How to use functions that return functions to clean data
by Guangming Lang
We have talked about functions that return functions and functions that eat functions. If you are new and curious, you can search my old blog posts. Today, I’m going to show you a real-world example of how to use functions that return functions to clean data. By the end of the post, I’m sure you’ll be amazed by their beauty and power. Let’s get started. First, we define three functions that return functions.
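Here is a minimal sketch of what three such functions could look like. The names split_by(), rm_char() and fix_missing() come from this post; their arguments (`sep`, `n`, `char`, `missing_value`) and their bodies are assumptions made for illustration, not necessarily the original definitions.

```r
# Each function below takes a setting and returns a new function
# that remembers that setting (a closure).

# split_by(sep) returns a function that splits every string in x on `sep`
# and returns a character matrix with `n` columns.
split_by <- function(sep, n = 2) {
  function(x) {
    parts <- strsplit(as.character(x), sep, fixed = TRUE)
    pad <- function(p) {
      length(p) <- n  # truncate or pad with NA so every row has n pieces
      p
    }
    t(vapply(parts, pad, character(n)))
  }
}

# rm_char(char) returns a function that deletes every occurrence of `char`.
rm_char <- function(char) {
  function(x) gsub(char, "", x, fixed = TRUE)
}

# fix_missing(missing_value) returns a function that turns that value into NA.
fix_missing <- function(missing_value) {
  function(x) {
    x[!is.na(x) & x == missing_value] <- NA
    x
  }
}

# Build concrete cleaners from the factories
split_by_comma <- split_by(",")
rm_space       <- rm_char(" ")
empty_to_na    <- fix_missing("")

split_by_comma("Boston, MA")  # 1 x 2 matrix holding "Boston" and " MA"
rm_space(" MA")               # "MA"
empty_to_na(c("4307", ""))    # "4307" NA
```

The point of the pattern is that the setting (a separator, a character, a missing-value code) is chosen once, and the returned function can then be applied to any number of columns.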
Make sure you read these function definitions line by line three times before you move forward.
Next, we make some messy data.
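The rows below are made-up stand-ins for the kind of messy data this post works with, not real records. The column names `city_state` and `zipcode` and the object name `dat` are assumptions.

```r
# Hypothetical messy data: made-up rows and assumed column names
dat <- data.frame(
  city_state = c("Boston, MA", "Chicago, IL", "Austin, TX", "Denver, CO"),
  zipcode    = c("02139-4307", "60601 - ", "73301 -1234", "80202- 5678"),
  stringsAsFactors = FALSE  # keep text as character, not factor
)

str(dat)  # two character columns, four rows
```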
I’ve cleaned much worse data, but this is good and representative enough for our purpose. Here are the places we need to clean up:
The city and state are in one column, and we need to separate them into two columns.
The first 5 and last 4 digits of the zip code are in one column, and we also want to separate them into two columns.
We need to make sure the separated columns do not contain spaces.
If a value is missing in the newly created columns, we need to make sure it’s NA instead of “”.
First, let’s separate city and state.
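A sketch of this step, assuming a split_by() that takes a separator and returns a splitting function. The body of split_by() and the sample values are illustrative assumptions; only the names split_by(), dat_part1 and state come from this post.

```r
# Hypothetical split_by(), written out so the snippet runs on its own
split_by <- function(sep, n = 2) {
  function(x) {
    parts <- strsplit(as.character(x), sep, fixed = TRUE)
    t(vapply(parts, function(p) { length(p) <- n; p }, character(n)))
  }
}

# Made-up values standing in for the combined city/state column
city_state <- c("Boston, MA", "Chicago, IL", "Austin, TX", "Denver, CO")

# Split on the comma and name the two resulting columns
dat_part1 <- as.data.frame(split_by(",")(city_state), stringsAsFactors = FALSE)
names(dat_part1) <- c("city", "state")

dat_part1$state  # each state still carries a leading space, e.g. " MA"
```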
Next, let’s divide zipcode into first 5 and last 4 digits.
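The same idea applied to the zip code, again as a sketch: split_by() is an assumed implementation and the zip codes are made up. Only the separator changes, from a comma to a hyphen.

```r
# Hypothetical split_by(), written out so the snippet runs on its own
split_by <- function(sep, n = 2) {
  function(x) {
    parts <- strsplit(as.character(x), sep, fixed = TRUE)
    t(vapply(parts, function(p) { length(p) <- n; p }, character(n)))
  }
}

# Made-up values standing in for the combined zip code column
zipcode <- c("02139-4307", "60601 - ", "73301 -1234", "80202- 5678")

# Split on the hyphen and name the two resulting columns
dat_part2 <- as.data.frame(split_by("-")(zipcode), stringsAsFactors = FALSE)
names(dat_part2) <- c("zip_first5", "zip_last4")

dat_part2  # stray spaces survive the split and still need cleaning
```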
Notice how similar these two chunks of code are! If you cannot appreciate their neatness, go back and read the definition of split_by() again. Understand what it does.
Next, we combine dat_part1 and dat_part2 into one data frame, and check the variables.
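One way to do the combining and checking, sketched with small made-up versions of dat_part1 and dat_part2. The name `dat_clean` is an assumption.

```r
# Made-up stand-ins for the two split results
dat_part1 <- data.frame(city = c("Boston", "Chicago"),
                        state = c(" MA", " IL"),
                        stringsAsFactors = FALSE)
dat_part2 <- data.frame(zip_first5 = c("02139", "60601 "),
                        zip_last4 = c("4307", " "),
                        stringsAsFactors = FALSE)

# Bind the columns side by side (both parts must have the same row order)
dat_clean <- cbind(dat_part1, dat_part2)

str(dat_clean)             # column types at a glance
lapply(dat_clean, unique)  # distinct values per column make stray spaces visible
```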
We observe that state has a space in front of all its elements, zip_first5 has spaces trailing some of its elements, and zip_last4 also has spaces in some of its elements. We need to remove the spaces, and that’s what we’re doing next.
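A sketch of the space removal, assuming rm_char() takes the character to delete and returns a function that deletes it. The body of rm_char(), the helper name `rm_space` and the sample data are assumptions.

```r
# Hypothetical rm_char(): returns a function that deletes every `char` in x
rm_char <- function(char) {
  function(x) gsub(char, "", x, fixed = TRUE)
}
rm_space <- rm_char(" ")

# Made-up combined data with the stray spaces described above
dat_clean <- data.frame(city = c("Boston", "Chicago"),
                        state = c(" MA", " IL"),
                        zip_first5 = c("02139", "60601 "),
                        zip_last4 = c("4307", " "),
                        stringsAsFactors = FALSE)

# Apply the same cleaner to the three affected columns only
cols <- c("state", "zip_first5", "zip_last4")
dat_clean[cols] <- lapply(dat_clean[cols], rm_space)

dat_clean  # the second zip_last4 is now an empty string
```

The city column is left out on purpose in this sketch, since deleting every space would also glue together multi-word city names.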
Pause for a moment, and go back to read the definition of rm_char() again. Make sure you understand what it does.
Finally, removing the spaces left empty strings in place of some values, and we need to replace them with NAs.
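A sketch of this last step, assuming fix_missing() takes the value that stands for "missing" and returns a function that replaces it with NA. The body of fix_missing(), the helper name `empty_to_na` and the sample data are assumptions.

```r
# Hypothetical fix_missing(): returns a function that turns `missing_value` into NA
fix_missing <- function(missing_value) {
  function(x) {
    x[!is.na(x) & x == missing_value] <- NA
    x
  }
}
empty_to_na <- fix_missing("")

# Made-up data as it might look after the spaces were removed
dat_clean <- data.frame(city = c("Boston", "Chicago"),
                        state = c("MA", "IL"),
                        zip_first5 = c("02139", "60601"),
                        zip_last4 = c("4307", ""),
                        stringsAsFactors = FALSE)

# Apply the cleaner to every column; `[]` keeps the data frame structure
dat_clean[] <- lapply(dat_clean, empty_to_na)

dat_clean$zip_last4  # "4307" NA
```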
Do you understand what fix_missing() does? If not, go back and read its definition again…
This article is inspired by Hadley Wickham’s book “Advanced R”, which is available from Amazon.