This chapter talks about cleaning data.
New tibble functions:
New column functions:
New concepts: Variable assignment in the general R space; variable assignment inside
For humanities data analysis (and most other types, for that matter), a huge proportion of you effort will go into cleaning rather than analyzing data. Cleaning can be itself be an important act. In Trevor Munoz and Katie Rawson’s essay Against Cleaning, they give a compelling rationale for why you must attend to the ways that regarding cleaning as easy or straightforward can give you less of an understanding of your information.
That said, you will invariably have to do something to your data to productively analyze, visualize, or understand it.
Most of the work of cleaning data is necessarily irregular. Different types of data all have their own problems. Often, the problem of ‘cleaning’ is really the problem of tracing a set of errors back to their sources. Here are some examples of data sanitation that I’ve encountered and had to clean over the years:
- A government dataset about asylum that spells “torture” as “toture” in one of the fields.
- A large number of ship’s logs records that give dates of February 30, September 31, and occasionally well into the future.
- A government dataset of statistics from the 1970s that was entirely unreadable until it became clear it used an almost-extinct character encoding scheme for IBM mainframes called EBCDIC.
There is no more a general set of rules for data cleaning than a general set of rules for editing manuscripts. Part of cleaning is understanding your own dataset. Another part is understanding your own analytic goals.
Common data cleaning issues.
Sometimes, data will be missing. In the crews data set, we see this for many fields. Inside R, missing data is represented with with a special indication as
<NA> (for “not available”).
How missing data is encoded may change. Some sets use empty strings; others use the letters “NA”; others use words like “None” or “Null.” And there are different types of unknowns; sometimes you’ll find a dataset where someone has written in “unknown” or “unclear,” which is not the same as missing data in some cases but is the same in others. By default (as you can see by typing
?read_csv into R or looking in the help pane),
read_csv uses a value of
c("", "NA"), which means that values which are empty (
"") or which are literally the string
NA will turned into
<NA> values. Sometimes this will be wrong–for instance, NA is a perfectly valuable Library of Congress Classification. And sometimes you may want to specify that things are missing by adding new strings to the list of missing values.
It’s also sometimes necessary to work around
<NA> values. You’ll see this a number of times in the code in this book. For instance, if you want to take the mean height in the
crews dataset, you get “NA” because some heights are not present. There are two ways to deal with this.
One is to apply a filter that uses the function
is.na. We can check each element to ensure that the function
is.na says “No!” – technically,
FALSE– and keep them in the filter if they do. There’s a slightly faster way, which is to use an exclamation point at the beginning of that line. Just as
!= means “does not equal,”,
! here too is a shorthand for “not.” So you read this simply as
crews %>% filter(is.na(height) == FALSE) # (Equivalently, but faster to type) crews %>% filter(!is.na(height))
Another way is use specific functions tactics for handling NA values. For instance, when using functions like
median, we can say
na.rm = TRUE to remove NA values. In practice, this is a little bit harder, I think.
crews |> summarize(height = mean(height, na.rm=TRUE))
If you’re working with new data, it’s probably in the Unicode standard. But if it’s older, characters from outside (things like curly quotes created on a Windows machine in 2003) may cause real problems. If you find yourself working with a file in a different character encoding, R generally allows you to read the file in by specifying the name of the standard. The precise instructions for doing so involve ‘locales’. Extensive discussion of this is available in the readr documentation.
Even when you’re given perfectly formatted data, you may experience some trouble reading it in.
In R, most problems will arise from conventions around commenting and quoting files. When you use
read_table and encounter errors, most often you want to adjust the arguments to comment and quote characters, which instructs the parser to ignore the “#” sign and quotation marks. Note that some files do use quotation marks as a “quote” character, such as the New Bedford whaling data we looked at in class.
Even the simplest of characters can cause problems. Sometimes, you may find that the end-of-line files aren’t behaving as you would expect. This has to do with the so-called “carriage return” character (called
\r in regular expressions, as opposed to
\n, the standard newline.
There are two ways to avoid these kinds of problems. One, in reading data, is to be sure always to use
read_tsv instead of the older functions
read.tsv that come with R. Students that have problems reading data in are often using copy-and-pasted code from online (which, again is a good strategy) that uses the older function:
read_csv has more sensible defaults.
History note: the reason for the peculiar behavior of the carriage return and newline has to do with the typewriter operations underlying modern computing: the
\r symbol was supposed to move the typewriter head to the beginning of the line, while the
\n newline character pushes the typewriter head down. In practice,
\n is sufficient to do both; there’s actually a third character, the formfeed (
\f), that usually takes on the job of dropping the cursor down a line. You will almost never see it in use, but it can be handy to insert from time to time if you need your own record breaks.
Inconsistent category labels
Almost omnipresent is the problem of inconsistent category labels. It gets to the distinction between datasets of strings and datasets of categories. Carefully curated datasets will always use exactly the same labels for categorical labels; but real-world sets often will not. Labels may be inconsistently capitalized–the whaling set uses “Rig” and “rig” for the same type of boat. Labels may occasionally be misspelled. And so on. This problem is exacerbated in visualization, where any bad values will take up a row.
tidyverse supplies a few ways to get rid of infrequent values by turning strings into what R calls
factors. [pandas calls
categoricals.] Factors are vectors where each row in the data is a discrete item that might be repeated across items.
fct_lump_n combines everything other than the most common elements into a catchall category called “Other.” Try commenting out the middle line in the code below and see how it changes.
crews |> mutate(Rig = fct_lump_n(Rig, 3)) |> count(Rig)
Date-time and other “type” formats
What is the data type for a year? You might reasonably expect it to be a number. But in fact, you’ll often find that dates are strings.
In converting messy dates, you’ll frequently have to use regular expressions to perform these coercsions.
When creating data yourself, you should use ISO 8601, which uses a “year-month-day” format. Countries around the world differ about whether Pearl Harbor happened on
7/12/41; the only sound response if you are creating data is to use
1945-12-07 as your format. The full ISO format for times includes ways of representing dates for which the precise date or the year is not specified, as well as ranges and intervals.
|Month, no specific date||1990-10|
|Date, no specific year||–12-25|
Datetime standards work well when used for the time period that computers have existed, but in historical cases things can get a little strange.
Internally, computers generally represent dates as integers to allow addition and subtraction. Dates are actually surprisingly hard to represent as numbers, which is why the ISO convention represents them as text instead. There are a variety of ways in which date calculations can be complicated. Some (like leap seconds and general relativity) are unlikely to matter unless you’re doing something like sending a spacecraft to another planetary body, in which case you probably should probably be reading a different text. But historical dates present challenges to both numeric and textual representations of numbers.
The most widespread convention for software is to represent date and time, together, as the number of seconds since January 1, 1970. This date– chosen in the 1970s–is known as the “Unix Epoch.” You’re probably familiar with the “Y2K” bug, in which years that were represented as two digit numbers could not accomodate the switch from 1999 to 2000. A similar probably may be coming in 2038 when 32-bit integers–most commonly used to store numbers–are not able to store the current date. (My favorite of the names proposed for this is the “epochalypse”).
The converse of the “epochalypse” is that dates with times before 1901 may not be easily represented in all software. It’s possible this will create issues for you. More likely is that you’ll find yourself working with dates from a country that had not yet switched from the Julian to the Gregorian calendar. Officially, the ISO standard should only be used for dates after 1582-10-15, the date on which Pope Gregory XIII’s new calendar began. (In the Gregorian calendar, the dates from October 5 to October 14 1582 were skipped to realign the calendar with the solar equinox.) In practice, ISO dates may present challenges when used for dates before any given country switched from the Julian to the Gregorian calendar (1752 in American and England, 1912 in China, 1918 in Russia).
Even if you choose to represent dates as simply years, pre-modern dates can create minor challenges. If you need to encode dates before 1 CE, you may need to remember that there is no such thing as “year zero;” the year 1BCE was immediately followed by the year 1AD. Although ISO-8601 requires explicit description of any dates before 1582, the general stqndard is to allow the existence of a year 0, corresponding to what’s usually called “1 BCE”: therefore the year “-0582” is not 582 BCE, but 583 BCE, and so forth.
Data cleaning strategies.
Depending on how the data you wish to read in is structured, you will typically use one of three functions in R to read it in.
How do you read in data? Generally, the information you want will be either on your hard drive (as when you format it) or on the Internet. For our first example, we’ll be looking at a well-formatted CSV online.
There’s some good descriptive data about people, which suggests a chance for something about bodies–measurements, physical descriptions, and ages all have interesting interplays. That will be particularly valuable if we can tie it in to some other sorts of information.
Reading tables and constructive failure.
Often, if there is something even slightly askew about your input data,
read_csv will fail. This may be frustrating. Try to be grateful, though, instead. This failure is the first aspect of something we’ll encounter many times in this class that is a general feature of data analysis: programs that don’t receive exactly the input they expect will simply fail to work, usually “throwing” an error message of some sort.
Here, we’ll look a bit more at the original version of the crewlist dataset we explored earlier.
You can read it on your computer by typing the following code.
library("tidyverse") crews = read_csv("https://benschmidt.org/crewlist.csv")
If you run this, you will see some alarming text in red. You can read it or try to understand; but the basic problem is that there are thousands of parsing failures. The reason has to do with the data types we’ve been talking about.
read_csv will automatically guess at data types; but it turns out that it’s getting some of these wrong because:
- It doesn’t see any eye colors until thousands of entries in;
- A few ship IDs have letters in them, but it initially learns to read them as numbers.
The first part of our cleaning is just to work around this by telling it to look farther in. As with any function,
?read_csv at the command prompt will give you instructions.
crews = read_csv("https://benschmidt.org/crewlist.csv", guess_max = 1000000)
Now we can inspect it. Going to the “Environment” pane in the upper right-hand corner, you can see what’s in here.
Where did this data come from? Did they do a good job transcribing? Here’s an example of the original source
R is composed of functions: each of these apply on an object. Each is, essentially, a little program of its own: you can run it on data, and see what happens.
In tidyverse R, the most common data structure is a
tibble; it’s essentially a table where the rows correspond to observations, and the columns refer to variables. (Tidyverse R borrows this from the incredibly similarly named ‘data.frame’ object that R has had since its birth. There are basically interchangeable, but using the modern versions helps you avoid a few pitfalls.) It resembles a spreadsheet or database table, but every datapoint has a type.
In this data set, as you’ll see, each row corresponds to an individual crew member, and the columns give information about him, such as the ship he sailed on his, his name, his rank, and so forth.
To look at it, we can start with the simplest program: simply looking at the thing. Type its name and run the code block from the upper right hand corner.
There are many different libraries out there for reading data. R is blessed with many packages for importing them. You can read excel files using ‘readxl’, for example. But for many purposes, it may be simpler go into Excel and click “save as CSV” to write to a more standard form.
There are some obvious problems with the “crews” data we need to fix for analysis.
For example. For example, in the raw version of this dataset:
- ‘Height’ is represented as a string like
5' 2 1/2", with feet and inches broken out;
- ‘Age’ has been read as a string, not a number;
- ‘Date’ has been read as a string, so we can’t plot–for example–years.
An example: cleaning years, and assignment. Suppose we want to extract the years and turn them into integers.
str_replace to clean data.
Note that the date doesn’t follow a standard form. We’re going to use regular expressions again to clean it up. Notice the steps in the chain here. We’re creating a new column called “year.” Then, we feed the output into the function
str_replace, which lets us use a regular expression.
Remember the function
str_replace: it is among the most useful tools for data cleaning in R. It lets you use the full power of regular expressions for find-replace operations. This regular expression is complicated–you may have to refer back to your sheet to see what’s going on. Note in particular that the parenthesis are performing a grouping operation.
str_replace is a substitution function, so you have to tell it what to replace it with. In this case, the escaped phrase
\\1 tells R to substitute with the first matched group. (That’s what
\\2 will match the second matched group, and so forth.)
So: suppose we want to pull ‘year’ out as a number.
The tidyverse has a function
mutate that changes a dataframe: it either adds a new column, or changes an existing one.
This gets a little complicated, so let’s break it down.
pull extracts a column so you can look at it more closely. Let’s do that with our
crews %>% pull(ApproximateDeparture) %>% head(10)
Let’s build up a tiny program that extracts the year from a date. The first step is to use regular expressions to get to just the year part of the expression. A simple way to do that here is to use the
.* regex operator to replace everything up to and including a slash with nothing (the empty string, ““).
crews %>% pull(ApproximateDeparture) %>% str_replace(".*/", "") %>% head(10)
We’re not done yet, though, because these are strings and years should be integers. The function
as.numeric does this for us. (There’s an inverse function,
as.character, that can turn a number into a string.)
crews %>% pull(ApproximateDeparture) %>% str_replace(".*/", "") %>% as.numeric() %>% head(10)
Success! But while this program runs, it doesn’t actually change our original frame. To do so, we need to assign.
Assignment in R.
Assignment is a fundamental feature of all languages. To this point, we’ve basically been working with
Advanced note: in the early history of R, you could not in fact assign a variable by using the
= sign. Instead, you would build an arrow out of two characters:
president <- "Washington" This form of assignment is still used in R. If you’re the sort of writer who sometimes starts sentences without having figured out the final clause, it can even be handy, because they can point in both directions; you can end a long program by saying something like.
"Washington" -> president Know to recognize this code when it appears. But most computer languages use
= for assignment and
== for equality, and so R now follows suit. Inside tibble calls, you use
= to assign; I find it simpler to just always use
= when assigning. Super-esoteric note: There is actually a third assignment operator, which introductory programmers should never use: the
<<- assigner, which assigns to a variable created outside of a function.
The actual assignment takes two steps. One says, inside a mutate call, to set year equal to the output of a string function; the other reassigns the variable
crews to equal this new frame, with its new column.
crews = crews %>% mutate( year = ApproximateDeparture %>% str_replace(".*/", "") %>% as.numeric() )
When you run this, you will get a warning error that says “NAs introduced by coercion.” That is often OK, but if you want to get a dataset completely clean you should look at the missing variables as well. You can do this using tools we already know:
crews %>% filter(is.na(year)) |> distinct(ApproximateDeparture) |> pull(ApproximateDeparture)
Cleaning using other functions
In practice, you often do not need to get so far into the nitty-gritty. I do not know of an R function that would translate feet and inches, but for age and date, we can do this all much more simply.
To turn a string into a number is quite simple; it just takes the function
as.numeric, which converts between types. Dates are more complicated. A date is a base data type in R, just like strings and numbers, but there are many different ways of specifying them. So we use the function
parse_date with a special string that describes the format: month-day-year.
This is another formal language, of course, a tiny little one that specifies data formats. “%m/%d/%Y” means month-day-year (12/7/1945); “%Y-%m-%d” means (1945-12-7). Are you supposed to memorize all these letters? Of course not. Instead, you just need to know that there’s a function for date parsing, and then use the ‘help’ documentation in R studio to read about how a particular conversion function works.
crews = crews %>% mutate( Age = Age %>% as.numeric(), date = ApproximateDeparture %>% parse_date("%m/%d/%Y") )
Getting the date
library(HumanitiesDataAnalysis) library(tidyverse) crews %>% mutate( first_initial = str_extract(FirstName, ".."), second_initial = str_extract(LastName, ".."), initials = str_c(first_initial, " ", second_initial) ) %>% count(initials) %>% arrange(-n)
:::exercise ### Assigning
crewsdatabase has columns for feet and inches, but no column for ‘height’. Make one, where height is measured in feet, inches, or something else.
crews %>% mutate( KEEP_GOING )
- The above code prints to the screen, but doesn’t change the ‘crews’. Make it so that that assignment happens to a new dataframe.
The following code chunks are all broken. You may have to run individual lines. Rather than copying and pasting, remember that the easiest way to run short sets of code in R is to highlight and then hit “control-enter” to force it to run.
To give three at the end:
x == 2 x + 1
Fix the following line of code so it gives (2,3,4)
c((1, 2, 3) + 1
Fix this line so that ‘count’ equals three.
count = 1 + "2"
- Is the data you read in under shiptypes
(Hard, skippable) In the crews dataset bundled with the package, I’ve added ‘feet’ and ‘inches.’ Here’s the code that I used to do it. Two hard questions:
- Can you describe what is going on in each line of this?
- Can you find any examples of heights that this fails to detect? (One way would be using an ‘is.na()’ filter)
crews = crews %>% mutate( feet = Height %>% str_replace(" ?ft.? ", "'") %>% str_replace("[',] ?.*", "") %>% as.numeric(), inches = Height %>% str_replace(" ?ft.? ", "'") %>% str_replace("[0-9]'? ?(1?[0-9]).*", "\\1") %>% as.numeric() )
- Using the
pivot_widerfunctions from the
tidyrpackage, create a new data.frame that has rows corresponding to ship types and columns corresponding to years. (That is, the exact inverse of the data.frame you saw before.)
crews %>% mutate(year = lubridate::year(date)) %>% count(Rig, year) %>% KEEP_GOING()
- There’s another dataset. It was created by the Center for Spatial and Textual Analysis at Stanford. Read it in. We’ll be working with this data set more next week.
CESTA %>% head()
This data is clearly not tidy. (Why not?) Use
pivot_longer to turn it into tidy data, and save it as a data.frame called “tidied”. (Hint: the final frame will have a column called “year.” You may have to use str_replace to remove Xs from the year name.)
? command or the help pane to read about the function
write_csv. Save the result to disk: we’ll be exploring it more next week.