This chapter talks about how to count things, individually and in aggregate, using a pipeline strategy where dataframes are (optionally) grouped, the summarized, then combined. This is the most basic thing that you can do with tabular data statistically, but often one of the most powerful.
This is an essential strategy of many data analysis platforms.
The power of counting
Now that we know what data is, we are finally ready to analyze it.
Data “analysis,” of course, can mean many different things. That’s the whole point of this text! The analyses we’ll conduct to begin with are extremely simple; they mostly include counting, grouping, and summarizing. Later in this course we’ll come to more complicated operations. But in fact, simple algorithms of counting have major advantages in terms.
You can understand them. There is a “black box” quality to many of the more advanced tools we’ll be looking at later in the course: “summary statistics” are much easier to correct and change on the fly. They use, for the most part, math you learned in elementary school.
Your readers can understand them. It is a burden to have to spend five minutes explaining your algorithms at every talk, and when your work makes a field-specific contribution, you want scholars to focus on your argument and evidence, not on whether they trust your algorithm.
The pipeline strategy for exploratory data analysis
Our core strategy will be a pipeline strategy. The idea is that a single table object is passed through a series of saved steps until it gives you useful results. You will sometimes want to save the results of your work; but most of the time, you just
exposes a number of functions that make this easy. But although they are more coherently bundled in this package, they are shared by a wide variety of data manipulation software. In particular, they closely resemble the operations of SQL, the most important data access language of all time.1
I’ll be introducing these statements one by one, but first let me describe the basic idea.
The idea of the pipe is to represent each transformation of data you make as a chain. Remember how we mentioned in the data types chapter that multiplication and addition operators are a different sort of notation from the rest of R? Although it’s highly unnatural, a much more computer-friendly method would be to use functions called
divide_by, and so on. These are available in the maggritr package.
If you wanted to define a process where you took a number, added 2, then multiplied by 6, and finally divided by ten. You could express that as a set of functions where you assign them along the way. (The
<- operator below assigns values; we will not use it again this chapter.) We’re making heavy, heavy use of the so-called “pipe” operator,
But you could also nest them inside each other. This requires you to read the functions from the inside-out: it’s awkward, and doesn’t reflect the order in which you will usually work.
Using pipes, you can read the operation simply from left to right. The idea here is to think of data as flowing through a set of operations. First you take the thing and multiply it; next you take the same thing and divide it.
These expressions can get quite long. Formatting your code will make it substantially more readable. There are a few guidelines for this:
- Each line must end with a pipe-forward operator: if R hits a linebreak with a syntactically complete element, it will print the results.
- If you pipe an operation to a function that has no arguments, you may omit the parentheses after the function. (Don’t worry about what this means yet if it doesn’t make sense.)
So you might prefer to write the above in the following form. (Note that RStudio will automatically indent your code for you.)
Let’s look at some real data. You have already downloaded, with the
HumanitiesDataAnalysis package, a dataset called ‘crews.’ As long as you’ve already loaded the
HumanitiesDataAnalysis library, you can load it like this:
The first thing to do is simply look at. Going to the “Environment” pane in the upper right-hand corner, you can see what’s present. Click on it.
Where did this data come from? Did they do a good job transcribing? Here’s an example of the original source
Once you have a dataset in R, there are a few ways to look at it.
The most basic is to highlight the variable name and hit
control-enter in the RStudio code window. This prints a representation of the dataset.
Clicking on the item in your ‘environment’ panel loads it for you in a separate tab. If you look at the console area of R, you’ll see that this simply calls a function:
View(crews). Unlike most R functions,
View is capitalized; this is a sign that it’s part of RStudio specifically and probably won’t work in permanent code. Whether you want to click on the variable or type
View is up to you.
Filtering is the operation where you remove certain rows from a dataset without changing the columns. Filters evaluate a true/false condition for every row in the dataset.
The comparison operators
The easiest way to filter is when you know exact which value you’re looking for.
Passing data through the filter operator reduces it down to only those entries that match your criteria. In the previous section, you have noticed that the
summary function on crews showed a single 82-year-old. By filtering to only persons aged 82, we can see who exactly that was.
If we wished to case a slightly wider net, we could filter for sailors over the age of 70:
Note: One of the most frequent forms of errors you will encounter in data analysis is confusing the
== operator with the
= one. For the most part, the warnings you get should help clarify this problem. But keep in mind that ‘==’ is for comparison (it asks the question: ‘are x and y equal?’) while ‘=’, as discussed later, is for assignment:
x = y means “make x equal to y.
The special operator
%in% tests if an element is part of a list.
== tests if two values are the same:
%in% tests if the left hand sign is part of the right hand side. This expression tells you whether, for example, “New York” is in a list of cities.
It evaluates to
TRUE because “New York” is one of the elements.
On a longer list, this can be useful to find any combination of names of interest. Here, for example, we can check if anyone on a list might be a Beatle.
To print all of these, might be too long, so instead we use a special function called
sample that takes some number of rows at random.
You can use
head to see the beginning of a dataset; but can be better because it gives you a different set of results each time, not biased towards the beginning of your data. (A related function to
sample_frac, which takes a fraction of the data; so
crews |> sample_frac(.33) would get 33% of the dataset.)
Compare that to a regular-expression based approach, below.
What’s the difference between using
str_detect to filter down a dataset by a string? (Hint: try seeing how they both behave on some very short strings.)
The tidyverse also exposes a function,
count, that makes it possible to see how many times each element in a list occurs. We can also count the occurrences of each of these last names to see which is most common.
Sorting is the basic transformation that changes the order of data. Using the
head function from last week, we can, for example, first limit to ships that sailed before 1860, and then show the youngest individuals.
If you want to sort in descending order, to see the oldest, you can use the
desc to reverse the variable: but usually it’s easiest to just put a negative sign in front of the variable you want sorted.
It is also useful to look at a completely unordered chunk of the data. When you read something in like this
crews set, the first few records
Looking at your individual data is sometimes sufficient: but usually, you want to know some aggregate conditions about it.
dplyr provides the
summarize function to do this. Unlike the
summary function we saw earlier,
summarize doesn’t do anything on its own; instead, it lets you specify the kinds of commands you want to run.
In dplyr, you pipe your results through to a summarize function and then run a different function call inside the parentheses. The simplest function call is
n: it says how many rows there are. To run it, we use the function summarize, which reduces the whole dataset down.
But you can use any of the variables inside the frame as part of your function call: for instance, to find the average age in the set, you could run the following.
This produces an empty value, because there is missing data. R provides a variety of ways of dealing with missing data: in
dplyr, we can just put in a filter operation in.
NA is a special value, it requires a special function in the filter:
is.na. And we don’t want to filter to only
<NA> fields; we want to filter to not
<NA> fields. By common programming convention,
! means “not”.
This is a bit confusing, so in the tidyverse, there is a function to do this directly:
drop_na. The code below keeps only people for whom we have an existing age.
This may look like a single variable, but it’s actually a tibble like all of those we’ve been working with. By using
= to assign values inside summarize, we can summarize on a variety of statistics: average, mean, or oldest age.
There are a bunch of functions you maybe haven’t seen before here:
max. From their names it should be clear what they do.
So how do you find additional functions? The best place to start is the help bar and then type the phrase you’re looking for. Especially if it’s a statistically meaningful concept (e.g., “Standard Deviation”), you should be able to find it.
Grouping and summarizing
This sorting and arranging is all stuff that you can do quite easily in Excel. The reason to use a tabular package like this is that it allows you to perform these operations many times across different facets of data. The last element which enables all sorts of amazing analysis is grouping. Now that you know how to filter and summarize, you’re ready for the most important operation in
group_by doesn’t actually change the data you see. Instead, it does something much more subtle; as it says, it groups the data for future operations.
In other words, it sets the units that you’ll be working with. In Michael Witmore’s terms, it changes the level of address for the text.@witmore_text_2010
The most basic idea is to use a grouping and then the
n function to count the number of items in each bucket.
This is actually so common that there’s a function, “count”, that does it directly and assigns to a variable called
n. But I recommend sticking with the more basic format for now, because it gets you in the habit of actually thinking about what happens.
You can also group by or count multiple items at once. Each of these is like adding and addition index to your search.
Here is a little bit about the social construction of race in code. Try to understand what this code does, and what the data says.
Note that there’s a new function in here,
summarize, that adds a new column (or changes an existing one); unlike summarize, it doesn’t change the number of rows.
Read the line of code below. What does it do? Can you come up with any patterns or explanations for the results that you see here?
- Start by just editing some code. The code below finds the first date that appears in this collection. Edit it to find the minimum age in the set.
filterto determine: what is the name of that youngest person? When did he or she sail?
- How many sailors left on ‘Barks’ between 1850 and 1880? Chain together
summarizewith the special
n()function. Note that this has a number of different conditions in the filter statement. You could build several filters in a row: but you can also include multiple filters by separating them with commas. For instance,
filter(school=="NYU",year==2020)might be a valid filter on some dataset (though not this one.)
To filter by date you may need to use a function like
as.Date on your input.
Question 3 told you how many sailors left on barks in those years. How many distinct voyages left? The variable
Voyage.number identifies distinct voyages in this set. (This may require reading some documentation: reach out to me or a classmate if you can’t figure it out. There are at least two ways: one involves using the
distinct before summarizing, and the second involves using the functions
unique in your call to
Summarizing tells you things about a whole dataset: grouping allows you to define different portions. It is a somewhat strange transformation because, on its own, it doesn’t actually change the data at all–it just changes the way that future functons will work.
If you group and then summarize, the summary functions are applied separately to each group. For example, when we run a summary operation with crew data grouped by residence, the function returns the counts for each each individual group. Here, for example,are the counts for each type of ‘Rig’ in the
Change the code above to count the distinct “Residence” locations in the dataset. Then add two more pipes to the end to arrange by count.
Working with Text
To think about working with text, let’s look at a bibiographic dataset. I’ve taken all the books in the Library of Congress’s catalog dump that have summaries in the bibliographic record, and then further reduced down to “just” about 150,000 books that have dates of birth for their authors. (This is about 1% of the full Library of Congress catalog of books.)
tidyverse packages includes some functions for exploring strings we can use inside these functions.
str_detect finds a regular expression; it returns TRUE or FALSE, so works well with
str_extract replaces a regular expression inside a string: it returns a new string, so it works well with mutate.
You may not remember everything about regexes: but for this, try to recall at least that the vertical bar (
|) means “or”. So
"France|French" will be true if the string contains either “France” or “French.”
- Try to get a sense of what is the books set based on some keyword searches. Can you get a sense of what the biases of this subset of the catalog are?
Here are a couple examples having to do with geographic terms in subjects; you’d probably do better to explore some other kind of resource.
Consider counting some individual words, as well. Using ‘str_extract’, we can create a new column–‘word’–which is only the part of the subject that matches a search. Then we can count those individual terms.
Again, look for something other than the three countries here.
read_csv to work on your own csv or excel file that you explored the types in previously, or use arrow::read_parquet on some of the files at [benschmidt.org/directories]. and do three of the following:
- Find an outlier; who is the oldest person? The youngest?
- Count a categorical variable and arrange by decreasing count. What are the most common labels?
str_extractto create a better categorical variable than the ones you have.
- Count some combination of variables and see if you can identify things that tend to occur a lot together or apart.
maxinside groups to see which groups have higher or lower values.
- Describe the extent of your data.
If you have a slow computer or an extremely large data set, you can do all of these operations on data on disk. I recommend using
duckdbas the on-disk data storage engine: read the documentation at
https://duckdb.org/docs/api/rfor an explanation of how to read it in.↩︎