Counting things
Summary
This chapter talks about how to count things, individually and in aggregate, using a pipeline strategy where dataframes are (optionally) grouped, the summarized, then combined. This is the most basic thing that you can do with tabular data statistically, but often one of the most powerful.
This is an essential strategy of many data analysis platforms. In R it is represented by the following functions, all of which you should know what they do.
(On files)
read_csv
arrow::read_parquet
(On data frames)
filter
arrange
summarize
count
group_by
mutate
sample_n
andsample_frac
drop_na
(Inside summary/mutate calls.)
sum
n()
mean
The power of counting
Now that we know what data is, we are finally ready to analyze it.
Data “analysis,” of course, can mean many different things. That’s the whole point of this text! The analyses we’ll conduct to begin with are extremely simple; they mostly include counting, grouping, and summarizing. Later in this course we’ll come to more complicated operations. But in fact, simple algorithms of counting have major advantages in terms.
You can understand them. There is a “black box” quality to many of the more advanced tools we’ll be looking at later in the course: “summary statistics” are much easier to correct and change on the fly. They use, for the most part, math you learned in elementary school.
Your readers can understand them. It is a burden to have to spend five minutes explaining your algorithms at every talk, and when your work makes a field-specific contribution, you want scholars to focus on your argument and evidence, not on whether they trust your algorithm.
The pipeline strategy for exploratory data analysis
Our core strategy will be a pipeline strategy. The idea is that a single table object is passed through a series of saved steps until it gives you useful results. You will sometimes want to save the results of your work; but most of the time, you just
dplyr
exposes a number of functions that make this easy. But although they are more coherently bundled in this package, they are shared by a wide variety of data manipulation software. In particular, they closely resemble the operations of SQL, the most important data access language of all time.1
I’ll be introducing these statements one by one, but first let me describe the basic idea.
Data chains
The idea of the pipe is to represent each transformation of data you make as a chain. Remember how we mentioned in the data types chapter that multiplication and addition operators are a different sort of notation from the rest of R? Although it’s highly unnatural, a much more computer-friendly method would be to use functions called add
, divide_by
, and so on. These are available in the maggritr package.
If you wanted to define a process where you took a number, added 2, then multiplied by 6, and finally divided by ten. You could express that as a set of functions where you assign them along the way. (The <-
operator below assigns values; we will not use it again this chapter.) We’re making heavy, heavy use of the so-called “pipe” operator, |>
.
start <- 1
x <- multiply_by(start, 6)
y <- add(x, 2)
z <- divide_by(y, 10)
z
But you could also nest them inside each other. This requires you to read the functions from the inside-out: it’s awkward, and doesn’t reflect the order in which you will usually work.
divide_by(add(multiply_by(1, 6), 2), 10)
Using pipes, you can read the operation simply from left to right. The idea here is to think of data as flowing through a set of operations. First you take the thing and multiply it; next you take the same thing and divide it.
1 |>
multiply_by(6) |>
add(2) |>
divide_by(10)
Before 2021, R did not have a native pipe operator: instead, a package called magrittr
(a joke–‘this is not a pipe’) exported an operator that looked like %>%
. You will often see this ‘old’ operator in code: for most purposes, it’s the same as |>
. I’m using |>
here because it runs without importing any packages, but it’s fine to keep the old one in place if you copy code from somewhere else.
These expressions can get quite long. Formatting your code will make it substantially more readable. There are a few guidelines for this:
- Each line must end with a pipe-forward operator: if R hits a linebreak with a syntactically complete element, it will print the results.
- If you pipe an operation to a function that has no arguments, you may omit the parentheses after the function. (Don’t worry about what this means yet if it doesn’t make sense.)
So you might prefer to write the above in the following form. (Note that RStudio will automatically indent your code for you.)
Real Data
Let’s look at some real data. You have already downloaded, with the HumanitiesDataAnalysis
package, a dataset called ‘crews.’ As long as you’ve already loaded the HumanitiesDataAnalysis
library, you can load it like this:
data(crews)
The first thing to do is simply look at. Going to the “Environment” pane in the upper right-hand corner, you can see what’s present. Click on it.
Where did this data come from? Did they do a good job transcribing? Here’s an example of the original source
Once you have a dataset in R, there are a few ways to look at it.
The most basic is to highlight the variable name and hit control-enter
in the RStudio code window. This prints a representation of the dataset.
Clicking on the item in your ‘environment’ panel loads it for you in a separate tab. If you look at the console area of R, you’ll see that this simply calls a function: View(crews)
. Unlike most R functions, View
is capitalized; this is a sign that it’s part of RStudio specifically and probably won’t work in permanent code. Whether you want to click on the variable or type View
is up to you.
Filtering
Filtering is the operation where you remove certain rows from a dataset without changing the columns. Filters evaluate a true/false condition for every row in the dataset.
The comparison operators
The easiest way to filter is when you know exact which value you’re looking for.
Passing data through the filter operator reduces it down to only those entries that match your criteria. In the previous section, you have noticed that the summary
function on crews showed a single 82-year-old. By filtering to only persons aged 82, we can see who exactly that was.
crews |> filter(Age == 82)
If we wished to case a slightly wider net, we could filter for sailors over the age of 70:
crews |> filter(Age > 65)
Note: One of the most frequent forms of errors you will encounter in data analysis is confusing the ==
operator with the =
one. For the most part, the warnings you get should help clarify this problem. But keep in mind that ‘==’ is for comparison (it asks the question: ‘are x and y equal?’) while ‘=’, as discussed later, is for assignment: x = y
means “make x equal to y.
The special operator %in%
tests if an element is part of a list. ==
tests if two values are the same: %in%
tests if the left hand sign is part of the right hand side. This expression tells you whether, for example, “New York” is in a list of cities.
It evaluates to TRUE
because “New York” is one of the elements.
"New York" %in% c("Boston", "New York", "Philadelphia")
On a longer list, this can be useful to find any combination of names of interest. Here, for example, we can check if anyone on a list might be a Beatle.
To print all of these, might be too long, so instead we use a special function called sample
that takes some number of rows at random.
You can use head
to see the beginning of a dataset; but sample_n
can be better because it gives you a different set of results each time, not biased towards the beginning of your data. (A related function to sample_n
is sample_frac
, which takes a fraction of the data; so crews |> sample_frac(.33)
would get 33% of the dataset.)
crews |>
filter(LastName %in%
c("McCartney", "Lennon", "Harrison", "Starr")) |>
sample_n(10)
Compare that to a regular-expression based approach, below.
What’s the difference between using %in%
and str_detect
to filter down a dataset by a string? (Hint: try seeing how they both behave on some very short strings.)
crews |>
filter(LastName |> str_detect("McCartney|Lennon|Harrison|Starr")) |>
sample_n(10)
The tidyverse also exposes a function, count
, that makes it possible to see how many times each element in a list occurs. We can also count the occurrences of each of these last names to see which is most common.
crews |>
filter(LastName %in% c("McCartney", "Lennon", "Harrison", "Starr")) |>
count(LastName)
You may notice a similarity between %in% and %>% in the way they use percentage signs. This is because they both represent an infix operator: basically, A %in% B
is the same as if there were a function that takes two arguments in(A, B). The pipe operator, %>%
, takes two arguments: first a data frame A, and then a function with arguments B. It calls function B with the arguments from A. Javascript is considering using the |>
operator as well.
Sorting
Sorting is the basic transformation that changes the order of data. Using the head
function from last week, we can, for example, first limit to ships that sailed before 1860, and then show the youngest individuals.
crews |>
arrange(Age) |>
head(3)
If you want to sort in descending order, to see the oldest, you can use the dplyr
function desc
to reverse the variable: but usually it’s easiest to just put a negative sign in front of the variable you want sorted.
crews |>
arrange(Age |> desc()) |>
head()
It is also useful to look at a completely unordered chunk of the data. When you read something in like this crews
set, the first few records
crews |> sample_n(5)
Summarizing
Looking at your individual data is sometimes sufficient: but usually, you want to know some aggregate conditions about it.
dplyr
provides the summarize
function to do this. Unlike the summary
function we saw earlier, summarize
doesn’t do anything on its own; instead, it lets you specify the kinds of commands you want to run.
In dplyr, you pipe your results through to a summarize function and then run a different function call inside the parentheses. The simplest function call is n
: it says how many rows there are. To run it, we use the function summarize, which reduces the whole dataset down.
crews |> summarize(n = n())
But you can use any of the variables inside the frame as part of your function call: for instance, to find the average age in the set, you could run the following.
crews |> summarize(mean_age = mean(Age))
This produces an empty value, because there is missing data. R provides a variety of ways of dealing with missing data: in dplyr
, we can just put in a filter operation in.
Since NA
is a special value, it requires a special function in the filter: is.na.
And we don’t want to filter to only <NA>
fields; we want to filter to not <NA>
fields. By common programming convention, !
means “not”.
crews |>
filter(!is.na(Age)) |>
summarize(mean_age = mean(Age))
This is a bit confusing, so in the tidyverse, there is a function to do this directly: drop_na.
The code below keeps only people for whom we have an existing age.
crews |>
drop_na(Age) |>
summarize(mean_age = mean(Age))
This may look like a single variable, but it’s actually a tibble like all of those we’ve been working with. By using =
to assign values inside summarize, we can summarize on a variety of statistics: average, mean, or oldest age.
crews |>
drop_na(Age) |>
summarize(
average_age = mean(Age),
median_age = median(Age),
oldest_age = max(Age)
)
Summarization functions
There are a bunch of functions you maybe haven’t seen before here: mean
, median
, and max
. From their names it should be clear what they do.
Some languages (Python, for example) work to reduce the number of functions in the set R is not like these: it has so many functions that even experienced users sometimes stumble across ones they have not seen before. And libraries like dplyr
provide still more at your fingertips.
A few functions from base R that may be useful with summarize include:
n()
- count the number of rows.rank(col)
- give each row a rank based on the value of col.min(col)
,max(col)
,median(col)
,mean(col)
- return the minimum value ofcol
.
It’s also important to know about type conversion.
as.character
- Turn anything into a character.as.numeric
- Turn anything into a number. Letters will produce<NA>
.as.Date
- turn an ISO formatted field (e.g., “1940-03-15”) into a date object. For dates in a less standard format, you will want to investigate thelubridate
package in the tidyverse. That has functions likedmy
that will turn “3/5/92” into1992-05-03
andmdy
that will turn it into1992-03-05
.
So how do you find additional functions? The best place to start is the help bar and then type the phrase you’re looking for. Especially if it’s a statistically meaningful concept (e.g., “Standard Deviation”), you should be able to find it.
The other way is by searching. One of R’s major flaws is that the name is so generic that it’s hard to Google. The website Stack Overflow contains some of the most valuable information, and you can search inside R by prefixing your search term with the letter [R]
in brackets. But Google will often include good answers as well.
Grouping and summarizing
This sorting and arranging is all stuff that you can do quite easily in Excel. The reason to use a tabular package like this is that it allows you to perform these operations many times across different facets of data. The last element which enables all sorts of amazing analysis is grouping. Now that you know how to filter and summarize, you’re ready for the most important operation in dplyr
: group_by
. Unlike filter
and summarize
, group_by
doesn’t actually change the data you see. Instead, it does something much more subtle; as it says, it groups the data for future operations.
In other words, it sets the units that you’ll be working with. In Michael Witmore’s terms, it changes the level of address for the text.@witmore_text_2010
The most basic idea is to use a grouping and then the n
function to count the number of items in each bucket.
crews |> group_by(Skin) |> summarize(count = n())
This is actually so common that there’s a function, “count”, that does it directly and assigns to a variable called n
. But I recommend sticking with the more basic format for now, because it gets you in the habit of actually thinking about what happens.
crews |> count(Skin)
You can also group by or count multiple items at once. Each of these is like adding and addition index to your search.
crews |>
group_by(Skin, Hair) |>
summarize(count = n()) |>
arrange(-count) |>
head(10)
Here is a little bit about the social construction of race in code. Try to understand what this code does, and what the data says.
Note that there’s a new function in here, mutate
. Like summarize
, that adds a new column (or changes an existing one); unlike summarize, it doesn’t change the number of rows.
crews |>
group_by(Skin) |>
drop_na(Skin, Hair) |>
mutate(skin_total = n()) |>
filter(skin_total > 100) |> ## Why?
group_by(Skin, Hair, skin_total) |> ## Why is `skin_total` here?
summarize(combo_count = n()) |>
mutate(share = combo_count / skin_total) |>
arrange(-share) |>
filter(share > .1) |> # Remove uncommon entries
head(10)
Read the line of code below. What does it do? Can you come up with any patterns or explanations for the results that you see here?
crews |>
drop_na(Age) |>
group_by(Skin) |>
summarize(meanAge = mean(Age), count = n()) |>
filter(count > 100) |>
arrange(meanAge) |> head(10)
- Start by just editing some code. The code below finds the first date that appears in this collection. Edit it to find the minimum age in the set.
crews |>
drop_na(date) |>
summarize(min = min(date))
- Use
filter
to determine: what is the name of that youngest person? When did he or she sail?
- How many sailors left on ‘Barks’ between 1850 and 1880? Chain together
filter
andsummarize
with the specialn()
function. Note that this has a number of different conditions in the filter statement. You could build several filters in a row: but you can also include multiple filters by separating them with commas. For instance,filter(school=="NYU",year==2020)
might be a valid filter on some dataset (though not this one.)
To filter by date you may need to use a function like as.Date
on your input.
Question 3 told you how many sailors left on barks in those years. How many distinct voyages left? The variable Voyage.number
identifies distinct voyages in this set. (This may require reading some documentation: reach out to me or a classmate if you can’t figure it out. There are at least two ways: one involves using the dplyr
function distinct
before summarizing, and the second involves using the functions length
and unique
in your call to summarize
.)
Grouping
Summarizing tells you things about a whole dataset: grouping allows you to define different portions. It is a somewhat strange transformation because, on its own, it doesn’t actually change the data at all–it just changes the way that future functons will work.
If you group and then summarize, the summary functions are applied separately to each group. For example, when we run a summary operation with crew data grouped by residence, the n()
function returns the counts for each each individual group. Here, for example,are the counts for each type of ‘Rig’ in the crews
dataset.
crews |>
group_by(Rig) |>
summarize(count = n())
Change the code above to count the distinct “Residence” locations in the dataset. Then add two more pipes to the end to arrange by count.
Working with Text
To think about working with text, let’s look at a bibiographic dataset. I’ve taken all the books in the Library of Congress’s catalog dump that have summaries in the bibliographic record, and then further reduced down to “just” about 150,000 books that have dates of birth for their authors. (This is about 1% of the full Library of Congress catalog of books.)
The tidyverse
packages includes some functions for exploring strings we can use inside these functions.
str_detect
finds a regular expression; it returns TRUE or FALSE, so works well with filter
. str_extract
replaces a regular expression inside a string: it returns a new string, so it works well with mutate.
You may not remember everything about regexes: but for this, try to recall at least that the vertical bar (|
) means “or”. So "France|French"
will be true if the string contains either “France” or “French.”
- Try to get a sense of what is the books set based on some keyword searches. Can you get a sense of what the biases of this subset of the catalog are?
Here are a couple examples having to do with geographic terms in subjects; you’d probably do better to explore some other kind of resource.
books |>
filter(subjects |> str_detect("France|French")) |>
select(year, summary) |>
sample_n(10)
Consider counting some individual words, as well. Using ‘str_extract’, we can create a new column–‘word’–which is only the part of the subject that matches a search. Then we can count those individual terms.
Again, look for something other than the three countries here.
books |>
mutate(word = subjects |> str_extract("(France|Germany|England)")) |>
drop_na(word) |>
group_by(word) |>
summarize(count = n())
Free Exercise
Try getting read_csv
to work on your own csv or excel file that you explored the types in previously, or use arrow::read_parquet on some of the files at [benschmidt.org/directories]. and do three of the following:
- Find an outlier; who is the oldest person? The youngest?
- Count a categorical variable and arrange by decreasing count. What are the most common labels?
- Use
str_sub
orstr_extract
to create a better categorical variable than the ones you have. - Count some combination of variables and see if you can identify things that tend to occur a lot together or apart.
- Use
min
,mean
, ormax
inside groups to see which groups have higher or lower values. - Describe the extent of your data.
If you have a slow computer or an extremely large data set, you can do all of these operations on data on disk. I recommend using
duckdb
as the on-disk data storage engine: read the documentation athttps://duckdb.org/docs/api/r
for an explanation of how to read it in.↩︎