The Gift of Data
Does “Humanities Data” sound like an oxymoron? To many, nowadays, it does. Humanists make meanings, arguments, and narratives. The word “data”, on the other hand, sounds like it comes from an altogether different vocabulary. Data doesn’t write stories, but undergirds proofs, discoveries, and statistical inference. Reading belongs to the humanities; data analysis belongs to STEM. When universities set up new programs to teach data skills, they inevitably call them “Data Science” even when not all the participants are scientists themselves.
This stems from a crucial metaphor that dominates modern conversations about data. In contemporary language, data is treated as a resource or a commodity: as something physical that is used to build arguments, proofs, and evidence. Common usage even treats data as if it’s something you find in the ground: data is to be “mined,” or data is “the new oil” @anonymous_world_2017. That makes data seem pristine, but a flip side emphasizes data instead as detritus: as the developer Maciej @ceglowski_haunted_2015 puts it, it is “a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.”
Thinking of data as resource or pollutant forces us to take sides in one of the great culture wars of the 21st century: between the technologists who see salvation in data, and the social critics who fear the ways it simplifies the world. Many people find it meaningful to identify themselves as “loving data” or being suspicious of it. To be on “data’s” side is to align with science and rationalism; to be against it is to shore up the remaining blocks of human experience, connection, and local knowledge that are melting like glaciers in the technocratic greenhouse that is the modern world.
This text is a how-to manual for those who want to work in a middle ground, where data exists as a part of–not contrary to–research in the humanities.
To understand what “data” can mean in a humanistic context, I like to think etymologically. Current language carries traces of past usage, which can let us recover older possibilities of meaning. “Data” in Latin, you may know, is simply the plural past participle of the verb ‘do’, to give. (Every first-year Latin student memorizes the foursome “do-dare-dedi-datum.”) Data, in Latin, means “that which is given”; it could probably be translated in different contexts as the words ‘given’ or ‘donation.’
Data as “Capta”
If you subscribe to the extractionist metaphor, this etymology reveals a great cover-up. Some scholars have argued that to call data given misrepresents the social process that makes data available to us, the force with which data has been ripped from the world. Johanna Drucker, following Christopher Chippindale, argues that we should use the term capta instead. “Capta” means “taken”; it reminds us that data is taken from the world, not freely given by it. The point is to remind the users of data that someone has gone out and ripped some information out of its original context for us to use. Humanities data is always in some sense about people; it represents information about people extracted through force, flattery, or subterfuge. Data-as-capta is an analytic framework that suggests an important responsibility: the humanist must always carry an awareness of the situatedness of the data they work with, and the social relations and power that make it possible for it to exist.
Data as a given
With most data you’ll encounter, treating it as “taken” is a good idea. But thinking about data analysis just as a jousting match between you and someone who built the dataset misses the most important part of communicating through data, one rooted in the etymology of ‘data’ in the humanistic tradition. Daniel Rosenberg has shown how the original uses of ‘data’ in English-language books come from theology and mathematics. In those books, data did not necessarily mean quantitative evidence: it instead meant something that your argument rests on. It was “given” in the sense that your argument rested on it without examining it, not in any sense that it existed in the ground or was wrenched from a social context. The system of geometry is built up from a small set of fundamental axioms which we take–or rather, which we ‘grant’–to be true. Any rhetorical argument proceeds from some assumptions, that which we ‘take for granted.’ This is what ‘data’ meant. Rosenberg finds that this older use of data was already evaporating by 1800: “It had become usual to think of data as the result of an investigation rather than its premise. While this semantic inversion did not produce the twentieth-century meaning of data, it did make it possible.” @rosenberg_data_2013.
In this sense, the “gift” in data is not from the world to the researcher. It is from the reader to the writer. “Data,” here, means the premises which the author asks her readers to concede–or to give up; or to take as a “given”–before the argument begins.
To talk of resources, capta, or science obscures that relationship. When writing about data, we need to take seriously not just the relationship between the scholar and her sources, but between the writer and her audience. To do ‘data analysis,’ in this sense, is to work out the system of implications of some set of evidence; and it is only useful if anyone will accept your evidence to begin with. (A better word than gift might be concession.)
To write about data is to go, hat in hand, begging an allowance from your readers. That gift is the willingness to entertain your premises while you describe them. Depending on what field you are in, the way that you solicit this gift will be radically different. In the humanities–as in much public writing–it is not reasonable to expect others to accept your data because it’s numerical; you must, instead, lead them along to the idea that data has something to say.
To work with data in this sense is not always to perform scientific inference. It is to plumb the relationships of the written record and the enumerated record to the people who were reduced to writing and numbers; and it’s to engage in the careful working out of the implications of that record. One key component of rhetoric that is rarely thought of as data analysis is the reductio ad absurdum: the rhetorical form that demonstrates that two premises (two givens; two data) cannot reasonably co-exist, because they produce some outcome which is self-evidently ridiculous. This is a claim we’ll explore.
As I have written elsewhere, digital humanists do not need to understand algorithms; instead, they need to understand the underlying transformations that algorithms execute @schmidt_digital_2016. These transformations describe the sorts of things you can do to a dataset. Many of them you probably already know in principle: sorting a list, taking an average, making a line chart. For those, this text simply aims to give you a way to command a computer to execute those tasks in a language that’s flexible enough to help you think about stringing multiple simple transformations together.
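To make the familiar transformations concrete, here is a minimal sketch in base R. The `books` table and every value in it are invented purely for illustration, and the variable names are my own assumptions rather than anything used later in the text.

```r
# Hypothetical data: five invented books with a year and a page count.
books <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  year  = c(1851, 1813, 1925, 1847, 1871),
  pages = c(635, 432, 218, 532, 880)
)

# Transformation 1: sort the rows by publication year.
sorted_books <- books[order(books$year), ]

# Transformation 2: take an average.
mean_pages <- mean(books$pages)

# Stringing transformations together: filter the sorted table,
# then average what is left.
early_books <- sorted_books[sorted_books$year < 1860, ]
mean_early_pages <- mean(early_books$pages)

# Transformation 3: a line chart of page counts over time.
plot(sorted_books$year, sorted_books$pages, type = "l",
     xlab = "Year", ylab = "Pages")
```

Each step takes a table (or a column of one) and hands back something the next step can use, which is what makes it possible to chain them.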
Other transformations are more exotic, but have shown their worth either in decades of research in the digital humanities or in intense exploration inside computer science over the last decade. I have tried to be judicious in what I present from this sphere, but there is good reason to understand things like the vector space model underlying modern machine learning, the concept of the ‘bootstrap’ for general purpose statistical testing, or the transformations involved in the fundamental information-theoretic metric of pointwise mutual information.
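As a rough illustration of one of these, pointwise mutual information compares how often two things actually co-occur with how often they would co-occur if they were independent: the base-2 logarithm of p(x, y) / (p(x) p(y)). The sketch below uses base R; the word-by-genre counts are invented for the example, and the object names are assumptions of mine.

```r
# Hypothetical counts: how often two words appear in two invented genres.
counts <- matrix(
  c(30, 10,
    10, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(word = c("whale", "love"),
                  genre = c("adventure", "romance"))
)

# Joint probability of each (word, genre) pair.
p_joint <- counts / sum(counts)

# Marginal probabilities of each word and each genre.
p_word  <- rowSums(p_joint)
p_genre <- colSums(p_joint)

# Probability each pair would have if word and genre were independent.
p_expected <- outer(p_word, p_genre)

# PMI: log (base 2) ratio of observed to expected probability.
pmi <- log2(p_joint / p_expected)
```

Positive values mark pairs that occur together more often than independence would predict, and negative values mark pairs that occur together less often. A pair with a count of zero would yield negative infinity, so real data usually needs some handling of empty cells that this sketch leaves out.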
This course will have you writing code in the R language. There is an extensive debate about whether digital humanists need to learn to code. If you have a lot of money to pay other people, you can probably get away without it. But the fact of the matter is simply that if you want to do data analysis in the humanities, coding will often be the only way to realize your personal vision; and if you want to build resources in the humanities that others might want to analyze, you’ll need to know what sophisticated users want to do with your tools to make them work for them.
I have no expectation that anyone will come out of this a full-fledged developer. But I hope you’ll come out a little more sophisticated in your understanding. In particular, I hope you’ll come to see that debates over learning to code create a false binary between coders and non-coders. We’ll be focusing in particular on developing skills less in full-fledged “programming” than in “scripting.” That means instructing a computer in every stage of your workflow, using a language rather than a Graphical User Interface (GUI). This takes more time at first, but has some major advantages over working in a GUI:
- Your work is saved and open for inspection.
- If you discover an error, you can correct it without losing the work done after.
- If you want to amend your process (analyze a hundred books instead of ten, for instance) but do the same analysis, you can alter the code only slightly.
- You can deploy a wide variety of methods on the same set of data. While the initial overhead to coding is high, when you read about some fancy new method you can often test it quickly inside R rather than having to figure out some different piece of software.
- You can deploy the same methods on a wide variety of data. The tidy data abstraction we’re working with gives a vocabulary for thinking about documents, resources, and anything that can be counted; by creatively re-combining them, you can interpret new artifacts in interesting ways.
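A small sketch of what this looks like in practice, using base R only. The two-document `texts` vector is a made-up stand-in for a real corpus, and `count_words` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical mini-corpus; in real work these strings would be read from files.
texts <- c(
  doc1 = "the whale the sea",
  doc2 = "the sea the sea the shore"
)

# Count the words in one document, returning one row per (document, word) pair.
count_words <- function(doc_name, text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  tab <- table(words)
  data.frame(
    doc  = doc_name,
    word = names(tab),
    n    = as.integer(tab),
    stringsAsFactors = FALSE
  )
}

# Apply the same step to every document, however many there are,
# and stack the results into a single long table.
word_counts <- do.call(rbind, Map(count_words, names(texts), texts))
rownames(word_counts) <- NULL
```

Going from two documents to two hundred means changing only the `texts` vector; the counting step and everything built on top of it stay the same, and the script itself remains as a record of what was done.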