Preface
This is a text for a course in humanities data analysis; it uses the modern R language to introduce the major challenges and features that exist for data analysis in the humanities.
It teaches the practice and principles of data analysis as encountered in traditionally non-quantitative fields, and is especially targeted at graduate students in history and literature departments.
I have developed it over the years as a resource for classes aimed at graduate students in the humanities who are new to programming but interested in working with digitized sources of many sorts: texts, maps, networks, and images. It deals especially heavily with textual data, which is widespread and (relatively) straightforward to work with. But the core goal is to teach fundamental principles, algorithms, and approaches to communicating with data.
The chapters are cumulative, working towards a core curriculum of key concepts in the manipulation and presentation of data. The concepts here are drawn from what is likely to be useful to people in the humanities.
The first half is largely occupied with technologies of counting and the manipulation of tabular data frames. The sort of data you’ll work with here may seem, at first, excessively limited. There are basically only two data types that this text deals with. The first is a table with rows representing observations and columns representing values. (This is a form almost everyone has encountered in spreadsheets or databases.)
The second is the related but more abstract representation of observations as points in an arbitrary, multidimensional space. I don’t discuss in depth how to visualize or analyze nested hierarchies, network relations, or sentence trees.
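As a minimal sketch of these two representations, consider a small table of word counts. The books, words, and numbers below are invented for illustration and come from no real corpus. The same data is held first as a data frame with one row per observation, and then recast so that each book is a point whose coordinates are its word counts.

```r
# Hypothetical data: one row per observation (a word counted in a book).
counts <- data.frame(
  book  = c("A", "A", "B", "B", "C", "C"),
  word  = c("whale", "sea", "whale", "sea", "whale", "sea"),
  count = c(10, 4, 0, 7, 3, 3)
)

# Recast the table as a matrix with one row per book and one column per word.
# Each book is now a point in a two-dimensional "word space".
book_points <- with(counts, tapply(count, list(book, word), sum))

# Euclidean distances between the books, treated as points.
book_distances <- dist(book_points)
book_distances
```

With only two words the space has two dimensions, but nothing in the code changes if the table holds thousands of distinct words.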
If you work through this full book, I hope you’ll see that such a constraint can ultimately be generative, not limiting. While we won’t directly visualize XML documents, for example, we will consider how best to work with and manipulate them as tabular data, with their tag hierarchies represented as columns. This may seem strange! But it also captures one of the most interesting things about data analysis: the tools you might learn for analyzing the distribution of words in a document can be just as valuable and valid for analyzing the distribution of people in a city or photographic features in an archive.
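To make this concrete, here is a rough sketch of a tag hierarchy stored as columns. It assumes a document that has already been flattened by hand into one row per word; the column names (act, scene, speaker), the speakers, and the residents table are all hypothetical, and no XML is actually parsed here.

```r
# Hypothetical flattened document: one row per word, with each level of
# the tag hierarchy (act > scene > speaker) stored in its own column.
words <- data.frame(
  act     = c(1, 1, 1, 1, 2, 2),
  scene   = c(1, 1, 2, 2, 1, 1),
  speaker = c("FIRST", "FIRST", "SECOND", "FIRST", "SECOND", "SECOND"),
  word    = c("to", "be", "my", "not", "lord", "my")
)

# Counting words per speaker is an ordinary grouped count on the table.
speaker_counts <- table(words$speaker)
speaker_counts

# The identical operation counts people per neighborhood in a
# (hypothetical) table of city residents.
residents <- data.frame(
  neighborhood = c("North", "North", "South", "South", "South"),
  person_id    = 1:5
)
neighborhood_counts <- table(residents$neighborhood)
neighborhood_counts
```

The call to `table()` is the same in both cases; only the column being counted changes.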
For any single analysis task, you can probably save time at first by loading your data into some online tool or downloadable Java application; but in doing so you lose the ability to see the shared representational layers below. An absolutely fundamental skill for data manipulation is the ability to recast data into different forms; by doing visualization and statistical analysis on just two of them, you will see how to shape a variety of forms of information.