The Gift of Data

Data as “Capta”

Data as a given

Transformational thinking

Code

Different languages and humanities computing

The case for R

The place of GUIs

Packages

Projects

Literate Programming

The Tidyverse

Installing from CRAN

Installing from GitHub

The course package

Troubleshooting Guide

Exercises: creating a project

Data Types

Numbers

Types of Numbers

Textual data

Strings

Character encoding

Other data types, vectors, and dataframes.

Other primitive types

Combined types

Vectors

Dataframes (tibbles)

Formal languages

Arithmetic is the formal language of numbers

Ontologies are formal languages for specific areas

Regexes are the formal language of strings

Examples

Where to use regexes:

Basic search-replace operations

Custom operators

Basic Operators:

, and

OR:

The power of

Replacements

Escape characters.

Escaping special characters

Extraction

Other special characters

Data Types

Data formats

CSV/TSV files

Spreadsheets

JSON and XML

Apache Parquet and Apache Feather

Tables and Databases

The power of counting

The pipeline strategy for exploratory data analysis

Data chains

Real Data

Filtering

The comparison operators

Sorting

Summarizing

Summarization functions

Grouping and summarizing

Grouping

Working with Text

Free Exercise

Why Visualize?

The Grammar of Graphics in ggplot

Data, Geometry, and Aesthetics

Changing Scales

Conventions around X and Y axes.

Facets

Other settings

Coordinate systems

Labels

Themes

Exploring Data with ggplot

Histograms of quantitative variables

Barplots of categorical variables

Scatterplots

Density Plots

Text Scatterplots

Colors

RGB

Exercises: Visualizing Data

Free Exercise

Overview

Common data cleaning issues.

NA values

Character encoding

Record formats

Inconsistent category labels

Date-time and other “type” formats

Historical dates

Data cleaning strategies.

Reading files.

Reading tables and constructive failure.

Other formats

Cleaning Data.

Using and to clean data.

Assignment in R.

Cleaning using other functions

Cleaning Code

Merging Data

Selects and filters

Combining data horizontally and vertically.

Renaming and joins

Left joins and anti-joins

Abstract data, SQL databases, and normalization.

Tidyverse to SQL equivalencies.

Self joins

The Pleiades dataset: Real-world modeling and SQL querying

Writing SQL queries.

Nests and deeply structured data

Where can you use list columns?

Working with database systems

Exercises: joins and nesting.

Loading texts into R

Tokenization

The choices of tokenization

Wordcounts

Word counts and Zipf’s law.

Concordances

Functions

Metadata Joins

Subword tokenization with sentencepiece

Train a model, then tokenize

Probabilities and Markov chains.

State of the Union

Joins

Free Exercise

The Variable-Document model

Groupings as documents

Chunking

Programming, summarizing, and ‘binding’

TEIdytext and the variable-document model on XML

hathidy, tidyDFR, and the variable-document model for wordcounts.

UNDER CONSTRUCTION

Bookworm databases

Highly Optional: Working with TEI.

Three metrics

TF-IDF: let’s build a search engine

Pointwise mutual information

food

Dunning Log-Likelihood

The term-document matrix

Principal Components and dimensionality reduction

What is a map as data?

Representing spatial data

Converting point data

Working with SF Objects.

Reading shapefiles.

Why coordinate systems have to be such a pain.

Choosing projections.

We need multiple ways to store information.

Loading data through R packages and choosing world-scale projections

Data manipulation

Cartography

Spatial joins and mapping counts and data

Spatial Summaries

Spatial joins

Using text analysis tools with maps

Grid plotting

Classification

Naive Bayes

Logistic Regression

Choosing the right test and train sets

K-means clustering

Hierarchical Clustering

Topic Modeling