The Gift of Data
Data as “Capta”
Data as a given
Transformational thinking
Code
Different languages and humanities computing
The case for R
The place of GUIs
Packages
Projects
Literate Programming
The Tidyverse
Installing from CRAN
Installing from GitHub
The course package
Troubleshooting Guide
Exercises: creating a project
Data Types
Numbers
Types of Numbers
Textual data
Strings
Character encoding
Other data types, vectors, and dataframes.
Other primitive types
Combined types
Vectors
Dataframes (tibbles)
Formal languages
Arithmetic is the formal language of numbers
Ontologies are formal languages for specific areas
Regexes are the formal language of strings
Examples
Where to use regexes:
Basic search-replace operations
Custom operators
Basic Operators:
, and
OR:
The power of
Replacements
Escape characters.
Escaping special characters
Extraction
Other special characters
Data Types
Data formats
CSV/TSV files
Spreadsheets
JSON and XML
Apache Parquet and Apache Feather
Tables and Databases
The power of counting
The pipeline strategy for exploratory data analysis
Data chains
Real Data
Filtering
The comparison operators
Sorting
Summarizing
Summarization functions
Grouping and summarizing
Grouping
Working with Text
Free Exercise
Why Visualize?
The Grammar of Graphics in ggplot
Data, Geometry, and Aesthetics
Changing Scales
Conventions around X and Y axes.
Facets
Other settings
Coordinate systems
Labels
Themes
Exploring Data with ggplot
Histograms of quantitative variables
Barplots of categorical variables
Scatterplots
Density Plots
Text Scatterplots
Colors
RGB
Exercises: Visualizing Data
Free exercise
Overview
Common data cleaning issues.
NA values
Character encoding
Record formats
Inconsistent category labels
Date-time and other “type” formats
Historical dates
Data cleaning strategies.
Reading files.
Reading tables and constructive failure.
Other formats
Cleaning Data.
Using and to clean data.
Assignment in R.
Cleaning using other functions
Cleaning Code
Merging Data
Selects and filters
Combining data horizontally and vertically.
Renaming and joins
Left joins and anti-joins
Abstract data, SQL databases, and normalization.
Tidyverse to SQL equivalencies.
Self joins
The Pleiades dataset: Real-world modeling and SQL querying
Writing SQL queries.
Nests and deeply structured data
Where can you use list columns?
Working with database systems
Exercises: joins and nesting.
Loading texts into R
Tokenization
The choices of tokenization
Wordcounts
Word counts and Zipf’s law.
Concordances
Functions
Metadata Joins
Subword tokenization with sentencepiece
Train a model, then tokenize
Probabilities and Markov chains.
State of the Union
Joins
Free Exercise
The Variable-Document model
Groupings as documents
Chunking
Programming, summarizing, and ‘binding’
TEIdytext and the variable-document model on XML
hathidy, tidyDFR, and the variable-document model for wordcounts.
UNDER CONSTRUCTION
Bookworm databases
Highly Optional: Working with TEI.
Three metrics
TF-IDF: let’s build a search engine
Pointwise mutual information
food
Dunning Log-Likelihood
The term-document matrix
Principal Components and dimensionality reduction
What is a map as data?
Representing spatial data
Converting point data
Working with SF Objects.
Reading shapefiles.
Why coordinate systems have to be such a pain.
Choosing projections.
We need multiple ways to store information.
Loading data through R packages and choosing world-scale projections
Data manipulation
Cartography
Spatial joins and mapping counts and data
Spatial Summaries
Spatial joins
Using text analysis tools with maps
Grid plotting
Classification
Naive Bayes
Logistic Regression
Choosing the right test and train sets
K-means clustering
Hierarchical Clustering
Topic Modeling