The data table

The data table

Although data can take many shapes, one is especially important: the data table.

A table is a structure for describing a set of data. The simplest mental model for most people is something like a Microsoft Excel spreadsheet with rows and columns. With tabular data, each row represents a record and each column represents a field for that record.

Although this is the this dominant model of data we’ll work with, it’s worth noting that almost no information fits naturally into a tabular form. Even in Microsoft Excel, there are a variety

Is a data table exactly the same as a spreadsheet? No. The key difference is that in a table, each “field”–that is, the columns in a spreadsheet–must be of the same type. Every column in a data table must have the same type: you can’t mix and match numbers, dates, strings.

This means

TitleNetworkPremiered
charactercategoricaldate
Halt and Catch FireAMC2014-06-01
The WireHBO2002-06-02

This means that a data table is something you can read in two directions: by rows and by columns.

Data formats

To work with data, you have to get data. But the material format of data is complicated. The types described above can vary slightly in implementation from programming language to programming language, and from computer platform to computer platform. (One change that happens at the lowest possible level of representation, which you are unlikely to ever encounter, has to do with the order of individual bits in a file; in a reference to how pointless these splits can be, computer scientists borrowed the names [“big-endian” and “little-endian”] from Gulliver’s Travels. So there need to be files that are independent of the way data is built in a computer.

The format that you receive data in will vary on the data. Text files, for example, are almost always distributed nowadays as Unicode data; books as PDFs; and images as JPGs (for small size) or TIFFs (for highest quality.)

The tabular data we’re looking at needs an on-disk format. There are few you may encounter: each has different advantages, because each serves different purposes. Economists talk about the different roles of money; it is a unit of account, a means of exchange, a store of value, etc. Data formats serve several different roles simultaneous. They can be

  1. A permanent store for information that is likely to persist for a long time.
  2. An interchange format that brings data between computer programs.
  3. A means of compression. You might prefer to get images of handwritten text through your e-mail, but turning it into . In some formats, this doesn’t matter at all. Think of the way that people post photographs of books into their social media; the cost of the photograph is minimal.
  4. An editable record that one or many people may need to collaborate on.

Different formats have different advantages. Even in the world of tabular data, there are at least four formats you are likely to encounter.

CSV/TSV files

The CSV format is the most basic way to share tabular data. It uses plain text to store information, so you can open it up in almost any program, but

Basically, a csv is a file in which rows are separated by return characters, and columns are separated by commas. (The TSV is an extremely similar format in which columns are instead separated by tabs; for most purposes, the two formats are interchangeable but csv sees more widespread support.)

That simple definition quickly runs into issues. What happens if a name has a comma inside of it? You can use quotes–but then what if a field has quotes inside of it? There is no single definition of a CSV. The closest is RFC 4180, which most programming languages will support; but in

CSVs can’t represent data types like null values,

Spreadsheets

There are two different spreadsheet programs in wide use. Microsoft Excel is a desktop program that goes back to the 1980s; Google Sheets is a web app from the 2000s. Both allow you to do a great deal of computation in them, which can mean that data gets locked into the format. They also allow a great deal of approaches that are elegant for exploring data (colored columns, merged cells) but that can make it difficult to read data into a standard form. For data entry, spreadsheets are an irreplaceable tool; if a group of people are transcribing sources together, Google Sheets can be immensely powerful. But both Excel and Google have demerits as a unit of interchange. It is possible to read directly from Google Sheets or Excel into R or Python, but often it takes some special wrangling. Their corporate ties also make them subject to decay and unsuitable as a long term archive format.

JSON and XML

JSON and XML are two extremely flexible text-based formats for storing data. Both use

JSON is poorly suited for tabular data, but often used anyways. Like CSVs, it is human readable, universally interpretable, and plain-text based. Unlike them, it allows for some preservation of datatypes (it can distinguish strings from integers, but not dates). And, in something of a mixed blessing, it allows for multiple records in a single entry.

Reading data into R is done using a number of functions from the readr package. In general, you should get in the habit of typing a line of code to load data and assign it to a variable. If you store your data in the same folder as your code, reading a file called my_data.csv is usually as simple typing the following:

      dataset = read_csv("my_data.csv")
    
```{R}

Once you do this, you will have a variable called dataset storing all the columns in the file my_data.csv.

To get a more detailed walkthrough, RStudio can help you load a dataset by going to File > Import Dataset > From Text (readr). If you are doing something rather complicated like loading an Excel file, this may be useful. But for the reasons described earlier, make sure that the file is still located inside the same directory (folder) as your .Rmd file or in a subfolder.

Apache Parquet and Apache Feather

If you want to persist data quickly, efficiently, and in a form closer to what every modern data analysis platform works with, things have gotten dramatically better since about 2017. It used to be that long term storage had to be in a CSV, JSON, or XML file; you will still encounter people who think that these are the only safe ways to share data. I am of the firm opinion that those people have not yet realized just how powerful the latest generation of alternatives are.

These new files binary columnar serialization formats have a number of advantages. The most widely used is Apache Parquet; another similar option is the Apache Feather format. These are formats that store columns of data contiguously as binary types, preserve datatypes rigorously, and can support arbitrary metadata so that it’s possible to describe a documnet internally. Compared to CSV, they will take up substantially less disk space because of internal compression, and be much faster to load. If you find that you have especially large datasets, they are worth exploring; there are almost no cases.

They are not bound to any particular language. As open formats, they do not require any commercial software. Their relative youth means that it may be wise in a few cases to store backup copies in CSV or JSON; but for writing saving your own workflows, you will be well served by using them.

For both, you will need to install the Apache Arrow package .

If you wish to save intermediate work results in {{language}} nowadays, I highly recommend trying one of these instead of using CSVs or native data formats. Because they are harder to read and not so widely known, they are likely to be less suitable to sending to someone else.

      arrow::read_parquet()
    
```{R}

Tables and Databases

We’ve now worked our way up through a hierarchy from a single data point.

  • A vector is many datapoints of the same type stacked together.
  • A table is a set of vectors bound together as a single entity, where each row describes many things of the same type.

There is at least one further step we can take, which is to imagine something composed of many different tables that describe different types of things. This is one definition of a database: several different tables about the same general area. We’ll talk more about databases later.

Chapter actions