Working in a Programming Language

Different languages and humanities computing

Different computer languages serve different purposes. If you have ever taken an introductory computer science course, you might have learned a different language, like Python, Java, C, or Lisp.

Although computing languages are equivalent in a certain, abstract sense, they each channel you towards thinking in particular ways. As I say in Chapter 2, computers offer a variety of formal languages for describing things; each of these languages emphasizes a different thing.

Which of these languages is best? It depends on what you want to do. For creating rich, user-oriented experiences, JavaScript and the open web are best.

What R–especially tidyverse R–does well is encourage you to move from thinking about programming to thinking about data. Exploratory data analysis in the tidyverse operates on a particular base class, the ‘dataframe’ or (for short) ‘tibble.’ We’ll talk about this more in Chapter 3, but a dataframe represents a structured collection of data, much like an Excel spreadsheet or database table. This gives a coherent, basic framework for describing any data set. The things that you can do with a dataframe
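To make the spreadsheet comparison concrete, here is a minimal sketch of a dataframe. It uses base R's data.frame() so that it runs without installing anything; a tibble would be built and inspected in much the same way. The `books` table and its column names are invented for illustration and are not part of the course data.

```r
# A small illustrative table: each row is one book, each column one variable.
# The object name and column names are made up for this example.
books <- data.frame(
  title = c("Middlemarch", "Moby-Dick", "Beloved"),
  year = c(1871, 1851, 1987),
  stringsAsFactors = FALSE
)

# Like a spreadsheet, the table has a fixed number of rows and named columns.
nrow(books)
names(books)

# A single column can be pulled out and summarized like any other vector.
mean(books$year)

# Rows can be selected by a condition on a column.
books[books$year < 1900, ]
```

The point is less the particular functions than the shape: once data is in rows and named columns, the same few questions (how many rows, which columns, which rows match a condition) can be asked of any data set.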

If you only learn a single language, there’s a strong argument that it should be Python, which is a widespread language that can do anything and frequently runs quite quickly. If you want to learn to create code, Python is a better language.

But Python generally promotes a specific way of thinking about how to solve a problem, one that revolves around thinking like a computer.

The closest analogues to these in other languages are less elegant and less well thought out. Python has a widely used tool called pandas for analyzing data that is fast, powerful, and effective. But it is also more challenging for beginners than it needs to be. If you Google a problem, you’ll be confronted with a variety of different ways to solve it. Ten years ago, one big advantage of Python over R was that it had a small standard library, cleaner syntax, and promoted a single way to do things effectively. One of the great ironies of modern data science is that, for programming with data, the situation has almost completely reversed; pandas gives you a bewildering number of different ways to join data frames, to access their rows or columns, or to walk through the rows. The tidyverse does a better job enforcing a particular approach.

If you want to learn programming, there’s a good argument for learning Python. If you just want to get things done, though, there’s an equally strong case for JavaScript; and if you really want to understand computers, you should take a month learning to write in Haskell, or Lisp, or C.

The place of GUIs

One thing you can’t do in this course, though, is rely on out-of-the-box graphical tools, each built for one kind of problem. ArcGIS or QGIS may be the best way to make maps, and Gephi the best way to do network analysis. But as this is a course in data analysis, you should think about the fundamental operations of cartography and network analysis as simply subsets of a broader field, which is hard to see from the confines of a single tool. All of these things are possible in R, and by seeing them as facets of a broader activity, you’ll develop transferable skills and insights.

Also unlike graphical tools, working in a language saves your workflow. If you make a map with laboriously positioned points in ArcGIS, you may have a beautiful final project, but you can’t reproduce exactly how it happened. In R, though, every step you take and every move you make can be preserved. This is called reproducible research, and it is among the most important contributions you can make when working collaboratively.
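As a rough sketch of what a preserved workflow looks like, the script below writes down every step between a small table of points and a finished plot. It uses only base R; the `places` table, its column names, and the approximate coordinates are illustrative assumptions, not course data.

```r
# Illustrative points with approximate coordinates; in a graphical tool
# these might be positioned by hand.
places <- data.frame(
  name = c("Boston", "New York", "Philadelphia"),
  lat = c(42.36, 40.71, 39.95),
  lon = c(-71.06, -74.01, -75.17),
  stringsAsFactors = FALSE
)

# Step 1: keep only the points north of 40 degrees latitude.
northern <- places[places$lat > 40, ]

# Step 2: sort the remaining points from west to east.
northern <- northern[order(northern$lon), ]

# Step 3: draw the result and label each point.
plot(northern$lon, northern$lat, xlab = "Longitude", ylab = "Latitude")
text(northern$lon, northern$lat, labels = northern$name, pos = 3)
```

Because the filtering, sorting, and plotting are all recorded, anyone who reruns the script gets the same picture, and changing the latitude cutoff is a one-character edit rather than a fresh round of clicking.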

The course package

This course itself uses an R package to manage information. You can install it using the following lines of code.

The second line will also reinstall the package, which we’ll probably do periodically in the semester.

Once installed, you can also update by typing update_HDA() at the R prompt.

The course package contains four things:

  1. Sample data sets we’ll be working with
  2. Code to make it easier to work with the class by, for example, downloading problem sets to your computer.
  3. Code to streamline approaches that we’ve already learned and that aren’t easily expressed in other packages.
  4. A list of ‘dependencies’ that will automatically install other packages you need.

Troubleshooting Guide

When you have trouble running code, there are a few questions you should ask first.

  1. Does RStudio know you’re writing code?
    • Do you have a code block that’s grey with white text on either side?
    • Is there a ‘run’ button on your cell?
    • If not, you probably have a formatting problem. Make sure the chunk has three backticks at the top and bottom; consider going up to the insert button and inserting a fresh chunk.
  2. Read the error messages.
    • Is there something missing? For instance, have you run library(tidyverse) at the top of your code?
  3. Are you spelling everything right and closing all your punctuation?
    • It’s easy to lose track of how many open parentheses you have.
  4. Can you restart and start over from the beginning? Sometimes you’ll be relying on some changed piece of code you’ve forgotten about.
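Several of the checks above can also be run from the console. The sketch below uses only base R functions; `sumarize_everything()` is a deliberately nonexistent, hypothetical function name used to trigger an error, and the exact wording of the messages may vary with your R version and language settings.

```r
# 1. Is something missing? Check that a package is installed before loading it.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)
if (!has_tidyverse) {
  message("tidyverse is not installed; install.packages(\"tidyverse\") would add it.")
}

# 2. A misspelled or unloaded function produces an error.
#    tryCatch() captures the message so it can be read rather than halting.
#    sumarize_everything() is a hypothetical name that does not exist.
missing_function <- tryCatch(
  sumarize_everything(1:3),
  error = function(e) conditionMessage(e)
)
print(missing_function)

# 3. An unclosed parenthesis is caught when R tries to parse the code.
unclosed <- tryCatch(
  parse(text = "mean(c(1, 2, 3)"),
  error = function(e) conditionMessage(e)
)
print(unclosed)
```

In both error cases the message names the problem fairly directly (a function R cannot find, or input that ends before the expression is complete), which is why reading the message first usually saves time.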


Exercises: creating a project

Getting started is the hardest thing, because it requires understanding–to some degree–this entire software ‘stack.’ Here’s what you should do once RStudio is running.

  1. Type the following into the prompt to install the latest version of this package.

These updates are important.

  2. Type library(HumanitiesDataAnalysis) to actually load the package.

  3. Create a new project for problem sets in a folder on your computer.

  4. Type download_problem_sets() into the console prompt to download the sets.

  5. Start editing the code in the first problem set and run it using the green arrow buttons.
