```{R}
library(tidyverse)
library(HumanitiesDataAnalysis)
```

What’s the difference between using `%in%` and `str_detect` to filter down a dataset by a string? (Hint: try seeing how they both behave on some very *short* strings.)

1. Start by just editing some code. The code below finds the first date that appears in this collection. Edit it to find the minimum **age** in the set.

```{R}
crews |>
  drop_na(date) |>
  summarize(min = min(date))
```

2. Use `filter` to determine: what is the name of that youngest person? When did he or she sail?

```{R}

```

3. How many sailors left on ‘Barks’ between 1850 and 1880? Chain together `filter` and `summarize` with the special `n()` function. Note that this requires several different conditions in the filter statement. You could chain several filters in a row, but you can also put multiple conditions in a single `filter` call by separating them with commas. For instance, `filter(school=="NYU",year==2020)` might be a valid filter on some dataset (though not this one). To filter by date you may need to use a function like `as.Date` on your input.

```{R}

```

Question 3 told you how many sailors left on barks in those years. How many distinct voyages left? The variable `Voyage.number` identifies distinct voyages in this set. (This may require reading some documentation: reach out to me or a classmate if you can’t figure it out. There are at least two ways: one involves using the `dplyr` function `distinct` before summarizing, and the second involves using the functions `length` and `unique` in your call to `summarize`.)

```{R}

```

Change the code above to count the distinct “Residence” locations in the dataset. Then add two more pipes to the end to arrange by count.

1. Try to get a sense of what is in the books set based on some keyword searches. Can you get a sense of what the biases of this subset of the catalog are? Here is an example having to do with geographic terms in subjects; you’d probably do better to explore some other kind of resource.

```{R}
books |>
  filter(subjects |> str_detect("France|French")) |>
  select(year, summary) |>
  sample_n(10)
```

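To see how the `|` in the pattern above behaves, here is a minimal sketch on a made-up vector of subject strings. The vector `toy_subjects` is hypothetical and not part of the `books` data. `str_detect` returns one `TRUE`/`FALSE` per element, which is what `filter` uses to keep or drop rows.

```{R}
library(stringr)

# Hypothetical subject strings, invented for illustration only.
toy_subjects <- c(
  "France -- History",
  "French poetry",
  "Germany -- Description and travel",
  "Cooking"
)

# One logical value per string: TRUE if either alternative appears anywhere in it.
has_french_term <- str_detect(toy_subjects, "France|French")
has_french_term

# Subsetting by the logical vector keeps only the matching strings.
toy_subjects[has_french_term]
```
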
Consider counting some individual words, as well. Using `str_extract`, we can create a new column, `word`, which is *only* the part of the subject that matches a search. Then we can count those individual terms. Again, look for something other than the three countries here.

```{R}
books |>
  mutate(word = subjects |> str_extract("(France|Germany|England)")) |>
  drop_na(word) |>
  group_by(word) |>
  summarize(count = n())
```

### Free Exercise

Try getting `read_csv` to work on your own CSV or Excel file whose types you explored previously, or use `arrow::read_parquet` on some of the files at benschmidt.org/directories, and do three of the following:

1. Find an *outlier*; who is the oldest person? The youngest?
2. Count a categorical variable and arrange by decreasing count. What are the most common labels?
3. Use `str_sub` or `str_extract` to create a better categorical variable than the ones you have.
4. Count some *combination* of variables and see if you can identify things that tend to occur a lot together or apart.
5. Use `min`, `mean`, or `max` inside groups to see which groups have higher or lower values.
6. Describe the *extent* of your data.
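
As a rough sketch of item 5, comparing groups with `min`, `mean`, and `max` might look like the following. The tibble `toy` and its `rig` and `age` columns are invented for illustration; substitute your own data frame and column names.

```{R}
library(dplyr)

# Hypothetical data, invented for illustration only.
toy <- tibble(
  rig = c("Bark", "Bark", "Ship", "Ship", "Ship"),
  age = c(19, 25, 30, NA, 22)
)

# One row per group; na.rm = TRUE skips missing ages instead of returning NA.
age_by_rig <- toy |>
  group_by(rig) |>
  summarize(
    min_age  = min(age, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE),
    max_age  = max(age, na.rm = TRUE)
  )

age_by_rig
```

Setting `na.rm = TRUE` is one choice among several; dropping the missing rows before grouping would give the same summaries here.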