```{R}
library(tidyverse)
library(HumanitiesDataAnalysis)
```

What’s the difference between using `%in%` and `str_detect` to filter
down a dataset by a string? (Hint: try seeing how they both behave on
some very *short* strings.)
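
One way to start exploring this is with a small made-up vector rather than the real data. The sketch below assumes nothing about `crews`; the `rigs` values are invented for illustration. Run it and compare the three results line by line.

```{R}
library(stringr)

# Made-up labels, including some very short strings
rigs <- c("Bark", "Barkentine", "B", "Ship", "")

# %in% asks: is each element exactly equal to one of the values on the right?
rigs %in% c("Bark", "B")

# str_detect asks: does each element contain the pattern somewhere inside it?
str_detect(rigs, "Bark")
str_detect(rigs, "B")
```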

1.  Start by just editing some code. The code below finds the first date
    that appears in this collection. Edit it to find the minimum **age**
    in the set.

```{R}
crews |>
  drop_na(date) |>
  summarize(min = min(date))
```

2.  Use `filter` to determine: what is the name of that youngest person?
    When did he or she sail?

```{R}

```

3.  How many sailors left on ‘Barks’ between 1850 and 1880? Chain
    together `filter` and `summarize` with the special `n()` function.
    Note that this has a number of different conditions in the filter
    statement. You could build several filters in a row, but you can
    also include multiple conditions in one filter by separating them
    with commas. For instance, `filter(school == "NYU", year == 2020)`
    might be a valid filter on some dataset (though not this one).

<span class="hint">To filter by date you may need to use a function like
`as.Date` on your input.</span>
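
As a rough illustration of comma-separated conditions combined with `as.Date`, here is a sketch on a hypothetical toy table. The `enrollments` data and its column names are invented and do not match the columns in `crews`.

```{R}
library(dplyr)

# Hypothetical toy data; these are not the columns in `crews`
enrollments <- tibble(
  school = c("NYU", "NYU", "Columbia", "NYU"),
  enrolled = as.Date(c("2019-09-03", "2020-01-27", "2020-09-02", "2021-01-25"))
)

# Each comma adds another condition; a row must satisfy all of them
nyu_2020 <- enrollments |>
  filter(
    school == "NYU",
    enrolled >= as.Date("2020-01-01"),
    enrolled <= as.Date("2020-12-31")
  ) |>
  summarize(count = n())

nyu_2020
```

Converting the comparison values with `as.Date` means both sides of `>=` and `<=` are dates rather than a date and a character string.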

```{R}

```

Question 3 told you how many sailors left on barks in those years. How
many distinct voyages left? The variable `Voyage.number` identifies
distinct voyages in this set. (This may require reading some
documentation: reach out to me or a classmate if you can’t figure it
out. There are at least two ways: one involves using the `dplyr`
function `distinct` before summarizing, and the second involves using
the functions `length` and `unique` in your call to `summarize`.)
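
The two approaches mentioned above might look roughly like this on a hypothetical toy table (the `loans` data and its columns are invented for illustration, not taken from `crews`).

```{R}
library(dplyr)

# Hypothetical toy data: five loans of three different books
loans <- tibble(
  book_id = c(101, 101, 102, 103, 103),
  borrower = c("Ann", "Ben", "Ann", "Cal", "Dee")
)

# Way 1: reduce to one row per distinct value, then count the rows
way_1 <- loans |>
  distinct(book_id) |>
  summarize(n_books = n())

# Way 2: count the unique values directly inside summarize
way_2 <- loans |>
  summarize(n_books = length(unique(book_id)))

way_1
way_2
```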

```{R}

```

Change the code above to count the distinct “Residence” locations in the
dataset. Then add two more pipes to the end to arrange by count.

1.  Try to get a sense of what is in the `books` set based on some
    keyword searches. Can you get a sense of what the biases of this
    subset of the catalog are?

Here are a couple of examples having to do with geographic terms in
subjects; you’d probably do better to explore some other kind of
resource.

```{R}
books |>
  filter(subjects |> str_detect("France|French")) |>
  select(year, summary) |>
  sample_n(10)
```

Consider counting some individual words, as well. Using `str_extract`,
we can create a new column, `word`, which is *only* the part of the
subject that matches a search. Then we can count those individual terms.

Again, look for something other than the three countries here.

```{R}
books |>
  mutate(word = subjects |> str_extract("(France|Germany|England)")) |>
  drop_na(word) |>
  group_by(word) |>
  summarize(count = n())
```
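
One thing worth keeping in mind when reading these counts: `str_extract` returns at most one match per row. The sketch below uses made-up subject strings (not rows from `books`) to show what happens when a subject mentions two of the terms or none of them.

```{R}
library(stringr)

# Made-up subject strings for illustration
toy_subjects <- c(
  "France -- History",
  "Germany -- Relations -- France",
  "Poetry"
)

# One result per input: the first match, or NA when nothing matches
first_match <- str_extract(toy_subjects, "France|Germany|England")

# A list holding every match for each input
all_matches <- str_extract_all(toy_subjects, "France|Germany|England")

first_match
all_matches
```

Because only the first match is kept, a subject naming both Germany and France is counted once, under whichever term appears first in the string.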

### Free Exercise

Try getting `read_csv` to work on your own CSV or Excel file whose types
you explored previously (an Excel file has to be exported to CSV first,
or read with `readxl::read_excel`), or use `arrow::read_parquet` on some
of the files at \[benschmidt.org/directories\], and do three of the
following:

1.  Find an *outlier*; who is the oldest person? The youngest?
2.  Count a categorical variable and arrange by decreasing count. What
    are the most common labels?
3.  Use `str_sub` or `str_extract` to create a better categorical
    variable than the ones you have.
4.  Count some *combination* of variables and see if you can identify
    things that tend to occur a lot together or apart.
5.  Use `min`, `mean`, or `max` inside groups to see which groups have
    higher or lower values.
6.  Describe the *extent* of your data.
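
As a starting point, here is a sketch of what items 2 through 5 might look like. It assumes a hypothetical toy table, `people`, standing in for your own file; every column name here is invented, so swap in your own.

```{R}
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for your own file
people <- tibble(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  born = c("1841", "1849", "1852", "1858", "1858"),
  town = c("Salem", "Salem", "Lynn", "Salem", "Lynn"),
  age = c(19, 24, 31, 17, 22)
)

# Item 3: derive a coarser categorical variable (decade) from a string
people <- people |>
  mutate(decade = str_c(str_sub(born, 1, 3), "0s"))

# Items 2 and 4: count a combination of variables, most common first
combo_counts <- people |>
  count(town, decade) |>
  arrange(desc(n))

# Item 5: compare numeric summaries across groups
town_ages <- people |>
  group_by(town) |>
  summarize(mean_age = mean(age), youngest = min(age))

combo_counts
town_ages
```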
