```{R}
library(tidyverse)
library(HumanitiesDataAnalysis)
```

Create the word statistics field as above, then choose a number between 100 and 200 and look at the words that appear exactly that many times. (For example, `filter(n == 100)` keeps the words that appear exactly 100 times; it does not return the top 100 words.) Compare the words with the highest and the lowest IDF scores, and look at their document counts as well. Does how clustered a word is (i.e., how high its IDF score is) seem to indicate how specific the word is?

::: {.cell hash=‘Bag-of-words_cache/json/unnamed-chunk-11_6e1edf69afe2741f31b6c52c2a972a45’}
```{R}
# IDF: log of (total documents / number of documents containing the word)
word_statistics |> mutate(IDF = log(total_documents / documents))
```
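
The exercise can be sketched on a tiny made-up table before trying it on the real data. Everything here is hypothetical and only for illustration: the `toy_statistics` table, its four words, their counts, and a `total_documents` of 100. The sketch assumes the same column names (`n`, `documents`) used in `word_statistics` above.

```{R}
library(dplyr)

# Hypothetical toy data: total occurrences (n) and the number of
# documents each word appears in. These values are invented.
toy_statistics <- tibble(
  word      = c("the", "senate", "whale", "treaty"),
  n         = c(5000, 150, 150, 150),
  documents = c(100, 60, 3, 20)
)
total_documents <- 100  # assumed corpus size for the toy example

toy_idf <- toy_statistics |>
  filter(n == 150) |>                                # words appearing exactly 150 times
  mutate(IDF = log(total_documents / documents)) |>  # natural-log IDF
  arrange(desc(IDF))                                 # most clustered words first

toy_idf
```

All three remaining words occur 150 times, but the one concentrated in 3 documents gets a much higher IDF than the one spread across 60, which is the contrast the exercise asks you to look for in the real corpus.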