Rewrite for Library Checkouts Data
The library checkouts data that Nick formatted is available here. We need to:
- [x] Assess the new dataset for ease of learning (105 MB is pretty big for learners to work with) - does it need to be smaller?
- [ ] Transition the current text to work with the new dataset
- [ ] Add context to the current text to help learners know what they are doing and why
For adding the context back in, here's the link to my original reader for this workshop.
I did some exploration to try to figure out the best way to subset the data set:
- 157,964 items in `items` (103.4 MB)
  - 4,123 of these have multiple entries
  - 1,539 of these are checked out
  - 152,714 unique items
  - Size of the `items` data frame by checkout year (possibly with duplicates): 2019: 43.9 MB, 2020: 16.8 MB, 2021: 16.5 MB, 2022: 27.8 MB, 2023: 30 MB
  - Sizes of `items` data frame columns above 2 MB: subjects: 54.4 MB, title: 17.5 MB, author: 8.5 MB, publisher: 4 MB, publication_place: 2.4 MB
  - Active items alone require 22 MB (or 10.5 MB without subjects)
- 2,595 unique users in `users` (0.1 MB), all with checkouts
- 240,424 checkouts in `checkouts` (13.1 MB) spanning 2019-01-02 to 2023-12-28
  - 31,464 active checkouts (i.e., associated with a user)
  - 29,283 unique items
This makes me think we should focus on subsetting the `items` table. The `subjects` column alone is almost 50% of the size of the data set. Its value is generally a long list of subject categories for the item, which might be interesting for demonstrating text search with `LIKE`, but probably not relevant for other commands. The `title` and `author` columns also have searchable text, so maybe we should drop `subjects`.
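For reference, the kind of `LIKE` search that `title` and `author` still support without `subjects` (a minimal sketch in Python/sqlite3 with made-up rows; the real tables have these column names but many more columns):

```python
import sqlite3

# Throwaway in-memory table shaped like a slice of the items table
# (column names from the discussion above; the rows are invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (title TEXT, author TEXT)")
con.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        ("A Brief History of Time", "Stephen Hawking"),
        ("The Time Machine", "H. G. Wells"),
        ("Pride and Prejudice", "Jane Austen"),
    ],
)

# Text search with LIKE works on title/author just as it would on subjects.
rows = con.execute(
    "SELECT title FROM items WHERE title LIKE '%Time%'"
).fetchall()
print(rows)  # the two sample titles containing "Time"
```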
If we drop `subjects` as well as all inactive items, we can get down to ~25 MB. I think keeping the `checkouts` table intact is worth it because it covers Jan 2019 - Dec 2023 and you can definitely see the effect of the pandemic.
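As a sanity check on that pandemic signal, counting checkouts per year is a one-liner once the dates are parsed (a sketch with a few invented dates and a hypothetical `date` column name; the real `checkouts` table spans 2019-01-02 to 2023-12-28):

```python
import pandas as pd

# A few invented checkout dates standing in for the real checkouts table.
checkouts = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-03-01", "2019-07-15", "2020-04-02",
        "2021-06-30", "2022-01-10", "2022-11-05", "2023-09-09",
    ])
})

# Checkouts per year; on the full data, the 2020-2021 dip from the
# pandemic should show up in this table.
per_year = checkouts["date"].dt.year.value_counts().sort_index()
print(per_year)
```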
If the data set can be a little bit larger, say ~35 MB, we could:
- Do the 25 MB plan but also keep ~15,000 items that aren't checked out (for a total of ~45,000 items)
- Do the 25 MB plan but keep the `subjects` column
@MicheleTobias let me know what you think would work best, whether one of these ideas or something else, and I'll update the R script to do it.
@nick-ulle Thanks for layout out some good options! I think ~35MB is really reasonable. The subjects column is probably the least useful since it's a list. The author column can provide similar learning opportunities, so subjects isn't really needed. So that's my vote.
I dropped `subjects` and kept a sample of 15,000 inactive items, and it came out to 25 MB. The code is in `R/` in this repo, and the file is on Google Drive. Let me know if there's anything else I can help with!
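For readers following along, the subsetting logic amounts to something like this (a Python/pandas sketch of the approach, not the repo's actual R script; the `item_id` column name and the tiny tables are invented for illustration):

```python
import pandas as pd

# Tiny stand-ins for the real items and checkouts tables.
items = pd.DataFrame({
    "item_id": [1, 2, 3, 4, 5],
    "title": ["A", "B", "C", "D", "E"],
    "subjects": ["x", "y", "z", "w", "v"],
})
checkouts = pd.DataFrame({"item_id": [1, 3]})  # items 1 and 3 are active

# 1. Drop the subjects column (close to half the size of the data set).
items = items.drop(columns=["subjects"])

# 2. Keep all active items plus a random sample of inactive ones
#    (15,000 in the real data; 1 here so the toy example runs).
is_active = items["item_id"].isin(checkouts["item_id"])
active = items[is_active]
inactive = items[~is_active]
subset = pd.concat([active, inactive.sample(n=1, random_state=42)])
print(subset)
```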
Thanks, @nick-ulle! This is great! I don't think there will be any more tasks, but I'll let you know if that changes. The only thing left is for me to finish going through the hands-on section to work with the new data and add some explanations.