workshop_intro_to_sql

Rewrite for Library Checkouts Data

MicheleTobias opened this issue 2 years ago • 5 comments

The library checkouts data that Nick formatted is available here. We need to:

  • [x] Assess the new dataset for ease of learning (105 MB is pretty big for learners to work with) - does it need to be smaller?
  • [ ] Transition the current text to work with the new dataset
  • [ ] Add context to the current text to help learners know what they are doing and why

MicheleTobias avatar Apr 01 '24 21:04 MicheleTobias

For adding the context back in, here's the link to my original reader for this workshop.

MicheleTobias avatar Apr 01 '24 22:04 MicheleTobias

I did some exploration to try to figure out the best way to subset the data set:

  • 157,964 items in items (103.4 MB)
    • 4,123 of these have multiple entries
      • 1,539 of these are checked out
    • 152,714 unique items
    • Size of items data frame by checkout year (possibly with duplicates):
            2019      2020      2021      2022      2023
       "43.9 Mb" "16.8 Mb" "16.5 Mb" "27.8 Mb"   "30 Mb"
      
    • Sizes of items data frame columns (above 2 MB):
           title                 author           subjects
       "17.5 Mb"               "8.5 Mb"          "54.4 Mb"
       publisher      publication_place
          "4 Mb"               "2.4 Mb"
      
    • Active items alone require 22 MB (or 10.5 MB without subjects).
  • 2,595 unique users in users (0.1 MB), all with checkouts
  • 240,424 checkouts in checkouts (13.1 MB) spanning 2019-01-02 to 2023-12-28
    • 152,714 unique items
    • 31,464 active checkouts (i.e., associated with a user)
      • 29,283 unique items
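The audit above boils down to a few grouped counts: items with multiple entries, unique items, and checkouts that are "active" because they're associated with a user. A minimal sketch of that logic on toy data (the real exploration was done in R on the full tables, and the column names `item_id` and `user_id` here are assumptions):

```python
# Sketch of the dataset audit above, on a tiny synthetic sample.
# Column names (item_id, user_id) are assumptions; the real audit
# ran in R against the full 103 MB items table.
from collections import Counter

# Toy stand-ins for rows of the items and checkouts tables.
items = [
    {"item_id": 1, "title": "A"},
    {"item_id": 2, "title": "B"},
    {"item_id": 2, "title": "B"},   # duplicate entry for the same item
    {"item_id": 3, "title": "C"},
]
checkouts = [
    {"item_id": 1, "user_id": 10},    # active: associated with a user
    {"item_id": 2, "user_id": None},  # inactive: no user attached
]

counts = Counter(row["item_id"] for row in items)
n_unique = len(counts)                                      # distinct items
n_multi = sum(1 for c in counts.values() if c > 1)          # items with duplicates
n_active = sum(1 for c in checkouts if c["user_id"] is not None)

print(n_unique, n_multi, n_active)  # 3 1 1
```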

This makes me think we should focus on subsetting the items table. The subjects column in the items table is almost 50% of the size of the data set. The value is generally a long list of subject categories for the item, which might be interesting for demonstrating text search with LIKE, but probably not relevant for other commands. The title and author columns also have searchable text, so maybe we should drop subjects.
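For reference, the `LIKE` lesson that `subjects` would have supported transfers directly to `title` or `author`. A quick sketch with an in-memory SQLite table (the schema and sample rows here are made up for illustration, not taken from the workshop data):

```python
# Demonstrates the LIKE pattern-matching lesson mentioned above,
# using an in-memory SQLite table with assumed column names.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (title TEXT, author TEXT, subjects TEXT)")
con.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("Silent Spring", "Rachel Carson", "Pesticides; Ecology"),
        ("The Sea Around Us", "Rachel Carson", "Oceanography"),
        ("Dune", "Frank Herbert", "Science fiction"),
    ],
)

# The same text-search exercise works against author (or title),
# which is why dropping subjects loses little teaching value.
rows = con.execute(
    "SELECT title FROM items WHERE author LIKE '%Carson%'"
).fetchall()
print(rows)  # [('Silent Spring',), ('The Sea Around Us',)]
```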

If we drop subjects as well as all inactive items, we can get down to ~25 MB. I think keeping the checkouts table intact is worth it because it covers Jan 2019 - Dec 2023 and you can definitely see the effect of the pandemic.

If the data set can be a little bit larger, say ~35 MB, we could:

  • Do the 25 MB plan but also keep ~15,000 items that aren't checked out (for a total of ~45,000 items)
  • Do the 25 MB plan but keep the subjects column

@MicheleTobias let me know what you think would work best, whether one of these ideas or something else, and I'll update the R script to do it.

nick-ulle avatar Apr 09 '24 03:04 nick-ulle

@nick-ulle Thanks for laying out some good options! I think ~35 MB is really reasonable. The subjects column is probably the least useful since each value is a long list. The author column can provide similar learning opportunities, so subjects isn't really needed. So that's my vote.

MicheleTobias avatar Apr 09 '24 15:04 MicheleTobias

I dropped subjects and kept a sample of 15,000 inactive items and it came out to 25 MB. The code is in R/ in this repo, and the file is on Google Drive. Let me know if there's anything else I can help with!
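The subsetting logic described here (keep all active items, add a random sample of inactive ones, drop the subjects column) can be sketched in a few lines. This is only an illustration on toy data with assumed column names; the actual script is the R code in R/:

```python
# Illustration of the subsetting logic described above, on toy data.
# The real work is the R script in R/; item_id/subjects names and the
# sizes here are assumptions, scaled down from 15,000 to 10.
import random

random.seed(42)

items = [{"item_id": i, "subjects": "...", "title": f"t{i}"} for i in range(100)]
active_ids = {i for i in range(100) if i % 4 == 0}  # pretend these are checked out
SAMPLE_SIZE = 10  # the real script sampled 15,000 inactive items

inactive = [r for r in items if r["item_id"] not in active_ids]
keep = [r for r in items if r["item_id"] in active_ids]
keep += random.sample(inactive, SAMPLE_SIZE)

# Drop the subjects column from every kept row.
keep = [{k: v for k, v in row.items() if k != "subjects"} for row in keep]

print(len(keep))  # 25 active + 10 sampled inactive = 35 rows
```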

nick-ulle avatar Apr 10 '24 01:04 nick-ulle

Thanks, @nick-ulle ! This is great! I don't think there will be any more tasks, but I'll let you know if that changes. The only thing left is for me to finish going through the hands-on section to work with the new data and add some explanations.

MicheleTobias avatar Apr 10 '24 15:04 MicheleTobias