
zip support?

Open nfultz opened this issue 2 years ago • 2 comments

I was wondering if zip support could be implemented.

I am working with the Reddit Data Export, which arrives as a zip file containing a couple of different CSVs, one zip file per user. Some of them can be quite large depending on user activity, so it would be nice to use the virtual table facilities without having to extract the CSVs first.

Or if you have other ideas, I'd appreciate it. Thanks much.

nfultz avatar Jul 05 '23 20:07 nfultz

So you have one zip file with multiple CSV files inside, and you want to query individual CSV files inside that? Like you have a bundle.zip which contains ./a.csv, ./b.csv, ./c.csv etc, and you want to be able to do something like "select all rows from the a.csv file inside bundle.zip"?

If so, you currently can't do that easily with sqlite-xsv alone, although you could potentially combine the SQLite zipfile extension with csv_reader to do what you'd like. The downside is that this will read the entire CSV file into memory, and won't work for files larger than 1GB. Here's what that might look like:

create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select *
from temp.students_reader(
  (select data from zipfile('bundle.zip') where name = 'students.csv')
);

If your zipfile had a single CSV file inside it (e.g. an a.zip that only contained a sole a.csv file), then sqlite-xsv could support that with relative ease. But since you have multiple CSV files in the zip file, it'll be a bit awkward to specify which file entry inside the zipfile should be used.

I have a few features planned for the near future that'll offer a really flexible/extensible way to read CSVs from any source (filesystem/HTTP/S3/compression archives etc). Once that's complete, I could see something like this:

create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select * 
from temp.students_reader(
  zipfile_reader('bundle.zip', 'students.csv')
);

Where zipfile_reader would be a function from another future SQLite extension, sqlite-zip, which would efficiently read a specific entry in a zipfile and pass it along to sqlite-xsv for CSV processing.

asg017 avatar Jul 05 '23 21:07 asg017

Thank you so much for your detailed explanation.

I need to think about whether loading it fully into memory is a good idea. My original idea was to mount each user's extract as a set of virtual tables, union-all the users together in another view, and then hope predicate pushdown could buy back some performance. It sounds like I'll need to experiment a bit; I could imagine that being pathologically bad for, say, joining comments from one user to posts from another.

But I also don't want to re-implement reddit, I just want some queries for data cleaning and analysis.
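For anyone else trying this in the meantime, here's a rough stdlib-Python sketch of that plan (function, table, and file names are just illustrative, not part of sqlite-xsv): it streams a CSV entry straight out of the zip into a SQLite table without extracting to disk, and re-adds the username column while it's at it:

```python
import csv
import io
import sqlite3
import zipfile

def load_csv_from_zip(con, zip_path, entry, table, username):
    """Stream one CSV entry out of a zip into a SQLite table,
    tagging each row with the user the export came from."""
    with zipfile.ZipFile(zip_path) as zf, zf.open(entry) as raw:
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
        header = next(reader)
        # Sketch-level shortcut: column names come straight from the CSV
        # header, so only do this with files you trust.
        cols = ", ".join(["username"] + header)
        placeholders = ", ".join("?" * (len(header) + 1))
        con.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
        con.executemany(
            f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
            ([username] + row for row in reader),
        )

con = sqlite3.connect(":memory:")
# One call per (user zip, csv entry) pair, e.g.:
# load_csv_from_zip(con, "export_nfultz_20230612.zip",
#                   "comments.csv", "comments", "nfultz")
```

Loading every user into the same physical table sidesteps the union-all view entirely, at the cost of materializing the data once up front.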

Just for future readers, here is what the data looks like for each user. There will be many of these zip files:

downloads$ unzip -l export_nfultz_20230612.zip
Archive:  export_nfultz_20230612.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       31  2023-07-03 23:10   account_gender.csv
       46  2023-07-03 23:10   approved_submitter_subreddits.csv
   328813  2023-07-03 23:10   chat_history.csv
    76706  2023-07-03 23:10   comment_headers.csv
    94506  2023-07-03 23:10   comment_votes.csv
   133707  2023-07-03 23:10   comments.csv
      142  2023-07-03 23:10   drafts.csv
     1810  2023-07-03 23:10   friends.csv
       32  2023-07-03 23:10   gilded_comments.csv
       32  2023-07-03 23:10   gilded_posts.csv
      671  2023-07-03 23:10   hidden_posts.csv
    12893  2023-07-03 23:10   ip_logs.csv
       22  2023-07-03 23:10   linked_identities.csv
       14  2023-07-03 23:10   linked_phone_number.csv
    14559  2023-07-03 23:10   message_headers.csv
   316397  2023-07-03 23:10   messages.csv
       36  2023-07-03 23:10   moderated_subreddits.csv
     1374  2023-07-03 23:10   multireddits.csv
      440  2023-07-03 23:10   poll_votes.csv
  1315419  2023-07-03 23:10   post_headers.csv
   636504  2023-07-03 23:10   post_votes.csv
  1744510  2023-07-03 23:10   posts.csv
       48  2023-07-03 23:10   reddit_gold_information.csv
     1371  2023-07-03 23:10   saved_comments.csv
     7840  2023-07-03 23:10   saved_posts.csv
       71  2023-07-03 23:10   scheduled_posts.csv
      189  2023-07-03 23:10   statistics.csv
      868  2023-07-03 23:10   subscribed_subreddits.csv
       10  2023-07-03 23:10   twitter.csv
     2106  2023-07-03 23:10   user_preferences.csv
     2542  2023-07-03 23:10   checkfile.csv
---------                     -------
  4693709                     31 files

One other gotcha for this particular data set is that you have to add the username column back to each table; it's not included in a single-user extract.
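For example (view and per-user table names are hypothetical), the missing column can be re-added as a literal inside the union-all view:

```sql
create view comments as
  select 'nfultz' as username, * from nfultz_comments
  union all
  select 'alice' as username, * from alice_comments;
```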

nfultz avatar Jul 06 '23 17:07 nfultz