zip support?
I was wondering if zip support could be implemented.
I am working with Reddit data exports, which arrive as a zip file containing several different CSVs, one zip file per user. Some of them can be quite large depending on user activity, so it would be nice to use the virtual table facilities without having to extract the CSVs first.
Or if you have other ideas, I'd appreciate it. Thanks much.
So you have one zip file with multiple CSV files inside, and you want to query individual CSV files inside it? Like you have a bundle.zip which contains ./a.csv, ./b.csv, ./c.csv etc., and you want to be able to do something like "select all rows from the a.csv file inside bundle.zip"?
If so, you currently can't do that easily with sqlite-xsv alone, although you could potentially combine the SQLite zipfile extension with csv_reader to do what you'd like. The downside is that this will read the entire CSV file into memory, and won't work for files larger than 1GB. Here's what that might look like:
create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select *
from temp.students_reader(
  (select data from zipfile('bundle.zip') where name = 'students.csv')
);
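If you're not sure of the exact entry name to use in that subquery, the same zipfile extension can also list an archive's contents as a table. A quick sketch, using the zipfile table-valued function's name and sz columns:

```sql
-- List every entry in the archive with its uncompressed size,
-- to find the exact `name` value to filter on.
select name, sz
from zipfile('bundle.zip');
```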
If your zip file had a single CSV file inside it (e.g. an a.zip that contained only a sole a.csv file), then sqlite-xsv could support that with relative ease. But since you have multiple CSV files in the zip file, it would be a bit awkward to specify which file entry inside the zip file should be used.
I have a few features planned for the near future that'll offer a really flexible/extensible way to read CSVs from any source (filesystem/HTTP/S3/compressed archives/etc.). Once that's complete, I could see something like this:
create virtual table temp.students_reader using csv_reader(
  id int,
  name text,
  birthdate text
);

select *
from temp.students_reader(
  zipfile_reader('bundle.zip', 'students.csv')
);
Where zipfile_reader would be a function from another future SQLite extension, sqlite-zip, which would efficiently read a specific entry in a zip file and pass it along to sqlite-xsv for CSV processing.
Thank you so much for your detailed explanation.
I need to think about whether loading it fully into memory is a good idea. My original idea was to mount each user extract as virtual tables, union-all the users together in another view, and then hope predicate pushdown could buy back some performance. It sounds like I'll need to experiment a bit; I could imagine that being pathologically bad when joining comments from one user to posts from another.
But I also don't want to re-implement reddit, I just want some queries for data cleaning and analysis.
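Concretely, the union-all view I have in mind would look something like this (posts_user_a etc. are hypothetical per-user csv_reader virtual tables, one per extract):

```sql
-- One virtual table per user's posts.csv, unioned into a single view.
create view all_posts as
  select * from posts_user_a
  union all
  select * from posts_user_b;

-- A filter like this only stays cheap if the predicate is pushed
-- down into each branch of the union, rather than applied after
-- every branch has been fully scanned.
select * from all_posts where subreddit = 'sqlite';
```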
Just for future readers, here is what the data looks like for each user. There will be many of these zip files:
downloads$ unzip -l export_nfultz_20230612.zip
Archive: export_nfultz_20230612.zip
Length Date Time Name
--------- ---------- ----- ----
31 2023-07-03 23:10 account_gender.csv
46 2023-07-03 23:10 approved_submitter_subreddits.csv
328813 2023-07-03 23:10 chat_history.csv
76706 2023-07-03 23:10 comment_headers.csv
94506 2023-07-03 23:10 comment_votes.csv
133707 2023-07-03 23:10 comments.csv
142 2023-07-03 23:10 drafts.csv
1810 2023-07-03 23:10 friends.csv
32 2023-07-03 23:10 gilded_comments.csv
32 2023-07-03 23:10 gilded_posts.csv
671 2023-07-03 23:10 hidden_posts.csv
12893 2023-07-03 23:10 ip_logs.csv
22 2023-07-03 23:10 linked_identities.csv
14 2023-07-03 23:10 linked_phone_number.csv
14559 2023-07-03 23:10 message_headers.csv
316397 2023-07-03 23:10 messages.csv
36 2023-07-03 23:10 moderated_subreddits.csv
1374 2023-07-03 23:10 multireddits.csv
440 2023-07-03 23:10 poll_votes.csv
1315419 2023-07-03 23:10 post_headers.csv
636504 2023-07-03 23:10 post_votes.csv
1744510 2023-07-03 23:10 posts.csv
48 2023-07-03 23:10 reddit_gold_information.csv
1371 2023-07-03 23:10 saved_comments.csv
7840 2023-07-03 23:10 saved_posts.csv
71 2023-07-03 23:10 scheduled_posts.csv
189 2023-07-03 23:10 statistics.csv
868 2023-07-03 23:10 subscribed_subreddits.csv
10 2023-07-03 23:10 twitter.csv
2106 2023-07-03 23:10 user_preferences.csv
2542 2023-07-03 23:10 checkfile.csv
--------- -------
4693709 31 files
One other gotcha for this particular data set is that you have to add the username column back to each table; it's not included in a single-user extract.
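One way to patch that up is a thin view per user that tacks the username on as a literal (comments_nfultz here is a hypothetical virtual table over that user's comments.csv):

```sql
-- Re-attach the username, which otherwise only exists in the
-- zip's filename, so per-user tables can be unioned safely.
create view comments_nfultz_tagged as
  select 'nfultz' as username, c.*
  from comments_nfultz as c;
```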