Intermediate Data Storage Format
Questions:
- How do we want to store data in an intermediate format before converting it to lm_dataformat, which uses JSON lists?
- Do we even want an intermediate data format?
Let's use this issue to discuss this topic.
Resources:
- https://www.databricks.com/glossary/what-is-parquet#:~:text=What%20is%20Parquet%3F,handle%20complex%20data%20in%20bulk.
- https://arrow.apache.org/
For datasets that need processing (filtering, dedup, and near dedup) before converting to lm_dataformat, using Arrow or a similar intermediate format like Parquet would make the pipeline faster and more efficient. Are there any datasets that do not need to be processed/filtered?
Datasets that come already prepared from a paper or are available on Hugging Face probably wouldn't need processing.
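As a rough sketch of the kind of processing step mentioned above, filtering and exact dedup over a Parquet intermediate could look like this with pyarrow and pandas. The file paths, column names (`url`, `text`), and the length threshold are all made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Load only the columns we need from the intermediate Parquet file
# (paths and column names are hypothetical).
table = pq.read_table("scraped/site_raw.parquet", columns=["url", "text"])
df = table.to_pandas()

# Simple filtering step: drop very short documents.
df = df[df["text"].str.len() >= 200]

# Exact dedup on the document text.
df = df.drop_duplicates(subset="text")

# Write the cleaned data back to Parquet for the next processing step.
pq.write_table(pa.Table.from_pandas(df, preserve_index=False),
               "scraped/site_dedup.parquet", compression="zstd")
```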
Parquet through pyarrow would be quite good. I want to use it between processing steps as well anyway.
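One way this could look between steps is to stream the previous stage's Parquet output in batches and append the transformed batches to the next stage's file, so a large dataset never has to sit fully in memory. The stage file names below are placeholders, not an agreed layout:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read the previous stage's output and reuse its schema for the next stage.
source = pq.ParquetFile("stage1.parquet")
writer = pq.ParquetWriter("stage2.parquet", source.schema_arrow, compression="zstd")

# Process the data batch by batch instead of materialising it all at once.
for batch in source.iter_batches(batch_size=10_000):
    table = pa.Table.from_batches([batch])
    # ... apply whatever per-batch cleaning this stage does ...
    writer.write_table(table)

writer.close()
```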
Converting the XML files and storing them in Parquet would be pretty useful and efficient. Parquet offers much better data compression and stores data in columnar format, so it's faster to load as pandas objects and to run computations, joins, etc. on top of it.
As we iterate multiple times on different post-processing possibilities, being able to load the data quickly would be useful.
Example difference in sizes for a site

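A minimal sketch of the XML-to-Parquet conversion described above. The `<doc>`/`<title>`/`<body>` layout, file names, and column names are assumptions for illustration, not the project's actual schema:

```python
import xml.etree.ElementTree as ET
import pyarrow as pa
import pyarrow.parquet as pq

# Parse a scraped XML dump into plain Python lists, one entry per document.
root = ET.parse("site_dump.xml").getroot()
titles = [doc.findtext("title", default="") for doc in root.iter("doc")]
bodies = [doc.findtext("body", default="") for doc in root.iter("doc")]

# Store the documents as a compressed, columnar Parquet file.
table = pa.table({"title": titles, "text": bodies})
pq.write_table(table, "site_dump.parquet", compression="zstd")

# Later iterations can load just the columns they need, straight into pandas.
df = pq.read_table("site_dump.parquet", columns=["text"]).to_pandas()
```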
The JSON format, other than being readable if one wants to inspect files manually, offers no advantage.
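So JSON would only appear at the very end, when the cleaned Parquet data is handed to lm_dataformat. A sketch of that last step, assuming a final Parquet file with `text` and `url` columns (both names hypothetical) and the `Archive` writer from lm_dataformat:

```python
import pyarrow.parquet as pq
from lm_dataformat import Archive

# Read the final, cleaned intermediate data.
table = pq.read_table("site_dedup.parquet", columns=["text", "url"])

# Write it out as lm_dataformat's compressed jsonl archives.
ar = Archive("lm_output")
for text, url in zip(table["text"].to_pylist(), table["url"].to_pylist()):
    ar.add_data(text, meta={"url": url})
ar.commit()
```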