
Intermediate Data Storage Format

Open ncoop57 opened this issue 3 years ago • 4 comments

Questions:

  • How do we want to store data in an intermediate format before moving it to the lm_dataformat that uses json lists?
  • Do we even want an intermediate data format?

Let's use this issue to discuss this topic.

Resources:

  • https://www.databricks.com/glossary/what-is-parquet#:~:text=What%20is%20Parquet%3F,handle%20complex%20data%20in%20bulk.
  • https://arrow.apache.org/

ncoop57 avatar Sep 16 '22 20:09 ncoop57

For datasets that need processing, such as filtering, exact dedup, and near dedup before conversion with lm_dataformat, using Arrow or a similar intermediary format like Parquet would make the pipeline faster and more efficient. Are there any datasets that do not need to be processed/filtered?

taisazero avatar Sep 17 '22 03:09 taisazero

Ones that are already released with a paper or on Hugging Face we probably wouldn't need to process.

ncoop57 avatar Sep 17 '22 21:09 ncoop57

Parquet through pyarrow would be quite good. I want to use it between processing steps anyway.

flowpoint avatar Sep 20 '22 20:09 flowpoint

Converting the XML and storing it in Parquet would be pretty useful and efficient. Parquet offers much better data compression and stores data in column format, so it's faster to load as pandas objects and run other computations, joins, etc. on top of it.

As we iterate multiple times over different post-processing possibilities, being able to load data quickly would be useful. [attached image: example difference in sizes for one site]

JSON, other than being readable when one wants to inspect files manually, offers no advantage.

vanga avatar Sep 23 '22 08:09 vanga