RedPajama-Data

Inconsistent IDs lead to distributed computing woes.

Open: axelmagn opened this issue 1 year ago • 1 comment

When trying to work with these data via Dataflow, I noticed a few things:

  • The ID field key is inconsistent between files: it is id in minhash and signals, but doc_id in duplicates.
  • IDs are not present as an explicit field in documents. They must be reconstructed from the file path and line number.
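
A minimal sketch of the workaround described above, in Python. The key names id and doc_id come from this issue; everything else is an assumption for illustration: that documents are stored as gzipped JSON Lines, that the reconstructed ID has the form "<relative_path>/<line_number>" with 0-based line numbers, and the helper names normalize_id and read_documents, which are hypothetical.

```python
import gzip
import json
from typing import Iterator

# Key names reported in this issue: "id" (minhash, signals), "doc_id" (duplicates).
ID_KEYS = ("id", "doc_id")


def normalize_id(record: dict) -> dict:
    """Return a copy of `record` with its ID stored under the single key 'doc_id'."""
    out = dict(record)
    for key in ID_KEYS:
        if key in out:
            out["doc_id"] = out.pop(key)
            return out
    raise KeyError(f"no ID field found; expected one of {ID_KEYS}")


def read_documents(path: str, relative_path: str) -> Iterator[dict]:
    """Yield documents from one file, attaching a reconstructed ID to each.

    ASSUMPTION: the file is gzipped JSON Lines and the ID format is
    "<relative_path>/<line_number>" (0-based). Check both against the
    dataset documentation before relying on this.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line_number, line in enumerate(f):
            doc = json.loads(line)
            doc["doc_id"] = f"{relative_path}/{line_number}"
            yield doc
```

Because the line number only exists while a file is read sequentially from the start, a reader like this has to process each file as a single unit instead of letting the pipeline split it across workers.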

This creates a lot of unnecessary friction when working with big data pipelines, since the line number is not usually available to readers that process files in parallel. I'm finding myself writing a custom reader (sort of a bummer if you've ever had to do it).

For future data releases, please consider embedding a consistent key between all file groups for easier joining at scale. Just a UUID would be fine.
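
As an illustration of this request, here is a hedged sketch of how a stable key could be attached to every record when the files are written. It uses Python's standard uuid module; deriving the UUID from the file path and line number with uuid5 is an assumption made here so that the key is reproducible, and the function name stable_doc_uuid is hypothetical. A random uuid4 generated once and stored in all file groups would serve the same purpose.

```python
import uuid


def stable_doc_uuid(relative_path: str, line_number: int) -> str:
    """Derive a deterministic UUID for a document.

    ASSUMPTION: (relative_path, line_number) uniquely identifies a document.
    The same inputs always produce the same UUID, so every file group can
    compute or store an identical key.
    """
    name = f"{relative_path}/{line_number}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


# Usage: embed the key in each record at write time.
record = {"text": "example document"}
record["doc_id"] = stable_doc_uuid("shard/docs.json.gz", 0)
```

With such a field present in every file group, a join needs only the embedded key and no knowledge of file layout or line order.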

axelmagn, Mar 20 '24 19:03

Hi @axelmagn, thanks for your feedback. These are very good points, and this is something we will definitely do in future releases.

mauriceweber, Mar 29 '24 13:03