
About building *.bin and *.idx

Open Yijia-Xiao opened this issue 4 years ago • 5 comments

Hi, thank you for your great work! I've been using Megatron-LM for some time, and I've run into a problem building a large dataset. I used preprocess_data.py to convert a JSONL file (about 1 TB) into *.bin and *.idx files; the server has 504 GB of memory. Unfortunately, when the *.bin file grows to about 600 GB, the process appears to die. Is there a solution for large corpora, or would the lazy loader work?

Thank you:)

Yijia-Xiao avatar Oct 29 '21 02:10 Yijia-Xiao

Hello, has this issue been resolved?

Ant0082 avatar Mar 07 '23 03:03 Ant0082

This issue will be addressed in the next few days by an update to preprocess_data.py that allows a large dataset to be processed in multiple partitions, thereby avoiding OOM errors. I'll update this issue when the update lands.
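
For readers who need a workaround in the meantime, the partitioning idea can be sketched roughly as follows. This is only an illustration and not the actual update to preprocess_data.py: the function name `split_jsonl` and the `part_NNNN.jsonl` naming are hypothetical. It streams the input line by line, so memory use stays small regardless of corpus size, and each resulting shard could then be preprocessed on its own.

```python
import os
from contextlib import ExitStack


def split_jsonl(input_path, output_dir, num_partitions):
    """Stream a JSONL file into num_partitions shard files, round-robin.

    Hypothetical helper for illustration; it is not part of Megatron-LM.
    Only one line is held in memory at a time.
    """
    os.makedirs(output_dir, exist_ok=True)
    # Hypothetical shard naming: part_0000.jsonl, part_0001.jsonl, ...
    paths = [
        os.path.join(output_dir, f"part_{i:04d}.jsonl")
        for i in range(num_partitions)
    ]
    with ExitStack() as stack:
        src = stack.enter_context(open(input_path, "r", encoding="utf-8"))
        outs = [
            stack.enter_context(open(p, "w", encoding="utf-8")) for p in paths
        ]
        # Line k of the input goes to shard k mod num_partitions.
        for line_no, line in enumerate(src):
            outs[line_no % num_partitions].write(line)
    return paths
```

Round-robin assignment keeps the shards roughly equal in size without needing to know the total line count up front. How the per-shard *.bin and *.idx outputs are combined afterwards is not covered by this sketch.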

jon-barker avatar Jun 29 '23 22:06 jon-barker

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Aug 29 '23 18:08 github-actions[bot]

What percentage of the project has been completed so far?

felipeliliti avatar May 10 '24 22:05 felipeliliti

You can go ahead and finish it; it's authorized.

felipeliliti avatar May 11 '24 00:05 felipeliliti

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 10 '24 18:07 github-actions[bot]