how much disk space will be used?

newbietuan opened this issue 2 years ago • 3 comments

Hello there. I want to get the zh data from one dump. How much disk space will be occupied during the download and processing, and how large will the final data be?

newbietuan avatar May 19 '23 03:05 newbietuan

Hi @newbietuan -- the ccnet pipeline processes the warc files on the fly, so you won't need to store an entire cc dump on disk. I cannot say how much space the minified zh output will be, but as a guideline: for en, the output of the mined 2023-06 cc dump is around 800G.

I hope this helps!
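
For a very rough sense of scale, one could extrapolate the zh output size from the en figure above. The sketch below is only an illustration: the language shares are assumptions, not numbers from cc_net or Common Crawl documentation, and the real ratio also depends on filtering.

    # Back-of-envelope estimate of the mined zh output size, scaled from the
    # ~800G quoted above for en on the 2023-06 dump.
    EN_OUTPUT_GB = 800   # from the comment above (en, 2023-06 dump)
    EN_SHARE = 0.45      # assumed share of English content in Common Crawl
    ZH_SHARE = 0.05      # assumed share of Chinese content in Common Crawl

    zh_output_gb = EN_OUTPUT_GB * ZH_SHARE / EN_SHARE
    print(f"rough zh output estimate: ~{zh_output_gb:.0f} GB")  # ~89 GB with these assumptions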

mauriceweber avatar May 22 '23 13:05 mauriceweber

Thank you very much, @mauriceweber.

Sorry for the late reply.

During the pipeline run, is the wet_cache deleted automatically? When I ran a test, it did not seem to be deleted. So if 800G is the final output, how much disk space is needed for the whole run? Does it depend on the snapshot, perhaps around 60-100T? Could you also share your config.json and the machine configuration (memory, CPU, disk, runtime, etc.)? I have no idea what configuration I should plan for in order to get the data.
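
If the cache is indeed not removed automatically, a simple manual workaround (not part of cc_net itself) is to delete the cache directory after the run. The path below assumes a cache_dir of zh_data/wet_cache; adjust it to your own config.

    # Minimal manual cleanup of the WET cache after a run (a sketch, not cc_net behavior).
    import shutil
    from pathlib import Path

    cache_dir = Path("zh_data/wet_cache")  # assumed cache_dir from the config used in this thread
    if cache_dir.exists():
        shutil.rmtree(cache_dir)           # remove the downloaded .warc.wet.gz files
        print(f"removed {cache_dir}")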

newbietuan avatar Jun 08 '23 02:06 newbietuan

Hi @mauriceweber, when I run python -m cc_net --config config/my_config.json with the following config:

    {
        "hash_in_mem": 50,
        "dump": "2023-06",
        "num_shards": 8,
        "task_parallelism": 48,
        "num_segments_per_shard": -1,
        "mine_num_processes": 48,
        "cleanup_after_regroup": "True",
        "lang_whitelist": ["zh"],
        "pipeline": ["dedup", "lid", "keep_lang", "sp", "lm", "pp_bucket", "minify", "split_by_segment"],
        "execution": "debug",
        "output_dir": "zh_data",
        "mined_dir": "zh_mined_by_segment",
        "target_size": "1GB",
        "cache_dir": "zh_data/wet_cache"
    }

the first shard contains 11000 .warc.wet.gz files. The download speed is about 12 MB/s, so it seems downloading all 8 shards will take about 800 hours. I also noticed the cleanup_after_regroup parameter: in my test, after the output files were copied from zh_mined_by_segment_split to zh_mined_by_segment, nothing was deleted. Does this have something to do with the value of target_size? Does target_size refer to the size of the final json files?
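
For reference, a back-of-envelope version of this download-time estimate. The average per-file size is an assumption (not measured from the dump), and the result scales linearly with it:

    # Rough download-time estimate for the numbers mentioned above.
    num_shards = 8
    files_per_shard = 11_000   # observed for the first shard
    avg_file_mb = 150          # assumed average size of a .warc.wet.gz file
    bandwidth_mb_s = 12        # observed download speed

    total_mb = num_shards * files_per_shard * avg_file_mb
    hours = total_mb / bandwidth_mb_s / 3600
    print(f"~{hours:.0f} hours to download everything")  # ~306 hours with these assumptions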

I am also a bit confused about the task_parallelism and mine_num_processes parameters: after the first shard is downloaded, can the downloading and processing of subsequent shards run in parallel?

Right now I have a machine with 64 CPUs, 512 GB of RAM, and 1 TB of disk, and a download speed of about 12 MB/s. Is this configuration enough to complete the processing of a snapshot?

newbietuan avatar Jun 09 '23 02:06 newbietuan