About downloading a small portion of CC
Hello there, thanks for your good work. I want to download a small portion of CC (to run through the whole process first). Starting from the command 'python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1', is it enough to add the argument --num_segments_per_shard 2 and scale some numbers down, like 'python -m cc_net --dump 2023-06 --task_parallelism 10 --num_shards 10 -l en --mine_num_processes 10 --hash_in_mem 1 --num_segments_per_shard 2'? Also, do other arguments like target_size='4G' have any influence? How should I set these arguments, i.e. which ones should I modify and what values are appropriate? Thanks a lot!
You can refer to 'python -m cc_net --help' and RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py for the full list of arguments.
You can also run 'python -m cc_net --config config/test_segment.json' to read the arguments from a JSON file.
An example config for testing the whole process:
{
    "dump": "2023-06",
    "output_dir": "test_data",
    "num_shards": 5,
    "num_segments_per_shard": 3,
    "hash_in_mem": 1,
    "mine_num_processes": 5,
    "task_parallelism": 5
}
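Your command line also passes -l en; if you want to keep that language filter in the config file, the corresponding field should be "lang_whitelist": ["en"] (my assumption based on the Config dataclass in mine.py; confirm the exact name with 'python -m cc_net --help'). Fields you leave out of the JSON, such as target_size, keep their defaults from mine.py, which are fine for a small test run.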
Thank you very much. I will try it.
When I run the command (demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6 --hash_in_mem 1 --num_segments_per_shard 2
and it finishes downloading the 6 xxxx.bin files, I get the following message:
Traceback (most recent call last):
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
    jsonql.run_pipes(
  File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
    multiprocessing.Pool(
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard.
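This AttributeError comes from Python's multiprocessing rather than from cc_net's logic: since Python 3.8, macOS defaults to the 'spawn' start method, which pickles everything handed to the worker processes when the Pool starts them, and the pipeline built inside _mine_shard contains locally defined functions, which cannot be pickled. On Linux the default is 'fork', where workers inherit those objects without pickling, so the same code runs there. Below is a minimal standalone sketch of the same failure mode (standard-library code only, not cc_net's actual pipeline):

import multiprocessing

def run(start_method):
    def local_init():  # a local object, analogous to the functions built inside _mine_shard
        pass
    ctx = multiprocessing.get_context(start_method)
    # Pool creation calls w.start() for each worker; under 'spawn' that
    # pickles the Process object, including the unpicklable local_init.
    with ctx.Pool(2, initializer=local_init) as pool:
        return pool.map(abs, [-1, -2])

if __name__ == "__main__":
    print(run("fork"))   # works: forked workers inherit local_init without pickling
    try:
        run("spawn")     # fails at w.start(), exactly like the traceback above
    except AttributeError as e:
        print(e)         # Can't pickle local object 'run.<locals>.local_init'

Running the pipeline on a Linux machine may be the simplest fix; calling multiprocessing.set_start_method("fork") early in the process can also work around it on macOS, though 'fork' there has known caveats with some system libraries.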