
About downloading a small portion of CC

Open newbietuan opened this issue 2 years ago • 2 comments

Hello there, thanks for your good work. I want to download a small portion of CC (to run through the whole process first). Currently I run:

python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1

Is it enough to add the argument --num_segments_per_shard 2 and scale down some of the numbers, like this?

python -m cc_net --dump 2023-06 --task_parallelism 10 --num_shards 10 -l en --mine_num_processes 10 --hash_in_mem 1 --num_segments_per_shard 2

Also, do other arguments such as target_size='4G' have any influence? More generally, how should I set these arguments: which ones should I modify, and what values are appropriate? Thanks a lot!

newbietuan avatar May 12 '23 06:05 newbietuan

You can refer to python -m cc_net --help and RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py for all arguments. You can also run python -m cc_net --config config/test_segment.json to read the arguments from a JSON file. An example config for testing the whole process:

{
  "dump": "2023-06",
  "output_dir": "test_data",
  "num_shards": 5,
  "num_segments_per_shard": 3,
  "hash_in_mem": 1,
  "mine_num_processes": 5,
  "task_parallelism": 5
}

ladit avatar May 12 '23 09:05 ladit
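To make the suggested test run concrete, the config can be written to disk and then passed with --config. A minimal sketch in Python: the keys are copied from the JSON in the reply above, and the path config/test_segment.json matches the command mentioned there (relative to the cc_net working directory).

```python
# Write the example test config from the reply to disk, then launch
# cc_net with it. Keys are copied verbatim from the JSON above.
import json
from pathlib import Path

test_config = {
    "dump": "2023-06",
    "output_dir": "test_data",
    "num_shards": 5,
    "num_segments_per_shard": 3,
    "hash_in_mem": 1,
    "mine_num_processes": 5,
    "task_parallelism": 5,
}

cfg_path = Path("config/test_segment.json")
cfg_path.parent.mkdir(parents=True, exist_ok=True)
cfg_path.write_text(json.dumps(test_config, indent=2))

# Then run, from the cc_net directory:
#   python -m cc_net --config config/test_segment.json
```

Shrinking num_shards and num_segments_per_shard is what bounds the download size here; the remaining keys only control parallelism and memory use.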

Thank you very much, I will try it. I ran the command:

(demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6 --hash_in_mem 1 --num_segments_per_shard 2

After it finished downloading the 6 xxxx.bin files, I got this message:

Traceback (most recent call last):
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
    jsonql.run_pipes(
  File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
    multiprocessing.Pool(
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard..'

I found a suggested fix on the web: globals()['my_local_function'] = my_local_function. However, I don't know whether it is right, or how I should apply it here.

newbietuan avatar May 13 '23 03:05 newbietuan
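For context on the AttributeError in the traceback: it is the standard multiprocessing pickling limitation. Under the "spawn" start method (the default on macOS since Python 3.8), everything handed to a Pool worker is serialized with pickle, and pickle serializes functions by looking them up via their module-level qualified name. A function defined inside another function (like the local objects created in _mine_shard) has "<locals>" in its __qualname__, so the lookup fails. A minimal sketch, assuming nothing about cc_net internals beyond what the traceback shows:

```python
# Demonstrate why a spawn-based Pool rejects nested functions:
# pickle can serialize a module-level function by reference, but a
# function defined inside another function cannot be looked up by name.
import pickle

def top_level(x):
    # Module-level: picklable, safe to pass to a "spawn" Pool.
    return x * 2

def make_nested():
    def nested(x):
        # Nested: __qualname__ is "make_nested.<locals>.nested",
        # so pickling raises AttributeError ("Can't pickle local object").
        return x * 2
    return nested

# Round-tripping the module-level function works.
restored = pickle.loads(pickle.dumps(top_level))
assert restored(3) == 6

# The nested function does not survive pickling.
try:
    pickle.dumps(make_nested())
    picklable = True
except AttributeError:
    picklable = False
assert not picklable
print("nested function picklable:", picklable)
```

The globals()['my_local_function'] = my_local_function trick found on the web usually does not help on its own, because pickle resolves functions through __qualname__, which still contains "<locals>" for a nested function. More robust options are to move the function to module level, or to force the "fork" start method (e.g. multiprocessing.set_start_method("fork") early in the program, which is how this code behaves on Linux); whether "fork" is safe on macOS depends on the other libraries in use, so treat that as a workaround rather than a guaranteed fix.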