EOFError: Compressed file ended before the end-of-stream marker was reached
Hi, thank you in advance.
I am facing the following error while using the same command for processing Common Crawl as given in the README.
python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
The error seems to be caused by a file read over a bad connection. As I understand it, the code processes each file remotely, so it needs to keep a connection open to a single wet .gz file (is that right?). However, the network connection to the Common Crawl S3 bucket seems to be unstable these days. So if my suspicion is correct and this is due to bad network conditions, there seems to be nothing more I can do.. or is there anything I am missing?
Also, the process is killed right away when it hits that error, before finishing the whole job. I'm thinking of editing the code to catch the exception, so that the process gives up on the .gz file with the bad connection but continues to the next one. Do you think that is a viable idea?
Thank you!
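For context, the failure mode can be reproduced with the standard library alone: a gzip stream that gets cut off mid-download raises exactly this EOFError once the reader runs out of data. A minimal sketch, independent of cc_net:

```python
import gzip
import io

# Simulate a WET segment whose download was interrupted: compress some data,
# then keep only the first half of the compressed bytes.
payload = gzip.compress(b"some WET-like content\n" * 1000)
truncated = payload[: len(payload) // 2]

try:
    gzip.GzipFile(fileobj=io.BytesIO(truncated)).read()
except EOFError as e:
    # "Compressed file ended before the end-of-stream marker was reached"
    print(e)
```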
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93707 (12 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
jsonql.run_pipes(
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
for doc in group_by_docs(lines):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
for warc in warc_lines:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
return self._buffer.read1(size)
File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93259 (13 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
jsonql.run_pipes(
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
for doc in group_by_docs(lines):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
for warc in warc_lines:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
return self._buffer.read1(size)
File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 75105 (14 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
jsonql.run_pipes(
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
for doc in group_by_docs(lines):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
for warc in warc_lines:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
return self._buffer.read1(size)
File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
I also met this problem. My Python version is 3.7.11. I've changed my network many times, but it still fails.
Hi @kimcando , you are right, the error you observe is most likely due to an interrupted network connection. I have seen in the past that the connection to the CC buckets can be somewhat shaky. Usually this resolves itself within a few days though.
You can try to catch the gzip exception and continue with the next one -- but if it is an overload issue on the S3 side, then the other files will likely be skipped too.
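For anyone who wants to try that, here is a minimal sketch of the idea, patched into the `__iter__` shown in the traceback (process_wet_file.py, around line 216). The `self.segments` attribute name and the exact loop body are assumptions inferred from the traceback, so treat this as an illustration rather than a verified fix:

```python
import logging

def __iter__(self):
    for segment in self.segments:  # attribute name assumed from the traceback
        try:
            # Whatever per-document bookkeeping the original loop does should
            # stay inside this try block.
            for doc in parse_warc_file(self.open_segment(segment), self.min_len):
                yield doc
        except EOFError as e:
            # Truncated download: log it and move on to the next .wet.gz
            # segment instead of killing the whole shard.
            logging.warning("Skipping truncated segment %s: %s", segment, e)
            continue
```

Keep in mind that any segment skipped this way is silently dropped from the shard, so it is worth logging which ones failed and re-running them once the connection to the bucket is stable again.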