
EOFError: Compressed file ended before the end-of-stream marker was reached

Open kimcando opened this issue 2 years ago • 2 comments

Hi, thank you in advance. I am facing the following error while running the same command for processing Common Crawl as in the README: `python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1`. The error seems to be caused by a file downloaded over a bad connection. As I understand it, the code processes each file remotely, so it needs to keep a connection open to a single WET .gz file (is that right?). However, the network connection to the Common Crawl S3 bucket seems to be unstable these days. So if my suspicion is correct and this is due to bad network conditions, there seems to be nothing more I can do. Or is there anything I am missing?

Also, the process is killed right away, before finishing the whole job, when it hits this error. I am thinking of editing the code to catch the exception, so that the process gives up on the gz file with the bad connection and continues with the next one. Do you think that is a viable idea?
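For illustration, catching the exception around each segment could look something like the sketch below. The `iter_docs`, `open_segment`, and `parse` names here are hypothetical stand-ins for the iteration in cc_net's `process_wet_file.py` (whose exact signatures may differ); the demo uses in-memory gzip data, with one stream deliberately cut short to reproduce the EOFError:

```python
import gzip
import io

def iter_docs(segments, open_segment, parse):
    """Yield parsed docs, skipping any segment whose gzip stream is truncated."""
    for segment in segments:
        try:
            yield from parse(open_segment(segment))
        except EOFError:
            # Truncated download: warn and move on instead of killing the whole job.
            print("skipping truncated segment")

# Demo with in-memory "segments": one complete gzip stream, one cut short.
good = gzip.compress(b"doc1\ndoc2\n")
bad = good[: len(good) // 2]  # ends before the end-of-stream marker

def open_segment(data):
    return gzip.GzipFile(fileobj=io.BytesIO(data))

def parse(fh):
    for line in fh:
        yield line.decode().strip()

docs = list(iter_docs([good, bad], open_segment, parse))
print(docs)
```

The try/except also catches an EOFError raised mid-iteration, since the exception propagates out of the inner generator; anything already yielded from the bad segment is kept.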

Thank you!

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93707 (12 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
    jsonql.run_pipes(
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93259 (13 / 5000): Job (task=0) failed during processing with trace:
----------------------
[same traceback as above, ending in EOFError: Compressed file ended before the end-of-stream marker was reached]

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 75105 (14 / 5000): Job (task=0) failed during processing with trace:
----------------------
[same traceback as above, ending in EOFError: Compressed file ended before the end-of-stream marker was reached]

kimcando avatar May 01 '23 08:05 kimcando

I also ran into this problem. My Python version is 3.7.11. I've changed my network several times, but it still fails.

starlitsky2010 avatar May 02 '23 07:05 starlitsky2010

Hi @kimcando, you are right: the error you observe is most likely due to an interrupted network connection. I have seen in the past that the connection to the Common Crawl buckets is somewhat shaky. Usually this resolves itself within a few days, though.

You can try to catch the gzip exception and continue with the next file. But if it is an overload issue on the S3 side, then the other files will likely be skipped too.
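Since the failures are transient, retrying a segment a few times with backoff before giving up may recover more data than skipping outright. A minimal, hypothetical helper (not part of cc_net) could look like this; the `flaky` function simulates a download that fails twice and then succeeds:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, exceptions=(EOFError, OSError)):
    """Call fn(), retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide to skip or fail
            time.sleep(base_delay * (2 ** attempt))

# Example: a "download" that succeeds only on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EOFError("Compressed file ended before the end-of-stream marker")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=0)
print(result)  # → ok
```

In the pipeline, the wrapped call would be the full open-and-parse of one segment, since a truncated gzip stream cannot be resumed mid-read and must be fetched again from the start.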

mauriceweber avatar May 02 '23 13:05 mauriceweber