
cdx_writer.py times out when a WARC contains a large number of URIs

Open kiska3 opened this issue 6 years ago • 0 comments

There are currently 71 tasks that have timed out (they did not finish within ~76k seconds) because of the large number of URIs in the megawarc files.

These can be found in the tinypic collection from archiveteam.

Example tasks:

warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830091905_c83d08f5/cdxstats.json'> '/t/_archiveteam_tinypic_20190830091905_c83d08f5/cdx.txt' failed with exit code: 124, but told to continue on...

warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76764 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830120442_36ec361d.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830120442_36ec361d' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830120442_36ec361d/cdxstats.json'> '/t/_archiveteam_tinypic_20190830120442_36ec361d/cdx.txt' failed with exit code: 124, but told to continue on...

And 69 more.
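For anyone triaging the remaining tasks: exit code 124 is what GNU coreutils `timeout` returns when the wrapped command hits its time limit, so these warnings can be filtered mechanically. Below is a minimal sketch that pulls the WARC name, the time limit, and the timed-out flag from a warning line. The regular expression and the `parse_warning` helper are hypothetical and based only on the two sample lines above; other task logs may be laid out differently.

```python
import re

# Layout assumed from the two warning lines quoted in this issue only.
WARNING_RE = re.compile(
    r"timeout (?P<limit>\d+) \S*cdx_writer\.pex '(?P<warc>[^']+)'"
    r".*failed with exit code: (?P<code>\d+)"
)

# GNU coreutils `timeout` exits with 124 when the time limit is reached.
TIMEOUT_EXIT_CODE = 124


def parse_warning(line):
    """Return (warc_name, limit_seconds, timed_out), or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return (
        match.group("warc"),
        int(match.group("limit")),
        int(match.group("code")) == TIMEOUT_EXIT_CODE,
    )


# Shortened copy of the first warning above (some flags dropped for brevity).
sample = (
    "warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 "
    "/petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' "
    "--file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' "
    "failed with exit code: 124, but told to continue on..."
)
print(parse_warning(sample))
```

Running this over the full task logs would give a list of affected megawarcs and the limit each one exceeded, which may help when choosing which files to retry.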

kiska3, Nov 27 '19 19:11