cdx_writer.py times out when a WARC contains a large number of URIs
We currently have 71 tasks that have timed out (or at least did not finish within ~76k seconds) due to the large number of URIs in each megawarc.
These can be found in the tinypic collection from archiveteam.
Example tasks:
warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830091905_c83d08f5/cdxstats.json'> '/t/_archiveteam_tinypic_20190830091905_c83d08f5/cdx.txt' failed with exit code: 124, but told to continue on...
warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76764 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830120442_36ec361d.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830120442_36ec361d' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830120442_36ec361d/cdxstats.json'> '/t/_archiveteam_tinypic_20190830120442_36ec361d/cdx.txt' failed with exit code: 124, but told to continue on...
And 69 more
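
To help collect the full list of affected files, here is a minimal Python sketch that pulls the WARC name and timeout limit out of warning lines like the two above. It assumes the warnings keep exactly the format quoted here. The names `WARNING_RE` and `timed_out_warcs` are made up for illustration and are not part of cdx_writer or petabox. It relies on GNU `timeout` exiting with status 124 when the time limit is hit.

```python
import re

# Assumption: warnings follow the format quoted in this issue; other formats will not match.
WARNING_RE = re.compile(
    r"timeout (?P<limit>\d+) \S*cdx_writer\.pex '(?P<warc>[^']+)'"
    r".*failed with exit code: (?P<code>\d+)"
)


def timed_out_warcs(lines):
    """Return (warc_name, limit_seconds) for each warning whose exit code is 124."""
    results = []
    for line in lines:
        match = WARNING_RE.search(line)
        # GNU timeout exits with status 124 when the command exceeds its limit.
        if match and match.group("code") == "124":
            results.append((match.group("warc"), int(match.group("limit"))))
    return results
```

Feeding it the task log lines (for example, `timed_out_warcs(open(path))` with a hypothetical `path`) should give one tuple per timed-out WARC, while warnings with any other exit code are skipped.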