CDX-Writer
Python script to create CDX index files of WARC data
Removed video metadata as option. Will now process automatically.
There are currently 71 tasks that have timed out (or at least did not finish within ~76k seconds) due to the large number of URIs in the megawarc. These can be found in the tinypic collection from...
If you instantiate a [`CDX_Writer` with a `file` argument](https://github.com/internetarchive/CDX-Writer/blob/77c3539c59a2d55c31de600e7ca6a9eee67a4977/cdx_writer.py#L729) that contains a relative or absolute path, and you use the defaults of `use_full_path=False` and `file_prefix=None`, the `g` field is written...
To deal with large indexes (where sorted cdx data cannot all fit in core), an idx file is produced, containing one line for each chunk of (typically 3000) cdx lines....
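The chunking idea described above can be sketched roughly as follows. This is only an illustration, not the format CDX-Writer actually emits: the function name `build_idx`, the `(first_key, start_line, line_count)` entry layout, and the assumption that the first two space-separated fields of a CDX line form its sort key are all hypothetical.

```python
from itertools import islice


def build_idx(cdx_lines, chunk_size=3000):
    """Yield one summary entry per chunk of already-sorted CDX lines.

    Hypothetical entry layout (an assumption, not the real idx format):
    (first_key, start_line, line_count), where first_key is the first
    two space-separated fields of the chunk's first line.
    """
    it = iter(cdx_lines)
    start = 0
    while True:
        # Pull at most chunk_size lines without loading the whole input.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        first_key = " ".join(chunk[0].split(" ", 2)[:2])
        yield first_key, start, len(chunk)
        start += len(chunk)
```

Because the input is consumed lazily, only one chunk is held in memory at a time. A reader could then binary-search the much smaller list of entries to find the chunk that may contain a given key.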
Also reported here: https://bitbucket.org/rajbot/warc-tools/issue/1. I'm creating this issue on GitHub so that others know about it as well. > According to the WARC ISO 28500 Version 1...
The zstd extra depends on a non-public package:
```
# requires a version locally patched for correct single-frame decompression
'zstandard==0.12.0+ia1'
```
Is it possible to release that package, or provide...