wget-lua

Fix segfaults during CDX read

Open the-blank-x opened this issue 2 years ago • 4 comments

Fixes #23

the-blank-x avatar Jan 26 '24 22:01 the-blank-x

This writes invalid WARC-Refers-To-Date headers--oops.

This happens because the code now reads the timestamps from the CDX file and writes them into the WARC-Refers-To-Date header without rewriting them to conform to the WARC specification, which says the value is "a UTC timestamp formatted according to W3CDTF", i.e. in the form YYYY-MM-DDThh:mm:ssZ. The CDX timestamps appear to be formatted as YYYYMMDDhhmmss, but that format is not defined in the legend, nor in the specifications from 2006 and 2015.

I can have the code rewrite the CDX timestamp into a W3CDTF UTC timestamp, but I am slightly hesitant to do so (unless given the okay), since I don't know whether all properly formed CDX files have their timestamps in the form YYYYMMDDhhmmss.
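For illustration, the rewrite described above could look roughly like the sketch below. It assumes the CDX timestamp really is exactly 14 ASCII digits (YYYYMMDDhhmmss) and returns NULL for anything else, so unexpected formats are rejected rather than written out as an invalid header. The name cdx14_to_w3cdtf is hypothetical and is not the function from this PR; plain malloc/free stand in for wget's own allocation helpers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the PR's actual code): converts a CDX timestamp
   of the assumed form "YYYYMMDDhhmmss" into a newly allocated
   "YYYY-MM-DDThh:mm:ssZ" string.  Returns NULL if the input is not exactly
   14 ASCII digits or if allocation fails.  Field ranges (e.g. month <= 12)
   are NOT checked.  The caller owns the result and must free it.  */
static char *
cdx14_to_w3cdtf (const char *cdx_ts)
{
  enum { CDX_LEN = 14, WARC_LEN = 20 };
  char *out;
  int written;
  size_t i;

  if (cdx_ts == NULL || strlen (cdx_ts) != CDX_LEN)
    return NULL;

  for (i = 0; i < CDX_LEN; i++)
    if (!isdigit ((unsigned char) cdx_ts[i]))
      return NULL;

  out = malloc (WARC_LEN + 1);
  if (out == NULL)
    return NULL;

  /* "%.Ns" copies at most N characters from each field's start offset.  */
  written = snprintf (out, WARC_LEN + 1, "%.4s-%.2s-%.2sT%.2s:%.2s:%.2sZ",
                      cdx_ts, cdx_ts + 4, cdx_ts + 6,
                      cdx_ts + 8, cdx_ts + 10, cdx_ts + 12);
  assert (written == WARC_LEN);
  (void) written;               /* silence unused warning under NDEBUG */

  return out;
}
```

Returning NULL on malformed input leaves the decision to the caller, for example to skip the WARC-Refers-To-Date header for that record instead of emitting a value that violates the WARC specification.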

the-blank-x avatar Jan 27 '24 01:01 the-blank-x

The invalid WARC-Refers-To-Date headers issue should be fixed now

the-blank-x avatar Jan 29 '24 00:01 the-blank-x

Thank you! It looks like warc_date = cdx_to_warc_timestamp(date); is not xfreed after use.
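A minimal sketch of the ownership pattern in question, assuming the converter returns a heap-allocated string that the caller must release. The real signature of cdx_to_warc_timestamp is not shown in this thread, so convert_timestamp_stub and format_refers_to_date below are hypothetical stand-ins, and plain free is used where wget code would use xfree.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the PR's converter (hypothetical): assumed to return a
   heap-allocated string owned by the caller.  Here it only copies.  */
static char *
convert_timestamp_stub (const char *date)
{
  size_t len = strlen (date) + 1;
  char *copy = malloc (len);
  if (copy != NULL)
    memcpy (copy, date, len);
  return copy;
}

/* Formats the header line into buf and releases the converted timestamp
   before returning.  Returns 0 on success, -1 on allocation failure or
   if the line does not fit in buf.  */
static int
format_refers_to_date (const char *date, char *buf, size_t bufsize)
{
  char *warc_date;
  int n;

  assert (date != NULL && buf != NULL);

  warc_date = convert_timestamp_stub (date);
  if (warc_date == NULL)
    return -1;

  n = snprintf (buf, bufsize, "WARC-Refers-To-Date: %s\r\n", warc_date);
  free (warc_date);             /* the step reported missing; xfree in wget */

  return (n < 0 || (size_t) n >= bufsize) ? -1 : 0;
}
```

Freeing immediately after the last use, before any return, keeps every exit path leak-free without needing a cleanup label.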

I will need to check this PR more closely before merging it.

Arkiver2 avatar Jan 29 '24 08:01 Arkiver2

Oopsie

(and you're welcome!)

the-blank-x avatar Jan 29 '24 10:01 the-blank-x