
CommonCrawl index date range code is broken

Open wumpus opened this issue 3 years ago • 5 comments

cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

The above date range should be empty.
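A minimal sketch of the check this implies, not cdx_toolkit's actual code: pad both partial timestamps to 14 digits and treat the range as empty when the start sorts after the end. The helper names `pad_timestamp` and `range_is_empty` are hypothetical, and the padding is a simplification (it can produce day 31 for short months, which is harmless for string comparison only).

```python
def pad_timestamp(ts: str, fill: str) -> str:
    """Pad a partial timestamp such as '2021' to 14 digits (YYYYMMDDhhmmss)."""
    return ts + fill[len(ts):]


def range_is_empty(from_ts: str, to_ts: str) -> bool:
    """Return True when the earliest possible start is after the latest possible end."""
    start = pad_timestamp(from_ts, "00000101000000")  # earliest moment of from_ts
    end = pad_timestamp(to_ts, "99991231235959")      # latest moment of to_ts
    return start > end


# The range from the command above: --from 2021 --to 2020
print(range_is_empty("2021", "2020"))
```

With a check like this in front of the index selection, an inverted range would select no endpoints instead of fetching from CC-MAIN-2021-04.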

wumpus, Mar 27 '22 20:03

I've recently started using ranges and hit this issue. Is this likely to be picked up in the near future? I've also noticed that the 'closest' argument works okay for commoncrawl and creates a 3 month window, but it does not work for wayback.

Medstaar, Oct 06 '22 13:10

Can you give some examples? The bug I was complaining about shouldn't affect any real usage.

wumpus, Oct 06 '22 22:10

Sorry, I think I might have misunderstood how the ranges work. It looks like if I put from=20220101 it will use the index CC-MAIN-2021-49 (November 2021), and if I put from=20220401 it will use CC-MAIN-2022-05 (January 2022). So it actually uses the closest index that falls before the date provided.
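The behaviour described above can be sketched as a "latest index at or before the date" lookup. This is only an illustration, not how cdx_toolkit is implemented: the function names are hypothetical, the index list is a short assumed subset, and each crawl's start is approximated as the Monday of the ISO week in its name.

```python
from bisect import bisect_right
from datetime import date


def crawl_start(name: str) -> date:
    """Approximate a crawl's start: 'CC-MAIN-2021-49' -> Monday of ISO week 49, 2021."""
    year, week = name.split("-")[2:4]
    return date.fromisocalendar(int(year), int(week), 1)


def index_for_from(names, from_ts):
    """Pick the latest index whose approximate start is on or before from_ts (YYYYMMDD)."""
    ordered = sorted(names, key=crawl_start)
    starts = [crawl_start(n) for n in ordered]
    target = date(int(from_ts[:4]), int(from_ts[4:6]), int(from_ts[6:8]))
    pos = bisect_right(starts, target)  # number of indexes starting on or before target
    return ordered[pos - 1] if pos else None


# Assumed subset of index names, for illustration only.
names = ["CC-MAIN-2021-43", "CC-MAIN-2021-49", "CC-MAIN-2022-05", "CC-MAIN-2022-21"]
print(index_for_from(names, "20220101"))
print(index_for_from(names, "20220401"))
```

Under these assumptions the two lookups reproduce the indexes observed above, CC-MAIN-2021-49 and CC-MAIN-2022-05.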

For wayback, if I use closest=20221007 it seems to extract URLs with a 2019 timestamp. Using from and to works fine with wayback, however.

Medstaar, Oct 07 '22 11:10

OK, so Common Crawl is doing the right thing, and the closest-on-wayback issue is a problem on the Internet Archive side, which is something I can't control.

wumpus, Oct 07 '22 16:10

I don't know precisely what you're trying to explain, but my issue is also related to the index date ranges, though I'm trying to use them programmatically with from_ts. Using it with the iter method isn't working: it doesn't return anything. Calling iter without it works, but I don't need every capture going back a year or whatever the default is.

sgjohnson1981, Mar 11 '24 22:03