CommonCrawl index date range code is broken
cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index
The above date range should be empty.
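For illustration, here is a minimal sketch of the check that would make such a range come back empty. It is not cdx_toolkit's actual code: `pad_from`, `pad_to`, and `range_is_empty` are hypothetical helpers, and the sketch assumes timestamps are digit strings of even length (4 to 14 digits) in `YYYYMMDDhhmmss` order.

```python
import calendar


def pad_from(ts: str) -> str:
    """Pad a partial timestamp to 14 digits at the START of its period."""
    # Hypothetical helper: '2021' -> '20210101000000'
    return ts + "00000101000000"[len(ts):]


def pad_to(ts: str) -> str:
    """Pad a partial timestamp to 14 digits at the END of its period."""
    # Hypothetical helper: '2020' -> '20201231235959'
    if len(ts) == 4:
        ts += "12"  # last month of the year
    if len(ts) == 6:
        last_day = calendar.monthrange(int(ts[:4]), int(ts[4:6]))[1]
        ts += f"{last_day:02d}"  # last day of that month
    return ts + "00000000235959"[len(ts):]


def range_is_empty(from_ts: str, to: str) -> bool:
    """True when the padded start lies after the padded end."""
    # Fixed-width digit strings compare correctly as plain strings.
    return pad_from(from_ts) > pad_to(to)


print(range_is_empty("2021", "2020"))  # the range from the command above
print(range_is_empty("2020", "2021"))  # the same range the right way round
```

With a check like this in front of the index selection, `--from 2021 --to 2020` would be rejected before any index is queried.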
I've recently started using ranges and hit this issue. Is this likely to be picked up in the near future? I've also noticed that the 'closest' argument works okay for Common Crawl, where it creates a 3-month window, but does not work for wayback.
Can you give some examples? The bug I was complaining about shouldn't affect any real usage.
Sorry, I think I might have misunderstood how the ranges work. It looks like if I put from=20220101 it will use the index CC-MAIN-2021-49 (November 2021), and if I put from=20220401 it will use CC-MAIN-2022-05 (January 2022). So it appears to use the closest index dated before the date provided.
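The behaviour described above (pick the newest index dated at or before the requested date) can be sketched as follows. This only illustrates the selection rule and is not how cdx_toolkit implements it: `index_start` and `closest_index_at_or_before` are hypothetical names, and dating an index by the Monday of the ISO week in its `CC-MAIN-YYYY-WW` name is an assumption.

```python
from bisect import bisect_right
from datetime import date


def index_start(name: str) -> date:
    """Approximate an index date from a 'CC-MAIN-YYYY-WW' name."""
    # Assumption: WW is an ISO week number; use that week's Monday.
    _, _, year, week = name.split("-")
    return date.fromisocalendar(int(year), int(week), 1)


def closest_index_at_or_before(names, when):
    """Return the newest index dated at or before `when`, or None."""
    ordered = sorted(names, key=index_start)
    starts = [index_start(n) for n in ordered]
    pos = bisect_right(starts, when)  # count of indexes dated <= when
    return ordered[pos - 1] if pos else None


# Index names taken from this thread.
indexes = [
    "CC-MAIN-2020-50",
    "CC-MAIN-2021-04",
    "CC-MAIN-2021-49",
    "CC-MAIN-2022-05",
]

print(closest_index_at_or_before(indexes, date(2022, 1, 1)))  # CC-MAIN-2021-49
print(closest_index_at_or_before(indexes, date(2022, 4, 1)))  # CC-MAIN-2022-05
```

Under these assumptions the sketch reproduces both observations: from=20220101 lands on CC-MAIN-2021-49 and from=20220401 lands on CC-MAIN-2022-05.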
For wayback, if I use closest=20221007 it seems to return URLs with a 2019 timestamp. Using from and to is okay with wayback, however.
OK, so Common Crawl is doing the right thing, and the closest issue on wayback is a problem on the Internet Archive side, which is something I can't control.
I don't know precisely what you're trying to explain, but my issue is also related to the index date ranges, though I'm trying to use them programmatically with from_ts. Using it with the iter method isn't working: it doesn't return anything. Calling iter without it works, but I don't need every capture going back a year or whatever the default is.