cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Results 4 cc-crawl-statistics issues

- add `as_index=False` if a `groupby(level=0)` call would duplicate a column already used as index
- drop columns used as index when calling `set_index(...)`
- required when upgrading from Pandas...
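A minimal sketch of the kind of duplication the second point may refer to. This is an assumption, since the issue text is truncated: the frame, its column names, and the page counts below are made up for illustration and are not taken from the project. When `set_index` is given a Series rather than a column name, pandas keeps the column, so the same label exists both as index and as column, and a later `reset_index()` fails.

```python
import pandas as pd

# Hypothetical data; the page counts are placeholders, not real statistics.
df = pd.DataFrame({
    "crawl": ["CC-MAIN-2025-26", "CC-MAIN-2025-30"],
    "pages": [10, 20],
})

# Passing a Series keeps the "crawl" column next to the new index.
kept = df.set_index(df["crawl"])

# Passing the column name moves the column into the index (drop=True is the default).
dropped = df.set_index("crawl")

# With the column kept, reset_index() cannot re-insert "crawl" as a column.
try:
    kept.reset_index()
    reset_failed = False
except ValueError:
    reset_failed = True

print(list(kept.columns))
print(list(dropped.columns))
print(reset_failed)
```

Dropping the column at the point where the index is set avoids carrying the same values twice through later `groupby` and `reset_index` calls.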

ggplot2 requires R despite everything else being written in Python. To get rid of the R dependency, ggplot2 should be replaced by matplotlib while maintaining the general look-and-feel of the...
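One possible starting point, sketched under assumptions: matplotlib ships a built-in style sheet named `ggplot` that approximates the ggplot2 defaults (grey panel, white grid lines). The function name `plot_series`, its parameters, and the sample values are hypothetical and not part of the project.

```python
import matplotlib

matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt


def plot_series(labels, values, path, ylabel="value"):
    """Draw a line plot with matplotlib's ggplot-like style and save it to `path`."""
    # The style is applied only inside this block, leaving global settings untouched.
    with plt.style.context("ggplot"):
        fig, ax = plt.subplots(figsize=(8, 4))
        ax.plot(range(len(labels)), values, marker="o")
        ax.set_xticks(range(len(labels)))
        ax.set_xticklabels(labels, rotation=45, ha="right")
        ax.set_ylabel(ylabel)
        fig.tight_layout()
        fig.savefig(path)
        plt.close(fig)
    return path


if __name__ == "__main__":
    # Placeholder values for illustration only.
    plot_series(["crawl-a", "crawl-b", "crawl-c"], [1.0, 1.5, 1.2], "example.png")
```

The style sheet covers only the overall appearance; ggplot2 features such as faceting or scale handling would need to be rebuilt separately.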

The current process is error-prone because the cache must be cleaned (`rm -r stats/excerpt`) if not all previous months are present in the cache directory. This could be...

enhancement

cc-crawl-statistics can sometimes report a host count that is one higher than the actual number. This behavior is sporadic and doesn't always happen. Example: In `domains-top-500.csv` for `CC-MAIN-2025-30`: | domain | actual host...