cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
- add `as_index=False` if a `groupby(level=0)` call would duplicate a column already used as index
- drop columns used as index when calling `set_index(...)`
- required when upgrading from Pandas...
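A minimal sketch of the index/column duplication that these changes guard against. The data and the column names (`crawl`, `hosts`) are hypothetical and not taken from the project. The example groups by a column label rather than reproducing the exact `groupby(level=0)` call from the notes.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "crawl": ["CC-MAIN-2025-26", "CC-MAIN-2025-26", "CC-MAIN-2025-30"],
    "hosts": [10, 5, 7],
})

# drop=False keeps "crawl" both as the index and as a regular column.
duplicated = df.set_index("crawl", drop=False)
try:
    duplicated.reset_index()
    reset_failed = False
except ValueError:
    # pandas refuses to insert a column that already exists
    reset_failed = True

# drop=True (the default) removes the column once it becomes the index,
# so grouping by the index level is unambiguous.
indexed = df.set_index("crawl", drop=True)
per_crawl = indexed.groupby(level=0).sum()

# as_index=False keeps the group key as a regular column instead of the index.
flat = df.groupby("crawl", as_index=False).sum()
```

Here `per_crawl` is indexed by `crawl`, while `flat` carries `crawl` as an ordinary column with a default integer index.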
ggplot2 requires R, although everything else is written in Python. To remove the R dependency, ggplot2 should be replaced by matplotlib while maintaining the general look-and-feel of the...
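One possible starting point, offered as a sketch and not as the project's plan: matplotlib ships a built-in `ggplot` style sheet that approximates the ggplot2 appearance. The data, labels, and title below are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only.
crawls = ["crawl A", "crawl B", "crawl C"]
page_counts = [2.1, 2.4, 2.3]

# Apply the built-in "ggplot" style sheet only within this block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(crawls, page_counts, marker="o")
    ax.set_xlabel("crawl")
    ax.set_ylabel("pages (hypothetical units)")
    ax.set_title("Example plot with the ggplot style sheet")
    fig.tight_layout()

    # Write the PNG into memory; pass a file path instead to save to disk.
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
```

The style sheet only covers colors, grid, and background. Matching the layout of existing plots more closely would still need manual adjustment.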
The current process is error-prone because it requires cleaning the cache (`rm -r stats/excerpt`) if not all previous months are present in the cache directory. This could be...
cc-crawl-statistics sometimes reports a host count that is one more than the actual number. This behavior is sporadic and doesn't always happen. Example: In `domains-top-500.csv` for `CC-MAIN-2025-30`: | domain | actual host...