crawler
Web Scraping Framework
PycURL 7.43.0.4 contains a fix for a deprecation warning that appears on Python >= 3.8. In Python 3.10 the warning became a hard error, so PycURL is unusable without that fix: `SystemError: PY_SSIZE_T_CLEAN macro must be defined...`
https://github.com/lorien/crawler/blob/master/crawler/base.py#L86 The `init_hook` method is called inside `__init__`. So, for instance, if you're doing some work (db calls etc.) inside `init_hook`, there is no way to change class attributes before calling init...
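A minimal sketch of the ordering problem, under one reading of the report above. `BaseCrawler`, `MyCrawler`, `db_name`, and `connected_to` are hypothetical stand-ins, not the library's actual classes or attributes; the only thing taken from the report is that `init_hook` runs inside `__init__`.

```python
class BaseCrawler:
    """Hypothetical stand-in for a base class that calls init_hook from __init__."""

    def __init__(self):
        # The hook fires before the caller gets a chance to touch the instance.
        self.init_hook()

    def init_hook(self):
        pass


class MyCrawler(BaseCrawler):
    db_name = "default"  # hypothetical class attribute

    def init_hook(self):
        # Simulates "db calls etc.": records which database would be opened.
        self.connected_to = self.db_name


bot = MyCrawler()
bot.db_name = "custom"   # too late: init_hook has already run
print(bot.connected_to)  # still the class-level default
```

In this sketch the attribute can only be changed on the class itself (or in a subclass) before instantiation, because no instance exists yet when the hook needs the value.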
It would be nice to have something like pre/post request hooks, for instance to detect whether a request was banned by the host, to check whether a page matches defined rules, or to save some counters...
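One possible shape for such hooks, as a rough sketch only. `HookedFetcher`, `pre_request_hook`, `post_request_hook`, and `fake_fetch` are all hypothetical names invented for illustration and are not part of the crawler library; the ban check (HTTP 403/429) and the page rule (body must contain `<html`) are likewise assumptions.

```python
from collections import Counter


class HookedFetcher:
    """Hypothetical wrapper that runs a hook before and after every request."""

    def __init__(self, fetch):
        self.fetch = fetch            # callable: url -> (status, body)
        self.counters = Counter()

    def pre_request_hook(self, url):
        self.counters["requests"] += 1

    def post_request_hook(self, url, status, body):
        # Assumed ban signal: the host answers 403 or 429.
        if status in (403, 429):
            self.counters["banned"] += 1
            return False
        # Assumed page rule: the body must look like HTML.
        if "<html" not in body.lower():
            self.counters["rule_mismatch"] += 1
            return False
        self.counters["ok"] += 1
        return True

    def request(self, url):
        self.pre_request_hook(url)
        status, body = self.fetch(url)
        accepted = self.post_request_hook(url, status, body)
        return body if accepted else None


def fake_fetch(url):
    # Stand-in for a real network call so the sketch runs offline.
    pages = {
        "https://example.com/ok": (200, "<html>hi</html>"),
        "https://example.com/ban": (403, "denied"),
    }
    return pages.get(url, (200, "plain text"))


fetcher = HookedFetcher(fake_fetch)
urls = ["https://example.com/ok", "https://example.com/ban", "https://example.com/other"]
results = [fetcher.request(u) for u in urls]
```

Rejected responses come back as `None` here; a real implementation might instead retry, rotate proxies, or raise.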
Implement cache backends like the ones in Grab. Sometimes they are really useful.
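As a rough illustration of what a pluggable cache backend could look like, here is a sketch built on the standard-library `sqlite3` module. It does not reproduce Grab's cache API; `SqliteCache`, `cached_fetch`, and `fake_fetch` are hypothetical names, and keying the cache by URL alone is an assumption.

```python
import sqlite3


class SqliteCache:
    """Hypothetical cache backend storing response bodies keyed by URL."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, body BLOB)"
        )

    def get(self, url):
        row = self.conn.execute(
            "SELECT body FROM cache WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def set(self, url, body):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (url, body) VALUES (?, ?)",
                (url, body),
            )


def cached_fetch(url, fetch, cache):
    """Return the cached body if present; otherwise fetch and store it."""
    body = cache.get(url)
    if body is None:
        body = fetch(url)
        cache.set(url, body)
    return body


calls = []


def fake_fetch(url):
    # Stand-in for a real network call; records how often it is invoked.
    calls.append(url)
    return b"payload for " + url.encode()


cache = SqliteCache()
first = cached_fetch("https://example.com/", fake_fetch, cache)
second = cached_fetch("https://example.com/", fake_fetch, cache)
```

Any object with the same `get`/`set` pair (an in-memory dict wrapper, a file store) could be swapped in, which is the point of having backends.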