crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
Crawlers can be used in different scenarios. In cases where a maximum number of requests is defined, there should be a reasonable enqueue strategy that does not "overenqueue" links too...
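A minimal sketch of the budgeting idea behind such a strategy, with all names (`links_to_enqueue`, its parameters) hypothetical rather than actual Crawlee API:

```python
def links_to_enqueue(found_links, handled_count, in_progress_count, max_requests):
    """Return only as many links as the remaining request budget allows.

    Hypothetical helper: instead of enqueueing every discovered link, cap the
    batch so handled + in-progress + newly enqueued never exceeds max_requests.
    """
    remaining = max_requests - handled_count - in_progress_count
    if remaining <= 0:
        return []
    return found_links[:remaining]

# Usage: with 8 handled and 1 in progress out of a budget of 10,
# only one of the three discovered links should be enqueued.
batch = links_to_enqueue(["a", "b", "c"], handled_count=8, in_progress_count=1, max_requests=10)
```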
### Description - Also run unit tests in CI on `macOS` to prevent OS-specific incompatible changes, for example: https://github.com/apify/crawlee-python/issues/1329
In `AutoscaledPool._worker_task_orchestrator` there is a call to `self._system_status.get_current_system_info()`, which processes existing snapshots to calculate whether the system is overloaded. This function is called in a while loop without a wait...
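A minimal sketch of the pattern presumably intended here: sleep between status checks instead of spinning, so snapshot processing is not re-run back-to-back. The function and stub names are hypothetical, not the actual `AutoscaledPool` internals:

```python
import asyncio

async def orchestrator_cycle(get_system_info, is_overloaded, poll_interval=0.01, cycles=3):
    """Check system status a few times, yielding between checks instead of busy-waiting."""
    spawned = 0
    for _ in range(cycles):
        info = get_system_info()            # processes snapshots (potentially expensive)
        if not is_overloaded(info):
            spawned += 1                    # a real pool would start a worker task here
        await asyncio.sleep(poll_interval)  # the missing wait: yield control between checks
    return spawned

# Usage: with stubbed status functions, every non-overloaded check "spawns" a worker.
spawned = asyncio.run(
    orchestrator_cycle(lambda: {"cpu": 0.2}, lambda info: info["cpu"] > 0.9)
)
```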
Create a documentation guide that describes common optimization options for crawlers. It can point to the Open Telemetry guide as a source of data that can be used for optimization....
To avoid issues like this https://github.com/apify/crawlee-python/issues/1301.
When using `user_data` to pass data between handlers, I noticed that Crawlee also uses the same dict to internally store some of its own fields, such as `label`. This surprised me. It's user data,...
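A plain-dict sketch of the surprise being described, under the assumption that the internal `label` field and user-set keys share one dict (the request shape here is simplified, not the real `Request` class):

```python
# Hypothetical simplified request: user_data is one shared dict.
request = {"url": "https://example.com", "user_data": {}}

# The framework routes the request by writing `label` into user_data...
request["user_data"]["label"] = "DETAIL"

# ...while the user stores their own payload in the very same dict.
request["user_data"]["product_id"] = 123

# A user iterating over "their" data now also sees the internal key.
keys = sorted(request["user_data"].keys())
```

This is why the issue reads as a namespacing problem: internal bookkeeping and user payload are indistinguishable once merged.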
```
$ python -c "import time; start = time.perf_counter(); from crawlee.crawlers import ParselCrawler; print(f'{time.perf_counter() - start:.4f}')"
2.1944
```
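One standard-library technique for reducing import cost of this kind is deferring submodule imports until first attribute access. A sketch using the documented `importlib.util.LazyLoader` recipe (not something Crawlee currently does, just an illustration):

```python
import importlib.util
import sys

def lazy_import(name):
    """Register a module whose real import runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy machinery, does not run module code yet
    return module

# Usage: the module body only executes when an attribute is touched.
b64 = lazy_import("base64")
encoded = b64.b64encode(b"hi")  # first access triggers the real import
```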
Using the RSS of the main process and all its children can lead to overestimation of used memory, because shared memory is counted multiple times. This was addressed for Linux using...
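The usual Linux-side fix is to read PSS (proportional set size, which splits shared pages among their users) from `/proc/<pid>/smaps_rollup` instead of summing RSS. A small parsing sketch, with the helper name hypothetical:

```python
def parse_pss_kib(smaps_rollup_text):
    """Extract the Pss value (in KiB) from /proc/<pid>/smaps_rollup contents.

    PSS charges each shared page 1/n to each of its n sharers, so summing PSS
    across processes does not double-count shared memory the way RSS does.
    """
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Pss:"):
            return int(line.split()[1])
    return None

# Usage: on a real system, the text would come from open(f"/proc/{pid}/smaps_rollup").
sample = "Rss:            1024 kB\nPss:             512 kB\nShared_Clean:    512 kB\n"
pss = parse_pss_kib(sample)
```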
The `AutoscaledPool` controller does not converge to an optimal `desired_concurrency`, which can lead to a steady decline in performance for long-running actors. It seems that the current version of the `AutoscaledPool` controller can keep...
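For context, autoscaling controllers of this shape are often additive-increase / multiplicative-decrease: grow concurrency by a step while healthy, cut it by a factor when overloaded. A toy sketch of one AIMD step (hypothetical constants and names, not the actual Crawlee controller):

```python
def step_concurrency(desired, overloaded, min_c=1, max_c=50):
    """One AIMD update of desired concurrency.

    Additive increase (+1) while the system is healthy, multiplicative
    decrease (x0.95) when overloaded; clamped to [min_c, max_c]. If the
    decrease factor or overload signal is miscalibrated, the controller
    can oscillate or drift instead of converging.
    """
    if overloaded:
        return max(min_c, int(desired * 0.95))
    return min(max_c, desired + 1)
```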