
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...

Results 229 crawlee-python issues

Crawlers can be used in different scenarios. In cases where the maximum number of requests is defined, there should be a reasonable enqueue strategy that does not "overenqueue" links too...

enhancement
t-tooling
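
The idea of not "overenqueuing" can be illustrated with a minimal sketch. This is not Crawlee's API: `enqueue_within_budget`, `handled_count`, and `max_requests` are hypothetical names, and the assumption is that the budget left equals the maximum minus requests already handled and already queued.

```python
from collections import deque


def enqueue_within_budget(queue, seen, links, handled_count, max_requests):
    """Add unseen links to `queue` without exceeding the request budget.

    Hypothetical helper: the remaining budget is max_requests minus the
    requests already handled and the requests already waiting in the queue.
    """
    remaining = max_requests - handled_count - len(queue)
    added = []
    for link in links:
        if remaining <= 0:
            break  # budget exhausted; drop the rest instead of overenqueuing
        if link in seen:
            continue  # skip duplicates so they do not consume budget
        seen.add(link)
        queue.append(link)
        added.append(link)
        remaining -= 1
    return added


queue = deque(["https://example.com/a"])
seen = set(queue)
discovered = [f"https://example.com/page/{i}" for i in range(100)]
added = enqueue_within_budget(queue, seen, discovered, handled_count=2, max_requests=5)
print(len(added), len(queue))  # 2 3
```

With 2 requests handled and 1 queued out of a budget of 5, only 2 of the 100 discovered links are added.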

### Description - Run unit tests in CI on `macOS` as well, to prevent OS-specific incompatible changes, for example: https://github.com/apify/crawlee-python/issues/1329

t-tooling
tested

In `AutoscaledPool._worker_task_orchestrator` there is a call to `self._system_status.get_current_system_info()`, which processes existing snapshots to calculate whether the system is overloaded. This function gets called in a while loop without wait...

enhancement
t-tooling
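
One possible way to throttle such a polling loop is to yield to the event loop between status checks. The sketch below is not the `AutoscaledPool` implementation: `orchestrate`, `is_overloaded`, `run_task`, and `poll_interval` are hypothetical names chosen only to illustrate the pattern.

```python
import asyncio


async def orchestrate(is_overloaded, run_task, poll_interval=0.01, max_iterations=50):
    """Poll a (hypothetical) overload check, sleeping between polls."""
    started = 0
    for _ in range(max_iterations):
        if not is_overloaded():
            run_task()
            started += 1
        # Without this await the status would be recomputed back to back
        # and other coroutines would never get a chance to run.
        await asyncio.sleep(poll_interval)
    return started


calls = {"status": 0, "tasks": 0}


def fake_is_overloaded():
    calls["status"] += 1
    return calls["status"] % 2 == 0  # report overload on every second check


def fake_run_task():
    calls["tasks"] += 1


started = asyncio.run(
    orchestrate(fake_is_overloaded, fake_run_task, poll_interval=0.001, max_iterations=10)
)
print(started, calls["status"])  # 5 10
```

The sleep bounds how often the status is recomputed, at the cost of reacting to load changes up to one `poll_interval` later.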

Create a documentation guide that describes common optimization options for crawlers. You can point to the guide about OpenTelemetry as a source of data that can be used for optimization....

documentation
t-tooling

To avoid issues like this one: https://github.com/apify/crawlee-python/issues/1301.

documentation
t-tooling

Using `user_data` to pass data between handlers, I noticed that Crawlee also uses the dict for internally storing some stuff, such as `label`. This has surprised me. It's user data,...

t-tooling

```
$ python -c "import time;start = time.perf_counter();from crawlee.crawlers import ParselCrawler;print(f'{time.perf_counter() - start:.4f}')"
2.1944
```

t-tooling

Using the RSS of the main process and all its children can lead to overestimation of used memory, because shared memory is counted multiple times. This was addressed for Linux using...

enhancement
t-tooling
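
The overestimation can be shown with a small worked example. The numbers and the `ProcMem` type are made up for illustration, and the sketch assumes every process maps the same shared region; it does not reproduce whatever approach the Linux fix uses.

```python
from dataclasses import dataclass


@dataclass
class ProcMem:
    private_mb: float  # memory mapped only by this process
    shared_mb: float   # memory shared with the other processes


def summed_rss(procs):
    # Each process's RSS includes the full shared region,
    # so summing RSS counts that region once per process.
    return sum(p.private_mb + p.shared_mb for p in procs)


def deduplicated(procs):
    # Assumption: all processes map the same shared region, so count it once.
    shared = max((p.shared_mb for p in procs), default=0.0)
    return sum(p.private_mb for p in procs) + shared


# Hypothetical main process plus two children sharing 200 MB.
procs = [ProcMem(100, 200), ProcMem(50, 200), ProcMem(50, 200)]
print(summed_rss(procs), deduplicated(procs))  # 800 400
```

Here the naive sum reports twice the memory actually in use, and the gap grows with every additional child process.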

The `AutoscaledPool` controller does not converge to the optimal `desired_concurrency`, which can lead to a steady decline in performance for long-running actors. It seems that the current version of the `AutoscaledPool` controller can keep...

bug
t-tooling