crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
Crawlers can be used in different scenarios. In cases where a maximum number of requests is defined, there should be a reasonable enqueue strategy that does not "overenqueue" links too...
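A minimal sketch of the budgeting idea behind such a strategy, with all names (`links_to_enqueue`, its parameters) hypothetical rather than actual Crawlee API:

```python
def links_to_enqueue(found_links, handled_count, in_progress_count, max_requests):
    """Return only as many links as the remaining request budget allows.

    Hypothetical helper: instead of enqueueing every discovered link, cap the
    batch so handled + in-progress + newly enqueued never exceeds max_requests.
    """
    remaining = max_requests - handled_count - in_progress_count
    if remaining <= 0:
        return []
    return found_links[:remaining]

# Usage: with 8 handled and 1 in progress out of a budget of 10,
# only one of the three discovered links should be enqueued.
batch = links_to_enqueue(["a", "b", "c"], handled_count=8, in_progress_count=1, max_requests=10)
```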
### Description - Also run unit tests in CI on `macOS` to prevent OS-specific incompatible changes, for example: https://github.com/apify/crawlee-python/issues/1329
In `AutoscaledPool._worker_task_orchestrator` there is a call to `self._system_status.get_current_system_info()`, which processes existing snapshots to calculate whether the system is overloaded. This function is called in a while loop without a wait...
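A minimal sketch of the pattern presumably intended here: sleep between status checks instead of spinning, so snapshot processing is not re-run back-to-back. The function and stub names are hypothetical, not the actual `AutoscaledPool` internals:

```python
import asyncio

async def orchestrator_cycle(get_system_info, is_overloaded, poll_interval=0.01, cycles=3):
    """Check system status a few times, yielding between checks instead of busy-waiting."""
    spawned = 0
    for _ in range(cycles):
        info = get_system_info()            # processes snapshots (potentially expensive)
        if not is_overloaded(info):
            spawned += 1                    # a real pool would start a worker task here
        await asyncio.sleep(poll_interval)  # the missing wait: yield control between checks
    return spawned

# Usage: with stubbed status functions, every non-overloaded check "spawns" a worker.
spawned = asyncio.run(
    orchestrator_cycle(lambda: {"cpu": 0.2}, lambda info: info["cpu"] > 0.9)
)
```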
Create a documentation guide that describes common optimization options for crawlers. It can point to the Open Telemetry guide as a source of data that can be used for optimization....
To avoid issues like this https://github.com/apify/crawlee-python/issues/1301.
When using `user_data` to pass data between handlers, I noticed that Crawlee also uses the same dict to internally store some of its own fields, such as `label`. This surprised me. It's user data,...
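A plain-dict sketch of the surprise being described, under the assumption that the internal `label` field and user-set keys share one dict (the request shape here is simplified, not the real `Request` class):

```python
# Hypothetical simplified request: user_data is one shared dict.
request = {"url": "https://example.com", "user_data": {}}

# The framework routes the request by writing `label` into user_data...
request["user_data"]["label"] = "DETAIL"

# ...while the user stores their own payload in the very same dict.
request["user_data"]["product_id"] = 123

# A user iterating over "their" data now also sees the internal key.
keys = sorted(request["user_data"].keys())
```

This is why the issue reads as a namespacing problem: internal bookkeeping and user payload are indistinguishable once merged.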
```
$ python -c "import time; start = time.perf_counter(); from crawlee.crawlers import ParselCrawler; print(f'{time.perf_counter() - start:.4f}')"
2.1944
```
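One standard-library technique for reducing import cost of this kind is deferring submodule imports until first attribute access. A sketch using the documented `importlib.util.LazyLoader` recipe (not something Crawlee currently does, just an illustration):

```python
import importlib.util
import sys

def lazy_import(name):
    """Register a module whose real import runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy machinery, does not run module code yet
    return module

# Usage: the module body only executes when an attribute is touched.
b64 = lazy_import("base64")
encoded = b64.b64encode(b"hi")  # first access triggers the real import
```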
Using the RSS of the main process and all its children can lead to overestimation of used memory, because shared memory is counted multiple times. This was addressed for Linux using...
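The usual Linux-side fix is to read PSS (proportional set size, which splits shared pages among their users) from `/proc/<pid>/smaps_rollup` instead of summing RSS. A small parsing sketch, with the helper name hypothetical:

```python
def parse_pss_kib(smaps_rollup_text):
    """Extract the Pss value (in KiB) from /proc/<pid>/smaps_rollup contents.

    PSS charges each shared page 1/n to each of its n sharers, so summing PSS
    across processes does not double-count shared memory the way RSS does.
    """
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Pss:"):
            return int(line.split()[1])
    return None

# Usage: on a real system, the text would come from open(f"/proc/{pid}/smaps_rollup").
sample = "Rss:            1024 kB\nPss:             512 kB\nShared_Clean:    512 kB\n"
pss = parse_pss_kib(sample)
```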
The `AutoscaledPool` controller does not converge to an optimal `desired_concurrency`, which can lead to a steady decline in performance for long-running actors. It seems that the current version of the `AutoscaledPool` controller can keep...
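For context, autoscaling controllers of this shape are often additive-increase / multiplicative-decrease: grow concurrency by a step while healthy, cut it by a factor when overloaded. A toy sketch of one AIMD step (hypothetical constants and names, not the actual Crawlee controller):

```python
def step_concurrency(desired, overloaded, min_c=1, max_c=50):
    """One AIMD update of desired concurrency.

    Additive increase (+1) while the system is healthy, multiplicative
    decrease (x0.95) when overloaded; clamped to [min_c, max_c]. If the
    decrease factor or overload signal is miscalibrated, the controller
    can oscillate or drift instead of converging.
    """
    if overloaded:
        return max(min_c, int(desired * 0.95))
    return min(max_c, desired + 1)
```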