crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
I am using the latest version of Crawlee, Python 3.11, and Windows 11, and have tried both Chromium and Firefox. Here is a simple example. P.S. There is also an error: `ValueError: Cannot close...`
`Scrapy` can be given an HTTP API through a third-party library called `ScrapyRT`, which exposes spiders over HTTP. By sending a request to `ScrapyRT` with the spider name and URL,...
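As an illustration, ScrapyRT's crawl endpoint takes the spider name and start URL as query parameters of a GET request. A minimal sketch of building such a request URL (the port, spider name, and target URL below are placeholders, not values from any particular project):

```python
from urllib.parse import urlencode

# Placeholder endpoint; ScrapyRT listens on port 9080 by default.
SCRAPYRT_ENDPOINT = 'http://localhost:9080/crawl.json'

def build_scrapyrt_url(spider_name: str, url: str) -> str:
    """Build the GET URL for ScrapyRT's crawl endpoint."""
    query = urlencode({'spider_name': spider_name, 'url': url})
    return f'{SCRAPYRT_ENDPOINT}?{query}'

print(build_scrapyrt_url('quotes', 'http://quotes.toscrape.com/'))
```

The response is a JSON document containing the items scraped by the spider for that start URL.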
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [eslint-plugin-react](https://togithub.com/jsx-eslint/eslint-plugin-react) | [`7.34.3` -> `7.34.4`](https://renovatebot.com/diffs/npm/eslint-plugin-react/7.34.3/7.34.4) | ... |
It seems we have 8-space indentation at the beginning:

```text
        [crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
          "requests_finished": 0,
          "requests_failed": 0,
          "retry_histogram": [
            0
          ],
          "request_avg_failed_duration": null,
          "request_avg_finished_duration": null,
          "requests_finished_per_minute": ...
```
By default, HTTPX's logging level is set to INFO, producing logs such as:

```text
$ python run_beautifulsoup_crawler_el.py
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
  "requests_finished": 0,
  "requests_failed": 0,
  "retry_histogram": [
    0 ...
```
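One common way to quiet HTTPX's per-request INFO lines is to raise the level of its logger before the crawler starts; a minimal sketch using the standard `logging` module:

```python
import logging

# Suppress HTTPX's per-request INFO records; warnings and errors still pass.
logging.getLogger('httpx').setLevel(logging.WARNING)

# INFO records are now filtered out for this logger.
print(logging.getLogger('httpx').isEnabledFor(logging.INFO))  # -> False
```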
Currently, we expose the logger instance only through the context, e.g.:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}...')
```

We could expose it on the Crawler level...
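A sketch of what a crawler-level accessor might look like (the class and attribute names here are hypothetical, not the actual Crawlee API):

```python
import logging

class BasicCrawlerSketch:
    """Illustrative stand-in for a crawler exposing its logger directly."""

    def __init__(self, name: str = 'crawlee.basic_crawler') -> None:
        self._logger = logging.getLogger(name)

    @property
    def log(self) -> logging.Logger:
        # Usable outside request handlers, e.g. before/after crawler.run().
        return self._logger

crawler = BasicCrawlerSketch()
crawler.log.info('Starting crawl...')
print(crawler.log.name)  # -> crawlee.basic_crawler
```

This would let users log through the crawler's configured logger outside of request handlers, mirroring the existing `context.log` accessor.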
Improve API docs of the public components, mainly: - [ ] `BasicCrawler` - [ ] `HttpCrawler` - [ ] `BeautifulSoupCrawler` - [ ] `ParselCrawler` - [ ] `PlaywrightCrawler` - [x]...
The CLI should prompt the user for a project name repeatedly until a valid, non-existing folder name is provided. It should behave as follows:

```text
$ crawlee create
[?] Name...
```
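The retry loop itself might be sketched like this (the validation rule and helper names are assumptions for illustration, not the CLI's actual implementation):

```python
import re
import tempfile
from pathlib import Path
from typing import Callable

# Assumed naming rule: letters, digits, hyphens, underscores only.
VALID_NAME = re.compile(r'^[A-Za-z0-9_-]+$')

def prompt_project_name(ask: Callable[[str], str], root: Path) -> str:
    """Keep asking until the answer is a valid name of a non-existing folder."""
    while True:
        name = ask('Name of the new project folder: ').strip()
        if not VALID_NAME.match(name):
            continue  # empty or invalid characters -> ask again
        if (root / name).exists():
            continue  # folder already exists -> ask again
        return name

# Demo with scripted answers instead of interactive input.
root = Path(tempfile.mkdtemp())
(root / 'taken').mkdir()
answers = iter(['bad name!', 'taken', 'my-crawler'])
print(prompt_project_name(lambda _: next(answers), root))  # -> my-crawler
```

Passing the prompt function in as a parameter keeps the loop testable without a terminal.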
Currently, for the end-user `requests` parameter we define:

```python
requests: Sequence[str | BaseRequestData | Request]
```

which, unfortunately, also matches a single-string request (e.g. `requests='https://crawlee.dev'`). We could make it more...
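The root cause is that `str` is itself a `Sequence[str]`, so the annotation cannot rule out a bare string. A minimal runtime guard (the function name is illustrative) might look like:

```python
from collections.abc import Sequence

# A bare string structurally satisfies Sequence:
print(isinstance('https://crawlee.dev', Sequence))  # -> True

def ensure_request_list(requests: Sequence[str]) -> list[str]:
    """Reject a single string passed where a sequence of requests is expected."""
    if isinstance(requests, str):
        raise TypeError('Pass a sequence of requests, not a single string.')
    return list(requests)

print(ensure_request_list(['https://crawlee.dev']))  # -> ['https://crawlee.dev']
```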