
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...

Results: 229 crawlee-python issues, sorted by recently updated

I am using the latest version of Crawlee on Python 3.11 and Windows 11, and have tried both Chromium and Firefox. Here is a simple example. P.S. There is also an error: `ValueError: Cannot close...`

bug
t-tooling

`Scrapy` offers an HTTP API for spiders through a third-party library called `ScrapyRT`. By sending a request to `ScrapyRT` with the spider name and URL,...

enhancement
t-tooling
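For context, ScrapyRT's documented interface is a `GET /crawl.json` endpoint taking `spider_name` and `url` query parameters. A minimal sketch of building such a request URL (the host and port are ScrapyRT's defaults, and the spider name is a placeholder):

```python
from urllib.parse import urlencode

# ScrapyRT's documented endpoint shape: GET /crawl.json?spider_name=...&url=...
# 'localhost:9080' is the ScrapyRT default; 'example_spider' is a placeholder.
base = 'http://localhost:9080/crawl.json'
query = urlencode({'spider_name': 'example_spider', 'url': 'https://crawlee.dev'})
request_url = f'{base}?{query}'
print(request_url)
```

A comparable feature for Crawlee would presumably accept a crawler identifier and a start URL in the same spirit.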

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [eslint-plugin-react](https://togithub.com/jsx-eslint/eslint-plugin-react) | [`7.34.3` -> `7.34.4`](https://renovatebot.com/diffs/npm/eslint-plugin-react/7.34.3/7.34.4) | [![age](https://developer.mend.io/api/mc/badges/age/npm/eslint-plugin-react/7.34.4?slim=true)](https://docs.renovatebot.com/merge-confidence/)... |

It seems we have 8 spaces indentation at the beginning:

```text
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
        "requests_finished": 0,
        "requests_failed": 0,
        "retry_histogram": [ 0 ],
        "request_avg_failed_duration": null,
        "request_avg_finished_duration": null,
        "requests_finished_per_minute":...
```

t-tooling

By default, HTTPX's logging level is set to INFO, producing logs such as:

```text
$ python run_beautifulsoup_crawler_el.py
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
    "requests_finished": 0,
    "requests_failed": 0,
    "retry_histogram": [ 0...
```

enhancement
t-tooling
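Until the library adjusts this, one possible user-side workaround (an assumption, not an official fix) is to raise the `httpx` logger's threshold with the standard `logging` module so its INFO-level request lines are suppressed:

```python
import logging

# Raise the httpx logger's threshold: INFO-level request lines ("HTTP Request: ...")
# are then filtered out, while warnings and errors still come through.
logging.getLogger('httpx').setLevel(logging.WARNING)
```

This only touches the `httpx` logger, so Crawlee's own statistics output is unaffected.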

Currently, we expose the logger instance only through the context, e.g.:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}...')
```

We could expose it on the Crawler level...

enhancement
t-tooling
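A minimal sketch of what crawler-level access might look like. The class and logger name below are illustrative stand-ins, not Crawlee's actual implementation; the idea is simply a `log` property delegating to the crawler's internal logger:

```python
import logging

class BasicCrawlerSketch:
    """Toy stand-in for a crawler class; the `log` property is the proposed addition."""

    def __init__(self) -> None:
        # Hypothetical logger name, chosen for illustration only.
        self._logger = logging.getLogger('crawlee.basic_crawler')

    @property
    def log(self) -> logging.Logger:
        # Proposed: expose the same logger that the crawling context wraps.
        return self._logger

crawler = BasicCrawlerSketch()
crawler.log.info('Crawler started')  # usable outside any request handler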

Improve API docs of the public components, mainly:

- [ ] `BasicCrawler`
- [ ] `HttpCrawler`
- [ ] `BeautifulSoupCrawler`
- [ ] `ParselCrawler`
- [ ] `PlaywrightCrawler`
- [x]...

documentation
t-tooling
hacktoberfest

The CLI should prompt the user for a project name repeatedly until a valid, non-existing folder name is provided. It should behave as follows:

```text
$ crawlee create
[?] Name...
```

enhancement
t-tooling
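The proposed loop could be sketched as follows. The function name, messages, and validation rules here are assumptions for illustration, not the CLI's actual code; the point is re-prompting until the answer is non-empty and does not collide with an existing folder:

```python
from pathlib import Path

def prompt_project_name(ask=input) -> str:
    """Keep asking until a non-empty name is given whose folder does not exist yet.

    `ask` is injectable so the loop can be tested without real stdin.
    """
    while True:
        name = ask('Name of the new project folder: ').strip()
        if not name:
            print('The name cannot be empty.')
            continue
        if Path(name).exists():
            print(f'A folder named {name!r} already exists.')
            continue
        return name
```

With this shape, an empty answer or a clash with an existing directory simply re-triggers the prompt instead of aborting the `crawlee create` run.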

Currently, for end-user "requests" params we define:

```python
requests: Sequence[str | BaseRequestData | Request]
```

which, unfortunately, matches a single-string request (e.g. `requests='https://crawlee.dev'`) as well. We could make it more...

bug
t-tooling
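The pitfall can be demonstrated directly: a `str` is itself a `Sequence` of one-character strings, so both static checkers and `isinstance` accept a bare URL where a list of requests was intended, and iterating over it yields characters:

```python
from collections.abc import Sequence

# A plain str satisfies Sequence, so `requests='https://crawlee.dev'`
# type-checks against Sequence[str | ...] and then iterates per character.
url = 'https://crawlee.dev'
print(isinstance(url, Sequence))  # True
print(list(url)[:5])             # ['h', 't', 't', 'p', 's'] -- characters, not URLs
```

This is why APIs that accept `Sequence[str]` often special-case `str` at runtime or narrow the annotation (e.g. to `list[...]`).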