crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
I am using the latest version of Crawlee, Python 3.11, and Windows 11, and have tried both Chromium and Firefox. Here is a simple example. P.S. There is also an error: `ValueError: Cannot close...`
`Scrapy` can be given an HTTP API through a third-party library called `ScrapyRT`, which exposes spiders over HTTP. By sending a request to `ScrapyRT` with the spider name and URL,...
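As an illustration, ScrapyRT's crawl endpoint takes the spider name and start URL as query parameters of a GET request. A minimal sketch of building such a request URL (the port, spider name, and target URL below are placeholders, not values from any particular project):

```python
from urllib.parse import urlencode

# Placeholder endpoint; ScrapyRT listens on port 9080 by default.
SCRAPYRT_ENDPOINT = 'http://localhost:9080/crawl.json'

def build_scrapyrt_url(spider_name: str, url: str) -> str:
    """Build the GET URL for ScrapyRT's crawl endpoint."""
    query = urlencode({'spider_name': spider_name, 'url': url})
    return f'{SCRAPYRT_ENDPOINT}?{query}'

print(build_scrapyrt_url('quotes', 'http://quotes.toscrape.com/'))
```

The response is a JSON document containing the items scraped by the spider for that start URL.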
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [eslint-plugin-react](https://togithub.com/jsx-eslint/eslint-plugin-react) | [`7.34.3` -> `7.34.4`](https://renovatebot.com/diffs/npm/eslint-plugin-react/7.34.3/7.34.4) | ... |
It seems we have 8-space indentation at the beginning:

```text
        [crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
          "requests_finished": 0,
          "requests_failed": 0,
          "retry_histogram": [
            0
          ],
          "request_avg_failed_duration": null,
          "request_avg_finished_duration": null,
          "requests_finished_per_minute": ...
```
By default, HTTPX's logging level is set to INFO, producing logs such as:

```text
$ python run_beautifulsoup_crawler_el.py
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
  "requests_finished": 0,
  "requests_failed": 0,
  "retry_histogram": [
    0 ...
```
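One common way to quiet HTTPX's per-request INFO lines is to raise the level of its logger before the crawler starts; a minimal sketch using the standard `logging` module:

```python
import logging

# Suppress HTTPX's per-request INFO records; warnings and errors still pass.
logging.getLogger('httpx').setLevel(logging.WARNING)

# INFO records are now filtered out for this logger.
print(logging.getLogger('httpx').isEnabledFor(logging.INFO))  # -> False
```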
Currently, we expose the logger instance only through the context, e.g.:

```python
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}...')
```

We could expose it on the Crawler level...
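A sketch of what a crawler-level accessor might look like (the class and attribute names here are hypothetical, not the actual Crawlee API):

```python
import logging

class BasicCrawlerSketch:
    """Illustrative stand-in for a crawler exposing its logger directly."""

    def __init__(self, name: str = 'crawlee.basic_crawler') -> None:
        self._logger = logging.getLogger(name)

    @property
    def log(self) -> logging.Logger:
        # Usable outside request handlers, e.g. before/after crawler.run().
        return self._logger

crawler = BasicCrawlerSketch()
crawler.log.info('Starting crawl...')
print(crawler.log.name)  # -> crawlee.basic_crawler
```

This would let users log through the crawler's configured logger outside of request handlers, mirroring the existing `context.log` accessor.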
Improve API docs of the public components, mainly: - [ ] `BasicCrawler` - [ ] `HttpCrawler` - [ ] `BeautifulSoupCrawler` - [ ] `ParselCrawler` - [ ] `PlaywrightCrawler` - [x]...
The CLI should prompt the user for a project name repeatedly until a valid, non-existing folder name is provided. It should behave as follows:

```text
$ crawlee create
[?] Name...
```
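The retry loop itself might be sketched like this (the validation rule and helper names are assumptions for illustration, not the CLI's actual implementation):

```python
import re
import tempfile
from pathlib import Path
from typing import Callable

# Assumed naming rule: letters, digits, hyphens, underscores only.
VALID_NAME = re.compile(r'^[A-Za-z0-9_-]+$')

def prompt_project_name(ask: Callable[[str], str], root: Path) -> str:
    """Keep asking until the answer is a valid name of a non-existing folder."""
    while True:
        name = ask('Name of the new project folder: ').strip()
        if not VALID_NAME.match(name):
            continue  # empty or invalid characters -> ask again
        if (root / name).exists():
            continue  # folder already exists -> ask again
        return name

# Demo with scripted answers instead of interactive input.
root = Path(tempfile.mkdtemp())
(root / 'taken').mkdir()
answers = iter(['bad name!', 'taken', 'my-crawler'])
print(prompt_project_name(lambda _: next(answers), root))  # -> my-crawler
```

Passing the prompt function in as a parameter keeps the loop testable without a terminal.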
Currently, for the end-user `requests` parameter we define:

```python
requests: Sequence[str | BaseRequestData | Request]
```

which, unfortunately, also matches a single-string request (e.g. `requests='https://crawlee.dev'`). We could make it more...
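The root cause is that `str` is itself a `Sequence[str]`, so the annotation cannot rule out a bare string. A minimal runtime guard (the function name is illustrative) might look like:

```python
from collections.abc import Sequence

# A bare string structurally satisfies Sequence:
print(isinstance('https://crawlee.dev', Sequence))  # -> True

def ensure_request_list(requests: Sequence[str]) -> list[str]:
    """Reject a single string passed where a sequence of requests is expected."""
    if isinstance(requests, str):
        raise TypeError('Pass a sequence of requests, not a single string.')
    return list(requests)

print(ensure_request_list(['https://crawlee.dev']))  # -> ['https://crawlee.dev']
```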