crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
### Description

- Declare private and public interface.

### Issues

- N/A

### Testing

- N/A

### Checklist

- [x] CI passed
Optimize performance by skipping unnecessary `update_request()` calls in `RequestQueue.reclaim_request()` https://github.com/apify/apify-sdk-python/blob/v1.3.0/src/apify/storages/request_queue.py#L314:318
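A minimal sketch of the optimization idea, assuming a simplified queue and a fake storage client (the real `RequestQueue` in the linked source is more involved; only the `reclaim_request()`/`update_request()` names are taken from it). The point is to skip the API round trip when the request is unchanged and is not being moved to the front of the queue:

```python
import asyncio

class FakeClient:
    """Stand-in storage client that just counts update_request() calls."""
    def __init__(self):
        self.update_calls = 0

    async def update_request(self, request, *, forefront=False):
        self.update_calls += 1

class RequestQueue:
    """Hypothetical reduction of the optimization, not the real class."""
    def __init__(self, client):
        self.client = client
        self.in_progress = set()

    async def reclaim_request(self, request, *, forefront=False):
        # Only pay for the API call when the request was modified or must
        # be moved to the front; previously it was called unconditionally.
        if forefront or request.get("modified", False):
            await self.client.update_request(request, forefront=forefront)
        self.in_progress.discard(request["id"])
```

With this guard, reclaiming an untouched request costs no client call at all, which is where the speedup comes from on large queues.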
Simulate this error in Python and handle it accordingly.

```js
try {
  return await this.client.listItems(options);
} catch (e) {
  const error = e as Error;
  if (error.message.includes('Cannot create a string...
```
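A rough Python analogue of the pattern in the JavaScript snippet, assuming an async client with a `list_items()` method. The error message in the issue is truncated, so `'Cannot create a string'` is used here only as a placeholder prefix:

```python
import asyncio

async def list_items_safe(client, options):
    """Call the client and swallow only the specific oversized-string error."""
    try:
        return await client.list_items(options)
    except Exception as exc:
        # The JS code checks error.message; in Python we inspect str(exc).
        if "Cannot create a string" in str(exc):
            return None  # handle the oversized-response case here
        raise  # re-raise anything we did not anticipate
```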
If the `user_data_dir` option is found in `browser_option`, then `launch_persistent_context` is used as the launch function instead of `launch`, and the `user_data_dir` option is passed to Playwright.

### Description

It makes it possible...
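The dispatch described above can be sketched as follows. The `launch()` and `launch_persistent_context()` names match Playwright's async `BrowserType` API; the surrounding plugin code and the `browser_option` dict shape are simplified assumptions, and a fake browser type stands in for Playwright so the flow is visible:

```python
import asyncio

class FakeBrowserType:
    """Stand-in for Playwright's BrowserType, recording which launcher ran."""
    def __init__(self):
        self.called = None

    async def launch(self, **kwargs):
        self.called = "launch"

    async def launch_persistent_context(self, user_data_dir, **kwargs):
        self.called = f"launch_persistent_context:{user_data_dir}"

async def launch_browser(browser_type, browser_option: dict):
    """Pop user_data_dir and, when present, launch a persistent context
    so the browser reuses the given profile directory."""
    options = dict(browser_option)
    user_data_dir = options.pop("user_data_dir", None)
    if user_data_dir is not None:
        return await browser_type.launch_persistent_context(user_data_dir, **options)
    return await browser_type.launch(**options)
```

Note that `launch_persistent_context()` returns a `BrowserContext` rather than a `Browser`, which is why the calling code has to branch too.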
https://crawlee.dev/python/docs/introduction/saving-data#using-a-context-helper should put the emphasis on using the `push_data` helper; `Dataset.open().push_data()` should only be mentioned later in the article.
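A toy reduction of why the context helper deserves the emphasis: the request handler only needs `context.push_data()` and never has to know which dataset the crawler writes to. The `CrawlingContext` class below is a hypothetical stand-in for crawlee's real context, with a plain list playing the dataset:

```python
import asyncio

class CrawlingContext:
    """Hypothetical reduction of a crawler context."""
    def __init__(self, url, dataset):
        self.url = url
        self._dataset = dataset

    async def push_data(self, item):
        # The helper hides the storage details from the handler.
        self._dataset.append(item)

async def handler(context):
    # The handler never opens a Dataset itself.
    await context.push_data({"url": context.url})
```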
Based on discussion in #347
### Description

- `item_count` unexpectedly incremented when loaded from metadata

### Issues

- Closes: #442

### Testing

- Added `test_reuse_dataset` test

### Checklist

- [ ] CI passed
### Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

### Feature

Please add support for using a user-provided browser profile. ###...
When reusing a dataset with metadata, `item_count` is incremented again after being loaded from the metadata file. This leads to non-contiguous file numbering and breaks multiple functions on datasets (export...
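A minimal sketch of the expected behaviour, under the assumption (from the report) that item files are numbered from `item_count` and the count is persisted in a metadata file. On reopen, `item_count` must be taken verbatim from metadata, with no extra increment, so numbering stays contiguous; the `Dataset` class here is a hypothetical reduction, not crawlee's implementation:

```python
import json
import tempfile
from pathlib import Path

class Dataset:
    """Hypothetical minimal dataset with persisted item_count."""
    def __init__(self, directory):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        meta = self.dir / "metadata.json"
        # Load the count as-is; incrementing here is exactly the bug.
        self.item_count = (
            json.loads(meta.read_text())["item_count"] if meta.exists() else 0
        )

    def push_data(self, item):
        self.item_count += 1
        # File names derive from the counter, so any drift creates gaps.
        (self.dir / f"{self.item_count:09}.json").write_text(json.dumps(item))
        (self.dir / "metadata.json").write_text(
            json.dumps({"item_count": self.item_count})
        )
```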
Hello, I'm experiencing performance issues with my web crawler after approximately 1.5 to 2 hours of runtime. The crawling speed significantly decreases to about one site per minute or less,...