[Bug]: Unexpected error in _crawl_web
crawl4ai version
0.7.7
Expected Behavior
Should parse the webpage correctly.
Current Behavior
When crawling this page: https://www.toshiba-lifestyle.com/th-en/blog/how-to-choose-the-right-laundry-product-for-you
I get the following error:
```
[ERROR]... × https://www.toshiba-lif...laundry-product-for-you | Error:
Unexpected error in _crawl_web at line 493 in aprocess_html
(../usr/local/lib/python3.12/site-packages/crawl4ai/async_webcrawler.py):
Error: Process HTML, Failed to extract content from the website:
https://www.toshiba-lifestyle.com/th-en/blog/how-to-choose-the-right-laundry-product-for-you, error: 1 validation error for MediaItem
width
  Input should be a valid integer, unable to parse string as an integer
For further information visit https://errors.pydantic.dev/2.12/v/int_parsing
```
Code context:
```python
488            )
489
490        except InvalidCSSSelectorError as e:
491            raise ValueError(str(e))
492        except Exception as e:
493  →         raise ValueError(
494                f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}"
495            )
496
497        # Extract results - handle both dict and ScrapingResult
498        if isinstance(result, dict):
```
It looks like something in the site's source is malformed and is causing the issue.
@Martichou could you share your code with us? It will help us to narrow down and speed up the debugging process.
Sure sorry!
```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
    ports:
      - "0.0.0.0:11235:11235"
    environment:
      - CRAWL4AI_ENV=prod
      - MAX_CONCURRENT_TASKS=25
      - MEMORY_THRESHOLD_PERCENT=80
      - PYTHONUNBUFFERED=1
    volumes:
      - ./crawl4ai_data:/data
      - ./logs:/app/logs
    shm_size: '2gb'
    mem_limit: 28G
    mem_reservation: 16G
    deploy:
      resources:
        limits:
          cpus: '15'
          memory: 28G
        reservations:
          cpus: '8'
          memory: 16G
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
```typescript
const requestBody: Crawl4AIRequest = {
  urls: [website],
  browser_config: {
    type: 'BrowserConfig',
    params: {
      extra_args: [
        '--disable-dev-shm-usage',
        '--disable-gpu',
        '--no-sandbox',
      ],
    },
  },
  crawler_config: {
    type: 'CrawlerRunConfig',
    params: {
      magic: true,
      cache_mode: 'bypass',
      page_timeout: 60000,
      markdown_generator: {
        type: 'DefaultMarkdownGenerator',
        params: {},
      },
      ...(withProxy
        ? {
            proxy_config: {
              type: 'ProxyConfig',
              params: {
                server: this.proxyServer,
                username: this.proxyUsername,
                password: this.proxyPassword,
              },
            },
          }
        : {}),
    },
  },
};

const response = await fetch(`${this.apiUrl}/crawl`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    ...(this.apiSecret ? { 'X-API-Key': `${this.apiSecret}` } : {}),
  },
  body: JSON.stringify(requestBody),
});
```
@Martichou thank you, will take a look at it
Full Root Cause (Simplified)
What Triggered It
The website's HTML had a malformed `<img>` tag with `width="banner-Ho"`: that's not a number, just a random string (probably a typo or bad CMS output). HTML expects `width` to be a pixel count, but this site broke the rules.
How It Broke the Code
- The crawler (in `async_webcrawler.py`) called the scraping strategy to process images.
- In `content_scraping_strategy.py`, the `process_image()` function grabbed the `width` attribute.
- It checked whether the value looked numeric with `w.isdigit()`; since `"banner-Ho"` isn't digits, it should have skipped setting `width`.
- Somehow, the string `"banner-Ho"` still ended up in the data dict passed to Pydantic's `MediaItem` model.
- Pydantic tried to coerce it to an integer (per `width: Optional[int]`), failed, and raised a validation error, crashing the whole crawl with a 500.
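The failure mode above can be reproduced with a minimal stdlib-only sketch (`coerce_width` and `media_attrs` are illustrative stand-ins, not crawl4ai's actual code; the real model uses Pydantic, whose int coercion behaves the same way for junk strings):

```python
from typing import Optional


def coerce_width(raw: Optional[str]) -> Optional[int]:
    """Mimic strict coercion for `width: Optional[int]`: parse or raise."""
    if raw is None:
        return None
    return int(raw)  # raises ValueError for non-numeric strings


# Attribute dict as scraped from the broken page
media_attrs = {"src": "banner.jpg", "width": "banner-Ho"}

try:
    coerce_width(media_attrs["width"])
except ValueError as exc:
    print(f"validation failed: {exc}")
```

One bad attribute on one image is enough to abort the whole crawl, because the exception propagates up to the blanket `except Exception` in `aprocess_html`.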
Why the Code Didn't Handle It
The code assumed HTML attributes would be valid, but the web is full of junk. It had a basic check, but no real error handling for non-numeric values. Pydantic is strict: it doesn't guess or skip; it fails hard.
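As a side note, `isdigit()` alone is not even a safe guard before `int()`: some Unicode strings pass the check yet still fail to parse, so try/except is the robust pattern regardless (`safe_int` below is an illustrative helper, not crawl4ai code):

```python
def safe_int(s: str):
    """Parse an int defensively; return None for anything int() rejects."""
    try:
        return int(s)
    except ValueError:
        return None


print("banner-Ho".isdigit(), safe_int("banner-Ho"))  # False None
print("²".isdigit(), safe_int("²"))                  # True None  <- isdigit() passes, int() fails
print("42".isdigit(), safe_int("42"))                # True 42
```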
Suggested Fix
To make the crawler tougher on bad HTML:
- In `content_scraping_strategy.py`, change the width parsing to use try/except instead of just `isdigit()`:

```python
if w := img.get("width"):
    try:
        base_info["width"] = int(w)
    except ValueError:
        pass  # Skip invalid widths
```

- In `models.py`, add a Pydantic validator to `MediaItem` for extra safety:

```python
@field_validator('width', mode='before')
@classmethod
def validate_width(cls, v):
    if isinstance(v, str):
        try:
            return int(v)
        except ValueError:
            return None
    return v
```
This skips bad attributes and sanitizes any that slip through, preventing crashes.
@Ahmed-Tawfik94 Are you sure about that? Because looking at the source code of the website, I don't see an image with a width of "banner-Ho". What I see is the following:
```html
<picture>
<source media="(max-width:640px), (max-aspect-ratio:1/1) and (max-width: 1200px)" data-srcset="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/mobile banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" srcset="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/mobile banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg">
<img src="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/desktop banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" data-src="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/desktop banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" class="swiper-banner-img swiper-lazy blur-up lazyloaded">
</picture>
```
Notice the srcset: the URLs in it contain unescaped spaces, which seems to break the attribute tokenizer. I may be wrong, but I don't see any other reference to "banner-Ho" on the page except there.
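This hypothesis is plausible: per the HTML spec, whitespace separates a srcset URL from its width/density descriptor, so a URL with literal spaces will be misread by any whitespace-based tokenizer. A naive sketch (not crawl4ai's actual parser, and using a shortened hypothetical URL echoing the real one) shows how a fragment like "banner-How" surfaces as a bogus descriptor:

```python
def naive_srcset_parse(srcset: str):
    """Split srcset into (url, descriptor) pairs the naive way: commas, then whitespace."""
    pairs = []
    for candidate in srcset.split(","):
        tokens = candidate.strip().split()
        url = tokens[0] if tokens else ""
        descriptor = tokens[1] if len(tokens) > 1 else None
        pairs.append((url, descriptor))
    return pairs


# Hypothetical URL with unescaped spaces, like the one on the Toshiba page
srcset = "//example.com/dam/mobile banner-How to choose.jpg"
print(naive_srcset_parse(srcset))
# -> [('//example.com/dam/mobile', 'banner-How')]
```

A token like `banner-How` (or a truncation of it) then being treated as a numeric width would produce exactly the `MediaItem` validation error seen in the log.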