
[Bug]: Unexpected error in _crawl_web

Open Martichou opened this issue 2 months ago • 5 comments

crawl4ai version

0.7.7

Expected Behavior

Should parse the webpage correctly.

Current Behavior

When crawling this page: https://www.toshiba-lifestyle.com/th-en/blog/how-to-choose-the-right-laundry-product-for-you

I get the following error:

[ERROR]... × https://www.toshiba-lif...laundry-product-for-you  | Error:
Unexpected error in _crawl_web at line 493 in aprocess_html
(../usr/local/lib/python3.12/site-packages/crawl4ai/async_webcrawler.py):
Error: Process HTML, Failed to extract content from the website:
https://www.toshiba-lifestyle.com/th-en/blog/how-to-choose-the-right-laundry-pro
duct-for-you, error: 1 validation error for MediaItem
width
  Input should be a valid integer, unable to parse string as an integer
    For further information visit https://errors.pydantic.dev/2.12/v/int_parsing

Code context:
 488                   )
 489
 490           except InvalidCSSSelectorError as e:
 491               raise ValueError(str(e))
 492           except Exception as e:
 493 →             raise ValueError(
 494                   f"Process HTML, Failed to extract content from the
website: {url}, error: {str(e)}"
 495               )
 496
 497           # Extract results - handle both dict and ScrapingResult
 498           if isinstance(result, dict):

It seems that something unusual in their page source is causing the issue.

Martichou avatar Nov 24 '25 09:11 Martichou

@Martichou could you share your code with us? It will help us to narrow down and speed up the debugging process.

Ahmed-Tawfik94 avatar Nov 25 '25 06:11 Ahmed-Tawfik94

Sure, sorry!

version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
    ports:
      - "0.0.0.0:11235:11235"
    environment:
      - CRAWL4AI_ENV=prod
      - MAX_CONCURRENT_TASKS=25
      - MEMORY_THRESHOLD_PERCENT=80
      - PYTHONUNBUFFERED=1
    volumes:
      - ./crawl4ai_data:/data
      - ./logs:/app/logs
    shm_size: '2gb'
    mem_limit: 28G
    mem_reservation: 16G
    deploy:
      resources:
        limits:
          cpus: '15'
          memory: 28G
        reservations:
          cpus: '8'
          memory: 16G
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3

      const requestBody: Crawl4AIRequest = {
        urls: [website],
        browser_config: {
          type: 'BrowserConfig',
          params: {
            extra_args: [
              '--disable-dev-shm-usage',
              '--disable-gpu',
              '--no-sandbox',
            ],
          },
        },
        crawler_config: {
          type: 'CrawlerRunConfig',
          params: {
            magic: true,
            cache_mode: 'bypass',
            page_timeout: 60000,
            markdown_generator: {
              type: 'DefaultMarkdownGenerator',
              params: {},
            },
            ...(withProxy
              ? {
                  proxy_config: {
                    type: 'ProxyConfig',
                    params: {
                      server: this.proxyServer,
                      username: this.proxyUsername,
                      password: this.proxyPassword,
                    },
                  },
                }
              : {}),
          },
        },
      };

      const response = await fetch(`${this.apiUrl}/crawl`, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          ...(this.apiSecret ? { 'X-API-Key': `${this.apiSecret}` } : {}),
        },
        body: JSON.stringify(requestBody),
      });

Martichou avatar Nov 25 '25 23:11 Martichou

@Martichou thank you, I will take a look at it.

Ahmed-Tawfik94 avatar Nov 27 '25 03:11 Ahmed-Tawfik94

Full Root Cause (Simplified)

What Triggered It

The website's HTML had a malformed <img> tag with width="banner-Ho". That is not a number, just an arbitrary string (probably a typo or bad CMS output). HTML expects width to be an integer number of pixels, but this site broke the rules.

How It Broke the Code

  • The crawler (in async_webcrawler.py) called the scraping strategy to process images.
  • In content_scraping_strategy.py, the process_image() function grabbed the width attribute.
  • It checked whether the value looked like a number with w.isdigit(), but since "banner-Ho" isn't all digits, it skipped setting width, or so it should have.
  • Somehow, the string "banner-Ho" still ended up in the data dict passed to Pydantic's MediaItem model.
  • Pydantic tried to turn it into an integer (as per width: Optional[int]), failed, and threw a validation error, crashing the whole crawl with a 500.

Why the Code Didn't Handle It

The code assumed HTML attributes would be valid, but the web is full of junk. It had a basic check, but no real error handling for non-numeric values. Pydantic will coerce a numeric string such as "640" to an integer, but it does not guess or skip when the string is not numeric: it raises a validation error.

Suggested Fix

To make the crawler tougher on bad HTML:

  • In content_scraping_strategy.py, change the width parsing to use try-except instead of just isdigit():
    if w := img.get("width"):
        try:
            base_info["width"] = int(w)
        except ValueError:
            pass  # Skip invalid widths
    
  • In models.py, add a Pydantic validator to MediaItem for extra safety:
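    # Requires: from pydantic import field_validator
    @field_validator('width', mode='before')
    @classmethod
    def validate_width(cls, v):
        # Runs before Pydantic's own int parsing (mode='before').
        if isinstance(v, str):
            try:
                return int(v)
            except ValueError:
                return None  # Treat unparseable strings as "no width"
        return v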
    

This skips bad attributes and sanitizes any that slip through, preventing crashes.
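
For illustration, here is a small stdlib-only sketch of the same idea. The helper name to_optional_int is hypothetical and is not part of crawl4ai; it only shows the parsing behavior the two changes above aim for. It also shows one reason try/except is safer than isdigit() alone: isdigit() accepts characters such as "²" that int() rejects.

```python
from typing import Optional


def to_optional_int(value: object) -> Optional[int]:
    """Return value as an int, or None when it cannot be parsed.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    try:
        # int() tolerates surrounding whitespace but rejects "640px", "banner-Ho", etc.
        return int(str(value).strip())
    except ValueError:
        return None


if __name__ == "__main__":
    for raw in ["640", " 640 ", "640px", "banner-Ho", "²", None]:
        print(repr(raw), "->", to_optional_int(raw))
```

Using one helper like this in both the scraper and the model validator would keep the two code paths consistent, although whether that fits the project's structure is a separate question.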

Ahmed-Tawfik94 avatar Nov 27 '25 04:11 Ahmed-Tawfik94

@Ahmed-Tawfik94 Are you sure about that? Because looking at the source code of the website, I don't see an image with a width of "banner-Ho". What I see is the following:

<picture>
    <source media="(max-width:640px), (max-aspect-ratio:1/1) and (max-width: 1200px)" data-srcset="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/mobile banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" srcset="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/mobile banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg">
    <img src="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/desktop banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" data-src="//web-res.midea.com/content/dam/smartlife/thailand/how-to-choose-the-right-laundry-product-for-you/website-blog-images/desktop banner-How to choose the right laundry product for you.jpg/jcr:content/renditions/cq5dam.web.5000.5000.jpeg" class="swiper-banner-img swiper-lazy blur-up lazyloaded">
</picture>

Notice the srcset: the URL in it contains spaces, which seems to break the tokenizer being used. I may be wrong, but that is the only place on the page where I see banner-Ho.
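
To illustrate the suspicion, here is a minimal sketch. The function names naive_parse_srcset and strict_parse_srcset are hypothetical and are not taken from crawl4ai. The sketch only assumes a parser that splits each srcset candidate on whitespace and strips a trailing "w" from the second token, as it would for a descriptor like "640w". Under that assumption, the token banner-How becomes exactly banner-Ho.

```python
from typing import Dict, List

# The srcset value from the <source> tag above, split only for readability.
SRCSET = (
    "//web-res.midea.com/content/dam/smartlife/thailand/"
    "how-to-choose-the-right-laundry-product-for-you/website-blog-images/"
    "mobile banner-How to choose the right laundry product for you.jpg"
    "/jcr:content/renditions/cq5dam.web.5000.5000.jpeg"
)


def naive_parse_srcset(srcset: str) -> List[Dict[str, object]]:
    """Hypothetical parser: trusts that the second token is a width descriptor."""
    variants = []
    for candidate in srcset.split(","):
        tokens = candidate.split()
        if not tokens:
            continue
        # Assumption: anything after the URL is "<number>w", so strip the "w".
        width = tokens[1].rstrip("w") if len(tokens) > 1 else None
        variants.append({"url": tokens[0], "width": width})
    return variants


def strict_parse_srcset(srcset: str) -> List[Dict[str, object]]:
    """Same split, but keep a width only if the descriptor is digits plus 'w'."""
    variants = []
    for candidate in srcset.split(","):
        tokens = candidate.split()
        if not tokens:
            continue
        descriptor = tokens[1] if len(tokens) > 1 else ""
        is_width = descriptor.endswith("w") and descriptor[:-1].isdecimal()
        width = int(descriptor[:-1]) if is_width else None
        variants.append({"url": tokens[0], "width": width})
    return variants


if __name__ == "__main__":
    print(naive_parse_srcset(SRCSET)[0]["width"])   # banner-Ho
    print(strict_parse_srcset(SRCSET)[0]["width"])  # None
```

If the real parser behaves like the naive version, the bad width would come from the srcset rather than from an <img width="..."> attribute, so the fix might need to go into the srcset handling as well.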

Martichou avatar Nov 27 '25 07:11 Martichou