[Bug]: The LLM strategy always sends all tokens from the URL to the LLM server, even when the URL input is raw HTML content

Open phamngocquy opened this issue 8 months ago • 2 comments

crawl4ai version

0.6.3

Expected Behavior

My example crawler:

llm_strategy = LLMExtractionStrategy(
    llm_config=self.llm_config,
    schema=PdfDoc.model_json_schema(),
    extraction_type="schema",
    instruction="""
    From the crawled content, extract data from html
    - data in html source include pdf file name and href url under data-pdf style attribute
    - One extracted post JSON format should look like this:

        {
            "name": "Volume 1 - 2024 CMD Rate Case - E-filing.pdf",
            "data_pdf": "/DMS/pdfview/``wportal`Documents`DMS`31`2494`/Volume 1 - 2024 CMD Rate Case - E-filing~pdf",
        }
    """,
    input_format="cleaned_html",
)

run_conf = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.DISABLED,
    target_elements=[
        ".btnOpenPdfFile",
        ".btn",
        ".btn-default",
        ".btn-sm",
    ],
    excluded_tags=[
        "header",
        "footer",
        "nav",
        "meta",
        "script",
        "style",
        "iframe",
        "li",
        "ul",
    ],
    prettiify=True,
)


resp = requests.get(url)
async with AsyncWebCrawler(config=self.browser_conf) as crawler:
    result = await crawler.arun(url=f"raw://{resp.text}", config=run_conf)
    assert isinstance(result, CrawlResultContainer)

Expected: only cleaned_html should be sent to the LLM server.

Current Behavior

The LLM extraction strategy sends both the URL and the HTML content to the LLM server. I think we don't handle the case where the URL input is raw HTML (a raw:// URL). As a result, the full raw HTML carried in the URL is always sent to the LLM server, regardless of input_format. This is likely unintended behavior, and it can burn through the token quota.
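
To make the effect concrete, here is a minimal sketch. It is not crawl4ai's actual code: PROMPT_TEMPLATE, build_prompt, and safe_url are hypothetical names that only mimic a prompt built from "URL plus content". It shows how a raw:// URL drags the whole unfiltered page into the prompt, and how replacing that URL with a short placeholder would keep only the cleaned HTML.

```python
# Hypothetical prompt template (assumption): mimics a prompt that embeds
# both the page URL and the extracted content.
PROMPT_TEMPLATE = "URL: {url}\nContent:\n{html}\nInstruction: {instruction}"


def build_prompt(url: str, html: str, instruction: str) -> str:
    """Naive builder: interpolates the URL verbatim, even a raw:// payload."""
    return PROMPT_TEMPLATE.format(url=url, html=html, instruction=instruction)


def safe_url(url: str) -> str:
    """Illustrative guard: swap a raw HTML pseudo-URL for a short placeholder."""
    if url.startswith(("raw://", "raw:")):
        return "raw://[inline html omitted]"
    return url


# A large page where only one element is of interest.
raw_html = (
    "<html><body>"
    + "<div>noise</div>" * 1000
    + '<a class="btnOpenPdfFile" data-pdf="/x.pdf">x.pdf</a>'
    + "</body></html>"
)
# What target_elements / excluded_tags are expected to leave behind.
cleaned_html = '<a class="btnOpenPdfFile" data-pdf="/x.pdf">x.pdf</a>'
url = f"raw://{raw_html}"

naive = build_prompt(url, cleaned_html, "extract pdf links")
guarded = build_prompt(safe_url(url), cleaned_html, "extract pdf links")

# The naive prompt contains the entire raw page; the guarded one does not.
print(len(naive), len(guarded))
```

In the naive prompt the filtered content is only a small part of what gets sent, which would match the quota usage described above. Whether the real fix belongs in the prompt builder or earlier in the pipeline is left open here.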

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

Linux

Python version

3.12.3

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

phamngocquy, Jun 03 '25 07:06

I also ran into this problem. excluded_tags does not take effect, so the LLM quota was exceeded.

lance6716, Jun 22 '25 04:06

Related to https://github.com/unclecode/crawl4ai/issues/1116

@rbushri, I will include the PR in our next release.

ntohidi, Nov 18 '25 09:11

@ntohidi Appreciate the attention to this issue. Is there an ETA for your next release? We are deciding whether to wait for your release or to refactor our current code to avoid raw HTML. Thanks.

smajoseph, Dec 03 '25 17:12