
[Bug]: result formats markdown and cleaned_html will include damaged html tables

Open Blackvz opened this issue 11 months ago • 2 comments

crawl4ai version

0.4.248

Expected Behavior

When I crawl a page with an HTML table on it (for example: https://www.german-tigers.de/trainingszeiten.php), the table should be exported correctly, at least in cleaned_html. When I look at the html format of the result, the table is correct there, probably because this output is raw and not cleaned. But a table should also be exported correctly in cleaned_html. If columns or rows are missing, that's a bug.

Current Behavior

Empty columns in an HTML table are removed. This makes the table invalid, and an LLM cannot properly extract data from it, because the table is already wrong in the cleaned_html.
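To illustrate the failure mode, here is a standalone sketch (not crawl4ai's actual cleaning code): a naive "drop empty elements" pass over a table removes empty cells, so data rows no longer line up with the header.

```python
import xml.etree.ElementTree as ET

# A well-formed table: the "Coach" column is empty in the data row.
raw = ("<table>"
       "<tr><th>Day</th><th>Time</th><th>Coach</th></tr>"
       "<tr><td>Mon</td><td>18:00</td><td></td></tr>"
       "</table>")

table = ET.fromstring(raw)

# Naive cleaning pass (hypothetical stand-in for the real logic):
# drop any cell that has no text content.
for row in table.findall("tr"):
    for cell in list(row):
        if not (cell.text or "").strip():
            row.remove(cell)

# The data row now has only two cells against three header columns,
# so "Coach" can no longer be matched to any value.
print(ET.tostring(table, encoding="unicode"))
```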

Is this reproducible?

Yes

Inputs Causing the Bug

- Test URL (https://www.german-tigers.de/trainingszeiten.php)
- Use the AsyncWebCrawler and just run .arun() on that URL. No config needed. Check the cleaned_html output and you will see that the table is wrong.

Steps to Reproduce

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.german-tigers.de/trainingszeiten.php",
        )

        print(result.cleaned_html)
        return result.cleaned_html

if __name__ == "__main__":
    asyncio.run(main())


OS

macOS

Python version

3.12

Browser

Arc

Browser version

1.83.1

Error logs & Screenshots (if applicable)


Blackvz avatar Feb 27 '25 21:02 Blackvz

@Blackvz Thanks for reporting this very important issue!

RCA

Scraping strategies in Crawl4AI remove empty elements that don't contain any text or don't meet the MIN_WORD_THRESHOLD count (unless they are certain special tags). The reason for this is to exclude ornamental elements (present in the page only for visual flair) from the final output and keep it clean for processors further down the line.

However, you are right in pointing out that when table elements are removed by the same criteria, it breaks the fundamental structure you'd expect from a table (one row ends up with three cells while another has five). So by adding the table-related tags to the special tags that are excluded from the empty/min-word-threshold checks, we'd be able to retain empty table cells in the cleaned_html.

Fix suggestions

In WebScrapingStrategy, add the following lines to the _process_html function to keep table elements regardless of whether they have any content:

# Special case for table elements - always preserve structure
if element.name in ["tr", "td", "th"]:
    keep_element = True

As you can see, I'm only keeping the table-related tags whose removal, when empty, would throw off the structure of the table.

Now do the same for LXMLScrapingStrategy by updating the remove_empty_elements_fast function. It has a bypass_tags list (the special tags I was referring to); add "tr", "td", and "th" to it.
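In rough Python, the intended bypass logic looks like this (a sketch based on the description above — bypass_tags is the variable name mentioned, but the surrounding function shape and the pre-existing entries in the list are assumptions):

```python
# Assumed shape of the empty-element check; only the bypass_tags name
# comes from the description above, the rest is illustrative.
bypass_tags = ["br", "hr", "img"]   # hypothetical existing entries
bypass_tags += ["tr", "td", "th"]   # the proposed fix: always keep table cells

def is_removable(tag: str, text: str) -> bool:
    """Return True if an empty element may be pruned from the tree."""
    if tag in bypass_tags:
        return False                # special/structural tags always survive
    return not text.strip()         # everything else: prune when empty
```

With a check like this in place, an empty td survives the cleaning pass, so every row keeps the same number of cells as the header.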

Doing this preserves the empty cells, thereby keeping table structures intact in cleaned_html and in further processing down the line. This is a very important callout @Blackvz. Thanks a bunch.

Now, we can't get a fix into the next release (it's already tested and ready to go in the next couple of days), but we'll plan it for the one after that.

aravindkarnam avatar Mar 01 '25 12:03 aravindkarnam

I'm very glad I could help 👍🏻

Also thanks for the quick answer, the great library and the upcoming fix!

Blackvz avatar Mar 01 '25 17:03 Blackvz