[Bug]: result formats markdown and cleaned_html include damaged HTML tables
crawl4ai version
0.4.248
Expected Behavior
When I crawl a page with an HTML table on it (for example: https://www.german-tigers.de/trainingszeiten.php), the table should be exported correctly, at least in cleaned_html. When I look into the html format of the result, the table is correct there, probably because that output is raw and not cleaned. But the table should also be correct in cleaned_html. If columns or rows are missing, that's a bug.
Current Behavior
Empty columns in an HTML table get removed. This makes the table invalid, and an LLM cannot properly extract data from it, because the table is already wrong in the cleaned_html.
Is this reproducible?
Yes
Inputs Causing the Bug
- Test URL (https://www.german-tigers.de/trainingszeiten.php)
- Use the AsyncWebCrawler and just run .arun() on that URL; no config needed. Check the cleaned_html output and you will see that the table is wrong.
Steps to Reproduce
```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.german-tigers.de/trainingszeiten.php",
        )
        print(result.cleaned_html)
        return result.cleaned_html


if __name__ == "__main__":
    asyncio.run(main())
```
OS
macOS
Python version
3.12
Browser
Arc
Browser version
1.83.1
Error logs & Screenshots (if applicable)
@Blackvz Thanks for reporting this very important issue!
RCA
Scraping strategies in Crawl4AI remove empty elements that don't contain any text or don't meet the MIN_WORD_THRESHOLD count (unless they are certain special tags). The reason for this is to exclude ornamental elements (present only for visual flair) from the final output and keep it clean for processors further down the line.
However, you are right to point out that when table elements are removed by the same criteria, it breaks the fundamental structure you'd expect from a table (one row with three cells, another with five). By adding the table-related tags to the special tags that are excluded from the empty/min-word-threshold checks, we can retain empty table cells in the cleaned_html.
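To see why an empty-element filter corrupts tables, here is a minimal, self-contained sketch (not crawl4ai's actual code; the real strategies work on BeautifulSoup/lxml trees, and the stdlib `xml.etree` is used here only for portability):

```python
import xml.etree.ElementTree as ET

html = """<table>
  <tr><th>Day</th><th>Time</th><th>Coach</th></tr>
  <tr><td>Monday</td><td>18:00</td><td></td></tr>
</table>"""

root = ET.fromstring(html)
# Naive cleanup pass: drop any child element that contains no text at all.
for parent in root.iter():
    for child in list(parent):
        if not "".join(child.itertext()).strip():
            parent.remove(child)

rows = root.findall("tr")
print([len(row) for row in rows])  # [3, 2]: the second row lost its empty cell
```

The header row still has three cells but the data row only two, so column alignment is gone and any downstream consumer (including an LLM) will map values to the wrong columns.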
Fix suggestions
In WebScrapingStrategy, add the following lines to the _process_html function to keep table elements regardless of whether they have any content inside them:

```python
# Special case for table elements - always preserve structure
if element.name in ["tr", "td", "th"]:
    keep_element = True
```

As you can see, I'm only keeping the table-related tags that would throw off the structure of a table when they are empty and deleted.
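In isolation, the special case slots into a pruning pass roughly like this. This is a hedged sketch: the function name `prune_empty`, the `PRESERVE_TAGS` set, and the stdlib `xml.etree` traversal are illustrative, not crawl4ai's actual implementation.

```python
import xml.etree.ElementTree as ET

PRESERVE_TAGS = {"tr", "td", "th"}  # rows/cells survive even when empty


def prune_empty(parent):
    """Recursively drop empty elements, except table-structure tags."""
    for child in list(parent):
        prune_empty(child)
        keep_element = child.tag in PRESERVE_TAGS  # special case for tables
        if not keep_element and not "".join(child.itertext()).strip():
            parent.remove(child)


root = ET.fromstring(
    "<table><tr><td>Monday</td><td>18:00</td><td></td></tr></table>"
)
prune_empty(root)
print(len(root.find("tr")))  # 3: the empty <td> is retained
```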
Now do the same for LXMLScrapingStrategy by updating the remove_empty_elements_fast function. It has a bypass_tags array variable (the special tags I was referring to); add "tr", "td", and "th" to it.
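The bypass-list idea can be illustrated with a tiny predicate. Note the original contents of `bypass_tags` and the `should_remove` helper shown here are hypothetical, used only to show the effect of extending the list:

```python
# Hypothetical original bypass list (the real one lives in
# remove_empty_elements_fast in crawl4ai's LXMLScrapingStrategy).
bypass_tags = ["br", "hr", "img", "input"]
bypass_tags += ["tr", "td", "th"]  # the suggested addition


def should_remove(tag, text):
    """True when an element is empty and not on the bypass list."""
    return tag not in bypass_tags and not (text or "").strip()


print(should_remove("td", ""))   # False: empty table cells are now kept
print(should_remove("div", ""))  # True: ornamental empty elements still go
```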
Doing this preserves the empty cells, thereby keeping table structures intact in cleaned_html and in further processing down the line. This is a very important callout @Blackvz. Thanks a bunch.
We can't get a fix into the next release (it's already tested and ready to go in the next couple of days), but we'll plan it for the one after that.
I'm very glad I could help 👍🏻
Also thanks for the quick answer, the great library and the upcoming fix!