[Bug]: remove_overlay_elements is not working

Open ederuiter opened this issue 11 months ago • 4 comments

crawl4ai version

2025-feb-alpha-1

Expected Behavior

Adding remove_overlay_elements=True to your config should remove overlays from the scraped pages.

Current Behavior

It does not remove any overlays.

Is this reproducible?

Yes

Inputs Causing the Bug

Any page will have this problem. The current code injects the JS from js_snippet/remove_overlay_elements.js, but that code is never executed: https://github.com/unclecode/crawl4ai/blob/15fd96db17fe748a2ac1cbde3b11a7f4d8805b30/crawl4ai/async_crawler_strategy.py#L1772 wraps the code in an anonymous function but never actually calls that function.
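
A minimal sketch of the failure mode described above, runnable in Node. It assumes the injected snippet is itself an anonymous function expression; the snippet body and the `overlaysRemoved` flag are hypothetical stand-ins, not the project's actual code.

```javascript
// Hypothetical stand-in for the contents of remove_overlay_elements.js:
// an anonymous function expression, which does nothing until it is called.
const snippet = `() => { globalThis.overlaysRemoved = true; }`;

// Wrapping the snippet only evaluates the function expression; nothing calls it.
const wrappedOnly = `(() => { (${snippet}); })()`;

// Appending () to the inner expression actually invokes it.
const wrappedAndInvoked = `(() => { (${snippet})(); })()`;

globalThis.overlaysRemoved = false;

eval(wrappedOnly);
const afterWrapOnly = globalThis.overlaysRemoved;

eval(wrappedAndInvoked);
const afterInvoke = globalThis.overlaysRemoved;

console.log({ afterWrapOnly, afterInvoke });
```

The flag only changes after the second `eval`, which is the difference between defining the inner function and calling it.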

Steps to Reproduce


Code snippets


OS

any

Python version

any

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

ederuiter Feb 17 '25 13:02

An easy workaround is to add your own js_code to the CrawlerRunConfig, for example:

js_code="""
// Scroll to the bottom of the page first.
document.body.scrollIntoView(false);
// Remove every element positioned as fixed or sticky.
const elements = document.querySelectorAll("*");
elements.forEach((elem) => {
  const style = window.getComputedStyle(elem);
  if (style.position === "fixed" || style.position === "sticky") {
    elem.remove();
  }
});
""",

ederuiter Feb 17 '25 14:02

Confirming it's also not working on my side.

ziudeso Mar 22 '25 11:03

RCA

The overlay removal script in crawl4ai/crawl4ai/js_snippet/remove_overlay_elements.js failed to detect and remove scroll-dependent overlays because it did not trigger a scroll before attempting removal. Many modern websites only show certain overlays, popups, and banners after the user has scrolled to a specific position on the page.

Solution

Added document.body.scrollIntoView(false) to scroll to the bottom of the page before running the removal logic, triggering any scroll-dependent overlays to appear. Also increased the timeout from 100ms to 250ms to allow these elements to fully render.
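
The order of operations described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: the function name `scrollThenRemoveOverlays` is hypothetical, the fixed/sticky check is borrowed from the workaround earlier in this thread, and `doc` and `win` are passed in as parameters only so the sketch can be exercised outside a browser.

```javascript
// Hypothetical helper: scroll, wait, then remove overlay-like elements.
async function scrollThenRemoveOverlays(doc, win, delayMs = 250) {
  // Scroll to the bottom so scroll-dependent overlays get a chance to appear.
  doc.body.scrollIntoView(false);

  // Wait for late-rendering elements (250 ms, per the fix description).
  await new Promise((resolve) => setTimeout(resolve, delayMs));

  // Assumption: treat fixed or sticky elements as overlays, as the workaround does.
  let removed = 0;
  for (const elem of doc.querySelectorAll("*")) {
    const { position } = win.getComputedStyle(elem);
    if (position === "fixed" || position === "sticky") {
      elem.remove();
      removed += 1;
    }
  }
  return removed;
}
```

In a page context this would be called as `scrollThenRemoveOverlays(document, window)`. Removing every fixed or sticky element is a blunt heuristic and can also strip legitimate sticky headers.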

With this fix, the script now successfully removes all the overlay elements that the workaround seems to be targeting.

I tested this with the following site: without this fix, the cookie disclaimer text is also scraped and ends up in the final markdown, but with the fix it gets eliminated.

aravindkarnam Mar 31 '25 07:03

@ederuiter @ziudeso Can you try this with any URLs you faced issues with and let me know if anything else needs fixing?

The updated code is in the bug fix branch for March.

aravindkarnam Mar 31 '25 07:03