
Incorrect scraped content (another page's content is scraped)

Open jtha opened this issue 1 year ago • 1 comments

I noticed some strange behaviour while doing retrieval: the content returned for a given URL is sometimes that of a different page. I have replicated this a few times, and so far it appears to be triggered by setting magic=True. My guess is that simulating user behaviour may be inadvertently clicking a link on the page.

Turning magic off and enabling the individual protection methods, with the exception of simulate_user=True, seems to make it behave as intended, at least as far as I can see. For reference, this was happening on Weaviate's documentation pages, which have links basically everywhere: nav bar, sidebar, and main content area.
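The workaround described above could be captured roughly like this. It is only a sketch: `crawl_kwargs` is a hypothetical helper, `magic` and `simulate_user` are the options named in this report, and `override_navigator` is an assumption about which other protection option is meant.

```python
# Hypothetical helper that collects the settings from the workaround above.
# `magic` and `simulate_user` come from this report; `override_navigator`
# is an assumed example of one of the other protection options.
def crawl_kwargs(simulate_user: bool = False) -> dict:
    return {
        "magic": False,                  # magic mode off, as in the workaround
        "simulate_user": simulate_user,  # left off by default
        "override_navigator": True,      # assumed: another protection kept on
    }


if __name__ == "__main__":
    print(crawl_kwargs())
```

Assuming the installed version accepts these as keyword arguments, they could be passed along the lines of `crawler.arun(url=url, **crawl_kwargs())`.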

jtha avatar Nov 15 '24 11:11 jtha

@jtha Thx for using our library, let me work on this and see what is going on over there.

unclecode avatar Nov 20 '24 07:11 unclecode

@jtha I just tried to reproduce this with magic mode, i.e. magic=True, and was unable to. Could you try our latest version, 0.6.0? If the problem still exists, please reopen this issue with a code snippet so it's easier for us to reproduce and root-cause it.

Thanks again for taking the effort to report this.
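For a reproduction snippet, it may help to record whether the crawler ended up on a different URL than the one requested. Below is a minimal sketch using only the standard library. `normalize` and `same_page` are hypothetical names, and how the final URL is read from the crawl result is left to the caller, since that depends on the library version.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple:
    """Reduce a URL to (lowercased host, path without trailing slash)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return (host, path)


def same_page(requested: str, final: str) -> bool:
    """True if both URLs point at the same host and path.

    Scheme, query string and fragment are ignored, so an in-page anchor
    does not count as a different page.
    """
    return normalize(requested) == normalize(final)


if __name__ == "__main__":
    print(same_page("https://example.com/docs/", "https://example.com/docs#intro"))
    print(same_page("https://example.com/docs", "https://example.com/blog"))
```

Logging this check with magic=True and again with it turned off would show whether a navigation away from the requested page is involved.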

aravindkarnam avatar May 08 '25 05:05 aravindkarnam