
[Self-Host] Waterfalling into a permanent loop when blocked by anti-bot

Open krim404 opened this issue 3 months ago • 2 comments

In the past few weeks, my Firecrawl process has frequently encountered a critical failure where it becomes completely unresponsive, stuck in an infinite loop. The log file repeatedly shows the same entries, without any progress and without any kind of timeout...

Any idea why?

nuq-worker-4   {"level":"info","message":"Waterfalling to next engine...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"a3dde25ece9ea351","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4   {"level":"info","message":"Scraping via document...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"a3dde25ece9ea351","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":15000}
nuq-worker-4   {"level":"debug","message":"Document was blocked by anti-bot, prefetching with chrome-cdp","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"e76c496ea97e95c3","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Scraping URL \"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films\"...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Selected engines","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","selectedEngines":[{"engine":"pdf","supportScore":20,"unsupportedFeatures":{}},{"engine":"document","supportScore":20,"unsupportedFeatures":{}}],"span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Scraping via pdf...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4   {"error":{"engine":"pdf","error":{"message":"Engine pdf was unsuccessful","name":"EngineUnsuccessfulError","stack":"EngineUnsuccessfulError: Engine pdf was unsuccessful\n    at scrapePDF (/app/dist/src/scraper/scrapeURL/engines/pdf/index.js:262:31)\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:443:12)\n    at async scrapeURLLoopIter (/app/dist/src/scraper/scrapeURL/index.js:201:26)\n    at async /app/dist/src/scraper/scrapeURL/index.js:315:37\n    at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)"},"message":"WrappedEngineError","name":"WrappedEngineError","stack":"WrappedEngineError: WrappedEngineError\n    at /app/dist/src/scraper/scrapeURL/index.js:319:31\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:9:12)\n    at async processJob (/app/dist/src/services/worker/scrape-worker.js:145:26)\n    at async processJobWithTracing (/app/dist/src/services/worker/scrape-worker.js:832:36)"},"level":"warn","message":"An unexpected error happened while scraping with 
pdf.","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Waterfalling to next engine...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4   {"level":"info","message":"Scraping via document...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":15000}
nuq-worker-4   {"level":"debug","message":"Document was blocked by anti-bot, prefetching with chrome-cdp","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"e76c496ea97e95c3","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Scraping URL \"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films\"...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Selected engines","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","selectedEngines":[{"engine":"pdf","supportScore":20,"unsupportedFeatures":{}},{"engine":"document","supportScore":20,"unsupportedFeatures":{}}],"span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4   {"level":"info","message":"Scraping via pdf...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4   {"error":{"engine":"pdf","error":{"message":"Engine pdf was unsuccessful","name":"EngineUnsuccessfulError","stack":"EngineUnsuccessfulError: Engine pdf was unsuccessful\n    at scrapePDF (/app/dist/src/scraper/scrapeURL/engines/pdf/index.js:262:31)\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:443:12)\n    at async scrapeURLLoopIter (/app/dist/src/scraper/scrapeURL/index.js:201:26)\n    at async /app/dist/src/scraper/scrapeURL/index.js:315:37\n    at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)"},"message":"WrappedEngineError","name":"WrappedEngineError","stack":"WrappedEngineError: WrappedEngineError\n    at /app/dist/src/scraper/scrapeURL/index.js:319:31\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n    at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:9:12)\n    at async processJob (/app/dist/src/services/worker/scrape-worker.js:145:26)\n    at async processJobWithTracing (/app/dist/src/services/worker/scrape-worker.js:832:36)"},"level":"warn","message":"An unexpected error happened while scraping with 
pdf.","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}

krim404 avatar Nov 01 '25 22:11 krim404

Same issue. I was looking into whether I could disable those engines entirely, but it looks like there are some fixes on the way; hopefully they will get merged soon! Thanks for raising the issue.

JohnGemstone avatar Nov 18 '25 06:11 JohnGemstone

Same over here: some pages make the scraping process loop over and over, requiring a service restart. Hopefully the pending fix will solve it. Any target date for merging the fix?

emonget avatar Nov 23 '25 16:11 emonget

Three pull requests, zero merges. Any updates?

frost19k avatar Dec 10 '25 02:12 frost19k

Same issue here; any update on this?

sheldonxxxx avatar Dec 12 '25 11:12 sheldonxxxx

Does anyone know a workaround while the pull request is pending? Experiencing the same issue...

MaggiR avatar Dec 13 '25 13:12 MaggiR

> Does anyone know a workaround while the pull request is pending? Experiencing the same issue...

afaik no

krim404 avatar Dec 14 '25 17:12 krim404

Same as this issue, which they marked as fixed even though the code highlighted there was not changed: https://github.com/firecrawl/firecrawl/issues/2056

jaredcdep avatar Dec 18 '25 09:12 jaredcdep

> Does anyone know a workaround while the pull request is pending? Experiencing the same issue...

The easiest workaround is to make a request first, and only pass URLs with a 200 response status code to Firecrawl. This is what I'm doing now.

sheldonxxxx avatar Dec 18 '25 14:12 sheldonxxxx

Strange; it seems this issue has persisted for half a year. The earliest report I found is https://github.com/firecrawl/firecrawl/issues/1657, from June 2025.

krim404 avatar Dec 18 '25 17:12 krim404

I am attempting to apply the diffs from https://github.com/firecrawl/firecrawl/pull/2381 onto my local v2.7.0 tag.

I needed to ask Copilot a few things, as TS is not my most familiar language.

jaredcdep avatar Dec 19 '25 09:12 jaredcdep

Looking now: https://github.com/firecrawl/firecrawl/pull/2364 may be a smaller change, and more maintainable as a temporary fork.

@krim404 by any chance, have you tried the change against v2.7.0?

Once I am done testing https://github.com/firecrawl/firecrawl/pull/2381, I'll try your PR as well.

Maybe we can build an image on one of our GH forks as a workaround until this is fixed.

jaredcdep avatar Dec 19 '25 09:12 jaredcdep

> The easiest workaround is to make a request first, and only pass URLs with a 200 response status code to Firecrawl. This is what I'm doing now.

@sheldonxxxx Can you explain what you mean by "make a request first"?

emonget avatar Dec 19 '25 09:12 emonget

> @sheldonxxxx Can you explain what you mean by "make a request first"?

Make a GET request to the target URL first, and pass it to Firecrawl only if the response status code is 200.
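That precheck can be sketched roughly like this; a minimal TypeScript sketch where the helper names and the self-hosted endpoint path are my assumptions, not something from this thread:

```typescript
// Sketch of the precheck workaround: GET the URL yourself first, and only
// hand it to Firecrawl when the plain request came back with status 200.
// (Helper names and the endpoint path below are illustrative assumptions.)

// Pure decision: only a 200 response should be forwarded to Firecrawl.
function shouldForwardToFirecrawl(statusCode: number): boolean {
  return statusCode === 200;
}

// Precheck a URL before submitting it, so a 403/429 from anti-bot
// protection never enters the engine waterfall in the first place.
async function precheckAndScrape(url: string): Promise<boolean> {
  const res = await fetch(url, { redirect: "follow" });
  if (!shouldForwardToFirecrawl(res.status)) {
    console.warn(`Skipping ${url}: precheck returned ${res.status}`);
    return false;
  }
  // e.g. POST to your self-hosted instance here (path may differ by version):
  // await fetch("http://localhost:3002/v1/scrape", { method: "POST", ... });
  return true;
}
```

This trades one extra request per URL for never feeding a blocked URL into the waterfall.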

sheldonxxxx avatar Dec 19 '25 09:12 sheldonxxxx

Would the SCRAPEURL_ENGINE_WATERFALL_DELAY_MS config item affect this issue, or does it only affect the main scrape jobs?

https://github.com/firecrawl/firecrawl/blob/main/apps/api/src/scraper/scrapeURL/index.ts#L522

jaredcdep avatar Dec 19 '25 09:12 jaredcdep

> Make a GET request to the target URL first, and pass it to Firecrawl only if the response status code is 200.

Hmm, seems you're right. I thought it would return 200 in any case, but after trying curl or wget on a failing URL, I got status 403. So yes, this seems to be a viable workaround that I will use until the bug is solved. Cheers.

emonget avatar Dec 19 '25 10:12 emonget

> Looking now: #2364 may be a smaller change, and more maintainable as a temporary fork.

> @krim404 by any chance, have you tried the change against v2.7.0?

I just started my CI/CD workflow and built it against the current latest; the automerge worked fine. If I don't post anything in the next few hours, everything should be fine.

krim404 avatar Dec 19 '25 12:12 krim404

I have tested a locally built image, and I think the fix works, but it appears that it just tries to crawl the next URL just as fast. I wonder if the delay option is not being honoured by the nuq worker?

Edit: just to clarify, I was testing a modified version of https://github.com/firecrawl/firecrawl/pull/2381, as I am already on v2.7.0.

jaredcdep avatar Dec 19 '25 15:12 jaredcdep

I had to add extra delay elements to the crawl config; the delay option alone does nothing at all:

{
  "url": "https://www.abcdef.com/",
  ...
  "delay": 2,
  "maxConcurrency": 1,
  "scrapeOptions": {
    "waitFor": 1000,
  ...
  }
}
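For reference, a sketch of submitting a crawl with those options from TypeScript. The base URL, endpoint path, and field comments are my assumptions; only the option names come from the config above:

```typescript
// Crawl options mirroring the config above: serialize requests and add
// per-request delays so the target's rate limiting is not tripped.
const crawlBody = {
  url: "https://www.example.com/",
  delay: 2,            // seconds between requests
  maxConcurrency: 1,   // one request at a time against the target
  scrapeOptions: {
    waitFor: 1000,     // ms to wait before scraping the loaded page
  },
};

// Submit to a self-hosted instance (endpoint path may differ by version).
async function submitCrawl(baseUrl: string): Promise<number> {
  const res = await fetch(`${baseUrl}/v1/crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(crawlBody),
  });
  return res.status;
}
```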

jaredcdep avatar Dec 19 '25 15:12 jaredcdep

I have added a few quick-and-dirty changes; I need to get a crawl done this weekend:

Add a 30s delay when encountering 403 and 429 (there is no built-in 429 back-off logic, it seems! And e.g. AWS WAF blocks rate-limit-exceeded requests with 403 by default).

Just after https://github.com/firecrawl/firecrawl/blob/main/apps/api/src/scraper/scrapeURL/index.ts#L394:

const isLikelyProxyError = [401, 403, 429].includes(
  engineResult.statusCode,
);

Added

// Pause for 30s on engineResult.statusCode of 403 or 429
// TODO: make the pause length a config option
if ([403, 429].includes(engineResult.statusCode)) {
  meta.logger.warn(
    `Engine ${engine} received status code ${engineResult.statusCode}, pausing for 30s to avoid further rate limiting.`,
  );
  await new Promise(resolve => setTimeout(resolve, 30000));
}
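A slightly more general variant of that pause, with a capped exponential back-off instead of a fixed 30s. This is my own sketch; the names are illustrative and only the status codes come from the patch above:

```typescript
// Capped exponential back-off for rate-limit style responses (403/429).
const RATE_LIMIT_CODES = [403, 429];

// attempt 0 -> 30s, attempt 1 -> 60s, attempt 2 -> 120s, capped at 300s.
function backoffMs(attempt: number, baseMs = 30_000, maxMs = 300_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Sleep before the next waterfall iteration when the engine was blocked;
// returns true when a pause actually happened.
async function pauseIfRateLimited(statusCode: number, attempt: number): Promise<boolean> {
  if (!RATE_LIMIT_CODES.includes(statusCode)) return false;
  const delay = backoffMs(attempt);
  console.warn(`Status ${statusCode}: pausing ${delay}ms before retrying`);
  await new Promise(resolve => setTimeout(resolve, delay));
  return true;
}
```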

Also, I'm now not sure whether this whole thing can't be avoided by setting the crawl proxy scrape option to something other than auto (it would just be the proxy option on the scrape config):

{
  "scrapeOptions": {
    "proxy": "basic"
  }
}

jaredcdep avatar Dec 20 '25 12:12 jaredcdep

> I have added a few quick-and-dirty changes; I need to get a crawl done this weekend:

> Add a 30s delay when encountering 403 and 429 (there is no built-in 429 back-off logic, it seems! And e.g. AWS WAF blocks rate-limit-exceeded requests with 403 by default).

Thank you for this suggestion. I also ran into rate limiting (but didn't bother with it so far) and implemented your fix. Maybe you should create a separate bug report + pull request for that issue?

krim404 avatar Dec 20 '25 14:12 krim404

> Thank you for this suggestion. I also ran into rate limiting (but didn't bother with it so far) and implemented your fix. Maybe you should create a separate bug report + pull request for that issue?

I have not coded much in TypeScript before; I am on the DevOps side (the use of Firecrawl is in a unique project).

It's also a bit off-putting that we have had 3 PRs for this issue and none have been looked at (I thank all 3 of you!). I'll try to put together a PR, but I am not sure it will meet their contributing guidelines. The development is also so rapid that a PR to main today could make no sense in a week or two.

The other thing is that, understandably, this project is primarily a commercial tool, where they have proxies etc., so they probably don't run into many rate-limit situations, which is probably why this issue has not been looked at.

I was on v2.0.0 for a while (pinned by image SHA, as Firecrawl does not tag Docker images*), since I only use Firecrawl for basic crawling/scraping, but we were hitting this bug, so I decided to try rebuilding the Docker images (which I never managed to do locally before due to dependency errors).

* This is probably the first project I have used that doesn't tag its Docker images; I don't understand that part.

jaredcdep avatar Dec 20 '25 15:12 jaredcdep

@krim404 I have pushed https://github.com/jaredcdep/firecrawl/pull/1 to my fork.

I had to make a branch from the v2.7.0 tag. I also squashed my commits, as I had my local GitLab CI files in the history.

jaredcdep avatar Dec 20 '25 16:12 jaredcdep