[Self-Host] Waterfalling into a permanent loop when blocked by anti-bot
Over the past few weeks, my Firecrawl process has frequently hit a critical failure where it becomes completely unresponsive, stuck in an infinite loop. The log repeatedly shows the same entries, with no progress and no timeout of any kind...
Any idea why?
nuq-worker-4 {"level":"info","message":"Waterfalling to next engine...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"a3dde25ece9ea351","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4 {"level":"info","message":"Scraping via document...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"a3dde25ece9ea351","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":15000}
nuq-worker-4 {"level":"debug","message":"Document was blocked by anti-bot, prefetching with chrome-cdp","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"e76c496ea97e95c3","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Scraping URL \"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films\"...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Selected engines","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","selectedEngines":[{"engine":"pdf","supportScore":20,"unsupportedFeatures":{}},{"engine":"document","supportScore":20,"unsupportedFeatures":{}}],"span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Scraping via pdf...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4 {"error":{"engine":"pdf","error":{"message":"Engine pdf was unsuccessful","name":"EngineUnsuccessfulError","stack":"EngineUnsuccessfulError: Engine pdf was unsuccessful\n at scrapePDF (/app/dist/src/scraper/scrapeURL/engines/pdf/index.js:262:31)\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:443:12)\n at async scrapeURLLoopIter (/app/dist/src/scraper/scrapeURL/index.js:201:26)\n at async /app/dist/src/scraper/scrapeURL/index.js:315:37\n at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)"},"message":"WrappedEngineError","name":"WrappedEngineError","stack":"WrappedEngineError: WrappedEngineError\n at /app/dist/src/scraper/scrapeURL/index.js:319:31\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)\n at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:9:12)\n at async processJob (/app/dist/src/services/worker/scrape-worker.js:145:26)\n at async processJobWithTracing (/app/dist/src/services/worker/scrape-worker.js:832:36)"},"level":"warn","message":"An unexpected error happened while scraping with pdf.","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Waterfalling to next engine...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4 {"level":"info","message":"Scraping via document...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"10c2bcfb3c2d17c4","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":15000}
nuq-worker-4 {"level":"debug","message":"Document was blocked by anti-bot, prefetching with chrome-cdp","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"e76c496ea97e95c3","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Scraping URL \"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films\"...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Selected engines","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","selectedEngines":[{"engine":"pdf","supportScore":20,"unsupportedFeatures":{}},{"engine":"document","supportScore":20,"unsupportedFeatures":{}}],"span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
nuq-worker-4 {"level":"info","message":"Scraping via pdf...","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e","waitUntilWaterfall":120000}
nuq-worker-4 {"error":{"engine":"pdf","error":{"message":"Engine pdf was unsuccessful","name":"EngineUnsuccessfulError","stack":"EngineUnsuccessfulError: Engine pdf was unsuccessful\n at scrapePDF (/app/dist/src/scraper/scrapeURL/engines/pdf/index.js:262:31)\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:443:12)\n at async scrapeURLLoopIter (/app/dist/src/scraper/scrapeURL/index.js:201:26)\n at async /app/dist/src/scraper/scrapeURL/index.js:315:37\n at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)"},"message":"WrappedEngineError","name":"WrappedEngineError","stack":"WrappedEngineError: WrappedEngineError\n at /app/dist/src/scraper/scrapeURL/index.js:319:31\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async /app/dist/src/scraper/scrapeURL/index.js:325:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async /app/dist/src/scraper/scrapeURL/index.js:671:30\n at async withSpan (/app/dist/src/lib/otel-tracer.js:49:24)\n at async runWebScraper (/app/dist/src/main/runWebScraper.js:59:24)\n at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:9:12)\n at async processJob (/app/dist/src/services/worker/scrape-worker.js:145:26)\n at async processJobWithTracing (/app/dist/src/services/worker/scrape-worker.js:832:36)"},"level":"warn","message":"An unexpected error happened while scraping with pdf.","module":"ScrapeURL","scrapeId":"a25e91fc-9064-4983-ab5e-1f4a03926841","scrapeURL":"https://en.wikipedia.org/wiki/List_of_artificial_intelligence_films","span_id":"04aee25049a838e2","teamId":"bypass","team_id":"bypass","trace_flags":"01","trace_id":"67b4e0f540d9e665ea44119a25d6c91e"}
Same issue here. I was looking into whether I could disable those engines entirely, but it looks like some fixes are on the way; hopefully they get merged soon! Thanks for raising the issue.
Same over here: some pages make the scraping process loop over and over, requiring a service restart. Hopefully the pending fix will solve it. Any target date for merging the fix?
Three pull requests, zero merges. Any updates?
Same issue here, any update on this?
Does anyone know a workaround while the pull request is waiting? Experiencing the same issue...
> Anyone knowing a workaround while the pull request is waiting? Experiencing the same issue...

afaik no
Same as this one, which they marked as fixed even though the code highlighted in the issue was not changed: https://github.com/firecrawl/firecrawl/issues/2056

> Anyone knowing a workaround while the pull request is waiting? Experiencing the same issue...

The easiest workaround is to make a request first and pass only URLs that return status code 200 to Firecrawl. This is what I'm doing now.
Strange, this issue seems to have persisted for half a year: the first report I found is https://github.com/firecrawl/firecrawl/issues/1657 from June 2025.
I am attempting to apply the diffs of https://github.com/firecrawl/firecrawl/pull/2381 onto my local v2.7.0 tag.
I needed to ask Copilot a few things, as TS is not my most familiar language.
Looking now: https://github.com/firecrawl/firecrawl/pull/2364 may be a smaller change, and more maintainable as a temporary fork.
@krim404 by any chance have you tried the change against v2.7.0?
Once I am done testing https://github.com/firecrawl/firecrawl/pull/2381 I'll try your PR as well.
Maybe we can build an image on one of our GH forks as a workaround until this is fixed.
> Anyone knowing a workaround while the pull request is waiting? Experiencing the same issue...
>
> The easiest workaround is make a request first, and only pass url with response status code 200 to firecrawl. This is what I'm doing now.

@sheldonxxxx Can you explain what you mean by "make a request first"?
> Anyone knowing a workaround while the pull request is waiting? Experiencing the same issue...
>
> The easiest workaround is make a request first, and only pass url with response status code 200 to firecrawl. This is what I'm doing now.
>
> @sheldonxxxx Can you explain what you mean by make a request first?

Make a GET request to the target URL first, and pass it to Firecrawl only if the response status code is 200.
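The pre-check described above could be sketched roughly like this (the helper names are my own, not Firecrawl's; assumes Node 18+, where `fetch` is built in):

```typescript
// Sketch of the pre-check workaround: probe each URL with a plain GET
// and hand only the ones that answered 200 to Firecrawl. Helper names
// are hypothetical; assumes Node 18+ with a global fetch.

type StatusChecker = (url: string) => Promise<number>;

// Default checker: issue a GET and report the HTTP status code.
const fetchStatus: StatusChecker = async (url) => {
  const res = await fetch(url, { redirect: "follow" });
  return res.status;
};

// Keep only URLs whose probe returned 200; anything else (e.g. 403
// from a WAF, 429 rate limits) is skipped instead of being fed into
// the engine waterfall.
async function filterScrapableUrls(
  urls: string[],
  check: StatusChecker = fetchStatus,
): Promise<string[]> {
  const ok: string[] = [];
  for (const url of urls) {
    try {
      if ((await check(url)) === 200) ok.push(url);
    } catch {
      // Network error: treat the URL as not scrapable.
    }
  }
  return ok;
}
```

You would then submit only the filtered URLs to your Firecrawl instance. Note this doubles the request count against the target site, so it trades bandwidth for not getting stuck.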
Would the SCRAPEURL_ENGINE_WATERFALL_DELAY_MS config item affect this issue, or does it only affect the main scrape jobs?
https://github.com/firecrawl/firecrawl/blob/main/apps/api/src/scraper/scrapeURL/index.ts#L522
> Anyone knowing a workaround while the pull request is waiting? Experiencing the same issue...
>
> The easiest workaround is make a request first, and only pass url with response status code 200 to firecrawl. This is what I'm doing now.
>
> @sheldonxxxx Can you explain what you mean by make a request first?
>
> Make a GET request to the target url first, pass to firecrawl only if response status code is 200

Hmm, seems you're right. I thought it would return 200 in any case, but after I tried to curl or wget a failing URL, I got status 403. So yes, this seems to be a workaround I'll use until the bug is solved. Cheers.
> looking now - #2364 may be a smaller change - and more maintainable as a temp fork
>
> @krim404 by any chance have you tried the change against v2.7.0?

I just started my CI/CD workflow and built it against the current latest; the automerge worked fine. If I don't post anything in the next few hours, everything should be fine.
I have tested a locally built image, and I think the fix works, but it appears to try crawling the next URL just as fast. I wonder if the delay option is not being honoured by the nuq worker?
Edit: just to clarify, I was testing a modified version of https://github.com/firecrawl/firecrawl/pull/2381, as I am already on v2.7.0.
I had to add extra delay elements to the crawl config; the top-level delay does nothing at all:

```json
{
  "url": "https://www.abcdef.com/",
  ...
  "delay": 2,
  "maxConcurrency": 1,
  "scrapeOptions": {
    "waitFor": 1000,
    ...
  }
}
```
I have added a few quick and dirty changes I needed to get a crawl done this weekend: a 30s delay when encountering 403 and 429 (there is no built-in 429 back-off logic, it seems! And e.g. AWS WAF blocks rate-limit-exceeded requests with 403 by default).

Just after https://github.com/firecrawl/firecrawl/blob/main/apps/api/src/scraper/scrapeURL/index.ts#L394:

```ts
const isLikelyProxyError = [401, 403, 429].includes(
  engineResult.statusCode,
);
```

I added:

```ts
// Pause on engineResult.statusCode of 403 and 429 for 30s
// TODO: make pause length a config option
if ([403, 429].includes(engineResult.statusCode)) {
  meta.logger.warn(
    `Engine ${engine} received status code ${engineResult.statusCode}, pausing for 30s to avoid further rate limiting.`,
  );
  await new Promise(resolve => setTimeout(resolve, 30000));
}
```
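If a fixed 30s pause turns out to be too blunt, the same idea generalizes to a capped exponential backoff. This is a standalone sketch, not Firecrawl code; the names and default values are illustrative:

```typescript
// Standalone sketch of capped exponential backoff for 403/429
// responses, as a more general alternative to a fixed 30 s pause.
// Not Firecrawl code; names and defaults are illustrative.

const BASE_DELAY_MS = 5_000;   // pause before the first retry
const MAX_DELAY_MS = 120_000;  // upper bound on any single pause

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Pause only on the status codes that indicate rate limiting / blocking.
async function pauseForStatus(statusCode: number, attempt: number): Promise<void> {
  if (![403, 429].includes(statusCode)) return;
  await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
}
```

An attempt counter would have to be threaded through the retry loop for this to work; the fixed pause above avoids that complication, which is why it's the quicker hack.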
Also, I'm now not sure whether this whole thing couldn't be avoided by setting the crawl's proxy scrape option to something other than auto (this would just be the proxy option on scrape configs):

```json
{
  "scrapeOptions": {
    "proxy": "basic"
  }
}
```
> I have added a few quick and dirty changes I need to get a crawl done this weekend:
>
> Add a 30s delay when encountering 403 and 429 (there is no built in 429 back off logic it seems!, and eg AWS WAF blocks rate limit exceeded with 403 by default)

Thank you for this suggestion. I also ran into rate limiting (but hadn't bothered with it so far) and implemented your fix. Maybe you should create a separate bug report + pull request for this issue?
> thank you for this suggestion. i also ran into rate limiting (but didnt bother so far) and implemented your fix. Maybe you should create a different bug + pullrequest for this issue?

I have not coded in TypeScript much before; I am on the DevOps side (the use of Firecrawl is in a unique project).
It's also a bit off-putting that we had three PRs for this issue and none have been looked at (I thank all three of you!). I'll try to put together a PR, but I am not sure it will meet their contributing guidelines. Development is also so rapid that a PR to main today could make no sense in a week or two.
The other thing is that, understandably, this project is primarily a commercial tool where they have proxies etc., so they probably don't hit many rate-limit situations, which is probably why this issue has not been looked at.
I was on v2.0.0 for a while (using the image SHA, as Firecrawl does not tag Docker images*), since I only use Firecrawl for basic crawling/scraping, but we were hitting this bug, so I decided to try rebuilding the Docker images (which I never managed to do locally before due to dependency errors).
* This is probably the first project I have used that doesn't tag its Docker images; I don't understand that part.
@krim404 I have pushed https://github.com/jaredcdep/firecrawl/pull/1 to my fork.
I had to make a branch from the v2.7.0 tag; I also squashed my commits, as I had my local GitLab CI files in the history.