Invalid host defined options error in http crawler when used with pkg post ESM edition

Open teammakdi opened this issue 2 years ago • 5 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Error

makdi-test  | DEBUG HttpCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: ce096038-12bd-4473-ab69-d0e0fc741228
makdi-test  | INFO  HttpCrawler: Starting the crawler.
makdi-test  | DEBUG HttpCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 1963 MB. Use the CRAWLEE_MEMORY_MBYTES or CRAWLEE_AVAILABLE_MEMORY_RATIO environment variable to override it.
makdi-test  | DEBUG HttpCrawler:SessionPool: Created new Session - session_MrRYxu57S9
makdi-test  | WARN  HttpCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid host defined options
makdi-test  |     at HttpCrawler._requestFunction (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:423:13)
makdi-test  |     at /snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:94
makdi-test  |     at wrap (/snapshot/makdi/node_modules/@apify/timeout/index.js:52:27)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:66:7
makdi-test  |     at AsyncLocalStorage.run (node:async_hooks:319:14)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:65:13
makdi-test  |     at new Promise (<anonymous>)
makdi-test  |     at addTimeoutToPromise (/snapshot/makdi/node_modules/@apify/timeout/index.js:59:10)
makdi-test  |     at HttpCrawler._handleNavigation (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:76)
makdi-test  |     at async HttpCrawler._runRequestHandler (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:309:13) {"id":"3FULW9cbbkMrV4R","url":"https://crawlee.dev","retryCount":1}
makdi-test  | DEBUG HttpCrawler:SessionPool: Created new Session - session_jF6xxRoKff
makdi-test  | WARN  HttpCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid host defined options
makdi-test  |     at HttpCrawler._requestFunction (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:423:13)
makdi-test  |     at /snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:94
makdi-test  |     at wrap (/snapshot/makdi/node_modules/@apify/timeout/index.js:52:27)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:66:7
makdi-test  |     at AsyncLocalStorage.run (node:async_hooks:319:14)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:65:13
makdi-test  |     at new Promise (<anonymous>)
makdi-test  |     at addTimeoutToPromise (/snapshot/makdi/node_modules/@apify/timeout/index.js:59:10)
makdi-test  |     at HttpCrawler._handleNavigation (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:76)
makdi-test  |     at async HttpCrawler._runRequestHandler (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:309:13) {"id":"3FULW9cbbkMrV4R","url":"https://crawlee.dev","retryCount":2}
makdi-test  | DEBUG HttpCrawler:SessionPool: Created new Session - session_0RMbGUTi5D
makdi-test  | WARN  HttpCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid host defined options
makdi-test  |     at HttpCrawler._requestFunction (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:423:13)
makdi-test  |     at /snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:94
makdi-test  |     at wrap (/snapshot/makdi/node_modules/@apify/timeout/index.js:52:27)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:66:7
makdi-test  |     at AsyncLocalStorage.run (node:async_hooks:319:14)
makdi-test  |     at /snapshot/makdi/node_modules/@apify/timeout/index.js:65:13
makdi-test  |     at new Promise (<anonymous>)
makdi-test  |     at addTimeoutToPromise (/snapshot/makdi/node_modules/@apify/timeout/index.js:59:10)
makdi-test  |     at HttpCrawler._handleNavigation (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:366:76)
makdi-test  |     at async HttpCrawler._runRequestHandler (/snapshot/makdi/node_modules/@crawlee/http/internals/http-crawler.js:309:13) {"id":"3FULW9cbbkMrV4R","url":"https://crawlee.dev","retryCount":3}

Works fine with [email protected] (before ESM migration), fails since [email protected]

See the Dockerfile for reference: https://github.com/teammakdi/makdi-test/blob/main/Dockerfile#L14

Code sample

https://github.com/teammakdi/makdi-test

Package version

3.6.2

Node.js version

18.17.0

Operating system

Ubuntu

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

teammakdi, Dec 07 '23 09:12

Hey, I'm unable to reproduce the issue you're having when running the code in your example repository. Could you confirm if you can consistently reproduce this issue? Does it happen in docker?

vladfrangu, Dec 17 '23 18:12

Hey @vladfrangu

Yes, it only fails via docker. It works fine when run directly on the host machine via node.

Also in docker, it fails when used with https://github.com/vercel/pkg

You can check out the test project and run it via docker:

docker-compose up --build

teammakdi, Dec 18 '23 06:12

Is there any reason you use pkg in docker? 👀 Could you also try to reproduce this without pkg?

vladfrangu, Dec 18 '23 13:12

Hey @vladfrangu, it is reproducible only with the pkg tool.

We are required to ship our app as a binary, which is why we use pkg.

It however works with [email protected], not with [email protected].

teammakdi, Dec 18 '23 13:12

Hmm... I'll try to take a look and see if I can find out why it's breaking when run via pkg, but I can't make any promises. 😅

vladfrangu, Dec 18 '23 14:12
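
The thread stops before a root cause is identified. Because the failure is reported only under pkg and only after the ESM migration, one untested assumption is that a dynamic `import()` fails inside the pkg snapshot. The sketch below is a hypothetical diagnostic to check that. The helper name `probeDynamicImport` is made up here and is not part of Crawlee or pkg. It reports whether `process.pkg` is set (pkg defines it inside packaged executables) and whether a dynamic `import()` of a built-in module succeeds. Running it once with plain node and once inside the pkg binary would show whether the two runtimes differ on this point.

```javascript
// Hypothetical diagnostic helper; not part of Crawlee or pkg.
// Assumption under test: dynamic import() behaves differently inside a pkg binary.
async function probeDynamicImport(specifier = 'node:os') {
    // pkg sets `process.pkg` inside packaged executables; it is undefined under plain node.
    const packaged = typeof process.pkg !== 'undefined';
    try {
        // Try to load the module through dynamic import().
        await import(specifier);
        return { packaged, ok: true, error: null };
    } catch (err) {
        // Keep the error name and message so it can be compared with the reported TypeError.
        return { packaged, ok: false, error: `${err.name}: ${err.message}` };
    }
}

// Print the result as a single JSON line so it is easy to spot in container logs.
probeDynamicImport().then((result) => console.log(JSON.stringify(result)));
```

If the packaged run reports `ok: false` with the same `TypeError: Invalid host defined options`, that would support the assumption. If it reports `ok: true`, the cause lies elsewhere.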