
Support for crawling from secondary IP address

Open teammakdi opened this issue 1 year ago • 1 comments

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Feature

Hi, I see that both HttpCrawler and PuppeteerCrawler support ProxyConfiguration, which requires an HTTP proxy server. However, my use case is to crawl from a secondary IP address on the machine, without a proxy.

Motivation

Plain axios supports sending requests from a secondary IP address present on the machine. Example:


import https from 'node:https';
import axios from 'axios';

// Bind outgoing connections to a secondary IP address on this machine.
const httpsAgent = new https.Agent({
    localAddress: 'x.x.x.x',
    localPort: xxxx,
});

await axios.get('https://api.ipify.org', { httpsAgent })
    .then((response) => {
        console.log('HTTPS Agent: ', response.data); // prints secondary IP address
    })
    .catch((err) => {
        console.error(err);
    });

I was wondering whether this is possible with Crawlee's HttpCrawler, i.e. with the got library. I am not sure whether it would be feasible with PuppeteerCrawler.

Ideal solution or implementation, and any additional constraints

Alternative solutions or implementations

No response

Other context

No response

teammakdi avatar Apr 08 '24 18:04 teammakdi

For HttpCrawler, this was relatively easy.

preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // Bind outgoing requests to the secondary IP address.
        gotOptions.localAddress = secondaryIpAddress;
    },
],

Setting gotOptions.localAddress works.
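
For anyone who wants to reuse the hook above, here is a minimal sketch that wraps it in a small factory. The name `bindLocalAddress` and the IP `192.0.2.10` are placeholders of mine, not part of Crawlee, and the commented-out wiring assumes the usual `HttpCrawler` options (`preNavigationHooks`, `requestHandler`).

```javascript
// Hypothetical helper (not part of Crawlee): builds a pre-navigation hook
// that binds outgoing requests to the given local IP address.
function bindLocalAddress(localAddress) {
    return async (crawlingContext, gotOptions) => {
        // Same assignment as in the hook above, with the address passed in.
        gotOptions.localAddress = localAddress;
    };
}

// Assumed wiring with Crawlee (requires the `crawlee` package):
//
//   import { HttpCrawler } from 'crawlee';
//
//   const crawler = new HttpCrawler({
//       preNavigationHooks: [bindLocalAddress('192.0.2.10')], // placeholder IP
//       async requestHandler({ body }) {
//           console.log(body.toString());
//       },
//   });
//   await crawler.run(['https://api.ipify.org']);
```

The address has to be one that is actually assigned to an interface on the machine; otherwise the connection will probably fail at bind time.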

I am still looking for a solution for PuppeteerCrawler.

I was able to make it work by setting up an HTTP proxy server with Squid; however, I am still looking for an approach that uses the secondary IP address directly.

teammakdi avatar Apr 22 '24 06:04 teammakdi