scancode.io

Increase timeout duration for get request in `fetch_http`

Open keshav-space opened this issue 2 years ago • 6 comments

The current timeout of 5 seconds is insufficient for fetching archives such as https://www.busybox.net/downloads/busybox-1.01.tar.bz2 and https://www.uclibc.org/downloads/uClibc-0.9.30.tar.gz, since these websites are a bit slow to respond.

keshav-space, Apr 02 '24 10:04

@keshav-space is this because of https://github.com/nexB/scancode.io/blob/d6389b28841c4edf25075208eaf0708658650d06/scanpipe/pipes/fetch.py#L380 ?

pombredanne, Apr 02 '24 13:04

> @keshav-space is this because of https://github.com/nexB/scancode.io/blob/d6389b28841c4edf25075208eaf0708658650d06/scanpipe/pipes/fetch.py#L380 ?

@pombredanne No. The problem is in `fetch_http`, here: https://github.com/nexB/scancode.io/blob/d6389b28841c4edf25075208eaf0708658650d06/scanpipe/pipes/fetch.py#L99

keshav-space, Apr 02 '24 13:04

> The current timeout of 5 seconds is insufficient for fetching archives

https://docs.python-requests.org/en/latest/user/advanced/#timeouts

> The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine

As a clarification, the timeout value here is not the time available to fetch the whole file, just the time allowed to get a response from the server.

That being said, those servers are extremely slow to answer a connection request, taking over 10 seconds at times. I'm not sure what the best timeout value would be here, as raising it too much may have unwanted consequences.

@keshav-space what's your take?

tdruez, Apr 17 '24 10:04

@tdruez Some of these URLs are very inconsistent in their response time. For example, getting a response from https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.3.11.tar.gz takes around 20 seconds the first time, but requesting the same archive a second time takes less than 5 seconds. Some of these websites are definitely doing delivery optimization.

keshav-space, Apr 17 '24 11:04

@keshav-space So we may want to combine two changes: adding an automatic retry on timeout exceptions, and raising the default timeout value to something around 10 seconds.

tdruez, Apr 17 '24 11:04

@tdruez yes, that should work.

keshav-space, Apr 17 '24 11:04
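
For readers following along, here is a minimal sketch of the combination discussed above: retry on a timeout exception, with a default timeout of about 10 seconds. It is not the actual `fetch_http` code from scanpipe. The names `retry_on_timeout` and `fetch_with_retry`, the retry count, and the delay between attempts are all assumptions made for illustration. Only `requests.get(..., timeout=...)` and `requests.exceptions.Timeout` are real Requests APIs.

```python
import time


def retry_on_timeout(func, timeout_exceptions, retries=2, delay=1.0):
    """Call func(); if it raises one of timeout_exceptions, try again.

    Hypothetical helper: makes at most `retries + 1` attempts in total and
    re-raises the last timeout exception if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except timeout_exceptions:
            if attempt == retries:
                raise
            # Short pause before the next attempt (value is an assumption).
            time.sleep(delay)


def fetch_with_retry(url, timeout=10, retries=2):
    """Hypothetical wrapper around requests.get, not scanpipe's fetch_http.

    A single `timeout` value is applied by Requests to both the connect and
    the read timeout; it is not a limit on the total download time.
    """
    # Third-party import kept local so the helper above stays stdlib-only.
    import requests

    return retry_on_timeout(
        lambda: requests.get(url, timeout=timeout),
        # Base class of both ConnectTimeout and ReadTimeout.
        timeout_exceptions=(requests.exceptions.Timeout,),
        retries=retries,
    )
```

With these assumed defaults, a server that never answers would block for roughly 30 seconds of timeouts plus the pauses before the exception propagates, which is one of the trade-offs of raising the timeout mentioned earlier in the thread.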