PR for Download File From Link function
Summary
This PR improves the download_file_from_link utility to support robust, memory-efficient downloads for large datasets and adds a dedicated test suite to ensure correct behaviour under different network conditions.
Motivation
Some of the datasets used in TopoBench (e.g. those hosted on external academic servers) can be:
- Very large, making response.content downloads memory-inefficient.
- Slow or unstable, leading to timeouts or partial downloads.
- Reachable only with verify=False in some cases, which previously wasn't configurable.
The old implementation used a single requests.get call, loaded the entire response into memory, and did not retry on transient failures. This could lead to frequent failures or hangs when downloading large files over slow connections.
What this PR does
1. Improve download_file_from_link
The function download_file_from_link in topobench.data.utils.io_utils is updated to:
- Stream the response in 5 MB chunks instead of loading it all into memory.
- Ensure the target directory exists via os.makedirs(path_to_save, exist_ok=True).
- Support SSL verification control via a verify argument (default True).
- Support a configurable per-chunk read timeout via a timeout argument (default: 60 seconds for the read timeout, 30 seconds for the connection).
- Add retry logic with exponential backoff on failures, controlled by a retries argument.
- Print download progress when content-length is available:
  - Total size (in GB)
  - Percentage completed
  - Approximate download speed (MB/s)
  - ETA in hours and minutes
- Handle unknown content length gracefully and still stream the file.
- Raise an exception after all retry attempts are exhausted, instead of silently failing.
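As a rough illustration of the progress line described above, the arithmetic could look like the sketch below. The helper name format_progress, the output layout, and the use of binary units (1 GB = 1024**3 bytes) are assumptions made for this example and are not taken from the PR.

```python
def format_progress(downloaded_bytes, total_bytes, elapsed_seconds):
    """Return a one-line progress summary (hypothetical helper, not the PR's code)."""
    # Assumption: binary units, i.e. 1 MB = 1024**2 bytes and 1 GB = 1024**3 bytes.
    total_gb = total_bytes / 1024**3
    percent = 100 * downloaded_bytes / total_bytes

    # Average speed since the start of the download; guard against a zero elapsed time.
    if elapsed_seconds > 0:
        speed_mb_s = (downloaded_bytes / 1024**2) / elapsed_seconds
    else:
        speed_mb_s = 0.0

    # ETA is the remaining volume divided by the average speed so far.
    remaining_mb = (total_bytes - downloaded_bytes) / 1024**2
    eta_seconds = remaining_mb / speed_mb_s if speed_mb_s > 0 else 0.0
    eta_hours, remainder = divmod(int(eta_seconds), 3600)
    eta_minutes = remainder // 60

    return (
        f"{percent:.1f}% of {total_gb:.2f} GB | "
        f"{speed_mb_s:.1f} MB/s | ETA {eta_hours}h {eta_minutes}m"
    )


# 512 MB of a 2 GB file downloaded in 64 seconds.
print(format_progress(512 * 1024**2, 2 * 1024**3, 64))
# prints "25.0% of 2.00 GB | 8.0 MB/s | ETA 0h 3m"
```

When content-length is missing, total_bytes is unknown, so a helper like this would simply not be called and the file would still be streamed.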
Behavioural notes:
- For HTTP status codes other than 200, the function logs an error and returns without creating a file (same high-level behaviour as before, but now explicit).
- For persistent network errors (e.g. repeated timeouts), the function retries and finally raises the underlying exception on the last failure.
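The behaviour listed above can be sketched roughly as follows. This is a minimal stand-in under stated assumptions, not the actual TopoBench implementation: the name download_sketch, the default values for file_format and retries, the 2**attempt backoff schedule, and the returned path are all illustrative choices. Progress printing is omitted for brevity.

```python
import os
import time

import requests

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB per chunk, as described above
CONNECT_TIMEOUT = 30  # seconds; the connection timeout mentioned in the PR


def download_sketch(file_link, path_to_save, dataset_name, file_format="tar.gz",
                    verify=True, timeout=60, retries=3):
    """Illustrative stand-in for download_file_from_link (defaults partly assumed)."""
    os.makedirs(path_to_save, exist_ok=True)
    output_path = os.path.join(path_to_save, f"{dataset_name}.{file_format}")

    for attempt in range(retries):
        try:
            response = requests.get(
                file_link,
                stream=True,  # do not load the whole body into memory
                verify=verify,
                timeout=(CONNECT_TIMEOUT, timeout),  # (connect, read)
            )
            if response.status_code != 200:
                # Non-200 responses are reported and no file is created.
                print(f"Failed to download the file (status {response.status_code}).")
                return None
            with open(output_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                    if chunk:  # skip empty chunks
                        f.write(chunk)
            return output_path  # assumption: the real function may return nothing
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # all attempts exhausted: surface the underlying error
            time.sleep(2 ** attempt)  # assumed backoff schedule: 1 s, 2 s, 4 s, ...
```

Catching requests.exceptions.RequestException covers timeouts and connection errors alike; whether the real function catches a narrower or broader set of exceptions is not stated in this description.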
2. Add tests for download_file_from_link
This PR introduces a new test file (e.g. tests/data/utils/test_io_utils.py) containing a test suite for download_file_from_link. The tests use unittest.mock and pytest to cover:
- Successful streaming download with progress:
  - Mocks iter_content with multiple chunks totalling 5 MB.
  - Asserts the output file exists and has the expected size.
- Automatic directory creation:
  - Uses a nested, non-existent directory in path_to_save.
  - Verifies that the directory is created and the file is written.
- HTTP error handling:
  - Mocks a 404 response.
  - Asserts that no file is created.
- Retry on timeout:
  - First requests.get call raises requests.exceptions.Timeout.
  - Second call returns a successful mock response.
  - Verifies that the file is created and that requests.get is called twice.
- Exhausting retries:
  - All requests.get calls raise requests.exceptions.Timeout.
  - Asserts that the function raises after the configured number of retries.
- Support for multiple file formats:
  - Loops over ["zip", "tar", "tar.gz"].
  - Verifies that files with the correct extensions are created.
- Handling empty chunks:
  - Includes empty chunks in iter_content.
  - Ensures the final file size only includes non-empty chunks.
- Unknown content length:
  - Omits the content-length header.
  - Verifies that the file is still correctly written.
- SSL verification toggle:
  - Calls download_file_from_link with verify=False.
  - Asserts that requests.get was invoked with verify=False.
- Custom timeout:
  - Calls the function with a custom timeout value.
  - Asserts that requests.get uses (30, custom_timeout) for (connect, read) timeouts.
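To give a feel for the mocking pattern these tests rely on, here is a small self-contained sketch covering the empty-chunk and unknown-content-length cases. The helpers make_mock_response and write_stream are hypothetical stand-ins written for this example; the real tests exercise download_file_from_link itself and may be structured differently.

```python
import os
import tempfile
from unittest.mock import MagicMock


def make_mock_response(chunks, status_code=200, with_length=True):
    """Build a stand-in for a streamed requests response (hypothetical helper)."""
    response = MagicMock()
    response.status_code = status_code
    total = sum(len(chunk) for chunk in chunks)
    # Omitting the header simulates a server that does not report the size.
    response.headers = {"content-length": str(total)} if with_length else {}
    response.iter_content.return_value = chunks
    return response


def write_stream(response, output_path):
    """Hypothetical stand-in for the chunk-writing loop under test."""
    written = 0
    with open(output_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:  # empty chunks must not contribute to the file
                written += f.write(chunk)
    return written


def test_empty_chunks_and_unknown_length():
    response = make_mock_response([b"data", b"", b"more"], with_length=False)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.zip")
        assert write_stream(response, path) == 8
        assert os.path.getsize(path) == 8
    assert "content-length" not in response.headers
```

In the actual suite, the same kind of mock would typically be returned from a patched requests.get so that the function under test never touches the network.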
Backwards compatibility
- The function name, module location, and core signature (file_link, path_to_save, dataset_name, file_format) are unchanged.
- New keyword arguments (verify, timeout, retries) have sensible defaults and should not break existing call sites.
- The main change in behaviour is that persistent network failures now raise an exception after all retries instead of only printing an error. This makes failures explicit and easier to debug, while not affecting successful downloads.
Testing
- New tests added for download_file_from_link (see tests/data/utils/test_io_utils.py).
- All tests pass locally:
  pytest tests/data/utils/test_io_utils.py