
PR for the download_file_from_link function

Open · marindigen opened this issue 2 months ago · 0 comments

Summary

This PR improves the download_file_from_link utility to support robust, memory-efficient downloads for large datasets and adds a dedicated test suite to ensure correct behaviour under different network conditions.

Motivation

Some of the datasets used in TopoBench (e.g. those hosted on external academic servers) can be:

  • Very large, making response.content downloads memory-inefficient.
  • Slow or unstable, leading to timeouts or partial downloads.
  • Occasionally requiring verify=False, which previously wasn’t configurable.

The old implementation used a single requests.get call, loaded the entire response into memory, and did not retry on transient failures. This could lead to frequent failures or hangs when downloading large files over slow connections.
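For reference, the previous approach looked roughly like this (a hedged sketch reconstructed from the description above, not the exact old code; the signature matches the documented one):

```python
import requests

def download_file_from_link_old(file_link, path_to_save, dataset_name,
                                file_format="zip"):
    # Single request: no timeout, no retries, no streaming.
    response = requests.get(file_link)
    if response.status_code == 200:
        # The entire payload is buffered in memory before being written.
        with open(f"{path_to_save}/{dataset_name}.{file_format}", "wb") as f:
            f.write(response.content)
```

For multi-gigabyte datasets, holding response.content in RAM and having no retry path is exactly what caused the failures described above.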

What this PR does

1. Improve download_file_from_link

The function download_file_from_link in topobench.data.utils.io_utils is updated to:

  • Stream the response in 5MB chunks instead of loading it all into memory.
  • Ensure the target directory exists via os.makedirs(path_to_save, exist_ok=True).
  • Support SSL verification control via a verify argument (default True).
  • Support a configurable per-chunk read timeout via a timeout argument
    (default: 60 seconds; the connection timeout is fixed at 30 seconds).
  • Add retry logic with exponential backoff on failures, controlled by a retries argument.
  • Print download progress when content-length is available:
    • Total size (in GB)
    • Percentage completed
    • Approximate download speed (MB/s)
    • ETA in hours and minutes
  • Handle unknown content length gracefully and still stream the file.
  • Raise an exception after all retry attempts are exhausted, instead of silently failing.
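The combined behaviour of the points above can be sketched as follows. This is an illustrative outline based on the PR description, not the exact merged code; the chunk size, log format, and backoff schedule are assumptions:

```python
import os
import time
import requests

CHUNK_SIZE = 5 * 1024 * 1024  # stream in 5 MB chunks

def download_file_from_link(file_link, path_to_save, dataset_name,
                            file_format="zip", verify=True, timeout=60,
                            retries=3):
    os.makedirs(path_to_save, exist_ok=True)
    target = os.path.join(path_to_save, f"{dataset_name}.{file_format}")

    for attempt in range(retries):
        try:
            response = requests.get(
                file_link, stream=True, verify=verify,
                timeout=(30, timeout),  # (connect, read) timeouts
            )
            if response.status_code != 200:
                print(f"Download failed: HTTP {response.status_code}")
                return  # no file is created, mirroring the old behaviour

            total = int(response.headers.get("content-length", 0))
            done = 0
            start = time.time()
            with open(target, "wb") as f:
                for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                    if not chunk:  # skip keep-alive / empty chunks
                        continue
                    f.write(chunk)
                    done += len(chunk)
                    if total:  # progress only when content-length is known
                        speed = done / max(time.time() - start, 1e-9) / 2**20
                        print(f"{done / total:6.1%} of {total / 2**30:.2f} GB"
                              f" at {speed:.1f} MB/s")
            return
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # surface the error once all retries are exhausted
            time.sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
```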

Behavioural notes:

  • For HTTP status codes other than 200, the function logs an error and returns without creating a file (same high-level behaviour as before, but now explicit).
  • For persistent network errors (e.g. repeated timeouts), the function retries and finally raises the underlying exception on the last failure.

2. Add tests for download_file_from_link

This PR introduces a new test file (e.g. tests/data/utils/test_io_utils.py) containing a test suite for download_file_from_link. The tests use unittest.mock and pytest to cover:

  • Successful streaming download with progress:
    • Mocks iter_content with multiple chunks totalling 5MB.
    • Asserts the output file exists and has the expected size.
  • Automatic directory creation:
    • Uses a nested, non-existent directory in path_to_save.
    • Verifies that the directory is created and the file is written.
  • HTTP error handling:
    • Mocks a 404 response.
    • Asserts that no file is created.
  • Retry on timeout:
    • First requests.get call raises requests.exceptions.Timeout.
    • Second call returns a successful mock response.
    • Verifies that the file is created and that requests.get is called twice.
  • Exhausting retries:
    • All requests.get calls raise requests.exceptions.Timeout.
    • Asserts that the function raises after the configured number of retries.
  • Support for multiple file formats:
    • Loops over ["zip", "tar", "tar.gz"].
    • Verifies that files with the correct extensions are created.
  • Handling empty chunks:
    • Includes empty chunks in iter_content.
    • Ensures the final file size only includes non-empty chunks.
  • Unknown content length:
    • Omits the content-length header.
    • Verifies that the file is still correctly written.
  • SSL verification toggle:
    • Calls download_file_from_link with verify=False.
    • Asserts that requests.get was invoked with verify=False.
  • Custom timeout:
    • Calls the function with a custom timeout value.
    • Asserts that requests.get uses (30, custom_timeout) for (connect, read) timeouts.
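Two of the cases above (retry on timeout, HTTP error handling) can be sketched like this. The real suite imports the function from topobench.data.utils.io_utils; here a minimal stand-in is defined so the sketch is self-contained, and test names are illustrative assumptions:

```python
import os
import tempfile
from unittest import mock

import requests

# Minimal stand-in for download_file_from_link so this sketch runs on
# its own (the real suite imports it from topobench.data.utils.io_utils).
def download_file_from_link(file_link, path_to_save, dataset_name,
                            file_format="zip", retries=2, **kwargs):
    os.makedirs(path_to_save, exist_ok=True)
    for attempt in range(retries):
        try:
            response = requests.get(file_link, stream=True, **kwargs)
            if response.status_code != 200:
                return  # no file on HTTP error
            path = os.path.join(path_to_save, f"{dataset_name}.{file_format}")
            with open(path, "wb") as f:
                for chunk in response.iter_content(chunk_size=5 * 2**20):
                    if chunk:
                        f.write(chunk)
            return
        except requests.exceptions.Timeout:
            if attempt == retries - 1:
                raise

def _mock_response(status=200, chunks=(b"data",)):
    resp = mock.Mock()
    resp.status_code = status
    resp.headers = {"content-length": str(sum(len(c) for c in chunks))}
    resp.iter_content.return_value = list(chunks)
    return resp

def test_retry_on_timeout():
    # First call times out, second succeeds: file written, get called twice.
    effects = [requests.exceptions.Timeout(), _mock_response()]
    with mock.patch("requests.get", side_effect=effects) as get:
        with tempfile.TemporaryDirectory() as d:
            download_file_from_link("http://example.org/x.zip", d, "ds")
            assert get.call_count == 2
            assert os.path.getsize(os.path.join(d, "ds.zip")) == 4

def test_http_error_creates_no_file():
    # A 404 response must leave no file behind.
    with mock.patch("requests.get", return_value=_mock_response(status=404)):
        with tempfile.TemporaryDirectory() as d:
            download_file_from_link("http://example.org/x.zip", d, "ds")
            assert not os.path.exists(os.path.join(d, "ds.zip"))
```

Mocking requests.get keeps the tests hermetic: no network access is needed, and failure modes such as timeouts can be triggered deterministically.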

Backwards compatibility

  • The function name, module location, and core signature (file_link, path_to_save, dataset_name, file_format) are unchanged.
  • New keyword arguments (verify, timeout, retries) have sensible defaults and should not break existing call sites.
  • The main change in behaviour is that persistent network failures now raise an exception after all retries instead of only printing an error. This makes failures explicit and easier to debug, while not affecting successful downloads.

Testing

  • New tests added for download_file_from_link (see tests/data/utils/test_io_utils.py).
  • All tests pass locally:

    pytest tests/data/utils/test_io_utils.py

marindigen · Nov 24 '25 15:11