:bug: Refactor data loaders to be lazy and use generators to prevent memory problems

Open garlontas opened this issue 8 months ago • 2 comments

Summary by Sourcery

Refactor all data loaders (CSV, JSON, XML, YAML) to use lazy generators that yield items on demand instead of loading entire datasets into memory, and update the tests to exercise the new streaming behavior.

New Features:

  • Support reading CSV, JSON, XML, and YAML data directly from source strings via a read_from_src flag
  • Add custom context manager in tests to mock file operations uniformly across loaders

Enhancements:

  • Replace LazyFileIterable with native Python generators for all loaders
  • Unify CSV loader code into separate functions for file and string sources and a shared processing routine
  • Implement lazy JSON, XML, and YAML parsing functions that yield namedtuples progressively
  • Streamline XML parsing to yield elements or nested lists based on retrieve_children configuration
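
The CSV restructuring described above (one entry point, separate file and string sources, one shared processing generator) can be sketched roughly as follows. This is an illustration only, not the code from this PR: `load_csv` and the single-underscore helper names are hypothetical stand-ins, and the `delimiter` and `encoding` parameters are assumptions.

```python
import csv
import io
from collections import namedtuple


def load_csv(src, read_from_src=False, delimiter=",", encoding="utf-8"):
    """Hypothetical entry point: treat src as CSV text or as a file path."""
    if read_from_src:
        return _load_csv_from_string(src, delimiter)
    return _load_csv_from_file(src, delimiter, encoding)


def _load_csv_from_file(path, delimiter, encoding):
    # Generator function: the file is opened on the first next() call and
    # closed once the generator is exhausted or closed.
    with open(path, newline="", encoding=encoding) as handle:
        yield from _process_csv(handle, delimiter)


def _load_csv_from_string(text, delimiter):
    yield from _process_csv(io.StringIO(text), delimiter)


def _process_csv(lines, delimiter):
    """Shared routine: yield one namedtuple per data row."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:  # empty input, nothing to yield
        return
    # rename=True replaces header names that are not valid identifiers
    Row = namedtuple("Row", header, rename=True)
    for values in reader:
        yield Row(*values)
```

One consequence of this shape worth keeping in mind: because the file source is itself a generator, a wrong path is only reported when the first item is requested, not when the loader is called.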

Tests:

  • Update loader tests to consume data via iterators and assert StopIteration at end
  • Add tests verifying loader laziness and custom delimiters for CSV and generator type for YAML
  • Refactor tests to use a mock_csv_file context manager for consistent mocking of file operations
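
A test in the style listed above might look like the sketch below. It uses only the standard library; `read_lines` is a toy loader and `mock_text_file` a hypothetical helper, both invented here for illustration, so the PR's actual `mock_csv_file` may well be implemented differently.

```python
import types
from contextlib import contextmanager
from unittest.mock import mock_open, patch


def read_lines(path):
    """Toy lazy loader that gives the test something to exercise."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


@contextmanager
def mock_text_file(content):
    """Hypothetical helper: patch builtins.open so it serves `content`."""
    with patch("builtins.open", mock_open(read_data=content)) as opener:
        yield opener


def test_loader_is_lazy_and_exhausts():
    with mock_text_file("a\nb\n") as opener:
        lines = read_lines("ignored.txt")
        assert isinstance(lines, types.GeneratorType)
        opener.assert_not_called()  # nothing is opened before the first next()

        assert next(lines) == "a"
        opener.assert_called_once()
        assert next(lines) == "b"

        # The iterator must signal exhaustion with StopIteration.
        try:
            next(lines)
        except StopIteration:
            pass
        else:
            raise AssertionError("expected StopIteration")
```

Checking `assert_not_called()` before the first `next()` is what distinguishes a lazy loader from one that merely returns an iterator over data it has already read.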

garlontas avatar Jun 06 '25 17:06 garlontas

Reviewer's Guide

This PR refactors the CSV, JSON, XML, and YAML data loaders to use plain Python generators for lazy loading instead of eagerly building lists or relying on a custom LazyFileIterable, standardizes loader function signatures to accept either file paths or raw strings, and updates the corresponding tests to consume these iterators via next() and StopIteration assertions.

Sequence Diagram for Lazy Data Loading with Generators

sequenceDiagram
    actor Client
    participant DataLoader as "Loader Module (e.g., csv_loader.csv())"
    participant InternalProcessor as "Internal Generator Function (e.g., __process_csv)"
    participant DataSource as "File/String Source"

    Client->>DataLoader: load_data(src, ...)
    DataLoader->>InternalProcessor: (initiates lazy processing of src)
    Note right of DataLoader: Returns an iterator immediately
    DataLoader-->>Client: data_iterator

    loop Client requests next item
        Client->>data_iterator: next()
        data_iterator->>InternalProcessor: (requests next item)
        activate InternalProcessor
        InternalProcessor->>DataSource: Read minimal data needed (e.g., a line)
        DataSource-->>InternalProcessor: raw_item_data
        InternalProcessor-->>InternalProcessor: Parse data, create object (e.g., namedtuple)
        InternalProcessor-->>data_iterator: processed_item
        deactivate InternalProcessor
        data_iterator-->>Client: processed_item
    end
    Client->>data_iterator: next() # After all items are processed
    data_iterator-->>Client: StopIteration
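
The interaction in the diagram can be reproduced with a small self-contained sketch. `CountingSource` and `process` are illustrative names, not part of the library; the counter exists only to make visible when the source is actually read.

```python
from collections import namedtuple


class CountingSource:
    """Iterator over lines that records how many lines were read so far."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.reads = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)  # raises StopIteration when exhausted
        self.reads += 1
        return line


def process(source):
    """Internal generator: parse one raw line per requested item."""
    Item = namedtuple("Item", ["key", "value"])
    for raw in source:
        key, value = raw.split("=", 1)
        yield Item(key, value)


source = CountingSource(["a=1", "b=2", "c=3"])
items = process(source)            # returns immediately
reads_before = source.reads        # no line has been read yet
first = next(items)                # pulls exactly one line from the source
reads_after_first = source.reads
```

Each later `next(items)` call pulls one more line, and the call after the last item raises `StopIteration`, matching the final two arrows of the diagram.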

File-Level Changes

Change: Refactored CSV loader to generator-based lazy loading
Details:
  • Changed csv() signature to accept src/read_from_src instead of file_path only
  • Split loading into __load_csv_from_file and __load_csv_from_string
  • Extracted row processing into a new __process_csv generator
  • Removed LazyFileIterable usage in favor of yield-based iteration
Files:
  • pystreamapi/loaders/__csv/__csv_loader.py
  • tests/_loaders/test_csv_loader.py

Change: Refactored JSON loader to generator-based lazy loading
Details:
  • Removed LazyFileIterable; the loader now returns an iterator directly
  • Added __lazy_load_json_file and __lazy_load_json_string generator functions
  • Yield parsed objects via json.loads with object_hook
Files:
  • pystreamapi/loaders/__json/__json_loader.py
  • tests/_loaders/test_json_loader.py

Change: Refactored XML loader to generator-based lazy parsing
Details:
  • Replaced LazyFileIterable with _lazy_parse_xml_file and _lazy_parse_xml_string generators
  • Yield parsed elements lazily using _parse_xml_string_lazy
  • Flatten nested children via yield from instead of building lists
Files:
  • pystreamapi/loaders/__xml/__xml_loader.py
  • tests/_loaders/test_xml_loader.py

Change: Refactored YAML loader to generator-based lazy loading
Details:
  • Removed LazyFileIterable; the loader now returns an iterator directly
  • Yield documents via yaml.safe_load_all and convert to namedtuples
  • Support multi-document output with yield from __convert_to_namedtuples
Files:
  • pystreamapi/loaders/__yaml/__yaml_loader.py
  • tests/_loaders/test_yaml_loader.py

Change: Updated tests for lazy iteration and consolidated mocking
Details:
  • Introduced a mock_csv_file context manager to deduplicate file mocking
  • Replaced length and list-based assertions with next()/StopIteration checks
  • Added explicit tests for iterator laziness (GeneratorType)
  • Removed redundant test cases and unified test patterns
Files:
  • tests/_loaders/test_csv_loader.py
  • tests/_loaders/test_json_loader.py
  • tests/_loaders/test_xml_loader.py
  • tests/_loaders/test_yaml_loader.py
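
For the JSON and XML rows, the two techniques named above (json.loads with an object_hook, and yield from over child elements) can be sketched with the standard library alone. The function names and the exact meaning given to `retrieve_children` here are assumptions made for illustration, not the PR's implementation.

```python
import json
import xml.etree.ElementTree as ET
from collections import namedtuple


def _to_namedtuple(mapping):
    # rename=True replaces keys that are not valid Python identifiers
    return namedtuple("Item", mapping.keys(), rename=True)(*mapping.values())


def lazy_json(text):
    """Yield top-level JSON items as namedtuples (hypothetical helper)."""
    # json.loads still parses the whole string in one go; the generator only
    # defers that work until the first item is requested.
    parsed = json.loads(text, object_hook=_to_namedtuple)
    if isinstance(parsed, list):
        yield from parsed
    else:
        yield parsed


def lazy_xml(text, retrieve_children=True):
    """Yield the root's direct children, or the root itself (assumed semantics)."""
    root = ET.fromstring(text)
    if retrieve_children:
        yield from root  # iterating an Element yields its direct children
    else:
        yield root
```

Note the caveat in the comment: with `json.loads`, laziness means deferred parsing rather than incremental parsing, so peak memory for a single large JSON document is not reduced by this pattern on its own.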

sourcery-ai[bot] avatar Jun 06 '25 17:06 sourcery-ai[bot]

Quality Gate failed

Failed conditions
17.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

sonarqubecloud[bot] avatar Jun 06 '25 17:06 sonarqubecloud[bot]