TopoBench

Category: B1;  Team name: DLLB;  Dataset: FakeDataset

Open • dleko11 opened this issue 2 months ago • 4 comments

Checklist

  • [x] My pull request has a clear and explanatory title.
  • [x] My pull request passes the Linting test.
  • [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • [x] My PR follows PEP8 guidelines. (refer to comment below)
  • [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • [x] I linked to issues and PRs that are relevant to this PR.

Description

This PR introduces an implementation of an on-disk data loading pipeline for inductive datasets, along with memory profiling utilities.

Key Features

  • On-Disk Dataset Support:
    Implemented an on-disk version of PyG’s FakeDataset to enable realistic testing without holding the full dataset in memory.

  • On-Disk Preprocessor:
    Added a preprocessor built on top of PyG’s OnDiskDataset, which applies transformations one graph at a time and saves the processed outputs.
    This ensures the entire dataset is never fully loaded into memory.

  • Transform Categorisation:
    Introduced a two-tier transform strategy:

    • Heavy transforms: topology and feature liftings, executed during the on-disk preprocessing phase.
    • Easy transforms: data manipulation and intrinsic dataset transforms, applied on the fly at load time.
  • Data Splitting Enhancements:
    Updated load_inductive_splits and assign_train_val_test_mask_to_graphs to support lazy lists, minimizing memory use by avoiding in-memory storage of dataset splits.
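
The features above can be illustrated with a minimal sketch. It is not the code from this PR and it does not use PyG's OnDiskDataset: it relies only on the standard library, stores each graph as a pickled dict, and every name in it (`preprocess_to_disk`, `LazySplit`, `fake_graphs`, `count_edges`, `tag_train`) is hypothetical. It shows the general idea only: the heavy transform runs once per graph and the result is written to disk, while a split holds indices and reads a graph, then applies the easy transform, only on access.

```python
import os
import pickle
import tempfile


def preprocess_to_disk(graphs, heavy_transform, out_dir):
    """Apply heavy_transform to one graph at a time and save each result.

    `graphs` can be any iterable (for example a generator), so the full
    dataset never has to exist in memory at once.
    """
    count = 0
    for index, graph in enumerate(graphs):
        path = os.path.join(out_dir, f"graph_{index}.pkl")
        with open(path, "wb") as handle:
            pickle.dump(heavy_transform(graph), handle)
        count += 1
    return count


class LazySplit:
    """A split that stores only indices and loads graphs from disk on access."""

    def __init__(self, out_dir, indices, easy_transform=None):
        self.out_dir = out_dir
        self.indices = list(indices)
        self.easy_transform = easy_transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, position):
        path = os.path.join(self.out_dir, f"graph_{self.indices[position]}.pkl")
        with open(path, "rb") as handle:
            graph = pickle.load(handle)
        # The easy transform is applied on the fly, at load time.
        if self.easy_transform is not None:
            graph = self.easy_transform(graph)
        return graph


def fake_graphs(n):
    """Hypothetical stand-in for a dataset: yields small path graphs lazily."""
    for i in range(n):
        yield {"num_nodes": i + 2, "edges": [(j, j + 1) for j in range(i + 1)]}


def count_edges(graph):
    """Hypothetical stand-in for a heavy transform such as a lifting."""
    return {**graph, "num_edges": len(graph["edges"])}


def tag_train(graph):
    """Hypothetical stand-in for an easy, load-time transform."""
    return {**graph, "split": "train"}


with tempfile.TemporaryDirectory() as tmp_dir:
    n_written = preprocess_to_disk(fake_graphs(5), count_edges, tmp_dir)
    train_split = LazySplit(tmp_dir, [0, 1, 2], easy_transform=tag_train)
    first_graph = train_split[0]
```

In this sketch the split object costs memory proportional to the number of indices, not to the size of the graphs, which is the property the lazy lists are meant to provide.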

Testing & Validation

  • The pipeline passes the existing pipeline test suite.
  • Added a new memory usage test comparing:
    • Our new on-disk FakeDataset
    • PyG’s original in-memory FakeDataset
  • Memory usage was successfully tested for the following models:
    • graph/gcn
    • cell/topotune
    • simplicial/topotune

Details are available in the tutorial_on_disk_inductive_pipeline.ipynb notebook.
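
For readers without the notebook at hand, the kind of comparison described above can be sketched as follows. This description does not state how memory was measured, so the use of Python's `tracemalloc` here is an assumption, and the helper names (`peak_bytes`, `make_graph`, `consume_in_memory`, `consume_streaming`) are hypothetical. Note that `tracemalloc` only tracks allocations made through Python's allocator, so it would not by itself capture tensor storage allocated by native libraries.

```python
import tracemalloc


def peak_bytes(fn):
    """Run fn() and return the peak Python-level memory it allocated, in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def make_graph(i):
    """Hypothetical stand-in for one graph: a list of 10,000 feature values."""
    return [float(i)] * 10_000


def consume_in_memory(n=200):
    """Materialize every graph first, as an in-memory dataset would."""
    graphs = [make_graph(i) for i in range(n)]
    return sum(len(g) for g in graphs)


def consume_streaming(n=200):
    """Hold one graph at a time, as an on-disk pipeline would."""
    return sum(len(make_graph(i)) for i in range(n))


in_memory_peak = peak_bytes(consume_in_memory)
streaming_peak = peak_bytes(consume_streaming)
print(f"in-memory peak: {in_memory_peak} bytes")
print(f"streaming peak: {streaming_peak} bytes")
```

The streaming variant should report a much lower peak, because only one graph is alive at any moment; the exact numbers depend on the interpreter and platform.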

dleko11 • Nov 03 '25 22:11

Added unit tests and corrected the docstring format.

dleko11 • Nov 19 '25 20:11

Hi @dleko11 and @luka-benic, I noticed the current description is missing a reference to the notebook (.ipynb file) containing the usage examples. Could you please update the description to include this reference?

levtelyatnikov • Nov 28 '25 10:11

Hi @levtelyatnikov, thanks for letting us know. As requested, I updated the comment to include the reference to the notebook. Best, David

dleko11 • Nov 28 '25 12:11