Category: B1; Team name: DLLB; Dataset: FakeDataset
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
- [x] My PR follows PEP8 guidelines. (refer to comment below)
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [x] I linked to issues and PRs that are relevant to this PR.
Description
This PR introduces an implementation of an on-disk data loading pipeline for inductive datasets, along with memory profiling utilities.
Key Features
- On-Disk Dataset Support:
  Implemented an on-disk version of PyG’s `FakeDataset` to enable realistic testing without holding the full dataset in memory.
- On-Disk Preprocessor:
  Added a preprocessor built on top of PyG’s `OnDiskDataset`, which applies transformations one graph at a time and saves the processed outputs.
  This ensures the entire dataset is never fully loaded into memory.
- Transform Categorisation:
  Introduced a two-tier transform strategy:
  - Heavy transforms: topology and feature liftings, executed during the on-disk preprocessing phase.
  - Easy transforms: data manipulation and intrinsic dataset transforms, applied on the fly at load time.
- Data Splitting Enhancements:
  Updated `load_inductive_splits` and `assign_train_val_test_mask_to_graphs` to support lazy lists, minimizing memory use by avoiding in-memory storage of dataset splits.
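To illustrate how the features above fit together, here is a minimal standard-library sketch, not the code in this PR. It assumes graphs can be represented as plain dicts and stored with `pickle`, whereas the PR builds on PyG’s `OnDiskDataset`. All names (`preprocess_to_disk`, `LazyGraphList`, `fake_graphs`, `add_edge_count`) are hypothetical and chosen only for this example.

```python
import pickle
import tempfile
from pathlib import Path


def preprocess_to_disk(graphs, heavy_transform, out_dir):
    """Apply heavy_transform to one graph at a time and pickle each result."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for idx, graph in enumerate(graphs):  # `graphs` may be a generator
        with open(out_dir / f"graph_{idx}.pkl", "wb") as f:
            pickle.dump(heavy_transform(graph), f)
        count += 1
    return count


class LazyGraphList:
    """Sequence-like split that reads a graph from disk only when indexed."""

    def __init__(self, root, indices, easy_transform=None):
        self.root = Path(root)
        self.indices = list(indices)  # only integer ids are kept in memory
        self.easy_transform = easy_transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        with open(self.root / f"graph_{self.indices[i]}.pkl", "rb") as f:
            graph = pickle.load(f)
        if self.easy_transform is not None:
            graph = self.easy_transform(graph)  # cheap, applied at load time
        return graph


def fake_graphs(n):
    """Yield toy graphs as dicts (hypothetical stand-in for PyG Data objects)."""
    for i in range(n):
        yield {"x": [float(i)] * 4, "edges": [(0, 1), (1, 2)]}


def add_edge_count(graph):
    """Toy 'heavy' transform standing in for a topological lifting."""
    return {**graph, "num_edges": len(graph["edges"])}


with tempfile.TemporaryDirectory() as tmp:
    n_saved = preprocess_to_disk(fake_graphs(10), add_edge_count, tmp)
    train = LazyGraphList(
        tmp, range(0, 8), easy_transform=lambda g: {**g, "split": "train"}
    )
    first = train[0]  # read from disk here, not at construction time
    n_train = len(train)
```

The split object stores only indices, so creating train, validation, and test splits costs memory proportional to the number of graphs rather than to their size.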
Testing & Validation
- The pipeline passes the existing pipeline test suite.
- Added a new memory usage test comparing:
  - Our new on-disk `FakeDataset`
  - PyG’s original in-memory `FakeDataset`
- Memory usage was successfully tested for the following models:
  - graph/gcn
  - cell/topotune
  - simplicial/topotune
Details are available in the tutorial_on_disk_inductive_pipeline.ipynb notebook.
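The description does not say how the notebook measures memory. As a rough illustration of this kind of comparison, the sketch below uses Python’s `tracemalloc` to compare the peak allocation of holding every graph at once against processing one graph at a time. The helper names (`peak_bytes`, `make_graph`, `in_memory_pass`, `streaming_pass`) are hypothetical, and `tracemalloc` only sees Python-level allocations, so a real test on tensors would likely need a process-level measure instead.

```python
import tracemalloc


def peak_bytes(fn):
    """Return the peak Python-heap allocation (bytes) seen while running fn."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def make_graph(i, num_features=10_000):
    """Toy 'graph': a feature list (stand-in for a real data object)."""
    return [float(i)] * num_features


def in_memory_pass(n=50):
    graphs = [make_graph(i) for i in range(n)]  # whole dataset held at once
    return sum(len(g) for g in graphs)


def streaming_pass(n=50):
    total = 0
    for i in range(n):  # only one graph is alive at any time
        total += len(make_graph(i))
    return total


in_memory_peak = peak_bytes(in_memory_pass)
streaming_peak = peak_bytes(streaming_pass)
print(f"in-memory peak: {in_memory_peak} B, streaming peak: {streaming_peak} B")
```

The streaming peak should stay close to the size of a single graph, while the in-memory peak grows with the number of graphs.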
Added unit tests, corrected docstrings format.
Hi @dleko11 and @luka-benic, I noticed the current description is missing a reference to the notebook containing the provided utilization examples (.ipynb file). Could you please update the description to include this reference?
Hi @levtelyatnikov, thanks for letting us know. As requested, I updated the comment to include the reference to the notebook. Best, David