graphnet icon indicating copy to clipboard operation
graphnet copied to clipboard

Automatic ensemble new

Open Aske-Rosted opened this issue 1 year ago • 3 comments

the following is an implementation of the feature mentioned in #720.

I had to include a type ignore on line 68. which I personally was not too happy with, but I did not manage to satisfy the pre-commit hooks without it.

The solution uses the pre-existing EnsembleDataset class to automatically combine databases from a list of databases.

Aske-Rosted avatar May 17 '24 04:05 Aske-Rosted

Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges that you mentioned.

Given that this is new functionality, could you be persuaded to add a small unit test for this here? I added an example of how such a test could look like below:

@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
    """Test constructing Dataset with multiple data paths."""
    # Arrange
    config_path = CONFIG_PATHS[backend]

    # Single dataset
    config = DatasetConfig.load(config_path)
    dataset = Dataset.from_config(config)
    # Construct multiple datasets
    config_ensemble = DatasetConfig.load(config_path)
    config_ensemble.path = [config_ensemble.path, config_ensemble.path]

    ensemble_dataset = Dataset.from_config(config)
    
    assert len(dataset)*2 == len(ensemble_dataset)

Hey Rasmus looked into the suggestion for a bit but since this automatic ensembling is happing during the dataloader, and not in the dataset.from_config(config) call the current test suggestion does not test the ensembling code. This does however illuminate a bit of an issue with the current implementation of the automatic ensembling. That is you can have a dataset config file that works fine as long as you only feed it to a dataloader, however if you try to manually load the dataset from the config using the dataset class it will fail due to the list of files rather than a single file...

Aske-Rosted avatar May 28 '24 07:05 Aske-Rosted

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?

RasmusOrsoe avatar May 28 '24 09:05 RasmusOrsoe

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?

I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it.

Aske-Rosted avatar May 29 '24 02:05 Aske-Rosted

Was fixed with PR #775

Aske-Rosted avatar Dec 12 '24 01:12 Aske-Rosted