Add parameters for MixedDataLoader
Addresses #100
Maybe I misunderstood the function but I added keywords that the sampling can now allow for empirical and uniform priors of the discrete label, and in addition to select between conditional and discrete only positive sampling.
minor comment but this also includes the doc changes from #99 - I'd remove those first to keep this PR clean, thanks @timonmerk - we will review it.
Yes, sorry for that! Took me some time to figure out how to remove an existing commit from github, but git rebase and force push did the job :)
Previously I obtain an error in the test_criterions.py but lokally I did not obtain this error. After merging the upstream main changes, the error now also doesn't occur any more.
@stes Are there any comments about the docstring and test I added?
Hi @timonmerk , apologies for the slow replies here. Let's get this merged before the PRs 1st year anniversary :D Checking compatibility with current main branch now and aiming to get this ready later today!
The test looks good.
@stes Thanks for addressing this PR. And yes, took me some time to get into it again :D
I think the first intention of this PR was simply to implement the empirical and uniform priors that were already mentioned in the docstring of the MixedDataLoader. I realize now however that there is more functionality in discussion behind the MixedDataLoader.
As mentioned in the paper you wrote
if we aim to build a robust brain machine interface that should be invariant to such short-term changes, we would include trial information as a task-irrelevant variable and obtain an embedding space that no longer carries this information
So I thought that it would be possible to pass two "labels", one task-relevant and one task invariant. I understood that the positive sampling then samples with respect to the task-relevant variable, but also with respect to the task invariant variable to make the embedding invariant to it, e.g. by specifying the distribution to be uniform for that variable.
I think this would be an optimal application for a couple of use cases, such as the one mentioned in the paper, or also to build an embedding that is invariant across patients but nevertheless "variant" to a behavioural variable. Optimally both variables could be discrete, continuous, or a mixture of both. But I guess this wouldn't be directly supported by the currently Loader setup, even though I see that the sampling methods for that are implemented, e.g. https://github.com/AdaptiveMotorControlLab/CEBRA/blob/9898850a560357f176441b37e422eb6d95a0c475/cebra/distributions/mixed.py#L88 but not used within the MixedDataLoader.
So I guess this is more up to the user to declare their own DataLoader, and then define manually how prior and index should be specified. As far as I understand it, the scikit-learn API also doesn't provide access to modify those.
With regard to the current code, you're right that the continuous sampling wouldn't be used right now, only if positive_sampling equals conditional, which doesn't make that much sense.
And to be honest I agree that the mixed.MixedTimedeltaDistribution makes also more sense to be uniform, if you think it would be useful to add the empirical option I would also be happy to add it. Otherwise maybe just the docstring could be removed to avoid confusion that there is currently no option to specify the discrete distribution?
@stes lets decide what to do here ...