Use balanced subsampling for data generation

Open svenvanderburg opened this issue 4 years ago • 0 comments

Our current approach to balancing the data during data generation is to pick an inchikey, then look for a pair in a randomly sampled similarity bin, iteratively widening the bin until a pair is found. It is hard to understand exactly what the resulting data looks like, and the procedure is a bit complex. I suggest using stratified sampling instead:

On epoch end:

1. We take all inchikey pairs with their similarity label.
2. We take a random subsample that is balanced for the similarity labels (I think the term would be: random majority under-sampling with replacement).
3. We loop over the selected inchikey pairs to generate training examples.
4. We still do spectrum selection and data augmentation on the individual training example level.
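
To make the proposal concrete, here is a minimal sketch of the balanced subsampling step, not an actual implementation from the code base. The function name `balanced_subsample`, the `bin_edges` and `n_per_bin` parameters, and the toy inchikeys `K1` to `K4` are all hypothetical; it assumes the similarity labels are continuous scores that get binned before sampling.

```python
import numpy as np


def balanced_subsample(pairs, scores, bin_edges, n_per_bin, rng):
    """Draw n_per_bin pairs from every non-empty similarity bin, with replacement.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    # Assign each pair to a bin using the interior edges only,
    # so the last bin also includes its right edge.
    bin_ids = np.digitize(scores, bin_edges[1:-1])
    selected = []
    for b in range(len(bin_edges) - 1):
        candidates = np.flatnonzero(bin_ids == b)
        if candidates.size == 0:
            continue  # nothing to sample from in this bin
        # Sampling with replacement gives every bin exactly n_per_bin draws.
        selected.append(rng.choice(candidates, size=n_per_bin, replace=True))
    selected = np.concatenate(selected)
    rng.shuffle(selected)  # mix the bins before generating batches
    return [pairs[i] for i in selected], scores[selected], bin_ids[selected]


# Toy data: four low-similarity pairs and one high-similarity pair.
rng = np.random.default_rng(0)
pairs = [("K1", "K2"), ("K1", "K3"), ("K2", "K3"), ("K1", "K4"), ("K3", "K4")]
scores = [0.05, 0.10, 0.15, 0.20, 0.90]
bin_edges = [0.0, 0.5, 1.0]

sel_pairs, sel_scores, sel_bins = balanced_subsample(
    pairs, scores, bin_edges, n_per_bin=4, rng=rng
)
# Each selected pair would then go through spectrum selection and augmentation.
```

One thing to keep in mind with this sketch: because the draws are with replacement, a sparsely populated bin repeats the same inchikey pairs, so the variety within such a bin comes only from the spectrum selection and augmentation applied afterwards.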

To me this is more transparent, because it ensures that the y label is exactly balanced. It is also a bit cleaner.

Mar 25 '21 07:03 svenvanderburg