ms2deepscore

Use balanced subsampling for data generation

Open svenvanderburg opened this issue 4 years ago • 0 comments

Our current approach to balancing the data during data generation is to pick an inchikey, then look for a pair in a randomly sampled similarity bin, iteratively widening the bin until a pair is found. It is hard to understand exactly what the resulting data looks like, and the procedure is a bit complex. I suggest using stratified sampling instead:

On epoch end:

  1. We take all inchikey pairs with their similarity label
  2. We take a random subsample that is balanced for the similarity labels (I think the term would be: Random majority under-sampling with replacement)
  3. We loop over the selected inchikey pairs to generate training examples
  4. We still do spectrum selection and data augmentation on the individual training example level
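
A minimal sketch of steps 1 and 2, to make the proposal concrete. It is not existing ms2deepscore code: the function name `balanced_pair_subsample`, its parameters, and the assumption that the similarity labels live in a square NumPy array of scores in [0, 1] (one row and column per inchikey) are all hypothetical. Pairs are grouped into equal-width similarity bins and the same number of pairs is drawn from every non-empty bin, with replacement.

```python
import numpy as np


def balanced_pair_subsample(scores, n_bins=10, pairs_per_bin=None, rng=None):
    """Return (rows, cols) index arrays of inchikey pairs, balanced over score bins.

    scores: square, symmetric array of pairwise similarities in [0, 1]
            (hypothetical input format, one row/column per inchikey).
    """
    rng = np.random.default_rng(rng)

    # Step 1: all unique inchikey pairs (upper triangle, self-pairs included)
    # together with their similarity label.
    rows, cols = np.triu_indices(scores.shape[0])
    labels = scores[rows, cols]

    # Assign each pair to an equal-width bin; a score of exactly 1.0
    # is placed in the last bin.
    bin_ids = np.minimum((labels * n_bins).astype(int), n_bins - 1)
    occupied = np.unique(bin_ids)

    # Default: under-sample every bin down to the size of the smallest one.
    if pairs_per_bin is None:
        pairs_per_bin = min(int(np.sum(bin_ids == b)) for b in occupied)

    # Step 2: draw the same number of pairs from each occupied bin,
    # with replacement.
    chosen = np.concatenate([
        rng.choice(np.flatnonzero(bin_ids == b), size=pairs_per_bin, replace=True)
        for b in occupied
    ])
    rng.shuffle(chosen)
    return rows[chosen], cols[chosen]
```

Steps 3 and 4 would then loop over the returned index pairs once per epoch, selecting spectra and applying augmentation for each pair as is done now. Bins that contain no pairs at all are skipped in this sketch, so the result is balanced only across the bins that actually occur in the data.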

To me this is more transparent, because it will ensure that the y label is indeed exactly balanced. And it is a bit cleaner.

svenvanderburg, Mar 25 '21 07:03