ms2deepscore
Use balanced subsampling for data generation
Our current approach to balancing the data during data generation is to pick an inchikey, then look for a pair in a randomly sampled similarity bin, iteratively widening the bin until a pair is found. It is hard to understand exactly what the resulting data looks like, and the procedure is a bit complex. I suggest using stratified sampling instead:
On epoch end:
- We take all inchikey pairs with their similarity label
- We take a random subsample that is balanced for the similarity labels (I think the term would be: Random majority under-sampling with replacement)
- We loop over the selected inchikey pairs to generate training examples
- We still do spectrum selection and data augmentation on the individual training example level
To me this is more transparent, because it ensures that the y label (the similarity score) is exactly balanced across bins. It is also a bit cleaner.
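The sampling step above could look roughly like the sketch below. This is only an illustration, not existing ms2deepscore code: the function name `balanced_pair_subsample`, its parameters, and the use of equal-width bins over [0, 1] are all assumptions. It takes one similarity score per inchikey pair and returns indices of the selected pairs, drawing the same number of pairs from every non-empty bin with replacement.

```python
import numpy as np


def balanced_pair_subsample(scores, n_bins=10, samples_per_bin=100, seed=None):
    """Return indices into `scores`, with equal draws from each non-empty bin.

    Hypothetical helper: `scores` holds one similarity label in [0, 1]
    per inchikey pair.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Equal-width bins over [0, 1]; a score of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((scores * n_bins).astype(int), n_bins - 1)

    selected = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        if members.size == 0:
            continue  # no pairs in this bin, so nothing to draw
        # Sampling with replacement: large bins are under-sampled,
        # small bins may contribute the same pair more than once.
        selected.append(rng.choice(members, size=samples_per_bin, replace=True))

    if not selected:
        return np.array([], dtype=int)
    selected = np.concatenate(selected)
    rng.shuffle(selected)  # mix bins so batches are not ordered by similarity
    return selected


# Toy example: 50 low-similarity pairs, 5 mid, 1 identical.
scores = np.array([0.05] * 50 + [0.55] * 5 + [1.0])
idx = balanced_pair_subsample(scores, n_bins=10, samples_per_bin=4, seed=0)
bin_counts = np.bincount(np.minimum((scores[idx] * 10).astype(int), 9), minlength=10)
```

In the toy example only three bins contain pairs, and each of them contributes exactly four draws. Bins with no pairs at all stay empty, so the labels are balanced only over the bins that actually occur. The generator would then loop over `idx` once per epoch and apply spectrum selection and augmentation to each selected pair.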