Add option to directly sample the disruptive subset during shot list/set splitting

Open felker opened this issue 6 years ago • 0 comments

Currently, if the testing and training ({train} U {validate}) sets are drawn from the same source shot list, the ratio conf['model']['train_frac'] is used to randomly divide the source shots without regard to the shot classes. The same class-blind split is applied when dividing the train and validate sets with conf['model']['validation_frac'].

So, while the division of the overall shot counts will match the desired fractions to within 1/N (where N is the total number of shots), the division of the disruptive and non-disruptive shots among the sets may not be so close to those fractions. This is only a problem when the number of disruptive (or non-disruptive) samples is low and/or the training and testing sets are drawn from different raw lists. As the number of samples -> infinity, of course, the per-class ratios converge to the configured fractions, e.g. N_{validate, disrupt}/N_{training, disrupt} -> conf['model']['validation_frac'].

There is no real reason not to explicitly divide the disruptive and non-disruptive classes when splitting the shot sets, so I think we should at least add it as an option, if not make it the default behavior.
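A minimal sketch of what a per-class split could look like, to make the proposal concrete. The function name `stratified_split`, the `is_disruptive` mapping, and the `seed` argument are all hypothetical and do not correspond to anything in the existing code; the actual shot list classes would need their own integration.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    shots: Sequence[Hashable],
    is_disruptive: Dict[Hashable, bool],
    frac: float,
    seed: int = 0,
) -> Tuple[List[Hashable], List[Hashable]]:
    """Split shots into two lists, applying frac to each class separately.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must lie in [0, 1]")
    rng = random.Random(seed)
    first: List[Hashable] = []
    second: List[Hashable] = []
    # Handle the disruptive and non-disruptive shots independently.
    for label in (True, False):
        group = [s for s in shots if is_disruptive[s] == label]
        rng.shuffle(group)
        # Per-class cut point, rounded to the nearest whole shot.
        cut = round(frac * len(group))
        first.extend(group[:cut])
        second.extend(group[cut:])
    # Shuffle again so the classes are interleaved in the output lists.
    rng.shuffle(first)
    rng.shuffle(second)
    return first, second


# Toy data (made up): 100 shots, of which the first 10 are disruptive.
shots = list(range(100))
labels = {s: s < 10 for s in shots}
train_val, test = stratified_split(shots, labels, frac=0.8, seed=1)
```

With this toy data, each class is divided 80/20 on its own, so the per-class fractions are off by at most one shot's worth of rounding instead of fluctuating with the random draw. The same helper could be applied a second time to the first list with validation_frac to separate the train and validate sets.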

[ ] Consider renaming train_frac to test_frac (value = 1.0 - train_frac) or another name to make it clear that the "training fraction" is further divided between the training and hold-out validation sets.
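
To illustrate why the current name can mislead, here is a small sketch of each set's share of the source list. It assumes validation_frac is applied to the portion selected by train_frac, as described above; `effective_fractions` is a hypothetical helper and the input values are made up.

```python
def effective_fractions(train_frac: float, validation_frac: float) -> dict:
    """Return each set's share of the full source shot list.

    Assumes validation_frac is a fraction of the train_frac portion.
    """
    test = 1.0 - train_frac
    validate = train_frac * validation_frac
    train = train_frac * (1.0 - validation_frac)
    return {"train": train, "validate": validate, "test": test}


# Made-up values: train_frac = 0.75, validation_frac = 0.2.
fracs = effective_fractions(train_frac=0.75, validation_frac=0.2)
```

Under that assumption, roughly 60% of the shots end up in the actual training set, 15% in validation, and 25% in testing, so "train_frac" overstates the share used for training.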

Dec 05 '19 19:12 felker