Saving trained models and their metadata for inference and reproducibility
Following the discussion at the FRNN group meeting in San Diego on Wednesday, 2019-12-04, we need to start systematically saving the best trained models for:
- Collaboration (no need for multiple users to waste GPU hours retraining the same models)
- Practical inference (@mdboyer wants a Python interface derived from `performance_analysis.py` that would allow a user to load a trained model and easily feed it a set of shot(s) for inference, without using the bloated shot list and preprocessing pipeline that has been oriented towards training during the first phase of the project. This would enable exploratory studies of proximity to disruption, UQ, clustering, etc., and is an important intermediate step towards setting up the C-based real-time inference tool in the PCS. A minimal sketch of such an interface follows this list.)
- Reproducibility
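To make the practical-inference point concrete, here is a minimal sketch of what such an interface could look like, assuming the model is saved as a self-contained Keras `.h5` file and the normalization reduces to per-channel standard deviations (as for `VarNormalizer`). The class name, file names, and array layout below are hypothetical, not existing code:

```python
# Hypothetical sketch of a lightweight inference interface; the class name,
# file names, and preprocessing assumptions are illustrative only.
import numpy as np
from tensorflow import keras


class FRNNPredictor:
    """Load a saved model and run inference on individual preprocessed shots."""

    def __init__(self, weights_path, stds_path):
        # weights_path: .h5 file with the trained model (architecture + weights)
        self.model = keras.models.load_model(weights_path)
        # stds_path: plain-text per-channel standard deviations
        # (see the normalization discussion below)
        self.sigma = np.loadtxt(stds_path)

    def predict_shot(self, shot_signals):
        """shot_signals: unnormalized array of shape (timesteps, n_channels)."""
        x = shot_signals / self.sigma      # VarNormalizer-style scaling (assumed)
        x = x[np.newaxis, ...]             # add batch dimension
        return self.model.predict(x)[0]    # disruption score(s) for the shot


# Example usage (paths are placeholders):
# predictor = FRNNPredictor("model_weights.h5", "normalization_stds.txt")
# scores = predictor.predict_shot(preprocessed_shot_array)
```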
As part of a broader effort to improve the reproducibility of our workflow, these models should be stored together with:
- `.h5` file containing the tunable parameters (can be directly loaded by Keras or by C-translated inference software)
- Input configuration: `conf.yaml` and/or the dumped final configuration used in specifying and training the model
- Output performance metrics of the trained model (train/validate/test ROC)
- Normalization: `.npz` pickled class. For `VarNormalizer`, this would only consist of the standard deviations of each channel of each signal from the set of shots used to train the normalizer. However, it is serialized and saved as a "fat" class object that requires the entire `plasma` module to load. We might want to dump a simple non-pickled array, or even a `.txt` file, alongside the pickle, so that we have a simple file to load with the Keras-C wrapper.
- Some metadata about the layout of a preprocessed shot in `processed_shots/signal_group_*/*.npz` (order of channels and signals, sampling rates, thresholding(?), etc.), so that any real-time inference wrapper could apply the same preprocessing to the incoming data.
- Exact individual shot numbers used in the training, validation, and testing sets, so that anyone using the model for inference will know whether the shot being supplied to the model was already used to train it.
- SHA1 of Git commit
- Conda environment; versions of dependencies such as TensorFlow, Keras, PyTorch, scikit-learn
- Computer used for training, MPI library, cuDNN library, etc.
- Number of devices and MPI ranks used in training (least important)
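Most of these reproducibility items are small pieces of text, so they could be collected into a single plain-text manifest written next to the `.h5` weights. A minimal sketch, assuming a JSON manifest (YAML would work equally well); every file name and field name is illustrative, not an existing convention:

```python
# Sketch of dumping a plain-text training manifest alongside the weights.
import json
import subprocess
import sys

import tensorflow as tf


def write_manifest(out_path, conf, roc, shot_sets, signal_order, dt):
    """Dump reproducibility metadata as version-controllable JSON."""
    manifest = {
        "git_sha1": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "python": sys.version,
        "tensorflow": tf.__version__,
        "keras": tf.keras.__version__,
        "conf": conf,                  # dumped final configuration (dict)
        "roc": roc,                    # {"train": ..., "validate": ..., "test": ...}
        "signal_order": signal_order,  # channel/signal layout of the processed arrays
        "dt": dt,                      # sampling interval used in preprocessing
        "shots": {k: sorted(v) for k, v in shot_sets.items()},  # train/validate/test shot numbers
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)


# Host, MPI/cuDNN, and Conda details can be captured next to it, e.g.:
#   conda env export > environment.yml
#   hostname > host.txt
```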
Given the binary `.h5` and `.npz` files, we probably don't want to use version control to store everything, but we might want to version-control the plain-text metadata about the trained models. Should it live in this repository alongside the code, or in a new repository under our GitHub organization?
Also, should we consider ONNX?
Initially, maybe archive both ONNX and `.h5`, since we may use either for PCS deployment.
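If we do archive both, the conversion can be done offline from the saved `.h5`. One possible sketch using the `tf2onnx` package (just one option; `keras2onnx` or other converters would also work, and LSTM models may need an explicit input signature):

```python
# Sketch of producing an ONNX copy of an archived Keras model; the paths are
# placeholders and tf2onnx is only one possible converter.
import tf2onnx
from tensorflow import keras

model = keras.models.load_model("model_weights.h5")

# Writes model_weights.onnx next to the .h5 and also returns the ONNX ModelProto.
model_proto, _ = tf2onnx.convert.from_keras(model, output_path="model_weights.onnx")
```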
I'd advocate saving the normalization as `.txt`/`.h5` instead of `.npz` to facilitate reading by the PCS.
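A minimal sketch of what that dump could look like, assuming the pickled `VarNormalizer` exposes its per-channel standard deviations as an array attribute (the `stds` attribute and the output file names are guesses, not actual API):

```python
# Sketch of flattening the pickled normalizer into PCS-friendly text files.
import numpy as np


def dump_normalization(normalizer, signal_names,
                       stds_path="normalization_stds.txt",
                       names_path="signal_names.txt"):
    """Write per-channel standard deviations and signal names as plain text."""
    np.savetxt(stds_path, np.asarray(normalizer.stds))
    with open(names_path, "w") as f:
        f.write("\n".join(signal_names) + "\n")
```

Writing the signal-name ordering in the same pass would also cover the text-file point below.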
Better yet, could the normalization just be added as a layer to the model post-training so it is saved in the ONNX/H5 file? This would make implementation of the inference even simpler since the unnormalized data could be used as input to the deployed model.
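Folding the normalization in should be doable post-training in Keras without retraining. A rough sketch, assuming the normalization is a pure division by per-channel standard deviations (as for `VarNormalizer`) and a Keras version recent enough to provide the `Normalization` layer; all paths are placeholders:

```python
# Sketch of prepending the normalization to the trained network so the saved
# model accepts unnormalized signals; paths and the sigma source are placeholders.
import numpy as np
from tensorflow import keras

trained = keras.models.load_model("model_weights.h5")
sigma = np.loadtxt("normalization_stds.txt")   # per-channel standard deviations

# Normalization with zero mean and variance sigma**2 applies x -> x / sigma.
inputs = keras.Input(shape=trained.input_shape[1:])
scaled = keras.layers.Normalization(mean=np.zeros_like(sigma),
                                    variance=sigma**2)(inputs)
outputs = trained(scaled)

keras.Model(inputs, outputs).save("model_with_normalization.h5")
```

The wrapped model could then be exported to ONNX in exactly the same way as the bare one.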
Text files for the signal names would also be easier to use in the PCS.
I would think having some example trained models in the main repo would be useful, but maybe a larger library of models could be maintained separately?