
Pretrained model checkpoint validation

Open · xyhuang opened this issue · 1 comment

If a pretrained model is used, maintain a list of approved checkpoints and make sure the submission uses one of them.
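For concreteness, here is a minimal sketch of what such a check could look like, comparing a submission's checkpoint against an approved list by SHA-256 digest. The benchmark names, digests, and file paths below are placeholders, not part of any existing MLCommons tooling.

```python
# Hypothetical checkpoint validator: compare a submission's pretrained
# checkpoint against a list of approved digests. All values are placeholders.
import hashlib
import sys

APPROVED_CHECKPOINTS = {
    # benchmark -> approved SHA-256 digests of the allowed checkpoint files
    "ssd": {"<sha256-of-resnet34-333f7ec4.pth>"},
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large checkpoints do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_approved(benchmark: str, checkpoint_path: str) -> bool:
    return sha256_of(checkpoint_path) in APPROVED_CHECKPOINTS.get(benchmark, set())


if __name__ == "__main__":
    benchmark, path = sys.argv[1], sys.argv[2]
    ok = is_approved(benchmark, path)
    print(f"{path}: {'approved' if ok else 'NOT approved'} for {benchmark}")
    sys.exit(0 if ok else 1)
```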

xyhuang · Jan 09 '21

On this topic, since we discussed it in the meeting:

If we implement it, the plan is to check whether we start from the same checkpoint. For SSD the only valid checkpoint is this one: https://download.pytorch.org/models/resnet34-333f7ec4.pth. That is specified by the reference implementation: https://github.com/mlcommons/training/blob/master/single_stage_detector/download_resnet34_backbone.sh

I see the NVDA submission accessing it from there, for example: https://github.com/mlcommons/training_results_v1.0/blob/master/NVIDIA/benchmarks/ssd/implementations/DGXA100_128x8x3/mxnet/scripts/get_resnet34_backbone.sh

However, Google decided to keep a local copy, so we mirrored the checkpoint offline and used that copy; see: https://github.com/mlcommons/training_results_v1.0/blob/master/Google/benchmarks/ssd/implementations/ssd-preview-TF-tpu-v4-128/ssd_constants.py#L125

Do you want to enforce a check that we really access the checkpoint the reference specifies as the accepted one?

Even in that case, the code still needs to be reviewed to make sure the checkpoint is loaded and translated appropriately.
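To illustrate what "loaded and translated appropriately" can involve, and why it still needs human review, here is a rough sketch of reusing the ResNet-34 classification checkpoint as a detection backbone; the key remapping and the `backbone.` prefix are hypothetical, not taken from the reference implementation.

```python
# Illustrative only: the prefix remapping below is made up, not the reference
# implementation's actual translation of the ResNet-34 checkpoint.
import torch
from torchvision.models import resnet34

state_dict = torch.load("resnet34-333f7ec4.pth", map_location="cpu")

# Sanity check: the checkpoint should restore cleanly into the architecture
# it was trained for before any translation into the detector's backbone.
classifier = resnet34()
classifier.load_state_dict(state_dict)

# A detector typically drops the final fc layer and renames the remaining
# keys; this is exactly the kind of code reviewers still have to inspect,
# with or without an automated checkpoint check.
backbone_state = {
    f"backbone.{k}": v for k, v in state_dict.items() if not k.startswith("fc.")
}
print(f"prepared {len(backbone_state)} tensors for the (hypothetical) backbone")
```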

The way I see it, checking for a path somewhere inside the source code or the accompanying scripts would save very little work during review, so I prefer to avoid that rule. We could instead require the logger to print something out, rather than looking into source files; that would save some effort on our side, but again it would not save any reviewing effort.
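As a sketch of the "make the logger print it" option: the submission could emit the checkpoint's source URL and digest as a compliance-log event, so reviewers grep the result logs instead of the source tree. This assumes the mlperf_logging mllog package, and the event key used here is made up rather than an official constant.

```python
# Sketch only: "pretrained_checkpoint" is not an official mllog key; it would
# have to be added to the logging spec for this to be enforceable.
import hashlib

from mlperf_logging import mllog

mllogger = mllog.get_mllogger()


def log_pretrained_checkpoint(path: str, source_url: str) -> None:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    mllogger.event(
        key="pretrained_checkpoint",
        value=f"{source_url} sha256={digest.hexdigest()}",
    )


log_pretrained_checkpoint(
    "resnet34-333f7ec4.pth",
    "https://download.pytorch.org/models/resnet34-333f7ec4.pth",
)
```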

Anyway, let me know what you think.

emizan76 · Aug 18 '21