Saving torchvision checkpoints based on staged recipe phase

Open corey-nm opened this issue 2 years ago • 1 comments

Renames BaseManager.phase -> BaseManager.phase_at_end_of
Clarifies behavior
Integrates saving checkpoints based on phases into torchvision

Test Plan

Ran the following recipe:

version: 1.1.0

training_modifiers:
  - !EpochRangeModifier
    start_epoch: 0
    end_epoch: 15

  - !SetLearningRateModifier
    start_epoch: 0.0
    learning_rate: 0.001

pruning_modifiers:
  - !GMPruningModifier
    init_sparsity: 0.05
    final_sparsity: 0.85
    start_epoch: 5.0
    end_epoch: 10.0
    update_frequency: 1.0
    params: ["re:.*conv..weight*"]

quantization_modifiers:
  - !QuantizationModifier
    start_epoch: 11.0
    freeze_bn_stats_epoch: 12.0
    disable_quantization_observer_epoch: 13.0

The run generated the following files in the output directory:

- best_dense.pth (best_dense.txt contains epoch 3)
- best_pruned_quantized.pth (best_pruned_quantized.txt contains epoch 13)
- best_pruned.pth (best_pruned.txt contains epoch 10)
- last_dense.pth (last_dense.txt contains epoch 4)
- last_pruned_quantized.pth (last_pruned_quantized.txt contains epoch 14)
- last_pruned.pth (last_pruned.txt contains epoch 10)
- last.pth (last.txt contains epoch 14)
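The file naming above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `checkpoints_to_write` is a hypothetical helper, the metric is assumed to be higher-is-better, and it is an assumption that `last.pth` is refreshed every epoch, including epochs where the phase is None.

```python
from typing import Dict, List, Optional


def checkpoints_to_write(
    phase: Optional[str],
    metric: float,
    best_metric: Dict[str, float],
) -> List[str]:
    """Return the checkpoint files to (over)write after one finished epoch.

    Hypothetical helper for illustration only.
    `best_metric` maps phase -> best metric seen so far in that phase
    and is updated in place.
    """
    files = ["last.pth"]  # assumed to be refreshed after every epoch
    if phase is None:
        # Mid-transition (e.g. pruning in progress): no phase-specific file.
        return files
    files.append(f"last_{phase}.pth")
    if phase not in best_metric or metric > best_metric[phase]:
        best_metric[phase] = metric
        files.append(f"best_{phase}.pth")
    return files
```

Under these assumptions, an epoch that finishes in phase `pruned` can only ever touch `last.pth`, `last_pruned.pth`, and `best_pruned.pth`, so a dense checkpoint is never overwritten by a pruned one.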

And the following output:

sparseml.image_classification.train --recipe resnet18-pq.yaml --dataset-path ~/.cache/nm_datasets/imagenette/imagenette-320/ --arch-key resnet18 --output-dir ./runs
INFO:sparseml.pytorch.torchvision.train:Finished epoch 0 in phase dense
INFO:sparseml.pytorch.torchvision.train:Finished epoch 1 in phase dense
INFO:sparseml.pytorch.torchvision.train:Finished epoch 2 in phase dense
INFO:sparseml.pytorch.torchvision.train:Finished epoch 3 in phase dense
INFO:sparseml.pytorch.torchvision.train:Finished epoch 4 in phase dense
INFO:sparseml.pytorch.torchvision.train:Finished epoch 5 in phase None
INFO:sparseml.pytorch.torchvision.train:Finished epoch 6 in phase None
INFO:sparseml.pytorch.torchvision.train:Finished epoch 7 in phase None
INFO:sparseml.pytorch.torchvision.train:Finished epoch 8 in phase None
INFO:sparseml.pytorch.torchvision.train:Finished epoch 9 in phase None
INFO:sparseml.pytorch.torchvision.train:Finished epoch 10 in phase pruned
INFO:sparseml.pytorch.torchvision.train:Finished epoch 11 in phase pruned_quantized
INFO:sparseml.pytorch.torchvision.train:Finished epoch 12 in phase pruned_quantized
INFO:sparseml.pytorch.torchvision.train:Finished epoch 13 in phase pruned_quantized
INFO:sparseml.pytorch.torchvision.train:Finished epoch 14 in phase pruned_quantized

Note that the following transitions are correct:

Epochs 0-4 are in the dense phase
The start epoch for pruning is 5, so at the END of epoch 5, pruning is in progress -> phase is None
The end epoch for pruning is 10, so at the END of epoch 10, pruning is complete -> phase is pruned
The start epoch for quantization is 11, so at the END of epoch 11, quantization has been applied -> phase is pruned_quantized
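The transitions above can be reproduced with a small sketch. This is not the `BaseManager.phase_at_end_of` implementation; it is a standalone, hypothetical function that hard-codes the epochs from the test recipe as defaults and simply mirrors the rules listed above.

```python
from typing import Optional


def phase_at_end_of(
    epoch: int,
    pruning_start: float = 5.0,   # GMPruningModifier start_epoch in the recipe
    pruning_end: float = 10.0,    # GMPruningModifier end_epoch in the recipe
    quant_start: float = 11.0,    # QuantizationModifier start_epoch in the recipe
) -> Optional[str]:
    """Hypothetical sketch: phase label once `epoch` has finished."""
    if epoch < pruning_start:
        return "dense"
    if epoch < pruning_end:
        return None  # pruning in progress, no stable phase
    if epoch < quant_start:
        return "pruned"
    return "pruned_quantized"


# One label per epoch of the 15-epoch run (epochs 0..14).
phases = [phase_at_end_of(e) for e in range(15)]
```

With the recipe's values, `phases` lines up with the log: five `dense` entries, five `None` entries, one `pruned`, then four `pruned_quantized`.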

Mar 30 '23 20:03 corey-nm

LGTM, just curious, what was the motivation for the change?

This method of saving was chosen to standardize behavior across integrations. It also makes it clearer whether a checkpoint is dense, pruned, or quantized. Previously, best.pt could contain any of these versions; notably, it could still be a dense model.

Mar 31 '23 14:03 corey-nm