Learning rate does not change
Can you please tell me why the learning rate does not change when the model is launched for fine-tuning?
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

cfg = OmegaConf.load("conformer_ctc/config.yaml")

trainer = pl.Trainer(
    logger=logger,  # logger is created elsewhere in my script
    **cfg.trainer,
    callbacks=[
        model_checkpoint_callback  # as is model_checkpoint_callback
    ],
)

model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
model.set_trainer(trainer)
trainer.fit(model, ckpt_path="pretrained/epoch=39.ckpt")
Log from fit():
610/1200 [06:32<-1:37:02, -0.43it/s, loss=61.6, v_num=103]
But according to the graphs, learning_rate = 1e-6 throughout training.
Part of "conformer_ctc/config.yaml":
name: "Conformer-CTC-Char"
model:
sample_rate: 8000
optim:
name: adamw
lr: 2.0
betas: [0.9, 0.98]
# less necessity for weight_decay as we already have large augmentations with SpecAug
# you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
# weight decay of 0.0 with lr of 2.0 also works fine
weight_decay: 0.0
sched:
name: NoamAnnealing
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 15000
warmup_ratio: null
min_lr: 1e-6
trainer:
devices: -1
num_nodes: 1
max_epochs: 100
max_steps: -1
val_check_interval: 1.0
accelerator: auto
strategy: dp
accumulate_grad_batches: 1
gradient_clip_val: 1.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
# amp_level: O2
# amp_backend: apex
log_every_n_steps: 50
progress_bar_refresh_rate: 10
resume_from_checkpoint: null
num_sanity_val_steps: 0
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
enable_checkpointing: true
benchmark: false # needs to be false for models with variable-length speech input as it slows down training
exp_manager:
exp_dir: null
name: ${name}
use_datetime_version: True
create_tensorboard_logger: true
create_checkpoint_callback: true
checkpoint_callback_params:
monitor: "val_wer"
mode: "min"
save_top_k: 3
always_save_nemo: True
resume_if_exists: False
resume_ignore_no_checkpoint: False
create_wandb_logger: false
wandb_logger_kwargs:
name: null
project: null
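For reference, here is a quick sanity check of what the schedule should be producing, assuming NoamAnnealing follows the standard Noam/Transformer formula; d_model = 176 is only a hypothetical value (the config resolves it from ${model.encoder.d_model}):

def noam_lr(step, lr=2.0, d_model=176, warmup_steps=15000, min_lr=1e-6):
    # standard Noam schedule: warmup proportional to step, then inverse-sqrt decay, floored at min_lr
    scale = lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return max(scale, min_lr) if step > warmup_steps else scale

for step in (1000, 15000, 100000):
    print(step, noam_lr(step))  # roughly 8e-5, 1.2e-3, 4.8e-4

With these settings the lr should peak around 1.2e-3 at step 15000, so a curve that sits flat at 1e-6 suggests the scheduler is never being stepped, or the resumed run is already past its end.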
Either you must provide max_steps to the trainer, or you have to initialize the train_ds and set the trainer so that we can resolve max_steps. The scheduler will not work without some way of calculating max_steps.
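A minimal sketch of the first option, reusing the logger and model_checkpoint_callback objects from the snippet above; the max_steps value is only a placeholder (a rough way to estimate it is shown further below):

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

cfg = OmegaConf.load("conformer_ctc/config.yaml")
cfg.trainer.max_steps = 120000  # placeholder: any positive step budget lets the scheduler be built

trainer = pl.Trainer(logger=logger, **cfg.trainer, callbacks=[model_checkpoint_callback])
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
model.set_trainer(trainer)
trainer.fit(model, ckpt_path="pretrained/epoch=39.ckpt")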
I do have train_ds set up. It is still not possible to resume training with the same lr: it either restarts from scratch or stays at min_lr.
train_ds:
  manifest_filepath: "../manifest.json"
  labels: ${model.labels}
  sample_rate: ${model.sample_rate}
  batch_size: &batch_size 16
  shuffle: true
  num_workers: &num_workers 8
  # pin_memory: true
  trim_silence: true
  max_duration: 32.9 # 16.7
  min_duration: 0.1
  # tarred datasets
  is_tarred: false
  tarred_audio_filepaths: null
  shuffle_n: 2048
  # bucketing params
  bucketing_strategy: "synced_randomized"
  bucketing_batch_size: null
Check the area in your logs that prints out the optimizer being loaded; near it, scheduler info should also be printed, or it will mention that the scheduler could not be initialized. For simplicity, pass max_steps to the trainer.
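If you set max_steps explicitly, here is a rough back-of-the-envelope estimate from the values in the config above; num_samples and num_gpus are placeholders for your actual manifest size and GPU count:

import math

num_samples = 250_000        # placeholder: utterances in ../manifest.json after duration filtering
num_gpus = 1                 # placeholder: GPUs actually used (devices: -1 selects all visible ones)
batch_size = 16              # train_ds.batch_size
accumulate_grad_batches = 1  # trainer.accumulate_grad_batches
max_epochs = 100             # trainer.max_epochs

steps_per_epoch = math.ceil(num_samples / (batch_size * num_gpus * accumulate_grad_batches))
print(steps_per_epoch * max_epochs)  # value to put in cfg.trainer.max_steps

This assumes one optimizer step per global batch, with no adjustment for bucketing.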
Strangely, max_steps is always set, but sometimes the lr starts from min_lr and sometimes from where it left off. Maybe I'm going about resuming training of my models the wrong way?
Please tell me, what would be the best way? Is it simply
trainer.fit(model_with_new_cfg, ckpt_path="epoch=10.ckpt")
or is it more correct to specify the trainer.resume_from_checkpoint parameter in the config, or to restore the model checkpoint via nemo_asr and override the datasets and other configs from "conformer_ctc/config.yaml"?
Use the exp_manager to resume your jobs. If your scheduler has already hit min_lr, then the previous run is over; it will not update the scheduler anymore.
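For reference, a sketch of the exp_manager-based resume flow, following the pattern of NeMo's example training scripts; the Trainer must not bring its own logger or checkpoint callback when exp_manager is configured to create them:

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr
from nemo.utils.exp_manager import exp_manager

cfg = OmegaConf.load("conformer_ctc/config.yaml")
cfg.exp_manager.resume_if_exists = True             # continue from the last checkpoint in the experiment dir
cfg.exp_manager.resume_ignore_no_checkpoint = True  # on the very first run, start fresh instead of failing
cfg.trainer.enable_checkpointing = False            # exp_manager provides its own checkpoint callback

trainer = pl.Trainer(logger=False, **cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))

model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)  # restores optimizer and scheduler state, so the lr continues where it left off

Unlike loading only the weights into a fresh fit, this resumes the full training state, including the scheduler step count.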