
Learning rate does not change

psydok opened this issue 3 years ago · 5 comments

Can you please tell me why the learning rate does not change when the model is launched for fine-tuning?

from omegaconf import OmegaConf
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

cfg = OmegaConf.load("conformer_ctc/config.yaml")
trainer = pl.Trainer(
    logger=logger,  # logger and model_checkpoint_callback are defined elsewhere in my script
    **cfg.trainer,
    callbacks=[
        model_checkpoint_callback
    ]
)
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
model.set_trainer(trainer)

trainer.fit(model, ckpt_path="pretrained/epoch=39.ckpt")

Log fit:

610/1200 [06:32<-1:37:02, -0.43it/s, loss=61.6, v_num=103]

But according to the graphs, learning_rate = 1e-6 throughout the training.

Half "conformer_ctc/config.yaml":

name: "Conformer-CTC-Char"
model:
  sample_rate: 8000
  optim:
    name: adamw
    lr: 2.0
    betas: [0.9, 0.98]
    # less necessity for weight_decay as we already have large augmentations with SpecAug
    # you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
    # weight decay of 0.0 with lr of 2.0 also works fine
    weight_decay: 0.0

    sched:
      name: NoamAnnealing
      d_model: ${model.encoder.d_model}
      # scheduler config override
      warmup_steps: 15000
      warmup_ratio: null
      min_lr: 1e-6

trainer:
  devices: -1
  num_nodes: 1
  max_epochs: 100
  max_steps: -1
  val_check_interval: 1.0
  accelerator: auto
  strategy: dp
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  # amp_level: O2
  # amp_backend: apex
  log_every_n_steps: 50
  progress_bar_refresh_rate: 10
  resume_from_checkpoint: null
  num_sanity_val_steps: 0 
  check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
  sync_batchnorm: true
  enable_checkpointing: true
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training


exp_manager:
  exp_dir: null
  name: ${name}
  use_datetime_version: True
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: "val_wer"
    mode: "min"
    save_top_k: 3
    always_save_nemo: True
  resume_if_exists: False
  resume_ignore_no_checkpoint: False

  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null
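
As an aside, NoamAnnealing here is the Transformer-style "Noam" warmup-then-decay schedule, so the lr of 2.0 above acts as a scale factor rather than a literal learning rate. A rough sketch (my own approximation with a hypothetical d_model, not NeMo's exact implementation) of how the schedule parameters interact:

# Rough approximation of a Noam-style schedule; d_model=176 is hypothetical
# (the config references ${model.encoder.d_model}).
def noam_lr(step, lr_scale=2.0, d_model=176, warmup_steps=15000, min_lr=1e-6):
    step = max(step, 1)
    lr = lr_scale * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
    return max(lr, min_lr)  # in this sketch, min_lr is simply a floor

print(noam_lr(1000))    # early in warmup: lr ramps up roughly linearly with step
print(noam_lr(15000))   # peak at the end of warmup, about 2.0 / sqrt(176 * 15000) ~= 1.2e-3
print(noam_lr(200000))  # afterwards decays as 1 / sqrt(step)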

psydok avatar Sep 02 '22 16:09 psydok

Either you must provide max_steps to the trainer, or you have to initialize the train_ds and set the trainer so that we can resolve max_steps. The scheduler will not work without max_steps being computed somehow.
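
For illustration, a minimal sketch (the step count is a hypothetical placeholder) of setting max_steps explicitly before building the trainer, so the scheduler can be constructed:

cfg = OmegaConf.load("conformer_ctc/config.yaml")
cfg.trainer.max_steps = 100000  # hypothetical value; leaving it at -1 means max_steps stays unresolved
trainer = pl.Trainer(logger=logger, **cfg.trainer, callbacks=[model_checkpoint_callback])
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model, ckpt_path="pretrained/epoch=39.ckpt")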

titu1994 avatar Sep 02 '22 18:09 titu1994

I do have train_ds set up. It is not possible to resume training with the same lr; the lr either restarts from the beginning of the schedule or starts at min_lr.

  train_ds:
    manifest_filepath: "../manifest.json"
    labels: ${model.labels}
    sample_rate: ${model.sample_rate}
    batch_size: &batch_size 16
    shuffle: true
    num_workers: &num_workers 8
    # pin_memory: true
    trim_silence: true
    max_duration: 32.9 #16.7
    min_duration: 0.1
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

psydok avatar Sep 03 '22 09:09 psydok

Check the area in your logs that prints out the optimizer being loaded; near it, the scheduler info should also be printed, or it will mention that the scheduler could not be initialized. For simplicity, pass max_steps to the trainer.

titu1994 avatar Sep 03 '22 09:09 titu1994

Strangely, max_steps is always set, but sometimes the lr starts from min_lr and sometimes from where it left off. Maybe I'm restarting training of my models the wrong way? Please tell me, what would be the best way? trainer.fit(model_with_new_cfg, ckpt_path="epoch=10.ckpt")? Or is it more correct to specify the trainer.resume_from_checkpoint parameter in the config, or to load the model checkpoint via nemo_asr and override the datasets and other configs from the file "conformer_ctc/config.yaml"?

psydok avatar Sep 08 '22 21:09 psydok

Use exp_manager to resume your jobs. If your scheduler has already hit the min lr, then the previous run is over; it will not update the scheduler anymore.
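
For illustration, a sketch of resuming through exp_manager instead of passing ckpt_path to fit, using the resume flags already present in the config above (assuming the usual nemo.utils.exp_manager entry point):

from nemo.utils.exp_manager import exp_manager

cfg = OmegaConf.load("conformer_ctc/config.yaml")
cfg.exp_manager.resume_if_exists = True             # pick up the latest checkpoint in the experiment dir
cfg.exp_manager.resume_ignore_no_checkpoint = True  # don't fail on the very first run
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.exp_manager)                # wires up logging, checkpointing and resume
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)                                   # resumes from the managed checkpoint if one is found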

titu1994 avatar Sep 09 '22 05:09 titu1994

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 10 '22 02:10 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Oct 17 '22 02:10 github-actions[bot]