Anna Shors

19 comments by Anna Shors

Hi, we've made some changes to Megatron recently to remove the required dependency on Transformer Engine. You should no longer need to install Transformer Engine to run this script. The...

> > If only we can propagate this flag to `dist_checkpointing.load` then using this flag would be ideal
>
> Yeah, an example of how PL does that can be...

Hi, I just rebuilt the Dockerfile on the v0.3.1 branch and reran the Megatron 70b experiment, and my results matched what is reported in the blog. Could you share...

Hi, the cosine decay will happen across the length of the training run and depends on `max_num_steps`. If `max_num_steps` is large, the decay might happen very slowly. Could you try...
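
To make the dependence on `max_num_steps` concrete, here's a minimal sketch of a standard cosine annealing schedule (illustrative only; Megatron's actual scheduler also handles warmup and other options):

```python
import math

def cosine_lr(step: int, max_num_steps: int, lr: float = 5.0e-6, min_lr: float = 5e-9) -> float:
    # Cosine decay from lr down to min_lr over the full training run.
    progress = min(step, max_num_steps) / max_num_steps
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * progress))

# With a very large max_num_steps, the LR after a few hundred steps is still
# essentially the initial value, so the decay can look like it isn't happening.
print(cosine_lr(step=100, max_num_steps=1_000_000))  # ~5.0e-6
```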

Cosine decay seems to be working as expected for me. Here's my config:

```
sft:
  max_num_steps: 60
policy:
  megatron_cfg:
    optimizer:
      lr: 5.0e-6
      min_lr: 5e-9
      ...
    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: ...
```

Thanks for the PR @sanjana-inflection! Overall, it looks great. Do you have any loss curves or experiments that you used to validate this PR that you can share as well?

Thanks for the results @sanjana-inflection! I'd like to test this PR a bit more thoroughly before merging. I'll run some experiments on my end and update here if everything looks...

I think it makes sense to set preference loss type during `__init__` since presumably we wouldn't want to use the same `PreferenceLoss` instance with different loss types.
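
Something along these lines is what I have in mind (a rough sketch with made-up names, not the actual implementation):

```python
import torch
import torch.nn.functional as F

class PreferenceLoss:
    """Sketch: the loss type is fixed once at construction time."""

    def __init__(self, preference_loss_type: str = "dpo", beta: float = 0.1):
        self.preference_loss_type = preference_loss_type
        self.beta = beta

    def __call__(self, chosen_logratios: torch.Tensor, rejected_logratios: torch.Tensor) -> torch.Tensor:
        diff = chosen_logratios - rejected_logratios
        if self.preference_loss_type == "dpo":
            # Sigmoid (DPO-style) preference loss.
            return -F.logsigmoid(self.beta * diff).mean()
        if self.preference_loss_type == "ipo":
            # IPO-style squared loss with a 1/(2*beta) target margin.
            return ((diff - 1.0 / (2.0 * self.beta)) ** 2).mean()
        raise ValueError(f"Unknown preference loss type: {self.preference_loss_type}")
```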

Hi, we provide an importer that automatically converts the HF Qwen2VL model into NeMo 2.0 format: https://github.com/NVIDIA/NeMo/blob/5421d66f3874c74609f68e9f3a25fe1493781125/nemo/collections/vlm/qwen2vl/model/qwen2vl.py#L79
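
For reference, the usual NeMo 2.0 import flow looks roughly like the sketch below; the model/config class names are my assumptions here, so please check the linked module for the exact ones:

```python
from nemo.collections import llm, vlm

# Converts the Hugging Face checkpoint into NeMo 2.0 format via the registered importer.
# Qwen2VLModel / Qwen2VLConfig7B are assumed names; see the linked qwen2vl.py for the real ones.
llm.import_ckpt(
    model=vlm.Qwen2VLModel(config=vlm.Qwen2VLConfig7B()),
    source="hf://Qwen/Qwen2-VL-7B-Instruct",
)
```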