
" ValueError: max() arg is an empty sequence " while converting mamba 2 hybrid checkpoint to nemo

SkanderBS2024 opened this issue on Aug 16, 2024 · 2 comments

Describe the bug

As described in the title, the error occurs after completing all installs and building NeMo and Megatron-LM from source; the model was trained with Megatron-LM.

Steps/Code to reproduce bug

[NeMo W 2024-08-16 12:43:58 nemo_logging:349] /workspace/megatron/Megatron-LM/megatron/core/tensor_parallel/layers.py:280: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, input, weight, bias, allreduce_dgrad):

[NeMo W 2024-08-16 12:43:58 nemo_logging:349] /workspace/megatron/Megatron-LM/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[NeMo W 2024-08-16 12:43:58 nemo_logging:349] /workspace/megatron/Megatron-LM/megatron/core/tensor_parallel/layers.py:380: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(

[NeMo W 2024-08-16 12:43:58 nemo_logging:349] /workspace/megatron/Megatron-LM/megatron/core/tensor_parallel/layers.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[WARNING  | megatron.core.dist_checkpointing.strategies.zarr]: `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
[NeMo W 2024-08-16 12:43:59 nemo_logging:349] /workspace/megatron/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py:22: DeprecationWarning: `torch.distributed._sharded_tensor` will be deprecated, use `torch.distributed._shard.sharded_tensor` instead
      from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor

[NeMo W 2024-08-16 12:43:59 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/tensor_quant.py:84: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
      scaled_e4m3_abstract = torch.library.impl_abstract("trt::quantize_fp8")(

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/selective_scan_interface.py:164: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/selective_scan_interface.py:240: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, dout):

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/triton/layer_norm.py:959: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/triton/layer_norm.py:1018: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, dout, *args):

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/distributed/tensor_parallel.py:26: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, x, weight, bias, process_group=None, sequence_parallel=True):

[NeMo W 2024-08-16 12:44:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/distributed/tensor_parallel.py:62: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):

[NeMo W 2024-08-16 12:44:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/triton/ssd_combined.py:736: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(ctx, zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, initial_states=None, seq_idx=None, dt_limit=(0.0, float("inf")), return_final_states=False, activation="silu",

[NeMo W 2024-08-16 12:44:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/triton/ssd_combined.py:814: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, dout, *args):

Traceback (most recent call last):
  File "/workspace/nemo/NeMo/scripts/checkpoint_converters/convert_mamba2_pyt_to_nemo.py", line 190, in <module>
    convert(args)
  File "/workspace/nemo/NeMo/scripts/checkpoint_converters/convert_mamba2_pyt_to_nemo.py", line 115, in convert
    num_layers = max(layer_numbers) + 1
ValueError: max() arg is an empty sequence
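
For context, the failing line derives the layer count from the parameter names found in the loaded checkpoint. Below is a minimal sketch of that kind of logic, not the actual NeMo source; the key pattern, file name, and variable names other than layer_numbers/num_layers are assumptions for illustration. If no keys match the expected layer naming (for example because the checkpoint is a Megatron-LM distributed/sharded checkpoint rather than a consolidated PyTorch state dict), layer_numbers ends up empty and max() raises exactly this ValueError.

import re
import torch

# Hypothetical sketch of the logic around the failing line in
# convert_mamba2_pyt_to_nemo.py; names and key pattern are assumptions.
checkpoint = torch.load("mamba2_checkpoint.pt", map_location="cpu")  # example path
state_dict = checkpoint.get("model", checkpoint)

# Collect the layer index from every parameter name that matches the
# expected naming scheme, e.g. "decoder.layers.12.mixer.A_log".
layer_numbers = [
    int(m.group(1))
    for key in state_dict
    if (m := re.search(r"layers\.(\d+)\.", key))
]

# If no key matches the expected scheme, layer_numbers is empty and the
# next line raises "ValueError: max() arg is an empty sequence".
num_layers = max(layer_numbers) + 1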

Expected behavior

Expected the Mamba-trained model to be converted to .nemo format for fine-tuning.

Environment overview (please complete the following information)

  • Environment location: Ubuntu, Docker, FluidStack VM with 2 × A100 80GB.
  • Method of NeMo install: installed from source, with Megatron-LM also built from source.
  • If method of install is [Docker], provide docker pull & docker run commands used:

Docker pull command :

 sudo docker pull nvcr.io/nvidia/pytorch:24.07-py3 

Docker Run command :

docker run --gpus all -it --rm --ipc=host \
  --shm-size=40g \
  -v /ephemeral/megatron:/workspace/megatron \
  -v /ephemeral/data:/workspace/dataset/data \
  -v /ephemeral/outfix:/workspace/dataset/outfix \
  -v /ephemeral/tok:/workspace/dataset/tok \
  -v /ephemeral/checkpoints:/workspace/checkpoints \
  -v /ephemeral/nemo:/workspace/nemo \
  nvcr.io/nvidia/pytorch:24.07-py3 

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

  • OS version : Ubuntu 22.04.3 LTS
  • PyTorch version : 2.4
  • Python version : 3.10.12

Additional context

NVIDIA PyTorch container: 24.07 (assuming training was done with 24.03). GPUs: 2 × A100 80GB.

Followed the steps here: tutorials/llm/mamba/mamba.rst

SkanderBS2024 · Aug 16 '24 13:08

Hi @SkanderBS2024, I see you are not using the NeMo container nvcr.io/nvidia/nemo:24.07 and are instead mounting NeMo into the base PyTorch container. I tested the conversion script in nvcr.io/nvidia/nemo:24.07 and it works fine. However, an update is needed for the latest main, for which I have raised a PR: https://github.com/NVIDIA/NeMo/pull/10224. You can either check out that PR or use the 24.07 NeMo container. Thanks for reporting the issue!
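
For reference, a run command along the lines of the one above can simply be pointed at the NeMo container instead of the base PyTorch image. The sketch below mirrors the reporter's command with the image swapped; mount paths are placeholders, and since the NeMo container ships with NeMo preinstalled (under /opt/NeMo), mounting a separate NeMo source tree should not be required.

docker pull nvcr.io/nvidia/nemo:24.07

docker run --gpus all -it --rm --ipc=host \
  --shm-size=40g \
  -v /ephemeral/checkpoints:/workspace/checkpoints \
  nvcr.io/nvidia/nemo:24.07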

JRD971000 · Aug 21 '24 18:08

Hello @JRD971000, yes, I switched to the nvcr.io/nvidia/nemo:24.07 container and everything worked fine. Thank you for your response.

SkanderBS2024 · Aug 21 '24 21:08

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Sep 21 '24 01:09

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Sep 28 '24 01:09