"The provided qkv memory layout is not supported!" when using RoPE
I think this problem has been solved before, but now it has come back.
How can I solve it? I have tested both the stable branch and the main branch; neither works.
https://github.com/NVIDIA/TransformerEngine/issues/544 https://github.com/NVIDIA/TransformerEngine/issues/455
I just ran the official Megatron-LM text generation example, adding the --position-embedding-type rope and --no-position-embedding args:
https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_text_generation_server_345M.sh
And got the error: "The provided qkv memory layout is not supported!"
Moreover, I am using the mcore model instead of the legacy model, so you need to change that in text_generation_server.py to reproduce the error.
I solved the problem by hard-coding...
https://github.com/NVIDIA/Megatron-LM/issues/703#issuecomment-1965759788
Hi @1049451037, could you provide more details about how to run the job, please? Currently I can start the job, but I am stuck at:
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 76705792
* Serving Flask app 'megatron.text_generation_server'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://10.176.10.102:5000
INFO:werkzeug:Press CTRL+C to quit
Could you send me your complete run_text_generation_server_345M.sh script please?
I don't see any problem in your log. You are running the server; you just need to start a client with: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/text_generation_cli.py
python tools/text_generation_cli.py 10.176.10.102:5000
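If the CLI client is inconvenient, here is a minimal sketch of querying the REST endpoint directly. It assumes the PUT /api route and JSON body that the Flask server logs show later in this thread, and it uses only the Python standard library; adjust the address to your own server.

import json
import urllib.request

# Replace with the address from the server's "Running on ..." log line.
url = "http://10.176.10.102:5000/api"
payload = {"prompts": ["I am"], "tokens_to_generate": 10}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))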
Same issue. Any solutions?
I tested with TE main (8255f87) and Megatron-LM main (8957468), and I'm not seeing the issue above. Let me know if I'm not using the same run script as you.
Thanks.
# container: nvcr.io/nvidia/pytorch:24.02-py3
# install the latest TransformerEngine and Megatron-LM
$ cat examples/run_text_generation_server_345M.sh
#!/bin/bash
# This example will start serving the 345M model.
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT="" #<Path to checkpoint (e.g /345m)>
VOCAB_FILE=/code/gpt2-vocab.json
MERGE_FILE=/code/gpt2-merges.txt
DATA_PATH=/code/ds/ThePile/BookCorpus2_ftfy_cleaned_id_shuf_text_document
export CUDA_DEVICE_MAX_CONNECTIONS=1
pip install flask-restful
#torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
torchrun tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 2 \
--hidden-size 1024 \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--position-embedding-type rope \
--no-position-embedding \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--data-path $DATA_PATH \
--use-mcore-models \
--micro-batch-size 2 \
--global-batch-size 8 \
--lr 0.00015 \
--train-iters 50 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--lr-warmup-fraction .01 \
--clip-grad 1.0 \
--seed 42
$ pip list | grep transformer
transformer-engine 1.5.0.dev0
$ bash examples/run_text_generation_server_345M.sh
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://10.32.113.152:5000
INFO:werkzeug:Press CTRL+C to quit
request IP: 10.32.113.152
{"prompts": ["I am"], "tokens_to_generate": 10}
start time: 2024-03-09 04:22:05.314579
INFO:werkzeug:10.32.113.152 - - [09/Mar/2024 04:22:10] "PUT /api HTTP/1.1" 200 -
$ python tools/text_generation_cli.py 10.32.113.152:5000
Enter prompt: I am
Enter number of tokens to generate: 10
Megatron Response:
I am perennlington vehiclesigning protagonistlon Peng surreal nostalgia ignorant
Enter prompt:
You don't hit the problem if you just run the example, because the example inference does not use the MCore model. It uses the legacy model, as you can see in model_provider.
@cyanguwa You can replace the model provider in the text generation server with the following to reproduce the error:
from megatron import get_args, print_rank_0  # needed by model_provider below
from megatron.core.models.gpt import GPTModel
import megatron.model
from megatron.training import get_model
from megatron.arguments import core_transformer_config_from_args
from megatron.text_generation_server import MegatronServer
from megatron.text_generation import generate_and_post_process
from megatron.text_generation import beam_search_and_post_process
import torch
from megatron.core.transformer.spec_utils import import_module
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    args = get_args()
    print_rank_0('building GPT model ...')
    config = core_transformer_config_from_args(args)
    if args.use_mcore_models:
        print("building megatron core model!!!!!!!!!!!!!!")
        if args.spec is not None:
            transformer_layer_spec = import_module(args.spec)
        else:
            transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(
                args.num_experts, args.moe_grouped_gemm)
        model = GPTModel(
            config=config,
            transformer_layer_spec=transformer_layer_spec,
            vocab_size=args.padded_vocab_size,
            max_sequence_length=args.max_position_embeddings,
            pre_process=pre_process,
            post_process=post_process,
            fp16_lm_cross_entropy=args.fp16_lm_cross_entropy,
            parallel_output=False,
            share_embeddings_and_output_weights=not args.untie_embeddings_and_output_weights,
            position_embedding_type=args.position_embedding_type,
            rotary_percent=args.rotary_percent,
        )
    else:
        print("building megatron legacy model!!!!!!!!!!!!!!")
        assert False, "Never do this!"
        assert args.context_parallel_size == 1, "Context parallelism is only supported with Megatron Core!"
        model = megatron.model.GPTModel(
            config,
            num_tokentypes=0,
            parallel_output=False,
            pre_process=pre_process,
            post_process=post_process,
        )
    return model
We're aware of this bug and will push a fix to MCore. For now, you can add the following code in https://github.com/NVIDIA/Megatron-LM/blob/89574689447d694bb19dd86fc8a6153b4467ba9d/megatron/core/transformer/custom_layers/transformer_engine.py#L464
# In PyTorch, the following two tensors are in fact the same:
# Tensor with shape (1, S, H, D) and stride (S*H*D, H*D, D, 1)
# Tensor with shape (1, S, H, D) and stride (H*D, H*D, D, 1)
# We unify them to the first one to pass the stride check in TE
if value.shape == key.shape and value.stride() != key.stride():
    value = value.as_strided(value.shape, key.stride())
No, this won't fix the bug. It makes inference work normally, but it makes training fail (the training loss does not converge).
My solution for now is to add the as_strided only for inference and to comment out this line during training... Waiting for an official, more elegant fix...
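As a hedged sketch of that temporary workaround, one way to avoid hand-editing between runs is to gate the fix on the module's training flag. This assumes self is in scope at that point in TEDotProductAttention.forward and that the model is put in eval mode for generation, as the text generation server does:

# Apply the stride fix only when not training (e.g. during text generation,
# where the server calls model.eval()); skip it during training so the loss
# issue described above is not triggered.
if not self.training and value.shape == key.shape and value.stride() != key.stride():
    value = value.as_strided(value.shape, key.stride())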
Maybe it is the qkv_format; you can check whether the tensor format is sbhd or bshd.
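If you want to verify, here is a quick shape-based sanity check (a hypothetical helper, assuming you can inspect the query tensor where the error is raised):

def guess_qkv_format(q, seq_len, batch_size):
    # sbhd: (seq_len, batch, num_heads, head_dim)
    # bshd: (batch, seq_len, num_heads, head_dim)
    # Ambiguous when seq_len == batch_size; inspect the config in that case.
    if q.dim() == 4 and q.shape[0] == seq_len and q.shape[1] == batch_size:
        return "sbhd"
    if q.dim() == 4 and q.shape[0] == batch_size and q.shape[1] == seq_len:
        return "bshd"
    return "unknown"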
sbhd; this is just the official training and inference code in Megatron.
I believe the MCore issue is fixed now; is that correct, @yaox12? Can we close this issue?
Yes, the issue is fixed in the latest main branch of Megatron-LM.