"The provided qkv memory layout is not supported!" when using RoPE
I think this problem has been solved before, but now it has come back.
How can I solve it? I have tested both the stable branch and the main branch; neither works.
https://github.com/NVIDIA/TransformerEngine/issues/544 https://github.com/NVIDIA/TransformerEngine/issues/455
I just ran the official Megatron-LM text generation example, adding the --position-embedding-type rope and --no-position-embedding args:
https://github.com/NVIDIA/Megatron-LM/blob/main/examples/run_text_generation_server_345M.sh
And got the error: "The provided qkv memory layout is not supported!"
Moreover, I am using the mcore model instead of the legacy model, so you need to change that in text_generation_server.py to reproduce the error.
I solved the problem by hard-coding...
https://github.com/NVIDIA/Megatron-LM/issues/703#issuecomment-1965759788
Hi @1049451037, could you provide more details about how to run the job, please? Currently I can start the job, but I am stuck at:
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 76705792
* Serving Flask app 'megatron.text_generation_server'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://10.176.10.102:5000
INFO:werkzeug:Press CTRL+C to quit
Could you send me your complete run_text_generation_server_345M.sh script please?
I don't see any problem in your log. You are running the server; you just need to start a client with: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/text_generation_cli.py
python tools/text_generation_cli.py 10.176.10.102:5000
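If the CLI client is inconvenient, here is a minimal sketch of querying the REST endpoint directly. It assumes the PUT /api route and JSON body that the Flask server logs show later in this thread, and it uses only the Python standard library; adjust the address to your own server.

import json
import urllib.request

# Replace with the address from the server's "Running on ..." log line.
url = "http://10.176.10.102:5000/api"
payload = {"prompts": ["I am"], "tokens_to_generate": 10}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))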
Same issue. Any solutions?
I tested with TE main (8255f87) and Megatron-LM main (8957468), and I'm not seeing the issue above. Let me know if I'm not using the same run script as you.
Thanks.
# container: nvcr.io/nvidia/pytorch:24.02-py3
# install the latest TransformerEngine and Megatron-LM
$ cat examples/run_text_generation_server_345M.sh
#!/bin/bash
# This example will start serving the 345M model.
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT="" #<Path to checkpoint (e.g /345m)>
VOCAB_FILE=/code/gpt2-vocab.json
MERGE_FILE=/code/gpt2-merges.txt
DATA_PATH=/code/ds/ThePile/BookCorpus2_ftfy_cleaned_id_shuf_text_document
export CUDA_DEVICE_MAX_CONNECTIONS=1
pip install flask-restful
#torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
torchrun tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 2 \
--hidden-size 1024 \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--position-embedding-type rope \
--no-position-embedding \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--data-path $DATA_PATH \
--use-mcore-models \
--micro-batch-size 2 \
--global-batch-size 8 \
--lr 0.00015 \
--train-iters 50 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--lr-warmup-fraction .01 \
--clip-grad 1.0 \
--seed 42
$ pip list | grep transformer
transformer-engine 1.5.0.dev0
$ bash examples/run_text_generation_server_345M.sh
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://10.32.113.152:5000
INFO:werkzeug:Press CTRL+C to quit
request IP: 10.32.113.152
{"prompts": ["I am"], "tokens_to_generate": 10}
start time: 2024-03-09 04:22:05.314579
INFO:werkzeug:10.32.113.152 - - [09/Mar/2024 04:22:10] "PUT /api HTTP/1.1" 200 -
$ python tools/text_generation_cli.py 10.32.113.152:5000
Enter prompt: I am
Enter number of tokens to generate: 10
Megatron Response:
I am perennlington vehiclesigning protagonistlon Peng surreal nostalgia ignorant
Enter prompt:
You don't hit the problem if you just run the example, because the example inference does not use the MCore model. It uses the legacy model, as you can see in model_provider.
@cyanguwa You can replace the model provider in the text generation server with the following to reproduce the error:
from megatron import get_args, print_rank_0  # needed by model_provider below
from megatron.core.models.gpt import GPTModel
import megatron.model
from megatron.training import get_model
from megatron.arguments import core_transformer_config_from_args
from megatron.text_generation_server import MegatronServer
from megatron.text_generation import generate_and_post_process
from megatron.text_generation import beam_search_and_post_process
import torch
from megatron.core.transformer.spec_utils import import_module
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    args = get_args()
    print_rank_0('building GPT model ...')
    config = core_transformer_config_from_args(args)
    if args.use_mcore_models:
        print("building megatron core model!!!!!!!!!!!!!!")
        if args.spec is not None:
            transformer_layer_spec = import_module(args.spec)
        else:
            transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(
                args.num_experts, args.moe_grouped_gemm)
        model = GPTModel(
            config=config,
            transformer_layer_spec=transformer_layer_spec,
            vocab_size=args.padded_vocab_size,
            max_sequence_length=args.max_position_embeddings,
            pre_process=pre_process,
            post_process=post_process,
            fp16_lm_cross_entropy=args.fp16_lm_cross_entropy,
            parallel_output=False,
            share_embeddings_and_output_weights=not args.untie_embeddings_and_output_weights,
            position_embedding_type=args.position_embedding_type,
            rotary_percent=args.rotary_percent,
        )
    else:
        print("building megatron legacy model!!!!!!!!!!!!!!")
        assert False, "Never do this!"
        assert args.context_parallel_size == 1, "Context parallelism is only supported with Megatron Core!"
        model = megatron.model.GPTModel(
            config,
            num_tokentypes=0,
            parallel_output=False,
            pre_process=pre_process,
            post_process=post_process,
        )
    return model
We're aware of this bug and will push a fix to MCore. For now, you can add the following code in https://github.com/NVIDIA/Megatron-LM/blob/89574689447d694bb19dd86fc8a6153b4467ba9d/megatron/core/transformer/custom_layers/transformer_engine.py#L464
# In PyTorch, the following two tensors are in fact the same:
# Tensor with shape (1, S, H, D) and stride (S*H*D, H*D, D, 1)
# Tensor with shape (1, S, H, D) and stride (H*D, H*D, D, 1)
# We unify them to the first one to pass the stride check in TE
if value.shape == key.shape and value.stride() != key.stride():
    value = value.as_strided(value.shape, key.stride())
No, this won't fix the bug. It makes inference work normally, but it makes training fail (the training loss does not converge).
My solution for now is to add the as_strided only for inference and to comment out this line during training... Waiting for an official, more elegant fix...
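As a hedged sketch of that temporary workaround, one way to avoid hand-editing between runs is to gate the fix on the module's training flag. This assumes self is in scope at that point in TEDotProductAttention.forward and that the model is put in eval mode for generation, as the text generation server does:

# Apply the stride fix only when not training (e.g. during text generation,
# where the server calls model.eval()); skip it during training so the loss
# issue described above is not triggered.
if not self.training and value.shape == key.shape and value.stride() != key.stride():
    value = value.as_strided(value.shape, key.stride())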
Maybe it is the qkv_format; you can check whether the tensor format is sbhd or bshd.
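If you want to verify, here is a quick shape-based sanity check (a hypothetical helper, assuming you can inspect the query tensor where the error is raised):

def guess_qkv_format(q, seq_len, batch_size):
    # sbhd: (seq_len, batch, num_heads, head_dim)
    # bshd: (batch, seq_len, num_heads, head_dim)
    # Ambiguous when seq_len == batch_size; inspect the config in that case.
    if q.dim() == 4 and q.shape[0] == seq_len and q.shape[1] == batch_size:
        return "sbhd"
    if q.dim() == 4 and q.shape[0] == batch_size and q.shape[1] == seq_len:
        return "bshd"
    return "unknown"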
sbhd; this is just the official training and inference code in Megatron.
I believe the MCore issue is fixed now; is that correct, @yaox12? Can we close this issue?
Yes, the issue is fixed in the latest main branch of Megatron-LM.