
Dynamic scaling not working on RoPe / rotary_scaling

Open TheCodeWrangler opened this issue 1 year ago • 5 comments

          @byshiue can you try to see if dynamic scaling works? Linear scaling works fine. If dynamic scaling doesn't work at all, then this is indeed a bug.

Originally posted by @avianion in https://github.com/NVIDIA/TensorRT-LLM/issues/1595#issuecomment-2112786968

Several users have experienced errors when running engine files that were built to use "dynamic" rotary_scaling.

Is dynamic scaling supported at this time?

TheCodeWrangler avatar May 15 '24 16:05 TheCodeWrangler

Seconding this. I would like to try dynamic scaling but at the moment only linear scaling works.

Dynamic scaling supposedly provides better results. But this isn't possible to try at the moment.
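For context, "dynamic" here refers to dynamic NTK-aware RoPE scaling, which rescales the rotary base as the sequence length grows past the trained context, rather than dividing positions by a fixed factor as linear scaling does. A minimal sketch of the commonly used formula (as in the Hugging Face Transformers implementation; the function name is illustrative):

```python
import math


def dynamic_ntk_base(base, dim, seq_len, max_pos, factor):
    """Return the RoPE base adjusted by dynamic NTK-aware scaling.

    base:    original rotary base (e.g. 10000.0, or 500000.0 for Llama 3)
    dim:     per-head rotary dimension
    seq_len: current sequence length
    max_pos: max position embeddings the model was trained with
    factor:  the rotary_scaling "factor" from config.json
    """
    if seq_len <= max_pos:
        # Within the trained context, the base is left unchanged.
        return base
    # Grow the base so the lowest frequencies stretch to cover seq_len.
    return base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))
```

Because the base only changes once the input exceeds the trained context, short prompts behave identically to the unscaled model, which is why dynamic scaling is expected to give better results than linear scaling.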

avianion avatar May 15 '24 16:05 avianion

@byshiue @kaiyux

avianion avatar May 15 '24 21:05 avianion

Hi @avianion and @TheCodeWrangler, could you please provide the steps to reproduce?

nekorobov avatar May 17 '24 07:05 nekorobov

@nekorobov

Hi.

Please follow these steps:

  1. Install the Llama 3 8B or 70B Instruct model, and convert it to a TensorRT-LLM checkpoint with the following command:

python3 convert_checkpoint.py --model_dir llama370b \
    --output_dir llama-3-70b-ckpt \
    --dtype float16 \
    --workers 2

  2. cd into the checkpoint directory, and modify the config.json file to have the following rotary scaling settings:

"rotary_scaling": {
    "type": "dynamic",
    "factor": 2.0
},

  3. Build the engine from the checkpoint:

trtllm-build --checkpoint_dir llama-3-70b-ckpt \
    --output_dir ./llama-3-70b-engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 8 \
    --workers 2

  4. Once the engine is built, simply run inference on it. You can try a basic command like this:

mpirun --allow-run-as-root -n 2 python3 ../run.py \
    --engine_dir ./llama-3-70b-engine \
    --max_output_len 4096 \
    --tokenizer_dir meta-llama/Meta-Llama-3-70B-Instruct \
    --input_text "<|start_header_id|>user<|end_header_id|>Tell me how to count to nine in French<|eot_id|>"

Observe that you will either get no output or an error regarding the encoding, no matter what settings you use.
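As an aside, step 2 can be scripted instead of hand-editing config.json; a minimal sketch (the path and the top-level field location follow the snippet above and may need adjusting for your checkpoint layout):

```python
import json


def set_rotary_scaling(cfg_path, scaling_type="dynamic", factor=2.0):
    """Patch a converted checkpoint's config.json to request rotary scaling.

    Writes the rotary_scaling entry shown in step 2 above; the field is
    assumed to live at the top level of config.json.
    """
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["rotary_scaling"] = {"type": scaling_type, "factor": factor}
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg


if __name__ == "__main__":
    # Example path from step 1 above.
    set_rotary_scaling("llama-3-70b-ckpt/config.json")
```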

avianion avatar May 17 '24 09:05 avianion

I was not able to reproduce this with either the 8B or the 8B-Instruct version and dynamic scaling. Could you please share what output you get? Note that I was using a single-GPU setup.

nekorobov avatar May 17 '24 22:05 nekorobov

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Jun 23 '24 01:06 github-actions[bot]