activezhao

Results: 8 issues by activezhao

### System Info CPU: x86_64, GPU: NVIDIA A10, TensorRT branch: main, commit id: cad22332550eef9be579e767beb7d605dd96d6f3, CUDA: NVIDIA-SMI 470.82.01, Driver Version: 470.82.01, CUDA Version: 11.4 ### Who can help? Quantization: @Tracin ### Information...

bug
triaged

I use tutorials/Quick_Deploy/vLLM to deploy CodeLlama 7B, then I call the metrics API, and part of the metrics output is: `nv_inference_request_summary_us_count{model="triton-vllm-code-llama-model",version="1"} 2639 nv_inference_request_summary_us_sum{model="triton-vllm-code-llama-model",version="1"} 2885180` I use Grafana to...
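For reference, the average end-to-end request latency is just `sum / count` from those two counters (2885180 / 2639 ≈ 1093 µs here). A minimal sketch of pulling that out of the metrics endpoint, assuming Triton's default metrics port 8002 and the model name shown above:

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumed default Triton metrics port
MODEL = "triton-vllm-code-llama-model"

text = urllib.request.urlopen(METRICS_URL).read().decode()

def counter(name: str) -> float:
    # Read one counter for this model from the Prometheus text exposition format.
    label = f'{name}{{model="{MODEL}",version="1"}}'
    return float(re.search(re.escape(label) + r"\s+(\S+)", text).group(1))

count = counter("nv_inference_request_summary_us_count")
total_us = counter("nv_inference_request_summary_us_sum")

# e.g. 2885180 / 2639 ≈ 1093 µs ≈ 1.09 ms per request
print(f"avg request latency: {total_us / count / 1000:.2f} ms")
```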

question

In llama's convert_checkpoint.py, if we pass the args on the command line, the rotary_scaling param cannot be omitted in some situations, so a rotary_scaling param should be added to avoid...
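As a rough illustration of the kind of change being requested, a minimal sketch of an optional command-line flag, assuming convert_checkpoint.py parses its arguments with argparse (the flag shape, names, and default below are illustrative, not the actual definition in the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# Illustrative flag: let the caller set rotary scaling explicitly on the command
# line instead of requiring it to be present in the model config.
parser.add_argument(
    "--rotary_scaling",
    nargs=2,
    metavar=("TYPE", "FACTOR"),
    default=None,
    help="e.g. --rotary_scaling linear 4.0; omit to keep the value from the model config",
)

args = parser.parse_args(["--rotary_scaling", "linear", "4.0"])
if args.rotary_scaling is not None:
    scaling_type, scaling_factor = args.rotary_scaling[0], float(args.rotary_scaling[1])
    print(scaling_type, scaling_factor)  # -> linear 4.0
```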

### System Info CPU: x86_64, GPU: NVIDIA L20, TensorRT branch: v0.8.0, CUDA: NVIDIA-SMI 535.154.05, Driver Version: 535.154.05, CUDA Version: 12.3 ### Who can help? @byshiue ### Information - [X] The...

bug

When the prompt and parameters are the same, I call the ensemble and tensorrt_llm_bls APIs and the results are different; the result of `ensemble` is the expected one. I analyzed the code...
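To make the comparison concrete, here is a small sketch of sending the identical request to both models through Triton's HTTP generate endpoint; the field names (text_input, max_tokens, temperature, text_output) follow the usual tensorrtllm_backend templates and are assumptions here, as is the default port:

```python
import requests

TRITON_URL = "http://localhost:8000"  # assumed default Triton HTTP port
payload = {
    # Field names follow the common tensorrtllm_backend ensemble/BLS config;
    # adjust them if the deployed config.pbtxt differs.
    "text_input": "def fibonacci(n):",
    "max_tokens": 64,
    "temperature": 0.0,
    "top_p": 1.0,
}

for model in ("ensemble", "tensorrt_llm_bls"):
    resp = requests.post(f"{TRITON_URL}/v2/models/{model}/generate", json=payload)
    resp.raise_for_status()
    # With greedy sampling the two outputs would be expected to match.
    print(model, "->", resp.json().get("text_output"))
```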

### System Info CPU: x86_64, GPU: NVIDIA L20, TensorRT branch: v0.8.0, CUDA: NVIDIA-SMI 535.154.05, Driver Version: 535.154.05, CUDA Version: 12.3 ### Who can help? @kaiyux @byshiue @schetlur-nv ### Information -...

bug

I used the following command to quantize the [Codestral-22B-v0.1](https://huggingface.co/mistralai/Codestral-22B-v0.1) model:

```
python3 /data/trt_llm_code/trt_llm_v0.16.0/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /data/base_models/codestral-22b-v1.0-250311 \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_dataset /data/cnn_dailymail \
    --output_dir /data/trt_llm_quantize_files/nv-ada/trt-v16-codestral-22b-v1.0-250311-fp8/2-gpu \
    --tp_size 2
```

I got the...