activezhao

Results: 39 comments by activezhao

> Please try to add the `quantize_lm_head` option to build.py.

@Tracin OK, thanks for your reply, I'll try it.

> Please try to add the `quantize_lm_head` option to build.py.

@Tracin I added the `--quantize_lm_head` parameter, but a new error appeared:

```
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    ...
```

> Sorry for the ambiguous instruction; you have to add `--quantize_lm_head` for `quantize.py` as well. Since there is a bug in AMMO, we have to enable this before the next release.

@Tracin OK,...
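
For reference, a minimal sketch of the two-step flow discussed above. Only `--quantize_lm_head` and the paths come from this thread; `--qformat int4_awq` and `--export_path` are assumptions based on the example quantization script, and the elided build arguments are left as placeholders:

```
# Step 1: quantize the HF checkpoint, enabling the lm_head workaround
# (--qformat/--export_path are assumed from the example quantize.py)
python quantize.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --qformat int4_awq \
    --quantize_lm_head \
    --export_path /data/trt_llama_7b_quantized_int4-awq/

# Step 2: build the engine from the quantized checkpoint with the same flag
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    --quantize_lm_head \
    ...
```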

@Tracin It works, so nice. But when I launch Triton Server, the error is:

```
E0129 06:39:43.236314 25545 model_repository_manager.cc:580] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded...
```

> `Assertion failed: mpiSize == tp * pp` Did you run with mpi?

@Tracin Yes, I use `scripts/launch_triton_server.py` to launch Triton Server.

> > > `Assertion failed: mpiSize == tp * pp` Did you run with mpi?
> >
> > @Tracin Yes, I use `scripts/launch_triton_server.py` to launch Triton...
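
The assertion means the MPI world size must equal the `tp_size * pp_size` the engine was built with. A sketch of a launch that satisfies it for a single-GPU engine, assuming the script accepts the `--world_size` and `--model_repo` flags as in the tensorrtllm_backend examples, with a placeholder repo path:

```
# world_size must equal tp * pp of the built engine; 1 for a tp=1, pp=1 engine
python3 scripts/launch_triton_server.py \
    --world_size=1 \
    --model_repo=/path/to/triton_model_repo
```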

Hi @Tracin I have two questions, could you help me answer them?

1. I use int4_awq engines, max_batch_size is 8, one GPU, and the throughput is 379 tokens/s. But int8_weight + kv...

> 1. I am not sure; there is more than one variant in the experiments.
> 2. int4_awq supports tp_size > 1.
> 3. If you want to change max_batch_size, you have...
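
For context on point 3: in TensorRT-LLM, `max_batch_size` is fixed when the engine is built, so changing it means rebuilding. A sketch, where the value 16 is an arbitrary example and all other arguments are elided placeholders:

```
# rebuild the engine with the new batch limit (16 is just an example value)
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --max_batch_size 16 \
    ...
```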

> How did you calculate the average inference delay? Is it from the client side? cc: @rmccorm4 about metrics calculations

I just collect the metrics data by calling `:8002/metrics`...
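
For reference, `:8002` is Triton's default Prometheus metrics port, so the scrape presumably looks something like this:

```
# scrape Triton's metrics endpoint and keep the inference counters
curl -s localhost:8002/metrics | grep '^nv_inference'
```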

> I meant, how did you get 300-500 ms? But I agree, the metrics should behave the same way as they would with the FasterTransformer backend.

Let me describe it...
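
A plausible way to get such a number from the endpoint above: Triton's counters are cumulative, so the average request latency is the duration counter divided by the request counter. A sketch using the standard `nv_inference_request_duration_us` and `nv_inference_request_success` counters; note this yields a lifetime average, while taking deltas between two scrapes would give an interval average:

```
# lifetime average request latency from Triton's cumulative counters
# (head -n1 picks the first model's line; adjust the grep for a specific model)
metrics=$(curl -s localhost:8002/metrics)
dur=$(echo "$metrics" | grep '^nv_inference_request_duration_us' | head -n1 | awk '{print $2}')
cnt=$(echo "$metrics" | grep '^nv_inference_request_success' | head -n1 | awk '{print $2}')
echo "$dur $cnt" | awk '{ if ($2 > 0) printf "avg latency: %.1f ms\n", $1 / $2 / 1000 }'
```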