Kyungmin Lee

11 comments by Kyungmin Lee

Hi @molly-smith! `replace_with_kernel_inject=True` on OPT (galactica-6.7b) still produces garbage output on `deepspeed==0.8.3`. Is a fix for this in progress?
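
For context, this is roughly how the injection path is invoked — a minimal sketch, assuming the Hugging Face `facebook/galactica-6.7b` checkpoint, a single GPU, and `deepspeed==0.8.3`; not the exact script I ran:

```python
# Minimal repro sketch (assumed model/checkpoint and single-GPU setup).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/galactica-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Kernel injection path that produces garbage output for me on OPT-family models.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("The Transformer architecture", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```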

Hi, regarding question 1: there seems to be a constraint on the softmax. See https://github.com/NVIDIA/FasterTransformer/issues/663

@byshiue I have the same problem. Following https://github.com/triton-inference-server/fastertransformer_backend#rebuilding-fastertransformer-backend-optional:

```
cmake \
  -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
  -D CMAKE_BUILD_TYPE=Release \
  -D ENABLE_FP8=OFF \
  -D BUILD_MULTI_GPU=ON \
  -D BUILD_PYT=ON \
  -D SM=80 \
  -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
  ...
```

Related to https://github.com/NVIDIA/TensorRT-LLM/issues/1440

Hi @MrD005 @ARomoH @jellysnack, can you try setting `use_custom_all_reduce` to `disable`? Related to https://github.com/triton-inference-server/tensorrtllm_backend/issues/390

@PerkzZheng I worked around it for now: my solution is to disable `use_custom_all_reduce` in `trtllm-build`.
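
A minimal sketch of the workaround invocation. The checkpoint and output paths are placeholders, `trtllm-build` is assumed to be on `PATH`, and the `--use_custom_all_reduce disable` flag matches the TensorRT-LLM version I was on; newer releases may have changed or removed this option:

```python
# Sketch of the workaround (hypothetical paths; flag availability depends on
# the installed TensorRT-LLM version).
import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "./tllm_checkpoint",  # placeholder checkpoint dir
        "--output_dir", "./engine_output",        # placeholder engine dir
        "--use_custom_all_reduce", "disable",     # disable the custom all-reduce plugin
    ],
    check=True,
)
```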

Hi @byshiue, can you check this MR? If "exaone" is in `model_name.lower()`, then `--model_dir` should point to a path that includes "exaone", for example: [export HF_MODEL_DIR=hf_models/exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone#exaone-deep) On the other hand,...

The same condition is being used here too. https://github.com/NVIDIA/TensorRT-LLM/blob/eb2d51a42990b8d0b30bc6c29fad4fd491da749f/tensorrt_llm/models/llama/convert.py#L419-L420
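
To make the constraint concrete, here is a paraphrase of the kind of check being discussed; this is not the exact source from `convert.py`, and the directory value and name derivation are only illustrative:

```python
# Paraphrase of the check (not the exact TensorRT-LLM source): the EXAONE
# code path is selected from the model name, which is taken from the
# directory passed via --model_dir, so that path must contain "exaone".
import os

model_dir = "hf_models/exaone"            # example --model_dir value from the EXAONE README
model_name = os.path.basename(model_dir)  # rough illustration of how the name is derived

if "exaone" in model_name.lower():
    print("detected as an EXAONE checkpoint")
else:
    print("EXAONE-specific conversion logic would be skipped")
```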