Godlovecui
### System Info
L20, 8 cards, 8x48G memory, TensorRT-LLM version: 0.11.0.dev2024051400

### Who can help?
@Tra

### Information
- [X] The official example scripts
- [ ] My own modified...
### System Info
rtx4090

### Who can help?
@

### Information
- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks
- [ ]...
### System Info
8*RTX4090, 24G
tensorrt_llm version: 0.11.0.dev2024051400

### Who can help?
@T

### Information
- [X] The official example scripts
- [ ] My own modified scripts

### Tasks...
**Description**
I ran a benchmark of Meta-Llama-3-8B-Instruct on 8*RTX 4090. With 16 concurrent requests, an input sequence length of 1024, and an output sequence length of 1024, the TTFT (time to first token) is...
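For reference, below is a minimal sketch of how TTFT can be measured from a streaming `generate` call. It is an illustrative assumption on my side: it uses Hugging Face `transformers` rather than the TensorRT-LLM benchmark harness, runs a single request instead of 16 concurrent ones, and only approximates the 1024-token prompt.

```python
# Hedged sketch: measuring TTFT (time to first token) via a streaming generate call.
# Assumptions: Hugging Face transformers (not the TensorRT-LLM benchmark harness),
# a single request rather than 16 concurrent ones, and a rough ~1024-token prompt.
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Hello " * 1024  # rough stand-in for a ~1024-token input
inputs = tok(prompt, return_tensors="pt").to("cuda")

streamer = TextIteratorStreamer(tok, skip_prompt=True)
start = time.perf_counter()
# generate() blocks, so run it in a thread and read tokens from the streamer.
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=1024, streamer=streamer)).start()

first_chunk = next(iter(streamer))  # blocks until the first decoded text chunk arrives
print(f"TTFT: {time.perf_counter() - start:.3f} s (first chunk: {first_chunk!r})")
```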
### System Info
RTX 8*4090
version:
TensorRT-LLM: v0.9.0
tensorrtllm_backend: v0.9.0

### Who can help?
@kaiyux @BY

### Information
- [X] The official example scripts
- [ ] My own modified...
FP8 is very useful for LLM training and inference. Does FlashAttention support FP8? Thank you~
# 🚀 Feature
FP8 is very useful for LLM training and inference. Does xformers support FP8? Thank you~
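Not an answer on FP8 support, but for context, here is a minimal sketch of the xformers attention call the question refers to, run in BF16. Whether FP8 inputs are accepted is exactly the open question and is not assumed here; shapes and dtypes are illustrative.

```python
# Minimal sketch of xformers' memory_efficient_attention in BF16.
# FP8 input support is the open question and is NOT assumed here;
# shapes and dtypes below are illustrative only.
import torch
import xformers.ops as xops

# (batch, seq_len, num_heads, head_dim)
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

out = xops.memory_efficient_attention(q, k, v)  # dispatches to a fused attention kernel
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```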
When I run 06-fused-attention.py on an RTX 4090, it raises the error below. How can I fix it? Thank you!
Triton version: 2.3.0
CUDA: 12.4

root@GPU-RTX4090-4-8:/workspaces/triton/python/tutorials# python 06-fused-attention.py
Traceback (most recent call last):...
ENV: RTX 8*4090
I want to test FP8 inference with TransformerEngine for Llama-3 (from Hugging Face). I cannot find instructions for inference. Can you share some code? Thank you~
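Not official guidance, but below is a minimal sketch of how FP8 is typically enabled for inference with TransformerEngine's `fp8_autocast`. It exercises a bare `te.Linear` rather than the Hugging Face Llama-3 model, and the recipe settings are illustrative assumptions.

```python
# Hedged sketch: FP8 inference with TransformerEngine (not wired into Llama-3 here).
# Assumes an FP8-capable GPU (compute capability >= 8.9, e.g. RTX 4090) and a recent
# transformer_engine release; the recipe settings are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID: E4M3 for forward activations/weights, E5M2 for gradients (unused at inference).
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda().eval()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM runs in FP8; the output is returned in the input dtype

print(y.shape, y.dtype)
```

For a full Hugging Face Llama-3 checkpoint, the usual approach is to replace the model's linear/attention modules with their TE equivalents (or use a serving framework that already does this) and run the forward pass under the same `fp8_autocast` context; that wiring is not shown above.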
Hi: I'd like to test FP8 on an RTX 4090. I can find BF16 MMA atoms such as SM80_16x8x8_F32BF16BF16F32_TN in cutlass/include/cute/arch/mma_sm80.hpp, but I can't find FP8 atoms such as SM80_16x8x8_F32E4M3E4M3FP32_TN. So, how...