Godlovecui
### System Info
L20, 8 cards, 8x48G memory, TensorRT-LLM version: 0.11.0.dev2024051400

### Who can help?
@Tra

### Information
- [X] The official example scripts
- [ ] My own modified...
### System Info
rtx4090

### Who can help?
@

### Information
- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks
- [ ]...
### System Info
8*RTX4090, 24G
tensorrt_llm version: 0.11.0.dev2024051400

### Who can help?
@T

### Information
- [X] The official example scripts
- [ ] My own modified scripts

### Tasks...
**Description**
I ran a benchmark of Meta-Llama-3-8B-Instruct on 8*RTX 4090. With 16 concurrent requests, an input sequence length of 1024, and an output sequence length of 1024, the TTFT (time to first token) is...
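For reference, below is a minimal sketch of how TTFT can be measured from a streaming `generate` call. It is an illustrative assumption on my side: it uses Hugging Face `transformers` rather than the TensorRT-LLM benchmark harness, runs a single request instead of 16 concurrent ones, and only approximates the 1024-token prompt.

```python
# Hedged sketch: measuring TTFT (time to first token) via a streaming generate call.
# Assumptions: Hugging Face transformers (not the TensorRT-LLM benchmark harness),
# a single request rather than 16 concurrent ones, and a rough ~1024-token prompt.
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Hello " * 1024  # rough stand-in for a ~1024-token input
inputs = tok(prompt, return_tensors="pt").to("cuda")

streamer = TextIteratorStreamer(tok, skip_prompt=True)
start = time.perf_counter()
# generate() blocks, so run it in a thread and read tokens from the streamer.
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=1024, streamer=streamer)).start()

first_chunk = next(iter(streamer))  # blocks until the first decoded text chunk arrives
print(f"TTFT: {time.perf_counter() - start:.3f} s (first chunk: {first_chunk!r})")
```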
### System Info
RTX 8*4090
version:
TensorRT-LLM: v0.9.0
tensorrtllm_backend: v0.9.0

### Who can help?
@kaiyux @BY

### Information
- [X] The official example scripts
- [ ] My own modified...
FP8 is very useful for LLM training and inference. Does FlashAttention support FP8? Thank you~
# 🚀 Feature
FP8 is very useful for LLM training and inference. Does xformers support FP8? Thank you~
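Not an answer on FP8 support, but for context, here is a minimal sketch of the xformers attention call the question refers to, run in BF16. Whether FP8 inputs are accepted is exactly the open question and is not assumed here; shapes and dtypes are illustrative.

```python
# Minimal sketch of xformers' memory_efficient_attention in BF16.
# FP8 input support is the open question and is NOT assumed here;
# shapes and dtypes below are illustrative only.
import torch
import xformers.ops as xops

# (batch, seq_len, num_heads, head_dim)
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)

out = xops.memory_efficient_attention(q, k, v)  # dispatches to a fused attention kernel
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```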
When I run 06-fused-attention.py on an RTX 4090, it raises the error below. How can I fix it? Thank you!
Triton version: 2.3.0
CUDA: 12.4

root@GPU-RTX4090-4-8:/workspaces/triton/python/tutorials# python 06-fused-attention.py
Traceback (most recent call last):...
ENV: RTX 8*4090
I want to test FP8 inference with TransformerEngine for Llama-3 (from Hugging Face). I cannot find instructions for inference. Can you share some code? Thank you~
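Not official guidance, but below is a minimal sketch of how FP8 is typically enabled for inference with TransformerEngine's `fp8_autocast`. It exercises a bare `te.Linear` rather than the Hugging Face Llama-3 model, and the recipe settings are illustrative assumptions.

```python
# Hedged sketch: FP8 inference with TransformerEngine (not wired into Llama-3 here).
# Assumes an FP8-capable GPU (compute capability >= 8.9, e.g. RTX 4090) and a recent
# transformer_engine release; the recipe settings are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID: E4M3 for forward activations/weights, E5M2 for gradients (unused at inference).
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda().eval()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM runs in FP8; the output is returned in the input dtype

print(y.shape, y.dtype)
```

For a full Hugging Face Llama-3 checkpoint, the usual approach is to replace the model's linear/attention modules with their TE equivalents (or use a serving framework that already does this) and run the forward pass under the same `fp8_autocast` context; that wiring is not shown above.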
Hi: I'd like to test FP8 on an RTX 4090. I can find BF16 MMA atoms such as SM80_16x8x8_F32BF16BF16F32_TN in cutlass/include/cute/arch/mma_sm80.hpp, but I can't find FP8 atoms such as SM80_16x8x8_F32E4M3E4M3FP32_TN. So, how...