chengyi

Results 2 comments of chengyi

@ginowu I also have a question. In the NVIDIA doc (https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-data-types) I see that `wgmma` with FP8 inputs can only use `fp16/fp32` as the accumulator. Or do you mean precision limitations not related to...
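For reference, the constraint in that PTX section is that `wgmma.mma_async` with FP8 (`e4m3`/`e5m2`) operands must accumulate into `fp16` or `fp32`. The sketch below only illustrates that pairing at the scalar level in plain CUDA (FP8 inputs, FP32 accumulator); it does not issue `wgmma` instructions, and the kernel and variable names are made up for the example. It assumes CUDA 11.8+ for `cuda_fp8.h`.

```cuda
// Illustrative only: FP8 e4m3 inputs with an FP32 accumulator, mirroring the
// doc's rule that FP8 wgmma operands accumulate in fp16/fp32. Not a wgmma call.
#include <cuda_fp8.h>
#include <cstdio>

__global__ void fp8_dot(const __nv_fp8_e4m3* a, const __nv_fp8_e4m3* b,
                        float* out, int n) {
    float acc = 0.0f;  // accumulator kept in fp32
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // FP8 values are widened to float before the multiply-accumulate
        acc += static_cast<float>(a[i]) * static_cast<float>(b[i]);
    }
    atomicAdd(out, acc);
}

int main() {
    const int n = 256;
    __nv_fp8_e4m3 *a, *b;
    float* out;
    cudaMallocManaged(&a, n * sizeof(*a));
    cudaMallocManaged(&b, n * sizeof(*b));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) {
        a[i] = __nv_fp8_e4m3(1.0f);
        b[i] = __nv_fp8_e4m3(0.5f);
    }
    *out = 0.0f;
    fp8_dot<<<1, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();
    printf("dot = %f\n", *out);  // expect 128.0
    return 0;
}
```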

> I come across with the same issue, when pybind is built with NVIDIA TensorRT-LLM
>
> -- Found Torch: /home/ma-user/anaconda3/envs/py10-llm/lib/python3.10/site-packages/torch/lib/libtorch.so
> -- TORCH_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0
> -- Building for TensorRT version: 8.6.1,...