Fei Hu
@MLnick The input sizes for the three models are: 1) `?*24*12` for the univariate model; 2) `?*24*12` for the multivariate model; 3) `?*48*12` for the multistep model, where `12` is the number of input...
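For illustration, the three input shapes above can be mocked up with NumPy as follows (the leading `?` is the batch dimension; the batch size of 4 and the variable names are placeholders, not from the thread):

```python
import numpy as np

batch = 4  # the leading "?" dimension: any batch size

# univariate model: ?*24*12 (24 timesteps, 12 input features)
univariate_in = np.zeros((batch, 24, 12), dtype=np.float32)

# multivariate model: same ?*24*12 shape
multivariate_in = np.zeros((batch, 24, 12), dtype=np.float32)

# multistep model: ?*48*12 (48 timesteps, 12 input features)
multistep_in = np.zeros((batch, 48, 12), dtype=np.float32)

print(univariate_in.shape, multivariate_in.shape, multistep_in.shape)
```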
> Can you try the docker image recommended in the document?

Thanks @byshiue for the quick reply! Do you mean this Docker image, `nvcr.io/nvidia/pytorch:22.09-py3`?
Just tried `nvcr.io/nvidia/pytorch:22.09-py3`, but still got the same error message.
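For reference, a minimal sketch of how such an NGC image can be pulled and launched with GPU access (the `docker run` flags are the usual ones for NVIDIA containers, not taken from this thread; the volume mount path is an assumption):

```shell
# pull the NGC PyTorch image mentioned above
docker pull nvcr.io/nvidia/pytorch:22.09-py3

# start an interactive container with all GPUs visible,
# mounting the current directory so build artifacts persist
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/host \
    nvcr.io/nvidia/pytorch:22.09-py3
```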
I also ran the commands below to tune gemm, but fp8 is several times slower than fp16 in 8 of 11 cases (see the last column (`speedup`) in the...
Got the same error on H100 and H100-MIG, as below:
```
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered
FasterTransformer/src/fastertransformer/models/gpt_fp8/GptFP8ContextDecoder.cc:243
```
> Can you post your scripts and full log?

Hi @byshiue, I created new Docker containers to test it again. For `nvcr.io/nvidia/pytorch:22.09-py3`, I can confirm it works well now (not quite...
> For performance between fp16 and fp8, fp8 only brings speedup when the batch size is large enough.

But the batch size in the example is only 1. I made...
@ananthdurbha I hit the same issue as you. I'm wondering whether you have resolved it?
Solved this issue by using clang++ and building PyTorch from source. There may be other, better solutions.
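A minimal sketch of what "use clang++ and build PyTorch from source" can look like (the clone location and exact steps are assumptions; consult the official PyTorch build instructions for your platform):

```shell
# assumed: a working clang toolchain is installed
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

# point the CMake-based build at clang/clang++ instead of gcc/g++
export CC=clang
export CXX=clang++

# install build dependencies, then build and install into the
# current Python environment (this takes a long time)
pip install -r requirements.txt
python setup.py install
```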