
❓ [Question] Running LayerNorm in fp16

Open Tomiinek opened this issue 1 year ago • 10 comments

❓ Question

What you have already tried

I am trying to convert a transformer model to TRT in fp16 (fp32 works fine 🙂). It includes a bunch of LayerNorms, all of which explicitly cast their inputs to fp32, i.e.:

class LayerNormFP32(nn.LayerNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)

I am getting warnings about the precision of the layers:

WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected layernorm nodes in FP16: %126 : Tensor = aten::layer_norm(%input.9, %127, %self.decoder.layers.0.attn_ln.weight.1, %370, %129, %130), scope: __module.decoder/__module.decoder.layers.0/__module.decoder.layers.0.attn_ln
...
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT encountered issues when converting weights between types and that could affect accuracy.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Check verbose logs for the list of affected weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 2 weights are affected by this issue: Detected FP32 infinity values and converted them to corresponding FP16 infinity.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 27 weights are affected by this issue: Detected subnormal FP16 values.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 3 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.

I checked the dtype of the mentioned weights in the trace that I pass to torch_tensorrt.compile and they are correctly in fp32, even though the warnings state the opposite.

The warning suggests two solutions (use INormalizationLayer or force FP32 precision), but I have no idea how to achieve either. These might be related: https://github.com/pytorch/TensorRT/pull/2509 (or https://github.com/NVIDIA/TensorRT/issues/3101)
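
For reference, this is roughly how I imagine "forcing layernorm to FP32" would look with the raw TensorRT Python API after an ONNX export at opset >= 17. This is just an untested sketch based on the TensorRT 8.6 Python bindings (the ONNX file name is a placeholder), and I don't know how to express the same thing through torch_tensorrt:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "model_opset17.onnx" is a placeholder for an export done with opset_version >= 17,
# so that LayerNormalization maps to INormalizationLayer
with open("model_opset17.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# make TensorRT respect the per-layer precisions set below
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# pin every normalization layer (and its output) to FP32
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.NORMALIZATION:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)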

Any ideas how to resolve or debug this issue?

Environment

  • Python 3.11.8
  • torch 2.2.1
  • torch_tensorrt 2.2.0
  • A100 GPU

Tomiinek avatar Apr 05 '24 09:04 Tomiinek

Here is a minimal reproducible example:

import torch
import torch.nn as nn


class LayerNormFP32(nn.LayerNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)


class Model(nn.Module):
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.ln = LayerNormFP32(hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.ln(x)

    def to_jit_trace(
        self,
        device: str = "cpu",
        dtype: torch.dtype = torch.float, 
        batch_size: int = 2,
    ) -> tuple[torch.jit.ScriptModule, torch.Tensor]:
        
        dummy_inputs = torch.randn((batch_size, self.hidden_dim), dtype=dtype, device=device)
        
        self.to(device)
        self.eval()
    
        with torch.no_grad():
            outputs1 = self(dummy_inputs)
            trace = torch.jit.trace(self, dummy_inputs, check_trace=False)
            outputs2 = trace(dummy_inputs)
        assert torch.allclose(outputs1, outputs2)
        
        return trace, dummy_inputs

    def to_tensorrt(
        self,
        batch_size,
        precisions: set[torch.dtype] = {
            torch.float,
            torch.half
        },
    ):
        import torch_tensorrt

        dtype = torch.float
        if torch.half in precisions:
            dtype = torch.half

        with torch.cuda.amp.autocast(enabled=True):
            trace, dummy_inputs = self.to_jit_trace("cuda", dtype, batch_size=batch_size)
        
        trt = torch_tensorrt.compile(
            trace,
            input_signature=(torch_tensorrt.Input(shape=dummy_inputs.shape, dtype=dummy_inputs.dtype),),
            enabled_precisions=precisions,
            require_full_compilation=True,
            truncate_long_and_double=True,
        )
            
        return trt

With fp32 the compiled module gives the same outputs as the trace, but fp16 does not (while producing the warnings above):

model = Model()
batch_size = 1

trt_16 = model.to_tensorrt(batch_size=batch_size, precisions={torch.float, torch.half})
with torch.cuda.amp.autocast(enabled=True):
    trace_fp16, dummy_inputs_16 = model.to_jit_trace("cuda", torch.half, batch_size=batch_size)

trt_32 = model.to_tensorrt(batch_size=batch_size, precisions={torch.float})
trace_fp32, dummy_inputs_32 = model.to_jit_trace("cuda", torch.float, batch_size=batch_size)

with torch.no_grad():

    # False
    # tensor(0.0020, device='cuda:0', dtype=torch.float16)
    print(torch.allclose(trace_fp16(dummy_inputs_16), trt_16(dummy_inputs_16)))
    print((trace_fp16(dummy_inputs_16) - trt_16(dummy_inputs_16)).abs().max())

    # True
    # tensor(2.9802e-08, device='cuda:0')
    print(torch.allclose(trace_fp32(dummy_inputs_32), trt_32(dummy_inputs_32)))
    print((trace_fp32(dummy_inputs_32) - trt_32(dummy_inputs_32)).abs().max())

Tomiinek avatar Apr 05 '24 09:04 Tomiinek

Hi @Tomiinek, I refactored the layer norm converter to use INormalizationLayer. Could you confirm if this works for you? Thanks!

zewenli98 avatar Apr 16 '24 20:04 zewenli98

Hello @zewenli98, thank you!

I am having issues compiling the latest code in my environment (Python 3.11, torch 2.2), so I tried the wheel from the GitHub Actions run associated with the PR (this one: https://github.com/pytorch/TensorRT/actions/runs/8711801688/artifacts/1419799870), but also without success. Simply patching the file in site-packages of the latest release did not help either (i.e. the fp16 issue persists).

Is there another way to check it out or to catch it in tests?

Tomiinek avatar Apr 17 '24 11:04 Tomiinek

@Tomiinek It seems the trace you pass into torch_tensorrt.compile() has type _ModuleType.ts, which means it will be compiled with the TorchScript frontend. Can you try using the dynamo frontend instead? It is better supported. The function _get_target_fe() in TensorRT/py/torch_tensorrt/_compile.py might be helpful.
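
For example, something roughly like this (just a sketch, with placeholder shapes and precisions, reusing the Model class from your repro):

import torch
import torch_tensorrt

model = Model().eval().cuda()   # the Model class from your repro above

trt_model = torch_tensorrt.compile(
    model,                      # pass the nn.Module directly, no jit.trace needed
    ir="dynamo",                # select the dynamo frontend explicitly
    inputs=[torch_tensorrt.Input(shape=(1, 1024), dtype=torch.half)],
    enabled_precisions={torch.float, torch.half},
)

out = trt_model(torch.randn(1, 1024, device="cuda", dtype=torch.half))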

zewenli98 avatar Apr 20 '24 03:04 zewenli98

Hi @zewenli98, thank you for your patience.

I tried something like:

model_ = torch.export.export(model, (dummy_inputs,))
trt = torch_tensorrt.compile(
    model_,
    input_signature=(torch_tensorrt.Input(shape=dummy_inputs.shape, dtype=dummy_inputs.dtype),),
    enabled_precisions={
        torch.float,
        torch.half
    },
    require_full_compilation=True,
    truncate_long_and_double=True,
)

but it says

ValueError: Input graph is an ExportedProgram which is not currently supported. Please provide torch.nn.Module or torch.fx.GraphModule as inputs

because I am still on 2.2.0.

So I tried to upgrade to 2.3.0dev, but I am not able to import the package:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/__init__.py", line 84, in <module>
    from torch_tensorrt._compile import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/_compile.py", line 9, in <module>
    import torch_tensorrt.ts
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/ts/__init__.py", line 1, in <module>
    from torch_tensorrt.ts._compile_spec import TensorRTCompileSpec  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/ts/_compile_spec.py", line 7, in <module>
    import torch_tensorrt._C.ts as _ts_C
ImportError: /fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN3c104cuda9GetDeviceEPi

Do you have any tips on how to install or try out the latest and greatest code or builds? What is the preferred way of using the dynamo frontend?

These are my versions:

tensorrt==8.6.1.post1
tensorrt-bindings==8.6.1
tensorrt-libs==8.6.1
torch-tensorrt==2.3.0.dev20240110+cu121

Tomiinek avatar Apr 22 '24 10:04 Tomiinek

Hi @Tomiinek, For this error:

ImportError: /fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN3c104cuda9GetDeviceEPi

This is probably because you installed a mismatched libtorch version. You can replace the corresponding part in WORKSPACE with the URLs of the correct libtorch version, e.g.:

http_archive(
    name = "libtorch",
    build_file = "@//third_party/libtorch:BUILD",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/test/cu121/libtorch-cxx11-abi-shared-with-deps-2.3.0%2Bcu121.zip"],
    # urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-cxx11-abi-shared-with-deps-latest.zip"],
)

http_archive(
    name = "libtorch_pre_cxx11_abi",
    build_file = "@//third_party/libtorch:BUILD",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/test/cu121/libtorch-shared-with-deps-2.3.0%2Bcu121.zip"],
    # urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-shared-with-deps-latest.zip"],
)

and then build torch-tensorrt again with:

python setup.py develop

Besides, you can try to use:

exp_program = torch_tensorrt.dynamo._tracer.trace(module, torchtrt_inputs, **kwargs)
trt_graph_module = torch_tensorrt.dynamo._compiler.compile(
    exp_program,
    inputs=torchtrt_inputs,
    enabled_precisions=enabled_precisions_set,
    **kwargs,
)
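
Here module, torchtrt_inputs, enabled_precisions_set, and kwargs are placeholders; for your repro they could be filled in along these lines (illustrative only):

module = Model().eval().cuda()
torchtrt_inputs = [torch_tensorrt.Input(shape=(1, 1024), dtype=torch.half)]
enabled_precisions_set = {torch.float, torch.half}
kwargs = {}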

zewenli98 avatar Apr 22 '24 22:04 zewenli98

Hi @zewenli98 , thanks for your responses! I'm trying to create a wheel for @Tomiinek to test out the fix. I'm opting for Docker, as local compilation gave me some weird errors about incompatible hashes when downloading tarballs from Nvidia.

I've changed the libtorch sections per your suggestion, checked out your PR branch, and ran DOCKER_BUILDKIT=1 docker build --build-arg TENSORRT_VERSION=8.6 --build-arg CUDNN_VERSION=8.9 -f docker/Dockerfile -t torch_tensorrt:latest .. The container builds; however, running python3 -c "import torch_tensorrt" in the container still errors out:

root@ond-g5-1gpu-dy-g5-4xlarge-16cpu-1:~/.pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorrt# python3 -c "import torch_tensorrt"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/__init__.py", line 84, in <module>
    from torch_tensorrt._compile import *  # noqa: F403
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 9, in <module>
    import torch_tensorrt.ts
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/ts/__init__.py", line 1, in <module>
    from torch_tensorrt.ts._compile_spec import TensorRTCompileSpec  # noqa: F401
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/ts/_compile_spec.py", line 8, in <module>
    import torch_tensorrt._C.ts as _ts_C
ImportError: /opt/python3/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN5torch3jit11parseSchemaERKSs

Perhaps it would be easier to merge the PR, and we'll test whether a nightly wheel of Torch-TensorRT works? Compiling Torch-TensorRT locally seems to be pretty complicated.
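
If it does get merged, I assume testing would then just be something along the lines of the following (I haven't checked yet whether a nightly containing the fix is actually published):

pip install --pre torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu121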

srdecny avatar Apr 24 '24 13:04 srdecny

Hello @zewenli98, I installed the current release with Python 3.10 so that I can at least try out dynamo.

I tried compiling a single linear layer with the TorchScript frontend in fp32. The compiled module gives correct outputs (i.e. the same as the raw module), but not in fp16; I believe this changed from the previous release, which was giving correct outputs but ignoring the casts in the layer norms.

I tried compiling a single linear layer with dynamo in fp32. I am not getting correct outputs, and the compiled module is 3x slower than the one compiled with the TorchScript frontend.

The layernorm issue persists with TorchScript; dynamo does not produce warnings but still produces weird outputs.

I am really confused, could you please help me and provide code snippets that I could run and that at the same time work for you? Specifically:

  • how to compile a single linear layer with torchscript
  • how to compile a single linear layer with dynamo while getting the same inference speed as with ts
  • how to compile a single linear layer with whatever, but in fp16 while getting correct outputs
  • how to compile a single layer norm with fp16 inputs and internal fp32 cast while getting correct outputs and speedups

Or at least tell me if the code I posted above works for you with the latest release, or what I am doing wrong in there 🤷

CC: @narendasan

Tomiinek avatar Apr 29 '24 17:04 Tomiinek

@narendasan @peri044 Can you guys take a look?

zewenli98 avatar Apr 30 '24 18:04 zewenli98