
🐛 [Bug] Issue in conversion when parameters/buffers are moved during compilation

Open · gs-olive opened this issue 1 year ago · 1 comment

Bug Description

Bug 1

  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/converter_utils.py", line 491, in to_numpy
    output = value.cpu().detach().contiguous().numpy()
RuntimeError: .numpy() is not supported for tensor subclasses.

Suggested Fix 1

A custom version of the following function is needed, one which registers a parameter rather than a buffer: https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py#L39

Bug 2

File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and meta! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Suggested Fix 2

Constant Tensors need to be cast to nn.Parameter on CUDA at constant-folding time: https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py#L39
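A minimal sketch of such a cast is below. The helper name `cast_to_device_parameter` is hypothetical, and it falls back to CPU when CUDA is unavailable so the example stays runnable; the real fix would target the compilation device:

```python
import torch


def cast_to_device_parameter(value: torch.Tensor) -> torch.nn.Parameter:
    """Hypothetical helper: ensure a constant-folded tensor lives on the
    target device before it is registered, so downstream ops do not see a
    mix of cuda:0 and meta tensors (the Bug 2 RuntimeError)."""
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    return torch.nn.Parameter(value.to(device), requires_grad=False)
```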

Bug 3

File "<eval_with_key>.67 from /root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/fx/experimental/proxy_tensor.py:569 in wrapped", line 11, in forward
File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/_ops.py", line 571, in __call__
  return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and meta! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Suggested Fix 3

This line needs to be removed, as it causes unintended behavior when casting constant parameters: https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/conversion/_conversion.py#L32

Expected behavior

The model should compile successfully.

Environment

  • Torch and Torch-TensorRT Version: 2.3.0.dev2024222+cu121

gs-olive avatar Feb 23 '24 21:02 gs-olive

Just a note: I just hit Bug 1 (RuntimeError: .numpy() is not supported for tensor subclasses.) during torch_compile compilation. In my case, the error originates here: https://github.com/pytorch/TensorRT/blob/main/py/torch_tensorrt/dynamo/conversion/impl/shape.py#L60. The conclusion is that we are creating torch.zeros within the torch.compile workflow, which produces fake tensors instead of real torch.Tensors, and hence .numpy() fails. I switched this to create numpy constants directly instead of torch tensors, and this works fine.
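A minimal sketch of that workaround, assuming the constant can be built directly in NumPy (the function name `shape_constant` and the int32 dtype are illustrative, not the actual code in impl/shape.py):

```python
import numpy as np


def shape_constant(shape, dtype=np.int32):
    """Hypothetical replacement for a torch.zeros-based constant.

    Under torch.compile, torch.zeros yields a FakeTensor subclass, so a
    later .numpy() call raises. Building the constant directly with NumPy
    bypasses the fake-tensor machinery entirely.
    """
    return np.zeros(shape, dtype=dtype)
```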

peri044 avatar Feb 23 '24 23:02 peri044