🐛 [Bug] Issue in conversion when parameters/buffers are moved during compilation
Bug Description
Bug 1
File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/converter_utils.py", line 491, in to_numpy
output = value.cpu().detach().contiguous().numpy()
RuntimeError: .numpy() is not supported for tensor subclasses.
Suggested Fix 1
Need a custom version of the following function, one that registers a parameter rather than a buffer: https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py#L39
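For illustration, here is a minimal sketch of what such a custom helper could look like. The function name `replace_node_with_constant_param` and the attribute naming scheme are assumptions for this example, not the actual Torch-TensorRT code:

```python
import torch
import torch.nn as nn


def replace_node_with_constant_param(
    gm: torch.fx.GraphModule, node: torch.fx.Node, constant: torch.Tensor
) -> None:
    # Register the folded constant as an nn.Parameter instead of a buffer so
    # the converter sees a plain Parameter rather than a tensor subclass.
    # requires_grad=False keeps integer-dtype constants legal as Parameters.
    qualname = f"_frozen_param_{node.name}"  # hypothetical naming scheme
    gm.register_parameter(qualname, nn.Parameter(constant, requires_grad=False))

    # Rewire the graph: replace the folded node with a get_attr on the new parameter.
    with gm.graph.inserting_before(node):
        new_node = gm.graph.get_attr(qualname)
        new_node.meta.update(node.meta)
    node.replace_all_uses_with(new_node)
    gm.graph.erase_node(node)
    gm.recompile()
```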
Bug 2
File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and meta! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Suggested Fix 2
Constant tensors need to be cast to nn.Parameter on CUDA at constant-folding time:
https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py#L39
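As a rough sketch of that cast (the helper name `to_cuda_parameter` is an assumption, not existing library code):

```python
import torch
import torch.nn as nn


def to_cuda_parameter(constant: torch.Tensor) -> nn.Parameter:
    # Move the folded constant onto the GPU before it is attached to the
    # GraphModule, so later tracing/execution does not mix cuda and meta devices.
    return nn.Parameter(constant.cuda().contiguous(), requires_grad=False)
```

This would be applied at the point where the folded constant is registered onto the GraphModule, e.g. `gm.register_parameter(name, to_cuda_parameter(constant))`, rather than registering the raw (possibly meta/CPU) tensor.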
Bug 3
File "<eval_with_key>.67 from /root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/fx/experimental/proxy_tensor.py:569 in wrapped", line 11, in forward
File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and meta! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Suggested Fix 3
This line needs to be removed, as it has unintended behavior when casting constant params https://github.com/pytorch/TensorRT/blob/afd5abebbffa49107bcc7766c9f00bd6be2e593c/py/torch_tensorrt/dynamo/conversion/_conversion.py#L32
Expected behavior
Model should compile
Environment
- Torch and Torch-TensorRT Version:
2.3.0.dev2024222+cu121
Just a note:
I just hit Bug 1 (RuntimeError: .numpy() is not supported for tensor subclasses.) during torch_compile compilation. In my case, the error originates here:
https://github.com/pytorch/TensorRT/blob/main/py/torch_tensorrt/dynamo/conversion/impl/shape.py#L60
The conclusion is that we create torch.zeros within the torch.compile workflow, which produces fake tensors instead of real torch tensors, so .numpy() fails. I switched this to create numpy constants directly instead of torch tensors, and it works fine.
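A minimal sketch of the kind of change described; the variable names are illustrative, not the exact code in shape.py:

```python
import numpy as np

# "After" state: build the zeros constant directly in numpy so no (fake)
# torch tensor is ever created under torch.compile and the later
# .numpy()/to_numpy() call is never needed.
length = 4  # illustrative: rank of the input shape tensor
zeros = np.zeros((length,), dtype=np.int32)

# Previously (roughly): zeros = to_numpy(torch.zeros(length, dtype=torch.int32)),
# which fails under torch.compile because torch.zeros returns a FakeTensor there.
```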