
feat: support aten.atan2.out converter


Description

The aten.atan2.out operation computes the element-wise arctangent of two tensors (taking the signs of both inputs into account to determine the quadrant) and writes the result into a caller-provided output tensor, out. It does not modify the input tensors, so it is not an in-place operation in the traditional sense.
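As a quick illustration, here is a minimal PyTorch sketch of the out-variant semantics (tensor names and shapes are illustrative, not taken from the test case):

import torch

y = torch.randn(5)
x = torch.randn(5)
out = torch.empty(5)

# aten.atan2.out writes the result into `out`; y and x are left unchanged
torch.atan2(y, x, out=out)
assert torch.equal(out, torch.atan2(y, x))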

However, I've encountered two issues:

First, when the shape of the provided out tensor does not match the actual output shape, PyTorch issues a warning but does not raise an error, and still produces the correct results. The output of running the atan2.out operation in PyTorch is attached below:

[Screenshot: PyTorch output of atan2.out when the out tensor shape does not match the result shape]
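For reference, a minimal reproduction of that behavior, with shapes chosen to mirror the [5] vs. [10] case in the warning below (variable names are illustrative):

import torch

y = torch.randn(10)
x = torch.randn(10)
out = torch.empty(5)  # deliberately the wrong shape

# PyTorch warns and resizes `out` to [10] instead of raising an error
result = torch.atan2(y, x, out=out)
print(result.shape)  # torch.Size([10]); the values themselves are still correct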

Second, even though TensorRT reports an error when the above warning occurs, the test case still passes. The output of the executed test case is attached below.

[W511 17:09:06.345646983 Resize.cpp:28] Warning: An output with one or more elements was resized since it had shape [5], which does not match the required output shape [10]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 145, GPU 1474 (MiB)
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init builder kernel library: CPU +1750, GPU +317, now: CPU 2031, GPU 1791 (MiB)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.013850
WARNING:torch_tensorrt [TensorRT Conversion Context]:Unused Input: out
WARNING:torch_tensorrt [TensorRT Conversion Context]:[RemoveDeadLayers] Input Tensor out is unused or used only at compile-time, but is not being removed.
INFO:torch_tensorrt [TensorRT Conversion Context]:Global timing cache in use. Profiling results in this builder pass will be stored.
INFO:torch_tensorrt [TensorRT Conversion Context]:Detected 3 inputs and 1 output network tensors.
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Host Persistent Memory: 32
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Device Persistent Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Scratch Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Activation Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Weights Memory: 296
INFO:torch_tensorrt [TensorRT Conversion Context]:Engine generation completed in 0.0770512 seconds.
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 3581 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 26 bytes of code generator cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 7750 bytes of compilation cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 0 timing cache entries
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.081009
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 13980 bytes of Memory
INFO:harness:Interpreter run time(s): 0.09525678399950266
INFO:torch_tensorrt [TensorRT Conversion Context]:Loaded engine size: 0 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[W511 17:09:08.253573766 Resize.cpp:28] Warning: An output with one or more elements was resized since it had shape [5], which does not match the required output shape [10]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
ERROR:torch_tensorrt [TensorRT Conversion Context]:3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
WARNING:torch_tensorrt [TensorRT Conversion Context]:Using default stream in enqueueV3() may lead to performance issues due to additional calls to cudaStreamSynchronize() by TensorRT to ensure correct synchronization. Please use non-default stream instead.
INFO:harness:TRT run time(s)= 0.010920960426330567
.INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 286, GPU 1483 (MiB)
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init builder kernel library: CPU +1748, GPU +310, now: CPU 2034, GPU 1793 (MiB)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.009873
WARNING:torch_tensorrt [TensorRT Conversion Context]:Unused Input: out
WARNING:torch_tensorrt [TensorRT Conversion Context]:[RemoveDeadLayers] Input Tensor out is unused or used only at compile-time, but is not being removed.
INFO:torch_tensorrt [TensorRT Conversion Context]:Global timing cache in use. Profiling results in this builder pass will be stored.
INFO:torch_tensorrt [TensorRT Conversion Context]:Detected 3 inputs and 1 output network tensors.
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Host Persistent Memory: 32
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Device Persistent Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Scratch Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Activation Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Weights Memory: 296
INFO:torch_tensorrt [TensorRT Conversion Context]:Engine generation completed in 0.0711228 seconds.
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 3659 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 26 bytes of code generator cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 7750 bytes of compilation cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 0 timing cache entries
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.074702
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 13980 bytes of Memory
INFO:harness:Interpreter run time(s): 0.08486600499600172
INFO:torch_tensorrt [TensorRT Conversion Context]:Loaded engine size: 0 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
WARNING:torch_tensorrt [TensorRT Conversion Context]:Using default stream in enqueueV3() may lead to performance issues due to additional calls to cudaStreamSynchronize() by TensorRT to ensure correct synchronization. Please use non-default stream instead.
INFO:harness:TRT run time(s)= 0.00047513601183891296
.
----------------------------------------------------------------------
Ran 2 tests in 3.744s

OK
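For context, a rough sketch of how the out variant could be registered in the dynamo converter registry. This is only an assumption-laden sketch: it presumes the existing impl.elementwise.atan2 helper and the decorator/signature pattern used by the other aten converters in aten_ops_converters.py, and the actual implementation in this PR may differ:

# Hypothetical sketch inside aten_ops_converters.py, where dynamo_tensorrt_converter,
# ConversionContext, SourceIR, impl, and the typing aliases are already in scope.
@dynamo_tensorrt_converter(torch.ops.aten.atan2.out)
def aten_ops_atan2_out(
    ctx: ConversionContext,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    # TensorRT writes to its own output buffer, so the provided `out` tensor
    # (the third positional arg / kwargs["out"]) is not consumed -- hence the
    # "Unused Input: out" warnings in the log above.
    return impl.elementwise.atan2(ctx, target, SourceIR.ATEN, name, args[0], args[1])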

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • [x] My code follows the style guidelines of this project (You can use the linters)
  • [x] I have performed a self-review of my own code
  • [ ] I have commented my code, particularly in hard-to-understand areas and hacks
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added tests to verify my fix or my feature
  • [ ] New and existing unit tests pass locally with my changes
  • [x] I have added the relevant labels to my PR so that relevant reviewers are notified

chohk88 · May 11 '24 08:05