
aten::empty_like


apbose commented on Feb 23, 2024

I had a doubt on this one. Does this require a test? In the following test:

def test_lowering_empty_like(self):
    class emptyLike(torch.nn.Module):
        def __init__(self, *args, **kwargs) -> None:
            super().__init__(*args, **kwargs)

        def forward(self, x):
            y = torch.ops.aten.empty_like.default(x)
            return y

    # Operations expected to be removed in the traced graph after decompositions
    expected_ops = {}
    unexpected_ops = {torch.ops.aten.empty_like.default}

    inputs = [torch.randn(2, 3).cuda()]

    # inputs = [torch.empty((2, 3), dtype=torch.int32, device="cuda")]

    fx_graph = torch.fx.symbolic_trace(emptyLike())
    unexpected_ops_seen, expected_ops_unseen = lower_graph_testing(
        fx_graph,
        inputs,
        expected_ops=expected_ops,
        unexpected_ops=unexpected_ops,
        min_block_size=1,
    )

    torch._dynamo.reset()

    # Validate that the results between Torch and Torch-TRT are similar
    optimized_model = torch_tensorrt.compile(
        fx_graph,
        "torch_compile",
        inputs,
        min_block_size=1,
        pass_through_build_failures=True,
    )
    optimized_model_results = optimized_model(*inputs).detach().cpu()
    torch_model_results = fx_graph(*inputs).detach().cpu()

    max_diff = float(
        torch.max(torch.abs(optimized_model_results - torch_model_results))
    )
    self.assertAlmostEqual(
        max_diff,
        0,
        DECIMALS_OF_AGREEMENT,
        "empty_like TRT outputs don't match with the original model.",
    )

  1. Is the above required, since both the Torch-TRT compiled optimized_model and fx_graph will have the same lowering pass applied?
  2. Also, when I compile the above I see:
  File "/home/abose/Documents/work/torchTRT_empty_2_26/TensorRT/tests/py/dynamo/testing_utilities.py", line 55, in fx_dynamo_testing_backend
    trt_compiled = custom_backend(
  File "/home/abose/Documents/work/torchTRT_empty_2_26/TensorRT/tests/py/dynamo/testing_utilities.py", line 73, in compile_module_testing
    partitioned_module, _ = partitioning.fast_partition(
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py", line 280, in
partition
    partitioned_graph = partitioner.partition_graph()
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py", line 197, in
partition_graph
    subgraphs = self.put_nodes_into_subgraphs()
  File "/home/abose/Documents/work/torchTRT/torch_trt/lib/python3.8/site-packages/torch/fx/passes/splitter_base.py", line 805, in put_nodes_into_subgraphs
    raise FxNetSplitterInternalError("Couldn't create subgraphs")
torch._dynamo.exc.BackendCompilerFailed: backend='functools.partial(<function fx_dynamo_testing_backend at 0x7f5c946045e0>, store_intermediate_graphs=[], min_
block_size=1, torch_executed_ops=set(), use_fast_partitioner=True)' raised:
FxNetSplitterInternalError: Couldn't create subgraphs

Is this expected? Is it something to do with no splits happening for the above graph?

apbose commented on Feb 27, 2024

I'm not sure what empty_like lowers to, but you could potentially add another operation in the nn.Module so that the graph is non-empty, as sketched below. It is likely that the graph is completely empty, so the partitioning fails. Since this decomposition is Torch-provided, we shouldn't need a test; however, it is important to verify that whatever the operator is lowered to is also supported by Torch-TRT.
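
A minimal sketch of that suggestion (the class name emptyLikeAdd and the extra aten.add are only illustrative, not taken from the actual test suite):

import torch

# Hypothetical variant of the test module: the extra add keeps the traced graph
# non-empty even if empty_like is decomposed and constant-folded away.
class emptyLikeAdd(torch.nn.Module):
    def forward(self, x):
        c = torch.ops.aten.add(x, x)
        return torch.ops.aten.empty_like.default(c)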

gs-olive commented on Feb 27, 2024

I do not think the graph would be empty, since empty_like should reduce to the lowering operations of aten::size and the creation of a torch.Tensor of the corresponding size. So the lowered graph should contain these operations, though I need to confirm. Ok, I will add another operation to the module and verify the lowering.
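
One way to confirm what empty_like actually decomposes to, independently of the Torch-TRT lowering, is to trace it through the registered Torch decompositions. This snippet is only a suggestion (it assumes make_fx and get_decompositions, and is not part of the test above):

import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

# Trace empty_like through its registered decomposition and print the resulting graph.
decomps = get_decompositions([torch.ops.aten.empty_like])

def fn(x):
    return torch.ops.aten.empty_like.default(x)

gm = make_fx(fn, decomposition_table=decomps)(torch.randn(2, 3))
# If the decomposition is registered, this is expected to show aten.empty_permuted
# (matching the AOT-traced graphs below) rather than size()/Tensor construction.
gm.graph.print_tabular()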

apbose commented on Mar 1, 2024

I verified the above test case with three cases:

  1. Case 1:
class emptyLike(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    def forward(self, x):
        y = torch.ops.aten.empty_like.default(x)
        return y

Without the decomposition of empty_like:

a. Before AOT trace

%l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
%empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%l_x,), kwargs = {})
 return (empty_like_default,)

b. After AOT trace

%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%clone : [num_users=1] = call_function[target=torch.ops.aten.clone.default](args = (%arg0_1,), kwargs = {})
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%clone,), kwargs = {})
return (empty_like,)

c. After lowering passes

%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%arg0_1,), kwargs = {})
return (empty_like,)

This is the graph that goes to partitioning.

With the decomposition of empty_like:

a. Before AOT trace

%l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
%empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%l_x,), kwargs = {})
 return (empty_like_default,)

b. After AOT trace

%arg0_1 : [num_users=0] = placeholder[target=arg0_1]
%empty_like : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([2,3],[0,1]), kwargs = {})
return (empty_like,)

c. After lowering passes

%arg0_1 : [num_users=0] = placeholder[target=arg0_1]
%_frozen_param0 : [num_users=1] = get_attr[target=_frozen_param0]
return (_frozen_param0,)

Partitioning of the above graph errors out in put_nodes_into_subgraphs of FX's splitter_base, since the only node with users is the frozen param (that's my assumption). A self-contained stand-in is sketched below.
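
A tiny stand-in for that failing graph (the Frozen module below is only illustrative, not Torch-TRT code):

import torch

# Mimics the lowered graph of case 1 with the decomposition applied:
# placeholder -> get_attr(_frozen_param0) -> output, with no call_function nodes.
class Frozen(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("_frozen_param0", torch.empty(2, 3))

    def forward(self, x):
        return self._frozen_param0

gm = torch.fx.symbolic_trace(Frozen())
print([n.op for n in gm.graph.nodes])  # ['placeholder', 'get_attr', 'output']
# With no call_function nodes left to place into a TRT subgraph, the fast
# partitioner presumably has nothing to split, hence "Couldn't create subgraphs".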

  2. Case 2:

class emptyLike(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    def forward(self, x):
        c = torch.ops.aten.add(x, x)
        y = torch.ops.aten.empty_like.default(c)
        return y

Like the case above, during compilation, if empty_like is included in the decompositions, the shape of x is extracted statically before runtime and the subgraphs are not created.

  3. Case 3:

class emptyLike(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    def forward(self, x):
        c = torch.ops.aten.add(x, x)
        y = torch.ops.aten.empty_like.default(c)
        d = y + c
        return d

With the decomposition of empty_like:

a. Before AOT trace

   %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
   %add : [num_users=2] = call_function[target=torch.ops.aten.add](args = (%l_x_, %l_x_), kwargs = {})
   %empty_like_default : [num_users=1] = call_function[target=torch.ops.aten.empty_like.default](args = (%add,), kwargs = {})
   %add_1 : [num_users=1] = call_function[target=operator.add](args = (%empty_like_default, %add), kwargs = {})
   return (add_1,)

b. After AOT trace

 %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
 %clone : [num_users=1] = call_function[target=torch.ops.aten.clone.default](args = (%arg0_1,), kwargs = {})
 %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%clone, %clone), kwargs = {})
 %empty_permuted : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([2, 3], [0, 1]), kwargs = {dtype: torch.float32, layout: torch.strided, device: cuda:0, pin_memory: False})
 %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%empty_permuted, %add), kwargs = {})
 return (add_1,)

c. After lowering passes

    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, %arg0_1), kwargs = {})
    %_frozen_param0 : [num_users=1] = get_attr[target=_frozen_param0]
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%_frozen_param0, %add), kwargs = {})
    return (add_1,)

In the above case, since there are additional add nodes alongside the frozen_param node, the subgraph is created.

Studying the above cases, it seems that the aten lowering happens during the AOT trace. As discussed, ideally a test case should not be required. I do not believe empty_permuted is supported, though.

apbose commented on Mar 6, 2024

Thanks for the analysis @apbose - this is very helpful. It looks like the constant_folding lowering pass is freezing the memory for the empty_like operator and storing it as an attribute of the model.

Regarding empty_permuted - it seems like it would be necessary in the dynamic shape case, since we would not be able to freeze the parameter in that case. It seems based on the Core ATen IR that prims.empty_permuted is a core op, so I do think the conversion/evaluation of that would be helpful here, but it could go in a separate PR.
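
If a converter does become necessary for the dynamic-shape case, a very rough sketch of a static-shape version is below. This is not the actual Torch-TRT implementation; the dynamo_tensorrt_converter import path, the converter signature, and the zero-fill strategy are all assumptions here:

import numpy as np
import torch
from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter  # assumed import path

@dynamo_tensorrt_converter(torch.ops.aten.empty_permuted.default)  # hypothetical registration
def aten_ops_empty_permuted(ctx, target, args, kwargs, name):
    # empty_* values are unspecified, so materializing zeros of the requested
    # (static) shape is one valid, if simplistic, realization.
    shape = [int(s) for s in args[0]]
    layer = ctx.net.add_constant(tuple(shape), np.zeros(shape, dtype=np.float32))
    layer.name = name
    return layer.get_output(0)

Handling truly dynamic shapes would still require building the output from runtime shape tensors rather than a fixed constant, which is presumably where the conversion/evaluation mentioned above comes in.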

gs-olive commented on Mar 9, 2024

Ok, I will go ahead and make a separate PR for empty_permuted. For now, can this PR be merged then?

apbose commented on Mar 12, 2024