ProcessGroupCCL Destructor Not Correctly Called in PT 1.10
Hi torch-ccl community,
I was trying to run the following code with PT 1.10 + the ccl backend:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch_ccl

dist.init_process_group(backend="ccl")

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10, bias=False)
        self.net2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.net2(self.net1(x))

model = ToyModel()
ddp = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True)

inp = torch.randn(1, 10)
out = ddp(inp)
When find_unused_parameters=True, the destructor of ProcessGroupCCL is not correctly called; with find_unused_parameters=False there is no issue. This should be fine in most cases because the destructor is empty anyway (https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.cpp#L109-L111). However, I am building an extension that requires me to release resources in ~ProcessGroupCCL(). If ~ProcessGroupCCL() is not called, the process hangs on exit. This issue also does not exist in PT 1.9. It seems like an object lifecycle management issue in PyTorch.
Would appreciate any insights and help!
Hi @Zha0q1,
I cannot reproduce the issue of "the destructor of ProcessGroupCCL was not correctly called".
On my side, ~ProcessGroupCCL is always called at the end of the Python process lifetime, for both find_unused_parameters=True and find_unused_parameters=False.
There may be some requirements on the ordering of the exit-time cleanup in your code.
Please be aware that the destructor of ProcessGroup is called when the Python objects referring to it are cleaned up at the end of the Python process lifetime.
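For example, here is a minimal sketch of releasing the group explicitly at the end of your script instead of relying on interpreter-shutdown ordering (only a suggestion to experiment with, not something your script currently does):

import torch.distributed as dist

# ... end of the reproduction script above ...

# Drop the DDP wrapper first, so its Reducer no longer holds the process group.
del ddp

# Then release torch.distributed's own cached reference (_default_pg) explicitly.
# After this, ~ProcessGroupCCL should be free to run once no other references remain.
dist.destroy_process_group()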
Hi @chengjunlu, thanks for your reply! Would you share the hardware and software stack you used? This issue only occurred with PT 1.10 for me -- PT 1.9 worked just fine. I was using an AWS P4d instance with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-cpu-py38-ubuntu20.04-sagemaker as the base image.
I am using the public PyTorch v1.10.0-rc3 tag for the 1.10 release.
Would you help double-check whether this issue can be reproduced without your changes?
Hi, I used the v1.10.0 tag and built PyTorch from source. And yes, even with the https://github.com/intel/torch-ccl/tree/ccl_torch1.10 branch the issue is still reproducible. I only added a std::cout in the destructor to show whether it was called.
Let's try a few more experiments:
- Add some debug information in the destructor of ProcessGroup.
- Can you show the ABI of the PyTorch build on your platform (torch._C._GLIBCXX_USE_CXX11_ABI, see the snippet below)?
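For reference, it can be read directly from the torch module (this simply prints the ABI flag the build was compiled with):

import torch

# True means the build uses the new C++11 ABI; False means the old pre-C++11 ABI.
print(torch._C._GLIBCXX_USE_CXX11_ABI)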
- Do you mean the PyTorch ProcessGroup?
- It shows True. One more question: did you try the same script I used?
- Do you mean the PyTorch ProcessGroup? Yes.
- It shows True. One more question: did you try the same script I used? Yes.
Sure, I will do more experiments on Monday. Do you have any insights as to what might be the issue?
It is a bizarre issue. I don't have strong confidence about the root cause. The hard part is that I cannot reproduce your issue on my platform.
Here are some points we can look into:
The process group in PT 1.10 is managed by an intrusive pointer. A known drawback of reference-counted smart pointers in C++ is that cyclic references can prevent objects from being destroyed correctly.
The reducer attribute of DistributedDataParallel (the Reducer) keeps a reference to the process group (in this test, the ProcessGroupCCL object). The _default_pg attribute in torch.distributed also keeps a reference to it.
But neither of them holds a cross reference to the other, so we need to investigate this further.
Another aspect we can check is pybind itself; less likely, but who knows.
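To look into the reference-holding from the Python side, here is a sketch of a small experiment (the private helper dist.distributed_c10d._get_default_group() and the ddp.process_group attribute are assumptions from reading the 1.10 code, so please verify them on your build; also note that sys.getrefcount only sees Python references, not C++ intrusive_ptr holders such as the Reducer):

import gc
import sys
import torch.distributed as dist

# ... after running the reproduction script above ...

# The default group that torch.distributed caches internally (_default_pg).
pg = dist.distributed_c10d._get_default_group()

# DDP keeps its own handle to the process group; confirm it is the same object.
print("same object as DDP's group:", ddp.process_group is pg)

# Rough signal only: Python-side reference count of the wrapper object.
print("python refcount:", sys.getrefcount(pg))

# Python objects the garbage collector knows are referring to the group.
print("referrers:", gc.get_referrers(pg))

# Drop the local handle again so this experiment does not keep the group alive.
del pg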