
nccl:send not found

Open qyysjtu opened this issue 1 year ago • 2 comments

Describe the Bug

When I run the PyTorch converter, it reports that the nccl:send comm_type is not supported. Is there any plan to support it, or is this comm_type not expected in the trace?

admin@admin: ~/llm/chakra(main)$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename et_plus/profile_et_rank_0_plus.json --output_filename et_plus/profile_chakra.0.et 
Traceback (most recent call last):
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/et_converter.py", line 89, in main
    converter.convert()
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 169, in convert
    collective_comm_type = self.get_collective_comm_type(pytorch_node.name)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 395, in get_collective_comm_type
    raise ValueError(f"'{name}' not found in collective communication mapping. "
ValueError: 'nccl:send' not found in collective communication mapping. Please add this collective communication name to the mapping.

qyysjtu avatar Apr 09 '24 02:04 qyysjtu

Supported collective communication types are listed here. Chakra does not currently recognize nccl:send as a collective communication type, and the Chakra working group must decide whether to add SEND and RECV as new collective types. We understand that these operations appear in collected traces, but we do not have a working solution yet. In the meantime, you can make local changes to support the SEND and RECV types. If that works and makes sense, you can create a PR.

TaekyungHeo avatar May 09 '24 12:05 TaekyungHeo

Thanks for reporting this issue.

@TaekyungHeo - we probably need to handle this as COMM_SEND_NODE, right? Wdyt? nccl:send is point-to-point, so it cannot be a collective operation.
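
A rough sketch of that idea, for illustration only: check for point-to-point names before falling back to the collective mapping, so that nccl:send never reaches the lookup that raises the ValueError above. Everything here is an assumption rather than Chakra's actual code: `NodeKind`, `classify_comm_node`, the COMM_RECV_NODE and COMM_COLL_NODE labels, and the list of collective names are hypothetical and would need to be checked against the real converter and schema.

```python
from enum import Enum


class NodeKind(Enum):
    # Hypothetical labels; only COMM_SEND_NODE is named in this thread.
    SEND = "COMM_SEND_NODE"
    RECV = "COMM_RECV_NODE"
    COLLECTIVE = "COMM_COLL_NODE"


# Assumed point-to-point operation names as they might appear after "nccl:".
POINT_TO_POINT = {"send": NodeKind.SEND, "recv": NodeKind.RECV}

# Assumed collective operation names; not taken from Chakra's real mapping.
COLLECTIVES = {
    "all_reduce", "allreduce",
    "all_gather", "allgather",
    "all_to_all", "alltoall",
    "reduce_scatter", "broadcast",
}


def classify_comm_node(name: str) -> NodeKind:
    """Classify a trace node name such as 'nccl:send' or 'nccl:all_reduce'."""
    # Drop an optional backend prefix ("nccl:") and normalize case.
    op = name.lower().split(":", 1)[-1].strip()
    # Point-to-point operations are checked first, so they never hit
    # the collective lookup.
    if op in POINT_TO_POINT:
        return POINT_TO_POINT[op]
    if op in COLLECTIVES:
        return NodeKind.COLLECTIVE
    raise ValueError(f"'{name}' is neither a known point-to-point nor collective op")


if __name__ == "__main__":
    for node_name in ("nccl:send", "nccl:recv", "nccl:all_reduce"):
        print(node_name, "->", classify_comm_node(node_name).value)
```

With this split, unknown names still fail loudly, but send and recv get their own node kinds instead of being forced into the collective mapping.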

srinivas212 avatar May 18 '24 07:05 srinivas212

@qyysjtu this issue should be fixed now - #PR112

srinivas212 avatar Jun 28 '24 01:06 srinivas212