
Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis

JoongunPark opened this pull request 1 year ago • 5 comments

Summary

  • This PR relies on https://github.com/mlcommons/chakra/pull/119 and can only be merged after that PR is merged.

This PR introduces dependencies from GPU operators to CPU operators using the critical path analysis in HolisticTraceAnalysis (HTA). In the Chakra simulation flow, postprocessors such as the trace linker and the converter merge Chakra host traces with Chakra device traces and encode dependencies between operators. Currently, the postprocessors only encode dependencies from CPU operators to GPU operators. However, dependencies can also run in the opposite direction when a CPU operator depends on a GPU operator. To identify such cases, this PR leverages HTA's critical path analysis, specifically its synchronization dependencies. A synchronization dependency occurs when a CPU operator has to wait for a dispatched GPU operator to complete, which makes it a natural fit for capturing GPU-to-CPU dependencies.

  • HTA Repository: https://github.com/facebookresearch/HolisticTraceAnalysis
  • HTA Critical Path Analysis Documentation: https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html
  • HTA Critical Path Analysis Example: https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/examples/experimental/critical_path_analysis.ipynb

Please note that:

  • The command-line interface has changed: you now need to specify the rank ID with --rank for chakra_trace_link.
  • When collecting the trace, the Kineto profiler should record events whose 'cat' field is 'cuda_sync'. Please follow the instructions in https://github.com/pytorch/pytorch/pull/105187:
from torch.autograd.profiler import profile, _ExperimentalConfig

# Enable collection of 'cuda_sync' events so that HTA can derive synchronization dependencies.
with profile(
    use_kineto=True,
    use_cuda=True,
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    workload()  # your training or inference step
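
As a quick sanity check, you can confirm that the exported Kineto trace file actually contains cuda_sync events. This is a minimal sketch; kineto_0.json is a placeholder for your own trace file.

import json

# Placeholder path to the exported Kineto trace; adjust to your own file.
with open("kineto_0.json") as f:
    trace = json.load(f)

# Kineto stores events under "traceEvents"; sync events carry cat == "cuda_sync".
sync_events = [e for e in trace.get("traceEvents", []) if e.get("cat") == "cuda_sync"]
print(f"Found {len(sync_events)} cuda_sync events")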

Test Plan

Download and install HTA:

git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git checkout d731cc2e2249976c97129d409a83bd53d93051f6
git submodule update --init
pip install -r requirements.txt
pip install -e .
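
To verify the installation, you can import the modules used by the test script below. This is a minimal check that only exercises the imports; the printed edge types should include SYNC_DEPENDENCY.

# Minimal check that HTA installed correctly and exposes the APIs used below.
from hta.analyzers.critical_path_analysis import CPEdgeType
from hta.trace_analysis import TraceAnalysis  # noqa: F401  (import check only)

print([edge_type.name for edge_type in CPEdgeType])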

Next, you need to collect traces by following the instructions here: https://github.com/pytorch/pytorch/pull/105187.

After that, you can load sync dependencies and print them out with the following script:

import argparse
import logging
import os
from typing import Dict, List

from hta.analyzers.critical_path_analysis import CPEdgeType
from hta.trace_analysis import TraceAnalysis

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_sync_dependencies(
    rank: int, kineto_file: str, annotation: str = "ProfilerStep", instance_id: int = 0
) -> Dict[int, List[int]]:
    """
    Load synchronization dependencies using Holistic Trace Analysis (HTA).

    Args:
        rank (int): Rank for the input Kineto trace.
        kineto_file (str): Path to the Kineto trace file.
        annotation (str): Annotation to use for the analysis. Defaults to "ProfilerStep".
        instance_id (int): Instance ID for the analysis. Defaults to 0.

    Returns:
        Dict[int, List[int]]: A dictionary mapping end event's external ID to a list of start event's external IDs
            that have synchronization dependencies.
    """
    sync_dependencies = {}
    trace_analysis = TraceAnalysis(trace_dir=os.path.dirname(kineto_file))
    cp_graph, success = trace_analysis.critical_path_analysis(rank=rank, annotation=annotation, instance_id=instance_id)
    if not success:
        logger.error("Failed to load Critical Path Graph")
        return sync_dependencies

    raw_events = trace_analysis.t.get_raw_trace_for_one_rank(rank=rank)["traceEvents"]
    for edge in cp_graph.critical_path_edges_set:
        if edge.type == CPEdgeType.SYNC_DEPENDENCY:
            start_event_id, end_event_id = cp_graph.get_events_for_edge(edge)
            start_event, end_event = raw_events[start_event_id], raw_events[end_event_id]
            if "External id" in end_event["args"] and "External id" in start_event["args"]:
                start_event_external_id = start_event["args"]["External id"]
                end_event_external_id = end_event["args"]["External id"]
                start_event_name = start_event["name"]
                end_event_name = end_event["name"]
                if start_event_external_id != end_event_external_id:
                    # Record the dependency so the return value matches the docstring:
                    # the end event (CPU side) waits on the start event (GPU side).
                    sync_dependencies.setdefault(end_event_external_id, []).append(start_event_external_id)
                    print(
                        f"start_event_id {start_event_id}, end_event_id {end_event_id}, "
                        f"start_event_external_id {start_event_external_id}, end_event_external_id {end_event_external_id}, "
                        f"start_event_name '{start_event_name}', end_event_name '{end_event_name}'"
                    )
            else:
                logger.warning(
                    f"Synchronization dependency from event {start_event_id} to event {end_event_id} will "
                    "not be considered due to missing external IDs."
                )
    return sync_dependencies


def main() -> None:
    """
    Main function to parse arguments and load synchronization dependencies.
    """
    parser = argparse.ArgumentParser(description="Load and print synchronization dependencies from Kineto traces.")
    parser.add_argument("--input", type=str, required=True, help="Path to the Kineto trace file.")
    parser.add_argument("--rank", type=int, required=True, help="Rank for the input traces.")
    args = parser.parse_args()

    load_sync_dependencies(args.rank, args.input)


if __name__ == "__main__":
    main()

You can run it with the following command:

$ python sync_dep.py --input ~/Downloads/cuda-sync/kineto_0.json --rank 0 > /tmp/out

The trace used in this test is attached as cuda-sync.zip. The resulting /tmp/out contains:

start_event_id 24868, end_event_id 24874, start_event_external_id 94785, end_event_external_id 94792, start_event_name 'void multi_tensor_apply_kernel<TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float)', end_event_name 'cudaDeviceSynchronize'
start_event_id 24536, end_event_id 24650, start_event_external_id 13847, end_event_external_id 91874, start_event_name 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)', end_event_name 'cudaDeviceSynchronize'

Two synchronization dependencies are identified with the script. In this test, we focus on the dependency between 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)' and 'cudaDeviceSynchronize'.

Let's confirm this observation with a trace visualizer. You can open Kineto traces in https://perfetto.dev/. Searching for ncclDevKernel_ReduceScatter_Sum_f32_RING_LL shows that it is a GPU kernel (see its category field) with an external ID of 13847. Near the kernel, but in the CPU row of the visualization, you can find a cudaDeviceSynchronize whose external ID is 94792; it is a cuda_runtime operator. Since cuda_runtime operators are not considered simulatable in the toolchain, the closest later CPU operator, aten::empty with an external ID of 16392, should depend on the GPU kernel.
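
If you prefer to check this programmatically rather than in Perfetto, here is a small sketch that looks up trace events by their External id. It assumes the same Kineto trace you passed to the script above (kineto_0.json is a placeholder path), and the external IDs come from the script output.

import json

# Load the same Kineto trace that was fed to the sync-dependency script.
with open("kineto_0.json") as f:
    events = json.load(f)["traceEvents"]

def events_with_external_id(ext_id: int):
    # Return (name, cat) for every event carrying the given External id.
    return [
        (e.get("name"), e.get("cat"))
        for e in events
        if e.get("args", {}).get("External id") == ext_id
    ]

print(events_with_external_id(13847))  # the ncclDevKernel_ReduceScatter GPU kernel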

Let's see if the synchronization dependency is properly encoded by chakra_trace_link. Make sure Chakra is installed:

$ pip install .

Run chakra_trace_link.

chakra_trace_link \
  --pytorch-et-file /Users/theo/Downloads/et_0.json \
  --kineto-file /Users/theo/Downloads/kineto_0.json \
  --output-file ~/megatron_0.json \
  --rank 0

You can review ~/megatron_0.json and confirm that the sync dependencies are encoded:

        {
            "id": 15899,
            "name": "ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)",
            "ctrl_deps": 15898,
            "inputs": {
                "values": [
                    [
                        87,
                        49,
                        576716800,
                        52428800,
                        4,
                        "cuda:0"
                    ],
                    [
                        90,
                        49,
                        629145600,
                        52428800,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    [
                        52428800
                    ],
                    [
                        52428800
                    ]
                ],
                "types": [
                    "Tensor(float)",
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [],
                "shapes": [],
                "types": []
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13651
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": -1
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 7
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 4
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": ""
                }
            ],
            "inclusive_dur": 44160,
            "exclusive_dur": 44160,
            "ts": 1719249141376319,
            "inter_thread_dep": 15685,
            "cat": "kernel",
            "ph": "X",
            "stream": 64
        },
        {
            "id": 15982,
            "name": "aten::detach",
            "ctrl_deps": 29,
            "inputs": {
                "values": [
                    [
                        20832,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [
                    [
                        20842,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13722
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": 19404
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 1
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": "aten::detach(Tensor(a) self) -> Tensor(a)"
                }
            ],
            "inclusive_dur": 17,
            "exclusive_dur": 11,
            "ts": 1719249141527040,
            "inter_thread_dep": 15901,
            "sync_dep": 15899
        },
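
Beyond eyeballing the JSON, you can list every node that received a sync_dep field. This is a minimal sketch; it assumes the merged file follows the PyTorch ET layout with a top-level "nodes" list, so adjust the key if your file differs.

import json
import os

# List every node in the linked trace that carries a sync_dep field.
with open(os.path.expanduser("~/megatron_0.json")) as f:
    et = json.load(f)

for node in et.get("nodes", []):
    if "sync_dep" in node:
        print(f'Node {node["id"]} ("{node["name"]}") has sync_dep -> {node["sync_dep"]}')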

Run chakra_converter:

chakra_converter --input_filename ~/megatron_0.json \
    --output_filename megatron_0.chakra \
    --input_type PyTorch \
    --log_filename /tmp/rank_0
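
Once the conversion finishes, you can skim the converter log for the sync dependencies it encoded. This is a sketch that assumes the log contains messages like the ones pasted later in this thread.

# Print every converter log line that mentions a sync dependency.
with open("/tmp/rank_0") as f:
    for line in f:
        if "dependency on Node ID" in line:
            print(line.rstrip())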

Here are the traces that I used: cuda-sync.zip, Resnet-50.zip, llama2.zip.

JoongunPark avatar May 10 '24 22:05 JoongunPark

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

github-actions[bot] avatar May 10 '24 22:05 github-actions[bot]

Hi, how is the review going?

JoongunPark avatar May 30 '24 18:05 JoongunPark

Hi, @JoongunPark. We did not get a chance to review and test because we have an urgent task internally. Thank you for your patience.

TaekyungHeo avatar May 30 '24 18:05 TaekyungHeo

@JoongunPark - we may need 1-2 more weeks since we are setting up integration tests internally as we speak. We will try to expedite this asap. Thank you for your patience.

srinivas212 avatar May 31 '24 02:05 srinivas212

I have tested with Taekyung's latest enhancement. It works well in my environment (Python 3.10.13, Linux 5.15.0-105-generic). Below is the log that I obtained after converting the traces into Chakra HDT.

Llama2
Node ID 14063 now has a synchronization dependency on Node ID 13783
Node ID 13782 now has a synchronization dependency on Node ID 13779
Node ID 13784 now has a synchronization dependency on Node ID 13779
Node ID 13837 now has a synchronization dependency on Node ID 13779
Node ID 13852 now has a synchronization dependency on Node ID 13783
Node ID 13779 now has a synchronization dependency on Node ID 13776
Node ID 13781 now has a synchronization dependency on Node ID 13776
Node ID 13831 now has a synchronization dependency on Node ID 13776
Node ID 13849 now has a synchronization dependency on Node ID 13780
Resnet-50
Node ID 4864 now has a synchronization dependency on Node ID 4861
Node ID 4866 now has a synchronization dependency on Node ID 4861
Node ID 5270 now has a synchronization dependency on Node ID 4865

Also, as he mentioned, the code now builds the sync dependency with the closest following CPU operator instead of the cuda_runtime op.

JoongunPark avatar Jul 01 '24 15:07 JoongunPark

@JoongunPark can you please resolve the merge conflicts? We can merge this PR.

srinivas212 avatar Oct 05 '24 02:10 srinivas212

Merging based on @TaekyungHeo's feedback and review. Thanks for the PR @JoongunPark and thanks for the review @TaekyungHeo.

My apologies for the delayed recognition of the merge conflicts. Thank you so much for reviewing and managing this PR, @srinivas212 and @TaekyungHeo!

JoongunPark avatar Oct 08 '24 00:10 JoongunPark