[GraphBolt] ItemSampler CPU usage too high, especially hetero case.
🔨Work Item
Description
When running the hetero GraphBolt example in pure-GPU mode, CPU utilization is very high (4000%).
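For reference, the 4000% figure is top/htop-style reporting, where 100% corresponds to one fully busy core, so this is roughly 40 saturated cores. A minimal sketch for sampling the utilization of a running training process, assuming psutil is installed (sample_cpu_percent is a hypothetical helper, not part of DGL):

import sys

import psutil


def sample_cpu_percent(pid, interval=1.0, samples=10):
    # cpu_percent() reports 100% per fully busy core, matching top/htop,
    # so values like 4000% mean roughly 40 saturated cores.
    proc = psutil.Process(pid)
    for _ in range(samples):
        print(f"{proc.cpu_percent(interval=interval):.0f}%")


if __name__ == "__main__":
    sample_cpu_percent(int(sys.argv[1]))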
Depending work items or issues
As @mfbalin mentioned, the logic specific to ItemSetDict could be the culprit.
It now looks like dgl.create_block is the culprit.
dgl/heterograph.py:6407 make_canonical_edges uses numpy for some ops.
https://github.com/dmlc/dgl/blob/41a38486a5ed9298093d9f0bc415751269c7d577/python/dgl/convert.py#L583
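If that path goes through numpy, any tensor living on the GPU has to be copied to the host first, since numpy only operates on CPU memory. A minimal illustration of the pattern (not the actual DGL code), assuming CUDA is available:

import torch

etype_ids = torch.arange(3, device="cuda")
# numpy cannot read CUDA memory, so this forces a device-to-host copy
# followed by CPU-side work:
host_ids = etype_ids.cpu().numpy()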
@peizhou001
can dgl.create_block() run purely on the GPU, or can it only run on the CPU? I remember you looked into it previously.
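A quick check one could run, assuming CUDA is available; it only shows where the resulting block lives, not whether intermediate steps (e.g. make_canonical_edges) drop to numpy internally, which would need a profiler:

import torch
import dgl

src = torch.tensor([0, 1, 2], device="cuda")
dst = torch.tensor([1, 2, 0], device="cuda")
block = dgl.create_block((src, dst), num_src_nodes=3, num_dst_nodes=3)
print(block.device)  # expect cuda:0 if the block was built on the GPU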
@mfbalin tried bypassing the whole forward pass, including data.blocks, and CPU usage is still high. So create_block() is probably not the culprit.
Update: CPU usage is high even for the homo examples. Some recent change might have caused us to utilize the CPU even in pure-GPU mode. @frozenbugs do you think it could be the logic that moves MiniBatch to the device?
Or could it possibly be one of my recent changes, such as #7312?
Oh the code in #7312 does not run in the homo case.
I am going to bisect to see if I can identify a commit that causes this issue.
git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9
This commit does not have the issue. Somewhere between current master and the commit above, there was a change that caused high CPU utilization on the GPU code path.
Could be https://github.com/dmlc/dgl/pull/7309. @yxy235 could you help look into it, reproduce, and confirm?
The easiest way to test is to run python examples/sampling/graphbolt/pyg/node_classification_advanced.py --torch-compile --mode=cuda-cuda-cuda. There is up to a 30% regression.
Transferred attr list:
['blocks', 'compacted_negative_dsts', 'compacted_negative_srcs', 'compacted_node_pairs', 'compacted_seeds', 'edge_features', 'indexes', 'input_nodes', 'labels', 'negative_dsts', 'negative_node_pairs', 'negative_srcs', 'node_features', 'node_pairs', 'node_pairs_with_labels', 'positive_node_pairs', 'sampled_subgraphs', 'seed_nodes', 'seeds']
compacted_negative_dsts
compacted_negative_srcs
compacted_node_pairs
compacted_seeds
edge_features
indexes
input_nodes
labels
negative_dsts
negative_srcs
node_features
node_pairs
sampled_subgraphs
seed_nodes
seeds
Actually transferred by calling .to:
input_nodes
labels
seeds
Looks like the blocks property is invoked inside MiniBatch.to() even for the PyG example.
I see. Do you think we need a check when calling MiniBatch.to()?
I figured it out. When we filter which attributes to transfer, we end up calling the blocks property. Making a quick patch now.
CPU usage is still higher than 100%, though, so I am not sure the whole issue is resolved.
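The failure mode in a minimal sketch, with FakeMiniBatch as a hypothetical stand-in for the real GraphBolt MiniBatch: filtering candidate attributes with getattr over all public names also evaluates lazy properties such as blocks, which is exactly the work we wanted to skip. Restricting the scan to real dataclass fields leaves the property untouched.

from dataclasses import dataclass, fields


@dataclass
class FakeMiniBatch:
    input_nodes: object = None
    labels: object = None
    seeds: object = None

    @property
    def blocks(self):
        # In the real MiniBatch this builds DGLBlocks (CPU work);
        # here we just record that the property was evaluated.
        print("blocks property evaluated")
        return []


mb = FakeMiniBatch(input_nodes=1, labels=2, seeds=3)

# Buggy filtering: getattr on every public name also triggers the property.
attrs = [n for n in dir(mb) if not n.startswith("_") and getattr(mb, n) is not None]

# Safer filtering: only inspect real dataclass fields, so blocks is never touched.
attrs = [f.name for f in fields(mb) if getattr(mb, f.name) is not None]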
Even with #7330, we still need to investigate where the high CPU usage comes from. CPU usage is 800% for our main pure-GPU (--mode=cuda-cuda) node classification example.
The hetero example's CPU usage is still 4000%.
@Rhett-Ying Here we can see the last iterations of the training dataloader for the hetero example. Since we have a prefetcher thread with a buffer size of 2, the last 2 iterations don't show excessive CPU utilization, as the computation for those iterations has already finished. This indicates that the high CPU utilization is due to the ItemSampler.
# (4) Cut datapipe at CopyTo and wrap with prefetcher. This enables the
# data pipeline up to the CopyTo operation to run in a separate thread.
datapipe_graph = _find_and_wrap_parent(
    datapipe_graph,
    CopyTo,
    dp.iter.Prefetcher,
    buffer_size=2,
)
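For reference, a standalone sketch of the same idea with torchdata; the toy pipeline below stands in for the ItemSampler and sampling stages. Everything upstream of the Prefetcher runs in a background thread and fills a small buffer, which is why the CPU cost only disappears for the final iterations the buffer already covers.

import torchdata.datapipes as dp

# Toy upstream pipeline standing in for ItemSampler + sampling stages.
pipe = dp.iter.IterableWrapper(range(8)).map(lambda x: x * x)
# Wrap with a prefetcher thread and a buffer of 2, mirroring the snippet above.
pipe = dp.iter.Prefetcher(pipe, buffer_size=2)

for item in pipe:
    print(item)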
Users with multiple GPUs may not be able to utilize them effectively due to a potential CPU bottleneck.
python examples/graphbolt/pyg/labor/node_classification.py --dataset=yelp --dropout=0 --mode=cuda-cuda-cuda
CPU usage on this example is too high as well, and this is the homo case. @Rhett-Ying
The CPU becomes the bottleneck: a faster CPU results in faster performance even if the GPU is slower. 10000% CPU usage.
@mfbalin So the culprit of the high CPU usage is the ItemSampler?