[GraphBolt] ItemSampler CPU usage too high, especially hetero case.
🔨Work Item
Description
When running the hetero GraphBolt example in pure-GPU mode, CPU utilization is very high (4000%).
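For reference, the 4000% figure is top/htop-style reporting, where 100% corresponds to one fully busy core, so this is roughly 40 saturated cores. A minimal sketch for sampling the utilization of a running training process, assuming psutil is installed (sample_cpu_percent is a hypothetical helper, not part of DGL):

import sys

import psutil


def sample_cpu_percent(pid, interval=1.0, samples=10):
    # cpu_percent() reports 100% per fully busy core, matching top/htop,
    # so values like 4000% mean roughly 40 saturated cores.
    proc = psutil.Process(pid)
    for _ in range(samples):
        print(f"{proc.cpu_percent(interval=interval):.0f}%")


if __name__ == "__main__":
    sample_cpu_percent(int(sys.argv[1]))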
Depending work items or issues
As @mfbalin mentioned, the logic specific to ItemSetDict could be the culprit.
It now looks like dgl.create_block is the culprit.
dgl/heterograph.py:6407 make_canonical_edges uses numpy for some ops.
https://github.com/dmlc/dgl/blob/41a38486a5ed9298093d9f0bc415751269c7d577/python/dgl/convert.py#L583
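If that path goes through numpy, any tensor living on the GPU has to be copied to the host first, since numpy only operates on CPU memory. A minimal illustration of the pattern (not the actual DGL code), assuming CUDA is available:

import torch

etype_ids = torch.arange(3, device="cuda")
# numpy cannot read CUDA memory, so this forces a device-to-host copy
# followed by CPU-side work:
host_ids = etype_ids.cpu().numpy()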
@peizhou001
can dgl.create_block() run purely on the GPU, or can it only run on the CPU? I remember you looked into it previously.
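A quick check one could run, assuming CUDA is available; it only shows where the resulting block lives, not whether intermediate steps (e.g. make_canonical_edges) drop to numpy internally, which would need a profiler:

import torch
import dgl

src = torch.tensor([0, 1, 2], device="cuda")
dst = torch.tensor([1, 2, 0], device="cuda")
block = dgl.create_block((src, dst), num_src_nodes=3, num_dst_nodes=3)
print(block.device)  # expect cuda:0 if the block was built on the GPU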
@mfbalin tried bypassing the whole forward pass, including data.blocks, and CPU usage is still high. So create_block() is probably not the culprit.
Update: CPU usage is high even for the homo examples. Some recent change might have caused us to utilize the CPU even in pure-GPU mode. @frozenbugs do you think it could be the logic that moves MiniBatch to the device?
Or could it possibly be one of my recent changes, such as #7312?
Oh the code in #7312 does not run in the homo case.
I am going to bisect to see if I can identify a commit that causes this issue.
git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9
This commit does not have the issue. Somewhere between current master and the commit above, there was a change that caused high CPU utilization on the GPU code path.
Could be https://github.com/dmlc/dgl/pull/7309. @yxy235 could you help look into it, reproduce, and confirm?
The easiest way to test is to run python examples/sampling/graphbolt/pyg/node_classification_advanced.py --torch-compile --mode=cuda-cuda-cuda. There is up to a 30% regression.
Transferred attr list:
['blocks', 'compacted_negative_dsts', 'compacted_negative_srcs', 'compacted_node_pairs', 'compacted_seeds', 'edge_features', 'indexes', 'input_nodes', 'labels', 'negative_dsts', 'negative_node_pairs', 'negative_srcs', 'node_features', 'node_pairs', 'node_pairs_with_labels', 'positive_node_pairs', 'sampled_subgraphs', 'seed_nodes', 'seeds']
compacted_negative_dsts
compacted_negative_srcs
compacted_node_pairs
compacted_seeds
edge_features
indexes
input_nodes
labels
negative_dsts
negative_srcs
node_features
node_pairs
sampled_subgraphs
seed_nodes
seeds
Actually transferred by calling .to:
input_nodes
labels
seeds
Looks like the blocks property is invoked inside MiniBatch.to() even for the PyG example.
I see. Do you think we need a check when calling MiniBatch.to()?
I figured it out. When we filter which attributes to transfer, we end up calling the blocks property. Making a quick patch now.
CPU usage is still higher than 100%, though, so I am not sure the whole issue is resolved.
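The failure mode in a minimal sketch, with FakeMiniBatch as a hypothetical stand-in for the real GraphBolt MiniBatch: filtering candidate attributes with getattr over all public names also evaluates lazy properties such as blocks, which is exactly the work we wanted to skip. Restricting the scan to real dataclass fields leaves the property untouched.

from dataclasses import dataclass, fields


@dataclass
class FakeMiniBatch:
    input_nodes: object = None
    labels: object = None
    seeds: object = None

    @property
    def blocks(self):
        # In the real MiniBatch this builds DGLBlocks (CPU work);
        # here we just record that the property was evaluated.
        print("blocks property evaluated")
        return []


mb = FakeMiniBatch(input_nodes=1, labels=2, seeds=3)

# Buggy filtering: getattr on every public name also triggers the property.
attrs = [n for n in dir(mb) if not n.startswith("_") and getattr(mb, n) is not None]

# Safer filtering: only inspect real dataclass fields, so blocks is never touched.
attrs = [f.name for f in fields(mb) if getattr(mb, f.name) is not None]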
Even with #7330, we still need to investigate where the high CPU usage comes from. CPU usage is 800% for our main pure-GPU (--mode=cuda-cuda) node classification example.
The hetero example's CPU usage is still 4000%.
@Rhett-Ying Here we can see the last iterations of the training dataloader for the hetero example. Since we have a prefetcher thread with a buffer size of 2, the last 2 iterations don't show excessive CPU utilization, as the computation for those iterations has already finished. This indicates that the high CPU utilization is due to the ItemSampler.
# (4) Cut datapipe at CopyTo and wrap with prefetcher. This enables the
# data pipeline up to the CopyTo operation to run in a separate thread.
datapipe_graph = _find_and_wrap_parent(
    datapipe_graph,
    CopyTo,
    dp.iter.Prefetcher,
    buffer_size=2,
)
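For reference, a standalone sketch of the same idea with torchdata; the toy pipeline below stands in for the ItemSampler and sampling stages. Everything upstream of the Prefetcher runs in a background thread and fills a small buffer, which is why the CPU cost only disappears for the final iterations the buffer already covers.

import torchdata.datapipes as dp

# Toy upstream pipeline standing in for ItemSampler + sampling stages.
pipe = dp.iter.IterableWrapper(range(8)).map(lambda x: x * x)
# Wrap with a prefetcher thread and a buffer of 2, mirroring the snippet above.
pipe = dp.iter.Prefetcher(pipe, buffer_size=2)

for item in pipe:
    print(item)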
Users with multiple GPUs may not be able to utilize them effectively due to a potential CPU bottleneck.
python examples/graphbolt/pyg/labor/node_classification.py --dataset=yelp --dropout=0 --mode=cuda-cuda-cuda
CPU usage on this example is too high as well, and this is the homo case. @Rhett-Ying
The CPU becomes the bottleneck: a faster CPU results in faster performance even if the GPU is slower. 10000% CPU usage.
@mfbalin So the culprit of the high CPU usage is the ItemSampler?