PyTorchProfiler does not profile GPU
Bug description
When I use PyTorchProfiler, no GPU profiling data appears in the TensorBoard view, even though the logs indicate that the GPU is being used.
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Conda env
name: lightning_tutorials
channels:
- conda-forge
dependencies:
- python=3.12
- lightning
- torchvision=0.21.0=cuda126_py312_h361dbbe_0
- tensorboard
- torch-tb-profiler
train.py
import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torchvision as tv
from lightning.pytorch.loggers import TensorBoardLogger
from lightning.pytorch.profilers import PyTorchProfiler
# --------------------------------
# Step 1: Define a LightningModule
# --------------------------------
# A LightningModule (nn.Module subclass) defines a full *system*
# (ie: an LLM, diffusion model, autoencoder, or simple image classifier).
class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop. It is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# -------------------
# Step 2: Define data
# -------------------
dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor())
train, val = data.random_split(dataset, [55000, 5000])
# -------------------
# Step 3: Train
# -------------------
autoencoder = LitAutoEncoder()
logger = TensorBoardLogger(save_dir="tb_logs")
profiler = PyTorchProfiler()
trainer = L.Trainer(logger=logger, profiler=profiler, max_epochs=2)
trainer.fit(autoencoder, data.DataLoader(train), data.DataLoader(val))
Error messages and logs
(lightning_tutorials) PS C:\Users\anguzo\Projects\work\Machine-Learning-Collection> & C:/Users/anguzo/.local/share/mamba/envs/lightning_tutorials/python.exe "c:/Users/anguzo/Projects/work/Machine-Learning-Collection/ML/Pytorch/pytorch_lightning/9.1 Prof/train.py"
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
C:\Users\anguzo\.local\share\mamba\envs\lightning_tutorials\Lib\site-packages\lightning\pytorch\trainer\configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params | Mode
-----------------------------------------------
0 | encoder | Sequential | 100 K | train
1 | decoder | Sequential | 101 K | train
-----------------------------------------------
202 K Trainable params
0 Non-trainable params
202 K Total params
0.810 Total estimated model params size (MB)
8 Modules in train mode
0 Modules in eval mode
C:\Users\anguzo\.local\share\mamba\envs\lightning_tutorials\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
Epoch 0: 0%| | 4/55000 [00:00<39:03, 23.47it/s, v_num=0][W226 15:37:54.000000000 collection.cpp:647] Warning: Optimizer.step#Adam.step (function operator ())
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55000/55000 [03:34<00:00, 256.89it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55000/55000 [03:34<00:00, 256.88it/s, v_num=0]
FIT Profiler Report
Profile stats for: records
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
ProfilerStep* 18.41% 10.089ms 60.02% 32.896ms 10.965ms 8.532ms 15.48% 32.903ms 10.968ms 3
[pl][profile]run_training_batch 0.22% 118.400us 27.99% 15.341ms 7.670ms 107.000us 0.19% 15.349ms 7.675ms 2
[pl][profile][LightningModule]LitAutoEncoder.optimiz... 0.13% 70.600us 27.77% 15.222ms 7.611ms 48.000us 0.09% 15.242ms 7.621ms 2
Optimizer.step#Adam.step 20.39% 11.178ms 27.64% 15.152ms 7.576ms 11.207ms 20.33% 15.194ms 7.597ms 2
[pl][profile][Strategy]SingleDeviceStrategy.backward... 19.32% 10.590ms 20.25% 11.099ms 3.700ms 10.435ms 18.93% 11.136ms 3.712ms 3
[pl][profile][Strategy]SingleDeviceStrategy.training... 3.26% 1.788ms 10.56% 5.790ms 1.930ms 1.524ms 2.77% 5.809ms 1.936ms 3
autograd::engine::evaluate_function: AddmmBackward0 1.48% 812.100us 5.97% 3.271ms 272.583us 473.000us 0.86% 3.406ms 283.833us 12
AddmmBackward0 1.47% 806.000us 4.16% 2.280ms 189.967us 578.000us 1.05% 2.577ms 214.750us 12
aten::t 1.95% 1.066ms 3.62% 1.986ms 34.840us 1.088ms 1.97% 2.560ms 44.912us 57
[pl][profile][_TrainingEpochLoop].train_dataloader_n... 0.21% 113.100us 3.95% 2.167ms 722.333us 94.000us 0.17% 2.214ms 738.000us 3
enumerate(DataLoader)#_SingleProcessDataLoaderIter._... 2.32% 1.273ms 3.75% 2.054ms 684.633us 924.000us 1.68% 2.120ms 706.667us 3
[pl][module]torch.nn.modules.container.Sequential: e... 0.58% 315.500us 3.02% 1.653ms 551.067us 300.000us 0.54% 1.691ms 563.667us 3
aten::transpose 1.61% 880.300us 1.68% 919.700us 16.135us 1.056ms 1.92% 1.472ms 25.825us 57
autograd::engine::evaluate_function: torch::autograd... 0.49% 267.800us 2.12% 1.161ms 48.367us 425.000us 0.77% 1.408ms 58.667us 24
aten::linear 0.42% 232.500us 2.34% 1.285ms 107.042us 233.000us 0.42% 1.398ms 116.500us 12
[pl][profile][Callback]TQDMProgressBar.on_train_batc... 2.27% 1.246ms 2.34% 1.284ms 427.967us 1.278ms 2.32% 1.333ms 444.333us 3
aten::item 1.50% 821.700us 1.53% 836.700us 16.406us 818.000us 1.48% 1.254ms 24.588us 51
[pl][module]torch.nn.modules.container.Sequential: d... 0.48% 265.400us 2.17% 1.191ms 397.033us 201.000us 0.36% 1.201ms 400.333us 3
aten::detach 1.34% 737.000us 1.46% 802.700us 17.838us 737.000us 1.34% 1.142ms 25.378us 45
aten::result_type 0.03% 14.100us 0.03% 14.100us 0.117us 997.000us 1.81% 997.000us 8.308us 120
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 54.812ms
Self CUDA time total: 55.115ms
Environment
Current environment
- CUDA:
  - GPU:
    - NVIDIA GeForce GTX 1080
  - available: True
  - version: 12.6
- Lightning:
  - lightning: 2.5.0.post0
  - lightning-utilities: 0.12.0
  - pytorch-lightning: 2.5.0.post0
  - torch: 2.6.0
  - torch-tb-profiler: 0.4.3
  - torchmetrics: 1.6.1
  - torchvision: 0.21.0
- Packages:
  - absl-py: 2.1.0
  - autocommand: 2.2.2
  - backports.tarfile: 1.2.0
  - brotli: 1.1.0
  - certifi: 2025.1.31
  - charset-normalizer: 3.4.1
  - colorama: 0.4.6
  - filelock: 3.17.0
  - fsspec: 2025.2.0
  - grpcio: 1.67.1
  - idna: 3.10
  - importlib-metadata: 8.6.1
  - inflect: 7.3.1
  - jaraco.collections: 5.1.0
  - jaraco.context: 5.3.0
  - jaraco.functools: 4.0.1
  - jaraco.text: 3.12.1
  - jinja2: 3.1.5
  - lightning: 2.5.0.post0
  - lightning-utilities: 0.12.0
  - markdown: 3.6
  - markupsafe: 3.0.2
  - more-itertools: 10.3.0
  - mpmath: 1.3.0
  - networkx: 3.4.2
  - numpy: 2.2.3
  - optree: 0.14.0
  - packaging: 24.2
  - pandas: 2.2.3
  - pillow: 11.1.0
  - pip: 25.0.1
  - platformdirs: 4.2.2
  - protobuf: 5.28.3
  - pybind11: 2.13.6
  - pybind11-global: 2.13.6
  - pysocks: 1.7.1
  - python-dateutil: 2.9.0.post0
  - pytorch-lightning: 2.5.0.post0
  - pytz: 2024.1
  - pyyaml: 6.0.2
  - requests: 2.32.3
  - setuptools: 75.8.0
  - six: 1.17.0
  - sympy: 1.13.3
  - tensorboard: 2.19.0
  - tensorboard-data-server: 0.7.0
  - tomli: 2.0.1
  - torch: 2.6.0
  - torch-tb-profiler: 0.4.3
  - torchmetrics: 1.6.1
  - torchvision: 0.21.0
  - tqdm: 4.67.1
  - typeguard: 4.3.0
  - typing-extensions: 4.12.2
  - tzdata: 2025.1
  - urllib3: 2.2.2
  - werkzeug: 3.1.3
  - wheel: 0.45.1
  - win-inet-pton: 1.1.0
  - zipp: 3.21.0
- System:
  - OS: Windows
  - architecture:
    - 64bit
    - WindowsPE
  - processor: AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
  - python: 3.12.9
  - release: 11
  - version: 10.0.22631
More info
No response
I'm seeing the same behavior.
Python: 3.12.8, lightning: 2.5.0.post0, torch: 2.5.1+cu121, tensorboard: 2.19.0, torch_tb_profiler: 0.4.3
@Borda - I can take this up.
I was able to get GPU profiling in TensorBoard using the library versions below, with the same code. I will try to confirm whether this is a version-specific issue.
lightning -- 2.5.2
lightning-utilities -- 0.14.3
line_profiler -- 4.2.0
pytorch-ignite -- 0.5.2
pytorch-lightning -- 2.5.1.post0
tensorboard -- 2.18.0
tensorboard-data-server -- 0.7.2
torch -- 2.6.0+cu124
FIT Profiler Report
Profile stats for: records
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
ProfilerStep* 0.00% 0.000us 0.00% 0.000us 0.000us 38.337ms 3953.29% 38.337ms 12.779ms 3
[pl][profile][Strategy]SingleDeviceStrategy.training... 0.00% 0.000us 0.00% 0.000us 0.000us 2.073ms 213.77% 2.073ms 691.026us 3
ProfilerStep* 29.08% 15.516ms 70.20% 37.460ms 12.487ms 0.000us 0.00% 731.164us 243.721us 3
aten::_foreach_addcdiv_ 0.11% 59.589us 0.17% 88.718us 29.573us 183.167us 18.89% 183.167us 61.056us 3
void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 183.167us 18.89% 183.167us 61.056us 3
aten::_foreach_div_ 0.09% 46.927us 0.14% 73.271us 24.424us 169.087us 17.44% 169.087us 56.362us 3
void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 169.087us 17.44% 169.087us 56.362us 3
autograd::engine::evaluate_function: AddmmBackward0 1.45% 775.080us 5.39% 2.874ms 239.519us 0.000us 0.00% 146.304us 12.192us 12
aten::_foreach_sqrt 0.13% 69.885us 0.45% 238.946us 79.649us 113.280us 11.68% 113.280us 37.760us 3
void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 113.280us 11.68% 113.280us 37.760us 3
[pl][profile][LightningModule]LitAutoEncoder.transfe... 0.00% 0.000us 0.00% 0.000us 0.000us 104.734us 10.80% 104.734us 34.911us 3
[pl][profile][Strategy]SingleDeviceStrategy.training... 3.97% 2.116ms 11.82% 6.308ms 2.103ms 0.000us 0.00% 93.120us 31.040us 3
AddmmBackward0 0.21% 114.695us 3.33% 1.774ms 147.863us 0.000us 0.00% 80.736us 6.728us 12
aten::mm 0.87% 463.009us 1.28% 685.548us 32.645us 80.736us 8.33% 80.736us 3.845us 21
aten::_foreach_addcmul_ 0.12% 66.595us 0.18% 93.564us 31.188us 69.055us 7.12% 69.055us 23.018us 3
void at::native::(anonymous namespace)::multi_tensor... 0.00% 0.000us 0.00% 0.000us 0.000us 69.055us 7.12% 69.055us 23.018us 3
aten::sum 0.38% 203.757us 0.55% 295.501us 24.625us 65.568us 6.76% 65.568us 5.464us 12
[pl][profile]run_training_batch 0.36% 194.507us 23.81% 12.704ms 4.235ms 0.000us 0.00% 61.408us 20.469us 3
[pl][profile][LightningModule]LitAutoEncoder.optimiz... 0.15% 81.050us 23.37% 12.473ms 4.158ms 0.000us 0.00% 61.408us 20.469us 3
Optimizer.step#Adam.step 11.16% 5.953ms 23.19% 12.372ms 4.124ms 0.000us 0.00% 61.408us 20.469us 3
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 53.360ms
Self CUDA time total: 969.756us
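The two reports in this thread differ in a way that can be checked mechanically: this one lists rows that look like CUDA kernel signatures (`void at::native::...`), while the report in the original post lists none and shows nearly equal CPU and CUDA totals. A minimal sketch of that comparison follows. `summarize_report` is a hypothetical helper, not part of Lightning or PyTorch, and treating rows that start with `void ` as kernel rows is an assumption based only on the rows shown here.

```python
import re

# Matches summary lines such as "Self CUDA time total: 969.756us".
_TOTAL_RE = re.compile(r"Self (CPU|CUDA) time total:\s*([\d.]+)(us|ms|s)\b")
_UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}


def summarize_report(report: str) -> dict:
    """Return the CPU/CUDA totals in microseconds and a count of kernel-like rows."""
    totals = {}
    for device, value, unit in _TOTAL_RE.findall(report):
        totals[device] = float(value) * _UNIT_TO_US[unit]
    # Assumption: CUDA kernel rows begin with a C++ signature ("void ...").
    kernel_rows = sum(
        1 for line in report.splitlines() if line.lstrip().startswith("void ")
    )
    return {
        "cpu_us": totals.get("CPU"),
        "cuda_us": totals.get("CUDA"),
        "kernel_rows": kernel_rows,
    }
```

Pasting each "FIT Profiler Report" into `summarize_report` gives a quick side-by-side view. A report with zero kernel-like rows may be worth a closer look, though this heuristic alone does not prove that GPU activity was missed.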
Using the versions of the libraries below. I used the same code.
I see that you iterated on your reply, so was there something else needed or just freshening the TB?
I spent some time in a Kaggle notebook, using their GPU option, to observe the files that are created as I run the code. I copied the JSON files to my local machine to run TensorBoard, since Kaggle does not let me run TensorBoard.
I did not have to change any code or library versions.
https://www.kaggle.com/code/gc2713/pytorch-lightning-tensorboard/
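Before opening the copied JSON files in TensorBoard, it can help to check whether they contain any GPU events at all. The sketch below counts trace events by category. It assumes the files are Chrome-trace JSON with a top-level `traceEvents` list, that their names end in `.pt.trace.json`, and that GPU work is recorded under categories such as `kernel` or `gpu_memcpy`. None of these details are confirmed in this thread, and both function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_trace_categories(log_dir: str, pattern: str = "*.pt.trace.json") -> Counter:
    """Count trace events per (lower-cased) category in all trace files under log_dir."""
    counts = Counter()
    for path in Path(log_dir).rglob(pattern):
        with open(path, encoding="utf-8") as fh:
            trace = json.load(fh)
        # Chrome-trace files are either a dict with "traceEvents" or a bare list.
        events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
        for event in events:
            if isinstance(event, dict):
                counts[str(event.get("cat", "")).lower()] += 1
    return counts


def has_gpu_events(counts: Counter) -> bool:
    # Assumption: GPU activity is filed under "kernel" or a "gpu_*" category.
    return any(cat == "kernel" or cat.startswith("gpu_") for cat in counts)
```

For example, `has_gpu_events(count_trace_categories("tb_logs"))` returning `False` would suggest the traces themselves lack GPU data, which points away from TensorBoard as the cause.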
@getgaurav2 did you try to isolate which package was causing the problem? For example, is this a tensorboard v2.19.0 and v2.5.0 problem? It seems odd that it's not working in 2.5 and 2.19 but is working in 2.18.
Actually, it's working with 2.19 as well. I tried again; please see the Colab notebook below.
https://colab.research.google.com/drive/10rYUUmUexhfIIJDqcZ-XOQh_LM4q_NAr?usp=sharing
lightning==2.5.0.post0 torch==2.5.1 tensorboard==2.19.0 torch_tb_profiler==0.4.3
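Since the thread compares several environments by hand, a small script that prints the relevant package versions in one consistent format may make reports easier to line up. This is a sketch using only the standard library. The package list is taken from the versions quoted above, and `collect_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names mentioned in this thread.
PACKAGES = [
    "lightning",
    "pytorch-lightning",
    "torch",
    "torchvision",
    "tensorboard",
    "torch-tb-profiler",
]


def collect_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing the result of `collect_versions()` in each environment (local, Kaggle, Colab) gives output that can be compared directly between a working and a non-working setup.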