pytorch-lightning PyTorchProfiler does not profile GPU

Bug description

With PyTorchProfiler, I don't get any GPU profiling data in the TensorBoard view, even though the logs indicate that the GPU is being used.

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Conda env

name: lightning_tutorials
channels:
  - conda-forge
dependencies:
  - python=3.12
  - lightning
  - torchvision=0.21.0=cuda126_py312_h361dbbe_0
  - tensorboard
  - torch-tb-profiler

train.py

import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torchvision as tv
from lightning.pytorch.loggers import TensorBoardLogger
from lightning.pytorch.profilers import PyTorchProfiler

# --------------------------------
# Step 1: Define a LightningModule
# --------------------------------
# A LightningModule (nn.Module subclass) defines a full *system*
# (ie: an LLM, diffusion model, autoencoder, or simple image classifier).


class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop. It is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# -------------------
# Step 2: Define data
# -------------------
dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor())
train, val = data.random_split(dataset, [55000, 5000])

# -------------------
# Step 3: Train
# -------------------
autoencoder = LitAutoEncoder()
logger = TensorBoardLogger(save_dir="tb_logs")
profiler = PyTorchProfiler()
trainer = L.Trainer(logger=logger, profiler=profiler, max_epochs=2)
trainer.fit(autoencoder, data.DataLoader(train), data.DataLoader(val))

Error messages and logs

(lightning_tutorials) PS C:\Users\anguzo\Projects\work\Machine-Learning-Collection> & C:/Users/anguzo/.local/share/mamba/envs/lightning_tutorials/python.exe "c:/Users/anguzo/Projects/work/Machine-Learning-Collection/ML/Pytorch/pytorch_lightning/9.1 Prof/train.py"
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
C:\Users\anguzo\.local\share\mamba\envs\lightning_tutorials\Lib\site-packages\lightning\pytorch\trainer\configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type       | Params | Mode
-----------------------------------------------
0 | encoder | Sequential | 100 K  | train
1 | decoder | Sequential | 101 K  | train
-----------------------------------------------
202 K     Trainable params
0         Non-trainable params
202 K     Total params
0.810     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
C:\Users\anguzo\.local\share\mamba\envs\lightning_tutorials\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
Epoch 0:   0%|                                                                                                                                                 | 4/55000 [00:00<39:03, 23.47it/s, v_num=0][W226 15:37:54.000000000 collection.cpp:647] Warning: Optimizer.step#Adam.step (function operator ())
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55000/55000 [03:34<00:00, 256.89it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55000/55000 [03:34<00:00, 256.88it/s, v_num=0] 
FIT Profiler Report
Profile stats for: records
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------        
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------        
                                          ProfilerStep*        18.41%      10.089ms        60.02%      32.896ms      10.965ms       8.532ms        15.48%      32.903ms      10.968ms             3        
                        [pl][profile]run_training_batch         0.22%     118.400us        27.99%      15.341ms       7.670ms     107.000us         0.19%      15.349ms       7.675ms             2        
[pl][profile][LightningModule]LitAutoEncoder.optimiz...         0.13%      70.600us        27.77%      15.222ms       7.611ms      48.000us         0.09%      15.242ms       7.621ms             2        
                               Optimizer.step#Adam.step        20.39%      11.178ms        27.64%      15.152ms       7.576ms      11.207ms        20.33%      15.194ms       7.597ms             2        
[pl][profile][Strategy]SingleDeviceStrategy.backward...        19.32%      10.590ms        20.25%      11.099ms       3.700ms      10.435ms        18.93%      11.136ms       3.712ms             3        
[pl][profile][Strategy]SingleDeviceStrategy.training...         3.26%       1.788ms        10.56%       5.790ms       1.930ms       1.524ms         2.77%       5.809ms       1.936ms             3        
    autograd::engine::evaluate_function: AddmmBackward0         1.48%     812.100us         5.97%       3.271ms     272.583us     473.000us         0.86%       3.406ms     283.833us            12        
                                         AddmmBackward0         1.47%     806.000us         4.16%       2.280ms     189.967us     578.000us         1.05%       2.577ms     214.750us            12        
                                                aten::t         1.95%       1.066ms         3.62%       1.986ms      34.840us       1.088ms         1.97%       2.560ms      44.912us            57        
[pl][profile][_TrainingEpochLoop].train_dataloader_n...         0.21%     113.100us         3.95%       2.167ms     722.333us      94.000us         0.17%       2.214ms     738.000us             3        
enumerate(DataLoader)#_SingleProcessDataLoaderIter._...         2.32%       1.273ms         3.75%       2.054ms     684.633us     924.000us         1.68%       2.120ms     706.667us             3        
[pl][module]torch.nn.modules.container.Sequential: e...         0.58%     315.500us         3.02%       1.653ms     551.067us     300.000us         0.54%       1.691ms     563.667us             3        
                                        aten::transpose         1.61%     880.300us         1.68%     919.700us      16.135us       1.056ms         1.92%       1.472ms      25.825us            57        
autograd::engine::evaluate_function: torch::autograd...         0.49%     267.800us         2.12%       1.161ms      48.367us     425.000us         0.77%       1.408ms      58.667us            24        
                                           aten::linear         0.42%     232.500us         2.34%       1.285ms     107.042us     233.000us         0.42%       1.398ms     116.500us            12        
[pl][profile][Callback]TQDMProgressBar.on_train_batc...         2.27%       1.246ms         2.34%       1.284ms     427.967us       1.278ms         2.32%       1.333ms     444.333us             3        
                                             aten::item         1.50%     821.700us         1.53%     836.700us      16.406us     818.000us         1.48%       1.254ms      24.588us            51        
[pl][module]torch.nn.modules.container.Sequential: d...         0.48%     265.400us         2.17%       1.191ms     397.033us     201.000us         0.36%       1.201ms     400.333us             3        
                                           aten::detach         1.34%     737.000us         1.46%     802.700us      17.838us     737.000us         1.34%       1.142ms      25.378us            45        
                                      aten::result_type         0.03%      14.100us         0.03%      14.100us       0.117us     997.000us         1.81%     997.000us       8.308us           120        
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------        
Self CPU time total: 54.812ms
Self CUDA time total: 55.115ms
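
In the report above, the CUDA columns closely track the CPU columns (Self CPU time total 54.812ms vs. Self CUDA time total 55.115ms), and no GPU kernel rows are listed. One thing that may be worth ruling out, although nothing in this report confirms it as the cause, is whether torch.profiler can record CUDA activity in this environment at all. A minimal sketch of such a check, independent of Lightning; the helper name cuda_profiling_status is hypothetical:

```python
import importlib.util


def cuda_profiling_status():
    """Report whether torch.profiler can record CUDA activity here.

    Hypothetical helper for illustration; it only inspects the
    environment and does not run any profiling.
    """
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this interpreter
        return {"torch_installed": False}

    import torch
    from torch.profiler import ProfilerActivity, supported_activities

    return {
        "torch_installed": True,
        # True when a CUDA device is visible to this torch build
        "cuda_available": torch.cuda.is_available(),
        # True when this torch build includes the Kineto profiling backend
        "kineto_available": torch.profiler.kineto_available(),
        # True when the profiler reports CUDA as a recordable activity
        "cuda_activity_supported": ProfilerActivity.CUDA in supported_activities(),
    }


print(cuda_profiling_status())
```

If cuda_activity_supported comes back False while cuda_available is True, the missing GPU data would more likely come from the torch build than from Lightning or TensorBoard. That reading is an assumption, not something verified in this issue.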

Environment

Current environment

CUDA: - GPU: - NVIDIA GeForce GTX 1080 - available: True - version: 12.6
Lightning: - lightning: 2.5.0.post0 - lightning-utilities: 0.12.0 - pytorch-lightning: 2.5.0.post0 - torch: 2.6.0 - torch-tb-profiler: 0.4.3 - torchmetrics: 1.6.1 - torchvision: 0.21.0
Packages: - absl-py: 2.1.0 - autocommand: 2.2.2 - backports.tarfile: 1.2.0 - brotli: 1.1.0 - certifi: 2025.1.31 - charset-normalizer: 3.4.1 - colorama: 0.4.6 - filelock: 3.17.0 - fsspec: 2025.2.0 - grpcio: 1.67.1 - idna: 3.10 - importlib-metadata: 8.6.1 - inflect: 7.3.1 - jaraco.collections: 5.1.0 - jaraco.context: 5.3.0 - jaraco.functools: 4.0.1 - jaraco.text: 3.12.1 - jinja2: 3.1.5 - lightning: 2.5.0.post0 - lightning-utilities: 0.12.0 - markdown: 3.6 - markupsafe: 3.0.2 - more-itertools: 10.3.0 - mpmath: 1.3.0 - networkx: 3.4.2 - numpy: 2.2.3 - optree: 0.14.0 - packaging: 24.2 - pandas: 2.2.3 - pillow: 11.1.0 - pip: 25.0.1 - platformdirs: 4.2.2 - protobuf: 5.28.3 - pybind11: 2.13.6 - pybind11-global: 2.13.6 - pysocks: 1.7.1 - python-dateutil: 2.9.0.post0 - pytorch-lightning: 2.5.0.post0 - pytz: 2024.1 - pyyaml: 6.0.2 - requests: 2.32.3 - setuptools: 75.8.0 - six: 1.17.0 - sympy: 1.13.3 - tensorboard: 2.19.0 - tensorboard-data-server: 0.7.0 - tomli: 2.0.1 - torch: 2.6.0 - torch-tb-profiler: 0.4.3 - torchmetrics: 1.6.1 - torchvision: 0.21.0 - tqdm: 4.67.1 - typeguard: 4.3.0 - typing-extensions: 4.12.2 - tzdata: 2025.1 - urllib3: 2.2.2 - werkzeug: 3.1.3 - wheel: 0.45.1 - win-inet-pton: 1.1.0 - zipp: 3.21.0
System: - OS: Windows - architecture: - 64bit - WindowsPE - processor: AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD - python: 3.12.9 - release: 11 - version: 10.0.22631

More info

No response

Feb 26 '25 14:02 anguzo

I'm seeing the same behavior.

Python: 3.12.8, lightning: 2.5.0.post0, torch: 2.5.1+cu121, tensorboard: 2.19.0, torch_tb_profiler: 0.4.3

Mar 24 '25 16:03 oseymour

@Borda - I can take this up.

Jun 06 '25 16:06 getgaurav2

I was able to get GPU profiling in TensorBoard using the library versions below, with the same code. I will try to confirm whether this is a version-specific issue.

lightning    --  2.5.2
lightning-utilities   --  0.14.3
line_profiler  --  4.2.0
pytorch-ignite  --  0.5.2
pytorch-lightning -- 2.5.1.post0
tensorboard  --  2.18.0
tensorboard-data-server --  0.7.2
torch  --   2.6.0+cu124

FIT Profiler Report
Profile stats for: records
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us      38.337ms      3953.29%      38.337ms      12.779ms             3  
[pl][profile][Strategy]SingleDeviceStrategy.training...         0.00%       0.000us         0.00%       0.000us       0.000us       2.073ms       213.77%       2.073ms     691.026us             3  
                                          ProfilerStep*        29.08%      15.516ms        70.20%      37.460ms      12.487ms       0.000us         0.00%     731.164us     243.721us             3  
                                aten::_foreach_addcdiv_         0.11%      59.589us         0.17%      88.718us      29.573us     183.167us        18.89%     183.167us      61.056us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     183.167us        18.89%     183.167us      61.056us             3  
                                    aten::_foreach_div_         0.09%      46.927us         0.14%      73.271us      24.424us     169.087us        17.44%     169.087us      56.362us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     169.087us        17.44%     169.087us      56.362us             3  
    autograd::engine::evaluate_function: AddmmBackward0         1.45%     775.080us         5.39%       2.874ms     239.519us       0.000us         0.00%     146.304us      12.192us            12  
                                    aten::_foreach_sqrt         0.13%      69.885us         0.45%     238.946us      79.649us     113.280us        11.68%     113.280us      37.760us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     113.280us        11.68%     113.280us      37.760us             3  
[pl][profile][LightningModule]LitAutoEncoder.transfe...         0.00%       0.000us         0.00%       0.000us       0.000us     104.734us        10.80%     104.734us      34.911us             3  
[pl][profile][Strategy]SingleDeviceStrategy.training...         3.97%       2.116ms        11.82%       6.308ms       2.103ms       0.000us         0.00%      93.120us      31.040us             3  
                                         AddmmBackward0         0.21%     114.695us         3.33%       1.774ms     147.863us       0.000us         0.00%      80.736us       6.728us            12  
                                               aten::mm         0.87%     463.009us         1.28%     685.548us      32.645us      80.736us         8.33%      80.736us       3.845us            21  
                                aten::_foreach_addcmul_         0.12%      66.595us         0.18%      93.564us      31.188us      69.055us         7.12%      69.055us      23.018us             3  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      69.055us         7.12%      69.055us      23.018us             3  
                                              aten::sum         0.38%     203.757us         0.55%     295.501us      24.625us      65.568us         6.76%      65.568us       5.464us            12  
                        [pl][profile]run_training_batch         0.36%     194.507us        23.81%      12.704ms       4.235ms       0.000us         0.00%      61.408us      20.469us             3  
[pl][profile][LightningModule]LitAutoEncoder.optimiz...         0.15%      81.050us        23.37%      12.473ms       4.158ms       0.000us         0.00%      61.408us      20.469us             3  
                               Optimizer.step#Adam.step        11.16%       5.953ms        23.19%      12.372ms       4.124ms       0.000us         0.00%      61.408us      20.469us             3  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 53.360ms
Self CUDA time total: 969.756us

Jun 21 '25 05:06 getgaurav2

Using the versions of the libraries below. I used the same code

I see that you edited your reply. Was something else needed, or was it just a matter of refreshing TensorBoard?

Jun 26 '25 11:06 Borda

I spent some time in a Kaggle notebook, using their GPU option, to observe the files that are created as the code runs. I copied the JSON files to my local machine to run TensorBoard, since Kaggle does not let me run TensorBoard.

I did not have to change any code or library versions.

https://www.kaggle.com/code/gc2713/pytorch-lightning-tensorboard/
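
For anyone repeating the copy-to-local step, a small sketch that lists the trace files under a log directory, so it is clear which files need to be copied. It assumes the traces use the `.pt.trace.json` suffix that torch-tb-profiler looks for and that the logs live under `tb_logs` as in the original script. The helper name find_trace_files is hypothetical.

```python
from pathlib import Path


def find_trace_files(log_dir):
    """Return profiler trace files under log_dir, newest first.

    Assumes traces are named '*.pt.trace.json'; adjust the pattern
    if the files in your run are named differently.
    """
    root = Path(log_dir)
    if not root.is_dir():
        # Nothing to search if the directory does not exist
        return []
    traces = root.rglob("*.pt.trace.json")
    return sorted(traces, key=lambda p: p.stat().st_mtime, reverse=True)


# "tb_logs" matches save_dir in the original train.py
for path in find_trace_files("tb_logs"):
    print(path, f"{path.stat().st_size / 1024:.1f} KiB")
```

After copying the files, pointing TensorBoard at the containing directory (for example `tensorboard --logdir tb_logs`) with torch-tb-profiler installed should make the profiler tab available.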

Jun 26 '25 15:06 getgaurav2

@getgaurav2 did you try to isolate which package was causing the problem? For example, is this a problem with tensorboard v2.19.0 and lightning v2.5.0? It seems odd that it fails with 2.5 and 2.19 but works with 2.18.

Sep 28 '25 14:09 oseymour

Actually, it's working with 2.19 as well. I tried again; please see the Colab notebook below.

https://colab.research.google.com/drive/10rYUUmUexhfIIJDqcZ-XOQh_LM4q_NAr?usp=sharing

lightning==2.5.0.post0 torch==2.5.1 tensorboard==2.19.0 torch_tb_profiler==0.4.3
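
To make version comparisons between environments easier, a short sketch that prints the installed versions of the packages discussed in this thread in the same `name==version` form. The package list and the helper name installed_versions are illustrative choices, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names discussed in this thread
PACKAGES = [
    "lightning",
    "pytorch-lightning",
    "torch",
    "tensorboard",
    "torch-tb-profiler",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment
            result[name] = None
    return result


for name, ver in installed_versions().items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Note that this reports only the version string. It does not show the build variant selected by conda (such as the `cuda126` build string in the original environment file), which may also matter here.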

Oct 02 '25 03:10 getgaurav2