torch.cuda.memory_stats returns all 0s
🐛 Bug
Calling torch.cuda.memory_stats on a gfx900 GPU (Radeon Vega Frontier Edition), or any of the memory-management methods listed at https://pytorch.org/docs/stable/cuda.html#memory-management, returns all 0s.
To Reproduce
Steps to reproduce the behavior:
- Run PyTorch on ROCm 3.3 (e.g., the rocm/pytorch Docker image)
- In Python, start training a model on the GPU
- In a separate Python process, run:
import torch
torch.cuda.memory_stats(0)
This returns all 0s (see the self-contained sketch after the output below). Output:
OrderedDict([('active.all.allocated', 0),
('active.all.current', 0),
('active.all.freed', 0),
('active.all.peak', 0),
('active.large_pool.allocated', 0),
('active.large_pool.current', 0),
('active.large_pool.freed', 0),
('active.large_pool.peak', 0),
('active.small_pool.allocated', 0),
('active.small_pool.current', 0),
('active.small_pool.freed', 0),
('active.small_pool.peak', 0),
('active_bytes.all.allocated', 0),
('active_bytes.all.current', 0),
('active_bytes.all.freed', 0),
('active_bytes.all.peak', 0),
('active_bytes.large_pool.allocated', 0),
('active_bytes.large_pool.current', 0),
('active_bytes.large_pool.freed', 0),
('active_bytes.large_pool.peak', 0),
('active_bytes.small_pool.allocated', 0),
('active_bytes.small_pool.current', 0),
('active_bytes.small_pool.freed', 0),
('active_bytes.small_pool.peak', 0),
('allocated_bytes.all.allocated', 0),
('allocated_bytes.all.current', 0),
('allocated_bytes.all.freed', 0),
('allocated_bytes.all.peak', 0),
('allocated_bytes.large_pool.allocated', 0),
('allocated_bytes.large_pool.current', 0),
('allocated_bytes.large_pool.freed', 0),
('allocated_bytes.large_pool.peak', 0),
('allocated_bytes.small_pool.allocated', 0),
('allocated_bytes.small_pool.current', 0),
('allocated_bytes.small_pool.freed', 0),
('allocated_bytes.small_pool.peak', 0),
('allocation.all.allocated', 0),
('allocation.all.current', 0),
('allocation.all.freed', 0),
('allocation.all.peak', 0),
('allocation.large_pool.allocated', 0),
('allocation.large_pool.current', 0),
('allocation.large_pool.freed', 0),
('allocation.large_pool.peak', 0),
('allocation.small_pool.allocated', 0),
('allocation.small_pool.current', 0),
('allocation.small_pool.freed', 0),
('allocation.small_pool.peak', 0),
('inactive_split.all.allocated', 0),
('inactive_split.all.current', 0),
('inactive_split.all.freed', 0),
('inactive_split.all.peak', 0),
('inactive_split.large_pool.allocated', 0),
('inactive_split.large_pool.current', 0),
('inactive_split.large_pool.freed', 0),
('inactive_split.large_pool.peak', 0),
('inactive_split.small_pool.allocated', 0),
('inactive_split.small_pool.current', 0),
('inactive_split.small_pool.freed', 0),
('inactive_split.small_pool.peak', 0),
('inactive_split_bytes.all.allocated', 0),
('inactive_split_bytes.all.current', 0),
('inactive_split_bytes.all.freed', 0),
('inactive_split_bytes.all.peak', 0),
('inactive_split_bytes.large_pool.allocated', 0),
('inactive_split_bytes.large_pool.current', 0),
('inactive_split_bytes.large_pool.freed', 0),
('inactive_split_bytes.large_pool.peak', 0),
('inactive_split_bytes.small_pool.allocated', 0),
('inactive_split_bytes.small_pool.current', 0),
('inactive_split_bytes.small_pool.freed', 0),
('inactive_split_bytes.small_pool.peak', 0),
('num_alloc_retries', 0),
('num_ooms', 0),
('reserved_bytes.all.allocated', 0),
('reserved_bytes.all.current', 0),
('reserved_bytes.all.freed', 0),
('reserved_bytes.all.peak', 0),
('reserved_bytes.large_pool.allocated', 0),
('reserved_bytes.large_pool.current', 0),
('reserved_bytes.large_pool.freed', 0),
('reserved_bytes.large_pool.peak', 0),
('reserved_bytes.small_pool.allocated', 0),
('reserved_bytes.small_pool.current', 0),
('reserved_bytes.small_pool.freed', 0),
('reserved_bytes.small_pool.peak', 0),
('segment.all.allocated', 0),
('segment.all.current', 0),
('segment.all.freed', 0),
('segment.all.peak', 0),
('segment.large_pool.allocated', 0),
('segment.large_pool.current', 0),
('segment.large_pool.freed', 0),
('segment.large_pool.peak', 0),
('segment.small_pool.allocated', 0),
('segment.small_pool.current', 0),
('segment.small_pool.freed', 0),
('segment.small_pool.peak', 0)])
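For completeness, a self-contained sketch of the two-process reproduction (assumptions on my part: device index 0, an arbitrary 64 MiB allocation, and an explicit torch.cuda.init() in the child so the stats dictionary is populated):

```python
import subprocess
import sys

import torch

# Parent process: allocate ~64 MiB on GPU 0 through PyTorch's caching allocator.
x = torch.empty(16 * 1024 * 1024, dtype=torch.float32, device="cuda:0")  # kept alive
print("parent:", torch.cuda.memory_stats(0)["allocated_bytes.all.current"])

# Child process: a fresh interpreter querying the same device reports zeros here.
child = (
    "import torch; torch.cuda.init(); "
    "print('child:', torch.cuda.memory_stats(0).get('allocated_bytes.all.current', 0))"
)
subprocess.run([sys.executable, "-c", child], check=True)
```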
Expected behavior
Even though I'm using an AMD GPU, I expect these memory stats to have an AMD/HIP analogue that torch.cuda.memory_stats can report.
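For reference, a device-wide query (as opposed to per-process allocator statistics) would look like the sketch below. This assumes a PyTorch build that exposes torch.cuda.mem_get_info, which wraps the driver's cudaMemGetInfo (hipMemGetInfo on ROCm) and so reflects allocations from all processes:

```python
import torch

# Device-wide view: free and total bytes on device 0, as reported by the
# driver. Unlike torch.cuda.memory_stats, this also reflects allocations
# made by *other* processes.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"used: {(total_bytes - free_bytes) / 2**20:.1f} MiB "
      f"of {total_bytes / 2**20:.1f} MiB")
```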
I tried this recently on ROCm 3.3 by instrumenting the Python code to print torch.cuda.memory_stats from the same process, and it worked. Note that memory_stats reports the statistics of PyTorch's caching allocator for the calling process only, so a separate process that hasn't allocated anything will always see zeros. Is there a reason you're trying to print it from a different process?
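For illustration, a minimal same-process check (a sketch; the device index and tensor size are arbitrary):

```python
import torch

# Same-process check: the statistics become nonzero once this process
# allocates through the caching allocator.
x = torch.ones(1024, 1024, device="cuda:0")  # ~4 MiB allocation
print(torch.cuda.memory_stats(0)["allocated_bytes.all.current"])  # > 0 here
print(torch.cuda.memory_allocated(0))  # same figure via the convenience API
```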
Hi @ericjang, can you reply to @jithunnair-amd's query? Please let us know if this issue is still reproducible on your end. Thanks!