torch.cuda.memory_stats returns all 0s
🐛 Bug
Calling torch.cuda.memory_stats on a gfx900 GPU (Radeon Vega Frontier Edition), or any of the memory-management methods listed at https://pytorch.org/docs/stable/cuda.html#memory-management, returns all 0s.
To Reproduce
Steps to reproduce the behavior:
- Run PyTorch on ROCm 3.3 (e.g., the rocm/pytorch Docker image)
- In Python, start training a model on the GPU
- In a separate Python process, run:
import torch
torch.cuda.memory_stats(0)
This returns all 0s (see the self-contained sketch after the output below). Output:
OrderedDict([('active.all.allocated', 0),
('active.all.current', 0),
('active.all.freed', 0),
('active.all.peak', 0),
('active.large_pool.allocated', 0),
('active.large_pool.current', 0),
('active.large_pool.freed', 0),
('active.large_pool.peak', 0),
('active.small_pool.allocated', 0),
('active.small_pool.current', 0),
('active.small_pool.freed', 0),
('active.small_pool.peak', 0),
('active_bytes.all.allocated', 0),
('active_bytes.all.current', 0),
('active_bytes.all.freed', 0),
('active_bytes.all.peak', 0),
('active_bytes.large_pool.allocated', 0),
('active_bytes.large_pool.current', 0),
('active_bytes.large_pool.freed', 0),
('active_bytes.large_pool.peak', 0),
('active_bytes.small_pool.allocated', 0),
('active_bytes.small_pool.current', 0),
('active_bytes.small_pool.freed', 0),
('active_bytes.small_pool.peak', 0),
('allocated_bytes.all.allocated', 0),
('allocated_bytes.all.current', 0),
('allocated_bytes.all.freed', 0),
('allocated_bytes.all.peak', 0),
('allocated_bytes.large_pool.allocated', 0),
('allocated_bytes.large_pool.current', 0),
('allocated_bytes.large_pool.freed', 0),
('allocated_bytes.large_pool.peak', 0),
('allocated_bytes.small_pool.allocated', 0),
('allocated_bytes.small_pool.current', 0),
('allocated_bytes.small_pool.freed', 0),
('allocated_bytes.small_pool.peak', 0),
('allocation.all.allocated', 0),
('allocation.all.current', 0),
('allocation.all.freed', 0),
('allocation.all.peak', 0),
('allocation.large_pool.allocated', 0),
('allocation.large_pool.current', 0),
('allocation.large_pool.freed', 0),
('allocation.large_pool.peak', 0),
('allocation.small_pool.allocated', 0),
('allocation.small_pool.current', 0),
('allocation.small_pool.freed', 0),
('allocation.small_pool.peak', 0),
('inactive_split.all.allocated', 0),
('inactive_split.all.current', 0),
('inactive_split.all.freed', 0),
('inactive_split.all.peak', 0),
('inactive_split.large_pool.allocated', 0),
('inactive_split.large_pool.current', 0),
('inactive_split.large_pool.freed', 0),
('inactive_split.large_pool.peak', 0),
('inactive_split.small_pool.allocated', 0),
('inactive_split.small_pool.current', 0),
('inactive_split.small_pool.freed', 0),
('inactive_split.small_pool.peak', 0),
('inactive_split_bytes.all.allocated', 0),
('inactive_split_bytes.all.current', 0),
('inactive_split_bytes.all.freed', 0),
('inactive_split_bytes.all.peak', 0),
('inactive_split_bytes.large_pool.allocated', 0),
('inactive_split_bytes.large_pool.current', 0),
('inactive_split_bytes.large_pool.freed', 0),
('inactive_split_bytes.large_pool.peak', 0),
('inactive_split_bytes.small_pool.allocated', 0),
('inactive_split_bytes.small_pool.current', 0),
('inactive_split_bytes.small_pool.freed', 0),
('inactive_split_bytes.small_pool.peak', 0),
('num_alloc_retries', 0),
('num_ooms', 0),
('reserved_bytes.all.allocated', 0),
('reserved_bytes.all.current', 0),
('reserved_bytes.all.freed', 0),
('reserved_bytes.all.peak', 0),
('reserved_bytes.large_pool.allocated', 0),
('reserved_bytes.large_pool.current', 0),
('reserved_bytes.large_pool.freed', 0),
('reserved_bytes.large_pool.peak', 0),
('reserved_bytes.small_pool.allocated', 0),
('reserved_bytes.small_pool.current', 0),
('reserved_bytes.small_pool.freed', 0),
('reserved_bytes.small_pool.peak', 0),
('segment.all.allocated', 0),
('segment.all.current', 0),
('segment.all.freed', 0),
('segment.all.peak', 0),
('segment.large_pool.allocated', 0),
('segment.large_pool.current', 0),
('segment.large_pool.freed', 0),
('segment.large_pool.peak', 0),
('segment.small_pool.allocated', 0),
('segment.small_pool.current', 0),
('segment.small_pool.freed', 0),
('segment.small_pool.peak', 0)])
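For completeness, a self-contained sketch of the two-process reproduction (assumptions on my part: device index 0, an arbitrary 64 MiB allocation, and an explicit torch.cuda.init() in the child so the stats dictionary is populated):

```python
import subprocess
import sys

import torch

# Parent process: allocate ~64 MiB on GPU 0 through PyTorch's caching allocator.
x = torch.empty(16 * 1024 * 1024, dtype=torch.float32, device="cuda:0")  # kept alive
print("parent:", torch.cuda.memory_stats(0)["allocated_bytes.all.current"])

# Child process: a fresh interpreter querying the same device reports zeros here.
child = (
    "import torch; torch.cuda.init(); "
    "print('child:', torch.cuda.memory_stats(0).get('allocated_bytes.all.current', 0))"
)
subprocess.run([sys.executable, "-c", child], check=True)
```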
Expected behavior
Even though I'm using an AMD GPU, I expect these memory stats to have an AMD/HIP analogue that torch.cuda.memory_stats can report.
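For reference, a device-wide query (as opposed to per-process allocator statistics) would look like the sketch below. This assumes a PyTorch build that exposes torch.cuda.mem_get_info, which wraps the driver's cudaMemGetInfo (hipMemGetInfo on ROCm) and so reflects allocations from all processes:

```python
import torch

# Device-wide view: free and total bytes on device 0, as reported by the
# driver. Unlike torch.cuda.memory_stats, this also reflects allocations
# made by *other* processes.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"used: {(total_bytes - free_bytes) / 2**20:.1f} MiB "
      f"of {total_bytes / 2**20:.1f} MiB")
```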
I tried this recently on ROCm 3.3 by instrumenting the Python code to print torch.cuda.memory_stats from the same process, and it worked. Note that memory_stats reports the statistics of PyTorch's caching allocator for the calling process only, so a separate process that hasn't allocated anything will always see zeros. Is there a reason you're trying to print it from a different process?
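For illustration, a minimal same-process check (a sketch; the device index and tensor size are arbitrary):

```python
import torch

# Same-process check: the statistics become nonzero once this process
# allocates through the caching allocator.
x = torch.ones(1024, 1024, device="cuda:0")  # ~4 MiB allocation
print(torch.cuda.memory_stats(0)["allocated_bytes.all.current"])  # > 0 here
print(torch.cuda.memory_allocated(0))  # same figure via the convenience API
```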
Hi @ericjang, can you reply to @jithunnair-amd's query? Please let us know if this issue is still reproducible on your end. Thanks!