Inconsistent GPU Memory Usage Reporting Between `dpctl` and `xpu-smi`
Description
When using `dpctl` to report GPU memory usage on Intel GPUs, the reported free and total memory values appear to be incorrect when compared with the output from `xpu-smi`. Specifically, `dpctl` reports 0 bytes of used memory, while `xpu-smi` correctly reports the used memory as 17 MiB.
Steps to Reproduce
- Set up an environment with `dpctl` and `xpu-smi` installed.
- Use the following Python script to get GPU memory information using `dpctl`:
```python
import os

import dpctl
from dpctl.utils import intel_device_info


def get_intel_gpu_memory_info():
    try:
        # Set the environment variable ZES_ENABLE_SYSMAN to 1
        os.environ["ZES_ENABLE_SYSMAN"] = "1"
        # Get the list of GPU devices
        devices = dpctl.get_devices(device_type=dpctl.device_type.gpu)
        for device in devices:
            # Get Intel GPU device info
            device_info = intel_device_info(device)
            if device_info:
                free_memory = device_info.get('free_memory', None)
                if free_memory is not None:
                    free_memory_mib = free_memory / (1024 * 1024)
                    print(f"Free Memory: {free_memory_mib:.2f} MiB")
                # Get the total global memory size
                try:
                    global_mem_size = device.get_info(dpctl.device_info.global_mem_size)
                except AttributeError:
                    global_mem_size = device.global_mem_size
                global_mem_size_mib = global_mem_size / (1024 * 1024)
                print(f"Total Memory: {global_mem_size_mib:.2f} MiB")
                # Calculate and display used memory
                if free_memory is not None and global_mem_size is not None:
                    used_memory = global_mem_size - free_memory
                    used_memory_mib = used_memory / (1024 * 1024)
                    print(f"Used Memory: {used_memory_mib:.2f} MiB")
                else:
                    print("Unable to calculate used memory due to missing information.")
                return
        print("No Intel GPU devices found or no information available.")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    get_intel_gpu_memory_info()
```
- Compare the output with the results of running `xpu-smi stats -d 0`:

```
xpu-smi stats -d 0
```
Observed Behavior
- Output from the Python script using `dpctl`:

```
Free Memory: 15473.60 MiB
Total Memory: 15473.60 MiB
Used Memory: 0.00 MiB
```
Also, `python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"` shows the following output, which matches the total memory:

```
2.1.0.post2+cxx11.abi
2.1.30+xpu
[0]: _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=0, total_memory=15473MB, max_compute_units=512, gpu_eu_count=512)
```
- Output from `xpu-smi`:

```
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Memory Used (MiB) | 17 |
| GPU Memory Util (%) | 0 |
+-----------------------------+--------------------------------------------------------------------+
```
Expected Behavior
The used memory derived from `dpctl` (i.e., `global_mem_size - free_memory`) should match the used memory reported by `xpu-smi`.
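For concreteness, a consistency check along the following lines would be expected to print roughly matching values. This is only a sketch: it assumes `ZES_ENABLE_SYSMAN=1` is exported before Python starts, uses `dpctl.select_gpu_device()` to pick the GPU, and parses the "GPU Memory Used (MiB)" row of the `xpu-smi stats -d 0` table shown above (the parsing may need adjusting for other `xpu-smi` versions).

```python
import re
import subprocess

import dpctl
from dpctl.utils import intel_device_info

# dpctl side: used = global_mem_size - free_memory (requires ZES_ENABLE_SYSMAN=1)
dev = dpctl.select_gpu_device()
info = intel_device_info(dev)
used_dpctl_mib = (dev.global_mem_size - info["free_memory"]) / (1024 * 1024)

# xpu-smi side: parse the "GPU Memory Used (MiB)" row for device 0
# (may require elevated privileges, as in the examples in this issue)
out = subprocess.run(
    ["xpu-smi", "stats", "-d", "0"], capture_output=True, text=True, check=True
).stdout
m = re.search(r"GPU Memory Used \(MiB\)\s*\|\s*([0-9]+)", out)
used_xpu_smi_mib = int(m.group(1)) if m else None

print(f"dpctl-derived used memory:    {used_dpctl_mib:.2f} MiB")
print(f"xpu-smi reported used memory: {used_xpu_smi_mib} MiB")
```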
Environment
- dpctl version: 0.17.0
- xpu-smi version: 1.2.38.20240718
- OS: HiveOS [Based on Ubuntu 20.04]
- Docker version: 24.0.7, build 24.0.7-0ubuntu2~20.04.1
- Docker image: intel/intel-extension-for-pytorch:2.1.30-xpu
- Python version: 3.10.12
- GPU: Intel(R) Arc(TM) A770 Graphics
Additional Information
Setting the environment variable `ZES_ENABLE_SYSMAN` to `1` was necessary, as mentioned in the documentation, for `free_memory` to be reported. The discrepancy in reported values suggests a potential issue within the `dpctl` library or in its interaction with the GPU drivers.
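Note that sysman generally has to be enabled before the Level Zero runtime initializes. Exporting the variable in the shell (as done later in this thread) or setting it before importing `dpctl` both avoid any ordering concerns; a minimal sketch of the latter, using `dpctl.select_gpu_device()` for brevity:

```python
# Sketch: set ZES_ENABLE_SYSMAN before importing dpctl, so the variable is in
# place before the Level Zero runtime can be initialized by device enumeration.
import os

os.environ["ZES_ENABLE_SYSMAN"] = "1"

import dpctl  # noqa: E402  (imported intentionally after the environment change)
from dpctl.utils import intel_device_info  # noqa: E402

# Prints the free_memory value in bytes, or None if it is not reported.
print(intel_device_info(dpctl.select_gpu_device()).get("free_memory"))
```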
Further information on OS:
```
# uname -r
6.1.0-hiveos
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
```
The container was launched through the following command:
```
docker run -ti --cap-add=PERFMON --device /dev/dri intel/intel-extension-for-pytorch:2.1.30-xpu bash
```
The `intel-basekit` package (which provides the necessary SYCL runtime and development tools for `dpctl`) and `xpu-smi` were installed with the following commands before testing the issue inside the container:
```
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
wget https://github.com/intel/xpumanager/releases/download/V1.2.38/xpu-smi_1.2.38_20240718.060204.0db09695+deb10u1_amd64.deb && \
apt update && apt install -y ./xpu-smi_1.2.38_20240718.060204.0db09695+deb10u1_amd64.deb intel-basekit && \
source /opt/intel/oneapi/setvars.sh
```
Proposed Solution
Investigate and resolve the inconsistency in GPU memory reporting between `dpctl` and `xpu-smi`. Ensure that `dpctl` accurately reflects the actual GPU memory usage.
Thank you for looking into this issue. Please let me know if further information or testing is required.
@avimanyu786 Please provide information about your GPU driver, e.g., output of `python -m dpctl -f`.
Thank you for the assistance, @oleksandr-pavlyk! I will post the output as soon as possible.
Actually, you have already provided the information, @avimanyu786: `driver_version='1.3.27642'`
While attempting to reproduce the reported behavior, I used `xpu-smi` on a machine where the GPU is not utilized (I used `xpu-smi ps` to verify that only `xpu-smi` was using the GPU), and `xpu-smi` reported a non-zero GPU memory footprint:
```
$ xpu-smi ps
PID        Command     DeviceID    SHR    MEM
883105     xpu-smi     0           0      2228
```
```
$ sudo xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | 0 |
| EU Array Active (%) | N/A |
| EU Array Stall (%) | N/A |
| EU Array Idle (%) | N/A |
| | |
| Compute Engine Util (%) | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| Render Engine Util (%) | N/A |
| Media Engine Util (%) | N/A |
| Decoder Engine Util (%) | N/A |
| Encoder Engine Util (%) | N/A |
| Copy Engine Util (%) | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| | Engine 4: 0, Engine 5: 0 |
| Media EM Engine Util (%) | N/A |
| 3D Engine Util (%) | N/A |
+-----------------------------+--------------------------------------------------------------------+
| Reset | N/A |
| Programming Errors | N/A |
| Driver Errors | N/A |
| Cache Errors Correctable | N/A |
| Cache Errors Uncorrectable | N/A |
| Mem Errors Correctable | N/A |
| Mem Errors Uncorrectable | N/A |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 31 |
| GPU Frequency (MHz) | 1550 |
| Media Engine Freq (MHz) | N/A |
| GPU Core Temperature (C) | N/A |
| GPU Memory Temperature (C) | N/A |
| GPU Memory Read (kB/s) | N/A |
| GPU Memory Write (kB/s) | N/A |
| GPU Memory Bandwidth (%) | N/A |
| GPU Memory Used (MiB) | 28 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | N/A |
+-----------------------------+--------------------------------------------------------------------+
```
So I assume an explanation for the discrepancy is that xpu-smi itself uses some amount of GPU global memory.
I see. In that case, perhaps we could try a different approach: run `xpu-smi stats` under the `watch` command in one terminal and recheck with `dpctl` from a different terminal?
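For the `dpctl` side of that two-terminal experiment, a polling loop along the following lines could be left running while `watch -n 1 xpu-smi stats -d 0` runs in the other terminal. This is only a sketch: it assumes `ZES_ENABLE_SYSMAN=1` is exported before Python starts and uses `dpctl.select_gpu_device()` to pick the GPU.

```python
import time

import dpctl
from dpctl.utils import intel_device_info

# Poll free/used memory once per second; compare against the "GPU Memory Used"
# row shown by `watch -n 1 xpu-smi stats -d 0` in another terminal.
# Press Ctrl-C to stop.
dev = dpctl.select_gpu_device()
total = dev.global_mem_size
while True:
    free = intel_device_info(dev).get("free_memory")
    if free is None:
        print("free_memory not reported; is ZES_ENABLE_SYSMAN=1 set?")
        break
    print(f"free: {free / 2**20:.1f} MiB, implied used: {(total - free) / 2**20:.1f} MiB")
    time.sleep(1)
```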
Just to confirm: in one terminal I started IPython (with `ZES_ENABLE_SYSMAN=1` set) and executed:
```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.empty(2**26, dtype="i8")

In [3]: y = dpt.empty(2**26, dtype="i8")

In [4]: (y.nbytes + x.nbytes) / (1024 * 1024 * 1024)
Out[4]: 1.0

In [5]: import dpctl.utils as du

In [6]: du.intel_device_info(x.sycl_device)
Out[6]:
{'device_id': 3034,
 'gpu_eu_count': 448,
 'gpu_hw_threads_per_eu': 8,
 'gpu_eu_simd_width': 16,
 'gpu_slices': 1,
 'gpu_subslices_per_slice': 56,
 'gpu_eu_count_per_subslice': 8,
 'free_memory': 50417606656,
 'memory_clock_rate': 3200,
 'memory_bus_width': 64}

In [7]: x.sycl_device.global_mem_size - Out[6]['free_memory']
Out[7]: 1122000896

In [8]: (y.nbytes + x.nbytes)
Out[8]: 1073741824
```
In another terminal I executed `sudo xpu-smi stats -d 0`, which showed:

```
| GPU Memory Used (MiB) | 1055 |
| GPU Memory Util (%) | 2 |
```
and `xpu-smi ps` showed:

```
$ xpu-smi ps
PID        Command     DeviceID    SHR    MEM
885596     ipython     0           0      1081212
885843     xpu-smi     0           0      2228
```
These figures are kind of consistent, accounting for some GPU global memory used by the driver.
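Spelling out the arithmetic behind that, using the numbers above (a quick sanity check, not additional measurements):

```python
# Numbers taken from the IPython session and the xpu-smi output above.
allocated = 2 * 2**26 * 8           # two int64 arrays of 2**26 elements = 1073741824 bytes
implied_used = 1122000896           # global_mem_size - free_memory reported via dpctl
print((implied_used - allocated) / 2**20)     # ~46.0 MiB not accounted for by x and y

xpu_smi_used_mib = 1055             # "GPU Memory Used (MiB)" from xpu-smi stats
print(xpu_smi_used_mib - allocated / 2**20)   # ~31.0 MiB beyond the 1024 MiB allocations
```

The ~46 MiB and ~31 MiB remainders are presumably the driver overhead and `xpu-smi`'s own footprint mentioned above.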
@oleksandr-pavlyk Thanks so much for these confirmations! I will test from my end when I have access to my machine and report back if I face any issues.
Hi @oleksandr-pavlyk,
I have conducted the tests (this time directly on the host) as suggested and observed the following results.
Terminal 1
```
(dpctl_env) root@Rig3073250:/home/user# export ZES_ENABLE_SYSMAN=1
(dpctl_env) root@Rig3073250:/home/user# python3.10
Python 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dpctl.tensor as dpt
>>> x = dpt.empty(2**26, dtype="i8")
>>> y = dpt.empty(2**26, dtype="i8")
>>> (y.nbytes + x.nbytes) / (1024 * 1024 * 1024)
1.0
>>> import dpctl.utils as du
>>> device_info = du.intel_device_info(x.sycl_device)
>>> print(device_info)
{'device_id': 22176, 'gpu_eu_count': 512, 'gpu_hw_threads_per_eu': 8, 'gpu_eu_simd_width': 8, 'gpu_slices': 1, 'gpu_subslices_per_slice': 32, 'gpu_eu_count_per_subslice': 16, 'free_memory': 16225243136, 'memory_bus_width': 64}
>>> used_memory = x.sycl_device.global_mem_size - device_info['free_memory']
>>> print(used_memory)
0
>>> (y.nbytes + x.nbytes)
1073741824
```
Terminal 2
```
root@Rig3073250:/home/user# xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | N/A |
| EU Array Active (%) | N/A |
| EU Array Stall (%) | N/A |
| EU Array Idle (%) | N/A |
| | |
| Compute Engine Util (%) | Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| Render Engine Util (%) | Engine 0: 0 |
| Media Engine Util (%) | N/A |
| Decoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Encoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Copy Engine Util (%) | Engine 0: 0 |
| Media EM Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| 3D Engine Util (%) | N/A |
+-----------------------------+--------------------------------------------------------------------+
| Reset | N/A |
| Programming Errors | N/A |
| Driver Errors | N/A |
| Cache Errors Correctable | N/A |
| Cache Errors Uncorrectable | N/A |
| Mem Errors Correctable | N/A |
| Mem Errors Uncorrectable | N/A |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 38 |
| GPU Frequency (MHz) | 1000 |
| Media Engine Freq (MHz) | N/A |
| GPU Core Temperature (C) | N/A |
| GPU Memory Temperature (C) | N/A |
| GPU Memory Read (kB/s) | N/A |
| GPU Memory Write (kB/s) | N/A |
| GPU Memory Bandwidth (%) | N/A |
| GPU Memory Used (MiB) | 18 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | N/A |
+-----------------------------+--------------------------------------------------------------------+
root@Rig3073250:/home/user# xpu-smi ps
PID        Command     DeviceID    SHR    MEM
```
Summary
- `dpctl` output:
  - `free_memory`: 16225243136 bytes (15.1 GiB)
  - `used_memory`: 0 bytes
  - `total_memory`: 15.1 GiB (implied from `global_mem_size`)
- `xpu-smi stats -d 0` output:
  - GPU Memory Used (MiB): 18
- `xpu-smi ps` output: empty
Additional Observation
Upon exiting the Python console after the `(y.nbytes + x.nbytes)` step, the `xpu-smi` GPU memory usage drops to 17 MiB.
Despite allocating memory with `dpctl.tensor` in the Python session, the used memory derived from `dpctl` is 0 bytes. This is inconsistent with the `xpu-smi` output, which itself shows only 18 MiB of GPU memory usage while the 1 GiB of tensors is allocated, just 1 MiB more than after exiting the console. It seems there may be issues with `xpu-smi` on my machine as well.
UPDATE
I investigated further with PyOpenCL:
```python
import time

import numpy as np
import pyopencl as cl

# Create OpenCL context and queue
platforms = cl.get_platforms()
gpu_devices = [d for p in platforms for d in p.get_devices(device_type=cl.device_type.GPU)]
if not gpu_devices:
    print("No GPU devices found.")
    exit()

device = gpu_devices[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)

# Allocate memory on the GPU
buffer_size = 2**26  # 64 MiB per buffer
mf = cl.mem_flags
buffer1 = cl.Buffer(context, mf.READ_WRITE, size=buffer_size)
buffer2 = cl.Buffer(context, mf.READ_WRITE, size=buffer_size)
allocated_memory_mib = (buffer_size * 2) / (1024 * 1024)
print(f"Allocated memory: {allocated_memory_mib:.2f} MiB")

# Initialize data to write to buffers
host_data = np.random.rand(buffer_size // 4).astype(np.float32)

# Write data to the buffers
cl.enqueue_copy(queue, buffer1, host_data)
cl.enqueue_copy(queue, buffer2, host_data)
queue.finish()

# Pause for 30 seconds to allow xpu-smi observation
print("Memory allocated. Pausing for 30 seconds for observation with xpu-smi...")
time.sleep(30)

# Perform a simple computation to ensure buffers are used
program_src = """
__kernel void add(__global const float *a, __global const float *b, __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
"""
program = cl.Program(context, program_src).build()
result_buffer = cl.Buffer(context, mf.WRITE_ONLY, size=buffer_size)
program.add(queue, host_data.shape, None, buffer1, buffer2, result_buffer)
queue.finish()
print("Performed computation on the GPU.")
```
Output:

```
Allocated memory: 128.00 MiB
Memory allocated. Pausing for 30 seconds for observation with xpu-smi...
Performed computation on the GPU.
```
`xpu-smi stats -d 0` output:

```
| GPU Memory Used (MiB) | 146 |
| GPU Memory Util (%) | 1 |
```
So at this point `xpu-smi` seems to be working correctly. When I checked with `dpctl` while the PyOpenCL program was running, I still got the same output:

```
Free Memory: 16225243136 bytes
Total Memory: 16225243136 bytes
Used Memory: 0 bytes
```
Environment
- dpctl version: 0.17.0
- xpu-smi version: 1.2.38.20240718
- OS: HiveOS [Based on Ubuntu 20.04]
- Python version: 3.10.14
- GPU: Intel(R) Arc(TM) A770 Graphics
- GPU driver version: 1.3.27642
Please let me know if any further information or testing is required.
From the overall testing on my machine so far, it looks like, both on the host and in Docker, the `free_memory` key from the `dpctl.utils.intel_device_info(sycl_device)` dictionary reports the same value as the device's `global_mem_size`, even though memory is being consumed on the Intel Arc GPU.
@avimanyu786 `dpctl` uses this DPC++ feature: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#free-global-memory
Could you compile the following C++ program and check whether its output is consistent with the output of `dpctl`?
```cpp
// icpx -fsycl mem.cpp -o mem.x
#include <iostream>
#include <vector>
#include <string>

#include <sycl/sycl.hpp>

int main(void) {
    sycl::queue q{sycl::default_selector_v};
    const sycl::device &dev = q.get_device();

    const std::string &dev_name = dev.get_info<sycl::info::device::name>();
    const std::string &driver_ver = dev.get_info<sycl::info::device::driver_version>();
    std::cout << "Device: " << dev_name << " [" << driver_ver << "]" << std::endl;

    auto global_mem_size = dev.get_info<sycl::info::device::global_mem_size>();
    std::cout << "Global device memory size: " << global_mem_size << " bytes" << std::endl;

    if (dev.has(sycl::aspect::ext_intel_free_memory)) {
        auto free_memory = dev.get_info<sycl::ext::intel::info::device::free_memory>();
        std::cout << "Free memory: " << free_memory << " bytes" << std::endl;
        std::cout << "Implied memory in use: " << global_mem_size - free_memory << " bytes" << std::endl;
    } else {
        std::cout << "Free memory descriptor is not available" << std::endl;
    }

    return 0;
}
```
Once compiled, execute it as `ZES_ENABLE_SYSMAN=1 ./mem.x`. This is the output I observe when no process other than `mem.x` accesses the GPU:
```
$ ZES_ENABLE_SYSMAN=1 ./mem
Device: Intel(R) Data Center GPU Max 1100 [7.66.28691]
Global device memory size: 51539607552 bytes
Free memory: 51492134912 bytes
Implied memory in use: 47472640 bytes
```
This is what I observe while the Python session from the earlier discussion holds the 1 GiB allocated for `x` and `y`:
```
$ ZES_ENABLE_SYSMAN=1 ./mem
Device: Intel(R) Data Center GPU Max 1100 [7.66.28691]
Global device memory size: 51539607552 bytes
Free memory: 50416521216 bytes
Implied memory in use: 1123086336 bytes
```
If the native application reports the same value as `dpctl` does, try upgrading the GPU drivers (https://dgpu-docs.intel.com/driver/installation.html).
Hi @oleksandr-pavlyk,
Following the above suggestion, I see the same issue on the host machine with the `icpx`-compiled program (both when the GPU is idle and after allocating tensors in Python):
```
Device: Intel(R) Arc(TM) A770 Graphics [1.3.27642]
Global device memory size: 16225243136 bytes
Free memory: 16225243136 bytes
Implied memory in use: 0 bytes
```
After switching to HiveOS based on Ubuntu 22.04, I'm facing the same issue, even after upgrading the driver from 1.3.27642 to 1.3.29735:
```
Device: Intel(R) Arc(TM) A770 Graphics [1.3.29735]
Global device memory size: 16225243136 bytes
Free memory: 16225243136 bytes
Implied memory in use: 0 bytes
```
To update the driver on the host, I followed the client GPU documentation for Intel Arc.
Summary
- Operating System: HiveOS (Ubuntu 20.04 and Ubuntu 22.04)
- GPU: Intel(R) Arc(TM) A770 Graphics
- Driver versions tested:
  - 1.3.27642
  - 1.3.29735
- Intel BaseKit: installed; `icpx` used, based on the Intel BaseKit documentation for apt
- Python: 3.10.14
- dpctl: 0.17.0
The output for the Intel Data Center GPU Max 1100 shows a different driver version, 7.66.28691. Possibly that driver includes features or fixes not present in the driver versions available for the Intel Arc A770, which could explain the discrepancy in reported free memory. I'll wait for your further feedback. Thanks.
It may be that the discrepancy is indeed explained by the driver. In that case one should file an issue with https://github.com/intel/compute-runtime and provide this C++ reproducer, the driver version, the OS version, and the compiler version.
I do not think the behavior you are witnessing is caused by an issue with Python, as you have confirmed by running a stand-alone executable compiled from C++ code.
Many many thanks @oleksandr-pavlyk for following up on this issue! I have filed the corresponding issue in the compute runtime repository: https://github.com/intel/compute-runtime/issues/750
For added context, there is a Python file called `check_xpu_smi.py` in the https://github.com/intel/xpumanager repository that fetches the value of `XPUM_STATS_MEMORY_USED` to report used GPU memory. I found this by searching for "GPU Memory Used" in that repository.