Inconsistent GPU Memory Usage Reporting Between `dpctl` and `xpu-smi`
Description
When using `dpctl` to report GPU memory usage on Intel GPUs, the reported free and total memory values appear to be incorrect when compared with the output from `xpu-smi`. Specifically, `dpctl` reports 0 bytes of used memory, while `xpu-smi` correctly reports the used memory as 17 MiB.
Steps to Reproduce
- Set up an environment with `dpctl` and `xpu-smi` installed.
- Use the following Python script to get GPU memory information using `dpctl`:
```python
import os

import dpctl
from dpctl.utils import intel_device_info


def get_intel_gpu_memory_info():
    try:
        # Set the environment variable ZES_ENABLE_SYSMAN to 1
        os.environ["ZES_ENABLE_SYSMAN"] = "1"
        # Get the list of GPU devices
        devices = dpctl.get_devices(device_type=dpctl.device_type.gpu)
        for device in devices:
            # Get Intel GPU device info
            device_info = intel_device_info(device)
            if device_info:
                free_memory = device_info.get('free_memory', None)
                if free_memory is not None:
                    free_memory_mib = free_memory / (1024 * 1024)
                    print(f"Free Memory: {free_memory_mib:.2f} MiB")
                # Get the total global memory size
                try:
                    global_mem_size = device.get_info(dpctl.device_info.global_mem_size)
                except AttributeError:
                    global_mem_size = device.global_mem_size
                global_mem_size_mib = global_mem_size / (1024 * 1024)
                print(f"Total Memory: {global_mem_size_mib:.2f} MiB")
                # Calculate and display used memory
                if free_memory is not None and global_mem_size is not None:
                    used_memory = global_mem_size - free_memory
                    used_memory_mib = used_memory / (1024 * 1024)
                    print(f"Used Memory: {used_memory_mib:.2f} MiB")
                else:
                    print("Unable to calculate used memory due to missing information.")
                return
        print("No Intel GPU devices found or no information available.")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    get_intel_gpu_memory_info()
```
- Compare the output with the results of running `xpu-smi stats -d 0`:

```
xpu-smi stats -d 0
```
Observed Behavior
- Output from the Python script using `dpctl`:

```
Free Memory: 15473.60 MiB
Total Memory: 15473.60 MiB
Used Memory: 0.00 MiB
```
Also, `python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"` shows the following output, which matches the total memory:

```
2.1.0.post2+cxx11.abi
2.1.30+xpu
[0]: _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=0, total_memory=15473MB, max_compute_units=512, gpu_eu_count=512)
```
- Output from `xpu-smi`:

```
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Memory Used (MiB) | 17 |
| GPU Memory Util (%) | 0 |
+-----------------------------+--------------------------------------------------------------------+
```
Expected Behavior
The used memory derived from `dpctl` (i.e., `global_mem_size - free_memory`) should match the used memory reported by `xpu-smi`.
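For concreteness, a consistency check along the following lines would be expected to print roughly matching values. This is only a sketch: it assumes `ZES_ENABLE_SYSMAN=1` is exported before Python starts, uses `dpctl.select_gpu_device()` to pick the GPU, and parses the "GPU Memory Used (MiB)" row of the `xpu-smi stats -d 0` table shown above (the parsing may need adjusting for other `xpu-smi` versions).

```python
import re
import subprocess

import dpctl
from dpctl.utils import intel_device_info

# dpctl side: used = global_mem_size - free_memory (requires ZES_ENABLE_SYSMAN=1)
dev = dpctl.select_gpu_device()
info = intel_device_info(dev)
used_dpctl_mib = (dev.global_mem_size - info["free_memory"]) / (1024 * 1024)

# xpu-smi side: parse the "GPU Memory Used (MiB)" row for device 0
# (may require elevated privileges, as in the examples in this issue)
out = subprocess.run(
    ["xpu-smi", "stats", "-d", "0"], capture_output=True, text=True, check=True
).stdout
m = re.search(r"GPU Memory Used \(MiB\)\s*\|\s*([0-9]+)", out)
used_xpu_smi_mib = int(m.group(1)) if m else None

print(f"dpctl-derived used memory:    {used_dpctl_mib:.2f} MiB")
print(f"xpu-smi reported used memory: {used_xpu_smi_mib} MiB")
```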
Environment
- dpctl version: 0.17.0
- xpu-smi version: 1.2.38.20240718
- OS: HiveOS [Based on Ubuntu 20.04]
- Docker version: 24.0.7, build 24.0.7-0ubuntu2~20.04.1
- Docker image: intel/intel-extension-for-pytorch:2.1.30-xpu
- Python version: 3.10.12
- GPU: Intel(R) Arc(TM) A770 Graphics
Additional Information
Setting the environment variable `ZES_ENABLE_SYSMAN` to `1` was necessary, as mentioned in the documentation, for `free_memory` to be reported. The discrepancy in reported values suggests a potential issue within the `dpctl` library or in its interaction with the GPU drivers.
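Note that sysman generally has to be enabled before the Level Zero runtime initializes. Exporting the variable in the shell (as done later in this thread) or setting it before importing `dpctl` both avoid any ordering concerns; a minimal sketch of the latter, using `dpctl.select_gpu_device()` for brevity:

```python
# Sketch: set ZES_ENABLE_SYSMAN before importing dpctl, so the variable is in
# place before the Level Zero runtime can be initialized by device enumeration.
import os

os.environ["ZES_ENABLE_SYSMAN"] = "1"

import dpctl  # noqa: E402  (imported intentionally after the environment change)
from dpctl.utils import intel_device_info  # noqa: E402

# Prints the free_memory value in bytes, or None if it is not reported.
print(intel_device_info(dpctl.select_gpu_device()).get("free_memory"))
```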
Further information on OS:
```
# uname -r
6.1.0-hiveos
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
```
The container was launched through the following command:
```
docker run -ti --cap-add=PERFMON --device /dev/dri intel/intel-extension-for-pytorch:2.1.30-xpu bash
```
The `intel-basekit` package (which provides the necessary SYCL runtime and development tools for `dpctl`) and `xpu-smi` were installed with the following commands before testing the issue inside the container:
```
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
wget https://github.com/intel/xpumanager/releases/download/V1.2.38/xpu-smi_1.2.38_20240718.060204.0db09695+deb10u1_amd64.deb && \
apt update && apt install -y ./xpu-smi_1.2.38_20240718.060204.0db09695+deb10u1_amd64.deb intel-basekit && \
source /opt/intel/oneapi/setvars.sh
```
Proposed Solution
Investigate and resolve the inconsistency in GPU memory reporting between `dpctl` and `xpu-smi`. Ensure that `dpctl` accurately reflects the actual GPU memory usage.
Thank you for looking into this issue. Please let me know if further information or testing is required.
@avimanyu786 Please provide information about your GPU driver, e.g., output of `python -m dpctl -f`.
Thank you for the assistance, @oleksandr-pavlyk! I will post the output as soon as possible.
Actually, you have already provided the information, @avimanyu786: `driver_version='1.3.27642'`
While attempting to reproduce the reported behavior, I used `xpu-smi` on a machine where the GPU is not utilized (I used `xpu-smi ps` to verify that only `xpu-smi` was using the GPU), and `xpu-smi` reported a non-zero GPU memory footprint:
```
$ xpu-smi ps
PID        Command     DeviceID    SHR    MEM
883105     xpu-smi     0           0      2228
```
```
$ sudo xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | 0 |
| EU Array Active (%) | N/A |
| EU Array Stall (%) | N/A |
| EU Array Idle (%) | N/A |
| | |
| Compute Engine Util (%) | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| Render Engine Util (%) | N/A |
| Media Engine Util (%) | N/A |
| Decoder Engine Util (%) | N/A |
| Encoder Engine Util (%) | N/A |
| Copy Engine Util (%) | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| | Engine 4: 0, Engine 5: 0 |
| Media EM Engine Util (%) | N/A |
| 3D Engine Util (%) | N/A |
+-----------------------------+--------------------------------------------------------------------+
| Reset | N/A |
| Programming Errors | N/A |
| Driver Errors | N/A |
| Cache Errors Correctable | N/A |
| Cache Errors Uncorrectable | N/A |
| Mem Errors Correctable | N/A |
| Mem Errors Uncorrectable | N/A |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 31 |
| GPU Frequency (MHz) | 1550 |
| Media Engine Freq (MHz) | N/A |
| GPU Core Temperature (C) | N/A |
| GPU Memory Temperature (C) | N/A |
| GPU Memory Read (kB/s) | N/A |
| GPU Memory Write (kB/s) | N/A |
| GPU Memory Bandwidth (%) | N/A |
| GPU Memory Used (MiB) | 28 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | N/A |
+-----------------------------+--------------------------------------------------------------------+
```
So I assume an explanation for the discrepancy is that xpu-smi itself uses some amount of GPU global memory.
I see. In that case, perhaps we could try a different approach: run `xpu-smi stats` under the `watch` command in one terminal and recheck with `dpctl` from a different terminal?
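For the `dpctl` side of that two-terminal experiment, a polling loop along the following lines could be left running while `watch -n 1 xpu-smi stats -d 0` runs in the other terminal. This is only a sketch: it assumes `ZES_ENABLE_SYSMAN=1` is exported before Python starts and uses `dpctl.select_gpu_device()` to pick the GPU.

```python
import time

import dpctl
from dpctl.utils import intel_device_info

# Poll free/used memory once per second; compare against the "GPU Memory Used"
# row shown by `watch -n 1 xpu-smi stats -d 0` in another terminal.
# Press Ctrl-C to stop.
dev = dpctl.select_gpu_device()
total = dev.global_mem_size
while True:
    free = intel_device_info(dev).get("free_memory")
    if free is None:
        print("free_memory not reported; is ZES_ENABLE_SYSMAN=1 set?")
        break
    print(f"free: {free / 2**20:.1f} MiB, implied used: {(total - free) / 2**20:.1f} MiB")
    time.sleep(1)
```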
Just to confirm: in one terminal I started IPython (with `ZES_ENABLE_SYSMAN=1` set) and executed:
```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.empty(2**26, dtype="i8")

In [3]: y = dpt.empty(2**26, dtype="i8")

In [4]: (y.nbytes + x.nbytes) / (1024 * 1024 * 1024)
Out[4]: 1.0

In [5]: import dpctl.utils as du

In [6]: du.intel_device_info(x.sycl_device)
Out[6]:
{'device_id': 3034,
 'gpu_eu_count': 448,
 'gpu_hw_threads_per_eu': 8,
 'gpu_eu_simd_width': 16,
 'gpu_slices': 1,
 'gpu_subslices_per_slice': 56,
 'gpu_eu_count_per_subslice': 8,
 'free_memory': 50417606656,
 'memory_clock_rate': 3200,
 'memory_bus_width': 64}

In [7]: x.sycl_device.global_mem_size - Out[6]['free_memory']
Out[7]: 1122000896

In [8]: (y.nbytes + x.nbytes)
Out[8]: 1073741824
```
In another terminal I executed `sudo xpu-smi stats -d 0`, which showed:

```
| GPU Memory Used (MiB) | 1055 |
| GPU Memory Util (%) | 2 |
```
and `xpu-smi ps` showed:

```
$ xpu-smi ps
PID        Command     DeviceID    SHR    MEM
885596     ipython     0           0      1081212
885843     xpu-smi     0           0      2228
```
These figures are kind of consistent, accounting for some GPU global memory used by the driver.
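Spelling out the arithmetic behind that, using the numbers above (a quick sanity check, not additional measurements):

```python
# Numbers taken from the IPython session and the xpu-smi output above.
allocated = 2 * 2**26 * 8           # two int64 arrays of 2**26 elements = 1073741824 bytes
implied_used = 1122000896           # global_mem_size - free_memory reported via dpctl
print((implied_used - allocated) / 2**20)     # ~46.0 MiB not accounted for by x and y

xpu_smi_used_mib = 1055             # "GPU Memory Used (MiB)" from xpu-smi stats
print(xpu_smi_used_mib - allocated / 2**20)   # ~31.0 MiB beyond the 1024 MiB allocations
```

The ~46 MiB and ~31 MiB remainders are presumably the driver overhead and `xpu-smi`'s own footprint mentioned above.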
@oleksandr-pavlyk Thanks so much for these confirmations! I will test from my end when I have access to my machine and report back if I face any issues.
Hi @oleksandr-pavlyk,
I have conducted the tests (this time directly on the host) as suggested and observed the following results.
Terminal 1
```
(dpctl_env) root@Rig3073250:/home/user# export ZES_ENABLE_SYSMAN=1
(dpctl_env) root@Rig3073250:/home/user# python3.10
Python 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dpctl.tensor as dpt
>>> x = dpt.empty(2**26, dtype="i8")
>>> y = dpt.empty(2**26, dtype="i8")
>>> (y.nbytes + x.nbytes) / (1024 * 1024 * 1024)
1.0
>>> import dpctl.utils as du
>>> device_info = du.intel_device_info(x.sycl_device)
>>> print(device_info)
{'device_id': 22176, 'gpu_eu_count': 512, 'gpu_hw_threads_per_eu': 8, 'gpu_eu_simd_width': 8, 'gpu_slices': 1, 'gpu_subslices_per_slice': 32, 'gpu_eu_count_per_subslice': 16, 'free_memory': 16225243136, 'memory_bus_width': 64}
>>> used_memory = x.sycl_device.global_mem_size - device_info['free_memory']
>>> print(used_memory)
0
>>> (y.nbytes + x.nbytes)
1073741824
```
Terminal 2
```
root@Rig3073250:/home/user# xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | N/A |
| EU Array Active (%) | N/A |
| EU Array Stall (%) | N/A |
| EU Array Idle (%) | N/A |
| | |
| Compute Engine Util (%) | Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0 |
| Render Engine Util (%) | Engine 0: 0 |
| Media Engine Util (%) | N/A |
| Decoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Encoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Copy Engine Util (%) | Engine 0: 0 |
| Media EM Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| 3D Engine Util (%) | N/A |
+-----------------------------+--------------------------------------------------------------------+
| Reset | N/A |
| Programming Errors | N/A |
| Driver Errors | N/A |
| Cache Errors Correctable | N/A |
| Cache Errors Uncorrectable | N/A |
| Mem Errors Correctable | N/A |
| Mem Errors Uncorrectable | N/A |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 38 |
| GPU Frequency (MHz) | 1000 |
| Media Engine Freq (MHz) | N/A |
| GPU Core Temperature (C) | N/A |
| GPU Memory Temperature (C) | N/A |
| GPU Memory Read (kB/s) | N/A |
| GPU Memory Write (kB/s) | N/A |
| GPU Memory Bandwidth (%) | N/A |
| GPU Memory Used (MiB) | 18 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | N/A |
+-----------------------------+--------------------------------------------------------------------+
root@Rig3073250:/home/user# xpu-smi ps
PID        Command     DeviceID    SHR    MEM
```
Summary
- `dpctl` output:
  - `free_memory`: 16225243136 bytes (15.1 GiB)
  - `used_memory`: 0 bytes
  - `total_memory`: 15.1 GiB (implied from `global_mem_size`)
- `xpu-smi stats -d 0` output:
  - GPU Memory Used (MiB): 18
- `xpu-smi ps` output: empty
Additional Observation
Upon exiting the Python console after the `(y.nbytes + x.nbytes)` step, the `xpu-smi` GPU memory usage drops to 17 MiB.
Despite allocating memory with `dpctl.tensor` in the Python session, the used memory derived from `dpctl` is 0 bytes. This is inconsistent with the `xpu-smi` output, which itself shows only 18 MiB of GPU memory usage while the 1 GiB of tensors is allocated, just 1 MiB more than after exiting the console. It seems there may be issues with `xpu-smi` on my machine as well.
UPDATE
I investigated further with PyOpenCL:
```python
import time

import numpy as np
import pyopencl as cl

# Create OpenCL context and queue
platforms = cl.get_platforms()
gpu_devices = [d for p in platforms for d in p.get_devices(device_type=cl.device_type.GPU)]
if not gpu_devices:
    print("No GPU devices found.")
    exit()

device = gpu_devices[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)

# Allocate memory on the GPU
buffer_size = 2**26  # 64 MiB per buffer
mf = cl.mem_flags
buffer1 = cl.Buffer(context, mf.READ_WRITE, size=buffer_size)
buffer2 = cl.Buffer(context, mf.READ_WRITE, size=buffer_size)
allocated_memory_mib = (buffer_size * 2) / (1024 * 1024)
print(f"Allocated memory: {allocated_memory_mib:.2f} MiB")

# Initialize data to write to buffers
host_data = np.random.rand(buffer_size // 4).astype(np.float32)

# Write data to the buffers
cl.enqueue_copy(queue, buffer1, host_data)
cl.enqueue_copy(queue, buffer2, host_data)
queue.finish()

# Pause for 30 seconds to allow xpu-smi observation
print("Memory allocated. Pausing for 30 seconds for observation with xpu-smi...")
time.sleep(30)

# Perform a simple computation to ensure buffers are used
program_src = """
__kernel void add(__global const float *a, __global const float *b, __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
"""
program = cl.Program(context, program_src).build()
result_buffer = cl.Buffer(context, mf.WRITE_ONLY, size=buffer_size)
program.add(queue, host_data.shape, None, buffer1, buffer2, result_buffer)
queue.finish()
print("Performed computation on the GPU.")
```
Output:

```
Allocated memory: 128.00 MiB
Memory allocated. Pausing for 30 seconds for observation with xpu-smi...
Performed computation on the GPU.
```
`xpu-smi stats -d 0` output:

```
| GPU Memory Used (MiB) | 146 |
| GPU Memory Util (%) | 1 |
```
So at this point `xpu-smi` seems to be working correctly. When I checked with `dpctl` while the PyOpenCL program was running, I still got the same output:

```
Free Memory: 16225243136 bytes
Total Memory: 16225243136 bytes
Used Memory: 0 bytes
```
Environment
- dpctl version: 0.17.0
- xpu-smi version: 1.2.38.20240718
- OS: HiveOS [Based on Ubuntu 20.04]
- Python version: 3.10.14
- GPU: Intel(R) Arc(TM) A770 Graphics
- GPU driver version: 1.3.27642
Please let me know if any further information or testing is required.
From the overall testing on my machine so far, it looks like, both on the host and in Docker, the `free_memory` key from the `dpctl.utils.intel_device_info(sycl_device)` dictionary reports the same value as the device's `global_mem_size`, even though memory is being consumed on the Intel Arc GPU.
@avimanyu786 `dpctl` uses this DPC++ feature: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#free-global-memory
Could you compile the following C++ program and check whether its output is consistent with the output of `dpctl`?
```cpp
// icpx -fsycl mem.cpp -o mem.x
#include <iostream>
#include <vector>
#include <string>

#include <sycl/sycl.hpp>

int main(void) {
    sycl::queue q{sycl::default_selector_v};
    const sycl::device &dev = q.get_device();

    const std::string &dev_name = dev.get_info<sycl::info::device::name>();
    const std::string &driver_ver = dev.get_info<sycl::info::device::driver_version>();
    std::cout << "Device: " << dev_name << " [" << driver_ver << "]" << std::endl;

    auto global_mem_size = dev.get_info<sycl::info::device::global_mem_size>();
    std::cout << "Global device memory size: " << global_mem_size << " bytes" << std::endl;

    if (dev.has(sycl::aspect::ext_intel_free_memory)) {
        auto free_memory = dev.get_info<sycl::ext::intel::info::device::free_memory>();
        std::cout << "Free memory: " << free_memory << " bytes" << std::endl;
        std::cout << "Implied memory in use: " << global_mem_size - free_memory << " bytes" << std::endl;
    } else {
        std::cout << "Free memory descriptor is not available" << std::endl;
    }

    return 0;
}
```
Once compiled, execute it as `ZES_ENABLE_SYSMAN=1 ./mem.x`. This is the output I observe when no process other than `mem.x` accesses the GPU:
```
$ ZES_ENABLE_SYSMAN=1 ./mem
Device: Intel(R) Data Center GPU Max 1100 [7.66.28691]
Global device memory size: 51539607552 bytes
Free memory: 51492134912 bytes
Implied memory in use: 47472640 bytes
```
This is what I observe while the Python session from the earlier discussion holds the 1 GiB allocated for `x` and `y`:
```
$ ZES_ENABLE_SYSMAN=1 ./mem
Device: Intel(R) Data Center GPU Max 1100 [7.66.28691]
Global device memory size: 51539607552 bytes
Free memory: 50416521216 bytes
Implied memory in use: 1123086336 bytes
```
If the native application reports the same value as `dpctl` does, try upgrading the GPU drivers (https://dgpu-docs.intel.com/driver/installation.html).
Hi @oleksandr-pavlyk,
Following the above suggestion, I see the same issue on the host machine with the `icpx`-compiled program (both when the GPU is idle and after allocating tensors in Python):
```
Device: Intel(R) Arc(TM) A770 Graphics [1.3.27642]
Global device memory size: 16225243136 bytes
Free memory: 16225243136 bytes
Implied memory in use: 0 bytes
```
After switching to HiveOS based on Ubuntu 22.04, I'm facing the same issue, even after upgrading the driver from 1.3.27642 to 1.3.29735:
```
Device: Intel(R) Arc(TM) A770 Graphics [1.3.29735]
Global device memory size: 16225243136 bytes
Free memory: 16225243136 bytes
Implied memory in use: 0 bytes
```
To update the driver on the host, I followed the client GPU documentation for Intel Arc.
Summary
- Operating System: HiveOS (Ubuntu 20.04 and Ubuntu 22.04)
- GPU: Intel(R) Arc(TM) A770 Graphics
- Driver versions tested:
  - 1.3.27642
  - 1.3.29735
- Intel BaseKit: installed; `icpx` used, based on the Intel BaseKit documentation for apt
- Python: 3.10.14
- dpctl: 0.17.0
The output for the Intel Data Center GPU Max 1100 shows a different driver version, 7.66.28691. Possibly that driver includes features or fixes not present in the driver versions available for the Intel Arc A770, which could explain the discrepancy in reported free memory. I'll wait for your further feedback. Thanks.
It may be that the discrepancy is indeed explained by the driver. In that case one should file an issue with https://github.com/intel/compute-runtime and provide this C++ reproducer, the driver version, the OS version, and the compiler version.
I do not think the behavior you are witnessing is caused by an issue with Python, as you have confirmed by running a stand-alone executable compiled from C++ code.
Many many thanks @oleksandr-pavlyk for following up on this issue! I have filed the corresponding issue in the compute runtime repository: https://github.com/intel/compute-runtime/issues/750
For added context, there is a Python file called `check_xpu_smi.py` in the https://github.com/intel/xpumanager repository that fetches the value of `XPUM_STATS_MEMORY_USED` to report used GPU memory. I found this by searching for "GPU Memory Used" in that repository.