Host OS panic or crash caused by a use-after-free of CPU OS memory when `gdr_unmap` is not called before `gdr_close`
Impacted platform: All server-side products; first observed on a Grace-Hopper system
Impacted gdrcopy versions: 2.0 and later
Impacted gdrcopy configs: Both persistent and non-persistent mode
Scenarios
An application opens a connection to the driver (gdr_open), allocates GPU memory via CUDA, then pins that memory and maps it into the CPU address space (gdr_pin_buffer, gdr_map) for read/write operations. If the application subsequently closes the connection (gdr_close) without first explicitly unmapping the GPU memory (gdr_unmap), a use-after-free (UAF) of OS memory occurs. This can cause functional issues in unrelated areas, or even a kernel panic or crash.
Known Mitigations
Applications should explicitly call gdr_unmap before gdr_close.
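The ordering requirement above can be sketched as a small teardown planner. This is only an illustration, not gdrcopy code: `session_state`, `teardown_step`, and `plan_teardown` are hypothetical names introduced here, and the sketch assumes the application tracks which setup calls succeeded. It also places gdr_unpin_buffer between the unmap and the close, which is an assumption about a typical cleanup path rather than something this issue description requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one gdrcopy session (not part of gdrcopy). */
typedef struct {
    bool opened;  /* gdr_open succeeded */
    bool pinned;  /* gdr_pin_buffer succeeded */
    bool mapped;  /* gdr_map succeeded */
} session_state;

/* Teardown steps; each stands for the gdrcopy call named in its comment. */
typedef enum {
    STEP_UNMAP,   /* gdr_unmap */
    STEP_UNPIN,   /* gdr_unpin_buffer (assumed part of normal cleanup) */
    STEP_CLOSE    /* gdr_close */
} teardown_step;

/*
 * Writes the teardown steps required for state `s` into `steps`, in the
 * order they should be executed, and returns how many were written.
 * The unmap step, when needed, always precedes the close step.
 */
size_t plan_teardown(const session_state *s, teardown_step steps[3])
{
    assert(s != NULL && steps != NULL);

    size_t n = 0;
    if (s->mapped) {
        steps[n++] = STEP_UNMAP;   /* must run before the connection closes */
    }
    if (s->pinned) {
        steps[n++] = STEP_UNPIN;
    }
    if (s->opened) {
        steps[n++] = STEP_CLOSE;   /* always last */
    }
    return n;
}
```

In a real application, each planned step would be dispatched to the corresponding gdrcopy call using the handles saved at setup time. Routing every exit path, including error paths, through one such routine makes it harder to reach gdr_close with a mapping still live.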
Fixed gdrcopy version: 2.4.4