Observing GPU memory and/or CPU OS memory leaks with `use_persistent_mapping` enabled in `gdrdrv` during multi-process termination

Open realarnavgoel opened this issue 1 year ago • 0 comments

Impacted platform: All server-side products; first observed on a Grace Hopper-based system

Impacted gdrcopy versions: 2.4.1, 2.4.2, 2.4.3

Impacted gdrcopy configs: gdrdrv driver loaded with the module parameter use_persistent_mapping=1

Scenarios: When gdrcopy persistent mapping mode is enabled:

  1. If one process opens a connection to the driver (via gdr_open) and intends to explicitly share that connection with one or more other processes (using a UNIX domain socket), then the cleanup of the driver resources (via gdr_close) may be executed by one of the non-owning processes. That request is silently ignored, leading to CPU and GPU memory leaks.

  2. If a parent process A forks one or more child processes B without a subsequent exec (plain fork instead of Linux fork + exec), then B may attempt to close connections opened by A during an ungraceful termination of the processes via signals (SIGSEGV or SIGKILL), resulting in OS and GPU memory leaks.
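The fork scenario can be illustrated with a toy model in plain POSIX C. This is only a sketch and not gdrdrv or gdrcopy code: `conn_t`, `conn_open`, `conn_close`, and `child_close_is_dropped` are hypothetical names, and the owner-PID check is an assumption about how a close request from a non-owning process could end up being dropped. It shows that a child created by fork without exec inherits the handle, that its close request is refused, and that the resource stays allocated until the owner closes it.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a driver connection; not a gdrcopy type. */
typedef struct {
    pid_t owner;   /* process that opened the connection */
    int   is_open; /* 1 while the modelled resources are held */
} conn_t;

static conn_t conn_open(void)
{
    conn_t c = { getpid(), 1 };
    return c;
}

/* Returns 0 if resources were released, -1 if the request was dropped
 * because the caller is not the owning process (assumed behaviour). */
static int conn_close(conn_t *c)
{
    assert(c != NULL);
    if (getpid() != c->owner)
        return -1;          /* dropped: nothing is freed */
    c->is_open = 0;
    return 0;
}

/* Forks without exec and lets the child try to close the inherited
 * connection. Returns 1 if the child's close was dropped, 0 if it
 * succeeded, -1 on fork/wait failure. */
static int child_close_is_dropped(conn_t *c)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(conn_close(c) == -1 ? 1 : 0);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In this model, if the owner is killed before it calls `conn_close` itself, nothing ever releases the resource, which mirrors the leak described in scenario 2.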

By default, when persistent mode is disabled, GPU resource cleanup in both scenarios is performed through an independent workflow in the CUDA driver, so dropping the request to close the connection is benign.

Irrespective of persistent mode, this bug may lead to small CPU kernel memory leaks.

Signature of the defect

  • On coherent platforms, e.g. Grace Hopper systems, GPU memory leaks can lead to unexpected side effects. For example, turning off the nvidia-persistenced service may hang, requiring rebooting the machine.
  • On non-coherent platforms, GPU memory leaks may reduce the functionality or performance of CUDA applications.

Known mitigations: Turn off persistent mapping by setting the driver module parameter use_persistent_mapping=0 and reloading the driver.

Fixed gdrcopy version: 2.4.4
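For deployment scripts that need to flag affected installations, the version list above can be encoded as a small check. The following is a minimal sketch, assuming the installed gdrcopy version is already available as a `major.minor.patch` string; how that string is obtained is left out, and `parse_version` and `is_listed_impacted` are hypothetical helper names, not part of gdrcopy.

```c
#include <assert.h>
#include <stdio.h>

/* Parses "major.minor.patch"; returns 0 on success, -1 on malformed input
 * (missing fields or trailing characters). */
static int parse_version(const char *s, int *major, int *minor, int *patch)
{
    char trailing;
    assert(s != NULL);
    return sscanf(s, "%d.%d.%d%c", major, minor, patch, &trailing) == 3 ? 0 : -1;
}

/* Returns 1 if the version is one of the releases listed as impacted in
 * this issue (2.4.1, 2.4.2, 2.4.3), 0 if it is not on that list, and -1
 * if the string cannot be parsed. */
static int is_listed_impacted(const char *s)
{
    int major, minor, patch;
    if (parse_version(s, &major, &minor, &patch) != 0)
        return -1;
    return major == 2 && minor == 4 && patch >= 1 && patch <= 3;
}
```

The check only reflects the versions named in this issue; a result of 0 means "not listed here", not a guarantee that a release is unaffected.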

realarnavgoel, Jan 10 '25 02:01