
Support MIG parsing during CUDA context creation in UCX initialization

Open pentschev opened this pull request 3 years ago • 2 comments

Some issues relating to MIG often arise because of the lack of automated CI for it, as MIG requires specialized hardware that isn't currently available in gpuCI. This PR addresses errors reported recently, see https://github.com/rapidsai/dask-cuda/issues/583#issuecomment-1181505348. Depends on #6678.
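For reference, the parsing in question resolves MIG instances (entries of CUDA_VISIBLE_DEVICES that are UUID strings starting with "MIG-") to NVML device handles instead of assuming plain integer indices. Below is a minimal sketch, assuming pynvml is available; the helper name get_device_handle is illustrative and not the actual code added by this PR.

import os

import pynvml


def get_device_handle(device):
    # Illustrative helper: resolve one CUDA_VISIBLE_DEVICES entry to an NVML
    # handle. MIG instances are addressed by UUID (strings starting with
    # "MIG-"), full GPUs by integer index.
    pynvml.nvmlInit()
    if isinstance(device, str) and device.startswith("MIG-"):
        # Older pynvml releases expect a bytes UUID, so encode defensively.
        return pynvml.nvmlDeviceGetHandleByUUID(device.encode())
    return pynvml.nvmlDeviceGetHandleByIndex(int(device))


# e.g. resolve the first visible device, defaulting to GPU 0
first = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0]
handle = get_device_handle(first)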

  • [ ] Tests added / passed
  • [ ] Passes pre-commit run --all-files

pentschev avatar Jul 12 '22 21:07 pentschev

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   6h 26m 23s :stopwatch: -14m 56s
3 169 tests +4   3 084 :heavy_check_mark: +6   83 :zzz: ±0   2 :x: -1
23 448 runs +31   22 544 :heavy_check_mark: +36   900 :zzz: -2   4 :x: -2

For more details on these failures, see this check.

Results for commit 47147c44. ± Comparison against base commit f31fbde7.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 12 '22 22:07 github-actions[bot]

#6678 is now in, could you rebase to drop those commits?

wence- avatar Jul 21 '22 14:07 wence-

Can you check the gpuci failures here? https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/5098/

wence- avatar Oct 31 '22 16:10 wence-

> Can you check the gpuci failures here? https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/5098/

Yeah, I'm still trying to figure those out. Unfortunately the tests pass when run on their own, but there seems to be some leakage from previous tests that causes them to fail. I'll keep trying.

pentschev avatar Oct 31 '22 20:10 pentschev

> Yeah, I'm still trying to figure those out. Unfortunately the tests pass when run on their own, but there seems to be some leakage from previous tests that causes them to fail. I'll keep trying.

Seems like pynvml returns the PID of the compute process as seen from the host system, not the PID inside the container :(.

Inside the container:

In [1]: import numba.cuda

In [2]: import pynvml

In [3]: numba.cuda.current_context()
Out[3]: <CUDA context c_void_p(94546817482160) of device 0>

In [4]: pynvml.nvmlInit()

In [5]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [6]: process, = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)

In [7]: process.pid
Out[7]: 32487

In [8]: import os

In [9]: os.getpid()
Out[9]: 8611

Outside the container:

$ nvidia-smi
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32487      C   ...da/envs/ucx/bin/python3.9      303MiB |
+-----------------------------------------------------------------------------+

Edit: if we can run the container with --pid=host (unsure of the security consequences of this) then things would work (cf https://github.com/NVIDIA/nvidia-docker/issues/1460)
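A small sketch of the check this breaks (variable names are illustrative): if NVML reports host-namespace PIDs, the current process never appears in the device's compute-process list, even though it owns a CUDA context there.

import os

import numba.cuda
import pynvml

numba.cuda.current_context()  # create a CUDA context in this process

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
# Without --pid=host, NVML reports PIDs from the host namespace, so none of
# them match os.getpid() even though this process holds a context.
owns_context = any(p.pid == os.getpid() for p in procs)
print(owns_context)  # False in a container with its own PID namespace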

wence- avatar Nov 01 '22 09:11 wence-

Alright, I have now marked both CUDA context tests as xfail and gpuCI is passing. Could you take one more look when you have a chance, @wence- ?
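For context, marking a test as an expected failure generally looks something like the sketch below; the test name and reason string are illustrative, not the exact markers added in this PR.

import pytest


@pytest.mark.xfail(
    reason="NVML reports host-namespace PIDs inside containers, so the CUDA "
    "context owner cannot be matched against os.getpid() on gpuCI"
)
def test_cuda_context():
    ...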

pentschev avatar Nov 02 '22 17:11 pentschev

All 3 failing tests are the same as reported in https://github.com/dask/distributed/issues/7208. I'm not sure whether the codecov drop is an artefact of the failing tests or something else, but I don't believe this PR could affect coverage so heavily, so I'm not really sure what is happening there.

pentschev avatar Nov 03 '22 15:11 pentschev

@fjetter @jrbourbeau wondering if you could help us figure out whether something is wrong with codecov. I don't believe this PR alone should decrease coverage, and if I open the codecov report for this PR and click on any file to expand it, nothing shows up. Is there an ongoing issue with codecov, or am I doing something wrong?

pentschev avatar Nov 03 '22 15:11 pentschev

> @fjetter @jrbourbeau wondering if you could help us figure out whether something is wrong with codecov. I don't believe this PR alone should decrease coverage, and if I open the codecov report for this PR and click on any file to expand it, nothing shows up. Is there an ongoing issue with codecov, or am I doing something wrong?

For some reason (https://github.com/dask/distributed/actions/runs/3379719559/jobs/5611652549#step:20:36) the coverage failed to upload.

wence- avatar Nov 03 '22 15:11 wence-

> I don't believe this PR alone should decrease coverage

All the MIG-based code (which I think is the larger share of the new lines; e.g. https://github.com/dask/distributed/pull/6720/files#diff-457ccaa4d4a70945b095c0460229791e8142c269e9ec781b98ab6d1b1925fd8cR201-R217) will be untested because nothing in CI runs with MIG devices. Moreover, since gpuCI doesn't report to codecov, I think the assumption will be that comm/ucx.py and diagnostics/nvml.py have zero coverage. This PR is a net increase in the number of lines in those files, so the total number of uncovered lines goes up and coverage goes down.
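As a toy illustration with made-up numbers: if codecov previously counted 50 000 covered lines out of 60 000 (83.33%) and this PR adds 100 lines to files it treats as uncovered, the ratio becomes 50 000 / 60 100 ≈ 83.19%, a drop of roughly 0.14 percentage points even though no previously covered line lost coverage.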

wence- avatar Nov 03 '22 15:11 wence-

A big hammer would be to ignore those files, per https://docs.codecov.com/docs/ignoring-paths (although those patterns don't look like regexes to me):

diff --git a/codecov.yml b/codecov.yml
index 319545ae..bf003804 100644
--- a/codecov.yml
+++ b/codecov.yml
@@ -1,6 +1,12 @@
 codecov:
   require_ci_to_pass: yes
 
+ignore:
+  # These files exercise GPU functionality which is tested but doesn't
+  # interact with coverage reports, so we ignore them
+  - "distributed/comm/ucx.py"
+  - "distributed/diagnostics/nvml.py"
+  
 coverage:
   precision: 2
   round: down

wence- avatar Nov 03 '22 15:11 wence-

> All the MIG-based code (which I think is the larger share of the new lines; e.g. https://github.com/dask/distributed/pull/6720/files#diff-457ccaa4d4a70945b095c0460229791e8142c269e9ec781b98ab6d1b1925fd8cR201-R217) will be untested because nothing in CI runs with MIG devices.

Right, but codecov complains that the net change is -0.75%.

> Moreover, since gpuCI doesn't report to codecov, I think the assumption will be that comm/ucx.py and diagnostics/nvml.py have zero coverage. This PR is a net increase in the number of lines in those files, so the total number of uncovered lines goes up and coverage goes down.

So even if none of the lines added by this PR are actually seen by codecov, they wouldn't amount to 0.75% of the entire Distributed codebase. Also, if you look at the whole codecov list, many files that aren't at all related to this PR are affected (e.g., pytorch.py decreased by 97.73%). With that said, the failure to upload coverage files sounds like the more plausible explanation, as you mentioned in https://github.com/dask/distributed/pull/6720#issuecomment-1302301864.

pentschev avatar Nov 03 '22 16:11 pentschev

@wence- I pushed part of the changes you suggested in https://github.com/dask/distributed/pull/6720#issuecomment-1302316138. NVML is actually tested by some of the CPU workflows here and here, so we don't want to ignore it.
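For reference, the narrower ignore list presumably looks something like this sketch, keeping only the UCX comm module excluded:

ignore:
  # UCX comms are only exercised on gpuCI, which does not upload coverage
  - "distributed/comm/ucx.py"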

pentschev avatar Nov 03 '22 16:11 pentschev