Support MIG parsing during CUDA context creation in UCX initialization
Issues relating to MIG often arise because we lack automated CI for it, since it requires specialized hardware that isn't currently available in gpuCI. This PR addresses errors reported recently, see https://github.com/rapidsai/dask-cuda/issues/583#issuecomment-1181505348 . Depends on #6678 .
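For context, the parsing here is about resolving entries of `CUDA_VISIBLE_DEVICES` that are MIG instance UUIDs rather than plain device indices when the CUDA context is created. The sketch below is only an illustration of that idea, not the code from this PR: the helper name `get_device_handle` is made up, and it assumes a pynvml/driver combination in which `nvmlDeviceGetHandleByUUID` accepts MIG UUIDs.

```python
# Illustrative sketch (not the actual PR code): resolve the first entry of
# CUDA_VISIBLE_DEVICES to an NVML device handle, treating "MIG-..." and
# "GPU-..." UUIDs differently from plain integer indices.
import os

import pynvml


def get_device_handle(visible_devices: str):
    """Return an NVML handle for the first visible device."""
    first = visible_devices.split(",")[0].strip()
    if first.startswith(("MIG-", "GPU-")):
        # NVML looks devices up by UUID; recent drivers accept MIG UUIDs too.
        return pynvml.nvmlDeviceGetHandleByUUID(first.encode())
    return pynvml.nvmlDeviceGetHandleByIndex(int(first))


pynvml.nvmlInit()
handle = get_device_handle(os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
```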
- [ ] Tests added / passed
- [ ] Passes `pre-commit run --all-files`
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   6h 26m 23s :stopwatch: -14m 56s
3 169 tests +4   3 084 :heavy_check_mark: +6   83 :zzz: ±0   2 :x: -1
23 448 runs +31   22 544 :heavy_check_mark: +36   900 :zzz: -2   4 :x: -2
For more details on these failures, see this check.
Results for commit 47147c44. ± Comparison against base commit f31fbde7.
:recycle: This comment has been updated with latest results.
#6678 is now in, could you rebase to drop those commits?
Can you check the gpuci failures here? https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/5098/
Yeah, I'm still trying to figure those out. Unfortunately the tests pass if you run them alone, but there seems to be some leakage from previous tests that causes them to fail. I'll keep trying.
Seems like pynvml returns the PID of the compute process as seen by the host, not the PID inside the container :(.
In a container:

```
In [1]: import numba.cuda

In [2]: import pynvml

In [3]: numba.cuda.current_context()
Out[3]: <CUDA context c_void_p(94546817482160) of device 0>

In [4]: pynvml.nvmlInit()

In [5]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [6]: process, = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)

In [7]: process.pid
Out[7]: 32487

In [8]: import os

In [9]: os.getpid()
Out[9]: 8611
```
Outside the container:
```
$ nvidia-smi
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32487      C   ...da/envs/ucx/bin/python3.9      303MiB |
+-----------------------------------------------------------------------------+
```
Edit: if we can run the container with `--pid=host` (unsure of the security consequences of this), then things would work (cf. https://github.com/NVIDIA/nvidia-docker/issues/1460).
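For reference, a check along the lines of the session above could be used to tell whether the PID namespace is shared with the host. The helper name below is hypothetical and not part of this PR, and it assumes the current process has already created a CUDA context on the device (as `numba.cuda.current_context()` does above):

```python
# Hypothetical helper, not part of this PR: returns True if the current
# process appears in NVML's compute-process list for the device, which only
# happens when the PID namespace is shared with the host (e.g. the container
# was started with --pid=host) or we are not running in a container at all.
import os

import pynvml


def pid_namespace_shared(device_index: int = 0) -> bool:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
    return any(proc.pid == os.getpid() for proc in procs)
```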
Alright, I've now marked both CUDA context tests as xfail and gpuCI is passing. Could you take one more look when you have a chance @wence- ?
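(For reference, the marking itself is just the standard pytest mechanism, roughly along these lines; the test name and reason below are illustrative rather than copied from the diff.)

```python
# Illustrative only -- the real test names and reasons are in the PR diff.
import pytest


@pytest.mark.xfail(
    reason="If running in Docker, requires --pid=host, see "
    "https://github.com/NVIDIA/nvidia-docker/issues/1460",
)
def test_cuda_context():
    ...
```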
All 3 failing tests are the same as reported in https://github.com/dask/distributed/issues/7208 . I'm not sure whether the codecov failure is an artefact of the failing tests or something else, but I don't believe this PR could affect coverage that heavily, so I'm not really sure what is happening there.
@fjetter @jrbourbeau wondering if you could help us figure out whether something is wrong with codecov. I don't believe this PR alone should decrease coverage, and if I open the codecov report for this PR and click on any file to expand it, nothing shows up. Is there some ongoing issue with codecov, or am I doing something wrong on my end?
For some reason (https://github.com/dask/distributed/actions/runs/3379719559/jobs/5611652549#step:20:36) the coverage failed to upload.
> I don't believe this PR alone should decrease coverage
All the MIG-based code (which I think accounts for most of the new lines, e.g. https://github.com/dask/distributed/pull/6720/files#diff-457ccaa4d4a70945b095c0460229791e8142c269e9ec781b98ab6d1b1925fd8cR201-R217) will be untested because nothing runs with MIG devices. Moreover, because gpuCI doesn't interact with codecov, I think the assumption will be that the comm/ucx.py and diagnostics/nvml.py files have zero coverage. This PR is a net increase in the number of lines in those files (so the total number of uncovered lines goes up, and coverage goes down).
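To make that concrete, here is the mechanism with made-up numbers (nothing below is measured from this repository):

```python
# Made-up numbers, purely to illustrate the mechanism: adding lines that
# codecov never sees executed increases the uncovered-line count, so the
# overall percentage drops even though no previously covered line changed.
covered, total = 50_000, 60_000  # hypothetical baseline line counts
added_uncovered = 150            # hypothetical uncovered lines added by a PR

before = covered / total
after = covered / (total + added_uncovered)
print(f"{before:.2%} -> {after:.2%}")  # 83.33% -> 83.13%
```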
A big hammer would be to ignore those files, per https://docs.codecov.com/docs/ignoring-paths (although those patterns don't look like regexes to me)
```diff
diff --git a/codecov.yml b/codecov.yml
index 319545ae..bf003804 100644
--- a/codecov.yml
+++ b/codecov.yml
@@ -1,6 +1,12 @@
 codecov:
   require_ci_to_pass: yes
+ignore:
+  # These files exercise GPU functionality which is tested but doesn't
+  # interact with coverage reports, so we ignore them
+  - "distributed/comm/ucx.py"
+  - "distributed/diagnostics/nvml.py"
+
 coverage:
   precision: 2
   round: down
```
> All the MIG-based code (which I think accounts for most of the new lines, e.g. https://github.com/dask/distributed/pull/6720/files#diff-457ccaa4d4a70945b095c0460229791e8142c269e9ec781b98ab6d1b1925fd8cR201-R217) will be untested because nothing runs with MIG devices.
Right, but codecov complains that the net change is -0.75%.
> Moreover, because gpuCI doesn't interact with codecov, I think the assumption will be that the comm/ucx.py and diagnostics/nvml.py files have zero coverage. This PR is a net increase in the number of lines in those files (so the total number of uncovered lines goes up, and coverage goes down).
So even if none of the lines added by this PR are actually seen by codecov, it wouldn't amount to 0.75% of the entire Distributed codebase. Finally, if you look at the full codecov report, many files that aren't related to this PR at all are affected (e.g., pytorch.py decreased by 97.73%). With that said, the failure to upload coverage files that you mentioned in https://github.com/dask/distributed/pull/6720#issuecomment-1302301864 sounds like the more plausible explanation.