Leverage cuda-python for GPU detection
After the 3.0.0 release, I tried to redo https://github.com/conda-forge/scs-feedstock/pull/21, but the problems with running the test suite remain. In particular, the GPU builds segfault when there's no GPU hardware (as happens in the conda-forge CI).
Very recently, NVIDIA's new Python wrappers for CUDA (cuda-python) reached general availability; this would presumably be an excellent tool for determining dynamically whether the GPU can actually be used.
@bodono, what do you think about adding a check (possibly conditional on its availability) so that the GPU tests are only run if the drivers & GPU can be found?
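A minimal sketch of what such a check could look like, assuming the `cuda-python` package is available (the `gpu_available` helper name is hypothetical; if the bindings are not installed, we conservatively report no GPU):

```python
def gpu_available() -> bool:
    """Return True if the CUDA driver and at least one device are usable."""
    try:
        from cuda import cuda  # provided by the cuda-python package
    except ImportError:
        return False
    err, = cuda.cuInit(0)  # initialize the driver API
    if err != cuda.CUresult.CUDA_SUCCESS:
        return False
    err, count = cuda.cuDeviceGetCount()
    return err == cuda.CUresult.CUDA_SUCCESS and count > 0
```

The test suite could then gate the GPU tests on this, e.g. with `pytest.mark.skipif(not gpu_available(), reason="no CUDA device detected")`.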
I think it's easier to do this in the C code rather than in Python. I have created this PR, which should make SCS fail cleanly if there is no GPU available: https://github.com/cvxgrp/scs/pull/181/files
Are you able to patch this in and test?
> Are you able to patch this in and test?
Sorry for the delayed response. I applied the patch in https://github.com/conda-forge/scs-feedstock/pull/21, but the test suite still segfaults (both on linux & on windows)...
Are we sure it's to do with the gpu? Just looking at this:
export PREFIX=/home/conda/feedstock_root/build_artifacts/scs-split_1636715761516/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh
export SRC_DIR=/home/conda/feedstock_root/build_artifacts/scs-split_1636715761516/test_tmp
import: 'scs'
import: '_scs_direct'
import: '_scs_indirect'
import: '_scs_direct'
import: '_scs_indirect'
import: 'scs'
+ pytest test/ -v
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- $PREFIX/bin/python
cachedir: .pytest_cache
rootdir: $SRC_DIR
Fatal Python error: Segmentation fault
I don't see `import _scs_gpu`.
> I don't see `import _scs_gpu`.
That's because it's not part of the test recipe:
```yaml
test:
  imports:
    - scs
    - _scs_direct
    - _scs_indirect
  requires:
    - pytest
  source_files:
    - test/
  commands:
    - pytest test/ -v
```
However, I'm sure it has to do with the GPU build/code paths somehow, because the test suite for the CPU version passes.
This is very strange, I don't understand how just building the gpu version could break like this. Is it all platforms (linux, mac, windows)?
For context, each of the _scs_direct, _scs_indirect, _scs_gpu are totally independent packages.
> This is very strange, I don't understand how just building the gpu version could break like this. Is it all platforms (linux, mac, windows)?
There are no GPU builds for macOS in conda-forge, but for linux & windows, the GPU builds are broken when trying to run the test suite (the imports work fine), while everything runs through for the CPU builds.
To a degree (which this issue is about), this is to be expected, because the Azure CI that conda-forge uses does not have actual GPUs. So at runtime, if a GPU-enabled package tries to access a GPU that is not there, things fail. Hence the desire to add device detection so that GPU builds don't crash if there's no physical hardware.
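For illustration, such runtime device detection could even be done without extra dependencies by querying the CUDA driver library directly via ctypes (a sketch only; `cuda_device_count` is a hypothetical helper, and library discovery differs per platform):

```python
import ctypes
import ctypes.util

def cuda_device_count() -> int:
    """Return the number of CUDA devices, or 0 if no driver/GPU is present."""
    libname = ctypes.util.find_library("cuda")  # e.g. libcuda.so on linux
    if libname is None:
        return 0  # no driver installed at all
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return 0
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return 0
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return 0
    return count.value
```

On a CI agent without a GPU this returns 0 instead of crashing, which is exactly the behavior we'd want the test harness to key off.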
That's what I'm confused by, the `_scs_direct` and `_scs_indirect` packages should never access the gpu since they are completely independent binaries, even if they were built at the same time as `_scs_gpu`. These segfaults are very strange. I added the check that will fail cleanly if trying to run on gpu and none is available. However, even before that change, when there was no GPU available it would also fail reasonably (i.e. without a segfault).
Previous behavior with no GPU:
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: null pointer
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateCsr() parameter number 6 (csrColInd) had an illegal value: null pointer
linsys/gpu/indirect/private.c:357:scs_init_lin_sys_work
ERROR_CUDA (*): no CUDA-capable device is detected
ERROR: init_lin_sys_work failure
Failure:could not initialize work
**********************************************************
New behavior with no gpu:
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
cudaError: 100 (100 indicates no device)
ERROR: init_lin_sys_work failure
Failure:could not initialize work
**********************************************************
> That's what I'm confused by, the `_scs_direct` and `_scs_indirect` packages should never access the gpu
Do you mean the package imports here? As I said above, the imports work (for the GPU builds, even on an agent without a GPU), but the test suite fails.
In any case, great to hear that the failure should now be more graceful! I'm guessing these changes haven't made it to the repo(s) yet?
By test suite do you mean running `out/run_tests_gpu_indirect` is what is failing?
It should never seg fault with or without a gpu (even before the latest change to make the failing more graceful), it's very strange and it makes me think something else weird is going on.
By test suite I mean running the equivalent** of `pytest -v test/`.
** slight adaptation, because the test folder is not packaged in the same way as the package itself; for basically all intents and purposes it should be the same as running the tests in the source tree.
Ok I understand now, that does `import _scs_gpu`. Still, there shouldn't be a seg fault even without a gpu so I'm not sure what's going on here.
> Ok I understand now, that does `import _scs_gpu`
OK cool, glad we're on the same page now
> Still, there shouldn't be a seg fault even without a gpu so I'm not sure what's going on here.
I still have artefact persistence switched on in https://github.com/conda-forge/scs-feedstock/pull/21. You could try again to download an appropriate artefact, unpack it, and then install it into an environment. If we can get past the resolver errors this time, then you could have a closer look at what's happening... 🙃
> [...] and then install it into an environment
To recall:
- use `conda info` to see which cuda version is detected on your system (anything higher than 11.2 works with the 11.2 artefact)
- download the artefact with the appropriate platform / cuda version / python version (e.g. 3.8)
- unpack the artefact until you get to the first folder that contains `channeldata.json`
- `conda create -n test_env -c "path/to/said/folder" -c conda-forge python=3.8 scs`
- `conda activate test_env`
- etc.
@bodono, I've tried again for 3.1.0, and `import _scs_gpu` still segfaults hard on both linux and windows in the absence of a GPU.
Could we give it another shot with you installing one of the artefacts? I think the setup has hopefully improved enough now that you should be able to install it (the last CI run on that PR has green CI because I switched off the failing test suite so that the artefacts are more easily installable) - the instructions in the previous comment remain correct.
I'm looking at this now. Two issues:
- How do I tell which artifact is the right python / cuda version? E.g., what does `conda_artifacts_20220114.5.1_linux_64_c_compiler_version9cuda_co_h934bae3275` correspond to?
- I tried following the instructions with the artifact above and got the following error:
Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/exceptions.py", line 1062, in __call__
return func(*args, **kwargs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/main.py", line 84, in _main
exit_code = do_call(args, p)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/conda_argparse.py", line 82, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/main_create.py", line 37, in execute
install(args, parser, 'create')
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/install.py", line 256, in install
force_reinstall=context.force_reinstall or context.force,
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 112, in solve_for_transaction
force_remove, force_reinstall)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 150, in solve_for_diff
force_remove)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 249, in solve_final_state
ssc = self._collect_all_metadata(ssc)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 389, in _collect_all_metadata
index, r = self._prepare(prepared_specs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 974, in _prepare
self.subdirs, prepared_specs, self._repodata_fn)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/index.py", line 214, in get_reduced_index
repodata_fn=repodata_fn)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 91, in query_all
result = tuple(concat(executor.map(subdir_query, channel_urls)))
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 556, in result_iterator
yield future.result()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 398, in result
return self.__get_result()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 87, in <lambda>
package_ref_or_match_spec))
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 96, in query
self.load()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 160, in load
_internal_state = self._load()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 262, in _load
_internal_state = self._process_raw_repodata_str(raw_repodata_str)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 335, in _process_raw_repodata_str
json_obj = json.loads(raw_repodata_str or '{}')
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 49 (char 48)
`$ /usr/local/google/home/bodonoghue/miniconda2/bin/conda create -n test_env -c . -c conda-forge python=3.7 scs`
environment variables:
AUTO_PROXY=<set>
CIO_TEST=<not set>
CONDA_DEFAULT_ENV=python37
CONDA_EXE=/usr/local/google/home/bodonoghue/miniconda2/bin/conda
CONDA_MKL_INTERFACE_LAYER_BACKUP=
CONDA_PREFIX=/usr/local/google/home/bodonoghue/miniconda2/envs/python37
CONDA_PREFIX_1=/usr/local/google/home/bodonoghue/miniconda2
CONDA_PROMPT_MODIFIER=(python37)
CONDA_PYTHON_EXE=/usr/local/google/home/bodonoghue/miniconda2/bin/python
CONDA_ROOT=/usr/local/google/home/bodonoghue/miniconda2
CONDA_SHLVL=2
CUDA_BIN_PATH=/usr/local/cuda/bin
PATH=/usr/local/google/home/bodonoghue/miniconda2/bin:/usr/local/google/hom
e/bodonoghue/bin:/usr/local/google/home/bodonoghue/.luarocks/bin:/usr/
local/google/home/bodonoghue/miniconda2/envs/python37/bin:/usr/local/g
oogle/home/bodonoghue/miniconda2/condabin:/usr/local/google/home/bodon
oghue/bin:/usr/local/google/home/bodonoghue/.luarocks/bin:/usr/lib/goo
gle-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin:/usr
/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/bodono
ghue/.fzf/bin:/usr/local/google/home/bodonoghue/miniconda2/bin:/usr/lo
cal/google/home/bodonoghue/miniconda2/bin
PYTHONPATH=/usr/local/buildtools/current/sitecustomize
REQUESTS_CA_BUNDLE=<not set>
SSL_CERT_FILE=<not set>
active environment : python37
active env location : /usr/local/google/home/bodonoghue/miniconda2/envs/python37
shell level : 2
user config file : /usr/local/google/home/bodonoghue/.condarc
populated config files :
conda version : 4.7.10
conda-build version : not installed
python version : 3.6.2.final.0
virtual packages : __cuda=11.4
base environment : /usr/local/google/home/bodonoghue/miniconda2 (writable)
channel URLs : https://conda.anaconda.org/./linux-64
https://conda.anaconda.org/./noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /usr/local/google/home/bodonoghue/miniconda2/pkgs
/usr/local/google/home/bodonoghue/.conda/pkgs
envs directories : /usr/local/google/home/bodonoghue/miniconda2/envs
/usr/local/google/home/bodonoghue/.conda/envs
platform : linux-64
user-agent : conda/4.7.10 requests/2.22.0 CPython/3.6.2 Linux/5.10.46-5rodete1-amd64 debian/rodete glibc/2.33
UID:GID : 348005:89939
netrc file : None
offline mode : False
An unexpected error has occurred. Conda has prepared the above report.
If submitted, this report will be used by core maintainers to improve
future releases of conda.
Can you try `-c "/an/absolute/path"` instead of `-c .`?
On the artefact side, not all cuda versions support the relevant gcc versions, which are therefore mixed in and "pollute" the build string. Can you tell me which cuda/python version you need? It's possible to look it up from the logs, but a bit tedious. For Windows, it should already be visible.
Using the absolute path worked. After installing and activating the environment I navigated to the scs-python directory and ran pytest successfully on my linux machine with a GPU.
2022-01-26 10:16:19 (test_env) 0 bodonoghue@bodonoghue-[]-~/git/scs-python/test:
└──[ins]=> pytest .
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.7.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /usr/local/google/home/bodonoghue/git/scs-python
collected 20 items
test_scs_basic.py ..... [ 25%]
test_scs_rand.py ..... [ 50%]
test_scs_sdp.py ..... [ 75%]
test_solve_random_cone_prob.py ..... [100%]
================================================================================== 20 passed in 14.69s ==================================================================================
Hmm, on my linux machine when I try to import _scs_gpu I get
└──[ins] >>> import _scs_gpu
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named '_scs_gpu'
Can you verify using `conda list` that you've installed the local artefact? (If it's not found, conda falls back to conda-forge, where the published builds don't have GPU support yet.)
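A quick way to check this from Python itself (just a sketch; it relies only on the fact that the locally built GPU-enabled artefact ships the `_scs_gpu` extension, which the published conda-forge builds do not):

```python
import importlib.util

# The locally built (GPU-enabled) artefact ships the _scs_gpu extension;
# the builds published on conda-forge currently do not.
have_gpu_ext = importlib.util.find_spec("_scs_gpu") is not None
print("GPU extension present:", have_gpu_ext)
```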
Got it, looks like it's using scs from conda-forge:
scs conda-forge/linux-64::scs-3.1.0-py37he7248de_0
Presumably it's because I'm using the wrong artifact. I'm on a linux machine with `virtual packages : __cuda=11.4`. If you can point me to a valid artifact I will try that.
You need one of the builds that says cuda 11.2, for example this one (this is for python 3.9). The build variant is not fully visible in the artefact name, but it is visible in the job overview, which should also lead to the right artefact. Barring that, try the 19th one down on the artefact overview page.
For context, 11.2 is actually 11.2+ (i.e. compatible with all later minor versions of cuda 11)
You can probably also "fail faster" with the wrong artefacts by using strict channel priority:
`conda config --set channel_priority strict`
Which is the recommended default anyway...
Yes I found that page, but when I click on the '1 artifact produced' link it just brings me to the page of all the artifacts and I couldn't figure out which one corresponded to the link I had clicked. Anyway, with the 19th artifact down and using strict channel priority (both specifying and not specifying python=3.9) I get:
└──[ins] => conda create -n test_env -c /usr/local/google/home/bodonoghue/Downloads/build_artifacts -c conda-forge scs
Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package numpy conflicts for:
scs -> numpy[version='>=1.19.5,<2.0a0']
Package python conflicts for:
scs -> python[version='>=3.7,<3.8.0a0']
Package libblas conflicts for:
scs -> libblas[version='>=3.8.0,<4.0a0']
Package pypy3.7 conflicts for:
scs -> pypy3.7[version='>=7.3.7']
Package cudatoolkit conflicts for:
scs -> cudatoolkit[version='>=11.2,<12']
Package scipy conflicts for:
scs -> scipy
Package cvxpy conflicts for:
scs -> cvxpy[version='>1.1.15']
Package __glibc conflicts for:
scs -> __glibc[version='>=2.17']
Package scs-proc conflicts for:
scs -> scs-proc=[build=cuda]
Package python_abi conflicts for:
scs -> python_abi==3.7[build=*_pypy37_pp73]
Package liblapack conflicts for:
scs -> liblapack[version='>=3.8.0,<4.0a0']
Package libgcc-ng conflicts for:
scs -> libgcc-ng[version='>=9.4.0']
Note that strict channel priority may have removed packages required for satisfiability.
I think the main way to use SCS is the direct CPU solver, the GPU solver is a bit niche and in many (most?) cases is actually slower than the direct solver for the time being. With that in mind maybe we should pause on this for now?
It's possible that either you or I miscounted, or that the order on the artefact page is not the same as for the jobs. In any case it seems that you got the pypy build rather than the one for cpython 3.9.
Could you maybe have a look at the windows side of things for the time being - there the artefacts should be named unambiguously.
Once I have access to a computer again, I'll update the PR so that the Linux builds also get artefact names that are decipherable.
There's no urgency on my end, but it's still something that I think plays to conda-forge's strengths, and it would be good to have it sorted out. Presumably over time, the GPU variant will have some aspects where it outperforms the CPU version.
Can you post a link to the exact artifact I should use? You can get it from the right-hand side menu via "Copy download URL". Unfortunately I don't have access to a windows machine.
> Can you post a link to the exact artifact I should use?
Sorry for the delay, didn't have a laptop available for a while. This download is for linux/x86 + python=3.9 + cuda>=11.2 - could you give it a try? 🙃
Sorry for the delay. I'm still getting UnsatisfiableError with that exact artifact:
└──[ins] => conda create -n test_env -c /usr/local/google/home/bodonoghue/Downloads/build_artifacts -c conda-forge scs
Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package scs-proc conflicts for:
scs -> scs-proc=[build=cuda]
Package cudatoolkit conflicts for:
scs -> cudatoolkit[version='>=11.2,<12']
Package libblas conflicts for:
scs -> libblas[version='>=3.8.0,<4.0a0']
Package libgcc-ng conflicts for:
scs -> libgcc-ng[version='>=9.4.0']
Package scipy conflicts for:
scs -> scipy
Package cvxpy conflicts for:
scs -> cvxpy[version='>1.1.15']
Package numpy conflicts for:
scs -> numpy[version='>=1.19.5,<2.0a0']
Package __glibc conflicts for:
scs -> __glibc[version='>=2.17']
Package python conflicts for:
scs -> python[version='>=3.9,<3.10.0a0']
Package liblapack conflicts for:
scs -> liblapack[version='>=3.8.0,<4.0a0']
Package python_abi conflicts for:
scs -> python_abi=3.9[build=*_cp39]
Note that strict channel priority may have removed packages required for satisfiability.
> Package __glibc conflicts for: scs -> __glibc[version='>=2.17']
On which system are you running, and what's your current glibc version?