Leverage cuda-python for GPU detection
After the 3.0.0 release, I tried to redo https://github.com/conda-forge/scs-feedstock/pull/21, but the problems with running the test suite remain. In particular, the GPU builds segfault when there's no GPU hardware (as happens in the conda-forge CI).
Very recently, NVIDIA's new Python wrappers for CUDA (cuda-python) reached general availability; this would presumably be an excellent tool for determining dynamically whether the GPU can actually be used.
@bodono, what do you think about adding a check (possibly conditional on its availability) so that the GPU tests are only run if the drivers & GPU can be found?
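A minimal sketch of what such a check could look like, assuming the `cuda-python` package is available (the `gpu_available` helper name is hypothetical; if the bindings are not installed, we conservatively report no GPU):

```python
def gpu_available() -> bool:
    """Return True if the CUDA driver and at least one device are usable."""
    try:
        from cuda import cuda  # provided by the cuda-python package
    except ImportError:
        return False
    err, = cuda.cuInit(0)  # initialize the driver API
    if err != cuda.CUresult.CUDA_SUCCESS:
        return False
    err, count = cuda.cuDeviceGetCount()
    return err == cuda.CUresult.CUDA_SUCCESS and count > 0
```

The test suite could then gate the GPU tests on this, e.g. with `pytest.mark.skipif(not gpu_available(), reason="no CUDA device detected")`.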
I think it's easier to do this in the C code rather than in Python. I have created this PR, which should make SCS fail cleanly if there is no GPU available: https://github.com/cvxgrp/scs/pull/181/files
Are you able to patch this in and test?
> Are you able to patch this in and test?
Sorry for the delayed response. I applied the patch in https://github.com/conda-forge/scs-feedstock/pull/21, but the test suite still segfaults (both on linux & on windows)...
Are we sure it's to do with the gpu? Just looking at this:
export PREFIX=/home/conda/feedstock_root/build_artifacts/scs-split_1636715761516/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh
export SRC_DIR=/home/conda/feedstock_root/build_artifacts/scs-split_1636715761516/test_tmp
import: 'scs'
import: '_scs_direct'
import: '_scs_indirect'
import: '_scs_direct'
import: '_scs_indirect'
import: 'scs'
+ pytest test/ -v
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- $PREFIX/bin/python
cachedir: .pytest_cache
rootdir: $SRC_DIR
Fatal Python error: Segmentation fault
I don't see `import _scs_gpu`.
> I don't see `import _scs_gpu`.
That's because it's not part of the test recipe:
```yaml
test:
  imports:
    - scs
    - _scs_direct
    - _scs_indirect
  requires:
    - pytest
  source_files:
    - test/
  commands:
    - pytest test/ -v
```
However, I'm sure it has to do with the GPU build/code paths somehow, because the test suite for the CPU version passes.
This is very strange, I don't understand how just building the gpu version could break like this. Is it all platforms (linux, mac, windows)?
For context, each of the _scs_direct, _scs_indirect, _scs_gpu are totally independent packages.
> This is very strange, I don't understand how just building the gpu version could break like this. Is it all platforms (linux, mac, windows)?
There are no GPU builds for macOS in conda-forge, but for linux & windows, the GPU builds are broken when trying to run the test suite (the imports work fine), while everything runs through for the CPU builds.
To a degree (which this issue is about), this is to be expected, because the Azure CI that conda-forge uses does not have actual GPUs. So at runtime, if a GPU-enabled package tries to access a GPU that is not there, things fail. Hence the desire to add device detection so that GPU builds don't crash if there's no physical hardware.
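For illustration, such runtime device detection could even be done without extra dependencies by querying the CUDA driver library directly via ctypes (a sketch only; `cuda_device_count` is a hypothetical helper, and library discovery differs per platform):

```python
import ctypes
import ctypes.util

def cuda_device_count() -> int:
    """Return the number of CUDA devices, or 0 if no driver/GPU is present."""
    libname = ctypes.util.find_library("cuda")  # e.g. libcuda.so on linux
    if libname is None:
        return 0  # no driver installed at all
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return 0
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return 0
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return 0
    return count.value
```

On a CI agent without a GPU this returns 0 instead of crashing, which is exactly the behavior we'd want the test harness to key off.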
That's what I'm confused by, the `_scs_direct` and `_scs_indirect` packages should never access the gpu since they are completely independent binaries, even if they were built at the same time as `_scs_gpu`. These segfaults are very strange. I added the check that will fail cleanly if trying to run on gpu and none is available. However, even before that change, when there was no GPU available it would also fail reasonably (i.e. without a segfault).
Previous behavior with no GPU:
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: null pointer
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: null pointer
** On entry to cusparseCreateCsr() parameter number 6 (csrColInd) had an illegal value: null pointer
linsys/gpu/indirect/private.c:357:scs_init_lin_sys_work
ERROR_CUDA (*): no CUDA-capable device is detected
ERROR: init_lin_sys_work failure
Failure:could not initialize work
**********************************************************
New behavior with no gpu:
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 2, constraints m: 4
cones: l: linear vars: 4
settings: eps_abs: 1.0e-06, eps_rel: 1.0e-06, eps_infeas: 1.0e-09
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 100000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 4, nnz(P): 2
cudaError: 100 (100 indicates no device)
ERROR: init_lin_sys_work failure
Failure:could not initialize work
**********************************************************
> That's what I'm confused by, the `_scs_direct` and `_scs_indirect` packages should never access the gpu
Do you mean the package imports here? As I said above, the imports work (for the GPU builds, even on an agent without a GPU), but the test suite fails.
In any case, great to hear that the failure should now be more graceful! I'm guessing these changes haven't made it to the repo(s) yet?
By test suite do you mean running `out/run_tests_gpu_indirect` is what is failing?
It should never seg fault with or without a gpu (even before the latest change to make the failing more graceful), it's very strange and it makes me think something else weird is going on.
By test suite I mean running the equivalent** of `pytest -v test/`.
** slight adaptation, because the test folder is not packaged in the same way as the package itself; for basically all intents and purposes it should be the same as running the tests in the source tree.
Ok I understand now, that does `import _scs_gpu`. Still, there shouldn't be a seg fault even without a gpu so I'm not sure what's going on here.
> Ok I understand now, that does `import _scs_gpu`
OK cool, glad we're on the same page now
> Still, there shouldn't be a seg fault even without a gpu so I'm not sure what's going on here.
I still have artefact persistence switched on in https://github.com/conda-forge/scs-feedstock/pull/21. You could try again to download an appropriate artefact, unpack it, and then install it into an environment. If we can get past the resolver errors this time, then you could have a closer look at what's happening... 🙃
> [...] and then install it into an environment
To recall:
- use `conda info` to see which cuda version is detected on your system (anything higher than 11.2 works with the 11.2 artefact)
- download the artefact with the appropriate platform / cuda version / python version (e.g. 3.8)
- unpack the artefact until you get to the first folder that contains `channeldata.json`
- `conda create -n test_env -c "path/to/said/folder" -c conda-forge python=3.8 scs`
- `conda activate test_env`
- etc.
@bodono, I've tried again for 3.1.0, and `import _scs_gpu` still segfaults hard on both linux and windows in the absence of a GPU.
Could we give it another shot with you installing one of the artefacts? I think the setup has hopefully improved enough now that you should be able to install it (the last CI run on that PR has green CI because I switched off the failing test suite so that the artefacts are more easily installable) - the instructions in the previous comment remain correct.
I'm looking at this now. Two issues:
- How do I tell which artifact is the right python / cuda version? E.g., what does `conda_artifacts_20220114.5.1_linux_64_c_compiler_version9cuda_co_h934bae3275` correspond to?
- I tried following the instructions with the artifact above and got the following error:
Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/exceptions.py", line 1062, in __call__
return func(*args, **kwargs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/main.py", line 84, in _main
exit_code = do_call(args, p)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/conda_argparse.py", line 82, in do_call
exit_code = getattr(module, func_name)(args, parser)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/main_create.py", line 37, in execute
install(args, parser, 'create')
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/cli/install.py", line 256, in install
force_reinstall=context.force_reinstall or context.force,
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 112, in solve_for_transaction
force_remove, force_reinstall)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 150, in solve_for_diff
force_remove)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 249, in solve_final_state
ssc = self._collect_all_metadata(ssc)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 389, in _collect_all_metadata
index, r = self._prepare(prepared_specs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/solve.py", line 974, in _prepare
self.subdirs, prepared_specs, self._repodata_fn)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/index.py", line 214, in get_reduced_index
repodata_fn=repodata_fn)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 91, in query_all
result = tuple(concat(executor.map(subdir_query, channel_urls)))
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 556, in result_iterator
yield future.result()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 398, in result
return self.__get_result()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 87, in <lambda>
package_ref_or_match_spec))
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 96, in query
self.load()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 160, in load
_internal_state = self._load()
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 262, in _load
_internal_state = self._process_raw_repodata_str(raw_repodata_str)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/site-packages/conda/core/subdir_data.py", line 335, in _process_raw_repodata_str
json_obj = json.loads(raw_repodata_str or '{}')
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/google/home/bodonoghue/miniconda2/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 49 (char 48)
`$ /usr/local/google/home/bodonoghue/miniconda2/bin/conda create -n test_env -c . -c conda-forge python=3.7 scs`
environment variables:
AUTO_PROXY=<set>
CIO_TEST=<not set>
CONDA_DEFAULT_ENV=python37
CONDA_EXE=/usr/local/google/home/bodonoghue/miniconda2/bin/conda
CONDA_MKL_INTERFACE_LAYER_BACKUP=
CONDA_PREFIX=/usr/local/google/home/bodonoghue/miniconda2/envs/python37
CONDA_PREFIX_1=/usr/local/google/home/bodonoghue/miniconda2
CONDA_PROMPT_MODIFIER=(python37)
CONDA_PYTHON_EXE=/usr/local/google/home/bodonoghue/miniconda2/bin/python
CONDA_ROOT=/usr/local/google/home/bodonoghue/miniconda2
CONDA_SHLVL=2
CUDA_BIN_PATH=/usr/local/cuda/bin
PATH=/usr/local/google/home/bodonoghue/miniconda2/bin:/usr/local/google/hom
e/bodonoghue/bin:/usr/local/google/home/bodonoghue/.luarocks/bin:/usr/
local/google/home/bodonoghue/miniconda2/envs/python37/bin:/usr/local/g
oogle/home/bodonoghue/miniconda2/condabin:/usr/local/google/home/bodon
oghue/bin:/usr/local/google/home/bodonoghue/.luarocks/bin:/usr/lib/goo
gle-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin:/usr
/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/bodono
ghue/.fzf/bin:/usr/local/google/home/bodonoghue/miniconda2/bin:/usr/lo
cal/google/home/bodonoghue/miniconda2/bin
PYTHONPATH=/usr/local/buildtools/current/sitecustomize
REQUESTS_CA_BUNDLE=<not set>
SSL_CERT_FILE=<not set>
active environment : python37
active env location : /usr/local/google/home/bodonoghue/miniconda2/envs/python37
shell level : 2
user config file : /usr/local/google/home/bodonoghue/.condarc
populated config files :
conda version : 4.7.10
conda-build version : not installed
python version : 3.6.2.final.0
virtual packages : __cuda=11.4
base environment : /usr/local/google/home/bodonoghue/miniconda2 (writable)
channel URLs : https://conda.anaconda.org/./linux-64
https://conda.anaconda.org/./noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /usr/local/google/home/bodonoghue/miniconda2/pkgs
/usr/local/google/home/bodonoghue/.conda/pkgs
envs directories : /usr/local/google/home/bodonoghue/miniconda2/envs
/usr/local/google/home/bodonoghue/.conda/envs
platform : linux-64
user-agent : conda/4.7.10 requests/2.22.0 CPython/3.6.2 Linux/5.10.46-5rodete1-amd64 debian/rodete glibc/2.33
UID:GID : 348005:89939
netrc file : None
offline mode : False
An unexpected error has occurred. Conda has prepared the above report.
If submitted, this report will be used by core maintainers to improve
future releases of conda.
Can you try `-c "/an/absolute/path"` instead of `-c .`?
On the artefact side, not all cuda versions support the relevant gcc versions, which are therefore mixed in and "pollute" the build string. Can you tell me which cuda/python version you need? It's possible to look it up from the logs, but a bit tedious. For Windows, it should already be visible.
Using the absolute path worked. After installing and activating the environment I navigated to the scs-python directory and ran pytest successfully on my linux machine with a GPU.
2022-01-26 10:16:19 (test_env) 0 bodonoghue@bodonoghue-[]-~/git/scs-python/test:
└──[ins]=> pytest .
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.7.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /usr/local/google/home/bodonoghue/git/scs-python
collected 20 items
test_scs_basic.py ..... [ 25%]
test_scs_rand.py ..... [ 50%]
test_scs_sdp.py ..... [ 75%]
test_solve_random_cone_prob.py ..... [100%]
================================================================================== 20 passed in 14.69s ==================================================================================
Hmm, on my linux machine when I try to import _scs_gpu I get
└──[ins] >>> import _scs_gpu
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named '_scs_gpu'
Can you verify using `conda list` that you've installed the local artefact? (If it's not found, conda falls back to conda-forge, where the published builds don't have GPU support yet.)
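A quick way to check this from Python itself (just a sketch; it relies only on the fact that the locally built GPU-enabled artefact ships the `_scs_gpu` extension, which the published conda-forge builds do not):

```python
import importlib.util

# The locally built (GPU-enabled) artefact ships the _scs_gpu extension;
# the builds published on conda-forge currently do not.
have_gpu_ext = importlib.util.find_spec("_scs_gpu") is not None
print("GPU extension present:", have_gpu_ext)
```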
Got it, looks like it's using scs from conda-forge:
scs conda-forge/linux-64::scs-3.1.0-py37he7248de_0
Presumably it's because I'm using the wrong artifact. I'm on a linux machine with `virtual packages : __cuda=11.4`. If you can point me to a valid artifact I will try that.
You need one of the builds that says cuda 11.2, for example this one (this is for python 3.9). The build variant is not fully visible in the artefact name, but it is visible in the job overview, which should also lead to the right artefact. Barring that, try the 19th one down on the artefact overview page.
For context, 11.2 is actually 11.2+ (i.e. compatible with all later minor versions of cuda 11)
You can probably also "fail faster" with the wrong artefacts by using strict channel priority:
`conda config --set channel_priority strict`
Which is the recommended default anyway...
Yes I found that page, but when I click on the '1 artifact produced' link it just brings me to the page of all the artifacts and I couldn't figure out which one corresponded to the link I had clicked. Anyway, with the 19th artifact down and using strict channel priority (both specifying and not specifying python=3.9) I get:
└──[ins] => conda create -n test_env -c /usr/local/google/home/bodonoghue/Downloads/build_artifacts -c conda-forge scs
Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package numpy conflicts for:
scs -> numpy[version='>=1.19.5,<2.0a0']
Package python conflicts for:
scs -> python[version='>=3.7,<3.8.0a0']
Package libblas conflicts for:
scs -> libblas[version='>=3.8.0,<4.0a0']
Package pypy3.7 conflicts for:
scs -> pypy3.7[version='>=7.3.7']
Package cudatoolkit conflicts for:
scs -> cudatoolkit[version='>=11.2,<12']
Package scipy conflicts for:
scs -> scipy
Package cvxpy conflicts for:
scs -> cvxpy[version='>1.1.15']
Package __glibc conflicts for:
scs -> __glibc[version='>=2.17']
Package scs-proc conflicts for:
scs -> scs-proc=[build=cuda]
Package python_abi conflicts for:
scs -> python_abi==3.7[build=*_pypy37_pp73]
Package liblapack conflicts for:
scs -> liblapack[version='>=3.8.0,<4.0a0']
Package libgcc-ng conflicts for:
scs -> libgcc-ng[version='>=9.4.0']
Note that strict channel priority may have removed packages required for satisfiability.
I think the main way to use SCS is the direct CPU solver, the GPU solver is a bit niche and in many (most?) cases is actually slower than the direct solver for the time being. With that in mind maybe we should pause on this for now?
It's possible that either you or I miscounted, or that the order on the artefact page is not the same as for the jobs. In any case it seems that you got the pypy build rather than the one for cpython 3.9.
Could you maybe have a look at the windows side of things for the time being - there the artefacts should be named unambiguously.
Once I have access to a computer again, I'll update the PR so that the Linux builds also get artefact names that are decipherable.
There's no urgency on my end, but it's still something that I think plays to conda-forge's strengths, and it would be good to have it sorted out. Presumably over time, the GPU variant will have some aspects where it outperforms the CPU version.
Can you post a link to the exact artifact I should use? You can get it from the right-hand side menu via "Copy download URL". Unfortunately I don't have access to a windows machine.
> Can you post a link to the exact artifact I should use?
Sorry for the delay, didn't have a laptop available for a while. This download is for linux/x86 + python=3.9 + cuda>=11.2 - could you give it a try? 🙃
Sorry for the delay. I'm still getting UnsatisfiableError with that exact artifact:
└──[ins] => conda create -n test_env -c /usr/local/google/home/bodonoghue/Downloads/build_artifacts -c conda-forge scs
Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package scs-proc conflicts for:
scs -> scs-proc=[build=cuda]
Package cudatoolkit conflicts for:
scs -> cudatoolkit[version='>=11.2,<12']
Package libblas conflicts for:
scs -> libblas[version='>=3.8.0,<4.0a0']
Package libgcc-ng conflicts for:
scs -> libgcc-ng[version='>=9.4.0']
Package scipy conflicts for:
scs -> scipy
Package cvxpy conflicts for:
scs -> cvxpy[version='>1.1.15']
Package numpy conflicts for:
scs -> numpy[version='>=1.19.5,<2.0a0']
Package __glibc conflicts for:
scs -> __glibc[version='>=2.17']
Package python conflicts for:
scs -> python[version='>=3.9,<3.10.0a0']
Package liblapack conflicts for:
scs -> liblapack[version='>=3.8.0,<4.0a0']
Package python_abi conflicts for:
scs -> python_abi=3.9[build=*_cp39]
Note that strict channel priority may have removed packages required for satisfiability.
> Package __glibc conflicts for: scs -> __glibc[version='>=2.17']
On which system are you running, and what's your current glibc version?