hiop icon indicating copy to clipboard operation
hiop copied to clipboard

Fix segfault, remove nonsymmetric ginkgo solver

Open fritzgoebel opened this issue 3 years ago • 2 comments

This PR should fix issues #540 and #542. It also removes the nonsymmetric ginkgo solver as previously suggested by @pelesh.

The problem were a missing host side copy of the rhs vector for #540 and directly accessing matrix and vector values located in GPU memory in update_matrix for #542. This is fine as long as they are in unified memory, which the apparently are when built with Debug. I added a host side copy of the matrix that gets updated and moved back to the GPU instead.

fritzgoebel avatar Sep 22 '22 10:09 fritzgoebel

I think this is relevant for other GPU solvers in HiOp.

@kswirydo @nychiang

pelesh avatar Sep 22 '22 14:09 pelesh

I think this is relevant for other GPU solvers in HiOp.

@kswirydo @nychiang

Not sure it actually is, the GPU memory was coming from inside a ginkgo matrix

fritzgoebel avatar Sep 22 '22 14:09 fritzgoebel

PNNL Pipelines seem to be giving a false negative here. Tests failed initially, but are passing upon a retry. I would suggest adding a new commit with https://github.com/LLNL/hiop/blob/develop/.gitlab-ci.yml#L159 changed to not include -E NlpSparse..., and we can see if the failing tests are now passing!

Thankfully this doesn't require a rebuild, so this should be a quick check.

cameronrutherford avatar Sep 27 '22 17:09 cameronrutherford

Tests are passing on ascent and summit with HiOp built in Release mode. I think this fixes #542.

nkoukpaizan avatar Oct 31 '22 16:10 nkoukpaizan

Seems like CUDA_ARCHITECTURES cannot readily be passed from the CLI in the platform variables script into the CMake config. As a workaround, I have specified all Cuda architectures for Marianas in gcc-cuda.cmake. The only downside is that now all CI platforms/developers that use this config will by default now build with all the architectures, rather than only on Marianas as would be desirable.

cameronrutherford avatar Nov 08 '22 16:11 cameronrutherford

I was able to get tests passing when running manually on previously failing platform.

I am not able to re-produce old issue with updated module set. I assume that since ginkgo is built with the fully array of cuda architectures, that the CUDA runtime is smart enough to pick up the right spec, despite HiOp being built with the "wrong" cuda arch.

Unless people have qualms with how I hacked this together, I think this should be good to merge.

As a side note, ExaGO now has updated modules ready to go with a fresh build of latest hiop@develop.

cameronrutherford avatar Nov 08 '22 16:11 cameronrutherford

Thanks @CameronRutherford!

fritzgoebel avatar Nov 08 '22 16:11 fritzgoebel

The only downside is that now all CI platforms/developers that use this config will by default now build with all the architectures, rather than only on Marianas as would be desirable.

This is not a big issue. If you want to distribute HiOp binary, for example, this is exactly how you would build it.

pelesh avatar Nov 08 '22 16:11 pelesh

All tests passing in CI

cameronrutherford avatar Nov 08 '22 16:11 cameronrutherford

Passing spack builds in GitHub actions!

cameronrutherford avatar Nov 08 '22 17:11 cameronrutherford