
HIP compilation and fixing GGA/MGGA vvars kernels

Open ryanstocks00 opened this issue 1 year ago • 19 comments

Similar to https://github.com/wavefunction91/ExchCXX/pull/39, I had to make these minor changes to get GauXC to compile with HIP

ryanstocks00 avatar Jul 18 '24 08:07 ryanstocks00

@wavefunction91 @ryanstocks00 After a couple of minor tweaks to the build to enable successful hipblas discovery, I was able to build the code on Frontier. I made sure this PR points to @ryanstocks00's fork of ExchCXX. However, when I run a test, I get

  what():  Generic GauXC Exception (EXX + non-CUDA NYI)
  File     /scratch/panyala/bdft/GauXC_External-prefix/src/GauXC_External/src/xc_integrator/local_work_driver/device/scheme1_base.cxx
  Function virtual void GauXC::AoSScheme1Base::eval_exx_ek_screening_bfn_stats(XCDeviceData *)
  Line     1389

ajaypanyala avatar Jul 19 '24 20:07 ajaypanyala

Thanks Ajay, yeah, we don't have a HIP implementation of sn-K (although it might be easy to add; we would just need to tune the kernel params, since it's linear along the wavefront), so we just need to add the logic to not run that test.
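
A minimal sketch of what that skip logic could look like, assuming a Catch2 test and a hypothetical `GAUXC_HAS_HIP` build macro (the actual macro and test names in GauXC may differ):

```cpp
// Hedged sketch, not the actual GauXC test code: skip the sn-K (EXX) device
// test under HIP builds. GAUXC_HAS_HIP is a hypothetical build macro.
#include <catch2/catch.hpp>

TEST_CASE("sn-K / EXX Device", "[exx]") {
#ifdef GAUXC_HAS_HIP
  // EXX + sn-K kernels are NYI for non-CUDA device backends; skip.
  SUCCEED("sn-K NYI for HIP backends - test skipped");
  return;
#else
  // ... the existing CUDA-only EXX / sn-K checks would run here ...
#endif
}
```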

Can you provide the diff for the build system stuff?

wavefunction91 avatar Jul 19 '24 21:07 wavefunction91

No worries, I realized that just now regarding sn-K. The regular XC eval works fine. Can I commit the build system changes to this PR?

ajaypanyala avatar Jul 19 '24 21:07 ajaypanyala

That would be up to @ryanstocks00.

wavefunction91 avatar Jul 19 '24 23:07 wavefunction91

This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority): the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers (i.e. the kernel is too large for the launch parameters, and we never really hardened the implementation to do the "right" thing). Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.
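
As an illustration of the kind of hardening being described - a sketch with hypothetical names, not GauXC's actual launch path - one could wrap each kernel launch so that a config exceeding the device limits fails loudly instead of silently:

```cpp
// Hedged sketch (hypothetical helper, not GauXC's API): make kernel-launch
// failures loud by checking hipGetLastError after every launch.
#include <hip/hip_runtime.h>
#include <stdexcept>
#include <string>

#define HIP_CHECK(call)                                               \
  do {                                                                \
    hipError_t err_ = (call);                                         \
    if (err_ != hipSuccess)                                           \
      throw std::runtime_error(std::string("HIP error: ") +          \
                               hipGetErrorString(err_));              \
  } while (0)

template <typename Kernel, typename... Args>
void checked_launch(Kernel k, dim3 grid, dim3 block, size_t shmem,
                    hipStream_t stream, Args... args) {
  k<<<grid, block, shmem, stream>>>(args...);
  // Catches invalid launch configurations (e.g. too many threads or too much
  // shared memory for this architecture) that would otherwise pass silently.
  HIP_CHECK(hipGetLastError());
}
```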

I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.

Yes, we have had similar issues on MI250 hardware. My main priority here was getting something that would compile, so as not to break our HIP CI pipeline. I haven't yet gone through thorough testing, as I was going to get it integrated on NVIDIA systems first. That said, we do have access to AMD hardware, so I am happy to do whatever testing you think will be useful. I'm hopeful there will be more useful contributions going forward as I wrap my head around it all.

@ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

ryanstocks00 avatar Jul 20 '24 01:07 ryanstocks00

@ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

@ryanstocks00 I do not have the permissions. Could you please add me?

ajaypanyala avatar Jul 20 '24 03:07 ajaypanyala

@ajaypanyala How are you testing this? Just the UTs or are you running something non-trivial (e.g. Ubi)? I'd like to get a full run of Ubi/DZ on AMD HW before merging this.

Also, we might want to wait until #91 is merged, as it will require additional updates to get working with HIP - potentially some runtime-error issues to work out as well for large systems. Might need to coordinate with @mikovtun to get a big test system (they have some big things they're testing with).

wavefunction91 avatar Jul 28 '24 18:07 wavefunction91

@wavefunction91 Tested with Ubi/DZ (pbe0) on MI250X.

ajaypanyala avatar Jul 29 '24 04:07 ajaypanyala

@wavefunction91 Is this ready to go (modulo the merge conflict) ?

ajaypanyala avatar Aug 03 '24 18:08 ajaypanyala

@wavefunction91 @ajaypanyala I have fixed the merge conflicts so that it successfully compiles with HIP; however, it fails a lot of tests. It would be great to get some more experienced eyes over it.

ryanstocks00 avatar Sep 22 '24 10:09 ryanstocks00

Hi @ryanstocks00, sorry for the delay. I've recently changed jobs, so I'm still in the process of renormalizing where my time is spent.

Great that this compiles. Could you provide more info on which tests are failing? That will help us pinpoint where things could be going wrong. It might also be worth running the standalone_driver through rocgdb to see if we can see which kernels are dying. If I recall correctly, the last time I worked with AMD/HIP, the issue was invariably silent kernel-launch failures due to hard-coded launch params (optimized for the A100).
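
One cheap diagnostic along these lines - a sketch only, not GauXC's actual code - is to pre-flight any hard-coded launch config against the device limits before launching:

```cpp
// Hedged sketch: validate a hard-coded launch config against the actual
// device limits, so A100-tuned parameters don't silently fail on MI250X.
#include <hip/hip_runtime.h>
#include <cstdio>

bool launch_config_ok(dim3 block, size_t shmem_bytes, int device = 0) {
  hipDeviceProp_t prop;
  if (hipGetDeviceProperties(&prop, device) != hipSuccess) return false;
  size_t threads = size_t(block.x) * block.y * block.z;
  bool ok = threads <= size_t(prop.maxThreadsPerBlock) &&
            shmem_bytes <= prop.sharedMemPerBlock;
  if (!ok)
    fprintf(stderr,
            "bad launch config: %zu threads (max %d), %zu B shared (max %zu B)\n",
            threads, prop.maxThreadsPerBlock, shmem_bytes,
            prop.sharedMemPerBlock);
  return ok;
}
```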

wavefunction91 avatar Sep 26 '24 03:09 wavefunction91

@wavefunction91 hope Microsoft is treating you well! (Is this project likely to continue much development?)

The HIP code seems to be correct for LDA functionals, so I think I must have stuffed something up in the GGA code during the merge (not sure the MGGA code was ever fully implemented in HIP). I attempted to run rocgdb on standalone_driver as suggested, using an MI250; however, I didn't get any errors - is this the right process? hip_mi250_standalone_b3lyp_rocgdb.txt The output is non-deterministic for both EXC and VXC.

I have attached the output from running ./gauxc_test on both NVIDIA and AMD - interestingly, there is some small numeric noise on the 4080 that causes some of the tests to fail, but I assume this is just because it is not a datacenter GPU.

cuda_4080s_standalone_b3lyp.txt cuda_4080s_test.txt hip_mi250_standalone.txt hip_mi250_standalone_b3lyp.txt hip_mi250_test.txt

ryanstocks00 avatar Sep 30 '24 12:09 ryanstocks00

@ryanstocks00 Thanks, this is helpful for diagnosing the problem (I think). Lots to unpack here; I'll try to cover everything.

Is this project likely to continue much development?

Yes, updates may be rolled out more slowly, but dev and support will continue.

The HIP code seems to be correct for LDA functionals so I think I must have stuffed something in the GGA code during the merge

Nothing obvious from what I can see in the updates, but I agree with this assessment based on the results you've shared.

I attempted to run rocgdb on standalone_driver as suggested using an MI250 however didn't get any errors

This should indicate that kernels are not failing to launch (although I'm not precluding it; rocgdb is not the most stable software on earth).

is this the right process?

Looks like it

The output is non-deterministic for both EXC and VXC.

Can you expand on this? Both for LDA and GGA?

4080

Yes, this is known: https://github.com/wavefunction91/GauXC/issues/134. I'm not completely sold on the use of consumer GPUs for high-performance DFT simulations (happy to be proven wrong!), so I don't think I'll be prioritizing that for some time (that, and I don't have a consumer-grade GPU to test on!).

Based on the results you've shared, I think the next thing to check is whether or not the LDA gradients work on AMD. If they do, we can at least preclude the collocation gradients being the problem (as well as get more extensive testing of the batched BLAS - they'll run through similar paths). If those work, then it's in one of two kernels - the uvvars or the zmat. Since EXC looks busted for GGAs, I'd suspect (at least) the former is the culprit. When I was designing that kernel, it was definitely tuned for the NVIDIA SM structure (e.g. warp length of 32 + square process grids within the block). I'm not sure of the behaviour of e.g. the warp-level reductions on AMD and whether or not there's a non-obvious race condition. This should also be a problem for LDA (i.e. if the logic is fixed there, it should be fixed everywhere), but it might be worth another look.
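
To make the warp-length concern concrete - a generic sketch, not the uvvars kernel itself - a reduction written against NVIDIA's fixed warp of 32 silently drops half of each AMD wavefront unless it is parameterized on warpSize:

```cpp
// Hedged sketch (illustrative, not GauXC's uvvars kernel): a lane-shuffle sum
// hard-coded for 32 lanes misses lanes 32-63 of an AMD wavefront.
__device__ double warp_sum(double v) {
#if defined(__HIP_PLATFORM_AMD__)
  // AMD: wavefront length is 64; HIP shuffles take no sync mask.
  for (int off = warpSize / 2; off > 0; off /= 2)
    v += __shfl_down(v, off);
#else
  // NVIDIA: warp length is 32; CUDA >= 9 requires the _sync variants.
  for (int off = 16; off > 0; off /= 2)
    v += __shfl_down_sync(0xffffffffu, v, off);
#endif
  return v;  // lane 0 holds the full warp/wavefront sum
}
```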

Again, I don't have access to AMD hardware at the moment, so there's not a ton I can do on the debugging side. Happy to brainstorm debugging ideas though, let me know.

P.S. I'm sure you're aware of this, but just to be explicit - running the standalone driver with b3lyp for benzene will fail the checks as only the SVWN5 data is saved there. I'm assuming you ran those to get the reference numbers for B3LYP?

wavefunction91 avatar Oct 05 '24 23:10 wavefunction91

@wavefunction91 thank you very much for the detailed response. I don't think we can check the grads on AMD as I get a "Generic GauXC Exception (LDA Grad NYI for HIP Backends)".

The output is non-deterministic for both EXC and VXC.

Can you expand on this? Both for LDA and GGA?

Using the standalone driver, EXC and VXC are both correct for LDA and both non-deterministic for GGA (and an order of magnitude different from the correct value calculated with the host CPU implementation).

I have finally managed to get it to compile and run with HIP on an NVIDIA machine, and I get much the same results as on AMD (correct LDA, non-deterministic GGA), which I think rules out AMD launch-configuration issues and probably warp-length issues, so I'm pretty sure it is a code issue. This could potentially be a way for you to test as well, since the issue is replicable on NVIDIA hardware. If there's a good way to narrow down which kernel is the issue, that would be very helpful (e.g. does the fact that the "Li / SVWN5 / sto-3g" tests in gauxc_test fail despite being LDA mean anything?).

ryanstocks00 avatar Oct 08 '24 07:10 ryanstocks00

I think I have now got GauXC running correctly with HIP on an NVIDIA machine - the problem was the vvar grad kernel, which required the thread block to be square. I think this will still be problematic on AMD machines, as there isn't sufficient shared memory with the increased warp length, so it will need some more modification. I'm going to have to put this on the backburner for a bit due to other priorities, though.
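
For concreteness, the resource arithmetic at play (a sketch; GauXC's actual kernel shapes may differ): a square block sized off the warp/wavefront length jumps from 32x32 = 1024 threads with an 8 KiB double tile on NVIDIA to 64x64 = 4096 threads with a 32 KiB tile on AMD, which exceeds the 1024-thread block limit outright:

```cpp
// Hedged sketch: why a "square" warpSize x warpSize block that is legal on
// NVIDIA (32x32 = 1024 threads) breaks on AMD (64x64 = 4096 threads).
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  hipDeviceProp_t prop;
  hipGetDeviceProperties(&prop, 0);
  int side = prop.warpSize;                // 32 on NVIDIA, 64 on AMD
  size_t threads = size_t(side) * side;
  size_t tile = threads * sizeof(double);  // a square shared-memory tile
  printf("square block: %dx%d = %zu threads (device max %d)\n", side, side,
         threads, prop.maxThreadsPerBlock);
  printf("shared tile : %zu B (device max %zu B)\n", tile,
         prop.sharedMemPerBlock);
  return 0;
}
```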

ryanstocks00 avatar Oct 09 '24 09:10 ryanstocks00

vvar_grad

For the GGA energy or for the gradient? You should only call the vvar kernel for non-gradients.

Admittedly, I hadn't fully appreciated how complicated that code became with the latest refactor to support GKS, etc. I'll try to set aside some time to look at this over the coming week. FWIW, this set of kernels used to work (obviously); it might just require looking over the AMD commits to see what worked in the past and trying to translate it into the new format.

Thanks for continuing to take a look at this, the effort has been very helpful and much appreciated.

wavefunction91 avatar Oct 13 '24 23:10 wavefunction91

@wavefunction91 I have finally had some time and have fixed up the HIP kernels so that they now work (though likely not as efficiently as the CUDA implementations). Below is the output of the gauxc_test executable on an MI250X GPU. There are still some small errors in the "H3 / BLYP / cc-pvdz" tests - not sure if this is likely to be due to my changes.

❯ ./build/tests/gauxc_test

gauxc_test is a Catch v2.13.10 host application.
Run with -? for options

-------------------------------------------------------------------------------
XC Integrator
  Cytosine / R2SCANL / cc-pVDZ
  Device
  Incore - MPI Reduction
-------------------------------------------------------------------------------
/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:295
...............................................................................

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:295: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  Generic GauXC Exception (Laplacian NYI for HIP Backends)
    File     /software/projects/pawsey0799/ryan/setonix/software/GauXC/src/
  xc_integrator/local_work_driver/device/scheme1_base.cxx
    Function virtual void GauXC::AoSScheme1Base::eval_collocation_laplacian
  (GauXC::XCDeviceData*)
    Line     415

-------------------------------------------------------------------------------
XC Integrator
  Cytosine / R2SCANL / cc-pVDZ
  Device
  ShellBatched
-------------------------------------------------------------------------------
/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:332
...............................................................................

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:332: FAILED:
due to unexpected exception with message:
  Generic GauXC Exception (Laplacian NYI for HIP Backends)
    File     /software/projects/pawsey0799/ryan/setonix/software/GauXC/src/
  xc_integrator/local_work_driver/device/scheme1_base.cxx
    Function virtual void GauXC::AoSScheme1Base::eval_collocation_laplacian
  (GauXC::XCDeviceData*)
    Line     415

-------------------------------------------------------------------------------
XC Integrator
  H3 / BLYP / cc-pvdz
  Device
  Incore - MPI Reduction
-------------------------------------------------------------------------------
/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:295
...............................................................................

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:217: FAILED:
  CHECK( VXC_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000131 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:218: FAILED:
  CHECK( VXCz_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000009 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:219: FAILED:
  CHECK( VXCy_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:220: FAILED:
  CHECK( VXCx_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:229: FAILED:
  CHECK( VXC1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000131 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:230: FAILED:
  CHECK( VXCz1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000009 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:231: FAILED:
  CHECK( VXCy1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:232: FAILED:
  CHECK( VXCx1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

-------------------------------------------------------------------------------
XC Integrator
  H3 / BLYP / cc-pvdz
  Device
  ShellBatched
-------------------------------------------------------------------------------
/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:332
...............................................................................

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:217: FAILED:
  CHECK( VXC_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000131 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:218: FAILED:
  CHECK( VXCz_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000009 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:219: FAILED:
  CHECK( VXCy_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:220: FAILED:
  CHECK( VXCx_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:229: FAILED:
  CHECK( VXC1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000131 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:230: FAILED:
  CHECK( VXCz1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000000009 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:231: FAILED:
  CHECK( VXCy1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

/software/projects/pawsey0799/ryan/setonix/software/GauXC/tests/xc_integrator.cxx:232: FAILED:
  CHECK( VXCx1_diff_nrm / basis.nbf() < 1e-10 )
with expansion:
  0.0000001368 < 0.0000000001

===============================================================================
test cases:      16 |      15 passed |  1 failed
assertions: 3240884 | 3240866 passed | 18 failed

ryanstocks00 avatar Jul 29 '25 21:07 ryanstocks00

Hey, if I can weigh in: when I was authoring the NVIDIA kernels, I noticed very similar output when testing with a 3060 Ti. If I remember right, I was getting that very same 0.0000001368 diff on the H3 VXCs, and the culprit was NVIDIA's implementation of sqrt differing between product lines. As documented in #134, the issue goes away on an A100. Could be a coincidence you got that same value, or it could point to AMD's implementation using the same algorithm for sqrt as the consumer-grade NVIDIA cards. Nice PR!
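
A quick way to probe that hypothesis - a self-contained sketch, not part of GauXC - is to compare the device's double-precision sqrt bit-for-bit against the host's on a handful of values:

```cpp
// Hedged sketch: probe for cross-vendor sqrt discrepancies by comparing the
// device's double sqrt against the host result bit-for-bit.
#include <hip/hip_runtime.h>
#include <cmath>
#include <cstdio>
#include <cstring>

__global__ void dev_sqrt(const double* x, double* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = sqrt(x[i]);
}

int main() {
  const int n = 4;
  double hx[n] = {2.0, 3.0, 1e-300, 0.1}, hy[n];
  double *dx, *dy;
  hipMalloc(&dx, n * sizeof(double));
  hipMalloc(&dy, n * sizeof(double));
  hipMemcpy(dx, hx, n * sizeof(double), hipMemcpyHostToDevice);
  dev_sqrt<<<1, n>>>(dx, dy, n);
  hipMemcpy(hy, dy, n * sizeof(double), hipMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) {
    double ref = std::sqrt(hx[i]);
    if (std::memcmp(&hy[i], &ref, sizeof(double)) != 0)
      printf("x=%g: device and host sqrt differ in the last bits\n", hx[i]);
  }
  hipFree(dx); hipFree(dy);
  return 0;
}
```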

mikovtun avatar Jul 29 '25 21:07 mikovtun

@mikovtun good pickup! I think you might be on to something - if I use CUDA on a 4080, I get exactly the same 0.0000001368 errors; however, I do not get the 0.0000000131 or 0.0000000009 errors. Interestingly, if I compile the HIP version for the 4080, I do get the 0.0000000131 and 0.0000000009 errors, so it is clearly not only a hardware discrepancy? Bizarre, but at least all the errors are now replicable on NVIDIA hardware.

ryanstocks00 avatar Jul 30 '25 03:07 ryanstocks00