lammps icon indicating copy to clipboard operation
lammps copied to clipboard

Improve Kokkos unit testing

Open alphataubio opened this issue 1 year ago • 7 comments

Summary

add kokkos unit tests for angles, bonds, dihedral, improper, fixes, ...

Related Issue(s)

n/a

Author(s)

@alphataubio

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

n/a

Implementation Notes

pair_coul_dsf_kokkos INTENTIONALY SKIPPED IN YAML FILE pair_zbl_kokkos INTENTIONALY SKIPPED IN YAML FILE

Post Submission Checklist

  • [x] The feature or features in this pull request is complete
  • [x] Licensing information is complete
  • [x] Corresponding author information is complete
  • [x] The source code follows the LAMMPS formatting guidelines
  • [ ] Suitable new documentation files and/or updates to the existing docs are included
  • [ ] The added/updated documentation is integrated and tested with the documentation build system
  • [ ] The feature has been verified to work with the conventional build system
  • [x] The feature has been verified to work with the CMake based build system
  • [x] Suitable tests have been added to the unittest tree.
  • [ ] A package specific README file has been included or updated
  • [ ] One or more example input decks are included

Further Information, Files, and Links

alphataubio avatar Jun 07 '24 23:06 alphataubio

98% tests passed, 13 tests failed out of 573

The following tests FAILED: 452 - FixTimestep:deform (Failed) 454 - FixTimestep:deform_tri (Failed) 456 - FixTimestep:efield_const (Failed) 457 - FixTimestep:efield_region (Failed) 460 - FixTimestep:efield_variable (Failed) 472 - FixTimestep:nph (Failed) 475 - FixTimestep:npt_iso (Failed) 514 - FixTimestep:setforce_const (Failed) 515 - FixTimestep:setforce_region (Failed) 516 - FixTimestep:setforce_variable (Failed) 524 - FixTimestep:spring_self (Failed) 534 - FixTimestep:wall_lj93_const (Failed) 548 - DihedralStyle:hybrid (Failed)

alphataubio avatar Jun 08 '24 00:06 alphataubio

@alphataubio This may have a merge conflict when I get to merge PR #4203 (which will be very soon). My suggestion is that you pull that branch right away (I don't expect to make more changes to the fix style tester program after modernizing its fix pointer access and including changes to detect issues with run style respa) and then resolve the test failures. We're planning to have a feature release in a couple days and if this does not get ready and included by then, it will be postponed until after the stable release (possibly in august).

akohlmey avatar Jun 24 '24 17:06 akohlmey

there's no rush to merge this (NOT READY). i improved the unit test coverage for KOKKOS from 34% to 43% but there are now tests failing so it should NOT be merged.

i put up this draft pr so i can see automated coverage reports in codecov without having to setup/run coverage tools on my own machine.

alphataubio avatar Jun 24 '24 22:06 alphataubio

i improved the unit test coverage for KOKKOS from 34% to 43%

Nice!

stanmoore1 avatar Jun 24 '24 22:06 stanmoore1

Nice!

until you find out that 13 tests are now failing ...

alphataubio avatar Jun 24 '24 22:06 alphataubio

Also i ran the unit tests with GoogleTest Undefined Behavior Sanitizer, Address Sanitizer, and Thread Sanitizer. No issues detected

see https://github.com/lammps/lammps/issues/4229#issuecomment-2268817624

alphataubio avatar Aug 06 '24 05:08 alphataubio

@stanmoore1 i tested fix efield/kk on both 1 and 4 GPUs. It works, I was concerned about comment in the code about crash accessing unwrap[] but it doesn't and numerics are similar to serial

alphataubio avatar Aug 06 '24 22:08 alphataubio

@alphataubio overall this looks good. I had one question in the comments.

No I don't need fsum in Kokkos subclass anymore

Cleanup commits done, unit tests all pass

alphataubio avatar Sep 04 '24 22:09 alphataubio

@alphataubio thanks, this is good work. I'll run it through my regression testing, then it should be ready to merge.

stanmoore1 avatar Sep 04 '24 22:09 stanmoore1

It still needs a review by @sjplimp because I added functions to base classes

but no rush next stable is not for another year.

If @stanmoore1 you see people having problems with these you can just tell them to build develop plus this PR.

alphataubio avatar Sep 04 '24 22:09 alphataubio

Running Kokkos regression tests on this PR now...

stanmoore1 avatar Sep 30 '24 16:09 stanmoore1

A couple regression tests are failing with CUDA:

11:32:36 dir = /lammps_testing/examples/for_kokkos/reax/water
11:32:36 test = water.qeq.field
11:32:36 gold standard = log.23Sep24.linux.kk_gpu_2.water.qeq.field
11:32:51 ==== comparing log.kk_gpu_2.water.qeq.field with log.23Sep24.linux.kk_gpu_2.water.qeq.field ====
11:32:51 mpiexec noticed that process rank 1 with PID 0 on node cee-h100-002 exited on signal 6 (Aborted).
11:32:51 !!! test water.qeq.field FAILED
11:32:51 elapsed time = 14.8613550663 s for test water.qeq.field
 *** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165

stanmoore1 avatar Sep 30 '24 18:09 stanmoore1

A couple regression tests are failing with CUDA:

11:32:36 dir = /lammps_testing/examples/for_kokkos/reax/water
11:32:36 test = water.qeq.field
11:32:36 gold standard = log.23Sep24.linux.kk_gpu_2.water.qeq.field
11:32:51 ==== comparing log.kk_gpu_2.water.qeq.field with log.23Sep24.linux.kk_gpu_2.water.qeq.field ====
11:32:51 mpiexec noticed that process rank 1 with PID 0 on node cee-h100-002 exited on signal 6 (Aborted).
11:32:51 !!! test water.qeq.field FAILED
11:32:51 elapsed time = 14.8613550663 s for test water.qeq.field
 *** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165

ok ill have a look at it now

alphataubio avatar Sep 30 '24 18:09 alphataubio

@alphataubio the OpenMP test passes fine, so probably trying to access CPU memory the GPU, leading to a segfault.

stanmoore1 avatar Sep 30 '24 18:09 stanmoore1

@akohlmey this PR has some bugfixes that were required to get the unit tests to pass when using Kokkos, I think we need to backport all the changes in /src to the next stable update. Nothing outside of /src (e.g. /unittest) needs to be backported.

stanmoore1 avatar Sep 30 '24 21:09 stanmoore1

@alphataubio one question:

pair_coul_dsf_kokkos INTENTIONALY SKIPPED IN YAML FILE
pair_zbl_kokkos INTENTIONALY SKIPPED IN YAML FILE

Why were these skipped?

stanmoore1 avatar Sep 30 '24 21:09 stanmoore1

@alphataubio one question:

pair_coul_dsf_kokkos INTENTIONALY SKIPPED IN YAML FILE
pair_zbl_kokkos INTENTIONALY SKIPPED IN YAML FILE

Why were these skipped?

it wasnt me. i documented that skip_tests: kokkos_omp from yaml files for those two pairs by @akohlmey 2 years ago:

skip tests that fail with Kokkos+OpenMP +6 −1 7 lines changed Add force style tests for Kokkos... #3035 akohlmey committed on Feb 7, 2022

let me turn those back on to see what happens...

alphataubio avatar Sep 30 '24 21:09 alphataubio

@akohlmey this PR has some bugfixes that were required to get the unit tests to pass when using Kokkos, I think we need to backport all the changes in /src to the next stable update. Nothing outside of /src (e.g. /unittest) needs to be backported.

@stanmoore1 changes should now be in the maintenance branch (PR #4327). Already tested locally with Kokkos_OMP.

Please note, I had to restore the changes for my Radeon 780M chip and HIP back as they were overwritten when updating Kokkos. https://github.com/lammps/lammps/pull/4190/commits/db4c6fbaaae72033815811f32777262edc60c91f

akohlmey avatar Sep 30 '24 22:09 akohlmey

Thanks @akohlmey!

stanmoore1 avatar Sep 30 '24 22:09 stanmoore1

Windows unit test failure seems unrelated to the changes in this PR:

Failures
 - ninja (exited 1) - ninja not installed. An error occurred during installation:
 Error downloading 'ninja.1.12.1' from 'https://community.chocolatey.org/api/v2/package/ninja/1.12.1'.
Error: Process completed with exit code 1.

stanmoore1 avatar Sep 30 '24 22:09 stanmoore1

Windows unit test failure seems unrelated to the changes in this PR:

Failures
 - ninja (exited 1) - ninja not installed. An error occurred during installation:
 Error downloading 'ninja.1.12.1' from 'https://community.chocolatey.org/api/v2/package/ninja/1.12.1'.
Error: Process completed with exit code 1.

Yes, this is an unfortunate effect I am seeing now that we are using GitHub runners more. Sometimes they are disconnected from external servers, e.g. to download ubuntu packages, and then we get failures galore. Much less than with the Temple machines.

akohlmey avatar Sep 30 '24 22:09 akohlmey

Windows unit test failure seems unrelated to the changes in this PR

BTW: You should be able to re-run the test after clicking on "Details".

akohlmey avatar Sep 30 '24 22:09 akohlmey

For the record. The following force-style tests have the flag to skip the tests using KokkosOpenMP:

force-styles/tests/improper-class2.yaml
force-styles/tests/kspace-ewald_conp_charge.yaml
force-styles/tests/kspace-pppm_ad.yaml
force-styles/tests/kspace-pppm_conp_charge.yaml
force-styles/tests/manybody-pair-pod.yaml
force-styles/tests/manybody-pair-tersoff_mod_shift.yaml
force-styles/tests/manybody-pair-tersoff_shift.yaml
force-styles/tests/manybody-pair-tersoff_zbl_shift.yaml
force-styles/tests/mol-pair-coul_dsf.yaml
force-styles/tests/mol-pair-dpd_coul_slater_long.yaml
force-styles/tests/mol-pair-dpd_ext_tstat.yaml
force-styles/tests/mol-pair-dpd_ext.yaml
force-styles/tests/mol-pair-dpd_tstat.yaml
force-styles/tests/mol-pair-dpd.yaml
force-styles/tests/mol-pair-hybrid_multiple.yaml
force-styles/tests/mol-pair-hybrid-overlay.yaml
force-styles/tests/mol-pair-hybrid-scaled.yaml
force-styles/tests/mol-pair-lj_cut_coul_dsf.yaml
force-styles/tests/mol-pair-table.yaml
force-styles/tests/mol-pair-zbl.yaml

Some are principal problems, others require detailed debugging (and we are not even testing GPUs or OpenMP with different settings for neighbor list etc. so there may be more issues looming).

After all the changes in this PR, I think it would be better to merge it and focus on fixing tests later. The bigger challenge at the moment is to get the new regression tests working well.

akohlmey avatar Sep 30 '24 22:09 akohlmey