lammps icon indicating copy to clipboard operation
lammps copied to clipboard

Add KOKKOS versions of fixes

Open alphataubio opened this issue 1 year ago • 11 comments

Summary

Add KOKKOS versions:

  • [x] fix cmap/kk (to have all needed fixes in KOKKOS for performance on GPUs)

  • [x] fix nve/limit/kk (to relax overlapped systems with KOKKOS)

  • [x] fix recenter/kk (to have all needed fixes in KOKKOS for performance on GPUs), thanks to @srtee for suggesting this fix on matsci, very useful.

  • [x] fix wall/region/kk (to have all needed fixes in KOKKOS for performance on GPUs)

  • [x] region sphere/kk (bugfix for dynamic_cast in FixWallRegionKokkos)

Related Issue(s)

n/a

Author(s)

@alphataubio

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

n/a

Implementation Notes

ADDED UNIT TESTS:

  • fix-timestep-recenter-coords.yaml
  • fix-timestep-recenter-init.yaml
  • fix-timestep-recenter-null.yaml
  • fix-timestep-wall_region_lj1043.yaml
  • fix-timestep-wall_region_lj126.yaml
  • fix-timestep-wall_region_lj93.yaml
  • fix-timestep-wall_region_morse.yaml

Post Submission Checklist

  • [x] The feature or features in this pull request is complete
  • [x] Licensing information is complete
  • [x] Corresponding author information is complete
  • [x] The source code follows the LAMMPS formatting guidelines
  • [x] Suitable new documentation files and/or updates to the existing docs are included
  • [x] The added/updated documentation is integrated and tested with the documentation build system
  • [x] The feature has been verified to work with the conventional build system
  • [x] The feature has been verified to work with the CMake based build system
  • [x] Suitable tests have been added to the unittest tree.
  • [ ] A package specific README file has been included or updated
  • [x] One or more example input decks are included

Further Information, Files, and Links

alphataubio avatar Aug 24 '24 22:08 alphataubio

Running this PR through internal Kokkos regression testing now.

stanmoore1 avatar Oct 02 '24 17:10 stanmoore1

The fix recenter changes still need to be remove. What is there cannot be correct.

We'll have to look into testing and fixing this some other time. You'll have to stash away the force-style test yaml files as well for the time being.

fix recenter test skipped for respa

alphataubio avatar Oct 03 '24 22:10 alphataubio

@alphataubio for the new Kokkos styles, can you please add the boiler plate note as in other styles (if it isn't in there already)? It goes near the bottom. Thanks


----------

.. include:: accel_styles.rst

----------

stanmoore1 avatar Oct 04 '24 15:10 stanmoore1

I updated this PR to latest develop and it failed a compilation test with a weird error:

h5cc -I include -D_DEFAULT_SOURCE -O2 -DH5_NO_DEPRECATED_SYMBOLS  -Wall -fPIC -c src/ch5md.c -o build/ch5md.o
In file included from /usr/include/hdf5/serial/H5public.h:32,
                 from /usr/include/hdf5/serial/hdf5.h:22,
                 from src/ch5md.c:9:
/usr/include/hdf5/serial/H5version.h:39:4: error: #error "Can't choose old API versions when deprecated APIs are disabled"
   39 |   #error "Can't choose old API versions when deprecated APIs are disabled"
      |    ^~~~~
make: *** [Makefile.h5cc:21: build/ch5md.o] Error 1
make: Leaving directory '/data/workspace/dev/pull_requests/ubuntu/legacy_serial_openmp_smallsmall_static/build/lammps/lib/h5md'
++ exit 1

stanmoore1 avatar Oct 04 '24 16:10 stanmoore1

I updated this PR to latest develop and it failed a compilation test with a weird error:

This must be due to me updating the ubuntu container images on the Jenkins cluster from ubuntu 20.04LTS to ubuntu 22.04LTS so we can merge with PR #4344

Update: @stanmoore1 I am trying to fix this with https://github.com/lammps/lammps/pull/4348/commits/b2e8648e6c8f2ab46e99fdb1e0a31df73548567f Check out the build status on PR #4344

akohlmey avatar Oct 04 '24 16:10 akohlmey

I updated this PR to latest develop and it failed a compilation test with a weird error:

This must be due to me updating the ubuntu container images on the Jenkins cluster from ubuntu 20.04LTS to ubuntu 22.04LTS so we can merge with PR #4344

i see a different compile error on jenkins

I updated this PR to latest develop and it failed a compilation test with a weird error:

h5cc -I include -D_DEFAULT_SOURCE -O2 -DH5_NO_DEPRECATED_SYMBOLS  -Wall -fPIC -c src/ch5md.c -o build/ch5md.o
In file included from /usr/include/hdf5/serial/H5public.h:32,
                 from /usr/include/hdf5/serial/hdf5.h:22,
                 from src/ch5md.c:9:
/usr/include/hdf5/serial/H5version.h:39:4: error: #error "Can't choose old API versions when deprecated APIs are disabled"
   39 |   #error "Can't choose old API versions when deprecated APIs are disabled"
      |    ^~~~~
make: *** [Makefile.h5cc:21: build/ch5md.o] Error 1
make: Leaving directory '/data/workspace/dev/pull_requests/ubuntu/legacy_serial_openmp_smallsmall_static/build/lammps/lib/h5md'
++ exit 1

it works for me on macos, and rhel8 power9/v100. seems like a problem with hdf5 on your local machine. i know nothing about hdf5.

i see a different compile error on automated commit testing, it doesnt happen to me either. likely related to @akohlmey comment below.

g++ -std=c++11 -O3 -g -fPIC -Wall -c poemstreenode.cpp In file included from poemstreenode.cpp:18: poemstreenode.h: In constructor 'TreeNode::TreeNode(const int&, TreeNode*, TreeNode*, int)': poemstreenode.h:36:13: warning: 'TreeNode::data' will be initialized after [-Wreorder] 36 | int data; | ^~~~ poemstreenode.h:32:19: warning: 'TreeNode* TreeNode::left' [-Wreorder] 32 | TreeNode *left; | ^~~~ poemstreenode.cpp:22:1: warning: when initialized here [-Wreorder] 22 | TreeNode::TreeNode (const int & item, TreeNode *lptr,TreeNode *rptr, | ^~~~~~~~

alphataubio avatar Oct 04 '24 16:10 alphataubio

it works for me on macos, and rhel8 power9/v100. seems like a problem with hdf5 on your local machine. i know nothing about hdf5.

It is not the local machine, but rather the Jenkins server pool at Temple. I just updated some containers with newer OS versions. The HDF5 version must have been updated which the H5MD has not been adapted. It should not be a problem with CMake builds, since it does not set the compiler define to not use deprecated APIs. Only the GNU make build does that.

i see a different compile error on automated commit testing

These are warnings and have been around a long time. The reorder one is mostly harmless. There are others more worrisome. This is why the POEMS package will be flagged for removal after the next stable release.

akohlmey avatar Oct 04 '24 16:10 akohlmey

Internal Kokkos regression tests passed.

stanmoore1 avatar Oct 04 '24 16:10 stanmoore1

@alphataubio for the new Kokkos styles, can you please add the boiler plate note as in other styles (if it isn't in there already)? It goes near the bottom. Thanks


----------

.. include:: accel_styles.rst

----------

done

alphataubio avatar Oct 04 '24 17:10 alphataubio

Compilation with Kokkos/HIP currently fails with:

/home/akohlmey/compile/lammps/src/KOKKOS/fix_wall_region_kokkos.cpp:181:23: warning: variable 'engKK' set but not used [-Wunused-but-set-variable]
  181 |       double fwallKK, engKK;
      |                       ^
/home/akohlmey/compile/lammps/src/KOKKOS/fix_wall_region_kokkos.cpp:173:21: error: reference to __host__ function 'surface' in __host__ __device__ function
  173 |     int n = region->surface(d_x(i,0), d_x(i,1), d_x(i,2), cutoff);
      |                     ^
/home/akohlmey/compile/lammps/src/KOKKOS/fix_wall_region_kokkos.cpp:380:16: note: in instantiation of member function 'LAMMPS_NS::FixWallRegionKokkos<Kokkos::HIP>::wall_particle' requested here
  380 | template class FixWallRegionKokkos<LMPDeviceType>;
      |                ^
/home/akohlmey/compile/lammps/src/region.h:82:7: note: 'surface' declared here
   82 |   int surface(double, double, double, double);
      |       ^

akohlmey avatar Oct 06 '24 03:10 akohlmey

@stanmoore1 this is waiting on you now

akohlmey avatar Oct 14 '24 22:10 akohlmey

Running this through internal Kokkos regression testing again.

stanmoore1 avatar Oct 23 '24 15:10 stanmoore1

@alphataubio @weinbe2 I absolutely cannot get this to work with -diag_suppress 128. I think it has to do with nvcc_wrapper only taking 1 argument from -Xcudafe.

Also it is really confusing because nvcc --help and the output warning message both say to use diag-suppress (dash) but it only works with diag_suppress (underscore).

I'm using CUDA 12.4, @alphataubio do you remember which Cuda version failed for you?

stanmoore1 avatar Oct 23 '24 19:10 stanmoore1

I'm using CUDA 12.4, @alphataubio do you remember which Cuda version failed for you?

i tried both cuda/11.8.0 and cuda 12.2, it didnt work with either so i had to manualy remove diag_suppress in my cmakelist.txt file when i build on the cluster(s).

it's a problem with a flood of fmt::format warnings, i wish we could solve that @akohlmey ?

alphataubio avatar Oct 23 '24 19:10 alphataubio

@alphataubio I think I have a good solution now, can you test?

stanmoore1 avatar Oct 24 '24 05:10 stanmoore1

No new regression test failures.

stanmoore1 avatar Oct 24 '24 15:10 stanmoore1

@alphataubio I think I have a good solution now, can you test?

yes i just did and it works now.

alphataubio avatar Oct 24 '24 15:10 alphataubio

No new regression test failures.

maybe you should add examples/wall/in.wall.block and examples/wall/in.wall.sphere to KOKKOS regression tests for RegBlockKokkos and RegSphereKokkos ?

alphataubio avatar Oct 24 '24 15:10 alphataubio

Running the internal Kokkos regression tests again.

stanmoore1 avatar Nov 01 '24 22:11 stanmoore1

Kokkos regression tests passed. I'm going to run a few of my own tests, then this should be ready to merge.

stanmoore1 avatar Nov 02 '24 03:11 stanmoore1

I'm seeing lost atoms with fix recenter/kk in a different input, need to debug.

stanmoore1 avatar Nov 09 '24 00:11 stanmoore1

@alphataubio your latest changes fixed the lost atoms issue I saw, nice work. I'm running a few more tests.

stanmoore1 avatar Nov 12 '24 16:11 stanmoore1

Broken for sycl

lammps/src/KOKKOS/fix_cmap_kokkos.cpp:886:38: error: SYCL kernel cannot use a non-const static data variable
  886 |       for (k = 0; k < 16; k++) xx += wt[in][k]*x[k]

rarensu avatar Nov 19 '24 05:11 rarensu

@rarensu would removing the static (or replacing with const static) on line 853 fix this problem ?

sorry i dont access to anything SYCL (or HIP) to test.

alphataubio avatar Nov 19 '24 05:11 alphataubio

Adding const should do the trick, see https://godbolt.org/z/YxM4Wf4q1.

masterleinad avatar Nov 19 '24 13:11 masterleinad

@akohlmey please add this to your next small changes PR.

You're the cpp compiler expert do you think const static is the correct fix for this ?

alphataubio avatar Nov 19 '24 20:11 alphataubio

@akohlmey please add this to your next small changes PR.

You're the cpp compiler expert do you think const static is the correct fix for this ?

Way ahead of you. Check out: https://github.com/lammps/lammps/pull/4387/commits/e200d557ec94cc698d6d95d0290aff6d6202420c Which was part of the last pull request for the new release.

const static inside a function makes next to no sense. Just const will do. In fact, I tested it on the "godbolt" page with the same compiler setting to compile with SYCL and there was not difference in the assembly code between static const and plain const. Using static const makes more sense outside a function. But for that we now use static constexpr whereever possible to explicitly make it a compile time constant.

akohlmey avatar Nov 19 '24 20:11 akohlmey