Speed is much slower in `3.7.5 with icpx` than in `3.6.5 with icpc` for both PBE and EXX calculations
Details
Recently, I performed a SOC + EXX calculation. You can check the INPUT and output files in hse-3.6vs3.7-lowerspeed.zip.
With version 3.6.5, the speed is acceptable: every PBE step costs 13 s and every EXX step costs 178 s, although the PBE steps between EXX steps are slower.
When I change to 3.7.5, the speed is very slow: every PBE step costs 43 s and every EXX step costs 270 s, roughly twice as long as with 3.6.5.
Task list for Issue attackers (only for developers)
- [ ] Reproduce the performance issue on a similar system or environment.
- [ ] Identify the specific section of the code causing the performance issue.
- [ ] Investigate the issue and determine the root cause.
- [ ] Research best practices and potential solutions for the identified performance issue.
- [ ] Implement the chosen solution to address the performance issue.
- [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
- [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
- [ ] Review and incorporate any relevant feedback from users or developers.
- [ ] Merge the improved solution into the main codebase and notify the issue reporter.
Even in the `nspin=1` case, 3.7.5 shows a large speed regression compared with 3.6.5.
As you can see, 3.7.5 gives:
while 3.6.5 gives:
When I set `ks_solver scalapack_gvx` instead of `genelpa`, the slow speed remains:
3.7.5
3.6.5
@xdzhu What are your ABACUS installation dependencies?
I compared the time cost of these two versions. The slowdown seems to arise from the `ESolver_KS_LCAO` `runner` and `HSolverLCAO` `solve` modules.
> @xdzhu What are your ABACUS installation dependencies?
Both with Intel oneAPI 2023.1.0 and GCC 13.1.0.
3.6.5 with LibRI_0.1.0_loop3; 3.7.5 with LibRI_0.2.0.
I have noticed that for the 3.7.x version I used the icpx and mpicxx compilers, instead of the icpc and mpiicpc that I use to compile 3.6.5.
When I change CXX and MPI_CXX to icpc and mpiicpc and recompile 3.7.5, it runs faster than the icpx build, and the performance is nearly the same as with 3.6.5:
3.7.5 with icpc
3.7.5 with icpx
3.6.5 with icpc
@xdzhu What is your hardware setting?
@QuantumMisaka The calculation node has an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (2×20C, 40 cores in total), and I run ABACUS with the following command: `mpirun -np 10 -genv OMP_THREADS_NUM=4 abacus`
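For reference, the decomposition above (10 ranks × 4 threads on 40 cores) can be sketched as below. Note that the standard OpenMP environment variable is `OMP_NUM_THREADS`; a misspelled name such as `OMP_THREADS_NUM` is silently ignored by the OpenMP runtime, so it is worth double-checking the variable name in the launch command.

```shell
# A consistent ranks-by-threads decomposition for a 40-core node (a sketch,
# not the exact command from the comment above).
TOTAL_CORES=40
RANKS=10
THREADS=$((TOTAL_CORES / RANKS))   # 4 threads per rank

# Standard variable name is OMP_NUM_THREADS (not OMP_THREADS_NUM).
echo "mpirun -np ${RANKS} -genv OMP_NUM_THREADS=${THREADS} abacus"
```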
Could you:
- delete the old `./build` directory before you build a new one
- run some tests with `OMP_NUM_THREADS=1`
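A minimal clean-rebuild sketch of the two steps above (the `build` path and CMake invocation are assumptions; adjust to your actual configure options):

```shell
# Remove the stale build tree so no cached compiler choice or flags survive
rm -rf build

# Reconfigure from scratch; the compiler is the variable under test
CXX=icpx cmake -B build
cmake --build build -j

# Single-threaded run to rule out OpenMP interactions
OMP_NUM_THREADS=1 mpirun -np 4 abacus
```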
If the result is correct and the only issue is the performance, then according to this official guide, one possible reason (listed in its "Performance" section) is that `-O3` alone is no longer sufficient to enable advanced loop optimization and vectorization; `-xhost` might be necessary. Do we have any benchmark on this compiler flag? @caic99
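If someone wants to benchmark that flag, one way to try it (an assumption, not the project's official build recipe) is to append it at configure time:

```shell
# Pass -xHost to the Intel compiler through CMake; on Linux the flag is
# spelled -xHost for both icpc and icpx.
cmake -B build \
  -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_CXX_FLAGS="-O3 -xHost"
cmake --build build -j
```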
The test case in the zip file takes a very long time... do you have a smaller example with the same issue? @xdzhu
"-xhost" might be necessary. Do we have any benchmark on this compiler flag?
@jinzx10 I've tested it on a previous version of ABACUS, and it does not help much (about -1% time), since the heavy-lifting parts are the math libraries (here MKL and ELPA). I would suggest we better align the versions of those compilers and their dependencies, and then run further tests in an up-to-date environment.
I have another concern about the compilers: `mpicxx` might be a wrapper of `g++`; the wrapper for `icpx` should be `mpiicpx`. On my local PC (WSL2 Ubuntu 22.04), where the Intel compilers are installed via apt, `mpicxx` is clearly a wrapper of `g++`, as shown below:
```
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ which mpicxx
/opt/intel/oneapi/mpi/2021.13/bin/mpicxx
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ mpicxx -v
mpigxx for the Intel(R) MPI Library 2021.13 for Linux*
Copyright Intel Corporation.
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
```
while `mpiicpx` is clearly different from `mpicxx`:
```
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ which mpiicpx
/opt/intel/oneapi/mpi/2021.13/bin/mpiicpx
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ mpiicpx -v
mpiicpx for the Intel(R) MPI Library @IMPI_OFFICIALVERSION@ for Linux*
Copyright Intel Corporation.
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.4 (2023.2.4.20240127)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.4/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.4/linux/bin-llvm/../bin/icpx.cfg
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda, version
icpx: warning: argument unused during compilation: '-I /opt/intel/oneapi/mpi/2021.13/include' [-Wunused-command-line-argument]
```
I think it might be worth trying `mpiicpx` instead of `mpicxx`.
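For a quick check without recompiling anything, the Intel MPI wrappers can report which backend compiler they would invoke, and the backend can be overridden; a sketch, assuming Intel MPI's documented `-show` option and `I_MPI_CXX` variable are available in your installation:

```shell
# Print the underlying compile command without actually compiling
# (MPICH-style option supported by the Intel MPI wrappers):
mpicxx -show

# Override the wrapper's backend compiler via the environment:
I_MPI_CXX=icpx mpicxx -v

# Or use the dedicated icpx wrapper directly:
mpiicpx -v
```

Comparing the `-show` output of `mpicxx` and `mpiicpx` in the build environment would confirm which compiler each build actually used.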