Simulation with `indx=12, indy=12` crashes at start with a memory segfault
Dear devs,
I'm experiencing a problem when running a simulation with indx=12, indy=12.
I'm using an HPC cluster with plenty of resources, and, for example, indx=11, indy=11, indz=13 runs fine. As soon as I set, for example, indx=12, indy=12, indz=11, it crashes with a memory segfault.
It does not create the output folders except for ELOG, and the last lines in elog-* are:
new loop, nter= 1
Info: particles being passed further = 232215
Info: particles being passed further = 453895
Info: particles being passed further = 428535
The debugger traces the issue to input_class.f03:
source/simulation_class.f03:163
source/input_class.f03:104
and then shows MPI errors in ompi_bcast_f, PMPI_Bcast, etc.
This is very puzzling, since the same MPI layout works well with smaller simulations, and scaling it up does not help (nor does using different MPI implementations). Could there be any kind of buffer that overflows for such a slice size (4096 x 4096)?
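For reference, here is a small standalone calculation of the grid sizes these parameters imply (illustrative arithmetic only, not part of QuickPIC). The failing and working cases have the same total number of cells (2^35, which already exceeds the range of a default 32-bit integer), but the failing case has a 4x larger transverse slice:

```fortran
! Standalone back-of-the-envelope check of the grid sizes implied by
! indx/indy/indz (cells per axis = 2**ind). Not part of QuickPIC; the
! numbers only illustrate the 4096 x 4096 slice mentioned above.
program grid_sizes
   use iso_fortran_env, only: int64
   implicit none
   integer, parameter :: indx = 12, indy = 12, indz = 11   ! the failing case
   integer(int64) :: nx, ny, nz

   nx = 2_int64**indx        ! 4096
   ny = 2_int64**indy        ! 4096
   nz = 2_int64**indz        ! 2048
   print '(a,i0,a,i0,a,i0)', 'grid = ', nx, ' x ', ny, ' x ', nz
   print '(a,i0)', 'cells per transverse slice = ', nx*ny      ! 16 777 216
   print '(a,i0)', 'cells in the full grid     = ', nx*ny*nz   ! 34 359 738 368
   print '(a,i0)', 'huge(default integer)      = ', huge(1)    !  2 147 483 647
end program grid_sizes
```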
I'm running out of testing options, so any help or advice is appreciated!
It looks like the problem is happening during initialization of particles. What kind of initial distribution of particles are you using?
After particles are initialized, they are checked to see whether they were created in the correct node, and if not, they are moved to the correct domain.
The message "Info: particles being passed further" indicates that particles were moved more than one MPI domain during this process. For uniform initial domains, this should not happen.
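To illustrate the kind of check described above, here is a minimal sketch of a post-initialization ownership test for a 1D domain decomposition along y (the names here are hypothetical, not the actual UPIC/QuickPIC routines):

```fortran
! Minimal sketch of a post-initialization ownership check for a 1D domain
! decomposition along y. Names (check_particle_owners, edges) are
! hypothetical and do not correspond to the actual UPIC/QuickPIC routines.
subroutine check_particle_owners(part, npp, edges, nmoved)
   implicit none
   integer, intent(in)  :: npp          ! number of particles on this MPI task
   real,    intent(in)  :: part(2,npp)  ! part(2,i) holds the y position of particle i
   real,    intent(in)  :: edges(2)     ! lower/upper y boundary owned by this task
   integer, intent(out) :: nmoved       ! particles that must be passed to a neighbor
   integer :: i
   nmoved = 0
   do i = 1, npp
      if (part(2,i) < edges(1) .or. part(2,i) >= edges(2)) then
         ! Particle was created outside this task's domain; it would be
         ! copied into a send buffer and shifted to a neighboring task.
         ! A particle that is still not home after one shift is the kind
         ! that gets "passed further".
         nmoved = nmoved + 1
      end if
   end do
end subroutine check_particle_owners
```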
How many MPI nodes are being used?
Buffers could overflow, but this should be noted in the output log.
viktor decyk
Hi @vdecyk, and thank you for the fast response. For this test I use 512 MPI tasks, each with 6 threads. The test is rather minimal -- it has 2 beams and a uniform plasma species (see attached).
I also see the Info: particles being passed further message in the cases that run, but the number of moved particles is usually smaller.
UPD: the attached file had indx=11, indy=11, indz=13 grid, so it was a working case (sorry for that). I've reverted and updated the file
I have been looking over the source code, and I see that things are not as I expected. (I am the author of the underlying UPIC framework, but not of QuickPIC itself.)
I see now that the info message you got, "Info: particles being passed further", came from the 3D part of the code (the beam), from the procedure PMOVE32. This message is not necessarily an error. I am still studying the structure of this code.
viktor
Thank you for the update, @vdecyk.
Yes, I confirm that these particles being passed further warnings appear in all our simulations and usually do not indicate any problem. If there is such a possibility, could you confirm that you can reproduce this issue, just to be sure it's not some intrinsic OpenMPI bug?
Also, do you know if your dev group has tested such high-resolution cases, and if so, how far did they get?
Thanx
Igor