dqwu

Results: 99 comments by dqwu

> Which version of PMIx are you using? Slurm is reporting pmi2, not PMIx. OpenMPI was built --with-pmi=/usr, not --with-pmix=... ``` @chrlogin1:chrys$ srun --mpi=list srun: MPI types are... srun: none...

This test needs to be run on supercomputers or Linux clusters as it requires 512 MPI tasks. Load openmpi module (if it is available) for testing. If the system does...

> @dqwu, is there any error like `Transport retry count exceeded on ... `? Also are these errors seen on the specific nodes, or the failing nodes can vary from...

> hi thank you for bug report. > > it seems UCX detects error on network and reports it into OMPI. as result OMPI terminates failed rank, but neighbor rank...

> hi @dqwu we don't see timestamps on error messages... was it cut by any grep-like utility?

@hoopoepg test_openmpi.193176.out is the original output log file of the submitted slurm job...

> UCX_LOG_LEVEL=info

Here is the latest log file: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195239.out

> @dqwu can you pls try the following (separate) experiments to understand the problem: > > 1. Set `UCX_TLS=rc` to disable DC transport; to see if the issue is DC-related...

> ibv_devinfo

These two tests also failed.
[UCX_TLS=sm,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195622.out
[UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195628.out
[ibv_devinfo] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/ibv_devinfo.txt

> The test is run with SLURM srun instead of mpirun. Should I use environment variables? [-mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx] export OMPI_MCA_pml=ob1,^ucx export OMPI_MCA_btl=self,vader,tcp It...

> export OMPI_MCA_pml=ob1
> export OMPI_MCA_btl=self,vader,tcp
>
> > export OMPI_MCA_coll=^hcoll
> > export UCX_TLS=tcp

These two tests still failed.
[OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195638.out
[OMPI_MCA_coll=^hcoll UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195640.out
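
For readers following the srun-versus-mpirun question above: Open MPI reads any MCA parameter from an environment variable named `OMPI_MCA_<name>`, so each `-mca <name> <value>` pair given to mpirun has an environment equivalent that works under srun. The sketch below shows that mapping. The helper name `mca_flags_to_exports` is hypothetical and not part of Open MPI or Slurm. As far as I know, Open MPI does not accept inclusive and exclusive (`^`) entries mixed in a single list, which would explain why the later tests use `OMPI_MCA_pml=ob1` rather than `ob1,^ucx`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Open MPI or Slurm tool): turn mpirun-style
# "-mca <name> <value>" pairs into "export OMPI_MCA_<name>=<value>" lines.
mca_flags_to_exports() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -mca|--mca)
                # Each flag must be followed by a parameter name and a value.
                if [ "$#" -lt 3 ]; then
                    echo "error: $1 needs a name and a value" >&2
                    return 1
                fi
                printf 'export OMPI_MCA_%s=%s\n' "$2" "$3"
                shift 3
                ;;
            *)
                echo "error: unexpected argument: $1" >&2
                return 1
                ;;
        esac
    done
}

# The flags quoted in the thread, minus the conflicting "-mca pml ^ucx".
mca_flags_to_exports -mca pml ob1 -mca btl self,vader,tcp
# prints:
# export OMPI_MCA_pml=ob1
# export OMPI_MCA_btl=self,vader,tcp
```

In a job script the output could be applied with `eval "$(mca_flags_to_exports -mca pml ob1 -mca btl self,vader,tcp)"` before the `srun` line. This assumes the Slurm configuration propagates the submitting environment to the tasks.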