Task with IO stuck when using 2 nodes
This is happening on MareNostrum. We think the reason is that the ranks do not see the same temporary files when they run on two different nodes: the temporary directory is local to each compute node.
Hi, can you post a code snippet or a few more specifics about the code that is leading to the error? What version are you using? If it's 0.2.7, it could be related to a bug (#419), which we will fix shortly and release in a new version (0.2.8).
It is 0.2.7. Most of the tests were fine when we were using only 1 node, so I suspect it is because the temp storage is local. I don't think #419 cares about the number of nodes? I guess we can try again with 0.2.8 after it is tagged.
@mpellejero will you be able to deploy and test this (or to urge Albert to do it)?
BTW the job system on MareNostrum works with the stock conda mpich, so apparently we don't even need to recompile specially for the machine; the packages from the bccp channel worked directly. We didn't run benchmarks to see whether it falls back to TCP/IP, though.
Does nbodykit use internal temp storage anywhere? Or is it being created in the user script?
But #419 would not show up in serial. It was due to uneven allreduce(size) calls when some ranks had non-unity boolean selections while others did not.
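To illustrate why uneven collective calls hang only in parallel, here is a minimal sketch that simulates ranks with threads and uses a `threading.Barrier` as a stand-in for an allreduce. This is not nbodykit's code: `run_ranks`, its parameters, and the assumption that ranks with an all-True selection are the ones skipping the call are hypothetical and only meant to show the pattern.

```python
import threading


def run_ranks(nranks, selections, skip_when_all_selected, timeout=0.5):
    """Simulate `nranks` ranks reaching a collective call (hypothetical helper).

    Returns one status per rank: "done", "skipped" or "stuck".
    """
    barrier = threading.Barrier(nranks)  # stand-in for an MPI allreduce
    status = [None] * nranks

    def rank_main(rank):
        all_selected = all(selections[rank])
        if skip_when_all_selected and all_selected:
            # Buggy pattern (assumed): this rank never enters the collective.
            status[rank] = "skipped"
            return
        try:
            # Every participating rank blocks here until all ranks arrive.
            barrier.wait(timeout=timeout)
            status[rank] = "done"
        except threading.BrokenBarrierError:
            # A real MPI job would hang forever; the timeout makes it visible.
            status[rank] = "stuck"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

With a single rank there is nobody to wait for, which is consistent with the bug not appearing in serial runs.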
Hi, we installed nbodykit 0.2.8 and tried to run the tests with MPI on 4 nodes using:
python run-tests.py --mpirun="srun -n 4"
but it again got stuck at:
../testenv/lib/python3.6/site-packages/nbodykit/base/tests/test_catalogmesh.py .
I suspect the issue still persists. Could you try to set $TMPDIR to the scratch folder?
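If exporting the variable in the job script is awkward, the same thing can be tried from Python. This is a minimal sketch using only the standard library; `use_shared_tmpdir` is a hypothetical helper name, the path in the comment is a placeholder, and it only helps if the code creating the temporary files goes through Python's `tempfile` module (which I have not verified for nbodykit).

```python
import os
import tempfile


def use_shared_tmpdir(shared_dir):
    """Point Python's tempfile module at `shared_dir` (hypothetical helper).

    `shared_dir` should live on a filesystem visible from every node.
    """
    os.makedirs(shared_dir, exist_ok=True)
    os.environ["TMPDIR"] = shared_dir
    # tempfile caches its chosen directory; reset it so TMPDIR is re-read.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Example (placeholder path, call it on every rank before any temp file is made):
# use_shared_tmpdir("/path/to/shared/scratch/tmp")
```

The call has to happen on every rank, and before anything creates a temporary file, otherwise the node-local default is already in use.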
As soon as the cluster is back from maintenance I'll try.