AMUSE hangs when a system has multiple network interfaces
Describe the bug
If a system has multiple network interfaces, such as ethernet and wifi, or ethernet and docker, AMUSE hangs in seemingly random places. I am surprised that it DOES work with LXC, since that also gives me a network interface.
To Reproduce
Create an extra network interface and run test_c_implementation.py
Expected behavior
I expect it not to hang.
Logs
Container network interfaces (docker0 causes problems, lxcbr0 does not):
3: lxcbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
12: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
Workaround
For me it worked after setting
export OMPI_MCA_btl_tcp_if_include="enp7s0"
which is my ethernet interface. Several people seem to run into this problem (Merijn as well), so it would be nice if AMUSE at least detected it.
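For scripts that launch AMUSE from Python, the same workaround could be applied by building the environment before anything initialises MPI. The sketch below is only an illustration: `openmpi_interface_env` is a hypothetical helper name, and the interface name `enp7s0` is the one from this report and will differ per machine. The variable presumably has to be set before MPI starts, since Open MPI reads its MCA parameters at initialisation.

```python
import os


def openmpi_interface_env(interface, base_env=None):
    """Return a copy of the environment that restricts Open MPI's TCP
    transport to a single network interface.

    `interface` is machine specific (e.g. "enp7s0" in this report).
    `base_env` defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Same variable as in the shell workaround above.
    env["OMPI_MCA_btl_tcp_if_include"] = interface
    return env


# Build an environment for a child process; nothing is modified in place.
env = openmpi_interface_env("enp7s0", base_env={})
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, or copied into `os.environ` at the very top of a script, before any MPI-using module is imported.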
This is probably also related to #128
Thanks for reporting. It's a revelation to me that having multiple network interfaces could be the source of this problem. There may indeed be several related issues, like #128 and perhaps some of the tests that seem to hang in Travis and GitHub testing.
There are other variables that can influence this: OMPI_MCA_oob_tcp_if_include=.. and maybe the "exclude" variants.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 28 days if no further activity occurs. Thank you for your contributions.
keep open
@rieder @Sbte OK, so I have been thinking about reporting the network interfaces or checking for potential problems. There is no default way of getting the network interfaces in Python without an external library. For example, psutil has utility functions for this, but it would mean an extra prerequisite. Is it important enough for that?
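As a possible middle ground, on Python 3.3 and later the standard library's `socket.if_nameindex()` returns interface names on Unix without any extra prerequisite. It gives names only, not addresses or link state, so it could support a simple "more than one non-loopback interface" warning but not a full diagnosis. The helper names and the loopback name list below are assumptions made for illustration, not part of AMUSE.

```python
import socket


def list_interfaces():
    """Return interface names known to the OS, or [] if unavailable."""
    if not hasattr(socket, "if_nameindex"):
        return []
    try:
        return [name for _index, name in socket.if_nameindex()]
    except OSError:
        return []


def non_loopback(names, loopback=("lo", "lo0")):
    """Filter out loopback interfaces (names assumed: lo on Linux, lo0 on macOS)."""
    return [name for name in names if name not in loopback]


def needs_warning(names):
    """True when more than one non-loopback interface is present."""
    return len(non_loopback(names)) > 1


if needs_warning(list_interfaces()):
    print("Multiple network interfaces found:", non_loopback(list_interfaces()))
```

Bridges such as lxcbr0 and docker0 would both be counted here, even though only docker0 caused trouble in this report, so a warning based on this check would be a hint rather than a reliable detection.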
Still running into this (on a different machine again), but this time I'm also unable to solve it using the workaround described above. Only redirection="none" solved anything.
Can't you just check the communication between nodes at startup and if it doesn't work throw a warning and fall back to a different redirection method or something?
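A rough sketch of what such a startup check might look like, assuming the communication test can be expressed as a short external command that either finishes or hangs. Everything here is hypothetical: the function names, the probe command, and the `"preferred"` label are placeholders, and only `"none"` is a redirection value mentioned in this thread.

```python
import subprocess
import sys
import warnings


def probe_ok(command, timeout=10.0):
    """Run a probe command; report False if it fails or does not finish in time."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def choose_redirection(command, preferred, fallback="none", timeout=10.0):
    """Return `preferred` if the probe succeeds, else warn and return `fallback`."""
    if probe_ok(command, timeout=timeout):
        return preferred
    warnings.warn(
        "Communication probe failed or timed out; falling back to "
        "redirection=%r" % fallback
    )
    return fallback


# Placeholder probe: a trivial Python process standing in for a real worker test.
mode = choose_redirection([sys.executable, "-c", "pass"], preferred="preferred")
```

The timeout is what turns a silent hang into something that can be reported; choosing a value that is long enough for slow clusters but short enough to be useful would need testing.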