Cannot train with multiple instances due to grpc errors
Describe the bug --num-envs does not work on Ubuntu. Multiple instances of the game binary and of mlagents-learn spawn, but only one of them is used, probably because of how gRPC interacts with fork (see logs below). Training still runs, but I cannot fully utilize my CPU and GPU.
To Reproduce Steps to reproduce the behavior:
- Run mlagents-learn with the --num-envs=X and --env="" parameters.
Console logs / stack traces When run with the default polling strategy, the relevant log line is:
E0623 16:32:11.872695555 194999 fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies
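To make logs like the ones in this report easier to compare between runs, here is a minimal Python sketch that groups the gRPC debug lines by the numeric id in the third column (a process or thread id) and records the polling engine and any error messages for each. The regular expression is inferred from the log lines in this report only, and the summarize helper is a hypothetical name, not part of ML-Agents or gRPC.

```python
import re
from collections import defaultdict

# Pattern inferred from lines in this report, for example:
# E0623 16:32:11.872695555 194999 fork_posix.cc:70] Fork support is ...
# Optional leading/trailing asterisks cover lines that were bolded in the post.
GRPC_LINE = re.compile(
    r"^\**(?P<severity>[DIE])(?P<date>\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+)\s+"
    r"(?P<source>[\w.]+):(?P<line>\d+)\]\s+"
    r"(?P<message>.*?)\**$"
)


def summarize(log_text):
    """Group the polling engine and error messages by the id in column three."""
    summary = defaultdict(lambda: {"engine": None, "errors": []})
    for raw in log_text.splitlines():
        match = GRPC_LINE.match(raw.strip())
        if not match:
            continue  # banner art, [INFO] lines, version info, etc.
        entry = summary[match["tid"]]
        message = match["message"]
        if message.startswith("Using polling engine:"):
            entry["engine"] = message.split(":", 1)[1].strip()
        elif match["severity"] == "E":
            entry["errors"].append(message)
    return dict(summary)


# Sample lines copied from this report.
SAMPLE = """\
D0623 16:32:11.869297835 194999 ev_posix.cc:172] Using polling engine: epollex
D0623 16:32:11.871103842 195000 ev_posix.cc:172] Using polling engine: epollex
E0623 16:32:11.872695555 194999 fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies
[INFO] Connected new brain: Football1v1Behaviour?team=2
"""

for tid, info in summarize(SAMPLE).items():
    print(tid, info["engine"], len(info["errors"]))
```

Feeding it the full output of both runs shows at a glance which ids logged a fork-related error and under which polling engine.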
Full log up to start of training:
$ GRPC_VERBOSITY=debug mlagents-learn config/v0.2.20/football_020_1v1_001.yaml --run-id=football_020_1v1_001 --env="/media/hoss/data/workspace/nosuchstudio/football/football-unity-ubuntu/Builds/build-020-1v1/build-020-1v1" --num-envs=2 --resume
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.27.0,
ml-agents-envs: 0.27.0,
Communicator API: 1.5.0,
PyTorch: 1.8.1+cu102
**D0623 16:32:11.869297835 194999 ev_posix.cc:172] Using polling engine: epollex**
D0623 16:32:11.869364299 194999 lb_policy_registry.cc:42] registering LB policy factory for "grpclb"
D0623 16:32:11.869375930 194999 lb_policy_registry.cc:42] registering LB policy factory for "priority_experimental"
D0623 16:32:11.869381666 194999 lb_policy_registry.cc:42] registering LB policy factory for "weighted_target_experimental"
D0623 16:32:11.869385177 194999 lb_policy_registry.cc:42] registering LB policy factory for "pick_first"
D0623 16:32:11.869390382 194999 lb_policy_registry.cc:42] registering LB policy factory for "round_robin"
D0623 16:32:11.869393915 194999 lb_policy_registry.cc:42] registering LB policy factory for "ring_hash_experimental"
D0623 16:32:11.869397427 194999 dns_resolver_ares.cc:624] Using ares dns resolver
D0623 16:32:11.869435234 194999 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:32:11.869439670 194999 lb_policy_registry.cc:42] registering LB policy factory for "cds_experimental"
D0623 16:32:11.869445583 194999 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:32:11.869450782 194999 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:32:11.869454121 194999 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:32:11.870501304 194999 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
D0623 16:32:11.871103842 195000 ev_posix.cc:172] Using polling engine: epollex
D0623 16:32:11.871175471 195000 lb_policy_registry.cc:42] registering LB policy factory for "grpclb"
D0623 16:32:11.871184385 195000 lb_policy_registry.cc:42] registering LB policy factory for "priority_experimental"
D0623 16:32:11.871191771 195000 lb_policy_registry.cc:42] registering LB policy factory for "weighted_target_experimental"
D0623 16:32:11.871195208 195000 lb_policy_registry.cc:42] registering LB policy factory for "pick_first"
D0623 16:32:11.871199811 195000 lb_policy_registry.cc:42] registering LB policy factory for "round_robin"
D0623 16:32:11.871203148 195000 lb_policy_registry.cc:42] registering LB policy factory for "ring_hash_experimental"
D0623 16:32:11.871206487 195000 dns_resolver_ares.cc:624] Using ares dns resolver
D0623 16:32:11.871233435 195000 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:32:11.871237026 195000 lb_policy_registry.cc:42] registering LB policy factory for "cds_experimental"
D0623 16:32:11.871241879 195000 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:32:11.871246670 195000 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:32:11.871249958 195000 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:32:11.872387931 195000 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
E0623 16:32:11.872695555 194999 fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies
E0623 16:32:11.874621910 195000 fork_posix.cc:70] Fork support is only compatible with the epoll1 and poll polling strategies
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Hyperparameters for behavior name Football1v1Behaviour:
trainer_type: poca
I also tried running with other poll strategies. The error message changes, but the behavior stays the same. Relevant log lines:
E0623 16:28:12.707733140 193594 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
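For anyone retrying this comparison, here is a minimal Python sketch of how the same command could be launched under different values of GRPC_POLL_STRATEGY. The build_launch helper and the path/to/build placeholder are hypothetical. The flags and the two environment variables are the ones used in this report, and epoll1 is the other strategy named in the first error message. This only makes the runs in this report easier to repeat; it is not a known fix.

```python
import os
import shlex


def build_launch(config, run_id, env_path, num_envs,
                 poll_strategy=None, verbosity="debug"):
    """Return (argv, env) for an mlagents-learn run with gRPC variables set.

    poll_strategy=None leaves GRPC_POLL_STRATEGY untouched, so gRPC picks
    its default polling engine.
    """
    argv = [
        "mlagents-learn",
        config,
        f"--run-id={run_id}",
        f"--env={env_path}",
        f"--num-envs={num_envs}",
        "--resume",
    ]
    env = dict(os.environ)
    env["GRPC_VERBOSITY"] = verbosity
    if poll_strategy is not None:
        env["GRPC_POLL_STRATEGY"] = poll_strategy
    return argv, env


# "path/to/build" is a placeholder for the game binary, not a real path.
for strategy in (None, "poll", "epoll1"):
    argv, env = build_launch(
        "config/v0.2.20/football_020_1v1_001.yaml",
        "football_020_1v1_001",
        "path/to/build",
        num_envs=2,
        poll_strategy=strategy,
    )
    print(strategy, shlex.join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Keeping the environment in a separate dictionary avoids changing the variables of the calling shell between runs.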
Full log up to start of training:
$ GRPC_POLL_STRATEGY=poll GRPC_VERBOSITY=debug mlagents-learn config/v0.2.20/football_020_1v1_001.yaml --run-id=football_020_1v1_001 --env="/media/hoss/data/workspace/nosuchstudio/football/football-unity-ubuntu/Builds/build-020-1v1/build-020-1v1" --num-envs=2 --resume
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.27.0,
ml-agents-envs: 0.27.0,
Communicator API: 1.5.0,
PyTorch: 1.8.1+cu102
**D0623 16:28:12.702242882 193593 ev_posix.cc:172] Using polling engine: poll**
D0623 16:28:12.702323886 193593 lb_policy_registry.cc:42] registering LB policy factory for "grpclb"
D0623 16:28:12.702333627 193593 lb_policy_registry.cc:42] registering LB policy factory for "priority_experimental"
D0623 16:28:12.702343424 193593 lb_policy_registry.cc:42] registering LB policy factory for "weighted_target_experimental"
D0623 16:28:12.702349421 193593 lb_policy_registry.cc:42] registering LB policy factory for "pick_first"
D0623 16:28:12.702353020 193593 lb_policy_registry.cc:42] registering LB policy factory for "round_robin"
D0623 16:28:12.702356341 193593 lb_policy_registry.cc:42] registering LB policy factory for "ring_hash_experimental"
D0623 16:28:12.702362159 193593 dns_resolver_ares.cc:624] Using ares dns resolver
D0623 16:28:12.702397093 193593 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:28:12.702401248 193593 lb_policy_registry.cc:42] registering LB policy factory for "cds_experimental"
D0623 16:28:12.702407334 193593 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:28:12.702412522 193593 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:28:12.702416185 193593 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:28:12.703479565 193593 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
D0623 16:28:12.704147230 193594 ev_posix.cc:172] Using polling engine: poll
D0623 16:28:12.704199786 193594 lb_policy_registry.cc:42] registering LB policy factory for "grpclb"
D0623 16:28:12.704208190 193594 lb_policy_registry.cc:42] registering LB policy factory for "priority_experimental"
D0623 16:28:12.704230093 193594 lb_policy_registry.cc:42] registering LB policy factory for "weighted_target_experimental"
D0623 16:28:12.704234732 193594 lb_policy_registry.cc:42] registering LB policy factory for "pick_first"
D0623 16:28:12.704256968 193594 lb_policy_registry.cc:42] registering LB policy factory for "round_robin"
D0623 16:28:12.704259613 193594 lb_policy_registry.cc:42] registering LB policy factory for "ring_hash_experimental"
D0623 16:28:12.704266330 193594 dns_resolver_ares.cc:624] Using ares dns resolver
D0623 16:28:12.704312283 193594 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:28:12.704316586 193594 lb_policy_registry.cc:42] registering LB policy factory for "cds_experimental"
D0623 16:28:12.704321651 193594 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:28:12.704326540 193594 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:28:12.704349373 193594 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:28:12.705384996 193594 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
E0623 16:28:12.705817074 193593 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E0623 16:28:12.707733140 193594 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
Screenshots N/A
Environment (please complete the following information):
- Unity Version: Unity 2020.3.29f1
- OS version: Ubuntu 21.10
- ML-Agents version: 0.27.0
- Torch version: 1.8.1
- Environment: my own env, but the issue replicates with any environment and other versions of torch / python / unity.
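The list above does not include the version of grpcio, the Python gRPC package that emits the fork messages in the logs. A minimal sketch for collecting it alongside the other package versions, assuming Python 3.8 or later and the PyPI distribution names mlagents, mlagents-envs, torch and grpcio (collect_versions is a hypothetical helper name):

```python
from importlib import metadata

# Distribution names as published on PyPI (an assumption about how the
# packages were installed; adjust if they came from a source checkout).
PACKAGES = ("mlagents", "mlagents-envs", "torch", "grpcio")


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"- {name}: {version}")
```

The output is already in the bullet format of the environment list, so it can be pasted into a report directly.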
Thank you for the issue. We will look into it and get back to you soon, thanks!
This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.
Commenting mainly to keep the task alive. @AKemendo any updates on this? Do you think a fix will land anytime soon?
Can you update to the latest versions of the ML-Agents Unity and Python packages (2.2.1-exp.1 and 0.28, respectively)? Also, we are unable to help reproduce bugs with custom environments. Can you attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of those environments that reproduces the issue?
This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days if no further activity occurs. Thank you for your contributions.