
Cannot train with multiple instances due to grpc errors

Open hk1ll3r opened this issue 3 years ago • 4 comments

Describe the bug --num-envs doesn't work on Ubuntu. Multiple instances of the game binary and mlagents-learn spawn, but only one of them is used. This is probably due to gRPC and fork (see logs below). Training runs, but I cannot fully utilize my CPU and GPU.

To Reproduce Steps to reproduce the behavior:

  1. Run mlagents-learn with the --num-envs=X and --env="" parameters.
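
The reproduction command can also be assembled programmatically, for example to try several values of --num-envs in a row. The following is a minimal sketch: the flags are the ones used in the commands in this report, while the helper name `build_learn_command` and the placeholder paths are hypothetical.

```python
import shlex


def build_learn_command(config, run_id, env_path, num_envs, resume=False):
    """Return the mlagents-learn argument list for a multi-environment run."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    cmd = [
        "mlagents-learn",
        config,
        f"--run-id={run_id}",
        f"--env={env_path}",
        f"--num-envs={num_envs}",
    ]
    if resume:
        cmd.append("--resume")
    return cmd


# Placeholder values; substitute a real trainer config and game binary.
cmd = build_learn_command("config.yaml", "my_run", "/path/to/build", num_envs=2)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.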

Console logs / stack traces When run normally, with no polling strategy set explicitly. Relevant log lines:

E0623 16:32:11.872695555  194999 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
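
To see whether every spawned process reports the same error, the gRPC log lines can be grouped by the numeric id that precedes the source file name. This is a rough sketch: the line layout in the regular expression is inferred from the log lines in this report rather than from a gRPC specification, and the helper name `errors_by_id` is hypothetical.

```python
import re
from collections import defaultdict

# Layout inferred from the log lines in this report:
# <severity><MMDD> <HH:MM:SS.fraction>  <id> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<severity>[DIWE])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<id>\d+) (?P<file>[\w.]+):(?P<line>\d+)\]\s+"
    r"(?P<message>.*)$"
)


def errors_by_id(log_text):
    """Group error-severity messages by the numeric id before the file name."""
    grouped = defaultdict(list)
    for raw in log_text.splitlines():
        # Strip whitespace and any surrounding "**" emphasis markers.
        match = LOG_LINE.match(raw.strip().strip("*"))
        if match and match["severity"] == "E":
            grouped[match["id"]].append(match["message"])
    return dict(grouped)


sample = """\
D0623 16:32:11.869297835  194999 ev_posix.cc:172]            Using polling engine: epollex
E0623 16:32:11.872695555  194999 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0623 16:32:11.874621910  195000 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
"""
for ident, messages in errors_by_id(sample).items():
    print(ident, len(messages), messages[0])
```

Lines that do not follow the gRPC layout, such as the `[INFO]` trainer messages, are skipped.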

Full log up to start of training:

$ GRPC_VERBOSITY=debug mlagents-learn config/v0.2.20/football_020_1v1_001.yaml --run-id=football_020_1v1_001 --env="/media/hoss/data/workspace/nosuchstudio/football/football-unity-ubuntu/Builds/build-020-1v1/build-020-1v1" --num-envs=2 --resume


                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

        
 Version information:
  ml-agents: 0.27.0,
  ml-agents-envs: 0.27.0,
  Communicator API: 1.5.0,
  PyTorch: 1.8.1+cu102
**D0623 16:32:11.869297835  194999 ev_posix.cc:172]            Using polling engine: epollex**
D0623 16:32:11.869364299  194999 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D0623 16:32:11.869375930  194999 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D0623 16:32:11.869381666  194999 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D0623 16:32:11.869385177  194999 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D0623 16:32:11.869390382  194999 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D0623 16:32:11.869393915  194999 lb_policy_registry.cc:42]   registering LB policy factory for "ring_hash_experimental"
D0623 16:32:11.869397427  194999 dns_resolver_ares.cc:624]   Using ares dns resolver
D0623 16:32:11.869435234  194999 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:32:11.869439670  194999 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D0623 16:32:11.869445583  194999 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:32:11.869450782  194999 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:32:11.869454121  194999 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:32:11.870501304  194999 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
D0623 16:32:11.871103842  195000 ev_posix.cc:172]            Using polling engine: epollex
D0623 16:32:11.871175471  195000 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D0623 16:32:11.871184385  195000 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D0623 16:32:11.871191771  195000 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D0623 16:32:11.871195208  195000 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D0623 16:32:11.871199811  195000 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D0623 16:32:11.871203148  195000 lb_policy_registry.cc:42]   registering LB policy factory for "ring_hash_experimental"
D0623 16:32:11.871206487  195000 dns_resolver_ares.cc:624]   Using ares dns resolver
D0623 16:32:11.871233435  195000 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:32:11.871237026  195000 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D0623 16:32:11.871241879  195000 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:32:11.871246670  195000 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:32:11.871249958  195000 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:32:11.872387931  195000 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
E0623 16:32:11.872695555  194999 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0623 16:32:11.874621910  195000 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Hyperparameters for behavior name Football1v1Behaviour: 
	trainer_type:	poca

I tried running with other poll strategies; the error message changes, but the behavior stays the same. Relevant log lines:

E0623 16:28:12.707733140  193594 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
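
Switching strategies comes down to setting gRPC environment variables for the trainer process before it starts. Below is a minimal sketch of building such an environment. `GRPC_POLL_STRATEGY` and `GRPC_VERBOSITY` are the variables used in this report; the helper name `grpc_env` is hypothetical, and the optional `GRPC_ENABLE_FORK_SUPPORT` toggle is an assumption that does not come from this thread. Whether any combination changes the behavior described here is untested.

```python
import os

# The two strategies named in the "Fork support is only compatible with ..." message.
FORK_COMPATIBLE_STRATEGIES = ("epoll1", "poll")


def grpc_env(strategy, verbosity="debug", fork_support=None, base=None):
    """Return a copy of the environment with gRPC settings applied.

    fork_support is an assumption beyond this report: True/False sets
    GRPC_ENABLE_FORK_SUPPORT to "1"/"0"; None leaves it untouched.
    """
    if strategy not in FORK_COMPATIBLE_STRATEGIES:
        raise ValueError(
            f"expected one of {FORK_COMPATIBLE_STRATEGIES}, got {strategy!r}"
        )
    env = dict(os.environ if base is None else base)
    env["GRPC_POLL_STRATEGY"] = strategy
    env["GRPC_VERBOSITY"] = verbosity
    if fork_support is not None:
        env["GRPC_ENABLE_FORK_SUPPORT"] = "1" if fork_support else "0"
    return env


# An empty base keeps the printed result small; omit it to inherit os.environ.
env = grpc_env("poll", base={})
print(env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching mlagents-learn, so the variables are in place before gRPC initializes in the child process.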

Full log up to start of training:

$ GRPC_POLL_STRATEGY=poll GRPC_VERBOSITY=debug mlagents-learn config/v0.2.20/football_020_1v1_001.yaml --run-id=football_020_1v1_001 --env="/media/hoss/data/workspace/nosuchstudio/football/football-unity-ubuntu/Builds/build-020-1v1/build-020-1v1" --num-envs=2 --resume


                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

        
 Version information:
  ml-agents: 0.27.0,
  ml-agents-envs: 0.27.0,
  Communicator API: 1.5.0,
  PyTorch: 1.8.1+cu102
**D0623 16:28:12.702242882  193593 ev_posix.cc:172]            Using polling engine: poll**
D0623 16:28:12.702323886  193593 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D0623 16:28:12.702333627  193593 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D0623 16:28:12.702343424  193593 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D0623 16:28:12.702349421  193593 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D0623 16:28:12.702353020  193593 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D0623 16:28:12.702356341  193593 lb_policy_registry.cc:42]   registering LB policy factory for "ring_hash_experimental"
D0623 16:28:12.702362159  193593 dns_resolver_ares.cc:624]   Using ares dns resolver
D0623 16:28:12.702397093  193593 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:28:12.702401248  193593 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D0623 16:28:12.702407334  193593 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:28:12.702412522  193593 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:28:12.702416185  193593 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:28:12.703479565  193593 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
D0623 16:28:12.704147230  193594 ev_posix.cc:172]            Using polling engine: poll
D0623 16:28:12.704199786  193594 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D0623 16:28:12.704208190  193594 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D0623 16:28:12.704230093  193594 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D0623 16:28:12.704234732  193594 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D0623 16:28:12.704256968  193594 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D0623 16:28:12.704259613  193594 lb_policy_registry.cc:42]   registering LB policy factory for "ring_hash_experimental"
D0623 16:28:12.704266330  193594 dns_resolver_ares.cc:624]   Using ares dns resolver
D0623 16:28:12.704312283  193594 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
D0623 16:28:12.704316586  193594 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D0623 16:28:12.704321651  193594 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_impl_experimental"
D0623 16:28:12.704326540  193594 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_resolver_experimental"
D0623 16:28:12.704349373  193594 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
I0623 16:28:12.705384996  193594 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
E0623 16:28:12.705817074  193593 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E0623 16:28:12.707733140  193594 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
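
Both logs show two "Connected to Unity environment" lines for --num-envs=2, so the environments do connect. A small sketch for counting these lines in a captured log follows; the helper name `connection_summary` is hypothetical, and a matching count only confirms the connection, not that every environment is stepped during training.

```python
import re

CONNECTED = re.compile(r"^\[INFO\] Connected to Unity environment\b")
BRAIN = re.compile(r"^\[INFO\] Connected new brain: (?P<name>\S+)")


def connection_summary(log_text):
    """Count environment connections and brain registrations in a trainer log."""
    connected = 0
    brains = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        if CONNECTED.match(line):
            connected += 1
            continue
        match = BRAIN.match(line)
        if match:
            brains[match["name"]] = brains.get(match["name"], 0) + 1
    return connected, brains


sample = """\
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
[INFO] Connected new brain: Football1v1Behaviour?team=2
[INFO] Connected new brain: Football1v1Behaviour?team=1
"""
connected, brains = connection_summary(sample)
print(connected, brains)
```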

Screenshots N/A

Environment (please complete the following information):

  • Unity Version: Unity 2020.3.29f1
  • OS version: Ubuntu 21.10
  • ML-Agents version: 0.27.0
  • Torch version: 1.8.1
  • Environment: my own env, but the issue replicates with any environment and other versions of torch / python / unity.

hk1ll3r, Jun 23 '22 23:06

Thank you for the issue. We will look into it and get back to you soon, thanks!

AKemendo, Jun 24 '22 21:06

This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot], Jul 31 '22 03:07

Commenting mainly to keep the task alive. @AKemendo any updates on this? Do you think a fix will land anytime soon?

hk1ll3r, Aug 02 '22 10:08

This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot], Sep 21 '22 02:09

Can you update to the latest versions of the ML-Agents Unity and Python packages (2.2.1-exp.1 and 0.28, respectively)? Also, we are unable to help reproduce bugs with custom environments. Can you attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of the example environments that reproduces the issue?
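
For checking which Python-side versions are installed before and after such an update, something like the following may help. It is only a sketch: the distribution names in `PACKAGES` are assumptions about how the packages are published, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

# Distribution names are assumptions; adjust if your install uses different ones.
PACKAGES = ("mlagents", "mlagents-envs", "grpcio", "torch")


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```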

miguelalonsojr, Oct 05 '22 19:10

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days if no further activity occurs. Thank you for your contributions.

stale[bot], Jan 08 '23 04:01