
Workload imbalance across clients (SubprocessEnvManager does not step all envs evenly)

Open Tyushang opened this issue 3 years ago • 1 comments

I noticed the following code in SubprocessEnvManager._queue_steps():

    def _queue_steps(self) -> None:
        print([w.worker_id for w in self.env_workers if not w.waiting])  # to debug.
        for env_worker in self.env_workers:
            if not env_worker.waiting:
                env_action_info = self._take_step(env_worker.previous_step)
                ...

It always handles env tasks in a fixed order, no matter which env step arrived earlier or later. This causes a problem: the envs at the back of self.env_workers may execute far fewer steps than the envs at the front. For example, if we print the worker_id of every worker that is not waiting at the beginning of _queue_steps(), with num_envs=10, we will most likely get a log like the following:

[0, 1, 2, 3, 4, 5, 6, 7, 8]
[0, 1, 2, 3, 4, 5, 6, 9, 10]
[0, 1, 2, 3, 4, 5, 6, 7, ]
[0, 1, 2, 3, 4, 5, 6, 9, 10]
[0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
...

Why? Because _queue_steps() handles envs sequentially, and the clients take some time to return step data. Soon after _queue_steps() returns, we receive env steps from the clients and mark those envs as not waiting (waiting=False) for the next call to _queue_steps(). As a result, some envs (usually those at the back of self.env_workers) are still being processed by their clients and miss the next _queue_steps().
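To put a number on the imbalance, the debug output can be tallied per worker. The snippet below is only a rough sketch: `count_steps_per_worker` and `sample_log` are names made up for illustration (not part of ml-agents), and the input is just the five log lines shown above copied as Python lists.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_steps_per_worker(log: Iterable[List[int]]) -> Dict[int, int]:
    """Count how many times each worker_id appears across the logged calls."""
    counts: Counter = Counter()
    for ready_ids in log:
        # Each entry is the list of non-waiting worker ids printed by one call.
        counts.update(ready_ids)
    return dict(sorted(counts.items()))


# Hypothetical input: the five log lines from above, copied by hand.
sample_log = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8],
    [0, 1, 2, 3, 4, 5, 6, 9, 10],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 9, 10],
    [0, 1, 2, 3, 4, 5, 6, 7, 9, 10],
]

steps = count_steps_per_worker(sample_log)
for worker_id, n in steps.items():
    print(f"worker {worker_id}: {n} steps")
```

On this short sample, workers 0 to 6 are stepped in every call, while the workers at the back are stepped in only some of them.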

Does this problem matter? How about handling envs in FIFO order?
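A minimal sketch of what FIFO handling could look like, to make the question concrete. Everything here is an assumption for illustration: `Worker` and `FifoStepQueue` are hypothetical stand-ins, not the ml-agents classes, and the real code would send an action to the client where the comment indicates. Whether this actually evens out the step counts would still need to be measured.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Worker:
    """Hypothetical stand-in for an env worker (not the ml-agents class)."""
    worker_id: int
    waiting: bool = False


class FifoStepQueue:
    """Keeps ready workers in the order their step results arrived."""

    def __init__(self) -> None:
        self._ready: Deque[Worker] = deque()

    def mark_ready(self, worker: Worker) -> None:
        # Called when a step result arrives from this worker's client.
        worker.waiting = False
        self._ready.append(worker)

    def queue_steps(self) -> List[int]:
        # Dispatch to ready workers, oldest arrival first.
        dispatched: List[int] = []
        while self._ready:
            worker = self._ready.popleft()
            worker.waiting = True  # the real code would send the action here
            dispatched.append(worker.worker_id)
        return dispatched


workers = [Worker(i) for i in range(3)]
queue = FifoStepQueue()
queue.mark_ready(workers[2])  # worker 2 returned first
queue.mark_ready(workers[0])  # worker 0 returned second
order = queue.queue_steps()   # workers are served in arrival order
```

The difference from the current loop is that the dispatch order follows arrival order rather than the position in self.env_workers.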

Tyushang avatar May 09 '22 13:05 Tyushang

Thank you @Tyushang, we will log this request and get back to you once we have investigated it.

AKemendo avatar May 10 '22 15:05 AKemendo