
Training threads don't start on Windows

Open donamin opened this issue 8 years ago • 19 comments

Hi

I started training a few minutes ago and this is what I got in the command prompt:

E:\agents>python -m agents.scripts.train --logdir=E:\model --config=pendulum
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170918T084053-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.

It's been like this for about 10 minutes and TensorBoard doesn't show anything. The log directory contains only one file, 'config.yaml'. Is that okay? It would be nice to see whether the agent is progressing or hung.

Thanks Amin

donamin avatar Sep 18 '17 05:09 donamin

I changed the update_every value from 25 to 30 to resolve the warning "Number of agents should divide episodes per update", but it still doesn't seem to be working.

The weird thing is that sometimes when I run the code, I get the following exception:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

donamin avatar Sep 18 '17 07:09 donamin

Update: When I change env_processes to False, it seems to be working! But I guess that disables all the parallelism this framework provides, right?

donamin avatar Sep 18 '17 07:09 donamin

It could be normal that TensorBoard doesn't show anything for a while. The frequency for writing logs is defined inside _define_loop() in train.py. This is set to twice per epoch, where one training epoch is config.update_every * config.max_length steps and one evaluation epoch is config.eval_episodes * config.max_length steps. It could be that either your environment is very slow or that an epoch consists of a large number of steps for you.
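
As a rough back-of-the-envelope illustration (the numbers below are hypothetical and not taken from your config):

# Hypothetical values, only to estimate when the first summaries appear.
update_every = 30    # training episodes collected per policy update
eval_episodes = 25   # episodes per evaluation phase
max_length = 200     # assumed maximum episode length

train_epoch_steps = update_every * max_length  # 6000 environment steps
eval_epoch_steps = eval_episodes * max_length  # 5000 environment steps
print(train_epoch_steps, eval_epoch_steps)

With summaries written only twice per epoch, nothing shows up in TensorBoard until a few thousand environment steps have finished.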

What environment are you using and how long are episodes typically? Can you post your full config?

danijar avatar Sep 22 '17 12:09 danijar

I worked on that and it seems there's some other problem with the code: Now it's showing this error:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

If I change env_processes to False, it works! Do you know what the problem is?

donamin avatar Sep 22 '17 12:09 donamin

Please wrap code blocks in three backticks. Your configuration must be picklable, and it looks like yours is not. Try to define it without using lambdas. As alternatives, define external functions, nested functions, or use functools.partial(). I need to see your configuration to help further.
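
For example, here is a rough sketch of picklable alternatives to a lambda constructor (Pendulum-v0 is just a placeholder for whatever your config builds):

import functools
import gym

# Not picklable under Windows' spawn start method: a lambda defined inside
# a function cannot be serialized and sent to the child process.
# constructor = lambda: gym.make('Pendulum-v0')

# Picklable alternative 1: a module-level function.
def make_pendulum():
  return gym.make('Pendulum-v0')

# Picklable alternative 2: functools.partial over a picklable callable.
make_pendulum_partial = functools.partial(gym.make, 'Pendulum-v0')

Both variants can be pickled by reference, so they survive the trip to a worker process.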

danijar avatar Sep 22 '17 13:09 danijar

OK I got an update:

In train.py, I changed this line:

batch_env = utility.define_batch_env(lambda: _create_environment(config), config.num_agents, env_processes)

into this:

batch_env = utility.define_batch_env(_create_environment(config), config.num_agents, env_processes)

Now it doesn't give me the previous error, but it seems to freeze after showing this log:

INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170922-165119-pendulum.
[2017-09-22 16:51:19,149] Making new env: Pendulum-v0

The CPU usage of my Python process is 0%, so it doesn't seem to be doing anything. Any ideas?

This is my config:

def default():
  """Default configuration for PPO."""
  # General
  algorithm = ppo.PPOAlgorithm
  num_agents = 10
  eval_episodes = 25
  use_gpu = False
  # Network
  network = networks.ForwardGaussianPolicy
  weight_summaries = dict(all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*')
  policy_layers = 200, 100
  value_layers = 200, 100
  init_mean_factor = 0.05
  init_logstd = -1
  # Optimization
  update_every = 30
  policy_optimizer = 'AdamOptimizer'
  value_optimizer = 'AdamOptimizer'
  update_epochs_policy = 50
  update_epochs_value = 50
  policy_lr = 1e-4
  value_lr = 3e-4
  # Losses
  discount = 0.985
  kl_target = 1e-2
  kl_cutoff_factor = 2
  kl_cutoff_coef = 1000
  kl_init_penalty = 1
  return locals()

donamin avatar Sep 22 '17 13:09 donamin

Where is the env defined in your config? You should not create the environments in the main process as you did by removing the lambda.

danijar avatar Sep 22 '17 14:09 danijar

I thought we pass the env as one of the main arguments on the command line. So how should I create the environments? Do you mean I should change the default code structure to make BatchPPO work?

donamin avatar Sep 22 '17 16:09 donamin

No, I meant you should undo the change you made to the batch env line. You define environments in your config by setting env = ... to either the name of a registered Gym environment or to a function that returns an env object.
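
For instance, a minimal sketch of a config function that names its environment (the max_length value is only an assumed example, and in practice this would be combined with the defaults shown above):

def pendulum():
  """Hypothetical config sketch that names its environment."""
  # Either the id of a registered Gym environment ...
  env = 'Pendulum-v0'
  # ... or a picklable callable that returns an env object, e.g.
  # env = functools.partial(gym.make, 'Pendulum-v0')
  max_length = 200  # assumed maximum episode length
  return locals()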

danijar avatar Sep 23 '17 09:09 danijar

Oh OK, I see what I did wrong by removing the lambda keyword. But how can I solve this using external or nested functions? I did a lot of searching but couldn't figure it out since I'm kind of new to Python. Can you help me with this? How is it that it works on your computer and not on mine? Not being able to pickle lambda functions seems to be a Python limitation, and I already tried Python 3.5 and 3.6.

donamin avatar Sep 23 '17 16:09 donamin

I've seen it working on many people's computers :)

Please check if YAML is installed:

python3 -c "import ruamel.yaml; print('success')"

And check if the Pendulum environment works:

python3 -c "import gym; e=gym.make('Pendulum-v0'); e.reset(); e.render(); input('success')"

If both work, please start from a fresh clone of this repository and report your error message again.

danijar avatar Sep 24 '17 08:09 danijar

Thanks for your reply.

I tried both tests with success.

I cloned the repository again and the code still doesn't work. It's no longer showing the lambda error, but it hangs when it reaches this line of code in wrappers.py: self._process.start()

When I debug, stepping into the start function eventually guides me to this line in context.py (the code hangs when it reaches it): from .popen_spawn_win32 import Popen

By the way, I'm using Windows 10. Maybe it has something to do with the OS?

donamin avatar Sep 24 '17 08:09 donamin

Yeah, that might be the problem. Multiprocessing behaves quite differently on Windows than on Linux/Mac, and we mainly tested on the latter. I'm afraid I can't be of much help since I don't use Windows. Do you have an idea how to debug this? I'd be happy to test and merge a fix if you come up with one.
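
For background, a quick way to see the platform difference (a generic Python note, not specific to this repo): Linux defaults to the fork start method, where the child inherits the parent's memory, while Windows only supports spawn, which pickles the process target and its arguments and sends them to a fresh interpreter.

import multiprocessing as mp

# Prints 'fork' on Linux (and macOS under Python 3.5), 'spawn' on Windows.
# Under spawn, everything passed to Process(target=..., args=...) must be
# picklable, which is why a lambda constructor only fails on Windows.
print(mp.get_start_method())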

danijar avatar Sep 24 '17 10:09 danijar

OK, thanks for your reply. I have no idea right now, but I will work on it because it's important for me to make it work on Windows. I'll let you know if I solve it. Thanks :)

donamin avatar Sep 24 '17 10:09 donamin

@donamin Were you able to narrow down this issue?

danijar avatar Nov 09 '17 00:11 danijar

@danijar No, I couldn't solve it, so I had to switch to Linux. Sorry.

donamin avatar Nov 09 '17 05:11 donamin

Thanks for getting back to me. I'll keep this issue open for now. We might support Windows in the future, since as far as I can see the process handling is the only platform-specific bit. But unfortunately, there are no concrete plans for this at the moment.

danijar avatar Nov 09 '17 12:11 danijar

It seems you cannot use the _worker class method as the target of multiprocessing.Process on Windows. If you use a global function, def globalworker(constructor, conn), it does not hang. But then it cannot use getattr. Is there a way to rewrite _worker as a global worker?

self._process = multiprocessing.Process(
    target=globalworker, args=(constructor, conn))

erwincoumans avatar Nov 24 '18 06:11 erwincoumans

@erwincoumans Yes, this seems trivial since self._worker() does not access any object state. You'd just have to replace the occurrences of self with ExternalProcess. I'd be happy to accept a patch if this indeed fixes the behavior on Windows. I don't have a way to test on Windows myself.
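
A minimal sketch of that idea (this is not the actual _worker body from wrappers.py, just an illustration, assuming a constructor callable and a Pipe connection; the message protocol is simplified):

import multiprocessing

def _worker(constructor, conn):
  """Module-level worker, so Windows' spawn start method can pickle it."""
  env = constructor()
  try:
    while True:
      message, payload = conn.recv()
      if message == 'step':
        conn.send(env.step(payload))
      elif message == 'reset':
        conn.send(env.reset())
      elif message == 'close':
        break
  finally:
    conn.close()

# Inside ExternalProcess.__init__, roughly:
# conn, child_conn = multiprocessing.Pipe()
# self._process = multiprocessing.Process(
#     target=_worker, args=(constructor, child_conn))
# self._process.start()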

danijar avatar Dec 18 '18 17:12 danijar