Rujikorn Charakorn
@vwxyzjn Sure. I'll report back after I've run the tests.
@vwxyzjn I have tested the proper version. The results look very good on PyBullet's HalfCheetah (hovering around 2200 after 1M steps, which is higher than the current version of PPO)...
@vwxyzjn I did not turn wandb tracking on. I'll do that tonight and send you the report link right after. And the PR should be simple enough. Should we try...
@vwxyzjn Sorry for the late reply. It seems like the improvement I reported was just noise :( It seems like the continuous control tasks do not benefit from using...
And the tracked stats are here: https://wandb.ai/51616/proper_ppo_entropy?workspace=user-51616
I followed your fix and got a small negative pi loss (around -0.00xx). Is this normal? Edit: I'm now using the alternative code and it produces a positive pi loss.
@jl1990 I use these two lines instead: `pi -= (1 - valids) * 1000` followed by `pi = log_softmax(pi)`. This should produce a positive pi loss.
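For context, a minimal sketch of what that masking does (assuming PyTorch-style tensors; the logits and variable names here are purely illustrative):

```python
import torch
import torch.nn.functional as F

# Push the logits of invalid actions far down before normalizing, so
# log_softmax assigns them effectively zero probability.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.3]])   # raw policy logits (hypothetical values)
valids = torch.tensor([[1.0, 0.0, 1.0, 0.0]])    # 1 = valid action, 0 = invalid

logits = logits - (1 - valids) * 1000            # invalid logits drop to roughly -1000
log_probs = F.log_softmax(logits, dim=-1)        # invalid actions get log-prob near -1000
probs = log_probs.exp()                          # their probabilities are effectively 0
print(probs)  # e.g. tensor([[0.9526, 0.0000, 0.0474, 0.0000]])
```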
Is there any chance I can use this during training?
Cool! @gigayaya That would be amazing, since I think self-play is the bottleneck of this training loop. How much faster is it if you do self-play in parallel? Is it...
I would love to see the implementation, of course. @gigayaya You can just commit to this PR and I can read the code just fine. 👍 Also, have you heard...