Rujikorn Charakorn

Results 25 comments of Rujikorn Charakorn

@vwxyzjn Sure. I'll report back after I've run the tests.

@vwxyzjn I have tested the proper version. The results look very good on PyBullet's HalfCheetah (hovering around 2200 after 1M steps, which is higher than the current version of PPO)...

@vwxyzjn I did not turn wandb tracking on. I'll do that tonight and send you the report link right after. The PR should be simple enough. Should we try...

@vwxyzjn Sorry for the late reply. It seems like the improvement I reported is just noise :( It seems like the continuous control tasks do not benefit from using...

And the tracked stats are here: https://wandb.ai/51616/proper_ppo_entropy?workspace=user-51616

I followed your fix and got a small negative pi loss (around -0.00xx). Is this normal? Edit: now I'm using the alternative code and it produces a positive pi loss.

@jl1990 I use this code instead: `pi -= (1 - valids) * 1000; pi = log_softmax(pi)`. This should produce a positive pi loss.
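A minimal sketch of that masking trick, assuming `valids` is a 0/1 vector marking which actions are legal (the names and the NumPy implementation here are illustrative, not the original code): subtracting a large constant from the logits of invalid actions drives their probabilities to effectively zero after the log-softmax.

```python
import numpy as np

def masked_log_softmax(logits, valids):
    """Push the logits of invalid actions far negative, then take a
    numerically stable log-softmax.

    logits: raw policy logits, shape (n_actions,)
    valids: 0/1 mask of legal actions, shape (n_actions,)
    """
    logits = logits - (1 - valids) * 1000   # invalid actions -> ~ -1000
    shifted = logits - logits.max()         # shift for numerical stability
    return shifted - np.log(np.exp(shifted).sum())

# Example: action 2 is invalid, so its probability is ~0 after masking.
log_probs = masked_log_softmax(np.array([1.0, 2.0, 3.0]),
                               np.array([1, 1, 0]))
probs = np.exp(log_probs)
```

The remaining valid actions still form a proper distribution (their probabilities sum to 1), which is why downstream loss terms stay well behaved.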

Is there any chance I can use this during training?

Cool! @gigayaya That would be amazing, since I think self-play is the bottleneck of this training loop. How much faster is it if you do self-play in parallel? Is it...

I would love to see the implementation, of course. @gigayaya You can just commit to this PR and I can read the code just fine. 👍 Also, have you heard...