IverYangg issues

Results: 4 issues of IverYangg

How to judge whether a trained policy is good or bad?

The code can train many policies, but how do I choose the best one? In other words, what kind of standard can be used to judge a trained...
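One common way to compare trained policies is to roll each one out for a fixed number of evaluation episodes on the same seeds and compare the mean and spread of the episode returns. The sketch below illustrates that idea only. `ToyEnv`, `evaluate`, `copy_policy`, and `constant_policy` are hypothetical names invented for this example and are not part of the repository discussed in the issue.

```python
import random
import statistics


class ToyEnv:
    """Hypothetical stand-in environment: reward 1.0 when the action equals the observed bit."""

    def __init__(self, seed, horizon=10):
        self.rng = random.Random(seed)
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.target = self.rng.randint(0, 1)
        return self.target

    def step(self, action):
        reward = 1.0 if action == self.target else 0.0
        self.t += 1
        self.target = self.rng.randint(0, 1)
        done = self.t >= self.horizon
        return self.target, reward, done


def evaluate(policy, episodes=20, seed=0):
    """Return (mean, population std) of episode returns over fixed seeds."""
    returns = []
    for ep in range(episodes):
        env = ToyEnv(seed + ep)  # same seeds for every policy, so runs are comparable
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)


def copy_policy(obs):
    # Always matches the observed bit, so it collects every reward.
    return obs


def constant_policy(obs):
    # Ignores the observation.
    return 0


if __name__ == "__main__":
    for name, policy in [("copy", copy_policy), ("constant", constant_policy)]:
        mean, std = evaluate(policy)
        print(f"{name}: mean return {mean:.2f}, std {std:.2f}")
```

Evaluating every checkpoint on the same seeds keeps the comparison fair. For a real task, a task-specific metric such as success rate may be more informative than raw return.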

Setting the reward function

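Hello, I would like to ask how the reward function is set. The file Monitornv.py contains this setting: ![reward function from code ](https://user-images.githubusercontent.com/63235229/180025577-4e9825ad-f1aa-4314-900a-2e0ae66e017f.png) I am not sure whether it corresponds to the reward function described in the paper. The paper's description of the reward function seems rather brief. Is there a more detailed explanation?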

ValueError: device value error, must be str,

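May I ask: I only ran train.py in the simulation environment, following the README instructions, but it raised an exception: `ValueError: device value error, must be str, paddle.CPUPlace(), paddle.CUDAPlace(), paddle.CUDAPinnedPlace() or paddle.XPUPlace(), but the type of device is device`. I checked: device=cuda, type(device) = , and I still cannot find where the problem is. Could you help take a look?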
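The error message says Paddle expects a string or a Paddle place object but received some other device object. One possible workaround, assuming the offending object prints as `cuda` or `cuda:0` (as a PyTorch-style device does), is to convert it to a Paddle-style string such as `gpu:0` before passing it on. The helper below is a hypothetical sketch, not code from the repository. It does not import Paddle, and `FakeDevice` exists only to make the example self-contained.

```python
def to_paddle_device_str(device):
    """Convert a device-like object to a Paddle-style device string.

    Assumption: str(device) looks like "cpu", "cuda", or "cuda:<index>".
    """
    name = str(device).strip().lower()
    if name == "cpu":
        return "cpu"
    if name == "cuda":
        return "gpu"
    if name.startswith("cuda:"):
        return "gpu:" + name.split(":", 1)[1]
    # Already Paddle-style (for example "gpu:1") or unknown: return unchanged.
    return name


class FakeDevice:
    """Hypothetical stand-in for a device object whose str() is e.g. 'cuda:0'."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


if __name__ == "__main__":
    print(to_paddle_device_str(FakeDevice("cuda:0")))
    # The resulting string could then be handed to Paddle, e.g. paddle.set_device(...),
    # assuming the installed Paddle build supports that device.
```

Whether this fits depends on where the device object is created in the training script, which the issue text does not show.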

Question about computing GAE in the PPO algorithm

![defcomputereturn](https://user-images.githubusercontent.com/63235229/189487405-cbff013b-5f1d-4e60-84c3-cead5727e2e9.png) Hello, a question about computing GAE. Suppose the current step satisfies env._max_episode_steps == self.env._elapsed_steps, meaning the episode has reached its maximum number of steps, so both mask and bad_mask are 0. At lines 105 and 106, why is the GAE of the last state 0? Why not compute it as `gae = delta + gamma * gae_lambda * self.masks[step + 1] * gae * self.bad_masks[step + 1]`?
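To make the difference concrete, here is a small plain-Python sketch of the two update rules. It assumes the code in the screenshot follows the common pattern of computing `gae` and then multiplying the whole value by `bad_masks[step + 1]`. The screenshot is not reproduced here, so that is an assumption. The function names and the toy numbers are hypothetical.

```python
def gae_zeroed(rewards, values, masks, bad_masks, gamma, lam):
    """Assumed existing rule: the whole gae is multiplied by bad_masks[step + 1]."""
    gae, adv = 0.0, [0.0] * len(rewards)
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step + 1] - values[step]
        gae = delta + gamma * lam * masks[step + 1] * gae
        gae = gae * bad_masks[step + 1]  # zeroes the advantage at a time-limit cut
        adv[step] = gae
    return adv


def gae_proposed(rewards, values, masks, bad_masks, gamma, lam):
    """Rule proposed in the question: bad_masks only gates the recursive term."""
    gae, adv = 0.0, [0.0] * len(rewards)
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step + 1] - values[step]
        gae = delta + gamma * lam * masks[step + 1] * gae * bad_masks[step + 1]
        adv[step] = gae
    return adv


if __name__ == "__main__":
    # Two-step episode that is cut off by the time limit after the second step.
    rewards = [1.0, 1.0]
    values = [0.0, 0.0, 0.0]   # one extra entry for the bootstrap value
    masks = [1, 1, 0]
    bad_masks = [1, 1, 0]
    print(gae_zeroed(rewards, values, masks, bad_masks, 1.0, 1.0))
    print(gae_proposed(rewards, values, masks, bad_masks, 1.0, 1.0))
```

In the first rule, the advantage at the truncated step becomes 0, so the return there equals the value prediction and that step contributes no advantage signal. In the proposed rule, `delta` is kept, but because `masks[step + 1]` is also 0 it is computed without a bootstrap value, which treats the time-limit cut like a true terminal state. Which behavior is intended is a question for the repository authors.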