BrainWWW issues

Results: 3 issues of BrainWWW

Some questions about the objective functions in policy gradient and PPO.

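Hello Morvan (莫烦), I would like to ask you the following questions:

1. The PPO pseudocode in DPPO: ![image](https://user-images.githubusercontent.com/26730243/48056374-9994d700-e1ec-11e8-917d-f8e92a64fcb1.png) This part computes the sum of the new/old policy ratio terms from t=1 to T. But your code uses tf.reduce_mean, which should correspond to this objective: ![image](https://user-images.githubusercontent.com/26730243/48056510-eaa4cb00-e1ec-11e8-98f9-14289c307371.png) I am confused about which of these two objective functions is correct. Or, if both are correct, what is the difference between them?

2. The objective functions of PG and PPO. This question is similar to the previous one. The figure below shows the traditional PG objective: ![image](https://user-images.githubusercontent.com/26730243/48056687-496a4480-e1ed-11e8-9efa-df8b8ec0ef06.png) This is an expectation over trajectories, so it sums over t=1 to T. But the PPO objective is as follows: ![image](https://user-images.githubusercontent.com/26730243/48056906-baa9f780-e1ed-11e8-8b34-21159b35cf5d.png) This is an expectation over action-state pairs, and I do not quite understand how to get from an expectation over trajectories to an expectation over action-state pairs.

3. The PG objective itself seems to be written in these two ways: ![image](https://user-images.githubusercontent.com/26730243/48057461-10cb6a80-e1ef-11e8-968a-0255c0e74847.png) The first is definitely correct, but I do not know how to interpret the second.

I hope to get your help, Mr. Morvan!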
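One way to probe question 1 numerically (a minimal sketch, not the repository author's answer): for a fixed batch of T timesteps, the summed and the averaged clipped surrogate differ only by the constant factor T, so their gradients point in the same direction and only the effective step size changes. The function name `clipped_surrogate_terms`, the clipping range `eps=0.2`, and all sample values below are assumptions made up for illustration; plain Python stands in for the `tf.reduce_mean` call mentioned in the issue.

```python
import math

def clipped_surrogate_terms(logp_new, logp_old, advantages, eps=0.2):
    """Per-timestep PPO clipped surrogate terms (hypothetical helper)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip ratio to [1-eps, 1+eps]
        terms.append(min(ratio * adv, clipped * adv))    # pessimistic (lower) bound
    return terms

# Made-up log-probabilities and advantages for T = 4 timesteps.
logp_old = [-1.2, -0.7, -2.0, -0.3]
logp_new = [-1.0, -0.9, -1.5, -0.35]
advantages = [0.5, -1.0, 2.0, 0.1]

terms = clipped_surrogate_terms(logp_new, logp_old, advantages)
objective_sum = sum(terms)                  # sum over t = 1..T, as in the pseudocode
objective_mean = objective_sum / len(terms) # average over the batch, as reduce_mean does

print("sum :", objective_sum)
print("mean:", objective_mean)
print("sum / mean:", objective_sum / objective_mean)  # equals T up to rounding
```

Under these assumptions the two forms rank parameter updates identically; the choice mainly interacts with the learning rate, since the sum scales with the batch length while the mean does not.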

Parameter of network is NaN

Hi, I'm doing some research on multi-agent learning, and big2 is a really good platform for testing multi-agent algorithms! But I have a problem with your code: after a long...

[Kosmos-v2] unable to build the environment

I am building the environment using the Docker method. But when it reaches `pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers`, I get the error below: ``` building 'xformers._C_flashattention' extension creating /tmp/pip-install-lwdzzryz/xformers_896a4241413344a4850e6654ebe11206/build/temp.linux-x86_64-3.8 creating...