Multi-agent-RL
MC_Basic.py: sum_qvalue_list.append(sum_qvalue) is in the wrong place
RL_Learning-main/scripts/Chapter5_Monte Carlo Methods/MC_Basic.py
Current (buggy) code:

```python
sum_qvalue_list = []
for each_episode in episodes:
    sum_qvalue = 0
    for i in range(len(each_episode)):
        sum_qvalue += (self.gama**i) * each_episode[i]['reward']
sum_qvalue_list.append(sum_qvalue)  # ❌ wrong position: outside the episode loop
self.qvalue[state][action] = np.mean(sum_qvalue_list)
```
Corrected code:

```python
sum_qvalue_list = []
for each_episode in episodes:
    sum_qvalue = 0
    for i in range(len(each_episode)):
        sum_qvalue += (self.gama**i) * each_episode[i]['reward']
    sum_qvalue_list.append(sum_qvalue)  # ✅ correct position: append after each episode's return is computed
self.qvalue[state][action] = np.mean(sum_qvalue_list)
```
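For reference, the inner loop computes the discounted return gamma^0*r_0 + gamma^1*r_1 + ... of one episode. Below is a minimal vectorized sketch of the same computation; the helper name `discounted_return` is illustrative and not part of MC_Basic.py, and it assumes each episode is a list of dicts with a `'reward'` key, as produced by `obtain_episode`:

```python
import numpy as np

def discounted_return(episode, gamma):
    # Illustrative helper (not in MC_Basic.py): discounted return of one episode,
    # i.e. sum_i gamma**i * reward_i. Each step is assumed to be a dict with a
    # 'reward' key, matching the structure returned by obtain_episode.
    rewards = np.array([step['reward'] for step in episode])
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))
```

With such a helper, the evaluation step reduces to `sum_qvalue_list = [discounted_return(e, self.gama) for e in episodes]`.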
Problem analysis:

Problems with the current code:

- `sum_qvalue_list.append(sum_qvalue)` runs only once, after the loop over all episodes has finished, so only the last episode's return is ever added to the list.
- As a result, the Q-value estimate is based on a single episode rather than on all collected episodes.

Effect of the fix (a short numeric sketch follows this list):

- The discounted cumulative return is computed for each episode.
- Every episode's return is appended to the list.
- Taking the mean then gives a more accurate Q-value estimate.
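A quick numeric sketch (the return values are made up purely for illustration) of why the placement matters:

```python
import numpy as np

# Suppose three episodes starting from (s, a) yield these discounted returns:
returns = [4.0, 6.0, 8.0]

# Buggy placement: only the last computed return ever reaches the list.
buggy_list = [returns[-1]]
print(np.mean(buggy_list))  # 8.0 (estimate based on a single episode)

# Fixed placement: every episode's return is appended before averaging.
fixed_list = list(returns)
print(np.mean(fixed_list))  # 6.0 (average over all collected episodes)
```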
After moving sum_qvalue_list.append(sum_qvalue), however, the results still came out wrong. It turned out that the episodes list was defined outside the loops, so it kept accumulating episodes from every state-action pair.
The problematic part (a small standalone sketch of the scoping issue follows the snippet):

```python
episodes = []  # this should be initialized inside each (state, action) pair
for epoch in range(epochs):
    for state in tqdm(range(self.state_space_size)):
        for action in range(self.action_space_size):
            # episodes should be initialized here, not outside the loops
```
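A minimal standalone sketch of the scoping issue (loop bounds made up for illustration): a list created outside the nested loops keeps growing across every (state, action) pair, so later averages mix in episodes that did not start from the current pair.

```python
episodes = []  # shared across all (state, action) pairs: this is the bug
for state in range(2):
    for action in range(2):
        episodes.append((state, action))
        # By the time (1, 1) is processed, `episodes` still holds the
        # entries from (0, 0), (0, 1) and (1, 0) as well.
print(len(episodes))  # 4, not 1; re-initializing inside the action loop gives 1
```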
The revised mc_basic_simple_GUI is as follows:
```python
def mc_basic_simple_GUI(self, length=50, epochs=10):
    """
    :param length: episode length collected for each state-action pair
    """
    num_episode = 10
    for epoch in range(epochs):
        for state in tqdm(range(self.state_space_size), desc=f"Epoch {epoch}/{epochs}"):
            for action in range(self.action_space_size):
                episodes = []
                # Collect sufficiently many episodes starting from (s, a) by following πk
                for tmp in range(num_episode):  # collect 10 episodes for each action
                    episodes.append(self.obtain_episode(self.policy, state, action, length))
                # Policy evaluation:
                sum_qvalue_list = []
                for each_episode in episodes:
                    sum_qvalue = 0
                    for i in range(len(each_episode)):
                        sum_qvalue += (self.gama ** i) * each_episode[i]['reward']
                    sum_qvalue_list.append(sum_qvalue)
                # The average return of all the episodes starting from (s, a)
                self.qvalue[state][action] = np.mean(sum_qvalue_list)
                # self.qvalue[state][action] = np.sum(sum_qvalue_list) / num_episode
            # Policy improvement:
            max_index = np.argmax(self.qvalue[state])
            max_qvalue = np.max(self.qvalue[state])
            self.policy[state, :] = np.zeros(self.action_space_size)
            self.policy[state, max_index] = 1
            self.state_value[state] = max_qvalue
```
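For completeness, a hypothetical usage sketch: the class name MCBasicSolver and its construction below are illustrative assumptions, not the actual API of MC_Basic.py; only the mc_basic_simple_GUI call mirrors the snippet above.

```python
# Hypothetical usage sketch: the constructor is an assumption, standing in
# for however MC_Basic.py builds its grid-world solver object.
solver = MCBasicSolver()
solver.mc_basic_simple_GUI(length=50, epochs=10)
print(solver.policy)       # greedy (one-hot) policy after the MC Basic sweeps
print(solver.state_value)  # each state's value, taken from its max Q-value
```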
Thanks for pointing out these issues; I'll fix and merge them when I find some time.