
MC_Basic.py: sum_qvalue_list.append(sum_qvalue) is in the wrong place

konyyds opened this issue 3 months ago · 2 comments

RL_Learning-main/scripts/Chapter5_Monte Carlo Methods/MC_Basic.py

Current buggy code:

sum_qvalue_list = []
for each_episode in episodes:
    sum_qvalue = 0
    for i in range(len(each_episode)):
        sum_qvalue += (self.gama**i) * each_episode[i]['reward']
sum_qvalue_list.append(sum_qvalue)  # ❌ wrong place: outside the loop
self.qvalue[state][action] = np.mean(sum_qvalue_list)

Corrected code:

sum_qvalue_list = []
for each_episode in episodes:
    sum_qvalue = 0
    for i in range(len(each_episode)):
        sum_qvalue += (self.gama**i) * each_episode[i]['reward']
    sum_qvalue_list.append(sum_qvalue)  # ✅ correct place: append after each episode's return is computed
self.qvalue[state][action] = np.mean(sum_qvalue_list)

Problem analysis:

The problem with the current code

  • sum_qvalue_list.append(sum_qvalue) runs only once, after the loop over all episodes has finished
  • only the return of the last episode is ever appended to the list
  • the Q-value estimate is therefore based on a single episode rather than all of the collected episodes

Effect of the fix

  • the discounted cumulative return is computed for each episode
  • every episode's return is appended to the list
  • averaging these returns gives a more accurate Q-value estimate, as the sketch below demonstrates
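To make the difference concrete, here is a minimal self-contained sketch; the discount factor and toy rewards are invented for illustration, but the episode format (a list of {'reward': r} steps) matches the snippets above:

import numpy as np

gamma = 0.9  # illustrative discount factor
# Three toy episodes, each a list of {'reward': r} steps.
episodes = [
    [{'reward': 1}, {'reward': 0}, {'reward': 1}],
    [{'reward': 0}, {'reward': 1}],
    [{'reward': -1}],
]

sum_qvalue_list = []
for each_episode in episodes:
    sum_qvalue = 0
    for i in range(len(each_episode)):
        sum_qvalue += (gamma ** i) * each_episode[i]['reward']
    sum_qvalue_list.append(sum_qvalue)  # inside the loop: one return per episode

print(sum_qvalue_list)           # ≈ [1.81, 0.9, -1.0]
print(np.mean(sum_qvalue_list))  # ≈ 0.57, the average over all three episodes
# With the append outside the loop, the list would hold only the last
# return (-1.0), so the "average" would be that single value.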

konyyds commented on Oct 31 '25

After moving sum_qvalue_list.append(sum_qvalue), I found the results were still wrong. It turned out the episodes list was defined outside the loops, so it kept accumulating episodes across all state-action pairs.

The problematic part

episodes = []  # this should be initialized inside each state-action pair
for epoch in range(epochs):
    for state in tqdm(range(self.state_space_size)):
        for action in range(self.action_space_size):
            # episodes should be initialized here, not outside the loops
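A tiny self-contained sketch of the scoping problem (the loop bounds are made up; the tuple stands in for an episode returned by obtain_episode):

episodes = []  # created once, outside all loops
for state in range(2):
    for action in range(2):
        episodes.append((state, action))  # stands in for obtain_episode(...)

# len(episodes) is now 4: episodes from every (state, action) pair pile up,
# so a later Q-value estimate would average over returns that did not start
# from the current pair. Moving `episodes = []` inside the action loop
# resets the buffer for each pair, which is what the revision below does.
print(len(episodes))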

The revised mc_basic_simple_GUI:

def mc_basic_simple_GUI(self, length=50, epochs=10):
    """
    :param length: length of the episode collected from each state-action pair
    :param epochs: number of policy-iteration sweeps
    """
    num_episode = 10
    for epoch in range(epochs):
        for state in tqdm(range(self.state_space_size), desc=f"Epoch {epoch}/{epochs}"):
            for action in range(self.action_space_size):
                episodes = []
                # Collect sufficiently many episodes starting from (s, a) by following πk
                for tmp in range(num_episode):  # collect num_episode episodes per (state, action) pair
                    episodes.append(self.obtain_episode(self.policy, state, action, length))
                # Policy evaluation:
                sum_qvalue_list = []
                for each_episode in episodes:
                    sum_qvalue = 0
                    for i in range(len(each_episode)):
                        sum_qvalue += (self.gama ** i) * each_episode[i]['reward']
                    sum_qvalue_list.append(sum_qvalue)
                # The average return of all the episodes starting from (s, a):
                self.qvalue[state][action] = np.mean(sum_qvalue_list)
                # self.qvalue[state][action] = np.sum(sum_qvalue_list)/num_episode
            # Policy improvement:
            max_index = np.argmax(self.qvalue[state])
            max_qvalue = np.max(self.qvalue[state])
            self.policy[state, :] = np.zeros(self.action_space_size)
            self.policy[state, max_index] = 1
            self.state_value[state] = max_qvalue
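One structural detail of the revised function worth noting: the policy-improvement block sits at the indentation level of the action loop rather than inside it, so each state is made greedy with respect to its freshly evaluated q(s, ·) exactly once per sweep, right after all of that state's actions have been evaluated. The commented-out np.sum(sum_qvalue_list)/num_episode line is equivalent to np.mean(sum_qvalue_list) here only because every (state, action) pair collects exactly num_episode episodes.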

konyyds commented on Nov 01 '25

Thanks for pointing out the problem. I'll make the fix and merge it when I get some free time soon.

Ronchy2000 commented on Nov 10 '25