Question about the transform between true reward and value prefix
Hi, I am a little confused about how to get the true reward from the value prefix in core/ctree/cnode.cpp.
In the function update_tree_q() at Line 256, the true reward is calculated as float true_reward = node->value_prefix - parent_value_prefix.
Suppose we have a root node_1 with two children, node_2 and node_3.
Before the while loop, we push node_1 onto node_stack. In the first iteration of the while loop, we pop node_1, push node_2 and node_3 onto node_stack, and finally set parent_value_prefix = node_1.value_prefix.
In the second iteration, we pop node_3 (suppose no child of node_3 has been expanded) and set parent_value_prefix = node_3.value_prefix (Line 281).
In the third iteration, we pop node_2. When we calculate the true reward of node_2 at Line 266, true_reward = node_2.value_prefix - parent_value_prefix = node_2.value_prefix - node_3.value_prefix.
However, the parent of node_2 is node_1, so the true reward should be node_2.value_prefix - node_1.value_prefix. So I wonder whether there is a problem with how the variable "parent_value_prefix" is updated, or whether I have misunderstood the code.
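To make the trace above concrete, here is a minimal sketch that reproduces it outside the repository. The Node struct and both function names are hypothetical and only for illustration. The sketch leaves out the rest of update_tree_q (the discount, the Q-value, and the min/max update) and keeps only the stack traversal and the subtraction. The second function shows one possible fix, which pairs each node on the stack with its own parent's value prefix. It is not necessarily the fix the maintainers would choose.

```cpp
#include <cassert>
#include <map>
#include <stack>
#include <utility>
#include <vector>

// Hypothetical, simplified node: only the fields needed for this illustration.
struct Node {
    int id;
    float value_prefix;
    std::vector<Node*> children;
};

// Traversal as described in the question: a single shared variable is
// overwritten after every pop, so it holds the value prefix of the
// previously popped node, which is not always the parent.
std::map<int, float> true_rewards_shared(Node* root) {
    assert(root != nullptr);
    std::map<int, float> rewards;
    std::stack<Node*> node_stack;
    node_stack.push(root);
    float parent_value_prefix = 0.0f;
    while (!node_stack.empty()) {
        Node* node = node_stack.top();
        node_stack.pop();
        if (node != root) {
            rewards[node->id] = node->value_prefix - parent_value_prefix;
        }
        for (Node* child : node->children) {
            node_stack.push(child);
        }
        parent_value_prefix = node->value_prefix;
    }
    return rewards;
}

// One possible fix (an assumption, not the repository's code): push each
// node together with the value prefix of its actual parent.
std::map<int, float> true_rewards_per_parent(Node* root) {
    assert(root != nullptr);
    std::map<int, float> rewards;
    std::stack<std::pair<Node*, float>> node_stack;
    node_stack.push({root, 0.0f});
    while (!node_stack.empty()) {
        std::pair<Node*, float> entry = node_stack.top();
        node_stack.pop();
        Node* node = entry.first;
        float parent_value_prefix = entry.second;
        if (node != root) {
            rewards[node->id] = node->value_prefix - parent_value_prefix;
        }
        for (Node* child : node->children) {
            node_stack.push({child, node->value_prefix});
        }
    }
    return rewards;
}
```

For example, with value prefixes 1.0 for node_1, 1.5 for node_2, and 3.0 for node_3, the shared-variable version gives node_2 a true reward of 1.5 - 3.0 = -1.5, while the per-parent version gives 1.5 - 1.0 = 0.5. Both versions agree on node_3 (2.0), because node_3 is popped immediately after its parent.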
That said, in the function update_tree_q we only update the min_max value, so it may not affect convergence. I wonder whether training would converge faster if this operation were fixed.
Thank you for your correction!
You are right. It is a bug that results in wrong min/max values on the tree side. Thank you very much for your careful reading. I think it will affect convergence, stability, or something else.
We will fix this in the next few days and check the performance :)