Reward loss targets don't account for episodes that finish within n steps
Hi,
Great work. I enjoyed reading the paper and I replicated your work independently.
I noticed a minor performance difference between the two implementations, and I think I've found the cause: while computing the target for the reward loss, you aren't accounting for episodes that finish within nsteps (whereas for the Q loss this is correctly accounted for).
Specifically, `proc_seq.append(seq[env, t+offset:t+offset+depth, :])` on line 32 of treeqn_utils.py doesn't check for done flags.
If the episode finished at time t=3, this makes the target for d=1 at t=3 the reward at t=4, which is wrong. Can you please clarify whether my understanding is correct?
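To make it concrete, here's a toy sketch of the effect (my own names and numbers, not the repo's actual code):

```python
import numpy as np

rewards = np.array([0.0, 0.0, 0.0, 1.0, 5.0])  # episode ends at t=3 with reward 1.0
dones   = np.array([0,   0,   0,   1,   0])    # t=4 already belongs to the next episode
depth, t = 2, 3

# Current behaviour: the slice ignores dones, so the d=1 target at t=3 is rewards[4]
naive_window = rewards[t:t + depth]            # -> [1.0, 5.0]; 5.0 leaks in from the next episode

# What it should be: zero out everything after the first done inside the window
done_window = dones[t:t + depth]
alive = np.cumprod(1 - np.concatenate(([0], done_window[:-1])))  # 1 up to and including the done step
masked_window = naive_window * alive           # -> [1.0, 0.0]
print(naive_window, masked_window)
```

With the done flag at t=3, the naive window pulls 5.0 (a reward from the next episode) into the d=1 target, while the masked version zeroes it out.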
Hi, thanks a lot, that definitely looks wrong! I think I was handling this correctly in an earlier version and broke it when "simplifying" for the code release 😅. I'll try to fix this if I find a little time, or you're welcome to make a pull request.
https://github.com/oxwhirl/treeqn/pull/4
Here's what I think is a quick fix. There's probably a faster vectorized implementation, though.
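For what it's worth, a vectorized version could look roughly like this (untested sketch; the function name, shapes, and the unfold/cumprod approach are just my own illustration, not code from the repo or the PR):

```python
import torch

def nstep_reward_targets(rewards, dones, depth):
    """Vectorized sketch: per-start-index reward windows with anything past
    an episode boundary zeroed out.

    rewards, dones: [num_envs, T] float tensors.
    Returns: [num_envs, T - depth + 1, depth].
    """
    reward_windows = rewards.unfold(1, depth, 1)  # [num_envs, T - depth + 1, depth]
    done_windows = dones.unfold(1, depth, 1)
    # "alive" stays 1 up to and including the step where done first fires,
    # then drops to 0 for the rest of the window (next-episode steps).
    shifted = torch.cat(
        [torch.zeros_like(done_windows[..., :1]), done_windows[..., :-1]], dim=-1)
    alive = torch.cumprod(1.0 - shifted, dim=-1)
    return reward_windows * alive
```

The cumprod over the shifted done flags keeps the reward at the terminal step itself but zeroes everything after it, which should match what the loop-based fix does.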
Thanks for the bug spot and fix; sorry for being so slow -- merged now!
@Greg-Farquhar I think the np -> torch commit introduced a new bug. In `make_seq_mask`, the `mask` variable is updated in place, which causes our `tmp_masks` variable in the `build_sequences` function to change as well:

`mask[int(max_i):].fill_(1)`
I also think that I initially padded `tmp_masks` the wrong way. I forgot that PyTorch pads in reverse, and we should also be padding with 1's, since they will be flipped to zeros in the `make_seq_mask` function. If both of these changes sound correct, I'll submit a PR.
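Roughly what I have in mind for the in-place issue (a simplified sketch rather than the exact function in the repo; the nonzero-based `max_i` is just a stand-in):

```python
import torch

def make_seq_mask(mask):
    # Clone first so fill_ only touches a local copy and the caller's
    # tmp_masks tensor in build_sequences is left untouched.
    mask = mask.clone().float()
    nonzero = mask.nonzero()
    if nonzero.numel() > 0:
        max_i = nonzero[0, 0]          # hypothetical: first flagged step
        mask[int(max_i):].fill_(1)
    # The returned mask is flipped, which is why padding tmp_masks with 1's
    # (rather than 0's) ends up as zeros here.
    return 1 - mask
```

The clone is the important bit; the final flip is why padding with 1's looks like the right choice to me.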
Ah, thanks. Just pushed, lmk if that fixes it correctly!
looks good!