Reward loss targets don't account for episodes that finish within n steps
Hi,
Great work. I enjoyed reading the paper and I replicated your work independently.
I noticed a minor performance difference between the two implementations, and I think I've found the cause: while computing the target for the reward loss, you aren't accounting for episodes that finish within nsteps (whereas for the Q loss this is correctly accounted for).
Specifically, `proc_seq.append(seq[env, t+offset:t+offset+depth, :])` on line 32 of treeqn_utils.py doesn't check for done flags.
If the episode finished at time t=3, this makes the target for d=1 at t=3 the reward at t=4, which is wrong. Can you please clarify whether my understanding is correct?
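To make it concrete, here's a toy sketch of the effect (my own names and numbers, not the repo's actual code):

```python
import numpy as np

rewards = np.array([0.0, 0.0, 0.0, 1.0, 5.0])  # episode ends at t=3 with reward 1.0
dones   = np.array([0,   0,   0,   1,   0])    # t=4 already belongs to the next episode
depth, t = 2, 3

# Current behaviour: the slice ignores dones, so the d=1 target at t=3 is rewards[4]
naive_window = rewards[t:t + depth]            # -> [1.0, 5.0]; 5.0 leaks in from the next episode

# What it should be: zero out everything after the first done inside the window
done_window = dones[t:t + depth]
alive = np.cumprod(1 - np.concatenate(([0], done_window[:-1])))  # 1 up to and including the done step
masked_window = naive_window * alive           # -> [1.0, 0.0]
print(naive_window, masked_window)
```

With the done flag at t=3, the naive window pulls 5.0 (a reward from the next episode) into the d=1 target, while the masked version zeroes it out.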
Hi, thanks a lot, that definitely looks wrong! I think I was handling this correctly in an earlier version and broke it when "simplifying" for the code release 😅. I'll try to fix this if I find a little time, or you're welcome to make a pull request.
https://github.com/oxwhirl/treeqn/pull/4
Here's what I think is a quick fix. There's probably a faster vectorized implementation, though.
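For what it's worth, a vectorized version could look roughly like this (untested sketch; the function name, shapes, and the unfold/cumprod approach are just my own illustration, not code from the repo or the PR):

```python
import torch

def nstep_reward_targets(rewards, dones, depth):
    """Vectorized sketch: per-start-index reward windows with anything past
    an episode boundary zeroed out.

    rewards, dones: [num_envs, T] float tensors.
    Returns: [num_envs, T - depth + 1, depth].
    """
    reward_windows = rewards.unfold(1, depth, 1)  # [num_envs, T - depth + 1, depth]
    done_windows = dones.unfold(1, depth, 1)
    # "alive" stays 1 up to and including the step where done first fires,
    # then drops to 0 for the rest of the window (next-episode steps).
    shifted = torch.cat(
        [torch.zeros_like(done_windows[..., :1]), done_windows[..., :-1]], dim=-1)
    alive = torch.cumprod(1.0 - shifted, dim=-1)
    return reward_windows * alive
```

The cumprod over the shifted done flags keeps the reward at the terminal step itself but zeroes everything after it, which should match what the loop-based fix does.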
Thanks for the bug spot and fix; sorry for being so slow -- merged now!
@Greg-Farquhar I think the np -> torch commit introduced a new bug. In `make_seq_mask`, the `mask` variable is updated in place, which causes our `tmp_masks` variable in the `build_sequences` function to change as well:

`mask[int(max_i):].fill_(1)`
I also think that I initially padded `tmp_masks` the wrong way. I forgot that PyTorch pads in reverse, and we should also be padding with 1's, since they will be flipped to zeros in the `make_seq_mask` function. If both of these changes sound correct, I'll submit a PR.
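Roughly what I have in mind for the in-place issue (a simplified sketch rather than the exact function in the repo; the nonzero-based `max_i` is just a stand-in):

```python
import torch

def make_seq_mask(mask):
    # Clone first so fill_ only touches a local copy and the caller's
    # tmp_masks tensor in build_sequences is left untouched.
    mask = mask.clone().float()
    nonzero = mask.nonzero()
    if nonzero.numel() > 0:
        max_i = nonzero[0, 0]          # hypothetical: first flagged step
        mask[int(max_i):].fill_(1)
    # The returned mask is flipped, which is why padding tmp_masks with 1's
    # (rather than 0's) ends up as zeros here.
    return 1 - mask
```

The clone is the important bit; the final flip is why padding with 1's looks like the right choice to me.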
Ah, thanks. Just pushed, lmk if that fixes it correctly!
looks good!