Scheduled Sampling
In the scheduled sampling paper, it is mentioned that tossing a coin once and then feeding the model's predicted outputs for the whole sequence (or not) actually performs worse. Instead, the choice between the correct token and the model's own prediction should be made independently at each time step (see the footnote on p. 3 of the paper). Yet in the decoder here, teacher forcing is either enabled for the whole sequence or not at all, so I don't think that would work.
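For reference, a minimal sketch of the per-time-step variant inside a decoder loop might look like this (`decoder_step`, `embed`, and `targets` are placeholder names for illustration, not this repo's actual API):

```python
import torch

def decode_with_scheduled_sampling(decoder_step, embed, targets, hidden,
                                   sampling_prob):
    """Decode with a per-time-step coin flip: at every step, decide
    whether the next input is the ground-truth token or the model's
    own previous prediction.

    decoder_step: runs one decoder step -> (logits, hidden)
    targets: (batch, seq_len) ground-truth token ids
    sampling_prob: probability of feeding the model's own prediction
    """
    batch_size, seq_len = targets.shape
    inputs = targets[:, 0]  # e.g. the <sos> token
    all_logits = []
    for t in range(1, seq_len):
        logits, hidden = decoder_step(embed(inputs), hidden)
        all_logits.append(logits)
        predicted = logits.argmax(dim=-1)
        # Independent coin flip per step (and per example), rather
        # than a single flip for the whole sequence.
        use_model = torch.rand(batch_size, device=targets.device) < sampling_prob
        inputs = torch.where(use_model, predicted, targets[:, t])
    return torch.stack(all_logits, dim=1), hidden
```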
You might be right, but the teacher forcing here can really improve performance by 1~2 points~
@AtmaHou Did you mean the kind of teacher forcing that is implemented here? I tried that and it actually doesn't improve performance (in agreement with the scheduled sampling paper).
Yep~~ You could try tuning the teacher forcing ratio (default 0); 0.5 is worth trying. I found that neither 0 nor 1 helps. emmmmm..... From my point of view, scheduled sampling is just a trick to let the model see its own output at some random rate, and both methods achieve this.
@AtmaHou My experience with this kind of teacher forcing on non-trivial tasks has not been good so far; it sometimes worsens my results. The scheduled sampling method works better, though.
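One detail worth noting: the paper doesn't use a fixed rate but decays the probability of feeding the ground-truth token as training progresses, via a linear, exponential, or inverse sigmoid schedule. A rough sketch of those schedules (the constants `k`, `c`, and `floor` are tunable knobs, not values from the paper or this repo):

```python
import math

def teacher_forcing_prob(i, schedule="inverse_sigmoid",
                         k=1000.0, c=1e-5, floor=0.0):
    """Probability of feeding the ground-truth token at training step i,
    decayed over training. k, c, and floor are tunable constants."""
    if schedule == "linear":
        # eps_i = max(floor, k - c*i), with k <= 1
        return max(floor, min(k, 1.0) - c * i)
    if schedule == "exponential":
        # eps_i = k**i, with k < 1 (e.g. k = 0.9999)
        return k ** i
    if schedule == "inverse_sigmoid":
        # eps_i = k / (k + exp(i / k)), with k >= 1
        return k / (k + math.exp(i / k))
    raise ValueError(f"unknown schedule: {schedule}")
```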
Since this repo has so many stars, and at one point I was using it as a reference implementation, I thought I should point this out.
@umgupta Ha~ Your post has also deepened my understanding of teacher forcing. Maybe I should implement the kind of teacher forcing you pointed out, which could further improve my model's performance.
@AtmaHou Sure, do so and let me know :).
Also, I am fairly new to sequence learning. Do you happen to know a toy problem for comparing algorithms or sanity-checking them (like MNIST for images)? The sequence-reversal task in this repo is too trivial: any kind of teacher forcing works OK on it, and even code with mistakes can still get good results.
@umgupta The machine translation problem in the PyTorch tutorial is quite simple; it might satisfy you.