Why is the last token in the sequence removed?
I was wondering if you could explain why the last token in the sequence is dropped in the cond_transformer.py script; the paper does not explain this. Thanks!
https://github.com/CompVis/taming-transformers/blob/9d17ea64b820f7633ea6b8823e1f78729447cb57/taming/models/cond_transformer.py#L100
same question
Autoregressive learning needs to remove the last element. For example, for the sequence [a, b, c, d, e], the decoder's input would be [bos, a, b, c, d, e] and the desired output [a, b, c, d, e, eos]. There is no eos here, so we just remove the last element 'e': the decoder input becomes [bos, a, b, c, d] and the desired output [a, b, c, d, e].
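A minimal sketch of that shift (the function name is hypothetical, not from the repo): without an eos token, the last element is dropped from the input so the input and target stay the same length.

```python
def make_decoder_pair(seq, bos="bos"):
    """Build (decoder_input, desired_output) for autoregressive training.

    With no explicit eos token, the last element is dropped from the
    input so input and target have equal length.
    """
    decoder_input = [bos] + seq[:-1]   # [bos, a, b, c, d]
    desired_output = seq               # [a, b, c, d, e]
    return decoder_input, desired_output

inp, out = make_decoder_pair(["a", "b", "c", "d", "e"])
# inp == ["bos", "a", "b", "c", "d"]; out == ["a", "b", "c", "d", "e"]
```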
The first image token is predicted from the output at the last segmentation token, and a prediction for the last image token is not needed; dropping it lets the transformer's input fit in 512 indices. However, the predicted results should be read starting from the output at the last segmentation token.