Why not use the masked transformers directly in the first two stages?

Open · xwan0527 opened this issue 1 year ago · 0 comments

Why do the first two stages use convolutions instead of masked transformers? Since upsampling is already used to obtain the mask matrix, it seems that transformers could be used in those stages as well.

May 15 '24 09:05 xwan0527