Hi, thanks for this wonderful work. I am confused about the CrossAttention module. In the code of XBERT, when layer_num >= 6 the text_encoder switches to cross-attention; however, it will do...
Thanks for this great work and the open-sourced repo. I want to know how to visualize the results after inference, like those you show at the end of the repo.
Thanks for this great work. I am wondering whether you simply randomly initialized the adaption prompts, since you used zero-init attention in the L layers. I also think the multi-modal...
Thanks for this wonderful work. I have a question about the attached picture. When sr_ratio > 1, you apply the conv first, then add the original v, and apply the attention function last. But...
Thanks for this wonderful work. It is very inspiring. I am confused about how to generate the heat-map shown in your paper. Looking forward to your reply at...
Thanks for this inspiring work. I am confused about why you chose Group Normalization for normalization. Did you try other methods like Batch Normalization? I think Batch Normalization...
Thanks for this great work. I'm wondering how to run inference on a single 8 GB GPU, like the example shown in the README. I tried it on my RTX 2080 Ti with...
Thanks for this great work and the open-sourced repo. I haven't worked on segmentation tasks before, and I would like to know how to visualize the dense segmentation image like you show...
Has anyone successfully downloaded the Flickr30k dataset provided in this article? I used azcopy, but it didn't work. I really need this dataset for some research. If you can, please...