Text4Vis
【AAAI'2023 & IJCV】Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
Thank you for your impressive work. Could you provide the pretrained model trained without text on HMDB, as shown in Table 6? Thank you very much. Kind regards,
Hello! Your project is very interesting. I would like to adapt it into a regression task on my own dataset. Is such a modification possible? If so, which parts should be changed, and how?
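Not speaking for the authors, but a common starting point for this kind of modification is to keep the frozen video encoder and swap the similarity-based classifier for a regression head trained with MSE. The sketch below is illustrative only; `RegressionHead`, the feature dimension, and the stand-in tensors are assumptions, not part of this repository.

```python
# A minimal sketch (not the authors' code) of replacing the class-embedding
# classifier with a regression head. `features` stands in for the output of
# the frozen video encoder; all names here are hypothetical.
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Map the encoder feature to a single continuous output
        # instead of per-class similarity logits.
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        return self.head(video_features).squeeze(-1)

# Training step: MSE loss replaces cross-entropy over class logits.
head = RegressionHead(feat_dim=512)
criterion = nn.MSELoss()
features = torch.randn(8, 512)   # stand-in for encoder output
targets = torch.randn(8)         # continuous regression labels
loss = criterion(head(features), targets)
loss.backward()
```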
Hello, I am very interested in your work and have two questions.
1. How is the classifier obtained (i.e., the training procedure)? Specifically, how is the lda_0.1.pt file produced by transferring visual statistic knowledge (LDA), and how are the classes_features obtained by transferring textual semantic knowledge?
2. The related .pt files, distilbert-base-k400.pt and lda_0.1.pt, are not provided.
Looking forward to your reply.
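For reference, here is a plausible sketch of the textual-semantic-knowledge side, as commonly done with CLIP: encode each class name with the frozen text encoder and keep the normalized embeddings as classifier weights. This assumes the OpenAI `clip` package; the prompt template and output file name are hypothetical, and this is not the authors' exact pipeline (in particular, it does not cover the LDA-based visual statistic classifier).

```python
# Illustrative sketch: build class embeddings ("classes_features") from
# class names using a frozen CLIP text encoder. Assumes `pip install clip`
# (https://github.com/openai/CLIP); names and prompts are assumptions.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["brush hair", "cartwheel", "catch"]  # e.g. HMDB51 classes
prompts = clip.tokenize(
    [f"a video of a person {c}" for c in class_names]
).to(device)

with torch.no_grad():
    classes_features = model.encode_text(prompts)
    # L2-normalize so similarities against visual features are cosine similarities.
    classes_features = classes_features / classes_features.norm(dim=-1, keepdim=True)

torch.save(classes_features.cpu(), "classes_features.pt")  # hypothetical file name
```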
CoOp
May I ask how CoOp is implemented in the paper? Is there a tutorial available?
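For readers unfamiliar with it, CoOp (Context Optimization) replaces hand-written prompt text with learnable context vectors that are prepended to each class name's token embeddings before the frozen CLIP text encoder. Below is a minimal, simplified sketch of that idea, not the implementation used in this paper; all dimensions and names are assumptions.

```python
# A CoOp-style prompt learner sketch (illustrative, simplified): only the
# context vectors receive gradients, while the CLIP encoders stay frozen.
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, n_classes: int = 51):
        super().__init__()
        # Shared learnable context tokens, small random initialization.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        self.n_classes = n_classes

    def forward(self, class_token_embeddings: torch.Tensor) -> torch.Tensor:
        # class_token_embeddings: (n_classes, n_name_tokens, ctx_dim),
        # the frozen token embeddings of each class name.
        ctx = self.ctx.unsqueeze(0).expand(self.n_classes, -1, -1)
        # Prompt = [learned context tokens][class name tokens].
        return torch.cat([ctx, class_token_embeddings], dim=1)

learner = PromptLearner()
name_emb = torch.randn(51, 4, 512)  # stand-in for frozen token embeddings
prompts = learner(name_emb)         # (51, 16 + 4, 512) -> frozen text encoder
```

In full CoOp the learned context sits between the start-of-sequence token and the class tokens inside CLIP's token sequence; the sketch above omits that bookkeeping for brevity.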
Hello, I read your paper and found it very inspiring and well written. I have two questions.
1. I have successfully reproduced the code. Using the ViT-L/14 pretrained model on two 4090 GPUs, I got top-1: 95.3% / top-5: 99.2%, which may still fall short of your results.
2. When fusing the visual and textual features, you use CLIP's default cosine-similarity computation, but I don't quite follow the code; it doesn't seem to match the pseudocode in the original CLIP paper. Could you explain what `logit_scale` is for, why it is needed, and why it is initialized this way?

```python
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
logit_scale = self.logit_scale.exp()
logits = logit_scale * image_emb @ text_emb.t()
```
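For context: `logit_scale` is the learnable temperature from CLIP's contrastive loss, stored in log space so that `exp()` always yields a positive scale; initializing it at `log(1/0.07)` reproduces the temperature τ = 0.07 from the CLIP paper, so `logit_scale.exp() * image_emb @ text_emb.t()` matches the paper's pseudocode `logits = np.dot(I_e, T_e.T) * np.exp(t)`. A self-contained illustration follows; the random embeddings and dimension are assumptions.

```python
# Why the temperature matters: cosine similarities lie in [-1, 1], which is
# too flat for a useful softmax, so CLIP scales them by a learnable factor.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # exp() ~= 14.29

image_emb = F.normalize(torch.randn(4, 512), dim=-1)  # unit-norm embeddings, so
text_emb = F.normalize(torch.randn(4, 512), dim=-1)   # emb @ emb.T is cosine sim

cosine = image_emb @ text_emb.t()        # values in [-1, 1]
logits = logit_scale.exp() * cosine      # scaled logits, as in CLIP's pseudocode
probs = logits.softmax(dim=-1)           # sharper, trainable distribution
```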
Which code should I run to reproduce the published result? Also, I noticed that `train_nce.py` is quite similar to the code for [BIKE](https://github.com/whwu95). It would be helpful if you could...
While the GitHub links are available, all OneDrive links have expired. Training on HMDB51 and UCF101 requires the pre-trained ViT-L models, which are therefore inaccessible. Please extend...