VQA_AlgorithmDatasets

Table of Contents

  • Collection of VQA papers
  • Leaderboard
  • Tutorials
  • VQA Dataset
  • VQA Algorithm
  • VQA Code Library

:page_facing_up: Collection of Papers

  • VQA: https://github.com/jokieleung/awesome-visual-question-answering/blob/master/README.md#CVPR-2020

  • Text-VQA: https://github.com/xinke-wang/Awesome-Text-VQA

  • Survey papers:

    • KB-VQA: https://github.com/astro-zihao/Awesome-KBQA
    • 2019, V+L dataset and methods: https://arxiv.org/pdf/1907.09358.pdf
    • 2017, VQA dataset and methods: https://www.sciencedirect.com/science/article/pii/S1077314217300772?casa_token=EX_Gt8Ib5rQAAAAA:NSFjlS4iVem0eC_iQCvHf6HPkg18fbQAQC-BqxW96u85bg2gMNw0yFFUFS4HvdiAuzr0D0FQ1Bc

:green_book: Tutorials

  • Sigir2020: https://www.avishekanand.com/talk/sigir20-tute/
  • CVPR2020 (Recent Advances in Vision-and-Language Research): https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/
  • KDD 2020 (Scene Graph): https://suitclub.ischool.utexas.edu/IWKG_KDD2020/slides/Shih-Fu.pdf

:chart_with_upwards_trend: Leaderboard

  • VQAv2 leaderboard: https://visualqa.org/roe.html
| Algorithm | Accuracy |
| --- | --- |
| Renaissance | 79.34 |
| UNIMO Ensemble | |
| VinVL (MSR + MS Cog Svcs., X10 models) (paper, code) | 76.60 |
| GridFeat+MoVie | 76.36 |
| DL-61 (BGN) | 76.08 |
| VILLA (adversarial training, based on UNITER) (paper, code) | 75.90 |
| Ensemble of LXMERT, ViLBERT, VisualBERT | 75.15 |
| Pixel-BERT (x152) | 74.45 |
| Oscar (paper, code) | 73.82 |
| UNITER (+grid feature) (paper, code1, code2) | 73.82 |
| SOHO | 73.47 |
| LXMERT (paper, code) | 72.54 |
| VL-BERT | 72.22 |
| Pixel-BERT (r50) | 71.35 |
| ViLT | 71.32 |
| VisualBERT | 71.00 |
| MCAN | 70.93 |
| ViLBERT | 70.92 |
| BUTD | 65.67 |
| MUTAN | 60.17 |
  • VizWiz leaderboard (2022): https://eval.ai/web/challenges/challenge-page/1560/leaderboard/3852
| Algorithm | Accuracy |
| --- | --- |
| GIT | 67.53 |
| HSSLab | 66.72 |
| Alibaba | 61.81 |
| LXMERT | 55.40 |
| Pythia | 54.72 |
| GridFeature+MCAN | 54.17 |
| ViLBERT | 52.00 |
| SAN | 47.30 |
  • Text VQA leaderboard (2022): https://eval.ai/web/challenges/challenge-page/874/leaderboard/2313

| Algorithm | Accuracy |
| --- | --- |
| Mia | 73.67 |
| SunLan | 65.86 |
| Summer | 59.16 |
| Microsoft | 54.71 |
| TAG | 53.69 |
| ST-VQA | 45.66 |
| M4C | 39.01 |
| RUArt-M4C | 33.54 |
| LoRRA | 27.63 |

:floppy_disk: Dataset

  • VQA Dataset
    • General VQA

      • COCO
      • VQAv1, VQAv2
      • VQA Dialog
    • Text-VQA

      • TextVQA
      • Scene Text VQA
      • OCR-VQA (toy-sized dataset of book and poster covers)
    • Doc-VQA

    • Rephrase VQA questions

      • Inverse Visual QA (iVQA)
      • VQA-Rephrasings
      • VQA-LOL
      • VQA-Introspect
      • Rephrase ambiguous questions | 2022 paper
    • Replace VQA images

      • VQAv2
      • VQA-CP
    • VQA reasoning

      • VCR (11/2018)
      • Visual Entailment(2019)
      • GQA
      • CLEVR
      • Referring Expression
      • NLVR2 (2018)
    • VQA with External Knowledge

      • OK-VQA
      • FVQA
      • KBVQA
      • KVQA (2019)
    • Explainable/Grounding Image Captioning/VQA

      • Grounding for image captioning (referring expression)
        • Flickr30K entities
        • Visual Genome
        • RefClef
        • RefCOCO
        • CLEVR-Ref+
        • Google Referring expression
        • PhraseCut
      • grounding for VQA
        • Visual7W (2016)
        • Visual Genome (2016) | paper | website
        • VQA-HAT(2016)
        • VQS (2017) | paper
        • VQA-X(2018)
        • VQA-E(2018)
        • TextVQA-X
        • GQA
        • CLEVR-Ans
        • VizWiz-VQA-Grounding (2022) | paper
    • Multilingual

      • Multilingual VQA
      • Image captioning
        • crossmodal3600

:pencil2: Algorithm

  • Image Feature preparation

    • Show, Attend and Tell (2015/5)
    • SAN (2015/11)
    • BUTD (2017/7) | paper
    • Grid Feature (2020/1)
    • Pixel-BERT (2020/4)
    • SOHO(2021/4)
    • VinVL(2021/4)
  • Enhanced multimodal fusion

    • Bilinear pooling: how to fuse an image vector and a question vector into one joint representation

      • MCB (2016/6)
      • MLB (2016/10)
      • MUTAN (2017/5)
      • MFB&MFH (2017/8)
      • BLOCK (2019/1)
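As a rough sketch of the idea shared by MLB/MFB-style methods (not the exact formulation of any one paper): both modalities are projected into a common space, fused by an element-wise product, and sum-pooled over groups of dimensions. The matrices, dimensions, and normalization below are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

d_img, d_txt, d_rank, d_out = 2048, 512, 1024, 16

# Hypothetical projection matrices (learned in a real model).
U = rng.normal(size=(d_img, d_rank))   # image projection
V = rng.normal(size=(d_txt, d_rank))   # question projection

def low_rank_bilinear_pool(x, y, k=d_rank // d_out):
    """Fuse an image vector x and a question vector y (MFB-style sketch):
    project both into a shared space, take the element-wise product,
    then sum-pool each group of k dimensions into one output dimension."""
    joint = (x @ U) * (y @ V)                 # (d_rank,) element-wise fusion
    pooled = joint.reshape(d_out, k).sum(1)   # sum pooling over groups
    # signed square-root ("power") + L2 normalization, as used in MFB
    z = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return z / (np.linalg.norm(z) + 1e-12)

x = rng.normal(size=d_img)   # e.g. a CNN image feature
y = rng.normal(size=d_txt)   # e.g. an LSTM question feature
z = low_rank_bilinear_pool(x, y)
print(z.shape)  # (16,)
```

The grouped sum pooling is what makes the fusion "low-rank": a full bilinear map x W y would need a d_img x d_txt x d_out tensor, while the factorized form needs only the two projections.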
    • FiLM: Feature-wise Linear Modulation

      • FiLM
    • cross-modal attention

      • SAN (2015/11)
      • HierCoAttn (2016/5)
      • DAN (2016/11)
      • DCN (2018/4)
      • BAN (2018/5)
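The common core of these cross-modal attention models can be sketched as question-guided soft attention over image regions. This simplified version scores regions by a dot product with the question vector (real models like SAN or BUTD use a learned MLP scorer); all shapes are illustrative:

```python
import numpy as np

def question_guided_attention(q, regions):
    """Soft attention over image regions guided by a question vector:
    score each region against the question, softmax the scores, and
    return the attention-weighted sum of region features."""
    scores = regions @ q          # (N,) dot-product relevance scores
    scores -= scores.max()        # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()      # softmax attention weights
    return weights @ regions, weights

rng = np.random.default_rng(0)
N, d = 36, 512                    # e.g. 36 BUTD region features of dim 512
regions = rng.normal(size=(N, d))
q = rng.normal(size=d)
attended, w = question_guided_attention(q, regions)
print(attended.shape)  # (512,)
```

Co-attention variants (HierCoAttn, DAN, DCN, BAN) extend this by also attending over question words conditioned on the image, rather than attending in one direction only.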
    • pretraining:

      • UNITER
      • ViLBERT
      • LXMERT
      • B2T2
      • VisualBERT
      • Unicoder-VL
      • VL-BERT
      • ERNIE-ViL (AAAI, 2021): Scene Graph Prediction
      • Oscar
      • UNIMO (ACL, 2021)
    • End-to-End pretraining:

      • SOHO (CVPR, 2021/4)
      • ViLT (2021, ICML)
    • graph attention/graph Convolutional Network

      • Graph-Structured, (2016/9)
      • Relation Network, (2017/6)
      • Graph Learner,(2018/6)
      • MuRel, (2019/2)
      • ReGAT, (2019/3)
      • LCGN (2019/5)
    • Cross-modal+intra-modal

      • MCAN, 2019: Deep Modular Co-Attention Network
    • Multi-step reasoning

      • MAC: Memory, Attention and Composition
    • Neural module networks

      • NMN, (2015/11)
      • N2NMN,(2017/4)
      • PG+EE,(2017/5)
      • TbD,(2018/3)
      • stackNMN,(2018/7)
      • NS-VQA,(2018/10)
      • Prob-NMN, (2019/2)
      • MMN (2019/10)
  • External Knowledge Algorithm