VQA_AlgorithmDatasets
Categories
- Collection of VQA papers
- Leaderboard
- Tutorials
- VQA Dataset
- VQA Algorithm
- VQA Code Library
:page_facing_up: Collection of Papers
- VQA: https://github.com/jokieleung/awesome-visual-question-answering/blob/master/README.md#CVPR-2020
- Text-VQA: https://github.com/xinke-wang/Awesome-Text-VQA
- Survey papers:
  - KB-VQA: https://github.com/astro-zihao/Awesome-KBQA
  - 2019, V+L datasets and methods: https://arxiv.org/pdf/1907.09358.pdf
  - 2017, VQA datasets and methods: https://www.sciencedirect.com/science/article/pii/S1077314217300772?casa_token=EX_Gt8Ib5rQAAAAA:NSFjlS4iVem0eC_iQCvHf6HPkg18fbQAQC-BqxW96u85bg2gMNw0yFFUFS4HvdiAuzr0D0FQ1Bc
:green_book: Tutorials
- SIGIR 2020: https://www.avishekanand.com/talk/sigir20-tute/
- CVPR 2020 (Recent Advances in Vision-and-Language Research): https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/
- KDD 2020 (Scene Graph): https://suitclub.ischool.utexas.edu/IWKG_KDD2020/slides/Shih-Fu.pdf
:chart_with_upwards_trend: Leaderboard
- VQAv2 leaderboard: https://visualqa.org/roe.html
| Algorithm | Accuracy (%) |
|---|---|
| Renaissance | 79.34 |
| UNIMO Ensemble | |
| VinVL (MSR + MS Cog Svcs., ×10 models) (paper, code) | 76.60 |
| GridFeat+MoVie | 76.36 |
| DL-61 (BGN) | 76.08 |
| VILLA (adversarial training, based on UNITER) (paper, code) | 75.9 |
| Ensemble of LXMERT, ViLBERT, VisualBERT | 75.15 |
| Pixel-BERT x152 | 74.45 |
| Oscar (paper, code) | 73.82 |
| UNITER (+grid feature) (paper, code1, code2) | 73.82 |
| SOHO | 73.47 |
| LXMERT (paper, code) | 72.54 |
| VL-BERT | 72.22 |
| Pixel-BERT r50 | 71.35 |
| ViLT | 71.32 |
| VisualBERT | 71.00 |
| MCAN | 70.93 |
| ViLBERT | 70.92 |
| BUTD | 65.67 |
| MUTAN | 60.17 |
- VizWiz leaderboard (2022): https://eval.ai/web/challenges/challenge-page/1560/leaderboard/3852
| Algorithm | Accuracy (%) |
|---|---|
| GIT | 67.53 |
| HSSLab | 66.72 |
| Alibaba | 61.81 |
| LXMERT | 55.4 |
| Pythia | 54.72 |
| GridFeat+MCAN | 54.17 |
| ViLBERT | 52 |
| SAN | 47.3 |
- Text VQA leaderboard (2022): https://eval.ai/web/challenges/challenge-page/874/leaderboard/2313
| Algorithm | Accuracy (%) |
|---|---|
| Mia | 73.67 |
| SunLan | 65.86 |
| Summer | 59.16 |
| Microsoft | 54.71 |
| TAG | 53.69 |
| ST-VQA | 45.66 |
| M4C | 39.01 |
| RUArt-M4C | 33.54 |
| LoRRA | 27.63 |
:floppy_disk: Dataset
- VQA Dataset
  - General VQA
    - COCO
    - VQAv1, VQAv2
    - VQA Dialog
  - Text-VQA
    - TextVQA
    - Scene Text VQA
    - OCR-VQA (toy-sized dataset of book/poster covers)
  - Doc-VQA
  - Rephrase VQA questions
    - Inverse Visual QA (iVQA)
    - VQA-Rephrasings
    - VQA-LOL
    - VQA-Introspect
    - Rephrase ambiguous questions | 2022 paper
  - Replace VQA images
    - VQAv2
    - VQA-CP
  - VQA reasoning
    - VCR (2018/11)
    - Visual Entailment (2019)
    - GQA
    - CLEVR
    - Referring Expression
    - NLVR2 (2018)
  - VQA with External Knowledge
    - OK-VQA
    - FVQA
    - KB-VQA
    - KVQA (2019)
  - Explainable/Grounding Image Captioning/VQA
    - Grounding for image captioning (referring expression)
      - Flickr30K Entities
      - Visual Genome
      - RefClef
      - RefCOCO
      - CLEVR-Ref+
      - Google Referring Expressions
      - PhraseCut
    - Grounding for VQA
  - Multilingual
    - Multilingual VQA
      - xGQA
      - MaXM | paper
    - Image captioning
      - Crossmodal-3600
:pencil2: Algorithm
- Image feature preparation (see the grid-feature sketch after this list)
  - Show, Attend and Tell (2015/5)
  - SAN (2015/11)
  - BUTD (2017/7) | paper
  - Grid Feature (2020/1)
  - Pixel-BERT (2020/4)
  - SOHO (2021/4)
  - VinVL (2021/4)
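A minimal sketch of the grid-feature idea behind Grid Feature and Pixel-BERT: skip the BUTD-style object detector and use the CNN feature map itself as the set of visual tokens. PyTorch/torchvision assumed; the backbone, input size, and shapes below are illustrative, not any paper's exact setup.

```python
import torch
import torchvision

# Illustrative sketch: a ResNet-50 trunk as the grid-feature extractor
# (in practice the backbone would be pretrained and often finetuned).
backbone = torchvision.models.resnet50(weights=None)
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc

image = torch.randn(1, 3, 448, 448)           # one RGB image
fmap = trunk(image)                           # (1, 2048, 14, 14) feature map
grid_feats = fmap.flatten(2).transpose(1, 2)  # (1, 196, 2048): 196 grid "regions"
```

BUTD-style region features would instead come from a detector's per-box pooled features (typically 36 boxes per image); the grid variant trades localized object evidence for a much simpler, end-to-end-trainable pipeline.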
- Enhanced multimodal fusion
  - Bilinear pooling: how to fuse two vectors into one (see the sketch after this list)
    - MCB (2016/6)
    - MLB (2016/10)
    - MUTAN (2017/5)
    - MFB & MFH (2017/8)
    - BLOCK (2019/1)
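The common thread in these methods is approximating the full bilinear (outer-product) interaction between the image and question vectors, which is too large to materialize directly. A minimal low-rank sketch in the spirit of MLB/MUTAN, in PyTorch; all dimensions are illustrative rather than any paper's settings.

```python
import torch
import torch.nn as nn

class LowRankBilinear(nn.Module):
    """Project both vectors to a shared rank-d space, fuse by elementwise
    product, then map to the output space (an MLB-style approximation)."""
    def __init__(self, v_dim=2048, q_dim=1024, rank=1200, out_dim=1000):
        super().__init__()
        self.Uv = nn.Linear(v_dim, rank)    # image projection
        self.Uq = nn.Linear(q_dim, rank)    # question projection
        self.out = nn.Linear(rank, out_dim)

    def forward(self, v, q):
        joint = torch.tanh(self.Uv(v)) * torch.tanh(self.Uq(q))  # Hadamard fusion
        return self.out(joint)

fuse = LowRankBilinear()
z = fuse(torch.randn(8, 2048), torch.randn(8, 1024))  # (8, 1000) answer logits
```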
  - FiLM: Feature-wise Linear Modulation (see the sketch below)
    - FiLM
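FiLM conditions the visual pathway on the question: a small network predicts a per-channel scale γ and shift β from the question embedding and applies y = γ ⊙ x + β to the feature map. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    def __init__(self, q_dim=1024, channels=128):
        super().__init__()
        self.to_gamma_beta = nn.Linear(q_dim, 2 * channels)

    def forward(self, fmap, q):
        # fmap: (B, C, H, W) image features; q: (B, q_dim) question embedding
        gamma, beta = self.to_gamma_beta(q).chunk(2, dim=-1)  # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)             # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * fmap + beta                            # per-channel modulation

film = FiLMLayer()
out = film(torch.randn(8, 128, 14, 14), torch.randn(8, 1024))  # (8, 128, 14, 14)
```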
  - Cross-modal attention (see the sketch after this list)
    - SAN (2015/11)
    - HieCoAtt (2016/5)
    - DAN (2016/11)
    - DCN (2018/4)
    - BAN (2018/5)
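The recurring pattern, going back to SAN: the question scores each image region, and the attention-weighted sum of regions becomes the visual summary used for answering. A single-hop sketch with illustrative sizes (SAN stacks multiple hops; later models add the reverse, image-to-question direction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionToImageAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hid)
        self.proj_q = nn.Linear(q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, regions, q):
        # regions: (B, N, v_dim) image regions; q: (B, q_dim) question vector
        h = torch.tanh(self.proj_v(regions) + self.proj_q(q).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)   # (B, N, 1) region weights
        return (alpha * regions).sum(dim=1)       # (B, v_dim) attended summary

attn = QuestionToImageAttention()
v_hat = attn(torch.randn(8, 36, 2048), torch.randn(8, 1024))  # (8, 2048)
```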
- Pretraining
  - UNITER
  - ViLBERT
  - LXMERT
  - B2T2
  - VisualBERT
  - Unicoder-VL
  - VL-BERT
  - ERNIE-ViL (AAAI, 2021): scene graph prediction
  - Oscar
  - UNIMO (ACL, 2021)
- End-to-end pretraining
  - SOHO (CVPR, 2021/4)
  - ViLT (ICML, 2021)
- Graph attention / graph convolutional networks (see the sketch after this list)
  - Graph-Structured (2016/9)
  - Relation Network (2017/6)
  - Graph Learner (2018/6)
  - MuRel (2019/2)
  - ReGAT (2019/3)
  - LCGN (2019/5)
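The shared idea in these models: treat detected objects as graph nodes and update each node from its neighbors so the network can reason over relations. A rough one-layer, attention-based node update over a fully connected object graph (real models add relation types, spatial edges, and question conditioning):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectGraphAttention(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # query projection
        self.k = nn.Linear(dim, dim)   # key projection
        self.v = nn.Linear(dim, dim)   # value projection

    def forward(self, nodes):
        # nodes: (B, N, dim) object features
        scores = self.q(nodes) @ self.k(nodes).transpose(1, 2)   # (B, N, N)
        adj = F.softmax(scores / nodes.size(-1) ** 0.5, dim=-1)  # soft adjacency
        return nodes + adj @ self.v(nodes)                       # residual update

gat = ObjectGraphAttention()
updated = gat(torch.randn(8, 36, 2048))  # (8, 36, 2048)
```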
- Cross-modal + intra-modal attention (see the sketch below)
  - MCAN (2019): Deep Modular Co-Attention Network
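MCAN stacks two kinds of attention: self-attention inside each modality and "guided" attention where image features attend to question words. A rough sketch of one such block using nn.MultiheadAttention (the paper also has feed-forward sublayers and layer norm, omitted here):

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img, txt):
        img = img + self.self_attn(img, img, img)[0]    # intra-modal (image)
        img = img + self.guided_attn(img, txt, txt)[0]  # cross-modal (image -> text)
        return img

block = CoAttentionBlock()
img = block(torch.randn(8, 36, 512), torch.randn(8, 14, 512))  # (8, 36, 512)
```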
- Multi-step reasoning (see the sketch below)
  - MAC: Memory, Attention and Composition
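A MAC network runs a small recurrent cell for a fixed number of reasoning steps: a control state decides what the current step is about, a read unit retrieves from the image "knowledge base", and a write unit folds the retrieved evidence into memory. A heavily simplified one-step sketch; every projection and size here is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMACStep(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.control = nn.Linear(2 * dim, dim)  # next control from question
        self.read = nn.Linear(dim, dim)         # score knowledge-base cells
        self.write = nn.Linear(2 * dim, dim)    # merge retrieval into memory

    def forward(self, ctrl, mem, q, kb):
        # ctrl, mem, q: (B, dim); kb: (B, N, dim) image knowledge base
        ctrl = self.control(torch.cat([ctrl, q], dim=-1))
        scores = (self.read(kb) * ctrl.unsqueeze(1)).sum(-1)              # (B, N)
        retrieved = (F.softmax(scores, dim=-1).unsqueeze(-1) * kb).sum(1)
        mem = self.write(torch.cat([mem, retrieved], dim=-1))
        return ctrl, mem

step = SimpleMACStep()
ctrl = mem = torch.zeros(8, 512)
q, kb = torch.randn(8, 512), torch.randn(8, 196, 512)
for _ in range(4):                       # a fixed number of reasoning hops
    ctrl, mem = step(ctrl, mem, q, kb)
```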
- Neural module networks (see the sketch after this list)
  - NMN (2015/11)
  - N2NMN (2017/4)
  - PG+EE (2017/5)
  - TbD (2018/3)
  - StackNMN (2018/7)
  - NS-VQA (2018/10)
  - Prob-NMN (2019/2)
  - MMN (2019/10)
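Neural module networks parse each question into a layout (e.g. find → answer) and assemble a question-specific network from small reusable modules. A toy two-module composition that illustrates the idea only; the modules and layout are hypothetical, not any specific paper's:

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """Toy 'find' module: produce an attention map over image regions."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, regions):
        return torch.softmax(self.score(regions), dim=1)   # (B, N, 1)

class Answer(nn.Module):
    """Toy 'answer' module: classify from the attended image summary."""
    def __init__(self, dim=512, n_answers=1000):
        super().__init__()
        self.cls = nn.Linear(dim, n_answers)
    def forward(self, attn, regions):
        return self.cls((attn * regions).sum(dim=1))        # (B, n_answers)

regions = torch.randn(8, 36, 512)
layout = [Find(), Answer()]     # in a real NMN the layout is parsed per question
attn = layout[0](regions)
logits = layout[1](attn, regions)  # (8, 1000)
```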
- External knowledge algorithms
  - Mucko (2020/1)
  - KRISP (2020)