[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering[J]. International Journal of Computer Vision, 2017, 123(1): 4-31.
[2]
[3] SHIH K J, SINGH S, HOIEM D. Where to look: focus regions for visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 4613-4621.
[4] YANG Z C, HE X D, GAO J F, et al. Stacked attention networks for image question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 21-29.
[5] LI L J, GAN Z, CHENG Y, et al. Relation-aware graph attention network for visual question answering[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2019: 10312-10321.
[6] KAFLE K, KANAN C. Visual question answering: datasets, algorithms, and future challenges[J]. Computer Vision and Image Understanding, 2017, 163: 3-20.
doi: 10.1016/j.cviu.2017.06.005
[7] WU Q, TENEY D, WANG P, et al. Visual question answering: a survey of methods and datasets[J]. Computer Vision and Image Understanding, 2017, 163: 21-40.
doi: 10.1016/j.cviu.2017.05.001
[8] BAO X G, ZHOU C L, XIAO K J, et al. Survey on visual question answering[J]. Journal of Software, 2021, 32(8): 2522-2544.
[9] MANMADHAN S, KOVOOR B C. Visual question answering: a state-of-the-art review[J]. Artificial Intelligence Review, 2020, 53(8): 5705-5745.
doi: 10.1007/s10462-020-09832-7
[10] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
doi: 10.1109/TPAMI.2016.2577031
[11]
[12]
[13] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 770-778.
[14] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2015: 1-9.
[15]
[16] SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2018: 4510-4520.
[17] CHEN Y D, WANG W, ZHOU Y, et al. Self-training for domain adaptive scene text detection[C]//Proceedings of the 25th International Conference on Pattern Recognition. Washington D. C., USA: IEEE Press, 2021: 850-857.
[18] HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2017: 2980-2988.
[19]
[20] LI H, WANG P, SHEN C H, et al. Show, attend and read: a simple and strong baseline for irregular text recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8610-8617.
doi: 10.1609/aaai.v33i01.33018610
[21] BORISYUK F, GORDO A, SIVAKUMAR V. Rosetta: large scale system for text detection and recognition in images[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM Press, 2018: 71-79.
[22] SHI B G, BAI X, YAO C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(11): 2298-2304.
doi: 10.1109/TPAMI.2016.2646371
[23] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
doi: 10.1162/tacl_a_00051
[24] ALMAZÁN J, GORDO A, FORNÉS A, et al. Word spotting and recognition with embedded attributes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(12): 2552-2566.
doi: 10.1109/TPAMI.2014.2339814
[25] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
doi: 10.1162/neco.1997.9.8.1735
[26]
[27]
[28]
[29] KAFLE K, KANAN C. Answer-type prediction for visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 4976-4984.
[30] GAO D F, LI K, WANG R P, et al. Multi-modal graph neural network for joint reasoning on vision and scene text[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 12743-12753.
[31]
[32] HU R H, SINGH A, DARRELL T, et al. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 9989-9999.
[33]
[34] ZHU Q, GAO C Y, WANG P, et al. Simple is not easy: a simple strong baseline for TextVQA and TextCaps[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(4): 3608-3615.
doi: 10.1609/aaai.v35i4.16476
[35]
[36] JIN Z X, SHOU M Z, ZHOU F, et al. From token to word: OCR token evolution via contrastive learning and semantic matching for text-VQA[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 4564-4572.
[37]
[38] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[EB/OL]. [2023-03-05]. https://arxiv.org/abs/1707.07998.
[39] SCARSELLI F, GORI M, TSOI A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks, 2009, 20(1): 61-80.
doi: 10.1109/TNN.2008.2005605
[40] GU J T, LU Z D, LI H, et al. Incorporating copying mechanism in sequence-to-sequence learning[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. [S. l.]: Association for Computational Linguistics, 2016: 1631-1640.
[41] WU J J, DU J, WANG F R, et al. A multimodal attention fusion network with a dynamic vocabulary for TextVQA[J]. Pattern Recognition, 2022, 122: 108214.
doi: 10.1016/j.patcog.2021.108214
[42] GÓMEZ L, BITEN A F, TITO R, et al. Multimodal grid features and cell pointers for scene text visual question answering[J]. Pattern Recognition Letters, 2021, 150: 242-249.
doi: 10.1016/j.patrec.2021.06.026
[43] SHARMA H, JALAL A S. Improving visual question answering by combining scene-text information[J]. Multimedia Tools and Applications, 2022, 81(9): 12177-12208.
doi: 10.1007/s11042-022-12317-0
[44] GÓMEZ L, MAFLA A, RUSIÑOL M, et al. Single shot scene text retrieval[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 728-744.
[45] PERRONNIN F, DANCE C. Fisher kernels on visual vocabularies for image categorization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2007: 1-8.
[46] VINYALS O, FORTUNATO M, JAITLY N. Pointer networks[J]. Advances in Neural Information Processing Systems, 2015, 28: 2692-2700.
[47] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 711-727.
[48] RAO V N, ZHEN X, HOVSEPIAN K, et al. A first look: towards explainable TextVQA models via visual and textual explanations[EB/OL]. [2023-03-05]. https://arxiv.org/abs/2105.02626.
[49]
[50] GAO C Y, ZHU Q, WANG P, et al. Structured multimodal attentions for TextVQA[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9603-9614.
doi: 10.1109/TPAMI.2021.3132034
[51] LIU Y L, ZHANG S, JIN L W, et al. Omnidirectional scene text detection with sequential-free box discretization[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence. New York, USA: ACM Press, 2019: 3052-3058.
[52] YANG L, WANG P, LI H, et al. A holistic representation guided attention network for scene text recognition[J]. Neurocomputing, 2020, 414: 67-75.
doi: 10.1016/j.neucom.2020.07.010
[53] LIU F, XU G H, WU Q, et al. Cascade reasoning network for text-based visual question answering[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, USA: ACM Press, 2020: 4060-4069.
[54] YANG Z, XUAN J, LIU Q, et al. Modality-specific multimodal global enhanced network for text-based visual question answering[C]//Proceedings of IEEE International Conference on Multimedia and Expo. Washington D. C., USA: IEEE Press, 2022: 1-6.
[55] GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 6639-6648.
[56] YU Z, YU J, CUI Y H, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 6281-6290.
[57] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2017: 618-626.
[58] PARK D H, HENDRICKS L A, AKATA Z, et al. Multimodal explanations: justifying decisions and pointing to the evidence[EB/OL]. [2023-03-05]. https://arxiv.org/abs/1802.08129.
[59]
[60] HAN W, HUANG H T, HAN T. Finding the evidence: localization-aware answer prediction for text visual question answering[C]//Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg, USA: International Committee on Computational Linguistics, 2020: 3118-3131.
[61] ZHANG X Y, YANG Q. Position-augmented transformers with entity-aligned mesh for TextVQA[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 2519-2528.
[62] ZENG G Y, ZHANG Y, ZHOU Y, et al. Beyond OCR+VQA: involving OCR into the flow for robust and accurate TextVQA[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 376-385.
[63] QIAO Z, ZHOU Y, YANG D B, et al. SEED: semantics enhanced encoder-decoder framework for scene text recognition[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 13528-13537.
[64] GORDO A, ALMAZAN J, MURRAY N, et al. LEWIS: latent embeddings for word images and their semantics[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2015: 1242-1250.
[65]
[66] FANG C Y, ZENG G Y, ZHOU Y, et al. Towards escaping from language bias and OCR error: semantics-centered text visual question answering[C]//Proceedings of IEEE International Conference on Multimedia and Expo. Washington D. C., USA: IEEE Press, 2022: 1-6.
[67] SHAH S, MISHRA A, YADATI N, et al. KVQA: knowledge-aware visual question answering[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8876-8884.
doi: 10.1609/aaai.v33i01.33018876
[68] NARASIMHAN M, SCHWING A G. Straight to the facts: learning knowledge base retrieval for factual visual question answering[EB/OL]. [2023-03-05]. https://arxiv.org/abs/1809.01124.
[69] SINGH A K, MISHRA A, SHEKHAR S, et al. From strings to things: knowledge-enabled VQA model that can read and reason[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2019: 4602-4612.
[70] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 3195-3204.
[71] YE K R, ZHANG M D, KOVASHKA A. Breaking shortcuts by masking for robust visual reasoning[C]//Proceedings of IEEE Winter Conference on Applications of Computer Vision. Washington D. C., USA: IEEE Press, 2021: 3520-3530.
[72] LI G H, WANG X, ZHU W W. Boosting visual question answering with context-aware knowledge aggregation[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, USA: ACM Press, 2020: 1227-1235.
[73] DEY A U, VALVENY E, HARIT G. EKTVQA: generalized use of external knowledge to empower scene text in text-VQA[J]. IEEE Access, 2022, 10: 72092-72106.
doi: 10.1109/ACCESS.2022.3186471
[74] CHEN F L, ZHANG D Z, HAN M L, et al. VLP: a survey on vision-language pre-training[J]. Machine Intelligence Research, 2023, 20(1): 38-56.
doi: 10.1007/s11633-022-1369-5
[75] ZHOU L W, PALANGI H, ZHANG L, et al. Unified vision-language pre-training for image captioning and VQA[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 13041-13049.
doi: 10.1609/aaai.v34i07.7005
[76]
[77]
[78]
[79] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2023-03-05]. https://arxiv.org/abs/2010.11929v1.
[80]
[81]
[82] BIGHAM J P, JAYANT C, JI H J, et al. VizWiz: nearly real-time answers to visual questions[C]//Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM Press, 2010: 333-342.
[83] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of International Conference on Document Analysis and Recognition. Washington D. C., USA: IEEE Press, 2019: 947-952.
[84] WANG X Y, LIU Y L, SHEN C H, et al. On the general value of evidence, and bilingual scene-text visual question answering[EB/OL]. [2023-03-05]. https://arxiv.org/abs/2002.10215.
[85]
[86] AKHTAR N, MIAN A. Threat of adversarial attacks on deep learning in computer vision: a survey[J]. IEEE Access, 2018, 6: 14410-14430.
doi: 10.1109/ACCESS.2018.2807385
[87] ZHANG W E, SHENG Q Z, ALHAZMI A, et al. Adversarial attacks on deep-learning models in natural language processing: a survey[J]. ACM Transactions on Intelligent Systems and Technology, 2020, 11(3): 1-24.
[88] XU X, CHEN J F, XIAO J H, et al. What machines see is not what they get: fooling scene text recognition models with adversarial text images[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 12301-12311.