Review of Natural Scene Text Detection Based on Deep Learning

doi:10.19678/j.issn.1000-3428.0067427

Abstract:

Natural scene text detection based on deep learning has become an important research direction in computer vision and natural language processing. It has broad application prospects and also gives researchers a new platform for exploring neural network models and algorithms. First, this review introduces the relevant concepts, research background, and current state of natural scene text detection. Next, it analyzes recent deep learning-based text detection methods and divides them into four categories: detection box-based methods, segmentation-based methods, hybrid methods that combine the two, and other methods. For the classical and mainstream methods in each category, it describes the basic ideas and main algorithmic pipelines; summarizes their mechanisms, applicable scenarios, strengths and weaknesses, simulation experiment results, and environment settings; and clarifies how the methods relate to one another. It then introduces the commonly used public datasets and the performance evaluation methods for natural scene text detection. Finally, it identifies the main challenges currently facing deep learning-based natural scene text detection and discusses future development directions.

Key words: deep learning, computer vision, natural scene text, text detection, multi-directional text detection, multi-scale text detection

Zhe LIAN, Yanjun YIN, Fei YUN, Min ZHI. Review of Natural Scene Text Detection Based on Deep Learning[J]. Computer Engineering, 2024, 50(3): 16-27.

http://www.ecice06.com/CN/Y2024/V50/I3/16

Figures/Tables: 13

Fig.1 Structure of TB-AFF module

Fig.2 Structure of MP module

Fig.3 Structure of DMFF module
