Document Detection Method with Multi-scale Feature and Semantic Optimization

doi:10.19678/j.issn.1000-3428.0260038

Abstract

To address unbalanced multi-scale feature representation, information loss in cross-level fusion, and insufficient bounding box localization accuracy in document detection, a document detection method with multi-scale feature and semantic optimization is proposed. The method makes three improvements. First, a multi-branch convolutional attention fusion module is constructed, which expands the receptive field through multi-scale strip convolution and integrates an attention mechanism with the C3k module. Second, a multi-scale neck that coordinates global semantics with high-order correlations is designed; it fuses features through global feature collection, hypergraph convolution-based correlation mining, and multi-scale scattering. Third, the bounding box regression loss is optimized with a dual-threshold interval mapping that makes the losses of different samples easier to distinguish. Experiments on the EXAM, CDLA, D4LA, and PubLayNet datasets show that the method achieves significantly higher average detection accuracy than existing methods. The results indicate that the method overcomes the performance bottleneck of YOLO11n in document detection, improving accuracy while maintaining efficiency, and offers a practical scheme for document detection.
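The abstract does not state the exact form of the dual-threshold interval mapping. As an illustration only, the sketch below assumes a linear rescaling of IoU between two thresholds, of the kind used in Focaler-IoU-style losses. The function names, the parameter names `low` and `high`, and their default values are hypothetical and are not taken from the paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def interval_mapped_iou(value, low=0.2, high=0.8):
    """Assumed mapping: rescale IoU linearly from [low, high] to [0, 1].

    Values at or below `low` map to 0 and values at or above `high` map to 1,
    so the loss varies only for samples whose IoU lies between the thresholds.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def mapped_iou_loss(pred, target, low=0.2, high=0.8):
    """Regression loss computed as 1 minus the interval-mapped IoU (hypothetical)."""
    return 1.0 - interval_mapped_iou(iou(pred, target), low, high)
```

Under this assumption, narrowing the interval `[low, high]` steepens the slope of the mapping, so samples whose IoU falls inside the interval receive more widely separated loss values, while samples outside it saturate at 0 or 1.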


ZHANG Chi, ZHOU Shibing, JU Jialin, JIANG Min. Document Detection Method with Multi-scale Feature and Semantic Optimization[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260038.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260038

