Stream Document Structure Recognition Based on Bidirectional LSTM Network

Computer Engineering, 2020, Vol. 46, Issue (1): 60-66,73. doi: 10.19678/j.issn.1000-3428.0053702

ZHANG Zhen^a,b, LI Ning^a,b, TIAN Ying'ai^a,b

a. Beijing Key Laboratory of Internet Culture and Digital Dissemination; b. Computer School, Beijing Information Science and Technology University, Beijing 100101, China

Received: 2019-01-16 Revised: 2019-03-05 Online: 2020-01-15 Published: 2019-03-14
About the authors: ZHANG Zhen (born 1994), male, master's student, whose main research interest is document information processing; LI Ning, professor, Ph.D.; TIAN Ying'ai, associate professor, Ph.D.
Funding:
National Key R&D Program of China, "Service-Oriented Intelligent Office System Platform in a Private Cloud Environment" (2018YFB1004100); National Natural Science Foundation of China, "Intelligent Analysis and Optimization Methods for Stream Document Typesetting Formats" (61672105).

Abstract

Stream document structure recognition plays an important role in automatic typesetting optimization and information extraction. Existing rule-based structure recognition methods generalize poorly, while machine learning-based methods achieve low recognition accuracy because they do not consider long-distance dependencies between document units. To address this problem, this paper proposes a stream document structure recognition method based on a bidirectional Long Short-Term Memory (LSTM) network. The method selects key features from three aspects of document units: format, content, and semantics. It then treats document structure recognition as a sequence labeling problem and uses a bidirectional LSTM neural network to build a recognition model that recognizes 18 logical labels. Experimental results show that the method recognizes document structure effectively and outperforms the Founder FX software.

Key words: document structure recognition, stream document, feature extraction, sequence labeling, Long Short-Term Memory (LSTM) network
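
As a rough illustration of the sequence-labeling formulation described in the abstract, the sketch below runs an untrained bidirectional LSTM over a sequence of document-unit feature vectors and picks one of 18 labels for each unit. It is not the authors' implementation. The feature dimension, hidden size, random weights, example document, and label names (`LABEL_0` to `LABEL_17`) are all placeholder assumptions, and training is omitted entirely.

```python
import math
import random

NUM_LABELS = 18   # number of logical labels, taken from the abstract
FEATURE_DIM = 6   # assumed size of each document-unit feature vector
HIDDEN_DIM = 4    # assumed LSTM hidden size


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LSTMCell:
    """Standard LSTM cell: input, forget, output gates plus a candidate state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.w = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng) for g in "ifog"}
        self.b = {g: [0.0] * hidden_dim for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.w[k], z), self.b[k])] for k in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, xs):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in xs:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def bilstm_label(units, fwd, bwd, w_out):
    """Label every unit using left context (fwd) and right context (bwd)."""
    h_fwd = fwd.run(units)
    h_bwd = bwd.run(units[::-1])[::-1]  # re-align to the original unit order
    labels = []
    for hf, hb in zip(h_fwd, h_bwd):
        scores = matvec(w_out, hf + hb)  # linear layer over both directions
        labels.append(max(range(NUM_LABELS), key=scores.__getitem__))
    return labels


rng = random.Random(0)
fwd = LSTMCell(FEATURE_DIM, HIDDEN_DIM, rng)
bwd = LSTMCell(FEATURE_DIM, HIDDEN_DIM, rng)
w_out = rand_matrix(NUM_LABELS, 2 * HIDDEN_DIM, rng)

# Five hypothetical document units (e.g. paragraphs), each a feature vector.
document = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(5)]
predicted = bilstm_label(document, fwd, bwd, w_out)
print([f"LABEL_{k}" for k in predicted])
```

Reversing the input for the backward pass and then reversing its outputs keeps both hidden-state lists aligned by unit index, so the score for each unit draws on context from both earlier and later units. This is the property that lets a bidirectional model capture dependencies a unit-by-unit classifier would miss.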

CLC Number: TP18

ZHANG Zhen, LI Ning, TIAN Ying'ai. Stream Document Structure Recognition Based on Bidirectional LSTM Network[J]. Computer Engineering, 2020, 46(1): 60-66,73.

https://www.ecice06.com/CN/Y2020/V46/I1/60
