Computer Engineering ›› 2018, Vol. 44 ›› Issue (12): 281-287. doi: 10.19678/j.issn.1000-3428.0047985

• Development Research and Engineering Application •

Research on Text Information Depth Extraction and Multi-keyword Parallel Matching Technique

WANG Wenqi1, LI Yong2, GUAN Yunyun3

  1. Zhengzhou Key Lab of Computer Network Security Assessment, Henan Engineering Lab of Computer Information System Security Assessment, Zhengzhou 450007, China; 2. School of Physics and Electrical Engineering, Anyang Normal University, Anyang, Henan 455002, China; 3. Library, Zhongyuan University of Technology, Zhengzhou 450007, China
  • Received: 2017-07-17 Online: 2018-12-15 Published: 2018-12-15
  • About the authors: WANG Wenqi (born 1971), male, associate professor, Ph.D.; main research interests: network security and multimedia communication. LI Yong, associate professor. GUAN Yunyun, engineer.
  • Supported by:

    Science and Technology Research Project of Henan Province (142102310284).

Abstract:

Current text information extraction and retrieval techniques adapt poorly to complex environments, are constrained by user permissions, and must cope with large storage capacities. To address these problems, the features of text information in various document types are analyzed, and a parallel deep text information analysis system is built. Text information is extracted from different types of documents using a fine-grained XML representation, the extracted text is then searched by keyword using multi-core parallel techniques, and finally the analysis results are output. Experimental results show that the system can analyze different types of text information in depth and at fine granularity, and that it quickly extracts complete information when the number of search keywords is large.

Key words: XML fine-grained representation, disk information extraction, document text information extraction, memory management algorithm, parallel search algorithm
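
The abstract does not give the system's algorithms, so the following is only a minimal sketch of the general idea it describes: text is pulled out of a fine-grained XML representation, and the fragments are then matched against several keywords in parallel. The XML layout, the function names (`extract_fragments`, `find_keywords`, `parallel_match`), the `executor_cls` parameter, and the use of plain substring matching are all assumptions made for illustration, not the authors' method.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import xml.etree.ElementTree as ET


def extract_fragments(xml_text):
    """Return (path, text) pairs, one per XML element that has leading text.

    Paths carry sibling indices so every fragment is uniquely addressable.
    Tail text (text after a child element) is ignored in this sketch.
    """
    fragments = []

    def walk(node, path, index):
        here = f"{path}/{node.tag}[{index}]"
        text = (node.text or "").strip()
        if text:
            fragments.append((here, text))
        for i, child in enumerate(node):
            walk(child, here, i)

    walk(ET.fromstring(xml_text), "", 0)
    return fragments


def find_keywords(task):
    """Worker: list the keywords that occur in one fragment (case-sensitive)."""
    path, text, keywords = task
    return path, [kw for kw in keywords if kw in text]


def parallel_match(xml_text, keywords, workers=None,
                   executor_cls=ProcessPoolExecutor):
    """Match every keyword against every fragment, one task per fragment.

    executor_cls is a hypothetical hook: ProcessPoolExecutor spreads the work
    over CPU cores, while ThreadPoolExecutor is convenient for quick checks.
    """
    tasks = [(path, text, tuple(keywords))
             for path, text in extract_fragments(xml_text)]
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(find_keywords, tasks))  # keeps document order
    # Keep only fragments in which at least one keyword was found.
    return [(path, hits) for path, hits in results if hits]
```

With the default process pool, `parallel_match` should be called from under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Splitting by fragment rather than by keyword keeps each task small; a real system handling many keywords would more likely use a dedicated multi-pattern matcher inside each worker.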

CLC Number: