
Computer Engineering (计算机工程) ›› 2010, Vol. 36 ›› Issue (12): 83-84. doi: 10.3969/j.issn.1000-3428.2010.12.029

• Software Technology and Database •

Text Information Extraction Algorithm Based on Tag Path Clustering

LIU Yun-feng

  1. (Network & Audio-visual Center, Shanxi Engineering Polytechnic, Taiyuan 030009)
  • Online: 2010-06-20 Published: 2010-06-20
  • About the author: LIU Yun-feng (b. 1974), male, lecturer, M.S.; main research interest: database technology

Abstract: This paper proposes a text information extraction algorithm based on tag path clustering to address Web page noise and the high complexity of extracting unstructured information from Web pages. The method first preprocesses Web page noise, then clusters tag paths according to the page's Document Object Model (DOM) tree structure. An automatically trained threshold and a Web page segmentation algorithm are used to quickly identify the key part of the page, and text extraction templates are derived from the nesting structure within the data blocks. Experimental results on different types of Web sites show that the algorithm runs quickly and extracts text accurately.

Key words: tag path, Web page segmentation, information extraction, clustering, threshold
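
The abstract gives only the outline of the method, so the following is a minimal illustrative sketch of the general idea and not the paper's algorithm. It assumes that "clustering" can be approximated by grouping text nodes that share an identical tag path, that noise preprocessing means skipping a fixed set of tags, and that the threshold is a fixed parameter (the paper trains it automatically). The names `TagPathCollector`, `key_paths`, `NOISE_TAGS`, and `VOID_TAGS` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: tags treated as page noise and skipped entirely.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Group visible text by the tag path (e.g. html/body/div/p) leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # open tags from the root down
        self.clusters = defaultdict(list)   # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not NOISE_TAGS.intersection(self.stack):
            self.clusters["/".join(self.stack)].append(text)


def key_paths(html, threshold=0.5):
    """Return tag paths whose share of the visible text is >= threshold.

    Assumption: the threshold is hand-set here, not trained as in the paper.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    sizes = {path: sum(len(t) for t in texts)
             for path, texts in collector.clusters.items()}
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {path: collector.clusters[path]
            for path, size in sizes.items() if size / total >= threshold}
```

Calling `key_paths(page_source)` on a page's HTML returns a dictionary from each dominant tag path to the text fragments found under it. Exact-match grouping is the simplest stand-in for clustering; a closer approximation would presumably merge similar paths and use the nesting structure of the selected block to build a reusable extraction template.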
