
Computer Engineering ›› 2012, Vol. 38 ›› Issue (23): 154-157. doi: 10.3969/j.issn.1000-3428.2012.23.038

• Artificial Intelligence and Recognition Technology •

Research on Identification Method of Data Table in Web Page

CHE Cheng-yi, MA Zong-min, JIAO Xiao-long

  1. (College of Information Science and Engineering, Northeastern University, Shenyang 110819, China)
  • Received: 2012-02-07 Online: 2012-12-05 Published: 2012-12-03
  • About the authors: CHE Cheng-yi (born 1969), male, Ph.D. candidate, main research interests: artificial intelligence, ontology construction, and Web data processing; MA Zong-min, professor and doctoral supervisor; JIAO Xiao-long, master's student
  • Funding:
    National Natural Science Foundation of China (61073139)

Abstract: To improve the accuracy of Web data table identification, this paper proposes an identification method based on a Support Vector Machine (SVM) with a mixed kernel function. It defines the structural features, content features, and row (column) similarity features of a table, combines a polynomial kernel function and a linear kernel function into a mixed kernel function, and uses that kernel to identify Web data tables automatically. Experimental results on seven sites show that the method achieves an average precision of 95.14% and an average recall of 95.69%.

Key words: Web page, data table, feature extraction, Support Vector Machine (SVM), kernel function
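
The abstract does not say how the polynomial and linear kernels are combined, so the following is only a minimal sketch of one common construction: a weighted sum of the two. The function names, the `weight`, `degree`, and `coef0` parameters, and the sample feature vectors are all hypothetical and are not taken from the paper.

```python
def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def mixed_kernel(x, y, weight=0.5, degree=2, coef0=1.0):
    """Weighted sum of the polynomial and linear kernels.

    ASSUMPTION: weight, degree and coef0 are illustrative values,
    not parameters reported in the abstract.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * polynomial_kernel(x, y, degree, coef0)
            + (1.0 - weight) * linear_kernel(x, y))


def gram_matrix(samples, kernel=mixed_kernel):
    """Pairwise kernel values between all samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Hypothetical feature vectors for three <table> elements
# (for example: row count, column count, fraction of numeric cells).
tables = [[5.0, 3.0, 0.8], [1.0, 1.0, 0.0], [6.0, 4.0, 0.7]]
K = gram_matrix(tables)
```

A weighted sum with non-negative weights of two valid kernels is itself a valid kernel, so a Gram matrix built this way could in principle be passed to an SVM implementation that accepts precomputed kernels, such as scikit-learn's `SVC(kernel="precomputed")`. Whether the authors used this weighting scheme is not stated in the abstract.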
