
Computer Engineering

• Software Technology and Database •

Discovery of Web Query Interface Based on HTML Features and Hierarchical Clustering

WEI Jiaxin, YE Feiyue

  1. (School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China)
  • Received: 2015-02-02 Online: 2016-02-15 Published: 2016-01-29
  • About the authors: WEI Jiaxin (b. 1990), female, master's degree, main research interest: Web semantic understanding; YE Feiyue, Ph.D.

Abstract: Web Query Interfaces (WQIs) on different Web sites are structurally heterogeneous, which makes them hard to discover automatically. To address this problem, this paper proposes a WQI discovery method based on Hyper Text Markup Language (HTML) features and hierarchical clustering. The method exploits the hierarchical and attachment relationships among HTML control elements, together with the fact that interactive controls generally sit at the terminal (leaf) positions of the DOM tree, and parses the page by combining pre-order and post-order traversal to build a tree model of the page. Local candidate query regions are located where interaction density is locally concentrated, and each one, called an interaction group, initializes a class in the clustering set. The interaction groups are then clustered hierarchically by structure distance, and each interactive control in the resulting candidate interfaces is semantically labeled with the structurally nearest text node, yielding complete WQI regions. A text filter based on the textual features of each interface removes non-query interfaces. Experimental results show that the method avoids the excessive dependence of traditional methods on the "form" tag and generalizes well, with an interface recognition rate of 90.7% and an accuracy of 92%.

Key words: Web Query Interface (WQI), Hyper Text Markup Language (HTML), hierarchical clustering, structure distance, interaction density, text filter
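
The core idea of the abstract, grouping interactive controls by how close they sit in the page tree instead of relying on the "form" tag, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define structure distance, interaction density, or the linkage rule, so the sketch assumes that structure distance is the number of tree edges between two controls, that clustering is single-linkage with a fixed cutoff, and that the page tree is built from the event stream of the standard-library `html.parser` rather than by combined pre-order and post-order traversal. The names `Node`, `TreeBuilder`, `tree_distance`, `cluster` and `find_groups` are hypothetical.

```python
from html.parser import HTMLParser

INTERACTIVE = {"input", "select", "textarea", "button"}  # assumed set of controls
VOID = {"input", "br", "img", "hr", "meta", "link"}      # tags without a closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        """Return the path from this node up to the root, inclusive."""
        path, node = [], self
        while node is not None:
            path.append(node)
            node = node.parent
        return path


class TreeBuilder(HTMLParser):
    """Builds an element tree and records interactive controls in document order."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.controls = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag in INTERACTIVE:
            self.controls.append(node)
        if tag not in VOID:
            self.current = node  # descend only into elements that can have children

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def tree_distance(a, b):
    """Assumed structure distance: edges between a and b via their lowest common ancestor."""
    steps_from_a = {id(n): i for i, n in enumerate(a.ancestors())}
    for j, n in enumerate(b.ancestors()):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def cluster(controls, max_distance):
    """Single-linkage agglomerative clustering that stops at max_distance (assumed cutoff)."""
    groups = [[c] for c in controls]  # each control starts as its own group
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(tree_distance(a, b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break
        _, i, j = best
        groups[i].extend(groups.pop(j))
    return groups


def find_groups(html, max_distance=4):
    builder = TreeBuilder()
    builder.feed(html)
    return cluster(builder.controls, max_distance)
```

Calling `find_groups(page_source)` returns lists of `Node` objects, one list per candidate interface region. A fuller pipeline along the lines of the abstract would still need to label each control with a nearby text node and filter out non-query groups by their text, which this sketch leaves out.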
