
计算机工程 (Computer Engineering)

• Advanced Computing and Data Processing •

Data Source Two-layer Selection Model for Deep Web Localized Data Integration

XIAN Xuefeng 1,2, CUI Zhiming 1,2, FANG Ligang 1, GU Caidong 1, SUN Xun 1

  1. Jiangsu Province Support Software Engineering R&D Center for Modern Information Technology Application in Enterprise, Suzhou, Jiangsu 215104, China; 2. Institute of Intelligent Information Processing and Application, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2016-02-19  Online: 2017-03-15  Published: 2017-03-15
  • About the authors: XIAN Xuefeng (born 1980), male, Associate Professor, Ph.D.; his main research interests are Web data management and data mining. CUI Zhiming, Professor, Ph.D. supervisor; FANG Ligang, Associate Professor, Ph.D.; GU Caidong, Professor, M.S.; SUN Xun, Assistant Experimentalist, M.S.
  • Supported by:
    National Natural Science Foundation of China (61440053, 61472268, 41201338); Science and Technology Program of Suzhou (SYG201342, SYG201343, SS201344).



Abstract: Data sources selected solely on the basis of source quality incur a high cost and a high duplication rate when their data are crawled. To address this problem, this paper proposes a Deep Web data source selection and integration method based on a two-layer selection model, in which the two layers are built from the intrinsic quality of a data source and from its utility. A recursive incremental source selection and integration strategy based on this model is presented. A quality-based selector in the first layer filters out the large number of low-quality Deep Web data sources and passes only a few high-quality ones as the input of the second-layer, utility-based selector. The second layer recursively selects from this candidate set the k data sources that the integration system finally crawls and integrates, so that the system obtains as much high-quality data as possible while avoiding sources whose coverage heavily overlaps. Experimental results show that, by combining the advantages of the two selectors, the method reduces the candidate data source space while ensuring the quality of the integrated data, avoids processing large amounts of duplicate data, and effectively lowers the cost of Deep Web data crawling and integration.

Key words: Deep Web, data integration, data source selection, data source quality, utility model, recursive incremental strategy
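
The abstract describes an algorithmic pipeline: a quality-based first-layer filter followed by a utility-driven, recursive incremental selection of k sources. As a rough illustration only, the Python sketch below shows one way such a two-layer selection could look; the class and function names, the quality threshold, and the marginal-utility formula are hypothetical assumptions made for this sketch and are not taken from the paper.

from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DeepWebSource:
    name: str
    quality: float                                    # assumed per-source quality score in [0, 1]
    records: Set[str] = field(default_factory=set)    # identifiers of records the source can return


def quality_filter(sources: List[DeepWebSource], threshold: float = 0.7) -> List[DeepWebSource]:
    """Layer 1 (assumed): keep only sources whose quality score reaches the threshold."""
    return [s for s in sources if s.quality >= threshold]


def select_top_k(candidates: List[DeepWebSource], k: int) -> List[DeepWebSource]:
    """Layer 2 (assumed): greedily pick up to k sources by marginal, quality-weighted utility.

    Each step rewards records not yet covered by the already selected sources,
    so heavily overlapping (duplicate-rich) sources are avoided.
    """
    selected: List[DeepWebSource] = []
    covered: Set[str] = set()
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda s: len(s.records - covered) * s.quality)
        if not (best.records - covered):              # no new data left to gain, stop early
            break
        selected.append(best)
        covered |= best.records
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    sources = [
        DeepWebSource("A", 0.90, {"r1", "r2", "r3"}),
        DeepWebSource("B", 0.80, {"r2", "r3"}),       # largely overlaps A, so it is skipped
        DeepWebSource("C", 0.85, {"r4", "r5"}),
        DeepWebSource("D", 0.40, {"r6"}),             # removed by the first-layer quality filter
    ]
    chosen = select_top_k(quality_filter(sources), k=2)
    print([s.name for s in chosen])                   # -> ['A', 'C']

Here the second layer is written as an iterative greedy loop over the filtered candidates; the paper's recursive incremental strategy refines the candidate set in a similar spirit, trading overlap (duplicate records) against newly gained high-quality data.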
