Computer Engineering (计算机工程) ›› 2007, Vol. 33 ›› Issue (04): 52-54. doi: 10.3969/j.issn.1000-3428.2007.04.018

• Software Technology and Database •

Web Site Logical Domain Mining Based on Web Page Block Cluster (网页分块聚类的Web站点逻辑域挖掘)

ZHENG Jiaoling (郑皎凌) 1, WANG Chengliang (王成良) 2

  1. College of Computer, Chongqing University, Chongqing 400044; 2. School of Software Engineering, Chongqing University, Chongqing 400044
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-02-20 Published: 2007-02-20

Abstract: Web logical domain mining is an active research topic in the field of Web mining. Rather than performing plain text clustering or hyperlink ranking, it takes the Web site designer's point of view and looks for pages within a site that are logically related, grouping them into a logical domain. The definition of a Web site logical domain varies with the application. After analyzing several representative Web site logical domain models and their mining methods, this paper proposes a model and an algorithm for Web site logical domain mining based on Web page block clustering. The experimental results show that the algorithm is stable and adaptable: its precision is hardly affected by the scale of the Web site, its language, or the presence of mirror sites, while its recall increases as the number of Web pages retrieved increases.

Key words: Web page block, Web logical domain, Web mining, block granularity