Web Site Logical Domain Mining Based on Web Page Block Cluster

doi:10.3969/j.issn.1000-3428.2007.04.018

Computer Engineering, 2007, Vol. 33, Issue (04): 52-54. doi: 10.3969/j.issn.1000-3428.2007.04.018

ZHENG Jiaoling (郑皎凌) 1, WANG Chengliang (王成良) 2

(1. College of Computer Science, Chongqing University, Chongqing 400044; 2. School of Software Engineering, Chongqing University, Chongqing 400044)

Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-02-20 Published: 2007-02-20

Abstract

Abstract: Web logical domain mining is one of the active research topics in Web mining. It emphasizes mining, from the site designer's point of view, the pages within a site that are logically related, so that they form a logical domain, rather than performing plain text clustering or hyperlink ranking. The definition of a site's logical domain varies with the application. After a comprehensive analysis of several representative site logical domains and their mining methods, this paper proposes a Web site logical domain mining model and mining algorithm based on Web page block clustering. Experimental results show that the algorithm is stable and adaptable: its precision is not affected by factors such as site scale, language, or mirroring, while its recall increases as the number of Web pages retrieved increases.

Key words: Web page block, Web logical domain, Web mining, block granularity

ZHENG Jiaoling, WANG Chengliang. Web Site Logical Domain Mining Based on Web Page Block Cluster[J]. Computer Engineering, 2007, 33(04): 52-54.

http://www.ecice06.com/CN/Y2007/V33/I04/52
