Document Clustering Algorithm Based on Word Co-occurrence


Computer Engineering ›› 2012, Vol. 38 ›› Issue (2): 213-214. doi: 10.3969/j.issn.1000-3428.2012.02.070

CHANG Peng^1a,1b, FENG Nan^1a, MA Hui^2

(1a. School of Management; 1b. Information and Network Center, Tianjin University, Tianjin 300072, China; 2. Department of Management Engineering, Tianjin Institute of Urban Construction, Tianjin 300384, China)

Received: 2011-07-05 Online: 2012-01-20 Published: 2012-01-20
About the authors: CHANG Peng (born 1980), male, assistant researcher, Ph.D., main research interest: text mining; FENG Nan and MA Hui, lecturers, Ph.D.
Supported by: National Natural Science Foundation of China (70901054)

Abstract

To address the information loss that occurs when the topic of a text is represented, this paper proposes a document clustering algorithm based on word co-occurrence. Words that frequently co-occur across the document set are used to build a topic vector representation of each document. This model is then applied in a hierarchical clustering algorithm, and clustering entropy is used to find the optimal level at which to partition the hierarchy, so that the result accurately reflects the topical relationships between documents. Experimental results show that the algorithm outperforms other phrase-based hierarchical document clustering algorithms.

Key words: document clustering, document model, word co-occurrence, document similarity, clustering gain
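
The pipeline summarized in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the exact vector model or the definition of clustering entropy, so the binary pair vectors, cosine similarity, average-linkage merging, and the `partition_score` criterion (intra-cluster plus inter-cluster cosine distance, lower is better) are all assumptions chosen for illustration, as are the function names and the `min_support` parameter.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_pairs(docs, min_support=2):
    """Word pairs that appear together in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        counts.update(combinations(sorted(set(doc)), 2))
    return sorted(pair for pair, n in counts.items() if n >= min_support)


def topic_vector(doc, pairs):
    """Assumed model: 1.0 where both words of a frequent pair occur in the document."""
    words = set(doc)
    return [1.0 if a in words and b in words else 0.0 for a, b in pairs]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def partition_score(clusters, vectors):
    """Stand-in for clustering entropy (an assumption, not the paper's definition):
    distance of documents to their cluster centroid plus distance of each
    cluster centroid to the overall centroid. Lower is better."""
    overall = centroid(vectors)
    score = 0.0
    for members in clusters:
        center = centroid([vectors[i] for i in members])
        score += sum(1.0 - cosine(vectors[i], center) for i in members)
        score += 1.0 - cosine(center, overall)
    return score


def cluster_documents(docs, min_support=2):
    """Agglomerative clustering; returns the level with the lowest partition_score."""
    pairs = frequent_pairs(docs, min_support)
    vectors = [topic_vector(doc, pairs) for doc in docs]
    clusters = [[i] for i in range(len(docs))]
    best, best_score = list(clusters), partition_score(clusters, vectors)
    while len(clusters) > 1:
        # Merge the two clusters with the highest average pairwise similarity.
        def avg_sim(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
            return total / (len(a) * len(b))

        p, q = max(combinations(range(len(clusters)), 2), key=avg_sim)
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
        score = partition_score(clusters, vectors)
        if score < best_score:
            best, best_score = list(clusters), score
    return sorted(sorted(c) for c in best)


if __name__ == "__main__":
    docs = [
        ["stock", "market", "price"],
        ["stock", "market", "trade"],
        ["gene", "protein", "cell"],
        ["gene", "protein", "dna"],
    ]
    print(cluster_documents(docs))
```

On this toy input the only frequent pairs are `("gene", "protein")` and `("market", "stock")`, and the selected partition groups documents 0 and 1 together and documents 2 and 3 together. The score is equal at the two extremes of the hierarchy (all singletons, one cluster) and dips in between, which is why it can serve as a cut criterion in this sketch.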

CLC number: TP301.6

CHANG Peng, FENG Nan, MA Hui. Document Clustering Algorithm Based on Word Co-occurrence[J]. Computer Engineering, 2012, 38(2): 213-214.

http://www.ecice06.com/CN/Y2012/V38/I2/213

