Computer Engineering

• Artificial Intelligence and Recognition Technology

Document Clustering Fused with Semantic Resources and Key Words

WU Shun-yao 1a,1b, SHAO Feng-jing 1a,1b, WANG Jin-long 2, SUN Ren-cheng 1b, WANG Ying 1b

  (1a. College of Automation Engineering; 1b. College of Information Engineering, Qingdao University, Qingdao 266071, China; 2. School of Computer Engineering, Qingdao Technological University, Qingdao 266033, China)
  • Received: 2013-05-18 Online: 2014-04-15 Published: 2014-04-14
  • About the authors: WU Shun-yao (born 1986), male, Ph.D. candidate, whose main research interests are data mining and complex networks; SHAO Feng-jing, professor, Ph.D.; WANG Jin-long and SUN Ren-cheng, associate professors, Ph.D.; WANG Ying, master's student.
  • Funding:
    National Natural Science Foundation of China (91130035); National Special Research Fund for Public Welfare Industries (200905030-2); Key Program of the Natural Science Foundation of Shandong Province (ZR2012FZ003); Natural Science Foundation of Shandong Province (ZR2012FQ017); Qingdao Science and Technology Program (13-1-4-12-jch, 12-1-4-4-(8)-jch).

Abstract: Fusing attribute-level knowledge in the form of key words can effectively improve the quality of document clustering. However, initializing the cluster centers when key words are fused remains an open problem. This paper therefore proposes a document clustering method that fuses semantic resources and key words. It uses Wikipedia semantics to identify the themes of a document collection, and adopts a network-based inference strategy with dynamic resource allocation to discover latent semantic relatedness from article collaborative relationships, so as to select the important documents (initial cluster centers) that best represent each theme. Key words are then incorporated into document clustering by combining soft-constraint and metric learning strategies. Experiments on the 20Newsgroup collection show that, compared with k-means and with a semi-supervised clustering method that fuses only key words, the proposed method effectively improves clustering quality. On the News_Different_3 dataset in particular, Normalized Mutual Information (NMI) improves by up to about 20%.

Key words: key words, document clustering, Wikipedia semantics, initialization of cluster center, network inference, important document
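
For readers unfamiliar with the evaluation index named in the abstract, the following is a minimal sketch of how NMI between a reference labeling and a clustering result can be computed. It assumes the common normalization by the geometric mean of the two label entropies; the abstract does not state which normalization the paper uses, and the function name `normalized_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(T; P) / sqrt(H(T) * H(P)), using natural logarithms.

    Assumption: geometric-mean normalization; other variants
    (arithmetic mean, min, max) are also used in the literature.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label frequencies.
    mi = 0.0
    for (t, p), n_tp in count_joint.items():
        mi += (n_tp / n) * math.log(n * n_tp / (count_true[t] * count_pred[p]))

    # Entropy of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    # Degenerate case: at least one labeling puts everything in one group.
    if h_true == 0.0 or h_pred == 0.0:
        return 1.0 if h_true == h_pred else 0.0

    return mi / math.sqrt(h_true * h_pred)
```

NMI is invariant to how cluster identifiers are named, so a clustering that reproduces the reference partition scores 1 regardless of label permutation, while a clustering statistically independent of the reference scores 0.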

CLC number: