Document Clustering Fused with Semantic Resources and Key Words

doi:10.3969/j.issn.1000-3428.2014.04.043

Computer Engineering (计算机工程)

WU Shun-yao ^1a,1b, SHAO Feng-jing ^1a,1b, WANG Jin-long ^2, SUN Ren-cheng ^1b, WANG Ying ^1b

(1a. College of Automation Engineering; 1b. College of Information Engineering, Qingdao University, Qingdao 266071, China; 2. School of Computer Engineering, Qingdao Technological University, Qingdao 266033, China)

Received: 2013-05-18 Online: 2014-04-15 Published: 2014-04-14
About the authors: WU Shun-yao (born 1986), male, Ph.D. candidate, main research interests: data mining and complex networks; SHAO Feng-jing, professor, Ph.D.; WANG Jin-long and SUN Ren-cheng, associate professors, Ph.D.; WANG Ying, master's student.
Funding:
National Natural Science Foundation of China (91130035); National Special Research Fund for Public Welfare Industries (200905030-2); Key Program of the Natural Science Foundation of Shandong Province (ZR2012FZ003); Natural Science Foundation of Shandong Province (ZR2012FQ017); Qingdao Science and Technology Program (13-1-4-12-jch, 12-1-4-4-(8)-jch).

Abstract

Fusing attribute-level knowledge in the form of key words can effectively improve the quality of document clustering, but initializing the cluster centers when key words are fused remains an open problem. To address this, the paper proposes a document clustering method that fuses semantic resources and key words. It uses Wikipedia semantics to identify the themes of a document collection and applies a network inference strategy based on resource allocation, which uncovers latent semantic relatedness through the collaborative relationships among articles, to select the important documents that best represent each theme as initial cluster centers. Key words are then incorporated into clustering through a combination of soft constraints and metric learning. Experiments on the 20Newsgroup collection show that, compared with k-means and with a clustering method that fuses key words only, the proposed method effectively improves clustering quality. On the News_Different_3 dataset in particular, Normalized Mutual Information (NMI) improves by up to about 20%.

Key words: key words, document clustering, Wikipedia semantics, initialization of cluster center, network inference, important document
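
The resource-allocation network inference named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the document-concept incidence matrix `adj`, the uniform initial resource, and the idea of ranking documents by the resource they receive are assumptions made for illustration. Only the weight formula follows the standard resource-allocation projection for bipartite networks, w_ij = (1/k_j) * sum_l a_il * a_jl / k_l.

```python
# Illustrative sketch only; all names and the toy data are hypothetical.

def resource_allocation_weights(adj):
    """Project a document-concept bipartite graph onto documents.

    adj[i][l] is 1 if document i is linked to concept l, else 0.
    Returns w where w[i][j] is the share of document j's resource
    that reaches document i after one document -> concept -> document step.
    """
    n_docs = len(adj)
    n_concepts = len(adj[0])
    doc_deg = [sum(row) for row in adj]
    concept_deg = [sum(adj[i][l] for i in range(n_docs)) for l in range(n_concepts)]

    w = [[0.0] * n_docs for _ in range(n_docs)]
    for i in range(n_docs):
        for j in range(n_docs):
            if doc_deg[j] == 0:
                continue  # an isolated document has no resource to pass on
            shared = sum(
                adj[i][l] * adj[j][l] / concept_deg[l]
                for l in range(n_concepts)
                if concept_deg[l] > 0
            )
            w[i][j] = shared / doc_deg[j]
    return w


def redistribute(w, resource):
    """Apply one redistribution step: new_resource = W * resource."""
    n = len(resource)
    return [sum(w[i][j] * resource[j] for j in range(n)) for i in range(len(w))]


# Toy data (assumed): 4 documents, 3 concepts.
adj = [
    [1, 1, 0],  # document 0 mentions concepts 0 and 1
    [1, 1, 0],  # document 1 mentions concepts 0 and 1
    [0, 1, 1],  # document 2 mentions concepts 1 and 2
    [0, 0, 1],  # document 3 mentions concept 2
]

weights = resource_allocation_weights(adj)
scores = redistribute(weights, [1.0] * len(adj))

# Assumption: documents that receive more resource are treated as more
# "important" and are better candidates for an initial cluster center.
ranked = sorted(range(len(adj)), key=lambda d: scores[d], reverse=True)
print(ranked, scores)
```

Because each column of the weight matrix sums to 1, the step only moves resource between documents and never creates or destroys it. That makes the scores comparable across documents. How the paper actually builds the network and picks one center per theme is not stated in the abstract.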

CLC Number: TP18

Cite this article: WU Shun-yao, SHAO Feng-jing, WANG Jin-long, SUN Ren-cheng, WANG Ying. Document Clustering Fused with Semantic Resources and Key Words[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2014/V40/I4/223

Editor: JIN Hu-kao


