
Computer Engineering (计算机工程) ›› 2007, Vol. 33 ›› Issue (14): 38-40. doi: 10.3969/j.issn.1000-3428.2007.14.013

• Software Technology and Database •

Study on Massive Short Documents Clustering Technology

WANG Yongheng, JIA Yan, YANG Shuqiang

  1. (Institute of Network, Computer School, National University of Defense Technology, Changsha 410073)
  • Online: 2007-07-20 Published: 2007-07-20

Abstract: The rapid development of information technology has led to the accumulation of huge volumes of text data, a large share of which consists of short documents. Clustering such documents is valuable for acquiring knowledge from them automatically. However, most current clustering algorithms cannot handle massive data at the terabyte (TB) scale, and acceptable clustering accuracy is hard to achieve because keywords occur only a few times in each short document. This paper proposes a parallel clustering algorithm based on frequent term sets for clustering massive collections of short documents, and uses semantic information to improve clustering accuracy. Experiments show that the algorithm is both accurate and efficient on massive short-document data.

Key words: text mining, massive, short document, parallel
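
The abstract names the technique but gives no algorithmic detail, so the following is only a minimal, single-process sketch of the general idea behind clustering by frequent term sets. It is not the authors' algorithm: it omits both the parallelism and the semantic information the paper relies on, and the function names, the `min_support` and `max_size` parameters, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each term set found in at least min_support documents to the
    indices of the documents that contain it (hypothetical helper)."""
    cover = defaultdict(set)
    for idx, doc in enumerate(docs):
        terms = sorted(set(doc.split()))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                cover[frozenset(combo)].add(idx)
    return {ts: ids for ts, ids in cover.items() if len(ids) >= min_support}


def cluster_by_frequent_terms(docs, min_support=2, max_size=2):
    """Assign each document to the most specific frequent term set it contains."""
    frequent = frequent_term_sets(docs, min_support, max_size)
    clusters = defaultdict(list)
    for idx, doc in enumerate(docs):
        terms = set(doc.split())
        candidates = [ts for ts in frequent if ts <= terms]
        if not candidates:
            # No frequent term set covers this document; keep it under the empty set.
            clusters[frozenset()].append(idx)
            continue
        # Prefer larger term sets, then higher support, then term order for determinism.
        best = max(candidates, key=lambda ts: (len(ts), len(frequent[ts]), sorted(ts)))
        clusters[best].append(idx)
    return dict(clusters)


# Illustrative short documents (made up for this sketch).
docs = [
    "cheap flight ticket",
    "flight ticket sale",
    "football match tonight",
    "football match score",
    "hello",
]
clusters = cluster_by_frequent_terms(docs, min_support=2)
for term_set, members in clusters.items():
    print(sorted(term_set), members)
```

Because each short document contributes only a handful of terms, the shared term set itself acts as the cluster label here. Enumerating all term combinations is only workable for very short documents and small `max_size`; a real system at the scale the abstract describes would need a proper frequent-itemset miner and a partitioning strategy.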

CLC number: