Computer Engineering ›› 2009, Vol. 35 ›› Issue (7): 183-185. doi: 10.3969/j.issn.1000-3428.2009.07.064

• Artificial Intelligence and Recognition Technology •

Topic Number Evaluation Based on Bayes Information Criteria

WANG Xiao-bin, WEN Chun, SHI Zhao-xiang   

  1. (Department of Network Engineering, Electronic Engineering Institute, Hefei 230037)
  • Online: 2009-04-05 Published: 2009-04-05

Abstract: Topic identification and keyword extraction in specific domains have wide application, but topic categories that are specified manually or generated automatically by text clustering lack an objective measure. This paper combines model selection theory based on the Bayesian Information Criterion (BIC) with Independent Component Analysis (ICA) to estimate the number of topics probabilistically, giving the statistical distribution of the topic number in the BIC sense. On this basis, it performs an ICA decomposition of the term-document matrix and obtains the keywords of each topic and their weights from the separated independent components. Experiments show that, without the support of domain knowledge, the method can estimate a topic number that reflects the text collection and extract the corresponding keywords.

Key words: topic identification, keyword extraction, Independent Component Analysis (ICA), Bayesian Information Criterion (BIC)
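
The abstract's idea of a distribution over the topic number "in the BIC sense" can be illustrated with a minimal sketch. It assumes the standard definition BIC = -2 ln L + d ln N and the usual approximation that, under a uniform prior over candidate models, the posterior probability of a model is proportional to exp(-BIC/2). The function names, the log-likelihood values, and the parameter counts below are hypothetical placeholders chosen only for illustration; they are not taken from the paper, whose likelihood model is not given on this page.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Standard BIC: -2 ln L + d ln N (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


def topic_number_posterior(log_likelihoods, n_params, n_samples):
    """Approximate P(K | data) as proportional to exp(-BIC_K / 2).

    Assumes a uniform prior over the candidate topic numbers K.
    Both dict arguments are keyed by K.
    """
    scores = {
        k: -0.5 * bic(log_likelihoods[k], n_params[k], n_samples)
        for k in log_likelihoods
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores.values())
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical inputs: maximized log-likelihoods for K = 2..5 topics,
# fitted on an assumed collection of 200 documents.
n_docs = 200
loglik = {2: -5210.0, 3: -5105.0, 4: -5080.0, 5: -5072.0}
# Assumed parameter count: 10 free parameters per topic.
n_params = {k: 10 * k for k in loglik}

posterior = topic_number_posterior(loglik, n_params, n_docs)
best_k = max(posterior, key=posterior.get)

for k in sorted(posterior):
    print(f"K={k}: P(K | data) ~ {posterior[k]:.3g}")
print("most probable topic number:", best_k)
```

With these placeholder numbers the penalty term d ln N outweighs the small likelihood gains beyond three topics, so the mass concentrates on K = 3 with some remaining on K = 4. In a real pipeline the log-likelihoods would come from the fitted ICA model for each candidate K.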
