Abstract: Traditional text classification methods give too little weight to the links among documents. To address this, the paper proposes TC-iTM, a relational text classification algorithm based on iTopicModel. TC-iTM infers the category that each topic represents from the probabilities with which labeled documents are assigned to each topic. It then classifies unlabeled documents using both their topic-assignment probabilities and their text content. Experimental results show that TC-iTM outperforms traditional text classification methods when the links among documents in a document network strongly influence the documents' categories.
Key words:
text classification,
document network,
topic model,
EM algorithm
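
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the topic-assignment probabilities are combined with the text information, so the linear mixing weight `alpha`, the function names, and the assumption that a separate text classifier supplies per-class scores are all hypothetical. The topic-assignment probabilities themselves are taken as given, for example as the output of a topic model fitted over the document network.

```python
def topic_class_weights(theta_labeled, labels, n_classes):
    """Estimate, for each topic, a distribution over classes from labeled documents.

    theta_labeled[d][k] is the probability that labeled document d belongs to topic k.
    labels[d] is the class index of document d.
    """
    n_topics = len(theta_labeled[0])
    counts = [[0.0] * n_classes for _ in range(n_topics)]
    for theta, y in zip(theta_labeled, labels):
        for k, p in enumerate(theta):
            counts[k][y] += p  # soft count: topic k receives mass p for class y
    weights = []
    for row in counts:
        total = sum(row)
        # Fall back to a uniform distribution if no labeled mass reached the topic.
        weights.append([c / total if total > 0 else 1.0 / n_classes for c in row])
    return weights


def classify(theta, text_scores, weights, alpha=0.5):
    """Combine topic-based and text-based class scores (hypothetical linear mix)."""
    n_classes = len(text_scores)
    topic_scores = [
        sum(theta[k] * weights[k][c] for k in range(len(theta)))
        for c in range(n_classes)
    ]
    combined = [
        alpha * topic_scores[c] + (1.0 - alpha) * text_scores[c]
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=combined.__getitem__)


# Three labeled documents, two topics, two classes (toy numbers for illustration).
theta_labeled = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels = [0, 0, 1]
weights = topic_class_weights(theta_labeled, labels, n_classes=2)

# An unlabeled document whose text alone is uninformative (equal text scores).
predicted = classify([0.2, 0.8], [0.5, 0.5], weights)
```

In this toy case the text scores are tied, so the prediction is driven entirely by the topic memberships, which is the situation the abstract describes as favorable to TC-iTM.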
梁鹏鹏, 柴玉梅, 王黎明. 基于iTopicModel的关联文本分类算法[J]. 计算机工程, 2011, 37(21): 124-125,130.
LIANG Peng-Peng, CHAI Yu-Mei, WANG Li-Ming. Relational Text Classification Algorithm Based on iTopicModel[J]. Computer Engineering, 2011, 37(21): 124-125,130.