Abstract:
This paper introduces the Tree-Augmented Naïve Bayes (TAN) text categorization model, experimentally analyzes its threshold selection problem, and proposes an Automatic TAN (ATAN) text categorization framework that requires no threshold selection. Two ATAN-based algorithms are compared with BL-TAN, whose threshold is manually selected to give its best classification performance, on imbalanced Chinese and English datasets. The results show that the ATAN-based algorithms outperform BL-TAN.
Key words:
text categorization,
Tree-Augmented Naïve Bayes (TAN) model,
Bayesian network
摘要:
介绍一种树状朴素贝叶斯(TAN)文本分类模型,对该模型存在的阈值选取问题进行实验分析,提出不需要进行阈值选取的TAN文本自动分类框架(ATAN)。在中英文非均匀类分布测试集上对基于ATAN的2种算法与手动选取阈值达到最优性能的BL-TAN进行对比,结果表明基于ATAN的算法具有更高性能。
关键词:
文本分类,
树状朴素贝叶斯模型,
贝叶斯网络
CLC Number:
LIU Jia, JIA Cai-Yan. Automatic Text Categorization Framework Based on TAN[J]. Computer Engineering, 2010, 36(16): 36-38.
刘佳, 贾彩燕. 基于TAN的文本自动分类框架[J]. 计算机工程, 2010, 36(16): 36-38.