
Computer Engineering ›› 2020, Vol. 46 ›› Issue (4): 91-96. doi: 10.19678/j.issn.1000-3428.0054056

• Artificial Intelligence and Pattern Recognition •

Weighted Naive Bayes Text Classification Algorithm Based on Poisson Distribution

ZHAO Bowen, WANG Lingjiao, GUO Hua

  1. College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105, China
  • Received: 2019-03-01  Revised: 2019-04-18  Online: 2020-04-15  Published: 2019-05-27
  • About the authors: ZHAO Bowen (1994-), male, M.S. candidate, main research interest: data mining; WANG Lingjiao, associate professor, Ph.D.; GUO Hua, senior experimentalist, M.S.
  • Supported by: National Natural Science Foundation of China (61771414).



Abstract: The Naive Bayes (NB) algorithm is simple and efficient when applied to text classification, but its accuracy is limited by the built-in assumption that attribute independence and attribute importance are consistent. To address this problem, this paper proposes a feature-weighted NB text classification algorithm based on the Poisson distribution. The algorithm combines the Poisson distribution model with the NB algorithm and introduces a Poisson random variable into the weight of feature words. On this basis, the Information Gain Ratio (IGR) is defined to weight the feature words of texts, which reduces the impact of the attribute independence assumption of the traditional algorithm. Experimental results on the 20-newsgroups dataset show that, compared with the traditional NB algorithm and its improved variants RW, C-MNB and CFSNB, the proposed algorithm improves the accuracy, recall and F1 value of text classification, and its execution efficiency is higher than that of the K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms.
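To make the idea concrete, the following is a minimal Python/NumPy sketch of the general scheme the abstract describes: feature words are weighted by a generic information gain ratio computed over term presence/absence, and classification uses a Poisson class-conditional model for term counts whose per-term log-likelihood is scaled by that weight. The function and class names (information_gain_ratio, PoissonWeightedNB), the smoothing, and the exact IGR definition are illustrative assumptions and may differ from the paper's formulation.

import numpy as np

def information_gain_ratio(X, y):
    # Per-term Information Gain Ratio used as feature weights.
    # X: document-term count matrix (n_docs, n_terms); y: class labels.
    n_docs, n_terms = X.shape
    _, counts = np.unique(y, return_counts=True)
    p_c = counts / n_docs
    h_y = -np.sum(p_c * np.log2(p_c))                        # class entropy H(Y)
    weights = np.zeros(n_terms)
    for t in range(n_terms):
        present = X[:, t] > 0
        p_split = np.array([present.mean(), 1.0 - present.mean()])
        h_t = -sum(p * np.log2(p) for p in p_split if p > 0)  # split information H(T)
        h_y_t = 0.0                                            # conditional entropy H(Y|T)
        for mask, p in zip((present, ~present), p_split):
            if p == 0:
                continue
            _, sub = np.unique(y[mask], return_counts=True)
            q = sub / sub.sum()
            h_y_t += p * (-np.sum(q * np.log2(q)))
        weights[t] = (h_y - h_y_t) / h_t if h_t > 0 else 0.0
    return weights

class PoissonWeightedNB:
    # NB classifier whose class-conditional term-count model is Poisson,
    # with each term's log-likelihood scaled by its IGR weight.
    def fit(self, X, y, weights):
        self.w = np.asarray(weights, dtype=float)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Smoothed Poisson rate per (class, term): mean term count per document.
        self.lam_ = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 1.0)
                              for c in self.classes_])
        return self

    def predict(self, X):
        # Weighted log Poisson likelihood: sum_t w_t * (x_t*log(lam_t) - lam_t);
        # the log(x_t!) term is identical across classes and therefore omitted.
        log_like = X @ (self.w * np.log(self.lam_)).T - (self.w * self.lam_).sum(axis=1)
        return self.classes_[np.argmax(self.log_prior_ + log_like, axis=1)]

In use, one would first compute weights = information_gain_ratio(X_train, y_train) and then call PoissonWeightedNB().fit(X_train, y_train, weights) before predicting on the test documents.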

Key words: text classification, Naive Bayes (NB) algorithm, Poisson distribution, Information Gain Ratio (IGR), weight of feature words

CLC Number: