
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Semi-supervised Text Classification Algorithm Based on Ant Colony Aggregation Pheromone

DU Fanghua 1, JI Junzhong 1, WU Chensheng 2, WU Jinyuan 1

  1. Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
  2. Beijing Institute of Science and Technology Information, Beijing 100048, China
  • Received: 2013-11-13 Online: 2014-11-15 Published: 2014-11-13
  • About the authors: DU Fanghua (born 1988), male, master's student, main research interests: data mining and machine learning; JI Junzhong, professor; WU Chensheng, researcher; WU Jinyuan, master's student.
  • Funding:
    National Natural Science Foundation of China (61375059, 61332016).

Abstract: In semi-supervised text classification, the labeled and unlabeled data may follow different distributions, which can degrade the performance of classifiers that rely on the data distribution. This paper presents a semi-supervised text classification algorithm based on aggregation pheromone, which real ants and other insects use for species aggregation. The proposed method makes no assumption about the data distribution and can therefore be applied to data of any distribution. It combines aggregation pheromone with conventional text similarity computation, selects the colonies that an unlabeled ant may belong to with a Top-k strategy, and determines the confidence of each unlabeled ant by a judgment rule. Using a random selection strategy, unlabeled ants with high confidence are added to the training colony that attracts them most. In experiments on a benchmark dataset, the algorithm performs better than Naïve Bayes and the EM algorithm on precision, recall, and Macro F1.

Key words: text classification, semi-supervised learning, aggregation pheromone, self-training, Top-k strategy, random selection strategy
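
The self-training loop summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so every detail here is an assumption made for illustration: the Gaussian pheromone kernel on cosine distance, the parameter `delta`, the judgment rule (the best colony's share of the Top-k attraction must reach `threshold`), and the reading of "random selection" as sampling a fraction `sample_frac` of the confident ants per round. The function names are hypothetical, and the sketch is not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def attraction(ant, colony, delta=0.5):
    """Mean pheromone a non-empty colony deposits at `ant`.

    Assumption: a Gaussian kernel on cosine distance (1 - cosine similarity).
    """
    total = 0.0
    for member in colony:
        d = 1.0 - cosine(ant, member)
        total += math.exp(-(d * d) / (2.0 * delta * delta))
    return total / len(colony)


def top_k_colonies(ant, colonies, k=2, delta=0.5):
    """Return the k (label, attraction) pairs with the highest attraction, best first."""
    scored = [(label, attraction(ant, members, delta))
              for label, members in colonies.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def self_train(colonies, unlabeled, k=2, threshold=0.6,
               sample_frac=0.5, rounds=10, seed=0):
    """Grow the labeled colonies with confident unlabeled ants.

    Returns (grown colonies, ants that were never assigned).
    """
    rng = random.Random(seed)
    # Copy the member lists so the caller's colonies are not modified.
    colonies = {label: list(members) for label, members in colonies.items()}
    pending = list(unlabeled)
    for _ in range(rounds):
        confident = []
        for idx, ant in enumerate(pending):
            top = top_k_colonies(ant, colonies, k)
            total = sum(score for _, score in top)
            # Assumed judgment rule: best colony's share of the Top-k attraction.
            if total > 0 and top[0][1] / total >= threshold:
                confident.append((idx, top[0][0]))
        if not confident:
            break
        # Assumed random selection: sample a fraction of the confident ants.
        n_pick = max(1, int(len(confident) * sample_frac))
        picked = rng.sample(confident, n_pick)
        for idx, label in picked:
            colonies[label].append(pending[idx])
        picked_idx = {idx for idx, _ in picked}
        pending = [ant for i, ant in enumerate(pending) if i not in picked_idx]
    return colonies, pending
```

As a usage example, calling `self_train({"sports": [[1.0, 0.0]], "tech": [[0.0, 1.0]]}, [[0.9, 0.1], [0.1, 0.9]])` grows each seed colony with the nearby unlabeled vector. In practice the vectors would be document representations such as TF-IDF weights, which is also an assumption here.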

CLC number: