Semi-supervised Text Classification Algorithm Based on Ant Colony Aggregation Pheromone

doi:10.3969/j.issn.1000-3428.2014.11.033

Computer Engineering

DU Fanghua¹, JI Junzhong¹, WU Chensheng², WU Jinyuan¹

(1. Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China; 2. Beijing Institute of Science and Technology Information, Beijing 100048, China)

Received: 2013-11-13 Online: 2014-11-15 Published: 2014-11-13
About the authors: DU Fanghua (b. 1988), male, master's student, research interests: data mining and machine learning; JI Junzhong, professor; WU Chensheng, research fellow; WU Jinyuan, master's student.
Funding:
National Natural Science Foundation of China (61375059, 61332016).

Abstract

Abstract: Many semi-supervised text classification algorithms rely on assumptions about the data distribution, and they may perform poorly when the labeled data and the unlabeled data follow different distributions. This paper presents a semi-supervised text classification algorithm based on aggregation pheromone, which real ants and other insects use for species aggregation. The proposed method makes no assumption about the data distribution and can therefore be applied to any kind of data distribution. Aggregation pheromone is combined with a conventional text similarity computation, and a Top-k strategy selects the colonies to which an unlabeled ant may belong. A judgment rule then determines the confidence of each unlabeled ant, and a random selection strategy adds unlabeled ants with high confidence to the training colony that attracts them most. Experiments on a benchmark dataset show that the algorithm outperforms Naïve Bayes and the EM algorithm in precision, recall, and Macro F1.

Key words: text classification, semi-supervised learning, aggregation pheromone, self-training, Top-k strategy, random selection strategy
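
The abstract names the steps of the method but gives none of its formulas, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. Assumed here: pheromone decays as a Gaussian of Euclidean distance with a hypothetical width `delta`; pheromone and cosine similarity are fused by multiplication; the judgment rule is a margin test with a hypothetical threshold `ratio`; and the random selection strategy accepts a confident ant with probability equal to the winning colony's share of the Top-k attraction. All function and parameter names are made up for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def attraction(ant, colony, delta=1.0):
    """Assumed attraction of a non-empty colony on an ant: the mean, over
    colony members, of Gaussian pheromone times cosine similarity."""
    total = 0.0
    for member in colony:
        d2 = sum((a - b) ** 2 for a, b in zip(ant, member))
        pheromone = math.exp(-d2 / (2 * delta ** 2))
        total += pheromone * cosine(ant, member)
    return total / len(colony)


def self_train(colonies, unlabeled, k=2, ratio=1.5, rounds=5, rng=None):
    """Self-training loop. colonies maps label -> list of vectors (labeled
    ants); unlabeled is a list of vectors. Returns the grown colonies and
    the ants that were never accepted. Inputs are not modified."""
    rng = rng or random.Random(0)
    colonies = {label: list(members) for label, members in colonies.items()}
    pending = list(unlabeled)
    for _ in range(rounds):
        if not pending:
            break
        still_pending = []
        for ant in pending:
            # Top-k strategy: keep the k colonies that attract the ant most.
            scores = sorted(
                ((attraction(ant, members), label)
                 for label, members in colonies.items()),
                reverse=True,
            )
            top = scores[:k]
            best_score, best_label = top[0]
            second = top[1][0] if len(top) > 1 else 0.0
            # Assumed judgment rule: the best colony must win by a margin.
            confident = best_score > 0 and best_score >= ratio * second
            accepted = False
            if confident:
                # Assumed random selection: accept with probability equal to
                # the best colony's share of the Top-k attraction.
                share = best_score / sum(score for score, _ in top)
                accepted = rng.random() < share
            if accepted:
                colonies[best_label].append(ant)
            else:
                still_pending.append(ant)
        pending = still_pending
    return colonies, pending


# Toy two-term vocabulary; the labels and vectors are invented for the demo.
labeled = {"sports": [[1.0, 0.0], [0.9, 0.1]], "tech": [[0.0, 1.0], [0.1, 0.9]]}
grown, left_over = self_train(labeled, [[0.8, 0.2], [0.2, 0.8]])
print({label: len(members) for label, members in grown.items()}, len(left_over))
```

Because accepted ants join a colony immediately, they contribute pheromone to later decisions in the same round, which is what makes this a self-training loop rather than a one-shot classifier.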

CLC number: TP311.12

DU Fanghua, JI Junzhong, WU Chensheng, WU Jinyuan. Semi-supervised Text Classification Algorithm Based on Ant Colony Aggregation Pheromone[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.11.033.

http://www.ecice06.com/CN/Y2014/V40/I11/167


