A Text Classification Algorithm Based on Neural Network and LDA

Computer Engineering, 2019, Vol. 45, Issue (10): 208-214. doi: 10.19678/j.issn.1000-3428.0054297

NIU Shuoshuo, CHAI Xiaoli, LI Deqi, XIE Bin

The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China

Received: 2019-03-20 Revised: 2019-04-23 Online: 2019-10-15 Published: 2019-10-09
About the authors: NIU Shuoshuo (born 1993), male, master's student, whose main research interests are machine learning and natural language processing; CHAI Xiaoli, research fellow; LI Deqi, engineer; XIE Bin, research fellow, Ph.D.
Funding: National Ministries and Commissions Fund.

Abstract

The traditional Latent Dirichlet Allocation (LDA) topic model uses Gibbs sampling to fit unknown parameters under known conditional distributions during text classification, which makes it difficult to balance classification accuracy against computational complexity. To address this, building on the LDA topic model, a neural network is used to fit the word-topic probability distribution, and a text classification algorithm, NLDA, is proposed. Experiments on the THUCNews corpus and the Fudan University corpus show that, compared with the traditional LDA model, the algorithm improves average classification accuracy by 5.53% and 4.67%, respectively, and reduces average training time by 8% and 10%, respectively.

Key words: text classification, deep learning, neural network, Latent Dirichlet Allocation (LDA), topic model
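
To make the baseline concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA, the fitting procedure the abstract attributes to the traditional model. It illustrates the baseline only and is not the NLDA algorithm, whose network design is not described in the abstract. The function name `lda_gibbs`, the hyperparameter values, and the toy corpus are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Hypothetical helper for illustration; hyperparameters are assumed values.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_{k,w}
    topic_total = [0] * n_topics                              # n_k
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed word-topic distributions phi[k][w]; each row sums to 1.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return phi, doc_topic


# Toy corpus (assumed): four short documents over a six-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 1], [5, 3, 4, 4]]
phi, doc_topic = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Every sweep visits each token and recomputes a distribution over all topics, so the cost grows with corpus size, topic count, and iteration count. That is the accuracy-versus-cost trade-off the abstract refers to, and the word-topic distribution `phi` is the quantity the abstract says NLDA fits with a neural network instead.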

CLC Number: TP183

NIU Shuoshuo, CHAI Xiaoli, LI Deqi, XIE Bin. A Text Classification Algorithm Based on Neural Network and LDA[J]. Computer Engineering, 2019, 45(10): 208-214.

https://www.ecice06.com/CN/Y2019/V45/I10/208

Figures/Tables: 16



