
Computer Engineering ›› 2021, Vol. 47 ›› Issue (1): 79-86. doi: 10.19678/j.issn.1000-3428.0056222

• Artificial Intelligence and Pattern Recognition •

Chinese Short Text Classification Algorithm Based on BERT Model

DUAN Dandan1, TANG Jiashan1, WEN Yong1, YUAN Kehai1,2   

  1. College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
    2. Department of Psychology, University of Notre Dame, South Bend 46556, USA
  • Received: 2019-10-09  Revised: 2019-11-27  Published: 2019-12-13

  • Author profile: DUAN Dandan (1994-), female, M.S. candidate; her research interests include natural language processing and data analysis. TANG Jiashan (corresponding author), WEN Yong, and YUAN Kehai are professors.
  • Funding:
    Horizontal research project of Nanjing University of Posts and Telecommunications (2018外095).

Abstract: Existing Chinese short text classification algorithms face sparse features, informal wording, and massive data volumes. To address these problems, this paper proposes a Chinese short text classification algorithm based on the Bidirectional Encoder Representation from Transformer (BERT) model. The algorithm uses the BERT pre-trained language model to produce sentence-level feature vector representations of short texts, and the resulting feature vectors are then fed into a Softmax regression model for training and classification. Experimental results show that as the volume of Sohu news text data grows, the overall F1 value of the proposed algorithm on the test dataset reaches up to 93%, which is 6 percentage points higher than that of the TextCNN-based short text classification algorithm. These results demonstrate that the proposed algorithm better represents sentence-level semantic information and achieves better classification performance on Chinese short texts.
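The classification stage described in the abstract — sentence-level feature vectors fed into a Softmax regression model — can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes the BERT sentence embeddings have already been extracted (here they are stood in by plain NumPy arrays), and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_softmax_regression(X, y, n_classes, lr=0.1, epochs=200):
    """Train a softmax (multinomial logistic) regression classifier
    by gradient descent on the cross-entropy loss.
    X: (n_samples, dim) sentence-level feature vectors,
       e.g. BERT [CLS] embeddings; y: (n_samples,) integer labels."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]            # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ W + b)          # predicted class probabilities
        grad = P - Y                    # d(cross-entropy)/d(logits)
        W -= lr * (X.T @ grad) / n
        b -= lr * grad.mean(axis=0)
    return W, b

def predict(X, W, b):
    # Assign each text to the class with the highest probability
    return softmax(X @ W + b).argmax(axis=-1)
```

In the paper's pipeline the inputs `X` would come from the BERT pre-trained language model rather than being raw features, so the classifier itself stays small; only the linear weights `W` and bias `b` are trained here.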

Key words: Chinese short text classification, Bidirectional Encoder Representation from Transformer (BERT), Softmax regression model, TextCNN model, word2vec model

