Computer Engineering, 2021, Vol. 47, Issue (1): 79-86. doi: 10.19678/j.issn.1000-3428.0056222

• Artificial Intelligence and Pattern Recognition •

Chinese Short Text Classification Algorithm Based on BERT Model

DUAN Dandan1, TANG Jiashan1, WEN Yong1, YUAN Kehai1,2

  1. College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
  2. Department of Psychology, University of Notre Dame, South Bend 46556, USA
  • Received: 2019-10-09 Revised: 2019-11-27 Published: 2019-12-13
  • About the authors: DUAN Dandan (born 1994), female, master's student, whose main research interests are natural language processing and data analysis; TANG Jiashan (corresponding author), WEN Yong, and YUAN Kehai, professors.
  • Funding:
    Horizontal Research Project of Nanjing University of Posts and Telecommunications (2018外095).

Abstract: Existing Chinese short text classification algorithms typically struggle with sparse features, nonstandard wording, and massive data volumes. To address these problems, this paper proposes a Chinese short text classification algorithm based on Bidirectional Encoder Representations from Transformers (BERT). The algorithm uses the pre-trained BERT language model to produce a sentence-level feature vector representation of each short text, and then feeds the resulting feature vectors into a Softmax regression model for training and classification. Experimental results show that, as the amount of Sohu news text data increases, the overall F1 score of the proposed algorithm on the test set reaches up to 93%, which is 6 percentage points higher than that of the short text classification algorithm based on the TextCNN model. This indicates that the algorithm effectively represents sentence-level semantic information and achieves better classification performance on Chinese short texts.

Key words: Chinese short text classification, Bidirectional Encoder Representations from Transformers (BERT), Softmax regression model, TextCNN model, word2vec model
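
The abstract describes a two-stage pipeline: a sentence-level feature vector for each short text, followed by a Softmax regression classifier. The final stage can be illustrated with a minimal sketch, under loud assumptions: the feature vectors below are hypothetical 4-dimensional toy vectors drawn at random, not BERT outputs, and the function names, learning rate, epoch count, and class centers are illustrative choices rather than the authors' settings.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Class probabilities for one feature vector x under a linear model."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)


def predict(weights, biases, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, biases, x)
    return max(range(len(probs)), key=probs.__getitem__)


def train_softmax_regression(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit weights and biases by batch gradient descent on mean cross-entropy."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = predict_proba(weights, biases, x)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. score k: p_k - 1[k == y]
                err = probs[k] - (1.0 if k == y else 0.0)
                grad_b[k] += err
                for j in range(dim):
                    grad_w[k][j] += err * x[j]
        for k in range(num_classes):
            biases[k] -= lr * grad_b[k] / n
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j] / n
    return weights, biases


def make_toy_vectors(rng, per_class, centers, noise=0.3):
    """Hypothetical stand-in for sentence vectors: Gaussian blobs around centers."""
    features, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(per_class):
            features.append([c + rng.gauss(0.0, noise) for c in center])
            labels.append(label)
    return features, labels


rng = random.Random(0)
centers = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
train_x, train_y = make_toy_vectors(rng, 30, centers)
test_x, test_y = make_toy_vectors(rng, 10, centers)

weights, biases = train_softmax_regression(train_x, train_y, num_classes=3)
hits = sum(predict(weights, biases, x) == y for x, y in zip(test_x, test_y))
accuracy = hits / len(test_y)
print(f"held-out accuracy on toy vectors: {accuracy:.2f}")
```

In a real setting the toy vectors would be replaced by sentence-level vectors produced by a pre-trained BERT model, and the classifier would typically be trained with an optimized numerical library rather than pure Python loops.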

Chinese Library Classification (CLC) number: