
Computer Engineering ›› 2022, Vol. 48 ›› Issue (3): 69-73, 80. doi: 10.19678/j.issn.1000-3428.0060560

• Artificial Intelligence and Pattern Recognition •

Text Classification Using Capsule Network Integrating Stroke Features

LI Ranran, LIU Daming, LIU Zheng, CHANG Gaoxiang

  1. School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 200090, China
  • Received: 2021-01-12  Revised: 2021-03-09  Published: 2022-03-11
  • About the authors: LI Ranran (b. 1996), male, M.S. candidate; his research interests include deep learning and natural language processing. LIU Daming, associate professor, Ph.D. LIU Zheng and CHANG Gaoxiang, M.S. candidates.
  • Funding: Natural Science Foundation of Gansu Province (SKLLDJ032016021).


Abstract: Most current text classification methods cannot effectively reflect the importance of different words in a sentence, and the word vectors obtained during neural network training ignore the structural information of Chinese characters. A GRU-ATT-Capsule hybrid model is proposed, combined with the CW2Vec model to train Chinese word vectors. First, the text data are preprocessed: word vectors trained by a traditional word-vector method serve as the first input of the model, and Chinese word vectors containing the stroke features of Chinese characters, obtained by CW2Vec training, serve as the second input to represent the text. Second, the contextual features of the two inputs are extracted by a Gated Recurrent Unit (GRU), and an attention mechanism is used to learn the importance of words in the text. The contextual features extracted from the two inputs are then fused, and a capsule network learns the relationships between the local and global features of the text to perform classification. Experimental results on the Sogou news dataset show that the GRU-ATT-Capsule hybrid model improves test-set classification accuracy by 2.35 and 4.70 percentage points over the TextCNN and BiGRU-ATT models, respectively, and that the dual-channel input hybrid model fused with stroke features improves test-set accuracy by 0.45 percentage points over the single-channel input hybrid model. This demonstrates that the GRU-ATT-Capsule hybrid model can effectively extract more text features, including Chinese character structure, and improve text classification performance.
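The stroke-feature input described in the abstract follows the CW2Vec idea: each Chinese character is decomposed into one of five stroke categories, a word's characters are concatenated into one stroke sequence, and sliding stroke n-grams become the word's subword features. A minimal sketch (not the authors' code; the tiny stroke table and the n-gram range 3-5 are illustrative assumptions):

```python
# Stroke categories used by CW2Vec:
# 1 horizontal, 2 vertical, 3 left-falling, 4 right-falling/dot, 5 turning.
# A real system would use a full character-to-stroke dictionary.
STROKES = {
    "大": [1, 3, 4],  # heng, pie, na
    "人": [3, 4],     # pie, na
}

def stroke_ngrams(word, n_min=3, n_max=5):
    """Concatenate the stroke codes of every character in `word`,
    then slide windows of length n_min..n_max over the sequence."""
    seq = [s for ch in word for s in STROKES[ch]]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            grams.append(tuple(seq[i:i + n]))
    return grams

# "大人" -> stroke sequence [1, 3, 4, 3, 4] -> six n-grams of length 3 to 5
print(stroke_ngrams("大人"))
```

Each n-gram gets its own embedding during training, so words sharing stroke patterns (and hence often structural components) share parameters, which is what lets the model exploit character structure.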

Key words: word vector, stroke feature, Gated Recurrent Unit(GRU), attention mechanism, capsule network, text classification
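Two of the building blocks named above, attention pooling over GRU hidden states and the capsule-network "squash" nonlinearity, can be sketched in NumPy as follows; the shapes and random inputs are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def attention_pool(H, w):
    """H: (T, d) GRU hidden states, one per word; w: (d,) attention query.
    Softmax over per-word scores yields a weighted sentence representation,
    giving important words larger weights."""
    scores = H @ w                      # (T,) one relevance score per word
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                # softmax over time steps
    return alpha @ H                    # (d,) pooled sentence vector

def squash(s, eps=1e-9):
    """Capsule squash: keeps the vector's orientation while mapping its
    norm into [0, 1), so the norm can act as an existence probability."""
    norm2 = np.sum(s * s)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))            # 7 words, 16-dim hidden states
v = squash(attention_pool(H, rng.normal(size=16)))
print(np.linalg.norm(v))                # always below 1 after squashing
```

In the full model these operations run per input channel before fusion, and the squash is applied inside the capsule layer's dynamic routing rather than directly on the pooled vector.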
