
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 63-70. doi: 10.19678/j.issn.1000-3428.0062306

• Artificial Intelligence and Pattern Recognition •

  • Author profiles: GUO Kexiang (b. 1992), male, M.S. candidate; research interests include natural language processing and deep learning. WANG Hengjun, associate professor. BAI Zhixu, M.S. candidate.
  • Funding:
    National Key Research and Development Program of China (2017YFB0801900).

Detection Model for Word-Level Text Error Combining Multi-Channel CNN and BiGRU

GUO Kexiang1,2, WANG Hengjun1, BAI Zhixu1   

  1. School of Cryptographic Engineering, Information Engineering University, Zhengzhou 450001, China;
    2. Unit 96714 of PLA, Yongan, Fujian 366001, China
  • Received: 2021-08-09  Revised: 2021-09-29  Published: 2021-10-11



Abstract: Text proofreading is an important branch of Natural Language Processing (NLP). Deep learning is widely applied to Chinese text proofreading because of its strong feature-extraction and learning ability. To address two weaknesses of existing Chinese text error detection models, namely ignoring the local information between consecutive words in a sentence and insufficiently extracting contextual semantic information from long text, this study presents a word-level text error detection model that combines a multi-channel Convolutional Neural Network (CNN) with a Bidirectional Gated Recurrent Unit (BiGRU). First, the text to be checked is vectorized with Word2vec. The CNN then mines local features of the text, while the BiGRU learns its contextual semantic information and long-term dependencies. A Softmax layer outputs the text classification result, which indicates whether the text contains word errors. L2 regularization and a dropout strategy are adopted to prevent the model from overfitting. Experimental results on the SIGHAN2014 and SIGHAN2015 Chinese spelling-check task datasets show that the proposed model improves the detection F1 score by 3.01 percentage points over a Long Short-Term Memory (LSTM)-based text error detection model, demonstrating superior word-level text error detection performance.
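The multi-channel convolution step the abstract describes can be illustrated with a minimal sketch: 1-D convolutions with several kernel widths slide over the token-embedding sequence, each width capturing n-gram-like local features, and the resulting feature maps are collected per channel. All names, dimensions, and the random weights below are illustrative assumptions, not the authors' implementation; the paper's full model additionally feeds these features through a BiGRU and a Softmax classifier, which are omitted here.

```python
import random

def conv1d(embeddings, kernel, width):
    """Valid 1-D convolution over a sequence of embedding vectors.

    embeddings: list of lists with shape (seq_len, dim)
    kernel: flat weight list of length width * dim
    Returns one feature value per window position (seq_len - width + 1 values).
    """
    out = []
    for start in range(len(embeddings) - width + 1):
        # Flatten the window of `width` consecutive embeddings and dot it
        # with the kernel weights.
        window = [x for vec in embeddings[start:start + width] for x in vec]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

def multi_channel_features(embeddings, widths=(2, 3, 4), seed=0):
    """One randomly initialized kernel per width; the list of feature maps
    mimics the multi-channel CNN branch of the model (widths are assumed)."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    features = []
    for width in widths:
        kernel = [rng.uniform(-0.1, 0.1) for _ in range(width * dim)]
        features.append(conv1d(embeddings, kernel, width))
    return features

# Toy 5-token sentence with 4-dimensional stand-ins for Word2vec embeddings.
rng = random.Random(42)
sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
maps = multi_channel_features(sentence)
print([len(m) for m in maps])  # prints [4, 3, 2]: one window count per kernel width
```

In a real implementation each width would use many kernels (filters) rather than one, and the feature maps would be pooled or concatenated before the recurrent layer; the sketch only shows how different kernel widths yield local features at different n-gram granularities.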

Key words: word error, multi-channel convolution operation, Convolutional Neural Network (CNN), Bidirectional Gated Recurrent Unit (BiGRU), text error detection
