Computer Engineering (计算机工程), 2020, Vol. 46, Issue (5): 291-297. doi: 10.19678/j.issn.1000-3428.0054243

• Development Research and Engineering Application •

Prediction Model of Chinese Punctuation Based on Self-Attention Mechanism

DUAN Dagao, LIANG Shaohu, ZHAO Zhendong, HAN Zhongming

  1. School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  • Received: 2019-03-15 Revised: 2019-04-19 Published: 2019-05-16
  • About the authors: DUAN Dagao (born 1976), male, associate professor, whose main research interests are data mining and natural language processing; LIANG Shaohu and ZHAO Zhendong, master's students; HAN Zhongming, professor.
  • Funding:
    National Natural Science Foundation of China (61170112, 61532006); Humanities and Social Sciences Youth Fund of the Ministry of Education (13YJC860006); Beijing Natural Science Foundation (4172016).

Abstract: Chinese Punctuation Prediction (PP) is an important task in Natural Language Processing (NLP) that helps readers resolve ambiguity and understand text more accurately. To address the inability of conventional self-attention models to handle sequence position information, this paper proposes a Chinese punctuation prediction model based on the self-attention mechanism. The model builds on the self-attention mechanism by stacking multiple layers of Bi-directional Long Short-Term Memory (Bi-LSTM) networks, and it jointly learns from part-of-speech and grammatical information to predict punctuation. The self-attention mechanism captures the relationship between any two words regardless of their distance, while part-of-speech and grammatical information improve the accuracy of the predicted punctuation. Experimental results on a real-world news dataset show that the proposed model achieves an F1 value of 85.63%, significantly higher than conventional CRF and LSTM prediction methods, enabling accurate prediction of Chinese punctuation.

Key words: Punctuation Prediction (PP), self-attention mechanism, Bi-LSTM network, Deep Neural Network (DNN), Natural Language Processing (NLP)

CLC number:
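The abstract's two claims about self-attention (it relates any two words regardless of distance, yet it does not by itself encode sequence position) can be illustrated with a small sketch. This is not the paper's model: it is a bare scaled dot-product self-attention in plain Python, assuming identity query/key/value projections and made-up two-dimensional word vectors. The names `softmax`, `self_attention`, `sentence`, and `close` are hypothetical and chosen only for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length float vectors, one per word.
    Returns one context vector per word.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score the query against every position; distance plays no role.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Weighted sum of all value vectors.
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens))
                        for d in range(dim)])
    return outputs

def close(a, b, tol=1e-9):
    """Compare two vectors up to floating-point rounding."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

# Toy 2-dimensional "embeddings" for three words (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reordered = [sentence[2], sentence[0], sentence[1]]

out = self_attention(sentence)
out_reordered = self_attention(reordered)

# Each word receives the same context vector wherever it sits in the sequence.
order_blind = (close(out[2], out_reordered[0])
               and close(out[0], out_reordered[1])
               and close(out[1], out_reordered[2]))
print(order_blind)
```

Under these assumptions, reordering the words only reorders the outputs, which is why a position-aware component is needed on top of plain self-attention. According to the abstract, the paper supplies that order sensitivity with stacked Bi-LSTM layers; that part is not sketched here.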