Computer Engineering, 2020, Vol. 46, Issue (10): 294-300. doi: 10.19678/j.issn.1000-3428.0055669

• Development Research and Engineering Application •

Thai Sentence Segmentation Model Based on Bi-LSTM and CRF

LI Zijian, CHI Chengying, ZHAN Xuegang

  1. School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, Liaoning 114031, China
  • Received: 2019-08-05 Revised: 2019-10-11 Published: 2019-10-21
  • About the authors: LI Zijian (b. 1995), male, M.S., main research interests: natural language processing and deep learning; CHI Chengying (corresponding author), professor; ZHAN Xuegang, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61672138).

Abstract: Sentence segmentation for Southeast Asian languages such as Thai is a challenging task in Natural Language Processing (NLP). This paper treats sentence segmentation as a sequence labeling problem and proposes an automatic sentence boundary recognition model based on a bidirectional Long Short-Term Memory (Bi-LSTM) recurrent neural network. GloVe word embeddings are used to convert the words or characters of a Thai sentence into vectors of different dimensions, and the word or character vectors are then combined into a sentence representation that is fed to the model for training. The bidirectional network structure captures context on both sides of each position, which improves segmentation. Experimental results show that the model identifies Thai sentence boundaries with high accuracy.

Keywords: Natural Language Processing (NLP), sentence segmentation, deep learning, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) network, Thai
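
The sequence labeling formulation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tag set, so the two tags used here (`E` for a token that ends a sentence, `O` for any other token) and the function names `sentences_to_labels` and `labels_to_sentences` are assumptions chosen for illustration, not the authors' implementation. The first function builds training targets from already-segmented text; the second recovers sentences from a token stream and the tags a model would predict.

```python
from typing import List, Sequence, Tuple

# Hypothetical tag set (an assumption, not taken from the paper):
# "E" marks the last token of a sentence, "O" marks every other token.
END_TAG = "E"
OTHER_TAG = "O"


def sentences_to_labels(
    sentences: Sequence[Sequence[str]],
) -> Tuple[List[str], List[str]]:
    """Flatten tokenized sentences into one token stream with one tag per token."""
    tokens: List[str] = []
    labels: List[str] = []
    for sentence in sentences:
        for i, token in enumerate(sentence):
            tokens.append(token)
            is_last = i == len(sentence) - 1
            labels.append(END_TAG if is_last else OTHER_TAG)
    return tokens, labels


def labels_to_sentences(
    tokens: Sequence[str], labels: Sequence[str]
) -> List[List[str]]:
    """Rebuild sentences by cutting the token stream after every END_TAG."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == END_TAG:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens with no closing END_TAG form a final sentence.
        sentences.append(current)
    return sentences
```

Under these assumptions, a tagger such as the Bi-LSTM described in the abstract would be trained to predict the `labels` sequence from the `tokens` sequence, and `labels_to_sentences` would then turn its predictions back into sentence boundaries.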

CLC Number: