Thai Sentence Segmentation Model Based on Bi-LSTM and CRF

doi:10.19678/j.issn.1000-3428.0055669

Abstract

Abstract: Sentence segmentation for Southeast Asian languages such as Thai is a challenging task in Natural Language Processing (NLP). This paper treats sentence segmentation as a sequence labeling problem and proposes an automatic sentence boundary recognition model based on a bidirectional Long Short-Term Memory (Bi-LSTM) recurrent neural network. GloVe word embeddings are used to convert the words or characters of Thai sentences into vectors of different dimensions, and these word or character vectors are then combined into sentence vectors that are fed to the model for training. On this basis, the bidirectional network structure captures context information to improve sentence segmentation. Experimental results show that the model recognizes sentence boundaries in Thai with high accuracy.

Key words: Natural Language Processing (NLP), sentence segmentation, deep learning, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) network, Thai
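
To illustrate what casting sentence segmentation as sequence labeling can look like, the following is a minimal sketch and not the authors' implementation. The two-tag scheme ("E" for the last word of a sentence, "O" for every other word), the function names, and the placeholder tokens w1..w5 are all assumptions made for illustration; the abstract does not state which tag set the paper uses. A tagger such as a Bi-LSTM with a CRF output layer would predict the tag sequence, and the second function turns predicted tags back into sentences.

```python
from typing import List, Tuple

# Hypothetical tag set (an assumption, not taken from the paper):
# "E" marks the last word of a sentence, "O" marks any other word.
END, OTHER = "E", "O"


def sentences_to_tags(sentences: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten word-segmented sentences into one token stream with boundary tags."""
    tokens: List[str] = []
    tags: List[str] = []
    for sentence in sentences:
        for i, word in enumerate(sentence):
            tokens.append(word)
            tags.append(END if i == len(sentence) - 1 else OTHER)
    return tokens, tags


def tags_to_sentences(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild sentences by cutting the token stream after every END tag."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for word, tag in zip(tokens, tags):
        current.append(word)
        if tag == END:
            sentences.append(current)
            current = []
    if current:  # trailing words with no predicted boundary form a final sentence
        sentences.append(current)
    return sentences


# Placeholder tokens stand in for already word-segmented Thai text.
tokens, tags = sentences_to_tags([["w1", "w2", "w3"], ["w4", "w5"]])
print(list(zip(tokens, tags)))
print(tags_to_sentences(tokens, tags))
```

Under this framing, training data is the pair (tokens, tags), and segmentation quality can be measured on how well the predicted "E" positions match the reference ones.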

CLC number: TP391.1

LI Zijian, CHI Chengying, ZHAN Xuegang. Thai Sentence Segmentation Model Based on Bi-LSTM and CRF[J]. Computer Engineering, 2020, 46(10): 294-300.

https://www.ecice06.com/CN/Y2020/V46/I10/294

Figures/Tables: 10

