Named Entity Recognition Combining Wubi Glyphs with Contextualized Character Embeddings

doi:10.19678/j.issn.1000-3428.0057265

Abstract

Abstract: Named Entity Recognition (NER), a fundamental task in natural language processing, is widely used in information extraction, knowledge graph construction, and other tasks. However, existing Chinese pre-trained language models usually model only the characters in context and ignore the glyph structure of Chinese characters. This paper proposes two methods for building contextualized character embeddings that incorporate Wubi glyphs, in order to enhance the semantic representation of character embeddings. The first method extracts features from characters and glyphs separately and models them jointly to produce the character embeddings. The second method concatenates the Wubi glyphs to the character embeddings as auxiliary information and, on this basis, trains a hybrid language model over both characters and Wubi glyphs. Experimental results show that both methods significantly improve the performance of Chinese NER systems and outperform language models based on characters alone.
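The idea of attaching Wubi glyph information to a character embedding can be illustrated with a minimal sketch. It is not the paper's model. The toy Wubi table, the dimensions, the random vectors, and the averaging of key embeddings are all assumptions made for illustration, and the resulting vectors are static rather than contextualized (a language model would be trained on top of such inputs).

```python
import random

# Hypothetical toy table: character -> Wubi key sequence.
# Illustrative placeholder values, not a verified Wubi code table.
TOY_WUBI = {"中": "khk", "国": "lgyi", "人": "wwww"}

CHAR_DIM = 4   # assumed size of the character embedding
GLYPH_DIM = 3  # assumed size of the glyph (Wubi key) embedding

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_vector(dim):
    """Stand-in for a learned embedding vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# One embedding per character and one per Wubi key (letters a-z).
char_table = {ch: random_vector(CHAR_DIM) for ch in TOY_WUBI}
key_table = {key: random_vector(GLYPH_DIM) for key in "abcdefghijklmnopqrstuvwxyz"}


def glyph_vector(ch):
    """Average the embeddings of the keys in the character's Wubi code.

    Averaging is an assumption of this sketch; any encoder over the key
    sequence could be used here instead.
    """
    vecs = [key_table[k] for k in TOY_WUBI[ch]]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def combined_embedding(ch):
    """Concatenate the character vector with its glyph vector."""
    return char_table[ch] + glyph_vector(ch)


sentence = "中国人"
embedded = [combined_embedding(ch) for ch in sentence]
print(len(embedded), len(embedded[0]))  # 3 7
```

Each character ends up with a vector of length `CHAR_DIM + GLYPH_DIM`, in which the first `CHAR_DIM` entries are the unchanged character vector and the rest carry the glyph information.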

Key words: language model, Named Entity Recognition (NER), Wubi glyphs, contextualized character embeddings, unlabeled corpus

摘要： 命名实体识别（NER）作为自然语言处理的重要部分，在信息抽取和知识图谱等任务中得到广泛应用。然而目前中文预训练语言模型通常仅对上下文中的字符进行建模，忽略了中文字符的字形结构。提出2种结合五笔字形的上下文相关字向量表示方法，以增强字向量的语义表达能力。第一种方法分别对字符和字形抽取特征并联合建模得到字向量表示，第二种方法将五笔字形作为辅助信息拼接到字向量中，训练一个基于字符和五笔字形的混合语言模型。实验结果表明，所提两种方法可以有效提升中文NER系统的性能，且结合五笔字形的上下文相关字向量表示方法的系统性能优于基于单一字符的语言模型。

关键词: 语言模型, 命名实体识别, 五笔字形, 上下文相关字向量, 无标注语料

CLC Number:

TP391

ZHANG Dong, WANG Mingtao, CHEN Wenliang. Named Entity Recognition Combining Wubi Glyphs with Contextualized Character Embeddings[J]. Computer Engineering, 2021, 47(3): 94-101.

张栋, 王铭涛, 陈文亮. 结合五笔字形与上下文相关字向量的命名实体识别[J]. 计算机工程, 2021, 47(3): 94-101.


URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0057265

http://www.ecice06.com/EN/Y2021/V47/I3/94

Figures/Tables 9

