Automatic Construction of Chinese Nested Named Entity Recognition Corpus Based on Wikipedia

Computer Engineering, 2018, Vol. 44, Issue (11): 76-82. doi: 10.19678/j.issn.1000-3428.0048667

LI Yanqun, HE Yunqi, QIAN Longhua, ZHOU Guodong

Natural Language Processing Laboratory, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Received: 2017-09-14 Online: 2018-11-15 Published: 2018-11-15
About the authors: LI Yanqun (born 1992), female, master's student, whose main research interest is information extraction; HE Yunqi, master's student; QIAN Longhua (corresponding author), associate professor; ZHOU Guodong, professor and doctoral supervisor.
Funding: National Natural Science Foundation of China (61373096, 61331011, 61673290).

Abstract

Abstract: Traditional supervised learning methods require annotating an in-domain corpus of a certain scale, which limits their domain adaptability. To address this, a method is proposed for automatically constructing a Chinese nested named entity recognition corpus from Chinese Wikipedia entries. The entries of Chinese Wikipedia are classified by entity type, and the entity entries are then used to construct the nested structure of entities, thereby automatically generating a large-scale Chinese nested named entity recognition corpus. Experimental results on a manually annotated nested named entity recognition corpus show that the automatically constructed corpus is large in scale and broad in domain coverage, and that it can adapt to Chinese nested named entity recognition across a wide range of domains.

Key words: nested named entity recognition, information extraction, Wikipedia, corpus, conditional random field
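
The abstract does not spell out how the nested structure is built from entity entries, so the following is only a minimal sketch of one plausible reading: given an outer entity mention and a dictionary mapping entry titles to entity types, every substring that is itself a known title is reported as an inner entity. The dictionary contents, the type labels (`ORG`, `LOC`), and the function name `nested_spans` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical title-to-type dictionary. In the setting the abstract describes,
# such a mapping would come from classifying Wikipedia entries; these three
# entries and their labels are invented for this example.
ENTITY_TYPES: Dict[str, str] = {
    "苏州大学": "ORG",
    "苏州": "LOC",
    "江苏": "LOC",
}

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def nested_spans(mention: str, offset: int, entity_types: Dict[str, str]) -> List[Span]:
    """Return every substring of `mention` that is a known entity title.

    The full mention and any shorter titles inside it are all reported,
    which yields a nested annotation. Offsets are shifted by `offset` so
    that they refer to character positions in the enclosing sentence.
    """
    spans: List[Span] = []
    n = len(mention)
    for start in range(n):
        for end in range(start + 1, n + 1):
            label = entity_types.get(mention[start:end])
            if label is not None:
                spans.append((offset + start, offset + end, label))
    # Order by start position, with longer (outer) spans before shorter ones.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    return spans


sentence = "他毕业于苏州大学。"
mention = "苏州大学"
print(nested_spans(mention, sentence.index(mention), ENTITY_TYPES))
```

For the sample sentence this prints `[(4, 8, 'ORG'), (4, 6, 'LOC')]`: the organization name as the outer entity and the place name nested inside it. Exhaustive substring lookup is quadratic in the mention length, which is acceptable for short entity names but is a design shortcut of this sketch rather than a claim about the authors' implementation.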

CLC number: TP311

LI Yanqun, HE Yunqi, QIAN Longhua, ZHOU Guodong. Automatic Construction of Chinese Nested Named Entity Recognition Corpus Based on Wikipedia[J]. Computer Engineering, 2018, 44(11): 76-82.

https://www.ecice06.com/CN/Y2018/V44/I11/76
