Computer Engineering ›› 2018, Vol. 44 ›› Issue (11): 76-82. doi: 10.19678/j.issn.1000-3428.0048667

• Architecture and Software Technology •

Automatic Construction of Chinese Nested Named Entity Recognition Corpus Based on Wikipedia

LI Yanqun, HE Yunqi, QIAN Longhua, ZHOU Guodong

  1. Natural Language Processing Laboratory, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2017-09-14 Online: 2018-11-15 Published: 2018-11-15
  • About the authors: LI Yanqun (born 1992), female, master's student, main research interest: information extraction; HE Yunqi, master's student; QIAN Longhua (corresponding author), associate professor; ZHOU Guodong, professor and doctoral supervisor.
  • Funding:

    National Natural Science Foundation of China (61373096, 61331011, 61673290).

Abstract:

Traditional supervised learning methods require an annotated in-domain corpus of a certain scale, which limits their domain adaptability. To address this, a method is proposed for automatically constructing a Chinese nested named entity recognition corpus from Chinese Wikipedia entries. The entries of Chinese Wikipedia are first classified by entity type, and the entity entries are then used to construct the nested structure of entities, thereby automatically generating a large-scale Chinese nested named entity recognition corpus. Experimental results on a manually annotated nested named entity recognition corpus show that the automatically constructed corpus is large in scale and broad in domain coverage, and that it is suited to Chinese nested named entity recognition across a wide range of domains.

Key words: nested named entity recognition, information extraction, Wikipedia, corpus, conditional random field
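
The abstract only outlines how nested structure is derived from entity entries, so the following is a minimal sketch of the general idea rather than the authors' pipeline. It assumes a small hypothetical dictionary, `ENTITY_TYPES`, standing in for Wikipedia entry titles that have already been classified by entity type, and it uses plain substring matching (a swapped-in simplification) to find entities nested inside longer ones. The function name `nested_spans` and the type labels `ORG` and `LOC` are illustrative assumptions, not names from the paper.

```python
from typing import Dict, List, Tuple

# (start, end_exclusive, entity_type) over character offsets
Span = Tuple[int, int, str]

# Hypothetical stand-in for Wikipedia entry titles already classified by type.
ENTITY_TYPES: Dict[str, str] = {
    "苏州大学": "ORG",
    "苏州": "LOC",
    "江苏": "LOC",
}


def nested_spans(text: str, entity_types: Dict[str, str]) -> List[Span]:
    """Return every dictionary match in text, keeping matches that are
    nested inside longer matches instead of discarding them."""
    spans: List[Span] = []
    for title, etype in entity_types.items():
        start = text.find(title)
        while start != -1:
            spans.append((start, start + len(title), etype))
            start = text.find(title, start + 1)
    # Order by start offset, longer span first, so an outer entity
    # precedes the entities nested inside it.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    return spans


sentence = "苏州大学位于江苏苏州。"
for start, end, etype in nested_spans(sentence, ENTITY_TYPES):
    print(start, end, etype, sentence[start:end])
```

In this toy sentence, the location 苏州 is reported both on its own and as an inner span of the organization 苏州大学, which is the kind of nested annotation the corpus targets. A real system would need disambiguation and a far larger entity inventory than this dictionary lookup provides.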

CLC Number: