Computer Engineering, 2020, Vol. 46, Issue (5): 312-320. doi: 10.19678/j.issn.1000-3428.0054483

• Development Research and Engineering Application •

Research on Entity Recognition and Tagging in Fiscal and Taxation Domain

QIU Yu1,2,3, CHENG Li1,2,3

  1. The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China;
  3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
  • Received: 2019-04-03 Revised: 2019-05-15 Published: 2019-05-31
  • About the authors: QIU Yu (born 1988), male, Ph.D. candidate, whose main research interests are artificial intelligence and natural language processing; CHENG Li, researcher and doctoral supervisor.
  • Funding:
    National "Thousand Talents Program" project (Y32H251201); "West Light" Foundation of the Chinese Academy of Sciences (2017-XBZG-BR-001).

Abstract: Entities in specific domains have more complex and diverse structures and types than those in the general domain, so traditional named entity recognition methods struggle to achieve good results. To address this problem, this paper takes the fiscal and taxation domain as an example and studies domain entity recognition and tagging, with the aim of dynamically expanding a knowledge base. A hierarchical set of entity types is defined according to the characteristics of the domain, and a training corpus is obtained by distant supervision. A deep neural network model that combines character features and word features is used to recognize entity boundaries. Entity type tagging is treated as a multi-label, multi-class classification task, and a method based on ensemble learning is proposed for it. Experimental results on real datasets show that, compared with methods such as logistic regression and support vector machines, the proposed method achieves higher precision, recall, and F-score.

Key words: knowledge base expansion, entity recognition, entity tagging, deep learning, ensemble learning
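
The abstract does not say how the distantly supervised corpus is built, so the following is only a minimal sketch of one common approach: entity names already in a knowledge base are matched against raw text (longest match first) to produce character-level BIO tags. The function name `distant_label`, the sample knowledge-base entries, and the type names `TaxType` and `Invoice` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def distant_label(text: str, kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag each character of `text` with a BIO label via longest-match lookup in `kb`.

    `kb` maps an entity name to its type (hypothetical types, for illustration only).
    """
    tags = ["O"] * len(text)
    # Try longer names first so a long entity is not split by a shorter one.
    names = sorted((n for n in kb if n), key=len, reverse=True)
    i = 0
    while i < len(text):
        for name in names:
            if text.startswith(name, i):
                tags[i] = "B-" + kb[name]
                for j in range(i + 1, i + len(name)):
                    tags[j] = "I-" + kb[name]
                i += len(name)
                break
        else:
            # No knowledge-base name starts here; leave the character as "O".
            i += 1
    return list(zip(text, tags))


# Hypothetical knowledge base: "增值税" (value-added tax) and
# "增值税专用发票" (special VAT invoice).
kb = {"增值税": "TaxType", "增值税专用发票": "Invoice"}
# "开具增值税专用发票" = "issue a special VAT invoice"
labeled = distant_label("开具增值税专用发票", kb)
for char, tag in labeled:
    print(char, tag)
```

Labels produced this way are noisy (names missing from the knowledge base stay tagged `O`, and ambiguous names get a single type), which is why such a corpus is normally used to train a statistical model rather than as a final output.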
