
计算机工程 (Computer Engineering), 2024, Vol. 50, Issue (10): 145-153. doi: 10.19678/j.issn.1000-3428.0068226

• Artificial Intelligence and Pattern Recognition •

Chinese Named Entity Recognition Based on Lexicon Fusion and Dependency Relation

TANG Zhuoran, LIU Yi*

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2023-08-16    Online: 2024-10-15    Published: 2024-01-25
  • Contact: LIU Yi
  • Supported by: Key-Area Research and Development Program of Guangdong Province (2021B0101200002)

Abstract:

Named entity recognition is a fundamental task in natural language processing that provides valuable data support for many downstream tasks, such as relation extraction and knowledge graph construction. Chinese named entity recognition is hampered by word segmentation errors, ambiguous entity boundaries, and contextual dependencies, and existing methods neither fully exploit lexical information nor effectively extract the internal features of a text. To address these problems, this paper proposes a Chinese named entity recognition model based on lexicon fusion and dependency relations. First, the self-matching words of each character in the input text are retrieved to generate lexical feature vectors, and word boundary information is derived from the position of each character within its self-matching words. The character vectors and lexical feature vectors are then fused with a biaffine attention mechanism, so that the lexical and word boundary information is integrated into the model's encoding process and strengthens its entity recognition ability. Next, a dependency graph of the input text is built from its dependency syntax, and a Graph Attention Network (GAT) captures the dependency features within the text, enriching its internal semantic dependency information and helping to distinguish entity boundaries. Finally, a Conditional Random Field (CRF) computes the label sequence of the text. Experimental results show that the proposed model achieves F1 scores of 92.10%, 80.76%, and 95.66% on the CCKS2017, OntoNotes 4.0, and MSRA datasets, respectively, outperforming the comparison models.
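The first step, retrieving each character's self-matching words and its position inside them (the word boundary information), can be illustrated with a short sketch. This is a minimal illustration rather than the paper's implementation: the toy lexicon, the BMES-style position tags, and the `max_len` window are assumptions for demonstration; a real system would typically match against a large pretrained word list, for example via a trie.

```python
# Collect each character's "self-matching words" from an external lexicon and
# record the character's position (B/M/E/S) inside each matched word.
from collections import defaultdict

def self_matching_words(sentence: str, lexicon: set[str], max_len: int = 5):
    """Return {char_index: [(word, position_tag), ...]} for every character."""
    matches = defaultdict(list)
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            for i in range(start, end):
                if end - start == 1:
                    tag = "S"          # single-character word
                elif i == start:
                    tag = "B"          # word beginning
                elif i == end - 1:
                    tag = "E"          # word end
                else:
                    tag = "M"          # word middle
                matches[i].append((word, tag))
    return matches

# Illustrative lexicon and sentence (not from the paper's datasets).
lexicon = {"广州", "广州市", "市长", "长江", "长江大桥", "大桥"}
print(self_matching_words("广州市长江大桥", lexicon))
```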
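Fusing character vectors with the lexical feature vectors through a biaffine attention score might look roughly like the following PyTorch sketch. It is a simplified stand-in for the paper's module: the dimensions, the scoring form s = c^T U w + V[c; w] + b, the masking of padded words, and the final projection are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact implementation) of biaffine attention
# fusion between character representations and self-matching-word features.
import torch
import torch.nn as nn


class BiaffineLexiconFusion(nn.Module):
    """Fuse each character vector with its self-matching word vectors."""

    def __init__(self, char_dim: int, word_dim: int, out_dim: int):
        super().__init__()
        # Biaffine scoring: s = c^T U w + V [c; w] + b
        self.U = nn.Parameter(torch.empty(char_dim, word_dim))
        nn.init.xavier_uniform_(self.U)
        self.V = nn.Linear(char_dim + word_dim, 1, bias=True)
        self.proj = nn.Linear(char_dim + word_dim, out_dim)

    def forward(self, char_vec, word_vecs, word_mask):
        # char_vec:  (batch, seq_len, char_dim)
        # word_vecs: (batch, seq_len, n_words, word_dim) self-matching words per character
        # word_mask: (batch, seq_len, n_words)           1 for real words, 0 for padding
        bilinear = torch.einsum("bsc,cw,bsnw->bsn", char_vec, self.U, word_vecs)
        concat = torch.cat(
            [char_vec.unsqueeze(2).expand(-1, -1, word_vecs.size(2), -1), word_vecs], dim=-1
        )
        scores = bilinear + self.V(concat).squeeze(-1)        # (batch, seq_len, n_words)
        scores = scores.masked_fill(word_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        attn = torch.nan_to_num(attn)                         # characters with no matched word
        lex = torch.einsum("bsn,bsnw->bsw", attn, word_vecs)  # weighted lexicon feature
        return self.proj(torch.cat([char_vec, lex], dim=-1))  # fused representation


# Usage with arbitrary shapes (character vectors would come from the encoder).
fusion = BiaffineLexiconFusion(char_dim=768, word_dim=50, out_dim=256)
chars = torch.randn(2, 10, 768)
words = torch.randn(2, 10, 4, 50)   # up to 4 self-matching words per character
mask = torch.ones(2, 10, 4)
fused = fusion(chars, words, mask)  # (2, 10, 256)
```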
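Building the dependency graph from parser output and applying graph attention only along dependency edges could be sketched as below. The single-head GAT layer, the symmetric adjacency with self-loops, and the example sentence and head indices are illustrative assumptions; the paper's GAT configuration (e.g. number of heads and layers) may differ.

```python
# A minimal sketch of graph attention over a dependency graph built from
# the head indices produced by a dependency parser.
import torch
import torch.nn as nn
import torch.nn.functional as F


def dependency_adjacency(heads: list[int]) -> torch.Tensor:
    """Symmetric adjacency matrix (with self-loops) from dependency heads.
    heads[i] is the index of token i's head, or -1 for the root."""
    n = len(heads)
    adj = torch.eye(n)
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i, h] = adj[h, i] = 1.0
    return adj


class GATLayer(nn.Module):
    """Single-head graph attention layer restricted to dependency edges."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (n_tokens, in_dim), adj: (n_tokens, n_tokens)
        h = self.W(x)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float("-inf"))  # attend only along dependency edges
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)                     # aggregated neighbourhood features


# Example: "广东工业大学 位于 广州" with token 1 ("位于") as the root.
heads = [1, -1, 1]
adj = dependency_adjacency(heads)
tokens = torch.randn(3, 256)              # fused character/lexicon features
out = GATLayer(256, 256)(tokens, adj)     # (3, 256)
```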
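For the final CRF labelling step, inference scores tag sequences with emission and transition scores and decodes the best sequence with the Viterbi algorithm. The sketch below shows only this decoding step with random, untrained scores and a made-up tag set; a real model would learn the scores jointly with the encoder via the CRF negative log-likelihood loss, which is not shown here.

```python
# A minimal Viterbi-decoding sketch for CRF label prediction.
import torch


def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list[int]:
    """emissions: (seq_len, n_tags) per-token tag scores from the encoder.
    transitions: (n_tags, n_tags) score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0]                   # best score ending in each tag at step 0
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j], maximised over previous tag i
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]


# Example with a tiny hypothetical tag set {O, B-ORG, I-ORG} and random scores.
tags = viterbi_decode(torch.randn(5, 3), torch.randn(3, 3))
print(tags)  # e.g. [1, 2, 2, 0, 0]
```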

Key words: attention mechanism, dependency relation, lexicon fusion, Graph Attention Network (GAT), Chinese named entity recognition