
Computer Engineering ›› 2023, Vol. 49 ›› Issue (2): 271-278. doi: 10.19678/j.issn.1000-3428.0064023

• Development Research and Engineering Application •

A Cross-Lingual Distant Supervision Method for Uyghur and Chinese

YANG Zhenyu1,2,3, WANG Lei1,2,3, MA Bo1,2,3, YANG Yating1,2,3, DONG Rui1,2,3, Azmat Anwar1,2,3, WANG Zhen1,2,3

  1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China;
  3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
  • Received: 2022-02-24 Revised: 2022-03-28 Published: 2022-07-18
  • About the authors: YANG Zhenyu (born 1996), male, M.S., main research interests: natural language processing and information extraction; WANG Lei, researcher, Ph.D.; MA Bo, associate researcher, Ph.D.; YANG Yating, researcher, Ph.D.; DONG Rui, associate researcher, Ph.D.; Azmat Anwar, assistant researcher, Ph.D.; WANG Zhen, research intern, M.S.
  • Funding:
    National Natural Science Foundation of China, Special Program for Local Young Talent Training (U2003303); National Key Research and Development Program of China (2018YFC0823002); Tianshan Innovation Project of Xinjiang Uygur Autonomous Region (2020D14045); "Tianshan Youth" Program, Outstanding Young Scientific and Technological Talents Project (2019Q031); Youth Innovation Promotion Association of the Chinese Academy of Sciences (科发人函字[2019]26号); Western Young Scholars Program of the Chinese Academy of Sciences, Category B (2019-XBQNXZ-B-008).

Abstract: Distant supervision is an important corpus expansion technique in relation extraction: it can quickly generate a pseudo-labeled corpus from a small amount of annotated data. However, traditional distant supervision methods are mainly applied to monolingual text, so low-resource languages such as Uyghur cannot use them to obtain pseudo-labeled corpora. To address this problem, this paper proposes a cross-lingual distant supervision method for Uyghur and Chinese that uses an existing Chinese corpus to automatically expand a Uyghur corpus in the absence of an existing Uyghur corpus. The method treats distant supervision as a text semantic similarity problem rather than simple text lookup, and judges whether a Uyghur-Chinese sentence pair expresses the same relation at two levels: entity semantics and sentence semantics. If the relation is the same, the existing Chinese label is transferred to the Uyghur sentence, so that the Uyghur corpus is built automatically from scratch. In addition, to capture the context and hidden semantic information of entities effectively, the paper proposes an interactive matching method with a gating mechanism, in which gate units control the flow of information between the encoding layer and the attention layer. To simulate the distant supervision process and verify the model's performance, the authors manually labeled 3,500 Uyghur sentences and 600 Chinese sentences. Experimental results show that the method reaches an F1 score of 73.05%, and that a pseudo-labeled relation extraction dataset containing 97,949 Uyghur sentences was successfully constructed.

Key words: relation extraction, semantic similarity, semantic encoding, distant supervision, cross-lingual
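
The label-transfer idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the paper learns the matching with a gated interactive network, whereas this sketch assumes that cross-lingual vectors have already been computed by some encoder and simply combines entity-level and sentence-level cosine similarity. The names `Example` and `transfer_labels`, the `threshold` and `entity_weight` values, and the toy vectors are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Example:
    """A sentence with pre-computed vectors (hypothetical representation)."""
    text: str
    sentence_vec: Sequence[float]   # sentence-level embedding
    entity_vec: Sequence[float]     # embedding of the entity pair
    relation: Optional[str] = None  # known for Chinese, None for Uyghur


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def transfer_labels(uyghur: List[Example], chinese: List[Example],
                    threshold: float = 0.8,
                    entity_weight: float = 0.5) -> List[Example]:
    """Copy the relation of the best-matching Chinese sentence onto each
    Uyghur sentence whose combined similarity reaches the threshold."""
    labeled = []
    for u in uyghur:
        best_score, best_relation = 0.0, None
        for c in chinese:
            # Combine the two levels named in the abstract (assumed weighting).
            score = (entity_weight * cosine(u.entity_vec, c.entity_vec)
                     + (1 - entity_weight) * cosine(u.sentence_vec, c.sentence_vec))
            if score > best_score:
                best_score, best_relation = score, c.relation
        if best_relation is not None and best_score >= threshold:
            labeled.append(Example(u.text, u.sentence_vec, u.entity_vec, best_relation))
    return labeled


# Toy data: two labeled Chinese sentences and two unlabeled Uyghur sentences.
zh = [Example("zh-1", [1.0, 0.0], [1.0, 0.0], "born_in"),
      Example("zh-2", [0.0, 1.0], [0.0, 1.0], "works_for")]
ug = [Example("ug-1", [0.9, 0.1], [1.0, 0.0]),
      Example("ug-2", [1.0, 1.0], [1.0, -1.0])]

pseudo_labeled = transfer_labels(ug, zh)
for ex in pseudo_labeled:
    print(ex.text, ex.relation)
```

In this toy run only the Uyghur sentence that closely matches a Chinese sentence at both levels receives a label; the other falls below the threshold and is left out, which mirrors how a similarity cutoff trades coverage for label quality.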
