A Cross-Lingual Distant Supervision Method for Uyghur and Chinese

Computer Engineering, 2023, Vol. 49, Issue (2): 271-278. doi: 10.19678/j.issn.1000-3428.0064023

YANG Zhenyu^1,2,3, WANG Lei^1,2,3, MA Bo^1,2,3, YANG Yating^1,2,3, DONG Rui^1,2,3, Azmat Anwar^1,2,3, WANG Zhen^1,2,3

1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China

Received: 2022-02-24 Revised: 2022-03-28 Published: 2022-07-18
About the authors: YANG Zhenyu (born 1996), male, master's degree, main research interests: natural language processing and information extraction; WANG Lei, researcher, Ph.D.; MA Bo, associate researcher, Ph.D.; YANG Yating, researcher, Ph.D.; DONG Rui, associate researcher, Ph.D.; Azmat Anwar, assistant researcher, Ph.D.; WANG Zhen, research assistant, master's degree.
Funding:
National Natural Science Foundation of China, Special Program for Training Local Young Talents (U2003303); National Key Research and Development Program of China (2018YFC0823002); Tianshan Innovation Project of Xinjiang Uygur Autonomous Region (2020D14045); "Tianshan Youth" Program, Outstanding Young Scientific and Technological Talents Project (2019Q031); Youth Innovation Promotion Association of the Chinese Academy of Sciences (Kefa Ren Han Zi [2019] No. 26); Chinese Academy of Sciences Western Young Scholars Program, Category B (2019-XBQNXZ-B-008).

Abstract

Abstract: Distant supervision is an important corpus expansion technique in relation extraction: it can quickly generate a pseudo-labeled corpus from a small amount of annotated data. However, traditional distant supervision methods are mainly applied to monolingual text, so low-resource languages such as Uyghur cannot use them to obtain pseudo-labeled corpora. To address this problem, this paper proposes a cross-lingual distant supervision method for Uyghur and Chinese that uses existing Chinese corpora to automatically expand a Uyghur corpus when no Uyghur corpus is available. The method treats distant supervision as a text semantic similarity problem rather than simple text lookup, and judges whether a Uyghur-Chinese sentence pair expresses the same relation at two levels: entity semantics and sentence semantics. If the relation is the same, the existing Chinese annotation is transferred to the Uyghur sentence, so that the Uyghur corpus is built automatically from scratch. In addition, to effectively capture the context and hidden semantic information of entities, the paper proposes an interactive matching method with a gating mechanism, in which gate units control the information passed between the encoding layer and the attention layer. The authors manually labeled 3 500 Uyghur sentences and 600 Chinese sentences to simulate the distant supervision process and verify the performance of the model. Experimental results show that the method achieves an F1 score of 73.05% and that a pseudo-labeled relation extraction dataset containing 97 949 Uyghur sentences was successfully constructed.

Keywords: relation extraction, semantic similarity, semantic encoding, distant supervision, cross-lingual
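The label-transfer idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the paper uses a learned gated interactive matching network, whereas this sketch swaps in plain cosine similarity over precomputed vectors. The `Sentence` class, the `entity_vec` and `sentence_vec` fields, the `transfer_labels` function, and the `threshold` value are all hypothetical names and settings chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


@dataclass
class Sentence:
    """Hypothetical container: text plus precomputed cross-lingual vectors."""
    text: str
    entity_vec: Sequence[float]    # assumed embedding of the entity pair
    sentence_vec: Sequence[float]  # assumed embedding of the whole sentence
    relation: Optional[str] = None  # None means "not labeled"


def transfer_labels(uyghur: List[Sentence],
                    chinese: List[Sentence],
                    threshold: float = 0.8) -> List[Sentence]:
    """Copy a Chinese relation label onto each Uyghur sentence whose best match
    is similar enough at BOTH the entity level and the sentence level."""
    labeled = []
    for u in uyghur:
        best_score, best_relation = -1.0, None
        for c in chinese:
            # Take the weaker of the two similarities, so a pair only matches
            # when entity semantics and sentence semantics both agree.
            score = min(cosine(u.entity_vec, c.entity_vec),
                        cosine(u.sentence_vec, c.sentence_vec))
            if score > best_score:
                best_score, best_relation = score, c.relation
        if best_relation is not None and best_score >= threshold:
            labeled.append(Sentence(u.text, u.entity_vec, u.sentence_vec, best_relation))
    return labeled
```

Taking the minimum of the two scores is one simple way to require agreement at both levels; a learned matcher, as the paper proposes, would replace this hand-written scoring rule.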

CLC number: TP391.1

YANG Zhenyu, WANG Lei, MA Bo, YANG Yating, DONG Rui, Azmat Anwar, WANG Zhen. A Cross-Lingual Distant Supervision Method for Uyghur and Chinese[J]. Computer Engineering, 2023, 49(2): 271-278.

https://www.ecice06.com/CN/Y2023/V49/I2/271

Figures/Tables: 7
