
Computer Engineering, 2023, Vol. 49, Issue (1): 82-91. doi: 10.19678/j.issn.1000-3428.0063407

• Artificial Intelligence and Pattern Recognition •

Chinese-Vietnamese Cross-Lingual Word Embedding Combined with Word Cluster Constraints

WU Zhaoyuan1,2, YU Zhengtao1,2, HUANG Yuxin1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming 650500, China
  • Received: 2021-11-30 Revised: 2022-01-27 Published: 2022-01-28
  • About the authors: WU Zhaoyuan (born 1997), male, master's student, whose main research interests are natural language processing and intelligent information processing; YU Zhengtao (corresponding author), professor, Ph.D., doctoral supervisor; HUANG Yuxin, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61732005, U21B2027, 61972186, 61866020, 61866019); Yunnan Major Science and Technology Special Project (202002AD080001, 202103AA080015); Yunnan High-Tech Industry Special Project (201606).

Abstract: Traditional cross-lingual word-embedding methods align poorly on low-resource language pairs that differ greatly, such as Chinese and Vietnamese. To address this problem, this paper proposes a Chinese-Vietnamese cross-lingual word-embedding method that incorporates word cluster alignment constraints. First, Chinese and Vietnamese monolingual word embeddings are trained on independent monolingual corpora. Next, three types of association (synonyms, same-category words, and same-topic words) are used to mine the word cluster alignment information in the bilingual dictionary, and this information is integrated into the training of the mapping matrix so that the matrix learns common features and mapping relationships shared by similar words in the two languages. The monolingual word embeddings of both languages are then mapped into a shared space through the cross-lingual mapping, so that Chinese and Vietnamese word embeddings with the same meaning lie close to each other. Finally, cosine similarity is used to find the Vietnamese translation of each unlabeled Chinese word in the shared space, and Chinese-Vietnamese aligned word pairs are constructed to realize the cross-lingual word embedding. Experimental results show that, compared with the traditional supervised and unsupervised cross-lingual word-embedding methods Multi_w2v, Orthogonal, VecMap, and Muse, the proposed method improves the generalization of the mapping matrix to unlabeled words and alleviates the poor alignment seen in the low-resource Chinese-Vietnamese setting. On the Chinese-Vietnamese bilingual dictionary induction task, its alignment accuracy on P@1 and P@5 is 2.2 percentage points higher than that of the best baseline model.

Key words: Chinese-Vietnamese bilingual, low-resource language, cross-lingual word embedding, word cluster alignment, multi-granularity constraints
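
The retrieval and evaluation steps summarized in the abstract (map a Chinese word embedding into the shared space, rank Vietnamese words by cosine similarity, and score with P@k) can be illustrated with a minimal sketch. This is not the authors' implementation: the mapping matrix `W` is assumed to be already learned and is simply hard-coded here, the word cluster constraints are not modeled, and the function names, the two-dimensional toy vectors, and the tiny gold dictionary are all hypothetical.

```python
import math

def apply_mapping(W, x):
    """Map a source-language vector x into the shared space: y = W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_translations(src_vec, tgt_emb, W, k):
    """Return the k target words closest to the mapped source vector."""
    mapped = apply_mapping(W, src_vec)
    ranked = sorted(tgt_emb, key=lambda w: cosine(mapped, tgt_emb[w]), reverse=True)
    return ranked[:k]

def precision_at_k(src_emb, tgt_emb, W, gold, k):
    """Fraction of source words with a gold translation among the top k."""
    hits = sum(
        1
        for src_word, gold_tgts in gold.items()
        if set(top_k_translations(src_emb[src_word], tgt_emb, W, k)) & gold_tgts
    )
    return hits / len(gold)

# Hypothetical toy embeddings (2-D for readability, not real data).
zh_emb = {"猫": [1.0, 0.0], "狗": [0.0, 1.0]}
vi_emb = {"mèo": [0.0, 1.0], "chó": [-0.9, 0.3], "sói": [-1.0, 0.1]}

# Assumed, already-learned mapping: a 90-degree rotation (an orthogonal matrix).
W = [[0.0, -1.0],
     [1.0, 0.0]]

# Hypothetical gold dictionary used only for evaluation.
gold = {"猫": {"mèo"}, "狗": {"chó"}}

p_at_1 = precision_at_k(zh_emb, vi_emb, W, gold, 1)
p_at_2 = precision_at_k(zh_emb, vi_emb, W, gold, 2)
print(p_at_1, p_at_2)
```

In this toy setup the distractor "sói" ranks above "chó" for "狗", so the correct translation is missed at k = 1 but recovered at k = 2. That is the kind of gap between P@1 and P@5 that the bilingual dictionary induction evaluation measures.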

CLC Number: