Chinese-Vietnamese Cross-Lingual Word-Embedding Combined with Word Cluster Constraints

doi:10.19678/j.issn.1000-3428.0063407

Abstract

Abstract: Traditional cross-lingual word-embedding methods align poorly on low-resource language pairs that differ substantially, such as Chinese-Vietnamese. To address this problem, this paper proposes a Chinese-Vietnamese cross-lingual word-embedding method that incorporates word cluster alignment constraints. First, Chinese and Vietnamese monolingual word embeddings are trained on independent monolingual corpora. Second, three types of association relationships (synonyms, words of the same category, and words on the same topic) are used to mine word cluster alignment information from the bilingual dictionary, and this information is integrated into the training of the mapping matrix so that the matrix learns common features and mapping relationships shared by similar words across the two languages. Third, the monolingual word embeddings of the two languages are mapped into a shared space through cross-lingual mapping, so that Chinese and Vietnamese word embeddings with the same meaning lie close to each other. Finally, cosine similarity is used to find the Vietnamese translation of each unlabeled Chinese word in the shared space, and the resulting Chinese-Vietnamese aligned word pairs realize the cross-lingual word embedding. Experimental results show that, compared with traditional supervised and unsupervised cross-lingual word-embedding methods such as Multi_w2v, Orthogonal, VecMap, and Muse, the proposed method improves the generalization of the mapping matrix to unlabeled words and mitigates the poor alignment of models in low-resource Chinese-Vietnamese settings. Its alignment accuracy on the Chinese-Vietnamese bilingual dictionary induction task, measured by P@1 and P@5, is 2.2 percentage points higher than that of the best baseline model.

Key words: Chinese-Vietnamese bilingual, low-resource language, cross-lingual word embedding, word cluster alignment, multi-granularity constraints
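
The pipeline summarized in the abstract (learn a mapping matrix from a seed dictionary, map into a shared space, retrieve translations by cosine similarity, score with P@k) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the training objective, so the mapping here is a plain orthogonal Procrustes solution, and `add_cluster_centroids` is a hypothetical stand-in that injects cluster-level information by appending one centroid pair per word cluster. All function names, the toy data, and the cluster lists are assumptions made for illustration.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, 1e-12)


def learn_orthogonal_map(src, tgt):
    """Solve min_W ||src @ W - tgt||_F with W orthogonal (Procrustes via SVD)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def add_cluster_centroids(src, tgt, clusters):
    """Hypothetical cluster constraint: append one centroid pair per cluster.

    `clusters` is a list of index lists into the rows of the seed dictionary.
    """
    src_c = np.array([src[idx].mean(axis=0) for idx in clusters])
    tgt_c = np.array([tgt[idx].mean(axis=0) for idx in clusters])
    return np.vstack([src, src_c]), np.vstack([tgt, tgt_c])


def precision_at_k(mapped_src, tgt_vocab, gold, k):
    """Share of source words whose gold target is among the top-k cosine neighbours."""
    sims = normalize_rows(mapped_src) @ normalize_rows(tgt_vocab).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [gold[i] in topk[i] for i in range(len(gold))]
    return float(np.mean(hits))


# Toy data (assumption): "Vietnamese" vectors are an exact rotation of the
# "Chinese" vectors, so a perfect orthogonal map exists.
rng = np.random.default_rng(0)
dim, n_words = 8, 50
zh = normalize_rows(rng.normal(size=(n_words, dim)))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
vi = zh @ true_rotation

# The first 20 word pairs act as the seed dictionary; two toy clusters within it.
seed = np.arange(20)
clusters = [list(range(0, 5)), list(range(5, 10))]
src_aug, tgt_aug = add_cluster_centroids(zh[seed], vi[seed], clusters)
w = learn_orthogonal_map(src_aug, tgt_aug)

# Evaluate on words that were not in the seed dictionary.
held_out = np.arange(20, n_words)
p1 = precision_at_k(zh[held_out] @ w, vi, held_out, k=1)
p5 = precision_at_k(zh[held_out] @ w, vi, held_out, k=5)
print(f"P@1={p1:.2f}, P@5={p5:.2f}")
```

Because the toy target space is an exact rotation of the source space, this sketch only demonstrates the mechanics of mapping and retrieval; it says nothing about how much cluster constraints help on real Chinese-Vietnamese embeddings.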

CLC Number: TP391

WU Zhaoyuan, YU Zhengtao, HUANG Yuxin. Chinese-Vietnamese Cross-Lingual Word-Embedding Combined with Word Cluster Constraints[J]. Computer Engineering, 2023, 49(1): 82-91.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0063407

http://www.ecice06.com/EN/Y2023/V49/I1/82
