
Computer Engineering, 2025, Vol. 51, Issue (5): 93-102. doi: 10.19678/j.issn.1000-3428.0069492

• Artificial Intelligence and Pattern Recognition •

Vietnamese Text Error Detection Method for Multi-source Text

ZHUANG Ziwei1,2, ZHU Junguo1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China;
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
  • Received: 2024-03-06 Revised: 2024-04-14 Online: 2025-05-15 Published: 2025-05-10
  • Corresponding author: ZHU Junguo, E-mail: jg.zhu.hit@qq.com
  • Funding:
    General Program of the Yunnan Provincial Department of Science and Technology (202101AT070077).

Abstract: Text error detection is a research direction in natural language processing that aims to automatically identify the location and type of erroneous words in input text. The task is widely used in downstream text processing and touches many aspects of daily life. Error detection models for English and Chinese already achieve high accuracy, but Vietnamese corpus resources are scarce and manually labeled data are insufficient, so Vietnamese error detection suffers from both a shortage and a low quality of training samples. In addition, text from different source scenes contains different error types, and the number of examples per error type is unbalanced. As a result, general-purpose error detection models fail to learn how to detect specific error types, and their detection ability is weak. To address these problems, this study first proposes a method for constructing a Vietnamese text error detection corpus for multi-source text. Datasets for Vietnamese Optical Character Recognition (OCR), Vietnamese speech recognition, and Vietnamese-English translation are used to build an initial corpus; a multi-source Vietnamese error corpus generation method then produces the erroneous text; and an automatic labeling algorithm yields labeled training data. Second, the study proposes a Vietnamese text error detection sequence labeling model that incorporates multi-source information features. Scene features are integrated at the encoder of multilingual Bidirectional Encoder Representations from Transformers (BERT), so that the model can adapt to the error types of the scene from which the current input text originates. Experimental results show that, compared with the baseline model, the proposed method improves the F0.5 and F1 values by 1.91 and 1.80 percentage points, respectively. The experiments also confirm the necessity of each model component and the effectiveness of the dataset construction method.
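The abstract does not describe the automatic labeling algorithm in detail. As an illustration only, the following minimal Python sketch shows one common way to derive token-level labels: align a correct sentence with its erroneous counterpart and tag each token of the erroneous sentence. The function name `label_tokens` and the tag names `KEEP`, `REPLACE`, `INSERT`, and `MISSING` are assumptions made for this sketch, not the paper's label set or algorithm.

```python
from difflib import SequenceMatcher


def label_tokens(correct: str, noisy: str) -> list[tuple[str, str]]:
    """Tag each whitespace token of `noisy` by aligning it with `correct`.

    Hypothetical tag set (not taken from the paper):
      KEEP    - token matches the correct sentence
      REPLACE - token was substituted for a different correct token
      INSERT  - extra token that is absent from the correct sentence
      MISSING - a correct token was dropped next to this token
    """
    ref, hyp = correct.split(), noisy.split()
    tags = ["KEEP"] * len(hyp)
    # autojunk=False disables the popularity heuristic, which is not
    # wanted when aligning short token sequences.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            for j in range(j1, j2):
                tags[j] = "REPLACE"
        elif op == "insert":
            for j in range(j1, j2):
                tags[j] = "INSERT"
        elif op == "delete" and hyp:
            # Nothing in `noisy` corresponds to the dropped tokens, so the
            # nearest surviving token carries the label.
            j = min(j1, len(hyp) - 1)
            if tags[j] == "KEEP":
                tags[j] = "MISSING"
    return list(zip(hyp, tags))


if __name__ == "__main__":
    # A diacritic error of the kind OCR can produce ("đi" read as "di").
    for token, tag in label_tokens("tôi đi học hôm nay", "tôi di học hôm nay"):
        print(token, tag)
```

A real pipeline would likely need Vietnamese word segmentation and finer-grained, scene-specific error types; the whitespace split here is only a simplification.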

Key words: natural language processing, machine learning, deep learning, text error detection, Vietnamese
