
Computer Engineering

   

Chinese Braille Word Segmentation System Based on BERT

  

  • Published: 2025-10-17

基于BERT的汉语盲文分词方法

Abstract: Chinese Braille is the script used by people with visual impairment in China and is an important part of the National Commonly-Used Language and Script. Although several methods have been developed for automatically translating Chinese text into Braille, they still have shortcomings. Braille word segmentation is a crucial step in Chinese-Braille translation that strongly affects the final translation result, and it is also an important task in Braille informatization research. Pre-trained models are widely used in Chinese natural language processing but are still rarely applied to Braille informatization. Braille and Chinese characters express the same language in different writing systems, so the two share similarities and knowledge can transfer between them; pre-trained models therefore have considerable potential in this field. This paper introduces the BERT pre-trained model into the Braille word segmentation task. BERT is used to extract feature vectors, which are decoded by a CRF, and combined with the whole-word masking strategy this yields BERT-CRF-wwm, a word segmentation model with an encoder-decoder structure. Because the Chinese word segmentation information already present in the BERT model may interfere with Braille word segmentation, a new Braille feature embedding is concatenated at the embedding layer, giving the final BeBERT-CRF-wwm model. Trained on the Chinese-Braille Corpus, the model achieves a precision of 98.80% and a recall of 98.71%, outperforming existing Braille word segmentation methods on all evaluation metrics.

摘要: 汉语盲文是我国盲人使用的文字,是国家语言文字的重要组成部分。目前,虽然已经有一些技术实现了中文到盲文的自动转换,但是仍然存在不足。盲文分词是汉盲转换中的重要一环,对汉盲转换效果影响显著,也是盲文信息化研究的一项重要任务。尽管预训练模型在中文自然语言处理领域已经被广泛采用,但目前在盲文信息化领域的应用仍较为有限。盲文与汉字属于同一语言在不同文字体系下的表现形式,两者之间存在相似性和可迁移性,预训练模型在盲文信息化领域具有良好发展空间。本文将BERT预训练模型引入盲文分词任务,利用BERT模型提取特征,结合全词掩码策略和CRF解码器,实现了BERT-CRF-wwm编码器-解码器结构的分词模型。针对BERT模型原有的汉语分词信息可能干扰盲文分词的问题,在嵌入层引入一种新的盲文特征嵌入,最终形成BeBERT-CRF-wwm模型。通过在汉语盲文语料库上进行训练,最终达到了98.80%的精确率和98.71%的召回率,与现有的盲文分词方法进行对比,在各项评估指标上都达到了更好的效果。
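To make the reported precision and recall concrete, the sketch below shows one common way such figures are computed for a sequence-labelling segmenter. It is an illustration, not the paper's evaluation code. It assumes that the CRF emits one B/M/E/S tag per character (begin, middle, end, single), that the tag sequences are well formed, and that a predicted word counts as correct only when both of its boundaries match the reference. The abstract does not state the tag scheme or the exact scoring rule, and the function names `tags_to_spans` and `precision_recall` are hypothetical.

```python
from typing import List, Set, Tuple


def tags_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Convert a B/M/E/S tag sequence into (start, end) word spans, end exclusive.

    Assumes a well-formed sequence; the B/M/E/S scheme itself is an
    assumption, not something stated in the abstract.
    """
    spans = set()
    start = 0
    for i, tag in enumerate(tags):
        if tag in ("B", "S"):
            start = i              # a word begins at this position
        if tag in ("E", "S"):
            spans.add((start, i + 1))  # a word ends at this position
    return spans


def precision_recall(gold_tags: List[str], pred_tags: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall: a word is correct only if both
    of its boundaries match the reference segmentation."""
    gold = tags_to_spans(gold_tags)
    pred = tags_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example: the reference has 3 words, the prediction has 4, and 2 match.
gold = list("BESBME")
pred = list("BESSBE")
p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the prediction splits the last three-character word into two, so precision is 2/4 and recall is 2/3. Counting whole words rather than individual tags penalises every boundary error, which is why word-level scores are usually lower than per-character tag accuracy.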