Chinese Braille Word Segmentation System Based on BERT

doi:10.19678/j.issn.1000-3428.0252691

Abstract

Abstract: Chinese Braille is the script used by people with visual impairments in China, and it is an important part of the National Commonly-Used Language and Script. Although several methods have been developed for automatically translating Chinese text into Braille, shortcomings remain. Braille word segmentation is a crucial step in Chinese-Braille translation that strongly affects the final translation result, and it is also an important task in Braille informatization research. Pre-trained models are widely used in Chinese natural language processing, but they are still rarely applied to Braille informatization. Braille and Chinese characters express the same language in different writing systems, so the two share similarities and knowledge can be transferred between them; pre-trained models therefore have considerable potential in this field. This paper introduces the BERT pre-trained model into the Braille word segmentation task. BERT is used to extract feature vectors, which are decoded by a CRF in combination with a whole-word masking strategy, yielding BERT-CRF-wwm, a word segmentation model with an encoder-decoder structure. To address the possibility that the Chinese word segmentation information already present in BERT interferes with Braille word segmentation, a new Braille embedding is concatenated at the embedding layer, giving the final BeBERT-CRF-wwm model. On the Chinese-Braille Corpus, the model achieves a precision of 98.80% and a recall of 98.71%, outperforming existing Braille word segmentation methods on all evaluation metrics.
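For readers unfamiliar with how precision and recall are usually computed for word segmentation, the following is a minimal sketch. It assumes the common word-level definition, in which a predicted word counts as correct only if both of its boundaries match the gold segmentation; the abstract does not state the exact evaluation protocol, so this may differ from the one used in the paper. The function names and the placeholder strings are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall(gold: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall of a predicted segmentation.

    Assumption: a predicted word is correct only when its start and end
    positions both coincide with a word in the gold segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


# Placeholder text (not real Braille data): the gold segmentation has three
# words, and the prediction wrongly merges the last two.
gold = ["ab", "c", "de"]
predicted = ["ab", "cde"]
p, r = precision_recall(gold, predicted)
```

In this toy case only one of the two predicted words matches a gold word, and only one of the three gold words is recovered, so under-segmentation lowers recall more than precision.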

摘要： 汉语盲文是我国盲人使用的文字，是国家语言文字的重要组成部分。目前，虽然已经有一些技术实现了中文到盲文的自动转换，但是仍然存在不足。盲文分词是汉盲转换中的重要一环，对汉盲转换效果影响显著，也是盲文信息化研究的一项重要任务。尽管预训练模型在中文自然语言处理领域已经被广泛采用，但目前在盲文信息化领域的应用仍较为有限。盲文与汉字属于同一语言在不同文字体系下的表现形式，两者之间存在相似性和可迁移性，预训练模型在盲文信息化领域具有良好发展空间。本文将BERT预训练模型引入盲文分词任务，利用BERT模型提取特征，结合全词掩码策略和CRF解码器，实现了BERT-CRF-wwm编码器-解码器结构的分词模型。针对BERT模型原有的汉语分词信息可能干扰盲文分词的问题，在嵌入层引入一种新的盲文特征嵌入，最终形成BeBERT-CRF-wwm模型。通过在汉语盲文语料库上进行训练，最终达到了98.80%的精确率和98.71%的召回率，与现有的盲文分词方法进行对比，在各项评估指标上都达到了更好的效果。

Wang Ruixuan, Li Yan, Zhong Jinghua, Yao Dengfeng, Xu Cheng, Ren Tianyu. Chinese Braille Word Segmentation System Based on BERT[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252691.

汪睿璇, 李妍, 钟经华, 姚登峰, 徐成, 任天宇. 基于BERT的汉语盲文分词方法[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0252691.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0252691

