Abstract: Trailing elements (suffix-like components) occur with high frequency in Tibetan. When the suffix of an unknown (out-of-vocabulary) word is split off as a separate token during word segmentation, segmentation accuracy suffers. To address this, a word (morpheme) + suffix merging method is adopted: each Tibetan trailing element is merged with the preceding word (morpheme) and the pair is output as a single segmentation unit. Tibetan text also contains many unknown words, such as personal names, place names, and organization names, which are cut into fragments during segmentation. For these, a segmentation fragment integration method is used: fragments of an entry that occur repeatedly are combined and output as a single segmentation unit. Experimental results show that the two methods improve the recognition accuracy of automatic Tibetan word segmentation.
Key words:
Tibetan information processing,
affix merging,
unknown word,
word segmentation fragment integration
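The two post-processing steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `suffixes` and `lexicon` sets, the `min_count` threshold, and the rule that a "fragment" is any token missing from the lexicon are all assumptions made for illustration, and the tokens are Latin placeholders rather than real Tibetan syllables.

```python
from collections import Counter


def merge_suffixes(tokens, suffixes, joiner=""):
    """Attach each suffix token to the token before it (word + suffix merging).

    `suffixes` is a hypothetical set of trailing elements; a suffix at the
    very start of the sequence has nothing to attach to and is kept as-is.
    """
    merged = []
    for tok in tokens:
        if tok in suffixes and merged:
            merged[-1] = merged[-1] + joiner + tok
        else:
            merged.append(tok)
    return merged


def _fragment_runs(tokens, lexicon):
    """Yield (start, end) index pairs of maximal runs of 2+ out-of-lexicon tokens."""
    start = None
    # The trailing None is a sentinel that closes a run ending at the last token.
    for i, tok in enumerate(list(tokens) + [None]):
        is_fragment = tok is not None and tok not in lexicon
        if is_fragment and start is None:
            start = i
        elif not is_fragment and start is not None:
            if i - start >= 2:
                yield start, i
            start = None


def integrate_fragments(tokens, lexicon, min_count=2, joiner=""):
    """Merge a run of fragments into one unit if the same run occurs repeatedly.

    Assumption: a run counts as an unknown word once it appears at least
    `min_count` times in the token sequence; rarer runs are left untouched.
    """
    runs = list(_fragment_runs(tokens, lexicon))
    counts = Counter(tuple(tokens[s:e]) for s, e in runs)
    out, pos = [], 0
    for s, e in runs:
        out.extend(tokens[pos:s])
        if counts[tuple(tokens[s:e])] >= min_count:
            out.append(joiner.join(tokens[s:e]))
        else:
            out.extend(tokens[s:e])
        pos = e
    out.extend(tokens[pos:])
    return out
```

In this sketch, `merge_suffixes` would run on the raw segmenter output first, and `integrate_fragments` would then be applied over a whole document so that repeated fragment runs can be counted. The frequency threshold is a design choice of the sketch, not a value taken from the paper.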
YANG Mao-Zhuo-Ma, GAO Ding-Guo. Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation[J]. Computer Engineering, 2012, 38(17): 46-48.