Computer Engineering ›› 2012, Vol. 38 ›› Issue (17): 46-48. doi: 10.3969/j.issn.1000-3428.2012.17.013

• Software Technology and Database •

Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation

Yangmo Droma 1,2, GAO Ding-guo 1

  (1. School of Engineering, Tibet University, Lhasa 850000, China; 2. Normal College for Nationalities, Qinghai Normal University, Hainan Tibetan Autonomous Prefecture, Qinghai 813000, China)
  • Received: 2011-10-28 Revised: 2011-12-20 Online: 2012-09-05 Published: 2012-09-03
  • About the authors: Yangmo Droma (born 1978), female, lecturer and master's student, main research area: Tibetan information processing; GAO Ding-guo, associate professor
  • Supported by:
    National Natural Science Foundation of China project "Study on the Formatting of Basic Tibetan Sentence Patterns Based on Function Words" (6106315)

Abstract: Trailing components (suffix-like elements attached after a word) occur with high frequency in Tibetan. When the suffix of an unknown word is split off as a separate token during segmentation, segmentation accuracy drops. To address this, a word (morpheme) + affix merging method is applied: each Tibetan trailing component is merged with the preceding word (morpheme) and the pair is output as a single segmentation unit. Tibetan text also contains many unknown words, such as personal names, place names, and organization names, which are not in the dictionary and are therefore cut into fragments during segmentation. For this problem, a segmentation fragment integration method is used: fragments of an entry that occur repeatedly are merged and output as a single segmentation unit. Experimental results show that the two methods improve the recognition accuracy of automatic Tibetan word segmentation.

Key words: Tibetan information processing, affix merging, unknown word, word segmentation fragment integration
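
The two post-processing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the suffix inventory, the definition of a fragment, or the repetition threshold. The function names `merge_suffixes` and `integrate_fragments`, the romanized placeholder suffix set, the rule that a fragment run is two or more adjacent out-of-lexicon tokens, and the `min_count` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical suffix inventory (romanized placeholders for illustration only);
# the paper's actual list of Tibetan trailing components is not in the abstract.
SUFFIXES = {"pa", "po", "ba", "ma"}


def merge_suffixes(tokens, suffixes=SUFFIXES):
    """Attach each standalone suffix token to the token that precedes it."""
    merged = []
    for tok in tokens:
        if tok in suffixes and merged:
            merged[-1] += tok  # word (morpheme) + affix becomes one unit
        else:
            merged.append(tok)  # a leading suffix has nothing to attach to
    return merged


def integrate_fragments(tokens, lexicon, min_count=2):
    """Merge runs of adjacent out-of-lexicon tokens that recur in the text.

    Assumption: a "fragment run" is a maximal run of two or more adjacent
    tokens that are not in the lexicon, and it is merged only if the same
    run occurs at least `min_count` times in `tokens`.
    """
    # Collect (start, end) spans of maximal out-of-lexicon runs of length >= 2.
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] not in lexicon:
            j += 1
        if j - i >= 2:
            spans.append((i, j))
        i = max(j, i + 1)

    # Count how often each distinct run occurs.
    counts = Counter(tuple(tokens[a:b]) for a, b in spans)

    # Rebuild the token list, merging only the runs that recur often enough.
    out, pos = [], 0
    for a, b in spans:
        out.extend(tokens[pos:a])
        if counts[tuple(tokens[a:b])] >= min_count:
            out.append("".join(tokens[a:b]))
        else:
            out.extend(tokens[a:b])
        pos = b
    out.extend(tokens[pos:])
    return out
```

In this sketch, a run of fragments that appears only once is left split, which reflects the abstract's condition that only repeatedly occurring fragments are integrated. Real Tibetan text would also need the syllable delimiter (tsheg) handled when tokens are joined, which the sketch omits.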
