Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation

Computer Engineering, 2012, Vol. 38, Issue (17): 46-48. doi: 10.3969/j.issn.1000-3428.2012.17.013

Yangmo Droma^1,2, GAO Ding-guo^1

(1. School of Engineering, Tibet University, Lhasa 850000, China; 2. Nationalities Normal College, Qinghai Normal University, Hainan Tibetan Autonomous Prefecture, Qinghai 813000, China)

Received: 2011-10-28 Revised: 2011-12-20 Online: 2012-09-05 Published: 2012-09-03
About the authors: Yangmo Droma (b. 1978), female, lecturer and master's student, main research interest: Tibetan information processing; GAO Ding-guo, associate professor
Supported by:
National Natural Science Foundation of China project "Research on the Formatting of Basic Tibetan Sentence Patterns Based on Function Words" (6106315)

Abstract

Trailing components (suffix-like elements) occur frequently in Tibetan. When the suffix of an unknown word is split off as a separate token during segmentation, segmentation accuracy suffers. To address this, a word (morpheme) + affix merging method is adopted: a Tibetan trailing component is merged with the preceding word (morpheme) and the two are output as a single segmentation unit. Tibetan text also contains many unknown words, such as personal names, place names, and organization names, which are not in the dictionary and are cut into fragments during segmentation. For these, a segmentation fragment integration method is used: fragment sequences that occur multiple times are integrated into a single segmentation unit. Experimental results show that both methods improve the recognition accuracy of automatic Tibetan word segmentation.

Key words: Tibetan information processing, affix merging, unknown word, word segmentation fragment integration

CLC number: TP391.1
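
The two methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `merge_affixes` and `integrate_fragments`, the `affixes` set, the `is_fragment` predicate, and the `min_count` and `max_len` thresholds are all assumptions made for illustration, and the tokens are Latin placeholders rather than real Tibetan syllables. The paper's actual affix inventory and merging rules are not given on this page.

```python
from collections import Counter


def merge_affixes(tokens, affixes):
    """Attach each trailing-component token to the token before it.

    `affixes` is a hypothetical set of suffix-like tokens; a real system
    would use a curated inventory of Tibetan trailing components.
    """
    merged = []
    for tok in tokens:
        if tok in affixes and merged:
            merged[-1] += tok      # word (morpheme) + affix -> one unit
        else:
            merged.append(tok)     # ordinary token, or affix with nothing before it
    return merged


def integrate_fragments(tokens, is_fragment, min_count=2, max_len=4):
    """Merge runs of fragment tokens that recur at least `min_count` times.

    `is_fragment` is a hypothetical predicate marking tokens left over as
    fragments (for example, single syllables not found in the dictionary).
    """
    n = len(tokens)
    counts = Counter()
    # Count every n-gram (length 2..max_len) made up only of fragments.
    for i in range(n):
        for length in range(2, max_len + 1):
            gram = tuple(tokens[i:i + length])
            if len(gram) < length or not all(is_fragment(t) for t in gram):
                break
            counts[gram] += 1

    out = []
    i = 0
    while i < n:
        # Prefer the longest recurring fragment sequence starting here.
        for length in range(max_len, 1, -1):
            gram = tuple(tokens[i:i + length])
            if len(gram) == length and counts.get(gram, 0) >= min_count:
                out.append("".join(gram))
                i += length
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Placeholder tokens: "s" stands in for a trailing component,
    # "x", "y", "z" for leftover fragments of unknown words.
    print(merge_affixes(["w1", "s", "w2", "w3", "s"], {"s"}))
    frags = {"x", "y", "z"}
    print(integrate_fragments(["x", "y", "went", "x", "y", "home", "z"],
                              lambda t: t in frags))
```

In this sketch, a fragment sequence is merged only when the same sequence recurs in the text, which mirrors the abstract's condition that fragments "occur multiple times"; a lone fragment such as `"z"` is left as it is. Plain string concatenation is used to join tokens, which is also an assumption, since real Tibetan output would need to respect the syllable delimiter.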

YANG Mao-Zhuo-Ma, GAO Ding-Guo. Study of Unknown Word Processing Method in Automatic Tibetan Word Segmentation[J]. Computer Engineering, 2012, 38(17): 46-48.

http://www.ecice06.com/CN/Y2012/V38/I17/46
