Computer Engineering ›› 2014, Vol. 40 ›› Issue (12): 141-145. doi: 10.3969/j.issn.1000-3428.2014.12.026

• Artificial Intelligence and Recognition Technology •

“v+n+n” Format Disambiguation in Kazakh

HU Bingxin (1a,2,3), Gulia·Altenbek (1a,2,3), QI Hui (1b)

  1. Xinjiang University, Urumqi 830046, China: a. College of Information Science and Engineering; b. College of Humanity;
    2. The Base of Kazakh and Kirghiz Language, Minority Languages Branch Center, National Language Resource Monitoring and Research Center, Urumqi 830046, China;
    3. Multi-lingual Information Technology Laboratory of Xinjiang, Urumqi 830046, China
  • Received: 2014-02-10 Revised: 2014-03-09 Online: 2014-12-15 Published: 2015-01-16
  • About the authors: HU Bingxin (born 1989), female, master's degree, main research interest: natural language processing; Gulia·Altenbek (corresponding author), professor and doctoral supervisor; QI Hui, bachelor's degree.
  • Funding:
    National Natural Science Foundation of China (61063025).

Abstract: By studying a large number of ambiguous phrase instances, this paper analyzes the ambiguity in determining phrase structure boundaries that arises during computer processing. For the common ambiguous format “v+n+n”, it uses a conditional random field model for disambiguation. Drawing on the characteristics of the Kazakh language, it proposes a method that constructs feature templates from the category and position information of Kazakh suffixes. With 30 days of data from the 2008 Xinjiang Daily (Kazakh Language Version) as the experimental corpus, adding the disambiguation strategy brings the recognition precision for noun phrases and verb phrases to 87.23% and 97.46%, and the recall to 80.12% and 95.80%, respectively. Experimental results show that introducing the extracted features into the conditional random field model improves the precision, recall, and F value of the system.

Key words: Kazakh, natural language processing, ambiguity, affix, conditional random field model, template
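
The abstract reports precision and recall but not the F value itself. As a minimal sketch, assuming the F value is the standard balanced F1 (the harmonic mean of precision and recall), the two reported pairs can be combined as follows. The function name `f_measure` and the `beta` parameter are illustrative choices, not taken from the paper, and the paper's own F values may be computed or rounded differently.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (same unit in, same unit out).

    beta = 1.0 gives the balanced F1; this default is an assumption,
    since the abstract does not state which F definition is used.
    """
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported = {
    "noun phrase": (87.23, 80.12),
    "verb phrase": (97.46, 95.80),
}

# Derived values only; these are not figures quoted from the paper.
f_values = {name: f_measure(p, r) for name, (p, r) in reported.items()}

for name, f in f_values.items():
    print(f"{name}: F1 = {f:.2f}%")
```

Under that assumption the sketch gives about 83.52% for noun phrases and 96.62% for verb phrases. These are derived from the abstract's figures, not results stated by the authors.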

CLC Number: