Abstract:
Drawing on a large number of phrases that exhibit structural ambiguity, this paper analyzes the ambiguity that arises when a computer determines phrase structure boundaries. For the most common ambiguous pattern, “v+n+n”, it applies a conditional random field (CRF) model for disambiguation. Taking the characteristics of the Kazakh language into account, it proposes constructing feature templates from the category and position information of Kazakh suffixes. The experimental corpus consists of 30 days of the Xinjiang Daily (Kazakh Language Version) from 2008. With the disambiguation strategy, recognition precision reaches 87.23% for noun phrases and 97.46% for verb phrases, and recall reaches 80.12% and 95.80%, respectively. The experimental results show that introducing the proposed features into the CRF improves the system's precision, recall, and F value.
Key words:
Kazakh,
natural language processing,
ambiguity,
affix,
conditional random field model,
template
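The feature-template idea summarized in the abstract (features built from suffix category and position) can be illustrated with a minimal sketch. It assumes each token already carries a part-of-speech tag and a coarse suffix category; the function names, the `PAD` marker, the window size, and the category labels (`NONE`, `GEN`, `POSS`) are hypothetical placeholders, not the paper's actual templates or tag set.

```python
from typing import Dict, List

# Hypothetical token representation: surface form, POS tag, and a coarse
# suffix category. All labels here are illustrative placeholders.
Token = Dict[str, str]

PAD = "PAD"  # value used for positions outside the sentence


def _get(sent: List[Token], j: int, key: str) -> str:
    """Return sent[j][key], or PAD when j falls outside the sentence."""
    return sent[j][key] if 0 <= j < len(sent) else PAD


def token_features(sent: List[Token], i: int, window: int = 1) -> Dict[str, str]:
    """Build features for token i from POS and suffix category at each
    relative position in [-window, +window]."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        feats[f"pos[{offset:+d}]"] = _get(sent, i + offset, "pos")
        feats[f"suffix_cat[{offset:+d}]"] = _get(sent, i + offset, "suffix_cat")
    # Conjunction of the current and next suffix categories, intended to
    # capture how adjacent nouns in a "v+n+n" sequence are marked.
    feats["suffix_cat[+0]|suffix_cat[+1]"] = (
        _get(sent, i, "suffix_cat") + "|" + _get(sent, i + 1, "suffix_cat")
    )
    return feats


def sentence_features(sent: List[Token]) -> List[Dict[str, str]]:
    """Feature dicts for every token in the sentence."""
    return [token_features(sent, i) for i in range(len(sent))]


# A placeholder "v+n+n" sequence (w1, w2, w3 stand in for real words).
example = [
    {"word": "w1", "pos": "v", "suffix_cat": "NONE"},
    {"word": "w2", "pos": "n", "suffix_cat": "GEN"},
    {"word": "w3", "pos": "n", "suffix_cat": "POSS"},
]

for feats in sentence_features(example):
    print(feats)
```

Per-token feature dictionaries of this shape are a common input format for CRF toolkits; how the paper's own templates are encoded is not stated in the abstract.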
HU Bingxin, Gulia·Altenbek, QI Hui. “v+n+n” Format Disambiguation in Kazakh[J]. Computer Engineering, 2014, 40(12): 141-145.
户冰心,古丽拉·阿东别克,祁卉. 哈萨克语“v+n+n”格式的歧义消解[J]. 计算机工程, 2014, 40(12): 141-145.