
Computer Engineering, 2010, Vol. 36, Issue 24: 59-61. doi: 10.3969/j.issn.1000-3428.2010.24.021

• Software Technology and Database •

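Study on Frequency Statistic of Kazak Word Based on Corpus

WANG Hua, GULILA Altenbek

  1. (College of Information Science & Engineering, Xinjiang University, Urumqi 830046, China)
  • Online: 2010-12-20 Published: 2010-12-14
  • About the authors: WANG Hua (born 1978), female, master's degree, main research interests: software and theory, natural language processing; GULILA Altenbek, professor
  • Supported by:

    National Natural Science Foundation of China project "Research on Construction Technology for a Modern Kazak Word-Level Text Corpus" (60763005); Research Fund of the Ministry of Education and the State Language Commission for the Standardization and Informatization of Minority Languages and Scripts (MZ115-92)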

Abstract:

Kazak is one of the minority languages of Xinjiang, and word frequency statistics, a foundational task in natural language processing, is a problem that urgently needs to be solved for it. This paper introduces Zipf's law and its relation to Kazak word frequency statistics. A continuous input Kazak character string is segmented, and the segmented word strings are then used to build a Kazak dictionary. The dictionary stores the distinct Kazak word forms together with the frequency with which each occurs. Statistical experiments on Kazak text show that Kazak word frequencies are intrinsically related, and confirm that the Kazak word frequency distribution follows Zipf's power law.

Key words: Kazak word frequency statistics, power law, Zipf, frequency
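The pipeline the abstract describes (segment text into words, store each distinct word form with its frequency, then check the rank-frequency relation) can be illustrated with a minimal sketch. This is not the authors' system. The function names `word_frequencies` and `zipf_slope` are hypothetical, the regular-expression tokenizer is an assumed stand-in for the paper's segmentation step, and the least-squares fit is one common way to test a power law, not necessarily the one used in the paper.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Build a frequency dictionary of distinct word forms.

    Assumption: words are runs of word characters separated by
    whitespace or punctuation. The paper's own segmentation step
    is not specified here and may differ.
    """
    tokens = re.findall(r"\w+", text)
    return Counter(tokens)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1 for the most frequent word form.
    """
    counts = sorted(freqs.values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct word forms")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Zipf's law in its simplest form states that the frequency of the word of rank r is roughly proportional to 1/r, so a fitted slope close to -1 on a large corpus would be consistent with it. On small samples the estimate is noisy and should be read with caution.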

CLC number: