Abstract:
Kazak is one of the minority languages and scripts widely used in Xinjiang, and word frequency statistics, a foundational task in Kazak natural language processing, has become a problem that urgently needs solving. This paper describes the relationship between Zipf's law and Kazak word frequency statistics. The system segments a continuous input string of Kazak characters and outputs the segmented word string. The segmented output usually consists of two-word Kazak strings, from which a dictionary is built. The dictionary stores the Kazak word forms together with the frequency with which each appears, and a statistical experiment is carried out on Kazak text. The experimental results show how the Kazak word frequencies are related to one another, and the resulting word frequency distribution follows Zipf's power law.
Key words:
Kazak word frequency statistics,
power law,
Zipf,
frequency
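Abstract (Chinese version, translated):
Kazak is one of the minority languages of Xinjiang, and its word frequency statistics, as a foundational topic in natural language processing, has become a problem that urgently needs solving. On this basis, the paper introduces the connection between Zipf's law and Kazak word frequency statistics. A continuous input string of Kazak characters is segmented, the segmented Kazak word string is output, and a Kazak dictionary is obtained from it. The dictionary stores Kazak word groups with different word forms together with the frequency with which they occur, and a statistical experiment on Kazak is carried out. The results indicate that Kazak word frequencies are internally related, and they confirm that Kazak word frequencies follow Zipf's power law.
Key words (Chinese version, translated):
Kazak word frequency statistics,
power law,
Zipf,
frequency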
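As a rough illustration of the kind of check the abstract describes, the sketch below counts word frequencies, ranks them, and fits the Zipf exponent s in f(r) ≈ C / r^s by least squares on the log-log rank-frequency data. It is not the authors' system: the whitespace tokenizer, the function names `rank_frequency` and `fit_zipf_exponent`, and the toy text are all assumptions made for this example, and real Kazak text would first need the segmentation step the paper refers to.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent."""
    # Assumption: words are separated by whitespace. The paper's own
    # segmentation system is not reproduced here.
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


def fit_zipf_exponent(frequencies):
    """Least-squares fit of log f = log C - s * log r; returns (s, C)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Hypothetical toy text: placeholder word number r occurs 120 // r times,
# so the frequencies are close to an ideal Zipf distribution with s = 1.
toy_text = " ".join(f"w{r} " * (120 // r) for r in range(1, 11))
freqs = rank_frequency(toy_text)
s, c = fit_zipf_exponent(freqs)
print(f"fitted exponent s = {s:.3f}, constant C = {c:.1f}")
```

A fitted exponent near 1 on a real corpus would be consistent with the power-law behaviour the abstract reports, though a straight-line fit on log-log axes is only a first check and not a rigorous test of a power law.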
WANG Hua, GU Li-La·A-Dong-Bie-Ke. Study on Frequency Statistic of Kazak Word Based on Corpus[J]. Computer Engineering, 2010, 36(24): 59-61.
王花, 古丽拉阿东别克. 基于语料的哈萨克语词频统计研究[J]. 计算机工程, 2010, 36(24): 59-61.