摘要:
利用K-最近距离算法对哈萨克语文本进行分类,通过统计词频信息和语言信息相结合的方法选择特征,实现一个哈萨克语文本分类系统。在计算特征权重值时不仅考虑词频,还利用特征的集中度、分散度,经过训练和统计对每一类哈萨克语文本形成特征的权重向量,根据K-最近距离算法判断测试文本的所属类别,实验结果表明该方法可行。
关键词:
文本分类,
K-最近距离,
集中度,
分散度
Abstract:
The K-nearest-neighbor (KNN) algorithm is used to classify Kazakh text, and a Kazakh text categorization system is implemented. Features are selected by combining word-frequency statistics from the training corpus with linguistic information. The weight of each feature is computed from three quantities: word frequency, centralized degree (concentration), and decentralized degree (dispersion). Training yields a feature weight vector for each category of Kazakh text, and the KNN algorithm then determines the category of a test document. Experimental results show that the method is feasible.
Key words:
text categorization,
K-nearest-neighbor,
centralized degree,
decentralized degree
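
The abstract does not give the weighting formulas, so the following is only a minimal sketch of how such a pipeline might look, not the authors' implementation. It assumes that the centralized degree of a term in a category is the share of the term's occurrences that fall in that category, that the decentralized degree is the share of the category's documents containing the term, and that a training document's feature weight is the product of word frequency and these two factors. The function names, the cosine similarity measure, the majority vote, and the English placeholder tokens are all hypothetical; real Kazakh input would first need tokenization and stemming.

```python
import math
from collections import Counter, defaultdict


def class_term_factors(train):
    """Return {label: {term: concentration * dispersion}} (assumed definitions)."""
    tf_by_class = defaultdict(Counter)  # term occurrences per category
    df_by_class = defaultdict(Counter)  # documents containing the term, per category
    docs_per_class = Counter()
    for tokens, label in train:
        tf_by_class[label].update(tokens)
        df_by_class[label].update(set(tokens))
        docs_per_class[label] += 1

    total_tf = Counter()
    for counts in tf_by_class.values():
        total_tf.update(counts)

    factors = {}
    for label, counts in tf_by_class.items():
        factors[label] = {
            # concentration: share of the term's occurrences inside this category
            # dispersion: share of this category's documents that contain the term
            term: (n / total_tf[term])
            * (df_by_class[label][term] / docs_per_class[label])
            for term, n in counts.items()
        }
    return factors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(train, test_tokens, k=3):
    """Label a test document by majority vote among its k most similar training documents."""
    factors = class_term_factors(train)
    vectors = []
    for tokens, label in train:
        tf = Counter(tokens)
        weighted = {t: n * factors[label][t] for t, n in tf.items()}
        vectors.append((weighted, label))

    # The category of the test document is unknown, so it keeps raw term frequencies.
    query = dict(Counter(test_tokens))
    nearest = sorted(vectors, key=lambda pair: cosine(query, pair[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy corpus with hypothetical English placeholder tokens.
train = [
    (["goal", "match", "team"], "sport"),
    (["team", "player", "goal"], "sport"),
    (["bank", "market", "price"], "economy"),
    (["price", "trade", "bank"], "economy"),
]
print(knn_classify(train, ["goal", "team", "score"], k=3))
```

Under these assumed definitions, a term that occurs in only one category and in most of that category's documents receives the largest factor, which matches the stated aim of weighting features by more than raw word frequency.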
玛依来.哈帕尔, 古丽拉.阿东别克. 哈萨克语文本分类系统的设计与实现[J]. 计算机工程, 2011, 37(5): 196-198.
MA Yi-Lai·Ha-Mo-Er, GU Li-La·A-Dong-Bie-Ke. Design and Implementation of Kazakh Text Categorization System[J]. Computer Engineering, 2011, 37(5): 196-198.