Abstract: The Bisecting K-Means (BKM) clustering algorithm is slow at selecting the initial centroids for each bisecting step. To address this, the two points separated by the maximum distance are taken as the initial centroids of each bisection, which speeds up the algorithm. The paper also studies how to cluster quickly on a cluster system: based on the characteristics of BKM, the algorithm is parallelized using data parallelism and a uniform data-partitioning strategy. Experimental results show that the improved algorithm achieves a fairly good speedup and relatively high efficiency.
Key words: data mining, clustering algorithm, Bisecting K-Means (BKM), parallelization, cluster system
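The two ideas in the abstract, far-apart initial centroids for each bisection and a data-parallel assignment step over uniformly sized chunks, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the maximum-distance points are found, so `far_seeds` uses an assumed two-sweep heuristic that only approximates the farthest pair, and all function names, the thread pool, and the choice to split the cluster with the largest sum of squared errors are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(dist2(p, c) for p in points)


def far_seeds(points):
    """Assumed heuristic: farthest point from the mean, then farthest from it."""
    c = mean(points)
    a = max(points, key=lambda p: dist2(p, c))
    b = max(points, key=lambda p: dist2(p, a))
    return a, b


def uniform_chunks(points, parts):
    """Split into `parts` contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(points), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + q + (1 if i < r else 0)
        chunks.append(points[start:end])
        start = end
    return chunks


def partial_assign(args):
    """Worker step: assign the points of one chunk to the nearer centroid."""
    chunk, c0, c1 = args
    left, right = [], []
    for p in chunk:
        (left if dist2(p, c0) <= dist2(p, c1) else right).append(p)
    return left, right


def bisect(points, workers=2, iters=10):
    """2-means seeded with far-apart points; returns None if no split exists."""
    c0, c1 = far_seeds(points)
    chunks = uniform_chunks(points, workers)
    split = None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iters):
            tasks = [(ch, c0, c1) for ch in chunks]
            parts = list(pool.map(partial_assign, tasks))
            left = [p for l, _ in parts for p in l]
            right = [p for _, r in parts for p in r]
            if not left or not right:
                break
            split = (left, right)
            new_c0, new_c1 = mean(left), mean(right)
            if (new_c0, new_c1) == (c0, c1):
                break  # centroids stopped moving
            c0, c1 = new_c0, new_c1
    return split


def bisecting_kmeans(points, k, workers=2):
    """Repeatedly bisect the cluster with the largest SSE until k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: sse(clusters[j]))
        target = clusters.pop(i)
        split = bisect(target, workers) if len(target) > 1 else None
        if split is None:
            clusters.append(target)  # nothing left to split
            break
        clusters.extend(split)
    return clusters
```

Threads are used here only to show the structure of the data-parallel step; on a real cluster system each chunk would live on a separate node and only the per-chunk results would be exchanged.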
CLC number:
张军伟, 王念滨, 黄少滨, 蔄世明. 二分K均值聚类算法优化及并行化研究[J]. 计算机工程, 2011, 37(17): 23-25.
ZHANG Jun-Wei, WANG Nian-Bin, HUANG Shao-Bin, MAN Shi-Meng. Research on Bisecting K-Means Clustering Algorithm Optimization and Parallelism[J]. Computer Engineering, 2011, 37(17): 23-25.