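Abstract (translated from Chinese): A large high-dimensional data set is randomly partitioned into several subsets, and each subset is clustered with a genetic-algorithm-based fuzzy clustering method for high-dimensional data. The method introduces a fuzzy dissimilarity matrix to represent the degree of dissimilarity between high-dimensional samples, randomly initializes the high-dimensional samples onto a two-dimensional plane, and uses a genetic algorithm to iteratively optimize the coordinates of the two-dimensional samples so that the Euclidean distances between them approach the fuzzy dissimilarities between the original samples. The resulting optimal two-dimensional samples are then clustered with the fuzzy C-means (FCM) algorithm, which removes the dependence of clustering validity on the spatial distribution of the high-dimensional samples. Simulation experiments show that the algorithm achieves good clustering results and greatly increases clustering speed.
Keywords (translated from Chinese):
fuzzy clustering,
divide and conquer,
genetic algorithm,
fuzzy dissimilarity matrix,
large data sets,
high dimension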
Abstract: A large high-dimensional data set is randomly divided into several subsets, and each subset is clustered with a high-dimensional fuzzy clustering method based on a genetic algorithm. The method introduces a fuzzy dissimilarity matrix to express the degree of dissimilarity between any two high-dimensional samples and randomly initializes the high-dimensional samples onto a two-dimensional plane. A genetic algorithm then iteratively optimizes the coordinates of the two-dimensional samples so that the Euclidean distances between them gradually approach the fuzzy dissimilarities between the corresponding samples. Finally, the optimized two-dimensional samples are clustered with the fuzzy C-means (FCM) algorithm, which avoids the dependence of clustering validity on the spatial distribution of the high-dimensional samples. Experimental results show that the method gives good clustering results and greatly improves clustering speed.
Key words:
fuzzy clustering,
distributed method,
genetic algorithm,
fuzzy dissimilar matrix,
large data sets,
high dimension
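
The pipeline summarized in the abstract can be illustrated with a rough sketch. The abstract gives none of the implementation details, so everything concrete here is an assumption and not the authors' method: the dissimilarity is taken to be the pairwise Euclidean distance scaled into [0, 1], the genetic algorithm uses elitism, binary tournament selection, uniform crossover and Gaussian mutation, and the function names `dissimilarity`, `stress`, `ga_embed` and `fcm` as well as all parameter values are hypothetical. How the per-subset results are combined is not stated in the abstract and is left out.

```python
import numpy as np

def dissimilarity(X):
    """Assumed dissimilarity: pairwise Euclidean distance scaled into [0, 1]."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return d / d.max()

def stress(Y, D):
    """Sum of squared gaps between 2-D distances and target dissimilarities."""
    d2 = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return float(np.sum((d2 - D) ** 2) / 2.0)

def ga_embed(D, pop_size=30, generations=200, sigma=0.05, seed=0):
    """Evolve 2-D coordinates whose pairwise distances approach D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    pop = rng.random((pop_size, n, 2))           # random initial 2-D layouts
    history = []                                 # best stress per generation
    for _ in range(generations):
        fit = np.array([stress(ind, D) for ind in pop])
        pop = pop[np.argsort(fit)]               # sort layouts, best first
        history.append(float(fit.min()))
        children = [pop[0].copy()]               # elitism: keep the best layout
        while len(children) < pop_size:
            # binary tournament: the lower index is the lower-stress layout
            a = rng.integers(0, pop_size, 2)
            b = rng.integers(0, pop_size, 2)
            p1, p2 = pop[a.min()], pop[b.min()]
            mask = rng.random((n, 1)) < 0.5      # uniform crossover per sample
            child = np.where(mask, p1, p2)
            child = child + rng.normal(0.0, sigma, child.shape)  # mutation
            children.append(child)
        pop = np.array(children)
    fit = np.array([stress(ind, D) for ind in pop])
    history.append(float(fit.min()))
    return pop[np.argmin(fit)], history

def fcm(Y, c=2, m=2.0, iters=100, seed=0):
    """Plain fuzzy C-means on the 2-D points; returns memberships and centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((Y.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # each membership row sums to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ Y) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic 50-dimensional data with two groups (illustrative only).
    X = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(5, 1, (20, 50))])
    subsets = np.array_split(rng.permutation(len(X)), 2)  # random partition
    for idx in subsets:
        Y, hist = ga_embed(dissimilarity(X[idx]))
        U, _ = fcm(Y)
        print(len(idx), hist[0], hist[-1], U.argmax(axis=1))
```

Because the best layout is copied unchanged into each new generation, the recorded stress never increases; whether it gets low enough for FCM to separate the groups depends on the assumed operators and parameters.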
CLC number:
王宝文;阎俊梅;刘文远;石 岩. 基于分治法的高维大数据集模糊聚类算法[J]. 计算机工程, 2007, 33(24): 60-62.
WANG Bao-wen; YAN Jun-mei; LIU Wen-yuan; SHI Yan. Fuzzy Clustering Algorithm for High-dimensional Large Data Sets Based on Distributed Method[J]. Computer Engineering, 2007, 33(24): 60-62.