Computer Engineering (计算机工程) ›› 2007, Vol. 33 ›› Issue (24): 60-62. doi: 10.3969/j.issn.1000-3428.2007.24.020

• Software Technology and Database •

Fuzzy Clustering Algorithm for High-Dimensional Large Data Sets Based on Divide and Conquer

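WANG Bao-wen (王宝文)1, YAN Jun-mei (阎俊梅)1, LIU Wen-yuan (刘文远)1, SHI Yan (石岩)2

  1. Institute of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; 2. Department of Information Systems Engineering, School of Engineering, Kyushu Tokai University, Japan
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-12-20 Published: 2007-12-20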

Fuzzy Clustering Algorithm for High-dimensional Large Data Sets Based on Distributed Method

WANG Bao-wen1, YAN Jun-mei1, LIU Wen-yuan1, SHI Yan2

  1. Institute of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; 2. Department of Information Systems Engineering, School of Engineering, Kyushu Tokai University, Japan
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-12-20 Published: 2007-12-20

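Abstract: A high-dimensional large data set is randomly divided into several subsets, and each subset is clustered with a genetic-algorithm-based fuzzy clustering method for high-dimensional data. The method introduces a fuzzy dissimilarity matrix to represent the degree of dissimilarity between high-dimensional samples and randomly initializes the samples onto a two-dimensional plane. A genetic algorithm then iteratively optimizes the coordinates of the two-dimensional samples so that the Euclidean distances between them approach the fuzzy dissimilarities between the original samples. The resulting optimal two-dimensional samples are clustered with the fuzzy C-means (FCM) algorithm, which removes the dependence of clustering validity on the spatial distribution of the high-dimensional samples. Simulation experiments show that the algorithm clusters well and greatly increases clustering speed.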

Key words: fuzzy clustering, divide and conquer, genetic algorithm, fuzzy dissimilarity matrix, large data sets, high dimension

Abstract: Data sets are randomly divided into several subsets. A fuzzy clustering method for high-dimensional data, based on a genetic algorithm, is used to cluster each subset: a fuzzy dissimilarity matrix is introduced to express the degree of dissimilarity between any two high-dimensional data points, and the high-dimensional samples are initialized on a two-dimensional plane. The coordinates of the two-dimensional samples are then iteratively optimized with the genetic algorithm, so that the Euclidean distances between the two-dimensional samples gradually approximate the fuzzy dissimilarities between the original samples. Finally, the two-dimensional data are clustered with the FCM algorithm, which avoids the dependence of clustering validity on the spatial distribution of the high-dimensional samples. Experimental results show that the method gives good clustering results and greatly improves clustering speed.

Key words: fuzzy clustering, distributed method, genetic algorithm, fuzzy dissimilar matrix, large data sets, high dimension
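
The pipeline outlined in the abstract can be sketched roughly as follows. This is only an illustrative reading, not the authors' implementation: the abstract does not define the fuzzy dissimilarity matrix, the genetic operators, or how the per-subset results are combined, so the sketch assumes pairwise Euclidean distances scaled into [0, 1] as a stand-in dissimilarity, a toy genetic algorithm (elitism, uniform crossover, Gaussian mutation), and a squared-error mismatch between 2-D distances and dissimilarities as the fitness. All function names and parameters (`fuzzy_dissimilarity`, `stress`, `ga_embed`, `fcm`, `cluster_subsets`, `pop_size`, `sigma`) are hypothetical.

```python
import numpy as np

def fuzzy_dissimilarity(X):
    """Assumed stand-in: pairwise Euclidean distances scaled into [0, 1]."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    return D / D.max() if D.max() > 0 else D

def stress(Y, D):
    """Sum of squared gaps between 2-D distances and target dissimilarities."""
    diff = Y[:, None, :] - Y[None, :, :]
    d2 = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(len(Y), k=1)
    return float(((d2[iu] - D[iu]) ** 2).sum())

def ga_embed(D, rng, pop_size=30, generations=200, sigma=0.05):
    """Toy genetic algorithm: evolve 2-D coordinates that minimize stress."""
    n = D.shape[0]
    pop = rng.random((pop_size, n, 2))           # random initial 2-D layouts
    for _ in range(generations):
        fitness = np.array([stress(ind, D) for ind in pop])
        elite = pop[np.argsort(fitness)[: pop_size // 2]]  # keep better half
        k = pop_size - len(elite)
        pa = elite[rng.integers(len(elite), size=k)]
        pb = elite[rng.integers(len(elite), size=k)]
        # Uniform crossover: each point is taken from one of the two parents
        mask = rng.random(pa.shape[:2])[..., None] < 0.5
        children = np.where(mask, pa, pb)
        children = children + rng.normal(0.0, sigma, children.shape)  # mutation
        pop = np.concatenate([elite, children])
    fitness = np.array([stress(ind, D) for ind in pop])
    return pop[np.argmin(fitness)]

def fcm(Y, c, rng, m=2.0, iters=100, tol=1e-6):
    """Standard fuzzy C-means; returns (centers, membership matrix U)."""
    U = rng.random((len(Y), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ Y) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        inv = np.fmax(dist, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def cluster_subsets(X, c, n_subsets, seed=0):
    """Randomly split X, then embed and cluster each subset independently."""
    rng = np.random.default_rng(seed)
    results = []
    for idx in np.array_split(rng.permutation(len(X)), n_subsets):
        D = fuzzy_dissimilarity(X[idx])
        Y = ga_embed(D, rng)                     # optimized 2-D coordinates
        _, U = fcm(Y, c, rng)
        results.append((idx, U))                 # memberships are per subset
    return results
```

A call such as `cluster_subsets(X, c=2, n_subsets=2)` returns one `(indices, memberships)` pair per subset. Cluster indices are not aligned across subsets in this sketch, because the abstract does not say how the subset results are merged.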

CLC number: