Computer Engineering

Special topic: Big Data


Improved Nearest Neighbor Absorption First Clustering Algorithm for Massive Data

NING Ke 1, SUN Tongjing 1, XU Jiejie 2

  1. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
  2. Zhejiang Province Electronic Information Products Testing Institute, Hangzhou 310007, China
  • Received: 2017-04-07 Online: 2018-04-15 Published: 2018-04-15
  • About the authors: NING Ke (born 1992), male, master's student, main research interest: massive data mining; SUN Tongjing, associate professor, Ph.D.; XU Jiejie, engineer.
  • Funding: Fund of the Zhejiang Provincial Key Laboratory of Information Security (KYZ066816004).

Abstract: The Nearest Neighbor Absorption First (NNAF) clustering algorithm is difficult to apply to massive data sets. To address this, an improved algorithm based on MapReduce is proposed. It introduces the MapReduce parallel programming framework, uses Canopy coarse clustering to optimize the computation, and improves the handling of the regions where clusters intersect. Experiments on three data sets of different sizes compare the improved algorithm with the K-means algorithm and the original NNAF clustering algorithm. The results show that the improved algorithm maintains clustering quality while running faster, and that it is suitable for clustering analysis of massive data.

Key words: massive data, clustering, MapReduce framework, Nearest Neighbor Absorption First (NNAF) clustering algorithm, Canopy algorithm, parallelization
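
For readers unfamiliar with Canopy coarse clustering, the following is a minimal single-machine sketch of the generic Canopy procedure, not the authors' MapReduce implementation. The abstract gives no thresholds or implementation details, so the function name `canopy`, the thresholds `t1` (loose) and `t2` (tight), the Euclidean distance, and the choice to assign members only from the remaining candidate points are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Generic Canopy coarse clustering (illustrative sketch).

    t1 is the loose threshold and t2 the tight threshold, with t1 > t2.
    Returns a list of (center, members) pairs. A point whose distance to a
    center lies in [t2, t1) joins that canopy but stays a candidate, so
    canopies may overlap.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the next center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.4, 10.0)]
    for center, members in canopy(sample, t1=2.0, t2=1.0):
        print(center, members)
```

In this sketch a point can fall into more than one canopy, which is the kind of overlap between clusters that the abstract says the improved algorithm handles differently. How the paper resolves that overlap is not stated in the abstract.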
