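Abstract: To meet real-world data mining demands involving large data volumes, complex workflows, and intensive computation, to improve the efficiency of traditional incremental-update tree mining, and to move away from the serial execution of existing algorithms, a dynamic tree incremental updating method based on Hadoop is proposed. Basic concepts such as cloud computing, the model, and the execution flow are introduced. To address the random assignment strategy used for task scheduling on the existing Hadoop platform, a resource scheduling and allocation algorithm for a dynamic cloud platform is designed with the aim of minimizing cost. A tree incremental updating mining algorithm and two parallel algorithms (DeleteFreqTree and FindNewTree) are given to carry out incremental mining of tree data. Experimental results show that the parallel algorithm is effective and feasible, is efficient and scales well, and can perform update mining on massive tree data.
Keywords:
data mining,
database,
cloud computing,
concurrency control,
frequent subtree,
incremental updating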
Abstract: To handle real-world data mining tasks that involve large data volumes, complex processing, and intensive computation, to improve the efficiency of traditional incremental-update tree mining, and to replace the serial execution of existing algorithms, this paper proposes a dynamic tree incremental updating method based on Hadoop. It first introduces the relevant concepts, including cloud computing, the cloud model, and the execution flow. Then, to address the random task-assignment strategy used by task scheduling on the Hadoop platform, it puts forward a new resource allocation algorithm for a dynamic cloud platform that aims to minimize cost. It also designs a new tree incremental updating algorithm for the cloud platform, together with two parallel algorithms (DeleteFreqTree and FindNewTree). A large number of experiments show that the parallel algorithm is feasible, efficient, and scalable, and that it can mine massive tree data effectively.
Key words:
data mining,
database,
cloud computing,
concurrency control,
frequent subtree,
incremental updating
CLC number: