
Computer Engineering

• Architecture and Software Technology •

Micro-array Data Two-stage Parallel K Nearest Neighbor Gene Extraction Based on Hadoop

QI Xiangming 1, ZHENG Shuai 1, WEI Ping 2

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  2. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China
  • Received: 2015-04-10 Publication date: 2016-05-15 Release date: 2016-05-13
  • About the authors: QI Xiangming (born 1966), male, associate professor, master's degree; research interests: data mining, big data technology, and graphics and image processing. ZHENG Shuai, master's degree. WEI Ping, lecturer, Ph.D.
  • Funding:
    Supported by the Liaoning Provincial Department of Education Fund Project (L2012113).

Abstract: Gene information extraction involves very large volumes of data, so traditional single-threaded classification and query methods cannot meet the requirements for real-time performance and extraction accuracy. To address this, a two-stage parallel computing model is designed on the Hadoop framework. The first stage selects candidate gene subsets in parallel, and the second stage performs parallel K nearest neighbor gene information extraction, so that the entire process is covered by parallel computation. To reduce the computational complexity of the algorithm, a data screening index is defined for gene microarray data and used to sample it, which reduces the amount of data to be processed and eliminates data redundancy. Experimental results show that the proposed algorithm runs efficiently, inherits the scalability of the Hadoop programming model, and is highly portable.

Key words: Hadoop framework, parallel computing, micro-array sampling, big data, K nearest neighbor, gene information
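
The two-stage structure described in the abstract can be illustrated with a small, single-process sketch in Python. It is not the authors' implementation: the abstract does not give the screening index, so population variance is used here purely as an assumed stand-in, Euclidean distance is an assumed metric, and all function names (`partition`, `stage1_map`, `stage1_reduce`, `stage2_map`, `stage2_reduce`, `two_stage_knn`) are hypothetical. The map and reduce steps are plain function calls that mimic what Hadoop would distribute across nodes.

```python
import heapq
import math
from statistics import pvariance


def partition(items, n_parts):
    """Split a list into n_parts chunks (a stand-in for Hadoop input splits)."""
    return [items[i::n_parts] for i in range(n_parts)]


# Stage 1: candidate gene subset selection.
def stage1_map(chunk):
    # Emit (gene, score). Population variance is an ASSUMED screening index;
    # the abstract does not say which index the paper defines.
    return [(gene, pvariance(profile)) for gene, profile in chunk]


def stage1_reduce(mapped, m):
    # Merge all mapper outputs and keep the m highest-scoring genes.
    scored = [pair for part in mapped for pair in part]
    return [gene for gene, _ in heapq.nlargest(m, scored, key=lambda p: p[1])]


# Stage 2: K nearest neighbor search over the candidate subset.
def stage2_map(chunk, query, k):
    # Each mapper returns its local k nearest genes (Euclidean distance assumed).
    dists = [(math.dist(profile, query), gene) for gene, profile in chunk]
    return heapq.nsmallest(k, dists)


def stage2_reduce(mapped, k):
    # The global k nearest genes are among the union of the local k nearest.
    return heapq.nsmallest(k, [pair for part in mapped for pair in part])


def two_stage_knn(data, query, m, k, n_parts=2):
    """Run both stages; data maps gene id -> expression profile (list of floats)."""
    items = list(data.items())
    stage1_out = [stage1_map(c) for c in partition(items, n_parts)]
    candidates = set(stage1_reduce(stage1_out, m))
    kept = [(g, p) for g, p in items if g in candidates]
    stage2_out = [stage2_map(c, query, k) for c in partition(kept, n_parts)]
    return [gene for _, gene in stage2_reduce(stage2_out, k)]


if __name__ == "__main__":
    # Toy expression profiles, invented for illustration only.
    data = {
        "g1": [1.0, 1.0, 1.0],
        "g2": [0.0, 5.0, 10.0],
        "g3": [1.0, 5.0, 9.0],
        "g4": [10.0, 0.0, 10.0],
        "g5": [2.0, 2.0, 2.1],
    }
    print(two_stage_knn(data, query=[0.0, 5.0, 9.5], m=3, k=2))
```

Because each stage-2 mapper only forwards its local k nearest candidates, the reducer handles at most k entries per partition rather than the full gene set, which is the usual reason for structuring a K nearest neighbor search this way under MapReduce.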
