Computer Engineering ›› 2025, Vol. 51 ›› Issue (5): 177-187. doi: 10.19678/j.issn.1000-3428.0068749

• Advanced Computing and Data Processing •

Parallel Processing Scheme for Sequencing Data in Copy Number Variation Based on MapReduce

HE Heng1,2, CHENG Kaili1,2, ZHANG Kui1,2, CHENG Shujun3

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China;
  2. Key Laboratory of Intelligent Information Processing and Real Time Industrial Systems in Hubei Province, Wuhan 430065, Hubei, China;
  3. School of Computing, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2023-11-02 Revised: 2024-02-28 Published: 2025-05-15 Released online: 2024-05-10
  • Corresponding author: ZHANG Kui, E-mail: zhangkui@wust.edu.cn
  • Funding:
    National Natural Science Foundation of China (62372343, 61602351).

Abstract: Copy Number Variation (CNV) is a type of genetic variation that occurs widely across the human genome. Improving the efficiency of CNV detection can give more patients faster and more accurate results, substantially reduce medical costs, and facilitate drug development and clinical application. Read Depth (RD)-based methods are currently the most commonly used approach to CNV detection, but processing RD-related information is slow and accounts for a large share of total detection time. Existing methods cannot be applied effectively to whole-genome analysis and suffer from low computational efficiency and reduced detection accuracy. Building on RD-based CNV detection, this paper proposes EPPCNV, an efficient parallel processing scheme for sequencing data. EPPCNV runs two MapReduce jobs in sequence to process whole-genome sequencing data in parallel and extract RD-related information accurately. To account for the effect of GC-content bias on CNV detection results, EPPCNV corrects the RDs of the sequencing data, which keeps the sensitivity and precision of the final detection results high. EPPCNV processes data independently of any specific CNV detection method, so the RD-related information it generates can be combined directly with a variety of mainstream CNV detection methods, substantially improving overall performance without changing how the original method identifies CNV regions. Experimental results show that EPPCNV achieves high overall accuracy and that, combined directly with CNV-LOF, HBOS-CNV, and CNVnator, it significantly improves the computational efficiency of each method while preserving high sensitivity and precision. The higher the coverage depth and the larger the data volume of the sequencing data, the greater the efficiency gain from combining a CNV detection method with EPPCNV.

Key words: Copy Number Variation (CNV) detection, MapReduce job, sequencing data processing, Read Depth (RD), whole genome
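
The abstract states only that two MapReduce jobs run in sequence and that RDs are corrected for GC content; it does not say what each job computes. The sketch below is therefore an illustration under stated assumptions, not the EPPCNV implementation: it assumes the first job counts reads per fixed-width bin and the second rescales each bin by the widely used ratio of the global mean RD to the mean RD of bins with the same GC percentage. The names `run_job`, `map_read`, `reduce_count`, `gc_correct`, and `BIN_SIZE` are hypothetical, and the jobs are simulated in memory rather than on Hadoop.

```python
from collections import defaultdict
from statistics import mean

BIN_SIZE = 1000  # hypothetical bin width in base pairs (assumption)


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


# Assumed job 1: count the reads that start in each bin (raw read depth).
def map_read(read):
    chrom, pos = read
    yield (chrom, pos // BIN_SIZE), 1


def reduce_count(key, ones):
    yield key, sum(ones)


def gc_correct(raw_rd, gc_percent):
    """Assumed job 2: rescale each bin's RD by global mean / same-GC mean."""
    global_mean = mean(rd for _, rd in raw_rd)

    def map_gc(item):
        bin_key, rd = item
        # Key by GC percentage so bins with equal GC meet in one reducer.
        yield gc_percent[bin_key], (bin_key, rd)

    def reduce_gc(gc, bins):
        gc_mean = mean(rd for _, rd in bins)
        for bin_key, rd in bins:
            # Guard against an all-zero GC group.
            yield bin_key, (rd * global_mean / gc_mean) if gc_mean else 0.0

    return dict(run_job(raw_rd, map_gc, reduce_gc))


if __name__ == "__main__":
    # Toy reads as (chromosome, start position); GC values are made up.
    reads = [("chr1", 100), ("chr1", 200), ("chr1", 1500),
             ("chr1", 2500), ("chr1", 2600), ("chr1", 2700), ("chr1", 2800)]
    gc = {("chr1", 0): 40, ("chr1", 1): 40, ("chr1", 2): 60}
    raw = run_job(reads, map_read, reduce_count)
    print(raw)
    print(gc_correct(raw, gc))
```

Running the jobs back to back mirrors the serial arrangement the abstract describes: the second job needs the complete per-bin counts before it can compute the global and per-GC means, so it cannot start until the first has finished.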

CLC Number: