Computer Engineering

Review of Machine Learning Based Methods for Predicting Pathogenicity of Genomic Structural Variations

  • Published: 2026-02-12

Abstract: Genomic structural variations (SVs) alter the three-dimensional conformation and regulatory networks of the genome through insertions, deletions, inversions, or translocations of large DNA fragments, and are key pathogenic variants in many complex diseases. In recent years, breakthroughs in long-read sequencing and 3D genomics have significantly improved SV detection. However, because SVs are complex and functional annotations are scarce, predicting their pathogenicity remains a major challenge. Several methods have been proposed that mine chromatin interaction, epigenetic modification, and single-cell transcriptomic data to reveal how SVs affect gene expression and phenotype and to decipher their pathogenic mechanisms, but a systematic summary of these methods is still lacking. This article therefore systematically reviews methods for predicting SV pathogenicity from high-throughput sequencing data, covering knowledge-driven methods, traditional machine learning methods, deep learning methods, and large-model methods. After summarizing the limitations of existing methods, including insufficient sensitivity for low-frequency variants, a shortage of functional annotation databases, and the limited generalizability of 3D models, it proposes multimodal data fusion, causal inference models, and spatial omics technologies as potential directions for advancing the field, with the aim of providing a theoretical reference for the functional interpretation of genomic structural variations.