Author Login Chief Editor Login Reviewer Login Editor Login Remote Office

Computer Engineering ›› 2025, Vol. 51 ›› Issue (8): 53-61. doi: 10.19678/j.issn.1000-3428.0069237

• Research Hotspots and Reviews • Previous Articles     Next Articles

Sequence Alignment Algorithm Based on Combined minimizer Seeds on Pan-Genome Graph

GAO Jia1,2, XU Yun1,2,*()   

  1. 1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, Anhui, China
    2. Key Laboratory of High Performance Computing of Anhui Province, Hefei 230027, Anhui, China
  • Received:2024-01-15 Revised:2024-04-22 Online:2025-08-15 Published:2024-06-13
  • Contact: XU Yun

基于组合minimizer种子的泛基因组图序列比对算法

高佳1,2, 徐云1,2,*()   

  1. 1. 中国科学技术大学计算机科学与技术学院,安徽 合肥 230027
    2. 安徽省高性能计算重点实验室,安徽 合肥 230027
  • 通讯作者: 徐云
  • 基金资助:
    国家自然科学基金面上项目(61672480); 高等学校学科创新引智计划(BP0719016)

Abstract:

With advancements in sequencing technology, human genome analysis has shifted from individual analysis to population analysis. To better demonstrate the genetic variation information between different samples within a population, the pan-genome graph model has replaced the traditional linear multi-sequence reference genome model, and sequence-to-graph alignment has become a key issue in biological sequence analyses. Existing alignment algorithms employ seed-and-extend strategies. However, owing to the numerous paths formed by graph combinations, localization and verification phases become time-consuming, necessitating further optimization and improvement of single-seed selection methods. To address this issue, this paper proposes a sequence alignment algorithm based on a combined minimizer seed. In the localization phase, the algorithm enhances the coverage range of a single seed through the combined hashing of minimizer seeds. Simultaneously, seeds are located through both sequence and relative position information, which significantly reducing the number of false-positive matching positions, thus lowering the workload of the subsequent filtering and verification processes. Experimental results demonstrate that the proposed algorithm can reduce candidate positions by approximately 80%, optimize time performance by one to three times, and have index memory and precise comparison capabilities comparable to mainstream alignment algorithms.

Key words: pan-genome graph, sequence alignment, seed-and-extend strategy, minimizer seed

摘要:

随着测序技术的发展和应用,人类基因组序列的研究已从个体分析逐步扩展到群体分析。为更好地展示种群不同样本之间的遗传变异信息,泛基因组图模型开始取代传统的线性多序列参考基因组模型,序列到图的比对成为生物序列分析的关键问题之一。现有比对算法通常采用种子与扩展策略,但由于图中组合的路径较多,定位和验证阶段的时间成本高,需要进一步优化单种子选取方法,减少候选位置的数量。为此,提出一种基于组合minimizer种子的序列比对算法,在定位阶段通过对minimizer种子的组合hash,扩展单个种子的覆盖范围。同时,通过序列和相对位置两方面信息查找种子,减少假阳性匹配位置的数量,从而降低后续筛选和验证的工作量。实验结果表明,与主流算法相比,该算法能够减少约80%的候选位置,时间性能提升1~3倍,同时保持相当的索引内存占用和精确比对能力。

关键词: 泛基因组图, 序列比对, 种子与扩展策略, minimizer种子