作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2007, Vol. 33 ›› Issue (09): 40-42.

• 博士论文 • 上一篇    下一篇

k-长DNA子序列计数算法研究

王树林1,2,王 戟1,陈火旺1,张鼎兴1   

  1. (1. 国防科技大学计算机学院,长沙 410073;2. 湖南大学计算机与通信学院,长沙 410082)
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2007-05-05 发布日期:2007-05-05

Research on Counting Algorithm of k-mer Occurrence in DNA Sequence

WANG Shulin1,2, WANG Ji1, CHEN Huowang1, ZHANG Dingxing1   

  1. (1. School of Computer Science, National University of Defense Technology, Changsha 410073; 2. College of Computer and Communication, Hunan University, Changsha 410082)
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-05-05 Published:2007-05-05

摘要: 基因组的结构与功能存在密切联系,其功能主要通过DNA子序列来表达,因此研究DNA序列结构对于生物信息学来说具有重要的意义。该文研究了k-长DNA子序列在DNA全序列中出现频数的计数问题,设计并实现了k-长DNA子序列内部计数算法和外部计数算法。该算法通过一个哈希函数把k-长DNA子序列映射为整数关键字从而把k-长DNA子序列出现频数的计数问题转化为整数关键字的重复计数问题,使得能够利用经典B树算法来解决k-长DNA子序列的出现频数计数问题。针对所要解决的问题提出3种改进措施以进一步提高算法的性能。

关键词: k-长DNA子序列, DNA序列, B树, 全基因组

Abstract: There is a close relationship between the structures of whole genome and its functions which are expressed by its subsequences. Researching the structure of DNA sequence has a profound meaning to bioinformatics. The problem that all k-mers in whole genome are counted is researched. The internal and external algorithm which counts all k-mers occurrence in DNA sequences is designed and implemented. This algorithm translates the problem of counting all k-mers into the problem of counting integer keys with the help of a hash function which maps a k-mer to an integer, and it applies the classic B-tree algorithm to solve the problem of counting k-mers in DNA sequence. It proposes three measures to further improve the efficiency of the algorithm according to the feature of the counting problem.

Key words: k-mer, DNA sequence, B-tree, Whole genome

中图分类号: