摘要: 基因组的结构与功能存在密切联系,其功能主要通过DNA子序列来表达,因此研究DNA序列结构对于生物信息学来说具有重要的意义。该文研究了k-长DNA子序列在DNA全序列中出现频数的计数问题,设计并实现了k-长DNA子序列内部计数算法和外部计数算法。该算法通过一个哈希函数把k-长DNA子序列映射为整数关键字从而把k-长DNA子序列出现频数的计数问题转化为整数关键字的重复计数问题,使得能够利用经典B树算法来解决k-长DNA子序列的出现频数计数问题。针对所要解决的问题提出3种改进措施以进一步提高算法的性能。
关键词:
k-长DNA子序列,
DNA序列,
B树,
全基因组
Abstract: There is a close relationship between the structures of whole genome and its functions which are expressed by its subsequences. Researching the structure of DNA sequence has a profound meaning to bioinformatics. The problem that all k-mers in whole genome are counted is researched. The internal and external algorithm which counts all k-mers occurrence in DNA sequences is designed and implemented. This algorithm translates the problem of counting all k-mers into the problem of counting integer keys with the help of a hash function which maps a k-mer to an integer, and it applies the classic B-tree algorithm to solve the problem of counting k-mers in DNA sequence. It proposes three measures to further improve the efficiency of the algorithm according to the feature of the counting problem.
Key words:
k-mer,
DNA sequence,
B-tree,
Whole genome
中图分类号:
王树林;王 戟;陈火旺;张鼎兴. k-长DNA子序列计数算法研究[J]. 计算机工程, 2007, 33(09): 40-42.
WANG Shulin; WANG Ji; CHEN Huowang; ZHANG Dingxing. Research on Counting Algorithm of k-mer Occurrence in DNA Sequence[J]. Computer Engineering, 2007, 33(09): 40-42.