摘要: 针对现有DNA重复体频率统计算法效率低、灵活性差等不足,基于字符串多模式匹配的有限状态自动机,构造DNA子序列比对自动机,利用KMP算法对自动机进行状态转移优化,由此提出一种高效的重复体频率统计算法。该算法通过对DNA数据库的线性扫描,得到每个DNA子序列在全局数据库中重叠与非重叠的重复体频率统计信息以及指定DNA序列集合的最长公共子序列信息。实验结果表明,该算法具有效率高、匹配精确、信息获取方式灵活、支持在线操作等优势。
关键词:
有限状态自动机,
DNA子序列,
重复体频率,
频率统计算法,
最长公共子序列
Abstract: The existing statistical algorithms for DNA repeats frequency have some defects on efficiency, flexibility and so on. Based on Finite State Automaton(FSA) for string multiple patterns matching, this paper constructs a DNA subsequence comparative automaton, and optimizes the automaton’s state transition based on the idea of Knuth-Morris-Pratt(KMP). By on-line scanning DNA database, it can achieve all DNA subsequence’s statistics in the whole database, including the frequency statistics of overlapped or nonoverlapped repeats and the longest common subsequence in a designated DNA sequence set. Experimental results show that the algorithm has advantages of efficiency, precise matching, flexible information acquisition, supporting on-line operation and so on.
Key words:
Finite State Automaton(FSA),
DNA subsequence,
repeats frequency,
frequency statistical algorithm,
longest common subsequence
中图分类号:
陈聪, 韩建民, 贾泂, 辛德东. 基于FSA的DNA重复体频率统计算法[J]. 计算机工程, 2011, 37(11): 184-186,189.
CHEN Cong, HAN Jian-Min, GU Jiong, XIN De-Dong. Statistical Algorithm for DNA Repeats Frequency Based on Finite State Automaton[J]. Computer Engineering, 2011, 37(11): 184-186,189.