摘要: 基于簇聚类的确定型有穷自动机(DFA)压缩算法,即ClusterFA算法,解决了正则表达式匹配中的空间爆炸问题,但该算法的分组个数取理想值较为困难,且其类中心向量表的每一行中连续重复转移状态出现频率较高。针对该问题,提出一种改善ClusterFA算法的方案En_ClusterFA。提取类中心向量表行与行之间相同的首尾部分,并对其进行游程编码以建立索引表,对类中心向量表余下部分的转移状态进行游程编码。利用该方案对Bro,Snort和L7filter规则集进行测试,实验结果表明,除了L7_2和L7_6规则集的压缩率分别提高到96.1%和98.1%之外,其他规则集的压缩率都提高到99%以上。与ClusterFA算法的压缩率相比,En_ClusterFA平均提高了4%,证明En_ClusterFA能够有效地提高DFA的压缩效率。
关键词:
正则表达式,
ClusterFA算法,
确定型有穷自动机,
游程编码,
压缩率,
吞吐率
Abstract: In order to solve the Deterministic Finite Automata(DFA) space explosion problem,a DFA algorithm based on clustering,named ClusterFA,is proposed.However,it is difficult to take the ideal value for the number of groups for ClusterFA algorithm.The number in each line of the class center vector table,which is also named CommonTable,is continuously repeated.In order to further improve the clusterFA compression ratio,this paper puts forward a new solution:extracting the same head and tail section between lines of the CommonTable as part of the index table,and using the Runlength Encoding(RLE) technique to code the continuously repeated numbers.This algorithm is tested by Bro,Snort and L7filter rule sets.Experimental results show that the rule sets compression ratio is up to 99% or more except that the compression ratio of L7_2 and L7_6 increases to 96.1% and 98.1%.Compared with the ClusterFA algorithm,the compression ratio of the En_ClusterFA improves an average of 4%.It proves that the En_ClusterFA can effectively improve the compression ratio of the DFA.
Key words:
regular expression,
ClusterFA algorithm,
Deterministic Finite Automata(DFA),
Runlength Encoding(RLE),
compression ratio,
throughput
中图分类号:
杨嘉佳,姜腊林,姜磊,戴琼,谭建龙. 基于簇聚类和游程编码的正则表达式压缩算法[J]. 计算机工程, doi: 10.3969/j.issn.1000-3428.2014.08.054.
YANG Jia-jia,JIANG La-lin,JIANG Lei,DAI Qiong,TAN Jian-long. Regular Expression Compression Algorithm Based on Cluster Clustering and Runlength Encoding[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.08.054.