计算机工程

• 开发研究与工程应用 • 上一篇    下一篇

基于簇聚类和游程编码的正则表达式压缩算法

杨嘉佳1,姜腊林1,姜磊2,戴琼3,谭建龙3   

  1. (1.长沙理工大学计算机与通信工程学院,长沙 410114;2.中国科学院计算技术研究所,北京 100190;3.中国科学院信息工程研究所,北京 100093)
  • 收稿日期:2013-08-26 出版日期:2014-08-15 发布日期:2014-08-15
  • 作者简介:杨嘉佳(1988-),男,硕士研究生,主研方向:网络安全,模式匹配;姜腊林,副教授;姜磊,博士研究生;戴琼,高级工程师;谭建龙,研究员、博士生导师。
  • 基金项目:
    国家“863”计划基金资助项目(2012AA012502);中国科学院战略性先导科技专项基金资助项目(XDA06030602)。

Regular Expression Compression Algorithm Based on Cluster Clustering and Runlength Encoding

YANG Jia-jia 1,JIANG La-lin  1,JIANG Lei  2,DAI Qiong  3,TAN Jian-long  3   

  1. (1.College of Computer and Communication Engineering,Changsha University of Science and Technology,Changsha 410114,China;2.Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China;3.Institute of Information Engineering,Chinese Academy of Sciences,Beijing 100093,China)
  • Received:2013-08-26 Online:2014-08-15 Published:2014-08-15

摘要: 基于簇聚类的确定型有穷自动机(DFA)压缩算法,即ClusterFA算法,解决了正则表达式匹配中的空间爆炸问题,但该算法的分组个数取理想值较为困难,且其类中心向量表的每一行中连续重复转移状态出现频率较高。针对该问题,提出一种改善ClusterFA算法的方案En_ClusterFA。提取类中心向量表行与行之间相同的首尾部分,并对其进行游程编码以建立索引表,对类中心向量表余下部分的转移状态进行游程编码。利用该方案对Bro,Snort和L7filter规则集进行测试,实验结果表明,除了L7_2和L7_6规则集的压缩率分别提高到96.1%和98.1%之外,其他规则集的压缩率都提高到99%以上。与ClusterFA算法的压缩率相比,En_ClusterFA平均提高了4%,证明En_ClusterFA能够有效地提高DFA的压缩效率。

关键词: 正则表达式, ClusterFA算法, 确定型有穷自动机, 游程编码, 压缩率, 吞吐率

Abstract: In order to solve the Deterministic Finite Automata(DFA) space explosion problem,a DFA algorithm based on clustering,named ClusterFA,is proposed.However,it is difficult to take the ideal value for the number of groups for ClusterFA algorithm.The number in each line of the class center vector table,which is also named CommonTable,is continuously repeated.In order to further improve the clusterFA compression ratio,this paper puts forward a new solution:extracting the same head and tail section between lines of the CommonTable as part of the index table,and using the Runlength Encoding(RLE) technique to code the continuously repeated numbers.This algorithm is tested by Bro,Snort and L7filter rule sets.Experimental results show that the rule sets compression ratio is up to 99% or more except that the compression ratio of L7_2 and L7_6 increases to 96.1% and 98.1%.Compared with the ClusterFA algorithm,the compression ratio of the En_ClusterFA improves an average of 4%.It proves that the En_ClusterFA can effectively improve the compression ratio of the DFA.

Key words: regular expression, ClusterFA algorithm, Deterministic Finite Automata(DFA), Runlength Encoding(RLE), compression ratio, throughput

中图分类号: