Regular Expression Compression Algorithm Based on Cluster Clustering and Runlength Encoding

doi:10.3969/j.issn.1000-3428.2014.08.054

Computer Engineering

YANG Jia-jia¹, JIANG La-lin¹, JIANG Lei², DAI Qiong³, TAN Jian-long³

(1. College of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China)

Received: 2013-08-26 Online: 2014-08-15 Published: 2014-08-15
About the authors: YANG Jia-jia (born 1988), male, master's student, research interests: network security and pattern matching; JIANG La-lin, associate professor; JIANG Lei, doctoral student; DAI Qiong, senior engineer; TAN Jian-long, researcher and doctoral supervisor.
Funding:
National High Technology Research and Development Program of China ("863" Program) (2012AA012502); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030602).

Abstract

Abstract: ClusterFA, a Deterministic Finite Automata (DFA) compression algorithm based on clustering, addresses the space explosion problem in regular expression matching. However, an ideal value for its number of groups is difficult to choose, and consecutive repeated transition states occur frequently within each row of its class center vector table (also called CommonTable). To address these problems, this paper proposes En_ClusterFA, a scheme that improves on ClusterFA. It extracts the head and tail sections that are identical across the rows of the class center vector table and run-length encodes them to build an index table, then applies Runlength Encoding (RLE) to the transition states in the remaining part of the table. The scheme is tested on the Bro, Snort, and L7filter rule sets. Experimental results show that the compression ratio rises to 96.1% for the L7_2 rule set and 98.1% for the L7_6 rule set, and to above 99% for all other rule sets. Compared with ClusterFA, En_ClusterFA improves the compression ratio by an average of 4%, which shows that En_ClusterFA effectively improves the compression efficiency of the DFA.

Key words: regular expression, ClusterFA algorithm, Deterministic Finite Automata (DFA), Runlength Encoding (RLE), compression ratio, throughput
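
The two steps named in the abstract (pulling out the head and tail shared by every row, then run-length encoding what is left) can be illustrated with a minimal Python sketch. The paper's actual data layout is not given here, so the function names, the `(state, run_length)` pair format, and the toy 8-column table are all assumptions made for illustration; this is not the authors' implementation.

```python
from itertools import groupby


def rle_encode(row):
    """Run-length encode a row as (state, run_length) pairs."""
    return [(state, len(list(run))) for state, run in groupby(row)]


def rle_lookup(encoded, column):
    """Return the state at `column` by walking the runs, without decoding."""
    offset = 0
    for state, length in encoded:
        offset += length
        if column < offset:
            return state
    raise IndexError(column)


def common_prefix_len(rows):
    """Length of the longest prefix shared by every row."""
    n = 0
    for column in zip(*rows):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def split_table(rows):
    """Split equal-length, non-empty rows into (head, tail, middles).

    `head` and `tail` are the sections identical in every row;
    `middles` holds what remains of each row.
    """
    head_len = common_prefix_len(rows)
    remainder = [row[head_len:] for row in rows]
    tail_len = common_prefix_len([r[::-1] for r in remainder])
    head = rows[0][:head_len]
    tail = remainder[0][len(remainder[0]) - tail_len:]
    middles = [r[:len(r) - tail_len] for r in remainder]
    return head, tail, middles


# Hypothetical toy table: 3 rows, 8 input symbols per row.
table = [
    [0, 0, 0, 5, 5, 2, 7, 7],
    [0, 0, 0, 3, 3, 3, 7, 7],
    [0, 0, 0, 4, 1, 1, 7, 7],
]
head, tail, middles = split_table(table)
index_table = (rle_encode(head), rle_encode(tail))  # stored once for all rows
encoded_middles = [rle_encode(m) for m in middles]  # stored per row
```

In this sketch the shared head and tail are stored once rather than once per row, and each remaining row shrinks to a few runs. The trade-off is that a lookup walks the runs instead of indexing an array directly.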

CLC number: TN791

YANG Jia-jia, JIANG La-lin, JIANG Lei, DAI Qiong, TAN Jian-long. Regular Expression Compression Algorithm Based on Cluster Clustering and Runlength Encoding[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2014/V40/I8/282

Editor: GU Yi-fei

