
Computer Engineering ›› 2025, Vol. 51 ›› Issue (2): 188-201. doi: 10.19678/j.issn.1000-3428.0069150

• Cyberspace Security •

Encrypted Traffic Classification Model Based on Byte Coding and Pre-Training Tasks

YAO Lifeng, CAI Manchun*, ZHU Yi, CHEN Yonghao, ZHANG Yiwen

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2024-01-02  Online: 2025-02-15  Published: 2024-06-03
  • Corresponding author: CAI Manchun
  • Supported by:
    Double First-Class Innovation Research Project in Cyberspace Security Law Enforcement Technology of the People's Public Security University of China (2023SYL07); 2022 Fundamental Research Funds of the People's Public Security University of China (2022JKF02009)

Abstract:

When the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to encrypted traffic classification, it lacks an encoding method and corresponding pre-training tasks designed for the characteristics of encrypted traffic. To address this, this study proposes a pre-training model for encrypted traffic classification that integrates byte-level encoding with improved pre-training tasks. First, a novel vocabulary construction method is designed to enhance the model's ability to represent traffic transmission structures. Second, two new self-supervised pre-training tasks are introduced: dynamic mask BURST prediction, which strengthens the model's ability to capture the semantic diversity of encrypted traffic, and homogeneous BURST coherence prediction, which improves the model's ability to model the coherent ordering of encrypted traffic. Experimental results show that the proposed model achieves an accuracy of 98.52%, precision of 98.40%, recall of 98.35%, and F1 score of 98.43% on the CSTNET-TLS 1.3 dataset, improvements of 1.15, 0.98, 0.93, and 1.02 percentage points, respectively, over the best-performing existing pre-trained baseline model. Furthermore, on seven mainstream datasets spanning five downstream encrypted traffic classification tasks, the proposed model outperforms existing methods on all four evaluation metrics and effectively classifies encrypted traffic.

Key words: encrypted traffic classification, pre-training model, byte-level encoding, self-supervised pre-training task, fine-tuning method
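
The abstract describes byte-level encoding and two self-supervised pre-training tasks but, being an abstract, gives no implementation detail. The minimal Python sketch below illustrates one plausible reading of the data preparation only; the BURST definition (the consecutive payload bytes sent in one direction of a flow), the byte-bigram tokens, the 15% mask ratio, and the 50/50 positive/negative pairing are all assumptions made for illustration, not details taken from the paper.

```python
"""Illustrative sketch only -- not the authors' implementation.
Assumptions: a BURST is the run of payload bytes sent in one direction of a
flow; tokens are overlapping byte bigrams; the 15% mask ratio and the 50/50
positive/negative pairing follow conventional BERT-style defaults."""
import random
from typing import List, Tuple

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def byte_tokens(payload: bytes) -> List[str]:
    """Byte-level encoding (assumed form): one token per overlapping two-byte
    window, giving a compact vocabulary of at most 65 536 bigrams over raw bytes."""
    h = payload.hex()
    return [h[i:i + 4] for i in range(0, len(h) - 2, 2)]

def dynamic_mask(tokens: List[str], ratio: float = 0.15) -> Tuple[List[str], List[str]]:
    """Dynamic mask BURST prediction (sketch): re-sample the masked positions
    each time the BURST is seen, instead of fixing one static mask."""
    inputs, targets = list(tokens), [""] * len(tokens)
    for i in random.sample(range(len(tokens)), max(1, int(len(tokens) * ratio))):
        targets[i] = tokens[i]   # the model must recover the original token
        inputs[i] = MASK
    return inputs, targets

def coherence_pair(same_flow: List[List[str]],
                   other_flows: List[List[str]]) -> Tuple[List[str], int]:
    """Homogeneous BURST coherence prediction (sketch): label 1 when the second
    segment is the BURST that actually follows the first within the same flow,
    0 when it is drawn from a different flow."""
    first, follower = same_flow[0], same_flow[1]
    if random.random() < 0.5:
        return [CLS] + first + [SEP] + follower + [SEP], 1
    return [CLS] + first + [SEP] + random.choice(other_flows) + [SEP], 0

if __name__ == "__main__":
    # Toy flows built from dummy byte payloads, just to show the shapes produced.
    flow_a = [byte_tokens(bytes(range(16))), byte_tokens(bytes(range(16, 32)))]
    flow_b = [byte_tokens(bytes(range(32, 48)))]
    print(dynamic_mask(flow_a[0]))
    print(coherence_pair(flow_a, flow_b))
```

In this reading, the two tasks supply complementary supervision: dynamic masking varies which bytes must be reconstructed, while the coherence pairs force the encoder to judge whether two BURSTs belong to the same flow in order; the actual model architecture and fine-tuning procedure are described in the paper itself.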