
Computer Engineering ›› 2025, Vol. 51 ›› Issue (2): 188-201. doi: 10.19678/j.issn.1000-3428.0069150

• Cyberspace Security •

Encrypted Traffic Classification Model Based on Byte Coding and Pre-Training Tasks

YAO Lifeng, CAI Manchun*, ZHU Yi, CHEN Yonghao, ZHANG Yiwen

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2024-01-02  Online: 2025-02-15  Published: 2024-06-03
  • Corresponding author: CAI Manchun
  • Supported by:
    Double First-Class Innovation Research Project in Cyberspace Security Law Enforcement Technology of the People's Public Security University of China (2023SYL07); 2022 Fundamental Research Funds of the People's Public Security University of China (2022JKF02009)

Abstract:

When the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to encrypted traffic classification, it lacks an encoding method and corresponding pre-training tasks designed for the characteristics of encrypted traffic. To address this, this study proposes a pre-training model for encrypted traffic classification that integrates byte-level encoding with improved pre-training tasks. First, a novel vocabulary construction method is designed to enhance the model's ability to represent traffic transmission structures. Second, two new self-supervised pre-training tasks are introduced: dynamic mask BURST prediction, which strengthens the model's ability to capture the semantic diversity of encrypted traffic, and homogeneous BURST coherence prediction, which improves the model's ability to model the coherent ordering of encrypted traffic. Experimental results show that the proposed model achieves an accuracy of 98.52%, precision of 98.40%, recall of 98.35%, and F1 score of 98.43% on the CSTNET-TLS 1.3 dataset, improvements of 1.15, 0.98, 0.93, and 1.02 percentage points, respectively, over the best-performing existing pre-trained baseline model. Furthermore, on seven mainstream datasets spanning five downstream encrypted traffic classification tasks, the proposed model outperforms existing methods on all four evaluation metrics and effectively classifies encrypted traffic.

Key words: encrypted traffic classification, pre-training model, byte-level encoding, self-supervised pre-training task, fine-tuning method
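
The abstract describes byte-level encoding and two self-supervised pre-training tasks but, being an abstract, gives no implementation detail. The minimal Python sketch below illustrates one plausible reading of the data preparation only; the BURST definition (the consecutive payload bytes sent in one direction of a flow), the byte-bigram tokens, the 15% mask ratio, and the 50/50 positive/negative pairing are all assumptions made for illustration, not details taken from the paper.

```python
"""Illustrative sketch only -- not the authors' implementation.
Assumptions: a BURST is the run of payload bytes sent in one direction of a
flow; tokens are overlapping byte bigrams; the 15% mask ratio and the 50/50
positive/negative pairing follow conventional BERT-style defaults."""
import random
from typing import List, Tuple

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def byte_tokens(payload: bytes) -> List[str]:
    """Byte-level encoding (assumed form): one token per overlapping two-byte
    window, giving a compact vocabulary of at most 65 536 bigrams over raw bytes."""
    h = payload.hex()
    return [h[i:i + 4] for i in range(0, len(h) - 2, 2)]

def dynamic_mask(tokens: List[str], ratio: float = 0.15) -> Tuple[List[str], List[str]]:
    """Dynamic mask BURST prediction (sketch): re-sample the masked positions
    each time the BURST is seen, instead of fixing one static mask."""
    inputs, targets = list(tokens), [""] * len(tokens)
    for i in random.sample(range(len(tokens)), max(1, int(len(tokens) * ratio))):
        targets[i] = tokens[i]   # the model must recover the original token
        inputs[i] = MASK
    return inputs, targets

def coherence_pair(same_flow: List[List[str]],
                   other_flows: List[List[str]]) -> Tuple[List[str], int]:
    """Homogeneous BURST coherence prediction (sketch): label 1 when the second
    segment is the BURST that actually follows the first within the same flow,
    0 when it is drawn from a different flow."""
    first, follower = same_flow[0], same_flow[1]
    if random.random() < 0.5:
        return [CLS] + first + [SEP] + follower + [SEP], 1
    return [CLS] + first + [SEP] + random.choice(other_flows) + [SEP], 0

if __name__ == "__main__":
    # Toy flows built from dummy byte payloads, just to show the shapes produced.
    flow_a = [byte_tokens(bytes(range(16))), byte_tokens(bytes(range(16, 32)))]
    flow_b = [byte_tokens(bytes(range(32, 48)))]
    print(dynamic_mask(flow_a[0]))
    print(coherence_pair(flow_a, flow_b))
```

In this reading, the two tasks supply complementary supervision: dynamic masking varies which bytes must be reconstructed, while the coherence pairs force the encoder to judge whether two BURSTs belong to the same flow in order; the actual model architecture and fine-tuning procedure are described in the paper itself.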