Computer Engineering ›› 2026, Vol. 52 ›› Issue (1): 206-216. doi: 10.19678/j.issn.1000-3428.0069691

• Computer Vision and Graphics & Image Processing •

Knowledge Distillation-based Transformer for Human-Object Interaction Detection

CHEN Dongji, LAI Huicheng*, GAO Guxue, MA Jun, LI Junkai, QUAN Hutuo

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830094, Xinjiang, China
  • Received: 2024-04-03  Revised: 2024-08-15  Online: 2026-01-15  Published: 2024-10-21
  • Corresponding author: LAI Huicheng
  • About the authors:

    CHEN Dongji, male, master's student; his main research interests are image understanding and recognition.

    LAI Huicheng (corresponding author), professor.

    GAO Guxue, Ph.D. candidate.

    MA Jun, master's student.

    LI Junkai, master's student.

    QUAN Hutuo, master's student.

  • Funding:
    National Natural Science Foundation of China (2022ZD0115803); Key Research and Development Program of Xinjiang Uygur Autonomous Region (2022B01008)

Abstract:

The Transformer, now widely adopted across fields, has also achieved strong results in Human-Object Interaction (HOI) detection. Building on this, a new Knowledge Distillation-based Transformer (KDT) network is proposed for end-to-end HOI detection. Because the holistic HOI features modeled by a plain Transformer network are coarse, a basic multi-branch Transformer structure is built around the three subtasks of HOI detection: predicting human boxes, predicting object boxes and object categories, and predicting the interaction actions between human and object. The structure comprises a human instance branch, an object instance branch, and an interaction branch, and the decoders of the human and object branches supply the interaction-branch decoder with regional cues about the human and the object. To provide the Transformer structure with key semantic and spatial information, semantic features of object categories and interaction verbs, together with spatial features of the human and object boxes, are generated in advance to give the different branches semantic and spatial cues, further strengthening each decoder's ability to extract features for its HOI subtask. On this basis, another multi-branch Transformer structure is built as a teacher network, whose decoders take the pre-generated features as decoder queries and output more accurate HOI features. During training, the basic multi-branch network is made to imitate the output of the teacher network, and an additional category similarity loss measures the intra- and inter-category vector similarity between the predictions of the two networks, thereby improving the performance of the basic network's decoders. Experimental results show that the mean Average Precision (mAP) on all categories, rare categories, and non-rare categories of the HOI benchmark dataset HICO-DET reaches 32.13%, 28.57%, and 33.19%, respectively, an improvement of up to 4.65 percentage points over the baseline.
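To make the branch layout above concrete, the following PyTorch sketch wires a human instance decoder, an object instance decoder, and an interaction decoder over shared encoder features, adds pre-generated semantic and spatial cues to the decoder queries, and feeds the instance decoders' outputs to the interaction decoder as region cues. Module names, feature dimensions (for example, 300-dimensional word vectors and 8-dimensional box geometry), and the additive fusion are illustrative assumptions for this sketch, not the authors' exact KDT design.

```python
# Minimal sketch of a three-branch HOI decoder with pre-generated query cues.
# Assumptions: 300-d semantic embeddings, 8-d box-pair geometry, additive fusion.
import torch
import torch.nn as nn


def make_decoder(d_model, num_layers=3):
    layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=num_layers)


class MultiBranchHOIDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=64,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        self.human_decoder = make_decoder(d_model)    # human instance branch
        self.object_decoder = make_decoder(d_model)   # object instance branch
        self.inter_decoder = make_decoder(d_model)    # interaction branch

        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.semantic_proj = nn.Linear(300, d_model)  # object/verb word vectors
        self.spatial_proj = nn.Linear(8, d_model)     # human + object box coords

        self.human_box_head = nn.Linear(d_model, 4)
        self.object_box_head = nn.Linear(d_model, 4)
        self.object_cls_head = nn.Linear(d_model, num_obj_classes + 1)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)

    def forward(self, memory, semantic_cue, spatial_cue):
        # memory:       (B, HW, d_model) flattened encoder features
        # semantic_cue: (B, num_queries, 300) pre-generated semantic features
        # spatial_cue:  (B, num_queries, 8)   pre-generated spatial features
        b = memory.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = q + self.semantic_proj(semantic_cue) + self.spatial_proj(spatial_cue)

        h_feat = self.human_decoder(q, memory)
        o_feat = self.object_decoder(q, memory)
        # The instance branches hand their decoded features to the interaction
        # branch as region cues (modelled here by adding them to its queries).
        i_feat = self.inter_decoder(q + h_feat + o_feat, memory)

        return {
            "human_boxes": self.human_box_head(h_feat).sigmoid(),
            "object_boxes": self.object_box_head(o_feat).sigmoid(),
            "object_logits": self.object_cls_head(o_feat),
            "verb_logits": self.verb_cls_head(i_feat),
        }
```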

Key words: Transformer network, Human-Object Interaction (HOI), pre-generated feature, teacher network, category similarity loss
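The abstract also describes a category similarity loss that measures intra- and inter-category vector similarity between the student's and the teacher's predictions. One plausible reading, sketched below, aligns each class vector of the student with the teacher's vector for the same class (intra-class) and matches the two networks' class-to-class cosine-similarity matrices (inter-class); the exact formulation in the paper may differ, so treat this purely as an illustration.

```python
# Hedged sketch of a category similarity loss for prediction-level distillation.
import torch
import torch.nn.functional as F


def category_similarity_loss(student_logits, teacher_logits, alpha=1.0):
    """student_logits, teacher_logits: (num_queries, num_classes) for one image."""
    s = F.normalize(student_logits.t(), dim=1)           # (C, Q) class vectors
    t = F.normalize(teacher_logits.t().detach(), dim=1)  # no gradient to teacher

    # Intra-class term: each student class vector should align with the
    # teacher's vector for the same class (cosine similarity driven toward 1).
    intra = (1.0 - (s * t).sum(dim=1)).mean()

    # Inter-class term: the student's class-to-class similarity matrix should
    # match the teacher's, preserving relationships between different classes.
    inter = F.mse_loss(s @ s.t(), t @ t.t())

    return intra + alpha * inter


# Example: distil the teacher's interaction (verb) predictions into the student.
student_verb_logits = torch.randn(64, 117, requires_grad=True)
teacher_verb_logits = torch.randn(64, 117)
kd_loss = category_similarity_loss(student_verb_logits, teacher_verb_logits)
kd_loss.backward()
```

In a full training loop this term would be added, with a weighting factor, to the usual HOI detection losses of the basic network while the teacher network's weights stay fixed.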