
Computer Engineering (计算机工程)



Fusing Multiple Self-Supervised Representations by Solving a Feature Regression Task

  • Published:2025-05-09


Abstract: Self-supervised learning has demonstrated strong potential in computer vision tasks. However, effectively fusing the features extracted by multiple self-supervised tasks remains a major open challenge. Traditional multi-task learning methods struggle to integrate heterogeneous self-supervised features due to issues such as input conflicts and architectural incompatibilities, while existing feature fusion methods (e.g., subspace learning) often over-compress the feature space, losing task-specific information. This paper proposes a multi-self-supervised feature fusion method based on a feature regression task, which treats feature fusion as a multi-view learning problem: the goal is to learn a latent space shared across views while maximizing the correlation between the different self-supervised features. The model first treats the multiple self-supervised features as complementary "multi-view" representations and builds a feature interaction network centered on a Transformer encoder. The feature regression task then takes masked features as input and, through self-attention, exploits cross-task correlations to reconstruct the original features, forcing the model to preserve view-specific information while maximizing shared information. The resulting features capture both the shared and the unique information of the image's different views, yielding representations that generalize better. Image classification experiments on several well-known datasets show that the fused features generalize significantly better than the individual pre-fusion features, validating the effectiveness of the proposed fusion method.
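The masked feature-regression idea described above can be sketched in code. The following is a minimal PyTorch illustration, not the authors' implementation: all module names, hyperparameters (embedding size, number of heads/layers, mask ratio), and the mean-pooled fused representation are my own assumptions. Each image contributes one feature vector per self-supervised encoder; each vector becomes a "view" token, a random subset of tokens is replaced by a learnable mask embedding, and a Transformer encoder must regress the original features from the surviving views, which requires modeling cross-task correlations without discarding view-specific information.

```python
# Hypothetical sketch of the feature-regression fusion task described in the
# abstract; architecture details are assumptions, not the published code.
import torch
import torch.nn as nn


class FeatureRegressionFusion(nn.Module):
    """Fuse features from several self-supervised encoders ("views").

    Each per-image feature vector is one token. Masked tokens are replaced
    by a learnable mask embedding, and a Transformer encoder reconstructs
    the original features from the remaining views via self-attention.
    """

    def __init__(self, num_views, feat_dim, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)          # per-view embedding
        self.view_embed = nn.Parameter(torch.zeros(num_views, d_model))
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, feat_dim)          # regression head

    def forward(self, feats, mask):
        # feats: (B, V, feat_dim); mask: (B, V) bool, True = masked view.
        x = self.proj(feats)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.view_embed                           # tells tokens apart
        z = self.encoder(x)                               # cross-view attention
        return self.head(z), z                            # reconstruction, fused tokens


def regression_loss(model, feats, mask_ratio=0.3):
    """MSE on the masked views only, as in masked-feature modeling."""
    mask = torch.rand(feats.shape[:2]) < mask_ratio
    recon, _ = model(feats, mask)
    if mask.any():
        return ((recon - feats) ** 2)[mask].mean()
    return (recon * 0.0).sum()                            # no view masked this step
```

After training, a fused image representation could be obtained by, e.g., mean-pooling the output tokens `z` over the view dimension; the abstract itself does not specify this pooling step, so it is an assumption here.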