Diversified Data Expansion Method Using Text-Based Person Image Search

doi:10.19678/j.issn.1000-3428.0068883

Abstract

In recent years, Text-Based Person Search (TBPS) has played an increasingly important role in fields such as security and criminal investigation. However, existing datasets contain relatively few person images, and the text describing each person is fairly monotonous, so models cannot fully learn person features and information, which limits further progress in TBPS. To address this problem, a data expansion method is proposed that generates and filters diversified person image-text pairs. In the data generation stage, person text descriptions are first generated by combining a constituency parsing model with a large language model, and a conditional image generation model then produces the corresponding person images from the generated descriptions. In the text-guided image screening stage, the scoring function PickScore computes a similarity score between each generated person image and its text description; images with low similarity scores are screened out at a coarse granularity, and only the higher-scoring images and their descriptions are retained. In the image-text pair filtering stage, a multimodal large vision-language model computes the matching probability between each person image and its text description, and pairs whose probability falls below a threshold are removed as a fine-grained filter; the remaining high-quality pairs are added to the existing datasets as positive sample pairs. Experimental results on multiple public TBPS datasets show that, after the datasets are expanded with this method, metrics such as Rank-k and mean Average Precision (mAP) improve substantially for different retrieval baseline models. In addition, the influence of pose control and style control on the expansion results is explored, offering a direction for more in-depth follow-up research.

Key words: diversified person data expansion, constituency parsing model, large language model, conditional image generation model, multimodal large model
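
The coarse-then-fine filtering described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `similarity` and `match_prob` callables are hypothetical stand-ins for PickScore and for the multimodal model's matching probability, the toy score tables are invented for the example, and the choices of keeping the top-scoring image per text (`keep_top`) and a threshold of 0.5 are assumptions, since the abstract does not state the exact selection rule or threshold value.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (image_id, text_id)


def coarse_filter(
    candidates: Dict[str, List[str]],
    similarity: Callable[[str, str], float],
    keep_top: int = 1,
) -> List[Pair]:
    """Stage 1 (coarse): for each text, keep the `keep_top` highest-scoring images."""
    kept: List[Pair] = []
    for text, image_ids in candidates.items():
        ranked = sorted(image_ids, key=lambda img: similarity(img, text), reverse=True)
        kept.extend((img, text) for img in ranked[:keep_top])
    return kept


def fine_filter(
    pairs: List[Pair],
    match_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Stage 2 (fine): drop pairs whose matching probability is below the threshold."""
    return [(img, text) for img, text in pairs if match_prob(img, text) >= threshold]


# Toy stand-ins for the real scorers (hypothetical values, for illustration only).
TOY_SIMILARITY = {
    ("img_a1", "a"): 0.9, ("img_a2", "a"): 0.4,
    ("img_b1", "b"): 0.7, ("img_b2", "b"): 0.8,
}
TOY_MATCH_PROB = {("img_a1", "a"): 0.95, ("img_b2", "b"): 0.30}


def toy_similarity(img: str, text: str) -> float:
    return TOY_SIMILARITY[(img, text)]


def toy_match_prob(img: str, text: str) -> float:
    return TOY_MATCH_PROB.get((img, text), 0.0)


# Two generated descriptions, each with two candidate generated images.
candidates = {"a": ["img_a1", "img_a2"], "b": ["img_b1", "img_b2"]}
stage1 = coarse_filter(candidates, toy_similarity, keep_top=1)
stage2 = fine_filter(stage1, toy_match_prob, threshold=0.5)
print(stage2)  # [('img_a1', 'a')]
```

In this toy run the coarse stage keeps one image per description, and the fine stage then discards the pair for description "b" because its matching probability is below the assumed threshold; only the surviving pairs would be added to the dataset as positive samples.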

WANG Jingyao, CAO Min. Diversified Data Expansion Method Using Text-Based Person Image Search[J]. Computer Engineering, 2024, 50(12): 276-287.

https://www.ecice06.com/CN/Y2024/V50/I12/276

Figures/Tables 13

Fig.1 Framework of person text expansion method based on CPEM and ChatGPT

Fig.2 Person image generation method based on SDXL model

Fig.3 Examples of multimodal data generation

Fig.4 Examples of image generation controlled by LoRA models

Fig.5 Combination of DPDGF and ControlNet models

