
计算机工程 (Computer Engineering)



MF-cache: A CLIP-Based Multimodal Cache Model for Maize Disease Recognition

  • Published: 2025-10-13


Abstract: Maize is a vital economic crop, widely used in industry, animal husbandry, and grain-oil processing, so timely identification of maize diseases is crucial for safeguarding yield. Deep learning methods such as convolutional neural networks (CNNs) have been widely applied to disease recognition, but most existing methods rely solely on image information, overlooking features from other modalities, and their large parameter counts and high deployment costs hinder practical application. To address these challenges, we propose MF-cache, a lightweight image-text multimodal cache model with only 0.061M parameters that combines low computational cost with high recognition accuracy. The model uses the multimodal pre-trained model CLIP to extract image and text features, which are fused in parallel to build a learnable key-value cache enriched with domain knowledge. A weighted two-stage fusion mechanism then dynamically adjusts each modality's contribution to the classification result, improving both stability and interpretability. To improve robustness, several data augmentation strategies increase sample diversity and mitigate overfitting in low-data scenarios. Experimental results on the self-constructed CornI&T dataset and the public PlantVillage dataset show that the method achieves accuracies of 99.72% and 98.80%, respectively, demonstrating strong generalization. These results indicate that the proposed method delivers high recognition performance while maintaining low computational overhead, offering an efficient and practical solution for crop disease detection, and they highlight the potential of combining multimodal pre-trained models with few-shot learning in intelligent agricultural applications.
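The abstract does not specify the exact fusion rule or cache update, so the following is only a minimal NumPy sketch of the general idea of a CLIP-style key-value cache classifier with two-stage fusion: keys are fused image-text features of the few-shot training samples, values are one-hot labels, and a query's cache logits are blended with CLIP's zero-shot logits. All names and hyperparameters here (`w_img`, `alpha`, `beta`, the weighted-sum fusion) are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project features onto the unit sphere so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_cache(image_feats, text_feats, labels, num_classes, w_img=0.5):
    # Keys: parallel fusion of CLIP image and text features.
    # A weighted sum is used here as an ASSUMED fusion rule.
    keys = l2_normalize(w_img * image_feats + (1.0 - w_img) * text_feats)
    # Values: one-hot class labels of the few-shot training samples.
    values = np.eye(num_classes)[labels]
    return keys, values

def classify(query_feat, keys, values, clip_logits, alpha=1.0, beta=5.5):
    # Stage 1: affinity between the query and every cached key.
    q = l2_normalize(query_feat)
    affinity = np.exp(-beta * (1.0 - q @ keys.T))
    # Cached one-hot values turn affinities into per-class votes.
    cache_logits = affinity @ values
    # Stage 2: weighted fusion with CLIP's zero-shot logits.
    return clip_logits + alpha * cache_logits
```

In this sketch the cache itself carries the domain knowledge: classifying a query reduces to similarity lookups against stored key-value pairs, so the only trainable tensors would be the keys, which is consistent with the very small parameter count the abstract reports.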