Computer Engineering ›› 2026, Vol. 52 ›› Issue (3): 420-428. doi: 10.19678/j.issn.1000-3428.0252659

• Interdisciplinary Integration and Engineering Applications •

MF-cache: CLIP-Based Multimodal Cache Model for Maize Disease Recognition

SUN Wei, CHEN Junjie*

  1. College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010011, Inner Mongolia, China
  • Received: 2025-06-23 Revised: 2025-08-21 Online: 2026-03-15 Published: 2025-10-13
  • Contact: CHEN Junjie

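  • About the authors:

    SUN Wei, male, master's degree; main research interest: multimodal learning

    CHEN Junjie (corresponding author), associate professor, Ph.D.

  • Funding:
    Special Fund Project for the Transformation of Scientific and Technological Achievements of Inner Mongolia Autonomous Region (2020CG0054)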

Abstract:

Maize is a vital economic crop that is widely used in industry, animal husbandry, and grain and oil processing. Timely identification of maize diseases is crucial for ensuring a stable yield. Deep learning methods such as Convolutional Neural Networks (CNNs) are now widely applied to disease recognition. However, most existing methods rely solely on image information and overlook the features of other modalities. Moreover, their large parameter counts and high deployment costs hinder practical application. To address these challenges, we propose MF-cache, a lightweight image-text multimodal cache model that contains only 61 000 parameters and combines low computational cost with high recognition accuracy. The model uses the multimodal pre-trained model CLIP to extract image and text features, which are fused in parallel to build a learnable key-value cache structure enriched with domain knowledge. In addition, a weighted two-stage fusion mechanism dynamically adjusts the contribution of each modality to the classification outcome, which improves both stability and interpretability. To improve robustness, several data augmentation strategies are employed to increase sample diversity and mitigate overfitting in low-data scenarios. Experiments on a self-constructed dataset, CornI&T, and the public PlantVillage dataset yield accuracies of 99.72% and 98.80%, respectively, indicating good generalization. These results show that the method achieves excellent recognition performance while maintaining low computational overhead, offering an efficient and practical solution for crop disease detection. They also highlight the potential of combining multimodal pre-trained models with few-shot learning in intelligent agricultural applications.
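
The abstract does not give the model's equations, so the following is only a minimal sketch of how a CLIP-style key-value cache classifier is commonly built, not the authors' implementation. It assumes a Tip-Adapter-style formulation: cache keys are weighted fusions of image and class-text features, cache values are one-hot labels, and the final logits are a weighted sum of a zero-shot text term and a cache term. The function names and the weights `w_img`, `alpha`, and `beta` are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_cache(image_feats, text_feats, labels, num_classes, w_img=0.5):
    """Stage 1 (assumed): fuse each few-shot image feature with its class text feature.

    Keys are the fused, normalized features; values are one-hot labels.
    """
    fused = w_img * image_feats + (1.0 - w_img) * text_feats[labels]
    keys = l2_normalize(fused)
    values = np.eye(num_classes)[labels]
    return keys, values

def cache_logits(query, keys, values, text_feats, alpha=1.0, beta=5.5):
    """Stage 2 (assumed): combine zero-shot text logits with cache-retrieval logits."""
    query = l2_normalize(query)
    affinity = query @ keys.T                       # cosine similarity to every cache key
    cache_term = np.exp(-beta * (1.0 - affinity)) @ values
    zero_shot = 100.0 * query @ l2_normalize(text_feats).T
    return zero_shot + alpha * cache_term

# Toy data: random class prototypes stand in for CLIP image/text embeddings.
rng = np.random.default_rng(0)
num_classes, dim, shots = 4, 32, 3
prototypes = l2_normalize(rng.normal(size=(num_classes, dim)))
labels = np.repeat(np.arange(num_classes), shots)
image_feats = l2_normalize(prototypes[labels] + 0.05 * rng.normal(size=(labels.size, dim)))
text_feats = prototypes

keys, values = build_cache(image_feats, text_feats, labels, num_classes)
queries = l2_normalize(prototypes + 0.05 * rng.normal(size=prototypes.shape))
logits = cache_logits(queries, keys, values, text_feats)
pred = logits.argmax(axis=1)
```

In this sketch `alpha` controls how much the cache influences the prediction relative to the zero-shot term, and `beta` controls how sharply the cache favors near-identical keys. In a trainable variant, the keys would be the learnable parameters.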

Key words: maize disease recognition, multimodal cache, pre-trained model, CLIP model, few-shot
