Classification Method for Medical Data Based on the Fusion of Large and Small Models

doi:10.19678/j.issn.1000-3428.0070408

Abstract:

Medical data cover a wide range of areas, are large in volume, and come in many types, which makes privacy protection difficult. To classify medical data so that privacy protection measures can be matched to the classification results, this article proposes a classification method that fuses large and small models according to the sensitivity level of medical information, with the goal of classification-based encryption of medical data. A Large Language Model (LLM), a deep neural network, is combined with the Medical Data Classification Standards (MDCS) to annotate features of the medical dataset. The output features of the LLM are then used as the input of a small text classification model, whose Long Short-Term Memory (LSTM) network learns feature representations of the text. Finally, the erroneous predictions of the small text classification model are returned to the LLM for reclassification, and the classification results of the large and small models are fused to classify medical data accurately by sensitivity level. Experimental results show that, compared with other classification models and classification standards, the fused method achieves significant improvements in model convergence, classification accuracy, and balance of data classification. This indicates that the iterative fusion mechanism of large and small models is well suited to medical data scenarios, improves classification accuracy and efficiency, and thereby supports the privacy protection of medical data.

Keywords: medical data classification, privacy protection, classification standard, fusion of large and small models, Large Language Model (LLM), machine learning
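The routing step described in the abstract, in which predictions of the small model that are judged erroneous go back to the LLM and the two sets of results are then fused, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the abstract does not say how erroneous predictions are detected or how results are fused, so the sketch treats a low confidence score as a proxy for a likely error and lets the large model's label override the small model's label in that case. The function names, the `threshold` parameter, the toy stand-in models, and the sensitivity labels `high`, `medium`, and `low` are all hypothetical.

```python
from typing import Callable, List, Tuple

Label = str
Prediction = Tuple[Label, float]  # (label, confidence in [0, 1])


def fuse_predictions(
    texts: List[str],
    small_model: Callable[[str], Prediction],
    large_model: Callable[[str], Label],
    threshold: float = 0.8,
) -> List[Label]:
    """Keep confident small-model labels; send the rest to the large model.

    Assumption: low confidence stands in for "erroneous prediction",
    because the abstract does not define the error criterion.
    """
    fused: List[Label] = []
    for text in texts:
        label, confidence = small_model(text)
        if confidence < threshold:
            # Likely error: ask the large model to reclassify this record.
            label = large_model(text)
        fused.append(label)
    return fused


# Toy stand-ins (hypothetical) for the LSTM classifier and the LLM.
def toy_small_model(text: str) -> Prediction:
    if "diagnosis" in text:
        return ("high", 0.95)
    if "appointment" in text:
        return ("low", 0.90)
    return ("low", 0.40)  # unsure about anything else


def toy_large_model(text: str) -> Label:
    return "high" if "HIV" in text or "diagnosis" in text else "medium"


records = [
    "diagnosis: type 2 diabetes",
    "appointment at 10:00",
    "HIV test result pending",
]
fused_labels = fuse_predictions(records, toy_small_model, toy_large_model)
```

In this sketch only the third record falls below the threshold, so it is the only one sent to the large model. A real system would replace the toy functions with a trained LSTM classifier and an LLM call, and would choose the error criterion and fusion rule from the full paper.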

LI Jiangtao (李江涛), MA Li (马礼), LI Yang (李阳). Classification Method for Medical Data Based on the Fusion of Large and Small Models[J]. Computer Engineering (计算机工程), 2026, 52(5): 360-370.

https://www.ecice06.com/CN/Y2026/V52/I5/360

Figures/Tables (18)

Fig.1 Fusion classification procedure of TextMDCM-bs method

Fig.2 Application architecture of large language model

Fig.3 Structure of TextTCM neural network

Fig.4 Construction process of TextTCM network

Fig.5 Structure of LSTM unit

Fig.6 Construction procedure of TextTCM

Fig.7 Comparison of convergence of loss functions

Fig.8 Comparison of convergence trends of loss functions

Fig.9 Accuracy comparison among different classification methods

Fig.10 Comparison of classification result accuracy trends

Fig.11 Comparison of comprehensive classification accuracy of each label

Fig.12 Comparison of total number of labels and classification accuracy

