A Multimodal Symptom Classification Method Based on Patients’ Chief-Complaint Speech and Text

doi:10.19678/j.issn.1000-3428.0260521

Abstract

In real-world clinical consultations, patients typically state their chief complaints verbally, and physicians then record them as text. To judge and classify symptoms, physicians must draw on both the patient’s spoken description and the corresponding textual record, which provides a basis for subsequent clinical decision-making. This task faces several challenges. Speech is susceptible to environmental noise and individual pronunciation differences, while textual records cannot fully reflect expressive features of speech such as speaking rate, pauses, and intonation. In addition, chief complaints are usually colloquial, subjective, and unstructured, and the semantic boundaries between symptom categories can be ambiguous. These factors make it difficult for single-modality methods to achieve satisfactory classification performance. To address these issues, a multimodal symptom classification method based on dynamic weight decision fusion, named DWDF-MSC, is proposed to exploit the complementarity between textual and speech information and to improve the accuracy and robustness of symptom classification. The method consists of three stages: multimodal feature extraction, preliminary classification, and adaptive gated decision fusion. In the multimodal feature extraction stage, a text branch and a speech branch model the chief-complaint text and speech data in parallel. The text branch uses the clinical pre-trained language model Bio_ClinicalBERT to extract global semantic features and local lexical features simultaneously, and fuses the two through a heterogeneous textual feature fusion module, which strengthens the model’s representation of both the overall semantics of a chief complaint and its local symptom-related keywords.
In the speech branch, an Audio Spectrogram Transformer extracts temporal acoustic representations from speech, supplementing expressive information that is difficult to capture from textual records. In the preliminary classification stage, the text branch and the speech branch each output an initial classification result through their own classification modules, so that the two modalities first judge symptoms independently. In the adaptive gated decision fusion stage, which produces the final classification, fusion weights are generated dynamically from the features of each sample, and the initial classification results of the text and speech branches are weighted and fused to obtain the final symptom classification result. Unlike simple feature concatenation or fixed-weight fusion, this strategy adjusts the contribution of each modality to the final decision on a per-sample basis, which strengthens the influence of discriminative information on the classification result and improves classification stability in complex chief-complaint scenarios. Experimental results on a public medical dataset show that DWDF-MSC achieves 82.43%, 87.44%, and 81.52% in Accuracy, Precision, and F1-score, respectively, exceeding most mainstream baseline models on every metric. A comparison of multimodal fusion schemes further shows that the proposed dynamic weight decision fusion achieves better classification performance than feature-level fusion. In the ablation study, the complete DWDF-MSC model achieves relative improvements of 4.25% and 7.60% in Accuracy and F1-score, respectively, over the variant that employs only heterogeneous text feature fusion, demonstrating the effectiveness of the speech branch and the adaptive gated decision fusion mechanism.
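The adaptive gated decision fusion described above can be illustrated with a minimal sketch. The abstract does not give the gate's exact form, so everything specific here is an assumption made for illustration: a scalar sigmoid gate computed from the concatenated text and speech features, the linear gate parameters `w` and `b`, the function names, and the toy numbers. It is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weight(text_feat: Sequence[float], speech_feat: Sequence[float],
                w: Sequence[float], b: float) -> float:
    """Hypothetical per-sample gate: sigmoid of a linear map over the
    concatenated text and speech features (an assumed form)."""
    joint = list(text_feat) + list(speech_feat)
    z = sum(wi * xi for wi, xi in zip(w, joint)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fuse_decisions(text_logits: Sequence[float],
                   speech_logits: Sequence[float], g: float) -> List[float]:
    """Decision-level fusion: weight each branch's class probabilities
    by g (text) and 1 - g (speech)."""
    p_text = softmax(text_logits)
    p_speech = softmax(speech_logits)
    return [g * pt + (1.0 - g) * ps for pt, ps in zip(p_text, p_speech)]


# Toy sample with three symptom classes (all values are made up).
text_feat = [0.5, -1.0]
speech_feat = [2.0, 0.1]
w = [1.0, 0.5, -0.25, 0.0]  # assumed gate parameters, normally learned
b = 0.0

g = gate_weight(text_feat, speech_feat, w, b)
fused = fuse_decisions([2.0, 0.5, 0.1], [0.2, 1.5, 0.3], g)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(f"gate={g:.3f} fused={[round(p, 3) for p in fused]} class={predicted}")
```

Because the gate is recomputed for every sample, the same pair of branch outputs can be fused differently depending on the features, which is the property that distinguishes this scheme from fixed-weight fusion.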
McNemar tests show that the p-values between DWDF-MSC and multiple comparison methods are below 0.0001, indicating that the differences in classification results between DWDF-MSC and these methods are statistically significant. Noise-robustness experiments show that DWDF-MSC maintains relatively stable classification performance under different signal-to-noise ratio conditions. In summary, DWDF-MSC effectively fuses the textual and speech information in patient chief complaints, improves classification performance, and provides a feasible multimodal method for intelligent symptom classification based on patient chief complaints.
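As a reminder of what the McNemar test mentioned above compares, the following sketch computes a two-sided exact McNemar p-value from paired predictions of two classifiers on the same test samples. The abstract does not say which variant of the test was used (exact binomial or chi-square), so the exact form is an assumption here, and the function name and toy labels are hypothetical.

```python
import math
from typing import Sequence


def mcnemar_exact(y_true: Sequence[int], pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for two classifiers evaluated
    on the same samples. Only discordant pairs enter the test."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for y, a, o in zip(y_true, pred_a, pred_b) if a == y and o != y)
    c = sum(1 for y, a, o in zip(y_true, pred_a, pred_b) if a != y and o == y)
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree in correctness
    k = min(b, c)
    # Under H0, discordant pairs split as Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data: model A is always right, model B is wrong on the first six samples.
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
pred_a = list(y_true)
pred_b = [1, 2, 0, 1, 2, 0, 0, 1, 2, 0]

p_value = mcnemar_exact(y_true, pred_a, pred_b)
p_same = mcnemar_exact(y_true, pred_a, pred_a)
print(p_value, p_same)
```

With six discordant pairs all favoring model A, the p-value is 2 × (1/64) = 0.03125; a p-value below 0.0001, as reported in the abstract, requires a much larger and similarly one-sided set of discordant pairs.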

ZHANG Haoran, JIAN Muwei, WANG Rui, SONG Zengkai. A Multimodal Symptom Classification Method Based on Patients’ Chief-Complaint Speech and Text[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260521.

张昊然, 蹇木伟, 王瑞, 宋增凯. 面向患者主诉语音与文本的多模态症状分类方法[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0260521.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260521
