
Computer Engineering

   

A Multimodal Symptom Classification Method Based on Patients’ Chief-Complaint Speech and Text

  

  • Published: 2026-06-24

面向患者主诉语音与文本的多模态症状分类方法

Abstract: In real-world clinical consultations, patients usually state their chief complaints verbally, and physicians then record them as text. To judge and classify symptoms, physicians must draw on both the patient’s spoken description and the corresponding textual record, which provides a basis for subsequent clinical decision-making. This task remains challenging. Speech is susceptible to environmental noise and individual pronunciation differences, whereas textual records cannot fully capture expressive features of speech such as speaking rate, pauses, and intonation. In addition, chief complaints are usually colloquial, subjective, and unstructured, and the semantic boundaries between symptom categories can be ambiguous. As a result, single-modality methods struggle to achieve satisfactory classification performance. To address these issues, a multimodal symptom classification method based on dynamic weight decision fusion, named DWDF-MSC, is proposed. It exploits the complementarity between textual and speech information to improve the accuracy and robustness of symptom classification. The method consists of three stages: multimodal feature extraction, preliminary classification, and adaptive gated decision fusion. In the multimodal feature extraction stage, a text branch and a speech branch model the chief-complaint text and speech in parallel. The text branch uses the clinical pre-trained language model Bio_ClinicalBERT to extract global semantic features and local lexical features simultaneously, and fuses the two through a heterogeneous textual feature fusion module. This strengthens the model’s representation of both the overall semantics of a chief complaint and its local symptom-related keywords.
In the speech branch, an audio spectrogram Transformer extracts temporal acoustic representations from the speech, supplementing the expressive information that textual records fail to capture. In the preliminary classification stage, the text branch and the speech branch each output an initial classification result through their own classification modules, so that the two modalities first judge symptoms independently. In the adaptive gated decision fusion stage, which produces the final classification, fusion weights are generated dynamically from the features of each sample, and the initial results of the text and speech branches are combined by weighted fusion to obtain the final symptom classification result. Unlike simple feature concatenation or fixed-weight fusion, this strategy adjusts the contribution of each modality to the final decision on a per-sample basis, which strengthens the influence of discriminative information and improves classification stability in complex chief-complaint scenarios. Experimental results on a public medical dataset show that DWDF-MSC achieves an Accuracy of 82.43%, a Precision of 87.44%, and an F1-score of 81.52%, outperforming most mainstream baseline models on every metric. A comparison of multimodal fusion schemes further shows that the proposed dynamic weight decision fusion achieves better classification performance than feature-level fusion. In the ablation study, the complete DWDF-MSC model achieves relative improvements of 4.25% in Accuracy and 7.60% in F1-score over the variant that uses only heterogeneous text feature fusion, which demonstrates the effectiveness of the speech branch and the adaptive gated decision fusion mechanism.
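As a rough illustration of the adaptive gated decision fusion stage described above, the following Python sketch fuses the class probabilities of the two branches with a weight computed per sample. The abstract does not specify the form of the gate, so the scalar sigmoid gate over concatenated branch features, and every name used here (`gated_decision_fusion`, `gate_w`, `gate_b`, the toy inputs), are assumptions for illustration and not the DWDF-MSC implementation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_decision_fusion(text_logits, speech_logits,
                          text_feat, speech_feat, gate_w, gate_b):
    """Fuse two branch decisions with a sample-dependent scalar gate.

    Assumption: the gate is a single linear layer plus sigmoid applied to
    the concatenated branch features. g weights the text branch and
    (1 - g) weights the speech branch.
    """
    p_text = softmax(text_logits)
    p_speech = softmax(speech_logits)
    gate_input = list(text_feat) + list(speech_feat)
    g = sigmoid(sum(w * x for w, x in zip(gate_w, gate_input)) + gate_b)
    fused = [g * pt + (1.0 - g) * ps for pt, ps in zip(p_text, p_speech)]
    return g, fused


# Toy example with three symptom classes and made-up numbers.
g, fused = gated_decision_fusion(
    text_logits=[2.0, 0.5, 0.1],
    speech_logits=[0.2, 1.5, 0.3],
    text_feat=[0.4, -0.2],
    speech_feat=[0.1, 0.9],
    gate_w=[0.5, -0.3, 0.2, 0.1],  # would be learned in practice
    gate_b=0.0,
)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(f"gate={g:.3f}, fused={fused}, predicted class={predicted_class}")
```

Because the gate depends on the sample's features, a noisy recording could in principle receive a smaller speech weight than a clean one, which is the behaviour a fixed-weight average cannot provide.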
McNemar’s test yields p-values below 0.0001 between DWDF-MSC and multiple comparison methods, indicating that the differences between their classification results are statistically significant. Noise-robustness experiments show that DWDF-MSC maintains relatively stable classification performance across different signal-to-noise ratios. In summary, DWDF-MSC effectively fuses the textual and speech information in patient chief complaints, improves classification performance, and offers a feasible multimodal approach to intelligent symptom classification based on patient chief complaints.
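For readers unfamiliar with the significance test mentioned above, the sketch below shows how a McNemar comparison between two classifiers evaluated on the same test samples can be computed. The abstract does not state which variant of the test was used. This example assumes the exact binomial form, and the function name `mcnemar_exact` and the toy labels are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same samples.

    Only discordant samples matter:
      b = samples model A gets right and model B gets wrong
      c = samples model A gets wrong and model B gets right
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        if a == truth and other != truth:
            b += 1
        elif a != truth and other == truth:
            c += 1
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree on correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy labels, not data from the paper.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
model_a = [0, 1, 2, 1, 0, 2, 0, 0]
model_b = [0, 2, 1, 1, 1, 2, 0, 2]
b, c, p = mcnemar_exact(y_true, model_a, model_b)
print(f"b={b}, c={c}, p={p:.4f}")
```

A small p-value indicates that the two models' error patterns differ by more than chance would explain. The test says nothing by itself about which model is better, which is why it is reported alongside the accuracy figures.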
