
Computer Engineering ›› 2024, Vol. 50 ›› Issue (11): 49-58. doi: 10.19678/j.issn.1000-3428.0069631

• Intelligent Situation Awareness and Computing •


Bimodal Fused Depressive Tendency Recognition Algorithm Based on Textual and Acoustic Features

ZHAO Jian*, CUI Qian, SHI Jia, LIU Yue

  1. School of Information Science and Technology, Northwest University, Xi'an 710127, Shaanxi, China
  • Received: 2024-03-21  Online: 2024-11-15  Published: 2024-11-01
  • Corresponding author: ZHAO Jian
  • Supported by: International Science and Technology Cooperation Program of Shaanxi Province (2021KWZ-07)


Abstract:

In depression diagnosis, data such as facial expressions, voice signals, and written text from patients can serve as objective indicators for assessing depressive tendency. Compared with video, the text and audio modalities better protect patient privacy when sensitive personal information is handled, and both are language modalities with strong mutual correlation. To address two difficulties in depressive tendency recognition, namely that variable-length text data are hard to analyze and that manually extracted audio features have inherent limitations, this study proposes a Transformer-based fusion network optimization method. For the text modality, a Convolutional Neural Network (CNN) first extracts local features of the text at multiple scales, and a Transformer model is then introduced to capture global information and long-range dependencies. For the audio modality, a VGGish network automatically extracts audio features, reducing the influence of manual feature engineering on recognition results, and the extracted features are fed into a Transformer. Finally, to further strengthen the recognition performance of the text-audio fusion network, a Squeeze-and-Excitation (SE) channel attention mechanism is introduced, enabling the model to adaptively adjust the weight distribution between the modalities and focus more effectively on key features. Experimental results show that the bimodal fusion network achieves an accuracy of 92.7%, an improvement of 2.9 and 4.9 percentage points over the text-only and audio-only modalities, respectively.
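
To make the pipeline concrete, the sketch below shows one plausible PyTorch wiring of the architecture described above. It is a minimal illustration under stated assumptions, not the authors' implementation: all layer counts, dimensions, kernel sizes, and the mean-pooling and concatenation-fusion choices are illustrative, and the audio input is assumed to be precomputed 128-dimensional VGGish frame embeddings (the standard output size of the pretrained VGGish model). Positional encoding and training code are omitted.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: a two-layer bottleneck
    produces per-channel weights in (0, 1) that rescale the input."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels); applied to the fused feature vector, this
        # lets the model reweight text channels against audio channels.
        return x * self.fc(x)


class TextBranch(nn.Module):
    """Multi-scale 1D convolutions over token embeddings (local n-gram
    features), followed by a Transformer encoder for global context."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Kernel sizes 3/5/7 capture local features at three scales;
        # "same" padding keeps the sequence length unchanged.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, d_model, kernel_size=k, padding=k // 2)
             for k in (3, 5, 7)]
        )
        self.proj = nn.Linear(3 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)                  # (B, E, T)
        x = torch.cat([conv(x) for conv in self.convs], dim=1)  # (B, 3D, T)
        x = self.proj(x.transpose(1, 2))                        # (B, T, D)
        # Positional encoding omitted for brevity.
        return self.encoder(x).mean(dim=1)                      # (B, D)


class AudioBranch(nn.Module):
    """Takes precomputed VGGish frame embeddings (128-dim each) and
    models their temporal structure with a Transformer encoder."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vggish_frames: torch.Tensor) -> torch.Tensor:
        # vggish_frames: (B, n_frames, 128)
        return self.encoder(vggish_frames).mean(dim=1)          # (B, 128)


class BimodalFusion(nn.Module):
    """Concatenates pooled text and audio features, reweights the joint
    channels with an SE block, and classifies depressive tendency."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_classes: int = 2):
        super().__init__()
        self.text = TextBranch(vocab_size, d_model=d_model)
        self.audio = AudioBranch(d_model=d_model)
        self.se = SEBlock(2 * d_model)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, tokens, vggish_frames):
        fused = torch.cat([self.text(tokens), self.audio(vggish_frames)], dim=1)
        return self.head(self.se(fused))


# Shape check with random inputs (batch of 2, 64 tokens, 10 audio frames).
model = BimodalFusion(vocab_size=30000)
logits = model(torch.randint(0, 30000, (2, 64)), torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 2])
```

Applying the SE block to the concatenated feature vector is one simple way to realize the adaptive inter-modality weighting described in the abstract: the learned sigmoid gates scale every channel, text-derived and audio-derived alike, before classification.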

Key words: Transformer model, VGGish network, bimodal fusion, depressive tendency recognition, SE channel attention mechanism, deep learning