
Computer Engineering ›› 2025, Vol. 51 ›› Issue (3): 76-85. doi: 10.19678/j.issn.1000-3428.0068832

• Artificial Intelligence and Pattern Recognition •

Similarity Calculation for Chinese Text Based on Dependency Graph Convolution

HU Shulin1, ZHANG Huajun1,*, DENG Xiaotao2, WANG Zhenghua2

  1. School of Automation, Wuhan University of Technology, Wuhan 430070, Hubei, China
    2. Wuhan DaSoundGen Technologies Co., Ltd., Wuhan 430075, Hubei, China
  • Received: 2023-11-13 Online: 2025-03-15 Published: 2025-03-17
  • Contact: ZHANG Huajun
  • Funding: Key Research and Development Program of Hubei Province (2022BAA051)



Abstract:

In current Chinese text similarity computation, word-embedding techniques enable discrimination of text similarity at the semantic level, but they typically overlook the rich syntactic structure inherent in texts. Moreover, Chinese syntactic analysis operates at the word level, which is inconsistent with the character-level granularity of dynamic word-embedding models; consequently, most studies that incorporate syntactic analysis can use only static word embeddings to represent the semantic vectors of words. To address this issue, this study constructs a dependency graph based on dependency parsing and uses tokenization mask mapping with attention-mixed pooling to represent the semantic features of word nodes with dynamic word embeddings. A graph convolutional network then extracts the dependency-relation information among the word nodes, and a readout of the dependency graph serves as the sentence's feature vector, so that the similarity between sentences is computed at both the semantic and syntactic levels. The proposed method is applied to two model structures, representation-based and interaction-based, and evaluated on the BQ_Corpus and ATEC datasets. The experimental results show that the models achieve highest accuracies of 87.12% and 88.33%, respectively, and that all evaluation metrics improve once dependency syntax information is incorporated.
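The pipeline described above, with character-level "dynamic" embeddings pooled into word nodes through a tokenization mask, followed by a graph convolution over the dependency arcs and a readout, can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for BERT-style character embeddings, the dependency arcs and the mean/attention mixing weight `alpha` are hand-picked, and all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Character-level contextual embeddings for a 6-character sentence
# (stand-in for the output of a BERT-style encoder).
num_chars, dim = 6, 8
char_emb = rng.normal(size=(num_chars, dim))

# Tokenization mask mapping: which characters form each word.
# Here, 3 words spanning characters [0,1], [2,3,4], and [5].
word_spans = [[0, 1], [2, 3, 4], [5]]

# Attention-mixed pooling: score each character with a (here random)
# attention vector, then mix attention pooling with mean pooling.
attn_vec = rng.normal(size=dim)
def pool_word(span, alpha=0.5):
    chars = char_emb[span]                  # (span_len, dim)
    weights = softmax(chars @ attn_vec)     # attention over characters
    attn_pool = weights @ chars
    mean_pool = chars.mean(axis=0)
    return alpha * attn_pool + (1 - alpha) * mean_pool

word_nodes = np.stack([pool_word(s) for s in word_spans])  # (3, dim)

# Dependency graph over the 3 words: word 1 heads words 0 and 2
# (arcs chosen for illustration), plus self-loops.
A = np.eye(3)
for head, dep in [(1, 0), (1, 2)]:
    A[head, dep] = A[dep, head] = 1.0

# One symmetric-normalized GCN layer: H' = ReLU(D^-1/2 A D^-1/2 H W).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.normal(size=(dim, dim))
H = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ word_nodes @ W, 0.0)

# Readout: mean over word nodes yields the sentence feature vector.
sentence_vec = H.mean(axis=0)
print(sentence_vec.shape)
```

In a representation-based structure, two sentences would each be encoded this way and compared with, e.g., cosine similarity between their readout vectors; an interaction-based structure would instead compare node features before the readout.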

Key words: graph convolutional neural network, dependency parsing, dynamic word embedding, text similarity, attention mechanism