
Computer Engineering ›› 2025, Vol. 51 ›› Issue (3): 76-85. doi: 10.19678/j.issn.1000-3428.0068832

• Artificial Intelligence and Pattern Recognition •

Similarity Calculation for Chinese Text Based on Dependency Graph Convolution

HU Shulin1, ZHANG Huajun1,*, DENG Xiaotao2, WANG Zhenghua2

  1. School of Automation, Wuhan University of Technology, Wuhan 430070, Hubei, China
    2. Wuhan DaSoundGen Technologies Co., Ltd., Wuhan 430075, Hubei, China
  • Received: 2023-11-13 Online: 2025-03-15 Published: 2025-03-17
  • Contact: ZHANG Huajun
  • Funding: Key Research and Development Program of Hubei Province (2022BAA051)



Abstract:

In current Chinese text similarity computation, word-embedding techniques enable discrimination of text similarity at the semantic level, but they typically overlook the rich syntactic structure inherent in texts. Moreover, Chinese syntactic analysis operates at the word level, which is inconsistent with the character-level granularity of dynamic word-embedding models; consequently, most studies that incorporate syntactic analysis can use only static word embeddings to represent the semantic vectors of words. To address this issue, this study constructs a dependency graph based on dependency parsing and uses tokenization mask mapping with attention-mixed pooling to represent the semantic features of word nodes with dynamic word embeddings. A graph convolutional network then extracts the dependency-relation information among the word nodes, and a readout of the dependency graph serves as the sentence's feature vector, so that the similarity between sentences is computed at both the semantic and syntactic levels. The proposed method is applied to two model structures, representation-based and interaction-based, and evaluated on the BQ_Corpus and ATEC datasets. The experimental results show that the models achieve highest accuracies of 87.12% and 88.33%, respectively, and that all evaluation metrics improve once dependency syntax information is incorporated.
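The pipeline described above, with character-level "dynamic" embeddings pooled into word nodes through a tokenization mask, followed by a graph convolution over the dependency arcs and a readout, can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for BERT-style character embeddings, the dependency arcs and the mean/attention mixing weight `alpha` are hand-picked, and all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Character-level contextual embeddings for a 6-character sentence
# (stand-in for the output of a BERT-style encoder).
num_chars, dim = 6, 8
char_emb = rng.normal(size=(num_chars, dim))

# Tokenization mask mapping: which characters form each word.
# Here, 3 words spanning characters [0,1], [2,3,4], and [5].
word_spans = [[0, 1], [2, 3, 4], [5]]

# Attention-mixed pooling: score each character with a (here random)
# attention vector, then mix attention pooling with mean pooling.
attn_vec = rng.normal(size=dim)
def pool_word(span, alpha=0.5):
    chars = char_emb[span]                  # (span_len, dim)
    weights = softmax(chars @ attn_vec)     # attention over characters
    attn_pool = weights @ chars
    mean_pool = chars.mean(axis=0)
    return alpha * attn_pool + (1 - alpha) * mean_pool

word_nodes = np.stack([pool_word(s) for s in word_spans])  # (3, dim)

# Dependency graph over the 3 words: word 1 heads words 0 and 2
# (arcs chosen for illustration), plus self-loops.
A = np.eye(3)
for head, dep in [(1, 0), (1, 2)]:
    A[head, dep] = A[dep, head] = 1.0

# One symmetric-normalized GCN layer: H' = ReLU(D^-1/2 A D^-1/2 H W).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.normal(size=(dim, dim))
H = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ word_nodes @ W, 0.0)

# Readout: mean over word nodes yields the sentence feature vector.
sentence_vec = H.mean(axis=0)
print(sentence_vec.shape)
```

In a representation-based structure, two sentences would each be encoded this way and compared with, e.g., cosine similarity between their readout vectors; an interaction-based structure would instead compare node features before the readout.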

Key words: graph convolutional neural network, dependency parsing, dynamic word embedding, text similarity, attention mechanism