
Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 97-104. doi: 10.19678/j.issn.1000-3428.0066057

• Artificial Intelligence and Pattern Recognition •

  • Author biographies:

    LIU Dong (born 1986), male, senior engineer, M.S.; his research interests include artificial intelligence and information security.

    YANG Hui, senior engineer, M.S.

    CAO Yang, senior engineer, M.S.

  • Funding:
    Joint Fund for Enterprise Innovation and Development of the National Natural Science Foundation of China (U20B2049); 2018-2019 Open Fund of the Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory (w-2019010)

Text Similarity Computing Model Based on Weighted Combination of Multiple Models

Dong LIU1,2, Hui YANG2,3, Shaopei JI1,2,*, Yang CAO2,3   

  1. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu 610041, China
    2. CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China
    3. Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory, Guiyang 550022, China
  • Received:2022-10-20 Online:2023-10-15 Published:2023-10-10
  • Contact: Shaopei JI


Abstract:

A text similarity computing model based on a weighted combination of multiple models is established to address two shortcomings of traditional text similarity computing models: they do not consider semantic and structural information, and they tend to ignore fine-grained text features. Taking word order, topic, semantics, and other content into account, an embedding representation is produced for each word in a sentence; a max pooling operation and a Bi-directional Gated Recurrent Unit (Bi-GRU) neural network serve as the encoder that generates the sentence embedding, and the similarity relationship between sentence embeddings is learned through multi-level comparison. In parallel, the text is converted into a structured representation, and phrase-based shallow syntax-tree features are extracted as input to a Tree-GRU for text similarity computing. The two similarity scores are then linearly weighted to obtain the final text similarity. Experimental results show that the model achieves its best similarity results when the weight parameters C1 and C2 are set to 0.6 and 0.4, respectively. The precision, recall, and F1 score reach 90.32%, 90.89%, and 90.52% on the STSB dataset; 85.41%, 85.95%, and 85.61% on the SICK dataset; and 90.32%, 90.89%, and 90.52% on the MRPC dataset. The proposed model fully exploits the multi-level content and structural information of text, is suitable for processing complex long texts, and achieves better text similarity results than models such as DT-TEAM and ECNU.

Key words: text feature, multi-word embedding, multi-level comparison, shallow syntax tree, linear weighting, text similarity
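
The final linear-weighting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `s1` stands in for the Bi-GRU multi-level comparison score (approximated here as a cosine similarity over max-pooled word vectors) and `s2` is a hypothetical placeholder for the Tree-GRU structural score; the weights C1=0.6 and C2=0.4 follow the paper's reported best setting.

```python
import numpy as np

def max_pool_encode(word_vectors):
    """Element-wise max over word vectors -> one sentence embedding."""
    return np.max(np.stack(word_vectors), axis=0)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_similarity(s1, s2, c1=0.6, c2=0.4):
    """Linear weighting of the two model scores: sim = C1*s1 + C2*s2."""
    return c1 * s1 + c2 * s2

# Toy word vectors for two sentences (illustrative values only).
sent_a = [np.array([0.2, 0.9, 0.1]), np.array([0.8, 0.3, 0.5])]
sent_b = [np.array([0.1, 0.8, 0.2]), np.array([0.9, 0.2, 0.4])]

s1 = cosine(max_pool_encode(sent_a), max_pool_encode(sent_b))
s2 = 0.75  # hypothetical Tree-GRU structural score
print(round(combined_similarity(s1, s2), 4))
```

In practice both scores would come from trained encoders; the sketch only shows how the two branches are fused by the fixed linear weights.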