Review of Text Similarity Calculation Methods

doi:10.19678/j.issn.1000-3428.0068086

Abstract

Text similarity calculation is a branch of natural language processing that measures how similar two words, sentences, or texts are. It has many application scenarios, and research on it plays an important role in the development of artificial intelligence. Early text similarity calculation relied on the surface form of character strings. With the introduction of word vectors, text similarity can also be modeled and computed with statistical and deep learning methods, and these can be combined with pre-trained models. This review first divides text similarity calculation methods into five categories (character string-based, word vector-based, pre-trained model-based, deep learning-based, and other methods) and briefly introduces each. It then describes common methods according to their underlying principles, including edit distance, Hamming distance, the bag-of-words model, the Vector Space Model (VSM), the Deep Structured Semantic Model (DSSM), and Simple Contrastive Learning of Sentence Embeddings (SimCSE). Finally, it summarizes and analyzes the data sets and evaluation criteria commonly used for text similarity calculation and discusses prospects for future development.

Key words: text similarity, character string, word vector, pre-trained model, deep learning
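
As a rough illustration of three of the methods named in the abstract (edit distance, Hamming distance, and a bag-of-words comparison), the following minimal Python sketch may help. It is not taken from the paper: the function names, the whitespace tokenization, and the use of cosine similarity over raw term counts are assumptions made here for illustration only.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (assumed whitespace tokens)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # missing terms count as 0
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(hamming_distance("karolin", "kathrin"))
    print(bow_cosine("text similarity methods", "text similarity review"))
```

The two string-based measures compare surface characters only, whereas the bag-of-words comparison ignores word order. Semantic models such as DSSM or SimCSE, which the review also covers, need trained encoders and are outside the scope of this sketch.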

WEI Wei, DING Xiangxiang, GUO Mengxing, YANG Zhao, LIU Hui. Review of Text Similarity Calculation Methods[J]. Computer Engineering, 2024, 50(9): 18-32.

https://www.ecice06.com/CN/Y2024/V50/I9/18

Figures/Tables (10)

Fig.1 Principle of Word2Vec model

Fig.2 Structure of BERT model

Fig.3 Structure of DSSM model

Fig.4 Classification of text similarity calculation methods

