Solution Algorithm of String Similarity Based on Improved Levenshtein Distance

doi:10.3969/j.issn.1000-3428.2014.01.047

Computer Engineering

JIANG Hua^a,b, HAN An-qi^a,b, WANG Mei-jia^a,b, WANG Zheng^a,b, WU Yun-ling^a,b

(a. School of Computer Science and Information Technology; b. University Key Laboratory of Intelligent Information Processing in Jilin Province, Northeast Normal University, Changchun 130117, China)

Received: 2012-10-15 Online: 2014-01-15 Published: 2014-01-13
About the authors: JIANG Hua (born 1964), male, associate professor; main research interests: text mining, Web mining, and clustering algorithms. HAN An-qi, WANG Mei-jia, WANG Zheng, and WU Yun-ling are master's students.
Funding:
Project funded by the Jilin Provincial Development and Reform Commission (Ji Fa Gai Gao Ji [2012] No. 747)

Abstract

Abstract: When measuring the similarity of two strings, the Levenshtein distance (LD) algorithm considers only the number of edit operations and ignores the effect that the strings' common substrings have on similarity. To address this, a string similarity algorithm based on an improved Levenshtein distance is proposed, which revises both the similarity measure and the way the Levenshtein matrix is computed. While computing the edit distance, the algorithm uses the original matrix to obtain the longest common substring of the two strings and all LD backtracking paths. In the experiment, one word is chosen as the source string and a group of words with varying degrees of similarity to it as the target strings, and the improved similarity formula is compared with existing string similarity measures. The improved formula reduces the number of target strings that enter the winner table, and the sample range and standard deviation of the similarity values are 0.331 and 0.150, respectively. Experimental results show that, without changing the space complexity, the improved algorithm computes string similarity more accurately and supports more flexible queries.

Key words: Levenshtein distance (LD), LD algorithm, backtracking path, longest common substring, similarity, fuzzy query
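
The ingredients named in the abstract (the Levenshtein matrix, the longest common substring, and the LD backtracking paths) can be illustrated with a minimal Python sketch. All function names here are hypothetical, and `baseline_similarity` uses the commonly seen normalization 1 - LD / max(len), not the improved formula of the paper, which this page does not reproduce. The paper states that it derives the longest common substring from the original matrix; the sketch computes it in a separate pass purely for readability.

```python
from typing import List


def ld_matrix(source: str, target: str) -> List[List[int]]:
    """Build the standard (len(source)+1) x (len(target)+1) Levenshtein matrix."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i source characters
    for j in range(n + 1):
        d[0][j] = j  # insert the first j target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d


def longest_common_substring(source: str, target: str) -> str:
    """Longest contiguous common substring, tracked with a rolling row of run lengths."""
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(source) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if source[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return source[best_end - best_len:best_end]


def backtracking_paths(source: str, target: str) -> List[List[str]]:
    """Enumerate every optimal edit script by walking the matrix back to (0, 0)."""
    d = ld_matrix(source, target)
    paths: List[List[str]] = []

    def walk(i: int, j: int, ops: List[str]) -> None:
        if i == 0 and j == 0:
            paths.append(ops[::-1])  # ops were collected back to front
            return
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                walk(i - 1, j - 1, ops + ["match" if cost == 0 else "substitute"])
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            walk(i - 1, j, ops + ["delete"])
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            walk(i, j - 1, ops + ["insert"])

    walk(len(source), len(target), [])
    return paths


def baseline_similarity(source: str, target: str) -> float:
    """Common LD-only normalization (an assumption, not the paper's improved formula)."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - ld_matrix(source, target)[-1][-1] / longest
```

For the pair "kitten" and "sitting", `ld_matrix(...)[-1][-1]` is 3 and `longest_common_substring(...)` is "itt". The baseline measure sees only the former, which is the gap the abstract describes.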

CLC number:

TP311.12

JIANG Hua, HAN An-qi, WANG Mei-jia, WANG Zheng, WU Yun-ling. Solution Algorithm of String Similarity Based on Improved Levenshtein Distance[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2014/V40/I1/222

Editor: REN Ji-hui
