Solution Algorithm of String Similarity Based on Improved Levenshtein Distance

doi:10.3969/j.issn.1000-3428.2014.01.047

Computer Engineering

JIANG Hua (姜华)^a,b, HAN An-qi (韩安琪)^a,b, WANG Mei-jia (王美佳)^a,b, WANG Zheng (王峥)^a,b, WU Yun-ling (吴雲玲)^a,b

(a. School of Computer Science and Information Technology; b. University Key Laboratory of Intelligent Information Processing in Jilin Province, Northeast Normal University, Changchun 130117, China)

Received: 2012-10-15 Publication date: 2014-01-15 Release date: 2014-01-13
About the authors: JIANG Hua (born 1964), male, associate professor; main research interests: text mining, Web mining, and clustering algorithms. HAN An-qi, WANG Mei-jia, WANG Zheng, and WU Yun-ling are master's students.
Funding:
Project supported by the Jilin Provincial Development and Reform Commission (吉发改高技[2012]747号)

Abstract

Abstract: When measuring the similarity of two strings, the Levenshtein Distance (LD) algorithm considers only the number of edit operations and ignores the effect of the strings' common substrings on similarity. To address this, an algorithm for computing string similarity based on an improved Levenshtein distance is proposed, which revises both the similarity formula and the computation of the Levenshtein matrix. While computing the edit distance, the algorithm uses the original matrix to also obtain the longest common substring of the two strings and all LD backtracking paths. In the experiment, one word is chosen as the source string and a set of words with varying degrees of similarity to it as the target strings, and the improved similarity formula is compared with existing string similarity measures. The improved formula reduces the number of target strings that enter the winner table, and the sample range and standard deviation of the similarity are 0.331 and 0.150, respectively. Experimental results show that, without changing the space complexity, the improved algorithm computes string similarity more accurately and supports more flexible queries.

Key words: Levenshtein Distance (LD), LD algorithm, backtracking path, longest common substring, similarity, fuzzy query
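
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It covers only the classic pieces: the Levenshtein matrix, the commonly used baseline similarity 1 - LD / max(|s|, |t|), the longest common substring, and a count of the optimal backtracking paths through the matrix. The abstract does not state the improved similarity formula, so it is not reproduced here. All function names are hypothetical, and the longest common substring is computed with its own small table, which is an assumption of this sketch rather than the single-matrix method the paper describes.

```python
def levenshtein_matrix(source: str, target: str) -> list[list[int]]:
    """Build the full (len(source)+1) x (len(target)+1) edit-distance matrix."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete i characters of source
    for j in range(cols):
        d[0][j] = j  # insert j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
    return d


def classic_similarity(source: str, target: str) -> float:
    """Baseline similarity 1 - LD / max length (not the paper's improved formula)."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_matrix(source, target)[-1][-1] / longest


def longest_common_substring(source: str, target: str) -> str:
    """Return the first longest contiguous substring shared by both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(source) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if source[i - 1] == target[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return source[best_end - best_len:best_end]


def count_backtracking_paths(source: str, target: str) -> int:
    """Count the minimum-cost paths from cell (0, 0) to the bottom-right cell."""
    d = levenshtein_matrix(source, target)
    paths = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]
    paths[0][0] = 1
    for i in range(len(source) + 1):
        for j in range(len(target) + 1):
            if i > 0 and d[i - 1][j] + 1 == d[i][j]:
                paths[i][j] += paths[i - 1][j]
            if j > 0 and d[i][j - 1] + 1 == d[i][j]:
                paths[i][j] += paths[i][j - 1]
            if i > 0 and j > 0:
                cost = 0 if source[i - 1] == target[j - 1] else 1
                if d[i - 1][j - 1] + cost == d[i][j]:
                    paths[i][j] += paths[i - 1][j - 1]
    return paths[-1][-1]


if __name__ == "__main__":
    print(levenshtein_matrix("kitten", "sitting")[-1][-1])  # 3
    print(longest_common_substring("kitten", "sitting"))    # itt
    print(round(classic_similarity("kitten", "sitting"), 3))
```

Under the baseline formula, two targets at the same edit distance and length from the source score identically regardless of how much contiguous text they share with it; the longest common substring is the extra signal the abstract says the improved formula takes into account.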

CLC number: TP311.12

JIANG Hua, HAN An-qi, WANG Mei-jia, WANG Zheng, WU Yun-ling. Solution Algorithm of String Similarity Based on Improved Levenshtein Distance[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.01.047.

http://www.ecice06.com/CN/Y2014/V40/I1/222

Editor: REN Ji-hui

