Research on Paraphrase Identification Based on Improved Sentence Similarity Algorithm

doi:10.19678/j.issn.1000-3428.0055313

Computer Engineering, 2020, Vol. 46, Issue (9): 76-82.

CHEN Junyue, HAO Wenning, ZHANG Zixuan, TANG Xinde, KANG Ruizhi, MO Fei

Institute of Command Information System, Army Engineering University, Nanjing 210000, China

Received: 2019-06-27 Revised: 2019-08-28 Published: 2019-10-14
About the authors: CHEN Junyue (born 1995), female, master's student, main research interest: natural language processing; HAO Wenning (corresponding author), professor; ZHANG Zixuan and TANG Xinde, master's students; KANG Ruizhi, doctoral student; MO Fei, master's student.
Funding:
National Natural Science Foundation of China, "Rapid Generation and Dynamic Repair Methods for Courses of Action Based on Hierarchical Task Networks in an Effect-Driven Context" (61806221).

Abstract

Abstract: Existing sentence similarity algorithms cannot handle synonyms and suffer from low accuracy and high complexity. To address these problems, this paper uses word embeddings to improve the Levenshtein similarity algorithm and the Jaccard index, and proposes a new sentence similarity algorithm for paraphrase identification. It also analyzes the strengths and weaknesses of several sentence similarity algorithms and designs an application mode that combines multiple similarity features. Experimental results on the MRPC paraphrase identification dataset show that a paraphrase identification model using the proposed algorithm reaches an accuracy of 74.4% and an F1 score of 83.1%, outperforming models that use traditional algorithms such as TF-IDF and Bag of Words (BoW).

Key words: sentence similarity, Jaccard index, Levenshtein distance (LD), word embedding, paraphrase identification, multi-feature combination
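
The abstract names the two classical measures that the paper improves but does not give the improved formulas, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It computes word-level Levenshtein similarity and the Jaccard index, first with exact token matching and then with a synonym-aware match based on cosine similarity between word vectors. The names `TOY_VECTORS`, `embedding_match`, and the 0.8 threshold are hypothetical and chosen for illustration; a real experiment would load pretrained embeddings instead.

```python
import math
from typing import Callable, Sequence

Match = Callable[[str, str], bool]


def exact_match(x: str, y: str) -> bool:
    return x == y


def levenshtein_similarity(a: Sequence[str], b: Sequence[str],
                           same: Match = exact_match) -> float:
    """1 - (word-level edit distance / longer length); `same` decides token equality."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if same(tok_a, tok_b) else 1
            curr.append(min(prev[j] + 1,           # delete tok_a
                            curr[j - 1] + 1,       # insert tok_b
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard_similarity(a: Sequence[str], b: Sequence[str],
                       same: Match = exact_match) -> float:
    """|A ∩ B| / |A ∪ B| over unique tokens, with greedy one-to-one matching.

    With exact matching this is the classical Jaccard index. With a soft
    match the greedy pairing is an approximation (an assumption of this sketch).
    """
    ua, ub = list(dict.fromkeys(a)), list(dict.fromkeys(b))
    if not ua and not ub:
        return 1.0
    unused = list(ub)
    inter = 0
    for tok in ua:
        for cand in unused:
            if same(tok, cand):
                unused.remove(cand)
                inter += 1
                break
    return inter / (len(ua) + len(ub) - inter)


# Hypothetical 2-d vectors standing in for pretrained word embeddings.
TOY_VECTORS = {
    "buy": (0.9, 0.1), "purchase": (0.88, 0.15),
    "car": (0.1, 0.9), "automobile": (0.12, 0.88),
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def embedding_match(x: str, y: str, threshold: float = 0.8) -> bool:
    """Treat two tokens as equal if identical or if their vectors are close."""
    if x == y:
        return True
    if x in TOY_VECTORS and y in TOY_VECTORS:
        return cosine(TOY_VECTORS[x], TOY_VECTORS[y]) >= threshold
    return False


s1 = "i want to buy a car".split()
s2 = "i want to purchase an automobile".split()
print("Levenshtein exact/soft:",
      levenshtein_similarity(s1, s2),
      levenshtein_similarity(s1, s2, embedding_match))
print("Jaccard exact/soft:",
      jaccard_similarity(s1, s2),
      jaccard_similarity(s1, s2, embedding_match))
```

In this toy pair, exact matching treats "buy"/"purchase" and "car"/"automobile" as unrelated tokens, while the embedding-based match counts them as equal, so both similarity scores rise. This illustrates the synonym limitation the abstract attributes to existing algorithms.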

CLC number: TP18

CHEN Junyue, HAO Wenning, ZHANG Zixuan, TANG Xinde, KANG Ruizhi, MO Fei. Research on Paraphrase Identification Based on Improved Sentence Similarity Algorithm[J]. Computer Engineering, 2020, 46(9): 76-82.

http://www.ecice06.com/CN/Y2020/V46/I9/76

Figures/Tables: 3
