
Computer Engineering, 2020, Vol. 46, Issue 9: 76-82. doi: 10.19678/j.issn.1000-3428.0055313

• Artificial Intelligence and Pattern Recognition •

Research on Paraphrase Identification Based on Improved Sentence Similarity Algorithm

CHEN Junyue, HAO Wenning, ZHANG Zixuan, TANG Xinde, KANG Ruizhi, MO Fei

  1. Institute of Command Information System, Army Engineering University, Nanjing 210000, China
  • Received: 2019-06-27 Revised: 2019-08-28 Published: 2019-10-14
  • About the authors: CHEN Junyue (born 1995), female, master's student, main research interest: natural language processing; HAO Wenning (corresponding author), professor; ZHANG Zixuan and TANG Xinde, master's students; KANG Ruizhi, doctoral student; MO Fei, master's student.
  • Funding:
    National Natural Science Foundation of China, "Methods for rapid generation and dynamic repair of courses of action based on hierarchical task networks in an effects-driven context" (61806221).

Abstract: Existing sentence similarity algorithms cannot handle synonyms and suffer from low accuracy and high complexity. To address these problems, this paper uses word embeddings to improve the Levenshtein similarity algorithm and the Jaccard index, and proposes a new sentence similarity algorithm for paraphrase identification. It also analyzes the strengths and weaknesses of several sentence similarity algorithms and designs an application mode that combines multiple similarity features. Experimental results on the MRPC paraphrase identification dataset show that the paraphrase identification model using this algorithm reaches an accuracy of 74.4% and an F1 score of 83.1%, outperforming models that use traditional algorithms such as TF-IDF and Bag of Words (BoW).

Key words: sentence similarity, Jaccard index, Levenshtein distance (LD), word embedding, paraphrase identification, multi-feature combination
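
The abstract does not give the formulas behind the improved algorithm, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "improving" Levenshtein and Jaccard with word embeddings means replacing exact word matching with cosine similarity between word vectors. The names `TOY_VECTORS`, `word_sim`, `soft_levenshtein_similarity`, `soft_jaccard` and the `threshold` parameter are hypothetical, and the tiny vectors are made up for illustration; a real system would load pretrained embeddings.

```python
import math

# Hypothetical toy embeddings, made up for illustration only.
TOY_VECTORS = {
    "buy": (0.9, 0.1, 0.0),
    "purchase": (0.85, 0.15, 0.05),
    "car": (0.1, 0.9, 0.0),
    "automobile": (0.12, 0.88, 0.02),
    "he": (0.0, 0.1, 0.9),
    "a": (0.3, 0.3, 0.3),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_sim(w1, w2, vectors):
    """Similarity in [0, 1]: 1.0 for identical words, else clipped cosine."""
    if w1 == w2:
        return 1.0
    if w1 in vectors and w2 in vectors:
        return max(0.0, cosine(vectors[w1], vectors[w2]))
    return 0.0  # unknown words only match themselves


def soft_levenshtein_similarity(s1, s2, vectors):
    """Word-level edit distance where substitution costs 1 - word_sim."""
    n, m = len(s1), len(s2)
    if max(n, m) == 0:
        return 1.0
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - word_sim(s1[i - 1], s2[j - 1], vectors)
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # deletion
                dist[i][j - 1] + 1.0,      # insertion
                dist[i - 1][j - 1] + sub,  # (soft) substitution
            )
    # Normalise the distance by the longer sentence to get a similarity.
    return 1.0 - dist[n][m] / max(n, m)


def soft_jaccard(s1, s2, vectors, threshold=0.8):
    """Jaccard index where words count as shared if word_sim >= threshold.

    Uses a simple greedy one-to-one matching (an assumption of this sketch).
    """
    set1, set2 = set(s1), set(s2)
    if not set1 and not set2:
        return 1.0
    remaining = set(set2)
    matched = 0
    for w in sorted(set1):
        best = max(remaining, key=lambda x: word_sim(w, x, vectors), default=None)
        if best is not None and word_sim(w, best, vectors) >= threshold:
            matched += 1
            remaining.discard(best)
    union = len(set1) + len(set2) - matched
    return matched / union


if __name__ == "__main__":
    a = ["he", "buy", "a", "car"]
    b = ["he", "purchase", "a", "automobile"]
    print("soft Levenshtein:", soft_levenshtein_similarity(a, b, TOY_VECTORS))
    print("soft Jaccard:", soft_jaccard(a, b, TOY_VECTORS))
```

Passing an empty dictionary as `vectors` reduces both functions to their exact-match forms, which makes the effect of the embeddings on synonym pairs such as "buy"/"purchase" easy to compare. In the multi-feature setting the abstract mentions, scores like these could serve as input features to a classifier, though the paper's actual feature set is not stated here.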
