Computer Engineering

• Artificial Intelligence and Recognition Technology •

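  • About the authors: DENG Han (born 1991), female, master's student, research interest: natural language processing; ZHU Xinhua (corresponding author), professor; LI Qi and PENG Qi, master's students.
  • Funding:
    National Natural Science Foundation of China (61363036, 61462010); Guangxi Normal University Natural Science Youth Fund project "Research on Lexical Semantic Similarity Calculation".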

Sentence Similarity Calculation Based on Syntactic Structure and Modifier

DENG Han 1a,ZHU Xinhua 1a,2,LI Qi 1a,PENG Qi 1b   

  1. (1a. College of Computer Science and Information Engineering; 1b. Network Center, Guangxi Normal University, Guilin, Guangxi 541004, China; 2. Collaborative Innovation Center of Guangxi Regional Multi-source Information Integration and Intelligent Processing, Guilin, Guangxi 541004, China)
  • Received:2016-08-16 Online:2017-09-15 Published:2017-09-15

Abstract: To address the complex structure of Chinese sentences and the polysemy of Chinese words, this paper proposes a sentence similarity calculation method. Sentences are first preprocessed by syntactic parsing and dependency analysis, and the word sets of the main sentence components (subject, predicate, object, preposition, and so on) are extracted so that the shallow semantics of each sentence are represented accurately. HowNet is then used to calculate the semantic similarity between corresponding components of different sentences. Because the attributive and adverbial relations in dependency syntax act as semantic modifiers, modifiers are further incorporated on top of the syntactic structure, and the overall sentence similarity is computed so as to distinguish both the consistency of sentence topics and antonymy between sentences. With 30 sentence pairs drawn from the Microsoft Research Paraphrase Corpus as the test set, experimental results show that the proposed method reaches a Pearson correlation coefficient of 0.89 and an F-measure of 85.7%, indicating good accuracy and practicality.
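
As a rough illustration of the idea in the abstract, the sketch below compares two sentences component by component and blends in modifier similarity. It is only a minimal sketch under stated assumptions, not the authors' implementation: the weights in WEIGHTS, the MODIFIER_SHARE value, the best-match averaging in set_sim, and the exact-match toy_word_sim (a placeholder for a HowNet-based word similarity) are all hypothetical, and the parser output is assumed to be supplied already as plain dictionaries. Antonym handling is not covered.

```python
from typing import Callable, Dict, List

# Hypothetical component weights; the abstract does not state the real values.
WEIGHTS = {"subject": 0.3, "predicate": 0.3, "object": 0.3, "preposition": 0.1}
# Hypothetical share of a component's score contributed by its modifiers.
MODIFIER_SHARE = 0.2

WordSim = Callable[[str, str], float]


def toy_word_sim(a: str, b: str) -> float:
    """Placeholder for a HowNet-based similarity: 1.0 if identical, else 0.0."""
    return 1.0 if a == b else 0.0


def set_sim(xs: List[str], ys: List[str], word_sim: WordSim) -> float:
    """Symmetric average of best-match word similarities between two word lists."""
    if not xs or not ys:
        return 0.0
    forward = sum(max(word_sim(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(word_sim(y, x) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def sentence_sim(s1: Dict[str, Dict[str, List[str]]],
                 s2: Dict[str, Dict[str, List[str]]],
                 word_sim: WordSim = toy_word_sim) -> float:
    """Weighted component similarity, with modifiers blended in where present."""
    score, weight_used = 0.0, 0.0
    for role, weight in WEIGHTS.items():
        c1, c2 = s1["core"].get(role, []), s2["core"].get(role, [])
        if not c1 and not c2:
            continue  # role absent from both sentences: skip and renormalise
        sim = set_sim(c1, c2, word_sim)
        m1, m2 = s1["mods"].get(role, []), s2["mods"].get(role, [])
        if m1 or m2:
            sim = (1 - MODIFIER_SHARE) * sim + MODIFIER_SHARE * set_sim(m1, m2, word_sim)
        score += weight * sim
        weight_used += weight
    return score / weight_used if weight_used else 0.0


# Hand-written stand-ins for parser output (attributive modifiers of the object).
a = {"core": {"subject": ["cat"], "predicate": ["eats"], "object": ["fish"]},
     "mods": {"object": ["fresh"]}}
b = {"core": {"subject": ["cat"], "predicate": ["eats"], "object": ["fish"]},
     "mods": {"object": ["rotten"]}}
print(sentence_sim(a, a), sentence_sim(a, b))
```

In this toy setup the two sentences share every core component, so only the differing modifier lowers the score; with a real lexical resource in place of toy_word_sim, near-synonyms would receive partial credit instead of zero.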

Key words: sentence similarity, HowNet, dependency tree, syntactic structure, modifier
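
The two evaluation measures quoted in the abstract, the Pearson correlation coefficient and the F-measure, can be computed as sketched below. This is a generic illustration with made-up inputs, not the paper's evaluation script; in particular, how similarity scores were turned into paraphrase decisions for the F-measure is not stated in the abstract, so the boolean labels here are an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def f_measure(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Balanced F-measure (F1) of predicted positive labels against gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up numbers purely for illustration.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(f_measure([True, True, False, False], [True, False, True, False]))
```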

CLC number: