Abstract: To address the complex sentence structure and word polysemy of Chinese, this paper proposes a sentence similarity calculation method. Sentences are preprocessed with syntactic parsing and dependency analysis to extract the word sets of the main structural components (subject, predicate, object, preposition, and so on), which capture the shallow semantics of each sentence. HowNet is then used to compute the semantic similarity between the same components of different sentences. Because the attributive and adverbial relations in dependency syntax act as semantic modifiers, the method further incorporates modifier words on top of the syntactic structure and computes an overall sentence semantic similarity, distinguishing both the consistency of the sentences' topic content and antonymous relations between sentences. On a test set of 30 sentence pairs drawn from the Microsoft Research Paraphrase Corpus, the proposed method reaches a Pearson correlation coefficient of 0.89 and an F-measure of 85.7%, indicating good accuracy and practicality.
Key words:
sentence similarity,
HowNet,
dependency tree,
syntactic structure,
modifier
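
The component-wise comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the sentences have already been parsed into component word sets, it replaces the HowNet-based word similarity with a hypothetical toy lookup table (`toy_word_sim`), and the component weights in `WEIGHTS` are illustrative values only. It does not model antonym detection.

```python
from typing import Callable, Dict, List, Optional

Components = Dict[str, List[str]]

# Hypothetical stand-in for a HowNet-based word similarity in [0, 1].
TOY_SIM = {
    frozenset(("buy", "purchase")): 0.9,
    frozenset(("car", "vehicle")): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def set_sim(ws1: List[str], ws2: List[str],
            word_sim: Callable[[str, str], float]) -> Optional[float]:
    """Average best-match similarity between two word sets, in both directions."""
    if not ws1 and not ws2:
        return None  # component absent from both sentences
    if not ws1 or not ws2:
        return 0.0   # component present in only one sentence
    best1 = [max(word_sim(a, b) for b in ws2) for a in ws1]
    best2 = [max(word_sim(a, b) for a in ws1) for b in ws2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


def sentence_sim(s1: Components, s2: Components, weights: Dict[str, float],
                 word_sim: Callable[[str, str], float] = toy_word_sim) -> float:
    """Weighted average of per-component similarities (weights are assumptions)."""
    total, weight_sum = 0.0, 0.0
    for comp, w in weights.items():
        sim = set_sim(s1.get(comp, []), s2.get(comp, []), word_sim)
        if sim is None:
            continue  # skip components neither sentence has
        total += w * sim
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Illustrative weights: main components dominate, modifiers refine the score.
WEIGHTS = {"subject": 0.3, "predicate": 0.3, "object": 0.3, "modifier": 0.1}

s1 = {"subject": ["he"], "predicate": ["buy"], "object": ["car"], "modifier": ["red"]}
s2 = {"subject": ["he"], "predicate": ["purchase"], "object": ["vehicle"], "modifier": ["red"]}
score = sentence_sim(s1, s2, WEIGHTS)
```

Comparing only like components (subject with subject, predicate with predicate) keeps words that play different roles from inflating the score, and the separate modifier weight lets attributive and adverbial words adjust the result without dominating it.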
CLC number: