Computer Engineering

• Artificial Intelligence and Recognition Technology

Key Words Extraction Algorithm Based on Chi-square Value of Co-occurrence Words

SHI Yongbin (时永宾), YU Qingsong (余青松)

  1. (Computer Center, East China Normal University, Shanghai 200333, China)
  • Received: 2015-06-02 Online: 2016-06-15 Published: 2016-06-15
  • About the authors: SHI Yongbin (born 1989), male, master's student, research focus: Web semantics; YU Qingsong, senior engineer.

Abstract:

The dictionaries of text segmentation systems do not include new words or compound words, yet these words are highly indicative of a document's topic. To address this problem, a keyword extraction algorithm based on the chi-square value of co-occurrence words is proposed. Associations among words are built with the dependency parsing of the Language Technology Platform (LTP), and co-occurrence words are extracted from these associations. A chi-square test is then used to check whether the distribution of each co-occurrence word differs significantly. The larger the difference, the more likely the co-occurrence word is a keyword. The same test also applies to single words. Taking single words and co-occurrence words as candidate keywords, the algorithm extracts the keywords of the full text by jointly considering each candidate's chi-square value, word frequency, and word count. Experimental results show that the algorithm outperforms the TextRank algorithm: the precision of keyword extraction reaches 38.07%, and the accuracy of the extracted co-occurrence words reaches 80.15%.

Key words: dependency parsing, co-occurrence word, chi-square test, candidate keyword, significant difference
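
The abstract does not give the exact form of the chi-square statistic, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes sentence-level co-occurrence instead of LTP dependency relations, and it scores each word pair with the standard Pearson chi-square statistic on a 2x2 contingency table. The names `chi_square_2x2` and `pair_chi_squares` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0  # degenerate table: a whole row or column is empty
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def pair_chi_squares(sentences):
    """Score each unordered word pair that shares at least one sentence.

    `sentences` is a list of token lists (text that is already segmented).
    Assumption: co-occurrence means "appears in the same sentence".
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing both words
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), both in pair_df.items():
        only_a = word_df[a] - both
        only_b = word_df[b] - both
        neither = n - both - only_a - only_b
        scores[(a, b)] = chi_square_2x2(both, only_a, only_b, neither)
    return scores
```

For example, `pair_chi_squares([["chi", "square"], ["chi", "square"], ["test"], ["word"]])[("chi", "square")]` evaluates to `4.0`, because the two words always appear together. A full keyword ranker along the lines of the abstract would still need to combine this score with word frequency and word count, which the sketch leaves out.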

CLC number: