Computer Engineering ›› 2021, Vol. 47 ›› Issue (10): 82-88. doi: 10.19678/j.issn.1000-3428.0059116

• Artificial Intelligence and Pattern Recognition •

TextRank-based Keyword Extraction Method Integrating Semantic Features

YANG Yanjiao, ZHAO Guotao, YUAN Zhenqiang, HAN Jiachen

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
  • Received: 2020-07-31 Revised: 2020-09-11 Published: 2020-09-21
  • About the authors: YANG Yanjiao (born 1976), female, associate professor, M.S.; main research interest: data mining. ZHAO Guotao, YUAN Zhenqiang, and HAN Jiachen are master's students.
  • Funding:
    National Natural Science Foundation of China (61662068); Gansu Province Higher Education Innovation Ability Improvement Project (2019A-006).

Abstract: TextRank uses a co-occurrence window in place of PageRank's Web hyperlinks to determine the relationships between words. However, the word graph built under the co-occurrence window mechanism is undirected, and in actual Chinese text a word usually has no cognitively meaningful directional link to the other words in its co-occurrence window. As a result, word relationships under this mechanism differ considerably from the hyperlink relationships in PageRank. To address this problem, a keyword extraction method that integrates semantic features, S-TextRank, is proposed. Building on TextRank, S-TextRank uses dependency relationships instead of co-occurrence windows to determine the relationships between words, thereby simulating the directional hyperlinks of PageRank. Words with different parts of speech are assigned corresponding weight coefficients to simulate the differing importance of different types of Web pages. On this basis, a non-keyword list is constructed by combining the IDF method with Chinese grammar rules, so that irrelevant words are excluded and their influence on the extraction results is reduced. Experimental results show that the accuracy of S-TextRank on the test set reaches 74%, which is 19.4 percentage points higher than that of TextRank.

Key words: TextRank method, keyword extraction, dependency relationship, part-of-speech importance, IDF method, PageRank method
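
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the actual formulas or coefficients. The sketch assumes that each dependency relation becomes a directed edge from the dependent word to its head, that part-of-speech weights enter the ranking as a non-uniform teleport distribution, and that the non-keyword list is supplied as a ready-made set rather than derived from IDF and grammar rules. The names `POS_WEIGHT` and `rank_words`, and the weight values, are hypothetical.

```python
from collections import defaultdict

# Hypothetical part-of-speech weights; the abstract does not state the real coefficients.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}


def rank_words(edges, pos_tags, non_keywords=frozenset(), damping=0.85, iterations=50):
    """Rank words with a PageRank-style iteration over directed dependency edges.

    edges: list of (dependent, head) pairs; each pair is treated as a link
           pointing from the dependent to the head (an assumed direction).
    pos_tags: dict mapping each word to a coarse part-of-speech tag.
    non_keywords: words excluded before the graph is built.
    """
    out_links = defaultdict(set)
    nodes = set()
    for dependent, head in edges:
        if dependent in non_keywords or head in non_keywords or dependent == head:
            continue
        out_links[dependent].add(head)
        nodes.update((dependent, head))
    if not nodes:
        return []

    # Part-of-speech weights, normalized, serve as the teleport distribution.
    raw = {w: POS_WEIGHT.get(pos_tags.get(w), 0.1) for w in nodes}
    total = sum(raw.values())
    prior = {w: raw[w] / total for w in nodes}

    score = dict(prior)
    for _ in range(iterations):
        # Score held by words with no outgoing link is redistributed via the prior.
        dangling = sum(score[w] for w in nodes if not out_links[w])
        new_score = {w: (1 - damping + damping * dangling) * prior[w] for w in nodes}
        for w in nodes:
            targets = out_links[w]
            if targets:
                share = damping * score[w] / len(targets)
                for t in targets:
                    new_score[t] += share
        score = new_score
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Toy input: hand-written dependency pairs, not the output of a real parser.
toy_edges = [
    ("keyword", "extraction"),
    ("extraction", "method"),
    ("proposed", "method"),
    ("the", "method"),
]
toy_pos = {"keyword": "n", "extraction": "n", "method": "n", "proposed": "v", "the": "u"}
ranked = rank_words(toy_edges, toy_pos, non_keywords={"the"})
```

In this toy input, "method" is the head that the other words point to, so it ranks first, while "the" is filtered out by the non-keyword set before the graph is built. Because the edges are directed, score flows only from dependents to heads, which is the contrast with an undirected co-occurrence graph that the abstract draws.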
