TextRank-based Keyword Extraction Method Integrating Semantic Features

doi:10.19678/j.issn.1000-3428.0059116

Abstract

Abstract: TextRank uses a co-occurrence window in place of PageRank's Web hyperlinks to determine the relationships between words. However, the word graph built under the co-occurrence window mechanism is undirected, and in real Chinese text a word usually has no cognitively directional link to the words inside its co-occurrence window. As a result, word relationships under this mechanism differ considerably from the hyperlink relationships in PageRank. To address this problem, a keyword extraction method integrating semantic features, S-TextRank, is proposed. Building on TextRank, S-TextRank uses dependency relations instead of co-occurrence windows to determine the relationships between words, simulating PageRank's directional hyperlinks. Words with different parts of speech are assigned corresponding weight coefficients, simulating the differing importance of Web pages of different kinds. On this basis, a non-keyword list is constructed using the IDF method together with Chinese grammar rules, excluding irrelevant words to reduce their influence on the extraction results. Experimental results show that S-TextRank achieves an accuracy of 74% on the test set, 19.4 percentage points higher than TextRank.

Key words: TextRank method, keyword extraction, dependency relationship, part-of-speech importance, IDF method, PageRank method
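
The three ideas in the abstract (directed dependency edges, part-of-speech weighting, and an IDF-based non-keyword list) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the edge direction (dependent to head), the `POS_WEIGHT` values, the use of part-of-speech weights as a PageRank teleport prior, and the function names `idf_non_keywords` and `rank_words` are all assumptions made for illustration. The Chinese grammar rules mentioned in the abstract are not modeled.

```python
import math
from collections import defaultdict

# Hypothetical part-of-speech weights; the abstract does not state the real coefficients.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}
DEFAULT_POS_WEIGHT = 0.2  # assumed fallback for any other tag


def idf_non_keywords(corpus, threshold):
    """Return words whose IDF is below `threshold`, i.e. words found in most documents."""
    n_docs = len(corpus)
    doc_freq = defaultdict(int)
    for doc in corpus:
        for word in set(doc):
            doc_freq[word] += 1
    return {w for w, df in doc_freq.items() if math.log(n_docs / df) < threshold}


def rank_words(edges, pos_tags, non_keywords, damping=0.85, iterations=50):
    """Score words on a directed graph built from (dependent, head) pairs."""
    # Remove edges that touch a non-keyword so it neither receives nor passes on score.
    edges = [(d, h) for d, h in edges
             if d not in non_keywords and h not in non_keywords]
    nodes = sorted({w for edge in edges for w in edge})
    if not nodes:
        return {}

    out_links = {w: [] for w in nodes}
    for dep, head in edges:
        out_links[dep].append(head)

    # Assumption: part-of-speech weights act as the teleport (personalization) prior.
    raw = {w: POS_WEIGHT.get(pos_tags.get(w), DEFAULT_POS_WEIGHT) for w in nodes}
    total = sum(raw.values())
    prior = {w: raw[w] / total for w in nodes}

    score = dict(prior)
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is redistributed via the prior.
        dangling = sum(score[w] for w in nodes if not out_links[w])
        new_score = {w: (1 - damping + damping * dangling) * prior[w] for w in nodes}
        for src in nodes:
            targets = out_links[src]
            if not targets:
                continue
            share = damping * score[src] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return score


# Toy usage with hand-written dependency pairs (a real pipeline would use a parser).
corpus = [["the", "keyword", "graph"],
          ["the", "ranking", "method"],
          ["the", "keyword", "method"]]
stop = idf_non_keywords(corpus, threshold=0.1)
edges = [("the", "method"), ("method", "extracts"),
         ("keywords", "extracts"), ("semantic", "keywords")]
tags = {"method": "n", "extracts": "v", "keywords": "n", "semantic": "a"}
scores = rank_words(edges, tags, stop)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

Because every edge has a direction, score flows only from a dependent to its head, which is the property the abstract contrasts with the undirected co-occurrence graph of TextRank. Folding the part-of-speech weights into the teleport prior is only one possible design; they could equally be applied as edge weights or as a multiplier on the final scores.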

CLC number: TP391.1

YANG Yanjiao, ZHAO Guotao, YUAN Zhenqiang, HAN Jiachen. TextRank-based Keyword Extraction Method Integrating Semantic Features[J]. Computer Engineering, 2021, 47(10): 82-88.

http://www.ecice06.com/CN/Y2021/V47/I10/82

Figures/Tables: 8

