Extraction of Chinese Text Summarization Based on Improved TextRank Algorithm

doi:10.19678/j.issn.1000-3428.0051615

Abstract

This paper proposes DK-TextRank, a summary extraction algorithm for Chinese text that combines the Doc2Vec model, the K-means algorithm, and the TextRank algorithm to improve summarization accuracy. After vectorizing the text with the Doc2Vec model, DK-TextRank clusters similar text with an improved K-means algorithm, then applies a TextRank algorithm with weight influence factors within each cluster to rank the sentences and extract topic sentences, from which the summary is generated. Experimental results show that DK-TextRank reaches an F value of 79.36% when the number of summary sentences is 7, and that its summaries are of higher quality than those extracted by the traditional TF-IDF and TextRank algorithms.
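The three stages named in the abstract (vectorize, cluster, rank within each cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sentence vectors are assumed to be precomputed (for example by a Doc2Vec model), the clustering is textbook K-means rather than the paper's improved variant, and the per-sentence `weights` argument is a hypothetical stand-in for the weight influence factors, whose exact form the abstract does not give. The names `kmeans`, `textrank`, and `summarize` are illustrative only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iters=20, seed=0):
    """Textbook Lloyd's K-means (NOT the paper's improved variant).

    Returns one cluster label in range(k) per input vector.
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def textrank(vectors, weights=None, damping=0.85, iters=50):
    """Weighted PageRank over a cosine-similarity sentence graph.

    Assumption: weights[i] scales every edge pointing into sentence i.
    This is one plausible reading of "weight influence factor", not the paper's formula.
    """
    n = len(vectors)
    weights = weights or [1.0] * n
    # w[j][i] is the weight of the edge from sentence j to sentence i.
    w = [
        [0.0 if i == j else max(cosine(vectors[j], vectors[i]), 0.0) * weights[i]
         for i in range(n)]
        for j in range(n)
    ]
    out = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - damping)
            + damping * sum(w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, vectors, k, per_cluster=1, weights=None):
    """Cluster the sentences, rank inside each cluster, keep the top ones in document order."""
    labels = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        sub_weights = [weights[i] for i in idx] if weights else None
        scores = textrank([vectors[i] for i in idx], sub_weights)
        # Highest score first; ties go to the earlier sentence.
        ranked = sorted(zip(scores, idx), key=lambda p: (-p[0], p[1]))
        chosen.extend(i for _, i in ranked[:per_cluster])
    return [sentences[i] for i in sorted(chosen)]
```

Ranking inside each cluster, rather than over the whole document, means every topic cluster contributes at least one sentence to the summary. In this sketch the summary length is at most `k * per_cluster`.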

Key words: Doc2Vec model, K-means algorithm, TextRank algorithm, summarization extraction, weight influence factor

CLC Number:

TP391

XU Xintao,CHAI Xiaoli,XIE Bin,SHEN Chen,WANG Jingping. Extraction of Chinese Text Summarization Based on Improved TextRank Algorithm[J]. Computer Engineering, 2019, 45(3): 273-277.

徐馨韬,柴小丽,谢彬,沈晨,王敬平. 基于改进TextRank算法的中文文本摘要提取[J]. 计算机工程, 2019, 45(3): 273-277.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0051615

http://www.ecice06.com/EN/Y2019/V45/I3/273
