Short Text Duplicate Removal Algorithm Based on Feature Iteration

doi:10.3969/j.issn.1000-3428.2015.12.011

Abstract

Abstract: Short texts have low, nearly uniform word frequencies and a simple structure, so duplicate removal algorithms built on conventional feature selection methods do not suit them. This paper proposes a duplicate removal algorithm for short text that iteratively reweights features. It generates a fingerprint for each short text using SimHash and clusters the fingerprints with Shared Nearest Neighbor (SNN) clustering. Initial features are added or deleted according to the clustering results, and the process iterates until convergence to detect duplicate short texts. Experimental results on two real datasets show that the method suits short text and removes duplicates more effectively than existing methods.

Key words: SimHash algorithm, Shared Nearest Neighbor (SNN), iteration, feature selection, short text, duplicate removal

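The two building blocks named in the abstract, SimHash fingerprints and shared-nearest-neighbor similarity, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the 64-bit fingerprint width, the MD5-based token hash, the neighborhood size `k`, the mutual-neighbor condition, and all function names are assumptions made for illustration, and the feature add/delete iteration is omitted because the abstract does not specify its rules.

```python
import hashlib
from typing import Dict, List, Set

BITS = 64  # fingerprint width (assumption; the abstract does not state it)


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (MD5 is an arbitrary illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights: Dict[str, float]) -> int:
    """SimHash: each feature votes +weight or -weight on every bit position."""
    totals = [0.0] * BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def k_nearest(fingerprints: List[int], k: int) -> List[Set[int]]:
    """For each fingerprint, the indices of its k nearest others by Hamming distance."""
    neighbors = []
    for i, fp in enumerate(fingerprints):
        others = [j for j in range(len(fingerprints)) if j != i]
        others.sort(key=lambda j: hamming(fp, fingerprints[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors: List[Set[int]], i: int, j: int) -> int:
    """Shared-neighbor count, nonzero only if i and j list each other as neighbors."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


if __name__ == "__main__":
    # Hypothetical feature weights for three short texts.
    texts = [
        {"cheap": 1.0, "flights": 1.0, "today": 0.5},
        {"cheap": 1.0, "flights": 1.0, "now": 0.5},
        {"weather": 1.0, "forecast": 1.0},
    ]
    fps = [simhash(t) for t in texts]
    nbrs = k_nearest(fps, k=2)
    print(hamming(fps[0], fps[1]), snn_similarity(nbrs, 0, 1))
```

In a sketch like this, texts whose fingerprints are close in Hamming distance and share many nearest neighbors would be candidates for the same duplicate cluster; the feature weights passed to `simhash` are what an iterative scheme would adjust between rounds.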

CLC Number: TP311

CAO Hai, SUN Jing, SHI Xibin. Short Text Duplicate Removal Algorithm Based on Feature Iteration[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2015.12.011.


URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2015.12.011

http://www.ecice06.com/EN/Y2015/V41/I12/54


