Computer Engineering (计算机工程)

• Advanced Computing and Data Processing

Short Text Duplicate Removal Algorithm Based on Feature Iteration

CAO Hai, SUN Jing, SHI Xibin

  1. (Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai 201203, China)
  • Received: 2014-12-05 Online: 2015-12-15 Published: 2015-12-15
  • About the authors: CAO Hai (born 1989), male, master's student; main research interests: shared nearest neighbor clustering algorithms and machine learning. SUN Jing and SHI Xibin, doctoral students.
  • Funding:
    National Key Technology R&D Program of China (2012BAH13F02); Science and Technology Commission of Shanghai Municipality projects (12511502403, 12511509602).

Abstract: Short texts have nearly uniform term frequencies and simple structure, so text deduplication algorithms built on conventional feature selection methods do not suit them. This paper proposes a deduplication algorithm tailored to these characteristics. It generates a fingerprint for each short text with SimHash and clusters the fingerprints with Shared Nearest Neighbor (SNN). Initial features are then added or removed according to the clustering result, and the process iterates until convergence, which yields the duplicate detection for short text. Experimental results on two real datasets show that the algorithm achieves better duplicate removal results on short text than existing text deduplication algorithms.

Key words: SimHash algorithm, Shared Nearest Neighbor (SNN), iteration, feature selection, short text, duplicate removal
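
The two building blocks named in the abstract, SimHash fingerprinting and shared-nearest-neighbor similarity, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 64-bit fingerprint width, the MD5-based token hash, the optional `weights` mapping, and the helper names (`token_hash`, `simhash`, `hamming`, `knn`, `snn_similarity`) are all assumptions made for illustration. The feature add/remove iteration is omitted because the abstract does not specify its rules.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumption, not taken from the paper


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens, weights=None) -> int:
    """SimHash: each token votes +w or -w on every bit; positive sums become 1-bits."""
    totals = [0.0] * BITS
    for tok, cnt in Counter(tokens).items():
        # Hypothetical per-feature weight; defaults to plain term frequency.
        w = cnt * (weights.get(tok, 1.0) if weights else 1.0)
        h = token_hash(tok)
        for b in range(BITS):
            totals[b] += w if (h >> b) & 1 else -w
    fp = 0
    for b in range(BITS):
        if totals[b] > 0:
            fp |= 1 << b
    return fp


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def knn(fingerprints, k):
    """For each fingerprint, the index set of its k nearest others by Hamming distance."""
    neighbors = []
    for i, fi in enumerate(fingerprints):
        others = sorted(
            (j for j in range(len(fingerprints)) if j != i),
            key=lambda j: (hamming(fi, fingerprints[j]), j),  # ties broken by index
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j) -> int:
    """Shared-nearest-neighbor similarity: size of the overlap of two k-NN sets."""
    return len(neighbors[i] & neighbors[j])
```

In a sketch like this, two short texts would be treated as duplicate candidates when their SNN similarity exceeds some threshold; the threshold and the value of `k` would need to be tuned on real data.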

CLC Number: