
Computer Engineering ›› 2011, Vol. 37 ›› Issue (20): 203-205. doi: 10.3969/j.issn.1000-3428.2011.20.070

• Artificial Intelligence and Recognition Technology •

Duplicate Checking Algorithm of Document Partial Content Based on Particle Swarm Optimization

YE Qing-wei, WU Dong-xing, ZHOU Yu, WANG Xiao-dong

  1. (Information Science and Engineering Institute, Ningbo University, Ningbo 315211, China)
  • Received:2011-05-09 Online:2011-10-20 Published:2011-10-20
  • About the authors: YE Qing-wei (born 1970), male, associate professor, Ph.D., main research interest: vibration signal processing; WU Dong-xing, master's student; ZHOU Yu and WANG Xiao-dong, associate professors
  • Supported by:
    Fund project of the Education Department of Zhejiang Province (Y200908502)

Abstract: Existing document similarity algorithms can compute the similarity between two documents, but they cannot locate the duplicated or most similar partial content within them. This paper proposes an algorithm based on Particle Swarm Optimization (PSO) for detecting duplicated partial content in documents. PSO is used to search for the position and length of the most similar partial content in two documents, and an encoding of the particles is given for this purpose. A related coefficient between strings is defined to measure string similarity, and the PSO evaluation function is built on this related coefficient function. A hybrid mutation PSO algorithm is used to search for the most similar partial content. Tests indicate that the algorithm determines the position and length of the duplicated or most similar partial content quickly and accurately.

Key words: duplicate checking, similarity function, Particle Swarm Optimization (PSO), evaluation function, character string
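
The abstract does not give the particle encoding, the related coefficient, or the hybrid mutation step, so the following is only a minimal sketch of the general idea under stated assumptions. Each particle is assumed to encode a triple (start in document A, start in document B, length); the similarity measure is Python's `difflib.SequenceMatcher` ratio used as a stand-in for the paper's related coefficient; the swarm is a plain global-best PSO with no mutation. The names `score_window`, `decode`, and `pso_find_similar`, and all parameter values, are hypothetical.

```python
import random
from difflib import SequenceMatcher


def score_window(a, b, i, j, n):
    """Matched minus unmatched characters for a[i:i+n] versus b[j:j+n]."""
    # difflib's ratio is a stand-in here, NOT the paper's related coefficient.
    ratio = SequenceMatcher(None, a[i:i + n], b[j:j + n], autojunk=False).ratio()
    return (2.0 * ratio - 1.0) * n


def decode(position, len_a, len_b, min_len):
    """Map a real-valued particle to a valid (start_a, start_b, length) triple."""
    max_len = min(len_a, len_b)
    n = min(max(int(round(position[2])), min_len), max_len)
    i = min(max(int(round(position[0])), 0), len_a - n)
    j = min(max(int(round(position[1])), 0), len_b - n)
    return i, j, n


def pso_find_similar(a, b, min_len=4, n_particles=40, n_iter=150, seed=0):
    """Plain global-best PSO; assumes both documents have at least min_len characters."""
    rng = random.Random(seed)
    lower = [0, 0, min_len]
    upper = [len(a) - min_len, len(b) - min_len, min(len(a), len(b))]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration weights (assumed values)

    def evaluate(pos):
        return score_window(a, b, *decode(pos, len(a), len(b), min_len))

    swarm = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
             for _ in range(n_particles)]
    velocity = [[0.0] * 3 for _ in range(n_particles)]
    best_pos = [p[:] for p in swarm]          # personal best positions
    best_val = [evaluate(p) for p in swarm]   # personal best scores
    g = max(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far

    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(3):
                r1, r2 = rng.random(), rng.random()
                velocity[k][d] = (w * velocity[k][d]
                                  + c1 * r1 * (best_pos[k][d] - swarm[k][d])
                                  + c2 * r2 * (g_pos[d] - swarm[k][d]))
                # Move the particle and clamp it to the search bounds.
                swarm[k][d] = min(max(swarm[k][d] + velocity[k][d], lower[d]), upper[d])
            val = evaluate(swarm[k])
            if val > best_val[k]:
                best_pos[k], best_val[k] = swarm[k][:], val
                if val > g_val:
                    g_pos, g_val = swarm[k][:], val

    i, j, n = decode(g_pos, len(a), len(b), min_len)
    return i, j, n, g_val
```

Calling `pso_find_similar(doc_a, doc_b)` returns a candidate start position in each document, a length, and the score of that window pair. The score rewards matched characters and penalizes unmatched ones so that the swarm is not drawn toward trivially short exact matches or very long loose ones; this weighting is a design choice of the sketch, not something stated in the abstract. Plain PSO can stall in local optima on this kind of landscape, which is presumably what the paper's hybrid mutation variant addresses.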
