
Computer Engineering ›› 2010, Vol. 36 ›› Issue (18): 197-199. doi: 10.3969/j.issn.1000-3428.2010.18.068

• Artificial Intelligence and Recognition Technology •

Quick Full-text Identification Algorithm for Document Set Plagiarism

HU Ming-xiao

  1. (College of Physics & Electronic Information Engineering, Wenzhou University, Wenzhou 325035, Zhejiang, China)
  • Online: 2010-09-20 Published: 2010-09-30
  • About the author: HU Ming-xiao (born 1965), male, lecturer, M.S.; main research interests: artificial intelligence, computer graphics
  • Supported by:
    Wenzhou Science and Technology Plan Project (H20090049)

Abstract: To identify plagiarism within a local document set, this paper defines the document plagiarism distance as an approximation of a generalized edit distance based on the returning number (backward steps) and the skipping number (forward jumps). After analyzing sufficient conditions under which this distance satisfies the triangle inequality or a weak triangle inequality, it proposes a fast full-text identification algorithm that finds all ordered pairs of documents in the set suspected of plagiarism. Experimental results show that, compared with other algorithms, the proposed algorithm improves identification efficiency by 3 to 5 times without lowering the recall ratio.

Key words: plagiarism identification, document set, triangle inequality, electronic document management
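
The role of the triangle inequality in speeding up an all-pairs search can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the definition of the document plagiarism distance, so the classic Levenshtein distance (symmetric, and a true metric) stands in for it, and the names `levenshtein`, `find_similar_pairs`, `threshold` and `pivot` are hypothetical. The idea shown is only the pruning step: once the distances from one pivot document to all others are known, any pair whose pivot distances differ by more than the threshold cannot be within the threshold, so its exact distance need not be computed.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute); a true metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def find_similar_pairs(docs, threshold, pivot=0):
    """Return (pairs, exact_count).

    pairs holds every (i, j, distance) with i < j and distance <= threshold.
    exact_count is the number of full distance computations performed,
    including the initial pass from the pivot to every document.
    """
    # Assumption: one fixed pivot document; distances to it are computed once.
    to_pivot = [levenshtein(docs[pivot], d) for d in docs]
    exact_count = len(docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        # Triangle inequality lower bound: d(i, j) >= |d(p, i) - d(p, j)|.
        if abs(to_pivot[i] - to_pivot[j]) > threshold:
            continue  # pruned without computing the exact distance
        if i == pivot:
            d = to_pivot[j]
        elif j == pivot:
            d = to_pivot[i]
        else:
            d = levenshtein(docs[i], docs[j])
            exact_count += 1
        if d <= threshold:
            pairs.append((i, j, d))
    return pairs, exact_count
```

Because Levenshtein distance satisfies the triangle inequality exactly, this pruning never discards a true match. The distance described in the abstract is directed (it yields ordered pairs) and satisfies the inequality only under the sufficient conditions the paper analyzes, so a faithful implementation would need those conditions and an asymmetric comparison rather than this symmetric stand-in.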
