Abstract: To identify plagiarism within a local document set, this paper defines the document plagiarism distance as an approximation of a generalized edit distance based on the number of backward returns and the number of forward skips. It analyzes sufficient conditions under which this distance satisfies the triangle inequality or a weak triangle inequality, and then proposes a fast full-text identification algorithm that finds all ordered pairs of documents in the set that are suspected of plagiarism. Experimental results show that, compared with other algorithms, the proposed algorithm improves identification efficiency by a factor of 3 to 5 while maintaining the recall ratio.
Key words:
plagiarism identification,
document set,
triangle inequality,
electronic document management
CLC number:
HU Ming-Xiao (胡明晓). Quick Full-text Identification Algorithm for Document Set Plagiarism[J]. Computer Engineering (计算机工程), 2010, 36(18): 197-199.
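The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of using a triangle inequality to avoid full pairwise comparisons. It is not the paper's method: the plain Levenshtein edit distance is swapped in for the document plagiarism distance, a single pivot document is used for pruning, and the names `edit_distance`, `suspicious_pairs`, `threshold`, and `pivot_index` are all hypothetical. Because the stand-in distance is symmetric, the sketch reports unordered pairs, whereas the paper reports ordered pairs.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (a true metric), used here as a stand-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def suspicious_pairs(docs, threshold, pivot_index=0):
    """Return (pairs, skipped): index pairs with distance <= threshold,
    and how many full comparisons the pivot bound avoided."""
    pivot = docs[pivot_index]
    to_pivot = [edit_distance(pivot, d) for d in docs]
    pairs, skipped = [], 0
    for i, j in combinations(range(len(docs)), 2):
        # Triangle inequality: d(i, j) >= |d(p, i) - d(p, j)|.
        # If that lower bound already exceeds the threshold, skip the pair.
        if abs(to_pivot[i] - to_pivot[j]) > threshold:
            skipped += 1
            continue
        if edit_distance(docs[i], docs[j]) <= threshold:
            pairs.append((i, j))
    return pairs, skipped


if __name__ == "__main__":
    docs = ["abcdef", "abcdeg", "xyz", "abcdef123456"]
    print(suspicious_pairs(docs, threshold=2))
```

The pruning is lossless only because the stand-in distance satisfies the triangle inequality exactly; a distance that satisfies only a weak triangle inequality would need a correspondingly looser bound.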