
Computer Engineering, 2013, Vol. 39, Issue (5): 230-234. doi: 10.3969/j.issn.1000-3428.2013.05.051

• Artificial Intelligence and Recognition Technology •

Plagiarism Judgment Based on Language Model and Feature Classification

LI Hui, LIU Ying

  1. (Department of Chinese Language and Literature, Tsinghua University, Beijing 100084, China)
  • Received: 2012-05-09 Online: 2013-05-15 Published: 2013-05-14
  • About the authors: LI Hui (born 1987), female, master's student, main research interest: computational linguistics; LIU Ying, associate professor
  • Supported by:
    National Natural Science Foundation of China project "Modeling of Interactive Behavior and Linguistic Features Based on Pragmatic Information" (61171114)

Abstract: Copyright protection for authors has drawn increasing attention in the information age. To address large-scale textual similarity between certain novels, this paper proposes a method based on language models and feature classification. Bigram to 6-gram language models of the texts are computed and plotted as topological graphs, and the coincidence probability and the part-of-speech (POS) ratio are calculated to analyze the degree of word overlap and the grammatical information. On this basis, Principal Component Analysis (PCA) and Random Forest (RF) are used for classification. The machine learning results show that the method can effectively identify whether a novel contains plagiarism.

Key words: plagiarism judgment, language model, grammatical information, Principal Component Analysis (PCA), Random Forest (RF), classification
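
The n-gram overlap step described in the abstract can be illustrated with a minimal sketch. The abstract does not define how the coincidence probability is computed, so the definition used here (the share of the suspect text's n-grams that also occur in the source text, with counts clipped to the source's counts) is an assumption, not the authors' formula. The function names `ngrams`, `overlap_ratio`, and `overlap_features` are hypothetical, and the inputs are assumed to be lists of tokens that have already been word-segmented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_ratio(suspect, source, n):
    """Assumed overlap measure: share of the suspect's n-grams found in the source.

    Each n-gram is credited at most as many times as it occurs in the source.
    """
    suspect_counts = ngrams(suspect, n)
    source_counts = ngrams(source, n)
    total = sum(suspect_counts.values())
    if total == 0:
        # The suspect text is shorter than n tokens, so it has no n-grams.
        return 0.0
    shared = sum(min(count, source_counts[gram])
                 for gram, count in suspect_counts.items())
    return shared / total


def overlap_features(suspect, source, orders=range(2, 7)):
    """One overlap ratio per n-gram order, bigram through 6-gram by default."""
    return [overlap_ratio(suspect, source, n) for n in orders]


if __name__ == "__main__":
    suspect_tokens = "a b c d e f g".split()
    source_tokens = "a b c d x y z".split()
    print(overlap_features(suspect_tokens, source_tokens))
```

A vector like this, one value per n-gram order, is the kind of feature that could then be passed to a dimensionality-reduction step and a classifier; that part of the pipeline is not shown here.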
