Computer Engineering

• Development Research and Engineering Application

Spam Web Page Detection Method Based on Topic and Semantics

YI Junkai, LIU Mufan, WAN Jing

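  1. (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029)
  • Received: 2014-07-10 Publication date: 2015-09-15 Release date: 2015-09-15
  • About the authors: YI Junkai (born 1972), male, professor, main research interests: information security, artificial intelligence, semantic mining; LIU Mufan, master's student; WAN Jing (corresponding author), lecturer.
  • Funding:
    Supported by the Fundamental Research Funds for the Central Universities (ZZ1311).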

Spam Web Detection Method Based on Topic and Semantic

YI Junkai, LIU Mufan, WAN Jing

  1. (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China)
  • Received: 2014-07-10 Online: 2015-09-15 Published: 2015-09-15

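Abstract: Web spam detection can be viewed as a binary classification problem. Current content-based methods for detecting spam web pages rely mainly on statistical features and cannot accurately identify hidden spamming techniques. To address this, an improved spam web page detection method is proposed that uses both semantic and statistical features and carries spam detection down to the topic level. The method builds a topic model of the page content, maps the content into the topic space, performs semantic analysis on its topic distribution to extract semantic features, and combines these with statistical features to classify the page. Experimental results show that the method achieves good precision, recall, and F1 measure.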

Keywords: classification, topic model, Latent Dirichlet Allocation, semantic feature, semantic similarity

Abstract: Web spam detection can be treated as a binary classification problem. Current content-based spam web page detection relies mainly on statistical features, which work at a shallow level and have several limitations. A topic- and semantics-based spam web page detection method is presented that uses both semantic and statistical features, extending spam detection to the topic level. The method performs topic modeling, maps the page content into the topic space, computes and extracts semantic features from the content's topic distribution, and combines semantic and statistical features to detect spam. Experimental results show that the proposed method performs well in terms of precision, recall, and F1 values.

Key words: classification, topic model, Latent Dirichlet Allocation (LDA), semantic feature, semantic similarity
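
The pipeline outlined in the abstract (topic distribution, then semantic features, then classification scored by precision, recall and F1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which semantic features are computed, so the cosine similarity between the topic distributions of two parts of a page (here a hypothetical title and body) is an assumption, as are the function names and the toy numbers. The topic distributions are taken as given, for example from an LDA model fitted elsewhere.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic distributions.

    Assumption: used here as one possible "semantic similarity" feature;
    the abstract does not specify the measure the authors use.
    """
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # no topical content on one side: treat as dissimilar
    return dot / (norm_p * norm_q)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical topic distributions over three topics for one page.
title_topics = [0.7, 0.2, 0.1]
body_topics = [0.1, 0.2, 0.7]
semantic_feature = cosine_similarity(title_topics, body_topics)

# Hypothetical labels (1 = spam, 0 = normal) and classifier predictions.
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(f"semantic feature: {semantic_feature:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

A low similarity between a page's title and body topics is the kind of signal that a purely statistical feature such as word counts would miss; in a full system this value would be appended to the statistical feature vector before classification.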

CLC number: