
Computer Engineering

• Security Technology

Illegal Website Identification Method Based on URL Feature Detection

FAN Yourong 1,2, YANG Tao 1,2, WANG Yongjian 1,2, JIANG Guoqing 1,2

  1. The Third Research Institute of Ministry of Public Security, Shanghai 200031, China
  2. Key Lab of Information Network Security of Ministry of Public Security, Shanghai 201400, China
  • Received: 2016-12-01 Online: 2018-03-15 Published: 2018-03-15
  • About the authors: FAN Yourong (born 1992), female, research intern, M.S.; main research interests: information security and data mining. YANG Tao and WANG Yongjian, associate research fellows, Ph.D. JIANG Guoqing, research intern, M.S.
  • Funding:
    National Key Research and Development Program of China (2016YFC0800909); Fundamental Research Funds for the Central Universities (C16356).

Abstract: To identify illegal websites efficiently, an identification method based on URL feature detection is proposed. Exploiting the hierarchical structure of user access paths in the request-line information of messages, a website similarity model based on path similarity is constructed, and its distributed computation is implemented in Python. Websites are clustered with the Fast Unfolding algorithm and the URL features of illegal websites are extracted; features that are highly accurate and carry a specific meaning are retained as effective illegal-website features. An unknown website is then identified as illegal by checking whether it exhibits these URL features. Experimental results show that the method effectively measures the degree of association between websites of the same type and, combined with the Fast Unfolding algorithm, effectively distinguishes websites of different types. Compared with illegal-website identification methods based on URL lexical features, HTML features, or semantic features, the proposed method achieves the highest F-Measure.

Key words: URL feature, illegal website identification, website similarity, clustering, access path
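
The abstract does not give the formula of the similarity model, so the following is only a minimal sketch of the general idea of building website similarity from path similarity. It rests on assumptions that are not taken from the paper: path similarity is taken to be the depth of the shared leading path levels divided by the depth of the deeper path, and website similarity is the symmetric average of each path's best match on the other site. The function names and the example URLs are hypothetical.

```python
from typing import List
from urllib.parse import urlsplit


def path_levels(url: str) -> List[str]:
    """Split the path part of a URL (or a bare request path) into its levels."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def path_similarity(a: List[str], b: List[str]) -> float:
    """Assumed measure: shared leading levels / depth of the deeper path."""
    if not a and not b:
        return 1.0  # two root paths are treated as identical
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def site_similarity(urls_a: List[str], urls_b: List[str]) -> float:
    """Assumed measure: symmetric average of best-match path similarities."""
    paths_a = [path_levels(u) for u in urls_a]
    paths_b = [path_levels(u) for u in urls_b]
    if not paths_a or not paths_b:
        return 0.0

    def directed(src, dst):
        # For every path on one site, take its best match on the other site.
        return sum(max(path_similarity(p, q) for q in dst) for p in src) / len(src)

    return (directed(paths_a, paths_b) + directed(paths_b, paths_a)) / 2


# Hypothetical access paths observed for three sites.
site_a = ["http://a.example/game/lottery/bet.php", "http://a.example/user/login.php"]
site_b = ["http://b.example/game/lottery/result.php", "http://b.example/user/login.php"]
site_c = ["http://c.example/news/2018/index.html"]

sim_ab = site_similarity(site_a, site_b)  # sites sharing a path layout
sim_ac = site_similarity(site_a, site_c)  # sites with unrelated paths
```

In a pipeline like the one the abstract describes, pairwise scores of this kind could serve as edge weights of a website graph, which a community detection algorithm such as Fast Unfolding (commonly known as the Louvain method) would then partition into clusters.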

CLC Number: