
Computer Engineering

Special topic: Mobile Social Networking


Universal Crawling Algorithm for Microblogging Data

LU Ti-guang, LIU Xin, LIU Ren-ren

  1. (Key Laboratory of Intelligent Computing and Information Processing, Ministry of Education, Institute of Information Engineering, Xiangtan University, Xiangtan 411105, China)
  • Received: 2013-10-31 Online: 2014-05-15 Published: 2014-05-14
  • About the authors: LU Ti-guang (born 1988), male, master's student, main research interests: social computing and information security; LIU Xin, associate professor; LIU Ren-ren, professor and doctoral supervisor.
  • Supported by:
    Natural Science Foundation of Hunan Province (12JJ3066); Hunan Provincial University Fund for the Industrialization of Scientific and Technological Achievements (11CY018); Hunan Provincial Key Discipline Fund.

Abstract: Commonly used Web crawlers and algorithms that collect data through microblog APIs struggle to meet the microblog data requirements of public opinion monitoring systems. To address this, this paper proposes an algorithm that simulates a browser login to the microblog site and crawls data from Web pages, so that all data on any microblog user's pages can be obtained easily. A user network is built from the relationships among microblog users, and new users are discovered through this network. To obtain high-quality data, a mathematical model is established that computes each user's influence from factors such as the number of posts, posting frequency, number of followers, number of reposts, and number of comments. A priority queue is then built with influence as the main factor, so that users with greater influence are crawled more frequently, and a time interval is calculated so that data from inactive users is still collected. Experimental results show that the algorithm is highly general, requires no manual intervention, obtains high-quality information, and runs fast.

Key words: microblogging data, simulated login, user network, user influence, Internet public opinion, priority queue
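
The influence-driven scheduling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual model, so everything specific here is an assumption: the `UserStats` fields, the `WEIGHTS` values, the log-scaled weighted sum in `influence`, and the interval rule in `crawl_interval` are hypothetical stand-ins chosen only to show how a priority queue can revisit influential users more often while still returning to inactive ones.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    """Per-user counters named in the abstract (field names are hypothetical)."""
    user_id: str
    posts: int
    posts_per_day: float
    followers: int
    reposts: int
    comments: int


# Hypothetical weights; the paper's real coefficients are not stated in the abstract.
WEIGHTS = {"posts": 0.1, "posts_per_day": 0.2, "followers": 0.4,
           "reposts": 0.2, "comments": 0.1}


def influence(user: UserStats) -> float:
    """Illustrative score: weighted sum of log-scaled counters (not the paper's formula)."""
    return (WEIGHTS["posts"] * math.log1p(user.posts)
            + WEIGHTS["posts_per_day"] * math.log1p(user.posts_per_day)
            + WEIGHTS["followers"] * math.log1p(user.followers)
            + WEIGHTS["reposts"] * math.log1p(user.reposts)
            + WEIGHTS["comments"] * math.log1p(user.comments))


def crawl_interval(score: float, min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Higher influence gives a shorter revisit interval; the clamp to max_s
    guarantees that low-influence users are still revisited eventually."""
    return max(min_s, min(max_s, max_s / (1.0 + score)))


class CrawlScheduler:
    """Priority queue ordered by next due time, then by descending influence."""

    def __init__(self):
        self._heap = []  # entries: (due_time, -influence, user_id)

    def add(self, user: UserStats, now: float = 0.0) -> None:
        heapq.heappush(self._heap, (now, -influence(user), user.user_id))

    def pop_next(self):
        """Return (due_time, user_id) for the next crawl and reschedule that user."""
        due, neg_score, user_id = heapq.heappop(self._heap)
        next_due = due + crawl_interval(-neg_score)
        heapq.heappush(self._heap, (next_due, neg_score, user_id))
        return due, user_id
```

Calling `pop_next()` repeatedly yields a crawl order in which a user with many followers and reposts appears several times for each appearance of a rarely active user. A real crawler would also refresh each user's counters after every visit, which this sketch omits.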
