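Abstract (translated from the Chinese): Social network data is information-rich and strongly topical, holds great value for data mining, and is an important part of Internet big data. Because traditional search engines cannot directly index the information on social network platforms through keyword retrieval, this paper designs a social network data collection model based on the crowdsourcing mode and a C/S architecture. The model comprises four modules: a server, a client, a storage system, and a topic Deep Web crawler system. The distributed machine nodes of the topic Deep Web crawler automatically request crawling tasks from the server and upload the crawled data, and the Hadoop Distributed File System (HDFS) is used to process the crawled data quickly and store the results. Experimental results show that the topic Deep Web crawler system is simple to configure and supports functional extension and direct acquisition of target information, and that the data collection model achieves fast data acquisition and high information retrieval efficiency.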
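Keywords (translated from the Chinese):
social network,
crowdsourcing mode,
distributed computing,
information collection,
Web crawler,
Hadoop Distributed File System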
Abstract: Social network data is information-rich and strongly topical, has significant value for data mining, and is an important part of Internet big data. However, traditional search engines cannot directly index the information on social network platforms using keyword retrieval. To address this, this paper designs and implements a data collection model based on the crowdsourcing mode and a C/S architecture. The model consists of four modules: a server, a client, a storage subsystem, and a topic Deep Web crawler system. Distributed nodes run the topic Deep Web crawler system to request new tasks automatically and upload the acquired data, while the system uses the Hadoop Distributed File System (HDFS) to process the data rapidly and store the results. Experimental results show that the topic Deep Web crawler system offers easy configuration, flexible extensibility, and direct data collection, and that the data collection model completes tasks with a high success rate and collects data efficiently.
Key words:
social network,
crowdsourcing mode,
distributed computing,
information collection,
Web crawler,
Hadoop Distributed File System (HDFS)
CLC number:
高梦超,胡庆宝,程耀东,周旭,李海波,杜然. 基于众包的社交网络数据采集模型设计与实现[J]. 计算机工程, doi: 10.3969/j.issn.1000-3428.2015.04.007.
GAO Mengchao, HU Qingbao, CHENG Yaodong, ZHOU Xu, LI Haibo, DU Ran. Design and Implementation of Crowdsourcing-based Social Network Data Collection Model[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2015.04.007.