Computer Engineering

• Software Technology and Database •

Design of Geopolitical Topical Crawler Based on Classified Keyword Term Frequency Model

WEI Yong 1,2, HU Danlu 3, HAO Chenguang 3, OU Xiaoping 3

  1. Institute of Geospatial Information, Information Engineering University, Zhengzhou 450052, China
  2. Sichuan Engineering Research Center for Emergency Mapping & Disaster Reduction, Chengdu 610041, China
  3. China National Technic Service Corporation for Surveying and Mapping, Beijing 100088, China
  • Received: 2015-04-28 Online: 2016-02-15 Published: 2016-01-29
  • About the authors: WEI Yong (born 1987), male, Ph.D. candidate, main research interest: Web information extraction; HU Danlu, senior engineer, Ph.D.; HAO Chenguang and OU Xiaoping, engineers, M.S.
  • Supported by:

    Open Fund of the Sichuan Engineering Research Center for Emergency Mapping & Disaster Reduction (K2015B014).

Abstract:

To address the lack of structure in the Term Frequency-Inverse Document Frequency (TF-IDF) model when it is applied to topical crawlers, this paper designs a topical crawler based on the Classified Keyword Term Frequency (CKTF) model. Each Webpage is mapped to a five-dimensional vector using the structural features of the Webpage document and the distribution of topical words. Keywords are selected from the Chinese Wikipedia corpus and the Sogou full-web news corpus, and their relevance to the geopolitical topic is calculated. A Support Vector Machine (SVM) is then used to learn and classify the Webpage vectors. Experimental results show that, compared with traditional topical crawlers, the proposed crawler can mine the rich content of the geopolitical topic, effectively measure the relevance between a Webpage and the topic, and achieve higher crawling precision and stability.
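The mapping from a Webpage to a five-dimensional vector can be illustrated with a minimal sketch. The abstract does not name the five page parts or give the weighting formula, so the region names (`title`, `meta`, `headings`, `anchors`, `body`), the keyword relevance values, and the normalisation by token count are all assumptions made for illustration, not the paper's CKTF definition. Whitespace tokenisation is also a simplification; Chinese text would need a word segmenter.

```python
from collections import Counter

# Hypothetical page regions: the abstract says a page maps to a 5-D vector
# but does not name the five parts.
REGIONS = ("title", "meta", "headings", "anchors", "body")

# Hypothetical keyword -> topic relevance scores in [0, 1]; in the paper
# these would be derived from the Wikipedia and Sogou corpora.
KEYWORD_RELEVANCE = {"border": 0.9, "treaty": 0.8, "sanctions": 0.7}


def region_score(text, relevance):
    """Relevance-weighted keyword frequency, normalised by token count."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    weighted = sum(counts[word] * weight for word, weight in relevance.items())
    return weighted / len(tokens)


def page_vector(page, relevance=KEYWORD_RELEVANCE):
    """Map a page (dict of region name -> text) to a 5-D feature vector."""
    return [region_score(page.get(region, ""), relevance) for region in REGIONS]


page = {
    "title": "Border treaty signed",
    "meta": "treaty border",
    "headings": "Background",
    "anchors": "sanctions list",
    "body": "The treaty ends a long border dispute",
}
vector = page_vector(page)
print(vector)
```

Vectors of this shape, paired with relevant/irrelevant labels, are the kind of input an SVM classifier would be trained on to decide whether a fetched page belongs to the topic.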

Key words: Classified Keyword Term Frequency (CKTF) model, word vector, Support Vector Machine (SVM), relevancy, topical crawler

CLC Number: