Computer Engineering (计算机工程)

• Advanced Computing and Data Processing •

A Strategy of Disaster Focused Crawler Based on Ontology Semantics

MA Leilei 1,2, LI Hongwei 1, LIAN Shiwei 1, LIANG Rupeng 1, CHEN Hu 3

  (1. Institute of Geographic Space Information, Information Engineering University, Zhengzhou 450052, China; 2. Sichuan Engineering Research Center for Emergency Mapping and Disaster Reduction, Chengdu 610041, China; 3. Institute of National Defense Information, Wuhan 430010, China)
  • Received: 2015-10-30  Online: 2016-11-15  Published: 2016-11-15
  • About the authors: MA Leilei (born 1987), male, Ph.D. candidate, whose research interests are geographic information ontology and intelligent processing of geographic information; LI Hongwei, professor, Ph.D., doctoral supervisor; LIAN Shiwei, engineer, Ph.D. candidate; LIANG Rupeng, lecturer, Ph.D.; CHEN Hu, teaching assistant, M.S.
  • Funding:
    National Natural Science Foundation of China (41271392, 41401463, 41571394); Open Fund of the Sichuan Engineering Research Center for Emergency Mapping and Disaster Reduction (K2015B014).

Abstract: To extract disaster-related webpage text from the Internet efficiently and accurately, this paper introduces ontology semantics and proposes a new disaster-focused crawler strategy. First, the framework and workflow of an ontology-supported disaster-focused crawler are presented, and the method for calculating semantic similarity between ontology concepts is improved. Second, the topic semantic vector is computed from the semantic similarity, the webpage text feature vector is obtained through HTML position weighting, and the topic relevance of each page is calculated. A method for calculating the topic relevance of URL anchor text is then designed, URL link priority is analyzed, and the crawling queue is optimized. Two topics, earthquake disasters and meteorological disasters, are selected for testing and analysis, and the experimental results show that the proposed strategy effectively improves crawling stability and precision.

Key words: focused crawler, ontology, semantic similarity, Vector Space Model (VSM), relevance calculation, anchor text
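
The abstract names three building blocks: a position-weighted page vector, a vector space model relevance score, and a priority-ordered crawling queue. The sketch below illustrates how these pieces could fit together. It is not the authors' implementation: the abstract gives no formulas, so the position weights, the use of cosine similarity as the relevance measure, and the names `POSITION_WEIGHTS`, `page_vector`, `cosine`, and `Frontier` are all assumptions made for illustration. The topic vector is assumed to hold, for each term, its semantic similarity to the topic concept in the ontology.

```python
import heapq
import math
from collections import Counter

# Hypothetical weights for where a term appears in the HTML page
# (the paper's actual weighting scheme is not given in the abstract).
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "anchor": 1.5, "body": 1.0}


def page_vector(terms_by_position):
    """Build a position-weighted term vector from {position: [terms]}."""
    vec = Counter()
    for position, terms in terms_by_position.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in terms:
            vec[term] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class Frontier:
    """Crawl queue that pops the URL with the highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In use, a crawler built this way would score each fetched page with `cosine(topic_vector, page_vector(...))`, score each outgoing link from its anchor text in the same way, and `push` the link with that score so that the most topic-relevant URLs are fetched first.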

CLC number: