Computer Engineering ›› 2019, Vol. 45 ›› Issue (3): 26-31. doi: 10.19678/j.issn.1000-3428.0049976

Special topic: Cloud Computing and Big Data

Design and Implementation of Correlation Weight Algorithm Based on Hadoop Platform

GAO Jun, HUANG Xiance

  1. College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
  • Received: 2018-01-04 Online: 2019-03-15 Published: 2019-03-15
  • About the authors: GAO Jun (born 1979), male, associate professor, Ph.D.; main research interests are big data analysis and heterogeneous computing. HUANG Xiance, master's student.
  • Funding:

    National Natural Science Foundation of China (41701523); Shanghai Maritime University Graduate Innovation Fund (YXR2017032)

Abstract:

The traditional TF-IDF algorithm computes the correlation weight between a keyword and a document from term frequency and inverse document frequency alone, ignoring the influence of user interest on the weight. To better serve the user's information retrieval goals, this paper proposes a correlation weight algorithm based on log association. Taking a user-oriented view of relevance, the algorithm builds a user interest model by analyzing users' search logs and, following the idea of distributed computing, uses the MapReduce programming framework to parallelize the computation. Experimental results show that, when processing massive data, the algorithm not only improves efficiency but also dynamically adjusts the weights of query terms according to users' historical search records, which improves the interaction between users and the system.

Key words: distributed computing, TF-IDF algorithm, log, interest model, information retrieval
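
To make the idea in the abstract concrete, the following is a minimal single-process sketch of textbook TF-IDF with a log-derived adjustment. It is not the authors' algorithm: the abstract gives no formula, so the interest factor (1 plus the share of logged queries that use a term) and the function names `interest_factors` and `log_adjusted_weights` are assumptions made purely for illustration. The MapReduce parallelization described in the paper is also not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using the textbook TF-IDF definition."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def interest_factors(search_log):
    """Assumed interest model: 1 + share of logged queries using the term."""
    counts = Counter(search_log)
    total = sum(counts.values())
    return {term: 1.0 + count / total for term, count in counts.items()}


def log_adjusted_weights(docs, search_log):
    """Scale each TF-IDF weight by the (assumed) interest factor of its term."""
    factors = interest_factors(search_log)
    return {
        doc_id: {t: w * factors.get(t, 1.0) for t, w in terms.items()}
        for doc_id, terms in tf_idf(docs).items()
    }


# Toy corpus and search log, invented for this example.
docs = {
    "d1": ["hadoop", "mapreduce", "cluster"],
    "d2": ["hadoop", "retrieval", "index"],
}
search_log = ["retrieval", "retrieval", "cluster", "index"]

base = tf_idf(docs)
adjusted = log_adjusted_weights(docs, search_log)
```

In this toy run, terms that appear in the search log ("retrieval", "cluster", "index") receive a larger weight than plain TF-IDF assigns, while terms absent from the log keep their original weight. This mirrors, in a simplified way, the dynamic re-weighting by historical search records that the abstract describes.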

CLC number: