Computer Engineering, 2019, Vol. 45, Issue (3): 1-6. doi: 10.19678/j.issn.1000-3428.0050119

Special topic: Cloud Computing and Big Data

A Distributed User Browse Click Model Algorithm

ZHANG Haoshenglun 1,2, LI Chong 1, KE Yong 1, ZHANG Shibo 1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2018-01-05 Online: 2019-03-15 Published: 2019-03-15
  • About the authors: ZHANG Haoshenglun (born 1993), male, master's student, whose main research interests are click models and distributed computing; LI Chong (corresponding author), associate research fellow, Ph.D.; KE Yong, senior engineer; ZHANG Shibo, engineer
  • Funding:

    Informatization Special Project of the Chinese Academy of Sciences, "Informatization Assessment of the Chinese Academy of Sciences" (Y647021189)

Abstract:

To mine user behavior quickly from massive search click logs, a distributed User Browse Click Model (UBM) algorithm is proposed. The examination parameter E obtained by the original UBM algorithm depends only on the rank position of the result document and the position of the previously clicked document, and it is very stable. Based on this property, the EM iterative solution is converted into a distributed UBM algorithm that estimates the examination parameter from a sample and then solves for the attractiveness. Simulation on the Spark platform shows that, compared with the original UBM algorithm, the proposed algorithm resolves the severe data skew present in click logs and runs more efficiently.

Key words: click log, click model, User Browse Click Model (UBM) algorithm, search engine, Spark platform
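
The idea summarized in the abstract can be illustrated with a minimal single-machine sketch. It assumes the usual UBM factorization, in which the click probability is the product of a document's attractiveness and the parameter E (the examination probability), and it assumes E has already been estimated from a sample. The function name `estimate_attractiveness`, the `gamma` lookup table, and the clicks-over-expected-examinations estimator are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def estimate_attractiveness(sessions, gamma):
    """Estimate attractiveness per (query, doc) with examination held fixed.

    sessions: iterable of (query, [(doc, clicked), ...]), results in rank order.
    gamma: dict mapping (rank, distance_to_previous_click) -> examination
           probability, assumed to come from a sample (hypothetical input).
    """
    clicks = defaultdict(float)          # observed clicks per (query, doc)
    expected_exams = defaultdict(float)  # summed examination probabilities

    for query, results in sessions:
        prev_click_rank = 0  # 0 means no click so far in this session
        for rank, (doc, clicked) in enumerate(results, start=1):
            # Examination depends only on rank and distance to the last click.
            expected_exams[(query, doc)] += gamma[(rank, rank - prev_click_rank)]
            if clicked:
                clicks[(query, doc)] += 1.0
                prev_click_rank = rank

    # Moment estimator: E[click] = attractiveness * examination,
    # so attractiveness ~ clicks / expected examinations, capped at 1.
    return {
        key: min(1.0, clicks[key] / exams)
        for key, exams in expected_exams.items()
        if exams > 0
    }
```

Both accumulators are plain per-(query, document) sums, so partial results can be combined in any order, which is the property a distributed aggregation relies on. No EM iteration over the full log is needed once the examination table is fixed.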

CLC number: