
Computer Engineering

• Special Column

High Energy Physics Data Analysis System Based on MapReduce

ZANG Dong-song 1,2, HUO Jing 1,2, LIANG Dong 1,2, SUN Gong-xing 1

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2013-02-20 Published: 2014-02-15 Released online: 2014-02-13
  • About the authors: ZANG Dong-song (born 1981), male, Ph.D. candidate, main research interests: distributed computing and massive data management; HUO Jing and LIANG Dong, Ph.D. candidates; SUN Gong-xing, research professor
  • Supported by:

    Key Program of the National Natural Science Foundation of China (90912004)

Abstract:

This paper brings the MapReduce approach to high energy physics data analysis and proposes a data analysis system built on the Hadoop framework. By building a database of event TAG information, the system reduces the number of events that need further analysis by 2 to 3 orders of magnitude, which relieves I/O pressure and improves the efficiency of analysis jobs. Using an event pre-selection model based on TAG information and a MapReduce model of event analysis, it provides MapReduce class libraries for the ROOT framework that handle data splitting, event reading, and result merging. The system is implemented for the BESIII experiment at the Beijing Electron Positron Collider and tested on an 8-node experimental cluster. The results show that it shortens the analysis time for 4×10^6 events by 23%, and that as nodes are added, the number of events analyzed concurrently per second grows roughly in proportion to the number of cluster nodes, which indicates that the event analysis cluster scales well.

Key words: high energy physics, big data, data analysis, MapReduce model, cluster, distributed computing
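
The two-stage flow described in the abstract (TAG-based pre-selection, then a MapReduce pass over the surviving events) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which runs on Hadoop and ROOT. The `Tag` fields, the selection cut, the `read_event` stand-in, and all data values are assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Tag:
    """Hypothetical per-event summary stored in the TAG database."""
    event_id: int
    n_charged: int          # assumed field: number of charged tracks
    total_energy_mev: int   # assumed field: total energy in MeV


def preselect(tags: Sequence[Tag], cut: Callable[[Tag], bool]) -> List[int]:
    """Pre-selection: scan only the small TAG records, keep matching event ids."""
    return [t.event_id for t in tags if cut(t)]


def split(event_ids: Sequence[int], n_splits: int) -> List[List[int]]:
    """Data splitting: deal the selected ids round-robin into n_splits chunks."""
    return [list(event_ids[i::n_splits]) for i in range(n_splits)]


def map_events(event_ids: Iterable[int],
               read_event: Callable[[int], Dict[str, int]]) -> Counter:
    """Map: read each selected event in full and fill a partial histogram."""
    hist: Counter = Counter()
    for eid in event_ids:
        event = read_event(eid)
        hist[event["mass_bin"]] += 1
    return hist


def reduce_histograms(partials: Iterable[Counter]) -> Counter:
    """Reduce: merge the partial histograms by adding bin contents."""
    total: Counter = Counter()
    for h in partials:
        total.update(h)  # Counter.update adds counts bin by bin
    return total


def read_event(event_id: int) -> Dict[str, int]:
    """Stand-in for reading a full event record; the values are synthetic."""
    return {"mass_bin": event_id % 3}


# Synthetic TAG table for 1000 events (made-up values, for illustration only).
tags = [Tag(i, i % 5, 3000 + (i % 10) * 10) for i in range(1000)]
selected = preselect(tags, lambda t: t.n_charged == 2 and t.total_energy_mev > 3050)

# One map task per chunk; 8 chunks mirror the 8-node test cluster.
partials = [map_events(chunk, read_event) for chunk in split(selected, 8)]
merged = reduce_histograms(partials)
print(len(selected), sum(merged.values()))
```

Only the events that pass the TAG cut are ever read in full, which is where the I/O saving comes from. The toy cut here keeps one event in ten, whereas the abstract reports a reduction of 2 to 3 orders of magnitude for the real system. Because the partial histograms merge by simple addition, the map tasks are independent and can be spread across nodes.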
