Computer Engineering ›› 2021, Vol. 47 ›› Issue (2): 126-132. doi: 10.19678/j.issn.1000-3428.0057273

• Advanced Computing and Data Processing •

File Access Popularity Prediction for Hierarchical Storage for High-Energy Physics

CHENG Zhenjing1,2, WANG Lu1,2, CHENG Yaodong1,2,3, CHEN Gang1, HU Qingbao1, LI Haibo1,2   

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China;
  3. Tianfu Cosmic Ray Research Center, Institute of High Energy Physics, Chinese Academy of Sciences, Chengdu 610041, China
  • Received: 2020-01-20  Revised: 2020-02-28  Online: 2021-02-15  Published: 2020-03-25

  • About the authors: CHENG Zhenjing (born 1993), male, Ph.D. candidate, whose main research interests are distributed storage and machine learning; WANG Lu, associate researcher, Ph.D.; CHENG Yaodong and CHEN Gang, researchers, Ph.D.; HU Qingbao, M.S.; LI Haibo, associate researcher, Ph.D.
  • Supported by:
    National Key Research and Development Program of China (2017YFB0203200); National Natural Science Foundation of China (11675201, 11805226, 11805223).

Abstract: High-energy physics computing is a typical data-intensive workload. It mainly relies on file-based hierarchical storage, in which data is placed on storage devices of different performance levels according to its access popularity. Existing data popularity prediction schemes generally use heuristic algorithms based on human experience, and their prediction accuracy is low. This paper proposes a method for predicting the future access popularity of files using a Long Short-Term Memory (LSTM) network, covering network structure design, training, and prediction algorithms. The method partitions file accesses into dynamic time windows to construct a time series of file access features, and on this basis predicts the access trends of different data. Experimental results on a data set from the LHAASO high-energy physics experiment show that, compared with SVM, MLP, and other algorithms, the proposed method improves prediction accuracy by about 30% and is more widely applicable.

Key words: hierarchical storage, file access characteristics, time series data, Long Short-Term Memory (LSTM) network, file access popularity
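
As a rough illustration of the time-series construction the abstract describes, the sketch below counts accesses per file in consecutive time windows. It is not the authors' implementation: the function name `build_access_series`, the `(path, timestamp)` record format, the file paths, and the fixed window width are all assumptions made for illustration. The abstract does not specify how the paper's dynamic windows are chosen, and the LSTM model itself is omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_access_series(
    records: Iterable[Tuple[str, float]],
    window_seconds: float,
    num_windows: int,
    end_time: float,
) -> Dict[str, List[int]]:
    """Count accesses per file in `num_windows` consecutive windows.

    The windows cover [end_time - window_seconds * num_windows, end_time).
    Fixed-width windows are an assumption of this sketch; the paper's
    dynamic windowing is not described in the abstract.
    """
    start_time = end_time - window_seconds * num_windows
    series: Dict[str, List[int]] = defaultdict(lambda: [0] * num_windows)
    for path, timestamp in records:
        # Ignore accesses that fall outside the observation period.
        if not (start_time <= timestamp < end_time):
            continue
        index = int((timestamp - start_time) // window_seconds)
        # Guard against floating-point rounding at the upper boundary.
        index = min(index, num_windows - 1)
        series[path][index] += 1
    return dict(series)


# Hypothetical access log: (file path, access time in seconds).
records = [
    ("/lhaaso/a.root", 5.0),
    ("/lhaaso/a.root", 15.0),
    ("/lhaaso/a.root", 17.0),
    ("/lhaaso/b.root", 25.0),
    ("/lhaaso/b.root", 40.0),  # after end_time, so it is ignored
]
series = build_access_series(records, window_seconds=10.0, num_windows=3, end_time=30.0)
for path, counts in sorted(series.items()):
    print(path, counts)
```

Each per-file list of counts is one candidate input sequence for a sequence model such as an LSTM. In practice further features (for example, bytes read or time since creation) could be stacked per window, but which features the paper uses is not stated in the abstract.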
