[1]CAPPELLO F.Resilience:one of the main challenges for exascale computing[D].Champaign,USA:INRIA-Illinois Joint Laboratory on Petascale Computing,2011.
[2]KUSNEZOV D,BINKLEY S,HARROD B,et al.DOE exascale initiative[EB/OL].[2017-10-21].https://energy.gov/downloads/doe-exascale-initiative.
[3]KOGGE P,BERGMAN K,BORKAR S,et al.Exascale computing study:technology challenges in achieving exascale systems[EB/OL].[2017-10-21].http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.
[4]SCHROEDER B,GIBSON G A.A large-scale study of failures in high-performance computing systems[J].IEEE Transactions on Dependable and Secure Computing,2010,7(4):337-350.
[5]LIANG Y,ZHANG Y,JETTE M,et al.BlueGene/L failure analysis and prediction models[C]//Proceedings of the 43rd Annual IEEE International Conference on Dependable Systems and Networks.Washington D.C.,USA:IEEE Press,2006:425-434.
[6]ZHENG Z,YU L,TANG W,et al.Co-analysis of RAS log and job log on Blue Gene/P[C]//Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium.Washington D.C.,USA:IEEE Press,2011:840-851.
[7]HEIEN E,LAPINE D,KONDO D,et al.Modeling and tolerating heterogeneous failures in large parallel systems[C]//Proceedings of the 2011 International Conference for High Performance Computing,Networking,Storage and Analysis.Washington D.C.,USA:IEEE Press,2011:214-222.
[8]SCHROEDER B,PINHEIRO E,WEBER W.DRAM errors in the wild:a large-scale field study[C]//Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems.Washington D.C.,USA:IEEE Press,2009:193-204.
[9]PINHEIRO E,WEBER W,BARROSO L A.Failure trends in a large disk drive population[C]//Proceedings of the 5th USENIX Conference on File and Storage Technologies.Washington D.C.,USA:IEEE Press,2007:17-28.
[10]NIE B,TIWARI D,GUPTA S,et al.A large-scale study of soft-errors on GPUs in the field[C]//Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture.Washington D.C.,USA:IEEE Press,2016:519-530.
[11]ZHENG Z,LAN Z,GUPTA R,et al.A practical failure prediction with location and lead time for Blue Gene/P[C]//Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops.Washington D.C.,USA:IEEE Press,2010:125-134.
[12]SAHOO R K,OLINER A J,RISH I,et al.Critical event prediction for proactive management in large-scale computer clusters[C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York,USA:ACM Press,2003:426-435.
[13]GU J,ZHENG Z,LAN Z,et al.Dynamic meta-learning for failure prediction in large-scale systems:a case study[C]//Proceedings of IEEE International Conference on Parallel Processing.Washington D.C.,USA:IEEE Press,2008:154-166.
[14]GAINARU A,CAPPELLO F,SNIR M,et al.Fault prediction under the microscope:a closer look into HPC systems[C]//Proceedings of IEEE International Conference on High Performance Computing,Networking,Storage and Analysis.Washington D.C.,USA:IEEE Press,2012:456-465.
[15]LU X,WANG H Q,ZHOU R J,et al.Autonomic failure prediction based on manifold learning for large-scale distributed systems[J].Journal of China Universities of Posts and Telecommunications,2010,17(4):116-124.
[16]YOUNG J W.A first order approximation to the optimum checkpoint interval[J].Communications of the ACM,1974,17(9):530-531.
[17]GUNAWI H S,HAO M,SUMINTO R O,et al.Why does the cloud stop computing?:lessons from hundreds of service outages[C]//Proceedings of the 7th ACM Symposium on Cloud Computing.New York,USA:ACM Press,2016:1-16.
[18]GUNAWI H S,HAO M,LEESATAPORNWONGSE T,et al.What bugs live in the cloud? A study of 3000+ issues in cloud systems[C]//Proceedings of ACM Symposium on Cloud Computing.New York,USA:ACM Press,2014:1-14.
[19]HUANG P,GUO C,ZHOU L,et al.Gray failure:the Achilles’ heel of cloud-scale systems[C]//Proceedings of the 16th Workshop on Hot Topics in Operating Systems.Washington D.C.,USA:IEEE Press,2017:150-155.
[20]ZHENG Z,LAN Z,PARK B H,et al.System log pre-processing to improve failure prediction[C]//Proceedings of IEEE/IFIP International Conference on Dependable Systems and Networks.Washington D.C.,USA:IEEE Press,2009:215-226.
[21]ZHANG Zhihua.Reliability theory and engineering applications[M].Beijing:Science Press,2012.(in Chinese)