Abstract
As the Internet has grown and storage capacity has expanded rapidly, data loss caused by frequent hard disk failures in large data centers has become a major source of loss for enterprises. Previous work has built hard disk failure prediction models on SMART (self-monitoring, analysis and reporting technology) attributes using statistical and machine learning methods. Although these models perform well, their data collection and processing have shortcomings. Using SMART attribute data extracted from a real, large-scale Internet data center, this paper proposes an attribute selection method based on the neural network weight matrix, combined with three non-parametric statistical methods: the rank-sum test, the reverse arrangement test (RAT), and the Z-score. The selected attributes are used to build hard disk failure prediction models with two machine learning methods, the CART decision tree and the BP neural network. Experimental results show that both models perform very well, demonstrating that machine learning algorithms can be applied effectively in a real production setting. In addition, the experiments and their analysis yield several useful conclusions that lay the foundation for further research.
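The abstract names the rank-sum test and the reverse arrangement test without giving their formulas. As a rough illustration only, the sketch below computes the textbook normal-approximation statistics for both tests on a single SMART attribute. The function names and the sample values are hypothetical, ties receive average ranks with no tie correction to the variance, and nothing here is taken from the paper's actual procedure.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(healthy, failed):
    """Wilcoxon rank-sum z statistic for the failed sample (normal approximation).

    Assumption: the variance is not corrected for ties.
    """
    n1, n2 = len(failed), len(healthy)
    ranks = average_ranks(list(failed) + list(healthy))
    w = sum(ranks[:n1])  # rank sum of the failed-drive sample
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean) / math.sqrt(var)


def reverse_arrangements(series):
    """Count pairs (i, j) with i < j and series[i] > series[j]."""
    n = len(series)
    return sum(
        1 for i in range(n) for j in range(i + 1, n) if series[i] > series[j]
    )


def reverse_arrangement_z(series):
    """z statistic of the reverse arrangement count under the no-trend hypothesis."""
    n = len(series)
    a = reverse_arrangements(series)
    mean = n * (n - 1) / 4
    var = n * (n - 1) * (2 * n + 5) / 72
    return (a - mean) / math.sqrt(var)


if __name__ == "__main__":
    # Hypothetical readings of one SMART attribute, for illustration only.
    healthy_values = [100, 99, 100, 98, 100]
    failed_values = [97, 90, 85, 92, 88]
    print("rank-sum z:", rank_sum_z(healthy_values, failed_values))
    print("RAT z for a failed drive's series:", reverse_arrangement_z([97, 92, 90, 88, 85]))
```

A large absolute z value from either test suggests that the attribute separates failed drives from healthy ones, or trends over time, and so may be worth keeping. Any cut-off applied to these values would be a further assumption.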