<<

DATA ENGINEERING AND FAILURE PREDICTION FOR HARD DRIVE S.M.A.R.T.

Asanga Ramanayaka Mudiyanselage

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2020

Committee:

Robert C. Green II, Advisor

Robert Dyer

Yan Wu

Copyright © August 2020
Asanga Ramanayaka Mudiyanselage
All rights reserved

ABSTRACT

Robert C. Green II, Advisor

Failing hard drives within data centers can be costly, but it can be very difficult to predict the failure of these devices since they are designed to be reliable and, as such, they do not typically fail often or quickly. Due to this goal of reliable design, any data set that records hard drive failures tends to be highly imbalanced, containing many more records of hard drives continuing to function when compared to those that fail. Accordingly, this study focuses on predicting the failure of hard drives using S.M.A.R.T. data records as recorded by the entire Backblaze Data Set, covering multiple years of data beginning in 2013. In order to perform this analysis, a Data Engineering process is developed for collecting, combining, and cleaning the data set before various resampling algorithms, machine learning algorithms, and distributed and high performance computing techniques are applied to achieve proper feature selection and prediction. In addition, this data is divided on a per manufacturer basis, resulting in increased performance.

To my loving wife Shashini, daughter Binushi, my parents, and my sister for their encouragement, support and love. – Asanga Ramanayaka Mudiyanselage

ACKNOWLEDGMENTS

I hereby gratefully appreciate Dr. Robert Green for his mentorship, encouragement, and support over the last two years. His vision inspired me to pursue this master's degree in Computer Science and it initiated several new opportunities for me to work in this field. I also convey my gratitude to the committee members Dr. Robert Dyer and Dr. Yan Wu for their guidance and support throughout this thesis. The faculty members, the staff, and the colleagues of the Department of Computer Science are also acknowledged. I thank Backblaze for offering free access to hard-drive SMART data. I acknowledge the staff at the Ohio Supercomputing Center for granting me access to the Owens cluster for faster data processing. Finally, I would like to thank my loving wife Shashini, daughter Binushi, my parents, my sister, my in-laws, and my dear friends at BGSU for their cooperation and valuable support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 BACKGROUND & LITERATURE REVIEW
   2.1 Machine Learning Methods & S.M.A.R.T. Data
   2.2 Backblaze Dataset

CHAPTER 3 EXPLORATORY ANALYSIS
   3.1 Imbalanced Data
      3.1.1 Random Under Sampling (RUS)
      3.1.2 Synthetic Minority Over Sampling Technique (SMOTE)
   3.2 Cross-Validation
      3.2.1 Stratified K-fold Cross-validation
   3.3 Recursive Feature Elimination (RFE)
   3.4 Performance Measures
   3.5 Analysis and Results
      3.5.1 Random Undersampling
      3.5.2 SMOTE
   3.6 Principal Component Analysis

CHAPTER 4 EVALUATING BACKBLAZE DATA ON A PER MANUFACTURER BASIS
   4.1 Data Collection and Engineering
   4.2 Data Preprocessing for Per Manufacturer Analysis
      4.2.1 Data Munging
      4.2.2 Data Standardization
   4.3 Analysis of Toshiba Dataset
      4.3.1 Correlation Matrix
      4.3.2 Recursive Feature Elimination
      4.3.3 Scikit-Learn Analysis
      4.3.4 PySpark Analysis
   4.4 Analysis of Western Digital Dataset
      4.4.1 Correlation Matrix
      4.4.2 Recursive Feature Elimination
      4.4.3 Scikit-Learn Analysis
      4.4.4 PySpark Analysis
   4.5 Analysis of Hitachi Dataset
      4.5.1 Correlation Matrix
      4.5.2 Recursive Feature Elimination
      4.5.3 PySpark Analysis
   4.6 Analysis of Seagate Dataset
      4.6.1 Correlation Matrix
      4.6.2 Recursive Feature Elimination
      4.6.3 PySpark Analysis
   4.7 Threats to Validity

CHAPTER 5 CONCLUSION
   5.1 Future Works

BIBLIOGRAPHY

LIST OF FIGURES

3.1 Stratified k-fold cross-validation
3.2 Optimal number of features using RFECV
3.3 Features importance using RFECV
3.4 ROC curves - Random Undersampling
3.5 PR curves - Random Undersampling
3.6 ROC curves - SMOTE
3.7 PR curves - SMOTE
3.8 Number of principal components vs variance
3.9 Principal component plots

4.1 Data collection and engineering
4.2 Samples with duplicated features and dual status
4.3 Correlation plot for Toshiba
4.4 ROC curves for Scikit-Learn Analysis on Toshiba
4.5 PR curves for Scikit-Learn Analysis on Toshiba
4.6 ROC curves for PySpark Analysis on Toshiba
4.7 PR curves for PySpark Analysis on Toshiba
4.8 Correlation plot for Western Digital
4.9 ROC curves for PySpark Analysis on Western Digital
4.10 ROC curves for PySpark Analysis on Western Digital
4.11 Correlation plot for Hitachi
4.12 ROC curves for PySpark Analysis on Hitachi
4.13 Correlation plot for Seagate
4.14 ROC curves for PySpark Analysis on Seagate

LIST OF TABLES

2.1 Hard drive SMART attributes available in Backblaze dataset

3.1 Confusion matrix in general
3.2 Comparison of performances - 2014 dataset Random Undersampling
3.3 Confusion matrix for Decision Tree classifier on 2014 dataset
3.4 Confusion matrix for Random Forest classifier on 2014 dataset
3.5 Confusion matrix for AdaBoost classifier on 2014 dataset
3.6 Comparison of performances - 2014 dataset SMOTE
3.7 Confusion matrix for Decision Tree classifier on 2014 dataset
3.8 Confusion matrix for Random Forest classifier on 2014 dataset
3.9 Confusion matrix for AdaBoost classifier on 2014 dataset

4.1 No. of SMART attributes present in Backblaze dataset. "Q[X]" stands for "Quarter [X]"
4.2 Top ten features for Toshiba based on RFE
4.3 Comparison of results of Scikit-Learn Analysis on Toshiba
4.4 Confusion matrix for Decision Tree classifier on Toshiba dataset
4.5 Confusion matrix for Random Forest classifier on Toshiba dataset
4.6 Confusion matrix for AdaBoost classifier on Toshiba dataset
4.7 Comparison of results from PySpark analysis on Toshiba
4.8 Confusion matrix for Decision Tree classifier on Toshiba dataset
4.9 Confusion matrix for Random Forest classifier on Toshiba dataset
4.10 Confusion matrix for Gradient Boost classifier on Toshiba dataset
4.11 Confusion matrix for MLP classifier on Toshiba dataset
4.12 Top ten features for Western Digital based on RFE
4.13 Comparison of results of Scikit-Learn Analysis on Western Digital
4.14 Confusion matrix for Decision Tree classifier on Western Digital dataset
4.15 Confusion matrix for Random Forest classifier on Western Digital dataset
4.16 Confusion matrix for Ada Boost classifier on Western Digital dataset
4.17 Comparison of results from PySpark analysis on Western Digital
4.18 Confusion matrix for Decision Tree classifier of Western Digital PySpark analysis
4.19 Confusion matrix for Random Forest classifier of Western Digital PySpark analysis
4.20 Confusion matrix for Gradient Boost classifier of Western Digital PySpark analysis
4.21 Confusion matrix for MLP classifier of Western Digital PySpark analysis
4.22 Top ten features for Hitachi based on RFE
4.23 Comparison of results from PySpark analysis on Hitachi
4.24 Confusion matrix for Decision Tree classifier on Hitachi
4.25 Confusion matrix for Random Forest classifier on Hitachi
4.26 Confusion matrix for Gradient Boost classifier on Hitachi
4.27 Confusion matrix for MLP classifier on Hitachi
4.28 Top ten features for Seagate based on RFE
4.29 Comparison of results from PySpark analysis on Seagate
4.30 Confusion matrix for Decision Tree classifier on Seagate
4.31 Confusion matrix for Random Forest classifier on Seagate
4.32 Confusion matrix for Gradient Boost classifier on Seagate
4.33 Confusion matrix for MLP classifier on Seagate

CHAPTER 1 INTRODUCTION

In the modern era, people are producing a massive amount of data. This data is stored on different types of storage devices; still, the hard drive is the most commonly used storage device in the world. The most important feature of a storage device is reliability. Generally, hard disk drives are considered to be durable and reliable, however recent studies have shown that the hard drive is the most frequently replaced device in data centers [1]. Hard drive failures can be extremely costly for any company or user because they can lead to irreplaceable data loss or a live server crash that can cause millions of dollars in damage [2]. On the other hand, cloud storage service providers such as Apple and Backblaze are keen to know which drives are about to fail ahead of time so that they can take steps to protect users' data. In addition, hard drives fail in data centers every day and have to be replaced, which is an expensive process. Considering all these situations, it is extremely important to predict hard drive failures accurately in advance in order to mitigate data loss.

Most hard drive manufacturers have integrated advanced technologies such as Self-Monitoring, Analysis, and Reporting Technology (SMART) to report any possibility of hard drive failures [3]. SMART attributes contain several numerical values that represent the current condition of the hard drive such as read error rate, seek error rate, spin-up time, temperature, power-on hours, reallocated sectors count, etc. [4, 5]. They reveal defects and the health condition of hard drives. A frequently used approach is to set thresholds on each SMART attribute and raise an alarm when one is exceeded; however, this can lead to many false alarms [6]. At the same time, models that use multiple attributes perform much better than threshold-based mechanisms with isolated attributes [3, 7, 8]. Therefore, many researchers have created machine learning models by combining several SMART attributes to predict hard drive failures.

While most manufacturers use SMART attributes, the implementation of SMART attributes is not fully standardized. Hence, there are many challenges associated with using SMART attributes in machine learning models to predict hard drive health and failure [7, 9, 10]. Different hard drive manufacturers use a distinct set of SMART parameters in their products, and these tend to evolve over time. For example, when considering data recorded in the Backblaze Hard Drive Dataset [5], multiple issues are encountered:

• Some SMART parameters are completely null or zero for some hard drive models;

• Some of the null SMART parameters in the previous years started to produce values in the following years;

• The number and type of SMART parameters associated with a single manufacturer changes from year to year; and

• Each SMART parameter has two columns – raw value and normalized value – of which the normalized value is manufacturer specific.

When considering machine learning algorithms for predicting hard drive failures, this incon- sistency and evolution of data lead to inherent difficulties as this family of algorithms typically assumes that input will be consistent and homogeneous. As such, this thesis pursues two goals: First, to develop and describe a data engineering process for collecting, organizing, and combining hard drive SMART data from Backblaze that continually expands and evolves in a variety of ways. Second, to effectively predict hard drive failure on a per manufacturer basis. Accordingly, this thesis is differentiated from past works as it:

1. Performs thorough data engineering, including the definition of a process for continually combining and updating all data provided by Backblaze;

2. Performs a per manufacturer analysis and evaluation of this data, including the use of machine learning algorithms for predicting failure based on highly imbalanced datasets; and

3. Focuses on using the entire data set provided by Backblaze as opposed to individual years or quarters of data.

The remainder of this thesis is structured as follows: background information on the Backblaze dataset, on which this research was performed, is in Chapter 2; data pre-processing methods are included in Chapter 3; Chapter 4 consists of the analysis of this data on a per manufacturer basis; finally, our findings and potential future developments are presented in the Chapter 5 conclusion.

CHAPTER 2 BACKGROUND & LITERATURE REVIEW

This chapter details the state-of-the-art literature related to this project, including predictive analysis of hard disks, and provides a description of the Backblaze dataset.

2.1 Machine Learning Methods & S.M.A.R.T. Data

Several studies have used machine learning models and statistical methods to improve hard drive failure prediction. Most of the research uses SMART attributes for the prediction [2–4] while others use failure logs and captured disk events [9, 11]. Furthermore, these studies were performed on different datasets. Some of the datasets are publicly available, such as Backblaze [5], while many other datasets are not disclosed to the public due to security reasons. Many of the studies calculated the failure detection rate and false alarm rate (FAR). In general, the failure prediction rate varies between 20%-90% while FAR is around 0%-3%.

Murray and Hughes performed several studies on hard drive failure prediction. In [3] they introduced improved algorithms applied to SMART attributes to increase the correct prediction accuracy of hard drives. The prediction rate achieved in the study was 40% and the FAR was 0.2% [3]. Then, Murray and Hughes came up with non-parametric statistical methods to generate more accurate results with a FAR of 0.1% [12]. Later, they were able to achieve more than a 50% prediction rate with a FAR of 0%, using a Support Vector Machine (SVM) classifier which was computationally expensive [13]. Since then, prediction rates have been improved considerably by several other researchers.

Some researchers used different statistical techniques such as time series, maximum likelihood, regression trees, and evaluation metrics for predicting hard drive failures [4, 8, 9]. In addition, machine learning techniques such as SVM, Artificial Neural Networks (ANN), Classification Trees (CT), etc. were used in these studies [2, 4]. The work in [4] used Backpropagation Artificial Neural Networks (BP ANN) and an advanced SVM model with a dataset of more than 23,000 hard disk drives to predict drive failures. They were able to achieve a failure detection rate of 95% with a FAR of 0.03%, which is a relatively high accuracy [4]. Instead of using SMART attributes, [9] tried to predict hard drive failures using checksum mismatches of disks. They used a rule-based classifier to attain 70% accuracy [9]. In addition, Schroeder and Gibson predicted hard drive failures using mean-time-to-failure (MTTF) values [11]. On the other hand, [2] discussed predicting the actual failure time of hard disks. They used the Combined Bayesian Network (CBN) on SMART attributes to predict the actual failure time with 70% accuracy. In this study, four classifiers were trained by Back-propagation Artificial Neural Networks, Evolutionary Neural Network (ENN), SVM, and Classification Trees (CT) [2].

Yang and Hu changed the direction of hard drive failure prediction research by using big data to train their machine learning model [8]. In this study, the researchers focused more on improving the quality of training of the existing model, instead of building new advanced models. They used 74.5 million hard disk records to train their model Hdoctor and achieved about a 98% detection rate with a FAR of 0.3%, which is an impressive improvement when compared with previous studies [8]. This study emphasizes how to improve accuracy by greatly increasing the number of training samples. However, they used only the logistic regression model for the prediction.

Many recent studies have incorporated the publicly available Backblaze dataset, which is real-world, operational hard disk SMART data [6, 10, 14–26]. A few of those studies made interesting points on hard disk failure prediction.
The authors of [10] applied three machine learning models, SVM, Random Forest (RF), and Gradient-Boosted Tree (GBT), on the 2014 Backblaze dataset using a cluster. They used the Synthetic Minority Over-sampling Technique (SMOTE) as a resampling strategy to mitigate class imbalance [27]. Their RF and GBT models returned impressive results with precision (95%) and recall (67%). The authors of [6] used five different machine learning algorithms on one specific Seagate hard drive model of the 2016 Backblaze dataset; XGBoost was the best performing model among them. The authors of [22] used a larger subset of the Backblaze dataset (2013-2017) than other researchers. However, their goal was to predict the time to failure of the hard disks using regression analysis. More or less, all of the previous researchers have analyzed a few quarters or a few years of data, but no one has investigated all 7 years of data. In our study, we have incorporated the entire Backblaze dataset available at this point, that is, April 2013 to December 2019.

In addition, the majority of the studies done on the Backblaze dataset have focused only on a single manufacturer. Most frequently, the analyses were performed on a few selected Seagate drive models. A few researchers [10, 14, 16, 22] analyzed the corresponding datasets as a single heterogeneous hard drive population. Among them, a couple of studies [14, 16] suggested in their future work sections that manufacturer-based analysis would produce better results. They further discussed that per manufacturer investigation is advisable because many of the SMART parameters are vendor-specific and not standardized. The authors of [15] applied a machine learning-based pipeline on Seagate and Hitachi hard drives separately; however, they did not incorporate Western Digital or Toshiba makes due to the lack of available samples. Clearly, none of the existing studies predicted failures for all manufacturers with dedicated analyses.

2.2 Backblaze Dataset

Backblaze is a cloud storage and backup provider for enterprises and end-users all over the world. They have three data centers located in the US. The Backblaze Sacramento data center has published hard drive related data on their website for public access. The Backblaze dataset contains millions of records of hard drive data from 2013 to 2019 [5]. Each year, more hard disks are added to the data center. This is a relatively big dataset, with a volume of around 40 gigabytes and more than 160 million rows. The whole dataset available (at this point), from April 2013 to December 2019, was used in this study to predict hard disk drive failures.

The complete Backblaze dataset contains thousands of CSV files. A single file contains a snapshot of all working hard drives on each day [5]. For 2013, 2014, and 2015, yearly datasets were released at the end of each year. From 2016, new data was released at the end of each quarter since the number of operational disks had increased. There are several columns available in each file: serial number, model, capacity, failure status, and SMART attributes. SMART attributes are a set of flags which represent the current condition of the hard drive [4]. Many of the SMART attributes are empty because most drives do not report values for all fields daily. However, Yang et al. identified 22 basic SMART attributes that are meaningful, and Zhu et al. incorporated 10 SMART attributes in their study which are useful for predicting hard drive failures [4, 8]. Table 2.1 shows the list of available hard drive SMART attributes. The 22 SMART attributes used by Yang et al. are represented in bold text.

Table 2.1 Hard drive SMART attributes available in Backblaze dataset

SMART parameter    Name                        SMART parameter    Name
smart 1 raw        Read Error Rate             smart 190 raw      Airflow Temperature
smart 2 raw        Throughput Performance      smart 191 raw      G-Sense Errors
smart 3 raw        Spin Up Time                smart 192 raw      Power-Off Retract Cycles
smart 4 raw        Start/Stop Count            smart 193 raw      Load/Unload Cycles
smart 5 raw        Reallocated Sectors         smart 194 raw      Temperature Celsius
smart 7 raw        Seek Error Rate             smart 195 raw      Hardware ECC Recovered
smart 8 raw        Seek Time Performance       smart 196 raw      Reallocated Events
smart 9 raw        Power-On Hours              smart 197 raw      Current Pending Sectors
smart 10 raw       Spin-up Retries             smart 198 raw      Offline Uncorrectable
smart 11 raw       Calibration Retries         smart 199 raw      CRC Error Count
smart 12 raw       Power Cycle Count           smart 200 raw      Multi-Zone Error Rate
smart 13 raw       Soft Read Error Rate        smart 201 raw      Soft Read Errors
smart 15 raw       Vendor-specific field       smart 218 raw      Vendor-specific field
smart 16 raw       Vendor-specific field       smart 220 raw      Disk Shift
smart 17 raw       Vendor-specific field       smart 222 raw      Loaded Hours
smart 18 raw       Vendor-specific field       smart 223 raw      Load/Unload Retries
smart 22 raw       Current Helium Level        smart 224 raw      Load Friction
smart 23 raw       Vendor-specific field       smart 225 raw      Load/Unload Cycles
smart 24 raw       Vendor-specific field       smart 226 raw      Load-in Time
smart 168 raw      Vendor-specific field       smart 231 raw      Temperature
smart 170 raw      Reserved Block Count        smart 232 raw      Available Reserved Space
smart 173 raw      Wear Level Count            smart 233 raw      Media Wearout Indicator
smart 174 raw      Unexpected Power Loss       smart 235 raw      Good Block Count
smart 177 raw      Wear Range Delta            smart 240 raw      Head Flying Hours
smart 179 raw      Used Block Count            smart 241 raw      Total LBAs Written
smart 181 raw      Unused Block Count          smart 242 raw      Total LBAs Read
smart 182 raw      Erase Fail Count            smart 250 raw      Read Error Retry Rate
smart 183 raw      SATA Downshifts             smart 251 raw      Min Spares Remaining
smart 184 raw      End-to-End error            smart 252 raw      Bad Flash Block
smart 187 raw      Uncorrectable Errors        smart 254 raw      Free Fall Protection
smart 188 raw      Command Timeout             smart 255 raw      Vendor-specific field
smart 189 raw      High Fly Writes

The Backblaze dataset is considered to be a structured dataset. However, the number of SMART attributes tends to change from year to year. For instance, in 2015, five additional SMART attributes were added to the dataset, which means ten new columns were generated to store new values. Another interesting point of this dataset is the inconsistency of the fields. Some of the SMART attributes depend on the model of the hard drive and the manufacturer. The Backblaze dataset has four main hard drive manufacturers: Seagate, Hitachi, Western Digital, and Toshiba. Therefore, some of the fields are highly inconsistent and could not be used in the analyses of previous studies [5]. At the same time, the set of columns in which SMART values are stored changes based on the manufacturer. Hence, in this study, we split the Backblaze dataset into four based on the manufacturer and ran a predictive analysis on each dataset separately.

CHAPTER 3 EXPLORATORY ANALYSIS

To determine the best techniques for analyzing the data, a subset of the data, the year 2014 dataset (January 1 to December 31, 2014), was processed using different methods. The reason for using a single year of data for this exploration is to minimize the time spent on analysis while figuring out the best approach for classification. Aussel et al. used the same dataset (Backblaze, year 2014) for their study and obtained strong results. In this study, some of their methods are used as the basis to develop the analysis; however, our data preprocessing strategy and feature selection may differ from theirs. Accordingly, this chapter presents details regarding an exploratory analysis of the data. These results and methods directly impact choices made regarding the final analysis in Chapter 4.

3.1 Imbalanced Data

In the Backblaze dataset, the daily status of the hard drive is indicated by the column 'failure', where '0' represents a healthy hard drive on a particular day and '1' represents a failed hard drive. Therefore, predicting hard drive failures is identified as a binary classification problem. The year 2014 dataset consists of 12 million rows of daily records with 40 different SMART attributes, including 2206 failed hard drives, which is the highest number of failures per year in the whole dataset (2013-2019). The ratio of healthy drive samples to failed drive samples is approximately 5700:1. Clearly, there is a large difference between the number of observations per class; in other words, the dataset is extremely imbalanced. Therefore, any machine learning model will be biased when making predictions based on this dataset. In addition, the cost associated with misclassifying a failure as a healthy sample (type II error) is much higher than the cost of a type I error.

To address the imbalance of the dataset, resampling techniques can be used. Fundamentally, there are three types of resampling methods: oversampling, undersampling, and hybrid. Oversampling generates synthetic data by duplicating or creating new records of the minority class. Undersampling removes existing data from the majority class to reduce the number of observations. Some techniques use a combination of both methods to resample datasets. However, resampling techniques should only be applied on the training set; if applied on the test set, they will introduce an over-optimism problem [28].

The Imbalanced-Learn Python library consists of multiple resampling approaches [29]. It includes oversampling methods such as SMOTE, Adaptive Synthetic (ADASYN) sampling, and Random Oversampling, and undersampling methods such as NearMiss, TomekLink, Random Undersampling, etc. There are different variations of SMOTE: SMOTENC, SVMSMOTE, KMeansSMOTE, and BorderlineSMOTE. In addition, there are some combined resampling techniques: SMOTETomek and SMOTEENN [30]. In this study, after evaluating multiple methods, it was decided to use SMOTE and Random Undersampling for the analysis. Methods such as ADASYN, TomekLink, and other variations of SMOTE were so computationally expensive that each of them took more than a day to execute on a large dataset. Random Oversampling was reasonably fast; however, it has an over-fitting problem [28].

3.1.1 Random Under Sampling (RUS)

In Random Undersampling, the number of samples in the majority class is reduced to equal the number of samples in the minority class by randomly eliminating samples from the majority class [31]. It works faster than most of the other resampling techniques and generated better results in this study.

3.1.2 Synthetic Minority Over Sampling Technique (SMOTE)

SMOTE balances the dataset by producing artificial examples from the minority class. It randomly selects a sample from the minority class and produces cases along the line segments joining it with its k-nearest minority neighbors. This continues until the minority class has as many samples as the majority class [10]. Since the generated samples are not duplicates of the original minority samples, SMOTE addresses the over-fitting problem to some extent [27]. SMOTE should be applied within cross-validation; applying SMOTE before cross-validation would introduce an over-optimism issue [28]. A small usage sketch of both resamplers follows.
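As a minimal, hedged sketch (not the exact thesis code), both resamplers can be applied to a training split using the Imbalanced-Learn API; X_train and y_train are assumed to already hold the features and labels of a single training fold.

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Random Undersampling: shrink the majority (healthy) class
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# SMOTE: synthesize new minority (failure) samples
smote = SMOTE(random_state=42)
X_sm, y_sm = smote.fit_resample(X_train, y_train)

print(Counter(y_rus), Counter(y_sm))   # both resampled sets are now roughly 1:1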

3.2 Cross-Validation

Cross-Validation (CV) is the standard model evaluation method. It splits the dataset into k partitions; k-1 folds are used to train the model. After training, the reserved partition is used to test the model (Figure 3.1, extracted from the Scikit-Learn documentation [32]). This process is repeated k times so that every partition is used for both training and testing. Finally, performance measures such as accuracy, precision, and recall are averaged. It is important to make sure that, at each iteration, the test fold is not used for training; otherwise, an over-fitting issue would be introduced [28].

Figure 3.1 Stratified k-fold cross-validation

3.2.1 Stratified K-fold Cross-validation

Stratification is a cross-validation method in which data is partitioned into folds where every fold represents the complete dataset. It ensures each fold has approximately the same percentage of samples of the targeted class. This makes the training and testing folds preserve the class distribution. Stratified k-fold cross-validation is suitable for most applications of binary or multi-class classifiers [32].

Scikit-Learn provides a variety of methods for cross-validation such as KFold, Stratified k-fold, GroupSplit, ShuffleSplit, and TimeSeriesSplit [32]. Since Backblaze data is time-series data, one would argue that TimeSeriesSplit is the best cross-validation method for this analysis. It splits data based on fixed time intervals and uses the first k-1 folds as the training set and the kth fold as the test set. This leads to folds with different numbers of target labels. With this extremely imbalanced dataset, there can be folds without any failed hard drives, which results in highly biased predictions. Therefore, Stratified k-fold cross-validation suits best, since it produces folds with approximately the same percentage of samples of label '1'. Furthermore, Aussel et al. used Stratified 3-fold cross-validation in their study and produced better results [10]. We used k=3 in our study as well.
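A brief sketch of how such a stratified split can be set up with Scikit-Learn; X and y are assumed to be the feature matrix and label vector, and the fold loop mirrors the structure used later in Listing 4.1.

from sklearn.model_selection import StratifiedKFold

skFold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in skFold.split(X, y):
    # every fold keeps approximately the same proportion of failed drives
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]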

Figure 3.2 Optimal number of features using RFECV

3.3 Recursive Feature Elimination (RFE)

The Recursive Feature Elimination method is a step-wise process that selects the most important features in the dataset. At each iteration, feature importance is calculated and the least significant feature is eliminated. To calculate the final ranking, the reverse order in which features are excluded is used. Feature selection must be done within the cross-validation: for each fold, the dataset is split into train and test sets, and then a binary classifier such as the random forest is run to select the best features.

Scikit-Learn provides two built-in functions, RFE and RFECV, for selecting the best features. RFECV has cross-validation incorporated into recursive feature elimination. RFECV returns the optimal number of features by calculating feature importance scores based on a given input scoring parameter such as accuracy, f1-score, average-precision, etc. On the other hand, one of the drawbacks of RFECV is that we cannot use any resampling method in the feature selection process. As shown in Figure 3.2, RFECV was used to determine the optimal number of features (n=7) for the dataset and to select them. Figure 3.3 shows the feature importance score for each selected variable. The SMART 198 and 5 fields have the highest feature importance.

One of the Backblaze blog articles explained that, generally, SMART fields 5 (Reallocated Sector Count), 187 (Reported Uncorrectable Errors), 188 (Command Timeout), 197 (Current Pending Sector Count) and 198 (Offline Uncorrectable Errors) are the most useful parameters for predicting hard drive failures [33]. In this analysis, only SMART 5 and 198 from that list were included in the top 7 selected features. The main reason for this is that fields such as 187 and 188 are not present in makes like Toshiba. At the same time, fields 197 and 198 are highly correlated, and including one of them in the analysis might be enough.
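The following sketch shows how RFECV can be asked for the optimal feature count and the selected columns; X, y, and feature_names are assumed variables, and the estimator and scoring parameter are illustrative choices rather than the exact configuration used here.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(estimator=RandomForestClassifier(n_estimators=100),
                 step=1, cv=StratifiedKFold(n_splits=3), scoring='f1')
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print("Selected features:", selected)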

3.4 Performance Measures

Table 3.1 Confusion matrix in general

                  Predicted Negative     Predicted Positive
Actual Negative   True Negative (TN)     False Positive (FP)
Actual Positive   False Negative (FN)    True Positive (TP)

To evaluate the performance of a binary classifier, several metrics have been used. Since Backblaze is an extremely imbalanced dataset, metrics like accuracy and false alarm rate (FAR) are not appropriate for measuring model performance. For example, since the percentage of failed hard drives is less than 1%, even blindly predicting all hard drives as healthy drives will give an accuracy higher than 99%.

Figure 3.3 Features importance using RFECV

However, several studies used failure detection rate (FDR) and FAR to compare and contrast their models [15, 19, 23]. Balanced accuracy or Balanced Classification Rate (BCR) is the macro average of the recall of the two classes. It is a more appropriate performance measure than FDR. However, even blindly predicting all hard drives as healthy drives will give 50% balanced accuracy (100% accuracy on class 0 and 0% on class 1 yields a 50% balanced accuracy).

In this study, the confusion matrix, precision, recall, and F1-score were used to evaluate the performance of the models. The confusion matrix provides more details about the performance of the model than any other metric. Precision shows the percentage of samples correctly predicted out of the total predicted. Recall shows the percentage of samples identified out of the actual total. The F1-score indicates the F-measure, which is the balance of precision and recall [34]. One could suggest that the Matthews Correlation Coefficient (MCC) is also a great metric when dealing with an imbalanced dataset because it considers all four cells of the confusion matrix, including true negatives. However, in this study, the F1-score is good enough since the majority class is labeled as negative [35].

BCR = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}

F1\text{-score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

Using the Receiver Operating Characteristic (ROC) curve, we can determine the performance of the model. In the ROC curve, the True Positive rate is plotted against the False Positive rate. The area under the ROC curve (AUC) can vary from 0 to 1. AUC values greater than 0.5 indicate better-than-random performance; values less than 0.5 indicate that if we flip the zeros and ones of the predictions, we will get a better model than the existing one [36]. However, previous studies suggest that ROC curves can be overly optimistic if the dataset is imbalanced [37]. The Precision-Recall (PR) curve is an alternative to the ROC curve where precision (y-axis) is plotted against recall (x-axis). It deals reasonably well with imbalanced datasets. In addition, it is easier to compare and contrast different models using PR curves than ROC curves [37].
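As a small sketch of how these curve-based measures can be computed with Scikit-Learn, assuming y_test holds the true labels and y_score the predicted probabilities of the failure class:

from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

fpr, tpr, _ = roc_curve(y_test, y_score)            # points on the ROC curve
roc_auc = auc(fpr, tpr)                             # area under the ROC curve

precision, recall, _ = precision_recall_curve(y_test, y_score)
avg_precision = average_precision_score(y_test, y_score)   # summary of the PR curve

print("ROC AUC = %.2f, average precision = %.2f" % (roc_auc, avg_precision))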

3.5 Analysis and Results

Hundreds of machine learning algorithms are used for different purposes. Three frequently used algorithms were chosen to build models for this binary classification problem. Decision Tree is simple to implement and trains faster than most other algorithms. The Random Forest classifier is an ensemble model that has a good reputation for working with imbalanced data. AdaBoost is a boosting algorithm that trains a set of models sequentially, each learning from the previous models. This analysis was performed using the Scikit-Learn library in a Python 3 based Jupyter Lab environment. In this analysis, the year 2014 data were analyzed as a whole without splitting them into subsets. The data consists of hard drives mainly from four manufacturers and 81 different hard drive models. To remove the imbalance of the dataset, we used two resampling techniques separately and performed two different analyses.

3.5.1 Random Undersampling

Table 3.2 Comparison of performances - 2014 dataset Random Undersampling

Model           BCR%   Precision%   Recall%   F-1%   ROC AUC   PR curve
Decision Tree   74.5   0.1          76.7      0.2    0.81      0.00
Random Forest   78.0   0.2          82.9      0.4    0.86      0.39
Ada Boost       81.6   0.2          78.8      0.3    0.85      0.10

Table 3.3 Confusion matrix for Decision Tree classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   2811171       564181
Actual 1   178           402

Table 3.4 Confusion matrix for Random Forest classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   3155984       219369
Actual 1   100           480

Table 3.5 Confusion matrix for AdaBoost classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   3086400       288953
Actual 1   123           457

Figure 3.4 ROC curves - Random Undersampling

Figure 3.5 PR curves - Random Undersampling

Random Undersampling was applied on training data within 3-fold cross-validation. Three machine learning models were trained and tested. According to Table 3.2, precision was less than 1% in all the models because of the high number of false alarms seen in the confusion matrices (Tables 3.3, 3.4, and 3.5). As a result, the F1-score was also very low. In other words, all three models predicted that many hard drives would fail, although only a few hard drives actually failed. In fact, all models struggled to minimize false positives. All three models produced reasonable balanced accuracy and recall values, as shown in Table 3.2. This indicates all models were able to identify the majority of failed hard drives (few missed alarms). By considering Table 3.2 along with Fig. 3.4 and Fig. 3.5, it can be seen that the decision tree algorithm underperformed the other two. We noted that in Fig. 3.5 the average precision of the decision tree PR curve was zero. Overall, the random forest was the best among all three when random undersampling was applied.

3.5.2 SMOTE

Table 3.6 Comparison of performances - 2014 dataset SMOTE

Model           BCR%   Precision%   Recall%   F-1%   ROC AUC   PR curve
Decision Tree   77.0   0.1          61.4      0.3    0.77      0.00
Random Forest   79.0   1.9          59.1      3.6    0.84      0.31
Ada Boost       85.6   0.3          74.8      0.5    0.89      0.39

Table 3.7 Confusion matrix for Decision Tree classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   3139536       235830
Actual 1   219           348

SMOTE was applied on the 2014 dataset within 3-fold cross-validation. Table 3.6 shows the comparison of the three models using different performance measures. Balanced accuracy, precision, and F-1 score for the Random Forest and AdaBoost classifiers were slightly improved by using SMOTE instead of random undersampling. The worst performing algorithm was the decision tree, which performed worse with SMOTE than with undersampling.

Table 3.8 Confusion matrix for Random Forest classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   3357594       17772
Actual 1   232           335

Table 3.9 Confusion matrix for AdaBoost classifier on 2014 dataset

           Predicted 0   Predicted 1
Actual 0   3219199       156167
Actual 1   143           424

However, by observing the confusion matrices in Tables 3.7, 3.8, and 3.9, we can see that the number of false positives was reduced in all three models. On the other hand, the number of false negatives increased, hence recall values decreased. Once again the average precision of the decision tree PR curve was zero, as shown in Fig. 3.7. The average precision value of the AdaBoost classifier increased from 0.10 to 0.39 by using SMOTE, which is a considerable increase.

Overall, we observed that all three machine learning models struggled to predict failed hard drives properly. We conclude that analyzing the 2014 dataset as a whole did not produce great results. Therefore, we decided that analyzing the Backblaze dataset (from 2013 to 2019) on a per manufacturer basis may produce better results.

3.6 Principal Component Analysis

Principal Component Analysis (PCA) is a linear transformation method and an ordination technique used in multivariate statistical models. PCA transforms variables into a new set of variables called "principal components" along which variation is maximal. The first principal component captures the maximum variance contained in the original variables. This variation decreases in the second and third components, respectively [38, 39]. In this study, PCA was performed on the 2014 dataset to identify any separation between healthy and failed hard drives using principal components.

Figure 3.6 ROC curves - SMOTE

Before applying PCA, the dataframe containing the 2014 dataset was scaled using the standard scaler of the Scikit-Learn library. Figure 3.8 plots the number of principal components against the percentage of cumulative variance explained by them. According to the graph, at least 10 principal components are needed to explain 90% of the cumulative variance. This indicates that dimensionality reduction using PCA did not work very well: instead of the thirteen variables in the dataset, 10 principal components are necessary, which is still impossible to plot in 3-D space [40]. According to Figure 3.8, three principal components can explain only about 50% of the cumulative variance. Nevertheless, PCA was applied to generate 3 principal components to see if we could identify any separation between healthy and failed hard drives. By running PCA, the thirteen variables in the dataset were projected onto 3 principal components. Figure 3.9 shows the generated principal component plots. To make the failed hard drives clearly visible in the graphs, data were filtered to keep only values within +3 to -3 standard deviations of the mean.
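A minimal sketch of this step, assuming df holds the thirteen selected SMART columns of the 2014 dataframe (the exact column list follows the earlier feature selection):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(df)       # zero mean, unit variance

pca_all = PCA().fit(X_scaled)                       # all components, for the variance plot
print(pca_all.explained_variance_ratio_.cumsum())   # cumulative variance explained

pca3 = PCA(n_components=3)                          # 3 components for the scatter plots
components = pca3.fit_transform(X_scaled)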

Figure 3.7 PR curves - SMOTE

Figure 3.8 Number of principal components vs variance

As shown in Figure 3.9, by plotting the three principal components against each other, we could not differentiate between the healthy and failed hard drives. According to the blue and red markers in all three graphs, it seems that most failed hard drives are clustered together. However, since the healthy and failed hard drives overlap with each other, it was impossible to identify any separation between them using PCA.

Figure 3.9 Principal component plots

CHAPTER 4 EVALUATING BACKBLAZE DATA ON A PER MANUFACTURER BASIS

Due to previously identified trends in the Backblaze dataset suggesting that the data should be split by manufacturer for analysis, this chapter details the related data engineering and evaluation process, which includes collecting and combining all data, preprocessing the data for analysis, analyzing the data, and then performing an analytical comparison.

4.1 Data Collection and Engineering

In order to separate the Backblaze dataset by manufacturer, a Python script was run to extract, transform and load (ETL) the data. The script implements the following process as represented in Figure 4.1:

1. Download and collect Data;

2. Decompress and save files appropriately. This results in 90-92 daily CSV (comma separated value) files in one folder for each quarter of data;

3. For each quarter of data:

(a) Load the files

(b) Remove Normalized Columns

(c) Filter by Manufacturer

(d) Order by Columns

(e) Combine with previously processed data

4. Perform analysis.

Downloading, decompressing, and processing steps were necessary to organize the data for straightforward use in the analysis. In step three, data processing of each quarter was performed as follows. First, all the normalized columns were dropped from the dataset since they could not be used for the analysis; only the SMART raw fields were used to build the models.

Figure 4.1 Data collection and engineering

Second, the entire Backblaze dataset was split based on the manufacturer. There are mainly four different manufacturers in the dataset. More than 60% of the records are from Seagate hard drives; the rest of the hard drives are Hitachi, Western Digital, and Toshiba. In order to identify the manufacturer, we used the field "model": all Seagate drive models start with "ST", Hitachi with "H", Western Digital with "W", and Toshiba with "T".

One major challenge with this data was handling the inconsistent growth of the columns between datasets. As shown in Table 4.1, the number of SMART attributes kept growing over the years. When Backblaze started publishing data in 2013, there were only 40 SMART columns; at present, 63 SMART columns are present in the 2019 quarter 4 dataset. The main reason is that different manufacturers kept updating their technologies to capture more parameters. To address this, while processing the data in each directory, SMART parameters were ordered according to the growth of the columns in each quarter. By ordering, columns that were added later are pushed to the right side of the dataframe. If newly added columns were unavailable in a dataset, ordering the columns ensured that they were assigned null automatically while splitting each row. In the end, each quarter's data was appended to the existing CSV file for its manufacturer.

Table 4.1 No. of SMART attributes present in the Backblaze dataset. "Q[X]" stands for "Quarter [X]".

No. of SMART Attributes   Corresponding Datasets
40                        2013, 2014
45                        2015, 2016 Q1, Q2, Q3, Q4, 2017 Q1, Q2, Q3, Q4
50                        2018 Q1
52                        2018 Q2, Q3
62                        2018 Q4, 2019 Q1, Q2, Q3
63                        2019 Q4

The end goal of this process was the creation of four different CSV files, each related to a single manufacturer. For processing and analysis, these files were loaded into Pandas Dataframes. Due to the size of these files (300MB - 20GB), the script also had to ensure that all data was flushed appropriately.
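A simplified sketch of the collection and splitting process described above; the directory layout, manufacturer prefix map, and output file names are assumptions for illustration, not the exact production script (which also reorders columns as new attributes appear and manages headers and flushing):

import glob
import pandas as pd

PREFIXES = {'ST': 'seagate', 'H': 'hitachi', 'W': 'western_digital', 'T': 'toshiba'}

for daily_file in sorted(glob.glob('data/2019_Q4/*.csv')):      # one quarter at a time
    df = pd.read_csv(daily_file)
    raw_cols = [c for c in df.columns if not c.endswith('_normalized')]
    df = df[raw_cols]                                            # drop normalized columns
    for prefix, name in PREFIXES.items():
        subset = df[df['model'].str.startswith(prefix)]
        # append to the growing per manufacturer CSV (header handled separately)
        subset.to_csv(name + '.csv', mode='a', index=False, header=False)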

4.2 Data Preprocessing for Per Manufacturer Analysis

We tried to perform the per manufacturer analysis for the Seagate and Hitachi datasets with Pandas; however, they were too large for the Pandas library to handle. Hence, they were analyzed using Apache Spark. The Toshiba and Western Digital datasets were relatively small and were therefore analyzed using both methods: first using Pandas + Scikit-Learn and then using Apache Spark. The CSV files containing the hard drive SMART data were loaded into Jupyter Lab and analyzed separately. However, the data munging and data standardization steps are similar for all four manufacturers, so the common approach is explained here.

4.2.1 Data Munging

Data munging was completed in a few steps. First, the SMART columns were converted to floats. Then all the columns that had more than 25% null values were dropped from the dataframe. Here we noticed that different manufacturers had a different set of columns to drop, mainly due to the vendor-specific nature of SMART parameters. After that, rows with any null values were dropped. Then columns with a single value for all rows (e.g. all-zero columns) were also dropped. After these steps, 16-23 SMART attributes were left in the dataframes for building machine learning models.

At this point, we observed that some working disk records and failed disk records had the exact same values for the feature variables, as shown in Figure 4.2. We called them "samples with duplicated features and dual status". The main reason behind this is that some hard disk drives were included in the daily snapshot as working hard drives on the day before they died. Therefore, such a hard drive has the same feature values on consecutive days with a failure status of both 0 and 1. These records cannot be differentiated by any binary classifier and they confuse machine learning models. For example, there were 16 samples in the Toshiba dataset showing that behavior; the 8 working samples among them were dropped from the dataframe to maintain integrity. A sketch of these munging steps follows.
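A hedged sketch of these munging steps for one manufacturer's dataframe df; the 25% threshold matches the text, while the column naming (smart_*_raw, failure) follows the Backblaze layout and the duplicate handling is one reasonable implementation:

# Cast SMART columns to float, then drop columns that are more than 25% null
smart_cols = [c for c in df.columns if c.startswith('smart_')]
df[smart_cols] = df[smart_cols].astype('float64')
df = df.dropna(axis=1, thresh=int(0.75 * len(df)))
df = df.dropna(axis=0)                      # drop rows with any remaining nulls
df = df.loc[:, df.nunique() > 1]            # drop constant (e.g. all-zero) columns

# Drop the healthy-side records of "dual status" samples: identical feature
# values appearing with failure = 0 and failure = 1
feature_cols = [c for c in df.columns if c.startswith('smart_')]
statuses = df.groupby(feature_cols)['failure'].transform('nunique')
df = df[~((statuses > 1) & (df['failure'] == 0))]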

Figure 4.2 Samples with duplicated features and dual status

4.2.2 Data Standardization

To achieve standardization across the data, all SMART attributes were standardized using the standard scaler, by removing the mean and scaling to unit variance. This step was mandatory to get all the SMART parameters onto the same scale before applying any machine learning model. It is an important preprocessing step because features with extremely large values would otherwise dominate the model, and SMART attributes clearly have different scales. For example, while SMART 190 and 194 are temperature values in Celsius, typically ranging from 10 to 40, the SMART 1 and 7 columns represent error rates usually in the millions. Hence, all SMART parameters were standardized using the standard scaler function. Starting from the next step, the per manufacturer analysis is presented separately, since the results differ for each manufacturer.
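In code, this standardization reduces to a few lines; the sketch below fits the scaler on the training split and reuses it on the test split, which is one reasonable reading of the step (the text does not spell out the exact fitting strategy):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learn mean and variance from training data
X_test_std = scaler.transform(X_test)         # apply the same scaling to the test data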

Figure 4.3 Correlation plot for Toshiba

4.3 Analysis of Toshiba Dataset

The Toshiba portion of the analyzed dataset consists of 1.8 million samples, only 204 of which are records containing failures; failures therefore make up roughly 0.01% of the Toshiba data. The Toshiba data also differs from other manufacturers in the inclusion of SMART features 4 and 220. Analysis of the Toshiba dataset was completed using both Scikit-Learn and PySpark. Since this is a relatively small dataset, it could be analyzed using Pandas + Scikit-Learn. However, we analyzed it using PySpark as well, in order to compare the results with the other manufacturers.

4.3.1 Correlation Matrix

The correlation matrix was plotted for the Toshiba dataset using the 16 SMART parameters left in the dataset after removing columns with null values (explained in Section 4.2.1, Data Munging). Figure 4.3 represents the correlation plot, where the Pearson correlation coefficient of each column against every other column is shown. It indicated that some SMART attributes had strong relationships with other attributes, so there is no point in including both of the correlated variables in the analysis. We dropped SMART parameters that had correlation coefficients higher than 0.8. Accordingly, SMART 196, 222, and 226 were dropped from the Toshiba dataframe since they had a high correlation with SMART 5, 9, and 220, respectively. The number of correlated variables and the set of correlated parameters depend highly upon the manufacturer.
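A small sketch of this correlation-based filtering, assuming df holds the remaining SMART columns; the 0.8 cutoff is the one stated above:

import numpy as np

corr = df.corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print("Highly correlated columns to drop:", to_drop)
df = df.drop(columns=to_drop)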

4.3.2 Recursive Feature Elimination

We used Scikit-Learn's RFECV function to select features when analyzing the 2014 dataset, as mentioned in Chapter 3. However, we used the RFE function in the per manufacturer analysis since it can be combined with a resampling method like SMOTE within K-fold cross-validation. We decided to use the top ten features from each per manufacturer analysis in the machine learning models. RFE was run using SMOTE, Stratified K-fold sampling, and a Random Forest classifier; as mentioned in Chapter 3, it is important to apply SMOTE inside the cross-validation to avoid over-optimism. The feature rankings generated by the RFE function were used to select the top ten features, as sketched below. Table 4.2 shows the selected features for Toshiba hard drives. SMART attributes shown in italics are the features used in previous works [8].
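A hedged sketch of this ranking procedure; averaging the per fold rankings is an assumption about how the fractional values in Table 4.2 can be obtained, and X and y are the standardized features and labels:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold

rankings = []
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, _ in skf.split(X, y):
    X_res, y_res = SMOTE().fit_resample(X[train_idx], y[train_idx])   # resample the fold only
    rfe = RFE(estimator=RandomForestClassifier(n_estimators=100),
              n_features_to_select=10).fit(X_res, y_res)
    rankings.append(rfe.ranking_)

mean_rank = np.mean(rankings, axis=0)        # averaged rankings across the folds
top_ten = np.argsort(mean_rank)[:10]         # indices of the ten best features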

Table 4.2 Top ten features for Toshiba based on RFE

Ranking   Feature
1.0       smart 3 raw
1.0       smart 4 raw
1.0       smart 5 raw
1.0       smart 9 raw
1.0       smart 193 raw
1.0       smart 194 raw
1.0       smart 197 raw
1.0       smart 199 raw
1.0       smart 220 raw
1.6       smart 191 raw

4.3.3 Scikit-Learn Analysis

The Toshiba dataset was split using stratified sampling into 80% of the data as the training set and the remaining 20% as the testing set. In this dataset, the ratio of healthy drives to failed hard drives is about 9000 to 1. This indicates an extremely imbalanced dataset on which many machine learning algorithms might perform poorly. To overcome this issue, SMOTE was applied on the training dataset inside K-fold cross-validation.

# Listing 4.1 assumes X, y, skFold, and the helper plot_cm are defined earlier
# in the notebook; the imports below are the ones the code requires.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

for train_index, test_index in skFold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Apply SMOTE only to the training fold to avoid over-optimism
    resampling = SMOTE()
    X_train_res, y_train_res = resampling.fit_resample(X_train, y_train)
    print('Resampled dataset shape %s' % Counter(y_train_res))

    # Train the decision tree on the resampled training fold
    classifier = DecisionTreeClassifier()
    classifier.fit(X_train_res, y_train_res)
    classifier_accuracy = classifier.score(X_test, y_test)
    print("classifier_accuracy = {}%".format(classifier_accuracy * 100))

    # Evaluate on the untouched test fold
    y_pred = classifier.predict(X_test)
    cnf_matrix = confusion_matrix(y_test, y_pred)
    plt.figure()
    plot_cm(cnf_matrix)
    plt.show()
    print(classification_report(y_test, y_pred))

Listing 4.1: Decision tree binary classifier

Three widely used supervised learning models were applied on the Toshiba dataset: Decision Tree, Random Forest, and AdaBoost. The script in Listing 4.1 shows the code for the decision tree binary classifier applied on the dataset. All the hyper-parameters were left at their default values for each classifier. Finally, balanced accuracy, precision, recall, F1-score, the area under the ROC curve, and the average precision of the PR curve were calculated.

Table 4.3 Comparison of results of Scikit-Learn Analysis on Toshiba

Model           BCR%   Precision%   Recall%   F-1%   ROC AUC   PR curve
Decision Tree   60.0   0.7          20.6      1.4    0.60      0.00
Random Forest   71.2   1.1          42.9      2.1    0.85      0.01
Ada Boost       71.0   0.1          46.0      0.3    0.80      0.02

Table 4.4 Confusion matrix for Decision Tree classifier on Toshiba dataset

           Predicted 0   Predicted 1
Actual 0   559196        1745
Actual 1   50            13

Table 4.5 Confusion matrix for Random Forest classifier on Toshiba dataset

           Predicted 0   Predicted 1
Actual 0   576753        2517
Actual 1   36            27

Table 4.6 Confusion matrix for AdaBoost classifier on Toshiba dataset

           Predicted 0   Predicted 1
Actual 0   556532        22738
Actual 1   34            29

We defined the failure samples as positive and the healthy samples as negative. Table 4.3 includes the performance measures for the Scikit-Learn analysis on the Toshiba dataset. Due to a high number of False Positives, all the models produced very low precision values, around 1%; hence the F-1 score is also very low. Balanced accuracy, recall, and ROC-AUC values were reasonably good. Overall, Random Forest produced slightly better results than the AdaBoost model, and the Decision Tree model underperformed the rest. The average precision values of the PR curves were small for all models; the reason might be the use of an extremely imbalanced dataset.

Confusion matrices of the models are shown in Tables 4.4, 4.5, and 4.6. In all cases, the number of True Negatives is high. This is good because it indicates almost all healthy hard drives have been identified correctly. On the other hand, the number of False Positives is high when compared with the number of True Positives, which leads to poor performance. This is the main reason for the low precision (see the precision equation in Section 3.4). Figure 4.4 shows the ROC curves generated for each model. Random Forest has the highest area under the curve; Decision Tree had the lowest value and the worst shape of curve. According to most of the performance measures, AdaBoost was a close contender with the Random Forest classifier. The PR curves plotted for the Toshiba dataset produced extremely low values, as shown in Figure 4.5. Since we got low precision and F-1 scores, we tried to improve the results by performing hyper-parameter tuning with Scikit-Learn's GridSearchCV, sketched below; however, there was no noticeable gain.
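The tuning attempt can be sketched roughly as follows; the classifier and parameter grid are purely illustrative, not the exact grid that was searched:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],
              'max_depth': [None, 10, 20],
              'class_weight': [None, 'balanced']}

search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring='f1', cv=3, n_jobs=-1)
search.fit(X_train_res, y_train_res)          # resampled training data, as in Listing 4.1
print(search.best_params_, search.best_score_)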

4.3.4 PySpark Analysis

Machine learning models were applied on all four datasets using the Apache Spark ML library. The main reason for using Spark was that the Hitachi (8 GB) and Seagate (20 GB) datasets were too big to handle with the Pandas and Scikit-Learn libraries. However, we realized that many built-in features such as stratification, RFE, and the resampling methods provided by Scikit-Learn and Imbalanced-Learn were not available in Apache Spark. Therefore, some of them were written from scratch. Random undersampling, written manually from scratch, was used to remove the imbalance of the datasets. SMOTE could not be used in this step for two reasons: the Seagate and Hitachi datasets each contain more than 50 million records, and applying SMOTE on such big datasets would take a very long time; at the same time, a built-in SMOTE function was not available, and a manually written SMOTE function did not perform as efficiently as the Imbalanced-Learn built-in SMOTE function. In all PySpark analyses of the per manufacturer data, each dataset was split into 80% as the training set and the remaining 20% as the testing set, and random undersampling was applied on the training set.

Figure 4.4 ROC curves for Scikit-Learn Analysis on Toshiba

Four machine learning models were used in the PySpark analysis: Decision Tree, Random Forest, Gradient Boosted Tree (GBT), and Multi-layer Perceptron (MLP). All of them are built-in classifiers in Spark. GBT was selected since AdaBoost was not available in the Spark ML library. MLP refers to a type of feed-forward artificial neural network (ANN). When using the PySpark MLP classifier, only limited options could be customized; we specified the layers as 10, 25, 25, and 2, where 10 indicates the number of input features, 2 is the number of output classes, and the two intermediate layers have size 25. The comparison of the results for each machine learning model is shown in Table 4.7. For the PySpark analysis, the metric calculated for the PR curve was the area under the curve (AUC) instead of Scikit-Learn's average precision, mainly due to the difference in the built-in functions of the PySpark and Scikit-Learn libraries. A minimal sketch of this setup follows.
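A minimal PySpark sketch of this setup; it assumes Spark dataframes train_df and test_df with an assembled 'features' vector column and a 'label' column, and only the MLP layer sizes are taken from the text, everything else being illustrative:

from pyspark.ml.classification import RandomForestClassifier, MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Manual random undersampling of the majority (healthy) class on the training split
failures = train_df.filter('label = 1')
healthy = train_df.filter('label = 0')
fraction = failures.count() / healthy.count()
train_bal = healthy.sample(False, fraction, seed=42).union(failures)

rf_model = RandomForestClassifier(featuresCol='features', labelCol='label').fit(train_bal)
mlp_model = MultilayerPerceptronClassifier(layers=[10, 25, 25, 2],
                                           featuresCol='features',
                                           labelCol='label').fit(train_bal)

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
print(evaluator.evaluate(rf_model.transform(test_df)))    # ROC AUC on the test split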

Figure 4.5 PR curves for Scikit-Learn Analysis on Toshiba

Table 4.7 Comparison of results from PySpark analysis on Toshiba

Model           BCR%   Precision%  Recall%  F-1%  ROC AUC  PR curve
Decision Tree   71.98  0.04        64.44    0.08  0.76     0.00
Random Forest   73.92  0.03        80.00    0.06  0.83     0.01
Gradient Boost  73.81  0.04        71.11    0.08  0.80     0.00
MLP             73.11  0.04        71.11    0.07  0.79     0.00

Table 4.8 Confusion matrix for Decision Tree classifier on Toshiba dataset

          Predicted 0  Predicted 1
Actual 0  275996       71141
Actual 1  16           29

Confusion matrices of the four models used in the PySpark analysis are shown in Tables 4.8, 4.9, 4.10, and 4.11. By examining these matrices together with Table 4.7, several facts can be highlighted.

Table 4.9 Confusion matrix for Random Forest classifier on Toshiba dataset

          Predicted 0  Predicted 1
Actual 0  235483       111654
Actual 1  9            36

Table 4.10 Confusion matrix for Gradient Boost classifier on Toshiba dataset

          Predicted 0  Predicted 1
Actual 0  265611       81526
Actual 1  13           32

Table 4.11 Confusion matrix for MLP classifier on Toshiba dataset

          Predicted 0  Predicted 1
Actual 0  253779       93358
Actual 1  13           32

All the models had extremely low precisions and F-1 scores due to the high number of False-Positives. Recall values were better than 60% in each model since True-Positives outnumbered False-Negatives in all models. Based on recall, ROC-AUC values, and balanced accuracies, the Random Forest classifier is the best model of the four; Figure 4.6, which compares the ROC curves, indicates the same. PR-AUC values were extremely low, as shown in Figure 4.7. Comparing the results of the Scikit-Learn analysis (Table 4.3) and the PySpark analysis (Table 4.7) on the same Toshiba dataset highlights the difference between the two. One would expect similar results for the two analyses since we used the same dataset and SMART parameters. The main reason for the divergent results is likely the use of different resampling techniques in each analysis. Furthermore, the Scikit-Learn library has more sophisticated techniques and customizable parameters in its machine learning algorithms than the Spark ML library. In addition, the use of other techniques such as cross-validation can impact the results.
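As a worked example of how the confusion matrices translate into the measures in Table 4.7, the Random Forest entries in Table 4.9 (TP = 36, FP = 111654, FN = 9) give

\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{36}{36 + 111654} \approx 0.0003\ (0.03\%),
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{36}{36 + 9} = 0.80\ (80.00\%),
\]

which match the 0.03% precision and 80.00% recall reported for Random Forest in Table 4.7.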

Figure 4.6 ROC curves for PySpark Analysis on Toshiba

4.4 Analysis of Western Digital Dataset

4.4.1 Correlation Matrix

The correlation matrix was plotted for the Western Digital dataset using 15 SMART parameters. None of the SMART parameters had a correlation coefficient higher than 0.8; the maximum correlation was 0.69. Therefore, we did not drop any of the features. This is one of the unique aspects of the Western Digital dataset, as all the other datasets had at least a couple of highly correlated features. At the same time, many coefficients are shown in yellow, which indicates negative correlations; this too is unique to the Western Digital dataset.

Figure 4.7 PR curves for PySpark Analysis on Toshiba

4.4.2 Recursive Feature Elimination

Table 4.12 shows the features selected using RFE to run the machine learning models for Western Digital hard drives. When we compare this with Table 4.2, where the top ten parameters for the Toshiba dataset are listed, we notice a big difference: only four SMART attributes are common to both (SMART 3, 5, 194, and 197). This indicates that feature importance can differ by manufacturer, and it supports the argument that per-manufacturer analysis is more appropriate for hard disks. SMART attributes shown in italics are the features used in previous work [8].
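A minimal Scikit-Learn sketch of this selection step is shown below, assuming X is a Pandas dataframe of the raw SMART attributes and y the failure labels; the fractional rankings reported in the tables suggest integer RFE rankings averaged over several runs or folds, which is not shown here.

    # Sketch of recursive feature elimination with Scikit-Learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    rfe = RFE(estimator=RandomForestClassifier(random_state=42),
              n_features_to_select=10, step=1)
    rfe.fit(X, y)
    # rank 1 marks the selected features; higher ranks were eliminated earlier
    ranked = sorted(zip(rfe.ranking_, X.columns))
    print(ranked[:10])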

Figure 4.8 Correlation plot for Western Digital

Table 4.12 Top ten features for Western Digital based on RFE

Ranking  Features
1.0      smart 1 raw
1.0      smart 5 raw
1.0      smart 192 raw
1.0      smart 194 raw
1.0      smart 196 raw
1.0      smart 197 raw
1.0      smart 200 raw
1.6      smart 198 raw
2.6      smart 3 raw
2.6      smart 12 raw

4.4.3 Scikit-Learn Analysis

The Western Digital dataset was split using stratified sampling into 80% of the data as the training set and the remaining 20% as the testing set. In this dataset, the ratio of healthy drives to failed hard drives is about 7000 to 1. This ratio is slightly better than that of the Toshiba hard drives, but it still indicates an extremely imbalanced dataset.
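A minimal sketch of this stratified split, assuming X holds the selected SMART features and y the failure labels:

    from sklearn.model_selection import train_test_split

    # With roughly 7000 healthy records per failure, stratifying on y keeps
    # the (tiny) failure ratio identical in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)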

Table 4.13 Comparison of results of Scikit-Learn Analysis on Western Digital

Model          BCR%  Precision%  Recall%  F-1%  ROC AUC  PR curve
Decision Tree  54.2  0.1         10.0     0.2   0.54     0.00
Random Forest  75.1  0.1         56.0     0.3   0.81     0.01
Ada Boost      71.0  0.2         44.7     0.3   0.75     0.02

Table 4.14 Confusion matrix for Decision Tree classifier on Western Digital dataset

          Predicted 0  Predicted 1
Actual 0  1033382      16603
Actual 1  135          15

Table 4.15 Confusion matrix for Random Forest classifier on Western Digital dataset

          Predicted 0  Predicted 1
Actual 0  989793       60192
Actual 1  66           84

Table 4.16 Confusion matrix for Ada Boost classifier on Western Digital dataset

          Predicted 0  Predicted 1
Actual 0  1009786      40199
Actual 1  83           67

Table 4.13 and the confusion matrices in Tables 4.14, 4.15, and 4.16 indicate that Random Forest is the best model on the Western Digital dataset. Based on balanced accuracy, recall, and ROC-AUC values, the Decision Tree performed far worse than the other two models; according to its confusion matrix, it correctly predicted only 15 failures out of 150.

Figure 4.9 ROC curves for PySpark Analysis on Western Digital

When we compare these results with the Toshiba Scikit-Learn analysis in Table 4.3, all the models produced a higher number of False-Positives in this analysis, which leads to lower precisions and F-1 scores. Figure 4.4 depicts the poor performance of the Decision Tree classifier (by shape and AUC) compared to the other two models.

4.4.4 PySpark Analysis

The Western Digital dataset was split into 80% of the data as the training set and the remaining 20% as the testing set for the PySpark analysis. The four supervised learning models were applied with the random undersampling method, and the comparison of the results is given below. Based on Table 4.17 and the confusion matrices in Tables 4.18, 4.19, 4.20, and 4.21, the following observations can be made. In the PySpark analysis on the Western Digital dataset, all four models produced a high number of False-Positives, especially the Gradient Boost and MLP models.

Table 4.17 Comparison of results from PySpark analysis on Western Digital

Model           BCR%   Precision%  Recall%  F-1%  ROC AUC  PR curve
Decision Tree   72.17  0.07        55.06    0.14  0.76     0.00
Random Forest   75.94  0.07        64.04    0.15  0.82     0.00
Gradient Boost  73.07  0.50        62.92    0.11  0.80     0.00
MLP             63.99  0.02        69.66    0.05  0.69     0.00

Table 4.18 Confusion matrix for Decision Tree classifier of Western Digital PySpark analysis

          Predicted 0  Predicted 1
Actual 0  562261       67524
Actual 1  40           49

Therefore, the precision and PR-AUC values for all models were close to zero, and hence the F-1 scores were also very low. However, recall and ROC-AUC values for all the models were reasonable. When comparing the Scikit-Learn and PySpark analyses on the Western Digital dataset, the performance of the Decision Tree improved in the latter, while Random Forest behaved almost the same in both analyses.

4.5 Analysis of Hitachi Dataset

The Hitachi dataset consists of 53.3 million samples, only 1049 of which are records containing failures; altogether, these make up 0.002% of the entire dataset. SMART features 2 (Throughput Performance), 8 (Seek Time Performance), and 10 (Spin-up Retries) are unique to the Hitachi dataset; no other manufacturer uses them. The Hitachi dataset was split into 80% as the training set and the remaining 20% as the testing set.

Table 4.19 Confusion matrix for Random Forest classifier of Western Digital PySpark analysis

          Predicted 0  Predicted 1
Actual 0  553192       76593
Actual 1  32           57

Table 4.20 Confusion matrix for Gradient Boost classifier of Western Digital PySpark analysis

          Predicted 0  Predicted 1
Actual 0  524104       105681
Actual 1  33           56

Table 4.21 Confusion matrix for MLP classifier of Western Digital PySpark analysis

          Predicted 0  Predicted 1
Actual 0  367265       262520
Actual 1  27           62

Figure 4.10 ROC curves for PySpark Analysis on Western Digital

Figure 4.11 Correlation plot for Hitachi

4.5.1 Correlation Matrix

The correlation matrix was plotted for the Hitachi dataset using 17 SMART parameters. Four of them had correlation coefficients higher than 0.8: SMART 2, 4, 193, and 196 were dropped from the Hitachi dataframe since they had a high correlation with SMART 8, 12, 192, and 5, respectively.
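A minimal sketch of this check is shown below, assuming df is a Pandas dataframe (or a manageable sample of one) holding the raw SMART columns in the Backblaze naming style (smart_2_raw, etc.); on the full 53-million-row dataset this step would need Spark or chunked processing.

    # Flag pairs of SMART columns with |r| > 0.8 and drop one of each pair.
    corr = df.corr().abs()
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.8]
    print(pairs)
    # For Hitachi the flagged pairs involve SMART 2/8, 4/12, 193/192, and 196/5;
    # one member of each pair is dropped:
    df = df.drop(columns=["smart_2_raw", "smart_4_raw",
                          "smart_193_raw", "smart_196_raw"])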

4.5.2 Recursive Feature Elimination

Feature selection was done using recursive feature elimination. The selected top ten features are listed in Table 4.22. Nine SMART parameters were assigned a ranking of one.

4.5.3 PySpark Analysis

For the Hitachi and Seagate datasets, only the PySpark analysis was performed: due to the high volume of these datasets, Pandas dataframes and Scikit-Learn functions produced memory errors, so we analyzed them using Apache Spark with Python. The PySpark analyses for the Hitachi and Seagate datasets were completed at the Ohio Supercomputing Center (OSC) on the Owens cluster. Because of the data volume, the analyses could take days on a normal server; using OSC, both were completed in a few hours. First, the Hitachi dataset was split into 80% of the data as the training set and the remaining 20% as the testing set, and the four supervised learning models were applied with the random undersampling method. Table 4.23 and the confusion matrices in Tables 4.24, 4.25, 4.26, and 4.27 show the results of the PySpark analysis of the Hitachi dataset. The precision values, F-1 scores, and PR-AUC values were extremely low and close to zero. The main reason is the high number of False-Positives: in all cases, False-Positives outnumbered True-Positives by more than a factor of 1000, so some performance measures are heavily influenced by this. Based on balanced accuracies and recalls, the MLP model performed best, while Random Forest and Gradient Boost Trees were closely matched. Figure 4.12 shows the ROC curves for the models; the Gradient Boost Trees classifier had the highest ROC-AUC of all four.
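The following sketch shows how one of these Spark ML models and the two area-under-curve metrics can be obtained; the dataframes train_balanced and test_df, with a features vector column and a failure label column, are assumptions carried over from the earlier preprocessing sketches.

    # Sketch of a Spark ML classifier plus the ROC-AUC and PR-AUC evaluators.
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    gbt = GBTClassifier(featuresCol="features", labelCol="failure", maxIter=20)
    model = gbt.fit(train_balanced)
    pred = model.transform(test_df)

    roc_auc = BinaryClassificationEvaluator(
        labelCol="failure", metricName="areaUnderROC").evaluate(pred)
    pr_auc = BinaryClassificationEvaluator(
        labelCol="failure", metricName="areaUnderPR").evaluate(pred)
    print(roc_auc, pr_auc)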

Table 4.22 Top ten features for Hitachi based on RFE

Ranking  Features
1.0      smart 1 raw
1.0      smart 3 raw
1.0      smart 5 raw
1.0      smart 8 raw
1.0      smart 9 raw
1.0      smart 10 raw
1.0      smart 12 raw
1.0      smart 194 raw
1.0      smart 197 raw
1.2      smart 192 raw

Table 4.23 Comparison of results from PySpark analysis on Hitachi

Model           BCR%   Precision%  Recall%  F-1%  ROC AUC  PR curve
Decision Tree   76.96  0.02        58.05    0.05  0.78     0.00
Random Forest   77.80  0.02        60.92    0.04  0.82     0.00
Gradient Boost  77.37  0.01        62.07    0.03  0.83     0.00
MLP             78.94  0.01        64.94    0.03  0.80     0.00

Table 4.24 Confusion matrix for Decision Tree classifier on Hitachi

          Predicted 0  Predicted 1
Actual 0  8845678      380675
Actual 1  73           101

Table 4.25 Confusion matrix for Random Forest classifier on Hitachi

          Predicted 0  Predicted 1
Actual 0  8736015      490338
Actual 1  68           106

Table 4.26 Confusion matrix for Gradient Boost classifier on Hitachi

          Predicted 0  Predicted 1
Actual 0  8550879      675474
Actual 1  66           108

Table 4.27 Confusion matrix for MLP classifier on Hitachi

          Predicted 0  Predicted 1
Actual 0  8398338      828015
Actual 1  61           113

4.6 Analysis of Seagate Dataset

Seagate is the largest dataset among all manufacturers: about 65% of the whole dataset belongs to Seagate hard drives, spanning more than 100 different hard drive models.

Figure 4.12 ROC curves for PySpark Analysis on Hitachi

The Seagate dataset consists of 108.8 million samples, only 9242 of which are records containing failures; altogether, these make up 0.009% of the entire dataset.

4.6.1 Correlation Matrix

The correlation matrix was plotted for the Seagate dataset using 21 SMART parameters, the highest number of available SMART attributes for a single manufacturer. Four SMART parameters had correlation coefficients higher than 0.8: SMART 192, 194, 197, and 199 were dropped from the Seagate dataframe since they had a high correlation with SMART 4, 190, 198, and 188, respectively.

Figure 4.13 Correlation plot for Seagate

4.6.2 Recursive Feature Elimination

Table 4.28 shows the features selected using RFE to run the machine learning models for the Seagate dataset. The top ten features include some unique attributes that do not appear among the selected features for other manufacturers: SMART 189 (High Fly Writes), 241 (Total Logical Block Addresses Written), and 242 (Total Logical Block Addresses Read). SMART attributes shown in italics are the features used in previous work [8].

4.6.3 PySpark Analysis

The Seagate dataset was split into 80% of the data as the training set and the remaining 20% as the testing set for the PySpark analysis, and the four supervised learning models were applied with the random undersampling method.

Table 4.28 Top ten features for Seagate based on RFE

Ranking  Features
1.0      smart 1 raw
1.0      smart 193 raw
1.0      smart 190 raw
1.0      smart 189 raw
1.0      smart 241 raw
1.0      smart 242 raw
1.0      smart 9 raw
1.0      smart 4 raw
1.0      smart 12 raw
1.3      smart 191 raw

Table 4.29 Comparison of results from PySpark analysis on Seagate

Model           BCR%   Precision%  Recall%  F-1%  ROC AUC  PR curve
Decision Tree   85.76  0.16        75.11    0.31  0.87     0.01
Random Forest   86.08  0.14        76.07    0.30  0.91     0.01
Gradient Boost  86.77  0.12        78.63    0.24  0.91     0.01
MLP             85.65  0.11        76.87    0.21  0.90     0.01

Table 4.30 Confusion matrix for Decision Tree classifier on Seagate

          Predicted 0  Predicted 1
Actual 0  15864180     592926
Actual 1  312          942

Table 4.31 Confusion matrix for Random Forest classifier on Seagate

          Predicted 0  Predicted 1
Actual 0  15813837     643269
Actual 1  300          954

Based on Table 4.29 and the confusion matrices in Tables 4.30, 4.31, 4.32, and 4.33, it is clear that all the models performed better on the Seagate dataset than on the rest. All performance measures improved in the Seagate PySpark analysis; balanced accuracy, recall, and ROC-AUC for all the models are high.

Table 4.32 Confusion matrix for Gradient Boost classifier on Seagate

          Predicted 0  Predicted 1
Actual 0  15622075     835031
Actual 1  268          986

Table 4.33 Confusion matrix for MLP classifier on Seagate

          Predicted 0  Predicted 1
Actual 0  15540869     916237
Actual 1  290          964

Figure 4.14 shows the ROC curves for the models; the AUC values and shapes of the curves are much better than in all the previous analyses on the other datasets. However, precisions and F-1 scores are still below 1% due to the high number of False-Positives. Values for PR-AUC have slightly increased when compared with the other analyses. By comparing the performance measures produced on the 2014 whole dataset (in Chapter 3) with each per-manufacturer analysis, we cannot conclude that splitting the data by manufacturer improved the results. No noticeable improvements in the performance measures were observed when comparing Tables 3.2 and 3.6 with Tables 4.3, 4.7, 4.13, 4.17, 4.23, and 4.29. In most cases, reasonable recall and balanced accuracy values were observed; the reason for the good recall values is that True-Positives outnumbered False-Negatives in each case. However, precision and F-1 scores were consistently below an acceptable range because the number of False-Positives was extremely high. When we examine the PR-curve-related metrics (average precision for the Scikit-Learn analyses and PR-AUC for the PySpark analyses), it can be seen that the analysis on the 2014 whole dataset produced much better PR curves than the per-manufacturer analyses; comparing Figures 3.5 and 3.7 in Chapter 3 with Figures 4.5 and 4.7 in Chapter 4 confirms this argument. Analyzing all these figures and performance-measure tables, the Random Forest classifier generated better PR curves on the 2014 whole dataset than in the analyses done on individual makes. We can argue that splitting the data by manufacturer is not the optimal way to analyze it. As we suggest in Section 5.1, splitting by hard drive model may be a better way to perform the analysis.

Figure 4.14 ROC curves for PySpark Analysis on Seagate

4.7 Threats to Validity

The main threat to the validity of this study is issues within the dataset. For example, there are some missing data in the 2017 Q1 dataset: as Backblaze has reported, from 28th January to 31st January no daily snapshots were captured due to a data center error [5]. At the same time, there were several records like the one shown in Figure 4.2 for every manufacturer. These were handled by removing records where the failure status equals one. However, there may be other errors in the dataset that went unnoticed.

In this study, in all the analyses, we standardized the data using the standard scaler function in the Scikit-Learn library before splitting the data into training and test sets. However, it is mentioned in several blogs that standardization should be done after splitting the data, so that the test data remains untouched until evaluation. The standardization process uses the mean and standard deviation; if these are calculated from both the training and test sets, information from the test set leaks into the model and biases the evaluation. We can eliminate this error by computing the mean and standard deviation from the training set only and applying the same transformation to the test set, keeping the test set unseen until evaluation.
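A minimal sketch of this leakage-free alternative, assuming X and y hold the features and labels:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)

    scaler = StandardScaler().fit(X_train)   # mean/std from training data only
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)    # test set stays unseen until here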

CHAPTER 5 CONCLUSION

This study aimed to 1) develop and describe a data engineering process for extracting, transforming, and loading (ETL) hard drive SMART data, and 2) effectively predict hard drive failure on a per-manufacturer basis. The first goal was achieved by implementing a Python script to automate the collection, combining, and organizing of the data. To accomplish the second goal, the entire Backblaze dataset was split by manufacturer and analyzed separately. The idea of per-manufacturer analysis was motivated by a couple of previous studies [14, 16], which mentioned that manufacturer-based analysis may produce better predictions. Based on balanced accuracy, recall, and ROC-AUC metrics, the analysis on the Seagate dataset generated better results than those of the other manufacturers. However, the main issue across all the analyses was the large number of False-Positives; as a result, precision and F-1 scores were extremely low. In machine learning there is always a trade-off between precision and recall: increasing one tends to decrease the other. In this study, the Random Forest classifier was almost always superior in terms of performance, and in the PySpark analyses the Gradient Boost Tree was most often a close contender. It is difficult to compare the results directly with previous research for a few reasons:

1. No one had previously worked on the entire dataset

2. Most work focused on only one or a few hard drive models of a single manufacturer (mostly Seagate)

3. Many studies discussed accuracy and false alarm rate, which are not valid for an extremely imbalanced dataset such as Backblaze

However, there are a few studies with which the results can be compared. Aussel et al. were able to achieve 67% recall and 95% precision using Random Forest on the 2014 Backblaze heterogeneous dataset, which is significantly better than what we achieved [10]. Their key to generating promising results was using a time window for failed hard drives. However, when they used an SVM for failure prediction, the precision was less than 1%, which is comparable to the current study. Rincon et al. analyzed the 2015 and 2016 heterogeneous datasets for failure prediction using the Decision Tree classifier and included their confusion matrices in the results section of their paper; by comparison, we had a lower percentage of False-Negatives and a higher percentage of False-Positives [16]. In this study, we ignored the time-series factor when analyzing the data. Using the time-series factor may complicate the analysis but could lead to more accurate results; as in some of the previous research, using a time window to determine failures might work better [10]. On the other hand, it would be difficult to apply time-series analysis to all the datasets at once.

5.1 Future Works

We suggest that splitting by hard drive model may be a better way to perform the analysis than splitting the data by manufacturer, and we plan to investigate this further in future research. However, due to the large number of hard drive models in the Backblaze dataset (there are more than 100 models for Seagate hard drives alone), this will not be an easy task. We also plan to use combined resampling techniques consisting of both undersampling and oversampling, such as SMOTEENN and SMOTETomek in the Imblearn library, which are expected to perform better. Alternatively, instead of resampling techniques, a Generative Adversarial Network (GAN), in which two neural networks compete with each other, could be used; however, more computational power will be necessary to apply it to a large dataset such as Seagate. Deep learning might also produce better results for predicting hard drive failures, but applying it to the entire dataset will require high computational power. In addition, cloud services such as AWS, Google, and Microsoft Azure can be used to speed up the analysis.
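A minimal sketch of the combined resampling we plan to try, assuming X_train and y_train are the training split; this is a proposal rather than something evaluated in this study:

    # SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours
    # cleaning; applied to the training data only.
    from imblearn.combine import SMOTEENN

    X_resampled, y_resampled = SMOTEENN(random_state=42).fit_resample(
        X_train, y_train)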

BIBLIOGRAPHY

[1] K. V. Vishwanath and N. Nagappan, “Characterizing hardware reliability,” in Proceedings of the 1st ACM symposium on Cloud computing, 2010, pp. 193–204.

[2] S. Pang, Y. Jia, R. Stones, G. Wang, and X. Liu, “A combined bayesian network method for predicting drive failure times from smart attributes,” in 2016 International Joint Conference on Neural Networks (IJCNN), July 2016, pp. 4850–4856.

[3] G. F. Hughes, J. F. Murray, K. Kreutz-Delgado, and C. Elkan, “Improved disk-drive failure warnings,” IEEE Transactions on Reliability, vol. 51, no. 3, pp. 350–357, Sep. 2002.

[4] B. Zhu, G. Wang, X. Liu, D. Hu, S. Lin, and J. Ma, “Proactive drive failure prediction for large scale storage systems,” in 2013 IEEE 29th symposium on mass storage systems and technologies (MSST). IEEE, 2013, pp. 1–5.

[5] Backblaze, “Hard drive data and stats,” Available at https://www.backblaze.com/b2/hard-drive-test-data.html, 2019.

[6] X. Huang, “Hard drive failure prediction for large scale storage system,” Master’s thesis, UCLA, 2017.

[7] E. Pinheiro, W.-D. Weber, and L. A. Barroso, “Failure trends in a large disk drive population,” in File and Storage Technologies (FAST’07), 2007.

[8] W. Yang, D. Hu, Y. Liu, S. Wang, and T. Jiang, “Hard drive failure prediction using big data,” in 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW). IEEE, 2015, pp. 13–18.

[9] V. Agrawal, C. Bhattacharyya, T. Niranjan, and S. Susarla, “Prediction of hard drive failures via rule discovery from autosupport data.” Citeseer, 2009.

[10] N. Aussel, S. Jaulin, G. Gandon, Y. Petetin, E. Fazli, and S. Chabridon, “Predictive models of hard drive failures based on operational data,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 619–625.

[11] B. Schroeder and G. A. Gibson, “Understanding disk failure rates: What does an mttf of 1,000,000 hours mean to you?” ACM Transactions on Storage (TOS), vol. 3, no. 3, pp. 8–es, 2007.

[12] J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado, “Hard drive failure prediction using non-parametric statistical methods,” in Proceedings of ICANN/ICONIP, 2003.

[13] ——, “Machine learning methods for predicting failures in hard drives: A multiple-instance application,” Journal of Machine Learning Research, vol. 6, no. May, pp. 783–816, 2005.

[14] I. C. Chaves, M. R. P. de Paula, L. G. Leite, L. P. Queiroz, J. P. P. Gomes, and J. C. Machado, “Banhfap: A bayesian network based failure prediction approach for hard disk drives,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2016, pp. 427–432.

[15] M. M. Botezatu, I. Giurgiu, J. Bogojeska, and D. Wiesmann, “Predicting disk replacement towards reliable data centers,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 39–48.

[16] C. A. Rincón, J.-F. Pâris, R. Vilalta, A. M. Cheng, and D. D. Long, “Disk failure prediction in heterogeneous environments,” in 2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS). IEEE, 2017, pp. 1–7.

[17] F. Mahdisoltani, I. Stefanovici, and B. Schroeder, “Proactive error prediction to improve storage system reliability,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17), 2017, pp. 391–402.

[18] F. L. F. Pereira, F. D. dos Santos Lima, L. G. de Moura Leite, J. P. P. Gomes, and J. de Castro Machado, “Transfer learning for bayesian networks with application on hard disk drives failure prediction,” in 2017 Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2017, pp. 228–233.

[19] J. Xiao, Z. Xiong, S. Wu, Y. Yi, H. Jin, and K. Hu, “Disk failure prediction in data centers via online learning,” in Proceedings of the 47th International Conference on Parallel Processing, 2018, pp. 1–10.

[20] S. Bhardwaj, A. Saxena, and A. Nayyar, “Exploratory on hard drive failure and prediction,” International Journal, vol. 6, no. 6, pp. 1–6, 2018.

[21] P. Anantharaman, M. Qiao, and D. Jadav, “Large scale predictive for hard disk remaining useful life estimation,” in 2018 IEEE International Congress on Big Data (BigData Congress). IEEE, 2018, pp. 251–254.

[22] A. R. Mashhadi, W. Cade, and S. Behdad, “Moving towards real-time data-driven quality monitoring: A case study of hard disk drives,” Procedia Manufacturing, vol. 26, pp. 1107– 1115, 2018.

[23] J. Shen, J. Wan, S.-J. Lim, and L. Yu, “Random-forest-based failure prediction for hard disk drives,” International Journal of Distributed Sensor Networks, vol. 14, no. 11, p. 1550147718806480, 2018.

[24] C.-J. Su and S.-F. Huang, “Real-time big data analytics for hard disk drive predictive maintenance,” Computers & Electrical Engineering, vol. 71, pp. 93–101, 2018.

[25] F. Pereira, D. Teixeira, J. P. Gomes, and J. Machado, “Evaluating one-class classifiers for fault detection in hard disk drives,” in 2019 8th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2019, pp. 586–591.

[26] J. Yu, “Hard disk drive failure prediction challenges in machine learning for multi-variate time series,” in Proceedings of the 2019 3rd International Conference on Advances in Image Processing, 2019, pp. 144–148.

[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[28] M. S. Santos, J. P. Soares, P. H. Abreu, H. Araujo, and J. Santos, “Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches [research frontier],” IEEE Computational Intelligence Magazine, vol. 13, no. 4, pp. 59–76, 2018.

[29] G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-365.html

[30] “Resampling methods.” [Online]. Available: https://imbalanced-learn.readthedocs.io/en/stable/api.html

[31] J. Brownlee, “Step-by-step framework for imbalanced classification projects,” Mar 2020. [Online]. Available: https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/

[32] “Cross-validation: evaluating estimator performance.” [Online]. Available: https://scikit-learn.org/stable/modules/cross_validation.html

[33] B. Beach and B. Beach, “Hard drive smart stats,” May 2020. [Online]. Available: https://www.backblaze.com/blog/hard-drive-smart-stats/

[34] J. Brownlee, “Tour of evaluation metrics for imbalanced classification,” Jan 2020. [Online]. Available: https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

[35] L. Sisters, “Matthews correlation coefficient: when to use it and when to avoid it,” May 2020. [Online]. Available: https://towardsdatascience.com/matthews-correlation-coefficient-when-to-use-it-and-when-to-avoid-it-310b3c923f7e

[36] I. Kuznetsov, “Metrics for imbalanced classification,” May 2019. [Online]. Available: https://towardsdatascience.com/metrics-for-imbalanced-classification-41c71549bbb5?source=rss----7f60cf5620c9---4

[37] J. Davis and M. Goadrich, “The relationship between precision-recall and roc curves,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 233–240.

[38] M. Galarnyk, “Pca using python (scikit-learn),” May 2020. [Online]. Available: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

[39] “Principal component analysis tutorial.” [Online]. Available: https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial

[40] M. Brems, “A one-stop shop for principal component analysis,” Jun 2019. [Online]. Available: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c