
Approach to Fault Identification for Electronic Products Using Mahalanobis Distance

Sachin Kumar, Member, IEEE, Tommy W. S. Chow, Senior Member, IEEE, and Michael Pecht, Fellow, IEEE

Abstract—This paper presents a Mahalanobis distance (MD)-based diagnostic approach that employs a probabilistic approach to establish thresholds to classify a product as being healthy or unhealthy. A technique for detecting trends and biasness in system health is presented by constructing a control chart for the MD value. The performance parameters' residuals, which are the differences between the estimated values (from an empirical model) and the observed values (from health monitoring), are used to isolate parameters that exhibit faults. To aid in the qualification of a product against a specific known fault, we suggest that a fault-specific threshold MD value be defined by minimizing an error function. A case study on notebook computers is presented to demonstrate the applicability of this proposed diagnostic approach.

Index Terms—Computers, diagnostics, electronic products, fault identification, fault isolation, Mahalanobis distance (MD).

Manuscript received March 19, 2009; revised August 27, 2009; accepted August 28, 2009. Date of publication October 30, 2009; date of current version July 14, 2010. The Associate Editor coordinating the review process for this paper was Dr. John Sheppard. S. Kumar is with the Prognostics and Health Management Laboratory, Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]; [email protected]). T. W. S. Chow is with the Prognostics and Health Management Centre, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong. M. Pecht is with the Prognostics and Health Management Laboratory, CALCE, University of Maryland, College Park, MD 20742 USA, and also with the Prognostics and Health Management Center, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIM.2009.2032884

I. INTRODUCTION

QUANTIFICATION of degradation and fault progression in an electronic system is difficult since not all faults necessarily lead to system failure or functionality loss [1], [2]. In addition, there is a significant lack of knowledge about failure precursors in electronics [3]. With limited failure precursors and complex architecture, it is generally hard to implement a health-monitoring system that can directly monitor all the conditions in which fault incubation occurs.

The health of a system is a state of complete physical, structural, and functional well-being and not merely conformance to the system's specifications. A health assessment of electronic products can be performed at the product level, assembly level, or component level [4]. The health assessment procedure should also consider various environmental and usage conditions in which a product is likely to be used.

The built-in test (BIT) and self-test abilities in a system were early attempts at providing diagnostic capabilities incorporated into a system's own structure. Gao and Suryavanshi have catalogued applications of BITs in many industries, including semiconductor production, manufacturing, aerospace, and transportation [5]. BIT system applicability is limited to the failure definition embedded at the system's manufacturing stage, whereas, with developments in sensor and data analysis capabilities, the development and implementation of data-driven diagnostic systems that can adapt to new failure definitions are now possible.

Today, a product's health can be assessed in many ways, including monitoring changes in its performance parameters, which are used to characterize a system's performance; monitoring canaries (structures that have equivalent circuitry but are calibrated to fail at a faster rate than the actual product); and estimating accumulated damage based on physics-of-failure modeling [6]. Performance parameter analysis uncovers the interactions between performance parameters and the influence of environmental and operational conditions on these parameters. In the absence of fault-indicating parameters, health assessment can be performed by combining 1) damage estimate information obtained from physics-based models that utilize data from environmental and operating conditions and 2) failure precursor information extracted from data-driven models [7]. A product's historical data on intermittent failures (i.e., failures that cannot be reproduced in a laboratory environment [8]) should be included in a product's health assessment.

Sun Microsystems developed the Continuous System Telemetry Harness for collecting, conditioning, synchronizing, and storing computer systems' telemetry signals [9]. The Multivariate State Estimation Technique (MSET) provides an estimate of each parameter, and these estimates are later used for decision making using the Sequential Probability Ratio Test and hypothesis testing. The Mahalanobis distance (MD) approach considered in this paper is a distance measure in multidimensional space that considers correlations among parameters [10]. The use of the MD approach over the MSET will reduce the analytical burden, because the MD approach provides a single number for determining a system's health after combining information on all performance parameters, whereas MSET provides an estimate for each parameter and needs analytical assessment of each parameter for determining a system's health.

Other distance-based approaches that have been used for diagnostics and classification include Manhattan distance, Euclidean distance, Hamming distance, Hotelling T-square, and square prediction error. Manhattan distance is the distance between two points measured along axes at right angles. It has been used to classify text via the N-gram approach [11].

Euclidean distance is the straight-line distance between two points and can be calculated as the sum of the squares of the differences between two points. The Hotelling T-square and square prediction error are used in principal component analysis for representing statistical indices [12]. The Hotelling T-square is a measure that accounts for the covariance structure of a multivariate normal distribution and is computed in reduced model space, which is defined by a few principal components (i.e., the number of principal components used is less than the number of original parameters) [13]. The squared prediction error index is a measure computed in the residual space that is not explained by the model space [14].

The Manhattan distance, Euclidean distance, and Hamming distance do not use correlation among parameters and suffer from a scaling effect, in contrast to MD. The scaling effect describes a situation where the variability of one parameter masks the variability of another parameter, and it happens when the measurement ranges or scales of two parameters are different [15]. To remove the scaling effect (i.e., eliminate the influence of measurement units), the data should be normalized. The Hotelling T-square and the square prediction error indices are calculated in reduced dimensions (i.e., information loss) and use covariance as opposed to a correlation matrix, which is one reason to consider using MD for fault diagnosis. MD calculation uses the normalized values of measured parameters, which eliminates the problem of scaling. MD also uses correlation among parameters, which makes it sensitive to interparameter "health" changes. For example, consider a set of multiparameter points that are equidistant (in the Euclidean sense) from a location and thus form a sphere around it. This location is defined by the arithmetic mean of those points in multidimensional space. The MD stretches this sphere to even off the respective scales of the different dimensions and to account for the correlation among the parameters.

The performance data of some electronic systems are multidimensional, such as multifunctional radio-frequency communication devices, infrared imaging cameras, and hybrid silicon complementary metal–oxide–semiconductor (CMOS) circuits [16]. While a high-dimensional data set contains much valuable information, 1-D measures are easier to comprehend and can be computed in quick succession.

Consideration of correlations among performance parameters is advantageous because an electronic product experiences diverse environmental and usage conditions. For example, the capacitance and insulation resistance of a capacitor vary with changes in ambient temperature. The effectiveness of a diagnostic procedure increases by incorporating the change in relationship among performance parameters. This is because each performance parameter changes at a different rate with changes in ambient conditions.

In an MD-based diagnostic approach, a healthy baseline and a threshold MD value are needed to classify a product as healthy or unhealthy. Traditional methods for defining a threshold MD value are either based on personal judgment or traded off to lower the economic consequences of misclassifications, or an MD value that corresponds to a known abnormal condition is given [17]–[20]. These traditional methods do not provide a generic framework for defining a threshold MD value for fault identification. The proposed diagnostic method does not require the definition of a faulty product during training and fault isolation, unlike other methods such as clustering and supervised neural networks that require a priori knowledge of the types of faults during training [21]. When unforeseen types of faults occur, supervised neural networks or clustering approaches may fail to deliver a correct decision on system health [21].

The MD approach suffers from the masking effect if the training data contain a significant number of outliers [22]. This is because MD uses a sample mean and a correlation matrix that can be influenced by a cluster of outliers. These outliers can shift the sample mean and inflate the correlation matrix in a covariate direction. This is particularly true if the n/p ratio is small, where n is the number of observations and p is the number of features. Another issue is the computation time, which grows as O(p²) with the dimensionality p of the feature vectors [23].

This paper provides a probabilistic approach for defining warning and fault threshold MD values to improve upon the traditional approaches where threshold MD values are decided by experts. Since MD values do not generally follow a normal distribution and take only positive values, a Box–Cox transformation was applied to the MD values to obtain a normally distributed transformed variable. The transformed variable was used to construct a control chart and define threshold values to detect faults. An optimized MD value, using an error function, was obtained to qualify a product against a particular fault. The residual, which is the difference between a parameter's estimated and observed values, was calculated to isolate faulty parameters. A product's health was classified by comparing its MD value, which was computed for each observation, with a threshold MD value.

A. Mahalanobis Distance

The MD methodology distinguishes multivariable data groups by a univariate distance measure that is calculated from the measurements of multiple parameters. The MD value is calculated using the normalized values of performance parameters and their correlation coefficients, which is the reason for MD's sensitivity [10].

A data set formed by measuring the performance parameters of a healthy product is used as training (or baseline) data. The collection of MD values for a healthy system is known as the Mahalanobis space (MS). The performance parameters collected from a product are denoted as $X_i$, where $i = 1, 2, \ldots, p$. Here, $p$ is the total number of performance parameters. The observation of the $i$th parameter on the $j$th instance is denoted by $X_{ij}$, where $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, m$; $m$ is the total number of times an observation is made for all parameters. Thus, the $(p \times 1)$ data vector for the normal group is denoted by $X_j$, where $j = 1, 2, \ldots, m$. Each individual parameter in the data vector is normalized using the mean and the standard deviation of that parameter calculated from the baseline data. Thus, a parameter's normalized values are

$$Z_{ij} = \frac{X_{ij} - \bar{X}_i}{S_i}, \qquad i = 1, 2, \ldots, p; \; j = 1, 2, \ldots, m \qquad (1)$$

Fig. 2. MD calculation using test data.

Fig. 1. Fault detection approach.

where

$$\bar{X}_i = \frac{1}{m}\sum_{j=1}^{m} X_{ij}, \qquad S_i = \sqrt{\frac{\sum_{j=1}^{m}\left(X_{ij} - \bar{X}_i\right)^2}{m-1}}. \qquad (2)$$
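As a concrete illustration of the normalization in (1) and (2), a minimal NumPy sketch is given below; the data array, its shape, and the variable names are illustrative assumptions rather than part of the original study.

```python
import numpy as np

# Hypothetical baseline (training) data: m observations of p monitored parameters.
rng = np.random.default_rng(0)
m, p = 25000, 8
X = rng.normal(loc=50.0, scale=10.0, size=(m, p))   # placeholder for measured data

# Eq. (2): per-parameter mean and standard deviation from the baseline data.
X_bar = X.mean(axis=0)
S = X.std(axis=0, ddof=1)        # ddof=1 gives the (m - 1) denominator

# Eq. (1): normalized values Z_ij = (X_ij - X_bar_i) / S_i.
Z = (X - X_bar) / S
```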

Next, the MD values are calculated for a healthy product

$$MD_j = \frac{1}{p}\, z_j^{T} C^{-1} z_j \qquad (3)$$

where $z_j = [z_{1j}, z_{2j}, \ldots, z_{pj}]^{T}$ is a vector comprising $z_{ij}$; $z_j^{T}$ is the transpose of $z_j$; and $C$ is the correlation matrix calculated as

$$C = \frac{1}{m-1} \sum_{j=1}^{m} z_j z_j^{T}. \qquad (4)$$

For fault diagnosis, Betta and Pietrosanto [24] presented the requirements, including system monitoring; establishment of a suitable threshold; and estimation of residuals, which can be obtained by continuous comparison of the system under analysis with another system or by taking the differences between the measured and expected quantities. The following section illustrates an MD-based diagnostic approach that meets these requirements, including the creation of a healthy baseline from measured data, an approach for defining a threshold for fault detection, and a residual-based approach for identifying faulty parameters.

II. DIAGNOSTIC APPROACH

Our anomaly detection approach (Fig. 1) starts with performance parameter monitoring. For a test product, the MD value for each observation is calculated using the performance parameters' means, standard deviations, and a correlation coefficient matrix that are obtained from the training data (Fig. 2). The calculated MD value is then compared with a threshold MD value τ, which is established from a baseline, to classify the product as being healthy or unhealthy. Then, if the product were to be classified as unhealthy, further processing would be performed to isolate the faulty parameter(s) to establish reasons for the fault. The process for defining the baseline and the threshold MD values is discussed in the following sections.

Fig. 3. Baseline establishment methodology.

A. Baseline Construction

A product's performance range is defined by measurements made of its performance parameters under different operating conditions. The combination of performance parameters can be summarized by a distance measure. A baseline consists of an MD profile, a threshold MD value, and the empirical models of performance parameters. The process of constructing a baseline is shown in Fig. 3.

The baseline construction process starts with the functional evaluation of a product. Based on a failure modes, mechanisms, and effects analysis (FMMEA) of a product, parameters that represent product performance should be selected for monitoring [1]. These parameters are monitored during the operation of a set of healthy products under various environmental, operational, and usage conditions. The collected information on parameters forms a data set that is used to train and calculate the statistical features of each parameter. For MD calculation, performance parameter data are normalized, and a correlation coefficient matrix is formed. The correlation coefficient between two parameters expresses the linear dependence of one parameter on the other and the direction of the dependence. The MD values corresponding to each observation in the training data are calculated, and this group of MD values forms the MS. From the MS, the min–max range, mean, and standard deviation of MD values are obtained to explain the variability of a healthy product's performance in terms of MD values.
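A sketch of (3) and (4) and of the Mahalanobis space built from the training MDs is shown below; it assumes a normalized baseline matrix Z such as the one produced in the previous sketch, here regenerated with placeholder data so the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 25000, 8
Z = rng.normal(size=(m, p))              # normalized baseline data from (1)-(2)

# Eq. (4): correlation matrix of the normalized baseline data.
C = (Z.T @ Z) / (m - 1)
C_inv = np.linalg.inv(C)

# Eq. (3): MD_j = (1/p) z_j^T C^{-1} z_j for every training observation j.
MD = np.einsum('ij,jk,ik->i', Z, C_inv, Z) / p

# The collection of training MD values forms the Mahalanobis space (MS);
# its min-max range, mean, and standard deviation summarize healthy variability.
ms_summary = {'min': MD.min(), 'max': MD.max(),
              'mean': MD.mean(), 'std': MD.std(ddof=1)}
```

For a test observation, the same baseline statistics (X_bar, S, and C_inv) would be reused, which is the calculation sketched in Fig. 2.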

Fig. 5. Threshold value calculation.

Empirical models of performance parameters are developed in the absence of analytical models. Training data are used to compute the correlation coefficients between different parameters and identify parameters to be used for empirical models. The linear modeling approach was chosen because of its simplicity and effectiveness without losing much model-fitting accuracy. One can use nonlinear models for parameter estimation, but nonlinear models need relatively complex learning algorithms to fit the underlying relationship among parameters [25]. In our application, the training data, which were collected under various operating conditions of a set of healthy products, are linear. Thus, a linear model for each performance parameter is developed as a function of other related performance parameters. Linear models are considered appropriate due to their simplicity and considerable fit (i.e., > 90%) to the experimentally collected data. These models are used for isolating parameters that are behaving far differently from expectations.

B. Threshold Determination

In this section, a probabilistic approach is presented to determine two types of threshold MD values. First, a generic threshold for detecting any type of fault or anomaly present in a product, based on the MDs obtained from the training data, is determined. Second, a fault-specific threshold for detecting the presence of a particular fault, based on historical data related to that fault, is determined. The second threshold can be considered a second-tier fault isolation process.

Fig. 4. Approach for defining the threshold MD value.

1) Generic Threshold Determination: An approach for determining a generic threshold—an MD value—for fault diagnosis is shown in Fig. 4. The MDs are always positive, but they do not generally follow a normal distribution. The Box–Cox power transformation can be used to transform a variable that has positive values and does not follow a normal distribution into a normally distributed transformed variable [26]. The Box–Cox transformation is defined as follows:

$$x(\lambda) = \frac{x^{\lambda} - 1}{\lambda}, \quad \lambda \neq 0; \qquad x(\lambda) = \ln(x), \quad \lambda = 0 \qquad (5)$$

where the vector of data observations is $x = x_1, x_2, \ldots, x_n$, and $x(\lambda)$ is the transformed data. The power $\lambda$ is obtained by maximizing the logarithm of the likelihood function

$$f(x, \lambda) = -\frac{n}{2} \ln\!\left[\frac{\sum_{i=1}^{n}\left(x_i(\lambda) - \bar{x}(\lambda)\right)^2}{n}\right] + (\lambda - 1)\sum_{i=1}^{n} \ln(x_i) \qquad (6)$$

where

$$\bar{x}(\lambda) = \frac{1}{n}\sum_{i=1}^{n} x_i(\lambda). \qquad (7)$$

The normality of x(λ), which is a transformed variable, is confirmed by plotting it on a normal probability plot. The mean (μx) and standard deviation (σx) of the transformed variable are used to determine the control limits of an x-bar chart. A threshold value corresponding to the warning limit (μx + 2σx) and a threshold value corresponding to a fault alarm (μx + 3σx) are defined. Since higher MD values are of concern from an "unhealthiness" perspective, the upper portion of the control chart is of importance for identifying changes in system health. Rules from quality control, including bias and variance identification, can be used [27].

2) Fault-Specific Threshold Determination: A normally distributed transformed variable, which corresponds to MD values, can be used to determine Type-I and Type-II errors [19]. A Type-I error, often referred to as a false positive, is a statistical error made in testing the health of a product, in which the product is healthy but is incorrectly determined to be unhealthy. A Type-II error, often referred to as a false negative, is a statistical error made in testing the unhealthiness of a product, in which a product is determined to be healthy when it is not. Fig. 5 shows Type-I and Type-II errors using a variable's distribution for a healthy and an unhealthy system, where the healthy distribution is defined from the training data and the unhealthy distribution is defined from the data representing a specific fault in a system.
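The following sketch applies (5)–(7) and the control limits of Section II-B1 to a set of training MD values using SciPy, whose boxcox routine selects λ by maximizing the same log-likelihood; the MD array is a placeholder, and mapping the limits back to MD units is shown only as one possible convenience.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(2)
MD = rng.lognormal(mean=0.0, sigma=0.4, size=25000)   # placeholder MD values (> 0)

# Eqs. (5)-(7): Box-Cox transform with lambda chosen by maximum likelihood.
x, lam = boxcox(MD)

# Control limits of the x-bar style chart on the transformed variable.
mu_x, sigma_x = x.mean(), x.std(ddof=1)
warning_limit = mu_x + 2 * sigma_x    # warning threshold
fault_limit = mu_x + 3 * sigma_x      # fault-alarm threshold

# Optionally express the limits back in MD units via the inverse transform.
md_warning = inv_boxcox(warning_limit, lam)
md_fault = inv_boxcox(fault_limit, lam)
```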

TABLE I ENVIRONMENTAL CONDITIONS

TABLE II EXPERIMENTS PERFORMED

For a known fault, an optimal transformed variable can be defined such that the combined error (i.e., the sum of Type-I and Type-II errors) remains minimal (i.e., the shaded region in Fig. 5), and an MD value corresponding to the optimal transformed variable x is calculated. For a healthy product, the probability of having MD values higher than the threshold value is the number of observations that produce an MD value higher than the threshold MD value divided by the total number of observations for a healthy product. Similarly, for an unhealthy product, the probability of having an MD value less than the threshold value is the number of observations that produce MD values less than the threshold MD value divided by the total number of observations for an unhealthy product. The threshold value τx of a transformed variable for detecting a known anomaly is established using the following error function ε:

$$\varepsilon(\tau_x) = \frac{e_1}{n_h} + \frac{e_2}{n_u} \qquad (8)$$

where τx is the threshold, e1 is the number of observations classified as unhealthy in the healthy population of size nh, and e2 is the number of observations classified as healthy in the unhealthy population of size nu. The threshold value is obtained by minimizing the error function (i.e., by choosing different values for τx).
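A sketch of the threshold search implied by (8) is given below: candidate thresholds are swept over the transformed values of the two populations, and the one minimizing the combined misclassification rate is kept. The healthy and unhealthy arrays are placeholders for transformed MD values of the training data and of a known-fault data set.

```python
import numpy as np

rng = np.random.default_rng(3)
healthy = rng.normal(0.0, 1.0, 5000)     # transformed MDs, healthy (training) population
unhealthy = rng.normal(3.0, 1.0, 1000)   # transformed MDs, known-fault population

def error(tau, healthy, unhealthy):
    """Eq. (8): e1/n_h + e2/n_u for a candidate threshold tau."""
    e1 = np.sum(healthy > tau)      # healthy observations flagged as unhealthy (Type I)
    e2 = np.sum(unhealthy <= tau)   # unhealthy observations passed as healthy (Type II)
    return e1 / healthy.size + e2 / unhealthy.size

candidates = np.linspace(healthy.min(), unhealthy.max(), 1000)
errors = np.array([error(t, healthy, unhealthy) for t in candidates])
tau_opt = candidates[errors.argmin()]    # fault-specific threshold
```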

III. CASE STUDY

Experiments were performed on ten state-of-the-art (2007) notebook computers that were produced by the same manufacturer. As part of the test plan, it was necessary to assess the performance of the products under various environmental and usage conditions. The computers used for this study were exposed to different environmental and usage conditions during the experiments, and their performance parameters were monitored in situ. Since not all conditions could be tested, certain extreme and nominal conditions were included. The software usage conditions—a set of computer user activities representative of typical computer uses—were defined [28]. These usage conditions were executed through a script file, where all user activities were encoded.

To study the variability in performance parameters, experiments were conducted under six different environmental conditions, as shown in Table I. The test temperature range was 5 °C–50 °C, which was wider than the specified operating and storage temperature range of the computer in order to include variation in operating conditions beyond the manufacturer-specified range. In each environmental (temperature–humidity combination) condition, four usage conditions and three power supply conditions were considered [14]. The test duration depended on the way the computer was powered. When the computer was powered by an ac adapter and the battery was fully charged [relative state of charge (RSOC) = 100%], the test ran for 3.5 h. When the computer was powered by an ac adapter and the battery was fully discharged (i.e., RSOC < 4%), the test duration was determined by the time that the battery took to fully charge (RSOC = 100%). When the computer was powered by its battery only, the test duration was determined by the discharge (RSOC < 4%) time. The tests were conducted in a temperature–humidity chamber and in a room-ambient environment. Table II shows all 72 experiments. Each computer was turned on for 30 min before the experiment was started. The computers were kept at room temperature for 30 min between tests.

The correlation coefficients among performance parameters were calculated. Only significant correlation coefficients (for which the Pearson probability was less than 0.05) between two performance parameters are shown in Table III. The training data were formed by eight correlated performance parameters (listed in Table III). The parameters measured were fan speed (speed of a cooling fan in r/min), CPU temperature (measured on the CPU die), motherboard temperature (measured on the top surface of the printed circuit board near the CPU), video card temperature (measured on the graphics processor unit), %CPU usage (a measure of how much time the processor spends on a user's applications and high-level Windows functions), and %CPU throttle (a measure of the maximum CPU percentage to be used by any process or service, thereby ensuring that no process can consume all of the CPU's resources at the expense of other users or processes). The parameters C2 and C3 are the power-saving states of the CPU in which the processor consumes less power and dissipates less heat than in the active state [29]. C2 and C3 represent the percentage of time that a processor spends in the low-power idle state and are a subset of the processor's total idle time. In the C2 power state, the processor is able to maintain the context of the computer's caches. The C3 power state offers improved power savings and higher exit latency over the C2 state. In a C3 power state, the processor is unable to maintain the coherency of its caches. All the parameters mentioned in Table III were sampled at different rates: CPU operation every fifth second, and temperatures and fan speed every 30th second.
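Because the parameters were logged at different rates (CPU activity every 5 s; temperatures and fan speed every 30 s), they must be placed on a common time base before correlations and MD values can be computed. The pandas sketch below shows one way this alignment could be done; the column names, rates, and one-hour window are illustrative assumptions, not the authors' logging format.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t5 = pd.date_range('2007-01-01', periods=720, freq='5s')     # 1 h of 5-s samples
t30 = pd.date_range('2007-01-01', periods=120, freq='30s')   # 1 h of 30-s samples

cpu = pd.DataFrame({'cpu_usage': rng.uniform(0, 100, t5.size)}, index=t5)
thermal = pd.DataFrame({'cpu_temp': rng.normal(60, 5, t30.size),
                        'fan_speed': rng.normal(3000, 300, t30.size)}, index=t30)

# Average the fast channel onto the 30-s grid and join on timestamps.
aligned = thermal.join(cpu.resample('30s').mean(), how='inner')
```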

TABLE III CORRELATION COEFFICIENTS FOR NOTEBOOK PERFORMANCE PARAMETERS

Fig. 7. Probability density of residual CPU temperature for a healthy product.

The MD for each observation in the training data set was calculated using (3). According to the flowchart shown in Fig. 3, a healthy baseline was defined using MD values and empirical models of performance parameters. The training data comprised approximately 25 000 observations. The distribution of these MD values corresponding to the training data is shown in Fig. 6.

Fig. 6. Histogram of MD values for healthy population.

Empirical models for each performance parameter were developed as functions of other performance parameters using training data. The "residuals" of each parameter were calculated by subtracting the estimated value from the observed value. For example, an empirical model for CPU temperature as a function of fan speed, motherboard temperature, and video card temperature is

$$\text{CPU Temp} = -21.6 - 0.0025 \times \text{Fan Speed} + 0.44 \times \text{MB Temp} + 0.87 \times \text{VC Temp}. \qquad (9)$$

The residual analysis of CPU temperature indicated that the probability density plot of CPU temperature residuals (Fig. 7) represents 94% of the variability in the CPU temperature. The residual analysis indicated that the mean residual for fan speed was up to 500 r/min. Similarly, the mean residual for the CPU temperature was 5 °C, and that for the motherboard and video card temperatures was 8 °C. Similar empirical models for other parameters have been developed [30].

Two types of threshold MD values were determined: first, a generic threshold for detecting faults at the product level, and second, a specific threshold for detecting the presence of a particular fault. For generic threshold value determination, the Box–Cox transformation was applied to the training MD values, and an optimized value of λ (= −0.2) was obtained by maximizing the likelihood function defined earlier. The plot of λ and the likelihood function f(x, λ) is shown in Fig. 8.

The optimal λ value was used to obtain a normally distributed transformed variable x from the training MD values. The normal probability plot of x is shown in Fig. 9. A control chart for fault identification was developed, where the control limits were calculated using the mean and standard deviation of the transformed variable x (Fig. 10).
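A sketch of fitting an empirical linear model like (9) by ordinary least squares and computing the residuals used later for isolation is shown below (NumPy); the data are synthetic placeholders generated around the reported coefficients, not the measured notebook data.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25000
fan_speed = rng.normal(3000, 300, n)        # r/min
mb_temp = rng.normal(45, 3, n)              # deg C
vc_temp = rng.normal(55, 4, n)              # deg C
cpu_temp = (-21.6 - 0.0025 * fan_speed + 0.44 * mb_temp
            + 0.87 * vc_temp + rng.normal(0, 1.0, n))   # synthetic data around (9)

# Ordinary least squares: CPU Temp ~ 1 + Fan Speed + MB Temp + VC Temp.
A = np.column_stack([np.ones(n), fan_speed, mb_temp, vc_temp])
coef, *_ = np.linalg.lstsq(A, cpu_temp, rcond=None)

# Residual = observed - estimated; its distribution (cf. Fig. 7) drives isolation.
residual = cpu_temp - A @ coef
```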

TABLE IV PERCENTAGE OF ALARMS RAISED BY DIFFERENT RULES

Fig. 8. Plot of λ and likelihood function f(x, λ).

A warning limit and a fault limit corresponding to μx + 2σx and μx + 3σx were defined. For fault identification, two rules were used: first, one or more points fall above the fault limit (i.e., μx + 3σx), and second, two (or three) out of three consecutive points are within the fault and warning limits (i.e., Zone A). From the training data, only one data point fell above the fault limit, and 1.5% of the data fell in Zone A (i.e., above the warning limit). The threshold MD values corresponding to the warning limit and the fault limit were 3.4 and 9.1, respectively.

Fig. 9. Normal probability plot of transformed variable x.

Fig. 10. Control limits for fault identification.

Quality control rules were applied to determine the bias and trend in the data along with the identification of faults [27]. Data exhibit a trend if six (or more) consecutive points are increasing or decreasing. Data exhibit biasness if nine (or more) consecutive points fall on one side of the central line (i.e., μx). Since MD values increase with abnormality, data that fall above the central line of the control chart are of more concern.

A set of data obtained from one field-returned test notebook computer was plotted on the control chart constructed for the transformed variable x, and the quality control rules were applied. The observations made (Table IV) were as follows: 62.1% of the data were above the failure alarm limit (μx + 3σx) (i.e., 62.1% of the data indicated the presence of faults in the test system), in comparison with 0% for a healthy system. In addition, 37.8% of the data were within Zone A (i.e., 37.8% of the data indicated the tendency of the test system to be faulty), in comparison with 1.5% for a healthy system. This also indicated that 99.9% of the data were above the warning limit μx + 2σx, in comparison with 1% for a healthy system. In 96% of the cases, the data indicated the presence of a trend in the test system, in comparison with 2% for a healthy system. All test data (i.e., 100%) were on one side of the average, which indicated the presence of biasness in the test system, in comparison with 33% for a healthy system. The difference between a healthy system and the test system, which was identified using the fault isolation approach, suggested that the test system had problems. The MD values corresponding to the baseline (i.e., healthy) and the test computer are shown in Fig. 11.

Both sets of MD values obtained from the training and test data sets were transformed into normally distributed variables. To detect a specific fault, a threshold MD value corresponding to that fault was defined using the error function approach discussed earlier. An optimal threshold MD value τ was calculated by minimizing the error function (8). The amount of error ε for different candidate MD values is shown in Fig. 12. In this study, the optimal MD threshold value τ was 4.70, and the error ε(τ) was 0.025 (i.e., 2.5% misclassification, where 1.8% was contributed by the training data and 0.7% by the test data). The higher misclassification of the training data suggests that the defined threshold value was conservative in nature, because the healthy product was misclassified more than the unhealthy product.

Fig. 12. Optimal threshold evaluation.

Fig. 13. Robustness evaluation of the threshold value.

The validity of the defined threshold value was evaluated by calculating the misclassification of training data and test data at various threshold values (Fig. 13). The graph in Fig. 13 indicates that lowering the threshold value resulted in an increase in the number of observations from the training data being classified as faulty (misclassification of healthy data as unhealthy data increased). Similarly, increasing the threshold value resulted in an increase in the number of observations from the test data being classified as healthy (misclassification of unhealthy data as healthy data increased). Large percentage changes in misclassification were not observed, even after changing (increasing or decreasing) the threshold MD value by 10%. Thus, the threshold value can be considered robust.
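The control-chart rules applied above (a point beyond the fault limit, two of three consecutive points in Zone A, six or more successive increases or decreases for a trend, and nine or more consecutive points on one side of the center line for biasness) can be coded directly. A minimal sketch over a sequence of transformed values is shown below; the function name and flag layout are illustrative.

```python
import numpy as np

def chart_flags(x, mu, sigma):
    """Apply the run rules used in the paper to a 1-D array of transformed MD values."""
    warn, fault = mu + 2 * sigma, mu + 3 * sigma
    n = len(x)

    beyond_fault = x > fault                            # points above the fault limit
    zone_a = (x > warn) & (x <= fault)                  # between warning and fault limits
    two_of_three = np.array([zone_a[max(0, i - 2):i + 1].sum() >= 2 for i in range(n)])

    d = np.sign(np.diff(x))                             # +1 rising, -1 falling
    trend = np.array([i >= 5 and (np.all(d[i - 5:i] > 0) or np.all(d[i - 5:i] < 0))
                      for i in range(n)])               # six points steadily rising/falling

    above = x > mu
    bias = np.array([i >= 8 and (np.all(above[i - 8:i + 1]) or np.all(~above[i - 8:i + 1]))
                     for i in range(n)])                # nine points on one side of the center line

    return beyond_fault, two_of_three, trend, bias
```

Percentages such as those reported in Table IV then follow from the mean of each flag array.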

The performance parameter residuals were used to isolate the parameters that were responsible for the drift in the health of the test computers. A few test data samples are presented in Table V, where the measured (M) and estimated (E) parameter values are shown. From the residual analysis, it was observed that the residual of the fan speed was greater than expected in 90% of the instances, and in 10% of the instances, the residual of the temperature parameters was greater than expected. The fan was judged to be faulty based on the residual analysis, and this judgment was verified by investigation of the raw data.

Fig. 11. MD value for a baseline and a test system.

The case study demonstrated that the methodology presented was capable of identifying faults. A baseline generated from experimental data can be used to successfully analyze the onset of a fault and the eventual failure of a similar computer. The diagnostic approach can be applied to any product, but the case study results (and the baseline) cannot be extrapolated to all products and their variations. It would be expected that product developers would develop baselines for their products of interest.
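A sketch of the isolation logic described above: for each monitored parameter, the fraction of instances in which its residual exceeds the residual bound learned from the healthy baseline is computed, and the parameter that is out of bounds most often is flagged. The residual arrays are placeholders; the bounds echo the baseline residual magnitudes quoted earlier (about 500 r/min for fan speed and 5 °C–8 °C for the temperatures) but are shown only as illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)
residuals = {                                        # test-unit residuals (placeholders)
    'fan_speed': rng.normal(800, 200, 1000),         # r/min
    'cpu_temp': rng.normal(2, 2, 1000),              # deg C
    'mb_temp': rng.normal(1, 2, 1000),               # deg C
}
bounds = {'fan_speed': 500.0, 'cpu_temp': 5.0, 'mb_temp': 8.0}   # healthy-baseline residual bounds

# Fraction of observations whose residual magnitude exceeds the expected bound.
exceedance = {name: float(np.mean(np.abs(r) > bounds[name]))
              for name, r in residuals.items()}
suspect = max(exceedance, key=exceedance.get)        # parameter most often out of bounds
```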

IV. CONCLUSION

This paper presents a data-driven diagnostic approach that utilizes MD. Instead of using an expert-opinion-based threshold MD value, a probabilistic approach has been developed to establish threshold MD values to classify a product as being healthy or unhealthy. An error function has been defined and minimized to determine a reference MD value to identify the presence of a specific fault in a product. Once faults are detected, a set of specific threshold values developed using the residuals of the performance parameters can be used for isolating known faults. This paper demonstrates that the distribution of the residuals of performance parameters can be used to isolate parameters that exhibit faults.

This paper presents an approach for constructing an MD control chart from a system's performance data. The control chart enables continuous monitoring of a system's health using the MD value calculated from the system's performance data. This MD control chart concept can also be used by the manufacturing industry for continuous process monitoring, instead of following several performance parameter control charts. Rules for detecting faults and observing trends and biases in a system's performance have been presented in this paper. The ability to identify trends and biasness in the data will enable the development of new tests to identify flawed systems and processes. The ability to detect trends and biasness in system health by observing a control chart constructed for MD values will allow for the detection of changes in a product's health before it experiences failure.

The case study on notebook computers demonstrates that the approach to defining a threshold MD value is a major improvement. The defined thresholds were able to detect faults in a product with 99% accuracy. In a known fault condition, a specific threshold was defined, which classified a product with 97.5% accuracy (i.e., 2.5% error). The residual analysis of the performance parameters identified the fan as a problem 90% of the time. The temperature parameters that are correlated to the fan operation were identified as a problem 10% of the time. The results demonstrated that the suggested approach for defining a threshold MD value for the diagnostic approach was able to identify faults. The residual-based parameter isolation approach identified the cause of the problem.

MD is a good health measure that summarizes multiple monitored parameters that are correlated. With the modifications presented in this paper, MD will benefit manufacturers in controlling the quality of their products and processes online or offline.

TABLE V ESTIMATED VALUES OF THE PARAMETERS

ACKNOWLEDGMENT

The members of the Prognostics and Health Management Consortium, CALCE, University of Maryland, College Park, and the Prognostics and Health Management Center at City University of Hong Kong supported this work. The authors would like to thank M. Zimmerman of CALCE for copyediting this paper.

REFERENCES

[1] M. Pecht, Prognostics and Health Management of Electronics. New York: Wiley-Interscience, 2008.
[2] N. Vichare and M. Pecht, "Prognostics and health management of electronics," IEEE Trans. Compon. Packag. Technol., vol. 29, no. 1, pp. 222–229, Mar. 2006.
[3] K. Janasak and R. Beshears, "Diagnostics to prognostics—A product availability technology evolution," in Proc. Annu. RAMS, Jan. 2007, pp. 113–118.
[4] J. Gu, N. Vichare, T. Tracy, and M. Pecht, "Prognostics implementation methods for electronics," in Proc. Annu. RAMS, Jan. 2007, pp. 101–106.
[5] R. X. Gao and A. Suryavanshi, "BIT for intelligent system design and condition monitoring," IEEE Trans. Instrum. Meas., vol. 51, no. 5, pp. 1061–1067, Oct. 2002.
[6] Z. Liu, D. S. Forsyth, J. P. Komorowski, K. Hanasaki, and T. Kirubarajan, "Survey: State of the art in NDE data fusion techniques," IEEE Trans. Instrum. Meas., vol. 56, no. 6, pp. 2435–2451, Dec. 2007.
[7] M. Baybutt, C. Minnella, A. E. Ginart, P. W. Kalgren, and M. J. Roemer, "Improving digital system diagnostics through Prognostic and Health Management (PHM) technology," IEEE Trans. Instrum. Meas., vol. 58, no. 2, pp. 255–262, Feb. 2009.
[8] D. Thomas, K. Ayers, and M. Pecht, "The 'trouble not identified' phenomenon in automotive electronics," Microelectron. Reliab., vol. 42, no. 4/5, pp. 641–651, Apr./May 2002.
[9] K. C. Gross, K. W. Whisnant, and A. Urmanov, "Electronic prognostics through continuous system telemetry," in Proc. 60th Soc. Mach. Failure Prevention Technol. Meeting, Virginia Beach, VA, Apr. 2006, pp. 53–62.
[10] G. Taguchi, S. Chowdhury, and Y. Wu, The Mahalanobis–Taguchi System. New York: McGraw-Hill, 2001.
[11] L. Khreisat, "A machine learning approach for Arabic text classification using N-gram frequency statistics," J. Informetr., vol. 3, no. 1, pp. 72–77, Jan. 2009.
[12] S. Bouhouche, M. Lahreche, A. Moussaoui, and J. Bast, "Quality monitoring using principal component analysis and fuzzy logic application in continuous casting process," Amer. J. Appl. Sci., vol. 4, no. 9, pp. 637–644, 2007.
[13] K. Choi, S. Singh, A. Kodali, K. R. Pattipati, J. W. Sheppard, S. M. Namburu, S. Chigusa, D. V. Prokhorov, and L. Qiao, "Novel classifier fusion approaches for fault diagnosis in automotive systems," IEEE Trans. Instrum. Meas., vol. 58, no. 3, pp. 602–611, Mar. 2009.
[14] S. Kumar, V. Sotiris, and M. Pecht, "Health assessment of electronic products using Mahalanobis distance and projection pursuit analysis," Int. J. Comput. Inf. Syst. Sci. Eng., vol. 2, no. 4, pp. 242–250, Fall 2008.
[15] G. Blöschl, "Scaling issues in snow hydrology," Hydrol. Process., vol. 13, no. 14/15, pp. 2149–2175, 1999.
[16] J. Ahn, H. Kim, K. J. Lee, S. Jeon, S. J. Kang, Y. Sun, R. G. Nuzzo, and J. A. Rogers, "Heterogeneous three-dimensional electronics by use of printed semiconductor nanomaterials," Science, vol. 314, no. 5806, pp. 1754–1757, Dec. 2006.
[17] T. Riho, A. Suzuki, J. Oro, K. Ohmi, and H. Tanaka, "The yield enhancement methodology for invisible defects using the MTS+ method," IEEE Trans. Semicond. Manuf., vol. 18, no. 4, pp. 561–568, Nov. 2005.
[18] L. Abdesselam and C. Guy, "Time-frequency classification applied to induction machine faults monitoring," in Proc. 32nd IEEE IECON, Nov. 2006, pp. 5051–5056.
[19] O. Schabenberger and F. J. Pierce, Contemporary Statistical Models for the Plant and Soil Sciences, 1st ed. Boca Raton, FL: CRC Press, 2001.
[20] S.-K. Si, X.-F. Wang, and X.-J. Sun, "Discrimination methods for the classification of breast cancer diagnosis," in Proc. ICNN&B, Oct. 2005, vol. 1, pp. 259–261.
[21] S. Wu and T. Chow, "Induction machine fault detection using SOM-based RBF neural networks," IEEE Trans. Ind. Electron., vol. 51, no. 1, pp. 183–194, Feb. 2004.
[22] P. J. Rousseeuw and B. C. van Zomeren, "Unmasking multivariate outliers and leverage points," J. Amer. Stat. Assoc., vol. 85, no. 411, pp. 633–639, Sep. 1990.
[23] F. Sun, S. Omachi, N. Kato, H. Aso, S. Kono, and T. Takagi, "Two-stage computational cost reduction algorithm based on Mahalanobis distance approximations," in Proc. 15th Int. Conf. Pattern Recog., Sep. 2000, vol. 2, pp. 696–699.
[24] G. Betta and A. Pietrosanto, "Instrument fault detection and isolation: State of the art and new research trends," IEEE Trans. Instrum. Meas., vol. 49, no. 1, pp. 100–107, Feb. 2000.
[25] T. Mitchell, Machine Learning, 1st ed. New York: McGraw-Hill, 1997.
[26] G. Box and D. Cox, "An analysis of transformations," J. R. Stat. Soc., Ser. B Stat. Methodol., vol. 26, no. 2, pp. 211–252, 1964.
[27] L. S. Nelson, "Technical aids," J. Qual. Technol., vol. 16, no. 4, pp. 238–239, Oct. 1984.
[28] J. C. Day, A. Janus, and J. Davis, "Computer and Internet use in the United States: 2003," U.S. Bureau Labor Stat., Washington, DC, Oct. 2005.
[29] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba, Advanced Configuration and Power Interface Specification, Revision 3.0a, Dec. 30, 2005. [Online]. Available: http://www.acpi.info/DOWNLOADS/ACPIspec30a.pdf (last accessed May 29, 2009).
[30] S. Kumar and M. Pecht, "Baseline performance of notebook computer under various environmental and usage conditions for prognostics," IEEE Trans. Compon. Packag. Technol., vol. 32, no. 3, pp. 667–676, Sep. 2009.

Sachin Kumar (M'07) received the B.S. degree in metallurgical engineering from the Bihar Institute of Technology, Sindri, India, and the M.Tech. degree in reliability engineering from the Indian Institute of Technology, Kharagpur, India. He is currently working toward the Ph.D. degree in reliability engineering with the Prognostics and Health Management Laboratory, Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park. His research interests include reliability evaluation and prediction, model and algorithm development for diagnostics and prognostics, electronic system health management, Bayesian methodology, statistical modeling, data mining, machine learning, and artificial intelligence.

Tommy W. S. Chow (M'93–SM'03) received the B.Sc. (first honors) and Ph.D. degrees from the University of Sunderland, Sunderland, U.K. He is currently a Professor with the Prognostics and Health Management Centre, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong. He has been working on different consultancy projects with the Mass Transit Railway and the Kowloon–Canton Railway Corporation, Hong Kong. He has also conducted other collaborative projects with Hong Kong Electric Co. Ltd., the MTR Hong Kong, and the Hong Kong Observatory on the application of neural networks for machine fault detection and forecasting. He is the author or coauthor of more than 130 journal articles related to his research, five book chapters, and one book. His research interests include neural networks, machine learning, pattern recognition, fault diagnosis, and bioinformatics. Dr. Chow received the Best Paper Award at the 2002 IEEE Industrial Electronics Society Annual Meeting in Seville, Spain. He is an Associate Editor of Pattern Analysis and Applications and the International Journal of Information Technology.

Michael Pecht (F'92) received the M.S. degree in electrical engineering and the M.S. and Ph.D. degrees in engineering mechanics from the University of Wisconsin, Madison. He is currently a Visiting Professor in electrical engineering with the City University of Hong Kong, Kowloon, Hong Kong, and the founder of the Center for Advanced Life Cycle Engineering (CALCE), University of Maryland, College Park, where he is also the George Dieter Chair Professor in mechanical engineering and a Professor in applied mathematics. He has been leading a research team in the area of prognostics for the past ten years. He has consulted for more than 100 major international electronics companies, providing expertise in strategic planning, design, test, prognostics, IP, and risk assessment of electronic products and systems. He has written more than 20 books on electronic product development, use, and supply chain management, and more than 400 technical papers. He is a Chief Editor for Microelectronics Reliability. Dr. Pecht is a Fellow of the American Society of Mechanical Engineers and the International Microelectronics and Packaging Society (IMAPS). He is a Professional Engineer. He served as Chief Editor for the IEEE TRANSACTIONS ON RELIABILITY for eight years and on the advisory board of IEEE SPECTRUM. He is an Associate Editor for the IEEE TRANSACTIONS ON COMPONENTS AND PACKAGING TECHNOLOGY. He was the recipient of the IEEE Reliability Society's Lifetime Achievement Award in 2008 (the highest reliability honor), the European Micro and Nano-Reliability Award for outstanding contributions to reliability research, the 3M Research Award for electronics packaging, and the IMAPS William D. Ashman Memorial Achievement Award for his contributions in electronic reliability analysis.