VIBRATION ANALYSIS FOR OCEAN TURBINE RELIABILITY MODELS by Randall David Wald

A Dissertation Submitted to the Faculty of The College of Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Florida Atlantic University
Boca Raton, FL
December 2012

Copyright by Randall David Wald 2012


ACKNOWLEDGEMENTS

I would like to recognize the assistance and guidance provided by my advisor, Dr. Taghi M. Khoshgoftaar, towards the completion of my doctoral studies at Florida Atlantic University. His years of research experience and invaluable advice and support were an essential part of my studies. I would also like to thank Dr. Howard Hanson, Dr. Martin Solomon, and Dr. Bassem Alhalabi for serving on my dissertation committee. I am also grateful to the Southeast National Marine Renewable Energy Center (SNMREC) for supporting and funding this research. In addition, I would like to acknowledge my colleagues in the Data Mining and Machine Learning Laboratory, Janell Duhaney, Dr. John C. Sloan, and Dr. Amri Napolitano, with whom I have been fortunate enough to collaborate on this research. Finally, I wish to thank the researchers and staff in the Department of Ocean and Mechanical Engineering at the Dania Beach campus for their collaboration and assistance with the SNMREC ocean turbine project.

ABSTRACT

Author: Randall David Wald
Title: Vibration Analysis for Ocean Turbine Reliability Models
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Taghi M. Khoshgoftaar
Degree: Doctor of Philosophy
Year: 2012

Submerged turbines which harvest energy from ocean currents are an important potential energy resource, but their harsh and remote environment demands an automated system for machine condition monitoring and prognostic health monitoring (MCM/PHM). For building MCM/PHM models, vibration sensor data is among both the most useful (because it can show abnormal behavior which has yet to cause damage) and the most challenging (because, due to its waveform nature, frequency bands must be extracted from the signal). To perform the necessary analysis of the vibration signals, which may arrive rapidly in the form of data streams, we develop three new wavelet-based transforms (the Streaming Wavelet Transform, Short-Time Wavelet Packet Decomposition, and Streaming Wavelet Packet Decomposition) and propose modifications to the existing Short-Time Wavelet Transform.

We also prepare post-processing techniques to resolve additional problems such as interpreting wavelet data in a fully-streaming format, automatically choosing the appropriate transformation depth without performing classification, and building models which can perform state identification correctly even as the turbine’s environment changes. Collectively, these new approaches solve problems not currently dealt with by existing algorithms and offer important improvements. The proposed algorithms allow data to be processed in a fully-streaming manner. These algorithms also create and select frequency-band features which focus on the areas of the signal most important to MCM/PHM, producing only the information necessary for building models (or removing all unnecessary information) so models can run on less powerful hardware. Finally, we demonstrate models which can work in multiple environmental conditions.

To evaluate these algorithms, along with the Short-Time Fourier Transform, which is often neglected in the context of MCM/PHM, we perform six case studies on data from two different physical machines: a fan and a dynamometer model of the ocean turbine. Our results show that many of the transforms give similar results in terms of performance, but their differing properties regarding time complexity, ability to operate in a fully streaming fashion, and number of generated features may make some more appropriate than others in particular applications, such as when streaming data or hardware limitations are extremely important (e.g., ocean turbine MCM/PHM).

DEDICATION

To my parents, Harlan and Karen Wald, for always supporting me and pushing me to achieve my goals.

List of Tables

List of Figures

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization

2 Background
  2.1 Prognostic Health Monitoring
    2.1.1 Designing an MCM/PHM system
    2.1.2 Combining MCM/PHM algorithms
    2.1.3 Applying MCM/PHM to machine maintenance
  2.2 Vibration Analysis
    2.2.1 Fourier Transforms
    2.2.2 Wavelet Transforms
    2.2.3 Comparing Fourier and Wavelet Approaches
    2.2.4 Streaming Data

3 Methodology
  3.1 Classification

    3.1.1 C4.5
    3.1.2 Random Forest
    3.1.3 Naïve Bayes
    3.1.4 k-Nearest Neighbor
    3.1.5 Logistic Regression
    3.1.6 Support Vector Machines
    3.1.7 Multi-Layer Perceptron
    3.1.8 RIPPER
    3.1.9 Radial Basis Function Neural Networks
  3.2 Feature Selection
    3.2.1 Chi Squared
    3.2.2 Information Gain
    3.2.3 Signal to Noise
  3.3 Performance Evaluation
    3.3.1 Performance Metrics
    3.3.2 Training and Test Datasets
  3.4 Fourier Transforms
    3.4.1 Short-Time Fourier Transforms

4 Wavelet Transforms
  4.1 Continuous Wavelet Transforms
  4.2 Discrete Wavelet Transforms
    4.2.1 Short-Time Wavelet Transform
    4.2.2 Streaming Wavelet Transform
  4.3 Wavelet Packet Decomposition
    4.3.1 Short-Time Wavelet Packet Decomposition
    4.3.2 Streaming Wavelet Packet Decomposition

5 Post-Processing
  5.1 Scale Detection
  5.2 Data Windowing
  5.3 Automatic Depth Selection
  5.4 Baseline-Differencing

6 Datasets
  6.1 Fan Experiments
  6.2 Fan Transformation Parameters
    6.2.1 Short-Time Fourier Transform
    6.2.2 Streaming Wavelet Transform
    6.2.3 Short-Time Wavelet Packet Decomposition
    6.2.4 Streaming Wavelet Packet Decomposition
  6.3 Dynamometer Data
  6.4 Dynamometer Transformation Parameters

7 Case Studies
  7.1 Case Study One
    7.1.1 Parameter optimization
    7.1.2 Classification
    7.1.3 Experimental procedure
    7.1.4 Results
  7.2 Case Study Two
    7.2.1 Classification
    7.2.2 Results
  7.3 Case Study Three
    7.3.1 Classification

    7.3.2 Results
  7.4 Case Study Four
    7.4.1 Classifiers and Feature Selection
    7.4.2 Results
  7.5 Case Study Five
    7.5.1 Feature Ranking for Depth Selection
    7.5.2 Classification
    7.5.3 Results
  7.6 Case Study Six
    7.6.1 Classification
    7.6.2 Results

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

Bibliography

LIST OF TABLES

7.1 CS1: Fan Experiment One
7.2 CS2: Fan Experiment Two
7.3 CS2
7.4 CS3: STFT with NB
7.5 CS3: STFT with 5NN
7.6 CS3: STFT with C4.5
7.7 CS3: STWPD with NB
7.8 CS3: STWPD with 5NN
7.9 CS3: STWPD with C4.5
7.10 CS4: STWPD
7.11 CS4: SWPD
7.12 CS5: Largest Selected Features
7.13 CS5: Fan Experiment One, Distribution of Features
7.14 CS5: Fan Experiment Two, Distribution of Features
7.15 CS5: Classification Performance
7.16 CS6: Dyn Experiment One, Test on 656, No Baselining
7.17 CS6: Dyn Experiment One, Test on 656, Baseline-Differencing
7.18 CS6: Dyn Experiment One, Test on 763, No Baselining
7.19 CS6: Dyn Experiment One, Test on 763, Baseline-Differencing
7.20 CS6: Dyn Experiment Two, Test on 654, No Baselining
7.21 CS6: Dyn Experiment Two, Test on 654, Baseline-Differencing

7.22 CS6: Dyn Experiment Two, Test on 1090, No Baselining
7.23 CS6: Dyn Experiment Two, Test on 1090, Baseline-Differencing

LIST OF FIGURES

3.1 Example spectrogram

4.1 Example scalogram
4.2 Comparison of discrete wavelet transform and wavelet packet decomposition
4.3 Illustration of wavelet packet decomposition tree

Chapter 1

Introduction

Broadening the range and availability of alternative energy is an increasingly important national goal, aiming to lower energy costs, increase national self-reliance, reduce environmental harm, and guarantee access to energy even after nonrenewable resources become more scarce. While much research and development has focused on areas such as photovoltaic, solar thermal, and wind energy [57], one mostly untapped energy resource is the ocean, in the form of ocean thermal energy, tidal energy, or hydrokinetics. This last form in particular focuses on exploiting continuous ocean currents in regions such as South Florida, where the Florida Current moves large masses of water through the power of thermal, Coriolis, and other effects [63], making it a reliable and renewable resource for energy extraction. Tapping these currents would expand the spectrum of available energy sources for coastal regions near such currents, helping to take the load off of other resources [35]. To harness these currents, underwater ocean turbines are necessary [69]. Although these turbines operate on moving water, much like the turbines in a hydroelectric dam, their design has more in common with a wind turbine. As with their land-based cousins, ocean turbines will have a large blade assembly connected via a shaft to a pod, which will contain the gearbox and generator used to convert rotary motion into electricity. Unlike wind turbines, however, the entire assembly will be underwater.

This forces a number of changes: the blades are much shorter and faster-moving (in one design, they are approximately two meters in diameter and ideally rotate at approximately 60 RPM), while wind turbines have much larger blades (40 meters in diameter or more) and consequently rotate more slowly (15–20 RPM). In addition, while wind turbines are located atop individual towers, due to the benthic conditions found in areas with strong ocean currents, undersea turbines will be tethered to the sea floor using a cable, with support buoys, connected via additional cables, to carry topside equipment. And of course, all of these components must subsist in a very hostile environment: the ocean. Any engineered device faces many challenges for operating in the open ocean. The corrosive nature of sea water will rust and degrade many metals, which can damage cables, short out electronics, and threaten the integrity of watertight compartments. In addition, any surfaces exposed to the ocean will experience biofouling: the accumulation of bacteria, algae, and eventually more extensive sea life. Each layer provides the ecosystem for even more sea life to accumulate, and this growth can impede moving components or accelerate the corrosion started by the sea water [70]. Finally, it is important to note that just as human equipment is threatened by the ocean environment, the ocean environment can be threatened by human equipment [8]. Animals such as fish and sea turtles can be struck by moving blades, and migration patterns can be affected by large collections of turbines. Any large-scale ocean project must employ monitoring systems, both for the health of the system and the health of the environment. The remote ocean turbines which are proposed to harness energy from ocean currents face one additional challenge, above and beyond most ocean systems: they must operate unmanned for long stretches of time, with up to a year between maintenance operations [19].
The primary reason for this is that sending out an expedition to retrieve a turbine, servicing it on shore, and returning it to the ocean incurs significant costs, and should only be done on an as-needed basis. Performing these maintenance operations too frequently will reduce the cost-effectiveness of the entire turbine project, while performing them too rarely will lead to increased damage and associated costs from malfunctioning equipment. Planned maintenance approaches, which rely on a priori assumptions about system lifespans, are too imprecise to determine when a turbine needs human attention. Instead, predictive maintenance, using machine condition monitoring and prognostic health monitoring (MCM/PHM) systems, must be used to determine when a specific turbine has issues which require further attention, and when a system should be shut down to reduce further damage.

1.1 MOTIVATION

Condition monitoring is an important part of modern industrial design across a wide range of application domains, from induction motors [20] to industrial tools [40] to remote ocean turbines [70]. In all cases, the basic problem is simple: complex machines have a large number of components, and consequently have a wide range of failure conditions. While some failures can be diagnosed by human operators, often this is only feasible once the failure has progressed past a certain point, resulting in additional damage. In addition, this diagnosis may require that the machine be taken offline, and the availability of human experts may be limited. The solution to these problems is computerized monitoring: designing and implementing a system which will evaluate the state of the machines on a regular basis, without direct human involvement [3, 33]. Naturally, these systems rely on sensor data from the machines being evaluated, and this data is processed using different behavior models to both determine the current condition of the machine (MCM) and predict the future states of the machine (PHM). Although the specifics of these models may differ across application domains (some focus on real-time models while others employ models too costly to run continuously, and different applications call for different processes in building the models), the basic principle of employing sensor data to evaluate computer models is found across all applications of MCM/PHM.

Many types of sensor data are used with MCM/PHM systems, again as appropriate for a given application domain. In the case of an underwater ocean turbine [9], these may include leak sensors (to determine if the integrity of the pressure vessel has been compromised), pressure sensors (both to evaluate vessel integrity and determine depth), tachometers (to find the rotational velocity of the blades), voltmeters (to measure the voltage being produced by the generator), oil quality sensors (which typically use the oil as a dielectric medium in a capacitor and evaluate changes in capacitance), temperature sensors (both to find outside sea temperature and to see if internal components are overheating), and strain gauges (to determine if the blades are being flexed inappropriately, for example by having been caught on something). For the most part, these sensors are relatively binary: there is a known range of appropriate values, and any values outside this range are immediate cause for concern. For most of these sensors, the model only needs to find whether any values are outside of their appropriate ranges, and if so alert the human operators of the potential failure. One type of sensor does require additional attention, however: the accelerometer [2]. These sensors, used to detect vibrations within or of a piece of machinery, are among the most useful for detecting subtle failures and pre-failure wear in a machine, but their output is also the most difficult to process.
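Before turning to vibration data, the simple range-check monitoring described above for the other sensors can be sketched as follows. This is an illustrative sketch only; the sensor names and limits are invented, not taken from the actual turbine design.

```python
# Hypothetical sketch of range-check monitoring: each non-vibration sensor
# has a known range of acceptable values, and any reading outside that
# range triggers an alert. Names and limits are purely illustrative.
SENSOR_LIMITS = {
    "oil_temperature_c": (0.0, 85.0),
    "generator_voltage_v": (300.0, 480.0),
    "blade_rpm": (0.0, 75.0),
}

def check_readings(readings):
    """Return (sensor, value) pairs that fall outside their allowed range."""
    alerts = []
    for sensor, value in readings.items():
        low, high = SENSOR_LIMITS[sensor]
        if not (low <= value <= high):
            alerts.append((sensor, value))
    return alerts

print(check_readings({"oil_temperature_c": 92.0,
                      "generator_voltage_v": 410.0,
                      "blade_rpm": 60.0}))
# → [('oil_temperature_c', 92.0)]
```

The point of the sketch is how little modeling such sensors require: a lookup and a comparison suffice, which is exactly why the accelerometer stands apart.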
In rotating or reciprocating machinery such as an underwater turbine, different components of the machine will complete one full cycle of motion at different rates. Each of these components thus creates a signature vibration, a type of signal which shows that the component in question is functioning normally. When something changes, due to wear, foreign objects, or other failure modes, these components will behave differently, and these changes will affect the vibration signatures. However, extracting these signatures is not as simple as polling the accelerometer and comparing the current value with a given range. By their very waveform nature, each vibration signature can only be observed when collecting data from a time frame at least as long as the wavelength in question. This data must be passed through a mathematical function known as a transform, to move it from the amplitude-time domain (e.g., the physical location of the accelerometer at a given point in time) to the frequency-magnitude domain (e.g., how much of the wave can be decomposed into vibrations of each given frequency). The Fourier transform is a well-known form of frequency-magnitude representation: the original data is decomposed into a collection of sinusoidal waves, with each point of the transformed data representing how much of the original data exhibited oscillations (waves) of the given frequency. These separate frequencies can be used as the vibration signature to characterize a machine in both healthy and abnormal states [42].

One downside of Fourier transforms (and frequency-magnitude transforms in general) is that they assume that all data exists prior to performing the transform, and that the data continues to infinity in both directions. Naturally, these assumptions are not met when using vibration sensor data as part of an MCM/PHM system. Moreover, the most important aspect of such a system is that it can detect a change in the machine’s behavior: whether a system that works on one day has stopped working on the next. Any transform used for such a system must be able to extract vibration signatures and determine when these signatures themselves change over time.
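As an illustration of this frequency-magnitude view, a naive discrete Fourier transform can recover the component frequencies of a synthetic two-tone "vibration." This is a sketch for exposition only (an O(n²) direct DFT, not the dissertation's implementation, and the signal is invented):

```python
# Illustrative sketch: move a sampled signal from the amplitude-time domain
# to the frequency-magnitude domain with a direct discrete Fourier transform.
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin for a real-valued signal."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2)]

# A synthetic "vibration": a 5 Hz and a weaker 12 Hz component, sampled at
# 64 Hz for one second, so bin k corresponds to k Hz.
fs, n = 64, 64
signal = [math.sin(2 * math.pi * 5 * t / fs) +
          0.5 * math.sin(2 * math.pi * 12 * t / fs) for t in range(n)]
mags = dft_magnitudes(signal)
peaks = sorted(range(len(mags)), key=mags.__getitem__, reverse=True)[:2]
print(sorted(peaks))  # → [5, 12]: the two component frequencies reappear
```

The two dominant bins act as the "vibration signature" in this toy case; a real MCM/PHM pipeline would track how such peaks shift or grow over time.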
This is a challenge, because the signatures themselves take up a finite amount of time (at least long enough to evaluate one wavelength of the longest/lowest frequency which composes that signature), but changes in signatures must be detected rapidly. A time-frequency-magnitude transformation [11] must be employed to balance these concerns and provide information about which vibration signatures are present and when.

One solution to this problem with the traditional Fourier transform is the short-time Fourier transform (STFT) [15, 88]. This transform modifies the existing Fourier transform by selecting a window of data and performing the transform on only this window. The window is then slid along, processing new time spans from the data. Choosing the window size is a major downside of the STFT: too long, and the window will not be able to recognize changes in high-frequency (short-wavelength) vibrations until long after these changes have transpired. Too short, on the other hand, and low-frequency (long-wavelength) vibrations will be excluded altogether, because the window ends before they can be detected. Despite these drawbacks, the STFT is one potential approach to vibration analysis, and bears further study.

Wavelet transforms are an alternative to Fourier transforms which lack these drawbacks [83, 105]. Wavelets (which form the core of wavelet transforms) are generator functions which only have non-zero magnitude over a very limited portion of their range, and which exhibit wave-like patterns (hence the name, signifying “little wave”). The underlying principle of wavelet transformation is choosing a single generator function (known as the “mother wavelet”), and using this function to create a family of “child wavelets” by stretching (scaling) and sliding (translating) the mother wavelet. Each child wavelet will then have a unique position (based on how much it was translated) and size (based on how much it was scaled), representing the range of time and scale being studied.
This collection of child wavelets is convolved with the original data, to determine, for each child wavelet, how much of the original data can be represented by vibration on that scale at that time. This collection of scale data, collected into a “scalogram,” can be used to provide a full time-frequency-magnitude representation.

With wavelet transforms, one additional problem is streaming data [12, 79]. (Fourier transforms, in the form of the STFT, are already able to handle streaming data.) Even these time-frequency-magnitude transforms, which properly handle data where vibration signatures vary over time, often expect that all data will be present from the outset. Frequently the full dataset is used to build the results, and time localization can only be used retrospectively. Thus, streaming versions of these transforms are needed, which are designed from the ground up to operate on data as it flows through an MCM/PHM system, without expecting future data to be available. A number of solutions exist: windowing (such as the windowing employed in the STFT) may be considered, or the transform can be built sequentially, using new data to update information as soon as it is available. Both approaches have benefits and drawbacks, and merit further investigation.

Additional challenges exist for using vibration sensors to build an MCM/PHM system to monitor an underwater ocean turbine. One known problem with such underwater systems is that the environmental conditions can induce vibrations which are not directly related to the operating state of the turbine [1]. For example, the rotating speed of the turbine shaft is not directly related to the health of the turbine, only to the speed of the ocean flow at a given time (and this speed is known to change over time). However, the rotating shaft will induce vibrations, and these will affect the vibration signatures extracted from the accelerometers. Measures must be taken to remove this background noise, to preserve only the parts of the vibration data which are useful for evaluating system state.
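One simple way to picture such background-noise removal is a baseline-differencing sketch: subtract a spectrum recorded under the same environmental conditions but a known machine state, so that environment-induced frequency components cancel and only health-related components remain. The band layout and values below are invented for illustration:

```python
# Hedged sketch of baseline-differencing: per-band subtraction of a
# baseline spectrum (healthy machine, matching environmental conditions)
# from an observed spectrum. All numbers here are invented.
def baseline_difference(spectrum, baseline):
    """Per-frequency-band difference between an observation and a baseline."""
    return [s - b for s, b in zip(spectrum, baseline)]

baseline = [0.9, 0.1, 0.4, 0.05]   # healthy machine at the same flow speed
observed = [0.9, 0.1, 0.4, 0.80]   # same flow speed, new energy in band 3
print(baseline_difference(observed, baseline))
```

The shared (environmental) components cancel to roughly zero, while the band that changed stands out, which is the signal a state-identification model would want to see.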
In addition, some types of transform require further processing to determine which scales exhibit the strongest waves, so that the collection of discovered scales can be used for building a vibration signature. Finally, depending on the types of accelerometers used in a given system, data from these sources may be combined to create a cleaner and more useful data source, both reducing and refining the data employed by the MCM/PHM system [18].

1.2 CONTRIBUTIONS

This dissertation studies the problem of vibration analysis within an MCM/PHM system to improve ocean turbine reliability, considering many of the challenges discussed above. Through the course of this work, a number of different algorithms are developed, tested, and compared, in order to demonstrate the strengths and weaknesses of each. Case studies are performed to evaluate these algorithms on real-world data, and secondary issues which impose additional challenges are discussed and dealt with. Overall, this work is a thorough collection of techniques and approaches which may be used for vibration analysis of ocean turbine sensor data. The specific contributions from this work include:

1. A collection of new wavelet-based transforms, including a modified short-time wavelet transform, and three novel transforms: the streaming wavelet transform, short-time wavelet packet decomposition, and streaming wavelet packet decomposition. Collectively, these consider a number of different views on wavelet transformation, with different advantages and drawbacks in terms of extracting different types of vibration signatures.

2. New approaches for dealing with the challenges of extracting useful scale information from the aforementioned wavelet-based transforms. In particular, while existing techniques can employ windowed data to determine which frequencies in a wavelet-based transform show the highest amount of amplitude, we introduce an approach that can perform this in a fully streaming fashion, examining different frequencies on different time scales (rather than being restricted to a single time window across all frequencies).

3. A novel method for reducing the size of the data extracted through the aforementioned transformation steps, which works by employing the selected features to discover an appropriate depth for transformation. This is the first data-driven approach towards depth selection, as opposed to using ad hoc or human-driven approaches based on comparing results across different experiments.

4. A baseline-differencing approach for finding useful vibration data even in the face of changing environmental conditions. This is important because in the real world, vibration data will necessarily incorporate information regarding environmental conditions in addition to important state information, and models will need to be built using only the state information without the extraneous information regarding the environment.

5. A large collection of case studies from two different physical systems to evaluate and compare the above approaches. These two systems, a commercial box fan and a purpose-built dynamometer test bed, exhibit different traits and serve different purposes for testing. Through the course of the case studies, we demonstrate all the above algorithms and approaches, and compare how they affect the performance of the models.

1.3 ORGANIZATION

This dissertation is organized as follows:

• Chapter 2 presents a background of related works and ideas in the area of MCM/PHM, with a special focus on vibration analysis. This provides a context for this dissertation, and presents additional works which have addressed similar problems with related systems.

• Chapter 3 includes an overview of the methodologies used in this work, including classification models, feature selection, the Fourier transform, and model evaluation. Although these are all well-understood techniques, they are presented here for completeness, and to include the specific parameter, algorithm, and implementation decisions made in this dissertation.

• Chapter 4 presents both a broad background on wavelets (with mathematical justification and algorithms) and the details of the newly-proposed or modified algorithms (that is, the Short-Time Wavelet Transform, the Streaming Wavelet Transform, the Short-Time Wavelet Packet Decomposition, and the Streaming Wavelet Packet Decomposition). All of these include full algorithmic description and explanation.

• Chapter 5 includes the new post-processing techniques we propose to address the problems of scale detection and baseline-differencing. In each case, we discuss which post-processing techniques are appropriate for which transforms, and present full detail for performing these techniques.

• Chapter 6 gives an outline of both the box fan and the dynamometer testbed used in the case studies, discussing the data collection and format for each. The similarities and differences between the two are also discussed.

• Chapter 7 presents the case studies, in each case presenting the results for applying one or more of the proposed algorithms and approaches to the data from one of the physical systems discussed in the previous chapter.

• Finally, Chapter 8 presents conclusions, summarizing the results across all the case studies and showing how the algorithms compare when considering all of the experiments. In addition, future work is discussed, giving suggestions for what aspects of research remain yet unexplored.

Chapter 2

Background

MCM/PHM systems and the use of vibration analysis both have a long history of study across many different academic domains. The following sections give an overview of related works and important ideas in these fields, covering both works focused on MCM/PHM overall (Section 2.1) and works which specifically examine different approaches to vibration analysis (Section 2.2).

2.1 PROGNOSTIC HEALTH MONITORING

Machine Condition Monitoring/Prognostic Health Monitoring (MCM/PHM) systems are used in a wide variety of domains to ensure that industrial machines are functioning correctly, in order to minimize maintenance operations and costs associated with failures. Due to the broad importance of MCM/PHM systems, many studies have examined the different approaches to building these systems [80]. The following sections will discuss overall design considerations (Section 2.1.1), important approaches for combining MCM/PHM algorithms (Section 2.1.2), and important considerations when applying MCM/PHM algorithms towards machine maintenance (Section 2.1.3).

2.1.1 Designing an MCM/PHM system

Before designing an MCM/PHM system, one must consider all the aspects of the system in order to make the best choice at each stage of MCM/PHM design. This problem is dealt with by Uckun et al. [75]. Using their procedure, the first step is to determine whether or not MCM/PHM is justified for the system in question. If there is no business case for prognostics, there is no reason to deploy it. Once it is known that MCM/PHM is relevant to the system, the designer must find the most severe potential faults in the system and determine the root causes of these faults, employing an understanding of the underlying mechanics of the system. Once these causes are found, the designer must decide if existing sensors are sufficient to detect the warning signs for these causes, or whether physical redesign of the system is necessary to support MCM/PHM.

Once the system design itself is finalized, the testing platform must be constructed. A mock-up of the system is used to generate data representing normal and abnormal behavior. Using this data, a model-driven approach (using the lab data to create a physics model or expert system which will examine the real-world data), a data-driven approach (comparing real-world data to data generated in the lab and finding differences), or a hybrid of the two can be built and tested.

One problem which must also be considered when designing an MCM/PHM system is how to evaluate its effectiveness. One approach is a cost-benefit analysis. If the MCM/PHM system is employed, how much money will be saved by repairing machines before they become severely damaged, compared with the extra cost of inspecting potentially-healthy systems? By computing the net effect of MCM/PHM, one can produce a bottom-line result that is easily understood by all stakeholders [99]. This is not the only possible metric, however; there are a variety of methods to evaluate MCM/PHM approaches, including examining the effectiveness of the algorithm as well as how computationally efficient the algorithm is [67]. As in many fields, the choice of metric often determines which algorithm will be declared most suitable.
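The cost-benefit calculation described above can be made concrete with a small sketch. All dollar figures below are invented for illustration; only the arithmetic structure reflects the analysis in the text:

```python
# Illustrative cost-benefit bottom line for deploying MCM/PHM.
# Benefit: catastrophic failures avoided by early repair.
# Cost: extra inspections of machines that turn out to be healthy.
# All figures are hypothetical.
def net_benefit(failures_avoided, cost_per_failure,
                extra_inspections, cost_per_inspection):
    return (failures_avoided * cost_per_failure
            - extra_inspections * cost_per_inspection)

# e.g. 3 avoided failures at $250k each vs. 10 unnecessary $15k inspections
print(net_benefit(3, 250_000, 10, 15_000))  # → 600000
```

A positive result argues for deployment; a negative one suggests MCM/PHM is not justified for that system, which is exactly the first question in the design procedure above.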

2.1.2 Combining MCM/PHM algorithms

A wide range of algorithms have been applied to MCM/PHM; collectively, these approach the problem of prognostics from a variety of different perspectives. It is unwise to rely too heavily on any one strategy: there are always edge cases where that specific approach will fail, while another might have caught a potential problem. Thus, whenever possible, it is best to employ a hybrid approach, running multiple algorithms in parallel or in sequence to achieve better results.

One example of hybrid prognostics is found in Goebel et al. [29]. Here, the researchers combine a physics-based model with a condition-based one. The first type of model uses data from the system to determine the state of each component and then uses preprogrammed knowledge about component interactions to predict the overall lifetime of the system. The second model uses that same data to define an overall state of the system and then uses past experimental data to interpret the survival of systems found in a given state. Each of these approaches has its drawbacks: a physics model assumes that the experts designing the model have a perfect understanding of the system, and that no unexpected problems will arise. The condition model, on the other hand, assumes that there is training data from all possible conditions, and that no new state will be found. Neither can be used alone.

To overcome these difficulties, both their predictions and the uncertainties in their predictions must be fused. How this is to be done depends on the amount of processing power available, as well as the specific type of problems (short-term or long-term) being addressed. The researchers in [29], examining aeronautic data, consider prognostics both during an actual mission (while the plane is in flight) and between missions (while the plane is in a hangar). For the purposes of remote oceanic systems, there is no sense of “during” or “between” missions, but the same principles show that processor-intensive prognostics can be run less frequently, because they are meant to see further into the future.

When performing short-term, low-processor predictions, the researchers perform prediction fusion in a fairly rudimentary way: information from sensors and a physics model are fused in a diagnostic reasoner, and when the amount of predicted damage exceeds a given threshold, this information loops back to the physics model, which changes into a different state. Between flights, when planning future missions, more elaborate models can be run. For example, the physics model can be used in conjunction with a mission profile for future flights to predict what sort of damage would be expected on that specific mission, based on the current state of the system. A condition-based model can also produce the same information. Each of these models will have known uncertainties, both in terms of overall prediction accuracy and specific areas of the input domain known to be especially hard to predict.

Prognostic models can combine these results using simple weighted averaging or adaptive neuro-fuzzy inference systems, but combining the uncertainties is just as important. Dempster-Shafer regression provides one method to do this [58]. In essence, the data is considered as a phase space: each sensor, metric, or other feature is one dimension, and x is a vector which represents a new instance which is to be classified. The class itself is y. To perform this classification, each piece of training data of the form (xi, yi) is considered; the closer that a given xi is to the x being classified, the more its prediction yi is believed.
These predictions are then combined, based on their respective belief values, and the results (along with their belief values) are returned. Thus, multiple modalities can be integrated while incorporating their individual uncertainties to develop the final result.

An alternate strategy is the agent-based approach discussed in Tang, et al. [73]. This approach incorporates a number of separate databases and agents, covering both experience- and model-based information. In this system, the databases for the case library and the expert-written failure library are used by independent agents to generate predictions. In addition, a separate reinforcement learning agent scours the case library and attempts to discover new failure modes, which it can then add to the failure library. This allows the system to work even with minimal starting information. Over time, the case library will grow as more data is collected, and this growing case library is used to expand the failure library. Reinforcement learning can also be employed within the case-based agent to improve its predictions. Once both agents have come to their conclusions, they are combined using an additional agent, the knowledge fusion agent. While the fusion agent is not discussed extensively in Tang, et al. [73], it could employ techniques similar to those discussed earlier. Even without fusion at this stage, the use of the case library to improve the failure library allows for the best of both types of learning.

For our premiere application, ocean turbines, we have a great deal of information which can be employed to create an expert-driven model. The parts of the system are well understood in isolation: gears, bearings, and shafts have known failure modes, rated capacities, and lifespans. Collectively, these possible failure states have been assembled into a database, according to ISO-13374, the ISO standard for machine condition monitoring. This standard describes a multilayer architecture, with each layer describing progressively more complex types of MCM/PHM, which can be used on progressively larger pieces of the system.
Thus, it provides an outline for model-based MCM/PHM: at each level, known interactions of problems on the level below are integrated into a single state. These models can be combined with the data-driven and hybrid approaches discussed above to result in more accurate predictions.
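The distance-weighted belief idea behind Dempster-Shafer regression can be sketched as follows. This is a deliberately simplified illustration (a kernel-weighted average that also reports a total belief), not the full Dempster-Shafer machinery of [58]; the function name, kernel choice, and bandwidth parameter are all our own assumptions.

```python
import math

def belief_weighted_prediction(x, training_data, bandwidth=1.0):
    """Fuse the predictions y_i of training pairs (x_i, y_i): each pair
    contributes its y_i with a belief that decays with the distance from
    x_i to the query point x (here via a Gaussian kernel).  Returns the
    fused prediction and the total belief behind it."""
    total_belief, weighted_sum = 0.0, 0.0
    for xi, yi in training_data:
        d = math.dist(x, xi)                     # Euclidean distance
        belief = math.exp(-(d / bandwidth) ** 2)  # closer => more believed
        total_belief += belief
        weighted_sum += belief * yi
    return weighted_sum / total_belief, total_belief
```

A query near one training point essentially inherits that point's prediction, while distant points contribute almost nothing; the returned total belief gives a crude measure of how well-supported the fused prediction is.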

2.1.3 Applying MCM/PHM to machine maintenance

As outlined earlier, MCM/PHM has been applied to a wide variety of fields. Not all of these applications are directly relevant to the specific task of predicting faults for the purpose of maintaining machines before they break. Fortunately, this is a particularly well-studied subset of the field; much research has been conducted examining how to optimize repair schedules. Some examples follow.

A straightforward method is discussed in Bey-Temsamani, et al. [5]. Here, the assumption is that a single organization is handling MCM/PHM and maintenance planning for a large, distributed network of systems, all of which are sending data back to a headquarters. After the data from the various systems are integrated into a single database which can be mined, features are selected in a two-phase process which first removes irrelevant features and then builds an entropy-based decision tree (which, since it iteratively selects those features which produce the best reduction of entropy, contains embedded feature selection). Though this decision tree classifies various instances as “in need of preventative maintenance,” “in need of corrective maintenance,” or “not in need of any maintenance,” this tree is not used directly on live systems to decide their fates. Instead, the most significant feature is used alone, and a threshold is established which indicates that maintenance is essential. Based on historic data as well as a weighted mean slopes model (a type of generalized linear regression) for analyzing live data, the prognostic unit predicts when that feature will cross the threshold, and reports this as the time left until maintenance must be performed. Though the use of a single feature is extremely limiting, the idea of setting a threshold and using historic data to determine how long until it is surpassed has many potential applications.
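The threshold-crossing idea can be sketched with ordinary least-squares regression standing in for the weighted mean slopes model (the function below is our own illustrative substitute, not the method of [5]):

```python
def time_to_threshold(times, values, threshold):
    """Fit a least-squares line to a monitored feature's history and
    extrapolate when it will cross a maintenance threshold.  Returns the
    predicted crossing time, or None if the feature is not trending
    toward the threshold."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # degradation indicator is flat or improving
    return (threshold - intercept) / slope
```

The returned value is the "time left until maintenance" figure described above; a fielded system would refit the line as each new observation streams in.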

A more unusual approach to taking maintenance into account when planning an MCM/PHM system is found in Muller, et al. [53]. This paper views maintenance as one part of an overall prognostic system, with the ultimate goal of producing a probabilistic behavior model of the system. This model combines a physics-based understanding of how the system works as a whole with data-derived decay information on the system’s parts: events (data from a live system) are incorporated as predictors of the decay state of various components. In this framework, maintenance can be added as a type of event: rather than observing that a given part is failing, the part is actively being fixed. To determine the value of maintenance, one simply runs the model with and without maintenance; if the predicted benefits outweigh the costs, it is worth doing.

It is especially difficult to plan maintenance for remote ocean systems. Land-based systems are often relatively easy to repair in place: either they are located in a factory which has staff available to perform the maintenance, or they are large fixed structures which have appropriate access systems to allow workers to open up and perform maintenance on the system. Even mobile units such as aircraft and vehicles (and unmanned versions thereof) can be driven or flown into a garage or repair facility. The same applies to manned aquatic systems: boats and submarines can sometimes be repaired by their crew, and even if the damaged section is outside the vessel, it can at least be examined by an expert before the entire system is hauled back to shore. This is not the case with remote ocean systems: repair missions are all-or-nothing, requiring a full expedition and recovery trip just to see whether a problem exists in the first place. This makes predicting needed maintenance especially important for such systems: there is a much lower margin of error, since mistakes in either direction will lead to significant costs.

2.2 VIBRATION ANALYSIS

Vibration signals are an invaluable source of information for MCM and PHM. Many machines have rotating or reciprocating parts, and as these parts pass through their ranges of motion, they subtly affect the position of other parts and the machine at large. This is especially noticeable for parts which are heavy, moving quickly, or both; such parts can cause the whole machine to move in counterpoint to the parts. This gross movement, which can be detected as vibration signals, contains useful information about the state of each of the moving parts. Since different forms of failure will affect this motion in different ways, the vibration signal can provide hints about which conditions currently affect the machine.

However, as there are often many distinct moving parts, each with its own patterns, care must be taken in analyzing vibration signals. Each part may oscillate at a specific frequency, with changes in that frequency’s amplitude reflecting that part’s health. Alternately, multiple frequencies may be associated with a given part, and some parts may have overlapping frequencies. To fully analyze vibrations, the time-series waveform must be broken down into its component frequencies, and then these frequencies may be examined separately to find the changes which indicate part deterioration. Different approaches have been developed to perform this task. The following sections discuss these approaches, including Fourier transforms (Section 2.2.1) and wavelet transforms (Section 2.2.2); works which compare these approaches to each other (Section 2.2.3); and additional issues pertaining to streaming data (Section 2.2.4).

2.2.1 Fourier Transforms

The traditional technique for transforming waves outside the context of MCM/PHM is the Fourier transform. When looking at a wave generated by a stationary (non-changing) process, the Fourier transform works by finding a collection of sinusoidal functions which, when weighted appropriately, can be summed together to reconstruct the original signal. The weight given to each of the individual sinusoids shows how much of the original wave can be explained by oscillations of that sinusoid’s frequency. Together, the function which maps these frequencies to the weights associated with the sinusoids for each frequency is the Fourier transform of the signal.

One major downside of using Fourier transforms for processing vibration data is that the signal must be time-invariant, meaning that there must be no changes in the prevalence of different frequencies throughout the signal. This assumption is invalid for non-stationary signals like those found in the MCM/PHM domain: rather than expecting that the relative importance of different frequencies will remain constant throughout a machine’s life, it is precisely changes in these relative importances which can indicate changes in machine state and which MCM/PHM systems look for.

To get around these problems, the short-time Fourier transform (STFT) was developed. This is similar to the regular Fourier transform, but rather than evaluating the signal for all time, the transform is performed on a succession of time windows. This is accomplished by multiplying the input signal with either a square windowing function (one which is defined as 1 for a short span and 0 elsewhere) or a smoother function, such as a Hann or Hamming window [7]. In either case, the result is that only the parts of the signal where the windowing function is non-zero are used. This window is then slid forward, allowing a different span of the signal to be examined by the transform. In the end, these results are collected together into a spectrogram, a time-frequency-amplitude plot where each choice of time (the specific point the window was slid to) and frequency gives a value for the magnitude of that frequency’s importance at the given time.
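A minimal spectrogram along these lines can be sketched in a few lines of NumPy; the window length, hop size, and choice of a Hann window are illustrative, not the parameters used elsewhere in this dissertation.

```python
import numpy as np

def stft_spectrogram(signal, window_size, hop):
    """Sketch of a short-time Fourier transform: slide a Hann window
    along the signal and take the FFT magnitude of each windowed
    segment.  Rows of the result are time steps, columns are frequency
    bins (spaced sample_rate / window_size apart)."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = signal[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(segment)))
    return np.array(frames)

# A steady 50 Hz tone sampled at 1 kHz: with a 200-sample window the
# bins are 5 Hz wide, so energy should concentrate in bin 10.
fs = 1000
t = np.arange(fs) / fs
spec = stft_spectrogram(np.sin(2 * np.pi * 50 * t), window_size=200, hop=100)
```

For a non-stationary signal, the peak bin would drift from row to row, which is exactly the time-frequency information the block Fourier transform cannot provide.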

2.2.2 Wavelet Transforms

Although the STFT is simple to implement and understand, many object to using it for vibration analysis [21, 30, 32, 34, 39, 54, 56, 71, 74, 95, 96, 97, 101]. A common complaint concerns its use of a fixed-length time window. Choosing the length of this window fixes two properties of the resulting transform: the maximum wavelength which can be detected by the transform and the minimum temporal resolution which the transform will have. The wavelength limit is due to the simple fact that for waves longer than the window size, the transform performed on the window will not have enough information to detect even a single iteration of that frequency’s waveform; thus, these longer waves will be missed. At the other end of the frequency range (that is, for shorter wavelengths), it can be difficult to determine when, exactly, these smaller waves are found. Given a single window, all that can be said is how much that frequency contributes to the timespan considered by that window, and not whether that frequency was found more during the beginning, end, or middle of the span. By considering a succession of time spans, one can begin to guess where that frequency gained and lost importance, but this is an inherently imprecise process. Any attempt to make temporal resolution crisper will necessarily make longer wavelengths that much harder to detect. In the end, a proper balance must be found between detecting long wavelengths and knowing when the shorter wavelengths are found.

Due to these disadvantages, many works advise against using Fourier transforms, and instead suggest an alternate approach referred to as wavelet transformation.

Wavelet transforms are based on a special type of function called a “wavelet,” or little wave, so called because they resemble waves but only have non-zero amplitude for a small part of their domain. In this way they share some similarities with the windowing functions used in STFT, but here the resemblance ends. To perform wavelet transformation, a specific function with this property must be chosen, and this “mother wavelet” will then be dilated (stretched) and translated (slid) to produce a family of related functions called “child wavelets.” Each of these children is (separately) multiplied with the original signal, to show how much of that signal can be represented by that child wavelet alone, and in turn by oscillation at the frequency and time represented by the child wavelet’s dilation and translation factors.

Many works have examined the use of the wavelet transform for vibration analysis. These range from extracting features from washing machine vibrations [30], to mechanical fault diagnosis [49], to detecting faults in rotating gears [4, 41], beams and plates [66], ball and rolling element bearings [64, 74, 92], and gearboxes [50]. Although these papers employ a number of different approaches to applying the wavelet transforms to their domains and to processing the transform to determine the health of the machine being studied, the breadth of the literature demonstrates the usefulness of this technique.

Some have moved beyond simple wavelet transforms to employ the more advanced approach known as wavelet packet decomposition (WPD) [14]. This is a more complicated technique that increases high-frequency resolution and precision, necessary for some particularly recalcitrant applications. WPD techniques have been used to detect faults in rolling element bearings [39, 21], large-scale civil infrastructure [71], turbine generator sets [48], rotating machinery [51], and steel beams [34].
Again, there is variation within these studies regarding the details of their approaches (e.g., depth of the transform, choice of mother wavelet, feature extraction, hybrid approaches combining WPD with other techniques, etc.), but the breadth of research supports the use of this technique for the purposes of MCM/PHM on vibration signals.
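The structural difference between an ordinary wavelet transform and WPD can be sketched with the unnormalized Haar wavelet, the simplest mother wavelet. This is a didactic sketch only: practical implementations use normalized filters and longer wavelets such as the Daubechies family, and assume appropriate signal padding.

```python
def haar_step(signal):
    """One level of an (unnormalized) Haar transform: split a signal of
    even length into approximation (pairwise averages, low-frequency)
    and detail (pairwise differences, high-frequency) halves."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail

def wavelet_transform(signal, depth):
    """Ordinary DWT: recurse only on the approximation branch, giving
    fine time resolution at high frequencies but coarse frequency
    resolution there."""
    bands = []
    for _ in range(depth):
        signal, detail = haar_step(signal)
        bands.append(detail)
    bands.append(signal)  # final approximation band
    return bands

def wavelet_packet(signal, depth):
    """WPD: recurse on BOTH branches, giving uniform frequency
    resolution at the cost of 2^depth leaf bands."""
    nodes = [signal]
    for _ in range(depth):
        nodes = [half for node in nodes for half in haar_step(node)]
    return nodes
```

Running both on the same signal makes the trade-off concrete: the DWT of depth d yields d + 1 bands, while WPD yields 2^d equally wide bands, which is why WPD offers the extra high-frequency resolution mentioned above.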

2.2.3 Comparing Fourier and Wavelet Approaches

Although many works employing wavelet transforms briefly discuss the advantages of wavelets over the Fourier transform, few of these papers directly address how well Fourier analysis fares in this domain [83, 89]. Despite the conjectures which favor wavelet transforms over Fourier transforms, the only direct empirical comparison between the classification performance of vibration data transformed using these two techniques known to the authors is work performed by the authors. However, some preexisting work does compare these two techniques in less formal ways.

Pan, et al. [55] apply windowed Fourier transforms, two forms of wavelet transform (using the Morlet and Daubechies wavelets), and smoothed Wigner-Ville distributions to examine non-stationary behavior in a brake system. Vibration sensors monitor the system as it is started and then stopped, and these vibration signals are transformed using the various techniques. The largest frequency (within a specific frequency window) found at each time is then used as that time’s output value. The authors compare how these largest-frequency-versus-time plots differ for the different transforms, observing that they give different curves for different forms of braking. However, they do not quantify this observation or comment on how the different techniques compare to each other. Thus, although this paper introduces the concept of comparing Fourier and wavelet techniques when analyzing vibration signals, the analysis is not sufficient to draw any conclusions.

Rehorn, et al. [61] experiment with a technique known as selective regional correlation (SRC) on vibration sensors from three points on a machining tool. SRC first performs a time-frequency transform (such as an STFT or wavelet transform) on vibration signals representing known states. Then, a windowing function is applied to eliminate all but a certain range of times and frequencies from the transform’s output. Finally, an inverse transform is performed to get a modified signal which has been cleaned to only hold the parts of the original signal which relate to the chosen times and frequencies. These transformed-windowed-inverse-transformed signals are the templates used to describe each state. To evaluate new data, this same transform-window-inverse-transform process is performed, and the cross-correlations of the resulting modified signals and the templates for each state are used to determine the state most similar to the new data. The authors use three transform types as part of this technique (short-time Fourier transforms (STFT), continuous wavelet transforms (CWT), and S-Transforms), and find that S-Transforms work best for two of three sensor positions (with STFT coming in second and CWT coming last) and CWT works best for the third sensor position (with the S-Transform second and STFT last for this sensor). Although this paper does compare STFT with CWT, its focus on the SRC method and ambiguous results when comparing STFT and CWT make it difficult to determine which technique is best suited for the MCM/PHM domain in general.

Gaberson [26] discusses the application of various time-frequency transforms for the purpose of vibration analysis. The examined techniques include short-time Fourier transforms, the Wigner distribution, multiple reduced interference distributions (including the Choi-Williams distribution), and wavelet analysis with both the Morlet and Daubechies 4-tap wavelets. Each of these techniques is tested on vibration data from an intake valve cap on an air compressor. The author examines the plots produced by these transforms and discusses how effective the different techniques are at finding the important time and frequency components of the signal, concluding that the Choi-Williams distribution has the greatest precision (but that all techniques other than Morlet wavelets have their virtues). However, this conclusion was drawn through ad hoc interpretation of the plots; no statistical comparison was performed.

Of the reviewed literature, the work which best compares short-time Fourier transforms and wavelet transforms (here, using the Shannon wavelet) for vibration analysis in the context of MCM/PHM is Li, et al. [48]. Here, a steam turbine is tested under steady state and during start-up, with different loads. For the steady-state experiments the short-time Fourier transform is not needed, but when examining the changes in the system over time, techniques for non-stationary signals are required. Here, it was found that wavelets were able to provide clearer information about the signal; this was evident in the fact that more distinct bands were seen in their respective contour plots. Unfortunately, this conclusion is not quantified beyond visual inspection of the plots, and no automated classification is performed to examine how important this distinction is to actual MCM/PHM practice.

Although these papers do examine both Fourier and wavelet transforms, they each use different analysis techniques, either finding the similarity between transformed signals and templates for each state or visually examining the plots produced by the transforms. What sets the present work apart from previous work is our comparison of both the Fourier and wavelet transforms for creating features for downstream classification of an experimental data set. We find that STFT should not be dismissed as readily as it is in much of the literature.

2.2.4 Streaming Data

One element not often considered in the literature is the use of wavelet approaches adapted to real-time monitoring of live systems. The above algorithms generally presume that the data is presented to the user in one large block; thus, the algorithms do not properly handle streaming data. While for many applications it is satisfactory to apply transforms to complete blocks of data and process the wave after all the data has been received, this does not suffice for monitoring active systems in situ. For these applications, it is not sufficient to take chunks of data and process them separately using a sliding window, as in Kong, et al. [46]. Doing so would require choosing an appropriate window size: this results in a trade-off between choosing larger windows to see longer wavelengths and smaller windows to see changes more quickly, as is found with STFT. This may at times be sufficient, but it is important to consider all alternatives.

Some attempts have been made to modify wavelet transforms to make them function in a streaming fashion: by applying the continuous wavelet transform to streaming data [31], by applying the discrete wavelet transform to out-of-order streams of data [27], by performing in-place wavelet processing [43], or by updating the complete scalogram at each time instance [62]. The newly-proposed and modified algorithms presented in this dissertation (the Short-Time Wavelet Transform, Streaming Wavelet Transform, Short-Time Wavelet Packet Decomposition, and Streaming Wavelet Packet Decomposition) all address the problem of streaming data in different ways, and the results are compared for each.
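The flavor of streaming transformation can be illustrated with a single Haar level that emits a coefficient pair as soon as enough samples have arrived. This toy class is our own illustration of the general idea, not the Streaming Wavelet Transform proposed in this dissertation.

```python
class StreamingHaar:
    """One level of an (unnormalized) Haar transform over a data
    stream: samples arrive one at a time, and an (approximation,
    detail) pair is emitted as soon as two samples have been buffered.
    Only one sample is ever held in memory, rather than a whole block."""

    def __init__(self):
        self._pending = None

    def push(self, sample):
        """Accept one sample; return a coefficient pair or None."""
        if self._pending is None:
            self._pending = sample
            return None
        a, b = self._pending, sample
        self._pending = None
        return (a + b) / 2, (a - b) / 2  # (approximation, detail)
```

Deeper decompositions can be built by feeding each emitted approximation coefficient into another instance of the same class, one per level, so coefficients at every depth appear with bounded delay instead of waiting for a complete block.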

Chapter 3

Methodology

A number of different methodologies were employed through the course of this research. Although these are standard techniques, we review them here to provide a foundation for the work performed later in this dissertation.

3.1 CLASSIFICATION

In many application domains, too much data is generated to be easily processed by humans. Although patterns exist, they require connecting vast quantities of data together in ways which would take years to be discovered manually. The field of data mining encompasses a wide range of techniques for extracting information from large datasets [98]. Many of these techniques are predicated on the concept that a dataset is a collection of instances, each representing one example of the type of data being analyzed. For studying ocean turbines, instances could be collections of vibration signatures extracted from a given time span (recall that these cannot be instantaneous, because this would not permit any vibrations to transpire). Each instance may be considered as a set of (attribute, value) pairs; all instances will have the same number of pairs representing the same attributes, but their values will differ.

The specific goals of data mining depend on the nature of these attributes. Oftentimes, one or more of these attributes will be the “class,” or dependent attributes; these are the most important attributes, and are used to understand each instance. For this study, the class attribute will represent the state of the machine: typically, one state is designated the “normal” state, and the remaining states are considered to be “abnormal.” The remaining attributes are referred to as features or independent attributes. These are pieces of information collected from each instance, not necessarily useful on their own but valuable for what they tell about the class attributes. For vibration data, these features are the vibration signatures from each frequency.

Because there are a fixed number of states (classes) considered in each case study, the primary goal of our models is classification: building a model to assign each instance to one of the possible class values. This task is particularly useful for providing class labels to future, unlabeled instances. Furthermore, to reduce the complexity of our models, we consider only “binary classification,” where models are trained to distinguish between only two classes. In all cases, the “normal” class is used as one of these, and one of the “abnormal” classes is the other. A number of different machine learning models are used for different case studies, and these are presented in greater detail below.

All machine learning models in this study were built using the Weka machine learning toolkit, and more information regarding the algorithms defined in this toolkit may be found in the accompanying textbook [98].

3.1.1 C4.5

One simple form of classifier is the decision tree. This is a tree-shaped flow chart which classifies instances by comparing the values of certain features with thresholds identified during the training of the classification algorithm (such as the C4.5 algorithm [60]). When using the tree to evaluate a new, unknown instance, these features and thresholds are considered until the tree has enough information to make a prediction. For example, the root of the tree might say “If feature f has a value less than x, go to the left subtree; otherwise, go to the right subtree.” The left and right subtrees have internal nodes with similar structures, each using one feature (which may or may not be the same as the feature chosen by other nodes in the tree) to divide instances into two or more new subtrees (if, for example, the feature used to split the tree has three distinct values). Eventually the subtrees end at leaf nodes, which say “If you have reached this node, predict that you are in class C.” Decision trees are especially useful for classification because it is easy to see how the classifier is working: if the user wants to know why a given instance was assigned to a given class, they can trace that instance’s path through the tree and see how its attributes led to this classification.

The C4.5 algorithm in particular is designed to use Information Gain (described in more detail in Section 3.2) to select a feature as the splitting feature at each node. Conceptually, the algorithm considers all the possible features and how they would split the tree into subtrees, and selects the feature which creates the purest subtrees (those with the greatest concentrations of a single class). Also, a C4.5 tree does not necessarily proceed until every subtree is 100% pure (i.e., contains only instances from a single class): pruning may be applied in order to build a smaller tree which will reduce the problem of overfitting¹. All models in this study are built using Weka’s default parameter values, unless the so-called “normalized” values are used, which entail using an unpruned tree with Laplace smoothing.

¹ Overfitting occurs when a model is built to match the exact noise and randomness in the data rather than matching the underlying information.
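The split-selection criterion described above can be sketched as follows; this is a generic entropy and information-gain computation, not Weka's J48 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at
    `threshold`: the quantity a C4.5-style learner maximizes when
    choosing which feature (and cut point) to split on."""
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```

A split that produces two pure subtrees realizes the full entropy of the parent node as gain, while a split that leaves both subtrees as mixed as the parent gains nothing.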

3.1.2 Random Forest

While a single decision tree makes an effective classifier, for imbalanced data, where one class has many more instances than the other class, ensemble classifiers often significantly improve performance [68]. An ensemble classifier employs a family of base classifiers which are designed to be reasonably accurate while also varying as much as possible from one another. By using this family and classifying an instance based on its majority opinion (rather than the opinion of only one classifier), there is much less risk that the classifier will simply put all instances into the majority class to maximize its accuracy. Imbalanced data is important to our domain because while we can design our experimental conditions to control the proportions of class membership, once the turbine is in the water a disproportionately large number of instances will reflect normal operation, while a tiny minority will be abnormal.

One type of ensemble classifier based on decision trees is the Random Forest (RF) classifier [10, 23, 44]. The RF classifier creates variation within its family of decision trees using two methods: first, each tree in the ensemble is not trained on the full set of instances from the data set; rather, each tree is trained on an independently-bootstrapped copy of the data. Secondly, when training each tree, only a limited subset of the features is considered for each node, rather than allowing the C4.5 training algorithm to select from all of them. These two approaches ensure that the members of the ensemble do not end up too similar to one another.

One parameter which must be considered when designing an ensemble classifier is the number of base classifiers to put into the ensemble. In this study, we used 100 trees to make up our ensemble unless noted otherwise. This value was found to work especially well in our experiments, but case studies based on other data might find different optimal values. In addition, the choice of RF was likewise made based on our data, and must be considered before applying these results to other work.
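The two sources of randomness in a Random Forest, plus the majority vote that combines the ensemble, can be sketched with a few helper functions. These are illustrative stand-ins, not Weka's implementation.

```python
import random
from collections import Counter

def bootstrap(dataset, rng):
    """Bagging: sample len(dataset) instances WITH replacement, giving
    each tree its own independently-drawn copy of the training data."""
    return [rng.choice(dataset) for _ in dataset]

def random_feature_subset(n_features, k, rng):
    """The limited subset of features a single forest node is allowed
    to consider when choosing its split."""
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Combine the ensemble members' class predictions for one
    instance by taking the most common vote."""
    return Counter(predictions).most_common(1)[0][0]
```

Because each tree sees a different bootstrap sample and a different feature subset at each node, the trees disagree on individual instances, and the vote smooths out their individual errors.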

3.1.3 Naïve Bayes

Naïve Bayes [98] is a simple Bayesian classifier which seeks to determine the probability of a class knowing only the values of its features; this is written as p(C | F_1, F_2, F_3, ..., F_n), where C is the class being tested and F_1, F_2, F_3, ..., F_n are the various features which are being used to predict the class. To get this using the known feature distribution within the training data, Bayes’s theorem must be employed:

p(C \mid F_1, F_2, F_3, \ldots, F_n) = \frac{p(C)\, p(F_1, F_2, F_3, \ldots, F_n \mid C)}{p(F_1, F_2, F_3, \ldots, F_n)} \qquad (3.1)

Even with this, the p(F_1, F_2, F_3, ..., F_n | C) term is difficult to compute formally. One common simplification is to assume conditional independence of the features, that is, that the values of one feature have no effect on the values of other features. With this assumption, Equation 3.1 may be rewritten as:

p(C \mid F_1, F_2, F_3, \ldots, F_n) = \frac{p(C) \prod_{i=1}^{n} p(F_i \mid C)}{\prod_{i=1}^{n} p(F_i)} \qquad (3.2)

Each p(F_i | C) term is simply the fraction of the given class’s instances which have the specific feature value of interest, and may be easily computed from the training data. Also, because the denominator is not a function of the class being examined, its value is effectively constant, giving the probability of membership in a given class as proportional to p(C) \prod_{i=1}^{n} p(F_i \mid C). Although this equation requires that all features be independent of one another, prior research has shown that Naïve Bayes is an effective classifier even when this requirement is not met [47].
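This computation can be sketched directly from raw counts over the training data. The sketch below omits Laplace smoothing (so unseen feature values zero out a class), and the feature values in the usage example are invented.

```python
from collections import Counter

def naive_bayes_score(instance, train):
    """Score each class as p(C) * prod_i p(F_i | C), with both factors
    estimated by simple frequency counts over `train`, a list of
    (feature_tuple, label) pairs.  The class with the highest score is
    the prediction."""
    class_counts = Counter(label for _, label in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)  # prior p(C)
        for i, value in enumerate(instance):
            matches = sum(1 for feats, label in train
                          if label == c and feats[i] == value)
            score *= matches / n_c  # likelihood p(F_i | C)
        scores[c] = score
    return scores
```

Since only the relative order of the scores matters for classification, the constant denominator of Equation 3.2 is simply dropped, exactly as described above.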

3.1.4 k-Nearest Neighbor

k-Nearest Neighbor (k-NN) [94] is an instance-based learner which finds the k instances in the training set closest to the test instance. This distance is measured using the Euclidean metric, which finds the distance between two instances with respect to each feature separately, squares these separate distances, sums them together, and then finds the square root of the total. Once these nearest neighbors are found, they are each weighted by a factor of 1/distance (so closer neighbors have greater weight), and weighted voting is used to determine which class the test instance belongs to. The class whose instances have the largest collective weight is

chosen as the label for the test instance. When using Weka’s k-NN classifier, the weightByDistance parameter was set to 1, and k was set to 5 for most experiments (except when noted otherwise).
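The distance-weighted vote can be sketched as follows. This is an illustrative toy, not Weka's IBk implementation used in the study; the small epsilon guarding against a zero distance is an assumption of this sketch, and the one-feature example data is hypothetical.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=5):
    """train: list of (feature_vector, label) pairs. Euclidean distance,
    then a 1/distance-weighted vote among the k nearest neighbors."""
    neighbors = []
    for feats, label in train:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(feats, query)))
        neighbors.append((d, label))
    neighbors.sort(key=lambda pair: pair[0])
    weights = defaultdict(float)
    for d, label in neighbors[:k]:
        weights[label] += 1.0 / (d + 1e-12)  # epsilon guards against d == 0
    return max(weights, key=weights.get)

train = [([0.0], "normal"), ([0.1], "normal"), ([0.9], "fault"),
         ([1.0], "fault"), ([1.1], "fault")]
```

With `k=3`, a query at `[0.95]` is surrounded by "fault" neighbors and is labeled accordingly, while a query at `[0.05]` has two very close "normal" neighbors whose 1/distance weights dominate the vote.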

3.1.5 Logistic Regression

Logistic Regression [45] is a simple form of regression model commonly used to perform classification. As with other regression models, this approach finds the weighted sum of a given instance's features (with the correct weights having been determined during the training phase). Unlike a simple linear regression, however, the logistic function is applied to this sum prior to a threshold being used to determine the instance's class.
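The decision rule just described is compact enough to sketch directly; the weights and bias are assumed to have already been learned during training, and the numeric values used below are purely hypothetical.

```python
import math

def logistic_predict(weights, bias, feats, threshold=0.5):
    # Weighted sum of the instance's features, as in other regression models
    z = bias + sum(w * x for w, x in zip(weights, feats))
    # The logistic function squashes the sum into (0, 1)
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p >= threshold else 0
```

With hypothetical weights `[2.0]` and bias `-1.0`, an input of `[1.0]` gives z = 1 (probability about 0.73, class 1), while `[0.0]` gives z = -1 (probability about 0.27, class 0).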

3.1.6 Support Vector Machines

Support Vector Machines (SVM) [36] are built from the assumption that both classes can be linearly separated from each other. This assumption allows us to use a discriminant to split the instances into the two classes. A linear discriminant uses the

formula g(x|w, ω_0) = w^T x + ω_0. In the case of the linear discriminant, the only data that needs to be learned is the weight vector w and the bias ω_0. One aspect that must be addressed is that there can be multiple discriminants which correctly classify the two classes. SVM is a linear discriminant classifier which assumes that the best discriminant maximizes the margin between the two classes, measured as the distance from the discriminant to the nearest samples of each class. In Weka, the SVM classifier is implemented as SMO.

3.1.7 Multi-Layer Perceptron

Multi-Layer Perceptron [65] is a form of neural network, intended to model how decisions are made in biological systems. The basic component of a neural network is the artificial neuron, which takes multiple inputs, finds their weighted sum, applies an activation function (such as a sigmoid), and provides this as an output. In a multi-layer perceptron, these neurons are arranged into layers: one input layer which takes the feature values from the instances and passes them on to downstream layers (no sum-and-sigmoid is performed on this layer), one or more hidden layers which consist of neurons that receive inputs from every neuron on the previous layer (but no other neurons) and send their outputs to every neuron on the following layer, and one final output layer with one neuron for the class value (which receives its input from all the neurons on the previous layer). For our study, two parameters were varied from Weka’s defaults: we use a network with one hidden layer which contains three nodes, and we used backpropagation (with 10% held back for verifying when to stop backpropagation) to train the neuron weights.
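The forward pass through such a network (one hidden layer with three sigmoid nodes, matching the configuration used in the study) can be sketched as below. This shows inference only; the backpropagation training described above is omitted, and the weight values in the usage example are hypothetical rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer: each hidden neuron takes the weighted sum of all
    inputs and applies a sigmoid; the single output neuron does the same
    over the hidden activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden))
                   + output_bias)
```

For example, with all weights and biases zero every neuron outputs sigmoid(0) = 0.5, and any real-valued weights produce an output strictly between 0 and 1, which is then thresholded to yield a class label.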

3.1.8 RIPPER

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [13] is a rule-based classifier which generates its rule set based on the data. It operates by initializing an empty rule set, and then incrementally dividing the dataset into Train and Test folds (with 2/3 of the data used for training), building a single additional rule based on the Train fold and testing whether the new rule shows good performance on the Test fold. If so, the rule is retained and instances successfully identified by the rule are removed from the dataset (so that future rules will learn the remaining instances); otherwise, the rule is discarded and the algorithm iterates. This learning process stops when a newly-accepted rule causes too great an increase in the description length of the complete rule set. Following the above iterative process, an optimization step considers replacing each of the rules in turn to improve performance, and then the whole iterative process is repeated, using the generated rule set as the starting point.

3.1.9 Radial Basis Function Neural Networks

Radial basis function neural networks [6] are similar to multilayer perceptrons with one hidden layer, but use radial basis functions within each hidden node rather than the sigmoid of a weighted sum. Radial basis functions operate by taking the collection of input values as a vector and finding the Euclidean distance from this vector to a so-called “center” vector, the value of which is learned during the model-building process. This distance is then passed through a Gaussian function (with an internal weighting term) and multiplied by a final weighting term. Both weighting terms are also discovered during the model-building process. The final output node of the network (on the output layer) uses simple linear weights to combine the radial basis function nodes from the hidden layer.

3.2 FEATURE SELECTION

Some of the experiments in this study employ feature selection, a group of techniques designed to alleviate the problems inherent in datasets which have too many features. These extraneous features can be either redundant (providing information already available in other features) or irrelevant (providing no useful information whatsoever about the class values), and in either case will only harm learning. More features will result in a longer time to generate models and can sometimes produce worse results. There are three main types of feature selection: filter-based, wrapper-based, and embedded. Filter-based feature selection methods apply different statistical tests to the dataset to determine which features are relevant, while wrapper-based methods (unlike filter-based methods) build classifiers to determine which features are most useful. Embedded methods combine feature selection with the classification process itself. The feature selection techniques in this work come from the family of filter-based feature selection, specifically filter-based feature rankers. These techniques use statistical approaches to give each feature a score, and then rank all features based on this score. The alternative, filter-based subset evaluation, gives grades to entire subsets, but due to the additional complexity in searching the possible feature subsets (and in calculating the grades), we only consider filter-based feature ranking in this work.

3.2.1 Chi Squared

The Chi Squared (CS) feature evaluator [24] is based on the assumption that if the feature value² is not correlated with the class, it will be randomly assigned. The χ² distribution can be used to determine the probability of the extant distribution given

² For continuous or floating-point values, discretization is required.

this assumption of random assignment. In particular, this is found with the following equation:

χ² = Σ_{i=1}^{N_F} Σ_{j=1}^{N_C} (O_ij − E_ij)² / E_ij

where N_F is the number of feature values, N_C is the number of class values, O_ij is the observed number of instances in class j with feature value i, and E_ij is the expected number of instances in class j with feature value i, if the distribution were random. The greater this sum, the less likely it is that the feature value was randomly assigned, and thus the more relevant the feature is to the class.
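The score can be illustrated with a short sketch for a single (already discretized) feature; this is a toy illustration, not the Weka evaluator used in the study.

```python
from collections import Counter

def chi_squared(instances):
    """instances: list of (feature_value, class_value) pairs for one
    discretized feature. Sums (O - E)^2 / E over every cell of the
    feature-value x class contingency table."""
    n = len(instances)
    feat_counts = Counter(f for f, _ in instances)
    class_counts = Counter(c for _, c in instances)
    observed = Counter(instances)
    score = 0.0
    for f, nf in feat_counts.items():
        for c, nc in class_counts.items():
            expected = nf * nc / n  # E_ij under random assignment
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score
```

A feature perfectly aligned with the class (every "a" positive, every "b" negative) scores 4.0 on four instances, while a feature whose values are split evenly across both classes scores 0, reflecting random assignment.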

3.2.2 Information Gain

A frequently-employed feature evaluator is the information gain metric [100, 78]. This is the application of a more general technique, the measurement of informational entropy, to the problem of deciding how important a given feature is. Informational entropy, when measured using Shannon entropy, is notionally the question of how many bits of data it would take to encode a given piece of information [100]. The more space a piece of information takes to encode, the more entropy it has. Intuitively, this makes sense because a random string has maximum entropy and cannot be compressed, while a highly-patterned string can be written with a brief description of the string's information. In the context of classification, the distribution of instances among classes is the information in question. If the instances are randomly assigned among the classes, the number of bits necessary to encode this class distribution is high, because each instance would need to be enumerated. On the other hand, if all the instances are in a single class, the entropy would be lower, because the bit-string would simply say

“All instances save for these few are in the first class.” Thus, to find a numeric metric for entropy, we want a value which increases when the class distribution gets more spread out, and which can be applied recursively to permit finding the entropy of subsets of the data. The following formula satisfies both of these requirements:

e(a, b, c, ...) = −(a/n) log(a/n) − (b/n) log(b/n) − (c/n) log(c/n) − ...

In this formula, we consider a subset of the total data set. This subset has n total elements, with a members in the A class, b members in the B class, c members in the C class, etc. With this, e(a, b, c, ··· ) is the amount of entropy present in that subset. To find the information gain of a specific feature, one finds the entropy for each value of that feature; that is, one considers the subset of instances which contain a specified value for the feature in question and then finds the entropy for that subset. If the feature is a good description of the class, each value of that feature will have little entropy in its class distribution; for each value most of the instances should be primarily in one class. The weighted sum of the entropies of all the values of the feature is then subtracted from the total entropy of the entire data set; this represents the information that the feature provides to the data set.
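This calculation can be sketched directly from the two steps above (total entropy minus the weighted entropy of each feature value's subset); the sketch is illustrative, uses base-2 logarithms, and is not the Weka evaluator used in the study.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy of a class distribution, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(instances):
    """instances: list of (feature_value, class_label) pairs. Entropy of
    the whole class distribution minus the weighted entropy within each
    feature value's subset."""
    labels = [c for _, c in instances]
    groups = defaultdict(list)
    for f, c in instances:
        groups[f].append(c)
    n = len(instances)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that perfectly predicts a balanced binary class gains the full 1 bit of entropy, while a feature whose values are spread evenly over both classes gains nothing.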

3.2.3 Signal to Noise

Signal to Noise (S2N) is a metric originating from electrical engineering which is rarely used as a feature ranker. However, our research team has recently implemented it as a ranker and found that it is as effective as or more effective than many traditional techniques [93]. In the context of feature ranking, S2N is computed by grouping the instances based on the class values (of which there are assumed to be two, P and

N; S2N only works for binary classification). For the positive class, the mean µ_P

and standard deviation σ_P of the given feature's values are found; likewise, for the negative class µ_N and σ_N are found. S2N is then found using the following equation:

S2N = (µ_P − µ_N) / (σ_P + σ_N)

This equation favors features which have different means for the two classes (a large value in the numerator) while at the same time having minimal variation of their values within each class (a small value in the denominator). Such features are effective at partitioning the two classes.
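The per-feature computation is a few lines; this sketch is illustrative rather than the implementation from [93], and its use of the population standard deviation is an assumption, since the description above does not specify sample versus population.

```python
import statistics

def s2n(pos_values, neg_values):
    """One feature's values, grouped by the two class labels.
    Population standard deviation (pstdev) is an assumption of this
    sketch; the original description does not specify which variant."""
    mu_p = statistics.mean(pos_values)
    mu_n = statistics.mean(neg_values)
    sigma_p = statistics.pstdev(pos_values)
    sigma_n = statistics.pstdev(neg_values)
    return (mu_p - mu_n) / (sigma_p + sigma_n)

score = s2n([4.0, 6.0], [0.0, 2.0])  # well-separated, tight classes score high
```

A feature whose class means overlap more, with the same within-class spread, receives a smaller score and would therefore be ranked lower.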

3.3 PERFORMANCE EVALUATION

Although building models is naturally an important part of data mining and machine learning, it is also vital to properly evaluate the quality of these models. Different forms of performance evaluation can lead to different results. The following sections discuss the performance metrics used in this dissertation, and how models were built and tested given the data.

3.3.1 Performance Metrics

A number of different performance metrics may be used to evaluate the quality of a model, based on the number of instances correctly and incorrectly classified in the positive class (the class of interest, which is usually the minority class or abnormal state) and the negative class (which is usually the majority class or normal state). Five of the most widely used are the True Positive Rate (TPR, or true positives / all positives), the True Negative Rate (TNR, or true negatives / all negatives), the False Positive Rate (FPR, or false positives / all negatives), the False Negative Rate (FNR, or false negatives / all positives), and the Area Under the Receiver Operating Characteristic (ROC) Curve metric, or AUC. The ROC curve is built by plotting

the TPR on one axis and the FPR on the other, as the decision threshold (which trades off false positives against false negatives) is varied. To get a single value from this curve, the area underneath it is found. This is the AUC, and its value ranges from 0 to 1, with 1 being the best. For all of our studies, the abnormal class is considered to be the positive class. The TPR, TNR, FPR, and FNR values reported in the tables are based on a default decision threshold of 0.5. These performance metrics were chosen because TPR, TNR, FPR, and FNR give a good perspective on the different types of misclassification present in the learners when using the default decision threshold, and AUC gives a summary of the overall misclassification at all decision thresholds.
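The four threshold-specific rates follow directly from the confusion-matrix counts; a minimal sketch (illustrative only, with hypothetical counts in the usage example):

```python
def rates(tp, fn, fp, tn):
    """TPR, TNR, FPR, FNR from confusion-matrix counts at one
    decision threshold."""
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}

r = rates(tp=8, fn=2, fp=1, tn=9)  # hypothetical counts for 10 abnormal, 10 normal instances
```

Note the built-in complements: TPR + FNR = 1 and TNR + FPR = 1, which is why a classifier that trivially labels everything normal can show a perfect TNR while its TPR collapses to zero.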

3.3.2 Training and Test Datasets

For evaluating a model, two steps are needed: first, the model must be built, and then it must be tested. Because it is not appropriate to use the same dataset in both steps (since this encourages overfitting), we use two different paradigms for model evaluation. For many of our experiments, cross-validation was employed to evaluate the quality of the classifiers. The first step of cross-validation is to randomly divide the data set into N equal-sized pieces. For our case studies, the value of N was set to 5. After the dataset is so divided, one of the pieces is chosen to be the hold-out set. The classifier is then built using the remaining N − 1 pieces, and is tested against the hold-out set to determine its accuracy. This prevents the model from overfitting to the data, by not testing the model on the data used to build it. After the results are found for the chosen hold-out set, the procedure is repeated N − 1 more times, with each piece being used as the hold-out set in precisely one iteration. The results from these tests are averaged together to give the performance of the classification on the data set as a whole.
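The fold construction described above can be sketched as follows; this is an illustrative sketch of N-fold cross-validation with N = 5 (not the Weka evaluation harness used in the study), and the fixed seed is an assumption made for reproducibility.

```python
import random

def cross_validation_folds(instances, n_folds=5, seed=0):
    """Shuffle, split into n_folds roughly equal pieces, and yield a
    (train, holdout) pair with each piece serving as the holdout set
    exactly once."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        holdout = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, holdout
```

Each iteration builds the classifier on the N − 1 training pieces and evaluates on the holdout; the N results are then averaged, so no instance is ever used to both build and test the same model.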

In order to test the ability of models built under one environmental condition to perform in an entirely different environmental condition, when evaluating the baseline-differencing algorithm we build models with a given load level, RPM, and burst, and test these models on data from the same load and burst but for a different RPM (because RPM is considered to be the environmental condition). This is more challenging than simply building models and testing on the same RPM condition. In principle, it is expected that different environmental conditions will affect the nature of the vibration data, and that models built for one condition might fare poorly with another. It is here where the baseline-differencing approach becomes especially valuable: although the models are built on data from one condition and tested on data from another condition, all preprocessing is done with a baseline appropriate for each dataset's own environmental condition. Thus, the model should not be seeking out specifics about how changes in vibration can distinguish load in a given RPM condition; rather, the model should be finding the changes in load which are invariant across environmental conditions.

3.4 FOURIER TRANSFORMS

Although this dissertation primarily focuses on the use of wavelet transforms for vibration analysis of MCM/PHM systems, it is important to compare and contrast these to a more traditional form of time-series analysis, the Fourier transform. This will both help to highlight the properties of wavelet transforms which are not found in Fourier transforms, as well as provide a background for our work in directly comparing the results of vibration analysis using both Fourier and wavelet-based transforms. As noted, one of the oldest approaches for finding the oscillations in a waveform is a Fourier transform. This transforms a function from the time domain to the

frequency domain, such that every value of the input no longer answers the question, “what was the amplitude at this time?” but instead answers, “how much of the original wave could be described by an oscillation at this frequency?” In the general case,

a Fourier transform of an arbitrary function f(x) or a sequence of amplitudes xn is computed as follows, in both the continuous (3.3) and discrete (3.4) cases.

f̂(ξ) = ∫_{−∞}^{∞} f(x) e^{−2πixξ} dx   (3.3)

X_k = Σ_{n=0}^{N−1} x_n e^{−i2πkn/N},   k = 0, ..., N − 1   (3.4)

In these equations, f(x) is the continuous function being transformed, f̂(ξ) is the transformed function (with ξ representing the frequency of oscillation rather than distance from the origin), x_n is one of a discrete sequence of N data points, and X_k is the amount of oscillation found in the kth frequency band of that sequence³. While the discrete form is generally more useful on real-world data, it is still computationally intensive; a naive implementation will take O(N²) time. Because of this, Fast Fourier Transform (FFT) algorithms were developed. These compute the same values as the discrete Fourier transform, but only require O(N log N) time. These algorithms employ a variety of optimizations, such as divide-and-conquer as well as more esoteric formulae; the most efficient of these have been built into specialized hardware devices.
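The naive O(N²) approach is just a direct evaluation of Equation 3.4; a minimal sketch (illustrative only, with no attempt at the optimizations an FFT would apply):

```python
import cmath

def dft(x):
    """Direct evaluation of Equation 3.4: one sum of N terms for each of
    the N output coefficients, hence O(N^2) operations, which is what
    FFT algorithms reduce to O(N log N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

spectrum = dft([1.0, 1.0, 1.0, 1.0])  # constant signal: all energy in X_0
```

For the constant input above, X_0 sums the four samples while every other X_k cancels to zero, matching the intuition that a flat signal contains no oscillation.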

3.4.1 Short Time Fourier Transforms

One obvious problem with the above approaches is that they assume all the data is available at the time of computation; either f(x) is defined beforehand, or the entire set of xn’s is provided as an input to the algorithm. This does not work for many √ 3e, π, and i have their usual interpretations as Euler’s Number, the circle constant, and −1, respectively.

common domains, where data is continuously streamed from a sensor. Moreover, by only looking at the data in one large block, basic Fourier transforms do not help to find when the data changes. They will find oscillations, but will detect sudden changes neither in absolute amplitude nor in the oscillations.

input : Vector V indexed from 0 to max-time − 1 of the input signal (values from the accelerometer)
      : Parameter w, the size of the window to be used
      : Function window(n, w), which takes two parameters, n and w, where w is the total length of the window and n is the point within the window whose windowing coefficient is being found, and returns a single modified value as seen in Equations 3.5, 3.6, and 3.7
      : Parameter l, the time increment by which to slide the window
output: Two-subscript matrix M which contains max-time − w rows, each of which has w features representing the Fourier coefficients at that row's time

1 Prepare vectors Vin and Vout, each having w values ;

2 Initialize fft (Fast Fourier Transform) subroutine to use Vin and Vout as input and output vectors ;

3 for t ← 0 to max-time − w − 1 step t ← t + l do

4 for i ← 0 to w − 1 do

5 Vin[i] ← window(i, w)×V[t + i] ;

6 fft(Vin, Vout) ;

7 for i ← 0 to w − 1 do

8 M[t][i] ← Vout[i] ;

Algorithm 3.1: Short-Time Fourier Transform

To get around this problem, Short-time Fourier Transforms (STFT) are employed. Here, instead of performing the Fourier transform from −∞ to ∞ in the continuous case or across all input data in the discrete case, only a finite-time window is examined at once. Algorithm 3.1 demonstrates how this may be performed. The initial data stream is first chopped up into w-sized pieces, with the w parameter having been chosen by the operator beforehand. As part of this windowing process, a windowing

function window is multiplied by each element, with parameters i and w as defined in Algorithm 3.1. In the simplest case, this is a rectangular window (Equation 3.5), where all elements within the window are multiplied by 1 (and all elements outside

the window are implicitly multiplied by 0). More advanced windows include the Hann and Hamming windows, shown in Equation 3.6 and Equation 3.7, respectively:

window(i, w) = 1   (3.5)

window(i, w) = 0.5 (1 − cos(2πi / (w − 1)))   (3.6)

window(i, w) = 0.54 − 0.46 cos(2πi / (w − 1))   (3.7)

After windowing, the Fourier transform is performed. Note that, although the

X_k values from Equation 3.4 are complex numbers, and thus for N (e.g., max-time) values in the input there will be 2N distinct values in the output, in the case where all input values are real the nth value of the output will be the complex conjugate of the (N − n)th value. Thus, we need only look at the real and imaginary parts of the first N/2 frequencies, giving a total of N output values. This is why Vin and Vout have the same number of values: the fft algorithm used does not change the size or type of the vectors from the input to the output. The full fft algorithm is not presented here, as a separate Fast Fourier Transform implementation may be chosen (e.g., many hardware vendors supply their own custom FFT implementations, and both open-source and proprietary versions may be employed); the details of our chosen FFT implementation are discussed in Section 6.2.1. Following Fourier transformation, the output vectors are collected into a two-dimensional matrix M. This can be presented graphically as a frequency-time spectrogram, or waterfall, as shown in Figure 3.1. This waterfall shows both the frequencies found in the data at any given time and how these frequencies changed over time. STFTs are not without their downsides, however. In order to perform the transform, one must decide on the size of the window. This choice highlights an important

Figure 3.1: Example spectrogram. The input wave has oscillations of 10, 25, 50, and 100 Hz, increasing from one to the next every 5 seconds. (Credit: Wikimedia Foundation, Inc.)

trade-off that Fourier transform-based algorithms often have: frequency resolution versus temporal resolution. If a larger window is chosen, the resulting STFT will be able to resolve lower-frequency oscillations; it will be less precise, however, in determining when changes in oscillations occur. Conversely, a smaller window will give a better idea of when changes occurred in the data but will not be able to detect patterns which span longer than the window's duration. One solution to the problem of not knowing in advance what patterns will be most relevant to the data is to compute multiple STFTs with different window sizes.
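The windowed, sliding structure of Algorithm 3.1 can be sketched compactly. This is an illustrative sketch, not the implementation used in the study: the function names `stft` and `hann` are chosen here for clarity, and a direct DFT stands in for the fft subroutine the algorithm assumes.

```python
import cmath
import math

def hann(i, w):
    # Hann window, Equation 3.6
    return 0.5 * (1 - math.cos(2 * math.pi * i / (w - 1)))

def stft(signal, w, l):
    """Slide a w-sample window along the signal by l samples at a time;
    window each frame, then Fourier-transform it (a direct DFT stands in
    for the fft subroutine of Algorithm 3.1)."""
    rows = []
    for t in range(0, len(signal) - w + 1, l):
        frame = [hann(i, w) * signal[t + i] for i in range(w)]
        row = [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / w)
                   for n in range(w))
               for k in range(w)]
        rows.append(row)
    return rows
```

Each returned row corresponds to one time position of the window, so stacking the rows yields exactly the matrix M that is rendered as a spectrogram; a 16-sample signal with w = 8 and l = 4 produces three rows of eight coefficients each.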

Chapter 4

Wavelet Transforms

Wavelet transforms, a family of time-frequency-magnitude transforms, are an increasingly popular alternative to Fourier transforms [17, 28, 37, 38, 54, 56, 59, 72, 103]. These employ a “wavelet,” a small wave which only has non-zero amplitude for a small part of its domain. A generating function known as the “mother wavelet” and denoted by ψ(x) is chosen, and the goal is to modify this function by both dilation and translation (creating “child wavelets”) to find out how much of the original wave (the data being transformed) has signal corresponding to these dilations and translations. This chapter discusses both the basics of continuous and discrete wavelet transforms (in Sections 4.1 and 4.2, respectively) and wavelet packet decomposition (in Section 4.3), as well as the four main algorithms employed in this work: a modified Short-Time Wavelet Transform (Section 4.2.1), the Streaming Wavelet Transform (Section 4.2.2), the Short-Time Wavelet Packet Decomposition (Section 4.3.1), and the Streaming Wavelet Packet Decomposition (Section 4.3.2). Of these four, the first algorithm is a version of the traditional STWT that we tuned for the purposes of MCM/PHM, and the latter three were novel creations designed specifically for this application.

4.1 CONTINUOUS WAVELET TRANSFORMS

The so-called continuous wavelet transform is the most mathematically pure form of wavelet analysis, examining all scales with infinite resolution. Two key equations form the basis of this transform¹:

ψ_{a,b}(t) = (1/√a) ψ((t − b)/a)   (4.1)

X_w(a, b) = (1/√a) ∫_{−∞}^{∞} x(t) ψ*((t − b)/a) dt   (4.2)

Equation 4.1 demonstrates how the child wavelets are generated based on the mother wavelet, with a and b denoting the amount of dilation and translation, respectively, used to generate a given child wavelet. Equation 4.2 describes how to generate a scalogram of input wave x(t), which plots scale (a) on the y axis versus translation (b) on the x axis, in the continuous case. For each choice of scale and time point, the product of the original function and the complex conjugate (denoted by the asterisk) of the child wavelet is computed for all time. Since the child wavelet is zero outside of a small range, this “integral over all time” is actually only looking at the area immediately around the child wavelet and is thus bounded; everything else will be multiplied by zero and will not contribute to the integral in Equation 4.2. The magnitude of this integral will reflect how much overlap functions x(t) and ψ_{a,b}(t) have. For a given choice of a and b, X_w(a, b) will give how much the original wave x(t) is similar to the chosen wavelet ψ_{a,b} at that scale and point in time. When viewed over time, the scalogram tells us at each point b how much of the function resembles the mother wavelet at scale a. Observe that scalograms do not show oscillations. Instead, they gauge how well

¹ Further discussion of and mathematical justification for these equations may be found at http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html

Figure 4.1: Example scalogram. The x axis is time, and the y axis is frequency (from high frequency at the bottom to low frequency towards the top). The input wave starts with a low-frequency oscillation, then in the middle of the scalogram switches to a high-frequency one. (Credit: Wikimedia Foundation, Inc.)

the shape of the input wave x(t) fits child wavelet ψ_{a,b} at scale a. If the child wavelet aligns positively with the input (e.g., resembles it directly), this will result in a large integral, and the scalogram for that point will be especially large (exact values will depend on the nature of the data). On the other hand, if the mirror image of the child wavelet (flipped about the x axis) aligns to the input wave, this will cause the integral to take on a negative value, but one with a large absolute value. Looking at the scalogram, then, one would see alternating bands of high and low values (e.g., the white and black bands found in Figure 4.1) in those scale ranges which matched well with the original function. If there were no matches, then the child wavelet would tend to get integrals with values near 0 (gray values in the above figure), indicating that there is little periodic behavior at that scale. Because they fundamentally represent different things, spectrograms and scalograms must be interpreted using different analytic techniques. With a spectrogram, to find the most relevant frequencies at a given point in time, one simply looks for the frequencies with the largest values. This does not work in a scalogram. Depending on the nature of the data, extremely small (negative) values may imply a trough (and therefore, an oscillation) at a given point, while values closer to zero would mean no match. Also, while in a spectrogram sudden changes in value imply a change in the

strength of that frequency's oscillations, with a scalogram such changes are found within a single consistent oscillation. Thus, naïvely applying analysis techniques appropriate for STFT spectrograms will give misleading results. Further discussion of techniques for analyzing scalograms may be found in Chapter 5.

4.2 DISCRETE WAVELET TRANSFORMS

As with the Fourier transforms, computing the continuous transform is difficult, since in the real world it is not possible to check every value of a and b to generate the proper scalogram. Using small discrete steps in both a and b is an option, but a computationally-expensive one. The alternative is a discrete wavelet transform. Here, rather than considering all possible values of scale and translation, we consider

only pairs (a, b) which conform to {(a_0^m, n a_0^m b_0) | m, n ∈ {0, 1, 2, ...}}, with a_0, b_0 fixed values chosen before calculating the transform (e.g., based on user knowledge of the underlying data). When using only these values, the child wavelets do not have the full range of variation found in Equation 4.1 and can be restated as in Equation 4.3 below.

ψ_{m,n}(t) = a_0^{−m/2} ψ(a_0^{−m} t − n b_0)   (4.3)

This equation tells how to find the child wavelets in the discrete case, but this alone is not enough to compute the discrete wavelet transform. We must still find out how important each child wavelet is at each scale and point in time. Though one could accomplish this by simply applying Equation 4.2 to the new values of a and b, doing so would still require evaluating the integral of a continuous function; only a finite number of such integrals would be necessary, but each one would still be computationally expensive. To avoid this, multiresolution analysis is employed. This

consists of finding a family of related waves, starting with the original wave being transformed, where each subsequent wave has only half the resolution of the previous one. A scaling function, sometimes called a “father wavelet,” is used to halve the resolution; this function acts as a low-pass filter to remove the higher-frequency information. At the same time, child wavelets ψ_{m,n} are considered for each of these waves, looking specifically at the scale which will be removed with the next filtering. After the child wavelet finds the high-frequency information, the low-frequency information is used as the starting point for the next iteration of the analysis. So at each stage, two filters are applied, giving the low-frequency (approximation) and high-frequency (detail) information about the original wave at the current scale. While the approximation information is passed from one stage of the calculation to the next, it is really the detail information we care about. This is the information which represents how much of the function is described at each frequency scale. This detail information can be plotted in a scalogram as above, with the benefit of reducing the computational complexity compared to approximating a continuous wavelet transform. To demonstrate the calculation of the discrete wavelet transform, we consider the Haar wavelet:

ψ(t) = 1 for 0 ≤ t < 1/2,  −1 for 1/2 ≤ t < 1,  0 otherwise   (4.4)

φ(t) = 1 for 0 ≤ t < 1,  0 otherwise   (4.5)

Equation 4.4 is the mother wavelet function, while Equation 4.5 is the scaling, or “father,” wavelet. These functions describe the types of patterns being sought

in the input wave (the child wavelets based on this mother wavelet), as well as the procedure used to alter the resolution of the input wave (the father wavelet). In practice, for discrete data these filters are applied to adjacent pairs of values m and n, producing the “approximation” and “detail” for this pair. These are found with (m + n)/2 and (m − n)/2, respectively. By convention, when working with a sequence of numbers, the first half of the sequence is replaced with the approximation information and the second half gets the detail. This is applied recursively until the entire sequence has been replaced by the detail information on different resolutions (aside from the very first entry, which will contain the average across the entire sequence). Algorithm 4.1 shows how this is done procedurally. For example, consider the sequence [w0, w1, w2, w3]. Here, the first pass would result in [(w0 + w1)/2, (w2 + w3)/2, (w0 − w1)/2, (w2 − w3)/2]. For the second step, only the approximation data from the first half would be used, overwriting only that part of the array; thus, the full discrete wavelet transform for this example would be [((w0 + w1) + (w2 + w3))/4, ((w0 + w1) − (w2 + w3))/4, (w0 − w1)/2, (w2 − w3)/2]. If the input array had contained [9, 3, 5, 8], after the transform this same array would now contain [6.25, −0.25, 3, −1.5]. In this case, we can see by looking at the higher-order detail data (the 3 and −1.5) that our original sequence had some important changes on the smaller scale, but on the larger scale (the −0.25), it did not change much from one end to the other. Multiresolution analysis using discrete wavelet transforms has applications beyond simply finding the scales which have the most interesting oscillations. It is frequently employed as a form of compression and smoothing. In this context, the incoming signal is passed through a sequence of filter banks. Each filter bank performs the approximation and detail filters separately, saves the detail data as the coefficients for this level, and passes the approximation data to the next filter bank. At the end, the final approximation data is saved along with the detail data. When used

50 input : Array A with indices from 0 to k output: Array A modified to contain the discrete wavelet transform of the original A

j 1 for j from k to 1 step j ← b 2 c do

2 tempA ← A ; j 3 for i from 0 to b 2 c do // This finds the “approximation” values

4 tempA[i] ← (A[2i] + A[2i + 1])/2 ; // This finds the “detail” values j 5 tempA[i + d 2 e] ← (A[2i] − A[2i + 1])/2 ;

6 A ← tempA ; Algorithm 4.1: Discrete Haar Wavelet Transform as a compression scheme, it is hoped that some levels will possess regularities which reduce data size; for example, if one level has values near zero, these can likely be replaced with zeros without much loss of information. Patterns in the data are used to reduce the amount of information needed to reproduce that data; this can also be used to denoise the data. Additionally, multiresolution analysis has been employed to detect attribute noise, by comparing how the detail data varies among different attributes [22].
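The in-place transform described above can be sketched as a short Python function (an illustrative sketch only; the name haar_dwt is our own, and the input length is assumed to be a power of two):

```python
def haar_dwt(a):
    # In-place discrete Haar wavelet transform (averaging convention):
    # each pass replaces the active prefix with pairwise averages
    # followed by pairwise half-differences, as in Algorithm 4.1.
    a = list(a)
    n = len(a)
    while n > 1:
        half = n // 2
        pairs = [(a[2 * i], a[2 * i + 1]) for i in range(half)]
        a[:n] = [(x + y) / 2 for x, y in pairs] + [(x - y) / 2 for x, y in pairs]
        n = half
    return a

print(haar_dwt([9, 3, 5, 8]))  # [6.25, -0.25, 3.0, -1.5]
```

On the worked example [9, 3, 5, 8] this reproduces the result [6.25, −0.25, 3, −1.5] given in the text.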

4.2.1 Short-Time Wavelet Transform

Discrete wavelet transforms are a useful technique for finding the location and size of oscillations hidden in an input signal, but traditionally they require that the entire signal be present before transformation may occur. Although this assumption is fine in some domains, for MCM/PHM it does not hold. When a sensor provides a continuous stream of information about a machine, practitioners want to know about problems as soon as they occur. One strategy for overcoming this obstacle is the short-time wavelet transform (STWT) [81], modeled after the short-time Fourier transform. Like the STFT, the STWT divides the signal into a series of overlapping

windows, and examines each in turn, building the transform just for the data within that window. The window size used is based on the scale of the wavelet transform: if a depth of d is used, this means that when examining a window of size 2^d (that is, 2^d adjacent samples, regardless of the actual amount of time each sample represents), the highest-possible resolution will match up exactly with the sampling rate. Thus, the window size is a direct function of the scale of the transform.

input : Vector V indexed from 0 to max-time − 1 of the input signal (values from the accelerometer)
        Parameter d, the depth of the transform to be performed
output: Two-layer matrix M which contains max-time − 2^d rows, each of which has d + 1 features representing the wavelet component energies at that row's time

1  M ← Empty Matrix ;
2  for i from 0 to max-time − 2^d − 1 do
       // Create a window of size 2^d starting at time i
3      W ← Empty Vector ;
4      for j from 0 to 2^d − 1 do
5          W[j] ← V[i + j] ;
       // Find wavelet transform (using Algorithm 4.1)
6      T ← HaarTransform(W) ;
       // Calculate component energies for each scale
7      C ← Empty Vector ;
       // First entry in wavelet transform is overall average; second is overall difference
8      C[0] ← T[0] ;
9      C[1] ← T[1] ;
       // Remaining entries are higher-resolution differences
10     for j from 1 to d − 1 do
11         C[j + 1] ← sum of T[k]^2 for k from 2^j to 2^(j+1) − 1 ;
12     Append C to M ;
Algorithm 4.2: Short-Time Wavelet Transform

To address the problem of converting scalogram information (which only contains the locations where the wavelet matched either positively or negatively with the signal) into a form which explicitly shows those frequency bands which have the strongest signal (either positive or negative), we have modified the STWT to add

an extra step of finding the component energies of each level of the decomposition. The concept of component energies comes from wavelet packet decompositions (which will be discussed further in Section 4.3), where they are frequently used to find the overall signal at different levels of decomposition, but they are also useful for simpler wavelet transforms to identify the signal in each frequency band. The component energy of each band is calculated by finding the sum of the squares of all the values at the given level. So, for the highest-frequency level, all 2^(d−1) values are squared and then added together. The result is the sole number reported for this highest level of decomposition. Although this does sacrifice the time localization information inherent in the sets at each level of decomposition, windowing already does this to some extent, and this sum-of-squaring does permit the algorithm to report on d + 1 values (for a d-level decomposition, because the average across the entire window is also included) rather than 2^d, a significant savings. In addition, by only having a single value for each scale of the resolution, there is no risk of placing undue weight on higher frequencies simply due to these having twice as many values as the next-lowest frequency. The STWT algorithm is presented in more detail in Algorithm 4.2. This demonstrates how the original signal is broken into overlapping windows, each of which has its wavelet transform performed separately. Recall that when utilizing the Haar wavelet transform, the first half of the data from each scale being examined is replaced with the approximation values from the previous scale, while the second half is replaced with the detail values from the current scale.
This means that the vector T begins with a single value representing the average (approximation) over the entire window, followed by a single value that shows the difference across the entire window (e.g., the average of the first half minus the average of the second half), followed by two values for “difference from first half” and “difference from second half,” then four

values for the differences at the next-highest scale, and so on, until the entire last half of the vector is taken up by the differences between successive values in the original vector. Thus, the component-energy calculations (to produce vector C) first pull out the first two values (to represent the lowest-possible scales), and then collect each of the remaining scales using a sum-of-squares until the highest scale is reported. These component energy vectors are then collected into the output matrix, one vector for each window of data which was examined.
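This windowed procedure can be illustrated with a minimal Python sketch that slides a window across the signal, applies the Haar transform, and collapses each scale to one component energy (function names such as short_time_wavelet_transform are our own; this is an illustrative sketch rather than the dissertation's implementation):

```python
def haar_dwt(a):
    # In-place Haar transform (averaging convention), as in Algorithm 4.1.
    a = list(a)
    n = len(a)
    while n > 1:
        half = n // 2
        pairs = [(a[2 * i], a[2 * i + 1]) for i in range(half)]
        a[:n] = [(x + y) / 2 for x, y in pairs] + [(x - y) / 2 for x, y in pairs]
        n = half
    return a

def short_time_wavelet_transform(v, d):
    # One row of d+1 features per full window of 2**d samples,
    # sliding one sample at a time over every full window.
    size = 2 ** d
    rows = []
    for i in range(len(v) - size + 1):
        t = haar_dwt(v[i:i + size])
        c = [t[0], t[1]]                      # overall average, overall difference
        for j in range(1, d):                 # component energy for each finer scale
            c.append(sum(x * x for x in t[2 ** j:2 ** (j + 1)]))
        rows.append(c)
    return rows

print(short_time_wavelet_transform([9, 3, 5, 8, 2, 2], 2)[0])  # [6.25, -0.25, 11.25]
```

For the first window [9, 3, 5, 8], the two highest-scale details (3 and −1.5) collapse into the single energy 3² + 1.5² = 11.25.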

4.2.2 Streaming Wavelet Transform

Although the STWT is one way of modifying the wavelet transform to function on streaming data, its use of a window sacrifices the wavelet transform's ability to have different temporal resolution (e.g., localization of changes) at different frequency bands. To address the problems of streaming data which arise in the context of MCM/PHM while preserving this ability, we propose a novel solution for vibration monitoring which can perform incremental updates to the scalogram on an as-needed basis [79]. This results in an algorithm which somewhat resembles the Redundant Haar Wavelet Transform, except that it lacks the very redundancy which gives the Redundant Haar Wavelet its name [62]. Rather than producing a new (and redundant) copy of the transform at each time instance, coefficients from previous levels are reused until enough time has passed for a given level to gain a totally new coefficient. The consequences of this lack of redundancy are discussed at the end of this section. In this non-redundant approach, described in more detail in Algorithm 4.3, the time slices are numbered consecutively (i.e., each has a timestamp, with numbering starting at 1 and going on with 2, 3, 4, etc.). Instead of updating all scales at every time slice, each scale s is updated only at those time slices whose timestamp is divisible by 2^s. In other words, the highest-frequency scale (denoted as 1 in this

input : Vector V indexed from 1 to max-time of the input signal (values from the accelerometer)
        Parameter d, the depth of the transform to be performed (i.e., the transform will examine all scales from 2^1 to 2^d, where smaller wavelengths correspond to higher frequencies)
output: Two-layer matrix Sc, a scalogram with max-time/2 rows, each of which has d coefficients showing how well the Haar wavelet aligns with the signal at that scale at that time

1  Prepare a two-layer matrix Avg where the outermost layer has max-time rows (one for each time point) and the inner layer has d coefficients, representing the approximation of the signal at that time and scale (the scalogram matrix Sc will hold the difference coefficients) ;
2  for t ← 1 to max-time do
3      if t = 1 then
4          for s ← 1 to d do
5              Avg[1, s] ⇐ 0 ;
6              Sc[1, s] ⇐ 0 ;
7      else
8          for s ← 1 to d do
9              Avg[t, s] ⇐ Avg[t − 1, s] ;
10             Sc[t, s] ⇐ Sc[t − 1, s] ;
11     for s ← 1 to d do
12         if t mod 2^s = 0 then
13             if s = 1 then
                   // Update the highest-frequency (smallest-wavelength) scale
14                 Avg[t, 1] ⇐ (V[t − 1] + V[t])/2 ;
15                 Sc[t, 1] ⇐ (V[t − 1] − V[t])/2 ;
16             else
                   // Update all other (lower-frequency, larger-wavelength) scales
17                 Avg[t, s] ⇐ (Avg[t − 1, s − 1] + Avg[t, s − 1])/2 ;
18                 Sc[t, s] ⇐ (Avg[t − 1, s − 1] − Avg[t, s − 1])/2 ;
19 Downsample Sc by removing all odd-valued time points ;
Algorithm 4.3: Streaming Wavelet Transform

numbering scheme) would update every other time slice, the next-highest-frequency (next shortest wavelength) would update every four time slices, the third every eight, and so on in powers of two. Assuming the use of the Haar wavelet as in the earlier example, these updates could be implemented by first having the Avg matrix at all scales copy forward coefficients from the previous time point. When a scale is instructed to update, it requests that the scale above it provide the coefficients from the previous and current iteration; these two coefficients are averaged together (line 17 in Algorithm 4.3) and this result constitutes the current scale's new average coefficient. Half the difference of these two coefficients is also computed and stored separately as this scale's detail coefficient for the current iteration (in line 18). Half the difference is used instead of the difference alone to normalize the coefficients; since the average consists of (second + first)/2, for consistency the detail consists of (second − first)/2. These detail coefficients are not used in further computing the transform, but are the output of the transform used to create the scalogram. The highest-frequency scale is a special case, since it cannot request coefficients from another scale; it gets the latest state directly from the sensor (in line 14) and returns half the difference between this and the sensor's previous state as its detail (in line 15), recording the mean of the sensor's new state and its previous state as this scale's new average coefficient. It should be noted that this transform is non-redundant because at each scale, the transform is recomputed only once for each span of time equal to this scale's wavelength. Thus, each point in the initial signal is only examined once by each scale's transform. This means that the highest-frequency transform, with the lowest wavelength of 2, still will not produce a coefficient for each timestamp in the initial signal.
If the timestamps are numbered 1, 2, 3, . . . , then even at this highest frequency the output will only have coefficients at times 2 (the transform on signal

data from times 1 and 2), 4 (the transform performed on times 3 and 4), 6 (the transform on times 5 and 6), and so on. Thus, in practice the computation is only performed for these time values, and the scalogram matrix has only half as many time points as the initial signal. The downsampling on the final line of Algorithm 4.3 represents this, by discarding all the odd-valued points (which will never be computed using the loop starting on line 12). This is the reason that the final output has only half as many instances as the input: max-time represents the number of points in the initial signal, while the output signal will have only half as many usable instances. This lack of redundancy does have a price in terms of accuracy, since the transform will depend on the initial time point when the transform is computed. It may be possible to contrive a data set such that the signal contains a wave precisely out of phase with the initial time when the transform begins, so that this signal is detected at the wrong scale or not at all. However, we feel that the computational advantages outweigh this risk. Furthermore, our case studies in Chapter 7 demonstrate that this limitation does not necessarily affect the results when applied to real-world data. Such data would need to coincidentally have a significant number of its oscillations lined up perfectly with the initial point of the transform, an event which is extremely unlikely to occur.
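The update schedule described above can be sketched in Python as follows (a minimal illustration of Algorithm 4.3 with hypothetical names; timestamps are 1-based as in the algorithm, and column 0 of each row is unused padding):

```python
def streaming_wavelet_transform(v, d):
    # Non-redundant streaming Haar transform: scale s is refreshed only at
    # timestamps divisible by 2**s; otherwise coefficients carry forward.
    avg = [[0.0] * (d + 1)]   # avg[t][s]; index 0 is padding so t is 1-based
    sc = [[0.0] * (d + 1)]
    for t in range(1, len(v) + 1):
        a, c = avg[t - 1][:], sc[t - 1][:]   # carry previous values forward
        for s in range(1, d + 1):
            if t % (2 ** s) == 0:
                if s == 1:
                    # Highest-frequency scale reads the sensor directly
                    a[1] = (v[t - 2] + v[t - 1]) / 2   # V[t-1] and V[t], 1-based
                    c[1] = (v[t - 2] - v[t - 1]) / 2
                else:
                    # Lower scales combine the previous and current
                    # coefficients of the scale above
                    a[s] = (avg[t - 1][s - 1] + a[s - 1]) / 2
                    c[s] = (avg[t - 1][s - 1] - a[s - 1]) / 2
        avg.append(a)
        sc.append(c)
    # Downsample: keep only even timestamps (odd ones are never recomputed)
    return [sc[t][1:] for t in range(2, len(v) + 1, 2)]
```

For the input [1, 2, 3, 6] with d = 2, the scalogram rows at timestamps 2 and 4 are [−0.5, 0.0] and [−1.5, −1.5]; scale 2 still holds its zero initialization at timestamp 2 because it first updates at timestamp 4.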

4.3 WAVELET PACKET DECOMPOSITION

Although the discrete wavelet transform (and the short-time and streaming wavelet transforms based on it) is a useful technique for many signal processing needs, it is not without its flaws: it lacks sufficient high-frequency resolution for certain MCM/PHM tasks (where these frequencies often contain the most useful information pertaining

to a system's health), and is not always easy to interpret. To alleviate these flaws, a newer technique, wavelet packet decomposition, was developed [14]. This technique builds upon the benefits of the discrete wavelet transform (efficiency, automatic frequency scaling) to analyze both the high and low frequency data at increasing resolution, recursively applying the approximation and detail filters to all vectors on each level rather than just the single approximation vector.

(a) Discrete Wavelet Transform    (b) Wavelet Packet Decomposition
Figure 4.2: Comparison of Discrete Wavelet Transform and Wavelet Packet Decomposition (Credit: Wikimedia Foundation, Inc.)

Figure 4.2 compares the two techniques. In 4.2a, we see the discrete wavelet transform, identical to the transform shown in Algorithm 4.1 (although the approximation and detail filters are generalized as g[n] and h[n], respectively). The ↓ 2 symbol represents a decimation filter which removes every other instance in accordance with Nyquist's rule; although this was performed implicitly in the algorithm, it is stated explicitly here. Note that the output of the discrete wavelet transform consists of four vectors: the h[n] vectors from levels 1, 2, and 3, and the g[n] vector from level 3.

These correspond to the detail information at the first three levels of decomposition, as well as the average information left over from the final level of decomposition. Figure 4.2b shows the results of a wavelet packet decomposition. As can be seen, unlike in a discrete wavelet transform the wavelet packet decomposition applies the g[n] and h[n] (approximation and detail) filters to both outputs from each level, not just to the g[n] vector from the previous level. This means that each level has twice as many vectors as the level before it. Note, however, that due to the ↓ 2 filter, each level has precisely as many coefficients as its predecessor (assuming the initial number is a power of two); this demonstrates how Nyquist's rule ensures that information is being neither created nor destroyed at each level of decomposition. Unlike the discrete wavelet transform, the output of the wavelet packet decomposition is not the vectors from different levels of the tree; instead, it is all of the vectors found on the lowest level of the tree. Nonetheless, as these vectors may still have many coefficients, further processing is needed to turn them into a set of scalar features useful for MCM/PHM tasks, and to ensure that both positive and negative values are detected appropriately (as is necessary to find both those places where the wavelet lines up directly with the signal and those places where it has a strong negative match).

4.3.1 Short-Time Wavelet Packet Decomposition

One way to turn these vectors into a manageable number of features is to find the wavelet packet component energies [102]. This is a relatively simple procedure: for each of the output vectors, the squared values of all its coefficients are found and summed together. This sum then represents the component energy for that particular vector. Assuming that a d-level decomposition was performed, this means that rather than having the same number of individual numbers as initial instances (and with high-sample-rate sensors, this value can be extremely high), only 2^d values are found.

input : Vector V indexed from 0 to max-time − 1 of the input signal (values from the accelerometer)
        Parameter d, the depth of the transform to be performed
        Parameter w, the size of the window to be used (should be a power of two, with 2^d < w)
output: Two-layer matrix M which contains max-time − w rows, each of which has 2^d features representing the wavelet packet component energies at that row's time

1  Prepare a three-layer matrix W (with a structure similar to that seen in Figure 4.3) where the first subscript represents the level of the decomposition (e.g., from 1 to d), the second subscript indexes the specific vector within that particular level (e.g., from 0 to 2^i − 1, on the ith level), and the third subscript indexes the scalar coefficients stored in the given vector ;
2  for t1 ← 0 to max-time − w − 1 do
3      Clear all vectors within W ;
4      for t2 ← 0 to w − 1 do
5          W[0][0][t2] ← V[t1 + t2] ;
6      for i ← 1 to d do
7          for j ← 0 to 2^i − 1 step j ← j + 2 do
8              for u ← 0 to size(W[i − 1][j/2]) − 1 step u ← u + 2 do
9                  Append (W[i − 1][j/2][u] + W[i − 1][j/2][u + 1])/2 to end of vector W[i][j] ;
10                 Append (W[i − 1][j/2][u] − W[i − 1][j/2][u + 1])/2 to end of vector W[i][j + 1] ;
11     for j ← 0 to 2^d − 1 do
12         M[t1][j] ← 0 ;
13         for u ← 0 to size(W[d][j]) − 1 do
14             M[t1][j] ← M[t1][j] + (W[d][j][u])^2 ;
Algorithm 4.4: Short-Time Wavelet Packet Decomposition

Algorithm 4.4 demonstrates a novel sliding window version of the wavelet packet decomposition that we refer to as the short-time wavelet packet decomposition (STWPD) [85]. As the name implies, this approach is inspired by the short-time Fourier transform (STFT), and like the STFT needs to completely recompute the transform as the window slides along (here, once per time unit in the original data). At each time instant, the algorithm first copies the actual sensor values from the current window into the very first vector within the W matrix starting at line 4. Then it considers all the d scales being examined one at a time starting at line 6. Within each scale, it then considers how many different vectors the current level is going to produce, and prepares to loop through these. The loop variable j indexes these vectors, and the loop increments by two because every pair of vectors on the current level is created based on a single vector from the previous level. The algorithm then goes through all coefficients for the vector on the previous level starting at line 8. Here, a skip is included because pairs of coefficients on the previous level are used to create single coefficients on the current level. Finally, knowing the correct pair of coefficients from the previous level, the algorithm finds their average and half their difference and adds these to the vectors being built on the present level. Once the entire transform has been processed in this fashion, the wavelet packet component energies are found by going through all the vectors on the bottom layer starting at line 11 and calculating the sum of squares for each vector individually, making this sum the value for the jth feature for the current time instance. These features together compose a vector with 2^d elements, each one representing one of the vectors found on the bottom level of the decomposition.
An important difference between the discrete wavelet transform and wavelet packet decomposition is that with the discrete wavelet transform, the transform’s depth only matters in that the greater the depth, the smaller the approximation vector at the

end. Any detail vector produced by the shallower transform will have an identical copy in the deeper transform; only the shallow transform's approximation vector is missing, replaced with one or more new detail vectors and a new, smaller approximation vector. This can be seen in Figure 4.2a: a level 2 decomposition would give the h[n] (detail) vector from level 1 and both the g[n] (approximation) and h[n] vectors from level 2, while a level 3 decomposition would give the h[n] vectors from levels 1 and 2 and the g[n] and h[n] vectors from level 3. On the other hand, with the wavelet packet decomposition only the deepest level is returned as the transform's output. Thus, it is very important that the transform be performed to the depth that represents the best trade-off between runtime performance and classification performance. Figure 4.2b shows this: a level 2 decomposition would give the four vectors found on the second level, while a level 3 decomposition would give the eight vectors found on the third level (and none of the vectors found on the second level). There is little guidance in the literature on selecting this parameter, although values often fall in the range of 4 to 7 [21, 34, 54, 71, 101, 104]. The general policy seems to be to empirically find the best value of this parameter for a specific application.
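The per-window decomposition and component-energy step at the heart of STWPD can be sketched in a few lines of Python (an illustrative sketch with our own names, assuming Haar filters and a power-of-two window):

```python
def wpd_level(vectors):
    # One level of Haar wavelet packet decomposition: each vector yields an
    # approximation vector and a detail vector (the inner loops of Algorithm 4.4).
    out = []
    for vec in vectors:
        approx = [(vec[u] + vec[u + 1]) / 2 for u in range(0, len(vec), 2)]
        detail = [(vec[u] - vec[u + 1]) / 2 for u in range(0, len(vec), 2)]
        out.extend([approx, detail])
    return out

def wpd_component_energies(window, d):
    # Decompose one window to depth d, then return the 2**d component
    # energies (sum of squared coefficients of each bottom-level vector).
    level = [list(window)]
    for _ in range(d):
        level = wpd_level(level)
    return [sum(x * x for x in vec) for vec in level]
```

Sliding this computation along the signal once per time step, as Algorithm 4.4 does, yields the STWPD feature matrix. Note that every level holds the same total number of coefficients as its predecessor, matching the Nyquist argument above.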

4.3.2 Streaming Wavelet Packet Decomposition

As discussed in Section 4.2.2, dealing with ongoing data streams is a problem which many types of transforms cannot easily handle. For example, before a signal can be processed with a traditional wavelet packet decomposition, it must be entirely finished, or must at least use a fixed-size window selected from the ongoing signal for analysis. Because of this windowing, no frequency with a wavelength longer than the window size can be detected. Moreover, localizing signals in time more precisely than the window size is challenging, especially when using wavelet packet component energies, as the summing removes the time-specific information found within each

vector. A truly streaming-based approach is needed to process this data in an ongoing fashion, producing a new set of outputs for every input so that signals can be localized in time.

input : Vector V indexed from 0 to max-time − 1 of the input signal (values from the accelerometer)
        Parameter d, the depth of the transform to be performed
output: Two-layer matrix M which contains max-time/2 rows, each of which has 2^(d+1) − 2 features representing the results of the transform at that row's time

1  Prepare a three-layer matrix W where the first subscript represents the level of the decomposition (e.g., from 1 to d), the second subscript indexes the specific vector within that particular level (e.g., from 0 to 2^i − 1, on the ith level), and the third subscript indexes the scalar coefficients stored in the given vector ;
2  Copy V into W[0][0] as the notional "0th" level of decomposition ;
3  for t ← 0 to max-time − 1 do
4      for i ← 1 to d do
5          if (t + 1) mod 2^i = 0 then
6              for j ← 0 to 2^i − 1 step j ← j + 2 do
7                  Append (W[i − 1][j/2][(2 × (t + 1))/2^i − 2] + W[i − 1][j/2][(2 × (t + 1))/2^i − 1])/2 to end of vector W[i][j] ;
8                  Append (W[i − 1][j/2][(2 × (t + 1))/2^i − 2] − W[i − 1][j/2][(2 × (t + 1))/2^i − 1])/2 to end of vector W[i][j + 1] ;
9          for j ← 0 to 2^i − 1 do
10             if W[i][j] ≠ ∅ then
11                 Append the last value of W[i][j] to M[t] ;
12             else
13                 Append ? (missing value symbol) to M[t] ;
14 for t ← 0 to max-time/2 − 1 do
15     Append vector M[2t + 1] to matrix tempM ;
16 M ← tempM
Algorithm 4.5: Streaming Wavelet Packet Decomposition

Algorithm 4.5 describes the novel streaming wavelet packet decomposition (SWPD) [91]. This algorithm processes the input waveform at each successive time point, producing a new vector for each which shows the most recent wavelet decomposition coefficients for all levels at that time. These vectors are subsequently collected into

the output matrix M. Over the course of the decomposition, the complete WPD tree W for the entire waveform is incrementally built; Figure 4.3 shows the structure of this tree, and may help with visualizing the algorithm. The algorithm works as follows: for each point in time, the algorithm steps through each level performing the decomposition starting at line 4. Then, if the current time indicates that there is new information to be found at a given level, the algorithm goes through each vector on this level and writes new data into each, finding both the approximation and the detail of the previous level's vectors. Note that throughout the algorithm, even values of j represent approximation vectors and odd values of j represent detail vectors; this is why only the even values are used when examining a previous level. After all the data which can be updated has been updated, the algorithm determines if any information is available at the given level (line 10). If this is the case, the most recent information at the present level is written to the current time's feature vector (line 11); otherwise, a missing value indicator is written to show that no information was available yet (line 13). The final steps, lines 14–16, remove the even-valued instances, which would contain nothing but duplicated data (much the same as line 19 in Algorithm 4.3). To walk through this algorithm, consider an input signal consisting of the vector {1, 2, 3, 6, 7, 4, 2, −3}. As this is the input signal, it shall be treated as vector W[0][0], corresponding to level 0, vector 0 in Figure 4.3. At the first time instance, t = 0, no further computation will be performed, since (t + 1) mod 2^i ≠ 0 for all i. This means that no new information about the lower waves can be found, and all vectors in levels 1 and below will remain empty. Since no information has been processed yet, the output vector for time 0, M[0], will consist entirely of missing values (that is, the ? symbol).
When t = 1, however, there is new information, so the first level (W[1], or level 1 in the figure) can be computed. In particular, the arithmetic mean of the most

recent two coefficients in the W[0][0] vector (which here have the values 1 and 2) is found and added as a new coefficient in the W[1][0] vector, and half their difference (see the note below) is put as the new coefficient in the W[1][1] vector. This is illustrated in Figure 4.3a.

(a) Algorithm at time t = 1    (b) Algorithm at time t = 3    (c) Algorithm at time t = 7
Figure 4.3: Illustration of wavelet packet decomposition tree W, both empty and at different stages in processing Algorithm 4.5

Because some information has been put into the lower levels, the output vector M[1] will begin with {1.5, 0.5} before continuing with missing values (?'s) representing levels 2 and below. Note that despite the appearance in the figure, all vectors after

Footnote 2: Specifically, "second minus first over two." The division is done to ensure that the means and differences remain on the same scale.

W[0][0] are presumed to start off with no elements (rather than with placeholder blank elements), and new coefficients are appended whenever they become available. This results in filling out the tree as shown when the transform is complete. At t = 2, no new coefficients can be found. However, at t = 3, new coefficients can be found at both the first and second levels. For the level 1 decomposition, the same procedure as above is performed, with the mean and half the difference of the most recent two coefficients (now, 3 and 6) being added to the W[1][0] and W[1][1] vectors. However, now the W[2] vectors must be updated as well, using the most recent coefficients from the W[1] vectors. W[2][0] and W[2][1] use the coefficients from W[1][0] (that is, 1.5 and 4.5), while W[2][2] and W[2][3] use the coefficients from W[1][1] (that is, 0.5 and 1.5). In each case, the vector with an even value for j (the second index of W, or the vector number in the figure) will take the average of its two source numbers, while the vector with an odd value for j will take half their difference. These coefficients are shown in Figure 4.3b. To produce the output vector M[3], the last coefficients in all available levels are used: {4.5, 1.5, 3, 1.5, 1, 0.5, ?, ?, ?, ?, ?, ?, ?, ?}. The coefficients from the third level are still missing since that information has not yet had time to process. This algorithm is iterated, with each level receiving new coefficients when enough time has elapsed for the level above it to receive two new coefficients per vector, so that those two coefficients can be used to determine the lower level's new coefficients. Figure 4.3c shows the complete wavelet packet decomposition tree for this input vector, with each value representing a wavelet packet decomposition coefficient. At the final time for this input, M[7] = {−0.5, −2.5, 2.5, −3, −2, −0.5, 2.75, −0.25, −0.75, −2.25, −0.5, −1.5, 0, −0.5}, with no missing values.
The matrix M consists entirely of vectors of this form, with either coefficients or the ? symbol in each position. Each vector (one for each time instance) is indexed separately.
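The walkthrough above can be reproduced with a small Python sketch of this incremental scheme (names are our own; following the footnote's "second minus first over two" sign convention used in the worked example):

```python
def swpd(v, d):
    # Streaming wavelet packet decomposition sketch: vector (i, j) of the
    # tree W gains a coefficient only when (t + 1) is divisible by 2**i;
    # every feature row reports the newest coefficient of each vector,
    # or '?' when a vector is still empty.
    w = {(0, 0): list(v)}          # level 0 is the raw signal
    m = []
    for t in range(len(v)):
        row = []
        for i in range(1, d + 1):
            if (t + 1) % (2 ** i) == 0:
                end = (t + 1) // (2 ** (i - 1))   # coefficients so far at level i-1
                for j in range(0, 2 ** i, 2):
                    src = w[(i - 1, j // 2)]
                    first, second = src[end - 2], src[end - 1]
                    w.setdefault((i, j), []).append((first + second) / 2)
                    w.setdefault((i, j + 1), []).append((second - first) / 2)
            for j in range(2 ** i):
                vec = w.get((i, j), [])
                row.append(vec[-1] if vec else '?')
        m.append(row)
    return [m[2 * t + 1] for t in range(len(v) // 2)]

out = swpd([1, 2, 3, 6, 7, 4, 2, -3], 3)
print(out[0][:3])   # [1.5, 0.5, '?']
```

For the worked input, the first retained row begins {1.5, 0.5} followed by missing values, and the last row matches the complete vector M[7] given above.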

This algorithm has many desirable properties. It incrementally generates a new vector of features (wavelet packet decomposition coefficients) for each point in time, so there is no waiting for a window to pass to detect new information, reducing latency compared to STWPD. Additionally, broader information about longer wavelengths is retained; waves up to size 2^d can be detected, without a window restricting the maximum length. Because it computes a new coefficient for each level only when enough new information has arrived to replace the old coefficient, this algorithm performs the minimal computation needed to extract the signal's information. This permits examining depths greater than those which can be feasibly reached using a more typical wavelet packet decomposition, as in Algorithm 4.4. Moreover, since the coefficients at all levels are written into the feature vector, there is no harm in examining the signal with greater depth; the output is not restricted to only the deepest level, so at worst the deeper information simply adds extraneous noise and reduces runtime performance. When coupled with feature selection techniques, even this downside is mitigated, and examining all levels permits the algorithm to detect on its own which levels are most useful for interpreting the data. There are some downsides to this algorithm, however. Because it only computes new coefficients when enough new information has arrived to redo the calculation, there is no redundancy, and thus there is the potential for alignment issues. As with the SWT algorithm, it is possible to contrive an input wave that precisely aligns with the boundaries where the computations take place, making the wave difficult to detect. Preliminary research with an implementation of Algorithm 4.5 suggests that this contrived scenario does not significantly impact the ability of algorithms such as this to classify waveforms by fault class.
Hence, the advantages outlined earlier appear to outweigh this potential risk. One important distinction between short-time wavelet packet decomposition (STWPD)

and streaming wavelet packet decomposition (SWPD) is their respective performance bounds. To perform STWPD, the full wavelet packet decomposition must be found for every single window. The upper bound for this calculation (performed on lines 6–10 of Algorithm 4.4) is O(n log(n)), because in the worst-case scenario d (the level) is bounded above by log(n), w (the window size) is bounded by n/2, and the full wavelet packet decomposition tree has d × w entries. Since this calculation must be performed a total of n − w times from lines 2–14 (and in the worst case, with w = n/2, this is also n/2), the overall worst-case performance bound is O(n^2 log(n)). In contrast, the SWPD only builds one wavelet packet decomposition tree over the course of its entire run, rather than a separate tree for every instance. Rather than having their own trees, each instance is just a snapshot of the overall tree at that particular moment in time. Overall, in the worst case where d is bounded above by log(n), the building of this tree will take O(n log(n)) steps for the entire calculation. That is, O(n log(n)) time will be spent on lines 5–8 in Algorithm 4.5 over the course of the entire algorithm. However, taking these snapshots (which is to say, going through the tree to find the most recent value in each of its vectors for the current time instance, which is done in lines 9–13) is O(n) for each instance, since there are O(n) vectors to be searched. Thus, the overall worst-case performance bound is the maximum of O(n log(n)) and O(n^2), which is to say O(n^2). This is a factor of log(n) less than the worst case for STWPD. Especially on large datasets, the reduced complexity of SWPD is crucial to finding answers for real-time MCM/PHM applications, and does not cost an appreciable amount of accuracy.

Chapter 5

Post-Processing

A number of different post-processing techniques may be used to further transform and modify the data produced by either the Fourier or wavelet transforms. These serve different purposes, from extracting the scale information from the transform, to collecting windowed data to avoid outlier instances, to employing feature selection to pick the optimal transform depth, to baseline-differencing to dissociate the environmental condition from the machine state. All of these techniques are novel contributions of this work, aside from windowing; windowing itself is not a novel algorithm, but its application to ocean turbine MCM/PHM is performed for the first time as part of this work.

5.1 SCALE DETECTION

As discussed in Chapter 4, one notable difference between Fourier and wavelet transforms is that the former gives a value for how much of the original signal can be found at a given time and scale, while the latter instead reports how well a wavelet with the matching time and scale corresponds to the given data. Most importantly, this means that if the wavelet has a strong negative correspondence (which would indicate a valid wave at the given time and scale), it will have a large negative value, rather than a large positive one. However, the goal in many application domains

(including analyzing vibration data for detecting faults) is to detect oscillations, or repeating patterns, and find when these occur and at what scales. In the case of rotating machinery, these oscillations correspond to the vibrational modes of different parts of the machine. Changes in these vibrations can be an indicator that something is not operating properly, and it is these changes which are important, not simply a scalogram showing where the child wavelets correspond to the input signal. Because of this, we propose further processing. One approach to identifying these changes is the component energy calculations discussed in the previous chapter. These fix the problem of searching for large positive or negative values by simply using squared values, which converts everything to a positive number. In addition, the problem of detecting waves when not all instances will be extremely positive or extremely negative is solved by summing the squared values: even if some time points have more neutral values, if there is a strong wave in a given frequency band, most of its values will have large squares. However, one downside of this approach is that it needs to collect all the values for a given time period into this sum. This is sufficient for approaches such as the STWT and the STWPD, which already operate on a windowed period, but for the SWT and SWPD, such component energies are not feasible, and an algorithm is needed which will exhibit the same automatic scaling of temporal resolution over different scales as the underlying transform. The scale detection algorithm outlined in Algorithm 5.1 demonstrates one such approach. This algorithm was designed specifically to seek out waves in the transformed data, in particular when the transformed data was generated using the Haar wavelet. It relies on properties of the Haar wavelet which are not found in the general case for all mother wavelets.
However, since the Haar wavelet is what was employed in the earlier streaming wavelet transform, this is sufficient. For more intricate wavelets, such as the Daubechies 4-tap wavelet [16], a different form of analysis may be required.

input : Two-layer matrix Sc, which is a scalogram with max-time/2 rows, each of which has d coefficients showing how well the Haar wavelet aligns with the signal at that scale at that time
      : Parameters p and q, where p is the number of transitions to examine and q is the minimum number of these transitions which must involve a sign change for an oscillation to be detected
output: Two-layer matrix SD, with max-time/2 rows, each of which has d values showing whether or not there appears to be a wave (a sequence of rising and falling) at that scale at that time; a 1 in a given cell means that there appears to be a wave, while a 0 means there does not

1   for t ← 1 to max-time/2 do
2       Initialize SD for time t with 0's ;
3       for s ← 1 to d do
4           if 2^s × p > t then   // It is still too early to detect a wave
5               SD[t, s] ⇐ 0 ;
6           else   // Search for a wave
7               r ⇐ 0 ;
8               for k ← 0 to p − 1 do
9                   t0 ⇐ t − k × 2^s ;
10                  t1 ⇐ t − (k + 1) × 2^s ;
11                  if (Sc[t0, s] ≥ 0 && Sc[t1, s] < 0) || (Sc[t0, s] < 0 && Sc[t1, s] ≥ 0) then
12                      r ⇐ r + 1 ;
13              if r ≥ q then
14                  SD[t, s] ⇐ 1 ;
15              else
16                  SD[t, s] ⇐ 0 ;

Algorithm 5.1: Haar wavelet-based scale detection

Although the scalogram shows positive alignment with the Haar wavelet when the signal is falling and negative alignment when it is rising, these patterns may not indicate oscillations per se. Instead, an oscillation would be represented by alternating bands of falling and rising. While for simple cases these alternating bands can be recognized visually in the scalogram, it is frequently difficult to visually identify more subtle or intricate patterns. Algorithm 5.1 automatically detects patterns which, if illustrated in a scalogram, would appear as these alternating colors. Note in this algorithm that the input and output matrices are said to have max-time/2 instances; here, max-time signifies the total duration of the original (pre-transformation) data file. The scale detection algorithm is shown working on matrices of half this length (i.e., max-time/2) because it is primarily useful for the SWT and SWPD, which due to their streaming nature only produce output instances for every other input instance, and thus their output matrices have only half as many instances as the original max-time value.

This algorithm works by considering blocks of p consecutive transitions. A transition is a timespan during which, at a given scale, the wavelet transform was recomputed and thus could potentially have changed its coefficient. In practice, these are computed by picking the p + 1 previous time points at the current scale such that the transform has been recomputed exactly once between each point and the next one in the sequence. The loop between lines 8 and 12 in Algorithm 5.1 finds these p + 1 points in time, which are t through t − p × 2^s relative to starting point t, at resolution s. The transitions are the p spaces between these p + 1 points, where the coefficient of the scalogram was recomputed. The spacing of these points (and of the transitions located between them) corresponds to the degree of resolution s at a given scale of the transform. Higher resolution means that transitions occur more frequently;

these transitions occur more rarely at the lower resolutions. When considering these p consecutive transitions, a wave is said to be present if the transform changed sign (went from positive to negative or negative to positive) in at least q of them. The output of Algorithm 5.1 is the detected scales information. While this has the same dimensions as the scalogram data, here the attributes are not the degree of alignment with the Haar wavelet at a given scale, but simply whether or not the algorithm believes an oscillation is present at that scale at that time. In this implementation, these detected scales were represented as binary values, with 0 for no oscillation and 1 for oscillation. In the output file, each row is a single time instance, with the first value being that instance's timestamp and the subsequent values being these 0's or 1's, one for each scale being examined. When performing the full SWT later in this dissertation, it is this detected scales file which is used as input to downstream processing for fault detection.
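A direct rendering of Algorithm 5.1 might look as follows (a sketch only; it assumes the scalogram is stored densely, with the most recent coefficient for every scale available at every 0-based time index, rather than only at the instants where the streaming transform recomputes):

```python
def scale_detection(Sc, d, p, q):
    # Sc[t][s-1] holds the scale-s Haar coefficient at time t.
    # Returns SD, where SD[t][s-1] is 1 if a wave (>= q sign changes
    # over p consecutive transitions of spacing 2^s) is detected.
    T = len(Sc)
    SD = [[0] * d for _ in range(T)]
    for t in range(T):
        for s in range(1, d + 1):
            if 2 ** s * p > t:
                continue  # too early in the stream to see a full wave
            changes = 0
            for k in range(p):
                t0 = t - k * 2 ** s
                t1 = t - (k + 1) * 2 ** s
                if (Sc[t0][s - 1] >= 0) != (Sc[t1][s - 1] >= 0):
                    changes += 1  # coefficient changed sign here
            SD[t][s - 1] = 1 if changes >= q else 0
    return SD
```

The sign-change test is written as an inequality of booleans, which is equivalent to the two-clause condition on line 11 of Algorithm 5.1.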

5.2 DATA WINDOWING

Having performed this second pass, one option is to perform data mining on the raw output, considering each of the time instances separately. However, for certain transforms and datasets this would not be an effective approach. There can be a great deal of short-term variation in the signal, such that no scale will have a wave detected 100% of the time, even if that particular experimental setup tends to produce waves in that range. To work around this problem, windowing may be employed. In windowing, an individual point in time is not considered alone. Instead, a group of adjacent points is collected into a single instance. A sliding window is a particular type of window wherein the window at time t + 1 is created by removing the first time point from the window at time t and adding the first time point which

follows that window. For windows of length N, N − 1 instances overlap.

SD(t) = (sd_1(t), sd_2(t), sd_3(t), \ldots, sd_d(t))    (5.1)

W(t) = \sum_{i=t}^{t+N-1} SD(i)    (5.2)

     = \left( \sum_{i=t}^{t+N-1} sd_1(i), \sum_{i=t}^{t+N-1} sd_2(i), \sum_{i=t}^{t+N-1} sd_3(i), \ldots, \sum_{i=t}^{t+N-1} sd_d(i) \right)    (5.3)

Equations 5.1 through 5.3 describe how windowing was used in this research. Equation 5.1 denotes how instances SD(t) are represented in the detected scales file; for each point in time, there are d data points, one for each scale. These points may be either binary values or continuous values, depending on whether Algorithm 5.1 has been applied. Equations 5.2 and 5.3 describe how these instances are collected together to form the windows W(t), with Equation 5.2 working in the vector format and Equation 5.3 composing each component of SD(i) individually. Note that if the input to this windowing is the output of the scale detection algorithm (such that each frequency in the input data may only take a value of 0 or 1), then the values of W(t) may take any integral value from 0 to N. For our case studies, we let N = 100, because preliminary research suggested that 100 is a good window size; any smaller and noise started to creep in, while any larger and one of the key advantages of the streaming wavelet transform (being able to quickly detect changes) would be lost in the resulting blur. Note that this windowing is different from the windowing which may be applied during the initial transformation step (as in the short-time algorithms STFT, STWT, and STWPD). Windowing during the initial transform will make it impossible to detect oscillations larger in size than the window used, while also making it more challenging to localize changes occurring within a window. This is one reason to favor wavelet

transforms over Fourier transforms for the initial transformation: wavelet transforms do not necessarily require such windowing. However, at this stage of data processing (i.e., when the data windowing discussed here is performed), both wavelet transformation and wavelet interpretation (if necessary) have already been performed, and information from all scales has been integrated into the data stream. Thus, even with the windowing described here, oscillations of all scales can still be detected.
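Equations 5.2 and 5.3 amount to a per-scale sliding sum, which can be sketched as follows (a hypothetical helper, not taken from the dissertation's codebase):

```python
def sliding_windows(SD, N):
    # SD: list of instances, each a list of d per-scale values (e.g. the
    # 0/1 flags produced by scale detection). Returns one summed window
    # W(t) per valid start time t, as in Equations 5.2 and 5.3.
    d = len(SD[0])
    return [[sum(SD[i][s] for i in range(t, t + N)) for s in range(d)]
            for t in range(len(SD) - N + 1)]
```

With binary input, each component of a window is an integer between 0 and N, matching the range noted above.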

5.3 AUTOMATIC DEPTH SELECTION

For all of the transforms discussed and proposed in this work, one downside is that a depth must be chosen for the transformation. Using the maximum possible depth often results in very large quantities of data, especially when examining systems over a long time span, so for real-time systems this is not an option. However, a priori there is no guide for finding this depth. Domain experts can advise about which frequencies are believed to be most important, but a data-driven approach allows a more direct solution to finding the best transformation depth. Feature selection techniques such as chi-squared (CS) and information gain (IG) can be used for this purpose, by finding the most important features and then determining the appropriate depth based on those. This is demonstrated in Algorithm 5.2. Specifically, first the feature ranking is performed on training data (which has been transformed to the greatest possible depth) to order the different features from most to least important. These rankings are provided as the input to the algorithm. Then, for each combination of ranking and dataset, the top N features are examined. If a given feature has not yet been added to the consensus set (F-out), it gets added. One important aspect of this is that the size of F-out cannot be known in advance; it depends on the number of duplicates. Finally, the largest of these top features is found and examined to

input : Parameter num-r, the total number of feature rankers used
      : Parameter num-d, the total number of distinct datasets to be considered
      : Parameter N, the total number of features to be selected from each separate list
      : Two-layer matrix F-in, where the first subscript represents one combination of ranker and dataset (i.e., it runs from 1 to num-r × num-d), and the second subscript represents the ranked features for that combination (e.g., F-in[i][0] is the top-ranked feature, F-in[i][1] is the second-place feature, F-in[i][2] is in third place, and so on)
output: Parameter Max, the maximum depth necessary to capture all of the top N features for all combinations of ranker and dataset
      : Set F-out, the set of features which are found within the top N features for at least one combination of ranker and dataset

1   F-out ← {} ;
2   for i ← 0 to num-r − 1 do
3       for j ← 0 to num-d − 1 do
4           for k ← 0 to N − 1 do
5               if F-in[i × num-d + j][k] ∉ F-out then
6                   F-out ← F-out ∪ {F-in[i × num-d + j][k]} ;
7   Best-yet ← 0 ;
8   for i ← 0 to size(F-out) − 1 do
9       if F-out[i] > Best-yet then
10          Best-yet ← F-out[i] ;
11  Max ← 0 ;
12  while true do
13      if 2^Max ≥ Best-yet then
14          break ;
15      Max ← Max + 1 ;

Algorithm 5.2: Automated Depth Selection using Feature Selection

determine the smallest depth which nonetheless encompasses that feature. This is chosen as the depth to be used when transforming the data for the live system. The choice of N, the number of top features to examine, might seem to be an important decision in this algorithm. However, we will show in Chapter 7 that the selected depth does not depend strongly on this choice, remaining relatively stable over a fairly wide range.
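A compact sketch of this procedure follows. It assumes, for illustration, that each feature is identified by an integer index reflecting the wavelength it captures, so that the 2^Max ≥ Best-yet test of Algorithm 5.2 applies directly to the largest index in the consensus set:

```python
def select_depth(rankings, N):
    # rankings: one ranked feature list per (ranker, dataset) combination,
    # best feature first. Returns (Max, F-out): the smallest depth whose
    # 2^Max covers every consensus feature, and the consensus set itself.
    consensus = set()
    for ranked in rankings:
        consensus.update(ranked[:N])  # union of top-N lists, deduplicated
    best = max(consensus)
    depth = 0
    while 2 ** depth < best:
        depth += 1
    return depth, consensus
```

As in the algorithm, the size of the consensus set is data-dependent, since duplicate features across rankings are only counted once.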

5.4 BASELINE-DIFFERENCING

Although the proposed algorithms are useful for producing a summary of the wave information contained in a signal, one risk of these vibration monitoring techniques is that it is not clear how much of the observed oscillation is normal for the signal and how much is abnormal. Statistical techniques and machine learning classifiers can be employed for a given environmental condition to determine the difference between normal and abnormal, but such approaches do not generalize well. Depending on the number of environmental conditions involved, many different models will be necessary, and it may not be clear which should be used for the edge cases between models.

input : Two-layer matrix R, which is the output of a transform (or post-processing algorithm), with max-time rows each of which has d coefficients for the corresponding scales
      : Baseline vector BL, which has d values
output: Two-layer matrix BD, with max-time rows each of which has d values showing how far above or below the baseline the corresponding scale and time are

1   Initialize BD as a two-layer matrix with the same dimensions as R ;
2   for t ← 0 to max-time − 1 do
3       for s ← 0 to d − 1 do
4           BD[t][s] ← R[t][s] − BL[s] ;

Algorithm 5.3: Baseline-differencing

To resolve this, we propose that for each environmental condition, a baseline be

found showing what constitutes "normal" for this environment, and then the values from this baseline be subtracted from all individual data points in this environment to normalize them. This is the baseline-differencing algorithm presented in Algorithm 5.3. This algorithm takes a data file which has a known environmental condition but an unknown state and pairs it with a baseline which is known to be from the "healthy" state of that environmental condition. Then, for each instance in the data file, the difference between the "raw" value and the baseline value is found for all of the scales, and these differences are used as the values for downstream calculations. It is important to note that in practice, this differencing is applied to all data taken from that environmental condition, regardless of whether that data is believed to be in the "baseline" state or not: once the baseline has been found, all data is processed identically.

input : Two-layer matrix R, which is the output of a transform (or post-processing algorithm), with max-time rows each of which has d coefficients for the corresponding scales
output: Baseline vector BL, which has d values

1   Initialize BL as a vector of length d ;
2   for s ← 0 to d − 1 do
3       BL[s] ← ( \sum_{t=0}^{max-time−1} R[t][s] ) ÷ max-time ;

Algorithm 5.4: Creating a baseline

Ideally, one would want to generate the baseline using nothing more than the environmental conditions themselves. This would permit a new baseline to be generated for each condition, avoiding the problem of finding a new baseline for each environment that the operator thinks will become relevant (and somehow interpolating for intermediate conditions). Such automatic generation will be explored in future work. However, for this study, we find our baseline values by computing the arithmetic mean of all values for each coefficient across the data from the baseline class, as demonstrated in Algorithm 5.4. Since the data is assumed to show what the

machine looks like in the baseline state, by finding the average of all instances, we find the baseline values against which all others should be compared.
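Algorithms 5.3 and 5.4 together amount to a per-scale column mean followed by a column-wise subtraction, which can be sketched as (hypothetical helper names):

```python
def make_baseline(R):
    # Algorithm 5.4: per-scale arithmetic mean over the "healthy" data.
    return [sum(row[s] for row in R) / len(R) for s in range(len(R[0]))]

def baseline_difference(R, BL):
    # Algorithm 5.3: subtract the baseline from every instance.
    return [[row[s] - BL[s] for s in range(len(BL))] for row in R]
```

In deployment, `make_baseline` would be run once on known-healthy data for a given environmental condition, and `baseline_difference` applied to all subsequent data from that condition.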

Chapter 6

Datasets

Two very different experimental setups were used to generate data for the case studies in this dissertation. These were a household box fan, described further in Section 6.1, and a dynamometer testbed, described in Section 6.3. In both cases, data was gathered in a number of different operating conditions, one of which was designated "normal." All models were built by combining data from the normal condition and one of the abnormal conditions and performing binary classification.

6.1 FAN EXPERIMENTS

The first set of case studies examined in this work employs data recorded from a household box fan using two separate experimental setups. While this system lacks many of the properties of submerged turbines, it is sufficient for demonstrating the analysis of vibration data from rotating machinery operating in a variety of modes. To acquire data from the fan, two accelerometers were attached to its case. These were located at opposite sides of the fan's top edge, to the left and right of the fan's axle. Each accelerometer provided a separate channel of data, and these were referred to as channels 1 and 2 (denoted CH1 and CH2). Using this hardware, two different sets of experiments were performed, each examining the fan in both baseline (not faulty) and various abnormal (faulty) states. In the first experiment, the fan

was run at 1010 RPM, and four different experimental conditions were employed: fan operating normally; fan tilted against a hard surface (e.g., a wall), denoted TOW; fan tilted against a soft surface (e.g., a hand), denoted TOH; and fan slowed by having an obstruction (e.g., a pencil) inserted into the blades, denoted SWO. Each of these conditions was tested with six separate runs. Each run, or burst, lasted 3 seconds, and data was collected with a resolution of 1000 Hz, so a total of 3000 samples were collected per burst.¹ This resulted in 2 accelerometers × 4 conditions × 6 bursts = 48 data files with 3000 lines each. In the second experiment, the fan was run at 420 RPM, and the states used were unperturbed (baseline), two different types of obstructions (where the inserted object was only placed partially into the fan, lightly obstructing the blades, and where it was placed in farther, heavily obstructing them; these were referred to as LO and HO, respectively), and with percussion (taps on the side of the fan by the user) to simulate an ocean turbine being jostled by sharks, denoted PP. Here, four runs were conducted for each condition, again with a resolution of 1000 Hz but now only lasting 1 second. Thus, given the 2 accelerometers, 4 conditions, and 4 bursts, there were 32 data files with 1000 lines each.

6.2 FAN TRANSFORMATION PARAMETERS

Of the vibration analysis algorithms discussed and introduced in this work (STFT, STWT, SWT, STWPD, and SWPD), all but the STWT were applied to the aforementioned fan data. The following sections discuss the specific parameters used in the course of these transformations.

¹ Use of bursts in this study was necessary due to bandwidth and storage limitations. None of the included algorithms require data to be in bursts. In fact, continuous data streams alleviate the difficulties faced in handling the initial portions of data streams for the two streaming algorithms, SWT and SWPD.

6.2.1 Short-Time Fourier Transform

The Short-Time Fourier Transform (STFT) employed for our case studies used a window size of 512 (i.e., slightly more than half a second, since with a resolution of 1 kHz each instance represents 1/1000th of a second), and the transform was performed as outlined in Algorithm 3.1. Note that overlapping windows were used: the ith window consists of all time points (instances) from point i to point i + 512. Thus, if there were N instances in the burst, a total of N − 512 windows would be created, since the last such window would include the final 512 instances from the dataset. For the STFT, the Hann window was chosen. This was based on preliminary ad-hoc analysis comparing the rectangular, Hann, and Hamming windows. Future research will consider additional windowing functions. The Fourier transform implementation used in this work was FFTW, an open-source FFT library written in C [25]. This implementation is designed to dynamically optimize its choice of FFT algorithms based on the current system's architecture, and thus is a good choice for performing the STFT, where many Fourier transforms are performed sequentially.
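For illustration only (the case studies use FFTW with a window of 512, not this toy transform), a single Hann-windowed frame of a naive O(w²) DFT can be computed as:

```python
import math
import cmath

def stft_frame(signal, start, w=8):
    # One Hann-windowed DFT frame starting at `start`; returns the
    # magnitudes for the non-negative frequency bins 0..w/2. A small
    # default w keeps the naive DFT fast; real use would take an FFT.
    hann = [0.5 * (1 - math.cos(2 * math.pi * n / w)) for n in range(w)]
    frame = [signal[start + n] * hann[n] for n in range(w)]
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / w)
                    for n in range(w)))
            for k in range(w // 2 + 1)]
```

Sliding `start` forward one sample at a time reproduces the overlapping-window scheme described above.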

6.2.2 Streaming Wavelet Transform

The first wavelet technique examined was the streaming wavelet transform (SWT). The SWT technique is actually performed in two steps: first the scalogram is created using Algorithm 4.3, and then the detected scales file is created using Algorithm 5.1. When performing this transform, the depth (value of d) must be chosen to specify the length of the longest wave which can be examined (in particular, 2^d). For the first experiment, with a total of 3000 instances per burst (3 seconds sampled at 1 kHz), a depth of 10 was chosen. This is because, while the algorithm itself will detect half-waves (either a rising or falling section) of length 2^10 = 1024, the streaming wavelet interpretation algorithm needs to see two of these to detect a wave, and thus the actual maximum wavelength is 2048, the largest power of 2 less than 3000. Since the second experiment only has a total of 1000 instances per burst (the same 1 kHz resolution for only 1 second), a depth of 10 would be too great, and so 9 was chosen. Following the SWT, the Scale Detection algorithm discussed in Section 5.1 was employed. This was necessary to convert the raw data from a form which only shows where the Haar wavelet has positive and negative alignment with the original signal into a form which shows which frequency bands have waves detected. Note also that because the SWT algorithm cannot detect waves of a given wavelength until at least that much time has elapsed since the start of the burst, the earlier instances in each burst will have the value 0 (no wave detected) for the longer wavelengths. Windowing, with a window size of 100, was employed to alleviate this. In addition, since even the smallest wave to be detected has a size of 2, the algorithm only produces a new output for every other time point in the input. Thus there are only half as many points in the output of the SWT as there are in the input (1500 for the first experiment and 500 for the second). When performing the second half of the SWT approach (i.e., Scale Detection), values must be chosen for p and q to decide both the number of transitions examined to find sign changes (p) and the target number of changes which must be found (q) to predict a wave. For the first case study (described in Section 7.1), these values are varied to discover optimal parameters, and in the subsequent studies, only the optimal values of p and q for each dataset are reported. In the first fan experiment, these optimal values were 2 and 1, respectively, while in the second fan experiment, they were 3 and 3.

6.2.3 Short-Time Wavelet Packet Decomposition

The first of the wavelet packet decomposition-based techniques is a windowing approach referred to as short-time wavelet packet decomposition, or STWPD (as discussed in Algorithm 4.4). Here, windows of size 512 (i.e., slightly more than half a second, since with a resolution of 1 kHz each instance represents 1/1000th of a second) were applied over the vibration data, and wavelet packet decomposition was performed on these windows. Note that overlapping windows were used: the ith window consists of all time points (instances) from point i to point i + 512. Thus, if there were n instances in the burst, a total of n − 512 windows would be created, since the last such window would include the final 512 instances from the dataset. On each of these windows, a sixth-level Haar wavelet packet decomposition was performed. This means that the Haar wavelet approximation and detail functions were applied to the signal (original data) and then applied recursively to the outputs from the previous level, giving a total of six levels. The value six was chosen because, while little guidance exists in the literature (most works suggest finding the best value empirically), a value in the range of four to seven was found to work on a number of datasets [21, 34, 54, 71, 101, 104]. Future work will examine values other than six. The final level of this decomposition consisted of 64 separate vectors, each containing 8 coefficients (since the window size was 512 and 512/64 = 8). To get a single value for each vector, the wavelet packet component energies were calculated by summing the squared coefficients for each element in each vector. These 64 sum-of-squares values were then used as features for building a classifier.
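The component-energy step at the end of this procedure can be sketched as (hypothetical helper name):

```python
def component_energies(leaf_vectors):
    # One energy per final-level vector: the sum of squared coefficients.
    # For a depth-6 WPD of a 512-sample window, this yields 64 features
    # (64 leaf vectors of 8 coefficients each).
    return [sum(c * c for c in vec) for vec in leaf_vectors]
```

Squaring makes strong negative alignments count the same as strong positive ones, which is why energies rather than raw coefficients are used as features here.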

6.2.4 Streaming Wavelet Packet Decomposition

The second approach based on wavelet packet decomposition was the streaming wavelet packet decomposition (SWPD) discussed in Algorithm 4.5. For this particular data, the depth of the decomposition was chosen based on the total number of instances in each dataset. For the first experiment, with 3000 instances per burst, a depth of 11 was chosen; for the second experiment, with only 1000 instances per burst, a depth of 9 was used. These values were selected because the depth of the decomposition directly relates to the size of the largest wave which can be detected. In particular, the largest wave that can be detected with a dth-level decomposition is 2^d. Thus, with d = 11, the largest wave is one of size 2048, while for d = 9, the largest wave has size 512. Since these are the largest powers of two which are less than 3000 and 1000, respectively, these choices of d were used for each dataset. Due to the nature of SWPD, all information from transforms with a depth less than d is found in the transform of depth d. Nonetheless, future work will consider the effects of smaller values of d. SWPD results in an output dataset with 2^(d+1) − 2 features. This is because rather than only examining the coefficients on the bottom, dth level of the decomposition, the results include coefficients from every single vector (recall that a complete binary tree of depth d has 2^(d+1) − 1 nodes). This number is decremented one more time because the root node, which contains the original signal, is not used to produce an output. Notice that not every instance in the output will have a coefficient for each of the 2^(d+1) − 2 features. This is because at each level, a vector only first gains a coefficient when enough time has passed for the algorithm to find waves of that length. In fact, the very first instance in each dataset has zero real coefficients, since even waves of size 2 have yet to become visible. As with the SWT above, these early,

incomplete instances may have a detrimental effect on classification performance.

6.3 DYNAMOMETER DATA

The final case study discussed in this dissertation employs data from a full-scale dynamometer prototype [19] which shares the behavior of an unsupervised ocean turbine. This machine consists of a drive motor (simulating the motion of propellers driven by ocean current) connected via a shaft to a generator (simulating the power-generation aspect of an ocean turbine). By varying the speed of the drive motor and the resistive load placed on the generator, the dynamometer allows for simulating varying possible operating conditions for the turbine. In addition to the motor and generator driving the rotational motion in the machine, the dynamometer has a number of vibration sensors placed at key points to provide state information. Four of these are low-frequency accelerometers, placed towards the drive motor end (as well as in the middle, along the shaft), while two are high-frequency accelerometers placed around the gearbox at the generator end of the machine. All accelerometers collect data at 5000 Hz (that is, 5000 samples per second). It is these accelerometers which produce the vibration signals which are then processed and interpreted for the experiments. Two separate experiments were performed on the dynamometer, with the aim of determining the machine's resistive load based on its vibration, with the drive motor's RPM taken as known. The resistive load and RPM used varied between the experiments. In the first experiment, it was assumed that a load of 60% is "normal," and vibration data gathered when the machine is in this state was used to generate a baseline. The abnormal states were taken to be other levels of resistance, specifically 30%, 45%, 75%, and 90%. For the second experiment, the normal (i.e., baseline) load

was taken as 45%, while the abnormal loads were 60% and 90%. In both cases, because we employed binary classification, each model is trained to distinguish between the normal (60% for the first experiment, 45% for the second) load level and one of the abnormal levels. The multiclass problem will be considered in future work. We varied the choice of "normal" load between experiments to demonstrate that our results are not specific to any one choice. In addition to varying the load, different levels of motor speed (RPM) were considered. For the first experiment, the levels were 545, 654, and 763, while the second experiment uses 545, 654, and 1090. These were not used as separate classes, however. Rather, these are considered to be environmental conditions. In the real-world case, the environmental conditions will always be known, and the goal of the learning algorithm will be to determine the system state given the environmental condition. For our tests, all models were built on the 545 RPM data, and models were tested on either the 654 RPM data or the third RPM (763 for the first experiment, 1090 for the second) data. Two different RPM levels were used for the "third" dataset due to different conditions when acquiring the two datasets. For each combination of load and RPM, one 32-second burst of data was recorded from the six accelerometers. The 32 seconds of data from each sensor and combination of conditions were then broken up into eight 4-second bursts for ease of processing. The STWT algorithm was performed on each of these 4-second bursts separately (and for each of the six accelerometer channels separately). Because the first experiment uses five loads (30%, 45%, 60%, 75%, and 90%) and three RPM levels (545, 654, and 763), there were 720 data files from this experiment. The second experiment, with its three loads (45%, 60%, and 90%) and three RPM levels (545, 654, and 1090), had 432 data files.

6.4 DYNAMOMETER TRANSFORMATION PARAMETERS

For the case study employing dynamometer data, the STWT algorithm (as shown in Algorithm 4.2) was used to preprocess all data received from the accelerometers. When using this algorithm, the depth parameter d must be chosen; this determines both the resolution of the transform and the length of the window. We chose a depth of 12 because this produces a window of 2^12 = 4096 samples, which at our 5000 Hz data source means the window is approximately 0.8 seconds long. Any deeper (e.g., d = 13, giving a window roughly 1.6 seconds long) and the window takes up too much of the 4-second bursts, resulting in fewer distinct windows which could be used (since only times from t = 0 to t = max-time − window-size are valid start points for windows). Because we use d = 12, for each instance (window) we have 13 features, each representing a different scale from lowest to highest (with the first feature being the "zeroth" scale in the sense that it represents the average across the entire window).
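The depth/window trade-off described above can be made concrete with a short sketch (the constant and helper names are ours, following the figures in the text):

```python
# Sketch of the depth/window trade-off for the STWT parameters described
# above; names are ours, figures follow the text.
SAMPLE_RATE_HZ = 5000
BURST_SAMPLES = 4 * SAMPLE_RATE_HZ      # each 4-second burst

def window_stats(depth):
    window = 2 ** depth                 # window length in samples
    seconds = window / SAMPLE_RATE_HZ   # window duration
    n_features = depth + 1              # one feature per scale, plus the "zeroth"
    # valid window start points: t = 0 .. max_time - window_size
    valid_starts = BURST_SAMPLES - window + 1
    return window, seconds, n_features, valid_starts

print(window_stats(12))  # (4096, 0.8192, 13, 15905)
print(window_stats(13))  # (8192, 1.6384, 14, 11809) -- far fewer usable windows
```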

Chapter 7

Case Studies

To determine the effectiveness of the discussed and proposed algorithms on real-world data, a number of case studies were performed. These were conducted over the course of our research, with parameters, transforms, and other experimental conditions (such as the choices of classification algorithm and dataset) evolving as the work progressed. This work began with Case Study One, which introduced the Streaming Wavelet Transform (SWT) and used data from the fan experiments (discussed in Section 6.1) to determine the optimal parameters for the SWT. The following case study, Case Study Two, compared the SWT results with those using the two wavelet packet decomposition-based approaches: Short-Time Wavelet Packet Decomposition (STWPD) and Streaming Wavelet Packet Decomposition (SWPD). The third case study considered an alternate approach, the Short-Time Fourier Transform (STFT), which is compared with the STWPD. Case Studies Four and Five considered ways of reducing the dimensionality of the wavelet packet decomposition output, first through feature selection (performed on both the STWPD and SWPD in Case Study Four), and then by using the feature selection results for automatic depth selection of the STWPD in Case Study Five. Finally, Case Study Six considered more realistic data (acquired from the dynamometer described in Section 6.3) and addresses the problem of building models which are tolerant of varying environmental

conditions, through the use of the baseline-differencing algorithm. Collectively, these case studies represent over two years of research and experimentation using different algorithms and approaches.

7.1 CASE STUDY ONE

The goal of this first case study [84], performed using the fan data discussed in Section 6.2, was to evaluate the streaming wavelet transform (SWT) and scale detection algorithms, and in particular to optimize the parameters for the scale detection algorithm. In addition, data windowing was employed for this experiment.

7.1.1 Parameter optimization

After creating the scalograms, the scale detection algorithm discussed in Algorithm 5.1 was used to find scales and times which contain oscillations. As noted in the algorithm, there are two free values which must be chosen when performing this analysis: p, the number of transitions to examine, and q, the minimum number of transitions which must include a sign change for an oscillation to be detected. Although the following Case Studies use the optimized parameters discussed in Section 6.2.2, this first Case Study was where these parameters were discovered. Thus, we explored a range of parameters: p ∈ {1, 2, 3, 4, 8}, and q starting equal to p and going down until it equaled 1 or p − 4, whichever happened first. This was done because unpublished pilot studies strongly suggested that smaller values for p were preferred. For the second fan experiment, we examined all values of p ranging from 2 to 8, and values of q ranging from 1 to p.
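The parameter grid described above can be enumerated mechanically; this sketch (the helper name and `floor_offset` parameter are ours) reproduces the first experiment's search space:

```python
# Sketch of the (p, q) search grid for the first fan experiment; the helper
# name and floor_offset parameter are ours.
def pq_grid(p_values, floor_offset=4):
    """For each p, let q run from p down to max(1, p - floor_offset)."""
    pairs = []
    for p in p_values:
        for q in range(p, max(1, p - floor_offset) - 1, -1):
            pairs.append((p, q))
    return pairs

grid = pq_grid([1, 2, 3, 4, 8])
print(len(grid))   # 15 (p, q) combinations
print(grid[-5:])   # the p = 8 pairs: q runs from 8 down to 4
```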

7.1.2 Classification

For this initial Case Study, twelve different classifiers were used: C4.5 decision trees with two different sets of parameter values (default and "normalized"), Naïve Bayes, Multilayer Perceptron, RIPPER, two k-Nearest Neighbor models (with k set to either 2 or 5), Support Vector Machines, two Random Forest models (one with default parameters and one with 100 trees), Radial Basis Function neural networks, and Logistic Regression. These learners are all discussed in greater detail in Section 3.1, including parameters which were changed from their default Weka implementation. This wide range of classifiers was tested due to the preliminary nature of this case study. In the remaining case studies, once the optimal set of learners was narrowed, a smaller set of classifiers was employed. For all experiments in this case study, five-fold cross-validation was used to create training and test datasets, and AUC was used as the performance metric. These are discussed in more detail in Section 3.3.
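As a hedged illustration of this evaluation protocol (the original experiments used Weka; the classifier choice, synthetic data, and scikit-learn API here are stand-ins), five-fold cross-validation with AUC scoring might look like:

```python
# Illustrative sketch only: the dissertation used Weka; the data here is
# synthetic and the classifier is a stand-in.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))       # e.g., 13 wavelet-scale features per window
y = np.array([0] * 50 + [1] * 50)    # baseline vs. abnormal state
X[y == 1] += 1.5                     # give the abnormal class a mean shift

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
print(round(aucs.mean(), 3))
```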

7.1.3 Experimental procedure

The experiments in this case study were conducted as follows: first, the original signal data was processed by the Streaming Wavelet Transform and Scale Detection algorithms, resulting in a collection of files (one for each combination of fan experiment, run number, class, and channel) with the detected scales for each point in time. These were then processed using the windowing approach discussed earlier, creating one data set for each fan experiment and class with the values for all windows in that experiment. To build the three separate binary classifiers per channel, three combined data sets were built, by (for each experiment and channel) combining all the runs from the baseline data set with the runs for the class of interest. This gave a

total of six data sets for each fan experiment (three abnormal classes × two channels).

p − q   p = 2       p = 3       p = 4       p = 8
0       0.9973912   0.9852202   0.9422787   0.7342919
1       0.9996832   0.9998036   0.9990993   0.9344293
2       –           0.9993748   0.9998382   0.9535954
3       –           –           0.9938027   0.9940203
4       –           –           –           0.9939642

(a) Channel 1

p − q   p = 2       p = 3       p = 4       p = 8
0       0.9995377   0.9957457   0.9861783   0.9136349
1       0.9999539   0.9998464   0.9997948   0.9560843
2       –           0.9997716   0.9998312   0.9805002
3       –           –           0.9860269   0.9976710
4       –           –           –           0.9999244

(b) Channel 2

Table 7.1: Case Study One, Fan Experiment One: AUC values averaged across all abnormal class datasets and all twelve classifiers, as p and q vary

7.1.4 Results

The results of the experiments are presented in Tables 7.1 and 7.2, which present the results of each fan experiment separately (with Table 7.1 containing the first fan experiment and Table 7.2 containing the second). Each subtable presents the results for a single choice of experiment and channel. All numbers are averaged over all three abnormal class data sets and all twelve classifiers. The tables show the performance of the classifiers when using the specified values of p and q to perform the scale detection algorithm. Note that the rows are indexed not directly by values of q, but rather by values of p − q; this was done to highlight the effects of q values as they come closer and

p − q   p = 2       p = 3       p = 4       p = 5       p = 6       p = 7       p = 8
0       0.9998748   0.9999376   0.9999614   0.9999199   0.9998712   0.9993372   0.9996146
1       0.9999556   0.9999370   0.9998290   0.9996799   0.9998942   0.9998463   0.9997574
2       –           0.9996902   0.9998270   0.9997410   0.9995928   0.9990012   0.9991270
3       –           –           0.9993006   0.9999544   0.9998210   0.9996946   0.9998894
4       –           –           –           0.9986566   0.9997093   0.9999348   0.9997153
5       –           –           –           –           0.9620894   0.9981024   0.9996850
6       –           –           –           –           –           0.8997543   0.9847578
7       –           –           –           –           –           –           0.8147661

(a) Channel 1

p − q   p = 2       p = 3       p = 4       p = 5       p = 6       p = 7       p = 8
0       0.9997800   0.9997863   0.9987589   0.9963524   0.9947563   0.9876337   0.9871252
1       0.9993542   0.9991751   0.9992837   0.9966872   0.9807406   0.9731641   0.9867416
2       –           0.9987892   0.9995076   0.9996096   0.9997576   0.9991757   0.9830107
3       –           –           0.9992319   0.9995282   0.9993140   0.9984770   0.9990459
4       –           –           –           0.9984456   0.9993200   0.9991124   0.9985326
5       –           –           –           –           0.9970434   0.9990807   0.9988774
6       –           –           –           –           –           0.9911959   0.9975649
7       –           –           –           –           –           –           0.9777743

(b) Channel 2

Table 7.2: Case Study One, Fan Experiment Two: AUC values averaged across all abnormal class datasets and all twelve classifiers, as p and q vary

closer to their respective p values. The best performance in each column is printed in boldface, and the best overall values can be found by comparing these boldface values. Examining the boldface (best) values, some trends emerge. For the first fan experiment, choosing p = 2 or p = 3, and p − q = 1 (that is, the (p, q) pairs (2, 1) and (3, 2)) were generally best, with (2, 1) in particular being a strong choice. The results were less clear for the second experiment, although p = 2 or p = 3 still showed the best results. Generally, the ideal q values were found to be half of their respective p values, but this was less stable for the second experiment, where results varied depending on the choice of channel. In each case, it was important to optimize the p and q values based on the data from the fan experiments. It is by no means certain whether these choices of p and q would be ideal for data with different levels of noise or with a significantly different sampling rate; for such experiments, it would be important to rerun these experiments and confirm which choices of p and q produced ideal classification results. The fan experiments here were presented both to show which choices of p and q worked best for one type of data and to show the process of discovering the best choices of p and q for an arbitrary set of data. Note that despite the fluctuations in optimal p and q values (especially in the second fan experiment), the average value of classifier performance was extremely high (nearly 1 in many cases). Thus we can see that the proposed strategy (including wavelet transformation, scale detection, data windowing, and classification) is quite effective at interpreting data such as this.

7.2 CASE STUDY TWO

This case study [91, 85] took the optimized parameters for the SWT discovered in Case Study One (specifically, p = 2 and q = 1 for the first fan experiment and p = 3 and q = 3 for the second experiment) and compared the results from this transform to the wavelet packet decomposition-based algorithms, STWPD and SWPD. All data were gathered from the two fan experiments discussed in Section 6.2. To avoid potential problems arising from noisy data in the SWT and SWPD algorithms (since these algorithms effectively examine each time point individually and do not combine more than one coefficient from each vector), an additional windowing step was applied, as discussed in Section 5.2. SWPD in particular was tested both with and without this windowing, to evaluate its influence on performance. In addition, although the separate windowing step discussed in Section 5.2 was not applied to the STWPD algorithm (because STWPD inherently involves windowing), STWPD is nonetheless labeled as windowed in our results.

7.2.1 Classification

For this study, classifiers were built to label instances as being in one of just two classes, comparing an abnormal experimental condition to the corresponding baseline state. Each of the two separate experiments has its own baseline, and the three conditions for each experiment ({ToW, ToH, SWO} and {LO, HO, PP}, respectively) are paired off individually with their baseline to form the experimental datasets. Future work will consider the multi-class problem, where a single classifier will consider all four classes at once. In this experiment, and based on the results of the first Case Study, the Naïve Bayes classifier was used to distinguish faulty states from the baseline state [98]. Five-fold cross-validation was employed to evaluate the models, and the AUC performance metric was used for calculating the performance. All of these techniques were discussed in Chapter 3.

7.2.2 Results

Table 7.3 compares the classification performance for SWT, STWPD, and SWPD. Each column contains the results of both experiments, with each case for each experiment labeled separately. The performance values are given in terms of AUC, and the highest values for a given combination of experimental condition and channel are designated in boldface. Observe that AUC values are extremely high across the board, with the lowest value being 0.97581. (On the SWT data, choosing different parameter values for the first experiment led to a minimum value of 0.97584, but as the results for those parameters were lower in general, they are not reported.) This shows the effectiveness of wavelet-based techniques for interpreting vibration data. Overall, the results for STWPD are higher than those for SWT and SWPD, but this is likely because both of the streaming approaches do not produce full instances until their maximum wavelength has passed, with the earlier instances being incomplete. Only the first few instances suffer from this problem, and the longer the burst being examined, the less of a problem this becomes. Nonetheless, all three techniques, and especially the wavelet packet decomposition-based techniques, are validated as effective for interpreting vibration data. Due to their similarities, the SWT and SWPD approaches can be more directly compared than either approach with STWPD. Thus, these are the best two cases to consider to observe the benefits of wavelet packet decomposition compared to wavelet transforms. Note that when considering only the windowed results for SWPD (because the SWT results have an identical window applied), we find that the lowest

Condition   Channel   Windowed?   SWT       STWPD     SWPD
ToW         1         No          –         –         0.99995
ToW         1         Yes         0.98678   1.00000   0.99982
ToW         2         No          –         –         0.99996
ToW         2         Yes         0.99598   1.00000   0.99999
ToH         1         No          –         –         0.99999
ToH         1         Yes         0.99770   1.00000   1.00000
ToH         2         No          –         –         0.99998
ToH         2         Yes         0.99326   1.00000   1.00000
SWO         1         No          –         –         0.99968
SWO         1         Yes         0.97581   0.99999   1.00000
SWO         2         No          –         –         0.99997
SWO         2         Yes         0.99799   1.00000   1.00000
LO          1         No          –         –         1.00000
LO          1         Yes         1.00000   1.00000   1.00000
LO          2         No          –         –         1.00000
LO          2         Yes         0.99976   1.00000   1.00000
HO          1         No          –         –         1.00000
HO          1         Yes         1.00000   1.00000   1.00000
HO          2         No          –         –         0.99998
HO          2         Yes         0.99954   1.00000   1.00000
PP          1         No          –         –         0.99942
PP          1         Yes         0.99994   1.00000   0.99999
PP          2         No          –         –         0.99921
PP          2         Yes         0.99999   1.00000   0.99998

Table 7.3: Case Study Two: AUC values based on SWT, SWPD, and STWPD transforms and NB learner, with each channel and abnormal condition considered separately

value for the first experiment is 0.99982 (for ToW, channel 1), while for the second experiment it is 0.99998 (for PP, channel 2). These are significant improvements over the SWT results, especially for the first experiment. In addition, the windowing step itself only improves the SWPD results noticeably in three cases: SWO channel 1, PP channel 1, and PP channel 2. In fact, in the ToW channel 1 case, removing windowing improves performance. This is important because the windowing process can add additional computational expense while introducing latency, as is seen with the STWPD. In contrast, SWPD can achieve very high performance even without this extra windowing step. Also, the SWT algorithm requires choosing internal parameters to maximize its performance, while with SWPD, the only choices are whether or not to apply windowing, and the depth of the transform. Note that since choosing too high a value for the depth of the SWPD only increases runtime without removing any of the output data, this value may be adapted to the available computing power. Thus, the SWPD algorithm is more useful for processing streaming data than the SWT algorithm.

7.3 CASE STUDY THREE

In this third case study [88, 89], the Short-Time Fourier Transform (STFT) is compared with the highest-performing transform from the previous Case Study, the short-time wavelet packet decomposition (STWPD). (Although the SWPD performed nearly as well without requiring windowing, for comparison with the already-windowed STFT, we chose STWPD based on its performance alone and despite its use of windowing.) As with the previous Case Study, these results were found using the data discussed in Section 6.2, with both fan datasets being employed.
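For reference, a minimal NumPy-only short-time Fourier transform can be sketched as follows; the window length, hop size, and test signal are illustrative, not the case study's exact settings:

```python
# A NumPy-only STFT sketch; window length and hop size are illustrative,
# not the case study's exact settings.
import numpy as np

def stft_magnitudes(signal, window=256, hop=128):
    """Magnitude spectra of Hann-windowed frames, one row per frame."""
    w = np.hanning(window)
    frames = [signal[i:i + window] * w
              for i in range(0, len(signal) - window + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

fs = 5000                                  # sampling rate used for the sensors
t = np.arange(2 * fs) / fs
sig = np.sin(2 * np.pi * 120 * t)          # a 120 Hz vibration component
mags = stft_magnitudes(sig)
peak_bin = mags.mean(axis=0).argmax()
print(peak_bin * fs / 256)                 # frequency of the strongest bin
```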

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToW         2         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         2         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         1         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         2         1.00000   1.00000   0.00000   0.00000   1.00000
LO          1         1.00000   1.00000   0.00000   0.00000   1.00000
LO          2         1.00000   1.00000   0.00000   0.00000   1.00000
HO          1         1.00000   1.00000   0.00000   0.00000   1.00000
HO          2         1.00000   1.00000   0.00000   0.00000   1.00000
PP          1         1.00000   1.00000   0.00000   0.00000   1.00000
PP          2         1.00000   1.00000   0.00000   0.00000   1.00000

Table 7.4: Case Study Three: AUC values based on STFT and NB learner, with each channel and abnormal condition considered separately

7.3.1 Classification

In this experiment, the Naïve Bayes (NB), five-Nearest Neighbor (5NN), and C4.5 decision tree (C4.5) classifiers were used. Five-fold cross-validation was employed, along with the TPR, TNR, FPR, FNR, and AUC performance metrics. Further details regarding these techniques may be found in Chapter 3.

7.3.2 Results

Tables 7.4, 7.5, 7.6, 7.7, 7.8, and 7.9 compare the classification performance for STFT and STWPD. Each table shows the results for one learner and transform, with the rows containing the different channels and experimental conditions and the columns having the different performance metrics. AUC values are presented in boldface if they are the highest (or tied for the highest) for a given combination of channel and experimental condition across all learners and both transforms.

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToW         2         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         2         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         1         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         2         1.00000   1.00000   0.00000   0.00000   1.00000
LO          1         0.24590   1.00000   0.00000   0.75410   1.00000
LO          2         0.25256   1.00000   0.00000   0.74744   1.00000
HO          1         0.19877   1.00000   0.00000   0.80123   1.00000
HO          2         0.12807   1.00000   0.00000   0.87193   1.00000
PP          1         0.50461   1.00000   0.00000   0.49539   1.00000
PP          2         0.48822   1.00000   0.00000   0.51178   1.00000

Table 7.5: Case Study Three: AUC values based on STFT and 5NN learner, with each channel and abnormal condition considered separately

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         0.99940   1.00000   0.00000   0.00060   0.99978
ToW         2         1.00000   0.99973   0.00027   0.00000   0.99995
ToH         1         0.99967   0.99973   0.00027   0.00034   0.99972
ToH         2         0.99987   0.99879   0.00121   0.00013   0.99956
SWO         1         0.99933   0.99973   0.00027   0.00067   0.99963
SWO         2         0.99960   0.99960   0.00040   0.00040   0.99967
LO          1         0.99590   0.99693   0.00307   0.00410   0.99650
LO          2         0.99795   1.00000   0.00000   0.00205   0.99872
HO          1         0.99949   0.99641   0.00359   0.00051   0.99796
HO          2         0.99641   0.99949   0.00051   0.00359   0.99759
PP          1         0.99949   1.00000   0.00000   0.00051   0.99974
PP          2         0.99795   0.99949   0.00051   0.00205   0.99869

Table 7.6: Case Study Three: AUC values based on STFT and C4.5 learner, with each channel and abnormal condition considered separately

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToW         2         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         2         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         1         0.99156   0.99993   0.00007   0.00844   0.99999
SWO         2         1.00000   1.00000   0.00000   0.00000   1.00000
LO          1         1.00000   1.00000   0.00000   0.00000   1.00000
LO          2         1.00000   1.00000   0.00000   0.00000   1.00000
HO          1         1.00000   1.00000   0.00000   0.00000   1.00000
HO          2         1.00000   1.00000   0.00000   0.00000   1.00000
PP          1         1.00000   1.00000   0.00000   0.00000   1.00000
PP          2         1.00000   1.00000   0.00000   0.00000   1.00000

Table 7.7: Case Study Three: AUC values based on STWPD and NB learner, with each channel and abnormal condition considered separately

Observe that when considering the AUC metric, both transforms were able to give perfect or near-perfect results on this data. The main exception is the STWPD transform with the C4.5 learner, which showed poor AUC performance with the ToW and SWO classes using data from the first accelerometer (channel 1). Despite this, however, both transforms generally give extremely good results with this metric, suggesting that either technique can be used to produce features which these classifiers can use to build reliable classification models. When considering the performance metrics other than AUC, there are some discrepancies. Notably, STFT performed extremely poorly on the TPR and FNR metrics when using the 5NN learner on the second set of experiments (LO, HO, and PP), with either channel. Despite these low scores, STFT nonetheless managed to have a perfect AUC in these same circumstances. One possible explanation is to consider the decision thresholds used for calculating TPR, FNR, and AUC. With TPR and

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToW         2         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         2         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         1         1.00000   1.00000   0.00000   0.00000   1.00000
SWO         2         1.00000   1.00000   0.00000   0.00000   1.00000
LO          1         1.00000   1.00000   0.00000   0.00000   1.00000
LO          2         1.00000   1.00000   0.00000   0.00000   1.00000
HO          1         1.00000   1.00000   0.00000   0.00000   1.00000
HO          2         1.00000   1.00000   0.00000   0.00000   1.00000
PP          1         1.00000   1.00000   0.00000   0.00000   1.00000
PP          2         1.00000   1.00000   0.00000   0.00000   1.00000

Table 7.8: Case Study Three: AUC values based on STWPD and 5NN learner, with each channel and abnormal condition considered separately

Condition   Channel   TPR       TNR       FPR       FNR       AUC
ToW         1         1.00000   0.04086   0.95914   0.00000   0.52335
ToW         2         0.99980   0.99960   0.00040   0.00020   0.99979
ToH         1         1.00000   1.00000   0.00000   0.00000   1.00000
ToH         2         0.99993   1.00000   0.00000   0.00007   0.99997
SWO         1         0.38438   0.99993   0.00007   0.61562   0.69834
SWO         2         1.00000   1.00000   0.00000   0.00000   1.00000
LO          1         1.00000   0.99949   0.00051   0.00000   0.99974
LO          2         1.00000   1.00000   0.00000   0.00000   1.00000
HO          1         1.00000   1.00000   0.00000   0.00000   1.00000
HO          2         1.00000   0.99949   0.00051   0.00000   0.99974
PP          1         1.00000   0.99949   0.00051   0.00000   0.99974
PP          2         1.00000   0.99949   0.00051   0.00000   0.99974

Table 7.9: Case Study Three: AUC values based on STWPD and C4.5 learner, with each channel and abnormal condition considered separately

FNR, the default decision threshold (0.5) is used: if the weighted sum for the nearest neighbors (recall this is the 5NN learner) from one class exceeds the sum for the other class, that class will be chosen. Effectively, the fraction (weight of positive class) / (weight of negative class) is considered, and if and only if this value exceeds 1, the positive class is selected. However, the AUC metric is not confined to this threshold; instead, it considers all possible values this fraction must exceed to select the positive class and examines how the TPR and FPR change in response to changes in this value. Thus, as long as there exists one threshold which leads to perfect classification, and changing this threshold does not decrease performance in nonintuitive ways (e.g., making the positive class more likely does not decrease the rate of true positives), then the classifier can still have a perfect AUC even with imperfect error rates observed at the default decision threshold. Previous research highlights the importance of considering the threshold before performing classification, and the risks of using the default decision threshold [77], although these risks are more often encountered in domains with imbalanced data (where the two classes do not have an equal number of instances). Because of this, the poor values for TPR and FNR seen in Table 7.5 do not necessarily indicate that STFT is unsuitable for classification. Overall, these results demonstrate that despite popular opinion, STFT can be just as effective a technique as STWPD for transforming vibration data prior to performing classification, given suitable datasets and classifiers. The issues with 5NN suggest that STFT may not work well with a default decision threshold, even when working with balanced data, but this should not hinder its use in real-world applications.
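A toy numeric illustration (synthetic scores, not the case-study data) makes the point: a scorer that ranks every positive instance above every negative one attains AUC = 1 even when most positive scores fall below the default 0.5 threshold, yielding a low TPR:

```python
# Toy illustration with synthetic scores: perfect ranking (AUC = 1) can
# coexist with a poor TPR at the default 0.5 decision threshold.
scores_neg = [0.05, 0.10, 0.15, 0.20]      # negative (baseline) instances
scores_pos = [0.30, 0.35, 0.40, 0.60]      # positive (abnormal) instances

# AUC equals the fraction of (positive, negative) pairs ranked correctly.
pairs = [(p, n) for p in scores_pos for n in scores_neg]
auc = sum(p > n for p, n in pairs) / len(pairs)

# TPR at the default threshold of 0.5: only scores >= 0.5 are called positive.
tpr = sum(s >= 0.5 for s in scores_pos) / len(scores_pos)
print(auc, tpr)  # 1.0 0.25
```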

7.4 CASE STUDY FOUR

Although the previous study demonstrated that STFT and STWPD can be effective transforms for building classification models, both produce large numbers of features, which may make them infeasible for real-time classification in systems with limited computational resources. Because of this, we chose to examine techniques for dimensionality reduction, specifically as it relates to our proposed wavelet decomposition-based transforms, STWPD and SWPD. The specific parameters used in this study resulted in 64 STWPD features for both fan experiments and either 4095 or 1023 SWPD features for the first and second fan experiments, respectively. In this case study [87, 86], we consider the use of feature selection to directly select those features which are most useful for building models. This case study used the data discussed in Section 6.2, with both fan datasets being employed.

7.4.1 Classifiers and Feature Selection

Three learning algorithms (classifiers) were chosen for this case study: Naïve Bayes (NB), 5-Nearest Neighbor (5NN), and C4.5 Decision Tree (C4.5). These were chosen because they represent a wide range of learner types, so our results could be more generalizable. Five-fold cross-validation was employed to evaluate the models. Two feature selection techniques, Information Gain (IG) and Signal-to-Noise (S2N), were used to reduce the number of features produced by the transforms, and results are reported in terms of AUC. All standard data mining and machine learning tools are discussed further in Chapter 3.
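Information Gain, for example, scores each (discretized) feature by how much it reduces the entropy of the class label; a minimal sketch (all names are ours, not Weka's) is:

```python
# Minimal information-gain ranker sketch; feature values are assumed to be
# discretized, and all names are ours rather than Weka's.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """H(class) - H(class | feature) for one discretized feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = [0, 0, 0, 1, 1, 1]
useful = [0, 0, 0, 1, 1, 1]   # perfectly predictive -> IG = 1 bit
noise  = [0, 0, 1, 1, 0, 0]   # uninformative       -> IG = 0 bits
print(info_gain(useful, labels), info_gain(noise, labels))
```

Ranking then amounts to sorting features by this score and keeping the top N.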

Experiment   Channel   All       IG-5      IG-10     IG-20     IG-50     S2N-5     S2N-10    S2N-20    S2N-50
SWO          CH1       0.89944   0.83345   0.83594   0.88854   0.89843   0.83361   0.83502   0.84312   0.89843
SWO          CH2       1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000
TOH          CH1       1.00000   0.99999   0.99999   0.99999   0.99999   1.00000   1.00000   1.00000   1.00000
TOH          CH2       0.99999   0.99999   0.99999   0.99999   0.99999   0.99999   0.99999   0.99999   0.99999
TOW          CH1       0.84112   0.83456   0.83541   0.83675   0.84013   0.83456   0.83541   0.83675   0.84114
TOW          CH2       0.99993   0.99995   0.99995   0.99995   0.99995   0.99995   0.99995   0.99995   0.99995
LO           CH1       0.99991   1.00000∗  1.00000   1.00000   0.99991   0.99991   0.99991   0.99991   0.99991
LO           CH2       1.00000   0.99974   0.99991   0.99991   1.00000   0.99991   1.00000   1.00000   1.00000
HO           CH1       1.00000   0.99983   1.00000   1.00000   1.00000   0.99991   0.99991   0.99991   1.00000
HO           CH2       0.99991   0.99991   0.99991   0.99991   0.99991   0.99991   0.99991   1.00000   0.99991
PP           CH1       0.99991   0.99983   0.99991   0.99991   0.99991   0.99983   0.99991   0.99991   0.99991
PP           CH2       0.99991   0.99991   0.99991   0.99991   0.99991   0.99991   0.99991   0.99991   0.99991

Table 7.10: Case Study Four, STWPD: AUC values based on STWPD, averaged across all learners and with each channel, abnormal condition, and choice of feature selection considered separately

Experiment   Channel   All       IG-5      IG-10     IG-20     IG-50     S2N-5     S2N-10    S2N-20    S2N-50
SWO          CH1       0.93677   0.84222   0.85493   0.88256   0.94770   0.85502   0.91053   0.93556   0.94215
SWO          CH2       0.93681   0.91212   0.95573   0.98592   0.99938   0.94181   0.94171   0.94172   0.94173
TOH          CH1       0.98725   0.99072   0.99690   0.99866   0.99990   0.93508   0.94218   0.92917   0.94219
TOH          CH2       0.97651   0.97010   0.97087   0.99504   0.99988   0.91917   0.93553   0.94179   0.94231
TOW          CH1       0.96476   0.94797   0.96263   0.98644   0.99830   0.93551   0.93572   0.94230   0.94231
TOW          CH2       0.94182   0.96383   0.96473   0.98640   0.99762   0.92755   0.94178   0.94189   0.94191
LO           CH1       0.94367   0.99960   0.99988   0.99981   0.99972   0.87025   0.86944   0.87015   0.87015
LO           CH2       0.94180   0.99979   0.99984   0.99977   0.99958   0.86944   0.86944   0.87015   0.87015
HO           CH1       0.94105   0.99987   0.99989   0.99978   0.99984   0.87051   0.86975   0.87025   0.86896
HO           CH2       0.93503   0.99966   0.99986   0.99978   0.99946   0.86896   0.87025   0.87025   0.87025
PP           CH1       0.94382   0.98407   0.99534   0.99719   0.99761   0.86919   0.87015   0.87015   0.87015
PP           CH2       0.93822   0.98608   0.99297   0.99693   0.99713   0.86975   0.86986   0.87075   0.87015

Table 7.11: Case Study Four, SWPD: AUC values based on SWPD, averaged across all learners and with each channel, abnormal condition, and choice of feature selection considered separately

7.4.2 Results

The STWPD and SWPD results are presented in Tables 7.10 and 7.11, respectively. Each table contains the results of both fan experiments, and for each channel (accelerometer) within the experiments. To simplify the tables, the results from all three classifiers are averaged together (except in the LO, CH1, IG, 5 features, short-time case, denoted with ∗, where the C4.5 results were not available). The results both without feature selection (i.e., using all features) and using only the top 5, 10, 20, or 50 features from the IG or S2N feature rankers are shown. Recall that for STWPD, both fan experiments have 64 features, while for SWPD, the first and second experiments (which are presented in the first and last three rows of each table) have 4095 and 1023 features, respectively. The best results for each combination of experimental condition, channel, and transform are highlighted with boldface. Despite a greatly reduced number of features, the classification performance is not severely impacted, and in the case of SWPD, shows an overall improvement. This demonstrates the effectiveness of feature selection to improve (or at least not hurt) the results of the STWPD and SWPD algorithms while significantly reducing the complexity of the resulting models (due to their having vastly fewer features). Examining the data more closely, we see that for STWPD, a larger number of features tended to improve classification performance, with the results for 50 features coming closest to achieving the accuracy of the full set. This is not entirely impressive, as 50 features comprise over three quarters of the 64 original features. In almost no cases did the reduced-feature models outperform the all-features model, and in those few cases the performance gains were minimal. However, in the cases where the all-features model produced nearly perfect results, the five-features models (using either feature ranker) also matched these results. In the two cases where the all-features

model did not perform perfectly (channel one for both SWO and TOW), the models built using few features achieved AUCs of around 0.835, with the performance steadily approaching that of the all-features model as the number of features was increased. With the SWPD data, more intriguing results may be found. First of all, S2N performs poorly with the second experiment (LO/HO/PP), never achieving an AUC value above 0.9. This dataset has many features (1023 in total), and it appears S2N is not able to find even the 50 most important features for building a classification model. With the first fan experiment, these results are not as clear; while IG generally has better performance than S2N (except for SWO), S2N is not that bad either, despite this experiment having over four times as many features as the second (a total of 4095). Also varying between experiments is the trend of performance as the number of features is increased. For the first experiment, increasing the number of features vastly improves the performance; for the second, however, the number of features does not have a significant effect on performance. In any event, one conclusion is clear across both fan experiments: using IG for feature ranking produces significantly better classification performance than S2N or (in many cases) no feature selection whatsoever.

7.5 CASE STUDY FIVE

Although Case Study Four demonstrates one form of dimensionality reduction (direct feature selection), we developed and demonstrate a second technique: automatic depth selection through feature selection. This case study [90] demonstrates this technique using the SWPD-transformed fan data, as described in Section 6.2.4.

Experiment   Channel   CS    IG
TOW          CH1       119   93
TOW          CH2       118   118
TOH          CH1       118   103
TOH          CH2       109   109
SWO          CH1       121   121
SWO          CH2       108   105
LO           CH1       61    61
LO           CH2       58    58
HO           CH1       61    61
HO           CH2       60    60
PP           CH1       61    61
PP           CH2       58    58

Table 7.12: Case Study Five: Largest Selected Features by Experiment, Channel, and Ranker, for N = 30

7.5.1 Feature Ranking for Depth Selection

For each fan experiment, there are a total of six datasets (three different types of “abnormal” class and two different channels/accelerometers). A preliminary SWPD was computed for each dataset using d values of 11 and 9 for the first and second experiment, respectively; these are the maximum depths possible given each experiment’s burst size. Feature rankings computed on these preliminary transforms were then used with the algorithm discussed in Section 5.3 to find the most appropriate depth of transformation (i.e., value of d). One sample of this output is presented in Table 7.12, which shows the largest (i.e., furthest into the tree) feature for each combination of dataset and feature ranker when N is 30. Note that for the first experiment, none of the top 30 features for any combination of experiment, channel, or ranker is larger than 126 (= 2^(6+1) − 2), while for the second experiment, no value exceeds 62 (= 2^(5+1) − 2). Thus, a depth of 6 was chosen for the first fan experiment and a depth of 5 for the second.
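The depth-selection rule just described can be sketched in a few lines. This assumes the wavelet packet features are numbered breadth-first from 1, so that depth k holds indices 2^k − 1 through 2^(k+1) − 2 (consistent with the bounds 126 = 2^(6+1) − 2 and 62 = 2^(5+1) − 2 quoted above); the function names are illustrative, not from the dissertation's code.

```python
def feature_depth(i):
    """Depth of 1-based packet-tree feature index i, where depth k covers
    indices 2**k - 1 .. 2**(k + 1) - 2."""
    return (i + 1).bit_length() - 1

def choose_depth(ranked_lists, n=30):
    """Transformation depth = depth of the deepest feature appearing in the
    top n of any ranking (one ranked index list per dataset/ranker)."""
    return feature_depth(max(max(lst[:n]) for lst in ranked_lists))
```

For example, `feature_depth(121) == 6` and `feature_depth(61) == 5`, matching the depths chosen for the two fan experiments from the largest entries in Table 7.12.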

      Largest   Fraction Found at Depth
N     Feature   1   2   3   4         5         6         7

5      89       0   0   0   0.18182   0.63636   0.18182   0
10     121      0   0   0   0.08824   0.47059   0.44118   0
15     121      0   0   0   0.06667   0.37778   0.55556   0
20     121      0   0   0   0.05556   0.37037   0.57407   0
25     121      0   0   0   0.05357   0.35714   0.58929   0
30     121      0   0   0   0.05263   0.35088   0.59649   0
35     121      0   0   0   0.05085   0.33898   0.61017   0
40     125      0   0   0   0.04615   0.32308   0.63077   0
50     125      0   0   0   0.05263   0.30263   0.64474   0
60     125      0   0   0   0.05000   0.28750   0.66250   0
75     244      0   0   0   0.05556   0.19841   0.49206   0.25397
100    253      0   0   0   0.05114   0.15341   0.35795   0.43750

Table 7.13: Case Study Five: Distribution of Features for Fan Experiment 1, when Selecting the Top N Features

To justify the claim that the choice of N is unimportant, Tables 7.13 and 7.14 show the distribution of chosen features (how many of the chosen features are found at each depth level), considered across all datasets and feature rankers, for each fan experiment and for different values of N. Levels containing a plurality of the features are printed in bold. In addition, the largest feature found for each N is shown. As can be seen, when N is between 5 and 35, the largest depth which still holds some top features does not change for either experiment; the largest depth only increases for N ≥ 40, and then only by one step. Note also that the SWPD includes data from both the chosen depth and all shallower levels (for example, choosing d = 5 includes all data from levels 1–5), so d = 5 still captures the majority of experiment two’s data up until N = 60, and d = 6 works for experiment one past N = 100. Thus, this technique can find an appropriate depth of transformation without requiring a careful choice of N.

      Largest   Fraction Found at Depth
N     Feature   1   2   3         4         5         6

5      42       0   0   0         0.73333   0.26667   0
10     60       0   0   0         0.55556   0.44444   0
15     61       0   0   0.02500   0.40000   0.57500   0
20     61       0   0   0.04348   0.34783   0.60870   0
25     61       0   0   0.04255   0.34043   0.61702   0
30     61       0   0   0.04167   0.33333   0.62500   0
35     61       0   0   0.04000   0.32000   0.64000   0
40     119      0   0   0.03774   0.30189   0.60377   0.05660
50     119      0   0   0.05747   0.18391   0.36782   0.39080
60     125      0   0   0.05310   0.14159   0.28319   0.52212
75     125      0   0   0.05085   0.13559   0.27119   0.54237
100    125      0   0   0.05085   0.13559   0.27119   0.54237

Table 7.14: Case Study Five: Distribution of Features for Fan Experiment 2, when Selecting the Top N Features
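The per-depth distributions and the cumulative-coverage argument above can be reproduced with a short sketch. The names are illustrative, and the same breadth-first indexing is assumed, where the depth of 1-based index i is floor(log2(i + 1)).

```python
def depth_distribution(features, max_depth):
    """Fraction of the given (1-based) feature indices found at each
    depth 1..max_depth, as in Tables 7.13 and 7.14."""
    depths = [(i + 1).bit_length() - 1 for i in features]
    return [depths.count(d) / len(depths) for d in range(1, max_depth + 1)]

def coverage_up_to(features, d):
    """Fraction of the given features captured by a depth-d transform,
    which includes depth d and all shallower levels."""
    captured = sum(1 for i in features if (i + 1).bit_length() - 1 <= d)
    return captured / len(features)
```

Running `coverage_up_to` over the selected features for increasing N is exactly the check used above to confirm that the chosen depth still captures the majority of the top-ranked information.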

7.5.2 Classification

Once all the data had been transformed both to the full depths (11 and 9 for fan experiments 1 and 2, respectively) and to the chosen depths (6 and 5, respectively), it was used to build classification models for predicting whether instances are in the normal or abnormal class. The Naïve Bayes (NB) classifier was used due to its simplicity, ease of operation, and accuracy, especially given its successful performance in the previous case studies. Following the model-building phase, five-fold cross-validation was used to evaluate the performance of the models, with AUC as the performance metric.
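The evaluation setup just described can be sketched with scikit-learn. This is an illustrative stand-in, not the dissertation's actual code: `GaussianNB` substitutes for the Naïve Bayes learner, and toy random data (with the 62-feature chosen-depth count from fan experiment two) stands in for the SWPD features.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy stand-in data: 100 instances, 62 features, with a small
# class-dependent mean shift so the classes are separable.
y = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 62)) + y[:, None] * 0.8

# Five-fold cross-validation with AUC as the performance metric.
aucs = cross_val_score(GaussianNB(), X, y, cv=5, scoring="roc_auc")
print(aucs.mean())
```

In the actual experiments, `X` would hold the SWPD features at either the full or the chosen depth, and the comparison in Table 7.15 is simply this procedure run on both versions of each dataset.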

7.5.3 Results

Table 7.15 shows the performance of the classifiers built using the SWPD data from both the full and the chosen depths. As with Table 7.12, results are broken down by dataset (specific abnormal class and channel number) within each fan experiment grouping.

Experiment  Channel   Full       Chosen

TOW         CH1      0.999954   0.997655
TOW         CH2      0.999964   0.998196
TOH         CH1      0.999988   1.000000
TOH         CH2      0.999980   0.999979
SWO         CH1      0.999675   0.962000
SWO         CH2      0.999967   0.999817
LO          CH1      1.000000   1.000000
LO          CH2      1.000000   1.000000
HO          CH1      0.999998   1.000000
HO          CH2      0.999983   1.000000
PP          CH1      0.999422   0.996927
PP          CH2      0.999210   0.994794

Table 7.15: Case Study Five: AUC values using the full and chosen depth, with each channel and abnormal condition considered separately

For each combination of experimental condition and channel, the best result (from either the full depth or the chosen depth) is bolded. As can be seen, the classification values were extremely high, showing very little loss of AUC even when using a depth chosen based on the top 30 features selected by the two feature rankers. Note that at the chosen depths, the experiments have 126 and 62 features, respectively, compared to the 4094 and 1022 found at the full depths. The lowest AUC value is 0.962000 (for the first fan experiment, the SWO class, and the first channel, when using the chosen depth), but this makes sense because the first experiment (with its longer burst size) had the greater potential for extremely large features which might be lost by choosing a depth of just 6. Nonetheless, even without those large features the performance was generally very good, showing that a carefully-chosen depth can enable good classification models without producing large quantities of data.

7.6 CASE STUDY SIX

In the final Case Study [82, 81], to explore the effect of environmental condition on model performance, we moved from the earlier fan-based data to data from the dynamometer, a more realistic model of an underwater turbine. In addition, for this case study we employed the Short-Time Wavelet Transform. This change was prompted by the large difference in scale between the fan data and the dynamometer data: the two fan experiments contained either 3000 or 1000 instances per burst (because bursts are 3 or 1 seconds in length, with a sampling frequency of 1000 Hz), while the dynamometer data has 20000 instances per burst (4 seconds of data at 5000 Hz), and there are eight bursts per combination of channel, load level, and motor RPM (for the fan data, there is only one burst per combination of channel and condition). In addition, six channels are studied rather than two. The STWT operates more efficiently than the STWPD or SWPD algorithms and produces a more selective set of features (for a d-level transform, only d + 1 features, rather than the 2^d or 2^(d+1) − 1 produced by STWPD and SWPD, respectively), and is thus better suited for large datasets. The specific parameters used for applying the STWT to dynamometer data are discussed in Section 6.4. In addition, the baseline-differencing algorithm introduced in Section 5.4 was applied to this data, and the results with and without baseline-differencing are compared.
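The feature-count comparison above can be made concrete. These helpers simply encode the counts stated in the text (d + 1 for STWT, 2^d for STWPD, 2^(d+1) − 1 for SWPD); the function names are illustrative.

```python
def stwt_features(d):
    """STWT: d detail bands plus one approximation band."""
    return d + 1

def stwpd_features(d):
    """STWPD: the 2^d leaves of a depth-d wavelet packet tree."""
    return 2 ** d

def swpd_features(d):
    """SWPD: all 2^(d+1) - 1 retained nodes of the packet tree."""
    return 2 ** (d + 1) - 1
```

At the depths used in the fan experiments, `swpd_features(11) == 4095` and `swpd_features(9) == 1023`, while an STWT of the same depths yields only 12 or 10 features; this gap is what makes the STWT practical for the much larger dynamometer datasets.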

7.6.1 Classification

For this experiment, five different types of models were used for classification: Naïve Bayes (NB) [76], C4.5 decision trees [98], Logistic Regression (LR) [98], Support Vector Machines (SVM) [52], and Multi-Layer Perceptrons (MLP) [98]. The AUC performance metric was used. In addition, rather than using cross-validation, we trained models on one environmental condition and tested them on another, as discussed in Section 3.3.2. For these experiments, RPM 545 was always treated as the “baseline” environmental condition, and the other RPM values (654 or 763 for the first dynamometer experiment, and 654 or 1090 for the second) were the environmental conditions used for testing.

Because so many models were built (for both the first and second dynamometer experiments), we present summarized results instead of reporting values for each combination of burst number and channel. In particular, the tables in the next section show how many models performed within specified performance ranges, so the shift in behavior with and without baseline-differencing may be observed. We also present the averages across burst and channel in the final column.
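The cross-condition evaluation just described (train on the baseline RPM, test on a different RPM) can be sketched as follows. This is an illustrative setup, not the dissertation's code: the function name, the scikit-learn learners, and the synthetic stand-in data are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def cross_condition_auc(train_data, test_data, learner=None):
    """Train on one environmental condition and score AUC on another.
    train_data/test_data: (X, y) pairs recorded at two different RPMs."""
    X_tr, y_tr = train_data
    X_te, y_te = test_data
    model = learner or MLPClassifier(max_iter=500, random_state=0)
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, scores)

# Toy demonstration: the test condition's features carry a small constant
# offset, standing in for the vibration shift caused by a different RPM.
rng = np.random.default_rng(1)
y_tr = np.array([0] * 40 + [1] * 40)
X_tr = rng.normal(size=(80, 4)) + y_tr[:, None] * 3.0
y_te = np.array([0] * 40 + [1] * 40)
X_te = rng.normal(size=(80, 4)) + y_te[:, None] * 3.0 + 0.2  # condition shift
auc = cross_condition_auc((X_tr, y_tr), (X_te, y_te))
```

In the actual experiments, `train_data` would come from RPM 545 and `test_data` from 654, 763, or 1090, with and without baseline-differencing applied first.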

7.6.2 Results

Each dynamometer experiment is presented separately in the following results. These are presented in terms of how many models match a given set of parameters (abnormal load level, classifier, use of baseline-differencing, and choice of RPM used for testing). In addition, the best average performance for a given combination of preprocessing (with or without baseline-differencing), test dataset, and abnormal class is emphasized in bold. There are a total of 48 models for each combination of parameters (due to the eight 4-second bursts and six channels), and each column is exclusive (e.g., the number in the ≥0.8 column is the number of models with AUC values ≥0.8 but not ≥0.9). The performance metric is AUC, both for computing how many models fall within each category and for finding the average performance across all models. There are additional rows designated with “All”: one for each of the abnormal load levels, and one final row where “All” replaces the load level. These collect all of the models in the relevant group: the All rows within each load level show the counts and average AUC for all models built with that load level (across all five learners), while the final All row contains the counts and average AUC for all models in that table (i.e., for that choice of test dataset and use of baseline-differencing). Because the All rows aggregate the results per load and overall, their counts sum to 240 (for the per-load All rows) or to either 960 or 480 (for the per-table All rows, for the first and second dynamometer experiments respectively).
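The exclusive binning used in the tables can be sketched as follows; the function name is illustrative, and the bin labels mirror the column headers.

```python
def bin_auc(auc_values):
    """Count models in the exclusive performance ranges used in the tables:
    =1, [0.9, 1), [0.8, 0.9), ..., [0.5, 0.6), and <0.5."""
    bins = {"=1": 0, ">=0.9": 0, ">=0.8": 0, ">=0.7": 0,
            ">=0.6": 0, ">=0.5": 0, "<0.5": 0}
    for a in auc_values:
        if a == 1.0:
            bins["=1"] += 1
        elif a >= 0.9:
            bins[">=0.9"] += 1
        elif a >= 0.8:
            bins[">=0.8"] += 1
        elif a >= 0.7:
            bins[">=0.7"] += 1
        elif a >= 0.6:
            bins[">=0.6"] += 1
        elif a >= 0.5:
            bins[">=0.5"] += 1
        else:
            bins["<0.5"] += 1
    return bins
```

Each row of Tables 7.16 through 7.23 is this histogram computed over the 48 models for one load/learner combination, plus the mean of the same 48 AUC values.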

First Dynamometer Experiment

Tables 7.16 through 7.19 show the results for the first dynamometer experiment. As can be seen, the difference in performance between models which use baseline-differencing and those which do not is dramatic. Using the data without baseline-differencing, most models fail to reach an AUC of 0.6. With the baseline-differenced data, however, far better models can be built. Considering the two test RPMs separately: when using 654 as the test RPM, an overall average AUC of 0.76495 is achieved with baseline-differencing, versus only 0.57141 without it. Likewise, when using RPM 763 as the test data, the average AUC is 0.80240 with baseline-differencing and 0.54537 without. This highlights the value of the baseline-differencing approach, and demonstrates that rather than performance dropping as the difference between the training and test RPMs grows (recall the training RPM is 545), with baseline-differencing the performance can actually improve as the training and test RPMs spread farther apart. This suggests that baseline-differencing remains useful even across large ranges of environmental operating conditions. Considering the four abnormal classes (load levels 30, 45, 75, and 90), it is useful to note that baseline-differencing improves results for each individually as well as across the board. However, this is not to say that all four abnormal classes act the same.

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

30    NB         0     0     0     0     0     1    47   0.17082
30    C4.5       0     1     2     1     1    41     2   0.53111
30    LR         0     9     5     5     7    21     1   0.68236
30    SVM        5    13     5     5     3    15     2   0.74754
30    MLP        7    20     6     6     1     3     5   0.83762
30    All       12    43    18    17    12    81    57   0.59389

45    NB         0     0     0     1     1     3    43   0.22288
45    C4.5       1     5     0     1     1    36     4   0.54930
45    LR         0     5     6     4     3    26     4   0.62189
45    SVM        3     8     6     3     3    21     4   0.67995
45    MLP        5    19     9     3     5     3     4   0.83363
45    All        9    37    21    12    13    89    59   0.58153

75    NB         0     0     0     3     2     3    40   0.36401
75    C4.5       0     1     0     2     0    38     7   0.47936
75    LR         1     4     0     2     6    33     2   0.56865
75    SVM        2     5     1     1     0    37     2   0.57390
75    MLP       17    11     4     0     2     6     8   0.78047
75    All       20    21     5     8    10   117    59   0.55328

90    NB         0     3     1     3     2    10    29   0.43519
90    C4.5       1     1     1     0     1    43     1   0.53176
90    LR         2     0     0     1     1    44     0   0.53004
90    SVM        1     2     0     0     1    44     0   0.53401
90    MLP       15     8     6     1     4     5     9   0.75376
90    All       19    14     8     5     9   146    39   0.55695

All   —         60   115    52    42    44   433   214   0.57141

Table 7.16: Case Study Six, Dynamometer Experiment One: Training on 545, Test on 654, No baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

30    NB         0    10    12     8     9     5     4   0.74215
30    C4.5       0    25    12     3     3     4     1   0.84738
30    LR         0    15    16    10     1     2     4   0.80368
30    SVM        1    20    16     6     3     0     2   0.84673
30    MLP       11    24     2     4     1     2     4   0.87397
30    All       12    94    58    31    17    13    15   0.82278

45    NB         0     9    13    13     6     5     2   0.76873
45    C4.5       0    24    16     1     2     3     2   0.86107
45    LR         0    12    20     8     0     3     5   0.78079
45    SVM        0    17    19     6     2     1     3   0.82987
45    MLP        8    23     6     3     4     1     3   0.89098
45    All        8    85    74    31    14    13    15   0.82629

75    NB         0    21     9     4     2     2    10   0.75742
75    C4.5       0    17    19     0     0     6     6   0.75301
75    LR         0     6     7     4     1     1    29   0.45210
75    SVM        1    23     4     3     2     4    11   0.73433
75    MLP       14    14     7     1     3     0     9   0.79005
75    All       15    81    46    12     8    13    65   0.69738

90    NB         0    35     2     9     0     1     1   0.90208
90    C4.5       1    20    25     0     0     1     1   0.88593
90    LR         0     4     1     2     0     0    41   0.30208
90    SVM        6    14     6     2     5     2    13   0.71364
90    MLP       14     9     7     2     3     3    10   0.76294
90    All       21    82    41    15     8     7    66   0.71333

All   —         56   342   219    89    47    46   161   0.76495

Table 7.17: Case Study Six, Dynamometer Experiment One: Training on 545, Test on 654, Data following baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

30    NB         0     1     0     4     9     6    28   0.41601
30    C4.5       0     0     0     0     1    41     6   0.48951
30    LR         0     4     3     2     2    35     2   0.56497
30    SVM        0     5     1     0     3    39     0   0.57137
30    MLP        1    13     7     2     7    16     2   0.71266
30    All        1    23    11     8    22   137    38   0.55090

45    NB         0     0     0     2     3     6    37   0.35666
45    C4.5       0     0     1     0     0    47     0   0.50792
45    LR         0     2     1     1     1    40     3   0.53570
45    SVM        0     3     1     0     4    37     3   0.54424
45    MLP        0     9     6     7     1    23     2   0.67035
45    All        0    14     9    10     9   153    45   0.52297

75    NB         0     0     4     2     3     7    32   0.45152
75    C4.5       0     0     0     0     0    46     2   0.49756
75    LR         0     0     4     1     2    40     1   0.54526
75    SVM        0     0     5     1     2    39     1   0.55196
75    MLP        5    10    10     5     1    16     1   0.76027
75    All        5    10    23     9     8   148    37   0.56131

90    NB         0     2     1     0     2     3    40   0.34347
90    C4.5       0     0     0     1     0    41     6   0.46459
90    LR         0     0     0     0     0    45     3   0.48190
90    SVM        0     0     0     0     1    47     0   0.50579
90    MLP       21    16     4     3     2     2     0   0.93573
90    All       21    18     5     4     5   138    49   0.54629

All   —         27    65    48    31    44   576   169   0.54537

Table 7.18: Case Study Six, Dynamometer Experiment One: Training on 545, Test on 763, No baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

30    NB         0     4     4    15    12     9     4   0.67693
30    C4.5       0     7    13    11     3     5     9   0.70585
30    LR         0    10    29     5     0     0     4   0.80380
30    SVM        0    14    21     7     4     0     2   0.81808
30    MLP        4    28     8     3     1     1     3   0.88376
30    All        4    63    75    41    20    15    22   0.77768

45    NB         0     5     6    17    10     8     2   0.71326
45    C4.5       0    11    18     5     3     6     5   0.75253
45    LR         0     5    29     9     2     2     1   0.81374
45    SVM        0    10    25     7     4     0     2   0.82196
45    MLP        1    35     3     4     2     3     0   0.91444
45    All        1    66    81    42    21    19    10   0.80319

75    NB         0     6     8    11     8     8     7   0.69089
75    C4.5       4     6    18     2     6     5     7   0.73927
75    LR         0     8    22     6     2     3     7   0.75410
75    SVM        0    20    17     4     1     4     2   0.84132
75    MLP       19    11     7     3     4     3     1   0.89288
75    All       23    51    72    26    21    23    24   0.78369

90    NB         0    17     9     6     7     6     3   0.79066
90    C4.5       0    12    23     4     0     6     3   0.79228
90    LR         0    17     9     3     4     5    10   0.70203
90    SVM        5    34     8     1     0     0     0   0.95128
90    MLP       36    11     0     0     1     0     0   0.98896
90    All       41    91    49    14    12    17    16   0.84504

All   —         69   271   277   123    74    74    72   0.80240

Table 7.19: Case Study Six, Dynamometer Experiment One: Training on 545, Test on 763, Data following baseline-differencing

In fact, there is a fair amount of random variation, with no easily-discernible patterns that hold across all abnormal classes, learners, test RPMs, and presence or absence of baseline-differencing. Nonetheless, these results demonstrate that the utility of baseline-differencing is not limited to a single choice of abnormal class. Viewed in terms of the optimal learner, it is clear that MLP performs best in the majority of circumstances (abnormal load level, test RPM, and presence or absence of baseline-differencing). The only exception is found in the load 90 section of Table 7.17, where MLP performs quite poorly, especially compared with NB and C4.5, which both perform well under the same combination of parameters. Empirically, this appears to have been caused by Channel 5 of the data giving particularly bad values in this case: the bottom four models for this collection of parameters all come from Channel 5, and the remaining four models involving Channel 5 are also within the bottom 16 models. Nonetheless, even in this particularly poor-performing combination of parameters, there are 14 perfect models built with MLP, twice as many as the remaining four learners combined, which suggests that selecting the top few models built with MLP is always a safe choice. Overall, the results from dynamometer experiment one show that baseline-differencing improves classification performance, both in terms of how many models perform well and in terms of the average AUC, and these results hold for a wide range of abnormal classes and test RPMs. Thus, from these results we would recommend the use of the MLP learner along with baseline-differencing for reliable state detection of unattended machinery in unstable and changing conditions.

Second Dynamometer Experiment

The results from the second dynamometer experiment are presented in Tables 7.20 through 7.23, as discussed above.

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

60    NB         0     0     2     1     3    11    31   0.43239
60    C4.5       0     0     0     1     2    33    12   0.47039
60    LR         1     3     1     1     1    33     8   0.49373
60    SVM        1     2     0     2     0    36     7   0.47265
60    MLP        9     5     2     4     5    11    12   0.60116
60    All       11    10     5     9    11   124    70   0.49407

90    NB         0     2     1     1     2    15    27   0.51229
90    C4.5       0     0     0     0     0    47     1   0.49293
90    LR         7     1     0     0     1    39     0   0.58556
90    SVM        2     6     1     0     1    38     0   0.59557
90    MLP       20    20     1     2     2     0     3   0.91591
90    All       29    29     3     3     6   139    31   0.62045

All   —         40    39     8    12    17   263   101   0.55726

Table 7.20: Case Study Six, Dynamometer Experiment Two: Training on 545, Test on 654, No baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

60    NB         0     8    14    11     7     6     2   0.75278
60    C4.5       0     5    11     6     2    14    10   0.64665
60    LR         0    11     9     4     3     1    20   0.52916
60    SVM        0    14     7     6     1     0    20   0.56625
60    MLP        4    14     8     3     2     1    16   0.62039
60    All        4    52    49    30    15    22    68   0.62305

90    NB         9    25    12     2     0     0     0   0.92312
90    C4.5       0    15    14     5     0     7     7   0.76788
90    LR        11     2     0     3     3     2    27   0.48057
90    SVM       21    16     1     0     0     0    10   0.83325
90    MLP       19    20     2     2     2     0     3   0.91399
90    All       60    78    29    12     5     9    47   0.78376

All   —         64   130    78    42    20    31   115   0.70340

Table 7.21: Case Study Six, Dynamometer Experiment Two: Training on 545, Test on 654, Data following baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

60    NB         0     2     4     7    10     6    19   0.51837
60    C4.5       0     0     0     0     1    40     7   0.46276
60    LR         0     0     0     0     0    45     3   0.48363
60    SVM        0     0     0     0     2    41     5   0.45322
60    MLP        0     1     1     2     2    34     8   0.46688
60    All        0     3     5     9    15   166    42   0.47697

90    NB         1     0     2     1     3     4    37   0.29061
90    C4.5       0     0     0     0     0    40     8   0.41667
90    LR         0     0     0     0     0    48     0   0.50000
90    SVM        0     0     0     0     0    48     0   0.50000
90    MLP       11     8     2     1     4    19     3   0.72143
90    All       12     8     4     2     7   159    48   0.48574

All   —         12    11     9    11    22   325    90   0.48136

Table 7.22: Case Study Six, Dynamometer Experiment Two: Training on 545, Test on 1090, No baseline-differencing

Load  Learner   =1  ≥0.9  ≥0.8  ≥0.7  ≥0.6  ≥0.5  <0.5  Average

60    NB         0     0     0     1     3     4    40   0.32121
60    C4.5       0     3     9     9     3    10    14   0.61457
60    LR         0     5    28     5     0     1     9   0.71551
60    SVM        0     8    25     4     2     0     9   0.69736
60    MLP       10    19     6     1     3     0     9   0.76996
60    All       10    35    68    20    11    15    81   0.62372

90    NB         7     4    13    12     7     2     3   0.78114
90    C4.5       0    15    24     0     0     4     5   0.81084
90    LR         0    40     7     1     0     0     0   0.95902
90    SVM       12    20     7     1     1     1     6   0.85779
90    MLP       40     1     1     1     1     2     2   0.94032
90    All       59    80    52    15     9     9    16   0.86982

All   —         69   115   120    35    20    24    97   0.74677

Table 7.23: Case Study Six, Dynamometer Experiment Two: Training on 545, Test on 1090, Data following baseline-differencing

Overall, these results show that using the baseline-differenced data improves the performance of models built on one RPM (545) and tested on another RPM (654 or 1090), compared to not using baseline-differencing. When using 654 as the test dataset, the baseline-differenced models have an average AUC of 0.70340, compared with 0.55726 for the non-baseline-differenced models; with 1090 as the test dataset, these values are 0.74677 (baseline-differencing) and 0.48136 (no baseline-differencing), respectively. The difference is even clearer when considering the number of models with a performance of ≥0.9 (counting both the =1 and ≥0.9 columns). Without baseline-differencing, only 79 of the 480 models tested on 654 data fall in this group, but with baseline-differencing, 194 do. Likewise, on the 1090 data, the number of models performing at or above 0.9 jumps from 23 to 184 when baseline-differencing is used. These results demonstrate that baseline-differencing yields better models on average, and that far more high-quality models can be found when it is applied.

Although baseline-differencing shows improvement overall, it is important to see how different factors such as test dataset, abnormal load, and choice of learner affect the relative performance of models built with and without it. The first observation is that baseline-differencing is far more important when the test dataset is 1090, compared to testing on 654. This makes sense because 654 is far closer to the training dataset (545) than 1090 is. Since 545 and 654 are relatively close, models built without baseline-differencing still show decent performance, and baseline-differencing is not as essential. When the operating condition (RPM) changes more dramatically, however, performance collapses without baseline-differencing, while the baseline-differenced data performs similarly on both test RPMs (greater average AUC on 1090, but fewer models at or above 0.9).
This underscores the circumstances in which baseline-differencing should be employed.

Considering the two possibilities for the abnormal class (load 60 and load 90), it is important to note that both show improved performance on an individual basis when used with baseline-differencing. For load 60, the improvement is 0.62305/0.49407 and 0.62372/0.47697 for test RPMs 654 and 1090, respectively (both in with/without baseline-differencing order), and for load 90, it is 0.78376/0.62045 and 0.86982/0.48574. An additional observation based on the load levels is that overall, load 90 has the highest performance values. This makes sense because the baseline class (the load which the models are trying to distinguish from loads 60 or 90) is 45, which is farthest from load 90. Nonetheless, the average AUC for load 90 using baseline-differencing is especially encouraging.

When comparing the five different learners, some show higher overall performance as well as better response to the baseline-differenced data. The best models overall are built using the MLP learner on load 90: with the 654 data, MLP gives an average AUC of 0.91399 with baseline-differencing and 0.91591 without, and with the 1090 data, it gives average AUC values of 0.94032 and 0.72143 with and without baseline-differencing, respectively. These values are not always the highest individual score for a given combination of learner and load level (e.g., for test RPM 1090 with baseline-differencing, LR exceeds MLP, while NB outperforms MLP when testing on RPM 654 with baseline-differencing on either abnormal load, and on RPM 1090 with no baseline-differencing on load 60). However, the other two candidates (LR and NB) show very poor performance outside of these specific instances, so MLP is the safer choice of classifier, especially in light of the results of the first dynamometer experiment, which also recommended MLP.
It should be noted that baseline-differencing only significantly improved the MLP results when 1090 was used as the test RPM, again demonstrating the importance of baseline-differencing when the environmental operating condition varies greatly.
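The baseline-differencing algorithm itself is defined in Section 5.4 and is not reproduced in this chapter; the sketch below captures only its general shape, under the assumption that a per-condition baseline feature vector (for example, the mean normal-state vector recorded at each RPM) is subtracted from every instance collected under that condition. The function name and signature are illustrative.

```python
import numpy as np

def baseline_difference(X, rpm_labels, baselines):
    """Subtract each operating condition's baseline feature vector from
    every instance recorded under that condition.
    X: (n_instances, n_features) feature matrix
    rpm_labels: array giving the condition (RPM) of each instance
    baselines: dict mapping condition -> (n_features,) baseline vector
    (conditions missing from `baselines` are left unfilled)."""
    out = np.empty_like(X, dtype=float)
    for rpm, base in baselines.items():
        mask = rpm_labels == rpm
        out[mask] = X[mask] - base
    return out
```

After this step, what remains in each feature is (ideally) only the deviation from normal behavior at that condition, which is why models trained at 545 RPM transfer to 654, 763, or 1090 RPM far better than on the raw features.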

Overall, these results lead us to recommend the use of the MLP learner, especially for use in conjunction with baseline-differencing when the operating condition shows significant change.

Chapter 8

Conclusions and Future Work

The focus of this research has been defining and fine-tuning vibration analysis techniques with the goal of building reliability models for remote ocean turbines. This is an important application because ocean currents are an exciting potential new energy resource, but the submerged turbines necessary to extract energy from these currents face unique reliability challenges. Unlike most waterborne devices, they will be left submerged and unattended for months or even years at a time (the ideal is one year of maintenance-free operation), and thus must be monitored remotely. Eventually, a whole fleet of turbines would be in operation to extract a useful portion of the current, which makes direct human examination of the numerous signals from each turbine impractical. A Machine Condition Monitoring/Prognostic Health Monitoring (MCM/PHM) system is thus necessary to perform automated maintenance activities, using both information about the present state (provided by the MCM system) and the predicted future state (provided by the PHM system) to adjust machine behavior, minimize damage, and alert human operators when maintenance is required.

The proposed MCM/PHM system would rely on many sensor signals from the turbine to predict its state, but vibration signals are among the most useful, especially for the prognostics portion of the system. Minor faults such as wear, lodged obstructions, and chipped components can change vibration signals before they cause severe enough consequences to trigger other sensor types, so vibration signals can allow operators to understand faults before they cause significant damage to the machine and the environment. Because vibrations are a type of waveform, transformation is necessary to convert raw vibration signals into a collection of frequency bins, which can then be used to build classification models for predicting state. Although previous work has examined vibration analysis for MCM/PHM in other application domains, this is the first thorough examination and comparison of different transforms, especially wavelet transforms, for the goal of ocean turbine MCM/PHM, with its demands for streaming and minimally computationally intensive techniques.

8.1 CONCLUSIONS

In the course of this work, we performed three major classes of algorithm development and analysis, grouped into Chapters 4, 5, and 7. First, we developed three novel wavelet-based transforms (the Streaming Wavelet Transform (SWT) [79], Short-Time Wavelet Packet Decomposition (STWPD) [85], and Streaming Wavelet Packet Decomposition (SWPD) [91]), and also modified an existing technique (the Short-Time Wavelet Transform (STWT) [81]) to make it appropriate for use in real-time streaming applications. Then, we introduced a number of new post-processing techniques, including a scale detection algorithm which can identify important frequency bands without sacrificing the streaming nature of wavelets, an approach for selecting the appropriate transformation depth based on actual results from the datasets, and a solution to the problem of models detecting vibration changes which are representative of their environmental condition rather than their internal state. Finally, we performed a number of case studies to evaluate and compare these algorithms and approaches.

From the case studies in Chapter 7, we can discern a number of conclusions. First of all, based on Case Studies Two and Three, we find that satisfactory classification results can be obtained on our fan data using any of the tested algorithms (Short-Time Fourier Transform (STFT) [88], SWT, STWPD, and SWPD), at least on certain combinations of experimental condition and channel and with certain choices of learner. Case Study Two in particular showed that STWPD allows for nearly perfect classification performance, while SWPD (with windowing) gives only slightly-less-than-perfect results for one combination of condition and channel. These results were found using Naïve Bayes, which was overall the best learner. In addition, Case Study Three found that STFT and STWPD gave perfect (or nearly perfect) AUC values when using the Naïve Bayes and 5-Nearest Neighbor learners, although we did find somewhat unusual results (very high error rates coupled with perfect AUC values) when applying the STFT transform and 5-Nearest Neighbor learner to the second set of experimental data.

Although Case Study Six focuses on data from a dynamometer rather than a fan, its results show good performance using the STWT algorithm, especially when baseline-differencing (a novel approach proposed in Section 5.4 for eliminating the environment-specific portion of the signal prior to building models) is applied. Specifically, the dynamometer is a full-scale mock-up of a turbine, with the only difference being the use of a motor to simulate the propeller blades. Using this dynamometer, where both the motor speed and load level can be independently varied, allows us to tackle the additional challenge of identifying system state even when the environmental condition (which influences the vibrations) varies.
Our proposed baseline-differencing algorithm is able to remove the effect of the environmental condition, ensuring that MCM/PHM remains effective even when the motor RPM varies by as much as 100%.

Collectively, the techniques presented in our case studies show comparable performance in terms of AUC, so other considerations become important when deciding which is appropriate for an ocean turbine MCM/PHM system. One challenge facing many vibration analysis techniques is their reliance on windowing, the process of examining data in finite chunks. Although windowing enables algorithms such as the Fourier transform to operate on streaming data, it restricts both the size of the lowest-frequency (longest-wavelength) visible wave and the temporal resolution for detecting when a higher-frequency wave has changed its behavior. Two of our proposed techniques, the SWT and SWPD, are specifically designed to avoid this form of windowing by processing data on a fully-streaming basis, updating each frequency band only when new information about that frequency has arrived. This enables more rapid response to higher-frequency events while retaining the ability to detect lower-frequency events. Our other two novel or improved approaches, the STWT and STWPD, do employ a window (much as the STFT does), because for some datasets (those provided in discrete bursts) the window does not impair timeliness. They retain the wavelet's ability to examine different frequency bands at different levels of resolution, which allows the STWT to summarize a wide range of frequencies with a relatively small number of features, and the STWPD to produce in-depth information about the higher frequencies without providing excessive redundant information for the lower frequencies.

Another major concern for an MCM/PHM system monitoring a submerged turbine is that, due to the turbine's remoteness, any data sent to shore must travel over wireless communication links. Only a limited amount of data can be transmitted over these links, and they can be unreliable.
Due to these restrictions, certain MCM/PHM operations, such as vibration analysis, that require access to larger chunks of raw sensor data would reside on the support buoy directly connected to the turbine, rather than in an onshore laboratory. This would necessarily be a smaller-scale computing solution, specifically designed for reliability and survivability rather than sheer computational power. All of the wavelet transforms discussed in this work were designed to minimize computational demand, by utilizing small data structures and few operations and by retaining only the information necessary to build the transform. In addition, the scale detection algorithm introduced for use with the SWT is able to reduce the dataset further, by replacing floating-point features with booleans. This reduced set of data can be more easily transmitted to shore if needed.

An additional approach developed to reduce the quantity of data to be processed is the automatic depth selection algorithm, which enables the operator to tune the transformation depth to whatever is necessary for giving the most useful results. Notably, automatic depth selection depends only on feature selection, not classification; even when building models shoreside, feature selection is much more efficient than classification, so automated depth selection significantly reduces the time necessary to determine the optimal depth, compared to performing classification experiments on every depth separately. The automatic depth selection algorithm is also the first approach to let the dataset directly recommend a depth level, rather than forcing the user to decide among different levels based on performance values and domain knowledge. By giving an unambiguous answer, this algorithm makes the process of finding the right depth much more efficient.

Overall, we have developed and proposed a number of new transforms and techniques, and demonstrated their effectiveness on data collected from real-world systems.
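The boolean reduction performed by scale detection can be illustrated with a minimal sketch. The exact criterion is defined in the SWT chapter and is not reproduced here; the threshold rule below (a multiple of the median band energy) and the function name are stand-in assumptions.

```python
import numpy as np

def detect_scales(band_energies, threshold=None):
    """Reduce a vector of per-band energies to booleans marking which
    frequency bands show meaningful oscillation. The default threshold
    (twice the median energy) is an illustrative stand-in for the
    dissertation's actual criterion."""
    band_energies = np.asarray(band_energies, dtype=float)
    if threshold is None:
        threshold = 2.0 * np.median(band_energies)
    return band_energies > threshold
```

One bit per band instead of one float per band is the data reduction that makes shipping the transformed signal over an unreliable wireless link practical.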
Based on these experiments, our recommendations for appropriate transforms depend on the exact nature of the data being processed. If processing streaming data in real time is the highest priority, the SWT or SWPD transforms should be considered, along with scale detection to determine which frequency bands show oscillations. However, if temporal localization of changes in high-frequency waves is of lower importance, a windowed technique such as the STFT, STWT, or STWPD may be appropriate. If the system used to evaluate new data will have limited resources, techniques such as feature selection or automatic depth selection may be used to reduce the quantity of data processed by the classification models. Lastly, if the environmental conditions are expected to change, operators may need to employ baseline-differencing to ensure that their models predict only the internal state of the machine, not the condition of the environment.
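Because automatic depth selection relies only on feature-ranking scores rather than full classification runs, it can be sketched very compactly. The selection rule below (favoring the depth whose top-k features have the highest mean relevance) and the `top_k` parameter are simplified assumptions standing in for the dissertation's actual algorithm:

```python
def recommend_depth(scores_by_depth, top_k=3):
    """Pick a decomposition depth using only feature-relevance scores.

    scores_by_depth: dict mapping candidate depth -> list of relevance
    scores (e.g. from a threshold-based feature ranker) for the
    features that depth produces.  This simplified rule favors the
    depth whose top-k features have the highest mean relevance; the
    key point it illustrates is that no classifier is ever trained.
    """
    def top_k_mean(scores):
        best = sorted(scores, reverse=True)[:top_k]
        return sum(best) / len(best)

    return max(scores_by_depth, key=lambda d: top_k_mean(scores_by_depth[d]))

# Hypothetical ranker output: depth 4's frequency bands carry the most signal.
print(recommend_depth({3: [0.2, 0.3, 0.1], 4: [0.6, 0.5, 0.4], 5: [0.55, 0.2, 0.1]}))  # -> 4
```

Since feature ranking is far cheaper than training and evaluating classifiers, a rule of this shape can compare many candidate depths at a small fraction of the cost of classification experiments on each one.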

8.2 FUTURE WORK

A number of questions remain unanswered by this research and may be considered in the future. First of all, with greater computational resources it may be possible to experiment with the SWT, STWPD, and SWPD transforms on the dynamometer data. Some preliminary work (not presented in this dissertation but part of ongoing research efforts) suggests this would give similar results to the STWT, but a more comprehensive analysis is needed. This comparison could also consider the STFT, or wavelet-based techniques which do not employ the Haar wavelet, and may apply the baseline-differencing approach to these other transforms to gain the same benefits as seen when using it with the STWT.

Another important question not addressed in this research is multi-class classification. In the real world, there are many abnormal states, not just one. Again, although unpublished preliminary investigation was promising, more work is needed to demonstrate the best algorithms and approaches for detecting system state when the whole range of possible states is considered.

Finally, an important open question is finding a generalized method for creating a system baseline in an arbitrary environmental condition. This would allow for detecting system state even if the current condition does not match one seen previously. This automated baseline generation may be performed through interpolation or regression, and would complement the algorithms already presented towards building a complete ocean turbine MCM/PHM system.
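The interpolation route can be sketched for the simple case of a single scalar environmental parameter. The choice of shaft RPM as that parameter, and the linear-interpolation rule between the two nearest recorded conditions, are assumptions for illustration only; this is proposed future work, not an implemented component:

```python
def interpolate_baseline(known_baselines, condition):
    """Estimate a vibration baseline for an unseen environmental condition.

    known_baselines: dict mapping a scalar condition value (e.g. shaft
    RPM, a hypothetical choice of parameter) to a list of per-band
    baseline feature values recorded at that condition.  Linearly
    interpolates between the two nearest recorded conditions, and
    clamps to the nearest endpoint outside the recorded range.
    """
    points = sorted(known_baselines)
    if condition <= points[0]:
        return list(known_baselines[points[0]])
    if condition >= points[-1]:
        return list(known_baselines[points[-1]])
    # Find the bracketing pair of recorded conditions and blend them.
    for lo, hi in zip(points, points[1:]):
        if lo <= condition <= hi:
            w = (condition - lo) / (hi - lo)
            return [a + w * (b - a)
                    for a, b in zip(known_baselines[lo], known_baselines[hi])]

# Baselines recorded at 100 and 200 RPM; estimate one for 150 RPM.
baselines = {100: [1.0, 2.0], 200: [3.0, 6.0]}
print(interpolate_baseline(baselines, 150))  # -> [2.0, 4.0]
```

A regression model fit across many recorded conditions would generalize this beyond one parameter, but either way the estimated baseline could then be subtracted from incoming features exactly as in the baseline-differencing approach already presented.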

BIBLIOGRAPHY

[1] W. Bartelmus and R. Zimroz. A new feature for monitoring the condition of gearboxes in non-stationary operating conditions. Mechanical Systems and Signal Processing, 23(5):1528–1534, 2009.

[2] W. Bartelmus and R. Zimroz. Vibration condition monitoring of planetary gearbox under varying external load. Mechanical Systems and Signal Processing, 23(1):246–257, 2009. Special Issue: Non-linear Structural Dynamics.

[3] T. Batzel and D. Swanson. Prognostic health management of aircraft power generators. IEEE Transactions on Aerospace and Electronic Systems, 45(2):473–482, April 2009.

[4] N. Baydar and A. Ball. Detection of gear failures via vibration and acoustic signals using wavelet transform. Mechanical Systems and Signal Processing, 17(4):787–804, 2003.

[5] A. Bey-Temsamani, M. Engels, A. Motten, S. Vandenplas, and P. Agusmian. A practical approach to combine data mining and prognostics for improved predictive maintenance. In 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2009), June – July 2009.

[6] C. Bishop. Improving the generalization properties of radial basis function neural networks. Neural Computation, 3(4):579–588, Dec. 1991.

[7] R. B. Blackman and J. W. Tukey. Particular pairs of windows. In The Measurement of Power Spectra, From the Point of View of Communications Engineering, pages 98–99. Dover, New York, 1959.

[8] G. W. Boehlert and A. B. Gill. Environmental and ecological effects of ocean renewable energy development: a current synthesis. Oceanography, 23(2):68–81, 2010.

[9] M. Borghi, F. Kolawole, S. Gangadharan, W. Engblom, J. VanZwieten, G. Alsenas, W. Baxley, and S. Ravenna. Design, fabrication and installation of a hydrodynamic rotor for a small-scale experimental ocean current turbine. In Proceedings of IEEE Southeastcon, pages 1–6, March 2012.

[10] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.

[11] R. Ceravolo. Time–Frequency Analysis. John Wiley & Sons, Ltd, 2009.

[12] P. Chaovalit, A. Gangopadhyay, G. Karabatis, and Z. Chen. Discrete wavelet transform-based time series analysis and mining. ACM Computing Surveys, 43(2):6:1–6:37, Feb. 2011.

[13] W. W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123. Morgan Kaufmann, July 1995.

[14] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713–718, Mar. 1992.

[15] F. Combet and R. Zimroz. A new method for the estimation of the instantaneous speed relative fluctuation in a vibration signal based on the short time scale transform. Mechanical Systems and Signal Processing, 23(4):1382–1397, 2009.

[16] I. Daubechies. Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics). SIAM: Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.

[17] M. Dong and D. He. Hidden semi-Markov model-based methodology for multi-sensor equipment health diagnosis and prognosis. European Journal of Operational Research, 178(3):858–878, 2007.

[18] J. Duhaney, T. M. Khoshgoftaar, and J. C. Sloan. Feature level sensor fusion for improved fault detection in MCM systems for ocean turbines. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference. AAAI Press, May 2011.

[19] J. Duhaney, T. M. Khoshgoftaar, J. C. Sloan, B. Alhalabi, and P. P. Beaujean. A dynamometer for an ocean turbine prototype: Reliability through automated monitoring. In 13th IEEE International Symposium on High-Assurance Systems Engineering, pages 244–251. IEEE Computer Society, November 2011.

[20] D. G. Ece and M. Başaran. Condition monitoring of speed controlled induction motors using wavelet packets and discriminant analysis. Expert Systems with Applications, 38(7):8079–8086, 2011.

[21] L. Eren and M. J. Devaney. Bearing damage detection via wavelet packet decomposition of the stator current. IEEE Transactions on Instrumentation and Measurement, 53(2):431–436, April 2004.

[22] A. Folleco and T. M. Khoshgoftaar. Attribute noise detection using multi-resolution analysis. International Journal of Reliability, Quality and Safety Engineering, 13(3):267–288, 2006.

[23] A. Folleco, T. M. Khoshgoftaar, J. Van Hulse, and L. Bullard. Software quality modeling: The impact of class noise on the random forest classifier. pages 3853–3859, Jun. 2008.

[24] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003.

[25] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".

[26] H. A. Gaberson. The use of wavelets for analyzing transient machinery vibration. Sound and Vibration, 36(9):12–17, 2002.

[27] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 79–88, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[28] Z. E. Gketsis, M. E. Zervakis, and G. Stavrakakis. Detection and classification of winding faults in windmill generators using wavelet transform and ANN. Electric Power Systems Research, 79(11):1483–1494, 2009.

[29] K. Goebel, N. Eklund, and P. Bonanni. Fusing competing prediction algorithms for prognostics. In Aerospace Conference, 2006 IEEE, pages 1–10, 2006.

[30] S. K. Goumas, M. E. Zervakis, and G. S. Stavrakakis. Classification of washing machines vibration signals using discrete wavelet analysis for feature extraction. IEEE Transactions on Instrumentation and Measurement, 51(3):497–508, June 2002.

[31] S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. In SODA '06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pages 698–707, New York, NY, USA, 2006. ACM.

[32] K. Gurley and A. Kareem. Applications of wavelet transforms in earthquake, wind and ocean engineering. Engineering Structures, 21(2):149–167, 1999.

[33] Z. Hameed, Y. Hong, Y. Cho, S. Ahn, and C. Song. Condition monitoring and fault detection of wind turbines and related algorithms: A review. Renewable and Sustainable Energy Reviews, 13(1):1–39, 2009.

[34] J.-G. Han, W.-X. Ren, and Z.-S. Sun. Wavelet packet based damage identification of beam structures. International Journal of Solids and Structures, 42(26):6610–6627, 2005.

[35] H. P. Hanson, A. Bozek, and A. E. S. Duerr. The Florida Current: A clean but challenging energy resource. Eos Transactions AGU, 92(4), Jan 2011.

[36] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, July/Aug. 1998.

[37] Y. Huang, C. Liu, X. F. Zha, and Y. Li. A lean model for performance assessment of machinery using second generation wavelet packet transform and Fisher criterion. Expert Systems with Applications, 37(5):3815–3822, 2010.

[38] A. K. S. Jardine, D. Lin, and D. Banjevic. A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 20(7):1483–1510, 2006.

[39] P. Jayaswal, S. Verma, and A. Wadhwani. Application of ANN, fuzzy logic and wavelet transform in machine fault diagnosis using vibration signal analysis. Journal of Quality in Maintenance Engineering, 16(2):190–213, 2010.

[40] K. Jemielniak, J. Kossakowska, and T. Urbański. Application of wavelet transform of acoustic emission and cutting force signals for tool condition monitoring in rough turning of Inconel 625. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 225(1):123–129, 2011.

[41] C. Kar and A. R. Mohanty. Monitoring gear vibrations through motor current signature analysis and wavelet transform. Mechanical Systems and Signal Processing, 20(1):158–187, 2006.

[42] C. Kar and A. R. Mohanty. Vibration and current transient monitoring for gearbox fault detection using multiresolution Fourier transform. Journal of Sound and Vibration, 311(1-2):109–132, 2008.

[43] M. Kawada, K. Yamada, Y. Kaneko, and K. Isaka. Discrimination of vibration phenomena on model turbine rotor using in-place fast Haar wavelet transform. In Power Engineering Society General Meeting, 2007. IEEE, pages 1–6, June 2007.

[44] T. M. Khoshgoftaar, M. Golawala, and J. Van Hulse. An empirical study of learning from imbalanced data using random forest. In 19th IEEE International Conference on Tools with Artificial Intelligence, 2007. ICTAI 2007., volume 2, pages 310–317, October 2007.

[45] D. G. Kleinbaum and M. Klein. Maximum likelihood techniques: An overview. Logistic Regression, pages 103–127, 2010.

[46] Y. Kong, Y. Shi, and J. Yuan. Prediction method of time series data stream based on wavelet transform and least squares support vector machine. In Fourth International Conference on Natural Computation, 2008. ICNC '08, volume 2, pages 120–124, October 2008.

[47] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In C. Nédellec and C. Rouveirol, editors, Machine Learning: ECML-98, volume 1398 of Lecture Notes in Computer Science, pages 4–15. Springer Berlin / Heidelberg, 1998.

[48] X. Li, L. Qu, G. Wen, and C. Li. Application of wavelet packet analysis for fault detection in electro-mechanical systems based on torsional vibration measurement. Mechanical Systems and Signal Processing, 17(6):1219–1235, 2003.

[49] J. Lin and L. Qu. Feature extraction based on Morlet wavelet and its application for mechanical fault diagnosis. Journal of Sound and Vibration, 234(1):135–148, 2000.

[50] J. Lin and M. J. Zuo. Gearbox fault diagnosis using adaptive wavelet filter. Mechanical Systems and Signal Processing, 17(6):1259–1269, 2003.

[51] B. Liu. Selection of wavelet packet basis for rotating machinery fault diagnosis. Journal of Sound and Vibration, 284(3-5):567–582, 2005.

[52] T.-Y. Liu. EasyEnsemble and feature selection for imbalance data sets. In IJCBS '09: International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, 2009, pages 517–520, August 2009.

[53] A. Muller, M. C. Suhner, and B. Iung. Maintenance alternative integration to prognosis process engineering. Journal of Quality in Maintenance Engineering, 13(2):198–211, 2007.

[54] H. Ocak, K. A. Loparo, and F. M. Discenzo. Online tracking of bearing wear using wavelet packet decomposition and probabilistic modeling: A method for bearing prognostics. Journal of Sound and Vibration, 302(4-5):951–961, 2007.

[55] M.-C. Pan, P. Sas, and H. van Brussel. Nonstationary time-frequency analysis for machine condition monitoring. In Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, 1996, pages 477–480, June 1996.

[56] Y. Pan, J. Chen, and L. Guo. Robust bearing performance degradation assessment method based on improved wavelet packet-support vector data description. Mechanical Systems and Signal Processing, 23(3):669–681, 2009.

[57] N. Panwar, S. Kaushik, and S. Kothari. Role of renewable energy sources in environmental protection: A review. Renewable and Sustainable Energy Reviews, 15(3):1513–1524, 2011.

[58] S. Petit-Renaud and T. Denoeux. Nonparametric regression analysis of uncertain and imprecise data using belief functions. International Journal of Approximate Reasoning, 35(1):1–28, 2004.

[59] H. Qiu, J. Lee, J. Lin, and G. Yu. Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics. Journal of Sound and Vibration, 289(4-5):1066–1090, 2006.

[60] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[61] A. G. Rehorn, E. Sejdic, and J. Jiang. Fault diagnosis in machine tools using selective regional correlation. Mechanical Systems and Signal Processing, 20(5):1221–1238, 2006.

[62] O. Renaud, J.-L. Starck, and F. Murtagh. Wavelet-based combined signal filtering and prediction. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(6):1241–1251, December 2005.

[63] P. L. Richardson. Florida Current, Gulf Stream and Labrador Current. Ocean Currents: A Derivative of the Encyclopedia of Ocean Sciences, pages 13–22, 2009.

[64] R. Rubini and U. Meneghetti. Application of the envelope and wavelet transform analyses for the diagnosis of incipient faults in ball bearings. Mechanical Systems and Signal Processing, 15(2):287–302, 2001.

[65] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1(4):296–298, Dec. 1990.

[66] M. Rucka and K. Wilde. Application of continuous wavelet transform in vibration based damage detection method for beams and plates. Journal of Sound and Vibration, 297(3-5):536–550, 2006.

[67] A. Saxena, J. Celaya, E. Balaban, K. Goebel, B. Saha, S. Saha, and M. Schwabacher. Metrics for evaluating performance of prognostic techniques. In International Conference on Prognostics and Health Management, 2008. PHM 2008., pages 1–17, October 2008.

[68] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(1):185–197, January 2010.

[69] J. C. Sloan, T. M. Khoshgoftaar, B. Alhalabi, and P.-P. Beaujean. Strategy and application of data-driven testing of an ocean turbine drivetrain. International Journal of Reliability, Quality and Safety Engineering, 18(06):555–571, 2011.

[70] J. C. Sloan, T. M. Khoshgoftaar, P.-P. J. Beaujean, and F. Driscoll. Ocean turbines – a reliability assessment. International Journal of Reliability, Quality and Safety Engineering, 16(5):413–433, October 2009.

[71] Z. Sun and C. C. Chang. Structural damage assessment based on wavelet packet transform. Journal of Structural Engineering, 128(10):1354–1361, 2002.

[72] M. M. R. Taha, A. Noureldin, J. L. Lucero, and T. J. Baca. Wavelet transform for structural health monitoring: A compendium of uses and features. Structural Health Monitoring, 5(3):267–295, 2006.

[73] L. Tang, G. J. Kacprzynski, J. R. Bock, and M. Begin. An intelligent agent-based self-evolving maintenance and operations reasoning system. In Aerospace Conference, 2006 IEEE, pages 1–12, 2006.

[74] P. W. Tse, Y. H. Peng, and R. Yam. Wavelet analysis and envelope detection for rolling element bearing fault diagnosis—their effectiveness and flexibilities. Journal of Vibration and Acoustics, 123(3):303–310, 2001.

[75] S. Uckun, K. Goebel, and P. J. F. Lucas. Standardizing research methods for prognostics. In International Conference on Prognostics and Health Management, 2008. PHM 2008., pages 1–10, October 2008.

[76] J. Van Hulse and T. M. Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering, 68(12):1513–1542, 2009.

[77] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. A novel noise filtering algorithm for imbalanced data. In S. Draghici, T. M. Khoshgoftaar, V. Palade, W. Pedrycz, M. A. Wani, and X. Zhu, editors, ICMLA, pages 9–14. IEEE Computer Society, 2010.

[78] J. Van Hulse, T. M. Khoshgoftaar, A. Napolitano, and R. Wald. Feature selection with high-dimensional imbalanced data. In IEEE International Conference on Data Mining Workshops, 2009. ICDMW '09., pages 507–514, December 2009.

[79] R. Wald and T. M. Khoshgoftaar. Wavelet transforms for classification of streaming data in reliability analysis of ocean systems. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, September 2010.

[80] R. Wald, T. M. Khoshgoftaar, and A. Abu Shanab. The effect of measurement approach and noise level on gene selection stability. Technical report, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, June 2012.

[81] R. Wald, T. M. Khoshgoftaar, and B. Alhalabi. Baseline-differencing: A novel approach for building generalizable ocean turbine models. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, September 2012.

[82] R. Wald, T. M. Khoshgoftaar, and B. Alhalabi. A novel baseline-differencing approach for creating generalizable reliability models of ocean turbine behavior. In Proceedings of the 18th ISSAT International Reliability and Quality in Design Conference, pages 1–8, July 2012.

[83] R. Wald, T. M. Khoshgoftaar, P.-P. J. Beaujean, and J. C. Sloan. Combining wavelet and Fourier transforms in reliability analysis of ocean systems. In Proceedings of the 16th ISSAT International Reliability and Quality in Design Conference, pages 303–307, 2010.

[84] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. An improved wavelet transform method for streaming data in reliability analysis of ocean systems. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, December 2010.

[85] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Comparison of wavelet-based approaches for real-time vibration analysis. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, March 2011.

[86] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Feature selection for optimization of wavelet packet decomposition in reliability analysis of systems. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, June 2011.

[87] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Feature selection for vibration sensor data transformed by a streaming wavelet packet decomposition. In 23rd IEEE International Conference on Tools with Artificial Intelligence, pages 978–985, November 2011.

[88] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Fourier transforms for vibration analysis: A review and case study. In 2011 IEEE International Conference on Information Reuse and Integration (IRI), pages 366–371, August 2011.

[89] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Review and comparison of Fourier and wavelet-based transforms for vibration-based reliability analysis. Technical report, Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University, Boca Raton, FL, USA, April 2011.

[90] R. Wald, T. M. Khoshgoftaar, and J. C. Sloan. Using feature selection to determine optimal depth for wavelet packet decomposition of vibration signals for ocean system reliability. In 13th IEEE International Symposium on High-Assurance Systems Engineering, pages 236–243. IEEE Computer Society, November 2011.

[91] R. Wald, T. M. Khoshgoftaar, J. C. Sloan, and P.-P. Beaujean. A streaming wavelet packet decomposition approach for real-time vibration analysis. In Proceedings of the 17th ISSAT International Reliability and Quality in Design Conference, pages 359–363, August 2011.

[92] C. Wang and R. X. Gao. Wavelet transform with spectral post-processing for enhanced feature extraction [machine condition monitoring]. IEEE Transactions on Instrumentation and Measurement, 52(4):1296–1301, August 2003.

[93] H. Wang, T. M. Khoshgoftaar, and A. Napolitano. A comparative study of ensemble feature selection techniques for software defect prediction. In Ninth International Conference on Machine Learning and Applications (ICMLA), pages 135–140, December 2010.

[94] H. Wang, T. M. Khoshgoftaar, and J. Van Hulse. A comparative study of threshold-based feature selection techniques. In IEEE International Conference on Granular Computing (GrC), pages 499–504, August 2010.

[95] W. J. Wang and P. D. McFadden. Application of wavelets to gearbox vibration signals for fault detection. Journal of Sound and Vibration, 192(5):927–939, 1996.

[96] S. J. Watson, B. J. Xiang, W. Yang, P. J. Tavner, and C. J. Crabtree. Condition monitoring of the power output of wind turbine generators using wavelets. IEEE Transactions on Energy Conversion, 25(3):715–721, September 2010.

[97] W. A. Wilkinson and M. D. Cox. Discrete wavelet analysis of power system transients. IEEE Transactions on Power Systems, 11(4):2038–2044, November 1996.

[98] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition, January 2011.

[99] C. Yang and S. Letourneau. Model evaluation for prognostics: Estimating cost saving for the end users. In ICMLA ’07: Proceedings of the Sixth International Conference on Machine Learning and Applications, pages 304–309, Washington, DC, USA, 2007. IEEE Computer Society.

[100] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, 1997.

[101] G. G. Yen and K.-C. Lin. Conditional health monitoring using vibration signatures. In Proceedings of the 38th IEEE Conference on Decision and Control, volume 5, pages 4493–4498, 1999.

[102] G. G. Yen and K.-C. Lin. Wavelet packet feature extraction for vibration monitoring. IEEE Transactions on Industrial Electronics, 47(3):650–667, June 2000.

[103] F. Zhao, J. Chen, and W. Xu. Condition prediction based on wavelet packet transform and least squares support vector machine methods. Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering, 223(2):71–79, May 2009.

[104] H. Zhengjia, Z. Jiyuan, H. Yibin, and M. Qingfeng. Wavelet transform and multiresolution signal decomposition for machinery monitoring and diagnosis. In Proceedings of The IEEE International Conference on Industrial Technology, 1996. (ICIT ’96), pages 724–727, December 1996.

[105] K. Zhu, Y. S. Wong, and G. S. Hong. Wavelet analysis of sensor signals for tool condition monitoring: A review and some new results. International Journal of Machine Tools and Manufacture, 49(7–8):537–553, 2009.
