LIU-ITN-TEK-A-19/046--SE

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Simon Corell

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University

Supervisor: Daniel Jönsson
Examiner: Gabriel Eilertsen

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden

Norrköping, 2019-09-03


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Simon Corell

Linköping University | Department of Science and Technology
Master's thesis, 30 ECTS | Computer Science and Engineering
2019 | LIU-ITN/LITH-EX-A--2019/001--SE

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Simon Corell

Supervisor: Daniel Jönsson
Examiner: Gabriel Eilertsen

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/.

© Simon Corell

Abstract

This study investigates whether a recurrent long short-term memory (LSTM) based neural network can be used to estimate the battery capacity in electrical cars. There is an enormous interest in finding the underlying reasons why and how Lithium-ion batteries age, and this study is a part of this broader question. The research questions that have been answered are how well an LSTM model estimates the battery capacity, how the LSTM model performs compared to a linear model, and which parameters are important when estimating the capacity. There have been other studies covering similar topics, but only a few have been performed on a real data set from real cars driving. With a data science approach, it was discovered that the LSTM model indeed is a powerful model to use for estimating the capacity. It had better accuracy than a linear regression model, although the linear regression model still gave good results. The parameters that appeared to be important when estimating the capacity were logically related to the properties of a Lithium-ion battery.

Acknowledgments

This project has been the most challenging and exciting work I have ever been a part of. I am grateful to have received this opportunity at such a reputable company as Volvo Cars. The idea behind the project came from Christian Fleischer at Volvo Cars. Christian has supervised the project and has provided me with various pieces of advice. A huge thank you, Christian, for giving me this opportunity and for all the help you have provided. The initial idea was unfortunately not possible due to different circumstances, but I am still satisfied with what the project resulted in. A big thank you to all other Volvo Cars employees that have helped in various discussions and meetings, especially Herman Johnsson for all the support you gave. I also want to thank my examiner Gabriel Eilertsen and supervisor Daniel Jönsson for the great feedback and guidance throughout the project. Lastly, I want to thank all the other master's thesis students at the same department for all the laughs and amusing moments.

Simon Corell, 2019

Contents

Abstract

Acknowledgments

Contents

List of Figures

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Background description and related work
  2.1 Big Data
  2.2 Time Series
  2.3 Machine Learning
  2.4 Neurons & Synapses
  2.5 Lithium Ion Battery Description and Battery Management System
  2.6 Related research

3 Theory
  3.1 Data pre-processing
  3.2 Linear Regression
  3.3 Artificial neural networks

4 Method
  4.1 Prestudy: Capacity estimation using lab data from NASA
  4.2 Overall Project Progression
  4.3 Data pre-processing
  4.4 Implementation of the Linear Regression model
  4.5 Implementation of the recurrent LSTM neural network

5 Results and Analysis
  5.1 Results From Linear Regression Model
  5.2 Long Short-Term Memory (LSTM)

6 Discussion
  6.1 Method
  6.2 Results
  6.3 Replicability, reliability and validity
  6.4 The work in a wider context

7 Conclusion
  7.1 Summarizing
  7.2 Future work

A Appendix

Bibliography

List of Figures

2.1 A Lithium Ion battery pack for Nissan Leaf at the Tokyo Motor Show 2009

3.1 Basic idea behind Local Outlier Factor
3.2 The normal distribution
3.3 The typical pipeline for a filter method
3.4 The typical pipeline for a Wrapper method
3.5 A simple linear model
3.6 A simple ANN
3.7 Vanilla recurrent neural network
3.8 The LSTM cell

4.1 Absolute difference for every 3000 estimation
4.2 Model Pipeline
4.3 Capacity values over all cycles for some cars
4.4 Capacity values over all cycles for some cars
4.5 Top 25 Pearson correlation coefficients
4.6 The Pearson correlations between the signals

5.1 Error distribution for linear regression
5.2 Plot showing the estimated values scattered against the real values
5.3 Plot showing the estimated values and the real values for car BDCWCU5
5.4 Plot showing the estimated values against the real values for car BDCWCU5
5.5 Error distribution for LSTM
5.6 Plot showing the estimated values against the real values
5.7 Plot showing the estimated values against the real values for car BDCWCU5
5.8 Plot showing the estimated values against the real values for car BDCWCU5

1 Introduction

This research study focuses primarily on the hypothesis that a recurrent neural network can, with high accuracy, estimate the current capacity of lithium-ion batteries. Lithium-ion batteries are used as the primary power source in the latest era of electrical cars. Volvo Cars wants to investigate how well a recurrent neural network performs in this estimation and has provided a data set containing information from a fleet of their cars. The study was carried out at the Volvo Cars headquarters in Gothenburg, Sweden.

1.1 Motivation

Traditionally, fossil fuels have been the main source of fuel in automobiles. This makes vehicles a huge contributor to the carbon footprint, which has a detrimental impact on the environment. Many companies nowadays are taking a more eco-friendly approach by electrifying new automobiles and, to a large extent, also the existing fleet in operation. The main source of portable power for these automobiles will be batteries. Therefore, it is of primary importance that research is done on the batteries when commercializing them, so that their viability in the field of transportation and vehicle propulsion can be understood. Among the existing chemical combinations of batteries, Lithium-ion batteries have proved to be the best choice for Electric Vehicles (EVs). The reasons for this choice are fast charge capability, high power density and energy efficiency, a wide operating temperature range, a low self-discharge rate, light weight, and small size [1]. However, the battery wears out due to many factors such as temperature, depth of discharge, charge/discharge rates, driving style, route terrain, etc.

1.2 Aim

Today it is difficult for a driver of an electric vehicle (EV) to know exactly how many ampere minutes/hours the battery in his/her car has left. This is because the current State of Health (SoH) cannot precisely state this value, since it does not take into consideration factors such as driving patterns, driving style, the type of environment around the car and so on. To be able to estimate the capacity of the battery with high precision will certainly be beneficial

for a car company like Volvo. It can also help us understand which factors make a battery perform well over a longer time, and why some batteries do not. The underlying purpose of this thesis project is to evaluate and investigate whether a recurrent neural network model can learn to estimate the battery's capacity with respect to all the different parameters the Battery Management System (BMS) sends out.

1.3 Research questions

The research questions that this report will try to answer are listed here.

1. How well does a recurrent LSTM neural network estimate the capacity of a battery in an electrical vehicle?

2. Is a recurrent LSTM neural network necessary for the estimation of the capacity or can a linear model be used instead?

3. Which parameters seem to be important when estimating the battery capacity?

1.4 Delimitations

The most important delimitations are listed here.

• The data covers only 3-6 months, while a battery is estimated to last for around 7 years. Therefore, the decision was made to change the direction of the thesis from predicting the battery's end-of-life to instead estimating the online State of Health/capacity that the battery currently has.

• The study will only focus on the provided data set, and the results can therefore, strictly speaking, only be shown to apply to this particular data set.

• In a wider scope, Volvo wants to investigate whether a global model can be combined with local models to make the estimation even more precise. Could, for example, Gothenburg, Sweden have a local model which applies only to the cars there, together with a more general global model for all the cars in the world? The focus of this master's thesis project, however, is only to implement a global model for the estimation of the battery capacity.

2 Background description and related work

This chapter will in detail explain all the background information that was researched and the related work that was investigated.

2.1 Big Data

Big data is stored digital information of immense magnitude [2]. The sizes of big data sets vary from gigabytes to terabytes or even petabytes. Big data comes from many different areas such as bioinformatics, physics, environmental studies, simulations and web services such as Youtube, Twitter, Facebook and Google. Many companies and organizations are trying to extract as much information and value as possible from their big data. It is becoming more and more important for large global companies to handle and manage their data as well as possible. The emergence of big data comes from the possibility of collecting digital information using different sensors, as well as from the internet, and later storing the information on hard drives located in different data servers and centers. Big data is too large and/or complex for traditional data-processing methods. Using new algorithms and methods, complex correlations in the data can be found: correlations explaining phenomena such as business trends, diseases, crimes and so on. The data on our planet grows rapidly with the increase of Internet of Things (IoT) devices such as cameras, microphones, mobile devices, and cars. The global data volume will grow exponentially from around 4.4 zettabytes to around 44 zettabytes between 2013 and 2020. By 2025 there will be around 163 zettabytes of digital information [3].

2.2 Time Series

The data in this study is a time series. A time series is a data set where the samples are indexed in time order [4]. The most common time series have the same time difference between samples, making them sequences of discrete-time data. Time series can deal with real continuous data, discrete numeric data or symbolic data. Time series are used in many different fields including statistics, signal processing, weather forecasting, astronomy and so on. Time series analysis consists of analyzing time series to extract meaningful information from them. Time series forecasting tries to build a model to predict future values based on the previously observed values. For example, from previously observed weather values, how will the weather be tomorrow?


2.3 Machine Learning

Machine learning (ML) is a set of algorithms that can learn behavior and patterns without being hardcoded for the purpose [5]. ML algorithms are founded on optimization and probability theory, which use multidimensional calculus and linear algebra in their calculations. ML has become enormously popular due to the excellent results it has in solving problems involving complex and large data sets. There are mainly two different categories of problems ML algorithms can deal with. These are called supervised and unsupervised learning. In supervised learning, the input data is labeled, so the task is to learn the correct relation between the inputs and outputs so that the ML algorithm can generalize what it has learned to new data. In unsupervised learning, the input data is not labeled. In this case, it is up to the ML algorithm to find underlying structures and patterns within the input data. Another way to divide ML problems is by the format of the output data. If the output data consists of discrete values out of a set of classes it is called classification. If the output data consists of real values it is called regression. Another ML problem is called clustering. Clustering tries to divide the input data into different groups and in that way categorize them. This is most commonly an unsupervised problem. Today ML algorithms are used in various fields including natural language processing and image recognition. Deep learning is a class of machine learning algorithms that are used in supervised and unsupervised tasks. Deep learning algorithms are most often based on artificial neural networks. They can learn multiple levels of representations and use a large number of layers of nonlinear connectivity. The "deep" in deep learning refers to the large number of layers in the artificial neural network.

2.4 Neurons & Synapses

The human brain consists of a very large collection of neurons [6]. Neurons are the core component of the nervous system, and this is especially true for the brain. Neurons are electrically excitable cells that receive, process and later transmit information to other neurons through chemical and electrical signals. The connections between the neurons are called synapses, and these complex membrane junctions allow neurons to communicate with each other. The neurons operate in different neural circuits. A neural circuit is a population of neurons designed to carry out a specific task. These neural circuits are interconnected with other neural circuits to form a large-scale neural network in the brain. There are an estimated 100 billion neurons in our human brain. Each neuron can be connected to up to 10 thousand other neurons, passing signals over up to 1,000 trillion synapse connections. This is equivalent to a CPU with 1 trillion bits per second operations. The human brain's memory capacity is estimated to vary between 1 and 1,000 terabytes [7]. It is fairly simple to understand the enormous power of the human brain: how fast it operates and how much memory it can store. The human brain is amazing in how it can learn to understand complex patterns, process information and memorize large amounts of information. All these factors made scientists wonder if there was a possibility to mimic the human brain somehow. The result is called Artificial neural networks.

2.5 Lithium Ion Battery Description and Battery Management System

Lithium-Ion batteries were first introduced during the 1990s and have now become one of the most common batteries on the market [8] [9]. They are being used in a large majority of different electronic devices including music devices, laptops, mobile phones and now electric cars. Lithium-Ion batteries have been found to be very well suited for mobile devices since they have a high power density (W/kg) and a long expected lifetime. Lithium-Ion batteries consist of Lithium-Ion cells. Lithium-Ion cells have three primary components in their internal

construction. These are two electrodes, called cathode (positive) and anode (negative), as well as a conductive electrolyte. During a charging cycle, the lithium cations move to the negative electrode, the anode. During a discharge cycle, the lithium cations move to the positive electrode, the cathode. For electric cars, a Lithium-Ion battery consists of tens to thousands of individual cells packaged together. The number of cells is determined by the required power, voltage, and energy. These Lithium-Ion battery cell packs are located underneath the car and can be seen in figure 2.1.

Figure 2.1: A Lithium Ion battery pack for Nissan Leaf at the Tokyo Motor Show 2009 taken from [10].

So what characterizes a Lithium-Ion battery? There are some important parameters one must know to be able to understand the most essential properties of a Li-Ion battery. These values are discharging and charging cycles, State of Charge, battery voltage, discharge current, internal resistance, capacity and State of Health. These values are briefly described below.

• Discharging and charging cycles: A discharging cycle is when the battery is being used. In the case of a car battery, this is equivalent to the car driving and being used. A charge cycle is when the car is standing still and the battery is being filled up with energy.

• State of Charge: State of Charge is the equivalent of a fuel gauge for the energy in the battery of an electric vehicle. 0% meaning an empty battery and 100% meaning a full battery.

• Battery cell voltage: The battery voltage is the difference in charge between the cathode (positive electrode) and anode (negative electrode). The higher the difference is, the higher the voltage is. The battery voltage is equivalent to the electric potential in the battery. The electric potential is measured in volts.

• Discharge current: An electric current is the rate of flow of electric charge passing a point or over a region. It is measured in ampere. Discharge current is the electric current used during a discharge cycle. The discharge current is equivalent to the necessary current required to handle the demands of energy.

• Internal resistance: Internal resistance is the impedance in an electric circuit. It is measured in Ohm (Ω) and is defined as the opposition to the current in an electric circuit.

• State of Health (SoH): State of Health is a value for the health of a battery compared to its ideal conditions. The state of health will be 100% at the time of manufacture and will later decrease when being used over time. The state of health is measured from the capacity in the battery.


• Capacity: A battery's capacity is the amount of electric charge it can deliver. It is measured in ampere-hours and is a good indication of the health of the battery.

• EOL: EOL stands for End-Of-Life and is a state when the battery is considered to be useless. This is estimated to be when a full charge of the battery only results in 70% of its original capacity.

So what is known to be the cause of the degradation of Lithium-Ion batteries? The capacity is a good indication of the health of the battery and will, therefore, be the value representing the degradation. The capacity will degrade over thousands of cycles. The degradation consists of slow electrochemical processes in the electrodes. Chemical and mechanical changes to the electrodes are major causes of the reduction of the capacity. These changes cause flaws such as lithium plating, mechanical cracking of electrode particles, and thermal decomposition of the electrolyte. Charge and discharge cycles cause consumption of lithium ions on the electrodes. With fewer lithium ions on the electrodes, the charge and discharge efficiency of the electrode material is reduced. This means that charging the cells will result in less capacity and that discharge cycles will consume the capacity faster. The degradation is highly correlated to the cell temperature. The cylindrical cell temperatures are linearly correlated to the discharge current. It is also believed that high discharge levels and elevated ambient temperatures will cause an increase in degradation. High discharge levels mean that the State of Charge has been pushed down to small percentages.

The battery management system (BMS) is a software program that manages a battery. Examples of its functions are keeping the battery from operating outside its safe operating area and monitoring its different parameters, including, for example, the voltage, temperature and current. The battery management system also communicates directly with the battery and does all of the computations needed for the values used for the management of the battery, including State of Charge, State of Health and internal resistance.


2.6 Related research

In the paper written by Hicham Chaoui [11], a neural network with delayed inputs is used to predict the SOC and SOH values of Li-ion batteries. The benefit of this method lies in the fact that the network does not need any information regarding the battery model or parameters. They used previously recorded parameters like ambient temperature, voltage, current etc. as inputs. Another advantage of this method is that it compensates for the non-linear behaviour of the batteries, such as hysteresis and reduction in performance due to aging. From the results one can learn that this method provides high accuracy and robustness with a simple algorithm for evaluating a LiFePO4 battery. Batteries were conventionally modelled by defining an empirical model dependent on the equivalent electric circuit, which can be made through actual measurements or estimation of battery parameters using extrapolation. The problem with these models is that they can only describe the battery characteristics until the battery is charged again. In a research paper written by Massimo [12], the inter-cycle battery effects are automatically taken into account and implemented into the generic model used as a reference template. This method was used to model a commercial lithium iron phosphate battery to check the flexibility and accuracy of the described strategy. In the paper written by Ecker [13], a multivariate analysis of the battery parameters was performed on batteries subjected to accelerated aging procedures. Impedance and capacity were used as the output variables. The change in these parameters was observed when the battery was subjected to different conditions of State of Charge and temperature. This led to the development of battery functions which model the aging process, bolstered by the physical measurements on the battery. These functions led to the development of a general battery aging model. These models were subjected to different drive cycles and management strategies, which gave a better understanding of their impact on battery life. The study done by Tina [14] highlights the use of several layers of Recurrent Neural Networks (RNNs) to take into account the non-linear behaviour of photovoltaic batteries while modelling. In this manner, the trends of change in the input and output parameters of the battery are taken into account during charge and discharge processes. This has proved to be a powerful tool despite the heavy computations required to run the neural network architecture. In this model, the electric current supplied to the battery is the only effective external input parameter because it is dependent on the user application. The voltage and SOC are taken as the output variables. In the study done by Chemali [15], LSTM cells were used in a Recurrent Neural Network (RNN) to estimate the State of Charge (SOC). The significance of the research presented in this paper lies in the fact that they used a standalone neural network to take into account all the dependencies, without using any prior knowledge of the battery model or employing any filters for estimation like the Kalman filter. The network is capable of generalizing the kinds of feature changes it is trained on, using data sets containing different battery behaviour subjected to different initial environment conditions. One such feature taken into account while recording data sets is ambient temperature.
The model was able to predict a very accurate SOC with variation in external temperature in this case. In the research conducted by Saha [16], the remaining useful life of a battery was estimated in a setting where the internal state is difficult to measure under normal operating conditions. Therefore, these states need to be estimated using a Bayesian statistical approach based on the previously recorded data, indirect measurements and different operational conditions. Concepts from the electrochemical processes, like the equivalent electric circuit, and from statistics, like state transition models, were combined in a systematic framework to model the changes in the aging process and predict the Remaining Useful Life (RUL). Particle filters and Relevance Vector Machines (RVMs) were used to provide uncertainty bounds for these models.


In the study conducted by You [17], the State of Health (SOH) of the battery is estimated to determine the schedule for maintenance and the replacement time of the battery, or to estimate the driving mileage. Most of the previous research was done in a constrained environment where a full charge cycle with constant current has been a common assumption. However, these assumptions cannot be used for EVs, as the charge cycles are mostly partial and dynamic. The objective of this study was to come up with a robust model to estimate SOH where batteries are subjected to real-world environment conditions. The study demonstrates a technique that makes use of historic data collected on current, voltage and temperature in order to estimate the State of Health of the battery in real time with high accuracy.

3 Theory

This chapter will explain the theory behind the key components used in the thesis. The chapter starts with data pre-processing, followed by an explanation of linear regression and neural networks.

3.1 Data pre-processing

This section will go through the theory behind data pre-processing, starting with an overall description of data pre-processing followed by descriptions of some of its different steps.

Data pre-processing is an important step in a machine learning project [18]. When data has just been collected it is called raw data. Raw data often comes with many flaws. These flaws include, for example, values that are out of range, values that are missing or combinations of values that are impossible. Including these kinds of values can lead to misleading results. This is why it is important to analyze the data and to make the necessary changes to it before applying any kind of machine learning algorithm. There are many different steps in data pre-processing. The only steps that this report covers are called data cleaning, data transformation, and feature selection. These will be explained in the upcoming paragraphs.

3.1.1 Data Cleaning

Data cleaning is a process where the aim is to detect incorrect values and later correct or remove these [19]. The purpose of data cleaning is to make all data sets consistent with each other. Data cleaning can consist of many different steps. The steps that were taken into consideration in this report are NaN-replacements, outlier detection, and Data Aggregation. These will be explained in the upcoming paragraphs.

3.1.1.1 NaN-values

NaN stands for "not a number" and is an error that can happen in the collection of data [20]. NaN is exactly what it stands for, "not a number", meaning that for a value which should

be numeric, nothing or a non-numeric value has been saved. This error can happen due to wrong user input or from an error in the sensor used in the collection of the data. It is important to deal with NaNs because their false values will affect the data set. There are different solutions for dealing with NaN-values. Some of these are listed below.

• Replacing the NaN-value with the mean of the column.
• Removing the samples with a NaN-value.
• Replacing the NaN-value with the mean of the previous valid value and the next valid value.
• Replacing the NaN-value with either the previous valid value or the next valid value.
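As a minimal illustration of these replacement strategies, the snippet below uses pandas; the DataFrame and its column names are hypothetical and are not taken from the thesis data set.

```python
import numpy as np
import pandas as pd

# Hypothetical raw samples with missing (NaN) readings.
df = pd.DataFrame({"voltage": [3.7, np.nan, 3.6, 3.5],
                   "current": [1.2, 1.1, np.nan, 0.9]})

# 1) Replace NaN-values with the mean of each column.
df_mean = df.fillna(df.mean())

# 2) Remove the samples (rows) that contain a NaN-value.
df_drop = df.dropna()

# 3) Replace NaN-values with the mean of the previous and next valid values,
#    i.e. linear interpolation between the neighbouring samples.
df_interp = df.interpolate(method="linear")

# 4) Replace NaN-values with the previous valid value (forward fill)
#    or the next valid value (backward fill).
df_ffill = df.ffill()
df_bfill = df.bfill()
```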

3.1.1.2 Outlier detection

Outlier detection identifies events or observations in data sets that differ significantly from the majority of the data [21]. Outlier detection can either be unsupervised or supervised. When dealing with unsupervised outlier detection, the assumption is made that the majority of the samples in an unlabeled data set are normal, and the aim is to find the abnormal samples. In supervised learning, a classifier is trained on data sets that are labeled either normal or abnormal. There are many different outlier detection algorithms. The one used in this report is based on the K-nearest neighbors algorithm and is called Local Outlier Factor.

Local Outlier Factor is an algorithm created by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander in the year 2000 [22]. The algorithm emanates from the samples' local densities. The local density is estimated by using the k-distance given by the k-nearest neighbors. The basic idea is to calculate all samples' local densities and then compare each sample's density with the densities of its neighboring samples. This way, the samples whose densities differ the most from those of their neighbors can be found and later be classified as outliers. The algorithm starts with the user deciding how many nearest neighbors, k, should be used. The k-distance is then the Euclidean distance to the k-th nearest neighbor.

This k-distance is used to define what is called the reachability-distance. The reachability-distance between two samples is defined as the maximum of the k-distance and the distance between the samples. This means that if the sample is within the k neighbors, the reachability-distance will be the k-distance; otherwise, it will be the real distance between the samples. This procedure is just a "smoothing factor" used to get more stable results. Lastly, the reachability-distance is used to calculate the local reachability density. The local reachability density for a given sample is the inverse of the average of all the reachability-distances to its k-nearest neighbors. The longer the reachability-distances are, the sparser the local reachability density becomes. The local reachability density describes how far away the sample is from other samples or clusters of samples. Local Outlier Factor is performed by first calculating the local reachability density for each sample. Then, for each sample, the k ratios between the local reachability densities of its k-nearest neighbors and the sample's own local reachability density are calculated. The k ratios are later averaged into one single value. The equation for Local Outlier Factor for an object p can be seen in equation 3.1.

LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}}{|N_k(p)|}    (3.1)

where o is an object within the k-nearest neighbors N_k(p) of p, \mathrm{lrd}_k(o) is the local reachability density of object o and \mathrm{lrd}_k(p) is the local reachability density of the given object p.


If this value is greater than 1, the local reachability density of the given sample is lower than those of its k-nearest neighbors, i.e. the sample lies in a sparser region than its neighbors, and it is considered to be an outlier. The basic idea behind Local Outlier Factor can be seen in figure 3.1.

Figure 3.1: Basic idea behind Local Outlier Factor taken from [23].

Figure 3.1 shows the local reachability densities for some samples. Sample A has a sparse local reachability density, with large k-distances to its neighbors. It is thereby considered to be an outlier.
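A minimal sketch of unsupervised outlier detection with Local Outlier Factor, using scikit-learn's LocalOutlierFactor. The toy data and the choice of k (n_neighbors) are illustrative assumptions and not the settings used in this thesis.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy 2-D data: a dense cluster plus one far-away sample (a likely outlier).
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2], [0.2, 0.0], [5.0, 5.0]])

# fit_predict returns +1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)

# negative_outlier_factor_ holds the negated LOF scores; values well below -1
# correspond to LOF values clearly greater than 1, i.e. outliers.
print(labels)                          # e.g. [ 1  1  1  1 -1]
print(-lof.negative_outlier_factor_)   # the LOF value per sample
```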

3.1.2 Data transformation

3.1.2.1 Data Aggregation

Data aggregation is a process in which data is gathered and expressed in a summary form [24]. The purpose of data aggregation is to prepare data sets for data processing. Data aggregation can, for example, consist of aggregating raw data over some time to provide new information such as mean, maximum, minimum, sum or count. With this new information, one can analyze the aggregated data to gain insights about the data.

With the explosion of big data, organizations and companies are seeking ways of understanding their large amounts of information. Data aggregation can be a critical part of data management. One example of this is that minimizing the number of rows dramatically decreases both the time required to process the data and the size of the data. Data aggregation is a popular strategy in Business Intelligence (BI) solutions and databases.
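As a small illustration of data aggregation, the sketch below groups hypothetical raw BMS samples per charge/discharge cycle and summarizes them with pandas; the column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw time-series rows: one row per recorded BMS sample.
raw = pd.DataFrame({
    "cycle_id":  [1, 1, 1, 2, 2],
    "cell_temp": [21.0, 23.5, 25.0, 22.0, 24.0],
    "current":   [10.2, 11.0, 9.8, 12.1, 11.5],
})

# Aggregate the raw samples per cycle into summary statistics,
# reducing many rows to one row per cycle.
per_cycle = raw.groupby("cycle_id").agg(
    mean_temp=("cell_temp", "mean"),
    max_temp=("cell_temp", "max"),
    mean_current=("current", "mean"),
    n_samples=("current", "count"),
)
print(per_cycle)
```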

3.1.2.2 Normalization techniques and Normal distributions

Normalization techniques are used in statistics and applications of statistics. Normalization can have different purposes and can be used within data transformation. In the simplest cases, normalization is used to adjust values measured on different scales to a common scale. In more advanced cases, normalization is used to change whole probability distributions. There are different normalization algorithms, for example Standard score, Studentized residual and Min-Max feature scaling. They all have different purposes and applications. The

one that this report will focus on is called Standard score. Standard score is a common normalization technique used when working with machine learning algorithms.

3.1.2.3 Normal Distributions

In probability theory, the normal distribution (Gaussian distribution) is an important probability distribution [25]. The normal distribution is connected to the central limit theorem, which states that the sum (or mean) of a set of observations of random variables will become approximately normally distributed when the number of observations becomes sufficiently large. The distribution fits many natural phenomena, for example blood pressures, heights, IQ scores, and measurement errors. The normal distribution is always symmetrical around the mean, and the shape of its distribution is determined by the mean and a value called the standard deviation. The graph for the normal distribution can be seen in figure 3.2.

Figure 3.2: The normal distribution taken from [26].

In figure 3.2 σ is the standard deviation and µ is the mean. One can see that 68 % of the samples are within one σ of the mean, 95 % are within two standard deviations of the mean and 99.7 % of the samples are within three standard deviations of the mean.

The values come from the probability density function for the normal distribution. This function can be seen in equation 3.2.

P(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (3.2)

where x is a single data sample, σ is the standard deviation and µ is the mean.


3.1.2.4 Shapiro-Wilk test

The Shapiro-Wilk test was published in 1965 by Samuel Sanford Shapiro and Martin Wilk [27]. It is a test of normality and tests the null hypothesis that a sample comes from a normally distributed population.

W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (3.3)

where x_{(i)} is the i-th order statistic (the i-th smallest value in the sample), x_i is the i-th observed value and \bar{x} is the sample mean. The coefficients a_i are given by a vector of expected values of the order statistics of independent and identically distributed random samples. The null hypothesis is rejected if the probability value (p-value) is less than the chosen alpha value. If the p-value is higher than the chosen alpha value, the null hypothesis cannot be rejected and the tested data is considered consistent with a normal distribution. The sample size must be greater than two, and if the sample size is sufficiently large the test may detect departures from the null hypothesis that are trivial, resulting in rejections based on small, meaningless departures.
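A minimal sketch of running the Shapiro-Wilk test with SciPy; the synthetic sample and the significance level α = 0.05 are assumptions for the example.

```python
import numpy as np
from scipy import stats

# Synthetic, roughly normally distributed sample.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

w_statistic, p_value = stats.shapiro(sample)

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print(f"W={w_statistic:.3f}, p={p_value:.3f}: cannot reject normality")
else:
    print(f"W={w_statistic:.3f}, p={p_value:.3f}: reject the null hypothesis of normality")
```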

3.1.2.5 Standardization (Standard Score)

The standard score is a normalization technique where the mean is subtracted from the observed value and the result is divided by the standard deviation [28].

z = \frac{x - \alpha}{\beta}    (3.4)

where x is the sample value, α is the mean and β is the standard deviation. The result of applying the standard score is that all values are rescaled so that the standard deviation is one and the mean is zero. Having a standard normal distribution is a general requirement for a large proportion of machine learning algorithms. Some examples of machine learning algorithms that need standard scores are neural networks with gradient descent as an optimization algorithm, K-nearest neighbors with Euclidean distance measurement and principal component analysis [29]. All have their different reasons. For example, for neural networks with gradient descent-based optimization, some weights will otherwise be updated much faster than others, and for principal component analysis you want to find the directions that maximize the variance, which can only be done when the values have the same scale.
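A minimal sketch of standardization, both written out directly from equation 3.4 and with scikit-learn's StandardScaler; the feature matrix is a made-up example. In practice the scaler would be fitted on the training data only and then reused to transform validation and test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are samples, columns are signals.
X = np.array([[3.7, 25.0],
              [3.6, 30.0],
              [3.9, 22.0],
              [3.5, 28.0]])

# Manual standard score: subtract the mean and divide by the standard deviation.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same thing with scikit-learn.
scaler = StandardScaler()
z_sklearn = scaler.fit_transform(X)

print(np.allclose(z_manual, z_sklearn))  # True
```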

3.1.3 Feature Selection

Feature selection is the process of selecting a subset of features out of an original larger set of features [30]. The newly selected subset of features is instead used when constructing a machine learning model. There are mainly five reasons for using feature selection. These are listed down below.

• Simplification of data makes the models easier to understand by users and researchers.

• Reduces the chances of overfitting the model. Overfitting means that the model cannot generalize to new unseen data.

• Shortens the time needed for training the model.

• Avoids the curse of dimensionality. The curse of dimensionality refers to various negative phenomena that can occur when dealing with high-dimensional data.


• It improves the performance of the model if the most suitable subset is selected.

Feature selection emanates from the fact that there are features that are redundant or irrelevant for the particular problem faced and can, therefore, be removed. Redundant means that some features could be relevant but are highly correlated with other, more relevant features. Feature selection is often used for data sets which have many features but comparatively few samples. The most famous cases where feature selection is used are the analysis of written texts and DNA analysis. In these cases, there are thousands of features but only hundreds of samples. The aim of doing feature selection is to obtain the most optimal subset of features for the particular problem. There are three main categories of feature selection algorithms: Filter methods, Wrapper methods and Embedded methods. Filter methods and Wrapper methods will be explained in the upcoming paragraphs. This report will not cover Embedded methods.

3.1.3.1 Filter Method

Filter methods are the simplest feature selection algorithms. In filter methods, a measurement is used to determine how relevant and important a feature is. The measurements that are most commonly used are mutual information, the Pearson correlation coefficient, relief-based algorithms and significance tests. These measurements are fast to compute while still capturing the importance of the features. Some filter methods provide the "best" subset of features while others provide a ranking of the features. The typical pipeline for a filter method can be seen in figure 3.3.

Figure 3.3: The typical pipeline for a filter method taken from [31].

Figure 3.3 shows the typical pipeline for a filter method. It starts with all the features. Then each feature's correlation with the target variable is calculated, and the best subset is selected with respect to the chosen threshold. This subset is used when training and later evaluating the model.


The measurement that this report will focus on is the Pearson correlation coefficient [32]. The Pearson correlation coefficient was created by Karl Pearson and measures the linear correlation between two variables. It has a value between -1 and +1, where +1 means a total positive linear correlation and -1 means a total negative linear correlation. A value of 0 means that there is no linear correlation between the variables. The correlation coefficient is defined as the covariance of the two variables divided by the product of their standard deviations.

\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}    (3.5)

where cov is the covariance, σ_X is the standard deviation of X (X being the population of the feature x) and σ_Y is the standard deviation of Y (Y being the population of the target feature y). When used in the filter method, the user decides what threshold the Pearson correlation coefficients should exceed when selecting the subset of features. This means that the filter method consists of first calculating the Pearson correlation coefficient of every feature and later selecting those features whose Pearson correlation coefficient satisfies the chosen threshold. The negative aspect of using a filter method with Pearson correlation coefficients is that it does not take into consideration the dependencies between features in the data. It can only find the best subset of features if all features in the data are completely independent of each other. It also does not take into consideration non-linear correlation between the features and the target feature. So, when dealing with data sets with non-linearity and features with dependencies, the only result that is obtained from this method is a ranking of how all features are linearly correlated to the target feature. It will not find the most optimal subset of features, but this ranking can still provide the user with a first overall linear correlation description of the data.
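A minimal sketch of a Pearson-correlation filter in pandas: rank the features by absolute correlation with the target and keep those above a threshold. The data frame, the target column name and the threshold are assumptions for the example, not the values used in the thesis.

```python
import pandas as pd

def pearson_filter(df: pd.DataFrame, target: str, threshold: float = 0.5) -> list:
    """Return the features whose absolute Pearson correlation with the
    target column is at least `threshold` (a simple filter method)."""
    correlations = df.drop(columns=[target]).corrwith(df[target])
    ranking = correlations.abs().sort_values(ascending=False)
    return ranking[ranking >= threshold].index.tolist()

# Hypothetical usage, where df holds the pre-processed signals and the capacity:
# selected = pearson_filter(df, target="capacity", threshold=0.5)
```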

3.1.3.2 Wrapper Method

Wrapper methods are feature selection methods which take both linear and non-linear dependencies between features into consideration. Wrapper methods can deal with data sets whose features are dependent on each other. Wrapper methods use a predictive model to score different selected subsets of features. A new model is trained for each subset, and a score is calculated based on how well the model performed. Wrapper methods are very computationally heavy but are usually the best feature selection methods to use. The typical pipeline for a wrapper method can be seen in figure 3.4.


Figure 3.4: The typical pipeline for a Wrapper method taken from [33].

The typical pipeline for a wrapper method starts with all the features. Then subsets of features are generated and evaluated until the subset of features which gives the best performance is found. It is essentially reduced to a search problem and is very computationally expensive. There are mainly three different wrapper methods. These are called Forward Selection, Backward Elimination and Recursive Feature Elimination. These are listed and explained below.

• Forward Selection: Forward Selection starts with no features. Then for each iteration, it adds the feature that best improves the model. This is done until no feature can be added to improve the performance of the model.

• Backward Elimination: Backward Elimination starts with all features. It later removes the least important feature at each iteration concerning the performance. This is repeated until no removal of features can improve the performance.

• Recursive Feature Elimination: Recursive Feature Elimination is a greedy optimization algorithm which repeatedly creates new models. At each iteration, it saves the best performing feature. It later constructs the next model with the saved features. This is done until all features have been tried.

The wrapper method that this report will cover is a backward elimination method based on shadow features and a random forest algorithm [34]. The wrapper method consists of four major steps in its pipeline. The pipeline is explained below.

• 1: The given data set is extended by creating shuffled copies of all features. These new copies are called shadow features. Each original feature has one corresponding shadow feature.

• 2: A random forest regressor is later trained on the extended data and a feature importance measure is applied. The measurement is called Mean Decrease Accuracy and explains how important each feature is.

• 3: Later at each iteration, it checks if an original real feature has higher importance than the best of the shadow features. If the best shadow feature has higher importance, then that feature is considered unimportant and is removed.

• 4: This is done until all features are either confirmed or rejected.


In the backward elimination method based on shadow features and a random forest algorithm, the original data is expanded with an equally large randomized data set of shadow features. Then, at each iteration, a random forest is trained on the expanded data set. This calculates the relative importance of each feature. Then, for each feature, a check is made to see if it has a higher importance than the best of the shadow features. If the feature has the higher value, this is considered to be a "hit". The hit is later saved in a table. At each iteration, another check is also done which evaluates the number of times the feature did better than the best shadow feature, using a binomial distribution. For example, say F1 is a feature which gets 3 hits over 3 iterations. With these values, we can calculate a probability value using a binomial distribution with k=3, n=3, and p=0.5. F1 is then confirmed to be an important feature, and its column no longer needs to be tested in the following iterations. If a feature has not recorded enough hits over a certain number of iterations, it is considered to be unimportant and is rejected (also removed from further testing). The iterations continue until all features have either been confirmed or rejected.
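The sketch below shows a single iteration of the shadow-feature check with a random forest. It is a simplification of the method described above: it uses scikit-learn's built-in impurity-based feature importance rather than Mean Decrease Accuracy, and the iterative bookkeeping with the binomial test is only hinted at in the comments. The variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_hits(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.Series:
    """One iteration: True for features whose importance exceeds the
    importance of the best shuffled (shadow) feature."""
    rng = np.random.default_rng(random_state)

    # Step 1: extend the data with one shuffled copy (shadow) per feature.
    shadows = X.copy()
    for col in shadows.columns:
        shadows[col] = rng.permutation(shadows[col].values)
    shadows.columns = ["shadow_" + c for c in X.columns]
    extended = pd.concat([X, shadows], axis=1)

    # Step 2: train a random forest and read out feature importances.
    forest = RandomForestRegressor(n_estimators=200, random_state=random_state)
    forest.fit(extended, y)
    importances = pd.Series(forest.feature_importances_, index=extended.columns)

    # Step 3: a "hit" is a real feature beating the best shadow feature.
    best_shadow = importances[shadows.columns].max()
    return importances[X.columns] > best_shadow

# Step 4 (not shown): repeat over many iterations, count hits per feature and
# confirm or reject features with a binomial test, e.g.
# hits = shadow_feature_hits(features_df, capacity_series, random_state=i)
```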

3.2 Linear Regression

Linear regression is a linear model which tries to model the relationship between a target variable and predictor variables [35]. It assumes that the relationship between the target variable and the predictor variables is linear. When there is only one predictor variable the model is called simple linear regression and when there is more than one predictor variable it is called multiple linear regression.

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i    (3.6)

where Y_i is the target variable, \beta_0, \beta_1, \ldots, \beta_p are regression coefficients and X_{i1}, \ldots, X_{ip} are the predictor variables. \epsilon_i is the error variable, an unobserved random variable which adds noise to the linear relationship. \beta_1, \ldots, \beta_p can be interpreted as partial derivatives of the target variable with respect to the predictor variables. In figure 3.5 an example of a simple linear regression model can be seen.

Figure 3.5: A simple linear model taken from [36].

In figure 3.5, the red dots are the observed values of the target variable plotted against the predictor variable. The green lines are the random deviations from the underlying relationship, which the blue line represents.
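A minimal sketch of fitting a multiple linear regression of the form in equation 3.6 with scikit-learn; the data is synthetic and only serves to show how the coefficients β and the intercept are obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: y = 1 + 2*x1 - 0.5*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)     # estimate of beta_0
print(model.coef_)          # estimates of beta_1 and beta_2
print(model.predict(X[:5])) # predictions for the first five samples
```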


3.3 Artificial neural networks

This section will describe what Artificial neural networks (ANNs) [37] are and in detail explain the different aspects of ANNs.

3.3.0.1 Background

The interest in neural networks began in 1958 when the perceptron was created by Rosenblatt. But due to computer hardware limitations and the fact that basic perceptrons were not sufficient, the interest stagnated until the backpropagation algorithm was created in 1975. The backpropagation algorithm made it possible to train multi-layer networks in a much more efficient way and paved the way for new sets of applications. In the following years more and more algorithms were discovered while the computational power increased. This has led us to what we have today. Today there are various ANNs including recurrent neural networks, deep neural networks, and convolutional neural networks. These ANNs are used, for example, for image, text, and voice recognition.

3.3.0.2 General description

ANNs are computing systems inspired by the neurons and synapses in our brains. ANNs are a collection of connected nodes called artificial neurons. These artificial neurons try to model the neurons in our human brain. Each connection between the artificial neurons can send signals between them, just as the synapses in our brain. One can see ANNs as a framework where different types of machine learning algorithms can be used to process complex data inputs. The most important quality that ANNs have is the ability to perform tasks without being hardcoded for them. An ANN learns how to process new data based on what it has previously learned. The basic idea with a feedforward ANN is to train the network so that it can predict what the input data it gets corresponds to, e.g. predict that an input image contains a human face or a dog. The procedure for training a network for this kind of problem is called supervised learning. In supervised learning, the network is fed with labeled training data over a time duration. After the time duration is over, the network changes its internal parameters with respect to the error it had. The update of its internal parameters is done through an optimization strategy; more about these in the optimization section. The error is calculated by comparing what the network predicted for the training data with what the training data was labeled with. The calculation of the error is called the cost function and will be described more closely in the cost-functions section. The ANN repeats this procedure over the labeled training data until the maximum number of iterations has been reached. When this occurs, the network gets tested on the test data. It is here one can tell if the network has learned to generalize its ability to predict, or if the network is overfitted to the training data and predicts wrongly on new data. More on this in a later section. The basic structure of an ANN consists of nodes, layers, and weights. These handle data inputs and compute an output as a result of the data input. A simple ANN can be found in figure 3.6 below.


Figure 3.6: A simple ANN taken from [38].

The ANN in figure 3.6 consists of three layers. The first layer is called the input layer and is the layer which first receives the input data. The input layer here consists of three nodes, meaning that the input data consists of three features. Together, these three features form a sample. This sample can represent any kind of data; e.g. it can represent a point in a vector space (x,y,z) or three properties of a flower. Each feature needs to be represented by a number representing its value. Since this is a fully connected network, every feature value gets transmitted to every hidden node in the hidden layer. When a feature value is transmitted from the input layer to the hidden layer, it is multiplied with a unique weight for each hidden-layer node. The weights are scalar values and are for the most part initialized with a random value within a certain range. The second layer in this ANN is called the hidden layer. It has four nodes, and since this ANN is fully connected, every node computes the sum of every weighted feature from the previous layer.

p_j(t) = \sum_i o_i(t) w_{ij}    (3.7)

where p_j(t) is the output of the given node and o_i(t) w_{ij} is the weighted output from the previous node. When this has been done, the node activates the summation of the weighted features with an activation function; more about activation functions in the next section. The hidden layer later transmits its activated values to the third layer in the same way as the input layer transmitted to the hidden layer. The third layer is called the output layer and has two nodes. Every node in the output layer calculates the sum of the weighted activated values and computes the activation of this value using an activation function. The network later decides what to output based on which activation in the output layer has the highest value. This procedure is done for every single sample in the training data. With an increasing number of nodes and layers, the complexity increases and more computations need to be done. But this also increases the possibility to create a network which can handle a large amount of complex data with a high degree of variation. The difficulty for a designer of an ANN is to first find the right number of hidden nodes and hidden layers to use in his/her network. The

The creator also needs to find the right type of activation functions, cost function, optimization strategy, and validation procedure. There are also many different variations of ANNs, each with a unique architecture and each most suitable for a certain type of data and problem. The type of ANN that has been chosen for this problem is a recurrent neural network. More about recurrent neural networks in a later section.
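To make the forward pass in equation 3.7 concrete, the following is a minimal NumPy sketch of a network shaped like the one in figure 3.6 (three input features, four hidden nodes, two output nodes). The layer sizes, the random weights and the choice of a sigmoid activation are illustrative assumptions and not values used in this thesis.

```python
import numpy as np

def sigmoid(x):
    # Squashes values into (0, 1); used here as the activation function.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# One sample with three features, as in figure 3.6.
x = np.array([0.2, -1.3, 0.7])

# Randomly initialized weights: 3 inputs -> 4 hidden nodes -> 2 output nodes.
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# Hidden layer: each node sums its weighted inputs (equation 3.7) and activates.
hidden = sigmoid(x @ W1)

# Output layer: the same procedure applied to the activated hidden values.
output = sigmoid(hidden @ W2)

# The predicted class is the output node with the highest activation.
prediction = int(np.argmax(output))
print(output, prediction)
```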

3.3.0.3 Activation functions

The activation function decides what a node in an ANN should output. It is used to normalize the outputs and in some cases it acts as a binary switch which decides whether the node should be "activated" or not. Non-linear activation functions are central in making the neural network able to model non-linear relationships. Non-linearity is needed because the aim in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weights and inputs. Some of the most common activation functions for ANNs are listed below, together with descriptions of how they work; a small code sketch follows the list.

• Tanh: Tanh normalizes the input between -1 and 1.

tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}    (3.8)

where x is the input value and e is Euler's number.

• Rectified linear unit (ReLU): ReLU outputs the maximum of zero and the input value. This makes negative outputs impossible.

ReLU(x) = \max(0, x)    (3.9)

where x is the input value.

• Sigmoid: Sigmoid normalizes the input between 0 and 1.

f(x) = \frac{1}{1 + e^{-x}}    (3.10)

where e is Euler's number and x is the input value.
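As a reference, the three activation functions above can be written directly in NumPy. This is a minimal sketch and not code from the thesis.

```python
import numpy as np

def tanh(x):
    # Equation 3.8: normalizes the input to the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Equation 3.9: returns max(0, x), so negative inputs become zero.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation 3.10: normalizes the input to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), relu(x), sigmoid(x))
```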

3.3.0.4 Cross-validation

Cross-validation strategies are techniques for validating an ANN and decide how the whole data set should be divided into training, test and validation parts. The cross-validation must be well thought out, since problems like overfitting can occur otherwise. Overfitting is the phenomenon where the ANN performs well on the training data but cannot generalize its predictions to new, unknown data. This is mostly because the training data has had a unique pattern and behavior which does not exist in the test data. Two common types of cross-validation are listed below; a small usage sketch follows the list.

• Holdout method: This method splits the entire data set into two parts; one is used as training data and the other as test data.

• n-fold cross-validation: The original data set is divided into n equal-sized subsamples. One subsample is then picked as validation data while the remaining n-1 subsamples are used as training data. This procedure is repeated n times until all n subsamples have been used for validation. A single estimate is then produced by averaging the n results.
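The two strategies can be illustrated with scikit-learn. The toy data and the choice of n = 5 folds below are illustrative assumptions, not values from the thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10)                  # toy targets

# Holdout method: one training part and one test part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n-fold cross-validation: every sample is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X):
    # Train on n-1 folds, validate on the remaining fold, then average the n scores.
    print(len(train_idx), len(val_idx))
```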


3.3.0.5 Cost-functions

Cost functions measure how well an ANN performed when mapping the training examples to correct outputs. They compute this result by calculating the differences between the outputs and the correct outputs. This is done for every training sample, and an error or score is then calculated by averaging all the differences. The creator of the ANN wants to either minimize (if it is an error) or maximize (if it is a score) the cost function by using an optimization algorithm; more about optimization algorithms in the next section. A cost function depends on four parameters: the weights and the biases in the network, one single training sample, and the correct output. It returns a single value which represents either an error or a score. Two of the most used cost functions are listed below; a small code sketch follows the list.

• Mean Squared Error (MSE): MSE sums the squared error over the whole training set.

C_{MSE}(W, B, S^r, E^r) = 0.5 \sum_j (a_j^L - E_j^r)^2    (3.11)

where W are the weights, B the biases, S^r the training sample, E^r the correct output, and a_j^L the activation of output node j.

• Cross-entropy cost (CE): CE sums all the cross-entropies in a training set.

C_{CE}(W, B, S^r, E^r) = -\sum_j \left[ E_j^r \ln a_j^L + (1 - E_j^r) \ln(1 - a_j^L) \right]    (3.12)

where W are the weights, B the biases, S^r the training sample, E^r the correct output, and a_j^L the activation of output node j.
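The following is a minimal NumPy sketch of the two cost functions in equations 3.11 and 3.12 for a single training sample; the toy vectors are illustrative.

```python
import numpy as np

def mse_cost(a, e):
    # Equation 3.11: half the summed squared difference between
    # the network output a and the correct output e.
    return 0.5 * np.sum((a - e) ** 2)

def cross_entropy_cost(a, e):
    # Equation 3.12: summed cross-entropy between output a and correct output e.
    return -np.sum(e * np.log(a) + (1.0 - e) * np.log(1.0 - a))

a = np.array([0.8, 0.1])   # output activations for one sample
e = np.array([1.0, 0.0])   # correct (labeled) output
print(mse_cost(a, e), cross_entropy_cost(a, e))
```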

3.3.0.6 Optimization algorithms

Optimization algorithms are used to either minimize or maximize the return value from the cost function. They play a major role in the training of an ANN because they state how the internal parameters, the weights and the bias values, should be updated. The goal of an optimization algorithm is to update the internal parameters in the direction of the optimal solution. There are two major categories of optimization algorithms: first-order optimization algorithms and second-order optimization algorithms. First-order optimization algorithms use the cost function's gradient values to update the ANN's internal parameters, while second-order optimization algorithms use the cost function's second-order derivative values. First-order optimization algorithms converge rather fast on large data sets and are easier to compute than second-order optimization. Second-order optimization algorithms are often slower and require more time and memory. Two of the most used update rules are listed below; a small code sketch follows the list.

• Gradient Descent: Gradient descent is a common algorithm used for optimization. It updates the weights and bias values after each training step according to the formula in equation 3.13 below. The gradient is calculated using a method called backpropagation.

\theta_{t+1,i} = \theta_{t,i} - \eta \cdot \nabla_\theta J(\theta_{t,i})    (3.13)

where \theta are the weight and bias values, \eta is the learning rate, and \nabla_\theta J(\theta) is the gradient of the cost function. Backpropagation is commonly used in the training of deep neural networks. The method computes the gradients for each layer in the ANN by using the chain rule on the cost function. By using the chain rule, the method can propagate backwards through the network so that the effect each inner neuron had


on the output can be traced and the neuron can thereby be updated correctly. Optimization algorithms that use gradient descent include standard gradient descent, stochastic gradient descent, and mini-batch gradient descent.

• Momentum: Momentum is an improvement of the previously described gradient descent. It is gradient descent with a short-term memory: it takes more than only the latest gradient into consideration when updating the weights.

z_{t+1,i} = \beta \cdot z_{t,i} + \nabla_\theta J(\theta_{t,i})    (3.14)
\theta_{t+1,i} = \theta_{t,i} - \eta \cdot z_{t+1,i}    (3.15)

where \theta are the weight and bias values, \eta is the learning rate, \nabla_\theta J(\theta) is the gradient of the cost function and z_{t+1,i} is the momentum value. Momentum gives gradient descent an acceleration, and how large this acceleration should be is up to the user of the algorithm. When \beta = 0, momentum becomes exactly like gradient descent, while for \beta = 0.99 the acceleration can be substantial. Momentum has proved to be suitable for data with pathological curvature. Pathological curvature is when regions of the cost surface are not scaled properly and form shapes similar to valleys, canals and ravines. Where gradient descent gets stuck in or jumps between these regions, momentum more often progresses along the optimal direction. One optimizer which uses momentum is Adam.
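The two update rules can be sketched as follows for a one-dimensional toy cost J(\theta) = \theta^2; the learning rate, the value of \beta and the starting point are illustrative assumptions.

```python
def grad_J(theta):
    # Gradient of the toy cost J(theta) = theta^2.
    return 2.0 * theta

eta, beta = 0.1, 0.9   # learning rate and momentum factor (beta = 0 gives plain gradient descent)

# Plain gradient descent (equation 3.13).
theta = 5.0
for _ in range(50):
    theta = theta - eta * grad_J(theta)

# Gradient descent with momentum (equations 3.14 and 3.15).
theta_m, z = 5.0, 0.0
for _ in range(50):
    z = beta * z + grad_J(theta_m)
    theta_m = theta_m - eta * z

print(theta, theta_m)
```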

3.3.1 Recurrent neural network

Recurrent neural networks (RNNs) are a type of ANN with a structure that allows training on data where the order of the input sequence is of importance [39]. Recurrent neural networks can scale to much longer sequences than traditional feedforward neural networks. An RNN processes its inputs as sequences, defined as vectors whose elements are separated by a time step. The time step can either be a timestamp in the real world or the position in the sequence. RNNs are based on the idea of an unfolded computational graph to include cycles [40]. In these cycles, the present value has an influence on its own value at a future time step. With a computational graph, there is a way to form a structure which can map inputs to outputs where the inputs have a repetitive structure, typically corresponding to a chain of events. The structure of most RNNs can be decomposed into three blocks of parameters and associated transformations: from the input to the hidden state, from the previous hidden state to the next hidden state, and from the hidden state to the output. Each of these transformations corresponds to a shallow transformation, that is, a learned affine transformation followed by a fixed non-linearity. RNNs are typically normal feedforward networks where the outputs from the hidden nodes are fed back into themselves. There are three different types of RNNs: many-to-many, one-to-many, and many-to-one. The different types and their explanations are listed below.

• Many-to-many: The network takes in a sequence of inputs and outputs a sequence of outputs.

• One-to-many: The network takes in a single input and outputs a sequence of outputs.

• Many-to-one: The network takes in a sequence of inputs and outputs a single output.

The vanilla RNN is a fully recurrent neural network, meaning that every neuron in a given layer is directly connected to every neuron in the next layer. Each hidden neuron is at every time step feeding its output back into itself. With this behavior, the hidden neurons calculate their outputs with regard to the values they previously outputted. The network has a


Figure 3.7: Vanilla recurrent neural network taken from [41].

"memory" of the previous inputs. A folded and unfolded example of a Vanilla RNN can be seen in figure 3.7. In figure 3.7 the folded vanilla RNN shows the cycles where the hidden node h always is saving its previous output and is considering this value when calculating its next output. The unfolded vanilla RNN shows how this looks in practice over three-time steps t-1, t and t+1.

The vanilla RNN is not used very often for long-term memory tasks. The main reason for this is the vanishing gradient problem. The vanishing gradient problem is that the more time steps we have, the greater the chance that the back-propagated gradients either accumulate, explode or vanish down to zero. This is because, in the backpropagation, the gradients become exponentially smaller due to long-term interactions which involve the multiplication of many Jacobians. With many time steps, many non-linear functions are applied and the result is highly non-linear, with most of the derivatives being small, some being large, and many alternations between increasing and decreasing. Small derivatives are especially common when the sigmoid function is used as activation function, because the gradient of the sigmoid becomes very small when its output is close to zero or one. The problem cannot be avoided by simply staying in a region of parameter space where the gradients do not vanish or explode. This is because, to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where the gradients vanish. Whenever the model is able to represent long-term dependencies, the gradient of a long-term dependency has exponentially smaller magnitude than the gradient of a short-term dependency. It is not impossible to learn long-term dependencies, but it might take a very long time because the long-term dependencies will tend to be hidden by the smallest fluctuations arising from the short-term dependencies. To remove the vanishing gradient problem we need to create paths where the derivatives neither vanish nor explode. Gated RNNs are a type of RNN that does this by updating the weights at a given time step only when suitable and by letting the network learn to clear its state when needed. Long short-term memory (LSTM) is a gated RNN that does exactly this.

3.3.2 LSTM

LSTM introduced the idea of self-loops to create paths where the gradient neither vanishes nor explodes [40]. The most important addition has been to make the weight in this self-loop dynamic rather than fixed. By making the weight of the self-loop gated (controlled by another hidden unit), the effect time has on the integration can be changed dynamically, because the time constants are outputted by the model itself. LSTM has been enormously

successful in many applications such as speech and handwriting recognition. In an LSTM the neurons are replaced by LSTM cells. The cell has three different gates that control the various aspects of the cell. The time dependence and the effect of previous inputs are controlled by a forget gate. The input gate controls how much of the current input enters the cell and the output gate controls what value the output should have. The LSTM cell has an internal state with a linear self-loop whose weight is controlled by the forget gate. The LSTM cell can be seen in figure 3.8.

Figure 3.8: The LSTM cell taken from [42].

In figure 3.8 we can see the whole structure of the LSTM cell. The cell can be divided into three different stages: the input stage, the internal stage, and the output stage. The cell gets two inputs; x_t is the current input to the cell and h_{t-1} is the previous output from the cell. In the input stage, the current input to the cell is processed by the input gate. The input gate is a sigmoid function which acts as a switch that turns inputs either "on" or "off". The input gate outputs a gating value which is combined with the activation of the current input to the cell. This gated input leaves the input stage and enters the internal stage. In the internal stage, the forget gate decides what parts of the previous state should be forgotten. The internal state is then computed by combining the gated input from the input stage with the previous state scaled by the output of the forget gate. Lastly, the internal state reaches the output stage. Here the output gate, driven by the current inputs to the cell, decides how much of the activated internal state affects the output. The output from the cell is then computed by combining the activated internal state with the output of the output gate.

Instead of (as in the case of the vanilla RNN) having nodes which simply apply an element-wise non-linearity to an affine transformation of the inputs, LSTM has cells which themselves have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. Each cell has the same input and output as a vanilla RNN but has an internal structure of gates and an internal state which controls the flow of information. With this structure the vanishing gradient problem can be dealt with by creating a path along which the gradients neither vanish nor explode.
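For completeness, a common formulation of the gates and states described above is given below. The notation (\sigma for the sigmoid gates and \odot for element-wise multiplication) is the standard textbook one and is not taken from this thesis.

```latex
% Forget, input and output gates, all computed with the sigmoid function sigma.
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)

% Internal (cell) state: forget part of the old state and add the gated new input.
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c)

% Output of the cell: the gated activation of the internal state.
h_t = o_t \odot \tanh(c_t)
```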

4 Method

This chapter will describe how the whole study was performed. The chapter is divided into two parts, starting with an explanation of how the prestudy was carried out and ending with how the actual study was done. The different software libraries and tools used can be found in Appendix A.

4.1 Prestudy: Capacity estimation using lab data from NASA

A prestudy was carried out. The prestudy investigated how well an LSTM model performed when estimating the capacity on Li-ion lab data from NASA. The lab data consisted of 34 batteries, each discharged over roughly 150 cycles. The batteries were discharged under the same environment until EOL was reached. The only difference between the discharges was the state of charge reached, meaning that some batteries were discharged over longer time durations during the cycles and others over shorter time durations. The parameters captured during the discharge were ambient temperature, cell temperature, discharge current, impedance (internal resistance), battery voltage and capacity. This resulted in a data set consisting of 6 features and 7 million samples.

4.1.1 Method

To be able to use an LSTM model, the input data must be transformed from two dimensions to three dimensions. The new dimension is called the lookback. The lookback states how many previous inputs the LSTM cell should take into consideration in its weight updates. The original data set had samples along the y-axis and features along the x-axis. After the transformation, the features are along the z-axis, the lookback along the y-axis and the samples along the x-axis. The number of elements in the matrix increases by a factor of the lookback. After the transformation, the target variable, the capacity, was extracted and the data set was divided into training and test parts. The model had 40 hidden LSTM cells with ReLU activations and a lookback of 5. Adam was used as the optimizer and mean squared error as the loss function. The training was done over 10 epochs with a batch size of 1000.
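A minimal sketch of the prestudy model described above, assuming a Keras-style API (the actual libraries are listed in Appendix A). The data loading is replaced with placeholder arrays and the final Dense output layer is an assumption, while the hyperparameters (40 LSTM cells, ReLU activation, lookback 5, Adam, mean squared error, 10 epochs, batch size 1000) follow the text.

```python
import numpy as np
from tensorflow import keras

lookback, n_features = 5, 6

# Placeholder arrays standing in for the reshaped NASA lab data:
# X has shape (samples, lookback, features) and y holds the capacity target.
X_train = np.random.rand(1000, lookback, n_features)
y_train = np.random.rand(1000)

model = keras.Sequential([
    keras.layers.LSTM(40, activation="relu", input_shape=(lookback, n_features)),
    keras.layers.Dense(1),   # single output node: the estimated capacity
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X_train, y_train, epochs=10, batch_size=1000, verbose=0)
```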


4.1.2 Result

The LSTM model estimated the capacity with high accuracy. The root mean squared error on the test data was 0.011 ampere-hours (0.66 ampere-minutes). Figure 4.1 shows every 3000th estimation, illustrating the overall trend of the accuracy.

Figure 4.1: Absolute difference for every 3000th estimation

Figure 4.1 shows that the absolute difference is most often under 0.04 ampere-hours away from the true value. The worst accuracy is close to 0.008.

4.1.3 Conclusion

It is fair to say that the LSTM model estimated the capacity well. A root mean squared error of 0.011 ampere-hours is relatively low compared to a capacity ranging from 1.75 ampere-hours down to 0 ampere-hours. The problem is that this was a limited lab data set. The lab data was collected under ideal conditions for all batteries and the only difference was the state of charge reached. What would happen if real data collected from real cars driving were used instead? Would the LSTM model be able to estimate the capacity with the same accuracy?


4.2 Overall Project Progression

The project can be broadly divided into modules as shown in figure 4.2. Below is a description of the project pipeline.

Figure 4.2: Model Pipeline

Figure 4.2 shows the model pipeline. It starts with the pre-processing, including data cleaning and data transformation, followed by feature selection with both a filter and a wrapper method. Finally, two models are tested on the data: linear regression and LSTM.

4.2.1 Description of the data and new approach of the thesis

The data has been provided by the Wireless Information Collection Environment (WICE) team at Volvo Cars. The WICE team is responsible for the development of the WICE system, which is used for data collection and management. The data contains signals collected from the battery management systems in 32 cars. These cars are located in the Gothenburg area in Sweden, the Charleston area in the USA, or China. The cars are of the same model of plug-in hybrid electric vehicle (PHEV). A PHEV combines a gasoline or diesel engine with an electric motor. The battery in a PHEV is recharged with a plug, which allows for long-distance drives using only the electric motor. When the battery is empty, the gasoline or diesel engine takes over. The data has been collected over a period of 3-6 months for each car, using mainly two different sensors. Each sensor has its unique configuration of which signals it collects: one collects roughly 1700 signals and the other roughly 360 signals. Typically, the sensor which collects 360 signals starts the collection and then the other sensor takes over, each covering roughly half of the total collection. The different signals that are collected cover more or less everything that can be sent from the cars: everything from the incline of the road, the torques in various modules, the requested vehicle speed and the actual speed, to whether a particular lamp is on or off, and so on. The data can be divided into different cycles, where each cycle is the driving between two charges of the battery. The data from each cycle for each car has been saved into .dat files.

The original aim of the thesis was to see if a recurrent deep learning model could be implemented to predict a battery's end-of-life. But to be able to do this, one must have data

collected during the entirety of a battery's life, that is, from its first cycle to its last. This data set does not start from the beginning of the batteries' lives and there is no information about what has previously happened. The data also only covers a period of 3-6 months, while it is estimated that a battery should last for around 7 years. Therefore, the decision was made to change the direction of the thesis: from predicting the battery's end-of-life to instead estimating the online state of health/capacity that the battery currently has.

4.2.2 First glimpse of the data

The data has been collected and saved as different cycles for each car. The first thing that was done was to analyze how the state of health/capacity changed over the course of the whole data set. The values for some cars can be found in figure 4.3.

Figure 4.3: Capacity values over all cycles for some cars

This trend is the same for all cars. The capacity starts at 100 percent and drops down to around 93-94 percent at the end. One can notice that the values have a stair-like behavior. This is because many nearby cycles have the same capacity value. One particular car (BDWCU9) has an anomalous increase in capacity and the decision was made to exclude this car. This can be seen in figure 4.4. When looking at the capacity values within the cycles, one very important thing was discovered: all capacity values within each cycle have the same value. This is because the capacity value is only updated when a charge has been done, that is, between two cycles. This means that all samples within a cycle will have the same capacity value. This fact is of significant importance, especially when training a recurrent model, since the recurrent model remembers what it has previously been fed. Each cycle consisted of roughly three hundred thousand samples. By feeding the recurrent model all these samples, the model would always presume that the current sample has the same capacity value as the previous ones.

4.3 Data pre-processing

This section will describe how the pre-processing of the data was done and why it was done in such a manner. All the steps in the different sections appear in the order they were implemented.


Figure 4.4: Capacity values over all cycles for some cars

4.3.1 Data cleaning

4.3.1.1 Feature cleaning

Because the data has been collected using different sensors which collect different sets of signals, the decision was made to find the intersection of signals between these sensors. This was found by iterating through each cycle for each car and finding the signals that appear in all sensors for all cars, which resulted in around 360 unique signals. When looking at the values in the data, it was clear that many signals were completely static: they would always have the same value for each cycle and each car. By looking at the variance of all signals for each cycle, all signals that had zero variance could be found and removed from the relevant signals. After this, around 280 signals remained. By examining these signals, it was clear that many of them would not be of importance for the capacity of the battery. These signals were, for example, whether a light switch was on or off, whether a window was opened or not, whether music was playing, the status of the touch screen and so on. A scientific decision was made to remove all of these signals that most likely would not affect the performance of the battery. This resulted in 131 remaining signals.

4.3.1.2 Handling of missing data & Local Outlier Factor

The whole data set was iterated through and each NaN value was replaced with the mean of the previous valid value and the next valid value. Outlier detection was done using the Local Outlier Factor algorithm. The algorithm takes two parameters: the number of neighbors to take into consideration and the threshold to use. After trial and error, twenty was chosen as the number of neighbors. The decision was made to use 0.5 as the threshold, since this seemed to remove the outliers with misplaced values.
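A minimal sketch of these two cleaning steps, assuming pandas and scikit-learn; the file name is illustrative. Replacing each NaN with the mean of the surrounding valid values corresponds to linear interpolation between neighbours, and the threshold of 0.5 mentioned above is assumed here to map to the contamination parameter of LocalOutlierFactor.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("cycle_data.csv")   # illustrative file name

# Replace each NaN with the value interpolated between the previous and next valid values.
df = df.interpolate(method="linear", limit_direction="both")

# Flag outliers with Local Outlier Factor: 20 neighbours, contamination 0.5.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.5)
labels = lof.fit_predict(df.values)   # -1 marks an outlier, 1 an inlier

df_clean = df[labels == 1]
```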

4.3.2 Data Transformation

4.3.2.1 Standard score normalization

To test whether the 131 signals in each cycle were normally distributed, the Shapiro-Wilk test was done. The Shapiro-Wilk test tests the null hypothesis that a sample came from a normal distribution and returns true or false. Shapiro-Wilk rejected the null hypothesis, which indicated that the features were not normally distributed. To make the features have the

same properties as a standard normal distribution, that is, a standard deviation of 1 and a mean of zero, a standard score normalization was done on each feature. Standard score normalization iterates through each value and replaces it with the result of the equation seen in equation 4.1.

z = \frac{x - \alpha}{\beta}    (4.1)

where, x is the sample value, α is the mean and β is the standard deviation.
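A small sketch of the normality check and the standard score normalization in equation 4.1, assuming SciPy and pandas; the file name is illustrative.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("aggregated_cycles.csv")   # illustrative file name

non_normal = []
for column in df.columns:
    # Shapiro-Wilk: a small p-value rejects the null hypothesis of normality.
    _, p_value = stats.shapiro(df[column])
    if p_value < 0.05:
        non_normal.append(column)

# Equation 4.1: subtract the mean (alpha) and divide by the standard deviation (beta).
df_standardized = (df - df.mean()) / df.std()
```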

4.3.2.2 Data aggregation

Since all samples for each cycle have the same capacity value, an aggregation needed to be done. By doing aggregation a lot of information was lost; the characteristics that a cycle had in many signals disappear. But this needed to be done to be able to create a data set which can be fed to a recurrent model. Mean aggregation was chosen as the aggregation method. This means taking the mean of the values for each signal, thereby creating one sample to represent the whole cycle. This way some characteristics of the behaviour of the car are still preserved. For example, the mean vehicle speed will give us some information about how fast the car was driven, but we will lose the information about whether the car was driven very fast at times and very slow at others. The same applies to all the important signals coming from the battery, for example the values of the voltage, resistance, current, and temperature. This aggregation method was chosen because it gives an estimation of what values the signals had during the cycles, even though a lot of information is lost in the process.

After the mean aggregation was done, the data set consisted of around 10 000 samples for all cars, each sample representing the mean values collected from one cycle. The cycles were appended for each car, creating one complete matrix per car, where each row represents one sample with 131 features.
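A minimal pandas sketch of the mean aggregation, assuming each raw sample carries car and cycle identifier columns; the column and file names are illustrative.

```python
import pandas as pd

df = pd.read_csv("car_signals.csv")   # illustrative: one row per raw sample

# Collapse every cycle into a single sample by taking the mean of each signal.
# The capacity is constant within a cycle, so its mean is simply that value.
cycle_means = df.groupby(["car_id", "cycle_id"], as_index=False).mean(numeric_only=True)

# One matrix per car, each row being the mean-aggregated sample of one cycle.
per_car = {car: group.drop(columns=["car_id"])
           for car, group in cycle_means.groupby("car_id")}
```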

4.3.3 Feature selection using Filter Method

To see how the different features linearly correlate to the target, the capacity, the Pearson correlation coefficient was calculated for each signal. Using data from all cars, the 25 signals with the highest correlation can be seen in figure 4.5. Please note that the signal names have been cut down due to secrecy.


Figure 4.5: Top 25 Pearson correlation coefficients

Looking at the signals in order of importance, one can notice 2 signals with a very high correlation, almost 100 percent. These two signals are the battery state of health and the maximum cell capacity. The state of health signal is the same value as the target, the capacity. The other signal, the maximum cell capacity, is the upper limit of the capacity, while our target signal is the lower limit of the capacity. The decision was made to remove these from the relevant list due to their obvious correlation. Going up in the list, 3 signals represent various battery resistances. After those, there is a signal which gives the percentage of the CPU currently in use, and after that a signal which gives the total distance that the car has driven. All of these signals seem to be relevant, except maybe the CPU usage signal. Going further up the list there are signals such as the sun intensity, the sun elevation and the energy save mode selected by the user. All of these signals can have an impact on the capacity of the battery over a whole lifetime, but it is fairly unlikely that they will have an impact over a course of only 3-6 months. Other signals that most likely are of interest are the battery current and the battery temperature. Then there is a signal for the usable charge energy left and the transmission torque ratio that could be of interest. So how does one select features with this knowledge of their Pearson correlation coefficients? Choosing features by their Pearson correlation coefficients is called the filter method. One signal that is missing from the top 25 is the battery voltage. The fact that many seemingly unrelated signals are among the top 25 while, for example, the battery voltage is not, indicates that there is non-linearity in our data that needs to be investigated. Remember, the Pearson correlation only looks for linear correlations. Our data has also been reduced by a factor of around 300 000, making it likely that there are false correlations.


4.3.4 Feature selection using Wrapper Method

One method of feature selection which considers the non-linearity in the data is called the wrapper method. The wrapper method evaluates subsets of features, which allows for the detection of possible interactions between features. It does this by creating shadow features and re-evaluating whether the performance of the predictions significantly increases or decreases when these shadow features are added or removed. The computation time increases exponentially with the number of features. A wrapper method using a random forest regression algorithm for its estimations was chosen, to see which features it confirms to be of importance; a simplified sketch of the shadow-feature idea is given after the list. The wrapper method confirmed the following signals to be of importance:

• 3 Different battery resistances

• CPU usage

• The total distance driven

• Average cell temperature in the battery

• Ambient temperature

• Battery voltage

• Battery state of charge

• Battery current
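The following is a single-pass simplification of the shadow-feature idea described above, written directly with scikit-learn's random forest regressor rather than the exact package used in this thesis (which additionally repeats the comparison and applies statistical tests). The file and column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("aggregated_cycles.csv")   # illustrative file name
X = df.drop(columns=["capacity"])
y = df["capacity"]

# Shadow features: shuffled copies of every real feature, which by construction
# carry no information about the target.
shadow = X.copy()
for c in shadow.columns:
    shadow[c] = np.random.permutation(shadow[c].values)
shadow.columns = ["shadow_" + c for c in X.columns]
X_all = pd.concat([X, shadow], axis=1)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_all, y)

# Keep the real features whose importance beats the best shadow feature.
importances = pd.Series(forest.feature_importances_, index=X_all.columns)
threshold = importances[shadow.columns].max()
selected = importances[X.columns][importances[X.columns] > threshold].index.tolist()
print(selected)
```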

All of these signals seem to be of importance for the capacity of the battery. The resistance somewhat indicates the status of the battery. The average cell temperature gives information on how far the battery was pushed during the cycle. The voltage and current are the main properties of the battery. The state of charge indicates how the battery was used, for example whether it was operated close to fully charged or close to empty. The ambient temperature and CPU usage could be of importance for the battery, but they are not as obvious as the rest of the signals since they are not directly linked to the battery itself. By looking at how these signals correlate to each other, one can see whether there is a need to remove any with high mutual correlations.


The Pearson correlations can be seen in figure 4.6.

Figure 4.6: The Pearson correlations between the signals. All the signal names have been removed due to secrecy. The y-axis is the mirror of the x-axis. Every square shows the correlation between the two signals at that square's coordinates. The diagonal is constant 1 because a signal correlates 100% with itself; the decision was made to color it white.

The three resistance signals have a very high correlation, in fact, a correlation over 95%. The cell pack resistance was kept because it had the highest correlation to the capacity. The other 2 resistance signals were chosen to be removed. There is also a very high correlation between the battery state of charge and the battery voltage. A correlation over 95%. The decision was made to remove the state of charge since the voltage had a higher correlation to the capacity. There is also a high correlation between the ambient temperature and the average battery cell temperature. This correlation coefficient is 70% and the decision was made to keep them both. This resulted in a total of 7 signals left.

4.4 Implementation of the Linear Regression model

A linear model based on ordinary least squares was initialized. Ordinary least squares computes the least-squares solution to the linear equation Ax = b. The model was then fitted with the training data and a score was calculated on the test data. This was done with cross-validation using 6 groups with 5 unique cars in each group.
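A minimal scikit-learn sketch of this step; the placeholder arrays and the way the car groups are encoded are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# X: one row per aggregated cycle, y: the capacity, groups: which car each row belongs to.
X = np.random.rand(300, 7)              # placeholder data, 7 selected signals
y = np.random.rand(300)
groups = np.repeat(np.arange(30), 10)   # placeholder: 30 cars with 10 cycles each

# 6 folds, each holding out a group of 5 unique cars as test data.
scores = []
for train_idx, test_idx in GroupKFold(n_splits=6).split(X, y, groups):
    model = LinearRegression()          # ordinary least squares
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores))
```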

4.5 Implementation of the recurrent LSTM neural network

This section will show the whole implementation of the LSTM network.

4.5.1 Data preparation for LSTM

To be able to use a time series as data for a supervised learning problem, it needs to be transformed. As mentioned earlier, a time series is a sequence of numbers ordered by a time index. To turn a time series into a supervised learning problem, the time series needs to be expanded with shifted columns, meaning that the original columns have been shifted n times. By creating copies of columns that are pushed forward, lag observations as well as columns of forecast observations for the time series data set are

created. Five was chosen as the number of time steps, resulting in 5 new shifted columns with the time observations t-1, t-2, t-3, t-4 and t-5 for each original column, where t refers to the original column looked at. This results in a data set 6 times larger than the original time series, with a lookback of 5 cycles. In this newly expanded data set, each column is accompanied by columns that are shifted 1, 2, 3, 4 or 5 times.

The data that is fed to the LSTM network needs to be three-dimensional, because the LSTM expects one axis to consist of time steps. The dimensions of the original data are the samples along the x-axis and the features along the y-axis. Having two-dimensional data means that a conversion into three dimensions needs to be made, where the new data has samples along the x-axis, time steps along the y-axis and features along the z-axis. So, by reshaping the original two-dimensional data into a new three-dimensional data structure, the desired structure of the data was obtained.
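A minimal sketch of the two transformations described above (the shifted columns and the reshape to three dimensions), assuming pandas and NumPy; the variable and file names are illustrative.

```python
import numpy as np
import pandas as pd

lookback = 5
features = pd.read_csv("aggregated_cycles.csv")   # illustrative: one row per cycle

# Create the shifted copies t-5 ... t-1 of every column (the lookback window).
frames = [features.shift(i).add_suffix(f"_t-{i}") for i in range(lookback, 0, -1)]
frames.append(features)                           # the original columns at time t
supervised = pd.concat(frames, axis=1).dropna()   # drop rows without a full history

# Reshape to (samples, time steps, features) as the LSTM expects.
n_features = features.shape[1]
values = supervised.values.reshape(len(supervised), lookback + 1, n_features)
print(values.shape)
```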

4.5.2 Implementation of the LSTM

The model was initialized as a sequential model, which means that the layers appear after each other in sequential order. Then the first layer was added with LSTM cells as nodes. For the first layer, the shape of the data needs to be specified since this layer receives the inputs first. For the next four layers, 10, 10, 25 and 25 LSTM cells were used. The activation function used was tanh for each layer. Adam was chosen as the optimizer and mean squared error as the loss function. Training of the model was done by feeding the model the training data with a batch size of 1, so the weights were updated after each training sample. This is called online learning and is a popular strategy for training a model on time series. The data is now three-dimensional, with features, samples and the 5 previous cycles as dimensions, which in practice makes the online learning update the weights after each series of six samples. Validation was then done on the test data, again with a batch size of 1, making the model estimate the output value after each series of 6 test samples. The training and testing were done using n-fold cross-validation where all 30 cars were divided into 6 different groups. Each group acted once as test cars, making the cross-validation run over 6 folds.
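A sketch of the network described above, assuming a Keras-style API. The thesis text does not spell out the exact layer arguments, so the stacking with return_sequences and the final Dense output layer are assumptions (and it is left ambiguous whether there is an additional first layer before the four listed ones), while the layer sizes 10, 10, 25 and 25, the tanh activation, Adam, mean squared error and the batch size of 1 follow the text.

```python
from tensorflow import keras

time_steps, n_features = 6, 7   # lookback of 5 previous cycles plus the current one

model = keras.Sequential([
    keras.layers.LSTM(10, activation="tanh", return_sequences=True,
                      input_shape=(time_steps, n_features)),
    keras.layers.LSTM(10, activation="tanh", return_sequences=True),
    keras.layers.LSTM(25, activation="tanh", return_sequences=True),
    keras.layers.LSTM(25, activation="tanh"),
    keras.layers.Dense(1),      # single output node: the estimated capacity
])
model.compile(optimizer="adam", loss="mean_squared_error")

# Online learning: batch size 1, so the weights are updated after every sample.
# model.fit(X_train, y_train, epochs=..., batch_size=1)
# model.evaluate(X_test, y_test, batch_size=1)
```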

5 Results and Analysis

This section will show various results that were achieved from using a linear model and a recurrent LSTM model. The first part is from the usage of a linear model and the second part is from the usage of a recurrent LSTM model.

5.1 Results From Linear Regression Model

The global root mean squared error is 0.341 ampere-hours over all the test cars. This is roughly an error of 20.5 ampere-minutes. The global root mean squared error is defined by adding the mean squared errors (the squared root mean squared errors) for all cars, dividing by 6, and then taking the square root of this value. This can be seen in equation 5.1.

\mathrm{Global\ rmse} = \sqrt{\frac{mse_1 + \cdots + mse_k}{k}}    (5.1)

where k = 6.


In figure 5.1 one can see the error distribution for all cars.

Figure 5.1: Error distribution for linear regression

The errors resemble a normal distribution with a mean of -0.014 percent and a variance of 1.43 percent. The error is mostly spread out between -2.5 percent and +2.5 percent.

In figure 5.2 all the estimated values are scattered against the real values.

Figure 5.2: Plot showing the estimated values scattered against the real values

A perfect result would show all values on a straight line. In the plot, one can see that many values are misplaced. The values from around 88 percent up to around 96 percent are somewhat linear, but with a spread of around 1.5 percentage points. Especially the higher values perform poorly, with many predictions far away from the true ones.


So how does it look when estimating for only one car? The plots for the car named BDWCU5 can be seen in figures 5.3 and 5.4.

Figure 5.3: Plot showing the estimated values and the real values for car BDCWCU5

From figure 5.3 one can see that the linear model estimates the first predictions very wrongly before it stabilizes and stays within a distance of around 2.5 percentage points for the rest of the cycles.

Figure 5.4: Plot showing the estimated values against the real values for car BDCWCU5

Figure 5.4 shows clearly that the high values lie far away from the rest. The rest of the predictions are within a reasonable distance from the true values, forming a somewhat straight line.


5.2 Long Short-Term Memory (LSTM)

So can a Long Short-Term Memory (LSTM) model achieve better accuracy? The global root mean squared error is 0.077 ampere-hours over all the test cars. This results in an error of roughly 4.6 ampere-minutes. Compared to the linear regression model, this is a decrease of 77.5%. Figure 5.5 shows the error distribution.

Figure 5.5: Error distribution for LSTM

The errors resemble a normal distribution with a mean of -0.004 percent and a variance of 0.074 percent. The error is mostly spread out between -0.5% and +0.5%. Compared to the linear model, this error distribution is more compact and a large majority of the errors are within 0.5%.

In figure 5.6 all the estimated values are scattered against the real values.

Figure 5.6: Plot showing the estimated values against the real values


Here as well, a perfect result would show all values on a straight line. In the plot, one can see that a thick straight line is formed, with some misplaced values. Figures 5.7 and 5.8 show the estimations for the car BDWCU5 (the same car as for the linear model).

Figure 5.7: Plot showing the estimated values against the real values for car BDCWCU5

Figure 5.7 shows that the estimated values follow the real values very closely from start to end. At no point does the estimated value "jump" away from the real value, as it did for the linear model. The root mean squared error for this car was 0.047 ampere-hours or 2.82 ampere-minutes.

Figure 5.8: Plot showing the estimated values against the real values for car BDCWCU5

Here the estimations are close to the real values. The line is very straight without any outliers.

6 Discussion

This chapter will discuss various aspects of this master thesis, such as the choices of approaches in the method, things that could have been done differently, and what went well or badly. Aspects of the results, such as which results were surprising or not, will also be discussed. A section covering the work in a wider context is also included.

6.1 Method

In this section the applied method will be criticized and discussed.

6.1.1 The use of an LSTM

The purpose of this study was to see how well a recurrent neural network model could estimate the battery capacity in electrical vehicles. Many different machine learning models can be applied to various types of problems, and many other machine learning models could be applied to this specific problem, maybe even giving better results than an LSTM model or another recurrent neural network for that matter. The data in this problem is a time series, and it is safe to say that previously recorded values for a battery have a relationship to its current and future values. Recurrent neural networks, and especially the LSTM type, have been shown to be very powerful models for estimation and prediction in weather forecasting. Weather forecasting is heavily dependent on the values of the previous days and is a prime example of a time series. A recurrent neural network is therefore a good choice for this time series problem. The question is rather: is the data such that it can really be called a time series? The data is stored in containers of cycles where the output we are seeking to estimate is only updated once every cycle. This is in a way a time series, since all the data points lie on a time axis where the values somehow depend on the previous values. But in my opinion this makes the data not a pure time series. The time series is artificially made because of this infrequent update of the battery capacity. Hopefully an LSTM can pick up this time dependency, but it is fair to say that it will not be as powerful as for weather forecasting, where all values are updated for every data point. There is a time dependency in the data, but it is not as clear as in a problem like weather forecasting. There are many different recurrent neural networks and they all have their special twist. The choice of using an LSTM model among all the different recurrent neural

networks is not crystal clear. The LSTM has its special mechanism of using LSTM cells which save different "states", and it has proven to be a powerful model for time series. The decision was therefore made to try this type of recurrent neural network. There is a possibility that other recurrent neural networks could have given better results, but the limitation was made to only try this model because it is considered to be the "state of the art" among recurrent neural network models.

6.1.2 The pre-processing

This section will go through what could have been done differently in the pre-processing of the data. Many things could have been done differently: there are hundreds of different approaches to pre-processing data, and one can only know how well one has done after applying a model to it. For each step in the pre-processing there are various software libraries, and it is impossible to know beforehand which will be best for the purpose. A study is rarely perfect, and the ideal would be to try all combinations of all the different steps to see which combination gives the best result. This report lacks this and is limited to the steps that were chosen.

I will take a self-critical stance towards the choices made in the pre-processing of this project. The need for NaN replacement and data normalization is widely known to be of huge importance when pre-processing data. These steps are needed; without them the data would be corrupted and not suitable for a machine learning model. The questionable steps in the pre-processing are the rest of the steps. Each step will be discussed in the upcoming sections.

6.1.2.1 Outlier Detection

Whether to do outlier detection and which algorithm to use is something one cannot know beforehand. Outliers are an important part of the data and help to describe it. Keeping outliers could give a better understanding of the data and therefore better results. But it can also give the model false values if the outliers themselves are false, such as wrongly collected values. This would trick the model and give worse results. The data has been collected at a very high speed, collecting a huge number of samples every second over a long period. It is fair to say that some samples will have wrongly collected values and will therefore sometimes be considered outliers. The decision was therefore made to do outlier detection using a Local Outlier Factor. Local Outlier Factor is an easy to implement but powerful outlier detection algorithm which spots when samples are considered "abnormal". This way the data becomes more compact and is shrunk at the far ends. This should not affect this particular data set too much since these abnormal samples are such a small part of the whole data set. There is a possibility that the outlier detection did not affect the outcome at all, but it is sometimes better to have it than not, especially for this rapidly collected data set. The decision was therefore made to include outlier detection as a part of the data cleaning step.

6.1.2.2 Data aggregation

This is the pre-processing step where the largest limitation was made. Data aggregation is about shrinking data dimensions and therefore taking away information. When the cycles were aggregated into single samples, an enormous amount of information was lost. It was a hard decision but it needed to be made. Keeping the full time series would result in huge blocks, each consisting of thousands of samples all with the same capacity. The time series would have a structure that a recurrent neural network would have a huge problem dealing with. When training the model with these blocks, the model would learn to always estimate the same capacity value as its previous samples had. This is because over 99.9% of the estimations would be within a block and not on the border between 2

blocks, where the capacity was updated and changed. This is a problem that needed to be solved. Data aggregation is an efficient way to solve it, and taking the mean of the cycles was the method decided on. Taking the mean describes the general character of the cycles, but undoubtedly a huge amount of information is lost. For example, consider the signal for the car speed. Taking the mean describes the average car speed over the cycle, but the behavior of the driver is completely lost. One cycle where the car is driven at a steady speed will be treated the same as a cycle where it has been driven sometimes very fast and sometimes very slow. The same applies to all the different signals such as temperatures, voltages, resistances, SOC, current and so on. Another aggregation method could provide a better description of the cycles and thereby give better results. This will be described in the future work section in the upcoming Conclusion chapter.

The defense of doing mean aggregation of the data is simply that it was needed and it proved to be an easily implemented way to shape the data into a form the model could handle. The criticism is that a huge amount of valuable information was lost, information that probably would be of huge importance for the estimation of the capacity.

6.1.2.3 Feature Selection

Feature selection could have been done in many different ways. After the study of Li-ion batteries, it is clear that there are some properties of the batteries that are directly linked to their capacity and aging. So one way of doing feature selection would be to simply select these signals as features. This could maybe be enough, since these are the properties that are known to be linked to the battery capacity. The downside is that we would miss all the signals in the data that could also be important for the aging of the battery, for example signals that could describe how a reckless driver with many hard brakes and fast accelerations affects the battery's aging, or other signals such as the incline of the road and the time of the day. Nor is this a data science approach to this big data problem; it would be more of a scientific approach of selecting features with regard to the believed relationship they have to the target output. A data science approach is to let the data speak for itself: selecting the features that correlate to the output target in the data, or that seem to positively affect the prediction of the output target. The decision was therefore made to take the data science approach, where analyses of the data decide which features to select. Feature selection based on a data science approach can be done in many different ways. The two most common ways are to use a filter method or a wrapper method, or, as in this study, both.

Using only a filter method would only work if the data consisted of completely independent features with a linear relationship to the output target. But since the filter method was used as a complement to the wrapper method, I cannot see any argument against using it. By using a filter method, a lot of valuable information about the data was discovered, which culminated in the removal of some features and the realization that the data had both dependent features and non-linear correlation to the output target. This led to the realization that a wrapper method was indeed needed. As described in chapter 4, the top 25 features correlated to the output target did not make much sense: many important features were missing and some signals at the top of the list seemed irrelevant. This made me realize the weakness in the data and many questions came up. Was the data not large enough? Would these false correlations disappear if the data were larger? That these false correlations would affect the model's ability to estimate the capacity if used was certain. It was clear that a feature selection which handles non-linearity was needed.


By using a wrapper method based on a random forest algorithm with shadow features, the non-linearity between the features was handled. The software used for this is based on a PhD student's thesis [34]. Others have used this software for feature selection and have come up with satisfying results. The software has been used mostly for feature selection in biological computing problems. In biological computing, the ratio between samples and features is very low, making it extremely important to select the features that are important to the output target. One fine example is the analysis of DNA, where the data consists of thousands of proteins, each representing one feature. This was the strongest motivation for choosing this wrapper method and software. This problem does not come close to having such a small sample/feature ratio as in the DNA analysis example, making the method even more suitable and powerful for this particular problem.

The features that the wrapper method decided to keep were not surprising with regard to the study of the batteries. All features are battery properties except for the CPU usage, the total distance driven and the ambient temperature. The only feature that does not make too much sense is the signal for CPU usage. The CPU usage is a percentage between 0% and 100% which states how hard the central processing unit in the car's central system is working. It appears high up in the Pearson linear correlation list, and it could be that this is a false correlation that will disappear when the data is sufficiently large.

The fact that the data seemed to be too small for the filter method but not for the wrapper method is most likely because of the non-linear relationships between the features. There are non-linear combinations of some features that affect the battery capacity in a way that the filter method could not perceive. But the fact that the wrapper method still considered the CPU usage to be of huge importance testifies that the data could be too small in some aspects even for the wrapper method.

As previously noted, a report is rarely perfect and neither are these feature selection methods. There is a possibility that this combination of features is not the most suitable for the estimation of the battery capacity. But with this data science approach, together with the comparison against the research made on the batteries, it is fair to say that this combination has a strong effect on the battery capacity.

6.2 Results

In this section, the results will be discussed. The section is divided into 3 parts. The first part will discuss the results from the linear model, the second part will discuss the LSTM and the last part will compare the results from both models.

6.2.1 Linear Regression Model

The linear model does a good job of estimating the capacity. The global root mean squared error is only 0.341 ampere-hours (20.5 ampere-minutes) overall. This is a desirable performance and indicates that there is a degree of linear relationship between the set of features and the battery capacity. When looking at the error distribution, one can see that the error is normally distributed and that a high majority of the errors are within -2.5% and +2.5%. One could argue that a linear model is indeed enough to estimate the battery capacity on this data set. Figure 5.2 shows that the estimations and actual values form a 45-degree line with a noticeable spread. When looking at the example car BDWCU5, it is obvious that the estimated values are reasonably close to the actual values. The weakness of the estimations is that there are many notable outliers. Especially the beginning of the estimations is poor, before the error stabilizes between -2.5% and +2.5%. The bad performance in the beginning is probably because the data does not start at the first-ever cycle for the car. Here

the features have values that are not similar to the values for the rest of the cycles, making it hard for the linear model to estimate. Why this is the case is not certain.

6.2.2 LSTM

The LSTM also does a good job of estimating the capacity. The global root mean squared error is as low as 0.077 ampere-hours (4.6 ampere-minutes) over all the test cars. The error distribution is normally distributed with a high majority of the errors within -0.5% and +0.5%. Figure 5.6 shows an almost perfect 45-degree line with a small spread and almost no noticeable outliers. For the example car BDWCU5, it is obvious that the LSTM estimates with high accuracy. The lines for the actual values and the estimated values are almost the same. The line for the estimated values follows the behavior of the actual values with high accuracy: when the actual value increases or decreases, the estimated value follows with almost the same increase or decrease.

6.2.3 Comparison between linear model and LSTM

When looking at the results from both models, it is easy to say that the LSTM is better at estimating the capacity than the linear model. The global root mean squared error is 77.5% lower than the linear model's, the error distribution is more centered around zero error, and the LSTM has nowhere near the number of outliers that the linear model has. The outliers that the LSTM does have are also noticeably smaller. The question becomes rather: is it worth using an LSTM? An LSTM requires much more computational power and takes significantly more time to compute. The linear model can estimate with reasonably high accuracy and can do so with a small overhead and at high speed. I believe the answer lies with the users of the models. If car owners and car manufacturers want very high accuracy and accept the costs of using an LSTM, then this is the model to use. If, on the other hand, the users are satisfied with errors between -2.5% and +2.5%, then the linear model is the model to use. This is not a question I can answer. In a real-life situation where the model, for example, runs in the cloud, computes the battery-capacity estimations and updates the users with the current capacity value, I would argue that an accuracy of ±2.5% is enough. The result that the LSTM does a better job than the linear model was expected, since non-linearity was found in the data. But considering the high computational and time overhead, I would argue that the linear model is relatively more powerful for its cost. A linear model is also easier to understand compared to the complex structure of the LSTM, which makes the linear model much easier to troubleshoot if ever needed.

6.2.4 Scalability

As previously mentioned, this data set is limited in many ways. The time duration for each car is nowhere near as long as would be needed to see how well a linear model and an LSTM really perform, and the fleet of only 30 cars is not large enough either. Both models do a good job of estimating on this particular data set, with this particular set of cars. What would happen if the data set covered the whole life of the batteries, with the first cycle being the car's first drive and the last cycle its last? Or if the fleet consisted of thousands of cars instead of only thirty? Would the models have the same accuracy? This question is hard to answer. It is not certain that the models would scale with the same or even better accuracy. Since the complexity of the data set would increase, the LSTM could possibly handle this increase better than the linear model due to its more powerful non-linear structure. The LSTM could find more complex patterns and relationships in the data that were not discovered in this smaller data set. The ideal scenario would be to have a model in the cloud that could handle each car on the road. This fleet of thousands of cars would send their parameters to the model, which would give out its estimations based on each car's previously recorded parameters. The model would train on these cars and, after an unknown time duration, it would predict all cars' capacities in real time with high accuracy. This global model could even have local models beneath it, which would learn behaviors and relationships specific to the geographical areas they cover.

I believe that an LSTM model with the right training data could be divided into many local models and one global model covering all cars on the road. After a certain period of training, it would be able to form an internal structure that not only estimates the cars' capacities but also predicts how their batteries will age.

6.3 Replicability, reliability and validity

This study aims to be clear enough to be replicated, relied upon and validated. With a clear description of what software was used and how it was used, one should be able to replicate this study. The data set belongs to Volvo Cars and is not publicly available, but it should be possible to perform the same study on another, similar data set. The exact same results would certainly not be obtained, but hopefully similar ones. Source criticism is always important when validating a study. All references used in this study come from reliable sources, and Google Scholar was the search engine used when looking for scientific papers.

6.4 The work in a wider context

In a time when environmental questions are among the most important topics of our age, it is of great importance that alternative means of transport are investigated and developed. Emissions of carbon dioxide and other greenhouse gases from burning fossil fuels are damaging our planet, and the car industry is responsible for a large part of this damage. If the car industry shifted to more electric solutions, a large part of the damage would disappear. It is therefore of great importance to contribute to the electrification of the car industry. Electric cars do not produce emissions while driving; the only emissions come from the production of the batteries. The car only needs to be driven for 2-3 years before these emissions are compensated for, after which the driving itself has no further impact on the earth. This study gives a small contribution to the process of electrifying the car industry by providing new ways of estimating the current capacity of the batteries.

7 Conclusion

This chapter goes through the conclusions made in the study. The first section summarizes the work and connects the research questions to what has been achieved. The second section describes what future work can be done.

7.1 Summarizing

The purpose of the study was to investigate whether a recurrent long short-term memory-based neural network can be applied to estimate the current capacity of the battery in an electric car. The first research question that was asked was: How well does a recurrent LSTM neural network estimate the capacity of a battery in an electrical vehicle? The answer is that a recurrent LSTM neural network is indeed a suitable model and that it can estimate the capacity with high accuracy. The overall accuracy is high, with a small global root mean squared error of only 2.5 ampere-minutes. Since the battery capacity varies over many ampere-hours, this is indeed a very small error margin. The LSTM uses its powerful internal structure to take the capacity values of previous cycles into account, which makes it reliable when following changes in capacity. It generalizes what it has learned during training to new, previously unseen data. For the large majority of the cars, the estimated values stick closely to the real values. Even when the values jump between cycles, the estimations follow the trend, and there are almost no noticeable outliers. The error distribution is very compact around 0%, which strongly indicates that this model is reliable and powerful for this purpose. Worth mentioning is that the LSTM is much more computationally expensive than the linear regression model.

The second research question was: Is a recurrent LSTM neural network necessary for estimation of the capacity or can a linear model be used instead? The answer is that, looking at the results, the LSTM model is indeed superior to the linear model. This makes sense since non-linearity was discovered in the data set. The LSTM model is preferable over the linear model mostly because the linear model has many estimations that are far from the real value, which sometimes makes it an unstable and unreliable model for this problem, and in particular because a high majority of the LSTM's estimations are much closer to the real values, making it a stronger and more reliable model. But the linear model still estimates with reasonably high accuracy, having a global root mean squared error of only 20.5 ampere-minutes, which is within a reasonable error margin. If this error margin satisfies the user's requirements, then the linear model is naturally the model to choose. Another point worth mentioning is that the computational time and effort are far greater for the LSTM than for the linear model, making it a question of proportion: is the use of an LSTM model "worth it" when the linear model still does such a good job? Here again, it is up to the user to decide. The last research question was: What parameters seem to be important when estimating the battery capacity? Using various feature selection techniques, the parameters that seem to be most important when estimating the battery capacity are listed below.

• Three different battery resistances

• CPU usage

• The total distance driven

• Average cell temperature in the battery

• Ambient temperature

• Battery voltage

• Battery state of charge

• Battery current

Compared to the research on Li-ion batteries, the parameters that are certainly important are all parameters directly related to the capacity. These are the battery resistances, the cell temperature, the ambient temperature, the battery voltage, the battery state of charge and the battery current; they are needed when estimating the capacity. The other two parameters are the total distance driven and the CPU usage. That the total distance driven is of importance makes sense, since the capacity will degrade with the distance the car has driven. The CPU usage is more questionable. This parameter represents the amount of work the CPU is doing and should not be directly related to the capacity value. The reason this parameter was selected by the feature selection could be explained by the lack of data; if the data set were scaled up, there is a possibility that this parameter would disappear.

7.2 Future work

This section describes two topics for future work: more advanced aggregation methods and future possibilities with more data.

7.2.1 More advanced aggregation methods

Since the capacity cannot be updated within cycles, there is a need for data aggregation. More advanced aggregation methods that capture the characteristics of the cycles would give more information on what is causing the batteries to age. Instead of only using the direct values from the signals, there is a possibility to extract information from the signals and create new features. For example, one could extract the car speed at certain time points to determine how many stops were made, or whether the driver was driving aggressively with many hard brakes. New features could then be created that represent the number of such unique events occurring during one particular cycle. By adding more of these new features, the character of the cycle would be better explained and captured. These new features could, by all means, be combined with aggregated values from the real signals, as used in this study.
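As an illustration of such event-based features, the sketch below is a hypothetical example of my own; the column name speed_kmh, the sampling period and the thresholds are assumptions and do not come from the Volvo data set.

import pandas as pd

def cycle_event_features(cycle, sample_period_s=1.0,
                         stop_speed_kmh=1.0, hard_brake_kmh_per_s=10.0):
    # `cycle` is assumed to be a Pandas dataframe holding one cycle's
    # time-sampled signals, with a column "speed_kmh" sampled at a fixed rate.
    speed = cycle["speed_kmh"]
    # A stop is counted each time the speed falls below the stop threshold
    # after having been above it (or when the cycle starts at standstill).
    stopped = speed < stop_speed_kmh
    n_stops = int((stopped & ~stopped.shift(1, fill_value=False)).sum())
    # A hard brake is a deceleration larger than the threshold between samples.
    deceleration = -speed.diff() / sample_period_s
    n_hard_brakes = int((deceleration > hard_brake_kmh_per_s).sum())
    return {"n_stops": n_stops, "n_hard_brakes": n_hard_brakes}

# Example with a made-up speed trace:
cycle = pd.DataFrame({"speed_kmh": [0, 20, 50, 30, 5, 0, 0, 40, 60, 45, 0]})
print(cycle_event_features(cycle))    # {'n_stops': 3, 'n_hard_brakes': 4}

Such per-cycle counts could then be appended as additional columns to the aggregated cycle data used in this study.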

7.2.2 Future possibilities with more data

With a data set covering a sufficiently large car fleet over an extended time period, the possibilities to truly understand what causes the batteries to age are enormous. It would be possible to train a model able to predict how each individual battery will age in the future. To do this, a huge amount of data is needed: data that covers the whole lifespan of the battery. With such a model, the driver could be updated with high accuracy on how much longer she/he can drive on the battery, based on how the car has been driven, in what area it has been driven, what types of drives have been made and, of course, what values the battery currently has. One possibility is to use the new features that describe driving behavior and combine them with values from the battery, the car, and its environment.

The difficult aspect of this is that a battery would need to be out on the market, or at least be tested, for a whole lifetime in countless different cars in various locations before a model could be fully trained. By that time, a new car model would without a doubt already have been released, and the model could not be applied directly to its battery. But since the batteries and the cars will certainly have similarities, it is possible to adjust the parameters of the collected data and retrain the model accordingly so that it can be applied to the current car models.

A Appendix

• Prestudy: The LSTM model was implemented using the software libraries Tensorflow [43] and Keras [44].

• The data was loaded into a Python program with the help of an in-house MDF parser made by Volvo. Once the files were loaded, they could be transformed into dataframes. Dataframes are a data structure within the Python library Pandas [45] which stores numbers in matrices. Pandas is a high-performance, open-source, easy-to-use data analysis tool for Python. Once the data is in a dataframe, various operations can be applied to it, which facilitates the work.

• The software library used for Local Outlier Factor is called sklearn [46]. Sklearn is open source and stands for scikit-learn. The particular function used is called "LocalOutlierFactor" [23].

• The Shapiro-Wilk test and standard score normalization were done using the SciPy library for Python [47].

• The software library used for Pearson correlation is called yellowbrick [48] and is based on scikit [46].

• The software library used for the wrapper method is called Boruta and can be found in [34].

• The linear model was implemented using the Sklearn library and can be found here [49].

• The long short-term memory (LSTM) model was implemented through Tensorflow and Keras. A combined sketch of how these libraries could fit together is given after this list.
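To complement the list above, the following is a minimal, hypothetical sketch of my own (not the thesis code) showing how these libraries could fit together. It assumes that the aggregated cycle data is already available as a Pandas dataframe df with one "capacity" target column and one column per feature; the number of neighbours, the window length and the network size are arbitrary choices, not the values used in the thesis.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LinearRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def preprocess(df):
    # 1. Local Outlier Factor: keep only the rows flagged as inliers (+1).
    inliers = LocalOutlierFactor(n_neighbors=20).fit_predict(df.values) == 1
    df = df.loc[inliers].copy()
    # 2. Shapiro-Wilk test of normality for each column (small p: reject normality).
    for column in df.columns:
        statistic, p_value = stats.shapiro(df[column])
        print(f"{column}: W={statistic:.3f}, p={p_value:.3f}")
    # 3. Standard score (z-score) normalization of the feature columns.
    features = df.drop(columns=["capacity"])
    df[features.columns] = stats.zscore(features.values, axis=0)
    return df

def fit_linear_model(df):
    X = df.drop(columns=["capacity"]).values
    y = df["capacity"].values
    return LinearRegression().fit(X, y)

def fit_lstm_model(df, window=10):
    # Build overlapping windows of `window` consecutive cycles per training sample.
    X_flat = df.drop(columns=["capacity"]).values
    y = df["capacity"].values
    X = np.stack([X_flat[i:i + window] for i in range(len(X_flat) - window)])
    y = y[window:]
    model = Sequential([
        LSTM(64, input_shape=(window, X_flat.shape[1])),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=50, batch_size=32, verbose=0)
    return model

The LSTM input here is formed by stacking the feature vectors of consecutive cycles into fixed-length sequences; other sequence layouts are of course possible.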

Bibliography

[1] D. A. Wetz, B. Shrestha, S. T. Donahue, D. N. Wong, M. J. Martin, and J. Heinzel. “Capacity Fade of 26650 Lithium-Ion Phosphate Batteries Considered for Use Within a Pulsed-Power System’s Prime Power Supply”. In: IEEE Transactions on Plasma Science 43.5 (May 2015), pp. 1448–1455. ISSN: 0093-3813. DOI: 10.1109/TPS.2015.2403301.
[2] Andrew McAfee, Erik Brynjolfsson, Thomas H Davenport, DJ Patil, and Dominic Barton. “Big data: the management revolution”. In: Harvard business review 90.10 (2012), pp. 60–68.
[3] Andrew Cave. What Will We Do When The World’s Data Hits 163 Zettabytes In 2025? 2017. URL: https://www.forbes.com/sites/andrewcave/2017/04/13/what-will-we-do-when-the-worlds-data-hits-163-zettabytes-in-2025/#ab0ef9c349ab. (accessed: 08.05.2019).
[4] Peter J Brockwell, Richard A Davis, and Matthew V Calder. Introduction to time series and forecasting. Vol. 2. Springer, 2002.
[5] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2009. ISBN: 0-13-604259-7.
[6] Bryan Kolb and Ian Q Whishaw. An introduction to brain and behavior. Worth Publishers, 2001.
[7] Ashley Feinberg. An 83,000-Processor Supercomputer Can Only Match 1% of Your Brain. 2013. URL: https://gizmodo.com/an-83-000-processor-supercomputer-only-matched-one-perc-1045026757. (accessed: 08.05.2019).
[8] Anthony Barré, Benjamin Deguilhem, Sébastien Grolleau, Mathias Gérard, Frédéric Suard, and Delphine Riu. “A review on lithium-ion battery ageing mechanisms and estimations for automotive applications”. In: Journal of Power Sources 241 (2013), pp. 680–689.
[9] Doron Aurbach, Boris Markovsky, Alexander Rodkin, Miriam Cojocaru, Elena Levi, and Hyeong-Jin Kim. “An analysis of rechargeable lithium-ion batteries after prolonged cycling”. In: Electrochimica Acta 47.12 (2002), pp. 1899–1911.
[10] Tennen-Gas. Nissan Leaf at the 2009 Tokyo Motor Show. 2009. URL: https://creativecommons.org/licenses/by-sa/3.0. (accessed: 04.16.2019).


[11] Hicham Chaoui, Chinemerem C. Ibe-Ekeocha, and Hamid Gualous. “Aging prediction and state of charge estimation of a LiFePO4 battery using input time-delayed neural networks”. In: Electric Power Systems Research 146 (2017), pp. 189–197. ISSN: 0378-7796. DOI: https://doi.org/10.1016/j.epsr.2017.01.032. URL: http://www.sciencedirect.com/science/article/pii/S037877961730041X.
[12] M. Petricca, D. Shin, A. Bocca, A. Macii, E. Macii, and M. Poncino. “Automated generation of battery aging models from datasheets”. In: 2014 IEEE 32nd International Conference on Computer Design (ICCD). Oct. 2014, pp. 483–488. DOI: 10.1109/ICCD.2014.6974723.
[13] Madeleine Ecker, Jochen B. Gerschler, Jan Vogel, Stefan Käbitz, Friedrich Hust, Philipp Dechent, and Dirk Uwe Sauer. “Development of a lifetime prediction model for lithium-ion batteries based on extended accelerated aging test data”. In: Journal of Power Sources 215 (2012), pp. 248–257. ISSN: 0378-7753. DOI: https://doi.org/10.1016/j.jpowsour.2012.05.012. URL: http://www.sciencedirect.com/science/article/pii/S0378775312008671.
[14] Giuseppe Tina and Giacomo Capizzi. “Improved Lead-Acid Battery Modelling for Photovoltaic Application by Recurrent Neural Networks”. In: July 2008, pp. 1170–1174. DOI: 10.1109/SPEEDHAM.2008.4581302.
[15] E. Chemali, P. J. Kollmeyer, M. Preindl, R. Ahmed, and A. Emadi. “Long Short-Term Memory Networks for Accurate State-of-Charge Estimation of Li-ion Batteries”. In: IEEE Transactions on Industrial Electronics 65.8 (Aug. 2018), pp. 6730–6739. ISSN: 0278-0046. DOI: 10.1109/TIE.2017.2787586.
[16] B. Saha, K. Goebel, S. Poll, and J. Christophersen. “Prognostics Methods for Battery Health Monitoring Using a Bayesian Framework”. In: IEEE Transactions on Instrumentation and Measurement 58.2 (Feb. 2009), pp. 291–296. ISSN: 0018-9456. DOI: 10.1109/TIM.2008.2005965.
[17] Gae-won You, Sangdo Park, and Dukjin Oh. “Real-time state-of-health estimation for electric vehicle batteries: A data-driven approach”. In: Applied Energy 176 (2016), pp. 92–103. ISSN: 0306-2619. DOI: https://doi.org/10.1016/j.apenergy.2016.05.051. URL: http://www.sciencedirect.com/science/article/pii/S0306261916306456.
[18] Shichao Zhang, Chengqi Zhang, and Qiang Yang. “Data preparation for data mining”. In: Applied artificial intelligence 17.5-6 (2003), pp. 375–381.
[19] Erhard Rahm and Hong Hai Do. “Data cleaning: Problems and current approaches”. In: IEEE Data Eng. Bull. 23.4 (2000), pp. 3–13.
[20] Mark Huisman. “Imputation of missing item responses: Some simple techniques”. In: Quality and Quantity 34.4 (2000), pp. 331–351.
[21] Charu C Aggarwal. “An introduction to outlier analysis”. In: Outlier analysis. Springer, 2017, pp. 1–34.
[22] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. “LoOP: local outlier probabilities”. In: Proceedings of the 18th ACM conference on Information and knowledge management. ACM. 2009, pp. 1649–1652.
[23] Chire. Basic idea of LOF: comparing the reachability (density) of a point with the reachability (density) of its neighbors to detect Outliers. 2010. URL: https://en.wikipedia.org/wiki/Local_outlier_factor#/media/File:LOF-idea.svg. (accessed: 04.16.2019).
[24] ZE. How Important Is Data Aggregation? 2018. URL: https://blog.ze.com/the-zema-solution/how-important-is-data-aggregation/. (accessed: 04.16.2019).


[25] Robert V Hogg, Joseph McKean, and Allen T Craig. Introduction to mathematical statistics. Pearson Education, 2005.
[26] Dan Kernler. A visual representation of the Empirical (68-95-99.7) Rule based on the normal distribution. 2014. URL: https://creativecommons.org/licenses/by-sa/4.0/. (accessed: 04.16.2019).
[27] Patrick Royston. “A toolkit for testing for non-normality in complete and censored samples”. In: Journal of the Royal Statistical Society: Series D (The Statistician) 42.1 (1993), pp. 37–43.
[28] S Patro and Kishore Kumar Sahu. “Normalization: A preprocessing stage”. In: arXiv preprint arXiv:1503.06462 (2015).
[29] Sebastian Raschka. About Feature Scaling and Normalization – and the effect of standardization for machine learning algorithms. 2014. URL: https://sebastianraschka.com/Articles/2014_about_feature_scaling.html. (accessed: 08.05.2019).
[30] Isabelle Guyon and André Elisseeff. “An introduction to variable and feature selection”. In: Journal of machine learning research 3.Mar (2003), pp. 1157–1182.
[31] Lucien Mousin. Summary of filter method for feature selection. 2015. URL: https://creativecommons.org/licenses/by-sa/4.0/. (accessed: 04.18.2019).
[32] Jacek Biesiada and Wlodzisław Duch. “Feature selection for high-dimensional data—a Pearson redundancy based filter”. In: Computer recognition systems 2. Springer, 2007, pp. 242–249.
[33] Lastdreamer7591. Wrapper Method pour la selection d’attribut. 2014. URL: https://creativecommons.org/licenses/by-sa/4.0/. (accessed: 04.18.2019).
[34] Miron B Kursa, Witold R Rudnicki, et al. “Feature selection with the Boruta package”. In: J Stat Softw 36.11 (2010), pp. 1–13.
[35] Xin Yan. Linear Regression Analysis: Theory and Computing. World Scientific, 2009. ISBN: 9812834117.
[36] Oleg Alexandrov. Illustration of linear least squares. 2008. URL: https://en.wikipedia.org/wiki/Linear_regression#/media/File:Linear_least_squares_example2.png. (accessed: 04.18.2019).
[37] Kevin Gurney. An Introduction to Neural Networks. CRC Press, 1997. ISBN: 9781315273570.
[38] Glosser.ca. Artificial neural network with layer coloring. 2013. URL: https://creativecommons.org/licenses/by-sa/3.0/. (accessed: 04.22.2019).
[39] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. “Empirical evaluation of gated recurrent neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[40] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[41] François Deloche. A diagram for a one-unit recurrent neural network (RNN). From bottom to top: input state, hidden state, output state. U, V, W are the weights of the network. Compressed diagram on the left and the unfold version of it on the right. 2017. URL: https://creativecommons.org/licenses/by/4.0/. (accessed: 04.22.2019).
[42] Guillaume Chevalier. The Long Short-Term Memory (LSTM) cell can process data sequentially and keep its hidden state through time. 2018. URL: https://creativecommons.org/licenses/by/4.0/. (accessed: 04.22.2019).
[43] Tensorflow dev. An end-to-end open source machine learning platform. 2018. URL: https://www.tensorflow.org/.


[44] Keras dev. Keras: The Python Deep Learning library. 2018. URL: https://keras.io/.
[45] Python Software Foundation. Python Data Analysis Library. 2019. URL: https://pandas.pydata.org/.
[46] Scikit-learn community. Scikit-learn. 2019. URL: https://scikit-learn.org/stable/.
[47] SciPy community. SciPy. 2019. URL: https://www.scipy.org/.
[48] Yellowbrick community. Yellowbrick: Machine Learning Visualization. 2019. URL: https://www.scikit-yb.org/en/latest/.

[49] Sklearn development team. sklearn.linear_model.LinearRegression. 2019. URL: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
