
Time Series Feature Extraction for Industrial Big Data (IIoT) Applications


Term paper for the subject "Current Topics of Data Engineering" in the course of studies MSc. Data Engineering at Jacobs University Bremen

Markus Sun, born on 07.11.1994 in Achim
Student/registration number: 30003082

1. Reviewer: Prof. Dr. Adalbert F.X. Wilhelm
2. Reviewer: Prof. Dr. Stefan Kettemann

Table of Contents

Abstract

1 Introduction

2 Main Topic - Terminology

2.1 Sensor Data

2.2 Feature Selection and Feature Extraction

2.2.1 Variance Threshold
2.2.2 Correlation Threshold
2.2.3 Principal Component Analysis (PCA)
2.2.4 Linear Discriminant Analysis (LDA)
2.2.5 Autoencoders

2.3 Time Series

2.3.1 Autocorrelation
2.3.2 Seasonality
2.3.3 Stationarity

2.4 Feature Extraction from Time Series Data

3 Method Description - "TSFRESH" Package for Feature Extraction in Python

4 Conclusion & Outlook

References

List of Figures

Figure 1 - Data Points in different dimensions [10]
Figure 2 - Autoencoder's NN Architecture [12]
Figure 3 - Autocorrelation example [13]
Figure 4 - Example of Seasonality [13]
Figure 5 - Example of a non-stationary process [13]
Figure 6 - Overview of Data Scientist's working time [14]

Abstract

The topic chosen for this term paper in the subject "Current Topics of Data Engineering" is "Time Series Feature Extraction for Industrial Big Data (IIoT) Applications." The report should rather be seen as a project proposal for possible future projects, offering insights into the topic itself and possible approaches. It has two main parts. The first introduces the terminology and some of the methods related to the topic of this paper. The second describes a potential method for solving this problem in a real case, namely the use of the Python package "TSFRESH".

1 Introduction

Nowadays, the terms "Artificial Intelligence," "Artificial Neural Networks," "Machine Learning," and "Deep Learning" are often used in the context of innovations with huge potential impact on today's world. Even though the mathematical and statistical concepts behind them have existed for several decades, they only gained popularity fairly recently, for mainly two reasons: the advanced processing power of computers, and big data. The latter term describes the vast amount of data produced by digital equipment and sensors in this digital age. In the past 50 years, the amount of data has increased exponentially; moreover, 90% of all data to date has been created in the last two years [2]. The estimated amount of data we should have by 2020 is 40 trillion gigabytes [3]. Needless to say, this vast quantity of big data has significant potential for accessing new knowledge, but it has no value unless we can analyze it and find the hidden patterns within. That is precisely where machine learning techniques are needed. Within the area of data analytics, machine learning techniques are known as predictive analytics. Their strength is finding valuable hidden patterns within vast amounts of complex data. With these newly gained pieces of information, it is possible to predict future events and to find solutions for specific complex problems. One particular research field, referred to as Human Activity Recognition (HAR), is of great interest for medical, military, and security applications [4]. Related to this research field, the Universität Bremen in Bremen, Germany, organized the competition "The Bremen Big Data Challenge 2019". The objective of this particular challenge was the classification of various leg movements. The data had been collected beforehand through different sensors positioned on the legs of human subjects.
A student group formed by Gari Ciodaro, Diogo Cosin, and Ralph Florent used a variety of feedforward neural networks to take on the challenge. The highest accuracies obtained on the training set were 84.44% and 91%; the accuracies on the challenge data set (test set) were 54% and 64% [5]. Generally speaking, industrial big data, the Internet of Things (IoT), robotics, and other similar information sources generate a large volume and variety of information at high velocity, with considerable variability (inconsistency) and veracity issues (imprecision). In order to create models with good performance, one needs to carefully merge and integrate these kinds of data by analyzing and extracting the most useful features from different time intervals. This report discusses time series feature extraction for industrial big data (IIoT) applications and its possible usage for "The Bremen Big Data Challenge 2019", in order to optimize the prediction accuracy of the machine learning models. [6]

2 Main Topic - Terminology

This chapter briefly introduces some of the terminology related to the topic of time series feature extraction. Knowing the terminology is important for a good overall understanding of the topic. I will first introduce the term "sensor data", followed by "feature selection", "feature extraction" and "time series".

2.1 Sensor Data

Sensors are indispensable devices in today's world. A sensor is a device that measures certain physical properties by detecting different types of input from its environment. The measured properties can be, for instance, pressure, force, displacement, acceleration, temperature, vibration or even electrochemical potential. Once a property is measured, it is converted into a standardized control signal. This signal is then either converted directly at the device (sensor) into a readable display, or transmitted electronically via a network for further processing [7]. The output of these devices is referred to as sensor data. It can be used to provide information or as an input to another system. Sensor data is a crucial component of the Internet of Things (IoT) environment because most of the data transmitted over such networks is sensor data. [8]

2.2 Feature Selection and Feature Extraction

The world has more data than ever, and the amount is increasing at an exponential rate. This leads to data sets with confusingly many features. Therefore, it is important to be able to tell interesting data from useful data, or in other words, to discriminate between the relevant and irrelevant parts of a given data set. The procedure of selecting a subset of a machine learning algorithm's input variables, on which the algorithm should put its focus while ignoring the rest, is referred to as "feature selection" [9]. This reduction of dimensionality is needed because of a phenomenon that occurs when classifying, organizing, and analyzing high-dimensional data: the curse of dimensionality. "Dimensionality" in the context of machine learning refers to the number of features (variables) in a data set. As can be seen in figure 1 below, where the data space moves from one dimension to two and then to three dimensions, the given data fills less of the whole data space each time the dimension increases. As a result, the amount of data needed for analysis grows exponentially as the dimension increases.

Figure 1 - Data Points in different dimensions [10]

Another problem related to classification and clustering methods arises with the curse of dimensionality. Data points may look similar in low-dimensional spaces because they lie near each other, but that perception can be wrong: the data points may be far apart in higher dimensions [10]. There are many different methods for dimensionality reduction. These methods can be separated into two sections:

1. Feature Selection
2. Feature Extraction

Feature selection filters irrelevant or redundant features (variables) from the given data set by keeping a subset of the original features. Feature extraction has the same purpose, but instead of keeping a subset, it creates new features. This is the key difference between feature selection and feature extraction. Both can be supervised or unsupervised. "Variance thresholds" and "correlation thresholds" are two feature selection methods, which I will briefly describe in the following.

2.2.1 Variance Threshold

Variance threshold is a feature selection method that removes features whose variance falls below a threshold, in other words, features whose values do not change much from one observation to another. The strength of variance thresholds is that they are based on solid intuition: features that do not change much also do not add much information. Therefore, it is a relatively easy and safe way to reduce dimensionality. The weakness of this method is that it is probably not sufficient on its own for cases where strong dimensionality reduction is required. Additionally, the variance threshold itself can be difficult to tune.
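The idea can be sketched in a few lines of NumPy; the toy data and the threshold value of 0.1 are purely illustrative assumptions (scikit-learn ships a packaged version of this method as `sklearn.feature_selection.VarianceThreshold`):

```python
import numpy as np

# Toy data: three features; the middle column barely changes.
X = np.array([
    [1.0, 5.00, 10.0],
    [2.0, 5.01, 20.0],
    [3.0, 5.00, 30.0],
    [4.0, 5.01, 40.0],
])

threshold = 0.1                    # illustrative cutoff
variances = X.var(axis=0)          # per-feature variance
keep = variances > threshold       # boolean mask of informative features
X_reduced = X[:, keep]             # the low-variance middle column is dropped

print(variances)
print(X_reduced.shape)
```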

2.2.2 Correlation Threshold

A related method is the "correlation threshold". It removes features that are highly correlated with others, because such features only provide redundant information. In order to decide which feature to remove, one first calculates all pairwise correlations and then checks which correlations between pairs of features lie above a given threshold. Of each such pair, the feature with the larger mean absolute correlation with the other features gets removed. The strengths of this method are similar to those of variance thresholds: it is based on the solid intuition that similar features provide redundant information, and removing correlated features can boost the performance of some algorithms. The weakness is also similar, namely the tuning of the correlation threshold: if the threshold is set too low, useful information can be lost.
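A minimal pandas sketch of this procedure might look as follows; the toy features `a`, `b`, `c` and the 0.95 threshold are assumptions chosen for illustration, with `b` constructed to be nearly a copy of `a`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "c": rng.normal(size=200),                      # independent feature
})

corr = df.corr().abs()             # pairwise absolute correlations
threshold = 0.95                   # illustrative cutoff
cols = corr.columns
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > threshold:
            # of the correlated pair, drop the feature with the larger
            # mean absolute correlation with all other features
            mean_i = corr[cols[i]].drop(cols[i]).mean()
            mean_j = corr[cols[j]].drop(cols[j]).mean()
            to_drop.add(cols[i] if mean_i > mean_j else cols[j])

df_reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop))  # one of the near-duplicate pair "a"/"b" is removed
```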

Feature extraction is a category of dimensionality reduction methods that select and combine variables into new features, reducing an initial set of raw data to a more manageable data set for processing. As a result, it effectively reduces the amount of data that must be processed while still capturing most of the useful information. Some of the algorithms for feature extraction are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and autoencoders. I will introduce the basic ideas behind these algorithms in the following. [11]

2.2.3 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features. These newly created features are orthogonal to each other and hence uncorrelated. Additionally, they are ranked according to their "explained variance": the first principal component (PC1) explains the most variance in the data set, PC2 explains the second-most variance, and so on. In that way, one can reduce the dimensionality of a data set by limiting the number of principal components. For example, one might keep only as many principal components as needed to reach a cumulative explained variance of 90%. It makes sense to normalize the data set before performing PCA because the transformation is scale-dependent. PCA's strength is that it works well in practice and is fast and straightforward to implement, which makes it easy to compare an algorithm's performance with and without PCA. Its weaknesses are that the new principal components are no longer interpretable, and that the variance threshold must be tuned manually.
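With scikit-learn, the 90%-cumulative-variance example above can be sketched like this; the synthetic data (four features generated from two latent factors) is an illustrative assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Four correlated features generated from two latent factors plus small noise
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 4)) + rng.normal(scale=0.05, size=(300, 4))

# Normalize first: the PCA transformation is scale-dependent
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough PCs for >= 90% explained variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```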

2.2.4 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) creates, similar to PCA, linear combinations of the original features. Compared to PCA, however, LDA maximizes the separability between the classes instead of maximizing the explained variance, which makes it a supervised method. Since the LDA transformation also depends on the scale, it is recommended to normalize the data set beforehand. The strength of LDA is that it can improve the predictive performance of the extracted features. Similar to PCA, the new features of LDA are not easily interpretable, and the number of components needs to be chosen manually. Additionally, because LDA is a supervised method, it requires labeled data, which makes it applicable in fewer situations.
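A short scikit-learn sketch of LDA as a supervised feature extractor; the three synthetic classes and their means are assumptions made for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Three classes in four dimensions with different class means
means = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]])
X = np.vstack([rng.normal(loc=m, size=(100, 4)) for m in means])
y = np.repeat([0, 1, 2], 100)          # class labels (LDA is supervised)

X_scaled = StandardScaler().fit_transform(X)   # LDA is also scale-dependent
lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1
X_lda = lda.fit_transform(X_scaled, y)

print(X_lda.shape)  # 4 original features reduced to 2 discriminant axes
```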

2.2.5 Autoencoders

The feature extraction tools called "autoencoders" are neural networks that are trained to reconstruct their original inputs. For instance, an image autoencoder is trained to reconstruct the original image given as input, instead of classifying it. The hidden layer of the neural network needs to have fewer neurons than the input and output layers, as can be seen in figure 2. That hidden layer learns to create a smaller representation of the original image.

Figure 2 - Autoencoder's NN Architecture [12]

Since autoencoders use the input image as the target output, they can be seen as an unsupervised method. Autoencoders can either be used directly, e.g., in image compression, or stacked in sequence, e.g., in deep learning. The strength of autoencoders is their good performance on certain data types, for instance image and audio data, precisely because they are neural networks. Their weakness, on the other hand, also comes from the fact that they are neural networks: they require much more data to train. [12]
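As a rough illustration, a bottleneck autoencoder can be imitated with scikit-learn's `MLPRegressor` by training it to map its input back to itself; the 8-dimensional synthetic data with a 2-dimensional latent structure and the 2-neuron hidden layer are illustrative assumptions (real autoencoders are usually built with dedicated deep learning frameworks):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# 8-dimensional inputs that really live on a 2-dimensional latent space
latent = rng.uniform(-1, 1, size=(500, 2))
X = np.tanh(latent @ rng.normal(size=(2, 8)))

# A 2-neuron hidden layer forces a compressed internal representation
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                  max_iter=2000, random_state=0)
ae.fit(X, X)  # target output = input, as in figure 2

reconstruction = ae.predict(X)
mse = np.mean((X - reconstruction) ** 2)
print(mse)  # should beat simply predicting the mean of X
```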

2.3 Time Series

Time is an important factor that must be included in most models. It plays a crucial role in tasks like predicting trends in the financial market, electricity consumption or weather forecasting. Whenever a series of data points is ordered in time, it is considered a time series, where time is usually an independent variable. Time series have several characteristics; I will briefly introduce three of them. [13]

2.3.1 Autocorrelation

The term autocorrelation refers to a type of serial dependence. It describes the similarity between different observations as a function of the time lag between them. For example, in the autocorrelation plot in figure 3 below, it can be seen that the first value and the 24th value have a relatively high autocorrelation, and similarly the 12th and 36th observations. These characteristics indicate that one can find a similar value every 24 units of time.

Figure 3 - Autocorrelation example [13]

By looking at figure 3 above, one can notice that the curve has a sinusoidal form. This indicates seasonality. [13]
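The lag-based similarity described above can be reproduced on synthetic data; the hourly sine-plus-noise series with a 24-step period is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hourly series with a 24-hour seasonal pattern plus noise
t = np.arange(24 * 10)
series = np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.2, size=t.size)

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

print(autocorr(series, 24))  # strongly positive: values repeat every 24 steps
print(autocorr(series, 12))  # strongly negative: half a period out of phase
```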

2.3.2 Seasonality

Seasonality refers to a data set with periodic fluctuation patterns. Typical examples of seasonality are electricity consumption (high during the day and low during the night) and online sales around Christmas (increasing toward the holiday and slowing down afterward). Such seasonality can be seen in figure 4 below: every day the value starts low, increases up to a maximum in the evening, and goes down again at night. Furthermore, seasonality can be derived from an autocorrelation plot whenever it shows a sinusoidal pattern; in these cases, one can find the length of a season by looking at the period. In the example in figure 4, it would be 24. [13]

Figure 4 - Example of Seasonality [13]

2.3.3 Stationarity

A time series is stationary only if its statistical properties do not change over time: the mean and variance have to remain constant, and the covariance has to be independent of time. The series in figure x above is stationary because its mean and variance do not change over time. In contrast, consider prices in the stock market: there are growing trends, or the volatility might increase or decrease over time, which means the variance is changing. A non-stationary time series can be seen in figure 5 below. For modeling purposes, it is beneficial to have stationary time series. Since not all time series are naturally stationary, one needs to apply transformations to make them stationary.
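One common such transformation is differencing. The sketch below, using a random walk as an illustrative non-stationary series, shows that taking first differences recovers a stationary process:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random walk: a classic non-stationary series whose variance grows with time
steps = rng.normal(size=1000)
walk = np.cumsum(steps)

# First differences recover the underlying stationary step process
diffed = np.diff(walk)

print("walk variance:", np.var(walk))                    # large
print("diff variance, first half:", np.var(diffed[:500]))   # near 1
print("diff variance, second half:", np.var(diffed[500:]))  # near 1
```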

Figure 5 - Example of a non-stationary process [13]

In order to make useful predictions, the time series needs to be modeled. Three such modeling methods are listed in the following: [13]

- Moving average
- Exponential smoothing
- Seasonal Autoregressive Integrated Moving Average (SARIMA)
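The first two of these methods are one-liners in pandas; the synthetic noisy series, the window of 12 and the smoothing factor alpha = 0.2 are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Noisy sine wave standing in for a real measurement series
series = pd.Series(np.sin(np.linspace(0, 6, 120))
                   + rng.normal(scale=0.3, size=120))

ma = series.rolling(window=12).mean()  # moving average over 12 observations
es = series.ewm(alpha=0.2).mean()      # exponential smoothing

# Both smoothed series fluctuate less step to step than the raw series
print(series.diff().std(), ma.diff().std(), es.diff().std())
```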

2.4 Feature Extraction from Time Series Data

Feature extraction is one of the first steps in machine learning pipelines. It identifies the important and useful attributes/features by discarding redundant features and noise from the system, in order to build a model with the best prediction output. Feature extraction is more difficult for time series classification and regression than for standard classification and regression, because each label or regression target is associated with several time series and meta-information simultaneously. Huge time series data sets occur quite often, for instance from machinery, industrial heavy manufacturing equipment, or IoT applications; the latter are often used in the context of maintenance or production line optimization. The goals of feature extraction from time series data are the following: [6]

- Extracting characteristic features from the time series (minimum, maximum, average, percentiles or other mathematical derivations)
- Clustering time series based on the relevancy of the extracted features
- Consolidating the feature extraction and selection process from distributed and heterogeneous sources of information that lie on different time scales, in order to predict the target output
- Using the extracted (relevant and non-relevant) features to more easily identify new insights into time series properties and dimensions, in both classification and regression modeling
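The first goal, extracting characteristic features per time series, can be sketched with a pandas groupby; the long-format layout with an `id` column (one row per series and time step) and the particular feature set are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Long-format sensor data: three time series of 50 observations each
df = pd.DataFrame({
    "id": np.repeat([0, 1, 2], 50),     # which time series a row belongs to
    "value": rng.normal(size=150),      # the measured sensor value
})

# Characteristic features per time series: min, max, mean, 90th percentile
features = df.groupby("id")["value"].agg(
    minimum="min",
    maximum="max",
    average="mean",
    p90=lambda v: v.quantile(0.9),
)
print(features)  # one feature row per time series id
```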

3 Method Description - “TSFRESH” Package for Feature Extraction in Python

Having seen that there are several different methods for feature extraction from time series, I will now introduce the Python package "TSFRESH" as an option for time series feature extraction in real cases. TSFRESH, which stands for "Time Series Feature Extraction Based on Scalable Hypothesis Tests", is a Python package that contains a large number of extraction methods and a robust algorithm for feature selection. As can be seen in figure 6 below, data scientists spend most of their time either cleaning the data or building features (cleaning and organizing data: 60%). The latter can be automated.

Figure 6 - Overview of Data Scientist's working time [14]

The TSFRESH package automatically extracts hundreds of features from time series, giving a data scientist more time to work on the models. The extracted features describe basic characteristics of the time series data, for example the number of peaks or the average or maximal value, as well as more complex features like time-reversal symmetry statistics. The package also contains a filtering procedure to avoid extracting irrelevant features. The filtering algorithm is based on the theory of hypothesis testing and uses a multiple test procedure; as a result, the filtering process mathematically controls the percentage of irrelevant extracted features. Some of TSFRESH's advantages are listed in the following: [15]
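The filtering idea, one hypothesis test per feature combined with a multiple test procedure, can be illustrated independently of the package itself. The sketch below is not TSFRESH's actual implementation but a simplified stand-in using a Mann-Whitney U test per feature and the Benjamini-Hochberg procedure, on synthetic features where only `relevant` actually depends on the target:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)                    # binary target
features = {
    "relevant": y + rng.normal(scale=0.5, size=400),  # depends on y
    "noise1": rng.normal(size=400),                   # irrelevant
    "noise2": rng.normal(size=400),                   # irrelevant
}

# One hypothesis test per feature: does its distribution differ by class?
pvals = {
    name: stats.mannwhitneyu(f[y == 0], f[y == 1],
                             alternative="two-sided").pvalue
    for name, f in features.items()
}

# Benjamini-Hochberg multiple-test procedure at FDR level 0.05:
# keep all features up to the largest rank k with p(k) <= alpha * k / m
alpha = 0.05
ranked = sorted(pvals.items(), key=lambda kv: kv[1])
m = len(ranked)
selected = set()
for k, (name, p) in enumerate(ranked, start=1):
    if p <= alpha * k / m:
        selected.update(n for n, _ in ranked[:k])

print(sorted(selected))
```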

1. The package is field- and unit-tested
2. The filtering algorithm/process is statistically (mathematically) correct
3. It is compatible with sklearn, numpy, and pandas
4. New features can easily be added
5. The package has comprehensive documentation

The equivalent package for automated time series feature extraction in R is referred to as "tsfeaturex". This package is inspired by and modelled after the Python package TSFRESH.

4 Conclusion & Outlook

In this report, I first introduced the individual terms in order to establish a common understanding of the topic of time series feature extraction. Afterward, I discussed feature extraction in the context of time series data. Time series feature extraction for time series regression and classification problems is a task of high complexity. It has several goals which, if achieved, bear benefits for both classification and regression modeling. At the end of this term paper I gave some brief insights into a Python package for time series feature extraction, referred to as "TSFRESH". Based on my current, admittedly still sparse, state of knowledge, I would say that the performance of the neural networks built by the student group of Gari Ciodaro, Diogo Cosin, and Ralph Florent for "The Bremen Big Data Challenge 2019" might improve by using the "TSFRESH" package for feature extraction beforehand.

This report task was truly an interesting one. I can picture myself tackling a project related to the topic "time series feature extraction for industrial big data (IIoT) applications" in the near future. On the one hand, this is because of the interesting nature of the topic, which can add a new analytical perspective to my thought processes. On the other hand, it is because this skill is one of the versatile fundamentals for any data scientist and plays a crucial role in the final model's performance. To begin, I would gather more detailed information about the feature extraction methods mentioned above in relation to time series data, and then familiarize myself with the Python package "TSFRESH".

References

[1] ncpl, "ncpl," [Online]. Available: https://www.ncplinc.com/. [Accessed 20 November 2019].

[2] C. Petrov, "Techjury," 22 March 2019. [Online]. Available: https://techjury.net/statsabout/big-data-statistics/. [Accessed 5 November 2019].

[3] J. Gantz and D. Reinsel, "THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," IDC, December 2012. [Online]. Available: https://de.slideshare.net/arms8586/the-digital-universe-in-2020. [Accessed 5 November 2019].

[4] S. Rosati et al., "Comparison of Different Sets of Features for Human Activity Recognition by Wearable Sensors," MDPI, 29 November 2018. [Online]. Available: https://www.mdpi.com/1424-8220/18/12/4189/htm. [Accessed 5 November 2019].

[5] G. Ciodaro, D. Cosin and R. Florent, " - Project Report," 17 May 2019. [Online]. Available: http://lrgcobranzas.com/data_analytics/big_data_challenge.pdf. [Accessed 5 November 2019].

[6] S. Chatterjee, "Time Series Feature Extraction for industrial big data (IIoT) applications," Towards Data Science, 10 April 2019. [Online]. Available: https://towardsdatascience.com/time-series-feature-extraction-for-industrial-big-data-iiot-applications-5243c84aaf0e. [Accessed 5 November 2019].

[7] ALTHEN, "Sensoren," [Online]. Available: https://www.althensensors.com/de/sensoren/. [Accessed 5 November 2019].

[8] M. Rouse, "IoT Agenda," TechTarget, 2019. [Online]. Available: https://internetofthingsagenda.techtarget.com/definition/sensor-data. [Accessed 5 November 2019].

[9] M. Ved, "Feature Selection and Feature Extraction in Machine Learning: An Overview," Medium, 19 July 2018. [Online]. Available: https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96. [Accessed 6 November 2019].

[10] DeepAI, "Curse of Dimensionality," [Online]. Available: https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality. [Accessed 6 November 2019].

[11] DeepAI, "Feature Extraction," [Online]. Available: https://deepai.org/machine-learning-glossary-and-terms/feature-extraction. [Accessed 6 November 2019].

[12] ELITE DATA SCIENCE, "Dimensionality Reduction Algorithms: Strengths and Weaknesses," 2016. [Online]. Available: https://elitedatascience.com/dimensionality-reduction-algorithms. [Accessed 6 November 2019].

[13] M. Peixeiro, "Almost Everything You Need to Know About Time Series," Towards Data Science, 5 February 2019. [Online]. Available: https://towardsdatascience.com/almost-everything-you-need-to-know-about-time-series-860241bdc578. [Accessed 8 November 2019].

[14] G. Press, "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says," Forbes, 23 March 2016. [Online]. Available: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#67fa06c96f63. [Accessed 8 November 2019].

[15] dbarbier and nils-braun, "blue-yonder/tsfresh," 2018. [Online]. Available: https://github.com/blue-yonder/tsfresh. [Accessed 8 November 2019].