A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

Saptarshi Sengupta, Sanchita Basak, Pallabi Saikia, Sayak Paul, Vasilios Tsalavoutis, Frederick Ditliac Atiah, Vadlamani Ravi and Richard Alan Peters II

Abstract—Deep learning has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with an accuracy that often surpasses that of human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense out of the flood of data that now inundates our society. As public awareness of the efficacy of deep learning increases so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular Deep Learning model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer artificial neural networks that comprise deep learning. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology: anomalous behavior detection in financial applications or in financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the deep learning community as well as to provide a handy reference to researchers seeking to use deep learning in their work for what it does best: statistical pattern recognition with unparalleled learning capacity with the ability to scale with information.

Index Terms—Deep Neural Network Architectures, Supervised Learning, Unsupervised Learning, Testing Neural Networks, Applications of Deep Learning, Evolutionary Computation

arXiv:1905.13294v3 [cs.LG] 9 Jul 2019

Manuscript received March XX, 2019. S. Sengupta, S. Basak and R.A. Peters II are with the EECS Department at Vanderbilt University, USA. e-mail: [email protected]; [email protected]; [email protected]. P. Saikia is with the CSE Department at IIT Guwahati, India. e-mail: [email protected]. Sayak Paul is with DataCamp Inc, India. e-mail: [email protected]. V. Tsalavoutis is with the School of Mechanical Engineering, Industrial Engineering Laboratory, National Technical University of Athens, Greece. e-mail: [email protected]. F. Atiah is with the Computer Science Department at the University of Pretoria. e-mail: [email protected]. R. Vadlamani is with the Institute for Development and Research in Banking Technology, India. e-mail: [email protected].

I. INTRODUCTION

Artificial neural networks (ANNs), now one of the most widely-used approaches to computational intelligence, started as an attempt to mimic adaptive biological nervous systems in software and customized hardware [1]. ANNs have been studied for more than 70 years [2], during which time they have waxed and waned in the attention of researchers. Recently they have made a strong resurgence as pattern recognition tools following pioneering work by a number of researchers [3]. It has been demonstrated unequivocally that multilayered artificial neural architectures can learn complex, non-linear functional mappings, given sufficient computational resources and training data. Importantly, unlike more traditional approaches, their results scale with training data. Following these remarkable, significant results in robust pattern recognition, the intellectual neighborhood has seen exponential growth, both in terms of academic and industrial research. Moreover, multilayer ANNs reduce much of the manual work that until now has been needed to set up classical pattern recognizers. They are, in effect, black box systems that can deliver, with minimal human attention, excellent performance in applications that require insights from unstructured, high-dimensional data [4][5][6][7][8][9]. These facts motivate this review of the topic.

A. What is an Artificial Neural Network?

An artificial neural network comprises many interconnected, simple functional units, or neurons, that act in concert as parallel information-processors to solve classification or regression problems.
That is, they can separate the input space (the range of all possible values of the inputs) into a discrete number of classes, or they can approximate the function (the black box) that maps inputs to outputs. If the network is created by stacking layers of these multiply connected neurons, the resulting computational system can:

1) Interact with the surrounding environment by using one layer of neurons to receive information (these units are known to be part of the input layers of the neural network)

2) Pass information back-and-forth between layers within the black-box for processing by invoking certain design goals and learning rules (these units are known to be part of the hidden layers of the neural network)

3) Relay processed information out to the surrounding environment via some of its atomic units (these units are known to be part of the output layers of the neural network).

Fig. 1: The Learning Model

Within a hidden layer each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer, each multiplied by a number called a weight. The neuron computes the sum of the products of those outputs (its inputs) and their corresponding weights. This computation is the dot product between an input vector and weight vector, which can be thought of as the projection of one vector onto the other or as a measure of similarity between the two. Assume the input vectors and weights are both n-dimensional and there are m neurons in the layer. Each neuron has its own weight vector, so the output of the layer is an m-dimensional vector computed as the input vector pre-multiplied by an m×n matrix of weights. That is, the output is an m-dimensional vector that is the linear transformation of an n-dimensional input vector. The output of each neuron is in effect a linear classifier, where the weight vector defines a borderline between two classes and where the input vector lies some distance to one side of it or the other.
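For concreteness, the layer computation just described can be sketched in a few lines of NumPy. This is only an illustration of the m×n weight-matrix product; the dimensions and random values are arbitrary assumptions, not taken from the text:

```python
import numpy as np

# One layer of m = 3 neurons acting on an n = 4 dimensional input:
# the layer output is the input vector pre-multiplied by an m x n
# weight matrix, i.e. one dot product (similarity score) per neuron.
n, m = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))    # one n-dimensional weight vector per neuron
x = rng.normal(size=n)         # n-dimensional input vector
y = W @ x                      # m-dimensional layer output
print(y.shape)                 # (3,)
```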
The combined result of all m neurons is an m-dimensional hyperplane that independently classifies the n dimensions of the input into two m-dimensional classes in the output. If the weights are derived via least mean-squared (LMS) estimation from matched pairs of input-output data, they form a linear regression, i.e. the hyperplane that is closest in the LMS sense to all the outputs given the inputs. The hyperplane maps new input points to output points that are consistent with the original data, in the sense that some error function between the computed outputs and the actual outputs in the training data is minimized. Multiple layers of linear maps, wherein the output of one linear classifier or regression is the input of another, is actually equivalent to a different single linear classifier or regression. This is because the output of k different layers reduces to the multiplication of the inputs by a single q × n matrix that is the product of the k matrices, one per layer.

To classify inputs non-linearly, or to approximate a nonlinear function with a regression, each neuron adds a numerical bias value to the result of its input sum of products (the linear classifier) and passes that through a nonlinear activation function. The actual form of the activation function is a design parameter. But they all have the characteristic that they map the real line through a monotonically increasing function that has an inflection point at zero. In a single neuron, the bias effectively shifts the inflection point of the activation function to the value of the bias itself. So the sum of products is mapped through an activation function centered on the bias. Any pair of activation functions so defined is capable of producing a pulse between their inflection points if each one is scaled and one is subtracted from the other. In effect, each pair of neurons samples the input space and outputs a specific value for all inputs within the limits of the pulse. Given training data consisting of input-output pairs - input vectors each with a corresponding output vector - the ANN learns an approximation to the function that produced each of the outputs from its corresponding input. That approximation is the partition of the input space into samples that minimizes the error function between the output of the ANN given its training inputs and the training outputs. This is stated mathematically by the universal approximation theorem, which implies that any functional mapping between input vectors and output vectors can be approximated with arbitrary accuracy by an ANN, provided that it has a sufficient number of neurons in a sufficient number of layers with a specific activation function [10][11][12][13].

Given the dimensions of the input and output vectors, the number of layers, the number of neurons in each layer, the form of the activation function, and an error function, the weights are computed via optimization over input-output pairs to minimize the error function. That way the resulting network is a best approximation of the known input-output data.

B. How do these networks learn?

Neural networks are capable of learning: by changing the distribution of weights it is possible to approximate a function representative of the patterns in the input. The key idea is to re-stimulate the black-box using new excitation (data) until a sufficiently well-structured representation is achieved. Each stimulation redistributes the neural weights a little bit - hopefully in the right direction, given that the learning algorithm involved is appropriate for use - until the error in approximation w.r.t. some well-defined metric is below a practitioner-defined lower bound. Learning, then, is the aggregation of a variable length of causal chains of neural computations [14] seeking to approximate a certain pattern recognition task through linear/nonlinear modulation of the activation of the neurons across the architecture. In the instances in which chains of linear activation fail to learn the underlying structure, non-linearity aids the modulation process. The term 'deep' in this context is a direct indicator of the space complexity of the aggregation chain across many hidden layers needed to learn sufficiently detailed representations.
Theorists and empiricists alike have contributed to an exponential growth in studies using Deep Neural Networks, although, generally speaking, the existing constraints of the field are well-acknowledged [15][16][17]. Deep learning has grown to be one of the principal components of contemporary research in artificial intelligence in light of its ability to scale with input data and its capacity to generalize across problems with similar underlying feature distributions, which are in stark contrast to the hard-coded, problem-specific pattern recognition architectures of yesteryear.

TABLE I: Some Key Advances in Neural Networks Research

People Involved | Contribution
McCulloch & Pitts | ANN models with adjustable weights (1943) [2]
Rosenblatt | The Perceptron Learning Algorithm (1957) [18]
Widrow and Hoff | Adaline (1960), Madaline Rule I (1961) & Madaline Rule II (1988) [19][20]
Minsky & Papert | The XOR Problem (1969) [21]
Werbos (Doctoral Dissertation) | Backpropagation (1974) [22]
Hopfield | Hopfield Networks (1982) [23]
Rumelhart, Hinton & Williams | Renewed interest in backpropagation: multilayer adaptive backpropagation (1986) [24]
Vapnik, Cortes | Support Vector Networks (1995) [25]
Hochreiter & Schmidhuber | Long Short Term Memory Networks (1997) [26]
LeCun et al. | Convolutional Neural Networks (1998) [27]
Hinton & Ruslan | Hierarchical Feature Learning in Deep Neural Networks (2006) [28]

TABLE II: A Collection of Popular Deployment Platforms

Software | Purpose
Tensorflow [30] | Software library with high performance numerical computation and support for Machine Learning and Deep Learning architectures, compatible to be deployed on CPU, GPU and TPU. url: https://www.tensorflow.org/
Theano [35] | GPU compatible Python library with tight integration to NumPy; involves smooth mathematical operations on multi-dimensional arrays. url: http://deeplearning.net/software/theano/
CNTK [36] | Microsoft Cognitive Toolkit (CNTK) is a Deep Learning framework describing computations through directed graphs. url: https://www.microsoft.com/en-us/cognitive-toolkit/
Keras [33] | Runs on top of Tensorflow, CNTK or Theano, compatible to be deployed on CPU and GPU. url: https://keras.io/
PyTorch [29] | Distributed training and performance evaluation platform integrated with Python, supported by major cloud platforms. url: https://pytorch.org/
Caffe [31] | Convolutional Architecture for Fast Feature Embedding (Caffe) is a Deep Learning framework with a focus on image classification and segmentation, deployable on both CPU and GPU. url: http://caffe.berkeleyvision.org/
Chainer [32] | Supports CUDA computation and multiple GPU implementation. url: https://chainer.org/
BigDL [34] | Distributed deep learning library for Apache Spark supporting the programming languages Scala and Python. url: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
C. Why are deep neural networks garnering so much attention now?

Multi-layer neural networks have been around through the better part of the latter half of the previous century. A natural question to ask is why deep neural networks have gained the undivided attention of academics and industrialists alike in recent years. There are many factors contributing to this meteoric rise in research funding and volume. Some of these are outlined below:

• A surge in the availability of large training data sets with high quality labels.

• Advances in parallel computing capabilities and multi-core, multi-threaded implementations.

• Niche software platforms such as PyTorch [29], Tensorflow [30], Caffe [31], Chainer [32], Keras [33], BigDL [34] etc. that allow seamless integration of architectures into a GPU computing framework without the complexity of addressing low-level details such as derivatives and environment setup. Table II provides a summary of popular Deep Learning frameworks.

• Better regularization techniques introduced over the years that help avoid overfitting as we scale up: techniques like batch normalization, dropout, data augmentation, early stopping etc. are highly effective in avoiding overfitting and can improve model performance with scaling.

• Robust optimization algorithms that produce near-optimal solutions: algorithms with adaptive learning rates (AdaGrad, RMSProp, Adam), Stochastic Gradient Descent (with standard momentum or Nesterov momentum), Particle Swarm Optimization, Differential Evolution, etc.

D. Review Methodology

The article, in its present form, serves to present a collection of notable work carried out by researchers in and related to the deep learning niche. It is by no means exhaustive, and is limited in its own right in capturing the global scheme of proceedings in the ever-evolving complex web of interactions among the deep learning community. While cognizant of the difficulty of achieving the stated goal, we nonetheless try to present to the reader an overview of pertinent scholarly collections in varied niches in a single article.

The article makes the following contributions from a practitioner's reading perspective:

• It walks through the foundations of the biomimicry of artificial neural networks from biological ones, commenting on how neural network architectures learn and why deeper layers of neural units are needed for certain pattern recognition tasks.

• It talks about how several different deep architectures work, starting from Deep Feed-forward Networks (DFNNs) and Restricted Boltzmann Machines (RBMs), through Deep Belief Networks (DBNs) and Autoencoders. It also briefly sweeps across Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs) and some of the more recent deep architectures. This cluster within the article serves as a baseline for further reading, or as a refresher for the sections which build on it and follow.

Fig. 2: Organization of the Review

• The article surveys two major computational areas of research in the present day deep learning community that we feel have not been adequately surveyed yet: (a) multi-agent approaches in automatic architecture generation and learning rule optimization of deep neural networks using swarm intelligence, and (b) testing, troubleshooting and robustness analysis of deep neural architectures, which are of prime importance in guaranteeing up-time and ensuring fault-tolerance in mission-critical applications.

• A general survey of developments in certain application modalities is presented. These include:

1) Fraud Detection in Financial Services,
2) Financial Time Series Forecasting,
3) Prognostics and Health Monitoring,
4) Medical Imaging and
5) Power Systems

Figure 2 captures a high-level hierarchical abstraction of the organization of the review with emphasis on current practices, challenges and future research directions. The content organization is as follows: Section II outlines some commonly used deep architectures with the high-level working mechanism of each, Section III talks about the infusion of swarm intelligence techniques within the context of deep learning, and Section IV details diagnostic approaches to assuring fault-tolerant implementations of deep learning systems. Section V makes an exploratory survey of several pertinent applications highlighted in the previous paragraph, while Section VI makes a critical dissection of the general successes and pitfalls of the field as of now and Section VII concludes the article.

II. DEEP ARCHITECTURES: WORKING MECHANISMS

There are numerous deep architectures available in the literature. Comparison of architectures is difficult, as different architectures have different advantages based on the application and the characteristics of the data involved; for example, Convolutional Neural Networks [27] are preferred for vision, while Recurrent Neural Networks [37] are preferred for sequence and time series modelling. However, deep learning is a fast evolving field. Every year, new architectures with new learning algorithms are developed to address the need for efficient, human-like machines in different domains of application.

A. Deep Feed-forward Networks

The Deep Feedforward Neural Network is the most basic deep architecture, in which connections between the nodes move only forward. Basically, when a multilayer neural network contains multiple hidden layers, we call it a deep neural network [38]. An example of a Deep Feed-Forward Network with n hidden layers is provided in Figure 3. Multiple hidden layers help in modelling complex nonlinear relations more efficiently compared to a shallow architecture: a complex function can be modelled with fewer computational units than in a similarly performing shallow network, due to the hierarchical learning made possible by the multiple levels of nonlinearity [39]. Due to the simplicity of its architecture and training, it has always been a popular architecture among researchers and practitioners in almost all domains of engineering.

Fig. 3: Deep Feed-forward Neural Network with n Hidden layers, p input units and q output units with weights W.

Backpropagation using gradient descent [40] is the most common learning algorithm used to train this model. The algorithm first initialises the weights randomly, and then the weights are tuned to minimise the error using gradient descent. The learning procedure involves multiple consecutive forward and backward passes. In the forward pass, we propagate the input towards the output through multiple hidden layers of nonlinearity and ultimately compare the computed output with the actual output of the corresponding input. In the backward pass, the error derivatives with respect to the network parameters are back-propagated to adjust the weights in order to minimise the error in the output. The process continues multiple times until we obtain a desired improvement in the model prediction. If X_i is the input and f_i is the nonlinear activation function of layer i, the output of layer i, which becomes the input of the next layer, can be represented by:

X_{i+1} = f_i(W_i X_i + b_i)    (1)

where W_i and b_i are the parameters connecting layer i with the previous layer. In the backward pass, these parameters can be updated with:

W_new = W − η ∂E/∂W    (2)

b_new = b − η ∂E/∂b    (3)

where W_new and b_new are the updated parameters for W and b respectively, E is the cost function, and η is the learning rate. The cost function of the model is decided depending on the task to be performed, such as regression or classification: for regression, root mean square error is common, and for classification, the softmax function.

Many issues, like overfitting, getting trapped in local minima, and vanishing gradients, can arise if a deep neural network is trained naively; this was the reason neural networks were forsaken by the machine learning community in the late 1990s. However, in 2006 [28], [41], with the advent of the unsupervised pretraining approach for deep neural networks, the neural network was revived to be used for complex tasks like vision and speech. Lately, many other techniques like l1 and l2 regularisation [42], dropout [43], batch normalisation [44], good sets of weight initialisation techniques [45], [46], [47], [48] and good sets of activation functions [49] have been introduced to combat the issues in training deep neural networks.
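A minimal NumPy sketch of this training loop, implementing the forward pass of Eq. (1) and the updates of Eqs. (2)-(3) for a tiny one-hidden-layer regressor; the data, dimensions and learning rate are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))          # 100 toy samples, 3 features
t = X.sum(axis=1, keepdims=True)       # made-up regression target

W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.05                             # learning rate (eta in Eqs. 2-3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    # forward pass, Eq. (1): X_{i+1} = f_i(W_i X_i + b_i)
    h = sigmoid(X @ W1 + b1)
    y = h @ W2 + b2                    # linear output layer
    # backward pass: error derivatives w.r.t. each parameter (E = 0.5*MSE)
    dy = (y - t) / len(X)
    dW2, db2 = h.T @ dy, dy.sum(axis=0)
    dh = dy @ W2.T * h * (1 - h)       # chain rule through the sigmoid
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    # updates, Eqs. (2)-(3): W_new = W - eta * dE/dW
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
```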
many more like Recurrent TRBM [65], factored conditional Many issues like overfitting, trapped in local minima and RBM (fcRBM) [66]. Different types of nodes like Bernoulli, vanishing gradient issues can arise if a deep neural network Gaussian [67] are introduced to cope with the characteristics is trained naively. This was the reason; neural network was of the data used. However, the basic RBM modelling concept forsaken by the machine learning community in the late 1990s. introduced with Bernoulli units. Each node in RBM is a However, in 2006 [28], [41], with the advent of unsupervised computational unit that processes the input it receives to make pretraining approach in deep neural network, the neural net- stochastic decisions whether to transmit that input or not. An work is revived again to be used for the complex tasks like RBM with m visible and n hidden units is provided in Figure vision and speech. Lately, many other techniques like l1, l2 4. regularisation [42], dropout [43], batch normalisation [44], The joint probability distribution of an standard RBM can 1 −E(v,h) good set of weight initialisation techniques [45], [46], [47], be defined with Gibbs distribution p(v, h) = Z e , IEEE TRANSACTIONS ON XXX, VOL. XX, NO. YY, MARCH 2019 6 where energy function E(v,h) can be represented with:

E(v, h) = − Σ_{i=1..n} Σ_{j=1..m} w_ij h_i v_j − Σ_{j=1..m} b_j v_j − Σ_{i=1..n} c_i h_i    (4)

where m and n are the numbers of visible and hidden units, v_j and h_i are the states of visible unit j and hidden unit i, b_j and c_i are the real-valued biases corresponding to the j-th visible unit and the i-th hidden unit respectively, and w_ij is the real-valued weight connecting them. Z is the normalisation constant (the sum of e^(−E(v,h)) over all possible configurations), ensuring that the probability distribution sums to 1.

The restriction on intra-layer connections makes the RBM hidden layer variables independent given the states of the visible layer variables, and vice versa. This eases the complexity of modelling the probability distribution, and hence the distribution of each layer can be represented by the conditional probability distributions given below:

p(h|v) = Π_{i=1..n} p(h_i|v)    (5)

p(v|h) = Π_{j=1..m} p(v_j|h)    (6)

The RBM is trained to maximise the expected probability of the training samples. The contrastive divergence algorithm proposed by Hinton [52] is popular for the training of RBMs. The training brings the model to a stable state by minimising its energy through updates of the model parameters:

Δw_ij = ε(⟨v_j h_i⟩_data − ⟨v_j h_i⟩_model)    (7)

Δb_j = ε(⟨v_j⟩_data − ⟨v_j⟩_model)    (8)

Δc_i = ε(⟨h_i⟩_data − ⟨h_i⟩_model)    (9)

where ε is the learning rate, and ⟨.⟩_data and ⟨.⟩_model denote the expected values under the data distribution and the model distribution respectively.
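In practice the model expectation in Eqs. (7)-(9) is approximated with a single Gibbs step (CD-1). The following NumPy sketch is one plausible reading of that procedure for a Bernoulli RBM; the batch, sizes and learning rate are made-up illustrations, not the authors' setup:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 4                               # visible and hidden units
W = 0.01 * rng.normal(size=(n, m))        # w_ij couples hidden i, visible j
b, c = np.zeros(m), np.zeros(n)           # visible and hidden biases
eps = 0.1                                 # learning rate (epsilon)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

V = (rng.random((50, m)) < 0.5).astype(float)   # toy binary training batch
for _ in range(100):
    # p(h|v), Eq. (5): hidden units are conditionally independent
    ph0 = sigmoid(V @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # stochastic states
    # p(v|h), Eq. (6): one Gibbs step gives the "reconstruction"
    pv1 = sigmoid(h0 @ W + b)
    ph1 = sigmoid(pv1 @ W.T + c)
    # Eqs. (7)-(9): <.>_data minus <.>_model, approximated by CD-1
    W += eps * (ph0.T @ V - ph1.T @ pv1) / len(V)
    b += eps * (V - pv1).mean(axis=0)
    c += eps * (ph0 - ph1).mean(axis=0)
```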
C. Deep Belief Networks

Deep belief network (DBN) is a generative graphical model composed of multiple layers of latent variables. The latent variables, typically binary, can represent the hidden features present in the input observations. The connection between the top two layers of a DBN is undirected, like in an RBM model; hence a DBN with one hidden layer is just an RBM. The other connections in the DBN, except the last, are directed graphs pointing towards the input layer. The DBN is a generative model, hence generating a sample from a DBN follows a top-down approach: we first draw samples from the RBM on the top layer, usually by Gibbs sampling, then we can sample the visible units by a simple pass of ancestral sampling in a top-down fashion. A standard DBN model [68] with three hidden layers is shown in Figure 5.

Fig. 5: DBN with input vector x with 3 hidden layers

Inference in a DBN is an intractable problem due to the explaining-away effect in the latent variable model. However, in 2006 Hinton [28] proposed a fast and efficient way of training a DBN by stacking Restricted Boltzmann Machines (RBMs) one above the other. The lowest level RBM learns the distribution of the input data during training. The next RBM block learns higher-order correlations between the hidden units of the previous hidden layer by sampling those hidden units. This process is repeated for each hidden layer up to the top. A DBN with L hidden layers models the joint distribution between its visible layer v and the hidden layers h^l, where l = 1, 2, ..., L, as follows:

p(v, h^1, ..., h^L) = p(v|h^1) (Π_{l=1..L−2} p(h^l|h^{l+1})) p(h^{L−1}, h^L)    (10)

The log-probability of the training data can be improved by adding layers to the network, which, in turn, increases the true representational power of the network [69]. The DBN training proposed in 2006 [28] by Hinton led to the deep learning era of today and revived the neural network; it was the first deep architecture in history able to train efficiently. Before that, it was almost impossible to train deep architectures. Deep architectures built by initialising the weights with a DBN outperformed the kernel machines that dominated the research landscape at that time. Besides its use as a generative model, the DBN has been successfully applied as a discriminative model by appending a discrimination layer at the end and fine-tuning the model using the target labels provided [3]. In most applications, this approach of pretraining a deep architecture led to state-of-the-art performance in discriminative models [70], [28], [41], [71], [54], such as recognising handwritten digits, detecting pedestrians and time series prediction, even when the amount of labelled data was limited [72]. It has gained immense popularity in acoustic modelling [73] recently, as the model could provide up to 20% improvement over state-of-the-art models such as the Gaussian Mixture Model. The approach creates feature detectors hierarchically, as features of features, in pretraining, providing a good set of initialised weights to the discriminative model. The initialised weights are in a region near the optimal weights, which can improve both modelling and the convergence of fine-tuning [70], [74]. The DBN has been used as an initialising model for classification in many applications, such as phone recognition [62] and computer vision [63], where it is used for the training of higher-order factorized Boltzmann machines, in speech recognition [75], [76], [77] for pretraining DNNs, and for pretraining deep convolutional neural networks (CNNs) [60], [78], [61]. The improved performance is due to the ability of the hidden layers of the network to learn abstract features. Some of the work analysing the learned features to understand what is lost and what is captured during training is demonstrated in [64], [79], [80].
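A hedged NumPy sketch of this greedy layer-wise stacking, reusing the CD-1 update from the previous listing; the helper name, layer sizes and data are our own illustrative choices, not the procedure of [28] verbatim:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(V, n_hidden, eps=0.1, steps=100):
    """Train one Bernoulli RBM with CD-1 (as in the previous listing)."""
    n_visible = V.shape[1]
    W = 0.01 * rng.normal(size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(steps):
        ph0 = sigmoid(V @ W.T + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W + b)
        ph1 = sigmoid(pv1 @ W.T + c)
        W += eps * (ph0.T @ V - ph1.T @ pv1) / len(V)
        b += eps * (V - pv1).mean(axis=0)
        c += eps * (ph0 - ph1).mean(axis=0)
    return W, c, sigmoid(V @ W.T + c)

# Greedy layer-wise pretraining: each RBM models the hidden activations
# of the one below it; the stacked weights then initialize a deep network.
data = (rng.random((50, 20)) < 0.5).astype(float)   # toy binary data
stack, x = [], data
for size in [16, 12, 8]:                            # three hidden layers
    W, c, x = train_rbm(x, size)
    stack.append((W, c))
```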


D. Autoencoder

Fig. 6: Autoencoder with 3 neurons in hidden layer

An autoencoder is a three-layer neural network, as shown in Figure 6, that tries to reconstruct its input in its output layer. Hence, the output layer of an autoencoder contains the same number of units as the input layer. The hidden layer typically contains fewer neurons than the visible layer and tries to encode or represent the input in a more compact form. It shares the same idea as the RBM, but it typically uses deterministic units instead of stochastic units with a particular distribution, as in the case of the RBM.

Like a feedforward neural network, an autoencoder is typically trained using the backpropagation algorithm. The training consists of two phases: encoding and decoding. In the encoding phase, the model tries to encode the input into some hidden representation using the weight matrix of the lower half layer, and in the decoding phase, it tries to reconstruct the same input from the encoded representation using the weight matrix of the upper half layer. Hence, the encoding and decoding weights are forced to be transposes of each other. The encoding and decoding operations of an autoencoder can be represented by the equations below. In the encoding phase,

y' = f(wx + b)    (11)

where w and b are the parameters to be tuned, f is the activation function, x is the input vector, and y' is the hidden representation. In the decoding phase,

x' = f(w'y' + c)    (12)

where w' is the transpose of w, c is the bias to the output layer, and x' is the reconstructed input at the output layer. The parameters of the autoencoder can be updated using the following equations:

w_new = w − η ∂E/∂w    (13)

b_new = b − η ∂E/∂b    (14)

where w_new and b_new are the updated parameters for w and b respectively at the end of the current iteration, and E is the reconstruction error of the input at the output layer.

An autoencoder with multiple hidden layers forms a deep autoencoder. As with deep neural networks, autoencoder training may be difficult due to the multiple layers. This can be overcome by training each layer of the deep autoencoder as a simple autoencoder [28], [41]. The approach has been successfully applied to encoding documents for faster subsequent retrieval [81], image retrieval, efficient speech features [82], etc. Like RBM stacking to form a DBN [28] for layerwise pretraining of a DNN, the autoencoder [41], along with the sparse-encoding energy-based model [71], was independently developed at that time; both were effectively used to pre-train a deep neural network, much like the DBN. Unsupervised pretraining using autoencoders has been successfully applied in many fields, such as image recognition and dimensionality reduction on MNIST [54], [82], [83], multimodal learning on speech and video images [84], [85], and many more. The autoencoder has gained immense popularity as a generative model in recent years [38], [86]. The non-probabilistic and non-generative nature of the conventional autoencoder has been generalised to generative modelling [87], [42], [88], [89], [90], which can be used to generate meaningful samples from the network.
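The encode/decode/update cycle of Eqs. (11)-(14) with tied weights can be sketched in NumPy as follows. This is a toy illustration (sizes, data and learning rate are assumptions); note that because w appears in both phases, its gradient collects one term from each:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.random((200, 8))                 # toy inputs in [0, 1]
w = 0.1 * rng.normal(size=(4, 8))        # 4 hidden units < 8 visible units
b, c = np.zeros(4), np.zeros(8)          # hidden and output biases
eta = 0.5

for _ in range(1000):
    y = sigmoid(X @ w.T + b)             # encoding phase, Eq. (11)
    Xr = sigmoid(y @ w + c)              # decoding with w' = w^T, Eq. (12)
    # backpropagate the reconstruction error E = 0.5*||Xr - X||^2
    D2 = (Xr - X) * Xr * (1 - Xr) / len(X)   # dE/d(pre-activation), output
    D1 = (D2 @ w.T) * y * (1 - y)            # dE/d(pre-activation), hidden
    w -= eta * (y.T @ D2 + D1.T @ X)     # Eq. (13): tied weights, two terms
    c -= eta * D2.sum(axis=0)            # Eq. (14) for the output bias
    b -= eta * D1.sum(axis=0)
```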
Several variations of autoencoders have been introduced, with quite different properties and implementations, to learn more efficient representations of data. One popular variation of the autoencoder that is robust to input variations is the denoising autoencoder [89], [42], [90]. The model can be used for good compact representation of the input, with the number of hidden units less than the number of input units. It can also be used to perform robust modelling of the input distribution with a higher number of neurons in the hidden layer. The robustness of the denoising autoencoder is achieved by introducing the dropout trick, or by introducing Gaussian noise to the input data [91], [92] or to the hidden layers [93]. The approach helps improve performance in many ways: it virtually increases the training set, hence reducing overfitting, and makes the representation of the input robust. The sparse autoencoder [93] was introduced to allow a larger number of hidden units than visible units, to make the representation of the input distribution in the hidden layer easier and more efficient. The larger hidden layer represents the input by turning the units in the hidden layer on and off. The variational autoencoder [86], [94] uses quite a similar concept to the RBM: it learns a stochastic distribution of latent variables instead of a deterministic distribution. Transforming autoencoders [95] were proposed as autoencoders with a transformation-invariance property, in which the encoded features can effectively reflect transformation invariance. The encoder is applied for image recognition purposes [95], [96] and contains the capsule as its building block; a capsule is an independent sub-network that extracts local features within a limited window of viewing to understand if a feature entity is present with a certain probability. Pretraining CNNs using regularised deep autoencoders has become very popular in recent years in computer vision work. Robust models of CNNs are obtained with the denoising autoencoder [88] and the sparse autoencoder with pooling and local contrast normalization [97], which provide not only translation-invariant features but also scaling and out-of-plane rotation invariant features.
E. Convolutional Neural Networks

Convolutional Neural Networks are a class of neural networks that are extremely good at processing images. Although the idea was proposed way back in 1998 by LeCun et al. in their paper entitled "Gradient-based learning applied to document recognition" [98], the deep learning world actually saw it in action when Krizhevsky et al. were able to win the ILSVRC-2012 competition. The architecture that Krizhevsky et al. proposed is popularly known as AlexNet [99]. This remarkable win started the new era of artificial intelligence, and the computation community witnessed the real power of CNNs. Soon after this, several architectures were proposed and still are being proposed, and in many cases these CNN architectures have been able to beat human recognition power as well. It is worth noting that the deep learning revolution actually began with the usage of Convolutional Neural Networks (CNNs). CNNs are extremely useful for a set of computer vision related tasks such as image detection, image segmentation, image classification and so on, and all of these tasks are practically well aligned. On a very high level, deep learning is all about learning data representations, and in order to do so, deep learning systems typically break down complex representations into a set of simpler representations. As mentioned earlier, CNNs are particularly useful when it comes to images, as images have a special spatial property in their formations. An image has several characteristics, like edges, contours, strokes, textures, gradients, orientation and colour. A CNN breaks down an image in terms of simple properties like these and learns them as representations in different layers [100]. Figure 8 is a good representative of this learning scheme.

The layers involved in any CNN model are the convolution layers and the subsampling/pooling layers, which allow the network to learn filters that are specific to specific parts of an image. The convolution layers help the network retain the spatial arrangement of pixels present in any image, whereas the pooling layers allow the network to summarize the pixel information [101]. There are several CNN architectures, such as ZFNet, AlexNet, VGG, YOLO, SqueezeNet and ResNet, some of which are discussed in Section II-H.

F. Recurrent Neural Networks

Although Hidden Markov Models (HMMs) can express time dependencies, they become computationally unfeasible for modelling the long term dependencies which RNNs are capable of. A detailed derivation of the Recurrent Neural Network from differential equations can be found in [102]. RNNs are a form of feed-forward networks spanning adjacent time steps, such that at any time instant a node of the network takes the current data input as well as the hidden node values capturing information of previous time steps. During the backpropagation of errors across multiple timesteps, the problem of vanishing and exploding gradients takes place; this can be avoided by Long Short Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [103]. The amount of information to be retained from previous time steps is controlled by a sigmoid layer known as the 'forget' gate, whereas the sigmoid-activated 'input' gate decides upon the new information to be stored in the cell, followed by a hyperbolic-tangent-activated layer to produce new candidate values, which are combined with the forget-gate-weighted old state's candidate value to update the state. Finally, the output is produced, controlled by the output gate and the hyperbolic-tangent-activated candidate value of the state.

LSTM networks with peephole connections [104] update the three gates using the cell state information. A single update gate, instead of the forget and input gates, is introduced in the Gated Recurrent Unit (GRU) [105], merging the hidden and the cell state. In [106] Sak et al. came up with training LSTM RNNs in a distributed way on multicore CPUs using asynchronous SGD (Stochastic Gradient Descent) optimization for the purpose of acoustic modelling; they presented a two-layer deep LSTM architecture with each layer having a linear recurrent projection layer, with more efficient use of the model parameters. Doetsch et al. [107] proposed an LSTM-based training framework composed of sequence chunks forming mini-batches for training, for the purpose of handwriting recognition; with a reduction of runtime by a factor of 3, the architecture uses modified gating units with layer-specific weights for each gate. Palangi et al. [108] implemented a sentence embedding model using an LSTM-RNN that sequentially extracts information from each word and embeds it in a semantic vector until the end of the sentence, to obtain the overall semantic representation of the entire sentence. The model, with the capability of attenuating unimportant words and identifying salient keywords, is specifically useful in web document retrieval applications.
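To make both architectures concrete, here are minimal Keras sketches of a small CNN classifier and a small LSTM sequence model. The layer counts, sizes and input shapes are arbitrary placeholders of ours, not architectures from the cited works:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A minimal CNN: stacked convolution + pooling layers learn local image
# filters, then a dense layer classifies. Shapes are illustrative only.
cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),                 # summarize pixel information
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A minimal LSTM sequence model: gated recurrent units carry information
# across time steps, mitigating vanishing/exploding gradients.
rnn = models.Sequential([
    layers.LSTM(64, input_shape=(50, 8)),        # 50 time steps, 8 features
    layers.Dense(1),
])
rnn.compile(optimizer="adam", loss="mse")
```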
G. Generative Adversarial Networks

Goodfellow et al. [109] introduced a novel framework for Generative Adversarial Nets with simultaneous training of a generative and a discriminative model. The proposed new generative model bypasses the difficulty of approximating unmanageable probabilistic measures in Maximum Likelihood Estimation faced previously. The generative model tries to capture the data distribution, whereas the discriminative model learns to estimate the probability of a sample coming either from the training data or from the distribution captured by the generative model. If the two models are described by multilayer perceptrons, only the backpropagation and dropout algorithms are required to train them.

The goal in this process is to train the generative network in a way that maximizes the probability of the discriminative network making a mistake. A unique solution can be obtained in the function space where the generative model recovers the distribution of the training data and the discriminative model results in a probability of 50% for each sample. This can be viewed as a minimax two-player game between these two models, as the generative model produces adversarial examples while the discriminative model tries to identify them correctly, and both try to improve their efficiency until the adversarial examples are indistinguishable from the original ones.

In [110], the authors presented training procedures to be applied to GANs focusing on producing visually sensible images. The proposed model was successful in producing MNIST samples visually indistinguishable from the original data and also in learning recognizable features from the Imagenet dataset in a semi-supervised way. This work provides insight about appropriate evaluation metrics for generative models in GANs and a stable semi-supervised training approach. In [111], the authors identified distinct features of GANs from a Turing perspective. The discriminators were allowed to behave as interrogators, as in the Turing Test, by interacting with data sample generating processes, and the increase in accuracy of the models was affirmed by verification with two case studies. The first one was about inferring an agent's behavior based on a hidden stochastic process while managing its environment. The second example talks about active self-discovery exercised by a robot to conclude about its own sensors through controlled movements.
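A compact sketch of this two-player training loop in TensorFlow/Keras, using the common non-saturating variant of the generator loss; all network sizes and the toy data here are assumptions of ours, not the setup of [109]:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim, data_dim = 8, 2
# G maps noise z to fake samples; D outputs the probability that a
# sample came from the training data rather than from G.
G = models.Sequential([layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
                       layers.Dense(data_dim)])
D = models.Sequential([layers.Dense(16, activation="relu", input_shape=(data_dim,)),
                       layers.Dense(1, activation="sigmoid")])
bce = tf.keras.losses.BinaryCrossentropy()
d_opt, g_opt = tf.keras.optimizers.Adam(1e-3), tf.keras.optimizers.Adam(1e-3)

real = np.random.normal(loc=3.0, size=(64, data_dim)).astype("float32")  # toy data
for step in range(200):
    z = tf.random.normal((64, latent_dim))
    with tf.GradientTape() as dt, tf.GradientTape() as gt:
        fake = G(z)
        d_real, d_fake = D(real), D(fake)
        # D is pushed towards 1 on real samples and 0 on generated samples
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # G is pushed to make D output 1 on generated samples
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(dt.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
```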

Fig. 7: Convolution and Pooling Layers in a CNN

Wu et al. [112] proposed a 3D Generative Adversarial Network (3DGAN) for three dimensional object generation using volumetric convolutional networks, with a mapping from a probabilistic space of lower dimension to the three dimensional object space, so that the 3D object can be sampled or explored without any reference image. As a result, high quality 3D objects can be generated employing an efficient shape descriptor learnt in an unsupervised manner by the adversarial discriminator. Vondrick et al. [113] came up with a video recognition/classification and video generation/prediction model using a Generative Adversarial Network (GAN), with separation of foreground from background, employing a spatio-temporal convolutional architecture. The proposed model is efficient in predicting futuristic versions of static images, extracting meaningful features and recognizing actions embedded in the video in a minimally supervised way. Thus, learning scene dynamics from unlabeled videos using adversarial learning is the main objective of the proposed framework.

Another interesting application is generating images from detailed visual descriptions [114]. The authors trained a deep convolutional generative adversarial network (DC-GAN) based on text features encoded through a hybrid character-level convolutional recurrent neural network, and used a manifold interpolation regularizer. The generalizability of the approach was tested by generating images from various objects and changing backgrounds.

Fig. 8: Representations of an image of handwritten digit learned by CNN

H. Recent Deep Architectures

When it comes to deep learning and computer vision, datasets like Cats and Dogs, ImageNet, CIFAR-10 and MNIST are used for benchmarking purposes. Throughout this section, the ImageNet dataset is used for the purpose of benchmarking results, as it is more generalized than the other datasets just mentioned. Every year a competition named ILSVRC (ImageNet Large Scale Visual Recognition Competition) is organized (which is an image classification competition) based on the ImageNet dataset, and it is widely accepted by the deep learning community [115]. Several deep neural network architectures have been proposed in the literature and still are being proposed, with an objective of achieving general artificial intelligence. The LeNet architecture, for example, was proposed by LeCun et al. in 1998, and it was originally proposed as a digit classification model.
Later, LeNet was incorporated to identify handwritten numbers on cheques [98]. Several architectures have been proposed after LeNet, among which AlexNet certainly deserves the most notable mention. It was proposed by Krizhevsky et al. in 2012, and AlexNet was able to beat all the competitors of the ILSVRC challenge. The discovery of AlexNet marks a significant turn in the history of deep learning for several reasons: AlexNet incorporated the dropout regularization, which had just been developed by that time, and AlexNet made use of efficient GPU computing for reducing the training time, which was a first of its kind back in 2012 [99].

Fig. 9: A Recurrent Neural Network Architecture

Soon after AlexNet, ZFNet was proposed by Zeiler et al. in the year 2013 and showed state-of-the-art results on the ILSVRC challenge. It was an enhancement of the AlexNet architecture: it uses expanded mid convolution layers and incorporates smaller strides and filters in the first convolution layer to capture the pixel information in great detail [116]. In 2014, Google researchers came up with a better model, known as GoogleNet or the Inception Network, and won the ILSVRC 2014 challenge. The main catch of this architecture is the inception layer, which allows convolving in parallel with different kernel sizes; this in turn allows learning the smaller pixel information of an image in a better way [117]. It is worth mentioning the VGGNet (also called VGG) architecture here. It was the runner-up in the ILSVRC 2014 challenge and was proposed by Simonyan et al. VGG uses a 3x3 kernel throughout its entire architecture and achieves tremendous generalization with this fixation [118]. The winner of the ILSVRC 2015 challenge was the ResNet architecture, proposed by He et al. This architecture is more formally known as Residual Networks and is deeper than the VGG architecture while still being less complex than it. ResNet was able to beat human performance on the ImageNet dataset, and it is still quite actively used in production [119][120].

III. SWARM INTELLIGENCE IN DEEP LEARNING

The introduction of heuristic and meta-heuristic algorithms in designing complex neural network architectures, aimed towards tuning the network parameters to optimize the learning process, has brought improvements in the performance of several Deep Learning frameworks. In order to design Artificial Neural Networks (ANNs) automatically with evolutionary computation, a Deep Evolutionary Network Structured Representation (DENSER) was proposed in [121], where the optimal design for the network is achieved by a bi-level representation. The outer level deals with the number of layers and their sequence, whereas the inner layer optimizes the parameters and hyper-parameters associated with each layer, defined by a context-free, human-perceivable grammar. Through automatic design of CNNs, the proposed approach performed well on the CIFAR-10, CIFAR-100, MNIST and Fashion-MNIST datasets. On the other hand, Garro et al. [122] proposed a methodology to automatically design ANNs using basic Particle Swarm Optimization (PSO), Second Generation Particle Swarm Optimization (SGPSO), and a New Model of PSO (NMPSO), to simultaneously evolve and optimize the synaptic weights, the transfer function for each neuron and the architecture itself. The ANNs designed in this way were evaluated over eight fitness functions. The work aimed towards dimensionality reduction of the input pattern, and was compared to traditional design architectures using the well known Back-Propagation and Levenberg-Marquardt algorithms. Das et al. [123] used PSO to optimize the number of layers, the neurons, the kinds of transfer functions to be involved and the topology of ANNs aimed at building channel equalizers that perform better in the presence of all noise scenarios.

Wang et al. [124] used Variable-length Particle Swarm Optimization for the automatic evolution of deep Convolutional Neural Network architectures for image classification purposes. They proposed a novel encoding strategy to encode CNN layers in particle vectors and introduced a Disabled layer hiding certain dimensions of the particle vector to obtain variable-length particles.
In addition to this, to speed up the process, the authors randomly picked partial datasets for evaluation. Thus several variants of PSO, along with its hybridised versions [125], as well as a host of recent swarm intelligence algorithms such as the Quantum Double Delta Swarm Algorithm (QDDS) [126] and its chaotic implementation [127] proposed by Sengupta et al., can be used, among others, for the automatic generation of architectures used in Deep Learning applications.

The problem of the changing dimensionality of information perceived by each agent in the domain of Deep Reinforcement Learning (RL) for swarm systems has been solved in [128] using an end-to-end learned mean feature embedding as state information. The research concluded that an end-to-end embedding using neural network features helps to scale up the RL architecture with increasing numbers of agents towards better performing policies, as well as ensuring fast convergence.

Fig. 10: A Generative Adversarial Network Architecture

IV. TESTING NEURAL NETWORKS

Software employed in safety critical systems needs to be rigorously tested through white-box or black-box testing. In white-box testing, the internal structure of the software/program is known and utilized in generating test cases as per the test criteria/requirement, whereas in black-box testing the inputs and outputs of the program are compared, as the internal code of the software cannot be accessed. Some of the previous works dealing with generating test cases revealing faulty cases can be found in [129], and in [130] using Principal Component Analysis. In [131] the authors implemented a black-box testing methodology by feeding randomly generated input test cases to an original version of a real-world test program producing the corresponding outputs, so that input-output pairs are generated to train a neural network. Then each test case is applied to a mutated and faulty version of the test program and compared against the output of the trained ANN, calculating the distance between the two outputs to indicate whether the faulty program has produced a valid or invalid result. Thus the ANN is treated as an automated oracle, which produces satisfactory results when the training set is comprised of data ensuring good coverage of the whole range of input.

Y. Sun et al. [132] proposed a set of four test coverage criteria drawing inspiration from the traditional Modified Condition/Decision Coverage (MC/DC) criteria. They also proposed algorithms for generating test cases for each criterion, built upon linear programming. A new test case (an input to the Deep Neural Network) is produced by perturbing a given one, where the stated algorithms encode the test requirement and a fragment of the DNN by fixing the activation pattern obtained from the given input example, and then minimize the difference between the new and the current inputs. The utility of this method lies in bug finding, determining DNN safety statistics, measuring testing accuracy and analysing the DNN's internal structure. The paper discusses the sign change, value change and distance change of a neuron pair, with two neurons in adjacent layers, in the context of the change in their activation values across two given test cases. Four covering methods: sign-sign cover, distance-sign cover, sign-value cover and distance-value cover are explained, along with the test requirement and test criteria, which compute the percentage of the neuron pairs that are covered by test cases with respect to the covering method.
For each test requirement, an automatic test case generation algorithm is implemented based on Linear Programming (LP). The objective is to find a test input variable, whose value is to be synthesized with LP, with an activation pattern identical to that of a given input. A pair of inputs that satisfy the closeness definition are called adversarial examples if only one of them is correctly labeled by the DNN. The testing criteria necessitate that (sign or distance) changes of the condition neurons support the (sign or value) change of every decision neuron. For a pair of neurons with a specified testing criterion, two activation patterns need to be found such that the two patterns together exhibit the changes required by the corresponding testing criterion. The inputs matching these patterns are then added to the final test suite. The authors put forward results on 10 DNNs with the Sign-Sign, Distance-Sign, Sign-Value and Distance-Value covering methods, which show that the test generation algorithms are effective, as they reach high coverage for all covering criteria. Also, the covering methods designed are useful; this is supported by the fact that a significant portion of adversarial examples have been identified. To evaluate the quality of the obtained adversarial examples, a distance curve showing how close each adversarial example is to the correct input has been plotted. It is observed that, when going deeper into the DNN, it can become harder to cover the neuron pairs. Under such circumstances, to improve the coverage performance, the use of a larger data set when generating test pairs is needed. Interestingly, it seems that most adversarial examples can be found around the middle layers of all DNNs tested. In the future the authors propose to find more efficient test case generation algorithms that do not require linear programming.

Katz et al. [133] provided methods for verifying the adversarial robustness of neural networks with the Reluplex algorithm, to prove that a small perturbation to a rightly classified input should not result in misclassification. Huang et al. [134] proposed an automated verification framework based on Satisfiability Modulo Theory (SMT) to test the safety of a neural network by searching for adversarial manipulations through exploration in the space around a given data point. The adversarial examples discovered were used to fine-tune the network.
A. Different Methods of Adversarial Test Generation

Despite the success of deep learning in various domains, the robustness of the architectures needs to be studied before applying neural network architectures in safety critical systems. In this subsection we discuss the kinds of malicious attacks that can fool or mislead a NN into outputting wrong decisions, and ways to overcome them. The work presented by Tuncali et al. [135] deals with generating scenarios leading to unexpected behaviors by introducing perturbations in the testing conditions. For identifying falsification and critical system behavior for autonomous driving systems, the authors focused on finding glancing counterexamples, which refer to the borderline behavior of the system where it is on the verge of failing. They introduced a Signal Temporal Logic (STL) formula for the problem at hand, which in this case is a combination of predicates over the speed of the target car, the distances of all other objects (including cars and pedestrians) and their relative positions. Then a list of test cases is created and evaluated against the STL specification. A covering array spanning all possible combinations of the values the variables can take is generated. To find a glancing behavior, the discrete parameters from the covering array that correspond to the trace that minimizes the STL conditions are used to create test cases, either uniformly randomly or with a cost function to guide a search over the continuous variables. Thus, a glancing test case for a trace is obtained. The proposed closed loop architecture behaves in an integrated way along with the controller and the Deep Neural Network (DNN) based perception system to search for critical behavior of the vehicle.

In [136] Yuan et al. discuss the adversarial falsification problem, explaining false positive and false negative attacks; white box attacks, where there is complete knowledge about the trained NN model; and black box attacks, where no information about the model can be accessed. With respect to adversarial specificity, there are targeted and non-targeted attacks, where the class output of the adversarial input is predefined in the first case and arbitrary in the second case. They also discuss perturbation scope, where individual attacks are geared towards generating unique perturbations per input, whereas universal attacks generate a similar attack for the whole dataset. The perturbation measurement is computed as the p-norm distance between the actual and the adversarial input. The paper discusses various attack methods, including the L-BFGS attack and the Fast Gradient Sign Method (FGSM), which performs a one-step gradient update along the direction of the sign of the gradient at every pixel, expressed as [137]:

η = ε sign(∇_x J_θ(x, l))    (15)

where ε is the magnitude of the perturbation η, which when added to an input data point generates an adversarial data point. FGSM has been extended by the Basic Iterative Method (BIM) and the Iterative Least-Likely Class Method (ILLC). Moosavi-Dezfooli et al. [138] proposed Deepfool, where an iterative attack is performed with linear approximation to surpass the nonlinearity in multidimensional cases.
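Eq. (15) translates almost directly into code. A hedged TensorFlow sketch follows; the model, inputs and labels are assumed placeholders (any differentiable classifier producing class probabilities), not the implementation of [137]:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm(model, x, label, epsilon=0.01):
    """Return x + eta with eta = epsilon * sign(grad_x J_theta(x, l))."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)                        # differentiate w.r.t. the input
        loss = loss_fn(label, model(x))      # J_theta(x, l)
    eta = epsilon * tf.sign(tape.gradient(loss, x))
    return x + eta                           # adversarial example
```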
FGSM has been extended by the Basic Iterative Method (BIM) and the Iterative Least-Likely Class Method (ILLC). Moosavi-Dezfooli et al. [138] proposed DeepFool, where an iterative attack is performed with a linear approximation of the classifier to surpass its nonlinearity in the multidimensional case.

B. Countermeasures for Adversarial Examples

The paper [136] deals with reactive countermeasures such as adversarial detecting, input reconstruction and network verification, and proactive countermeasures such as network distillation, adversarial (re)training and classifier robustifying. In network distillation, a high-temperature softmax activation reduces the sensitivity of the model to small perturbations. In adversarial (re)training, adversarial examples are used during training. Adversarial detecting deals with estimating the probability that a given input is adversarial. In the input reconstruction technique, a denoising autoencoder transforms adversarial examples back towards actual data before passing them to the deep network's prediction module. Gaussian Process Hybrid Deep Neural Networks (GPDNNs) have also proven more robust to adversarial inputs, and there are ensembling defense strategies to counter multifaceted adversarial examples. The defense strategies discussed here, however, are mostly applicable to computer vision tasks, whereas the need of the day is real-time adversarial input detection and mitigation for safety-critical systems.

In [139], Rouhani et al. proposed DeepFense, an online defense framework against adversarial deep learning. They formulated it as an unsupervised optimization problem that minimizes the less-observed regions of the latent feature hyperspace spanned by a deep network, and were able to decrease the risk of integrated attacks. With an integrated design of algorithms for software and hardware, the proposed framework aims to maximize model reliability.

It is necessary to build robust countermeasures for the different types of adversarial scenarios to provide a reliable infrastructure, as none of the countermeasures is universally applicable to all sorts of adversaries. A detailed list of specific attack generation techniques and their corresponding countermeasures can be found in [140].
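As a concrete illustration of adversarial (re)training, the sketch below augments each training batch with FGSM examples, reusing the `fgsm_attack` helper sketched in the previous subsection. This is a minimal single-step variant under our own assumptions, not the specific procedure of any cited defense.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.1):
    """One step of adversarial (re)training: fit on clean plus FGSM-perturbed inputs."""
    model.train()
    x_adv = fgsm_attack(model, x, y, epsilon)  # craft attacks against the current model
    optimizer.zero_grad()                      # discard gradients left over by the attack
    batch = torch.cat([x, x_adv])              # clean + adversarial inputs
    labels = torch.cat([y, y])                 # adversarial examples keep their true labels
    loss = F.cross_entropy(model(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```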
TABLE III: Distribution of Articles by Application Areas

- Fraud Detection in Financial Services: Pumsirirat et al. [141], Schreyer et al. [142], Wang et al. [143], Zheng et al. [144], Dong et al. [145], Gomez et al. [146], Ryman-Tubb et al. [147], Fiore et al. [148]
- Financial Time Series Forecasting: Cavalcante et al. [149], Li et al. [150], Fama et al. [151], Lu et al. [152], Tk & Verner [153], Pandey et al. [154], Lasfer et al. [155], Gudelek et al. [156], Fischer & Krauss [157], Santos Pinheiro & Dras [158], Bao et al. [159], Hossain et al. [160], Calvez and Cliff [161]
- Prognostics and Health Monitoring: Basak et al. [162], Tamilselvan & Wang [163], Kuremoto et al. [164], Qiu et al. [165], Gugulothu et al. [166], Filonov et al. [167], Botezatu et al. [168]
- Medical Image Processing: Suk, Lee & Shen [169], van Tulder & de Bruijne [170], Brosch & Tam [171], Esteva et al. [172], Rajaraman et al. [173], Kang et al. [174], Hwang & Kim [175], Andermatt et al. [176], Cheng et al. [177], Miao et al. [178], Oktay et al. [179], Golkov et al. [180]
- Power Systems: Vankayala & Rao [181], Chow et al. [182], Guo et al. [183], Bourguet & Antsaklis [184], Bunn & Farmer [185], Hippert et al. [186], Kuster et al. [187], Aggarwal & Song [188], Zhai [189], Park et al. [190], Mocanu et al. [191], Chen et al. [192], Bouktif et al. [193], Dedinec et al. [194], Rahman et al. [195], Kong et al. [196], Dong et al. [197], Kalogirou et al. [198], Wang et al. [199], Das et al. [200], Dabra et al. [201], Liu et al. [202], Jang et al. [203], Gensler et al. [204], Abdel-Nasser et al. [205], Manwell et al. [206], Marugán et al. [207], Wu et al. [208], Wang et al. [209], Wang et al. [210], Feng et al. [211], Qureshi et al. [212]

V. APPLICATIONS

A. Fraud Detection in Financial Services

Fraud detection is an interesting problem in that it can be formulated in an unsupervised, a supervised and a one-class classification setting. In the unsupervised category, class labels are unknown or assumed to be unknown, and clustering techniques are employed to figure out (i) distinct clusters containing fraudulent samples, or (ii) fraudulent samples that lie far from every cluster while all clusters contain genuine samples, in which case it is treated as an outlier detection problem. In the supervised category, class labels are known and a binary classifier is built to identify fraudulent samples; examples of such techniques include Naive Bayes, supervised neural networks, decision trees, support vector machines, fuzzy rule based classifiers and rough set based classifiers. Finally, in the one-class classification category, only samples of the genuine class are available, or fraud samples are not considered for training even if available. These are called one-class classifiers; examples include the one-class support vector machine (also known as Support Vector Data Description, or SVDD) and auto-association neural networks (autoencoders). In this category, models are trained on the genuine class data and tested on the fraud class.

Literature abounds with studies involving traditional neural networks of various architectures for the three categories above. Having said that, fraud (including cyber fraud) detection is becoming increasingly menacing, and fraudsters always appear to be a few notches ahead of organizations in finding new loopholes in the system and circumventing them effortlessly. Organizations, on the other hand, make huge investments in money, time and resources to predict fraud in near real time, if not real time, and to mitigate its consequences. Financial fraud manifests itself in areas such as banking, insurance and investments (stock markets), both offline and online. Online fraud includes credit/debit card fraud, transaction fraud and cyber fraud involving security, while offline fraud includes accounting fraud, forgeries and the like.

Deep learning algorithms have proliferated during the last five years, finding immense application in many fields where traditional neural networks had already been applied with great success; fraud detection is one of them. In what follows, we review the works that employed deep learning for fraud detection and appeared in refereed international journals (one article is from the arXiv repository); papers published in international conferences are excluded.
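The one-class, reconstruction-error idea underlying several of the autoencoder-based detectors reviewed below can be sketched as follows: train only on genuine transactions and flag inputs the model reconstructs poorly. The architecture, feature count and threshold-free scoring are illustrative assumptions rather than any cited paper's design.

```python
import torch
import torch.nn as nn

class TransactionAE(nn.Module):
    """A small fully connected autoencoder for one-class fraud detection."""
    def __init__(self, n_features=30, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fit_and_score(genuine, queries, epochs=50):
    """Train on genuine transactions only; score queries by reconstruction error."""
    model = TransactionAE(genuine.shape[1])
    loss_fn = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                 # full-batch training for brevity
        opt.zero_grad()
        loss = loss_fn(model(genuine), genuine)
        loss.backward()
        opt.step()
    with torch.no_grad():
        err = ((model(queries) - queries) ** 2).mean(dim=1)  # per-sample error
    return err  # higher error => more likely fraudulent
```

In practice a threshold on `err`, calibrated on held-out genuine data, turns the score into a fraud/no-fraud decision.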
Pumsirirat (2018) [141] proposed an unsupervised deep autoencoder (AE) based on a restricted Boltzmann machine (RBM) to detect novel frauds, since fraudsters always try to be innovative in their modus operandi so as not to be caught while perpetrating fraud. He employed a backpropagation-trained deep autoencoder based on an RBM that reconstructs normal transactions, so that anomalies stand out from the normal patterns. He used Google's TensorFlow library together with the deep learning facilities of H2O to implement the AE and RBM, and reported results in terms of mean squared error, root mean squared error and area under the curve.

Schreyer (2017) [142] observed the disadvantage of business- and experiential-knowledge-driven rules: they fail to generalize beyond known scenarios in large-scale accounting frauds. He therefore proposed a deep autoencoder for this purpose and tested its effectiveness on two real-world datasets; chartered accountants appreciated the power of the deep autoencoder in predicting anomalous accounting entries.

Automobile insurance fraud has traditionally been predicted from structured data only, and the textual data present in the claims was never analyzed. Wang and Xu (2018) [143] proposed a novel method in which Latent Dirichlet Allocation (LDA) is first used to extract the text features hidden in the text descriptions of the accidents appearing in the claims, after which deep neural networks are trained on these together with the traditional numeric features. On a real-world insurance fraud dataset, they concluded that their hybrid approach outperformed random forests and support vector machines.

Telecom fraud has assumed large proportions, and its impact can be seen in services involving mobile banking. Zheng et al. (2018) [144] proposed a novel generative adversarial network (GAN) based model to compute the probability of fraud for each large transfer, so that the bank can prevent potential frauds if the probability exceeds a threshold. The model uses a deep denoising autoencoder to learn the complex probabilistic relationships among the input features, and employs adversarial training to discriminate accurately between positive and negative samples. They concluded that the model outperformed traditional classifiers, and two commercial banks using it reduced losses of about 10 million RMB in twelve weeks, significantly improving their reputation.

In today's word-of-mouth marketing, online reviews posted by customers influence buyers' purchase decisions more critically than ever before. However, fraud can be perpetrated in these reviews too, by posting fake and meaningless reviews that cannot reflect customers' genuine purchase experiences and opinions; such reviews pose great challenges for users trying to make the right choices. It is therefore desirable to build a fraud detection model that identifies and weeds out fake reviews. Dong et al. (2018) [145] present an autoencoder-based approach in which a stochastic decision tree model fine-tunes the parameters; extensive experiments were conducted on a large Amazon review dataset.

Gomez et al. (2018) [146] presented a neural network based system for fraud detection in banking. They analyzed a real-world dataset and proposed an end-to-end solution from the practitioner's perspective, focusing especially on issues such as data imbalance, data processing and cost metric evaluation. They reported that their proposed solution performed comparably with state-of-the-art solutions.

Ryman-Tubb et al. (2018) [147] observed that payment card fraud has dented economies to the tune of USD 416bn in 2017, fraud perpetrated primarily to finance terrorism, arms and drug crime. Until recently, the patterns of fraud and the criminals' modus operandi had remained unsophisticated; however, smartphones, mobile payments, cloud computing and contactless payments emerged almost simultaneously with large-scale data breaches, rendering the extant methods less effective. They benchmarked extant methods using 2017 transactional volumes, showing that only eight traditional methods perform well enough for industrial deployment, and suggested cognitive computing approaches and deep learning as promising research directions.

Fiore et al. (2019) [148] observed that data imbalance is a crucial issue in payment card fraud detection and that oversampling has some drawbacks. They proposed Generative Adversarial Networks (GANs) for oversampling: a GAN was trained to output mimicked minority-class examples, which were then merged with the training data into an augmented training set to improve the effectiveness of a classifier. They concluded that a classifier trained on the augmented set outperformed the same classifier trained on the original data, especially as far as sensitivity is concerned, resulting in an effective fraud detection mechanism.
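The GAN-based oversampling idea of Fiore et al. [148] can be sketched as fitting a small generator/discriminator pair on minority-class (fraud) rows and then sampling synthetic rows to merge into the training set. Network sizes, learning rates and step counts below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def gan_oversample(minority, n_new, z_dim=8, steps=2000):
    """Train a tiny GAN on minority-class feature rows, then draw synthetic rows."""
    n_feat = minority.shape[1]
    G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, n_feat))
    D = nn.Sequential(nn.Linear(n_feat, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()
    real_y = torch.ones(len(minority), 1)
    fake_y = torch.zeros(len(minority), 1)
    for _ in range(steps):
        fake = G(torch.randn(len(minority), z_dim))
        opt_d.zero_grad()                 # discriminator: real rows vs. generated rows
        d_loss = bce(D(minority), real_y) + bce(D(fake.detach()), fake_y)
        d_loss.backward()
        opt_d.step()
        opt_g.zero_grad()                 # generator: try to fool the discriminator
        g_loss = bce(D(fake), real_y)
        g_loss.backward()
        opt_g.step()
    with torch.no_grad():
        return G(torch.randn(n_new, z_dim))  # synthetic minority rows to merge
```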
In summary, as far as fraud detection is concerned, some progress has been made in the application of a few deep learning architectures. However, there remains immense potential to contribute to this field, especially by applying ResNets, gated recurrent units, capsule networks and similar architectures to detect frauds, including cyber frauds.

B. Financial Time Series Forecasting

Advances in technology and breakthroughs in deep learning models have driven an increase in intelligent automated trading and decision support systems in financial markets, especially the stock and foreign exchange (FOREX) markets. However, time series problems are difficult to predict, especially financial time series [149]. On the other hand, neural network and deep learning models have shown great success in forecasting financial time series [150], despite the contradictory position of the efficient market hypothesis (EMH) [151], which holds that the FOREX and stock markets follow a random walk and any profit made is by chance. This success can be attributed to the ability of neural networks to self-adapt to any nonlinear data set without statistical assumptions or prior knowledge of the data [152].

Deep learning algorithms have used both fundamental and technical analysis data, the two most commonly used techniques for financial time series forecasting, to train and build deep learning models [149]. Fundamental analysis is the use or mining of textual information, such as financial news, company financial reports and other economic factors like government policies, to predict price movements. Technical analysis, on the other hand, is the analysis of historical data of the stock and FOREX markets.

Deep Learning NNs (DLNN), or Multilayer Feedforward NNs (MFF), are the algorithms most used for financial markets [153]. The experimental analysis of Pandey et al. [154] showed that an MFF with Bayesian learning performed better than an MFF trained with backpropagation for the FOREX market.

Deep neural networks, like machine learning models in general, are considered black boxes, because their internal workings are not fully understood, and the performance of a DNN is highly influenced by its parameters for a particular domain. Lasfer et al. [155] analyzed the influence of parameters such as the number of neurons, the learning rate and the activation function on stock price forecasting. Their work showed that a larger NN produces better results than a smaller one, and that the effect of the activation function is smaller on a large NN.

Although CNNs have earned their stripes in image recognition and have seen less application in financial markets, they have also shown good performance in forecasting the stock market. As indicated by [155], the deeper the network, the better it can generalize to produce good results; however, the more layers an NN has, the more likely it is to overfit a given data set. The CNN, with its convolution, pooling and dropout mechanisms, reduces this tendency to overfit [156].

To apply a CNN to financial markets, the input data need to be transformed or adapted for the CNN. Using a sliding window, Gudelek et al. [156] took image-like snapshots of the stock time series and fed them into a 2D-CNN to perform daily prediction and classification of trends (downward or upward). The model achieved 72 percent accuracy on a data set of 17 exchange-traded funds, although it was not compared against other NN architectures.
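A minimal sketch of the sliding-window transformation just described: each window of prices becomes a 2D "snapshot" labeled with the next day's direction. Replicating the normalized window into a square image is one simple rasterization; Gudelek et al. [156] build richer snapshots (e.g., with technical indicators), so treat this only as an illustration.

```python
import numpy as np

def make_windows(prices, window=28):
    """Slice a 1D price series into 2D snapshots plus next-day trend labels."""
    X, y = [], []
    for t in range(len(prices) - window):
        snap = prices[t:t + window].astype(float)
        snap = (snap - snap.min()) / (snap.max() - snap.min() + 1e-8)  # scale to [0, 1]
        X.append(np.tile(snap, (window, 1)))      # replicate into a window x window image
        y.append(int(prices[t + window] > prices[t + window - 1]))  # 1 = upward move
    return np.stack(X)[:, None, :, :], np.array(y)  # (N, 1, H, W), ready for a 2D CNN

X, y = make_windows(np.cumsum(np.random.randn(300)) + 100.0)  # dummy price series
print(X.shape, y.shape)  # (272, 1, 28, 28) (272,)
```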
Fischer and Krauss [157] adapted the LSTM for stock prediction and compared its performance with memory-free algorithms such as random forests, logistic regression classifiers and deep neural networks. The LSTM performed better than the other algorithms, although the random forest outperformed it during the financial crisis in 2008.

The EMH [151] holds that financial news affecting price movements is incorporated into the price immediately or gradually, so any investor who can analyze the news first and build a good trading strategy can profit. Based on this view and the rise of big data, there has been an upward trend in sentiment analysis and text mining research that utilizes blogs, financial news and social media to forecast the stock or FOREX markets [149]. Santos Pinheiro and Dras [158] explored the impact of news articles on company stock prices by implementing an LSTM neural network, pre-trained by a character-level language model, to predict the changes in a company's prices for both inter-day and intra-day trading. The results showed that a CNN with a word-wise trained model outperformed the other models, while the LSTM character-level model performed better than RNN-based models and with less architectural complexity.

Moreover, hybrid architectures have been built to combine the strengths of more than one deep learning model for forecasting financial time series. Bao et al. [159] combined wavelet transforms, stacked autoencoders and an LSTM for stock price prediction, feeding the output of each model into the next as input; the hybrid model performed better than standalone LSTM and RNN models. Hossain et al. [160] likewise created a hybrid model combining an LSTM and a gated recurrent unit (GRU) to predict the S&P 500 stock price. Compared against standalone LSTM and GRU models with different numbers of layers, the hybrid model outperformed all of them.

Calvez and Cliff [161] introduced a new approach to trading on the stock market with a DLNN model: the model learns by observing the trading behavior of traders. The authors used a limit order book (LOB) and quotes made by successful traders (both automated and human) as input data. The DLNN was able to learn from, and outperform, both human and automated traders. This approach to learning might be the breakthrough for intelligent automated trading in financial markets.

C. Prognostics and Health Management

The service reliability of the ever-encompassing cyber-physical systems around us has started to garner the undivided attention of the prognostics community in recent years. Factors such as revenue loss, system downtime, failure in mission-critical deployments and the market competitive index are emergent motivations behind making accurate predictions about the State of Health (SoH) and Remaining Useful Life (RUL) of components and systems. Industry niches such as manufacturing, electronics, automotive, defense and aerospace are increasingly reliant on expert diagnosis of system health and on smart recommender systems for maximizing system uptime and adaptively scheduling maintenance.
Given the surge in sensor influx, if there exists sufficient structured information in historical or transient data, accurate models describing the system's evolution may be proposed. The general idea of such approaches is that there is a point in the operational cycle of a component beyond which it no longer delivers optimum performance. The most widely used metric for this critical operational cycle is the Remaining Useful Life (RUL), a measure of the time from the present measurement to the critical cycle beyond which sub-optimal performance is anticipated. Prognostic approaches may be divided into three categories: (a) model-driven, (b) data-driven, and (c) hybrid, i.e., any combination of (a) and (b).

The last three decades have seen extensive usage of model-driven approaches, with Gaussian Processes and Sequential Monte Carlo (SMC) methods remaining popular for capturing patterns in relatively simple sensor data streams. However, one shortcoming of the model-driven approaches used to date is their dependence on physical evolution equations recommended by an expert with problem-specific domain knowledge. For model-driven approaches to continue to perform well as problem complexity scales, the prior distribution (the physical equations) must continue to capture the causalities embedded in the data accurately. It has been observed, though, that as sensor data scale up, the ability of model-driven approaches to learn the inherent structure of the data lags behind, owing to simplistic priors and updates that cannot capture the complex functional relationships in high-dimensional input data. With the introduction of self-regulated learning paradigms such as deep learning, this problem of learning the structure of sensor data was mitigated to a large extent, because it was no longer necessary for an expert to hand-design the physical evolution scheme of the system. With the recent advancements in parallel computation, techniques leveraging the volume of available data have begun to shine.

One key issue to keep in mind is that data-driven approaches are only as good as the labeled data available for training. While the surplus of sensor data may motivate choosing such approaches, it is critical that the precursor to the supervised part of learning, i.e., data labeling, be accurate. This often requires laborious and time-consuming effort and is not guaranteed to result in near-accurate ground truth. However, when adequate precaution is in place and a strategic implementation facilitating optimal learning is achieved, it is possible to deliver customized solutions to complex prediction problems with an accuracy unmatched by simpler, model-driven approaches. Therein lies the holy grail of deep learning: the ability to scale learning with training data.

Recent works on device health forecasting include the following. Basak et al. [162] carried out Remaining Useful Life (RUL) prediction for hard disks on Backblaze hard disk data, along with a discussion of effective feature normalization strategies. A Deep Belief Network (DBN) based multi-sensor health diagnosis methodology was proposed in [163] and employed in aircraft engine and electric power transformer health diagnosis to show the effectiveness of the approach. Kuremoto et al. [164] applied a DBN composed of two Restricted Boltzmann Machines (RBM) to capture the input feature distribution, then optimized the size of the network and the learning rate through Particle Swarm Optimization for forecasting with time series data. Qiu et al. [165] proposed an early warning model in which feature extraction through a DNN is combined with hidden state analysis of a Hidden Markov Model (HMM) for health maintenance of an equipment chain in a gas pipeline. Gugulothu et al. [166] proposed a forecasting scheme using a Recurrent Neural Network (RNN) model to generate embeddings that capture the trends of multivariate time series, which are expected to be disparate for healthy and unhealthy devices. The idea of using RNNs to capture intricate dependencies among the time cycles of sensor observations is also emphasized in [167] for prognostic applications. Botezatu et al. [168] came up with rules for directly identifying the healthy or unhealthy state of a device, employing a disk replacement prediction algorithm with changepoint detection applied to time series Backblaze data.

A typical data-driven prognostics and health management flow for a system under test thus starts with data collection, including instances of device performance or features under both normal and degraded operating conditions. Data preprocessing and feature selection play a crucial role before deep learning is applied to learn a model capturing the degradation of the device under test, which can then be employed to diagnose a fault and prognosticate future states of the device to ensure proper maintenance. In this setting, maintaining a balance between false positives and false negatives becomes crucial under real-world industrial constraints, and accuracy measures such as precision and recall must be validated before deployment.
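A minimal sketch of the common data-driven RUL setup discussed above: sliding windows of sensor readings go through an LSTM that regresses remaining useful life, with targets labeled as cycles-to-failure. The architecture and the linear RUL target are standard simplifying assumptions, not the exact models of [162]-[168].

```python
import torch
import torch.nn as nn

class RULRegressor(nn.Module):
    """LSTM over sliding windows of multivariate sensor readings, regressing RUL."""
    def __init__(self, n_sensors=14, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, window, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # RUL estimate from the last hidden state

def rul_targets(n_cycles, failure_cycle):
    """Cycles-to-failure labels: at cycle t the target is failure_cycle - t."""
    return torch.arange(failure_cycle, failure_cycle - n_cycles, -1,
                        dtype=torch.float32)

model = RULRegressor()
windows = torch.randn(32, 30, 14)        # dummy batch: 32 windows, 30 cycles, 14 sensors
print(model(windows).shape)              # torch.Size([32, 1])
```

Training then minimizes a regression loss (e.g., MSE) between the predicted and labeled RUL, with precision/recall-style checks applied to any derived healthy/unhealthy decision as noted above.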
D. Medical Image Processing

Deep learning techniques have pervaded the entire discipline of medical image processing, and the number of studies highlighting their application to canonical tasks such as image classification, detection, enhancement, image generation, registration and segmentation has been on a sharp rise. A recent survey by Litjens et al. [213] presents a collective picture of the prevalence and applications of deep learning models within the community, as does a fairly rigorous treatise of the same by Shen et al. [214]. A concise overview of recent work in some of these canonical tasks follows.

The purpose of image/exam classification is to identify the presence of a disease from medical examination images. Over the last few years, various neural network architectures have been used in this field, including stacked autoencoders applied to the diagnosis of Alzheimer's disease and mild cognitive impairment, exploiting the latent, complicated non-linear relations among features [169]; Restricted Boltzmann Machines applied to lung CT analysis, combining generative and discriminative learning techniques [170]; and Deep Belief Networks trained on three-dimensional medical images [171]. Recently, a trend towards Convolutional Neural Networks in the field of medical image processing has been observed.

In 2017, Esteva et al. [172] used and fine-tuned the Inception v3 model [215] to classify clinical images from skin cancer examinations into benign and malignant variants, validating the model's performance against a good number of dermatologists. In 2018, Rajaraman et al. [173] used specialized CNN architectures such as ResNet to detect malaria parasites in thin blood smear images. Kang et al. [174] improved on the performance of 2D CNNs with a 3D multi-view CNN for lung nodule classification, exploiting spatial contextual information with the help of a 3D Inception-ResNet architecture.

Object/lesion detection aims to identify different parts or lesions in an image. Although object classification and object detection are quite similar to each other, the challenges are specific to each category. In object detection, class imbalance can pose a major hurdle to model performance. Object detection also involves identifying localized information, specific to different parts of an image, from the full image space; the task is therefore a combination of localized identification and classification [216]. In 2016, Hwang and Kim [175] proposed a self-transfer learning (STL) framework that jointly optimizes both aspects of medical object detection, and tested it on the detection of nodules in chest radiographs and lesions in mammography.

[Fig. 11: MRI Brain Slice and its different segmentations [217]]

Segmentation happens to be one of the most common subjects of interest in applications of deep learning to medical image processing. Organ and substructure segmentation allows advanced, fine-grained analysis of a medical image, and it is widely practiced in the analysis of cardiac and brain images; a demonstration is shown in Figure 11, where different segmented parts of an MRI brain slice are considered along with the original slice. Segmentation involves both the local and the global context of pixels with respect to a given image, and the performance of a segmentation model can suffer from inconsistencies due to class imbalance, which makes segmentation a difficult task. The most widely used CNN architecture for medical image segmentation is U-Net, proposed by Ronneberger et al. [218] in 2015. U-Net takes care of the sampling required to check class-imbalance factors and can scan an entire image in a single forward pass, which enables it to consider the full context of the image. RNN-based architectures have also been proposed for segmentation tasks: in 2016, Andermatt et al. [176] presented a method to automatically segment 3D volumes of biomedical images, using multi-dimensional gated recurrent units (GRU) as the main layers of their neural network model. Their method also involves on-the-fly data augmentation, which enables the model to be trained with a smaller amount of training data.

Other applications of deep learning in medical image processing include image registration, i.e., coordinate transformation from a reference image space to a target image space. Cheng et al. [177] used a multi-modal stacked denoising autoencoder to compute an effective similarity measure among images, using normalized mutual information and local cross-correlation, while Miao et al. [178] developed CNN regressors to directly evaluate the registration transformation parameters. In addition, image generation and enhancement techniques have been discussed in [179] and [180].

So far, the application of deep learning to medical image processing has produced satisfactory results in most cases. However, in a sensitive field like medical imaging, prior knowledge should be incorporated into image detection, recognition and reconstruction so that the data-driven approaches do not produce implausible results [219].
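The fine-tuning workflow used in several of the classification studies above (e.g., Esteva et al. [172]) can be sketched as follows. We use a ResNet-18 backbone as a stand-in for their Inception v3, and the freeze-everything-but-the-head policy is an illustrative assumption (torchvision >= 0.13 weights API).

```python
import torch.nn as nn
from torchvision import models

def build_lesion_classifier(n_classes=2):
    """Fine-tune a pretrained backbone for a benign/malignant image task."""
    net = models.resnet18(weights="IMAGENET1K_V1")      # ImageNet-pretrained stand-in
    for p in net.parameters():
        p.requires_grad = False                         # freeze generic visual features
    net.fc = nn.Linear(net.fc.in_features, n_classes)   # fresh, trainable task head
    return net
```

Training then updates only the new head (or, later, unfreezes deeper layers), which is what lets such models work with the comparatively small labeled medical datasets.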
power output of wind and solar energy systems. Segmentation happens to be one of the most common Load forecasting is one of the most important tasks for subjects of interest when it comes to application of Deep the efficient power system’s operation. It allows the system Learning in the domain of medical image processing. Or- operator to schedule spinning reserve allocation, decide for gan and substructure segmentation allows for advanced fine- possible interchanges with other utilities and assess system’s grained analysis of a medical image and it is widely practiced security [184]. A small decrease in load forecasting error may in the analyses of cardiac and brain images. A demonstration result in significant reduction of the total operation cost of is shown in Figure 11, where different segmented parts of an the power system [185]. Among the Artificial Intelligence MRI Brain Slice along with the original slice are considered. techniques applied for load forecasting, methods based on Segmentation includes both the local and global context of ANN have undoubtedly received the largest share of attention pixels with respect to a given image and the performance [186]. A basic reason for their popularity lies on the fact that of a segmentation model can suffer from inconsistencies due ANN techniques are well-suited for energy forecast [187]; to class imbalances. This makes the task of segmentation they may obtain adequate estimations in cases where data a difficult one. The most widely-used CNN architecture for is incomplete [188] and can consistently deal with complex medical image segmentation is U-Net which was proposed non-linear problems [189]. Park et al. [190], was one of the by Ronneberger et al. [218] in 2015. U-Net takes care of first approaches for applying ANN in load forecasting. The sampling that is required to check the class-imbalance factors efficiency of the proposed Multi-layer Perceptron (MLP) was and it is capable of scanning an entire image in just one demonstrated by benchmarking it against a numerical fore- forward pass which enables it to consider the full context of the casting method frequently used by utilities. As an evolution image. RNN-based architectures have also been proposed for of ANN forecasting techniques, DL methods are expected to segmentation tasks. In 2016, Andermatt et al. [176] presented increase the prediction accuracy by allowing higher levels a method to automatically segment 3D volumes of biomedical of abstraction [191]. Thus, DL methods are gradually gain images. They used multi-dimensional gated recurrent units increased popularity due to their ability to capture data be- (GRU) as the main layers of their neural network model. The haviour when considering complex non-linear patterns and IEEE TRANSACTIONS ON XXX, VOL. XX, NO. YY, MARCH 2019 17 large amounts of data. In [192], an end-to-end model based clean [201]. However, due to the chaotic and erratic nature of on deep residual neural networks is proposed for hourly load the weather systems, the power output of PV energy systems forecasting of a single day. Only raw data of past load and is intermittent, volatile and random [202]. These uncertainties temperature were used as inputs of the model. Initially, the may potentially degrade the real-time control performance, inputs of the model are processed by several fully connected reduce system economics, and thus pose a great challenge layers to produce preliminary forecast. 
The inputs of the model are first processed by several fully connected layers to produce a preliminary forecast, which is then passed through a deep neural network structure constructed from residual blocks. The efficiency of the proposed model was demonstrated on data sets from the North American Utility and ISO-NE. In [193], a Long Short Term Memory (LSTM) based neural network is proposed for short and medium term load forecasting; to optimize the effectiveness of the approach, a Genetic Algorithm is used to find the optimal values for the time lags and the number of layers of the LSTM model, and its performance was verified on electricity consumption data of metropolitan France. Mocanu et al. [191] utilized two deep learning approaches based on Restricted Boltzmann Machines (RBM), i.e., the conditional RBM and the factored conditional RBM, for single-meter residential load forecasting; benchmarked against several shallow ANN architectures and a Support Vector Machine approach, the method demonstrated increased efficiency compared to the competing methods. Dedinec et al. [194] employed a Deep Belief Network (DBN), comprising several stacks of layer-wise pre-trained RBMs, for short-term load forecasting of the Former Yugoslav Republic of Macedonia. Rahman et al. [195] proposed two models based on the architecture of Recurrent Neural Networks (RNN), aiming to predict the medium and long term electricity consumption in residential and commercial buildings at one-hour resolution; the approach utilized an MLP in combination with an LSTM-based model using an encoder-decoder architecture. A model based on an LSTM-RNN framework with appliance consumption sequences for short-term residential load forecasting was proposed in [196]; the researchers showed that their method outperforms other state-of-the-art methods for load forecasting. In [197], a Convolutional Neural Network (CNN) with k-means clustering was proposed: k-means is used to partition the large amount of data into clusters, which are then used to train the networks, and the method showed improved performance compared to the case where k-means was not engaged.

The utilization of DL techniques for modelling and forecasting in renewable energy systems is progressively increasing. Since the data in such systems are inherently noisy, they may be handled adequately with ANN architectures [198]; moreover, because renewable energy data are complicated in nature, shallow learning models may be insufficient to identify and learn the corresponding deep non-linear and non-stationary features and traits [199]. Among the various renewable energy sources, wind and solar energy have gained the most popularity due to their potential and high availability [200]. As a result, recent research endeavours have focused on developing DL techniques for problems related to the deployment of these renewable energy sources.

Photovoltaic (PV) energy has received much attention due to its many advantages: it is abundant, inexhaustible and clean [201]. However, due to the chaotic and erratic nature of weather systems, the power output of PV energy systems is intermittent, volatile and random [202]. These uncertainties may potentially degrade real-time control performance, reduce system economics, and thus pose a great challenge to the management and operation of electric power and energy systems [203]. For these reasons, accurate forecasting of PV power output plays a major role in ensuring optimum planning and modelling of PV plants. In [199], a deep neural network architecture is proposed for deterministic and probabilistic PV power forecasting: the deterministic architecture comprises a Wavelet Transform and a deep CNN, while the probabilistic model combines the deterministic model with a spline Quantile Regression (QR) technique. The method was evaluated on historical PV power data sets from two PV farms in Belgium, exhibiting high forecasting stability and robustness. Gensler et al. [204] examined several deep network architectures, i.e., MLP, LSTM networks, DBN and autoencoders, with respect to their accuracy in forecasting PV power output, validating their performance on actual data from PV facilities in Germany. The architecture that exhibited the best performance was the Auto-LSTM network, which combines the feature extraction ability of the autoencoder with the forecasting ability of the LSTM. In [205], an LSTM-RNN is proposed for forecasting the output power of solar PV systems: the authors examine five different LSTM network architectures to obtain the one with the highest forecasting accuracy on the examined data sets, retrieved from two cities in Egypt, and the network providing the highest accuracy was the LSTM with memory between batches.
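Many of the load and PV forecasters above share the same sequence-to-one LSTM skeleton. The sketch below is a generic minimal version with made-up dimensions (a 24-step lookback with load plus temperature as inputs), not the architecture of any specific cited model.

```python
import torch
import torch.nn as nn

class ShortTermForecaster(nn.Module):
    """Sequence-to-one LSTM of the kind used in several load and PV studies."""
    def __init__(self, n_inputs=1, hidden=32, horizon=1):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                # x: (batch, lookback, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # next `horizon` values

model = ShortTermForecaster(n_inputs=2)  # e.g., past load + temperature
x = torch.randn(8, 24, 2)                # dummy batch: 8 series, 24-step lookback
print(model(x).shape)                    # torch.Size([8, 1])
```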
With the advantages of non-pollution, low costs and remarkable benefits of scale, wind power is considered one of the most important sources of energy [206]. ANNs have been widely employed for processing the large amounts of data obtained from the data acquisition systems of wind turbines [207], and in recent years many approaches based on DL architectures have been proposed for predicting the power output of wind power systems. In [208], a deep neural network architecture combining CNN and LSTM networks is proposed for deterministic wind power forecasting; the results of the model are further analyzed and evaluated based on the wind power forecasting error in order to perform probabilistic forecasting. The method was validated on data obtained from a wind farm in China and managed to perform better than other statistical methods, i.e., ARIMA and the persistence method, as well as artificial-intelligence-based techniques, in both deterministic and probabilistic wind power forecasting. Wang et al. [209] proposed a wind power forecasting method based on the Wavelet Transform, CNNs and an ensemble technique. Their method was compared with the persistence method and two shallow ANN architectures, i.e., Back-Propagation ANN (BPANN) and Support Vector Machine, on data sets collected from wind farms in China; the results validate that their method outperforms the benchmark approaches in terms of reliability, sharpness and overall skill. In [210], a DBN model in conjunction with the k-means clustering algorithm is proposed for wind power forecasting; the approach demonstrated significantly increased forecasting accuracy compared to a BPANN and a Morlet wavelet neural network on data sets obtained from a wind farm in Spain. A data-driven multi-model wind forecasting methodology with deep feature selection is proposed in [211]: a two-layer ensemble is developed in which the first layer comprises multiple machine learning models that generate individual forecasts, and the second layer merges them with a blending algorithm; numerical results validate the efficiency of the proposed methodology compared to models employing a single algorithm. Finally, in [212], an approach combining deep autoencoders, DBNs and the concept of transfer learning is proposed for wind power forecasting. Tested on data sets containing power measurements and meteorological forecasts related to the components of wind, obtained from wind farms in Europe, it derives better results in terms of MAE, RMSE and SDE than commonly used baseline regression models, i.e., ARIMA and the Support Vector Regressor.
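A hybrid of the kind proposed in [208] can be sketched as a 1D convolution feeding an LSTM: the convolution extracts local patterns from the wind series while the LSTM models their dynamics. The layer sizes are illustrative assumptions, and the probabilistic error-analysis stage of the paper is omitted.

```python
import torch
import torch.nn as nn

class CNNLSTMWind(nn.Module):
    """1D conv over the input series feeding an LSTM, for one-step wind power forecasts."""
    def __init__(self, n_inputs=1, channels=16, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(n_inputs, channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                              # x: (batch, time, n_inputs)
        z = torch.relu(self.conv(x.transpose(1, 2)))   # -> (batch, channels, time)
        out, _ = self.lstm(z.transpose(1, 2))          # back to (batch, time, channels)
        return self.head(out[:, -1, :])                # one-step-ahead wind power

model = CNNLSTMWind()
print(model(torch.randn(4, 48, 1)).shape)              # torch.Size([4, 1])
```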
VI. DISCUSSIONS

In this paper we presented several Deep Learning architectures, starting from the foundational architectures and moving up to the recent developments, covering their modifications and evolution over time as well as their applications to specific domains. We provided a category-wise list of publicly available data repositories for Deep Learning practitioners in Table V. We discussed the blend of swarm intelligence in Deep Learning approaches and how each enriches the other when applied to real-world problems. The vastly growing use of deep learning architectures, especially in safety-critical systems, raises the question of how reliable the architectures are in providing decisions even in the presence of adversarial scenarios. To address this, we started by giving an overview of testing neural network architectures, various methods for adversarial test generation, and countermeasures to be adopted against adversarial examples. Next we moved on to specific applications of deep learning, including medical imaging, prognostics and health management, applications in financial services, financial time series forecasting and, lastly, applications in power systems. For each application, the current research trends as well as future research directions were discussed. Table IV lists a collection of recent reviews in different fields of Deep Learning, including computer vision, forecasting, image processing, adversarial cases, autonomous vehicles, natural language processing, recommender systems and big data analytics.

VII. CONCLUSIONS AND FUTURE WORK

In conclusion, we highlight a few open areas of research and elaborate on some of the existing lines of thought and study in addressing the challenges that lie within.

• Challenges with scarcity of data: With the growing availability of data as well as powerful and distributed processing units, Deep Learning architectures can be successfully applied to major industrial problems. However, deep learning is traditionally big-data-driven and, unless trained with billions of training samples, lacks the efficiency to learn abstractions through clear verbal definitions [220]. The heavy reliance on Convolutional Neural Networks (CNNs), especially for video recognition purposes, could also face exponential inefficiency leading to their demise [221]; this can be avoided by capsules [222], which capture critical spatial hierarchical relationships more efficiently than CNNs and with smaller data requirements. To make DL work with smaller available data sets, some of the approaches in use are data augmentation, transfer learning, recursive classification techniques and synthetic data generation. One-shot learning [223] is also opening new avenues for learning from very few training examples and has already started showing progress in language processing and image classification tasks. Developing more generalized techniques that let DL models learn from sparse or fewer data representations is a current research thrust.

• Adopting unsupervised approaches: A major thrust is towards combining deep learning with unsupervised learning methods. Systems that set their own goals [220] and develop problem-solving approaches on their way to exploring the environment are the future research direction, surpassing supervised approaches that require large amounts of data a priori. The thrust of AI research, including Deep Learning, is thus towards meta-learning, i.e., learning to learn, which involves automated model design and decision-making capabilities of the algorithms, optimizing the ability to learn various tasks from fewer training data [224].

• Influence of cognitive neuroscience: Inspiration drawn from cognitive neuroscience and developmental psychology to decipher human behavioral patterns is able to bring major breakthroughs in applications such as enabling artificial agents to learn spatial navigation on their own, something that comes naturally to most living beings [225].

• Neural networks and reinforcement learning: Meta-modeling approaches using Reinforcement Learning (RL) are being used to design problem-specific Neural Network architectures. In [226], the authors introduced MetaQNN, an RL-based meta-modeling algorithm that automatically generates CNN architectures for image classification using Q-learning [227] with ε-greedy exploration. AlphaGo, the computer program built by combining reinforcement learning and CNNs for playing the game of Go, achieved great success by beating professional human Go players. Deep convolutional neural networks can also act as function approximators to predict Q-values in a reinforcement learning problem. Thus, a major thrust of current research is the superposition of neural networks and reinforcement learning geared towards problem-specific requirements.

This review has aimed to aid the beginner as well as the practitioner in the field in making informed choices, and has made an in-depth analysis of some recent deep learning architectures as well as an exploratory dissection of some pertinent application areas. It is the authors' hope that readers find the material engaging and informative, and we openly encourage feedback to make the organization and content of this article better aligned with a formal extension of the literature within the deep learning community.

TABLE IV: A Collection of Recent Reviews on Deep Learning

Computer Vision:
- Deep learning for visual understanding: A review [228]
- Deep Learning for Computer Vision: A Brief Review [229]
- A Survey on Deep Learning Methods for Robot Vision [230]
- Deep Learning Advances in Computer Vision with 3D Data: A Survey [231]
- Visualizations of Deep Neural Networks in Computer Vision: A Survey [232]

Forecasting:
- Machine Learning in Financial Crisis Prediction: A Survey [233]
- Deep Learning for Time-Series Analysis [234]
- A Survey on Machine Learning and Statistical Techniques in Bankruptcy Prediction [235]
- Time series forecasting using artificial neural networks methodologies: A systematic review [236]
- A Review of Deep Learning Methods Applied on Load Forecasting [237]
- Trends in Machine Learning Applied to Demand & Sales Forecasting: A Review [238]
- A survey on retail sales forecasting and prediction in fashion markets [239]
- Electric load forecasting: Literature survey and classification of methods [240]
- A review of unsupervised feature learning and deep learning for time-series modeling [241]

Image Processing:
- A Survey on Deep Learning in Medical Image Analysis [213]
- A Comprehensive Survey of Deep Learning for Image Captioning [242]
- Biological image analysis using deep learning-based methods: Literature review [243]
- Deep learning for remote sensing image classification: A survey [244]
- Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review [245]
- Deep Learning for Medical Image Processing: Overview, Challenges and the Future [246]
- An overview of deep learning in medical imaging focusing on MRI [247]
- Deep Learning in Medical Ultrasound Analysis: A Review [248]

Adversarial Cases:
- Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey [249]
- Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks [250]
- Adversarial Attacks and Defenses Against Deep Neural Networks: A Survey [251]
- Adversarial Machine Learning: A Literature Review [252]
- Review of Artificial Intelligence Adversarial Attack and Defense Technologies [253]
- A Survey of Adversarial Machine Learning in Cyber Warfare [254]

Autonomous Vehicles:
- Planning and Decision-Making for Autonomous Vehicles [255]
- A Review of Deep Learning Methods and Applications for Unmanned Aerial Vehicles [256]
- MIT Autonomous Vehicle Technology Study: Large-Scale Deep Learning Based Analysis of Driver Behavior and Interaction with Automation [257]
- Perception, Planning, Control, and Coordination for Autonomous Vehicles [258]
- Survey of neural networks in autonomous driving [259]
- Self-Driving Cars: A Survey [260]

Natural Language Processing:
- Recent Trends in Deep Learning Based Natural Language Processing [261]
- A Survey of the Usages of Deep Learning in Natural Language Processing [262]
- A survey on the state-of-the-art machine learning models in the context of NLP [263]
- Inflectional Review of Deep Learning on Natural Language Processing [264]
- Deep learning for natural language processing: advantages and challenges [265]
- Deep Learning for Natural Language Processing [266]

Recommender Systems:
- Deep Learning based Recommender System: A Survey and New Perspectives [267]
- A Survey of Recommender Systems Based on Deep Learning [268]
- A review on deep learning for recommender systems: challenges and remedies [269]
- Deep Learning Methods on Recommender System: A Survey of State-of-the-art [270]
- Deep Learning-Based Recommendation: Current Issues and Challenges [271]
- A Survey and Critique of Deep Learning on Recommender Systems [272]

Big Data Analytics:
- Efficient Machine Learning for Big Data: A Review [273]
- A survey on deep learning for big data [274]
- A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective [275]
- A survey of machine learning for big data processing [276]
- Deep learning in big data Analytics: A comparative study [277]
- Deep learning applications and challenges in big data analytics [278]

TABLE V: A Collection of Data Repositories for Deep Learning Practitioners

Image Datasets:
- MNIST: http://yann.lecun.com/exdb/mnist/
- CIFAR-100: http://www.cs.utoronto.ca/~kriz/cifar.html
- Caltech 101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
- Caltech 256: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
- Imagenet: http://www.image-net.org/
- COIL100: http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
- STL-10: http://www.stanford.edu/~acoates//stl10/
- Google Open Images: https://ai.googleblog.com/2016/09/introducing-open-images-dataset.html
- LabelMe: http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php

Speech Datasets:
- Google Audioset: https://research.google.com/audioset/dataset/index.html
- TIMIT: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
- VoxForge: http://www.voxforge.org/
- CHIME: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html
- 2000 HUB5 English: https://catalog.ldc.upenn.edu/LDC2002T43
- LibriSpeech: http://www.openslr.org/12/
- VoxCeleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- Open SLR: https://www.openslr.org/51
- CALLHOME American English Speech: https://catalog.ldc.upenn.edu/LDC97S42
- English Broadcast News: https://catalog.ldc.upenn.edu/LDC97S44

Text Datasets:
- SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
- Billion Word Dataset: http://www.statmt.org/lm-benchmark/
- 20 Newsgroups: http://qwone.com/~jason/20Newsgroups/
- Google Books Ngrams: https://aws.amazon.com/datasets/google-books-ngrams/
- UCI Spambase: https://archive.ics.uci.edu/ml/datasets/Spambase
- Common Crawl: http://commoncrawl.org/the-data/
- Yelp Open Dataset: https://www.yelp.com/dataset
- Web 1T 5-gram: https://catalog.ldc.upenn.edu/LDC2006T13
- Blizzard Challenge 2018: https://www.synsig.org/index.php/Blizzard_Challenge_2018
- Flickr personal taxonomies: https://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html

Natural Language Datasets:
- Multi-Domain Sentiment Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
- Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
- Blogger Corpus: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
- Wikipedia Links Data: https://code.google.com/archive/p/wiki-links/downloads
- Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
- SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
- UCI's Spambase data: https://archive.ics.uci.edu/ml/datasets/Spambase

Geospatial Datasets:
- OpenStreetMap: https://www.openstreetmap.org
- Landsat8: https://landsat.gsfc.nasa.gov/landsat-8/
- NEXRAD: https://www.ncdc.noaa.gov/data-access/radar-data/nexrad
- ESRI Open Data: https://hub.arcgis.com/pages/open-data
- USGS EarthExplorer: https://earthexplorer.usgs.gov/
- OpenTopography: https://opentopography.org/
- NASA SEDAC: https://sedac.ciesin.columbia.edu/
- NASA Earth Observations: https://neo.sci.gsfc.nasa.gov/
- Terra Populus: https://terra.ipums.org/

Recommender Systems Datasets:
- Movielens: https://grouplens.org/datasets/movielens/
- Million Song Dataset: https://www.kaggle.com/c/msdchallenge
- Last.fm: https://grouplens.org/datasets/hetrec-2011/
- Book-Crossing Dataset: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
- Jester: https://goldberg.berkeley.edu/jester-data/
- Netflix Prize: https://www.netflixprize.com/
- Pinterest Fashion Compatibility: http://cseweb.ucsd.edu/~jmcauley/datasets.html#pinterest
- Amazon Question and Answer Data: http://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_qa
- Social Circles Data: http://cseweb.ucsd.edu/~jmcauley/datasets.html#socialcircles

Economics and Finance Datasets:
- Quandl: https://www.quandl.com/
- World Bank Open Data: https://data.worldbank.org/
- IMF Data: https://www.imf.org/en/Data
- Financial Times Market Data: https://markets.ft.com/data/
- Google Trends: https://trends.google.com/trends/?q=google&ctab=0&geo=all&date=all&sort=0
- American Economic Association: https://www.aeaweb.org/resources/data/us-macro-regional
- US Stock Data: https://github.com/eliangcs/pystock-data
- World Factbook: https://www.cia.gov/library/publications/download/
- Dow Jones Index Data Set: http://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index

Autonomous Vehicles Datasets:
- BDD100k: https://bdd-data.berkeley.edu/
- Baidu Apolloscapes: http://apolloscape.auto/
- Comma.ai: https://archive.org/details/comma-dataset
- Oxford's Robotic Car: https://robotcar-dataset.robots.ox.ac.uk/
- Cityscapes Dataset: https://www.cityscapes-dataset.com/
- CSSAD Dataset: http://aplicaciones.cimat.mx/Personal/jbhayet/ccsad-dataset
- KUL Belgium Traffic Sign Dataset: http://www.vision.ee.ethz.ch/~timofter/traffic_signs/
- LISA: http://cvrr.ucsd.edu/LISA/datasets.html
- Bosch Small Traffic Light: https://hci.iwr.uni-heidelberg.de/node/6132
- LaRa Traffic Light Recognition: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition
- WPI Datasets: http://computing.wpi.edu/dataset.html

of neural networks and reinforcement learning geared towards problem-specific requirements.

This review has aimed to help both the beginner and the practitioner in the field make informed choices. To that end, it has presented an in-depth analysis of several recent deep learning architectures together with an exploratory dissection of some pertinent application areas. The authors hope that readers find the material engaging and informative, and they openly encourage feedback on the organization and content of this article so that it may serve as a formal extension of the literature within the deep learning community.

REFERENCES

[1] M. van Gerven and S. Bohte, "Editorial: Artificial neural networks as models of neural information processing," Frontiers in Computational Neuroscience, vol. 11, p. 114, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fncom.2017.00114
[2] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec. 1943. [Online]. Available: https://doi.org/10.1007/BF02478259
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] S. Lawrence, C. L. Giles, and A. D. Back, "Face recognition: A convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, Jan. 1997.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3431–3440.
[6] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 677–691, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2599174
[7] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," CoRR, vol. abs/1511.02683, 2015. [Online]. Available: http://arxiv.org/abs/1511.02683
[8] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. V. Gool, "Weakly supervised cascaded convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5131–5139.
[9] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. Loy, and X. Tang, "DeepID-Net: Object detection with deformable part based convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1320–1334, July 2017.
[10] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989. [Online]. Available: https://doi.org/10.1007/BF02551274
[11] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[12] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, "The expressive power of neural networks: A view from the width," CoRR, vol. abs/1709.02540, 2017. [Online]. Available: http://arxiv.org/abs/1709.02540
[13] B. Hanin, "Universal function approximation by deep neural nets with bounded width and ReLU activations," Aug. 2017. [Online]. Available: https://arxiv.org/abs/1708.02691
[14] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608014002135
[15] G. Marcus, "Deep learning: A critical appraisal," arXiv e-prints, p. arXiv:1801.00631, Jan. 2018.
[16] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), March 2016, pp. 372–387.
[17] E. Abbe and C. Sandon, "Provable limitations of deep learning," arXiv e-prints, p. arXiv:1812.06369, Dec. 2018.
[18] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[19] R. Winter and B. Widrow, "MADALINE RULE II: A training algorithm for neural networks," in IEEE 1988 International Conference on Neural Networks, July 1988, pp. 401–408, vol. 1.
[20] B. Widrow and M. A. Lehr, "30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation," Proceedings of the IEEE, vol. 78, no. 9, pp. 1415–1442, Sep. 1990.
[21] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, 1969.
[22] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. John Wiley & Sons, 1994, vol. 1.
[23] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982. [Online]. Available: https://www.pnas.org/content/79/8/2554
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 318–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=104279.104293
[25] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[28] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
[32] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: A next-generation open source framework for deep learning," in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. [Online]. Available: http://learningsys.org/papers/LearningSys_2015_paper_33.pdf
[33] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[34] J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, C. Zhang, Y. Wan, Z. Li, J. Wang, S. Huang, Z. Wu, Y. Wang, Y. Yang, B. She, D. Shi, Q. Lu, K. Huang, and G. Song, "BigDL: A distributed deep learning framework for big data," CoRR, vol. abs/1804.05839, 2018.
[35] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
[36] F. Seide and A. Agarwal, "CNTK: Microsoft's open-source deep-learning toolkit," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 2135–2135. [Online]. Available: http://doi.acm.org/10.1145/2939672.2945397
[37] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, "Recurrent neural network based language modeling in meeting recognition," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[38] L. Deng, D. Yu, et al., "Deep learning: Methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[39] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
[41] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, 2007, pp. 153–160.
[42] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8624–8628.
[43] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8609–8613.
[44] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[45] D. Sussillo and L. Abbott, "Random walk initialization for training very deep feedforward networks," arXiv preprint arXiv:1412.6558, 2014.
[46] D. Mishkin and J. Matas, "All you need is a good init," arXiv preprint arXiv:1511.06422, 2015.
[47] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[48] S. K. Kumar, "On weight initialization in deep neural networks," arXiv preprint arXiv:1704.08863, 2017.
[49] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[50] A. Fischer and C. Igel, "An introduction to restricted Boltzmann machines," in Iberoamerican Congress on Pattern Recognition. Springer, 2012, pp. 14–36.
[51] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," Colorado Univ at Boulder Dept of Computer Science, Tech. Rep., 1986.
[52] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[53] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[54] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[55] H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 536–543.
[56] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 791–798.
[57] I. Sutskever and G. Hinton, "Learning multilevel distributed representations for high-dimensional sequences," in Artificial Intelligence and Statistics, 2007, pp. 548–555.
[58] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Advances in Neural Information Processing Systems, 2007, pp. 1345–1352.
[59] R. Memisevic and G. Hinton, "Unsupervised learning of image transformations," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[60] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 609–616.
[61] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096–1104.
[62] G. Dahl, A.-r. Mohamed, G. E. Hinton, et al., "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems, 2010, pp. 469–477.
[63] G. E. Hinton et al., "Modeling pixel means and covariances using factorized third-order Boltzmann machines," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2551–2558.
[64] A.-r. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4273–4276.
[65] I. Sutskever, G. E. Hinton, and G. W. Taylor, "The recurrent temporal restricted Boltzmann machine," in Advances in Neural Information Processing Systems, 2009, pp. 1601–1608.
[66] G. W. Taylor and G. E. Hinton, "Factored conditional restricted Boltzmann machines for modeling motion style," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1025–1032.
[67] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 599–619.
[68] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016, vol. 1.
[69] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.
[70] G. E. Hinton et al., "What kind of graphical model is the brain?" in IJCAI, vol. 5, 2005, pp. 1765–1775.
[71] C. Poultney, S. Chopra, Y. L. Cun, et al., "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, 2007, pp. 1137–1144.
[72] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.
[73] A.-r. Mohamed, G. E. Dahl, G. Hinton, et al., "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[74] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.
[75] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian polynomial for speaker adaptation of connectionist speech recognition systems," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2152–2161, 2013.
[76] S. M. Siniscalchi, D. Yu, L. Deng, and C.-H. Lee, "Exploiting deep neural networks for detection-based speech recognition," Neurocomputing, vol. 106, pp. 148–157, 2013.
[77] D. Yu, S. M. Siniscalchi, L. Deng, and C.-H. Lee, "Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4169–4172.
[78] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM, vol. 54, no. 10, pp. 95–103, 2011.
[79] J. Susskind, V. Mnih, G. Hinton, et al., "On deep generative models with applications to recognition," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2857–2864.
[80] V. Stoyanov, A. Ropson, and J. Eisner, "Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 725–733.
[81] R. Salakhutdinov and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[82] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.-r. Mohamed, and G. Hinton, "Binary coding of speech spectrograms using a deep auto-encoder," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[83] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[84] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, "Learning deep energy models," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1105–1112.
[85] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.
[86] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[87] G. Alain and Y. Bengio, "What regularized auto-encoders learn from the data-generating distribution," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
[88] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[89] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski, "Deep generative stochastic networks trainable by backprop," in International Conference on Machine Learning, 2014, pp. 226–234.
[90] Y. Bengio, "Deep learning of representations: Looking forward," in International Conference on Statistical Language and Speech Processing. Springer, 2013, pp. 1–37.
[91] P. Vincent, "A connection between score matching and denoising autoencoders," Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[92] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[93] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[94] C. Doersch, "Tutorial on variational autoencoders," arXiv preprint arXiv:1606.05908, 2016.
[95] G. E. Hinton, "A better way to learn features: Technical perspective," Communications of the ACM, vol. 54, no. 10, pp. 94–94, 2011.
[96] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.
[97] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8595–8598.
[98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[99] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS'12. USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
[100] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: https://doi.org/10.1038/nature14539
[101] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[102] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," CoRR, vol. abs/1808.03314, 2018.
[103] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Dec. 1997.
[104] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, July 2000, pp. 189–194.
[105] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[106] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014.
[107] P. Doetsch, M. Kozielski, and H. Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 279–284, 2014.
[108] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward, "Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 694–707, 2016.
[109] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014.
[110] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16. USA: Curran Associates Inc., 2016, pp. 2234–2242. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157096.3157346
[111] R. Groß, Y. Gu, W. Li, and M. Gauci, "Generalizing GANs: A Turing perspective," in NIPS, 2017.
[112] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling," in NIPS, 2016.
[113] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in NIPS, 2016.
[114] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in ICML, 2016.
[115] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[116] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," CoRR, vol. abs/1311.2901, 2013. [Online]. Available: http://arxiv.org/abs/1311.2901
[117] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[118] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[119] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[120] L. N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay," CoRR, vol. abs/1803.09820, 2018. [Online]. Available: http://arxiv.org/abs/1803.09820
[121] F. Assunção, N. Lourenço, P. Machado, and B. Ribeiro, "DENSER: Deep evolutionary network structured representation," Genetic Programming and Evolvable Machines, pp. 1–31, 2018.
[122] B. A. Garro and R. A. Vázquez, "Designing artificial neural networks using particle swarm optimization algorithms," in Comp. Int. and Neurosc., 2015.
[123] G. Das, P. K. Pattnaik, and S. K. Padhy, "Artificial neural network trained by particle swarm optimization for non-linear channel equalization," Expert Systems with Applications, vol. 41, no. 7, pp. 3491–3496, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417413008701
[124] B. Wang, Y. Sun, B. Xue, and M. Zhang, "Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification," 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2018.
[125] S. Sengupta, S. Basak, and R. A. Peters, "Particle swarm optimization: A survey of historical and recent developments with hybridization perspectives," Machine Learning and Knowledge Extraction, vol. 1, no. 1, pp. 157–191, 2018. [Online]. Available: http://www.mdpi.com/2504-4990/1/1/10
[126] S. Sengupta, S. Basak, and R. A. Peters, "QDDS: A novel quantum swarm algorithm inspired by a double Dirac delta potential," in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Nov. 2018, pp. 704–711.
[127] S. Sengupta, S. Basak, and R. A. Peters, "Chaotic quantum double delta swarm algorithm using Chebyshev maps: Theoretical foundations, performance analyses and convergence issues," Journal of Sensor and Actuator Networks, vol. 8, no. 1, 2019. [Online]. Available: http://www.mdpi.com/2224-2708/8/1/9
[128] M. Hüttenrauch, A. Šošić, and G. Neumann, "Deep reinforcement learning for swarm systems," CoRR, vol. abs/1807.06613, 2018.
[129] C. Anderson, A. V. Mayrhauser, and R. Mraz, "On the use of neural networks to guide software testing activities," in Proceedings of 1995 IEEE International Test Conference (ITC), Oct. 1995, pp. 720–729.
[130] T. M. Khoshgoftaar and R. M. Szabo, "Using neural networks to predict software faults during testing," IEEE Transactions on Reliability, vol. 45, no. 3, pp. 456–462, Sep. 1996.
[131] M. Vanmali, M. Last, and A. Kandel, "Using a neural network in the software testing process," Int. J. Intell. Syst., vol. 17, pp. 45–62, 2002.
[132] Y. Sun, X. Huang, and D. Kroening, "Testing deep neural networks," CoRR, vol. abs/1803.04792, 2018.
[133] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, "Towards proving the adversarial robustness of deep neural networks," in FVAV@iFM, 2017.
[134] X. Huang, M. Z. Kwiatkowska, S. Wang, and M. Wu, "Safety verification of deep neural networks," in CAV, 2017.
[135] C. E. Tuncali, G. Fainekos, H. Ito, and J. Kapinski, "Simulation-based adversarial test generation for autonomous vehicles with machine learning components," 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1555–1562, 2018.
[136] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li, "Adversarial examples: Attacks and defenses for deep learning," CoRR, vol. abs/1712.07107, 2017.
[137] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," CoRR, vol. abs/1412.6572, 2014.
[138] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582, 2016.
[139] B. D. Rouhani, M. Samragh, M. Javaheripi, T. Javidi, and F. Koushanfar, "DeepFense: Online accelerated defense against adversarial deep learning," in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD '18. New York, NY, USA: ACM, 2018, pp. 134:1–134:8. [Online]. Available: http://doi.acm.org/10.1145/3240765.3240791
[140] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, "Adversarial attacks and defences: A survey," CoRR, vol. abs/1810.00069, 2018.
[141] A. Pumsirirat and L. Yan, "Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine," International Journal of Advanced Computer Science and Applications, vol. 9, no. 1, pp. 18–25, 2018.
[142] M. Schreyer, T. Sattarov, D. Borth, A. Dengel, and B. Reimer, "Detection of anomalies in large scale accounting data using deep autoencoder networks," CoRR, vol. abs/1709.05254, 2017. [Online]. Available: http://arxiv.org/abs/1709.05254
[143] Y. Wang and W. Xu, "Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud," Decision Support Systems, vol. 105, pp. 87–95, 2018.
[144] Y.-J. Zheng, X.-H. Zhou, W.-G. Sheng, Y. Xue, and S.-Y. Chen, "Generative adversarial network based telecom fraud detection at the receiving bank," Neural Networks, vol. 102, pp. 78–86, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608018300698
[145] M. Dong, L. Yao, X. Wang, B. Benatallah, C. Huang, and X. Ning, "Opinion fraud detection via neural autoencoder decision forest," Pattern Recognition Letters, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865518303039
[146] J. A. Gómez, J. Arévalo, R. Paredes, and J. Nin, "End-to-end neural network architecture for fraud scoring in card payments," Pattern Recognition Letters, vol. 105, pp. 175–181, 2018, Machine Learning and Applications in Artificial Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016786551730291X
[147] N. F. Ryman-Tubb, P. Krause, and W. Garn, "How artificial intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark," Engineering Applications of Artificial Intelligence, vol. 76, pp. 130–157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197618301520
[148] U. Fiore, A. D. Santis, F. Perla, P. Zanetti, and F. Palmieri, "Using generative adversarial networks for improving classification effectiveness in credit card fraud detection," Information Sciences, vol. 479, pp. 448–455, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025517311519
[149] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega, and A. L. Oliveira, "Computational intelligence and financial markets: A survey and future directions," Expert Systems with Applications, vol. 55, pp. 194–211, 2016.
[150] X. Li, Z. Deng, and J. Luo, "Trading strategy design in financial investment through a turning points prediction scheme," Expert Systems with Applications, vol. 36, no. 4, pp. 7818–7826, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417408008622
[151] E. F. Fama, "Random walks in stock market prices," Financial Analysts Journal, vol. 51, no. 1, pp. 75–80, 1995.
[152] C.-J. Lu, T.-S. Lee, and C.-C. Chiu, "Financial time series forecasting using independent component analysis and support vector regression," Decision Support Systems, vol. 47, no. 2, pp. 115–125, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923609000323
[153] M. Tkáč and R. Verner, "Artificial neural networks in business: Two decades of research," Applied Soft Computing, vol. 38, pp. 788–804, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494615006122
[154] T. N. Pandey, A. K. Jagadev, S. Dehuri, and S.-B. Cho, "A novel committee machine and reviews of neural network and statistical models for currency exchange rate prediction: An experimental analysis," Journal of King Saud University - Computer and Information Sciences, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1319157817303816
[155] A. Lasfer, H. El-Baz, and I. Zualkernan, "Neural network design parameters for forecasting financial time series," in Modeling, Simulation and Applied Optimization (ICMSAO), 2013 5th International Conference on. IEEE, 2013, pp. 1–4.
[156] M. U. Gudelek, S. A. Boluk, and A. M. Ozbayoglu, "A deep learning based stock trading model with 2-D CNN trend detection," in Computational Intelligence (SSCI), 2017 IEEE Symposium Series on. IEEE, 2017, pp. 1–8.
[157] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
[158] L. dos Santos Pinheiro and M. Dras, "Stock market prediction with deep learning: A character-based neural language model for event-based trading," in Proceedings of the Australasian Language Technology Association Workshop 2017, 2017, pp. 6–15.
[159] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, p. e0180944, 2017.
[160] A. H. Mohammad, K. Rezaul, T. Ruppa, D. B. B. Neil, and W. Yang, "Hybrid deep learning model for stock price prediction," in IEEE Symposium Series on Computational Intelligence (SSCI), 2018. IEEE, 2018, pp. 1837–1844.
[161] A. le Calvez and D. Cliff, "Deep learning can replicate adaptive traders in a limit-order-book financial market," 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1876–1883, 2018.
[162] S. Basak, S. Sengupta, and A. Dubey, "Mechanisms for integrated feature normalization and remaining useful life estimation using LSTMs applied to hard-disks," arXiv e-prints, p. arXiv:1810.08985, Oct. 2018.
[163] P. Tamilselvan and P. Wang, "Failure diagnosis using deep belief learning based health state classification," Reliability Engineering & System Safety, vol. 115, pp. 124–135, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0951832013000574
[164] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, "Time series forecasting using a deep belief network with restricted Boltzmann machines," Neurocomputing, vol. 137, pp. 47–56, 2014, Advanced Intelligent Computing Theories and Methodologies. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231213007388
[165] J. Qiu, W. Liang, L. Zhang, X. Yu, and M. Zhang, "The early-warning model of equipment chain in gas pipeline based on DNN-HMM," Journal of Natural Gas Science and Engineering, vol. 27, pp. 1710–1722, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1875510015302407
[166] N. Gugulothu, T. Vishnu, P. Malhotra, L. Vig, P. Agarwal, and G. Shroff, "Predicting remaining useful life using time series embeddings based on recurrent neural networks," CoRR, vol. abs/1709.01073, 2017.
[167] P. Filonov, A. Lavrentyev, and A. Vorontsov, "Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model," CoRR, vol. abs/1612.06676, 2016.
[168] M. M. Botezatu, I. Giurgiu, J. Bogojeska, and D. Wiesmann, "Predicting disk replacement towards reliable data centers," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: ACM, 2016, pp. 39–48. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939699
[169] H.-I. Suk, S.-W. Lee, and D. Shen, "Latent feature representation with stacked auto-encoder for AD/MCI diagnosis," Brain Structure and Function, vol. 220, pp. 841–859, 2013.
[170] G. van Tulder and M. de Bruijne, "Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1262–1272, May 2016.
[171] T. Brosch and R. Tam, "Manifold learning of brain MRIs by deep learning," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 633–640.
[172] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, pp. 115–118, Jan. 2017. [Online]. Available: http://dx.doi.org/10.1038/nature21056
[173] S. Rajaraman, S. K. Antani, M. Poostchi, K. Silamut, M. A. Hossain, R. J. Maude, S. Jaeger, and G. R. Thoma, "Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images," PeerJ, vol. 6, p. e4568, Apr. 2018, PMID: 29682411. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/29682411
[174] G. Kang, K. Liu, B. Hou, and N. Zhang, "3D multi-view convolutional neural networks for lung nodule classification," in PLoS ONE, 2017.
[175] S. Hwang and H. Kim, "Self-transfer learning for fully weakly supervised object localization," CoRR, vol. abs/1602.01625, 2016. [Online]. Available: http://arxiv.org/abs/1602.01625
[176] S. Andermatt, S. Pezold, and P. Cattin, "Multi-dimensional gated recurrent units for the segmentation of biomedical 3D-data," in Deep Learning and Data Labeling for Medical Applications, G. Carneiro, D. Mateus, L. Peter, A. Bradley, J. M. R. S. Tavares, V. Belagiannis, J. P. Papa, J. C. Nascimento, M. Loog, Z. Lu, J. S. Cardoso, and J. Cornebise, Eds. Cham: Springer International Publishing, 2016, pp. 142–151.
[177] X. Cheng, X. Lin, and Y. Zheng, "Deep similarity learning for multimodal medical images," CMBBE: Imaging & Visualization, vol. 6, pp. 248–252, 2018.
[178] S. Miao, Z. J. Wang, and R. Liao, "A CNN regression approach for real-time 2D/3D registration," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1352–1363, May 2016.
[179] O. Oktay, W. Bai, M. C. H. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. A. Cook, D. P. O'Regan, and D. Rueckert, "Multi-input cardiac image super-resolution using convolutional neural networks," in MICCAI, 2016.
[180] V. Golkov, A. Dosovitskiy, J. I. Sperl, M. I. Menzel, M. Czisch, P. Sämann, T. Brox, and D. Cremers, "q-Space deep learning: Twelve-fold shorter and model-free diffusion MRI scans," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1344–1351, May 2016.
[181] V. S. S. Vankayala and N. D. Rao, "Artificial neural networks and their applications to power systems: A bibliographical survey," Electric Power Systems Research, vol. 28, no. 1, pp. 67–79, 1993.
[182] M.-y. Chow, P. Mangum, and R. Thomas, "Incipient fault detection in DC machines using a neural network," in Signals, Systems and Computers, 1988. Twenty-Second Asilomar Conference on, vol. 2. IEEE, 1988, pp. 706–709.
[183] Z. Guo, K. Zhou, X. Zhang, and S. Yang, "A deep learning model for short-term power load and probability density forecasting," Energy, vol. 160, pp. 1186–1200, 2018.
[184] R. E. Bourguet and P. J. Antsaklis, "Artificial neural networks in electric power industry," ISIS, vol. 94, p. 007, 1994.
[185] J. Sharp, "Comparative models for electrical load forecasting: D. H. Bunn and E. D. Farmer, Eds. (Wiley, New York, 1985) £24.95, pp. 232," International Journal of Forecasting, vol. 2, no. 2, pp. 241–242, 1986. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:intfor:v:2:y:1986:i:2:p:241-242
[186] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: A review and evaluation," IEEE Transactions on Power Systems, vol. 16, no. 1, pp. 44–55, 2001.
[187] C. Kuster, Y. Rezgui, and M. Mourshed, "Electrical load forecasting models: A critical systematic review," Sustainable Cities and Society, vol. 35, pp. 257–270, 2017.
[188] R. Aggarwal and Y. Song, "Artificial neural networks in power systems. I. General introduction to neural computing," Power Engineering Journal, vol. 11, no. 3, pp. 129–134, 1997.
[189] Y. Zhai, "Time series forecasting competition among three sophisticated paradigms," Ph.D. dissertation, University of North Carolina at Wilmington, 2005.
[190] D. C. Park, M. El-Sharkawi, R. Marks, L. Atlas, and M. Damborg, "Electric load forecasting using an artificial neural network," IEEE Transactions on Power Systems, vol. 6, no. 2, pp. 442–449, 1991.
[191] E. Mocanu, P. H. Nguyen, M. Gibescu, and W. L. Kling, "Deep learning for estimating building energy consumption," Sustainable Energy, Grids and Networks, vol. 6, pp. 91–99, 2016.
[192] K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He, "Short-term load forecasting with deep residual networks," IEEE Transactions on Smart Grid, 2018.
[193] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani, "Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches," Energies, vol. 11, no. 7, p. 1636, 2018.
[194] A. Dedinec, S. Filiposka, A. Dedinec, and L. Kocarev, "Deep belief network based electricity load forecasting: An analysis of Macedonian case," Energy, vol. 115, pp. 1688–1700, 2016.
[195] A. Rahman, V. Srikumar, and A. D. Smith, "Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks," Applied Energy, vol. 212, pp. 372–385, 2018.
[196] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, "Short-term residential load forecasting based on LSTM recurrent neural network," IEEE Transactions on Smart Grid, 2017.
[197] X. Dong, L. Qian, and L. Huang, "Short-term load forecasting in smart grid: A combined CNN and k-means clustering approach," in Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on. IEEE, 2017, pp. 119–125.
[198] S. A. Kalogirou, "Artificial neural networks in renewable energy systems applications: A review," Renewable and Sustainable Energy Reviews, vol. 5, no. 4, pp. 373–401, 2001.
[199] H. Wang, H. Yi, J. Peng, G. Wang, Y. Liu, H. Jiang, and W. Liu, "Deterministic and probabilistic forecasting of photovoltaic power based on deep convolutional neural network," Energy Conversion and Management, vol. 153, pp. 409–422, 2017.
[200] U. K. Das, K. S. Tey, M. Seyedmahmoudian, S. Mekhilef, M. Y. I. Idris, W. Van Deventer, B. Horan, and A. Stojcevski, "Forecasting of photovoltaic power generation and model optimization: A review," Renewable and Sustainable Energy Reviews, vol. 81, pp. 912–928, 2018.
[201] V. Dabra, K. K. Paliwal, P. Sharma, and N. Kumar, "Optimization of photovoltaic power system: A comparative study," Protection and Control of Modern Power Systems, vol. 2, no. 1, p. 3, 2017.
[202] J. Liu, W. Fang, X. Zhang, and C. Yang, "An improved photovoltaic power forecasting model with the assistance of aerosol index data," IEEE Transactions on Sustainable Energy, vol. 6, no. 2, pp. 434–442, 2015.
[203] H. S. Jang, K. Y. Bae, H.-S. Park, and D. K. Sung, "Solar power prediction based on satellite images and support vector machine," IEEE Trans. Sustain. Energy, vol. 7, no. 3, pp. 1255–1263, 2016.
[204] A. Gensler, J. Henze, B. Sick, and N. Raabe, "Deep learning for solar power forecasting: An approach using autoencoder and LSTM neural networks," in Systems, Man, and Cybernetics (SMC), 2016 IEEE International Conference on. IEEE, 2016, pp. 002858–002865.
[205] M. Abdel-Nasser and K. Mahmoud, "Accurate photovoltaic power forecasting models using deep LSTM-RNN," Neural Computing and Applications, pp. 1–14, 2017.
[206] J. F. Manwell, J. G. McGowan, and A. L. Rogers, Wind Energy Explained: Theory, Design and Application. John Wiley & Sons, 2010.
[207] A. P. Marugán, F. P. G. Márquez, J. M. P. Pérez, and D. Ruiz-Hernández, "A survey of artificial neural network in wind energy systems," Applied Energy, vol. 228, pp. 1822–1836, 2018.
[208] W. Wu, K. Chen, Y. Qiao, and Z. Lu, "Probabilistic short-term wind power forecasting based on deep neural networks," in Probabilistic Methods Applied to Power Systems (PMAPS), 2016 International Conference on. IEEE, 2016, pp. 1–8.
[209] H.-z. Wang, G.-q. Li, G.-b. Wang, J.-c. Peng, H. Jiang, and Y.-t. Liu, "Deep learning based ensemble approach for probabilistic wind power forecasting," Applied Energy, vol. 188, pp. 56–70, 2017.
[210] K. Wang, X. Qi, H. Liu, and J. Song, "Deep belief network based k-means cluster approach for short-term wind power forecasting," Energy, vol. 165, pp. 840–852, 2018.
[211] C. Feng, M. Cui, B.-M. Hodge, and J. Zhang, "A data-driven multi-model methodology with deep feature selection for short-term wind forecasting," Applied Energy, vol. 190, pp. 1245–1257, 2017.
[212] A. S. Qureshi, A. Khan, A. Zameer, and A. Usman, "Wind power prediction using deep neural network based meta regression and transfer learning," Applied Soft Computing, vol. 58, pp. 742–755, 2017.
[213] G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," CoRR, vol. abs/1702.05747, 2017. [Online]. Available: http://arxiv.org/abs/1702.05747
[214] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, no. 1, pp. 221–248, 2017, PMID: 28301734. [Online]. Available: https://doi.org/10.1146/annurev-bioeng-071516-044442
[215] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016.
[216] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 2042–2049.
[217] "Brain MRI image segmentation using stacked denoising autoencoders," https://goo.gl/tpnDx3.
[218] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," CoRR, vol. abs/1505.04597, 2015. [Online]. Available: http://arxiv.org/abs/1505.04597
[219] A. Maier, C. Syben, T. Lasser, and C. Riess, "A gentle introduction to deep learning in medical image processing," Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 86–101, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S093938891830120X
[220] G. Marcus, "Deep learning: A critical appraisal," CoRR, vol. abs/1801.00631, 2018.
[221] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in NIPS 2017, 2017.
[222] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in ICANN, 2011.
[223] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, "Matching networks for one shot learning," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16. USA: Curran Associates Inc., 2016, pp. 3637–3645. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157382.3157504
[224] K. Hsu, S. Levine, and C. Finn, "Unsupervised learning via meta-learning," CoRR, vol. abs/1810.02334, 2018.
[225] A. Banino, C. Barry, B. Uria, C. Blundell, T. P. Lillicrap, P. Mirowski, A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, G. Wayne, H. Soyer, F. Viola, B. Zhang, R. Goroshin, N. C. Rabinowitz, R. Pascanu, C. Beattie, S. Petersen, A. Sadik, S. Gaffney, H. King, K. Kavukcuoglu, D. Hassabis, R. Hadsell, and D. Kumaran, "Vector-based navigation using grid-like representations in artificial agents," Nature, vol. 557, pp. 429–433, 2018.
[226] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," CoRR, vol. abs/1611.02167, 2016.
[227] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698
[228] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: A review," Neurocomputing, vol. 187, pp. 27–48, 2016.
[229] A. Voulodimos, N. D. Doulamis, A. D. Doulamis, and E. Protopapadakis, "Deep learning for computer vision: A brief review," in Comp. Int. and Neurosc., 2018.
[230] J. R. del Solar, P. Loncomilla, and N. Soto, "A survey on deep learning methods for robot vision," CoRR, vol. abs/1803.10862, 2018.
[231] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, "Deep learning advances in computer vision with 3D data: A survey," ACM Computing Surveys, vol. 50, June 2017.
[232] C. Seifert, A. Aamir, A. Balagopalan, D. Jain, A. Sharma, S. Grottel, and S. Gumhold, Visualizations of Deep Neural Networks in Computer Vision: A Survey. Cham: Springer International Publishing, 2017, pp. 123–144. [Online]. Available: https://doi.org/10.1007/978-3-319-54024-5_6
[233] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai, "Machine learning in financial crisis prediction: A survey," IEEE Transactions on Systems, Man, and Cybernetics - TSMC, vol. 42, pp. 421–436, July 2012.
[234] J. C. B. Gamboa, "Deep learning for time-series analysis," CoRR, vol. abs/1701.01887, 2017.
[235] S. Sarojini Devi and Y. Radhika, "A survey on machine learning and statistical techniques in bankruptcy prediction," International Journal of Machine Learning and Computing, vol. 8, pp. 133–139, Apr. 2018.
[236] A. Tealab, "Time series forecasting using artificial neural networks methodologies: A systematic review," Future Computing and Informatics Journal, vol. 3, no. 2, pp. 334–340, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2314728817300715
[237] A. Almalaq and G. Edwards, "A review of deep learning methods applied on load forecasting," in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2017, pp. 511–516.
[238] J. P. Usuga Cadavid, S. Lamouri, and B. Grabot, "Trends in machine learning applied to demand & sales forecasting: A review," in International Conference on Information Systems, Logistics and Supply Chain, Lyon, France, July 2018. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01881362
[239] S. Beheshti-Kashi, H. Reza Karimi, K.-D. Thoben, M. Lütjen, and M. Teucke, "A survey on retail sales forecasting and prediction in fashion markets," Systems Science & Control Engineering: An Open Access Journal, vol. 3, pp. 154–161, Jan. 2015.
[240] H. K. Alfares and M. Nazeeruddin, "Electric load forecasting: Literature survey and classification of methods," Int. J. Systems Science, vol. 33, pp. 23–34, 2002.
[241] M. Längkvist, L. Karlsson, and A. Loutfi, "A review of unsupervised feature learning and deep learning for time-series modeling," Pattern Recognition Letters, vol. 42, pp. 11–24, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865514000221
[242] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A comprehensive survey of deep learning for image captioning," ACM Comput. Surv., vol. 51, no. 6, pp. 118:1–118:36, Feb. 2019. [Online]. Available: http://doi.acm.org/10.1145/3295748
[243] H. Wang, S. Shang, L. Long, R. Hu, Y. Wu, N. Chen, S. Zhang, F. Cong, and S. Lin, "Biological image analysis using deep learning-based methods: Literature review," Digital Medicine, vol. 4, no. 4, pp. 157–165, 2018. [Online]. Available: http://www.digitmedicine.com/article.asp?issn=2226-8561;year=2018;volume=4;issue=4;spage=157;epage=165;aulast=Wang;t=6
[244] Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, "Deep learning for remote sensing image classification: A survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 6, p. e1264, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1264
[245] W. Rawat and Z. Wang, "Deep convolutional neural networks for image classification: A comprehensive review," Neural Computation, vol. 29, no. 9, pp. 2352–2449, 2017, PMID: 28599112. [Online]. Available: https://doi.org/10.1162/neco_a_00990
[246] M. I. Razzak, S. Naz, and A. Zaib, Deep Learning for Medical Image Processing: Overview, Challenges and the Future. Cham: Springer International Publishing, 2018, pp. 323–350. [Online]. Available: https://doi.org/10.1007/978-3-319-65981-7_12
[247] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102–127, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0939388918301181
[248] S. Liu, Y. Wang, X. Yang, B. Lei, L. Liu, S. X. Li, D. Ni, and T. Wang, "Deep learning in medical ultrasound analysis: A review," Engineering, vol. 5, no. 2, pp. 261–275, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2095809918301887
[249] N. Akhtar and A. S. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, vol. 6, pp. 14410–14430, 2018.
[250] D. J. Miller, Z. Xiang, and G. Kesidis, "Adversarial learning in statistical classification: A comprehensive review of defenses against attacks," CoRR, vol. abs/1904.06292, 2019. [Online]. Available: http://arxiv.org/abs/1904.06292
[251] M. Ozdag, "Adversarial attacks and defenses against deep neural networks: A survey," Procedia Computer Science, vol. 140, pp. 152–161, 2018, Cyber Physical Systems and Deep Learning, Chicago, Illinois, November 5–7, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050918319884
[252] S. Thomas and N. Tabrizi, Adversarial Machine Learning: A Literature Review, July 2018, pp. 324–334.
[253] S. Qiu, Q. Liu, S. Zhou, and C. Wu, "Review of artificial intelligence adversarial attack and defense technologies," Applied Sciences, vol. 9, no. 5, 2019. [Online]. Available: http://www.mdpi.com/2076-3417/9/5/909
[254] V. Duddu, "A survey of adversarial machine learning in cyber warfare," 2018.
[255] W. Schwarting, J. Alonso-Mora, and D. Rus, "Planning and decision-making for autonomous vehicles," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, May 2018.
[256] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. C. Cervera, "A review of deep learning methods and applications for unmanned aerial vehicles," J. Sensors, vol. 2017, pp. 3296874:1–3296874:13, 2017.
[257] A. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley, A. Pettinato, B. Seppelt, L. Angell, B. Mehler, and B. Reimer, "MIT autonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior and interaction with automation," CoRR, vol. abs/1711.06976, 2017.
[258] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, 2017. [Online]. Available: http://www.mdpi.com/2075-1702/5/1/6
[259] G. von Zitzewitz, "Survey of neural networks in autonomous driving," July 2017.
[260] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. F. R. Jesus, R. F. Berriel, T. M. Paixão, F. W. Mutz, T. Oliveira-Santos, and A. F. de Souza, "Self-driving cars: A survey," CoRR, vol. abs/1901.04407, 2019.
[261] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing [review article]," IEEE Computational Intelligence Magazine, vol. 13, pp. 55–75, 2018.
[262] D. W. Otter, J. R. Medina, and J. K. Kalita, "A survey of the usages of deep learning in natural language processing," CoRR, vol. abs/1807.10854, 2018.
[263] W. Khan, A. Daud, J. Nasir, and T. Amjad, "A survey on the state-of-the-art machine learning models in the context of NLP," vol. 43, pp. 95–113, Oct. 2016.
[264] S. Fahad and A. Yahya, "Inflectional review of deep learning on natural language processing," July 2018.
[265] H. Li, "Deep learning for natural language processing: Advantages and challenges," National Science Review, vol. 5, no. 1, pp. 24–26, Sep. 2017. [Online]. Available: https://doi.org/10.1093/nsr/nwx110
[266] Y. Xie, L. Le, Y. Zhou, and V. V. Raghavan, "Deep learning for natural language processing," 2018.
[267] S. Zhang, L. Yao, and A. Sun, "Deep learning based recommender system: A survey and new perspectives," ACM Comput. Surv., vol. 52, pp. 5:1–5:38, 2019.
[268] R. Mu, "A survey of recommender systems based on deep learning," IEEE Access, vol. 6, pp. 69009–69022, 2018.
[269] Z. Batmaz, A. Yurekli, A. Bilge, and C. Kaleli, "A review on deep learning for recommender systems: Challenges and remedies," Artificial Intelligence Review, Aug. 2018. [Online]. Available: https://doi.org/10.1007/s10462-018-9654-y
[270] B. T. Betru, C. A. Onana, and B. Batchakui, "Deep learning methods on recommender system: A survey of state-of-the-art," 2017.
[271] R. Fakhfakh, A. Ben Ammar, and C. Ben Amar, "Deep learning-based recommendation: Current issues and challenges," International Journal of Advanced Computer Science and Applications, vol. 8, Jan. 2017.
[272] L. Zheng, "A survey and critique of deep learning on recommender systems," 2016.
[273] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, "Efficient machine learning for big data: A review," Big Data Research, vol. 2, no. 3, pp. 87–93, 2015, Big Data, Analytics, and High-Performance Computing. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271
[274] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, "A survey on deep learning for big data," Information Fusion, vol. 42, pp. 146–157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253517305328
[275] Y. Roh, G. Heo, and S. E. Whang, "A survey on data collection for machine learning: A big data - AI integration perspective," CoRR, vol. abs/1811.03402, 2018. [Online]. Available: http://arxiv.org/abs/1811.03402
[276] J. Qiu, Q. Wu, G. Ding, Y. Xu, and S. Feng, "A survey of machine learning for big data processing," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 67, May 2016. [Online]. Available: https://doi.org/10.1186/s13634-016-0355-x
[277] B. Jan, H. Farman, M. Khan, M. Imran, I. U. Islam, A. Ahmad, S. Ali, and G. Jeon, "Deep learning in big data analytics: A comparative study," Computers & Electrical Engineering, vol. 75, pp. 275–287, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0045790617315835
[278] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," Journal of Big Data, vol. 2, no. 1, p. 1, Feb. 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7