Deep Learning
Chapter 12: Deep Learning

© Springer Nature Switzerland AG 2020. A. V. Joshi, Machine Learning and Artificial Intelligence, https://doi.org/10.1007/978-3-030-26622-6_12

12.1 Introduction

In his famous book, The Nature of Statistical Learning Theory [37], Vladimir Vapnik stated (in 1995) the four historical periods in the area of learning theory as:

1. Constructing the first learning machine
2. Constructing fundamentals of the theory
3. Constructing the neural network
4. Constructing alternatives to the neural network

The very existence of the fourth period marked a failure of neural networks. Vapnik and others came up with various alternatives to the neural network that were far more technologically feasible and efficient at solving the problems at hand at the time, and neural networks slowly became history. However, with modern technological advances, neural networks are back and we are in the fifth historical period: "rebirth of the neural network." With this rebirth, neural networks have also been renamed deep neural networks, or simply deep networks, and the learning process is called deep learning. However, deep learning is probably one of the most widely used and misused terms.

Since their inception, there were not sufficient success stories to back up the over-hyped concept of neural networks. As a result, they got pushed back in preference to more targeted ML technologies that showed instant success, e.g., evolutionary algorithms, dynamic programming, support vector machines, etc. The concept of neural networks, or artificial neural networks (ANNs), was simple: to replicate the processing methodology of the human brain, which is composed of a connected network of neurons. The ANN was proposed as a generic solution to any type of learning problem, just as the human brain is capable of learning to solve any problem. The only blocker was that, at the time, processor architectures and manufacturing processes were not sufficiently mature to tackle the gargantuan processing required to realistically simulate the human brain. The average human brain has roughly 100 billion neurons and over 1000 trillion synaptic connections [1]! This would compare to a processor with a trillion calculations per second of processing power, backed by over 1000 terabytes of hard drive. Even in 2018, this configuration for a single computer is far-fetched.¹

The generic learning of ANNs can only converge to sufficient accuracy if it is trained with sufficient data and a corresponding level of computation. This was simply beyond the scope of the technology of the 1990s, and as a result the full potential of the technology could not be realized. With the advent of graphics-processor-based computation (called general-purpose computing on graphics processing units, or GPGPU), made popular by NVIDIA in the form of the CUDA library, the technology started coming close to realizing the potential of ANNs. However, due to the stigma associated with the original ANNs, the newly introduced neural networks were called deep neural networks (DNNs) and the ML process was called deep learning. In essence, there is no fundamental difference between an ANN of the 1990s and a deep network of the twenty-first century. However, there are some additional differences that are typically assumed when one talks about deep networks. The original ANNs were primarily used for solving problems with a very small amount of data and a very narrow scope of application, and consisted of a handful of nodes arranged in 1–3 layers at best. Given the hardware limitations of the time, this was all that could be achieved. Deep networks typically consist of hundreds to thousands of nodes per layer, and the number of layers can easily exceed 10. With this added order-of-magnitude complexity, the optimization algorithms have also drastically changed.
One of the fundamental changes in the optimization algorithms is the heavier use of parallel computation. As GPGPUs present many hundreds or even thousands of cores that can be used in parallel, the only way to improve computation performance is to parallelize the training optimization of the deep networks [2]. The deep learning framework is not limited to supervised learning; some unsupervised problems can also be tackled with this technique.

¹ The fastest supercomputer in the world is rated at 200 petaflops, or 200,000 trillion calculations per second, backed by 250 petabytes (250,000 terabytes) of storage. It thus does significantly better than a single human brain, but at the cost of consuming 13 MW of power and occupying an entire floor of a huge office building [18].

12.2 Origin of Modern Deep Learning

Although the current state of deep learning is made possible by the advances in computation hardware, a lot of effort has been spent on the optimization algorithms that are needed to successfully use these resources and converge to the optimal solution. It is hard to name one person who invented deep learning, but there are a few names that are certainly at the top of the list. The seminal paper by Geoffrey Hinton titled A fast learning algorithm for deep belief nets [54] truly ushered in the era of deep learning. Some of the early applications of deep learning were in the fields of acoustic and speech modeling and image recognition. Reference [53] concisely summarizes the steps in the evolution of deep learning as we see it today. Typically, when one refers to deep learning or deep networks, they belong either to convolutional networks or to recurrent networks, or their variations. Surprisingly enough, both of these concepts were invented before deep learning itself: Fukushima introduced convolutional networks back in 1980 [55], while Michael Jordan introduced recurrent neural networks in 1986 [56].
In the following sections we will discuss the architectures of convolutional and recurrent deep neural networks.

12.3 Convolutional Neural Networks (CNNs)

Convolutional neural networks, or CNNs, are based on the convolution operation. This operation has its roots in signal processing, and before going into the details of CNNs, let us look at the convolution process itself.

12.3.1 1D Convolution

Mathematically, convolution defines a process by which one real-valued function operates on another real-valued function to produce a new real-valued function. Let one real-valued continuous function be f(t) and the other be g(t), where t denotes continuous time, and let the convolution of the two be denoted as s(t). The convolution operation is typically denoted by *:

    s(t) = (f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau    (12.1)

Convolution is a commutative operation, meaning (f * g) is the same as (g * f):

    (f * g)(t) = (g * f)(t) = \int_{-\infty}^{\infty} g(\tau) f(t - \tau) d\tau    (12.2)

The same equations can also be written for the discrete processes that we typically deal with in machine learning applications. The discrete counterparts of the two equations can be written in terms of f(k) and g(k), where k denotes a discrete instance of time:

    (f * g)(k) = \sum_{\delta=-\infty}^{\infty} f(\delta) g(k - \delta)    (12.3)

and

    (f * g)(k) = (g * f)(k) = \sum_{\delta=-\infty}^{\infty} g(\delta) f(k - \delta)    (12.4)

These definitions are very similar to the definition of correlation between two functions. However, the key difference is that the sign of δ (in the discrete case) and of τ (in the continuous case) is opposite between f and g. If the sign were flipped, the same equations would represent the correlation operation. The sign reversal makes one function reversed in time before being multiplied (point-wise) with the other function. This time reversal has far-reaching impact, and the result of convolution is completely different from the result of correlation.
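To make the distinction between convolution and correlation concrete, here is a short Python sketch (using NumPy, with small made-up sequences) that evaluates the discrete convolution of Eq. 12.3 directly and checks it against NumPy's built-in routines. Note how correlation is recovered simply by reversing one sequence in time before convolving.

```python
import numpy as np

def conv1d(f, g):
    """Direct evaluation of Eq. 12.3: (f * g)(k) = sum_d f(d) g(k - d),
    restricted to the finite support of f and g."""
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    for k in range(n):
        for d in range(len(f)):
            if 0 <= k - d < len(g):
                out[k] += f[d] * g[k - d]
    return out

# Toy sequences (illustrative values only).
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

# The direct sum matches NumPy's library convolution.
print(conv1d(f, g))
print(np.convolve(f, g))

# Correlation flips the sign of the shift: it equals convolution
# with one of the sequences reversed in time.
print(np.correlate(f, g, mode="full"))
print(np.convolve(f, g[::-1]))
```

The last two lines print identical arrays, which is exactly the "sign reversal" discussed above: remove the time reversal and convolution turns into correlation.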
The convolution also has very interesting properties from the perspective of the Fourier transform [8] in the frequency domain, and these are heavily used in signal processing applications.

12.3.2 2D Convolution

The above equations represent 1D convolution. This is typically used in speech applications, where the signal is one-dimensional. The concept of convolution can be applied in two dimensions as well, which makes it suitable for applications on images. In order to define 2D convolution, let us consider two images A and B. The 2D convolution can now be defined as

    (A * B)(i, j) = \sum_{m} \sum_{n} A(m, n) B(i - m, j - n)    (12.5)

Typically the second image B is a small 2D kernel compared to the first image A. The convolution operation transforms the original image into another one, enhancing the image in some desired way.

12.3.3 Architecture of CNN

Figure 12.1 shows the architecture of a CNN. The building block of a CNN is composed of three units:

1. Convolution layer
2. Rectified linear unit, also called ReLU
3. Pooling layer

Fig. 12.1 Architecture of a convolutional neural network. The building block of the network contains the convolution and ReLU followed by the pooling layer. A CNN can have multiple such blocks in series, followed by a fully connected layer that generates the output.

12.3.3.1 Convolution Layer

The convolution layer consists of a series of 2D kernels. Each of these kernels is applied to the original image using the 2D convolution defined in Eq. 12.5. Stacking the outputs of all the kernels generates a 3D output.

12.3.3.2 Rectified Linear Unit (ReLU)

The rectified linear unit, or ReLU, as the name suggests, rectifies the output of the convolution layer by converting all negative values to 0. The function is defined as

    f(x) = max(0, x)    (12.6)

Fig. 12.2 Plot of the rectified linear unit (ReLU)

The ReLU function is shown in Fig. 12.2. This layer also introduces nonlinearity into the network, which is otherwise linear.
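As an illustration of a single convolution-plus-ReLU stage of the building block described above, the following Python sketch (NumPy only; the 4×4 "image" and 2×2 kernel are made-up toy values, not from the text) applies the 2D convolution of Eq. 12.5 over the valid region and then the ReLU of Eq. 12.6:

```python
import numpy as np

def conv2d(A, B):
    """'Valid'-region 2D convolution per Eq. 12.5:
    (A * B)(i, j) = sum_m sum_n A(m, n) B(i - m, j - n).
    Equivalently: flip the kernel in both axes, then slide it over A."""
    kh, kw = B.shape
    Bf = B[::-1, ::-1]  # the flip is what distinguishes convolution from correlation
    oh = A.shape[0] - kh + 1
    ow = A.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(A[i:i + kh, j:j + kw] * Bf)
    return out

def relu(x):
    """Eq. 12.6: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

# Toy image and kernel (illustrative values only).
A = np.array([[1., 2., 0., 1.],
              [0., 1., 3., 1.],
              [2., 0., 1., 0.],
              [1., 1., 0., 2.]])
B = np.array([[ 1., -1.],
              [-1.,  1.]])

features = conv2d(A, B)      # one feature map produced by the convolution layer
activated = relu(features)   # negative responses clipped to zero
print(activated)
```

In a real CNN, each kernel in the convolution layer produces one such feature map, the maps are stacked into a 3D output, and a pooling layer then reduces their spatial size.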