ECE598: Representation Learning: Algorithms and Models Fall 2017

Lecture 9: Neural Networks: Architecture of Neural Networks Lecturer: Prof. Pramod Viswanath Scribe: Moitreya Chatterjee, Sep 26, 2017

Sitting on your shoulders is the most complicated object in the known universe - Michio Kaku.

9.1 Discriminative Models

In the context of probability and statistics, Discriminative Models are approaches where the dependence of the unobserved variables conditioned on the observed ones is modeled directly. Note: This is in contrast to Generative Models, which are a class of approaches that model all random variables associated with a phenomenon, both those that can be observed as well as the unobserved ones [Bis06]. The contrast is exemplified by the following classification task. Let's assume that we are given data X, i.e. the observed variable, and we want to determine its class label Y, the unobserved (target) variable. A discriminative classifier, such as Logistic Regression, would directly model the posterior class probabilities, i.e. P(Y|X), to perform this inference. Alternatively, some other discriminative models map out a deterministic relation of Y as a function of X, Artificial Neural Networks being a prime example. A generative classifier, such as Naive Bayes, on the other hand, makes use of the joint distribution of X and Y, i.e. P(X, Y), to perform this inference. Kaku's quote cited above embodies the power of discriminative models elegantly, since the human brain houses a complicated network of neurons that are known to essentially perform discriminative reasoning, almost effortlessly, all the time.
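To make the contrast concrete for binary classification, the following is a small worked sketch using the two running examples above (Logistic Regression as the discriminative model and Naive Bayes as the generative one); the notation is ours, not the original lecture's:

% Discriminative: model the posterior P(Y|X) directly, e.g. Logistic Regression
P(Y = 1 \mid \vec{x}) = \sigma(\vec{w}^{T}\vec{x} + w_0), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Generative: model the joint P(X, Y) and recover the posterior via Bayes' rule;
% Naive Bayes additionally factorizes the class-conditional likelihood across features
P(Y \mid \vec{x}) = \frac{P(\vec{x} \mid Y)\, P(Y)}{\sum_{y'} P(\vec{x} \mid y')\, P(y')}, \qquad P(\vec{x} \mid Y) = \prod_{j} P(x_j \mid Y)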

9.2 The Perceptron

The history of Artificial Neural Networks commences with the seminal work of Rosenblatt in 1958 [Ros58]. In this work, he first proposed a structure and described the functioning of an artificial neuron, inspired by its counterparts found in the brains of mammals. Figure 9.1 shows a representative schematic diagram of a Perceptron, as conceived by Rosenblatt.

9.2.1 Description of the Architecture

The Perceptron consists of an $n$-dimensional input $\vec{x} \in \mathbb{R}^n$, amplified by a corresponding input weight vector $\vec{w} \in \mathbb{R}^n$, a scalar output variable $y \in \mathbb{R}$, and an input bias of the cell, $w_0 \in \mathbb{R}$. The final output of the cell is related to the input by the following:

$y = \sigma(\vec{w}^{T}\vec{x} + w_0),$

where $\sigma$ is an Activation Function and is typically non-linear. In the context of Figure 9.1, the step function has been chosen as the activation function.

Figure 9.1: A schematic representation of a Perceptron.

Note: It is interesting to note here that the weighted combination of inputs, as represented in a Perceptron, is simply a linear model. The only opportunity for making this model sufficiently complex is by injecting non-linearity via the choice of an appropriate activation function.
Note: Further, while modeling a process with neural networks, we are not restricted to using just one perceptron; we can use a stack of them in different configurations as well. This setting is what is typically known as a Neural Network.
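As a quick illustration, the following is a minimal NumPy sketch of a single perceptron forward pass with a step activation; the weight and input values are made up for the example:

import numpy as np

def step(z):
    # Step activation: outputs 1 if the weighted input is non-negative, else 0.
    return np.where(z >= 0, 1.0, 0.0)

def perceptron(x, w, w0):
    # y = step(w^T x + w0)
    return step(np.dot(w, x) + w0)

# Example with made-up values: a 3-dimensional input.
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.2])
w0 = 0.1
print(perceptron(x, w, w0))  # prints 0.0, since w.x + w0 = -0.1 < 0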

9.2.2 Activation Functions

Activation Functions are functions which map the weighted sum of inputs to a neuron to its output. The idea behind introducing them in the Perceptron was to have a function that behaved much like the firing behavior of a biological neuron cell [Ros58]. Activation Functions typically allow for the introduction of non-linearity in neural networks. The following are some of the desirable properties of activation functions [GBC16]:

• Non-Linearity: When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator [Cyb89]. On the other hand, linear activation functions result in the neural network essentially becoming a linear model, which often winds up being suboptimal for various tasks.

• Continuous Differentiability: This property is necessary for enabling gradient-based optimization methods, especially during training.

• Range of the Function: The range of the activation function plays a crucial role in efficient training of the neural network. A bounded range typically results in training that converges quickly, while unbounded ranges make the task of weight tuning harder.

• Monotonicity: When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex. This makes the task of obtaining optimal neuron weights significantly easier.

• Approximates the Identity Near the Origin: Such activation functions allow the neural network to efficiently learn when its weights are initialized with small random values.

Choice of Activation Functions

A wide assortment of activation functions has been explored by the neural networks community so far.

Figure 9.2: A table showing some common activation functions, their gradient with respect to the input, and their range [ali].

Figure 9.2 shows a list of some popular activation functions, their response plots, the equations which govern their response, their corresponding gradients with respect to their inputs, and the range of their outputs. While not all of these functions possess the “desirable” properties listed above, they still tend to behave quite robustly during training. For instance, the Rectified Linear Unit (ReLU) activation is not fully differentiable, but it is still compatible with gradient-based training methods because it is non-differentiable at only one point in the input space. Further, it is also noteworthy that the gradients of many of these functions can be expressed in terms of the output of the original function. This is true, for example, for the Logistic and the tan-hyperbolic activations. This makes the task of computing the derivatives much easier.
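As a small sketch of this last point (assuming NumPy), the derivatives of the logistic and tanh activations can be computed directly from their own outputs:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_from_output(s):
    # If s = logistic(z), then d/dz logistic(z) = s * (1 - s).
    return s * (1.0 - s)

def tanh_grad_from_output(t):
    # If t = tanh(z), then d/dz tanh(z) = 1 - t^2.
    return 1.0 - t ** 2

z = np.array([-2.0, 0.0, 2.0])
s, t = logistic(z), np.tanh(z)
print(logistic_grad_from_output(s))  # same values as logistic(z) * (1 - logistic(z))
print(tanh_grad_from_output(t))      # same values as 1 - tanh(z)**2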

9.2.3 Artificial Neural Networks

Figure 9.3: A schematic representation of an Artificial Neural Network.

So far in our discussion, we have only looked at a single perceptron cell. However, in a landmark work by Fukushima et al. [FMI83], the idea of stacking up such cells to build a “neural network” was proposed. Such networks are capable of approximating more complicated functions, such as those arising in tasks like handwritten numeral recognition. Figure 9.3 shows a representative example of an artificial neural network.

In his pioneering work, Cybenko observed that with a non-linear activation function, a two-layer neural network can be proven to be a universal function approximator [Cyb89]. This possibly helps explain, at least partially, their ubiquity in present-day applications, while at the same time giving us a sense of why they generalize so well.

9.2.4 Deep Neural Networks

Given their capacity to approximate extremely complex functions, the neural network community, in order to apply neural nets to different applications, started designing neural network architectures that are several layers deep. Such networks are, intuitively, called Deep Neural Networks. Going back to our definition of parametric discriminative models, we can represent a deep neural network as a Composite Functional Mapping from the input, say $\vec{x}$, to the output, say $y$. This is given by:

$y = f_1(\vec{w}_1, f_2(\vec{w}_2, \ldots, f_n(\vec{w}_n, \vec{x})))$

Over the years, with increasing insight about such networks, the community has calibrated different forms of these functions, the $f_i$'s.
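The following is a minimal PyTorch sketch of such a composite mapping, where each $f_i$ is taken to be an affine map followed by a non-linearity; the layer sizes are made up for illustration:

import torch
import torch.nn as nn

# Each stage f_i(w_i, .) is a linear map followed by a non-linear activation;
# stacking them yields the composite mapping y = f_1(w_1, f_2(w_2, ..., f_n(w_n, x))).
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # f_n: innermost stage, applied to x first
    nn.Linear(32, 32), nn.ReLU(),   # intermediate stage
    nn.Linear(32, 1),               # f_1: outermost stage, produces y
)

x = torch.randn(4, 16)  # a batch of 4 made-up 16-dimensional inputs
y = model(x)
print(y.shape)          # torch.Size([4, 1])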

Depending on whether the network accepts fixed-length inputs or variable-length ones, we pick the functions appropriately. In the following section, we will look at the fixed-length setting, in the context of image inputs, while later on we will delve into variable-length inputs, such as text. Yet other forms of Neural Network architectures exist, and the interested reader is referred to [zoo].

9.3 Convolutional Neural Networks

In the late 1980s, Convolutional Neural Networks (CNNs) were introduced for the task of image classification [LBD+90], specifically in the context of recognizing hand-written digits. Such networks have become immensely popular in recent times, owing to their high precision in recognizing image content. Modern-day CNNs tend to be deeper and typically take the form of the network shown in Figure 9.4.

Figure 9.4: A block diagram representing the AlexNet Network.

Such networks typically have the following components:

• Convolution Layers: One key property of images is that they are Shift Invariant Signals, i.e. an image of a dog should be categorized as a dog no matter where the dog is positioned in the image. This property motivated the design of convolution layers, where a weight matrix, typically much smaller than the image, is convolved with the image and its response is noted. Mathematically, these layers perform a set of convolution operations, represented by: $a^l = a^{l-1} * W^l + b^l$, where $a^i$ denotes the activation from layer $i$, $W^i$ and $b^i$ denote the convolution weight kernel and the bias of that kernel at layer $i$, and $*$ indicates convolution.
Note: The above convolution may also be represented as a simple matrix product, but then $W$ must be a Toeplitz Matrix, so that shift invariance is ensured.

• Maxpool Layers: These layers serve the purpose of downsampling the response map. A typical Maxpool operation selects only the maximum response in a window of size, say, 3 × 3. The motivation behind performing a maxpool is that only the maximum response is retained, while the response everywhere else is treated as local variation in the data, i.e. it is not considered key to the final decision making.

• Fully Connected Layers: Typically, after several stages of Convolution followed by Maxpool, one ends up using a set of Fully Connected Layers in a CNN. These layers resemble the familiar densely connected Neural Networks shown in Figure 9.3. They perform an operation that may be represented as a simple matrix multiply, like the one shown below: $a^l = a^{l-1} W^l$, where the notation carries forward its meaning from before.

• Softmax Layers: These layers convert the final output of the CNN into probability scores, indicating how likely it is that the sample belongs to a certain class. This is given by:

$P(y_l) = \frac{\exp(h_l)}{\sum_{l'} \exp(h_{l'})},$

where $h_i$ is the response of the $i$-th neuron in the final layer. Typically, the $h_i$'s are obtained via a weighted linear combination of the responses from the previous layer. This essentially amounts to performing multi-class logistic regression, which is a linear classifier. Therefore, to increase the classification capacity of the network, some recent work considers replacing this with non-parametric classifiers, such as k-Nearest Neighbors [knn]. Nonetheless, softmax remains popular as a simple classifier that permits gradient-based training.

Once the softmax probabilities are obtained, the image is typically assigned the class with the highest probability score.
Note: Each of the above listed operations can be thought of as a function mapping the input to the output class labels.
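To make these building blocks concrete, the following is a minimal PyTorch sketch of a toy CNN; the layer sizes and the 32 × 32 input resolution are made-up placeholders, much smaller than AlexNet's:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)    # convolution layer
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)   # convolution layer
        self.pool = nn.MaxPool2d(2)                                # maxpool layer
        self.fc = nn.Linear(16 * 8 * 8, num_classes)               # fully connected layer

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # conv -> ReLU -> maxpool
        x = self.pool(F.relu(self.conv2(x)))   # conv -> ReLU -> maxpool
        x = x.flatten(1)
        h = self.fc(x)                          # class scores h_l
        return F.softmax(h, dim=1)              # softmax probabilities P(y_l)

x = torch.randn(2, 3, 32, 32)          # a batch of 2 made-up 32x32 RGB images
probs = TinyCNN()(x)
print(probs.shape, probs.sum(dim=1))   # probabilities sum to 1 per image

In practice one would usually return the raw scores $h_l$ and fold the softmax into the cross-entropy loss, but the explicit softmax above mirrors the layer-by-layer description.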

Figure 9.5: A class definition of AlexNet in PyTorch.

Figure 9.6 shows an execution of the AlexNet architecture [KSH12], represented in Figure 9.4, on the Validation Set of the ImageNet dataset, while Figure 9.5 shows the definition of the CNN in the popular library PyTorch.

Figure 9.6: An execution of AlexNet in PyTorch.

9.4 Types of CNN Architectures

The quest for models with improved performance on image classification tasks has driven the community to design newer CNN architectures beyond AlexNet. Here, we list some of the other popular, successful architectures for image classification and briefly discuss their characteristics.

• VGG: VGG is a suite of popular CNN architectures, proposed by the Visual Geometry Group at Oxford University and originally designed for classifying ImageNet images [SZ14]. This class of CNNs largely adopts the AlexNet design, except that these networks are deeper: the VGG architectures are either 11, 13, 16, or 19 layers deep. The 19-layer architecture had a top-5 misclassification rate of 8.0%.

Figure 9.7: A schematic representation of the architecture of GoogleNet.

• GoogleNet: GoogleNet is another successful CNN architecture, proposed by Google for classifying ImageNet images [SLJ+15]. This CNN had 22 layers. However, one of the interesting insights from their work was that training deeper networks isn't always trivial, and one often encounters challenges related to vanishing gradients. They thus introduced Softmax layers at intermediate levels, so that more supervisory signal could be injected. Figure 9.7 shows a schematic representation of the architecture of GoogleNet. The other revelation of this work was that performing 1 × 1 convolutions within the “Inception” module structure helped introduce greater non-linearity. Figure 9.8 shows a schematic diagram of an Inception module. This architecture had a top-5 misclassification rate of about 7.0%.

Figure 9.8: A schematic representation of the Inception Module in GoogleNet.
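A minimal PyTorch sketch of an Inception-style block is given below; the branch widths are made-up placeholders, not the ones used in the actual GoogleNet:

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Four parallel branches whose outputs are concatenated along the channel axis;
    # the 1x1 convolutions reduce channels and add extra non-linearity cheaply.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 24, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 4, 1), nn.ReLU(),
                                nn.Conv2d(4, 8, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 8, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 32, 28, 28)        # a made-up 32-channel feature map
print(InceptionBlock(32)(x).shape)    # torch.Size([1, 56, 28, 28]); 16+24+8+8 output channels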

Figure 9.9: A schematic representation of the Skip Connections in ResNet.

• ResNet: ResNet, short for Residual Network, is one of the most successful CNN architectures, proposed by Microsoft Research for classifying ImageNet images [HZRS16]. This CNN had 152 layers. The authors of this work, too, emphasized that training deeper networks isn't always trivial, and that challenges related to vanishing gradients and overfitting are common. They thus introduced Skip Connections in the intermediate layers, so that the signal can potentially skip a layer and move over to a subsequent layer, if necessary. The same holds for the gradient during back-propagation. Figure 9.9 shows a schematic representation of the skip connections that are pervasive throughout the network. With this architecture, the top-5 misclassification rate on ImageNet went down to about 3.5%, which exceeds human performance as well.
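The following is a minimal, simplified PyTorch sketch of a residual block with a skip connection; it omits the batch normalization and downsampling details used in the actual ResNet:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        # Skip connection: add the input back, so the block only needs to learn a residual;
        # during back-propagation the gradient also flows directly through this addition.
        return F.relu(out + x)

x = torch.randn(1, 16, 8, 8)         # a made-up 16-channel feature map
print(ResidualBlock(16)(x).shape)    # torch.Size([1, 16, 8, 8])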

Note: To obtain increasingly better performance, one typically needs to design deeper architectures.

9.5 Applications of CNNs

CNNs have been applied to a wide variety of tasks, especially those in computer vision, such as detecting objects in an image [RDGF16], matching images [AJM15], and style transfer between images [GEB16], but also in natural language processing, such as for modeling sentence pairs [YSXZ15].

9.6 Recurrent Neural Networks

While CNNs can effectively model fixed-length inputs, variable-length inputs, such as those arising in text or in temporal data, require different deep neural network architectures. A popular class of deep neural network architectures that has been successful at modeling these kinds of inputs is known as Recurrent Neural Networks (RNNs). These architectures were inspired by Hopfield Networks [Hop82]. Principally, there are three main varieties of RNNs, viz.

• Vanilla Recurrent Neural Networks (Vanilla RNNs)
• Long Short-Term Memory Networks (LSTMs)
• Gated Recurrent Units (GRUs)

We will look into all three of them in some detail.

9.7 Vanilla RNN

A Vanilla Recurrent Neural Network (Vanilla RNN) is a neural network that consists of a pair of input and output units and a hidden state. For modeling data, these units are essentially replicated over time. The output at any given time-step depends on the current input and the previous hidden state. This hidden state essentially acts as the memory of the network. Figure 9.10 shows the architecture of an RNN cell. Vanilla RNNs can be thought of as duplications of the RNN cell over time, as shown in Figure 9.11, where the cells share their weights across time-steps.

Figure 9.10: A schematic representation of the Vanilla RNN Cell.

Figure 9.11: A schematic representation of the Vanilla RNN Cell, unrolled in time.

The equations that govern the forward pass through a Vanilla RNN cell are given by:

$\vec{h}_{t+1} = \sigma(A\vec{h}_t + B\vec{x}_{t+1} + b);$

$\vec{y}_t = \sigma(C\vec{h}_t + D\vec{x}_t + \tilde{b}),$

where $A, B, C, D$ are the associated weight matrices, $b$ and $\tilde{b}$ are neuron biases, and $\sigma$ indicates the non-linear activation function. During training, the parameters of the RNN are updated via Back-Propagation Through Time (BPTT), typically using Stochastic Gradient Descent [Bot10].
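The following is a small NumPy sketch of the forward pass through this cell, unrolled over a sequence; the dimensions and the choice of tanh for both activations are assumptions made for the example:

import numpy as np

def rnn_forward(xs, A, B, C, D, b, b_tilde):
    # xs: list of input vectors x_1, ..., x_T; returns the outputs y_1, ..., y_T.
    h = np.zeros(A.shape[0])   # initial hidden state h_0
    ys = []
    for x in xs:
        h = np.tanh(A @ h + B @ x + b)                # h_t = sigma(A h_{t-1} + B x_t + b)
        ys.append(np.tanh(C @ h + D @ x + b_tilde))   # y_t = sigma(C h_t + D x_t + b~)
    return ys

# Made-up sizes: 3-dimensional inputs, 5-dimensional hidden state, 2-dimensional outputs.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 3))
C, D = rng.normal(size=(2, 5)), rng.normal(size=(2, 3))
b, b_tilde = np.zeros(5), np.zeros(2)
xs = [rng.normal(size=3) for _ in range(4)]           # a length-4 input sequence
print(len(rnn_forward(xs, A, B, C, D, b, b_tilde)))   # 4 outputs, one per time-step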

9.7.1 Drawbacks of Vanilla RNNs

Despite their simplicity, Vanilla RNNs suffer from several drawbacks. Three of the principal shortcomings are:

• First, chaining multiple Vanilla RNN cells leads to the problem of gradients vanishing to zero during training.

• Second, they essentially tend to retain cell states from the recent past only.

• Finally, the weighting between the current input and the previous hidden state remains constant over time in a Vanilla RNN cell and is not adaptive.

9.8 Long Short-Term Memory Networks (LSTMs)

Long Short-Term Memory Networks were introduced to address the above-mentioned complications typical of Vanilla RNNs [GSC99, LBE15].

Figure 9.12: A schematic representation of the LSTM Cell.

The key idea in an LSTM cell is to allow the cell state to bypass the influence of the input at a certain time-step, if necessary. This objective is accomplished by introducing three gates in the cell, viz:

• Forget Gate: Controls what fraction of the previous cell state should be retained.

• Input Gate: Controls what fraction of the current input should be used in the forward pass.

• Output Gate: Controls what fraction of the output should be exposed beyond the cell.

These different gates allow for “skip-connection”-like edges in the RNN cell. The following equations, which characterize the behavior of an LSTM cell, bring this out clearly (symbols have their usual meaning):

$i^t = \sigma_i(W_{ix} x^t + W_{ih} h^{t-1} + w_{bi}),$
$f^t = \sigma_f(W_{fx} x^t + W_{fh} h^{t-1} + w_{bf}),$
$o^t = \sigma_o(W_{ox} x^t + W_{oh} h^{t-1} + w_{bo}),$
$\tilde{c}^t = \sigma_c(W_{cx} x^t + W_{ch} h^{t-1} + w_{bc}),$
$c^t = f^t \circ c^{t-1} + i^t \circ \tilde{c}^t,$
$h^t = o^t \circ \sigma_h(c^t).$

The first three equations control the input, forget, and output gates, the fourth and fifth ones control the cell state at the next time-step, and the last one produces the hidden state exposed as the output.
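A minimal NumPy sketch of one LSTM step, following the equations above, is given below; the weight shapes are made up, and the standard sigmoid/tanh choices for the gate and cell activations are assumed:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W is a dict of weight matrices and b a dict of bias vectors, keyed by gate.
    i = sigmoid(W["ix"] @ x + W["ih"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["fx"] @ x + W["fh"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["ox"] @ x + W["oh"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["cx"] @ x + W["ch"] @ h_prev + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde   # new cell state (elementwise, i.e. Hadamard, products)
    h = o * np.tanh(c)             # new hidden state exposed outside the cell
    return h, c

# Made-up sizes: 3-dimensional input, 4-dimensional hidden and cell states.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) if k.endswith("x") else rng.normal(size=(4, 4))
     for k in ["ix", "ih", "fx", "fh", "ox", "oh", "cx", "ch"]}
b = {k: np.zeros(4) for k in ["i", "f", "o", "c"]}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
print(h.shape, c.shape)  # (4,) (4,)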

9.8.1 Drawbacks of the LSTM

Despite the power of the LSTM cell, it has a large number of parameters and is thus often susceptible to over-fitting. This makes LSTM units harder to train.

9.9 Gated Recurrent Units (GRUs)

In order to address the challenges arising from the difficulty of training LSTM units, Gated Recurrent Units (GRUs) were introduced [CGCB14].

Figure 9.13: A schematic representation of the GRU Cell.

Figure 9.13 shows a representative figure of the GRU cell. These networks essentially operate just like an LSTM cell. However, as shown in the figure, here we have two gates instead of three, and thus fewer parameters, viz. the Update Gate and the Reset Gate. The following equations characterize the working of the two gates of the GRU cell (where the symbols have their usual meaning):

$z^t = \sigma(U_z x^t + W_z h^{t-1} + b_z),$
$r^t = \sigma(U_r x^t + W_r h^{t-1} + b_r).$

The state updates are governed by the following equations:

$\tilde{h}^t = \sigma(U_h x^t + W_h (h^{t-1} \circ r^t) + b_h),$

$h^t = (h^{t-1} \circ z^t) + (\tilde{h}^t \circ (1 - z^t)).$

Given the fewer parameters of a GRU cell, it is easier to train and has thus been used extensively in recent research, for instance in sentiment classification [TQL15].

In this class, we thus explored an assortment of neural network architectures for modeling data in a discriminative fashion.
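As a closing illustration of this last family of models, the following is a small sketch of how a GRU could be used for a sequence classification task such as sentiment analysis, using PyTorch's built-in nn.GRU module; the vocabulary size, dimensions, and two-class setup are made-up placeholders:

import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # gated recurrent layer
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        x = self.embed(tokens)             # (batch, seq_len, embed_dim)
        _, h_last = self.gru(x)            # final hidden state, shape (1, batch, hidden_dim)
        return self.fc(h_last.squeeze(0))  # class scores, one vector per sequence

tokens = torch.randint(0, 1000, (4, 12))   # a batch of 4 made-up token sequences of length 12
print(GRUClassifier()(tokens).shape)       # torch.Size([4, 2])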

Bibliography

[AJM15] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.

[ali] Activation Function. https://en.wikipedia.org/wiki/Activation_function. [Online; accessed 2-October-2017].

[Bis06] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[Bot10] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[CGCB14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[FMI83] Kunihiko Fukushima, Sei Miyake, and Takayuki Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (5):826–834, 1983.

[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.

[GEB16] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[GSC99] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

[Hop82] John J Hopfield. Neural networks and physical systems with emergent collective com- putational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.

[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[knn] Neural Networks, Manifolds, and Topology. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/. [Online; accessed 4-October-2017].

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[LBD+90] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

[LBE15] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.

[RDGF16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[Ros58] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[SLJ+15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[TQL15] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.

[YSXZ15] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193, 2015.

[zoo] The Neural Network Zoo. http://www.asimovinstitute.org/neural-network-zoo/. [Online; accessed 4-October-2017].
