Artificial Intelligence

Katarzyna Mazur

Information

Contact: [email protected], [email protected] | +48 (81) 537-29-38 | Maria Curie-Sklodowska University, Akademicka 9 Street, Department of Informatics, room 412 (4th floor)

WWW: https://cybersecurity.umcs.lublin.pl/ai-wsei/

Assignment: ...

Course Content

I Introduction
I Multilayer Perceptrons
I Convolutional Neural Networks
I Recurrent Neural Networks
I ...

Introduction: what is AI?

I Artificial intelligence - or AI for short - is technology that enables a computer to think or act in a more 'human' way. It does this by taking in information from its surroundings, and deciding its response based on what it learns or senses

I What is the relationship between AI, machine learning, neural networks and deep learning?
I You can think of deep learning, machine learning and artificial intelligence as a set of Russian dolls nested within each other, beginning with the smallest and working out
I Deep learning is a subset of machine learning, and machine learning is a subset of AI, which is an umbrella term for any computer program that does something smart
I In other words, all machine learning is AI, but not all AI is machine learning, and so forth

I Let us start with the basics and describe the basic unit of an AI system, that is, neurons and neural networks
I The artificial neuron has a structure similar to the neurons of the human brain - so let us first focus on biological neurons, in order to understand how they operate

Introduction: biological neuron

I There are 4 parts of a typical nerve cell (neuron): dendrites, axon, synapses, and nucleus

I Dendrites accept the inputs, the nucleus processes the inputs, the axon turns the processed inputs into outputs, and the synapses are the electrochemical contacts between neurons

Introduction: artificial neuron (perceptron)

I An artificial neuron consists of:
I inputs (which correspond to synapses in a biological neuron),
I weights (which correspond to dendrites in a biological neuron),
I an activation function (or transfer function, which corresponds to the nucleus in a biological neuron),
I outputs (which correspond to the axon in a biological neuron)

Introduction: artificial neuron (perceptron)

I Inputs to the network are represented by x_i, where i ∈ {1, ..., n}
I Each of these inputs is multiplied by a connection weight w_i, where i ∈ {1, ..., n}; the weight determines the importance of the incoming value
I These products are simply summed, fed through the transfer function to generate a result, and then output
I Outputs of the network are represented by y_j, where j ∈ {1, ..., m}
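To make this computation concrete, here is a minimal sketch of a single perceptron forward pass in plain Python (no external libraries); the input values, weights and threshold are illustrative:

```python
# Minimal perceptron forward pass: weighted sum fed through a step activation.
# The inputs, weights and threshold below are illustrative values.

def perceptron(inputs, weights, threshold):
    # Weighted sum of the inputs: sum of x_i * w_i
    total = sum(x * w for x, w in zip(inputs, weights))
    # Step activation: fire (output 1) only if the sum exceeds the threshold
    return 1 if total > threshold else 0

print(perceptron([1, 0, 1], [0.5, 0.9, 0.2], 0.6))  # 0.7 > 0.6 -> prints 1
```

Introduction: artificial neuron (perceptron)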

I The activation function determines the value of the neuron’s output. The simplest form of activation function is a certain type of step function. It mimics the biological neuron firing upon reaching its firing threshold by outputting a 1 if the total input exceeds a given threshold quantity, and outputting a 0 otherwise

I However, for a more realistic result, one needs to use a non-linear activation function. One of the most commonly used is the sigmoid function: f(x) = 1 / (1 + e^(-x))

Introduction: biological neuron vs artificial neuron

Introduction: exercises

I Exercise 1: using the programming language Python, implement, without any additional modules and libraries, a simple single perceptron (tip: a perceptron is a single neuron) with the threshold activation function. Inputs, weights and threshold should be read from a text file.

I Exercise 2: using the programming language Python, implement, without any additional modules and libraries, a simple single-layer perceptron with the sigmoid activation function. Inputs, weights and threshold should be read from a text file.

Introduction: Artificial Neural Networks (ANN)

I To understand the single-layer perceptron, it is important to first understand Artificial Neural Networks (ANN)
I An artificial neural network is an information processing system whose mechanism is inspired by the functionality of biological neural circuits
I An artificial neural network possesses many processing units connected to each other

Introduction: Artificial Neural Networks (ANN)

An Artificial Neural Network (ANN):
I Is an information processing architecture loosely modelled on the brain
I Consists of a large number of interconnected processing units (neurons)
I Works in parallel to accomplish a global task
I Is generally used to model relationships between inputs and outputs or to find patterns in data

Introduction: Artificial Neural Networks (ANN)

An Artificial Neural Network (ANN) has 3 types of layers: the input layer, one or more hidden layers, and the output layer

Introduction: Artificial Neural Networks (ANN)

Building blocks of each of these layers are neurons. Neurons are single processing units, built as described earlier.

Introduction: Artificial Neural Networks (ANN)

I There are 2 types of artificial neural network architectures: feed-forward networks and recurrent networks
I Feed-forward networks can be single-layered or multi-layered
I Recurrent networks can be single-layered or multi-layered

An example of a single-layer ANN:

Introduction: Artificial Neural Networks (ANN)

An example of a multi-layer ANN:

Introduction: Artificial Neural Networks (ANN)

Feed-forward artificial neural networks have the following characteristics:
I Perceptrons are arranged in layers, with the first layer taking in inputs and the last layer producing outputs. The middle layers have no connection with the external world, and hence are called hidden layers.
I Each perceptron in one layer is connected to every perceptron in the next layer. Information is constantly "fed forward" from one layer to the next, which explains why these networks are called feed-forward networks.
I There is no connection among perceptrons in the same layer.

Introduction: Artificial Neural Networks (ANN)

Recurrent artificial neural networks have the following characteristics:
I Recurrent nets are a powerful set of artificial neural network algorithms especially useful for processing sequential data such as sound, time series (sensor) data or written natural language
I Recurrent nets differ from feed-forward nets because they include a feedback loop, whereby output from step n-1 is fed back to the net to affect the outcome of step n, and so forth for each subsequent step

Introduction: Artificial Neural Networks (ANN)

A perceptron is:
I An algorithm for supervised learning of binary classifiers
I A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class
I It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector

A neural network is really just a composition of perceptrons, connected in different ways and operating on different activation functions

Introduction: exercises

I Exercise 3: using the programming language Python, implement, without any additional modules and libraries, a simple multi-layer perceptron with the threshold activation function and one hidden layer. Inputs, weights and thresholds should be read from a text file.

I Exercise 4: using the programming language Python, implement, without any additional modules and libraries, a simple multi-layer perceptron with the sigmoid activation function and two hidden layers. Inputs, weights and thresholds should be read from a text file.

Introduction: Artificial Neural Networks (ANN)

Input 1: 0 0 1, weights for input 1: 1 2 3
Input 2: 1 0 1, weights for input 2: 4 5 6
Input 3: 1 1 1, weights for input 3: 7 8 9
Input 4: 0 0 0, weights for input 4: 1 2 3
Input 5: 1 1 0, weights for input 5: 4 5 6

Threshold for neuron 1 in the hidden layer: 3
Threshold for neuron 2 in the hidden layer: 5
Threshold for neuron 3 in the hidden layer: 6

Introduction: Artificial Neural Networks (ANN)

Input for neuron 1 in the hidden layer: 1*0 + 4*1 + 7*1 + 1*0 + 4*1 = 15. Output from neuron 1: 1, because 15 > 3

Introduction: Artificial Neural Networks (ANN)

Input for neuron 2 in the hidden layer: 2*0 + 5*0 + 8*1 + 2*0 + 5*1 = 13. Output from neuron 2: 1, because 13 > 5

I Exercise 5: using the programming language Python, implement, without any additional modules and libraries, the network from the image. The inputs, weights and threshold values should also be taken from the image.

Introduction: what is deep learning?

I Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example

I Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free speakers

I In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound

Introduction: what is deep learning?

I Models are trained by using a large set of labeled data and neural network architectures that contain many layers

I Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks

I The term “deep” usually refers to the number of hidden layers in the neural network. Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as many as 150

I Google uses machine learning in all of its products to improve the search engine, translation, image captioning and recommendations

What's the Difference Between Machine Learning and Deep Learning?

I Deep learning is a specialized form of machine learning

I A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that categorizes the objects in the image
I With a deep learning workflow, relevant features are automatically extracted from images. In addition, deep learning performs "end-to-end learning" - where a network is given raw data and a task to perform, such as classification, and it learns how to do this automatically

What is TensorFlow? Introduction

I Google's TensorFlow is the most famous deep learning library in the world
I A practitioner using TensorFlow can build any deep learning structure, like a CNN or a simple artificial neural network
I TensorFlow is a library developed by the Google Brain Team to accelerate machine learning and deep neural network research
I TensorFlow is used by academics, startups, and large companies. Google uses TensorFlow in almost all of its daily products, including Gmail, Google Photos and the Google Search Engine
I It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has wrappers in several languages like Python, C++ or Java

What is TensorFlow? Architecture

TensorFlow's architecture works in three parts:
I Preprocessing the data
I Building the model
I Training and estimating the model

What is TensorFlow? Architecture

I It is called TensorFlow because it takes input as a multi-dimensional array, also known as a tensor
I You can construct a sort of flowchart of operations (called a graph) that you want to perform on that input
I The input goes in at one end, flows through this system of multiple operations, and comes out the other end as output
I This is why it is called TensorFlow: the tensor goes in, flows through a list of operations, and then comes out the other side

Where can TensorFlow run?

TensorFlow's hardware and software requirements can be classified by phase:
I Development Phase: this is when you train the model. Training is usually done on your desktop or laptop
I Run Phase (or Inference Phase): once training is done, TensorFlow can be run on many different platforms: a desktop running Windows, macOS or Linux, the cloud as a web service, or mobile devices like iOS and Android
I You can train the model on multiple machines and then, once you have the trained model, run it on a different machine

Where can TensorFlow run?

I The model can be trained and used on GPUs as well as CPUs
I GPUs were initially designed for video games. In late 2010, Stanford researchers found that GPUs were also very good at matrix operations and algebra, which makes them very fast for these kinds of calculations
I Deep learning relies on a lot of matrix multiplication
I TensorFlow is very fast at computing matrix multiplications because it is written in C++
I Although it is implemented in C++, TensorFlow can be accessed and controlled by other languages, mainly Python

Where can TensorFlow run?

I We will be using https://colab.research.google.com/

Introduction to Components of TensorFlow: Tensor

I Tensorflow’s name is directly derived from its core framework: Tensor. In Tensorflow, all the computations involve tensors

I A tensor is a vector or matrix of n-dimensions that represents all types of data

I All values in a tensor hold identical data type with a known (or partially known) shape

I The shape of the data is the dimensionality of the matrix or array

Introduction to Components of TensorFlow: Tensor

I A tensor can originate from the input data or from the result of a computation
I In TensorFlow, all the operations are conducted inside a graph
I The graph is a set of computations that take place successively
I Each operation is called an op node; the nodes are connected to each other
I The graph outlines the ops and the connections between the nodes. However, it does not display the values. The edges between the nodes are tensors, i.e., a way to populate the operations with data

Introduction to Components of TensorFlow: Graphs

TensorFlow makes use of a graph framework. The graph gathers and describes all the series of computations done during training. The graph has several advantages:

I It was designed to run on multiple CPUs or GPUs and even mobile operating systems

I The portability of the graph allows you to preserve the computations for immediate or later use; the graph can be saved and executed in the future

Introduction to Components of TensorFlow: Graphs

I All the computations in the graph are done by connecting tensors together

I A tensor has a node and an edge. The node carries the mathematical operation and produces endpoint outputs. The edges explain the input/output relationships between nodes

Simple TensorFlow Example

I In the first line of code, we import tensorflow as tf. With Python, it is common practice to use a short name for a library, to avoid typing the full name each time we need it. For instance, we import tensorflow as tf and call tf when we want to use a TensorFlow function
I Let's practice the elementary workflow of TensorFlow with a simple example. Let's create a computational graph that multiplies two numbers together

Simple TensorFlow Example

I In this example, we will multiply X_1 and X_2 together. TensorFlow will create a node to connect the operation; in our example, it is called multiply. When the graph is determined, the TensorFlow computational engine will multiply X_1 and X_2 together
I Finally, we will run a TensorFlow session that will run the computational graph with the values of X_1 and X_2 and print the result of the multiplication

Simple TensorFlow Example

I Let's define the X_1 and X_2 input nodes. When we create a node in TensorFlow, we have to choose what kind of node to create. The X_1 and X_2 nodes will be placeholder nodes. The placeholder assigns a new value each time we make a calculation. We will create them as tf.placeholder nodes

Simple TensorFlow Example

I When we create a placeholder node, we have to pass in the data type. We will be adding numbers here, so we can use a floating-point data type: tf.float32. We also need to give the node a name; this name will show up when we look at graphical visualizations of our model. Let's name this node X_1 by passing in a parameter called name with a value of X_1, and let's define X_2 the same way

Simple TensorFlow Example

I Now we can define the node that does the multiplication operation. In TensorFlow we can do that by creating a tf.multiply node
I We pass the X_1 and X_2 nodes to the multiplication node. This tells TensorFlow to link those nodes in the computational graph: we are asking it to pull the values from X_1 and X_2 and multiply the result. Let's also give the multiplication node the name multiply. This is the entire definition of our simple computational graph

Simple TensorFlow Example

I To execute operations in the graph, we have to create a session. In TensorFlow, this is done with tf.Session(). Now that we have a session, we can ask it to run operations on our computational graph by calling its run method
I When the operation runs, it needs the values of the X_1 and X_2 nodes, so we feed them in by supplying a parameter called feed_dict. We pass the values 1, 2, 3 for X_1 and 4, 5, 6 for X_2
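Putting the pieces together, a minimal sketch of the whole example (TensorFlow 1.x API, as described above):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Placeholder nodes: their values are supplied at run time via feed_dict
X_1 = tf.placeholder(tf.float32, name="X_1")
X_2 = tf.placeholder(tf.float32, name="X_2")

# Multiplication node linking X_1 and X_2 in the computational graph
multiply = tf.multiply(X_1, X_2, name="multiply")

# A session executes the graph with concrete input values
with tf.Session() as session:
    result = session.run(multiply, feed_dict={X_1: [1, 2, 3], X_2: [4, 5, 6]})
    print(result)  # [ 4. 10. 18.]
```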

I We print the results with print(result). We should see 4, 10 and 18 for 1x4, 2x5 and 3x6

Learning process of a neural network

I The ability to learn is a fundamental trait of intelligence

I Although a precise definition of learning is difficult to formulate, a learning process in the ANN context can be viewed as the problem of updating network architecture and connection weights so that a network can efficiently perform a specific task

I The network usually must learn the connection weights from available training patterns

I Performance is improved over time by iteratively updating the weights in the network

Learning process of a neural network

I Instead of following a set of rules specified by human experts, ANNs appear to learn underlying rules (like input-output relationships) from the given collection of representative examples. This is one of the major advantages of neural networks over traditional expert systems

I Learning is essential to most neural network architectures
I The choice of a learning algorithm is a central issue in network development

Learning process of a neural network

What is really meant by saying that a processing element learns?

I Learning implies that a processing unit is capable of changing its input/output behavior as a result of changes in the environment. Since the activation rule is usually fixed when the network is constructed, and since the input/output vectors cannot be changed, the weights corresponding to a given input vector need to be adjusted in order to change the input/output behavior. A method is thus needed by which, at least during a training stage, weights can be modified in response to the input/output process

Learning process of a neural network

I In a neural network, learning can be supervised, in which the network is provided with the correct answer for the output during training, or unsupervised, in which no external teacher is present

I The learning methods in neural networks are classified into 3 types:
I supervised learning
I unsupervised learning
I reinforced learning

Learning process of a neural network

These 3 types of learning are classified based on:
I the presence or absence of a teacher
I the information provided for the system to learn

These are further classified based on the rules used:
I Hebbian learning
I gradient descent learning
I competitive learning
I stochastic learning

The quality of learning results

I Hidden layers: Both the number of hidden layers and the number of nodes in each hidden layer can influence the quality of the results. For example, too few layers and/or nodes may not be adequate to sufficiently learn and too many may result in overtraining the network

I Number of cycles: a cycle is where a training example is presented and the weights are adjusted

The quality of learning results

I The number of examples that get presented to the neural network during the learning process can be set. The number of cycles should be set to ensure that the neural network does not overtrain. The number of cycles is often referred to as the number of epochs
I Learning rate: prior to building a neural network, the learning rate should be set, and this influences how fast the neural network learns

Learning process of a neural network

I At each training step the network computes the direction in which each bias (threshold value) and link value (weight) can be changed to calculate a more correct output
I The rate of improvement at that solution state is also known. A learning rate is user-designated in order to determine how much the link weights and node biases can be modified based on the change direction and change rate
I The higher the learning rate (max. 1.0), the faster the network is trained

Learning process of a neural network: Backpropagation

I Back-propagation is the essence of neural net training. It is the method of fine-tuning the weights of a neural net based on the error rate obtained in the previous epoch (i.e., iteration)
I Proper tuning of the weights allows you to reduce error rates and to make the model reliable by increasing its generalization
I Backpropagation is a short form for "backward propagation of errors." It is a standard method of training artificial neural networks. This method helps to calculate the gradient of a loss function with respect to all the weights in the network

Learning process of a neural network: Backpropagation

I While designing a neural network, in the beginning, we initialize the weights with some random values
I Now obviously, we are not superhuman, so it is not guaranteed that the weight values we have selected are correct, or that they fit our model best
I Okay, fine, we have selected some weight values in the beginning, but our model output is way different from our actual output, i.e. the error value is huge

Learning process of a neural network: Backpropagation

Now, how will you reduce the error?

I Basically, we need to somehow tell the model to change the parameters (weights) such that the error becomes minimal
I Let's put it another way: we need to train our model

Learning process of a neural network: Backpropagation

In backpropagation we try to reduce the error by changing the values of the weights and biases (threshold values)

Keras

I Keras is a high-level neural networks API, capable of running on top of TensorFlow, Theano, and CNTK. It enables fast experimentation through a high-level, user-friendly, modular and extensible API. Keras can run on both CPU and GPU
I Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible

Exercise: clothes classification with TF and Keras

I We will use Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels)

I In our example 60,000 images are used to train the network and 10,000 images to evaluate how accurately the network learned to classify images. You can access the Fashion MNIST directly from TensorFlow. We will import and load the Fashion MNIST data directly from TensorFlow

I We will try to train our network such that it will say, with some probability, what is in the image

Assignment

I Monday's group: 13.01.2020
I Thursday's group: 16.01.2020
I The assignment will be an a/b/c/d (multiple-choice) test

Keras overview

I tf.keras is TensorFlow’s implementation of the Keras API specification. This is a high-level API to build and train models that includes first-class support for TensorFlow-specific functionality

I tf.keras makes TensorFlow easier to use without sacrificing flexibility and performance

I The main data structure in Keras is the model, which provides a way to define the complete graph. You can add layers to the existing model/graph to build the network you want

Keras overview

Keras has two distinct ways of building models:

I Sequential models: This is used to implement simple models. You simply keep adding layers to the existing model. In Keras, you assemble layers to build models. A model is (usually) a graph of layers. The most common type of model is a stack of layers: the tf.keras.Sequential model.

I Functional API: the Keras functional API is very powerful and lets you build more complex models: models with multiple outputs, directed acyclic graphs, etc.

Keras overview - introduction

I To get started, import tf.keras as part of your TensorFlow program setup
I We start by importing Keras and building a Sequential model
I We can add layers like Dense (fully connected layer), Activation, Conv2D, MaxPooling2D etc. by calling the add function

Keras overview - sequential model
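A minimal sketch of this workflow (the layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Build a Sequential model by stacking layers with add()
model = Sequential()
model.add(Dense(32, input_shape=(784,)))  # fully connected layer with 32 units
model.add(Activation('relu'))             # activation added as its own layer
model.add(Dense(10))
model.add(Activation('softmax'))
```

Keras overview - layers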

Here is how you can add some of the most popular layers to the network (see the sketch after this list):

I Convolutional layer: here, we add a layer with 64 filters of size 3x3 and use relu activations after that

I MaxPooling layer: specify the type of layer and the pool size, and you are done

I Fully connected layer: it is called Dense in Keras; just specify the number of outputs

I Dropout layer

I Flattening layer
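A sketch of adding these layers (the convolutional layer uses the 64 filters of size 3x3 mentioned above; the remaining sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu'))  # convolutional layer: 64 filters, 3x3
model.add(MaxPooling2D(pool_size=(2, 2)))         # max pooling layer
model.add(Flatten())                              # flattening: 3D activations -> 1D vector
model.add(Dropout(0.5))                           # dropout: randomly drop 50% of units in training
model.add(Dense(10, activation='softmax'))        # fully connected (Dense) layer
```

Keras overview - multi-layer perceptron

Keras overview - specifying the input shape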

I The first layer of a network reads the training data, so we need to specify the size of the images/training data that we are feeding the network
I In this case, the input layer is a convolutional layer which takes input images of 224 x 224 x 3
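A sketch of such a first layer (the filter count is illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
# The first layer declares the input shape: 224 x 224 RGB images (3 channels)
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
```

Keras overview - model compilation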

I After the model is constructed, configure its learning process by calling the compile method. tf.keras.Model.compile takes three important arguments: optimizer, loss and metrics

I Once you have specified the architecture of the network, you need to specify the method for back-propagation by choosing an optimizer (like rmsprop or adagrad) and specify the loss (like categorical_crossentropy)
I We use the compile function to do that in Keras. For example, in the line below we are asking the network to use the rmsprop optimizer to change the weights in such a way that the loss binary_crossentropy is minimized at each iteration
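A sketch of that compile call (assuming a model built as in the previous snippets):

```python
# Choose the optimizer and the loss to be minimized at each iteration
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

Keras overview - training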

I Keras models are trained on NumPy arrays of input data and labels. For training a model, you will typically use the fit function
I For small datasets, use in-memory NumPy arrays to train and evaluate a model. The model is "fit" to the training data using the fit method
I We feed the data to the model via the fit function. You can also specify the batch_size and the maximum number of epochs you want training to go on
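For example (a sketch; the model, data and hyperparameters are illustrative):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A tiny illustrative model: 784 features in, one sigmoid output
model = Sequential([Dense(1, activation='sigmoid', input_shape=(784,))])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Illustrative in-memory data: 1000 samples with 784 features, binary labels
data = np.random.random((1000, 784))
labels = np.random.randint(2, size=(1000, 1))

# Fit the model for 10 epochs, in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
```

Keras overview - exercise 1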

I Let's use Keras and TensorFlow to make predictions for simple problems

I Suppose we want to understand when John will play basketball based on the weather conditions

I In the dataset below, we have a record of the outlook, humidity, wind, and whether he played or not
I In the first entry of our dataset, when it is sunny with high humidity and weak wind, he does not play basketball

Keras overview - exercise 1

I Since we want to predict if John will play basketball or not, our inputs will be outlook, humidity, and wind; our output will be play. We want to represent our inputs and outputs numerically, as follows

Keras overview - exercise 1

I We will take our table and convert it to an array of numbers, which will act as our training dataset
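The numeric encoding itself was shown on a slide that is not preserved here; a hypothetical encoding in the same spirit (the exact values may differ from the original slides):

```python
import numpy as np

# Hypothetical encoding: outlook (sunny=0, overcast=1, rain=2),
# humidity (high=0, normal=1), wind (weak=0, strong=1), play (no=0, yes=1)
X = np.array([
    [0, 0, 0],   # sunny, high humidity, weak wind
    [0, 0, 1],   # sunny, high humidity, strong wind
    [1, 0, 0],   # overcast, high humidity, weak wind
    [2, 1, 0],   # rain, normal humidity, weak wind
])
y = np.array([0, 0, 1, 1])  # whether John played
```

Keras overview - exercise 2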

I One of the capabilities of deep learning is image recognition. The "hello world" of object recognition for machine learning and deep learning is the MNIST dataset for handwritten digit recognition
I In this example, we will use Keras and TensorFlow to classify MNIST handwritten digits
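A minimal sketch of such a classifier (the architecture and hyperparameters here are illustrative, not necessarily those from the original slides):

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Load the MNIST handwritten digits (60,000 train / 10,000 test images, 28x28)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = Sequential([
    Flatten(input_shape=(28, 28)),    # 28x28 image -> vector of 784 values
    Dense(128, activation='relu'),    # hidden layer
    Dense(10, activation='softmax'),  # one probability per digit class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```

Activation Functions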

I Neural network activation functions are a crucial component of deep learning

I Activation functions determine the output of a deep learning model, its accuracy, and also the computational efficiency of training a model—which can make or break a large scale neural network

I Activation functions also have a major effect on the neural network's ability to converge and on the convergence speed; in some cases, activation functions may prevent neural networks from converging in the first place

Activation Functions

I Activation functions are mathematical equations that determine the output of a neural network

I The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction

I Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1

Activation Functions

I An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample

I Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function and its derivative

Activation Functions

I In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer

I Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer

I The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer

I It can be as simple as a step function that turns the neuron output on and off depending on a rule or threshold, or it can be a transformation that maps the input signals into the output signals that are needed for the neural network to function

Activation Functions

I Increasingly, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions

Activation Functions

I So what does an artificial neuron do? Simply put, it calculates a "weighted sum" of its inputs, adds a bias, and then decides whether it should be "fired" or not (an activation function does this)

Activation Functions

I So consider a neuron computing Y = Σ (weight · input) + bias
I Now, the value of Y can be anything ranging from -inf to +inf. The neuron really doesn't know the bounds of the value. So how do we decide whether the neuron should fire or not? (Why this firing pattern? Because we learnt from biology that this is the way the brain works, and the brain is a working testimony of an awesome and intelligent system)

Activation Functions

I We decided to add "activation functions" for this purpose: to check the Y value produced by a neuron and decide whether outside connections should consider this neuron as "fired" or not. Or rather, let's say "activated" or not

Activation Functions - Step function

I The first thing that comes to mind is: how about a threshold-based activation function? If the value of Y is above a certain value, declare it activated; if it's less than the threshold, say it's not
I Activation function: A = "activated" if Y > threshold, else not
I Alternatively, A = 1 if Y > threshold, 0 otherwise

Activation Functions - Step function

I Well, what we just did is a "step function"; see the figure below

I Its output is 1 (activated) when value > 0 (threshold) and 0 (not activated) otherwise

Activation Functions - Step function

I Great. So this gives us an activation function for a neuron. However, there are certain drawbacks. To understand them better, think about the following:
I Suppose you are creating a binary classifier - something which should say "yes" or "no" (activate or not activate). A step function could do that: it outputs exactly 1 or 0. Now think about the use case where you would want multiple such neurons connected, to bring in more classes: class1, class2, class3, etc. What happens if more than one neuron is "activated"? All of them output a 1 (from the step function). Now what would you decide? Which class is it? That is hard and complicated.

Activation Functions - Step function

I You would want the network to activate only one neuron while the others stay 0 (only then could you say it classified properly / identified the class). This is harder to train and converge. It would be better if the activation were not binary and instead said "50% activated" or "20% activated" and so on. Then, if more than one neuron activates, you could find which neuron has the "highest activation" (better than the max: a softmax, but let's leave that for now)

Activation Functions - Step function

I In this case as well, if more than one neuron says "100% activated", the problem still persists. But since there are intermediate activation values for the output, learning can be smoother and easier (less wiggly), and the chance of more than one neuron being 100% activated is smaller than with the step function while training (also depending on what you are training and on the data)

Activation Functions - Step function

I To train a neural network, you adapt the weights of the connections in such a way as to reduce the error (or loss function)

I For this, you need to calculate the gradient of the activation functions, and this will allow you to make small changes to the weights

I When you use a step function, the gradients are all 0, so you cannot work out in which direction to adapt the weights to improve performance

Activation Functions - Step function

I There is no "step" activation function in Keras; if you really need one, you have to implement it yourself
I If you have written a custom activation function in the backend (Theano, TensorFlow or CNTK) that you want to use in Keras, Keras will use the automatic differentiation engine of that backend to compute the necessary gradients automatically

Activation Functions - Step function

I The "step" activation function is useful when the input pattern can only belong to one of two groups, that is, binary classification
I OK, so we want something that gives us intermediate (analog) activation values rather than just saying "activated" or not (binary)
I The first thing that comes to mind would be a linear function

Activation Functions - Linear function

I Linear function: A = cx
I A linear function is a straight-line function where the activation is proportional to the input (which is the weighted sum from the neuron)
I This way, it gives a range of activations, so it is not a binary activation. We can definitely connect a few neurons together and, if more than one fires, take the max (or softmax) and decide based on that
I However, linear activation has some disadvantages: if there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in input delta(x)

Activation Functions - Linear function

I There is another problem: think about connected layers. Each layer is activated by a linear function. That activation in turn goes into the next layer as input, and the second layer calculates a weighted sum on that input and, in turn, fires based on another linear activation function
I No matter how many layers we have, if all of them are linear in nature, the final activation function of the last layer is nothing but a linear function of the input of the first layer
I That means these two layers (or N layers) can be replaced by a single layer. We have lost the ability to stack layers this way. No matter how we stack, the whole network is still equivalent to a single layer with linear activation (a combination of linear functions in a linear manner is still another linear function)
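This collapse is easy to check numerically; a small sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)            # input vector
W1 = rng.random((5, 4))      # layer 1 weights
W2 = rng.random((3, 5))      # layer 2 weights

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer with the combined weight matrix W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True
```

Activation Functions - Sigmoid Function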

I Sigmoid Function is nonlinear in nature. Combinations of this function are also nonlinear

I With this function network layers can be stacked

I Between X values of -2 and 2, the Y values are very steep, which means any small change in the values of X in that region causes the values of Y to change significantly

Activation Functions - Sigmoid Function

I That means this function has a tendency to bring the Y values to either end of the curve
I It tends to bring the activations to either side of the curve (above x = 2 and below x = -2, for example), making clear distinctions in prediction
I Another advantage of this activation function is that, unlike the linear function, its output is always in the range (0, 1), compared to (-inf, inf) for the linear function. So we have our activations bound in a range

Activation Functions

I However, there is a problem with the sigmoid activation function: towards either end of the sigmoid, the Y values respond much less to changes in X
I This means that the network refuses to learn further, or learns drastically slowly
I There are ways to work around this problem, and sigmoid is still very popular for classification problems

Activation Functions - Tanh Function

I Another activation function that is used is the tanh function
I It is a scaled sigmoid function: tanh(x) = 2·sigmoid(2x) - 1
I It is nonlinear in nature, so we can stack layers
I It is bound to the range (-1, 1), so no worries about activations blowing up
I Tanh is also a very popular and widely used activation function

Activation Functions - ReLu

I ReLu function: A(x) = max(0,x)

I It gives an output x if x is positive and 0 otherwise Activation Functions - ReLu

I ReLu is nonlinear in nature, and combinations of ReLu are also non-linear (in fact, ReLu is a good approximator: any function can be approximated with combinations of ReLu)
I This means we can stack layers. ReLu is not bound, though: its range is [0, inf)
I ReLu is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. That is a good point to consider when designing deep neural networks

Activation Functions - Softmax

I Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would

I Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer Activation Functions - Softmax

I Softmax lets us answer classification questions with probabilities, which are more useful than simpler answers (e.g. a binary yes/no)
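For reference, a minimal softmax sketch (the input scores are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the outputs sum to 1.0
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```

Convolutional Neural Networks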

I To take the next step in improving the accuracy of our networks, we need to delve into deep learning

I A particularly useful type of deep learning neural network for image classification is the convolutional neural network

I It should be noted that convolutional neural networks can also be used for applications other than images, such as time series prediction

Convolutional Neural Networks

I As shown in the previous example, multi-layer neural networks can perform pretty well in predicting things like digits in the MNIST dataset

I However, the MNIST dataset is quite simple. The images are small (only 28 x 28 pixels), are single layered (i.e. greyscale, rather than a coloured 3 layer RGB image) and include pretty simple shapes (digits only, no other objects)

I Once we start trying to classify things in more complicated colour images, such as buses, cars, trains etc., we run into problems with our accuracy

Convolutional Neural Networks

I The first thing we can do in order to solve this problem is to try to increase the number of layers in our neural network to make it deeper

I That will increase the complexity of the network and allow us to model more complicated functions

I However, it will come at a cost - the number of parameters (i.e. weights and biases) will rapidly increase. This makes the model more prone to overfitting and will prolong training times

I In fact, learning such difficult problems can become intractable for normal neural networks. This leads us to a solution: convolutional neural networks

Convolutional Neural Networks

I Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity

I The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other

I And they still have a loss function (e.g. softmax) on the last (fully-connected) layer

Convolutional Neural Networks

I Convolutional Neural Network architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture

I These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network

Recall: Regular Neural Networks

I Neural Networks receive an input (a single vector), and transform it through a series of hidden layers

I Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections

I The last fully-connected layer is called the "output layer" and in classification settings it represents the class scores

Recall: Regular Neural Networks

I Regular Neural Networks don’t scale well to full images

I Images, for instance, can be of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images

I For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights

Recall: Regular Neural Networks

I Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting

I Overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably"

Convolutional Neural Networks

I Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way

I In particular, unlike a regular Neural Network, the layers of a CNN have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.)

Convolutional Neural Networks

I For example, the input images in CIFAR-10 (it is a dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class) are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively)

I As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner

I Moreover, the final output layer for CIFAR-10 would have dimensions 1x1x10, because by the end of the CNN architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension

Convolutional Neural Networks

I A regular 3-layer Neural Network

Convolutional Neural Networks

I A CNN arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers
I Every layer of a CNN transforms the 3D input volume to a 3D output volume of neuron activations
I In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels)

Convolutional Neural Networks

I A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of activations to another through a differentiable function

I We use three main types of layers to build CNN architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks)

I We will stack these layers to form a full CNN architecture

I The role of the CNN is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction

CNN - Convolutional Layer

I In a convolutional layer, a "filter", sometimes called a "kernel", is passed over the image, viewing a few pixels at a time (for example, 3x3 or 5x5)
I The convolution operation is a dot product of the original pixel values with the weights defined in the filter
I The results are summed up into one number that represents all the pixels the filter observed

CNN - Convolutional Layer

I The CNN layer’s parameters consist of a set of learnable filters

I Every filter is small spatially (along width and height), but extends through the full depth of the input volume

I For example, a typical filter on a first layer of a CNN might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels)

I During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at every position

CNN - Convolutional Layer

I As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position

I Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network

I Now, we will have an entire set of filters in each CNN layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume

CNN - Convolutional Layer

I To illustrate the convolution operation, let's take a 6x6 grayscale image (i.e. only one channel) and convolve this 6x6 matrix with a 3x3 filter:

CNN - Convolutional Layer

I After the convolution, we will get a 4x4 image. The first element of the 4x4 matrix is calculated by taking the first 3x3 submatrix of the 6x6 image, multiplying it element-wise with the filter, and summing: 3*1 + 0*0 + 1*(-1) + 1*1 + 5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1) = -5
I To calculate the second element of the 4x4 output, we shift the filter one step to the right and again take the sum of the element-wise product

I Similarly, we will convolve over the entire image and get a 4x4 output
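A minimal sketch of this operation in plain Python (valid convolution, no padding or stride; note that, as in the example above, the filter is not flipped, so strictly speaking this is cross-correlation):

```python
def convolve2d(image, kernel):
    # 'Valid' convolution: slide the kernel over every position where it fits
    n, m = len(image), len(image[0])
    k = len(kernel)
    out = []
    for i in range(n - k + 1):
        row = []
        for j in range(m - k + 1):
            # Element-wise product of the kernel and the image patch, summed
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out  # (n-k+1) x (m-k+1) output; 6x6 input with 3x3 kernel -> 4x4
```

CNN - Convolutional Layer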

I The 6x6 image is now converted into a 4x4 image
I Think of the weight matrix like a paintbrush painting a wall
I The brush first paints the wall horizontally and then comes down and paints the next row horizontally. Pixel values are used again when the weight matrix moves along the image
I This basically enables parameter sharing in a convolutional neural network

CNN - Convolutional Layer

I The weight matrix behaves like a filter, extracting particular information from the original image matrix. One weight combination might be extracting edges, while another might extract a particular color, and another might just blur the unwanted noise

CNN - Pooling Layer

I Sometimes when the images are too large, we would need to reduce the number of trainable parameters

I It is then desired to periodically introduce pooling layers between subsequent convolution layers

I Pooling is done for the sole purpose of reducing the spatial size of the image

I Pooling is done independently on each depth dimension, therefore the depth of the image remains unchanged

CNN - Pooling Layer

I The most common form of pooling layer generally applied is the max pooling

I Pooling layers are generally used to reduce the size of the inputs and hence speed up the computation

I Consider a 4x4 matrix as shown below:

CNN - Pooling Layer

I Applying max pooling on this matrix will result in a 2x2 output

I For every consecutive 2x2 block, we take the max number

I Here, we have applied a filter of size 2 and a stride of 2
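A minimal sketch of 2x2 max pooling with stride 2 (the matrix is illustrative):

```python
def max_pool_2x2(matrix):
    # Take the max of every non-overlapping 2x2 block (stride 2)
    return [[max(matrix[i][j], matrix[i][j + 1],
                 matrix[i + 1][j], matrix[i + 1][j + 1])
             for j in range(0, len(matrix[0]), 2)]
            for i in range(0, len(matrix), 2)]

m = [[1, 3, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]
print(max_pool_2x2(m))  # [[6, 8], [3, 4]]
```

CNN - Pooling Layer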

I Here we have taken the stride as 2, and the pooling size also as 2
I The max operation is applied to each depth dimension of the convolved output
I As you can see, the 4x4 convolved output has become 2x2 after the max pooling operation

CNN - Pooling Layer

I As you can see, we have taken a convolved image and applied max pooling on it
I The max pooled image still retains the information that it's a car on a street
I If you look carefully, the dimensions of the image have been halved
I This helps to reduce the parameters to a great extent

CNN - Output Layer

I After multiple layers of convolution and padding, we would need the output in the form of a class

I The convolution and pooling layers would only be able to extract features and reduce the number of parameters from the original images

I However, to generate the final output we need to apply a fully connected layer to generate an output equal to the number of classes we need

I It becomes tough to reach that number with just the convolution layers

CNN - Output Layer

I Convolution layers generate 3D activation maps while we just need the output as whether or not an image belongs to a particular class

I The output layer has a loss function like categorical cross-entropy, to compute the error in prediction

I Once the forward pass is complete, backpropagation begins to update the weights and biases for error and loss reduction

Convolutional Neural Networks

I A "flatten" layer in a CNN turns the inputs into a vector
I A "dense" layer takes that vector and generates probabilities for the target labels, using an activation function
I The adam optimizer adjusts the learning rate throughout training
I Loss function: categorical_crossentropy, a common choice for classification. The lower the score, the better the model is performing
I Metrics: the accuracy metric gives an accuracy score when the model runs on the validation set

Google Colab - import your own dataset

I Dataset: http://files.fast.ai/data/dogscats.zip

I Documentation: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c

CNN - Convolutional Layer - exercises

I Exercise 1: Using any programming language you like, implement the convolution operation without using any additional libraries.

I Exercise 2: Using Google Colab and the given dataset, implement a CNN which will classify images of dogs and cats.

Tips: resize images to 300 x 300, use 10 filters, a 5 x 5 filter size, and 5 epochs. The layers should be 1x Conv2D, 1x MaxPooling2D with pool_size (2, 2), 1x Flatten, 1x Dense. Use categorical_crossentropy, adam and accuracy. After that, try adding more Conv2D and MaxPooling2D layers. Compare results.
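A skeleton of the model described in the tips (a sketch; the data loading is not shown, and the two-class softmax output for dogs vs cats is an assumption):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Architecture from the tips: Conv2D -> MaxPooling2D -> Flatten -> Dense
model = Sequential([
    Conv2D(10, (5, 5), activation='relu', input_shape=(300, 300, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(2, activation='softmax'),   # 2 classes: dog, cat
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=5)  # data loading not shown here
```

Pandas library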

I Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures
I Pandas is an open-source Python library providing high-performance data manipulation and analysis tools through its powerful data structures. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data
I Pandas is often used for processing the numerical data that is the input to a neural network

Pandas library - data structures

I If you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas

I There are two types of data structures in pandas: Series and DataFrames Pandas library - series

I Series: a pandas Series is a one-dimensional data structure ("a one-dimensional ndarray") that can store values - and for every value it holds a unique index, too
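For example (a minimal sketch):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # values with a unique index
print(s['b'])  # 20
```

Pandas library - data frames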

I DataFrame: a pandas DataFrame is a two (or more) dimensional data structure - basically a table with rows and columns, where the columns have names and the rows have indexes

Pandas library - data frames

I There are several ways to create a DataFrame. One way is to use a dictionary
I Another way to create a DataFrame is by importing a csv file using Pandas
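Both ways, sketched (the column names and file name are illustrative):

```python
import pandas as pd

# From a dictionary: keys become column names
df = pd.DataFrame({'animal': ['lion', 'zebra'], 'water_need': [350, 200]})

# From a csv file downloaded earlier (e.g. with wget, as below)
df2 = pd.read_csv('dataset1.csv')
```

Pandas library - datasets to learn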

I Zoo - https://pastebin.com/raw/jpEW5CB3

I Countries - https://pastebin.com/raw/C9QXZaF8

! wget -O dataset1.csv https://pastebin.com/raw/C9QXZaF8

Pandas library

Exercise 1: using the ZOO dataset and pandas:
I Count the number of rows (the number of animals) in zoo (count())
I Calculate the total water need of the animals (sum())
I Find the smallest water need value (min())
I Find the greatest water need value (max())
I Find the average water need value (mean(), median())
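The aggregations all follow the same pattern; a sketch for the first two (assuming the zoo dataset has a water_need column):

```python
import pandas as pd

# Load the zoo dataset directly from the URL given above
zoo = pd.read_csv('https://pastebin.com/raw/jpEW5CB3')
print(zoo.count())              # number of rows per column
print(zoo['water_need'].sum())  # total water need across all animals
```

Pandas library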

Exercise 2: using the Countries dataset and pandas:
I What is the most frequent source in this dataset?
I For the users of country_2, what was the most frequent topic and source combination? In other words: which topic, from which source, brought the most views from country_2?