An Introduction to

Machine Learning v1.1

E. J. Sagra

Agenda

● Why is Machine Learning in the News again?
● Artificial Intelligence vs Machine Learning vs Deep Learning
● Artificial Intelligence
● Machine Learning & Data Science
● Machine Learning
● Data
● Machine Learning - By The Steps
● Tasks that Machine Learning solves
○ Classification
○ Cluster Analysis
○ Regression
○ Ranking
○ Generation

Agenda (cont...)

● Model Training
○ Supervised / Unsupervised / Reinforcement Learning
● Reinforcement Learning - Going Deeper
○ Simple Example
○ The Bellman Equation
○ Deterministic vs. Non-Deterministic Search
○ Markov Decision Process (MDP)
○ Living Penalty
● Machine Learning - Decision Trees
● Machine Learning - Augmented Random Search (ARS)

Why is Machine Learning In The News Again?

Processing capabilities
● GPUs etc. have reached a level where Machine Learning / Deep Learning is practical
● Cloud computing allows even individuals the capability to create / train complex models on vast data sets

Memory (Hard Drive (now SSD) as well as RAM)
● Speed / capacity increasing
● Cost decreasing

General
● Tools / Languages / Automation
● Need for Data Science no longer limited to the tech giants
● Education is behind in creating Data Scientists
● Organizing data is hard. Organizations challenged
● High demand due to lack of qualified talent

Data
● Volume of Data
● Access to vast public data sets

Artificial Intelligence vs Machine Learning vs Deep Learning

Artificial Intelligence is the all-encompassing concept that initially erupted

Followed by Machine Learning that thrived later

Finally, Deep Learning is escalating the advances of Artificial Intelligence to another level. Artificial Intelligence

Artificial intelligence (AI) is perhaps the most vaguely understood field of data science.

The main idea behind building AI is to use pattern recognition and machine learning to build an agent able to think and reason as humans do (or approach this ability).

Challenge: The term is so widely used that we haven't yet agreed on how to interpret the 'I' in AI. Intelligence is hard to formalize, and ways to determine it are numerous.

Artificial Intelligence

For example:
● In business language, AI can be interpreted as the ability to solve new problems. Effectively, solving new problems is the outcome of perception, generalizing, reasoning, and judging.
● In the public view, AI is usually conceived as the ability of machines to solve problems related to many fields of knowledge. This would make them somewhat similar to humans. This concept of AGI (Artificial General Intelligence) remains in the realm of science fiction and does not match the existing state of the art.
● Famous systems such as AlphaGo, IBM Watson, or Libratus (Texas Hold'em) are representative of ANI (Artificial Narrow Intelligence): they specialize in one area and can perform tasks based on similar techniques to process data.

Scaling from ANI to AGI is the endeavor that data science has yet to achieve.

Machine Learning: Programs that Alter Themselves

Machine learning is a subset of artificial intelligence. That is, all machine learning counts as artificial intelligence, but not all artificial intelligence counts as machine learning. For example, symbolic logic – rules engines, expert systems and knowledge graphs – could all be described as artificial intelligence, and none of them are machine learning.

One aspect that separates machine learning from the knowledge graphs and expert systems is its ability to modify itself when exposed to more data; i.e. machine learning is dynamic and does not require human intervention to make certain changes. That makes it less brittle, and less reliant on human experts. Machine Learning & Data Science

Machine learning and statistics are part of data science. The word learning in machine learning means that the algorithms depend on some data, used as a training set, to fine-tune some model or algorithm parameters. This encompasses many techniques such as regression, naive Bayes or supervised clustering. But not all techniques fit in this category. For instance, unsupervised clustering - a statistical and data science technique - aims at detecting clusters and cluster structures without any prior knowledge or training set to help the classification algorithm. A human being is needed to label the clusters found. Some techniques are hybrid, such as semi-supervised classification. Some pattern detection or density estimation techniques fit in this category. Machine Learning

Machine Learning uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs.

Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible. Example applications include email filtering, optical character recognition (OCR) and computer vision.

While it seems that data mining and KDD solely address the main problem of data science, machine learning adds business efficiency to it. Machine Learning

Machine learning is similar to data mining in that it's about creating algorithms to extract valuable insights; however, it's heavily focused on continuous use in dynamically changing environments and emphasizes adjustment, retraining, and updating of algorithms based on previous experience. Machine Learning

The goal of machine learning is to constantly adapt to new data and discover new patterns or rules in it. Sometimes it can be realized without human guidance and explicit reprogramming. Machine Learning

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." - Tom Mitchell

"Field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel (1959)

Machine Learning - How?

The main difference between machine learning and conventionally programmed algorithms is the ability to process data without being explicitly programmed. This actually means that an engineer isn’t required to provide elaborate instructions to a machine on how to treat each type of data record. Instead, a machine defines these rules itself relying on input data.

Regardless of a particular machine learning application, the general workflow remains the same and iteratively repeats once the results become dated or need higher accuracy.

The core artifact of any machine learning execution is a mathematical model, which describes how an algorithm processes new data after being trained with a subset of historic data. The goal of training is to develop a model capable of predicting a target value (attribute), some unknown value of each data object. While this sounds complicated, it really isn't. Machine Learning - Example

For example, you need to predict whether customers of your eCommerce store will make a purchase or leave. These predictions, buy or leave, are the target attributes that we are looking for.

To train a model to make this type of prediction, you "feed" an algorithm a dataset that stores different records of customer behaviors and the results (whether customers left or made a purchase).

By learning from this historic data a model will be able to make predictions on future data. Data - How Much Do I Need?

For most Machine Learning algorithms / approaches, data is the essential ingredient; however, how much will you need?

No one can really tell you - however the more powerful machine learning algorithms (often referred to as nonlinear algorithms) generally require more data.

These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

E.g. If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, such as an artificial neural network.

Some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.

Data - Challenges

There may be a number of reasons that prevent you from obtaining data for your analysis, or make it more challenging. For example:

● Security and access
● Privacy
● Compliance
● Anonymized data
● IP protection
● Barriers (physical and virtual)

Finally, the format and structure of the data need to be considered. E.g. reviewing currency rates from the Federal Reserve going back 40 years, there will be a discontinuity from 1999 onwards, since the euro replaced most European currencies.

Data - Characteristics

Data must be thought of as a building block for information and analytics. It must be collected to answer a question or set of questions. This means that it must have the following characteristics:

● Accuracy: While obvious, the data must be accurate.
● Completeness: The data must be relevant, and data that is necessary to answer the question asked must be present. An obvious example of incomplete data would be a classroom where there are 30 students, but the teacher calculates the average for only 15.

● Consistency: If there is one database indicating that there are 30 students in a class and a second database showing that there are 31 in the same class then this is an issue.

● Uniqueness: If a student has different identifiers in two separate databases, this is an issue as it opens the risk that information won’t be complete or consistent.

● Timeliness: Data can change, and the AI model may need to be updated. Data - Splitting the Data (Training & Test)

A common approach to model training in general is to split your data set into at least two groups:

● The TRAINING data set to train the model
● The TEST data to validate the model's performance

The percentage per group can differ but is usually in the range of 70 - 80% for training and 20 - 30% for validation.

Many ways exist to achieve this systematically. E.g. to divide 25% testing & 75% training using the sklearn lib:

sklearn.model_selection.train_test_split
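A minimal sketch of that 75/25 split; the arrays X and y here are made-up stand-ins for your real features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 100 samples, 3 features, binary target
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 75% training / 25% testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```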

Data - Splitting (K-Folds Cross Validation)

As with everything in Data Science - there are many types of data splitting for model training. In K-Folds Cross Validation, for example, we split our data into k different subsets (or folds). We use k-1 subsets to train our model and leave the last subset (or the last fold) as test data. We then average the model's performance across each of the folds and then finalize our model. After that we test it against the test set.
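As a hedged sketch of this with scikit-learn (the estimator and toy data are placeholders, not from the slides):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy stand-in data
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# k = 5 folds: each fold serves once as validation data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(scores, scores.mean())  # per-fold scores and their average
```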

Visual representation of K-Folds (Joseph Nelson)

Data - Scaling & Normalization

One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar!

In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data.

Let's talk a little more in-depth about each of these options. Data - Scaling

Here, you're transforming your data so it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.

E.g. Comparing prices of products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what about if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

By scaling your variables, you can help compare different variables on equal footing. Data - Why is Data Scaling Important?

Feature scaling is a general trick applied to optimization problems, not just Support Vector Machines (SVM). The underlying algorithm used to solve the SVM optimization problem is gradient descent.

As an example of the core ideas here, suppose you have only two parameters and one of the parameters can take a relatively large range of values. Then the contour of the cost function can look like very tall and skinny ovals (see the blue ovals on the right). Your gradient descent path (drawn in red) could take a long time, going back and forth, to find the optimal solution.

Instead, if you scale your features, the contour of the cost function might look like circles (below); the gradient can then take a much straighter path and achieve the optimal point much faster.

sklearn.preprocessing.StandardScaler
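A minimal sketch using this class; the toy feature matrix is made up for the example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features: one column in the ~1000s, one in [0, 1]
X = np.array([[1000.0, 0.2],
              [1500.0, 0.9],
              [2000.0, 0.4]])

# StandardScaler rescales each column to zero mean and unit variance,
# so a change of "1" means the same thing for both features
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0 0] [1 1]
```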

Data - Scaling

Note: The data distribution does not change - only the range of the desired axis (the x-axis in this case).

Data - Normalization

Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

Normal distribution: Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.

In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)
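As one hedged example, the Box-Cox transformation (via scipy) is a common way to normalize strictly positive, skewed data like the L-shaped example on the next slide:

```python
import numpy as np
from scipy import stats

# Toy data: exponentially distributed, so its histogram is "L-shaped"
original = np.random.exponential(size=1000)

# Box-Cox is one common normalization; it requires positive values
normalized, _ = stats.boxcox(original)

print(stats.skew(original))    # strongly right-skewed
print(stats.skew(normalized))  # close to 0, i.e. roughly bell-shaped
```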

Data - Normalization

Notice that the shape of our data has changed. Before normalizing it was almost L-shaped. But after normalizing it looks more like the outline of a bell (hence "bell curve").

Data - What is Correlation?

Correlation is a statistical term which in common usage refers to how close two variables are to having a linear relationship with each other.

For example, two variables which are linearly dependent (say, x and y, which depend on each other as x = 2y) will have a higher correlation than two variables which are non-linearly dependent (say, u and v, which depend on each other as u = v²).
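A small illustration of that claim with toy data; the pairplot and heatmap shown below can be reproduced with seaborn:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.normal(size=500)
v = rng.normal(size=500)
df = pd.DataFrame({"x": 2 * y, "y": y, "u": v ** 2, "v": v})

# Pearson correlation: x and y correlate perfectly (linear relationship),
# while u and v barely correlate at all (non-linear relationship)
print(df.corr())

# Visuals as on this slide (requires seaborn):
# import seaborn as sns
# sns.pairplot(df); sns.heatmap(df.corr(), annot=True)
```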

PairPlot (from Seaborn)

Heat Map

Data - Compression - Run Length Encoding (RLE)

Run-length encoding (RLE) is a very simple form of lossless data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs. Consider, for example, simple graphic images such as icons, line drawings, and animations. It is not useful with files that don't have many runs as it could greatly increase the file size.

Example:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Could be stored as…

12W1B12W3B24W1B14W
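A minimal sketch of such an encoder in Python; this simple count-symbol format is just one of several RLE conventions:

```python
def rle_encode(s: str) -> str:
    """Encode runs as count+symbol pairs, e.g. 'WWB' -> '2W1B'."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                 # extend the current run
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle_encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"))
# -> 12W1B12W3B24W1B14W
```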

Data - Compression - RLE - E.g. Image Masking

Say we have satellite images of the ocean and want to identify ships in these images. We could create a model that generates a 'Mask' image which overlays the positions of the ships it has identified over the actual image for processing.

E.g. We have images with a resolution of 768 x 768 pixels - that's 589,824 total pixels

Satellite image of two ships | Mask image | Overlay image

Data - Compression - RLE - E.g. Image Masking

We start by counting the pixels from the top left pixel (1) and then count down the column until we reach the end pixel (768). We then start at the second column (769) and count down to the bottom (1536). We continue this until we hit the first pixel of the ship. In this case it’s 56010. We can then build up the RLE column by column as shown below:

56010 1 56777 3 57544 6 58312 7 59079 9 59846 11 60613 14 61380 16 62148 17……….

So, starting from pixel 56010, 1 pixel is part of the ship mask. Next, from pixel 56777, 3 pixels are, and so on.

Data - Compression - RLE - E.g. Image Masking

RLE is really effective for the mask file due to the binary form of the data (i.e. ship or NOT ship) and the relatively small area of the two ships when compared to the image size.

This is the actual mask data for the previous image (2 ships); it contains only 2,401 characters.

Machine Learning - By The Steps

Generally, the workflow follows these simple steps:

1. Collect data. Use your digital infrastructure and other sources to gather as many useful records as possible and unite them into a dataset.

2. Prepare data. Prepare your data to be processed in the best possible way. Data preprocessing and cleaning procedures can be quite sophisticated, but usually they aim at filling in missing values and correcting other flaws.

3. Split data. Separate subsets of data to train a model and further evaluate how it performs against new data.

4. Train a model. Use a subset of historic data to let the algorithm recognize the patterns in it.

5. Test and validate a model. Evaluate the performance of a model using testing and validation subsets of historic data and understand how accurate the prediction is.

6. Deploy a model. Embed the tested model into your decision-making framework as a part of an analytics solution or let users leverage its capabilities (e.g. better target your product recommendations).

7. Iterate. Collect new data after using the model to incrementally improve it. Machine Learning - By The Steps

Machine Learning Workflow

Tasks that Machine Learning solves

In business terms, machine learning addresses a broad spectrum of tasks, but on the higher levels, the tasks that algorithms solve fall into five major groups: classification, cluster analysis, regression, ranking, and generation.

● Classification ● Cluster Analysis ● Regression ● Ranking ● Generation Machine Learning - Classification

Classification algorithms define which category the objects from the dataset belong to. Thus, categories are usually referred to as classes. By solving classification problems you can address a variety of questions. Binary classification problems:

● Will this lead convert or not?
● Is this email spam or not?
● Is this transaction fraudulent or not?

Binary classification

Another highly specific type of classification task is anomaly detection. It's usually recognized as one-class classification, because the goal of anomaly detection is to find outliers: unusual objects in data that don't appear in its normal distribution. What kinds of problems can it solve?

● Are there any untypical customers in our dataset? ● Can we spot unusual behaviors among our bank clients? ● Does this patient deviate from the rest, according to the records?

Anomaly detection

Machine Learning - Cluster Analysis

The main difference between regular classification and clustering is that the algorithm is challenged to group items in clusters without predefined classes. In other words, it should decide the principles of the division itself without human guidance. Cluster analysis is usually realized within the unsupervised learning style, which we will talk about in a minute. Clustering can solve the following problems:

● What are the main segments of customers we have considering their demographics and behaviors? ● Is there any relationship between default risks of some bank clients and their behaviors? ● How can we classify the keywords that people use to reach our website?

Cluster analysis (estimated number of clusters: 3)

Machine Learning - Regression

Regression algorithms define numeric target values, instead of classes. By estimating numeric variables, these algorithms are powerful at predicting product demand, sales figures, marketing returns, etc. For example:

● How many items of this product will we be able to sell next month?
● What's going to be the flight fare for this air destination?
● What is the top speed for a vehicle to sustain its operating life?

Linear Regression

Machine Learning - Ranking

Ranking algorithms decide the relative importance of objects (or items) in connection with other objects. The most well-known example is PageRank, which is heavily used by Google to rank pages on the search engine results page. Ranking algorithms are also applied by Facebook to define which posts in a news feed are more engaging to users than others. What other problems can ranking address?

● Which movies is this user going to enjoy the most?
● What's going to be the top list of recommended hotels for this customer?
● How should we rank products on a search page of an eCommerce store?

Movie recommendation ranking

Machine Learning - Generation

Generation algorithms are applied to generate text, images, or music. Today they are used in applications like Prisma, which converts photos to artwork-style images, or WaveNet by DeepMind, which can mimic human speech or create musical compositions. Generative tasks are more common for mass consumer applications than for predictive analytics solutions. That's why this type of machine learning has big potential for entertainment software. What are the possible tasks of generative algorithms?

● Turn photos into paintings of a specific style.
● Create text-to-speech applications for mobile voice assistants (e.g. the Google Assistant).
● Create music samples of one style or reminiscent of a particular musician.

Image converted to artwork using "The Great Wave off Kanagawa" piece of art

Model Training

To meet these tasks, different model training approaches (or training styles) are used. Training is a procedure to develop a specific mathematical model that is tailored to dependencies among values in historic data. A trained model will be able to recognize these dependencies in future data and predict the values that you look for. So, there are three styles of model training.

● Supervised Learning ● Unsupervised Learning ● Reinforcement Learning Supervised Learning

Supervised learning algorithms operate with historic data that already has target values. Mapping these target values in training datasets is called labeling. In other words, humans tell the algorithm what values to look for and which decisions are right or wrong. By looking at a label as an example of a successful prediction, the algorithm learns to find these target values in future data. Today, supervised machine learning is actively used both with classification and regression problems as generally target values are already available in training datasets.

This makes supervised learning the most popular approach employed in business. For example, if you choose binary classification to predict the likelihood of lead conversion, you know which leads converted and which didn’t. You can label the target values (converted/not converted or 0/1) and further train a model. Supervised learning algorithms are also used in recognizing objects on pictures, in defining the mood of social media posts, and predicting numeric values as temperature, prices, etc. Unsupervised Learning

Unsupervised learning is aimed at organizing data without labeled target values. The goal of machine learning, in this case, is to define patterns in values and structure the objects according to similarities or differences. In the area of classification tasks, unsupervised learning is usually applied with clustering algorithms, anomaly detection, and generative tasks. These models are useful in finding hidden relations among items, solving segmentation problems, etc.

For example, a bank can use unsupervised learning to split clients into multiple groups. This will help to develop specific instructions for dealing with each particular group. Unsupervised learning techniques are also employed in ranking algorithms to provide individualized recommendations. Reinforcement Learning

Reinforcement learning is perhaps the most sophisticated style of machine learning inspired by game theory and behaviorist psychology.

The general aim of Machine Learning is to produce intelligent programs, often called agents (algorithms), through a process of learning and evolving. A Reinforcement Learning (RL) agent learns by interacting with its environment and observing the results of these interactions. This mimics the fundamental way in which humans (and animals alike) learn. As humans, we have a direct sensorimotor connection to our environment, meaning we can perform actions and witness the results of these actions on the environment. The idea is commonly known as "cause and effect", and this undoubtedly is the key to building up knowledge of our environment throughout our lifetime. Reinforcement Learning

Reinforcement learning techniques are actively used in robotics and AI development. A well-known AlphaGo algorithm by DeepMind used reinforcement learning to estimate the most productive moves in the ancient game of Go instead of enumerating all possible board combinations. Reinforcement Learning - Going Deeper

The "cause and effect" idea can be translated into the following steps for an RL agent:

1. The agent observes an input state
2. An action is determined by a decision making function (policy)
3. The action is performed
4. The agent receives a scalar reward or reinforcement from the environment
5. Information about the reward given for that state / action pair is recorded
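As a hedged sketch, these five steps map onto a simple loop; the toy one-dimensional environment here is made up purely for illustration:

```python
import random

class ToyEnvironment:
    """Hypothetical environment: the agent starts at 0, the goal is position 4."""
    def reset(self):
        return 0
    def step(self, state, action):
        next_state = max(0, min(4, state + action))
        reward = 1.0 if next_state == 4 else 0.0
        return next_state, reward, next_state == 4

env = ToyEnvironment()
value = {}  # recorded reward information per (state, action) pair

for episode in range(100):
    state, done = env.reset(), False        # 1. observe the input state
    while not done:
        action = random.choice([-1, 1])     # 2. policy picks an action
        next_state, reward, done = env.step(state, action)  # 3./4. act, get reward
        key = (state, action)               # 5. record the reward for the pair
        value[key] = value.get(key, 0.0) + 0.1 * (reward - value.get(key, 0.0))
        state = next_state
```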

By performing actions, and observing the resulting reward, the policy used to determine the best action for a state can be fine-tuned. Eventually, if enough states are observed an optimal decision policy will be generated and we will have an agent that performs perfectly in that particular environment. Reinforcement Learning - Going Deeper

In the business sphere, reinforcement learning is still hard to apply as most algorithms can successfully learn only within the unchanging frame of rules, goals, and world circumstances.

That's why many of the modern reinforcement learning advancements today are tied to games like Go, where these three parameters are stable.

Another problem of reinforcement learning is the longevity of learning cycles. In games, the time between the first decision and the achieved points is relatively short, while in real-life circumstances the time needed to estimate how successful a decision was may be weeks. Reinforcement Learning - Simple Example

In simple terms there is an Agent (our AI) that exists in an environment. The Agent takes Actions in this environment and in return receives an amended State and a Reward.

The Environment could be anything e.g. A kitchen where the Agent is cooking eggs. Reinforcement Learning - Simple Example

Your Agent performs tasks in the environment and is provided a reward based on the outcome of each action. This reward will either be positive (good) or negative (bad). A scale can also be provided to determine how good or bad the outcome of this last action was.

Simple comparison: Imagine training a dog. When the dog takes an action with a positive outcome in its environment it receives a treat. Alternatively, when the dog takes an action with a negative outcome it is disciplined.

Reinforcement Learning - Bellman Equation

Let’s take a look at the origins of Dynamic Programming with the Bellman Equation. We need to be aware of the following concepts:

s - State (what state our Agent is in)
a - Action (an Action an Agent can take in a particular State)
R - Reward (Reward an Agent obtains for entering a particular State)
v - Discount

We start with a basic maze. It has GOAL (green), HAZARD (red), Blocked (grey) and Open (white) squares. The objective is to make it to the GOAL without hitting the HAZARD. The Agent can move one block at a time: UP, DOWN, LEFT, RIGHT.

We can start by applying a REWARD of +1 to the Green square and a -1 to the Red square. The robot moves randomly until it hits either end state. Reinforcement Learning - Bellman Equation

You start off in a maze at the bottom left corner. The robot moves around randomly until it finds the flag and gets a reward of +1. It then works out how it got to that reward - the answer is from the square to the left. Hence it goes back one square and assigns it a Value of +1.0 (the max reward for taking an action from that state is +1.0 PLUS the Discount x the Value of the next state, which is 0) - hence V = 1.0. Reinforcement Learning - Bellman Equation

Next, going back (left) one more square: the Reward for taking the action to the right is ZERO (there is no reward for that square) PLUS the Discount (0.9) multiplied by the Value of the next square (which we just worked out to be 1.0) - hence V = 0.9. One more square to the left and V(s) = 0 (no reward) + 0.9 (Discount) x 0.9 (next state's value) = 0.81. Finally we have our PLAN. We know where we should go from every location on the board. NOTE that from the bottom left square we could take one of two directions - when two actions have the same Value like this, you just need a standard way to resolve the tie, e.g. pick UP as the priority.
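A hedged sketch of this value propagation in code, assuming a 3x4 maze laid out like the slides (goal top-right, hazard below it, one blocked square):

```python
# Deterministic value iteration: V(s) = max over actions of
# R(next state) + v * V(next state), with discount v = 0.9
GAMMA = 0.9
ROWS, COLS = 3, 4
GOAL, HAZARD, BLOCKED = (0, 3), (1, 3), (1, 1)

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) != BLOCKED}

def neighbors(s):
    r, c = s
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # UP, DOWN, LEFT, RIGHT
        n = (r + dr, c + dc)
        if n in V:
            yield n

def reward(s):
    return 1.0 if s == GOAL else (-1.0 if s == HAZARD else 0.0)

for _ in range(50):  # sweep until the values settle
    for s in V:
        if s not in (GOAL, HAZARD):  # terminal states are not updated
            V[s] = max(reward(n) + GAMMA * V[n] for n in neighbors(s))

# Squares approaching the goal from the left: 1.0, 0.9, 0.81 as above
print(V[(0, 2)], V[(0, 1)], V[(0, 0)])
```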

Reinforcement Learning - Bellman Equation

The Agent has Learned how to navigate based on its current State from anywhere in the maze.

RL - Deterministic vs. Non-Deterministic Search

For Deterministic search - you know the state you’re moving to. If you select UP then you’ll move UP.

For Non-Deterministic Search - if you select UP then there's a chance that you might go LEFT (say 10%), RIGHT (say 10%) or UP (say 80%). So we know the probabilities of what the next state will be - but we can't be sure.

In summary, for a deterministic algorithm, for a given input the computer will always produce the same output, going through the same states. In the case of a non-deterministic algorithm, for the same input the computer may produce different output in different runs. RL - Markov Decision Process (MDP)

V(s) = max_a [ R(s, a) + v * Σ_s' P(s, a, s') V(s') ]

V - Value
s - State
a - Action
R - Reward
v - Discount (Factor)
P - Probabilities

The Value 'V' of ending up in a state 's' is equal to the Maximum over all possible Actions 'a' of: the Reward for being in the next State 's' after the Action 'a', PLUS the Discount Factor Gamma 'v' MULTIPLIED by the SUM, over all possible States 's prime' that you could possibly get into, of the Probability of getting into State 's prime' when you're in State 's' and take Action 'a', multiplied by the Value 'V' of 's prime'.
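As a made-up numeric illustration: suppose from state s the best action is UP, the Reward R is 0, the Discount v is 0.9, UP succeeds 80% of the time (10% you slip LEFT, 10% RIGHT), and V(up) = 1.0, V(left) = 0.9, V(right) = 0.81. Then:

V(s) = 0 + 0.9 x (0.8 x 1.0 + 0.1 x 0.9 + 0.1 x 0.81) = 0.9 x 0.971 ≈ 0.87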

RL - Markov Decision Process (MDP)

If, when you select a direction to travel in, there is an 80% chance you'd get there, a 10% chance you'd go right and a 10% chance you'd go left - this is an example of an MDP.

The Values would be impacted as shown below. This now becomes a POLICY as opposed to a PLAN, since you can't guarantee where you'll end up. NOTE the last maze - the AI decided it would point away from the fire rather than risk going into it - it has a 10% chance of going left or right, which is preferable to ending in the fire. RL - Living Penalty

The primary aim here is to apply a cost of living to the Agent - to motivate it to find a solution rather than randomly moving with no associated cost. Below we apply a -0.04 cost to each empty tile - the Agent will accumulate this cost when it moves onto a new tile. RL - Living Penalty

Let’s look at the impact of different Living Penalties.

To the left are 4 examples - let’s see how each of them will impact the Path the Agent takes to the solution. RL - Living Penalty

R(s)=0 There is no penalty to existing - hence the Agent will move around for as long as it needs to without any sense of urgency. It wants to stay out of the fire.

R(s)=-0.04 Now the Agent has a motivation to find the end sooner. It now takes the 10% chance of getting into the Fire to move more quickly to the end. RL - Living Penalty

R(s)=-0.5 The Agent has a strong motivation to get to the end sooner. The two bottom middle squares have changed direction to the shortest possible route to the end.

R(s)=-2.0 The Agent wants to end in the shortest number of moves possible, even if this means jumping into the Fire (since the -1.0 Penalty for jumping into the fire is less than the accumulating -2.0 Living Penalty).

Augmented Random Search (ARS)

An algorithm based on a recently released white paper (Mar 20 2018) titled 'Simple random search provides a competitive approach to reinforcement learning' by Horia Mania, Aurelia Guy and Benjamin Recht from the University of California, Berkeley.

This is a reinforcement learning algorithm of relatively low complexity when compared to other approaches, and it has some quite exceptional results. Over the next few slides we take a look at this approach in depth and how it iterates and learns by utilizing 'shallow learning', i.e. a neural network with no hidden layers - just an input and an output layer.

This approach was used to train numerous MuJoCo environments with impressive results.

The really amazing take-away from this approach is that the solution can be implemented in Python utilizing just the 'numpy' library. In other words, no Deep Learning libraries like Tensorflow or Keras are required. MuJoCo - Multi-Joint dynamics with Contact

Source: www.extremetech.com

Augmented Random Search (ARS)

MuJoCo - Multi-Joint dynamics with Contact

MuJoCo is a physics engine aiming to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed.

Trained in < 5 hours with ARS reinforcement learning written in Python (125 lines of code), executed on a MacBook Pro (2.8 GHz)

Augmented Random Search (ARS)

What is behind ARS? Augmented Random Search (ARS)

Can be represented by a Perceptron?

Augmented Random Search (ARS)

Perceptrons with 2 Outputs Augmented Random Search (ARS)

Perceptron with 4 inputs and 5 outputs Augmented Random Search (ARS)

If you look at how this can be represented...

Inputs are a Vector, Weights a Matrix & Output a Vector.
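In code, that representation is just a matrix-vector product; a sketch with made-up numbers:

```python
import numpy as np

# Perceptron with 4 inputs and 5 outputs: output = weights @ inputs
inputs = np.array([0.5, -0.2, 0.1, 0.9])  # input vector, shape (4,)
weights = np.random.randn(5, 4)           # weight matrix, shape (5, 4)
output = weights @ inputs                 # output vector, shape (5,)
print(output.shape)                       # (5,)
```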

Augmented Random Search (ARS)

Note: ARS uses Shallow Learning, since there are no Hidden Layers. Deep Learning algorithms utilize one or more hidden layers in their neural networks.

The result is simplicity of computation, producing rapid results that are still impressive.

Augmented Random Search (ARS)

How does the model learn?

Each set of actions is called an Episode.

An Episode lasts until: 1) Agent falls / fails 2) A specific duration elapses 3) A specific Goal is reached Augmented Random Search (ARS)

How does the model learn? We Reward the Agent based on its performance of the assigned task.

The longer the Agent continues successfully with its task, e.g. walking, the higher the reward.

If it falls soon after starting, the Reward is lower. Augmented Random Search (ARS)

How does the model learn? We use this Reward to adjust the Weights.

This results in a change in the output values based on the same input values.

We then perform the training again to see what impact changing the weights had on the Agent. Augmented Random Search (ARS)

Method of Finite Differences

Note: The environment provides rewards after every Action - not just at the end of an Episode.

The longer the AI doesn’t fall the higher the reward. The further the AI travelled the higher the reward. The faster the AI travelled the higher the reward etc.

Usually in Artificial Intelligence, the gradient is used (differentiation).

ARS - uses the method of ‘Finite Differences’ (approximate gradient) Augmented Random Search (ARS)

How are the Weights adjusted?

The ARS approach is to take the weights and ADD and SUBTRACT small perturbations (a small change) to them, resulting in 2 new matrices of weights with slightly different values. Augmented Random Search (ARS)

We do this many times. E.g. for the Half Cheetah example we used 16 perturbations (32 matrices in total); we show 4 below: Augmented Random Search (ARS)

Next we run an Episode for each new matrix of Weights. Augmented Random Search (ARS)

The Episode execution for each of the new Weights will result in a Reward.

Some of these perturbations will perform better and some worse than the original Weights. This is why the approach is called Augmented RANDOM Search.

Rd-pos Re-pos Rf-pos Rg-pos

Rd-neg Re-neg Rf-neg Rg-neg Augmented Random Search (ARS)

We now multiply the perturbations by the Reward returned for the execution of that Episode. That means the weight deltas (original weight + perturbation) will count for more or less depending on whether that change had a positive or negative effect on the AI's performance.

Rd-pos Re-pos Rf-pos Rg-pos

Rd-neg Re-neg Rf-neg Rg-neg Augmented Random Search (ARS)

We then add the results from this calculation for all the perturbations to the original Weights. Augmented Random Search (ARS)

Next we run an Episode for each new matrix of Weights. Basic vs Augmented Random Search

Three main updates that make this approach Augmented:

● Scale the update step by the Standard Deviation of Rewards
○ Take the results from the previous slide and divide by the Standard Deviation of the rewards
● Online normalization of states
○ Basically, in real-time we normalize the inputs. E.g. if inputs are between 90 & 100 for some inputs and between 0 and 1 for others, they'd have a different magnitude of impact - even though this may be due to the AI being in a different environment. Hence by normalizing the values against values we've already seen for this training we can have consistent behaviour.
● Discard the directions that yield the lowest Rewards
○ Only the top 'k' perturbations will be used for amending the weights. The weights will evolve in the direction of the most successful results.
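Putting the pieces together, here is a hedged numpy-only sketch of one ARS update step. run_episode is a stand-in for rolling the policy out in the environment, the hyperparameter values are illustrative rather than the paper's, and the online normalization of states is omitted for brevity:

```python
import numpy as np

def run_episode(weights):
    # Stand-in for an environment rollout returning the episode's reward
    return -np.sum(weights ** 2)

n_inputs, n_outputs = 4, 5                  # shallow policy: one weight matrix
weights = np.zeros((n_outputs, n_inputs))
n_deltas, top_k, step_size, noise = 16, 8, 0.02, 0.03

for step in range(100):
    # Perturb the weights in both directions (method of finite differences)
    deltas = [np.random.randn(*weights.shape) for _ in range(n_deltas)]
    r_pos = np.array([run_episode(weights + noise * d) for d in deltas])
    r_neg = np.array([run_episode(weights - noise * d) for d in deltas])

    # Keep only the top 'k' best-performing directions
    order = np.argsort(np.maximum(r_pos, r_neg))[::-1][:top_k]

    # Scale the update step by the standard deviation of the used rewards
    sigma = np.concatenate([r_pos[order], r_neg[order]]).std() + 1e-8
    update = sum((r_pos[i] - r_neg[i]) * deltas[i] for i in order)
    weights += step_size / (top_k * sigma) * update
```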

ARS vs Other AI

ARS:
1. Exploration in the Policy Space
2. Method of Finite Differences
3. Shallow Learning

Other AI:
1. Exploration in the Action Space
2. Gradient Descent Algorithm
3. Usually Deep Learning

Additional Reading - ARS

Simple random search provides a competitive approach to reinforcement learning: https://arxiv.org/pdf/1803.07055.pdf

Evolution Strategies White Paper: https://arxiv.org/pdf/1703.03864.pdf

MuJoCo: http://www.mujoco.org/

Clues for which I search and choose: http://www.argmin.net/2018/03/20/mujocoloco/ Machine Learning - Decision Trees

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes).

In simpler terms, a decision tree checks if an attribute or a set of attributes satisfies a condition, and based on the result of the check, the subsequent checks are performed. The tree splits the data into different parts based on these checks.
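A brief hedged example with scikit-learn, which prints the learned flowchart of attribute tests:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small decision tree on the classic iris dataset
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each internal node tests one attribute; each leaf is a class label
print(export_text(clf, feature_names=iris.feature_names))
```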