Herbert Jaeger

Machine Learning (KIM.ML09)

Lecture Notes V 1.0, January 7, 2020
Master Program in Artificial Intelligence
Rijksuniversiteit Groningen, Bernoulli Institute

Contents

1 Introduction
1.1 Human Versus Machine Learning
1.2 The two super challenges of ML - from an eagle’s eye
1.3 Looking at Human Intelligence, Again
1.4 A Remark on “Modeling”
1.5 The Machine Learning Landscape

2 Decision trees and random forests
2.1 A toy decision tree
2.2 Formalizing “training data”
2.3 Learning decision trees: setting the stage
2.4 Learning decision trees: the core algorithm
2.5 Dealing with overfitting
2.6 Variants and refinements
2.7 Random forests

3 Elementary supervised temporal learning
3.1 Recap: ...
3.2 Temporal learning tasks
3.3 Time series prediction tasks
3.4 State-based vs. signal-based timeseries modeling
3.5 Takens’ theorem

4 Basic methods for dimension reduction
4.1 Set-up, terminology, general remarks
4.2 K-means clustering
4.3 Principal component analysis
4.4 Mathematical properties of PCA and an algorithm to compute PCs
4.5 Summary of PCA based dimension reduction procedure
4.6 Eigendigits
4.7 Self-organizing maps
4.8 Summary discussion. Model reduction, data compression, dimension reduction

5 Discrete symbolic versus continuous real-valued

6 The bias-variance dilemma and how to cope with it
6.1 Training and testing errors
6.2 The menace of overfitting – it’s real, it’s everywhere
6.3 An abstract view on ...
6.4 Tuning model flexibility
6.5 Finding the right modeling flexibility by cross-validation
6.6 Why it is called the bias-variance dilemma

7 Representing and learning distributions
7.1 Optimal classification
7.2 Representing and learning distributions
7.3 Mixture of Gaussians; maximum-likelihood estimates by EM algorithms
7.4 Parzen windows

8 Bayesian model estimation
8.1 The ideas behind frequentist statistics
8.2 The ideas behind Bayesian statistics
8.3 Case study: modeling proteins

9 Sampling algorithms
9.1 What is “sampling”?
9.2 Sampling by transformation from the uniform distribution
9.3 Rejection sampling
9.4 Proto-distributions
9.5 MCMC sampling
9.6 Application example: determining evolutionary trees

10 Graphical models
10.1 Bayesian networks
10.2 Undirected graphical models
10.3 Hidden Markov models

11 Online adaptive modeling
11.1 The adaptive linear combiner
11.2 Basic applications of adaptive linear combiners
11.3 Iterative learning algorithms by gradient descent on performance surfaces
11.4 Stochastic gradient descent with the LMS algorithm

12 Feedforward neural networks: the Multilayer Perceptron
12.1 MLP structure
12.2 Universal approximation and “deep” networks
12.3 Training an MLP with the backpropagation algorithm

A Elementary mathematical structure-forming operations
A.1 Pairs, tuples and indexed families
A.2 Products of sets
A.3 Products of functions

B Joint, conditional and marginal probabilities

C The argmax operator

D Expectation, variance, covariance, and correlation of numerical random variables

E Derivation of Equation 31

1 Introduction

1.1 Human Versus Machine Learning

Humans learn. Animals learn. Societies learn. Machines learn. It looks as if “learning” were a universal phenomenon and all we had to do were to develop a solid scientific theory of “learning”, turn that into algorithms, and then let “learning” happen on computers. Wrong, wrong, wrong. Human learning is very different from animal learning (and amoebas learn different things in different ways than chimpanzees), societal learning is quite another thing than human or animal learning, and machine learning is as different from any of the former as cars are from horses.

Human learning is incredibly scintillating and elusive. It is as complex and impossible to fully understand as you yourself are. Think of all the things you can do: all of your body motions from tying your shoes to playing the guitar; thoughts you can think from “aaagrhhh!” to “I think therefore I am”; achievements personal, social, academic; all the things you can remember, including your first kiss and what you did 20 seconds ago (you started reading this paragraph, in case you forgot); your plans for tomorrow and the next 40 years; well, just everything about you — and almost everything of that wild collection is the result of a fabulous mixing of learning of some kind with other miracles and wonders of life. To fully understand human learning, a scientist would have to integrate at least the following fields and phenomena:

body, brain, sensor & motor architecture · physiology and neurophysiology · body growth · brain development · motion control · exploration, curiosity, play · creativity · social interaction · drill and exercise and rote learning · reward and punishment, pleasure and pain · the universe, the earth, the atmosphere, water, food, caves · evolution · dreaming · remembering · forgetting · aging · other people, living · other people, long dead · machines, tools, buildings, toys · words and sentences · concepts and meanings · letters and books and schools · traditions ...

Recent spectacular advances in machine learning may have nurtured the impression that machines come already somewhat close. Specifically, neural networks with many cascaded internal processing stages (so-called deep networks) have been trained to solve problems that were considered close to impossible only a few years back. A showcase example (one that got me hooked) is automated image captioning (technical report: Kiros et al. (2014)). At http://www.cs.toronto.edu/~nitish/nips2014demo you can find stunning examples of caption phrases that have been automatically generated by a neural network based system which was given photographic images as input. Figure 1 shows some screenshots. This is a demo from 2014. Since the field is evolving incredibly fast, it is already a little outdated today. Other fascinating examples of deep learning are face recognition

(Parkhi et al., 2015), online text translation (Bahdanau et al., 2015), inferring a Turing machine (almost) from input-output examples (Graves et al., 2016), or playing the game of Go at and beyond the level of human grand-masters (Silver et al., 2016). So, apparently machine learning algorithms come close to human performance in several tasks or even surpass humans, and these performance achievements have been learnt by the algorithms — thus, machines today can learn like humans??!?

The answer is NO. ML researchers (the really good ones, not the average TensorFlow user) are highly aware of this. Outside ML, however, naive spectators (from the popular press, politics, or other sciences) often conclude that since learning machines can perform similar feats as humans, they also learn like humans. It takes some effort to argue why this is not so (read Edelman (2015) for a refutation from the perspective of cognitive psychology). I cannot embark on this fascinating discussion at this point. Very roughly speaking, it’s the same story again as with chess-playing algorithms: the best chess programs win against the best human chess players, but not by fair means — chess programs are based on larger amounts of data (recorded chess matches) than humans can memorize, and chess programs can do vastly more computational operations per second than a human can. Brute force wins over human brains at some point when there is enough data and processing bandwidth. Progress has accelerated in the last years because increasingly large training datasets have become available and fast enough computing systems have become cheap enough. This is not to say that powerful “deep learning” just means large datasets and fast machines. These conditions are necessary but not sufficient. In addition, numerous algorithmic refinements and theoretical insights in the area of statistical modeling had to be developed. Some of these algorithmic/theoretical concepts will be presented in this course.

Take-home message: the astonishing learning feats of today’s ML are based on statistical modeling techniques, raw processing power, and a lot of a researcher’s personal experience and trial-and-error optimization. It’s technology and maths, not brain biology or psychology. Dismiss any romantic ideas about ML. ML is stuff for sober engineers. But you are allowed to become very excited about that stuff, and that stuff can move mountains.

1.2 The two super challenges of ML - from an eagle’s eye

In this section I want to explain, on an introductory, pre-mathematical level, that large parts of ML can be understood as the art of estimating probability distributions from data. And that this art faces a double super challenge: the unimaginably complex geometry of real-world data distributions, and the extreme lack of information provided by real-world data. I hope that after reading this section you will be convinced that machine learning is impossible. Since however Siri exists, where is the trick? ... In this way I want to make you read on, like in a crime novel.

Figure 1: Three screenshots from the image caption demo at http://www.cs.toronto.edu/~nitish/nips2014demo. A “deep learning” system was trained on some tens of thousands of photos showing everyday scenes. Each photo in the training set came with a few short captions provided by humans. From these training data, the system learnt to generate tags and captions for new photos. The tags and captions on the left were produced by the trained system upon input of the photos at the right.


1.2.1 The superchallenge of complex geometry

In order to highlight the data geometry challenge, I will use the demo task highlighted in Figure 1 as an example. Let us follow the sacred tradition of ML and first introduce an acronym that is not understandable to outsiders: TICS = the Toronto Image Caption System. The TICS dataset and learning task and demo implementation come from one of the three labs that have pioneered the field which is now known as Deep Learning (DL, our next acronym), namely Geoffrey Hinton’s lab at the University of Toronto. (The other two widely acclaimed DL pioneers are Yoshua Bengio, University of Montréal, and Yann LeCun, Courant Institute of Mathematics, New York; Hinton, Bengio and LeCun received the 2018 Turing award – the Nobel prize in computer science.)

The TICS demo is a beautiful case to illustrate some fundamentals of machine learning. After training, TICS, at first sight, implements a function: test image in, caption (and tags) out. If one looks closer, however, the output is not just a single caption phrase, but a list of several captions (“generated captions” in Figure 1) rank-ordered by probability: captions that TICS thinks are more probable are placed higher in the list. Indeed TICS computes a relative probability value for each suggested caption. This probability value is not shown in the screenshots.

Let us look at this in a little more detail. TICS captions are based on a finite vocabulary of English words. For simplicity let us assume that the length of captions that TICS generates is always 10 words. The neural networks that hum behind the curtains of TICS cannot process “words” – the only data type that they can handle is real-valued vectors. One of the innovations which boosted DL is a method to represent words by vectors such that semantic similarity of words is captured by metric closeness of vectors: the vector representing the word “house” lies close to the vector for “building” but far away from the vector for “mouse”. A landmark paper where you can learn more about this technique is Mikolov et al. (2013); an accessible tutorial is Minnaar (2015). A typical size for such semantic word vectors is a dimension of a few hundred – let’s say, TICS uses 300-dimensional word vectors (I am too lazy while typing this to check it out). Thus, a caption can be represented by a sequence of ten 300-dimensional vectors, which boils down to a single 3000-dimensional vector, that is, a point c ∈ R^3000. Similarly, an input picture sized 600 × 800 pixels with 3 color channels is represented to TICS as a vector u of dimension 3 × 600 × 800 = 1,440,000.

Now TICS generates, upon presentation of an input picture vector u, a list of what it believes to be the five most probable captions for this input. That is, TICS must have an idea of the probability ratios of different caption candidates. Formally, if C denotes the random variable (RV) that returns captions c, and if U denotes the RV which produces sample input pictures u, TICS must compute

ratios of conditional probabilities of the form P(C = c_i | U = u) / P(C = c_j | U = u). In the semantic word vector space, these ratios become ratios of the values of probability density functions (pdfs) over the caption vector space R^3000. For every input image u, TICS must have some representation of the 3000-dimensional pdf describing the probabilities of caption candidates for image u.

Now follow me on a little excursion into geometric imagination. Consider some specific vector c ∈ R^3000 which represents a plausible 10-word caption for image u, that is, the pdf value p(c) is relatively large. What happens to the pdf value if we move away from c by taking a little step δ ∈ R^3000 of length, say, ‖δ‖ = 0.1, that is, how does p(c + δ) compare to p(c)? This depends on the direction in which δ points. In a few directions, p(c + δ) will be about as large as p(c). This happens when δ points toward another caption vector c′ which has one word replaced by another word that is semantically close. For example, consider the caption “A group of people on a bridge beside a boat”. The last 300 elements in the 3000-dimensional vector c coding this caption stand for the word boat. Replacing this caption by “a group of people on a bridge beside a ship” gives another code vector c′ ∈ R^3000 which is the same as c except in the last 300 components, which have been replaced by the semantic word vector for ship. Then, if δ points from c toward c′ (that is, δ is a fraction of c′ − c), p(c + δ) will not differ from p(c) in a major way. If you think a little about it, you will come to the conclusion that such δ which leave p(c + δ) roughly at the same level are always connected with replacing words by semantically related words. Other change directions δ∗ will either make no semantic sense or destroy the grammatical structure connected with c. The pdf value p(c + δ∗) will drop dramatically compared to p(c) in those cases.

Now, in a 10-word caption, how many replacements of some word with a related one exist? Some words will be grammatical function words (“a”, “of”, etc.) which admit only a small number of replacements, or none at all. The words that carry semantic meaning (“group”, “people”, etc.) typically allow for a few sense-making replacements. Let us be generous and assume that a word in a 10-word caption, on average, can be replaced by 5 alternative words such that after the replacement, the new caption still is a reasonable description of the input image. This means that around c there will be 5 · 10 = 50 directions in which the relatively large value of p(c) stays large. Assuming these 50 directions are given by linearly independent δ vectors, we find that around c there is a 50-dimensional affine linear subspace S of R^3000 in which we can find high p values in the vicinity of c, while in the 2950 directions orthogonal to S, the value of p will drop fast if one moves away from c.

Ok., this was a long journey to a single geometric finding: locally around some point c where the pdf is relatively large, the pdf will stay relatively large only in a small fraction of the directions – these directions span a low-dimensional hyperplane around c. If you move a little further away from c on these low-dimensional “sheets”, following the lead of high pdf values, you will find that this

high-probability surface will take you on a curved path – these high-probability sheets in R^3000 are not flat but curved.

The mathematical abstraction of such relatively high-probability, low-dimensional, curved sheets embedded in R^3000 is the concept of a manifold. Machine learning professionals often speak of the “data manifold” in a general way, in order to indicate that the geometry of high-probability areas of real-world pdfs consists of “thin” (low-dimensional), curved, sheet-like domains curled into the embedding data space. It is good for ML students to know the clean mathematical definition of a manifold. Although the geometry of real-world pdfs will be less clean than this mathematical concept, it provides a useful set of intuitions, and advanced research in DL algorithms frequently starts off from considering manifold models. So, here is a brief intro to the mathematical concept of a manifold. Consider an n-dimensional real vector space R^n (for TICS this would be n = 3000). Let m ≤ n be a positive integer not larger than n. An m-dimensional manifold M is a subset of R^n which locally can be smoothly mapped to R^m, that is, at each point of M one can smoothly map a neighborhood of that point to a neighborhood of the origin in the m-dimensional Euclidean coordinate system (Figure 2A). 1-dimensional manifolds are just lines embedded in some higher-dimensional R^n (Figure 2B), 2-dimensional manifolds are surfaces, etc. Manifolds can be wildly curved, knotted (as in Figure 2C), or fragmented (as in Figure 2B). Humans cannot visually imagine manifolds of dimension greater than 2.

The “data manifold” of a real-world data source is not a uniquely or even well-defined thing. For instance, returning to the TICS example, the dimension of the manifold around a caption c would depend on an arbitrary threshold fixed by the researcher – a direction δ would / would not lead out of the manifold if the pdf decreases in that direction faster than this threshold. Also (again using the caption scenario), around some good captions c the number of “good” local change directions will differ from the number around another, equally good, but differently structured caption c′. For these and other reasons, claiming that data distributions are shaped like manifolds is a strongly simplifying abstraction.

Despite this abstraction inaccuracy, the geometric intuitions behind the manifold concept have led to substantial insight into the (mal-)functioning of machine learning methods. Adversarial attacks on deep networks are a good example of the usefulness of the data manifold concept. Take a look at Figure 3 – taken from a much-cited paper (Goodfellow et al., 2014) which explores the phenomenon of adversarial attacks. The left panel shows a photo of a panda (one would think). It is given as input to a deep network that was trained to classify images. The network correctly classifies it as “panda”, with a “confidence” of 58% (this “confidence” level is derived from probability ratios). The middle panel shows a noise pattern. If a very small multiple (factor 0.007) of this noise pattern is added to the panda picture, one gets the picture shown in the right panel. For the human eye there is no perceptible change. However, the neural network now classifies it as “gibbon” with a dead-sure confidence level of 99%.


Figure 2: A An m-dimensional manifold can be locally “charted” by a bijective mapping to a neighborhood of the origin in R^m. The example shows a curved manifold of dimension m = 2 embedded in R^3. B Some examples of 1-dimensional manifolds embedded in R^2 (each color corresponds to one manifold — manifolds need not be connected). C A more wildly curved 2-dimensional manifold in R^3 (the manifold is the surface of this strange body). (Image sources: kahrstrom.com/mathematics/illustrations.php, en.wikipedia.org/wiki/Manifold, www.math.utah.edu/carlson60/.)

What has happened here? Well, this is a manifold story. Let’s say that the panda image is sized 600 × 600 pixels (I didn’t check), with three color channels. Thus, mathematically, such images are points in R^1080000. When the neural network was trained, it was presented with a large number of example images, that is, a point cloud in R^1080000. The outcome of the learning algorithm is an estimate (up to an undetermined scaling factor) of a pdf in R^1080000. Geometrically speaking, this pdf is highly concentrated along a low-dimensional manifold, call it M. The dimension of this manifold is given by the number of neurons in the most narrow layer of the neural network, say, m = 1000 (this would be a standard order of magnitude in such deep networks; I didn’t check). Thus, the original panda picture corresponds to a point u on a 1000-dimensional manifold which is curled into a 1,080,000-dimensional embedding space. In terms of dimensions, the manifold uses only about one out of a thousand dimensions!


Figure 3: Manifold magic: turning a “panda” (classified with 58% confidence) into a “gibbon” (99% confidence). For explanation see text. Picture taken from Goodfellow et al. (2014).

Now, add that noise image (call it δ) to u, getting the rightmost panel picture as u + δ. If δ is prepared in a way that it points in a direction orthogonal to M, the value of the pdf will shrink dramatically. The network, when presented with input u + δ, will “think” that it has never seen anything like this u + δ image, and will return a random classification – “gibbon” in this case. Adversarial examples are today widely used in a number of different ways to improve the quality of deep learning applications. Check out Section 7.13 in the “bible” of deep learning (Goodfellow et al., 2016) for a primer.

The TICS engineers enforced by neural network design that the 1,440,000-dimensional raw input image vectors were projected on a manifold that had only 4096 dimensions. This was achieved by setting up a layered neural network for image preprocessing whose input layer (the “retina” of the network) had 1,440,000 neurons (one neuron per color pixel), and whose output layer had 4096 neurons. The right “folding into shape” of this manifold was effected by the neural network training.

Take-home summary: real-world data distributions are typically concentrated in exceedingly thin (low-dimensional), badly curled-up sheets in the embedding (high-dimensional) data space R^n. Machine learning models, such as the TICS system, must contain some kind of model of this distribution. Such models will also typically feature a concentration of probability mass in very thin, curled sheets. A universal challenge for machine learning is to develop modeling methods that can represent such complex, manifold-like distribution geometries, and “learn” to align the model distribution with the data distribution. A catchword to point to this bundle of ideas and challenges is “data manifold”.
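
Before moving on, here is a minimal numerical sketch of the perturbation idea behind Figure 3, in the spirit of the fast gradient sign method from Goodfellow et al. (2014). It uses a toy linear classifier on random data instead of a deep network and a real photo, so all numbers below are stand-ins, not a claim about the actual panda experiment:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-in for an image classifier: linear class scores s = W u.
    # (Hypothetical sizes; the real panda network is a deep net on ~10^6 pixels.)
    n_dim, n_classes = 100_000, 3
    W = rng.normal(size=(n_classes, n_dim))

    u = rng.normal(size=n_dim)                  # the "clean image"
    scores = W @ u
    clean = int(np.argmax(scores))
    runner_up = int(np.argsort(scores)[-2])

    # Fast-gradient-sign-style attack: nudge every coordinate by +/- eps in the
    # direction that increases (runner-up score minus clean score).
    eps = 0.007                                 # the factor quoted in the text
    grad = W[runner_up] - W[clean]
    u_adv = u + eps * np.sign(grad)

    print("clean prediction:    ", clean)
    print("perturbed prediction:", int(np.argmax(W @ u_adv)))
    print("largest per-pixel change:", np.abs(u_adv - u).max())   # equals eps, tiny

The point of the exercise: a change of only 0.007 per coordinate is invisible in any single coordinate, yet summed over very many coordinates it is typically enough to push the input across a decision boundary.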

1.2.2 The superchallenge of lack of information

I continue to use the TICS system to illustrate this second challenge. What follows will again require an effort in geometrical thinking, but that’s the nature of vector data and the information contained in them.

Figure 4: Two images and their annotations from the training dataset. Taken from Young et al. (2014). (Annotations visible in the figure include “A graying man in a suit is perplexed at a business meeting.” and “A green-shirted man with a butcher’s apron uses a knife to carve out the hanging carcass of a cow.”)

The data used for training TICS contained about 30,000 photos taken from Flickr, each of which was annotated by humans with 5 captions. Figure 4 shows two examples. Each photo was sized 600 × 800 pixels, with three color channels, making a total of 1,440,000 numbers. We will consider the intensity values as normalized to the range [0, 1], which makes each photo a vector in the unit hypercube [0, 1]^1,440,000. Assuming as before that captions can be coded as 3000-dimensional vectors, which we will also take as normalized to a numeric range of [0, 1], a photo-caption pair makes a vector in [0, 1]^1,443,000. In order to simplify the following discussion we assume that each training image came with a single caption only. Thus, the training data consisted in 30,000 points in the 1,443,000-dimensional unit hypercube.

It is impossible for humans to visualize, imagine or even hallucinate the wild ways in which points can be spatially distributed in a 1,443,000-dimensional vector space. In Figure 5 I attempt a visualization of the TICS data scenario in a 3-dimensional projection of this 1,443,000-dimensional space – three dimensions being the largest spatial dimension that our brains can handle. The rendering in Figure 5 corresponds to a dataset where each “photo” is made of merely two grayscale pixels. Each such “photo” is thus a point in the square [0, 1]^2 (light blue area in the figure, spanned by the two pixel intensities x1, x2). Each caption is coded by a single number y ∈ [0, 1]. We furthermore simplify the situation by assuming that the training dataset contains only two photos u1, u2. The caption coding vector of a photo is here reduced to a single number, plotted above the photo’s coordinates on the y-axis (blue crosses). The two blue crosses in the figure thus represent the information contained in the training data.

Now consider a new test image u∗ (orange diamond in the figure). The TICS system must determine a suitable caption for u∗, that is, an appropriate value y∗ – a point somewhere on the orange broken ??-line in the figure. But all that TICS knows about captions is contained in the two training data points (blue crosses). If you think about it, there is no way for TICS to infer from these two points where y∗ should lie. Any placement along the ??-line is logically possible!


Figure 5: Training data for TICS (highly simplified). The photo dataspace (light blue square spanned by x1, x2) is here reduced to 2 dimensions, and the caption dataspace to a single one (y-axis). The training dataset is assumed to contain two photos only (blue diamonds u1, u2), each with one caption (blue crosses with y-values y1, y2). A test image u∗ (orange diamond) must be associated by TICS with a suitable caption, which lies somewhere in the y-direction above u∗ (dashed orange ??-line).

In order to determine a caption position along the ??-line, TICS must add some optimization criterion to the scenario. For instance, one could require one of the following conditions:

1. Make y∗ that point on the ??-line which has the smallest total distance to all the training points.

2. One might wish to grant closer-by training images a bigger impact on the caption determination than further-away training images. Thus, if d(u_i, u∗) denotes the distance between training image u_i and test image u∗, set y∗ to the weighted mean of the training captions y_i, where the weights are inversely proportional to the distance of the respective images (a small code sketch follows after this list):

   y^* = \frac{1}{\sum_i d(u_i, u^*)^{-a}} \sum_i d(u_i, u^*)^{-a} \, y_i .

The parameter a modulates how the impact of further-away images decays with distance.

3. Place a spherical 3-dimensional Gaussian with variance σ^2 around every blue cross. Determine y∗ to be that point on the ??-line where the sum of these Gaussians is maximal.

4. Consider all smoothly curved line segments L in the 3-dimensional data cube in Figure 5 which connect the blue crosses and cut through the ??-line. For any such L, consider the integral γ(L) of absolute curvature along L. Find that curved line segment L_opt which minimizes γ(L). Then declare the crossing of this L_opt with the ??-line to be y∗. – Like in the second proposal above, one would also want to emphasize the impact of close-by images, which would lead to a weighted curvature integral.
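
As announced under criterion 2, here is a minimal sketch of the inverse-distance weighting in the two-pixel toy scenario of Figure 5; the concrete coordinates and caption values are invented for illustration:

    import numpy as np

    def inverse_distance_caption(u_star, train_images, train_captions, a=2.0):
        """Criterion 2: weighted mean of training captions with weights d(u_i, u*)^(-a)."""
        d = np.linalg.norm(train_images - u_star, axis=1)   # distances d(u_i, u*)
        # (a real implementation would catch d == 0, i.e. u* equal to a training image)
        w = d ** (-a)
        return (w @ train_captions) / w.sum()

    # Two training "photos" in [0,1]^2, each with a one-number "caption" y_i.
    train_images = np.array([[0.2, 0.7],    # u_1
                             [0.8, 0.3]])   # u_2
    train_captions = np.array([0.9, 0.1])   # y_1, y_2
    u_star = np.array([0.6, 0.6])           # the test image

    print(inverse_distance_caption(u_star, train_images, train_captions, a=2.0))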

I collect some observations from this list:

• All of these optimization criteria make some intuitive sense, but they would lead to different results. The generated captions will differ depending on which criterion is chosen.

• The criteria rest on quite different intuitions. It is unclear which one would be better than another one, and on what grounds one would compare their relative merits. A note in passing: finding comparison criteria to assess relative merits of different statistical estimation procedures is a task that constitutes an entire, important branch of statistics. For an introduction you might take a look at my lecture notes on “Principles of statistical modeling”, Section 19.3.1 “Comparing statistical procedures” (http://minds.jacobs-university.de/teaching/ln/).

• Criteria 2 and 3 require the choice of design parameters (a, σ^2). The other criteria could be upgraded to include reasonable design parameters too. It is absolutely typical for machine learning algorithms to have such hyperparameters which need to be set by the experimenter. The effects of these hyperparameters can be very strong, determining whether the final solution is brilliant or useless.

• Casting an optimization criterion into a working algorithmic procedure may be challenging. For criteria 1, 2 and 3 in the list, this would be relatively easy. Turning criterion 4 into an algorithm looks fearsome.

• Each of the criteria leads to implicit constraints on the geometric shape of the learnable data manifolds.

Some methods in ML are cleanly derived from optimization criteria, with mathematical theory backing up the design of algorithms. Other branches employ

learning algorithms that are so complex that one loses control over what is actually optimized, and one has only little insight into the geometrical and statistical properties of the models that are delivered by these methods. This is particularly true for deep learning methods.

There is something like an irreducible arbitrariness in choosing an optimization condition, or, if one gives up on fully understanding what’s happening, an arbitrariness in the design of complex learning algorithms. This turns machine learning engineering into a quite personal thing, even into an art.

To put this all in a nutshell: the available training data don’t include the information of how the information contained in them should be “optimally” extracted, or what kind of information should be extracted. This could be called a lack of epistemic information (my private terminology; epistemology is the branch of philosophy that is concerned with the question by which methods of reasoning humans acquire knowledge, and with the question what “knowledge” is in the first place – check out https://en.wikipedia.org/wiki/Epistemology if you are interested in these deep, ancient, unsolved questions). The discussion of lacking epistemic information is hardly pursued in today’s ML, although there have been periods in the last decades when fierce debates on such issues were waged.

But there is also another lack of information which is a standard theme in today’s ML textbooks. This issue is called the curse of dimensionality. I will highlight this curse with our TICS demo again. In mathematical terminology, after training our (simplified) TICS system is able to compute a function f : [0, 1]^2 → [0, 1] which computes a caption f(u∗) for every input test image u∗ ∈ [0, 1]^2. The only information TICS has at learning time is contained in the training data points (blue crosses in our figure). Looking at Figure 5, estimating a function f from just the two blue crosses is clearly a dramatically underdetermined task. You may argue that the version of the TICS learning task which I gave in Figure 5 has been simplified to an unfair degree, and that the real TICS system had not just 2, but 30,000 training images to learn on. But, in fact, for the full-size TICS the situation is even much, much worse than what appears in Figure 5:

• In the caricature demo there were 2 training images scattered in a 2-dimensional (pixel intensity) space. That is, as many data points as dimensions. In contrast, in TICS there are 30,000 images scattered in a 1,440,000-dimensional pixel intensity space: that is, about 50 times more dimensions than data points! How should it be possible at all to estimate a function from [0, 1]^1440000 to [0, 1]^3000 when one has so many fewer training points than dimensions!?

• There is another fact about high-dimensional spaces which aggravates the situation. In the n-dimensional unit hypercube, the greatest possible distance between two points is equal to the longest diagonal in this hypercube,

√n, which amounts to 1200 in the TICS image space. This growth of the length of the main diagonal with dimension n carries over to a growth of average distances between random points in n-dimensional hypercubes. In simple terms, the higher the dimension, the wider the training data points will lie apart from each other.

The two conditions together – fewer data points than dimensions, and large distances between the points – make it appear impossible to estimate a function on [0, 1]^1440000 from just those 30,000 points. This is the dreaded “curse of dimensionality”. In plain English, the curse of dimensionality says that in high-dimensional data spaces, the available training data points will lie exceedingly far apart from each other. Metaphorically speaking, they are a few isolated flickering stars in a vast and empty universe. But most learning tasks require one to fill the empty spaces with information.
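
This effect can be checked numerically. The following minimal sketch draws a fixed number of random points in the unit hypercube [0, 1]^n and measures how far apart they sit as n grows (the sizes are arbitrary, not the TICS data):

    import numpy as np

    rng = np.random.default_rng(0)

    def pairwise_distances(X):
        """Euclidean distance matrix via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y"""
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.sqrt(np.maximum(d2, 0.0))

    N = 1000                                     # fixed number of data points
    for n in (2, 10, 100, 1000, 10000):          # growing dimension
        X = rng.random((N, n))                   # N random points in [0,1]^n
        D = pairwise_distances(X)
        np.fill_diagonal(D, np.inf)
        nn = D.min(axis=1).mean()                # mean distance to the nearest neighbour
        print(f"n = {n:5d}   diagonal sqrt(n) = {np.sqrt(n):7.1f}   "
              f"mean nearest-neighbour distance = {nn:5.2f}")

With the number of points held fixed, even the nearest neighbour of a typical point drifts far away as n grows – exactly the “isolated stars” picture.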

1.3 Looking at Human Intelligence, Again

An adult human has encoded in his/her brain a very useful, amazingly accurate and detailed model of that human’s outer world, which allows the human to almost immediately create a “situation model” of his/her current physical and social environment, based on the current, ever-new sensor input. This situation model is a condensed, meaningful representation of the current environment. This is analogous to the TICS scenario (after training), where the new input is a test image and the situation model corresponds to a distribution over captions.

An adult human has had a childhood and youth’s time to learn most of his/her world model. The “training data” are the sensory impressions collected in, say, the first 25 years of the human’s life. Thinking of a sensory impression as a vector, it is hard to say what the dimension of this vector is. A lower bound could be found in the number of sensory neural fibers reaching the brain. The optical nerve has about 1,000,000 fibres. Having two of them, plus all sorts of other sensory fibers reaching the brain (from the nose, ears, body), let us boldly declare that the dimension of sensor inputs to the brain is 3 Mio. Now the learning period for a human differs from TICS’s learning period in that sensory input arrives in a continuous stream, not in isolated training images. In order to make the two scenarios comparable, assume furthermore that a human organizes the continuous input stream into “snapshots” at a rate of 1 snapshot “picture” of the current sensory input per second (cognitive psychologists will tell us that the rate is actually less). In the course of 25 years, with 12 wake hours per day, this makes about 25 × 360 × 12 × 3600 ≈ 390 Mio “training snapshot images”. This gives a ratio of number of training points over data dimension of 390 Mio / 3 Mio = 130, which looks so much better than the ratio of 30,000 / 1,443,000 ≈ 0.02 in TICS’s case. But even the ratio of 130 data points per dimension is still hopeless, and the curse of dimensionality strikes just as well. Why?

For simplicity we again assume that the 3-Mio-dimensional sensor image vectors are normalized to a range of [0, 1], making a sensor impression a point in [0, 1]^3000000. Statisticians, machine learners, and many cognitive scientists would tell us that the “world model” of a human can be considered to be a probability distribution over the sensory image space [0, 1]^3000000. This is a hypercube with 2^3000000 corners. In order to estimate a probability distribution over an n-dimensional hypercube, a statistician would demand to have many more data points than the cube has corners (to see why, think about estimating a probability distribution over the one-dimensional unit hypercube [0, 1] from data points on that interval, then extrapolate to higher dimensions). That is, a brain equipped with ordinary statistical methods would demand to have many more than 2^3000000 training data points to distil a world model from that collection of experiences. But there are only 390 Mio such datapoints collected in 25 years. The ratio 390 Mio / 2^3000000 is about 2^−2999972. That is, a human would have to have a youth lifetime of about 25 × 2^2999972 ≈ 2^2999977 years in order to learn a statistically defendable world model.

Still, the human brain (and the human around it, with the world around the human around that brain) somehow can do it in 25 years. Cognitive scientists believe that the key to this magic lies in the evolutionary history of man. Through millions of years of incremental, evolutionary brain structure optimization, starting from worms’ brains or earlier, the human brain is pre-structured in exactly such ways that it comes with a built-in-by-birth data manifold geometry which reduces the 3-Mio-dimensional raw sensor data format to a much lower-dimensional manifold surface. Then, 390 Mio data points may be enough to cover this manifold densely enough for meaningful distribution estimates. The question which built-in experience pre-shapings a human brings to the table at birth has a long tradition in philosophy and psychology. A recent line of work that brings this tradition to bear on machine learning is the research of Joshua Tenenbaum – check out, for instance, Tenenbaum et al. (2006) if you are interested.

1.4 A Remark on “Modeling”

Machine learning in its modern form aims at modeling real-world systems, just like the natural sciences do. But the motivation is different. Physicists, chemists, biologists et al. want to understand reality through their models. Their models should tell the truth – in suitable abstraction – about reality; the models should be veridical (from Latin, “saying the truth”). The inner workings of a model should reflect the inner workings of reality. For instance, a detailed chemical model of a reaction which changes a substance A into a substance B should give an account of the kinetic and quantum-mechanical substeps that actually take place in this reaction. Newton’s laws of motion explain the dance of planets around the sun by formulas made of variables which should correspond to

physically real quantities – like gravitational forces, masses or velocities. Models whose inner mechanisms or mathematical variables are intended to capture real mechanisms and real quantities are called analytical models.

Machine learners, in contrast, are application-oriented. They want useful models. ML models of pieces of reality must function well in their respective application context. The inner workings of an ML model need not mirror the inner workings of the modeled system. Machine learning models are thus almost always blackbox models (a possible exception being Bayesian networks, which will be treated in Session 9 of this course). A blackbox model of some real-world system just captures the externally observable input-output behavior of the modeled system, but it may use any internal math or algorithm tricks that its designer can think of. Neural networks are a striking example. A neural network trained for predicting the next day’s stock index will be made of hundreds or even millions of interacting variables (“neuron activations”) which have no corresponding counterpart in the reality of stock markets.

Figure 6 sketches how blackbox models work. They are derived (“learnt”, “estimated”) from data emitted by the source system, and they should generate synthetic data that “fit” (have a similar distribution as) the real-world data. And that’s it. The structure of the blackbox model need not correspond in any way with the structure of the source system – in Figure 6, the source system is a robot, while the model is a neural network whose structure has no correspondence whatsoever in the robot design.

Figure 6: How blackbox models work. For explanation see text.

Analytical and blackbox models have complementary merits:

• Setting up the structure of an analytical model requires the modeler to have insight into the concerned laws of nature – an expert’s business. A blackbox model can be successfully created by a village idiot with access to TensorFlow.

• A blackbox model requires training data. There are cases where such data are not available in sufficient quantity. Then analytical modeling is the only way to go.

• As a rule, analytical models are much more compact than blackbox models. Compare E = mc^2 with the TICS system, which consists of several modules, some of which are neural networks with hundreds of thousands of variables.

• When the target system that one wants to model is very complex, there is typically no chance for an analytical model. Example: an analytical model of how humans describe images in captions would have to include accounts of the human brain and the cultural conditions of language use. No way. Only blackbox models can be used here.

• The great – really very great – advantage of analytical models is that they generalize to all situations within their scope. The laws of gravitation can be applied to falling apples as well as to the majestic whirling of galaxies. If a blackbox model had been trained by Newton on data collected from the motions of the sun and planets, this model would be applicable exclusively to our planetary system, not to apples and not to galaxies.

An interesting and very relevant modeling task is to model the earth’s atmosphere for weather forecasting. This modeling problem has been intensely worked on for many decades, and an interesting mix of analytical and blackbox methods marks the state of the art. If there is time and interest I will expand on this in the tutorial session.

I conclude this section with a tale from my professional life which nicely illustrates the difference between analytical and blackbox modeling. I was once called to consult a company in the chemical industry. They wanted a machine learning solution for the following problem. In one of their factories they had built a production line whose output was a certain artificial resin. Imagine a large hall full of vessels, heaters, pumps and valves and tubes. The production process of that resin was prone to a particularly nasty possible failure: if the process was not controlled correctly, some intermediate products might solidify. The concerned vessels, tubes and valves would be blocked and could not be cleared – necessitating disassembling the facility and replacing the congested parts with new ones. Very expensive! The chemical engineers had heard of the magic of neural networks and wanted one of these, which should give them early warnings if the production process was in danger of drifting into the danger zone. I told them that this was (maybe) possible if they could provide training data. What training data, please?

Well, in order to predict failure, a neural network needs examples of this failure. So, could the engineers please run the facility through a reasonably large number of solidifications, a few hundred maybe for good statistics? Obviously, that was that. Only analytical modeling would do here. A good analytical model would be able to predict any kind of imminent solidification situation. But that wasn’t an option either, because the entire production process was too complex for an accurate enough analytical model. Now put yourself into the skin of the responsible chief engineer. What should he do to prevent the dreaded solidification from happening, ever? Another nice discussion item for our tutorial session.

1.5 The Machine Learning Landscape

ML as a field which perceives itself as a field under this name is relatively young, say, about 40 years. It is interdisciplinary and has historical and methodological connections to neuroscience, cognitive science, linguistics, mathematical statistics, AI, signal processing and control, and pattern recognition (the latter is a traditional academic subdiscipline of computer science); it uses mathematical methods from statistics (of course), signal processing and control, dynamical systems theory, mathematical logic and numerical mathematics; and it has a very wide span of application types. This diversity in traditions, methods and applications makes it difficult to study “Machine Learning”. Any given textbook, even if it is very thick, will render the individual author’s specific view and knowledge of the field and will be partially blind to other perspectives. This is quite different from other areas in CS, say for example formal languages / theory of computation / computational complexity, where a widely shared repertoire of standard themes and methods cleanly defines the field.

I have a brother, Manfred Jaeger, who is a machine learning professor at Aalborg University (http://people.cs.aau.dk/~jaeger/). We naturally often talk with each other, but never about ML, because I wouldn’t understand what he is doing and vice versa. We have never met at scientific conferences because we attend different ones, and we publish in different journals.

The leading metaphors of ML have changed over the few decades of the field’s existence. The main shift, as I see it, was from “cognitive modeling” to “statistical modeling”. In the 1970s and 1980s, a main research motif/metaphor in ML (which was hardly named like that then) was to mimic human learning on computers, which connected ML to AI, cognitive science and neuroscience. While these connections persist to this day, the mainstream self-perception of the field today is to view it very soberly as the craft of estimating complex probability distributions with efficient algorithms and powerful computers.

My personal map of the ML landscape divides it into four main segments with distinct academic communities, research goals and methods:

Segment 1: Theoretical ML. Here one asks what are the fundamental possibilities and limitations of inferring knowledge from observation data. This

is the most abstract and “pure maths” strand of ML. There are cross-connections to the theory of computational complexity. Practical applicability of results and efficient algorithms are secondary. Check out https://en.wikipedia.org/wiki/Computational_learning_theory and https://en.wikipedia.org/wiki/Statistical_learning_theory for an impression of this line of research.

Segment 2: Symbolic-logic learning. Here the goal is to infer symbolic knowledge from data, to extract logical rules from data, to infer facts about the real world expressed in fragments of first-order logic or other logic formalisms, often enriched with probabilistic information. Neural networks are rarely used. A main motif is that the resulting models be human-understandable and directly useful for human end-users. Key terms are “knowledge discovery”, “data mining”, or “automated knowledge base construction”. This is the area of my brother’s research. Check out https://en.wikipedia.org/wiki/Data_mining or https://en.wikipedia.org/wiki/Inductive_logic_programming or Suchanek et al. (2013) to get the flavor. This is an application-driven field, with applications e.g. in bioinformatics, drug discovery, web mining, document analysis, and decision support systems. A beautiful case study is the PaleoDeepDive project described in Peters et al. (2014). This large-scale project aimed at making paleontological knowledge easily searchable and more reliable. Palaeontology is the science of extinct animal species. Its “raw data” are fossil bones. It is obviously difficult to reliably classify a handful of freshly excavated bones as belonging to a particular species – first, because one usually doesn’t dig out a complete skeleton, and second, because extinct species are not known in the first place. The field is plagued by misclassifications and terminological uncertainties – often a newly found set of bones is believed to belong to a newly discovered species, for which a new name is created, although in reality other fossil findings already named differently belong to the same species. In the PaleoDeepDive project, the web was crawled to retrieve virtually all scientific pdf documents relating to paleontology – including documents that had been published in pre-digital times and were just image scans. Using optical character recognition and image analysis methods at the front end, these documents were made machine readable, including information contained in tables and images. Then, unsupervised, logic-based methods were used to identify suspects for double naming of the same species, and also the opposite: single names for distinct species – an important contribution to purging the evolutionary tree of the animal kingdom.

Segment 3: Signal and pattern modeling. This is the most diverse sector in my private partition of ML and it is difficult to characterize globally. The basic attitude here is one of quantitative-numerical blackbox modeling. Our

TICS demo would go here. The raw data are mostly numerical (like physical measurement timeseries, audio signals, images and video). When they are symbolic (texts in particular), one of the first processing steps typically encodes symbols into some numerical vector format. Neural networks are widely used and there are some connections to computational neuroscience. The general goal is to distil from raw data a numerical representation (often implicit) of the data distribution which lends itself to efficient application purposes, like pattern classification, time series prediction, or motor control, to name a few. Human-user interpretability of the distribution representation is not easy to attain, but has lately become an important subject of research. Like Segment 2, this field is decidedly application-driven. Under the catchterm “deep learning” a subfield of this area has recently received a lot of attention.

Segment 4: Agent modeling and reinforcement learning. The overarching goal here is to model entire intelligent agents — humans, animals, robots, software agents — that behave purposefully in complex dynamical environments. Besides learning, themes like motor control, sensor processing, decision making, motivation, knowledge representation, and communication are investigated. An important kind of learning that is relevant for agents is reinforcement learning — that is, an agent is optimizing its action decision-making in a lifetime history based on reward and punishment signals. The outcome of research often is agent architectures: complex, multi-module “box-and-wiring diagrams” for autonomous intelligent systems. This is likely the most interdisciplinary corner of ML, with strong connections to cognitive science, the cognitive neurosciences, AI, robotics, artificial life, ethology, and philosophy.

It is hard to judge how “big” these four segments are in mutual comparison. Surely Segment 1 receives much less funding and is pursued by substantially fewer researchers than Segments 2 and 3. In this material sense, Segments 2 and 3 are both “big”. Segment 4 is bigger than Segment 1 but smaller than 2 or 3. My own research lies in 3 and 4. In this course I focus on the third segment — you should be aware that you only get a partial glimpse of ML.

A common subdivision of ML, partly orthogonal to my private 4-section partition, is based on three fundamental kinds of learning tasks:

Supervised learning. Training data are “labelled pairs” (x_n, y_n), where x is some kind of “input” and y is some kind of “target output” or “desired / correct output”. TICS is a typical example, where the x_n are images and the y_n are captions. The learning objective is to obtain a mechanism which, when fed with new test inputs x_test, returns outputs y_test that generalize in a meaningful way from the training sample. The underlying mathematical task is to estimate the conditional distributions P_{Y|X} from the training sample (check the end of Appendix B for a brief explanation of this notation).

The learnt input-output mechanism is “good” to the extent that upon input x_test it generates outputs that are distributed according to the true conditional distribution P(Y = y | X = x_test), just as we have seen in the TICS demo. Typical and important special cases of supervised learning are pattern classification (the y are correct class labels for the input patterns x) and timeseries prediction (the y are correct continuations of initial timeseries x). Segment 3 from my private segmentation of ML is the typical stage for supervised learning.
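
As a minimal concrete instance of estimating P(Y = y | X = x_test) from labelled pairs (this is not the method TICS uses, and the data are invented), here is a k-nearest-neighbour sketch that returns the empirical class frequencies among the k training points closest to a test input:

    import numpy as np
    from collections import Counter

    def knn_class_distribution(x_test, X_train, y_train, k=5):
        """Estimate P(Y = y | X = x_test) by the class frequencies of the k nearest training points."""
        distances = np.linalg.norm(X_train - x_test, axis=1)
        nearest_labels = y_train[np.argsort(distances)[:k]]
        counts = Counter(nearest_labels.tolist())
        return {label: n / k for label, n in counts.items()}

    # Toy data: two classes of points in the plane.
    rng = np.random.default_rng(1)
    X_train = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
                         rng.normal([1.0, 1.0], 0.5, size=(50, 2))])
    y_train = np.array([0] * 50 + [1] * 50)

    print(knn_class_distribution(np.array([0.5, 0.5]), X_train, y_train, k=7))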

Unsupervised learning. Training data are just data points x_n. The task is to discover some kind of “structure” (regularities, symmetries, redundancies, ...) in the data distribution which can be used to create a compressed representation of the data. This can become very challenging when data points are high-dimensional and/or when the distribution has a complex shape. Unsupervised learning is often used for dimension reduction. The result of an unsupervised learning process is a dimension-reducing, encoding function e which takes high-dimensional data points x as inputs and returns low-dimensional encodings e(x). This encoding should preserve most of the information contained in the original inputs x. That is, there should also exist a decoding function d which takes encodings e(x) as inputs and transforms them back to the high-dimensional format of x. The overall loss in the encoding-decoding process should be small, that is, one wishes to obtain x ≈ d(e(x)). A discovery of underlying rules and regularities is the typical goal for data mining applications, hence unsupervised learning is the main mode for Segment 2 from my private dissection of ML. Unsupervised methods are often used for data preprocessing in other ML scenarios, because most ML techniques suffer from curse-of-dimensionality effects and work better with lower-dimensional input data formats.
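
To make the encode/decode picture concrete, here is a minimal sketch that uses principal component analysis (treated in Section 4.3) as one possible choice for the pair e and d; the data and dimension numbers are invented for illustration:

    import numpy as np

    def fit_pca_codec(X, m):
        """Fit an m-dimensional PCA compression to the rows of X; return (encode, decode)."""
        mean = X.mean(axis=0)
        # Principal directions = leading right singular vectors of the centered data matrix.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        V = Vt[:m].T                                # n x m matrix of the first m principal components

        def encode(x):                              # e : R^n -> R^m
            return (x - mean) @ V

        def decode(c):                              # d : R^m -> R^n
            return mean + c @ V.T

        return encode, decode

    # Toy data: 200 points in R^50 that lie close to a 3-dimensional subspace (plus a little noise).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

    e, d = fit_pca_codec(X, m=3)
    x = X[0]
    print("reconstruction error ||x - d(e(x))|| =", np.linalg.norm(x - d(e(x))))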

Reinforcement learning. The set-up for reinforcement learning (RL) is quite distinct from the above two. It is always related to an agent that can choose between different actions which in turn change the state of the environment the agent is in, and furthermore the agent may or may not receive rewards in certain environment states. RL thus involves at least the following three types of random variables:

• action random variables A,
• world state random variables S,
• reward random variables R.

In most cases the agent is modeled as a stochastic process: a temporal sequence of actions A_1, A_2, ... leads to a sequence of world states S_1, S_2, ..., which are associated with rewards R_1, R_2, .... The objective of RL is to learn a strategy (called a policy in RL) for choosing actions that maximize

the reward accumulated over time. Mathematically, a policy is a conditional distribution of the kind

P(A_n = a_n | S_1 = s_1, ..., S_{n−1} = s_{n−1}; A_1 = a_1, ..., A_{n−1} = a_{n−1}),

that is, the next action is chosen on the basis of the “lifetime experience” of previous actions and the resulting world states. RL is naturally connected to my Segment 4. Furthermore there are strong ties to neuroscience, because neuroscientists have reason to believe that individual neurons in a brain can adapt their functioning on the basis of neural or hormonal reward signals. Last but not least, RL has intimate mathematical connections to a classical subfield of control engineering called optimal control, where the (engineering) objective is to steer some system in a way that some long-term objective is optimized. An advanced textbook example is to steer an interplanetary missile from earth to some other planet such that fuel consumption is minimized. Actions here are navigation maneuvers, the (negative) reward is fuel consumption, and the world state is the missile’s position and velocity in interplanetary space.
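
To make the notion of a policy a bit more tangible, here is a minimal sketch of a tabular stochastic policy. For simplicity it conditions only on the current world state rather than on the full action-state history, and the state names and probabilities are invented:

    import numpy as np

    rng = np.random.default_rng(0)

    # A tabular stochastic policy: for every world state, a probability distribution over actions.
    actions = ["left", "right", "thrust"]
    policy = {
        "far_from_goal": [0.1, 0.1, 0.8],   # mostly thrust while far away
        "near_goal":     [0.4, 0.4, 0.2],   # mostly steer when close
    }

    def choose_action(state):
        """Sample an action A_n from P(A_n = a | current state)."""
        return str(rng.choice(actions, p=policy[state]))

    print([choose_action("far_from_goal") for _ in range(5)])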

The distinction between supervised and unsupervised learning is not clear-cut. Training tasks that are globally supervised (like TICS) may benefit from, or plainly require, unsupervised learning subroutines for transforming raw data into meaningfully compressed formats (like we saw in TICS). Conversely, globally unsupervised training mechanisms may contain supervised subroutines where intermediate “targets” y are introduced by the learning system. Furthermore, today’s advanced ML applications often make use of semi-supervised training schemes. In such approaches, the original task is supervised: learn some input-output model from labelled data (x_n, y_n). This learning task may strongly benefit from including additional unlabelled input training points x̃_m, helping to distil a more detailed model of the input distribution P_X than would be possible on the basis of only the original x_n data. Again, TICS is an example: the data engineers who trained TICS used 70K un-captioned images in addition to the 30K captioned images to identify that 4096-dimensional manifold more accurately.

Also, reinforcement learning is not independent of supervised and unsupervised learning. A good RL scheme often involves supervised or unsupervised learning subroutines. For instance, an agent trying to find a good policy will benefit from data compression (= unsupervised learning) when the world states are high-dimensional; and an agent will be more capable of choosing good actions if it possesses an input-output model (= supervised learning) of the environment — inputs are actions, outputs are next states.

2 Decision trees and random forests

This section describes a class of machine learning models which is classical and simple and intuitive and useful. Decision trees and random forests are not super "fashionable" in these deep learning times, but practitioners in data analysis use them on a daily basis. The main inventions in this field were made roughly between 1980 and 2000. In this chapter I rely heavily on the chapters in the classical ML textbooks of Mitchell (1997) and Duda et al. (2001), and for the random forest part my source is the landmark paper Breiman (2001) which, as per today (Nov 6, 2019), has been cited 51022 times (Google Scholar). Good stuff to know, apparently.

Note: In this chapter of the lecture notes I will largely adhere to the notation used in Breiman (2001) in order to make it easier for you to read that key paper if you want to dig deeper. If in your professional future you want to use decision tree methods, you will invariably use them in random forests, and in order to understand what you are actually doing, you will have to read Breiman (2001) (like 50,000 others before you). Unfortunately, the notation used by Breiman is inconsistent, which makes the mathematical part of that paper hard to understand. My hunch is that most of the 50,000 readers skipped the two mathy sections and read only the experimental sections with the concrete hints for algorithms and gorgeous results. Inconsistent or even plainly incorrect mathematical notation (and mathematical thinking) happens a lot in "engineering applied math" papers and makes it difficult to really understand what the author wants to say (see my ramblings in my lecture notes on "Principles of Statistical Modeling", Chapter 14, "A note on what you find in textbooks", online at http://minds.jacobs-university.de/teaching/ln/). Therefore I will not use Breiman's notation one-to-one, but modify it a little to make it mathematically correct.

2.1 A toy decision tree

Figure 7 shows a simple decision tree. It can be used to classify fruit. If you have a fruit in your hands that you don't know by name, then you can use this decision tree to find out what fruit you have there, in an obvious way:

• Start at the root node of the tree. The root node is labeled with a property that fruit have (color? is the root property used in the figure).

• Underneath the root node you find child nodes, one for each of the three possible color attributes (green, yellow, red). Decide which color your fruit has and proceed to that child node. Let us assume that your fruit was yellow. Then you are now at the child node labelled shape?.

• Continue this game of moving downwards in the tree, taking at each node the direction that corresponds to the attributes of the fruit in

your hand. If you reach a leaf node of this tree, it will be labeled with the type of your fruit. If your fruit is yellow, round and small, it's a lemon!

Figure 7: A simple decision tree. Taken from Duda et al. (2001).

A decision tree is not unique. You could, for instance, create another tree whose root node queries the size? property, and which would lead to exactly the same ultimate decision as the tree shown in the figure (exercise!). If each internal (= non-leaf) node in a decision tree has exactly two child nodes, one speaks of a binary decision tree. These are particularly pleasant for mathematical analysis and algorithm design. Every decision tree can be transformed into an equivalent binary decision tree with an obvious procedure. Figure 8 shows the binary tree obtained from the fruit tree of Figure 7.

This is obviously a toy example. But decision trees can become really large and useful. A classical example are the botanic field guide books used by biologists to classify plants (Figure 9). I am a hobby botanist and own three such decision trees. Together they weigh about 1.5 kg and I have been using them ever since I was eighteen or so.

In the basic versions of decision tree learning, trees are learnt from data which are given as class-labeled attribute vectors. For instance, training data for our fruit tree might look like this:

nr.    Color   Size    Taste   Weight   Shape   Class
1      green   big     sweet   heavy    round   Watermelon
2      green   ?       ?       heavy    round   Watermelon
3      red     small   sweet   light    round   Cherry
4      red     small   sweet   light    round   Apple
...
3000   red     big     ?       medium   round   Apple

Figure 8: A binary version of the decision tree shown in Figure 7. Taken from Duda et al. (2001).

Here we get a first inkling that decision tree learning might not be as trivial as the final result in Figure 7 makes it appear. The above fruit training data table has missing values (marked by "?"); not all properties from the training data are used in the decision tree (the property "Weight" is ignored); some examples of the same class have different attribute vectors (the first two rows give different characterizations of watermelons); some identical attribute vectors have different classes (rows 3 and 4). In summary: real-world training data will be partly redundant, partly inconsistent, and will contain errors and gaps. All of this points in the same direction: statistical analyses will be needed to learn decision trees.

2.2 Formalizing "training data"

Before we describe decision tree learning, I want to squeeze in a little rehearsal of mathematical concepts and their notation. In my notation in this chapter I will be leaning on Breiman's notation as far as possible. Breiman's notation is unfortunately incomplete and inconsistent. Here I use a complete, correct notation, adhering to the standards of mathematical probability theory. I am aware that many participants of this course will be unfamiliar with probability theory and its formal notation standards. Because machine learning ultimately rests on probability theory, I strongly recommend that at some point you learn to master the basics of probability theory (for instance, from my lecture notes on the legacy course "Principles of Statistical Modeling", online: http://minds.jacobs-university.de/teaching/ln/). Only if you understand probability can you get a full grasp of machine learning. But it is clear that many students will not have the time or inclination to learn about probability theory.

28 Figure 9: Another decision tree. The right image shows a page from this botany field guide. What you see on this page is, in fact, a small section of a large binary classification tree. Image source: iberlibro.com, book- looker.de.

Therefore I will try to introduce notation and concepts in two ways throughout this course: in a rigorous "true math" version, and in an intuitive "makes some sense" version.

I called the fruit attribute data table above a sample. Samples are mathematical objects of key importance in statistics and machine learning (where they are also called "training data"). Samples are always connected with random variables (RVs). Here is how.

First, the intuitive version. As an empirical fruit scientist, you would obtain a "random draw" to get the training data table in a very concrete way: you would go to the fruit market, collect 3000 fruit "at random", observe and note down their color, size, taste, weight and shape attributes in an Excel table, and for each of the 3000 fruit you also ask the fruit vendor for the name of the fruit to get the

correct class label which you also note down in the last column of the table. The mathematical representation of observing and noting down the attributes of the i-th fruit that you have picked (where i = 1, ..., 3000) is $X_i(\omega)$. $X_i$ is the random variable which is the mathematical model of the observation procedure – $X_i$ could be described as the procedure "pick a random fruit and observe and report the attribute vector". The ω in $X_i(\omega)$ is the occasion when you actually executed the procedure $X_i$ – say, ω stands for the concrete data collection event when you went to the Vismarkt in Groningen last Tuesday and visited the fruit stands. If the entire procedure of collecting 3000 fruit specimens is executed on another occasion – for instance, a week earlier, or by your friend on the same day – this would be mathematically represented by another ω. For instance, $X_i(\omega)$ might be the attribute vector that you observed for the i-th fruit last Tuesday, $X_i(\omega')$ would be the attribute vector that you observed for the i-th fruit one week earlier, $X_i(\omega'')$ would be the attribute vector that your friend observed for the i-th fruit last Tuesday when he did the whole thing in parallel with you, etc. In mathematical terminology, these "observation occasions" ω are called elementary events. Similarly, $Y_i(\omega)$ is the fruit class name that you were told by the vendor when you did the data sampling last Tuesday; $Y_i(\omega')$ would be the i-th fruit name you were told when you did the whole exercise a week earlier already, etc.

The three thousand $(X_i(\omega), Y_i(\omega))$ correspond to the rows in the data table. I will call each of these $(X_i(\omega), Y_i(\omega))$ a data point. To get the entire table as a single mathematical object – namely, the sample – one combines these single fruit data points by a product operation, obtaining $(\mathbf{X}(\omega), \mathbf{Y}(\omega))$, where $\mathbf{X} = \bigotimes_{i=1,\ldots,N} X_i$ and $\mathbf{Y} = \bigotimes_{i=1,\ldots,N} Y_i$.

And here is the rigorous probability theory account of the $(\mathbf{X}(\omega), \mathbf{Y}(\omega))$ notation. As always in probability theory and statistics, we have an underlying probability space $(\Omega, \mathcal{A}, P)$. In this structure, Ω is the set of all possible elementary events ω ∈ Ω, $\mathcal{A}$ is a σ-field on Ω, and P is a probability measure on $(\Omega, \mathcal{A})$. The random variable $(\mathbf{X}, \mathbf{Y})$ is a function which returns samples, that is, collections of N training data points. We assume that all samples which could be drawn have N data points; this assumption is a matter of convenience and simplifies the ensuing notation and mathematical analysis a lot. – It is a good exercise to formalize the structure of a sample in more detail. A single data point consists of an attribute vector x and a class label y. In formal notation, we have m properties $Q_1, \ldots, Q_m$, and property $Q_i$ has a set $A_i$ of possible attribute values. Thus, $x \in A_1 \times \ldots \times A_m$. We denote the set of possible class labels by C. A data point $(X_i(\omega), Y_i(\omega))$ is thus an element of the set $A_1 \times \ldots \times A_m \times C$. See Appendix A for the mathematical notation used in creating product data structures. We denote the sample space of any random variable Z by $S_Z$. The sample space for $X_i$ is thus $S_{X_i} = A_1 \times \ldots \times A_m$, and for $Y_i$ it is $S_{Y_i} = C$. Because $S_{X_i} = S_{X_j}$ and $S_{Y_i} = S_{Y_j}$ for all $1 \le i, j \le N$, for simplicity we also write $S_X$ for $A_1 \times \ldots \times A_m$ and $S_Y$ for C. The entire training data table is an element of the sample space $S_{\mathbf{X} \otimes \mathbf{Y}}$ of the product RV $\mathbf{X} \otimes \mathbf{Y} = (\bigotimes_{i=1,\ldots,N} X_i) \otimes (\bigotimes_{i=1,\ldots,N} Y_i)$. Just to exercise formalization

skills: do you see that

$S_{\mathbf{X}} \times S_{\mathbf{Y}} = (A_1 \times \ldots \times A_m)^N \times C^N = (A_1 \times \ldots \times A_m \times C)^N = S_{\mathbf{X} \otimes \mathbf{Y}}$ ?

To round off our rehearsal of elementary concepts and notation, I repeat the basic connections between random variables, probability spaces and sample spaces: a random variable Z always comes with an underlying probability space $(\Omega, \mathcal{A}, P)$ and a sample space $S_Z$. The RV is a function $Z : \Omega \to S_Z$, and it induces a probability distribution on $S_Z$. This distribution is denoted by $P_Z$.

I torture you with these exercises in notation like a Russian piano teacher tortures his students with technical finger-exercising études. It is technical and mechanical, and the suffering student may not happily appreciate the necessity of it all – and yet this torture is a precondition for becoming a virtuoso. The music of machine learning is played in tunes of probability, no escape.

I am aware that, if you did not know these probability concepts before, this condensed rehearsal cannot possibly be understandable. Probability theory is one of the most difficult-to-learn areas of mathematics, and it takes weeks of digesting and exercising to embrace the concepts of probability spaces, random variables and sample spaces (not to speak of σ-fields). I will give a condensed probability basics tutorial in the tutorial sessions accompanying this course if there is a demand. Besides that, I want to recommend my lecture notes "Principles of Statistical Modeling" which I mentioned before. They document a graduate course whose main purpose was to give a detailed, understandable, slow enough, fully correct introduction to the concepts of probability theory and statistics.

2.3 Learning decision trees: setting the stage

After this excursion into the proper use of notation, we can start to discuss decision tree learning. Despite the innocent looks of decision trees, no universally "optimal" method for learning them from training data is known. But the basic options one has in designing a decision tree learning algorithm are known. There are a few key decisions the algorithm designer has to make in order to assemble a learning algorithm for decision trees, and standard algorithmic building blocks for each of the design decisions are known. We will now present these design options in turn.

Two things must be given before a learning algorithm can be assembled and learning can start: (A) the training data, and (B) an optimization criterion. While (A) is obvious, (B) deserves a few words of explanation. We observe that decision tree learning is a case of supervised learning. The training data is a collection of labeled samples $(x_i, y_i)_{i=1,\ldots,N} = (X_i(\omega), Y_i(\omega))_{i=1,\ldots,N}$, where $x_i \in A_1 \times \ldots \times A_m = S_X$ is an attribute vector and $y_i \in C = S_Y$ is a class label. A decision tree represents a function $h : S_X \to S_Y$ in an obvious way: using the sequential decision procedure outlined in Section 2.1, a vector of observed attributes leads you from the root node to some leaf, and that leaf gives a class label.

In statistics, such functions $h : S_X \to S_Y$ are generally called decision functions (also in other supervised learning settings that do not involve decision trees), a term that I will also sometimes use. Decision tree learning is an instance of the general case of supervised learning: the training data are labeled pairs $(x_i, y_i)_{i=1,\ldots,N}$, with $x_i \in S_X$, $y_i \in S_Y$, and the learning aims at finding a decision function $h : S_X \to S_Y$ which is "optimal" in some sense.

In what sense can such a function $h : S_X \to S_Y$ (in this section: such a decision tree) be "optimal"? How can one quantify the "goodness" of functions of the type $h : S_X \to S_Y$? This is a very non-trivial question, as we will learn to appreciate as the course evolves. For supervised learning tasks, the key to optimizing a learning procedure is to declare a loss function at the very outset of a learning project. A loss function is a function

$L : S_Y \times S_Y \to \mathbb{R}^{\geq 0},$    (1)

which assigns a nonnegative real number to a pair of class labels. The loss function is used to compare the classification h(x), returned by a decision function h on input x, with the correct value. After having learnt a decision function h from a training sample $(x_i, y_i)_{i=1,\ldots,N}$, one can quantify the performance of h averaged over the training data, obtaining a quantity $R^{emp}$ that is called the empirical risk:

$R^{emp}(h) = \frac{1}{N} \sum_{i=1,\ldots,N} L(h(x_i), y_i).$    (2)

Most supervised machine learning procedures are built around some algorithmic optimization procedure which searches for decision functions that minimize the empirical risk, that is, which try to give a small average loss on the training data. I emphasize that minimizing the empirical loss (or, in another common wording, minimizing the "training error") by a clever learning algorithm will usually not lead to a very useful decision function. This is due to the problem of overfitting. "Overfitting" means, intuitively speaking, that if the learning algorithm tries everything to be good (= small loss) on the training data, it will attempt to minimize the loss individually for each training data point. This means, in some way, to learn the training data "by heart" – to encode an exact memory of the training data points in the learnt decision function h. This is not a good idea because later "test" data points will be different, and the decision function obtained by "rote learning" will not know how to generalize to the new data points. The ultimate goal of supervised machine learning is not to minimize the empirical loss (that is, to have small loss on the training data), but to minimize the loss on future "test" data which were not available for training. Thus, the central optimization criterion for supervised learning methods is to find decision functions h which have a small risk

$R(h) = E[L(h(X), Y)],$    (3)

where E is the expectation operator (see Appendix D). That is, a good decision function h should incur a small expected loss – it should have small "testing error". At a later point in this course we will analyse the problem of avoiding overfitting in more detail. For now it's enough to be aware that overfitting is an important threat, and that in order to avoid it one should not allow the models h to dress themselves too closely around the training data points. For a classification task with a finite number of class labels, like in our fruit example, a natural loss function is one that simply counts misclassifications:

$L^{count}(h(x), y) = \begin{cases} 0, & \text{if } h(x) = y \\ 1, & \text{if } h(x) \neq y \end{cases}$    (4)

This counting loss is often a natural choice, but in many situations it is not appropriate. Consider, for instance, medical diagnostic decision making. Assume you visit a doctor with some vague complaints. Now compare two scenarios.

• Scenario 1: after doing his diagnostics the doctor says, “sorry to tell you but you have cancer and you should think about making your will” – and this is a wrong diagnosis; in fact you are quite healthy.

• Scenario 2: after doing his diagnostics the doctor says, “good news, old boy: there’s nothing wrong with you except you ate too much yesterday” – and this is a wrong diagnosis; in fact you have intestinal cancer.

These two errors are called the "false positive" and the "false negative" decisions. A simple counting loss would optimize medical decision making such that both kinds of errors are minimized. But in medicine, false negatives should be avoided as much as possible because their consequences can be lethal, whereas false positives will only cause a passing anxiety and inconvenience. Accordingly, a loss function used to optimize medical decision making should put a higher penalty (= larger L values) on false negatives than on false positives.

Generally, and specifically so in operations research and decision support systems, loss functions can become very involved, including a careful balancing of conflicting ethical, financial or other factors. However, we will not consider complex loss functions here and stick to the simple counting loss. This loss is the one that guided the design of the classical decision tree learning algorithms. And now, finally, we have set the stage for actually discussing learning algorithms for decision trees!
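To make the interplay of Equations (2) and (4) concrete, here is a small sketch (the class labels, the decision function and the factor 5 for false negatives are invented for illustration) of the counting loss, an asymmetric medical loss, and the empirical risk of a decision function:

```python
def counting_loss(prediction, target):
    # L_count from Equation (4): 0 for a correct classification, 1 otherwise
    return 0.0 if prediction == target else 1.0

def medical_loss(prediction, target):
    # Asymmetric loss: a false negative ("healthy" although the patient has cancer)
    # is penalized five times more than other errors (the factor 5 is arbitrary).
    if prediction == target:
        return 0.0
    return 5.0 if (target == "cancer" and prediction == "healthy") else 1.0

def empirical_risk(h, data, loss):
    # Equation (2): average loss of decision function h over the training sample
    return sum(loss(h(x), y) for x, y in data) / len(data)

# Usage with a trivial decision function that always answers "healthy"
data = [([37.2, 120], "healthy"), ([39.5, 180], "cancer"), ([36.8, 110], "healthy")]
h = lambda x: "healthy"
print(empirical_risk(h, data, counting_loss))   # 1/3
print(empirical_risk(h, data, medical_loss))    # 5/3
```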

2.4 Learning decision trees: the core algorithm

OK, let us get concrete. We are given training data $(x_i, y_i)$ (in the form of an Excel table, in real life), and – opting for the counting loss – we want to find an optimal (more realistically: a rather good) decision tree $h^{opt}$ which will minimize misclassifications on new "test" data.

In order to distil this $h^{opt}$ from the training data, we need to set up a learning algorithm which, on input $(x_i, y_i)$, outputs $h^{opt}$. All known learning algorithms for decision trees incrementally build $h^{opt}$ from the root node downwards. The first thing a decision tree learning algorithm (DTLA) must do is therefore to decide which property is queried first, making it the root node.

To understand the following procedures, notice that a decision tree iteratively splits the training dataset into increasingly smaller, disjoint subsets. The root node can be associated with the entire training dataset – call it D. If the root node $\nu_{root}$ queries the property $Q_j$ and this property has $k_j$ attributes $a^j_1, \ldots, a^j_{k_j}$, there will be $k_j$ child nodes $\nu_1, \ldots, \nu_{k_j}$, where node $\nu_l$ will be covering all training data points that have attribute $a^j_l$; and so forth down the tree. In detail: if $\nu_{l_1 l_2 \cdots l_r}$ is a tree node at level r (counting from the root), and this node is associated with the subset $D_{l_1 l_2 \cdots l_r}$ of the training data, and this node queries property $Q_u$ which has attributes $a^u_1, \ldots, a^u_{k_u}$, then the child node $\nu_{l_1 l_2 \cdots l_r l_s}$ will be associated with that subset of $D_{l_1 l_2 \cdots l_r}$ which contains those training data points that have attribute $a^u_s$ of property $Q_u$.

The classical solution to this problem of selecting a property Q for the root node is to choose that property which leads to a "maximally informative" split of the training dataset at the first child node level. Intuitively, the data point subsets associated with the child nodes should be as "pure" as possible with respect to the classes c ∈ C. In the best of all cases, if there are q different classes (that is, |C| = q), there would be a property $Q_{supermagic}$ with q attributes which already uniquely identify the classes, such that each first-level child node is already associated with a "pure" set of training examples all from the same class – say, the first child node covers only apples, the second only bananas, etc. This will usually not be possible, among other reasons because normally there is no property with exactly as many attributes as there are classes. Thus, the "purity" of data point sets associated with child nodes needs to be measured in a way that tolerates class mixtures in each node. The measure that is traditionally invoked in this situation comes from information theory. It is the entropy of the class distribution within a child node. If $D_l$ is the set of training points associated with a child node $\nu_l$, and $n^l_i$ is the number of data points in $D_l$ that are from class i (where i = 1, ..., q), and the total size of $D_l$ is $n^l$, then the entropy $S_l$ of the "class mixture" in $D_l$ is given by

$S_l = -\sum_{i=1,\ldots,q} \frac{n^l_i}{n^l} \log_2\left(\frac{n^l_i}{n^l}\right).$    (5)

If in this sum a term $n^l_i / n^l$ happens to be zero, by convention the product $\frac{n^l_i}{n^l} \log_2\left(\frac{n^l_i}{n^l}\right)$ is set to zero, too. I quickly rehearse two properties of entropy which are relevant here:

• Entropies are always nonnegative.

34 • The more “mixed” the set Dl, the higher the entropy Sl. In one extreme case, Dl is 100% clean, that is, it contains only data points of a single class. Then Sl = 0. The other extreme is the greatest possible degree of mixing, which occurs when there is an equal number of data points from each class

in $D_l$. Then $S_l$ attains its maximal possible value of $-q \cdot \frac{1}{q} \log_2(1/q) = \log_2 q$. The entropy of the root node is

$S_{root} = -\sum_{i=1,\ldots,q} \frac{n^{root}_i}{N} \log_2\left(\frac{n^{root}_i}{N}\right),$

where $n^{root}_i$ is the number of examples of class i in the total training data set D and N is the number of training data points. Intuitively, $S_{root}$ is a measure of the impurity of the training data set. Following the terminology of Duda et al. (2001), I will denote the entropy-based impurity measure of the root by $i_{entropy}(\nu_{root}) := S_{root}$. More generally, if ν is any node in the tree, and this node is associated with the training data point set $D_\nu$, and $|D_\nu| = n$, and the q classes are represented in $D_\nu$ by subsets of size $n_1, \ldots, n_q$, the entropy impurity of ν is given by

$i_{entropy}(\nu) = -\sum_{i=1,\ldots,q} \frac{n_i}{n} \log_2\left(\frac{n_i}{n}\right).$    (6)

If the root node νroot queries the property Qj and this property has k attributes, then the mixing of classes averaged over all child nodes ν1, . . . , νk is given by

$\sum_{l=1,\ldots,k} \frac{|D_l|}{N}\, i_{entropy}(\nu_l),$

where $D_l$ is the point set associated with $\nu_l$. Subtracting this average class mixing found in the child nodes from the class mixing in the root, one obtains the information gain achieved by opting for property $Q_j$ for the root node:

$\Delta i_{entropy}(\nu_{root}, Q_j) = i_{entropy}(\nu_{root}) - \sum_{l=1,\ldots,k} \frac{|D_l|}{N}\, i_{entropy}(\nu_l).$    (7)

It can be shown that this quantity is always nonnegative. It is maximal when all child nodes have “class-pure” data point sets associated with them. For a generic node ν labeled with property Q, where Q has k attributes, and the size of the set associated with ν is n and the sizes of the sets associated with the k child nodes ν1, . . . , νk are n1, . . . , nk, the information gain for node ν is

$\Delta i_{entropy}(\nu, Q) = i_{entropy}(\nu) - \sum_{l=1,\ldots,k} \frac{n_l}{n}\, i_{entropy}(\nu_l).$    (8)
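To make Equations (6) and (8) concrete, here is a minimal sketch (my own illustration; the function and variable names are not from Breiman or Duda) that computes the entropy impurity of a node from the class labels of its associated data points, and the information gain of a candidate split:

```python
from math import log2

def entropy_impurity(labels):
    """Equation (6): entropy of the class distribution in a node's data point set."""
    n = len(labels)
    impurity = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        impurity -= p * log2(p)          # terms with p = 0 never occur here
    return impurity

def information_gain(parent_labels, child_label_sets):
    """Equation (8): parent impurity minus the size-weighted child impurities."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy_impurity(child)
                   for child in child_label_sets)
    return entropy_impurity(parent_labels) - weighted

# Usage: splitting 6 fruit by color into three 100% pure subsets gives maximal gain
parent = ["Apple", "Apple", "Banana", "Banana", "Cherry", "Cherry"]
split  = [["Apple", "Apple"], ["Banana", "Banana"], ["Cherry", "Cherry"]]
print(information_gain(parent, split))   # log2(3), the parent's full impurity
```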

The procedure to choose a property for the root node is to compute the information gain for all properties and select the one which maximizes it. This procedure of choosing that query property for a node which leads to the greatest information gain is repeated tree-downwards as the tree is grown by the DTLA. A node is not further expanded and thus becomes a leaf node if either (i) the training dataset associated with this node is 100% pure (contains only examples from a single class), or (ii) one has reached level m, that is, all available properties have been queried on the path from the root to that node. In case (i) the leaf is labeled with the unique class of its associated data point set. In case (ii) it is labeled with the class that has the largest number of representatives in the associated data point set.

Information gain is the most popular and historically first criterion used for determining the query property of a node. But other criteria are used too. I mention three.

The first is a normalized version of the information gain. The motivation is that the information gain criterion (8) favours properties that have more attributes over properties with fewer attributes. This requires a little explanation. Generally speaking, properties with many attributes statistically tend to lead to purer data point sets in the child nodes than properties with only few attributes. To see this, compare the extreme case of a property with only a single attribute, which will lead to a zero information gain, with the extreme case of a property that has many more attributes than there are data points in the training dataset, which will (statistically) lead to many child node data point sets which contain only a single example, that is, they are 100% pure. A split of the data point set $D_\nu$ into a large number of very small subsets is undesirable because it is a door-opener for overfitting. This preference for properties with more attributes can be compensated if the information gain is normalized by the entropy of the data set splitting, leading to the information gain ratio criterion, which here is given for the root node:

$\Delta_{ratio}\, i_{entropy} = \frac{\Delta i_{entropy}(\nu_{root}, Q_j)}{-\sum_{l=1,\ldots,k} \frac{|D_l|}{N} \log_2 \frac{|D_l|}{N}}.$    (9)

The second alternative that I mention measures the impurity of a node by its Gini impurity, which for a node ν associated with set $D_\nu$, where $|D_\nu| = n$ and the subsets of points in $D_\nu$ of classes $1, \ldots, q$ have sizes $n_1, \ldots, n_q$, is

$i_{Gini}(\nu) = \sum_{1 \le i, j \le q;\; i \ne j} \frac{n_i}{n} \frac{n_j}{n} = 1 - \sum_{1 \le i \le q} \left(\frac{n_i}{n}\right)^2,$

which is the error rate when a category decision for a randomly picked point in $D_\nu$ is made randomly according to the class distribution within $D_\nu$. The third alternative is called the misclassification impurity in Duda et al.

(2001) and is given by

$i_{misclass}(\nu) = 1 - \max\{\, n_l / n \mid l = 1, \ldots, q \,\}.$

This impurity measure is the minimum (taken over all classes c) probability that a point from $D_\nu$ which is of class c is misclassified if the classification is randomly done according to the class distribution in $D_\nu$. Like the entropy impurity, the Gini and misclassification impurities are nonnegative and equal to zero if and only if the set $D_\nu$ is 100% pure. Like it was done with the entropy impurity, these other impurity measures can be used to choose a property for a node ν through the gain formula (8), plugging in the respective impurity measure for $i_{entropy}$.

This concludes the presentation of the core DTLA. It is a greedy procedure which incrementally constructs the tree, starting from its root, by always choosing that property for the node currently being constructed which maximizes the information gain (7) or one of its versions obtained from alternative impurity measures. You will notice three unfortunate facts:

• An optimization scheme which works by iteratively applying local greedy optimization steps will generally not yield a globally optimal solution.

• We started by claiming that our goal is to learn a tree which minimizes the count loss (or any other loss we might opt for). However, none of the impurity measures and the associated local gain optimization is connected in a mathematically transparent way with that loss. In fact, no computationally tractable method for learning an empirical-loss-minimal tree exists: the problem is NP-complete (Hyafil and Rivest, 1976).

• It is not clear which of the impurity measures is best for the goal of minimizing the empirical loss.

In summary, we have here a heuristic learning algorithm (actually, a family of algorithms, depending on the choice of impurity measure) which is based on intuition and the factual experience that it works reasonably well in practice. Only heuristic algorithms are possible due to the NP-completeness of the underlying optimization problem.
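Putting the pieces together, the following sketch is my own compact rendering of the greedy core procedure, using the entropy gain of Equation (8). It reuses entropy_impurity and information_gain from the sketch above, represents a tree as nested dicts, and omits all refinements (missing values, numerical attributes, pruning); it is not the ID3/C4.5 algorithm itself:

```python
from collections import Counter

def grow_tree(data, properties):
    """data: list of (attribute_dict, class_label); properties: still unqueried properties."""
    labels = [y for _, y in data]
    # Stopping criteria: node is 100% pure, or no properties are left to query
    if len(set(labels)) == 1 or not properties:
        return Counter(labels).most_common(1)[0][0]      # leaf = majority class
    # Greedy step: pick the property with maximal information gain (Equation (8))
    def gain(prop):
        children = {}
        for x, y in data:
            children.setdefault(x[prop], []).append(y)
        return information_gain(labels, list(children.values()))
    best = max(properties, key=gain)
    # Split the data on the chosen property and recurse into the child nodes
    splits = {}
    for x, y in data:
        splits.setdefault(x[best], []).append((x, y))
    rest = [p for p in properties if p != best]
    return {best: {attr: grow_tree(subset, rest) for attr, subset in splits.items()}}

def classify(tree, x):
    """Walk down the tree until a leaf (= class label) is reached.
    Assumes every attribute value of x was seen during training."""
    while isinstance(tree, dict):
        prop, children = next(iter(tree.items()))
        tree = children[x[prop]]
    return tree

# Usage with a tiny fruit sample
data = [({"Color": "green", "Size": "big"},   "Watermelon"),
        ({"Color": "green", "Size": "small"}, "Grape"),
        ({"Color": "red",   "Size": "small"}, "Cherry")]
tree = grow_tree(data, ["Color", "Size"])
print(classify(tree, {"Color": "green", "Size": "big"}))   # Watermelon
```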

2.5 Dealing with overfitting

If a decision tree is learnt according to the core algorithm, there is the danger that it overfits. This means that it performs well on the training data – that is, it has small empirical loss – which boils down to a condition where all leaf nodes are rather pure. A zero empirical loss may, for instance, be attained if the number of properties and attributes is large. Then it will easily be the case that every training example has a unique combination of attributes, which leads to 100% pure

leaf nodes, and every leaf having only a small set of training examples associated with it, maybe singleton sets. Then one has zero misclassifications of the training examples. But, intuitively speaking, the tree has just memorized the training set. If there is some "noise" in the attributes of data points, new examples (not in the training set) will likely have attribute combinations that lead the learnt decision tree on wrong tracks.

We will learn to analyze and fight overfitting later in the course. At this point I only point out that if a learnt tree T overfits, one can typically obtain from it another tree T′ which overfits less, by pruning T. Pruning a tree means to select some internal nodes and delete all of their children. This makes intuitive sense: the deeper one goes down in a tree learnt by the core algorithm, the more the branch reflects "individual" attribute combinations found in the training data. To make this point particularly clear, consider a case where there are many properties, but only one of them carries information relevant for classification, while all others are purely random. For instance, for a classification of patients into the two classes "healthy" and "has cancer", the binary property "has increased leukocyte count" carries classification-relevant information, while the binary property "given name starts with A" is entirely unconnected to the clinical status. If there are enough such irrelevant properties to uniquely identify each patient entered into the training sample, the core algorithm will (i) most likely find that the relevant property "has increased leukocyte count" leads to the greatest information gain at the root and thus use it for the root decision, and (ii) subsequently generate tree branches that lead to 100% pure leaf nodes. Zero training error and poor generalization to new patients result. The best tree here would be the one that only expands the root node once, exploiting the only relevant property.

With this insight in mind, there are two strategies to end up with trees that are not too deep. The first strategy is early stopping. One does not carry the core algorithm to its end but, at each node which one has created, one decides whether a further expansion would lead to overfitting. A number of statistical criteria are known on which this stopping decision can be based; or one can use cross-validation (explained in future chapters in these lecture notes). We do not discuss this further here – the textbook by Duda explains a few of these statistical criteria. A problem with early stopping is that it suffers from the horizon effect: if one stops early at some node, a more fully expanded tree might exploit further properties that are, in fact, relevant for classification refinement underneath the stopped node. The second strategy is pruning. The tree is first fully built with the core algorithm, then it is incrementally shortened by cutting away end sections of branches. An advantage of pruning over early stopping is that it avoids the horizon effect. Duda says (I am not a decision tree expert and can't judge) that pruning should be preferred over early stopping "in small problems" (whatever "small" means).
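The following sketch illustrates the pruning idea with one standard variant, reduced-error pruning against a held-out validation set. This specific criterion is my choice for illustration, not something prescribed in the text; the tree format and the classify helper are those of the sketch in Section 2.4:

```python
from collections import Counter

def prune(tree, train_data, val_data):
    """Reduced-error pruning: replace a subtree by a majority-class leaf
    whenever this does not increase the error on validation data."""
    if not isinstance(tree, dict):
        return tree                                    # already a leaf
    prop, children = next(iter(tree.items()))
    # First prune the children (bottom-up), routing the data down the split
    new_children = {}
    for attr, subtree in children.items():
        sub_train = [(x, y) for x, y in train_data if x[prop] == attr]
        sub_val   = [(x, y) for x, y in val_data   if x[prop] == attr]
        new_children[attr] = prune(subtree, sub_train, sub_val)
    pruned = {prop: new_children}
    # Candidate replacement: leaf labeled with the majority class of the training data
    leaf = Counter(y for _, y in train_data).most_common(1)[0][0]
    def errors(t):
        return sum(classify(t, x) != y for x, y in val_data)
    # Prefer the simpler tree on ties (assumes train_data is nonempty at every node)
    return leaf if errors(leaf) <= errors(pruned) else pruned
```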

2.6 Variants and refinements

The landscape of decision tree learning is much richer than what I can explore in a single lecture. Two important topics which I omit (treated briefly in Duda's book):

Missing values. Real-world datasets will typically have missing values, that is, not all training or test examples will have attribute values filled in for all properties. Missing values require adjustments both in training (an adapted version of impurity measures which accounts for missing values) and in testing (if a test example leads to a node with property Q and the example misses the required attribute for Q, the normal classification procedure would abort). The Duda book dedicates a subsection to recovery algorithms.

Numerical properties. Many real-world properties, like for instance "velocity" or "weight", have a continuous numerical range for attribute values. This requires a split of the continuous value range into a finite number (often two) of discrete bins which then serve as discrete attributes. For instance, the range of a "body weight" property might be split into two ranges "<50kg" and "≥50kg", leading to a new property with only two attributes. These ranges can be further split into finer-grained subproperties which would be queried downstream in the tree, for instance querying for "<45kg" vs. "45kg ≤ weight < 50kg" underneath the "<50kg" node. Again, heuristics must be called upon to determine splitting boundaries which lead to well-performing (correct and cheap = not too many nodes) trees. Again – consult Duda.

All in all, there is a large number of design decisions to be made when setting up a decision tree learning algorithm, and a whole universe of design criteria and algorithmic sub-procedures is available in the literature. Some combinations of such design decisions have led to final algorithms which have become branded with names and widely used. Two algorithms which are invariably cited (and which are available ready-made in professional toolboxes) are the ID3 algorithm and its more complex and higher-performing successor, the C4.5 algorithm. They were introduced by the decision tree pioneer Ross Quinlan in 1986 and 1993, respectively. The Duda book gives brief descriptions of these canonical algorithms.

2.7 Random forests

Decision tree learning is a subtle affair. If one designs a DTLA by choosing a specific combination of the manifold design options, the tree that one gets from a training dataset D will be influenced by the effects of that choice of options. It will not be "the optimal" tree which one might obtain by some other choice of options. Furthermore, if one uses pruning or early stopping to fight overfitting, one will likely end up with trees that, while not overfitting, are underfitting – that is, they do not exploit all the information that is hidden in the training data.

In summary, whatever one does, one is likely to obtain a decision tree that is significantly sub-optimal. This is a common situation in machine learning. There are only very few machine learning techniques where one has full mathematical control over getting the best possible model from a given training dataset, and decision tree learning is not among them.

Fortunately, there is also a common escape strategy for minimizing the quality deficit inherent in most learning designs: ensemble methods. The idea is to train a whole collection (called "ensemble") of models (here: trees), each of which is likely suboptimal (jargon: each model is obtained from a "weak learner"), but if their results on a test example are diligently combined, the merged result is much better than what one gets from each of the models in the ensemble. It's the idea of crowd intelligence.

"Ensemble methods" is an umbrella term for a wide range of concrete techniques. They differ, obviously, in the kind of the individual models, like for instance decision trees vs. neural networks. Second, they vary in the methods by which one generates diversity in the ensemble. Ensemble methods work well only to the extent that the individual models in the ensemble probe different aspects of the training data – they should look at the data from different angles, so to speak. Thirdly, ensemble methods differ in the way in which the different results of the individual models are combined. A general theory for setting up an ensemble scheme is not available – we are again thrown back to heuristics. The Wikipedia articles on "Ensemble learning" and "Ensemble averaging (machine learning)" give a condensed overview.

Obviously, ensemble methods can only be used when the computational cost of training a single model is rather low. This is the case for decision tree learning. Because training an individual tree will likely not give a competitive result, but is cheap, it is common practice to train not a single decision tree but an entire ensemble – which in the case of tree learning is called a random forest. In fact, random forests can yield competitive results (for instance, compared to neural networks) at moderate cost, and are therefore often used in practical applications.

The definitive reference on random forests is Breiman (2001). This paper has been cited more than 50,000 times on Google Scholar. This places it on a par even with the most cited works from the deep learning field. The paper has two parts. In the first part, Breiman gives a mathematical analysis of why combining decision trees does not lead to overfitting, and derives an instructive upper bound on the generalization error (= risk, see Section 2.3). These results were in synchrony with mainstream theory work in other areas of machine learning at the time and made this paper the theory anchor for random forests. However, I personally find the math notation used by Breiman opaque and hard to penetrate, and I am not sure how many readers would understand it. The second, larger part of this paper describes several variants and extensions of random forest algorithms, discusses their properties and benchmarks some of them against the leading classification learning algorithms of the time, with favourable outcomes. This part is an easy

read and the presented algorithms are not difficult to implement. My hunch is that it is the second rather than the first part which led to the immense impact of this paper. Here I give a summary of the paper, in reverse order, starting with the practical algorithm recommended by Breiman. After that I give an account of the most conspicuous theoretic results in transparent, standard probability notation.

In ensemble learning one must construct many different models in an automated fashion. One way to achieve this is to employ a stochastic learning algorithm. A stochastic learning algorithm can be seen as a learning algorithm which takes two arguments. The first argument is the training data D, the same as in ordinary, non-stochastic learning algorithms. The second argument is a random vector, denoted by Θ in Breiman's paper, which is set to different random values in each run of the algorithm. Θ can be seen as a vector of control parameters of the algorithm; different random settings of these control parameters lead to different outcomes of the learning algorithm although the training data are always the same, namely D. Breiman proposes two ways to make the tree learning stochastic:

Bagging. In each run of the learning algorithm, the training dataset is resampled with replacement. That is, from D one creates a new training dataset D′ of the same size as D by randomly copying elements from D into D′. This is a general approach for ensemble learning, called bagging. The Wikipedia article on "Bootstrap aggregating" gives an easy introduction if you are interested in learning more.

Random feature selection. This is the term used by Breiman for a randomization technique where, at each node whose query property has to be chosen during tree growing, a small subset of all still unqueried properties is randomly drawn as candidates for the query property of this node. The winning property among them is determined by the information gain criterion. (A minimal code sketch of both randomization mechanisms follows below.)
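A minimal sketch of the two randomization mechanisms (illustrative only; the helper names are mine, and choosing roughly √m candidate properties per node is a common convention rather than a prescription from Breiman's paper). In a full random forest, each tree is grown on a bootstrap sample, candidate_properties is called anew at every node, and the final decision is a majority vote over the trees (classify is the helper from the sketch in Section 2.4):

```python
import random

def bootstrap_sample(data):
    """Bagging: resample the training set with replacement, keeping its size."""
    return [random.choice(data) for _ in range(len(data))]

def candidate_properties(unqueried, k):
    """Random feature selection: at a node, draw k of the still unqueried
    properties as candidates; the winner is then picked by information gain."""
    return random.sample(unqueried, min(k, len(unqueried)))

def forest_vote(forest, x):
    """Combine an ensemble of trees by majority vote over their decisions."""
    votes = [classify(tree, x) for tree in forest]
    return max(set(votes), key=votes.count)
```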

Combining these two randomness-producing mechanisms with a number of other design decisions not mentioned here, Breiman obtains a stochastic learning algorithm which, with ensembles of size 100, outperformed other methods that were state-of-the-art at that time. The random vector Θ, which is needed for a concise specification and a mathematical analysis of the resulting stochastic tree learning algorithm, is some essentially arbitrary encoding of the random choices made for bagging and random feature selection. The main theoretical result in Breiman's paper is his Theorem 2.3:

$PE^* \leq \bar{\varrho}\, (1 - s^2)/s^2.$

This theorem certainly added much to the impact of the paper, because it gives intuitive guidance for the proper design of random forests, and also because it

connected random decision tree forests to other machine learning methods which were being mathematically explored at the time when the paper appeared. I will now try to give a purely intuitive explanation, and then conclude this section with a clean-math explanation of this theorem.

• The quantity PE∗, called generalization error by Breiman, is the probability that a random forest (in the limit of its size going to infinity) makes wrong decisions. Obviously one wants this probability to be as small as possible.

• The quantity $\bar{\varrho}$ is a certain statistical correlation measure which quantifies how similarly, on average across the tree population in a forest, the various trees classify test data points. This measure is maximal if all trees always yield the same classification results, and it is minimal if, roughly speaking, their classification results are statistically independent.

• The quantity s, which satisfies 0 ≤ s ≤ 1 and which Breiman calls the strength of the stochastic tree learning algorithm, measures a certain "discerning power" of the average tree in a forest, that is, how large the probability gap is between an average tree making the right decision and making a wrong decision. If the strength is equal to the maximal value of 1, then all trees in the forest always make correct decisions on all test data points. If it is zero, then, on average across the forest and test data points, the trees do not make correct decisions more often than wrong ones.

Note that the factor $(1 - s^2)/s^2$ ranges between zero and infinity and is monotonically increasing with decreasing s. It is zero at maximal strength s = 1, and it is infinite if the strength is zero. The theorem gives an upper bound on the generalization error observed in (asymptotically infinitely large) random forests. The main message is that this bound is a product of a factor $\bar{\varrho}$, which is smaller when the different trees in a forest vary more in their response to test inputs, with a factor $(1 - s^2)/s^2$, which is smaller when the trees in the forest place a larger probability gap between the correct and the second most common decision across the forest. The suggestive message of all of this is that in designing a stochastic decision tree learning algorithm one should

• aim at maximizing the response variability across the forest, while

• attempting to ensure that trees mostly come out with correct decisions.

If one succeeds in generating trees that always give correct decisions, one has maximal strength s = 1 and the generalization error is obviously zero. This will usually be impossible. Instead, the stochastic tree learning algorithm will produce trees that have a residual error probability, that is, s < 1. Then, the first factor

implies that one should aim at a stochastic tree generation mechanism which (for a fixed strength) produces trees that show great variability in their response behavior.

The remainder of this section is only for those of you who are familiar with probability theory, and this material will not be required in the final exam. I will give a mathematically transparent account of Breiman's Theorem 2.3. This boils down to an exercise in clean mathematical formalism. We start by formulating the ensemble learning scenario in rigorous probability terms. There are two underlying probability spaces involved, one for generating data points, and the other for generating the random vectors Θ. I will denote these two spaces by $(\Omega, \mathcal{A}, P)$ and $(\Omega^*, \mathcal{A}^*, P^*)$ respectively, with the second one used for generating Θ. The elements of Ω, Ω* will be denoted by symbols ω, ω*, respectively. Let X be the RV which generates attribute vectors,

X :Ω → A1 × ... × Am, and Y the RV which generates class labels,

Y :Ω → C.

The random parameter vectors Θ can be vectors of any type, numerical or categorical, depending on the way that one chooses to parametrize the learning algorithm. Let T denote the space from which Θ's can be picked. For a forest of K trees one needs K random vectors $\Theta_j$, where j = 1, ..., K. Let

$Z_j : \Omega^* \to T$

be the RV which generates the j-th parameter vector, that is, $Z_j(\omega^*) = \Theta_j$. The RVs $Z_j$ (j = 1, ..., K) are independent and identically distributed (iid). In the remainder of this section, we fix some training dataset D. Let h(·, Θ) denote the tree that is obtained by running the stochastic tree learning algorithm with parameter vector Θ. This tree is a function which outputs a class decision if the input is an attribute vector, that is,

h(x, Θ) ∈ C.

When Θ is fixed, we can view h as a function of the RV X, thus

h(X, Θ) : Ω → C is a random variable. Now the stage is prepared to re-state Breiman’s central result (Theorem 2.3) in a transparent notation. Breiman starts by introducing a function mr, called margin function for a random forest,

$mr : (A_1 \times \ldots \times A_m) \times C \to \mathbb{R}$
$(x, y) \mapsto P^*(h(x, Z) = y) - \max_{\tilde{y} \neq y} \{ P^*(h(x, Z) = \tilde{y}) \},$

where Z is a random variable distributed like the $Z_j$. If you think about it you will see that mr(x, y) > 0 if and only if a very large forest (in the limit of its size going to infinity) will classify the data point (x, y) correctly, and mr(x, y) < 0 if the forest will come to a wrong collective decision. On the basis of this margin function, Breiman defines the generalization error $PE^*$ by

$PE^* = P(mr(X, Y) < 0).$

The expectation of mr over data examples is

$s = E[mr(X, Y)].$

s is a measure of the extent to which, averaged over data point examples, the correct-answer probability (over all possible decision trees) is larger than the highest probability of deciding for any other answer. Breiman calls s the strength of the parametrized stochastic learning algorithm. The stronger the stochastic learning algorithm, the greater the probability margin between correct and wrong classifications on average over data point examples.

Breiman furthermore introduces a raw margin function $rmg_\Theta$, which is a function of X and Y parametrized by Θ, through

$rmg_\Theta(\omega) = \begin{cases} 1, & \text{if } h(X(\omega), \Theta) = Y(\omega), \\ -1, & \text{if } h(X(\omega), \Theta) \text{ gives the maximally probable among the wrong answers in an asymptotically infinitely large forest,} \\ 0, & \text{else.} \end{cases}$    (10)

Define $\varrho(\Theta, \Theta')$ to be the correlation (= covariance normalized by division with standard deviations) between $rmg_\Theta$ and $rmg_{\Theta'}$. For given Θ, Θ′, this is a number in [−1, 1]. Seen as a function of Θ, Θ′, $\varrho(\Theta, \Theta')$ maps each pair Θ, Θ′ into [−1, 1], which gives a RV which we denote with the same symbol ϱ for convenience,

$\varrho(Z, Z') : \Omega^* \to [-1, 1],$

where Z, Z′ are two independent RVs with the same distribution as the $Z_j$. Let $\bar{\varrho} = E[\varrho(Z, Z')]$ be the expected value of $\varrho(Z, Z')$. It measures to what extent, on average across random choices for Θ, Θ′, the resulting two trees h(·, Θ) and h(·, Θ′) both reach the same correct or wrong decision, averaged over data examples. And here is, finally, Breiman's Theorem 2.3:

$PE^* \leq \bar{\varrho}\, (1 - s^2)/s^2,$    (11)

whose intuitive message I discussed earlier.
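For the probability-minded reader, here is a rough empirical counterpart of these definitions (entirely my own illustration): given the votes of a finite forest of K trees on N labelled data points, it replaces the probabilities over Z by frequencies over the trees and the expectations over (X, Y) by averages over the data points. It estimates the strength s, the generalization error PE*, and a plain (unweighted) average of pairwise raw-margin correlations as a stand-in for $\bar{\varrho}$ (Breiman's $\bar{\varrho}$ is a weighted mean, so this is only an approximation):

```python
import numpy as np

def forest_statistics(votes, y):
    """votes: (K, N) array, votes[j, i] = class predicted by tree j on data point i.
    y: (N,) array of correct labels.  Assumes at least two classes and that no
    tree has a constant raw margin over the data points."""
    K, N = votes.shape
    classes = np.unique(np.concatenate([votes.ravel(), y]))
    mr = np.empty(N)                    # empirical margin mr(x_i, y_i)
    for i in range(N):
        freq = {c: np.mean(votes[:, i] == c) for c in classes}
        p_correct = freq[y[i]]
        p_best_wrong = max(p for c, p in freq.items() if c != y[i])
        mr[i] = p_correct - p_best_wrong
    s = mr.mean()                       # strength: average margin over data points
    pe = np.mean(mr < 0)                # estimated generalization error PE*
    # raw margins rmg_Theta (Equation (10)), one row per tree
    best_wrong = np.array([max((c for c in classes if c != y[i]),
                               key=lambda c: np.mean(votes[:, i] == c))
                           for i in range(N)])
    rmg = np.where(votes == y, 1.0, np.where(votes == best_wrong, -1.0, 0.0))
    rho = np.corrcoef(rmg)              # (K, K) correlations between the trees
    rho_bar = rho[np.triu_indices(K, k=1)].mean()
    return s, pe, rho_bar
```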

3 Elementary supervised temporal learning

In this section I give an introduction to a set of temporal data modeling methods which combine simplicity with broad practical usefulness: learning temporal tasks by training a linear regression map that transforms input signal windows to output data. In many scenarios, this simple technique is all one needs. It can be programmed and executed in a few minutes (really!) and you should run this technique as a first baseline whenever you start a serious learning task that involves time series data.

This section deals with numerical timeseries where the data format for each time point is a real-valued scalar or vector. This includes the majority of all learning tasks that arise in the natural sciences, in engineering, robotics, speech and in image processing. Methods for dealing with symbolic timeseries (in particular texts, but also discrete action sequences of intelligent agents / robots, DNA sequences and more) can be obtained by encoding symbols into numerical vectors and then applying numerical methods. Often, however, one uses methods that operate on symbol sequence data structures directly (all kinds of discrete-state dynamical systems, deterministic or stochastic, like finite automata, Markov chains, hidden Markov models, dynamical Bayesian networks, and more). I will not consider such methods in this section.

My secret educational agenda in this section is that this will force you to rehearse linear regression – which is such a basic, simple yet widely useful method that everybody should have a totally 100% absolute unshakeable secure firm hold on it.
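To give a first impression of what "input signal windows" will mean, here is a small sketch (the window length and the toy signal are arbitrary choices of mine) that turns a scalar time series into supervised training pairs; the regression machinery recapped in Section 3.1 can then be applied to these pairs:

```python
import numpy as np

def make_windows(series, window_len):
    """Turn a scalar time series into supervised training pairs:
    each input is a window of the last window_len values, the target
    is the value immediately following that window."""
    X, y = [], []
    for t in range(window_len, len(series)):
        X.append(series[t - window_len:t])
        y.append(series[t])
    return np.array(X), np.array(y)

# Usage: one-step-ahead prediction data from a toy sine series
series = np.sin(0.1 * np.arange(200))
X, y = make_windows(series, window_len=5)
print(X.shape, y.shape)    # (195, 5) (195,)
```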

3.1 Recap: linear regression

Remark on notation: throughout these lecture notes, vectors are column vectors unless otherwise noted. I use ′ to denote vector or matrix transpose. I start with a generic rehearsal of linear regression, independent of temporal learning tasks. The linear regression task is specified as follows:

Given: a collection $(x_i, y_i)_{i=1,\ldots,N}$ of N training data points, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.

Wanted: a linear map from $\mathbb{R}^n$ to $\mathbb{R}$, represented by a regression weight vector w (a row vector of size n), which solves the minimization problem

$w = \operatorname{argmin}_{w^*} \sum_{i=1}^N (w^* x_i - y_i)^2.$    (12)

In machine learning terminology, this is a supervised learning task, because the training data points include “teacher” outputs yi. Note that the abstract

data format $(x_i, y_i)_{i=1,\ldots,N}$ is the same as for decision trees, but here the data are numerical whereas there they were symbolic.

By a small modification, one can make linear regression much more versatile. Note that any weight vector $w^*$ in (12) will map the zero vector $0 \in \mathbb{R}^n$ on zero. This is often not what one wants to happen. If one enriches (12) by also training a bias $b \in \mathbb{R}$, via

$(w, b) = \operatorname{argmin}_{w^*,\, b^*} \sum_{i=1}^N (w^* x_i + b^* - y_i)^2,$

one obtains an affine linear function (linear plus offset). The common way to set up linear regression such that affine linear solutions become possible is to pad the original input vectors $x_i$ with a last component of constant size 1, that is, in (12) one uses n + 1-dimensional vectors [x; 1] (using Matlab notation for the vertical concatenation of vectors). Then a solution of (12) on the basis of the padded input vectors will be a regression weight vector $[w, b] \in \mathbb{R}^{n+1}$, with the last component b giving the offset.

We now derive a solution formula for the minimization problem (12). Most textbooks start from the observation that the objective function $\sum_{i=1}^N (w x_i - y_i)^2$ is a quadratic function in the weights w, and then one uses calculus to find the minimum of this quadratic function by setting its partial derivatives to zero. I will present another derivation which does not need calculus and better reveals the underlying geometry of the problem.

Let $x_i = (x_i^1, \ldots, x_i^n)'$ be the i-th input vector. The key to understanding linear regression is to realize that the N values $x_1^j, \ldots, x_N^j$ of the j-th component in these vectors (j = 1, ..., n) can be regarded as an N-dimensional vector $\varphi_j = (x_1^j, \ldots, x_N^j)' \in \mathbb{R}^N$. Similarly, the N target values $y_1, \ldots, y_N$ can be combined into an N-dimensional vector y. Figure 10 Top shows a case where there are N = 10 input vectors of dimension n = 4.

Using these N-dimensional vectors as a point of departure, geometric insight gives us a nice clue how w should be computed. To admit a visualization, we consider a case where we have only N = 3 input vectors which each have n = 2 components. This gives two N = 3 dimensional vectors $\varphi_1, \varphi_2$ (Figure 10 Bottom). The target values $y_1, y_2, y_3$ are combined in a 3-dimensional vector y. Notice that in machine learning, one should best have more input vectors than the input vectors have components, that is, N > n. In fact, a very coarse rule of thumb – with many exceptions – says that one should aim at N > 10 n (if this is not warranted, use unsupervised dimension reduction methods to reduce n). We will thus assume that we have fewer vectors $\varphi_j$ than training data points. The n vectors $\varphi_j$ thus span an n-dimensional subspace in $\mathbb{R}^N$ (greenish shaded area in Figure 10 Bottom).

Notice (easy exercise, do it!) that the minimization problem (12) is equivalent

Figure 10: Two visualizations of linear regression. Top. This visualization shows a case where there are N = 10 input vectors $x_i$, each one having n = 4 vector components $x_i^1, \ldots, x_i^4$ (green circles). The fourth component is a constant-1 bias. The ten values $x_1^j, \ldots, x_{10}^j$ of the j-th component (where j = 1, ..., 4) form a 10-dimensional (row) vector $\varphi_j$, indicated by a connecting line. Similarly, the ten target values $y_i$ give a 10-dimensional vector y (shown in red). The linear combination $y_{opt} = w\, [\varphi_1; \varphi_2; \varphi_3; \varphi_4]$ which gives the best approximation to y in the least mean square error sense is shown in orange. Bottom. The diagram shows a case where the input vector dimension is n = 2 and there are N = 3 input vectors $x_1, x_2, x_3$ in the training set. The three values $x_1^1, x_2^1, x_3^1$ of the first component give a three-dimensional vector $\varphi_1$, and the three values of the second component give $\varphi_2$ (green). These two vectors span a 2-dimensional subspace F in $\mathbb{R}^N = \mathbb{R}^3$, shown in green shading. The three target values $y_1, y_2, y_3$ similarly make for a vector y (red). The linear combination $y_{opt} = w_1 \varphi_1 + w_2 \varphi_2$ which has the smallest distance to y is given by the projection of y on this plane F (orange). The vectors $u_1, u_2$ shown in blue are a pair of orthonormal basis vectors which span the same subspace F.

to

w = argmin_{w^*} ||(Σ_{j=1}^n w_j^* ϕ_j) − y||^2,    (13)

where w^* = (w_1^*, ..., w_n^*). We take another look at Figure 10 Bottom. Equation (13) tells us that we have to find the linear combination Σ_{j=1}^n w_j ϕ_j which comes closest to y. Any linear combination Σ_{j=1}^n w_j^* ϕ_j is a vector which lies in the linear subspace F spanned by ϕ_1, ..., ϕ_n (shaded area in the figure). The linear combination which is closest to y apparently is the projection of y on that subspace. This is the essence of linear regression! All that remains is to compute which linear combination of ϕ_1, ..., ϕ_n is equal to the projection of y on F. Let us call this projection y_opt.
We may assume that the vectors ϕ_1, ..., ϕ_n are linearly independent. If they were linearly dependent, we could drop as many of them as needed to reach a linearly independent set. The rest is mechanical linear algebra.
Let X = (ϕ_1, ..., ϕ_n)' be the n × N sized matrix whose rows are formed by the ϕ_j' (and whose columns are the x_i). Then X'X is a positive semi-definite matrix of size N × N with a singular value decomposition X'X = U_N Σ_N U_N'. Since the rank of X is n < N, only the first n singular values in Σ_N are nonzero. Let U = (u_1, ..., u_n) be the N × n matrix made from the first n columns of U_N, and let Σ be the n × n diagonal matrix containing the n nonzero singular values of Σ_N on its diagonal. Then

X'X = U Σ U'.    (14)

This is sometimes called the compact SVD. Notice that the columns uj of U form an orthonormal basis of F (blue arrows in Figure 10 Bottom). Using the coordinate system given by u1, . . . , un, we can rewrite each ϕj as

ϕ_j = Σ_{l=1}^n (u_l' ϕ_j) u_l = U U' ϕ_j.    (15)

Similarly, the projection yopt of y on F is

y_opt = Σ_{l=1}^n (u_l' y) u_l = U U' y.    (16)

But also y_opt = X' w'. From (15) we get X' = U U' X', which in combination with (16) turns y_opt = X' w' into

U U' y = U U' X' w'.

A weight vector w solves this equation if it solves

U' y = U' X' w'.    (17)

It remains to find a weight vector w which satisfies (17). I claim that w' = (X X')^{-1} X y does the trick, that is, that U' y = U' X' (X X')^{-1} X y holds. To see this, first observe that X X' is nonsingular, so (X X')^{-1} is defined. Furthermore, observe that U' y and U' X' (X X')^{-1} X y are n-dimensional vectors, and that the N × n matrix U Σ has rank n. Therefore,

U' y = U' X' (X X')^{-1} X y  ⟺  U Σ U' y = U Σ U' X' (X X')^{-1} X y.    (18)

Replacing U Σ U' by X' X (compare Equation (14)) turns the right equation in (18) into X' X y = X' X X' (X X')^{-1} X y, which is obviously true. Therefore, w' = (X X')^{-1} X y solves our optimization problem (13). We summarize our findings:

Data. A set (x_i, y_i)_{i=1,...,N} of n-dimensional input vectors x_i and scalar targets y_i.

Wanted. An n-dimensional row weight vector w which solves the linear regression objective from Equation (12).

Step 1. Sort the input vectors as columns into an n × N matrix X and the targets into an N-dimensional vector y.

Step 2. Compute the (transpose of the) result by

w' = (X X')^{-1} X y.    (19)
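To make this recipe concrete, here is a minimal numpy sketch (the variable and function names are my own; the constant-1 row realizes the bias padding described above, and the linear system is solved instead of forming the inverse explicitly):

```python
import numpy as np

def linear_regression(X, y):
    """Solve w' = (X X')^{-1} X y as in Equation (19).
    X: n x N matrix whose columns are the input vectors x_i.
    y: N-dimensional vector of targets. Returns the weight vector w."""
    return np.linalg.solve(X @ X.T, X @ y)

# toy data: N = 50 scalar targets from a noisy affine function of 3-dimensional inputs
rng = np.random.default_rng(0)
N, n = 50, 3
X_raw = rng.normal(size=(n, N))
y = 2.0 * X_raw[0] - 1.0 * X_raw[2] + 0.5 + 0.1 * rng.normal(size=N)

# pad with a constant-1 row so that an offset b can be learned
X = np.vstack([X_raw, np.ones((1, N))])
w = linear_regression(X, y)          # last entry of w is the offset b
y_hat = w @ X                        # fitted outputs
```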

Some further remarks:

• What we have derived here generalizes easily to cases where the data are of the form (x_i, y_i)_{i=1,...,N} with x_i ∈ R^n, y_i ∈ R^k. That is, the output data are vectors, not scalars. The objective is to find a k × n regression weight matrix W which solves

W = argmin_{W^*} Σ_{i=1}^N ||W^* x_i − y_i||^2.    (20)

The solution is given by

W' = (X X')^{-1} X Y,

where Y is the N × k matrix that contains the y_i' in its rows.

• For an a × b sized matrix A, where a ≥ b and A has rank b, the matrix A^+ := (A'A)^{-1} A' is called the (left) pseudo-inverse of A. It satisfies A^+ A = I_{b×b}. It is often also written as A†.

• Computing the inverse (X X')^{-1} may suffer from numerical instability when X X' is close to singular. Remark: this happens more often than you would think - in fact, X X' matrices obtained from real-world, high-dimensional data are often ill-conditioned (= close to singular). You should always feel uneasy when your program code contains a matrix inverse! A quick fix is to always add a small multiple of the n × n identity matrix before inverting, that is, to replace (19) by

w_opt' = (X X' + α^2 I_{n×n})^{-1} X y.    (21)

This is called ridge regression (a code sketch follows after these remarks). We will see later in this course that ridge regression not only helps to circumvent numerical issues, but also offers a solution to the problem of overfitting.

• A note on terminology. Here we have described linear regression. The word "regression" is used in much more general scenarios. The general setting goes like this:

Given: training data (x_i, y_i)_{i=1,...,N}, where x_i ∈ R^n, y_i ∈ R^k.

Also given: a search space H containing candidate functions h : R^n → R^k.

Also given: a loss function L : R^k × R^k → R_{≥0}.

Wanted: a solution to the optimization problem

h_opt = argmin_{h∈H} Σ_{i=1}^N L(h(x_i), y_i).

In the case of linear regression, the search space H consists of all linear functions from R^n to R^k, that is, it consists of all k × n matrices. The loss function is the quadratic loss which you see in (20). When one speaks of linear regression, the use of the quadratic loss is implied. Search spaces H can be arbitrarily large and rich in modeling options – for instance, H might be the space of all deep neural networks of a given structure and size.
Classification tasks look similar to regression tasks at first sight: training data there have the format (x_i, c_i)_{i=1,...,N}. The difference is that the target values c_i are not numerical but symbolic — they are class labels.
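As announced above, here is a minimal numpy sketch of ridge regression (Equation (21)), written for the general vector-output case of Equation (20). The function name, the default α, and the random toy shapes are my own choices, not part of the lecture notes:

```python
import numpy as np

def ridge_regression(X, Y, alpha=1e-2):
    """Ridge regression, cf. Equations (20) and (21).
    X: n x N matrix of inputs (columns are the x_i).
    Y: N x k matrix of targets (rows are the y_i').
    Returns the k x n regression weight matrix W."""
    n = X.shape[0]
    # W' = (X X' + alpha^2 I)^{-1} X Y; solve the system instead of inverting explicitly
    W_T = np.linalg.solve(X @ X.T + alpha**2 * np.eye(n), X @ Y)
    return W_T.T

# usage sketch on random data (shapes only)
rng = np.random.default_rng(1)
n, k, N = 5, 2, 200
X = rng.normal(size=(n, N))
Y = rng.normal(size=(N, k))
W = ridge_regression(X, Y, alpha=0.1)   # W has shape (k, n); predictions are W @ x
```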

3.2 Temporal learning tasks

There are a number of standard learning tasks which are defined on the basis of timeseries data. Those tasks involve training data consisting of

• an input signal (u(t))_{t∈T}, where T is an ordered set of time points and, for every t ∈ T, u(t) ∈ R^k;

• a "teacher" output signal (y(t))_{t∈T}, where y(t) ∈ R^m.

For simplicity we will only consider discrete time with equidistant unit timesteps, that is, T = N (unbounded time) or T = {0, 1, ..., T} (finite time). The learning task consists in training a system which operates in time and, if it is fed with the input signal u(t), produces an output signal ŷ(t) which approximates the teacher signal y(t) (in the sense of minimizing a loss L(ŷ(t), y(t)) in time average over the training data). A few examples for illustration:

• Input signal: an ECG (electrocardiogram) signal; desired output: a signal which is constant zero as long as the ECG is normal, and jumps to 1 if there are irregularities in the ECG.

• Input signal: room temperature measurements from a thermometer. Output: value 0 as long as the room temperature input signal is above a certain threshold, value 1 if it is below the threshold (this is the input-output behavior of a thermostat, albeit a badly designed one - why?).

• Input signal: a noisy radio signal with lots of static and echoes. Desired output signal: the input signal in a version which has been de-noised and where echoes have been cancelled.

• Input signal: a speech signal in English. Desired output: a speech signal in Dutch.

These are quite different sorts of tasks. There are many names in use for quick reference to temporal tasks. The ECG monitoring task would be called a temporal classification or fault monitoring task; the 0-1 switching of the badly designed thermostat is too simplistic to have a respectable name; radio engineers would speak of de-noising and equalizing; and speech translation is too uniquely complex to have a name besides "online speech translation". Input-output signal transformation tasks are as many and as diverse as there are wonders under the sun.
In many cases, one may assume that the current output data point y(t) only depends on inputs up to that time point, that is, on the input history ..., u(t − 2), u(t − 1), u(t). Specifically, y(t) does not depend on future inputs. Input-output systems where the output does not depend on future inputs are called causal systems in the signal processing world.
A causal system can be said to have memory if the current output y(t) is not fully determined by the current input u(t), but is influenced by earlier inputs as well. I must leave the meaning of "influenced by" vague at this point; we will make it precise in a later section when we investigate stochastic processes in more detail. All examples except the (poorly designed) thermostat example have memory. Often, the output y(t) is influenced by long-ago input only to a negligible extent, and it can be explained very well from only the input history extending back

to a certain limited duration. All examples in the list above except the English translation one have such limited relevant memory spans. In causal systems with bounded memory span, the current output y(t) thus depends on an input window u(t − d + 1), u(t − d + 2), ..., u(t − 1), u(t) of d steps duration. Figure 11 (top) gives an impression.

Figure 11: The principle of temporal system learning by window-based regression. The case of scalar input and output signals is shown. Top: an input signal u(t) (blue crosses) is transformed to an output signal y(t) (or- ange). Bottom: The special case of timeseries prediction (single step prediction). The window size in both diagrams is d = 5.

In this situation, learning an input-output system is a regression learning task. One has to find a regression function f of the kind

f : (R^k)^d → R^m

which minimizes a chosen loss function on average over the training data time points. The most convenient option is to use linear functions f together with the quadratic loss, which turns the fearsome input-output timeseries learning task into a case of linear regression.

Note that while in Section 3.1 I assumed that the input data points for linear regression were scalar, they are now k-dimensional vectors in general. This is not a problem; all one has to do is to flatten the collection of d k-dimensional input vectors which lie in a window into a single d·k-dimensional vector, and then apply (20) as before.
Linear regression is often surprisingly accurate, especially when one uses large windows and a careful regularization (to be discussed later in this course) through ridge regression. When confronted with a new supervised temporal learning task, the first thing one should do as a seasoned pro is to run it through the machinery of window-based linear regression. This takes a few minutes of programming and gives, at least, a baseline for more sophisticated methods.
But linear regression can only give linear regression functions. This is not good enough if the dynamical input-output system behavior has significant nonlinear components. Then one must find a nonlinear regression function f.
If that occurs, one can resort to a simple method which yields nonlinear regression functions while not renouncing the conveniences of the basic linear regression learning formula (19). I discuss this for the case of scalar inputs u(t) ∈ R. The trick is to add fixed nonlinear transforms to the collection of input arguments u(t−d+1), u(t−d+2), ..., u(t). A common choice is to add polynomials. To make notation easier, let us rename u(t−d+1), u(t−d+2), ..., u(t) to u_1, u_2, ..., u_d. If one adds all polynomials of degree 2, one obtains a collection of d + d(d + 1)/2 input components for the regression, namely

{u1, u2, . . . , ud} ∪ {uiuj | 1 ≤ i ≤ j ≤ d}.

If one wants “even more nonlinearity”, one can venture to add higher-order poly- nomials. The idea to approximate nonlinear regression functions by linear combi- nations of polynomial terms is a classical technique in signal processing, where it is treated under the name of Volterra expansion or Volterra series. Very general classes of nonlinear regression functions can be approximated to arbitrary degrees of precision with Volterra expansions. Adding increasingly higher-order terms to a Volterra expansion obviously leads to a combinatorial explosion. Thus one will have to call upon some pruning scheme to use only those polynomial terms which lead to an increase of accuracy. There is a substantial literature in signal processing dealing with pruning strategies for Volterra series (google “Volterra pruning”). I personally would never try to use polynomials of degree higher than 2. If that doesn’t give satisfactory results, I would switch to other modeling techniques, using neural networks for instance.
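The following is a minimal numpy sketch of window-based (ridge) regression for a scalar input signal u and scalar teacher signal y, with an optional degree-2 Volterra-style expansion. All names, the regularization default, and the use of a constant-1 bias row are my own assumptions:

```python
import numpy as np
from itertools import combinations_with_replacement

def window_design_matrix(u, d, quadratic=False):
    """Stack sliding windows u(t-d+1),...,u(t) of a scalar signal as columns of a design
    matrix; optionally append all degree-2 products (a Volterra-style expansion)."""
    u = np.asarray(u, dtype=float)
    T = len(u)
    windows = np.stack([u[t - d + 1:t + 1] for t in range(d - 1, T)], axis=1)   # d x (T-d+1)
    if quadratic:
        quad = np.stack([windows[i] * windows[j]
                         for i, j in combinations_with_replacement(range(d), 2)], axis=0)
        windows = np.vstack([windows, quad])
    return np.vstack([windows, np.ones((1, windows.shape[1]))])   # constant-1 row for the bias

def fit_windowed(u, y, d, alpha=1e-2, quadratic=False):
    """Fit a window-based (ridge) regression from input signal u to teacher signal y,
    cf. Equations (19)-(21)."""
    X = window_design_matrix(u, d, quadratic)          # features x time
    y_target = np.asarray(y, dtype=float)[d - 1:]      # outputs aligned with window ends
    return np.linalg.solve(X @ X.T + alpha**2 * np.eye(X.shape[0]), X @ y_target)
```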

3.3 Time series prediction tasks

Looking into the future can make you rich, or wise, or satisfy your curiosity – these motivations are deeply rooted in human nature, and thus it is no wonder that time series prediction is an important kind of task for machine learning.

Time series prediction tasks come in many variants and I will not attempt to draw the large picture, but restrict this treatment to timeseries with integer timesteps and vector-valued observations, as in the previous subsections. Re-using terminology and notation, the input signal in the training data is, as before, (u(t))_{t∈T}. If one wants to predict this sequence of observations h timesteps ahead (h is called the prediction horizon), the desired output y(t) is just the input, shifted by h timesteps: (y(t))_{t∈T} = (u(t + h))_{t∈T}. A little technical glitch is that due to the timeshift h, the last h data points in the input signal (u(t))_{t∈T} cannot be used as training data, because their h-step future would lie beyond the maximal time T.
Framed as a u(t)-to-y(t) input-output signal transformation task, all methods that can be applied for the latter can be used for timeseries prediction too. Specifically, simple window-based linear regression (as in Figure 11 bottom) is again a highly recommendable first choice for getting a baseline predictor whenever you face a new timeseries prediction problem with numerical timeseries.
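A minimal sketch of how a prediction task with horizon h is rephrased as such an input-output task (the function name is mine; the commented usage lines assume the fit_windowed sketch from Section 3.2 above):

```python
import numpy as np

def make_prediction_dataset(u, h):
    """Rephrase an h-step-ahead prediction task as an input-output task: the teacher
    output is the input shifted by h steps; the last h points cannot be used."""
    u = np.asarray(u)
    return u[:-h], u[h:]

# usage sketch: single-step prediction with window-based regression
# u_in, y_teach = make_prediction_dataset(u, h=1)
# w = fit_windowed(u_in, y_teach, d=5)
```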

3.4 State-based vs. signal-based timeseries modeling

So far, I have described some ways in which one can use input observations to compute output signals. If one starts thinking about it, it seems that in some cases this should be impossible.
Many temporal learning tasks concern the input-output modeling of a physical dynamical system. The output signal is not (somehow magically) directly transformed from the input signal, but by a complex transformation which resides in the internal mechanisms of the physical system that sits "in the middle" between input and output signals. Just two examples:

• A radio reporter commenting on a soccer game. Output: the reporter's speech signal. Input: what the reporter sees on the soccer field. Dynamical system in the middle: the reporter, especially his brain and vocal tract. – Two different reporters would comment on the game differently, even if they see it from the same reporters' stand. Much of the specific information in the reporter's radio speech is not explainable by the raw soccer visual input, but arises from the reporter's specific knowledge and emotional condition. – Fun fact: attempting to develop machine learning methods for the automated commenting of soccer matches had been, for a decade or so around 1985, one of the leading research motifs in Germany's young AI research landscape.

• Input signal: the radiowaves emitted from a transmitting antenna. Output signal: the radiowaves picked up by a receiving antenna. Physical system “in the middle”: the physical world between the two antennas, inducing all kinds of static noise, echos, distortions, such that the received radio signal is a highly corrupted version of the cleanly transmitted one. Does it not

seem hopeless to model the input-output transformation without including a model of the intervening physical channel? – There can hardly be a more classical problem than this one; analysing signal transmission channels gave rise to Shannon's information theory.

Abstracting from these examples, consider how natural scientists and mathematicians describe dynamical systems. We stay in line with earlier parts of this section and consider only discrete-time models with time index set T = N or T = Z. Three timeseries are considered simultaneously:

• an input signal u(t), where u(t) ∈ Rk (as before),

• an output signal y(t), where y(t) ∈ Rm (as before),

• a dynamical system state sequence x(t), where x(t) ∈ Rn (this is new).

In the soccer reporting example, x(t) would refer to some kind of brain state — for instance, the vector of activations of all the reporter's brain's neurons. In the signal transmission example, x(t) would be the state vector of some model of the physical world stretched out between the sending and receiving antenna.
A note in passing: variable naming conventions are a little confusing. In the machine learning literature, x's and y's in (x, y) pairs usually mean the arguments and values (or inputs and outputs) of classification/regression functions. Both x and y are observable. In the dynamical systems literature (mathematics, physics and engineering), the variable name x is typically reserved for the state of a signal-generating dynamical system – this state is normally not fully observable and not part of the training data. Only the input and output signals u(t), y(t) are observable "data".
In a discrete-time setting, the temporal evolution of u(t), x(t), y(t) is governed by two functions, the state update map

x(t + 1) = f(x(t), u(t + 1))    (22)

and the observation function

y(t) = g(x(t)). (23)

The input signal u(t) is not specified by some equation, it is just “given”. Figure 12 visualizes the structural difference between the signal-based and the state-based input-output transformation models. There are many other types of state update maps and observation functions, for instance ODEs and PDEs for continuous-time systems, automata models for discrete-state systems, or a host of probabilistic formalisms for random dynamical systems. For our present discussion, considering only discrete-time state update maps is good enough.
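For concreteness, here is a minimal numpy sketch of simulating such a state-based system following Equations (22) and (23); the toy choices of f, g, the input, and the initial state are my own, purely for illustration:

```python
import numpy as np

def run_state_based_system(f, g, u, x0):
    """Simulate Equations (22) and (23): x(t+1) = f(x(t), u(t+1)), y(t) = g(x(t)).
    f, g are user-supplied functions; u is the input sequence; x0 the initial state."""
    x, ys = x0, [g(x0)]
    for t in range(1, len(u)):
        x = f(x, u[t])          # state update with the current input
        ys.append(g(x))         # observation of the new state
    return np.array(ys)

# example: a leaky accumulator state with a nonlinear scalar readout (an arbitrary toy system)
f = lambda x, u: 0.9 * x + 0.1 * u
g = lambda x: np.tanh(x)
y = run_state_based_system(f, g, u=np.sin(np.arange(100) * 0.3), x0=0.0)
```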

Figure 12: Contrasting signal-based with state-based timeseries transformations. Top: Window-based determination of the output signal through a regression function h (here the window size is 3). Bottom: Output signal generation by driving an intermediary dynamical system with an input signal. The state x(t) is updated by f.

A core difference between signal-based and state-based timeseries transformations is the achievable memory timespan. In windowed signal transformations through regression functions, the memory depth is bounded by the window length. In contrast, the dynamical system state x(t) of the intermediary system is potentially co-determined by input that was fed to the dynamical system in an arbitrarily deep past – the memory span can be unbounded! This may seem counterintuitive if one looks at Figure 12, because at each time point t, only the input data point u(t) from that same timepoint is fed to the dynamical system. But u(t) leaves some trace on the state x(t), and this effect is forwarded to the next timestep through f, thus x(t+1) is affected by u(t), too; and so forth. Thus, if one expects long-range or even unbounded memory effects, using state-based transformation models is often the best way to go.
Machine learning offers a variety of state-based models for timeseries transformations, together with learning algorithms. The most powerful ones are hidden Markov models (which we'll get to know in this course) and other dynamical graphical models (which we will not touch), and recurrent neural networks (which we'll briefly meet, I hope).

In some applications it is important that the input-output transformation can be learnt in an online adaptive fashion. The input-output transformation is then not trained just once, on the basis of a given, fixed training dataset. Instead, training never ends; while the system is being used, it continues to adapt to changing input-output relationships in the data that it processes. My favourite example is the online adaptive filtering (denoising, echo cancellation) of the radio signal received by a mobile phone. When the phone is physically moving while it is being used (a phonecall in a train, or just while walking up and down in a room), the signal channel from the transmitter antenna to the phone's antenna keeps changing. The denoising, echo-cancelling filter is re-learnt every few milliseconds. This is done with a window-based linear regression (window size several tens of thousands) and ingeniously simplified/accelerated algorithms. Because this is powerful stuff, we machine learners should not leave it only to the electrical engineers (who invented it) but learn to use it ourselves. I will devote a session later in this course to these online adaptive signal processing methods.

3.5 Takens' theorem

From what I just explained it may seem that signal-based and state-based input-output transformations are two quite different things. However, surprising connections exist. In certain ways and for certain systems, the two even coincide. This was discovered by the Dutch mathematician Floris Takens (1940-2010). His celebrated theorem (Takens, 1991), now universally called Takens Theorem, is deep and beautiful, and enabled novel data analysis methods which triggered a flood of work in the study of complex system dynamics (in all fields: biological, physical, meteorological, economical, social, engineering). Also, Takens was a professor in Groningen, and founded a tradition of dynamical systems theory research in our mathematics department. Plus, finally, Takens-like theorems have very recently (not yet published) been used to analyse why/how certain recurrent neural networks of the "reservoir computing" brand function so surprisingly well in machine learning. These are all good enough reasons for me to end this section with an intuitive explanation of Takens theorem, although this is not normally a part of a machine learning course. Since the following contains allusions to mathematical concepts which I cannot assume are known by all course participants, the material in this section will not be tested in exams.
Floris Takens' original theorem was formulated in a context of continuous-time dynamics governed by differential equations. Many variants and extensions of Takens theorem are known. To stay in tune with earlier parts of this section I present a discrete-time version. Consider the input-free dynamical system

x(t + 1) = f(x(t)),    (24)
y(t) = g(x(t)),

where x(t) ∈ R^n and y(t) ∈ R (a one-dimensional output). The background intuition behind this set-up is that x models some physical dynamical system — possibly quite high-dimensional, for instance x being a brain state vector — while y is a physical quantity that is "measured", or "observed", from x.
If the state update dynamics (24) is run for a long time, in many systems the state vector sequence x(t) will ultimately become confined to an attracting subset A ⊂ R^n of the embedding space R^n. In some cases, this subset will be a low-dimensional manifold of dimension m; in other cases ("chaotic attractors") it will have a fractal geometry which is characterized by a fractal dimension, a real number which I will likewise call m and which typically is also small compared to the embedding space dimension n.
Now let us look at the observation sequence y(t) recorded when the system has "relaxed" to the attracting subset. It is a scalar timeseries. For any k > 1, this scalar timeseries can be turned into a k-dimensional timeseries y(t) by delay embedding, as follows. Choose a delay of δ timesteps. Then set y(t) = (y(t), y(t − δ), y(t − 2δ), ..., y(t − (k − 1)δ))'. That is, just stack a few consecutive observations (spaced in time by δ) on top of each other. This gives a sequence of k-dimensional vectors y(t). You will recognize them as what we called "observation windows" earlier in this section.
Takens-like theorems now state that, under conditions specific to the particular version of the theorem, the geometry of the dynamics of the delay-embedded vectors is the same as the geometry of the dynamics of the original system, up to a smooth bijective transformation between the two geometries. I think a picture is really helpful here! Figure 13 gives a visual demo of Takens theorem at work.

Figure 13: Takens theorem visualized. Left plot shows the Lorenz attractor, a chaotic attractor with a state sequence x(t) defined in an n = 3 dimen- sional state space. The center plot shows a 1-dimensional observation y(t) thereof. The right plot (orange) shows the state sequence y(t) ob- tained from y(t) by three-dimensional delay embedding (I forget which delay δ I used).

When I just said "the geometry ... is the same ... up to smooth bijective transformation", I mean the following. Imagine that in Figure 13 (left) the blue trajectory lines were spun into a transparent 3-dimensional substrate which has rubber-like qualities. Then, by pushing and pulling and shearing this substrate (without rupturing it), the blue lines would take on new positions in 3D space — until they would exactly coincide with the orange ones in the right panel.
Even without applying such a "rubber-sheet transformation" (a term used in dynamical systems textbooks), some geometric characteristics are identical between the left and the right panel in the figure. Specifically, the so-called Lyapunov exponents, which are indicators for the stability (noise robustness) of the dynamics, and the (possibly fractal) dimension of the attracting set A are identical in the original dynamics and its delay-embedded reconstruction.
The deep, nontrivial message behind Takens-like theorems is that one can trade time for space in certain dynamical systems. If by "space" one means dimensions of state vectors, and by "time" lengths of observation windows of a scalar observable, Takens-like theorems typically state that if the dimension of the original attractor set (manifold or fractal) is m, then using a delay embedding of an integer dimension larger than or equal to 2m + 1 will always recuperate the original geometry; and sometimes smaller delay embedding dimensions suffice. Just for completeness: the fractal dimension of the Lorenz attractor is about m = 2.06, thus any delay embedding dimension exceeding 2m + 1 = 5.12 would certainly be sufficient for reconstruction from a scalar observation; it turns out in Figure 13 that an embedding dimension of 3 is already working.
In timeseries prediction tasks in machine learning, Takens theorem gives a justification why in deterministic, autonomous dynamical systems it should be possible to predict the future of a scalar timeseries y(t) from its past. If the timeseries in question has been produced from a system like (24), Takens theorem establishes that the information contained in the last 2m + 1 timesteps before the prediction point t comprises the full information about the state of the system that generated y(t), and thus knowledge of the last 2m + 1 timesteps is enough to predict y(t + 1) with absolute accuracy.
Even if you forget about the Takens theory background, it is good to be familiar with delay-embeddings of scalar timeseries, because they help create instructive graphics. If you transform a one-dimensional scalar timeseries y(t) into a 2-dimensional one by a 2-element delay embedding, plotting the 2-dimensional trajectory y(t) = (y(t − δ), y(t))' in its delay coordinates will often give revealing insight into the structure of the timeseries – our visual system is optimized for seeing patterns in 2-D images, not in 1-dimensional timeseries plots. See Figure 14 for a demo.

Figure 14: Getting nice graphics from delay embeddings. Left: a timeseries recorded from the "Mackey-Glass" chaotic attractor. Right: plotting the trajectory of a delay-embedded version of the left signal.
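A minimal numpy sketch of delay embedding as used here (the function name and the plotting lines are my own; the plot assumes matplotlib is available):

```python
import numpy as np

def delay_embed(y, k, delta=1):
    """Delay-embed a scalar timeseries: y(t) -> (y(t), y(t - delta), ..., y(t - (k-1) delta))'.
    Returns an array with one k-dimensional embedded vector per usable time point."""
    y = np.asarray(y)
    start = (k - 1) * delta
    return np.stack([y[start - j * delta: len(y) - j * delta] for j in range(k)], axis=1)

# usage sketch: 2-dimensional delay plot as in Figure 14
# Y = delay_embed(y, k=2, delta=5)
# plt.plot(Y[:, 1], Y[:, 0])    # y(t - delta) on the x-axis against y(t), assuming matplotlib
```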

4 Basic methods for dimension reduction

One way to take up the fight with the “curse of dimensionality” which I briefly highlighted in Section 1.2.2 is to reduce the dimensionality of the raw input data before they are fed to subsequent learning algorithms. The dimension reduction ratio can be enormous. In this Section I will introduce three standard methods for dimension reduc- tion: K-means clustering, principal component analysis, and self-organizing fea- ture maps. They all operate on raw vector data which come in the form of points x ∈ Rn, that is, this section is only about dimension reduction methods for nu- merical data. Dimension reduction is the archetypical unsupervised learning task.

4.1 Set-up, terminology, general remarks

We are given a family (x_i)_{i=1,...,N} of "raw" data points, where x_i ∈ R^n. The goal of dimension reduction methods is to compute from each (high-dimensional) "raw" data vector x_i a (low-dimensional) vector f(x_i) ∈ R^m, such that

• m < n, that is, we indeed reduce the number of dimensions — maybe even dramatically;

• the low-dimensional vectors f(x_i) should preserve from the raw data the specific information that is needed to solve the learning task that comes after this dimension-reducing data "preprocessing".

Terminology: The m component functions which constitute f(x) are called features. A feature is just a (any) function f : R^n → R which computes a scalar characteristic of input vectors x ∈ R^n. If one bundles together m such features f_1, ..., f_m, one obtains a feature map (f_1, ..., f_m)' =: f : R^n → R^m which maps input vectors to feature vectors.
It is very typical, almost universal, for ML systems to include an initial data processing stage where raw, high-dimensional input patterns are first projected

from their original pattern space R^n to a lower-dimensional feature space. In TICS, for example, a neural network was trained in a clever way to reduce the 1,440,000-dimensional raw input patterns to a 4,096-dimensional feature vector.
The ultimate quality of the learning system clearly depends on a good choice of features. Unfortunately there does not exist a unique or universal method to identify "good" features. Depending on the learning task and the nature of the data, different kinds of features work best. Accordingly, ML research has come up with a rich repertoire of feature extraction methods. On an intuitive level, a "good" set of features {f_1, ..., f_m} should satisfy some natural conditions:

• The number m of features should be small — after all, one of the reasons for using features is dimension reduction.

• Each feature fi should be relevant for the task at hand. For example, when the task is to distinguish helicopter images from winged aircraft photos (a 2-class classification task), the brightness of the background sky would be an irrelevant feature; but the binary feature “has wings” would be extremely relevant.

• There should be little redundancy — each used feature should contribute some information that the others don’t.

• A general intuition about features is that they should be rather cheap and easy to compute at the front end where the ML system meets the raw data. The “has wings” feature for helicopter vs. winged aircraft classification more or less amounts to actually solving the classification task and presumably is neither cheap nor easy to compute. Such highly informative, complex features are sometimes called high-level features; they are usually computed on the basis of more elementary, low-level features. Often features are com- puted stage-wise, low-level features first (directly from data), then stage by stage more complex, more directly task-solving, more “high-level cognition” features are built by combining the lower-level ones. Feature hierarchies are often found in ML systems. Example: in face recognition from photos, low-level features might extract coordinates of isolated black dots from the photo (candidates for the pupils of the person’s eyes); intermediate features might give distance ratios between eyes, nose-tip, center-of-mouth; high-level features might indicate gender or age.

To sharpen our intuitions about features, let us hand-design some features for use in a simple image classification task. Consider a training dataset which consists of grayscale, low-resolution pictures of handwritten digits (Figure 15). Here and further in this section I will illustrate dimension reduction techniques with the simple “Digits” benchmark dataset which was first described in Kittler et al. (1998). This dataset contains 2000 grayscale images of handwritten digits,

200 from each class. The images are 15 × 16 sized, making for n = 240 dimensional image vectors x. We assume they have been normalized to a real-valued pixel value range in [0, 1] with 0 = white and 1 = black. The Digits benchmark task is to train a classifier which classifies such images into the ten digit classes.

Figure 15: Some examples from the Digits dataset.

Here are some candidates for features that might be useful for this task.

Mean brightness. f_1(x) = 1_n' x / n (where 1_n is the vector of n ones). This is just the mean brightness of all pixels. Might be useful e.g. for distinguishing "1" images from "8" images, because we might suspect that for drawing an "8" one needs more black ink than for drawing a "1". Cheap to compute but not very class-differentiating.

Radiality. An image x is assigned a value f2(x) = 1 if and only if two conditions are met: (i) the center horizontal pixel line crossing the image from left to right has a sequence of pixels that changes from black to white to black; (ii) same for the center vertical pixel line. If this double condition is not met, the image is assigned a feature value of f2(x) = 0. f2 thus has only two possible values; it is called a binary feature. We might suspect that only the “0” images have this property. This would be a slightly less cheap-to-compute feature compared to f1 but more informative about classes.

Prototype matching. For each of the 10 classes c_j (j = 1, ..., 10) define a prototype vector π_j as the mean image vector of all examples of that class contained in the training dataset: π_j = 1/N_j Σ_{x is a training image of class j} x.

Then define 10 features f_3^j by their match with these prototype vectors: f_3^j(x) = π_j' x. We might hope that f_3^j has a high value for patterns of class j and low values for other patterns.
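To make these features concrete, here is a minimal numpy sketch of the mean-brightness and prototype-matching features (the binary radiality feature f_2 is omitted for brevity; all function names are my own):

```python
import numpy as np

def mean_brightness(x):
    """f1: mean pixel value of an image vector x (n = 240 for the 15 x 16 Digits images)."""
    return x.mean()

def class_prototypes(X_train, labels):
    """Compute the prototype pi_j of each class as the mean of that class's training images.
    X_train: N x n array of image vectors; labels: N class labels."""
    return np.stack([X_train[labels == j].mean(axis=0) for j in np.unique(labels)])

def prototype_features(x, prototypes):
    """f3^j: match of x with the class prototypes pi_j, computed as inner products."""
    return prototypes @ x       # one feature value per class
```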

Hand-designing features can be quite effective. Generally speaking, human insight on the side of the data engineer is a success factor for ML systems that can hardly be overrated. In fact, the classical ML approach to speech recognition relied for two decades on low-level acoustic features that had been hand-designed by insightful phonologists. The MP3 sound coding format is based on features that reflect characteristics of the human auditory system. Many of the first functional computer vision and optical character recognition systems relied heavily on visual feature hierarchies which grew from the joint efforts of signal processing engineers and cognitive neuroscience experts.
However, since hand-designing good features requires good insight on the side of the engineer, and good engineers are rare and have little time, the practice of ML today relies much more on features that are obtained from learning algorithms. Numerous methods exist. In the following subsections we will inspect three such methods.

4.2 K-means clustering

K-means clustering is an algorithm which allows one to split a collection of training data points (x_i)_{i=1,...,N} ∈ R^n into K clusters C_1, ..., C_K, such that points from the same cluster lie close to each other while being further away from points in other clusters. Each cluster C_j is represented by its codebook vector c_j, which is the vector pointing to the mean of all vectors in the cluster — that is, to the cluster's center of gravity.
The codebook vectors can be used in various ways to compress n-dimensional test data points x^test into lower-dimensional formats. The classical method, which also explains the naming of the c_j as "codebook" vectors, is to represent x^test simply by the index j of the codebook vector c_j which lies closest to x^test, that is, which has the minimal distance α_j = ||x^test − c_j||. The high-dimensional point x^test becomes represented by a single number, thus here we would have m = 1. This method is clearly very economical and it is widely used for data compression. But it is also clear that relevant information contained in the vectors x^test may be lost — maybe too much information for many ML applications.
Another, less lossy, way to make use of the codebook vectors for dimension reduction is to compute the distances α_j between x^test and the codebook vectors, and reduce x^test to the K-dimensional distance vector (α_1, ..., α_K)'. When K ≪ n, the dimension reduction is substantial. The vector (α_1, ..., α_K)' can be considered a feature vector.
We can be brief when explaining the K-means clustering algorithm, because it is almost self-explaining. The rationale for defining clusters is that points within a cluster should have small metric distance to each other, while points in different clusters should have large distance from each other.

Figure 16: Clusters obtained from K-means clustering (schematic): For a training set of data points (light blue dots), a spatial grouping into clusters C_j is determined by the K-means algorithm. Each cluster becomes represented by a codebook vector (dark blue crosses). The figure shows three clusters. The light blue straight lines mark the cluster boundaries. A test data point x^test (red cross) may then become coded in terms of the distances α_j of that point to the codebook vectors. Since this x^test falls into the second cluster C_2, it could also be compressed into the codebook index "2" of this cluster.

The procedure runs like this:

Given: a training data set (x_i)_{i=1,...,N} ∈ R^n, and a number K of clusters that one maximally wishes to obtain.

Initialization: randomly assign the training points to K sets S_j (j = 1, ..., K).

Repeat: For each set S_j, compute the mean µ_j = |S_j|^{-1} Σ_{x ∈ S_j} x. This mean vector µ_j is the "center of gravity" of the cluster S_j. Create new sets S_j' by putting each data point x_i into that set S_j' for which ||x_i − µ_j|| is minimal. If some S_j' remains empty, dismiss it and reduce K to K' by subtracting the number of dismissed empty sets (this happens rarely). Put S_j = S_j' (for the nonempty sets) and K = K'.

Termination: Stop when in one iteration the sets remain unchanged.
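A minimal numpy sketch of this procedure, together with the distance-based feature vector described above (the function names, the random initialization via a seed, and the relabeling of empty clusters are my own choices):

```python
import numpy as np

def k_means(X, K, seed=0):
    """Basic K-means as sketched above. X: N x n array of training points.
    Returns the codebook vectors (cluster means) and each point's cluster index."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(K, size=X.shape[0])                 # random initial assignment
    while True:
        # relabel to 0..K'-1, dropping clusters that became empty
        labels, assign = np.unique(assign, return_inverse=True)
        means = np.stack([X[assign == j].mean(axis=0) for j in range(len(labels))])
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # N x K'
        new_assign = dists.argmin(axis=1)                     # closest mean for each point
        if np.array_equal(new_assign, assign):                # stop when assignments are unchanged
            return means, assign
        assign = new_assign

def distance_features(x_test, codebook):
    """Reduce a test point to the K-dimensional vector of distances to the codebook vectors."""
    return np.linalg.norm(codebook - x_test, axis=1)
```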

It can be shown that at each iteration, the error quantity

J = Σ_{j=1}^K Σ_{x ∈ S_j} ||x − µ_j||^2    (25)

will not increase. The algorithm typically converges quickly and works well in

practice. It finds a local minimum or saddle point of J. The final clusters S_j may depend on the random initialization. The clusters are bounded by straight-line boundaries; each cluster forms a Voronoi cell. K-means cannot find clusters defined by curved boundaries. Figure 17 shows an example of a clustering run using K-means.

Figure 17: Running K-means with K = 3 on two-dimensional training points. Thick dots mark cluster means µj, lines mark cluster boundaries. The algorithm terminates after three iterations, whose boundaries are shown in light gray, dark gray, red. (Picture taken from Chapter 10 of the textbook Duda et al. (2001)).

K-means clustering and other clustering methods have many uses besides dimension reduction. Clustering can also be seen as a stand-alone technique of unsupervised learning. The detected clusters and their corresponding codebook vectors are of interest in their own right. They reveal a basic structuring of a set of patterns {x_i} into subsets of mutually similar patterns. These clusters may be further analyzed individually and given meaningful names, helping a human data analyst to make useful sense of the original unstructured data cloud. For instance, when the patterns {x_i} are customer profiles, finding a good grouping into subgroups may help to design targeted marketing strategies.

4.3 Principal component analysis

Like clustering, principal component analysis (PCA) is a basic data analysis technique that has many uses besides dimension reduction. But here we will focus on this use.

Generally speaking, a good dimension reduction method, that is, a good feature map f : R^n → R^m, should preserve much of the information contained in the high-dimensional patterns x ∈ R^n and encode it robustly in the feature vectors y = f(x). But what does it mean to "preserve information"? A clever answer is this: a feature representation f(x) preserves information about x to the extent that x can be reconstructed from f(x). That is, we wish to have a decoding function d : R^m → R^n which leads back from the feature vector encoding f(x) to x, that is, we wish to achieve x ≈ d ∘ f(x).
And here comes a fact of empowerment: when f and d are confined to linear functions and when the similarity x ≈ d ∘ f(x) is measured by mean square error, the optimal solution for f and d can be easily and cheaply computed by a method that has been known since the early days of statistics: principal component analysis (PCA). It was first found, in 1901, by Karl Pearson, one of the fathers of modern mathematical statistics. The same idea has been independently re-discovered under many other names in other fields and for a variety of purposes (check out https://en.wikipedia.org/wiki/Principal_component_analysis for the history). Because of its simplicity, analytical transparency, modest computational cost, and numerical robustness, PCA is widely used — it is the first-choice default method for dimension reduction that is tried almost by reflex, before more elaborate methods are maybe considered.
PCA is best explained alongside a visualization (Figure 18). Assume the patterns are 3-dimensional vectors, and assume we are given a sample of N = 200 raw patterns x_1, ..., x_200. We will go through the steps of a PCA to reduce the dimension from n = 3 to m = 2.
The first step in PCA is to center the training patterns x_i, that is, subtract their mean µ = 1/N Σ_i x_i from each pattern, obtaining centered patterns x̄_i = x_i − µ. The centered patterns form a point cloud in R^n whose center of gravity is the origin (see Figure 18A).
This point cloud will usually not be perfectly spherically shaped, but instead extend in some directions more than in others. "Directions" in R^n are characterized by unit-norm "direction" vectors u ∈ R^n. The distance of a point x̄_i from the origin in the direction of u is given by the projection of x̄_i on u, that is, the inner product u' x̄_i (see Figure 19). The "extension" of a centered point cloud {x̄_i} in a direction u is defined to be the mean squared distance to the origin of the points x̄_i in the direction of u. The direction of the largest extension of the point cloud is hence the direction vector given by

u_1 = argmax_{u, ||u||=1} 1/N Σ_i (u' x̄_i)^2.    (26)

Notice that since the cloud x̄_i is centered, the mean of all u' x̄_i is zero, and hence the number 1/N Σ_i (u' x̄_i)^2 is the variance of the numbers u' x̄_i.


Figure 18: Visualization of PCA. A. Centered data points and the first principal 3 component vector u1 (blue). The origin of R is marked by a red cross. B. Projecting all points to the orthogonal subspace of u1 and computing the second PC u2 (green). C. Situation after all three PCs have been determined. D. Summary visualization: the original data cloud with the three PCs and an ellipsoid aligned with the PCs whose main axes are scaled to the standard deviations of the data points in the respective axis direction. E. A new dimension-reduced coordinate system obtained by the projection of data on the subspace Um spanned by the m first PCs (here: the first two).

Inspecting Figure 18A, one sees how u_1 points in the "longest" direction of the pattern cloud. The vector u_1 is called the first principal component (PC) of the centered point cloud.
Next step: project the patterns on the (n − 1)-dimensional linear subspace of R^n that is orthogonal to u_1 (Figure 18B). That is, map pattern points x̄ to x̄* = x̄ − (u_1' x̄) · u_1. Within this "flattened" pattern cloud, again find the direction vector of greatest variance

u_2 = argmax_{u, ||u||=1} 1/N Σ_i (u' x̄_i*)^2

and call it the second PC of the centered pattern sample. From this procedure it is clear that u_1 and u_2 are orthogonal, because u_2 lies in the orthogonal subspace of u_1.

Figure 19: Projecting a point x̄_i on a direction vector u: the inner product u' x̄_i is the distance of x̄_i from the origin along the direction given by u.

Now repeat this procedure: In iteration k, the k-th PC uk is constructed by projecting pattern points to the linear subspace that is orthogonal to the already computed PCs u1, . . . , uk−1, and uk is obtained as the unit-length vector pointing in the “longest” direction of the current (n − k + 1)-dimensional pattern point distribution. This can be repeated until n PCs u1, . . . , un have been determined. They form an orthonormal coordinate system of Rn. Figure 18C shows this situ- ation, and Figure 18D visualizes the PCs plotted into the original data cloud. Now define features fk (where 1 ≤ k ≤ n) by

f_k : R^n → R, x ↦ u_k' x̄,    (27)

that is, f_k(x) is the projection component of x̄ on u_k. Since the n PCs form an orthonormal coordinate system, any point x ∈ R^n can be perfectly reconstructed from its feature values by

x = µ + Σ_{k=1,...,n} f_k(x) u_k.    (28)

The PCs and the corresponding features fk can be used for dimension reduction as follows. We select the first (“leading”) PCs u1, . . . , um up to some index m. Then we obtain a feature map

f : R^n → R^m, x ↦ (f_1(x), ..., f_m(x))'.    (29)

At the beginning of this section I spoke of a decoding function d : Rm → Rn which should recover the original patterns x from their feature vectors f(x). In our PCA story, this decoding function is given by

d : (f_1(x), ..., f_m(x))' ↦ µ + Σ_{k=1}^m f_k(x) u_k.    (30)

How "good" is this dimension reduction, that is, how similar are the original patterns x_i to their reconstructions d ∘ f(x_i)? If the dissimilarity of two patterns x_1, x_2 ∈ R^n is measured in the square error sense by δ(x_1, x_2) := ||x_1 − x_2||^2, a full answer can be given. Let

σ_k^2 = 1/N Σ_i f_k(x_i)^2

denote the variance of the feature values f_k(x_i) (notice that the mean of the f_k(x_i), taken over all patterns, is zero, so σ_k^2 is indeed their variance). Then the mean square distance between patterns and their reconstructions is

1/N Σ_i ||x_i − d ∘ f(x_i)||^2 = Σ_{k=m+1}^n σ_k^2.    (31)

A derivation of this result is given in Appendix E.
Equation (31) gives an absolute value for dissimilarity. For applications, however, the relative amount of dissimilarity compared to the mean variance of the patterns is more instructive. It is given by

(1/N Σ_i ||x_i − d ∘ f(x_i)||^2) / (1/N Σ_i ||x̄_i||^2) = (Σ_{k=m+1}^n σ_k^2) / (Σ_{k=1}^n σ_k^2).    (32)

Real-world sets of patterns often exhibit a rapid (roughly exponential) decay of the feature variances σ_k^2 as their index k grows. The ratio (32) is then very small, that is, compared to the mean size E[||X̄||^2] of the patterns only very little information is lost by reducing the dimension from n to m via PCA. Our visualization from Figure 18 is not doing justice to the amount of compression savings that is often possible. Typical real-world, high-dimensional, centered data clouds in R^n are often very "flat" in the vast majority of directions in R^n, and these directions can all be zeroed without much damage by the PCA projection.
Note. Using the word "variance" above is not mathematically correct. Variance is the expected squared deviation from a population mean, a statistical property which is different from the averaged squared deviation from a sample mean. It would have been more proper to speak of an "empirical variance" above, or an "estimated variance". Check out Appendix D!

4.4 Mathematical properties of PCA and an algorithm to compute PCs

The key to analysing and computing PCA is the (n × n)-dimensional covariance matrix C = 1/N Σ_i x̄_i x̄_i', which can be easily obtained from the centered training data matrix X̄ = [x̄_1, ..., x̄_N] by C = 1/N X̄ X̄'. C is very directly related to PCA by the following facts:

69 1. The PCs u1, . . . , un form a set of orthonormal, real eigenvectors of C.

2. The feature variances σ_1^2, ..., σ_n^2 are the eigenvalues of these eigenvectors.

A derivation of these facts can be found in my legacy ML lecture notes, Section 4.4.1 (http://minds.jacobs-university.de/uploads/teaching/lectureNotes/LN ML Fall11.pdf). Thus, the principal component vectors u_k and their associated data variances σ_k^2 can be directly gleaned from C.
Computing a set of unit-norm eigenvectors and eigenvalues from C can be most conveniently done by computing the singular value decomposition (SVD) of C. Algorithms for computing SVDs of arbitrary matrices are shipped with all numerical or statistical mathematics software packages, like Matlab, R, or Python with numpy. At this point let it suffice to say that every covariance matrix C is a so-called positive semi-definite matrix. These matrices have many nice properties. Specifically, their eigenvectors are orthogonal and real, and their eigenvalues are real and nonnegative. In general, when an SVD algorithm is run on an n-dimensional positive semi-definite matrix C, it returns a factorization

C = U Σ U',

where U is an n × n matrix whose columns are the normed orthogonal eigenvectors u_1, ..., u_n of C, and where Σ is an n × n diagonal matrix which has the eigenvalues λ_1, ..., λ_n on its diagonal. They are usually arranged in descending order. Thus, computing the SVD C = U Σ U' directly gives us the desired PC vectors u_k, lined up in U, and the variances σ_k^2, which appear as the eigenvalues of C, collected in Σ. This enables a convenient control of the goodness of similarity that one wants to ensure. For example, if one wishes to preserve 98% of the variance information from the original patterns, one can use the r.h.s. of (32) to determine the "cutoff" m such that the ratio in this equation is about 0.02.

4.5 Summary of PCA based dimension reduction procedure

Data. A set (x_i)_{i=1,...,N} of n-dimensional pattern vectors.

Result. An n-dimensional mean pattern vector µ and m principal component vectors, arranged column-wise in an n × m sized matrix U_m.

Procedure.

Step 1. Compute the pattern mean µ and center the patterns to obtain a centered pattern matrix X̄ = [x̄_1, ..., x̄_N].

Step 2. Compute the SVD U Σ U' of C = 1/N X̄ X̄' and keep from U only the first m columns, making for an n × m sized matrix U_m.

Usage for compression. In order to compress a new n-dimensional pattern x to an m-dimensional feature vector f(x), compute f(x) = U_m' x̄.

Usage for uncompression (decoding). In order to approximately restore x from its feature vector f(x), compute x_restored = µ + U_m f(x).
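A minimal numpy sketch of this summary, including a cutoff choice via the variance ratio of Equation (32) (the function names and the 98% default are my own choices):

```python
import numpy as np

def pca(X):
    """PCA following the summary above. X: n x N matrix whose columns are the patterns x_i.
    Returns the mean mu, the PC matrix U (columns u_1,...,u_n) and the variances sigma_k^2."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                               # Step 1: center the patterns
    C = (Xc @ Xc.T) / X.shape[1]              # covariance matrix
    U, S, _ = np.linalg.svd(C)                # Step 2: SVD of C; S holds the eigenvalues
    return mu, U, S

def choose_m(S, keep=0.98):
    """Smallest m preserving a fraction `keep` of the total variance, cf. the ratio in (32)."""
    return int(np.searchsorted(np.cumsum(S) / S.sum(), keep) + 1)

# usage: compress and (approximately) restore a pattern x
# mu, U, S = pca(X); m = choose_m(S); Um = U[:, :m]
# f_x = Um.T @ (x - mu.ravel())               # feature vector (compression)
# x_restored = mu.ravel() + Um @ f_x          # decoding (uncompression)
```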

4.6 Eigendigits

For a demonstration of dimension reduction by PCA, consider the "3" digit images. After reshaping the images into 240-dimensional grayscale vectors, centering, and computing the PCA on the basis of N = 100 training examples, we obtain 240 PCs u_k associated with variances σ_k^2. Only the first 99 of these variances are nonzero (because the 100 image vectors x_i span a 100-dimensional subspace in R^240; after centering, the x̄_i however span only a 99-dimensional subspace – why? homework exercise! – thus the matrix C = 1/N X̄ X̄' has rank at most the rank of X̄, which is 99), thus only the first 99 PCs are useable.
Figure 20 A shows some of these eigenvectors u_i rendered as 15 × 16 grayscale images. It is customary to call such PC re-visualizations eigenimages, in our case "eigendigits". (If you have some spare time, do a Google image search for "eigenfaces" and you will find weird-looking visualizations of PC vectors obtained from PCA carried out on face pictures.)
Figure 20 B shows the variances σ_i^2 of the first 99 PCs. You can see the rapid (roughly exponential) decay. Aiming for a dissimilarity ratio (Equation 32) of 0.1 gives a value of m = 32. Figure 20 C shows the reconstructions of some "3" patterns from the first m PC features using (30).

Figure 20: A. Visualization of a PCA computed from the "3" training images. Top left panel shows the mean µ, the next 7 panels (row-wise) show the first 7 PCs. Third row shows PCs 20–23, last row PCs 96–99. Grayscale values have been automatically scaled per panel such that they spread from pure white to pure black; they do not indicate absolute values of the components of PC vectors. B. The (log10 of) variances of the PC features on the "3" training examples. C. Reconstructions of digit "3" images from the first m = 32 features, corresponding to a reconstitution of 90% of the original image dataset variance. First row: 4 original images from the training set. Second row: their reconstructions. Third row: 4 original images from the test set. Last row: their reconstructions.

4.7 Self-organizing maps

Nature herself has been facing the problem of dimension reduction when She was using evolution as a learning algorithm to obtain well-performing brains. Specifically, in human visual processing the input dimension n is in the order of a million (judging from the number of axons in the optical nerve), but the "space" where the incoming n-dimensional raw visual signal is processed has only m = 2 dimensions: it is the walnut-kernel-curly surface of the brain, the cortical sheet. Now this is certainly an over-simplified view of biological neural information processing, but at least in some brain areas and for some kinds of input projecting into those areas, it seems fairly established in neuroscience that an incoming high-dimensional signal becomes mapped onto an m = 2 dimensional representation on the cortical sheet.
In the 1980s, Teuvo Kohonen developed an artificial neural network model which (i) could explain how this neural dimension reduction could be learnt in biological brains; and (ii) could be used as a practical machine learning algorithm for dimension reduction. His model is today known as the Kohonen Network or Self-Organizing Map (SOM). The SOM model is one of the very few and precious instances of a neural information processing model which is anchored in, and can bridge between, both neuroscience and machine learning. It made a strong impact in both communities. In the machine learning / data analysis world SOMs have been superseded in later years by other methods, and in my perception they are only rarely used today. But this may again change when non-digital neuromorphic microchip technologies mature. In neuroscience research, SOM-inspired models continue to be explored. Furthermore, the SOM model is simple and intuitive. I think all of this is reason enough to include them in this lecture.
The learning task for SOMs is defined as follows (for the case of m = 2):

Given: a pattern collection P = (x_i)_{i=1,...,N} of points in R^n.

Given: a 2-dimensional grid of “neurons” V = (vkl)1≤k≤K, 1≤l≤L, where k, l are the grid coordinates of neuron vkl. Learning objective: find a map κ : P → V which

1. preserves input space topology (the metric topology of Rn), that is, points xj in the neighborhood of a point xi should be mapped to κ(xi) or to grid neighbors of κ(xi), and which

2. covers the pattern collection P evenly, that is, all grid neurons vkl should have approximately the same number of κ-preimages.

Two comments: (i) Here I use a grid with a rectangular neighborhood struc- ture. Classical SOM papers and many applications of SOMs use a hexagonal neighborhood structure instead, where each neuron has 6 neighbors, all at the same distance. (ii) The “learning objective” sounds vague. It is. At the time

when Kohonen introduced SOMs, tailoring learning algorithms along well-defined loss functions was not standard. Kohonen's modeling attitude was oriented toward biological modeling, and a mathematical analysis of the SOM algorithm was not part of his agenda. In fact, the problem of finding a loss function which is minimized by the original SOM algorithm was still unsolved in the year 2008 (Yin, 2008). I don't know what the current state of research in this respect is — I would guess not much has happened since.
In a trained SOM, each grid neuron v_kl is connected to the input pattern space R^n by an n-dimensional weight (row) vector w(v_kl). If a new test pattern x ∈ R^n arrives, its κ-image is computed by

κ(x) = argmax_{v_kl ∈ V} w(v_kl) x.    (33)

In words, the neuron v whose weight vector w(v) best matches the input pattern x is chosen. In the SOM literature this neuron is called the best matching unit (BMU). Clearly the map κ is determined by the weight vectors w(v). In order to train them on the training data set P, a basic SOM learning algorithm works as follows:

Initialization: The weights w(v_kl) are set to small random values.

Iterate until convergence:

1. Randomly select a training pattern x ∈ P.

2. Determine the BMU v_BMU for this pattern, based on the current weight vectors w(v).

3. Update all the grid's weight vectors w(v_kl) according to the formula

w(v_kl) ← w(v_kl) + λ f_r(d(v_kl, v_BMU)) (x − w(v_kl)).    (34)

4. Make r a little smaller before entering the next iteration.

In this sketch of an algorithm, λ is a learning rate, a small number, for instance λ = 0.01. The function d is the Euclidean distance between two neurons' positions on the grid. f_r : R^≥0 → R^≥0 is a non-negative, monotonically decreasing function defined on the nonnegative reals which satisfies f_r(0) = 1 and which goes to zero as its argument grows. The function is parametrized by a radius parameter r > 0, where greater values of r spread out the range of f_r. A common choice is a Gaussian-like function f_r(x) = exp(−x²/r²).
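To make this concrete, here is a minimal numpy sketch of one possible implementation of the algorithm above. The function name train_som, the fixed number of iterations, and the geometric shrinking schedule for r are illustrative choices of mine, not part of the original SOM formulation; the BMU is selected by the inner-product criterion of Equation (33).

```python
import numpy as np

def train_som(P, K, L, n_iter=10000, lam=0.01, r_start=None, r_end=0.5, seed=0):
    """Train a K x L SOM on patterns P (an N x n array); returns the weight array W of shape (K, L, n)."""
    rng = np.random.default_rng(seed)
    N, n = P.shape
    W = rng.normal(scale=0.01, size=(K, L, n))            # small random initial weights
    gk, gl = np.meshgrid(np.arange(K), np.arange(L), indexing="ij")   # grid coordinates of all neurons
    if r_start is None:
        r_start = max(K, L) / 2.0
    for t in range(n_iter):
        # shrink the radius r a little in every iteration (geometric schedule, an arbitrary choice)
        r = r_start * (r_end / r_start) ** (t / (n_iter - 1))
        x = P[rng.integers(N)]                             # 1. randomly selected training pattern
        # 2. best matching unit, chosen by the largest inner product w(v) x, as in Equation (33)
        bmu = np.unravel_index(np.argmax(W @ x), (K, L))
        # 3. update all weight vectors according to Equation (34)
        d = np.sqrt((gk - bmu[0]) ** 2 + (gl - bmu[1]) ** 2)   # Euclidean distance to the BMU on the grid
        f = np.exp(-d ** 2 / r ** 2)                       # Gaussian-like neighborhood function f_r
        W += lam * f[..., None] * (x - W)
    return W
```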

Here is an intuitive explanation of this learning algorithm. When a new training pattern x has been presented and the BMU has been determined, each weight vector w(v_kl) is adapted a little bit. The weight update formula (34) adds to w(v_kl) a small multiple of x − w(v_kl). This pulls w(v_kl) in the direction of x. How strong this pull is depends on how close v_kl is to v_BMU: the closer, the stronger. This is regulated by the distance-dependent factor f_r(d(v_kl, v_BMU)). Obviously the adaptation is strongest for the BMU. When, in the early phases of the algorithm, r is large, neurons in the wider vicinity of the BMU will likewise receive rather strong weight adjustments, while only far-away neurons remain more or less unaffected.

This mechanism has the effect that after convergence, the training dataset P will be covered by grid neurons rather evenly (see objective nr. 2 stated above). To get an intuition for why this is so, let us consider a specific scenario. Assume that the pattern set P contains a dense cluster of mutually quite similar patterns x, besides a number of other, dissimilar patterns. Furthermore assume that we are in an early stage of the learning process, where the radius r is still rather large, and also assume that at this early stage each pattern from the cluster yields the same BMU v_0. Due to the large number of members in the cluster, patterns from that cluster will be drawn by the learning algorithm rather often. With r large, this will have the effect that neurons in the wider neighborhood of the BMU v_0 grow their weight vectors toward w(v_0). After some time, v_0 will be surrounded by grid neurons whose weight vectors are all similar to w(v_0), and w(v_0) will roughly be the mean of all patterns x in the cluster. Now, some patterns x in the cluster will start to have as their BMU not v_0 any longer, but some of its surrounding neighbors (why?). As a consequence, increasingly many patterns in the cluster will best-match the weight vectors of an increasing number of neighbors of v_0: the subpopulation of grid neurons which respond to cluster patterns has grown from the singleton population {v_0} to a larger one. This growth will continue until the population of grid neurons responding to cluster patterns has become so large that each member's BMU-response rate has become too low to further drive the population's expansion.

The radius r is set to large values initially in order to let all (or most) patterns in P compete with each other, leading to a coarse global organization of the emerging map κ. In later iterations, increasingly smaller r leads to a fine-balancing of the BMU responses of patterns that are similar to each other.

SOM learning algorithms come in many variations; I sketched one exemplar, but the core idea is always the same. Setting up a SOM learning algorithm and tuning it is not always easy: the weight initialization, the decrease schedule for r, the learning rate, the random sampling strategy for training patterns from P, the grid dimension (2 or 3 or even more, with 2 being the most common choice), and the pattern preprocessing (for instance, normalizing all patterns to the same norm) are all design decisions that can have a strong impact on the convergence properties and the quality of the final result. Yin (2008) includes a brief survey of SOM algorithm variants.

I mention in passing an algorithm which has its roots in SOMs but is quite significantly different: the Neural Gas algorithm (a good brief intro can be found at https://en.wikipedia.org/wiki/Neural_gas). Like SOMs, it leads to a collection of neurons which respond to patterns from a training set through trainable weight vectors. The main difference is that the neurons are not spatially arranged on a grid but are spatially uncoupled from each other (hence, neural "gas").

The spatially defined distance d appearing in the adaptation efficacy term f_r(d(v_kl, v_BMU)) is replaced by a rank ordering: the neuron with the best response to training pattern x (i.e., the BMU) is adapted most strongly, the unit v with the second best response (i.e., the second largest value of w(v) x) second most strongly, and so on.

For a quick SOM demo I used a version of a legacy Matlab toolbox published by Kohonen's own research group (I downloaded it almost 20 years ago). As a pattern dataset P I used the Digits dataset that I also used before in this section, with 100 examples from each digit class. Figure 21 shows the result of training an 8 × 8 neuron grid on this dataset. As expected, each of the ten digit classes becomes represented by approximately the same number of grid neurons, reflecting the fact that the classes were represented in P in equal shares.


Figure 21: An 8 × 8 SOM representation of the Digits training dataset. The grayscale images show the weight vectors w(v_kl), reshaped back to the rectangular pixel image format. The numbers in the small green insets give the most frequent class label of all training patterns which had the respective grid neuron as BMU. If this number is missing, the grid neuron never fired as BMU for any training pattern.

Just for the fun of it, I also generated an unbalanced pattern set that had 180 examples of class “5” and 20 examples of every other class. The result is shown in Figure 22. As it should be, roughly half of the SOM neurons are covering the “5” class examples, while the other SOM neurons reach out their weight vectors into the remaining classes.


Figure 22: Similar to previous figure, but with an unbalanced pattern set where the number of “5” examples was the same as the total number of all other classes.

Practical uses of SOMs appear to have been mostly as visualization tools for exploring (labelled) high-dimensional datasets in two-dimensional graphics. Specifically, such SOM visualizations can give insight into metric similarities between pattern classes. For instance, inspecting the bottom right 3 × 3 panels in Figure 21, one finds "morphable" similarities between certain versions of "5" and "9" patterns.

What I find more relevant and interesting about SOMs is their use in the neurosciences. Learning mechanisms similar to the SOM learning algorithm have been (and are being) invoked to explain the 2-dimensional spatial organization of cortical maps. They can explain how neurons on a surface patch of the cortical sheet align their specific responsiveness to high-dimensional input (from sensors or other brain areas) with their local 2-dimensional metric neighborhood. Pictures which put synthetic SOM grids side by side with images recorded from small patches of cortical surface have variously been published. Figure 23 gives an example from a study by Swindale and Bauer (1998). If you are interested in such themes, I can recommend the review by Bednar and Wilson (2016) on cortical maps as a starting point.

Figure 23: The SOM algorithm reproducing biological cortical response patterns. The scenario: an anaesthetized ferret with open eyes is shown moving images of a bar with varying orientation and direction of movement, while response activity from neurons on a patch (about 10 square mm) of visual cortex is recorded. A. A color-coded recording of neural response activity depending on the orientation of the visually presented bar. For instance, if the bar was shown in horizontal orientation, the neurons which responded to this orientation with maximal activity are rendered in red. B. Like panel A., but for the motion direction of the bar (same cortical patch). C., D. Artificial analogues of A., B. generated with a SOM algorithm. Figures taken from Swindale and Bauer (1998), who in turn took panels A. and B. from Weliky et al. (1996).

4.8 Summary discussion. Model reduction, data compression, dimension reduction

Reducing the dimensionality of patterns which are represented by vectors is important not only in machine learning but everywhere in science and engineering, where you find it explored under different headline names and for different purposes:

Dimension reduction is a term used in machine learning and statistics. The core motivation is, first, that statistical and machine learning models typically depend on an estimate of a probability distribution over some "pattern" space (this distribution estimate may remain implicit); second, that the curse of dimensionality makes it intrinsically hard to estimate distributions from training data points in high-dimensional vector spaces; so that, third, a key to success is to reduce the dimension of the "raw" patterns. Typically, the low-dimensional target format is again real-valued vectors ("feature vectors").

Data compression is a term used in signal processing and communication technology. The practical and economic importance of being able to compress and uncompress bulky data is obvious. The compressed data may have a different type than the raw data. For instance, in vector quantization compression, a real-valued vector becomes encoded by a natural number, the index of its associated codebook vector. Again there is a close connection with probability, established through Shannon information theory.

Model reduction is a term used when it comes to trimming down not just static "data points" but entire dynamical "system models". All branches of science and engineering today deal with models of complex physical dynamical systems which are instantiated as systems of coupled differential equations, often millions, sometimes billions or more of them. One obtains such gargantuan systems of coupled ODEs almost immediately when one discretizes system models expressed in PDEs. Such systems cannot be numerically solved on today's computing hardware and need to be dramatically shrunk before a simulation can be attempted. I am not familiar with this field. The mathematical tools and leading intuitions are different from the ones that guide dimension reduction in machine learning. I mention this field for completeness and because the name "model reduction" invites misleading analogies with dimension reduction. Antoulas and Sorensen (2001) give a tutorial overview with instructive examples.

In this section we took a look at three methods for dimension reduction of high-dimensional "raw" data vectors, namely K-means clustering, PCA, and SOMs. While at first sight these methods appear quite different from each other, there is a unifying view which connects them.

In all three of them, the reduction was done by introducing a comparatively simple kind of geometric object in the original high-dimensional pattern space R^n, which was then used to re-express raw data points in a lightweight encoding:

1. In K-means clustering, this object is the set of codebook vectors, which can be used to compress a test data point to the mere natural number index of its associated codebook vector; or which can be used to give a reduced K-dimensional vector made up of the distances α_j to the codebook vectors.

2. In PCA, this object is the m-dimensional linear (affine) hyperplane spanned by the first m eigenvectors of the data covariance matrix. An n-dimensional test point is represented by its m coordinates in this hyperplane.

3. In SOMs, this object is the grid of SOM neurons v, including their associated weight vectors w(v). A new test data point can be compressed to the natural number index of its BMU. This is entirely analogous to how the codebook vectors in K-means clustering can be used. Furthermore, if an m-dimensional neuron grid is used (we considered only m = 2 above), the grid plane is nonlinearly "folded" into the original data space R^n, in that every grid point v becomes projected to w(v). It thus becomes an m-dimensional manifold embedded in R^n, which is characterized by codebook vectors. This can be seen as a nonlinear generalization of the m-dimensional linear affine hyperplanes embedded in R^n in PCA. Thus, the SOM shares properties with both K-means clustering and PCA.

In fact, one can systematically explore a whole spectrum of dimension reduction / data compression algorithms which are located between K-means clustering and PCA, in the sense that they describe m-dimensional manifolds of different degrees of nonlinearity through codebook vectors. K-means clustering is the extreme case that uses only codebook vectors and no manifolds; PCA is the other extreme with only manifolds and no codebook vectors. The extensive Preface to the collection volume Principal Manifolds for Data Visualization and Dimension Reduction (Gorban et al., 2008) gives a readable introduction to this interesting field.

In today's deep learning practice one often ignores the traditional methods treated in this section. Instead one immediately fires the big cannon, training a deep neural network wired up in an autoencoder architecture. An autoencoder network is a multilayer feedforward network whose output layer has the same large dimension n as the input layer. It is trained in a supervised way, using training data (x_i, y_i) to approximate the identity function: the training output data y_i are identical to the input patterns x_i (possibly up to some noise added to the inputs). The trick is to insert a "bottleneck" layer with only m ≪ n neurons into the layer sequence of the network. In order to achieve a good approximation of the n-dimensional identity map on the training data, the network has to discover an n → m dimension-reducing mapping which preserves most of the information that is needed to describe the training data points.

I will not give an introduction to autoencoder networks in this course (it is a topic for Marco Wiering's "Deep Learning" course). The standard deep learning reference Goodfellow et al. (2016) has an entire section on autoencoders.

5 Discrete symbolic versus continuous real-valued

I hope this section will be as useful as it will be short and simple. Underneath it, however, lurks a mysterious riddle of mathematics, philosophy and the neurosciences. Some data are given in symbolic form, for instance

• texts of all sorts,

• yes/no or rating scale questionnaire items,

• DNA and protein sequence data,

• records of chess or Go matches,

• large parts of public administration databases.

Others are real-valued (scalar, vector, matrix or array formatted) in principle, notwithstanding the circumstance that on digital machines real real numbers must be approximated by finite-length binary strings. Examples of data that are best understood as “continuous” are

• images, video, speech,

• measurement data from technical systems, e.g. in production process control or engine monitoring,

• financial data,

• environmental and weather data,

• signals used in robot motion control.

Many mathematical formalisms can be sorted into two categories: "discrete" and "continuous". Mathematicians, in fact, come in two kinds: ask one, and he/she will be able to tell you whether he/she feels more like a discrete or a continuous mathematician. Discrete sorts of mathematics and discrete mathematicians typically operate in areas like number theory, set theory, logic, algebra, graph theory, and automata theory. Continuous mathematics and continuous mathematicians gravitate toward linear algebra, functional analysis, and calculus. Computer scientists are almost always discrete-math-minded: 0 and 1 and algorithms and Boolean circuits and AI knowledge representation formalisms are eminently discrete. Physicists are almost always continuous-math-minded, because their world is made of continuous space, time, matter, forces and fields. There are some mathematical domains that are neither, or open to both, especially topology and probability.

Some of the most advanced and ingenious modern fields of mathematics arise from crossover formalisms between the Discrete and the Continuous. The fundamental difference between the two is not dissolved in these theories, but the tension between the Discrete and the Continuous sets free new forms of mathematical energy. Sadly, these lines of research are beyond what I understand and what I can explain (or even name), and certainly beyond what is currently used in machine learning.

The hiatus (an educated word of Latin origin, meaning "dividing gap") between the Discrete and the Continuous is also the source of one of the great unresolved riddles in the neurosciences, cognitive science and AI: how can symbolic reasoning (utterly discrete) emerge from the continuous matter and signal processing in our material brains (very physical, very continuous)? This question has kept AI researchers and philosophers busy (and sometimes aggressively angry with one another) for five decades now and is not resolved; if you are interested, you can get a first flavor from the Wikipedia article on "Physical symbol system" or by reading the overview articles listed at http://www.neural-symbolic.org/.

Back to our down-to-earth business. Machine learning formalisms and algorithms likewise are often either discrete-flavored or continuous-flavored. The former feed on symbolic data and create symbolic results, using tools like decision trees, Bayesian networks and graphical models (including hidden Markov models), inductive logic, and certain sorts of neural networks where neurons have 0-1-valued activations (Hopfield networks, Boltzmann machines). The latter digest vector data and generate vector output, with deep neural network methods currently being the dominating workhorse, overshadowing more traditional ones like support vector machines and various sorts of regression learning "machines".

The great built-in advantage of discrete formalisms is that they lend themselves well to explainable, human-understandable solutions. Their typical disadvantage is that learning or inference algorithms are often based on combinatorial search, which quickly lets computing times explode. In contrast, continuous formalisms typically lead to results that cannot be intuitively interpreted (vectors don't talk), but they lend themselves to nicely, smoothly converging optimization algorithms. When one speaks of "machine learning" today, one mostly has vector processing methods in mind. This RUG course, too, focuses on vector data. The discrete strands of machine learning are more associated with what one often calls "data mining", although this terminology is not clearly defined. I might mention that I have a brother, Manfred Jaeger (http://people.cs.aau.dk/∼jaeger/), who is also a machine learning professor, but he is from the discrete quarters while I am more on the continuous side. We often meet and talk, but not about science, because neither of us understands what the other one researches; we publish in different journals and go to different conferences.

Sometimes one has vector data but wants to exploit benefits that come with discrete methods, or conversely, one has symbolic data and wants to use a neural network (because everybody else seems to be using them, or because one doesn't want to fight with combinatorial explosions).

Furthermore, many an interesting dataset comes as a mix of symbolic-discrete and numerical-continuous data; for instance, data originating from questionnaires or financial/business/administration sources are often of mixed sort. Then one way to go is to convert discrete data to vectors, or vice versa. It is a highly empowering professional skill to know the basic methods of discrete ↔ continuous conversion. Here are some discrete-to-continuous transformations:

One-hot encodings. Given: data points a_ν that are symbols from a finite "alphabet" A = {a_1, ..., a_k}. Examples: yes/no answers in a questionnaire; words from a vocabulary; the nucleic acid bases A, C, G, T occurring in DNA. Turn each a_ν into the k-dimensional binary vector v_ν ∈ {0, 1}^k which is zero everywhere except at position ν. This is a very common way to present symbolic input to neural networks. On the output side of a neural network (or any other regression learning machine), one-hot encodings are also often used to give vector teacher data in classification tasks: if (x_i, c_i) is a classification-task training dataset, where c_i is a symbolic class label from a class set C = {c_1, ..., c_k}, transform each c_i to its k-dimensional one-hot vector v_i and get a purely vector-type training dataset (x_i, v_i). (A small code sketch after this list illustrates this and the next two encodings.)

Binary pattern encodings. If the symbol alphabet A is of a large size k, one might shy away from one-hot encoding because it gives large vectors. Instead, encode each a_ν into a binary vector of length ⌈log2 k⌉, where ⌈log2 k⌉ is the smallest integer larger than or equal to log2 k. Advantage: small vectors. Disadvantage: subsequent vector-processing algorithms must invest substantial nonlinear effort into decoding this essentially arbitrary encoding. This will mostly be crippling, and binary pattern encodings should only be considered if there is some intuitive logic in the encoding.

Linear scale encoding. If there is some natural ordering in the symbols a_ν ∈ A, encode every a_ν by the index number ν. Makes sense, for instance, when the a_ν come from a Likert-scale questionnaire item, as in A = {certainly not, rather not, don't know, rather yes, certainly yes}.

Semantic word vectors. This is a technique which enabled breakthroughs in deep learning for text processing (I mentioned it in the Introduction; it is used in the TICS demo). By a nontrivial machine learning algorithm pioneered by Mikolov et al. (2013), words a_ν from a (large) vocabulary A are converted into vectors of some reasonably large size, say 300-dimensional, such that semantically related words become mapped to vectors that are metrically close to each other.

For instance, the code vectors v, v′ for the words w = airplane and w′ = aircraft should have a small distance, whereas the code vector v′′ for w′′ = rose should lie at a great distance to both v and v′. So the goal is to find vectors v, v′, v′′ in this example that have a small distance ‖v − v′‖ and large distances ‖v − v′′‖, ‖v′ − v′′‖. This is achieved by measuring the similarity of two words w, w′ through counting how often they occur in similar locations in training texts. A large collection of English texts is processed, collecting statistics about similar sub-phrases in those texts that differ only in the two words whose similarity one wishes to assess (plus, there was another trick: can you think of an important improvement of this basic idea?). Mikolov's paper has been cited (Google Scholar) more than 16,000 times in merely six years!
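As announced in the one-hot item above, here is a small Python sketch of the first three encodings. The alphabet, the symbol-to-index mapping and the function names are arbitrary illustrative choices of mine; semantic word vectors are omitted because they require a trained word-embedding model.

```python
import numpy as np

# A hypothetical ordered alphabet (a Likert scale); the symbol -> index mapping is arbitrary but fixed.
alphabet = ["certainly not", "rather not", "don't know", "rather yes", "certainly yes"]
index = {a: nu for nu, a in enumerate(alphabet)}
k = len(alphabet)

def one_hot(symbol):
    """k-dimensional binary vector, zero everywhere except at the symbol's position."""
    v = np.zeros(k)
    v[index[symbol]] = 1.0
    return v

def binary_pattern(symbol):
    """ceil(log2 k)-dimensional binary encoding: compact, but essentially arbitrary."""
    bits = int(np.ceil(np.log2(k)))
    nu = index[symbol]
    return np.array([(nu >> b) & 1 for b in reversed(range(bits))], dtype=float)

def linear_scale(symbol):
    """Encode a symbol by its index; sensible only because this alphabet is naturally ordered."""
    return float(index[symbol])

print(one_hot("rather yes"))         # [0. 0. 0. 1. 0.]
print(binary_pattern("rather yes"))  # [0. 1. 1.]
print(linear_scale("rather yes"))    # 3.0
```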

And here are some continuous-to-discrete transformations:

One-dimensional discretization, also known as binning. Given: a real-number dataset {x_i} where x_i ∈ R. Divide the real line into k segments

A_1, A_2, ..., A_{k−1}, A_k = (−∞, a_1], (a_1, a_2], ..., (a_{k−2}, a_{k−1}], (a_{k−1}, ∞).

Then transform each x_i to that symbol A_ν for which x_i ∈ A_ν. Comments:

• The segments (a_ν, a_{ν+1}] are called bins.

• In the simplest case, the range [min{x_i}, max{x_i}] of the data points is split into k bins of equal length.

• Alternatively, the bins can be chosen such that each bin contains an approximately equal number of data points from the set {x_i}. (A small code sketch after this list illustrates both splitting strategies.)

• Often the best way of splitting the data range into bins depends on the task. This can turn out to be a delicate optimization problem. For instance, there is a sizeable literature in deep learning which is concerned with the question of how one can reduce the numerical precision of weight parameters in neural networks. One first trains the neural network using standard high-precision learning algorithms, obtaining weight matrices in floating-point precision. In order to speed up computing, use small microchips in hand-held devices, or reduce energy consumption when using the trained network, one wishes to reduce the bit precision of the weight matrices. In extreme cases, one wants to end up with only three weight values {−1, 0, 1}, which would lead to very cheap neural network computations at use time. This amounts to splitting the range of the floating-point weights obtained after training into just three bins. The goal is to ensure that the reduced-precision network performs approximately as well as the high-precision original. This is a nontrivial problem. Check out the (randomly chosen) reference Indiveri (2015) if you want to get a taste of the flavor.

Multi-dimensional discretization by hierarchical refinement. If one wants to discretize a set {x_i} of n-dimensional vectors, one has to split the n-dimensional volume which contains the points {x_i} into a finite set of discrete regions R_ν. A common approach is to let these regions be n-dimensional hypercubes. By a process of hierarchical refinement one constructs these regions R_ν such that in areas where there is a higher point density, or where much fine-grained information is encoded in the local point distribution, one uses smaller hypercubes to increase resolution. This leads to a tree structure of hierarchically nested hypercubes, which in the case n = 2 is called a quadtree (because every non-leaf node has four children) and in the case n = 3 an octree. This tree structure enables a computationally efficient indexing of the hypercubes. The left panel in Figure 24 shows an example. The Wikipedia article on quadtrees is quite instructive. One may also opt for regions R_ν of other polygonal shape (see the right panel in Figure 24). There are many such mesh refinement methods, with task-specific optimization criteria. They are not typically used in machine learning but rather in methods for simulating spatiotemporal (fluid) dynamics by numerically solving PDEs. Still, it is good to know that such methods exist.

Vector quantization. Using K-means or other clustering algorithms, the vector set {x_i} is partitioned into k cells whose center-of-gravity vectors are indexed, and the indices are used as symbolic encodings of the {x_i}. We have seen this in Section 4.2. This is a typical machine learning method for discretization.

Turning neural dynamics into symbol sequences. When we (I really mean "we", = us humans!) speak or write, the continuous-time, continuous-valued neural brain dynamics leads to a discrete sequence of words. Somehow, this symbolic sequence is encoded in the "subsymbolic", analog brain dynamics. It is unknown how, exactly, this encoding is realized. Numerous proposals based on nonlinear dynamical systems theory have been made. This is an area of research in which I am personally engaged. If you are interested: a partial selection of approaches is listed in Durstewitz et al. (2000), Pascanu and Jaeger (2011), Fusi and Wang (2016). In machine learning, the problem of transforming continuous-valued neural state sequences into sequences of words (or letters) arises in applications like speech recognition ("speech-to-text") or gesture recognition. Here the most common solution (which is not necessarily biologically plausible) is to use the core neural network to generate a sequence of continuous-valued hypothesis output vectors with as many components as there are possible target symbols. At each point in time, the numerical value of each vector component reflects a current "degree of belief" about which symbol should be generated. With some postprocessing mechanism (not easy to set up), this hypothesis stream is denoised and turned into a symbol sequence, for instance by selecting at each point in time the symbol that has the largest degree of belief.
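Here is the small code sketch promised in the binning comments above. It contrasts equal-width with equal-frequency bins; the function names and the skewed exponential toy dataset are my own illustrative choices.

```python
import numpy as np

def equal_width_bins(x, k):
    """Split the range [min x, max x] into k bins of equal length; return bin indices 1..k."""
    edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]   # the k-1 interior boundaries a_1, ..., a_{k-1}
    return np.digitize(x, edges) + 1

def equal_frequency_bins(x, k):
    """Place the boundaries a_nu at quantiles, so each bin receives roughly the same number of points."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, edges) + 1

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)                # skewed toy data: the two strategies differ markedly
for binner in (equal_width_bins, equal_frequency_bins):
    counts = np.bincount(binner(x, 5))[1:]    # how many points fall into each of the 5 bins
    print(binner.__name__, counts)
```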

Figure 24: Left: A hierarchical hypercube mesh for a 2-dimensional dataset consisting of points homogeneously distributed inside a circle. Right: adaptive non-orthogonal meshing, here used in an airflow simulation. Sources: left panel: http://www.questinygroup.com/tag/quad-tree/; right panel: Luchinsky et al. (2012).

6 The bias-variance dilemma and how to cope with it

Reality is rich in detail and surprises. Measuring reality gives a finite number of data points, and the reality between and beyond these data points remains unmeasured, a source of unforeseeable surprises. Furthermore, measuring is almost always "noisy", imprecise or prone to errors. And finally, each space-time event point of reality can be measured in innumerable ways, of which only a small choice is actually measured. For instance, the textbook measurement of "tossing a coin" events is just to record "heads" or "tails". But one could also measure the weight and color of the coin, the wind speed, the falling altitude, the surface structure of the ground where the coin falls, and ... everything else. In summary, any dataset captures only a supremely scarce coverage of space-time events whose unfathomable qualitative richness has been reduced to a ridiculously small number of "observables". From this impoverished, punctuated information basis, called "training data", a machine learning algorithm will generate a "model" of a part of reality. This model will then be used to predict properties of new space-time events, called "test inputs". How should it ever be possible for a model to "know" anything about those parts of reality that lie between the pinpoints hit by the training data? This would be a good point to give up. But ... quoting Einstein: "Subtle is the Lord, but malicious He is not" (https://en.wikiquote.org/wiki/Albert_Einstein). Reality is co-operative.

Between measurement points, reality changes not arbitrarily but in a lawful manner. The question is, which law. In machine learning, this question is known as the problem of overfitting or, in more educated terms, the bias-variance dilemma. Welcome to this chapter, which is all about this question. For all practical exploits of machine learning in your professional future, this is the most helpful chapter in this course.

Figure 25: He gave the reason why machine learning works (from https://commons.wikimedia.org/wiki/File:Albert_Einstein_-_Colorized.jpg)

6.1 Training and testing errors

Let us take a closer look at the archetypical machine learning task of classifying handwritten digits. We use the simple Digits benchmark data which you know from Section 4.1. The task is specified as follows:

Training data. (x_i^train, c_i^train)_{i=1,...,1000}, where the x_i^train ∈ [0, 1]^240 are 240-dimensional image vectors whose components correspond to the 15 × 16 pixels, which have normalized grayscale values ranging in [0, 1]; and where the c_i^train are class labels from the set {0, 1, ..., 9}. There are 100 examples from each class in the training data set.

Test data. Another set (x_i^test, c_i^test)_{i=1,...,1000}, also having 100 exemplars of each digit.

Task specification. This is a benchmark dataset with a standardized task specification: train a classifier using (only) the training data, then test it on the test data and report the rate of misclassifications. The loss function is thus the count loss given in Equation 4.

An elementary but professionally structured learning pipeline which uses methods that we have already described in this course goes as follows (it may be a welcome rehearsal to give a step-by-step instruction):

First stage: dimension reduction by PCA:

1. Center the set of image vectors (x_i^train)_{i=1,...,1000} by subtracting the mean vector µ, obtaining (x̄_i^train)_{i=1,...,1000}.

2. Assemble the centered image vectors column-wise in a 240 × 1000 matrix X̄. Compute the covariance matrix C = 1/1000 X̄ X̄′ and factorize C into its SVD C = U Σ U′. The columns of U are the 240 principal component vectors of C.

3. Decide how strongly you want to reduce the dimension, shrinking it from n = 240 to m < n. Let U_m be the matrix made from the first m columns of U.

4. Project the centered patterns x̄_i^train on the first m principal components, obtaining m-dimensional feature vectors f_i^train = U_m′ x̄_i^train.

Vectorize the class labels by one-hot encoding: Each c_i^train is re-written as a binary 10-dimensional vector v_i^train which has a 1 entry at the position corresponding to the class c_i. Assemble these vectors column-wise in a 10 × 1000 matrix V.

Compute a linear regression classifier: Assemble the 1000 m-dimensional feature vectors f_i^train into an m × 1000 matrix F and obtain a 10 × m regression weight matrix W by W′ = (F F′)^{−1} F V′.

Compute the training MSE and training error rate: The training mean square error (MSE) is given by

MSE^train = 1/1000 Σ_{i=1}^{1000} ‖v_i^train − W f_i^train‖².

The training misclassification rate is

%^train = 1/1000 |{i | maxInd(W f_i^train) − 1 ≠ c_i^train}|,

where maxInd picks the index of the maximal element in a vector. (Note: the "−1" is owed to the fact that the vector indices range from 1 to 10, while the class labels go from 0 to 9.) Similarly, compute the test MSE and error rate by

MSE^test = 1/1000 Σ_{i=1}^{1000} ‖v_i^test − W f_i^test‖²

and

%^test = 1/1000 |{i | maxInd(W f_i^test) − 1 ≠ c_i^test}|,

where f_i^test = U_m′ (x_i^test − µ). Note: for centering the test data, use the mean µ obtained from the training data!
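For concreteness, here is a compact numpy rendering of this pipeline. It is only a sketch under the assumption that the Digits data are already available as arrays; the function name and the use of np.linalg.inv are my own choices, and because Python indexes classes from 0, the "−1" offset from the formulas above disappears.

```python
import numpy as np

def pca_linreg_classifier(X_train, c_train, X_test, c_test, m):
    """PCA(m) feature extraction plus linear regression classification.

    X_train, X_test: arrays of shape (N, 240) holding image vectors row-wise;
    c_train, c_test: integer class labels in 0..9. Returns (train, test) misclassification rates.
    """
    N = X_train.shape[0]
    mu = X_train.mean(axis=0)
    Xbar = (X_train - mu).T                       # 240 x N matrix of centered patterns (as columns)
    C = Xbar @ Xbar.T / N                         # covariance matrix
    U, _, _ = np.linalg.svd(C)                    # columns of U: principal component vectors
    Um = U[:, :m]                                 # first m principal components

    F = Um.T @ Xbar                               # m x N feature matrix
    V = np.eye(10)[:, c_train]                    # 10 x N matrix of one-hot class indicator vectors
    W = V @ F.T @ np.linalg.inv(F @ F.T)          # 10 x m regression weight matrix

    def error_rate(X, c):
        Fx = Um.T @ (X - mu).T                    # note: test data are centered with the *training* mean
        return np.mean(np.argmax(W @ Fx, axis=0) != c)

    return error_rate(X_train, c_train), error_rate(X_test, c_test)
```

Running this function for every m and plotting the two error rates reproduces the kind of diagnostics shown in Figure 26.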

This is a procedure made from basic linear operations that you should command even when sleepwalking; with some practice the entire thing should not take you more than 30 minutes of programming and running. Altogether it is a handy, quick-and-not-very-dirty routine that you should consider carrying out in every classification task, in order to get a baseline before you start exploring more sophisticated methods.

And now let us draw what is probably the most helpful graphic in these lecture notes, one worth burning into your subconscious. Figure 26 shows these diagnostics for all possible choices of the number m = 1, ..., 240 of PC features used.

Figure 26: Dashed: Train (blue) and test (red) MSE obtained for m = 1,..., 240 PCs. Solid: Train (blue) and test (red) misclassification rates. The y-axis is logarithmic base 10.

This plot visualizes one of the most important issues in (supervised) machine learning and deserves a number of comments.

• As m increases (the x-axis in the plot), the number of parameters in the corresponding linear regression weight matrices W grows as 10 · m. More model parameters mean more degrees of freedom, more "flexible" models. With greater m, models can solve the linear regression learning equation (20) increasingly well. This is evident from the monotonic decrease of the train MSE curve. The training misclassification rate also decreases persistently, except for a jitter that is due to the fact that we optimized models only indirectly for low misclassification.

• The analogous performance curves for the testing MSE and misclassification rate first exhibit a decrease, followed by an increasing tail. The testing misclassification rate is minimal for m = 34.

• This "first decrease, then increase" behavior of testing MSE (or classification rate) is always observed in supervised learning tasks when models are compared which have growing degrees of data-fitting flexibility. In our digit example, this increase in flexibility was afforded by growing numbers of PC features, which in turn gave the final linear regression a richer repertoire of feature values to combine into the hypothesis vectors.

• The increasing tail of testing MSE (or classification rate) is the hallmark of overfitting. When the learning algorithm admits too much flexibility, the resulting model can fit itself not only to what is "lawful" in the training data, but also to the random fluctuations in the training data. Intuitively and geometrically speaking, a learning algorithm that can shape many degrees of freedom in its learnt models allows the models to "fold in curls and wiggles to accommodate the random whims of the training data". But then, the random curls and wiggles of the learnt model will be at odds with fresh testing data.

6.2 The menace of overfitting – it's real, it's everywhere

ML research has invented a great variety of model types for an even larger variety of supervised and unsupervised learning tasks. How exactly overfitting (mis)functions in each of these model types needs a separate geometrical analysis in each case. Because overfitting is such a fundamental challenge in machine learning, I illustrate its geometrical manifestations with four examples.

6.2.1 Example 1: polynomial curve-fitting

This example is the standard textbook example for demonstrating overfitting. Let us consider a one-dimensional input, one-dimensional output regression task of the kind where the training data are of the form (x_i, y_i) ∈ R × R. Assume that there is some systematic relationship y = h(x) that we want to recover from the training data. We consider a simple artificial case where the x_i range in [0, 1] and the to-be-discovered true functional relationship is y = sin(2 π x). The training data, however, contain a noise component, that is, y_i = sin(2 π x_i) + ν_i, where ν_i is drawn from a normal distribution with zero mean and standard deviation σ. Figure 27 shows a training sample (x_i, y_i)_{i=1,...,11}, where N = 11 training points x_i are chosen equidistantly.

Figure 27: An example of training data (red squares) obtained from a noisy observation of an underlying "correct" function sin(2 π x) (dashed blue line).

We now want to solve the task of learning a good approximation for h from the training data (x_i, y_i) by applying polynomial curve fitting, an elementary technique you might be surprised to meet here as a case of machine learning. Consider an m-th order polynomial

p(x) = w_0 + w_1 x + ··· + w_m x^m.    (35)

We want to approximate the function given to us via the training sample by a polynomial, that is, we want to find ("learn") a polynomial p(x) such that p(x_i) ≈ y_i. More precisely, we want to minimize the mean square error on the training data,

MSE^train = 1/N Σ_{i=1}^{N} (p(x_i) − y_i)².

At this moment we don't bother with how this task is solved computationally but simply rely on the Matlab function polyfit, which does exactly this job for us: given data points (x_i, y_i) and polynomial order m, find the coefficients w_j which minimize this MSE. Figure 28 shows the polynomials found in this way for m = 1, 3, 10.
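The experiment is easy to reproduce. The sketch below uses numpy's polyfit, the analogue of the Matlab function; the noise level sigma and the random seed are arbitrary choices of mine, so the exact MSE values will differ from the ones reported below.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 11, 0.2                           # 11 equidistant points; sigma is an assumed noise level
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=N)

for m in (1, 3, 10):
    w = np.polyfit(x, y, m)                  # coefficients minimizing the training MSE
    mse_train = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"order {m:2d}: train MSE = {mse_train:.4f}")
    # (numpy may warn about poor conditioning for the order-10 fit; the fit is still computed)
```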


Figure 28: Fitting polynomials (green lines) for polynomial orders 1, 3, 10 (from left to right).

If we compute the MSEs for the three orders m = 1, 3, 10, we get MSE^train = 0.4852, 0.0703, 0.0000, respectively. Some observations:

• If we increase the order m, we get increasingly lower MSE^train.

• For m = 1, we get a linear polynomial, which apparently does not represent our original sine function well (underfitting).

• For m = 3, we get a polynomial that hits our target sine apparently quite well.

• For m = 10, we get a polynomial that perfectly matches the training data, but apparently misses the target sine function (overfitting).

The modelling flexibility is here defined through the polynomial order m. If it is too small, the models are too inflexible and underfit; if it is too large, we see overfitting.

However, please switch now into your most critical thinking mode and reconsider what I have just said. Why, indeed, should we judge the linear fit to be "underfitting", the order-3 fit to be "ok", and the order-10 fit to be "overfitting"? There is no other ground for justifying these judgements than our visual intuition! In fact, the order-10 fit might be the right one if the data contain no noise! The order-1 fit might be the best one if the data contain a lot of noise! We don't know!

6.2.2 Example 2: pdf estimation

Let us consider the task of estimating a 2-dimensional pdf over the unit square from 6 given training data points {x_i}_{i=1,...,6}, where each x_i lies in [0, 1] × [0, 1]. This is an elementary unsupervised learning task, the likes of which frequently occur as a subtask in more involved learning tasks, but which is of interest in its own right too. Figure 29 shows three pdfs which were obtained from three different learning runs with models of increasing flexibility (I don't explain the modeling method here; for the ones who know about it: simple Gaussian Parzen-window models where the degree of admitted flexibility was tuned by the kernel width). Again we encounter the fingerprints of under/overfitting: the low-flexibility model seems too "unbending" to resolve any structure in the training point cloud (underfitting), the high-flexibility model is so volatile that it can accommodate each individual training point (presumably overfitting). But again, we don't really know...

6.2.3 Example 3: learning a decision boundary

Figure 30 shows a schematic of a classification learning task where the training patterns are points in R^2 and come in two classes.

93 1.5 6 150 1 4 1 100 1 1 0.5 2 50 0.5 0 0 0.5 0.5 0 1 1 1 0.5 0.5 0.5 0 0 0 0 0 0

Figure 29: Estimating a pdf from 6 data points. Model flexibility grows from left to right. Note the different scalings of the z-axis: the integral of the pdf is 1 in each of the three cases.

When the trained model is too inflexible (left panel), the decision boundary is confined to be a straight line, presumably underfitting. When the flexibility is too large, each individual training point can be "lasso-ed" by a sling of the decision boundary, presumably overfitting. Do I need to repeat that, while these graphics seem to indicate under- or overfitting, we do not actually know?


Figure 30: Learning a decision boundary for a 2-class classification task of 2-dimensional patterns (marked by black "x" and red "o").

6.2.4 Example 4: furniture design

Overfitting can also hit you in your sleep, see Figure 31. Again, while this looks like drastic overfitting, it would be just right if all humans slept in the same folded way as the person whose sleep shape was used for training the mattress model.

Figure 31: A nightmare case of overfitting. Picture spotted by Yasin Cibuk (2018 ML course participant) on http://dominicwilcox.com/portfolio/bed/, designed and crafted by artist Dominic Wilcox, 1999. Quote from the artist's description of this object: "I used my own body as a template for the mattress". From an ML point of view, this means a training data set of size N = 1.

6.2.5 Interim summary

The four examples stem from different learning tasks (function approximation, pdf learning, classification learning, furniture optimization), and correspondingly the overfitting problem manifests itself in different geometrical ways. But the flavor is the same in all cases. The model flexibility determines how "wiggly" the geometry of the learnt model can become. Very large flexibility ultimately allows the model to be adapted to each individual training point. This leads to small, even zero, training error, but it is likely disastrous for generalization to new test data points. Very low flexibility can hardly adapt to the structure of the training data at all, likewise leading to poor test performance (and poor training performance too). Some intermediate flexibility is likely to strike the best balance.

Properly speaking, flexibility is not a characteristic of the final model that one obtains after learning, but of the learning algorithm. If one uses the term precisely, one should speak, for example, of "the flexibility of the procedure to train a third-order polynomial function", or of "the flexibility of the PCA-based linear regression learning scheme which uses m PCA features".

I introduced this way of speaking about "flexibility" ad hoc. In statistics and machine learning textbooks you will hardly find this term. Instead, specific methods to measure and tune the flexibility of a learning algorithm have their specific names, and it is these names that you will find in the literature.

The most famous among them is model capacity. This concept has been developed in a field now called statistical learning theory, and (not only) I consider it a highlight of modern ML theory. We will however not treat it in this lecture, since it is not an easy concept and it needs to be spelled out in different versions for different types of learning tasks and algorithms. Check out en.wikipedia.org/wiki/Vapnik-Chervonenkis_theory if you want to get an impression. Instead, in Sections 6.4.1 and 6.4.2 I will present two simpler methods for handling modeling flexibility which, while they lack the analytical beauty and depth of the model capacity concept, are immensely useful in practice.

I emphasize that finding the right flexibility for a learning algorithm is ab-so-lu-te-ly crucial for good performance of ML algorithms. Our little visual examples do not do justice to the dismal effects that overfitting may have in real-life learning tasks where a high pattern dimension is combined with a small number of training examples, which is a situation faced very often by ML engineers in practical applications.

6.3 An abstract view on supervised learning

Before I continue with the discussion of modeling flexibility, it is helpful to introduce some standard theoretical concepts and terminology. Before you start reading this section, make sure that you understand the difference between the expectation of a random variable and the sample mean (explained in Appendix D).

In supervised learning scenarios, one starts from a training sample of the form (x_i, y_i)_{i=1,...,N}, which is drawn from a joint distribution P_{X,Y} of two random variables X and Y. So far, we have focussed on tasks where the x_i were vectors and the y_i were class labels or numbers or vectors, but supervised learning tasks can be defined for any kind of variables. In this subsection we will take an abstract view and just consider any kind of supervised learning based on a training sample (x_i, y_i)_{i=1,...,N}. We then call the x_i the arguments and the y_i the targets of the learning task. The target RV Y is also sometimes referred to as the teacher variable.

The argument and target variables each come with their specific data value space, a set of possible values that the RVs X and Y may take. For instance, in our digit classification example, the data value space for X was R^240 (or, more constrained, [0, 1]^240) and the data value space for Y was the set of class labels {0, 1, ..., 9}. After the extraction of m features and turning the class labels into class indicator vectors, the original picture–label pairs (x_i, y_i) turned into pairs (f_i, v_i) with data value spaces R^m and {0, 1}^k respectively. In English-French sentence translation tasks, the data value spaces of the random variables X and Y would be (mathematical representations of) the sets of English and French sentences. We now abstract away from such concrete data value spaces and write E_X, E_Y for the data value spaces of X and Y.

Generally speaking, the aim of a supervised learning task is to derive from the training sample a function h : E_X → E_Y. We called this function a decision function earlier in these lecture notes, and indeed that is the term which is used in abstract statistical learning theory. Some of the following I already explained before (Section 2.3), but it will be helpful if I repeat it here with a little more detail.

The decision function h obtained by a learning algorithm should be optimized toward some objective. One introduces the concept of a loss function. A loss function is a function

• A loss that counts misclassifications in pattern classification: when the de- cision function returns a class label, define

 0, if h(x) = y L(h(x), y) = (37) 1, if h(x) 6= y

• A loss that penalizes quadratic errors of vector-valued targets:

L(h(x), y) = ‖h(x) − y‖².    (38)

This loss is often just called “quadratic loss”. We used it as a basis for deriving the algorithm for linear regression in Section 3.1.

The decision function h is the outcome of a learning algorithm, which in turn is informed by a sample (x_i, y_i)_{i=1,...,N}. Learning algorithms should minimize the expected loss, that is, a good learning algorithm should yield a decision function h whose risk

R(h) = E[L(h(X), Y)]    (39)

is small. The expectation here is taken with respect to the true underlying joint distribution P_{X,Y}. For example, in a case where X and Y are numerical RVs and their joint distribution is described by a pdf f, the risk of a decision function h would be given by

R(h) = ∫_{E_X × E_Y} L(h(x), y) f(x, y) d(x, y).

However, the true distribution f is unknown. The mission to find a decision function h which minimizes (39) is, in fact, hopeless. The only access to P_{X,Y} that the learning algorithm has is the scattered reflection of P_{X,Y} in the training sample (x_i, y_i)_{i=1,...,N}.

A natural escape from this impasse is to tune a learning algorithm such that, instead of attempting to minimize the risk (39), it tries to minimize the empirical risk

R^emp(h) = 1/N Σ_{i=1}^{N} L(h(x_i), y_i),    (40)

which is just the mean loss calculated over the training examples. Minimizing this empirical risk is an achievable goal, and a host of optimization algorithms for all kinds of supervised learning tasks exist which do exactly this, that is, they find

h_opt = argmin_{h ∈ H} 1/N Σ_{i=1}^{N} L(h(x_i), y_i).    (41)

The set H is the hypothesis space, the search space within which a learning algorithm may look for an optimal h. It is important to realize that every learning algorithm comes with a specific hypothesis space. For instance, in decision tree learning H is the set of all decision trees that use a given set of properties and attributes. In linear regression, H is the set of all affine linear functions from R^n to R^k. Or, if one sets up a neural network learning algorithm, H is typically the set of all neural networks that have a specific connection structure (number of neuron layers, number of neurons per layer); the networks in H then differ from each other by the weights associated with the synaptic connections. The empirical risk is often, especially in numerical function approximation tasks, also referred to as the training error.

While minimizing the empirical risk is a natural way of coping with the impossibility of minimizing the risk, it may lead to decision functions which combine a low empirical risk with a high risk. This is the ugly face of overfitting which I highlighted in the previous subsection. In extreme cases, one may learn a decision function which has zero empirical risk and yet has an extremely large expected testing error which makes it absolutely useless.

There is no easy or general solution for this conundrum. It has spurred statisticians and mathematicians to develop a rich body of theories which analyze the relationships between risk and empirical risk, and suggest insightful strategies to manage as well as one can in order to keep the risk within provable bounds. These theories, sometimes referred to as statistical learning theory (or better, theories), are beyond the scope of this lecture. If you are in a hardcore mood and have some background in probability theory, you can inspect Section 18 of the lecture notes of my legacy "Principles of Statistical Modeling" course (online at http://minds.jacobs-university.de/teaching/ln/). You will find that the definitions of loss and risk given there in the spirit of mathematical statistics are a bit more involved than what I presented above, but the definitions of loss and risk that I gave here are the ones used in machine learning textbooks.
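To tie the notation together, here is a tiny sketch of the two losses from above and of the empirical risk (40), evaluated for a hypothetical decision function h; all names and the toy sample are illustrative.

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """Misclassification counting loss, Equation (37)."""
    return 0.0 if y_hat == y else 1.0

def quadratic_loss(y_hat, y):
    """Quadratic loss for vector-valued targets, Equation (38)."""
    diff = np.asarray(y_hat) - np.asarray(y)
    return float(diff @ diff)

def empirical_risk(h, sample, loss):
    """Mean loss of decision function h over the training sample, Equation (40)."""
    return float(np.mean([loss(h(x), y) for x, y in sample]))

# toy example: classify a number as 1 if it is positive, else 0
h = lambda x: 1 if x > 0 else 0
sample = [(-2.0, 0), (0.5, 1), (1.5, 1), (-0.3, 1)]    # the last pair is "mislabeled" noise
print(empirical_risk(h, sample, zero_one_loss))         # 0.25
```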

6.4 Tuning model flexibility

In order to deal with the overfitting problem, one must have ways to tune the flexibility of the learning algorithm and set it to the "right" value. There are many, quite different, ways of tuning flexibility. In this section I outline the most common ones.

6.4.1 Tuning learning flexibility through model class size

In the Digits classification demo from Section 6.1, we tuned flexibility by changing the dimension of the PCA features. In the polynomial curve fitting demo in Section 6.2.1 we changed the order of the polynomials. Let us reconsider these two examples to get a clearer picture:

• In the Digits example (Figure 26) we found that the number m of principal component features that were extracted from the raw image vectors was decisive for the testing error. When m was too small, the resulting models were too simple to distinguish properly between different digit image classes (underfitting). When m was too large, overfitting resulted. Fixing a particular m determines the class H of candidate decision functions within which the empirical risk (41) is minimized. Specifically, using a fixed m meant that the optimal decision function h_opt was selected from the set H_m which contains all decision functions which first extract m principal component features before carrying out the linear regression. It is clear that H_{m−1} is contained in H_m, because decision functions that only combine the first m − 1 principal component features into the hypothesis vector can be regarded as special cases of decision functions that combine m principal component features into the hypothesis vector, namely those whose linear combination weight for the m-th feature is zero.

• In the polynomial curve fitting example from Section 6.2.1, the model parameters were the monomial coefficients w_0, ..., w_m (compare Equation 35). After fixing the polynomial order m, the optimal decision function p(x) was selected from the set H_m = {p : R → R | p(x) = Σ_{j=0}^{m} w_j x^j}. Again it is clear that H_{m−1} is contained in H_m.

A note on terminology: I use the words "decision function" and "model" as synonyms, meaning the (classification) algorithm h which results from a learning procedure. The word "decision function" is standard in theoretical statistics, the word "model" is more common in machine learning.

Generalizing from these two examples, we are now in a position to draw a precise picture of what it may mean to consider learning algorithms of "increasing flexibility". A model class inclusion sequence is a sequence H_1 ⊂ H_2 ⊂ ... ⊂ H_L of sets of candidate models.

Since there are more candidate models in classes that appear later in the sequence, "higher" model classes have more possibilities to fit the training data; thus the optimal model within class H_{m+1} can achieve an empirical risk that is at least as low as that of the optimal model in class H_m, but it also gets you closer to overfitting.

There are many ways to set up a sequence of learning algorithms which pick their respective optimal models from such a model class inclusion sequence. In most cases, such as in our two examples, this will just mean admitting larger models with more tuneable parameters for "higher" classes. From now on we assume that a class inclusion sequence H_1 ⊂ ... ⊂ H_L is given. We furthermore assume that we have a loss function L and are in possession of a learning algorithm which for every class H_m can solve the optimization problem of minimizing the empirical risk

h_{opt m} = argmin_{h ∈ H_m} 1/N Σ_{i=1}^{N} L(h(x_i), y_i).    (42)

So... how can we find the best model class m_opt, the one which gives us the best risk (note: not the best empirical risk)? Or, stated in more basic terms, which model class will give us the smallest expected test error? Expressed formally, how can we find

m_opt = argmin_m R(h_{opt m})?    (43)

6.4.2 Using regularization for tuning modeling flexibility

The flexibility tuning mechanism explained in this subsection is simple, practical, and in widespread use. It is called model regularization. When one uses regularization to vary the modeling flexibility, one does not vary the model class H at all. Instead, one varies the optimization objective (41) for minimizing the training error.

The basic geometric intuition behind modeling flexibility is that low-flexibility models should be "smooth", "more linear", "flatter", admitting only "soft curvatures" in fitting data; whereas high-flexibility models can yield "peaky", "rugged", "sharply twisted" curves (see again Figures 28, 29, 30).

When one uses model regularization, one fixes a single model structure and size with a fixed number of trainable parameters, that is, one fixes H. Structure and size of the considered model should be rich and large enough to be able to overfit (!) the available training data. Thus one can be sure that the "right" model is contained in the search space H. For instance, in our evergreen digit classification task one would altogether dismiss the dimension reduction through PCA (which has the same effect as using the maximum number m = 240 of PC components) and directly use the raw picture vectors (padded by a constant 1 component to enable affine linear maps) as argument vectors for training a linear regression decision function. Or, in polynomial curve-fitting, one would fix a polynomial order that clearly is too large for the expected kind of true curve.

The models in H are typically characterized by a set of trainable parameters. In our example of digit classification through linear regression from raw images, these trainable parameters are the elements of the regression weight matrix; in polynomial curve fitting they are the monomial coefficients. Following the traditional notation in the machine learning literature we denote this collection of trainable parameters by θ. This is a vector that has as many components as there are trainable parameters in the chosen kind of model. We assume that we have M tuneable parameters, that is, θ ∈ ℝ^M.

Such a high-flexibility model type would inevitably lead to overfitting if an “optimal” model were learnt using the basic learning equation (41), which I repeat here for convenience:

$$h_{opt} = \operatorname*{argmin}_{h \in H} \frac{1}{N} \sum_{i=1}^{N} L(h(x_i), y_i).$$

In order to dampen the exaggerated flexibility of this baseline learning algorithm, one adds a regularization term (also known as penalty term, or simply regularizer) to the loss function. A regularization term is a cost function R : ℝ^M → ℝ^{≥0} which penalizes model parameters θ that code models with a high degree of geometrical “wiggliness”. The learning algorithm then is constructed such that it solves, instead of (41), the regularized optimization problem

$$h_{opt} = \operatorname*{argmin}_{h \in H} \frac{1}{N} \sum_{i=1}^{N} L(h(x_i), y_i) + \alpha^2 R(\theta_h), \qquad (44)$$

where θ_h is the collection of parameter values in the candidate model h.

The design of a useful penalty term is up to your ingenuity. A good penalty term should, of course, assign high penalty values to parameter vectors θ which represent “wiggly” models; but furthermore it should be easy to compute and blend well with the algorithm used for empirical risk minimization. Two examples of such regularizers:

1. In the polynomial fit task from Section 6.2.1 one might consider for H all 10th order polynomials, but penalize the “oscillations” seen in the right panel of Figure 28, that is, penalize such 10th order polynomials that exhibit strong oscillations. The degree of “oscillativity” can be measured, for instance, by the integral over the square of the second derivative of the polynomial p,

$$R(\theta) = R((w_0, \ldots, w_{10})) = \int_0^1 \left( \frac{d^2 p(x)}{dx^2} \right)^2 dx.$$

Investing a little calculus (good exercise! not too difficult), it can be seen that this integral resolves to a quadratic form R(θ) = θ' C θ, where C is an 11 × 11 sized positive semi-definite matrix. That format is more convenient to use than the original integral version.

2. A popular regularizer that often works well is just the squared sum of all model parameters,

$$R(\theta) = \sum_{w \in \theta} w^2.$$

This regularizer favors models with small absolute parameters, which often amounts to “geometrically soft” models. This regularizer is popular among other reasons because it supports simple algorithmic solutions for minimizing risk functions that contain it. It is called the L2-norm regularizer because it measures the (squared) L2-norm of the parameter vector θ.
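To make the L2 regularizer concrete, here is a minimal Python sketch (assuming numpy is available) of how an L2-regularized quadratic objective of the form (44) can be minimized in the polynomial curve fitting example. The closed-form solution used below follows from setting the gradient of the quadratic objective to zero; the function name fit_polynomial_l2, the chosen polynomial order, and the values of α are only illustrative.

```python
import numpy as np

def fit_polynomial_l2(x, y, order, alpha):
    """Fit a polynomial of the given order to (x, y) data with an L2 penalty
    on the coefficient vector theta.  Minimizes
        (1/N) * ||V theta - y||^2 + alpha^2 * ||theta||^2,
    whose minimizer is given in closed form below."""
    V = np.vander(x, order + 1, increasing=True)   # columns: 1, x, x^2, ...
    N = len(x)
    A = V.T @ V / N + alpha**2 * np.eye(order + 1)
    return np.linalg.solve(A, V.T @ y / N)

# toy usage: noisy sine samples, deliberately over-flexible 10th-order polynomial
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)
theta_overfit = fit_polynomial_l2(x, y, order=10, alpha=0.0)   # no penalty: wiggly
theta_smooth  = fit_polynomial_l2(x, y, order=10, alpha=0.1)   # L2-regularized: softer
```

Increasing alpha in this sketch pulls the coefficients toward zero and thus flattens the fitted curve, exactly the “left-steering” effect described below.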

Computing a solution to the minimization task (44) means finding a set of parameters which simultaneously minimizes the original risk and the penalty term. The factor α² in (44) controls how strongly one wishes the regularization to “soften” the solution. Increasing α means downregulating the model flexibility. For α² = 0 one returns to the original un-regularized empirical risk (which would likely mean overfitting). For α² → ∞ the regularization term entirely dominates the model optimization and one gets a model which does not care anymore about the training data but instead is only tuned to have minimal regularization penalty. In the case of the L2 norm regularizer this means that all model parameters are zero – the ultimate wiggle-free model; one should indeed say the model is dead.

Let us briefly return to the overfitting diagram in Figure 26. In that demo, the flexibility regulation (moving left or right on the x-axis) was effected by moving in a model class inclusion hierarchy. But similar diagrams would be obtained in any other ML exercise where other methods for navigating on the flexibility axis might be used. Because the essential message of Figure 26 is universal across all machine learning tasks, I redraw that figure and annotate it generically (Figure 32). Summarizing and emphasizing the main messages of this figure:

• Increasing the model flexibility leads to a monotonic decrease of the empirical risk – because the training data can be fitted better and better.

• Increasing the model flexibility from very low to very high will lead to a risk that first decreases (less and less underfitting) and then rises again (more and more overfitting).

• The best model flexibility is where the risk curve reaches its minimum. The problem is that the risk is defined on the distribution of “test” data which one does not have at training time.

Figure 32: The generic, universal, core challenge of machine learning: finding the right model flexibility which gives the minimal risk.

When regularization is used to steer the degree of model flexibility, the x-axis in Figure 32 would be labelled by α² (highest α² on the left, lowest at the right end of the x-axis).

Using regularizers to vary model flexibility is often computationally more convenient than using different model sizes, because one does not have to tamper with differently structured models. One selects a model type with a very large (unregularized) flexibility, which typically means selecting a big model with many parameters (maybe hundreds of thousands). Then all one has to do (and one must do it – it is a most crucial part of any professionally done machine learning project) is to find the optimal degree of flexibility which minimizes the risk. At this moment we don't know how to do that – the secret will be revealed later in this section.

I introduced the word “regularization” here for the specific flexibility-tuning method of adding a penalty term to the loss function. The same word is however often used in a generalized fashion to denote any method used for steering model flexibility.

6.4.3 Ridge regression

Let us briefly take a fresh look at linear regression, now in the light of general supervised learning and regularization. Linear regression should always be used in conjunction with regularization. Because this is such a helpful thing to know, I devote this separate subsection to this “trick”. Recall from Section 3.1 that the learning task solved by linear regression is to find

$$w_{opt} = \operatorname*{argmin}_{w \in \mathbb{R}^n} \sum_{i=1}^{N} (w x_i - y_i)^2, \qquad (45)$$

where (x_i, y_i)_{i=1,...,N} is a set of training data with x_i ∈ ℝ^n, y_i ∈ ℝ. Like any other supervised learning algorithm, linear regression may lead to overfitting solutions w_opt. It is always advisable to control the flexibility of linear regression with an L2 norm regularizer, that is, instead of solving (45), go for

$$w_{opt} = \operatorname*{argmin}_{w \in \mathbb{R}^n} \frac{1}{N} \sum_{i=1}^{N} (w x_i - y_i)^2 + \alpha^2 \|w\|^2 \qquad (46)$$

and find the best regularization coefficient α². The optimization problem (46) admits a closed-form solution, namely the ridge regression formula that we have already met in Equation 21. Rewriting it a little to make it match the current general scenario, here it is again:

$$w'_{opt} = \left( \frac{1}{N}\, X X' + \alpha^2 I_{n \times n} \right)^{-1} \frac{1}{N}\, X Y = \left( X X' + N \alpha^2 I_{n \times n} \right)^{-1} X Y, \qquad (47)$$

where X = [x_1, ..., x_N] and Y = (y_1, ..., y_N)'.

In Section 3.1 I motivated the ridge regression formula by the fact that it warrants numerical stability. Now we see that a more fundamental reason to prefer ridge regression over the basic kind of regression (45) is that it implements L2 norm regularization. The usefulness of ridge regression as an all-round simple baseline tool for supervised learning tasks can hardly be overrated.
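As a concrete illustration, here is a minimal numpy sketch of the ridge regression formula (47). The function name ridge_regression and the toy data are only illustrative; the regularization coefficient α would in practice be chosen by the flexibility-tuning methods discussed in this section.

```python
import numpy as np

def ridge_regression(X, Y, alpha):
    """Ridge regression in the spirit of Equation 47.
    X : n x N matrix whose columns are the input vectors x_i
    Y : length-N vector of targets y_i
    alpha : regularization coefficient (alpha**2 weights the L2 penalty)
    Returns the weight vector w_opt of length n."""
    n, N = X.shape
    A = X @ X.T / N + alpha**2 * np.eye(n)
    return np.linalg.solve(A, X @ Y / N)

# usage sketch: 100 random 5-dimensional inputs with a noisy linear target
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 100))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = w_true @ X + 0.1 * rng.standard_normal(100)
w_hat = ridge_regression(X, Y, alpha=0.05)
```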

6.4.4 Tuning model flexibility through adding noise

Another way to tune model flexibility is to add noise. Noise can be added at different places. Here I discuss three scenarios.

Adding noise to the training input data. If we have a supervised training dataset (x_i, y_i)_{i=1,...,N} with vector patterns x_i, one can enlarge the training dataset by adding more patterns obtained from the original ones: for each x_i, add l variants x_i + ν_{i1}, ..., x_i + ν_{il} of this pattern to the training data, where the ν_{ij} are i.i.d. random vectors (for instance, uniform or Gaussian noise). This increases the number of training patterns from N to (l+1)N. The more such noisy variants are added and the stronger the noise, the more difficult it will be for the learning algorithm to fit all data points, and the smoother the optimal solution becomes – that is, the more one steers to the left (underfitting) side of Figure 32. Adding noise to training patterns is a very common strategy (a small code sketch of it follows after the third scenario below).

Adding noise while the optimization algorithm runs. If the optimization algorithm used for minimizing the empirical risk is iterative, one can “noisify” it by jittering the intermediate results with additive noise. We have not seen an iterative algorithm in this course so far, therefore I cannot demonstrate this by example here. The “dropout regularization” trick which is widely used in deep learning is of this kind. The effect is that the stronger the algorithm is noisified, the stronger the regularization, that is, the further one steers to the left end of Figure 32.

Use stochastic ensemble learning. We have seen this strategy in the presentation of random forests. A stochastic learning algorithm is repeatedly executed with different random seeds (called Θ in Section 2.7). The stronger the randomness of the stochastic learning algorithm, and the more members are included in the ensemble, the stronger the regularization effect.
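Here is the promised minimal sketch of the first scenario, input-noise augmentation, under the assumption of additive Gaussian noise; the function name augment_with_noise and the parameter defaults are illustrative choices, not fixed prescriptions.

```python
import numpy as np

def augment_with_noise(X, Y, l=5, sigma=0.1, rng=None):
    """Enlarge a training set by adding l noisy copies of every input pattern.
    X : N x n array of input patterns, Y : length-N array of targets.
    Each copy is x_i + nu with nu drawn i.i.d. from N(0, sigma^2 I).
    Returns arrays of shape ((l+1)N, n) and ((l+1)N,)."""
    rng = np.random.default_rng() if rng is None else rng
    X_aug, Y_aug = [X], [Y]
    for _ in range(l):
        X_aug.append(X + sigma * rng.standard_normal(X.shape))
        Y_aug.append(Y)                      # targets are kept unchanged
    return np.concatenate(X_aug), np.concatenate(Y_aug)
```

Larger l and larger sigma make the augmented dataset harder to fit exactly, pushing the learnt model toward the underfitting side of Figure 32.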

6.5 Finding the right modeling flexibility by cross-validation

Statistical learning theory has come up with a few analytical methods to approximately solve (43). But these methods are based on additional assumptions which are neither easy to verify nor always granted. By far the most widely used method to determine an (almost) optimal model flexibility (that is, to determine the position of the green line in Figure 32) is a rather simple scheme called cross-validation. Cross-validation is a generic method which does not need analytical insight into the particulars of the learning task at hand. Its main disadvantage is that it may be computationally expensive.

Here is the basic idea of cross-validation. In order to determine whether a given model is under- or overfitting, one would need to run it on test data that are “new” and not contained in the training data. This would allow one to get a hold on the red curve in Figure 32. However, at training time only the training data are available. The idea of cross-validation is to artificially split the training data set D = (x_i, y_i)_{i=1,...,N} into two subsets T = (x_i, y_i)_{i∈I} and V = (x_i, y_i)_{i∈I'}. These two subsets are then treated as if they were a “training” and a “testing” dataset. In the context of cross-validation, the second set is called a validation set.

A bit more formally, let the flexibility axis in Figure 32 be parametrized by m, where small m means “strong regularization”, that is, “go left” on the flexibility axis. Let the range of m be m = 1, ..., L. For each setting of the regularization strength m, the data in T is used to train an optimal model h_{opt,m}. The generalization performance on “new” data is then tested on the validation set, for each m = 1, ..., L. It is determined which model h_{opt,m} performs best on the validation data. Its regularization strength m is then taken to be the sought solution m_opt to (43). After this screening of model classes for the best test performance, a model with the optimal regularization strength found is finally trained on the original complete training data set D. This whole procedure is called cross-validation.

Notice that nothing has been said so far about how to split D into T and V. This is not a trivial question: how should D best be partitioned? A clever way to answer it is to split D into many subsets D_j of equal size (j = 1, ..., K). Then carry out K complete screening runs via cross-validation, where in the j-th run the subset D_j is withheld as a validation set, and the remaining K − 1 sets joined together make up the training set. After these K runs, average the validation errors in order to find m_opt. This is called K-fold cross-validation. Here is the procedure in detail:

Given: A set (xi, yi)i=1,...,N of training data, and a loss function L. Also given: Some method which allows one to steer the model flexibility along a regularization strength parameter m. The weakest regularization should be weak enough to allow overfitting.

Step 1. Split the training data into K disjoint subsets D_j = (x_i, y_i)_{i∈I_j} of roughly equal size N' = N/K.

Step 2. Repeat for m = 1, ..., L:

    Step 2.1 Repeat for j = 1, ..., K:

        Step 2.1.1 Designate D_j as validation set V_j and the union of the other D_{j'} as training set T_j.

        Step 2.1.2 Compute the model with minimal training error on T_j:

        $$h_{opt,m,j} = \operatorname*{argmin}_{h \in H_m} \frac{1}{|T_j|} \sum_{(x_i, y_i) \in T_j} L(h(x_i), y_i),$$

        where H_m is the model search space for the regularization strength m.

        Step 2.1.3 Test h_{opt,m,j} on the current validation set V_j by computing the validation risk

        $$R^{val}_{m,j} = \frac{1}{|V_j|} \sum_{(x_i, y_i) \in V_j} L(h_{opt,m,j}(x_i), y_i).$$

    Step 2.2 Average the K validation risks R^{val}_{m,j} obtained from the “folds” carried out for this m, obtaining

    $$R^{val}_m = \frac{1}{K} \sum_{j=1,\ldots,K} R^{val}_{m,j}.$$

Step 3. Find the optimal regularization strength by looking for that m which minimizes the averaged validation risk:

$$m_{opt} = \operatorname*{argmin}_m R^{val}_m.$$

Step 4. Compute h_{m_{opt}} using the complete original training data set:

$$h_{m_{opt}} = \operatorname*{argmin}_{h \in H_{m_{opt}}} \frac{1}{N} \sum_{i=1,\ldots,N} L(h(x_i), y_i).$$

This procedure contains two nested loops and looks expensive. For economy,

one starts with the low-end m and increases it stepwise, assessing the generalization quality through cross-validation for each regularization strength m, until the validation risk starts to rise. The strength m_opt reached at that point is likely to be about the right one.

The most fine-grained assessment of the optimal flexibility is achieved when the original training data set is split into singleton subsets – that is, each D_j contains just a single training example. This is called leave-one-out cross-validation. It looks like a horribly expensive procedure, yet it may be advisable when one has only a small training data set, which incurs a particularly large danger of ending up with poorly generalizing models if a wrong model flexibility is used.

K-fold cross-validation is widely used – it is the de facto standard procedure in supervised learning tasks when the computational cost of learning a model is affordable.
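The following is a minimal Python sketch of K-fold cross-validation as described above. The callbacks train and risk, and the name k_fold_cv, are placeholders standing in for whatever learning algorithm and loss one actually uses.

```python
import numpy as np

def k_fold_cv(X, Y, K, flexibilities, train, risk):
    """K-fold cross-validation over a list of flexibility settings m.
    train(X_train, Y_train, m) -> model;  risk(model, X_val, Y_val) -> float.
    Returns the flexibility setting with the lowest averaged validation risk."""
    N = len(Y)
    folds = np.array_split(np.random.permutation(N), K)
    avg_risk = []
    for m in flexibilities:                       # outer loop: Step 2
        fold_risks = []
        for j in range(K):                        # inner loop: Step 2.1
            val_idx = folds[j]
            train_idx = np.concatenate([folds[jj] for jj in range(K) if jj != j])
            model = train(X[train_idx], Y[train_idx], m)
            fold_risks.append(risk(model, X[val_idx], Y[val_idx]))
        avg_risk.append(np.mean(fold_risks))      # Step 2.2
    return flexibilities[int(np.argmin(avg_risk))]  # Step 3
```

After the optimal setting has been found, the final model (Step 4) is trained once more on the complete dataset with that setting.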

6.6 Why it is called the bias-variance dilemma

We have seen that a careful adjustment of the flexibility of a supervised learning algorithm is needed to find the sweet spot between underfitting and overfitting. A more educated way to express this condition is to speak of the bias-variance tradeoff, also known as the bias-variance dilemma. In this subsection I want to unravel the root cause of the under/overfitting phenomenon in a little more mathematical detail. We will find that it can be explained in terms of a bias and a variance term in the expected error of estimated models.

Again I start from a training data set (x_i, y_i)_{i=1,...,N} drawn from a joint distribution P_{X,Y}, with x_i ∈ ℝ^n, y_i ∈ ℝ. Based on these training data I consider the learning task of finding a decision function h : ℝ^n → ℝ which has a low quadratic risk

$$R(h) = E_{X,Y}[(h(X) - Y)^2], \qquad (48)$$

where the notation E_{X,Y} indicates that the expectation is taken with respect to the joint distribution of X and Y. We assume that we are using some fixed learning algorithm A which, if it is given a training sample (x_i, y_i)_{i=1,...,N}, estimates a model ĥ. The learning algorithm A can be anything, good or bad, clever or stupid, overfitting or underfitting; it may be close to perfect or just always return the same model without taking the training sample into account – we don't make any assumptions about it.

The next consideration leads us to the heart of the matter, but it is not trivial. In mathematical terms, the learning algorithm A is just a function which takes a training sample (x_i, y_i)_{i=1,...,N} as input and returns a model ĥ. Importantly, if we ran A repeatedly, but using freshly sampled training data (x_i, y_i)_{i=1,...,N} in each run, then the returned models ĥ would vary from trial to trial – because the input samples are different in different trials. Applying these varying ĥ to some fixed pattern x ∈ ℝ^n, the resulting values ĥ(x) would show random behavior too. The distribution of these variable values ĥ(x) is

a distribution over ℝ. This distribution is determined by the distribution P_{X,Y}, the chosen learning algorithm A, and the training sample size; a good statistician could give us a formal derivation of the distribution of ĥ(x) from P_{X,Y} and knowledge of A, but we don't need that for our purposes here. The only insight that we need to take home at this point is that for a fixed x, ĥ(x) is a random variable whose value is determined by the drawn training sample (x_i, y_i)_{i=1,...,N}, and which has an expectation which we write as E_retrain[ĥ(x)] to indicate that the expectation is taken over all possible training runs with freshly drawn training data.

Understanding this point is the key to understanding the inner nature of under/overfitting. If you feel that you have made friends with this E_retrain[ĥ(x)] object, we can proceed. The rest is easy compared to this first conceptual clarification.

Without proof I note the following, intuitively plausible fact. Among all decision functions (from any candidate space H), the quadratic risk (48) is minimized by the function

$$\Delta(x) = E[Y \mid X = x], \qquad (49)$$

that is, by the expectation of Y given x. This function Δ : ℝ^n → ℝ, x ↦ E[Y | X = x] is the gold standard for minimizing the quadratic risk; no learning algorithm can give a better result than this. Unfortunately, of course, Δ remains unknown because the underlying true distribution P_{X,Y} cannot be exactly known.

Now fix some x and ask by how much ĥ(x) deviates, on average and in the squared error sense, from the optimal value Δ(x). This expected squared error is E_retrain[(ĥ(x) − Δ(x))²]. We can learn more about this error if we rewrite (ĥ(x) − Δ(x))² as follows:

$$(\hat h(x) - \Delta(x))^2 = \left(\hat h(x) - E_{retrain}[\hat h(x)] + E_{retrain}[\hat h(x)] - \Delta(x)\right)^2$$
$$= \left(\hat h(x) - E_{retrain}[\hat h(x)]\right)^2 + \left(E_{retrain}[\hat h(x)] - \Delta(x)\right)^2 + 2 \left(\hat h(x) - E_{retrain}[\hat h(x)]\right)\left(E_{retrain}[\hat h(x)] - \Delta(x)\right). \qquad (50)$$

Now observe that

$$E_{retrain}\!\left[ \left(\hat h(x) - E_{retrain}[\hat h(x)]\right)\left(E_{retrain}[\hat h(x)] - \Delta(x)\right) \right] = 0, \qquad (51)$$

because the second factor (E_retrain[ĥ(x)] − Δ(x)) is a constant, hence

$$E_{retrain}\!\left[ \left(\hat h(x) - E_{retrain}[\hat h(x)]\right)\left(E_{retrain}[\hat h(x)] - \Delta(x)\right) \right] = E_{retrain}\!\left[ \hat h(x) - E_{retrain}[\hat h(x)] \right] \left(E_{retrain}[\hat h(x)] - \Delta(x)\right),$$

and

$$E_{retrain}\!\left[ \hat h(x) - E_{retrain}[\hat h(x)] \right] = E_{retrain}[\hat h(x)] - E_{retrain}\!\left[E_{retrain}[\hat h(x)]\right] = E_{retrain}[\hat h(x)] - E_{retrain}[\hat h(x)] = 0.$$

Inserting (51) into (50) and taking the expectation on both sides of (50) finally gives us

$$E_{retrain}[(\hat h(x) - \Delta(x))^2] = \underbrace{\left( E_{retrain}[\hat h(x)] - \Delta(x) \right)^2}_{\text{(squared) bias}} + \underbrace{E_{retrain}\!\left[ \left( \hat h(x) - E_{retrain}[\hat h(x)] \right)^2 \right]}_{\text{variance}}. \qquad (52)$$

The two components of this error are conventionally named the bias and the variance contribution to the expected squared mismatch E_retrain[(ĥ(x) − Δ(x))²] between the learnt-model decision values ĥ(x) and the optimal decision value Δ(x). The bias measures how strongly the average learning result deviates from the optimal value; thus it indicates a systematic error component. The variance measures how strongly the learning results ĥ(x) vary around their expected value E_retrain[ĥ(x)]; this is an indication of how strongly the particular training data sets induce variations on the learning result. When the model flexibility is too low (underfitting), the bias term dominates the expected modeling error; when the flexibility is too high (overfitting), the variance term is the main source of mismatch. This is why the underfitting versus overfitting challenge is also called the bias-variance tradeoff (or dilemma).
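In synthetic experiments, where the true function Δ is known, the decomposition (52) can be estimated by repeated retraining on freshly drawn samples. The following Python sketch does exactly that; the helper names sample_data, learn, and bias_variance_at_x are illustrative, and the deliberately underfitting degree-1 polynomial fit is just one possible choice of learning algorithm A.

```python
import numpy as np

def bias_variance_at_x(x0, sample_data, learn, n_trials=200):
    """Monte Carlo estimate of the retraining mean and variance of h_hat(x0).
    sample_data() -> (X, Y) draws a fresh training set;
    learn(X, Y) -> h returns a fitted decision function h(x)."""
    preds = np.array([learn(*sample_data())(x0) for _ in range(n_trials)])
    return preds.mean(), preds.var()    # squared bias needs Delta(x0), see below

# synthetic example: true function sin(2 pi x), noisy samples, degree-1 fit
rng = np.random.default_rng(2)
def sample_data():
    x = rng.uniform(0, 1, 20)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(20)
def learn(x, y):
    coeffs = np.polyfit(x, y, deg=1)            # deliberately low-flexibility model
    return lambda x0: np.polyval(coeffs, x0)

mean_pred, variance = bias_variance_at_x(0.25, sample_data, learn)
bias_sq = (mean_pred - np.sin(2 * np.pi * 0.25)) ** 2   # Delta(0.25) is known here
```

With this underfitting model the squared bias dominates; replacing deg=1 by a high polynomial degree shifts the balance toward the variance term.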

7 Representing and learning distributions

Almost all machine learning tasks are based on training data that have some random component. Having completely noise-free data from deterministic sources observed with high-precision measurements is a rare exception. Thus, machine learning algorithms are almost all designed to cope with stochastic data. Their ultimate functionality (classification, prediction, control, ...) will be served well or poorly to the extent that the probability distribution of the training data has been properly accounted for. We have seen this in the previous section. If one (wrongly) believes that there is little randomness in the training data, one will take the given training points as almost correct – and end up with an overfitted model. Conversely, if one (wrongly) thinks the training data are almost completely “white noise randomness”, the learnt model will under-exploit the information in the training data and underfit. Altogether it is fair to say that machine learning is the art of probability distribution modeling (plus subsequent use of that learnt distribution for classification etc.) In many machine learning algorithms, the distribution model remains implicit – there is no place or data structure in the algorithm which explicitly models the data distribution. In some algorithms however, and for some tasks, the probability distribution of the training data is explicitly modeled. We will encounter some of these algorithms later in this course (hidden Markov models, Bayesian networks, Boltzmann machines are of that kind). At any rate, it is part of the standard knowledge of a machine learning professional to know some ways to represent probability distributions and to estimate these representations from data.

7.1 Optimal classification

The grand role which is played by modeling a data distribution accurately can be demonstrated very nicely in classification learning. We have met this basic task several times now – and in your professional future you will meet it again, dozens of times if you seriously work in machine learning.

We consider again a scenario where the input patterns are vectors x ∈ P ⊆ ℝ^n, which belong to one of k classes which we represent by their class labels C = {c_1, ..., c_k}. The set P is the “possible pattern space”, usually a bounded subset of ℝ^n (for instance, when the patterns x are photographic pixel vectors and the pixel color values are normalized to a range between 0 and 1, P would be the n-dimensional unit hypercube). We furthermore introduce a random variable (RV) X for the patterns and a RV Y for the class labels.

Many different decision functions h : P → C exist, some of which can be learnt from training data using some known learning algorithm. We will not deal with the learning problem in this subsection, but ask (and answer!) the fundamental question: Among all possible decision functions h : P → C, which one has the lowest risk in the sense of giving the lowest possible rate of misclassifications on test

data? Or, expressing the same question in terms of a loss function, which decision function minimizes the risk connected to the counting loss L(y, c), where L(y, c) = 1 if y ≠ c and L(y, c) = 0 if y = c?

It turns out that the minimal-risk decision function is in fact well-defined and unique, and it can (and must) be expressed in terms of the distribution of our data-generating RVs X and Y.

Our starting point is the true joint distribution P_{X,Y} of patterns and labels. This joint distribution is given by all the probabilities of the kind

$$P(X \in A, Y = c), \qquad (53)$$

where A is some subvolume of P and c ∈ C. The subvolumes A can be n-dimensional hypercubes within P, but they can also be arbitrarily shaped “volume bodies”, for instance balls or donuts or whatever. Note that the probabilities P(X ∈ A, Y = c) are numbers between 0 and 1, while the distribution P_{X,Y} is the function which assigns to every choice of A ⊆ P, c ∈ C the number P(X ∈ A, Y = c). (Probability theory gives us a rigorous formal way to define and handle this strange object, P_{X,Y}, which I will explain in depth, step by step, in the tutorial.)

The joint distribution P_{X,Y} is the “ground truth” – it is the real-world statistical distribution of pattern-label pairs of the kind we are interested in. In the Digits example, it would be the distribution of pairs made of (i) a handwritten digit and (ii) a human-expert provided class label. Test digit images and their class labels would be randomly “drawn” from this distribution.

A decision function h : P → {c_1, ..., c_k} partitions the pattern space P into k disjoint decision regions R_1, ..., R_k by

$$R_i = \{x \in P \mid h(x) = c_i\}. \qquad (54)$$

A test pattern x_test is classified by h as class i if and only if it falls into the decision region R_i. Now we are prepared to analyze and answer our ambitious question, namely which decision functions yield the lowest possible rate of misclassifications. Since two decision functions yield identical classifications if and only if their decision regions are the same, we will focus our attention on these regions and reformulate our question: which decision regions yield the lowest rate of misclassifications, or, expressed in its mirror version, which decision regions give the highest probability of correct classifications?

Let f_i be the pdf of the conditional distribution P_{X | Y = c_i}. It is called the class-conditional distribution. The probability of obtaining a correct classification for a random test pattern, when the decision regions are R_i, is equal to Σ_{i=1}^k P(X ∈ R_i, Y = c_i). Rewriting this expression using the pdfs of the class-conditional distributions gives

$$\sum_{i=1}^{k} P(X \in R_i, Y = c_i) = \sum_{i=1}^{k} P(X \in R_i \mid Y = c_i)\, P(Y = c_i) = \sum_{i=1}^{k} P(Y = c_i) \int_{R_i} f_i(x)\, dx. \qquad (55)$$

Note that the integral is taken over a region that possibly has curved boundaries, and the integration variable x is a vector. The boundaries between the decision regions are called decision boundaries. For patterns x that lie exactly on such boundaries, two or more classifications are equally probable. For instance, the digit pattern shown in the last-but-third column of the second row in Figure 15 would likely be classified by humans as a “1” or a “4” class pattern with roughly the same probability; this pattern would lie close to a decision boundary.

The expression (55) obviously becomes maximal if the decision regions are given by

$$R_i = \{x \in P \mid i = \operatorname*{argmax}_j\, P(Y = c_j)\, f_j(x)\}. \qquad (56)$$

Thus we have found the decision function which is optimal in the sense that it maximizes the probability of correct classifications, namely

$$h_{opt} : P \to C, \quad x \mapsto c_{\operatorname{argmax}_j P(Y = c_j)\, f_j(x)}. \qquad (57)$$

A learning algorithm that finds the optimal decision function (or some function approximating it) must learn (implicitly or explicitly) estimates of the class-conditional distributions P_{X | Y = c_i} and the class probabilities P(Y = c_i). The class probabilities are also called the class priors.

Figure 33 visualizes optimal decision regions and decision boundaries. In higher dimensions, the geometric shapes of decision regions can become exceedingly complex, fragmented and “folded into one another” — disentangling them during a learning process is one of the eternal challenges of ML.
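If the class priors and the class-conditional pdfs are known (or have been estimated), Equation 57 can be turned into code directly. Here is a minimal Python sketch using scipy for the class-conditional Gaussians; the two-class toy setup and the function name bayes_optimal_classifier are only illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_optimal_classifier(priors, class_pdfs):
    """Construct the minimal-risk decision function of Equation 57.
    priors : list of class probabilities P(Y = c_j)
    class_pdfs : list of callables f_j(x) returning class-conditional densities.
    Returns h(x) giving the index of the most probable class."""
    def h(x):
        scores = [p * f(x) for p, f in zip(priors, class_pdfs)]
        return int(np.argmax(scores))
    return h

# toy example with two Gaussian class-conditional distributions in R^2
f0 = multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf
f1 = multivariate_normal(mean=[2, 2], cov=np.eye(2)).pdf
h_opt = bayes_optimal_classifier(priors=[0.7, 0.3], class_pdfs=[f0, f1])
print(h_opt(np.array([1.0, 1.0])))   # class 0 wins here because of its larger prior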

7.2 Representing and learning distributions

The optimal classifier described in the previous subsection is optimal because it is shaped along the true distribution of the pattern and class label random variables.

Figure 33: Optimal decision regions R_i. A case with a one-dimensional pattern space P and k = 3 classes is shown. Broken lines indicate decision boundaries. Decision regions need not be connected!

Quite generally, any machine learning task can be solved “optimally” (in terms of minimizing some risk) only if the solution takes the true distribution of all task-relevant RVs into account. As I mentioned before, many learning algorithms estimate a model of the underlying data distribution only implicitly. But some ML algorithms generate explicit models of probability distributions, and in the wider fields of statistical modeling, explicit models of probability distributions are often the final modeling target.

Representing a probability distribution in mathematical formalism or by an algorithm is not always easy. Real-world probability distributions can be utterly complex and high-dimensional objects which one cannot just “write down” in a formula. Over the last six decades or so, ML and statistics research has developed a wide range of formalisms and algorithms for representing and estimating (“learning”) probability distributions. A machine learning professional should know about some basic kinds of such formalisms and algorithms. In this section I present a selection, ranging from elementary to supremely complex, powerful — and computationally costly.

7.2.1 Some classical probability distributions

In this section we describe a few distributions which arise commonly in application scenarios. They have standard names, and one should know them and the conditions under which they arise.

We start with distributions that are defined over discrete sample spaces S = {s_1, ..., s_k} (finite sample space) or S = {s_1, s_2, s_3, ...} (countably infinite sample space). These distributions can all be represented by their probability mass function (pmf):

Definition 7.1 Given a discrete sample space S (finite or countably infinite), a probability mass function on S is a function p : S → [0, 1] whose total mass is 1, that is, which satisfies Σ_{s∈S} p(s) = 1.

Bernoulli distribution. The Bernoulli distribution arises when one deals with observations that have only two possible outcomes, like tail – head, female – male, pass – fail, 0 – 1. This means a two-element sample space S = {s1, s2} equipped with the power set σ-field, on which a Bernoulli distribution is defined by its pmf, which in this case has its own standard terminology:

Definition 7.2 The distribution of a random variable which takes values in S = {s_1, s_2} with probability mass function

$$p(s_i) = \begin{cases} 1 - q & \text{for } i = 1 \\ q & \text{for } i = 2 \end{cases}$$

is called a Bernoulli distribution with success parameter q, where 0 ≤ q ≤ 1.

A Bernoulli distribution is thus specified by a single parameter, q.

Learning / estimation: Given a set of training data {x_1, ..., x_N} where x_i ∈ {s_1, s_2}, the parameter q can be estimated by q̂ = (1/N) |{i | x_i = s_2}|.

Binomial distribution. The binomial distribution describes the counts of successes if a binary-outcome “Bernoulli experiment” is repeated N times. For a simple example, consider a gamble where you toss a coin N times, and every time heads comes up, you earn a Dollar (not Euro; gambling is done in Las Vegas). What is the distribution of Dollar earnings from such N-repetition games, if the coin comes up heads (= s_2; outcome tails is s_1) with a success probability of q? Clearly the range of possible earnings goes from 0 to N Dollars. These earnings are distributed according to the binomial distribution:

Definition 7.3 Let N be the number of trials of independent Bernoulli experiments with success probability q in each trial. The distribution of the number of successes is called the binomial distribution with parameters N and q, and its pmf is given by

$$p(s) = \binom{N}{s} q^s (1-q)^{N-s} = \frac{N!}{s!\,(N-s)!}\, q^s (1-q)^{N-s}, \qquad s = 0, 1, 2, \ldots, N.$$

We write Bi(N, q) to denote the binomial distribution with parameters N and q.

The factor $\binom{N}{s}$ is called the binomial coefficient. Figure 34 shows some binomial pmf's. A note on notation: it is customary to write

X ∼ Bi(10, 0.25)

as a shorthand for the statement “X is distributed according to Bi(10, 0.25)”.

Figure 34: The pmf's of Bi(20, 0.1) (blue), Bi(20, 0.5) (red), and Bi(20, 0.9) (green). Figure taken from www.boost.org.

Learning / estimation: Depends on the format of the available training data. In most cases the data will be of the same form as for the Bernoulli distribution, which gives rise to an estimate q̂ of the success parameter. The number N of repetitions will usually be given by the context in which the modeling task arises and need not be estimated.

Poisson distribution. This distribution is defined for S = {0, 1, 2, ...}. Here p(k) is the probability that a particular kind of event occurs k times within a given time interval. Examples (all except the last one taken from https://en.wikipedia.org/wiki/Poisson_distribution): p(k) might be the probability that

• k meteorites impact on the earth within 100 years,

• a call center receives k calls in an hour,

• a block of uranium emits k alpha particles in a second,

• k patients arrive in an emergency ward between 10 and 11 pm,

• a piece of brain tissue sends k neural spike signals within a second.

Similarly, instead of referring to a time interval, k may count spatially or otherwise circumscribed events, for instance

• the number of dust particles found in a milliliter of air,

• the number of diamonds found in a ton of ore.

The expected number of events E[X] is called the rate of the Poisson distribution, and is commonly denoted by λ. The pmf of a Poisson distribution with rate λ is given by

$$p(k) = \frac{\lambda^k\, e^{-\lambda}}{k!}. \qquad (58)$$

Figure 35 depicts the pmf's for three different rates.

Figure 35: The pmf of the Poisson distribution for various values of the parameter λ. The connecting lines between the dots are drawn only for better visual appearance (image source: https://commons.wikimedia.org/wiki/File:Poisson_pmf.svg).

Learning / estimation: For concreteness consider the call center example. Training data would be N one-hour protocols of incoming calls, where n_i (i = 1, ..., N) is the number of calls received in the respective hour. Then λ is estimated by λ̂ = (1/N) Σ_i n_i.
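A minimal Python sketch of this estimate; the hourly call counts below are made-up illustration data, and scipy's poisson object is used only to evaluate the resulting pmf.

```python
import numpy as np
from scipy.stats import poisson

# hourly call counts observed in N = 8 one-hour protocols (made-up numbers)
counts = np.array([3, 5, 2, 4, 6, 3, 4, 5])
lam_hat = counts.mean()              # rate estimate: lambda_hat = (1/N) sum_i n_i
p_zero = poisson.pmf(0, lam_hat)     # estimated probability of a silent hour
```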

We continue with a few distributions on continuous sample spaces (ℝ^n or subsets thereof). These distributions are most often characterized by their probability density functions (pdf's).

Definition 7.4 Let X be a RV which takes values in S ⊆ ℝ^n. A pdf for the distribution of X is a function f : S → ℝ^{≥0} which satisfies the following condition: for every subvolume A ⊆ S of S, the probability P(X ∈ A) that X gives a value in A is equal to

$$P(X \in A) = \int_A f(x)\, dx. \qquad (59)$$

Point distributions. There exist continuous-valued distributions for which a description through a pdf is impossible. One kind of such “non-pdf-able” continuous distributions occurs quite often: point distributions. Here, the probability

mass is concentrated in a few (finitely many or countably infinitely many) points in ℝ^n. The simplest point distribution is defined on the real line S = ℝ and has the defining property that for any subset A ⊆ ℝ it holds that

$$P(X \in A) = \begin{cases} 1 & \text{if } 0 \in A \\ 0 & \text{if } 0 \notin A. \end{cases}$$

There are several ways to write down such a distribution in mathematical formalism. In the machine learning literature (and throughout the natural sciences, especially physics), the above point distribution would be represented by a weird kind of pdf-like function, called the Dirac delta function δ (the Wikipedia article https://en.wikipedia.org/wiki/Dirac_delta_function is recommended if you want to understand this function better). The Dirac delta function is used inside an integral just like a normal pdf is used in (59). Thus, for a subset A ⊆ ℝ one has

$$P(X \in A) = \int_A \delta(x)\, dx.$$

The Dirac delta is also used in ℝ^n, where likewise it is a “pdf” which describes a probability distribution concentrated in a single point, the origin. If one wants to have a multi-point distribution one can combine Dirac deltas. For example, if you want to create a probability measure on the real line that places a probability of 1/2 on the point 1.0, and probabilities of 1/4 each on the points 2.0 and 3.0, you can do this by a linear combination of shifted Dirac deltas:

$$P(X \in A) = \int_A \tfrac{1}{2}\, \delta(x - 1) + \tfrac{1}{4}\, \delta(x - 2) + \tfrac{1}{4}\, \delta(x - 3)\, dx.$$

Point distributions arise frequently in Bayesian machine learning, where they are needed to express and compute certain “hyperdistributions”. We will meet them later in this course.

We now take a look at a small selection of common continuous distributions which can be characterized by pdfs.

The uniform distribution. We don't need to make a big fuss about this. If I = [a_1, b_1] × ... × [a_n, b_n] is an n-dimensional interval in ℝ^n, the uniform distribution on I is given by the pdf

$$p(x) = \begin{cases} \frac{1}{(b_1 - a_1) \cdot \ldots \cdot (b_n - a_n)} & \text{if } x \in I \\ 0 & \text{if } x \notin I. \end{cases} \qquad (60)$$

Learning / estimation: This distribution is not normally “learnt”. There is also nothing about it that could be learnt: the uniform distribution has no “shape” that would be specified by learnable parameters.

The exponential distribution. This distribution is defined for S = [0, ∞) and could be paraphrased as “the distribution of waiting times until the next of these things happens”. Consider any of the kinds of temporal events listed for the Poisson distribution, for instance the event “meteorite hits earth”. The exponential distribution characterizes how long you have to wait for the next impact, given that one impact has just happened. As with the Poisson distribution, such random event processes have an average rate of events per unit reference time interval. For instance, meteorites of a certain minimum size hit the earth with a rate of 2.34 per year (just guessing). This rate is again denoted by λ. The pdf of the exponential distribution is given by

$$p(x) = \lambda\, e^{-\lambda x} \quad (\text{note that } x \geq 0). \qquad (61)$$

It is (almost) self-explanatory that the expectation of an exponential distribution is the reciprocal of the rate, E(X) = 1/λ. Figure 36 shows pdf's for some rates λ.

Figure 36: The pdf of the exponential distribution for various values of the parameter λ (image source: https://commons.wikimedia.org/wiki/File:Exponential_pdf.svg).

The exponential distribution plays a big role in spiking neural networks (SNNs). Biological neurons communicate with each other by sending short “point-like” electrical pulses, called spikes. Many people believe that Nature invented communication by spikes to let the brain save energy – your head shouldn't perceptibly warm up (like your PC does) when you do a hefty thinking job! For the same reason, microchip engineers have teamed up with deep learning researchers to design novel kinds of microchips for deep learning applications. These microchips contain neuron-like processing elements which communicate with each other by spikes. IBM and Intel have actually built such chips (check out “IBM TrueNorth” and “Intel Loihi” if you want to learn more). Research about artificial SNNs for machine learning applications goes hand in hand with research in neuroscience.

In both domains, one often uses models based on the assumption that the temporal pattern in a spike sequence sent by a neuron is a stochastic process called a Poisson process. In a Poisson process, the waiting times between two consecutive spikes are exponentially distributed. Recordings from real biological neurons often show a temporal randomness of spikes that can be almost perfectly modeled by a Poisson process.

Learning / estimation: The rate λ is the only parameter characterizing an exponential distribution. Training data: a sequence t_1, t_2, ..., t_N of time points where the event in question was observed. The mean waiting time is estimated by (1/(N−1)) Σ_{i=1,...,N−1} (t_{i+1} − t_i), and since E(X) = 1/λ, the rate is estimated by its reciprocal, λ̂ = (N−1) / Σ_{i=1,...,N−1} (t_{i+1} − t_i).

The one-dimensional normal distribution. Enter the queen of distributions! Bow in reverence! I am sure you know her from the media and facebook... For royal garments, as everybody knows, she wears the pdf

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad (62)$$

which is fully specified by its mean µ and standard deviation (square root of the variance) σ. It has the famous bell shape, with µ being the location of the maximum and µ ± σ being the locations of the zeros of the second derivative (Fig. 37).

Figure 37: pdf of a normal distribution with mean 2 and standard deviation 1.

The normal distribution with mean µ and variance σ² is denoted by N(µ, σ²). The normal distribution with zero mean and unit variance, N(0, 1), is called the standard normal distribution. The normal distribution is also called the Gaussian distribution or simply a Gaussian.

The normal distribution has a number of nice properties that very much facilitate calculations and theory. Specifically, the sum of independent, normally distributed RVs is again normally distributed, and the variance of the sum is the sum of the variances:

Proposition 7.1 Let X, Y : Ω → ℝ be two independent, normally distributed RVs with means µ and ν and variances σ² and τ². Then the sum X + Y is normally distributed with mean µ + ν and variance σ² + τ².

The majestic power of the normal distribution, which makes her reign almost universally over natural phenomena, comes from one of the most central theorems of probability theory, the central limit theorem. It is stated in textbooks in a variety of (not always exactly equivalent) versions. It says, in brief, that one gets the normal distribution whenever random effects of many independent small-sized causes sum up to large-scale observable effects. The following definition — which I do not expect you to fully understand (that would need substantial training in probability theory), but which you should at least have seen — makes this precise:

Definition 7.5 Let (X_i)_{i∈ℕ} be a sequence of independent, real-valued, square integrable random variables with nonzero variances Var(X_i) = E[(X_i − E[X_i])²].

Then we say that the central limit theorem holds for (X_i)_{i∈ℕ} if the distributions P_{S_n} of the standardized sum variables

$$S_n = \frac{\sum_{i=1}^{n} (X_i - E[X_i])}{\sigma\!\left( \sum_{i=1}^{n} X_i \right)} \qquad (63)$$

converge weakly to N(0, 1).

Explanations:

• A real-valued random variable with pdf p is square integrable if its (uncentered) second moment, that is, the integral E[X²] = ∫_ℝ x² p(x) dx, is finite.

• If (P_n)_{n∈ℕ} is a sequence of distributions over ℝ, and P a distribution (all over the same measure space (ℝ, B)), then (P_n)_{n∈ℕ} is said to converge weakly to P if

$$\lim_{n \to \infty} \int f(x)\, P_n(dx) = \int f(x)\, P(dx) \qquad (64)$$

for all continuous, bounded functions f : ℝ → ℝ. You will find the notation of these integrals unfamiliar, and indeed you see here cases of Lebesgue integrals – a far-reaching generalization of the Riemann integrals that you know. Lebesgue integrals can deal with a far greater range of functions than the Riemann integral. Mathematical probability theory is formulated exclusively with the Lebesgue integral. We cannot give an introduction to Lebesgue integration theory in this course. Therefore, simply ignore the precise meaning of “weak convergence” and take home that sequences of distributions are required to converge to a target distribution in some subtly defined way.

A sequence (X_i)_{i∈ℕ} of random variables (or, equivalently, its associated sequence of distributions (P_{X_i})_{i∈ℕ}) obeys the central limit theorem under rather weak conditions – or, in other words, for many such sequences the central limit theorem holds. A simple, important class of (X_i) for which the central limit theorem holds is obtained when the X_i are identically distributed (and, of course, are independent, square integrable and have nonzero variance). Notice that regardless of the shape of the distribution of each X_i, the distributions of the normalized sums converge to N(0, 1)!

The classical demonstration of the central limit theorem is the Galton board, named after Sir Francis Galton (1822–1911), an English multi-scientist. The idea is to let little balls (or beans, hence this device is sometimes called a “bean machine”) trickle down a grid of obstacles which randomly deflect the ball left or right (Figure 38). It does not matter how, exactly, these deflections act — in the simplest case, the ball is just kicked right or left by one space grid unit with equal probability. The deeper the trickling grid, the closer the resulting distribution will be to a normal distribution. A nice video can be watched at https://www.youtube.com/watch?v=PM7z_03o_kk.

Figure 38: The Galton board. See text for explanation. Figure taken from https://janav.wordpress.com/2013/09/26/power-law/.

However, this simple case does not explain the far-reaching, general importance of the central limit theorem (rather: property). In textbooks one often finds statements like, “if the outcomes of some measurement procedure can be conceived to be the combined effect of many independent causal effects, then the outcomes will be approximately normally distributed”. The “many independent causal effects” that are referred to here are the random variables (X_i); they will typically not be identically distributed at all. Still the central limit theorem holds under mild assumptions. Intuitively, all that one has to require is that none of the individual random variables X_i dominates all the others – the effects of any single X_i must asymptotically be “washed out” as an increasing number of other X_{i'} is entered into the sum variable S_n. In mathematical textbooks on probability you may find numerous mathematical conditions which amount to this “washing out”. A special case that captures many real-life cases is the condition that the X_i are uniformly bounded, that is, there exists some b > 0 such that |X_i(ω)| < b for all i and ω. However, there exist much more general (nontrivial to state) conditions that likewise imply the central limit theorem. For our purposes, a good enough take-home message is

if (Xi) is a halfway reasonably behaved sequence of numerical RV’s, then the normalized sums converge to the standard normal distribution.

As we will see in Part 2 of this lecture, the normal distribution plays an overwhelming role in applied statistics. One often has to actually compute integrals of the pdf (62):

Task: compute the numerical value of $\int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$.

There is no closed-form solution formula for this task. Instead, the solution is found in a two-step procedure:

1. Transform the problem from its original version N(µ, σ²) to the standard normal distribution N(0, 1), by using

$$\int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \int_{\frac{a-\mu}{\sigma}}^{\frac{b-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx. \qquad (65)$$

In terms of probability theory, this means transforming the original, N(µ, σ²)-distributed RV X to a N(0, 1)-distributed variable Z = (X − µ)/σ. (The symbol Z is often used in statistics for standard normal distributed RVs.)

2. Compute the numerical value of the r.h.s. in (65) by using the cumulative distribution function of N(0, 1), which is commonly denoted by Φ:

$$\int_{\frac{a-\mu}{\sigma}}^{\frac{b-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx = \Phi\!\left( \frac{b-\mu}{\sigma} \right) - \Phi\!\left( \frac{a-\mu}{\sigma} \right).$$

Since there is no closed-form solution for calculating Φ, in former times statisticians found the solution in books where Φ was tabulated. Today, statistics software packages call fast iterative solvers for Φ.

Learning / estimation: The one-dimensional normal distribution is characterized by two parameters, µ and σ². Given a dataset (x_i)_{i=1,...,N} of points in ℝ, the estimation formulas for these two parameters are

$$\hat{\mu} = \frac{1}{N} \sum_i x_i, \qquad \hat{\sigma}^2 = \frac{1}{N-1} \sum_i (x_i - \hat{\mu})^2.$$
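A minimal Python sketch of the two-step procedure above, using scipy's implementation of Φ (norm.cdf); the numbers in the usage line are only an example.

```python
from scipy.stats import norm

def normal_interval_probability(a, b, mu, sigma):
    """Probability that X ~ N(mu, sigma^2) falls in [a, b], computed via the
    standard normal cdf Phi as in Equation 65:
    Phi((b - mu)/sigma) - Phi((a - mu)/sigma)."""
    return norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)

# example: X ~ N(2, 1); probability of falling within one standard deviation of the mean
print(normal_interval_probability(1.0, 3.0, mu=2.0, sigma=1.0))   # approx. 0.6827
```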

The n–dimensional normal distribution. If data points are not just real numbers but vectors x = (x_1, ..., x_n)' ∈ ℝ^n, whose component RVs X_i (which give the vector components x_i) fulfill the central limit theorem, the joint distribution of the RVs X_1, ..., X_n is the multidimensional normal distribution, which has the pdf

$$p(x) = \frac{1}{(2\pi)^{n/2}\, \det(\Sigma)^{1/2}}\, \exp\!\left( -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right). \qquad (66)$$

Here µ is the expectation E[(X_1, ..., X_n)'] and Σ is the covariance matrix of the n component variables, that is, Σ(i, j) = E[(X_i − E[X_i])(X_j − E[X_j])]. Figure 39 shows the pdf of a 2-dimensional normal distribution. In geometrical terms, a multidimensional normal distribution is shaped as an ellipsoid whose main axes coincide with the eigenvectors u_i of the covariance matrix Σ. Like in PCA one can obtain them from the SVD: if U D U' = Σ is the singular value decomposition of Σ, the eigenvectors u_i are the columns of U.

Multidimensional normal distributions (or simply, “Gaussians”) are major players in machine learning. On the elementary end, one can use weighted combinations of Gaussians to approximate complex distributions – we will see this later in today's lecture. At the advanced end, there is an entire modern branch of ML, Gaussian processes, where complex distributions (and hyperdistributions) are modeled by infinitely many, interacting Gaussians. This is quite beyond the scope of our course.

Learning / estimation: Given a set of n-dimensional training data points (x_i)_{i=1,...,N}, the expectation µ and the covariance matrix Σ can be estimated from the training data in the obvious way:

$$\hat{\mu} = \frac{1}{N} \sum_i x_i \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{N-1} \sum_i (x_i - \hat{\mu})(x_i - \hat{\mu})'.$$
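A minimal numpy/scipy sketch of these estimators; fit_gaussian is an illustrative helper name, and the toy data are drawn from a known Gaussian only so that the recovered parameters can be checked.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Estimate mean and covariance of an n-dimensional Gaussian from
    an N x n data matrix X, using the estimators given above."""
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False)    # uses the 1/(N-1) normalization
    return mu_hat, Sigma_hat

# usage: draw samples from a known 2-D Gaussian and recover its parameters
rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, Sigma_hat = fit_gaussian(X)
density_at_origin = multivariate_normal(mu_hat, Sigma_hat).pdf([0.0, 0.0])
```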

... and many more! The few common, named distributions that I displayed in this section are only meant to be illustrative picks from a much, much larger reservoir of well-known, completely analyzed, tabulated, pre-computed, and individually named distributions. The online book “Field Guide to Continuous Probability Distributions” Crooks (2017) attempts a systematic overview. You should take home the following message:

Figure 39: A pdf (indicated by contour lines) of a 2-dimensional normal distribution with expectation µ. The vectors u_1, u_2 are the principal axes of the ellipses which characterize the geometry of this pdf.

• In 100 years or so of research, statisticians have identified hundreds of basic mechanisms by which nature generates random observations. In this subsection we looked at only two of them – (i) intermittent rare “impact events” coming from large numbers of independent sources which hit some target system with a mean frequency λ, giving rise to Poisson and exponential distributions; and (ii) stochastic physical measurables that can be understood as the additive effect of a large number of different causes, which leads to the normal distribution.

• One way of approaching a statistical modeling task for a target distribution P_X is to

  1. first analyze and identify the nature of the physical (or psychological or social or economical...) effects that give rise to this distribution,

  2. then do a literature search (e.g. check out what G. E. Crooks says) or ask a statistics expert friend which known and named distribution is available that was tailored to capture exactly these effects – which will likely give you a distribution formula that is shaped by a small number of parameters θ,

  3. then estimate θ from available observation data, getting a distribution estimate θ̂, and

  4. use the theory that statisticians have developed in order to calculate confidence intervals (or similar accuracy tolerance measures) for θ̂, which

  5. finally allows you to state something like, “given the observation data, with a probability of 0.95, the true distribution θ_true differs from the estimate θ̂ by less than 0.01 percent.”

• In summary, a typical classical statistical modeling project starts from well-argued assumptions about the type of the true distribution, then estimates the parameters of the distribution, then reports the estimate together with a quantification of the error bound or confidence level (or the like) of the estimate.

• Professionally documented statistical analyses will always not only state the estimated model, but will also quantify how accurate the model estimate is. This can take many forms, like error bars drawn around estimated parameters or stating “significance levels”.

• If you read a report that only reports a model estimate, without any such quantification of accuracy levels, then this report has not been written by a professional statistician. There are two kinds of such reports. Either the author was ignorant about how to carry out a statistical analysis – then trash the report and forget it. Or the report was written by a machine learner in a task context where accuracy levels are not important and/or cannot be technically obtained because it is not possible to identify the distribution kind of the data generating system.

7.3 Mixture of Gaussians; maximum-likelihood estimates by EM algorithms

Approximating a distribution of points x ∈ ℝ^n by a single n-dimensional Gaussian will usually be a far too coarse approximation which wipes out most of the interesting structure in the data. Consider, for a demonstration, the probability distribution of points in a rectangular subset of ℝ² visualized in Figure 40. The true distribution shown in panel A could not be adequately approximated by a single 2-dimensional Gaussian, but if one invests a mixture of twenty of them, the resulting model distribution (panel D) starts to look useful.

Approximating a complex probability distribution by such mixtures of Gaussians (MoG's) is a widely used technique. It comes with a transparent and robust optimization algorithm which locates, scales and orients the various Gaussians such that all together they give an optimal approximation. This optimization algorithm is an expectation-maximization (EM) algorithm. EM algorithms appear in many variants in machine learning, but all of them are designed according to the same basic optimization principle, which every machine learning professional should know.

Figure 40: A. A richly structured 2-dimensional pdf (darker: larger values of the pdf; white: zero value of the pdf). B. Same, discretized to a pmf. C. A MoG coverage of the pmf with 20 Gaussians, showing the ellipsoids corresponding to standard deviations. D. Contour plot of the pdf represented by the mixture of Gaussians. (The picture I used here is an iconic photograph from the history of aviation, July 1909. It shows Latham departing from Calais, attempting to cross the Channel in an aeroplane for the first time in history. The motor of his Antoinette IV monoplane gave up shortly before he reached England and he had to ditch into the water. A few days later his rival Bleriot succeeded. https://en.wikipedia.org/wiki/Hubert_Latham)

I will use this occasion of optimizing MoGs to introduce you to the world of EM algorithms as a side effect.

Let us first fix notation for MoGs. We consider probability distributions in ℝ^n. A Gaussian (multi-dimensional normal distribution) in ℝ^n is characterized by its mean µ and its n-dimensional covariance matrix Σ. We lump all the parameters inside µ and Σ together and denote the resulting parameter vector by θ. We denote the pdf of the Gaussian with parameters θ by p(·|θ). Now we can write down the pdf p of a MoG that is assembled from m different Gaussians as

$$p : \mathbb{R}^n \to \mathbb{R}^{\geq 0}, \quad x \mapsto \sum_{j=1}^{m} p(x \mid \theta_j)\, P(j), \qquad (67)$$

where θ_j = (µ_j, Σ_j) are the parameters of the j-th Gaussian, and where the mixture coefficients P(j) satisfy

$$0 \leq P(j) \quad \text{and} \quad \sum_{j=1}^{m} P(j) = 1.$$
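For concreteness, here is a minimal Python sketch that evaluates the MoG pdf (67) at a point, with scipy supplying the Gaussian component pdfs; the two-component parameter values are just an illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_pdf(x, means, covs, weights):
    """Evaluate the mixture-of-Gaussians pdf of Equation 67 at a point x.
    means, covs : lists holding the parameters theta_j = (mu_j, Sigma_j)
    weights : mixture coefficients P(j), nonnegative and summing to 1."""
    return sum(w * multivariate_normal(m, c).pdf(x)
               for m, c, w in zip(means, covs, weights))

# a two-component mixture in R^2
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
weights = [0.6, 0.4]
print(mog_pdf(np.array([1.5, 1.5]), means, covs, weights))
```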

For notational simplification we will mostly write Σ_{j=1}^m p(x|j) P(j) instead of Σ_{j=1}^m p(x|θ_j) P(j).

MoG's are used in unsupervised learning situations: given a data set (x_i)_{i=1,...,N} of points in ℝ^n, one wants to find a “good” MoG approximation of the distribution from which the x_i have been sampled. But... what does “good” mean?

This is the point to introduce the loss function which is invariably used when the task is to find a “good” distribution model for a set of training data points. Because this kind of loss function and this way of thinking about optimizing distribution models is not limited to MoG models but is the universal way of handling distribution modeling throughout machine learning, I devote a separate subsection to it and discuss it in general terms.

7.3.1 Maximum likelihood estimation of distributions

We consider the following general modeling problem: Given a dataset D = (xi)i=1,...,N of points in Rn and a hypothesis set H of pdf’s p over Rn, where each pdf p ∈ H is parametrized by θp, we want to select that popt ∈ H which “best” models the distribution from which the data points xi were sampled. In supervised learning scenarios, we formalized this modeling task around a loss function which compared model outputs with teacher outputs. But now we find ourselves in an unsupervised learning situation. How can a pdf p be a “good” or “bad” model for a given set of training points? In order to answer this question, one turns the question around: assuming the true distribution is given by a pdf p, how “likely” is it to get the training dataset D = (xi)i=1,...,N ? That is, how well can the training datapoints be “explained” by p?

To make this precise, one considers the probability that the dataset D was drawn, given that the true distribution has the pdf p (which is characterized by parameters θ). Since the probability of drawing D is a probability over sets of N datapoints from ℝ^n, it is described by a pdf f(D|θ) over (ℝ^n)^N. At this point one assumes that the datapoints in D have been drawn independently (and each of them from the distribution p), which leads to

$$f(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta). \tag{68}$$

This pdf value $f(D \mid \theta)$ measures the probability of getting the sample D if the underlying distribution is the one modeled by θ. Conversely, one says that this value $f(D \mid \theta)$ is the likelihood of the distribution θ given the data D, and writes $L(\theta \mid D)$. The words "probability" and "likelihood" are two different words which refer to the same mathematical object, namely $f(D \mid \theta)$, but this object is read from left to right when speaking of "probability":

• “the probability to get data D given the distribution is modeled by θ”, written f(D|θ), and it is read from right to left when speaking of “likelihood”:

• “the likelihood of the model θ given a dataset D”, written L(θ | D).

A correct usage of the words "probability" vs. "likelihood" is a sign of a good education in machine learning. The likelihood pdf (68) is not suitable for doing numerical computations, because a product of very many small numbers (the $p(x_i \mid \theta)$) will lead to numerical underflow and become squashed to zero on digital computers. To avoid this, one always works with the log likelihood

$$L(\theta \mid D) = \log \prod_{i=1}^{N} p(x_i \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{69}$$

instead. A model θ is "good" if it has a high (log)likelihood. Machine learning algorithms for finding the best model p in a hypothesis set H are designed to find the maximum likelihood model

$$\theta^{ML} = \operatorname*{argmax}_{\theta \in H} L(\theta \mid D). \tag{70}$$

Machine learners like to think of their learning procedures as minimizing some cost or loss. Thus one also finds in the literature the sign-inverted version of (70)

$$\theta^{ML} = \operatorname*{argmin}_{\theta \in H} - L(\theta \mid D), \tag{71}$$

which obviously is equivalent. If one writes out $L(\theta \mid D)$ one sees the structural similarity of this unsupervised learning task with the supervised loss-minimization task which we met earlier in this course (e.g. (41)):

$$\theta^{ML} = \operatorname*{argmin}_{\theta \in H} - \sum_{i=1}^{N} \log p(x_i \mid \theta).$$
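As a small numerical illustration of (69)–(71), the following sketch fits a one-dimensional Gaussian to data by minimizing the negative log-likelihood. It assumes NumPy and SciPy; the hypothesis class, the toy data, and all names are my own illustrative choices, not part of the lecture notes.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)        # training sample (x_i), i = 1..N

def neg_log_likelihood(theta, x):
    """-L(theta | D) = -sum_i log p(x_i | theta) for a Gaussian with theta = (mu, log sigma)."""
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)    # should come out close to 2.0 and 0.5
```

Parametrizing the standard deviation as exp(log sigma) keeps the optimization unconstrained, which is one common way to handle the positivity constraint.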

7.3.2 Maximum-likelihood estimation of MoG models by an EM algorithm

Back to our concrete business of modeling a distribution represented by a dataset $D = (x_i)_{i=1,\ldots,N}$ by a MoG. In order to find a maximum-likelihood solution for this MoG modeling task, one has to answer two questions:

1. What is the appropriate number m of mixture components? Setting m determines the class of candidate solutions in which an optimization algorithm can search for the maximum-likelihood solution. The set $H_m$ contains all mixtures of m Gaussians. We have here a case of a model class inclusion sequence as discussed in Section 6.4. For determining an appropriate model flexibility m the routine way would be to carry out a cross-validation.

2. If an appropriate m has been determined, how should the parameters $\theta_j$ (where $j = 1, \ldots, m$) of the participating Gaussians be set in order to solve the optimization problem

$$\theta^{ML} = \operatorname*{argmax}_{\theta \in H_m} L(\theta \mid D)\,? \tag{72}$$

The second question is answered by employing an expectation-maximization (EM) algorithm. EM algorithms can often be designed for maximizing the (log) likelihoods of parametrized distributions. There is a general scheme for designing such algorithms, and general theorems concerning their convergence properties. The EM principle was introduced by Dempster et al. (1977) — one of those classical landmark papers that will continue to be cited forever (as of 2019, 57K Google cites). EM algorithms are widely used in computational statistics when it comes to estimating model parameters θ from incomplete or partially observable data. What "incomplete" means depends on the specific case. For instance, it might mean missing data points in time series measurements, or measurements of physical systems where the available sensors give only a partial account of the system state. The principle of designing EM algorithms is a cornerstone of many important machine learning algorithms that use "incomplete" training data. The general idea is that the observed data have been caused by some "hidden" mechanism which cannot be directly observed. EM algorithms are used, for instance,

• for filling in "best guesses" for missing values in time series data where observations are unavailable for certain time points (like handwritten texts with unreadable words in them);

• for estimating hidden Markov models (HMMs) of stochastic timeseries; HMMs are the most widely used models for stochastic processes with memory;

• for estimating models of complex systems that are described by many interacting random variables, some of which are not observable — an example from human-machine interfacing: modeling the decisions expected from a user assuming that these decisions are influenced by emotional states that are not directly observable; such models are called Bayesian networks or, more generally, graphical models;

• for a host of other practical modeling tasks, like our MoG optimization task.

Because of the widespread usefulness of EM algorithms I will now carry you through a general, abstract explanation of the underlying principle. This is not easy stuff but worth the effort. I follow the modern exposition given in Roweis and Ghahramani (1999), which also covers other versions of EM for other machine learning problems. I use however another notation than these authors and I supply more detail. In blue font I will insert how the abstract picture translates into our specific MoG case. The following treatment will involve a lot of probability formalism — I consider it a super exercise in becoming friends with random variables and products of random variables! Try not to give up on the way! I suggest that in a first reading you skip the blue parts and concentrate on the abstract picture alone.

The general situation is the following. Let X be an observable random variable and let Y be a hidden random variable, which take values in value sets ("sample spaces") $S_X$ and $S_Y$, respectively. X produces the visible training data D and Y produces the hidden data which have a causal impact on D but cannot be observed. Note: the words visible and hidden are the common technical terms in the EM literature.

In the MoG case, X would be the random variable which generates the entire training sample $(x_i)_{i=1,\ldots,N}$, thus here $S_X = (\mathbb{R}^n)^N$. Y would be a random variable that decides which of the m Gaussians is used to generate $x_i$, thus $S_Y = \{1, \ldots, m\}^N$. The RV Y is a product $Y = \bigotimes_{i=1}^N Y_i$ of i.i.d. RVs $Y_i$, where $Y_i$ takes values $j \in \{1, \ldots, m\}$. Concretely, $Y_i$ selects one of the m Gaussians by a random choice weighted by the probability vector $(P(1), \ldots, P(m))$. Thus the value of $Y_i$ is $j_i$, an element of $\{1, \ldots, m\}$. The RV X is a product $X = \bigotimes_{i=1}^N X_i$ of i.i.d. RVs $X_i$, where $X_i$ generates the i-th training data point $x_i$ by a random draw from the $j_i$-th Gaussian, that is from the normal distribution $\mathcal{N}(\mu_{j_i}, \Sigma_{j_i})$. Figure 41 illustrates what it means not to know the values of the hidden variable Y.

Figure 41: A point set sampled from a mixture of Gaussians. Left: what we (don't) know if we don't know from which Gaussian a point was drawn. Right: the "hidden" variable made visible. (Figure copied from an online presentation of Christopher M. Bishop, no longer accessible)

Let θ be a vector of parameters describing the joint distribution $P_{X,Y}$ of hidden and visible RVs. In our MoG case, $\theta = (\mu_1, \ldots, \mu_m, \Sigma_1, \ldots, \Sigma_m, P(1), \ldots, P(m))'$. Let D be the observable sample, that is, the result from a random draw with X. In the MoG scenario, $D = (x_i)_{i=1,\ldots,N}$. The objective of an EM algorithm is to determine θ such that the likelihood $L(\theta) = P(X = D \mid \theta)$ becomes maximal, or equivalently, such that the log likelihood $L(\theta \mid D) = \log P(X = D \mid \theta)$ becomes maximal.

At this point we would have to start using different notations (pmf's versus pdf's) depending on whether X and Y are discrete or continuous RVs. I will carry out the formal description only for the case that both X and Y are continuous, which means that their joint distribution can be described by a pdf $p_{X,Y}$ defined on $S_X \times S_Y$. The marginal distributions of X and Y are then given by the parametrized pdf's $p_X(x \mid \theta) = \int_{S_Y} p_{X,Y}(x, y \mid \theta)\, dy$ and $p_Y(y \mid \theta) = \int_{S_X} p_{X,Y}(x, y \mid \theta)\, dx$. We can then switch to a description of distributions using only pdfs, that is, the distribution of samples D is described by the pdf $p_X(D \mid \theta)$. In the case of all-continuous distributions, the log likelihood of θ becomes

$$L(\theta \mid D) = \log p_X(D \mid \theta). \tag{73}$$

The MoG case has a mix of types: X gives values in the continuous space $(\mathbb{R}^n)^N$ and Y gives values in the discrete space $\{1, \ldots, m\}^N$. This makes it necessary to use a mix of pdf's and pmf's when working out the MoG case in detail.

The difficulty we are facing is that the pdf value $p_X(D \mid \theta)$ depends on the values that the hidden variables take, that is,

$$L(\theta \mid D) = \log p_X(D \mid \theta) = \log \int_{S_Y} p_{X,Y}(D, y \mid \theta)\, dy.$$

In the MoG case, the integral over $S_Y$ becomes a sum over $S_Y$:

$$p_X(D \mid \theta) = \sum_{(j_1,\ldots,j_N) \in \{1,\ldots,m\}^N} P(Y = (j_1, \ldots, j_N))\; p_{X \mid Y = (j_1,\ldots,j_N)}(D) = \sum_{(j_1,\ldots,j_N) \in \{1,\ldots,m\}^N} \prod_{i=1,\ldots,N} P(Y_i = j_i)\; p_{X_i \mid Y_i = j_i}(x_i),$$

where $p_{X \mid Y = (j_1,\ldots,j_N)}$ is the pdf of the conditional distribution of X given that the hidden variables take the value $(j_1, \ldots, j_N)$ and $p_{X_i \mid Y_i = j_i}$ is the pdf of the conditional distribution of $X_i$ given that the hidden variable $Y_i$ takes the value $j_i$. Now let q be any pdf for the hidden variables Y. Then we can obtain a lower bound on $\log p_X(D \mid \theta)$ by

$$\begin{aligned}
L(\theta \mid D) &= \log \int_{S_Y} p_{X,Y}(D, y \mid \theta)\, dy \\
&= \log \int_{S_Y} q(y)\, \frac{p_{X,Y}(D, y \mid \theta)}{q(y)}\, dy \\
&\geq \int_{S_Y} q(y) \log \frac{p_{X,Y}(D, y \mid \theta)}{q(y)}\, dy \\
&= \int_{S_Y} q(y) \log p_{X,Y}(D, y \mid \theta)\, dy - \int_{S_Y} q(y) \log q(y)\, dy \qquad (74) \\
&=: \mathcal{F}(q, \theta), \qquad (75)
\end{aligned}$$

where the inequality follows from a version of Jensen's inequality which states that $E[f \circ Y] \leq f(E[Y])$ for any concave function f (the log is concave). EM algorithms maximize the lower bound $\mathcal{F}(q, \theta)$ by alternatingly and iteratively maximizing $\mathcal{F}(q, \theta)$ first with respect to q, then with respect to θ, starting from an initial guess $q^{(0)}, \theta^{(0)}$ which is then updated by:

$$\text{Expectation step:} \quad q^{(k+1)} = \operatorname*{argmax}_{q} \mathcal{F}(q, \theta^{(k)}), \tag{76}$$
$$\text{Maximization step:} \quad \theta^{(k+1)} = \operatorname*{argmax}_{\theta} \mathcal{F}(q^{(k+1)}, \theta). \tag{77}$$

The maximum in the E-step is obtained when q is the conditional pdf of Y given the data D,

$$q^{(k+1)} = p_{Y \mid X = D, \theta^{(k)}}, \tag{78}$$

because then $\mathcal{F}(q^{(k+1)}, \theta^{(k)}) = L(\theta^{(k)} \mid D)$:

$$\begin{aligned}
\mathcal{F}(p_{Y \mid X = D, \theta^{(k)}}, \theta^{(k)}) &= \int_{S_Y} p_{Y \mid X = D, \theta^{(k)}}(y) \log p_{X,Y}(D, y \mid \theta^{(k)})\, dy - \int_{S_Y} p_{Y \mid X = D, \theta^{(k)}}(y) \log p_{Y \mid X = D, \theta^{(k)}}(y)\, dy \\
&= \int_{S_Y} p_{Y \mid X = D, \theta^{(k)}}(y) \log \frac{p_{X,Y}(D, y \mid \theta^{(k)})}{p_{Y \mid X = D, \theta^{(k)}}(y)}\, dy \\
&= \int_{S_Y} p_{Y \mid X = D, \theta^{(k)}}(y) \log p_X(D \mid \theta^{(k)})\, dy \\
&= \int_{S_Y} \frac{p_{X,Y}(D, y \mid \theta^{(k)})}{p_X(D \mid \theta^{(k)})} \log p_X(D \mid \theta^{(k)})\, dy \\
&= \frac{\log p_X(D \mid \theta^{(k)})}{p_X(D \mid \theta^{(k)})} \int_{S_Y} p_{X,Y}(D, y \mid \theta^{(k)})\, dy \\
&= \log p_X(D \mid \theta^{(k)}) = L(\theta^{(k)} \mid D).
\end{aligned}$$

In concrete algorithms, the conditional distribution $p_{Y \mid X = D, \theta^{(k)}}$ is computed as the expectation of Y given data D and parameters $\theta^{(k)}$, which is why this step has its name, the Expectation step. The maximum in the M-step is obtained if the first term in (74) is maximized, because the second term does not depend on θ:

$$\begin{aligned}
\theta^{(k+1)} &= \operatorname*{argmax}_{\theta} \int_{S_Y} q^{(k+1)}(y) \log p_{X,Y}(D, y \mid \theta)\, dy \\
&= \operatorname*{argmax}_{\theta} \int_{S_Y} p_{Y \mid X = D, \theta^{(k)}}(y) \log p_{X,Y}(D, y \mid \theta)\, dy. \qquad (79)
\end{aligned}$$

How the M-step is concretely computed depends on the particular kind of model. Because we have $\mathcal{F} = L(\theta^{(k)} \mid D)$ before each M-step, and the E-step does not change $\theta^{(k)}$, and $\mathcal{F}$ cannot decrease in an EM double-step, the sequence $L(\theta^{(0)} \mid D), L(\theta^{(1)} \mid D), \ldots$ grows monotonically toward a supremum. The iterations are stopped when $L(\theta^{(k)} \mid D) = L(\theta^{(k+1)} \mid D)$, or more realistically, when a predefined number of iterations is reached or when the growth rate falls below a predetermined threshold. The last parameter set $\theta^{(k)}$ that was computed is taken as the outcome of the EM algorithm.

It must be emphasized that EM algorithms steer toward a local maximum of the likelihood. If started from another initial guess, another final parameter set may be found. Here is a summary of the EM principle:

E-step: Estimate the distribution $p_{Y \mid X = D, \theta^{(k)}}(y)$ of the hidden variables, given data D and the parameters $\theta^{(k)}$ of a preliminary model. This can be intuitively understood as inferring knowledge about the hidden variables, thereby completing the data.

M-step: Use the (preliminary) "knowledge" of the complete data (given by D for the visibles and by $p_{Y \mid X = D, \theta^{(k)}}(y)$ for the hidden variables) to compute a maximum likelihood model $\theta^{(k+1)}$.

There are other ways to compute maximum likelihood solutions in tasks that involve visible and hidden variables. Specifically, one can often invoke a gradient descent optimization. A big advantage of EM over gradient descent is that EM does not need a careful tuning of algorithm control parameters (like learning rates, an eternal bummer in gradient descent methods), simply because there are no tuning parameters. Furthermore, EM algorithms are typically numerically robust, which cannot be said about gradient descent algorithms.

Now let us put EM to practice for the MoG estimation. For better didactic transparency, I will restrict my treatment to a special case where we require all Gaussians to be spherical, that is, their covariance matrices are of the form $\Sigma = \sigma^2 I_n$, where $\sigma^2$ is the variance of the Gaussian in every direction and $I_n$ is the n-dimensional identity matrix. The general case of mixtures of Gaussians composed of member Gaussians with arbitrary Σ is described in textbooks, for instance Duda et al. (2001). And anyway, you would probably use a ready-made online tool for MoG estimation...

Here we go. A MoG pdf with m spherical components is given by the vector of parameters $\theta = (\mu_1, \ldots, \mu_m, \sigma_1^2, \ldots, \sigma_m^2, P(1), \ldots, P(m))'$. This gives the following concrete optimization problem:

$$\theta^{ML} = \operatorname*{argmax}_{\theta \in H_m} L(\theta \mid D) = \operatorname*{argmax}_{\theta \in H_m} \sum_{i=1}^{N} \log \left( \sum_{j=1}^{m} \frac{1}{(2\pi\sigma_j^2)^{n/2}} \exp\left( -\frac{\|x_i - \mu_j\|^2}{2\sigma_j^2} \right) P(j) \right) \tag{81}$$

Assume that we are after iteration k and want to estimate $\theta^{(k+1)}$. In the E-step we have to compute the conditional distribution of the hidden variable Y, given data D and the preliminary model

$$\theta^{(k)} = (\mu_1^{(k)}, \ldots, \mu_m^{(k)}, \sigma_1^{2\,(k)}, \ldots, \sigma_m^{2\,(k)}, P^{(k)}(1), \ldots, P^{(k)}(m))'.$$

Unlike in the treatment that I gave for the general case, where we assumed Y to be continuous, now Y is discrete. Its conditional distribution is given by the

probabilities $P(Y_i = j_i \mid X_i = x_i, \theta^{(k)})$. These probabilities are

$$P(Y_i = j_i \mid X_i = x_i, \theta^{(k)}) = \frac{p^{(k)}(x_i \mid Y_i = j_i)\; P^{(k)}(j_i)}{p_X^{(k)}(x_i)}, \tag{82}$$

where $p^{(k)}(x_i \mid Y_i = j_i)$ is the pdf of the $j_i$-th Gaussian with parameters $\mu_{j_i}^{(k)}, \sigma_{j_i}^{2\,(k)}$, and

$$p_X^{(k)}(x_i) = \sum_{j=1}^{m} P^{(k)}(j)\; p^{(k)}(x_i \mid Y_i = j).$$

In the M-step we have to find maximum likelihood estimates for all parameters in θ. I do not give a derivation here but just report the results, which are intuitive enough:

$$\mu_j^{(k+1)} = \frac{\sum_{i=1}^{N} P(Y_i = j \mid X_i = x_i, \theta^{(k)})\; x_i}{\sum_{i=1}^{N} P(Y_i = j \mid X_i = x_i, \theta^{(k)})},$$
$$\sigma_j^{2\,(k+1)} = \frac{1}{n}\, \frac{\sum_{i=1}^{N} P(Y_i = j \mid X_i = x_i, \theta^{(k)})\; \|\mu_j^{(k+1)} - x_i\|^2}{\sum_{i=1}^{N} P(Y_i = j \mid X_i = x_i, \theta^{(k)})}, \quad \text{and}$$
$$P^{(k+1)}(j) = \frac{1}{N} \sum_{i=1}^{N} P(Y_i = j \mid X_i = x_i, \theta^{(k)}).$$

I conclude this sudorific (look this up in a dictionary) subsection with a little EM-for-MoG demo that I copied from an online presentation of Christopher Bishop (now apparently no longer online). The sample data $x_i \in \mathbb{R}^2$ come from observations of the Old Faithful geyser in Yellowstone National Park (Figure 42).

Figure 42: A two-dimensional dataset.

Figure 43 shows 20 EM iterations with m = 2 Gaussian components. The first panel shows the initial guess. Color codes are the estimated probabilities

$P(Y_i = j_i \mid X_i = x_i, \theta^{(k)})$. The MoG modeling of the aeroplane image in Figure 40 was also done with the EM algorithm, in a version that allowed for arbitrary covariance matrices $\Sigma_j$. I did it myself, programming it from scratch, just to prove to myself that all of this really works. It does!
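For readers who want to try this themselves, here is a compact sketch of the E-step (82) and the M-step updates above for spherical Gaussians, assuming NumPy. This is my own illustrative implementation, not the code used for the figures, and the initialization and stopping rule are deliberately simplistic.

```python
import numpy as np

def em_spherical_mog(X, m, n_iter=50, seed=0):
    """EM for a mixture of m spherical Gaussians on data X of shape (N, n)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    mu = X[rng.choice(N, size=m, replace=False)]       # initial means: m random data points
    var = np.full(m, X.var())                          # initial spherical variances sigma_j^2
    P = np.full(m, 1.0 / m)                            # initial mixture weights P(j)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(Y_i = j | X_i = x_i, theta^(k)), Eq. (82)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # squared distances (N, m)
        log_r = -0.5 * d2 / var - 0.5 * n * np.log(2 * np.pi * var) + np.log(P)
        log_r -= log_r.max(axis=1, keepdims=True)                       # numerical stabilization
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update means, spherical variances and mixture weights
        Nj = r.sum(axis=0)                                              # effective counts per component
        mu = (r.T @ X) / Nj[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * d2).sum(axis=0) / (n * Nj)
        P = Nj / N
    return mu, var, P

# Usage, e.g. on a two-dimensional dataset such as the Old Faithful observations:
# mu, var, P = em_spherical_mog(data, m=2)
```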

Figure 43: The EM algorithm at work on the Old Faithful dataset.

7.4 Parzen windows

After the heavy-duty work of coming to grips with the EM algorithm, let us finish this section with something easy, for chilling out. Parzen windows! Parzen windows are a simple representative of the larger class of kernel-based representations, and provide an alternative to mixture models for representing pdf's. These representations are non-parametric — there is no parameter vector θ for specifying a Parzen window approximation to a pdf.

To introduce Parzen windows, consider a sample of 5 real-valued points shown in Figure 44. Centered at each sample point we place a unit square area on the x-axis. Weighting each of them by 1/5 and summing them gives an intuitively plausible representation of a pdf.

To make this example more formal and general, consider n-dimensional data points $x_i$. Instead of a unit-length square, we wish to make these points the centers of n-dimensional hypercubes of side length d. We need a function H that indicates which points around $x_i$ fall into the hypercube centered at $x_i$. To this end we introduce a kernel function, also known as Parzen window,

$$H: \mathbb{R}^n \to \mathbb{R}^{\geq 0}, \quad (x_1, \ldots, x_n) \mapsto \begin{cases} 1, & \text{if } |x_j| < 1/2 \text{ for } j = 1, \ldots, n \\ 0, & \text{else} \end{cases} \tag{83}$$

Figure 44: Rectangular Parzen window representation of a distribution given by a sample of 5 real numbers. The sample points are marked by colored circles. Each data point lies in the middle of a square "Parzen window", that is, a rectangular pdf centered on the point. Weighted by 1/5 (colored rectangles) and summed (solid black staircase line) they give a pdf.

which makes H the indicator function of a unit hypercube centered at the origin. Using H, we get the n-dimensional analog of the staircase pdf in Figure 44 for a sample $D = (x_i)_{i=1,\ldots,N}$ by

$$p^{(D)}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{d^n}\, H\!\left( \frac{x - x_i}{d} \right), \tag{84}$$

observing that the volume of such a cube is $d^n$. The superscript (D) in $p^{(D)}$ is meant to indicate that the pdf depends on the sample D. Clearly, given some sample $(x_i)_{i=1,\ldots,N}$, we do not believe that such a rugged staircase reflects the true probability distribution the sample was drawn from. We would rather prefer a smoother version. This can be easily done if we use smoother kernel functions. A standard choice is to use multivariate Gaussians with diagonal covariance matrix and uniform standard deviations $\sigma =: d$ for H. This turns (84) into

$$\begin{aligned}
p^{(D)}(x) &= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi d^2)^{n/2}} \exp\left( -\frac{\|x - x_i\|^2}{2 d^2} \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{d^n}\, \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} \left\| \frac{x - x_i}{d} \right\|^2 \right) \qquad (85)
\end{aligned}$$

where the second line brings the expression to a format that is analog to (84). It is clear that any nonnegative kernel function H which integrates to unity can be used in an equation of this sort such that the resulting $p^{(D)}$ will be a pdf. The scaling factor d determines the width of the Parzen window and thereby the amount of smoothing. Figure 45 illustrates the effect of varying d. Comments:

Figure 45: The effect of choosing different widths d (here d = 1, d = 0.5, d = 0.2) in representing a 5-point, 2-dimensional sample by Gaussian windows. (Taken from the online set of figures of the book by Duda, Hart & Stork, ftp://ftp.wiley.com/public/sci_tech_med/pattern/)

• Parzen window representations of pdfs are "non-parametric" in the sense that the shape of such a pdf is determined by a sample (plus, of course, by the shape of the kernel function, which however mainly serves a smoothing purpose). This fact also can render Parzen window representations computationally expensive, because if the sample size is large, a large number of data points have to be stored (and accessed if the pdf is going to be used).

• The basic Parzen windowing scheme, as introduced here, can be refined in many ways. A natural way to improve on it is to use different widths d for different sample points $x_i$. One then makes d narrow in regions of the sample set which are densely populated, and wide in regions that are only thinly covered by sample points. One way of doing that (which I invented while I was writing this — there are many ways to go) would be to (i) choose a reasonably small integer K; (ii) for each sample point $x_i$ determine its K nearest neighbors $x_{i1}, \ldots, x_{iK}$; (iii) compute the mean squared distance δ of $x_i$ from these neighbors, (iv) set d proportional to this δ for this sample point $x_i$.

• As Figure 45 demonstrates, the width d has a strong effect on the Parzen pdf. If d is too large, the smoothing becomes too strong, and information contained in the sample is smoothed away — underfitting! In contrast, when d is too small, all that we see in the resulting Parzen pdf is the individual data points — the pdf then models not a distribution, but just the sample. Overfitting! Here one should employ a cross-validation scheme to optimize d, minimizing the validation loss defined by $-\sum_{v \in V} \log p^{(T)}(v \mid d)$, where V is the validation set, T the training set, and $p^{(T)}(v \mid d)$ the Parzen pdf obtained from the training set using width d.

• The Parzen-window based distribution (85) will easily lead to numerical underflow problems for arguments x which are not positioned close to a sample point, especially if d is set to small values. A partial solution is to use log probabilities instead, i.e. use

$$\log p^{(D)}(x) = \log \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi d^2)^{n/2}} \exp\left( -\frac{\|x - x_i\|^2}{2 d^2} \right) \right). \tag{86}$$

Still this will not usually solve all underflow problems, because the sum-exp terms within the log still may underflow. Here is a trick to circumvent this problem, known as the "log-sum-exp" trick (a small code sketch follows after this list). Exploit the following:

$$\begin{aligned}
\log(\exp(-A) + \exp(-B)) &= \log\left( \exp(-A + C)\exp(-C) + \exp(-B + C)\exp(-C) \right) \\
&= -C + \log\left( \exp(-A + C) + \exp(-B + C) \right),
\end{aligned}$$

where you use $C = \min(A, B)$, so that the larger of the two terms inside the log is exactly 1 and neither of them can overflow.
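Here is the promised sketch of (85)/(86) with the log-sum-exp trick, assuming NumPy and SciPy (scipy.special.logsumexp implements exactly this trick); the function name and the toy data are my own.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_pdf(x, sample, d):
    """log p^(D)(x) for a Gaussian Parzen window of width d, computed in a numerically stable way."""
    sample = np.atleast_2d(sample)                       # shape (N, n)
    N, n = sample.shape
    sq_dists = ((sample - x) ** 2).sum(axis=1)
    log_kernels = -sq_dists / (2 * d**2) - 0.5 * n * np.log(2 * np.pi * d**2)
    return logsumexp(log_kernels) - np.log(N)            # log of the mean of the kernel values

# Tiny 1-dimensional demo with 5 sample points, in the spirit of Figure 44
sample = np.array([[0.3], [0.9], [1.1], [2.0], [2.2]])
print(parzen_log_pdf(np.array([1.0]), sample, d=0.5))
```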

8 Bayesian model estimation

Machine learning is based on probability, and probability is ... ? — We don't know what probability is! After centuries of thinking, philosophers and mathematicians have not arrived at a general consensus on how to best understand randomness. The proposals can be broadly grouped into objectivistic and subjectivistic interpretations of "probability". According to the objectivistic view, probability resides physically in the real world — it is a phenomenon of nature, as fundamentally a property of physical reality as for instance "time" or "energy". According to the subjectivistic view, probability describes an observer's subjective opinion on something observed — a degree of belief, of uncertainty, of plausibility of judgement, or missing information etc.

Calling the two views "objectivistic" versus "subjectivistic" is the philosophers' terminology. Statisticians and machine learners rather call it frequentist statistics versus Bayesian statistics.

The duality of understandings of "probability" has left a strong mark on machine learning. The two ways of thinking about probability lead to two different ways of designing learning algorithms. Both are important and both are in daily use. A machine learning professional should be aware of the fundamental intuitions behind the two sorts of algorithms and when to use which sort. In this section I explain the fundamental ideas behind the two understandings of probability, outline the general principles of constructing Bayesian algorithms, and demonstrate them in a case study of great importance in bioinformatics, namely the classification of proteins.

8.1 The ideas behind frequentist statistics

When probability is regarded as a phenomenon of nature, there should be ways to measure it. The standard proposition of how one can measure probability is by relative frequency counting. For instance, in the textbook example of a loaded (unfair) die, the die may have the physical property that P(X = 6) = 1/5 (as opposed to P(X = 6) = 1/6 for a fair die), where X is the RV which gives the outcomes of throwing-the-die experiments. This property of the die could be experimentally measured by repeating the die-throwing act many times. This would give a sequence of outcomes $x_1, x_2, x_3, \ldots$ where $x_i \in \{1, \ldots, 6\}$. After N such throws, an estimate of the quantity P(X = 6) is calculated by

$$\hat{P}_N(X = 6) = \frac{\text{number of outcomes } x_i = 6}{N}, \tag{87}$$

where N is the number of throws that the experimentalist measuring this probability has carried out. We generally use the hat symbol $\hat{\cdot}$ on top of some variable to denote a numerical estimate based on a finite amount of observation data.

The true probability P(X = 6) = 1/5 would then become "measurable in principle" by

$$P(X = 6) \overset{(*)}{=} \lim_{N \to \infty} \hat{P}_N(X = 6). \tag{88}$$

A fundamental mathematical theorem in frequentist probability theory — actually, the fundamental law which justifies frequentist probability theory in the first place — is the law of large numbers. It states that if one carries out an infinite sequence of repeating the same numerical measurement, obtaining a sequence of measurement values $x_1, x_2, \ldots$, where $x_i \in \mathbb{R}$, then the mean value $\mu_N = (1/N) \sum_{i=1}^{N} x_i$ of the initial sequences up to N will "almost always" converge to the same number $\mu = E[X]$, called the expectation of X. Stating and proving this theorem in formal rigor has taken mathematicians a long time and was achieved by Kolmogorov not earlier than in the early 1930's. A precise formulation needs the full apparatus of measure-theoretic probability theory (which I partly introduce in the tutorial sessions). More generally, considering some random variable X which takes values in a sample space S, one considers subsets A ⊆ S (called events) and then defines

The frequentist view of probability: The probability P (X ∈ A) that the outcome of sampling a datapoint with procedure X gives an outcome in A is the relative frequency (in the limit of carrying out in- finitely many trials of enacting a measurement procedure X) of obtaining a measurement value in A.

If one looks at this "definition" with a critical mind one will find that it is laden with difficulties.

First, it defines a “measurement” process that is not feasible in reality because one cannot physically carry out infinitely many measurement trials. This is maybe not really disturbing because any measurement in the sciences (say, of a voltage) is imprecise and one gets measurements of increasing precision by repeating the measurement, just as when one measures a probability.

Second, it does not inform us about how, exactly, the "repeated trials" of executing X should be done in order to be "unbiased". What does that mean in terms of experimental procedures? This is a very critical issue. To appreciate its impact, consider the example of a bank that wishes to estimate the probability that a customer will fail to pay back a loan. In order to get an estimate of this probability, the bank can only use customer data collected in the past, but wants to base creditworthiness decisions for future customers on those data. Picking only past data points to base probability estimates on can hardly qualify as an "absolutely unbiased" sampling of data, and in fact the bank may grossly miscalculate credit risks when

the general customer body or their economic conditions change over time. These difficulties have of course been recognized in practical applications of statistics. Textbooks and statistics courses for psychologists and economists contain entire chapters with instructions on how to create "random samples" or "unbiased samples" as well as possible.

Third, if one repeats the repeated measurement, say by carrying out one measurement sequence giving $x_1, x_2, \ldots$ and then (or in parallel) another one giving $x_1', x_2', \ldots$, the values $\hat{P}_N$ from Equation (88) are bound to differ between the two series. The limit indicated in Equation (88) must somehow be robust against different versions of the $\hat{P}_N$. Mathematical probability theory offers several ways to rigorously define limits of series of probability quantities which we do not present here. Equation (88) is suggestive only and I marked it with a (∗) to indicate that it is not technically complete.

Among these three difficulties, only the second one is really problematic. The first one is just a warning that in order to measure a probability with increasing precision we need to invest an increasingly large effort — but that is the same for other measurables in the sciences. The third difficulty can be fully solved by a careful definition of suitable limit concepts in mathematical probability theory. But the second difficulty is fundamental and raises its ugly head whenever statistical assertions about reality are made.

In spite of these difficulties, the objectivist view on probability in general and the frequentist account of how to measure it in particular is widely shared among empirical scientists. It is also the view of probability which is commonly taught in courses of statistics for mathematicians. A student of mathematics may not even have heard of Bayesian statistics at the end of his/her studies. In machine learning, the frequentist view often leads to learning algorithms which are based on the principle of maximum likelihood estimation of distributions — as we have seen in Section 7.3.1. In fact, everything I wrote in these lecture notes up to now was implicitly based on a frequentist understanding of probability.

8.2 The ideas behind Bayesian statistics

Subjectivist conceptions of probability have also led to mathematical formalisms that can be used in statistical modeling. A hurdle for the student here is that a number of different formalisms exist which reflect different modeling goals and approaches, and tutorial texts are rare.

The common starting point for subjectivist theories of probability is to cast "probability" as a subjective degree of belief, of certainty of judgement, or plausibility of assertions, or similar — instead of as an objective property of real-world systems. Subjectivist theories of probability do not develop analytical tools to describe randomness in the world. Instead they provide formalisms that code how rational agents (you, I, and AI systems) should think about the world, in the face of

various kinds of uncertainty in their knowledge and judgements. The formalisms developed by subjectivists can by and large be seen as generalizations of classical logic. Classical logic only knows two truth values: true or false. In subjectivist versions of logic formalisms, a proposition can be assigned graded degrees of "belief", "plausibility", etc. For a very first impression, contrast a classical-logic syllogism like

if A is true, then B is true
A is true

therefore, B is true

with a "plausibility reasoning rule" like

if A is true, then B becomes more plausible
B is true

therefore, A becomes more plausible

This example is taken from Jaynes (2003, first partial online editions in the late 1990s), where a situation is described in which a policeman sees a masked man running away from a jeweler's shop whose window was just smashed in. The plausibility rule captures the policeman's inference that the runner is a thief (A), because if a person is a thief, it is more likely that the person will run away from a smashed-in shop window (B) than when the person isn't a thief.

From starting points like this, a number of logic formalisms have been devised which enrich/modify classical two-valued logic in various ways. If you want to explore these areas a little further, the Wikipedia articles probabilistic logic, Dempster-Shafer theory, fuzzy logic, or Bayesian probability are good entry points. In some of these formalisms the Kolmogorov axioms of frequentist probability re-appear as part of the respective mathematical apparatus. Applications of such formalisms arise in artificial intelligence (modeling reasoning under uncertainty), human-machine interfaces (supporting discourse generation), game theory and elsewhere.

The discipline of statistics has almost entirely been developed in an objectivist spirit, firmly rooted in the frequentist interpretation of probability. Machine learning also in large parts roots in this view. However, a certain subset of machine learning models and computational procedures have a subjectivist component. These techniques are referred to as Bayesian model estimation methods. Bayesian modeling is particularly effective and important when training datasets are small. I will explain the principle of Bayesian model estimation with a super-simple synthetic example. A more realistic but also more complex example will be given in a separate subsection.

Consider the general statistical modeling task, here discussed for real-valued random variables only to simplify the notation. A measurement which yields real-valued outcomes (like measuring the speed of a diving falcon) is repeated N times, giving a measurement sequence $x_1, \ldots, x_N$. The ith measurement value is obtained from a RV $X_i$. These RVs $X_i$ which model the individual measurements are i.i.d. We assume that the distribution of each $X_i$ can be represented by a pdf $p_{X_i}: \mathbb{R} \to \mathbb{R}^{\geq 0}$. The i.i.d. property of the family $(X_i)_{i=1,\ldots,N}$ implies that all these $p_{X_i}$ are the same, and we call that pdf $p_X$. We furthermore assume that $p_X$ is a parametric pdf, that is, it is a function which is parametrized by a parameter vector θ, for which we often write $p_X(\theta)$. Then, the statistical modeling / machine learning task is to estimate θ from the sample data $x_1, \ldots, x_N$. We have seen that this set-up naturally leads to maximum-likelihood estimation algorithms which need to be balanced with regards to bias-variance using some regularization and cross-validation scheme.

For concreteness let us consider a case where N = 2, that is, two observations (only two!) have been collected, forming a training sample $D = (x_1, x_2) = (0.9, 1.0)$. We assume that the pdf $p_X$ is a normal distribution with unit standard deviation, that is, the pdf has the form $p_X(x) = 1/\sqrt{2\pi}\, \exp(-(x - \mu)^2/2)$. This leaves the expectation µ as the only parameter, thus θ = µ. The learning task is to estimate µ from $D = (x_1, x_2) = (0.9, 1.0)$.

The classical frequentist answer to this question is to estimate µ by the sample mean, that is, to compute

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{89}$$

which in our example gives $\hat{\mu} = (0.9 + 1.0)/2 = 0.95$.

This is the best a classical-frequentist modeler can do. In a certain well-defined sense which we will not investigate, the sample mean is the optimal estimator for the true mean of a real-valued distribution. But "best" is not "good": with only two data points, this estimate is quite shaky. It has a high variance: if one would repeat the observation experiment, getting a new sample $x_1', x_2'$, very likely one would obtain a quite different sample and thus an estimate $\hat{\mu}'$ that is quite different from $\hat{\mu}$.

Bayesian model estimation shows a way how to do better. It is a systematic method to make use of prior knowledge that the modeler may have beforehand. This prior knowledge takes the form of "beliefs" about which parameter vectors θ are more or less plausible. This belief is cast in the form of a probability distribution over the space Θ of possible parameter vectors θ. In our super-simple example let us assume that the modeler knows or believes that

• the true expectation µ can't be negative and it can't be greater than 1.

That's all as far as prior knowledge goes. In the absence of more detailed insight, this belief that the range of µ is limited to [0, 1] is cast in the Bayesian prior distribution h:

• the degrees of plausibility for µ are described by the uniform distribution h on [0, 1] (see Figure 46).

This kind of prior knowledge is often available, and it can be quite weak (as in our example). Abstracting from our example, this kind of knowledge means to fix a "belief profile" (in the mathematical format of a probability distribution) over the space Θ of possible parameters θ for a model family. In our mini-example the modeler felt confident to restrict the possible range of the single parameter $\theta_1 = \mu$ to the interval [0, 1], with a uniform (= non-committing) distribution of "belief" over this interval. Before I finish the treatment of our baby example, I will now present the general schema of Bayesian model estimation.

To start the Bayesian model estimation machinery, available prior knowledge about parameters θ is cast into the form of a distribution over parameter space. For a K-parametric pdf $p_X$, the parameter space is $\mathbb{R}^K$. Two comments:

• The distribution for model parameters θ is not a distribution in the classical sense. It is not connected to a random variable and does not model a real-world outcome of observations. Instead it captures subjective beliefs that

the modeler has about what the true distribution $P_{X_i}$ of data points should look like. It is here that subjectivistic aspects of "probability" intrude into an otherwise classical-frequentist picture.

• Each parameter vector $\theta \in \mathbb{R}^K$ corresponds to one specific pdf $p_X(\theta)$, which in turn represents one possible candidate distribution $\hat{P}_X$ for empirical observation values $x_i$. A distribution over parameter vectors $\theta \in \mathbb{R}^K$ is thus a distribution over distributions. It is called a hyperdistribution.

In order to proceed with our discussion of general principles, we need to lift our view from the pdf's $p_{X_i}(\theta)$, which model the distribution of single data points, to the N-dimensional pdf $p_{\bigotimes_i^N X_i}: \mathbb{R}^N \to \mathbb{R}^{\geq 0}$ for the distribution of the product RV $\bigotimes_i^N X_i$. It can be written as

$$p_{\bigotimes_i^N X_i}((x_1, \ldots, x_N)) = p_{X_1}(x_1) \cdot \ldots \cdot p_{X_N}(x_N), \tag{90}$$

or, in another notation (observing that all pdfs $p_{X_i}$ are identical, so we can use the generic RV X with $p_X = p_{X_i}$ for all i), as

$$p_{\bigotimes_i^N X}((x_1, \ldots, x_N)) = \prod_i p_X(x_i). \tag{91}$$

We write $p_{\bigotimes_i^N X}(D \mid \theta)$ to denote the pdf value $p_{\bigotimes_i^N X}(\theta)((x_1, \ldots, x_N)) = p_{\bigotimes_i^N X}(\theta)(D)$ of $p_{\bigotimes_i^N X}(\theta)$ on a particular training data sample D. Summarizing:

• The pdf h(θ) encodes the modeler's prior beliefs about what the parametrized distribution $p_X(\theta)$ should look like. Parameters θ where h(θ) is large correspond to data distributions that the modeler a priori finds more plausible. The distribution represented by h(θ) is a hyperdistribution and it is often called the (Bayesian) prior.

• If θ is fixed, $p_{\bigotimes_i^N X}(D \mid \theta)$ can be seen as a function of data vectors D. This function is a pdf over the training sample data space. For each possible training sample $D = (x_1, \ldots, x_N)$ it describes how probable this particular outcome is, assuming the true distribution of X is $p_X(\theta)$.

• If, conversely, D is fixed, then $p_{\bigotimes_i^N X}(D \mid \theta)$ can be seen as a function of θ. Seen as a function of θ, $p_{\bigotimes_i^N X}(D \mid \theta)$ is not something like a pdf over θ-space. Its integral over θ will not usually be one. Seen as a function of θ, $p_{\bigotimes_i^N X}(D \mid \theta)$ is called a likelihood function — given data D, it reveals certain models θ as being more likely than others. A model (characterized by θ) "explains" given data D better if $p_{\bigotimes_i^N X}(D \mid \theta)$ is higher. We have met the concept of a likelihood function before in Section 7.3.1.

We thus have two sources of information about the sought-after, unknown true distribution $p_X(\theta)$: the likelihood $p_{\bigotimes_i^N X}(D \mid \theta)$ of θ given data, and the prior plausibility encoded in h(θ). These two sources of information are independent of each other: the prior plausibility is settled by the modeler before data have been observed, and should not be informed by data. Because the two sources of information are "independent" (belief and data), it makes sense to combine them by multiplication and consider the product

$$p_{\bigotimes_i^N X}(D \mid \theta)\; h(\theta).$$

This product combines the two available sources of information about the sought-after true distribution $p_X(\theta)$. When data D are given, this product is a function of model candidates θ. High values of this product mean that a candidate model θ is a good estimate, low values mean it's bad — in the light of both observed data and prior assumptions. With fixed D, the product $p_{\bigotimes_i^N X}(D \mid \theta)\, h(\theta)$ is a non-negative function on the K-dimensional parameter space $\theta \in \mathbb{R}^K$. It will not in general integrate to unity and thus is not a pdf. Dividing this product by its integral however gives a pdf, which we denote by $h(\theta \mid D)$:

$$h(\theta \mid D) = \frac{p_{\bigotimes_i^N X}(D \mid \theta)\; h(\theta)}{\int_{\mathbb{R}^K} p_{\bigotimes_i^N X}(D \mid \theta)\; h(\theta)\, d\theta} \tag{92}$$

The distribution on model parameter space represented by the pdf $h(\theta \mid D)$ is called the posterior distribution or simply the posterior. The formula (92) shows how Bayesians combine the subjective prior h(θ) with the empirical information $p_{\bigotimes_i^N X}(D \mid \theta)$ to get a posterior distribution over candidate models. Comments:

• The posterior distribution $h(\theta \mid D)$ is the final result of a Bayesian model estimation procedure. It is a probability distribution over candidate models, which is a richer and often a more useful thing than the single model that is the result of a classical frequentist model estimation (like the sample mean from Equation 89).

• Here I have considered real-valued distributions that can be represented by pdfs throughout. If some of the concerned distributions are discrete or cannot be represented by pdfs for some reason, one gets different versions of (92).

• If one wishes to obtain a single, definite model estimate from a Bayesian modeling exercise, a typical procedure is to compute the mean value of the posterior. The resulting model $\hat{\theta}$ is called the posterior mean estimate

$$\hat{\theta} = \int_{\mathbb{R}^K} \theta\; h(\theta \mid D)\, d\theta.$$

• Compared to classical-frequentist model estimation, Bayesian model estimation procedures are generally computationally more expensive and also more difficult to design properly, because one has to invest some thinking into good priors. With diligently chosen priors, Bayesian model estimates may give far better models than classical-frequentist ones, especially when sample sizes are small.

• If one abbreviates the normalization term $\int_{\mathbb{R}^K} p_{\bigotimes_i^N X}(D \mid \theta)\; h(\theta)\, d\theta$ in (92) as p(D) ("probability density of seeing data D, averaged over candidate model parameters θ"), the formula (92) looks like a version of Bayes' rule:

$$h(\theta \mid D) = \frac{p_{\bigotimes_i^N X}(D \mid \theta)\; h(\theta)}{p(D)}, \tag{93}$$

which is why this entire approach is called "Bayesian". Note that while Bayes' rule is a theorem that can be proven from the axioms of classical probability theory, (93) is a definition (of $h(\theta \mid D)$).

• A note on terminology. There exists a family of models in machine learning named Bayesian networks. This name does not imply that Bayesian networks are estimated from data using "subjectivist" Bayesian statistics. Bayesian networks are parametrized models and their parameter vectors can be estimated with classical frequentist or with Bayesian methods.

Let us conclude this section with a workout of our simple demo example. For the given sample D = (0.9, 1.0), the likelihood function becomes

$$p_{\bigotimes_i^N X}(D \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\left( \frac{-(0.9 - \mu)^2}{2} \right) \cdot \frac{1}{\sqrt{2\pi}} \exp\left( \frac{-(1.0 - \mu)^2}{2} \right) = \frac{1}{2\pi} \exp\left( -0.905 + 1.9\,\mu - \mu^2 \right).$$


Figure 46: Bayesian model estimation for the baby example: the prior h(µ), the likelihood $p_{\bigotimes_i^N X}(D \mid \mu)$, and the posterior h(µ | D), plotted over the model parameter µ. Compare text.

The green line in Figure 46 gives a plot of $p_{\bigotimes_i^N X}(D \mid \mu)$, and the red broken line a plot of the posterior distribution $h(\mu \mid D)$. A numerical integration of $\int_{\mathbb{R}} \mu\; h(\mu \mid D)\, d\mu$ yields a posterior mean estimate of $\hat{\theta} \approx 0.565$. This is quite different from the sample mean 0.95, revealing the strong influence of the prior distribution on possible models.
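The posterior mean for this baby example is easy to reproduce numerically. The following sketch (my own, assuming NumPy) discretizes µ on a grid over the prior support and approximates the integrals in (92) by simple Riemann sums:

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 10001)                        # grid over the prior support [0, 1]
dmu = mu[1] - mu[0]
likelihood = np.exp(-0.905 + 1.9 * mu - mu**2) / (2 * np.pi)   # p(D | mu) for D = (0.9, 1.0)
prior = np.ones_like(mu)                                 # uniform prior h(mu) on [0, 1]
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dmu)    # h(mu | D), cf. Eq. (92)
posterior_mean = (mu * posterior).sum() * dmu
print(posterior_mean)    # roughly 0.57, compare the posterior mean estimate reported above
```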

8.3 Case study: modeling proteins

I will now discuss in some detail an elementary modeling task that is easily stated and ubiquitously met in practical applications — yet it needs a surprisingly subtle treatment if training data are scarce. In a general formulation, the task is to estimate a probability mass function for a finite, discrete distribution, given a histogram from a sample. For example, one might want to estimate the probabilities that a coin comes up with head or tail after tossing (formally modeled by a RV X). Tossing the coin 100 times gives a histogram over the two symbolic measurement outcomes H and T, say, (55, 45). Based on these training data, one wants to estimate the probabilities P(X = H) and P(X = T). A maximum likelihood estimate is the pmf $\hat{p} = (P(X = \text{H}), P(X = \text{T}))' = (0.55, 0.45)'$. Easy and obvious.

But things become not-so-easy and non-obvious when the number of symbolic categories is large and the number of available observations is small. I will highlight this with an example taken from a textbook of bioinformatics (Durbin et al., 2000). Proteins are sequences of amino acids, of which there are 20 different ones which occur in proteins. They are standardly tagged by 20 capital symbols, giving an alphabet S = {A, C, D, ..., Y}, intimately familiar to biologists. Proteins come in classes. A protein in one animal or plant species typically has close relatives in other species. Related proteins differ in detail but generally can be aligned, that

is, corresponding sites in the amino acid string can be detected, put side by side, and compared. For instance, consider the following short section of an alignment of 7 sequences, all from the class of "globuline" proteins:

...GHGK...
...AHGK...
...KHGV...
...THAN...
...WHAE...
...AHAG...
...ALGA...

A basic task in bioinformatics is to estimate the probability distribution of the amino acids in each column in a protein class. These 20-dimensional pmf's (one for each site in the protein) are needed for deciding whether some newly found protein belongs into the class — that is, for carrying out protein classifications, a fundamental and bulk-volume task in bioinformatics.

This task is rendered difficult by the fact that the sample of aligned proteins used to estimate these pmf's is typically rather small — here we have only 7 individuals in the sample. As can be seen in the segment of the alignment shown above, some columns have widely varying entries (e.g. the last column has KKVNEGA). In such a small sample, chances are high that an amino acid, which is generally rare at a given site in the animal/plant kingdom, will not appear in the training data set. The column KKVNEGA only features 6 out of 20 possible amino acids; 14 amino acids are missing at this site in the training data.

But the class of related proteins is huge: in every animal species one would expect at least one class member and typically many more — millions of members in the family of globulines! This implies that at every site, every amino acid will occur in some species, though maybe with low probability. A low-probability-in-a-specific-site amino acid, call it X, will often not be represented in that site's training data. Consequences: (1) a maximum-likelihood estimate of the pmf for that site will assign zero probability to X, which in turn implies (2), when this model is used for protein classification and a new test protein does have X in that site, it will turn out that the test protein cannot possibly be a globuline, because it has X at that site and in globulines the probability for this is zero. This wrong decision will be made for all test proteins which are globulines but happen to host one of the 14 amino acids not in the training set on that site. Since there are many sites in a protein (dozens to hundreds), chances are high for any test sequence to have a zero-probability amino acid in some site. Most of the test sequences which are, in fact, globulines, would be rejected.

Thus, maximum likelihood estimates of pmf's for sites of proteins are unusable. A Bayesian approach is mandatory. The following lengthy derivations show how the general formula (93) for Bayesian

model estimation is worked out in the case of amino acid distribution estimation. The 20-dimensional pmf for the amino acid distribution at a given site is characterized by a 20-dimensional parameter vector $\theta = (P(X = \text{A}), \ldots, P(X = \text{Y}))'$ (actually 19 parameters would be enough — why?).

We start with the probability of observed data, given that the true distribution is θ. This corresponds to the pdf factor $p_{\bigotimes_i^N X}(D \mid \theta)$ in (93), but here we have discrete counting data and hence use a probability instead of a pdf. The observation data D consist in counts $n_1, \ldots, n_{20}$ of the different amino acids found at a given site in the training population. These count vectors are distributed according to the multinomial distribution

$$P(D \mid \theta) = \frac{N!}{n_1! \cdots n_{20}!} \prod_{j=1}^{20} \theta_j^{n_j}, \tag{94}$$

where $N = n_1 + \ldots + n_{20}$ is the total number of observations.

Next we turn to the Bayesian prior h(θ) in (93). The "subjective" background knowledge which is inserted into the picture is the general belief held by biologists that every amino acid can appear in any site in a protein of a given class, though maybe with small probability. The Bayesian prior which reflects the biologists' insight that zero probabilities should not occur in θ is expressed in a hyperdistribution over the space Θ of possible θ vectors. Θ is the subset of $\mathbb{R}^{20}$ which contains all positive vectors whose components sum to one:

$$\Theta = \{(\theta_1, \ldots, \theta_{20})' \in \mathbb{R}^{20} \mid \theta_j \in (0, 1) \text{ and } \sum_j \theta_j = 1\}.$$

This is a 19-dimensional hypervolume in $\mathbb{R}^{20}$ (try to imagine its shape — impossible for $\mathbb{R}^{20}$, but you can do it for embedding spaces $\mathbb{R}^2, \mathbb{R}^3, \mathbb{R}^4$). Θ is a continuous space, thus a Bayesian prior distribution on Θ can be represented by a pdf. But which distribution makes a reasonable prior? For reasons that will become clear later, one uses the Dirichlet distribution. The Dirichlet distribution is parametrized itself, with (in our case) a 20-dimensional parameter vector $\alpha = (\alpha_1, \ldots, \alpha_{20})'$ (where all $\alpha_i$ are positive reals). The Bayesian prior pdf $h(\theta \mid \alpha)$ of the Dirichlet distribution with parameters α is defined as

$$h(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_{j=1}^{20} \theta_j^{\alpha_j - 1}, \tag{95}$$

where $Z(\alpha) = \int_\Theta \prod_{j=1}^{20} \theta_j^{\alpha_j - 1}\, d\theta$ is the normalization constant which ensures that the integral of h over Θ is one. Finally, the normalization denominator p(D) in (93) need not be analyzed in more detail because it will later cancel out.

Now we have everything together to calculate the Bayesian posterior distribution on Θ:

$$\begin{aligned}
p(\theta \mid D, \alpha) &= \frac{P(D \mid \theta)\; h(\theta \mid \alpha)}{p(D)} \\
&= \frac{1}{p(D)}\, \frac{N!}{n_1! \cdots n_{20}!} \prod_{j=1}^{20} \theta_j^{n_j}\; \frac{1}{Z(\alpha)} \prod_{j=1}^{20} \theta_j^{\alpha_j - 1} \\
&= \frac{1}{p(D)}\, \frac{N!}{n_1! \cdots n_{20}!}\, \frac{1}{Z(\alpha)} \prod_{j=1}^{20} \theta_j^{n_j + \alpha_j - 1} \\
&= \frac{1}{p(D)}\, \frac{N!}{n_1! \cdots n_{20}!}\, \frac{Z(D + \alpha)}{Z(\alpha)}\; h(\theta \mid D + \alpha), \qquad (96)
\end{aligned}$$

where $D + \alpha = (n_1 + \alpha_1, \ldots, n_{20} + \alpha_{20})'$. Because both $p(\theta \mid D, \alpha)$ and $h(\theta \mid D + \alpha)$ are pdfs, the product of the first three factors in (96) must be one, hence the posterior distribution of models θ is

$$p(\theta \mid D, \alpha) = h(\theta \mid D + \alpha). \tag{97}$$

In order to get the posterior mean estimate, we integrate over the model candidates θ with the posterior distribution. I omit the derivation (it can be found in Durbin et al. (2000)) and only report the result:

$$\theta_j^{PME} = \int_\Theta \theta_j\; h(\theta \mid D + \alpha)\, d\theta = \frac{n_j + \alpha_j}{N + A}, \tag{98}$$

where $N = n_1 + \cdots + n_{20}$ and $A = \alpha_1 + \cdots + \alpha_{20}$. If one compares this to the maximum-likelihood estimates

$$\theta_j^{ML} = \frac{n_j}{N}, \tag{99}$$

we can see that the $\alpha_j$ parameters of the Dirichlet distribution can be understood as "pseudo-counts" that are added to the actually observed counts. These pseudo-counts reflect the subjective intuitions of the biologist, and there is no formal rule for how to set them correctly.

Adding the $\alpha_j$ pseudocounts in (98) can also be considered just as a regularization tool in a frequentist setting. One would skip the Bayesian computations that give rise to the Bayesian posterior (97) and directly use (98), optimizing α to navigate on the model flexibility scale in a cross-validation scheme. With α set to all zero, one gets the maximum-likelihood estimate (99), which will usually be overfitting; with large values for α one smoothes out the information contained in the data — underfitting. The textbook of Durbin et al., from which this example is taken, shares some thoughts on how a biosequence modeler should select the pseudo-counts. The proper investment of such soft knowledge makes all the difference in real-life machine learning problems.
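To see the pseudo-count formulas (98) and (99) in action, here is a small sketch for the column KKVNEGA, assuming NumPy; the concrete value of the pseudo-counts α is an arbitrary illustrative choice of mine, not a recommendation from Durbin et al.

```python
import numpy as np

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")       # the 20 standard amino acid one-letter codes
column = "KKVNEGA"                                # observed amino acids at one site, N = 7

counts = np.array([column.count(a) for a in amino_acids], dtype=float)
N = counts.sum()

alpha = np.full(20, 0.5)                          # pseudo-counts (illustrative choice)
A = alpha.sum()

theta_ml  = counts / N                            # Eq. (99): zero for the 14 unseen amino acids
theta_pme = (counts + alpha) / (N + A)            # Eq. (98): strictly positive everywhere

print(dict(zip(amino_acids, np.round(theta_pme, 3))))
```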

9 Sampling algorithms

Many tasks that arise in machine learning (and in mathematical modeling of complex systems in general) require one to "draw" random samples from a probability distribution. Algorithms which compute "random" samples from a distribution are needed, for instance (very incomplete listing),

• when noise is needed, e.g. for regularization of learning algorithms,

• when ensembles of models are computed for better accuracy and generalization (as in random forests),

• generally, when natural or social or economic systems are simulated,

• when certain stochastic neural network models are trained and exploited (Hopfield networks, Boltzmann machines).

Sampling algorithms are an essential "fuel" which powers many modern computational techniques. Entire branches of physics, chemistry, biology (and meteorology and economics and ...) could only grow after efficient sampling algorithms became available. Algorithms which produce (pseudo-) random numbers from a given distribution are not easy to design. Even something that looks as elementary as sampling from the uniform distribution — with pseudo-random number generating algorithms — becomes mathematically and algorithmically involved if one wants to do it well. In this section I describe a number of design principles for sampling algorithms that you should know about.

9.1 What is "sampling"?

"Sampling" means to artificially simulate the process of randomly taking measurements from a distribution given by a pmf or pdf. A "sampler" is an algorithm for doing this. Of samplers there are many, and I am not aware of a universally agreed definition. Here is my own definition (only for the ones of you with probability theory background; will not be asked in exams):

Definition 9.1 Let $P_X$ be a distribution on a measure space (E, B). A sequence $X_1, X_2, \ldots$ of random variables is a sampler for $P_X$, if for all A ∈ B

$$P_X(A) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_A \circ X_i \quad P\text{-almost surely},$$

where $\mathbb{1}_A$ is the indicator function for A.

The notion of a sample must be clearly distinguished from the notion of a sampler. A "sample" is more often than not implied to be an "i.i.d. sample", that is, it results from independent random variables $X_1, \ldots, X_N$ that all have the same distribution as X, the random variable whose distribution $P_X$ the sample is drawn from. This i.i.d. assumption is standardly made for real-world training data. In contrast, the random variables $X_1, X_2, \ldots$ of a "sampler" for $P_X$ need not individually have the same distribution as X, and they need not be independent. For instance, let $P_X$ be the uniform distribution on the unit interval E = [0, 1]. Here are some samplers:

• All $X_i$ are i.i.d., with each of them being uniformly distributed on [0, 1]. This is a dream of a sampler...

• The subset $X_{2i}$ of RVs with even indices are identically and uniformly distributed on [0, 1]. The RVs $X_{2i+1}$ repeat the value of $X_{2i}$. Here the RVs $X_i$ are identically but not independently distributed.

• The subset $X_{2i}$ of RVs with even indices are identically and uniformly distributed on [0, 1/2] and the subset of $X_{2i+1}$ are identically and uniformly distributed on [1/2, 1]. Here the RVs $X_i$ are independently but not identically distributed.

• $X_1$ always evaluates to 0, and the other $X_i$ are inductively and deterministically defined to give the value $x_i = x_{i-1} + \sqrt{2} \pmod 1$. This is a deterministic sequence.

• Let $Y_i$ be an i.i.d. sequence where each $Y_i$ draws a value $y_i$ from $\mathcal{N}(0, 1)$, and let ε be a small positive real number. $X_1$ always evaluates to 0, and the following $X_i$ are inductively and stochastically defined by $x_i = x_{i-1} + \varepsilon\, y_i \pmod 1$. This gives a random walk (more specifically, a Brownian motion process) whose paths are slowly and randomly migrating across [0, 1]. This kind of sampler will turn out to be the most important type when it comes to sampling from complex distributions — this is a Markov chain Monte Carlo (MCMC) sampler. (A small numerical check of two of these samplers is sketched below.)
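Here is the small numerical check announced above, for the deterministic and the random-walk samplers of the uniform distribution on [0, 1]; the test event A, the step size ε, and all names are my own illustrative choices (assuming NumPy).

```python
import numpy as np

N = 200_000
A = (0.2, 0.5)                             # test event; P_X(A) = 0.3 for the uniform distribution

# Deterministic sampler: x_i = x_{i-1} + sqrt(2) (mod 1)
x = np.cumsum(np.full(N, np.sqrt(2))) % 1.0
print(np.mean((x > A[0]) & (x < A[1])))    # close to 0.3

# Random-walk (MCMC-like) sampler: x_i = x_{i-1} + eps * y_i (mod 1), y_i ~ N(0, 1)
rng = np.random.default_rng(1)
x = np.cumsum(0.05 * rng.standard_normal(N)) % 1.0
print(np.mean((x > A[0]) & (x < A[1])))    # also close to 0.3, but it converges more slowly
```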

9.2 Sampling by transformation from the uniform distribution

There is one sampler that you know very well and have often used: the random function that comes by different names in different programming languages (and is internally implemented in many different ways). In Matlab it is called rand, which draws a random double-precision number from the continuous, uniform distribution over the interval [0, 1]; in C it's also called rand but here generates a random integer from the discrete uniform distribution between 0 and some maximal value.

At any rate, a pseudo-random number generator of almost uniformly distributed numbers over some interval is offered by every programming language that I know, including Microsoft Word. And that's about it; the only sampler you can directly call in most programming environments is just that, a uniform sampler. By the way, it is by no means easy to program a good pseudo-random number generator — in fact, designing such generators is an active field of research. If you are interested, the practical guide to using pseudorandom number generators by Jones (2010) is fun to read and very illuminating.

Assume you have a sampler $U_i$ for the uniform distribution on [0, 1], but you want to sample from another distribution $P_X$ on the measure space $E = \mathbb{R}$, which has a pdf f(x). Then you can use the sampler $U_i$ indirectly to sample from $P_X$ by a coordinate transformation, as follows.

First, compute the cumulative density function $\varphi: \mathbb{R} \to [0, 1]$, which is defined by $\varphi(x) = \int_{-\infty}^{x} f(u)\, du$, and its inverse $\varphi^{-1}$. The latter may be tricky or impossible to do analytically; then, numerical approximations must be used. Now obtain a sampler $X_i$ for $P_X$ from the uniform sampler $U_i$ by

Xi = ϕ⁻¹ ◦ Ui.

Here is a brief proof why Xi indeed samples from PX . Let A = [a, b] be an interval on the real line. We have to show that

PX(A) = lim_{N→∞} (1/N) Σ_{i=1}^{N} 1_A ◦ Xi = lim_{N→∞} (1/N) Σ_{i=1}^{N} 1_A ◦ ϕ⁻¹ ◦ Ui.

This follows from

lim_{N→∞} (1/N) Σ_{i=1}^{N} 1_A ◦ ϕ⁻¹ ◦ Ui = lim_{N→∞} (1/N) Σ_{i=1}^{N} 1_{ϕ(A)} ◦ Ui = PU(ϕ(A)) = PX(A).

Figure 47 illustrates this.

Figure 47: Sampling by transformation from the uniform distribution.
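To illustrate, here is a minimal Python sketch of sampling by transformation for a case where ϕ⁻¹ is available in closed form, namely the exponential distribution with pdf f(x) = exp(−x) on [0, ∞), for which ϕ(x) = 1 − exp(−x) and ϕ⁻¹(u) = −ln(1 − u). The function name and the use of numpy are my choices, not part of the lecture notes.

import numpy as np

def sample_exponential(n, seed=0):
    # inverse-transform sampling: X_i = phi^{-1}(U_i) with U_i uniform on [0, 1]
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n)    # the uniform sampler U_i
    return -np.log(1.0 - u)              # phi^{-1}(u) for the exponential pdf

xs = sample_exponential(100000)
print(xs.mean())   # should be close to 1, the mean of the exponential distribution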

The remainder of this subsection is optional reading and will not be tested in exams.

Out of curiosity, I explored a little bit how one can sample from the standard normal distribution N (0, 1). Computing it by transformation from the uniform distribution, as outlined above, appears not to be done, because the inverse cumulative density function ϕ⁻¹ for N (0, 1) can only be represented through a power series which converges slowly (my source, I must confess, is Wikipedia). Instead, special algorithms have been invented for sampling from N (0, 1), which exploit mathematical properties of this distribution. One algorithm (see http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_for_normal_random_variables) which I found very elegant is the Box-Muller algorithm, which produces a N (0, 1) distributed RV C from two independent uniform-[0,1]-distributed RVs A and B:

C = √(−2 ln A) cos(2 π B).
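A minimal Python sketch of the Box-Muller step (the function name and seed are my own choices):

import numpy as np

def box_muller(n, seed=0):
    # turn pairs of independent uniform-[0,1] RVs A, B into N(0,1) samples
    rng = np.random.default_rng(seed)
    a = 1.0 - rng.uniform(size=n)   # values in (0, 1], avoids log(0)
    b = rng.uniform(size=n)
    return np.sqrt(-2.0 * np.log(a)) * np.cos(2.0 * np.pi * b)

print(box_muller(100000).std())   # should be close to 1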

The computational cost is to compute a logarithm and a cosine. An even (much) faster, but more involved (and very tricky) method is the Ziggurat algorithm (check Wikipedia for “Ziggurat algorithm” for an article written with tender care). All in all, there exist wonderful things under the sun.

Sampling by coordinate transformation can be generalized to higher-dimensional distributions. Here is the case of a 2-dimensional pdf f(x1, x2), from which the general pattern should become clear. First define the cumulative density function in the first dimension as the cumulative density function of the marginal distribution of the first coordinate x1

ϕ1(x1) = ∫_{−∞}^{x1} ( ∫_{−∞}^{∞} f(u, v) dv ) du

and the conditional cumulative density function of x2 given x1

ϕ2(x2 | x1) = ( ∫_{−∞}^{x2} f(x1, v) dv ) / ( ∫_{−∞}^{∞} f(x1, v) dv ).

Then, for sampling one value (x1, x2), sample two values u1, u2 from the uniform sampler U, and transform

x1 = ϕ1⁻¹(u1),   x2 = ϕ2⁻¹(u2 | x1).

A widely used method for drawing a random vector x from the n-dimensional multivariate normal distribution with mean vector µ and covariance matrix Σ works as follows (a small code sketch follows after the three steps):

1. Compute the Cholesky decomposition (matrix square root) of Σ, that is, find the unique lower triangular matrix A such that AA′ = Σ.

2. Sample n numbers z1, . . . , zn from the standard normal distribution.

3. Output x = µ + A (z1, . . . , zn)′.
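A minimal Python sketch of this three-step procedure; the example mean vector and covariance matrix are invented for illustration.

import numpy as np

def sample_mvn(mu, Sigma, n, seed=0):
    # draw n samples from N(mu, Sigma) via the Cholesky factor A with A A' = Sigma
    rng = np.random.default_rng(seed)
    A = np.linalg.cholesky(Sigma)             # lower triangular
    Z = rng.standard_normal((n, len(mu)))     # rows of i.i.d. N(0,1) entries
    return mu + Z @ A.T                       # each row is mu + A z

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = sample_mvn(mu, Sigma, 100000)
print(np.cov(X, rowvar=False))   # should be close to Sigma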

9.3 Rejection sampling

Sampling by transformation from the uniform distribution can be difficult or impossible if the pdf f one wishes to sample from has no simple-to-compute cumulative density function, or whose cumulative density function has no simple-to-compute inverse. If this happens, it is sometimes possible to sample from a simpler distribution with pdf g that is related to the target pdf f and for which sampling by transformation works, and by a simple trick end up with a sampler for the pdf of interest, f. To understand this rejection sampling (also known as importance sampling), we first need to generalize the notion of a pdf:

Definition 9.2 A proto-pdf on R^n is any nonnegative function g0 : R^n → R^{≥0} with a finite integral ∫_{R^n} g0(x) dx.

If one divides a proto-pdf g0 by its integral, one obtains a pdf g. Now consider a pdf f on R^n that you want to sample from, but sampling by transformation doesn't work. However, you find a proto-pdf g0 ≥ f whose associated pdf g you know how to sample from, that is, you have a sampler for g. With that in your hands you can construct a sampler Xi for f as follows. In order to generate a random value x for Xi, carry out the following procedure:

Initialize: set found_x = 0

While found_x == 0 do

1. sample a candidate x̃ from g.

2. with probability f(x̃)/g0(x̃), accept x̃; otherwise, reject x̃. If x̃ is accepted,

   (a) set found_x = 1

   (b) return x̃

Rejection sampling becomes immediately clear if you re-conceptualize “sampling from a pdf f” as “piling up the pdf by many grains of sand that you let fall at the various positions x of R^n with a frequency proportional to f(x)”. Look at Figure 48 where g0 and f are plotted. Think of the g0 curve as a sand dune that you get when sampling from g in this sand metaphor. Now, when you drop a grain of sand for modelling the g0 “sand curve”, imagine you paint it orange with a probability of f(x̃)/g0(x̃) before you let it fall down. I think this saves you a mathematical proof.

The computational efficiency of rejection sampling clearly depends on how close g0 is to f. If the ratio f/g0 is on average small, there will be many rejections, which slow down the algorithm. In high-dimensional spaces it is often quite difficult to avoid this.

Figure 48: The principle of rejection sampling. Candidates x̃ are first sampled from g, then accepted (“painted orange”) with probability f(x̃)/g0(x̃).
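Here is a minimal Python sketch of the rejection sampling loop. As target f I use the triangular pdf f(x) = 2x on [0, 1], and as proto-pdf the constant g0(x) = 2, whose associated pdf g is the uniform distribution on [0, 1]; these choices are mine, for illustration only.

import numpy as np

def f(x):        # target pdf on [0, 1]
    return 2.0 * x

def g0(x):       # proto-pdf with g0 >= f; its normalized pdf g is uniform on [0, 1]
    return 2.0

def rejection_sample(rng):
    while True:
        x_cand = rng.uniform()                       # sample a candidate from g
        if rng.uniform() < f(x_cand) / g0(x_cand):   # accept ("paint orange")
            return x_cand

rng = np.random.default_rng(1)
xs = np.array([rejection_sample(rng) for _ in range(20000)])
print(xs.mean())   # should be close to 2/3, the mean of the triangular pdf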

9.4 Proto-distributions

The pdf's of parametrized continuous distributions over R^k are typically represented by a formula of the kind

p(x | θ) = ( 1 / ∫_{R^k} p0(x | θ) dx ) p0(x | θ),     (100)

where p0 : R^k → R^{≥0} defines the shape of the pdf and the factor 1/(∫_{R^k} p0(x | θ) dx) normalizes p0 to integrate to 1. For example, in the pdf

p(x | µ, Σ) = 1/((2π)^{k/2} det(Σ)^{1/2}) exp( −(1/2) (x − µ)′ Σ⁻¹ (x − µ) )

for the multidimensional normal distribution, the shape is given by p0(x | µ, Σ) = exp( −(1/2) (x − µ)′ Σ⁻¹ (x − µ) ) and the normalization prefactor 1/((2π)^{k/2} det(Σ)^{1/2}) is equal to 1/∫_{R^k} p0(x | µ, Σ) dx.

It is a quite general situation in machine learning that a complex model of a distribution is only known up to an undetermined scaling factor — that is, one has a proto-distribution but not a distribution proper. This is often good enough. For instance, in our initial TICS demo (Section 1.2.1), we saw that the TICS system generated a list of captions, ordered by plausibility. In order to produce this list, the TICS system did not need to command a proper probability distribution on the space of possible captions — a proto-distribution is good enough to determine the best, second-best, etc. caption.

Proto-distributions are not confined to continuous distributions on R^k, where they appear as proto-pdf's. In discrete sample spaces S = {s1, . . . , sN} or S = {s1, s2, . . .} where the distribution is given by a parametrized pmf p(θ), the normalization factor becomes a sum over S:

p(x | θ) = ( 1 / Σ_{s∈S} p0(s) ) p0(x),

where again p0 : S → R^{≥0} is a shape-giving function which must have a finite sum. When S is large and p0 is analytically intractable, the normalization factor

1/Σ_{s∈S} p0(s) cannot be computed and one is limited to working with the proto-pmf p0 only. Very large S appear standardly in systems which are modeled by many interacting random variables. We will meet such proto-pmf's in the next section on graphical models.

This is also a good moment to mention the softmax function. This is a ubiquitously used machine learning trick to transform any vector y = (y1, . . . , yn)′ ∈ R^n into an n-dimensional probability vector p by

p = ( 1 / Σ_{i=1,...,n} exp(α yi) ) (exp(α y1), . . . , exp(α yn))′.

The softmax is standardly used, for instance, to transform the output vector of a neural network into a probability vector, which is needed when one wants to interpret the network output probabilistically. The factor α ≥ 0 determines the entropy of the resulting probability vector. If α = 0, p is the uniform distribution on {1, . . . , n}, and in the limit α → ∞, p becomes the binary vector which is zero everywhere except at the position i where y is maximal. (A small numerical sketch that connects the softmax with the Boltzmann distributions from Note 4 below is given after these notes.)

Notes:

1. The word “proto-distribution” is not in general use and can be found in the literature only rarely.

2. Proto-distributions always occur in Bayesian modeling (compare Section 8.3, Equation 96). The Bayesian posterior distribution p(θ | D) is shaped by the product P (D | θ) h(θ), whose integral will not usually be one. The functions P (D | θ) h(θ) are proto-distributions on θ-space.

3. Many algorithms exist that allow sampling from proto-distributions, without the need to normalize them. Rejection sampling is a case in point, but the Gibbs and Metropolis samplers presented below also work with proto-pdf's. The case study in Section 9.6 crucially relies on sampling from a proto-pdf.

4. There exists a large and important class of probability distributions on generic sample spaces S which are defined by an energy function E : S → R^{≥0}, or more generally, by a cost function C : S → R. In energy-based models of probability distributions, such an energy or cost function is turned into a probability distribution on S by defining the Boltzmann distribution, which has the pdf

   p(s | T ) = ( 1 / ∫_S exp(−C(s)/T ) ds ) exp(−C(s)/T ),     (101)

   where T is a temperature parameter. In this context, the normalization factor 1 / ∫_S exp(−C(s)/T ) ds is often written as 1/Z(T ), and Z(T ) is called the partition

function of the distribution. The notation 1/Z(θ) and the name “partition function” are sometimes also used in a generalized fashion for other distributions given by a shape-defining proto-pdf. Boltzmann distributions are imported from statistical physics and thermodynamics, where they play a leading role. In machine learning, Boltzmann distributions occur, for instance, in Markov random fields, which are a generalization of Markov chains to spatial patterns and are used (among other applications) in image analysis. They are a core ingredient for Boltzmann machines (Ackley et al., 1985), a type of neural network which I consider one of the most elegant and fundamental models for general learning systems. A computationally simplified version, restricted Boltzmann machines, became the starting point for what today is known as deep learning (Hinton and Salakhutdinov, 2006). Furthermore, Boltzmann distributions are also instrumental for free-energy models of intelligent agents, a class of models of brain dynamics popularized by Karl Friston (for example, Friston (2005)) which explain the emergence of adaptive, hierarchical information representations in biological brains and artificial intelligent “agents”, and which today are a mainstream approach in the cognitive neurosciences and cognitive science to understanding adaptive cognition. And last but not least, the Boltzmann distribution is the root mechanism in simulated annealing (Kirkpatrick et al., 1983), a major general-purpose strategy for finding minimal-cost solutions in complex search spaces. Sadly, in this course there is no time for treating energy-based models of information processing. I will present this material in my course on neural networks in the Summer semester (lecture notes remain to be written).
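Here is the small numerical sketch announced above. On a finite sample space S the Boltzmann distribution is just a softmax of the negative costs with α = 1/T. The cost values are invented for illustration; Python with numpy is assumed.

import numpy as np

def boltzmann(costs, T):
    # turn a cost vector C(s) over a finite sample space S into probabilities
    # p(s | T) proportional to exp(-C(s)/T); the sum plays the role of Z(T)
    # (for very large costs one would subtract the minimum first to avoid underflow)
    e = np.exp(-np.asarray(costs) / T)
    return e / e.sum()

costs = np.array([0.0, 1.0, 3.0])
print(boltzmann(costs, T=1.0))     # low-cost states dominate
print(boltzmann(costs, T=100.0))   # high temperature: close to uniform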

9.5 MCMC sampling

We now turn to Markov Chain Monte Carlo (MCMC) sampling techniques. These samplers are the only choice when one wants to sample from complex distributions of which one only knows a proto-pdf g0. Inventing MCMC sampling techniques was a breakthrough in theoretical physics because it made it possible to simulate stochastic physical systems on computers. The original article Equation of state calculations by fast computing machines (Metropolis et al., 1953) marks a turning point in physics (Wikipedia has an article devoted to this paper). In later decades, MCMC techniques were adopted in machine learning, where today they are used in several places, in particular for working with graphical models, which constitute the high-end modeling approach for complex real-world systems that are described by a multitude of interacting RVs. We will get a glimpse of graphical models in the next session of this course. A long tutorial paper (essentially, the PhD thesis) of Radford Neal (1993) was instrumental for establishing MCMC methods for statistical inference applications. This tutorial paper established not only MCMC methods in machine learning, but also the professional reputation of

Radford Neal (http://www.cs.utoronto.ca/∼radford/). If you ever want to implement MCMC samplers in a serious way, this tutorial is mandatory reading. And if you want to inform yourself about the use of MCMC in deep learning (advanced stuff), a talk that Radford Neal recently delivered in honor of Geoffrey Hinton (Neal, 2019) gives you an orientation. And if you think of entering a spectacular academic career, it is not a bad idea to write a PhD thesis which “only” gives the first readable, coherent, comprehensive tutorial introduction to some already existing kind of mathematical modeling method which, until you write that tutorial, has been documented only in disparate technical papers written from different perspectives and using different notations, so that nobody could really understand or use it.

9.5.1 Markov chains

Before we can start with MCMC we need to refresh some basics of Markov chains.

Definition 9.3 A Markov chain is a sequence of random variables X1, X2, . . ., all taking values in the same sample space S, such that Xn+1 depends only on Xn and is conditionally independent of all earlier values:

P (Xn+1 = xn+1 | X1 = x1, . . . , Xn = xn) = P (Xn+1 = xn+1 | Xn = xn).     (102)

Intuitively, a Markov chain is a discrete-time stochastic process which generates sequences of measurements/observations, where the next observation only depends on the current one; earlier observations are “forgotten”. Markov chains are “stochastic processes without memory”. Do not confuse the notion of a Markov chain with the notion of a path (or realization, or trajectory) of a Markov chain. The Markov chain is the generative mechanism which can be “executed” or “run” as often as one wishes, and at every run one will obtain a different concrete sequence x1, x2, . . . of observations. The possible values x ∈ S are called the states of a Markov chain.

A Markov chain on S is fully characterized by the initial distribution PX1 and the conditional transition distributions

Pn(Xn+1 = x | Xn = y) for all x, y ∈ S,     (103)

for which we also write (following Neal's notation)

Tn(x | y). (104)

A common name for Tn(x | y) is transition kernel. If Tn(x | y) = Tn′(x | y) for all n, n′, the Markov chain's transition law does not itself change with time; we then call the Markov chain homogeneous. With homogeneous Markov chains, the index n can be dropped from Tn(x | y) and we write T (x | y).

In order to generate a path from a homogeneous Markov chain with initial distribution PX1 and transition kernel T (x | y), one carries out the following recursive procedure:

160 1. Generate the first value x1 by a random draw from PX1 .

2. If xn has been generated, generate xn+1 by a random draw from the distribution PXn+1|Xn=xn, which is specified by the transition kernel T (x | xn).

A note on notation: a mathematically correct and general definition and notation for transition kernels on arbitrary observation spaces S requires tools from measure theory which we haven't introduced. Consider the notation T (x | y) as a somewhat sloppy shorthand. When dealing with discrete distributions where S is finite, say, S = {s1, . . . , sk}, then consider T (x | y) as a k × k Markov transition matrix M where M(i, j) = P (Xn+1 = sj | Xn = si). The i-th row of M holds the vector of the probabilities by which the process will transit from state si to the states s1, . . . , sk indexing the columns. When dealing with continuous distributions of x (given y) which have a pdf, regard T (x | y) as denoting the conditional pdf pX|Y=y of x. Note that when we write T (x | y), we refer not to a single pdf but to a family of pdf's; for every y we have another conditional pdf T (x | y).

If a Markov chain with finite state set S = {s1, . . . , sk} and Markov transition matrix M is executed for m steps, the probabilities to transit from state si to state sj can be found in the m-step transition matrix M^m:

P (Xn+m = sj | Xn = si) = M^m(i, j),     (105)

where M^m = M · M · . . . · M (m times).

We now consider the sample space S = R^k, and assume that all distributions of interest are specified by pdf's. We consider a homogeneous Markov chain. Its transition kernel T (x | y) can be identified with the pdf pXn+1|Xn=y, and in the remainder of this section, we will write T (x | y) for this pdf. Such a Markov chain with continuous sample space R^k is specified by an initial distribution on R^k which we denote by its pdf g^(1). The pdf g^(n+1) of the distribution of the subsequent RV Xn+1 can be calculated from the pdf g^(n) of the preceding RV Xn by

g^(n+1)(x) = ∫_{R^k} T (x | y) g^(n)(y) dy.     (106)

Please make sure you understand this equation. It is the key to everything which follows. For the theory of MCMC sampling, a core concept is an invariant distribution of a homogeneous Markov chain.

Definition 9.4 Let g be the pdf of any distribution on R^k, and let T (x | y) be the (pdf of the) transition kernel of a homogeneous Markov chain with values in R^k. Then g is the pdf of an invariant distribution of T (x | y) if

g(x) = ∫_{R^k} T (x | y) g(y) dy.     (107)

Except for certain pathological cases, a transition kernel generically has at least one invariant distribution. Furthermore, it is often the case that there exists exactly one invariant distribution g of T (x | y), and the sequence of distributions g^(n) converges to g from any initial distribution. We will call the transition kernel T (x | y) ergodic if it has this property. The (unique) invariant distribution g of an ergodic Markov chain is also called its asymptotic distribution or its stationary distribution or its equilibrium distribution.
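A minimal Python sketch of the path-generation procedure for a finite state set, using a made-up 3-state transition matrix M; the last line also shows numerically how the rows of M^m approach the invariant distribution.

import numpy as np

def run_markov_chain(M, p_init, n_steps, seed=0):
    # generate a path x_1, ..., x_n of a homogeneous Markov chain; row i of M
    # holds the distribution of the next state given current state i
    rng = np.random.default_rng(seed)
    k = len(p_init)
    path = np.empty(n_steps, dtype=int)
    path[0] = rng.choice(k, p=p_init)               # draw x_1 from the initial distribution
    for n in range(1, n_steps):
        path[n] = rng.choice(k, p=M[path[n - 1]])   # draw x_{n+1} from T(. | x_n)
    return path

M = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])
path = run_markov_chain(M, p_init=[1.0, 0.0, 0.0], n_steps=20000)
print(np.bincount(path) / len(path))       # empirical state frequencies
print(np.linalg.matrix_power(M, 50)[0])    # a row of M^m, close to the invariant distribution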

9.5.2 Constructing MCMC samplers through detailed balance

Let g be the pdf of some (usually very complex) distribution on R^k. The goal is to design an MCMC sampler for g. The core idea is to find a transition kernel T (x | y) for a homogeneous, ergodic Markov chain which has g as its invariant distribution. Then, by virtue of certain deep mathematical theorems (suitable versions of ergodic theorems), which we will not endeavour to explain, the Markov chain will be a sampler for g.

There will be many different ways to design such a transition kernel. However, they will differ from each other with regards to computational cost. Quoting from Neal's survey, “The amount of computational effort required to produce a good Monte Carlo estimate using the states generated by a Markov chain will depend on three factors: first, the amount of computation required to simulate each transition; second, the time for the chain to converge to the equilibrium distribution, which gives the number of states that must be discarded from the beginning of the chain; third, the number of transitions needed to move from one state drawn from the equilibrium distribution to another state that is almost independent, which determines the number of states taken from the chain at equilibrium that are needed to produce an estimate of a given accuracy. The latter two factors are related...”

A brief explanation: The best possible sampler for g would be one where each new sampled value is independent from the previous ones — in other words, one would like to have an i.i.d. sampler. If observations xn obtained from a sampler depend on previous observations xn−d, there is redundant information in the sample path. But the paths (xn)n=1,2,... obtained from running an MCMC sampler will have more or less strong dependencies between subsequent values. The initial values will depend, in a decreasing degree, on the arbitrary initial value x1 and should be discarded. After this initial “washout” phase, one usually keeps only those values xn, xn+d, xn+2d, . . . whose distance d from each other is large enough to warrant that the dependency of xn+d on xn has washed out to a negligible amount.

One standard way to construct an MCMC transition kernel T (x | y) which leads to a Markov chain that has the target distribution g as its invariant distribution is to ensure that the Markov chain (Xn)n=1,2,... has the property of detailed balance with respect to g. Detailed balance connects X1, X2, . . . to g in a strong way. It says that if we pick some state x ∈ R^k with the probability given by g and

multiply its probability g(x) with the transition probability density T (y | x) — that is, we consider the probability density of transiting from x to y, weighted with the probability density of x — then this is the same as the reverse weighted transiting probability density from y to x:

Definition 9.5 Let g be a pdf on R^k and let T (y | x) be the pdf of a transition kernel of a homogeneous Markov chain on R^k. Then T (y | x) has the detailed balance property with respect to g if

∀ x, y ∈ R^k :  T (y | x) g(x) = T (x | y) g(y).     (108)

If T (x | y) has detailed balance with respect to g, then g is an invariant distribution of the Markov chain given by T (x | y), because

∫_{R^k} T (x | y) g(y) dy = ∫_{R^k} T (y | x) g(x) dy = g(x) ∫_{R^k} T (y | x) dy = g(x).

Detailed balance is sufficient but not necessary for a Markov chain to be a sampler for g. However, there are generally applicable, convenient and widely used recipes to design MCMCs based on detailed balance. I present two of them: the Gibbs sampler and the Metropolis algorithm.

9.5.3 The Gibbs sampler

Let g be a pdf on R^k. For i = 1, . . . , k and x ∈ R^k, where x = (x1, . . . , xk)′, let

gi(· | x) : R → R^{≥0},

gi(y | x) = g((x1, . . . , xi−1, y, xi+1, . . . , xk)′) / ∫_R g((x1, . . . , xi−1, z, xi+1, . . . , xk)′) dz

be the conditional density function of the coordinate i given the values of x on the other coordinates. Let g^(1) be the pdf of an initial distribution on R^k, and let an initial value x^(1) = (x1^(1), . . . , xk^(1))′ be drawn from g^(1). We define a Markov chain X1, X2, . . . through transition kernels as follows. The idea is to cycle through the k coordinates and at some time νk + i (0 ≤ ν, 1 ≤ i ≤ k) change the previous state x^(νk+i−1) = (x1^(νk+i−1), . . . , xk^(νk+i−1))′ only in the i-th coordinate, by sampling from gi. That is, at time νk + i we set

x^(νk+i) = (x1^(νk+i−1), . . . , xi−1^(νk+i−1), y, xi+1^(νk+i−1), . . . , xk^(νk+i−1))′

equal to the previous state except in coordinate i, where the value y is freshly sampled from gi. This method is known as the Gibbs sampler. It uses k different transition kernels T1, . . . , Tk, where Ti is employed at times νk + i and updates only the i-th coordinate.

163 This Markov chain is not homogeneous because we cycle through different transition kernels. However, we can condense a sequence of k successive updates into a single update that affects all coordinates by putting T = Tk ◦ · · · ◦ T1, which yields a homogeneous Markov chain (Yn)n=1,2,... with transition kernel T whose path is derived from a path (xn)n=1,2,... of the “cycling” Markov chain by

y^(1) = x^(1),  y^(2) = x^(k+1),  y^(3) = x^(2k+1), . . . .

It should be clear that g is an invariant distribution of this homogeneous Markov chain, because each transition kernel Ti leaves g invariant: only the i-th component of the observation vector is affected at all, and it is updated exactly according to the conditional distribution gi for this component. It even holds that T has detailed balance w.r.t. g (exercise — do it!).

It remains to ascertain that the Markov chain with kernel T is ergodic. This is certainly the case when the support of g in R^k (that is, the subset of R^k where g is positive) is k-dimensionally connected. This means that for any points x, y ∈ R^k with g(x) > 0, g(y) > 0 there exists a finite sequence of k-dimensional balls Bm of positive radius, the first containing x and the last containing y, such that Bm intersects with Bm+1 with positive volume. However, if the support of g is not k-dimensionally connected, then T may or may not be ergodic, and determining whether it is needs to be done on a case by case basis. For instance, if g is a distribution on R^2 whose support lies exclusively in the first and third orthant, T would not be ergodic, because the Gibbs sampler, when started from a point in the third orthant, would be unable to jump into the first orthant. This situation is depicted in Figure 49.

Figure 49: A bipartite pdf g where the Gibbs sampler would fail.

The Gibbs sampler is practically applicable only if one can easily sample from the 1-dimensional conditional distributions gi. Therefore, the Gibbs sampler is mostly employed in cases where these gi are parametric, analytical distributions, or in cases where S is finite and the gi thus become simple probability vectors (and would be represented by pmf's, not pdf's). The Gibbs sampler is attractive for its

simplicity. A number of extensions and refinements of the basic idea are presented in Neal (1993).
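As an illustration, here is a minimal Python sketch of a Gibbs sampler for a bivariate standard normal distribution with correlation ρ, where the two conditional distributions g1, g2 are the one-dimensional Gaussians x1 | x2 ∼ N(ρ x2, 1 − ρ²) and x2 | x1 ∼ N(ρ x1, 1 − ρ²). The concrete setting (zero means, unit variances, ρ = 0.8) is my choice for the example.

import numpy as np

def gibbs_2d_gaussian(n_samples, rho=0.8, seed=0):
    # cycle through the two coordinates, resampling each from its conditional
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho**2)
    x = np.zeros(2)
    out = np.empty((n_samples, 2))
    for n in range(n_samples):
        x[0] = rho * x[1] + s * rng.standard_normal()   # sample from g_1(. | x_2)
        x[1] = rho * x[0] + s * rng.standard_normal()   # sample from g_2(. | x_1)
        out[n] = x
    return out

X = gibbs_2d_gaussian(50000)
print(np.corrcoef(X.T)[0, 1])   # should be close to 0.8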

9.5.4 The Metropolis algorithm

The Metropolis algorithm is more sophisticated than the Gibbs sampler and works in a much larger number of cases. The drawback is that it may be computationally more expensive. This algorithm explicitly constructs a Markov transition kernel with detailed balance. Like the Gibbs algorithm, the Metropolis algorithm can be cyclically applied to the k dimensions of the observation vectors in turn (“local” Metropolis algorithm) or it can be applied to all dimensions simultaneously (“global” algorithm). I describe the local version.

Like the Gibbs sampler, the local Metropolis algorithm updates the sample vector x^(νk+i−1) = (x1^(νk+i−1), . . . , xk^(νk+i−1))′ only in the i-th coordinate, yielding an update

x^(νk+i) = (x1^(νk+i−1), . . . , xi−1^(νk+i−1), y, xi+1^(νk+i−1), . . . , xk^(νk+i−1))′

in two substeps, which together ensure detailed balance w.r.t. the conditional distribution gi:

Step 1: Randomly choose a candidate value y∗ for y. Usually a proposal distribution Si(y∗ | x^(νk+i−1)) is used (which may depend on x^(νk+i−1)) which is symmetric in the sense that

Si(y | (x1^(νk+i−1), . . . , xi−1^(νk+i−1), y′, xi+1^(νk+i−1), . . . , xk^(νk+i−1))′) = Si(y′ | (x1^(νk+i−1), . . . , xi−1^(νk+i−1), y, xi+1^(νk+i−1), . . . , xk^(νk+i−1))′)

for all y, y′, x.

Step 2: Randomly accept or reject y∗ as the new value for xi^(νk+i), in a fashion that ensures detailed balance. In case of acceptance, xi^(νk+i) is set to y∗; in case of rejection, xi^(νk+i) is set to xi^(νk+i−1), i.e., the observation vector is not changed at all and is identically repeated in the sampling sequence. The probability for accepting y∗ is determined by an acceptance probability Ai(y∗ | x).

Note that Si(y∗ | x^(νk+i−1)) is a probability distribution on the xi coordinate space R, while Ai(y∗ | x) is a number (a probability value). The proposal distribution and the acceptance probability must be designed to warrant detailed balance, that is, to ensure

∀ x = (x1, . . . , xk)′ ∈ R^k, ∀ y, y′ ∈ R :

gi(y | x) Ti(y′ | (x1, . . . , xi−1, y, xi+1, . . . , xk)′) = gi(y′ | x) Ti(y | (x1, . . . , xi−1, y′, xi+1, . . . , xk)′).     (109)

There are several ways to ensure this. For instance, it is not difficult to see (exercise!) that if Si(y∗ | x^(νk+i−1)) is symmetric, using

Ai(y∗ | (x1, . . . , xk)′) = gi(y∗ | (x1, . . . , xk)′) / ( gi(y∗ | (x1, . . . , xk)′) + gi(xi | (x1, . . . , xk)′) )     (110)

yields detailed balance. This is called the Boltzmann acceptance function. However, the Metropolis algorithm standardly uses another acceptance distribution, namely the Metropolis acceptance distribution

Ai(y∗ | (x1, . . . , xk)′) = min ( 1, gi(y∗ | (x1, . . . , xk)′) / gi(xi | (x1, . . . , xk)′) ).     (111)

That is, whenever gi(y∗ | (x1, . . . , xk)′) ≥ gi(xi | (x1, . . . , xk)′) — the proposed observation in component i has no lower probability than the current one — accept with certainty; else accept with probability gi(y∗ | (x1, . . . , xk)′)/gi(xi | (x1, . . . , xk)′). For symmetric proposal distributions, this strategy implies detailed balance. It is clear that in the rejection case, where the current observation is not changed, the detailed balance condition trivially holds. In the acceptance case we verify (108) as follows:

gi(y∗ | (x1, . . . , xk)′) Ti(xi | (x1, . . . , y∗, . . . , xk)′) =
= gi(y∗ | x) Si(xi | (x1, . . . , y∗, . . . , xk)′) Ai(xi | x)
= Si(xi | (x1, . . . , y∗, . . . , xk)′) min (gi(y∗ | x), gi(xi | x))
= Si(y∗ | (x1, . . . , xi, . . . , xk)′) min (gi(y∗ | x), gi(xi | x))
= gi(xi | x) Si(y∗ | (x1, . . . , xi, . . . , xk)′) Ai(y∗ | x)
= gi(xi | x) Ti(y∗ | (x1, . . . , xi, . . . , xk)′).

(A minimal code sketch of a Metropolis sampler is given after the notes below.)

Notes:

1. Both the Boltzmann and the Metropolis acceptance functions also work with proto-pdf's gi. This is an invaluable asset because often it is infeasible to compute the normalization factor 1/Z needed to turn a proto-pdf into a pdf.

2. Like the Gibbs sampler, this local Metropolis algorithm can be turned into a global one, then having a homogeneous Markov chain with transition kernel T , by condensing one k-cycle through the component updates into a single observation update.

3. Detailed balance guarantees that g is an invariant distribution of the homogeneous Markov chain with transition kernel T . Before the resulting Metropolis sampler Y1, Y2, . . . can be used for sampling, it must in addition be shown that this Markov chain is ergodic. The same caveats that we met with the Gibbs sampler apply.

4. The Metropolis algorithm is more widely applicable than the Gibbs sampler because it obviates the need to sample from conditional distributions. The price one has to pay is a higher computational cost, because the rejection events lead to duplicate sample points, which obviously leads to an undesirable “repetition redundancy” in the sample that has to be compensated by a larger sample size.

5. In statistical physics, the term “Monte Carlo Simulation” is almost synonymous with this Metropolis algorithm (says Neal — I am no physicist and couldn't check).

6. The quality of Metropolis sampling depends very much on the proposal distribution used. Specifically, the variance of the proposal distribution should neither be too small (then exploration of new states is confined to a narrow neighborhood of the current state, implying that the Markov chain traverses the distribution very slowly) nor too large (then one will often be propelled far out into the regions of g where it almost vanishes, leading to numerous rejection events). A standard choice is a normal distribution centered on the current state.

7. The Metropolis algorithm (and the Gibbs sampler, too) works best if the k state components are statistically maximally independent. Then the state space exploration in each dimension is independent from the exploration in the other dimensions. If, in contrast, the components are statistically coupled, fast exploration in one dimension is hindered by the fact that moving in this dimension entails a synchronized moving in the other dimensions; larger changes in one dimension then have to wait for the correlated changes in the other dimensions. Thus, if possible, one should coordinate-transform the distribution before sampling, with the aim to decorrelate the dimensions (which in principle does not imply making them independent, but in practice is the best one can do).

8. I presented the Metropolis algorithm in the special case of sampling from a distribution over (a subset of) R^k. This made the “cycle through dimensions” scheme natural, which we also described for the Gibbs sampler. But the Metropolis algorithm also works on sample spaces S which do not have the vector space structure of R^k; proposal distributions then have to be designed as the individual case requires. The application study in the next subsection is an example.
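Here is the minimal code sketch announced above: a global random-walk Metropolis sampler in Python that only needs a proto-pdf g0. The target (a mixture of two Gaussian bumps), the proposal standard deviation, and the burn-in and subsampling settings are invented for illustration.

import numpy as np

def g0(x):
    # unnormalized target: two Gaussian bumps (normalization constant unknown and never needed)
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

def metropolis(n_samples, proposal_std=1.0, burn_in=1000, thin=5, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    kept = []
    for n in range(burn_in + n_samples * thin):
        x_cand = x + proposal_std * rng.standard_normal()   # symmetric proposal
        if rng.uniform() < min(1.0, g0(x_cand) / g0(x)):    # Metropolis acceptance
            x = x_cand                                      # else keep (repeat) the old x
        if n >= burn_in and (n - burn_in) % thin == 0:      # washout, then subsample
            kept.append(x)
    return np.array(kept)

xs = metropolis(20000)
print(xs.mean())   # should be close to 2/3, the mean of this two-bump mixture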

9.6 Application example: determining evolutionary trees

This section is a condensed version of a paper by Mau et al. (1999), Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods. It is a beautiful demonstration of how various modelling techniques are combined to characterize a

highly complex pdf (namely, the conditional distribution of all possible evolutionary pasts, given DNA data from living species), and how the Metropolis algorithm is used to sample from it. The MCMC-based method to determine evolutionary trees seems to have become a standard method in the field (I am not an expert to judge). A short paper (Huelsenbeck and Ronquist, 2001) announcing a software toolbox for this method has been cited more than 20,000 times on Google Scholar. Methods to reconstruct phylogenetic trees from DNA obtained in surviving species are needed, in particular, for reconstructing trees of descent for man; such studies often make it into the public press.

This application example is instructive in several respects. First, it uses Metropolis sampling on a sample space that is very different from R^k — the sample space will be a space of trees whose nodes are labelled with discrete symbols and whose edges are labelled with real numbers. This will highlight the flexibility and generality of Metropolis sampling. Second, the distribution from which will be sampled is a Bayesian posterior distribution — this application example also serves as a demonstration of the usefulness (in this case, even necessity) of Bayesian modeling. And third, the distribution from which will be sampled is represented in a format that is not normalized (the normalization factor 1/Z is unknown).

Data. DNA strings from l related living species, each string of length N, have been aligned without gaps. In the reported paper, l = 32 African fish species (cichlids from central African lakes, except one cichlid species from America that served as a control) were represented by DNA sequences of length 1044. 567 of these 1044 sites were identical across all considered species and thus carried no information about phylogeny. The remaining N = 477 sites represented the data D that was entered into the analysis. Reminder: the DNA symbol alphabet is Σ = {A, C, G, T}.

Task. Infer from data D the most likely phylogenetic tree, assuming that the considered living species have a common ancestor from which they all descended.

Modeling assumptions. Mutations act on all sites independently. Mutations occur randomly according to a “molecular clock”, i.e. a probability distribution of the form

P^clock(y | x, t, θ)     (112)

specifying the probability that the symbol y ∈ {A, C, G, T} occurs at a given site where t years earlier the symbol x occurred. θ is a set of parameters specifying further modelling assumptions about the clock mechanism. Mau et al. used the molecular clock model proposed in Hasegawa et al. (1985), which uses two parameters θ = (φ, κ), the first quantifying an overall rate of mutation, and the second a difference of rates between the more frequent mutations that leave the type of nucleic acid (purine or pyrimidine) unchanged (“transitions”) vs. change it (“transversions”). All that we need to know here, not pretending to be biologists, is that (112) can be efficiently computed. Note that θ is not assumed to be known beforehand but has to be estimated/optimized in the modelling process, based on the data D.

Representing phylogenetic trees. A phylogenetic tree is a binary tree Ψ. The nodes represent taxa (species); leaves are living taxa, internal nodes are extinct taxa, the root node is the assumed common ancestor. Mau et al. plot their trees bottom-up, root node at the bottom. Vertical distances between nodes metrically represent the evolutionary timespans t between nodes. Clades are subsets of the leaves that descend from a shared internal node. Figure 50 shows a schematic phylogenetic tree and some clades.

Figure 50: An exemplary phylogenetic tree (from the Mau et al. paper). {4, 7}, {1, 4, 7}, {2, 3, 6} are examples of clades in this tree.

A given evolutionary history can be represented by trees in 2^(n−1) different but equivalent ways (where n is the number of living species), through permuting the two child branches of an internal node. For computational purposes a more convenient representation than a tree graph is given by 1. a specification of a left-to-right order σ of leaves (in Figure 50, σ = (1, 4, 7, 2, 3, 6, 5)), plus 2. a specification a of the graph distances between two successive leaves. The graph distance is the length of the connecting path between the two leaves. In the example tree from Figure 50, the distances between leaves make the distance vector

a = 2 (t1 + t2 + t3 + t4, t1, t1 + t2 + t3 + t4 + t5, t1 + t2 + t3, t1 + t2, t1 + t2 + t3 + t4 + t5 + t6).

A pair (σ, a) is a compact representation of a phylogenetic tree.

Likelihood of a phylogenetic tree. One subtask that must be solved in order to find the most probable evolutionary tree is to devise a fast method for computing the likelihood of a particular tree Ψ and molecular clock parameters θ, given the sequence data D:

L(Ψ, θ) = P (D | Ψ, θ).     (113)

Because the mutation processes at different sites are assumed to be independent, P (D | Ψ, θ) splits into a product of single-site probabilities P (Di | Ψ, θ), where Di are the symbols found in the sequences from D at site i. Thus Di is a symbol vector of length l. Therefore, we only need to find a way to compute

Li(Ψ, θ) = P (Di | Ψ, θ). (114)

We approach this task sideways, assuming first that we know the symbols yν at the i-th site of the internal nodes ν of Ψ. Let ϱ be the root node and π0 a reasonable distribution of symbols in ϱ at site i (for instance the global distribution of all symbols in all sites of all sequences in D). Then we get the probability of the observed data Di joined with the hypothetical data of these (yν)ν∈I (where I is the set of internal nodes of Ψ) by

P (Di, (yν)ν∈I | Ψ, θ) = π0(yϱ) Π_{ν is a non-root node of Ψ} P^clock(yν | y_par(ν), tν, θ),     (115)

where par(ν) is the parent node of ν and tν is the timespan between par(ν) and ν. From (115) we could obtain (114) by summing over all possible assignments of symbols to internal nodes, which is clearly infeasible. Fortunately there is a cheap recursive way to obtain (114), which works from the leaves towards the root, inductively assigning conditional likelihoods Lν(y) = P (Di ↾ ν | Ψ, θ, node ν = y) to nodes ν, where y ∈ Σ and Di ↾ ν is the subset of the Di that sits at the leaves below node ν, as follows:

Case 1: ν ∉ I :  Lν(y) = 1 if y = yν, and Lν(y) = 0 else.

Case 2: ν ∈ I :  Lν(y) = ( Σ_{z∈Σ} Lλ(z) P^clock(z | y, tλ, θ) ) ( Σ_{z∈Σ} Lµ(z) P^clock(z | y, tµ, θ) ),

where λ, µ are the two children of ν, tλ is the timespan from ν to λ, and tµ is the timespan from ν to µ. Then (114) is obtained from

Li(Ψ, θ) = Σ_{z∈Σ} π0(z) Lϱ(z),

from which (113) is obtained by

L(Ψ, θ) = Π_{i is a site in D} Li(Ψ, θ).

O(N |Σ| l) flops are needed to compute L(Ψ, θ); in our example, N = 477, |Σ| = 4, l = 32.

The posterior distribution of trees and mutation parameters. We are actually not interested in the likelihoods L(Ψ, θ) but rather in the distribution

of Ψ, θ (a Bayesian hyperdistribution!) given D. Bayes' theorem informs us that this desired distribution is proportional to the likelihood times the prior (hyper-)distribution P (Ψ, θ) of Ψ, θ:

P (Ψ, θ | D) ∝ P (D | Ψ, θ) P (Ψ, θ) = L(Ψ, θ) P (Ψ, θ).

Lacking a profound theoretical insight, Mau et al. assume for P (Ψ, θ) a very simple, uniform-like distribution (such uninformedness is perfectly compatible with the Bayesian approach!). Specifically:

1. They bound the total height of trees by some arbitrary maximum value, that is, all trees Ψ with a greater height are assigned P (Ψ, θ) = 0.

2. All trees of lesser height are assigned the same probability. Note that this does not imply that all topologies are assigned equal prior probabilities. Figure 51 shows two topologies, where the one shown in a. will get a prior twice as large as the one shown in b. Reason: the two internal nodes of the first topology can be shifted up- and downwards independently, whereas this is not the case in tree b.; thus there are twice as many trees of topology a. as of b. Note that for evolutionary biologists it is the tree topology (which species derives from which species) rather than the tree metrics (how long took what) which is of prime interest!

3. The mutation parameters θ are assigned a prior distribution that is uniform on a range interval chosen generously large to make sure that all biologically plausible possibilities are contained in it.

Figure 51: Two tree topologies that get different prior probabilities. Topologies are defined by the parenthesis patterns needed to describe a tree. For instance, the tree a. would be characterized by a pattern ((x x)(x x)) and the tree b. by ((x (x x)) x).

The stage is now set. After all these preparations, for every Ψ, θ we can compute a pdf g (up to an unknown scaling factor) for P (Ψ, θ | D) with a small computational effort. (Of course, it will rather be the log of g which is actually computed, avoiding underflow problems.) What is the structure and dimensionality of the (hyperdistribution) sample space S in which Ψ, θ lie? Remember that a tree Ψ can be specified by (σ, a),

where σ is a permutation vector of (1, . . . , l) and a is a numerical vector of length l − 1. Noticing that there are l! permutations, a pair (σ, a) reflects a point in a product space {1, . . . , l!} × R^(l−1); together with the two real-valued parameters comprised in θ this brings us to a space {1, . . . , l!} × R^(l+1). However, there may be some constraints within the parameters of a that reduce the effective degrees of freedom to a number smaller than l − 1. This is hard to analyze (Mau et al. don't do it), and if the degrees of freedom are less than l − 1, we would not necessarily find a useful lower-dimensional representation. Be this as it may, all in all the support domain of our pdf for P (Ψ, θ | D) has a dimension in the order of l — in our example, 32 — and a heterogeneous, complicated structure.

The target question that biologists want to get answered: what tree topology is the most probable, given DNA sequences D of living species? The strategy to find out about the answer is brutally simple: sample a large number of trees Ψi from g, sort these sampled trees into sets defined by tree topology, then interpret the relative sizes of these sets as probability estimates.

The Metropolis algorithm at work. Mau et al. use the Metropolis algorithm (in a global version) to sample trees from g. A crucial design task is to find a good proposal distribution S((Ψ∗, θ∗) | (Ψ, θ)). It should lead from any plausible (Ψ, θ) [where g((Ψ, θ)) is not very small] to another plausible (Ψ∗, θ∗) which should however be as distinct from (Ψ, θ) as possible. The way Mau et al. go about this task is one of the core contributions of their work.

The authors alternate between updating only θ and only Ψ. Proposing θ∗ from θ is done in a straightforward way: the new ∗-parameters are randomly drawn from a rectangular distribution centered on the current settings θ. The tricky part is to propose an as different as possible, yet “plausibility-preserving”, new tree Ψ∗ from Ψ. Mau et al. transform Ψ = (σ, a) into Ψ∗ = (σ∗, a∗) in two steps:

1. The current tree Ψ is transformed into one of its 2^(n−1) equivalent topological versions by randomly reversing, with probability 0.5, each of its internal branches, getting Ψ′ = (σ′, a′).

2. In Ψ′ the evolutionary inter-species time spans t are varied by changing the old values by a random increment drawn from the uniform distribution over [−δ, δ], where δ is a fixed bound (see Figure 52). This gives Ψ∗ = (σ′, a∗).

Mau et al. show that this method yields a symmetric proposal distribution, and that every tree Ψ′ can be reached from every other tree Ψ in a bounded number of such transformations; the Markov chain is thus ergodic. Concretely, the Metropolis algorithm was run for 1,100,000 steps (20 hours CPU time on a 1998 Pentium 200 PC), the first 100,000 steps were discarded to wash out possible distortions resulting from the arbitrary starting tree, and the remaining 1,000,000 trees were subsampled by 200 (reflecting the finding that after every 200 steps, trees were empirically independent [zero empirical cross-correlation at 200 step distance]), resulting in a final tree sample of size 5000.

Figure 52: Proposal candidate trees, attainable from the current tree, are found within timeshift intervals of size 2δ, centered at the current internal nodes, that constrain the repositioning of the internal nodes. Note that if the two rightmost internal nodes are shifted such that their relative heights become reversed (dashed blue circles), the topology of the tree would change (dashed blue lines). Figure adapted from the Mau et al. paper.

Final findings. The 600 (!) most frequent topologies account for 90% of the total probability mass. This high variability, however, is almost entirely due to minute variations within 6 clades (labelled A, . . . , F) that are stable across different topologies. Figure 53 shows the two most frequently found topologies resolved at the clade level, with a very strong posterior probability indication for the first of the two. For quality control, this was repeated 10 times, with no change of the final outcome, which makes the authors confident of their work's statistical reliability.

Figure 53: The two clade tree structures of highest posterior probability found by Mau et al.

10 Graphical models

My favourite motivational example for introducing graphical models is the Space Shuttle. In the launch phase, the engineering staff in the control center and the astronauts on board have to make many critical decisions under severe time pressure. If anything on board the roaring Space Shuttle seems to go wrong, it must be immediately determined, for instance, whether to shut down an engine (means ditching the Space Shuttle in the ocean but might save the astronauts) or not (might mean explosion of the engine and death of astronauts — or it might be just the right thing to do if it's a case of sensor malfunction and in fact all is fine with the engine). The functioning of the Space Shuttle is monitored with many dozens of sensors (more prone to malfunction than you would believe). They are delivering a massive flow of information which is impossible for human operators to evaluate for fast decision-making. So-called decision support systems are needed which automatically calculate probabilities for the actual system state from sensor readings and display to the human operators a condensed, human-understandable view of its most decision-critical findings. Read the original paper Horvitz and Barry (1995) on the probabilistic modeling of the combined SpaceShuttle-Pilots-GroundControlStaff system if you want to see how thrilling and potentially life-saving statistical modeling can be.

Such decision support systems are designed around hidden and visible random variables. The visible RVs represent observable variables, for instance the pressure or temperature readings from Space Shuttle engine sensors, but also actions taken by the pilot. The hidden RVs represent system state variables which cannot be directly measured but which are important for decision making, for instance a binary RV “Sensor 32 is functioning properly — yes / no”, or “pilot-in-command is aware of excess temperature reading in engine 3 — yes / no”. There are many causal chains between the hidden and visible RVs which lead to chains of conditional probability assessments, for instance “If sensor reading 21 is 'normal' and sensor reading 122 is 'normal' and sensor reading 32 is 'excess temperature', the probability that sensor 32 is misreading is 0.6”. Such causal dependencies between RVs

are mathematically represented by arrows between the participating RVs, which in total gives a directed graph whose nodes are the RVs. In such a graph, each RV X has its own sample space SX which contains the possible values that X may take. When the graph is used (for instance, for decision support), for each RV a probability distribution PX on SX is computed. If X1, . . . , Xk are RVs with arrows fanning in on a RV Y , the probability distribution PY on SY is calculated as a conditional distribution which depends on the distributions PX1, . . . , PXk. These calculations quickly become expensive when the graph is large and richly connected (exact statistical inference in such graphs is NP-hard) and require approximate solution strategies. Despite the substantial complexity of algorithms in the field, such graphical models are today widely used.

Many special sorts of graphical models have traditional special names and have been investigated in a diversity of application contexts long before today's general, unifying theory of graphical models was developed. Such special kinds of graphical models or research traditions include

• Hidden Markov models, a basic model of stochastic processes with memory, for decades the standard machine learning model for speech and handwriting recognition and today still the reference model for DNA and protein sequences,

• models for diagnostic reasoning where a possible disease is diagnosed from symptoms and medical lab results (also in fault diagnostics in technical systems);

• decision support systems, not only in Space Shuttle launching but also in economics or warfare, to name only two;

• human-machine interfaces where the machine attempts to infer the user's state of mind from the user's actions — the (in-)famous and now abolished Microsoft Office “paperclip” online helper was based on a Bayesian network;

• Markov random fields which model the interactions between local phenomena in spatially extended systems, for instance the pixels in an image;

• Boltzmann machines, a neural network model of hierarchical information processing.

In this section I will give an introduction to the general theory of graphical models and give a special treatment of Markov random fields and hidden Markov models. This will be a long section. Graphical models are a heavyweight branch of machine learning and would best be presented in an entire course of their own. The course Probabilistic Graphical Models of Stefano Ermon at Stanford University is a (beautifully crafted) example. The course homepage https://cs228.stanford.edu/ gives links to literature and programming toolboxes — and it serves you a

transparently written set of lecture notes on https://ermongroup.github.io/cs228-notes/ which culminates in explaining deep variational autoencoders, a powerful recent deep learning method which includes a number of techniques from graphical models.

10.1 Bayesian networks

Bayesian networks (BNs) are graphical models where the graph structure is a finite, directed, acyclic (that is, the arrows never form cycles) graph. Today's general theory of graphical models emerged from research on Bayesian networks, and many themes of graphical models can be best introduced with Bayesian networks.

The classical application domain for BNs is decision support, user modeling, and diagnostic systems. BN algorithms and fast computers have become indispensable for decision support because humans are simply bad at correctly combining pieces of evidence in uncertain systems. They tend to make gross, systematic prediction errors even in simple diagnostic tasks involving only two observables — which is why in good universities, future doctors must take courses in diagnostic reasoning —, let alone be capable of dealing with complex systems involving dozens of random variables. In my condensed treatment I lean heavily on the two tutorial texts by Pearl and Russell (2003) and Murphy (2001).

For a first impression, we consider a classical example. Let X1, . . . , X5 be five discrete random variables, indicating the following observations:

X1 : indicates the season of the year, has values in SX1 = {Winter, Spring, Summer, Fall},

X2 : indicates whether it rains, with values {0, 1},

X3 : indicates whether the lawn sprinkler is on, has values {0, 1},

X4 : indicates whether the pavement (close to the lawn) is wet, values {0, 1},

X5 : indicates whether the pavement is slippery, values from {0, 1}, too.

There are certain causal influences between some of these random variables. For instance, the season co-determines the probabilities for rain; the sprinkler state co-determines whether the pavement is wet (but one would not say that the wetness of the pavement has a causal influence on the sprinkler state), etc. Such influences can be expressed by arranging the Xi in a directed, acyclic graph (a DAG), such that each random variable becomes a node, with an edge (i, j) denoting that the fact measured by Xi has some causal influence on the fact measured by Xj. Of course there will be some subjective judgement involved in claiming a causal influence between two observables, and denying it for other pairs; such dependency graphs are not objectively “true”, they are designed to represent

Figure 54: A simple Bayesian network. Image taken from Pearl and Russell (2003).

Definition 10.1 A directed acyclic graph with nodes labelled by RVs {X1,...,Xn} (each with a seprate sample space Si) is a Bayesian network (BN) for the joint distribution PX of X = X1 ⊗ ... ⊗ Xn if every Xi is conditionally independent of its non-descendants in the graph given its parents.

For convenience of notation we often identify the nodes Xi of a BN with their indices i. The descendants of a node i in a DAG G are all nodes j that can be reached from i on a forward path in G. Descendance is a transitive relation: if j is a descendant of i and k a descendant of j, then k is a descendant of i. The nondescendants of i are all j that are not descendants of i. The parents of a node i are all the direct graph predecessors of i, that is, all j such that (j, i) is an edge in the graph. For instance, in the DAG shown in Figure 54, X3 has parent X1 and descen- dants X4,X5, and the condition stated in the definition requires that P (X3|X1) is independent of X2, that is P (X3|X1) = P (X3|X1,X2): our judgement whether the sprinkler is on or off is not influenced by our knowledge whether it rains or not if we know what season it is. Personal note: I always found this a weird old-times Californian way of behaving: those guys would set the sprinkler on or off just depending on the season; I guess in Summer all Californian sprinklers sprinkled all Summer, come rain come shine (in the year 2000 when this tutorial text was written). The independence relations expressed in a BN in terms of parents and non- descendants need not be the only independence relations that are actually true in the joint distribution. In our example, for instance, it may be the case that

177 the pavement is always slippery because it was made from polished marble. Then X5 would be unconditionally independent from any of the other variables. The complete independence relations between the variables figuring in a BN depend on the particulars of the joint distribution and need not all be represented in the graph structure. A BN is only a partial model of the independence relations that may be present in the concerned variables. Let’s squeeze in a short note on notation here. By P (X4 | X2,X3) we de- note the conditional distribution of X4 given X2 and X3. In mathematical texts one would denote this distribution by PX4 | X2,X3 but I will follow the notation in the Pearl/Russell tutorial in this section. For discrete random variables, as in our example, such conditional distributions can be specified by a table — for P (X4 | X2,X3) such a table might look like

X2   X3   P (X4 = 0)   P (X4 = 1)
 0    0      1.0          0.0
 0    1      0.1          0.9
 1    0      0.1          0.9
 1    1      0.01         0.99
                                        (116)

For continuous-valued random variables, such conditional distributions cannot in general be specified in a closed form (one would have to specify pdf’s for each possible combination of values of the conditioning variables), except in certain special cases, notably Gaussian distributions. In contrast, I use lowercase P (x4 | x2, x3) as a shorthand for P (X4 = x4 | X2 = x2,X3 = x3). This denotes a single probability number for particular values of our random variables. End of the note on notation. A Bayesian network can be used for reasoning about uncertain causes and consequences in many ways. Here are three kinds of arguments that are frequently made, and for which BNs offer algorithmic support:

Prediction. “If the sprinkler is on, the pavement is wet with a probability P (X4 = 1|X3 = 1) = ###”: reasoning from causes to effects, along the arrows of the BN in forward direction. Also called forward reasoning in AI contexts.

Abduction. “If the pavement is wet, it is more probable that the season is spring than that it is summer, by a factor of ### percent”: reasoning from effects to causes, that is, diagnostic reasoning, backwards along the network links. (By the way, for backward reasoning you need Bayes’ formula, which is what gave Bayesian networks their name.)

Explaining away, and the rest. “If the pavement is wet and we don’t know whether the sprinkler is on, and then observe that it is raining, the probability of the sprinkler being on, too, drops by ### percent”: reasoning “sideways”,

of which there are many variants. In “explaining away” there are several possible causes C1,...,Ck for some observed effect E, and when we learn that actually cause Ci holds true, then the probabilities drop that the other causes are likewise true in the current situation.

Bayesian networks offer inference algorithms to carry out such arguments and compute the correct probabilities, ratios of probabilities, etc. These inference algorithms are not trivial at all, and Bayesian networks have only begun to make their way into applications since efficient inference algorithms were discovered in the mid-1980s. I will explain a classical algorithm in this section (a join tree algorithm for exact inference); it is still widely used and efficient in BNs where the connectivity is not too dense. Because exact statistical inference in BNs is NP-hard, one has to resort to approximate algorithms in many cases. There are two main families of such algorithms, one based on sampling and the other on variational approximation. I will give a hint on inference by sampling and omit variational inference.

In a sense, the most natural (the only?) thing you can do when it comes to handling the interaction between many random variables is to arrange them in a causal influence graph. So it is no surprise that related formalisms have been invented independently in AI, physics, genetics, statistics, and image processing. However, the most important algorithmic developments have been in AI / machine learning, where feasible inference algorithms for large-scale BNs have first been investigated. The unrivalled pioneer in this field is Judea Pearl (http://bayes.cs.ucla.edu/jp_home.html), who laid the foundations for the algorithmic theory of BNs in the 1980’s. These foundations were later developed into a more general theory of graphical models, a development promoted (among others) by Michael I. Jordan (https://people.eecs.berkeley.edu/~jordan/). Michael Jordan not only helped to build the general theory of graphical models but also had a shaping influence on other areas in machine learning. He is a machine learning superpower. The list of his past students and postdocs on his homepage reads like a Who’s Who of machine learning.

In the world of BNs (and graphical models in general) there exist two fundamental tasks:

Inference: Find algorithms that exploit the graph structure as ingeniously as possible to yield feasible algorithms for computing all the quantities asked for in prediction, abduction and explaining-away tasks.

Learning: Find algorithms to estimate a BN from observed empirical data. Learning algorithms typically call inference algorithms as a subroutine, so we will first treat the latter.

10.1.1 Brute-force inference in BNs

A Bayesian network is a probabilistic representation of a piece of reality — that piece which we observe through the visible RVs and which we hypothetically capture

through the hidden RVs. A BN represents the joint probability distribution of all of these RVs. In inference tasks one computes distributions of target variables of interest (“probability that Space Shuttle engine will explode”) as marginal or conditional probabilities derived from that joint distribution. In this subsection I will explain how such marginal or conditional distributions are mathematically connected to the joint distribution in cases where all RVs have a finite sample space, and how the graph structure can be exploited to factorize distributions of interest, which makes them computationally more tractable. The joint distribution in our Californian pavement example is a distribution over the sample space

{Winter, Spring, Summer, Fall} × {0, 1} × {0, 1} × {0, 1} × {0, 1},

which has 4 · 2 · 2 · 2 · 2 = 64 elements. Thus one would need a 64-dimensional pmf to characterize it, which has 63 degrees of freedom (one parameter we get for free because the pmf must sum to 1). This pmf would not be given in vector format but as a 5-dimensional probability array. We will now investigate brute-force methods for elementary inference in this joint probability table and see how the graph structure leads to a dramatic reduction in computational complexity (which however still is too high for practical applications with larger BNs; more refined algorithms will be presented later). By a repeated application of the factorization formula (174) (in Appendix B), the joint distribution of our five random variables is

   P (X) = P (⊗_{i=1,...,5} Xi) = P (X1) P (X2|X1) P (X3|X1,X2) P (X4|X1,X2,X3) P (X5|X1,X2,X3,X4).   (117)

Exploiting the conditional independencies expressed in the BN graph, this reduces to

P (X) = P (X1) P (X2|X1) P (X3|X1) P (X4|X2,X3) P (X5|X4). (118)

For representing the factors on the right-hand side of (118) by tables like (116), one would need tables of sizes 1×4, 4×2, 4×2, 4×2, and 2×2, respectively. Because the entries per row in each of these tables must sum to 1, one entry per row is redundant, so these tables are specified by 3, 4, 4, 4 and 2 parameters, respectively. All in all, this makes 17 parameters needed to specify P (X), as opposed to the 63 parameters needed for the naive representation of P (X) in a 5-dimensional array. In general, the number of parameters required to specify the joint distribution of n discrete random variables with at most ν values each, arranged in a BN with a maximum fan-in of k, is O(n ν^k), as opposed to the O(ν^n) parameters needed for a naive characterization of the joint distribution. This is a reduction from a space complexity that is exponential in n to a space complexity

that is linear in n! This simple fact has motivated many a researcher to devote his/her life to Bayesian networks.

Any reasoning on BNs (predictive, abductive, sidestepping or other) boils down to calculating conditional or marginal probabilities. For instance, the abductive question, “If the pavement is wet, by which factor y is it more probable that the season is spring than that it is summer”, asks one to compute

   y = P (X1 = spring | X4 = 1) / P (X1 = summer | X4 = 1).   (119)

Such probability ratios are often sought in diagnostic reasoning — as for instance in “by which factor is it more probable that my symptoms are due to cancer, than that they are due to a harmless cause”. Any conditional probability P (y1, . . . , ym | z1, . . . , zl), where the Yi and Zj are among the RVs in the BN, can be computed from the joint distribution of all variables in the BN by first transforming P (y1, . . . , ym | z1, . . . , zl) into a fraction of two marginal probabilities,

   P (y1, . . . , ym | z1, . . . , zl) = P (y1, . . . , ym, z1, . . . , zl) / P (z1, . . . , zl),

and then computing the denominator and the numerator by marginalization from the joint distribution of all RVs in the BN, exploiting efficient BN factorizations of the kind exemplified in Equation 118. The probability P (X1 = spring | X4 = 1), for instance, can be computed by

P (X1 = spring | X4 = 1) =                                                        (120)
   = P (X1 = spring, X4 = 1) / P (X4 = 1)
   = Σ_{x2,x3,x5} P (X1 = spring, x2, x3, X4 = 1, x5)  /  Σ_{x1,x2,x3,x5} P (x1, x2, x3, X4 = 1, x5)
   = Σ_{x2,x3,x5} P (X1 = s) P (x2|X1 = s) P (x3|X1 = s) P (X4 = 1|x2, x3) P (x5|X4 = 1)
     /  Σ_{x1,x2,x3,x5} P (x1) P (x2|x1) P (x3|x1) P (X4 = 1|x2, x3) P (x5|X4 = 1),

where I abbreviated “spring” to “s” in order to squeeze the expression into a single line. (120) can be computed by a brute-force evaluation of the concerned summations. The sum to be taken in the denominator would run over 4×2×2×2 = 32 terms, each of which is a product of 5 subterms; we thus would incur 128 multiplications. It is apparent that this approach generally incurs a number of multiplications that is exponential in the size of the BN. A speedup strategy is to try pulling the sum into the product as far as possible, and evaluate the resulting formula from the inside of bracketing levels. This is

called the method of variable elimination. For example, an equivalent formula for the sum in the denominator of (120) would be

   Σ_{x1} P (x1) ( Σ_{x2,x3} P (x2|x1) P (x3|x1) P (X4 = 1|x2, x3) ( Σ_{x5} P (x5|X4 = 1) ) ).

To evaluate this expression from the inside out, we note that the sum over x5 in the innermost term is 1 and need not be explicitly calculated. For the remaining calculations, 15 sums and 36 multiplications are needed, as opposed to the 31 sums and 128 multiplications needed for the naive evaluation of the denominator in (120). However, finding a summation order where this pulling-in leads to the minimal number of summations and multiplications is again NP-hard, although greedy algorithms for that purpose are claimed to work well in practice. (A small code sketch comparing the naive and the pulled-in evaluation follows after the list below.) In this situation, computer scientists can choose between three options:

1. Restrict the problem by suitable constraints, earning a tractable problem. For BNs, this was the first strategy that was used with success: early (now classical) inference algorithms were defined only for BNs with tree graphs.

2. Use heuristic algorithms, i.e. algorithms that embody human insight for computational shortcuts. Heuristic algorithms need not always succeed in delivering short runtimes, but their results are always exact. The goal is to find heuristics which lead to fast runtimes almost always and which run too long only rarely. The “join tree” algorithm which we will study later contains heuristic elements.

3. Use approximate algorithms, i.e. algorithms that always yield results fast, but with an error margin (which should be controllable). For BN inference, a convenient class of approximate algorithms is based on sampling. In order to obtain an estimate of some marginal probability, one samples from the distribution defined by the BN and uses the sample as a basis to estimate the desired marginal. There is an obvious tradeoff between runtime and precision. Another class of approximate algorithms is based on variational inference. Variational algorithms need some insight into the shape of the conditional distributions in a BN. Their speedup results from restricting the admissible shapes of these distributions to analytically tractable classes. I will not go into this topic in this course. Jordan et al. (1999) gives a tutorial introduction.
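Here is the small code sketch announced above. All conditional probability numbers are invented for illustration; the point is only the structure of the computation, namely the naive summation from (120) versus the pulled-in evaluation.

```python
# Sprinkler network X1 -> X2, X1 -> X3, (X2, X3) -> X4, X4 -> X5 with
# invented CPTs (illustration only, not the values of the original example).
seasons = ["winter", "spring", "summer", "fall"]
p1 = {s: 0.25 for s in seasons}                                            # P(X1)
p2 = {s: {1: q, 0: 1 - q} for s, q in zip(seasons, [0.6, 0.4, 0.1, 0.4])}  # P(X2|X1)
p3 = {s: {1: q, 0: 1 - q} for s, q in zip(seasons, [0.0, 0.3, 0.9, 0.2])}  # P(X3|X1)
p4 = {(0, 0): {1: 0.0, 0: 1.0}, (0, 1): {1: 0.9, 0: 0.1},                  # P(X4|X2,X3)
      (1, 0): {1: 0.9, 0: 0.1}, (1, 1): {1: 0.99, 0: 0.01}}
p5 = {0: {1: 0.0, 0: 1.0}, 1: {1: 0.7, 0: 0.3}}                            # P(X5|X4)

# Naive evaluation of P(X4 = 1): sum the factorized joint over all free variables.
naive = sum(p1[x1] * p2[x1][x2] * p3[x1][x3] * p4[(x2, x3)][1] * p5[1][x5]
            for x1 in seasons for x2 in (0, 1) for x3 in (0, 1) for x5 in (0, 1))

# Pulled-in evaluation: the innermost sum over x5 equals 1 and is dropped,
# and the sums over x2, x3 are carried out inside the sum over x1.
pulled = sum(p1[x1] * sum(p2[x1][x2] * p3[x1][x3] * p4[(x2, x3)][1]
                          for x2 in (0, 1) for x3 in (0, 1))
             for x1 in seasons)

print(naive, pulled)   # identical values, obtained with far fewer multiplications
```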

10.1.2 BN inference with the join-tree algorithm

In this subsection I will unfold the classical exact inference algorithm that can compute conditional and marginal probabilities on BNs at (often) affordable cost. This algorithm is often referred to as join tree algorithm, or junction tree algorithm.

It is an exact algorithm: it computes exact, not approximate values of the desired probabilities. The join-tree algorithm isn’t simple and this subsection will be long. I will follow the description given in Huang and Darwiche (1994). Before the BN can be used with the join tree inference algorithm, the original DAG must be transformed into an auxiliary graph structure, called a join tree, by a rather involved process. This has to be done only once for a given BN however; the same join tree can be re-used for all subsequent inference calls. The transformation from a DAG to a join tree runs through 4 stages:

Given: A BN with DAG Gd with nodes X = {X1,...,Xn}

Step 1: Transform Gd to an undirected graph Gm, called the moral undirected graphical model (moral UGM), which is an equivalent way of expressing the independencies expressed in Gd, but using another graph separation mechanism for expressing conditional independencies.

Step 2: Add some further undirected edges to Gm which triangulate Gm, obtaining Gt. This may destroy some of the original independence relations between {X1,...,Xn} but will not introduce new ones.

Step 3: Detect all cliques Ci in Gt (while this is NP-complete for general undirected graphs, it can be done efficiently for triangulated undirected graphs).

Step 4: Build an undirected join tree T with nodes Ci. This is the desired target structure. It represents an elegant factorization of the joint probability P (X) which in turn can be processed with a fast, local inference algorithm known as message passing.

We will go through these steps in a mostly informal manner. The purpose of this presentation is not to provide you with a detailed recipe for building executable code. There exist convenient BN toolboxes. The purpose of this subsection is to provide you with a navigation guide: first gain an overview of the general picture, then study the more detailed paper Huang and Darwiche (1994), then start using (and understanding) a toolbox.

Undirected graphical models. Before we can delve into the join tree algorithm, we have to introduce the concept of undirected graphical models (UGMs), because these will be constructed as intermediate data structures when a BN is transformed to a join tree. The essence of BNs is that conditional independence relationships between RVs are captured by directed graph structures, which in turn guide efficient inference algorithms. But it is also possible to use undirected graphs. This leads to undirected graphical models (UGMs), which have a markedly different flavour from directed BNs. UGMs originated in statistical physics and image processing,

while BNs were first explored in Artificial Intelligence. A highly readable non-technical overview and comparison of directed and undirected models is given in Smyth (1997). We will use the following compact notation for statistical, conditional independence between two sets Y and Z of random variables, given a set S of random variables:

Definition 10.2 Two sets Y and Z of random variables are independent given S, if P (Y, Z | S) = P (Y | S) P (Z | S). We write Y⊥Z | S to denote this conditional independence.

It is easy to see that if Y⊥Z | S and Y′ ⊆ Y, Z′ ⊆ Z, then Y′⊥Z′ | S. In an UGM, independence relations between sets of random variables are defined in terms of a graph separation property:

Definition 10.3 Let G = (V,E) be an undirected graph with vertices V and edges E ⊆ V × V . Let Y,Z,S ⊂ V be disjoint, nonempty subsets. Then Y is separated from Z by S if every path from some vertex in Y to any vertex in Z contains a node from S. S is called a separator for Y and Z.

And here is how statistical independence is reflected in a graph’s separation properties:

Definition 10.4 An undirected graph with nodes X = {X1,...,Xn} is an undirected graphical model (UGM) for P (X) if for all Y, Z, S ⊆ X it holds that if Y is separated from Z by S, then Y⊥Z | S.

UGMs are sometimes defined in a slightly different but equivalent way. We formulate this other definition as a proposition:

Proposition 10.1 An undirected graph with nodes X = {X1,...,Xn} is an UGM for P (X) iff for all i = 1, . . . , n it holds that P (Xi | (Xj)_{j≠i}) = P (Xi | (Xj)_{j∈Ni}), where Ni is the set of all direct graph neighbors of node i.

The set Ni which makes Xi independent from all other RVs in the model is also called the Markov blanket of Xi. Notice that Markov chains (see Section 9.5.1) can be considered as UGMs where the (only) edges are between subsequent Xn,Xn+1 and the Markov blanket of Xn is given by its temporal neighbors {Xn−1,Xn+1}.

Step 1: The moral UGM. After this quick introduction of UGMs we return to the join tree algorithm for BNs. The first step is to transform the directed graph structure of the BN into an UGM structure. This can be done in many ways. The art lies in transforming a BN into an UGM such that as few as possible of the valuable independence relations expressed in the BN get lost. The standard thing to do is to moralize the directed BN graph Gd into an undirected graph Gm by

1. converting all edges from directed to undirected ones,

2. for every node Xi, adding undirected edges between all parents of Xi.

See Figure 55 for an example. The peculiar name “moralizing” comes from the act of “marrying” previously unmarried parents. The moral UGM Gm implies the same conditional independence relations as the BN Gd from which it was derived (proof omitted here).

Figure 55: A BN and its associated moral UGM. Image taken from Huang and Darwiche (1994).
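The moralization step is easy to code. The following sketch is my own illustration (node names and the helper function are hypothetical); it builds the edge set of the moral graph from the parent lists of a BN.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents (the BN DAG).
    Returns the edge set of the moral undirected graph as frozensets."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:                        # 1. drop the edge directions
            edges.add(frozenset((p, child)))
        for p, q in combinations(pars, 2):    # 2. "marry" all co-parents
            edges.add(frozenset((p, q)))
    return edges

# Sprinkler example: X1 -> X2, X1 -> X3, (X2, X3) -> X4, X4 -> X5
bn = {"X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X4"]}
print(moralize(bn))   # contains the added "marriage" edge {X2, X3}
```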

Step 2: Triangulation. An undirected graph is triangulated if every cycle of length at least 4 contains two nodes not adjacent in the cycle which are connected by an edge. Any undirected graph can be triangulated by adding edges. There are in general many ways of doing so. For best efficiency of UGM inference algorithms, one should aim at triangulating in a way that minimizes the size of cliques of the triangulated graph. This is (again) an NP-complete problem. However there are heuristic algorithms which yield close to optimal triangulations in (fast) polynomial time. One of them is sketched in the Huang/Darwiche guideline for BN inference design.
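As an illustration of such a heuristic (not the specific procedure from Huang and Darwiche), here is a sketch of the common greedy min-fill strategy: repeatedly eliminate the node whose remaining neighbors require the fewest fill-in edges, and add those fill-in edges to the graph.

```python
def triangulate_min_fill(nodes, edges):
    """Greedy min-fill triangulation sketch. `edges` is a set of 2-tuples.
    Returns the fill-in edges; the original graph plus fill-ins is triangulated."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining, fill = set(nodes), set()

    def fill_cost(v):
        nb = sorted(adj[v] & remaining)
        return sum(1 for i, a in enumerate(nb) for b in nb[i + 1:] if b not in adj[a])

    while remaining:
        v = min(remaining, key=fill_cost)      # node needing fewest fill-in edges
        nb = sorted(adj[v] & remaining)
        for i, a in enumerate(nb):             # connect all its remaining neighbors
            for b in nb[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.add((a, b))
        remaining.discard(v)                   # "eliminate" v
    return fill
```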

After triangulation, we have an UGM graph Gt. Figure 56 shows a possible triangulation of the UGM from Figure 55.

Figure 56: Left: A triangulated version Gt of the moral UGM Gm from Figure 55. Right: the 6 cliques in Gt. Image adopted from Huang and Darwiche (1994).

The reason for triangulating is that triangulated graphs can be decomposed into junction trees in Step 4.

Step 3: Finding the cliques. For a change, finding all cliques in a triangulated graph is not NP-complete — efficient algorithms for finding all cliques in a triangulated graph are known. In our running example, we get 6 cliques, namely all the “triangles” ABD, ACE, ADE, DEF, CGE, and EGH (Figure 56 (right)). Our example may convey the wrong impression that we always get cliques of size 3 in triangulated graphs. In general, one may find cliques of any size ≥ 2 in such graphs.

Step 4: Building the join tree. This is a more interesting and involved step. After BNs and UGMs, join trees are our third graph-based representation of independence relations governing a set X of random variables. We will discuss join trees in their own right first, and then consider how a join tree can be obtained from the cliques of a triangulated UGM.

Definition 10.5 Let X = {X1,...,Xn} be a set of discrete random variables with joint distribution P (X), where Xi takes values in a sample space RXi. A join tree for P (X) is an undirected tree whose nodes are labelled with subsets Ci ⊆ X (called clusters) and whose edges are labelled with subsets Sj ⊆ X (called the separator sets or sepsets). Furthermore, each cluster C = {Y1,...,Yk} is associated with a belief potential ϕC : RY1 × ... × RYk → R^{≥0}, and each sepset S = {S1,...,Sl} with a belief potential ϕS : RS1 × ... × RSl → R^{≥0}, such that the following conditions are satisfied:

1. The graph has the running intersection property, that is, given two cluster nodes labelled with C and C′, all cluster nodes on the path between C and C′ must have labels containing C ∩ C′.

2. Each edge between two nodes labelled by C and C′ is labelled by S = C ∩ C′.

3. For each cluster C = {Y1,...,Yk} and neighboring sepset S = {Y1,...,Yl} (note that l ≤ k because S ⊆ C), it holds that ϕC is consistent with ϕS in the sense that

   Σ_{y_{l+1},...,y_k} ϕC(y1, . . . , yk) = ϕS(y1, . . . , yl),   (121)

that is, ϕS is obtained by marginalization from ϕC.

4. The joint distribution P (X) is factorized by the belief potentials according to

   P (X) = Π_i ϕCi / Π_j ϕSj.   (122)

Figure 57: A join tree derived from the triangulated UGM shown in Figure 56. Image taken from Huang and Darwiche (1994).

I mention without proof how statistical independence relations are reflected in a join tree:

Proposition 10.2 Let T be a join tree for P (X) and Y, Z, S ⊆ X. Re-interpret the link labels as nodes (so the join tree from Figure 57 would become an undirected tree with 11 nodes and unlabelled links), obtaining a “homogenized” tree T′. If for all Y ∈ Y, Z ∈ Z the path from any node in T′ containing Y to any node containing Z passes through a node whose label includes S, then Y⊥Z | S.

In join trees, the belief potentials are the marginal distributions of their variables:

Proposition 10.3 Let T be a join tree for P (X), and let K = {X1,...,Xk} ⊆ X = {X1,...,Xk, Xk+1,...,Xn} be a clique or sepset label set. Then for any value instantiation x1, . . . , xk of the variables from K, it holds that

   Σ_{x_{k+1},...,x_n} P (x1, . . . , xn) = ϕK(x1, . . . , xk),   (123)

that is, ϕK is the marginal distribution of the variables in K.

A proof of this proposition amounts to a medium difficult exercise. We will from now on use the shorthand notation

   Σ_{X\K} P (X) = ϕK   (124)

to denote marginalization. Ok., now that we know what a join tree is, we return to Step 4: constructing a join tree from a triangulated moral UGM Gt. A join tree is specified through (i) its labelled graph structure and (ii) its belief potentials. We first treat the question how one can derive the join tree’s graph structure from Gt.

There is much freedom in creating a join tree graph from Gt. One goal for optimizing the design is that one strives to end with clusters that are as small as possible (because the computational cost of using a join tree for inference will turn out to grow exponentially in the maximal size of clusters). On the other hand, in order to compute belief potentials later, any clique in Gt must be contained in some cluster. This suggests turning the cliques identified in Step 3 into the cluster nodes of the join tree. This is indeed done in a general recipe for constructing a join tree from a triangulated UGM Gt, which I rephrase from Huang and Darwiche (1994):

1. Begin with an empty set SEP, and a completely unconnected graph whose nodes are the maximal cliques Ci found in Step 3.

2. For each distinct pair of cliques Ci, Cj create a candidate sepset Sij = Ci ∩ Cj, and put it into SEP. If there are m cliques, SEP will then contain m (m − 1)/2 such Sij (some of which may be empty).

3. From SEP iteratively choose m − 1 sepsets and use them to create connections in the node graph, such that each newly chosen sepset connects two subgraphs that were previously unconnected. This necessarily yields a tree structure.

A note on the word “tree structure”: in most cases when a computer scientist talks about tree graphs, there is a special “root” node. Here we use a more general notion of trees to mean undirected graphs which (i) are connected (there is a path between any two nodes) and (ii) where these paths are unique, that is between any two nodes there is exactly one connecting path. In such tree graphs any node can be designated as “root” and one gets the familiar appearance of a tree. This general recipe leaves much freedom in choosing the sepsets from SEP. Not all choices will result in a valid join tree. In order to ensure the join tree properties, we choose, at every step from 3., the candidate sepset that has the largest mass (among all those which connect two previously unconnected subgraphs). The mass of a sepset is the number of variables it contains. This is not the only possible way of constructing a join tree, and it is still underspecified (there may be several maximal mass sepsets at our disposal in a

step from 3.) Huang and Darwiche propose a full specification that heuristically optimizes the join tree with respect to the ensuing inference algorithms. If the original BN was not connected, some of the sepsets used in the join tree will be empty; we then get a join forest. We now turn to the second subtask in Step 4 and construct belief potentials ϕC, ϕS for the clusters and sepsets, such that Equations 121 and 122 hold. Belief potentials which account for (121) and (122) are constructed in two steps. First, the potentials are initialized in a way such that (122) holds. Second, by a sequence of message passes, local consistency (121) is achieved.

Initialization. Initialization works in two steps:

1. For each clique or sepset K (we use symbol K for cliques C or sepsets S), set ϕK to the unit function ϕK ≡ 1.

2. For each variable Xk, do the following. Assign to Xk a clique CXk that contains Xk and its parents ΠXk from the original BN. (Note: due to the moralizing, such a clique must exist.) Multiply ϕCXk by P (Xk | ΠXk):

   ϕCXk ← ϕCXk P (Xk | ΠXk).

Make sure that you understand this operation: interpret P (Xk | ΠXk) as a function of all variables contained in CXk.

After this initialization, the conditional distributions P (Xk | ΠXk) of all variables (and hence the information from the BN) have been multiplied into the clique potentials, and (122) is satisfied:

   Π_i ϕCi / Π_j ϕSj = ( Π_{k=1,...,n} P (Xk | ΠXk) ) / 1 = P (X),

where i ranges over all cliques, j over all sepsets, and k over all RVs. After having initialized the join tree potentials, we make them locally consistent by propagating the information, which has been locally multiplied-in, across the entire join tree. This is done through a suite of message passing operations, each of which makes one clique/sepset pair consistent. We first describe a single message pass operation and then show how they can be scheduled such that a message pass does not destroy consistency of clique/sepset pairs that have been made consistent in an earlier message passing.

A single message pass. Consider two adjacent cliques C and D with an intervening sepset S and their associated belief potentials ϕC, ϕD, ϕS. A message pass from C to D occurs in two steps:

1. “Projection”: create a copy ϕS^old of ϕS for later use, then recompute ϕS by marginalization from C:

   ϕS^old = ϕS,    ϕS = Σ_{C\S} ϕC.

This makes ϕS consistent with ϕC according to Equation 121. The joint distribution P (X) becomes changed through this operation by a factor of ϕS^old / ϕS (notice that ϕS appears in the denominator of (122)).

2. “Absorption”: multiply the belief potential of D by the inverse of ϕS^old / ϕS in order to restore the joint distribution:

   ϕD = ϕD (ϕS / ϕS^old).

A technical detail: if ϕS^old(s) = 0, it can be shown that also ϕS(s) = 0; in this case set ϕS(s) / ϕS^old(s) = 0. After this step, C is consistent with S in the sense of (121). To also make D consistent with S, a message pass in the reverse direction must be carried out. An obvious condition is that this reverse-direction pass must preserve the consistency of C with S. This is warranted if a certain order of passes is observed, to which we now turn our attention.
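A minimal concrete sketch of a single message pass, with toy numbers of my own: cliques C = {A, B} and D = {B, C} with sepset S = {B}, all variables binary. Potentials are numpy arrays with axis order phi_C[a, b], phi_S[b], phi_D[b, c].

```python
import numpy as np

phi_C = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
phi_S = np.ones(2)                 # sepset potential, initialized to 1
phi_D = np.array([[0.5, 0.5],
                  [0.8, 0.2]])

# Projection: save the old sepset potential, then recompute it by
# marginalizing phi_C over A (axis 0).
phi_S_old = phi_S.copy()
phi_S = phi_C.sum(axis=0)

# Absorption: multiply phi_D by phi_S / phi_S_old, broadcast over the B axis;
# the convention 0/0 = 0 is implemented via np.divide's `where` argument.
ratio = np.divide(phi_S, phi_S_old, out=np.zeros_like(phi_S), where=phi_S_old > 0)
phi_D = phi_D * ratio[:, None]

# After the pass, phi_S is consistent with phi_C in the sense of Equation 121.
print(np.allclose(phi_C.sum(axis=0), phi_S))   # True
```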

Coordinating all message passes. In order to achieve local consistency (121) for all neighboring clique-sepset pairs in the join tree, as many message passes as there are such pairs must be executed, one for each pair. In order to ensure that local consistency for a pair C, S achieved by a message pass from C to D (where S is the sepset between C and D) is not destroyed by a subsequent message pass in the reverse direction, the global order of these passes is crucial. We will motivate a global propagation scheme by considering some connection C – S – D within the tree, as depicted in Figure 58.

Figure 58: A connection C – S – D within the tree.

This connection will be hit twice by a message pass, once in each direction. Assume that the first of the two passes went from C to D. After this pass, we have potentials ϕC^0, ϕS^0, ϕD^0, and C is consistent with S:

   ϕS^0 = Σ_{C\S} ϕC^0.

At some later time, a message pass sweeps back from D to C. Before this happens, the potential of D might have been affected by some other passes, so it is ϕD^1 when the pass from D to C occurs. After this pass, we have

   ϕS^1 = Σ_{D\S} ϕD^1   and   ϕC^1 = ϕC^0 (ϕS^1 / ϕS^0).

It turns out that still S is consistent with C:

   ϕS^1 = ϕS^0 (ϕS^1 / ϕS^0) = ( Σ_{C\S} ϕC^0 ) (ϕS^1 / ϕS^0) = Σ_{C\S} ( ϕC^0 ϕS^1 / ϕS^0 ) = Σ_{C\S} ϕC^1.

In summary, we see that if a connection C – S – D is hit by two passes, one for each direction, S will be consistent with C and D. In order to ensure consistency for all connections in the tree, we must make sure that after some connection C – S – D has been passed back and forth, neither C nor D take part in any further passes, as this might again disrupt the already achieved consistency. The following global scheduling scheme assures this condition:

1. To start, single out any clique C and call it the “center”.

2. In a first phase (“collect evidence” in the Huang/Darwiche paper), carry out all passes that are oriented towards the center. Carry out these passes in any order such that a pass “leaves” some node on some connection only after all other “incoming” connections have been used for passes.

3. In a second phase (“distribute evidence”), carry out all passes that are ori- ented away from the center. Observe the same order constraint as in the first phase.

Figure 59 shows a possible global scheduling for our example join tree. After all of this toiling, we have a magic join tree T — a tree graph adorned with belief potentials that are locally consistent (Equation 121) and globally represent the joint distribution P (X) (Equation 122). The join tree is now ready for use in inference tasks.

Figure 59: A scheduling for the global propagation of message passing. The center is ACE. Figure taken from the Huang/Darwiche paper.
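A sketch of how the two phases could be organized recursively (my own illustration; `neighbors` and `pass_message` are hypothetical helpers, the latter assumed to perform the projection/absorption step sketched earlier):

```python
def collect(node, parent, neighbors, pass_message):
    """Phase 1 ("collect evidence"): pass messages toward the center."""
    for nb in neighbors[node]:
        if nb != parent:
            collect(nb, node, neighbors, pass_message)
            pass_message(nb, node)

def distribute(node, parent, neighbors, pass_message):
    """Phase 2 ("distribute evidence"): pass messages away from the center."""
    for nb in neighbors[node]:
        if nb != parent:
            pass_message(node, nb)
            distribute(nb, node, neighbors, pass_message)

def global_propagation(center, neighbors, pass_message):
    collect(center, None, neighbors, pass_message)
    distribute(center, None, neighbors, pass_message)
```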

Inference task 1: compute a single marginal distribution. For any RV X in the BN, computing the marginal distribution P (X) is a simple two-step procedure:

1. Identify a clique or sepset K which contains X.

2. Obtain P (X) by marginalization:

   P (X) = Σ_{K\{X}} ϕK.

This follows from Proposition 10.3.

Inference task 2: Compute conditional distributions. Most use-cases of BNs concern the calculation of probabilities for certain events, given that some information is known / measured / observed. For instance, in the Space Shuttle example, one wants to know the probability that the engine will explode given certain sensor readings; or in medical diagnostics one wants to know the probability that the patient has cancer given lab results. In formal terms, this means calculating conditional distributions of the kind P (Z | E1 = e1,...,Ek = ek), where Z, E1,...,Ek are variables in the BN, and current values for the Ei are known. The known facts Ei = ei are called evidence.

Let e = (e1, . . . , ek) be an evidence, that is, observed values for the RVs E = {E1,...,Ek}. To feed this information into T, we introduce a new kind of belief potentials called (in this context) likelihoods. For each E ∈ E we define the likelihood ΛE : SE → {0, 1} by

   ΛE(e) = 1   if e is the observed value of E,
   ΛE(e) = 0   for all other e ∈ SE.

In order to compute P (Z | E1 = e1,...,Ek = ek) = P (Z | e), we have to go through the routine of initializing T and making it consistent via global message propagation again, using an augmented version of the method described previously. Here is how.

Initialization: Exactly as described above. This yields initial belief potentials ϕC, ϕS for all cliques and sepsets.

Observation entry: for each E ∈ E, do the following:

1. Identify a clique CE which contains E.

2. Update ϕCE = ϕCE ΛE.

This simply resets ϕCE to zero for all arguments that have a different value for E than the observed one. With the new potentials, the tree globally encodes P (X) 1e, where 1e : ⊗_i SXi → {0, 1} is the indicator function of the set {(x1, . . . , xn) ∈ ⊗_i SXi | xi = ej for Xi = Ej, j = 1, . . . , k}:

   Π_i ϕCi / Π_j ϕSj = P (X) ΛE1 · · · ΛEk = P (X) 1e =: P (X, e).

Note furthermore that Σ_{X\E} P (X, e) = P (e).

Global propagation: exactly as before, but starting from the updated potentials obtained by observation entry. After this is completed, the join tree is locally consistent. Furthermore, each clique or sepset K has a potential satisfying

ϕK = P (K, e).

Normalization: In order to compute P (Z | e), determine a clique CZ which contains Z. When we marginalize this clique’s potential to Z, we obtain the joint probability of Z and e:

   Σ_{CZ\{Z}} ϕCZ = P (Z, e).

But our goal is to compute P (Z | e), the distribution of Z given e. We obtain this by normalizing P (Z, e):

   P (Z | e) = P (Z, e) / P (e) = P (Z, e) / Σ_{z∈SZ} P (z, e).
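As a toy illustration (my own numbers) of observation entry and normalization, consider a join tree consisting of a single clique containing the binary variables (Z, E), with evidence E = 1. With only one clique, global propagation has nothing to do and the whole pipeline collapses to three lines:

```python
import numpy as np

phi = np.array([[0.20, 0.10],          # clique potential, axis order phi[z, e];
                [0.30, 0.40]])         # here it equals the joint P(Z, E)

Lambda_E = np.array([0.0, 1.0])        # likelihood for the observed value E = 1
phi_e = phi * Lambda_E[None, :]        # observation entry: zero out entries with E != 1

P_Z_and_e = phi_e.sum(axis=1)          # marginalize the clique potential down to Z
P_Z_given_e = P_Z_and_e / P_Z_and_e.sum()   # normalization
print(P_Z_given_e)                     # [0.2 0.8]
```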

10.1.3 Learning Bayesian networks

We have seen how we can calculate statistical inferences on the basis of a given BN. But who gives it to us, and how?

A common approach, especially in AI, is to distil the conditional probability tables from interviews with experts of the target domain. Domain experts often have clear conceptions of local conditional probabilities that connect a few variables. You might ask, why use a BN if such experts are available? The answer is that while humans may have good insight into local interdependencies between a few variables, they are psychologically and intellectually poorly equipped to use this local knowledge for making sound statistical inferences which connect many variables in a globally connected causal interaction system.

In other cases, empirical observations are available that allow one to estimate the conditional probability tables from data. For instance, the table shown in Equation (116) could have been estimated from counts of observed outcomes as suggested in Figure 60.

Figure 60: Example of estimating a probability table from frequency counts.

Notice that each row in such a table is estimated independently from the other rows. The task of estimating such a row is the same as estimating a discrete distribution from data. When there is no abundance of data, Bayesian model estimation via Dirichlet priors is the way to go, as in the protein frequency estimation in Section 8.3. When data like in Figure 60 are available for all nodes in a BN, estimating the local conditional probabilities by the obvious frequency counting ratios gives maximum likelihood estimates of the local conditional probabilities. It can be shown that this is also the maximum likelihood estimate of the joint distribution P (X) (not a deep result; the straightforward derivation can be found in the online lecture notes Ermon (2019), chapter “Learning in directed models”). When there are nodes in the BN for whose conditional distribution either no data are available, or the concerned RVs are unobservable, one uses EM algorithms. The basic version of EM for BNs which seems standard today was introduced in a paper by Steffen Lauritzen in 1995 (The EM algorithm for graphical association models with missing data, Computational Statistics & Data Analysis 19(2), 191-201). Unfortunately this article lies behind a paywall. The algorithm is however

also described in a number of online tutorials and open access papers, for instance Mouafo et al. (2016). The algorithm is complex. In the E-step it uses inference in join trees as a subroutine.
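Returning to the fully observed case: estimating one row of a conditional probability table from counts is a one-liner, and Laplace/Dirichlet smoothing (as mentioned above for scarce data) only adds a pseudo-count. A hedged sketch with invented counts:

```python
import numpy as np

def estimate_row(counts, alpha=0.0):
    """Estimate P(X = . | parent configuration) from counts of observed outcomes.
    alpha = 0 gives the maximum likelihood estimate; alpha > 0 adds a symmetric
    Dirichlet pseudo-count to every outcome."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(estimate_row([18, 2]))              # ML estimate:       [0.9       0.1      ]
print(estimate_row([18, 2], alpha=1.0))   # smoothed estimate: [0.8636... 0.1363...]
```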

10.2 Undirected graphical models

Undirected graphical models (UGMs) play a much larger role than just helping out as intermediate representations on the way from a BN to a join tree. UGMs have been (re-)discovered independently in different fields and under different names, for instance as

Markov random fields, a quite generic concept/word used in statistics but also in a more special sense in image processing;

Ising models, a class of models of 2D or 3D magnetic / spintronic / quantum-dynamical materials explored in statistical physics,

Boltzmann machines, a kind of stochastic, recurrent neural network which gives a wonderfully elegant and powerful model of a universal learning system — unfortunately, computationally intractable,

restricted Boltzmann machines, a stripped-down version of the former, computationally tractable, and instrumental in kicking off the deep learning revolution (Hinton and Salakuthdinov, 2006).

I originally planned to have a section on these models, exploring their deep connection to physics and giving a demo in an image denoising task. But the BN part of this section has grown sooooo long that I think it would be too much for one session of this course. If you are interested — the online course notes of Stefano Ermon (Ermon, 2019) include an easy-reading fast introduction to Markov random fields in the chapter with the same name, and my legacy lecture notes on “Statistical and Algorithmical Modeling” (http://minds.jacobs-university.de/teaching/ln/) have a Section 6.3 on UGMs which is somewhat more substantial.

10.3 Hidden Markov models

I also planned to introduce hidden Markov models (HMMs) in this session. HMMs are models of stochastic processes with memory — the standard model of such processes. A modern view of HMMs is to see them as graphical models, and since they are designed around hidden RVs (hence their name!), they are trained with a dedicated and efficient version of EM. All of this would be very “nice to have” but Christmas is approaching and we should better be starting to think about stopping to think about machine learning for a while. If you are very motivated — the classical tutorial text on HMMs is Rabiner (1990).

HMMs are models of stochastic processes — they describe how observations made in the past change conditional probabilities of things that may be observed in the future. HMMs use DAGs, like Bayesian networks, but with HMMs, there are causal arrows that mean causation in time: the causing RV Xn is earlier than the influenced RV Xn+1. This leads to potentially infinite DAGs (if time just runs on), and connects graphical models to the theory of stochastic processes. In recent years, directed graphical models which contain arrows that mean timesteps and which are more complex than HMMs have increasingly been studied. A pioneer is Kevin Murphy who more or less established this field with his PhD thesis (Murphy, 2002) — another case of starting a stellar career by tying together all sorts of existing loose ends in one unifying, tutorial work.

11 Online adaptive modeling

Often an end-user of a machine learning model finds himself/herself in a situation where the distribution of the input data needed by the model changes over time. The learnt model will then become inaccurate because it was trained with training data that had another distribution of the input data. It would be desirable if the learnt model could adapt itself to the new situation. Two examples:

• A speech recognition system is used in a situation where background noise obscures the speech input — and the background noise is of a sort that was not contained in the training data.

• A credit risk prediction system is used month after month — but when an economic crisis changes the loan-takers’ payment morale, the system should change its prediction biases.

Never-ending adaptation of machine learning systems has always been an issue for machine learning. Currently this theme is receiving renewed attention in deep learning in a special, somewhat restricted version, where it is discussed under the headline of continual learning (or continuous learning). Here one seeks solutions to the problem that, if an already trained neural network is subsequently trained even more on new incoming training data, the previously learnt competences will be destructively over-written by the process of continuing learning on new data. This phenomenon has long been known as catastrophic forgetting. Methods for counter-acting catastrophic forgetting are developing fast. I have uploaded a snapshot summary of the continual learning scene on Nestor’s “Contents” page (file Brief Overview of Continual Learning Approaches, written by Xu He, a PhD student in my group). In this section I will however not deal with continual (deep) learning, for two reasons: (i) this material is advanced and requires substantial knowledge of deep learning methods, (ii) the currently available methods still fall short of the desired goal to enable ongoing, “life-long” learning.

Instead, I will present methods which have long been explored and successfully used in the field of signal processing and control. This material is not normally treated in machine learning courses — I dare say, mostly because machine learners are just not aware of this body of knowledge, and maybe also, if they are, they find these methods too “linear” (no neural networks involved!). But I find this material most valuable to know,

• because these techniques are broadly applicable, especially in application contexts that involve signal processing and control — like robotics, industrial engineering applications, and modeling biological systems;

• because these techniques are mathematically elementary and transparent and give you a good understanding of conditions when gradient descent

optimization of loss functions becomes challenged — you will understand better why deep learning algorithms sometimes become veeeeery slow or numerically unstable;

• and finally, because I think that machine learning is an interdisciplinary enterprise and I believe that these signal processing flavored methods will become important in a young and fast-growing field of research called neuromorphic computing which has a large overlap with machine learning — my own field for a few years now.

Throughout this section, I am guided by the textbook Adaptive Filters: Theory and Applications (Farhang-Boroujeny, 1998).

11.1 The adaptive linear combiner

In this subsection I introduce basic terminology and notation and the fundamental system model used throughout, the adaptive linear combiner.

I start with a refresher on systems and signals terminology. A discrete-time, real-valued signal is a left-right infinite, real-valued sequence x = (x(n))_{n∈Z} ∈ R^Z (we will only consider discrete-time, real-valued signals). Note that in this section, x refers to a complete sequence of values; if we want to single out a particular signal value at time n, we write x(n). I will often use the shorthand (x(n)) for (x(n))_{n∈Z}. A filter (or system — this word is a synonym for “filter” in the world of signal processing) H is a mapping from signals to signals, written y = H(x). The signal processing folks like to put this in a diagram where the filter is a box, see Figure 61.


Figure 61: Diagram representation of a filter (or system) H.

The signal x is the input signal and y is the output signal of H. There are arbitrarily complex filters — only think of a filter where the input signals are (sampled versions of) microphone recordings of an English speaker and the output is synthesized speech of an online Dutch translation of the input signal. In this section we will restrict ourselves to a particularly simple, yet immensely useful, class of filters called transversal filters. A transversal filter generates the output by a linear transform of the L preceding input values, where L is the length of the transversal filter:

Definition 11.1 Let L ≥ 1 be an integer, and let w ∈ R^L be a vector of filter weights. The transversal filter Hw is the filter which transforms an input signal x = (x(n)) into the output y = Hw(x) defined by

   y(n) = w (x(n), x(n − 1), . . . , x(n − L + 1))′.   (125)

Note that like we did earlier, the weight vector w is a row vector. Transversal filters are linear filters. A filter H is called linear if for all a, b ∈ R and signals x1, x2

   H(a x1 + b x2) = a H(x1) + b H(x2).   (126)

The proof that transversal filters are linear is an easy exercise. I remark in passing that the theory of signals and systems works with complex numbers throughout, both for signals and filter parameters; for us it is however good enough if we only use real-valued signals and model parameters.

The unit impulse δ = (δ(n)) is the signal that is zero everywhere except for n = 0, where δ(0) = 1. The unit impulse response of a filter H is the signal H(δ). For a transversal filter Hw, the unit impulse response is the signal which replays w = (w1, . . . , wL)′ at times n = 0, 1, ..., L − 1:

   (Hw(δ))(n) = wi   if n = i − 1 (i = 1, ..., L),
   (Hw(δ))(n) = 0    else.   (127)
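A minimal numpy sketch of Equation (125) (my own illustration; the finite input excerpt is padded with zeros before time 0, whereas the definition speaks of two-sided infinite signals):

```python
import numpy as np

def transversal_filter(w, x):
    """Apply the transversal filter with weight (row) vector w of length L to a
    finite excerpt x of the input signal: y(n) = sum_i w[i] * x(n - i)."""
    L = len(w)
    xp = np.concatenate([np.zeros(L - 1), np.asarray(x, dtype=float)])
    y = np.zeros(len(x))
    for n in range(len(x)):
        window = xp[n:n + L][::-1]      # (x(n), x(n-1), ..., x(n-L+1))
        y[n] = np.dot(w, window)
    return y

# The unit impulse response replays the weight vector, cf. Equation (127):
w = np.array([0.5, 0.3, 0.2])
delta = np.zeros(6); delta[0] = 1.0
print(transversal_filter(w, delta))     # [0.5 0.3 0.2 0.  0.  0. ]
```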

Figure 62 shows the structure of a transversal filter in the graphical notation used in signal processing.


Figure 62: A transversal filter (black parts) and an adaptive linear combiner (blue parts). Boxes labeled z−1 are what signal processing people call “unit delays” — elementary filters which delay an input signal by one timestep. The triangular boxes mean “multiply with”.

The theme of this section is online adaptive modeling. In the context of filters, this means that the filter changes over time in order to keep up with changing situational conditions. Concretely, we consider scenarios of supervised training adaptation, where a teacher output signal is available. We denote this

“desirable” filter output signal by (d(n)). The objective of online adaptive filtering is to continually adapt the filter weights wi such that the filter output y(n) = w (x(n), . . . , x(n − L + 1))′ stays close to the desired output d(n). The situational conditions may change both with respect to the input signal x(n), which might change its statistical properties, and/or with respect to the teacher d(n), which might also change. For getting continual optimal performance this implies that the filter weights must change as well: the filter weight vector w becomes a temporal variable w(n). Using the shorthand x(n) = (x(n), . . . , x(n − L + 1))′ for the last L inputs up to the current x(n), this leads to the following online adaptation task:

Given at time n: the filter weight w(n−1) calculated at the end of the previous timestep; new data x(n), d(n).

Compute error at time n: ε(n) = d(n) − w(n − 1) x(n).

Adaptation goal: Compute new filter weights

w(n) = w(n − 1) + ∆(n)

such that the squared error

   ε^2(n + 1) = (d(n + 1) − w(n) x(n + 1))^2

at the next timestep can be predicted to be small. Notes:

• At the time of computing w(n), the next input/target pair x(n + 1), d(n + 1) is not yet known. The filter weights w(n) can only be optimized with regards to the expected next input/target pair x(n + 1), d(n + 1).

• Because the future signals x and d at time n + 1 and later are not yet known, the best one can do is to assume that the x and d signals in the next timesteps will follow the same rules as during the known past x(n), d(n), x(n − 1), d(n − 1), ... up to the present moment. The filter weights w(n) will (have to) be adapted such that the known squared errors that would be obtained in the past, namely ε^2(n − l) = (d(n − l) − w(n) x(n − l))^2 (where l = 0, 1, 2, ...), are minimized on average. Since the input-output rule may be changing with time, only a limited-horizon past can be used to minimize the average past error. We will later see that the best option is to use a discounted past: the filter weights w(n) will be optimized to minimize more recent squared errors more strongly than squared errors that lie further in the past. This will automatically happen if the modification ∆(n) is based only on the current error ε(n) = d(n) − w(n − 1) x(n). Information from earlier errors will already be incorporated in w(n − 1). This leads to an

error feedback based weight adaptation which is schematically shown in the blue parts of Figure 62. A diagonal arrow through a box is how parameter adaptation is depicted in such signal processing diagrams. Such continually adapted transversal filters are called adaptive linear combiners in the Farhang/Boroujeny textbook.

• The error signal ε^2(n) which the adaptive filter tries to keep low need not always be obtained by comparing the current model output w(n − 1) x(n) with a teacher d(n). Often the “error” signal is obtained in other ways than by comparison with a “teacher”. In any case, the objective for the adaptation algorithm is to keep the “error” amplitude at low levels. The error signal itself is the only source of information to steer the adaptation (compare Figure 62). Several examples in the next subsection will feature “error” signals which are not computed by comparison with a teacher signal.
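One simple and widely used choice for the modification ∆(n) is the LMS update, which is derived later in this section; as a preview, here is a hedged sketch of a single online adaptation step (function and parameter names are my own):

```python
import numpy as np

def adapt_step(w, x_window, d, mu=0.01):
    """One online adaptation step of an adaptive linear combiner.
    w: current weights (length L); x_window: (x(n), ..., x(n-L+1)); d: teacher d(n);
    mu: small adaptation rate. Returns the updated weights and the error eps(n)."""
    eps = d - np.dot(w, x_window)       # error at time n
    w_new = w + mu * eps * x_window     # Delta(n) depends only on the current error
    return w_new, eps
```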

11.2 Basic applications of adaptive linear combiners

Before explaining concrete algorithms for weight adaptation I want to highlight the importance and versatility of adaptive linear combiners with a choice of application examples, mostly taken from the Farhang/Boroujeny textbook.

11.2.1 System identification

This is the most basic task: reconstruct from x and teacher d a filter (“model system”, “system model”, “identification model”) whose output y approximates d. This kind of task is called system identification. A schematic block diagram for this kind of application is shown in Figure 63.


Figure 63: Schema of the system identification task.

The following game is played here. Some target system is monitored while it receives an input signal x and generates an output g which is (with added observation noise ν) observed; this observed signal output becomes the teacher d.

The model system is a transversal filter Hw whose weights are adapted such that the model system’s output y stays close to the observed output d of the target system.

Obtaining and maintaining a model system Hw is a task which occurs ubiquitously in system engineering tasks. The model system can be used for manifold purposes because it allows one to simulate the target system. Such simulation models are needed, for instance, for making predictions about the future behavior of the original system, or for assessing whether the target system might be running into malfunction modes. Almost every control or predictive maintenance task in electric or mechanical engineering requires system models. I proceed to illustrate the use of system models with two concrete examples taken from the Farhang/Boroujeny book.


Figure 64: Geological exploration via impulse response of learnt earth model. A. Physical setup. B. Analysis of impulse response.

Geological exploration. In geological prospecting, before one digs expensive holes, one wishes to obtain an idea of what’s what underground. Figure 64 shows the basic principle of exploring the structure of the ground under the earth surface. At some point A, the earth surface is excited by a strong acoustic signal (explosion or large vibrating mass). An earth microphone is placed at a distant point B,

202 picking up a signal d. A model Hw (a “dummy earth”) is learnt. After Hw is obtained, one analyses the impulse response w of this model. The peaks of w give indications about reflecting layers in the earth crust between A and B, which correspond to different delayed responses pj of the input impulse.

Adaptive open-loop control. In general terms, an open-loop (or direct or inverse or feedforward) controller is a filter that generates an input signal u into a system (called “plant” in control engineering) such that the system output y follows a reference (or target) trajectory r as closely as possible. For example, when the plant is an electric motor which one desires to deliver a rotation speed (r(n)), generate a voltage signal (u(n)) such that the motor’s actual speed (y(n)) comes close to the target (r(n)). Since electric motors are nonlinear and have inertia effects which delay the speed response to input voltage changes, this is not necessarily a simple task. One way to achieve this objective is to maintain a transversal filter model Ĥ = H_{w(n)} of the plant H by online weight adaptation and analytically compute an inverse filter H_{w(n)}^{−1} for H_{w(n)}. The inverse G^{−1} of a filter G is a filter which cancels the effects of G, that is, for any input signal x, one has G^{−1}(G(x)) = x. Figure 65 shows the idea. Note that the inverse of a transversal filter Hw is not itself a transversal filter — it will be a filter that needs its own output fed back as additional input.


Figure 65: Schema of online adaptive direct control.

11.2.2 Inverse system identification

This is the second most basic task: given an unknown system H which on input d produces output x, learn an inverse system that on input x produces output d (note the reversal of variable roles and names). A typical setup is shown in Figure 66. This is an alternative to the analytical inversion of an estimated system model. Introducing the delay z^{−∆} is not always necessary but typically improves stability of the learnt system.


Figure 66: Schema of inverse system identification.

Equalization of a communication channel. A prime application of inverse system modelling is in digital telecommunication, where a binary signal s (a bit sequence) is distorted when it is passed through a noisy channel H, and should be un-distorted (“equalized”) by passing it through an equalizing filter H^{−1} (for short, called an equalizer). In order to train the equalizer, the correct signal s must be known by the receiver when the equalizer is trained. But of course, if s were already known, one would not need the communication in the first place. This hen-and-egg problem is often solved by using a predetermined, fixed training sequence s = d. From time to time (especially at the initialization of a transmission), the sender transmits s = d, which is known by the receiver and enables it to estimate an inverse channel model. But also while useful communication is taking place, the receiver can continue to train its equalizer, as long as the receiver is successful in restoring the binary signal s: in that case, the correctly restored signal s can be used for continued training. The overall setup is sketched in Figure 67.


Figure 67: Schema of adaptive online channel equalization. Delays (which one would insert for stability) are omitted.

Feedback error learning for a composite direct / feedback controller. Pure open-loop control cannot cope with external disturbances to the plant. For

example, if an electric motor has to cope with varying external loads, the resulting braking effects would remain invisible to the kind of open-loop controller shown in Figure 68. One needs to establish some sort of feedback control, where the observed mismatch between the reference signal r and the plant output y is used to insert corrective input to the plant. There are many ways to design feedback controllers. The “feedback controller” box in Figure 68 contains one of them without further specification. The scheme shown in the figure (proposed in Jordan and Wolpert (1999) in a nonlinear control context, using neural networks) trains an open-loop inverse controller in conjunction with the operation of a fixed feedback controller.


Figure 68: Schema of feedback error learning for a composite control system.

Some explanations for this ingenious architecture:

• The control input u(n) is the sum of the outputs ufb(n) of the feedback controller and uff(n) of the feedforward controller.

• There is no teacher signal d in this architecture. The feedforward controller is trained on an “error” signal which is defined differently than as a difference between a teacher and a system output. That is fine as well; the adaptation of the feedforward controller just attempts to minimize the squared “error” signal.

• If the feedforward controller works perfectly, the feedback controller detects no discrepancy between the reference r(n) and the plant output y(n) and therefore produces a zero output ufb(n) — that is, the feedforward controller sees zero error ε and does not change.

• When the feedback controller does not work perfectly, its output ufb(n) acts as an error signal for further adaptation of the feedforward controller. The feedforward controller tries to minimize this error — that is, it changes its way to generate output uff(n) such that the feedback controller’s output is minimized, that is, such that (r(n) − y(n))2 is minimized, that is, such that the control improves (an admittedly superficial explanation).

• When the plant characteristics change, or when external disturbances set in, the feedback controller jumps into action, inducing further adaptation of the feedforward controller.

11.2.3 Interference cancelling, “denoising”

Assume that there is a signal s + ν0 that is an additive mixture of a useful signal s and a noise component ν0. You want to cancel the interfering component ν0 from this mixture. Assume further that you also have another signal source ν1 that is mainly generated by the same noise source as ν0, but contains only weak traces of s. In this situation you may use the denoising scheme shown in Figure 69.


Figure 69: Schema of denoising filter.

Denoising has many applications, for instance in airplane cockpit crew communication (cancelling the acoustic airplane noises from pilot intercom speech), postprocessing of live concert recordings, or (like in one of the four suggested semester projects) cancelling the mother’s ECG signal from the unborn child’s in prenatal diagnostics. In the Powerpoint file denoisingDemo.pptx which you find on Nestor together with the lecture notes, you can find an acoustic demo that I once produced. Explanations:

• The “error” which the adaptive denoising filter tries to minimize is s+ν0 −y, where y is the filter output.

• The only information that the filter has available to achieve this error minimization is its input ν1. Because this input is (ideally) independent of s, but related to ν0 via some noise-to-noise filter, all that the filter can do is to subtract from s + ν0 whatever it finds in s + ν0 that correlates with ν1. Ideally, this is ν0. Then, the residual "error" would be just s.

• This scheme is interesting (not just a trivial subtraction of ν1 from s + ν0) because the mutual relationship between ν1 and ν0 may be complex, involving for instance a superposition of delayed versions of ν0.
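As an illustration, here is a minimal Python sketch of this noise cancelling scheme with a transversal filter adapted by the LMS rule that will be derived in Section 11.4. The signals, the noise paths and the filter length are invented for this example.

import numpy as np

# Minimal sketch of the denoising scheme of Figure 69 (hypothetical signals).
rng = np.random.default_rng(1)
N, L, mu = 20000, 8, 0.01
s = np.sin(0.03 * np.arange(N))                  # useful signal
noise = rng.standard_normal(N)                   # common noise source
nu0 = np.convolve(noise, [0.8, 0.4])[:N]         # noise path into the mixture
nu1 = np.convolve(noise, [1.0, -0.3])[:N]        # reference noise input
primary = s + nu0                                # the observed mixture s + nu0

w = np.zeros(L)
denoised = np.zeros(N)
for n in range(L - 1, N):
    x = nu1[n - L + 1:n + 1][::-1]   # last L reference samples
    y = w @ x                        # filter output, an estimate of nu0(n)
    e = primary[n] - y               # "error" = current estimate of s(n)
    w += 2 * mu * e * x              # LMS weight update (Section 11.4)
    denoised[n] = e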

More sophisticated methods for denoising are known today, often based on mathematical principles from independent component analysis (ICA). You find an acoustic demo on https://cnl.salk.edu/~tony/ica.html (the author of this page, Anthony J. Bell, is an ICA pioneer).

11.3 Iterative learning algorithms by gradient descent on performance surfaces

So far in this course, we haven't encountered iterative learning algorithms for supervised learning tasks (the EM algorithms which we already saw are iterative, but were used for the estimation of distribution models, which is an unsupervised task). The online adaptive model estimation algorithms studied in the present section, and the "backpropagation" algorithm for training neural networks presented in the next section, are the first (and only) iterative supervised learning algorithms treated in this course. Both work by gradient descent on performance surfaces. In this subsection I give a general explanation of what that means.

11.3.1 Iterative supervised learning algorithms

A typical iterative supervised learning algorithm unfolds as follows:

Given: Training data (x_i, y_i)_{i=1,...,N} where y_i ∈ S_Y; a parametric family of candidate models h_θ, where each parameter vector θ ∈ Θ defines a possible model ("decision function"); and a loss function L : S_Y × S_Y → R^{≥0}.

Wanted: An optimal model

\[
\theta_{opt} = \operatorname*{argmin}_{\theta \in \Theta} R^{emp}(\theta) = \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{N} \sum_{i=1,\ldots,N} L(h_\theta(x_i), y_i). \qquad (?)
\]

But reality strikes: The optimization problem (?) often cannot be directly solved, for instance because it is analytically intractable (the case for neural network training) or because the training data come in as a time series and their statistical properties change with time (as in adaptive online modeling).

Second best approach: Design an iterative algorithm which produces a sequence of models (= parameter vectors) θ^(0), θ^(1), θ^(2), ... with decreasing empirical risk R^emp(θ^(0)) > R^emp(θ^(1)) > .... The model θ^(n+1) is typically computed by an incremental modification of the previous model θ^(n). The first model θ^(0) is a "guess" provided by the experimenter.

In neural network training, the hope is that this series converges to a model θ^(∞) = lim_{n→∞} θ^(n) whose empirical risk is close to the minimal possible empirical risk. In online adaptive modeling, the hope is that if one incessantly and iteratively tries to minimize the empirical risk, one stays close to the moving target of the current best model (we'll see how that works).

This scheme only treats the approach where one tries to minimize the training loss. We know that in standard supervised learning settings this invites overfitting and that one should better employ a regularized loss function together with some cross-validation procedure, a complication that we ignore here. In online adaptive modeling, the empirical risk R^emp(θ) is time-varying, another complication that for the time being we will ignore. Regardless of such complications, the general rationale of iterative supervised learning algorithms is to compute a sequence of models with decreasing empirical risk.

In standard (not online adaptive) settings, such iterative algorithms, if they converge, can find only locally optimal models. A model θ^(∞) is locally optimal if every slight modification of it will lead to a higher empirical risk. The final converged model lim_{n→∞} θ^(n) will depend on the initial guess θ^(0); coming up with a method for good initial guesses was the crucial innovation that started the deep learning revolution (Hinton and Salakhutdinov, 2006).

11.3.2 Performance surfaces

First a quick recap: the graph of a function. Recall that the graph of a function f : A → B is the set {(a, f(a)) ∈ A × B | a ∈ A} of all argument-value pairs of the function. For instance, the graph of the square function f : R → R, x ↦ x^2 is the familiar parabola curve in R^2.

A performance surface is the graph of a risk function. Depending on the specific situation, the risk function may be the empirical risk, the risk, a risk defined at a particular time (in online adaptive scenarios), or some other "cost". I will use the symbol R to denote a generic risk function of whatever kind.

Performance surfaces are typically discussed with parametric model families where a candidate set of models H = {h_θ | θ ∈ Θ ⊆ R^k} is given, in which an optimal (= minimal risk R) model h_opt ∈ H is sought by a learning algorithm. In such parametric model families, models h_θ are usually and conveniently identified with their corresponding parameter vectors θ. The performance surface then becomes the graph of the function R : Θ → R^{≥0}. Specifically, if we have k-dimensional parameter vectors (that is, Θ ⊆ R^k), the performance surface is a k-dimensional hypersurface in R^{k+1}.

Other terms are variously used for performance surfaces, for instance performance landscape or error landscape or error surface or cost landscape or similar wordings.

Performance surfaces can be objects of stunning complexity. In neural network training (next section), they are tremendously complex. Figure 70 shows a neural network error landscape randomly picked from the web.

Figure 70: A (2-dimensional cross-section of) a performance surface for a neural network. The performance landscape shows the variation of the loss when (merely) 2 weights in the network are varied. Source: http://www.telesens.co/2019/01/16/neural-network-loss-visualization/

11.3.3 Iterative model learning by gradient descent on a performance surface

Often a risk function R : Θ → R^{≥0} is differentiable. Then, for every θ ∈ Θ the gradient of R with respect to θ = (θ_1, ..., θ_k)' is defined:

\[
\nabla R(\theta) = \left( \frac{\partial R}{\partial \theta_1}(\theta), \cdots, \frac{\partial R}{\partial \theta_k}(\theta) \right)'. \qquad (128)
\]

The gradient ∇R(θ) is the vector which points from θ in the direction of the steepest ascent ("uphill") of the performance surface. The negative gradient −∇R(θ) is the direction of the steepest descent (Figure 71).

The idea of model optimization by gradient descent is to iteratively move toward a minimal-risk solution θ^(∞) = lim_{n→∞} θ^(n) by always "sliding downhill" in the direction of steepest descent, starting from an initial model θ^(0). This idea is as natural and compelling as can be. Figure 72 shows one such itinerary. The general recipe for iterative gradient descent learning goes like this:

Given: A differentiable risk function R : Θ → R^{≥0}.

Wanted: A minimal-risk model θ_opt.

Start: Guess an initial model θ^(0).

Iterate until convergence: Compute models

\[
\theta^{(n)} = \theta^{(n-1)} - \mu\, \nabla R(\theta^{(n-1)}). \qquad (129)
\]

209 � � −∇R(�) 2

�1

Figure 71: A performance surface for a 2-dimensional model family with parame- ters θ1, θ2, with its contour plot at the bottom. For a model θ (yellow star in contour plot) the negative gradient is shown as black solid ar- row. It marks the direction of steepest descent (broken black arrow) on the performance surface.

The adaptation rate (or learning rate) µ is set to a small positive value.
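As a minimal illustration, here is a Python sketch of this recipe, assuming the gradient ∇R is available as a function; the quadratic risk and all numbers below are made up.

import numpy as np

def gradient_descent(grad_R, theta0, mu=0.01, n_iter=1000, tol=1e-8):
    """Iterate Equation (129): theta(n) = theta(n-1) - mu * grad_R(theta(n-1))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        step = mu * grad_R(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # crude stopping criterion
            break
    return theta

# Hypothetical risk R(theta) = (theta_1 - 1)^2 + 10 * theta_2^2, a quadratic bowl
theta_opt = gradient_descent(lambda th: np.array([2 * (th[0] - 1), 20 * th[1]]),
                             theta0=[5.0, 5.0], mu=0.05)
# theta_opt ends up close to (1, 0), the minimum of this risk surface.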

An obvious weakness of this elegant and natural approach is that the final model θ^(∞) depends on the choice of the initial model θ^(0). In complex risk landscapes (as the one shown in Figure 70) there is no hope of guessing an initial model which guarantees to end in the global minimum. This circumstance is generally perceived and accepted. There is a substantial mathematical literature that amounts to "if the initial model is chosen with a good heuristic, the local minimum that will be reached will be a rather good one with high probability". We will see that the real headaches with gradient descent optimization are of a different nature: specifically, there is an inherently difficult tradeoff between speed of convergence (one does not want to invest millions or even billions of iterations) and stability (the iterations must not lead to erratic large-sized jumps that lead

210 �(0) �(1) �(2)

�(∞)

Figure 72: A gradient descent itinerary, re-using the contour map from Figure 71 and starting from the initial point shown in that figure. Notice the variable jump length and the sad fact that from this initial model θ(0) the global minimum is missed. Instead, the itinerary slides toward a local minimum at θ(∞). The blue arrows show the negative gradient at raster points. They are perpendicular to the contour lines and their size is inversely proportional to the spacing between the contour lines.

away from the downhill direction). The adaptation rate µ plays a key role in this difficult tradeoff. For good performance (stability plus satisfactory speed of convergence), it must be adapted online while the iterations proceed. And doing that sagaciously is not trivial. The situation shown in Figure 72 is deceptively simplistic; don't think that gradient descent works as smoothly as that in real-life applications.

These difficulties raise their ugly head already in the simplest possible risk surfaces, namely the ones that arise with iterative solutions to linear regression problems. Making you aware of these difficulties in an analytically tractable, transparent setting is one of the two main reasons why I believe a machine learning student should know about adaptive online learning of transversal filters. You will learn a lot about the challenges in neural network training as a side effect. The other main reason is that adaptive online learning of transversal filters is really, really useful in many practical tasks.

11.3.4 The performance surface of a stationary online learning task for transversal filters

We will now return to the specific scenario of online adaptive filtering. Recall that we are facing a left-right infinite signal of real-valued input data (x(n))_{n∈Z} ∈ R^Z paired with a desired output signal (d(n))_{n∈Z} ∈ R^Z. Our goal is to compute, at every time n and on the basis of the input and teacher signals up to time n, an L-dimensional weight vector w(n) with which we want to minimize the squared error ε_{w(n)}^2(n + 1) = (d(n + 1) − w(n) x(n + 1))^2 at the next timestep.

For starters we will work under the assumption that the input and teacher statistics do not change with time. In mathematical terms, this means that the signals x and d are generated by a stationary stochastic process. That is, these signals come from random variables (X_n)_{n∈Z} and (D_n)_{n∈Z}. Each time that we would "run" this process, we would get another pair of signals x, d. To define what "stationary" means in our context, let

\[
[X^L D]_n := X_n \otimes X_{n-1} \otimes \cdots \otimes X_{n-L+1} \otimes D_n \qquad (130)
\]

be the combined RV which picks, at time n, the previous L inputs up to and including time n and the teacher at time n. We now say that the process is stationary with respect to these RVs [X^L D]_n if [X^L D]_k and [X^L D]_m have the same distribution for all times k and m. Intuitively speaking, this means that if we look at signal snippets x(k), d(k) and x(m), d(m) at different times m and k, the distribution of these snippets across different runs of the stochastic process will be the same at both times. The process does not change its statistical properties over time. If this condition is met, the expectations

\[
p := E[x(n)\, d(n)] = \big(E[x(n)d(n)], \ldots, E[x(n-L+1)d(n)]\big)' \qquad (131)
\]
\[
R := E[x(n)\, x'(n)] \qquad (132)
\]

are well-defined and independent of n. R is called (in the field of signal processing) the correlation matrix of the input process; it has size L × L. p is an L-dimensional vector.

In a stationary process there is no need to have different filter weights at different times. We can thus, for now, drop the time dependence from w(n) and consider unchanging weights w. For any such weight vector w we consider the expected squared error

\[
\begin{aligned}
R(w) = E[\varepsilon_w^2(n)] &= E[(d(n) - w\, x(n))\,(d(n) - w\, x(n))] \\
&= E[(d(n))^2] - 2\, w\, E[x(n)\, d(n)] + w\, E[x(n)\, x'(n)]\, w' \\
&= E[(d(n))^2] - 2\, w\, p + w\, R\, w'.
\end{aligned} \qquad (133)
\]

This expected squared error for a filter H_w is the risk that we want to minimize. We now take a close look at the performance surface, that is the graph of the risk function R : R^L → R. Its geometrical properties will turn out to be key for mastering the adaptive filtering task. Figure 73 gives a visual impression of the performance surface for the case of L = 2 dimensional weight vectors w = (w_1, w_2)'.

I mention without proof some basic geometric properties of the performance surface. The function (133) is a (multidimensional) quadratic function of w. The general form of a multidimensional quadratic function F : R^k → R for k-dimensional vectors x is F(x) = a + x'b + x'Cx, where a ∈ R is a constant offset, x'b is the linear term, and x'Cx is the quadratic term. C must be a positive definite or negative definite matrix. The graph of a multidimensional quadratic function shares many properties with the graph of the one-dimensional quadratic function f(x) = a + bx + cx^2. The one-dimensional quadratic function graph has the familiar shape of a parabola, which depending on the sign of c opens upwards or downwards. Similarly, the graph of a k-dimensional quadratic function has the shape of a k-dimensional paraboloid, which opens upwards if C is positive definite and downwards if C is negative definite. A k-dimensional paraboloid is a "bowl" whose vertical cross-sections are parabolas. The contour curves of a k-dimensional paraboloid (the horizontal cross-sections) are k-dimensional ellipsoids whose main axes lie in the directions of the k orthogonal eigenvectors u_1, ..., u_k of C.

In the 2-dimensional case of a performance surface shown in Figure 73, the paraboloid must open upwards because the risk R is an expected squared error and hence cannot be negative; in fact, the entire surface must be nonnegative. The figure also shows a projection of the ellipsoid contour curves of the paraboloid on the w_1-w_2 plane, together with the eigenvectors of R.

Importantly, such quadratic performance surfaces have a single minimum (provided that R has full rank, which we tacitly take for granted). This minimum marks the position of the optimal (minimal risk) weight vector w_opt = argmin_{w ∈ R^L} R(w). An iterative learning algorithm would generate a sequence

Figure 73: The performance surface, case of two-dimensional weight vectors (black parts of this drawing taken from drip.colorado.edu/~kelvin/links/Sarto Chapter2.ps many years ago, page no longer online). An iterative algorithm for weight determination would try to determine a sequence of weights ..., w(n), w(n+1), w(n+2), ... (green) that moves toward wopt (blue). The eigenvectors uj of the correlation matrix R lie on the central axes of the hyperellipsoids given by the level curves of the performance surface (red).

..., w(n), w(n+1), w(n+2), ... of weight vectors where each new w(n) would be positioned at a place where the performance surface is deeper than at the previous weight vector w(n−1) (green dots in the figure). The optimal weight vector w_opt can be calculated analytically, if one exploits the fact that the partial derivatives ∂R/∂w_i are all zero (only) for w_opt. The gradient ∇R is given by

\[
\nabla R = \left( \frac{\partial R}{\partial w_1}, \cdots, \frac{\partial R}{\partial w_L} \right)' = 2\, R\, w - 2\, p,
\]

which follows from (133) (exercise). Setting this to zero gives the Wiener-Hopf equation

\[
R\, w_{opt} = p, \qquad (134)
\]

which yields the optimal weights as

\[
w_{opt} = R^{-1}\, p. \qquad (135)
\]
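For concreteness, here is a small Python sketch that estimates R and p from data and solves the Wiener-Hopf equation (134); the signals and the "true" filter are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
N, L = 10000, 3
x = rng.standard_normal(N)
d = np.convolve(x, [0.5, -0.2, 0.1])[:N] + 0.05 * rng.standard_normal(N)

# Build the input vectors x(n) = (x(n), x(n-1), ..., x(n-L+1))'
X = np.stack([x[n - L + 1:n + 1][::-1] for n in range(L - 1, N)])
D = d[L - 1:N]

R = (X.T @ X) / len(X)         # estimate of E[x(n) x'(n)]
p = (X.T @ D) / len(X)         # estimate of E[x(n) d(n)]
w_opt = np.linalg.solve(R, p)  # Wiener-Hopf: R w_opt = p; close to (0.5, -0.2, 0.1)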

If you consider the definitions of R and p (Equations 132 and 131), you will find that this solution for w_opt is the risk version of the linear regression solution (19) for the minimal empirical risk from a quadratic error loss.

In order to appreciate the challenges inherent in iterative gradient descent on quadratic performance surfaces we have to take a closer look at the shape of the hyperparaboloid. First we use (133) and the Wiener-Hopf equation to express the expected residual error R_min (see Figure 73) that we are left with when we have found w_opt:

\[
\begin{aligned}
R_{min} &= E[(d(n))^2] - 2\, w_{opt}\, p + w_{opt}\, R\, w_{opt}' \\
&= E[(d(n))^2] - w_{opt}\, p \\
&= E[(d(n))^2] - w_{opt}\, R\, w_{opt}' \\
&= E[(d(n))^2] - p'\, R^{-1}\, p.
\end{aligned} \qquad (136)
\]

For a more convenient analysis we rewrite the error function R in new coordinates such that it becomes centered at the origin. Observing that the paraboloid is centered on w_opt, that it has "elevation" R_min over the weight space, and that the shape of the paraboloid itself is determined by w_opt R w_opt', we find that we can rewrite (133) as

\[
R(v) = R_{min} + v'\, R\, v, \qquad (137)
\]

where we introduced shifted (and transposed) weight coordinates v = (w − w_opt)'. Differentiating (137) with respect to v yields

\[
\frac{\partial R}{\partial v} = \left( \frac{\partial R}{\partial v_1}, \ldots, \frac{\partial R}{\partial v_L} \right)' = 2\, R\, v. \qquad (138)
\]

Since R is positive semi-definite, its SVD factorization is R = U D U' = U D U^{-1}, where the columns of U are made from L orthonormal eigenvectors of R and D is a diagonal matrix containing the corresponding eigenvalues (which are real and non-negative) on its diagonal. Note that the eigenvectors u_j of R lie on the main axes of the hyperellipsoid formed by the contour lines of the performance surface (see Figure 73, red arrows).

By left-multiplication of the shifted coordinates v = (w − w_opt)' with U' we finally get normal coordinates ṽ = U'v. The coordinate axes of the ṽ system are in the direction of the eigenvectors of R, and (138) becomes

\[
\frac{\partial R}{\partial \tilde{v}} = 2\, D\, \tilde{v} = 2\, (\lambda_1 \tilde{v}_1, \ldots, \lambda_L \tilde{v}_L)', \qquad (139)
\]

from which we get the second derivatives

\[
\frac{\partial^2 R}{\partial \tilde{v}^2} = 2\, (\lambda_1, \ldots, \lambda_L)', \qquad (140)
\]

that is, the eigenvalues of R are (up to a factor of 2) the curvatures of the performance surface in the direction of the central axes of the hyperparaboloid. We will shortly see that the computational efficiency of gradient descent on the performance surface depends critically on these curvatures.

11.3.5 Gradient descent on a quadratic performance surface

Now let us do an iterative gradient descent on the performance surface, using the normal coordinates ṽ. The generic gradient descent rule (129) here is ṽ^(n+1) = ṽ^(n) − µ ∇R(ṽ^(n)), which spells out (observing (139)) to

\[
\tilde{v}^{(n+1)} = (I - 2\mu D)\, \tilde{v}^{(n)}. \qquad (141)
\]

Because I − 2µD is a diagonal matrix, this gradient descent operates on the L coordinates of ṽ without mutual coupling, yielding for every individual coordinate a separate update rule

\[
\tilde{v}_j^{(n+1)} = (1 - 2\mu\lambda_j)\, \tilde{v}_j^{(n)}. \qquad (142)
\]

This is a geometric sequence. If started in ṽ_j^(0), one obtains

\[
\tilde{v}_j^{(n)} = (1 - 2\mu\lambda_j)^n\, \tilde{v}_j^{(0)}. \qquad (143)
\]

If we defined and carried out gradient descent in the original weight coordinates w^(n) via w^(n+1) = w^(n) − µ ∇R(w^(n)), we would get the same model sequence (up to the coordinate change w ↔ ṽ) as given by Equations 141, 142 and 143. The sequence w^(n) converges to w_opt if and only if ṽ_j^(n) converges for all 1 ≤ j ≤ L. Equation 141 implies that this happens if and only if |1 − 2µλ_j| < 1 for all j. These inequalities can be re-written equivalently as

\[
0 < \mu < 1/\lambda_j \quad \text{for all } j. \qquad (144)
\]

Specifically, we must make sure that 0 < µ < 1/λ_max, where λ_max is the largest eigenvalue of R. Depending on the size of µ, the convergence behavior of (141) can be grouped into four classes which may be referred to as overdamped, underdamped, and two types of unstable. Figure 74 illustrates how ṽ_j^(n) evolves in these four classes.

We can find an explicit representation of w^(n) if we observe that

\[
w^{(n)} = w_{opt} + v^{(n)} = w_{opt} + \sum_j u_j\, \tilde{v}_j^{(n)},
\]

where the u_j are again the orthonormal eigenvectors of R. Inserting (143) gives us

\[
w^{(n)} = w_{opt} + \sum_{j=1,\ldots,L} \tilde{v}_j^{(0)}\, u_j\, (1 - 2\mu\lambda_j)^n. \qquad (145)
\]

Figure 74: The development of ṽ_j^(n) [plotted on the y-axis] versus n [x-axis]. The qualitative behaviour depends on the stepsize parameter µ. a. Overdamped case: 0 < µ < 1/(2λ_j). b. Underdamped case: 1/(2λ_j) < µ < 1/λ_j. c. Unstable with µ < 0 and d. unstable with 1/λ_j < µ. All plots start with ṽ_j^(0) = 1.

This representation reveals that the convergence of w^(n) toward w_opt is governed by an additive overlay of L exponential terms, each of which describes convergence in the direction of one of the eigenvectors u_j and is determined in its convergence speed by λ_j and the stepsize parameter µ. One speaks of the L modes of convergence with geometric ratio factors (1 − 2µλ_j). If all eigenvalues are roughly equal, convergence rates are roughly identical in the L directions. If however two eigenvalues are very different, say λ_1 ≪ λ_2, and µ is small compared to the eigenvalues, then convergence in the direction of u_1 will be much slower than in the direction of u_2 (see Figure 75).

Next we turn to the question of how the error R evolves over time. Recall from (137) that R(v) = R_min + v' R v, which can be re-written as R(v) = R_min + ṽ' D ṽ. Thus the error in the n-th iteration is

\[
R^{(n)} = R_{min} + \tilde{v}'^{(n)}\, D\, \tilde{v}^{(n)} = R_{min} + \sum_j \lambda_j\, (1 - 2\mu\lambda_j)^{2n}\, (\tilde{v}_j^{(0)})^2. \qquad (146)
\]

For suitable µ (considering (144)), R^(n) converges to R_min. Plotting R^(n) yields a graph known as the learning curve. Equation 146 reveals that the learning curve is the sum of L decreasing exponentials (plus R_min).
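A few lines of Python make the modes of convergence and the learning curve tangible; the eigenvalues, stepsize and starting point below are invented for illustration.

import numpy as np

lam = np.array([1.0, 10.0])     # assumed eigenvalues of R
mu = 0.04                       # stepsize; satisfies 0 < mu < 1/lambda_max = 0.1
v0 = np.array([1.0, 1.0])       # initial point in normal coordinates
R_min = 0.0                     # minimal risk assumed zero, as in Figure 76

n = np.arange(30)[:, None]
v = (1 - 2 * mu * lam) ** n * v0                     # Eq. (143), per coordinate
learning_curve = R_min + (lam * v**2).sum(axis=1)    # Eq. (146)
# The mode with lambda = 10 shrinks by factor |1 - 2*0.04*10| = 0.2 per step,
# the mode with lambda = 1 only by 0.92; soon the slow mode dominates the curve.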

Figure 75: Two quite different modes of convergence (panel a.) versus rather similar modes of convergence (panel b.). Plots show contour lines of the performance surface for two-dimensional weights w = (w_1, w_2). Violet dotted lines indicate some initial steps of weight evolution.

What this learning curve looks like depends on the size of µ relative to the eigenvalues λ_j. If 2µλ_j is close to zero for all j, the learning curve separates into sections that are each determined by the convergence of a single mode j. Figure 76 shows a three-mode learning curve for the case of small µ, rendered in linear and logarithmic scale.

Figure 76: A learning curve with three modes of convergence, in linear (a.) and logarithmic (b.) scaling. This plot shows the qualitative behavior of modes of convergence when µ is small. Rmin is assumed zero in these plots.

This separation of the learning curve into approximately linear sections (in logarithmic rendering) can be mathematically explained as follows. Each of the terms (1 − 2µλ_j)^{2n} is characterized by a time constant τ_j according to

\[
(1 - 2\mu\lambda_j)^{2n} = \exp(-n/\tau_j). \qquad (147)
\]

If 2µλ is close to zero, exp(−2µλ) is close to 1 − 2µλ and thus log(1 − 2µλ) ≈ −2µλ. Using this approximation, solving (147) for τ_j yields for the j-th mode a time

constant of

\[
\tau_j \approx \frac{1}{4\mu\lambda_j}.
\]

That is, the convergence rate (i.e. the inverse of the time constant) of the j-th mode is proportional to λ_j (for small µ). However, this analysis is meaningless for larger µ. If we want to maximize the speed of convergence, we should use significantly larger µ, as we will presently see. The final rate of convergence is dominated by the slowest mode of convergence, which is characterized by the geometrical sequence factor

\[
\max\{|1 - 2\mu\lambda_j| \mid j = 1,\ldots,L\} = \max\{|1 - 2\mu\lambda_{max}|,\, |1 - 2\mu\lambda_{min}|\}. \qquad (148)
\]

In order to maximize convergence speed, the learning rate µ should be chosen such that (148) is minimized. Elementary considerations reveal that this minimum is attained for |1 − 2µλ_max| = |1 − 2µλ_min|, which is equivalent to

\[
\mu_{opt} = \frac{1}{\lambda_{max} + \lambda_{min}}. \qquad (149)
\]

For this optimal learning rate, 1 − 2µ_opt λ_min is positive and 1 − 2µ_opt λ_max is negative, corresponding to the overdamped and underdamped cases shown in Figure 74. However, the two modes converge at the same speed (and all other modes are faster). Concretely, the optimal speed of convergence is given by the geometric factor

\[
\beta = 1 - 2\mu_{opt}\lambda_{min} = \frac{\lambda_{max}/\lambda_{min} - 1}{\lambda_{max}/\lambda_{min} + 1},
\]

which can be derived by substituting (149) for µ_opt. β has a value between 0 and 1. There are two extreme cases: if λ_max = λ_min, then β = 0 and we have convergence in a single step. As the ratio λ_max/λ_min increases, β approaches 1 and the convergence slows down toward standstill. The ratio λ_max/λ_min thus plays a fundamental role in limiting the convergence speed of steepest descent algorithms. It is called the eigenvalue spread.

All of the derivations in this subsection were done in the context of quadratic performance surfaces. This may seem a rather limited class of performance landscapes. However, it is a mathematical fact that in most (the mathematically correct term for "most" is "generic", which is a deep probability-theory concept) differentiable performance surfaces, the shape of the surface in the vicinity of a local minimum can be approximated arbitrarily well by a quadratic surface. The stability conditions and the speed of convergence toward local minima in generic gradient descent situations are therefore subject to the same conditions that we saw in this subsection. Specifically, the eigenvalue spread (of the quadratic approximation) around a local minimum sets stability limits to the achievable rate of convergence.

It is a sad fact that this eigenvalue spread is typically large in neural networks with many layers — today famously known as deep networks. For almost twenty years

(roughly, from 1985 when gradient descent training for neural networks became widely adopted, to 2006 when deep learning started) no general, good ways were known to achieve neural network training that was simultaneously stable and exhibited a fast enough speed of convergence. Only shallow neural networks (typically with a single hidden layer) could be effectively trained, limiting the applications of neural network modeling to not too nonlinear tasks. The deep learning revolution is based, among other factors, on an assortment of "tricks of the trade" to overcome the limitations of large eigenvalue spreads by clever modifications of gradient descent, which cannot work in its pure form. If you are interested — Section 8 in the deep learning "bible" (Goodfellow et al., 2016) is all about these refinements and modifications of, and alternatives to, pure gradient descent.

11.4 Stochastic gradient descent with the LMS algorithm

The update formula w(n+1) = w(n) − µ ∇R(w(n)) for steepest gradient descent is not applicable in practice because the gradient ∇R of the expected squared error R(w(n)) = E[ε_{w(n)}^2] is not known. Given filter weights w(n), we need to estimate E[ε_{w(n)}^2]. At first sight, one would need to estimate an expected squared error by collecting information over time – namely, to observe the ongoing filtering with weights w(n) for some time and then approximate E[ε_{w(n)}^2] by averaging over the errors seen in this observation interval. But we don't have this time – because we want to be efficient and update w at every time step; and furthermore, in online model estimation scenarios the error statistics may change over time.

One ruthless way out of this impasse is to just use the momentary squared error as an approximation to its expected value, that is, use the brutal approximation

\[
R^{(n)} \approx \varepsilon_{w(n)}^2(n) := (d(n) - w(n)\, x(n))^2. \qquad (150)
\]

The gradient descent update formula w(n+1) = w(n) − µ ∇R(w(n)) then becomes

\[
w(n+1) = w(n) - \mu\, \nabla \varepsilon_{w(n)}^2(n), \qquad (151)
\]

where we can calculate ∇ε_{w(n)}^2(n) as follows:

\[
\begin{aligned}
\nabla \varepsilon_{w(n)}^2(n) &= 2\, \varepsilon_{w(n)}(n)\, \nabla \varepsilon_{w(n)}(n) \\
&= 2\, \varepsilon_{w(n)}(n) \left( \frac{\partial \varepsilon_{w(n)}(n)}{\partial w_1}, \ldots, \frac{\partial \varepsilon_{w(n)}(n)}{\partial w_L} \right) \\
&= -2\, \varepsilon_{w(n)}(n) \left( \frac{\partial\, w(n)\, x(n)}{\partial w_1}, \ldots, \frac{\partial\, w(n)\, x(n)}{\partial w_L} \right) \quad [\text{use } \varepsilon_{w(n)}(n) = d(n) - w(n)\, x(n)] \\
&= -2\, \varepsilon_{w(n)}(n)\, (x(n), \ldots, x(n-L+1)) = -2\, \varepsilon(n)\, x'(n),
\end{aligned} \qquad (152)
\]

where in the last step we simplified the notation ε_{w(n)}(n) to ε(n). Inserting this into (151) gives

\[
w(n+1) = w(n) + 2\, \mu\, \varepsilon(n)\, x'(n), \qquad (153)
\]

which is the weight update formula of one of the most compact, cheap, powerful and widely used algorithms I know of. It is called the least mean squares (LMS) algorithm in signal processing, or the Widrow-Hoff rule in neuroscience, where the same weight adaptation rule has been (re-)discovered independently from the signal processing tradition. In fact, this online weight adaptation algorithm for linear regression has been independently discovered and re-discovered many times in many fields. For completeness, here are all the computations needed to carry out one full step of online filtering and weight adaptation with the LMS algorithm:

1. read in input and compute output: y(n) = w(n) x'(n),

2. compute current error: ε(n) = d(n) − y(n),

3. compute weight update: w(n+1) = w(n) + 2 µ ε(n) x'(n).
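These three steps translate into a few lines of Python. The following sketch runs LMS on a made-up system identification task; the filter length, stepsize and signals are invented.

import numpy as np

rng = np.random.default_rng(3)
N, L, mu = 20000, 4, 0.01
x = rng.standard_normal(N)                                       # input signal
d = np.convolve(x, [0.3, -0.5, 0.2, 0.1])[:N] + 0.01 * rng.standard_normal(N)

w = np.zeros(L)
for n in range(L - 1, N):
    xn = x[n - L + 1:n + 1][::-1]   # input vector x(n) = (x(n), ..., x(n-L+1))'
    y = w @ xn                      # 1. read in input and compute output
    eps = d[n] - y                  # 2. compute current error
    w = w + 2 * mu * eps * xn       # 3. weight update, Eq. (153)
# After convergence, w is close to the "true" filter (0.3, -0.5, 0.2, 0.1).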

One fact about the LMS algorithm should always be kept in mind: being a stochastic version of steepest gradient descent, the LMS algorithm inherits the problems connected with the eigenvalue spread of the input process X_n. If its eigenvalue spread is very large, the LMS algorithm will not work satisfactorily. As an aside, in my work with recurrent neural networks, I once tried out learning algorithms related to LMS. But the input signal to this learning algorithm had an eigenvalue spread of 10^14 to 10^16, so the beautiful LMS algorithm was entirely inapplicable.

Because of its eminent usefulness (if the input vector correlation matrix has a reasonably small eigenvalue spread), the LMS algorithm has been analysed in minute detail. I conclude this section by reporting the most important insights without mathematical derivations. At the same time I introduce some of the standard vocabulary used in the field of adaptive signal processing.

For starters, we again assume that [X^L D] is a stationary process (recall (130)). The evolution w(n) of the weights is now also a stochastic process, because the LMS weight update depends on the stochastic vectors x(n). One interesting question is how fast the LMS algorithm converges in comparison with the ideal steepest gradient descent algorithm ṽ^(n+1) = (I − 2µD) ṽ^(n) from (141). Because we now have a stochastic update, the vectors ṽ^(n) become random variables and one can only speak about their expected value E[ṽ^(n)] at time n. (Intuitive explanation: this value would be obtained if many (infinitely many in the limit) training runs of the adaptive filter would be carried out and in each of these runs, the value of ṽ^(n) at time n would be taken, and an average would be formed over all these ṽ^(n).) The following can be shown (using some additional assumptions, namely, that µ is small and that the signal (x(n)) has no substantial autocorrelation for time spans larger than L):

\[
E[\tilde{v}^{(n+1)}] = (I - 2\mu D)\, E[\tilde{v}^{(n)}]. \qquad (154)
\]

Rather to our surprise, if the LMS algorithm is used, the weights converge — on average across different trials — as fast to the optimal weights as when the ideal algorithm (141) is employed. Figure 77 shows an overlay of the deterministic development of weights according to (141) with one run of the stochastic gradient descent using the LMS algorithm.

Figure 77: Illustrating the similar performance of deterministic (pink) and stochastic (red) gradient descent.

The fact that on average the weights converge to the optimal weights by no means implies that R^(n) converges to R_min. To see why, assume that at some time n, the LMS algorithm actually would have found the correct optimal weights, that is, w^(n) = w_opt. What would happen next? Well, due to the random weight adjustment, these optimal weights would become misadjusted again in the next time step! So the best one can hope for asymptotically is that the LMS algorithm lets the weights w^(n) jitter randomly in the vicinity of w_opt. But this means that the effective best error that can be achieved by the LMS algorithm in the asymptotic average is not R_min but R_min + R_excess, where R_excess comes from the random scintillations of the weight update. It is intuitively clear that R_excess depends on the stepsize µ. The larger µ, the larger R_excess.

The absolute size of the excess error is not as interesting as the ratio M = R_excess/R_min, that is, the relative size of the excess error compared to the minimal error. The quantity M is called the misadjustment and describes what fraction of the residual error R_min + R_excess can be attributed to the random oscillations effected by the stochastic weight update [i.e., R_excess], and what fraction is inevitably due to inherent limitations of the filter itself [i.e., R_min]. Notice that R_excess can in principle be brought to zero by tuning down µ toward zero — however, that would be at odds with the objective of fast convergence.

Under some assumptions (notably, small M) and using some approximations

(cf. Farhang-Boroujeny, Section 6.3), the misadjustment can be approximated by

\[
M \approx \mu\, \mathrm{trace}(R), \qquad (155)
\]

where the trace of a square matrix is the sum of its diagonal elements. The misadjustment is thus proportional to the stepsize and can be steered by setting the latter, if trace(R) is known. Fortunately, trace(R) can be estimated online from the sequence (x(n)) simply and robustly [how? — easy exercise].

Another issue that one always has to be concerned about in online adaptive signal processing is stability. We have seen in the treatment of the ideal case that the adaptation rate µ must not exceed 1/λ_max in order to guarantee convergence. But this result does not directly carry over to the stochastic version of gradient descent, because it does not take into account the stochastic jitter of the gradient descent, which is intuitively likely to be harmful for convergence. Furthermore, the value of λ_max cannot be estimated robustly from few data points in a practical situation. Using again middle-league maths and several approximations, in the book of Farhang-Boroujeny the following upper bound for µ is derived:

\[
\mu \le \frac{1}{3\, \mathrm{trace}(R)}. \qquad (156)
\]

If this bound is respected, the LMS algorithm converges stably. In practical applications, one often wishes to achieve an initial convergence that is as fast as possible: this can be done by using µ close to the stability boundary (156). After some time, when a reasonable degree of convergence has been attained, one wishes to control the misadjustment M; then one switches into a control mode where µ is adapted dynamically according to µ = M/trace(R), which follows from (155).

Up to here we have been analyzing LMS under the assumption of a stationary process. But the real working arena for LMS are nonstationary processes, where the objective is to track the changing statistics of the [X^L D]_n process by continually adapting the model w(n). In this situation one still uses the same LMS rule (153). However, roughly speaking, the modeling error R^(n) is now a sum of three components: R^(n) = R_min(n) + R_excess(n) + R_lag(n), all of which are temporally changing. The new component R_lag(n) reflects the fact that iterative model adaptation always needs time to converge to a certain error level — when the signal statistics change with time, the model adaptation always lags behind the changing statistics. A rigorous analysis of this situation is beyond the scope of this course. In practice one must tune µ online such that a good compromise is maintained between, on the one hand, making R_lag(n) small (which means speeding up convergence by increasing µ, in order to not lag behind the changing input statistics too much), and on the other hand, minimizing R_excess(n) (which means small µ); all this while watching out for staying stable.

The LMS algorithm has been the workhorse of adaptive signal processing for some 50 years, and numerous refinements and variants have been developed. Here are some:

1. An even simpler stochastic gradient descent algorithm than LMS uses only the sign of the error in the update: w(n+1) = w(n) + 2 µ sign(ε(n)) x'(n). If µ is a power of 2, this algorithm does not need a multiplication (a shift does it then) and is suitable for very high throughput hardware implementations which are often needed in communication technology. There exist yet other "sign-simplified" versions of LMS [cf. Farhang-Boroujeny p. 169].

2. Online stepsize adaptation: at every update use a locally adapted stepsize µ(n) ∼ 1/‖x‖². This is called normalized LMS (NLMS). In practice this pure NLMS is apt to run into stability problems; a safer version is µ(n) ∼ µ0/(‖x‖² + ψ), where µ0 and ψ are hand-tuned constants [Farhang-B. p. 172] (a minimal sketch follows after this list). In my own experience, normalized LMS sometimes works wonders in comparison with vanilla LMS.

3. Include a whitening mechanism into the update equation: w(n+1) = w(n) + 2µ R^{-1} ε(n) x'(n). This Newton-LMS algorithm has a single mode of convergence, but a problem is to obtain a robust (noise-insensitive) approximation to R^{-1}, and to get that cheaply enough [Farhang-B. p. 210].

4. Block implementations: for very long filters (say, L > 10, 000) and high update rates, even LMS may become too slow. Various computationally efficient block LMS algorithms have been designed in which the input stream is partitioned into blocks, which are processed in the frequency domain and yield weight updates after every block only [Farhang-B. p. 247ff].
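Here is a minimal Python sketch of the safeguarded normalized LMS variant from item 2; µ0 and ψ are the hand-tuned constants mentioned there, with made-up values, and the signals are again invented.

import numpy as np

rng = np.random.default_rng(4)
N, L = 20000, 4
mu0, psi = 0.5, 1e-3
x = rng.standard_normal(N)
d = np.convolve(x, [0.3, -0.5, 0.2, 0.1])[:N]

w = np.zeros(L)
for n in range(L - 1, N):
    xn = x[n - L + 1:n + 1][::-1]
    eps = d[n] - w @ xn
    mu_n = mu0 / (xn @ xn + psi)   # locally adapted stepsize mu(n)
    w += 2 * mu_n * eps * xn       # LMS update with the adapted stepsize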

To conclude this section, it should be said that besides LMS algorithms there is another major class of online adaptive algorithms for transversal filters, namely recursive least squares (RLS) adaptation algorithms. RLS algorithms are not steepest gradient-descent algorithms. The background metaphor of RLS is not to minimize R(w) but to minimize the accumulated squared error up to the current time, ζ(n) = Σ_{i=1,...,n} (d(i) − y(i))², so the performance surface we know from LMS plays no role for RLS. The main advantages and disadvantages of LMS vs. RLS are:

1. LMS has computational cost O(L) per update step, where L is the filter length; RLS has cost O(L²). Also the space complexity of RLS is an issue for long filters because it is O(L²).

2. LMS is numerically robust when set up diligently. RLS is plagued by numerical instability problems that are not easy to master.

3. RLS has a single mode of convergence and converges faster than LMS, very much faster when the input signal has a high eigenvalue spread.

4. RLS is more complicated than LMS and more difficult to implement in robust, stable ways.

5. In applications where fast tracking of highly nonstationary systems is required, LMS may have better tracking performance than RLS (says Farhang-Boroujeny).

The use of RLS in signal processing has been boosted by the development of fast RLS algorithms which reach a linear time complexity in the order of O(20 L) [Farhang-B. Section 13]. Both LMS and RLS algorithms play a role in a field of recurrent neural networks called reservoir computing (Jaeger, 2007), which happens to be one of my personal professional playgrounds. In reservoir computing, the training of neural networks is reduced to computing a linear regression. Reservoir computing has recently become particularly relevant for low-power microchip hardware implementations of neural networks. If time permits I will give an introduction to this emerging field in a tutorial or extra session.

12 Feedforward neural networks: the Multilayer Perceptron

Artificial neural networks (ANNs) have been investigated for more than half a century in two scientific domains:

In computational neuroscience, ANNs are investigated as mathematical abstractions and computer simulation models of biological neural systems. These models aim at biological plausibility and serve as a research vehicle to better understand information processing in real brains.

In machine learning, ANNs are used for creating complex information processing architectures whose function can be shaped by training from training sample data. The goal here is to solve complex learning tasks in a data engineering spirit, aiming at models that combine good generalization with highly nonlinear data transformations.

Historically these two branches of ANN research had been united. The ancestor of all ANNs, the perceptron of Rosenblatt (1958), was a computational model of optical character recognition (as we would say today) which was explicitly inspired by design motifs imported from the human visual system (check out Wikipedia on “perceptron”). In later decades the two branches diverged further and further from each other, despite repeated and persistent attempts to re-unite them. Today most ANN research in machine learning has more or less lost its connections to biological origins. In this course we only consider ANNs in machine learning. Even if we only look at machine learning, ANNs come in many kinds and variations. The common denominator for most (but not all) ANNs in ML can be summarized as follows.

• An ANN is composed of a (possibly large) number of interconnected process- ing units. These processing units are called “neurons” or just “units”. Each such unit typically can perform only a very limited computational operation, for instance applying a fixed nonlinear function to the sum of its inputs.

• The units of an ANN are connected to each other by links called “synaptic connections” (an echo of the historical past) or just “connections” or “links”.

• Each of the links is weighted with a parameter called "synaptic weight", "connection weight", or just "weight" of the link. Thus, in total an ANN can be represented as a labelled graph whose nodes are the neurons and whose edges are the links, labelled with their weights. The structure of an ANN with L units can thus be represented by its L × L sized connection weight matrix W, where the entry W(i, j) = w_ij is the weight on the link leading from unit j to unit i. When w_ij = 0, then unit j has no connection to unit i. The nonzero elements in W therefore determine the network's connection topology. Often it is more convenient to split the global weight matrix into submatrices, one for each "layer" of weights. In what follows I will use the generic symbol θ as the vector of all weights in an ANN.

• Computing with an ANN means to compute a real-valued activation x_i for each unit i (i = 1, ..., L). In an ANN with L units, all of these activations together are combined into a state vector x ∈ R^L.

• The calculation of the state vector in a feedforward ANN depends on the input and connection weights, so we may write x = f(u, θ).

• The state calculation function f is almost always local: the activation x_i of unit i depends only on the activations x_j of units j that "feed into" i, that is, where w_ij ≠ 0.

• The external functionality of an ANN results from the combined local interactions between interconnected units. Very complex functionalities may thus arise from the structured local interaction between large numbers of simple processing units. This is, in a way, analogous to Boolean circuits – and indeed some ANNs can be mapped on Boolean circuits. In fact, the famous paper which can be regarded as the starting shot for computational neuroscience, A logical calculus of the ideas immanent in nervous activity (McCulloch and Pitts, 1943), compared biological brains directly to Boolean circuits.

• The global functionality of an ANN is determined by the connection weights θ. Absolutely drastic changes in functionality can be achieved by changing the entries in θ. For instance, an ANN with a given topology can serve as a recognizer of handwritten digits with some θ, and can serve as a recognizer of facial expressions with another θ' wherein only a fraction of the weights have been replaced.

• The hallmark of ANNs is that their functionality is learnt from training data. Most learning procedures that are in use today rely on some sort of iterative model estimation with a flavor of gradient descent.

This basic scenario allows for an immense spectrum of different ANNs, which can be set up for tasks as diverse as dimension reduction and data compression, approximate solving of NP-hard optimization problems, time series prediction, nonlinear control, game playing, dynamical pattern generation and many more.

In this course I give an introduction to a particular kind of ANNs called feedforward neural networks, or often also – for historical reasons – multilayer perceptrons (MLPs). MLPs are used for the supervised learning of vectorial input-output tasks. In such tasks the training sample is of the kind (u_i, y_i)_{i=1,...,N}, where u ∈ R^n, y ∈ R^k are drawn from a joint distribution P_{U,Y}.

Note that in this section my notation departs from the one used in earlier sections: I now use u instead of x to denote input patterns, in order to avoid confusion with the network states x.

The MLP is trained to produce outputs y ∈ R^k upon inputs u ∈ R^n in a way that this input-output mapping is similar to the relationships u_i ↦ y_i found in the training data. Similarity is measured by a suitable loss function.

Supervised learning tasks of this kind – which we have already studied in previous sections – are generally called function approximation tasks or regression tasks. It is fair to say that today MLPs and their variations are the most widely used workhorse tool in machine learning when it comes to learning nonlinear function approximation models.

An MLP is a neural network structure equipped with n input units and k output units. An n-dimensional input pattern u can be sent to the input units, then the MLP does some interesting internal processing, at the end of which the k-dimensional result vector of the computation can be read from the k output units. An MLP N with n input units and k output units thus instantiates a function N : R^n → R^k. Since this function is shaped by the synaptic connection weights θ, one could also write N_θ : R^n → R^k if one wishes to emphasize the dependence of N's functionality on its weights.

The learning task is defined by a loss function L : R^k × R^k → R^{≥0}. As we have seen before, a convenient and sometimes adequate choice for L is the quadratic loss L(N(u), y) = ‖N(u) − y‖², but other loss functions are also widely used. Chapter 6.2 in the deep learning bible Goodfellow et al. (2016) gives an introduction to the theory of which loss functions should be used in which task settings. Given the loss function, the goal of training an MLP is to find a weight vector θ_opt which minimizes the empirical loss, that is

\[
\theta_{opt} = \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} L(\mathcal{N}_\theta(u_i), y_i), \qquad (157)
\]

where Θ is a set of candidate weight vectors. We have seen variants of this formula many times by now in these lecture notes!

"Function approximation" sounds dry and technical, but many kinds of learning problems can be framed as function approximation learning. Here are some examples:

Pattern recognition: inputs u are vectorized representations of any kind of "patterns", for example images, soundfiles, stock market time series. Outputs y are hypothesis vectors of the classes that are to be recognized.

Time series prediction: inputs are vector encodings of a past history of a temporal process, outputs are vector encodings of future observations of the process.

Denoising, restoration and pattern completion: inputs are patterns that are corrupted by noise or other distortions, outputs are cleaned-up or repaired or completed versions of the same patterns.

Data compression: Inputs are high-dimensional patterns, outputs are low-dimensional encodings which can be restored to the original patterns using a decoding MLP. The encoding and decoding MLPs are trained together.

Process control: In control tasks the objective is to send control inputs to a technological system (called "plant" in control engineering) such that the system performs in a desired way. The algorithm which computes the control inputs is called a "controller". Control tasks range in difficulty from almost trivial (like controlling a heater valve such that the room temperature is steered to a desired value) to almost impossible (like operating hundreds of valves and heaters and coolers and whatnots in a chemical factory such that the chemical production process is regulated to optimal quality and yield). The MLP instantiates the controller. Its inputs are settings for the desired plant behavior, plus optionally observation data from the current plant performance. The outputs are the control inputs which are sent to the plant.

This list should convince you that “function approximation” is a worthwhile topic indeed, and spending effort on learning how to properly handle MLPs is a good personal investment for any engineer or data analyst.

12.1 MLP structure

Figure 78 gives a schematic of the architecture of an MLP. It consists of several layers of units. Layers are numbered 0, ..., K, where layer 0 is comprised of the input units and layer K of the output units. The number of units in layer m is L_m. The units of two successive layers are connected in an all-to-all fashion by synaptic links (arrows in Figure 78). The link from unit j in layer m − 1 to unit i in layer m has a weight w^m_ij ∈ R. The layer 0 is the input layer and the layer K is the output layer. The intermediate layers are called hidden layers. When an MLP is used for a computation, the i-th unit in layer m will have an activation x^m_i ∈ R.

From a mathematical perspective, an MLP N implements a function N : R^{L_0} → R^{L_K}. Using the MLP and its layered structure, this function N(u) of an argument u ∈ R^{L_0} is computed by a sequence of transformations as follows:

1. The activations x^0_j of the input layer are set to the component values of the L_0-dimensional input vector u.

2. For m < K, assume that the activations x^{m−1}_j of units in layer m − 1 have already been computed (or have been externally set to the input values, in the case of m − 1 = 0). Then the activation x^m_i is computed from the formula

\[
x_i^m = \sigma \left( \sum_{j=1}^{L_{m-1}} w_{ij}^m\, x_j^{m-1} + w_{i0}^m \right). \qquad (158)
\]

Figure 78: Schema of an MLP with K − 1 hidden layers of neurons.

That is, x^m_i is obtained from linearly combining the activations of the lower layer with combination weights w^m_ij, then adding the bias w^m_{i0} ∈ R, then wrapping the obtained sum with the activation function σ. The activation function is a nonlinear, "S-shaped" function which I explain in more detail below. It is customary to interpret the bias w^m_{i0} as the weight of a synaptic link from a special bias unit in layer m − 1 which always has a constant activation of 1 (as shown in Figure 78).

Equation 158 can be more conveniently written in matrix form. Let x^m = (x^m_1, ..., x^m_{L_m})' be the activation vector in layer m, let b^m = (w^m_{10}, ..., w^m_{L_m 0})' be the vector of bias weights, and let W^m = (w^m_{ij})_{i=1,...,L_m; j=1,...,L_{m−1}} be the connection weight matrix for links between layers m − 1 and m. Then (158) becomes

\[
x^m = \sigma\left( W^m x^{m-1} + b^m \right), \qquad (159)
\]

where the activation function σ is applied component-wise to the activation vector.

3. The L_K-dimensional activation vector y of the output layer is computed from the activations of the pre-output layer m = K − 1 in various ways,

depending on the task setting (compare Chapter 6.2 in Goodfellow et al. (2016)). For simplicity we will consider the case of linear output units,

\[
y = x^K = W^K x^{K-1} + b^K, \qquad (160)
\]

that is, in the same way as it was done in the other layers except that no activation function is applied. The output activation vector y is the result y = N (u).
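The forward pass (159)-(160) is only a few lines of Python; the network size and the random weights below are placeholders, and tanh is used as the sigmoid.

import numpy as np

def mlp_forward(u, Ws, bs):
    """Ws, bs: weight matrices W^1..W^K and bias vectors b^1..b^K."""
    x = np.asarray(u, dtype=float)
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = np.tanh(W @ x + b)       # hidden layers, Eq. (159), with sigma = tanh
    return Ws[-1] @ x + bs[-1]       # linear output layer, Eq. (160)

# Example: an MLP with L0 = 3 inputs, one hidden layer of 4 units, LK = 2 outputs
rng = np.random.default_rng(5)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
y = mlp_forward([0.1, -0.5, 2.0], Ws, bs)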

The activation function σ is traditionally either the hyperbolic tangent (tanh) function or the logistic sigmoid given by σ(a) = 1/(1 + exp(−a)). Figure 79 gives plots of these two S-shaped functions. Functions of such shape are often called sigmoids. There are two grand reasons for applying sigmoids:

• Historically, neural networks were conceived as abstractions of biological neural systems. The electrical activation of a biological neuron is bounded. Applying the tanh bounds the activations of MLP “neurons” to the interval [−1, 1] and the logistic sigmoid to [0, 1]. This can be regarded as an abstract model of a biological property.

• Sigmoids introduce nonlinearity into the function N_θ. Without these sigmoids, N_θ would boil down to a cascade of affine linear transformations, hence in total would be merely an affine linear function. No nonlinear function could be learnt by such a linear MLP.

In the area of deep learning a drastically simplified "sigmoid" is often used, the rectifier function defined by r(a) = 0 for a < 0 and r(a) = a for a ≥ 0. The rectifier has somewhat less pleasing mathematical properties compared to the classical sigmoids but can be computed much more cheaply. This is of great value in deep learning scenarios where the neural networks and the training samples both are often very large and the training process requires very many evaluations of the sigmoid.

In intuitive terms, the operation of an MLP can be summarized as follows. After an input vector u is written into the input units, a "wave of activation" sweeps forward through the layers of the network. The activation vector x^m in each layer m is directly triggered by the activations x^{m−1} according to (159). The data transformation from x^{m−1} to x^m is a relatively "mild" one: just an affine linear map W^m x^{m−1} + b^m followed by a wrapping with the gentle sigmoid σ. But when several such mild transformations are applied in sequence, very complex "foldings" of the input vector u can be effected. Figure 80 gives a visual impression of what a sequence of mild transformations can do. Also the term "feedforward neural network" becomes clear: the activation wave spreads in a single sweep unidirectionally from the input units to the output units.

Figure 79: The tanh (blue), the logistic sigmoid (green), and the rectifier function (red).

12.2 Universal approximation and "deep" networks

One reason for the popularity of MLPs is that they can approximate arbitrary functions f : R^n → R^k. Numerous results on the approximation qualities of MLPs have been published in the early 1990s. Such theorems have the following general format:

Theorem (schematic). Let F be a certain class of functions f : R^n → R^k. Then for any f ∈ F and any ε > 0 there exists a multilayer perceptron N with one hidden layer such that ‖f − N‖ < ε.

Such theorems differ with respect to the classes F of functions that are approximated and with respect to the norms ‖·‖ that measure the mismatch between two functions. All practically relevant functions belong to classes that are covered by such approximation theorems. In a summary fashion it is claimed that MLPs are universal function approximators. Again, don't let yourself be misled by the dryness of the word "function approximator". Concretely, the universal function approximation property of MLPs would spell out, for example, to the (proven) statement that any task of classifying pictures can be solved to any degree of perfection by a suitable MLP.

The proofs for such theorems are typically constructive: for some target function f and tolerance ε they explicitly construct an MLP N such that ‖f − N‖ < ε. However, these constructions have little practical value because the constructed MLPs N are far too large for any practical implementation. You can find more details concerning such approximation theorems and related results in my legacy ML lecture notes http://minds.jacobs-university.de/uploads/teaching/lectureNotes/LN ML Fall11.pdf, Section 8.1.

Even when the function f that one wants to train into an MLP is very complex (highly nonlinear and with many "folds"), it can be in principle approximated with

1-hidden-layer MLPs. However, when one employs MLPs with more hidden layers, the required overall size of the MLP (quantified by the total number of weights) is dramatically reduced (Bengio and LeCun, 2007). Even super-complex target functions can be represented by MLPs of feasible size when enough layers are used (one motivation to consider the subnetworks of 17 layers used in the TICS system described in Section 1.2.1).

Unfortunately, it is not easy to train deep networks (which is just another word for "many hidden layers"). This basic insight had not yet been made in the 1980s, when the easy to train non-deep ("shallow") MLPs became popular. Traditional learning algorithms could only train shallow MLPs; attempts to scale up to larger numbers of hidden layers, more complex data sets and tasks largely failed, due to slow or not well-behaved convergence and poor model quality. Since about 2006, an accumulation of clever "tricks of the trade" plus the availability of powerful (GPU-based) computing hardware has overcome these hurdles and made the training of deep networks affordable. This area is now one of the most thriving fields of ML and has become widely known under the name of deep learning.

Figure 80: Illustrating the power of iterating a simple transformation. The baker transformation (also known as horseshoe transformation) takes a 2-dimensional rectangle, stretches it and folds it back onto itself. The bottom right diagram visualizes a set that is obtained after numerous baker transformations (plus some mild nonlinear distortion). Diagrams taken from Savi (2016) and http://www.motherearthliving.com/cooking-methods/celebrity-chefs-pizza-dough-recipes.aspx

12.3 Training an MLP with the backpropagation algorithm

In this section I give an overview of the main steps in training a non-deep MLP for a mildly nonlinear task — tasks that can be solved with an MLP with one or two hidden

. methods/celebrity-chefs-pizza-dough-recipes.aspx 233 osso transformation horseshoe many ae 2- a takes ) hidden baker layers. When it comes to unleash deep networks on gigantic training corpora for hypercomplex tasks, the basic recipes given in this section will not suffice. You would need to train yourself first in the art of deep learning, investing at least a full university course’s effort (this is an advertisment of Marco Wiering’s Deep Learning course in summer!). However, what you can learn in this section is necessary background for surviving in the wilderness of deep learning.

12.3.1 General outline of an MLP training project

Starting from a task specification and access to training data (given to you by your boss or a paying customer, with the task requirements likely spelled out in rather informal terms, like "please predict the load of our internet servers for the next 3 minutes"), a basic training procedure for an elementary MLP N goes as follows.

1. Get a clear idea of the formal nature of your learning task. Do you want a model output that is a probability vector? or a binary decision? or a maximally precise transformation of the input? how should "precision" best be measured? and so forth. Only proceed with using MLPs if they really look like a suitable model class for your problem.

2. Decide on a loss function. Go for a simple quadratic loss if you want a quick baseline solution but be prepared to invest into other loss functions if you have enough time and knowledge (read Chapter 6.2 in Goodfellow et al. (2016)).

3. Decide on a regularization method. For elementary MLP training, suitable candidates are adding an L2 norm regularizer to your loss function; extending the size of the training data set by adding noisy / distorted exemplars; varying the network size; or early stopping (= stop gradient-descent training of an overfitting-enabled ANN when the validation error starts increasing; this requires continual validation during gradient descent). Demyanov (2015) is a solid guide for ANN regularization.

4. Think of a suitable vector encoding of the available training data, including dimension reduction if that seems advisable (it is advisable in all situations where the ratio raw data dimension / size of training data set is high).

5. Fix an MLP architecture. Decide how many hidden layers the MLP shall have, how many units each layer shall have, what kind of sigmoid is used and what kind of output function and loss function. The structure should be rich enough that data overfitting becomes possible and your regularization method can kick in.

6. Set up a cross-validation scheme in order to optimize the generalization performance of your MLPs. Recommended: implement a semi-automated k-fold cross-validation scheme which is built around two subroutines, (1) "train an MLP of certain structure for minimal training error on given training data", (2) "test MLP on validation data". (A minimal code sketch of such a scheme follows after this list.)

7. Implement the training and testing subroutines. The training will be done with a suitable version of the error backpropagation algorithm, which will require you to meddle with some global control parameters (like learning rate, stopping criterion, initialization scheme, and more).

8. Do the job. Enjoy the powers, and marvel at the wickedness, of ANNs.
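To make steps 3, 6 and 7 a bit more concrete, here is a minimal Python sketch of the semi-automated k-fold cross-validation scheme recommended in step 6. The subroutines train_mlp and evaluate are placeholders assumed for this sketch (they are not defined in these lecture notes); any MLP training routine with your chosen loss and regularization can be plugged in.

```python
import numpy as np

def kfold_cross_validation(X, Y, candidate_architectures, train_mlp, evaluate, k=5):
    """Semi-automated k-fold cross-validation built around two subroutines:
    (1) train_mlp(arch, X_train, Y_train) -> trained MLP,
    (2) evaluate(model, X_val, Y_val)     -> validation loss.
    Returns the candidate architecture with the lowest mean validation loss."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    mean_val_loss = {}
    for arch in candidate_architectures:          # e.g. tuples of hidden layer sizes
        losses = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_mlp(arch, X[train_idx], Y[train_idx])
            losses.append(evaluate(model, X[val_idx], Y[val_idx]))
        mean_val_loss[arch] = float(np.mean(losses))
    best = min(mean_val_loss, key=mean_val_loss.get)
    return best, mean_val_loss
```

The candidate architectures here are assumed to be hashable descriptions (for instance, tuples of hidden layer sizes); the function simply returns whichever candidate achieves the lowest average validation loss.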

You see that “neural network training” is a multi-faceted thing and requires you to consider all the issues that always jump at you in supervised machine learning tasks. It will not miraculously give good results just because it has “neural networks inside”. The actual “learning” part, namely solving the optimization task (157), is only a subtask, albeit a conspicuous one because it is done with an algorithm that has risen to fame, namely the backpropagation algorithm.

12.3.2 The error backpropagation algorithm

ANNs are trained by an iterative optimization procedure. In the classical, elementary case, this is pure gradient descent. I repeat the setup:

Given: Labelled training data in vector/vector format (obtained possibly after preprocessing raw data) (u_i, y_i)_{i=1,...,N}, where u ∈ R^n, y ∈ R^k.

Also given: a network architecture for models N_θ that can be specified by l-dimensional weight vectors θ ∈ Θ.

Also given: a loss function L : R^k × R^k → R^{≥0}.

Initialization: Choose an initial model θ^{(0)}. This needs an educated guess. A widely used strategy is to set all weight parameters w_i^{(0)} ∈ θ^{(0)} to small random values. Remark: For training deep neural networks this is not good enough; the deep learning field actually got kickstarted by a clever method for finding a good model initialization (Hinton and Salakuthdinov, 2006).
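As a small illustration of the initialization step, the following lines set all weights of a layered architecture to small random values. The scale 0.1, the use of NumPy, and the convention of storing the bias weights in an extra column are my own choices for this sketch, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [3, 10, 10, 2]   # example: 3 inputs, two hidden layers, 2 outputs
# one weight matrix per layer; the extra +1 column holds the bias weights
theta0 = [0.1 * rng.standard_normal((layer_sizes[m + 1], layer_sizes[m] + 1))
          for m in range(len(layer_sizes) - 1)]
```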

Iterate: Compute a series (θ^{(n)})_{n=0,1,...} of models of decreasing empirical loss (aka training error) by gradient descent. Concretely, with

$$R^{\mathrm{emp}}(\mathcal{N}_{\theta^{(n)}}) \;=\; \frac{1}{N} \sum_{i=1,\dots,N} L\big(\mathcal{N}_{\theta^{(n)}}(u_i),\, y_i\big) \qquad (161)$$

denoting the empirical risk of the model θ^{(n)}, and with

$$\nabla R^{\mathrm{emp}}(\mathcal{N}_{\theta^{(n)}}) \;=\; \left( \frac{\partial R^{\mathrm{emp}}}{\partial w_1}(\theta^{(n)}), \;\dots,\; \frac{\partial R^{\mathrm{emp}}}{\partial w_l}(\theta^{(n)}) \right)' \qquad (162)$$

being the gradient of the empirical risk with respect to the parameter vector θ = (w_1, ..., w_l)' at point θ^{(n)}, update θ^{(n)} by

$$\theta^{(n+1)} \;=\; \theta^{(n)} - \mu\, \nabla R^{\mathrm{emp}}(\mathcal{N}_{\theta^{(n)}}).$$

Stop when a stopping criterion chosen by you is met. This can be reaching a maximum number of iterations, or the empirical risk decrease falling under a predetermined threshold, or some early stopping scheme.
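The iteration just described can be summarized in a few lines of Python. This is only a sketch: the functions empirical_risk and empirical_risk_gradient are placeholders for the model-specific computations (the latter is what backprop will deliver below), θ is assumed to be a flat NumPy array, and the stepsize and stopping threshold are arbitrary illustrative values.

```python
import numpy as np

def gradient_descent(theta, empirical_risk, empirical_risk_gradient,
                     mu=0.005, max_iter=10000, tol=1e-6):
    """Plain gradient descent on the empirical risk, cf. Equations (161), (162)."""
    risk = empirical_risk(theta)
    for n in range(max_iter):
        theta = theta - mu * empirical_risk_gradient(theta)   # update step
        new_risk = empirical_risk(theta)
        if risk - new_risk < tol:      # stop when the risk decrease becomes tiny
            break
        risk = new_risk
    return theta
```

In minibatch training (discussed below), the gradient used in the update step would be averaged over a random subset of the training examples instead of the full sample.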

Computing a gradient is a differentiation task, and differentiation is easy. With high-school maths you would be able to compute (162). However, the gradient formula that you get from textbook recipes would be too expensive to compute. The error backpropagation algorithm (or simply "backprop" for the initiated, or even just BP if you want to have a feeling of belonging to this select community) is a specific algorithmic scheme to compute this gradient in a computationally efficient way.

The backpropagation algorithm became widely known as late as 1986 (historical remarks further below). In historical perspective it was the one and only decisive breakthrough for making ANNs practically usable. Every student of machine learning must have understood it in detail at least once in his/her life, even if later it is just downloaded from a toolbox in some more sophisticated fashion. Thus, brace yourself and follow along!

Let us take a closer look at the empirical risk (161). Its gradient can be written as a sum of gradients

$$\nabla R^{\mathrm{emp}}(\mathcal{N}_{\theta}) \;=\; \frac{1}{N}\, \nabla \!\left( \sum_{i=1,\dots,N} L(\mathcal{N}_{\theta}(u_i), y_i) \right) \;=\; \frac{1}{N} \sum_{i=1,\dots,N} \nabla L(\mathcal{N}_{\theta}(u_i), y_i),$$

and this is also how it is actually computed: the gradient ∇L(N_θ(u_i), y_i) is evaluated for each training data example (u_i, y_i) and the obtained N gradients are averaged. This means that at every gradient descent iteration θ^{(n)} → θ^{(n+1)}, all training data points have to be visited individually. In MLP parlance, such a sweep through all data points is called an epoch. In the neural network literature one finds statements like "the training was done for 120 epochs", which means that 120 average gradients were computed, and for each of these computations, N gradients for individual training example points (u_i, y_i) were computed. When training samples are large (as they should be), one epoch can clearly be too expensive. Therefore one often resorts to minibatch training, where for each gradient descent iteration only a subset of the total sample S is used.

The backpropagation algorithm is a subroutine in the gradient descent game. It is a particular algorithmic scheme for calculating the gradient ∇L(N_θ(u_i), y_i) for a single data point (u_i, y_i). Naive highschool calculations of this quantity incur a cost of O(l²) (where l is the number of network weights). When l is not extremely

small (it will almost never be extremely small: a few hundred weights will be needed for simple tasks, and easily a million for deep networks applied to serious real-life modeling problems), this cost O(l²) is too high for practical use (and it has to be paid N times in a single gradient descent step!). The backprop algorithm is a clever scheme for computing and storing certain auxiliary quantities which cuts down the cost from O(l²) to O(l). Here is how backprop works.

1. BP works in two stages. In the first stage, called the forward pass, the current network N_θ is presented with the input u and the output ŷ = N_θ(u) is computed using the "forward" formulas (159) and (160). During this forward pass, for each unit x_i^m which is not a bias unit and not an input unit, the quantity

$$a_i^m \;=\; \sum_{j=0,\dots,L_{m-1}} w_{ij}^m\, x_j^{m-1} \qquad (163)$$

is computed. This is sometimes referred to as the potential of unit x_i^m, that is, its internal state before it is passed through the sigmoid.

2. A little math in between. Applying the chain rule of calculus we have

$$\frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial w_{ij}^m} \;=\; \frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial a_i^m}\, \frac{\partial a_i^m}{\partial w_{ij}^m}. \qquad (164)$$

Define

$$\delta_i^m \;=\; \frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial a_i^m}. \qquad (165)$$

Using (163) we find

$$\frac{\partial a_i^m}{\partial w_{ij}^m} \;=\; x_j^{m-1}. \qquad (166)$$

Combining (165) with (166) we get

$$\frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial w_{ij}^m} \;=\; \delta_i^m\, x_j^{m-1}. \qquad (167)$$

Thus, in order to calculate the desired derivatives (164), we only need to compute the values of δ_i^m for each hidden and output unit.

3. Computing the δ's for output units. Output units x_i^K are typically set up differently from hidden units, and their corresponding δ values must be computed in ways that depend on the special architecture. For concreteness, here I stick with the simple linear units introduced in (160). The potentials a_i^K are thus identical to the output values ŷ_i and we obtain

$$\delta_i^K \;=\; \frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial \hat{y}_i}. \qquad (168)$$

This quantity is thus just the partial derivative of the loss with respect to the i-th output, which is usually simple to compute. For the quadratic loss L(N_θ(u), y) = ‖N_θ(u) − y‖², for instance, we get

$$\delta_i^K \;=\; \frac{\partial \|\mathcal{N}_\theta(u) - y\|^2}{\partial \hat{y}_i} \;=\; \frac{\partial \|\hat{y} - y\|^2}{\partial \hat{y}_i} \;=\; \frac{\partial (\hat{y}_i - y_i)^2}{\partial \hat{y}_i} \;=\; 2\,(\hat{y}_i - y_i). \qquad (169)$$

4. Computing the δ's for hidden units. In order to compute δ_i^m for 1 ≤ m < K we again make use of the chain rule. We find

$$\delta_i^m \;=\; \frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial a_i^m} \;=\; \sum_{l=1,\dots,L_{m+1}} \frac{\partial L(\mathcal{N}_\theta(u), y)}{\partial a_l^{m+1}}\, \frac{\partial a_l^{m+1}}{\partial a_i^m}, \qquad (170)$$

which is justified by the fact that the only path by which a_i^m can affect L(N_θ(u), y) is through the potentials a_l^{m+1} of the next higher layer. If we substitute (165) into (170) and observe (163) we get

$$\begin{aligned}
\delta_i^m &= \sum_{l=1,\dots,L_{m+1}} \delta_l^{m+1}\, \frac{\partial a_l^{m+1}}{\partial a_i^m} \\
&= \sum_{l=1,\dots,L_{m+1}} \delta_l^{m+1}\, \frac{\partial \sum_{j=0,\dots,L_m} w_{lj}^{m+1}\, \sigma(a_j^m)}{\partial a_i^m} \\
&= \sum_{l=1,\dots,L_{m+1}} \delta_l^{m+1}\, \frac{\partial\, w_{li}^{m+1}\, \sigma(a_i^m)}{\partial a_i^m} \\
&= \sigma'(a_i^m) \sum_{l=1,\dots,L_{m+1}} \delta_l^{m+1}\, w_{li}^{m+1}. \qquad (171)
\end{aligned}$$

This formula describes how the δ_i^m in a hidden layer can be computed by "back-propagating" the δ_l^{m+1} from the next higher layer. The formula can be used to compute all δ's, starting from the output layer (where (168) is used, or, in the special case of a quadratic loss, Equation 169), and then working backwards through the network in the backward pass of the algorithm.

When the logistic sigmoid σ(a) = 1/(1 + exp(−a)) is used, the computation of the derivative σ'(a_i^m) takes a particularly simple form, observing that for this sigmoid it holds that σ'(a) = σ(a)(1 − σ(a)), which leads to

$$\sigma'(a_i^m) \;=\; x_i^m\, (1 - x_i^m).$$
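To tie Equations (163)-(171) together, here is a minimal NumPy sketch of one forward and one backward pass for a single training example. It assumes logistic sigmoid hidden units, a linear output layer and the quadratic loss; the convention of storing the bias weights in column 0 of each weight matrix is my own choice for this illustration, not something prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_example(weights, u, y):
    """Gradient of the quadratic loss L = ||N_theta(u) - y||^2 for one example.

    `weights` is a list of matrices W[m], one per layer m = 1..K; each W[m] has
    shape (L_m, L_{m-1} + 1), where column 0 holds the bias weights (a bias
    unit x_0 = 1 is prepended to every layer's activation vector)."""
    # --- forward pass: store activations x[m] and potentials a[m] ---
    x = [np.concatenate(([1.0], u))]           # x[0]: input with bias unit
    for m, W in enumerate(weights, start=1):
        a = W @ x[m - 1]                       # Eq. (163): potentials of layer m
        if m < len(weights):                   # hidden layer: sigmoid + bias unit
            x.append(np.concatenate(([1.0], sigmoid(a))))
        else:                                  # linear output layer
            x.append(a)
    y_hat = x[-1]

    # --- backward pass: compute deltas and per-weight gradients ---
    grads = [None] * len(weights)
    delta = 2.0 * (y_hat - y)                  # Eq. (169): output deltas
    for m in range(len(weights), 0, -1):
        grads[m - 1] = np.outer(delta, x[m - 1])           # Eq. (167)
        if m > 1:
            xm1 = x[m - 1][1:]                              # drop bias unit
            # Eq. (171), with sigma'(a) = x (1 - x) for the logistic sigmoid
            delta = (xm1 * (1.0 - xm1)) * (weights[m - 1][:, 1:].T @ delta)
    return y_hat, grads

# tiny usage example: 2 inputs, 4 hidden units, 2 outputs
W = [np.random.randn(4, 3) * 0.1, np.random.randn(2, 5) * 0.1]
y_hat, grads = backprop_single_example(W, np.array([0.5, -1.0]), np.array([1.0, 0.0]))
```

Averaging the returned gradients over all (or a minibatch of) training examples gives the gradient of the empirical risk needed for one update step.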

Although simple in principle, and readily implemented, using the backprop algorithm appropriately is something of an art, even in basic shallow MLP training. There is only space here to hint at some difficulties:

• The stepsize µ must be chosen sufficiently small in order to avoid instabilities. But it also should be set as large as possible to speed up the convergence. We have seen the inevitable conflict between these two requirements in Section 11. In neural networks it is however not possible to provide an analytical treatment of how to set the stepsize optimally. Generally one uses adaptation schemes that modulate the learning rate as the gradient descent proceeds, using larger stepsizes in early epochs (a simple decay schedule is sketched after this list). Novel, clever methods for online adjustment of stepsize have been one of the enabling factors for deep learning. For shallow MLPs, typical stepsizes that can be fixed without much thinking are on the order of 0.001 to 0.01.

• Gradient descent on nonlinear performance landscapes may sometimes be cripplingly slow in "plateau" areas, still far away from the next local minimum, where the gradient is small in all directions, for reasons other than what we have seen for quadratic performance surfaces.

• Gradient-descent techniques on performance landscapes can only find a local minimum of the risk function. This problem can be addressed by various measures, all of which are computationally expensive. Some authors claim that the local minimum problem is overrated.
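There is no universal recipe for the stepsize schedule. As one hedged illustration of the "larger stepsizes in early epochs" idea mentioned above, a simple decaying schedule could look like this (the constants are arbitrary, not recommendations from the notes):

```python
def stepsize(n, mu0=0.01, tau=100.0):
    """Decaying learning rate: starts at mu0 and has roughly halved after tau epochs."""
    return mu0 / (1.0 + n / tau)
```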

The error backpropagation "trick" to speed up gradient calculations in layered systems has been independently discovered several times and in a diversity of contexts, not only in neural network or machine learning research. Schmidhuber (2015), Section 5.5, provides a historical overview. In its specific versions for neural networks, backpropagation apparently was first described by Paul Werbos in his 1974 PhD thesis, but remained unappreciated until it was re-described in a widely read collection volume on neural networks (Rumelhart et al., 1986). From that point onwards, backpropagation in MLPs developed into what today is likely the most widely used computational technique in machine learning. A truly beautiful visualization of MLP training has been pointed out to me by Rubin Dellialisi: playground.tensorflow.org/.

Simple gradient descent, as described above, is cheaply computed but may take long to converge and/or run into stability issues. A large variety of refined iterative loss minimization methods have been developed in the deep learning field which in most cases (though none in all cases) combine an acceptable speed of convergence with stability. Some of them refine gradient descent, others use information from the local curvature of the performance surface (and are therefore called "second-order" methods). The main alternatives are nicely sketched and compared at https://www.neuraldesigner.com/blog/5 algorithms to train a neural network (retrieved May 2017, local copy at http://minds.jacobs-university.de/uploads/teaching/share/NNalgs.zip).

That's it. I hope this course has helped you on your way into machine learning — that you have learnt a lot, that you enjoyed at least half of it, and that you will not forget the essential ten percent.

Appendix A Elementary mathematical structure-forming operations

A.1 Pairs, tuples and indexed families

If two mathematical objects O1, O2 are given, they can be grouped together in a single new mathematical structure called the ordered pair (or just pair) of O1, O2. It is written as (O1, O2).

In many cases, O1, O2 will be of the same kind, for instance both are integers. But the two objects need not be of the same kind. For instance, it is perfectly possible to group the integer O1 = 3 together with a random variable (a function!) O2 = X7 in a pair, getting (3, X7). The crucial property of a pair (O1, O2) which distinguishes it from the set {O1, O2} is that the two members of a pair are ordered, that is, it makes sense to speak of the "first" and the "second" member of a pair. In contrast, it makes no sense to speak of the "first" or "second" element of the set {O1, O2}. Related to this is the fact that the two members of a pair can be the same, for instance (2, 2) is a valid pair. In contrast, {2, 2} makes no sense. A generalization of pairs is N-tuples. For an integer N > 0, an N-tuple of N objects O1, O2, ..., ON is written as

(O1, O2,..., ON ). 1-tuples are just individual objects; 2-tuples are pairs, and for N > 2, N-tuples are also called lists (by computer scientists that is; mathematicians rather don’t use that term). Again, the crucial property of N-tuples is that one can identify its i-th member by its position in the tuple, or in more technical terminology, by its index. That is, in an N-tuple, every index 1 ≤ i ≤ N “picks” one member from the tuple. The infinite generalization of N-tuples is provided by indexed families. For any nonempty set I, called an index set in this context,

(Oi)i∈I denotes a compound object assembled from as many mathematical objects as there are index elements i ∈ I, and within this compound object, every individual member Oi can be “addressed” by its index i. One simply writes

Oi to denote the ith “component” of (Oi)i∈I . Writing Oi is a shorthand for applying the ith projection function on (Oi)i∈I , that is, Oi = πi((Oi)i∈I ).

A.2 Products of sets

We first treat the case of products of a finite number of sets. Let S1,...,SN be (any) sets. Then the product S1 × ... × SN is the set of all N-tuples of elements from the corresponding sets, that is,

$$S_1 \times \dots \times S_N = \{(s_1, \dots, s_N) \mid s_i \in S_i\}.$$

This generalizes to infinite products as follows. Let I be any set (we call it an index set in this context). For every i ∈ I, let S_i be some set. Then the product set indexed by I is the set of functions

$$\prod_{i \in I} S_i = \Big\{\varphi : I \to \bigcup_{i \in I} S_i \;\Big|\; \forall i \in I : \varphi(i) \in S_i\Big\}.$$

Using the notation of indexed families, this could equivalently be written as

$$\prod_{i \in I} S_i = \{(s_i)_{i \in I} \mid \forall i \in I : s_i \in S_i\}.$$

If all the sets S_i are the same, say S, then the product ∏_{i∈I} S_i = ∏_{i∈I} S is also written as S^I.

An important special case of infinite products is obtained when I = N. This situation occurs universally in modeling stochastic processes with discrete time. The elements n ∈ N are the points in time when the amplitude of some signal is measured. The amplitude is a real number, so at any time n ∈ N, one records an amplitude value a_n ∈ S_n = R. The product set

$$\prod_{n \in \mathbb{N}} S_n = \Big\{\varphi : \mathbb{N} \to \bigcup_{n \in \mathbb{N}} S_n \;\Big|\; \forall n \in \mathbb{N} : \varphi(n) \in S_n\Big\} = \{\varphi : \mathbb{N} \to \mathbb{R}\}$$

is the set of all right-infinite real-valued timeseries (with discrete time points starting at time n = 0).

A.3 Products of functions

First, again, the case of finite products: let f1, . . . , fN be functions, all sharing the same domain D, with image sets Si. Then the product f1 ⊗ ... ⊗ fN of these functions is the function with domain D and image set S1 × ... × SN given by

$$f_1 \otimes \dots \otimes f_N : D \to S_1 \times \dots \times S_N, \qquad d \mapsto (f_1(d), \dots, f_N(d)).$$

Again this generalizes to arbitrary products. Let (f_i : D → S_i)_{i∈I} be an indexed family of functions, all of them sharing the same domain D, and where the image set of f_i is S_i. The product ⊗_{i∈I} f_i of this set of functions is defined by

$$\bigotimes_{i \in I} f_i : D \to \prod_{i \in I} S_i, \qquad d \mapsto \varphi : I \to \bigcup_{i \in I} S_i \ \text{ given by } \ \varphi(i) = f_i(d).$$
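A finite product of functions is easy to mimic in code. The following toy lines, with arbitrarily chosen component functions, just illustrate the definition of f1 ⊗ f2:

```python
def f1(d):
    return d ** 2          # first component function (example)

def f2(d):
    return d + 1           # second component function (example)

def product(d):
    return (f1(d), f2(d))  # (f1 ⊗ f2)(d)

print(product(3))          # -> (9, 4)
```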

B Joint, conditional and marginal probabilities

Note. This little section is only a quick memory refresher of some of the most basic concepts of probability. It does not replace a textbook chapter! We first consider the case of two observations of some part of reality that have discrete values. For instance, an online shop creating customer profiles may record from their customers their age and gender (among many other items). The marketing optimizers of that shop are not interested in the exact age but only in age brackets, say a1 = at most 10 years old, a2 = 11 − 20 years, a3 = 21 − 30 years, a4 = older than 30. Gender is roughly categorized into the possibilities g1 = f, g2 = m, g3 = o. From their customer data the marketing guys estimate the following probability table:

  P(X = g_i, Y = a_j)    a1       a2      a3      a4
        g1              0.005    0.3     0.2     0.04
        g2              0.005    0.15    0.15    0.04          (172)
        g3              0.0      0.05    0.05    0.01

The cell (i, j) in this 3 × 4 table contains the probability that a customer with gender g_i falls into the age bracket a_j. This is the joint probability of the two observation values g_i and a_j. Notice that all the numbers in the table sum to 1.

The mathematical tool to formally describe a category of an observable value is a random variable (RV). We typically use symbols X, Y, Z, ... for RVs in abstract mathematical formulas. When we deal with concrete applications, we may also use "telling names" for RVs. For instance, in Table (172), instead of P(X = g_i, Y = a_j) we could have written P(Gender = g_i, Age = a_j). Here we have two such observation categories: gender and age bracket, and hence we use two RVs X and Y for gender and age, respectively. In order to specify, for example, that female customers in the age bracket 11-20 occur with a probability of 0.3 in the shop's customer reservoir (the second entry in the top line of the table), we write P(X = g1, Y = a2) = 0.3.

Here are some more bits of concepts and terminology connected with RVs. You should consider a RV as the mathematical counterpart of a procedure or apparatus to make observations or measurements. For instance, the real-world counterpart of the Gender RV could be an electronic questionnaire posted by the online shop, or more precisely, the "what is your gender?" box on that questionnaire, plus the whole internet infrastructure needed to send the information entered by the customer back to the company's webserver. Or, in a very different example (measuring the speed of a car and showing it to the driver on the speedometer), the real-world counterpart of a RV Speed would be the total on-board circuitry in a car, comprising the wheel rotation sensor, the processing DSP microchip, and the display at the dashboard.

A RV always comes with a set of possible outcomes. This set is called the sample space of the RV, and I usually denote it with the symbol S. Mathematically,

a sample space is a set. The sample space for the Gender RV would be the set S = {m, f, o}. The sample space for Age that we used in the table above was S = {{0, 1, ..., 10}, {11, ..., 20}, {21, ..., 30}, {31, 32, ...}}. For car speed measuring we might opt for S = R^{≥0}, the set of non-negative reals. A sample space can be larger than the set of measurement values that are realistically possible, but it must contain at least all the possible values.

Back to our table and the information it contains. If we are interested only in the age distribution of customers, ignoring the gender aspects, we sum the entries in each age column and get the marginal probabilities of the RV Y. Formally, we compute

$$P(Y = a_j) \;=\; \sum_{i=1,2,3} P(X = g_i, Y = a_j).$$

Similarly, we get the marginal distribution of the gender variable by summing along the rows. The two resulting marginal distributions are indicated in the table (173).

              a1       a2      a3      a4
    g1       0.005    0.3     0.2     0.04      0.545
    g2       0.005    0.15    0.15    0.04      0.345          (173)
    g3       0.0      0.05    0.05    0.01      0.110
             0.01     0.5     0.4     0.09

Notice that the marginal probabilities of age 0.01, 0.5, 0.4, 0.09 sum to 1, as do the gender marginal probabilities 0.545, 0.345, 0.110. Finally, the conditional probability P(X = g_i | Y = a_j) that a customer has gender g_i given that the age bracket is a_j is computed through dividing the joint probabilities in column j by the sum of all values in this column:

$$P(X = g_i \,|\, Y = a_j) \;=\; \frac{P(X = g_i, Y = a_j)}{P(Y = a_j)}. \qquad (174)$$

There are two equivalent versions of this formula:

$$P(X = g_i, Y = a_j) \;=\; P(X = g_i \,|\, Y = a_j)\, P(Y = a_j), \qquad (175)$$

where the right-hand side is called a factorization of the joint distribution on the left-hand side, and

$$P(Y = a_j) \;=\; \frac{P(X = g_i, Y = a_j)}{P(X = g_i \,|\, Y = a_j)}, \qquad (176)$$

demonstrating that each of the three quantities (joint, conditional, marginal probability) can be expressed by the respective two others. If you memorize one of these formulas – I recommend the second one – you have memorized the very key

to master "probability arithmetics" and will never get lost when manipulating probability formulas.

The factorization (175) can be done in two ways: P(Y = a_j | X = g_i) P(X = g_i) = P(X = g_i | Y = a_j) P(Y = a_j), which gives rise to Bayes' formula

$$P(Y = a_j \,|\, X = g_i) \;=\; \frac{P(X = g_i \,|\, Y = a_j)\, P(Y = a_j)}{P(X = g_i)}, \qquad (177)$$

which has many uses in statistical modeling because it shows how one can revert the direction of conditioning.

Joint, conditional, and marginal probabilities are also defined when there are more than two categories of observations. For instance, the online shop marketing people also record how much a customer spends on average, and formalize this by a third random variable, say Z. The values that Z can take are spending brackets, say s1 = less than 5 Euros to s20 = more than 5000 Euros. The joint probability values P(X = g_i, Y = a_j, Z = s_k) would be arranged in a 3-dimensional array of size 3 × 4 × 20, and again all values in this array together sum to 1. Now there are different arrangements for conditional and marginal probabilities. For instance, P(Z = s_k | X = g_i, Y = a_j) is the probability that among the group of customers with gender g_i and age a_j, a person spends an amount in the range s_k. Or P(Z = s_k, Y = a_j | X = g_i) is the probability that in the gender group g_i a person is aged a_j and spends s_k. As a last example, the probabilities P(X = g_i, Z = s_j) are the marginal probabilities obtained by summing away the Y variable:

$$P(X = g_i, Z = s_j) \;=\; \sum_{k=1,2,3,4} P(X = g_i, Y = a_k, Z = s_j). \qquad (178)$$

So far I have described cases where all kinds of observations were discrete, that is, they (i.e. all RVs) yield values from a finite set, for instance the three gender values or the four age brackets. Equally often one faces continuous random values which arise from observations that yield real numbers, for instance measuring the body height or the weight of a person. Since each such RV can give infinitely many different observation outcomes, their probabilities cannot be represented in a table or array. Instead, one uses probability density functions (pdf's) to write down and compute probability values.

Let's start with a single RV, say H = Body Height. Since body heights are non-negative and, say, never larger than 3 m, the distribution of body heights within some reference population can be represented by a pdf f : [0, 3] → R^{≥0} which maps the interval [0, 3] of possible values to the nonnegative reals (Figure 81). We will be using subscripts to make it clear which RV a pdf refers to, so the pdf describing the distribution of body height will be written f_H.

A pdf for the distribution of a continuous RV X can be used to calculate the probability that this RV takes values within a particular interval, by integrating the pdf over that interval. For instance, the probability that a measurement of

body height comes out between 1.5 and 2.0 meters is obtained by

$$P(H \in [1.5, 2.0]) \;=\; \int_{1.5}^{2.0} f_H(x)\, dx, \qquad (179)$$

see the shaded area in Figure 81.

Figure 81: A hypothetical distribution of human body sizes in some reference population, represented by a pdf.

Some comments:

• A probability density function is actually defined to be a function which allows one to compute probabilities of value intervals as in Equation 179. For a given continuous RV X over the reals there is exactly one function f_X which has this property, the pdf for X. (This is not quite true. There also exist continuous-valued RVs whose distribution is so complex that it cannot be captured by a pdf, but we will not meet such phenomena in this lecture. Furthermore, a given pdf can be altered on isolated points – which come from what is called a null set in probability theory – and still be a pdf for the same distribution. But again, we will not be concerned with such subtleties in this lecture.)

• As a consequence, any pdf f : R → R^{≥0} has the property that it integrates to 1, that is, $\int_{-\infty}^{\infty} f(x)\, dx = 1$.

• Be aware that the values f(x) of a pdf are not probabilities! Pdf's turn into probabilities only through integration over intervals.

• Values f(x) can be greater than 1 (as in Figure 81), again indicating that they cannot be taken as probabilities.
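A tiny numerical illustration of the last two points, with an invented uniform pdf on [0, 0.5] (its value there is 2, which is larger than 1, yet all probabilities obtained by integration stay between 0 and 1):

```python
import numpy as np

def f(x):
    # pdf of the uniform distribution on [0, 0.5]; its pointwise value 2.0 exceeds 1
    return np.where((x >= 0.0) & (x <= 0.5), 2.0, 0.0)

dx = 1e-5
xs = np.arange(-1.0, 1.0, dx)
print(np.sum(f(xs)) * dx)                        # integrates to approx. 1
mask = (xs >= 0.1) & (xs <= 0.3)
print(np.sum(f(xs[mask])) * dx)                  # P(X in [0.1, 0.3]) approx. 0.4
```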

Joint distributions of two continuous RVs X, Y can be captured by a pdf f_{X,Y} : R² → R^{≥0}. Figure 82 shows an example. Again, the pdf f_{X,Y} of a bivariate

continuous distribution must integrate to 1 and be non-negative; and conversely, every such function is the pdf of a continuous distribution of two RVs.

Figure 82: An exemplary joint distribution of two continuous-valued RVs X, Y, represented by its pdf.

Continuing on this track, the joint distribution of k continuous-valued RVs X_1, ..., X_k, where the possible values of each X_i are bounded to lie between a_i and b_i, can be described by a unique pdf function f_{X_1,...,X_k} : R^k → R^{≥0} which integrates to 1, i.e.

$$\int_{a_1}^{b_1} \dots \int_{a_k}^{b_k} f(x_1, \dots, x_k)\, dx_k \dots dx_1 \;=\; 1,$$

where also the cases a_i = −∞ and b_i = ∞ are possible. A more compact notation for the same integral is

$$\int_D f(x)\, dx,$$

where D denotes the k-dimensional box [a_1, b_1] × ... × [a_k, b_k] and x denotes vectors in R^k. Mathematicians speak of k-dimensional intervals instead of "boxes". The set of points S = {x ∈ R^k | f_{X_1,...,X_k}(x) > 0} is called the support of the distribution. Obviously S ⊆ D.

In analogy to the 1-dim case from Figure 81, probabilities are obtained from a k-dimensional pdf f_{X_1,...,X_k} by integrating over sub-intervals. For such a k-dimensional subinterval [r_1, s_1] × ... × [r_k, s_k] ⊆ [a_1, b_1] × ... × [a_k, b_k], we get its probability by

$$P(X_1 \in [r_1, s_1], \dots, X_k \in [r_k, s_k]) \;=\; \int_{r_1}^{s_1} \dots \int_{r_k}^{s_k} f(x_1, \dots, x_k)\, dx_k \dots dx_1. \qquad (180)$$

In essentially the same way as we did for discrete distributions, the pdf's of marginal distributions are obtained by integrating away the RV's that one wishes to expel. In analogy to (178), for instance, one would get

$$f_{X_1,X_3}(x_1, x_3) \;=\; \int_{a_2}^{b_2} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\, dx_2. \qquad (181)$$

And finally, pdf's of conditional distributions are obtained through dividing joint pdfs by marginal pdfs. Such conditional pdfs are used to calculate the probability that some RVs fall into a certain multidimensional interval given that some other RVs take specific values. We only inspect a simple case analogous to (174), where we want to calculate the probability that X falls into a range [a, b] given that Y is known to be c, that is, we want to evaluate the probability P(X ∈ [a, b] | Y = c) using pdfs. We can obtain this probability from the joint pdf f_{X,Y} and the marginal pdf f_Y by

$$P(X \in [a, b] \,|\, Y = c) \;=\; \frac{\int_a^b f_{X,Y}(x, c)\, dx}{f_Y(c)}. \qquad (182)$$

The expression f_{X,Y}(x, c) / f_Y(c) appearing on the r.h.s. is a function of x, parametrized by c. This function is a pdf, denoted by f_{X | Y=c}, and defined by

$$f_{X \,|\, Y=c}(x) \;=\; \frac{f_{X,Y}(x, c)}{f_Y(c)}. \qquad (183)$$

Let me illustrate this with a concrete example. An electronics engineer is testing a device which transforms voltages V into currents I. In order to empirically measure the behavior of this device (an electronics engineer would say, in order to "characterize" the device), the engineer carries out a sequence of measurement trials where he first sets the input voltage V to a specific value, say V = 0.0. Then he (or she) measures the resulting current many times, in order to get an idea of the stochastic spread of the current. In mathematical terms, the engineer wants to get an idea of the pdf f_{I | V=0.0}. The engineer then carries on, setting the voltage to other values c_1, c_2, ..., measuring the resulting currents in each case, and getting ideas of the conditional pdfs f_{I | V=c_i}. For understanding the characteristics of this device, the engineer needs to know all of these pdfs.
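In code, the engineer's procedure could be sketched as follows. The device model (a linear voltage-to-current law plus Gaussian noise) is entirely made up for illustration, and the normalized histogram is only one simple way to estimate a conditional pdf from repeated measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure_current(v, n_trials=10000):
    # hypothetical device: I = 0.3 * V + measurement noise
    return 0.3 * v + 0.01 * rng.standard_normal(n_trials)

c = 0.0
currents = measure_current(c)                       # repeated trials at V = c
density, edges = np.histogram(currents, bins=50, density=True)
# `density` approximates the conditional pdf f_{I | V=c} on the bins in `edges`;
# repeating this for other voltages c1, c2, ... characterizes the device.
```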

Conditional distributions arise whenever cause-effect relationships are being modeled. The conditioning variables are causes, the conditioned variables describe effects. In experimental and empirical research, the causes are under the control of an experimenter and can (and have to) be set to specific values in order to assess the statistics of the effects – which are not under the control of the experimenter. In ML pattern classification scenarios, the "causes" are the input patterns and the "effects" are the (stochastically distributed) class label assignments. Since research in the natural sciences is very much focussed on determining Nature's cause-effect workings, and 90% of the applications in machine learning concern pattern classification (my estimate), it is obvious that conditional distributions lie at the very heart of scientific (and engineering) modeling and data analysis.

In this appendix (and in the lecture) I consider only two ways of representing probability distributions: discrete ones by finite probability tables or arrays; continuous ones by pdfs. These are the most elementary formats of representing probability distributions. There are many others which ML experts readily command. This large and varied universe of concrete representations of probability distributions is tied together by an abstract mathematical theory of the probability distributions themselves, independent of particular representations. This theory is called probability theory. It is not an easy theory and we don't attempt an introduction to it. If you are mathematically minded, then you can get an introduction to probability theory in my graduate lecture notes "Principles of Statistical Modeling" (http://minds.jacobs-university.de/teaching/ln/). At this point I only highlight two core facts from probability theory:

• The main objects of study in probability theory are distributions. They are abstractly and axiomatically defined and analyzed, without reference to particular representations (such as tables or pdfs).

• A probability distribution always comes together with random variables. We write PX for the distribution of a RV X, PX,Y for the joint distribution of two RVs X,Y , and PX | Y for the conditional distribution (a truly involved concept since it is actually a family of distributions) of X given Y .

C The argmax operator

Let ϕ : D → R be some function from some domain D to the reals. Then

$$\operatorname*{argmax}_{a}\; \varphi(a)$$

is that d ∈ D for which ϕ(d) is maximal among all values of ϕ on D. If there are several arguments a for which ϕ gives the same maximal value (that is, ϕ does not have a unique maximum), or if ϕ has no maximum at all, then the argmax is undefined.
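For a finite domain D the argmax can be computed by direct enumeration; the function values below are made up purely for illustration.

```python
phi = {"a": 0.2, "b": 1.7, "c": -0.5}   # a function phi on the finite domain D = {a, b, c}
best = max(phi, key=phi.get)            # argmax_a phi(a)  ->  "b"
```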

D Expectation, variance, covariance, and correlation of numerical random variables

Recall that a random variable is the mathematical model of an observation / measurement / recording procedure by which one can “sample” observations from that piece of reality that one wishes to model. We usually denote RVs by capital roman letters like X,Y or the like. For example, a data engineer of an internet

shop who wants to get a statistical model of its (potential) customers might record the gender and age and spending of shop visitors – this would be formally captured by three random variables G, A, S.

A random variable always comes together with a sample space. This is the set of values that might be delivered by the random variable. For instance, the sample space of the gender RV G could be cast as {m, f, o} – a symbolic (and finite) set. A reasonable sample space for the age random variable A would be the set of integers between 0 and 200 – assuming that no customer will be older than 200 years and that age is measured in integers (years). Finally, a reasonable sample space for the spending RV S could be just the real numbers R.

Note that in the A and S examples, the sample spaces that I proposed look very generous. We would not really expect that some customer is 200 years old, nor would we think that ever a customer spends 10^1000 Euros – although both values are included in the respective sample space. The important thing about a sample space is that it must contain all the values that might be returned by the RV; but it may also contain values that will never be observed in practice.

Every mathematical set can serve as a sample space. We just saw symbolic, integer, and real sample spaces. Real sample spaces are used whenever one is dealing with an observation procedure that returns numerical values. Real-valued RVs are of great practical importance, and they allow many insightful statistical analyses that are not defined for non-numerical RVs. The most important analytical characteristics of real RVs are expectation, variance, and covariance, which I will now present in turn.

For the remainder of this appendix section we will be considering random variables X whose sample space is R^n, that is, observation procedures which return scalars (case n = 1) or vectors. We will furthermore assume that the distributions of all RVs X under consideration are represented by pdf's f_X : R^n → R^{≥0}. (In mathematical probability theory, more general numerical sample spaces are considered, as well as distributions that have no pdf, but we will focus on this basic scenario of real-valued RVs with pdfs.)

The expectation of a RV X with sample space R^n and pdf f_X is defined as

$$E[X] \;=\; \int_{\mathbb{R}^n} x\, f_X(x)\, dx, \qquad (184)$$

where the integral is written in a common shorthand for

$$\int_{x_1=-\infty}^{\infty} \dots \int_{x_n=-\infty}^{\infty} (x_1, \dots, x_n)'\, f_X((x_1, \dots, x_n))\, dx_n \dots dx_1.$$

The expectation of a RV X can be intuitively understood as the "average" value that is delivered when the observation procedure X would be carried out infinitely often. The crucial thing to understand about the expectation is that it does not depend on a sample; it does not depend on specific data.

In contrast, whenever in machine learning we base some learning algorithm on a (numerical) training sample (x_i, y_i)_{i=1,...,N} drawn from the joint distribution P_{X,Y} of two RVs X, Y, we may compute the average value of the x_i by

$$\mathrm{mean}(\{x_1, \dots, x_N\}) \;=\; 1/N \sum_{i=1}^{N} x_i,$$

but this sample mean is NOT the expectation of X. If we had used another random sample, we would most likely have obtained another sample mean. In contrast, the expectation E[X] of X is defined not on the basis of a finite, random sample of X, but it is defined by averaging over the true underlying distribution.

Since in practice we will not have access to the true pdf f_X, the expectation of a RV X cannot usually be determined in full precision. The best one can do is to estimate it from observed sample data. The sample mean is an estimator for the expectation of a numerical RV X. Marking estimated quantities by a "hat" accent, we may write

$$\hat{E}[X] \;=\; 1/N \sum_{i=1}^{N} x_i.$$

A random variable X is centered if its expectation is zero. By subtracting the expectation one gets a centered RV. In these lecture notes I use the bar notation to mark centered RVs: X̄ := X − E[X].

The variance of a scalar RV with sample space R is the expected squared deviation from the expectation

$$\sigma^2(X) \;=\; E[\bar{X}^2], \qquad (185)$$

which in terms of the pdf f_{\bar X} of X̄ can be written as

$$\sigma^2(X) \;=\; \int_{\mathbb{R}} x^2\, f_{\bar X}(x)\, dx.$$

Like the expectation, the variance is an intrinsic property of an observation procedure X and the part of the real world where the measurements may be taken from; it is independent of a concrete sample. A natural way to estimate the variance of X from a sample (x_i)_{i=1,...,N} is

$$\hat{\sigma}^2(\{x_1, \dots, x_N\}) \;=\; 1/N \sum_{i=1}^{N} \Big( x_i - 1/N \sum_{j=1}^{N} x_j \Big)^{\!2},$$

but in fact this estimator is not the best possible – on average (across different samples) it underestimates the true variance. If one wishes to have an estimator

that is unbiased, that is, which on average across different samples gives the correct variance, one must use

$$\hat{\sigma}^2(\{x_1, \dots, x_N\}) \;=\; 1/(N-1) \sum_{i=1}^{N} \Big( x_i - 1/N \sum_{j=1}^{N} x_j \Big)^{\!2}$$

instead. The Wikipedia article on "Variance", section "Population variance and sample variance", points out a number of other pitfalls and corrections that one should consider when one estimates variance from samples.

The square root of the variance of X, σ(X) = \sqrt{σ^2(X)}, is called the standard deviation of X.

The covariance between two real-valued scalar random variables X, Y is defined as

$$\mathrm{Cov}(X, Y) \;=\; E[\bar{X}\, \bar{Y}], \qquad (186)$$

which in terms of a pdf f_{\bar X \bar Y} for the joint distribution of the centered RVs spells out to

$$\mathrm{Cov}(X, Y) \;=\; \int_{\mathbb{R} \times \mathbb{R}} x\, y\; f_{\bar X \bar Y}((x, y)')\, dx\, dy.$$

An unbiased estimate of the covariance, based on a sample (x_i, y_i)_{i=1,...,N}, is given by

$$\widehat{\mathrm{Cov}}((x_i, y_i)_{i=1,\dots,N}) \;=\; 1/(N-1) \sum_{i} \Big( x_i - 1/N \sum_j x_j \Big) \Big( y_i - 1/N \sum_j y_j \Big).$$

Finally, let us inspect the correlation of two scalar RVs X, Y. Here we have to be careful because this term is used differently in different fields. In statistics, the correlation is defined as

$$\mathrm{Corr}(X, Y) \;=\; \frac{\mathrm{Cov}(X, Y)}{\sigma(X)\, \sigma(Y)}. \qquad (187)$$

It is easy to show that −1 ≤ Corr(X, Y) ≤ 1. The correlation in the understanding of statistics can be regarded as a normalized covariance. It has a value of 1 if X and Y are identical up to some positive scaling factor, and a value of −1 if X and Y are identical up to some negative scaling factor. When Corr(X, Y) = 0, X and Y are said to be uncorrelated.

The quantity Corr(X, Y) is also referred to as the (population) Pearson correlation coefficient, and is often denoted by the Greek letter ϱ: ϱ(X, Y) = Corr(X, Y).

In the signal processing literature (for instance in my favorite textbook Farhang-Boroujeny (1998)), the term "correlation" is sometimes used in quite a different way, denoting the quantity E[XY], that is, simply the expectation of the product of the uncentered RVs X and Y. Just be careful when you read terms like "correlation" or "cross-correlation" or

"cross-correlation matrix" and make sure that your understanding of the term is the same as the respective author's.

There are some basic rules for doing calculations with expectations and covariance which one should know:

1. Expectation is a linear operator:

E[α X + β Y ] = α E[X] + β E[Y ],

where α X is the RV obtained from X by scaling observations with a factor α.

2. Expectation is idempotent:

E[E[X]] = E[X].

3. Cov(X,Y ) = E[XY ] − E[X] E[Y ].
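The estimators and rules above are easy to check numerically. The following sketch, with an arbitrary synthetic sample, computes the sample mean, the biased and unbiased variance estimates, the unbiased covariance estimate, and the Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 1.0 + 2.0 * rng.standard_normal(1000)           # synthetic sample of X
y = 0.5 * x + rng.standard_normal(1000)             # a correlated sample of Y
N = len(x)

mean_x = x.sum() / N                                 # estimate of E[X]
var_biased = ((x - mean_x) ** 2).sum() / N           # 1/N estimator (biased)
var_unbiased = ((x - mean_x) ** 2).sum() / (N - 1)   # 1/(N-1) estimator (unbiased)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (N - 1)
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # Pearson correlation, in [-1, 1]
print(mean_x, var_biased, var_unbiased, cov_xy, corr_xy)
```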

E Derivation of Equation 31

$$\begin{aligned}
1/N \sum_i \| x_i - d \circ f(x_i) \|^2
&= 1/N \sum_i \Big\| \bar{x}_i - \sum_{k=1}^{m} (\bar{x}_i' u_k)\, u_k \Big\|^2 \\
&= 1/N \sum_i \Big\| \sum_{k=1}^{n} (\bar{x}_i' u_k)\, u_k - \sum_{k=1}^{m} (\bar{x}_i' u_k)\, u_k \Big\|^2 \\
&= 1/N \sum_i \Big\| \sum_{k=m+1}^{n} (\bar{x}_i' u_k)\, u_k \Big\|^2 \\
&= 1/N \sum_i \sum_{k=m+1}^{n} (\bar{x}_i' u_k)^2
\;=\; \sum_{k=m+1}^{n} 1/N \sum_i (\bar{x}_i' u_k)^2 \\
&= \sum_{k=m+1}^{n} \sigma_k^2.
\end{aligned}$$
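The identity just derived can be checked numerically in a few lines; the random data and the choice m = 2 below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # correlated data
mu = X.mean(axis=0)
Xc = X - mu                                        # centered data points x_bar
C = (Xc.T @ Xc) / len(X)                           # covariance (1/N normalization)
lam, U = np.linalg.eigh(C)                         # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                     # principal directions u_1, ..., u_n

m = 2                                              # keep the first m principal components
recon = mu + (Xc @ U[:, :m]) @ U[:, :m].T          # d(f(x_i)) for all data points
err = np.mean(np.sum((X - recon) ** 2, axis=1))    # 1/N sum ||x_i - d(f(x_i))||^2
print(err, lam[m:].sum())                          # the two numbers coincide (up to float error)
```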

References

D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.

A. C. Antoulas and D. C. Sorensen. Approximation of large-scale dynamical systems: an overview. Int. J. Appl. Math. Comput. Sci., 11(5):1093–1121, 2001.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Rep- resentations (ICLR), 2015. URL http://arxiv.org/abs/1409.0473v6.

J. A. Bednar and S. P. Wilson. Cortical maps. The Neuroscientist, 22(6):604–617, 2016.

Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Bottou L., Chapelle O., DeCoste D., and Weston J., editors, Large-Scale Kernel Machines. MIT Press, 2007.

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

G. E. Crooks. Field guide to continuous probability distributions, v 0.11 beta. online manuscript, retrieved april 2017, extended version also available in print since 2019, 2017. URL http://threeplusone.com/fieldguide.

A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incom- plete data via the EM-algorithm. Journal of the Royal Statistical Society, 39: 1–38, 1977.

S. Demyanov. Regularization Methods for Neural Networks and Related Models. Phd thesis, Dept of Computing and Information Systems, Univ. of Melbourne, 2015.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (second edition). Wiley Interscience, 2001.

R. Durbin, S. Eddy, A. Krogh, and G Mitchinson. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 2000.

D. Durstewitz, J. K. Seamans, and T. J. Sejnowski. Neurocomputational models of working memory. Nature Neuroscience, 3:1184–91, 2000.

S. Edelman. The minority report: some common assumptions to reconsider in the modelling of the brain and behaviour. J of Experimental and Theoreti- cal Artificial Intelligence, 2015. URL http://www.tandfonline.com/action/ showCitFormats?doi=10.1080/0952813X.2015.1042534.

254 S. Ermon. Probabilistic graphical models, 2019. URL https://ermongroup. github.io/cs228-notes/. Online lecture notes of a graduate course at Stan- ford University.

B. Farhang-Boroujeny. Adaptive Filters: Theory and Applications. Wiley, 1998.

K. Friston. A theory of cortical response. Phil. Trans. R. Soc. B, 360:815–836, 2005.

S. Fusi and X.-J. Wang. Short-term, long-term, and working memory. In M. Arbib and J. Bonaiuto, editors, From Neuron to Cognition via Computational Neuro- science, pages 319–344. MIT Press, 2016.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR 2015, 2014. arXiv:1412.6572v3.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Open access version at http://www.deeplearningbook.org.

A. N. Gorban, B. Kgl, D. C. Wunsch, and A. (eds.) Zinovyev. Principal Manifolds for Data Visualization and Dimension Reduction. Springer, 2008.

A. Graves et al. Hybrid computing using a neural network with dynamic external memory. Nature, 7626:471–476, 2016.

M. Hasegawa, H. Kishino, and T. Yano. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. of Molecular Evolution, 22:160–174, 1985.

G. E. Hinton and R. R. Salakuthdinov. Reducing the dimensionality of data with neural networks. Science, 313(July 28):504–507, 2006.

E. Horvitz and M. Barry. Display of information for time-critical decision making. In Proc. 11th Conf. on Uncertainty in Artificial Intelligence, pages 296–305. Morgan Kaufmann Publishers Inc., 1995.

C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. Int. J. of Approximate Reasoning, 11(1):158, 1994.

J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001.

L. Hyafil and R. L. Rivest. Computing optimal binary decision trees is NP- complete. Information Processing Letters, 5(1):15–17, 1976.

G. Indiveri. Rounding methods for neural networks with low resolution synaptic weights. preprint, Institute of Neuroinformatics, Univ. Zurich, 2015. URL http://arxiv.org/abs/1504.05767.

255 H. Jaeger. Echo state network. In Scholarpedia, volume 2, page 2330. 2007. URL http://www.scholarpedia.org/article/Echo State Network. E. T. Jaynes. Probability Theory: the Logic of Science. Cambridge University Press, 2003, first partial online editions in the late 1990ies. First three chapters online at http://bayes.wustl.edu/etj/prob/book.pdf. D. Jones. Good practice in (pseudo) random number generation for bioinformatics applications. technical report, published online, UCL Bioinformatics, 2010. URL http://www.cs.ucl.ac.uk/staff/d.jones/GoodPracticeRNG.pdf. M. I. Jordan and D. M. Wolpert. Computational motor control. In M. Gazzaniga, editor, The Cognitive Neurosciences, 2nd edition. MIT Press, 1999. M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999. S. Kirkpatrick, C.D. Gelatt, and M.P.¡ Vecchi. Optimization by simulated anneal- ing. Science, 220(4598):671–680, 1983. R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embed- dings with multimodal neural language models. http://arxiv.org/abs/1411.2539 Presented at NIPS 2014 Deep Learning Workshop, 2014. J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226– 239, 1998. D. Luchinsky and et al. Overheating anomalies during flight test due to the base bleeding. In Proc. 7th Int. Conf. on Computational Fluid Dynamics, Hawaii July 2012, 2012. B. Mau, M.A. Newton, and B. Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics, 55:1–12, 1999. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33. 8433&rep=rep1&type=pdf. W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bull. of Mathematical Biophysics, 5:115–133, 1943. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Dis- tributed representations of words and phrases and their compositional- ity. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and

256 K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013. URL http://papers.nips.cc/paper/ 5021-distributed-representations-of-words-and-phrases-and-their-compositionality. pdf.

A. Minnaar. Wor2Vec tutorial part I: the Skip-Gram model. Technical report, 2015. Online tutorial, retrieved from http://mccormickml.com/2016/04/27/word2vec-resources/ #efficient-estimation-of-word-representations-in-vector-space. Can serve as understandably detailed presentation to Mikolov’s original ideas.

T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

S. R. T. Mouafo, S. Vaton, J.-L. Courant, and S. Gosselin. A tutorial on the EM algorithm for Bayesian networks: application to self-diagnosis of GPON- FTTH networks. In Proc. 12th International Wireless Communications & Mobile Computing Conference (IWCMC 2016), pages 369 – 376, 2016. URL https://hal.archives-ouvertes.fr/hal-01394337.

K. Murphy. An introduction to graphical models. Technical report, Intel, 2001. http://www.cs.ubc.ca/∼murphyk/Papers/intro gm.pdf.

K. P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learn- ing. PhD thesis, Univ. of California, Berkeley, 2002.

R. M. Neal. Using deterministic maps when sampling from complex distributions, 2019. URL http://www.cs.utoronto.ca/∼radford/ftp/geoff-sym-talk. pdf. Presentation given at the Evolution of Deep Learning Symposium in honor of Geoffrey Hinton.

R.M. Neal. Probabilistic inference using Markov chain Monte Carlo meth- ods. Technical Report CRG-TR-93-1, Dpt. of Computer Science, University of Toronto, 1993.

O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. of BMVC, 2015. URL http://www.robots.ox.ac.uk:5000/∼vgg/ publications/2015/Parkhi15/parkhi15.pdf.

R. Pascanu and H. Jaeger. A neurodynamical model for working memory. Neural Networks, 24(2):199–207, 2011. DOI: 10.1016/j.neunet.2010.10.003.

J. Pearl and S. Russell. Bayesian networks. In M.A. Arbib, editor, Handbook of Brain Theory and Neural Networks, 2nd Ed., pages 157–160. MIT Press, 2003. URL https://escholarship.org/uc/item/53n4f34m.

S. E. Peters, C. Zhang, M. Livny, and C. R´e. A machine reading sys- tem for assembling synthetic paleontological databases. PLOS-ONE, 9(12):

257 e113523, 2014. URL http://journals.plos.org/plosone/article?id=10. 1371/journal.pone.0113523.

L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, pages 267–296. Morgan Kaufmann, San Mateo, 1990. Reprinted from Proceedings of the IEEE 77 (2), 257-286 (1989).

F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representa- tions by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing Vol. 1, pages 318–362. MIT Press, 1986. Also as Technical Report, La Jolla Inst. for Cognitive Science, 1985.

M. A. Savi. Nonlinear dynamics and chaos. In V. Lopes Junior and et al, editors, Dynamics of Smart Systems and Structures, pages 93–117. Springer Interna- tional Publishing Switzerland, 2016.

J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. Preprint: arXiv:1404.7828.

D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

P. Smyth. Belief networks, hidden markov models, and markov random fields: a unifying view. Pattern Recognition Letters, 18(11-13):1261–1268, 1997.

F. Suchanek, J. Fan, R. Hoffmann, S. Riedel, and P. P. Talukdar. Advances in auto- mated knowledge base construction. SIGMOD Records Journal, (March), 2013. URL http://suchanek.name/work/publications/sigmodrec2013akbc.pdf.

N. V. Swindale and H.-U. Bauer. Application of Kohonen’s self-organizing feature map algorithm to cortical maps of orientation and direction preference. Proc. R. Soc. Lond. B, 265:827–838, 1998.

F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.- S. Young, editors, Dynamical Systems and Turbulence, number 898 in Lecture Notes in Mathematics, pages 366–381. Springer-Verlag, 1991.

J. Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based bayesian models of inductive learning and reasoning. Trends in Cognitive Science, 10(7):309–318, 2006.

258 M. Weliky, W. H. Bosking, and D. Fitzpatrick. A systematic map of direction preference in primary visual cortex. Nature, 379:725 – 728, 1996.

H. Yin. Learning nonlinear principal manifolds by self-organising maps. In A. N. Gorban, B. Kgl, D. C. Wunsch, and A. Zinovyev, editors, Principal Manifolds for Data Visualization and Dimension Reduction, volume 58 of Lecture Notes in Computer Science and Engineering, pages 68–95. Springer, 2008.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2: 67–78, 2014.
