Machine Learning

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology
Center for Analytics and Discovery
Graduate Programs in IST
The Pennsylvania State University

[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu

Machine learning as a subfield of artificial intelligence

AI is about
§ Study of computational models of intelligence
§ Falsifiable hypotheses about intelligent behavior
§ Construction of intelligent artifacts
§ Mechanization of tasks requiring intelligence
§ Exploring the design space of intelligent systems

Why should machines learn? – Practical

• Intelligent behavior requires knowledge
• Explicitly specifying the knowledge needed for specific tasks is hard, and often infeasible
• If we can get machines to acquire the knowledge needed for particular tasks from observations (data) and interactions (experiments), we can
  • dramatically reduce the cost of developing intelligent systems
  • automate aspects of scientific discovery
  • …

Machine learning is most useful when
§ the structure of the task is not well understood, but representative data or interactions with the environment are available
§ the task (or its parameters) changes dynamically

Why should machines learn? – Applications

• Astrophysics
  § Characterizing and classifying galaxies
• Life and health sciences
  § Identifying sequence correlates of protein function, predicting potential adverse drug interactions, …
  § Understanding the relationship between genetic, environmental, and behavioral characteristics that contribute to health or disease
  § Diagnosing diseases from symptoms and test results (e.g., pneumonia, pap smears)

Why should machines learn? – Applications

• Agriculture
  – Precision farming
• Business
  – Fraud detection (e.g., credit cards, phone calls)
  – Product recommendation (e.g., Google, Amazon, Netflix)
  – Stock trading
• Technology
  – Self-driving vehicles
  – Natural language conversation
  – Computer vision
  – Video understanding

Why should machines learn? – Science of learning

Information processing models can provide useful insights into
• How humans and animals learn
• Information requirements of learning tasks
• The precise conditions under which learning is possible
• Inherent difficulty of learning tasks
• How to improve learning – e.g., the value of active versus passive learning
• Computational architectures for learning

Machine Learning – related disciplines

• Applied statistics
  – Emphasizes statistical models of data
  – Methods typically applied to small data sets
  – Often done by a statistician, increasingly assisted by a computer
• Machine learning
  – Relies on (often, but not always, statistical) inference from data and knowledge (when available)
  – Emphasizes efficient data structures and algorithms for learning from data
  – Characterizes what can be learned and under what conditions
  – Obtains guarantees regarding the quality of learned models
  – Scales to large, complex data sets (big data)

What is Machine Learning?

• A program M is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance, as measured by P on tasks in T in an environment Z, improves with experience E.

Example 1
T – cancer diagnosis
E – a set of diagnosed cases
P – accuracy of diagnosis on new cases
Z – noisy measurements, occasionally misdiagnosed training cases
M – a program that runs on a general-purpose computer

Example 2
T – recommending movies, e.g., on Netflix
E – movie ratings data from individuals
P – accuracy of predicted movie ratings

10% improvement in prediction accuracy – $1 million prize

Example 3
T – predicting protein–RNA interactions
E – a data set of known interactions
P – accuracy of predicted interactions

Example 4
T – reconstructing the functional connectivity of brains from brain activity (e.g., fMRI) data
E – fMRI data
P – accuracy of the reconstructed network

Example 5
T – solving integral calculus problems, given the rules of integral calculus
E – a set of solved problems
P – score on a test consisting of problems not in E

Example 6
T – predicting the risk of a disease before the onset of clinical symptoms
E – longitudinal gut microbiome data coupled with diagnostic tests
P – accuracy of predictions

Example 7
T – predicting sleep quality from actigraphy data
E – actigraphy data with sleep stage labels
P – accuracy of predictions

Example 8
T – uncovering the causal relationships between exercise, diet, and diabetes
E – data from observations and interventions (changes in diet, exercise)
P – accuracy of causal predictions

Key requirements

• There is a pattern to be learned
• There are data to learn from

Learning to approve credit

Canonical Learning Problems

Supervised learning: given labeled samples, predict labels on future samples
§ Classification
§ Regression
§ Time series prediction
§ Many variants based on what constitutes a predictive model
§ Many variants based on what constitutes a sample and a label
  § Multi-instance learning
  § Multi-label learning
  § Multi-instance, multi-label learning
  § Distributional learning
§ Many variants based on data type
  § Feature vectors
  § Sequences
  § Networks
  § Relations

Unsupervised learning: given unlabeled samples, discover representations, features, structure, etc.
§ Clustering
§ Compression
§ Representation
Many variants based on what constitutes samples and data types

Semi-supervised learning: given some labeled samples and large amounts of unlabeled samples, predict the labels of the unlabeled samples
• Transductive (unlabeled samples given at learning time)
• Inductive (new unlabeled samples given at prediction time)

Multi-view learning:
• Given data from multiple sources about some underlying system, discover how they relate to each other
• Integrate the data to make predictions that are more reliable than those obtainable using any single data source

Reinforcement learning: given the means of observing and interacting with an environment, learn how to act rationally
• Many variants based on what constitutes observation, interaction, and action

Causal inference: given observational and experimental data and causal assumptions, identify causal relations
• Identification
• Transport
• Meta-analysis

Learning input–output functions: classification, regression

Target function f – unknown to the learner – f ∈ F
Learner's hypothesis about what f might be – h ∈ H
H – hypothesis space
Instance space X – the domain of f and h
Output space Y – the range of f and h
Example – an ordered pair (x, y) where x ∈ X and f(x) = y ∈ Y

F and H may or may not be the same!
Training set E – a multiset of examples
Learner L – a procedure which, given some E, outputs an h ∈ H

Decision theoretic foundations of classification

Consider the problem of classifying an instance X into one of two mutually exclusive classes, ω1 or ω2.

P(ω1 | X) = probability of class ω1 given the evidence X
P(ω2 | X) = probability of class ω2 given the evidence X

What is the probability of error?

P(error | X) = P(ω1 | X) if we choose ω2
             = P(ω2 | X) if we choose ω1

Minimum Error Classification

To minimize classification error:

Choose ω1 if P(ω1 | X) > P(ω2 | X)
Choose ω2 if P(ω2 | X) > P(ω1 | X)

which yields

P(error | X) = min[ P(ω1 | X), P(ω2 | X) ]

We have, by Bayes rule:

P(ω1 | X) ∝ P(X | ω1) P(ω1)
P(ω2 | X) ∝ P(X | ω2) P(ω2)

Choose ω1 if P(ω1 | X) > P(ω2 | X), i.e., X ∈ R1
Choose ω2 if P(ω2 | X) > P(ω1 | X), i.e., X ∈ R2
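To make the decision rule concrete, here is a minimal sketch (not from the slides) of the minimum-error rule for two classes; the one-dimensional Gaussian class-conditional densities, the priors, and the test points are made up for illustration.

```python
# Minimal sketch of the minimum-error (Bayes) decision rule for two classes.
# The 1-D Gaussian class-conditional densities, priors, and test points are
# assumptions made for illustration only.
from scipy.stats import norm

p_x_given_w1 = norm(loc=0.0, scale=1.0).pdf   # assumed p(x | w1)
p_x_given_w2 = norm(loc=2.0, scale=1.0).pdf   # assumed p(x | w2)
prior_w1, prior_w2 = 0.6, 0.4                 # assumed P(w1), P(w2)

def bayes_decision(x):
    """Choose the class with the larger posterior (ties broken arbitrarily)."""
    score_w1 = p_x_given_w1(x) * prior_w1     # proportional to P(w1 | x)
    score_w2 = p_x_given_w2(x) * prior_w2     # proportional to P(w2 | x)
    return "w1" if score_w1 > score_w2 else "w2"

for x in (-1.0, 1.0, 3.0):
    print(x, bayes_decision(x))
```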

Optimality of the Bayes Decision Rule

We can show that the Bayes classifier is optimal in that it is guaranteed to minimize the probability of misclassification:

Pe = P(x ∈ R2, x ∈ ω1) + P(x ∈ R1, x ∈ ω2)
   = P(x ∈ R2 | ω1) P(ω1) + P(x ∈ R1 | ω2) P(ω2)
   = P(ω1) ∫_{R2} p(x | ω1) dx + P(ω2) ∫_{R1} p(x | ω2) dx

Applying Bayes rule, p(x | ωi) P(ωi) = P(ωi | x) p(x) = p(x, ωi), so

Pe = ∫_{R2} P(ω1 | x) p(x) dx + ∫_{R1} P(ω2 | x) p(x) dx

Because R1 ∪ R2 covers the entire input space,

∫_{R1} P(ω1 | x) p(x) dx + ∫_{R2} P(ω1 | x) p(x) dx = P(ω1)

so

Pe = P(ω1) − ∫_{R1} ( P(ω1 | x) − P(ω2 | x) ) p(x) dx

Pe is minimized by choosing
R1 such that P(ω1 | x) > P(ω2 | x), and
R2 such that P(ω2 | x) > P(ω1 | x)

• The proof generalizes to multivariate input spaces
• A similar result can be proved for discrete (as opposed to continuous) input spaces – replace the integration over the input space by a summation

Bayes Decision Rule yields Minimum Error Classification

To minimize classification error:

Choose ω1 if P(ω1 | X) > P(ω2 | X)
Choose ω2 if P(ω2 | X) > P(ω1 | X)

which yields

P(error | X) = min[ P(ω1 | X), P(ω2 | X) ]

Bayes Decision Rule

Behavior of the Bayes decision rule as a function of the prior probability of the classes

Bayes Optimal Classifier

Classification rule that guarantees minimum error:

Choose ω1 if P(X | ω1) P(ω1) > P(X | ω2) P(ω2)
Choose ω2 if P(X | ω2) P(ω2) > P(X | ω1) P(ω1)

If P(X | ω1) = P(X | ω2), the classification depends entirely on P(ω1) and P(ω2).
If P(ω1) = P(ω2), the classification depends entirely on P(X | ω1) and P(X | ω2).

The Bayes classification rule combines the effect of the two terms optimally, so as to yield minimum-error classification.

Generalization to multiple classes:

c(X) = argmax_{ωj} P(ωj | X)

Minimum Risk Classification

Let λij = the risk or cost associated with assigning an instance to class ωj when the correct classification is ωi, and let

R(ωi | X) = expected loss incurred in assigning X to class ωi

R(ω1 | X) = λ11 P(ω1 | X) + λ21 P(ω2 | X)
R(ω2 | X) = λ12 P(ω1 | X) + λ22 P(ω2 | X)

Classification rule that guarantees minimum risk:

Choose ω1 if R(ω1 | X) < R(ω2 | X)
Choose ω2 if R(ω2 | X) < R(ω1 | X)
Flip a coin otherwise

Ordinarily, (λ21 − λ22) and (λ12 − λ11) are positive (the cost of being correct is less than the cost of an error).

So we choose ω1 if

P(X | ω1) / P(X | ω2) > [ (λ21 − λ22) / (λ12 − λ11) ] × [ P(ω2) / P(ω1) ]

and otherwise choose ω2.

The minimum-error classification rule is the special case λij = 0 if i = j and λij = 1 if i ≠ j.

This classification rule can be shown to be optimal in that it is guaranteed to minimize the risk of misclassification.
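As an illustration, a minimal sketch of the minimum-risk rule with a hypothetical loss matrix; the costs and posterior values below are made up and are not part of the slides.

```python
# Minimal sketch of the minimum-risk decision rule for two classes.
# lam[i][j] = cost of choosing class j when the true class is i
# (indices 0 and 1 stand for w1 and w2); all values are illustrative.
import numpy as np

lam = np.array([[0.0, 10.0],   # true w1: correct, or a costly miss
                [1.0,  0.0]])  # true w2: a cheap false alarm, or correct

def min_risk_decision(post_w1, post_w2):
    """Return the class with the smaller conditional risk R(w_j | X)."""
    risk_w1 = lam[0, 0] * post_w1 + lam[1, 0] * post_w2  # risk of choosing w1
    risk_w2 = lam[0, 1] * post_w1 + lam[1, 1] * post_w2  # risk of choosing w2
    return "w1" if risk_w1 < risk_w2 else "w2"

# With asymmetric costs, w1 is chosen even though P(w1 | X) < P(w2 | X):
print(min_risk_decision(post_w1=0.3, post_w2=0.7))
```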

Summary of Bayesian recipe for classification

λij = the risk or cost associated with assigning an instance to class ωj when the correct classification is ωi

Choose ω1 if P(X | ω1) / P(X | ω2) > [ (λ21 − λ22) / (λ12 − λ11) ] × [ P(ω2) / P(ω1) ]
Choose ω2 if P(X | ω1) / P(X | ω2) < [ (λ21 − λ22) / (λ12 − λ11) ] × [ P(ω2) / P(ω1) ]

The minimum-error classification rule is a special case:

Choose ω1 if P(X | ω1) / P(X | ω2) > P(ω2) / P(ω1); otherwise choose ω2

Bayesian recipe for classification

Note that P(ωi | x) = P(x | ωi) P(ωi) / P(x)

Model P(x | ω1), P(x | ω2), P(ω1), and P(ω2).
Using Bayes rule, choose ω1 if P(x | ω1) P(ω1) > P(x | ω2) P(ω2); otherwise choose ω2.

Summary of Bayesian recipe for classification

• The Bayesian recipe is simple, optimal, and, in principle, straightforward to apply
• To use this recipe in practice, we need to know P(X | ωi) – the generative model of the data for each class – and P(ωi) – the prior probabilities of the classes
• Because these probabilities are unknown, we need to estimate them from data – or learn them!
• X is typically high-dimensional or may have complex structure
• So we need to estimate P(X | ωi) from data

Naïve Bayes Classifier

• How can we learn P(X | ωi)? One solution: assume that the random variables in X are conditionally independent given the class.
• Result: the Naïve Bayes classifier, which performs optimally under certain assumptions
• A simple, practical learning algorithm grounded in probability theory

When to use:
• The attributes that describe instances are likely to be conditionally independent given the classification
• The data is insufficient to estimate all the probabilities reliably if we do not assume independence

Conditional Independence

Let Z1, ..., Zn and W be random variables on a given event space.

Z1, ..., Zn are mutually independent given W if

P(Z1, Z2, ..., Zn | W) = ∏_{i=1}^{n} P(Zi | W)

Note that this represents a set of equations, one for each possible assignment of values to the random variables.

Implications of Independence

• Suppose we have 5 binary attributes and a binary class label
• Without independence, in order to specify the joint distribution we need a probability for each possible assignment of values to the variables, resulting in a table of size 2^6 = 64
• If the features are independent given the class label, we need only 5 × (2 × 2) = 20 entries
• The reduction in the number of probabilities to be estimated is even more striking when N, the number of attributes, is large – from O(2^N) to O(N)

Naive Bayes Classifier

Consider a discrete-valued target function f : χ → Ω, where an instance X = (X1, X2, ..., Xn) ∈ χ is described in terms of attribute values X1 = x1, X2 = x2, ..., Xn = xn, with xi ∈ Domain(Xi).

ωMAP = argmax_{ωj ∈ Ω} P(ωj | X1 = x1, X2 = x2, ..., Xn = xn)
     = argmax_{ωj ∈ Ω} P(X1 = x1, X2 = x2, ..., Xn = xn | ωj) P(ωj) / P(X1 = x1, X2 = x2, ..., Xn = xn)
     = argmax_{ωj ∈ Ω} P(X1 = x1, X2 = x2, ..., Xn = xn | ωj) P(ωj)

ωMAP is called the maximum a posteriori classification.

If the attributes are independent given the class, we have

ωMAP = argmax_{ωj ∈ Ω} P(ωj) ∏_{i=1}^{n} P(Xi = xi | ωj) = ωNB

Naive Bayes Learner

For each possible value ωj of Ω:
    P̂(Ω = ωj) ← Estimate(P(Ω = ωj), D)
    For each possible value aik of Xi:
        P̂(Xi = aik | ωj) ← Estimate(P(Xi = aik | Ω = ωj), D)

Classify a new instance X = (x1, x2, ..., xn):

c(X) = argmax_{ωj ∈ Ω} P̂(ωj) ∏_{i=1}^{n} P̂(Xi = xi | ωj)

Estimate is a procedure for estimating the relevant probabilities from a set of training examples.
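A minimal sketch of such a learner for discrete attributes, using simple relative-frequency estimates; the function name train_naive_bayes and the data format are illustrative choices, not part of the slides.

```python
# Minimal sketch of a Naive Bayes learner for discrete attributes,
# using relative-frequency estimates for P(class) and P(attr = value | class).
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (class, attr_index) -> value counts
    for x, label in examples:
        for i, v in enumerate(x):
            value_counts[(label, i)][v] += 1
    n = len(examples)
    priors = {c: class_counts[c] / n for c in class_counts}

    def classify(x):
        best, best_score = None, -1.0
        for c in priors:
            score = priors[c]                      # P(class)
            for i, v in enumerate(x):              # times each P(attr = v | class)
                score *= value_counts[(c, i)][v] / class_counts[c]
            if score > best_score:
                best, best_score = c, score
        return best

    return classify
```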

Learning Dating Preferences

Instances are ordered 3-tuples of attribute values corresponding to Height (tall, short), Hair (dark, blonde, red), and Eye (blue, brown); the classes are + and –.

Training data:

Instance   (Height, Hair, Eye)   Class
I1         (t, d, l)             +
I2         (s, d, l)             +
I3         (t, b, l)             –
I4         (t, r, l)             –
I5         (s, b, l)             –
I6         (t, b, w)             +
I7         (t, d, w)             +
I8         (s, b, w)             +

Probabilities to estimate

P(+) = 5/8,  P(–) = 3/8

P(Height | c):    t      s
       +         3/5    2/5
       –         2/3    1/3

P(Hair | c):      d      b      r
       +         3/5    2/5     0
       –          0     2/3    1/3

P(Eye | c):       l      w
       +         2/5    3/5
       –          1      0

Classify (Height = t, Hair = b, Eye = l):
P(X | +) = (3/5)(2/5)(2/5) = 12/125
P(X | –) = (2/3)(2/3)(1) = 4/9
P(+ | X) ∝ P(+) P(X | +) = (5/8)(12/125) = 0.06
P(– | X) ∝ P(–) P(X | –) = (3/8)(4/9) ≈ 0.167
So the instance is classified as –.

Classify (Height = t, Hair = r, Eye = w):
Note the problem with zero probabilities (here P(r | +) = 0 and P(w | –) = 0).
Solution – use the Laplacian correction.
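For concreteness, applying the hypothetical train_naive_bayes sketch from earlier to this data set reproduces the calculation above.

```python
# Applying the hypothetical train_naive_bayes sketch above to the
# dating-preferences data; the data tuples mirror the table I1..I8.
data = [
    (("t", "d", "l"), "+"), (("s", "d", "l"), "+"), (("t", "b", "l"), "-"),
    (("t", "r", "l"), "-"), (("s", "b", "l"), "-"), (("t", "b", "w"), "+"),
    (("t", "d", "w"), "+"), (("s", "b", "w"), "+"),
]
classify = train_naive_bayes(data)
# (5/8)(3/5)(2/5)(2/5) = 0.06  <  (3/8)(2/3)(2/3)(1) ≈ 0.167, so "-" is predicted
print(classify(("t", "b", "l")))
```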

Estimation of Probabilities from Small Samples

P̂(Xi = aik | ωj) ← (njik + m·pji) / (nj + m)

where
nj is the number of training examples of class ωj
njik is the number of training examples of class ωj that have value aik for attribute Xi
pji is the prior estimate for P̂(Xi = aik | ωj)
m is the weight given to the prior

As nj → ∞, P̂(Xi = aik | ωj) → njik / nj

This is effectively the same as using Dirichlet priors.
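A minimal sketch of this m-estimate; the function and argument names are illustrative.

```python
# Minimal sketch of the m-estimate for one class-conditional probability.
def m_estimate(n_jik, n_j, prior=0.5, m=1.0):
    """P_hat(X_i = a_ik | w_j) with a prior weighted by m pseudo-examples."""
    return (n_jik + m * prior) / (n_j + m)

# With m = |Domain(X_i)| and prior = 1/|Domain(X_i)| this is the Laplace correction:
print(m_estimate(n_jik=0, n_j=5, prior=1/3, m=3))   # 0.125 instead of 0
```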

Sample Applications of the Naïve Bayes Classifier

Naive Bayes is among the most useful algorithms
• Learning dating preferences
• Learning which news articles are of interest
• Learning to classify web pages by topic
• Learning to classify spam
• Learning to assign proteins to functional families

What attributes shall we use to represent text?

Learning to Classify Text

• Target function Interesting: Documents → {+, –}
• Learning: use training examples to estimate P(+), P(–), P(d | +), P(d | –)

Alternative generative models for documents:
• Represent each document as a sequence of words
  – In the most general case, we need a probability for each word occurrence in each position in the document, for each possible document length

P(d | ωi) = P(length(d)) ∏_{k=1}^{length(d)} P(Xk | ωi, length(d))

This would require estimating length(d) × |Vocabulary| × |Ω| probabilities, for each possible document length!

To simplify matters, assume that the probability of encountering a specific word in a particular position is independent of the position and of the document length.

Treat each document as a bag of words!

Bag of Words Representation

So we estimate one position-independent, class-conditional probability P(wk | ωj) for each word, instead of the set of position-specific word occurrence probabilities P(X1 = wk | ωj), ..., P(Xlength(d) = wk | ωj).

The number of probabilities to be estimated drops to |Vocabulary| × |Ω|.

The result is a generative model for documents that treats each document as an ordered tuple of word frequencies.

More sophisticated models can consider dependencies between adjacent word positions.

Learning to Classify Text

With the bag-of-words representation, we have

P(d | ωj) ∝ [ (Σk nkd)! / ∏k nkd! ] ∏k P(wk | ωj)^nkd

where nkd is the number of occurrences of wk in document d (ignoring the dependence on the length of the document).

We can estimate P(wk | ωj) from the labeled bags of words we have.
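A minimal sketch of bag-of-words text classification using scikit-learn's CountVectorizer and MultinomialNB; the toy documents and labels are made up for illustration.

```python
# Minimal sketch of multinomial Naive Bayes over bags of words with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the goalie made a great save",
        "the defenseman scored on a slapshot",
        "the senate passed the budget bill",
        "the election results were contested"]
labels = ["hockey", "hockey", "politics", "politics"]

vectorizer = CountVectorizer()            # document -> vector of word counts
X = vectorizer.fit_transform(docs)
clf = MultinomialNB(alpha=1.0)            # alpha=1.0 is the Laplace correction
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["a great slapshot by the defenseman"])))
```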

Naïve Bayes Text Classifier

• Given 1000 training documents from each newsgroup, learn to classify new documents according to the newsgroup to which they belong
• Naive Bayes achieves 89% classification accuracy

comp.graphics misc.forsale comp.os.ms-windows.misc rec.autos comp.sys.ibm.pc.hardware rec.motorcycles comp.sys.mac.hardware comp.windows.x rec.sport.baseball alt.atheism rec.sport.hockey soc.religion.christian sci.space talk.religion.misc sci.crypt talk.politics.mideast sci.electronics talk.politics.misc sci.med talk.politics.guns


Representative article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu From: [email protected] (John Doe) Subject: Re: This year's biggest and worst (opinion)... Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided ….

Learning to label images

Object → bag of visual "words"

Bags of features for object recognition

Example image categories: face, flowers, building

• Works pretty well for image-level classification and for recognizing object instances

Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

Bags of features for object recognition

Caltech6 dataset
(Figure panels: bag of features; bag of features; parts-and-shape model)

Bag of features: outline

1. Extract features
2. Learn a "visual vocabulary"
3. Represent images by frequencies of "visual words"
   – Note: this requires matching image features to visual words

Feature extraction

• Regular grid
  – Vogel & Schiele, 2003
  – Fei-Fei & Perona, 2005
• Interest point detector
  – Csurka et al., 2004
  – Fei-Fei & Perona, 2005
  – Sivic et al., 2005
• Other methods
  – Random sampling (Vidal-Naquet & Ullman, 2002)
  – Segmentation-based patches (Barnard et al., 2003)

Constructing the visual vocabulary

Clustering of local features → visual vocabulary …
(Slide credit: Josef Sivic)

Constructing and using a visual vocabulary

• Clustering is used to construct a visual vocabulary or codebook
  – Provided the training set is sufficiently representative, the vocabulary will be "universal"
  – More sophisticated methods exist – representation learning, e.g., using deep neural networks
• When a new image is to be represented using the visual vocabulary, its image features are mapped to one or more visual words
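A minimal sketch of constructing and using a visual vocabulary with k-means; the random "descriptors" stand in for real local features (e.g., SIFT), and the vocabulary size is an arbitrary choice.

```python
# Minimal sketch of building a visual vocabulary and a bag-of-visual-words histogram.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
training_descriptors = rng.normal(size=(1000, 128))   # stand-ins pooled from training images

k = 50                                                # vocabulary size (a design choice)
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(training_descriptors)

def bag_of_visual_words(image_descriptors):
    """Map each descriptor to its nearest visual word and return a count histogram."""
    words = vocab.predict(image_descriptors)
    return np.bincount(words, minlength=k)

new_image = rng.normal(size=(200, 128))               # descriptors from one new image
print(bag_of_visual_words(new_image))
```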

Example visual vocabulary
(Fei-Fei et al., 2005)

Naïve Bayes Image Classifier

• Models images as bags of visual words
• Similar to modeling documents as bags of words
• Assumes that the features are independent given the class
• Multinomial Naïve Bayes

Other examples of "bag of words" models

• Protein sequences represented by bags of amino acids or bags of sequence features (words)
• Protein structures represented by bags of structural features
• Traces of system calls represented as bags of system calls (for intrusion detection)
• Video segments represented as bags of frames (which in turn are represented by bags of words and bags of image features)

Naïve Bayes Learner – Summary

• Produces a minimum-error classifier if the attributes are conditionally independent given the class

When to use:
• The attributes that describe instances are likely to be conditionally independent given the classification
• There is not enough data to estimate all the probabilities reliably if we do not assume independence
• Often works well even when the independence assumption is violated (Domingos and Pazzani, 1996)
• Can be used iteratively (Kang et al., 2006)

Modeling dependencies between attributes

• The Naïve Bayes classifier assumes that the attributes are independent given the class
• What if the independence assumption does not hold? We need more sophisticated models:
  • Decision trees and random forests
  • Support vector machines
  • Markov models
  • Bayesian networks

Evaluating Classifier Performance

Vasant Honavar
Artificial Intelligence Research Laboratory
College of Information Sciences and Technology
Bioinformatics and Genomics Graduate Program
The Huck Institutes of the Life Sciences
Pennsylvania State University

[email protected] vhonavar.ist.psu.edu faculty.ist.psu.edu/vhonavar

Estimating classifier performance

Domain(X) = {a, b, c, d}
D(X) = {1/8, 1/2, 1/8, 1/4}

x      a   b   c   d
h(x)   0   1   1   0
f(x)   1   1   0   0

errorD(h) = PrD[ h(x) ≠ f(x) ] = D(X = a) + D(X = c) = 1/8 + 1/8 = 1/4

Measuring classifier performance

ErrorD(h) = Pr_{x∈D}( f(x) ≠ h(x) )

• We do not, in general, know D, the distribution from which the data samples are drawn
• So we estimate the error from the samples we have

Estimating Classifier Performance

N: total number of instances in the data set

TPj: Number of True positives for class j

FPj : Number of False positives for class j

TNj: Number of True Negatives for class j

FNj: Number of false negatives for class j

Accuracyj = (TPj + TNj) / N = P(label = cj ∧ class = cj) + P(label = ¬cj ∧ class = ¬cj)

Perfect classifier ⟺ Accuracy = 1
A popular measure, but biased in favor of the majority class – it should be used with caution!

Measuring Classifier Performance: Sensitivity

Sensitivityj = TPj / (TPj + FNj)
             = Count(label = cj ∧ class = cj) / Count(class = cj)
             = P(label = cj | class = cj)

Perfect classifier → Sensitivity = 1
The probability of correctly labeling members of the target class
Also called recall or hit rate

Measuring Classifier Performance: Specificity

Specificityj = TPj / (TPj + FPj)
             = Count(label = cj ∧ class = cj) / Count(label = cj)
             = P(class = cj | label = cj)

Perfect classifier → Specificity = 1
Also called precision
The probability that a positive prediction is correct

Measuring Performance: Precision, Recall, and False Alarm Rate

Precisionj = Specificityj = TPj / (TPj + FPj)        Perfect classifier → Precision = 1
Recallj = Sensitivityj = TPj / (TPj + FNj)           Perfect classifier → Recall = 1

FalseAlarmj = FPj / (TNj + FPj)
            = Count(label = cj ∧ class = ¬cj) / Count(class = ¬cj)
            = P(label = cj | class = ¬cj)

Perfect classifier → False Alarm Rate = 0

Classifier Learning – Measuring Performance

                Class = C1    Class = ¬C1
Label = C1      TP = 55       FP = 5
Label = ¬C1     FN = 10       TN = 30

N = TP + FN + TN + FP = 100

Sensitivity1 = TP / (TP + FN) = 55/65 ≈ 0.85
Specificity1 = TP / (TP + FP) = 55/60 ≈ 0.92
Accuracy1    = (TP + TN) / N  = 85/100 = 0.85
FalseAlarm1  = FP / (TN + FP) = 5/35 ≈ 0.14
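A minimal sketch that reproduces the calculations above from the confusion-matrix counts; the function name is illustrative.

```python
# Minimal sketch: performance measures from TP, FP, FN, TN counts.
def classifier_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity (precision, as defined on these slides)": tp / (tp + fp),
        "accuracy": (tp + tn) / n,
        "false alarm rate": fp / (tn + fp),
    }

print(classifier_metrics(tp=55, fp=5, fn=10, tn=30))
```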

Measuring Performance – Correlation Coefficient

CCj = (TPj × TNj − FPj × FNj) / sqrt( (TPj + FNj)(TPj + FPj)(TNj + FPj)(TNj + FNj) )

−1 ≤ CCj ≤ 1

Perfect classifier ⟺ CC = 1; random guessing ⟺ CC = 0

This corresponds to the standard measure of correlation between two random variables, Label and Class, estimated from the labels L and the corresponding class values C, for the special case of binary (0/1) valued labels and classes:

CCj = Σ_{di ∈ D} (jlabeli − mean(jlabel)) (jclassi − mean(jclass)) / (σ_JLABEL σ_JCLASS)

where jlabeli = 1 iff the classifier assigns di to class cj, and
      jclassi = 1 iff the true class of di is cj

Beware of terminological confusion in the literature!

• Some bioinformatics authors use "accuracy" incorrectly to refer to recall (i.e., sensitivity) or precision (i.e., specificity)
• In medical statistics, specificity sometimes refers to the sensitivity for the negative class, i.e., TNj / (TNj + FPj)
• Some authors use "false alarm rate" to refer to the probability that a positive prediction is incorrect, i.e., FPj / (FPj + TPj) = 1 − Precisionj

When you write: provide the formula in terms of TP, TN, FP, FN
When you read: check the formula in terms of TP, TN, FP, FN

Measuring Classifier Performance

• TP, FP, TN, FN provide the relevant information
• No single measure tells the whole story
• A classifier with 98% accuracy can be useless if 98% of the population does not have cancer and the 2% that do are misclassified by the classifier
• The use of multiple measures is recommended
• Beware of terminological confusion!

Micro-averaged performance measures

Performance on a random instance
• Micro-averaging gives equal importance to each instance
• Classes with a large number of instances dominate

MicroAverage Precision = Σj TPj / Σj (TPj + FPj)
MicroAverage Recall    = Σj TPj / Σj (TPj + FNj)
MicroAverage FalseAlarm = 1 − MicroAverage Precision
MicroAverage Accuracy  = Σj TPj / N
etc.

MicroAverage CC = ( (Σj TPj)(Σj TNj) − (Σj FPj)(Σj FNj) ) /
                  sqrt( (Σj TPj + Σj FNj)(Σj TPj + Σj FPj)(Σj TNj + Σj FPj)(Σj TNj + Σj FNj) )

Macro-averaged performance measures

Macro averaging gives equal importance to each of the M classes

MacroAverage Sensitivity = (1/M) Σj Sensitivityj

MacroAverage CorrelationCoeff = (1/M) Σj CorrelationCoeffj

MacroAverage Specificity = (1/M) Σj Specificityj
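A minimal sketch contrasting micro- and macro-averaged precision on hypothetical per-class counts, illustrating how the large class dominates the micro average.

```python
# Minimal sketch of micro- vs. macro-averaged precision from per-class counts.
counts = {                        # class -> (TP, FP, FN); made-up numbers
    "c1": (90, 10, 10),           # large class
    "c2": (1, 5, 9),              # small class
}

def micro_macro_precision(counts):
    per_class = {c: tp / (tp + fp) for c, (tp, fp, fn) in counts.items()}
    macro = sum(per_class.values()) / len(per_class)          # each class counts equally
    micro = (sum(tp for tp, _, _ in counts.values())
             / sum(tp + fp for tp, fp, _ in counts.values())) # each instance counts equally
    return micro, macro

print(micro_macro_precision(counts))   # micro ≈ 0.858, macro ≈ 0.533
```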

Receiver Operating Characteristic (ROC) Curve

• We can often trade off recall against precision, e.g., by adjusting the classification threshold θ:

  label = cj if P(cj | X) / P(¬cj | X) > θ

• The ROC curve is a plot of sensitivity against false alarm rate, which characterizes this tradeoff for a given classifier

(ROC plot: Sensitivity = TP/(TP + FN) on the y-axis versus False Alarm Rate = FP/(FP + TN) on the x-axis, both ranging from 0 to 1; a perfect classifier sits at the top-left corner.)

Measuring Performance of Classifiers – ROC curves

• ROC curves offer a more complete picture of the performance of the classifier as a function of the classification threshold
• A classifier h is better than another classifier g if ROC(h) dominates ROC(g)
• ROC(h) dominates ROC(g) ⟹ AreaROC(h) > AreaROC(g)
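A minimal sketch of computing an ROC curve and its area with scikit-learn; the labels and scores below are made up.

```python
# Minimal sketch: ROC curve and area under the curve with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 0, 0, 0, 0]                     # true classes
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]     # classifier scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # false-alarm rate vs. sensitivity
print(list(zip(fpr, tpr)))
print("AUC =", roc_auc_score(y_true, y_score))         # area under the ROC curve
```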

Evaluating a Classifier

• How well can a classifier be expected to perform on novel data?
• We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using a test data set (not used for training)
• How close is the estimated performance to the true performance?

References:
• Evaluation of discrete-valued hypotheses – Chapter 5, Mitchell
• Empirical Methods for Artificial Intelligence, Cohen

Estimating the performance of a classifier

The true error of a hypothesis h with respect to a target function f and an instance distribution D is

ErrorD(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]

The sample error of a hypothesis h with respect to a target function f and a sample S is

ErrorS(h) ≡ (1/|S|) Σ_{x∈S} δ( f(x), h(x) ),   where δ(a, b) = 1 iff a ≠ b and δ(a, b) = 0 otherwise

Evaluating the performance of a classifier

• The sample error estimated from the training data is an optimistic estimate of the true error: bias = E[ErrorS(h)] − ErrorD(h)
• For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
• Even when the estimate is unbiased, it can vary across samples!
• If h misclassifies 8 out of 100 samples, ErrorS(h) = 8/100 = 0.08
• How close is the sample error to the true error?

How close is the estimated error to the true error?

• Choose a sample S of size n according to distribution D
• Measure ErrorS(h)
• ErrorS(h) is a random variable (the outcome of a random experiment)
• Given ErrorS(h), what can we conclude about ErrorD(h)?

More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?

Evaluation of a classifier with limited data

• There is extensive literature on how to estimate classifier performance from samples and how to assign confidence to the estimates (see Mitchell, Chapter 5)
• Holdout method – use part of the data for training and the rest for testing
• We may be unlucky – the training data or the test data may not be representative
• Solution – run multiple experiments with disjoint training and test data sets in which each class is represented in roughly the same proportion as in the entire data set

Estimating the performance of the learned classifier: K-fold cross-validation

Partition the data (multi)set S into K equal parts S1, ..., SK with roughly the same class distribution as S.
Errorc ← 0
For i = 1 to K do {
    STest ← Si
    STrain ← S − Si
    α ← Learn(STrain)
    Errorc ← Errorc + Error(α, STest)
}
Error ← Errorc / K
Output(Error)
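A minimal sketch of this K-fold procedure with scikit-learn, using StratifiedKFold so each fold roughly preserves the class distribution of S; the data and the choice of a Gaussian Naive Bayes learner are illustrative assumptions.

```python
# Minimal sketch of K-fold cross-validation mirroring the pseudocode above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(2.0, 1.0, size=(50, 5))])   # made-up two-class data
y = np.array([0] * 50 + [1] * 50)

K = 5
error_c = 0.0
for train_idx, test_idx in StratifiedKFold(n_splits=K, shuffle=True, random_state=0).split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])            # alpha <- Learn(S_train)
    error_c += np.mean(clf.predict(X[test_idx]) != y[test_idx])   # Error(alpha, S_test)

print("estimated error:", error_c / K)
```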

Leave-one-out cross-validation

• K-fold cross-validation with K = n, where n is the total number of samples available
• n experiments – each uses n − 1 samples for training and the remaining sample for testing
• Leave-one-out cross-validation does not guarantee the same class distribution in the training and test data!
  Extreme case: 50% class 1, 50% class 2; predict the majority class label in the training data
  True error – 50%; leave-one-out error estimate – 100%!

Estimating classifier performance

Recommended procedure
• Use K-fold cross-validation (K = 5 or 10) to obtain performance estimates (accuracy, precision, recall, points on the ROC curve, etc.) and 95% confidence intervals around the mean
• Compute the mean values and standard deviations of the performance estimates
• Report the mean values of the performance estimates and their standard deviations or 95% confidence intervals around the mean
• Be skeptical – repeat the experiments several times with different random splits of the data into K folds!
• A great deal more can be done with regard to rigorous statistical evaluation of alternative models, algorithms, etc.

Generative Versus Discriminative Models

§ Generative models
  § Naïve Bayes, Bayes networks, etc.
§ Discriminative models
  § Perceptron, support vector machines, logistic regression, …
§ Relating generative and discriminative models
§ Tradeoffs between generative and discriminative models
§ Generalizations and extensions

Decision tree representation

In the simplest case,
• each internal node tests on an attribute
• each branch corresponds to an attribute value
• each leaf node corresponds to a class label

In general,
• each internal node corresponds to a test (on input instances) with mutually exclusive and exhaustive outcomes – tests may be univariate or multivariate
• each branch corresponds to an outcome of a test
• each leaf node corresponds to a class label

Decision Tree Representation

Data set:

Example   x   y   Class
1         1   1   A
2         0   1   B
3         1   0   A
4         0   0   B

(Figure: Tree 1 and Tree 2, two decision trees that are both consistent with this data set.)

Should we choose Tree 1 or Tree 2? Why?

Decision tree representation

• Any Boolean function can be represented by a decision tree
• Any function f : A1 × A2 × ··· × An → C, where Ai is the domain of the i-th attribute and C is a discrete set of values (class labels), can be represented by a decision tree
• In general, the inputs need not be discrete-valued

Learning Decision Tree Classifiers

• Decision trees are especially well suited for representing simple rules for classifying instances that are described by discrete attribute values
• Decision tree learning algorithms
  • implement Ockham's razor as a preference bias (simpler decision trees are preferred over more complex trees)
  • are relatively efficient – linear in the size of the decision tree and the size of the data set
  • produce comprehensible results
  • are often among the first to be tried on a new data set
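As a quick illustration (not from the slides), fitting a decision tree to the tiny (x, y) data set shown earlier with scikit-learn; with the entropy criterion the learned tree splits only on x.

```python
# Minimal sketch: fit a decision tree to the small (x, y) -> class data set above.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [0, 1], [1, 0], [0, 0]]    # columns: attributes x, y
labels = ["A", "B", "A", "B"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
print(export_text(tree, feature_names=["x", "y"]))   # the tree splits only on x
```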

Learning Decision Tree Classifiers

§ Ockham's razor recommends that we pick the simplest decision tree that is consistent with the training set
§ The simplest tree is the one that takes the fewest bits to encode (why?)
§ There are far too many trees that are consistent with a training set
§ Searching for the simplest tree that is consistent with the training set is typically not computationally feasible

Solution
§ Use a greedy algorithm – not guaranteed to find the simplest tree, but works well in practice
§ Or restrict the space of hypotheses to a subset of simple trees

Information – Some intuitions

• Information reduces uncertainty
• Information is relative – to what you already know
• The information content of a message is related to how surprising the message is
• Information depends on context

Digression: Information and Uncertainty

Sender → Message → Receiver

You are stuck inside. You send me out to report back to you on what the weather is like. I do not lie, so you trust me. You and I are both generally familiar with the weather in Iowa

On a July afternoon in Pennsylvania, I walk into the room and tell you it is hot outside On a January afternoon in Pennsylvania, I walk into the room and tell you it is hot outside

Digression: Information and Uncertainty

How much information does a message contain?
• If my message to you describes a scenario that you expect with certainty, the information content of the message, for you, is zero
• The more surprising the message is to the receiver, the greater the amount of information it conveys
What does it mean for a message to be surprising?

Suppose I have a coin with heads on both sides, and you know that I have a coin with heads on both sides. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?

Suppose I have a fair coin and you know that I have a fair coin. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?

Information

• Without loss of generality, assume that messages are binary – made of 0s and 1s. • Conveying the outcome of a fair coin toss requires 1 bit of information – need to identify one out of two equally likely outcomes • Conveying the outcome one of an experiment with 8 equally likely outcomes requires 3 bits .. • Conveying an outcome of that is certain takes 0 bits • In general, if an outcome has a probability p, the information content of the corresponding message is

I( p) = −log2 p I(0) = 0

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Information is Subjective

• Suppose there are 3 agents – Sanghack, Sam, and Aria – in a world where a die has been tossed.
• Sanghack observes that the outcome is a "6" and whispers to Sam that the outcome is "even", but Aria knows nothing about the outcome.
• The probability assigned by Sam to the event "6" is a subjective measure of Sam's belief about the state of the world.
• Information gained by Sanghack by looking at the outcome of the die $= \log_2 6$ bits
• Information conveyed by Sanghack to Sam $= \log_2 6 - \log_2 3 = 1$ bit
• Information conveyed by Sanghack to Aria $= 0$ bits

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Information and Shannon Entropy

• Suppose we have a message that conveys the result of a random experiment with $m$ possible discrete outcomes, with probabilities $p_1, p_2, \ldots, p_m$
• The expected information content of such a message is called the entropy of the probability distribution:

$$H(p_1, p_2, \ldots, p_m) = \sum_{i=1}^{m} p_i\, I(p_i)$$

where $I(p_i) = -\log_2 p_i$ if $p_i \neq 0$, and $I(p_i) = 0$ otherwise.

Shannon's entropy as a measure of information

Let $P = (p_1, \ldots, p_n)$ be a discrete probability distribution. The entropy of the distribution $P$ is given by

$$H(P) = \sum_{i=1}^{n} p_i \log_2\!\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

$$H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

$$H(0, 1) = -1\log_2 1 - 0\log_2 0 = 0 \text{ bits} \quad (\text{with the convention } 0\log_2 0 = 0)$$
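As a quick illustration (my own, not part of the original slides), here is a minimal Python sketch of the entropy of a discrete distribution; the function name `entropy` and the example distributions are assumptions chosen for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)  # convention: 0 log 0 = 0

print(entropy([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy([1.0, 0.0]))  # 0.0 bits (certain outcome)
print(entropy([1/6] * 6))   # log2(6) ~ 2.585 bits (fair die)
```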

Properties of Shannon's entropy

• $\forall P,\; H(P) \geq 0$
• If there are $N$ possible outcomes, $H(P) \leq \log_2 N$
• If $\forall i,\; p_i = \frac{1}{N}$, then $H(P) = \log_2 N$
• If $\exists i$ such that $p_i = 1$, then $H(P) = 0$
• $H(P)$ is a continuous function of $P$

Shannon's entropy as a measure of information

• For any distribution $P$, $H(P)$ is the optimal number of binary (yes/no) questions required on average to determine an outcome drawn from $P$.
• We can extend these ideas to talk about how much information the observation of the outcome of one experiment conveys about the possible outcomes of another (mutual information)
• We can also quantify the difference between two probability distributions (Kullback–Leibler divergence or relative entropy)

Coding Theory Perspective

• Suppose you and I both know the distribution $P$
• I choose an outcome according to $P$
• Suppose I want to send you a message about the outcome
• You and I could agree in advance on the questions, so I can simply send you the answers
• The optimal message length on average is $H(P)$
• This generalizes to noisy channels

Learning Decision Tree Classifiers

On average, the information needed to convey the class membership of a random instance drawn from nature is $H(P)$

[Diagram: Nature generates instances; a training data set $S = S_1 \cup S_2 \cup \ldots \cup S_m$ is used to build a classifier that assigns a class label to each instance]

$$H(\hat{P}) = -\sum_{i=1}^{m} \hat{p}_i \log_2(\hat{p}_i) = \hat{H}(X)$$

where $\hat{P}$ is an estimate of $P$, $X$ is a random variable with distribution $P$, and $S_i$ is the multiset of examples belonging to class $C_i$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Learning Decision Tree Classifiers

The task of the learner then is to extract the needed information from the training set and store it in the form of a decision tree for classification.

Information gain based decision tree learner (a sketch appears below):
• Start with the entire training set at the root
• Recursively add nodes to the tree corresponding to tests that yield the greatest expected reduction in entropy (i.e., the largest expected information gain)
• Stop when some termination criterion is met (e.g., the training data at every leaf node has zero entropy)
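The following is a minimal sketch (my own, not the course's reference implementation) of the greedy information-gain learner described above, for discrete attributes; the function names `H`, `cond_entropy`, and `grow_tree` and the data representation (a list of attribute-value dictionaries plus a parallel list of labels) are illustrative assumptions.

```python
import math
from collections import Counter

def H(labels):
    """Empirical entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(rows, labels, attr):
    """Expected entropy of the labels after splitting on attribute `attr`."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return sum(len(g) / n * H(g) for g in groups.values())

def grow_tree(rows, labels, attrs):
    """Greedy learner: returns a class label (leaf) or a pair (attribute, {value: subtree})."""
    if H(labels) == 0 or not attrs:                  # termination: pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: cond_entropy(rows, labels, a))   # largest information gain
    children = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                    [a for a in attrs if a != best])
    return (best, children)
```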

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Learning Decision Tree Classifiers - Example

Training Data

Instances are ordered 3-tuples of attribute values corresponding to Height (tall, short), Hair (dark, blonde, red), and Eye (blue, brown), coded t/s, d/b/r, and l/w respectively.

Instance   Height   Hair   Eye   Class label
I1         t        d      l     +
I2         s        d      l     +
I3         t        b      l     -
I4         t        r      l     -
I5         s        b      l     -
I6         t        b      w     +
I7         t        d      w     +
I8         s        b      w     +

Learning Decision Tree Classifiers - Example

$$\hat{H}(X) = -\tfrac{3}{8}\log_2\tfrac{3}{8} - \tfrac{5}{8}\log_2\tfrac{5}{8} = 0.954 \text{ bits}$$

Splitting on Height partitions $I_1 \ldots I_8$ into $S_t = \{I_1, I_3, I_4, I_6, I_7\}$ and $S_s = \{I_2, I_5, I_8\}$:

$$\hat{H}(X \mid Height = t) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971 \text{ bits}$$
$$\hat{H}(X \mid Height = s) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.918 \text{ bits}$$
$$\hat{H}(X \mid Height) = \tfrac{5}{8}\,\hat{H}(X \mid Height = t) + \tfrac{3}{8}\,\hat{H}(X \mid Height = s) = \tfrac{5}{8}(0.971) + \tfrac{3}{8}(0.918) = 0.95 \text{ bits}$$

Similarly, $\hat{H}(X \mid Eye) = 0.607$ bits and $\hat{H}(X \mid Hair) = 0.5$ bits.
Hair is the most informative attribute because it yields the largest reduction in entropy, so the test on the value of Hair is chosen as the root of the decision tree.
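Continuing the sketch given earlier (it reuses the `H` and `cond_entropy` helpers defined there), the numbers on this slide can be checked numerically; the dictionary encoding of the eight instances below is my own.

```python
rows = [
    {"Height": "t", "Hair": "d", "Eye": "l"},  # I1
    {"Height": "s", "Hair": "d", "Eye": "l"},  # I2
    {"Height": "t", "Hair": "b", "Eye": "l"},  # I3
    {"Height": "t", "Hair": "r", "Eye": "l"},  # I4
    {"Height": "s", "Hair": "b", "Eye": "l"},  # I5
    {"Height": "t", "Hair": "b", "Eye": "w"},  # I6
    {"Height": "t", "Hair": "d", "Eye": "w"},  # I7
    {"Height": "s", "Hair": "b", "Eye": "w"},  # I8
]
labels = ["+", "+", "-", "-", "-", "+", "+", "+"]

print(round(H(labels), 3))                                   # 0.954
for attr in ("Height", "Eye", "Hair"):
    print(attr, round(cond_entropy(rows, labels, attr), 3))  # ~0.95, 0.607, 0.5
```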

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Learning Decision Tree Classifiers - Example

Learned tree:
  Hair = d  →  +
  Hair = r  →  -
  Hair = b  →  test Eye:
      Eye = l  →  -
      Eye = w  →  +

Compare the result with Naïve Bayes.

In practice, we need some way to prune the tree to avoid overfitting the training data – more on this later.


Minimizing overfitting

• Use roughly the same size sample at every node to estimate entropy – when there is a large data set from which we can sample
• Stop when a further split fails to yield statistically significant information gain (estimated from a validation set)
• Grow the full tree, then prune
• Minimize size(tree) + size(exceptions(tree))

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Reduced error pruning

Each decision node in the tree is considered as a candidate for pruning

Pruning a decision node consists of:
• removing the subtree rooted at that node,
• making it a leaf node, and
• assigning it the most common label at that node

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Reduced error pruning – Example

Node   Accuracy gain by pruning
A      -20%
B      +10%

[Diagrams: the tree before pruning (root A, 100 examples, with a subtree rooted at B covering 60 examples) and the tree after pruning]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Reduced error pruning

Do until further pruning is harmful:
• Evaluate the impact on the validation set of pruning each candidate node
• Greedily select the node whose pruning (removing the subtree rooted at that node) most improves performance on the validation set

Drawback – holding back the validation set limits the amount of training data available – not desirable when data set is small

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Reduced error pruning

Pruned Tree

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Pruning based on whether information gain is significantly greater than zero

Evaluate a candidate split to decide whether the resulting information gain is significantly greater than zero, as determined using a suitable hypothesis testing method at a desired significance level.

Example: a node with $n_1 = 50$ instances of class 1 and $n_2 = 50$ instances of class 2, $N = n_1 + n_2 = 100$. A candidate split sends all 50 instances of class 1 to the left branch (L) and all 50 instances of class 2 to the right branch (R):
$$n_{1L} = 50,\quad n_{2L} = 0,\quad n_L = 50,\quad p = \frac{n_L}{N} = 0.5$$
Expected counts under a random split: $n_{1e} = p\,n_1 = 25$, $\;n_{2e} = p\,n_2 = 25$
$$\chi^2 = \frac{(n_{1L} - n_{1e})^2}{n_{1e}} + \frac{(n_{2L} - n_{2e})^2}{n_{2e}} = 25 + 25 = 50$$
This split is significantly better than random with confidence $> 99\%$ because $\chi^2 > 6.64$.

Pruning based on whether information gain is significantly greater than zero

Example: the $\chi^2$ statistic. In the 2-class, binary (L, R) split case, with $n_1$ instances of class 1 and $n_2$ of class 2, $N = n_1 + n_2$, and a split that sends $pN$ instances to L and $(1-p)N$ to R:
$$\chi^2 = \sum_{i=1}^{2} \frac{(n_{iL} - n_{ie})^2}{n_{ie}}$$
A random split would send $n_{1e} = p\,n_1$ instances of class 1 and $n_{2e} = p\,n_2$ instances of class 2 to L.
The critical value of $\chi^2$ depends on the number of degrees of freedom, which is 1 in this case (for a given $p$, $n_{1L}$ fully determines $n_{2L}$, $n_{1R}$, and $n_{2R}$).

In general, the number of degrees of freedom can be > 1
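A small sketch (my own) of this significance test for the 2-class binary split; the helper name `chi_square_split` is an assumption, and 6.64 is the $\chi^2$ critical value at the 1% level with 1 degree of freedom.

```python
def chi_square_split(n1L, n2L, n1, n2):
    """Chi-square statistic comparing a binary split's left-branch counts to a random split."""
    N = n1 + n2
    p = (n1L + n2L) / N              # fraction of instances sent to the left branch
    n1e, n2e = p * n1, p * n2        # expected left-branch counts under a random split
    return (n1L - n1e) ** 2 / n1e + (n2L - n2e) ** 2 / n2e

stat = chi_square_split(n1L=50, n2L=0, n1=50, n2=50)
print(stat, stat > 6.64)             # 50.0 True -> reject the null hypothesis that the split is random
```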

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Pruning based on whether information gain is significantly greater than zero

The greater the value of $\chi^2$, the less likely it is that the split is random. For a sufficiently high value of $\chi^2$, the difference from the expected (random) split is statistically significant and we reject the null hypothesis that the split is random.

$$\chi^2 = \sum_{j=1}^{Branches-1} \sum_{i=1}^{Classes} \frac{(n_{ij} - n_{iej})^2}{n_{iej}}$$

where $N = n_1 + n_2 + \ldots + n_{Classes}$, $\;p = [p_1\; p_2\; \ldots\; p_{Branches}]$ with $\sum_{j=1}^{Branches} p_j = 1$, and $n_{iej} = p_j\, n_i$

Degrees of freedom $= (Classes - 1)(Branches - 1)$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Classification of instances

• Unique classification – possible when each leaf has zero entropy and there are no missing attribute values • Most likely classification – based on distribution of classes at a node when there are no missing attribute values • Probabilistic classification – based on distribution of classes at a node when there are no missing attribute values

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Handling different types of attribute values

Types of attributes • Nominal – values are names • Ordinal – values are ordered • Cardinal (Numeric) – values are numbers (hence ordered) ….

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Handling numeric attributes

Attribute T:   40   48   50   54   60   70
Class:         N    N    Y    Y    Y    N

Candidate splits: $T > \frac{48+50}{2} = 49\,?$ and $T > \frac{60+70}{2} = 65\,?$

$$E(S \mid T > 49\,?) = \tfrac{2}{6}(0) + \tfrac{4}{6}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right)$$

• Sort instances by the value of the numeric attribute under consideration
• For each attribute, find the test which yields the lowest entropy
• Greedily choose the best test across all attributes
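A brief sketch (my own) of the candidate-threshold procedure on this slide, reusing the `H` helper from the earlier decision-tree sketch: sort by the attribute, place candidate thresholds at midpoints where the class changes, and score each by the resulting weighted entropy.

```python
def candidate_splits(values, labels):
    """Return (threshold, weighted entropy) pairs for midpoints where the class label changes."""
    pairs = sorted(zip(values, labels))
    out = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2:
            thr = (v1 + v2) / 2
            left = [c for v, c in pairs if v <= thr]
            right = [c for v, c in pairs if v > thr]
            e = len(left) / len(pairs) * H(left) + len(right) / len(pairs) * H(right)
            out.append((thr, e))
    return out

print(candidate_splits([40, 48, 50, 54, 60, 70], ["N", "N", "Y", "Y", "Y", "N"]))
# thresholds 49.0 and 65.0; the split at 49 has the lower weighted entropy (~0.54)
```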

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Handling numeric attributes

[Figure: an axis-parallel split versus an oblique split separating classes C1 and C2]

Oblique splits cannot be realized by univariate tests

Two-way versus multi-way splits

The entropy criterion favors many-valued attributes.
Pathological behavior – what if, in a medical diagnosis data set, social security number is one of the candidate attributes?
Solutions:
• Binary tests of the form $A = value$ versus $A = \neg value$
• Only two-way splits (CART)
• Entropy ratio (C4.5):

$$EntropyRatio(S \mid A) \equiv \frac{Entropy(S \mid A)}{SplitEntropy(S \mid A)}$$

$$SplitEntropy(S \mid A) \equiv -\sum_{i=1}^{|Values(A)|} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
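A short sketch (my own) of the entropy ratio as defined on this slide, reusing `Counter`, `math`, and `cond_entropy` from the earlier decision-tree sketch; normalizing by the split entropy penalizes many-valued attributes such as a social security number.

```python
def split_entropy(rows, attr):
    """Entropy of the partition sizes induced by attribute `attr` (ignores the class labels)."""
    n = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(rows, labels, attr):
    # Assumes the attribute takes at least two values, so split_entropy is nonzero.
    return cond_entropy(rows, labels, attr) / split_entropy(rows, attr)
```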

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Alternative split criteria Consider split of set S based on attribute A

$$Impurity(S \mid A) = \sum_{j=1}^{|Values(A)|} Impurity(S_j)$$

Entropy: $\;Impurity(Z) = -\sum_{i=1}^{Classes} \frac{|Z_i|}{|Z|} \log_2 \frac{|Z_i|}{|Z|}$

Gini: $\;Impurity(Z) = \sum_{i \neq j} \left(\frac{|Z_i|}{|Z|}\right)\left(\frac{|Z_j|}{|Z|}\right) = 1 - \sum_{i=1}^{Classes} \left(\frac{|Z_i|}{|Z|}\right)^2$

(Expected rate of error if the class label is picked randomly according to the distribution of instances in the set)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Alternative split criteria

One-sided split criteria – often useful for exploratory analysis of data

$$Impurity(S \mid A) = \min_{i \in Values(A)} \{Impurity(S_i)\}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Incorporating Attribute costs

Not all attribute measurements are equally costly or risky. In medical diagnosis, a Blood-Test may have a cost of $150 while Exploratory-Surgery may have a cost of $3000.
Goal: learn a decision tree classifier which minimizes the cost of classification.

Tan and Schlimmer (1990): choose the attribute that maximizes $\;\dfrac{Gain^2(S, A)}{Cost(A)}$

Nunez (1988): choose the attribute that maximizes $\;\dfrac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}$

where $w \in [0, 1]$ determines the importance of cost.

Incorporating Different Misclassification Costs for different classes

Not all misclassifications are equally costly: an occasional false alarm about a nuclear power plant meltdown is less costly than the failure to alert when there is a chance of a meltdown.

Weighted Gini Impurity

$$Impurity(S) = \sum_{ij} \lambda_{ij} \left(\frac{|S_i|}{|S|}\right)\left(\frac{|S_j|}{|S|}\right)$$

where $\lambda_{ij}$ is the cost of wrongly assigning an instance belonging to class $i$ to class $j$.
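A minimal sketch (my own) of this cost-weighted Gini impurity; `costs[i][j]` plays the role of $\lambda_{ij}$, and the example class counts and cost values are made up purely for illustration.

```python
def weighted_gini(class_counts, costs):
    """Cost-weighted Gini impurity: sum over i != j of lambda_ij * (|S_i|/|S|) * (|S_j|/|S|)."""
    total = sum(class_counts.values())
    return sum(costs[i][j] * (ci / total) * (cj / total)
               for i, ci in class_counts.items()
               for j, cj in class_counts.items() if i != j)   # lambda_ii is taken to be 0

# Hypothetical example: a false alarm costs far less than a missed meltdown.
counts = {"safe": 90, "meltdown": 10}
costs = {"safe": {"meltdown": 1}, "meltdown": {"safe": 100}}
print(weighted_gini(counts, costs))   # 0.09 + 9.0 = 9.09
```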

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Dealing with Missing Attribute Values (Solution 1)

Sometimes, the fact that an attribute value is missing might itself be informative – a missing blood sugar level might imply that the physician did not consider it necessary to measure it.
Introduce a new value (one per attribute) to denote a missing value.
Decision tree construction and use of the tree for classification proceed as before.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Dealing with Missing Attribute Values (Solution 2)

During decision tree construction Replace a missing attribute value in a training example with the most frequent value found among the instances at the node that have the same class label as the training example

During use of tree for classification Assign to a missing attribute the most frequent value found at the node (based on the training sample) Sort the instance through the tree to generate the class label

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Dealing with Missing Attribute Values

During decision tree construction Generate several fractionally weighted training examples based on the distribution of values for the corresponding attribute at the node During use of tree for classification Generate multiple instances by assigning candidate values for the missing attribute based on the distribution of instances at the node Sort each such instance through the tree to generate candidate labels and assign the most probable class label or probabilistically assign class label

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Dealing with Missing Attribute Values

[Tree: the root tests A ($n_+ = 60$, $n_- = 40$). The branch A=1 leads to a leaf labeled + ($n_+ \mid A{=}1 = 50$). The branch A=0 ($n_+ \mid A{=}0 = 10$, $n_- \mid A{=}0 = 40$) tests B: B=1 leads to a leaf labeled − ($n_- \mid A{=}0, B{=}1 = 40$) and B=0 leads to a leaf labeled + ($n_+ \mid A{=}0, B{=}0 = 10$)]

Suppose the value of B is missing:
• Replacement with the most frequent value at the node → B=1
• Replacement with the most frequent value among instances of class + → B=0

Dealing with Missing Attribute Values

[Same tree as above] Suppose the value of B is missing:
• Fractional instances based on the distribution at the node: 4/5 for B=1, 1/5 for B=0
• Fractional instances based on the distribution at the node for class +: 1/5 for B=0, 0 for B=1

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory A simple discriminative model: Perceptron

• Outline • Background • Threshold logic functions • Connection to logic • Connection to geometry • Learning threshold functions – perceptron algorithm and its variants • Perceptron convergence theorem

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory McCulloch-Pitts computational model of a neuron

[Diagram: inputs $x_0 = 1, x_1, x_2, \ldots, x_n$ with synaptic weights $w_0, w_1, \ldots, w_n$ feed a summation unit that produces the output $y$]

$$y = 1 \;\text{ if }\; \sum_{i=0}^{n} w_i x_i > 0, \qquad y = -1 \;\text{ otherwise}$$
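A tiny sketch (my own) of this threshold unit in Python, with $x_0 = 1$ providing the bias input; the function name `threshold_neuron` is an assumption, and the weights shown reproduce the logical AND example that appears later in these slides.

```python
def threshold_neuron(weights, x):
    """McCulloch-Pitts unit: y = 1 if sum_i w_i * x_i > 0 (with x[0] == 1), else y = -1."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

# Logical AND over inputs in {0,1}: w0 = -1.5, w1 = w2 = 1
print(threshold_neuron([-1.5, 1, 1], [1, 1, 1]))  # 1
print(threshold_neuron([-1.5, 1, 1], [1, 0, 1]))  # -1
```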

Threshold neuron – Connection with Geometry

[Figure: in the $(x_1, x_2)$ plane, the line $w_1 x_1 + w_2 x_2 + w_0 = 0$ is the decision boundary, with the weight vector $(w_1, w_2)$ normal to it; the half-plane where $w_1 x_1 + w_2 x_2 + w_0 > 0$ contains class $C_1$ and the half-plane where $w_1 x_1 + w_2 x_2 + w_0 < 0$ contains class $C_2$]

$\sum_{i=1}^{n} w_i x_i + w_0 = 0$ describes a hyperplane which divides the instance space $\Re^n$ into two half-spaces:
$$\chi^+ = \{X_p \in \Re^n \mid W \bullet X_p + w_0 > 0\} \quad\text{and}\quad \chi^- = \{X_p \in \Re^n \mid W \bullet X_p + w_0 < 0\}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory McCulloch-Pitts Neuron or Threshold Neuron

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \qquad y = \mathrm{sign}(W \bullet X + w_0) = \mathrm{sign}\!\left(\sum_{i=0}^{n} w_i x_i\right) = \mathrm{sign}(W^T X + w_0)$$

$$\mathrm{sign}(v) = 1 \;\text{ if } v > 0; \qquad \mathrm{sign}(v) = -1 \;\text{ otherwise}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron– Connection with Geometry

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron – Connection with Geometry

Instance space ℜ n Hypothesis space is the set of (n-1)-dimensional hyperplanes defined in the n-dimensional instance space

A hypothesis is defined by $\sum_{i=0}^{n} w_i x_i = 0$
• The orientation of the hyperplane is governed by $(w_1 \ldots w_n)^T$
• The perpendicular distance of the hyperplane from the origin is given by $\dfrac{w_0}{\sqrt{w_1^2 + w_2^2 + \ldots + w_n^2}}$

Threshold neuron as a pattern classifier

• The threshold neuron can be used to classify a set of instances into one of two classes C1, C2

• If the output of the neuron for input pattern Xp is +1 then Xp is assigned to class C1

• If the output is -1 then the pattern Xp is assigned to C2

• Example: $[w_0\; w_1\; w_2]^T = [-1\; -1\; 1]^T$, $\;X_p = [1\; 0]^T$
$$W \bullet X_p + w_0 = -1 + (-1) = -2 < 0$$
so $X_p$ is assigned to class $C_2$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron – Connection with Logic

• Suppose the input space is $\{0,1\}^n$
• Then the threshold neuron computes a Boolean function $f : \{0,1\}^n \rightarrow \{-1, 1\}$

Example: let $w_0 = -1.5$ and $w_1 = w_2 = 1$. In this case, the threshold neuron implements the logical AND function:

x1   x2   g(X)    y
0    0    -1.5   -1
0    1    -0.5   -1
1    0    -0.5   -1
1    1     0.5    1

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron – Connection with Logic

• A threshold neuron with the appropriate choice of weights can implement the Boolean AND, OR, and NOT functions
• Theorem: For any arbitrary Boolean function f, there exists a network of threshold neurons that can implement f.
• Theorem: Any arbitrary finite state automaton can be realized using threshold neurons and delay units
• Networks of threshold neurons, given access to unbounded memory, can compute any Turing-computable function
• Corollary: Brains, if given access to enough working memory, can compute any computable function

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron: Connection with Logic

Theorem: There exist functions that cannot be implemented by a single threshold neuron. Example Exclusive OR

Why?

[Figure: the four XOR points in the $(x_1, x_2)$ plane – no single line separates the two classes]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Threshold neuron – Connection with Logic

• Definition: A function that can be computed by a single threshold neuron is called a threshold function • Of the 16 2-input Boolean functions, 14 are Boolean threshold functions • As n increases, the number of Boolean threshold functions becomes an increasingly small fraction of the total number of n-input Boolean functions

$$N_{Boolean}(n) = 2^{2^n} \qquad N_{Threshold}(n) \leq 2^{n^2}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Learning Threshold functions

A training example $E_k$ is an ordered pair $(X_k, d_k)$ where $X_k = [x_{0k}\; x_{1k}\; \ldots\; x_{nk}]^T$ is an $(n+1)$-dimensional input pattern, $d_k = f(X_k) \in \{-1, 1\}$ is the desired output of the classifier, and $f$ is an unknown target function to be learned.

A training set E is simply a multi-set of examples.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Learning Threshold functions

$$S^+ = \{X_k \mid (X_k, d_k) \in E \text{ and } d_k = 1\} \qquad S^- = \{X_k \mid (X_k, d_k) \in E \text{ and } d_k = -1\}$$

We say that a training set $E$ is linearly separable if and only if
$$\exists W^* \text{ such that } \forall X_p \in S^+,\; W^* \bullet X_p > 0 \;\text{ and }\; \forall X_p \in S^-,\; W^* \bullet X_p < 0$$

Learning task: Given a linearly separable training set $E$, find a solution $W^*$ such that $\forall X_p \in S^+,\; W^* \bullet X_p > 0$ and $\forall X_p \in S^-,\; W^* \bullet X_p < 0$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Rosenblatt’s Perceptron Learning Algorithm

Initialize $W = [0\; 0\; \ldots\; 0]^T$; set the learning rate $\eta > 0$

Repeat until a complete pass through $E$ results in no weight updates:
    For each training example $E_k \in E$:
        $y_k \leftarrow \mathrm{sign}(W \bullet X_k)$
        $W \leftarrow W + \eta (d_k - y_k) X_k$
$W^* \leftarrow W$; return $W^*$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Perceptron learning algorithm –Example

Let $S^+ = \{(1, 1, 1), (1, 1, -1), (1, 0, -1)\}$ and $S^- = \{(1, -1, -1), (1, -1, 1), (1, 0, 1)\}$; $\;\eta = \frac{1}{2}$, $\;W = (0\; 0\; 0)$

Xk            dk    W           W·Xk   yk   Update?   Updated W
(1, 1, 1)      1    (0, 0, 0)     0    -1    Yes      (1, 1, 1)
(1, 1, -1)     1    (1, 1, 1)     1     1    No       (1, 1, 1)
(1, 0, -1)     1    (1, 1, 1)     0    -1    Yes      (2, 1, 0)
(1, -1, -1)   -1    (2, 1, 0)     1     1    Yes      (1, 2, 1)
(1, -1, 1)    -1    (1, 2, 1)     0    -1    No       (1, 2, 1)
(1, 0, 1)     -1    (1, 2, 1)     2     1    Yes      (0, 2, 0)
(1, 1, 1)      1    (0, 2, 0)     2     1    No       (0, 2, 0)
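A compact sketch (my own) of the update rule from the previous slide, run on this slide's training set; with $\eta = 1/2$ the first pass through the data reproduces the trace above, and the algorithm then keeps cycling until a full pass produces no updates.

```python
def sign(v):
    return 1 if v > 0 else -1

def perceptron(examples, eta=0.5):
    """Rosenblatt's rule: W <- W + eta * (d - y) * X, repeated until a full pass makes no updates."""
    w = [0.0] * len(examples[0][0])
    changed = True
    while changed:
        changed = False
        for x, d in examples:
            y = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                changed = True
    return w

E = [((1, 1, 1), 1), ((1, 1, -1), 1), ((1, 0, -1), 1),
     ((1, -1, -1), -1), ((1, -1, 1), -1), ((1, 0, 1), -1)]
print(perceptron(E))   # [1.0, 2.0, -1.0]; after the first pass the weights are (0, 2, 0) as in the table
```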

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Perceptron

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Perceptron Convergence Theorem (Novikoff)

Theorem: Let $E = \{(X_k, d_k)\}$ be a training set where $X_k \in \{1\} \times \Re^n$ and $d_k \in \{-1, 1\}$.
Let $S^+ = \{X_k \mid (X_k, d_k) \in E \;\&\; d_k = 1\}$ and $S^- = \{X_k \mid (X_k, d_k) \in E \;\&\; d_k = -1\}$.
The perceptron algorithm is guaranteed to terminate after a bounded number $t$ of weight updates with a weight vector $W^*$ such that $\forall X_k \in S^+,\; W^* \bullet X_k \geq \delta$ and $\forall X_k \in S^-,\; W^* \bullet X_k \leq -\delta$ for some $\delta > 0$, whenever such $W^* \in \Re^{n+1}$ and $\delta > 0$ exist – that is, whenever $E$ is linearly separable. The bound on the number $t$ of weight updates is given by
$$t \leq \left(\frac{\|W^*\|\, L}{\delta}\right)^2 \quad\text{where}\quad L = \max_{X_k \in S} \|X_k\| \;\text{ and }\; S = S^+ \cup S^-$$

Proof of Perceptron Convergence Theorem

Let Wt be the weight vector after t weight updates.

[Figure: the angle $\theta$ between $W^*$ and $W_t$]  Invariant: $\forall \theta,\; \cos\theta \leq 1$

Proof of Perceptron Convergence Theorem

Let $W^*$ be such that $\forall X_k \in S^+,\; W^* \bullet X_k \geq \delta$ and $\forall X_k \in S^-,\; W^* \bullet X_k \leq -\delta$. WLOG assume that the hyperplane $W^* \bullet X = 0$ passes through the origin.
Let $Z_k = X_k$ for $X_k \in S^+$, $\;Z_k = -X_k$ for $X_k \in S^-$, and $Z = \{Z_k\}$. Then
$$(\forall X_k \in S^+,\; W^* \bullet X_k \geq \delta \;\&\; \forall X_k \in S^-,\; W^* \bullet X_k \leq -\delta) \Leftrightarrow (\forall Z_k \in Z,\; W^* \bullet Z_k \geq \delta)$$

Let $E' = \{(Z_k, 1)\}$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Proof of Perceptron Convergence Theorem

$$W_{t+1} = W_t + \eta (d_k - y_k) Z_k \quad\text{where } W_0 = [0\; 0\; \ldots\; 0]^T \text{ and } \eta > 0$$

A weight update based on example $(Z_k, 1)$ occurs $\Leftrightarrow (d_k = 1) \wedge (y_k = -1)$, so
$$W^* \bullet W_{t+1} = W^* \bullet (W_t + 2\eta Z_k) = (W^* \bullet W_t) + 2\eta (W^* \bullet Z_k)$$
Since $\forall Z_k \in Z,\; W^* \bullet Z_k \geq \delta$, we have $W^* \bullet W_{t+1} \geq W^* \bullet W_t + 2\eta\delta$
$$\therefore \forall t,\quad W^* \bullet W_t \geq 2 t \eta \delta \qquad \text{(a)}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Proof of Perceptron Convergence Theorem

$$\|W_{t+1}\|^2 \equiv W_{t+1} \bullet W_{t+1} = (W_t + 2\eta Z_k) \bullet (W_t + 2\eta Z_k) = (W_t \bullet W_t) + 4\eta (W_t \bullet Z_k) + 4\eta^2 (Z_k \bullet Z_k)$$

Note that a weight update based on $Z_k$ occurs $\Leftrightarrow (W_t \bullet Z_k \leq 0)$, so
$$\|W_{t+1}\|^2 \leq \|W_t\|^2 + 4\eta^2 \|Z_k\|^2 \leq \|W_t\|^2 + 4\eta^2 L^2$$
Hence $\|W_t\|^2 \leq 4 t \eta^2 L^2$
$$\therefore \forall t,\quad \|W_t\| \leq 2\eta L \sqrt{t} \qquad \text{(b)}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Proof of Perceptron Convergence Theorem

From (a) we have $\forall t,\; (W^* \bullet W_t) \geq 2 t \eta \delta$
$$\Rightarrow \{\forall t,\; 2 t \eta \delta \leq (W^* \bullet W_t)\} \Rightarrow \{\forall t,\; 2 t \eta \delta \leq \|W^*\| \|W_t\| \cos\theta\} \Rightarrow \{\forall t,\; 2 t \eta \delta \leq \|W^*\| \|W_t\|\} \quad (\because \forall\theta,\; \cos\theta \leq 1)$$
Substituting the upper bound on $\|W_t\|$ from (b):
$$\forall t,\; \{2 t \eta \delta \leq \|W^*\|\, 2\eta L \sqrt{t}\} \Rightarrow \{\forall t,\; \delta \sqrt{t} \leq \|W^*\| L\} \Rightarrow t \leq \left(\frac{\|W^*\| L}{\delta}\right)^2$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Notes on the Perceptron Convergence Theorem

• The bound on the number of weight updates does not depend on the learning rate • The bound is not useful in determining when to stop the algorithm because it depends on the norm of the unknown weight vector and delta • The convergence theorem offers no guarantees when the training data set is not linearly separable

Exercise: Prove that the perceptron algorithm is robust with respect to fluctuations in the learning rate

0 < ηmin ≤ ηt ≤ ηmax < ∞

Multiple classes

• One-versus-rest: $K - 1$ binary classifiers
• One-versus-one: $\binom{K}{2} = \frac{K(K-1)}{2}$ binary classifiers

[Figure: decision regions obtained from one-versus-rest and one-versus-one combinations of binary classifiers]

Problem: Green region has ambiguous class membership

Multi-category classifiers

Define K linear functions of the form $y_k(X) = W_k^T X + w_{k0}$

$$h(X) = \arg\max_k y_k(X) = \arg\max_k (W_k^T X + w_{k0})$$

[Figure: linear decision regions for classes $C_1$, $C_2$, $C_3$]

Decision surface between class $C_k$ and $C_j$:
$$(W_k - W_j)^T X + (w_{k0} - w_{j0}) = 0$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Linear separator for K classes

• Decision regions defined by
$$(W_k - W_j)^T X + (w_{k0} - w_{j0}) = 0$$
are singly connected and convex:

For any points $X_A, X_B \in R_k$, any $\hat{X}$ that lies on the line segment connecting $X_A$ and $X_B$,
$$\hat{X} = \lambda X_A + (1 - \lambda) X_B \quad\text{where } 0 \leq \lambda \leq 1,$$
also lies in $R_k$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Winner-Take-All Networks

$$y_{ip} = 1 \;\text{ iff }\; W_i \bullet X_p > W_j \bullet X_p \;\;\forall j \neq i; \qquad y_{ip} = 0 \;\text{ otherwise}$$
Note: the $W_j$ are augmented weight vectors.

$$W_1 = [1\; -1\; -1]^T, \qquad W_2 = [1\; 1\; 1]^T, \qquad W_3 = [2\; 0\; 0]^T$$

Xp             W1·Xp   W2·Xp   W3·Xp   y1   y2   y3
(1, -1, -1)      3      -1       2      1    0    0
(1, -1, +1)      1       1       2      0    0    1
(1, +1, -1)      1       1       2      0    0    1
(1, +1, +1)     -1       3       2      0    1    0

What does neuron 3 compute?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Linear separability of multiple classes

Let S1,S2,S3...SM be multisets of instances

Let C1,C2,C3...CM be disjoint classes

∀i Si ⊆ Ci

∀i ≠ j Ci ∩ C j = ∅

We say that the sets $S_1, S_2, S_3, \ldots, S_M$ are linearly separable iff there exist weight vectors $W_1^*, W_2^*, \ldots, W_M^*$ such that
$$\forall i \;\{\forall X_p \in S_i,\; (W_i^* \bullet X_p > W_j^* \bullet X_p) \;\;\forall j \neq i\}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Training WTA Classifiers

dkp =1 iff X p ∈Ck ; dkp = 0 otherwise

$y_{kp} = 1$ iff $W_k \bullet X_p > W_j \bullet X_p \;\;\forall j \neq k$

Suppose dkp =1, y jp =1 and ykp = 0

Wk ← Wk + ηX p ; Wj ← Wj − ηX p ; All other weights are left unchanged.

Suppose dkp =1, y jp = 0 and ykp =1. The weights are unchanged.

Suppose dkp =1,∀j y jp = 0 (there was a tie)

Wk ← Wk + ηX p All other weights are left unchanged.
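A minimal sketch (my own) of one WTA training update as described above; `wta_update` and the tie-handling (no unique winner when the top score is shared) are illustrative choices consistent with the rules on this slide.

```python
def wta_update(W, x, target, eta=1.0):
    """One winner-take-all update. W is a list of weight vectors; target is the index of the true class."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    winner = winners[0] if len(winners) == 1 else None        # a tie means all outputs are 0
    if winner == target:
        return W                                              # correct unique winner: no change
    W = [list(w) for w in W]
    W[target] = [wi + eta * xi for wi, xi in zip(W[target], x)]      # promote the correct class
    if winner is not None:
        W[winner] = [wi - eta * xi for wi, xi in zip(W[winner], x)]  # demote the wrong winner
    return W
```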

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory WTA Convergence Theorem

Given a linearly separable training set, the WTA learning algorithm is guaranteed to converge to a solution within a finite number of weight updates.

Proof Sketch: Transform the WTA training problem to the problem of training a single perceptron using a suitably transformed training set. Then the proof of WTA learning algorithm reduces to the proof of perceptron learning algorithm

WTA Convergence Theorem

Let $W = [W_1^T\; W_2^T\; \ldots\; W_M^T]^T$ be the concatenation of the weight vectors associated with the $M$ neurons in the WTA group. Consider a multi-category training set $E = \{(X_p, f(X_p))\}$ where $\forall X_p,\; f(X_p) \in \{C_1, \ldots, C_M\}$.

Let $X_p \in C_1$. Generate $(M-1)$ training examples using $X_p$ for an $M(n+1)$-input perceptron:
$$X_{p12} = [X_p\; -X_p\; \phi\; \phi\; \ldots\; \phi]$$
$$X_{p13} = [X_p\; \phi\; -X_p\; \phi\; \ldots\; \phi]$$
$$\ldots$$
$$X_{p1M} = [X_p\; \phi\; \phi\; \ldots\; \phi\; -X_p]$$
where $\phi$ is an all-zero vector with the same dimension as $X_p$, and set the desired output of the corresponding perceptron to be 1 in each case. Similarly, from each training example for an $(n+1)$-input WTA, we can generate $(M-1)$ examples for an $M(n+1)$-input single neuron. Let the union of the resulting $|E|(M-1)$ examples be $E'$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory WTA Convergence Theorem

By construction, there is a one - to- one correspondence T T between the weight vector W = [W1W2....WM ] that results from training an M -neuron WTA on the multi - category set of examples E and the result of training an M (n +1) input perceptron on the transformed training set E'. Hence the convergence proof of WTA learning algorithm follows from the perceptron convergence theorem.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Weight space representation

Pattern space representation Coordinates of space correspond to attributes (features) A point in the space represents an instance

A weight vector $W$ defines a hyperplane $W \bullet X = 0$

Weight space (dual) representation
Coordinates define a weight space
A point in the space represents a choice of weights $W$
An instance $X_p$ defines a hyperplane $W \bullet X_p = 0$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Weight space representation

[Figure: in the $(w_0, w_1)$ weight space, each training instance (e.g., $X_p$, $X_q$, $X_r$ from $S^+$ and $S^-$) defines a hyperplane $W \bullet X = 0$; the solution region is the set of weight vectors that classify all instances correctly]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Weight space representation

[Figure: in the $(w_0, w_1)$ weight space, the hyperplane $W \bullet X_p = 0$ for an instance $X_p \in S^+$; the update $W_{t+1} \leftarrow W_t + \eta X_p$ moves the weight vector $W_t$ by $\eta X_p$ toward the correct side of that hyperplane]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

The Perceptron Algorithm Revisited

The perceptron works by adding misclassified positive examples to, or subtracting misclassified negative examples from, an (arbitrary) initial weight vector, which without loss of generality we assumed to be the zero vector. So the final weight vector is a linear combination of the training points:
$$w = \sum_{i=1}^{l} \alpha_i y_i x_i,$$
where, since the sign of the coefficient of $x_i$ is given by the label $y_i$, the $\alpha_i$ are positive values, proportional to the number of times misclassification of $x_i$ has caused the weight to be updated. $\alpha_i$ is called the embedding strength of the pattern $x_i$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Dual Representation

The decision function can be rewritten as:
$$h(x) = \mathrm{sgn}(\langle w, x\rangle + w_0) = \mathrm{sgn}\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j, x\rangle + w_0\right)$$

The update rule on training example $(x_i, y_i)$ is:
$$\text{if } y_i\!\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j, x_i\rangle + w_0\right) \leq 0, \;\text{ then }\; \alpha_i \leftarrow \alpha_i + \eta$$
WLOG, we can take $\eta = 1$.
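A small sketch (my own) of the dual update above, using a plain dot product for the inner product ⟨·,·⟩; the names `inner`, `dual_perceptron`, and `predict` are assumptions, $w_0$ is kept fixed since the slide gives no update for it, and a fixed number of passes is used for simplicity. Replacing `inner` with a kernel function turns this into a kernel perceptron.

```python
def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_perceptron(X, y, w0=0.0, passes=10, eta=1):
    """Dual-form perceptron: alpha[i] accumulates eta each time example i triggers an update."""
    alpha = [0] * len(X)
    for _ in range(passes):
        for i, (xi, yi) in enumerate(zip(X, y)):
            g = sum(a * yj * inner(xj, xi) for a, xj, yj in zip(alpha, X, y)) + w0
            if yi * g <= 0:          # misclassified (or on the boundary)
                alpha[i] += eta
    return alpha

def predict(alpha, X, y, x, w0=0.0):
    return 1 if sum(a * yj * inner(xj, x) for a, xj, yj in zip(alpha, X, y)) + w0 > 0 else -1
```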

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Limitations of perceptrons

• Perceptrons can only represent threshold functions • Perceptrons can only learn linear decision boundaries What if the data are not linearly separable? § More complex networks? § Non-linear transformations into a feature space where the data become separable?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Extending Linear Classifiers Learning in feature spaces

Map the data into a feature space where they become linearly separable: $x \rightarrow \varphi(x)$

[Figure: the input space $X$ is mapped by $\Phi$ into a feature space in which a linear separator exists]

Exclusive OR revisited

In the feature (hidden) space:
$$\varphi_1(x_1, x_2) = e^{-\|X - W_1\|^2} = z_1, \qquad W_1 = [1, 1]^T$$
$$\varphi_2(x_1, x_2) = e^{-\|X - W_2\|^2} = z_2, \qquad W_2 = [0, 0]^T$$

[Plot: the images of (0,0) and (1,1), and the common image of (0,1) and (1,0), in the $(z_1, z_2)$ plane, with a linear decision boundary separating the two classes]

When mapped into the feature space $\langle z_1, z_2\rangle$, $C_1$ and $C_2$ become linearly separable, so a linear classifier with $\varphi_1(x)$ and $\varphi_2(x)$ as inputs can be used to solve the XOR problem.

"Perceptrons" (1969)

"The perceptron […] has many features that attract attention: its linearity, its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile." [pp. 231 – 232]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Learning in the Feature Spaces High dimensional Feature spaces

$$x = (x_1, x_2, \ldots, x_n) \rightarrow \varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_d(x)), \quad\text{where typically } d \gg n$$
High-dimensional feature spaces solve the problem of expressing complex functions.

But this introduces a § computational problem (working with very large vectors) § Solved using the kernel trick – implicit feature spaces

§ generalization problem (curse of dimensionality) § Solved by maximizing the margin of separation – first implemented in SVM (Vapnik) We will return to SVM later

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Generative Versus Discriminative Models

§ Generative models § Naïve Bayes, Bayes networks, etc. § Discriminative models § Perceptron, Support vector machines, Logistic regression .. § Relating generative and discriminative models § Tradeoffs between generative and discriminative models § Generalizations and extensions

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Alternative realizations of the Bayesian recipe Chef 1: Generative model

Note that $P(\omega_i \mid x) = \dfrac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}$

Model P(x | ω1 ), P(x|ω2 ), P(ω1 ), and P(ω2 )

Using Bayes rule, choose ω1 if P(x | ω1 )P(ω1 ) > P(x|ω2 )P(ω2 )

Otherwise choose ω2

Chef 2: Discriminative Model

Model $P(\omega_1 \mid x)$ and $P(\omega_2 \mid x)$, or the ratio $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)}$ directly

Choose $\omega_1$ if $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1$

Otherwise choose $\omega_2$

Generative vs. Discriminative Classifiers

Generative classifiers
§ Assume some functional form for P(X|C), P(C)
§ Estimate the parameters of P(X|C), P(C) directly from training data
§ Use Bayes rule to calculate P(C|X=x)
Discriminative classifiers
§ Assume some functional form for P(C|X)
§ Estimate the parameters of P(C|X) directly from training data
Discriminative classifiers – maximum margin version
§ Assume a functional form f(W) for the discriminant
§ Find W that minimizes prediction error
§ E.g., find W that maximizes the margin of separation between classes (e.g., SVM)

Generative vs. Discriminative Models

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Which Chef cooks a better Bayesian recipe? In theory, generative and conditional models produce identical results in the limit • The classification produced by the generative model is the same as that produced by the discriminative model • That is, given unlimited data, assuming that both approaches select the correct form for the relevant probability distributions or the model for the discriminant function, they will produce identical results (Why?) • If the assumed form of the probability distributions is incorrect, then it is possible that the generative model might have a higher classification error than the discriminative model (Why?) How about in practice?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Which Chef cooks a better Bayesian recipe?

In practice
• The error of the classifier that uses the discriminative model can be lower than that of the classifier that uses the generative model (Why?)
• Naïve Bayes is a generative model
• A perceptron is a discriminative model, and so is an SVM
• An SVM can outperform Naïve Bayes on classification

If the goal is classification, it might be useful to consider discriminative models that directly learn the classifier without solving the harder intermediate problem of modeling the joint probability distribution of inputs and classes (Vapnik)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

• Assume classes are binary: $y \in \{0, 1\}$
• Suppose we model the class by a Bernoulli distribution with parameter $q$: $\;p(y \mid q) = q^y (1-q)^{(1-y)}$
• Assume each component $x_j$ of the input $X$ has a Gaussian distribution with parameters $\theta_j$, and that the components are independent given the class:

$$p(x, y \mid \Theta) = P(y \mid q) \prod_{j=1}^{n} p(x_j \mid y, \theta_j)$$

where $\Theta = (q, \theta_1, \ldots, \theta_n)$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

$$p(x_j \mid y = 0, \Theta_j) = \frac{1}{(2\pi\sigma_j^2)^{1/2}} \exp\!\left\{-\frac{1}{2\sigma_j^2}(x_j - \mu_{0j})^2\right\}$$

$$p(x_j \mid y = 1, \Theta_j) = \frac{1}{(2\pi\sigma_j^2)^{1/2}} \exp\!\left\{-\frac{1}{2\sigma_j^2}(x_j - \mu_{1j})^2\right\}$$

where $\Theta_j = (\mu_{0j}, \mu_{1j}, \sigma_j)$

(Note: we have assumed that $\forall j,\; \sigma_{0j} = \sigma_{1j} = \sigma_j$)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

The calculation of the posterior probability $p(Y = 1 \mid x, \Theta)$ is simplified if we use matrix notation:

$$p(x \mid y = 1, \Theta) = \prod_{j=1}^{n} \frac{1}{(2\pi)^{1/2}\sigma_j} \exp\!\left\{-\frac{1}{2}\left(\frac{x_j - \mu_{1j}}{\sigma_j}\right)^2\right\} = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left\{-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right\}$$

where $\mu_1 = (\mu_{11} \ldots \mu_{1n})^T$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$

From generative to discriminative models

$$p(y = 1 \mid x, \Theta) = \frac{p(x \mid y = 1, \Theta)\, p(y = 1 \mid q)}{p(x \mid y = 1, \Theta)\, p(y = 1 \mid q) + p(x \mid y = 0, \Theta)\, p(y = 0 \mid q)}$$

$$= \frac{q \exp\!\left\{-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right\}}{q \exp\!\left\{-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right\} + (1 - q) \exp\!\left\{-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right\}}$$

$$= \frac{1}{1 + \exp\!\left\{-\log\!\left(\frac{q}{1-q}\right) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - \frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right\}}$$

$$= \frac{1}{1 + \exp\!\left\{-\underbrace{(\mu_1 - \mu_0)^T \Sigma^{-1}}_{\beta^T}\, x + \underbrace{\frac{1}{2}(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 + \mu_0) - \log\!\left(\frac{q}{1-q}\right)}_{-\gamma}\right\}} = \frac{1}{1 + \exp(-\beta^T x - \gamma)}$$

where we have used $A^T D A - B^T D B = (A + B)^T D (A - B)$ for a symmetric matrix $D$.
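A numerical sketch (my own, using NumPy) checking the identity derived above for a shared diagonal Σ: the generative posterior computed by Bayes rule equals the logistic function of βᵀx + γ. The means, variances, prior, and test point below are arbitrary illustrative values.

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / np.sqrt(np.prod(2 * np.pi * var))

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
var = np.array([1.5, 0.5])          # shared diagonal covariance (sigma_j^2)
q = 0.3                             # prior p(y = 1)
x = np.array([1.0, 0.2])

# Generative posterior via Bayes rule
post = q * gaussian(x, mu1, var) / (q * gaussian(x, mu1, var) + (1 - q) * gaussian(x, mu0, var))

# Discriminative form: beta = Sigma^{-1}(mu1 - mu0), gamma = -0.5*(mu1-mu0)^T Sigma^{-1}(mu1+mu0) + log(q/(1-q))
beta = (mu1 - mu0) / var
gamma = -0.5 * np.dot((mu1 - mu0) / var, mu1 + mu0) + np.log(q / (1 - q))
logistic = 1.0 / (1.0 + np.exp(-(np.dot(beta, x) + gamma)))

print(np.isclose(post, logistic))   # True
```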

$$p(y = 1 \mid x, \Theta) = \frac{1}{1 + \exp(-\beta^T x - \gamma)}$$

The posterior probability that $Y = 1$ takes the form $\phi(z) = \dfrac{1}{1 + e^{-z}}$, where $z = \beta^T x + \gamma = \beta \bullet x + \gamma$ is an affine function of $x$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Sigmoid or Logistic Function φ(z)

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

[Plot: the sigmoid function $\phi(z)$ versus $z$]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Implications of the logistic posterior

• The posterior probability of Y is a logistic function of an affine function of x
• Contours of equal posterior probability are lines in the input space
• $\beta^T x$ is proportional to the projection of $x$ onto $\beta$, and this projection is equal for all vectors $x$ that lie along a line orthogonal to $\beta$
• Special case – if the variances of the Gaussians all equal 1, the contours of equal posterior probability are lines orthogonal to the difference vector between the means of the two classes
• The posteriors of the two classes are equal when $z = 0$

Geometric interpretation (diagonal $\Sigma$)

[Contour plot: contours of the two Gaussian class-conditional densities together with lines of equal posterior probability]

$$p(y = 1 \mid x, \Theta) = \frac{1}{1 + \exp(-\beta^T x - \gamma)} = \frac{1}{1 + e^{-z}}$$

When $q = 1 - q$,
$$z = (\mu_1 - \mu_0)^T \Sigma^{-1}\!\left(x - \frac{\mu_1 + \mu_0}{2}\right)$$

T −1⎛ (µ1 + µ0 )⎞ when q =1-q, z = (µ1 − µ0 ) Σ ⎜ x − ⎟ ⎝ 2 ⎠ In this case, the posterior probabilities for the two classes are equal when x is equidistant from the two means

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Geometric interpretation

• If the prior probabilities of the classes are such that q > 0.5, the effect is to shift the logistic function to the left, resulting in a larger value of the posterior probability for Y=1 at any given point in the input space.
• q < 0.5 results in a shift of the logistic function to the right, resulting in a smaller value of the posterior probability for Y=1 (or a larger value of the posterior probability for Y=0)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Geometric interpretation (general Σ)

Now the equi-probability contours are still lines in the input space, although the lines are no longer orthogonal to the difference between the means of the two classes.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Generalization to multiple classes – Softmax function

• Y is a multinomial variable which takes on one of K values:
$$q_k = p(y=k \mid q) = p(y^k = 1 \mid q), \qquad \text{where } (y=k) \equiv (y^k = 1)$$

$$q = (q_1,\, q_2,\, \ldots,\, q_K)$$
• As before, x is a multivariate Gaussian:
$$p(x \mid y^k=1, \Theta) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1} (x-\mu_k)\right\}$$
where $\mu_k = (\mu_{k1}, \ldots, \mu_{kn})^T$ and $\forall k\;\; \Sigma_k = \Sigma$ (the covariance matrix is assumed to be the same for each class)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Generalization to multiple classes – Softmax function Posterior probability for class k is obtained via Bayes rule

$$p(y^k=1 \mid x,\Theta) = \frac{p(x \mid y^k=1,\Theta)\, p(y^k=1 \mid q)}{\sum_{l=1}^{K} p(x \mid y^l=1,\Theta)\, p(y^l=1 \mid q)}$$

$$= \frac{q_k \exp\left\{-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\right\}}{\sum_{l=1}^{K} q_l \exp\left\{-\tfrac{1}{2}(x-\mu_l)^T \Sigma^{-1}(x-\mu_l)\right\}}$$

$$= \frac{\exp\left\{\mu_k^T \Sigma^{-1} x - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log q_k\right\}}{\sum_{l=1}^{K} \exp\left\{\mu_l^T \Sigma^{-1} x - \tfrac{1}{2}\mu_l^T \Sigma^{-1}\mu_l + \log q_l\right\}}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Generalization to multiple classes – Softmax function We have shown that

$$p(y^k=1 \mid x,\Theta) = \frac{\exp\left\{\mu_k^T \Sigma^{-1} x - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log q_k\right\}}{\sum_{l=1}^{K} \exp\left\{\mu_l^T \Sigma^{-1} x - \tfrac{1}{2}\mu_l^T \Sigma^{-1}\mu_l + \log q_l\right\}}$$

Defining the parameter vectors
$$\beta_k = \begin{bmatrix} -\tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log q_k \\ \Sigma^{-1}\mu_k \end{bmatrix}$$
and augmenting the input vector x by adding a constant input of 1, we have
$$p(y^k=1 \mid x,\Theta) = \frac{e^{\beta_k^T x}}{\sum_{l=1}^{K} e^{\beta_l^T x}} = \frac{e^{\langle\beta_k, x\rangle}}{\sum_{l=1}^{K} e^{\langle\beta_l, x\rangle}}$$
Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Generalization to multiple classes – Softmax function

$$p(y^k=1 \mid x,\Theta) = \frac{e^{\beta_k^T x}}{\sum_{l=1}^{K} e^{\beta_l^T x}} = \frac{e^{\langle\beta_k, x\rangle}}{\sum_{l=1}^{K} e^{\langle\beta_l, x\rangle}}$$
corresponds to the decision rule:

$$h(x) = \arg\max_j\, p(y^j=1 \mid x,\Theta) = \arg\max_j\, e^{\langle\beta_j, x\rangle} = \arg\max_j\, \langle\beta_j, x\rangle$$
Consider the ratio of the posterior probabilities for classes k and j:
$$\frac{p(y^k=1 \mid x,\Theta)}{p(y^j=1 \mid x,\Theta)} = \frac{e^{\langle\beta_k, x\rangle}\big/\sum_{l=1}^{K} e^{\langle\beta_l, x\rangle}}{e^{\langle\beta_j, x\rangle}\big/\sum_{l=1}^{K} e^{\langle\beta_l, x\rangle}} = \frac{e^{\langle\beta_k, x\rangle}}{e^{\langle\beta_j, x\rangle}} = e^{\langle\beta_k - \beta_j,\, x\rangle}$$
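As a sanity check (again with assumed toy parameters, not taken from the slides), the sketch below compares the Bayes posterior over K Gaussian classes with a shared covariance against the softmax of βₖᵀ[1, x] built exactly as above.

```python
# Check: Bayes posterior over K Gaussian classes equals softmax of beta_k^T [1, x].
import numpy as np
from scipy.stats import multivariate_normal

mus = [np.array([0.0, 0.0]), np.array([2.0, 0.5]), np.array([-1.0, 2.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
q = np.array([0.5, 0.3, 0.2])
Sigma_inv = np.linalg.inv(Sigma)

# Augmented parameter vectors beta_k (constant term first, as on the slide)
betas = [np.concatenate(([-0.5 * m @ Sigma_inv @ m + np.log(qk)], Sigma_inv @ m))
         for m, qk in zip(mus, q)]

x = np.array([0.7, -0.3])
x_aug = np.concatenate(([1.0], x))

scores = np.array([b @ x_aug for b in betas])
softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()

joint = np.array([qk * multivariate_normal.pdf(x, m, Sigma) for m, qk in zip(mus, q)])
bayes = joint / joint.sum()
print(np.allclose(softmax, bayes))   # True: the two routes agree
```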

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Equi-probability contours of the softmax function

$$(\beta_3-\beta_1)^T x = 0, \qquad (\beta_1-\beta_2)^T x = 0, \qquad (\beta_2-\beta_3)^T x = 0$$

[Figure: the three pairwise decision boundaries above partition the input space into the three class regions]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Naïve Bayes classifier with discrete attributes and K classes

qk= prior probability of class k

η_kji = probability that x_j (the jth component of x) takes the ith value in its domain when x belongs to class k.
$$p(x, y \mid \Theta) = p(y \mid q)\prod_{j=1}^{n} p(x_j \mid y, \theta_j)$$
$$q_k = p(y^k=1 \mid q); \qquad \eta_{kji} = p(x_j^i = 1 \mid y^k=1, \eta)$$
$$p(y^k=1 \mid x,\eta) = \frac{q_k \prod_j \prod_i (\eta_{kji})^{x_j^i}}{\sum_l q_l \prod_j \prod_i (\eta_{lji})^{x_j^i}}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Naïve Bayes classifier with discrete attributes and K classes

$$q_k = p(y^k=1 \mid q); \qquad \eta_{kji} = p(x_j^i = 1 \mid y^k=1, \eta)$$
$$p(y^k=1 \mid x,\eta) = \frac{q_k \prod_j \prod_i (\eta_{kji})^{x_j^i}}{\sum_l q_l \prod_j \prod_i (\eta_{lji})^{x_j^i}}
= \frac{\exp\left\{\log q_k + \sum_{j=1}^{n}\sum_{i=1}^{N_j} x_j^i \log \eta_{kji}\right\}}{\sum_{l=1}^{K}\exp\left\{\log q_l + \sum_{j=1}^{n}\sum_{i=1}^{N_j} x_j^i \log \eta_{lji}\right\}}
= \frac{e^{\beta_k^T x}}{\sum_{l=1}^{K} e^{\beta_l^T x}} = \frac{e^{\langle\beta_k, x\rangle}}{\sum_{l=1}^{K} e^{\langle\beta_l, x\rangle}}$$
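The same identity can be verified for the naïve Bayes case. In the sketch below (all probability values are made-up, illustrative assumptions), the posterior computed from products of η_kji equals a softmax of a linear function of the one-hot encoding of x.

```python
# Naive Bayes posterior vs. softmax of a linear function of one-hot features.
import numpy as np

q = np.array([0.4, 0.6])                       # class priors q_k
# eta[k][j] = distribution of attribute j given class k
eta = [
    [np.array([0.7, 0.3]), np.array([0.2, 0.5, 0.3])],   # class 0
    [np.array([0.1, 0.9]), np.array([0.6, 0.2, 0.2])],   # class 1
]
x = [1, 2]                                     # observed value index of each attribute

# Direct naive Bayes posterior
joint = np.array([q[k] * np.prod([eta[k][j][x[j]] for j in range(2)]) for k in range(2)])
posterior_nb = joint / joint.sum()

# Softmax-of-linear form: beta_k = [log q_k, log eta_k1., log eta_k2.], x one-hot with a leading 1
x_onehot = np.concatenate(([1.0],
                           np.eye(2)[x[0]],    # one-hot for attribute 1
                           np.eye(3)[x[1]]))   # one-hot for attribute 2
betas = np.array([np.concatenate(([np.log(q[k])],
                                  np.log(eta[k][0]),
                                  np.log(eta[k][1]))) for k in range(2)])
scores = betas @ x_onehot
posterior_softmax = np.exp(scores) / np.exp(scores).sum()
print(np.allclose(posterior_nb, posterior_softmax))   # True
```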

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

• A curious fact about all of the generative models we have considered so far is that – The posterior probability of class can be expressed in the form of a logistic function in the case of a binary classifier and a softmax function in the case of a K-class classifier

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

• For multinomial and Gaussian class conditional densities (in the case of the latter, with equal but otherwise arbitrary covariance matrices) – the contours of equal posterior probabilities of classes are hyperplanes in the input (feature) space. • The result is a simple linear classifier analogous to the perceptron (for binary classification) or winner-take-all network (for K-ary classification) • Next, we see that these results hold for a more general class of distributions

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Digression: The exponential family of distributions The exponential family is specified by

$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$
where η is a parameter vector and A(η), h(x) and G(x) are appropriately chosen functions.
• The Gaussian, binomial, and multinomial (and many other "textbook") distributions belong to the exponential family
• The log likelihood of an exponential-family model is concave in the natural parameter η, so maximum likelihood estimation is a convex optimization problem
• The maximum entropy estimate of an unknown probability distribution under moment constraints yields an exponential form

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory The Bernoulli distribution belongs to the exponential family

$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$
The Bernoulli distribution with success rate q is given by

$$p(x \mid q) = q^x (1-q)^{1-x} = \exp\left\{\log\!\left(\frac{q}{1-q}\right) x + \log(1-q)\right\}$$
We can see that the Bernoulli distribution belongs to the exponential family by choosing

$$\eta = \log\!\left(\frac{q}{1-q}\right); \qquad G(x) = x; \qquad h(x) = 1; \qquad A(\eta) = -\log(1-q) = \log(1+e^{\eta})$$
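A quick numerical check of this identification (not part of the slides; the value of q is an arbitrary choice):

```python
# Bernoulli pmf vs. its exponential-family form h(x) exp(eta*G(x) - A(eta)).
import numpy as np

q = 0.3
eta = np.log(q / (1 - q))
A = np.log(1 + np.exp(eta))                  # equals -log(1-q)

for x in (0, 1):
    direct = q**x * (1 - q)**(1 - x)
    exp_family = 1.0 * np.exp(eta * x - A)   # h(x) = 1
    print(x, direct, exp_family)             # identical values
```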

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

The Gaussian distribution belongs to the exponential family
$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$
The univariate Gaussian distribution can be written as

$$p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{1/2}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}$$

$$= \frac{1}{(2\pi)^{1/2}}\exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \ln\sigma\right\}$$

We see that the Gaussian distribution belongs to the exponential family by choosing
$$\eta = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}; \qquad G(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}; \qquad h(x) = \frac{1}{(2\pi)^{1/2}}; \qquad A(\eta) = \frac{\mu^2}{2\sigma^2} + \ln\sigma$$
Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory The exponential family

The exponential family, which is given by

$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$

where η is a parameter vector and A(η), h(x) and G(x) are appropriately chosen functions, can be shown to include several additional distributions, such as the multinomial, the Poisson, the Gamma, and the Dirichlet, among others.

Exercise: Show that the multinomial distribution belongs to the exponential family.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

• In the case of the generative models we have seen
• The posterior probability of the class can be expressed in the form of a logistic function in the case of a binary classifier and a softmax function in the case of a K-class classifier
• The contours of equal posterior probabilities of the classes are hyperplanes in the input (feature) space, yielding a linear classifier (for binary classification) or a winner-take-all network (for K-ary classification).

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory From generative to discriminative models

• We just showed that the probability distributions underlying the generative models considered belong to the exponential family • What can we say about the classifiers when the underlying generative models are distributions from the exponential family?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Classification problem for a generic class conditional density from the exponential family
$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$
Consider a binary classification task with the densities for class 0 and class 1 parameterized by η₀ and η₁. Further assume that G(x) is a linear function of x (before augmenting x with a 1).
$$p(y=1 \mid x,\eta) = \frac{p(x \mid y=1,\eta)\, p(y=1 \mid q)}{p(x \mid y=1,\eta)\, p(y=1 \mid q) + p(x \mid y=0,\eta)\, p(y=0 \mid q)}$$

$$= \frac{\exp\{\eta_1^T G(x) - A(\eta_1)\}\, h(x)\, q_1}{\exp\{\eta_1^T G(x) - A(\eta_1)\}\, h(x)\, q_1 + \exp\{\eta_0^T G(x) - A(\eta_0)\}\, h(x)\, q_0}$$

$$p(y=1 \mid x,\eta) = \frac{1}{1 + \exp\left\{-(\eta_1 - \eta_0)^T G(x) + A(\eta_1) - A(\eta_0) + \log\dfrac{q_0}{q_1}\right\}}$$

Note that this is a logistic function of a linear function of x.
Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Classification problem for a generic class conditional density from the exponential family
Consider a K-ary classification task; suppose G(x) is a linear function of x.
$$p(x \mid \eta) = h(x)\, e^{\{\eta^T G(x) - A(\eta)\}}$$
$$p(y^k=1 \mid x,\eta) = \frac{\exp\{\eta_k^T G(x) - A(\eta_k)\}\, q_k}{\sum_{l=1}^{K}\exp\{\eta_l^T G(x) - A(\eta_l)\}\, q_l}
= \frac{\exp\{\eta_k^T G(x) - A(\eta_k) + \log q_k\}}{\sum_{l=1}^{K}\exp\{\eta_l^T G(x) - A(\eta_l) + \log q_l\}}$$
which is a softmax function of a linear function of x!

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Summary • A variety of class conditional densities all yield the same logistic-linear or softmax-linear (with respect to parameters) form for the posterior probability • In practice, choosing a class conditional density can be difficult – especially in high dimensional spaces – e.g., multi-variate Gaussian where the covariance matrix grows quadratically in the number of dimensions! • The invariance of the functional form of the posterior probability with respect to the choice of the distribution is good news! • It is not necessary to specify the class conditional density at all if we can work directly with the posterior – which brings us to discriminative models!

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Discriminative Models
• We saw that, under fairly general assumptions about the underlying generative model, the posterior probability of the class given x can be expressed as a logistic function of an affine or polynomial (in the simplest case, linear) function of x in the case of a binary classification task:
$$P(y=1 \mid x) = \frac{1}{1 + e^{-\langle w,\, G(x)\rangle}} = \frac{1}{1 + e^{-\eta(x)}} = \mu(x)$$

where $\eta(x) = w^T G(x) = \langle w, G(x)\rangle$
• In the discriminative setting, we simply assume this form and proceed without regard to the details of the underlying generative model

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Discriminative Models
Note that the posterior probability of Y=1 is the same as the conditional expectation of y given x:
$$E(y \mid x) = 1\cdot P(y=1 \mid x) + 0\cdot P(y=0 \mid x) = P(y=1 \mid x) = \mu(x)$$
so that
$$P(y \mid x) = (\mu(x))^{y}\,(1-\mu(x))^{1-y}$$
where
$$\mu(x) = \frac{1}{1+e^{-\eta(x)}} = \frac{1}{1+e^{-w^T G(x)}}$$
Hence estimating P(Y=1|x) is equivalent to performing logistic regression.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Some Properties of the Logistic Function

$$\mu = \frac{1}{1+e^{-\eta}}; \qquad \eta = \log\!\left(\frac{\mu}{1-\mu}\right)$$
$$\frac{d\eta}{d\mu} = \frac{d}{d\mu}\log\!\left(\frac{\mu}{1-\mu}\right) = \left(\frac{1-\mu}{\mu}\right)\frac{d}{d\mu}\!\left(\frac{\mu}{1-\mu}\right) = \left(\frac{1-\mu}{\mu}\right)\frac{(1-\mu)\frac{d\mu}{d\mu} - \mu\frac{d(1-\mu)}{d\mu}}{(1-\mu)^2} = \frac{1}{\mu(1-\mu)}$$
$$\frac{d\mu}{d\eta} = \mu(1-\mu)$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

$$D = \{(x_n, y_n);\;\; x_n \in \mathrm{Domain}(x);\;\; y_n \in \{0,1\};\;\; n = 1..N\}$$

$$\eta_n = w^T x_n; \qquad \mu_n = \frac{1}{1+e^{-\eta_n}}$$

Likelihood:
$$P(y_1 \ldots y_N \mid x_1 \ldots x_N, w) = \prod_{n=1}^{N}(\mu_n)^{y_n}(1-\mu_n)^{(1-y_n)}$$

Log likelihood:

$$LL(w:D) = \sum_{n=1}^{N}\left\{y_n \log\mu_n + (1-y_n)\log(1-\mu_n)\right\}$$

We need to find w that maximizes log likelihood

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

$$LL(w:D) = \sum_{n=1}^{N}\left\{y_n \log\mu_n + (1-y_n)\log(1-\mu_n)\right\}, \qquad G(x) = x,\;\; \eta(x) = w^T x$$
$$\frac{\partial LL(w:D)}{\partial w} = \sum_{n=1}^{N}\left(\frac{y_n}{\mu_n} - \frac{1-y_n}{1-\mu_n}\right)\left(\frac{\partial\mu_n}{\partial\eta_n}\right)\left(\frac{\partial\eta_n}{\partial w}\right) = \sum_{n=1}^{N}\left(\frac{y_n - \mu_n}{\mu_n(1-\mu_n)}\right)\mu_n(1-\mu_n)\,x_n = \sum_{n=1}^{N}(y_n - \mu_n)\,x_n$$
Simple gradient ascent learning algorithm (batch update, and the corresponding online update):
$$w(t+1) \leftarrow w(t) + \rho\left.\frac{\partial LL(w:D)}{\partial w}\right|_{w=w(t)}, \qquad \rho > 0$$
$$w(t+1) \leftarrow w(t) + \rho_t\,(y_n - \mu_n)\,x_n, \qquad \lim_{t\to\infty}\rho_t = 0,\;\; \sum_{t=0}^{\infty}\rho_t = \infty,\;\; \sum_{t=0}^{\infty}\rho_t^2 < \infty$$
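A minimal batch implementation of this update on synthetic data (the data-generating parameters, step size ρ, and iteration count are assumptions, not prescribed by the slides):

```python
# Batch gradient ascent for logistic regression: w <- w + rho * sum_n (y_n - mu_n) x_n
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])   # constant input of 1 prepended
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(d)
rho = 0.01
for _ in range(2000):
    mu = 1.0 / (1.0 + np.exp(-X @ w))        # mu_n = sigmoid(w^T x_n)
    grad = X.T @ (y - mu)                    # sum_n (y_n - mu_n) x_n
    w = w + rho * grad                       # gradient ascent on the log likelihood
print(w)   # approaches a maximum-likelihood estimate near w_true
```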

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

The simple gradient ascent algorithm can be quite slow and has little to recommend it in practice. The momentum trick provides a simple way to speed it up.

$$w(t+1) = w(t) + \Delta w(t)$$
$$\Delta w(t) = \rho\left.\frac{\partial LL(w:D)}{\partial w}\right|_{w=w(t)} + \alpha\,\Delta w(t-1), \qquad 0 < \alpha < 1$$
$$= \sum_{\tau=0}^{t}\alpha^{t-\tau}\,\rho\left.\frac{\partial LL(w:D)}{\partial w}\right|_{w=w(\tau)}$$
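A sketch of the same learner with the momentum term added (α and ρ are assumed values; X and y can be taken from the previous sketch):

```python
# Momentum variant: Delta_w(t) = rho * grad + alpha * Delta_w(t-1)
import numpy as np

def fit_logistic_momentum(X, y, rho=0.01, alpha=0.9, iters=500):
    w = np.zeros(X.shape[1])
    delta = np.zeros_like(w)                 # Delta_w(-1) = 0
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y - mu)
        delta = rho * grad + alpha * delta   # exponentially weighted sum of past gradients
        w = w + delta
    return w
```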

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

The momentum trick can also be applied in the online version.

$$w(t+1) = w(t) + \Delta w(t)$$
$$\Delta w(t) = \rho_t\,(y_n - \mu_n)\,x_n\big|_{w=w(t)} + \alpha\,\Delta w(t-1), \qquad 0 < \alpha < 1$$
$$= \sum_{\tau=0}^{t}\alpha^{t-\tau}\,\rho_\tau\,(y_n - \mu_n)\,x_n\big|_{w=w(\tau)}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

• More sophisticated optimization algorithms – line search, conjugate gradient, Newton-Raphson, iteratively reweighted least squares, and related methods – can be used to maximize the log likelihood function, which, although not quadratic, is approximately quadratic near its maximum.
• For details, see standard texts on optimization.
• When the form of the underlying generative model is known, we can initialize the parameter vector w from the maximum likelihood estimates of the generative model (for which closed form solutions are often available) and then run a few iterations of gradient ascent to improve classification accuracy.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Multi-class Discriminative Model

The softmax-linear model is the multi-class generalization of the logistic-linear model:
$$p(y^k=1 \mid x) = \frac{e^{\theta_k^T x}}{\sum_{l=1}^{K} e^{\theta_l^T x}}$$

In the discriminative setting, we simply assume this form and proceed without regard to details of the underlying generative model

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Multi-class Discriminative Model

Let
$$p(y^k=1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_{l=1}^{K} e^{\theta_l^T x_n}} = \mu_n^k, \qquad \eta_n^k = \theta_k^T x_n$$
Let
$$\eta_n = \left[\eta_n^1 \;\ldots\; \eta_n^k \;\ldots\; \eta_n^K\right], \qquad \mu_n = \left[\mu_n^1 \;\ldots\; \mu_n^k \;\ldots\; \mu_n^K\right]$$
Then
$$\mu_n^k = \frac{e^{\eta_n^k}}{\sum_{l=1}^{K} e^{\eta_n^l}}; \qquad \mu^k = \frac{e^{\eta^k}}{\sum_{l=1}^{K} e^{\eta^l}}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Some properties of the softmax function

The softmax-linear function is invertible up to an additive constant:
$$\mu^k = \frac{e^{\eta^k}}{\sum_{l=1}^{K} e^{\eta^l}} \quad\Longrightarrow\quad \eta^k = \log\mu^k + C, \qquad C = \log\!\left(\sum_{l=1}^{K} e^{\eta^l}\right)$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Some properties of the softmax function

$$\mu^k = \frac{e^{\eta^k}}{\sum_{l=1}^{K} e^{\eta^l}}; \qquad \eta^k = \log\mu^k + C; \qquad C = \log\!\left(\sum_{l=1}^{K} e^{\eta^l}\right)$$

$$\frac{\partial\mu^k}{\partial\eta^j} = \frac{\left(\sum_{l=1}^{K} e^{\eta^l}\right) e^{\eta^k}\,\delta_{kj} - e^{\eta^k} e^{\eta^j}}{\left(\sum_{l=1}^{K} e^{\eta^l}\right)^2} = \frac{e^{\eta^k}}{\sum_{l=1}^{K} e^{\eta^l}}\left(\delta_{kj} - \frac{e^{\eta^j}}{\sum_{l=1}^{K} e^{\eta^l}}\right) = \mu^k\left(\delta_{kj} - \mu^j\right)$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

$$D = \{(x_n, y_n);\;\; x_n \in \mathrm{Domain}(x);\;\; y_n = (y_n^1, \ldots, y_n^K),\; y_n^k \in \{0,1\},\; \textstyle\sum_k y_n^k = 1;\;\; n = 1..N\}$$

$$P(y_n \mid x_n, \theta) = \prod_{k=1}^{K}(\mu_n^k)^{y_n^k}$$
$$L(\theta:D) = P(y_1 \ldots y_N \mid x_1 \ldots x_N, \theta) = \prod_{n=1}^{N}\prod_{k=1}^{K}(\mu_n^k)^{y_n^k}$$
$$LL(\theta_1 \ldots \theta_K : D) = \sum_{n=1}^{N}\sum_{k=1}^{K} y_n^k \log\mu_n^k$$
We need to find the parameters that maximize the log likelihood.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w
$$\eta_n^k = \theta_k^T x_n, \qquad LL(\theta:D) = \sum_{n=1}^{N}\sum_{k=1}^{K} y_n^k \log\mu_n^k$$
$$\nabla_{\theta_i} LL(\theta:D) = \sum_{n=1}^{N}\sum_{k=1}^{K}\left(\frac{\partial LL(\theta:D)}{\partial\mu_n^k}\right)\left(\frac{\partial\mu_n^k}{\partial\eta_n^i}\right)\left(\frac{\partial\eta_n^i}{\partial\theta_i}\right) = \sum_{n=1}^{N}\sum_{k=1}^{K} y_n^k\left(\delta_{ik} - \mu_n^i\right) x_n = \sum_{n=1}^{N}\left(y_n^i - \mu_n^i\right) x_n$$

where we have used the chain rule and the fact that $\forall n\;\; \sum_{k=1}^{K} y_n^k = 1$.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

$$\nabla_{\theta_i} LL(\theta:D) = \sum_{n=1}^{N}\left(y_n^i - \mu_n^i\right) x_n$$

The basic gradient ascent update rule is given by
$$\theta_i(t+1) \leftarrow \theta_i(t) + \rho\left.\frac{\partial LL(\theta:D)}{\partial\theta_i}\right|_{\theta=\theta(t)}, \qquad \rho > 0$$

which we can speed up using the momentum trick as before
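A minimal batch implementation of the softmax update Σₙ(yₙⁱ − µₙⁱ)xₙ on synthetic data (all data-generating choices and the step size are assumptions; momentum is omitted for brevity):

```python
# Batch gradient ascent for the softmax-linear (multi-class logistic) model.
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 300, 3, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
Theta_true = rng.normal(size=(K, d))
logits = X @ Theta_true.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(K, p=p) for p in P])
Y = np.eye(K)[labels]                         # one-hot targets y_n^k

Theta = np.zeros((K, d))
rho = 0.01
for _ in range(2000):
    logits = X @ Theta.T
    Mu = np.exp(logits - logits.max(axis=1, keepdims=True))
    Mu /= Mu.sum(axis=1, keepdims=True)       # mu_n^k = softmax(theta_k^T x_n)
    Theta = Theta + rho * (Y - Mu).T @ X      # row i gets sum_n (y_n^i - mu_n^i) x_n
print(Theta[1] - Theta[0])                    # only differences theta_k - theta_j are identifiable
```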

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

The momentum trick provides a simple approach to speeding up the simple gradient ascent algorithm

$$\theta_i(t+1) = \theta_i(t) + \Delta\theta_i(t)$$
$$\Delta\theta_i(t) = \rho\left.\frac{\partial LL(\theta:D)}{\partial\theta_i}\right|_{\theta=\theta(t)} + \alpha\,\Delta\theta_i(t-1), \qquad 0 < \alpha < 1$$

$$= \sum_{\tau=0}^{t}\alpha^{t-\tau}\,\rho\left.\frac{\partial LL(\theta:D)}{\partial\theta_i}\right|_{\theta_i=\theta_i(\tau)}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

$$\nabla_{\theta_i} LL(\theta:D) = \sum_{n=1}^{N}\left(y_n^i - \mu_n^i\right) x_n$$

The basic online gradient ascent update rule is given by

$$\theta_i(t+1) \leftarrow \theta_i(t) + \rho_t\left(y_n^i - \mu_n^i\right) x_n, \qquad \lim_{t\to\infty}\rho_t = 0,\;\; \sum_{t=0}^{\infty}\rho_t = \infty,\;\; \sum_{t=0}^{\infty}\rho_t^2 < \infty$$

which we can speed up using the momentum trick as before

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Maximum likelihood estimation of w

The momentum trick can also be applied in the online version.

$$\theta_i(t+1) = \theta_i(t) + \Delta\theta_i(t)$$
$$\Delta\theta_i(t) = \rho_t\left(y_n^i - \mu_n^i\right) x_n\big|_{\theta_i=\theta_i(t)} + \alpha\,\Delta\theta_i(t-1), \qquad 0 < \alpha < 1$$
$$= \sum_{\tau=0}^{t}\alpha^{t-\tau}\,\rho_\tau\left(y_n^i - \mu_n^i\right) x_n\big|_{\theta_i=\theta_i(\tau)}$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Summary
• For a large class of generative models – those whose class conditional densities belong to the exponential family – the posterior probability of the class given the input takes the same logistic-linear (or softmax-linear, with respect to the parameters) form
• Generative models can perform poorly when the assumed parametric form of the distribution is incorrect
• Discriminative models can perform poorly when the assumed form of G(x) is inappropriate – but it is often easier to choose the form of G(x) than it is to specify the precise form of the generative model
• Discriminative models focus on the classification problem without solving the (potentially more difficult) problem of learning a generative model for the data
• Estimating the parameters in the discriminative setting requires solving an optimization problem, whereas their generative counterparts often have closed form solutions (via sufficient statistics)
• Discriminative models can be 'kernelized' – e.g., kernel logistic regression
Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Summary

• We can learn classifiers in the discriminative setting using maximum likelihood, maximum a posteriori, or Bayesian estimation of the parameters
• Discriminative models may overfit the data – the use of priors or regularization is recommended

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Accuracy Boosting Using Ensemble Classifiers

• Outline • Ensemble methods • Bagging • Boosting • Error-correcting output coding • Why does ensemble learning work?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Ensemble learning

Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions

[Freund & Schapire, 1995]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Ensemble Learning

• Intuition: Combining the predictions of an ensemble is often more accurate than using a single classifier
• Justification:
– It is easy to find fairly accurate "rules of thumb"
– It is hard to find a single highly accurate prediction rule
– If the training examples are few and the hypothesis space is large, then there are several equally accurate classifiers
– The hypothesis space may not contain the true function, but it contains several good approximations
– Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Ensemble learning

[Diagram]
Learning phase: from a training set T, different training sets T1, T2, …, TS and/or different learning algorithms produce hypotheses h1, h2, …, hS.
Classification phase: a new instance (x, ?) is classified by the combined hypothesis h* = F(h1, h2, …, hS), yielding (x, y*).

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory How to make an effective ensemble?

Two basic questions in designing ensembles: • How to generate the base classifiers? h1, h2, … • How to combine them? F(h1(x), h2(x), …)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory How to combine classifiers

Usually take a weighted vote:

ensemble(x) = sign( Σi wi hi(x) )

• wi is the weight of hypothesis hi

• wi > wj means hi is more reliable than hj

• typically wi > 0 (though could have wi < 0 meaning hi is more often wrong than right) • Bayesian averaging is an example • (Fancier schemes are possible but uncommon)
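A tiny illustration of the weighted vote (the hypotheses h1–h3, their weights, and the use of labels in {-1, +1} are hypothetical choices, not from the slides):

```python
# ensemble(x) = sign( sum_i w_i * h_i(x) ) for hypotheses with labels in {-1, +1}
import numpy as np

def ensemble_predict(hypotheses, weights, x):
    # hypotheses: callables returning -1 or +1; weights: their reliabilities
    return int(np.sign(sum(w * h(x) for h, w in zip(hypotheses, weights))))

h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0 else -1
h3 = lambda x: 1 if x[0] + x[1] > 0 else -1
print(ensemble_predict([h1, h2, h3], [0.5, 0.3, 0.9], np.array([0.2, -1.0])))   # -1
```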

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory How to generate base classifiers

• A variety of approaches • Bagging (Bootstrap aggregation) • Boosting (Specifically, Adaboost – Adaptive Boosting algorithm) • …

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Ensemble learning

[Diagram, repeated from the earlier ensemble-learning slide]
Learning phase: from a training set T, different training sets T1, T2, …, TS and/or different learning algorithms produce hypotheses h1, h2, …, hS.
Classification phase: a new instance (x, ?) is classified by the combined hypothesis h* = F(h1, h2, …, hS), yielding (x, y*).

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting • Boosting is an ensemble method. • The prediction generated by the classifier is a combination of the prediction of several predictors. • What is different? – It is iterative – Boosting: Each successive classifier depends upon its predecessors unlike in the case of bagging where the individual classifiers were independent – Training Examples may have unequal weights – Look at errors from previous classifier step to decide how to focus on next iteration over data – Set weights to focus more on ‘hard’ examples. (the ones on which we committed mistakes in the previous iterations)

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting Algorithm

• W(x) is the distribution of weights over the N training points

∑ W(xi)=1

• Initially assign uniform weights W0(x) = 1/N for all x, step k=0 • At each iteration k :

– Find best weak classifier Ck(x) using samples obtained using Wk(x) § Resulting error rate εk § The weight of the resulting classifier Ck is αk § For each xi , update weights based on εk to get Wk+1(xi ) • CFINAL(x) =sign [ ∑ αi Ci (x) ]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Boosting (Algorithm)
$$C(x) = \sum_{j=1}^{M}\alpha_j C_j(x)$$

[Diagram: the training set yields C1; successive weighted samples yield C2, …, CM, which are combined in the weighted sum above]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting

Basic Idea:
• assign a weight to every training set instance
• initially, all instances have the same weight
• as boosting proceeds, it adjusts the weights based on how well we have predicted the data points so far
– data points predicted correctly → low weight
– data points mispredicted → high weight
• Result: as learning proceeds, the learner is forced to focus on portions of the data space that were not previously well predicted

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Algorithm

• W(x) is the distribution of weights over the N training points ∑ W(xi)=1 • Initially assign uniform weights W0(x) = 1/N for all x. • At each iteration k :

– Find best weak classifier Ck(x) using weights Wk(x) – Compute εk the error rate as εk= [ ∑ W(xi ) · I(yi ≠ Ck(xi )) ] / [ ∑ W(xi )]

– weight the classifier Ck by αk • αk = log ((1 – εk )/εk )

– For each xi , Wk+1(xi ) = Wk(xi ) · exp[αk · I(yi ≠ Ck(xi ))] • CFINAL(x) =sign [ ∑ αi Ci (x) ]
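The sketch below implements these update rules with a decision-stump weak learner; the stump search, the toy data, and the use of labels in {-1, +1} (so that the sign-based combination applies) are implementation choices, not prescribed by the slide.

```python
# Compact AdaBoost sketch: alpha_k = log((1-eps_k)/eps_k), mistakes up-weighted by exp(alpha_k).
import numpy as np

def stump_fit(X, y, w):
    # pick the (feature, threshold, polarity) stump with the lowest weighted error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                    # uniform initial weights
    classifiers = []
    for _ in range(rounds):
        err, j, thr, pol = stump_fit(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)        # classifier weight, as on the slide
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w = w * np.exp(alpha * (pred != y))    # up-weight the mistakes
        w = w / w.sum()                        # renormalize to a distribution
        classifiers.append((alpha, j, thr, pol))
    return classifiers

def adaboost_predict(classifiers, X):
    score = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1) for a, j, t, p in classifiers)
    return np.sign(score)                      # C_FINAL(x) = sign(sum_k alpha_k C_k(x))

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
clfs = adaboost(X, y, rounds=10)
print(np.mean(adaboost_predict(clfs, X) == y))   # training accuracy
```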

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Example

Original Training set : Equal Weights to all training samples

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Example

ROUND 1

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Example

ROUND 2

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Example

ROUND 3

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost Example

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Boosting

[Plot: ensemble error rate versus size of the ensemble (1, 2, 3, …, 500), with reference lines at 50% (error rate of flipping a coin) and 49% (error rate of L by itself); the ensemble error rate falls well below both as the ensemble grows]

• Suppose L is a weak learner – one that can learn a hypothesis that is better than random guessing, but perhaps only a tiny bit better
– Theorem: Boosting L yields an ensemble with arbitrarily low error on the training data!

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting performance
Decision stumps are very simple classifiers that test a condition on a single attribute. Suppose we use decision stumps as the individual classifiers whose predictions are combined to generate the final prediction, and we plot the misclassification rate of the boosting algorithm against the number of iterations performed.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting performance

Steep decrease in error

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting performance
• Observations
– The first few (about 50) iterations increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.
– As the number of iterations increases, the training error decreases.
– As the number of iterations increases, does the generalization error also decrease?

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Can Boosting do well if?

• The individual classifiers are not very accurate and have high variance (e.g., decision stumps)?
– It can, if the individual classifiers have considerable mutual disagreement.
• The individual classifier is very accurate and has low variance (e.g., an SVM with a good kernel function)?
– No; boosting provides little additional benefit in this case.

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model

• The final prediction in boosting f(x) can be expressed as an additive expansion of individual classifiers

$$f(x) = \sum_{m=1}^{M}\beta_m\, b(x;\gamma_m)$$
• The process is iterative and can be expressed as follows:

$$f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\gamma_m)$$
• Typically we would try to minimize a loss function over the training examples:

$$\min_{\{\beta_m,\gamma_m\}_1^M}\; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M}\beta_m\, b(x_i;\gamma_m)\right)$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model

• Simple case: Squared-error loss

$$L(y, f(x)) = \tfrac{1}{2}\left(y - f(x)\right)^2$$
• Forward stage-wise modeling amounts to just fitting the residuals from the previous iteration:

$$L\!\left(y_i,\, f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\right) = \tfrac{1}{2}\left(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\right)^2 = \tfrac{1}{2}\left(r_{im} - \beta\, b(x_i;\gamma)\right)^2$$
where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual.
• Squared-error loss is not robust for classification.
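A sketch of forward stage-wise fitting of residuals under squared-error loss (the regression-stump base learner and the synthetic data are assumptions; β_m is fixed to 1 for simplicity):

```python
# Forward stage-wise additive modeling: each new base learner fits r_im = y_i - f_{m-1}(x_i).
import numpy as np

def fit_stump(x, r):
    # choose the split minimizing squared error of piecewise-constant predictions
    best = None
    for thr in np.unique(x):
        left, right = r[x <= thr], r[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, cl, cr = best
    return lambda z: np.where(z <= thr, cl, cr)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

f = np.zeros_like(y)                 # f_0 = 0
for m in range(50):
    r = y - f                        # residuals from the previous iteration
    b = fit_stump(x, r)
    f = f + b(x)                     # f_m = f_{m-1} + b(x; gamma_m)
print(np.mean((y - f)**2))           # training squared error shrinks as m grows
```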

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model

• AdaBoost for Classification uses the exponential loss function: – L(y, f (x)) = exp(-y · f (x))

$$\arg\min_{f}\; \sum_{i=1}^{N} L\!\left(y_i, f(x_i)\right)$$
$$= \arg\min_{\beta, G_m}\; \sum_{i=1}^{N}\exp\!\left(-y_i\left[f_{m-1}(x_i) + \beta\, G_m(x_i)\right]\right)$$
$$= \arg\min_{\beta, G_m}\; \sum_{i=1}^{N}\exp\!\left(-y_i\, f_{m-1}(x_i)\right)\exp\!\left(-y_i\, \beta\, G_m(x_i)\right)$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model

First assume that β is constant, and minimize w.r.t. Gm: N

argmin∑exp(−yi ⋅ fm−1(xi ))exp(−yi ⋅ β ⋅Gm (xi )) β,Gm i=1 N (m) (m) = argmin∑wi ⋅ exp(−yi ⋅ β ⋅Gm (xi )), where wi = exp(−yi ⋅ fm−1(xi )) ,G β m i=1 N N (m) −β (m) β First optimize wrt Gm , then β = argmin ∑ wi ⋅ e + ∑ wi ⋅ e Gm yi =Gm (xi ) yi ≠Gm (xi ) N N β −β (m) −β (m) = argmin (e − e )∑[wi ⋅ I(yi ≠ Gm (xi ))]+ e ∑wi G m i=1 i=1 N (m) ∑[wi ⋅ I(yi ≠ Gm (xi ))] β −β i=1 −β = argmin (e − e ) N + e Gm (m) ∑wi i=1

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model

$$\arg\min_{G_m}\left[(e^{\beta} - e^{-\beta})\,\frac{\sum_{i=1}^{N} w_i^{(m)}\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i^{(m)}} + e^{-\beta}\right]
= \arg\min_{G_m}\left[(e^{\beta} - e^{-\beta})\cdot err_m + e^{-\beta}\right] = H(\beta)$$

err_m : the training error on the weighted samples.
On each iteration we must find a classifier that minimizes the training error on the weighted samples!

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting as an Additive Model
Now that we have found G, we minimize w.r.t. β:
$$H(\beta) = err_m\,(e^{\beta} - e^{-\beta}) + e^{-\beta}$$
$$\frac{\partial H}{\partial\beta} = err_m\,(e^{\beta} + e^{-\beta}) - e^{-\beta} = 0$$
Multiplying through by $e^{\beta}$:
$$err_m\,(e^{2\beta} + 1) - 1 = 0 \quad\Longrightarrow\quad e^{2\beta} = \frac{1 - err_m}{err_m}$$

$$\beta = \frac{1}{2}\log\!\left(\frac{1 - err_m}{err_m}\right)$$

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

[Figure: the AdaBoost pseudocode with annotations – binary class y ∈ {0,1}; initial weights set to 1/N; weights are normalized to a probability distribution p_t with Σ_i p_t^i = 1; mistakes on high-weight instances are penalized more; if h_t gets instance i right its weight is multiplied by a factor < 1, and if h_t gets instance i wrong its weight is multiplied by 1; the final hypothesis is a weighted vote]

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Learning from weighted instances?

(instance weights typically lie in [0,1])

• The learning algorithms we have seen take as input a set of unweighted examples • What if we have weighted examples instead? • It is easy to modify most learning algorithms to deal with weighted instances: – For example, replace counts of examples that match some specified criterion by sum of weights of examples that match the specified criterion

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory AdaBoost (Characteristics)

• Why the exponential loss function?
– Computational
• Simple modular re-weighting
• Determining the optimal parameters is relatively easy
– Statistical
• In the two-label case, the minimizer of the expected exponential loss is one half the log-odds of P(Y=1|x)
• We can use its sign as the classification rule
• Accuracy depends upon the number of iterations

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Boosting - Summary • Basic motivation – creating a committee of experts is typically more effective than trying to derive a single super-genius • Boosting provides a simple and powerful method for turning weak learners into strong learners

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Spectral Clustering

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Data matrix
Adjacency matrix (A)
– n x n matrix
– A = [w_ij], where w_ij is the edge weight between vertices (data instances) x_i and x_j

[Figure: a weighted graph on vertices x1–x6 with edge weights 0.8 (x1–x2), 0.6 (x1–x3), 0.1 (x1–x5), 0.8 (x2–x3), 0.2 (x3–x4), 0.8 (x4–x5), 0.7 (x4–x6), 0.8 (x5–x6)]

      x1    x2    x3    x4    x5    x6
x1    0     0.8   0.6   0     0.1   0
x2    0.8   0     0.8   0     0     0
x3    0.6   0.8   0     0.2   0     0
x4    0     0     0.2   0     0.8   0.7
x5    0.1   0     0     0.8   0     0.8
x6    0     0     0     0.7   0.8   0

• Important properties:
– Symmetric matrix
– Eigenvalues are real
– Eigenvectors span an orthogonal basis

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Degree matrix

Degree matrix (D)
– n x n diagonal matrix
– $D(i,i) = \sum_j w_{ij}$ : the total weight of edges incident to vertex x_i

For the example graph: D = diag(1.5, 1.6, 1.6, 1.7, 1.7, 1.5), with rows and columns indexed by x1 … x6.

Important application:
– Normalization of the adjacency matrix

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Laplacian Matrix L = D - A
• Laplacian matrix (L)
– n x n symmetric matrix

      x1     x2     x3     x4     x5     x6
x1    1.5   -0.8   -0.6    0     -0.1    0
x2   -0.8    1.6   -0.8    0      0      0
x3   -0.6   -0.8    1.6   -0.2    0      0
x4    0      0     -0.2    1.7   -0.8   -0.7
x5   -0.1    0      0     -0.8    1.7   -0.8
x6    0      0      0     -0.7   -0.8    1.5

• Important properties:
– L is symmetric and positive semi-definite
– L has n non-negative eigenvalues: $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$
– The smallest eigenvalue is 0 and the corresponding eigenvector is the constant vector of 1's

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Spectral analysis

1. Pre-processing
– Build the Laplacian matrix L = D - A corresponding to the graph represented by the adjacency matrix A (the matrix L for the example graph is shown on the previous slide)
2. Decomposition
– Find the k smallest eigenvalues λ1 … λk and
– the corresponding eigenvectors v1 … vk of the matrix L

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Spectral analysis reveals clusters
• Grouping
– The lowest eigenvalue corresponds to an eigenvector all of whose components are equal
– Sort the components of the next eigenvector (say v2), keeping track of the correspondence with the vertices
– Identify clusters by splitting the vertices into two groups
– Split at 0

Split at 0: the components of v2 are x1 = 0.2, x2 = 0.2, x3 = 0.2, x4 = -0.4, x5 = -0.7, x6 = -0.7
Cluster A (positive components): x1, x2, x3
Cluster B (negative components): x4, x5, x6
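The grouping step can be reproduced directly on the 6-vertex example graph from these slides (only the use of numpy's eigensolver is an implementation choice):

```python
# Spectral bi-partitioning: split vertices by the sign of the Fiedler vector of L = D - A.
import numpy as np

A = np.array([
    [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
    [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.8, 0.0],
])
D = np.diag(A.sum(axis=1))
L = D - A

eigvals, eigvecs = np.linalg.eigh(L)     # eigh: L is symmetric, eigenvalues in ascending order
v2 = eigvecs[:, 1]                       # eigenvector of the second-smallest eigenvalue
cluster_A = np.where(v2 > 0)[0] + 1      # vertices with positive components
cluster_B = np.where(v2 <= 0)[0] + 1
print(cluster_A, cluster_B)              # recovers {x1,x2,x3} vs {x4,x5,x6} (up to a sign flip)
```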

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory

Spectral Bi-partitioning Algorithm

[Table: eigenvectors of L (columns v1–v6), one row per vertex; v1 is the constant vector and v2 separates {x1, x2, x3} from {x4, x5, x6}]

      v1     v2     v3     v4     v5     v6
x1    0.41  -0.41  -0.65  -0.31  -0.38   0.11
x2    0.41  -0.44   0.01   0.30   0.71   0.22
x3    0.41  -0.37   0.64   0.04  -0.39  -0.37
x4    0.41   0.37   0.34  -0.45   0.00   0.61
x5    0.41   0.41  -0.17  -0.30   0.35  -0.65
x6    0.41   0.45  -0.18   0.72  -0.29   0.09

Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar Pennsylvania State University College of Information Sciences and Technology Artificial Intelligence Research Laboratory Example – 2 Spirals

[Figure: two interleaved spirals in the original 2-D input space]

Build a reduced space from k eigenvectors.
Cluster the data in the reduced space – using k-means.

[Figure: the same data in the embedded space given by the two leading eigenvectors]

In the embedded space given by the two leading eigenvectors, the clusters are trivial to separate.
Machine Learning, Astroinformatics @ Penn State, 2018, (C) Vasant Honavar
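A sketch of the general recipe – build a similarity graph, embed the data using the k smallest eigenvectors of L, and run k-means in the embedded space. The Gaussian similarity, its bandwidth, the two-rings toy data (used here instead of spirals for brevity), and the use of scikit-learn's KMeans are all implementation choices, not prescribed by the slides.

```python
# Spectral clustering sketch: Laplacian eigen-embedding followed by k-means.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k=2, sigma=0.5):
    # adjacency from a Gaussian similarity between all pairs of points
    sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    A = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A                 # unnormalized Laplacian L = D - A
    _, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]                     # reduced space from the k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

# usage on toy data: two noisy concentric rings
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.repeat([1.0, 3.0], 100)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)] + 0.05 * rng.normal(size=(200, 2))
labels = spectral_clustering(X, k=2, sigma=0.5)
print(np.bincount(labels))                         # two clusters of roughly 100 points each
```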