Introduction to Machine Learning
Amo G. Tong

Lecture: Computational Learning Theory

• PAC learning
• Sample complexity

• Some materials are courtesy of Mitchell. • All pictures belong to their creators.

Learning Theory
• Learning theory is used to characterize:
• The difficulty of machine learning problems.
• The capacity of machine learning algorithms.

• Some metrics:
• Size of training data. (Sample complexity)
• Running time of the learning approach. (Computational complexity)
• Quality of the result. (Correctness)

Settings
• What are the inputs? What is a learner? How do we measure the result?

Settings: Input and Output
• General setting: learn a boolean function with noise-free data.
• Instance space 푋. [possible weathers]
• Training data 푇. [a subset of 푋]
• Concept space 퐶. [all possible functions]
• Target concept 푐 ∈ 퐶. [the underlying true one]
• Hypothesis space 퐻 ⊆ 퐶.
• A learner 퐿 (algorithm): outputs one ℎ in 퐻, according to training data drawn from a certain distribution 퐷.

Example data (EnjoySport):
Sky    | Wind   | EnjoySport
Sunny  | Strong | Yes
Sunny  | Weak   | Yes
Rainy  | Strong | No
Cloudy | Strong | Yes

Settings: True Error
• 퐸_퐷(ℎ) = Pr_{푥∼퐷}[푐(푥) ≠ ℎ(푥)]: the probability that ℎ is wrong.
• The error is defined over the whole instance space 푋, not just the training data.
• Test examples are generated from the same distribution 퐷 as the training samples.
• We learn from training data, and the true error is not known to us. Can we bound the true error by minimizing the training error?
• We learn from training data, and the distribution 퐷 is not known to us either.
• (A sampling sketch of the true error follows below.)

[Figure: the instance space under distribution 퐷; the error regions are where the target 푐 and our hypothesis ℎ disagree.]
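A minimal sketch of what the true error means operationally: if we could sample from 퐷 and query the target 푐, we could estimate 퐸_퐷(ℎ) by Monte Carlo. The distribution, target, and hypothesis below are illustrative placeholders, not anything from the lecture.

```python
import random

def estimate_true_error(h, c, sample_from_D, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_D(h) = Pr_{x ~ D}[c(x) != h(x)]."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        x = sample_from_D(rng)
        if c(x) != h(x):
            errors += 1
    return errors / n_samples

# Illustrative setup: X = [0, 1), D = uniform, target c(x) = 1[x < 0.5],
# hypothesis h(x) = 1[x < 0.6].  They disagree exactly on [0.5, 0.6),
# so the true error is 0.1 (up to Monte Carlo noise).
c = lambda x: x < 0.5
h = lambda x: x < 0.6
print(estimate_true_error(h, c, lambda rng: rng.random()))  # about 0.1
```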

PAC Learnability
• Probably approximately correct learning: one way to relate sample complexity, running time, and the quality of results.

PAC Learnability
• Learnability: for a certain target, can it be reliably learned (small error) from
• a reasonable number of randomly drawn training examples (no infinite training data), and
• a reasonable amount of computation (no infinite time)?

• Is zero error possible? Only if the training data is the entire instance space. So we consider an error bound.
• Can we make sure that the error is small? No, because the training data is drawn randomly. So we consider a success probability.

PAC Learnability
• PAC: probably approximately correct.
• (Def) A concept class 퐶 is PAC-learnable by 퐿 using 퐻 if, for every 푐 ∈ 퐶, every distribution 퐷 over 푋, and all small values 휖 and 훿, 퐿 outputs a hypothesis ℎ in 퐻 such that 퐸_퐷(ℎ) ≤ 휖, with probability at least 1 − 훿, in time polynomial in 1/휖, 1/훿, 푛, and 푠𝑖푧푒(푐).
• 푛 is the length of one instance, and 푠𝑖푧푒(푐) is the length of the concept 푐.
• In words: the ability to achieve an arbitrarily good result, with arbitrarily high probability, efficiently.

PAC Learnability: Example
• Concept: EnjoySport (predicted from Sky and Wind).
• Instance: 2 features and one target value (푛 ≈ 3).
• Learner 퐿: the Find-S algorithm.
• 퐻: conjunctions of constraints.
• 퐷: a distribution over instances.
• 휖, 훿: any small values.
• Interpretation: EnjoySport can be predicted by the Find-S algorithm, using conjunctions of constraints, with arbitrarily high accuracy, with arbitrarily high success probability, for an arbitrary distribution, in polynomial time. (A sketch of Find-S follows below.)
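Since the example names Find-S as the learner, here is a minimal sketch of the Find-S idea as usually presented (start from the most specific hypothesis and minimally generalize on each positive example). The tuple-based encoding and the toy data are assumptions made for illustration.

```python
def find_s(training_data):
    """Find-S sketch: keep the most specific conjunction of attribute
    constraints consistent with the positive examples seen so far.
    Each example is (attribute_tuple, label); a constraint is either a
    required value or '?' (any value); None marks the initial hypothesis."""
    h = None  # most specific hypothesis: rejects everything
    for x, label in training_data:
        if not label:          # Find-S ignores negative examples
            continue
        if h is None:
            h = list(x)        # first positive example: copy it exactly
        else:                  # minimally generalize mismatched constraints
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return h

# EnjoySport-style data from the slides: attributes (Sky, Wind), label EnjoySport.
data = [(('Sunny', 'Strong'), True),
        (('Sunny', 'Weak'),   True),
        (('Rainy', 'Strong'), False),
        (('Cloudy', 'Strong'), True)]
print(find_s(data))   # ['?', '?'] after generalizing over the positive examples
```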

PAC Learnability
• How about the size of the training data? Polynomial running time requires a polynomial amount of training data.
• Sample complexity: how many training examples are needed?
• Computational complexity: how much running time is needed?

Sample Complexity
• Number of required training examples.

Sample Complexity: Outline
• Number of required training examples; data size is often the main limitation in practice.
• A bound (some sufficient conditions):
• Consistent learners.
• VC dimension.

Sample Complexity: Consistent Learner
• When the training error is zero, how much data is needed to ensure a good solution?

Sample Complexity: Consistent Learner
• Number of required training examples vs. problem (hypothesis space) size.
• A general bound on sample complexity for consistent learners:
• A consistent learner may be the best we can do with the training data.
• Does zero training error imply a low true error? How much data is needed?
• The bound is independent of the particular learning algorithm.

Sample Complexity: Consistent Learner
• Training examples 푇 and a target concept 푐.
• Version space: 푉푆_{퐻,푇} = {ℎ ∈ 퐻 : ℎ(푥) = 푐(푥) for each 푥 ∈ 푇}.
• Idea:
• A consistent learner produces one hypothesis in the version space.
• If every hypothesis in the version space is "good", the output of any consistent learner must be "good".
• How do we ensure that every hypothesis in 푉푆_{퐻,푇} is good?

Sample Complexity: Consistent Learner
• (Def) 휖-exhausted. For a distribution 퐷 and a target concept 푐, a version space 푉푆_{퐻,푇} is 휖-exhausted if every ℎ in 푉푆_{퐻,푇} satisfies 퐸_퐷(ℎ) = Pr_{푥∼퐷}[푐(푥) ≠ ℎ(푥)] ≤ 휖.
• An 휖-exhausted version space guarantees good performance of any consistent learner.
• How much data is needed to ensure an 휖-exhausted version space? We can give a bound without knowing 퐷 or 푐.

Sample Complexity: Consistent Learner
• (Theorem) Consider a finite hypothesis space 퐻, a target concept 푐, and 푚 training examples independently drawn from 퐷. For each 휖 ∈ [0,1], the probability that 푉푆_{퐻,푇} is not 휖-exhausted is no larger than |퐻|푒^(−휖푚).
• Equivalently, the probability that 푉푆_{퐻,푇} is "good" (휖-exhausted) is at least 1 − |퐻|푒^(−휖푚).

Sample Complexity – Consistent Learner (proof)
• Let ℎ_1, …, ℎ_푘 be the hypotheses in 퐻 with true error 퐸_퐷(ℎ_푖) > 휖.
• 푉푆_{퐻,푇} is not 휖-exhausted if and only if at least one of ℎ_1, …, ℎ_푘 is in 푉푆_{퐻,푇}.
• (a) The probability that ℎ_푖 ∈ 푉푆_{퐻,푇} is at most (1 − 휖)^푚. (shown below)
• By the union bound, the probability that at least one of ℎ_1, …, ℎ_푘 is in 푉푆_{퐻,푇} is at most 푘(1 − 휖)^푚: the probability that at least one event happens is no larger than the sum of the probabilities of the individual events.
• Finally, 푘(1 − 휖)^푚 ≤ |퐻|(1 − 휖)^푚 ≤ |퐻|푒^(−휖푚), using 1 − 휖 ≤ 푒^(−휖).

Sample Complexity – Consistent Learner (proof, continued)
• (a) The probability that ℎ_푖 ∈ 푉푆_{퐻,푇} is at most (1 − 휖)^푚.
• ℎ_푖 ∈ 푉푆_{퐻,푇} if and only if ℎ_푖 is consistent with all the training data.
• For one training example 푥, the probability that ℎ_푖 is consistent on 푥 is at most 1 − 휖, because 퐸_퐷(ℎ_푖) > 휖 and 푥 is drawn from 퐷.
• Since the 푚 training examples are drawn independently, the probability that ℎ_푖 is consistent with all of them is at most (1 − 휖)^푚.

Sample Complexity – Consistent Learner
• If we require the failure probability to be at most some value 훿, how much data is needed?
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• More data is needed as |퐻| increases. The bound depends only on the size of the hypothesis space, and it can be loose. (A worked calculation follows below.)
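A small worked calculation of the bound above; the function name and the example values of |퐻|, 휖, and 훿 are arbitrary choices for illustration.

```python
import math

def sample_size_finite_H(H_size, eps, delta):
    """Smallest m with |H| * exp(-eps * m) <= delta,
    i.e. m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# Example: |H| = 973 hypotheses, error bound eps = 0.1, failure probability delta = 0.05.
print(sample_size_finite_H(973, 0.1, 0.05))   # 99 examples suffice by this bound
```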

Sample Complexity – Consistent Learner
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Application 1: learn a concept that is a conjunction of literals over 푛 boolean variables, e.g., 푥_1 ∧ ¬푥_2 ∧ 푥_4.
• Are such concepts PAC-learnable by a consistent learner?
• Hypothesis space 퐻: all conjunctions of literals over the 푛 variables.
• |퐻| = 3^푛 (each variable appears positively, appears negated, or is absent).
• 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = (1/휖)(푛 ln 3 + ln(1/훿))
• The running time is bounded by a polynomial in the number of training examples.
• Result: such concepts are PAC-learnable by Find-S with 퐻 = {all conjunctions of literals}. (A numeric example follows below.)
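A sketch plugging |퐻| = 3^푛 into the bound, to show that the required sample size grows only linearly in 푛; the function name and the specific 휖, 훿, and 푛 values are illustrative.

```python
import math

def m_conjunctions(n, eps, delta):
    """m >= (1/eps) * (n*ln(3) + ln(1/delta)) for conjunctions of literals
    over n boolean variables (|H| = 3^n)."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

for n in (10, 20, 50, 100):
    print(n, m_conjunctions(n, eps=0.1, delta=0.05))
# n=10 -> 140, n=20 -> 250, n=50 -> 580, n=100 -> 1129: linear growth in n.
```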

Sample Complexity – Consistent Learner
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Application 2: learn a concept that can be any boolean function over 푛 variables.
• Are such concepts PAC-learnable by a consistent learner?
• Hypothesis space 퐻: all possible boolean functions.
• (Test 1) |퐻| = ? and 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = ?
• Is it PAC-learnable?

Sample Complexity – Consistent Learner
• Application 2: learn a concept that can be any boolean function over 푛 variables.
• Hypothesis space 퐻: all possible boolean functions.
• |퐻| = 2^(2^푛) (there are 2^푛 possible inputs, and each can be labeled in 2 ways).
• 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = (1/휖)(2^푛 ln 2 + ln(1/훿))
• Is it PAC-learnable? The bound now grows exponentially in 푛, so this argument does not give a polynomial sample size. (See the sketch below.)
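For contrast with Application 1, the same calculation with |퐻| = 2^(2^푛): the bound grows exponentially in 푛. The parameter values below are only illustrative.

```python
import math

def m_all_boolean(n, eps, delta):
    """m >= (1/eps) * (2^n * ln(2) + ln(1/delta)) when H is every
    boolean function over n variables (|H| = 2^(2^n))."""
    return math.ceil(((2 ** n) * math.log(2) + math.log(1 / delta)) / eps)

for n in (10, 20, 30):
    print(n, m_all_boolean(n, eps=0.1, delta=0.05))
# n=10 -> 7,128; n=20 -> ~7.3 million; n=30 -> ~7.4 billion: exponential in n.
```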

Sample Complexity – Infinite Hypothesis
• A method for infinite hypothesis spaces: look at the structure of the instance space.

Sample Complexity – Infinite Hypothesis
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• This is a useful bound, but it can be weak.
• This bound is not useful for an infinite hypothesis space 퐻, or other parameterized models…
• Another complexity measure: the Vapnik-Chervonenkis (VC) dimension.

Sample Complexity – Infinite Hypothesis
• Motivation: classification is about distinguishing subsets.
• Given a set of instances 푆 and a hypothesis space 퐻: can 퐻 distinguish all subsets of 푆?
• The number of concepts over 푆 is 2^|푆|.
• (Def) 푆 is shattered by 퐻 if all the concepts over 푆 are included in 퐻.
• I.e., for any bi-partition (푆_1, 푆_2) of 푆, there exists some ℎ in 퐻 such that ℎ(푠) = 0 for each 푠 ∈ 푆_1 and ℎ(푠) = 1 for each 푠 ∈ 푆_2. (Test 2)
• Corollary: |퐻| ≥ 2^|푆| if 퐻 can shatter 푆.
• Question: can a finite hypothesis space shatter an infinite set of instances? (A brute-force shattering check for finite sets is sketched below.)
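A brute-force check of the shattering definition for a finite collection of hypotheses and a finite set of instances; representing hypotheses as Python functions and the threshold example are assumptions made for illustration.

```python
from itertools import product

def shatters(H, S):
    """True iff every labeling of S (every bi-partition) is realized by
    some h in H, i.e. H shatters S."""
    S = list(S)
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(int(h(x)) == y for x, y in zip(S, labeling)) for h in H):
            return False
    return True

# Tiny illustration: instances are integers, H = threshold functions 1[x >= t].
H = [lambda x, t=t: x >= t for t in range(5)]
print(shatters(H, [2]))      # True: a single point is shattered by this H
print(shatters(H, [1, 3]))   # False: no threshold labels 1 -> 1 and 3 -> 0
```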

Sample Complexity – Infinite Hypothesis
• Recall: 푆 is shattered by 퐻 if all the concepts over 푆 are included in 퐻, i.e., for any bi-partition (푆_1, 푆_2) of 푆, there exists some ℎ in 퐻 such that ℎ(푠) = 0 for each 푠 ∈ 푆_1 and ℎ(푠) = 1 for each 푠 ∈ 푆_2.
• Should 퐻 shatter the whole instance space? No, an unbiased hypothesis space is not interesting; consider subsets of the space instead.
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Corollary: 푉퐶(퐻) ≤ log_2 |퐻|.

Sample Complexity – Infinite Hypothesis
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Example 1: 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐} of 푋 with size 1:
• {푐} can be distinguished by [푐 − 1, 푐 + 1],
• ∅ can be distinguished by [푐 + 1, 푐 + 2].
• So 푆 can be shattered by 퐻.

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• (Test 3) For any subset 푆 = {푐_1, 푐_2} of 푋 with size 2, 푐_1 < 푐_2 (assume 푐_2 − 푐_1 > 2):
• {푐_1} can be distinguished by ?
• {푐_2} can be distinguished by ?
• ∅ can be distinguished by ?
• {푐_1, 푐_2} can be distinguished by ?

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐_1, 푐_2} of 푋 with size 2, 푐_1 < 푐_2 (assume 푐_2 − 푐_1 > 2):
• {푐_1} can be distinguished by [푐_1 − 1, 푐_1 + 1],
• {푐_2} can be distinguished by [푐_2 − 1, 푐_2 + 1],
• ∅ can be distinguished by [푐_2 + 1, 푐_2 + 2],
• {푐_1, 푐_2} can be distinguished by [푐_1 − 1, 푐_2 + 1].
• So 푆 can be shattered by 퐻.

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐_1, 푐_2, 푐_3} of 푋 with size 3, 푐_1 < 푐_2 < 푐_3:
• Can we distinguish {푐_1, 푐_3} from {푐_2}? No: any interval containing 푐_1 and 푐_3 also contains 푐_2.
• So no size-3 subset can be shattered by 퐻, and 푉퐶(퐻) = 2.
• Note: 퐻 is infinite, but 푉퐶(퐻) is finite. (A brute-force verification is sketched below.)
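A brute-force verification of Example 1 using a discretized family of intervals; the grid of endpoints is an assumption made so that the hypothesis set is finite enough to enumerate.

```python
from itertools import product

def shatters(H, S):
    """True iff every labeling of S is realized by some h in H."""
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(int(h(x)) == y for x, y in zip(S, labeling)) for h in H):
            return False
    return True

# Intervals [a, b] with endpoints on a coarse grid (enough for this demo).
grid = [k / 2 for k in range(-10, 21)]
H = [lambda x, a=a, b=b: a <= x <= b for a in grid for b in grid if a <= b]

print(shatters(H, [1.0, 3.0]))        # True: two points can be shattered
print(shatters(H, [1.0, 3.0, 5.0]))   # False: cannot pick {1.0, 5.0} without 3.0
```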

Sample Complexity – Infinite Hypothesis
• Another upper bound for an 휖-exhausted version space:
• 푚 ≥ (1/휖)(4 log_2(2/훿) + 8 푉퐶(퐻) log_2(13/휖))   [compare: 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))]
• (A helper computing this bound is sketched below.)
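A small helper for the VC-based upper bound above, with base-2 logarithms as in the formula; the example values of 푉퐶(퐻), 휖, and 훿 are illustrative only.

```python
import math

def m_vc_bound(vc_dim, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Example: VC(H) = 3 (e.g., lines in the plane), eps = 0.1, delta = 0.05.
print(m_vc_bound(3, 0.1, 0.05))   # 1899 examples suffice by this bound
```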

Sample Complexity – Infinite Hypothesis
• A lower bound on sample complexity:
• (Theorem) Consider any concept class 퐶 with 푉퐶(퐶) ≥ 2, any learner 퐿, any 휖 ∈ (0, 1/8), and any 훿 ∈ (0, 1/100). Then there exists a distribution 퐷 and a target concept in 퐶 such that if 퐿 observes fewer than max[(1/휖) log_2(1/훿), (푉퐶(퐶) − 1)/(32휖)] examples, then with probability at least 훿, 퐿 outputs a hypothesis ℎ with 퐸_퐷(ℎ) > 휖.

Sample Complexity – Infinite Hypothesis
• Another upper bound for an 휖-exhausted version space: 푚 ≥ (1/휖)(4 log_2(2/훿) + 8 푉퐶(퐻) log_2(13/휖))
• Practice: In a 2-dimensional space, consider a class 퐶 of concepts of the form (푎 ≤ 푥 ≤ 푏) ∧ (푐 ≤ 푦 ≤ 푑), where 푎, 푏, 푐, 푑 are real values.
• Question: Find a number of training examples drawn randomly that assures that, for any target in 퐶, any consistent learner using 퐻 = 퐶 will, with probability at least 95%, output a hypothesis with error at most 0.15.
• Step 1: compute 푉퐶(퐻). Step 2: use the equation. (A worked calculation is sketched below.)
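A sketch of the arithmetic for Step 2, assuming Step 1 gives 푉퐶(퐻) = 4 (the standard value for axis-aligned rectangles in the plane); treat the numbers as an illustration rather than the official solution.

```python
import math

eps, delta, vc = 0.15, 0.05, 4   # error 0.15, success probability 95%, assumed VC(H) = 4
m = math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)
print(m)   # 1516 examples are enough according to the VC-based bound
```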

Summary and Learning Goals
• PAC learnability.
• Sample complexity: finite and infinite hypothesis spaces.
• Know how to compute the VC dimension.

Consistent Learner: Practice 1

Consistent Learner: Practice 1
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Practice: In a 2-dimensional space, consider a class 퐶 of concepts of the form (푎 ≤ 푥 ≤ 푏) ∧ (푐 ≤ 푦 ≤ 푑), where 푎, 푏, 푐, 푑 are integers in [0, 99].
• Question: Find a number of training examples drawn randomly that assures that, for any target in 퐶, any consistent learner using 퐻 = 퐶 will, with probability at least 95%, output a hypothesis with error at most 0.15.
• Step 1: compute |퐻|. Step 2: use the equation. (A worked calculation is sketched below.)
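A sketch of the counting in Step 1 and the plug-in in Step 2, treating a hypothesis as a choice of integers 푎 ≤ 푏 and 푐 ≤ 푑 in [0, 99]; whether to also count degenerate or empty rectangles is a modeling choice, so take the numbers as illustrative.

```python
import math

# Step 1: count hypotheses.  Pairs (a, b) with 0 <= a <= b <= 99: 100*101/2 = 5050,
# and independently the same for (c, d), so |H| = 5050**2.
H_size = (100 * 101 // 2) ** 2
print(H_size)                      # 25,502,500 hypotheses under this counting

# Step 2: m >= (1/eps) * (ln|H| + ln(1/delta)) with eps = 0.15, delta = 0.05.
eps, delta = 0.15, 0.05
m = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)
print(m)                           # 134 examples are enough by this bound
```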

VC Dimension Practice 2
VC Dimension Practice 2
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Example 2: 푋 = points (푥, 푦), 퐻 = linear classifiers (lines).
• Any subset 푆 of 푋 with size 1: can be shattered.
• Any subset 푆 of 푋 with size 2: can be shattered.
• Subsets 푆 of 푋 with size 3: sometimes (three non-collinear points can be shattered), so 푉퐶(퐻) ≥ 3.
• Subsets 푆 of 푋 with size 4: never shattered, so 푉퐶(퐻) = 3.
• In general, the VC dimension of linear separators in an 푛-dimensional space is 푛 + 1. (A shattering check for three points is sketched below.)
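A brute-force check that three non-collinear points can be shattered by linear classifiers of the form sign(푤·푥 + 푏); the small grid of weights searched below is an assumption that happens to be enough for these particular points.

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # three non-collinear points

def predict(w1, w2, b, p):
    """Linear classifier: label 1 iff w1*x + w2*y + b > 0."""
    return 1 if w1 * p[0] + w2 * p[1] + b > 0 else 0

# Candidate classifiers: integer weights in {-2,...,2}, bias on a half-integer grid.
candidates = [(w1, w2, b)
              for w1 in range(-2, 3) for w2 in range(-2, 3)
              for b in [k / 2 for k in range(-5, 6)]]

shattered = all(
    any(all(predict(*cand, p) == y for p, y in zip(points, labeling))
        for cand in candidates)
    for labeling in product([0, 1], repeat=3))
print(shattered)   # True: all 8 labelings are realized, so VC(H) >= 3
```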
