Introduction to Machine Learning
Amo G. Tong

Lecture: Computational Learning Theory

• PAC learning
• Sample complexity

• Some materials are courtesy of Mitchell. • All pictures belong to their creators.

Learning Theory
• Learning theory is used to characterize:
• The difficulty of machine learning problems.
• The capacity of machine learning algorithms.

• Some metrics:
• Size of training data. (Sample complexity)
• Running time of the learning approach. (Computational complexity)
• Quality of the result. (Correctness)

Settings
• What are the inputs? What is a learner? How do we measure the result?

Settings: Input and Output
• General setting: learn a boolean function with noise-free data.
• Instance space 푋. [possible weathers]
• Training data 푇. [a subset of 푋]
• Concept space 퐶. [all possible functions]
• Target concept 푐 ∈ 퐶. [the underlying true one]
• Hypothesis space 퐻 ⊆ 퐶.
• A learner 퐿 (algorithm): outputs one ℎ in 퐻, according to training data drawn from a certain distribution 퐷.

Example data (EnjoySport):
Sky    | Wind   | EnjoySport
Sunny  | Strong | Yes
Sunny  | Weak   | Yes
Rainy  | Strong | No
Cloudy | Strong | Yes

Settings: True Error
• 퐸_퐷(ℎ) = Pr_{푥∼퐷}[푐(푥) ≠ ℎ(푥)]: the probability that ℎ is wrong.
• The error is defined over the whole instance space 푋, not just the training data.
• Test examples are generated from the same distribution 퐷 as the training samples.
• We learn from training data, and the true error is not known to us. Can we bound the true error by minimizing the training error?
• We learn from training data, and the distribution 퐷 is not known to us either.
• (A sampling sketch of the true error follows below.)

[Figure: the instance space under distribution 퐷; the error regions are where the target 푐 and our hypothesis ℎ disagree.]
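A minimal sketch of what the true error means operationally: if we could sample from 퐷 and query the target 푐, we could estimate 퐸_퐷(ℎ) by Monte Carlo. The distribution, target, and hypothesis below are illustrative placeholders, not anything from the lecture.

```python
import random

def estimate_true_error(h, c, sample_from_D, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_D(h) = Pr_{x ~ D}[c(x) != h(x)]."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        x = sample_from_D(rng)
        if c(x) != h(x):
            errors += 1
    return errors / n_samples

# Illustrative setup: X = [0, 1), D = uniform, target c(x) = 1[x < 0.5],
# hypothesis h(x) = 1[x < 0.6].  They disagree exactly on [0.5, 0.6),
# so the true error is 0.1 (up to Monte Carlo noise).
c = lambda x: x < 0.5
h = lambda x: x < 0.6
print(estimate_true_error(h, c, lambda rng: rng.random()))  # about 0.1
```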

PAC Learnability
• Probably approximately correct learning: one way to relate sample complexity, running time, and the quality of results.

PAC Learnability
• Learnability: for a certain target, can it be reliably learned (small error) from
• a reasonable number of randomly drawn training examples (no infinite training data), and
• a reasonable amount of computation (no infinite time)?

• Is zero error possible? Only if the training data is the entire instance space. So we consider an error bound.
• Can we make sure that the error is small? No, because the training data is drawn randomly. So we consider a success probability.

PAC Learnability
• PAC: probably approximately correct.
• (Def) A concept class 퐶 is PAC-learnable by 퐿 using 퐻 if, for every 푐 ∈ 퐶, every distribution 퐷 over 푋, and all small values 휖 and 훿, 퐿 outputs a hypothesis ℎ in 퐻 such that 퐸_퐷(ℎ) ≤ 휖, with probability at least 1 − 훿, in time polynomial in 1/휖, 1/훿, 푛, and 푠𝑖푧푒(푐).
• 푛 is the length of one instance, and 푠𝑖푧푒(푐) is the length of the concept 푐.
• In words: the ability to achieve an arbitrarily good result, with arbitrarily high probability, efficiently.

PAC Learnability: Example
• Concept: EnjoySport (predicted from Sky and Wind).
• Instance: 2 features and one target value (푛 ≈ 3).
• Learner 퐿: the Find-S algorithm.
• 퐻: conjunctions of constraints.
• 퐷: a distribution over instances.
• 휖, 훿: any small values.
• Interpretation: EnjoySport can be predicted by the Find-S algorithm, using conjunctions of constraints, with arbitrarily high accuracy, with arbitrarily high success probability, for an arbitrary distribution, in polynomial time. (A sketch of Find-S follows below.)
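Since the example names Find-S as the learner, here is a minimal sketch of the Find-S idea as usually presented (start from the most specific hypothesis and minimally generalize on each positive example). The tuple-based encoding and the toy data are assumptions made for illustration.

```python
def find_s(training_data):
    """Find-S sketch: keep the most specific conjunction of attribute
    constraints consistent with the positive examples seen so far.
    Each example is (attribute_tuple, label); a constraint is either a
    required value or '?' (any value); None marks the initial hypothesis."""
    h = None  # most specific hypothesis: rejects everything
    for x, label in training_data:
        if not label:          # Find-S ignores negative examples
            continue
        if h is None:
            h = list(x)        # first positive example: copy it exactly
        else:                  # minimally generalize mismatched constraints
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return h

# EnjoySport-style data from the slides: attributes (Sky, Wind), label EnjoySport.
data = [(('Sunny', 'Strong'), True),
        (('Sunny', 'Weak'),   True),
        (('Rainy', 'Strong'), False),
        (('Cloudy', 'Strong'), True)]
print(find_s(data))   # ['?', '?'] after generalizing over the positive examples
```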

PAC Learnability
• How about the size of the training data? Polynomial running time requires a polynomial amount of training data.
• Sample complexity: how many training examples are needed?
• Computational complexity: how much running time is needed?

Sample Complexity
• Number of required training examples.

Sample Complexity: Outline
• Number of required training examples; data size is often the main limitation in practice.
• A bound (some sufficient conditions):
• Consistent learners.
• VC dimension.

Sample Complexity: Consistent Learner
• When the training error is zero, how much data is needed to ensure a good solution?

Sample Complexity: Consistent Learner
• Number of required training examples vs. problem (hypothesis space) size.
• A general bound on sample complexity for consistent learners:
• A consistent learner may be the best we can do with the training data.
• Does zero training error imply a low true error? How much data is needed?
• The bound is independent of the particular learning algorithm.

Sample Complexity: Consistent Learner
• Training examples 푇 and a target concept 푐.
• Version space: 푉푆_{퐻,푇} = {ℎ ∈ 퐻 : ℎ(푥) = 푐(푥) for each 푥 ∈ 푇}.
• Idea:
• A consistent learner produces one hypothesis in the version space.
• If every hypothesis in the version space is "good", the output of any consistent learner must be "good".
• How do we ensure that every hypothesis in 푉푆_{퐻,푇} is good?

Sample Complexity: Consistent Learner
• (Def) 휖-exhausted. For a distribution 퐷 and a target concept 푐, a version space 푉푆_{퐻,푇} is 휖-exhausted if every ℎ in 푉푆_{퐻,푇} satisfies 퐸_퐷(ℎ) = Pr_{푥∼퐷}[푐(푥) ≠ ℎ(푥)] ≤ 휖.
• An 휖-exhausted version space guarantees good performance of any consistent learner.
• How much data is needed to ensure an 휖-exhausted version space? We can give a bound without knowing 퐷 or 푐.

Sample Complexity: Consistent Learner
• (Theorem) Consider a finite hypothesis space 퐻, a target concept 푐, and 푚 training examples independently drawn from 퐷. For each 휖 ∈ [0,1], the probability that 푉푆_{퐻,푇} is not 휖-exhausted is no larger than |퐻|푒^(−휖푚).
• Equivalently, the probability that 푉푆_{퐻,푇} is "good" (휖-exhausted) is at least 1 − |퐻|푒^(−휖푚).

Sample Complexity – Consistent Learner (proof)
• Let ℎ_1, …, ℎ_푘 be the hypotheses in 퐻 with true error 퐸_퐷(ℎ_푖) > 휖.
• 푉푆_{퐻,푇} is not 휖-exhausted if and only if at least one of ℎ_1, …, ℎ_푘 is in 푉푆_{퐻,푇}.
• (a) The probability that ℎ_푖 ∈ 푉푆_{퐻,푇} is at most (1 − 휖)^푚. (shown below)
• By the union bound, the probability that at least one of ℎ_1, …, ℎ_푘 is in 푉푆_{퐻,푇} is at most 푘(1 − 휖)^푚: the probability that at least one event happens is no larger than the sum of the probabilities of the individual events.
• Finally, 푘(1 − 휖)^푚 ≤ |퐻|(1 − 휖)^푚 ≤ |퐻|푒^(−휖푚), using 1 − 휖 ≤ 푒^(−휖).

Sample Complexity – Consistent Learner (proof, continued)
• (a) The probability that ℎ_푖 ∈ 푉푆_{퐻,푇} is at most (1 − 휖)^푚.
• ℎ_푖 ∈ 푉푆_{퐻,푇} if and only if ℎ_푖 is consistent with all the training data.
• For one training example 푥, the probability that ℎ_푖 is consistent on 푥 is at most 1 − 휖, because 퐸_퐷(ℎ_푖) > 휖 and 푥 is drawn from 퐷.
• Since the 푚 training examples are drawn independently, the probability that ℎ_푖 is consistent with all of them is at most (1 − 휖)^푚.

Sample Complexity – Consistent Learner
• If we require the failure probability to be at most some value 훿, how much data is needed?
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• More data is needed as |퐻| increases. The bound depends only on the size of the hypothesis space, and it can be loose. (A worked calculation follows below.)
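A small worked calculation of the bound above; the function name and the example values of |퐻|, 휖, and 훿 are arbitrary choices for illustration.

```python
import math

def sample_size_finite_H(H_size, eps, delta):
    """Smallest m with |H| * exp(-eps * m) <= delta,
    i.e. m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# Example: |H| = 973 hypotheses, error bound eps = 0.1, failure probability delta = 0.05.
print(sample_size_finite_H(973, 0.1, 0.05))   # 99 examples suffice by this bound
```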

Sample Complexity – Consistent Learner
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Application 1: learn a concept that is a conjunction of literals over 푛 boolean variables, e.g., 푥_1 ∧ ¬푥_2 ∧ 푥_4.
• Are such concepts PAC-learnable by a consistent learner?
• Hypothesis space 퐻: all conjunctions of literals over the 푛 variables.
• |퐻| = 3^푛 (each variable appears positively, appears negated, or is absent).
• 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = (1/휖)(푛 ln 3 + ln(1/훿))
• The running time is bounded by a polynomial in the number of training examples.
• Result: such concepts are PAC-learnable by Find-S with 퐻 = {all conjunctions of literals}. (A numeric example follows below.)
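A sketch plugging |퐻| = 3^푛 into the bound, to show that the required sample size grows only linearly in 푛; the function name and the specific 휖, 훿, and 푛 values are illustrative.

```python
import math

def m_conjunctions(n, eps, delta):
    """m >= (1/eps) * (n*ln(3) + ln(1/delta)) for conjunctions of literals
    over n boolean variables (|H| = 3^n)."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

for n in (10, 20, 50, 100):
    print(n, m_conjunctions(n, eps=0.1, delta=0.05))
# n=10 -> 140, n=20 -> 250, n=50 -> 580, n=100 -> 1129: linear growth in n.
```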

Sample Complexity – Consistent Learner
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Application 2: learn a concept that can be any boolean function over 푛 variables.
• Are such concepts PAC-learnable by a consistent learner?
• Hypothesis space 퐻: all possible boolean functions.
• (Test 1) |퐻| = ? and 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = ?
• Is it PAC-learnable?

Sample Complexity – Consistent Learner
• Application 2: learn a concept that can be any boolean function over 푛 variables.
• Hypothesis space 퐻: all possible boolean functions.
• |퐻| = 2^(2^푛) (there are 2^푛 possible inputs, and each can be labeled in 2 ways).
• 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿)) = (1/휖)(2^푛 ln 2 + ln(1/훿))
• Is it PAC-learnable? The bound now grows exponentially in 푛, so this argument does not give a polynomial sample size. (See the sketch below.)
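For contrast with Application 1, the same calculation with |퐻| = 2^(2^푛): the bound grows exponentially in 푛. The parameter values below are only illustrative.

```python
import math

def m_all_boolean(n, eps, delta):
    """m >= (1/eps) * (2^n * ln(2) + ln(1/delta)) when H is every
    boolean function over n variables (|H| = 2^(2^n))."""
    return math.ceil(((2 ** n) * math.log(2) + math.log(1 / delta)) / eps)

for n in (10, 20, 30):
    print(n, m_all_boolean(n, eps=0.1, delta=0.05))
# n=10 -> 7,128; n=20 -> ~7.3 million; n=30 -> ~7.4 billion: exponential in n.
```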

Sample Complexity – Infinite Hypothesis
• A method for infinite hypothesis spaces: look at the structure of the instance space.

Sample Complexity – Infinite Hypothesis
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• This is a useful bound, but it can be weak.
• This bound is not useful for an infinite hypothesis space 퐻, or other parameterized models…
• Another complexity measure: the Vapnik-Chervonenkis (VC) dimension.

Sample Complexity – Infinite Hypothesis
• Motivation: classification is about distinguishing subsets.
• Given a set of instances 푆 and a hypothesis space 퐻: can 퐻 distinguish all subsets of 푆?
• The number of concepts over 푆 is 2^|푆|.
• (Def) 푆 is shattered by 퐻 if all the concepts over 푆 are included in 퐻.
• I.e., for any bi-partition (푆_1, 푆_2) of 푆, there exists some ℎ in 퐻 such that ℎ(푠) = 0 for each 푠 ∈ 푆_1 and ℎ(푠) = 1 for each 푠 ∈ 푆_2. (Test 2)
• Corollary: |퐻| ≥ 2^|푆| if 퐻 can shatter 푆.
• Question: can a finite hypothesis space shatter an infinite set of instances? (A brute-force shattering check for finite sets is sketched below.)
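A brute-force check of the shattering definition for a finite collection of hypotheses and a finite set of instances; representing hypotheses as Python functions and the threshold example are assumptions made for illustration.

```python
from itertools import product

def shatters(H, S):
    """True iff every labeling of S (every bi-partition) is realized by
    some h in H, i.e. H shatters S."""
    S = list(S)
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(int(h(x)) == y for x, y in zip(S, labeling)) for h in H):
            return False
    return True

# Tiny illustration: instances are integers, H = threshold functions 1[x >= t].
H = [lambda x, t=t: x >= t for t in range(5)]
print(shatters(H, [2]))      # True: a single point is shattered by this H
print(shatters(H, [1, 3]))   # False: no threshold labels 1 -> 1 and 3 -> 0
```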

Sample Complexity – Infinite Hypothesis
• Recall: 푆 is shattered by 퐻 if all the concepts over 푆 are included in 퐻, i.e., for any bi-partition (푆_1, 푆_2) of 푆, there exists some ℎ in 퐻 such that ℎ(푠) = 0 for each 푠 ∈ 푆_1 and ℎ(푠) = 1 for each 푠 ∈ 푆_2.
• Should 퐻 shatter the whole instance space? No, an unbiased hypothesis space is not interesting; consider subsets of the space instead.
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Corollary: 푉퐶(퐻) ≤ log_2 |퐻|.

Sample Complexity – Infinite Hypothesis
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Example 1: 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐} of 푋 with size 1:
• {푐} can be distinguished by [푐 − 1, 푐 + 1],
• ∅ can be distinguished by [푐 + 1, 푐 + 2].
• So 푆 can be shattered by 퐻.

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• (Test 3) For any subset 푆 = {푐_1, 푐_2} of 푋 with size 2, 푐_1 < 푐_2 (assume 푐_2 − 푐_1 > 2):
• {푐_1} can be distinguished by ?
• {푐_2} can be distinguished by ?
• ∅ can be distinguished by ?
• {푐_1, 푐_2} can be distinguished by ?

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐_1, 푐_2} of 푋 with size 2, 푐_1 < 푐_2 (assume 푐_2 − 푐_1 > 2):
• {푐_1} can be distinguished by [푐_1 − 1, 푐_1 + 1],
• {푐_2} can be distinguished by [푐_2 − 1, 푐_2 + 1],
• ∅ can be distinguished by [푐_2 + 1, 푐_2 + 2],
• {푐_1, 푐_2} can be distinguished by [푐_1 − 1, 푐_2 + 1].
• So 푆 can be shattered by 퐻.

Sample Complexity – Infinite Hypothesis
• Example 1 (continued): 푋 = the real numbers, 퐻 = intervals [푎, 푏].
• For any subset 푆 = {푐_1, 푐_2, 푐_3} of 푋 with size 3, 푐_1 < 푐_2 < 푐_3:
• Can we distinguish {푐_1, 푐_3} from {푐_2}? No: any interval containing 푐_1 and 푐_3 also contains 푐_2.
• So no size-3 subset can be shattered by 퐻, and 푉퐶(퐻) = 2.
• Note: 퐻 is infinite, but 푉퐶(퐻) is finite. (A brute-force verification is sketched below.)
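A brute-force verification of Example 1 using a discretized family of intervals; the grid of endpoints is an assumption made so that the hypothesis set is finite enough to enumerate.

```python
from itertools import product

def shatters(H, S):
    """True iff every labeling of S is realized by some h in H."""
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(int(h(x)) == y for x, y in zip(S, labeling)) for h in H):
            return False
    return True

# Intervals [a, b] with endpoints on a coarse grid (enough for this demo).
grid = [k / 2 for k in range(-10, 21)]
H = [lambda x, a=a, b=b: a <= x <= b for a in grid for b in grid if a <= b]

print(shatters(H, [1.0, 3.0]))        # True: two points can be shattered
print(shatters(H, [1.0, 3.0, 5.0]))   # False: cannot pick {1.0, 5.0} without 3.0
```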

Sample Complexity – Infinite Hypothesis
• Another upper bound for an 휖-exhausted version space:
• 푚 ≥ (1/휖)(4 log_2(2/훿) + 8 푉퐶(퐻) log_2(13/휖))   [compare: 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))]
• (A helper computing this bound is sketched below.)
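A small helper for the VC-based upper bound above, with base-2 logarithms as in the formula; the example values of 푉퐶(퐻), 휖, and 훿 are illustrative only.

```python
import math

def m_vc_bound(vc_dim, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Example: VC(H) = 3 (e.g., lines in the plane), eps = 0.1, delta = 0.05.
print(m_vc_bound(3, 0.1, 0.05))   # 1899 examples suffice by this bound
```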

Sample Complexity – Infinite Hypothesis
• A lower bound on sample complexity:
• (Theorem) Consider any concept class 퐶 with 푉퐶(퐶) ≥ 2, any learner 퐿, any 휖 ∈ (0, 1/8), and any 훿 ∈ (0, 1/100). Then there exists a distribution 퐷 and a target concept in 퐶 such that if 퐿 observes fewer than max[(1/휖) log_2(1/훿), (푉퐶(퐶) − 1)/(32휖)] examples, then with probability at least 훿, 퐿 outputs a hypothesis ℎ with 퐸_퐷(ℎ) > 휖.

Sample Complexity – Infinite Hypothesis
• Another upper bound for an 휖-exhausted version space: 푚 ≥ (1/휖)(4 log_2(2/훿) + 8 푉퐶(퐻) log_2(13/휖))
• Practice: In a 2-dimensional space, consider a class 퐶 of concepts of the form (푎 ≤ 푥 ≤ 푏) ∧ (푐 ≤ 푦 ≤ 푑), where 푎, 푏, 푐, 푑 are real values.
• Question: Find a number of training examples drawn randomly that assures that, for any target in 퐶, any consistent learner using 퐻 = 퐶 will, with probability at least 95%, output a hypothesis with error at most 0.15.
• Step 1: compute 푉퐶(퐻). Step 2: use the equation. (A worked calculation is sketched below.)
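A sketch of the arithmetic for Step 2, assuming Step 1 gives 푉퐶(퐻) = 4 (the standard value for axis-aligned rectangles in the plane); treat the numbers as an illustration rather than the official solution.

```python
import math

eps, delta, vc = 0.15, 0.05, 4   # error 0.15, success probability 95%, assumed VC(H) = 4
m = math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)
print(m)   # 1516 examples are enough according to the VC-based bound
```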

Summary and Learning Goals
• PAC learnability.
• Sample complexity: finite and infinite hypothesis spaces.
• Know how to compute the VC dimension.

Consistent Learner: Practice 1

Consistent Learner: Practice 1
• |퐻|푒^(−휖푚) ≤ 훿 ⇔ 푚 ≥ (1/휖)(ln|퐻| + ln(1/훿))
• Practice: In a 2-dimensional space, consider a class 퐶 of concepts of the form (푎 ≤ 푥 ≤ 푏) ∧ (푐 ≤ 푦 ≤ 푑), where 푎, 푏, 푐, 푑 are integers in [0, 99].
• Question: Find a number of training examples drawn randomly that assures that, for any target in 퐶, any consistent learner using 퐻 = 퐶 will, with probability at least 95%, output a hypothesis with error at most 0.15.
• Step 1: compute |퐻|. Step 2: use the equation. (A worked calculation is sketched below.)
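A sketch of the counting in Step 1 and the plug-in in Step 2, treating a hypothesis as a choice of integers 푎 ≤ 푏 and 푐 ≤ 푑 in [0, 99]; whether to also count degenerate or empty rectangles is a modeling choice, so take the numbers as illustrative.

```python
import math

# Step 1: count hypotheses.  Pairs (a, b) with 0 <= a <= b <= 99: 100*101/2 = 5050,
# and independently the same for (c, d), so |H| = 5050**2.
H_size = (100 * 101 // 2) ** 2
print(H_size)                      # 25,502,500 hypotheses under this counting

# Step 2: m >= (1/eps) * (ln|H| + ln(1/delta)) with eps = 0.15, delta = 0.05.
eps, delta = 0.15, 0.05
m = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)
print(m)                           # 134 examples are enough by this bound
```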

VC Dimension Practice 2
VC Dimension Practice 2
• (Def) VC Dimension. The VC dimension 푉퐶(퐻) of a hypothesis space 퐻 over an instance space 푋 is the size of the largest finite subset of 푋 shattered by 퐻.
• Example 2: 푋 = points (푥, 푦), 퐻 = linear classifiers (lines).
• Any subset 푆 of 푋 with size 1: can be shattered.
• Any subset 푆 of 푋 with size 2: can be shattered.
• Subsets 푆 of 푋 with size 3: sometimes (three non-collinear points can be shattered), so 푉퐶(퐻) ≥ 3.
• Subsets 푆 of 푋 with size 4: never shattered, so 푉퐶(퐻) = 3.
• In general, the VC dimension of linear separators in an 푛-dimensional space is 푛 + 1. (A shattering check for three points is sketched below.)
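A brute-force check that three non-collinear points can be shattered by linear classifiers of the form sign(푤·푥 + 푏); the small grid of weights searched below is an assumption that happens to be enough for these particular points.

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # three non-collinear points

def predict(w1, w2, b, p):
    """Linear classifier: label 1 iff w1*x + w2*y + b > 0."""
    return 1 if w1 * p[0] + w2 * p[1] + b > 0 else 0

# Candidate classifiers: integer weights in {-2,...,2}, bias on a half-integer grid.
candidates = [(w1, w2, b)
              for w1 in range(-2, 3) for w2 in range(-2, 3)
              for b in [k / 2 for k in range(-5, 6)]]

shattered = all(
    any(all(predict(*cand, p) == y for p, y in zip(points, labeling))
        for cand in candidates)
    for labeling in product([0, 1], repeat=3))
print(shattered)   # True: all 8 labelings are realized, so VC(H) >= 3
```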
