Machine Learning Foundations (機器學習基石)

Lecture 16: Three Learning Principles
Hsuan-Tien Lin (林軒田), [email protected]
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 15: Validation
(crossly) reserve validation data to simulate the testing procedure for model selection

Lecture 16: Three Learning Principles
Occam's Razor / Sampling Bias / Data Snooping / Power of Three

Occam's Razor

An explanation of the data should be made as simple as possible, but no simpler. —Albert Einstein? (1879–1955)

entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity) —William of Occam (1287–1347)

'Occam's razor' for trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons


Occam's Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:
1 What does it mean for a model to be simple?
2 How do we know that simpler is better?


Simple Model

• simple hypothesis h: small Ω(h) = 'looks' simple; specified by few parameters
• simple model H: small Ω(H) = not many; contains a small number of hypotheses

connection: h specified by ℓ bits ⇐ H of size 2^ℓ, hence small Ω(h) ⇐ small Ω(H)

simple: small hypothesis/model complexity


Simple is Better

in addition to the math proof that you have seen, philosophically:
simple H
⇒ smaller m_H(N)
⇒ m_H(N)/2^N = less 'likely' to fit data perfectly
⇒ more significant when a fit happens

direct action: linear first; always ask whether the data is over-modeled


Fun Time

Consider the decision stumps in R^1 as the hypothesis set H. Recall that m_H(N) = 2N. Consider 10 different inputs x_1, x_2, ..., x_10 coupled with labels y_n generated iid from a fair coin. What is the probability that the data D = {(x_n, y_n)}_{n=1}^{10} is separable by H?
1  1/1024
2  10/1024
3  20/1024
4  100/1024

Reference Answer: 3
Of all 1024 possible D, only 2N = 20 of them are separable by H.
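The 20/1024 count can be checked by brute force. A minimal sketch (my own construction, not from the lecture) that enumerates every dichotomy a 1-D decision stump can realize on 10 points, then checks all 2^10 labelings:

```python
from itertools import product

def stump_dichotomies(xs):
    """All labelings of points xs achievable by a 1-D decision stump
    h(x) = s * sign(x - theta) with s in {+1, -1}."""
    xs = sorted(xs)
    # one threshold below all points, plus one between each adjacent pair
    thetas = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return {tuple(s * (1 if x > t else -1) for x in xs)
            for s in (+1, -1) for t in thetas}

xs = list(range(10))                       # 10 distinct inputs
achievable = stump_dichotomies(xs)
assert len(achievable) == 2 * len(xs)      # m_H(10) = 20

# count which of the 2^10 equally likely labelings a stump can realize
separable = sum(tuple(y) in achievable
                for y in product((+1, -1), repeat=10))
print(separable, "/", 2 ** 10)             # prints: 20 / 1024
```

Since each of the 1024 labelings is equally likely under fair-coin labels, the probability of separability is exactly 20/1024.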


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/25 Three Learning Principles Sampling Bias Presidential Story

1948 US President election: Truman versus Dewey • a newspaper phone-poll of how people voted, • and set the title ‘Dewey Defeats Truman’ based on polling

who is this? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/25 editorial bug?—no • bad luck of polling (δ)?—no •


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/25 Three Learning Principles Sampling Bias The Big Smile Came from ...

Truman, and yes he won

suspect of the mistake: editorial bug?—no • bad luck of polling (δ)?—no •

hint: phones were expensive :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/25 technical explanation: • data from P (x, y) but test under P = P : VC fails 1 2 6 1 philosophical explanation: • study Math hard but test English: no strong test guarantee


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/25 Three Learning Principles Sampling Bias Sampling Bias

If the data is sampled in a biased way, learning will pro- duce a similarly biased outcome.

technical explanation: • data from P (x, y) but test under P = P : VC fails 1 2 6 1 philosophical explanation: • study Math hard but test English: no strong test guarantee

‘minor’ VC assumption: data and testing both iid from P

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/25 in my first shot, Eval(g) showed 13% improvement why am I still teaching here? :-) •
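The technical point can be seen in a toy sketch (my own construction, not from the lecture): a learner with small in-sample error under P1 fails badly when the test distribution P2 differs, and no VC bound protects it.

```python
import random

random.seed(2)

def labels(p_pos, n):
    """Draw n binary labels with P(y = +1) = p_pos."""
    return [+1 if random.random() < p_pos else -1 for _ in range(n)]

train = labels(0.2, 10000)   # P1: 20% positive
test = labels(0.8, 10000)    # P2: 80% positive, P2 != P1

# 'learn' the best constant classifier on the training sample (majority vote)
g = +1 if sum(y == +1 for y in train) > len(train) / 2 else -1

err_in = sum(y != g for y in train) / len(train)   # about 0.2: looks fine
err_out = sum(y != g for y in test) / len(test)    # about 0.8: VC says nothing here
```

Training and testing must come from the same distribution for the in-sample error to say anything about the test error.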


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/25 Three Learning Principles Sampling Bias Sampling Bias in Learning

A True Personal Story

Netflix competition for movie likes comedy?likes action?prefers blockbusters? likes Tom Cruise? • recommender system: viewer 10% improvement = 1M US dollars Match movie and add contributions predicted viewer factors from each factor rating formed val, • movie in my firstD shot,

comedy contentaction contentblockbuster? Tom Cruise in it? Eval(g) showed 13% improvement why am I still teaching here? :-) •

validation: random examples within ; test: ‘last’ user records ‘after’ D D

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/25 • training: emphasize later examples (KDDCup 2011) • validation: use ‘late’ user records


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/25 Three Learning Principles Sampling Bias Dealing with Sampling Bias

If the data is sampled in a biased way, learning will pro- duce a similarly biased outcome.

practical rule of thumb: • match test scenario as much as possible e.g. if test: ‘last’ user records ‘after’ • D • training: emphasize later examples (KDDCup 2011) • validation: use ‘late’ user records

last puzzle:

danger when learning ‘credit card approval’ with existing bank records?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/25 Reference Answer: 2 That’s how we form the validation set, remember? :-)
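A minimal sketch (hypothetical records, my own construction) of the rule of thumb: when the test consists of the 'last' records in time, hold out late records for validation rather than a uniformly random subset.

```python
import random

random.seed(0)

# hypothetical user records ordered by time: (timestamp, label)
records = [(t, random.choice([-1, +1])) for t in range(1000)]

# random validation split: simulates an iid test, which MISMATCHES a
# 'last user records' test scenario
val_random = random.sample(records, 200)

# time-based validation split: hold out the latest records, matching a
# test that consists of the 'last' user records 'after' D
records.sort(key=lambda r: r[0])            # keep explicit time order
train, val_late = records[:-200], records[-200:]

# the late-validation set really is the most recent data
assert min(t for t, _ in val_late) > max(t for t, _ in train)
```

The random split treats every era as equally likely, which is exactly the sampling bias the slide warns about when the test data comes strictly 'after' the training data.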


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/25 Three Learning Principles Sampling Bias Fun Time

If the data is an unbiased sample from the underlying distribution P for binary classification,D which of the following subset of is also an unbiased sample from P? D

1 all the positive (yn > 0) examples 2 half of the examples that are randomly and uniformly picked from without replacement D 3 half of the examples with the smallest xn values k k 4 the largest subset that is linearly separable

Reference Answer: 2 That’s how we form the validation set, remember? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/25 for VC-safety, Φ shall be decided without ‘snooping’ data
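A quick sanity check (synthetic data, my own sketch, not from the lecture): a uniform random half keeps the label proportions of D, while keeping only the positives clearly does not.

```python
import random

random.seed(1)

# hypothetical unbiased sample D with P(y = +1) = 0.3
D = [(random.random(), +1 if random.random() < 0.3 else -1)
     for _ in range(100000)]

def pos_rate(S):
    return sum(y == +1 for _, y in S) / len(S)

# choice 2: half picked randomly and uniformly without replacement: still unbiased
half = random.sample(D, len(D) // 2)

# choice 1: positive examples only: badly biased (positive rate becomes 1)
positives = [(x, y) for x, y in D if y == +1]

print(pos_rate(D), pos_rate(half), pos_rate(positives))
```

The positive rate of the random half stays close to that of D, matching the reference answer.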


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/25 Three Learning Principles Data Snooping Visual Data Snooping

1 Visualize = R2 X full Φ : z = (1, x , x , x2, x x , x2), d = 6 • 2 1 2 1 1 2 2 VC or z = (1, x2, x2), d = 3, after visualizing? 0 • 1 2 VC or better z = (1, x2 + x2) , d = 2? • 1 2 VC 2 2  or even better z = sign(0.6 x x ) ? 1 1 2 − 1 0 1 • − − − —careful about your brain’s ‘model complexity’

for VC-safety, Φ shall be decided without ‘snooping’ data

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/25 8 years of currency trading data • first 6 years for training, • last two 2 years for testing x = previous 20 days, • y = 21th day snooping versus no snooping: • superior profit possible


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/25 Three Learning Principles Data Snooping Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning pro- cess, its ability to assess the outcome has been com- promised.

8 years of currency trading data 30 • snooping first 6 years for training, • 20 last two 2 years for testing 10 x = previous 20 days, • y = 21th day 0 Cumulative Profit % snooping versus no snooping: -10 • no snooping superior profit possible 0 100 200 300 400 500 Day snooping: shift-scale all values by training+ testing • no snooping: shift-scale all values by training only • Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/25 if all papers from the same author in one big paper: • bad generalization due to dVC( m m) ∪ H step-wise: later author snooped data by reading earlier papers, • bad generalization worsen by publish only if better


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25 bad generalization worsen by publish only if better

step-wise: later author snooped data by reading earlier papers, •

if you torture the data long enough, it will confess :-)

Three Learning Principles Data Snooping Data Snooping by Data Reusing Research Scenario benchmark data D paper 1: propose that works well on • H1 D paper 2: find room for improvement, propose • 2 —and publish only if better than on H H1 D paper 3: find room for improvement, propose • 3 —and publish only if better than on H H2 D ... • if all papers from the same author in one big paper: • bad generalization due to dVC( m m) ∪ H

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25 bad generalization worsen by publish only if better

if you torture the data long enough, it will confess :-)

Three Learning Principles Data Snooping Data Snooping by Data Reusing Research Scenario benchmark data D paper 1: propose that works well on • H1 D paper 2: find room for improvement, propose • 2 —and publish only if better than on H H1 D paper 3: find room for improvement, propose • 3 —and publish only if better than on H H2 D ... • if all papers from the same author in one big paper: • bad generalization due to dVC( m m) ∪ H step-wise: later author snooped data by reading earlier papers, •

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25 if you torture the data long enough, it will confess :-)

Three Learning Principles Data Snooping Data Snooping by Data Reusing Research Scenario benchmark data D paper 1: propose that works well on • H1 D paper 2: find room for improvement, propose • 2 —and publish only if better than on H H1 D paper 3: find room for improvement, propose • 3 —and publish only if better than on H H2 D ... • if all papers from the same author in one big paper: • bad generalization due to dVC( m m) ∪ H step-wise: later author snooped data by reading earlier papers, • bad generalization worsen by publish only if better

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25 Three Learning Principles Data Snooping Data Snooping by Data Reusing Research Scenario benchmark data D paper 1: propose that works well on • H1 D paper 2: find room for improvement, propose • 2 —and publish only if better than on H H1 D paper 3: find room for improvement, propose • 3 —and publish only if better than on H H2 D ... • if all papers from the same author in one big paper: • bad generalization due to dVC( m m) ∪ H step-wise: later author snooped data by reading earlier papers, • bad generalization worsen by publish only if better

if you torture the data long enough, it will confess :-)
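The 'publish only if better' chain is really a selection among many hypotheses on one fixed benchmark D. A minimal simulation sketch (all names hypothetical, not from the lecture) shows the effect: on pure-noise labels, where nothing can truly beat 50% accuracy, the best of M random "papers" still looks impressively good on D.

```python
import random

random.seed(0)
N, M = 100, 1000  # N benchmark examples, M competing "papers" (hypotheses)

# pure noise: the labels carry no signal, so every hypothesis has
# true (out-of-sample) accuracy exactly 50%
labels = [random.choice([-1, +1]) for _ in range(N)]

def random_hypothesis():
    # a "paper" that just guesses each example independently
    return [random.choice([-1, +1]) for _ in range(N)]

best_acc = 0.0
for _ in range(M):
    h = random_hypothesis()
    acc = sum(h[i] == labels[i] for i in range(N)) / N
    best_acc = max(best_acc, acc)  # keep only if better: the snooping step

print(f"best in-sample accuracy over {M} tries: {best_acc:.2f}")
# well above the true 50%: the benchmark D has been snooped, not beaten
```

The gap between `best_acc` and 50% is exactly the bad generalization that dVC(∪m Hm) accounts for.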

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25 be blind: avoid making modeling decision by data • be suspicious: interpret research results (including your own) by • proper feeling of contamination

extremely honest: lock your test data in safe • less honest: reserve validation and use cautiously •

one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)

Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest •

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25 be blind: avoid making modeling decision by data • be suspicious: interpret research results (including your own) by • proper feeling of contamination

less honest: reserve validation and use cautiously •

one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)

Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest • extremely honest: lock your test data in safe •

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25 be blind: avoid making modeling decision by data • be suspicious: interpret research results (including your own) by • proper feeling of contamination

one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)

Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest • extremely honest: lock your test data in safe • less honest: reserve validation and use cautiously •

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25 be suspicious: interpret research results (including your own) by • proper feeling of contamination

one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)

Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest • extremely honest: lock your test data in safe • less honest: reserve validation and use cautiously •

be blind: avoid making modeling decision by data •

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25 one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)

Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest • extremely honest: lock your test data in safe • less honest: reserve validation and use cautiously •

be blind: avoid making modeling decision by data • be suspicious: interpret research results (including your own) by • proper feeling of contamination

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25 Three Learning Principles Data Snooping Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest • extremely honest: lock your test data in safe • less honest: reserve validation and use cautiously •

be blind: avoid making modeling decision by data • be suspicious: interpret research results (including your own) by • proper feeling of contamination

one secret to winning KDDCups:

careful balance between data-driven modeling (snooping) and validation (no-snooping)
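The 'lock your test data in a safe' advice can be made literal in code. A rough sketch (class and function names are my own, not from the lecture): split once up front, and wrap the test portion in an object that permits exactly one evaluation.

```python
import random

def lock_split(data, n_test, n_val, seed=1126):
    """Split once, up front; the test portion goes 'in the safe'."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

class LockedTestSet:
    """Allows exactly one evaluation -- a crude way to stay 'extremely honest'."""
    def __init__(self, test_data):
        self._data = test_data
        self._used = False

    def evaluate(self, error_fn):
        if self._used:
            raise RuntimeError("test data already used: any further use is snooping")
        self._used = True
        return sum(error_fn(x) for x in self._data) / len(self._data)

train, val, test = lock_split(list(range(100)), n_test=20, n_val=20)
safe = LockedTestSet(test)
# ... select the final model using train and val only ...
final_error = safe.evaluate(lambda x: 0)  # first (and only) honest estimate
# a second safe.evaluate(...) would raise RuntimeError
```

Validation data, by contrast, may be reused cautiously; only the safe's contents must stay untouched until the very end.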

Three Learning Principles Data Snooping Fun Time

Which of the following can result in unsatisfactory test performance in machine learning?
1 data snooping
2 overfitting
3 sampling bias
4 all of the above

Reference Answer: 4
A professional like you should be aware of those! :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25

Three Learning Principles Power of Three Three Related Fields

Power of Three

Data Mining
• use (huge) data to find property that is interesting
• difficult to distinguish ML and DM in reality

Artificial Intelligence
• compute something that shows intelligent behavior
• ML is one possible route to realize AI

Statistics
• use data to make inference about an unknown process
• statistics contains many useful tools for ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25

Three Learning Principles Power of Three Three Theoretical Bounds

Power of Three

Hoeffding
P[BAD] ≤ 2 exp(−2ε²N)
• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding
P[BAD] ≤ 2M exp(−2ε²N)
• M hypotheses
• useful for validation

VC
P[BAD] ≤ 4 mH(2N) exp(...)
• all of H
• useful for training

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25
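The first two bounds are directly computable, which makes the cost of choice concrete. A small sketch (function names are my own) using the slide's ε, N, and M:

```python
import math

def hoeffding_bound(eps, N):
    # one fixed hypothesis: P[BAD] <= 2 exp(-2 eps^2 N)
    return 2 * math.exp(-2 * eps ** 2 * N)

def multibin_bound(eps, N, M):
    # union bound over M hypotheses: P[BAD] <= 2M exp(-2 eps^2 N)
    return M * hoeffding_bound(eps, N)

# verifying a single hypothesis is cheap; validating among M finalists
# pays a factor of M in the guarantee for the same eps and N
print(hoeffding_bound(0.1, 1000))      # tight: one bin
print(multibin_bound(0.1, 1000, 100))  # 100x looser: many bins
```

The VC bound replaces the finite M by the growth function mH(2N), which is why it can cover a whole (infinite) H during training.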

Three Learning Principles Power of Three Three Linear Models

Power of Three

(figure: each model computes a linear score s from inputs x0, x1, ..., xd, then outputs h(x))

PLA/pocket
h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression
h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression
h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
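All three models share the same linear score s = wᵀx and differ only in how they transform it. A minimal sketch (weights and inputs are illustrative, not from the lecture):

```python
import math

def score(w, x):
    # shared linear score s = w^T x, where x includes x0 = 1
    return sum(wi * xi for wi, xi in zip(w, x))

def pla_pocket(w, x):
    # h(x) = sign(s): binary classification, 0/1 error
    return 1 if score(w, x) >= 0 else -1

def linear_regression(w, x):
    # h(x) = s: real-valued output, squared error
    return score(w, x)

def logistic_regression(w, x):
    # h(x) = theta(s): probability estimate, cross-entropy error
    s = score(w, x)
    return 1 / (1 + math.exp(-s))  # theta = logistic function

w = [0.5, -1.0, 2.0]  # hypothetical weights
x = [1.0, 0.3, 0.4]   # x0 = 1 plus two features
s = score(w, x)        # 0.5 - 0.3 + 0.8 = 1.0
```

One score, three outputs: a hard label, the raw score, and a probability—which is why the three optimization stories (special, analytic, iterative) can coexist over one linear structure.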

Three Learning Principles Power of Three Three Key Tools

Power of Three

Feature Transform
Ein(w) → Ein(w̃)
dVC(H) → dVC(HΦ)
• by using a more complicated Φ
• lower Ein
• higher dVC

Regularization
Ein(w) → Ein(wREG)
dVC(H) → dEFF(H, A)
• by augmenting a regularizer Ω
• lower dEFF
• higher Ein

Validation
Ein(h) → Eval(h)
H → {g1−, ..., gM−}
• by reserving K examples as Dval
• fewer choices
• fewer examples

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 21/25
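The validation tool above can be sketched in a few lines: reserve K examples as Dval, train each finalist on the remaining N − K, and keep the one with the smallest Eval. This is a rough illustration under my own naming, not the lecture's code; the toy finalists (a constant-zero model and an average-slope model) are hypothetical.

```python
import random

def validate_select(data, finalists, K, err, seed=0):
    """Pick among finalists g1-, ..., gM- by error on a held-out Dval.

    finalists: list of functions, each mapping a training set to a hypothesis.
    err: function (hypothesis, example) -> error value.
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    d_val = [data[i] for i in idx[:K]]      # reserved K examples: Dval
    d_train = [data[i] for i in idx[K:]]    # remaining N - K for training
    best_g, best_e = None, float("inf")
    for learn in finalists:
        g_minus = learn(d_train)            # g- trained on N - K examples
        e_val = sum(err(g_minus, z) for z in d_val) / K
        if e_val < best_e:
            best_g, best_e = g_minus, e_val
    return best_g, best_e

# toy data y = 2x, and two hypothetical finalists
data = [(x, 2.0 * x) for x in range(1, 21)]
finalists = [
    lambda d: (lambda x: 0.0),                                      # constant zero
    lambda d: (lambda x, a=sum(y / x for x, y in d) / len(d): a * x),  # avg slope
]
err = lambda g, z: (g(z[0]) - z[1]) ** 2
g, e = validate_select(data, finalists, K=5, err=err)
```

Fewer choices (only M finalists) keeps the multi-bin penalty small; the price is the fewer (N − K) training examples each finalist sees.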

Three Learning Principles Power of Three Three Learning Principles

Power of Three

Occam's Razor: simple is good
Sampling Bias: class matches exam
Data Snooping: honesty is best policy

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 22/25

Three Learning Principles Power of Three Three Future Directions

Power of Three

More Transform, More Regularization, Less Label

bagging, support vector machine, decision tree, neural network, kernel, aggregation, sparsity, autoencoder, AdaBoost, coordinate descent, deep learning, dual, uniform blending, nearest neighbor, decision stump, quadratic programming, SVR, kernel LogReg, large-margin, prototype, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, RBF network, soft-margin, k-means, OOB error, probabilistic SVM

ready for the jungle!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 23/25

Three Learning Principles Power of Three Fun Time

What are the magic numbers that repeatedly appear in this class?
1 3
2 1126
3 both 3 and 1126
4 neither 3 nor 1126

Reference Answer: 3
3 as illustrated, and you may recall 1126 somewhere :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 24/25

Three Learning Principles Power of Three Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 15: Validation

Lecture 16: Three Learning Principles
• Occam's Razor: simple, simple, simple!
• Sampling Bias: match test scenario as much as possible
• Data Snooping: any use of data is 'contamination'
• Power of Three: relatives, bounds, models, tools, principles

• next: ready for the jungle!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 25/25