
Minimum Description Length Principle

Carl E. Rasmussen, Niki Kilbertus

25 January 2018

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

What is this about?

The minimum description length (MDL) principle is a method for inductive inference that provides a generic solution to the model selection problem and, more generally, to the overfitting problem. – Peter Grünwald

Main idea: Any regularity in the data allows us to compress them better.

Disclaimer

MDL is a principle, not a method. It allows for several methods of inductive inference, and while there are sometimes clearly “better” or “worse” versions of MDL, there is no “optimal MDL”.

I found that the really interesting questions are hard to answer without getting lost in technicalities.

This talk is entirely based on The Minimum Description Length Principle by Peter Grünwald and contains numerous verbatim quotes.

The advantages

MDL is a general theory of inductive inference with

- Occam’s razor “built in”
- no overfitting automatically, without ad hoc principles
- Bayesian interpretation
- no need for “underlying truth” for clear interpretation
- predictive interpretation (data compression ≡ probabilistic prediction)

Inductive Inference

The task of inductive inference is to find laws or regularities underlying some given set of data. These laws are then used to gain insight into the data or to classify or predict future data.

Goals of Inductive Inference

- Orthodox: The goal of probability theory is to compute probabilities of events for some given probability measure. The goal of (orthodox) statistics is, given that some events have taken place, to infer what distribution may have generated these events.
- Bayesian: Start with a family of distributions (the model) and our beliefs over them – the prior. Update the prior in light of data to get the posterior, which is used for inference about the distributions in the model and/or prediction of new data.
- Machine Learning: Find a decision/prediction rule that generalizes well (whether train and test set come from the same distribution or not).
- MDL: Try to compress the data as much as possible. Hopefully such a model also generalizes well (and is consistent).

Compression as the main focus

compress data ⇒ find its regularities and try to model them ⇒ “remember” the residuals ⇒ account for the model complexity ⇒ minimize the description length of the model + residuals

MDL views learning as data compression.

What is a description? – Examples

Data
1. 00010001000100010001...000100010001000100010001
2. 01110100110100100110...111010111011000101100010
3. 00011000001010100000...001000010000001000110001

Descriptions
1. Repeating 0001
2. Tosses of a fair coin
3. Four times as many 0s as 1s (at random locations)
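To make the third example concrete, here is a minimal Python sketch (my own illustration, not from the slides; the particular sequence and the Bernoulli(0.2) model are assumptions made for the example). Storing the sequence literally costs 1 bit per symbol, while a code that exploits the 4:1 regularity needs only about 0.72 bits per symbol:

import math

def shannon_fano_bits(seq, p1):
    """Codelength -log2 P(seq) under an i.i.d. Bernoulli(p1) model, in bits."""
    return sum(-math.log2(p1 if s == 1 else 1.0 - p1) for s in seq)

# a sequence with four times as many 0s as 1s (locations here are regular only for brevity)
seq = [0, 0, 0, 0, 1] * 200              # length 1000, 20% ones
literal = len(seq)                        # 1 bit per symbol, no model
modelled = shannon_fano_bits(seq, 0.2)    # roughly 0.72 bits per symbol
print(f"literal: {literal} bits, Bernoulli(0.2) code: {modelled:.1f} bits")

A two-part code would additionally have to pay for describing the model itself, which is exactly what the rest of the talk is about.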

No-Hypercompression-Inequality

Intuition: If we want to encode binary sequences of length n, at most a fraction 2^(−K) of all sequences can be compressed by more than K bits.

Theorem: Suppose X_1, X_2, ..., X_n are distributed according to some distribution P* on X^n (not necessarily i.i.d.). Let Q be an arbitrary distribution on X^n. Then for all K > 0

P*( −log(Q(X^n)) ≤ −log(P*(X^n)) − K ) ≤ 2^(−K) .

A short description is equivalent to identifying the data as belonging to a special, tiny subset of all possible sequences.
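The bound is easy to check empirically. The following Monte Carlo sketch is my own illustration (the choices P* = fair-coin sequences and Q = Bernoulli(0.3) are assumptions for the example); it estimates the probability that the Q-code saves at least K bits over the P*-code, which stays below 2^(−K):

import math, random

random.seed(0)
n, K, trials = 100, 5, 20000

def neg_log2_bernoulli(seq, p1):
    return sum(-math.log2(p1 if s == 1 else 1.0 - p1) for s in seq)

hits = 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(n)]   # x drawn from P* (fair coin)
    lp_star = float(n)                             # -log2 P*(x) = n bits
    lq = neg_log2_bernoulli(x, 0.3)                # codelength under Q
    if lq <= lp_star - K:                          # Q saves at least K bits
        hits += 1

print(f"empirical P*(saving >= {K} bits) = {hits / trials:.4f}  <=  2^-{K} = {2 ** -K:.4f}")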

Ideal MDL

Definition: The Kolmogorov complexity (KC) of a binary string is the length of the shortest program in a universal language that produces the string and then halts.

Problems:

- KC is uncomputable
- KC is only defined up to a (potentially) large constant

The MDL principle with KC as its foundation is often called ideal(ized) MDL or algorithmic statistics.

Ming Li, Paul Vitányi, An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2008.
Practical MDL

Practical MDL uses a less expressive description method that is

- restrictive enough to allow us to compute the length of the shortest description of any data sequence
- general enough to allow us to compress many of the intuitively “regular” data sequences

Remark: In practice, a suitable choice of the description method will be guided by a priori knowledge of the problem domain.

Downside: There will be “regular” sequences that we cannot compress.

Terminology (intentionally imprecise)

point hypothesis: a single probability distribution or function
hypothesis: an arbitrary set of probability distributions or functions
model: a set of probability distributions or functions of the same functional form, e.g. first-order Markov models or kth-degree polynomials
model class: a family (set) of models, e.g. all Markov chains or polynomials of each order

[Figure: a model class containing model 1 and model 2; a hypothesis is an arbitrary set; a point hypothesis is a single element of a model]

Notation

H: general point hypothesis
P: probabilistic point hypothesis, e.g. N(µ, σ)
h: deterministic function, e.g. 4x^2 − 2x + 6
H (calligraphic): a set of general point hypotheses
M, M (calligraphic): probabilistic models and model classes

Terminology & Notation – Example

Problem setting: We fit a polynomial of degree k to data and perform cross-validation over “all” k for model selection.

Outcome: The best fit is obtained for the point hypothesis h = x^4 − 3x^2 + 2. We conclude that H_4 = {4th-order polynomials} is the best model in the model class of all polynomials.
Informal primer on “crude two-part MDL”

For a list of candidate models H_1, H_2, ..., the best point hypothesis to explain D in H := ∪_i H_i is

H* := argmin_{H ∈ H} [ L(H) + L(D | H) ] ,

where
- L(H) is the length, in bits, of the description of the hypothesis
- L(D | H) is the length, in bits, of the description of the data when encoded “with the help of the hypothesis”

The best model to explain D is the “smallest” H_i containing H*.

We need to operationalize L(·).

Operationalizing L(D | H) – the Shannon-Fano code

We can identify codelength functions with probability distributions such that L(D | H) = −log(P(D | H)):

short codelength ←→ high probability
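A minimal sketch of this identification (Python; the two per-symbol hypotheses are assumptions made up for the example): the codelength of the data is just −log2 of its probability under the hypothesis, so whichever hypothesis assigns the data higher probability yields the shorter description.

import math

# Hypothetical per-symbol distributions P(. | H) for two i.i.d. hypotheses.
H_fair   = {"0": 0.5, "1": 0.5}
H_biased = {"0": 0.8, "1": 0.2}

def L(data, P):
    """L(D | H) = -log2 P(D | H) for an i.i.d. hypothesis, in bits."""
    return sum(-math.log2(P[s]) for s in data)

D = "0001" * 25   # 100 symbols, mostly zeros
for name, P in [("fair", H_fair), ("biased 0.8/0.2", H_biased)]:
    print(f"L(D | {name}) = {L(D, P):.1f} bits")
# The hypothesis that assigns D higher probability gives the shorter code.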

Operationalizing L(H)

People have used “intuitively reasonable” codes for H ∈ H, which are always in danger of being arbitrary. We call these crude MDL methods.

For refined MDL methods, we need an additional principle for designing a code for H, which leads to minimax codes.

Operationalizing L(H) – Example

Models for the psychological perception (brightness) of physical dimensions (light intensity):

Stevens’s model: y = ax^b + Z
Fechner’s model: y = a ln(x + b) + Z

with noise Z ∼ N.

Although both have two free parameters, according to refined MDL, Stevens’s model is “more complex” than Fechner’s. The notion of complexity used corresponds to counting “essentially distinguishable” point hypotheses.

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

Probabilistic source

Definition: A probabilistic source with outcomes in a sample space X is a function P : X* → [0, ∞) such that for all n ≥ 0 and all x^n = (x_1, ..., x_n) ∈ X^n:
1. Σ_{z ∈ X} P(x^n, z) = P(x^n)
2. P(x^0) = 1, where x^0 is the empty sample,
where X* = ∪_{n ≥ 1} X^n ∪ X^0 and X^0 = {x^0}.

A probabilistic source is just a probability distribution over infinite sequences.

Coding systems and description methods

Definition: A coding system C is a relation between an alphabet A and binary strings {0, 1}*.
1. For (a, b) ∈ C, b is a code word for source symbol a.
2. C is called singular if two different source symbols have the same code word.
3. C is called partial if some source symbol does not have a code word.

Definition: A description method is a nonsingular coding system.

Definition: A code is a description method such that each source symbol is associated with at most one code word.

(De)coding functions and code length

- A description method can be identified with a decoding function C^(−1) : {0, 1}* → A ∪ {↑}, where ↑ means “undefined”.
- A code can be identified with a (possibly partial) encoding function C : A → {0, 1}* ∪ {↑}.

We denote the length (in number of bits) of the shortest description of x ∈ A under description method C by L_C(x). If C(x) is not well defined, L_C(x) := ∞.

We usually only care about codelength functions when we talk about codes.

Prefix coding systems

Definition: Prefix coding systems are coding systems in which no extension of a code word can itself be a code word.

Prefix coding systems are exactly the description methods that remain nonsingular under concatenation.

We entirely restrict ourselves to prefix codes.

Conditional description methods

Definition: Let x^n = (x_1, ..., x_n) be a sequence from the alphabet A_1 × A_2 × ... × A_n. We denote the conditional description method, i.e. the description method used to encode x_i given that the previous symbols were x_1, ..., x_{i−1}, by C(· | x^{i−1}).

- C(· | x^{i−1}) is still prefix.
- If C(· | x^i) is a code for every i, then

L_C(x^n) = L_C(x_1, ..., x_n) = L_C(x_1) + L_C(x_2 | x^1) + ... + L_C(x_n | x^{n−1}) .
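As an illustration of how conditional codelengths add up symbol by symbol (and a preview of the prequential codes mentioned later in the talk), here is a small Python sketch of my own; the Laplace (add-one) estimator used as the conditional distribution is a choice made for the example, not something fixed by the slides.

import math

def prequential_codelength(seq):
    """L(x^n) = sum_i -log2 P(x_i | x^{i-1}), with a Laplace (add-one) conditional estimator."""
    total, counts = 0.0, [1, 1]            # start with one virtual 0 and one virtual 1
    for s in seq:
        p = counts[s] / sum(counts)        # conditional probability of the next symbol
        total += -math.log2(p)             # conditional codelength L(x_i | x^{i-1})
        counts[s] += 1
    return total

seq = [0, 0, 0, 1] * 50                    # 200 symbols, 25% ones
print(f"total codelength: {prequential_codelength(seq):.1f} bits for {len(seq)} symbols")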

Uniform codes minimize maximum description length

Definition: Enumerating the elements of a finite alphabet A results in “the” uniform code U, for which

⌈log |A|⌉ = L_U(x) = max_{x ∈ A} L_U(x) = min_{L ∈ L_A} max_{x ∈ A} L(x) ,

where L_A is the set of all length functions for prefix description methods over A.

The uniform code results in the worst-case optimal codelengths over all x ∈ A.

Efficiency and complete codes

Definition: For two description methods C_1, C_2 for a set A, we call C_1 more efficient than C_2 if for all x ∈ A, L_1(x) ≤ L_2(x), and there exists x ∈ A such that L_1(x) < L_2(x).

Definition: A code C for a set A is complete if there does not exist a code C′ that is more efficient than C.

The Kraft Inequality

Theorem: For any description method C and a finite alphabet A, the code word lengths L_C(x) must satisfy

Σ_{x ∈ A} 2^(−L_C(x)) ≤ 1 .

Given a set of code word lengths that satisfy this inequality, there exists a prefix code with these code word lengths.

For a probability distribution P over the finite space X, there exists a code C for X such that for all x ∈ X

L_C(x) = ⌈−log(P(x))⌉ .

C is called the Shannon-Fano code.
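A quick numeric sanity check (my own sketch, with an arbitrary example distribution): Shannon-Fano codelengths ⌈−log2 P(x)⌉ always satisfy the Kraft inequality, so a prefix code with these lengths exists.

import math

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.1, "e": 0.025}   # any distribution over a finite alphabet

lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}  # Shannon-Fano codelengths
kraft = sum(2 ** -l for l in lengths.values())

print("codelengths:", lengths)
print(f"Kraft sum = {kraft:.3f} <= 1")   # <= 1, so a prefix code with these lengths exists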

Code length functions

- Non-integer codelengths are fine.
- The set of all codelength functions for a (finite, countable, or continuous) sample space X is

L_X = { L : X → [0, ∞] : Σ_{x ∈ X} 2^(−L(x)) ≤ 1 } .

- A codelength function L in L_X is complete if Σ_{x ∈ X} 2^(−L(x)) = 1.
- L_U is the unique minimax optimal code in L_A.

Note that we (ab)use “code” synonymously for “code length function”.

Revisiting the Shannon-Fano code

Theorem:
- Let X be a countable sample space, and P a probability distribution over X. Then there exists a prefix code C for X^n such that for all x^n ∈ X^n

L_C(x^n) = −log P(x^n) .

C is called the code corresponding to P.
- Let C′ be a prefix description method for X^n. Then there exists a defective probability distribution P′ such that for all x^n ∈ X^n

−log P′(x^n) = L_{C′}(x^n) .

P′ is called the probability distribution corresponding to C′.
- C′ is a complete prefix code if and only if P′ is a proper distribution.

The code corresponding to P is P-optimal

Theorem: Let X be a countable set and P a probability distribution over X. Then

L* := argmin_{L ∈ L_X} E_{x∼P}[L(x)]

exists, is unique, and satisfies L*(x) = −log(P(x)) for all x ∈ X.

Rewritten as the information inequality: for all distributions P, Q with P ≠ Q,

E_P[−log(Q(X))] > E_P[−log(P(X))] ,

equivalently D(P||Q) ≥ 0, with “=” iff P = Q.

Worst-case optimality of the uniform code

For every distribution P ≠ P_U on a finite sample space X of size |X| = M there exists a distribution Q such that

E_Q[−log(P(X))] > log(M) = E_Q[−log(P_U(X))] .

The uniform code has optimal worst-case performance over both distributions and data sequences.

Entropy and KL divergence

Interpretation of entropy: The entropy of P,

H(P) := E_P[−log(P(X))] ,

is the expected number of bits of the P-optimal encoding of outcomes generated by P.

Interpretation of KL divergence: The KL divergence between P and Q,

D(P||Q) := E_P[log(P(X)/Q(X))] ,

is the expected number of additional bits for the Q-optimal encoding of outcomes generated by P.
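Both interpretations can be checked numerically. In the following sketch (P and Q are arbitrary assumed distributions, not from the slides), the expected codelength of the Q-optimal code under P exceeds the entropy H(P) by exactly D(P||Q):

import math

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.3, "c": 0.5}

H_P  = sum(-p * math.log2(p) for p in P.values())        # entropy H(P)
E_PQ = sum(-P[x] * math.log2(Q[x]) for x in P)           # expected Q-optimal codelength under P
D_PQ = sum(P[x] * math.log2(P[x] / Q[x]) for x in P)     # KL divergence D(P||Q)

print(f"H(P)           = {H_P:.4f} bits  (expected P-optimal codelength)")
print(f"E_P[-log Q(X)] = {E_PQ:.4f} bits  (expected Q-optimal codelength)")
print(f"difference     = {E_PQ - H_P:.4f} = D(P||Q) = {D_PQ:.4f} bits")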

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion
Point hypothesis selection with crude two-part MDL

Let M be a candidate model or model class.
1. For data D, we set

P* := argmin_{P ∈ M} [ L_1(P) + L_2(D | P) ] =: argmin_{P ∈ M} L_{1,2}(P, D) .

2. If P* is not unique, we choose argmin_{P ∈ P*} L_1(P), i.e. the one with the lowest complexity.
3. After that we have no preference.

To formalize these ideas, we need to choose codes C_1 and C_2.

Why “crude”?

To keep it simple, we consider “crude” MDL, meaning that we simply make “intuitive” choices for C_1 and C_2.

- For C_2, the Shannon-Fano code L(D | P) := −log(P(D)) is essentially the definitive choice: it is the only code that would be optimal if P were true.
- C_1 is harder: given a specific M, we need to make ad hoc choices.

Example for crude MDL – Markov Chains

- data D = (x_1, ..., x_n) ∈ X^n for X = {0, 1}
- model class B = {all Markov chains}
- P ∈ B is parameterized by k = 2^γ (γ the order of the chain) and θ ∈ [0, 1]^k (the transition probabilities)
- a (cumbersome) discretization Θ_d^(k) of θ on a k-dimensional grid with a finite precision of d bits leads to a somewhat intuitive L_1

Eventually, two-part MDL suggests the point hypothesis

P* = argmin_{γ ∈ N, d ∈ N, θ ∈ Θ_d^(k)} [ −log(P_{k,θ}(D)) + kd + L_N(k) + L_N(d) ] ,

where −log(P_{k,θ}(D)) is the error term and kd + L_N(k) + L_N(d) is the complexity term.
Example for crude MDL – Markov Chains

- For increasing sample size we will first underfit (at small sample sizes) and then settle at the correct order.
- This procedure can be extended to model selection and parameter estimation in a straightforward way.
- With the right discretization, this is consistent for model selection of Markov chains and close to refined MDL.
- If there is something to learn, but it is not a Markov chain, the hope is that the two-part criterion still converges to the Markov chain that is in some sense “closest” to the process generating the data. Such a closest Markov chain will in many cases still lead to reasonable predictions.
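A minimal Python sketch of the two-part criterion above (my own simplification, not the slides' construction: maximum-likelihood transition probabilities stand in for the discretized θ, the precision d is fixed, and L_N is taken to be the Elias gamma codelength), applied to a noisy first-order sequence:

import math
from collections import defaultdict

def elias_gamma_bits(j):
    """Length of a simple prefix code for positive integers (Elias gamma)."""
    return 2 * int(math.floor(math.log2(j))) + 1

def neg_log_likelihood(data, order):
    """-log2 of the ML Markov-chain likelihood; the first `order` symbols are encoded literally."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(order, len(data)):
        counts[tuple(data[i - order:i])][data[i]] += 1
    bits = float(order)                              # literal cost of the initial context
    for i in range(order, len(data)):
        c0, c1 = counts[tuple(data[i - order:i])]
        p = (c1 if data[i] == 1 else c0) / (c0 + c1)
        bits += -math.log2(p)                        # ML plug-in instead of a discretized theta
    return bits

def two_part_score(data, order, d=8):
    k = 2 ** order                                   # number of transition probabilities
    complexity = k * d + elias_gamma_bits(k) + elias_gamma_bits(d)
    return neg_log_likelihood(data, order) + complexity

data = [0, 1] * 400 + [0] * 5                        # essentially first-order data with a bit of noise
for order in range(4):
    print(f"order {order}: error + complexity = {two_part_score(data, order):.1f} bits")
# The total codelength is minimised at order 1, the structure actually present in the sequence.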

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

Encoding models, not point hypotheses

In refined MDL we associate a code for D not with a single P ∈ M, but with the full model (class) M.

Refined MDL is a general theory of inductive inference based on universal codes that are designed to be (close to) minimax optimal. It has mostly been developed for model selection, estimation and prediction.
The basic idea behind refined MDL I

To find structure in observed data D based on a model class M, we first design a “top-level” universal model P̄_top relative to M.

For different tasks we choose a different P̄_top:

- point hypothesis selection: choose a full two-part code
- prediction: choose a one-part prequential code, e.g. a meta-Bayesian code
- model selection: choose a meta-two-part code

The basic idea behind refined MDL II

1. Choose a good P̄_top; it will be partially subjective and depend on a “luckiness function”, except in special cases where minimax optimal codes exist.
2. Patterns that we want to identify explicitly must be encoded explicitly with a two-part code, e.g. the models in model selection and the point hypotheses in point hypothesis selection.
3. For patterns we need not identify explicitly, we use one-part codes, because two-part codes are incomplete.
Universal codes – Overview

Intuition: A universal code for a model M is a code that compresses the data almost as well as the best element of M. The distribution corresponding to a universal code via L(x) = −log(P(x)) is called a universal model.

Examples

- The two-part code we have already seen
- Bayesian universal codes
- Prequential plug-in universal codes
- Variants of the above and more
Universal codes – Motivation

For simplicity, we only look at countable models over countable X (instead of the more realistic parametric and non-parametric ones).

Given candidate codes L for X^n, there is no L̄ ∈ L such that for all x^n ∈ X^n

L̄(x^n) ≤ min_{L ∈ L} L(x^n) .

In words: no code has short code words for many (all) different symbols.

Are there codes that are “not much larger” than the best one for all x^n ∈ X^n?

Redundancy

Let P̄ and P be two distributions on X^n.

Definition:
- The redundancy of P̄ over P on x^n is

RED(P̄, P, x^n) := ( −log(P̄(x^n)) ) − ( −log(P(x^n)) ) .

- The worst-case redundancy of P̄ over P is

RED_max(P̄, P) := max_{x^n ∈ X^n} RED(P̄, P, x^n) .
Universal models

Let M be a family of probabilistic sources.

Definition: A universal model relative to M is a sequence of distributions P̄^(1), P̄^(2), ... on X^1, X^2, ... respectively, such that for all P ∈ M and all ε > 0 there exists n_0 ∈ N such that for all n > n_0

(1/n) RED_max(P̄^(n), P) ≤ ε .

Remarks: This can be extended to f(n)-uniformly universal models. For example, in the finite case the codelength difference is often bounded by a constant, not just increasing sublinearly, which we call O(1)-uniformly universal.
Regret

Let M be a family of probabilistic sources and P̄ a probability distribution on X^n, not necessarily in M.

Definition: For given x^n the regret of P̄ relative to M is

REG(P̄, M, x^n) := −log(P̄(x^n)) − inf_{P ∈ M} { −log(P(x^n)) } .

The regret is the additional number of bits needed to encode x^n using the code P̄ as compared to the best P in M.

Minimax optimal universal model

We measure the quality of P̄ as a universal model by its regret relative to M, taking a worst-case approach over all x^n ∈ X^n:

REG_max^(n)(P̄, M) := max_{x^n ∈ X^n} REG(P̄, M, x^n) .

Definition: The minimax optimal universal model is given by

argmin_{P̄ ∈ Δ(X^n)} REG_max^(n)(P̄, M) .

Minimize the additional bits needed compared to the best code in M in the worst case over X^n.
Complexity

Definition: The complexity of a model M is defined by

COMP^(n)(M) := log( Σ_{x^n ∈ X^n} sup_{P ∈ M} P(x^n) ) .

The more sequences x^n have large likelihood under some P ∈ M, i.e. can be fit well, the larger COMP^(n)(M) is.
Normalized maximum likelihood (NML) model

Definition: If COMP^(n)(M) is finite, the minimax regret is uniquely achieved by the normalized maximum likelihood distribution

P̄_nml^(n)(x^n) := sup_{P ∈ M} P(x^n) / Σ_{y^n ∈ X^n} sup_{P ∈ M} P(y^n) .

P̄_nml^(n) is a uniformly universal model.

P̄_nml^(n) is called an equalizer strategy, as it achieves the same regret COMP^(n)(M) on every x^n ∈ X^n. Therefore it achieves the minimax regret, which is equal to the model complexity.
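For small n this can be computed by brute force. The following sketch is my own illustration (the Bernoulli model on binary sequences of length 8 is an assumed example): it builds P̄_nml explicitly and checks the equalizer property, i.e. that the regret equals COMP^(n)(M) for every sequence.

import math
from itertools import product

n = 8  # Bernoulli model: P_theta(x^n) = theta^k (1 - theta)^(n - k), with k the number of ones

def max_likelihood(xs):
    k = sum(xs)
    theta = k / len(xs)                               # ML estimate
    return theta ** k * (1 - theta) ** (len(xs) - k)  # sup_P P(x^n); note 0 ** 0 == 1 in Python

seqs = list(product([0, 1], repeat=n))
ml = {x: max_likelihood(x) for x in seqs}
Z = sum(ml.values())                                  # normalizer
COMP = math.log2(Z)                                   # model complexity COMP^(n)(M) in bits
P_nml = {x: ml[x] / Z for x in seqs}

# The regret of P_nml is the same on every sequence (equalizer property).
regrets = {(-math.log2(P_nml[x])) - (-math.log2(ml[x])) for x in seqs}
print(f"COMP^({n})(Bernoulli) = {COMP:.4f} bits; regrets observed: "
      f"{min(regrets):.4f} .. {max(regrets):.4f}")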

Simple universal two-part codes

Let L = {L_1, L_2, ..., L_M} be a finite set of codelength functions.

Definition: First, encode with log(M) bits (using a uniform code) the index j such that L_j(x^n) = min_{L ∈ L} L(x^n). Then encode x^n using the code indexed by j. Thus there exists a universal two-part code such that for all x^n ∈ X^n

L̄_2-p(x^n) = min_{L ∈ L} L(x^n) + log(M) .

Usually min_{L ∈ L} L(x^n) grows like n, so the difference between our two-part universal code and the minimum becomes negligible in relative terms. The code is log(M)-uniformly universal.

Bayesian universal model

Let M be a countable set of probabilistic sources, parameterized by a set Θ, and let W ∈ Δ(Θ) (the prior).

Definition: Then

P̄_Bayes(x^n) := Σ_{θ ∈ Θ} P_θ(x^n) W(θ)

is a universal model; it is also called the Bayesian marginal likelihood or Bayesian mixture.

Universality here means that for all θ_0 ∈ Θ

−log(P̄_Bayes(x^n)) = −log( Σ_{θ ∈ Θ} P_θ(x^n) W(θ) ) ≤ −log(P_{θ_0}(x^n)) + c_{θ_0} ,

where c_{θ_0} = −log(W(θ_0)) does not depend on n. Thus P̄_Bayes is an O(1)-uniformly universal model.
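A small sketch of this bound (my own example; the model of three Bernoulli sources with a uniform prior W is an assumed setup): the mixture codelength never exceeds the best single source's codelength by more than −log2 W(θ_0), no matter how large n gets.

import math
import random

random.seed(1)
thetas = [0.2, 0.5, 0.8]                  # the model M: three Bernoulli sources
W = {t: 1 / len(thetas) for t in thetas}  # uniform prior

def neg_log2_bernoulli(xs, theta):
    k = sum(xs)
    return -(k * math.log2(theta) + (len(xs) - k) * math.log2(1 - theta))

def neg_log2_bayes(xs):
    # -log2 of the Bayesian mixture sum_theta P_theta(x^n) W(theta)
    mix = sum(W[t] * 2 ** (-neg_log2_bernoulli(xs, t)) for t in thetas)
    return -math.log2(mix)

for n in [10, 100, 1000]:
    xs = [1 if random.random() < 0.2 else 0 for _ in range(n)]   # data from theta = 0.2
    best = min(neg_log2_bernoulli(xs, t) for t in thetas)
    bayes = neg_log2_bayes(xs)
    print(f"n={n:5d}: mixture - best = {bayes - best:.3f} bits "
          f"(bound: -log2 W = {-math.log2(W[0.2]):.3f})")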

Individual vs. stochastic universality

Individual-sequence-based: So far we have not made the assumption that sequences are generated by some P. Hence we took a worst-case view, requiring small redundancy for all sequences. This follows the MDL philosophy that data are never really generated by a distribution.

Expectation-based: Information theorists are often willing to assume that data are actually distributed according to some P ∈ M, leading to an alternative definition of universal models.

Universality

Let M be a family of probabilistic sources, and let P̄^(1), P̄^(2), ... be a sequence of distributions on X^1, X^2, ... respectively. P̄^(1), P̄^(2), ... is universal in the

1. individual-sequence sense
2. almost-sure sense
3. expected sense

if for all P ∈ M and all ε > 0 there exists n_0 ∈ N such that for all n > n_0 we have

(1/n) [ −log P̄^(n)(x^n) − ( −log P(x^n) ) ] ≤ ε

1. for all x^n ∈ X^n,
2. with P-probability 1,
3. in P-expectation.

Expected Universality

The criterion for expected universality is equivalent to: for all P ∈ M, all ε > 0, and all sufficiently large n,

(1/n) D(P || P̄^(n)) ≤ ε .

Since the expectation is over P, one could argue that we thereby explicitly assume that the data were generated by P.

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion
Summary of the MDL philosophy

1. Squeeze out all regularity for compression.
2. Models are identified with (universal) code lengths of descriptions of the data using the model.
3. Encoded regularities are meaningful regardless of whether a point hypothesis in the model is “true”.
4. We have only the data.
5. Consistency and convergence rates are a sanity check, not a design principle.

Are these properties, assumptions, or guiding principles of MDL?
Peter’s answer to common criticism

Criticism: Occam’s razor and MDL are arbitrary. By changing the coding method, we can make “complex” things look “simple” and vice versa.
Answer: Refined MDL is very restricted in terms of codes (in idealized MDL the code is unique).

Criticism: Occam’s razor is false. What good are simple explanations in a complex world?
Answer: MDL is a strategy (prefer simple models for small sample sizes) and not a statement about the world (simple models are more likely to be true). A strategy is not true or false, but clever or stupid. Occam’s razor is clever.

Thanks!

Luckiness function vs. prior distribution

From a Bayesian point of view, adopting a prior implies that we a priori expect certain things to happen; and strictly speaking, we should be willing to accept bets which have positive expected pay-off given these expectations. For example, we always believe that, for large n, with high probability, the ML estimator will lie in a region with high mass. If this does not happen, a low-probability event will have occurred and we will be surprised.

From an MDL point of view, adopting a luckiness function implies that we a priori hope certain things will happen. For example, we hope that the ML estimator will lie in a region with small luckiness function (i.e., high luckiness), but we are not willing to place bets on this event. If it does not happen, then we do not compress the data well, and therefore we do not have much confidence in the quality of our current predictive distribution when used for predicting future data; but we will not necessarily be surprised.

Bayes vs. MDL

1. MDL is not restricted to Bayesian universal codes.
2. Luckiness is a different and weaker type of subjectiveness than Bayesian subjectiveness. Many types of priors are meaningless from an MDL perspective.
3. Priors must compress. If a Bayesian marginal likelihood is used in an MDL context, it has to be interpretable as a universal code. This rules out the Diaconis-Freedman type priors, which make Bayesian inference inconsistent.
Bayes vs. MDL

“MDL is really just a special case of Bayes.” The NML distribution does not appear in any Bayesian textbook.

“NML is just an approximation of the log marginal likelihood.” This is not true for the “localized” NML distribution.

Luckiness function vs. prior distribution

For a parametric model {P_θ | θ ∈ Θ} and luckiness function a(θ), a prior π(θ) corresponding to a(θ) is in general not proportional to e^(−a(θ)), but rather to the luckiness-tilted Jeffreys prior π(θ) ∝ √(det I(θ)) e^(−a(θ)).

Typical statement that becomes hard to parse

For models with fixed variance, the Bayesian universal code based on a Gaussian prior coincides exactly with a particular Luckiness NML-2 code, for all sample sizes and not just asymptotically. This implies that the LNML-2 universal model for the linear regression models is prequential, i.e. it constitutes a probabilistic source.
