
Tutorial Part I: Information theory meets machine learning

Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)

June 2015

Introduction

Era of massive data sets: fascinating problems at the interfaces between information theory and statistical machine learning.

1 Fundamental issues
◮ Concentration of measure: high-dimensional problems are remarkably predictable
◮ Curse of dimensionality: without structure, many problems are hopeless
◮ Low-dimensional structure is essential

2 Machine learning brings in algorithmic components
◮ Computational constraints are central
◮ Memory constraints: need for distributed and decentralized procedures
◮ Increasing importance of privacy

Historical perspective: information theory and statistics

[Photo: Claude Shannon]

A rich intersection between information theory and statistics

1 Hypothesis testing, large deviations
2 Kullback-Leibler divergence
3 Fano's inequality

Statistical estimation as channel coding

Codebook: an indexed family of distributions {Q_θ | θ ∈ Θ}
Codeword: nature chooses some θ* ∈ Θ

[Diagram: θ* ∈ Θ → Q_{θ*} → X_1^n → θ̂]

Channel: user observes n i.i.d. draws X_i ∼ Q_{θ*}
Decoding: estimator X_1^n ↦ θ̂ such that θ̂ ≈ θ*

Perspective dating back to Kolmogorov (1950s), with many variations:
◮ codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
◮ channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
◮ closeness θ̂ ≈ θ*: exact/partial graph recovery in Hamming, ℓ_p-, L²(Q)-, sup-norm distances, etc.

Machine learning: algorithmic issues to the forefront!

1 Efficient algorithms are essential
◮ only (low-order) polynomial-time methods can ever be implemented
◮ trade-offs between computational cost and statistical performance?
◮ fundamental barriers due to computational complexity?

2 Distributed procedures are often needed
◮ many modern data sets: too large to be stored on a single machine
◮ need methods that operate separately on pieces of data
◮ trade-offs between decentralization and performance?
◮ fundamental barriers due to complexity?

3 Privacy and access issues
◮ conflicts between individual privacy and benefits of aggregation?
◮ principled information-theoretic formulations of such trade-offs?

Part I of tutorial: Three vignettes

1 Graphical model selection

2 Sparse principal component analysis

3 Structured non-parametric regression and minimax theory

Vignette A: Graphical model selection

Simple motivating example: epidemic modeling

Disease status of person s: x_s = +1 if individual s is infected, x_s = −1 if individual s is healthy.

(1) Independent infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s)

(2) Cycle-based infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{(s,t)∈C} exp(θ_st x_s x_t)

(3) Full clique infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{s≠t} exp(θ_st x_s x_t)

[Figures: possible epidemic patterns (on a square) and the underlying graphs]

Model selection for graphs: draw n i.i.d. samples from

Q(x_1,...,x_p; Θ) ∝ exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )

graph G and matrix [Θ]_st = θ_st of edge weights are unknown

data matrix X_1^n ∈ {−1,+1}^{n×p}

want estimator X_1^n ↦ Ĝ to minimize the error probability Q^n[Ĝ(X_1^n) ≠ G], the probability that the estimated graph differs from the truth
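To make the model class concrete, here is a minimal sketch (my own illustration, not part of the slides) that enumerates such a distribution on p = 5 binary variables and draws i.i.d. samples from it; the particular edge weights are arbitrary choices:

```python
import itertools
import numpy as np

# Minimal illustrative sketch: enumerate the pairwise Ising distribution
# Q(x; Theta) ∝ exp( Σ_s θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
# for a small graph, then draw i.i.d. samples from it.

def ising_distribution(theta, Theta):
    """theta: (p,) node weights; Theta: (p,p) symmetric edge weights, zero diagonal."""
    p = len(theta)
    configs = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
    # x^T Theta x double-counts each edge (s,t), hence the factor 1/2
    log_w = configs @ theta + 0.5 * np.sum((configs @ Theta) * configs, axis=1)
    w = np.exp(log_w - log_w.max())
    return configs, w / w.sum()

rng = np.random.default_rng(0)
p = 5
theta = 0.1 * np.ones(p)
Theta = np.zeros((p, p))
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:   # "cycle-based infection" graph
    Theta[s, t] = Theta[t, s] = 0.5

configs, probs = ising_distribution(theta, Theta)
X = configs[rng.choice(len(configs), size=1000, p=probs)]   # data matrix X_1^n, shape (n, p)
print(X.shape)
```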

Channel decoding: think of graphs as codewords, and the graph family as a codebook.

Past/ongoing work on graph selection:
◮ exact polynomial-time solution on trees (Chow & Liu, 1967)
◮ testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
◮ pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
◮ pseudolikelihood and ℓ1-regularized neighborhood regression: Gaussian case (Meinshausen & Buhlmann, 2006); binary case (Ravikumar, W. & Lafferty, 2006)
◮ various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
◮ lower bounds and inachievability results: information-theoretic bounds (Santhanam & W., 2012); computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014); phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

[Figure: US Senate network (2004–2006 voting)]

Experiments for sequences of star graphs

[Figures: star graphs with p = 9, d = 3 and p = 18, d = 6]

Empirical behavior: Unrescaled plots

[Plot: star graph with a linear fraction of neighbors; p = 64, 100, 225]

Plots of success probability versus raw sample size n.

Empirical behavior: Appropriately rescaled

[Plot: star graph with a linear fraction of neighbors; p = 64, 100, 225]

Plots of success probability versus rescaled sample size n/(d² log p).

Some theory: Scaling law for graph selection

graphs G_{p,d} with p nodes and maximum degree d

minimum absolute weight θ_min on edges

How many samples n are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

Achievable result: under regularity conditions, the graph estimate Ĝ produced by ℓ1-regularized logistic regression satisfies

n > c_u (d² + 1/θ_min²) log p   (lower bound on sample size)   ⟹   Q[Ĝ ≠ G] → 0   (vanishing probability of error).

Necessary condition: for a graph estimate Ĝ produced by any method, a sample size of the same order (up to constants) is required; below it, Q[Ĝ ≠ G] remains bounded away from zero.
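The achievable result refers to neighborhood-based ℓ1-regularized logistic regression. A minimal sketch of that idea using scikit-learn; the penalty scaling, threshold, and OR-symmetrization below are illustrative choices, not the exact procedure analyzed in the theorem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, lam=0.1, tol=1e-3):
    """Neighborhood-based graph selection for {-1,+1}-valued data X of shape (n, p).

    For each node s, regress x_s on the remaining variables with an
    l1-penalized logistic regression; nonzero coefficients are declared
    neighbors. Edges are kept if either endpoint selects the other
    (an "OR" rule; an "AND" rule is the other common choice).
    """
    n, p = X.shape
    edges = np.zeros((p, p), dtype=bool)
    for s in range(p):
        others = [t for t in range(p) if t != s]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (lam * n))
        clf.fit(X[:, others], X[:, s])
        coef = clf.coef_.ravel()
        for j, t in enumerate(others):
            if abs(coef[j]) > tol:
                edges[s, t] = True
    return edges | edges.T   # OR rule: symmetrize the neighborhood estimates

# Usage: edges = select_graph(X)   # X from the Ising sketch above
```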

Question: How to prove lower bounds on graph selection methods?

Answer: Graph selection is an unorthodox channel coding problem.

codewords/codebook: a graph G in some graph class G

channel use: draw a sample X_{i•} = (X_{i1},...,X_{ip}) ∈ {−1,+1}^p from the graph-structured distribution Q_G

decoding problem: use the n samples {X_{1•},...,X_{n•}} to correctly distinguish the "codeword" G

[Diagram: G → Q(X | G) → X_{1•}, X_{2•},...,X_{n•} → Ĝ]

Proof sketch: Main ideas for necessary conditions

based on assessing the difficulty of graph selection over various sub-ensembles G ⊆ G_{p,d}

choose G ∈ G uniformly at random, and consider the multi-way hypothesis testing problem based on the data X_1^n = {X_{1•},...,X_{n•}}

for any graph estimator ψ: X^n → G, Fano's inequality implies that

Q[ψ(X_1^n) ≠ G] ≥ 1 − I(X_1^n; G)/log |G| − o(1),

where I(X_1^n; G) is the mutual information between the observations X_1^n and the randomly chosen graph G

remaining steps:

1 Construct "difficult" sub-ensembles G ⊆ G_{p,d}

2 Compute or lower bound the log cardinality log |G|.

3 Upper bound the mutual information I(X_1^n; G).
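One standard way to carry out step 3 (a generic bound, not necessarily the exact argument of the cited papers): since G is uniform over G and the samples are i.i.d. given G, convexity of the KL divergence gives

I(X_1^n; G) ≤ (1/|G|²) Σ_{j,k} D(Q^n(· | G_j) ‖ Q^n(· | G_k)) = (n/|G|²) Σ_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)) ≤ n · max_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)).

Combined with Fano's inequality, reliable recovery then forces n ≳ log |G| / max_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)), which is why ensembles with large cardinality but small pairwise KL divergences are "hard".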

A "hard" d-clique ensemble

[Figures: base graph G; graph G_{uv}; graph G_{st}]

1 Divide the vertex set V into ⌊p/(d+1)⌋ groups of size d+1, and form the base graph G by making a (d+1)-clique within each group; let C denote the set of clique edges.

2 Form the graph G_{uv} by deleting edge (u,v) from G.

3 Consider the testing problem over the family of graph-structured distributions {Q(· ; G_{st}), (s,t) ∈ C}.

Why is this ensemble "hard"? The Kullback-Leibler divergence between pairs decays exponentially in d, unless the minimum edge weight decays as 1/d.

Vignette B: Sparse principal components analysis

Principal component analysis: a widely-used method for dimensionality reduction, etc.
◮ extracts top eigenvectors of the sample covariance matrix Σ̂
◮ classical PCA in p dimensions: inconsistent unless p/n → 0
◮ low-dimensional structure: many applications lead to sparse eigenvectors

Population model: Rank-one spiked covariance matrix

Σ = ν θ*(θ*)^T + I_p, where ν is the SNR parameter and θ*(θ*)^T is the rank-one spike

Sampling model: Draw n i.i.d. zero-mean vectors with cov(Xi) = Σ, and form sample covariance matrix

Σ̂ := (1/n) Σ_{i=1}^{n} X_i X_i^T   (p-dim. matrix, rank min{n,p})

Example: PCA for face compression/recognition

[Figure: first 25 standard principal components (estimated from data)]

[Figure: first 25 sparse principal components (estimated from data)]

Perhaps the simplest estimator.... Diagonal thresholding (Johnstone & Lu, 2008)

[Plot: diagonal entries of the sample covariance (diagonal value versus index)]

Given n i.i.d. samples X_i with zero mean and spiked covariance Σ = νθ*(θ*)^T + I:

1 Compute the diagonal entries of the sample covariance: Σ̂_jj = (1/n) Σ_{i=1}^{n} X_ij².

2 Apply a threshold to the vector {Σ̂_jj, j = 1,...,p}.
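A minimal sketch of this two-step estimator; the data generation and the threshold level are illustrative choices, not the calibrated procedure of Johnstone & Lu:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative spiked-covariance data: Sigma = nu * theta theta^T + I, with k-sparse theta
p, n, k, nu = 200, 500, 10, 10.0
theta = np.zeros(p)
theta[:k] = 1.0 / np.sqrt(k)          # k-sparse, unit-norm spike
Sigma = nu * np.outer(theta, theta) + np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Step 1: diagonal entries of the sample covariance
diag = (X ** 2).mean(axis=0)

# Step 2: threshold the diagonal (the cutoff below is a heuristic choice)
tau = 1.0 + 3.0 * np.sqrt(np.log(p) / n)
support_estimate = np.flatnonzero(diag > tau)
print(sorted(support_estimate))        # ideally the first k coordinates
```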

Diagonal thresholding: unrescaled plots

[Plot: probability of success versus number of observations, diagonal thresholding with k = O(log p); p = 100, 200, 300, 600, 1200]

Diagonal thresholding: rescaled plots

[Plot: probability of success versus control parameter, diagonal thresholding with k = O(log p); p = 100, 200, 300, 600, 1200]

Scaling is quadratic in sparsity: n ≍ k² log p.

Diagonal thresholding and fundamental limit

Consider the spiked covariance matrix

Σ = ν θ*(θ*)^T + I_{p×p}, where θ* ∈ B_0(k) ∩ B_2(1)   (ν: signal-to-noise; θ*(θ*)^T: rank-one spike; θ*: k-sparse and unit norm)

Theorem (Amini & W., 2009)

(a) There are thresholds τ_ℓ^DT ≤ τ_u^DT such that diagonal thresholding fails w.h.p. when n/(k² log p) ≤ τ_ℓ^DT, and succeeds w.h.p. when n/(k² log p) ≥ τ_u^DT.

(b) For the optimal method (exhaustive search), there is a threshold τ^ES such that it fails w.h.p. when n/(k log p) < τ^ES, and succeeds w.h.p. when n/(k log p) > τ^ES.

One polynomial-time SDP relaxation

Recall the Courant-Fischer variational formulation:

θ* = arg max_{‖z‖_2 = 1} { z^T ( νθ*(θ*)^T + I_{p×p} ) z },   where the matrix in parentheses is the population covariance Σ.

Equivalently, lifting to matrix variables Z = zz^T:

θ*(θ*)^T = arg max_{Z ∈ R^{p×p}, Z = Z^T, Z ⪰ 0} trace(Z^T Σ)   s.t. trace(Z) = 1 and rank(Z) = 1

Dropping rank constraint yields a standard SDP relaxation:

Ẑ = arg max_{Z ∈ R^{p×p}, Z = Z^T, Z ⪰ 0} trace(Z^T Σ̂)   s.t. trace(Z) = 1.

In practice (d'Aspremont et al., 2008):
◮ apply this relaxation using the sample covariance matrix Σ̂
◮ add the ℓ1-constraint Σ_{i,j=1}^{p} |Z_ij| ≤ k.
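A minimal sketch of this relaxation using cvxpy; the solver defaults and the rank-one rounding step are my own illustrative choices rather than the exact implementation of d'Aspremont et al.:

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    """SDP relaxation for sparse PCA: maximize trace(Z Sigma_hat)
    over symmetric PSD Z with trace(Z) = 1 and sum_ij |Z_ij| <= k."""
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0, cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k]
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ Z)), constraints)
    problem.solve()
    # Rank-one rounding: return the top eigenvector of the optimal Z
    eigvals, eigvecs = np.linalg.eigh(Z.value)
    return eigvecs[:, -1]

# Usage (small p recommended; X, k as in the diagonal-thresholding sketch above):
# theta_hat = sparse_pca_sdp(np.cov(X, rowvar=False), k)
```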

Phase transition for SDP: logarithmic sparsity

[Plot: probability of success versus control parameter, SDP relaxation with k = O(log p); p = 100, 200, 300]

Scaling is linear in sparsity: n ≍ k log p.

A natural question

Questions: Can the logarithmic sparsity or rank-one conditions be removed?

Computational lower bound for sparse PCA

Consider the spiked covariance matrix

Σ = ν (θ*)(θ*)^T + I_{p×p}, where θ* ∈ B_0(k) ∩ B_2(1)   (ν: signal-to-noise; (θ*)(θ*)^T: rank-one spike; θ*: k-sparse and unit norm)

Sparse PCA detection problem:
H_0 (no signal): X_i ∼ D(0, I_{p×p})
H_1 (spiked signal): X_i ∼ D(0, Σ)

where D is a distribution with sub-exponential tail behavior.

Theorem (Berthet & Rigollet, 2013)

Under average-case hardness of planted clique, the polynomial-minimax level of detection ν^POLY is given by

ν^POLY ≍ (k² log p)/n, for all sparsity log p ≪ k ≪ √p,

whereas the classical minimax level is ν^OPT ≍ (k log p)/n.

Planted k-clique problem

[Figures: an Erdős–Rényi graph and a graph with a planted k-clique]

Binary hypothesis test based on observing a random graph G on p vertices:

H_0: Erdős–Rényi, each edge present independently with prob. 1/2
H_1: planted k-clique, remaining edges random

Planted k-clique problem

[Figures: matrix with random entries; matrix with a planted k × k sub-matrix]

Binary hypothesis test based on observing a random binary matrix:

H_0: random {0,1} matrix with Ber(1/2) entries on the off-diagonal
H_1: planted k × k sub-matrix

Vignette C: (Structured) non-parametric regression

Goal: how to predict an output from covariates?
◮ given covariates (x_1, x_2, x_3,...,x_p) and an output variable y
◮ want to predict y based on (x_1,...,x_p)

Examples: medical imaging; geostatistics; astronomy; ...

[Plots of function value versus design value x: (a) second-order polynomial fit (2nd-order polynomial kernel); (b) Lipschitz function fit (1st-order Sobolev)]

High dimensions and sample complexity

Possible models:
◮ ordinary linear regression: y = Σ_{j=1}^{p} θ_j* x_j + w = ⟨θ*, x⟩ + w

◮ general non-parametric model: y = f*(x_1, x_2,...,x_p) + w.

Estimation accuracy: how well can f* be estimated using n samples?

linear models
◮ without any structure: error δ² ≍ p/n   (linear in p)
◮ with sparsity k ≪ p: error δ² ≍ k log(ep/k)/n   (logarithmic in p)

non-parametric models: p-dimensional, smoothness α
◮ curse of dimensionality: error δ² ≍ (1/n)^{2α/(2α+p)}   (exponential slow-down in p)

Minimax risk and sample size

Consider a function class F, and n i.i.d. samples from the model y_i = f*(x_i) + w_i, where f* is some member of F.

For a given estimator {(x_i, y_i)}_{i=1}^n ↦ f̂ ∈ F, the worst-case risk in a metric ρ is

R^n_worst(f̂; F) = sup_{f*∈F} E^n[ρ²(f̂, f*)].

Minimax risk: for a given sample size n, the minimax risk is

inf_{f̂} R^n_worst(f̂; F) = inf_{f̂} sup_{f*∈F} E^n[ρ²(f̂, f*)],

where the infimum is taken over all estimators f̂.

How to measure the "size" of function classes?

A 2δ-packing of F in the metric ρ is a collection {f¹,...,f^M} ⊂ F such that ρ(f^j, f^k) > 2δ for all j ≠ k. The packing number M(2δ) is the cardinality of the largest such set.

Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

A central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

Packing and covering

[Figure: a 2δ-packing {f¹, f², f³,...,f^M} of F]

Example: Sup-norm packing for Lipschitz functions

[Figure: construction of a packing of L-Lipschitz functions on an interval]

δ-packing set: functions {f¹, f²,...,f^M} such that ‖f^j − f^k‖_2 > δ for all j ≠ k

For L-Lipschitz functions in 1 dimension: M(δ) ≍ 2^{(L/δ)}.
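As a numerical illustration (my own construction, not from the slides): assign a ±"tent" of height h to each of m ≍ L/δ cells; distinct sign patterns give L-Lipschitz functions at pairwise distance ≥ 2h, yielding 2^{Θ(L/δ)} functions, consistent with the scaling above.

```python
import itertools
import numpy as np

# Illustrative sketch: build 2^m L-Lipschitz functions on [0, 1] whose pairwise
# sup-norm distances exceed delta, by placing a +/- tent of height h on each of
# m cells. With h = L/(2m), slopes are +/-L, and functions with different sign
# patterns differ by 2h at some tent peak.
L, delta = 1.0, 0.05
m = int(L / (4 * delta))            # number of cells; gives 2^m = 2^(L/(4 delta)) functions
h = L / (2 * m)                     # tent height so that slopes are +/- L
grid = np.linspace(0, 1, 2001)

def tent_function(signs):
    """Piecewise-linear function: signs[i] * (tent of height h) on cell i."""
    cell = np.minimum((grid * m).astype(int), m - 1)
    local = grid * m - cell                      # position within the cell, in [0, 1]
    tent = h * (1 - np.abs(2 * local - 1))       # 0 at cell edges, h at the center
    return np.asarray(signs)[cell] * tent

signs_list = list(itertools.product([-1, 1], repeat=m))
funcs = [tent_function(s) for s in signs_list]
print(f"built {len(funcs)} = 2^{m} functions")

# Distinct sign patterns differ by 2h at some peak, so the packing is delta-separated
min_sep = min(np.max(np.abs(f - g)) for f, g in itertools.combinations(funcs, 2))
print("minimum pairwise sup-norm distance:", min_sep, ">= delta?", min_sep >= delta)
```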

Standard reduction: from estimation to testing

goal: to characterize the minimax risk for ρ-estimation over F

construct a 2δ-packing of F: a collection {f¹,...,f^M} such that ρ(f^j, f^k) > 2δ

now form an M-component mixture distribution as follows:
◮ draw a packing index V ∈ {1,...,M} uniformly at random
◮ conditioned on V = j, draw n i.i.d. samples (X_i, Y_i) ∼ Q_{f^j}

[Figure: packing elements f¹, f², f³,...,f^M, pairwise separated by 2δ]

1 Claim: any estimator f̂ such that ρ(f̂, f^V) ≤ δ w.h.p. can be used to solve the M-ary testing problem.

2 Use standard techniques (Assouad, Le Cam, Fano) to lower bound the probability of error in the testing problem.

Minimax rate via metric entropy matching

observe (x_i, y_i) pairs from the model y_i = f*(x_i) + w_i

two possible norms:

‖f̂ − f*‖_n² := (1/n) Σ_{i=1}^{n} (f̂(x_i) − f*(x_i))²,   or   ‖f̃ − f*‖_2² = E[(f̃(X) − f*(X))²].

Metric entropy master equation

For many regression problems, the minimax rate δ_n > 0 is determined by solving the master equation

log M(2δ; F, ‖·‖) ≍ nδ².

◮ basic idea (with the Hellinger metric) dates back to Le Cam (1973)
◮ elegant general version due to Yang and Barron (1999)

Example 1: Sparse linear regression

Observations y_i = ⟨x_i, θ*⟩ + w_i, where

θ* ∈ Θ(k,p) := { θ ∈ R^p | ‖θ‖_0 ≤ k, and ‖θ‖_2 ≤ 1 }.

Gilbert-Varshamov: can construct a 2δ-separated set with log M(2δ) ≍ k log(ep/k) elements

Master equation and minimax rate

log M(2δ) ≍ nδ²  ⟺  δ² ≍ k log(ep/k)/n.

Polynomial-time achievability:

◮ by ℓ1-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011); a sketch follows below
◮ achieve minimax-optimal rates for ℓ2-error (Raskutti, W., & Yu, 2011)
◮ ℓ1-methods can be sub-optimal for prediction error ‖X(θ̂ − θ*)‖_2/√n (Zhang, W. & Jordan, 2014)
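A minimal sketch of the ℓ1-relaxation (the Lasso) on synthetic sparse data, using scikit-learn; the regularization level uses the usual √(log p / n) scaling with a hand-picked constant:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, k = 200, 1000, 10

theta_star = np.zeros(p)
theta_star[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # k-sparse, unit norm
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Lasso with lambda ~ sigma * sqrt(log p / n) (constant chosen by hand here)
lam = 0.1 * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

print("l2 estimation error:", np.linalg.norm(theta_hat - theta_star))
print("nonzeros in theta_hat:", np.count_nonzero(theta_hat), "(true k =", k, ")")
```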

Example 2: α-smooth non-parametric regression

Observations y_i = f*(x_i) + w_i, where

f* ∈ F(α) = { f: [0,1] → R | f is α-times diff'ble, with Σ_{j=0}^{α} ‖f^{(j)}‖_∞ ≤ C }.

Classical results (e.g., Kolmogorov & Tikhomirov, 1959):

log M(2δ; F) ≍ (1/δ)^{1/α}

Master equation and minimax rate

log M(2δ) ≍ nδ²  ⟺  δ² ≍ (1/n)^{2α/(2α+1)}.
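A quick numerical check (my own sketch): solve the master equation (1/δ)^{1/α} = nδ² for δ and compare with the predicted rate n^{−α/(2α+1)}:

```python
import numpy as np
from scipy.optimize import brentq

def minimax_rate(n, alpha):
    """Solve the master equation (1/delta)^(1/alpha) = n * delta^2 for delta."""
    f = lambda delta: (1.0 / delta) ** (1.0 / alpha) - n * delta ** 2
    return brentq(f, 1e-8, 1e3)   # root is bracketed: f > 0 near 0, f < 0 for large delta

alpha = 2
for n in [10**2, 10**4, 10**6]:
    delta = minimax_rate(n, alpha)
    print(n, delta, n ** (-alpha / (2 * alpha + 1)))   # matches n^(-alpha/(2*alpha+1))
```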

Example 3: Sparse additive and α-smooth

Observations y_i = f*(x_i) + w_i, where f* satisfies the constraints:
◮ Additively decomposable: f*(x_1,...,x_p) = Σ_{j=1}^{p} g_j*(x_j)
◮ Sparse: at most k of (g_1*,...,g_p*) are non-zero
◮ Smooth: each g_j* belongs to the α-smooth family F(α)

Combining previous results yields a 2δ-packing with

log M(2δ; F) ≍ k (1/δ)^{1/α} + k log(ep/k)

Master equation and minimax rate

Solving log M(2δ) ≍ nδ² yields

δ² ≍ k (1/n)^{2α/(2α+1)} + k log(ep/k)/n,

where the first term reflects k-component estimation and the second the search complexity.

Summary

To be provided during tutorial....