
Tutorial Part I: Information theory meets machine learning

Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)

June 2015

Introduction

Era of massive data sets: fascinating problems at the interfaces between information theory and statistical machine learning.

1 Fundamental issues
◮ Concentration of measure: high-dimensional problems are remarkably predictable
◮ Curse of dimensionality: without structure, many problems are hopeless
◮ Low-dimensional structure is essential

2 Machine learning brings in algorithmic components
◮ Computational constraints are central
◮ Memory constraints: need for distributed and decentralized procedures
◮ Increasing importance of privacy

Historical perspective: information theory and statistics

[Photo: Claude Shannon]

A rich intersection between information theory and statistics

1 Hypothesis testing, large deviations
2 Kullback-Leibler divergence
3 Fano's inequality

Statistical estimation as channel coding

Codebook: an indexed family of distributions {Q_θ | θ ∈ Θ}
Codeword: nature chooses some θ* ∈ Θ

[Diagram: θ* ∈ Θ → Q_{θ*} → X_1^n → θ̂]

Channel: user observes n i.i.d. draws X_i ∼ Q_{θ*}
Decoding: estimator X_1^n ↦ θ̂ such that θ̂ ≈ θ*

Perspective dating back to Kolmogorov (1950s), with many variations:
◮ codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
◮ channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
◮ closeness θ̂ ≈ θ*: exact/partial graph recovery in Hamming, ℓ_p-, L²(Q)-, sup-norm distances, etc.

Machine learning: algorithmic issues to the forefront!

1 Efficient algorithms are essential
◮ only (low-order) polynomial-time methods can ever be implemented
◮ trade-offs between computational cost and statistical performance?
◮ fundamental barriers due to computational complexity?

2 Distributed procedures are often needed
◮ many modern data sets: too large to be stored on a single machine
◮ need methods that operate separately on pieces of data
◮ trade-offs between decentralization and performance?
◮ fundamental barriers due to complexity?

3 Privacy and access issues
◮ conflicts between individual privacy and benefits of aggregation?
◮ principled information-theoretic formulations of such trade-offs?

Part I of tutorial: Three vignettes

1 Graphical model selection

2 Sparse principal component analysis

3 Structured non-parametric regression and minimax theory

Vignette A: Graphical model selection

Simple motivating example: epidemic modeling

Disease status of person s: x_s = +1 if individual s is infected, x_s = −1 if individual s is healthy.

(1) Independent infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s)

(2) Cycle-based infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{(s,t)∈C} exp(θ_st x_s x_t)

(3) Full clique infection

Q(x_1,...,x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{s≠t} exp(θ_st x_s x_t)

[Figures: possible epidemic patterns (on a square) and the underlying graphs]

Model selection for graphs: draw n i.i.d. samples from

Q(x_1,...,x_p; Θ) ∝ exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )

graph G and matrix [Θ]_st = θ_st of edge weights are unknown

data matrix X_1^n ∈ {−1,+1}^{n×p}

want estimator X_1^n ↦ Ĝ to minimize the error probability Q^n[Ĝ(X_1^n) ≠ G], the probability that the estimated graph differs from the truth
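To make the model class concrete, here is a minimal sketch (my own illustration, not part of the slides) that enumerates such a distribution on p = 5 binary variables and draws i.i.d. samples from it; the particular edge weights are arbitrary choices:

```python
import itertools
import numpy as np

# Minimal illustrative sketch: enumerate the pairwise Ising distribution
# Q(x; Theta) ∝ exp( Σ_s θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
# for a small graph, then draw i.i.d. samples from it.

def ising_distribution(theta, Theta):
    """theta: (p,) node weights; Theta: (p,p) symmetric edge weights, zero diagonal."""
    p = len(theta)
    configs = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
    # x^T Theta x double-counts each edge (s,t), hence the factor 1/2
    log_w = configs @ theta + 0.5 * np.sum((configs @ Theta) * configs, axis=1)
    w = np.exp(log_w - log_w.max())
    return configs, w / w.sum()

rng = np.random.default_rng(0)
p = 5
theta = 0.1 * np.ones(p)
Theta = np.zeros((p, p))
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:   # "cycle-based infection" graph
    Theta[s, t] = Theta[t, s] = 0.5

configs, probs = ising_distribution(theta, Theta)
X = configs[rng.choice(len(configs), size=1000, p=probs)]   # data matrix X_1^n, shape (n, p)
print(X.shape)
```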

Channel decoding: think of graphs as codewords, and the graph family as a codebook.

Past/ongoing work on graph selection:
◮ exact polynomial-time solution on trees (Chow & Liu, 1967)
◮ testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
◮ pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
◮ pseudolikelihood and ℓ1-regularized neighborhood regression: Gaussian case (Meinshausen & Buhlmann, 2006); binary case (Ravikumar, W. & Lafferty, 2006)
◮ various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
◮ lower bounds and inachievability results: information-theoretic bounds (Santhanam & W., 2012); computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014); phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

[Figure: US Senate network (2004–2006 voting)]

Experiments for sequences of star graphs

[Figures: star graphs with p = 9, d = 3 and p = 18, d = 6]

Empirical behavior: Unrescaled plots

[Plot: star graph with a linear fraction of neighbors; p = 64, 100, 225]

Plots of success probability versus raw sample size n.

Empirical behavior: Appropriately rescaled

[Plot: star graph with a linear fraction of neighbors; p = 64, 100, 225]

Plots of success probability versus rescaled sample size n/(d² log p).

Some theory: Scaling law for graph selection

graphs G_{p,d} with p nodes and maximum degree d

minimum absolute weight θ_min on edges

How many samples n are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

Achievable result: under regularity conditions, the graph estimate Ĝ produced by ℓ1-regularized logistic regression satisfies

n > c_u (d² + 1/θ_min²) log p   (lower bound on sample size)   ⟹   Q[Ĝ ≠ G] → 0   (vanishing probability of error).

Necessary condition: for a graph estimate Ĝ produced by any method, a sample size of the same order (up to constants) is required; below it, Q[Ĝ ≠ G] remains bounded away from zero.
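The achievable result refers to neighborhood-based ℓ1-regularized logistic regression. A minimal sketch of that idea using scikit-learn; the penalty scaling, threshold, and OR-symmetrization below are illustrative choices, not the exact procedure analyzed in the theorem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, lam=0.1, tol=1e-3):
    """Neighborhood-based graph selection for {-1,+1}-valued data X of shape (n, p).

    For each node s, regress x_s on the remaining variables with an
    l1-penalized logistic regression; nonzero coefficients are declared
    neighbors. Edges are kept if either endpoint selects the other
    (an "OR" rule; an "AND" rule is the other common choice).
    """
    n, p = X.shape
    edges = np.zeros((p, p), dtype=bool)
    for s in range(p):
        others = [t for t in range(p) if t != s]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (lam * n))
        clf.fit(X[:, others], X[:, s])
        coef = clf.coef_.ravel()
        for j, t in enumerate(others):
            if abs(coef[j]) > tol:
                edges[s, t] = True
    return edges | edges.T   # OR rule: symmetrize the neighborhood estimates

# Usage: edges = select_graph(X)   # X from the Ising sketch above
```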

Question: How to prove lower bounds on graph selection methods?

Answer: Graph selection is an unorthodox channel coding problem.

codewords/codebook: a graph G in some graph class G

channel use: draw a sample X_{i•} = (X_{i1},...,X_{ip}) ∈ {−1,+1}^p from the graph-structured distribution Q_G

decoding problem: use the n samples {X_{1•},...,X_{n•}} to correctly distinguish the "codeword" G

[Diagram: G → Q(X | G) → X_{1•}, X_{2•},...,X_{n•} → Ĝ]

Proof sketch: Main ideas for necessary conditions

based on assessing the difficulty of graph selection over various sub-ensembles G ⊆ G_{p,d}

choose G ∈ G uniformly at random, and consider the multi-way hypothesis testing problem based on the data X_1^n = {X_{1•},...,X_{n•}}

for any graph estimator ψ: X^n → G, Fano's inequality implies that

Q[ψ(X_1^n) ≠ G] ≥ 1 − I(X_1^n; G)/log |G| − o(1),

where I(X_1^n; G) is the mutual information between the observations X_1^n and the randomly chosen graph G

remaining steps:

1 Construct "difficult" sub-ensembles G ⊆ G_{p,d}

2 Compute or lower bound the log cardinality log |G|.

3 Upper bound the mutual information I(X_1^n; G).
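One standard way to carry out step 3 (a generic bound, not necessarily the exact argument of the cited papers): since G is uniform over G and the samples are i.i.d. given G, convexity of the KL divergence gives

I(X_1^n; G) ≤ (1/|G|²) Σ_{j,k} D(Q^n(· | G_j) ‖ Q^n(· | G_k)) = (n/|G|²) Σ_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)) ≤ n · max_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)).

Combined with Fano's inequality, reliable recovery then forces n ≳ log |G| / max_{j,k} D(Q(· | G_j) ‖ Q(· | G_k)), which is why ensembles with large cardinality but small pairwise KL divergences are "hard".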

A "hard" d-clique ensemble

[Figures: base graph G; graph G_{uv}; graph G_{st}]

1 Divide the vertex set V into ⌊p/(d+1)⌋ groups of size d+1, and form the base graph G by making a (d+1)-clique within each group; let C denote the set of clique edges.

2 Form the graph G_{uv} by deleting edge (u,v) from G.

3 Consider the testing problem over the family of graph-structured distributions {Q(· ; G_{st}), (s,t) ∈ C}.

Why is this ensemble "hard"? The Kullback-Leibler divergence between pairs decays exponentially in d, unless the minimum edge weight decays as 1/d.

Vignette B: Sparse principal components analysis

Principal component analysis: a widely-used method for dimensionality reduction, etc.
◮ extracts top eigenvectors of the sample covariance matrix Σ̂
◮ classical PCA in p dimensions: inconsistent unless p/n → 0
◮ low-dimensional structure: many applications lead to sparse eigenvectors

Population model: Rank-one spiked covariance matrix

Σ = ν θ*(θ*)^T + I_p, where ν is the SNR parameter and θ*(θ*)^T is the rank-one spike

Sampling model: Draw n i.i.d. zero-mean vectors with cov(Xi) = Σ, and form sample covariance matrix

Σ̂ := (1/n) Σ_{i=1}^{n} X_i X_i^T   (p-dim. matrix, rank min{n,p})

Example: PCA for face compression/recognition

[Figure: first 25 standard principal components (estimated from data)]

[Figure: first 25 sparse principal components (estimated from data)]

Perhaps the simplest estimator.... Diagonal thresholding (Johnstone & Lu, 2008)

[Plot: diagonal entries of the sample covariance (diagonal value versus index)]

Given n i.i.d. samples X_i with zero mean and spiked covariance Σ = νθ*(θ*)^T + I:

1 Compute the diagonal entries of the sample covariance: Σ̂_jj = (1/n) Σ_{i=1}^{n} X_ij².

2 Apply a threshold to the vector {Σ̂_jj, j = 1,...,p}.
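A minimal sketch of this two-step estimator; the data generation and the threshold level are illustrative choices, not the calibrated procedure of Johnstone & Lu:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative spiked-covariance data: Sigma = nu * theta theta^T + I, with k-sparse theta
p, n, k, nu = 200, 500, 10, 10.0
theta = np.zeros(p)
theta[:k] = 1.0 / np.sqrt(k)          # k-sparse, unit-norm spike
Sigma = nu * np.outer(theta, theta) + np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Step 1: diagonal entries of the sample covariance
diag = (X ** 2).mean(axis=0)

# Step 2: threshold the diagonal (the cutoff below is a heuristic choice)
tau = 1.0 + 3.0 * np.sqrt(np.log(p) / n)
support_estimate = np.flatnonzero(diag > tau)
print(sorted(support_estimate))        # ideally the first k coordinates
```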

Diagonal thresholding: unrescaled plots

[Plot: probability of success versus number of observations, diagonal thresholding with k = O(log p); p = 100, 200, 300, 600, 1200]

Diagonal thresholding: rescaled plots

[Plot: probability of success versus control parameter, diagonal thresholding with k = O(log p); p = 100, 200, 300, 600, 1200]

Scaling is quadratic in sparsity: n ≍ k² log p.

Diagonal thresholding and fundamental limit

Consider the spiked covariance matrix

Σ = ν θ*(θ*)^T + I_{p×p}, where θ* ∈ B_0(k) ∩ B_2(1)   (ν: signal-to-noise; θ*(θ*)^T: rank-one spike; θ*: k-sparse and unit norm)

Theorem (Amini & W., 2009)

(a) There are thresholds τ_ℓ^DT ≤ τ_u^DT such that diagonal thresholding fails w.h.p. when n/(k² log p) ≤ τ_ℓ^DT, and succeeds w.h.p. when n/(k² log p) ≥ τ_u^DT.

(b) For the optimal method (exhaustive search), there is a threshold τ^ES such that it fails w.h.p. when n/(k log p) < τ^ES, and succeeds w.h.p. when n/(k log p) > τ^ES.

One polynomial-time SDP relaxation

Recall the Courant-Fischer variational formulation:

θ* = arg max_{‖z‖_2 = 1} { z^T ( νθ*(θ*)^T + I_{p×p} ) z },   where the matrix in parentheses is the population covariance Σ.

Equivalently, lifting to matrix variables Z = zz^T:

θ*(θ*)^T = arg max_{Z ∈ R^{p×p}, Z = Z^T, Z ⪰ 0} trace(Z^T Σ)   s.t. trace(Z) = 1 and rank(Z) = 1

Dropping rank constraint yields a standard SDP relaxation:

Ẑ = arg max_{Z ∈ R^{p×p}, Z = Z^T, Z ⪰ 0} trace(Z^T Σ̂)   s.t. trace(Z) = 1.

In practice (d'Aspremont et al., 2008):
◮ apply this relaxation using the sample covariance matrix Σ̂
◮ add the ℓ1-constraint Σ_{i,j=1}^{p} |Z_ij| ≤ k.
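A minimal sketch of this relaxation using cvxpy; the solver defaults and the rank-one rounding step are my own illustrative choices rather than the exact implementation of d'Aspremont et al.:

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    """SDP relaxation for sparse PCA: maximize trace(Z Sigma_hat)
    over symmetric PSD Z with trace(Z) = 1 and sum_ij |Z_ij| <= k."""
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0, cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k]
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ Z)), constraints)
    problem.solve()
    # Rank-one rounding: return the top eigenvector of the optimal Z
    eigvals, eigvecs = np.linalg.eigh(Z.value)
    return eigvecs[:, -1]

# Usage (small p recommended; X, k as in the diagonal-thresholding sketch above):
# theta_hat = sparse_pca_sdp(np.cov(X, rowvar=False), k)
```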

Phase transition for SDP: logarithmic sparsity

[Plot: probability of success versus control parameter, SDP relaxation with k = O(log p); p = 100, 200, 300]

Scaling is linear in sparsity: n ≍ k log p.

A natural question

Questions: Can the logarithmic sparsity or rank-one conditions be removed?

Computational lower bound for sparse PCA

Consider the spiked covariance matrix

Σ = ν (θ*)(θ*)^T + I_{p×p}, where θ* ∈ B_0(k) ∩ B_2(1)   (ν: signal-to-noise; (θ*)(θ*)^T: rank-one spike; θ*: k-sparse and unit norm)

Sparse PCA detection problem:
H_0 (no signal): X_i ∼ D(0, I_{p×p})
H_1 (spiked signal): X_i ∼ D(0, Σ)

where D is a distribution with sub-exponential tail behavior.

Theorem (Berthet & Rigollet, 2013)

Under average-case hardness of planted clique, the polynomial-minimax level of detection ν^POLY is given by

ν^POLY ≍ (k² log p)/n, for all sparsity log p ≪ k ≪ √p,

whereas the classical minimax level is ν^OPT ≍ (k log p)/n.

Planted k-clique problem

[Figures: an Erdős–Rényi graph and a graph with a planted k-clique]

Binary hypothesis test based on observing a random graph G on p vertices:

H_0: Erdős–Rényi, each edge present independently with prob. 1/2
H_1: planted k-clique, remaining edges random

Planted k-clique problem

[Figures: matrix with random entries; matrix with a planted k × k sub-matrix]

Binary hypothesis test based on observing a random binary matrix:

H_0: random {0,1} matrix with Ber(1/2) entries on the off-diagonal
H_1: planted k × k sub-matrix

Vignette C: (Structured) non-parametric regression

Goal: how to predict an output from covariates?
◮ given covariates (x_1, x_2, x_3,...,x_p) and an output variable y
◮ want to predict y based on (x_1,...,x_p)

Examples: medical imaging; geostatistics; astronomy; ...

[Plots of function value versus design value x: (a) second-order polynomial fit (2nd-order polynomial kernel); (b) Lipschitz function fit (1st-order Sobolev)]

High dimensions and sample complexity

Possible models:
◮ ordinary linear regression: y = Σ_{j=1}^{p} θ_j* x_j + w = ⟨θ*, x⟩ + w

◮ general non-parametric model: y = f*(x_1, x_2,...,x_p) + w.

Estimation accuracy: how well can f* be estimated using n samples?

linear models
◮ without any structure: error δ² ≍ p/n   (linear in p)
◮ with sparsity k ≪ p: error δ² ≍ k log(ep/k)/n   (logarithmic in p)

non-parametric models: p-dimensional, smoothness α
◮ curse of dimensionality: error δ² ≍ (1/n)^{2α/(2α+p)}   (exponential slow-down in p)

Minimax risk and sample size

Consider a function class F, and n i.i.d. samples from the model y_i = f*(x_i) + w_i, where f* is some member of F.

For a given estimator {(x_i, y_i)}_{i=1}^n ↦ f̂ ∈ F, the worst-case risk in a metric ρ is

R^n_worst(f̂; F) = sup_{f*∈F} E^n[ρ²(f̂, f*)].

Minimax risk: for a given sample size n, the minimax risk is

inf_{f̂} R^n_worst(f̂; F) = inf_{f̂} sup_{f*∈F} E^n[ρ²(f̂, f*)],

where the infimum is taken over all estimators f̂.

How to measure the "size" of function classes?

A 2δ-packing of F in the metric ρ is a collection {f¹,...,f^M} ⊂ F such that ρ(f^j, f^k) > 2δ for all j ≠ k. The packing number M(2δ) is the cardinality of the largest such set.

Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

A central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

Packing and covering

[Figure: a 2δ-packing {f¹, f², f³,...,f^M} of F]

Example: Sup-norm packing for Lipschitz functions

[Figure: construction of a packing of L-Lipschitz functions on an interval]

δ-packing set: functions {f¹, f²,...,f^M} such that ‖f^j − f^k‖_2 > δ for all j ≠ k

For L-Lipschitz functions in 1 dimension: M(δ) ≍ 2^{(L/δ)}.
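As a numerical illustration (my own construction, not from the slides): assign a ±"tent" of height h to each of m ≍ L/δ cells; distinct sign patterns give L-Lipschitz functions at pairwise distance ≥ 2h, yielding 2^{Θ(L/δ)} functions, consistent with the scaling above.

```python
import itertools
import numpy as np

# Illustrative sketch: build 2^m L-Lipschitz functions on [0, 1] whose pairwise
# sup-norm distances exceed delta, by placing a +/- tent of height h on each of
# m cells. With h = L/(2m), slopes are +/-L, and functions with different sign
# patterns differ by 2h at some tent peak.
L, delta = 1.0, 0.05
m = int(L / (4 * delta))            # number of cells; gives 2^m = 2^(L/(4 delta)) functions
h = L / (2 * m)                     # tent height so that slopes are +/- L
grid = np.linspace(0, 1, 2001)

def tent_function(signs):
    """Piecewise-linear function: signs[i] * (tent of height h) on cell i."""
    cell = np.minimum((grid * m).astype(int), m - 1)
    local = grid * m - cell                      # position within the cell, in [0, 1]
    tent = h * (1 - np.abs(2 * local - 1))       # 0 at cell edges, h at the center
    return np.asarray(signs)[cell] * tent

signs_list = list(itertools.product([-1, 1], repeat=m))
funcs = [tent_function(s) for s in signs_list]
print(f"built {len(funcs)} = 2^{m} functions")

# Distinct sign patterns differ by 2h at some peak, so the packing is delta-separated
min_sep = min(np.max(np.abs(f - g)) for f, g in itertools.combinations(funcs, 2))
print("minimum pairwise sup-norm distance:", min_sep, ">= delta?", min_sep >= delta)
```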

Standard reduction: from estimation to testing

goal: to characterize the minimax risk for ρ-estimation over F

construct a 2δ-packing of F: a collection {f¹,...,f^M} such that ρ(f^j, f^k) > 2δ

now form an M-component mixture distribution as follows:
◮ draw a packing index V ∈ {1,...,M} uniformly at random
◮ conditioned on V = j, draw n i.i.d. samples (X_i, Y_i) ∼ Q_{f^j}

[Figure: packing elements f¹, f², f³,...,f^M, pairwise separated by 2δ]

1 Claim: any estimator f̂ such that ρ(f̂, f^V) ≤ δ w.h.p. can be used to solve the M-ary testing problem.

2 Use standard techniques (Assouad, Le Cam, Fano) to lower bound the probability of error in the testing problem.

Minimax rate via metric entropy matching

observe (x_i, y_i) pairs from the model y_i = f*(x_i) + w_i

two possible norms:

‖f̂ − f*‖_n² := (1/n) Σ_{i=1}^{n} (f̂(x_i) − f*(x_i))²,   or   ‖f̃ − f*‖_2² = E[(f̃(X) − f*(X))²].

Metric entropy master equation

For many regression problems, the minimax rate δ_n > 0 is determined by solving the master equation

log M(2δ; F, ‖·‖) ≍ nδ².

◮ basic idea (with the Hellinger metric) dates back to Le Cam (1973)
◮ elegant general version due to Yang and Barron (1999)

Example 1: Sparse linear regression

Observations y_i = ⟨x_i, θ*⟩ + w_i, where

θ* ∈ Θ(k,p) := { θ ∈ R^p | ‖θ‖_0 ≤ k, and ‖θ‖_2 ≤ 1 }.

Gilbert-Varshamov: can construct a 2δ-separated set with log M(2δ) ≍ k log(ep/k) elements

Master equation and minimax rate

log M(2δ) ≍ nδ²  ⟺  δ² ≍ k log(ep/k)/n.

Polynomial-time achievability:

◮ by ℓ1-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011); a sketch follows below
◮ achieve minimax-optimal rates for ℓ2-error (Raskutti, W., & Yu, 2011)
◮ ℓ1-methods can be sub-optimal for prediction error ‖X(θ̂ − θ*)‖_2/√n (Zhang, W. & Jordan, 2014)
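A minimal sketch of the ℓ1-relaxation (the Lasso) on synthetic sparse data, using scikit-learn; the regularization level uses the usual √(log p / n) scaling with a hand-picked constant:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, k = 200, 1000, 10

theta_star = np.zeros(p)
theta_star[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # k-sparse, unit norm
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Lasso with lambda ~ sigma * sqrt(log p / n) (constant chosen by hand here)
lam = 0.1 * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

print("l2 estimation error:", np.linalg.norm(theta_hat - theta_star))
print("nonzeros in theta_hat:", np.count_nonzero(theta_hat), "(true k =", k, ")")
```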

Example 2: α-smooth non-parametric regression

Observations y_i = f*(x_i) + w_i, where

f* ∈ F(α) = { f: [0,1] → R | f is α-times diff'ble, with Σ_{j=0}^{α} ‖f^{(j)}‖_∞ ≤ C }.

Classical results (e.g., Kolmogorov & Tikhomirov, 1959):

log M(2δ; F) ≍ (1/δ)^{1/α}

Master equation and minimax rate

log M(2δ) ≍ nδ²  ⟺  δ² ≍ (1/n)^{2α/(2α+1)}.
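A quick numerical check (my own sketch): solve the master equation (1/δ)^{1/α} = nδ² for δ and compare with the predicted rate n^{−α/(2α+1)}:

```python
import numpy as np
from scipy.optimize import brentq

def minimax_rate(n, alpha):
    """Solve the master equation (1/delta)^(1/alpha) = n * delta^2 for delta."""
    f = lambda delta: (1.0 / delta) ** (1.0 / alpha) - n * delta ** 2
    return brentq(f, 1e-8, 1e3)   # root is bracketed: f > 0 near 0, f < 0 for large delta

alpha = 2
for n in [10**2, 10**4, 10**6]:
    delta = minimax_rate(n, alpha)
    print(n, delta, n ** (-alpha / (2 * alpha + 1)))   # matches n^(-alpha/(2*alpha+1))
```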

Example 3: Sparse additive and α-smooth

Observations y_i = f*(x_i) + w_i, where f* satisfies the constraints:
◮ Additively decomposable: f*(x_1,...,x_p) = Σ_{j=1}^{p} g_j*(x_j)
◮ Sparse: at most k of (g_1*,...,g_p*) are non-zero
◮ Smooth: each g_j* belongs to the α-smooth family F(α)

Combining previous results yields a 2δ-packing with

log M(2δ; F) ≍ k (1/δ)^{1/α} + k log(ep/k)

Master equation and minimax rate

Solving log M(2δ) ≍ nδ² yields

δ² ≍ k (1/n)^{2α/(2α+1)} + k log(ep/k)/n,

where the first term reflects k-component estimation and the second the search complexity.

Summary

To be provided during tutorial....