
Tutorial Part I: Information theory meets machine learning
Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)
June 2015

Introduction: era of massive data sets
Fascinating problems arise at the interfaces between information theory and statistical machine learning.
1 Fundamental issues
◮ Concentration of measure: high-dimensional problems are remarkably predictable
◮ Curse of dimensionality: without structure, many problems are hopeless
◮ Low-dimensional structure is essential
2 Machine learning brings in algorithmic components
◮ Computational constraints are central
◮ Memory constraints: need for distributed and decentralized procedures
◮ Increasing importance of privacy

Historical perspective: information theory and statistics (Claude Shannon, Andrey Kolmogorov)
A rich intersection between information theory and statistics:
1 Hypothesis testing, large deviations
2 Fisher information, Kullback-Leibler divergence
3 Metric entropy and Fano's inequality

Statistical estimation as channel coding
Codebook: an indexed family of probability distributions {Q_θ | θ ∈ Θ}
Codeword: nature chooses some θ* ∈ Θ
Channel: the user observes n i.i.d. draws X_i ~ Q_{θ*}
Decoding: an estimator X_1^n ↦ θ̂ such that θ̂ ≈ θ*
[Figure: θ* → Q_{θ*} → X_1^n → θ̂]
This perspective dates back to Kolmogorov (1950s), with many variations:
codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
closeness θ̂ ≈ θ*: exact/partial graph recovery in Hamming distance, ℓ_p-distances, L(Q)-distances, sup-norm, etc.

Machine learning: algorithmic issues to the forefront!
1 Efficient algorithms are essential
◮ only (low-order) polynomial-time methods can ever be implemented
◮ trade-offs between computational complexity and performance?
◮ fundamental barriers due to computational complexity?
2 Distributed procedures are often needed
◮ many modern data sets are too large to be stored on a single machine
◮ need methods that operate separately on pieces of data
◮ trade-offs between decentralization and performance?
◮ fundamental barriers due to communication complexity?
3 Privacy and access issues
◮ conflicts between individual privacy and the benefits of aggregation?
◮ principled information-theoretic formulations of such trade-offs?

Part I of tutorial: three vignettes
1 Graphical model selection
2 Sparse principal component analysis
3 Structured non-parametric regression and minimax theory

Vignette A: Graphical model selection

Simple motivating example: epidemic modeling
Disease status of person s: x_s = +1 if individual s is infected, x_s = −1 if individual s is healthy.

(1) Independent infection:

Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s)

(2) Cycle-based infection:

Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{(s,t)∈C} exp(θ_st x_s x_t)

(3) Full clique infection:

Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{s≠t} exp(θ_st x_s x_t)

[Figures: possible epidemic patterns (on a square) and, from epidemic patterns to graphs, the underlying graphs]
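To make these three graph-structured distributions concrete, here is a minimal Python sketch (my own illustration, not part of the tutorial; the parameter values θ_s = 0.1 and θ_st = 0.5 are arbitrary assumptions) that evaluates the cycle-based model on all 2^5 configurations by brute-force enumeration.

```python
import itertools
import numpy as np

# Minimal sketch of the cycle-based infection model on p = 5 individuals:
#   Q(x_1,...,x_5) ∝ ∏_s exp(θ_s x_s) · ∏_{(s,t)∈C} exp(θ_st x_s x_t)
# The parameter values below are arbitrary choices for illustration.
p = 5
theta_node = np.full(p, 0.1)                         # node weights θ_s (assumed)
theta_edge = 0.5                                     # common edge weight θ_st (assumed)
cycle_edges = [(s, (s + 1) % p) for s in range(p)]   # the 5-cycle C

def unnormalized_q(x):
    """Unnormalized probability of a configuration x in {-1,+1}^5."""
    node_term = np.dot(theta_node, x)
    edge_term = theta_edge * sum(x[s] * x[t] for s, t in cycle_edges)
    return np.exp(node_term + edge_term)

# Brute-force normalization over all 2^5 sign patterns
configs = [np.array(c) for c in itertools.product([-1, +1], repeat=p)]
Z = sum(unnormalized_q(x) for x in configs)          # partition function
probs = {tuple(int(v) for v in x): unnormalized_q(x) / Z for x in configs}

# The three most probable epidemic patterns under this model
for pattern, prob in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(pattern, round(prob, 4))
```

Replacing cycle_edges with all pairs s ≠ t gives the full-clique model, and dropping the edge term recovers the independent-infection model.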
Model selection for graphs
Given n i.i.d. samples drawn from

Q(x_1, ..., x_p; Θ) ∝ exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t ),

the graph G and the matrix [Θ]_st = θ_st of edge weights are unknown
data matrix X_1^n ∈ {−1, +1}^{n×p}
want an estimator X_1^n ↦ Ĝ that minimizes the error probability Q^n[Ĝ(X_1^n) ≠ G], the probability that the estimated graph differs from the truth
Channel decoding: think of graphs as codewords, and the graph family as a codebook.

Past and ongoing work on graph selection
exact polynomial-time solution on trees (Chow & Liu, 1967)
testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Bühlmann, 2008)
pseudolikelihood and BIC criterion (Csiszár & Talata, 2006)
pseudolikelihood and ℓ1-regularized neighborhood regression
◮ Gaussian case (Meinshausen & Bühlmann, 2006)
◮ binary case (Ravikumar, W. & Lafferty, 2006)
various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
lower bounds and inachievability results
◮ information-theoretic bounds (Santhanam & W., 2012)
◮ computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
◮ phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

US Senate network (2004–2006 voting) [figure]

Experiments for sequences of star graphs [figures: p = 9, d = 3 and p = 18, d = 6]

Empirical behavior: unrescaled plots
[Figure: probability of success versus the raw sample size n, star graphs with a linear fraction of neighbors, p ∈ {64, 100, 225}]

Empirical behavior: appropriately rescaled
[Figure: probability of success versus the rescaled sample size n / (d² log p), same star graphs, p ∈ {64, 100, 225}; the curves align]

Some theory: scaling law for graph selection
graphs G_{p,d} with p nodes and maximum degree d
minimum absolute edge weight θ_min
how many samples n are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)
Achievable result: under regularity conditions, the graph estimate Ĝ produced by ℓ1-regularized logistic regression satisfies

n > c_u (d² + 1/θ_min²) log p   ⟹   Q[Ĝ ≠ G] → 0
(lower bound on sample size ⟹ vanishing probability of error)

Necessary condition: for a graph estimate G̃ produced by any algorithm,

n < c_ℓ (d² + 1/θ_min²) log p   ⟹   Q[G̃ ≠ G] ≥ 1/2
(upper bound on sample size ⟹ constant probability of error)
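The achievable part of this theorem is based on regressing each node on all of the others with an ℓ1-penalized logistic regression and reading off the non-zero coefficients as neighbors. The sketch below is my own minimal illustration of that recipe (not the authors' code): the regularization level, the coefficient threshold, and the AND rule for symmetrizing the neighborhoods are illustrative assumptions rather than the tuned choices from the theory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, reg=0.1, tol=1e-3):
    """Neighborhood-based graph selection for binary data X in {-1,+1}^{n x p}.

    Each node is regressed on all other nodes with an l1-penalized logistic
    regression; an edge (s,t) is kept when both neighborhood estimates
    include it (AND rule).  The regularization level `reg` and the threshold
    `tol` are illustrative choices, not the tuned values from the theory.
    """
    n, p = X.shape
    neighbors = np.zeros((p, p), dtype=bool)
    for s in range(p):
        y = (X[:, s] > 0).astype(int)              # response: node s
        Z = np.delete(X, s, axis=1)                # predictors: all other nodes
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (reg * n))
        clf.fit(Z, y)
        coef = np.insert(clf.coef_.ravel(), s, 0.0)  # re-insert a zero at position s
        neighbors[s] = np.abs(coef) > tol
    return np.logical_and(neighbors, neighbors.T)  # AND rule -> symmetric adjacency
```

In the analysis behind the theorem, the regularization weight is taken proportional to sqrt(log p / n); the fixed `reg` above is only a placeholder for a quick experiment.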
Information-theoretic analysis of graph selection
Question: how can we prove lower bounds that apply to any graph selection method?
Answer: graph selection is an unorthodox channel coding problem.
codewords/codebook: a graph G in some graph class 𝒢
channel use: draw a sample X_i• = (X_i1, ..., X_ip) ∈ {−1, +1}^p from the graph-structured distribution Q_G
decoding problem: use the n samples {X_1•, X_2•, ..., X_n•} to correctly distinguish the "codeword" G

Proof sketch: main ideas for the necessary conditions
based on assessing the difficulty of graph selection over various sub-ensembles 𝒢 ⊆ 𝒢_{p,d}
choose G ∈ 𝒢 uniformly at random, and consider the multi-way hypothesis testing problem based on the data X_1^n = {X_1•, ..., X_n•}
for any graph estimator ψ: 𝒳^n → 𝒢, Fano's inequality implies that

Q[ψ(X_1^n) ≠ G] ≥ 1 − I(X_1^n; G) / log|𝒢| − o(1),

where I(X_1^n; G) is the mutual information between the observations X_1^n and the randomly chosen graph G
remaining steps:
1 Construct "difficult" sub-ensembles 𝒢 ⊆ 𝒢_{p,d}.
2 Compute or lower bound the log cardinality log|𝒢|.
3 Upper bound the mutual information I(X_1^n; G).

A "hard" d-clique ensemble
[Figure: base graph G, graph G_uv, graph G_st]
1 Divide the vertex set V into ⌊p/(d+1)⌋ groups of size d+1, and form the base graph G by making a (d+1)-clique within each group.
2 Form the graph G_uv by deleting edge (u, v) from G.
3 Consider the testing problem over the family of graph-structured distributions {Q(· ; G_st), (s,t) ∈ 𝒞}.
Why is this ensemble "hard"? The Kullback-Leibler divergence between pairs decays exponentially in d, unless the minimum edge weight decays as 1/d.

Vignette B: Sparse principal component analysis

Principal component analysis (PCA): a widely-used method for dimensionality reduction, data compression, etc.
extracts the top eigenvectors of the sample covariance matrix Σ̂
classical PCA in p dimensions: inconsistent unless p/n → 0
low-dimensional structure: many applications lead to sparse eigenvectors

Population model: rank-one spiked covariance matrix

Σ = ν θ*(θ*)^T + I_p
(ν: SNR parameter; θ*(θ*)^T: rank-one spike)

Sampling model: draw n i.i.d. zero-mean vectors with cov(X_i) = Σ, and form the sample covariance matrix

Σ̂ := (1/n) Σ_{i=1}^{n} X_i X_i^T
(a p-dimensional matrix of rank min{n, p})

Example: PCA for face compression/recognition
[Figures: the first 25 standard principal components and the first 25 sparse principal components, each estimated from data]

Perhaps the simplest estimator: diagonal thresholding (Johnstone & Lu, 2008)
Given n i.i.d. samples X_i with zero mean and spiked covariance Σ = ν θ*(θ*)^T + I:
1 Compute the diagonal entries of the sample covariance: Σ̂_jj = (1/n) Σ_{i=1}^{n} X_ij².
2 Apply a threshold to the vector {Σ̂_jj, j = 1, ..., p}.
[Figure: diagonal value versus index; the spiked coordinates stand out]

Diagonal thresholding: unrescaled plots
[Figure: probability of success versus the raw number of observations, k = O(log p), for p ∈ {100, 200, 300, 600, 1200}]

Diagonal thresholding: rescaled plots
[Figure: probability of success versus the control parameter n / (k² log p), same values of p; the curves align]
The scaling is quadratic in the sparsity: n ≍ k² log p.

Diagonal thresholding and the fundamental limit
Consider the spiked covariance matrix Σ = ν θ*(θ*)^T + I_{p×p}, where ν is the signal-to-noise ratio, θ*(θ*)^T is the rank-one spike, and θ* ∈ B_0(k) ∩ B_2(1) is k-sparse with unit norm.

Theorem (Amini & W., 2009)
(a) There are thresholds τ_ℓ^DT ≤ τ_u^DT such that
n / (k² log p) ≤ τ_ℓ^DT   ⟹   diagonal thresholding fails w.h.p.,
n / (k² log p) ≥ τ_u^DT   ⟹   diagonal thresholding succeeds w.h.p.
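To make the diagonal-thresholding recipe concrete, here is a small self-contained simulation (my own sketch, not from the tutorial; the problem sizes n, p, k, the spike strength ν, and the particular cutoff are all arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumed, not from the tutorial)
n, p, k, nu = 2000, 500, 5, 2.0

# k-sparse, unit-norm spike direction theta*
theta = np.zeros(p)
theta[:k] = 1.0 / np.sqrt(k)

# Draw X_i ~ N(0, Sigma) with Sigma = nu * theta theta^T + I_p:
# each row is sqrt(nu) * g_i * theta + w_i with g_i ~ N(0,1), w_i ~ N(0, I_p).
g = rng.standard_normal(n)
W = rng.standard_normal((n, p))
X = np.sqrt(nu) * np.outer(g, theta) + W

# Diagonal thresholding (Johnstone & Lu, 2008):
# 1. diagonal of the sample covariance, hat Sigma_jj = (1/n) sum_i X_ij^2
diag = (X ** 2).mean(axis=0)

# 2. threshold: off the support E[hat Sigma_jj] = 1, on it 1 + nu * theta_j^2.
#    The cutoff below is an illustrative choice, not the theoretically tuned one.
tau = 1.0 + 4.0 * np.sqrt(np.log(p) / n)
support_estimate = np.flatnonzero(diag > tau)

print("true support:     ", list(range(k)))
print("estimated support:", support_estimate.tolist())
```

With these sizes n is well above k² log p, the regime in which the theorem predicts success; shrinking n below that scaling pushes the simulation into the failure regime shown in the rescaled plots above.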