Foundations of Data Science∗
Avrim Blum, John Hopcroft and Ravindran Kannan
Thursday 9th June, 2016
∗Copyright 2015. All rights reserved
Contents

1 Introduction

2 High-Dimensional Space
  2.1 Introduction
  2.2 The Law of Large Numbers
  2.3 The Geometry of High Dimensions
  2.4 Properties of the Unit Ball
    2.4.1 Volume of the Unit Ball
    2.4.2 Most of the Volume is Near the Equator
  2.5 Generating Points Uniformly at Random from a Ball
  2.6 Gaussians in High Dimension
  2.7 Random Projection and Johnson-Lindenstrauss Lemma
  2.8 Separating Gaussians
  2.9 Fitting a Single Spherical Gaussian to Data
  2.10 Bibliographic Notes
  2.11 Exercises

3 Best-Fit Subspaces and Singular Value Decomposition (SVD)
  3.1 Introduction and Overview
  3.2 Preliminaries
  3.3 Singular Vectors
  3.4 Singular Value Decomposition (SVD)
  3.5 Best Rank-k Approximations
  3.6 Left Singular Vectors
  3.7 Power Method for Computing the Singular Value Decomposition
    3.7.1 A Faster Method
  3.8 Singular Vectors and Eigenvectors
  3.9 Applications of Singular Value Decomposition
    3.9.1 Centering Data
    3.9.2 Principal Component Analysis
    3.9.3 Clustering a Mixture of Spherical Gaussians
    3.9.4 Ranking Documents and Web Pages
    3.9.5 An Application of SVD to a Discrete Optimization Problem
  3.10 Bibliographic Notes
  3.11 Exercises

4 Random Graphs
  4.1 The G(n, p) Model
    4.1.1 Degree Distribution
    4.1.2 Existence of Triangles in G(n, d/n)
  4.2 Phase Transitions
  4.3 The Giant Component
  4.4 Branching Processes
  4.5 Cycles and Full Connectivity
    4.5.1 Emergence of Cycles
    4.5.2 Full Connectivity
    4.5.3 Threshold for O(ln n) Diameter
  4.6 Phase Transitions for Increasing Properties
  4.7 Phase Transitions for CNF-sat
  4.8 Nonuniform and Growth Models of Random Graphs
    4.8.1 Nonuniform Models
    4.8.2 Giant Component in Random Graphs with Given Degree Distribution
  4.9 Growth Models
    4.9.1 Growth Model Without Preferential Attachment
    4.9.2 Growth Model With Preferential Attachment
  4.10 Small World Graphs
  4.11 Bibliographic Notes
  4.12 Exercises

5 Random Walks and Markov Chains
  5.1 Stationary Distribution
  5.2 Markov Chain Monte Carlo
    5.2.1 Metropolis-Hasting Algorithm
    5.2.2 Gibbs Sampling
  5.3 Areas and Volumes
  5.4 Convergence of Random Walks on Undirected Graphs
    5.4.1 Using Normalized Conductance to Prove Convergence
  5.5 Electrical Networks and Random Walks
  5.6 Random Walks on Undirected Graphs with Unit Edge Weights
  5.7 Random Walks in Euclidean Space
  5.8 The Web as a Markov Chain
  5.9 Bibliographic Notes
  5.10 Exercises

6 Machine Learning
  6.1 Introduction
  6.2 Overfitting and Uniform Convergence
  6.3 Illustrative Examples and Occam’s Razor
    6.3.1 Learning disjunctions
    6.3.2 Occam’s razor
    6.3.3 Application: learning decision trees
  6.4 Regularization: penalizing complexity
  6.5 Online learning and the Perceptron algorithm
    6.5.1 An example: learning disjunctions
    6.5.2 The Halving algorithm
    6.5.3 The Perceptron algorithm
    6.5.4 Extensions: inseparable data and hinge-loss
  6.6 Kernel functions
  6.7 Online to Batch Conversion
  6.8 Support-Vector Machines
  6.9 VC-Dimension
    6.9.1 Definitions and Key Theorems
    6.9.2 Examples: VC-Dimension and Growth Function
    6.9.3 Proof of Main Theorems
    6.9.4 VC-dimension of combinations of concepts
    6.9.5 Other measures of complexity
  6.10 Strong and Weak Learning - Boosting
  6.11 Stochastic Gradient Descent
  6.12 Combining (Sleeping) Expert Advice
  6.13 Deep learning
  6.14 Further Current directions
    6.14.1 Semi-supervised learning
    6.14.2 Active learning
    6.14.3 Multi-task learning
  6.15 Bibliographic Notes
  6.16 Exercises

7 Algorithms for Massive Data Problems: Streaming, Sketching, and Sampling
  7.1 Introduction
  7.2 Frequency Moments of Data Streams
    7.2.1 Number of Distinct Elements in a Data Stream
    7.2.2 Counting the Number of Occurrences of a Given Element
    7.2.3 Counting Frequent Elements
    7.2.4 The Second Moment
  7.3 Matrix Algorithms using Sampling
    7.3.1 Matrix Multiplication Using Sampling
    7.3.2 Implementing Length Squared Sampling in two passes
    7.3.3 Sketch of a Large Matrix
  7.4 Sketches of Documents
  7.5 Bibliography
  7.6 Exercises

8 Clustering
  8.1 Introduction
    8.1.1 Two general assumptions on the form of clusters
  8.2 k-means Clustering
    8.2.1 A maximum-likelihood motivation for k-means
    8.2.2 Structural properties of the k-means objective
    8.2.3 Lloyd’s k-means clustering algorithm
    8.2.4 Ward’s algorithm
    8.2.5 k-means clustering on the line
  8.3 k-Center Clustering
  8.4 Finding Low-Error Clusterings
  8.5 Approximation Stability
  8.6 Spectral Clustering
    8.6.1 Stochastic Block Model
    8.6.2 Gaussian Mixture Model
    8.6.3 Standard Deviation without a stochastic model
    8.6.4 Spectral Clustering Algorithm
  8.7 High-Density Clusters
    8.7.1 Single-linkage
    8.7.2 Robust linkage
  8.8 Kernel Methods
  8.9 Recursive Clustering based on Sparse cuts
  8.10 Dense Submatrices and Communities
  8.11 Community Finding and Graph Partitioning
    8.11.1 Flow Methods
  8.12 Axioms for Clustering
    8.12.1 An Impossibility Result
    8.12.2 Satisfying two of three
    8.12.3 Relaxing the axioms
    8.12.4 A Satisfiable Set of Axioms
  8.13 Exercises

9 Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  9.1 Topic Models
  9.2 Hidden Markov Model
  9.3 Graphical Models, and Belief Propagation
  9.4 Bayesian or Belief Networks
  9.5 Markov Random Fields
  9.6 Factor Graphs
  9.7 Tree Algorithms
  9.8 Message Passing in general Graphs
  9.9 Graphs with a Single Cycle
  9.10 Belief Update in Networks with a Single Loop
  9.11 Maximum Weight Matching
  9.12 Warning Propagation
  9.13 Correlation Between Variables
  9.14 Exercises

10 Other Topics
  10.1 Rankings
  10.2 Hare System for Voting
  10.3 Compressed Sensing and Sparse Vectors
    10.3.1 Unique Reconstruction of a Sparse Vector
    10.3.2 The Exact Reconstruction Property
    10.3.3 Restricted Isometry Property
  10.4 Applications
    10.4.1 Sparse Vector in Some Coordinate Basis
    10.4.2 A Representation Cannot be Sparse in Both Time and Frequency Domains
    10.4.3 Biological
    10.4.4 Finding Overlapping Cliques or Communities
    10.4.5 Low Rank Matrices
  10.5 Gradient
  10.6 Linear Programming
    10.6.1 The Ellipsoid Algorithm
  10.7 Integer Optimization
  10.8 Semi-Definite Programming
  10.9 Exercises

11 Wavelets
  11.1 Dilation
  11.2 The Haar Wavelet
  11.3 Wavelet Systems
  11.4 Solving the Dilation Equation
  11.5 Conditions on the Dilation Equation
  11.6 Derivation of the Wavelets from the Scaling Function
  11.7 Sufficient Conditions for the Wavelets to be Orthogonal
  11.8 Expressing a Function in Terms of Wavelets
  11.9 Designing a Wavelet System

12 Appendix
  12.1 Asymptotic Notation
  12.2 Useful relations
  12.3 Useful Inequalities
  12.4 Probability
    12.4.1 Sample Space, Events, Independence
    12.4.2 Linearity of Expectation
    12.4.3 Union Bound
    12.4.4 Indicator Variables
    12.4.5 Variance
    12.4.6 Variance of the Sum of Independent Random Variables
    12.4.7 Median
    12.4.8 The Central Limit Theorem
    12.4.9 Probability Distributions
    12.4.10 Bayes Rule and Estimators
    12.4.11 Tail Bounds and Chernoff inequalities
  12.5 Bounds on Tail Probability
  12.6 Applications of the tail bound
  12.7 Eigenvalues and Eigenvectors
    12.7.1 Symmetric Matrices
    12.7.2 Relationship between SVD and Eigen Decomposition
    12.7.3 Extremal Properties of Eigenvalues
    12.7.4 Eigenvalues of the Sum of Two Symmetric Matrices
    12.7.5 Norms
    12.7.6 Important Norms and Their Properties
    12.7.7 Linear Algebra
    12.7.8 Distance between subspaces
  12.8 Generating Functions
    12.8.1 Generating Functions for Sequences Defined by Recurrence Relationships
    12.8.2 The Exponential Generating Function and the Moment Generating Function
  12.9 Miscellaneous
    12.9.1 Lagrange multipliers
    12.9.2 Finite Fields
    12.9.3 Hash Functions
    12.9.4 Application of Mean Value Theorem
    12.9.5 Sperner’s Lemma
    12.9.6 Prüfer
  12.10 Exercises

Index
1 Introduction
Computer science as an academic discipline began in the 1960s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place, and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect, and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory.
While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just with how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years. One of the major changes is a shift in emphasis from discrete mathematics to probability, statistics, and numerical methods.
Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.
This book starts with the treatment of high-dimensional geometry. Modern data in diverse fields such as information processing, search, and machine learning is often advantageously represented as vectors with a large number of components. This is so even in cases when the vector representation is not the natural first choice. Our intuition from two- or three-dimensional space can be surprisingly off the mark when it comes to high-dimensional space. Chapter 2 works out the fundamentals needed to understand the differences. The emphasis of the chapter, as well as of the book in general, is to get across the mathematical foundations rather than to dwell on particular applications, which are only briefly described.
The mathematical areas most relevant to dealing with high-dimensional data are matrix algebra and algorithms. We focus on singular value decomposition, a central tool in this area. Chapter 3 gives a from-first-principles description of this. Applications of singular value decomposition include principal component analysis, a widely used technique which we touch upon, as well as modern applications to statistical mixtures of probability densities, discrete optimization, etc., which are described in more detail.
Central to our understanding of large structures, like the web and social networks, is building models to capture essential properties of these structures. The simplest model is that of a random graph formulated by Erdős and Rényi, which we study in detail, proving that certain global phenomena, like a giant connected component, arise in such structures with only local choices. We also describe other models of random graphs.
One of the surprises of computer science over the last two decades is that some domain-independent methods have been immensely successful in tackling problems from diverse areas. Machine learning is a striking example. We describe the foundations of machine learning, both algorithms for optimizing over given training examples and the theory for understanding when such optimization can be expected to lead to good performance on new, unseen data, including important measures such as Vapnik-Chervonenkis dimension. Another important domain-independent technique is based on Markov chains. The underlying mathematical theory, as well as the connections to electrical networks, forms the core of our chapter on Markov chains.
The field of algorithms has traditionally assumed that the input data to a problem is presented in random access memory, which the algorithm can repeatedly access. This is not feasible for modern problems involving massive amounts of data. The streaming model and other models have been formulated to better reflect this. In this setting, sampling plays a crucial role and, indeed, we have to sample on the fly. In Chapter 7 we study how to draw good samples efficiently and how to estimate statistical and linear-algebra quantities with such samples.
Another important tool for understanding data is clustering, dividing data into groups of similar objects. After describing some of the basic methods for clustering, such as the k-means algorithm, we focus on modern developments in understanding these, as well as newer algorithms. The chapter ends with a study of clustering criteria.
This book also covers graphical models and belief propagation, ranking and voting, sparse vectors, and compressed sensing. The appendix includes a wealth of background material.
A word about notation in the book. To help the student, we have adopted certain notations and, with a few exceptions, adhered to them. We use lower case letters for scalar variables and functions, bold face lower case for vectors, and upper case letters for matrices. Lower case letters near the beginning of the alphabet tend to denote constants; letters in the middle of the alphabet, such as i, j, and k, are indices in summations; n and m denote integer sizes; and x, y, and z denote variables. If A is a matrix, its elements are a_{ij} and its rows are a_i. If a_i is a vector, its coordinates are a_{ij}. Where the literature traditionally uses a particular symbol for a quantity, we also use that symbol, even if it means abandoning our convention. If we have a set of points in some vector space and work with a subspace, we use n for the number of points, d for the dimension of the space, and k for the dimension of the subspace.
The term “almost surely” means with probability one. We use ln n for the natural logarithm and log n for the base two logarithm. If we want base ten, we will use log_10. To simplify notation and to make it easier to read, we use E^2(1 − x) for (E(1 − x))^2 and E(1 − x)^2 for E((1 − x)^2). When we say “randomly select” some number of points from a given probability distribution D, independence is always assumed unless otherwise stated.
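For instance, if x takes the values 0 and 1 with probability 1/2 each, then E(1 − x) = 1/2, so E^2(1 − x) = 1/4, whereas (1 − x)^2 = 1 − x, so E(1 − x)^2 = 1/2.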
2 High-Dimensional Space
2.1 Introduction
High-dimensional data has become very important. However, high-dimensional space is very different from the two- and three-dimensional spaces we are familiar with. Generate n points at random in d dimensions, where each coordinate is a zero-mean, unit-variance Gaussian. For sufficiently large d, with high probability the distances between all pairs of points will be essentially the same. Also, the volume of the unit ball in d dimensions, the set of all points x such that |x| ≤ 1, goes to zero as the dimension goes to infinity. The volume of a high-dimensional unit ball is concentrated near its surface and is also concentrated at its equator. These properties have important consequences which we will consider.
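As a quick numerical illustration (a minimal sketch of our own, not part of the text; the function name and parameter choices are ours), one can sample points with independent N(0, 1) coordinates and observe that for large d every pairwise distance is close to sqrt(2d):

import numpy as np

rng = np.random.default_rng(0)

def pairwise_distance_spread(n=20, d=10000):
    # n random points in d dimensions, each coordinate an independent N(0,1) draw
    points = rng.standard_normal((n, d))
    # all pairwise Euclidean distances via broadcasting
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    pairs = dists[np.triu_indices(n, k=1)]      # keep each distinct pair once
    return pairs.min(), pairs.max(), np.sqrt(2 * d)

print(pairwise_distance_spread(d=10))      # low dimension: distances vary noticeably
print(pairwise_distance_spread(d=10000))   # high dimension: min and max are both near sqrt(2d)

In ten dimensions the largest pairwise distance typically exceeds the smallest by a substantial factor; in ten thousand dimensions all pairwise distances agree to within a couple of percent.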
2.2 The Law of Large Numbers

If one generates random points in d-dimensional space using a Gaussian to generate coordinates, the distance between all pairs of points will be essentially the same when d is large. The reason is that the square of the distance between two points y and z,
\[
|y - z|^2 = \sum_{i=1}^{d} (y_i - z_i)^2,
\]
is the sum of d independent random variables. If one averages n independent samples of a random variable x of bounded variance, the result will be close to the expected value of x. In the above summation there are d samples, where each sample is the squared distance in a coordinate between the two points y and z. Here we give a general bound called the Law of Large Numbers. Specifically, the Law of Large Numbers states that
\[
\mathrm{Prob}\left( \left| \frac{x_1 + x_2 + \cdots + x_n}{n} - E(x) \right| \geq \epsilon \right) \leq \frac{\mathrm{Var}(x)}{n \epsilon^2}. \tag{2.1}
\]
The larger the variance of the random variable, the greater the probability that the error will exceed ε. Thus the variance of x is in the numerator. The number of samples n is in the denominator, since the more values that are averaged, the smaller the probability that the difference will exceed ε. Similarly, the larger ε is, the smaller the probability that the difference will exceed ε, and hence ε is in the denominator. Notice that squaring ε makes the fraction a dimensionless quantity.
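As a sanity check on (2.1) (a small simulation sketch of our own, not part of the text; the distribution and parameters are chosen only for illustration), take x uniform on [0, 1], so that E(x) = 1/2 and Var(x) = 1/12, and compare the observed probability that the sample mean deviates from E(x) by at least ε with the bound Var(x)/(nε^2):

import numpy as np

rng = np.random.default_rng(1)

def lln_check(n, eps=0.05, trials=20000):
    # each row is one experiment: n independent uniform [0,1) samples
    samples = rng.random((trials, n))
    deviations = np.abs(samples.mean(axis=1) - 0.5)
    empirical = (deviations >= eps).mean()    # observed Prob(|sample mean - E(x)| >= eps)
    bound = (1 / 12) / (n * eps ** 2)         # right-hand side of (2.1)
    return empirical, bound

print(lln_check(n=100))    # empirical probability is below the bound
print(lln_check(n=400))    # both shrink as n grows

The observed probability is typically far smaller than the bound; (2.1) is only an upper bound, but it already captures the 1/n decay.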
We use two inequalities to prove the Law of Large Numbers. The first is Markov’s inequality, which states that the probability that a nonnegative random variable exceeds a is bounded by the expected value of the variable divided by a.
Theorem 2.1 (Markov’s inequality) Let x be a nonnegative random variable. Then for a > 0,
\[
\mathrm{Prob}(x \geq a) \leq \frac{E(x)}{a}.
\]
Proof: For a continuous nonnegative random variable x with probability density p,
\[
E(x) = \int_0^{\infty} x\,p(x)\,dx = \int_0^{a} x\,p(x)\,dx + \int_a^{\infty} x\,p(x)\,dx
     \geq \int_a^{\infty} x\,p(x)\,dx \geq a \int_a^{\infty} p(x)\,dx = a\,\mathrm{Prob}(x \geq a).
\]
Thus, Prob(x ≥ a) ≤ E(x)/a.
The same proof works for discrete random variables with sums instead of integrals.
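To see the inequality in action, here is a small empirical sketch (ours, not from the text; the exponential distribution is merely a convenient nonnegative example with E(x) = 1):

import numpy as np

rng = np.random.default_rng(2)

# an exponential with mean 1 is a nonnegative random variable with E(x) = 1
x = rng.exponential(scale=1.0, size=1_000_000)
for a in [1, 2, 5, 10]:
    empirical = (x >= a).mean()               # estimated Prob(x >= a)
    print(a, round(empirical, 5), 1.0 / a)    # Markov bound: E(x)/a = 1/a

Markov's inequality is loose for this distribution (the true tail e^{-a} is much smaller than 1/a), but it holds for every nonnegative random variable and uses nothing beyond the mean.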