Lecture 7: Concentration Inequalities and the Law of Averages

Total Page:16

File Type:pdf, Size:1020Kb

Load more

Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2021 Lecture 7: Concentration Inequalities and the Law of Averages Relevant textbook passages: Pitman [19]: Section 3.3 Larsen–Marx [17]: Section 4.3 7.1 The Law of Averages The “Law of Averages” is an informal term used to describe a number of mathematical theorems that relate averages of sample values to expectations of random variables. F ∈ Given random variables X1,...,Xn on a probability space (Ω, ,PP), each point ω Ω yields ¯ n a list X(ω)1,...,X(ω)n. If we average these numbers we get X(ω) = i=1 Xi(ω)/n, the sample average associated with the outcome ω. The sample average, since it depends on ω, is also a random variable. In later lectures, I’ll talk about how to determine the distribution of a sample average, but we already have a case that we can deal with. If X1,...,Xn are independent Bernoulli random variables, their sum has a Binomial distribution, so the distribution of the sample average is easily given. First note that the sample average can only take on the values k/n, for k = 0, . , n, and that n P X¯ = k/n = pk(1 − p)n−k. k Figure 7.1 shows the probability mass function of X¯ for the case p = 1/2 with various values of n. Observe the following things about the graphs. • The sample average X¯ is always between 0 and 1, and it is simply the fraction of successes in sequence of trials. • If the frequency interpretation of probability is to make sense, then as the sample size grows, it should converge to the probability of success, which in this case is 1/2. • What can we conclude about the probability that X¯ is near 1/2? As the sample size becomes larger, the heights (which measure probability) of the dots shrink, but there are more and more of them close to 1/2. Which effect wins? What happens for other kinds of random variables? Fortunately we do not need to know the details of the distribution to prove a Law of Averages. But we start with some preliminaries. KC Border v. 2020.10.21::10.28 Ma 3/103 Winter 2021 KC Border The Law of Averages 7–2 ��� ��� ��� ������� �� �� ����������� ��������� ��� 0.20 ● ● ● 0.15 ● ● 0.10 ● ● 0.05 ● ● ● ● ● ● ● ● ● ● 0.2 0.4 0.6 0.8 1.0 ��� ��� ��� ������� �� ��� ����������� ��������� ��� 0.05 ●●● ● ● ● ● ● ● ● ● 0.04 ● ● ● ● 0.03 ● ● ● ● ● ● 0.02 ● ● ● ● ● ● ● ● 0.01 ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0.2 0.4 0.6 0.8 1.0 ��� ��� ��� ������� �� ������ ����������� ��������� ��� ● ● ● 0.0030 ● ● ● ● ● ● ●● ●● ●● ●● ●● ●● 0.0025 ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● 0.0020 ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● 0.0015 ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● 0.0010 ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● 0.0005 ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0.2 0.4 0.6 0.8 1.0 Figure 7.1. v. 2020.10.21::10.28 KC Border Ma 3/103 Winter 2021 KC Border The Law of Averages 7–3 7.2 Standardized random variables Pitman [19]: p. 190 7.2.1 Definition Given a random variable X with finite mean µ and finite variance σ2, the standardization of X is the random variable X∗ defined by X − µ X∗ = . σ Note that E X∗ = 0, and Var X∗ = 1, and X = σX∗ + µ, so that X∗ is just X measured in different units, called standard units. ∗ [Note: Pitman uses both X and later X∗ to denote the standardization of X.] A convenient feature of standardized random variables is that they are invariant under change of scale and location. 7.2.2 Proposition Let X be a random variable with mean µ and standard deviation σ, and let Y = aX + b, where a > 0. Then X∗ = Y ∗. Proof : The proof follows from Propositions 6.7.1 and 6.10.2, which assert that E Y = aµ + b and SD Y = aσ. So z =}|Y { Y − aµ − b aX + b −aµ − b a(X − µ) X − µ Y ∗ = = = = = X∗. aσ aσ aσ σ 7.3 Concentration inequalities The term concentration inequality refers to a proposition about the probability of the value of a random variable being concentrated near a particular point. Concentration inequalities are crucial to the proof of the Law of Large Numbers and have many applications in statistics. One could write a whole book on the subject. Indeed Boucheron, Lugosi, and Massart [2] have done so. 7.3.1 Markov Markov’s Inequality bounds the probability of large values of a nonnegative random variable in terms of its expectation. It is a very crude bound, but it is just what we need for the Weak Law of Large Numbers. Pitman [19]: p. 174 7.3.1 Proposition (Markov’s Inequality) Let X be a nonnegative random variable with finite mean µ. For every ε > 0, µ P (X > ε) 6 . ε KC Border v. 2020.10.21::10.28
Recommended publications
  • Lecture 2: Estimating the Empirical Risk with Samples CS4787 — Principles of Large-Scale Machine Learning Systems

    Lecture 2: Estimating the Empirical Risk with Samples CS4787 — Principles of Large-Scale Machine Learning Systems

    Lecture 2: Estimating the empirical risk with samples CS4787 — Principles of Large-Scale Machine Learning Systems Review: The empirical risk. Suppose we have a dataset D = f(x1; y1); (x2; y2);:::; (xn; yn)g, where xi 2 X is an example and yi 2 Y is a label. Let h : X!Y be a hypothesized model (mapping from examples to labels) we are trying to evaluate. Let L : Y × Y ! R be a loss function which measures how different two labels are The empirical risk is n 1 X R(h) = L(h(x ); y ): n i i i=1 Most notions of error or accuracy in machine learning can be captured with an empirical risk. A simple example: measure the error rate on the dataset with the 0-1 Loss Function L(^y; y) = 0 if y^ = y and 1 if y^ 6= y: Other examples of empirical risk? We need to compute the empirical risk a lot during training, both during validation (and hyperparameter optimiza- tion) and testing, so it’s nice if we can do it fast. Question: how does computing the empirical risk scale? Three things affect the cost of computation. • The number of training examples n. The cost will certainly be proportional to n. • The cost to compute the loss function L. • The cost to evaluate the hypothesis h. Question: what if n is very large? Must we spend a large amount of time computing the empirical risk? Answer: We can approximate the empirical risk using subsampling. Idea: let Z be a random variable that takes on the value L(h(xi); yi) with probability 1=n for each i 2 f1; : : : ; ng.
  • Concentration Inequalities and Martingale Inequalities — a Survey

    Concentration Inequalities and Martingale Inequalities — a Survey

    Concentration inequalities and martingale inequalities | a survey Fan Chung ∗† Linyuan Lu zy May 28, 2006 Abstract We examine a number of generalized and extended versions of concen- tration inequalities and martingale inequalities. These inequalities are ef- fective for analyzing processes with quite general conditions as illustrated in an example for an infinite Polya process and webgraphs. 1 Introduction One of the main tools in probabilistic analysis is the concentration inequalities. Basically, the concentration inequalities are meant to give a sharp prediction of the actual value of a random variable by bounding the error term (from the expected value) with an associated probability. The classical concentration in- equalities such as those for the binomial distribution have the best possible error estimates with exponentially small probabilistic bounds. Such concentration in- equalities usually require certain independence assumptions (i.e., the random variable can be decomposed as a sum of independent random variables). When the independence assumptions do not hold, it is still desirable to have similar, albeit slightly weaker, inequalities at our disposal. One approach is the martingale method. If the random variable and the associated probability space can be organized into a chain of events with modified probability spaces and if the incremental changes of the value of the event is “small”, then the martingale inequalities provide very good error estimates. The reader is referred to numerous textbooks [5, 17, 20] on this subject. In the past few years, there has been a great deal of research in analyzing general random graph models for realistic massive graphs which have uneven degree distribution such as the power law [1, 2, 3, 4, 6].
  • Lecture 6: Estimating the Empirical Risk with Samples

    Lecture 6: Estimating the Empirical Risk with Samples

    Lecture 6: Estimating the empirical risk with samples CS4787 — Principles of Large-Scale Machine Learning Systems Recall: The empirical risk. Suppose we have a dataset D = f(x1; y1); (x2; y2);:::; (xn; yn)g, where xi 2 X is an example and yi 2 Y is a label. Let h : X!Y be a hypothesized model (mapping from examples to labels) we are trying to evaluate. Let L : Y × Y ! R be a loss function which measures how different two labels are. The empirical risk is n 1 X R(h) = L(h(x ); y ): n i i i=1 We need to compute the empirical risk a lot during training, both during validation (and hyperparameter optimization) and testing, so it’s nice if we can do it fast. But the cost will certainly be proportional to n, the number of training examples. Question: what if n is very large? Must we spend a long time computing the empirical risk? Answer: We can approximate the empirical risk using subsampling. Idea: let Z be a random variable that takes on the value L(h(xi); yi) with probability 1=n for each i 2 f1; : : : ; ng. Equivalently, Z is the result of sampling a single element from the sum in the formula for the empirical risk. By construction we will have n 1 X E [Z] = L(h(x ); y ) = R(h): n i i i=1 If we sample a bunch of independent identically distributed random variables Z1;Z2;:::;ZK identical to Z, then their average will be a good approximation of the empirical risk.
  • Arxiv:2002.10761V1 [Math.PR] 25 Feb 2020

    Arxiv:2002.10761V1 [Math.PR] 25 Feb 2020

    SOME NOTES ON CONCENTRATION FOR α-SUBEXPONENTIAL RANDOM VARIABLES HOLGER SAMBALE Abstract. We prove extensions of classical concentration inequalities for ran- dom variables which have α-subexponential tail decay for any α (0, 2]. This includes Hanson–Wright type and convex concentration inequalities.∈ We also pro- vide some applications of these results. This includes uniform Hanson–Wright inequalities and concentration results for simple random tensors in the spirit of Klochkov and Zhivotovskiy [2020] and Vershynin [2019]. 1. Introduction The concentration of measure phenomenon is by now a well-established part of probability theory, see e. g. the monographs Ledoux [2001], Boucheron et al. [2013] or Vershynin [2018]. A central feature is to study the tail behaviour of functions of random vectors X = (X1,...,Xn), i. e. to find suitable estimates for P( f(X) Ef(X) t). Here, numerous situations have been considered, e. g. ass| uming− | ≥ independence of X1,...,Xn or in presence of some classical functional inequalities for the distribution of X like Poincaré or logarithmic Sobolev inequalities. Let us recall some fundamental results which will be of particular interest for the present paper. One of them is the classical convex concentration inequality as first established in Talagrand [1988], Johnson and Schechtman [1991] and later Ledoux [1995/97]. Assuming X1,...,Xn are independent random variables each taking values in some bounded interval [a, b], we have that for every convex Lipschitz function f : [a, b]n R with Lipschitz constant 1, → t2 (1.1) P( f(X) Ef(X) > t) 2 exp | − | ≤ − 2(b a)2 − for any t 0 (see e.
  • The Introductory Statistics Course: a Ptolemaic Curriculum?

    The Introductory Statistics Course: a Ptolemaic Curriculum?

    UCLA Technology Innovations in Statistics Education Title The Introductory Statistics Course: A Ptolemaic Curriculum? Permalink https://escholarship.org/uc/item/6hb3k0nz Journal Technology Innovations in Statistics Education, 1(1) ISSN 1933-4214 Author Cobb, George W Publication Date 2007-10-12 DOI 10.5070/T511000028 Peer reviewed eScholarship.org Powered by the California Digital Library University of California The Introductory Statistics Course: A Ptolemaic Curriculum George W. Cobb Mount Holyoke College 1. INTRODUCTION The founding of this journal recognizes the likelihood that our profession stands at the thresh- old of a fundamental reshaping of how we do what we do, how we think about what we do, and how we present what we do to students who want to learn about the science of data. For two thousand years, as far back as Archimedes, and continuing until just a few decades ago, the mathematical sciences have had but two strategies for implementing solutions to applied problems: brute force and analytical circumventions. Until recently, brute force has meant arithmetic by human calculators, making that approach so slow as to create a needle's eye that effectively blocked direct solution of most scientific problems of consequence. The greatest minds of generations past were forced to invent analytical end runs that relied on simplifying assumptions and, often, asymptotic approximations. Our generation is the first ever to have the computing power to rely on the most direct approach, leaving the hard work of implementation to obliging little chunks of silicon. In recent years, almost all teachers of statistics who care about doing well by their students and doing well by their subject have recognized that computers are changing the teaching of our subject.
  • 4 Concentration Inequalities, Scalar and Matrix Versions

    4 Concentration Inequalities, Scalar and Matrix Versions

    4 Concentration Inequalities, Scalar and Matrix Versions 4.1 Large Deviation Inequalities Concentration and large deviations inequalities are among the most useful tools when understanding the performance of some algorithms. In a nutshell they control the probability of a random variable being very far from its expectation. The simplest such inequality is Markov's inequality: Theorem 4.1 (Markov's Inequality) Let X ≥ 0 be a non-negative random variable with E[X] < 1. Then, [X] ProbfX > tg ≤ E : (31) t Proof. Let t > 0. Define a random variable Yt as 0 if X ≤ t Y = t t if X > t Clearly, Yt ≤ X, hence E[Yt] ≤ E[X], and t ProbfX > tg = E[Yt] ≤ E[X]; concluding the proof. 2 Markov's inequality can be used to obtain many more concentration inequalities. Chebyshev's inequality is a simple inequality that control fluctuations from the mean. 2 Theorem 4.2 (Chebyshev's inequality) Let X be a random variable with E[X ] < 1. Then, Var(X) ProbfjX − Xj > tg ≤ : E t2 2 Proof. Apply Markov's inequality to the random variable (X − E[X]) to get: 2 E (X − EX) Var(X) ProbfjX −X j > t g= Prob f(X −X)2 > t2 g ≤ = : E E t2 t2 2 4.1.1 Sums of independent random variables In what follows we'll show two useful inequalities involving sums of independent random variables. The intuitive idea is that if we have a sum of independent random variables X = X1 + ··· + Xn; where X are iid centered random variables, then while the value of X can be of order O(n) it will very i p likely be of order O( n) (note that this is the order of its standard deviation).
  • Lecture 12: Martingales Concentration Inequality

    Lecture 12: Martingales Concentration Inequality

    MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013 Martingale Concentration Inequalities and Applications Content. 1. Exponential concentration for martingales with bounded increments 2. Concentration for Lipschitz continuous functions 3. Examples in statistics and random graph theory 1 Azuma-Hoeffding inequality Suppose Xn is a martingale wrt filtration Fn such that X0 = 0 The goal of this lecture is to obtain bounds of the form P(jXnj ≥ δn) ≤ exp(−Θ(n)) under some condition on Xn. Note that since E[Xn] = 0, the deviation from zero is the ”right” regime to look for rare events. It turns out the exponential bound of the form above holds under very simple assumption that the increments of Xn are bounded. The theorem below is known as Azuma-Hoeffding Inequality. Theorem 1 (Azuma-Hoeffding Inequality). Suppose Xn; n ≥ 1 is a martin­ gale such that X0 = 0 and jXi − Xi−1j ≤ di; 1 ≤ i ≤ n almost surely for some constants di; 1 ≤ i ≤ n. Then, for every t > 0, t2 P (jXnj > t) ≤ 2 exp − Pn 2 : 2 i=1 di Notice that in the special case when di = d, we can take t = xn and obtain p 2 2 an upper bound 2 exp −x n=(2d ) - which is of the form promised above. Note that this is consistent with the Chernoff bound for the special case Xn is the sum of i.i.d. zero mean terms, though it is applicable only in the special case of a.s. bounded increments. Proof. f(x) , exp(λx) is a convex function in x for any λ 2 R.
  • Lecture 1 - Introduction and the Empirical CDF

    Lecture 1 - Introduction and the Empirical CDF

    Lecture 1 - Introduction and the Empirical CDF Rui Castro February 24, 2013 1 Introduction: Nonparametric statistics The term non-parametric statistics often takes a different meaning for different authors. For example: Wolfowitz (1942): We shall refer to this situation (where a distribution is completely determined by the knowledge of its finite parameter set) as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non-parametric case. Randles, Hettmansperger and Casella (2004) Nonparametric statistics can and should be broadly defined to include all method- ology that does not use a model based on a single parametric family. Wasserman (2005) The basic idea of nonparametric inference is to use data to infer an unknown quantity while making as few assumptions as possible. Bradley (1968) The terms nonparametric and distribution-free are not synonymous... Popular usage, however, has equated the terms... Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution-free test is one which makes no assumptions about the precise form of the sampled population. The term distribution-free is used quite often in the statistical learning theory community, to refer to an analysis that makes no assumptions on the distribution of training examples (and only assumes these are independent and identically distributed samples from an unknown distribution). For our purposes we will say a model is nonparametric if it is not a parametric model. In the simplest form we speak of a parametric model if the data vector X is such that X ∼ Pθ ; d and where θ 2 Θ ⊆ R .
  • A Ptolemaic Curriculum?

    A Ptolemaic Curriculum?

    The Introductory Statistics Course: A Ptolemaic Curriculum George W. Cobb Mount Holyoke College 1. INTRODUCTION The founding of this journal recognizes the likelihood that our profession stands at the thresh- old of a fundamental reshaping of how we do what we do, how we think about what we do, and how we present what we do to students who want to learn about the science of data. For two thousand years, as far back as Archimedes, and continuing until just a few decades ago, the mathematical sciences have had but two strategies for implementing solutions to applied problems: brute force and analytical circumventions. Until recently, brute force has meant arithmetic by human calculators, making that approach so slow as to create a needle's eye that effectively blocked direct solution of most scientific problems of consequence. The greatest minds of generations past were forced to invent analytical end runs that relied on simplifying assumptions and, often, asymptotic approximations. Our generation is the first ever to have the computing power to rely on the most direct approach, leaving the hard work of implementation to obliging little chunks of silicon. In recent years, almost all teachers of statistics who care about doing well by their students and doing well by their subject have recognized that computers are changing the teaching of our subject. Two major changes were recognized and articulated fifteen years ago by David Moore (1992): we can and should automate calculations; we can and should automate graph- ics. Automating calculation enables multiple re-analyses; we can use such multiple analyses to evaluate sensitivity to changes in model assumptions.
  • Concentration Inequalities for Empirical Processes of Linear Time Series

    Concentration Inequalities for Empirical Processes of Linear Time Series

    Journal of Machine Learning Research 18 (2018) 1-46 Submitted 1/17; Revised 5/18; Published 6/18 Concentration inequalities for empirical processes of linear time series Likai Chen [email protected] Wei Biao Wu [email protected] Department of Statistics The University of Chicago Chicago, IL 60637, USA Editor: Mehryar Mohri Abstract The paper considers suprema of empirical processes for linear time series indexed by func- tional classes. We derive an upper bound for the tail probability of the suprema under conditions on the size of the function class, the sample size, temporal dependence and the moment conditions of the underlying time series. Due to the dependence and heavy-tailness, our tail probability bound is substantially different from those classical exponential bounds obtained under the independence assumption in that it involves an extra polynomial de- caying term. We allow both short- and long-range dependent processes. For empirical processes indexed by half intervals, our tail probability inequality is sharp up to a multi- plicative constant. Keywords: martingale decomposition, tail probability, heavy tail, MA(1) 1. Introduction Concentration inequalities for suprema of empirical processes play a fundamental role in statistical learning theory. They have been extensively studied in the literature; see for example Vapnik (1998), Ledoux (2001), Massart (2007), Boucheron et al. (2013) among others. To fix the idea, let (Ω; F; P) be the probability space on which a sequence of random variables (Xi) is defined, A be a set of real-valued measurable functions. For a Pn function g; denote Sn(g) = i=1 g(Xi): We are interested in studying the tail probability T (z) := P(∆n ≥ z); where ∆n = sup jSn(g) − ESn(g)j: (1) g2A When A is uncountable, P is understood as the outer probability (van der Vaart (1998)).
  • Concentration Inequalities

    Concentration Inequalities

    Concentration Inequalities St´ephane Boucheron1, G´abor Lugosi2, and Olivier Bousquet3 1 Universit´e de Paris-Sud, Laboratoire d'Informatique B^atiment 490, F-91405 Orsay Cedex, France [email protected] WWW home page: http://www.lri.fr/~bouchero 2 Department of Economics, Pompeu Fabra University Ramon Trias Fargas 25-27, 08005 Barcelona, Spain [email protected] WWW home page: http://www.econ.upf.es/~lugosi 3 Max-Planck Institute for Biological Cybernetics Spemannstr. 38, D-72076 Tubingen,¨ Germany [email protected] WWW home page: http://www.kyb.mpg.de/~bousquet Abstract. Concentration inequalities deal with deviations of functions of independent random variables from their expectation. In the last decade new tools have been introduced making it possible to establish simple and powerful inequalities. These inequalities are at the heart of the mathematical analysis of various problems in machine learning and made it possible to derive new efficient algorithms. This text attempts to summarize some of the basic tools. 1 Introduction The laws of large numbers of classical probability theory state that sums of independent random variables are, under very mild conditions, close to their expectation with a large probability. Such sums are the most basic examples of random variables concentrated around their mean. More recent results reveal that such a behavior is shared by a large class of general functions of independent random variables. The purpose of these notes is to give an introduction to some of these general concentration inequalities. The inequalities discussed in these notes bound tail probabilities of general functions of independent random variables.
  • Section 1: Basic Probability Theory

    Section 1: Basic Probability Theory

    ||||| Course Overview Three sections: I Monday 2:30pm-3:45pm Wednesday 1:00pm-2:15pm Thursday 2:30pm-3:45pm Goal: I Review some basic (or not so basic) concepts in probability and statistics I Good preparation for CME 308 Syllabus: I Basic probability, including random variables, conditional distribution, moments, concentration inequality I Convergence concepts, including three types of convergence, WLLN, CLT, delta method I Statistical inference, including fundamental concepts in inference, point estimation, MLE Section 1: Basic probability theory Dangna Li ICME September 14, 2015 Outline Random Variables and Distributions Expectation, Variance and Covariance Definitions Key properties Conditional Expectation and Conditional Variance Definitions Key properties Concentration Inequality Markov inequality Chebyshev inequality Chernoff bound Random variables: discrete and continuous I For our purposes, random variables will be one of two types: discrete or continuous. I A random variable X is discrete if its set of possible values X is finite or countably infinite. I A random variable X is continuous if its possible values form an uncountable set (e.g., some interval on R) and the probability that X equals any such value exactly is zero. I Examples: I Discrete: binomial, geometric, Poisson, and discrete uniform random variables I Continuous: normal, exponential, beta, gamma, chi-squared, Student's t, and continuous uniform random variables The probability density (mass) function I pmf: The probability mass function (pmf) of a discrete random variable X is a nonnegative function f (x) = P(X = x), where x denotes each possible value that X can take. It is always true that P f (x) = 1: x2X I pdf: The probability density function (pdf) of a continuous random variable X is a nonnegative function f (x) such that R b a f (x)dx = P(a ≤X≤ b) for any a; b 2 R.