
3 A few more good inequalities, martingale variety
    3.1 From independence to martingales
    3.2 Hoeffding's inequality for martingales
    3.3 Bennett's inequality for martingales
    3.4 *Concentration of random polynomials
    3.5 *Proof of the Kim-Vu inequality
    3.6 Problems
    3.7 Notes


Chapter 3 A few more good inequalities, martingale variety

Section 3.1 introduces the method for bounding tail probabilities using moment generating functions. Section 3.2 discusses the Hoeffding inequality, both for sums of independent bounded random variables and for martingales with bounded increments. Section 3.3 discusses the Bennett inequality, both for sums of independent random variables that are bounded above by a constant and their martingale analogs. *Section 3.4 presents an extended application of a martingale version of the Bennett inequality to derive a version of the Kim-Vu inequality for polynomials in independent, bounded random variables.

3.1 From independence to martingales

Throughout this Chapter $\{(S_i, \mathcal{F}_i) : i = 1, \dots, n\}$ is a martingale on some probability space $(\Omega, \mathcal{F}, P)$. That is, we have sub-sigma-fields $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}_n \subseteq \mathcal{F}$ and integrable, $\mathcal{F}_i$-measurable random variables $S_i$ for which $P_{\mathcal{F}_{i-1}} S_i = S_{i-1}$ almost surely. Equivalently, the martingale differences $\xi_i := S_i - S_{i-1}$ are integrable, $\mathcal{F}_i$-measurable, and $P_{\mathcal{F}_{i-1}} \xi_i = 0$ almost surely, for $i = 1, \dots, n$. In that case $S_i = S_0 + \xi_1 + \dots + \xi_i = S_{i-1} + \xi_i$. For simplicity I assume $\mathcal{F}_0$ is trivial, so that $S_0 = \mu = P S_n$. Also, I usually omit the "almost sure" qualifiers, unless there is some good reason to remind you that the conditional expectations are unique only up to almost sure equivalence. And to avoid a plethora of subscripts, I abbreviate the conditional expectation operator $P_{\mathcal{F}_{i-1}}$ to $P_{i-1}$.

The inequalities from Sections 2.5 and 2.6 have martingale analogs, which have proved particularly useful for the development of concentration inequalities. The analog of the Hoeffding inequality is often attributed to Azuma (1967), even though Hoeffding (1963, pages 17-18) had already noted that

    “The inequalities of this section can be strengthened in the following way. Let $S_m = X_1 + \dots + X_m$ for $m = 1, 2, \dots, n$. ... Furthermore, the inequalities of Theorems 1 and 2 remain true if the assumption that $X_1, X_2, \dots, X_n$ are independent is replaced by the weaker assumption that the sequence $S'_m = S_m - ES_m$, $m = 1, 2, \dots, n$, is a martingale, that is, .... Indeed, Doob's inequality (2.17) is true under this assumption. On the other hand, (2.18) implies that the conditional mean of $X_m$ for $S'_{m-1}$ fixed is equal to its unconditional mean. A slight modification of the proofs of Theorems 1 and 2 yields the stated result.”

In principle, the martingale results can be proved by conditional analogs of the moment generating function technique described in Section 2.2, with just a few precautions to avoid problems with negligible sets. In that Section I started with a random variable X that lives on some probability space $(\Omega, \mathcal{F}, P)$. By means of arguments involving differentiation inside expectations I derived a Taylor expansion for the cumulant generating function,

<1>    $L(\theta) = \log P e^{\theta X} = \theta P X + \tfrac{1}{2}\theta^2 \operatorname{var}_{\theta^*}(X),$

where $\theta^*$ lies between 0 and $\theta$ and the variance is calculated under the probability measure $P_{\theta^*}$ that has density $\exp\bigl(\theta^* X - L(\theta^*)\bigr)$ with respect to P. The obvious analog for expectations conditional on some sub-sigma-field $\mathcal{G}$ of $\mathcal{F}$ is

<2>    $\log P_{\mathcal{G}} e^{\theta X} = \theta P_{\mathcal{G}} X + \tfrac{1}{2}\theta^2 \operatorname{var}_{\mathcal{G},\theta^*}(X)$ almost surely.

Here the conditional expectation $P_{\mathcal{G}} e^{\theta X}$ is defined in the usual Kolmogorov way as the almost surely unique $\mathcal{G}$-measurable random variable $M_\omega(\theta)$ for which

$P\bigl(g(\omega)e^{\theta X}\bigr) = P\bigl(g(\omega)M_\omega(\theta)\bigr)$

at least for all nonnegative, $\mathcal{G}$-measurable random variables g. The random variable $M_\omega(\theta)$ could be changed on some $\mathcal{G}$-measurable negligible set $N_\theta$ without affecting the defining equality. Unfortunately, such negligible ambiguities can become a major obstacle if we want the map $\theta \mapsto M_\omega(\theta)$ to be differentiable, or even continuous. Regular conditional distributions eliminate the obstacle, by consolidating all the bookkeeping with negligible sets into one step.

Recall that the distribution of a real-valued random variable X is a probability measure $\mathcal{P}$ on $\mathcal{B}(\mathbb{R})$ for which $Pf(X) = \mathcal{P}f$, at least for all bounded, $\mathcal{B}(\mathbb{R})$-measurable real functions f. A regular conditional distribution for X given a sub-sigma-field $\mathcal{G}$ is a Markov kernel, a set $\{P_\omega : \omega \in \Omega\}$ of probability measures on $\mathcal{B}(\mathbb{R})$ for which $P_\omega f$ is a version of the conditional expectation $P_{\mathcal{G}}f(X)$ whenever it is well defined. That is, $\omega \mapsto P_\omega f$ is $\mathcal{G}$-measurable and $P\bigl(g(\omega)f(X(\omega))\bigr) = P\bigl(g(\omega)P_\omega f\bigr)$ for a suitably large collection of $\mathcal{B}(\mathbb{R})$-measurable functions f and $\mathcal{G}$-measurable functions g. Such regular conditional distributions always exist (Breiman, 1968, Section 4.3).

Remark. As noted by Breiman (1968, page 80), regular conditional distributions also justify the familiar idea that for the purpose of calculating $P_{\mathcal{G}}$ conditional expectations we may treat $\mathcal{G}$-measurable functions as constants. For example, if f is a bounded $\mathcal{B}(\mathbb{R}^2)\backslash\mathcal{B}(\mathbb{R})$-measurable real function on $\mathbb{R}^2$ and G is a $\mathcal{G}$-measurable random variable then $P_{\mathcal{G}}f(X, G) = P_\omega^x f(x, G(\omega))$ almost surely. You might find it instructive to prove this assertion using a generating class argument, starting from the fact that it is valid when f factorizes: $f(x, y) = f_1(x)f_2(y)$.

If we choose $P_\omega e^{\theta x}$ as the version of the conditional moment generating function $M_\omega(\theta)$ then the arguments from Section 2.2 work as before: on the interior of the interval $J_\omega = \{\theta \in \mathbb{R} : M_\omega(\theta) < \infty\}$ we have

$L_\omega(\theta) := \log M_\omega(\theta)$
$L'_\omega(\theta) = P_\omega\bigl(x e^{\theta x}\bigr)/M_\omega(\theta) = P_{\omega,\theta}x$
$L''_\omega(\theta) = \operatorname{var}_{\omega,\theta}(x) = P_{\omega,\theta}(x - \mu_{\omega,\theta})^2$ where $\mu_{\omega,\theta} := P_{\omega,\theta}x$.

As before, the probability measure $P_{\omega,\theta}$ is defined on $\mathcal{B}(\mathbb{R})$ by its density, $dP_{\omega,\theta}/dP_\omega = e^{\theta x}/M_\omega(\theta)$. If 0 and $\theta$ are interior points of $J_\omega$ then equality <2> takes the cleaner form

$L_\omega(\theta) = \theta P_\omega x + \tfrac{1}{2}\theta^2\operatorname{var}_{\omega,\theta^*}(x)$ for all $\omega$,

with $\theta^*$, which might depend on $\omega$, lying between 0 and $\theta$. Now suppose there exist functions $a_\omega$ and $b_\omega$ for which $P_\omega[a_\omega, b_\omega] = 1$. As in Section 2.5 we have

$P_\omega e^{\theta(x - P_\omega x)} \le \exp\bigl(\tfrac{1}{8}\theta^2(b_\omega - a_\omega)^2\bigr)$

so that

<3>    $P_{\mathcal{G}} e^{\theta(X - P_{\mathcal{G}}X)} \le e^{\theta^2R_\omega^2/8}$ a.s., if $R_\omega \ge b_\omega - a_\omega$.

Remark. Strangely enough, the first inequality does not require any sort of measurability assumption about aω and bω, because ω is being held fixed. However, we usually need the range Rω to be G-measurable when integrating out <3>.

Similarly, if $P_\omega(-\infty, b_\omega] = 1$ and $v_\omega \ge P_\omega x^2$ then

<4>    $P_{\mathcal{G}} e^{\theta(X - P_{\mathcal{G}}X)} \le \exp\bigl(\tfrac{1}{2}\theta^2v_\omega\Delta(\theta b_\omega)\bigr)$ for $\theta \ge 0$, a.s.,

where, as before, $\Delta$ denotes the convex increasing function on $\mathbb{R}$ for which

$\tfrac{1}{2}x^2\Delta(x) = \Phi_1(x) := e^x - 1 - x.$
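In concrete terms $\Delta(x) = 2(e^x - 1 - x)/x^2$, with $\Delta(0) = 1$ by continuity. The following little Python sketch (my own check, not part of the argument) spot-checks the unconditional analog of <4> for Bernoulli(p) variables, which are bounded above by b = 1 and have $Px^2 = p$:

    import numpy as np

    def Delta(x):
        # Delta(x) = 2(e^x - 1 - x)/x^2, extended by continuity to Delta(0) = 1
        x = np.asarray(x, dtype=float)
        safe = np.where(x == 0.0, 1.0, x)
        return np.where(x == 0.0, 1.0, 2.0 * (np.exp(x) - 1.0 - x) / safe**2)

    # Unconditional version of <4> for X ~ Ber(p): X <= b = 1 and P X^2 = p, so
    #     P exp(theta (X - PX)) <= exp(theta^2 p Delta(theta) / 2)   for theta >= 0.
    for p in (0.1, 0.3, 0.7):
        for theta in (0.5, 1.0, 2.0):
            lhs = p * np.exp(theta * (1 - p)) + (1 - p) * np.exp(-theta * p)
            rhs = np.exp(0.5 * theta**2 * p * Delta(theta))
            assert lhs <= rhs + 1e-12, (p, theta)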

3.2 Hoeffding's inequality for martingales

Consider the martingale $\{(S_i, \mathcal{F}_i) : i = 1, \dots, n\}$ described at the start of Section 3.1. Suppose the conditional distribution of the increment $\xi_i$ given $\mathcal{F}_{i-1}$ concentrates on an interval whose length is bounded above by an $\mathcal{F}_{i-1}$-measurable function $R_i(\omega)$. Then inequality <3> gives

<5>    $P_{i-1}e^{\theta\xi_i} \le \exp\bigl(\tfrac{1}{8}\theta^2R_i^2\bigr).$

If the $R_i$'s are actually constants then

$Pe^{\theta S_i} = P\bigl(e^{\theta S_{i-1}}P_{i-1}e^{\theta\xi_i}\bigr) \le \exp\bigl(\tfrac{1}{8}\theta^2R_i^2\bigr)Pe^{\theta S_{i-1}},$

which, by repeated substitution, gives

$Pe^{\theta S_n} \le Pe^{\theta S_0}\prod_{1\le i\le n}\exp\bigl(\tfrac{1}{8}\theta^2R_i^2\bigr) = e^{\theta PS_n + \theta^2W/8}$ where $W = \sum_{1\le i\le n}R_i^2$.

For each nonnegative x,

<6>    $P\{S_n \ge x + PS_n\} \le \inf_{\theta\ge0}\exp\bigl(-x\theta + W\theta^2/8\bigr) = \exp\bigl(-2x^2/W\bigr).$

We have the same bound as for a sum of independent increments, as Hoeffding promised.

If the $R_i$'s are truly random, a slightly more subtle argument leads to the inequality

<7>    $P\{S_n \ge x + PS_n\} \le \exp\bigl(-2x^2/W\bigr) + P\bigl\{\sum_{i\le n}R_i^2 > W\bigr\}$ for each constant W.

Compare with McDiarmid (1998, Theorem 3.7).

To prove <7> define $V_i := \sum_{k\le i}R_k^2$, which is $\mathcal{F}_{i-1}$-measurable. For each i,

$P\exp\bigl(\theta S_i - \theta^2V_i/8\bigr) = P\bigl[\exp\bigl(\theta S_{i-1} - \theta^2V_i/8\bigr)P_{i-1}e^{\theta\xi_i}\bigr] \le P\exp\bigl(\theta S_{i-1} - \theta^2V_{i-1}/8\bigr)$ by <5>.

Repeated substitution now gives

$P\exp\bigl(\theta S_n - \theta^2V_n/8\bigr) \le P\exp(\theta S_0) = e^{\theta PS_n}.$

The strange $V_n$ contribution is not a problem on the set where $V_n \le W$:

$Pe^{\theta S_n}\{V_n \le W\} \le Pe^{\theta S_n + \theta^2(W - V_n)/8} \le e^{\theta PS_n + \theta^2W/8}$

so that

$P\{S_n \ge x + PS_n\} \le P\{V_n > W\} + \inf_{\theta>0}e^{-\theta(x + PS_n)}Pe^{\theta S_n}\{V_n \le W\}.$

And so on, until inequality <7> emerges.
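To get a feel for how conservative the bound is, here is a small simulation sketch (mine, not part of the text) comparing the right-hand side of <6> with the empirical tail for a sum of independent increments, each uniform on [-1, 1], so that every range is $R_i = 2$ and $W = 4n$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 50, 200_000
    xi = rng.uniform(-1.0, 1.0, size=(trials, n))   # increments with range R_i = 2
    S_n = xi.sum(axis=1)                            # here P S_n = 0
    W = 4.0 * n                                     # sum of squared ranges

    for x in (5.0, 10.0, 15.0):
        empirical = (S_n >= x).mean()
        bound = np.exp(-2.0 * x**2 / W)             # right-hand side of <6>
        print(f"x = {x:4.1f}   empirical tail {empirical:.2e}   Hoeffding bound {bound:.2e}")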

<8> Example. Suppose $X = (X_1, \dots, X_n)$ is a vector of independent random variables and f is a $\mathcal{B}(\mathbb{R}^n)$-measurable function with the bounded difference property: for constants $c_i$,

$|f(x) - f(z)| \le c_i$ if x and z differ only in the ith coordinate.

More succinctly,

<9>    $|f(x_1, \dots, x_n) - f(z_1, \dots, z_n)| \le \sum_{i\le n}c_i\{x_i \ne z_i\}$ for all $x, z \in \mathbb{R}^n$.

Then an appeal to <6> will give

<10>    $P\{f(X) \ge t + Pf(X)\} \le \exp\bigl(-2t^2/\sum_ic_i^2\bigr).$

Remark. If you crave greater generality, let Xi take values in some measurable space (Xi, Ai) and let f be product-measurable. In that setting you will be grateful that <3> does not require measurability of aω and bω.

As before, let me simplify our lives by assuming $P = \otimes_iP_i$, a product measure, and the $X_i$'s are the coordinate maps. Define $Q_i := \otimes_{k>i}P_k$, the measure that integrates out the coordinates $x_{i+1}, \dots, x_n$. Then

$S_i := f_i(x_1, \dots, x_i) := Q_if(x_1, \dots, x_n)$ for $i = 0, 1, \dots, n$

is a martingale (for the filtration $\mathcal{F}_i := \sigma(X_1, \dots, X_i)$) with $S_0 = Pf(X)$ and increments

$\xi_i = f_i(x_1, \dots, x_i) - f_{i-1}(x_1, \dots, x_{i-1}).$

Note that fi inherits a bounded difference property from f:

$|f_i(x_1, \dots, x_i) - f_i(z_1, \dots, z_i)| \le Q_i|f(x_1, \dots, x_i, x_{i+1}, \dots, x_n) - f(z_1, \dots, z_i, x_{i+1}, \dots, x_n)| \le Q_i\sum_{k\le i}c_k\{x_k \ne z_k\} = \sum_{k\le i}c_k\{x_k \ne z_k\}.$

The last integration with respect to $Q_i$ has no effect because the $x_{i+1}, \dots, x_n$ coordinates have already disappeared from the bound. The conditional distribution of $\xi_i$ given $\mathcal{F}_{i-1}$ concentrates on the interval $[a(x_1, \dots, x_{i-1}), b(x_1, \dots, x_{i-1})]$, where

$a(x_1, \dots, x_{i-1}) = \inf_{y_i}f_i(x_1, \dots, x_{i-1}, y_i) - f_{i-1}(x_1, \dots, x_{i-1})$
$b(x_1, \dots, x_{i-1}) = \sup_{z_i}f_i(x_1, \dots, x_{i-1}, z_i) - f_{i-1}(x_1, \dots, x_{i-1})$

Remark. You might worry about the measurability of a and b if the yi and zi range over uncountable sets. It’s a good thing that the argument only depends on an upper bound for b − a.

The interval has length at most $c_i$ because

$|f_i(x_1, \dots, x_{i-1}, z_i) - f_i(x_1, \dots, x_{i-1}, y_i)| \le c_i.$

And so on. □
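As a small numerical illustration (my own sketch, not part of the example), take f to be the sample mean of n independent Uniform[0, 1] variables, so each $c_i = 1/n$ and <10> reduces to the familiar bound $\exp(-2nt^2)$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 100, 200_000
    X = rng.uniform(0.0, 1.0, size=(trials, n))
    f = X.mean(axis=1)                        # bounded differences with c_i = 1/n

    t = 0.1
    empirical = (f >= 0.5 + t).mean()         # here Pf(X) = 1/2
    bound = np.exp(-2.0 * t**2 * n)           # exp(-2 t^2 / sum_i c_i^2)
    print(f"empirical tail {empirical:.2e}   bounded-difference bound {bound:.2e}")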

Remark. McDiarmid (1989, page 159, after Lemma 4.1) noted that the cruder inequality $-c_i \le \xi_i \le c_i$ contributes an unwanted factor of 1/4 in the exponent of <10>. He stressed the importance of working with sharper (conditional) bounds on the squared range, rather than (unconditional) bounds on the increments.

<11> Example. (The Erdős–Rényi random graph) The edges of a graph with m vertices are a subset of the set E of all $n = \binom{m}{2}$ pairs of distinct vertices. Two edges are said to be adjacent if they share an endpoint. Three edges form a triangle if together they contain only three vertices: that is, the edges are $\{i, j\}$, $\{j, k\}$, and $\{k, i\}$ for distinct vertices i, j, k. The subset $\mathcal{T}$ of $E^3$ of all triangles has size $N = \binom{m}{3}$.

The Erdős–Rényi random graph $G_m(p)$ chooses its edges by a set of independent random variables $\{X_e : e \in E\}$, with each $X_e$ distributed Ber(p). That is, $G_m(p)$ includes e when $X_e = 1$, an event with probability p. The number of triangles in $G_m(p)$ equals

$f(X) = \sum_{\{e_1,e_2,e_3\}\in\mathcal{T}}X_{e_1}X_{e_2}X_{e_3}.$

The expected number of triangles is $Pf(X) = Np^3$. A change in a single $x_e$ can have a big effect on f(x). For example, if $x_e = 1$ for all $e \in E$ and just one $x_e$ is changed to zero then m − 2 triangles disappear. The constants corresponding to the $c_i$'s in inequality <9> must all be equal to m − 2. The two-sided version of inequality <10> gives

$P\{|f(X) - Pf(X)| \ge \epsilon\} \le 2\exp\left(-\frac{2\epsilon^2}{n(m-2)^2}\right).$

For what values of p (depending on m) might that inequality assert something nontrivial? For example, suppose we wanted to show that

<12>    $P\{|f(X) - Pf(X)| \ge \tfrac{1}{2}Pf(X)\} \to 0.$

We know that n grows like $m^2$ and N grows like $m^3$, so we have $\epsilon$ of order $m^3p^3$, which makes the exponent in the inequality have order $m^2p^6$. We would need $m^{1/3}p \to \infty$ to make the bound go to zero.

In fact, arguments based on the Chen-Stein method (as in Chapter 5 of Barbour et al., 1992) show that f(X) is approximately Poisson($Np^3$) distributed. To achieve <12> it should suffice to have the mean of the distribution going to infinity, that is, $mp \to \infty$.

Why has inequality <10> fallen so far short of the best answer? The blame lies with the pessimistic choice $c_i = m - 2$ forced by the unlikely configuration where $G_m(p)$ contains all n possible edges. It is much more likely that a possible edge, say $e = \{1, 2\}$, is involved in only about $mp^2$ potential triangles: the expected number of vertices $k \ge 3$ for which $X_{\{1,k\}} = 1 = X_{\{2,k\}}$ equals $(m-2)p^2$. A change in $X_e$ probably changes f(X) by $O(mp^2)$. If by some modification of the argument we could reduce the effective $c_i$ to a term of order $mp^2$ then the exponent in the bound from inequality <10> would be changed to something of order $m^2p^2$, which would then match the conclusion suggested by the Poisson approximation. The Kim-Vu inequality in Section 3.4 captures (modulo a few logs) this idea that the degree of concentration should be controlled by average rather than extreme behavior. To be continued in Example <21>. □
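The pessimism of the choice $c_i = m - 2$ is easy to see numerically. The following simulation sketch (my own; the triangle count uses $\mathrm{trace}(A^3)/6$ for the adjacency matrix A) shows the two-sided bound above becoming vacuous at a moderate (m, p) even though the triangle count genuinely fluctuates:

    import numpy as np

    rng = np.random.default_rng(2)
    m, p, trials = 40, 0.1, 2000
    n_edges = m * (m - 1) // 2
    counts = np.empty(trials)
    for t in range(trials):
        A = np.triu(rng.random((m, m)) < p, k=1).astype(float)
        A = A + A.T                               # adjacency matrix of G_m(p)
        counts[t] = np.trace(A @ A @ A) / 6.0     # number of triangles

    eps = counts.mean() / 2                       # deviation of half the mean count
    empirical = (np.abs(counts - counts.mean()) >= eps).mean()
    bound = 2 * np.exp(-2 * eps**2 / (n_edges * (m - 2)**2))
    print(f"mean count {counts.mean():.1f}   empirical tail {empirical:.3f}   bound {bound:.3f}")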

3.3 Bennett's inequality for martingales

If $X = X_1 + \dots + X_n$, a sum of independent random variables with each $X_i$ bounded above by some constant b and $W \ge \sum_{i\le n}PX_i^2$, then Bennett's inequality from Section 2.6 asserts that

<13>    $P\{X \ge t + PX\} \le \exp\left(-\frac{t^2}{2W}\psi_{\mathrm{Benn}}\left(\frac{bt}{W}\right)\right)$ for $t \ge 0$,

where $\psi_{\mathrm{Benn}}$ is the decreasing, convex function on $[-1, \infty)$ for which

$\tfrac{1}{2}t^2\psi_{\mathrm{Benn}}(t) = h(t) := (1+t)\log(1+t) - t.$
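In code, the two functions and the resulting bound <13> look as follows (a small helper sketch of my own, not part of the text):

    import numpy as np

    def h(t):
        # h(t) = (1+t) log(1+t) - t, defined for t > -1 (and h(-1) = 1 by continuity)
        return (1.0 + t) * np.log1p(t) - t

    def psi_benn(t):
        # psi_Benn(t) = 2 h(t) / t^2, extended by continuity to psi_Benn(0) = 1
        t = np.asarray(t, dtype=float)
        safe = np.where(t == 0.0, 1.0, t)
        return np.where(t == 0.0, 1.0, 2.0 * h(t) / safe**2)

    def bennett_bound(t, W, b):
        # right-hand side of <13>: exp(-(t^2/2W) psi_Benn(bt/W)) = exp(-(W/b^2) h(bt/W))
        return np.exp(-(W / b**2) * h(b * t / W))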

Inequality <4> extends the result to sums of martingale differences. The following theorem illustrates the method. The resulting bound interests me particularly because of its application to the Kim-Vu inequality in Sections 3.4 and 3.5.

<14> Theorem. Suppose $S_i = \mu + \xi_1 + \dots + \xi_i$ for $i \le n$ is the martingale, with $\mu = PS_n$, described at the start of Section 3.1. Let $V_i := \sum_{1\le k\le i}v_k$, where $v_i$ is an $\mathcal{F}_{i-1}$-measurable random variable with $v_i \ge P_{\mathcal{F}_{i-1}}\xi_i^2$, and let $M_i$ be an $\mathcal{F}_{i-1}$-measurable random variable for which $\xi_i \le M_i$. Then, for each $x \ge 0$ and nonnegative constants b and W,

$P\{S_n \ge x + \mu\} \le \exp\left(-\frac{x^2}{2W}\psi_{\mathrm{Benn}}\left(\frac{bx}{W}\right)\right) + P\{\max_iM_i > b \text{ or } V_n > W\}.$

<15> Corollary. If, in addition to the assumptions of the Theorem, we actually have two-sided control over the increments, $|\xi_i| \le M_i$, then $P\{|S_n - \mu| \ge x\}$ is less than twice the upper bound for $P\{S_n \ge x + \mu\}$.

Remark. The sequences {vi} and {Mi} are said to be predictable, because the values vi and Mi are both determined by what we learn up to step i − 1. They are used in the proof to define predictable stopping times. Compare with Pollard(2001, Section 6.5).

Proof. As with many martingale arguments, the main idea is to multiply the increments $\xi_i$ by predictable weights, which sets up an appeal to inequality <4>. The weights come from the stopping time

$\tau := \inf\{i \ge 1 : M_i > b \text{ or } V_i > W\}.$

As usual, the infimum of an empty set equals +∞; that is, τ = +∞ if maxi Mi ≤ b and Vn ≤ W . The stopping time is predictable because

{τ ≤ i} = {maxk≤i Mk > b or Vi > W } ∈ Fi−1 for 1 ≤ i ≤ n.

The reweighted increment ηi := ξi{i < τ} also has zero PFi−1 conditional expectation, because {i < τ} is Fi−1-measurable:

Pi−1ηi = Pi−1 (ξi{i < τ}) = {i < τ}Pi−1ξi = 0.

Moreover ηi ≤ b, because Mi ≤ b when τ > i; and

$P_{i-1}\eta_i^2 = \{i < \tau\}P_{i-1}\xi_i^2 \le v_i.$

For each $\theta \ge 0$ inequality <4> implies $P_{i-1}e^{\theta\eta_i} \le \exp\bigl(\tfrac{1}{2}\theta^2v_i\Delta(\theta b)\bigr)$ so that

$\{i < \tau\}P_{i-1}\exp\bigl(\theta\xi_i - \tfrac{1}{2}\theta^2v_i\Delta(\theta b)\bigr) \le 1.$

Now argue as in Section 3.2, this time for the process

$D_i = \exp\bigl(\theta S_i - \tfrac{1}{2}\theta^2V_i\Delta(\theta b)\bigr),$

suitably stopped. For 1 ≤ i ≤ n,

$PD_i\{i < \tau\} = P\bigl(D_{i-1}\{i < \tau\}P_{i-1}e^{\theta\xi_i - \theta^2v_i\Delta(\theta b)/2}\bigr) \le PD_{i-1}\{i - 1 < \tau\}.$

By repeated substitution,

$\exp\bigl(-\tfrac{1}{2}\theta^2W\Delta(\theta b)\bigr)Pe^{\theta S_n}\{n < \tau\} \le PD_n\{n < \tau\} \le PD_0\{0 < \tau\} = e^{\theta\mu}.$

Conclude that

$P\{S_n \ge x + \mu\} \le P\{\tau \le n\} + \inf_{\theta\ge0}P\bigl(\{\tau > n\}e^{\theta(S_n - x - \mu)}\bigr) \le P\{\tau \le n\} + \inf_{\theta\ge0}\exp\bigl(-\theta x + \tfrac{1}{2}\theta^2W\Delta(\theta b)\bigr).$

Calculate the final infimum in the same way as in Section 2.6. □
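For the record, here is one way to carry out that final optimization (my reconstruction of the Section 2.6 calculation, not a quotation). Since $\tfrac{1}{2}\theta^2b^2\Delta(\theta b) = e^{\theta b} - 1 - \theta b$, the exponent to be minimized is $-\theta x + (W/b^2)(e^{\theta b} - 1 - \theta b)$. Setting its derivative to zero gives $e^{\theta b} = 1 + bx/W$, that is $\theta = b^{-1}\log(1 + bx/W)$, and the minimum value is

$-\frac{W}{b^2}\Bigl[\Bigl(1 + \frac{bx}{W}\Bigr)\log\Bigl(1 + \frac{bx}{W}\Bigr) - \frac{bx}{W}\Bigr] = -\frac{W}{b^2}h\Bigl(\frac{bx}{W}\Bigr) = -\frac{x^2}{2W}\psi_{\mathrm{Benn}}\Bigl(\frac{bx}{W}\Bigr),$

which is exactly the exponent in the statement of the Theorem.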

3.4 *Concentration of random polynomials

The main ideas for the following material come from the papers by Kim and Vu (2000) and Vu (2002). See the Notes for an explanation of the liberties I am taking in referring to the arguments as the Kim-Vu method.

Their method deals with random variables expressible as polynomials f(X) in independent random variables $X = (X_1, \dots, X_n)$, each taking values in [0, 1]. For example, we might have $f(x) = 3x_1^2x_2^5 + 2x_1^3x_2^3x_3x_4 + 7x_1^3x_4^5x_5x_9 + 6x_8^8x_9^9$ for n = 9. The degree of f is defined to be the largest of the degrees of its terms, 17 for the example just given. At the cost of an extra factor of 2 we may assume that all the coefficients in the polynomial are nonnegative.

The problem is to find useful exponential tail bounds for $|f(X) - Pf(X)|$. Even though the polynomials do satisfy bounded difference conditions, as in Example <8>, the $c_i$ coefficients can be too big for the bounds from that Example to be helpful; the squared ranges can be too crude an upper bound for the variances. Instead, a recursive appeal to Theorem <14> will give a better result that involves two quantities, $E_0(f)$ and $E_1(f)$ (see <18> below), that play the scaling role of a variance. As before $h(t) := (1+t)\log(1+t) - t$ for $t \ge -1$.

<16> Kim-Vu inequality. For every kth degree polynomial f(X) (with nonnegative coefficients),

<17>    $P\{|f(X) - Pf(X)| \ge C_k\lambda^k\sqrt{E_0(f)E_1(f)}\} \le C_{n,k}e^{-h(\lambda)}$ for all $\lambda \ge 0$,

where $C_{n,k} \le 4(n+1)^{k-1}$ and the constants $C_k$ are defined recursively by $C_1 = 1$ and $C_k = 2k(1 + C_{k-1})$.

For the sake of comparison, here is the two-sided version of the Bennett inequality <13> for a sum $X = X_1 + \dots + X_n$ of independent random variables with $|X_i| \le b$ and $W \ge \sum_iPX_i^2$: for $t \ge 0$,

$P\{|X - PX| \ge t\} \le 2\exp\left(-\frac{t^2}{2W}\psi_{\mathrm{Benn}}\left(\frac{bt}{W}\right)\right) = 2\exp\left(-\frac{W}{b^2}h\left(\frac{bt}{W}\right)\right).$

It takes a bit of notation to describe the two scaling constants, $E_0(f)$ and $E_1(f)$. For me, digesting their definition was the hardest part of understanding the papers.

Once again I simplify our lives by assuming that $P = \otimes_{i\le n}P_i$ on $\mathcal{B}[0,1]^n$ and the $X_i$'s are the coordinate maps: $X_i(x) = x_i$ for $x = (x_1, \dots, x_n) \in [0,1]^n$. Write $Z_n$ for the set of all n-tuples of nonnegative integers and $|a|$ for $\sum_ia_i$ if $a \in Z_n$. For $a, \nu \in Z_n$ write $\nu \le a$ to mean $\nu_i \le a_i$ for all i and define

$x^a := \prod_{i\le n}x_i^{a_i}$ and $\partial_\nu = \dfrac{\partial^{\nu_1}}{\partial x_1^{\nu_1}}\cdots\dfrac{\partial^{\nu_n}}{\partial x_n^{\nu_n}},$

where $x_i^0 \equiv 1$ and $\partial^0/\partial x_i^0$ factors are ignored. For example, if $x \in [0,1]^n$ then

$\partial_\nu x^a = \begin{cases}\prod_{a_i>0}(a_i)_{\nu_i}x_i^{a_i-\nu_i} & \text{if } \nu \le a\\ 0 & \text{otherwise}\end{cases}$

where

$(a_i)_{\nu_i} := a_i(a_i - 1)\cdots(a_i - \nu_i + 1) = \dfrac{a_i!}{(a_i - \nu_i)!}$ for $0 \le \nu_i \le a_i$.

Notice that the contribution from $x_i$ disappears (leaving only an $a_i!$) if $\nu_i = a_i$. This small fact plays a subtle role in the Kim-Vu method. The expected value of various $\partial_\nu x^a$ terms will appear as coefficients in the polynomials that define the predictable $M_i$'s and the conditional variances $v_i$ that are needed for an appeal to Theorem <14>.

The kth degree polynomial can be written as

$f(x) = f_A(x) = \sum_{a\in A}w_ax^a,$

where A is a finite subset of $Z_n$ with $\max_{a\in A}|a| = k$ and $w_a \ge 0$. The polynomial has expected value

$Pf_A(x) = \sum_{a\in A}w_a\mu_a$ where $\mu_a := \prod_{a_i>0}P_ix_i^{a_i}.$

Without loss of generality we may assume $0 \notin A$ (cf. $Px^0 = 1 = \mu_0$). At last I have enough notation to define the scale factors appearing in <17>:

<18>    $E_j(f) := \max_{j\le\ell\le k}D_\ell(f)$ where $D_\ell(f) := \max_{|\nu|=\ell}P\partial_\nu f.$

The $E_j$'s for $j \ge 2$ will appear during the proof (using induction on k) in Section 3.5. Note that $D_0(f) = Pf(x)$ and (Problem [2]) $D_\ell(f) = 0$ for $\ell > k$ = the degree of f. By construction, $E_0(f) \ge Pf(x)$ and $E_0(f) \ge E_1(f) \ge \dots \ge E_k(f) \ge 0 = E_j(f)$ for $j > k$. For the remainder of the current Section let me try to give some feel for what calculation of the $E_j$'s involves.

<19> Example. For $g(x) = 3x_1^2x_2^5 + 2x_1^3x_2^3x_3x_4 + 7x_1^3x_4^5x_5x_9 + 6x_8^8x_9^9$ and $\nu = (1, 2, 0, \dots, 0)$:

$\partial_\nu g(x) = 120x_1x_2^3 + 36x_1^2x_2x_3x_4 + 0 + 0$

and $P\partial_\nu g(x) = 120(P_1x_1)(P_2x_2^3) + 36(P_1x_1^2)(P_2x_2)(P_3x_3)(P_4x_4)$. □
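The same calculation can be checked mechanically. Here is a small sympy sketch (my own, using the exponents as displayed in the example) that reproduces the coefficients 120 and 36:

    import sympy as sp

    x1, x2, x3, x4, x5, x8, x9 = sp.symbols('x1 x2 x3 x4 x5 x8 x9')
    g = 3*x1**2*x2**5 + 2*x1**3*x2**3*x3*x4 + 7*x1**3*x4**5*x5*x9 + 6*x8**8*x9**9
    # nu = (1, 2, 0, ..., 0): differentiate once in x1 and twice in x2
    d = sp.diff(g, x1, 1, x2, 2)
    print(sp.expand(d))        # 120*x1*x2**3 + 36*x1**2*x2*x3*x4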

<20> Example. Calculate the $E_j(f)$ for the simplest case where f has degree 1, that is, $f(x) = \sum_{i\le n}w_ix_i$ with $0 \le w_i$ for all i. (Remember that I assume $0 \notin A$, so that there is no constant term.) Write b for $\max_iw_i$ and (at the risk of minor notational confusion) $\mu_i$ for $P_ix_i$. Calculate:

• $E_j(f) = 0$ for $j \ge 2$.

• If $|\nu| = 1$ then there is a single i for which $\nu_i = 1$ and $\partial_\nu f = w_i$. Thus $E_1(f) = b$.

• If $|\nu| = 0$ then $\partial_\nu f(x) = f(x)$ so that $E_0(f) = \max(b, \sum_iw_i\mu_i)$.

Consequently,

$\gamma^2 := E_0(f)E_1(f) = \max\bigl(b^2, b\sum_iw_i\mu_i\bigr)$

and the k = 1 case of the Kim-Vu inequality becomes

$P\bigl\{\bigl|\sum_{i\le n}w_i(x_i - \mu_i)\bigr| \ge \lambda\gamma\bigr\} \le 4\exp(-h(\lambda)).$

Compare with the two-sided version of the Bennett inequality <13> for independent summands $w_ix_i$ with $W := \gamma^2 \ge b\sum_iw_i\mu_i \ge \sum_iP(w_ix_i)^2$ and $\max_i|w_ix_i| \le b$:

$F := P\bigl\{\bigl|\sum_iw_i(x_i - \mu_i)\bigr| \ge \lambda\gamma\bigr\} \le 2\exp\left(-\frac{\gamma^2}{b^2}h\left(\frac{b\lambda\gamma}{\gamma^2}\right)\right).$

From the fact that $h(t)/t^2$ is a decreasing function and $b/\gamma \le 1$ we have

$\frac{h(\lambda b/\gamma)}{(\lambda b/\gamma)^2} \ge \frac{h(\lambda)}{\lambda^2}$ and $F \le 2\exp(-h(\lambda)).$

That is, the Kim-Vu inequality for k = 1 follows directly from the Bennett inequality. □
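A quick numerical sanity check of the monotonicity step (my own sketch): for random $\lambda$, $\gamma$ and $b \le \gamma$, the quantity $(\gamma/b)^2h(\lambda b/\gamma)$ should never fall below $h(\lambda)$.

    import numpy as np

    def h(t):
        return (1.0 + t) * np.log1p(t) - t

    rng = np.random.default_rng(3)
    for _ in range(10_000):
        lam = rng.uniform(0.0, 50.0)
        gamma = rng.uniform(0.1, 10.0)
        b = rng.uniform(1e-3, 1.0) * gamma          # ensures b/gamma <= 1
        lhs = (gamma / b)**2 * h(lam * b / gamma)
        assert lhs >= h(lam) - 1e-9 * max(1.0, h(lam))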

<21> Example. (Continued from Example <11>) Recall the challenge of proving concentration of the number of triangles

$f(x) = \sum_{\{e_1,e_2,e_3\}\in\mathcal{T}}x_{e_1}x_{e_2}x_{e_3}$

near its expected value $Np^3$ for the Erdős–Rényi random graph on m vertices with $X_e \sim \mathrm{Ber}(p)$ for each of the $n = \binom{m}{2}$ possible edges and $N = \#\mathcal{T} = \binom{m}{3}$. Specifically, I wanted to show that

<22>    $P\{|f(x) - Pf(x)| \ge \tfrac{1}{2}Pf(x)\} \to 0$ if $mp \to \infty$.

Calculate Dj(f) for j = 0, 1, 2, 3. First notice that if maxe νe ≥ 2 then ∂νf = 0: in the sum that defines f(x) no xe is raised to a power higher than 1. For Dj(f) we have only to consider ν with exactly j components equal to 1, the rest equal to 0.

• For j = 3 suppose $\nu_e = \nu_{e'} = \nu_{e''} = 1$. Then $\partial_\nu x_{e_1}x_{e_2}x_{e_3} = 0$ unless $\{e, e', e''\} = \{e_1, e_2, e_3\} \in \mathcal{T}$. Thus $D_3(f) = 1$.

• For j = 2 suppose $\nu_e = \nu_{e'} = 1$. Then $\partial_\nu x_{e_1}x_{e_2}x_{e_3} = 0$ unless e and e' share an endpoint, say $e = \{i, \alpha\}$ and $e' = \{i, \beta\}$, and the triangle $\{e_1, e_2, e_3\}$ has vertices $i, \alpha, \beta$. Thus $D_2(f) = p$.

• For j = 1 suppose $\nu_e = 1$, for $e = \{\alpha, \beta\}$. Then $\partial_\nu x_{e_1}x_{e_2}x_{e_3} = 0$ unless the triangle has vertices $\alpha, \beta, \gamma$, for $\gamma$ one of the m − 2 vertices not in e. Thus $D_1(f) = (m-2)p^2$.

For large enough values of mp,

$E_2(f) = E_3(f) = 1$
$E_1(f) = \max\bigl(1, (m-2)p^2\bigr) \le 1 + mp^2$
$E_0(f) = \max\bigl(Pf(x), E_1(f)\bigr) = (1/6 + o(1))(mp)^3.$

How close to <22> can we get using <17>? We certainly need $h(\lambda)$ to grow faster than $4\log m$ to ensure that the upper bound in

$P\{|f(X) - Pf(X)| \ge C_3\lambda^3\sqrt{E_0(f)E_1(f)}\} \le 4(n+1)^2e^{-h(\lambda)}$

goes to zero. And to also ensure that

$C_3\lambda^3\sqrt{E_0(f)E_1(f)} \le C_3\lambda^3\sqrt{E_0(f)(1 + mp^2)} \le \tfrac{1}{4}Pf(x) = \tfrac{1}{4}\binom{m}{3}p^3$

we need mp greater than some suitably large constant multiple of $\log^2m$. Pretty close to <22>. □
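The arithmetic is easy to automate. The following sketch (my own; it takes $C_3 = 2\cdot3(1 + C_2) = 54$ from the recursion, and the choice $\lambda = 4\log m$ is a hypothetical one that kills the $4(n+1)^2$ factor) reports when the k = 3 threshold $C_3\lambda^3\sqrt{E_0(f)E_1(f)}$ drops below $\tfrac{1}{2}Pf(x)$:

    from math import comb, exp, log

    def h(t):
        return (1.0 + t) * log(1.0 + t) - t

    def triangle_report(m, p):
        lam = 4 * log(m)                      # enough to make 4(n+1)^2 exp(-h(lam)) small
        n = comb(m, 2)                        # potential edges
        Pf = comb(m, 3) * p**3                # expected triangle count
        E1 = max(1.0, (m - 2) * p**2)
        E0 = max(Pf, E1)
        threshold = 54 * lam**3 * (E0 * E1)**0.5
        tail = 4 * (n + 1)**2 * exp(-h(lam))
        print(f"m={m:.0e}, p={p:.0e}: Pf/2={Pf/2:.2e}, threshold={threshold:.2e}, tail bound={tail:.1e}")

    triangle_report(10**12, 1e-6)   # threshold < Pf/2: the inequality says something
    triangle_report(10**12, 1e-7)   # threshold > Pf/2: still vacuous at this sparsity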

3.5 *Proof of the Kim-Vu inequality

First note that inequality <17> holds for trivial reasons whenever the right-hand side $C_{n,k}e^{-h(\lambda)}$ is at least 1, which takes care of small values of $\lambda$. We may assume $\lambda > 3$.

Remark. I had hoped to write the proof in a way that would reveal the reasons for defining the Ej’s as in <18> and to show how much flexibility we have in the choice of constants, such as the bk and Wk for inequality <23>. Unfortunately I succumbed to the temptation of just checking that chosen values lead to the desired conclusion. See Problem [4].

Argue almost as in Example <8>, except that the martingale version of Bennett’s inequality takes over the role played by the martingale version of Hoeffding’s inequality. That is, express Sn = f(x) as Pf(x) plus a sum of martingale differences,

ξi = fi(x1, . . . , xi) − fi−1(x1, . . . , xi−1)


where

$f_i(x_1, \dots, x_i) = Q_if(x_1, \dots, x_i, x_{i+1}, \dots, x_n) = \sum_{a\in A}w_a\Bigl(\prod_{\ell\le i}x_\ell^{a_\ell}\Bigr)\Bigl(\prod_{\ell>i}P_\ell x_\ell^{a_\ell}\Bigr) = \sum_{a\in A}w_ax^{h[a,i]}\mu_{t[a,i]}$

with $h[a,i] := (a_1, \dots, a_i, 0, \dots, 0)$ and $t[a,i] := (0, \dots, 0, a_{i+1}, \dots, a_n)$. The 'head', $h[a,i]$, and the 'tail', $t[a,i]$, of the vector a play different roles. The increment $\xi_i$ is bounded in absolute value by the predictable function

$M_i(x_1, \dots, x_{i-1}) := \sum_{a\in A}w_a\{a_i > 0\}x^{h[a,i-1]}\mu_{t[a,i]}$

because

$|\xi_i| = \Bigl|\sum_{a\in A}w_ax^{h[a,i-1]}\bigl(x_i^{a_i} - P_ix_i^{a_i}\bigr)\mu_{t[a,i]}\Bigr|.$

The factor $x_i^{a_i} - P_ix_i^{a_i}$ is zero when $a_i = 0$ and otherwise is bounded in absolute value by 1.

For $\tau_k(f) = C_k\sqrt{E_0(f)E_1(f)}$ and $V_n = \sum_{i\le n}P_{i-1}\xi_i^2$ and all choices of positive constants $W_k$ and $b_k$, Theorem <14> gives

<23>    $\mathrm{prob}_k(f, \lambda) := P\{|f(x) - Pf(x)| \ge \tau_k(f)\lambda^k\} \le \mathrm{benn}_k + \mathrm{bad}_k$

where

<24>    $\mathrm{benn}_k := 2\exp\left(-\frac{W_k}{b_k^2}h\left(\frac{\tau_k\lambda^kb_k}{W_k}\right)\right)$

and

<25>    $\mathrm{bad}_k := P\{\max_iM_i > b_k \text{ or } V_n > W_k\} \le \sum_iP\{M_i > b_k\} + P\{\max_iM_i \le b_k \text{ and } V_n > W_k\}.$

Notice that $M_i$ is a polynomial of degree at most k − 1. The upper bound for $\mathrm{bad}_k$ suggests an induction on k. The conditional variance term $V_n$ is also a polynomial, but unfortunately with degree usually greater than k. Luckily, on the set where all the $M_i$ are bounded above by $b_k$, we have

$V_n \le \sum_{i\le n}P_{i-1}\xi_i^2\{M_i \le b_k\} \le b_k\sum_{i\le n}P_{i-1}|\xi_i| \le b_kU_n,$

where

<26>    $U_n := 2\sum_{i\le n}\sum_{a\in A}w_a\{a_i > 0\}x^{h[a,i-1]}\mu_{t[a,i-1]},$

a polynomial of degree at most k − 1.


Remark. The ath summand for $M_i$ equals $w_a\{a_i > 0\}x^{h[a,i-1]}\mu_{t[a,i]}$ whereas the (i, a)th summand for $U_n$ equals the same thing multiplied by $P_ix_i^{a_i}$. I was sorely tempted to discard that extra factor, replacing $U_n$ by $\sum_iM_i$. Not a good idea. Look at the proof of Lemma <28> to see why.

Temporarily write $\gamma$ for $\tau_k(f)\lambda^{k-1}$. If we choose $W_k = \gamma^2$ and $b_k = \gamma\theta_k$ for some $\theta_k$ in (0, 1], inequality <24> takes a much tidier form:

$\mathrm{benn}_k \le 2e^{-h(\lambda\theta_k)/\theta_k^2} \le 2e^{-h(\lambda)},$

again by virtue of the fact that $h(t)/t^2$ is a decreasing function. Inequality <23> then becomes

$\mathrm{prob}_k(f, \lambda) \le 2e^{-h(\lambda)} + \sum_iP\{M_i > \gamma\theta_k\} + P\{U_n > \gamma/\theta_k\}.$

If

<27>    $\gamma\theta_k - PM_i \ge \tau_{k-1}(M_i)\lambda^{k-1}$ and $\gamma/\theta_k - PU_n \ge \tau_{k-1}(U_n)\lambda^{k-1}$

then the inductive hypothesis (for k − 1) would give

$\mathrm{prob}_k(f, \lambda) \le 2e^{-h(\lambda)} + nC_{n,k-1}e^{-h(\lambda)} + C_{n,k-1}e^{-h(\lambda)} = C_{n,k}e^{-h(\lambda)}$

where

$C_{n,k} = 2 + (n+1)C_{n,k-1}.$

Example <20> gave Cn,1 = 2. Repeated substitution followed by summation of a geometric progression (Problem [3]) gives

$C_{n,k} = 2\,\frac{(n+1)^k - 1}{n} \le 4(n+1)^{k-1},$

as asserted.

It remains only to check inequalities <27> for an appropriate $\theta_k$. Because $PM_i \le E_0(M_i)$ and $PU_n \le E_0(U_n)$, it suffices to check that

$C_k\theta_k\sqrt{E_0(f)E_1(f)} \ge E_0(M_i) + C_{k-1}\sqrt{E_0(M_i)E_1(M_i)}$
$C_k\sqrt{E_0(f)E_1(f)}/\theta_k \ge E_0(U_n) + C_{k-1}\sqrt{E_0(U_n)E_1(U_n)}$

Choose $\theta_k = \sqrt{E_1(f)/E_0(f)}$ and note that $E_1(M_i) \le E_0(M_i)$ and $E_1(U_n) \le E_0(U_n)$ to reduce the task to showing that

$C_kE_1(f) \ge (1 + C_{k-1})E_0(M_i)$ and $C_kE_0(f) \ge (1 + C_{k-1})E_0(U_n).$

The next Lemma should make clear to you why the constants are defined recursively by $C_1 = 1$ and $C_k = 2k(1 + C_{k-1})$. The Lemma exploits the fact that averaging out with respect to some $x_i$ coordinates or differentiating with respect to some $x_i$ both reduce the degree of the polynomial $f(x) = \sum_{a\in A}w_ax^a$.

<28> Lemma. For all j and i,

(i) Ej(Mi) ≤ 2Ej+1(f)

(ii) Ej(Un) ≤ 2kEj(f)

Proof For (i), for a fixed i and ν ∈ Zn,

$P\partial_\nu M_i = \sum_{\ell=1}^{k}\sum_{a\in A}w_a\{a_i = \ell\}\{\nu \le h[a,i-1]\}\,P\partial_\nu x^{h[a,i-1]}\mu_{t[a,i]}.$

I inserted the indicator $\{\nu \le h[a,i-1]\}$ just to remind you that $\partial_\nu x^{h[a,i-1]} = 0$ if $\nu \not\le h[a,i-1]$. Compare the $(\ell, a)$th term with

$P\partial_\nu x^a = P\partial_\nu x^{h[a,i-1]}\bigl(P_ix_i^\ell\bigr)\mu_{t[a,i]}.$

We can eliminate the $P_ix_i^\ell$ by augmenting $\nu$ to differentiate out the $x_i^\ell$. Let e denote the $Z_n$ vector with $e_i = 1$ and $e_r = 0$ for $r \ne i$. Then

$P\partial_{\nu+\ell e}x^a = P\partial_\nu x^{h[a,i-1]}(\ell!)\,\mu_{t[a,i]}.$

Thus

$P\partial_\nu M_i = \sum_{\ell=1}^{k}\frac{1}{\ell!}\sum_{a\in A}w_a\{a_i = \ell\}\{\nu \le h[a,i-1]\}P\partial_{\nu+\ell e}x^a \le \sum_{\ell=1}^{k}\frac{1}{\ell!}P\sum_{a\in A}w_a\partial_{\nu+\ell e}x^a = \sum_{\ell=1}^{k}\frac{1}{\ell!}P\partial_{\nu+\ell e}f(x) \le \sum_{\ell=1}^{k}\frac{1}{\ell!}D_{|\nu+\ell e|}(f).$

If $|\nu| = j$ then $|\nu + \ell e| \ge j + 1$ and the last sum is less than $(e - 1)E_{j+1}(f)$.

Argue similarly for (ii), using the fact that $\{a_i > 0\} \le a_i$ and $\sum_ia_i \le k$. This time there is no need to eliminate the $x_i^{a_i}$ term:

$P\partial_\nu U_n \le 2\sum_{i\le n}\sum_{a\in A}w_aa_i\,P\partial_\nu x^{h[a,i-1]}\mu_{t[a,i-1]} \le 2\sum_{a\in A}w_a\Bigl(\sum_{i\le n}a_i\Bigr)P\partial_\nu x^a \le 2kP\partial_\nu f(x).$

The last term is bounded by $D_{|\nu|}(f)$. □

3.6 Problems

[1] Prove an analog of Theorem <14> under the Bernstein-like assumption: $P_{i-1}\xi_i^2 = v_i$ and $P_{i-1}|\xi_i|^k \le \tfrac{1}{2}v_ik!M_i^{k-2}$ for $k \ge 3$, where each $M_i$ is $\mathcal{F}_{i-1}$-measurable.

[2] In the notation of Section 3.4, prove that $\partial_\nu x^a = 0$ if $|\nu| = \sum_i\nu_i > \sum_ia_i = |a|$. Hint: We must have $\nu_i > a_i$ for at least one i.

[3] Suppose $m_1, m_2, \dots$ is a sequence of real numbers with $m_1 = \alpha > 0$ and $m_k = \alpha + \beta m_{k-1}$, with $\beta > 1$. Show that

$m_k = \alpha(1 + \beta + \beta^2 + \dots + \beta^{k-1}) = \alpha\,\frac{\beta^k - 1}{\beta - 1}.$
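For instance (a quick numerical check of my own, not a proof), the recursion from Section 3.5 is the case $\alpha = 2$, $\beta = n + 1$, and the closed form together with the bound $C_{n,k} \le 4(n+1)^{k-1}$ can be verified directly:

    def C(n, k):
        # recursion C_{n,1} = 2, C_{n,k} = 2 + (n+1) C_{n,k-1}
        return 2 if k == 1 else 2 + (n + 1) * C(n, k - 1)

    for n in (3, 10, 50):
        for k in (1, 2, 3, 4):
            closed = 2 * ((n + 1)**k - 1) // n      # alpha (beta^k - 1)/(beta - 1)
            assert C(n, k) == closed <= 4 * (n + 1)**(k - 1)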

[4] Rewrite Sections 3.4 and 3.5 to make clearer how much flexibility we have in the definition of the $E_j$'s and in the choice of the constants $W_k$ and $b_k$. In particular, I am not particularly satisfied with the part of the argument following equation <26>. Then send me your improved version.

3.7 Notes

The inequality for bounded differences in Example <8> is often attributed to McDiarmid. However McDiarmid (1989) seemed to be giving more credit to Hoeffding. See Boucheron et al. (2013, Chapter 6) for some powerful ways to extend the bounded difference idea.

Freedman (1975) proved analogs of the Bernstein inequality, for sums of bounded martingale differences, with a sum of conditional variances taking over the role played by the second moment bound W.

The Bennett inequality for independent summands was proved by Bennett (1962). The extension to martingales with increments that are bounded in absolute value by 1 is due to Freedman (1975). The relaxation to predictable upper bounds is implicit in Sections 3.3 and 3.4 of the paper of Vu (2002), although he did not express it via stopping times. He did not appeal to Bennett's inequality, but instead derived the necessary exponential bounds by direct arguments.

Something similar to inequality <17> appeared in the Kim and Vu (2000) paper, but only for $X_i$'s taking values in {0, 1}. Vu (2002) developed more elaborate versions of the inequality. Most of my discussion in Sections 3.4 and 3.5 is a translation of

the Vu and Kim-Vu papers into the framework of Bennett's inequality for martingales. Both papers used the numbers of triangles for the Erdős–Rényi random graph (my Examples <11> and <21>) to contrast their results with the results using the method of bounded differences.

References

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal 19 (3), 357–367.

Barbour, A. D., L. Holst, and S. Janson (1992). Poisson Approximation. Oxford University Press.

Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57, 33–45.

Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Breiman, L. (1968). Probability. Reading, Massachusetts: Addison-Wesley.

Freedman, D. A. (1975). On tail probabilities for martingales. Annals of Probability 3 (1), 100–118.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.

Kim, J. H. and V. H. Vu (2000). Concentration of multivariate polynomials and its applications. Combinatorica 20 (3), 417–434.

McDiarmid, C. (1989). On the method of bounded differences. In J. Siemons (Ed.), Surveys in Combinatorics, Volume 141 of London Mathematical Society Lecture Notes, pp. 148–188. Cambridge University Press.

McDiarmid, C. (1998). Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed (Eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195–248. Springer-Verlag.

Pollard, D. (2001). A User's Guide to Measure Theoretic Probability. Cambridge University Press.

Vu, V. H. (2002). Concentration of non-Lipschitz functions and applications. Random Structures and Algorithms 20, 262–316.
