3 A few more good inequalities, martingale variety1 3.1 From independence to martingales...... 1 3.2 Hoeffding’s inequality for martingales...... 4 3.3 Bennett’s inequality for martingales...... 8 3.4 *Concentration of random polynomials...... 10 3.5 *Proof of the Kim-Vu inequality...... 14 3.6 Problems...... 18 3.7 Notes...... 18
Printed: 8 October 2015
version: 8Oct2015 Mini-empirical printed: 8 October 2015 c David Pollard §3.1 From independence to martingales 1
Chapter 3 A few more good inequalities, martingale variety
Section 3.1 introduces the method for bounding tail probabilities using mo- ment generating functions. Section 3.2 discusses the Hoeffding inequality, both for sums of independent bounded random variables and for martingales with bounded increments. Section 3.3 discusses the Bennett inequality, both for sums of indepen- dent random variables that are bounded above by a constant and their martingale analogs. *Section3.4 presents an extended application of a martingale version of the Bennett inequality to derive a version of the Kim-Vu inequality for poly- nomials in independent, bounded random variables.
3.1 From independence to martingales BasicMG::S:intro Throughout this Chapter {(Si, Fi): i = 1, . . . , n} is a martingale on some probability space (Ω, F, P). That is, we have a sub-sigma fields F0 ⊆ F1 ⊆ · · · ⊆ Fn ⊆ F and integrable, Fi-measurable random variables Si for
which PFi−1 Si = Si−1 almost surely. Equivalently, the martingale differ-
ences ξi := Si − Si−1 are integrable, Fi-measurable, and PFi−1 ξi = 0 almost surely, for i = 1, . . . , n. In that case Si = S0 + ξ1 + ··· + ξi = Si−1 + ξi. For simplicity I assume F0 is trivial, so that S0 = µ = PSn. Also, I usually omit the “almost sure” qualifiers, unless there is some good reason to
Draft: 8Oct2015 c David Pollard §3.1 From independence to martingales 2
remind you that the conditional expectations are unique only up to almost sure equivalence. And to avoid a plethora of subscripts, I abbreviate the
conditional expectation operator PFi−1 to Pi−1. The inequalities from Sections 2.5 and 2.6 have martingale analogs, which have proved particularly useful for the development of concentration inequalities. The analog of the Hoeffding inequality is often attributed to Azuma(1967), even though Hoeffding(1963, pages 17–18) had already noted that
“The inequalities of this section can be strengthened in the following way. Let Sm = X1 + ··· + Xm for m = 1, 2, . . . , n. . . Furthermore, the inequalities of Theorems 1 and 2 remain true if the assumption that X1,X2,...,Xn. are independent 0 is replaced by the weaker assumption that the sequence Sm = Sm − ESm, m = 1, 2, . . . , n, is a martingale, that is, . . . . Indeed, Doob’s inequality (2.17) is true under this assumption. On the other hand, (2.18) implies that the conditional mean of Xm for 0 Sm−1 fixed is equal to its unconditional mean. A slight modifica- tion of the proofs of Theorems 1 and 2 yields the stated result.”
In principle, the martingale results can be proved by conditional analogs of the moment generating function technique described in Section 2.2, with just a few precautions to avoid problems with negligible sets. In that Sec- tion I started with a random variable X that lives on some probability space (Ω, F, P). By means of arguments involving differentiation inside ex- pectations I derived a Taylor expansion for the cumulant generating func- tion,
θX 1 2 \E@ L.Taylor <1> L(θ) = log Pe = θPX + 2 θ varθ∗ (X), where θ∗ lies between 0 and θ and the variance is calculated under the ∗ ∗ probability measure Pθ∗ that has density exp (θ X − L(θ )) with respect to P. The obvious analog for expectations conditional on some sub-sigma- field G of F is
θX 1 2 \E@ L.Taylor.condit <2> log PGe = θPGX + 2 θ varG,θ∗ (X) almost surely..
θX Here the conditional expectation PGe is defined in the usual Kolmogorov way as the almostly surely unique G-measurable random variable Mω(θ) for
Draft: 8Oct2015 c David Pollard §3.1 From independence to martingales 3 which
θX P g(ω)e = P (g(ω)Mω(θ)) at least for all nonnegative, G-measurable random variables g. The random variable Mω(θ) could be changed on some G-measurable negligible set Nθ without affecting the defining equality. Unfortunately, such negligible am- biguities can become a major obstacle if we want the map θ 7→ Mω(θ) to be differentiable, or even continuous. Regular conditional distributions eliminate the obstacle, by consolidating all the bookkeeping with negligible sets into one step. Recall that the distribution of a real-valued random variable X is a prob- ability measure P on B(R) for which Pf(X) = P f, at least for all bounded, B(R)-measurable real functions f.A regular conditional distribution for X given a sub-sigma-field G is a Markov kernel, a set {Pω : ω ∈ Ω} of probability measures on B(R) for which Pωf is a version of the conditional expectation PGf(X) whenever it is well defined. That is, ω 7→ Pωf is G- measurable and Pg(ω)f(X(ω)) = P (g(ω)Pωf) for a suitably large collection of B(R)-measurable functions f and G-measurable functions g. Such regular conditional distributions always exist (Breiman, 1968, Section 4.3).
Remark. As noted by Breiman(1968, page 80), regular conditional distributions also justify the familiar idea that for the purpose of calculating PG conditional expectations we may treat G-measurable functions as constants. For example, if f is a bounded B(R2)\B(R)- measurable real function on R2 and G is a G-measurable random x variable then PGf(X,G) = Pω f(x, G(ω)) almost surely. You might find it instructive to prove this assertion using a generating class argument, starting from the fact that it is valid when f factorizes: f(x, y) = f1(x)f2(y).
θx If we choose Pωe as the version of the conditional moment generating function Mω(θ) then the arguments from Section 2.2 work as before: on the interior of the interval Jω = {θ ∈ R : Mω(θ) < ∞} we have
Lω(θ) := log Mω(θ) 0 θx Lω(θ) = Pω xe /Mω(θ) = Pω,θx 00 2 Lω(θ) = varω,θ(x) = Pω,θ(x − µω,θ) where µω,θ := Pω,θx.
As before, the probability measure Pω,θ is defined on B(R) by its den- xθ sity, dPω,θ/dPω = e /Mω(θ). If 0 and θ are interior points of Jω then
Draft: 8Oct2015 c David Pollard §3.2 Hoeffding’s inequality for martingales 4
equality <2> takes the cleaner form
1 2 Lω(θ) = θPωx + 2 θ varω,θ∗ (x) for all ω, with θ∗, which might depend on ω, lying between 0 and θ. Now suppose there exist functions aω and bω for which Pω[aω, bω] = 1. As in Section 2.5 we have
θ(x−Pωx) 1 2 2 Pωe ≤ exp 8 θ (bω − aω) so that
2 2 θ(X−PGX) θ Rω/8 \E@ condit.Hoeff.mgf <3> PGe ≤ e if Rω ≥ bω − aω. a.s.
Remark. Strangely enough, the first inequality does not require any sort of measurability assumption about aω and bω, because ω is being held fixed. However, we usually need the range Rω to be G-measurable when integrating out <3>.
2 Similarly, if Pω(−∞, bω] = 1 and vω ≥ Pωx then
θ(X−PGX) 1 2 \E@ condit.Bennett.mgf <4> PGe ≤ exp 2 θ vω∆(θbω) for θ ≥ 0, a.s.
where, as before, ∆ denotes the convex increasing function on R for which
1 2 x 2 x ∆(x) = Φ1(x) := e − 1 − x.
3.2 Hoeffding’s inequality for martingales BasicMG::S:HoeffdingMG Consider the martingale {(Si, Fi): i = 1, . . . , n} described at the start of Section 3.1. Suppose the conditional distribution of the increment ξi given Fi−1 con- centrates on an interval whose length is bounded above by an Fi−1-measurable function Ri(ω). Then inequality <3> gives
θξi 1 2 2 \E@ mda.mgf <5> Pi−1e ≤ exp 8 θ Ri .
If the Ri’s are actually constants then θSi θSi−1 θξi 1 2 2 θSi−1 Pe = P e Pi−1e ≤ exp 8 θ Ri Pe ,
Draft: 8Oct2015 c David Pollard §3.2 Hoeffding’s inequality for martingales 5
which, by repeated substitution, gives
2 θSn θS0 Y 1 2 2 θPSn+θ W/8 P 2 Pe ≤ Pe exp 8 θ Ri = e where W = 1≤i≤n Ri . 1≤i≤n
For each nonnegative x,
2 2 \E@ Hoeff.promised <6> P{Sn ≥ x + PSn} ≤ inf exp(−xθ + W θ /8) = exp −2x /W . θ≥0
We have the same bound as for a sum of independent increments, as Ho- effding promised. If the Ri’s are truly random, a slightly more subtle argument leads to the inequality
2 X 2 \E@ range.squares <7> {Sn ≥ x + Sn} ≤ exp(−2 /W ) + { R > W } P P P i≤n i for each constant W . Compare with McDiarmid, 1998, Theorem 3.7. P 2 To prove <7> define Vi := k≤i Rk, which is Fi−1 measurable. For each i, h i 2 2 θξi P exp θSi − θ Vi/8 = P exp θSi−1 − θ Vi/8) Pi−1e 2 ≤ P exp θSi−1 − θ Vi−1/8 by <5>.
Repeated substitution now gives