BERNSTEIN POLYNOMIALS AND BROWNIAN MOTION
EMMANUEL KOWALSKI
1. Introduction. One of the greatest pleasures in mathematics is the surprising connections that often appear between apparently disconnected ideas and theories. Some particularly striking instances exist in the interaction between probability theory and analysis. One of the simplest is the elegant proof of the Weierstrass approximation theorem by S. Bernstein [2]: on the surface, this states that if $f : [0,1] \to \mathbf{C}$ is a continuous function, then the sequence of Bernstein polynomials

$$B_n(f)(x) = \sum_{j=0}^{n} f\Bigl(\frac{j}{n}\Bigr) \binom{n}{j} x^j (1-x)^{n-j}$$

converges uniformly to $f$ on $[0,1]$. However, the structure and origin of the polynomials become much clearer when given a probabilistic interpretation. We recall this argument in section 3, stating here only that the heart of this connection is the formula

$$B_n(f)(x) = E\Bigl(f\Bigl(\frac{S_n}{n}\Bigr)\Bigr), \qquad (1.1)$$

where $S_n = X_1 + \cdots + X_n$ for any $n$-tuple $(X_i)$ of independent random variables with Bernoulli law such that
$$P(X_i = 1) = x, \qquad P(X_i = 0) = 1 - x.$$

On a much deeper level lies Brownian motion, which can be described informally as "random" paths in Euclidean space $\mathbf{R}^d$ starting from the origin. We will recall the precise definition of this concept in the case $d = 1$ in section 4, and the reader with no previous experience is also encouraged to read [3], which contains a very readable informal description. In the author's view, there are few mathematical facts more amazing than the connection between plane Brownian motion and, for instance, the theory of harmonic functions. As an example, there is the beautiful formula (discovered by Kakutani [7])

$$f(z) = \int_{\Omega} f_0(z + B(\tau))\,dP$$

for the solution of the Dirichlet problem: to find the function $f$ harmonic in a bounded, open, connected plane set $U$ with smooth boundary such that $f(z) = f_0(z)$ for all $z$ on the boundary $\partial U$, where $f_0 : \partial U \to \mathbf{R}$ is a given (continuous) function. Here $B(t)$ signifies a two-dimensional Brownian motion starting at $0$ defined on a probability space $\Omega$ with probability measure $P$, and $\tau$ is the exit time from $U$ for $B$ started at $z$ (i.e., $\tau = \inf\{t : z + B(t) \in \partial U\}$; for a proof, see [8, chap. 4, sec. 2.B]).

Although not directly relevant to what follows, it might be useful to point out as well the discovery of the (still mysterious) link between the Riemann zeta-function $\zeta(s)$ (and its generalizations) and random matrices as another striking example where probability and number theory have become intertwined. Roughly speaking, it is expected, with overwhelming numerical evidence, that suitably normalized spacings of zeros of $\zeta(s)$ are distributed asymptotically "high in the critical strip" (i.e., when the imaginary part goes to infinity) like the normalized spacings of eigenvalues of large unitary matrices selected at random with respect to the natural probability measures on groups of unitary matrices. A shade differently, but conveying the same kind of intuitive idea, it is expected (and confirmed in various ways) that one can "model" $\zeta(1/2 + it)$ for large $|t|$ by values of characteristic polynomials of "random" unitary matrices of size roughly $\log t/2\pi$. We refer readers to [9] for a survey of this very lively topic and to [10] for some of the remarkable conjectures concerning the zeta-function that arise in this manner.

Since Brownian motion (now one-dimensional) is a way of speaking of a "random" continuous function on $[0, +\infty)$, it seems natural to bring together the simple Bernstein polynomials and the more delicate Brownian motion: namely, for $n \ge 1$ and for $x$ in $[0,1]$, we consider the random variable given by evaluating the Bernstein polynomial for the Brownian motion at $x$:

$$B_n(x) = \sum_{j=0}^{n} B\Bigl(\frac{j}{n}\Bigr) \binom{n}{j} x^j (1-x)^{n-j}. \qquad (1.2)$$

Now three different questions (at least) present themselves:

• Forgetting that we know that Brownian motion exists, can we use polynomial approximations to prove its existence? The idea is to consider a sequence $B_n$ of random polynomials of the type

$$B_n(x) = \sum_{j=0}^{n} X_{nj} \binom{n}{j} x^j (1-x)^{n-j} \qquad (1.3)$$

for some suitable random variables $X_{nj}$ and to prove that $B_n$ converges (again in a suitable sense) to a Brownian motion.

• Or, since we know that Brownian motion exists, can one investigate its properties by means of the Bernstein motions $B_n$ and Bernstein's convergence theorem?

• Or, again assuming that Brownian motion exists, what are the properties of the random polynomials of the type (1.2)?

We will consider simple instances of all three questions.
First, we provide a construction of Brownian motion (in the guise of the Wiener measure on the vector space $C([0,1], \mathbf{R})$ of real-valued continuous functions on the closed interval $[0,1]$) from random Bernstein polynomials. The complete proof is not simpler than known ones but (to the author, at least) furnishes a very convincing plausibility argument for the existence of Brownian motion that requires nothing more than the very basics of probability theory.
Second, we use results concerning Bernstein polynomials to prove that almost surely the Brownian path $t \mapsto B(t)$ is not differentiable outside a set of Lebesgue measure $0$ in $[0,1]$. This is, of course, weaker than the theorem of Paley-Wiener-Zygmund [13], which asserts that the exceptional set is almost surely empty, but the proof is particularly simple. Moreover, it is in fact independent of the construction of Brownian motion previously mentioned.

Finally, simple facts about the oscillations of a Brownian path allow us to prove some results about the zeros of random Bernstein polynomials.

In all of this we give complete rigorous proofs, but we also try to highlight the ideas (often quite intuitive) behind the arguments used. Everything should be understandable to a graduate student (not studying probability) or, with a reasonable amount of suspension of disbelief, to an undergraduate who has had a first course in probability theory. Such a reader may wish to start by reading the very nice informal description given by Billingsley [3] and can safely skip all the technical mentions of measurability (so any set and any function can be considered as measurable and all σ-algebras on a set $\Omega$ are just the entire power set of subsets of $\Omega$).

Notation. A triple $(\Omega, \Sigma, P)$ always signifies a probability space, with $P$ a probability measure on the σ-algebra $\Sigma$ of subsets of $\Omega$. For a random variable $X$, $V(X) = E((X - E(X))^2)$ is its variance. Also, as usual, we use $x \wedge y$ to denote $\min(x, y)$.
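Before turning to the details, the random polynomial (1.2) is easy to experiment with numerically. The following sketch (an illustration added for concreteness, not part of the paper's arguments; it assumes the NumPy library) samples the values $B(j/n)$ of a Brownian path on the grid points by summing independent centered Gaussian increments of variance $1/n$, then evaluates (1.2) at a few points:

```python
# A sketch, not the paper's construction: sample B(j/n) for j = 0,...,n by
# summing independent N(0, 1/n) increments (so B(0) = 0 and V(B(j/n)) = j/n),
# then evaluate the random Bernstein polynomial (1.2).
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n = 200
increments = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
B = np.concatenate([[0.0], np.cumsum(increments)])  # B[j] plays the role of B(j/n)

def random_bernstein(B, x):
    """Evaluate sum_j B(j/n) * C(n,j) * x^j * (1-x)^(n-j)."""
    n = len(B) - 1
    js = np.arange(n + 1)
    weights = np.array([comb(n, j) for j in js], dtype=float)
    return float(np.sum(B * weights * x**js * (1.0 - x)**(n - js)))

for x in (0.1, 0.5, 0.9):
    print(f"B_n({x}) = {random_bernstein(B, x):+.4f}")
```

Since the weights $\binom{n}{j} x^j (1-x)^{n-j}$ form the binomial($n$, $x$) distribution, $B_n(x)$ is simply a weighted average of the sampled path values concentrated near $j \approx nx$.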
2. Review of facts from probability. We list here the various definitions and basic results from probability theory needed in the rest of the paper, with a few exceptions that are deeper results and will be stated only when needed in later sections. As a reference, we use [15], but the material can be found in any textbook on probability theory.

First, for the study of Bernstein polynomials we use the most basic properties of Bernoulli random walks. Let $x$ in $[0,1]$ and $n \ge 1$ be given, and let $X_1, \ldots, X_n$ be independent random variables on a given probability space $(\Omega, \Sigma, P)$ taking only the values $0$ and $1$ according to the same Bernoulli law
$$P(X_i = 1) = x, \qquad P(X_i = 0) = 1 - x.$$
The associated ($n$-step) random walk is the sequence of sums $S_i = X_1 + \cdots + X_i$ $(0 \le i \le n)$, and the "empirical" average $T_n = S_n/n$ is often what we consider. Note that
$$E(X_i) = x, \qquad V(X_i) = x(1 - x).$$

Hence

$$E(T_n) = E\Bigl(\frac{S_n}{n}\Bigr) = x, \qquad V(T_n) = V\Bigl(\frac{S_n}{n}\Bigr) = \frac{x(1-x)}{n}, \qquad (2.1)$$

the latter because of the additivity of the variance for independent random variables. We also use the fourth moment of $T_n - x$, for which we have the following result:
Lemma 2.1. The estimate

$$E((T_n - x)^4) = E\Bigl(\Bigl(\frac{S_n}{n} - x\Bigr)^4\Bigr) \le \frac{1}{n^2}$$

holds for $n \ge 1$ and for all $x$ in $[0,1]$.

Proof. This is well known (it is the basis of Cantelli's "easy" proof of the strong law of large numbers for identically distributed random variables with finite fourth moment; see, for example, [15, Theorem 1, p. 363]). After expanding

$$(T_n - x)^4 = \Bigl(\frac{(X_1 - x) + \cdots + (X_n - x)}{n}\Bigr)^4$$

and taking the expectation, among the $n^4$ terms

$$\frac{1}{n^4} E((X_{i_1} - x)(X_{i_2} - x)(X_{i_3} - x)(X_{i_4} - x))$$

all those with three or four distinct indices vanish because of the independence assumption. For instance (with $\tilde{X}_i = X_i - x$),
$$E((X_1 - x)(X_2 - x)(X_3 - x)(X_4 - x)) = E(\tilde{X}_1) \cdots E(\tilde{X}_4) = 0,$$
$$E((X_1 - x)^2 (X_2 - x)(X_3 - x)) = E(\tilde{X}_1^2) E(\tilde{X}_2) E(\tilde{X}_3) = 0.$$

There remain only the $n$ terms of the type

$$\frac{1}{n^4} E((X_i - x)^4) = \frac{x^4(1-x) + x(1-x)^4}{n^4} \le \frac{1}{12 n^4},$$

and the $6n(n-1)/2$ terms of the type

$$\frac{1}{n^4} E((X_i - x)^2 (X_j - x)^2) = \frac{x^2(1-x)^2}{n^4} \le \frac{1}{16 n^4}$$

with $i \ne j$, so we find that

$$E((T_n - x)^4) \le \frac{n}{12 n^4} + \frac{6n(n-1)}{32 n^4} \le \frac{1}{n^2}.$$

We also consider $d$-dimensional variants of the Bernoulli random walk in which each $X_i$ is a vector $(X_{ij})$ $(1 \le j \le d)$ of Bernoulli random variables with

$$P(X_{ij} = 1) = x_j, \qquad P(X_{ij} = 0) = 1 - x_j$$

for some vector $x = (x_1, \ldots, x_d)$ in $[0,1]^d$. We assume then that the $(X_{ij})$ are all independent.

In the study of Bernstein polynomials, $x$ is in fact the variable of the functions under consideration. All random variables depend on it, although we suppress this fact in the notation (but notice that (2.1) recovers $x$ naturally from $S_n$ or $T_n$). This is already apparent in formula (1.1) (there is no $x$ on the right-hand side!); this may seem confusing, but we will explain this formula again carefully in the proof of Theorem 3.1.

We recall Chebychev's inequality [15, chap. 1, sec. 5.1] for a random variable $X$ having finite variance $V(X)$: for any $\delta > 0$

$$P(|X| \ge \delta) \le \frac{V(X)}{\delta^2}. \qquad (2.2)$$
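Because $S_n$ follows the binomial law with parameters $n$ and $x$, the fourth moment in Lemma 2.1 is a finite sum, and the bound $1/n^2$ can be checked exactly. Here is a short sketch of such a check (added for illustration only; it assumes NumPy):

```python
# Exact check of Lemma 2.1: S_n is binomial(n, x), so E((S_n/n - x)^4) is a
# finite sum over k = 0,...,n weighted by the binomial probabilities.
import numpy as np
from math import comb

def fourth_moment(n, x):
    ks = np.arange(n + 1)
    pmf = np.array([comb(n, k) for k in ks], dtype=float) * x**ks * (1.0 - x)**(n - ks)
    return float(np.sum((ks / n - x) ** 4 * pmf))

for n in (1, 5, 20, 100):
    worst = max(fourth_moment(n, x) for x in np.linspace(0.0, 1.0, 101))
    print(f"n = {n:4d}: max_x E((T_n - x)^4) = {worst:.3e} <= 1/n^2 = {1.0 / n**2:.3e}")
```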
Also, we have the easy half of the Borel-Cantelli lemma (see [15, chap. 2, sec. 10.3] for the full statement): if $(A_n)$ is a sequence of events such that the series

$$\sum_{n \ge 1} P(A_n)$$

converges, then almost surely any $\omega$ in $\Omega$ belongs to at most finitely many $A_n$; that is,

$$P\Bigl(\bigcap_{N \ge 1} \bigcup_{n \ge N} A_n\Bigr) = 0.$$

Our second main probabilistic topic is the elementary theory of Gaussian random variables and vectors (see, for example, [15, chap. 2, sec. 13]). Recall that a real-valued random variable $X$ is a centered Gaussian random variable with variance $\sigma^2$ $(\sigma > 0)$ if its distribution law $X(P)$ is given by
$$X(P) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{t^2}{2\sigma^2}}\,dt. \qquad (2.3)$$

Recall that this means simply that for any (measurable) subset $B$ of $\mathbf{R}$ (for example, an interval $B = [a,b]$), we have
$$P(X \in B) = \frac{1}{\sigma\sqrt{2\pi}} \int_B e^{-\frac{t^2}{2\sigma^2}}\,dt.$$

It follows that

$$E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{\mathbf{R}} t\, e^{-\frac{t^2}{2\sigma^2}}\,dt = 0, \qquad V(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{\mathbf{R}} t^2 e^{-\frac{t^2}{2\sigma^2}}\,dt = \sigma^2.$$

Moreover, a simple computation using (2.3) shows that for $k = 0, 1, 2, \ldots$ the moment $E(X^k)$ exists and, in particular, that

$$E(X^4) = 3V(X)^2 = 3\sigma^4. \qquad (2.4)$$

By the correspondence between probability laws and their characteristic functions, an equivalent condition for $X$ to be Gaussian is that for each real $t$ one has

$$E(e^{itX}) = e^{-\sigma^2 t^2/2}.$$

This approach by Fourier theory leads to an easy definition of a centered Gaussian random vector $(X_1, \ldots, X_m)$: this is a vector of real-valued random variables such that there exists a positive quadratic form $Q(t_1, \ldots, t_m)$ for which

$$E(e^{i(t_1 X_1 + \cdots + t_m X_m)}) = e^{-Q(t_1, \ldots, t_m)/2}$$

for all $(t_1, \ldots, t_m)$ in $\mathbf{R}^m$. The coefficients of the quadratic form are related to the Gaussian vector as follows: if

$$Q(t) = \sum_{1 \le i, j \le m} a_{ij} t_i t_j,$$

then the covariance matrix $A = (a_{ij})$ is determined by
$$E(X_i X_j) = a_{ij}.$$
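As a concrete illustration (added here, not in the original; it assumes NumPy, and the matrix $A$ below is an arbitrary example), one can sample a centered Gaussian vector with prescribed covariance $A$ and confirm that the empirical second moments recover the coefficients $a_{ij}$:

```python
# Sketch: draw many samples of a centered Gaussian vector with covariance A
# and compare the empirical matrix of second moments E(X_i X_j) with A.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # the covariance matrix (a_ij); positive definite
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=A, size=200_000)
empirical = samples.T @ samples / len(samples)  # Monte Carlo estimate of (E(X_i X_j))
print(empirical)  # close to A, up to sampling error of order 1/sqrt(200000)
```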
As before, the joint law $(X_1, \ldots, X_m)(P)$ of the vector is defined as a measure on $\mathbf{R}^m$ that permits one to compute $P((X_1, \ldots, X_m) \in B)$ for any measurable subset $B$ of $\mathbf{R}^m$. If $Q$ is positive definite, one shows that

$$P((X_1, \ldots, X_m) \in B) = \frac{1}{\sqrt{|\det(A)|}\,(2\pi)^{m/2}} \int_B e^{-\frac{1}{2}\sum_{i,j} b_{ij} t_i t_j}\,dt_1 \cdots dt_m,$$

where $(b_{ij})$ is the inverse of $A$.

Exploiting the form of the characteristic functions, one demonstrates easily that the coordinates of a centered Gaussian random vector $(X_1, \ldots, X_m)$ are independent if and only if they are uncorrelated. This means that the vector $(X_1, \ldots, X_m)$ is Gaussian with diagonal covariance matrix, or in other words, $E(X_i X_j) = 0$ if $i \ne j$.

Also, one sees without difficulty that if $X = (X_1, \ldots, X_m)$ is a centered Gaussian random vector and $u : \mathbf{R}^m \to \mathbf{R}^n$ is a linear map, then the $n$-dimensional vector $u(X)$ is again a centered Gaussian vector. In particular, a linear combination of the coordinates of $X$,

$$\sum_{i=1}^{m} \alpha_i X_i \qquad (\alpha_i \in \mathbf{R}),$$

is a centered Gaussian random variable.

As a final generalization, if $T$ is an arbitrary set, then a mapping on $T$ of the sort $t \mapsto X_t$, where $X_t$ is a random variable for each $t$, is a Gaussian stochastic process parameterized by $T$ (often $T = [0,1]$ or $[0, +\infty)$) if for each
finite subset $t_1, \ldots, t_m$ of $T$ the vector $(X_{t_1}, \ldots, X_{t_m})$ is a centered Gaussian random vector. The covariance function of the stochastic process $X = (X_t)$ is the function $g(s,t) = E(X_s X_t)$ defined for $(s,t)$ in $T^2$.

Besides the definition and characterization of independence, we will appeal to the following simple lemma, which makes rigorous the quite intuitive fact that the values of a sequence of Gaussian random variables with increasing variance (i.e., more and more "spread out") can form a convergent sequence only with probability zero.
Lemma 2.2. Let $(\Omega, \Sigma, P)$ be a probability space, and let $(X_n)$ be a sequence of centered Gaussian random variables such that $\sigma_n^2 = V(X_n) \to +\infty$. Then $(X_n)$ is almost surely unbounded.

Proof. If $Y$ is a centered Gaussian random variable with variance $\sigma^2 = 1$ and if $\varepsilon > 0$, we have by (2.3)

$$P(|Y| < \varepsilon) = \frac{1}{\sqrt{2\pi}} \int_{-\varepsilon}^{\varepsilon} e^{-t^2/2}\,dt \le \varepsilon.$$
Since $Y_n = X_n/\sigma_n$ has variance $1$, we infer that

$$P(|X_n| < M) = P\Bigl(|Y_n| < \frac{M}{\sigma_n}\Bigr) \le \frac{M}{\sigma_n}$$

for any $n$ and arbitrary positive $M$, whence
$$\lim_{n \to \infty} P(|X_n| < M) = 0 \quad (M > 0). \qquad (2.5)$$
So, in a sense, $(X_n)$ is unbounded in probability, and it should come as no surprise that there exists a subsequence of $(X_n)$ that is unbounded almost surely. Indeed, taking $M = k$, an integer, there exists $N_k$ such that

$$P(|X_n| < k) < 2^{-k} \qquad (n \ge N_k),$$

which ensures that

$$\sum_k P(|X_{N_k}| < k) < +\infty.$$

Invoking the easy half of the Borel-Cantelli lemma, we conclude that with probability one the event $\{|X_{N_k}| < k\}$ occurs only finitely often. Thus the subsequence $(X_{N_k})$ is almost surely unbounded, as then is $(X_n)$.

Alternatively, one can establish Lemma 2.2 by applying to $Y_n = X_n^{-1}$ (with $Y_n = 0$ if $X_n = 0$, which only occurs with probability zero) the standard fact (see, for example, [15, chap. 2, sec. 10.5]) that a sequence converging in probability has a subsequence converging almost surely, since $Y_n$ converges to $0$ in probability by (2.5).

Finally, let $Y$ be a metric space, and let $\mathcal{B}$ be the Borel σ-algebra on $Y$ (i.e., the smallest σ-algebra containing the open sets of the metric topology on $Y$). One says that a sequence $(P_n)$ of probability measures on $(Y, \mathcal{B})$ (respectively, a sequence $(X_n)$ of measurable maps $X_n : \Omega \to Y$) converges in distribution to a probability measure $Q$ on $(Y, \mathcal{B})$ if the sequence of measures $(P_n)$ (respectively, the sequence $(X_n(P))$ of image measures) converges weakly to $Q$. This means that for any bounded continuous function $f$ on $Y$ we have

$$\lim_{n \to \infty} \int_Y f(y)\,dP_n(y) = \int_Y f(y)\,dQ(y)$$

or

$$\lim_{n \to \infty} \int_Y f(y)\,dX_n(P)(y) = \int_Y f(y)\,dQ(y).$$

Equivalently, and more intuitively, the $P_n$-probability of a subset $A$ of $Y$ converges to its $Q$-probability for all "reasonable" sets $A$. To be precise, convergence in distribution holds if and only if for each Borel subset $A$ of $Y$ whose boundary $\partial A$ has $Q$-measure zero it is true that
$$\lim_{n \to \infty} P_n(A) = Q(A)$$

(see [15, chap. 3, sec. 1.2]). The Lévy criterion (see [15, chap. 3, sec. 3]) states that if $Y = \mathbf{R}^d$ with $d \ge 1$, then a sequence of $d$-dimensional random vectors $X_n = (X_{n1}, \ldots, X_{nd})$ converges in distribution to some $Q$ if the characteristic functions
$$\varphi_n(t_1, \ldots, t_d) = E(e^{i(t_1 X_{n,1} + \cdots + t_d X_{n,d})})$$

converge pointwise to a function $\varphi$ on $\mathbf{R}^d$ that is continuous at $0$. In particular, it follows that a sequence of centered Gaussian random vectors converges in distribution if and only if the covariance coefficients $E(X_{n,i} X_{n,j})$ converge to real numbers $a_{ij}$ $(1 \le i, j \le d)$, in which case the limit $\varphi$ is the characteristic function of a centered Gaussian random vector with covariance matrix $(a_{ij})$.
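Lemma 2.2 is also easy to observe experimentally. In the sketch below (an added illustration, assuming NumPy; the variances $\sigma_n^2 = n$ and the independence of the $X_n$ are convenient choices for sampling, not requirements of the lemma), the running maximum of $|X_n|$ visibly diverges along each sampled trajectory:

```python
# Sketch of Lemma 2.2: centered Gaussians with variances sigma_n^2 = n -> infinity.
# The running maximum of |X_n| diverges along every sampled trajectory.
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
sigmas = np.sqrt(np.arange(1, N + 1))  # sigma_n = sqrt(n)
for path in range(3):
    X = rng.normal(0.0, 1.0, size=N) * sigmas  # X_n ~ N(0, n), independent here
    running_max = np.maximum.accumulate(np.abs(X))
    print(f"path {path}: max|X_n| at n = 100, 1000, 10000:", running_max[[99, 999, 9999]])
```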
3. Preliminaries about Bernstein polynomials. In this section we summarize the facts about Bernstein polynomials that we need in the sequel. For completeness, we include proofs of all of them, albeit proofs in a probabilistic style. We first recall the proof of Bernstein's theorem, generalized to functions on $[0,1]^d$ (the case $d = 2$ will be required later).

Theorem 3.1 (Bernstein). Let $d$ be a positive integer. If $f : [0,1]^d \to \mathbf{C}$ is a continuous function and if

$$B_n(f)(x) = \sum_{0 \le i_1, \ldots, i_d \le n} f\Bigl(\frac{i_1}{n}, \ldots, \frac{i_d}{n}\Bigr) \prod_{k=1}^{d} \binom{n}{i_k} x_k^{i_k} (1 - x_k)^{n - i_k} \qquad (3.1)$$

for $n = 1, 2, \ldots$ and $x = (x_1, \ldots, x_d)$ in $[0,1]^d$, then $B_n(f) \to f$ uniformly on $[0,1]^d$.

Proof. Consider a positive integer $n$ and $x = (x_1, \ldots, x_d)$ in $[0,1]^d$. Let $X_i = (X_{i1}, \ldots, X_{id})$ $(1 \le i \le n)$ be a sequence of random vectors defined on some probability space such that the $X_{ij}$ are all independent and, as in the beginning of the previous section, each $X_{ij}$ takes only the values $0$ and $1$ according to the Bernoulli law for which
$$P(X_{ij} = 1) = x_j, \qquad P(X_{ij} = 0) = 1 - x_j,$$

thus putting our point $x$ in the picture. Generalizing (1.1) to dimension $d$, we claim that

$$B_n(f)(x) = E\Bigl(f\Bigl(\frac{S_n}{n}\Bigr)\Bigr), \qquad (3.2)$$

where still $S_n = X_1 + \cdots + X_n$.

This formula indicates why the theorem holds: by the law of large numbers, it is usually the case that $S_n/n$ is close to the average $(x_1, \ldots, x_d)$, and therefore (by continuity) $f(S_n/n)$ should be close to the constant $f(x)$. We infer that the expectation $B_n(f)(x)$ is close to $E(f(x)) = f(x)$.

To make this argument precise, we start by proving (3.2). Writing $S_n = (S_{n1}, \ldots, S_{nd})$, notice that $S_{nj}$, as a sum of $n$ terms each of which is either $0$ or $1$, can take only the values $0, 1, \ldots, n$, so $S_{nj}/n$ takes only values of the form $i_j/n$ with $0 \le i_j \le n$. Moreover, by counting the number of possible ways of doing this, we see that $S_{nj}/n$ takes the value $i_j/n$ with probability

$$P(S_{nj}/n = i_j/n) = \binom{n}{i_j} x_j^{i_j} (1 - x_j)^{n - i_j},$$

so expanding $E(f(S_n/n))$ according to the possible values of $S_n/n$ and using the independence of the various coordinate vectors, we get (3.2).

Now we put

$$\sigma_{n,j}^2 = V\Bigl(\frac{S_{nj}}{n}\Bigr) = \frac{x_j(1 - x_j)}{n} \le \frac{1}{4n}.$$

Using obvious notation for a multidimensional expectation, we have $E(S_n) = nx = n(x_1, \ldots, x_d)$, whence

$$B_n(f)(x) - f(x) = E\Bigl(f\Bigl(\frac{S_n}{n}\Bigr)\Bigr) - f(x) = E\Bigl(f\Bigl(\frac{S_n}{n}\Bigr)\Bigr) - f\Bigl(E\Bigl(\frac{S_n}{n}\Bigr)\Bigr).$$
Set $\|x\| = \max\{|x_i|\}$ for any $x$ in $\mathbf{R}^d$. By Chebychev's inequality (2.2), applied componentwise, and independence, we obtain for any $\delta > 0$ the upper bound
$$P\Bigl(\Bigl\|\frac{S_n}{n} - x\Bigr\| \ge \delta\Bigr) \le \sum_{1 \le j \le d} P\Bigl(\Bigl|\frac{S_{n,j}}{n} - x_j\Bigr| \ge \delta\Bigr) \le \sum_{1 \le j \le d} \frac{\sigma_{n,j}^2}{\delta^2} \le \frac{d}{4n\delta^2}.$$

Now let $\varepsilon > 0$ be given. By the uniform continuity of $f$ on the compact set $[0,1]^d$, there exists a positive $\delta$ such that $|f(x) - f(y)| < \varepsilon$ whenever $x$ and $y$ in $[0,1]^d$ have $\|x - y\| < \delta$. Then
$$\Bigl|\int_{\{\|S_n/n - x\| < \delta\}} \Bigl(f\Bigl(\frac{S_n}{n}\Bigr) - f(x)\Bigr)\,dP\Bigr| \le \varepsilon,$$

so

$$\Bigl|\int_{\{\|S_n/n - x\| \ge \delta\}} \Bigl(f\Bigl(\frac{S_n}{n}\Bigr) - f(x)\Bigr)\,dP\Bigr| \le 2 \max_{x \in [0,1]^d} |f(x)|\, P\Bigl(\Bigl\|\frac{S_n}{n} - x\Bigr\| \ge \delta\Bigr) \le \frac{2d}{4n\delta^2} \max_{x \in [0,1]^d} |f(x)|.$$
Clearly, taken together the foregoing bits of information imply that for all sufficiently large n
$$\Bigl| E\Bigl(f\Bigl(\frac{S_n}{n}\Bigr)\Bigr) - f\Bigl(E\Bigl(\frac{S_n}{n}\Bigr)\Bigr)\Bigr| \le 2\varepsilon$$

holds independently of $x$, hence the result.
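To see the convergence of Theorem 3.1 numerically, the following sketch (added for illustration, assuming NumPy; the choices $d = 1$ and $f(x) = |x - 1/2|$ are arbitrary) tabulates the sup-norm error of $B_n(f)$ on a grid of points:

```python
# Sketch: sup-norm distance between f and its Bernstein polynomial B_n(f)
# for f(x) = |x - 1/2|, illustrating the (slow) uniform convergence.
import numpy as np
from math import comb

def bernstein_1d(f, n, xs):
    js = np.arange(n + 1)
    weights = np.array([comb(n, j) for j in js], dtype=float)
    vals = f(js / n)  # the sampled values f(j/n)
    return np.array([np.sum(vals * weights * x**js * (1.0 - x)**(n - js)) for x in xs])

f = lambda x: np.abs(x - 0.5)
xs = np.linspace(0.0, 1.0, 501)
for n in (10, 40, 160, 640):
    err = np.max(np.abs(bernstein_1d(f, n, xs) - f(xs)))
    print(f"n = {n:4d}: sup-norm error = {err:.4f}")
```

For this particular $f$, the error decays roughly like $n^{-1/2}$ near the kink at $x = 1/2$, consistent with the variance in (2.1) driving the proof.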
Another simple property of Bernstein polynomials that we will require is the form of their derivatives. We consider only functions of one or two variables. The result (which is simply a direct computation) reads as follows:
Lemma 3.2. (1) If $f : [0,1] \to \mathbf{C}$ is a continuous function and $B_n(f)$ is the $n$th Bernstein polynomial of $f$, then
$$B_n(f)'(x) = n \sum_{j=0}^{n-1} \binom{n-1}{j} \Bigl(f\Bigl(\frac{j+1}{n}\Bigr) - f\Bigl(\frac{j}{n}\Bigr)\Bigr) x^j (1-x)^{n-1-j}. \qquad (3.3)$$
(2) If $f : [0,1]^2 \to \mathbf{C}$ is a continuous function and $B_n(f)$ is the $n$th Bernstein polynomial of $f$, then
$$\partial_x B_n(f)(x,y) = n \sum_{j=0}^{n-1} \sum_{k=0}^{n} \binom{n-1}{j} \binom{n}{k} \Bigl(f\Bigl(\frac{j+1}{n}, \frac{k}{n}\Bigr) - f\Bigl(\frac{j}{n}, \frac{k}{n}\Bigr)\Bigr) x^j (1-x)^{n-1-j} y^k (1-y)^{n-k},$$

and symmetrically for $\partial_y B_n(f)$.
Proof. For $d = 1$ we have

$$B_n(f)'(x) = \sum_{j=0}^{n} \binom{n}{j} f\Bigl(\frac{j}{n}\Bigr) \bigl\{ j x^{j-1} (1-x)^{n-j} - (n-j) x^j (1-x)^{n-1-j} \bigr\} \qquad (3.4)$$

$$= \sum_{j=1}^{n} j \binom{n}{j} f\Bigl(\frac{j}{n}\Bigr) x^{j-1} (1-x)^{n-j} - \sum_{j=0}^{n-1} (n-j) \binom{n}{j} f\Bigl(\frac{j}{n}\Bigr) x^j (1-x)^{n-1-j}.$$
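Since such computations are easy to get wrong by hand, formula (3.3) can also be sanity-checked numerically. The sketch below (an added illustration, assuming NumPy; the test function is an arbitrary smooth choice) compares (3.3) with a centered finite difference of $B_n(f)$:

```python
# Sketch: compare the derivative of B_n(f) computed via formula (3.3)
# with a centered finite difference of B_n(f) itself.
import numpy as np
from math import comb

def bernstein(f, n, x):
    js = np.arange(n + 1)
    w = np.array([comb(n, j) for j in js], dtype=float)
    return float(np.sum(f(js / n) * w * x**js * (1.0 - x)**(n - js)))

def bernstein_derivative(f, n, x):
    # formula (3.3): n * sum_j C(n-1,j) (f((j+1)/n) - f(j/n)) x^j (1-x)^(n-1-j)
    js = np.arange(n)
    w = np.array([comb(n - 1, j) for j in js], dtype=float)
    diffs = f((js + 1) / n) - f(js / n)
    return float(n * np.sum(diffs * w * x**js * (1.0 - x)**(n - 1 - js)))

f = lambda t: np.sin(2 * np.pi * t)
n, x, h = 50, 0.3, 1e-6
fd = (bernstein(f, n, x + h) - bernstein(f, n, x - h)) / (2 * h)
print(bernstein_derivative(f, n, x), fd)  # the two numbers agree closely
```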