Stochastic Approximation and Newton's Estimate of a Mixing

Home , Stochastic approximation

arXiv:1102.3592v1 [stat.ME] 17 Feb 2011 oepoeisrltosi iharcn loih [ algorithm recent a with relationship its explore to and applications, statistical way recent the some highlighting along (SA), approximation stochastic of ject pdit nipratae fssescnrland control systems of area important an into oped on observations siaeo iigDistribution Ghosh Mixing K. Jayanta and Martin a Ryan Newton’s of and Estimate Approximation Stochastic 0 odKlaa ni 018 e-mail: 700108, India Kolkata, Road T Institute B Statistical 203 Indian Mathematics, Theoretical and of Statistics USA Division 47907, Emeritus, Lafayette, Professor West and Street, University University, N. Purdue 150 Statistics, of Department Professor, function a of root the ﬁnding for 08 o.2,N.3 365–382 3, DOI: No. 23, Vol. 2008, Science Statistical [email protected] [email protected] 28 tet etLfyte nin 70,UA e-mail: USA, 47907, Indiana Lafayette, University West North Street, 250 University, Purdue Statistics,

ynMri sGaut tdn,Dprmn of Department Student, Graduate is Martin Ryan c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint h i ftepeetppri orve h sub- the review to is paper present the of aim The nttt fMteaia Statistics Mathematical of Institute Awsitoue n[ in introduced was SA o siaigamxn distribution. mixing a estimating for ] 10.1214/08-STS265 .INTRODUCTION 1. h nttt fMteaia Statistics Mathematical of Institute , itr oes ypnvfunctions. Lyapunov models, phrases:mixture con and its words prove estimati prop and Key also for model, We allows the mixture. in that parameter finite unknown algorithm a additional consis Newton’s of prove of case to used the modification st are in ODE results estimate and SA Newton’s functions standard Lyapunov Then includes theory. SA, ity which of necessary algorithm, review theory SA the a a specia and with a applications, begin statistical as We recent study (SA). we approximation which stochastic distribution, mixing the timating A Ser. rbto a nrae rmtclyi eetyas Newt years. recent mi in the dramatically estimate increased to has methods tribution efficient computationally for need Abstract. 08 o.2,N.3 365–382 3, No. 23, Vol. 2008, r vial.I a ic devel- since has It available. are . aat .Gohis Ghosh K. Jayanta . 31 64 2008 , sa loihi method algorithmic an as ] 20)3632 rpsdafs eusv loih o es- for algorithm recursive fast a proposed 306–322] (2002) aysaitclpolm nov itr oesadthe and models mixture involve problems statistical Many h hnol noisy only when tcatcapoiain miia Bayes, empirical approximation, Stochastic This . 26 in , – 1 uecnegneo Aagrtm hc sused is which algorithm, SA Section in a almost of for a convergence conditions state sure sufficient then We providing density. theorem a of constant normalizing o aibe hnbcmsamxueo h form the of mixture ran- a observed, becomes or then manifest, variables dom the of distribution [ The data high-dimensional with mation duality the with demonstrate SAMC We combining problems. that statistical of variety a in [ Approxima- (SAMC) Carlo Carlo Stochastic Monte Monte tion called versatile a method, re- is integration third algorithms, the Metropolis and and spectively, [ EM two the first inno- strengthen The and applications. recent three statistical review vative and algorithm SA the Section statis- In in tics. applications numerous with optimization, aet ruosre,rno aibe,frexam- [ for analysis variables, cluster ple, random unobserved, or latent, differential developed. ordinary is (ODEs) for equations the theory purpose, stability this necessary For estimate. distribution mixing (1.1) aysaitclpolm nov oeigwith modeling involve problems statistical Many [ 18 3 rvdsamto o siaigthe estimating for method a provides ] Π osuytecnegnepoete fa of properties convergence the study to f ( x = ) 2 egv re nrdcinto introduction brief a give we o nlssof analysis for Z 24 Θ n[ on n utpetsigo esti- or testing multiple and ] p ( no an of on x igdis- xing aeof case l sistency. Sankhy¯a ec of tency | 21 θ ) s a ose some f ,wihcnb applied be can which ], abil- ( θ energy–temperature ) dµ ( 1 θ , ) 8 , , 9 , 34 6 , , 36 16 ]. ] 2 R. MARTIN AND J. K. GHOSH where θ ∈ Θ is the latent variable or parameter, and We also consider the problem where the sampling f is an unknown mixing density with respect to the density p(x|θ) of (1.1) is of the form p(x|θ,ξ), where measure µ on Θ. Estimation of f plays a funda- f is a mixing density or prior for θ, and ξ is an ad- mental role in many inference problems, such as an ditional unknown parameter. Newton’s algorithm is empirical Bayes approach to multiple testing. unable to handle unknown ξ, and we propose a mod- For the deconvolution problem, when p(x|θ) in ified algorithm, called N + P, capable of recursively (1.1) is of the form p(x − θ), asymptotic results for estimating both f and ξ. We express this algorithm estimates of f, including optimal rates of conver- as a general SA and prove consistency under suitable gence, are known [10]. A nonparametric Bayes ap- conditions. proach to Gaussian deconvolution is discussed in [13]. In Section 5 we briefly discuss some additional For estimating Πf , a Bayesian might assume an a theoretical and practical issues concerning Newton’s priori distribution on f, inducing a prior on Πf via recursive algorithm and the N + P. the map f 7→ Πf . Consistency of the resulting estimate of Πf is considered in [3, 11, 12]. 2. STOCHASTIC APPROXIMATION In Section 3, we describe a recursive algorithm of Newton et al. [26–28] for estimating the mix- 2.1 Algorithm and Examples ing density f. This estimate is significantly faster Consider the problem of finding the unique root ξ to compute than the popular nonparametric Bayes of a function h(x). If h(x) can be evaluated exactly estimate based on a Dirichlet process prior. In fact, for each x and if h is sufficiently smooth, then var- the original motivation [27] for the algorithm was to ious numerical methods can be employed to locate approximate the computationally expensive Bayes ξ. A majority of these numerical procedures, includ- estimate. The relative efficiency of the recursive al- ing the popular Newton–Raphson method, are iter- gorithm compared to MCMC methods used to com- ative by nature, starting with an initial guess x of pute the Bayes estimate, coupled with the similarity 0 ξ and iteratively defining a sequence {x } that con- of the resulting estimates, led Quintana and New- n verges to ξ as n →∞. Now consider the situation ton [29] to suggest the former be used for Bayesian where only noisy observations on h(x) are available; exploratory data analysis. that is, for any input x one observes y = h(x)+ ε, While Newton’s algorithm performs well in exam- where ε is a zero-mean random error. This problem ples and simulations (see [14, 26–29, 38] and Sec- arises in situations where h(x) denotes the expected tion 3.3), very little is known about its large-sample value of the response when the experiment is run properties. A rather difficult proof of consistency, at setting x. Unfortunately, standard deterministic based on an approximate martingale representation methods cannot be used in this problem. of the Kullback–Leibler divergence, is given by Ghosh and Tokdar [14] when Θ is finite; see Section 3.1. In In their seminal paper, Robbins and Monro [31] Section 3.2, we show that Newton’s algorithm can proposed a stochastic approximation algorithm for be expressed as a stochastic approximation and re- defining a sequence of design points {xn} target- sults presented in Section 2.4 are used to prove a ing the root ξ of h in this noisy case. Start with stronger consistency theorem than in [14] for the an initial guess x0. At stage n ≥ 1, use the state case of finite Θ, where the Kullback–Leibler diver- xn−1 as the input, observe yn = h(xn−1)+ εn, and gence serves as the Lyapunov function. update the guess (xn−1,yn) 7→ xn. More precisely, The numerical investigations in Section 3.3 con- the Robbins–Monro algorithm defines the sequence sider two important cases when Θ is finite, namely, {xn} as follows: start with x0 and, for n ≥ 1, set when f is strictly positive on Θ and when f(θ) = 0 xn = xn−1 + wnyn for some θ ∈ Θ. In the former case, our calculations (2.1) show that Newton’s estimate is superior, in terms of = xn−1 + wn{h(xn−1)+ εn}, accuracy and computational efficiency, to both the nonparametric MLE and the Bayes estimate. For the where {εn} is a sequence of i.i.d. random variables latter case, when only a superset of the support of with mean zero, and the weight sequence {wn} sat- f is known, the story is completely different. While isfies Newton’s estimate remains considerably faster than 2 (2.2) wn > 0, wn = ∞, wn < ∞. the others, it is not nearly as accurate. n n X X SA AND NEWTON’S ALGORITHM 3

While the SA algorithm above works in more general situations, we can develop our intuition by look- ing at the special case considered in [31], namely, when h is bounded, continuous and monotone decreasing. If xn <ξ, then h(xn) > 0 and we have

E(xn+1|xn)= xn + wn+1{h(xn)+ E(εn+1)}

= xn + wn+1h(xn)

>xn.

Likewise, if xn >ξ, then E(xn+1|xn) percentile of the is that wn → 0. Clearly wn → 0 implies that the ef- t distribution. fect of the noise vanishes as n →∞. This, in turn, 5 has an averaging effect on the iterates yn. On the Example 2.2. Suppose we wish to find the αth other hand, the condition n wn = ∞ washes out the effect of the initial guess x0. For further details, quantile of the tν distribution; that is, we want to P see [25]. find the solution to the equation Fν(x)= α, where We conclude this section with three simple exam- Fν is the cdf of the tν distribution. While there are ples of SA to shed light on when and how the algo- numerous numerical methods available (e.g., Newton– rithm works. Example 2.1, taken from [19], page 4, Raphson or bijection), we demonstrate below how is an important special case of the Robbins–Monro SA can be used to solve this problem. Making use algorithm (2.1) which further motivates the algo- of the well-known fact that the tν distribution is a rithm as well as the conditions (2.2) on the sequence scale-mixture of normals, we can write {w }. Example 2.2 uses SA to find quantiles of a t- n F (x)= E[Φ(x|ν−1Z)],Z ∼ χ2, distribution, and Example 2.3 illustrates a connec- ν ν tion between SA and empirical Bayes, two of Rob- where Φ(x|σ2) is the cdf of the N(0,σ2) distribution. bins’s greatest contributions. i.i.d. 2 Now, for Z1,Z2,... ∼ χν , the sequence {yn} defined as y = α − Φ(x |ν−1Z ) are noisy observa- Example 2.1. Let Fξ be the cdf of a distribu- n n−1 n tion with mean ξ. Then estimation of ξ is equivalent tions of h(xn−1)= α − Fν (xn−1). This h is bounded, continuous and monotone decreasing so the Robbins– to solving h(x)=0 where h(x)= ξ − x. If Z1,...,Zn Monro theory says that the sequence {xn} defined are i.i.d. observations from Fξ , then the average Zn as (2.1) converges to the true quantile, for any ini- is the least squares estimate of ξ. To see that {Zn} is actually a SA sequence, recall the computationally tial condition x0. For illustration, Figure 1 shows the first 1000 iterations of the sequence {xn} for efficient recursive expression for Zn: α = 0.75, ν = 5 and for three starting values x0 ∈ −1 (2.3) Zn = Zn−1 + n (Zn − Zn−1). {0.5, 0.75, 1.0}.

−1 If we let xn = Zn, wn = n and yn = Zn − Zn−1, Example 2.3. In Section 3 we consider a par- then (2.3) is exactly of the form of (2.1), with {wn} ticular recursive estimate and show that it is of the satisfying (2.2). Moreover, if εn = Zn − ξ, then we form of a general SA. It turns out that the problem can write yn = h(xn−1)+ εn. With this setup, we there can also be expressed as an empirical Bayes could study the asymptotic behavior of xn using (EB) problem [30]. In this simple example, we demon- the SA analysis below (see Sections 2.3 and 2.4), strate the connection between SA and EB, both of although the SLLN already guarantees xn → ξ a.s. which are theories pioneered by Robbins. Consider 4 R. MARTIN AND J. K. GHOSH

need not have a unique limit point. However, conditions can be imposed which guarantee convergence to a particular solution ξ of h(x) = 0, provided that ξ is a stable solution to the ODEx ˙ = h(x). This is discussed further in Section 2.3. 2.2 Applications 2.2.1 Stochastic approximation EM. The EM algorithm [7] has quickly become one of the most popular computational techniques for likelihood estimation in a host of standard and nonstandard statistical problems. Common to all problems in which the EM can be applied is a notion of “missing data.” Consider a problem where data Y is observed and the goal is to estimate the parameter θ based on its likelihood function L(θ). Suppose that the observed Fig. 2. Sample path of the sequence {xn} in Example 2.3. data Y is incomplete in the sense that there is a The dotted line is the value of ξ used for data generation. component Z which is missing—this could be ac- tual values which are not observed, as in the case the simple hierarchical model of censored data, or it could be latent variables, i.i.d. ind as in a random effects model. Let X = (Y,Z) de- λ1, . . . , λn ∼ Exp(ξ) and Zi|λi ∼ Poi(λi) note the complete data. Then the likelihood func- for i = 1,...,n, where the exponential rate ξ> 0 is tion f(z,θ) based on the complete data x is related unknown. EB tries to estimate ξ based on the ob- to L(θ) according to the formula L(θ)= f(z,θ) dz. served data Z1,...,Zn. Here we consider a recursive The EM algorithm produces a convergent sequence R estimate of ξ. Fix an initial guess x0 of ξ. Assum- of estimates by iteratively filling in the missing data ing ξ is equal to x0, the posterior mean of λ1 is Z in the E-step and then maximizing the simpler −1 (Z1 + 1)/(x0 + 1), which is a good estimate of ξ complete-data likelihood function f(z,θ) in the M- if x0 is close to ξ. Iterating this procedure, we can step. The E-step is performed by sampling the z- generate a sequence values from the density 1 Z + 1 (2.4) x = x + w − i , f(z,θ)/L(θ), if L(θ) 6= 0, i i−1 i x x + 1 p(z|θ)= i−1 i−1 0, if L(θ) = 0, where {w } is assumed to satisfy (2.2). Let y de- i i which is the predictive density of Z, given Y and θ. note the quantity in brackets in (2.4) and take its It is often the case that at least one of the E-step expectation with respect to the distribution of Z : i and M-step is computationally difficult, and many ξ − x variations of the EM have been introduced to im- (2.5) h(x)= E(yi|xi−1 = x)= . ξx(x + 1) prove the rate of convergence and/or simplify the computations. In the case where the E-step cannot Then the sequence {xn} in (2.4) is a SA targeting a solution of h(x) = 0. Since h is continuous, de- be done analytically, Wei and Tanner [39] suggest re- creasing and h(x) = 0 iff x = ξ, it follows from the placing the expectation in the E-step with a Monte general theory that xn → ξ. Figure 2 shows the first Carlo integration. The resulting MCEM algorithm 250 steps of such a sequence with x0 = 1.5. comes with its own challenges, however; for example, simulating the missing data Znj, for j = 1,...,mn, The examples above emphasize one important prop- from p(z|θn) could be quite expensive. erty that h(x) must satisfy, namely, that it must Delyon, Lavielle and Moulines [6] propose, in the be easy to “sample” in the sense that there is a case where integration in the E-step is difficult or function H(x, z) and a random variable Z such that intractable, an alternative to the MCEM using SA. h(x)= E[H(x, Z)]. Another thing, which is not obvious from the examples, is that h(x) must have cer- SAEM Algorithm. At step n, simulate the tain stability properties. In general, a SA sequence missing data Znj from the posterior distribution SA AND NEWTON’S ALGORITHM 5 p(z|θn), j = 1,...,mn. Update Qn(θ) using resulting process is not Markov, but ergodicity is

mn proved using a regeneration argument [15]. wn An adaptive Metropolis (AM) algorithm is pre- Qn(θ) = (1 − wn)Qn−1(θ)+ b log f(Znj,θ), mn sented by Haario, Saksman and Tamminen [16], which Xj=1 b b uses previously visited states to update the proposal where {wn} is a sequence as in (2.2). Then choose covariance matrix Σ. Introduce a mean µ and set θn+1 such that Qn(θn+1) ≥ Qn(θ) for all θ ∈ Θ. θ = (µ, Σ). Let {wn} be a deterministic sequence as Compared to the MCEM, the SAEM algorithm’s in (2.2). b b use of the simulated data Znj is much more efficient. AM Algorithm. Fix a starting point z0 and At each iteration, the MCEM simulates a new set initial estimates µ0 and Σ0. At iteration n ≥ 1 draw of missing data from the posterior distribution and zn from Np(zn−1, cΣn−1) and set forgets the simulated data from the previous itera- ′ tion. On the other hand, note that the inclusion of Σn = (1 − wn)Σn−1 + wn(zn − µn−1)(zn − µn−1) , Q (θ) in the SAEM update θ 7→ θ implies all n−1 n n+1 µn = (1 − wn)µn−1 + wnzn. the simulated data points contribute. It is pointed −1 outb in [6] that the SAEM performs strikingly better Note that if wn = n , then µn and Σn are the than the MCEM in problems where maximization is sample mean and covariance matrix, respectively, of much cheaper than simulation. the observations z1,...,zn. The constant c in the Delyon, Lavielle and Moulines [6] show, using gen- AM is fixed and depends only on the dimension d eral SA results, that for a broad class of complete- of the support of π. A choice of c which is, in some data likelihoods f(z,θ) and under standard regular- sense, optimal is c = 2.42/d ([32], page 316). ity conditions, the SAEM sequence {θn} converges It is pointed out in [16] that the AM has the ad- a.s. to the set of stationary points {θ : ∇L(θ) = 0} vantage of starting the adaptation from the very be- of the incomplete-data likelihood. Moreover, they ginning. This property allows the AM algorithm to prove that the only attractive stationary points are search the support of π more effectively earlier than local maxima; that is, saddle points of L(θ) are other adaptive algorithms. Note that for the algo- avoided a.s. rithm of [15] mentioned above, the adaptation does not begin until the atom is first reached; although 2.2.2 Adaptive Markov Chain Monte Carlo. A ran- the renewal times are a.s. finite, they typically have dom walk Metropolis (RWM) algorithm is a spe- no finite upper bound. cific MCMC method that can be designed to sam- It is shown in [16] that, under certain conditions, ple from almost any distribution π. In this partic- the stationary distribution of the stochastic process ular case, the proposal is q(x,y)= q(x − y), where {zn} is the target π, the chain is ergodic (even though q is a symmetric density. A popular choice of q is a it is no longer Markovian), and there is almost sure Np(0, Σ) density. It is well known that the conver- convergence to θ = (µ , Σ ), the mean and covari- gence properties of Monte Carlo averages depend π π π ance of the target π. This implies that, as n →∞, on the choice of the proposal covariance matrix Σ, the proposal distributions in the AM algorithm will in the sense that it affects the rate at which the be close to the “optimal” choice. If H(z,θ) = (z − generated stochastic process explores the support of µ, (z − µ)(z − µ)′ − Σ), then the AM is a general π. Trial and error methods for choosing Σ can be SA algorithm with θ = θ + w H(z ,θ ), and difficult and time consuming. One possible solution n n−1 n n n−1 Andrieu, Moulines and Priouret [2] extend the work would be to use the history of the process to suit- in [16] via new SA stability results. ably tune the proposal. These so-called adaptive algorithms come with their own difficulties, however. 2.2.3 Stochastic approximation Monte Carlo. Let In particular, making use of the history destroys X be a finite or compact space with a dominat- the Markov property of the process so nonstandard ing measure ν. Let p(x)= κp0(x) be a probability results are needed in a convergence analysis. For density on X with respect to ν with possibly un- instance, when the state space contains an atom, known normalizing constant κ> 0. We wish to es- Gilks, Roberts and Sahu [15] propose an adaptive timate f dν, where f is some function depending algorithm that suitably updates the proposal den- on p or p0. For example, suppose p(x) is a prior sity only when the process returns to the atom. The and g(yR|x) is the conditional density of y given x. 6 R. MARTIN AND J. K. GHOSH

Then f(x)= g(y|x)p(x) is the unnormalized poste- In Example 2.4, we apply SAMC to estimate the rior density of x and its integral, the marginal den- partition function in the one-dimensional Ising model. sity of y, is needed to compute a Bayes factor. In this simple situation, a closed-form expression is The following stochastic approximation Monte available, which we can use as a baseline for assess- Carlo (SAMC) method is introduced in [21]. Let ing the performance of the SAMC estimate. A1,...,Am be a partition of X and let ηi = f dν Ai Example 2.4. Consider a one-dimensional Ising for 1 ≤ i ≤ m. Takeη ˆ (0) as an initial guess, and let i R model, which assumes that each of the d particles in ηî(n) be the estimate of ηi at iteration n ≥ 1. For a system has positive or negative spin. The Gibbs notational convenience, write distribution on X = {−1, 1}d has density (with re- ′ spect to counting measure ν) θni = logη î(n) and θn = (θn1,...,θnm) . 1 The probability vector π = (π ,...,π )′ will denote p (x)= e−E(x)/T ,Z(T )= e−E(x)/T , 1 m T Z(T ) the desired sampling frequency of the A ’s; that is, x∈X i X πi is the proportion of time we would like the chain where T is the temperature, and E is the energy to spend in Ai. The choice of π is flexible and does d−1 function defined, in this case, as E(x)= − i=1 xixi+1. not depend on the particular partition {A1,...,Am}. The partition function Z(T ) is of particular inter- P SAMC Algorithm. Starting with initial esti- est to physicists: the thermodynamic limit F (T )= −1 mate θ0, for n ≥ 0 simulate a sample zn+1 using a limd→∞ d log Z(T ) is used to study phase transi- RWM algorithm with target distribution tions [4]. In this simple case, a closed-form expres- m sion for Z(T ) is available. There are other more com- −θni plex systems, however, where no analytical solution (2.6) p(z|θn) ∝ f(z)e IAi (z), z ∈X . d i=1 is available and ν(X ) = 2 is too large to allow for X na¨ıve calculation of Z(T ). Then set θn+1 = θn + wn+1(ζn+1 − π), where the de- Our jumping-off point is the energy–temperature terministic sequence {wn} is as in (2.2), and ζn+1 = duality [18] Z(T )= Ω(u)e−u/T , where Ω(u)= ′ u (IA1 (zn+1),...,IAm (zn+1)) . ν{x : E(x)= u} is the density of states. We will ap- P The normalizing constant in (2.6) is generally un- ply SAMC to first estimate Ω(u) and then estimate Z(T ) with a plug-in: known and difficult to compute. However, p(z|θn) is only used at the RWM step where it is only required Z(T )= Ω(u)e−u/T . that the target density be known up to a proportion- u ality constant. X Note here thatb a single estimateb of Ω can be used It turns out that, in the case where no A are i to estimate the partition function for any T , elimi- empty, the observed sampling frequencyπ ˆ of A i i nating the need for simulations at multiple tempera- converges to π . This shows thatπ ˆ is independent of i i tures. Furthermore, Ω(u)= ν(X ) = 2d is known its probability p dν. Consequently, the resulting u Ai so, by imposing this condition on the estimate Ω we chain will not get stuck in regions of high probabil- P ity, as a standardR Metropolis chain might. do not fall victim to the lack of identifiability mentioned above. Figure 3 shows the true partition func-b The sequence {θn} is a general stochastic approx- d d−1 imation and, using the convergence results of [2], tion Z(T ) = 2 cosh (1/T ) for d = 10 as well as the SAMC estimate Z(T ) as a function of T ∈ [1, 4], on Liang, Liu and Carroll [21] show that if no Ai is empty and suitable conditions are met, then the log-scale, based on n = 1000 iterations. Clearly, Z performs quiteb well in this example, particularly for large T . (2.7) θni → C + log f dν − log πi a.s., ZAi b 2.3 ODE Stability Theory for 1 ≤ i ≤ m as n →∞, for some arbitrary constant C. Liang, Liu and Carroll [21] point out a lack The asymptotic theory of ODEs plays an impor- of identifiability in the limit (2.7); that is, C cannot tant role in the convergence analysis of a SA al- be determined from {θn} alone. Additional informa- gorithm. After showing the connection between SA m tion is required, such as i=1 ηî(n)= c for each n and ODEs, we briefly review some of the ODE the- and for some known constant c. ory that is necessary in the sequel. P SA AND NEWTON’S ALGORITHM 7

Imagine a physical system, such as an orbiting celestial body, whose state is being governed by the ODEx ˙ = h(x) with initial condition x(0) = x0. Then, loosely speaking, the system is stable if choosing an ′ alternative initial condition x(0) = x0 in a neighborhood of x0 has little eﬀect on the asymptotic properties of the resulting solution x(t). The following deﬁnition makes this more precise. Definition 2.5. A point ξ ∈ Rd is said to be locally stable forx ˙ = h(x) if for each ε> 0 there is a δ> 0 such that if kx(0) − ξk < δ, then kx(t) − ξk < ε for all t ≥ 0. If ξ is locally stable and x(t) → ξ as b t →∞, then ξ is locally asymptotically stable. If this Fig. 3. log Z(T ) (gray) and SAMC estimate log Z(T ) convergence holds for all initial conditions x(0), then (black) in Example 2.4. the asymptotic stability is said to be global.

Recall the general SA algorithm in (2.1) given Points ξ for which stability is of interest are equi- by xn = xn−1 + wnyn. Assume there is a measur- librium points ofx ˙ = h(x). Any point ξ such that able function h such that h(xn−1)= E[yn|xn−1] and h(ξ) = 0 is called an equilibrium point, since the rewrite this algorithm as constant solution x(t) ≡ ξ satisﬁesx ˙ = h(x).

xn = xn−1 + wnh(xn−1)+ wn{yn − h(xn−1)}. Example 2.6. Letx ˙ = Ax, where A is a fixed d × d matrix. For an initial condition x(0) = x0, we Define Mn = yn − h(xn−1). Then {Mn} is a zero- can write an explicit formula for the particular so- mean martingale sequence and, under suitable con- lution: x(t)= eAtx for t ≥ 0. Suppose, for simplic- ditions, the martingale convergence theorem guar- 0 ity, that A has a spectral decomposition A = UΛU ′, antees that M becomes negligible as n →∞, leav- n where U is orthogonal and Λ is a diagonal matrix of ing us with the eigenvalues λ1, . . ., λd of A. Then the matrix ex- At Λt ′ Λt xn = xn−1 + wnh(xn−1)+ wnMn ponential can be written as e = Ue U , where e λit is diagonal with ith element e . Clearly, if λi < 0, ≈ xn−1 + wnh(xn−1). then eλit → 0 as t →∞. Therefore, if A is negative But this latter “mean trajectory” is deterministic definite, then the origin x = 0 is globally asymptot- and essentially a finite difference equation with small ically stable. step sizes. Rearranging the terms gives us When explicit solutions are not available, proving xn − xn−1 = h(xn−1), asymptotic stability for a given equilibrium point wn will require a so-called Lyapunov function [20]. which, for large n, can be approximated by the ODE Definition 2.7. Let ξ ∈ Rd be an equilibrium x˙ = h(x). It is for this reason that the study of SA point of the ODEx ˙ = h(x) with initial condition algorithms is related to the asymptotic properties of Rd R solutions to ODEs. x(0) = x0. A function ℓ : → is called a Lyapunov Consider a general autonomous ODEx ˙ = h(x), function (at ξ) if: Rd Rd where h : → is a bounded and continuous, pos- • ℓ has continuous first partial derivatives in a neigh- sibly nonlinear, function. A solution x(t) of the ODE borhood of ξ; Rd is a trajectory in with a given initial condition • ℓ(x) ≥ 0 with equality if and only if x = ξ; x(0). Unfortunately, in many cases, a closed-form • the time derivative of ℓ along the path x(t), de- expression for a solution x(t) is not available. For fined as ℓ˙(x)= ∇ℓ(x)′h(x), is ≤ 0. that reason, other methods are necessary for study- ing these solutions and, in particular, their proper- A Lyapunov function is said to be strong if ℓ˙(x) = 0 ties as t →∞. implies x = ξ. 8 R. MARTIN AND J. K. GHOSH

Lyapunov functions are a generalization of the po- Theorem 2.10. For {xn} in (2.8) with {wn} tential energy of a system, such as a swinging pen- satisfying (2.2), assume dulum, and Lyapunov’s theory gives a formal ex- hSA1i sup Eky k2 < ∞. tension of the stability principles of such a system. n n hSA2i There exists a continuous function h(·) and Theorem 2.8 is very powerful because it does not E F require an explicit formula for the solution. See [20] a random vector βn such that (yn| n−1)= for a proof and various extensions of the Lyapunov h(xn−1)+ βn a.s. for each n. theory. hSA3i n wnkβnk converges a.s. Theorem 2.8. If there exists a (strong) Lya- If ξ isP globally asymptotically stable for x˙ = h(x), punov function in a neighborhood of an equilibrium then xn → ξ a.s. point ξ of x˙ = h(x), then ξ is (asymptotically) stable. 3. NEWTON’S RECURSIVE ESTIMATE There is no general recipe for constructing a Lya- punov function. In one important special case, how- Let Θ and X be the parameter space and sam- ever, a candidate Lyapunov function is easy to find. ple space, equipped with σ-finite measures µ and Suppose h(x)= −∇g(x), for some positive definite, ν, respectively. Typically, Θ and X are subsets of sufficiently smooth function g. Then ℓ(x)= g(x) is Euclidean space and ν is Lebesgue or counting mea- a Lyapunov function since ℓ˙(x)= −k∇g(x)k2 ≤ 0. sure. The measure µ varies depending on the inference problem: for estimation, µ is usually Lebesgue Example 2.9. Consider again the linear system or counting measure, but for testing, µ is often some- x˙ = Ax from Example 2.6, where A is a d × d nega- thing different (see Example 3.1). tive definite matrix. Here we will derive asymptotic Consider the following model for pairs of random stability by finding a Lyapunov function and apply- variables (X ,θi) ∈X× Θ: ing Theorem 2.8. In light of the previous remark, we i 1 ′ ˙ 2 i i.i.d. i ind i choose ℓ(x)= − 2 x Ax. Then ℓ(x)= −kAxk ≤ 0 so (3.1) θ ∼ f, Xi|θ ∼ p(·|θ ), i = 1,...,n, ℓ is a strong Lyapunov function forx ˙ = Ax and the origin is asymptotically stable by Theorem 2.8. where {p(·|θ) : θ ∈ Θ} is a parametric family of probability densities with respect to ν on X and f is a Of interest is the stronger conclusion of global probability density with respect to µ on Θ. In the asymptotic stability. Note, however, that Theorem present case, the variables (parameters) θ1,...,θn 2.8 does not tell us how far x0 can be from the equi- are not observed. Therefore, under model (3.1), librium in question and still get asymptotic stability. X1,...,Xn are i.i.d. observations from the marginal For the results that follow, we will prove the global density Π in (1.1). We call f the mixing density part directly. f (or prior, in the Bayesian context) and the inference 2.4 SA Convergence Theorem problem is to estimate f based on the data observed from Π . The following example gives a very impor- Consider, for fixed x and {w } satisfying (2.2), f 0 n tant special case of this problem—the analysis of the general SA algorithm DNA microarray data. (2.8) xn = Proj {xn−1 + wnyn}, n ≥ 1, X Example 3.1. A microarray is a tool that gives Rd where X ⊂ is compact and ProjX (x) is a projec- researchers the ability to simultaneously investigate tion of x onto X. The projection is necessary when the effects of numerous genes on the occurrence of boundedness of the iterates cannot be established by various diseases. Not all of the genes will be ex- other means. The truncated or projected algorithm pressed—related to the disease in question—so the (2.8) is often written in the alternative form [19] problem is to identify those which are. Let θi repre- i (2.9) xn = xn−1 + wnyn + wnzn, sent the expression level of the ith gene, with θ = 0 indicating the gene is not expressed. After some re- where zn is the “minimum” z such that xn−1 + i duction, the data Xi is a measure of θ , and the wnyn + wnz belongs to X. model is of the form (3.1) with f being a prior den- Next we state the main stochastic approximation sity with respect to µ = λ + δ . Consider the result used in the sequel, a special case of The- Leb {0} multiple testing problem orem 5.2.3 in [19]. Define the filtration sequence i Fn = σ(y1,...,yn). H0i : θ = 0, i = 1,...,n. SA AND NEWTON’S ALGORITHM 9

The number n of genes under investigation is often Theorem 3.2. Assume the following: in the thousands so, with little information about θi hN1i Θ is finite and µ is counting measure. in X , choosing a fixed prior f would be problematic. i hN2i w = ∞. On the other hand, the data contain considerable n n hN3i p(x|θ) is bounded away from 0 and ∞. information about the prior f so the empirical Bayes P approach—using the data to estimate the prior— Then surely there exists a density f∞ on Θ such has been quite successful [9]. that fn → f∞ as n →∞. In what follows, we focus our attention on a partic- Newton [26] presents a proof of Theorem 3.2 based ular estimate of the mixing density f. Let x1,...,xn ∈ on the theory of nonhomogeneous Markov chains. X be i.i.d. observations from the mixture density He proves that fn represents the n-step marginal Πf in (1.1). Newton [26] suggests the following al- distribution of the Markov chain {Zn} given by gorithm for estimating f. Zn−1, with prob 1 − wn, Newton’s Algorithm. Z0 ∼ f0,Zn = Choose a positive den- Yn, with prob wn, sity f0 on Θ and weights w1, . . . , wn ∈ (0, 1). Then for i = 1,...,n, compute where Yn has density ∝ p(Xn|θ)fn−1(θ). However, the claim that this Markov chain admits a station- p(Xi|θ)fi−1(θ) (3.2) fi(θ) = (1 − wi)fi−1(θ)+ wi , ary distribution is incomplete—N2 implies the chain Πi−1(Xi) {Zn} is weakly ergodic but the necessary strong er- where Πj(x)= p(x|θ)fj(θ) dµ(θ), and report fn(θ) godicity property does not follow, even when Θ is as the final estimate. finite. Counterexamples are given in [14, 17]. Ghosh R and Tokdar [14] prove consistency of f along quite In the following subsections we establish some n different lines. For probability densities ψ and ϕ asymptotic properties of f as n →∞ and we show n with respect to µ, define the Kullback–Liebler (KL) the results of several numerical experiments that divergence, demonstrate the finite-sample accuracy of Newton’s estimate (3.2) in both the discrete and continuous (3.3) K(ψ, ϕ)= ψ log(ψ/ϕ) dµ. cases. First, a few important remarks. ZΘ • The update fi−1 7→ fi in (3.2) is similar to a Bayes The following theorem is proved in [14] using an estimate based on a Dirichlet process prior (DPP), approximate martingale representation of K(f,fn). given the information up to, and including, time i − 1. That is, after observing X1, X2,...,Xi−1, D 1−wi Theorem 3.3. In addition to N1–N3, assume a Bayesian might model f with a DPP ( w , i hGT1i w2 < ∞. fi−1). In this case, the posterior expectation is n n hGT2i f is identifiable; that is, f 7→ Πf is injective. exactly the fi in (3.2). P • Because f depends on the ordering of the data n Then K(f,fn) → 0 a.s. as n →∞. and not simply on the sufficient statistic (X(1),..., Part of the motivation for the use of the KL di- X(n)), it is not a posterior quantity. • The algorithm is very fast: if one evaluates (3.2) vergence lies in the fact that the ratio fn/fn−1 has on a grid of m points θ1,...,θm and calculates the a relatively simple form. More important, however, integral in Πi−1 using, say, a trapezoid rule, then is the Lyapunov property shown in the proof The- the computational complexity is mn. orem 3.4. Sufficient conditions for GT2 in the case of finite Θ are given in, for example, [22, 37]. San 3.1 Review of Convergence Results Martin and Quintana [33] also discuss the issue of In this section, we give a brief review of the known identifiability in connection with the consistency of convergence results for Newton’s estimate fn in the fn. case of a finite parameter space. The case of a com- 3.2 Newton’s Estimate as a SA pact Θ is quite different and, until very recently [38], nothing was known about the convergence of fn in Here we show that Newton’s algorithm (3.2) is a such problems; see Section 5. special case of SA. First, note that if f is viewed Newton [26], building on the work in [27, 28], as a prior density, then estimating f is an empirical states the following convergence theorem. Bayes (EB) problem. The ratio in (3.2) is nothing 10 R. MARTIN AND J. K. GHOSH but the posterior distribution of θ, given xi, and Remark 3.5. Removal of the boundedness con- assuming that the prior f is equal to fi−1. This, in dition N3 on p(x|θ) in Theorem 3.4 extends the con- fact, is exactly the approach taken in Example 2.3 sistency result of [14] to many important cases, such to apply SA in an EB problem. as mixtures of normal or gamma densities. Let µ be counting measure and d = µ(Θ). We can 1 d ′ Remark 3.6. Theorem 3.4 covers the interior think of fn(θ) as a vector fn = (fn,...,fn) in the d case (when f is strictly positive) as well as the bound- probability simplex ∆ , defined as ary case (when f i = 0 for some i). The fact that i i d f0 > 0 implies fn > 0 for all n suggests that conver- ∆d = (ϕ1,...,ϕd)′ ∈ [0, 1]d : ϕi = 1 . gence may be slow in the boundary case. ( ) Xi=1 3.3 Simulations d Rd Define H : X × ∆ → with kth component Here we provide numerical illustrations comparing k the performance of Newton’s estimate with that of p(x|θk)ϕ k (3.4) Hk(x, ϕ)= − ϕ , k = 1,...,d, its competitors. We consider a location-mixture of Πϕ(x) normals; that is, p(·|θ) is a N(θ,σ2) density. The k −1 where Πϕ(x)= k p(x|θk)ϕ is the marginal density weights are set to be wi = (i + 1) and the initial d on X induced by ϕ ∈ ∆ . Then (3.2) becomes estimate f0 is taken to be a Unif(Θ) density. For the P Bayes estimate, we assume a Dirichlet process prior (3.5) fn = fn−1 + wnH(Xn,fn−1). f ∼ D(1,f0) in each example.

Let Px = diag{p(x|θk) : k = 1,...,d} be the diagonal Example 3.7 (Finite Θ). In this example, we matrix of the sampling density values and define the compare Newton’s recursive (NR) estimate with the mapping h :∆d → Rd to be the conditional expecta- nonparametric maximum likelihood (NPML) esti- tion of H(x,fn), given fn = ϕ: mate and the nonparametric Bayes (NPB) estimate. Computation of NR and NPML (using the EM algo- h(ϕ)= H(x, ϕ)Πf (x) dν(x) rithm) is straightforward. Here, in the case of finite X (3.6) Z Θ, we use sequential imputation [23] to calculate Z Πf (x) NPB. Take Θ = ∩ [−4, 4], and set σ =1 in p(x|θ). = Pxϕ dν(x) − ϕ, We consider two different mixing distributions on X Πϕ(x) Z Θ: where f = (f 1,...,f d)′ is the true mixing/prior distribution. From (3.6), it is clear that f solves the I. f = Bin(8, 0.6), II. f = 0.5δ + 0.5δ . equation h(ϕ) = 0 which implies (i) f is an equilib- {−2} {2} rium point of the ODEϕ ˙ = h(ϕ), and (ii) that f is We simulate 50 data sets of size n = 100 from the a fixed point of the map models corresponding to mixing densities I, II and computing the three estimates for each. Figure 4 Π (x) T (ϕ)= h(ϕ)+ ϕ = f P ϕ dν(x). shows the resulting estimates for a randomly chosen Π (x) x Z ϕ data set from each model. Notice that NR does bet- Newton [26], page 313, recognized the importance of ter for model I than both NPML and NPB. The story is different for model II—both NPML and this map in relation to the limit of fn. Also, the use of T in [5, 35] for the I-projection problem is closely NPB are considerably better than NR. This is fur- related to the SA approach taken here. ther illustrated in Figure 5 where the KL divergence R We have shown that (3.5) can be considered as a K(Πf , Πn) on X = is summarized over the 50 sam- general SA algorithm, targeting the solution ϕ = f ples. We see that NR has a slightly smaller KL num- of the equation h(ϕ)=0 in ∆d. Therefore, the SA ber thanb NPML and NPB for model I, but they results of Section 2.4 can be used in the conver- clearly dominate NR for model II. This discrepancy gence analysis. The following theorem is proved in is at least partially explained by Remark 3.6; see Appendix A.1. Section 5 for further discussion. We should point out, however, that both NPML and NPB take sig- Theorem 3.4. Assume N1, N2, GT1 and GT2. nificantly longer to compute than NR, about 100 If p(·|θ) > 0 ν-a.e. for each θ, then fn → f a.s. times longer on average. SA AND NEWTON’S ALGORITHM 11

Fig. 4. Estimates of mixing densities I and II in Example 3.7. Left column: True f (gray) for model I and the three estimates (black). Right column: True f (gray) for model II and the three estimates (black). 12 R. MARTIN AND J. K. GHOSH

cannot be used in this situation since θ does not fully specify the sampling density. In this section we introduce a modiﬁcation of New- ton’s algorithm to simultaneously and recursively estimate both a mixing distribution and an additional unknown parameter. This modiﬁcation, called the Newton+Plug-in (N+P), is actually quite simple— at each step we use a plug-in estimate of ξ in the update (3.2). We show that the N+P algorithm can be written as a general SA algorithm and, under certain conditions, prove its consistency. Let p(x|θ,ξ) be a two-parameter family of densities on X , and consider the model

i.i.d. θ1,...,θn ∼ f, (4.1) b i.i.d. Fig. 5. Summary of the KL divergence K(Πf , Πn) for the i b Xi1,...,Xir ∼ p(·|θ ,ξ), i = 1,...,n, three estimates Πn in models I and II in Example 3.7. where f is an unknown density on Θ and the parameter ξ ∈ Ξ is also unknown. The number of replicates Example 3.8 (Compact Θ). We consider a one- r ≥ 2 is assumed fixed. Note that (4.1) is simply a and a two-component mixture of beta densities on nonparametric random effects model. Θ = [0, 1] as the true f: Assume, for simplicity, that Ξ ⊆ R; the more gen- I. f = Beta(2, 7), eral case Ξ ⊆ Rp is a natural extension of what fol- II. f = 0.33 Beta(3, 30) + 0.67 Beta(4, 4). lows. Let Θ= {θ1,...,θd} be a finite set and take µ to be counting measure on Θ. Recall that ∆d is the Let σ = 0.1 be the normal sampling variance. Again, probability simplex. Assume: computation of NR is straightforward. To compute NPB, the importance sampling algorithm in [38] hNP1i ξ ∈ int(Ξ0), where Ξ0 is a compact and con- that makes use of a collapsing of the PolyáUrn vex subset of Ξ. d scheme is used. Figure 6 shows a typical realiza- hNP2i f ∈ int(∆0) where ∆0 ⊂ ∆ is compact and, 1 d tion of NR and NPB, based on a sample of size for each ϕ ∈ ∆0, the coordinates ϕ ,...,ϕ n = 100 from each of the corresponding marginals. are bounded away from zero. Note that the Bayes estimate does a rather poor The subset Ξ can be arbitrarily large so assump- job here, being much too spiky in both cases. This 0 tion NP1 causes no difficulty in practice. Assump- is mainly because the posterior for f sits on dis- tion NP2 is somewhat restrictive in that f must crete distributions. On the other hand, Newton’s es- be strictly positive. While NP2 seems necessary to timate has learned the general shape of f after only prove consistency (see Appendix A.3), simulations 100 iterations and results in a much better estimate suggest that this assumption can be weakened. than NPB. Furthermore, on average, the computa- The N+P algorithm uses an estimate of ξ at each tion time for NR is again about 100 times less than step in Newton’s algorithm. We assume here that an that of NPB. unbiased estimate is available:

4. N+P ALGORITHM hNP3i There exists an unbiased estimate TUBE(x), x ∈X r, of ξ with variance v2 < ∞. Suppose that the sampling distribution on X is parametrized not only by θ but by an additional Later we will replace the unbiased estimate with a parameter ξ. An example of this is the normal dis- Bayes estimate. This will require replacing NP3 with tribution with mean θ and variance ξ = σ2. More another assumption. speciﬁcally, we replace the sampling densities p(x|θ) At time i = 1,...,n, we observe an r-vector Xi = ′ ˆ(i) of Section 3 with p(x|θ,ξ) where θ is the latent vari- (Xi1,...,Xir) and we compute ξ = TUBE(Xi). An able, and ξ is also unknown. Newton’s algorithm unbiased estimate of ξ based on the entire data X1,..., SA AND NEWTON’S ALGORITHM 13

−1 n ˆ(i) Xn would be the average ξn = n i=1 ξ , which trary ξ0 ∈ Ξ0. Then for i = 1,...,n compute has a convenient recursive expression −1 ˆ(i) P ξi = ProjΞ0 {i [(i − 1)ξi−1 + ξ ]}, −1 (i) (4.2) ξi = i [(i − 1)ξi−1 + ξˆ ], i = 1,...,n. fi = Proj∆0 {fi−1 + wiH(Xi; fi−1,ξi)}, ˆ(1) ˆ(n) More importantly, by construction, ξ ,..., ξ are and produce (fn,ξn) as the final estimate. i.i.d. random variables with mean ξ and finite vari- We claim that the N+P algorithm for estimating f ance. It is, therefore, a consequence of the SLLN that can be written as a general SA involving the true but ξ , as defined in (4.2), converges a.s. to ξ. While this n unknown ξ plus an additional perturbation. Define result holds for any unbiased estimate T , an unbi- the quantities ased estimate T ′ with smaller variance is preferred, since it will have better finite-sample performance. (4.4) h(fn−1)= E[H(Xn,fn−1,ξ)|Fn−1], r d Define the mapping H : X × ∆0 × Ξ0 → R with βn = E[H(Xn,fn−1,ξn)|Fn−1] kth component (4.5) E F k − [H(Xn,fn−1,ξ)| n−1], p(x|θk, ψ)ϕ k (4.3) Hk(x, ϕ, ψ)= − ϕ , j where Fn−1 = σ(X1,...,Xn−1), so that j p(x|θj, ψ)ϕ E F for k = 1,...,d, wherePϕ and ψ denote generic el- [H(Xn,fn−1,ξn)| n−1]= h(fn−1)+ βn. ements in ∆0 and Ξ0, respectively, and p(·|θ, ψ) is Now the update fn−1 7→ fn can be written as the joint density of an i.i.d. sample of size r from (4.6) f = f + w {h(f )+ M + β + z }, p(·|θ, ψ). n n−1 n n−1 n n n where z is the “minimum” z keeping f in ∆ , and N+P Algorithm. Choose an initial estimate n n 0 f0 ∈ ∆0, weights w1, . . ., wn ∈ (0, 1), and an arbi- Mn = H(Xn,fn−1,ξn) − h(fn−1) − βn

Fig. 6. Estimates of the mixing densities I and II in Example 3.8. Top row: true f (gray) and NR (black). Bottom row: true f (gray) and NPB (black). 14 R. MARTIN AND J. K. GHOSH is a martingale adapted to Fn−1. Notice that (4.6) Proposition 4.3. NP4 holds for H in (4.7). is now in a form in which Theorem 2.10 can be ap- Let Σ be the Ξ defined in the general setup. plied. We will make use of the Law of Iterated Log- 0 0 arithm so define u(t) = (2t log log t)1/2. The consis- For the N+P, we choose TUBE(x) to be the sample tency properties of the N+P algorithm are summa- variance of x, resulting in the recursive estimate rized in the following theorem. i r 1 (4.8) σ2 = (X − X )2. Theorem 4.1. Assume N1, GT1, GT2, NP1– i i(r − 1) kj k NP3. In addition, assume Xk=1 Xj=1 2 2 hNP4i ∂ H(x; ϕ, ψ) is bounded on X r × ∆ × Ξ . For σ , take the standard noninformative prior p(σ )= ∂ψ 0 0 2 −1 −1 (σ ) . Under squared-error loss, the Bayes estimate hNP5i n wnn u(n) converges. 2 of σ based on X1,...,Xi is Then (f ,ξ ) → (f,ξ) a.s. as n →∞. Pn n 2 E 2 F σ˜i = (σ | i) We now remove the restriction to unbiased esti- (4.9) mates of ξ, focusing primarily on the use of a Bayes i r 1 2 estimate in place of the unbiased estimate. But first, = (Xkj − Xk) . ˜ i(r − 1) − 2 let ξi = T (X1,...,Xi) be any suitable estimate of ξ Xk=1 Xj=1 based on only X1,...,Xi. Then replace the N+P 2 2 −1/2 Note that |σñ −σ | = O(n ) a.s. so the conclusion update fi−1 7→ fi with −1 of Corollary 4.2 holds if wn ∼ n . ˜ ˜ ˜ ˜ fi = Proj∆0 {fi−1 + wiH(Xi, fi−1, ξi)}. The following example compares three resulting While this adaptation is more flexible with regard estimates for this location mixture of normals prob- to the choice of estimate, this additional flexibility lem: when σ2 is known, when (4.8) is used with the does not come for free. Notice that the algorithm is N+P and when (4.9) is used in the modified N+P. no longer recursive. That is, given a new data point Convergence of the iterates holds in each case by xn+1, we need more information than just the pair Theorems 3.4 and 4.1 and Corollary 4.2. (f˜ , ξ˜ ) to obtain (f˜ , ξ˜ ). n n n+1 n+1 Example 4.4. Let Θ= Z ∩ [−4, 4] and take f Corollary 4.2. If assumptions NP3 and NP5 to be a Bin(8, 0.5) density on Θ. Suppose r = 10, −1 2 in Theorem 4.1 are replaced by n = 100, wi = (i + 1) and set σ = 1.5. For each of ′ 100 simulated data sets, the three estimates of f are hNP3 i |ξñ − ξ| = O(ρn) a.s. as n →∞, ′ computed using Newton’s algorithm, the N+P and hNP5 i wnρn < ∞, n the Bayes modification. Each algorithm produces es- ˜ ˜ 2 then (fnP, ξn) → (f,ξ) a.s. as n →∞. timates fˆ andσ ˆ with which we compute Πσˆ(s)= d 2 ˆj Typically, for Bayes and ML estimates, the rate is j=1 g(s|θj, σˆ )f . Figure 7 summarizes the 100 KL ρ = n−1/2. Then NP5′ holds if, e.g., w ∼ n−1. b n n divergencesP K(Π, Πσˆ) for each of the three estimates. To illustrate the N+P and its modified version, Surprisingly, little efficiency is lost when an estimate consider the special case where p(·|θ,ξ) in (4.1) is of σ2 is used ratherb than the true value. Also, the a normal density with mean θ and ξ = σ2 is the N+P and the Bayes modification perform compa- unknown variance. That is, rably, with the Bayes version performing perhaps i.i.d. i 2 slightly better on average. Note that no projections Xi1,...,Xir ∼ N(θ ,σ ), i = 1,...,n. −4 4 onto Σ0 = [10 , 10 ] or Moreover, the statistic Si = Xi is sufficient for the 2 k −4 mean and the density g(·|θ,σ ) of Si is known. There- ∆0 = {ϕ ∈ ∆ : ϕ ≥ 10 , k = 1,...,d} fore, H in (4.3) can be written as were necessary in this example. g(s|θ , ψ)ϕk (4.7) H (s, ϕ, ψ)= k − ϕk k j 5. DISCUSSION j g(s|θj, ψ)ϕ for k = 1,...,d, where gP(s|θ, ψ) is the N(θ, ψ/r) den- In this paper, we have used general results in the sity. Even in this simple example, it is not obvious area of stochastic approximation to prove a consis- that the function H in (4.7) satisfies NP4. A proof tency theorem for a recursive estimate of a mixing of the following proposition is in Appendix A.3. distribution/prior in the case of a finite parameter SA AND NEWTON’S ALGORITHM 15

Newton et al. [26–28] claim that the former should serve as a suitable approximation to the latter. Our calculations in Section 3.3 disagree. In particular, we see two examples in the finite case, one where the recursive estimate is significantly better and the other where the Bayes estimate is significantly better. A new question arises: if it is not an approximation of the Dirichlet process prior Bayes estimate, for what prior does the recursive estimate approximate the corresponding Bayes estimate? Finally, it should be pointed out that our approach to the finite mixture problem is somewhat less general than would be desirable. In particular, we are assuming that the support of f is within a known b finite set of points. In general, however, what is Fig. 7. Summary of KL divergences K(Π, Πσˆ ) for the three known is that the support of f is contained in, say, algorithms in Example 4.4. a bounded interval. In this case, a set of grid points Θ= {θ ,...,θ } are chosen to approximate the un- space Θ. It is natural to wonder if this theorem 1 m known support Θ∗ = {θ∗,...,θ∗ } of f. Newton’s al- can be extended to the case where f is an infinite- 1 M gorithm will produce an estimate f on Θ in this dimensional parameter on an uncountable space Θ. n Very recently, Tokdar, Martin and Ghosh [38] have case, but it is impossible to directly compare fn to f since their supports Θ and Θ∗ may be en- proved consistency of fn in the infinite-dimensional case, under mild conditions. Their argument is based tirely different. There is no problem comparing the on the approximate martingale representation used marginals, however. This leads us to the following in [14] but applied to the KL divergence K(Πf , Πn) important conjecture, closely related to the so-called between the induced marginals. Again, there is a I-projections in [5, 35]. connection between their approach and the SA ap- Conjecture. proach taken here, namely, K(Π , Π ) is also a Lya- Let Πfn and Πf be the marginal f ϕ ∗ punov function for the associated ODEϕ ˙ = h(ϕ). densities corresponding to fn on Θ and f on Θ , In addition to convergence, there are some other respectively. Then, as n →∞, interesting theoretical and practical questions to con- K(Πf , Πf ) → inf K(Πf , Πϕ) a.s., sider. First and foremost, there is the question of n ϕ rate of convergence which, from a practical point of view, is much more important than convergence where ϕ ranges over all densities on Θ. alone. We expect that, in general, the rate of conver- Despite these unanswered practical and theoret- gence will depend on the support of f0, the weights ical questions, the strong performance of Newton’s wn and, in the case of an uncountable Θ, the smooth- ness of f. Whatever the true rate of convergence algorithm and the N+P algorithm in certain cases might be, Example 3.7 (model II) demonstrated that and, more importantly, their computational cost- this rate is unsatisfactory when the support of f is effectiveness, make them very attractive compared misspecified. For this reason, a modification of the to the more expensive nonparametric Bayes esti- algorithm that better handles such cases would be mate or the nonparametric MLE, and worthy of fur- desirable. ther investigation. Another question of interest goes back to the original motivation for Newton’s recursive algorithm. APPENDIX: PROOFS To an orthodox Bayesian, any method which performs well should be at least approximately Bayes. A.1 Proof of Theorem 3.4 Stemming from the fact that the recursive estimate and the nonparametric Bayes estimate, with the ap- To prove the theorem, we need only show that propriate Dirichlet process prior, agree when n = 1, the algorithm (3.5) satisfies the conditions of Theo- 16 R. MARTIN AND J. K. GHOSH

d rem 2.10. First note that fn is, for each n, a convex Given that ∇ℓ(ϕ) exists on all of ∆ , the time combination of points in the interior of ∆d so no derivative of ℓ along ϕ is projection as in (2.8) is necessary. Second, the ran- ˙ ′ dom variables βn in assumption SA2 are identically ℓ(ϕ)= ∇ℓ(ϕ) h(ϕ) zero so SA3 is trivially satisfied. Πf (x) − Πϕ(x) ′ d (A.2) = ∇ℓ(ϕ) P ϕ dν(x) Let {un} be a convergent sequence in ∆ , where Π (x) x 1 d ′ 1 d ′ Z ϕ un = (un, . . ., un) . The limit u = (u , . . . , u ) = Πf limn→∞ un also belongs to ∆ so h(u) is well de- = 1 − Πf dν. ′ Π fined. To prove that h = (h1,...,hd) is continuous, Z ϕ we show that hk(un) → hk(u) for each k = 1,...,d ˙ as n →∞. Consider It remains to show that ℓ(ϕ) = 0 iff ϕ = f. Applying Jensen’s inequality to y 7→ y−1 in (A.2) gives k p(x|θk)un k h (u )= Π (x) dν(x) − u . −1 k n f n Π Πun (x) ˙ ϕ Z ℓ(ϕ) = 1 − Πf dν Π k ZX f The integrand p(·|θk)un/Πun (·) is nonnegative and (A.3) bounded ν-a.e. for each k. Then by the bounded Π −1 ≤ 1 − ϕ Π dν = 0, convergence theorem we get Π f ZX f lim hk(un)= hk(u), k = 1,...,d. n→∞ where equality can hold in (A.3) iff Πϕ = Πf ν-a.e. We assume the mixtures are identifiable, so this im- d But {un}⊂ ∆ was arbitrary so h is continuous. plies ϕ = f. Therefore, ℓ˙(ϕ) = 0 iff ϕ = f, and we Next, note that H(x,fn) is the difference of two have shown that ℓ is a strong Lyapunov function d points in ∆ and is thus bounded independent of on ∆d. To prove that f is a globally asymptotically x and n. Then SA1 holds trivially. stable point forϕ ˙ = h(ϕ), suppose that ϕ(t) is a so- Finally, we show that f is globally asymptoti- lution, with ϕ(0) = f0, that does not converge to f. d cally stable for the ODEϕ ˙ = h(ϕ) in ∆ . Note that Since ℓ is a strong Lyapunov function, the sequence d i d i=1 ϕ˙ = i=1 hi(ϕ) = 0 so the trajectories lie on ℓ(ϕ(t)), as t →∞, is bounded, strictly decreasing d the connected and compact ∆ . Let ℓ(ϕ) be the KL and, thus, has a limit λ> 0. Then the trajectory P P d k k k divergence, ℓ(ϕ) = k=1 f log(f /ϕ ). We claim ϕ(t) must fall in the set that ℓ is a strong Lyapunov function forϕ ˙ = h(ϕ) P ∗ d at f. Certainly ℓ(ϕ) is positive definite. To check the ∆ = {ϕ ∈ ∆ : λ ≤ ℓ(ϕ) ≤ ℓ(f0)} differentiability condition, we must show that ℓ(ϕ) d has a well-defined gradient around f, even when f for all t ≥ 0. In the case f ∈ int(∆ ), ℓ(ϕ) →∞ as ∗ is on the boundary of ∆d. Suppose, without loss ϕ → ∂∆, so the set ∆ is compact (in the relative d ∗ of generality, that f 1,...,f s are positive, 1 ≤ s ≤ d, topology). If f ∈ ∂∆ , then ∆ is not compact but, ˙ and the remaining f s+1,...,f d are zero. By defini- as shown above, ℓ(ϕ) is well defined and continuous ˙ tion, ℓ(ϕ) is constant in ϕs+1,...,ϕd and, therefore, there. In either case, ℓ is continuous and bounded ∗ the partial derivatives with respect to those ϕ’s are away from zero on ∆ , so zero. Thus, for any 1 ≤ s ≤ d and for any ϕ such sup ℓ˙(ϕ)= −L< 0. that ℓ(ϕ) < ∞, the gradient can be written as ϕ∈∆∗ 1 2 d ′ s ′ (A.1) ∇ℓ(ϕ)= −(r , r , . . ., r ) + r Is, Then, for any τ ≥ 0, we have where rk = f k/ϕk and I is a vector whose first s τ s ˙ coordinates are one and last d − s coordinates are ℓ(ϕ(τ)) = ℓ(f0)+ ℓ(ϕ(s)) ds ≤ ℓ(f0) − Lτ. 0 zero. The key point here is that the gradient of ℓ(ϕ), Z for ϕ restricted to the boundary which contains f, is If τ >ℓ(f0)/L, then ℓ(ϕ(τ)) < 0, which is a con- exactly (A.1). We can, therefore, extend the defini- tradiction. Therefore, ϕ(t) → f for all initial condition of ∇ℓ(ϕ) continuously to the boundary if need tions ϕ(0) = f0, so f is globally asymptotically sta- be. ble. Theorem 2.10 then implies fn → f a.s. SA AND NEWTON’S ALGORITHM 17

A.2 Proof of Theorem 4.1 Proof of Proposition 4.3. Clearly each component H of H, defined in (4.7), is differentiable The proof of the theorem requires the following k with respect to ψ ∈ Σ0 and, after simplification, lemma, establishing a Lipschitz-type bound on the 2 k −rθk/2ψ rsθk/ψ error terms βn in (4.5). Its proof follows immediately ∂ ϕ e e Hk(s, ϕ, ψ)= from NP4 and the Mean Value Theorem. ∂ψ 2ψ2 2 j −rθj /2ψ rsθj /ψ Lemma A.1. Under the assumptions of Theo- j ukj(s)ϕ e e · 2 , rem 4.1, there exists a number A ∈ (0, ∞) such that j −rθj /2ψ rsθj /ψ 2 P[ j ϕ e e ] E F kβnk≤ A (|ξn − ξ| | n−1). where (as |s|→∞) P 2 2 Proof of Theorem 4.1. The map h in (4.4) (A.5) ukj(s)= θk − θj + 2s(θj − θk)= O(|s|). r has kth component This derivative is continuous on s(X ) × ∆0 × Σ0 and, since ∆0 and Σ0 are compact, we know that r hk(ϕ)= H(x; ϕ, ξ)Πf,ξ(s) dν (x) ∂ (A.6) A (s):= sup sup H (s; ϕ, ψ) Z k ∂ψ k Π (x) ϕ∈∆0 ψ∈Σ0 = f,ξ p(x|θ ,ξ)ϕk dνr(x) − ϕk, k is finite for all s ∈ s(X r) and for all k. By the Mean Πϕ,ξ(x) Z Value Theorem, k where Πf,ξ(x)= p(x|θk,ξ)f is the marginal den- 2 2 k |Hk(s; ϕ, ψ) − Hk(s; ϕ, σ )|≤ Ak(s)|ψ − σ |. sity of x and νr is the product measure on X r. No- tice that this h,P which does not depend on the esti- It remains to show that Ak(s) is bounded in s. For notational simplicity, assume that ϕ and ψ are the mate ξn, is exactly the same as the h in (3.6). There- fore, the continuity and stability properties derived values for which the suprema in (A.6) are attained. Making a change of variables y = rs/ψ we can, with in the proof of Theorem 3.4 are valid here as well. a slight abuse of notation, write All that remains is to show that the βn’s in (4.5) k yθk j yθj satisfy SA3 of Theorem 2.10. Ckϕ e j |ukj(y)|ϕ e Ak(y) ≤ . By the SLLN, ξ belongs to Ξ for large enough n j yθj 2 n 0 [ Pj ϕ e ] so we can assume, without loss of generality, that no We must show that A (y) is bounded as |y|→∞. projection is necessary. Let S = Z +· · ·+Z , where k P n 1 n Assume, without loss of generality, that the θ’s are the Z = v−1(ξˆ(i) − ξ) and v2 is the variance of ξˆ(i). i arranged in ascending order: θ <θ < · · · <θ . Fac- Then |ξ − ξ| = cn−1|S |, where c> 0 is a constant 1 2 d n n toring out, respectively, eyθ1 and eyθd , we can write independent of n. Since Sn is a sum of i.i.d. random k y(θk−θ1) j y(θj −θ1) variables with mean zero and unit variance, the Law Ckϕ e j |ukj(y)|ϕ e Ak(y) ≤ , of Iterated Logarithm states that 1 2 j i y(θj −θ1)+y(θi−θ1) (ϕ ) + j6=1 iP6=1 ϕ ϕ e

k y(θk−θd) j y(θj −θd) (A.4) lim sup{|Sn|/u(n)} = 1 a.s. Ckϕ eP P j |ukj(y)|ϕ e n→∞ Ak(y) ≤ . d 2 j i y(θj −θd)+y(θi−θd) (ϕ ) + j6=d iP6=d ϕ ϕ e Now, by Lemma A.1 and (A.4) we have j Note that since ϕP∈ ∆0P, each ϕ is bounded away −1 −1 y(θk−θ1) kβnk≤ Acn E(|Sn| | Fn−1)= O(n u(n)) from 0. If y → −∞, then the term e → 0 dominates the numerator of the first inequality, while and, therefore, n wnkβnk converges a.s. by NP5. the denominator is bounded. Similarly, if y → +∞, Condition SA3 is satisfied, completing the proof. then the term ey(θk−θd) → 0 dominates the numer- P ator in the second inequality, while the denomina- A.3 Proof of Proposition 4.3 tor is bounded. For the case k =1 or k = d, note To prove that the case of a location-mixture of that |u11(y)| = |udd(y)| = 0, so the two inequalities normals with unknown variance is covered by The- can still be applied and a similar argument shows orem 4.1, we must show that the function H, de- A1 and Ad are also bounded. Therefore, Ak(y) is fined in (4.7), satisfies NP4, that is, that the partial bounded for each k and the claim follows by taking ∂ A to be max{supy Ak(y) : 1 ≤ k ≤ d}. derivatives ∂ψ Hk(s; ϕ, ψ) are bounded. 18 R. MARTIN AND J. K. GHOSH

ACKNOWLEDGMENTS mixing distribution. In Frontiers in Statistics 429– 443. Imp. Coll. Press, London. MR2326012 The authors thank Professors Chuanhai Liu and [15] Gilks, W., Roberts, G. and Sahu, S. (1998). Adaptive Surya T. Tokdar for numerous fruitful discussions, Markov chain Monte Carlo through regeneration. J. as well as Professors Jim Berger and Mike West, Amer. Statist. Assoc. 93 1045–1054. MR1649199 the Associate Editor and the two referees for their [16] Haario, H., Saksman, E. and Tamminen, J. (2001). helpful comments. An adaptive Metropolis algorithm. Bernoulli 7 223– 242. MR1828504 [17] Isaacson, D. L. and Madsen, R. W. (1976). Markov REFERENCES Chains: Theory and Applications. Wiley, New York. MR0407991 Allison, D., Gadbury, G., Heo, M., Fernandez,´ [1] [18] Kou, S. C., Zhou, Q. and Wong, W. H. (2006). Equi- J., Lee, C., Prolla, T. and Weindruch, R. energy sampler with applications in statistical in- (2002). A mixture model approach for the anal- ference and statistical mechanics. Ann. Statist. 34 ysis of microarray gene expression data. Comput. 1581–1619. MR2283711 Statist. Data Anal. 39 1–20. MR1895555 [19] Kushner, H. J. and Yin, G. G. (2003). Stochastic Ap- [2] Andrieu, C., Moulines, E. and Priouret, P. (2005). proximation and Recursive Algorithms and Applica- Stability of stochastic approximation under veriﬁ- able conditions. SIAM J. Control Optim. 44 283– tions, 2nd ed. Springer, New York. MR1993642 LaSalle, J. Lefschetz, S. 312. MR2177157 [20] and (1961). Stability by Lia- [3] Barron, A., Schervish, M. and Wasserman, L. punov’s Direct Method with Applications. Academic (1999). The consistency of posterior distributions in Press, New York. MR0132876 nonparametric problems. Ann. Statist. 27 536–561. [21] Liang, F., Liu, C. and Carroll, R. J. (2007). Stochas- MR1714718 tic approximation in Monte Carlo computation. J. [4] Cipra, B. (1987). Introduction to the Ising model. Amer. Statist. Assoc. 102 305–320. MR2345544 Amer. Math. Monthly 94 937–959. MR0936054 [22] Lindsay, B. (1995). Mixture Models: Theory, Geometry [5] Csiszar,´ I. (1975). I-divergence geometry of probabil- and Applications. IMS, Haywood, CA. ity distributions and minimization problems. Ann. [23] Liu, J. S. (1996). Nonparametric hierarchical Bayes via Probab. 3 146–158. MR0365798 sequential imputations. Ann. Statist. 24 911–930. [6] Delyon, B., Lavielle, M. and Moulines, E. (1999). MR1401830 Convergence of a stochastic approximation ver- [24] McLachlan, G., Bean, R. and Peel, D. (2002). A sion of the EM algorithm. Ann. Statist. 27 98–128. mixture model-based approach to the clustering of MR1701103 microarray expression data. Bioinformatics 18 413– [7] Dempster, A., Laird, N. and Rubin, D. (1977). 422. Maximum-likelihood from incomplete data via the [25] Nevel’son, M. B. and Has’minskii, R. Z. (1973). EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. Stochastic Approximation and Recursive Estima- MR0501537 tion. Amer. Math. Soc., Providence, RI. Trans- [8] Efron, B. and Tibshirani, R. (2002). Empirical Bayes lations of Mathematical Monographs, Vol. 47. methods and false discovery rates for microarrays. MR0423714 23 Genet. Epidemiol. 70–86. [26] Newton, M. A. (2002). A nonparametric recursive es- [9] Efron, B., Tibshirani, R., Storey, J. and Tusher, timator of the mixing distribution. Sankhy¯aSer. A V. (2001). Empirical Bayes analysis of a microarray 64 306–322. MR1981761 experiment. J. Amer. Statist. Assoc. 96 1151–1160. [27] Newton, M. A., Quintana, F. A. and Zhang, MR1946571 Y. (1998). Nonparametric Bayes methods using [10] Fan, J. (1991). On the optimal rates of convergence predictive updating. In Practical Nonparametric for nonparametric deconvolution problems. Ann. and Semiparametric Bayesian Statistics (D. Dey, Statist. 19 1257–1272. MR1126324 P. Muller and D. Sinha, eds.) 45–61. Springer, New [11] Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures York. MR1630075 in density estimation. Ann. Statist. 27 143–158. [28] Newton, M. A. and Zhang, Y. (1999). A recursive MR1701105 algorithm for nonparametric analysis with missing 86 [12] Ghosal, S. and van der Vaart, A. (2001). Entropies data. Biometrika 15–26. MR1688068 and rates of convergence for maximum likelihood [29] Quintana, F. A. and Newton, M. A. (2000). Compu- and Bayes estimation for mixtures of normal densi- tational aspects of nonparametric Bayesian analysis ties. Ann. Statist. 29 1233–1263. MR1873329 with applications to the modeling of multiple binary [13] Ghosh, J. K. and Ramamoorthi, R. V. (2003). sequences. J. Comput. Graph. Statist. 9 711–737. Bayesian Nonparametrics. Springer, New York. MR1821814 MR1992245 [30] Robbins, H. (1964). The empirical Bayes approach to [14] Ghosh, J. and Tokdar, S. (2006). Convergence and statistical decision problems. Ann. Math. Statist. 35 consistency of Newton’s algorithm for estimating a 1–20. MR0163407 SA AND NEWTON’S ALGORITHM 19

[31] Robbins, H. and Monro, S. (1951). A stochastic ap- 24, Dept. Statistics, Purdue Univ., West Lafayette, proximation method. Ann. Math. Statist. 22 400– IN. 407. MR0042668 [36] Tang, Y., Ghosal, S. and Roy, A. (2007). Nonpara- [32] Robert, C. P. and Casella, G. (2004). Monte Carlo metric Bayesian estimation of positive false discov- 63 Statistical Methods, 2nd ed. Springer, New York. ery rates. Biometrics 1126–1134. MR2080278 [37] Teicher, H. (1963). Identifiability of finite mixtures. 34 [33] San Martin, E. and Quintana, F. (2002). Consistency Ann. Math. Statist. 1265–1269. MR0155376 [38] Tokdar, S. T., Martin, R. and Ghosh, J. K. (2008). and identifiability revisited. Braz. J. Probab. Stat. Consistency of a recursive estimate of mixing dis- 16 99–106. MR1962497 tributions. Ann. Statist. To appear. [34] Scott, J. G. and Berger, J. O. (2006). An exploration [39] Wei, G. C. G. and Tanner, M. A. (1990). A Monte of aspects of Bayesian multiple testing. J. Statist. Carlo implementation of the EM algorithm and the 136 Plann. Inference 2144–2162. MR2235051 Poor Man’s data augmentation algorithm. J. Amer. [35] Shyamalkumar, N. (1996). Cyclic I0 projections and Statist. Assoc. 85 699–704. its applications in statistics. Technical Report 96-