<<

arXiv:0808.3852v1 [stat.ME] 28 Aug 2008 hr tthe at where from as otniggvsaMro chain Markov a gives Continuing pass. X rbes rm( From problems. one-dimensional of sequence a by constant, ,x ., . . lsfo utvraepoaiiydensity probability sam- multivariate draw a to from way ofples a mainstay provides a It is computing. , scientific heat-bath the or namics onl nvriy aotHl,Ihc,NwYork e-mail: New USA Ithaca, 14853-4201, Hall, Malott Mathematics, University, of Cornell Department Professor, is Saloff-Coste Saloff-Coste Laurent and Khare Kshitij Diaconis, and Persi Polynomials Families Orthogonal Exponential Sampling, Gibbs 40-05 S e-mail: California USA Stanford, 94305-4065, University, 390 Stanford Hall, Mall, Sequoia Serra , of Department Student, 08 o.2,N.2 151–178 2, DOI: No. 23, Vol. 2008, Science Statistical 10.1214/08-STS252REJ 10.1214/08-STS252C [email protected] tnod aiona93546,UAe-mail: USA University, 94305-4065, Stanford California Mall, Stanford, Serra 390 Hall, Sequoia

es icnsi rfso,Dprmn fStatistics, of Department Professor, is Diaconis Persi c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint p h ib ape,as nw sGabrdy- Glauber as known also sampler, Gibbs The nttt fMteaia Statistics Mathematical of Institute 1 ,te ( then ), icse in Discussed 10.1214/07-STS252 p f ,prasol nw pt normalizing a to up known only perhaps ), ihteohrcodntsfie.Ti sone is This fixed. coordinates other the with X 1 ′ 10.1214/08-STS252A i X , .INTRODUCTION 1. hsae h oriaei sampled is coordinate the stage, th nttt fMteaia Statistics Mathematical of Institute X 2 , ′ il,snua au decomposition. value or families, singular location mials, priors, conjugate families, nential phrases: and words eigenfunctions. Key as explicitly is polynomials operator orthogonal av transition classical the are with case, sampler each Gibbs the In and used priors. families exponential widely standard involve the examples of The stationarity to gence Abstract. X , and 08 o.2,N.2 151–178 2, No. 23, Vol. 2008, 1 X , . . . , . 3 ,X ., . . , 10.1214/08-STS252D [email protected] [email protected] sii hr sGraduate is Khare Kshitij . 2008 , p rce o( to proceed ) p egv aiiso xmlsweesaprtso conver- of rates sharp where examples of families give We ) , . . . , , 10.1214/08-STS252B ( X ,X X, 1 ′ X , eone at rejoinder ; X 2 ′ 1 ′ ′ . X , . . . , f Laurent . X , X , ib ape,rnigtm nlss expo- analyses, time running sampler, Gibbs ( This . x 2 ′′ 1 ., . . , , . . . , x , p ′ 2 in ) , , 1 ape hisi tl ao eerheot An the at effort. given research is results major and tools a available of still overview is Gibbs for chains analyses sampler time running useful community, analyses throw time to running below. much questions how these the of call We of question away. segment the initial leaves these This an with away run. throw dealing to of to is way steps problems of One number stationarity? enormous anal- an reach image take or they enormous cards Do on shuffling ysis). of run (think are spaces chains state Markov state. Markov some starting the original Indeed, run the forgotten to has long it how until chain know to important is it [ lg n h oilsine,aogwt extensive with [ along in sciences, are bi- social references from the examples and many ology with accounts, Textbook of works [ rou- Wong the and for Tanner following method computations the Bayesian [ employ tine Geman to and Geman began by Statisticians analysis on image for [ based base proved (e.g., was dynamics measures this showing Gibbs theorem of uniqueness existence Dobrushin basic simulation The practical for [ both (e.g., physics, statistical of [ in discussed ditions has which ednl yTri [ Turcin by pendently 49 ept eoceot yteapidprobability applied the by efforts heroic Despite sampler, Gibbs the of application practical any In Glauber by 1963 in introduced was algorithm The od iuain o sn oes n inde- and models, Ising for simulations do to ] 1 88 )ada aua yais(.. [ (e.g., dynamics natural a as and ]) f ssainr est ne idcon- mild under density stationary as hgnlpolyno- thogonal 47 101 diagonalizable , 4 48 103 , 54 rconjugate ir n efn n mt [ Smith and Gelfand and ] 102 , ) twsitoue sa as introduced was It ]). 78 .I ssilasadr tool standard a still is It ]. ailable. ]. ]. 10 45 46 ]). ]. ]. 2 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE end of this introduction. 
The main purpose of the present paper is to give families of two component examples where a sharp analysis is available. These may be used to compare and benchmark more ro- bust techniques such as the Harris recurrence tech- niques in [64], or the spectral techniques in [2] and [107]. They may also serve as a base for the compar- ison techniques in [2, 26, 32]. Here is an example of our results. The following example was studied as a simple expository example in [16] and [78], page 132. Let n f (x)= θx(1 θ)n x, θ x −   − π(dθ) = uniform on (0, 1), x 0, 1, 2,...,n . ∈ { } These define the bivariate Beta/Binomial density (uniform prior) Fig. 1. Simulation of the Beta/Binomial “x-chain” with n = 100. n f(x,θ)= θx(1 θ)n x x −   − is reversible with stationary density m(x) [i.e., m(x) · with marginal density k(x,x′)= m(x′)k(x′,x)]. For the Beta/Binomial ex- ample 1 m(x)= f(x,θ) dθ n n n + 1 x x′ 0 k(x,x′)= , Z 2n + 1 2n 1 x+x′  = , x 0, 1, 2,...,n . (1.1) n + 1 ∈ { }  0 x,x n. ≤ ′ ≤ The posterior density [with respect to the prior A simulation of the Beta/Binomial “x-chain” (1.1) π(dθ)] is given by with n = 100 is given in Figure 1. The initial posi- tion is 100 and we track the position of the Markov fθ(x) π(θ x)= chain for the first 200 steps. | m(x) The proposition below gives an explicit diagonal- n = (n + 1) θx(1 θ)n x, θ (0, 1). ization of the x-chain and sharp bounds for the bi- x − ℓ   − ∈ variate chain [Kn,θ denotes the density of the distri- bution of the bivariate chain after ℓ steps starting The Gibbs sampler for f(x,θ) proceeds as follows: from (n,θ)]. Itf shows that order n steps are neces- From x, draw θ from Beta(x + 1,n x + 1). sary and sufficient for convergence. That is, for sam- • ′ − From θ′, draw x′ from Binomial(n,θ′). pling from the K to simulate from the • f, a small (integer) multi- The output is (x′,θ′). Let K(x,θ; x′,θ′) be the tran- fn ple of n steps suffices, while 2 steps do not. The sition density for this chain. While K has f(x,θ) as following bounds make this quantitative. The proof stationary density, the (K,ff ) pair is not reversible is given in Section 4. For example, when n = 100, (see below). This blocks straightforwardf use of spec- after 200 steps the total variation distance (see be- tral methods. Liu et al.f [77] observed that the “x- low) to stationarity is less than 0.0192, while after chain” with 50 steps the total variation distance is greater than 1 0.1858, so the chain is far from equilibrium. k(x,x′)= f (x′)π(θ x) dθ We simulate 3000 independent replicates of the θ | Z0 Beta/Binomial “x-chain” (1.1) with n = 100, start- 1 f (x)f (x ) ing at the initial value 100. We provide histograms = θ θ ′ dθ m(x) of the position of the “x-chain” after 50 steps and Z0 AND ORTHOGONAL POLYNOMIALS 3

Fig. 2. Histograms of the observations obtained from simulating 3000 independent replicates of the Beta/Binomial “x-chain” after 50 steps and 200 steps.

200 steps in Figure 2. Under the stationary distribu- The calculations work because the operator with tion (which is uniform on 0, 1,..., 100 ), one would density k(x,x′) takes polynomials to polynomials. expect roughly 150 observations{ in each} block of the Our main results give classes of examples with the histogram. As expected, these histograms show that same explicit behavior. These include: the empirical distribution of the position after 200 fθ(x) is one of the exponential families singled steps is close to the stationary distribution, while • out by Morris [86, 87] (binomial, Poisson, negative the empirical distribution of the position after 50 binomial, normal, gamma, hyperbolic) with π(θ) steps is quite different. the . If f, g are probability densities with respect to a fθ(x)= g(x θ) is a location family with π(θ) con- σ-finite measure µ, then the total variation distance • jugate and g−belongs to one of the six exponential between f and g is defined as families above. (1.2) f g = 1 f(x) g(x) µ(dx). Section 2 gives background. In Section 2.1 the k − kTV 2 | − | Z Gibbs sampler is set up more carefully in both sys- tematic and random scan versions. Relevant Markov Proposition 1.1. For the Beta/Binomial ex- chain tools are collected in Section 2.2. Exponential ample with uniform prior, we have: families and conjugate priors are reviewed in Section 2.3. The six families are described more carefully (a) The chain (1.1) has eigenvalues in Section 2.4 which calculates needed moments. A brief overview of orthogonal polynomials is in Sec- β0 = 1, tion 2.5. n(n 1) (n j + 1) β = − · · · − , 1 j n. Section 3 is the heart of the paper. It breaks the j (n + 2)(n + 3) (n + j + 1) ≤ ≤ · · · operator with kernel k(x,x′) into two pieces: T : L2(m) L2(π) defined by In particular, β1 = 1 2/(n+2). The eigenfunctions → are the discrete Chebyshev− polynomials [orthogonal polynomials for m(x) = 1/(n + 1) on 0,...,n ]. T g(θ)= fθ(x)g(x)m(dx) { } Z (b) For the bivariate chain K, for all θ,n and ℓ, and its adjoint T ∗. Then k is the kernel of T ∗T . Our ℓ 1/2 analysis rests on a singular value decomposition of 1 β − ℓ ℓ f 1 T . In our examples, T takes orthogonal polynomials β1 Kn,θ f TV 2ℓ 1 . 2 ≤ k − k ≤ 1 β − − 1 for m(x) into orthogonal polynomials for π(θ). This f 4 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE leads to explicit computations and allows us to treat also makes useful connections with classical renewal the random scan, x-chain and θ-chain on an equal theory. footing. The Markov chains studied in this paper generate The x-chains and θ-chains corresponding to three stochastic processes which have the marginal as sta- of the six classical exponential families are treated tionary distribution. These chains have been used in Section 4. There are some surprises; while order in a variety of applied modeling contexts in [80] and n steps are required for the Beta/Binomial exam- [89, 90, 91]. The present paper presents some new ple above, for the parallel Poisson/Gamma example, tools for these models. A related family of Markov log n + c steps are necessary and sufficient. The six chains and analyses have been introduced by Eaton location chains are treated in Section 5 where some [33, 34, 35] to study admissibility of formal Bayes standard queuing models emerge (e.g., the M/M/ estimators. The process is a two-component Gibbs queue). 
The final section points to other examples∞ sampler, as above, with π an improper prior having with polynomial eigenfunctions and other methods an almost surely proper posterior. Under regular- for studying present examples. Our examples are ity assumptions, Eaton shows that recurrence of the just illustrative. It is easy to sample from any of θ-chain implies the admissibility of certain formal the families f(x,θ) directly. Further, we do not see Bayes estimators corresponding to π. Hobert and how to carry our techniques over to higher compo- Robert [61] have developed this research, introduc- nent problems. We further point out that in routine ing the x-chain as a useful tool. Spectral techniques Bayesian use of the Gibbs sampler, the distribution have proved to be a useful adjunct to Foster-type we wish to sample from is typically a posterior dis- criteria for studying recurrence. We hope to develop tribution of the . Nevertheless, our two our theory in these directions. component problems are easy-to-understand stan- The analyses carried out in this paper hinge criti- dard statistical examples. cally on the existence of all marginal and conditional Basic convergence properties of the Gibbs sampler moments. These moments need not exist. Indeed, x can be found in [4, 102]. Explicit rates of conver- consider the geometric density fθ(x)= θ (1 θ), 0 x< , and put a Beta(α, β) prior on θ, with− α ≤1 gence appear in [95, 96]. These lean on Harris recur- ∞ ≥ rence and require a drift condition of type being an integer. The of x ad- mits only α 1 moments. For α> 1, the available E(V (X1) X0 = x) aV (x)+ b for all x. Also re- − quired are| a minorization≤ condition of the form moments are put to good use in [25]. k(x,x′) εq(x′) for ε> 0, some probability density q, and all≥ x with V (x) d. Here d is fixed with 2. BACKGROUND ≤ d b/(1 + a). Rosenthal [95] then gives explicit up- ≥ This section gives needed background. The two- per bounds and shows these are sometimes practi- component Gibbs sampler is defined more carefully cally relevant for natural statistical examples. His in Section 2.1. Bounds on convergence using eigen- paper [97] is a nice expository account. Finding use- values are given in Section 2.2. Exponential families ful V and q is currently a matter of art. For example, and conjugate priors are reviewed in Section 2.3. a group of graduate students tried to use these tech- The six families with a quadratic function niques in the Beta/Binomial example treated above of the are treated in Section 2.4. Finally, a and found it difficult to make choices giving useful brief review of orthogonal polynomials is in Section results. This led to the present paper. 2.5. A marvelous expository account of this set of tech- niques with many examples and an extensive liter- 2.1 Two-Component Gibbs Samplers ature review is given by Jones and Hobert in [64]. Let ( , ) be a measurable space equipped with In their main example an explicit eigenfunction was a σ-finiteX F measure µ. Let (Θ, ) be a measurable available for V ; our Gamma/Gamma examples be- space equipped with a probabilityG measure π(dθ). low generalize this. Further, “practically relevant” In many classical situations the prior is given by a examples with useful analyses of Markov chains for density g(θ) with respect to the Lebesgue measure stationary distributions which are difficult to sample dθ, and so π(dθ)= g(θ) dθ. However, we also con- directly from are in [65, 81, 98]. 
Some sharpenings sider examples where the θ is discrete, of the Harris recurrence techniques are in [9] which which cannot be described in the fashion mentioned GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 5 above. Thus, we work with general prior probabili- Note that k(x,x′)µ(dx′) = 1 so that k(x,x′) is a ties π(dθ) throughout. Let fθ(x) θ Θ be a family of probability density with respect to µ. Note further { } ∈ R probability densities with respect to µ. These define that m(x)k(x,x′)= m(x′)k(x′,x) so that the x-chain a probability measure on Θ via has m(dx) as a stationary distribution. X × The “θ-chain” [from θ, draw x from fθ(x) and P (A B)= f (x)µ(dx)π(dθ), × θ then θ′ from π(θ′ x)] has transition density ZB ZA | A , B . k(θ,θ′)= f (x)π(θ′ x)µ(dx) ∈ F ∈ G θ | The marginal density on is (2.2) ZX X f (x)f ′ (x) = θ θ µ(dx). m(x)= fθ(x)π(dθ) m(x) Z ZΘ X Note that k(θ,θ′)π(dθ′) = 1 and that k(θ,θ′) has so m(x)µ(dx) = 1 . π(dθ) as reversing measure. R  ZX  We assume that 0

K¯ (L2(m)+ L2(π)) ]. Indeed, for any h L2(P ), we can compute Kφ and m(φ)]. The second condi- ⊇ ⊥ ∈ Kg,h¯ = (Kg¯ )h dP = (Kh¯ )g dP = 0. Thus tion [orthogonality to constants in L2(m)] will be h iP Kg¯ = 0. We diagonalize random scan chains in Sec- automatically satisfied when β < 1. Such an eigen- R R tion 3. function yields a simple lower| bound| on the conver- gence of the chain to its stationary measure. 2.2 Bounds on Markov Chains 2.2.1 General results. We briefly recall well-known Lemma 2.1. Referring to the notation above, as- sume that φ L2(m(dξ)) and φ 2 dm = 1. Then results that will be applied to either our two-compo- ∈ | | nent Gibbs sampler chains or the x- and θ-chains. 2 2R 2ℓ χξ(ℓ) φ(ξ) β . Suppose we are given a Markov chain described by ≥ | | | | its kernel K(ξ,ξ′) with respect to a measureµ ˜(dξ′) Moreover, if φ is a bounded function, then [e.g., ξ = (x,θ),µ ˜(dξ)= µ(dx)π(dθ) for the two-com- φ(ξ) β ℓ ponent sampler, ξ = θ,µ ˜(dθ)= π(dθ) for the θ-chain, Kℓ m | || | . k ξ − kTV ≥ 2 φ etc.]. Suppose further that the chain has stationary k k∞ measure m(dξ)= m(ξ)˜µ(dξ) and write Proof. This follows from the well-known results

Kˆ (ξ,ξ′)= K(ξ,ξ′)/m(ξ′), 2 ℓ 2 (2.5) χξ(ℓ)= sup Kξ (g) m(g) g 2,m 1{| − | } ˆ ℓ ˆ ℓ ℓ k k ≤ Kξ (ξ′)= K (ξ,ξ′)= K (ξ,ξ′)/m(ξ′) and for the kernel and iterated kernel of the chain with respect to the stationary measure m(dξ). We define ℓ 1 ℓ (2.6) Kξ m TV = 2 sup Kξ (g) m(g) . k − k g ∞ 1{| − |} the chi-square distance between the distribution of k k ≤ the chain started at ξ after ℓ steps and its stationary Here, Kl (g) denotes the expectation of g under the measure by ξ density Kl(ξ, ). For chi-square, use g = φ as a test 2 ℓ 2 · χ (ℓ)= Kˆ (ξ′) 1 m(dξ′) function. For total variation use g = φ/ φ as a ξ | ξ − | k k∞ Z test function. More sophisticated lower bounds on ℓ 2 K (ξ,ξ′) m(ξ′) total variation are based on the second moment method = | − | µ˜(dξ′). m(ξ ) (e.g., [99, 106]).  Z ′ This quantity always yields an upper bound on the To obtain upper bounds on the chi-square dis- total variation distance tance, we need much stronger hypotheses. Namely, assume that K is a self-adjoint operator on L2(m) ℓ 1 ˆ ℓ K m TV = K (ξ′) 1 m(dξ′) 2 k ξ − k 2 | ξ − | and that L (m) admits an orthonormal basis of real Z eigenfunctions ϕi with real eigenvalues βi 0, β0 = 1 ℓ = K (ξ,ξ′) m(ξ′) µ˜(dξ′), ≥ 2 | − | 1, ϕ0 1, βi 0 so that Z ≡ ↓ namely, Kˆ (ξ,ξ′)ϕi(ξ′)m(dξ′)= βiϕi(ξ). ℓ 2 2 Z (2.4) 4 Kξ m TV χξ(ℓ). k − k ≤ Assume further that K acting on L2(m) is Hilbert– Our analysis will be based on eigenvalue decom- 2 Schmidt (i.e., βi < ). Then we have positions. Let us first assume that we are given a | | ∞ ˆ ℓP ℓ function φ such that K (ξ,ξ′)= βi ϕi(ξ)ϕi(ξ′) Xi Kφ(ξ)= K(ξ,ξ )φ(ξ )˜µ(dξ )= βφ(ξ), ′ ′ ′ [convergence in L2(m m)] Z × m(φ)= φ(ξ)m(ξ)˜µ(dξ) = 0 and Z 2 2ℓ 2 for some (complex number) β. In words, φ is a gener- (2.7) χξ (ℓ)= βi ϕi (ξ). alized eigenfunction with eigenvalue β. We say “gen- Xi>0 eralized” here because we have not assumed here Useful references for this part of classical functional that φ belongs to a specific L2 space [we only assume analysis are [1, 94]. GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 7

2.2.2 Application to the two-component Gibbs sam- For the variant K, the similar formula reads pler. All of the bounds in this paper are derived via ℓ ℓ 1 fθ′ (z) the following route: bound L1 by L2 and use the ex- K (x,θ; x′,θ′)=f k − (x, z) fθ′ (x′)µ(dz). m(z) plicit knowledge of eigenvalues and eigenfunctions to Z bound the sum in (2.7). This, however, does not ap- Thesef two bivariate chains have stationary density f(x,θ)= f (x) with respect to the measure µ(dx) ply directly to the two-component Gibbs sampler K θ · (or K) because these chains are not reversible with π(dθ). So, we write ℓ respect to their stationary measure. Fortunately, the K (x,θ; x′,θ′) f 1 x-chain and the θ-chain are reversible and their anal- f(x′,θ′) − ysis yields bounds on the two-component chain thanks ℓ 1 to the following elementary observation. The x-chain = (kˆ − (z,x′) 1)f (z)µ(dz) − θ has kernel k(x,x ) with respect to the measure µ(dx). Z ′ and It will also be useful to have kˆ(x,x′)= k(x,x′)/m(x′), Kℓ(x,θ; x ,θ ) the kernel with respect to the probability m(dx)= ′ ′ 1 ℓ ℓ m(x)µ(dx). For ℓ 2, we let kx(x′)= k (x,x′)= f(x′,θ′) − k(x,y)kℓ 1(y,x )µ≥(dy) denote the density [w.r.t. f − ′ ˆℓ 1 µ(dx)] of the distribution of the x-chain after l steps = (k − (x, z) 1)fθ′ (z)µ(dz). R − ˆℓ ˆℓ ˆ ˆℓ 1 Z and set kx(x′)= k (x,x′)= k(x,y)k − (y,x′)m(dy) To prove the second inequality in the lemma (the [the density w.r.t. m(dx)]. Also R proof of the first is similar), write 1/p ℓ p p (Kx,θ/f) 1 g p,P = g(ω) P (dω) for p 1. k − kp,P k k | | ≥ p Z  ℓ 1 f= (kˆ − (x, z) 1)f ′ (z)µ(dz) Lemma 2.2. Referring to the K, K two- − θ Z Z Z component chains and x-chain introduced in Sec- f ′ (x′)µ(dx′)π(dθ′) tion 2.1, for any p [1, ], we have f · θ ∈ ∞ ˆℓ 1 p ℓ p ℓ 1 p k − (x, z) 1 (K /f) 1 kˆ − 1 f (z)µ(dz) ≤ | − | k x,θ − kp,P ≤ k z − kp,m θ Z Z Z Z f ′ (z)µ(dz)f ′ (x′)µ(dx′)π(dθ′) ˆℓ 1 p · θ θ sup kz− 1 p,m ≤ z k − k ℓ 1 p = kˆ − (x, z) 1 m(z)µ(dz) | − | and Z ℓ p ℓ 1 p ˆℓ 1 p (K /f) 1 kˆ − 1 . = k − (x, z) 1 m(dz). k x,θ − kp,P ≤ k x − kp,m | − | Z  Similarly, forf the θ-chain, we have This gives the desired bound. ℓ p ℓ 1 p To get lower bounds, we observe the following. (K /f) 1 k − 1 π(θ x)π(dθ) k x,θ − kp,P ≤ k θ − kp,π | Z Lemma 2.3. Let g be a function of x only [abus- f ℓ 1 p sup kθ− 1 p,π ing notation, g(x,θ)= g(x)]. Then ≤ θ k − k and Kg(x,θ)= k(x,x′)g(x′)µ(dx′). Z l p ℓ 1 p (K /f) 1 k − 1 . If instead,fg is a function of θ only, then k x,θ − kp,P ≤ k θ − kp,π Proof. We only prove the results involving the Kg(x,θ)= k(θ,θ′)g(θ′)π(dθ′). x-chain. The rest is similar. Recall that the bivariate Z Proof. Assume g(x,θ)= g(x). Then chain has transition density fθ′ (x)fθ′ (x′) K(x,θ; x ,θ )= f (x )f ′ (x )/m(x ). Kg(x,θ)= g(x′)µ(dx′)π(dθ′) ′ ′ θ ′ θ ′ ′ m(x) Z Z By direct computation f = k(x,x′)g(x′)µ(dx′). ℓ ℓ 1 fθ′ (x′) K (x,θ; x′,θ′)= f (z)k − (z,x′) µ(dz). Z θ m(x ) The other case is similar.  Z ′ 8 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE

2 2 Lemma 2.4. Let χx,θ(ℓ) and χx,θ(ℓ) be the chi- 2.3.2 Conjugate priors for exponential families. square distances after ℓ steps for the K-chain and With notation as above, fix n0 > 0 and x0 in the 2 the K-chain, respectively, startinge at (x,θ). Let χx(ℓ), interior of the convex hull of the support of µ. Define 2 χθ(ℓ) be the chi-square distances for the x-chain a prior density with respect to Lebesgue measure dθ (startingf at x) and the θ-chain (starting at θ), re- by spectively. Then we have n0x0θ n0M(θ) πn0,x0 (dθ)= z(n0,x0)e − dθ, χ2(ℓ) χ2 (ℓ) χ2(ℓ 1), θ ≤ x,θ ≤ θ − where z(n0,x0) is a shown to ℓ ℓ ℓ 1 k 1 K f k − 1 k θ − kTV ≤ k x,θ − kTV ≤ k θ − kTV be positive and finite in Diaconis and Ylvisaker [29] and which contains proofs of the assertions below. The posterior is 2 2 2 χx(ℓ) χx,θ(ℓ) χx(ℓ 1), ≤ ≤ − π(dθ x)= π (dθ). ℓ ℓ ℓ 1 | n0+1,(n0x0+x)/(n0+1) kx m TV eKx,θ f TV kx− m TV. k − k ≤ k − k ≤ k − k Thus the family of conjugate priors is closed under Proof. This isf immediate from (2.5)–(2.6) and sampling. This is sometimes used as the definition  Lemma 2.3. of conjugate prior. A central fact about conjugate 2.3 Exponential Families and Conjugate Priors priors is Three topics are covered in this section: exponen- (2.8) E(m(θ) x)= ax + b. tial families, conjugate priors for exponential fami- | lies and conjugate priors for location families. This linear expectation property characterizes con- jugate priors for families where µ has infinite sup- 2.3.1 Exponential families. Let µ be a σ-finite port. Section 3 shows that linear expectation implies measure on the Borel sets of the real line R. De- that the associated chain defined at (2.1) always has R xθ fine Θ= θ : e µ(dx) < . Assume that Θ is an eigenfunction of the form x c with eigenvalue nonempty{ and∈ open. H¨older’s∞} inequality shows that − R a, and c equal to the mean of the marginal distribu- Θ is an interval. For θ Θ, set ∈ tion. Often, an is not parametrized M(θ) = log exθµ(dx), by the natural parameter θ, but in terms of the mean Z parameter m(θ). If we construct conjugate priors f (x)= exθ M(θ). θ − with respect to this parametrization, then (2.8) does The family of probability densities f ,θ Θ is not hold in general. However, for the six exponen- { θ ∈ } the exponential family through µ in its “natural tial families having quadratic variance function (dis- parametrization.” Allowable differentiations yield the cussed in Section 2.4 below), (2.8) holds even with mean m(θ)= xfθ(x)µ(dx)= M ′(θ) and the vari- the mean parametrization. In [19], this is shown to ance σ2(θ)= M (θ). R ′′ hold only for these six families. See [15, 55] for more Statisticians realized that many standard families on this. can be put in such form so that properties can be studied in a unified way. Standard references for ex- Example. For the Poisson example above the ponential families include [8, 11, 66, 73, 74]. conjugate priors with respect to θ are of the form θ n0x0θ n0e Example. Let = 0, 1, 2, 3,... , µ(x) = 1/x!. z(n0,x0)e − dθ. Then Θ= R, and MX(θ)={ eθ, } The mean parameter is λ = eθ. Since dθ = dλ/λ, the xθ eθ e − priors transform to fθ(x)= , x = 0, 1, 2,.... x! n0x0 1 n0λ z(n0,x0)λ − e− dλ. This is the Poisson(λ) distribution with λ = eθ. This paper works with familiar exponential fam- The Poisson density parametrized by λ is given ilies. Many exotic families have been studied. See by [12, 79]. These lead to interesting problems when ex log λ λ studied in conjunction with the Gibbs sampler. f (x)= − , x = 0, 1, 2,.... λ x! 
GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 9

Hence, the conjugate priors with respect to λ are of at discrete times: If this is Xn at time n, then SXn is the form the number served in the next time period and εn+1 is the number of unserved new arrivals. The explicit z˜(n ,x )λn0x0 e n0λ dλ. 0 0 − diagonalization of the M/M/ chain, in continuous ∞ Here, the Jacobian of the transformation θ m(θ) time, using Charlier polynomials appears in [3]. blends in with the rest of the prior so that→ both This same chain has yet a different interpreta- parametrizations lead to the usual Gamma priors tion: Let f (j) = η pj(1 p)η j . Here 0

The x-chain may be represented as Xn+1 = SXn + mean parametrization only for the six families. Wal- εn+1 with Sk Binomial(k,η/(η + λ)) and ter and Hamedani [105] construct orthogonal poly- ε Poisson(λ). This∼ also represents the number of nomials for the six families for use in empirical Bayes customers∼ on service in an M/M/ queue observed estimation. See also Pommeret [93]. ∞ 10 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE

k k 1 Extensions are developed by Letac and Mora [76] α − E (θk)= xk + b xj. and Casalis [15] who give excellent surveys of the x α + 1 j literature. Still most useful, Morris [86, 87] gives a   jX=0 unified treatment of basic (and not so basic) prop- Negative Binomial: = 0, 1, 2,... , µ counting mea- X { } erties such as moments, unbiased estimation, or- sure, Θ = (0, 1): thogonal polynomials and statistical properties. We Γ(x + r) f (x)= θx(1 θ)r, give the six families in their usual parametrization θ Γ(r)x! − along with the conjugate prior and formulae for the k k 0 <θ< 1,r> 0. moments Eθ(X ), Ex(θ ) of X and θ under dP = fθ(x)µ(dx)π(dθ), given the value of the other. For Γ(α + β) α 1 β 1 k k π(dθ)= θ − (1 θ) − dθ, each of these families, Eθ(X ) and Ex(θ ) are poly- Γ(α)Γ(β) − nomials of degree k in θ and x, respectively. We only 0 <α, β< . specify the leading coefficients of these polynomials ∞ below. In fact, the leading coefficients are all we re- k k 1 j k θ − θ quire for our analysis. In the conditional expectation Eθ(X ) = (r)k + aj , 1 θ 1 θ formulae, k is an integer running from 0 to unless  −  jX=0  −  R N ∞ specified otherwise. For a and n 0 , we k k 1 ∈ ∈ ∪ { } θ − define E = (β + r k) xk + b xj, x 1 θ − k j    j=0 (a)n = a(a + 1) (a + n 1) − X · · · − k < β + r. if n 1, (a)0 = 1. ≥ Normal: =Θ= R, µ Lebesgue measure: While some of the following calculations are stan- X 1 ( 1/2)(x θ)2 /σ2 dard and well known, others are not, and since the fθ(x)= e − − , details enter our main theorems, we give complete √2πσ2 statements. 0 <σ2 < Binomial: = 0,...,n , µ counting measure, Θ = ∞ X { } 1 ( 1/2)(θ v)2 /τ 2 (0, 1): π(dθ)= e − − dθ, √2πτ 2 n x n x f (x)= θ (1 θ) − , 0 <θ< 1,

−1 r irx r irx erx tan θβ + , , an explicit diagonalization of the univariate chains 2 2 2 2 ·  −  (x-chain, θ-chain) in Section 3. r> 0, 2. The first five families are very familiar, the sixth Γ(a)Γ(b) family less so. As one motivation, consider the gen- where β(a, b)= , eralized arc sine densities Γ(a + b) θ 1 (1 θ) 1 Γ(ρ/2 ρδi/2)Γ(ρ/2+ ρδi/2) y − (1 y) − − π(dθ)= − fθ(y)= − , 0 y, θ< 1. Γ(ρ/2)Γ(ρ/2 1/2)√π Γ(θ)Γ(1 θ) ≤ − − −1 eρδ tan θ Transform these to an exponential family via x = dθ, 2 ρ/2 log(y/(1 y)), η = πθ π/2. This has density · (1 + θ ) − − <δ< , ρ 1, exη+log(cos η) −∞ ∞ ≥ g (x)= , k 1 η 2 cosh((π/2)x) k k − j Eθ(X )= k!θ + ajθ , π π j=0

2.5 Some Background on Orthogonal These polynomials satisfy Polynomials (β) (α + β + j 1) E (Q Q )= j − n+1 A variety of orthogonal polynomials are used cru- m j ℓ (α + β + 2j 1) − cially in the following sections. While we usually just (2.10) j!(n j)! quote what we need from the extensive literature, − δjl. this section describes a simple example. Perhaps the · (α + β)n(α)j n! best introduction is in [18]. We will make frequent Thus they are orthogonal polynomials for m. When reference to [63] which is thorough and up-to-date. α = β = 1, these become the discrete Chebyshev The classical account [100] contains much that is polynomials cited in Proposition 1.1. From our work hard to find elsewhere. The on-line account [68] is in Section 2.2, we see we only need to know Qj(x0) very useful. For pointers to the literature on orthog- with x0 the starting position. This is often available onal polynomials and birth and death chains, see, in closed form for special values, for example, for for example, [104]. x0 =0 and x0 = n, As an indication of what we need, consider the Qj(0) = 1, Beta/Binomial example with a general Beta(α, β) (2.11) prior. Then the stationary distribution for the x- ( β j) Q (n)= − − j , 0 j n. chain on = 0, 1, 2,...,n is j (α + 1) X { } j ≤ ≤ n (α)x(β)n x For general starting values, one may draw on the m(x)= − x (α + β) extensive work on uniform asymptotics; see, for ex-   n ample, [100], Chapter 8, or [5]. where We note that [86], Section 8, gives an elegant self- Γ(a + x) contained development of orthogonal polynomials (a)x = = a(a + 1) (a + x 1). xθ M(θ) Γ(a) · · · − for the six families. Briefly, if fθ(x)= e − is the density, then The choice α = β = 1 yields the uniform distribution k while α = β = 1/2 yields the discrete arc-sine density 2k d pk(x,θ)= σ fθ(x) fθ(x) from [43], Chapter 3, dkm  . 2x 2n 2x 2 − [derivatives with respect to the mean m(θ)]. If σ (θ)= x n x 2 m(x)= 2n− . v2m (θ)+ v1m(θ)+ v0, then 2  k 1 The orthogonal polynomials for m are called Hahn 2k − polynomials [see (2.10) below]. They are developed Eθ(pnpk)= δnkakσ with ak = k! (1 + iv2). in [63], Section 6.2, which refers to the very useful iY=0 treatment of Karlin and McGregor [67]. The poly- We also find need for orthogonal polynomials for nomials are given explicitly in [63], pages 178–179. the conjugate priors π(θ). Shifting parameters by 1 to make the classical no- tation match present notation, the orthogonal poly- 3. A SINGULAR VALUE DECOMPOSITION nomials are The results of this section show that many of the j, j + α + β 1, x Gibbs sampler Markov chains associated to the six Q (x)= F 1 , j 3 2 − α, n− − families have polynomial eigenvectors, with explic-  −  itly known eigenvalues. This includes the x-chain, θ- 0 j n. ≤ ≤ chain and the random scan chain. Analysis of these Here chains is in Sections 4 and 5. For a discussion of Markov operators related to orthogonal polynomi- ℓ a1 ar ∞ (a1a2 ar)ℓ z als, see, for example, [6]. For closely related statis- rFs · · · z = · · · b1 bs (b1b2 bs)ℓ ℓ! tical literature, see [13] and the references therein.  · · ·  ℓ=0 · · · (2.9) X Throughout, notation is as in Section 2.1. We have r fθ(x) θ Θ a family of probability densities on the with (a1 ar)ℓ = (ai)ℓ. · · · real{ line} ∈R with respect to a σ-finite measure µ(dx), iY=1 GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 13 for θ Θ R. Further, π(dθ) is a probability mea- section consists of five examples in all. 
For each, sure on∈ Θ.⊆ These define a joint probability P on we set up the results for general parameter values R Θ with marginal density m(x) [w.r.t. µ(dx)] and and carry out the bounds in some natural special × posterior density [w.r.t. the prior π(dθ)] given by cases. The other three families are not amenable to π(θ x)= f (x)/m(x). The densities do not have to this analysis due to lack of existence of all moments | θ come from exponential families in this section. [which violates hypothesis (H1) in Section 3]. How- Let c = #suppm(x). This may be finite or infinite. ever, they are analyzed by probabilistic techniques For simplicity, throughout this section, we assume such as coupling in [25]. supp(π) is infinite. Moreover, we make the following hypotheses: 4.1 Beta/Binomial

α1 x +α2 θ 4.1.1 The x-chain for the Beta/Binomial. Fix α, (H1) For some α1, α2 > 0, e | | | |P (dx, dθ) < . β> 0. On the state space = 0, 1, 2,...,n , let ∞ k R X { } (H2) For 0 k < c, Eθ(X ) is a polynomial in θ ≤ k(x,x′) of degree k with lead coefficient ηk > 0. 1 (H3) For 0 k< , E (θk) is a polynomial in x n α+x+x′ 1 β+2n (x+x′) 1 ≤ ∞ x = θ − (1 θ) − − x′ − of degree k with lead coefficient µk > 0. Z0   2 Γ(α + β + n) dθ By (H1), L (m(dx)) admits a unique monic, or- (4.1) thogonal basis of polynomials p , 0 k < c, with p · Γ(α + x)Γ(β + n x) k k − of degree k. Also, L2(π(dθ)) admits≤ a unique monic, n Γ(α + β + n)Γ(α + x + x′) orthogonal basis of polynomials qk, 0 k< , with = ≤ ∞ x′ Γ(α + x)Γ(β + n x) qk of degree k. As usual, η0 = µ0 =1 and p0 q0 1.   − ≡ ≡ Γ(β + 2n (x + x′)) Theorem 3.1. Assume (H1)–(H3). Then: − . · Γ(α + β + 2n) (a) The x-chain (2.1) has eigenvalues β = η µ k k k When α = β = 1 (uniform prior), k(x,x ) is given by with eigenvectors p , 0 k < c. ′ k ≤ (1.1). For general α, β, the stationary distribution is (b) The θ-chain (2.2) has eigenvalues βk = ηkµk with eigenvectors q for 0 k < c, and eigenvalues the Beta/Binomial: k ≤ zero with eigenvectors qk for c k< . n (α)x(β)n x (c) The random scan chain (≤2.3) has∞ spectral de- m(x)= − , x (α + β)n composition given by   where eigenvalues 1 1 √η µ , 2 ± 2 k k Γ(a + j) ηk (a)j = = a(a + 1) (a + j 1). eigenvectors pk(x) qk(θ), 0 k < c, Γ(a) · · · − ± µk ≤ r From our work in previous sections we obtain the eigenvalues 1 , eigenvectors q , c k< . 2 k ≤ ∞ following result. The proof of Theorem 3.1 is in the Appendix. Proposition 4.1. For n = 1, 2,..., and α, Remark. The theorem holds with obvious mod- β> 0, the Beta/Binomial x-chain (4.1) has: ification if #supp(π) < . This occurs for binomial n(n 1) (n j+1) ∞ (a) Eigenvalues β0 = 1 and βj = − ··· − , 1 location problems. It will be used without further (α+β+n)j ≤ comment in Section 5. Further, the arguments work j n. (b) Eigenfunctions≤ Q , 0 j n, the Hahn polyno- to give some eigenvalues with polynomial eigenvec- j ≤ ≤ tors when only finitely many moments are finite. mials of Section 2.5. (c) For any ℓ 1 and any starting state x, ≥ 4. EXPONENTIAL FAMILY EXAMPLES n χ2 (ℓ)= β2ℓQ2(x)z This section carries out the analysis of the x- and x i i i i=1 θ- chains for the Beta/Binomial, Poisson/Gamma X (α + β + 2i 1)(α + β) (α) n and normal families. The x- and θ-chains for the where z = − n i . i (β) (α + β + i 1) i normal family are essentially the same. Hence, this i − n+1   14 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE

We now specialize this to α = β = 1 and prove the Γ(α + β + n) (4.2) bounds announced in Proposition 1.1. · Γ(α + j)Γ(β + n j) − Proof of Proposition 1.1. From (a), βi = n(n 1) (n i+1) α+j 1 β+n j 1 2 (θ′) − (1 θ′) − − . (n+2)(−n+3)··· −(n+i+1) . From (2.11), Qi (n) = 1. By ele- ··· · − mentary manipulations, zi = βi(2i +1). Thus

n ℓ 2 This is a transition density with respect to Lebesgue 2 (k (n,y) m(y)) χn(ℓ)= − measure dθ′ on [0, 1]. It has stationary density y=0 m(y) X Γ(α + β) α 1 β 1 n π(dθ)= θ − (1 θ) − dθ. 2ℓ+1 Γ(α)Γ(β) − = βi (2i + 1). Xi=1 Remark. In this specific example, the prior π(dθ) 2 Γ(α+β) α 1 β 1 We may bound β βi = (1 )i, and so has a density g(θ)= θ − (1 θ) − with re- i 1 n+2 Γ(α)Γ(β) − ≤ − spect to the Lebesgue measure dθ. For ease of expo- n n 2 2ℓ+1 i(2ℓ+1) sition, we deviate from the general treatment in Sec- χn(ℓ)= βi (2i + 1) β1 (2i + 1). ≤ tion 2.1, where k(θ,θ′) is a transition density with Xi=1 Xi=1 respect to π(dθ′). Instead, we absorb g(θ′) in the i i 2 Using 1∞ x = x/(1 x), 1∞ ix = x/(1 x) , we transition density, so that k(θ,θ′) is a transition den- obtain − − P P sity with respect to the Lebesgue measure dθ′. 3β2ℓ+1 The relevant orthogonal polynomials are Jacobi 2ℓ+1 2 1 a,b 3β1 χn(ℓ) 2ℓ+1 . polynomials P , a = α 1, b = β 1, given on ≤ ≤ (1 β )2 i − − − 1 [ 1, 1] in standard literature [68], 1.8. We make the − 1 x By Lemma 2.4, this gives (for the K chain) change of variables θ = −2 and write pi(θ) = α 1,β 1 − − 2ℓ 1 Pi (1 2θ). Then, we have 2ℓ+1 2 3β1f− − 3β χ (ℓ) . 1 1 n,θ 2ℓ 1 2 ≤ ≤ (1 β − ) 1 1 (4.3) pj(θ)pk(θ)π(θ) dθ = zj− δjk, − 0 The upper bounde in total variation follows from Z (2.4). For a lower bound in total variation, use the where eigenfunction ϕ (x)= x n . This is maximized at (2j + α + β 1)Γ(α)Γ(β)Γ(j + α + β 1)j! 1 − 2 z = − − . x = n and the lower bound follows from Lemma 2.1. j Γ(α + β)Γ(j + α)Γ(j + β)  Proposition 4.2. For α,β > 0, the θ-chain for Remark. Essentially, the same results hold for the Beta/Binomial (4.2) has: any Beta(α, β) prior in the sense that, for fixed α, β, n(n 1) (n j+1) (a) Eigenvalues β0 = 1, βj = − ··· − , 1 starting at n, order n steps are necessary and suf- (α+β+n)j ≤ ficient for convergence. The computation gets more j n, βj = 0 for j>n. ≤ involved if one starts from a different point than (b) Eigenfunctions pj, the shifted Jacobi polyno- n. Mizan Rahman and Mourad Ismail have shown mials. us how to evaluate the Hahn polynomials at n/2 (c) With zi from (4.3), for any ℓ 1 and any ≥ when n is even and α = β using [63], (1.4.12) (the starting state θ [0, 1], ∈ odd-degree Hahn polynomials vanish at n/2 and the n 2 2ℓ 2 three-term recurrence then easily yields the values χθ(ℓ)= βi pi (θ)zi. of the even-degree Hahn polynomials at n/2). See i=1 X Section 5.1 for a closely related example. The following proposition gives sharp chi-square 4.1.2 The θ-chain for the Beta/Binomial. Fix α, bounds, uniformly over α,β,n in two cases: (i) α ≥ β> 0. On the state space [0, 1], let β, starting from 0 (worst starting point), (ii) α = β, starting from 1/2 (heuristically, the most favor- n n j n j able starting point). The restriction α β is not re- k(θ,θ′)= θ (1 θ) − ≥ j − ally a restriction because of the symmetry P a,b(x)= jX=0   i GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 15

i b,a 2i + α + β + 1 i + α + β 1 i + α ( 1) Pi ( x). For α β> 1/2, it is known (e.g., (4.4) − [63−], Lemma− 4.2.1) that≥ · 2i + α + β 1 i + 1 i + β − 5 (α + β)(α + 1) α 1,β 1 (α)i sup pi = sup Pi − − = pi(0) = . ≤ 6 β + 1 [0,1] | | [ 1,1] | | i! − α + β + 2 2ℓ Hence, 0 is clearly the worst starting point from 1 . the viewpoint of convergence in chi-square distance, ·  − α + β + n + 1 2 that is, The lead term in χ0(ℓ) is

2 2 (α + β + 1)α 2ℓ sup χθ(ℓ) = χ0(ℓ). β . θ [0,1]{ } β 1 ∈   Proposition 4.3. For α β> 0, n> 0, set N = From (4.4), we get that for any log[(α + β)(α + 1)/(β + 1)].≥ The θ-chain for the 1 ℓ log[(α + β)(α + 1)/(β + 1)] Beta/Binomial (4.2) satisfies: ≥ 2 log β − 1 2 c N+c we have (i) χ0(ℓ) 7e− , for ℓ , c> 0. • ≤ ≥ 2 log β1 2ℓ 2 2 1 c −N c β pi+1(0) zi+1 χ (ℓ) e , for ℓ − , c> 0. i+1 0 6 2 log β1 5/6. • ≥ ≤ − 2ℓ 2 (ii) Assuming α = β> 0, βi pi(0) zi ≤ Hence, for such ℓ, χ2 (ℓ) 13β2ℓ, for ℓ 1 . 1/2 2 2 log β2 • ≤ ≥ − 2 1 2ℓ 2 (α + β + 1)α 2ℓ ∞ k χ (ℓ) β2 , for ℓ> 0. χ (ℓ) β (5/6) • 1/2 ≥ 2 0 ≤ β 1   0 ! Roughly speaking, part (i) says that, starting from X (α + β + 1)α 0, ℓ(α,β,n) steps are necessary and sufficient for = 6 β2ℓ. β 1 convergence in chi-square distance where   log[(α + β)(α + 1)/(β + 1)] With N = log[(α+β)(α+1)/(β +1)] as in the propo- ℓ(α,β,n)= . sition, we obtain 2 log(1 (α + β)/(α + β + n)) − − 2 c N + c χ (ℓ) 7e− for ℓ , c> 0; Note that if α, n, n/α tend to infinity and β is fixed, 0 ≤ ≥ 2 log β − 1 n log α α 2 1 c N c ℓ(α,β,n) , β1 1 . χ (ℓ) e for ℓ − , c> 0. ∼ α ∼ − n 0 ≥ 6 ≤ 2 log β  − 1 If α, n, n/α tend to infinity and α = β, Proof of Proposition 4.3(ii). When a = b, a,b n log α 2α the classical Jacobi polynomial Pk is given by ℓ(α, α, n) , β1 1 . ∼ 4α ∼ − n a,a (a + 1)k a+1/2 Pk (x)= Ck (x) The result also says that, starting from 0, conver- (2a + 1)k ν gence occurs abruptly (i.e., with cutoff) at ℓ(α,β,n) where the Ck ’s are the ultraspherical polynomials. ν as long as α tends to infinity. See [63], (4.5.1). Now, [63], (4.5.16) gives Cn(0) = 0 Part (ii) indicates a completely different behav- if n is odd and ior starting from 1/2 (in the case α = β). There is (2ν) ( 1)n/2 Cν(0) = n no cutoff and convergence occurs at the exponen- n n − 2 (n/2)!(ν + 1/2)n/2 tial rate given by β (β 1 4α if n/α tends to 2 2 n if n is even. Going back to the shifted Jacobi’s, this infinity). ∼ − yields p2k+1(1/2)=0 and Proof of Proposition 4.3(i). We have χ2(ℓ)= 0 (α)2k α 1/2 n 2ℓ 2 p (1/2) = C − (0) 1 βi pi(0) zi and 2k (2α 1) 2k − 2k P β2ℓ p (0)2z k i+1 i+1 i+1 (α)2k (2α 1)2k( 1) 2ℓ 2 = − − βi pi(0) zi (2α 1) 22kk!(α) − 2k k 2ℓ n i (α + k) ( 1)k = − = k − . α + β + n + i 2k   2 k! 16 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE

We want to estimate The orthogonal polynomials for the negative bino- n/2 mial are Meixner polynomials [68], (1.9): 2 ⌊ ⌋ 2ℓ 2 χ1/2(ℓ)= β2i p2i(1/2) z2i 1 j x X M (x)= F α . and thus we compute j 2 1 − a−  −  2ℓ 2 β2(i+1)p2(i+1)(1/2) z2(i+1) These satisfy [68], (1.92), 2ℓ 2 β p2i(1/2) z2i j 2i ∞ (1 + α) j! M (x)M (x)m(x)= δ . (n 2i)(n 2i 1) 2ℓ j k (a) jk = − − − x=0 j (2α + n + 2i)(2α + n + 2i + 1) X   Our work in previous sections, together with basic 4i + 2α + 1 2i + 2α 1 properties of Meixner polynomials, gives the follow- − · 4i + 2α 1 2i + 2α + 1 ing propositions. − 2i(2i + 1)(2α + 2i + 1)(2α + 2i) (4.5) Proposition 4.4. For a,α > 0 the Poisson/ · (2i + α)2(2i + α + 1)2 Gamma x-chain (4.6) has: (α + 2i)(α + 2i + 1) 2 (a) Eigenvalues β = (α/(1 + α))j, 0 j< . j ≤ ∞ 4(α + i)(i + 1) (b) Eigenfunctions Mj(x), the Meixner polynomi- ·   9 als. β2ℓ. (c) For any ℓ 0 and any starting state x ≤ 5 2 ≥ ℓ 2 Hence ∞ (k (x,y) m(y)) χ2 (ℓ)= − 1 x m(y) χ2 (ℓ) 10β2ℓp (1/2)2z for ℓ . y=0 1/2 ≤ 2 2 2 ≥ 2 log β X − 2 As ∞ 2ℓ 2 (a)i = βi Mi (x)zi, zi = i . α + 1 4(2α + 3) i=1 (1 + α) i! p2(1/2) = and z2 = , X 4 α(α + 1)2 Proposition 4.5. For α = a = 1, starting at n, this gives χ2 (ℓ) 1 β2ℓ and, assuming ℓ 1 , 2 2c 1/2 2 2 2 log β2 χn(ℓ) 2− for ℓ = log2(1 + n)+ c, c> 0; 2 2ℓ ≥ ≥ − ≤ χ (ℓ) 13β .  2 2c 1/2 2 χ (ℓ) 2 for ℓ = log (n 1) c, c> 0. ≤ n ≥ 2 − − 4.2 Poisson/Gamma Proof. From the definitions, for all j and pos- 4.2.1 The x-chain for the Poisson/Gamma. Fix itive integer x, α,a> 0. For x,y = 0, 1, 2,... = N, let j x ∈X { } ∧ i j e λ(α+1)/αλa+x 1 Mj(x) = ( 1) x(x 1) (x i + 1) ∞ − − | | − i − · · · − k(x,y)= a+x i=0   0 Γ(a + x)(α/(α + 1)) X Z j λ y j i j e− λ x =(1+ x) . (4.6) dλ ≤ i · y! Xi=0   a+x+y Thus, for ℓ log (1 + n)+ c, Γ(a + x + y)(α/(2α + 1)) ≥ 2 = a+x . Γ(a + x)(α/(α + 1)) y! ∞ χ2 (ℓ)= M 2(n)2 j(2ℓ+1) The stationary distribution is the negative binomial n j − j=1 a x X (a)x 1 α N m(x)= , x . ∞ 2j j(2ℓ+1) x! α + 1 α + 1 ∈ (1 + n) 2−     ≤ When α = a = 1, the prior is a standard exponential, jX=1 an example given in Section 2.1. Then, 2 (2ℓ+1) (1 + n) 2− x+y+1 x+1 1 x + y 1 2 (2ℓ+1) k(x,y)= , ≤ 1 (1 + n) 2− 3 x 2 −       2c 1 . 2− − 2c x+1 2− . m(x) = 1/2 . ≤ 1 2 2c 1 ≤ − − − GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 17

′ The lower bound follows from using only the lead θ θ (a 1)/2 e− − θ′ − √ term. Namely, = Ia 1(2 θθ′) α/(1 + α)  θ  − 2 2 2ℓ θ (α+1)θ′/α (a 1)/2 χn(ℓ) (1 n) 2− e− − (α + 1)θ′ − ≥ − = Ia 1 22c for ℓ = log (n 1) c.  α/(1 + α) αθ − ≥ 2 − −   Remark. Note the contrast with the Beta/ (2 (α + 1)θθ /α). · ′ Binomial example above. There, order n steps are q necessary and sufficient starting from n and there Thus, k(θ,θ′) is a transition density with respect to is no cutoff. Here, log2 n steps are necessary and the Lebesgue measure dθ′, that is, sufficient and there is a cutoff. See [27] for further discussion of cutoffs. k(θ,θ′) dθ′ = 1. Z Jim Hobert has shown us a neat appearance of Here Ia 1 is the modified Bessell function. For fixed the x-chain as branching process with immigration. − θ, k(θ,θ ) integrates to 1 as discussed in [44], pages Write the transition density above as ′ 58–59. The stationary distribution of this Markov Γ(a + x + y) α + 1 a+x α y chain is the Gamma: k(x,y)= . Γ(a + x)y! 2α + 1 2α + 1     e θ/αθa 1 π(dθ)= − − dθ. This is a negative binomial mass function with Γ(a)αa α parameters θ = 2α+1 and r = x + a. Hence, the x- chain can be viewed as a branching process with To simplify notation, we take α =1 for the rest of immigration. Specifically, given that the population this section. The relevant polynomials are the La- size at generation n is Xn = x, we have guerre polynomials ([68], Section 1.11) x (a)i i X = N + M , Li(θ)= 1F1 − θ n+1 i,n n+1 i! a i=1   X i where N1,n, N2,n,...,Nx,n are i.i.d. negative binomi- 1 ( i)j j α = − (a + j)i jθ . als with parameters θ = and r =1, and Mn+1, i! j! − 2α+1 j=0 which is independent of the Ni,n’s, has a negative X α with parameters θ = 2α+1 and Note that classical notation has the parameter a r = a. The Ni,n’s represent the number of offspring shifted by 1 whereas we have labeled things to mesh of the nth generation and Mn+1 represents the im- with standard statistical notation. The orthogonal- migration. This branching process representation was ity relation is used in [61] to study admissibility. Γ(a + j) ∞ L (θ)L (θ)π(θ) dθ = δ 4.2.2 The θ-chain for the Poisson/Gamma. Fix i j j!Γ(a) ij Z0 α,a> 0. For θ,θ′ Θ = (0, ), let η = (α + 1)θ′/α. ∈ ∞ 1 Note that π(dθ′) = g(θ′) dθ′, where g(θ′) = = zj− δij. ′ e−θ /α(θ′)a−1 Γ(a)αa . Hence, as in Section 4.1.2, we absorb The multilinear generating function formula ([63], g(θ′) in the transition density of the θ-chain. This Theorem 4.7.5) gives gives 2tθ/(1 t) 2 j ∞ e− − ∞ 1 θ t θ j θ′(α+1)/α a+j 1 2 i ∞ e− θ e− (θ′) − Li(θ) zit = a 2 . k(θ,θ′)= (1 t) j!(a)j 1 t a+j Xi=0 − X0  −  j=0 j! Γ(a + j)(α/(α + 1)) X Combining results, we obtain the following state- θ θ′ a 1 j e− − (θ′) − ∞ (θθ′) ments. = α/(α + 1) j!Γ(a + j) jX=0 Proposition 4.6. For α = 1 and a> 0, the Markov θ θ′ (a 1)/2 2j+a 1 chain with kernel (4.7) has: e− − θ′ − ∞ (√θθ′) − = 1 α/(α + 1) θ j!Γ(a + j) (a) Eigenvalues βj = j , 0 j< .   jX=0 2 ≤ ∞ (4.7) (b) Eigenfunctions Lj, the Laguerre polynomials. 18 P. DIACONIS, K. KHARE AND L. SALOFF-COSTE

−1/2(θ−ν)2/τ2 (c) For any ℓ 1 and any starting state θ, e dθ. The marginal density is Normal(v, ≥ √2πτ 2 2 2 ∞ j!Γ(a) σ + τ ). χ2(ℓ)= β2ℓL2(θ) θ j j Γ(a + j) A stochastic description of the chain is jX=1 X = aX + ε 2−2ℓ+1θ /(1 2−2ℓ) n+1 n n+1 e− − (4.8) = 2ℓ a τ 2 σ2ν (1 2− ) 2 − with a = 2 2 , ε Normal 2 2 ,σ . 2 2ℓ j σ + τ ∼ σ + τ ∞ 1 θ 2−   4ℓ 1. This is the basic autoregressive (AR1) process. Feller · j!(a)j 1 2− − X0  −  ([44], pages 97–99) describes it as the discrete-time Proposition 4.7. For α = 1 and a > 0, the Ornstein–Uhlenbeck process. The diagonalization of Markov chain with kernel (4.7) satisfies: this Gaussian Markov chain has been derived by other authors in various contexts. Goodman and For θ> 0, χ2(ℓ) e22 c if ℓ 1 (log [2(1 + a + θ − 2 2 Sokal [50] give an explicit diagonalization of vector- • θ2/a)] + c), c> 0≤. ≥ 2 c valued Gaussian autoregressive processes which spe- For θ (0, a/2) (2a, ), χθ(ℓ) 2 if ℓ cialize to (a), (b) below. Donoho and Johnstone ([31], • 1 (log [ 1∈(θ2/a + a)]∪ c),∞c> 0. ≥ ≤ 2 2 2 − Lemma 2.1) also specialize to (a), (b) below. Both Proof. For the upper bound, assuming ℓ 1, sets of authors give further references. Since it is so we write ≥ well studied, we will be brief and treat the special case with ν = 0, σ2 + τ 2 = 1/2. Thus the stationary 2 ℓ a (2θ4−ℓ)/(1 4−ℓ) χ (ℓ) = (1 4− )− e− − θ − distribution is Normal(0, 1/2). The orthogonal poly- 2 ℓ j nomials are now Hermite polynomials ([68], 1.13). ∞ 1 θ 4− ℓ 1 These are given by · j!(a)j 1 4− − X0  −  n n/2, (n 1)/2 1 2 ℓ Hn(y) = (2y) 2F0 − − − exp((2θ /a)4− ) − y2 ℓ a 1  −−  ≤ (1 4− ) − − [n/2] k n 2k 2 ℓ ( 1) (2y) − 2 ℓ exp(2(θ /a)4− ) = n! − . 2(θ /a + a)4− . k!(n 2k)! ≤ (1 4 ℓ)a+1 k=0 −  − −  X 1 2 They satisfy For ℓ 2 (log2[2(1 + θ /a + a)] + c), c> 0, we obtain 2 ≥ 2 c 1 2 χ (ℓ) e 2− . ∞ y n θ e− Hm(y)Hn(y) dy = 2 n!δmn. The≤ stated lower bound does not easily follow √π Z−∞ from the formula we just used for the upper bound. There is also a multilinear generating function for- 2 Instead, we simply use the first term in χθ(ℓ)= mula which gives ([63], Example 4.7.3) 2ℓ 2 j!Γ(a) 1 2 ℓ j 1 βj Lj (θ) Γ(a+j) , that is, a− (θ a) 4− . This 2 2 ≥ − ∞ H (x) 1 2x t easily gives the desired result.  n tn = exp . P 2nn! √ 2 1+ t 0 1 t   Remark. It is not easy to obtain sharp formulas X − 2 2 starting from θ near a. For example, starting at θ = Proposition 4.8. For ν = 0, σ + τ = 1/2, the a, one gets a lower bound by using the second term Markov chain (4.8) has: 2 2ℓ 2 j!Γ(a) 2 j 2 2 χθ(ℓ)= j 1 βj Lj (θ) Γ(a+j) (the first term vanishes (a) Eigenvalues βj = (2τ ) (as σ + τ = 1/2, we ≥ 2 at θ = a). This gives χ2(ℓ) [2a/(a + 1)]4 2ℓ . When have 2τ < 1). P a ≥ − a is large, this is significantly smaller than the upper (b) Eigenfunctions the Hermite polynomials Hj. (c) For any starting state x and all ℓ 1, bound proved above. ≥ ∞ 1 4.3 The Gaussian Case χ2 (ℓ)= (2τ 2)2kℓH2(x) x k 2kk! Here, the x-chain and the θ-chain are essentially kX=1 the same and indeed the same as the chain for the exp(2x2(2τ 2)2ℓ/(1 + (2τ 2)2ℓ)) = 1. additive models, so we just treat the x-chain. Let 2 4ℓ − R 1/2(x θ)2/σ2 2 1 (2τ ) = , fθ(x)= e− − /√2πσ and π(dθ)= − X q GIBBS SAMPLING AND ORTHOGONAL POLYNOMIALS 19

The next proposition turns the available chi-square follows: from x, draw θ from π( x) and then go to ′ ·| formula into sharp estimates when x is away from 0. X′ = θ′ + ε′ with ε′ independently drawn from g. 2 Starting from 0, the formula gives χ0(ℓ) = It has stationary distribution m(x) dx, the convolu- 2 4ℓ 1/2 (1 (2τ ) )− 1. This shows convergence at the tion of π and g. For the θ-chain, starting at θ, set − − 2 2 faster exponential rate of β2 = (2τ ) instead of β1 = X′ = θ +ε and draw θ′ from π( x′). It has stationary 2τ 2. distribution π. Observe that ·| 2 2 k k Proposition 4.9. For ν = 0, σ +τ = 1/2, x Eθ(X )= Eθ((θ + ε) ) R, the Markov chain (4.8) satisfies: ∈ k 2 k j k j 2 c log(2(1 + x )) + c = θ E(ε − ). χx(ℓ) 8e− for ℓ ,c> 0, j ≤ ≥ 2 log(2τ 2) jX=0   − x2ec log(2(1 + x2)) c Thus (H2) of Section 3 is satisfied with ηk = 1. To χ2 (ℓ) for ℓ − , x ≥ 2(1 + x2) ≤ 2 log(2τ 2) check the conjugate condition we may use results − of [87], Section 4. In present notation, Morris shows c> 0, that if pk is the monic orthogonal polynomial of de- 2 2 4ℓ 1 2 4ℓ χ0(ℓ)=(1 (2τ ) )− 1 (2τ ) . gree k for the distribution π and pk′ is the monic − − ≥ orthogonal polynomial of degree k for the distribu- Proof. For the upper bound, assuming tion m, then 1 ℓ (log(2(1 + x2)) + c), c> 0, k 2 log(2τ 2) n1 ≥ Ex(pk(θ)) = bkpk′ (x). − n1 + n2 we have   Here π is taken as the sum of n1 copies and ε the (2τ 2)2ℓ < 1/2, 2x2(2τ 2)2ℓ < 1 sum of n2 copies of one of the six families and and it follows that k 1 − 1+ ic/n1 exp(2x2(2τ 2)2ℓ/(1 + (2τ 2)2ℓ)) bk = , 2 1+ ic/(n1 + n2) χx(ℓ)= 1 i=0 1 (2τ 2)4ℓ − Y − where c is the coefficient of m2(θ) in σ2(θ)= a + q (1 + 2(2τ 2)4ℓ)(1 + 6x2(2τ 2)2ℓ) 1 bm(θ)+cm2(θ) for the family. Comparing lead terms ≤ − gives (H3) (in Section 3) with an explicit value of µ . 8(1 + x2)(2τ 2)2ℓ. k ≤ In the present setup, µk = βk is the kth eigenvalue. For the lower bound, write We now make specific choices for each of the six 2 2 2ℓ 2 2ℓ cases. 2 exp(2x (2τ ) /(1 + (2τ ) )) χx(ℓ)= 1 1 (2τ 2)4ℓ − 5.1 Binomial − q For fixed p, 0

2 For the binomial, the parameter c is c = 1 and This gives the desired result since we also have χ0(ℓ) − 2ℓ  ≥ the eigenvalues of the x-chain are Nqβ1 . n1(n1 1) (n1 k + 1) βk = − · · · − , In general, it is not very easy to evaluate the poly- (n1 + n2)(n1 + n2 1) (n1 + n2 k + 1) nomials k at x = 0 to estimate χ2(ℓ) and under- − · · · − j 6 x 0 k n + n = N. stand what the role of the starting point is. However, ≤ ≤ 1 2 if p = 1/2 and x = n1 = N/2, then the recurrence Note that βk =0 for k n1 + 1. The orthogonal polynomials are Krawtchouck≥ polynomials ([68], 1.10; equation ([63], (6.2.37)) shows that k2j+1(N/2) = 0 [63], page 100): and (2j 1)! j, x 1 k (N/2) = ( 1)j − . (5.2) kj(x)= 2F1 − − 2j N p − (N 2j + 1)2j  −  − which satisfy Thus, for ℓ 1, ≥ N N x N x N/4 2 p (1 p) − k (x)k (x) ⌈ ⌉ N (2j 1)! x − j ℓ χ2 (ℓ)= β2ℓ − x=0   N/2 2j 2j (N 2j + 1) X j=1    2j  1 j X − N − 1 p = − δjℓ. N/4 j p ⌈ ⌉ 2ℓ (2j 1)!     = β2j − . 2j(N 2j + 1)2j jX=1 − Proposition 5.1. Consider the chain (5.1) on This is small for ℓ =1 if N is large enough. Indeed, 0,...,n1 + n2 with 0

5.3 Negative Binomial The orthogonal polynomials are Laguerre polynomi- als, discussed in Section 4.2 above. Fix p with 0 0 and θ < π/2. θ Γ(x + n + n )Γ(n )Γ(n ) 2 | | |   1 2 1 2 It has mean µ = r tan(θ) and variance µ /r + r. See 0 θ x, [86], Section 5 or [37] for numerous facts and refer- ≤ ≤ ences. Fix real µ and positive n1,n2. Let the density which is a negative hypergeometric. A simple ex- π be hyperbolic with mean n µ and r = n (1 + µ2). ample has n = n = 1 () so 1 1 1 1 2 Let the density g be hyperbolic with mean n µ and π(θ x) = 1/(1 + x). The x-chain becomes: From x, 2 r = n (1 + µ2). Then m is hyperbolic with mean choose| θ uniformly in 0 θ x and let X = θ + ε 2 2 ′ (n + n )µ and r = (n + n )(1 + µ2). The condi- with ε geometric. The parameter≤ ≤ c = 1 so that 1 2 1 2 tional density π(θ x) is “unnamed and apparently | β0 = 1, has not been studied” ([87], page 581).

n1(n1 + 1) (n1 + k 1) For this family, the parameter c = 1 and thus βk = · · · − , (n1 + n2)(n1 + n2 + 1) (n1 + n2 + k 1) β = 1, · · · − 0 1 k< . n1(n1 + 1) (n1 + k 1) ≤ ∞ βk = · · · − . The orthogonal polynomials are Meixner polynomi- (n1 + n2) (n1 + n2 + k 1) · · · − als discussed in Section 4.2. The orthogonal polynomials are Meixner–Pollaczek 5.4 Normal polynomials ([100], page 395; [68], 1.7; [63], page 171). These are given in the form Fix reals µ and n1,n2,v> 0. Let π = Normal(n1µ, λ n1v), g = Normal(n2µ,n2v). Then m = Normal((n1 + Pn (x, ϕ) n )µ, (n + n )v) and π(θ x) = Normal( n1 x, 2 1 2 n1+n2 n n | (2λ)n n, λ + ix 2iϕ inϕ 1 2 v). Here c =0 and = 2F1 − 1 e− e , n1+n2 2λ n!  −  k (5.3) n1 β = , 0 k< . 1 ∞ (2ϕ π)x 2 λ λ k e − Γ(λ + ix) P P dx n1 + n2 ≤ ∞ 2π | | m n   Z The orthogonal polynomials are Hermite, discussed −∞ Γ(n + 2λ) in Section 4.3. Both the x- and θ-chains are classical = 2λ δmn. autoregressive processes as described in Section 4.3. n!(2 sin ϕ) Here 0, 0 <ϕ<π. The change 5.5 Gamma −∞ rx∞ π 1 of variables y = 2 , ϕ = 2 + tan− (θ) λ = r/2 trans- Fix positive real n ,n , α. Let π = Gamma(n , α), (2ϕ π)x 2 1 2 1 forms the density e − Γ(λ + ix) to a constant g = Gamma(n2, α). Then | | multiple of the density fθ(x) of Section 2.4. m = Gamma(n1 + n2, α), We carry out one simple calculation. Let π, g have 2 the density of π log C , with C standard Cauchy. π(θ x)= x Beta(n1,n2). | | | · Thus A simple case to picture is α = n1 = n2 = 1. Then, 1 the x-chain may be described as follows: From x, (5.4) π(dx)= g(x) dx = dx. 2 cosh(πx/2) choose θ uniformly in (0,x) and set X′ = θ + ε with 2 ε standard exponential. This is simply a continuous The marginal density is the density of π log C1C2 , version of the examples of Section 5.3. The param- that is, | | eter c =1 and so x m(x)= . β0 = 1, 2 sinh(πx/2) n (n + 1) (n + k 1) Proposition 5.2. For the additive walk based β = 1 1 · · · 1 − , k (n + n )(n + n + 1) (n + n + k 1) on (5.4): 1 2 1 2 · · · 1 2 − 0

(b) The eigenfunctions are the Meixner–Pollaczek Koudou shows that, given σ, one can always find polynomials (5.3) with ϕ = π/2, λ = 1. sequences of orthonormal functions such that σ is 2 2ℓ 1 1 x π 2 (c) χx(ℓ) = 2 k∞=1(k + 1)− − (Pk ( 2 , 2 )) . Lancaster for these sequences. He then character- izes those sequences ρ such that the associated σ Proof. UsingP Γ(z +1)= zΓ(z), Γ(z)Γ(1 z)= n − is absolutely continuous with respect to µ ν with π , we check that × sin(πz) dσ 2 (6.6) f = = ρnpn(x)qn(y) Γ(1 + ix) =Γ(1+ ix)Γ(1 ix) d(µ ν) n | | − × X = (ix)Γ(ix)Γ(1 ix) in L2(µ ν). − × π(ix) πx = = . For such Lancaster densities (6.6), the equivalence sin π(ix) sinh(πx) (6.5) says precisely that the x-chain for the Gibbs sampler for f has pn as eigenvectors with eigen- The result now follows from routine simplification. 2 { }  values ρn (cf. Section 3 above). For more on Lan- caster families{ } with marginals in the six exponential Remark. Part (c) has been used to show that families, see [7], Section 7, which makes fascinat- order log x steps are necessary and sufficient for con- ing connections between these families and diagonal vergence in unpublished joint work with Mourad Is- multivariate families. mail. The above does not require polynomial eigenvec- tors and Griffiths [52] gives examples of triangular 6. OTHER MODELS, OTHER METHODS margins and explicit nonpolynomial eigenfunctions. Here is another. Let G be a compact Abelian group Even in the limited context of bivariate Gibbs with characters pn(x) chosen orthonormal with sampling, there are other contexts in which the in- respect to Haar measure{ }µ. For a probability density gredients above arise. These give many further ex- f on G, the location problem amples where present techniques lead to sharp re- sults. This section gives brief pointers to Lancaster σ(dx, dy)= f(x y)µ(dx)ν(dy) − families, alternating projections, the multivariate has uniform margins, pn as eigenfunctions of the case and to other techniques for proving conver- { } x-chain and ρn = pn(x)f(x) dµ(x) as eigenvalues. gence. A main focus of Koudou and his co-authors is R 6.1 Lancaster Families delineating all of the extremal Lancaster sequences and so, by Choquet’s theorem (every point in a com- There has been a healthy development of bivariate pact convex subset of a metrizable topological vector distributions with given margins. The part of inter- space is a barycenter of a probability measure sup- est here begins with the work of Lancaster, nicely ported on the extreme points; see [84], Chapter 11.2) summarized in his book [72]. Relevant papers by all densities f with pn , qn as in (6.5). Of course, Angelo Koudou [69, 70, 71] summarize recent work. any of these can be used{ } with{ } the Gibbs sampler and Briefly, let ( ,µ), ( ,ν) be probability spaces with present techniques. These characterizations are car- X Y associated L2(µ),L2(ν). Let σ be a measure on ried out for a variety of classical orthogonal polyno- X× with margins µ and ν. Suppose the relevant con- mials. In particular, Koudou gives a clear translation Y ditional probabilities exist, so into probabilists’ language of Gasper’s complete de- termination of the extremals of the Lancaster fami- σ(dx, dy)= µ(dx)K (dy)= ν(dy)L (dx). x y lies with equal Beta(α, β) margins for α, β 1 . We ≥ 2 Say that σ is a Lancaster probability with respect give but one example. to the given sequences of orthonormal functions if Example. 
6. OTHER MODELS, OTHER METHODS

Even in the limited context of bivariate Gibbs sampling, there are other contexts in which the ingredients above arise. These give many further examples where present techniques lead to sharp results. This section gives brief pointers to Lancaster families, alternating projections, the multivariate case and to other techniques for proving convergence.

6.1 Lancaster Families

There has been a healthy development of bivariate distributions with given margins. The part of interest here begins with the work of Lancaster, nicely summarized in his book [72]. Relevant papers by Angelo Koudou [69, 70, 71] summarize recent work. Briefly, let $(\mathcal X,\mu)$, $(\mathcal Y,\nu)$ be probability spaces with associated $L^2(\mu)$, $L^2(\nu)$. Let $\sigma$ be a measure on $\mathcal X\times\mathcal Y$ with margins $\mu$ and $\nu$. Suppose the relevant conditional probabilities exist, so
$$\sigma(dx,dy)=\mu(dx)K_x(dy)=\nu(dy)L_y(dx).$$
Say that $\sigma$ is a Lancaster probability with respect to the given sequences of orthonormal functions if there exists a positive real sequence $\{\rho_n\}$ such that, for all $n$,
(6.5)
$$\int p_n(x)q_m(y)\,\sigma(dx,dy)=\rho_n\delta_{mn}.$$
This equation is also equivalent to $\int q_n(y)K_x(dy)=\rho_n p_n(x)$ and to $\int p_n(x)L_y(dx)=\rho_n q_n(y)$.

Koudou shows that, given $\sigma$, one can always find sequences of orthonormal functions such that $\sigma$ is Lancaster for these sequences. He then characterizes those sequences $\{\rho_n\}$ such that the associated $\sigma$ is absolutely continuous with respect to $\mu\times\nu$ with
(6.6)
$$f=\frac{d\sigma}{d(\mu\times\nu)}=\sum_n\rho_n p_n(x)q_n(y)\quad\text{in } L^2(\mu\times\nu).$$
For such Lancaster densities (6.6), the equivalence (6.5) says precisely that the $x$-chain for the Gibbs sampler for $f$ has $\{p_n\}$ as eigenvectors with eigenvalues $\{\rho_n\}$ (cf. Section 3 above). For more on Lancaster families with marginals in the six exponential families, see [7], Section 7, which makes fascinating connections between these families and diagonal multivariate families.

The above does not require polynomial eigenvectors, and Griffiths [52] gives examples of triangular margins and explicit nonpolynomial eigenfunctions. Here is another. Let $G$ be a compact Abelian group with characters $\{p_n(x)\}$ chosen orthonormal with respect to Haar measure $\mu$. For a probability density $f$ on $G$, the location problem
$$\sigma(dx,dy)=f(x-y)\,\mu(dx)\,\nu(dy)$$
has uniform margins, $\{p_n\}$ as eigenfunctions of the $x$-chain and $\rho_n=\int p_n(x)f(x)\,d\mu(x)$ as eigenvalues.

A main focus of Koudou and his co-authors is delineating all of the extremal Lancaster sequences and so, by Choquet's theorem (every point in a compact convex subset of a metrizable topological vector space is a barycenter of a probability measure supported on the extreme points; see [84], Chapter 11.2), all densities $f$ with $\{p_n\}$, $\{q_n\}$ as in (6.5). Of course, any of these can be used with the Gibbs sampler and present techniques. These characterizations are carried out for a variety of classical orthogonal polynomials. In particular, Koudou gives a clear translation into probabilists' language of Gasper's complete determination of the extremals of the Lancaster families with equal $\operatorname{Beta}(\alpha,\beta)$ margins for $\alpha,\beta\ge\frac12$. We give but one example.

Example. Consider the uniform distribution on the unit disk in $\mathbb R^2$ given by
(6.7)
$$f(x,y)=\frac{1}{\pi},\qquad 0\le x^2+y^2\le 1.$$
The conditional distribution of $X$ given $Y$ is uniform on $[-\sqrt{1-Y^2},\ \sqrt{1-Y^2}]$. Similarly, the conditional distribution of $Y$ given $X$ is uniform on $[-\sqrt{1-X^2},\ \sqrt{1-X^2}]$.
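Since both conditionals are uniform on explicit intervals, this Gibbs sampler is easy to simulate. The sketch below (our illustration) runs the $x$-chain and estimates the lag-one autocorrelation of $U_2(x)=4x^2-1$, the degree-two Chebyshev polynomial of the second kind; by the eigenvalue computation carried out next, it should be close to $1/9$.

import numpy as np

# Sketch (ours, not from the paper): Gibbs sampler for the uniform
# density on the unit disk.  Both conditionals are uniform on an
# interval.  U_2(x) = 4x^2 - 1 should be an eigenfunction of the
# x-chain with eigenvalue 1/9, so the lag-one autocorrelation of
# U_2(X_n) is approximately 1/9.
rng = np.random.default_rng(1)

n_steps = 500_000
x = np.empty(n_steps)
x[0] = 0.0
for n in range(n_steps - 1):
    half = np.sqrt(1.0 - x[n] ** 2)
    y = rng.uniform(-half, half)         # Y | X = x
    half = np.sqrt(1.0 - y ** 2)
    x[n + 1] = rng.uniform(-half, half)  # X' | Y = y

u2 = 4.0 * x ** 2 - 1.0
print("mean of U_2(X):", u2.mean(), "(exact value 0)")
print("lag-1 autocorrelation of U_2(X):", np.corrcoef(u2[:-1], u2[1:])[0, 1], "(exact value 1/9)")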

The marginal density of $X$ is
$$m(x)=\frac{2}{\pi}\sqrt{1-x^2},\qquad -1\le x\le 1.$$
By symmetry, $Y$ has the same marginal density. Since, for every $k\ge 0$,
$$E(X^{2k}\mid Y)=\frac{(1-Y^2)^k}{2k+1},\qquad E(Y^{2k}\mid X)=\frac{(1-X^2)^k}{2k+1}$$
and
$$E(X^{2k+1}\mid Y)=0,\qquad E(Y^{2k+1}\mid X)=0,$$
we conclude that the $x$-chain has polynomial eigenfunctions $\{p_k(x)\}_{k\ge0}$ and eigenvalues $\{\lambda_k\}_{k\ge0}$, where $\lambda_{2k}=\frac{1}{(2k+1)^2}$ and $\lambda_{2k+1}=0$, $k\ge0$. The polynomials $\{p_k\}_{k\ge0}$ are orthogonal polynomials corresponding to the marginal density $m$. They are Chebyshev polynomials of the second kind, given by the identity
$$p_k(\cos\theta)=\frac{\sin((k+1)\theta)}{\sin\theta},\qquad 0\le\theta\le\pi,\ k\ge0.$$
Using present theory we have the following theorem.

Theorem 6.1. For the $x$-chain from the Gibbs sampler for the density (6.7):

(a) $\chi_x^2(l)=\sum_{k=1}^{\infty}\frac{1}{(2k+1)^{4l}}\,p_{2k}^2(x)$.
(b) For $x=0$, $\chi_x^2(l)=\sum_{k=1}^{\infty}\frac{1}{(2k+1)^{4l}}$. Thus
$$\frac{1}{3^{4l}}\le\chi_x^2(l)\le\frac{8l+1}{8l-2}\cdot\frac{1}{3^{4l}}.$$
(c) For $|x|=1$, $\chi_x^2(l)=\sum_{k=1}^{\infty}\frac{1}{(2k+1)^{4l-2}}$. Thus
$$\frac{1}{3^{4l-2}}\le\chi_x^2(l)\le\frac{8l-3}{8l-6}\cdot\frac{1}{3^{4l-2}}.$$

One of the hallmarks of our examples is that they are statistically natural. It is an open problem to give natural probabilistic interpretations of a general Lancaster family. There are some natural examples. Eagleson [36] has shown that $W_1+W_2$ and $W_2+W_3$ are Lancaster if $W_1,W_2,W_3$ are independent and from one of the six quadratic families.

Natural Markov chains with polynomial eigenfunctions have been extensively studied in the mathematical genetics literature. This work, which perhaps begins with [42], was unified in [14]. See [38] for a textbook treatment. Models of Fisher–Wright, Moran, Kimura, Karlin and McGregor are included. While many models are either absorbing, nonreversible, or have intractable stationary distributions, there are also tractable new models to be found. See the Stanford thesis work of Hua Zhou.

Interesting classes of reversible Markov chains with explicit polynomial eigenfunctions appear in the work of Hoare, Rahman and their collaborators [20, 58, 59, 60]. These seem distinct from the present examples, are grounded in physics problems (transfer of vibrational energy between polyatomic molecules) and involve some quite exotic eigenfunctions ($9$-$j$ symbols). Their results are explicit and it seems like a worthwhile project to convert them into sharp rates of convergence.

A rather different class of examples can be created using autoregressive processes. For definiteness, work on the real line $\mathbb R$. Consider processes of the form $X_0=0$ and, for $1\le n<\infty$,
$$X_{n+1}=a_{n+1}X_n+\varepsilon_{n+1},$$
with $\{(a_i,\varepsilon_i)\}_{i\ge1}$ independent and identically distributed. Under mild conditions on the distribution of $(a_i,\varepsilon_i)$, the Markov chain $X_n$ has a unique stationary distribution $m$, which can be represented as the probability distribution of
$$X_\infty=\varepsilon_0+a_0\varepsilon_1+a_1a_0\varepsilon_2+\cdots.$$
The point here is that, for any $k$ such that the moments exist,
$$E(X_1^k\mid X_0=x)=E((a_1x+\varepsilon_1)^k)=\sum_{i=0}^k\binom{k}{i}x^iE(a_1^i\varepsilon_1^{k-i}).$$
If, for example, the stationary distribution $m$ has moments of all orders and is determined by those moments, then the Markov chain $\{X_n\}_{n=0}^\infty$ is generated by a compact operator with eigenvalues $E(a_1^i)$, $0\le i<\infty$, and polynomial eigenfunctions.
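For instance (our illustration, with an arbitrary choice of distributions), take $a\sim\operatorname{Uniform}(0,1)$ independent of $\varepsilon\sim N(0,1)$: the eigenvalues are then $E(a^i)=1/(i+1)$, and the linear eigenfunction gives lag-one autocorrelation $E(a)=1/2$ at stationarity.

import numpy as np

# Sketch (ours): the autoregressive chain X_{n+1} = a_{n+1} X_n + eps_{n+1}
# with (a_i, eps_i) i.i.d., choosing (arbitrarily, for illustration)
# a ~ Uniform(0, 1) independent of eps ~ N(0, 1).  The induced operator
# then has eigenvalues E(a^i) = 1/(i + 1); in particular the linear
# eigenfunction gives lag-one autocorrelation E(a) = 1/2.
rng = np.random.default_rng(2)

n_steps = 300_000
x = np.empty(n_steps)
x[0] = 0.0
for n in range(n_steps - 1):
    x[n + 1] = rng.uniform() * x[n] + rng.normal()

xs = x[1000:]  # drop a short burn-in
print("lag-1 autocorrelation:", np.corrcoef(xs[:-1], xs[1:])[0, 1], "(E(a) = 1/2)")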
We have treated the Gaussian case in Section 4.5. At the other extreme, take $|a|<1$ constant and let $\varepsilon_i$ take values $\pm1$ with probability $1/2$. The fine properties of $\pi$ have been intensively studied as Bernoulli convolutions. See [23] and the references there. For example, if $a=1/2$, then $\pi$ is the usual uniform distribution on $[-1,1]$ and the polynomials are Chebyshev polynomials. Unfortunately, for any value of $a\neq0$ in the $\pm1$ case, the distribution $\pi$ is known to be continuous while the distribution of $X_n$ is discrete, and so $X_n$ does not converge to $\pi$ in $L^1$ or $L^2$. We do not know how to use the eigenvalues to get quantitative rates of convergence in one of the standard metrics for weak convergence.

As a second example, take $(a,\varepsilon)=(u,0)$ with probability $p$ and $(1-u,u)$ with probability $1-p$, with $u$ uniform on $(0,1)$ and $p$ fixed in $(0,1)$. This Markov chain has a $\operatorname{Beta}(p,1-p)$ stationary density. The eigenvalues are $1/(k+1)$, $1\le k<\infty$. It has polynomial eigenfunctions. Alas, it is not reversible, and again we do not know how to use the spectral information to get usual rates of convergence. See [23] or [75] for more information about this so-called "donkey chain."

Finally, we mention the work of Hassairi and Zarai [57], which develops the orthogonal polynomials for cubic (and other) exponential families such as the inverse Gaussian. They introduce a novel notion of 2-orthogonality. It seems possible (and interesting) to use their tools to handle the Gibbs sampler for conjugate priors for cubic families.

6.2 Alternating Conditional Expectations

The alternating conditional expectations that underlie the Gibbs sampler arise in other parts of probability and statistics. These include classical canonical correlations, especially those abstracted by Dauxois and Pousse [21]. This last paper contains a framework for studying our problems where the associated operators are not compact.

A nonlinear version of canonical correlations was developed by Breiman and Jerry Friedman as the A.C.E. algorithm. Buja [13] pointed out the connections to Lancaster families. He found several other parallel developments, particularly in electrical engineering ([13], Sections 7–12). Many of these come with novel, explicit examples which are grist for our mill. Conversely, our development gives new examples for understanding the alternating conditional expectations that are the central focus of [13].

There is a classical subject built around alternating projections and the work of von Neumann. See Deutsch [22]. Let $\mathcal H_1$ and $\mathcal H_2$ be closed subspaces of a Hilbert space $\mathcal H$. Let $P_1$, $P_2$ be the orthogonal projections onto $\mathcal H_1$, $\mathcal H_2$, and let $P_I$ be the orthogonal projection onto $\mathcal H_1\cap\mathcal H_2$. von Neumann showed that $(P_1P_2)^n\to P_I$ as $n$ tends to infinity; that is, $\|(P_1P_2)^n(x)-P_I(x)\|\to0$. In [24] we show that the Gibbs sampler is a special case of von Neumann's algorithm, with $\mathcal H=L^2(\sigma)$, $\mathcal H_1=L^2(\mu)$, $\mathcal H_2=L^2(\nu)$, using the notation of Section 6.1. We develop the tools used to quantify rates of convergence for von Neumann's algorithm for use with the Gibbs sampler in [24].
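A two-dimensional toy case (ours) makes von Neumann's theorem concrete: with $\mathcal H_1$, $\mathcal H_2$ two lines through the origin in $\mathbb R^2$, we have $\mathcal H_1\cap\mathcal H_2=\{0\}$, so $P_I=0$ and $\|(P_1P_2)^nx\|\to0$ geometrically, contracting by the squared cosine of the angle between the lines at each iteration.

import numpy as np

# Sketch (ours): von Neumann's alternating projections in R^2.
# H1 and H2 are lines through the origin, so H1 ∩ H2 = {0},
# P_I = 0, and ||(P1 P2)^n x|| -> 0 geometrically, contracting by
# cos^2(angle between the lines) at each iteration.
def line_projector(v):
    """Orthogonal projector onto the line spanned by v."""
    v = v / np.linalg.norm(v)
    return np.outer(v, v)

P1 = line_projector(np.array([1.0, 0.0]))
P2 = line_projector(np.array([1.0, 1.0]))  # angle pi/4, so cos^2 = 1/2

x = np.array([0.3, 2.0])
for n in range(1, 8):
    x = P1 @ (P2 @ x)
    print(f"n = {n}   norm of (P1 P2)^n x = {np.linalg.norm(x):.6f}")
# successive norms decrease by the factor cos^2(pi/4) = 1/2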
6.3 Multivariate Models

The present paper and its companion paper [25] have discussed univariate models. There are a number of models with $x$ or $\theta$ multivariate where the associated Markov chains have polynomial eigenfunctions. Some analogs of the six exponential families are developed in [15]. In Koudou and Pommeret [69], the Lancaster theory for these families is elegantly developed. Their work can be used to give the eigenvalues and eigenfunctions for the multivariate versions of the location models in Section 5. In their work, families of multivariate polynomials depending on a parameter matrix are given. For our examples, specific forms must be chosen. In a series of papers, developed independently of [69], Griffiths [51, 53] has developed such specific bases and illuminated their properties. This work is turned into sharp rates of convergence for two-component multivariate Gibbs samplers in the Stanford thesis work of Kshitij Khare and Hua Zhou.

An important special case, high-dimensional Gaussian distributions, has been studied in [2, 50]. Here is a brief synopsis of these works. Let $m(x)$ be a $p$-dimensional normal density with mean $\mu$ and covariance $\Sigma$ [i.e., $N_p(\mu,\Sigma)$]. A Markov chain with stationary density $m$ may be written as
(6.8)
$$X_{n+1}=AX_n+Bv+C\varepsilon_{n+1}.$$
Here $\varepsilon_n$ has a $N_p(0,I)$ distribution, $v=\Sigma^{-1}\mu$, and the matrices $A,B,C$ have the form
$$A=-(D+L)^{-1}L^T,\qquad B=(D+L)^{-1},\qquad C=(D+L)^{-1}D^{1/2},$$
where $D$ and $L$ are the diagonal and strictly lower triangular parts of $\Sigma^{-1}$. The chain (6.8) is reversible if and only if $A\Sigma=\Sigma A^T$. If this holds, $A$ has real eigenvalues $(\lambda_1,\lambda_2,\dots,\lambda_p)$. In [50], Goodman and Sokal show that the Markov chain (6.8) has eigenvalues $\lambda^K$ and eigenfunctions $H_K$ for $K=(k_1,k_2,\dots,k_p)$, $k_i\ge0$, with
$$\lambda^K=\prod_{i=1}^p\lambda_i^{k_i},\qquad H_K(x)=\prod_{i=1}^pH_{k_i}(z_i),$$
where $Z=P^T\Sigma^{-1/2}X$, $\{H_k\}$ are the usual one-dimensional Hermite polynomials, and $\Sigma^{-1/2}A\Sigma^{1/2}=P\Lambda P^T$, with $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_p)$, is the eigendecomposition of $\Sigma^{-1/2}A\Sigma^{1/2}$. Goodman and Sokal show how a variety of stochastic algorithms, including the systematic scan Gibbs sampler for sampling from $m$, are covered by this framework. Explicit rates of convergence for this Markov chain can be found in the Stanford thesis work of Kshitij Khare.
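Stationarity of $N_p(\mu,\Sigma)$ for (6.8) is equivalent to $A\Sigma A^T+CC^T=\Sigma$ together with $(I-A)\mu=Bv$, which the sketch below (ours, not from the cited works) verifies numerically for a generic covariance matrix.

import numpy as np

# Sketch (ours): check that N_p(mu, Sigma) is stationary for the chain
# (6.8), X' = A X + B v + C eps, with
#   A = -(D+L)^{-1} L^T,  B = (D+L)^{-1},  C = (D+L)^{-1} D^{1/2},
# where D, L are the diagonal and strictly lower parts of Sigma^{-1}
# and v = Sigma^{-1} mu.  Stationarity is equivalent to
#   A Sigma A^T + C C^T = Sigma   and   (I - A) mu = B v.
rng = np.random.default_rng(3)

p = 4
G = rng.normal(size=(p, p))
Sigma = G @ G.T + p * np.eye(p)  # a generic covariance matrix
mu = rng.normal(size=p)

Q = np.linalg.inv(Sigma)         # precision matrix
D = np.diag(np.diag(Q))
L = np.tril(Q, k=-1)             # strictly lower triangular part
M = D + L

A = -np.linalg.solve(M, L.T)
C = np.linalg.solve(M, np.sqrt(D))
Bv = np.linalg.solve(M, Q @ mu)  # B v with v = Sigma^{-1} mu

print("||A Sigma A^T + C C^T - Sigma|| =",
      np.linalg.norm(A @ Sigma @ A.T + C @ C.T - Sigma))
print("||(I - A) mu - B v|| =",
      np.linalg.norm((np.eye(p) - A) @ mu - Bv))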

6.4 Conclusion

APPENDIX

$E_\theta[p_k(X)]$ is a polynomial of degree $k$ in $\theta$. Since it is orthogonal to all polynomials of degree less than $k$, we must have $E_\theta[p_k(X)]=\eta_kq_k(\theta)$. □

The second lemma is dual to the first.

Lemma A2. $E_x[q_k(\theta)]=\mu_kp_k(x)$, $0\le k<c$. If $c<\infty$, $E_x(q_k(\theta))=0$ for $k\ge c$.

Proof. The first part is proved as per Lemma A1. If $c<\infty$ and $k\ge c$, by the same argument we have, for $0\le j<c$, $E[p_j(X)E_X[q_k(\theta)]]=0$. But $\{p_j\}_{0\le j<c}$ span $L^2(m)$, so $E_x[q_k(\theta)]=0$ for all $x$. □

Suppose first that $c<\infty$. For $k\ge c$, Lemma A2 shows $E_xq_k(\theta)=0$ for all $x$. Thus $Kq_k(x,\theta)=\frac12q_k(\theta)$. Further, consider
$$\operatorname{span}\Bigl\{p_k(x)\pm\sqrt{\eta_k/\mu_k}\,q_k(\theta)\Bigr\},\qquad 0\le k<c.$$

REFERENCES

[3] Anderson, W. (1991). Continuous-Time Markov Chains: An Applications-Oriented Approach. Springer, New York. MR1118840
[4] Athreya, K., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method. Ann. Statist. 24 89–100. MR1389881
[5] Baik, J., Kriecherbauer, T., McLaughlin, K. T.-R. and Miller, P. D. (2007). Discrete Orthogonal Polynomials: Asymptotics and Applications. Princeton Univ. Press, Princeton, NJ.

[19] Consonni, G. and Veronese, P. (1992). Conjugate priors for exponential families having quadratic variance functions. J. Amer. Statist. Assoc. 87 1123–1127. MR1209570
[20] Cooper, R., Hoare, M. and Rahman, M. (1977). Stochastic processes and special functions: On the probabilistic origin of some positive kernels associated with classical orthogonal polynomials. J. Math. Anal. Appl. 61 262–291. MR0486694
[21] Dauxois, J. and Pousse, A. (1975). Une extension de l'analyse canonique. Quelques applications. Ann. Inst. H. Poincaré Sect. B (N.S.) 11 355–379. MR0408118
[22] Deutsch, F. (2001). Best Approximation in Inner Product Spaces. Springer, New York. MR1823556
[23] Diaconis, P. and Freedman, D. (1999). Iterated random functions. SIAM Rev. 41 45–76. MR1669737
[24] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Stochastic alternating projections. Preprint, Dept. Statistics, Stanford Univ.
[25] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Gibbs sampling, exponential families and coupling. Preprint, Dept. Statistics, Stanford Univ.
[26] Diaconis, P. and Saloff-Coste, L. (1993). Comparison theorems for Markov chains. Ann. Appl. Probab. 3 696–730. MR1233621
[27] Diaconis, P. and Saloff-Coste, L. (2006). Separation cut-offs for birth and death chains. Ann. Appl. Probab. 16 2098–2122. MR2288715
[28] Diaconis, P. and Stanton, D. (2006). A hypergeometric walk. Preprint, Dept. Statistics, Stanford Univ.
[29] Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7 269–281. MR0520238
[30] Diaconis, P. and Ylvisaker, D. (1985). Quantifying prior opinion. In Bayesian Statistics 2 (J. Bernardo et al., eds.) 133–156. North-Holland, Amsterdam. MR0862488
[31] Donoho, D. and Johnstone, I. (1989). Projection-based approximation and a duality with kernel methods. Ann. Statist. 17 58–106. MR0981438
[32] Dyer, M., Goldberg, L., Jerrum, M. and Martin, R. (2005). Markov chain comparison. Probab. Surv. 3 89–111. MR2216963
[33] Eaton, M. L. (1992). A statistical diptych: Admissible inferences—recurrence of symmetric Markov chains. Ann. Statist. 20 1147–1179. MR1186245
[34] Eaton, M. L. (1997). Admissibility in quadratically regular problems and recurrence of symmetric Markov chains: Why the connection? J. Statist. Plann. Inference 64 231–247. MR1621615
[35] Eaton, M. L. (2001). Markov chain conditions for admissibility in estimation problems with quadratic loss. In State of the Art in Probability and Statistics. A Festschrift for Willem van Zwet (M. de Gunst, C. Klaassen and A. van der Vaart, eds.) 223–243. IMS, Beachwood, OH. MR1836563
[36] Eagleson, G. (1964). Polynomial expansions of bivariate distributions. Ann. Math. Statist. 35 1208–1215. MR0168055
[37] Esch, D. (2003). The skew-t distribution: Properties and computations. Ph.D. dissertation, Dept. Statistics, Harvard Univ.
[38] Ewens, W. (2004). Mathematical Population Genetics. I. Theoretical Introduction, 2nd ed. Springer, New York. MR2026891
[39] Feinsilver, P. (1986). Some classes of orthogonal polynomials associated with martingales. Proc. Amer. Math. Soc. 98 298–302. MR0854037
[40] Feinsilver, P. (1991). Orthogonal polynomials and coherent states. In Symmetries in Science V 159–172. Plenum Press, New York. MR1143591
[41] Feinsilver, P. and Schott, R. (1993). Algebraic Structures and Operator Calculus. I. Representations and Probability Theory. Kluwer Academic Press, Dordrecht. MR1227095
[42] Feller, W. (1951). Diffusion processes in genetics. Proc. Second Berkeley Symp. Math. Statist. Probab. 227–246. Univ. California Press, Berkeley. MR0046022
[43] Feller, W. (1968). An Introduction to Probability Theory and Its Applications. I, 3rd ed. Wiley, New York. MR0228020
[44] Feller, W. (1971). An Introduction to Probability Theory and Its Applications. II, 2nd ed. Wiley, New York.
[45] Gelfand, A. E. and Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85 398–409. MR1141740
[46] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intelligence 6 721–741.
[47] Gilks, W., Richardson, S. and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London. MR1397966
[48] Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach. Chapman and Hall, Boca Raton, FL.
[49] Glauber, R. (1963). Time dependent statistics of the Ising model. J. Math. Phys. 4 294–307. MR0148410
[50] Goodman, J. and Sokal, A. (1989). Multigrid Monte Carlo method. Conceptual foundations. Phys. Rev. D 40 2035–2071.
[51] Griffiths, R. C. (1971). Orthogonal polynomials on the multinomial distribution. Austral. J. Statist. 13 27–35. MR0275548
[52] Griffiths, R. C. (1978). On a bivariate triangular distribution. Austral. J. Statist. 20 183–185. MR0548185
[53] Griffiths, R. C. (2006). n-kernel orthogonal polynomials on the Dirichlet, Dirichlet-multinomial, Ewens' sampling distribution and Poisson–Dirichlet processes. Lecture notes, available at http://www.stats.ox.ac.uk/~griff/.
[54] Gross, L. (1979). Decay of correlations in classical lattice models at high temperatures. Comm. Math. Phys. 68 9–27. MR0539733

[55] Gutiérrez-Peña, E. and Smith, A. (1997). Exponential and Bayesian conjugate families: Review and extensions. Test 6 1–90. MR1466433
[56] Harkness, W. and Harkness, M. (1968). Generalized hyperbolic secant distributions. J. Amer. Statist. Assoc. 63 329–337. MR0234503
[57] Hassairi, A. and Zarai, M. (2004). Characterization of the cubic exponential families by orthogonality of polynomials. Ann. Statist. 32 2463–2476. MR2078547
[58] Hoare, M. and Rahman, M. (1979). Distributed processes in discrete systems. Physica 97A 1–41.
[59] Hoare, M. and Rahman, M. (1983). Cumulative Bernoulli trials and Krawtchouk processes. Stochastic Process. Appl. 16 113–139. MR0724060
[60] Hoare, M. and Rahman, M. (2007). A probabilistic origin for a new class of bivariate polynomials. Preprint, School of Math. and Stat., Carleton Univ., Ottawa.
[61] Hobert, J. P. and Robert, C. P. (1999). Eaton's Markov chain, its conjugate partner and P-admissibility. Ann. Statist. 27 361–373. MR1701115
[62] Ismail, M. (1977). Connection relations and bilinear formulas for the classical orthogonal polynomials. J. Math. Anal. Appl. 57 487–496. MR0430361
[63] Ismail, M. (2005). Classical and Quantum Orthogonal Polynomials in One Variable. Cambridge Univ. Press. MR2191786
[64] Jones, G. and Hobert, J. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statist. Sci. 16 312–334. MR1888447
[65] Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. Ann. Statist. 32 784–817. MR2060178
[66] Jørgensen, B. (1997). The Theory of Dispersion Models. Chapman and Hall, London. MR1462891
[67] Karlin, S. and McGregor, J. (1961). The Hahn polynomials, formulas and an application. Scripta Math. 26 33–46. MR0138806
[68] Koekoek, R. and Swarttouw, R. (1998). The Askey-scheme of hypergeometric orthogonal polynomials and its q-analog. Available at http://math.nist.gov/opsf/projects/koekoek.html.
[69] Koudou, A. and Pommeret, D. (2000). A construction of Lancaster probabilities with margins in the multidimensional Meixner class. Aust. N. Z. J. Stat. 42 59–66. MR1747462
[70] Koudou, A. (1996). Probabilités de Lancaster. Exp. Math. 14 247–275. MR1409004
[71] Koudou, A. (1998). Lancaster bivariate probability distributions with Poisson, negative binomial and gamma margins. Test 7 95–110. MR1650839
[72] Lancaster, H. O. (1969). The Chi-Squared Distribution. Wiley, New York. MR0253452
[73] Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York. MR2135927
[74] Letac, G. (1992). Lectures on Natural Exponential Families and Their Variance Functions. Monografías de Matemática 50. IMPA, Rio de Janeiro. MR1182991
[75] Letac, G. (2002). Donkey walk and Dirichlet distributions. Statist. Probab. Lett. 57 17–22. MR1911808
[76] Letac, G. and Mora, M. (1990). Natural real exponential families with cubic variance functions. Ann. Statist. 18 1–37. MR1041384
[77] Liu, J., Wong, W. and Kong, A. (1995). Covariance structure and convergence rates of the Gibbs sampler with various scans. J. Roy. Statist. Soc. Ser. B 57 157–169. MR1325382
[78] Liu, J. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342
[79] Malouche, D. (1998). Natural exponential families related to Pick functions. Test 7 391–412. MR1667006
[80] McKenzie, E. (1988). Some ARMA models for dependent sequences of Poisson counts. Adv. in Appl. Probab. 20 822–835. MR0968000
[81] Marchev, D. and Hobert, J. P. (2004). Geometric ergodicity of van Dyk and Meng's algorithm for the multivariate Student's t model. J. Amer. Statist. Assoc. 99 228–238. MR2054301
[82] Meng, X. L. and Zaslavsky, A. (2002). Single observation unbiased priors. Ann. Statist. 30 1345–1375. MR1936322
[83] Meixner, J. (1934). Orthogonale Polynomsysteme mit einer besonderen Gestalt der erzeugenden Funktion. J. London Math. Soc. 9 6–13.
[84] Meyer, P. A. (1966). Probability and Potentials. Blaisdell, Waltham, MA. MR0205288
[85] Moreno, E. and Girón, F. (1998). Estimating with incomplete count data: A Bayesian approach. J. Statist. Plann. Inference 66 147–159. MR1617002
[86] Morris, C. (1982). Natural exponential families with quadratic variance functions. Ann. Statist. 10 65–80. MR0642719
[87] Morris, C. (1983). Natural exponential families with quadratic variance functions: Statistical theory. Ann. Statist. 11 515–589. MR0696064
[88] Newman, M. and Barkema, G. (1999). Monte Carlo Methods in Statistical Physics. Oxford Univ. Press. MR1691513
[89] Pitt, M. and Walker, S. (2006). Extended constructions of stationary autoregressive processes. Statist. Probab. Lett. 76 1219–1224. MR2269348
[90] Pitt, M. and Walker, S. (2005). Constructing stationary time series models using auxiliary variables with applications. J. Amer. Statist. Assoc. 100 554–564. MR2160559
[91] Pitt, M., Chatfield, C. and Walker, S. (2002). Constructing first order stationary autoregressive models via latent processes. Scand. J. Statist. 29 657–663. MR1988417
[92] Pommeret, D. (1996). Natural exponential families and Lie algebras. Exp. Math. 14 353–381. MR1418028

[93] Pommeret, D. (2001). Posterior variance for quadratic natural exponential families. Statist. Probab. Lett. 53 357–362. MR1856159
[94] Ringrose, J. (1971). Compact Non-Self-Adjoint Operators. Van Nostrand, New York.
[95] Rosenthal, J. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90 558–566. MR1340509
[96] Rosenthal, J. (1996). Analysis of the Gibbs sampler for a model related to James–Stein estimators. Statist. Comput. 6 269–275.
[97] Rosenthal, J. S. (2002). Quantitative convergence rates of Markov chains: A simple account. Electron. Comm. Probab. 7 123–128. MR1917546
[98] Roy, V. and Hobert, J. P. (2007). Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression. J. Roy. Statist. Soc. Ser. B 69 607–623.
[99] Saloff-Coste, L. (2004). Total variation lower bounds for finite Markov chains: Wilson's lemma. In Random Walks and Geometry (V. Kaimanovich and W. Woess, eds.) 515–532. de Gruyter, Berlin. MR2087800
[100] Szegő, G. (1959). Orthogonal Polynomials, rev. ed. Amer. Math. Soc., New York. MR0106295
[101] Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82 528–550. MR0898357
[102] Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22 1701–1762. MR1329166
[103] Turcin, V. (1971). On the computation of multidimensional integrals by the Monte Carlo method. Theory Probab. Appl. 16 720–724. MR0292259
[104] Van Doorn, E. A. (2003). Birth–death processes and associated polynomials. In Proceedings of the Sixth International Symposium on Orthogonal Polynomials, Special Functions and their Applications (Rome, 2001). J. Comput. Appl. Math. 153 497–506. MR1985718
[105] Walter, G. and Hamedani, G. (1991). Bayes empirical Bayes estimation for natural exponential families with quadratic variance functions. Ann. Statist. 19 1191–1224. MR1126321
[106] Wilson, D. (2004). Mixing times of Lozenge tiling and card shuffling Markov chains. Ann. Appl. Probab. 14 274–325. MR2023023
[107] Yuen, W. K. (2000). Applications of geometric bounds to the convergence rate of Markov chains in Rⁿ. Stochastic Process. Appl. 87 1–23. MR1751162