arXiv:0808.3852v1 [stat.ME] 28 Aug 2008

Statistical Science
2008, Vol. 23, No. 2, 151–178
DOI: 10.1214/08-STS252
© Institute of Mathematical Statistics, 2008

GIBBS SAMPLING, EXPONENTIAL FAMILIES AND ORTHOGONAL POLYNOMIALS

Persi Diaconis, Kshitij Khare and Laurent Saloff-Coste

Persi Diaconis is Professor, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, California 94305-4065, USA (e-mail: [email protected]). Kshitij Khare is Graduate Student, Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford University, Stanford, California 94305-4065, USA (e-mail: [email protected]). Laurent Saloff-Coste is Professor, Department of Mathematics, Malott Hall, Cornell University, Ithaca, New York 14853-4201, USA (e-mail: [email protected]).
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2008, Vol. 23, No. 2, 151–178. This reprint differs from the original in pagination and typographic detail.

Discussed in 10.1214/08-STS252A, 10.1214/08-STS252B, 10.1214/08-STS252C and 10.1214/08-STS252D; rejoinder at 10.1214/08-STS252REJ.

Abstract. We give families of examples where sharp rates of convergence to stationarity of the widely used Gibbs sampler are available. The examples involve standard exponential families and their conjugate priors. In each case, the transition operator is explicitly diagonalizable with classical orthogonal polynomials as eigenfunctions.

Key words and phrases: Gibbs sampler, running time analyses, exponential families, conjugate priors, location families, orthogonal polynomials, singular value decomposition.

1. INTRODUCTION

The Gibbs sampler, also known as Glauber dynamics or the heat-bath algorithm, is a mainstay of scientific computing. It provides a way to draw samples from a multivariate probability density f(x₁, x₂, ..., x_p), perhaps only known up to a normalizing constant, by a sequence of one-dimensional sampling problems. From (X₁, ..., X_p) proceed to (X₁′, X₂, ..., X_p), then (X₁′, X₂′, X₃, ..., X_p), ..., (X₁′, X₂′, ..., X_p′), where at the ith stage, the ith coordinate is sampled from f with the other coordinates fixed. This is one pass. Continuing gives a Markov chain X, X′, X″, ..., which has f as stationary density under mild conditions.
The algorithm was introduced in 1963 by Glauber [49] to do simulations for Ising models, and independently by Turcin [103]. It is still a standard tool of statistical physics, both for practical simulation (e.g., [88]) and as a natural dynamics (e.g., [10]). The basic Dobrushin uniqueness theorem showing existence of Gibbs measures was proved based on this dynamics (e.g., [54]). It was introduced as a base for image analysis by Geman and Geman [46]. Statisticians began to employ the method for routine Bayesian computations following the works of Tanner and Wong [101] and Gelfand and Smith [45]. Textbook accounts, with many examples from biology and the social sciences, along with extensive references, are in [47, 48, 78].

In any practical application of the Gibbs sampler, it is important to know how long to run the Markov chain until it has forgotten the original starting state. Indeed, Markov chains are run on enormous state spaces (think of shuffling cards or image analysis). Do they take an enormous number of steps to reach stationarity? One way of dealing with these problems is to throw away an initial segment of the run. This leaves the question of how much to throw away. We call these questions running time analyses below.

Despite heroic efforts by the applied probability community, useful running time analyses for Gibbs sampler chains are still a major research effort. An overview of available tools and results is given at the end of this introduction. The main purpose of the present paper is to give families of two-component examples where a sharp analysis is available. These may be used to compare and benchmark more robust techniques such as the Harris recurrence techniques in [64], or the spectral techniques in [2] and [107]. They may also serve as a base for the comparison techniques in [2, 26, 32]. Here is an example of our results.
The following example was studied as a simple expository example in [16] and [78], page 132. Let

  f_θ(x) = \binom{n}{x} θ^x (1 − θ)^{n−x},  π(dθ) = uniform on (0, 1),  x ∈ {0, 1, 2, ..., n}.

These define the bivariate Beta/Binomial density (uniform prior)

  f(x, θ) = \binom{n}{x} θ^x (1 − θ)^{n−x}

with marginal density

  m(x) = ∫₀¹ f(x, θ) dθ = 1/(n + 1),  x ∈ {0, 1, 2, ..., n}.

The posterior density [with respect to the prior π(dθ)] is given by

  π(θ|x) = f_θ(x)/m(x) = (n + 1) \binom{n}{x} θ^x (1 − θ)^{n−x},  θ ∈ (0, 1).

The Gibbs sampler for f(x, θ) proceeds as follows:

• From x, draw θ′ from Beta(x + 1, n − x + 1).
• From θ′, draw x′ from Binomial(n, θ′).

The output is (x′, θ′). Let K̃(x, θ; x′, θ′) be the transition density for this chain. While K̃ has f(x, θ) as stationary density, the (K̃, f) pair is not reversible (see below). This blocks straightforward use of spectral methods.

The associated "x-chain" is reversible with stationary density m(x) [i.e., m(x)k(x, x′) = m(x′)k(x′, x)]. For the Beta/Binomial example,

(1.1)  k(x, x′) = ((n + 1)/(2n + 1)) · \binom{n}{x} \binom{n}{x′} / \binom{2n}{x + x′},  0 ≤ x, x′ ≤ n.

Fig. 1. Simulation of the Beta/Binomial "x-chain" with n = 100.

A simulation of the Beta/Binomial "x-chain" (1.1) with n = 100 is given in Figure 1. The initial position is 100 and we track the position of the Markov chain for the first 200 steps.

The proposition below gives an explicit diagonalization of the x-chain and sharp bounds for the bivariate chain [K̃^ℓ_{n,θ} denotes the density of the distribution of the bivariate chain after ℓ steps starting from (n, θ)]. It shows that order n steps are necessary and sufficient for convergence. That is, for sampling from the Markov chain K̃ to simulate from the probability distribution f, a small (integer) multiple of n steps suffices, while n/2 steps do not. The following bounds make this quantitative. The proof is given in Section 4.
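The two conditional draws above are easy to simulate directly. The following minimal sketch (our own illustrative code, not from the paper; all names are ours) runs the systematic-scan sampler for the Beta/Binomial example and tracks the x-coordinate, as in Figure 1:

```python
# Illustrative simulation of the Beta/Binomial Gibbs sampler:
# theta' | x ~ Beta(x + 1, n - x + 1), then x' | theta' ~ Binomial(n, theta').
# Only the standard library is used; the Binomial draw is a sum of Bernoullis.
import random

def gibbs_beta_binomial(n, x0, steps, seed=0):
    """Return the trajectory of the x-coordinate, starting from x0."""
    rng = random.Random(seed)
    x, xs = x0, [x0]
    for _ in range(steps):
        theta = rng.betavariate(x + 1, n - x + 1)        # theta' | x
        x = sum(rng.random() < theta for _ in range(n))  # x' | theta'
        xs.append(x)
    return xs

traj = gibbs_beta_binomial(n=100, x0=100, steps=200)
```

Starting at x = 100 (the extreme state used in Figure 1), the trajectory moves toward the uniform stationary distribution on {0, ..., 100} over a few hundred steps.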
Liu et al. [77] observed that the "x-chain" with kernel

  k(x, x′) = ∫₀¹ f_θ(x′) π(θ|x) dθ = ∫₀¹ f_θ(x) f_θ(x′) / m(x) dθ

is reversible with stationary density m(x). For example, when n = 100, after 200 steps the total variation distance (see below) to stationarity is less than 0.0192, while after 50 steps the total variation distance is greater than 0.1858, so the chain is far from equilibrium.

We simulate 3000 independent replicates of the Beta/Binomial "x-chain" (1.1) with n = 100, starting at the initial value 100. We provide histograms of the position of the "x-chain" after 50 steps and
Fig. 2. Histograms of the observations obtained from simulating 3000 independent replicates of the Beta/Binomial “x-chain” after 50 steps and 200 steps.
200 steps in Figure 2. Under the stationary distribution (which is uniform on {0, 1, ..., 100}), one would expect roughly 150 observations in each block of the histogram. As expected, these histograms show that the empirical distribution of the position after 200 steps is close to the stationary distribution, while the empirical distribution of the position after 50 steps is quite different.

If f, g are probability densities with respect to a σ-finite measure µ, then the total variation distance between f and g is defined as

(1.2)  ‖f − g‖_TV = (1/2) ∫ |f(x) − g(x)| µ(dx).

Proposition 1.1. For the Beta/Binomial example with uniform prior, we have:

(a) The chain (1.1) has eigenvalues

  β₀ = 1,  β_j = [n(n − 1) ··· (n − j + 1)] / [(n + 2)(n + 3) ··· (n + j + 1)],  1 ≤ j ≤ n.

In particular, β₁ = 1 − 2/(n + 2). The eigenfunctions are the discrete Chebyshev polynomials [orthogonal polynomials for m(x) = 1/(n + 1) on {0, ..., n}].

(b) For the bivariate chain K̃, for all θ, n and ℓ,

  (1/2) β₁^ℓ ≤ ‖K̃^ℓ_{n,θ} − f‖_TV ≤ β₁^{ℓ−1/2} / (1 − β₁^{2ℓ−1})^{1/2}.

The calculations work because the operator with density k(x, x′) takes polynomials to polynomials. Our main results give classes of examples with the same explicit behavior. These include:

• f_θ(x) is one of the exponential families singled out by Morris [86, 87] (binomial, Poisson, negative binomial, normal, gamma, hyperbolic) with π(θ) the conjugate prior.
• f_θ(x) = g(x − θ) is a location family with π(θ) conjugate and g belongs to one of the six exponential families above.

Section 2 gives background. In Section 2.1 the Gibbs sampler is set up more carefully in both systematic and random scan versions. Relevant Markov chain tools are collected in Section 2.2. Exponential families and conjugate priors are reviewed in Section 2.3. The six families are described more carefully in Section 2.4, which calculates needed moments. A brief overview of orthogonal polynomials is in Section 2.5.

Section 3 is the heart of the paper. It breaks the operator with kernel k(x, x′) into two pieces: T : L²(m) → L²(π) defined by
  Tg(θ) = ∫ f_θ(x) g(x) m(dx)

and its adjoint T*. Then k is the kernel of T*T. Our analysis rests on a singular value decomposition of T. In our examples, T takes orthogonal polynomials for m(x) into orthogonal polynomials for π(θ). This leads to explicit computations and allows us to treat the random scan, x-chain and θ-chain on an equal footing.

The x-chains and θ-chains corresponding to three of the six classical exponential families are treated in Section 4. There are some surprises; while order n steps are required for the Beta/Binomial example above, for the parallel Poisson/Gamma example, log n + c steps are necessary and sufficient. The six location chains are treated in Section 5, where some standard queuing models emerge (e.g., the M/M/∞ queue). The final section points to other examples with polynomial eigenfunctions and other methods for studying the present examples. Our examples are just illustrative. It is easy to sample from any of the families f(x, θ) directly. Further, we do not see how to carry our techniques over to higher component problems. We further point out that in routine Bayesian use of the Gibbs sampler, the distribution we wish to sample from is typically a posterior distribution of the parameters. Nevertheless, our two-component problems are easy-to-understand standard statistical examples.

The Markov chains studied in this paper generate stochastic processes which have the marginal as stationary distribution. These chains have been used in a variety of applied modeling contexts in [80] and [89, 90, 91]. The present paper presents some new tools for these models. A related family of Markov chains and analyses have been introduced by Eaton [33, 34, 35] to study admissibility of formal Bayes estimators. The process is a two-component Gibbs sampler, as above, with π an improper prior having an almost surely proper posterior. Under regularity assumptions, Eaton shows that recurrence of the θ-chain implies the admissibility of certain formal Bayes estimators corresponding to π.
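Returning to the Beta/Binomial example, Proposition 1.1(a) can be checked numerically for small n by writing the kernel (1.1) as an (n+1) × (n+1) matrix; since m is uniform, the matrix is symmetric and its spectrum can be compared with the product formula for the β_j. An illustrative sketch (our own code; numpy assumed available):

```python
# Numerical check of Proposition 1.1(a) for the Beta/Binomial x-chain.
# k(x, x') = (n+1)/(2n+1) * C(n,x) C(n,x') / C(2n, x+x')  -- equation (1.1).
from math import comb
import numpy as np

n = 8
K = np.array([[(n + 1) / (2 * n + 1) * comb(n, x) * comb(n, y) / comb(2 * n, x + y)
               for y in range(n + 1)] for x in range(n + 1)])
assert np.allclose(K.sum(axis=1), 1.0)   # each row is a probability density

def beta_j(j, n):
    """beta_j = n(n-1)...(n-j+1) / ((n+2)(n+3)...(n+j+1)); beta_0 = 1."""
    val = 1.0
    for i in range(j):
        val *= (n - i) / (n + 2 + i)
    return val

spectrum = np.sort(np.linalg.eigvalsh(K))[::-1]        # K is symmetric here
predicted = np.sort([beta_j(j, n) for j in range(n + 1)])[::-1]
assert np.allclose(spectrum, predicted)
assert abs(beta_j(1, n) - (1 - 2 / (n + 2))) < 1e-12   # beta_1 = 1 - 2/(n+2)
```

The eigenvectors returned by the same decomposition are (up to scaling) the discrete Chebyshev polynomials mentioned in the proposition.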
Hobert and Robert [61] have developed this research, introducing the x-chain as a useful tool. Spectral techniques have proved to be a useful adjunct to Foster-type criteria for studying recurrence. We hope to develop our theory in these directions.

The analyses carried out in this paper hinge critically on the existence of all marginal and conditional moments. These moments need not exist. Indeed, consider the geometric density f_θ(x) = θ^x (1 − θ), 0 ≤ x < ∞, and put a Beta(α, β) prior on θ, with α ≥ 1 being an integer. The marginal distribution of x admits only α − 1 moments. For α > 1, the available moments are put to good use in [25].

Basic convergence properties of the Gibbs sampler can be found in [4, 102]. Explicit rates of convergence appear in [95, 96]. These lean on Harris recurrence and require a drift condition of the type E(V(X₁) | X₀ = x) ≤ aV(x) + b for all x. Also required is a minorization condition of the form k(x, x′) ≥ εq(x′) for ε > 0, some probability density q, and all x with V(x) ≤ d. Here d is fixed with d ≥ b/(1 + a). Rosenthal [95] then gives explicit upper bounds and shows these are sometimes practically relevant for natural statistical examples. His paper [97] is a nice expository account. Finding useful V and q is currently a matter of art. For example,
a group of graduate students tried to use these techniques in the Beta/Binomial example treated above and found it difficult to make choices giving useful results. This led to the present paper.

A marvelous expository account of this set of techniques with many examples and an extensive literature review is given by Jones and Hobert in [64]. In their main example an explicit eigenfunction was available for V; our Gamma/Gamma examples below generalize this. Further, "practically relevant" examples with useful analyses of Markov chains for stationary distributions which are difficult to sample from directly are in [65, 81, 98]. Some sharpenings of the Harris recurrence techniques are in [9], which also makes useful connections with classical renewal theory.

2. BACKGROUND

This section gives needed background. The two-component Gibbs sampler is defined more carefully in Section 2.1. Bounds on convergence using eigenvalues are given in Section 2.2. Exponential families and conjugate priors are reviewed in Section 2.3. The six families with variance a quadratic function of the mean are treated in Section 2.4. Finally, a brief review of orthogonal polynomials is in Section 2.5.

2.1 Two-Component Gibbs Samplers

Let (X, F) be a measurable space equipped with a σ-finite measure µ. Let (Θ, G) be a measurable space equipped with a probability measure π(dθ). In many classical situations the prior is given by a density g(θ) with respect to the Lebesgue measure dθ, and so π(dθ) = g(θ) dθ. However, we also consider examples where the parameter θ is discrete, which cannot be described in the fashion mentioned above. Thus, we work with general prior probabilities π(dθ) throughout. Let {f_θ(x)}_{θ ∈ Θ} be a family of probability densities with respect to µ. These define a probability measure on X × Θ via

  P(A × B) = ∫_B ∫_A f_θ(x) µ(dx) π(dθ),  A ∈ F, B ∈ G.

The marginal density on X is
  m(x) = ∫_Θ f_θ(x) π(dθ),

so ∫_X m(x) µ(dx) = 1. The "x-chain" [from x, draw θ from π(θ|x) and then x′ from f_θ(x′)] has transition density

(2.1)  k(x, x′) = ∫_Θ f_θ(x) f_θ(x′) / m(x) π(dθ).

Note that ∫ k(x, x′) µ(dx′) = 1, so that k(x, x′) is a probability density with respect to µ. Note further that m(x)k(x, x′) = m(x′)k(x′, x), so that the x-chain has m(dx) as a stationary distribution. The "θ-chain" [from θ, draw x from f_θ(x) and then θ′ from π(θ′|x)] has transition density

(2.2)  k(θ, θ′) = ∫_X f_θ(x) π(θ′|x) µ(dx) = ∫_X f_θ(x) f_θ′(x) / m(x) µ(dx).

Note that ∫ k(θ, θ′) π(dθ′) = 1 and that k(θ, θ′) has π(dθ) as reversing measure. We diagonalize random scan chains in Section 3.

2.2 Bounds on Markov Chains

2.2.1 General results. We briefly recall well-known results that will be applied to either our two-component Gibbs sampler chains or the x- and θ-chains. Suppose we are given a Markov chain described by its kernel K(ξ, ξ′) with respect to a measure µ̃(dξ′) [e.g., ξ = (x, θ), µ̃(dξ) = µ(dx)π(dθ) for the two-component sampler; ξ = θ, µ̃(dθ) = π(dθ) for the θ-chain, etc.]. Suppose further that the chain has stationary measure m(dξ) = m(ξ)µ̃(dξ), and write

  K̂(ξ, ξ′) = K(ξ, ξ′)/m(ξ′),  K̂^ℓ_ξ(ξ′) = K̂^ℓ(ξ, ξ′) = K^ℓ(ξ, ξ′)/m(ξ′)

for the kernel and iterated kernel of the chain with respect to the stationary measure m(dξ). We define the chi-square distance between the distribution of the chain started at ξ after ℓ steps and its stationary measure by

  χ²_ξ(ℓ) = ∫ |K̂^ℓ_ξ(ξ′) − 1|² m(dξ′) = ∫ |K^ℓ(ξ, ξ′) − m(ξ′)|² / m(ξ′) µ̃(dξ′).

This quantity always yields an upper bound on the total variation distance

  ‖K^ℓ_ξ − m‖_TV = (1/2) ∫ |K̂^ℓ_ξ(ξ′) − 1| m(dξ′) = (1/2) ∫ |K^ℓ(ξ, ξ′) − m(ξ′)| µ̃(dξ′),

namely,

(2.4)  4 ‖K^ℓ_ξ − m‖²_TV ≤ χ²_ξ(ℓ).

Our analysis will be based on eigenvalue decompositions. Let us first assume that we are given a function φ such that

  Kφ(ξ) = ∫ K(ξ, ξ′) φ(ξ′) µ̃(dξ′) = βφ(ξ)  and  m(φ) = ∫ φ(ξ) m(ξ) µ̃(dξ) = 0

for some (complex number) β. In words, φ is a generalized eigenfunction with eigenvalue β. We say "generalized" here because we have not assumed that φ belongs to a specific L² space [we only assume that we can compute Kφ and m(φ)]. The second condition [orthogonality to constants in L²(m)] will be automatically satisfied when |β| < 1. Such an eigenfunction yields a simple lower bound on the convergence of the chain to its stationary measure.

Lemma 2.1. Referring to the notation above, assume that φ ∈ L²(m(dξ)) and ∫ |φ|² dm = 1. Then

  χ²_ξ(ℓ) ≥ |φ(ξ)|² |β|^{2ℓ}.

Moreover, if φ is a bounded function, then

  ‖K^ℓ_ξ − m‖_TV ≥ |φ(ξ)| |β|^ℓ / (2 ‖φ‖_∞).

Proof. This follows from the well-known results

(2.5)  χ²_ξ(ℓ) = sup_{‖g‖_{2,m} ≤ 1} {|K^ℓ_ξ(g) − m(g)|²}

and

(2.6)  ‖K^ℓ_ξ − m‖_TV = (1/2) sup_{‖g‖_∞ ≤ 1} {|K^ℓ_ξ(g) − m(g)|}.

Here, K^ℓ_ξ(g) denotes the expectation of g under the density K^ℓ(ξ, ·). For chi-square, use g = φ as a test function. For total variation, use g = φ/‖φ‖_∞ as a test function.
More sophisticated lower bounds on total variation are based on the second moment method (e.g., [99, 106]).

To obtain upper bounds on the chi-square distance, we need much stronger hypotheses. Namely, assume that K is a self-adjoint operator on L²(m) and that L²(m) admits an orthonormal basis of real eigenfunctions ϕ_i with real eigenvalues β_i ≥ 0, β₀ = 1, ϕ₀ ≡ 1, β_i ↓ 0, so that

  ∫ K̂(ξ, ξ′) ϕ_i(ξ′) m(dξ′) = β_i ϕ_i(ξ).

Assume further that K acting on L²(m) is Hilbert–Schmidt (i.e., Σ |β_i|² < ∞). Then we have

  K̂^ℓ(ξ, ξ′) = Σ_i β_i^ℓ ϕ_i(ξ) ϕ_i(ξ′)  [convergence in L²(m × m)]

and

(2.7)  χ²_ξ(ℓ) = Σ_{i>0} β_i^{2ℓ} ϕ_i²(ξ).

Useful references for this part of classical functional analysis are [1, 94].

2.2.2 Application to the two-component Gibbs sampler. All of the bounds in this paper are derived via the following route: bound L¹ by L² and use the explicit knowledge of eigenvalues and eigenfunctions to bound the sum in (2.7). This, however, does not apply directly to the two-component Gibbs sampler K (or K̃) because these chains are not reversible with respect to their stationary measure.
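For the Beta/Binomial x-chain everything is finite, so the spectral identity (2.7) and the bound (2.4) can be verified numerically. An illustrative sketch (our own code; numpy assumed): since m is uniform here, the eigenvectors of the symmetric matrix K, rescaled by √(n+1), are orthonormal in L²(m).

```python
# Check (2.7): chi^2_x(l) = sum_{i>0} beta_i^{2l} phi_i(x)^2, and the
# total variation bound (2.4), for the Beta/Binomial x-chain.
from math import comb
import numpy as np

n, ell, x0 = 20, 5, 20
K = np.array([[(n + 1) / (2 * n + 1) * comb(n, x) * comb(n, y) / comb(2 * n, x + y)
               for y in range(n + 1)] for x in range(n + 1)])
m = np.full(n + 1, 1.0 / (n + 1))            # uniform stationary density

Kl = np.linalg.matrix_power(K, ell)          # l-step transition probabilities
chi2_direct = np.sum((Kl[x0] / m - 1.0) ** 2 * m)

evals, vecs = np.linalg.eigh(K)              # K is symmetric since m is uniform
order = np.argsort(evals)[::-1]
evals, vecs = evals[order], vecs[:, order]
# phi_i = sqrt(n+1) * (i-th eigenvector) is orthonormal in L^2(m).
chi2_spectral = (n + 1) * np.sum(evals[1:] ** (2 * ell) * vecs[x0, 1:] ** 2)
assert np.allclose(chi2_direct, chi2_spectral)

tv = 0.5 * np.abs(Kl[x0] - m).sum()
assert 4 * tv ** 2 <= chi2_direct + 1e-12    # the bound (2.4)
```

For a non-uniform stationary density the same check works after symmetrizing the kernel by conjugation with diag(√m).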
Fortunately, the x-chain and the θ-chain are reversible, and their analysis yields bounds on the two-component chain thanks to the following elementary observation. The x-chain has kernel k(x, x′) with respect to the measure µ(dx). It will also be useful to have k̂(x, x′) = k(x, x′)/m(x′), the kernel with respect to the probability m(dx) = m(x)µ(dx). For ℓ ≥ 2, we let k^ℓ_x(x′) = k^ℓ(x, x′) = ∫ k(x, y) k^{ℓ−1}(y, x′) µ(dy) denote the density [w.r.t. µ(dx′)] of the distribution of the x-chain after ℓ steps, and set k̂^ℓ_x(x′) = k̂^ℓ(x, x′) = ∫ k̂(x, y) k̂^{ℓ−1}(y, x′) m(dy) [the density w.r.t. m(dx′)]. Also

  ‖g‖_{p,P} = (∫ |g(ω)|^p P(dω))^{1/p}  for p ≥ 1.

Lemma 2.2. Referring to the K, K̃ two-component chains and the x-chain introduced in Section 2.1, for any p ∈ [1, ∞], we have

  ‖(K^ℓ_{x,θ}/f) − 1‖^p_{p,P} ≤ ∫ ‖k̂^{ℓ−1}_z − 1‖^p_{p,m} f_θ(z) µ(dz) ≤ sup_z ‖k̂^{ℓ−1}_z − 1‖^p_{p,m}

and

  ‖(K̃^ℓ_{x,θ}/f) − 1‖_{p,P} ≤ ‖k̂^{ℓ−1}_x − 1‖_{p,m}.

Similarly, for the θ-chain, we have

  ‖(K̃^ℓ_{x,θ}/f) − 1‖^p_{p,P} ≤ ∫ ‖k^{ℓ−1}_θ − 1‖^p_{p,π} π(θ|x) π(dθ) ≤ sup_θ ‖k^{ℓ−1}_θ − 1‖^p_{p,π}

and

  ‖(K^ℓ_{x,θ}/f) − 1‖_{p,P} ≤ ‖k^{ℓ−1}_θ − 1‖_{p,π}.

Proof. We only prove the results involving the x-chain. The rest is similar. Recall that the bivariate chain has transition density K(x, θ; x′, θ′) = f_θ(x′) f_{θ′}(x′)/m(x′). By direct computation,

  K^ℓ(x, θ; x′, θ′) = ∫ f_θ(z) k^{ℓ−1}(z, x′) (f_{θ′}(x′)/m(x′)) µ(dz).

For the variant K̃, the similar formula reads

  K̃^ℓ(x, θ; x′, θ′) = ∫ k^{ℓ−1}(x, z) (f_{θ′}(z)/m(z)) f_{θ′}(x′) µ(dz).

These two bivariate chains have stationary density f(x, θ) = f_θ(x) with respect to the measure µ(dx)π(dθ). So, we write

  K^ℓ(x, θ; x′, θ′)/f(x′, θ′) − 1 = ∫ (k̂^{ℓ−1}(z, x′) − 1) f_θ(z) µ(dz)

and

  K̃^ℓ(x, θ; x′, θ′)/f(x′, θ′) − 1 = ∫ (k̂^{ℓ−1}(x, z) − 1) f_{θ′}(z) µ(dz).

To prove the second inequality in the lemma (the proof of the first is similar), write

  ‖(K̃^ℓ_{x,θ}/f) − 1‖^p_{p,P} = ∫∫ |∫ (k̂^{ℓ−1}(x, z) − 1) f_{θ′}(z) µ(dz)|^p f_{θ′}(x′) µ(dx′) π(dθ′)
   ≤ ∫∫∫ |k̂^{ℓ−1}(x, z) − 1|^p f_{θ′}(z) µ(dz) f_{θ′}(x′) µ(dx′) π(dθ′)
   = ∫ |k̂^{ℓ−1}(x, z) − 1|^p m(z) µ(dz) = ∫ |k̂^{ℓ−1}(x, z) − 1|^p m(dz).

This gives the desired bound.

To get lower bounds, we observe the following.

Lemma 2.3. Let g be a function of x only [abusing notation, g(x, θ) = g(x)]. Then

  K̃g(x, θ) = ∫ k(x, x′) g(x′) µ(dx′).

If instead g is a function of θ only, then

  Kg(x, θ) = ∫ k(θ, θ′) g(θ′) π(dθ′).
Proof. Assume g(x, θ) = g(x). Then

  K̃g(x, θ) = ∫∫ (f_{θ′}(x) f_{θ′}(x′)/m(x)) g(x′) µ(dx′) π(dθ′) = ∫ k(x, x′) g(x′) µ(dx′).

The other case is similar.

Lemma 2.4. Let χ²_{x,θ}(ℓ) and χ̃²_{x,θ}(ℓ) be the chi-square distances after ℓ steps for the K-chain and the K̃-chain, respectively, starting at (x, θ). Let χ²_x(ℓ), χ²_θ(ℓ) be the chi-square distances for the x-chain (starting at x) and the θ-chain (starting at θ), respectively. Then we have

  χ²_θ(ℓ) ≤ χ²_{x,θ}(ℓ) ≤ χ²_θ(ℓ − 1),
  ‖k^ℓ_θ − π‖_TV ≤ ‖K^ℓ_{x,θ} − f‖_TV ≤ ‖k^{ℓ−1}_θ − π‖_TV

and

  χ²_x(ℓ) ≤ χ̃²_{x,θ}(ℓ) ≤ χ²_x(ℓ − 1),
  ‖k^ℓ_x − m‖_TV ≤ ‖K̃^ℓ_{x,θ} − f‖_TV ≤ ‖k^{ℓ−1}_x − m‖_TV.

Proof. This is immediate from (2.5)–(2.6) and Lemma 2.3.

2.3 Exponential Families and Conjugate Priors

Three topics are covered in this section: exponential families, conjugate priors for exponential families and conjugate priors for location families.

2.3.1 Exponential families. Let µ be a σ-finite measure on the Borel sets of the real line R. Define Θ = {θ ∈ R : ∫ e^{xθ} µ(dx) < ∞}. Assume that Θ is nonempty and open. Hölder's inequality shows that Θ is an interval. For θ ∈ Θ, set

  M(θ) = log ∫ e^{xθ} µ(dx),  f_θ(x) = e^{xθ − M(θ)}.

The family of probability densities {f_θ, θ ∈ Θ} is the exponential family through µ in its "natural parametrization." Allowable differentiations yield the mean m(θ) = ∫ x f_θ(x) µ(dx) = M′(θ) and the variance σ²(θ) = M″(θ). Statisticians realized that many standard families can be put in such form so that properties can be studied in a unified way. Standard references for exponential families include [8, 11, 66, 73, 74].

Example. Let X = {0, 1, 2, 3, ...}, µ(x) = 1/x!. Then Θ = R, M(θ) = e^θ, and

  f_θ(x) = e^{xθ − e^θ}/x!,  x = 0, 1, 2, ....

This is the Poisson(λ) distribution with λ = e^θ.

2.3.2 Conjugate priors for exponential families. With notation as above, fix n₀ > 0 and x₀ in the interior of the convex hull of the support of µ. Define a prior density with respect to Lebesgue measure dθ by

  π_{n₀,x₀}(dθ) = z(n₀, x₀) e^{n₀x₀θ − n₀M(θ)} dθ,

where z(n₀, x₀) is a normalizing constant shown to be positive and finite in Diaconis and Ylvisaker [29], which contains proofs of the assertions below. The posterior is

  π(dθ|x) = π_{n₀+1, (n₀x₀+x)/(n₀+1)}(dθ).

Thus the family of conjugate priors is closed under sampling. This is sometimes used as the definition of conjugate prior. A central fact about conjugate priors is

(2.8)  E(m(θ)|x) = ax + b.

This linear expectation property characterizes conjugate priors for families where µ has infinite support.
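For the Poisson family, the linear expectation property (2.8) can be checked directly: with a Gamma prior of shape a₀ and rate b₀ (our own parametrization for this sketch, not the paper's (n₀, x₀)), the posterior given x is Gamma(a₀ + x, b₀ + 1), so E(λ|x) = (a₀ + x)/(b₀ + 1), which is linear in x. A numerical check by quadrature:

```python
# Verify E(lambda | x) = (a0 + x)/(b0 + 1) for the Poisson/Gamma pair
# by integrating the unnormalized posterior lambda^(a0+x-1) e^{-(b0+1) lambda}.
from math import exp, log

def posterior_mean(x, a0, b0, grid=20000, lam_max=60.0):
    h = lam_max / grid
    num = den = 0.0
    for i in range(1, grid + 1):
        lam = i * h
        w = exp((a0 + x - 1) * log(lam) - (b0 + 1) * lam)  # unnormalized posterior
        num += lam * w
        den += w
    return num / den

a0, b0 = 2.0, 1.5
for x in range(6):
    assert abs(posterior_mean(x, a0, b0) - (a0 + x) / (b0 + 1)) < 1e-3
```

The slope 1/(b₀ + 1) and intercept a₀/(b₀ + 1) play the roles of a and b in (2.8).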
Section 3 shows that linear expectation implies that the associated chain defined at (2.1) always has an eigenfunction of the form x − c with eigenvalue a, and c equal to the mean of the marginal distribution.

Often, an exponential family is not parametrized by the natural parameter θ, but in terms of the mean parameter m(θ). If we construct conjugate priors with respect to this parametrization, then (2.8) does not hold in general. However, for the six exponential families having quadratic variance function (discussed in Section 2.4 below), (2.8) holds even with the mean parametrization. In [19], this is shown to hold only for these six families. See [15, 55] for more on this.

Example. For the Poisson example above, the conjugate priors with respect to θ are of the form

  z(n₀, x₀) e^{n₀x₀θ − n₀e^θ} dθ.

The mean parameter is λ = e^θ. Since dθ = dλ/λ, the priors transform to

  z(n₀, x₀) λ^{n₀x₀ − 1} e^{−n₀λ} dλ.

This paper works with familiar exponential families. Many exotic families have been studied. See [12, 79]. These lead to interesting problems when studied in conjunction with the Gibbs sampler. The Poisson density parametrized by λ is given by
  f_λ(x) = e^{x log λ − λ}/x! = e^{−λ} λ^x/x!,  x = 0, 1, 2, ....

Hence, the conjugate priors with respect to λ are of the form

  z̃(n₀, x₀) λ^{n₀x₀} e^{−n₀λ} dλ.

Here, the Jacobian of the transformation θ → m(θ) blends in with the rest of the prior, so that both parametrizations lead to the usual Gamma priors.

[...] at discrete times: if this is X_n at time n, then S_{X_n} is the number served in the next time period and ε_{n+1} is the number of unserved new arrivals. The explicit diagonalization of the M/M/∞ chain, in continuous time, using Charlier polynomials appears in [3]. This same chain has yet a different interpretation: let f_p(j) = \binom{η}{j} p^j (1 − p)^{η−j}. Here 0 <
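A chain of the kind just described, binomial thinning of the current count plus independent Poisson arrivals, has a Poisson stationary distribution: if each customer is retained with probability p and ε_{n+1} ~ Poisson(µ), the stationary law is Poisson(µ/(1 − p)). A minimal simulation sketch (our own code; the values of p and µ are illustrative, not from the paper):

```python
# Simulate X_{n+1} = Binomial(X_n, p) + Poisson(mu) and compare the
# long-run average with the stationary mean mu / (1 - p).
import random
from math import exp

def sample_poisson(mu, rng):
    # Knuth's product method (adequate for small mu).
    bound, k, prod = exp(-mu), 0, rng.random()
    while prod > bound:
        k += 1
        prod *= rng.random()
    return k

def long_run_average(p, mu, steps, seed=1):
    rng = random.Random(seed)
    x, total = 0, 0
    for _ in range(steps):
        survivors = sum(rng.random() < p for _ in range(x))  # binomial thinning
        x = survivors + sample_poisson(mu, rng)              # new arrivals
        total += x
    return total / steps

avg = long_run_average(p=0.5, mu=2.0, steps=200_000)
assert abs(avg - 2.0 / (1 - 0.5)) < 0.1   # stationary mean is mu/(1-p) = 4
```

The fixed-point argument is one line: if X ~ Poisson(c), then thinning gives Poisson(cp), adding arrivals gives Poisson(cp + µ), and cp + µ = c forces c = µ/(1 − p).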