
Metadata of the chapter that will be visualized online

Book Title International Encyclopedia of Statistical Science

Book CopyRight - Year 

Copyright Holder Springer-Verlag

Title Random Permutations and Partition Models

Author
Given Name Peter
Family Name McCullagh
Role John D. MacArthur Distinguished Service Professor
Organization University of Chicago
Country USA

Random Permutations and Partition Models

Peter McCullagh
John D. MacArthur Distinguished Service Professor
University of Chicago, USA

Miodrag Lovric (ed.), International Encyclopedia of Statistical Science, DOI ./----, © Springer-Verlag Berlin Heidelberg

Set Partitions

For $n \ge 1$, a partition $B$ of the finite set $[n] = \{1, \ldots, n\}$ is
● A collection $B = \{b_1, \ldots\}$ of disjoint non-empty subsets, called blocks, whose union is $[n]$
● An equivalence relation or Boolean function $B\colon [n] \times [n] \to \{0, 1\}$ that is reflexive, symmetric and transitive
● A symmetric Boolean matrix such that $B_{ij} = 1$ if $i, j$ belong to the same block

These equivalent representations are not distinguished in the notation, so $B$ is a set of subsets, a matrix, a Boolean function, or a subset of $[n] \times [n]$, as the context demands. In practice, a partition is sometimes written in an abbreviated form, such as $B = 2|13$ for a partition of $[3]$. In this notation, the five partitions of $[3]$ are

$$123, \quad 12|3, \quad 13|2, \quad 23|1, \quad 1|2|3.$$

The blocks are unordered, so $2|13$ is the same partition as $13|2$ and $2|31$.

A partition $B$ is a sub-partition of $B^*$ if each block of $B$ is a subset of some block of $B^*$ or, equivalently, if $B_{ij} = 1$ implies $B^*_{ij} = 1$. This relationship is a partial order denoted by $B \le B^*$, which can be interpreted as $B \subset B^*$ if each partition is regarded as a subset of $[n]^2$. The partition lattice $\mathcal{E}_n$ is the set of partitions of $[n]$ with this partial order. To each pair of partitions $B, B'$ there corresponds a greatest lower bound $B \wedge B'$, which is the set intersection or Hadamard component-wise matrix product. The least upper bound $B \vee B'$ is the least element that is greater than both, the transitive completion of $B \cup B'$. The least element of $\mathcal{E}_n$ is the partition $0_n$ with $n$ singleton blocks, and the greatest element is the single-block partition denoted by $1_n$.

A permutation $\sigma\colon [n] \to [n]$ induces an action $B \mapsto B^\sigma$ by composition such that $B^\sigma(i, j) = B(\sigma(i), \sigma(j))$. In matrix notation, $B^\sigma = \sigma B \sigma^{-1}$, so the action by conjugation permutes both the rows and columns of $B$ in the same way. The block sizes are preserved and are maximally invariant under conjugation. In this way, the 15 partitions of $[4]$ may be grouped into five orbits or equivalence classes as follows:

$$1234, \quad 123|4 \;[4], \quad 12|34 \;[3], \quad 12|3|4 \;[6], \quad 1|2|3|4.$$

Thus, for example, $12|34$ is the representative element for one orbit, which also includes $13|24$ and $14|23$.

The symbol $\#$ applied to a set denotes the number of elements, so $\#B$ is the number of blocks, and $\#b$ is the size of block $b \in B$. If $\mathcal{E}_n$ is the set of equivalence relations on $[n]$, or the set of partitions of $[n]$, the first few values of $\#\mathcal{E}_n$ are 1, 2, 5, 15, 52, called Bell numbers. More generally, $\#\mathcal{E}_n$ is the $n$th moment of the unit Poisson distribution whose exponential generating function is $\exp(e^t - 1)$. In the discussion of explicit probability models on $\mathcal{E}_n$, it is helpful to use the ascending and descending factorial symbols

$$\alpha^{\uparrow r} = \alpha(\alpha + 1) \cdots (\alpha + r - 1) = \Gamma(r + \alpha)/\Gamma(\alpha)$$
$$k^{\downarrow r} = k(k - 1) \cdots (k - r + 1)$$

for integer $r \ge 0$. Note that $k^{\downarrow r} = 0$ for positive integers $r > k$. By convention $\alpha^{\uparrow 0} = 1$.

Partition Model

The term partition model refers to a probability distribution, or family of probability distributions, on the set $\mathcal{E}_n$ of partitions of $[n]$. In some cases, the probability is concentrated on the subset $\mathcal{E}_n^k \subset \mathcal{E}_n$ of partitions having $k$ or fewer blocks. A distribution on $\mathcal{E}_n$ such that $p_n(B) = p_n(\sigma B \sigma^{-1})$ for every permutation $\sigma\colon [n] \to [n]$ is said to be finitely exchangeable. Equivalently, $p_n$ is exchangeable if $p_n(B)$ depends only on the block sizes of $B$.

Historically, the most important examples are Dirichlet-multinomial random partitions generated for fixed $k$ in three steps as follows.
● First generate the random probability vector $\pi = (\pi_1, \ldots, \pi_k)$ from the Dirichlet distribution with parameter $(\theta_1, \ldots, \theta_k)$.
● Given $\pi$, the sequence $Y_1, \ldots, Y_n, \ldots$ is independent and identically distributed, each component taking values in $\{1, \ldots, k\}$ with probability $\pi$. Each sequence of length $n$ in which the value $r$ occurs $n_r \ge 0$ times has probability
$$\frac{\Gamma(\theta_.) \prod_{j=1}^{k} \theta_j^{\uparrow n_j}}{\Gamma(n + \theta_.)},$$
where $\theta_. = \sum \theta_j$.
● Now forget the labels $1, \ldots, k$ and consider only the partition $B$ generated by the sequence $Y$, i.e. $B_{ij} = 1$ if $Y_i = Y_j$. The distribution is exchangeable, but an explicit simple formula is available only for the uniform case $\theta_j = \lambda/k$, which is now assumed. The number of sequences generating the same partition $B$ is $k^{\downarrow \#B}$, and these have equal probability in the uniform case. Consequently, the induced partition has probability
$$p_{nk}(B, \lambda) = k^{\downarrow \#B} \, \frac{\Gamma(\lambda) \prod_{b \in B} (\lambda/k)^{\uparrow \#b}}{\Gamma(n + \lambda)}, \tag{1}$$
called the uniform Dirichlet-multinomial partition distribution. The factor $k^{\downarrow \#B}$ ensures that partitions having more than $k$ blocks have zero probability.

In the limit as $k \to \infty$, the uniform Dirichlet-multinomial partition becomes

$$p_n(B, \lambda) = \frac{\lambda^{\#B} \prod_{b \in B} \Gamma(\#b)}{\lambda^{\uparrow n}}. \tag{2}$$

This is the celebrated Ewens distribution, or Ewens sampling formula, which arises in population genetics as the partition generated by allele type in a population evolving according to the Fisher–Wright model by random mutation with no selective advantage of allele types (Ewens ). The preceding derivation, a version of which can be found in Chap. of Kingman (), goes back to Watterson (). The Ewens partition is the same as the partition generated by a sequence drawn according to the Blackwell–MacQueen urn scheme (Blackwell and MacQueen ).

Although the derivation makes sense only if $k$ is a positive integer, the distribution (1) is well defined for negative values $-\lambda < k < 0$. For a discussion of this and the connection with GEM distributions and Poisson-Dirichlet distributions, see Pitman (, Sect. .).

Partition Processes and Partition Structures

Deletion of element $n$ from the set $[n]$, or deletion of the last row and column from $B \in \mathcal{E}_n$, determines a map $D_n\colon \mathcal{E}_n \to \mathcal{E}_{n-1}$, a projection from the larger to the smaller lattice. These deletion maps make the sets $\{\mathcal{E}_1, \mathcal{E}_2, \ldots\}$ into a projective system

$$\cdots \; \mathcal{E}_{n+1} \xrightarrow{D_{n+1}} \mathcal{E}_n \xrightarrow{D_n} \mathcal{E}_{n-1} \; \cdots$$

A family $p = (p_1, p_2, \ldots)$ in which $p_n$ is a probability distribution on $\mathcal{E}_n$ is said to be mutually consistent, or Kolmogorov-consistent, if each $p_{n-1}$ is the marginal distribution obtained from $p_n$ under deletion of element $n$ from the set $[n]$. In other words, $p_{n-1}(A) = p_n(D_n^{-1} A)$ for $A \subset \mathcal{E}_{n-1}$. Kolmogorov consistency guarantees the existence of a random partition of the natural numbers whose finite restrictions are distributed as $p_n$. The partition is infinitely exchangeable if each $p_n$ is finitely exchangeable. Some authors, for example Kingman (), refer to $p$ as a partition structure.

An exchangeable partition process may be generated from an exchangeable sequence $Y_1, Y_2, \ldots$ by the transformation $B_{ij} = 1$ if $Y_i = Y_j$ and zero otherwise. The Dirichlet-multinomial and the Ewens processes are generated in this way. Kingman's () paintbox construction shows that every exchangeable partition process may be generated from an exchangeable sequence in this manner.

Let $B$ be an infinitely exchangeable partition, $B[n] \sim p_n$, let $B^*$ be a fixed partition in $\mathcal{E}_n$, and suppose that the event $B[n] \le B^*$ occurs. Then $B[n]$ lies in the lattice interval $[0_n, B^*]$, which means that $B[n] = B[b_1] | B[b_2] | \ldots$ is the concatenation of partitions of the blocks $b \in B^*$. For each block $b \in B^*$, the restriction $B[b]$ is distributed as $p_{\#b}$, so it is natural to ask whether, and under what conditions, the blocks of $B^*$ are partitioned independently given $B[n] \le B^*$. Conditional independence implies that

$$p_n(B \mid B[n] \le B^*) = \prod_{b \in B^*} p_{\#b}(B[b]), \tag{3}$$

which is a type of non-interference or lack-of-memory property not dissimilar to that of the exponential distribution on the real line. It is straightforward to check that the condition is satisfied by (2) but not by (1). Aldous () shows that conditional independence uniquely characterizes the Ewens family.

Chinese Restaurant Process

A partition process is a random partition $B \sim p$ of a countably infinite set $\{u_1, u_2, \ldots\}$, and the restriction $B[n]$ of $B$ to $\{u_1, \ldots, u_n\}$ is distributed as $p_n$. The conditional distribution of $B[n+1]$ given $B[n]$ is determined by the probabilities assigned to those events in $\mathcal{E}_{n+1}$ that are consistent
with $B[n]$, i.e. the events $u_{n+1} \mapsto b$ for $b \in B$ and $b = \emptyset$. For the uniform Dirichlet-multinomial model (1), these are

$$\mathrm{pr}(u_{n+1} \mapsto b \mid B[n] = B) = \begin{cases} (\#b + \lambda/k)/(n + \lambda) & b \in B \\ \lambda(1 - \#B/k)/(n + \lambda) & b = \emptyset. \end{cases} \tag{4}$$

In the limit as $k \to \infty$, we obtain

$$\mathrm{pr}(u_{n+1} \mapsto b \mid B[n] = B) = \begin{cases} \#b/(n + \lambda) & b \in B \\ \lambda/(n + \lambda) & b = \emptyset, \end{cases} \tag{5}$$

which is the conditional probability for the Ewens process.

To each partition process $p$ there corresponds a sequential description called the Chinese restaurant process, in which $B[n]$ is the arrangement of the first $n$ customers at $\#B$ tables. The placement of the next customer is determined by the conditional distribution $p_{n+1}(B[n+1] \mid B[n])$. For the Ewens process, the customer chooses a new table with probability $\lambda/(n + \lambda)$ or one of the occupied tables with probability proportional to the number of occupants. The term, due to Pitman, Dubins and Aldous, is used primarily in connection with the Ewens and Dirichlet-multinomial models.

Exchangeable Random Permutations

Beginning with the uniform distribution on permutations of $[n]$, the exponential family with canonical parameter $\theta = \log(\lambda)$ and canonical statistic $\#\sigma$ equal to the number of cycles is

$$p_n(\sigma) = \lambda^{\#\sigma} / \lambda^{\uparrow n}.$$

The Stirling number of the first kind, $S_{n,k}$, is the number of permutations of $[n]$ having exactly $k$ cycles, for which $\lambda^{\uparrow n} = \sum_{k=1}^{n} S_{n,k} \lambda^k$ is the generating function. The cycles of the permutation determine a partition of $[n]$ whose distribution is (2), and a partition of the integer $n$ whose distribution is (6). From the cumulant function

$$\log(\lambda^{\uparrow n}) = \sum_{j=0}^{n-1} \log(j + \lambda)$$

it follows that $\#\sigma = X_0 + \cdots + X_{n-1}$ is the sum of independent Bernoulli variables with parameter $E(X_j) = \lambda/(\lambda + j)$, which is evident also from the Chinese restaurant representation. For large $n$, the number of cycles is roughly Poisson with parameter $\lambda \log(n)$, implying that $\hat\lambda \simeq \#\sigma / \log(n)$ is a consistent estimate as $n \to \infty$, but practically inconsistent.

A minor modification of the Chinese restaurant process also generates a random permutation by keeping track of the cyclic arrangement of customers at tables. After $n$ customers are seated, the next customer chooses a table with probability (4) or (5), as determined by the partition process. If the table is occupied, the new arrival sits to the left of one customer selected uniformly at random from the table occupants. The random permutation thus generated is $j \mapsto \sigma(j)$ from $j$ to the left neighbour $\sigma(j)$.

Provided that the partition process is consistent and exchangeable, the distributions $p_n$ on permutations of $[n]$ are exchangeable and mutually consistent under the projection $\Pi_n \to \Pi_{n-1}$ on permutations in which element $n$ is deleted from the cyclic representation (Pitman , Sect. .). In this way, every infinitely exchangeable random partition also determines an infinitely exchangeable random permutation $\sigma\colon \mathbb{N} \to \mathbb{N}$ of the natural numbers. Distributional exchangeability in this context is not to be confused with uniformity on $\Pi_n$.

On the Number of Unseen Species

A partition of the set $[n]$ is a set of blocks, and the block sizes determine a partition of the integer $n$. For example, the partition $15|23|4$ of the set $[5]$ is associated with the integer partition $2 + 2 + 1$, one singleton and two doubletons. An integer partition $m = (m_1, \ldots, m_n)$ is a list of multiplicities, also written as $m = 1^{m_1} 2^{m_2} \cdots n^{m_n}$, such that $\sum j m_j = n$. The number of blocks, usually called the number of parts of the integer partition, is the sum of the multiplicities $m_. = \sum m_j$.

Each integer partition is a group orbit in $\mathcal{E}_n$ induced by the action of the symmetric group on the set $[n]$. The multiplicity vector $m$ contains all the information about block sizes, but there is a subtle transfer of emphasis from block sizes to the multiplicities of the parts.

By definition, an exchangeable distribution on set partitions is a function only of the block sizes, so $p_n(B) = q_n(m)$, where $m$ is the integer partition corresponding to $B$. Since there are

$$\frac{n!}{\prod_{j=1}^{n} (j!)^{m_j} \, m_j!}$$

set partitions $B$ corresponding to a given integer partition $m$, to each exchangeable distribution $p_n$ on set partitions there corresponds a marginal distribution

$$q_n(m) \, \frac{n!}{\prod_{j=1}^{n} (j!)^{m_j} \, m_j!}$$

on integer partitions. For example, the Ewens distribution on integer partitions is

$$\frac{\lambda^{m_.} \Gamma(\lambda) \prod_j \Gamma(j)^{m_j}}{\Gamma(n + \lambda)} \times \frac{n!}{\prod_{j=1}^{n} (j!)^{m_j} \, m_j!} = \frac{\lambda^{m_.} \, n! \, \Gamma(\lambda)}{\Gamma(n + \lambda) \prod_j j^{m_j} \, m_j!}. \tag{6}$$

This version leads naturally to an alternative description as follows. Let $M = M_1, \ldots, M_n$ be independent Poisson
random variables with mean $E(M_j) = \lambda \theta^j / j$ for some positive number $\theta$. Then $\sum j M_j$ is sufficient for $\theta$, and the conditional distribution $\mathrm{pr}(M = m \mid \sum_{j=1}^{n} j M_j = n)$ is the Ewens integer-partition distribution with parameter $\lambda$. This representation leads naturally to a simple method of estimation and testing, using Poisson log-linear models with model formula $1 + j$ and offset $-\log(j)$ for response vectors that are integer partitions.

The problem of estimating the number of unseen species was first tackled in a paper by Fisher (), using an approach that appears to be entirely unrelated to partition processes. Specimens from species $i$ occur as a Poisson process with rate $\rho_i$, the rates for distinct species being independent and identically distributed gamma random variables. The number $N_i \ge 0$ of occurrences of species $i$ in an interval of length $t$ is a negative binomial random variable

$$\mathrm{pr}(N_i = x) = (1 - \theta)^{\nu} \, \theta^{x} \, \frac{\Gamma(\nu + x)}{x! \, \Gamma(\nu)}. \tag{7}$$

In this setting, $\theta = t/(1 + t)$ is a monotone function of the sampling time, whereas $\nu > 0$ is a fixed number independent of $t$. Specimen counts for distinct species are independent and identically distributed random variables with parameters $\nu > 0$ and $0 < \theta < 1$.

The probability that no specimens from species $i$ occur in the sample is $(1 - \theta)^{\nu}$, the same for every species. Most species are unlikely to be observed if either $\theta$ is small, i.e., the time interval is short, or $\nu$ is small.

Let $M_x$ be the number of species occurring $x \ge 0$ times, so that $M_.$ is the unknown total number of species of which $M_. - M_0$ are observed. The approach followed by Fisher is to estimate the parameters $\theta, \nu$ by conditioning on the number of species observed and regarding the observed multiplicities $M_x$ for $x \ge 1$ as multinomial with parameter vector proportional to the negative binomial frequencies (7). For Fisher's entomological examples, this approach pointed to $\nu = 0$, consistent with the Ewens distribution (6), and indicating that the data are consistent with the number of species being infinite. Fisher's approach using a model indexed by species is less direct for ecological purposes than a process indexed by specimens. Nonetheless, subsequent analyses by Good and Toulmin (), Holgate () and Efron and Thisted () showed how Fisher's model can be used to make predictions about the likely number of new species in a subsequent temporal extension of the original sample. This amounts to a version of the Chinese restaurant process.

At this point, it is worth clarifying the connection between Fisher's negative binomial formulation and the Ewens partition formulation. The relation between them is the same as the relation between binomial and negative binomial sampling schemes for a Bernoulli process: they are not equivalent, but they are complementary. The partition formulation is an exchangeable process indexed by specimens: it gives the distribution of species numbers in a sample consisting of a fixed number of specimens. Fisher's version is also an exchangeable process, in fact an iid process, but this process is indexed by species: it gives the distribution of the sample composition for a fixed set of species observed over a finite period. In either case, the conditional distribution given a sample containing $k$ species and $n$ specimens is the distribution induced from the uniform distribution on the set of $S_{n,k}$ permutations having $k$ cycles. For the sorts of ecological or literary applications considered by Good and Toulmin () or Efron and Thisted (), the partition process indexed by specimens is much more direct than one indexed by species.

Fisher's finding that the multiplicities decay as $E(M_j) \propto \theta^j / j$, proportional to the frequencies in the log-series distribution, is a property of many processes describing population structure, either social structure or genetic structure. It occurs in Kendall's () model for family sizes as measured by surname frequencies. One explanation for universality lies in the nature of the transition rates for Kendall's process, a discussion of which can be found in Sect. . of Kelly ().

Equivariant Partition Models

A family $p_n(\sigma; \theta)$ of distributions on permutations indexed by a parameter matrix $\theta$ is said to be equivariant under the induced action of the symmetric group if $p_n(\sigma; \theta) = p_n(g \sigma g^{-1}; g \theta g^{-1})$ for all $\sigma, \theta$, and for each group element $g\colon [n] \to [n]$. By definition, the parameter space is closed under conjugation: $\theta \in \Theta$ implies $g \theta g^{-1} \in \Theta$. The same definition applies to partition models. Unlike exchangeability, equivariance is not a property of a distribution, but a property of the family. In this setting, the family associated with $[n]$ is not necessarily the same as the family of marginal distributions induced by deletion from $[n+1]$.

Exponential family models play a major role in both theoretical and applied work, so it is natural to begin with such a family of distributions on permutations of the matrix-exponential type
$$p_n(\sigma; \theta) = \alpha^{\#\sigma} \exp(\mathrm{tr}(\sigma \theta)) / M_\alpha(\theta),$$
where $\alpha > 0$ and $\mathrm{tr}(\sigma \theta) = \sum_{j=1}^{n} \theta_{\sigma(j), j}$ is the trace of the ordinary matrix product. The normalizing constant is the $\alpha$-permanent

$$M_\alpha(\theta) = \mathrm{per}_\alpha(K) = \sum_{\sigma} \alpha^{\#\sigma} \prod_{j=1}^{n} K_{\sigma(j), j}$$

where $K_{ij} = \exp(\theta_{ij})$ is the component-wise exponential matrix. This family of distributions on permutations is equivariant.

The limit of the $\alpha$-permanent as $\alpha \to 0$ gives the sum of cyclic permutations

$$\mathrm{cyp}(K) = \lim_{\alpha \to 0} \alpha^{-1} \mathrm{per}_\alpha(K) = \sum_{\sigma : \#\sigma = 1} \prod_{j=1}^{n} K_{\sigma(j), j},$$

giving an alternative expression for the $\alpha$-permanent

$$\mathrm{per}_\alpha(K) = \sum_{B \in \mathcal{E}_n} \alpha^{\#B} \prod_{b \in B} \mathrm{cyp}(K[b])$$

as a sum over partitions. The induced marginal distribution (10) on partitions is of the product-partition type recommended by Hartigan (), and is also equivariant. Note that the matrix $\theta$ and its transpose determine the same distribution on partitions, but they do not usually determine the same distribution on permutations.

The $\alpha$-permanent has a less obvious convolution property that helps to explain why this function might be expected to occur in partition models:

$$\sum_{b \subset [n]} \mathrm{per}_\alpha(K[b]) \, \mathrm{per}_{\alpha'}(K[\bar b]) = \mathrm{per}_{\alpha + \alpha'}(K). \tag{8}$$

The sum extends over all $2^n$ subsets of $[n]$, and $\bar b$ is the complement of $b$ in $[n]$. A derivation can be found in section . of McCullagh and Møller (). If $B$ is a partition of $[n]$, the symbol $K \cdot B = B \cdot K$ denotes the Hadamard component-wise matrix product for which

$$\mathrm{per}_\alpha(K \cdot B) = \prod_{b \in B} \mathrm{per}_\alpha(K[b])$$

is the product over the blocks of $B$ of $\alpha$-permanents restricted to the blocks. Thus the function $B \mapsto \mathrm{per}_\alpha(K \cdot B)$ is of the product-partition type.

With $\alpha, K$ as parameters, we may define a family of probability distributions on $\mathcal{E}_n^k$, i.e., partitions of $[n]$ having $k$ or fewer blocks, as follows:

$$p_{nk}(B) = k^{\downarrow \#B} \, \mathrm{per}_{\alpha/k}(K \cdot B) / \mathrm{per}_\alpha(K). \tag{9}$$

The fact that (9) is a probability distribution on $\mathcal{E}_n$ follows from the convolution property of permanents. The limit as $k \to \infty$

$$p_n(B) = \alpha^{\#B} \prod_{b \in B} \mathrm{cyp}(K[b]) / \mathrm{per}_\alpha(K), \tag{10}$$

is a product-partition model satisfying the conditional independence property (3). For $K = 1_n$, the $n \times n$ matrix whose elements are all one, $\mathrm{per}_\alpha(1_n) = \alpha^{\uparrow n}$ is the ascending factorial function. Thus the uniform Dirichlet-multinomial model (1) and the Ewens model (2) are both obtained by setting $\theta = 0$.

Leaf-Labelled Trees

Kingman's $[n]$-coalescent is a non-decreasing, $\mathcal{E}_n$-valued Markov process $(B_t)$ in continuous time starting from the partition $B_0 = 0_n$ with $n$ singleton blocks at time zero. The coalescence intensity is one for each pair of blocks regardless of size, so each coalescence event unites two blocks chosen uniformly at random from the set of pairs. Consequently, the first coalescence occurs after a random time $T_n$ exponentially distributed with rate $\rho(n) = n(n - 1)/2$ and mean $1/\rho(n)$. After $k$ coalescences, the partition consists of $n - k$ blocks, and the waiting time $T_{n-k}$ for the next coalescence is exponential with rate $\rho(n - k)$. The time to complete coalescence is the sum of independent exponentials $T = T_n + T_{n-1} + \cdots + T_2$, which is a random variable with mean $2 - 2/n$ and variance increasing from 1 at $n = 2$ to a little less than 1.16 as $n \to \infty$. In the context of the Fisher–Wright model, the coalescent describes the genealogical relationships among a sample of individuals, and $T$ is the time to the most recent common ancestor of the sample.

The $[n]$-coalescent is exchangeable for each $n$, but the property that makes it interesting mathematically, statistically and genetically is its consistency under selection or sub-sampling (Kingman ). If we denote by $p_n$ the distribution on $[n]$-trees implied by the specific Markovian model described above, it can be shown that the embedded tree obtained by deleting element $n$ from the sample $[n]$ is not only Markovian but also distributed as $p_{n-1}$, i.e., the same coalescent rule applied to the subset $[n - 1]$. This property is mathematically essential for genealogical trees because the occurrence or non-occurrence of individual $n$ in the sample does not affect the genealogical relationships among the remainder.

A fragmentation $[n]$-tree is a non-increasing $\mathcal{E}_n$-valued Markov process starting from the trivial partition $B_0 = 1_n$ with one block of size $n$ at time $t = 0$. The simplest of these are the consistent binary Gibbs fragmentation trees studied by Aldous (), Bertoin (, ) and McCullagh et al. (). The first split into two branches occurs at a random time $T_n$ exponentially distributed with parameter $\rho(n)$. Subsequently, each branch fragments independently according to the same family of distributions with parameter $\rho(\#b)$ for branch $b$, which
is a Markovian conditional independence property analogous to (3). Consistency and conditional independence put severe limitations on both the splitting distribution and the rate function $\rho(n)$, so the entire class is essentially one-dimensional.

A rooted leaf-labelled tree $T$ is also a non-negative symmetric matrix. The interpretation of $T_{ij}$ as the distance from the root to the junction at which leaves $i, j$ occur on disjoint branches implies the inequality $T_{ij} \ge \min(T_{ik}, T_{jk})$ for all $i, j, k \in [n]$. The set of $[n]$-trees is a subset of the positive definite symmetric matrices, not a manifold, but a finite union of manifolds of dimension $2n - 1$, or $n$ if the diagonal elements are constrained to be equal. Like partitions, rooted trees form a projective system within the positive definite matrices. A fragmentation tree is an infinitely exchangeable random tree, which is also a special type of infinitely exchangeable random matrix.

Cluster Processes and Classification Models

An $\mathbb{R}^d$-valued cluster process is a pair $(Y, B)$ in which $Y = (Y_1, \ldots)$ is an $\mathbb{R}^d$-valued random sequence and $B$ is a random partition of $\mathbb{N}$. The process is said to be exchangeable if, for each finite sample $[n] \subset \mathbb{N}$, the restricted process $(Y[n], B[n])$ is invariant under permutation $\sigma\colon [n] \to [n]$ of sample elements.

The Gauss–Ewens process is the simplest non-trivial example for which the distribution for a sample $[n]$ is as follows. First fix the parameter values $\lambda > 0$, and $\Sigma^0, \Sigma^1$ both positive definite of order $d$. In the first step $B$ has the Ewens distribution on $\mathcal{E}_n$ with parameter $\lambda$. Conditionally on $B$, $Y$ is a zero-mean Gaussian matrix of order $n \times d$ with covariance matrix

$$\mathrm{cov}(Y_{ir}, Y_{js} \mid B) = \delta_{ij} \Sigma^0_{rs} + B_{ij} \Sigma^1_{rs},$$

where $\delta_{ij}$ is the Kronecker symbol. A scatterplot color-coded by blocks of the $Y$ values in $\mathbb{R}^2$ shows that the points tend to be clustered, the degree of clustering being governed by the ratio of between to within-cluster variances.

For an equivalent construction we may proceed using a version of the Chinese restaurant process in which tables are numbered in order of occupancy, and $t(i)$ is the number of the table at which customer $i$ is seated. In addition, $\epsilon_1, \ldots$ and $\eta_1, \ldots$ are independent Gaussian sequences with independent components $\epsilon_i \sim N_d(0, \Sigma^0)$, and $\eta_i \sim N_d(0, \Sigma^1)$. The sequence $t$ determines $B$, and the value for individual $i$ is a vector $Y_i = \eta_{t(i)} + \epsilon_i$ in $\mathbb{R}^d$, or $Y_i = \mu + \eta_{t(i)} + \epsilon_i$ if a constant non-zero mean vector is included.

Despite the lack of class labels, cluster processes lend themselves naturally to prediction and classification, also called supervised learning. The description that follows is taken from McCullagh and Yang () but, with minor modifications, the same description applies equally to more complicated non-linear versions associated with generalized linear mixed models (Blei et al. ). Given the observation $(Y[n], B[n])$ for the 'training sample' $[n]$, together with the feature vector $Y_{n+1}$ for specimen $u_{n+1}$, the conditional distribution of $B[n+1]$ is determined by those events $u_{n+1} \mapsto b$ for $b \in B$ and $b = \emptyset$ that are compatible with the observation. The assignment of a positive probability to the event that the new specimen belongs to a previously unobserved class seems highly desirable, even logically necessary, in many applications.

If the classes are tree-structured with two levels, we may generate a sub-partition $B' \le B$ whose conditional distribution given $B$ is Ewens restricted to the interval $[0_n, B]$, with parameter $\lambda'$. This sub-partition has the effect of splitting each main cluster randomly into sub-clusters. For the sample $[n]$, let $t'(i)$ be the number of the sub-cluster in which individual $i$ occurs. Given $B, B'$, the Gauss–Ewens two-level tree process is a sum of three independent Gaussian processes $Y_i = \eta_{t(i)} + \eta'_{t'(i)} + \epsilon_i$ for which the conditional distributions may be computed as before. In this situation, however, events that are compatible with the observation $B[n], B'[n]$ are of three types as follows:

$$u_{n+1} \mapsto b' \in B'[n], \qquad u_{n+1} \mapsto \emptyset \subset b \in B[n], \qquad u_{n+1} \mapsto \emptyset.$$

In all, there are $\#B' + \#B + 1$ disjoint events for which the conditional distribution given $B[n], B'[n], Y[n+1]$ must be computed. An event of the second type is one in which the new specimen belongs to the major class $b \in B$, but not to any of the sub-types previously observed for this class.

Further Applications of Partition Models

Exchangeable partition models are used to construct non-trivial, exchangeable processes suitable for cluster analysis and density estimation. See Fraley and Raftery () for a review. Here, cluster analysis means cluster detection and cluster counting in the absence of covariate or relational information about the units. In the computer-science literature, cluster detection is also called unsupervised learning. The simplest of these models is the marginal Gauss–Ewens process in which only the sequence $Y$ is observed. The conditional distribution $p_n(B \mid Y)$ on $\mathcal{E}_n$ is the posterior distribution on clusterings or partitions of $[n]$, and $E(B \mid Y)$ is the one-dimensional marginal distribution on pairs of units. In estimating the number of clusters, it is important to distinguish between the sample number $\#B[n]$, which is necessarily finite, and the population number $\#B[\mathbb{N}]$, which could be infinite (McCullagh and Yang ).
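As an illustration of the Chinese restaurant construction of the Gauss–Ewens process given under Cluster Processes and Classification Models, the following is a minimal Python sketch rather than a reference implementation. The function names and the arguments `sd_within` and `sd_between` are hypothetical, and the sketch assumes isotropic covariances (the within-cluster and between-cluster covariance matrices are taken to be `sd_within**2` and `sd_between**2` times the identity), which is a simplification not made in the text.

```python
import random


def chinese_restaurant_tables(n, lam, rng):
    """Seat n customers by the Ewens rule; return t with t[i] = table of customer i.

    Tables are numbered 0, 1, 2, ... in order of first occupancy.
    """
    sizes = []  # sizes[k] = current number of occupants at table k
    t = []
    for i in range(n):  # i customers are already seated
        # Table k is chosen with probability sizes[k] / (i + lam),
        # a new table with probability lam / (i + lam).
        u = rng.random() * (i + lam)
        choice = len(sizes)  # default: open a new table
        acc = 0.0
        for k, size in enumerate(sizes):
            acc += size
            if u < acc:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
        t.append(choice)
    return t


def gauss_ewens_sample(n, d, lam, sd_within, sd_between, seed=0):
    """Simulate (t, Y) with Y_i = eta_{t(i)} + eps_i in R^d.

    Assumption (hypothetical simplification): both covariance matrices are
    isotropic, so each coordinate is drawn independently.
    """
    rng = random.Random(seed)
    t = chinese_restaurant_tables(n, lam, rng)
    n_tables = max(t) + 1
    # One between-cluster effect eta_k per table.
    eta = [[rng.gauss(0.0, sd_between) for _ in range(d)] for _ in range(n_tables)]
    # Each observation is its table effect plus independent within-cluster noise.
    y = [[eta[t[i]][r] + rng.gauss(0.0, sd_within) for r in range(d)] for i in range(n)]
    return t, y


if __name__ == "__main__":
    t, y = gauss_ewens_sample(n=10, d=2, lam=1.0, sd_within=0.5, sd_between=3.0, seed=42)
    for table, point in zip(t, y):
        print(table, [round(v, 2) for v in point])
```

The table labels `t` determine the partition through the rule that two individuals share a block exactly when their labels agree. Setting `sd_within` to zero makes all points at one table coincide, which is a convenient sanity check on the construction.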

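The two closed-form partition distributions introduced under Partition Model, the uniform Dirichlet-multinomial distribution and its Ewens limit, can be checked numerically on small sets. The sketch below is an illustrative check under stated assumptions, not part of the original derivation: it enumerates all partitions of a small set by brute force, and the helper names (`set_partitions`, `rising`, `falling`, `ewens_prob`, `dirichlet_multinomial_prob`) are hypothetical.

```python
from math import gamma, prod


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new singleton block.
        yield [[first]] + part


def rising(a, r):
    """Ascending factorial a (a + 1) ... (a + r - 1); equals 1 when r = 0."""
    return prod(a + i for i in range(r))


def falling(k, r):
    """Descending factorial k (k - 1) ... (k - r + 1); zero when r > k."""
    return prod(k - i for i in range(r))


def ewens_prob(blocks, lam):
    """Ewens probability: lam^{#B} * prod Gamma(#b) / (ascending factorial of lam)."""
    n = sum(len(b) for b in blocks)
    return lam ** len(blocks) * prod(gamma(len(b)) for b in blocks) / rising(lam, n)


def dirichlet_multinomial_prob(blocks, lam, k):
    """Uniform Dirichlet-multinomial probability for at most k blocks."""
    n = sum(len(b) for b in blocks)
    return (falling(k, len(blocks)) * gamma(lam)
            * prod(rising(lam / k, len(b)) for b in blocks) / gamma(n + lam))


if __name__ == "__main__":
    parts = list(set_partitions([1, 2, 3, 4]))
    print(len(parts))                                              # number of partitions of [4]
    print(sum(ewens_prob(B, 2.5) for B in parts))                  # total Ewens mass
    print(sum(dirichlet_multinomial_prob(B, 2.5, 3) for B in parts))  # total mass for k = 3
```

Both totals should equal one up to rounding error, partitions with more than `k` blocks receive probability zero through the descending factorial, and for large `k` the Dirichlet-multinomial values approach the Ewens values, in line with the limit stated in the text.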

Exchangeable partition models are also used to provide a Bayesian solution to the multiple comparisons problem. The key idea is to associate with each partition $B$ of $[k]$ a subspace $V_B \subset \mathbb{R}^k$ equal to the span of the columns of $B$. Thus, $V_B$ consists of vectors $x$ such that $x_r = x_s$ if $B_{rs} = 1$. For a treatment factor having $k$ levels $\tau_1, \ldots, \tau_k$, the Gauss–Ewens prior distribution on $\mathbb{R}^k$ puts positive mass on the subspaces $V_B$ for each $B \in \mathcal{E}_k$. Likewise, the posterior distribution also puts positive probability on these subspaces, which enables us to compute in a coherent way the posterior probability $\mathrm{pr}(\tau \in V_B)$ or the marginal posterior probability $\mathrm{pr}(\tau_r = \tau_s \mid y)$. For details, see Gopalan and Berry ().

Acknowledgments

Peter McCullagh is the John D. MacArthur Distinguished Service Professor in the Department of Statistics at the University of Chicago. Professor McCullagh is a past editor of Bernoulli, a fellow of the Royal Society of London and of the American Academy of Arts and Sciences. He is co-author with John Nelder of the book Generalized Linear Models (Chapman and Hall).

Support for this research was provided in part by NSF Grant DMS-.

References and Further Readings

Aldous D () Probability distributions on cladograms. In: Random discrete structures. IMA Vol Appl Math. Springer, New York
Bertoin J () Homogeneous fragmentation processes. Probab Theor Relat Fields
Bertoin J () Random fragmentation and coagulation processes. Cambridge studies in advanced mathematics. Cambridge University Press, Cambridge
Blackwell D, MacQueen J () Ferguson distributions via Pólya urn schemes. Ann Stat
Blei D, Ng A, Jordan M () Latent Dirichlet allocation. J Mach Learn Res
Efron B, Thisted RA () Estimating the number of unknown species: how many words did Shakespeare know? Biometrika
Ewens WJ () The sampling theory of selectively neutral alleles. Theor Popul Biol
Fisher RA, Corbet AS, Williams CB () The relation between the number of species and the number of individuals in a random sample of an animal population. J Anim Ecol
Fraley C, Raftery AE () Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc
Good IJ, Toulmin GH () The number of new species, and the increase in population coverage when a sample is increased. Biometrika
Gopalan R, Berry DA () Bayesian multiple comparisons using Dirichlet process priors. J Am Stat Assoc
Hartigan JA () Partition models. Commun Stat Theor Meth
Holgate P () Species frequency distributions. Biometrika
Kelly FP () Reversibility and stochastic networks. Wiley, Chichester
Kendall DG () Some problems in mathematical genealogy. In: Perspectives in probability and statistics: papers in honour of M.S. Bartlett. Academic, London
Kingman JFC () Random discrete distributions (with discussion). J R Stat Soc B
Kingman JFC () The population structure associated with the Ewens sampling formula. Theor Popul Biol
Kingman JFC () The representation of partition structures. J Lond Math Soc
Kingman JFC () Mathematics of genetic diversity. CBMS-NSF conference series in applied mathematics. SIAM, Philadelphia
Kingman JFC () The coalescent. Stoch Proc Appl
McCullagh P, Møller J () The permanental process. Adv Appl Prob
McCullagh P, Yang J () Stochastic classification models. In: Proceedings of the international congress of mathematicians
McCullagh P, Yang J () How many clusters? Bayesian Anal
McCullagh P, Pitman J, Winkel M () Gibbs fragmentation trees. Bernoulli
Pitman J () Combinatorial stochastic processes. Springer, Berlin
Watterson GA () The sampling theory of selectively neutral alleles. Adv Appl Probab
