<<

7 13

CLUSTER UNDER PARAMETRIC MODELS

by

David Anthony Binder

Thesis submitted for the degree of Doctor of Philosophy in the University of London and for the Diploma of Imperial College.

January, 1977 -2-

ABSTRACT

Partitioning a set of units into mutually exclusive clusters is studied, assuming that within each cluster the data are independent and identically distributed with some parametric density function. Throughout, examples are given when this is multivariate normal. The cluster membership is assumed random. Parameter estimation and the asymptotic distributions are discussed when the cluster membership of each unit is independent and identically distributed with (possibly) unknown probabilities and the number of clusters is known. Tests of the hypothesis that there is one group with the alternative of two groups are given.

The model is then generalized to include any distribution for cluster membership of the units and a Bayesian approach is taken. A model-dependent definition of the similarity between two data units is proposed and an optimal clustering which is a function of the similarities is given in a decision theoretic framework. When the densities of the data from each of the clusters are in the , a class of prior distributions is given for which the posteriors of the group membership are often easy to evaluate. When these densities are multi- variate normal, conditions are found which lead to the same clustering criteria as those used by other authors.

A general is given for approximating the optimal cluster when the volume of the data is large. This algorithm is applied to Fisher's (1936) iris data and to Duncan's (1955) barley data.

ACKNOWLEDGEMENTS

I am grateful to my supervisor Dr. A.F.S. Mitchell and to Prof. D.R. Cox for their helpful comments during the course of this work. I also thank my wife, Marilyn, for her encouragement especially during my frustrations.

I am indebted to Canada for their generous financial support throughout my stay in London. The final version of this thesis was typed by Mrs. R. Fatorich. - 3 -

CONTENTS

Abstract 2

Acknowledgements 2

1. Introduction 5

2. A Parametric Model 9

2.1 Derivation of the model 9 2.2 Parameter estimation 11

2.2.1 Univariate case 13 2.2.2 General p-variate case 19

2.3 Information matrix for a mixture of two densities 23 2.4 Discussion 30

3. Some Test for Homogeneity 33

3.1 Introduction 33 3.2 Local tests for simple hypotheses 39 3.3 Local similar tests 42 3.4 Discussion 48

4. Bayesian Cluster Analysis: General 51

4.1 Introduction 51 4.2 General results 54

4.2.1 Posterior distributions 56 4.2.2 The general Bayesian decision problem 57 4.2.3 Some loss functions 58

4.3 Unknown number of groups 63 4.4 Convenient prior distributions 65 4.5 Practical considerations 69 4.6 Discussion 72

5. Bayesian Cluster Analysis: Normally Distributed Groups 74

5.1 Introduction 74 5.2 Unknown , common known 74 5.3 Unknown means, common unknown variance I 77 5.4 Unknown means, common unknown variance II 78 5.5 Different unknown means and 80 5.6 Common known means, unknown variances 83 5.7 Regression models 84 5.8 Some prior distributions for A 87 5.9 Performance of approximation algorithm 89 5.10 Discussion 93 4

6. Fisher's Iris Data 95

6.1 The data and previous analyses 95 6.2 Methodology for our analysis 99 6.3 Results 102 6.3.1 Common matrices 102 6.3.2 Different covariance matrices 106

6.4 Summary and discussion 109

7. Duncan's Barley Data 113 7.1 Introduction 113 7.2 Model for clustering varieties 114 7.3 Method and results 120 7.4 Discussion 129

8. Topics for Further Research 130

TABLES

5.1 Maximum probability in certain intervals of 91 sample file.

5.2 Maximum absolute deviation between estimated 92 and actual similarities.

6.1 Doubtful units for exactly three groups, 104 common covariance.

6.2 Doubtful units for up to four groups, common 105 covariance.

6.3 Doubtful units for up to four groups, diffrent 107 .

6.4 Doubtful units for exactly three groups, different 108 covariances.

6.5 Summary of estimated clusterings. 109

7.1 Duncan's barley data. 114 7.2 List of prior distributions. 122 7.3 Analysis of barley data. 124

FIGURES

6.1 Fisher's iris data 96 5

Chapter 1

INTRODUCTION

"There are many intuitive ideas, often conflicting, of what consti- tutes a cluster, but few formal definitions."

Cormack (1971)

This thesis proposes, develops and shows the usefulness of some parametric models which appear appropriate in many applications of cluster analysis. There is a vast literature of and applications of this kind of analysis. Many of the algorithms are reviewed in Cormack

(1971), Anderberg (1973), Everitt (1974) and Hartigan (1975). As Cormack

(1971) points out, however: "Most techniques for clustering have developed, without formal basis, as algorithms." Before we develop our clustering techniques, derived from explicit assumptions, we introduce some common concepts.

The objective of cluster analysis is to group a given set of entities

(e.g., individuals, attributes). There are essentially three types of cluster analyses which Cormack (1971) refers to as: (i) hierarchical classification, where the groupings are split further to form a tree,

(ii) partitioning, where the set of entities is partitioned into mutually exclusive clusters, and (iii) clumping, in which a given entity may be assigned to more than one group. We shall concentrate on partitioning, although some of the techniques we shall develop may lead to clumping when the classification of entities is doubtful.

We consider the data as consisting of n individuals or units, each individual being measured with respect to p attributes (p > 1). We shall not discuss the case of a varying number of attributes, although some of 6

the techniques we develop can be extended to this case. Two conceptually

distinct forms of cluster analysis are: (i) to cluster the individuals,

and (ii) to cluster the attributes. Gower (1966) has shown that certain

procedures for clustering individuals have a mathematical duality with

certain procedures for clustering attributes. We consider only the

problem of clustering individuals since this can be modelled fairly

easily.

The basic problem we wish to consider is to find a grouping of the

labels {/,...,n} such that units which are in the same group tend to be

"similar" to each other (internal cohesion) and units in different groups

tend to be "dissimilar" (external isolation). However, these concepts

of internal cohesion and external isolation are vague. In §4.2.3 we

propose a by which the user can specify their relative

importance.

A common starting point in many clustering algorithms is to specify

or to derive from the data a similarity (dissimilarity) matrix. These

are usually n x n symmetric matrices where the larger the (i,j)-th entry,

the more similar (dissimilar) the i-th and j-th units. Much of the

literature is devoted to specifying this similarity matrix and to scaling

variables so that the similarity matrix has a meaningful interpretation for the particular application. We shall propose a natural definition of a similarity matrix in Chapter 4 in a Bayesian context.

Many existing algorithms attempt to optimize some criterion over all partitions. This criterion may be based on the similarity matrix or on the actual data matrix. In a Bayesian decision theory framework, we may formally derive the criterion to be optimized. However, examination 7 of the natural similarity matrix which we propose will often lead to a clustering without resorting to an explicit decision theoretic model.

Chapter 2 introduces a model for cluster analysis, by assuming the observations come from a mixture of a known number of parametric density functions. We discuss certain problems associated with such a model, primarily in a non-Bayesian context.

In Chapter 3 we consider testing certain hypotheses about the number of density functions in the mixture.

We reformulate our problem more generally and place it in a Bayesian framework in Chapter 4. We discuss some loss functions in a Bayesian decision theoretic framework which may be appropriate to a wide class of practical problems. We also discuss the implementation and approximation of the Bayes' procedure when it is not feasible to compute the optimal solution.

Chapter 5 gives some examples of posterior distributions derived in

Chapter 4 under certain multivariate normal assumptions. We also demonstrate the numerical behaviour of the approximation procedure proposed in Chapter 4.

We apply some of our procedures to the Fisher (1936) iris data under certain multivariate normal assumptions in Chapter 6. We compare our results with those obtained using other clustering algorithms.

Chapter 7 demonstrates the usefulness of our procedures for the problem of multiple comparisons by applying our Bayesian clustering techniques to Duncan's (1955) barley data. We compare the effects of - 8 - changing the prior distributions.

Some areas of further research for the implementation of our general procedures are suggested in Chapter 8. 9

Chapter 2

A PARAMETRIC MODEL

2.1 Derivation of the Model

We now consider the specification of a model for partitioning the

units or individuals. In §2.2 we briefly review some of the literature

concerned with this model. We propose in §2.3 a technique for simplifying

some integrals which are useful in the asymptotic theory for the model,

applying the technique to a mixture of two multivariate normal densities

with common .

Each observation is assumed to be a vector in p-dimensional Euclidean

space. We let m be the number of clusters, which may be labelled Group 1,

Group 2,..., Group m. (Often the specific labels are not important, as

we may be interested only in which individuals have been assigned to the

same group.) Since the individuals are to be partitioned into clusters,

we assume there exists a "true cluster" to which each data unit

(individual) belongs. (However, there may be some applications where

"true clusters" do not exist. For example, there could be no difference

among certain attributes of two adjacent streets in a city, but they may

be in different boroughs, so the classification into boroughs does not

reflect an actual dissimilarity in the attributes. Kendall (1966) refers

to this as "dissection" as opposed to classification. We do not consider

the problem of dissection.)

The n data points in the sample are denoted by p-dimensional column vectors The "true clusters" to which the data units belong are represented by the vector g = (g(1),...,g(n)), where g(k) = i whenever - 10 -

the k-th data unit beongs to Group i. We call the "true grouping"

vector.

Now is unobservable and the aim of cluster analysis is to find a 04. g = (g(1),...,g(n)), based on the observations ;&1,...,K17, such that if

g(k) = i, our estimated clustering assigns the k-th unit to Group i. It

is assumed that is the realization of a random event g with distribution

F(g).

In this chapter we assume that G(1),...,G(n) are independent and

identically distributed with Pr(G(k) = iJ = Ai for i = 1,...,m, where

E x. = 1. i=./ 2

(Depending on the application, the parameters k (xl,...,xm) may be either known or unknown.) We shall generalize the distribution of g in

Chapter 4 where we shall consider the problem in a Bayesian context.

However, here we have

m n. f G (gW = Pr( = Zik) = n Xi , i=1 where n.a is the number of elements in {klg(k) = -01 so that

n = n. i=1 a

The data points are assumed to be realizations of the random events Conditional on the "true grouping" vector z, we assume that kz,...44.1 are independently distributed with the distribution of depending upon g(k), the group to which the k—th individual belon k gs. For a set of parameters given by the vector k, we have m parametric density functions h1(kik),...,hm,(0). The conditional density of the data given the "true groups" is

n = n h (?6 k) ti k1 g(k) k so the marginal density of the data is

n m (b,..., 11 1k/k) = 11 / / X./7(k le)]. (2.1.1) k=1 i=1 k

This mixture is a possible model if cluster analysis is to be approached from a probabilistic point of view. It is in fact the same model which other authors have assumed when considering cluster analysis; see, for example, Day (1969b) and Wolfe (1970). It is also similar to models proposed by Scott and Symons (1971) and Bock (1972).

2.2 Parameter Estimation

If all the parameters (k40 are known, the grouping of data units is an example of Bayesian discriminant analysis; see Anderson (1958),

Chapter 6. The vector k takes the role of the prior probabilities of membership in the respective populations (groups); these populations having known density functions hi(k I p,) ..,hm(dk). The assignment of the data units is based on the posterior probabilities of membership in the populations, conditional on the observations ki,...,n. - 12 -

Now

n Ag(k)hg(k)( kI Z) ,'R,) = II Pr(q = Z1 1"."kn'k k=1 T L Aihi( k lk) i=1

Therefore, conditional on the observations, the components of the "true

clustering" vector are distributed as n independent random variables, with

Pr[G(k) = ikz,...,kn,?A] depending on the observations only through irk.

We shall denote the quantity

I Aihi (k) i=1

by P(i1).

If some of the parameters (k,,k) are unknown, and there are no

individuals whose group membership is known, then we are no longer in the

usual discriminant analysis framework. However, if we can find some

A. A suitable estimator (ti,k) of (ti,k) and we let

h.(Xle) P(iik) = i2 "1.1

i./ 2 2

then it may seem reasonable to consider the set {P(il?Ek)lk = 1,...,n} to assign data units. This is what Day (1969b) and Wolfe (1970) suggest using maximum likelihood estimates for (k,,A). (Aitchison (1975) has suggested that in model fitting problems, the simple substitution of asymptotically - 13 -

efficient estimates of the parameters into the density function does not

behave as well as the predictive density derived from a Bayesian analysis.

The predictive probabilities for our model are discussed in Chapter 4.)

Let us now turn our attention to the problem of finding k and 8 - suitable

parameter estimates.

2.2.1 Univariate Case

The problem of estimating the parameters of a mixture is quite old and there is still no consensus on the best way to handle many situations.

The important case of unknown k is still "elusive" according to

Macdonald (1971). He stated, "There is no estimation technique uniformly 2 2 useful in practice." A case in point is when m = 2 and hi(x1111,p2,a1 ,a2 ) is the density function of a normally distributed with 2 2 pi andvariancea.2, the five parameters A,pl,p2,a ,a2 2 1 being all unknown. (Here A refers to A and A = 1-A.) Pearson (1894) gave a solution using l 2 the method of moments. This requires solving a nonic polynomial made up of terms involving the first five moments, and in the case of more than one real root, Pearson (1894) suggested taking the solution which agrees best with the sixth . Cohen (1967) simplified the numerical solution circumventing finding the roots of the nonic. Wolfe (1970) considered the stationary point of the with maximum likelihood even though the likelihood function is unbounded as a12 -4- 0 for fixed A and 2 a2 , and if pi = p2 = xk for any k; see Cox and Hinkley (1974), pg. 291.

Hasselblad (1966) considered the stationary point of the likelihood function with maximum likelihood based on the , using equispaced intervals and a convenient approximation for the interval probabilities, and Fryer and Robertson (1972) compared the bias and mean squared error of similar estimators (using exact interval probabilities) with those - 14 -

2 obtained from minimizing the x test and the

moment estimators. Some authors such as Cox (1966) and Bhattacharya (1967)

used graphs for estimating some of the parameters. For two components,

Cox (1966) mentioned splitting the data into two groups, where all the

observations in one group are larger than in the other and using normal

probability paper to estimate the means and variances. For a general number of component distributions, Bhattacharya (1967) used histogram ordinates for the plots and estimates the pi t s and ai's from the graph.

Using these estimates and the data, he estimates X.

2 2 2 If it is assumed that a = a = a , then the estimation problem / 2 becomes theoretically easier, since the likelihood function is bounded.

Rao (1952; pg. 300) derived the estimators based on equating the first four sample and population cumulants. The solution involves finding the unique negative real root of a cubic polynomial. This root always exists so the problem of multiple solutions disappears. Preston (1953) avoided solving the cubic by interpolating two parameters in the contours of a - chart. Tan and Chang (1972) considered the asymptotic relative of the moment estimators to the maximum likelihood estimators. They observed that the maximum likelihood estimators are much better than the moment estimators, especially when A = 'P1 u21/a is small.

Most authors remark on the poorness of all the estimators, unless the sample sizes are very large. For example, Tan and Chang (1972) pointed out that if we want to estimate A by maximum likelihood such that the of A is less than 0.1 then for A > 1 we need about

8000 observations, for A > 1.5 about 900, and for A > 2 about 160.

Some authors have considered the simpler problem of estimating k for

- 15 - m > 2 and general known density functions h.(x16) (0 is assumed known). 2 ou Even this case is non-trivial in general. If F(x) is the distribution function of X and F (x) the empirical distribution function, Choi and n Bulgren (1968) suggested estimating ,Ai by minimizing iff(x)-Fn(x).7 2dFn(x). 2 Macdonald (1971) suggested that minimizing f[F(x)-Fn(x)] dF(x), the

Cramer-von Mises goodness of fit test statistic, usually yields a much smaller bias and a smaller mean squared error than Choi and Bulgren's (1968) estimator when the component distributions are normal.

If v. = 1212.(x)dx < co, for i = 1,2, and vi 0 then the moment a a p2, -/ estimator for A when m = 2 is A = -p ) (x -u ) where x is the sample / 2 2 mean. Johnson (1973) considered this and other unbiased estimators of

A using means of truncated samples. The moment estimate is not necessarily in the [0,1]. However, by the weak law of large numbers, A converges to A in probability for large samples. If in addition la.2 = !(x-p.)2h.00dx < co for i = 1,2 then A converges to A almost surely, .1 1 and by the , 4-(A-A) is asymptotically normal with zero mean and variance given by

2 2 Aa (1-A)a 1 2 + A (1-A ) . 2 (1.1/ P2)

This moment estimator may be considered for any component distributions with known means. Generally, it is asymptotically inefficient but if 2 2 2 (pi - p2) is large relative to max(o1 ,a2 ) then the asymptotic variance of the moment estimator is close to A(1-A)/n, the variance of the asympto- tically efficient estimator of the parameter in a binomial distribution. - 16 -

Hill (1963) has shown that X(1-X)/n is a lower bound to the variance of 2 2 2 an asymptotically efficient estimator of A, so for (1117112) /max(ci ,c2 ) large, the asymptotic relative efficiency of the moment estimator is near one.

A class of unbiased estimators of k has been proposed by Bartlett and

Macdonald (1968) for general m. If Fn(x) is the empirical distribution and the cumulative distribution function of X is

F ( x ) = y A.H.(x), i=1 "

then for G a suitably chosen increasing function of x, Bartlett and

Macdonald (1968) proposed that k should minimize

( dF -dF ) /dG. (2.2.1.1) f n

For example, if m = 2, the estimator of A is

f(dF -dli)[(dH )/dG] A = n 12 dH -dll )((dH -dH2) /dG / ( 1 2 1 J

with the variance of A given by

/{(dH1-dH2 ) / dG}2dF - ( f{(dH1-dH2)/dG}dF )2

n[ f (dH -dH ){(dH -dH )/dG}12 1 2 1 2

If is unknown, Bartlett and Macdonald (1968) proposed minimizing

(2.2.1.1) with respect to ,1,X and Q. - 17 -

Let us now consider the maximum likelihood estimation of k. Let

k(x) be the vector 021(x),...,1217(x)). Macdonald (1971) pointed out that

the log likelihood function,

n m I log( y A.h.(xk)] k=1 i=1 "

is a strictly concave function of k ifEn > m-1 and there exist m-1

vectors Vxk) which together with the m-dimensional vector of i's form

a basis in Euclidean m-space. For m = 2 this means that there exists a k such that h (x ) h (x ). When the likelihood function is strictly 1 k 2 k concave, there is at most one stationary point, which, if it exists, would

maximize the likelihood. If there is no stationary point in the set

A = {A10 < A. < / for i = 1,...,m; y x. = 11, i=1

then the maximum likelihood estimate of is such that A. = 0 or 1 for A 2 some i e fi,...,m1. For m = 2, Macdonald (1971) showed that

y fh Pr(0 < A < 1) = 1 - Pr (n-1 (x )/h2(xk)1 <11 k=1 1 k

-/ - Prfn fh2rxk)/h1(xk)). <_/), (2.2.1.2) k=1 so for any finite n there is a non-zero probability that A ¢ (0,1). We also have the usual asymptotic result for maximum likelihood estimators that k converges almost surely to k since the likelihood function is continuous in k; see Cox and Hinkley (1974), pg. 209. This result holds - 18 - even if k, is on the boundary of the set A.

Under suitable regularity conditions, when m = 2 and 0 < A < 1 then -/ i(A-A) is asymptotically normal with zero mean and variance [i(X)] , where /(A) is given by fah1(x)-h2(x)}2/{Xyx)÷(1-A)h2(x)11dx when the distribution function for x is continuous. We can show that the following conditions are sufficient for this result: (i) I(A) > 0 for all A E (0,1), and

3 112 (x)-h (x)I (ii) / 2 3 hi(x)dx < co for i = 1, 2, [min{h (x),h (x)}] 1 2

since this implies that the regularity conditions given in Cox and

Hinkley (1974; pg. 281) hold.

Hill (1963) gave the following results: (i) /(A) = [A(/-)01-1/1-S(X),/, where SOO = f(hi(x)h2(x) fAhl(x)÷(1-X)h2(x)} 11dx, and (ii) if h1(x)/h2(x) is a continuous, strictly increasing function with

lim (111(x)/h2(x)] = co x-}=

and C satisfies Th1(C)/h2(C)] = [(1-X)/X] where C = - co if

Ah1(x) > (1-A)h (x) for all x, then 2

co -1 -1k S(A) = (1-A) 1 (-1) (7 (x){ (1-2)h (x)) h (x)dx k=0 -co (Ah1 2 1

CO -1 k co ,-1 k A I (-1) [(1-A)h (x){Xh c 2 1 (x)Y h (x)dx, k=0 2 - 19 - where the first integral is taken to be zero if C = -0.. For example, if 2 -1/2 2 2 hi (x)= (2no ) expf-(x-pi) /2a j, d = (U1-112) /a, r = 1o01-X)/A],

0(y) = g.,(27)-1/2 exp(-t2/2)dt, then

CO k 2 k S(X) = I (-1) expik(k+1)d /21(X 1{(/-A)/A} {1-0((2k4-1)d/2 + r/d)} k=0

k + (1-A) 1{X/(1-X)} {1-0((2k+1)d/2 - r/d)}).

In §2.3 we shall derive general forms for all the entries of the information matrix.

For m > 2, Behboodian (1972) has considered the problem of estimating k with known R, by Bayesian methods. He shows that if the prior for is a finite mixture of Dirichlet distributions then so is the posterior for k.

He derives the posterior mean and variance for A when m = 2. One advantage of the Bayesian approach is that if the prior for A is such that 0 < A < 1 with probability one, then the posterior for A also gives 0 < A < 1 with probability one for all finite samples. Hence the posterior mean, a possible estimator of A, is in the open interval (0,1). This is not always true of maximum likelihood and moment estimators, as we have already seen.

We shall discuss our model in a Bayesian setting in Chapter 4.

2.2.2 General p-Variate Case h Day (1969a) considered the estimation of (A,Z) when m = 2 and is the density function of a multivariate with unknown mean, kd, and completely unspecified common covariance matrix g. There are a 2 total of p /2 + 5p/2 + 1 parameters to be estimated. Day (1969a) discussed the moment and maximum likelihood estimators and he mentioned an iterative -20-

scheme for solving the maximum likelihood equations. He noted that the 2 computations involved for minimum x estimators and estimators involving fitting the cumulative distribution function with the empirical distribUtion function lead to prohibitive computations. He also dismissed Bayes'

estimators for the same reason. (However, if we use a prior suggested in

§5.4 then the posterior means of the unknown parameters could be approxi- mated using an algorithm similar to that given in §4.5.)

To derive the moment estimators, the first four moments are needed to

estimate all the unknown parameters; however, there are more of these

moments than there are parameters, so Day (1969a) considered a function of

the third and fourth sample moments which is invariant under rotations of

the sample space. Day (1969a) pointed out that for p > 2 the moment esti-

mators are poor compared to maximum likelihood, especially when 2 T -1 A = (k -g2) g (gi g2) is small. In the univariate case, the moment 1 estimates are sufficiently similar to the maximum likelihood estimates that

the former may serve as a starting point in his iterative scheme for

solving the maximum likelihood equations. However, for p > 2, the moment

estimates are so poor that they do not serve as good starting points.

The biases of the maximum likelihood estimators can be appreciable.

Day (1969a) produced tables of his simulation results to show this. For

example, with A = [(g1-g2)1g 1(81-82)11/4 = 2, X = 0.3, p = 2, and 1/2 n = 200, then E(A) = 2.24, and (Var(A)) = .32. Clearly, very large

samples are needed in many situations to get good estimates of the

parameters.

Let us now consider the maximum likelihood estimators of a general

mixture with density - 21 -

n m x (C7 )...);Cniq,) = [ / Aihi(Kklk)], k=1 i=1 where AEA = {a/ ... X m)10 < Xi < 1, for i = I x. /1, and i=i

O. It is possible that certain density functions may give rise to unbounded likelihbods, so considering the stationary points of the likeli- hood function does not necessarily yield the maximum likelihood estimate.

However, if the likelihood is maximized at (,Z), an interior point of

Axe, and if 9/og[hAlkWak exists for i = 1,...,m and for all k in the interior 0, then (k,,k) satisfies the following equations:

A ihi (Kk I V P(i l x ) = for i = 1,...,m; k = 1,...,n, m I A.12.(x 2 2 fuk le) i=1 (2.2.2.1)

n P(il k ) - nAi = 0 for i = 1,...,m, k=1

n m ^ P(il k )01og{hi (Kk lp,)}/3k] = 0. k=1 i=1

Some areas for further research are: (i) which numerical method should be used for finding the maximum likelihood estimates for a parti- cular model, and (ii) for how large a problem is the method computation- ally feasible? Day (1969a) has proposed an algorithm for the case where. the components of the mixture are multivariate normal with completely unspecified common covariance matrix. Wolfe (1970) and Hartigan (1975, pg, 113) proposed different algorithms which handle more general density functions than Day's (1969a). The application of Newton's method to

(2.2.2.1) is a third possibility. -22-

The asymptotic distribution of the maximum likelihood estimates may ttniez. often be found. If k is a q -dimensional vector, then undcr suitable

regularity conditions (see, for example Cox and Hinkley (1974), pg. 281), T T T -k ) is asymptotically (m+q-1)-variate normal -/ with zero means and covariance matrix qQ,k).) where ,E(k,k) is the T T information matrix for (A ,61 ) . We can partition this matrix / into

Cxx -TA()) = ka Zee

where has (i,j)-th entry given by

ffh i k)-hm(ek ) (kip) -hm (,d J , Aihi (k I V ]-1q,

ZA0 has (i,L)-th entry given by

f,hi w .11.020/og(h( k))/DR, 1] [ A.h.(0)/-/ k)-hmfk i v i, 7 A 2, 1 2 j=1 3 i=1

and zee has (k,2)-th entry given by

m m X. h.rophi revolog(hi wkwaekmalog(yekwayi i=1 J=1 2 j 2

x Aihi wk),-/d,. i/ -23-

These integrals are difficult to evaluate in general, so the method of

scoring suggested by Kale (1962) to solve the maximum likelihood, equations

may lead to prohibitive computations.

2.3 Information Matrix for a Mixture of Two Densities

In this section we propose a technique for evaluating the information

matrix when m = 2. In this case Hill (1963) has shown that

-1 = (A(1-X)] [1 - 6de)h adviffAipp 1(2;6 (2.3.1) AA / q, 2 ft

He evaluates the integral by expanding the integrand into a series. We

use a similar expansion for all the terms of the information matrix in our

example, where hi ,z) represents the multivariate normal density ( Iki,k2 function with mean u. and covariance matrix g.

Now, let us simplify the terms of the information matrix for (k,k)

when m = 2. For convenience we denote hi(k) by hi and f,( 1,p,) by f.

We let E. (.) denote the expected value when the random variable has

density hi. Also E(.) = AE1(.)+(1-X)E2(.). Let u = 1og(121/h2). We

assume that Ei O/og(hi)/Dk] = j.

The term corresponding to IAA is

E(fDlog(f)/DA 12] = I (121-h2)2[Xy(1-A)h 1-1q

-2 u -1 = A E [(Ae +/-A) - 1]. (2.3.2) 2

It is easily verified that (2.3.2) is equivalent to (2.3.1). The i-th component of kxe is

- 24 —

E[{aiog(f)/aX}{a/og(f)/aei}l

f(h -h )012 {alog(h )/ae.}4-(1—x)h Olog(h )/DO / 2 1 / 2 2 2 13

-1 x (Ah +(1-X)h ] c7,6 1 2

u -1 = A 1E fa- Agau/Dei l(Ae +1- X) - Dlog(h )/36.] 2 / •

(2.3.3)

Finally, the (i,j)-th component of 46 is

Eguog miae j43/og(f)/DO 2 .13

= fOth {alog(h )/90.}+ (1-X)h ( 3log(h )/30 / 2 2 2 i}]

x Oth {a1og(h )1/30.}4(1-X)h {alog(h )/ae 1 2 2 j}]

x 0,121+(1-X)h21-1ck

AEI ( alog(hi ) 3log(h,) . = L 36. DO. a

D1og(121) 31og(h2 ) 3log(121 ) au E2 ((1-A){ DO. DO DO. D6. m 2

(1-A)2 Du Du

(2.3.4) Xeu+1 -X -25-

Expressions (2.3.2) to (2.3.4) may be used for computing the

information matrix. They may be easier to handle than the original p-fold

integrals, if the distributions of the various functions of u, au/ak,

D/og(hi)/3p, are known when k has density function hi or h2 .

We now apply expressions (2.3.2) to (2.3.4) to the case where hi is

the density function of a p-variate normal distribution with mean Ri and k z -/ covariance matrix 9/. Let Sk (a) = s(z {Ae +/-A} )when z is normal with 2 2 mean aA and variance 0 . Therefore, (2.3.2) yields

-2 IAA = A (.5 (1/2)-11. 0 (2.3.5)

Now applying (2.3.3), we have

E({3log(f )/9 A}{a1og(f)/apii}1

1 = E u 2 1{(/-X)w/(Xe #1.4)}-w],

-1 -1 T -1 T -/ where w = {g -kpli and u = (k1-k2) ci-(//2)(k2 g k2-ki g

(Throughout, we use the notation { }. to refer to the i-th component of 2 the vector inside the curly brackets, and { }. . to denote the (i,j)-th ij entry of a matrix.) Now, (u,w) is bivariate normal under h2, so by first conditioning on u and taking conditional expectations, using standard results in multivariate normal distribution theory, then taking expecta- tions with respect to u, we obtain

Egalog(f)/DX}{9log(f)/aki }1

= (Ak)-1 (k -k2)(14.(1-A) {s1(-1/2)/02 - (1/2)S (-1/2)1). 1 0

(2.3.6) - 26 -

Using a similar argument as above, we find that

E[{3/og(f)/3A}0/og(f)/N21/

-1 2 = (1-A) (Aa) (ur u (-//2)//1 + (1/2)S (-1/2)1 . (2.3.7) iv 21 o

Applying (2.3.4),

ET{81og(f)/D 0/og(f)/Dp p P/i } /j

2 u = Aci j-(1-A)w .w (1-A) w .w .{Ae -1-1-A } 11 , 2 ILE2 2 2 .7

r 1 r / where wi = tZ (Ik/)}.71 9 wj = lg (1 -ki)/j,

T -1 ij - . Now (u,w.,w.) is u = (kl-k2 )Tg 1e(1/2)( k2Tg 1k2k1 g k1)' = (g 11ij trivariate normal under h2' so again using standard results in multivariate normal distribution theory, conditioning on u, and then taking expectations with respect to u, we get

Eg8log(f)/N11{81og(f)/3k1 }T ]

1 1 T 1 = (2A-1)g - (1.-X)g (k1-k2)

f{g 1 -1 -1 )s0 ÷ (1-X)2 (kl-k2 ) (kcz-k2 )Tg (

, -1 4 1 k2) g {-si(-1/2)/2 S (-1/2)/1 31 .Z (k1- (ki-k2)T 1 2

(2.3.8)

Using similar procedures, we arrive at - 27 -

.ETD/og(f)/Dk1}{3/og(f)/ak2}21]

= -(1-A)2({g-14-g-I(kI-k2)(ki-m2)Tg-1((//4)-A2)}.50(-1/2)

-1 T 2 -hg (ki-m2)(k1- 2)g{-s1(-1/2)/A S2(-1/2)/A4}J,

(2.3.9)

E[{9log(f)/3k2}{a1og(f)/N2}T1

2 -1 1 = (1-A) ({R, 4-g (),/-k2)(k1-k2)Tg }s0

T +g /( k/-k2)(k/-k2)g 1(s1(-1/2)/A2 S2(-1/2)/A4".

(2.3.10)

Now, for the information matrix entries corresponding to terms in g, we could consider a number of cases; e.g., (i) g = Or, where A is unknown, (ii) g is completely unspecified, (iii) a = diag(01,...,8p) where 01,...,0 pare unknown.

The appropriate terms for case (i) are

EVD/og(f)/n}{a/og(f)/30}] = -(Xe)-1[(/-A)S1(-1/2) A2/2],

(2.3.11)

s[{a1og(f)/9k1){a1og(f )/ae}]

-2 2 - - - 2 - - = 0 (1-A)(u -u )(14-(1/2)A 4(1 )0{S2( 1/2)/A (1/2)S1 ( 1/2)}],

(2.3.12) -28-

E({DIog(f)/3k2 }{31og(f)/D8}1

-2 = El (1 2 -X)lk1-k2 )(14- (1-A){.52 (-1/2)/0 4. (1/2) S1 (-1/4],

(2.3.13)

2 -2 2 2 2 E[{31og(f)/a0} ] = [(1/2)p - (1-4)A {14-(1/4)A } (1-A) 2 (-1/2)].

(2.3.14)

Cases (ii) and (iii) may be handled similarly but the expressions are quite long and tedious.

In order to evaluate these expressions, we must be able to calculate k = 0,1,2. Sk(-1/2) for This can be facilitated by considering

k S(a,t) = I St.(a)t /k!, k=0 so that S (a) = kS(a,0)/3tk k . We can show that

S(a,t) = expi(1/2)2(2at+t2 )180 (a+t), so that

2 S (a) = ak, S (a)4-{dS (a)/da } 1 0 0

2 4 2 2 +A )S (a)42aA 2 2 S2 (a) = (a A 0 {dS0 (a)/da}+{d S0 (a)/da }.

(2.3.15)

To evaluate S (a), 0 we expand

2 1/2 2 2 S0 (a)= (27rA ) - f _wm pe u 1-Allexp[-(u-aA2) /2A ldu as a series. Let z = /og((/-X)/A]. 0 We split the integral into two parts: -29- one over the range (-w,z0-e) and the other over the range (z04-e,c9, and u -/ let c tend to zero. We let (ae 4-/-X)

00 -/ r k k (1-A) (-1) [A/(1 -X)] elcu if u < z -e k=0 0

03 -1 k k -( 1)u A I (-1) [(1-4)/A] e k+ if u > z 4-e k=0 0

lx.(27)-1/2exp(- 2 Letting 1(x) t /2)dt, we obtain

co -1 k k 2 S0(a) = (1-A) I (-1) Ot/(1-X).7 exp[(1/2)A k(k+2a).74)(z0 - A(a+k)]

k=0

CO -1 k-1 k-1 2 +A 1 (-1) [(1-A)/X] exp((1/2)A k(k-2a)1(.1.-44z - 0(a-k)1]. o k=0

(2.3.16)

Now, dS0(a)/da and d2S0(a)/da2 may be derived from this expression and we can obtain S (a) and S (a) using (2.3.15). Therefore we have obtained 1 2 f6rmulae for deriving all the terms of the information matrix.

We have given a few possible assumptions that could be made about the extent to which the covariance matrix is known. There is a close analogy between these assumptions and some of the proposals in the literature for defining a dissimilarity matrix, which, we already mentioned in Chapter 1, is a common starting point for many clustering algorithms. For example, if we assume that p, = 0/, then ordinary

x. Euclidean distance between x.xi and 'to is an analogous distance (or dissimilarity) measure between the i-th and j-th observations. On the other hand, the dissimilarities based on the quadratic form -30-

T-1( 6i-c1), where 11,4 is a positive definite matrix not generated by the data, is analogous to assuming g = 0A. (Since there exists a -/ T decomposition A = P P, then gk = ,P rk for k = would be multivariate normal in the i-th cluster, with mean kki and covariance 6/, so by a

transformation of the data we are back into the case g = 01.) If A is generated by the data, then more unknown parameters will need to be introduced. For example, Cormack (1971) stated that some authors suggest T the distance between units should be (ki-v / 676i-V where le is a diagonal matrix with positive entries on the main diagonal, generated by the data. This case is analogous to assuming that g = diag(61,...„6) where 01,...,0 are unknown. For all these cases, one can compute the information matrix using the techniques given in this section. This matrix can be used to construct approximate confidence bounds for the parameters.

However, we must be wary of such a procedure of estimating the parameters and constructing confidence bounds. As we have already pointed 2 out, for small A the estimates may be quite biased unless the sample size is very large. We can refer to Table 3 in Day (1969a) for some simulation results on the biases and the variances of some of the parameters.

2.4 Discussion

Although the method of maximum likelihood yields asymptotically efficient estimates for the parameters of a mixture of multivariate normal distributions with common covariances, the estimates may be biased and have large variances for many practical problems. Therefore, it seems that mere substitution of the parameter estimates into P(il k) in order to cluster the units may lead to poor results, especially for p > 1.

For m = 2, it is often feasible to estimate the information matrix - 31 -

using techniques similar to those outlined in §2.3, by substituting the

maximum likelihood estimates of the parameters into the terms of the

matrix, so that approximate confidence bounds for the parameters may be

derived. Suppose, however, we wish to find a set of estimates {g}

(depending on the data which are random) such that the fixed true grouping

vector g is in that random set with a particular probability 1-a, say.

Such a set could be considered a confidence set of clusters. However, it

is not clear how one can find such a set.

A slightly different approach to cluster analysis was suggested by

Scott and Symons (1971). They consider the likelihood as

n = h le) 1 g(k) '11

and treat (g,,) as the unknown parameters. To find a g which gives the

estimated clustering, they propose that (g,p,) should maximize this

likelihood.

For example, if h i is a multivariate normal density function with mean u. and common completely unspecified covariance matrix and for

Qi = {k1g(k) = and

- = n i /L keQ. 2

where n. is the number of elements in Q2., then g minimizes 1

m 1 I (kii)(Kkii)11. i=1 keQiI -32-

Ifontheotherhand,thecovariancesaredifferentforeachh„theng 2 minimizes

m - E 06k —kci,(k1 Si) a' i=1 kcQi

(Note that this is incorrect in Scott and Symons (1971).)

It is difficult to discuss the asymptotic properties of this procedure since the number of parameters to be estimated is 0(n). However, we shall show in Chapter 5 that in a Bayesian setting, for a particular prior for g strongly weighted in favour of near equal-sized clusters, and if the prior for the other parameters can be approximated by a flat prior, then maximizing the approximate for g gives the same clusters as Scott and Symons (1971) technique in the case of multivariate normal components. Therefore, it is not surprising that such a clustering

tends to yield groups of about equal size.

Throughout this chapter, we have assumed that m, the number of groups is known. However, if m is not known, the problem of clustering becomes more difficult. In order to employ the method of maximum likelihood, m must be fixed. Therefore, we would like to establish some test statistics for the hypothesis m = mo, against the alternative m = m1, for given values of m and m , in order to specify m 0 1 before proceeding with maximum likelihood estimation. The special case of m = 1 and m = 2 will be the 0 / subject of the next chapter. - 33 -

Chapter 3

SOME TESTS FOR HOMOGENEITY

3.1 Introduction

The previous chapter considered the estimation of the parameters of

a mixture, assuming that m (the number of distributions in the mixture) is

known. However, if m is unknown we need to specify it before proceeding

with many of the methods discussed. One way of accomplishing this would

be to consider the sequence of hypotheses,

H.: m = i (i = I,2,...). (3.1.1)

We could begin by testing HI againstH2; if H1 is rejected we test H2

againstH3 andsoonuntilTh a is accepted, in which case we proceed by letting m to be i. Alternatively, we could start by testing Hk_i against

H (where k is a prespecified maximum number of groups) and if H is k k-1 accepted we test Hk..2 against Hk_l and so on until H is rejected (and i-1 we proceed with m = i) or H1 is accepted (and we proceed with m = 1).

(These procedures are analogous to choosing the degree of a polynomial

regression, for which Anderson (1962) advocates the latter approach.)

In this chapter the parametric model we consider under H. is (2.1.1) for m = j. We shall concentrate on the relatively simple problem of

testing the hypothesis H1 against the alternative H2. We attempt to derive similar tests for relatively simple (and, in practice, usually artificial) cases. We also discuss some problems associated with the maximum likelihood ratio test. We propose some tests for these simple problems, but we find that the interpretation of the results of the tests depends critically on some of the assumptions, and therefore in many

-34- practical situations the proposed tests are not useful. Also the tests do not extend easily to the cases where m > 2 under the null hypothesis.

If X is the mixing parameter when m = 2, then testing H1 against H2 is mathematically equivalent to

Hi: X = 0 or / (3.1.2) H : 0 < A < 1. 2

Now there are common cases where A may be replaced by /-X, leaving the data with essentially the same likelihood. We define such cases formally as follows. For p, = (01,...,04) let tir = (0v(1),...,0n(g)) where n is a member of the permutation group on the set [1,...,0. If there exists a group element 7 of order 2, such that hAlkir) = h2(t16) for almost all

then under H2 we have

f OE IxrV = xh1la+(1-x)h2 ( lk)

= (1-X)ht(0)+Ah2 (x I kn )

= f(id/-X,ks). (3.1.3)

Whenever (3.1.3) holds for some 7 we have a lack of identifiability of the parameters (X,)) when the original parameter space is retained. Then without loss of generality we may assume A < 1/2. In this case (3.1.2) becomes

X = 0 H1: (3.1.4) 0 < A < 1/2. H2: -35-

2 For example, .for i = 1 or 2 if h.(xlp ,p ) is a normal density function 1 2' a 2 2 with unknown mean pi and unknown variance a , then for k (p/,p ,a ) condition (3.1.3) holds when Ca(1),7r(2),7(3)) = (2,1,3).

However, if (3.1.3) does not hold for any permutation and if the hypothesis in (3.1.2) is accepted for a particular , the natural question of whether A = 0 or A = 1 usually arises in practice. Thus we would be concerned with a three-decision problem,

(i) A = 0, (ii) A = 1, or (iii) 0 < A < 1. (3.1.5)

There are a number of ad hoc ways of dealing with (3.1.5), but optimal criteria are difficult to specify. One possibility is to let

n n (/ if sup II hi(ialk) > sup II h2 klk) k k=1 k=1 (c

A* =N (3.1.6)

[0 otherwise, and then to test

H1: A = (3.1.7) H2: 0 < A < 1.

Another possibility is to first test

A= Hl: 0 (3.1.8) H2: A = 1

using a method such as one of those suggested by Cox (1962) or by - 36 -

Atkinson (1970) for tests of separate families of hypotheses; then we test

(3.1.7) where A is 0 or 1 according to whether (3.1.8) was accepted or rejected. The probability of taking a particular decision in (3.1.5) using such methods is difficult to derive analytically.

In this chapter we concentrate on the problems (3.1.4) and (3.1.7) * without referring to the more difficult case when A in (3.1.7) has been generated by the data. Without loss of generality we may assume that * A = 0.

One possibility for testing

H1: X = 0 (3.1.9) H2: A c w, where w is either (0,1/2] or (0,1) is to use the maximum likelihood ratio test. This was suggested by Wolfe (1970) for the more general problem of testing Hi against H.+1 in (3.1.1). However, the null distribution of such a test may be difficult to derive even asymptotically. For example if h and h are completely specified (no nuisance parameters) then the 1 2 test statistic is

n A121(kk"-(14)h2( k) T( 1,...,kn) = 2logf 1I /, (3.1.10) k=1 h2( k) where A is the maximum likelihood estimate of A under H . Now, 2

-1 i 0.1 = Pr(A = 0] = Prjn L {hi(kk)/h il< Pr[T(,...,) n L 2"ticl k=1 -37-

provided there exists a k such that 1.110W X h2( k); see Macdonald (1971).

If (i) A = 0, (ii) h1 (fir)= 0 whenever h2W = 0, and

(iii) 1 < f[121002/h2001dx < 00, then by the central limit theorem

n - n 1/2 I hi(V/h200 n1/2 k=1 is asymptotically normal with zero mean under the null hypothesis. In

this case we have

lim = 0] = 1/2 n-}00

2 and the test statistic (3.1.10) is not asymptotically x . The reason for 2 it not having an asymptotic x distribution is that under the null

hypothesis A is on the boundary of the parameter space {0} u w, so regularity conditions for the likelihood ratio test being asymptotically 2 X are not fulfilled. However, if h1(Wh2W is bounded for all x E Whi(V > 0) then Ah1(0+(1-A)h2(0 is a density when A is in some

E-neighbourhood of zero. Therefore, if A is the maximum likelihood

estimate of A over the set (-6,0) u w, then the test statistics (3.1.10) is -* -* A (x )4-(1-X )h2(x ' h1 rkk WI: ) 2log( II ] if A > 0 k=1 h2( k)

(3.1.11)

0 otherwise.

Under suitable regularity conditions (see for example Cox and -38-

Hinkley (1974), pg. 281), when A = 0 we have (i) n112i is asymptotically normal with mean zero and variance v = 11{(121-h2)2/h2}q1-1 and (ii) asymptotically (3.1.11) is

2 (A ) /v t o (1) if A > 0 (3.1.12)

0 otherwise.

Hence the asymptotic null distribution of T(k1,...q41) in (3.1.10) is 2 (1) with probability 1/2 and zero with probability 1/2.

A difficulty with the maximum likelihood ratio test in general is the derivation of its null distribution. If this can be found (at least asymptotically) it may prove to be a powerful tool for testing the number of clusters. We shall see in §3.4 that when h/ and h2 are p-variate normal,density functions (p > 1) with unknown means ki and k2, and completely unspecified common covariance matrix the maximum likelihood ratio test has a null distribution not depending on any of the unknown parameters. We might simulate it to get its null distribution, but as 2 T Day (1969a) points out when A = (u u ) a -u ) is small (the null - -1(u q,2 2 hypothesis is equivalent to A = 0) there tend to be many local maxima in the likelihood function and also convergence to the maximum likelihood estimate is slow. Therefore, if we wish to find the critical regions accurately for many values of p and n, simulations are costly.

We attempt to derive reasonable tests for (3.1.9), concentrating on local tests for alternatives near A = 0. In §3.2 we discuss local tests under the unrealistic assumption that h2 is completely specified. This demonstrates certain difficulties which may occur even for this simple case. In §3.3 we discuss local similar tests when h has minimal 2 -39-

sufficient statistics which are boundedly complete under the null

hypothesis. In §3.2 and §3.3 we exemplify our results when hi and h2 are

p-variate normal (p > 1) with common covariance matrix.

3.2 Local Tests For Simple Hypotheses

Assume h is completely specified. If h were also completely 2 1 specified, the locally most powerful test of the hypothesis A = 0 with

alternatives near A = 0 is given by rejecting the hypothesis if there

exists a k such that h2( k) = 0 and hi(,xLk) > 0, as well as rejecting the

hypothesis for larger values of

-1/2 2 1/2 T = = n hi(,k),h2(,k) - n . (3.2.1) k=1

(Note we have discussed the maximum likelihood ratio test for this simple

case in §3.1.) The exact distribution of Tn may be difficult to derive,

but if h1 (fir)= 0 whenever h2 (fir)= 0 and

1 < n = f[h1()2/h2().1q < co, (3.2.2)

then when A = 0, by the central limit theorem Tn is asymptotically normal

with mean zero and variance n-/. If in addition

= ![h1( )3/h2( < co (3.2.3)

Tn n1/2x(n 1) then for any A we have by the central limit theorem that -

is asymptotically normal with zero mean and known variance depending on

(X,n,Q. The test is consistent for all alternatives. -40-

Example 3.2.1

Assume

6 = (27)-1)/2 i0-1/2expf(-1/2)( i)T -1(r6i)] (3.2.4) where (61,62,6) are known. These assumptions are artificial because

4 are rarely known in practice, but, as we shall see in Example (k1462 3.2.2, relaxing the assumption that ki is known makes the problem more difficult and we end up suggesting a heuristic test.

The test statistic T in (3.2.1) is n

-1/2 -1 T -1 n ri + (1/2)(k2 g 1 n1/2. ki-k2) T Z1 4k + k2-)61 g k )] • k=1exPi(

(3.2.5)

The value of n in (3.2.2) is exp(A2) and the value of in (3.2.3) is 2 2 T - 1 exp(3A ), where A -62). Therefore, the asymptotic power = (g1-k2) g 061 2 of the test depends on (X,A ) and it can be verified that for any fixed 2 X > 0, the asymptotic power is a monotone increasing function of A .

h depends on some unknown It can be seen from (3.2.1) that if 1 parameter pi then in general the locally most powerful test depends on the value of Q. In this case we must resort to some heuristic procedure.

Example 3.2.2

Continuing with Example 3.2.1, we now assume that (62,6) are known 1 but ki is unknown. We see from (3.2.5) that for k g (61-k2) the test

1/2 - 1 1 statistic in Example 3.2.1 is equivalent to n (14-k eo(1Q1)] where

-1 x = n ix If we consider local alternatives for )6 near k defined to r1.4c. 2 - 41 - be those for which all the components of k are small, then the test statistic becomes equivalent to rejecting the hypothesis for large values

of

1/2 T - n k (c-k2)/A, (3.2.6)

T -1 1/2 T 1/2 where A = ((k1-k2) g (ki-k2)] = (k 0) . When A = 0 then (3.2.6) has a standard normal distribution. The critical region is exactly the same as that for testing the p-variate normal simple hypothesis and simple alternative

Ho: k = k2, g given

H1: k = ki, g same as in Ho.

If we wish to test over all directions d, a heuristic procedure would be

to use the same critical region as for the p-variate normal case

Ho: k = k2, g given

H1: k2, g same as in H k # o'

- T -1 - Therefore, we propose the test statistic T = n(-k2) g (.-k2) which under

the null hypothesis has a x2(p) distribution. Under the alternative and

conditional on g, the "true grouping" vector, it can be shown that T has a 2 noncentral x distribution on p degrees of freedom and noncentrality 12 2 parameter n A /n, where ni is the number of components of g equal to one.

Therefore, the density function for T under the alternative is

LZI `11' k=0 k - 42 -

2 where f (zIv,a) is the density function for a noncentral x distribution

on v degrees of freedom and noncentrality parameter a. Thus the power of

the proposed test may be computed numerically for any size test and any 2 value of (p,n,X,A ).

Assuming that (k2,g) is known is very artificial for most practical

cases. However, we have demonstrated that even this simple case leads to

using heuristic arguments in order to find a test statistic.

3.3 Local Similar Tests

In this section we make the more realistic assumption that there are

nuisance parameters under the null hypothesis; we restrict ourselves to

the case that when X = 0 the minimal sufficient statistics are boundedly

complete. With this restriction we consider local similar tests, but

usually no locally most powerful similar test exists. In Examples 3.3.1

and 3.3.2 we use a heuristic argument to derive reasonable tests in the

case of p-variate normal components with common covariance matrix.

We let k(k1,...442) be a vector of minimal sufficient statistics for

when k1,...442 are independent and identically distributed with density

function h2(k1k). Since we assume k is boundedly complete when X = 0, all

similar tests of size a are size a conditional on the value of k, for

almost all s; see Cox and Hinkley (1974), pg. 135. If under the alter- * native we specify A = R, then any similar test rejects the hypothesis if

the density of ki,...,kn given k is (i) zero under the null hypothesis, and (ii) non-zero under any alternative for X. For alternatives local to

X = 0 we also reject the hypothesis for large values of

_ h -1/2 *),h062k le) n1/2, (3.3.1) 166.71"kn i n k_1/k (kik -43- the distribution of which (given 0 does not depend on k under the null hypothesis.

The difficulty with this test statistic is that in general it yields different critical regions for varying values of k . When 8 is not known we must resort to a test which is suboptimal for most particular s.

Example 3.3.1

Assume

2 2 -p/2 hi lkalk2fa ) = (21ra ) exp[(-1/2a2)(c-Pi)7(6-ki)1 for i = 1,2,

2 where (k1,k,2,a ) is unknown, so that under the null hypothesis the data have a p-variate spherically normal distribution with minimal sufficient statistics

n - -/ 2 T = n L and s = (k- ) 676k-i0 k=1 k=1 which are boundedly complete; see Lehmann (1959), pg. 132. (Recall from

§2.3 that if the covariance matrix is known up toaproportionality constant then the data can be transformed to yield spherically normal components.)

If under the alternative we specify ki = ki for i = 1,2, and a2 = (a2)* * * 2 * then for 61 -k2 )/(a ) the locally most powerful test statistic for k ( alternatives local to A = 0 is given by (3.3.1), which is equivalent to 2 the test statistic Eexp( Tkrk) where kt and s are assumed fixed. The critical region changes with k. Now, the null hypothesis is mathematically equivalent to k j. If we consider alternatives for which k is near zero then for p > 1 the similar critical region of size a is given by - 44 -

(sTyyTx i T s2 c _ (3.3.2)

- 2 whereX= a p x n matrix and c is such that given (k,s ,k)

the probability content of the critical region under the null hypothesis

is a. This test rejects the hypothesis if the sample variance of the

lengths of the projections of the data vectors onto k is large relative 2 to s2/p(n-1), an estimate of a under the null hypothesis. Therefore,

this test depends on the assumption of sphericity under the hypothesis and

if this is not true then the results could be misleading.

If we let kk = (kk-k)/s, then under the null hypothesis, conditional 2 on and s we have that (ki,... n) is uniformly distributed over

{(ki,...,kn)IIkk = 0, rLk kT kk = 1). Therefore ( 1,...,kn) is independent

of /6 and s2 under the null hypothesis. Since the left hand side of (3.3.2)

is a function of ki,...,kn we have that c in (3.3.2) does not depend on 2 or s . Also since the null distribution of (x ) is invariant under rkd any orthogonal transformation of all the observations we have that the null

distribution of the test statistic given in (3.3.2) is invariant under

orthogonal transformations of k so c does not depend on k.

In general we do not know the direction k, so we wish to find a test

over all directions. One suggestion is to take as an overall critical

region the union (over k) of the critical regions of fixed size, an

application of Roy's (1953) union-intersection principle. This would yield the test statistic

T T 2 Y = mr(kTklt kik ks ), (3.3.3) - 45 -

2 the maximum eigenvalue of UT/s . Since y is invariant under any location, scale and orthogonal transformation of all the observations, the power of this test is a function of (A,t2) only.

It can be shown that y yields the same critical regions as the maximum likelihood ratio test for the multivariate normal hypothesis and alternative

H : f = f O 1 2 = =f

H1: f1 > f2 = = f where fi is the i-th largest eigenvalue of the variance-covariance matrix.

If the component distributions are not spherically normal, the results of test statistic (3.3.3) may be misleading.

2 For p = 1 and local alternatives of (5 = (p -11 )/a near zero, the / 2 locally most powerful similar critical region for alternatives of A near zero is

- 3 2 3/2 sign(S) y ock -x) ps ) > c. k=1

By an analogous argument as above we can show thatc does not depend on x 2 or s . If we wish to test the hypothesis without caring whether p > p 2 or p2 > pi, we may take the union of the two critical regions for a fixed size to yield

n[1(xk 70312/II(xk-30213 (3.3.4)

as a test statistic. This is invariant under location and scale 2 transformations so the power of the test is a function of (A,A ). -46-

Expression (3.3.4) is a (weakly) consistent estimator of 3 2 23 U.-t-Z“5 [A(1-4)(1-2)0A ]/[1+A(1-A)A so the test is consistent under A = 1/2.

If A = 1/2 and p = 1 the locally most powerful similar test of 6 near

zero in the alternative is given by

7/x _14/rVfx _;127 2 < c Li k p. .LLI k J

where 5 = (p14112)/2. Since x is a strongly consistent estimator of 5, a possible test statistic is to reject the hypothesis when

x _704 /[(x _70212 (3.3.5) LI k L k k x) 4/

is small. Since the distribution of each observation is platykurtic when

A = 1/2 for all all,p2), the test is consistent when A = 1/2 in the alternative.

When p = 1, we might like to combine tests (3.3.4) and (3.3.5) if we wish to test for all A, but it is not clear how this should be done.

One suggestion is to first test (3.3.4) and if the hypothesis is accepted we perform test (3.3.5). It is difficult to derive analytically the power of such a test, but computer simulations are feasible. However, because of the very limited scope of the test we do not perform the simulations in this thesis.

Example 3.3.2

Assume that h and h are multivariate normal density functions with 1 2 unknown means ki and k2 respectively and completely unspecified common

-47-

covariance matrix g. Under the null hypothesis, minimal sufficient - r statistics are = n and = ,XT, where x is defined in (3.3.2). -1 Gk These are boundedly complete under the null hypothesis; see Lehmann (1959), pg. 132. We let k = g-1(k1-k,2) and consider alternatives for X near zero. For alternatives of k near zero we obtain the test statistic

I(421;031(kTaT,)3/2 (3.3.6)

where kk is the k-th column of X. This is the same as the locally most

powerful similar test in Example 3.3.1 for p = 1, where the observations

T T are (A ;En ), so it is unsatisfactory for alternatives such as A = 1/2 when the test is inconsistent. The distribution of (3.3.6) does

not depend on (,q,k). Again the test depends on the direction ,, but if

we wish to test over all directions we might consider the critical region

defined by the union (over k) of the critical regions of fixed size. This

yields the test statistic

[(kr T 3 2

max (3.3.7) T T 3 J

which is invariant under any location and non-singular linear trans- 2 formation of all the observations, so its power depends on (X,A ).

Expression (3.3.7) may be regarded as a measure of multivariate

sk6wness. Mardia (1970) has considered another measure of multivariate

skewness for testing a multivariate normal hypothesis with unknown

parameters against general alternatives, whereas our test has a specific -48- alternative in mind.

3.4 Discussion

In the last section we showed that locally most powerful similar tests can be found for testing A = 0 in a mixture of two p-variate (p > 1) normal distributions with common covariance matrix, if we know the value of -1 k = g (k1-x2) up to a proportionality constant. In most applications we do not know 6 so we had to resort to a suboptimal procedure. As we have already discussed, the resulting tests could give misleading results if some of our assumptions are not true. With real data this is usually undesirable.

We might consider invariant tests for this problem. For example, 2 2 when g = a I and (g1,g2,a ) is unknown, then the hypothesis and alternative are invariant under any location, scale and orthogonal transformations of all the observations. A maximal invariant statistic is

T T = X Vtr (3.4.1) where is defined in (3.3.2). The distribution of depends on the 2 2 2 parameters only through (A,A ), . The where A = (k1 k2) (k1 k2)/a hypothesis is mathematically equivalent to assuming (X,A2) = (0,0) so the null distribution of does not involve any unknown parameters. (An example of an invariant test is the maximum likelihood ratio test which we discussed in §3.1.) However, for n > p, then is an n x n matrix of rank p (almost surely), so the distribution of is singular and cannot be easily derived.

Similarly, when (k1,k2,g) is completely unspecified, the hypothesis -49- and alternative are invariant under any location and non-singular linear transformation of all the observations. A maximal invariant statistic analogous to (3.4.1) is the n x n symmetric idempotent matrix

= evT)-1x

where X is defined as in (3.3.2). The distribution of this statistic 2 2 -1 depends on the parameters through (X,A ) where A -16 (k -k2). = (k1 2)Tg 1 (The maximum likelihood ratio test is an example of an invariant test.

However, its null distribution is difficult to derive.)

2 For the two cases (i) g = a and (ii) g completely unspecified, the distribution of a maximal invariant statistic depends on only two 2 parameters A and A , but we cannot derive mathematically tractable tests.

However, the similar tests we have suggested for the case of testing a normal hypothesis against a mixture of two normal distributions with common covariance matrix are easily computable with high speed computers, and the critical regions which depend on p and n may be found by simulations. Also the tests are consistent for a much wider class of alternatives than those considered here. For example (3.3.3) may be used as a test of sphericity of a multivariate normal distribution against general multivariate normal alternatives, although the more standard test would be that in Anderson (1958, pg. 261).

The problem of testing m = 1 against m = 2 is very restrictive when in fact there is little knowledge about m. Unfortunately the tests discussed in this chapter do not extend to the more general problem of testing Hi against Hi+1 in (3.1.1) for i > 1. When i > 1 then no reduction -50- of the data may be made by sufficiency, so tests based on the distribution conditioned by the minimal sufficient statistics are not appropriate.

An ad hoc way of approaching the cluster analysis problem with unknown number of groups is as follows. First we test H1 against H2 in

(3.1.1). If H is rejected we find the maximum likelihood estimates of the 1 parameters assuming two groups and partition the units on the basis of

these estimates. Treating each group separately, we repeat the process until a prespecified maximum number of groups is obtained or until the tests on all the partitions yield acceptance of each hypothesis. The distribution of the number of groups and the parameter estimators using this clustering procedure is difficult to derive analytically, but computer simulations may be feasible.

The fact that maximum likelihood estimators may be quite biased, can have large variances and cannot be used unless the number of groups is known indicates that maximum likelihood estimation is often not a satisfactory approach to cluster analysis. In the next chapter we consider cluster analysis in a Bayesian framework and derive algorithms for a wide variety of parametric models. - 51 -

Chapter 4

BAYESIAN CLUSTER ANALYSIS: GENERAL

4.1 Introduction

Chapter 2 considered a general parametric model which seemed appropri-

ate for cluster analysis. In particular, we assumed there exists a "true

grouping" vector g = (g(1),...,g(n)), where g(k) = j means that the k-th

unit comes from the j-th group with density function h. in such a way that

, ...x A) = h (4.1.1) ./ '"— klq4r] 'n k1

We considered the special case where the components of g were independent and identically distributed with Pr(G(k) = = Ai, so that

m n. Po(Z1k,) = xi 1, (4.1.2) i=1 where c...,A )andn,is the number of elements in {k1g(k) = a In this chapter we reformulate our problem in a Bayesian context, using a general density for q.

The likelihood function for the observations 1,...kri is given by

(4.1.1), where now g and A play the role of the parameters of the model.

If we consider the simple case where g has density (4.1.2) and (ti,k) is known, then the problem of assigning units to particular groups (or equivalently finding g, the estimate of g) is the same as the classical discriminant analysis problem; see Anderson (1958), Chapter 6. This problem is often handled with the use of decision theory. For c.. being the cost 2.7 of assigning any unit to the i-th group when it should be in the j-th group -52-

the Bayes' decision rule is given by

g(k) = i iff i minimizes ), j=1 2J -7 J

and g = (g(1),...,g(n)). If L(g,g) is the loss function for g when g is

the "true grouping" vector, this Bayes' rules is equivalent to assuming a

loss function of the form

L(z,g) = I I n..c..,lj 2J i=1 j=1

where n.. is the number of elements in the set 2J fklg(k) = i, g(k) = j}.

The case of known k, but unknown k, where the hi's correspond to normal density functions and where there are certain data units for which the group membership is known, has been considered by Geisser (1964,

1966). He assumes a vague prior distribution for the unknown parameters, and develops the posterior distribution of g given the observations on all the data units, including those for which the group membership is known.

Since he assumes that for each of the groups there are units with known group membership, the groups are well-defined and the use of

Pr( G(k) = il l,..., ,71 is justifiable. However, in many applications of cluster analysis there are no units with known group membership and the purpose of the analysis is to try to determine which units are in the same group, without regard to the label associated with each of the groups. For such cases the quantities PrIG(k) = 11,761,...,,W are not appropriate.

However, for the (2) pairs (k,k) we could consider the probability that the k-th and 2.-th units are in the same group, given by -53-

Pr[G(k) = G(t)lki,...,Q. Each of these quantities may be regarded as a

measure of the similarity between the k-th and t-th data units.

Bock (1972) has .considered Bayesian procedures for classifying

normally distributed random variables with common known covariance matrix.

In a Bayesian analysis, the posterior probabilities may depend critically

on our choice of priors. However, we shall see in Chapter 5 that the

adoption of a particular procedure, such as one of those proposed by Scott

and Symons (1971) (see §2.4) leads to the same results as maximizing the

posterior distribution for a particular choice of prior distributions. In

a Bayesian decision theoretic framework we specify explicitly what

assumptions are being made, so that if the user is dissatisfied with the

results of a particular algorithm which may be derived from a Bayesian

decision theoretic analysis, then he needs to reconsider only his choice

of basic model, prior distributions and loss functions, and thus modify

the algorithm accordingly. In Chapter 7 we investigate numerically the

robustness of one of our procedures as the prior distributions are changed.

Many techniques of cluster analysis approach the problem by optimizing

some criterion over the groupings. For example the criterion may be tr(W), / tr(W B) or IBI/IW1 where W and B are the within and between group sums of squares matrices respectively; see Friedman and Rubin (1967). However,

these criteria seem quite arbitrary. One of the advantages of a Bayesian

decision theoretic approach is that the criterion to optimize is given by

minimizing the posterior expected loss.

In §4.2 we consider the general cluster analysis problem in a Bayesian

context. We investigate the problem of unknown number of groups in §4.3, and in §4.4 we give some results for conjugate type priors when the -54-

components are in the exponential family. In §4.5 we outline a method of

approximating the theoretical results when the exact results are not

feasible to compute.

4.2 General Results

In general, the parameters of the model (4.1.1) are (g,%). However,

for the particular case where we may assume there exists a k, such that

conditional on k the density forg is given by (4.1.2), we regard the parameters of the model to be (g,k/k). In this chapter we take the

Bayesian approach of regarding the posterior distribution of the parameters

(g,V (or (g,k,k)), given kz,...,kn, as a representation of our information about the parameters after observing the data.

We first consider the model where m, the number of groups in the population is known. We assume that the "true grouping" vector g has a prior distribution

(4.2.1)

For example, if there are parameters k such that (4.1.2) applies, then p (a) is the marginal distribution of g when k has prior p(k); that is, G%

m n. a PG( g) = I II A. PA( )dk (4.2.2) r%, i=1 2 k,

In general, we have unknown parameters k = (el,...,e ), for which the prior distribution may depend on g. This prior is given by

p2lekiv. (4.2.3)

-55-

If we wish, we could assume that when (4.1.2) holds for some k then we

specify the prior for k to be

(4.2.4) PRIeik(0 '0

which depends on k as well as g. However for many applications it will

suffice to assume that (4.2.4) does not depend on k; that is, if we know

the "true grouping" then our knowledge about the component densities does

not depend on how the "true grouping" was generated. This implies that if

the "true grouping" is known then the distribution for k given the data

depends only on the "true grouping" and not the data; that is

'41e1 g).

The likelihood function for the observable quantities given g

and Qis given by (4.1.1). Note we are assuming that if there exists a k,

such that (4.1.2) holds then the conditional distribution of

given (g,k,k) does not depend on ti. We can see that if there is a Itt, such

that (4.1.2) holds then the model (2.1.1) we introduced in Chapter 2 is a

special case of our general model when k and k, are known. In particular,

if (4.2.4) does not depend on g we have

n m ( n y Aihi(„),,A (4.2.5) k=1 i=1

On the other hand, if (4.2.4) depends on g then (4.2.5) does not hold, so

our model is more general than that considered in Chapter 2. -56-

4.2.1 Posterior Distributions

In general we have the unobservable quantities (g,k) or (k,g,k) about

which we wish to make inferences. We base these inferences on the

posterior distribution of the unobservable quantities. For example, the

posterior distribution of (g,Q) is

n pG(g)p (0g) h (; lk). 2q 1 g(k) k

(4.2.1.1)

A similar expression may be derived for

when there exists a k such that (4.1.2) holds.

If the actual value of k or (k,k) is uninteresting for a particular application and we are primarily concerned with the grouping vector g, then we need the posterior distribution for g given by

n (g1 m pG(g)fpol G(klg) (; lk)dre . Prq,1 1"..'n 1 k _ih g(k) k

(4.2.1.2)

As we pointed out in §4.1, we may be interested in

n i n Pr(G(k) = p(g)Ip(oq klg) h irk t)drq, der k=i g(k) ( s.t. g(k)=i

(4.2.1.3) -57-

or

Pr(G(k) =

n 0: I PG(g)fPolc(klg) n hg(k)( klk)dk; (4.2.1.4) ger ,I, ,I, ,I, k=1 s.t. g (k)=g 00

where r = {glp,(g) > 0), in order to perform the cluster analysis. These

are intrinsically interesting quantities to consider and simply investi-

gating these values for a particular data set may lead to an obvious

clustering or a set of possible clusters without resorting to loss func-

tions. In §4.2.2 we introduce loss functions to formalize our choice of a

particular clustering when a single choice has to be made.

For many applications we may be interested in making inferences about the parameters ,or q,k). However, in this thesis we concentrate on estimating g since this appears to be the main purpose of most clustering

algorithms.

It may be difficult to evaluate the integrals and sums of expressions such as (4.2.1.2) to (4.2.1.4). In §4.4 we give some general cases where

the integrals may be easy to compute , and in §4.5 we consider the approximation of the sums in a problem where F , the set of allowable groups, is too large for exact computation. We now consider the model in a decision theoretic framework.

4.2.2 The General Bayesian Decision Problem

To formalize actions based on the posterior distributions, assume there is a set D, the decision space, consisting of all possible actions. -58-

For a given action d c D and for each vector k of unobservable quantities,

there is a loss incurred, given by L(6, V. In a Bayesian decision theory

framework we consider those (5's in the set

(dodif d'cip then E(1,65401ki,..., n) < E[L(61 ,k)1 1,...471])

This is, of course, quite general and we now turn our attention to the

problem of specifying a loss function which reflects the aims of cluster

analysis.

4.2.3 Some Loss Functions

Of course, any loss function should depend on the purpose of the analysis. However, it is often difficult to specify precisely, especially

when the analysis is used as an exploratory tool before proceeding with

further statistical techniques. In Chapter 1 we introduced the notions

of internal cohesion within each group and external isolation between

groups. One of the loss functions we consider, (4.2.3.9), can be regarded as a quantification of these concepts and gives the relative importance

of each.

It is often appropriate to assume that the decision space consists

of all allowable partitions, and the loss function depends on only the decision g c r and the "true grouping" vector g. Therefore, we are

ignoring how far the observations are from any particular group and how far the groups are from each other.

Usually, the order of the sequence of observations is unimportant.

This implies that the loss function depends on the matrix k with entries n., defined as the number of elements in the set fklg(k) = i, g(k) = j}. -59-

For a particular decision g, the row sums of are fixed; that is,

m n. . = n., j=1 13 1

the number of elements in {klg(k) = i}. Also we have that

m y n= n., i=1 13

the number of elements in {k1g(k) = j}.

An example of a loss function depending on the matrix kr, is that proposed by Bock (1972):

0 if g = g (k is diagonal),

1 otherwise.

The Bayes' decision in this case is to estimate g by taking the grouping with maximum posterior probability in r.

In certain applications, the labels of the true groups are important so the loss function should not be invariant under permutations of these labels. This could occur in a number of situations; for example, (i) there may be individuals with known group membership, so that the estimate of the group membership for the other individuals should depend on the group labels associated with the individuals with known membership and (ii) the densities h. (j = 1,...,m) may be of different functional forms (e.g., one is gamma, the other lognormal) so that a permutation of - 60 - the labels of the true groups would yield a different model. If we do not wish the loss function to be invariant under permutations of the labels of the true groups, a possible loss function is that used in many applications of discriminant analysis; namely,

m m L at) = 2 n. .c (4.2.3.5) i=1 j=1 2.1 ij where c.. ij is the cost of assigning any unit to Group i when it should be in Group j (c .. > c..). The Bayes' decision rule with this loss function J is to let g(k) = i if

(7..--(7..)PriG(k) = jkif“orkr2.7 j=1 lj jj

* is minimized at i = i .

On the other hand, for many applications, the labels of the "true groups" are unimportant so that we wish to consider loss functions which are invariant under permutations of the "true group" labels; that is,

14g,g(1),...,g(n)] = L[g,r{g(1)},...,7{g(n)}1 for all permutations r: {1,...,m). This implies that L(k) is invariant under permutations of the columns of N.Except for the tiivial case of cij = cii, no linear loss function of the form (4.2.3.5) satisfies this property for all Linear loss functions are not appropriate, but we may consider quadratic loss functions of the general form

mmmm m m L(g) =IEzin. a. . 4-1122.2). . (4.2.3.6) 2.24 23 1j2 i=1 j=1 k=1 2=1 i=1 j=1

- 61 -

where the a's and b's are real and do not depend on

m m n = E E ni4. i=1 j=1

We may assume without loss of generality that aiik2 = akuj . If L(N) is

invariant to permutations of the columns of g and L(g) = 0 when g = g,

then it can be shown that L(g) is of the form

2 2, (1/2)(1 1c.. I n..n . - n.. Y], (4.2.3.7) • . ij kj . d{(2 . 2J 2] 10k 2 j=1 i=1 j=1 3=1

with posterior expected loss

(1/2) c..1. I. Pr[G(k) = keA. ten. i#i J

2 (1/2) T - {Pr G(k) = L d.[n. - I i=1 2 2 kcA i teii

(4.2.3.8)

where n. is the number of units assigned to the i-th group and

Ai {kig(k) = i}. Loss function (4.2.3.7) is equivalent to losing (i) C.. 23 for each pair (k,k) such that g(k) = i , g(2.) = j (i#j) and g(k) = g(k),

and (ii) di for each pair (k,k) such that g(k) = g(k) = i and g(k) # g(t).

Therefore, we assume c.. > 0 and d. > 0. Note that (4.2.3.8) implies the 2J decision is based on the data only through the similarity matrix with

entries Pr(G(k) = G(t)kz,...,W, which we suggested in §4.1 as reasonable

quantities to consider.

-62-

If we do not wish to have different weights to the assigned groups,

thenweletc..=candd. =ci in (4.2.3.7) to yield the loss function 1

9 m (c/2) nd/2) / Li n..)-- n23 j=1 i=1 23 i=1 23 i=1 j=1 23 j-1

(4.2.3.9)

where c > 0, d > 0 and c d > 0. The posterior expected loss is given by

n n m r 2 (c/2) y Pr(G(k) = G(2.)1,t ,..., n] (d/2) > n. 1 2 k=1 L=1 i=1

- [(c+d)/21 1 Pr(G(k) = i=1 kedi 2

(4.2.3.10)

the first term of which does not depend on g so we need not calculate it

to get the optimal partition. Loss function (4.2.3.9) is equivalent to

considering the (2" 1 pairs (k,Z) and losing c whenever

(i) g(k) # g(2) and (ii) g(k) = g(k); (4.2.3.11)

losing d whenever

(i) g(k) = g(k) and (ii) g(k) # g(!Z). (4.2.3.12)

Therefore, when c/d is large we attach more importance to assigning

pairs to the same group when they should be in the same group, than when

c/d is small (where we attach more importance to pairs being in different -63- groups when they should be in different groups). Hence c/d may be regarded as a measure of the relative importance of internal cohesion to external isolation. If c = d we get a loss function equivalent to the criterion suggested by Rand (1971) for comparisons of different clusterings.

We have seen in this section that a decision theoretic approach may lead to a natural definition of a similarity matrix as well as formalizing the concepts of internal cohesion and external isolation. However, we have been assuming that m, the number of groups in the population is known. We now extend our discussion to the case of unknown m.

4.3 Unknown Number of Groups

In Chapter 3 we considered the problem of unknown m from a classical hypothesis testing viewpoint with specific reference to multivariate normal components, and the only choice being m = 1 or m = 2. Extensions to more general problems were not self-evident. Before proceeding with a

Bayesian approach, we should point out that even for known m, the Bayes' rule may partition the data units into fewer than m groups, leaving some groups empty. In most applications we wish to determine m', the number of groups actually represented in the data, as opposed to m, the number of groups in the population.

In this section we treat m as an unknown parameter with prior probabilities given by Wm). All the other parameters have priors conditional on m. Therefore, the posterior distribution for the unobserv- able quantities (m,g,k) is given by

-64-

(M,g,DIX ) PM,G,OIX r%, 1 n rx, run

n P (m)P 11,10m)P mlO gfm) h,(k) (Kk lk)- (4.3.1) M 2 rq,' k=1

A similar expression may be derived when the unknown parameters are

onfgfkfk) •

Thus we can derive the posterior probabilities for m' given the data.

In contrast to the results of Chapter 3, where we considered tests of

hypotheses on m, this is a relatively straightforward theoretical solution

which is quite general for all m and any parametric density function for

the component distributions. In practice, it may be difficult to compute

these posterior probabilities, and approximations such as those we shall

consider in §4.5 may be needed.

How do we proceed with cluster analysis when m is unknown? One

possibility is to decide on m' by minimizing the expected loss for some

loss function and then proceed with the analysis assuming m' is that which

has been decided upon. However, if our primary concern is the partition

of the data units and not the number of groups, then a couple of loss

functions considered in §4.2.3 extend naturally to the case of unknown m.

One is

0 if g = g L (g,g) = (4.3.2)

1 if g # g.

In this case we wish to maximize the posterior probability of g. However, - 65 -

this does not take into account how "close" g is to g. Another loss

function which extends naturally is (4.2.3.9), the extension being

A A m' m 2 rm 2 L(re,g,m,g) = (d/2) G E( L ni4) - L n. . ] i=1 j=1 j=1 iJ

m m' m' + )4/2 y r( 1 n..)2- I n 2] (4.3.3) j=1 i=1 23 i=1 ij

n) where c > 0, d > 0 and c d > 0. This equivalent to considering the 2 pairs and losing c or d whenever (4.2.3.11) or (4.2.3.12) respectively

holds. The posterior expected loss is analogous to (4.2.3.10). Of course,

there may be other loss functions which can be used and the particular one

chosen should depend on the application. For example, if each group will

be considered separately in.future analyses then there may be an additional a cost proportional to m'.

We now have a theoretically justifiable technique for cluster analysis

whether or not the number of groups are known. There probably is a large

number of practical cluster analysis problems where the loss function

(4.3.3) is a meaningful formulation of the aims of the user. However, each

application should be considered carefully to determine what loss function

is appropriate.

4.4 Convenient Prior Distributions

In §4.2 and §4.3 we discussed the usefulness of deriving the posterior

probabilities for (g,m) given the data In this section we concern ourselves with the computation of the quantities

Pm n (m,g,ka,...4.72). (4.4.1) -66-

In general this involves integrations, which for many applications

may need to be performed numerically. Here we consider the relatively

simple but widely applicable case where each of the densities hi(0)

which involve unknown parameters A are in the exponential family. To

facilitate the integrations we assume "conjugate type" priors for Q. This may not be always applicable in specific situations, but there are a wide

variety of cases for which such priors may be sensible to the user. In

Chapter 5 we give some examples of these posterior distributions and find

that some of them lead to existing cluster analysis techniques.

We now introduce some notation for this section. We assume that for

given m, then 1-1(m) = {1,...,m1} and 12(m) = {1,...,m} -I1(m). For

j e 1-2(m) we assume h. is completely specified and for j e .t1(m) the

component distribution, which depends on unknown parameters k(m), is given

by

h.(08) = explla..(k)b..00 Yc..(k) d.W1. (4.4.2) 2.7 J

In expression (4.4.2) we write 8 for Vm). We usually write such short

forms where there is no ambiguity. Note we are extending the usual

definition of the exponential family so that terms depending on % alone 2 are expressed as a sum. For example if h.(x1p1,...,plea ) is the density 2 function of a normal random variable with mean p. and variance a , we have c .(11 2 2 2 ,...4zeci2),..al and c (u ,p ) = (-1/2)1og(a2 1J 1 23 / m ).

For given g we let Q. = {klg(k) = j} which has nj elements. Now conditional on m and g we assume that q'has a prior distribution of the form -67-

-/ rr POIG,M(e = (K( ,k)] exPILLai.(Va +Yic ..(k)13 (4.4.3) ij J 23 ij 2J 23

where and k are matrices with entries a., and a.. respectively, and 13 2] K(k,B)is a normalizing factor. The entries of k and k may depend on (g,m).

* Weleteande bematriceswithentriesa.! 2J a 82.7 respectively, where

* a., := b. A I 1J 23 1J k kEQj

* O.. 2J 23 J

Having introduced this notation, we have that

Pm,q, 1,..., n(m'g,k1,---, n)

** =P (m)p (glm)f hi(ick).1exPI d.(? k)11C(4 )/K(k,k) _1E1" kEQ .1E1' kEQ. 2 j 1 j

(4.4.4) and

P (elor,M, el,m, i,..., n ', ft,

* * _/ * rr = ic(k ,k )1 explUai.(Vai4 acii(V0ii] ij J ij

(4.4.5)

Let us now turn our attention to certain convenient expressions for

1:11,,f(glin) when

- 68 -

m n. m n.

II A. 2/ I A if germ) 2 ga(M) i=1 Pq i k;m(glk,m) otherwise.

We assume that if g and m are given then pd and k are independent in the prior distributions; that is, = p l 'M (81g,m). This implies that

= Pkig,mRIg'm). (4.4.6)

Now when m = 2 and k is known, Behboodian (1972) considers the posterior distribution for A by assuming the prior for A is a beta density. He suggests for m > 2 using a Dirichlet prior. Here we assume that the prior for k given m is proportional to a Dirichlet distribution over some set of

A's given by P(m). Therefore

M s.-1 II X. 2 dk if k 112k.12 gS(m) i=1 i=1

pk (mMm)

0 otherwise.

(4.4.7) and we can then show that

m a a fR(m). Ai dx a=1 • (4.4.8) P0A1 m) m I f dk n(m).n Ai 2 2 ger(m) 2=/ -69-

In Chapter 5 we give some examples of the expressions given in this section, but now we consider the general application of these expressions to get an approximate Bayes' clustering rule.

4.5 Practical Considerations

In §4.2 and §4.3 we found that if we base our clusterings on the posterior distributions of g and m, we need to compute the posterior probabilities of certain events, each event we express as t(,M). For example, t(q,m) may be the event that G(k) = i and M = m, or it may be that G(k) = G(k) and M is arbitrary. We let

T(e) = {(g,m)It((,m) = e, gc r(m)}. Therefore,

Pr[t(,M) =

P (g,m)ET(e)I (4.5.1) PM G X X ( 'g'1' ...'n) in ger(m) m

In many practical problems, the set of allowable groupings for certain values of m may be very large. For example, if Jin = {/,...,m} we may have

F(m) = J x...xj m .. • m

n times which has mn elements. Therefore, for large n the sums in (4.5.1) cannot be evaluated exactly in practice. In this section we give a method of finding these sums approximately.

In expression (4.5.1) all the terms of the sums involve -70-

(n,g4E ...46).n If there are some (g,m)'s which have relatively large values of pm, then these values make the most important contribution to the ratio of these sums. For a given set of data units, it is not unreasonable to assume that relatively few of the partitions will yield a high posterior probability for the grouping and the vast majority of the partitions will be absurd in the light of the data, and therefore have low posterior probabilities. We outline a sequential method of which tends to give the partitions with higher probabilities toward the beginning of the sample.

(/) (2) We define the distance between g and g as the number of elements (1) (2) in the set {k g (k) # g (k)}. For fixed m, we define the m-neighbourhood of g as the set of all g with non-zero posterior probability for m given, such that the distance between g and g is one.

The basic assumption we make is that if g has relatively high posterior probability for a given m,, then some of the partitions in the m-neighbourhood of g have relatively high posterior probability for given m. We say that a local maximum occurs at (g,m) if none of the points in the m-neighbourhood of g has a posterior probability greater than that of g for fixed m.

The algorithm creates three non-overlapping files which together comprise the sample. The records on these files consist of the values of

(g,m) and pM (m,g, i,...46,2), or the logarithms of these probabilities to avoid machine overflow. We call these three files:

(i) Local Maximum File, (ii) Generator File and (iii) Rest of the Sample

File. In practice we may have one large file with each record containing a code denoting to which of the above three files that record belongs. - 71 -

The starting target of our algorithm is to obtain a set of (g,m)'s, each of which occurs at a local maximum. We put these into the Local

Maximum File. This set may be found by starting at some (g,m) and searching its m-neighbourhood until a (g,m) is found with a larger posterior probability, for m given. This step is repeated for the new

(g,m) until a local maximum is found. The starting (g,m)'s for this procedure may be obtained by two different means. One is to make various plots of the data and suggest values of (g,m) which give reasonable clusters. The other is to generate randomly some values of (g,m).

Now starting with the initial Local Maximum File and leaving the other two files empty, our algorithm is as follows:

1. Search the Local Maximum File and the Rest of the Sample File

for the (g,m) with maximum probability. Denote this by

(gMaxl,max). * * 2. Create two files, FLAG1 and FLAG2. For every ,m ) in the

Local Maximum File such that m = max and the distance between * * g and is one, add g to FLAG1. For every (z ,m ) in the max and the distance between g Generator File such that m = mmax * and is one or two add to FLAG2. Also, delete the record rtA max q corresponding to (a ,m ) from its original file and add it etmax max to the Generator File.

max-neighbourhood of. For 3. Generate the set of g's in the m rtmax each of the g's, if its distance is greater than zero from all

the g 's in FLAG1 and is greater than one from all the g 's in

FLAG2 then add (g,mmax) and pm,1,...441(mmax,g, i,...46n) to

the Rest of the Sample File. -72-

4. If the total number records in all three files, Local Maximum,

Generator and Rest of the Sample, exceeds the sample size

required then stop; otherwise go to step 1.

Note that we create FLAG1 and FLAG2 so that we can check fairly quickly whether a new (g,m) was already on one of the other three files.

We take as our total sample the three files created. If our basic assumption (if (g,m) has a relatively large posterior probability then some of those in the m-neighbourhood of (g,m) have relatively high probability) is correct then the partitions with higher ranking probabili- ties will tend to be in the sample we have taken. To evaluate expression

(4.5.1), instead of taking the sums over r(m) and T(e), we restrict the sum to those (g,m)'s in the sample. In Chapter 5 we discuss the behaviour of this sampling scheme and the approximations derived from the sample for a particular numerical example.

The algorithm has been specified quite generally, and usually certain modifications would be made in particular cases. We employ the general technique in Chapters 6 and 7 and describe the modifications used.

4.6 Discussion

In this chapter we have suggested how cluster analysis may be approached from a Bayesian point of view. The selection of a prior distribution is very much a subject matter problem and should be thought out carefully by the user. By changing his prior he may get quite different posterior probabilities. In particular, pm(m) and p m(gim) are important factors in the posterior distributions of (g,m). For example, if the user specifies pom(g1m) = m n he is implicitly saying -73- that the prior for (n1,...,n ) given m is m

-n m

This gives a high weight to each of the ni r s being close to n/m. If the user does not want his algorithm necessarily to give such a prior

probability to relatively equal cluster sizes, he should adjust his

specification of poi(glm). For example, he may consider a prior given

by (4.4.8). The user has some control over the type of results he is

likely to get and these controls have a meaningful interpretation in terms

of the shape of the prior distributions. - 74 -

Chapter 5

BAYESIAN CLUSTER ANALYSIS: NORMALLY DISTRIBUTED GROUPS

5.1 Introduction

The last chapter discussed a Bayesian procedure for partitioning data

units. We suggested that if our primary interest is in the estimated

grouping, we should consider

(5.1.1)

In §4.4 we gave some "conjugate type" priors for which the computation of

these quantities may be feasible. In this chapter, we give some examples of the convenience of the results of §4.4 where h. in (4.1.1) is a p-variate normal density with mean ki and covariance matrix a.. We r1,2 consider a number of cases, depending on various assumptions about our knowledge of the )'s and g's.

5.2 Unknown Means, Common Known Variance

Here we assume that gi = = gm = g (known and positive definite), 2 2 so that without loss of generality we may assume g = Z, where a is known. Now, conditional on g and m, the unknown parameters are

The "conjugate type" prior (4.4.3) given g and m is equivalent to the.'s being independent p-variate normal with EN? = gj = (a1j,...,apj)T and 2 -1 -/ V( 1) = cc diag(5lj ). For fixed m and Bij = = Sp), this prior distribution is the same as that used by Bock (1972).

We can show that -75-

2 -np/2 1,..-ri n) aP (m)P 1 (gim)(a ) 13M41"-471 (in,g1 M

P in 1/2 x II II fa../(0-411.)1 i=1 j=1 23 13 .7

2 2 xexpr(-1/2a .)) ) IcE ik 2j 2 +(n.../(n (x . -a ..) //I i=1 j=1 keQ. 2.7 jiJ 2j 2J

(5.2.1) where i.1c = and (x1k"..,2cPk)T

-1 x..= n. x. . 23 J keQ. lk

Note we are using the notation of §4.4 where Q. = {klg(k) = j}. The posterior distribution of ka,...,km given is such that the

}es are independent normal with

= (1.x..4-13..a..)/(n.4-$.J and 3 2.7 13 23 J 2 3 2 -1 -1 VN.IM,g, /..., n) = a diag[(a .411 ) .4.11,) .7. If our prior 1 13 jPJ J distribution satisfies: (i) (61,...,km) is independent of g so that

(4.2.5) holds and (ii) the density for ki,...,km is nearly flat (that is the (3's are close to zero) so that we may approximate (5.2.1) by letting the a's tend to zero, then (5.2.1) is approximately proportional to

2) my y Pm(m)P0A1m) .11 ni-P/2exp((-1/2a 3=1 - j=1)(EQ. 3

(5.2.2) - 76 -

where

-1 x = . 176k. J J keQ .

Note that (5.2.2) is invariant under location and orthogonal transforma-

tions of the data. Expression (5.2.2) is normable if and only if

.p.i m(glm) is zero whenever any of them groups is empty. (If there is a

non-zero probability that the j-th group is empty then (5.2.2) is infinite

when n. = 0.) J

Now, minimizing the within group sum of squares is a frequently used

criterion to determine the clusters in many clustering algorithms; see

Cormack (1971). An interesting feature of (5.2.2) is that it is maximized

when the within sum of squares is minimized if and only if

m glm) cc 11 n.13/2, (5.2.3) PIM( j=1

2 for any value of a . Note that (5.2.3) strongly weights the prior for g

given m in favour of relatively equal sized clusters. The prior for

) given m is proportional to m

p/2 II n. /E11 j=1

when the prior for g is given by (5.2.3). We would expect, therefore,

that the grouping which minimizes the within sum of squares tends to have

relatively equal numbers of data units in each cluster. If we do not wish

this necessarily to occur, we would not use. (5.2.3) as our prior for g. -77-

If our prior for g is given by (4.4.7) for some (s1,...,sm) where r(m)

does not contain partitions leaving any of the groups empty and if the

posterior for g may be approximated by (5.2.2) we would still consider

the within sum of squares but attach different weights according to the 2 value of ). Also the clustering may depend on a . m

5.3 Unknown Means, Common Unknown Variance I

The model where the covariance matrices a. (i = 1,...,m) are all equal

to g, say, not completely known, is considered in two parts. In this section, we assume that g is known up to a proportionality constant, so -/ 2 that without loss of generality g = 0 = a where A is unknown. In the next section we consider the case where none of the entries of the matrix g is known.

In this section, our "conjugate type" prior assumes that, conditional on (gm), then 0 has a r(a,$) distribution and, conditional on 0

(or 0-1 = a2) and (g,m), the prior for kl,...,km is such that they are 2 T p-variate normal with E(Iii la ,g,m) = (u/i,...,upi) and

2 2 -1 -1 N.la ,g,m) = a diag(d . ). We can show that 1j PJ

P m cc p (m)p (glm) II II ./ (d +n .)11/2 a M (2/f3) a(a+nP/2)/r (egii i=1 j=1 2.7 J

R m 2 „..12,7-c-np/2 x [(2/8)-1- y -x ) 4102.4-6.)) ij II • i=1 j=1 kEQ. ik iJ J ij aj/

(5.3.1) -78-

The distribution of 0 given g,m,K1,...,xn is gamma with parameters

2 2 ,-1 bx+np/2,{(2/0)+ ( L (x -x.) +(n.d../(n.#6..))(x..-v..) )1 1=1 j=1 keQ. ik 13 j 2J J 2J 1J 2.1

2 and ki,...,km given ,g,m, 1,...,;€1.2 are independent normal with

2 E(pijia ,g,m4E1,...,K11) = (nixiff-diivii)/(y6ii) and

2 2 -1 -1 V(kjia = diagi(diff-ni) ,...,(dpi+ni) 1. If our prior is such that (a,8) does not depend on (g,m) and the prior for ki,...,km given a2 is nearly flat so that we may approximate the posterior distribution by letting the Vs tend to zero, then (5.3.1) becomes proportional to

-a-np/2 -1- 1 T P M OOP 1 (glm) U n.-1)12[(2/8) (17 k--1C" ) (1Ek-P ' ll • j=1 j=1 kef2 .

(5.3.2)

Again this density is not normable unless our prior for g given m is zero whenever any of the m groups is empty. If our prior for g given m is

(5.2.3) then the grouping which maximizes (5.3.2) is that which minimizes the within group sum of squares. However, whenever our prior for g given m is not given by (5.2.3) then (a,8) which determines the prior for a2 may affect which grouping has maximum posterior probability. If our prior -1 -1 is such that is very small and we approximate (5.3.2) by letting a tend to zero, then any clustering rule based on this approximation is invariant under scale as well as location and orthogonal transformations of the data.

5.4 Unknown Means, Common Unknown Variance II

Here we assume that none of the entries of the common covariance -79-

- matrix g is known. We let 2 = g 1. We take the prior for ki,...,gm,Q given (g,m) to satisfy

(a) the matrix Qhas a Wishart distribution on a (a > p-/)

degrees of freedom and scale matrix k / (positive definite

symmetric),

(b) given g = 2-1, the prior for ki„.„km is such that they

are independent normal with E(ki lg,g,m) = xj and

V(ki lg,g,m) = g/di.

We let

- T W = m (,ik_ j)(,k_,j) , j=1 Ice Q3. the within sums of squares and products matrix for a given g and m. Then, we can show that

P (71,0k1r...rtn) m i 06.z -4tn

pm(nopom(glm) n ta./(6 .-i-n )P12 ikl ai 2 j=1 3

x n IT{(a4111-1-i)/2}/r{(a+1-i)/2}.7 1=1

.) T 1- (a+n) /2 I ilt, m in .6 ./(.2 .)] 01' .) (5.4.1) j=1 JJJJ JJJJ

The distribution of given k,m,k1,...,Kn is Wishart with (a+n) degrees of freedom and scale matrix -80-

- T -1 it+ I {n .6 ./(n .+6 .))().1-Xj) ] . j=1 .7.7

Given g,g,mqa,...,x , the distribution of ki,...,u is independent normal rt n with E(kjig,g,m, e...,itn) = (nAj+kA)/(ni4.6j) and

= g/(nj+6j). If the prior is such that (a,A) does not depend on (g,m) and the prior for ki,...,km given g is "nearly flat" so that the posterior distribution may be approximated by letting the 6's tend to zero, then (5.4.1) is approximately proportional to

p 670 1, , (ain?) n n.-p/21A,11-(a+n)/2 (5.4.2) M j=1 3

This is not normable unless the prior for g given m is zero whenever any of the m groups is empty. Friedman and Rubin (1967) suggested the criterion minIWI for obtaining a clustering. We see that (5.4.2) is maximized when 10 is minimized if (i) the prior for g satisfies (5.2.3) and (ii) the elements of A are sufficiently close to zero that we may approximate (5.4.2) by letting the elements of equal zero. Condition

(ii) implies that any clustering based on (5.4.2) is invariant under any location and non-singular linear transformation of the data. In general, however, clusters based on (5.4.2) depend on a and

5.5 Different Unknown Means and Variances

In this section we assume that the component distributions have different means and variance matrices. We recall from Chhpter 2 that non-Bayesian parameter estimation in the univariate problem with parameters 2 2 (X,111,p2,al ,02 ) presents certain theoretical difficulties because the likelihood function (2.1.1) is unbounded. A Bayesian analysis with proper

priors presents no such theoretical difficulties. We shall see in - 81 -

Chapter 6 the effect of approximating the posterior distribution by taking certain limiting cases on a particular numerical example.

-/ We let gi = RI and take the prior for given

(g,m) to be such that:

(a) 21,...,2m are independent Wishart with Rj having aj (aj > p-/) -/ degrees of freedom and scale matrix kj Kpositive definite

symmetric);

(b) given ki,...,gm the prior for ki,...,km is such that they are

independent normal with E(kjIgi,...,gm,g,m) = and

= gi/6j.

We let

T = ( kij)qk-IC.1) KEwi so that

(m'gliEl" ."kn)

({/( )}P/2 c` Pm (m )PG / 2 j=/

x ( (a .+n .+1-i) / 2)/r ((a .4-1-i)/2)} i=1 J J - .-r-n .)/2 a ./2 T1 j 3- , x Ikji -7 lAi4g.14-{nifj/(ni+dj)}0Ei-ki)(iE- j -kj) I J

(5.5.1)

Conditional on we have

- 82 -

(a) 21,...,2m are independent Wishart with Qj having (1.412.) degrees J J of freedom and scale matrix

fkil'i+Ini6j/(ni4-6j)1% -1;

(b) are independent normal, conditional on gl,...,gm with

gi having mean (Aiki+diki)/(nj+6.1) and variance gi/(di+ni).

If our prior is such that (i) al = = am = a and ki = = 4m = A

where (a,4) does not depend on (g,m) and (ii) the prior for k1' ...,km given gi,...,gm is sufficiently flat that letting the Vs tend to zero

provides an approximation for (5.5.1), then (5.5.1) is approximately

proportional to

-(a+n .)/2 p (m)p (Om) n (n .-p/2 (a+n .+/-i)/211k+fti m qim 11 j=1 i=1

(5.5.2)

Now Scott and Symons (1971) have proposed a criterion to be minimized in this situation when m is known namely

n . II -7 3

We see that this gives maximum posterior probability in (5.5.2) when

(i) a = 0, (ii) A = 0 and (iii)

m (r1.4.1)p/2 p -/ P 1 (71m) cc j=1II In. r((n.+/-i)/2)} (5.5.3) 1111 i=1

Note that our original prior is normable only if a. > p-1, so for p > 1,

condition (i) is not a limiting approximation of (5.5.2). Also condition - 83 -

(ii) implies that the approximate posterior probabilities for (g,m) are

invariant under location and non-singular linear transformations of the

data. If we approximate the posterior probabilities under condition (ii)

then (5.5.2) is not normable unless the prior for g given m is such that

it is zero whenever there are less than (p+1) observations on each group;

otherwise 'JO would be zero for certain groupings g for which

P I (g1m) > 0.

Probabilities given by (5.5.3) are even more peaked near equal n.'s

than the probabilities (5.2.3). Therefore, we would expect the resulting

clusters to have nearly equal numbers in each cluster which is often not

desirable. If the user does not wish to give such a high

to relatively equal sized clusters, he should not use (5.5.3). He may

consider (4.4.8) a more accurate representation of his prior for g.

5.6 Common Known Mean, Unknown Variances

In this section we assume ki = = km = k (known), so that the -/ unknown parameters are We let = gj . We assume that

given (g,m) then k1,...,Z11 are independent Wishart, having a. (a. > p-1) J -1 degrees of freedom and scale matrix A (positive definite symmetric).

We can show that

m p a ./2 Q p (top 1m (511 m) 11 {rox .41-i) /2) /r ( (a .+1-i)/W .1 j M j=1 2=1 J J

-(a )/2 T jj (5.6.1) x l kj+ I (kk-k )(tk-k ) I kcQ . -84-

Conditional on m,g4E1,..., n, then 2,...,2m are independent with Qj being

Wishart on Oz.-1-n.) degrees of freedom and scale matrix

T - (A.4- I k k-k,) ke .

The case where k is unknown is not discussed, because the "conjugate type" priors do not integrate easily. This is an example where our problem doesnot simplify greatly by assuming a prior as suggested in §4.4.

5.7 Regression Models

We consider now a model which includes some explanatory variables in the structure. For a of univariate observations on one

explanatory variable, we may consider a number of different models. For example, in the j-th group we may have

(concurrent lines),

(parallel lines),

(general lines).

We assume that for a vector of observations and a vector of explantory TT T variables (g ) , we have for the j-th group that

(5.7.1)

where is a txp matrix and S. is uxp. We let -85-

TT , T , T T , rT = ), and kJ = ( klj ,...,kui I.

-1 Weassume that kk ti N(0,g) and ki,...,kn are independent. We let 2 = g . We include the possibility that t = 0; that is, g is empty. Let

B F1 • rtm

0 4'1F1 ti1 H = (5.7.2) ti

F 0 fvm

We assume that conditional on (g,m)

(a)2 has a Wishart distribution on a degrees of freedom and -1 scale matrix k ;

(b) given Q, the entries of k, i,...,qm are jointly normal with

•••

11 = E

kin - 86 -

T TTT -1 1 and the covariance of (k ,ki ) is k 2 where 8 denotes the Kronecker product.

We let

B = B+ E Xkkk k=/

=

* . 12 k kkkk .1c=1

ki = kj 4 L kkkk keQ .

= keQ.2 4k ;

* * , * is the matrix (5.7.2) with (k,ej,ki) replaced by (k kj ). We define

= 4. n xkT (kT kiT kmT1 -1rkT kiT... kmT1T k-1-

*T *T *T * -1 *T *T *T T fk ki km ] ik ki --- km •

Therefore,

T P Q Pm(m)P1m(glm)flg* I/Ik111-V2 n Er{(ct-11v./-i)/2}/r{(a+1 - i)/2}] i=1 -87-

)/21. x fiAla/2/1A* 1(a+n (5.7.3)

In general (5.7.3) may be difficult to compute numerically. However, with certain models the expression may be simplified to a form suitable for computation. For example, in Chapter 7 we consider an model which could be expressed in the form (5.7.1), where the observations xl,...,xn are univariate and (5.7.3) is easily computable for the prior assumptions we make.

5.8 Some Prior Distributions for

In the preceding sections the posterior distributions were given for general priors of g given m. We now consider some examples of these priors in the case where there exists a X such that (4.1.2) holds. We employ the techniques of §4.4 for two special cases of 0(m), the set of allowable values of given m. It may often be the case that (4.4.6) is true so we also give expressions for pk i ,m(kig,m).

11. Therefore, Firstly,weletf/00 a

Pki m( Im) is the usual Dirichlet distribution given by

r(si) i=1 ri 2 H A 4 (5.8.1) i=1 H r(s.) 2 i=1

In this case m nia n m P (aim) ce [ n n (s.4-j-1)//r H ( s.4*-1)1, for g c r(m), IM i=1 j=1 2 k=1 1=1 2

(5.8.2) - 88 - and m r(n+ y s.) i=1 i m 2 2 'gin!) — R A. • (5.8.3) m i=1 2 II r(n 2 2.) i=1

In the second case we assume that

Wm) = {(X1,...,X m)10 < Al < Am < 1, A. = /}, so the group labels are ordered. Now if k has the Dirichlet distribution (5.8.1) then

A Beta(s Ysi ) l 1'2 and for - we have that conditional on A , i = 2,...,m 1, / i-1

i-1 (A./(1- y Beta (s s;j) 2 j=1 J 1.11 see Johnson and Kotz (1972, pg. 234). Therefore, when pom([m) is given by (5.8.1), it can be shown that

m-1 Pr(A < < A ). R [13 (s., s.)/B(s., 1 - m . -1 E . s.)19 i=1 671-2+1) 2+1 2+1 where

X BX(a, = I ta-'(1-0 13-idt 0 and B(a,8) = ya,0). Therefore for k, c Wm) we have - 89 -

m-1 m -1 m s.-1 P I (Alm) = B (s , s.), A. (5.8.4) klm i=1 (71-1.4-1)-1 i+1 i=1 1

m Is.411., (5.4 m-1 -1 a 2 j 11n ).7 (m-i+1) 1+1 Pq1011n) cc i=111 (5.8.5) B (s., Z -1 s.)j (m-i4.2) .1.+1

m-1 m m p (Ala,m) = [ B 1 2 fn.-1-s., (n.-Fs )17 Ai • 1=1 ( W-i#1)-1 2 2 1+1 j 11=1=1

(5.8.6)

In general, careful thought must be given to the prior for g given m, as we shall demonstrate in Chapters 6 and 7. The priors suggested in this section may be suitable for certain applications, but often the user will not find them satisfactory and he must specify a prior which better reflects the subject matter under consideration.

5.9 Performance of Approximation Algorithm

In §4.5 we outlined an algorithm for approximating the quantities

(4.5.1). We now take a number of small data sets and look at the order in which the probabilities 77,gki,--.,kn) appear on our pM'OZ " -46n ( sample file. We also compare estimates of Pr(G(k) = G(.94I 1,..., n/ with the true values and see how the clusters based on these estimates compare with the optimal cluster based on the true values of the posterior probabilities.

In particular, we took the probabilities given in (5.4.2) with a = 0 and k= 0 and had eight sets of bivariate observations with two groups.

-90-

Since the posterior probabilities are invariant under location and non-

singular linear transformations, the results depend on (k1,k2,g) only -/ //2 through A = 061 2)] . In the simulations we generated the ((k1-12)Tg -6 data such that

(.4)n1(.0n2/1.1(.4)10_(.01 0] for 1 < n1 < 9

P (g)

otherwise, so that the distribution of g was not invariant to a permutation of the

labels {1,2} and the probability of a group being empty is zero. We took

two replicates for each of four values of A corresponding to A equalling

1,2,3 and 4. To determine the posterior probabilities we used the following priors: Pr(M = 2) = 1 and

(10)1-1 if 1 < < "/ pr_.(g) = (5.9.1)

otherwise, which is the same as (5.8.2) where s1 = s2 = 1 and r is the set of all g's with no empty group.

The Local Maximum File was generated by first selecting twenty groupings randomly with probabilities (5.9.1) and then searching for the local maximum as outlined in §4.5. Only twenty starting points were selected so that the effect of not having found all the local maxima could be seen. - 91 -

Table 5.1 summarizes the order in which the groupings appeared on

the sample file. The posterior probabilities are normalized so that the

global maximum over the 1022 partitions in each data set has the value

one.

Table 5.1

MAXIMUM PROBABILITY IN CERTAIN INTERVALS OF SAMPLE FILE

A 1 2 3 4

Replicate No. 1 2 1 2 1 2 1 2

Sample File Interval

1 - 50 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 51 - 100 .067 .202 1.000 .074 .458 .004 .003 .004 101 - 150 .022 .061 .005 .074 .091 .002 .002 .003 151 - 200 .027 .055 .005 .301 .018 .002 .001 .002 201 - 250 .002 .018 .002 .677 .007 .001 .001 .001 251 - 300 .007 .012 .001 .010 .008 .001 .001 1.000 301 - 350 .001 .025 .000 .012 .002 .000 .001 .002 351 - 400 .001 .006 .001 .003 .001 .000 .000 .000 401 - 450 .004 .010 .001 .003 .001 .000 .000 .000 451 - 500 .003 .001 .000 .001 .001 .000 .000 .000

We see that as A gets larger, then the probabilities decrease more

rapidly down the sample. This is to be expected since for larger A the

posterior probabilities tend to be more extreme. Also we clearly have not

started with all the local maxima. The effect of this may be seen

particularly in the data sets A = 2, Replicate 2 and A = 4, Replicate 2.

But the important feature of Table 5.1 is that the algorithm does

succeed in that those groupings with higher probabilities tend to be near

the beginning of the file. In fact, for three of the data sets we considered, the maximum probability after the fiftieth grouping in the

sample was less than 10% of the overall maximum. -92-

Now we consider the effect of estimating, by the method proposed in

§4.5 using the first 100 groupings in the sample, the similarity matrix

with entries Pr(G(k) = Ga)lkz,...46z7/. For these small data sets we

could determine the exact values of the matrix. The results summarized in Table 5.2, were quite promising. There was a tendency that probabili-

ties close to zero or one became extreme in the estimate. But these

estimates are not too important for determining the optimal cluster. More important are the probabilities near 1/2 which, it turns out, tend to be estimated better.

Table 5.2

MAXIMUM ABSOLUTE DEVIATION BETWEEN ESTIMATED AND ACTUAL SIMILARITIES

A 1 2 3 4

Replicate No. 1 2 1 2 1 2 1 2

Max. Abs. Dev. .065 .127 .027 .154 .047 .015 .018 .030

We see that all the data sets, with the possible exceptions of A = 1,

Replicate 2 and A = 2, Replicate 2 perform very well. As can be seen from

Table 5.1, the case A = 2, Replicate 2 gives a larger deviation between the estimated and true values of the entries of the similarity matrix because an important (large) local maximum was not in the original Local

Maximum File. The same probably happened for the A = 1, Replicate case.

To check how the clusters derived from the estimates compare with the optimal clusters, using the true values of the posterior probabilities, we used loss function (4.2.3.9) with c = d. It turns out that all the clusters based on the estimates were exactly the same as those based on the true values of the posterior probabilities. The estimated clusters -93- themselves were not very good compared to the true clusters for small A, but this was reflected in a large expected loss.

5.10 Discussion

For all our examples in this chapter we have made normality type assumptions and the prior distributions were such that the integrals were easily calculable. Of course, we have not exhausted all the cases and others could be handled similarly. For example, we might assume that the covariance matrix is diagonal or that there are both discrete and continuous observations and the parameters of the continuous observations may depend on the values of the discrete ones. Our general theory developed in §4.4 has wide applicability.

If conjugate type priors are not appropriate for a particular application, or if the conjugate type prior does not lead to posterior probabilities which are easy to calculate, we may need to use numerical integration which can be quite time consuming. Because of the increased expense, we could not take as large a sample in our approximation algorithm and the resulting clusters may be suboptimal (although possibly still have a posterior expected loss close to the optimal grouping).

It is interesting that some of the examples we have considered in the limiting case leads to commonly used clustering algorithms such as minimizing tr(W) or minimizing le, where Ft is the within cluster sums of squares matrix. The examples given here provide some insight in the type of assumptions one could make to justify the use of these algorithms; see

§5.2, §5.3 and §5.4.

We have also demonstrated that our approximation to the optimal -94- clusters can perform very well. The important aspect of the algorithm seems to be the identification of the local maxima with relatively high probabilities. There tend to be more of these when the groups are not well separated. Also, when the groups are not well separated, then the resulting clusters based on the optimal Bayes' decision rule are poor compared to the true clusters (reflected by a high expected loss), but our approximation may still yield a clustering with a posterior expected loss close to that of the optimal cluster. -95-

Chapter 6

FISHER'S IRIS DATA

6.1 The Data and Previous Analyses

In this chapter, we apply some of the techniques discussed in Chapters 4 and 5 to the Fisher (1936) iris data. These data have been subjected to a variety of cluster analysis programs and the results are documented in the statistical literature. The data consist of four measurements (sepal width, sepal length, petal width, and petal length) on each of 150 flowers, which are known to belong to three groups of 50 flowers each (Iris Setosa, Iris Versicolor, and Iris Virginica). In our analysis, we assume that the actual grouping is unknown. The data are such that Iris Setosa is well separated from the other two varieties and there is an overlap between Iris Versicolor and Iris Virginica. Therefore, most cluster analysis algorithms succeed in separating the Setosa from the other two varieties, but the results differ when separating the Virginica and Versicolor plants. A visual summary of the data is given in Figure 6.1, where we have plotted the first two principal components based on the covariance matrix. They account for 97.8% of the total variation. However, these two components can be misleading as a reduction of the data, since some of the cluster analysis algorithms are invariant under linear transformations, but the principal components are not. For example, the first two principal components based on the correlation matrix would have yielded a different scatter diagram, but the main feature of one isolated group and two overlapping groups would be retained.
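As an illustration of how a plot like Figure 6.1 could be produced, the following short sketch computes the first two principal components from the covariance matrix of a 150 x 4 data array; the variable names and the use of numpy are assumptions, not part of the original analysis. The fraction of variation retained should be about 97.8%, as quoted above.

import numpy as np

def first_two_components(X):
    """Project the rows of X onto the two leading eigenvectors of cov(X)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    top2 = eigvecs[:, order[:2]]
    scores = Xc @ top2                         # coordinates plotted in Figure 6.1
    explained = eigvals[order[:2]].sum() / eigvals.sum()
    return scores, explained

# scores, frac = first_two_components(iris_measurements)   # frac ~ 0.978 per the text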

[Figure 6.1: FISHER'S IRIS DATA - scatter plot of the first two principal components]

Let us now review some of the previous analyses of these data. For the rest of this chapter we label the original data as follows:

Setosa: 1 - 50

Versicolor: 51 - 100

Virginica: 101 - 150.

Kendall (1966) proposed a "distribution-free" method based on the ranks of each component of the original data vectors. He applied his procedure to the Versicolor and Virginica data only. With his procedure he does not misclassify any of the Versicolor flowers, although he leaves 58, 61 and 94 undecided as to whether they form a small isolated group or should be included with the rest of Versicolor. He does misclassify 20 Virginica flowers: 102, 104, 107, 108, 112, 117, 120, 124, 126, 127, 128, 130, 131, 134, 138, 139, 143, 147 and 150. He also finds 109, 114, 122 and 132 remote from any group and remote from each other, so he remains undecided about them.

Friedman and Rubin (1967) considered the effect of optimizing various criteria to arrive at the resulting clusters. If W is the within cluster sums of squares matrix and B the between cluster sums of squares matrix, they considered the following three criteria:

(a) min tr(W)

(b) min |W|

(c) max tr(W^{-1}B)

For criterion (a), Maronna and Jacovkis (1974) report that 12 Versicolor and 11 Virginica flowers were misclassified. All the Setosa plants were successfully isolated. When applying criterion (b), only 3 out of the 150 flowers were misclassified: 71, 84 and 134. The first two were allocated to Virginica and 134 was allocated to Versicolor. The result is remarkably good, especially when compared with those of Kendall (1966). When criterion (c) was applied to the data, only three flowers were misclassified: 71, 78 and 84, all allocated to Virginica. No indication was given as to which units were most doubtful in the sense that changing those units to another group had the smallest effect on the value of the criterion being optimized. Friedman and Rubin's (1967) concluding remarks suggested that criterion (b) tends to give better results than (c) in general.

When Wolfe (1970) applied his NORMAP program to the iris data, he obtained exactly the same clusters as those derived from minimizing |W|. NORMAP finds the maximum likelihood estimates of the parameters of a mixture of multivariate normal distributions with common covariance matrix, and then applies the linear discriminant functions to the data using these estimates. Wolfe (1970) also considers the case where the covariance matrices may be different. In the program NORMIX, he estimates the parameters of the model by taking the stationary point of the likelihood with maximum value and applies the quadratic discriminant functions using these estimates. The result of this procedure when applied to the iris data is that 5 flowers from Versicolor were misclassified as Virginica.
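The following is a rough sketch, not Wolfe's program, of the kind of computation described for NORMAP: an EM-type iteration for the maximum likelihood estimates of a multivariate normal mixture with a common covariance matrix, followed by allocating each unit to the component with the highest estimated posterior probability (which, with a common covariance matrix, is the linear discriminant allocation). Initial values, the stopping rule and the data array are placeholders.

import numpy as np
from scipy.stats import multivariate_normal

def fit_common_cov_mixture(X, m, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    means = X[rng.choice(n, m, replace=False)]
    cov = np.cov(X, rowvar=False)
    props = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: posterior probability that unit k belongs to component j
        dens = np.column_stack([multivariate_normal.pdf(X, means[j], cov)
                                for j in range(m)])
        resp = dens * props
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions, means and the pooled covariance matrix
        nj = resp.sum(axis=0)
        props = nj / n
        means = (resp.T @ X) / nj[:, None]
        cov = sum(resp[:, j, None, None] *
                  np.einsum('ki,kj->kij', X - means[j], X - means[j])
                  for j in range(m)).sum(axis=0) / n
    return props, means, cov, resp.argmax(axis=1)   # hard allocation of units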

The iris data were also analyzed by Maronna and Jacovkis (1974) using the criterion of finding g which minimizes

    Σ_{j=1}^{m} |W_j|^{1/p} ,

where W_j is defined as in (5.5.1). They report that the results were similar to those obtained by the criteria max tr(W^{-1}B) and min |W|.
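A small sketch of evaluating this criterion, in the form reconstructed here (the sum over groups of |W_j|^(1/p)); the data array and label vector are placeholders.

import numpy as np

def within_ss_matrix(X, members):
    Xj = X[members]
    d = Xj - Xj.mean(axis=0)
    return d.T @ d

def maronna_jacovkis_criterion(X, g):
    """Sum over groups of |W_j|^(1/p) for a partition given by labels g."""
    g = np.asarray(g)
    p = X.shape[1]
    total = 0.0
    for label in np.unique(g):
        Wj = within_ss_matrix(X, g == label)
        total += np.linalg.det(Wj) ** (1.0 / p)
    return total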

Thus, there are cluster analysis algorithms which can perform well on Fisher's (1936) iris data. However, with the exception of Wolfe's (1970) algorithms, the user is not clear what assumptions he is implicitly making when employing these techniques. A discussion of Wolfe's (1970) technique has already been given in Chapter 2. We now apply the techniques we have proposed in Chapters 4 and 5 to these data.

6.2 Methodology for Our Analysis

We analyze the data in four ways under different assumptions. In all four analyses we assume the component distributions to be multivariate normal with different mean vectors. The four analyses correspond to the following models:

(a) exactly three groups, common covariance matrices,

(b) one to four groups, common covariance matrices,

(c) exactly three groups, different covariance matrices, and

(d) one to four groups, different covariance matrices.

We shall not investigate the effect of varying prior distributions; this is done for simpler data in Chapter 7. For the two cases where we assume a common covariance matrix Σ, say, we assume the prior for Σ^{-1}, μ_1,...,μ_m given (g,m) to be that given in §5.4, with α (the degrees of freedom of the Wishart density) being four (the smallest integer which yields a proper prior). We assume that (5.4.2) is a good approximation to (5.4.1) and that the entries of the inverse of the scale matrix of the Wishart density are sufficiently small that we may further approximate the posterior distribution of (g,m) by setting this matrix to 0. Therefore, our prior for Σ^{-1}, μ_1,...,μ_m given (g,m) is assumed to be close to the improper prior proportional to |Σ^{-1}|^{-1/2}. Similarly, for the case of different covariance matrices Σ_1,...,Σ_m, we assume that the prior for μ_1,...,μ_m, Σ_1^{-1},...,Σ_m^{-1} is close to the improper prior proportional to

    Π_{j=1}^{m} |Σ_j^{-1}|^{-1/2} .

We also assume that the prior on m is constant over the allowable values.

Therefore, for the equal covariance case, we have from (5.4.2) that

    p(g | x_1,...,x_n, m) ∝ p_{G|M}(g|m) { Π_{j=1}^{m} n_j^{-2} } |W|^{-(n+4)/2} ,        (6.2.1)

and for the case of different covariance matrices, we have from (5.5.2) that

    p(g | x_1,...,x_n, m) ∝ p_{G|M}(g|m) Π_{j=1}^{m} [ n_j^{-2} |W_j|^{-(4+n_j)/2} Π_{i=1}^{4} Γ{(5+n_j-i)/2} ] .        (6.2.2)

In order that the posterior probabilities for g are finite, we shall assume that

(a) n_j ≥ 1 for j = 1,...,m when the covariance matrices are equal, so that Π_{j=1}^{m} n_j^{-2} < ∞, and

(b) n_j ≥ 5 for j = 1,...,m when the covariances are different, so that |W_j| > 0 with probability one.

If we let Ω_k(m) = {g : n_j ≥ k for j = 1,...,m}, then the prior for g is taken to be

    p_{G|M}(g|m) = [ (150-mk+m-1 choose m-1) ]^{-1} [ 150!/(n_1! ... n_m!) ]^{-1}        (6.2.3)

for all g ∈ Ω_k(m), where k = 1 in the equal covariance case and k = 5 otherwise. This prior distribution is equivalent to the following model:

    p_{Λ|M}(λ_1,...,λ_m | m) = (m-1)!    for 0 < λ_i < 1, Σ_{i=1}^{m} λ_i = 1,

    p_{G|Λ,M}(g | λ_1,...,λ_m, m) = Π_{i=1}^{m} λ_i^{n_i} / ( Σ_{g'∈Ω_k(m)} Π_{i=1}^{m} λ_i^{n_i'} )    for g ∈ Ω_k(m),

                                  = 0    otherwise.

When we refer to the likelihood function in this chapter we mean the function

    Π_{k=1}^{n} Σ_{i=1}^{m} λ_i h(x_k; μ_i, Σ_i) ,

where h is the multivariate normal density function with mean μ_i and covariance matrix Σ_i. This is the density function of x_1,...,x_n given m, λ_1,...,λ_m, μ_1,...,μ_m, Σ_1,...,Σ_m.

Using (6.2.1) to (6.2.3) we may find estimates of the posterior "similarity" matrix with entries Pr[G(k) = G(ℓ) | x_1,...,x_n]. For the iris data we essentially employ the approximation technique outlined in §4.5. Initially we plot the first two principal components as in Figure 6.1 and make several "guesses" as to which partitions will yield relatively high posterior probabilities. In conjunction with each of these "guesses", we specify a set of units for which we are unsure whether moving those units to another specified group would yield a higher posterior probability. We then search through these doubtful units, assigning them to the other group, until no higher probabilities are found. We repeat the process with a new set of doubtful units until a local maximum is attained. We do this for each allowable value of m and thus create the Local Maximum File. Since the posterior probabilities are the same for all m! permutations of the group labels, we restrict the partitions to only one of these m! permutations when generating local maxima. Once having created the Local Maximum File we proceed as outlined in §4.5, except that we do not add to the Rest of the Sample File any partition whose posterior probability is less than 10^{-16} of the maximum posterior probability.
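The search over doubtful units described above can be written schematically as follows; the function posterior, standing for the quantity defined by (6.2.1)-(6.2.3), and the lists of doubtful units and rival groups are assumed to be supplied by the user, so this is a sketch rather than the program actually used.

def local_maximum(g_start, doubtful, rival_group, posterior):
    """g_start: list of group labels; doubtful: unit indices to try moving;
    rival_group: dict unit -> alternative label; posterior: callable on g."""
    g = list(g_start)
    best = posterior(g)
    improved = True
    while improved:
        improved = False
        for k in doubtful:
            old = g[k]
            g[k] = rival_group[k]          # tentatively reassign unit k
            value = posterior(g)
            if value > best:
                best, improved = value, True
            else:
                g[k] = old                 # undo the move
    return g, best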

We find the optimal clustering of the data using loss function (4.2.3.9) with c = d. In the next section, we outline the results of our procedures.

6.3 Results

6.3.1 Common Covariance Matrices

The criterion of minimizing |W| is equivalent to maximizing (6.2.1) if

    p_{G|M}(g|m) ∝ Π_{j=1}^{m} n_j^{2} .

This prior is more strongly peaked around n_1 = ... = n_m than the prior we have used, namely

    p_{G|M}(g|m) ∝ [ 150!/(n_1! ... n_m!) ]^{-1} .

Since the true grouping for the iris data is such that n_1 = n_2 = n_3 = 50, we would expect that minimizing |W| would perform better in terms of reproducing the original groups. However, it turns out that when m = 3 with probability one, the same grouping maximizes the posterior probability, minimizes the expected loss and minimizes |W|; namely, units 71 and 84 are misclassified as Virginica and unit 134 is misclassified as Versicolor. Our analysis has the benefit that we can investigate the posterior "similarity" matrix to see which units seem doubtful as to their classification.

In Table 6.1 we give in the third column those values of k such that Pr[G(k) = G(ℓ) | x_1,...,x_n] generally lies in the range of the second column whenever ℓ is a unit assigned to a group specified by the first column.

We call

Group I: All of Setosa,

Group II: Unit 134 and all of Versicolor except for 71 and 84,

Group III: Units 71 and 84 and all of Virginica except for 134. - 104 -

Table 6.1

DOUBTFUL UNITS FOR EXACTLY THREE GROUPS, COMMON COVARIANCE

Group Probability Range Units

II      (.35, .40)                  120
        (.25, .30)                  134
        (.15, .20)                  135
        (.05, .10) or (.90, .95)    71, 73, 84, 85, 127
        (.01, .05) or (.95, .99)    67, 69, 107, 124, 128, 139

III     (.60, .65)                  120
        (.70, .75)                  134
        (.80, .85)                  135
        (.05, .10) or (.90, .95)    71, 73, 84, 85, 127
        (.01, .05) or (.95, .99)    67, 69, 107, 124, 128, 139

All other units usually had probabilities in the range (0, .001) or (.99, 1) unless they were paired with one of those listed in Table 6.1. We see that all the misclassified units are in the top ten of Table 6.1. Also, all nine of the Virginica flowers in this list were misclassified in Kendall's (1966) analysis. The three units misclassified by the max tr(W^{-1}B) criterion are in the top ten of the table. As we shall discuss in §6.3.2, we have reason to believe that Wolfe's (1970) analysis based on his NORMIX program misclassified 69, 71, 73, 78 and 84. All these appear in Table 6.1.
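A table such as Table 6.1 can be read off an estimated similarity matrix by listing, for the units of a chosen group, the partners whose pairwise probabilities are neither near zero nor near one. The sketch below is illustrative only; the cut-off values are arbitrary.

import numpy as np

def doubtful_pairs(S, group_members, lo=0.01, hi=0.99):
    """Return triples (k, l, S[k, l]) for units k in the group and partners l
    whose pairwise probability lies strictly between lo and hi."""
    out = []
    for k in group_members:
        for l in range(S.shape[0]):
            if l != k and lo < S[k, l] < hi:
                out.append((k, l, float(S[k, l])))
    return out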

We now consider the case where m can take values one to four. The priors we have used lead to deciding on four groups. They are as follows:

Group I: All of Setosa,

Group II: Units 120 and 134 and all of Versicolor except 71,

Group III: 104, 106, 108, 109, 117, 118, 123, 126, 130, 131, 132, 135, 138,

Group IV: Unit 71 and all of Virginica except 120, 134 and those in Group III.

Essentially the analysis has split Virginica into two groups, and otherwise has misclassified units 71, 120 and 134. Because we have assumed equal covariance matrices, the procedure attempts to make each group have the same "shape". The result is that the convex hulls of Groups III and IV overlap. We now present a table analogous to Table 6.1 for this case.

Table 6.2

DOUBTFUL UNITS FOR UP TO FOUR GROUPS, COMMON COVARIANCE

Group Probability Range Units

II      (.50, .60)                  84, 120
        (.75, .80)                  134
        (.01, .05) or (.95, .99)    71, 73, 78, 85, 117, 124, 127, 128, 138, 139, 150

III     (.40, .50)                  84, 120
        (.20, .25)                  103, 134
        (.05, .10) or (.90, .95)    117, 138
        (.01, .05) or (.95, .99)    73, 104, 136

IV      (.75, .80)                  103
        (.05, .10) or (.90, .95)    117, 138
        (.01, .05) or (.95, .99)    71, 78, 84, 85, 104, 124, 127, 128, 136, 139, 150

Except for units 120 and 134, the Virginica group lies entirely in Groups III and IV. Of these 48 flowers, ten appear in Table 6.2, but only five of them were doubtful between Groups III and IV. The three other misclassified flowers, 71, 120 and 134, all appear in the table. Except for 103 and 136, all the Virginica flowers which appear in the table were misclassified in Kendall's (1966) analysis. Note that units 103 and 136 were only doubtful between Groups III and IV, which together may be considered the Virginica group, so in our analysis we are fairly sure that those units do not belong to the Versicolor group. Of the five flowers which Kendall considered remote from any group, three are placed in Group III, the smaller of our two Virginica groups. None of them appeared in Table 6.2.

If we compare the results for the two cases, (i) m = 3 and (ii) m ∈ {1,2,3,4} combining Groups III and IV, we misclassify three units in both cases. Both times 71 and 134 are misclassified, while unit 84 is misclassified in case (i) and unit 120 is misclassified in case (ii). In both analyses, unit 120 appears high in the list of doubtful units and in case (ii) unit 84 is high.

6.3.2 Different Covariance Matrices

With different covariance matrices the prior we have assumed leads to unappealing results. This is because the likelihood function is unbounded as |Σ_i| → 0 for some i ∈ {1,...,m} with μ_i = x_k for any k. The result of this is that the cluster analysis based on (6.2.2) and the prior for g we have assumed tends to yield a partition with a small number of units in one of the groups.

It turns out that if we ignore such clusterings, then for the case when m is exactly three, the grouping with maximum posterior probability misclassifies the five units 69, 71, 73, 78 and 84 as belonging to the Virginica group. Now Wolfe (1970) reported that when he applied his NORMIX program to the iris data, he misclassified five of the Versicolor flowers as belonging to Virginica. It seems reasonable to suspect that they were the same five plants.

Let us now consider what happens when we allow the groupings to take any value in Ω_5(m), defined in §6.2. First of all, a difficulty arises because the within sums of squares matrix associated with units 9, 14, 39, 42, 43, 133 and 145 is singular. This has occurred because of the limited accuracy of the measurements in the raw data. For all seven units the relation Sepal Length = Petal Width + 4.2 holds. Because the determinant of the within sums of squares matrix is zero, expression (6.2.2) would be infinite. To avoid these anomalies, we ignore such groupings.
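The anomaly just described can be detected numerically by checking whether the within sums of squares matrix of a candidate group is (nearly) singular, as in the following sketch; the column ordering of the data array and the tolerance are assumptions.

import numpy as np

def group_is_degenerate(X, members, tol=1e-10):
    """True if the within sums of squares matrix of the given group of rows
    of X is singular (or numerically close to singular)."""
    Xj = X[members]
    d = Xj - Xj.mean(axis=0)
    W = d.T @ d
    return np.linalg.det(W) <= tol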

We now look at the results when m can take values up to four. The assumed priors lead to deciding on four groups, the optimal partition

being:

Group I: 7, 9, 39, 42, 43,

Group II: All of Setosa except units in Group I,

Group III: All of Versicolor except 69, 71, 73, 78 and 84,

Group IV: 69, 71, 73, 78, 84 and all of Virginica.

Therefore, we have split Setosa into two groups and, of the remaining units, we have misclassified the same five units we suspect were misclassified using Wolfe's (1970) NORMIX program. The table analogous to Tables 6.1 and 6.2 is as follows:

Table 6.3

DOUBTFUL UNITS FOR UP TO FOUR GROUPS, DIFFERENT COVARIANCES

Group   Probability Range           Units

I       (.01, .05) or (.95, .99)    14, 42

II      (.01, .05) or (.95, .99)    14, 42

III     (.20, .25) or (.75, .80)    78, 88
        (.10, .20) or (.80, .90)    85, 134
        (.05, .10) or (.90, .95)    54, 74
        (.01, .05) or (.95, .99)    55, 56, 60, 64, 67, 69, 71, 73, 77, 79, 91, 132

IV      (.20, .25) or (.75, .80)    78, 88
        (.10, .20) or (.80, .90)    85, 134
        (.05, .10) or (.90, .95)    54, 74
        (.01, .05) or (.95, .99)    55, 56, 60, 64, 67, 69, 71, 73, 77, 79, 91, 132

Two of the five units assigned to Group I appear in this table, and all the units misclassified by the analysis between Groups III and IV except unit 84 are in this table. If we combine Groups I and II, we see that our clusters are reasonable.

If we assume that m is exactly three, then the optimal partition becomes

Group I: 7, 9, 14, 23, 43,

Group II: All of Setosa except Group I units,

Group III: All of Versicolor and Virginica.

We again get one small group and the rest of the data is split into two groups. Since Setosa is remote from the other two varieties, the rest of

Setosa forms a single group. All the pairs of units with posterior probabilities not being near zero or one occur between units in Groups I and II.

Table 6.4

DOUBTFUL UNITS FOR EXACTLY THREE GROUPS, DIFFERENT COVARIANCES

Group Probability Range Units

I       (.05, .10) or (.90, .95)      14, 24
        (.001, .01) or (.99, .999)    7, 9, 23, 39, 42

II      (.05, .10) or (.90, .95)      14, 24
        (.001, .01) or (.99, .999)    7, 9, 23, 39, 42

Hence four of the five units separated from the bulk of Setosa are listed as mildly doubtful. This partition is of limited interest since, after finding a small group (which we get because our priors are fairly flat), the data must split into two groups and the analysis does the obvious separation of Setosa from the rest.

In general our results for the cases where the covariance matrices are different are unsatisfactory as a result of our choice of priors. We could continue our investigation of the behaviour of our procedures on this data set using other priors, but not much further insight would be gained. We would expect that using priors which are close to the ones used here would lead to similar results, but as the priors become more peaked the results would be different and would depend strongly on the prior used.

6.4 Summary and Discussion

To summarize the clusterings resulting from the different analyses of the Fisher (1936) iris data, we give the matrix N with entries n_{ij}, the number of units whose true group is i and whose estimated group is j. All the clustering algorithms considered in this chapter ignore the labels associated with the true groups and the estimated groups, so the evaluation of the merits of a particular clustering should be invariant to permutations of rows and columns of N. We leave out Kendall's (1966) analysis in this summary since he did not give a complete partition of the units, but left some undecided.
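The summary matrix used in Table 6.5 can be computed as in the following sketch; since the group labels are arbitrary, any comparison of such matrices should be made up to permutations of rows and columns.

import numpy as np

def cross_tabulation(true_labels, estimated_labels):
    """Matrix N with n_ij = number of units with true group i and estimated group j."""
    true_labels = np.asarray(true_labels)
    estimated_labels = np.asarray(estimated_labels)
    rows = np.unique(true_labels)
    cols = np.unique(estimated_labels)
    N = np.zeros((len(rows), len(cols)), dtype=int)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            N[i, j] = np.sum((true_labels == r) & (estimated_labels == c))
    return N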

Table 6.5

SUMMARY OF ESTIMATED CLUSTERINGS

Algorithm: min tr(W) criterion - 3 groups

Setosa       50    0    0
Versicolor    0   38   12
Virginica     0   11   39

Algorithms: min |W| criterion, Bayesian common covariance and Wolfe's NORMAP - 3 groups

Setosa       50    0    0
Versicolor    0   48    2
Virginica     0    1   49

Algorithm: max tr(W^{-1}B) criterion - 3 groups

Setosa       50    0    0
Versicolor    0   47    3
Virginica     0    0   50

Algorithm: Wolfe's NORMIX - 3 groups

Setosa       50    0    0
Versicolor    0   45    5
Virginica     0    0   50

Algorithm: Bayesian common covariance - 4 groups

Setosa       50    0    0    0
Versicolor    0   49    0    1
Virginica     0    2   13   35

Algorithm: Bayesian different covariances - 4 groups

Setosa        5   45    0    0
Versicolor    0    0   45    5
Virginica     0    0    0   50

Algorithm: Bayesian different covariances - 3 groups

Setosa        5   45    0
Versicolor    0    0   50
Virginica     0    0   50

We see that our Bayesian procedure for common covariance matrices and three groups compares favourably with the other algorithms. However, when the number of groups is allowed to be between one and four, the priors we have used give a posterior which favours four groups. This is because the assumption of equal covariance matrices implies the clusters should have the same "shape". Since both our prior for this "shape" (that is, the prior for the common covariance matrix) and our prior for the number of units in each group are not strongly peaked around particular values, the split into four groups satisfies the requirement of the clusters having the same "shape" better than the split into three groups.

When we assume that the covariance matrices are different and our priors for the matrices and for the number of units in each group are not strongly peaked around particular values, then the size of this data set is too small for a good separation into clusters. We have indicated that if we ignore small isolated groups then the clustering into 3 groups is the same as in Wolfe's (1970) NORMIX program. This suggests that if our prior for g given m were more strongly peaked around n_1 = ... = n_m then the analysis would have given better results.

The data set considered in this chapter contained 150 four-dimensional observations. This is relatively small compared to many applications of cluster analysis. However, in the computer program, the creation of the Sample File was continued until a partition with maximum posterior probability among the partitions not in the Generator File had a probability of one third that of the partition with maximum probability in the Generator File. Although we have not included in our sample all the partitions with appreciable probability, tables such as 6.1 to 6.4 have identified those units which may be in error. With a larger data set, the recommended procedure would be to take as the estimated partition that with maximum posterior probability, and get a rough idea which units are doubtful by taking partitions close to the estimated partition. In all the examples we have considered, including those in §5.9, the optimal partition, using loss function (4.2.3.9) with c = d, was the partition with maximum posterior probability.

There is a need to find efficient procedures for computing the posterior probabilities of (g,m) given the observations. For posterior probabilities which involve computing |W| or |W_j| (j = 1,...,m), it is important to have an efficient procedure for calculating the determinant, especially when p, the dimensionality of the observation vectors, is large. For example, if |W| is known for a particular grouping and we then move a unit from one group to another, the new value of |W| should be expressed in terms of the old. Also, since

    |W| = |T| |I_m - diag(n_1,...,n_m) X' T^{-1} X| ,        (6.4.1)

where

    T = Σ_{k=1}^{n} (x_k - x̄)(x_k - x̄)' ,    X = (x̄_1 - x̄, ..., x̄_m - x̄) ,

x̄_i is the i-th group mean and x̄ the overall mean, the right hand side of (6.4.1) may be quicker to compute than directly computing |W| if p is larger than m. (Note that T does not change for different partitions.)
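The identity as reconstructed in (6.4.1) can be checked and exploited numerically as in the sketch below; only an m x m determinant has to be recomputed when the partition changes, since T is computed once. This is a sketch under the reconstruction above, not the thesis's program.

import numpy as np

def det_W_direct(X, g):
    """|W| computed directly from the pooled within-group deviations."""
    g = np.asarray(g)
    W = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(g):
        d = X[g == label] - X[g == label].mean(axis=0)
        W += d.T @ d
    return np.linalg.det(W)

def det_W_via_identity(X, g, T=None):
    """|W| via |T| times an m x m determinant; T may be precomputed once."""
    g = np.asarray(g)
    xbar = X.mean(axis=0)
    if T is None:                                    # T does not depend on g
        d = X - xbar
        T = d.T @ d
    labels = np.unique(g)
    n_j = np.array([np.sum(g == lab) for lab in labels])
    Xbar = np.column_stack([X[g == lab].mean(axis=0) - xbar for lab in labels])
    M = np.eye(len(labels)) - np.diag(n_j) @ Xbar.T @ np.linalg.solve(T, Xbar)
    return np.linalg.det(T) * np.linalg.det(M)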

In this chapter we demonstrated our proposed Bayesian cluster analysis under relatively flat priors. In the next chapter we consider how the estimated clusters change under different priors, using another data set.

Chapter 7

DUNCAN'S BARLEY DATA

7.1 Introduction

If in an analysis of variance table the mean square (due to treatments, say) is significant, the question often arises as to which treatment effects are equal. This is generally referred to as the problem of multiple comparisons, a review of which is given by O'Neill and

Wetherill (1971). In this chapter, we give a numerical illustration of this problem in a cluster analysis framework and compare our results when using different prior distributions.

Scott and Knott (1974) suggested a particular cluster analysis approach to this problem in a balanced design. They employed a divisive cluster analysis technique, by finding the partition into two groups which maximized the between sum of squares, repeating this procedure by splitting each of the newly formed smaller groups into a further two groups. After each split, they performed a test, based on the maximum between sum of squares, to decide if the split was significant. This technique was applied to the data on barley grain yields published in Duncan (1955).

The data arose from a randomized block experiment on seven varieties of barley, each with six replicates. We summarize the data in Table 7.1.

For convenience we refer to the varieties by their ranks.

Table 7.1

DUNCAN'S BARLEY DATA

Variety    A      F      G      D      C      B      E
Mean       49.6   58.1   61.0   61.5   67.6   71.2   71.3
Rank       1      2      3      4      5      6      7

Analysis of Variance

Source               d.f.    m.s.
Between Varieties    6       366.97
Between Blocks       5       141.95
Error                30      79.64

Estimated Standard Error of a Varietal Mean (based on 30 d.f.): 3.643

Scott and Knott (1974) first split the varieties into two groups, 1234,567. They found that the further splits are 1,234, which is on the borderline of significance at the 5% level, and 5,67, which is not significant at the 5% level. Hence they decide on the clusters 1234,567 with the possibility of a further split to 1,234,567. The latter of these was also suggested by Plackett in the discussion of O'Neill and Wetherill (1971). It should be noted that Scott and Knott (1974) do not use the between block sum of squares in their analysis.

7.2 Model for Clustering Varieties

In this section we present a Bayesian model for analyzing the barley

data. We let y_{ik} (i = 1,...,p; k = 1,...,n) be the individual barley yields for the i-th block and k-th variety. We make the usual linear assumptions

    y_{ik} = η + α_k + β_i + ε_{ik} ,        (7.2.1)

where the α's and β's are real constants corresponding to the variety and block effects respectively, and the ε's are independent normal random variables with E(ε_{ik}) = 0 and V(ε_{ik}) = σ². We let

    γ_k = η + α_k ,
    β^T = (β_1,...,β_p) ,
    y_k^T = (y_{1k},...,y_{pk}) .

We now introduce the grouping vector g^T = (g(1),...,g(n)) where, if there are m groups, g(k) ∈ {1,...,m} and g(k) = g(ℓ) if and only if γ_k = γ_ℓ. Therefore, given (m,g), we have parameters β, μ_1,...,μ_m, σ², where μ_j = γ_k whenever g(k) = j. We have that y_1,...,y_n, given m, g, β, μ_1,...,μ_m, σ², are independent normal with

    E(y_k | m,g,β,μ_1,...,μ_m,σ²) = β + μ_{g(k)} 1 ,

where 1 is the p-dimensional vector of 1's, and

    V(y_k | m,g,β,μ_1,...,μ_m,σ²) = σ² I .

This model is similar to those discussed in §5.7 except that the covariance matrix is restricted to the form σ² I. We let θ = 1/σ² and consider the following priors for the parameters. Given (g,m) and σ², we assume that (β^T, μ_1,...,μ_m)^T is (p+m)-variate normal with

    E(β | g,m,σ²) = 0 ,
    E(μ_j | g,m,σ²) = c_j ,
    V(β | g,m,σ²) = (σ²/t) I_p ,        (7.2.2)
    V[(μ_1,...,μ_m)^T | g,m,σ²] = (σ²/u) I_m ,
    Cov(β, μ_j | g,m,σ²) = 0 ,

where c_1,...,c_m, t, u are known. We assume the prior on θ to be the gamma distribution

    p_Θ(θ) = [b^a Γ(a)]^{-1} θ^{a-1} e^{-θ/b} ,        (7.2.3)

where a > 0, b > 0. These prior distributions are a special case of the general conjugate type priors we discussed in §4.4. Note that we are assuming that the mean of the block effects is zero. This is not unreasonable since in the model (7.2.1) it is usually assumed that Σ_i β_i = 0. We see that as t → ∞ there is no block effect, since β → 0 in probability. In that case we may incorporate the sum of squares due to blocks into the sum of squares due to error to yield a mean square due to error based on 35 degrees of freedom in the analysis of variance table. As t → 0 we can become arbitrarily close to a uniformly flat prior for the block effect, and as u → 0 the same is true for the variety effect.

An empirical Bayes approach might be considered to determine values of t and u based on the data when t and u do not depend on (g,m). Since the expected value of

    { (mean square due to blocks)/(mean square due to error) - 1 } / n        (7.2.4)

is t^{-1}, the prior variance of β_i/σ given σ, an ad hoc estimate of t^{-1} is

    max[ 0, { (mean square due to blocks)/(mean square due to error) - 1 } / n ] .        (7.2.5)

Note that we let the estimate of t^{-1} be zero when (7.2.4) is negative. For the data given in Table 7.1, expression (7.2.5) has the value 0.11. Also, if there are m groups with n_j varieties in the j-th group and the c_j's in

(7.2.2) are all equal, then the expected value of

    { (mean square due to varieties)/(mean square due to error) - 1 } / p        (7.2.6)

is

    u^{-1} ( 1 - Σ_{j=1}^{m} (n_j/n)² ) n/(n-1) ,

which is less than or equal to u^{-1}. Therefore, an ad hoc upper bound for u^{-1}, the prior variance of μ_j/σ given σ, is given by

    max[ 0, { (mean square due to varieties)/(mean square due to error) - 1 } / p ]        (7.2.7)

when c_1 = ... = c_m = c. For the data given in Table 7.1, expression (7.2.7) has the value 0.60. When all the c_j's are equal to c (not depending on (g,m)), an ad hoc estimate of c is the overall mean, 62.9.
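The ad hoc quantities quoted above can be checked directly from the mean squares in Table 7.1 (p = 6 blocks, n = 7 varieties), as in the following short sketch.

# Expressions (7.2.5), (7.2.7) and the overall mean, using Table 7.1.
ms_blocks, ms_varieties, ms_error = 141.95, 366.97, 79.64
p_blocks, n_varieties = 6, 7

t_inv_hat = max(0.0, (ms_blocks / ms_error - 1.0) / n_varieties)      # ~0.11
u_inv_bound = max(0.0, (ms_varieties / ms_error - 1.0) / p_blocks)    # ~0.60

variety_means = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]
c_hat = sum(variety_means) / len(variety_means)                       # ~62.9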

The prior mean and variance of σ² are given by [b(a-1)]^{-1} for a > 1 and [b²(a-1)²(a-2)]^{-1} for a > 2, respectively. The only condition for p_Θ(θ) to be a proper distribution is that a > 0 and b > 0, so that the prior mean and variance of σ² may not exist.

We let ȳ.. denote the overall mean, SST the variety sum of squares, SSB the block sum of squares, SSE the residual sum of squares, T̄_k the sample mean for the k-th variety and T̄_j the mean of the variety means in the j-th group. Also Q_j = {k : g(k) = j}.

We can show that the density function for y_1,...,y_n given (g,m,θ) is

    (2π)^{-np/2} θ^{np/2} { t/(t+n) }^{p/2} Π_{j=1}^{m} { u/(u+n_j p) }^{1/2}
      × ( 1 - {p/(t+n)} Σ_{j=1}^{m} { n_j²/(u+n_j p) } )^{-1/2}
      × exp[ (-θ/2) { SST + SSE + (t/(t+n))(SSB + np ȳ..²) + u Σ_{j=1}^{m} c_j²
          - Σ_{j=1}^{m} ( p n_j T̄_j - (n_j/(t+n)) np ȳ.. + u c_j )² / (u+n_j p)
          - ( Σ_{j=1}^{m} (n_j/(u+n_j p)) ( p n_j T̄_j - (n_j/(t+n)) np ȳ.. + u c_j ) )²
              / ( (t+n)/p - Σ_{j=1}^{m} n_j²/(u+n_j p) ) } ] .        (7.2.8)

Now if u is sufficiently close to zero that we may approximate (7.2.8) by letting u = 0, and if t does not depend on g or m, then (7.2.8) is approximately proportional to

    θ^{np/2} ( Π_{j=1}^{m} n_j )^{-1/2} exp[ (-θ/2) { SSE + (t/(t+n)) SSB + p Σ_{j=1}^{m} Σ_{k∈Q_j} (T̄_k - T̄_j)² } ] .        (7.2.9)

Therefore, if for fixed m the prior for g is proportional to Π_{j=1}^{m} n_j^{1/2}, then for all values of t and any prior for θ, the grouping with maximum posterior probability when u = 0 is that which minimizes

    Σ_{j=1}^{m} Σ_{k∈Q_j} (T̄_k - T̄_j)² .        (7.2.10)

Since the total sum of squares of the variety means is fixed, minimizing (7.2.10) is equivalent to maximizing the between-group sum of squares of the variety means; when m = 2, this is exactly the split into two groups which Scott and Knott (1974) use at each stage of their algorithm.
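As a check on the connection with Scott and Knott's (1974) first split, the sketch below enumerates the contiguous two-group splits of the ordered variety means in Table 7.1 and evaluates (7.2.10) as reconstructed above; for a single response the best two-group split is always contiguous in the ordered means, so this enumeration suffices.

# Variety means from Table 7.1, in rank order 1,...,7.
means = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]

def within_ss(groups):
    """Sum over groups of the squared deviations of the means about the group mean."""
    total = 0.0
    for grp in groups:
        gbar = sum(grp) / len(grp)
        total += sum((x - gbar) ** 2 for x in grp)
    return total

splits = {f"{'1234567'[:c]},{'1234567'[c:]}": within_ss([means[:c], means[c:]])
          for c in range(1, 7)}
best = min(splits, key=splits.get)          # expected to be '1234,567'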

We assume that (a,b,t,u) do not depend on (g,m), so the posterior probability for (g,m) given y_1,...,y_n, based on (7.2.8), is proportional to

    p_M(m) p_{G|M}(g|m) Π_{j=1}^{m} { u/(u+n_j p) }^{1/2} [ 1 - {p/(t+n)} Σ_{j=1}^{m} { n_j²/(u+n_j p) } ]^{-1/2}
      × [ (1/b) + (1/2) { SST + SSE + (t/(t+n))(SSB + np ȳ..²) + u Σ_{j=1}^{m} c_j²
          - Σ_{j=1}^{m} ( p n_j T̄_j - (n_j/(t+n)) np ȳ.. + u c_j )² / (u+n_j p)
          - ( Σ_{j=1}^{m} (n_j/(u+n_j p)) ( p n_j T̄_j - (n_j/(t+n)) np ȳ.. + u c_j ) )²
              / ( (t+n)/p - Σ_{j=1}^{m} n_j²/(u+n_j p) ) } ]^{-(a+np/2)} .        (7.2.11)

For the remainder of this chapter we assume the following:

(i) c_1 = ... = c_m = c, not depending on g or m,

(ii) p_M(m) ∝ ξ^{m-1}/(m-1)! for m ∈ {1,...,n}, where ξ > 0,

(iii) p_{G|M}(g|m) = Π_{i=1}^{m} Π_{j=1}^{n_i} (s+j-1) / Π_{k=1}^{n} (ms+k-1) if g(k) ≤ m for k = 1,...,n, and p_{G|M}(g|m) = 0 otherwise, where s does not depend on m,

(iv) a > 2, so that the prior values of E(σ²) and V(σ²) are finite.

Assumption (i) and the prior we have assumed for (μ_1,...,μ_m) given (g,m,σ²) are equivalent to assuming that η in (7.2.1) is known to be equal to c and that the prior mean for the variety effects given (g,m,σ²) is zero. We shall see in §7.3 that the value of c has little effect on the results.

Assumption (iii) is equivalent to

    p_{Λ|M}(λ | m) = [ Γ(ms)/{Γ(s)}^m ] Π_{i=1}^{m} λ_i^{s-1} ,

    p_{G|Λ,M}(g | λ, m) = Π_{i=1}^{m} λ_i^{n_i} .

As s → ∞, assumption (iii) becomes p_{G|M}(g|m) = m^{-n}; that is, each g is equally likely for fixed m. Assumptions (i) and (iii) imply that the posterior distribution for g is invariant to permutations of the labels of the groups. To derive the posterior distribution of g we must specify seven controls:

(i) the prior mean of m,

(ii) the value of s in assumption (iii) above,

(iii) the prior mean and variance of σ²,

(iv) the prior mean for the varieties,

(v) the prior values of V(β_i/σ | σ) and V(μ_j/σ | σ).

7.3 Method and Results

As suggested in §6.4, we shall let the estimated optimal partition be that which maximizes the posterior probability, instead of basing the optimal partition on the "similarity" matrix with entries Pr[G(k) = G(ℓ) | y_1,...,y_n], in order to save computer time. However, once having found this partition we estimate the "similarity" matrix as specified in §4.5, using the estimated optimal partition as our Local Maximum File and taking a total of ten generators. In §4.5 we suggested that with each generator we include in the sample the n(m-1) partitions in the m-neighbourhood of that generator for some specified m. However, in this example we include the n(n-1) partitions in the n-neighbourhood of each generator, where n is seven, the number of varieties. Since n > m with probability one, we are including a larger set of partitions in our sample.

To find the grouping with maximum posterior probability we perform the following:

(i) Select a value of m with probability given by the prior for m.

(ii) Select (n_1,...,n_m) with probability equal to the reciprocal of the binomial coefficient (n+m-1 choose m-1); that is, uniformly over the compositions of n into m non-negative parts.

(iii) Let g = (1,...,1, 2,...,2, ..., m,...,m), where the label j appears n_j times.

(iv) Search the n-neighbourhood of g for a partition with larger posterior probability. If one is found, then we let g be that partition and we search its n-neighbourhood. We continue this process until a local maximum is found.

(v) We also do step (iv) with the initial g being (1,...,1).

(vi) We repeat steps (i) to (iv) at least five times and take the optimal g to be that with maximum posterior probability.

It turns out that each local maximum we find by this process is either the partition derived from the starting point (1,...,1) or at most one other partition, if we consider partitions to be indistinguishable whenever they differ by a permutation of the labels {1,...,m}.
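Steps (i) to (vi) can be written schematically as follows; the function posterior, standing for (7.2.11), and the samplers for m and (n_1,...,n_m) are assumed to be supplied by the user, so this is a sketch rather than the program actually used.

def neighbourhood(g, m):
    """All partitions differing from g in the label of exactly one unit."""
    for k in range(len(g)):
        for label in range(1, m + 1):
            if label != g[k]:
                yield g[:k] + [label] + g[k + 1:]

def hill_climb(g, m, posterior):
    best, best_value = g, posterior(g, m)
    improved = True
    while improved:
        improved = False
        for h in neighbourhood(best, m):
            value = posterior(h, m)
            if value > best_value:
                best, best_value, improved = h, value, True
                break                       # continue climbing from h
    return best, best_value

def random_restarts(n, sample_m, sample_sizes, posterior, restarts=5):
    results = []
    for _ in range(restarts):
        m = sample_m()                      # step (i)
        sizes = sample_sizes(m)             # step (ii)
        g = [j + 1 for j, nj in enumerate(sizes) for _ in range(nj)]    # (iii)
        results.append(hill_climb(g, m, posterior))                     # (iv)
        results.append(hill_climb([1] * n, m, posterior))               # (v)
    return max(results, key=lambda r: r[1])                             # (vi)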

We now present our results for the data given in Table 7.1. We number the cases according to Table 7.2. These cases were selected so that the effect of radically altering each control could be seen.

Throughout we let the prior V(σ²) be small so that the effect of varying E(σ²) could be seen. Each case was chosen in the light of the results (given in Table 7.3) for the previous cases.

Table 7.2

LIST OF PRIOR DISTRIBUTIONS

Case   Prior   s       Prior    Prior    Prior Variety   V(μ_j/σ|σ)   V(β_i/σ|σ)
No.    E(m)            E(σ²)    V(σ²)    Mean

1      6.9     10^3    10^-4    10^-6    65              10^3         10^-6
2      6.9     10^3    10^-4    10^-6    65              10^3         10^-4
3      6.9     10^3    10^-4    10^-6    65              10^3         10^-1
4      6.9     10^3    10^-4    10^-6    65              10^3         10^5
5      4       10^3    10^-4    10^-6    65              10^3         10^5
6      2       10^3    10^-4    10^-6    65              10^3         10^5
7      1.1     10^3    10^-4    10^-6    65              10^3         10^5
8      6.9     1       10^-4    10^-6    65              10^3         10^5
9      6.9     1       10^-4    10^-6    65              10^3         10^-6
10     6.9     10^-3   10^-4    10^-6    65              10^3         10^5
11     6.9     1       10^-4    10^-6    50              1            10^5
12     6.9     1       10^-4    10^-6    70              1            10^5
13     6.9     1       10^-4    10^-6    30              1            10^5
14     6.9     1       10^-4    10^-6    100             1            10^5
15     6.9     1       10^-4    10^-6    70              10^-2        10^5
16     6.9     1       10^-4    10^-6    60              10^-2        10^5
17     6.9     1       10^-4    10^-6    50              10^-2        10^5
18     6.9     10^3    10^-4    10^-6    50              10^-2        10^5
19     6.9     1       10^3     10^-6    50              10^-2        10^5
20     6.9     10^-3   10^3     10^-6    65              10^3         10^5
21     1.1     10^3    10^3     10^-6    65              10^3         10^5
22     6.9     10^3    10^-4    10^-6    65              1            10^-6
23     6.9     10^3    10^-4    10^-6    65              1            1
24     6.9     10^3    10^2     10^-6    65              10^3         10^5
25     6.9     10^3    10^2     10^-6    65              1            10^5
26     6.9     10^3    10^3     10^-6    65              10^3         10^5

In Table 7.3 we give the estimated optimal partition for each of

these cases and summarize the "similarity" matrix by giving the pairs

which fall in certain ranges. These ranges are:

Range         Code
( 0, .01)     1
(.01, .05)    2
(.05, .10)    3
(.10, .25)    4
(.25, .75)    5
(.75, .90)    6
(.90, .95)    7
(.95, .99)    8
(.99, 1)      9

We code the groupings by

A: all in the same group,
B: two groups given by 1234,567,
C: three groups given by 1,234,567.

Note that groups B and C are those suggested by Scott and Knott (1974). All the estimated optimal partitions for the 26 cases considered are A, B or C.

Table 7.3

ANALYSIS OF BARLEY DATA

Case   Opt.    Entries of Similarity Matrix in the Range
No.    Group   1  2  3  4  5  6  7  8  9

1      B       (1,6) (1,5) (1,3) (1,2) (3,4) (6,7) (1,7) (1,4) (2,3) (5,6) (2,5) (2,4) (5,7) (2,6) (3,5) (2,7) (4,5) (3,6) (4,6) (3,7) (4,7)

2      B       --- Same as Case 1 ---

3 B (1,6) (1,5) all (1,2) (3,4) (6,7) (1,7) (2,6) other (2,3) (5,6) (2,7) pairs (2,4) (5,7)

4      B
5      B
6      B
7      A
8      B
9      B
10     A
11     C
12     C       --- Same as Case 11 ---
13     C       --- Same as Case 11 ---
14     C       --- Same as Case 11 ---

Case   Opt.    Entries of Similarity Matrix in the Range
No.    Group   1  2  3  4  5  6  7  8  9

15 A all pairs

16 A all pairs

17 A all pairs

18 B all (1,2) other (1,3) pairs (1,4) (2,3) (2,4) (3,4) (5,6) (5,7) (6,7)

19 A all pairs

20 A all pairs

21 A all pairs

22 C (1,5) all (1,2) (2,3) (1,6) other (2,4) (1,7) pairs (3,4) (2,6) (5,6) (2,7) (5,7) (6,7)

23 C (1,5) (1,2) (2,3) (1,6) (1,3) (2,4) (1,7) (1,4) (3,4) (2,6) (2,5) (5,6) (2,7) (3,5) (5,7) (3,6) (4,5) (6,7) (3,7) (4,6) (4,7)

24 B (1,6) (1,5) (2,5) all (5,6) (6,7) (1,7) (2,6) other (5,7) (2,7) pairs

Case Opt. Entries of Similarity Matrix in the Range No. Group 1 2 3 4 5 6 7 8 9

25 C (1,5) all (3,5) (1,2) (2,3) (6,7) (1,6) other (4,5) (2,4) (1,7) pairs (3,4) (5,6) (5,7)

26 A (1,3) all (1,4) other (1,5) pairs (1,6) (1,7)

We summarize these results. The partition with maximum posterior probability usually puts the barley varieties into at least two groups.

However, this does not occur in either of the following situations:

(a) The prior for the number of groups is such that it strongly

favours a single cluster; that is, E(m) is near one, or s is

near zero. This occurs in cases 7, 10, 20 and 21.

(b) The prior for the parameters of the component distributions is such that it strongly favours the component distributions being "relatively close" to each other compared to the error ε_{ik} in (7.2.1). In such cases it is difficult to separate out the clusters, and the partition with maximum posterior probability is the one with all varieties in the same group. However, the "similarity" matrix puts a lot of doubt on that clustering and none of the entries are near zero or one. In our example, the component distributions are "relatively close" if V(μ_j/σ | σ) is near zero or if σ² is large with high probability. This occurs in cases 15, 16, 17, 18, 19 and 26. However, case 18 gives two groups because the effect of a large s outweighs the effect of V(μ_j/σ | σ) being small.

If, however, neither (a) nor (b) holds, we get a clustering of either two or three groups. The important factor in these cases is the prior value of V(μ_j/σ | σ). If this is large, then an attempt is made to get as wide a separation of the groups as possible, and we end up with two groups. If, on the other hand, it is not too large (but not too small, otherwise condition (b) may take over and we would get only one group), we get three groups. The values of the prior variety mean and of the prior variance of the block effects (given σ²) do not seem to affect which grouping has maximum posterior probability.

We now look at the effect of the prior distributions on the

"similarity" matrix. We observe the following:

(a) As E(m) decreases, the "similarities" tend to be closer to

one. This occurs because our prior distribution asserts that

the pairs are more likely to be in the same group if E(m) is

small.

(b) As s decreases the "similarities" tend to be closer to one, because our prior asserts that pairs are more likely to be in the same group.

(c) As σ² increases the "similarities" tend to be closer to one.

This is because it is more difficult to discriminate between

groups if the ε's in (7.2.1) tend to be large, so the pairs

are more likely to be in the same group.

(d) The variety prior mean has little effect on the "similarities".

(e) As V(μ_j/σ | σ) increases, the "similarities" tend to be closer to

one, since our prior information about the group means is more

vague so it is more difficult to discriminate between groups.

(f) As V(β_i/σ | σ) increases, the "similarities" tend to be closer to

zero. This is due to the fact that we may explain more of the

variation of the data by the block effects so it is easier to

discriminate between variety groups.

7.4 Discussion

By taking Duncan's (1955) barley data as an example, we have demonstrated the feasibility of applying our clustering techniques to the problem

of multiple comparisons. We have shown that different prior distributions may lead to different clusters, but the resulting clusters are a natural consequence of the prior we assume. This is why it is important for the analyst to use a prior which makes sense for his particular application.

However, for all the cases we considered we came up with only three different possible clusterings. This tends to indicate that the procedure

is "robust" in the sense that "mild deviations" in the prior distribution

do not alter the resulting clusters. This property is important for

the analyst who usually cannot be too specific about his prior distribution but has a general feeling about the nature of the problem. Another

"robust" property of our procedure is that the ranking of the pairs in the

"similarity" matrix does not change much for different prior distributions.

One difficulty with our general procedure is that the practical feasibility of our analysis does not extend to the case of unbalanced designs. In theory, we could still derive the posterior probabilities of the respective groupings, but this involves much more computation.

Chapter 8

TOPICS FOR FURTHER RESEARCH

The preceding chapters discussed some general ideas on the partitioning of data units into mutually exclusive groups using parametric models. Our examples concentrated on the case where the distributions of the true groups were multivariate normal and we found that some of the techniques suggested by other authors fall into our general framework.

We have emphasized the importance of using the prior knowledge we may have about the groupings and have demonstrated how this prior knowledge may affect our results. We must agree with Cormack (1971) when he says: "The growing tendency to regard numerical taxonomy as a satisfactory alternative to clear thinking is condemned." We have suggested that this "clear thinking" can often be reflected in a Bayesian decision theoretic model.

A subjective Bayesian approach states that only one prior is appropriate for a particular user. There is no reason to suppose that such a prior corresponds to one of the conjugate type priors we have suggested in Chapter 4. However, the attitude we have taken is that by choosing a particular conjugate type prior, the user often has enough control over the resulting algorithm so as to yield useful results. In practice, if the user is employing the cluster analysis as an exploratory tool for a better qualitative understanding of the data, he may wish to specify a number of priors to see how the results differ.

There are a number of points which arise out of the development of our parametric model approach to cluster analysis that seem worthy of further research. We list a few of these. - 131 -

1. One problem is the numerical feasibility of the procedures

developed in Chapter 4. We have suggested a general method of

approximating the similarity matrix which seems to give

satisfactory results. There is a need to find efficient

algorithms for particular models, especially when the data matrix

is large in the number of rows or columns.

2. In the examples we have considered, the data were continuous and

certain normality assumptions were made. There is a need to

extend this to non-normal distributions and investigate the

behaviour of the resulting procedures.

3. We have justified the use of certain loss functions as a way of

expressing mathematically the concepts of "cohesion" within

clusters and "isolation" between clusters. We also found a

natural definition of a similarity matrix in our general

framework. However, the actual loss function used in practice

should reflect the purpose of the analysis. It would be

worthwhile to investigate which loss functions are appropriate

for specific applications.

4. If the number of attributes measured on each individual is large,

we may wish to consider a smaller set of attributes on which to

base our clustering. This is an interesting and important

problem which we have not discussed. A related problem is the

following. If we have a number of different clusterings based

on different sets of attributes, could we combine them to give

an overall clustering without going back to the original data? - 132 -

Similarly, the number of individuals to be clustered may be large. Since our similarity matrix is O(n²) in size, this can become too

large to handle. One way to overcome this problem is to process

the data sequentially, assigning one point at a time to a cluster.

How would we go about this? Could we improve our clustering by

performing more than one pass of the data file? Alternatively, we

could process subsets of the data separately and then combine

these into an overall clustering.

5. A problem with many multivariate data sets is that of missing

entries in the data matrix. In theory this may be modelled but

the analysis can become difficult. For example, we may consider

the density function of the k-th individual, given that g(k) = j

and given the parameters of group j, to be the marginal density function for

the attributes present in the k-th individual. Even if all the

marginal densities are in the exponential family, the natural

parameters of the distribution usually depend on which marginal

is being considered, and the integration with respect to the

parameters may not be easy. There is the additional problem that

the knowledge of a particular set of attributes not being observed

may yield some information about the parameter values, and this

should be incorporated into the model.

The above list includes a few of the problems associated with our approach to cluster analysis. The following message by Jardine in the discussion of Cormack (1971) summarizes our main theme.

"In conclusion, I suggest that a scientist who is tempted to use a method of automatic classification should beware, because unless he considers carefully both the significance of his data and his own purpose in classifying, he may be seriously misled by his results." - 133 -

REFERENCES

Anderberg, M. R. (1973) Cluster Analysis for Applications. New York: Academic Press.

Anderson, T. W. (1958) An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Anderson, T. W. (1962) The choice of the degree of a polynomial regression as a multiple decision problem. Ann. Math. Statist. 33, 255-265.

Aitchison, J. (1975) Goodness of prediction fit. Biometrika 62, 547-554.

Atkinson, A. C. (1970) A method for discriminating between models. J. R. Statist. Soc. B 32, 323-353.

Bartlett, M. S. and Macdonald, P. D. M. (1968) "Least squares" estimation of distribution mixtures. Nature 217, 195-196.

Behboodian, J. (1972) Bayesian estimation for the proportions in a mixture of distributions. Sankhya B 34, 15-20.

Bhattacharya, C. G. (1967) A simple method of resolution of a distribution into Gaussian components. Biometrics 23, 115-135.

Bock, H. H. (1972) Statistische Modelle und Bayessche Verfahren zur Bestimmung einer unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18, 120-132.

Choi, K. and Bulgren, W. G. (1968) An estimation procedure for mixtures of distributions. J. R. Statist. Soc. B 30, 444-460.

Cohen, A. C. (1967) Estimation in mixtures of two normal populations. Technometrics 9, 15-28.

Cormack, R. M. (1971) A review of classification. J. R. Statist Soc. A 134, 321-367.

Cox, D. R. (1962) Tests of separate families of hypotheses. J. R. Statist. Soc. B 24, 406-424.

Cox, D. R. (1966) Notes on the analysis of mixed distributions. Brit. J. Math. and Statist. Psych. 19, 39-47.

Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.

Day, N. E. (1969a) Estimating the components of a mixture of normal distributions. Biometrika 56, 463-474.

Day, N. E. (1969b) Divisive cluster analysis and a test for multivariate normality. Bull. International Statist. Inst. 43 (2), 110-112. -134-

Duncan, D. B. (1955) Multiple range and multiple F tests. Biometrics 11, 1-42.

Everitt, B. (1974) Cluster Analysis. London: Heinemann Educational Books.

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179-188.

Friedman, H. P. and Rubin, J. (1967) On some invariant criteria for grouping data. J. Amer. Statist. Assoc. 62, 1159-1178.

Fryer, J. G. and Robertson, C. A. (1972) A comparison of some methods of estimating mixed normal distributions. Biometrika 59, 639-648.

Geisser, S. (1964) Posterior odds for multivariate normal classifications. J. R. Statist. Soc. B 26, 26-76.

Geisser, S. (1966) Predictive discrimination. In 'Multivariate Analysis. Proceedings of an International Symposium'. (P. R. Krishnaiah, ed.) New York: Academic Press, 149-163.

Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-328.

Hartigan, J. A. (1975) Clustering Algorithms. New York: Wiley.

Hasselblad, V. (1966) Estimation of parameters for a mixture of normal distributions. Technometrics 8, 431-444.

Hill, B. M. (1963) Information for estimating the proportions in mixtures of exponential and normal distributions. J. Amer. Statist. Assoc. 58, 918-932.

Johnson, N. L. and Kotz, S. (1972) Distributions in Statistics: Continuous Multivariate Distributions. New York: Wiley.

Johnson, N. L. (1973) Some simple tests of mixtures with symmetrical components. Commun. Statist. 1, 17-25.

Kale, B. K. (1962) On the solution of likelihood equations by iteration processes. The multiparameter case. Biometrika 49, 479-486.

Kendall, M. G. (1966) Discrimination and classification. In 'Multivariate Analysis. Proceedings of an International Symposium'. (P. R. Krishnaiah, ed.) New York: Academic Press.

Lehmann, E. L. (1959) Testing Statistical Hypotheses. New York: Wiley.

Macdonald, P. D. M. (1971) Estimation procedures for mixtures of distributions. J. R. Statist. Soc. B 33, 326-329.

Maronna, R. and Jacovkis, P. M. (1974) Multivariate clustering procedures with variable metrics. Biometrics 30, 499-505.

O'Neill, R. and Wetherill, G. B. (1971) The present state of multiple comparison methods. J. R. Statist. Soc. B 33, 218-250.

Pearson, K. (1894) Contributions to the mathematical theory of evolution. Phil. Trans. R. Soc. A 185, 71-110.

Preston, E. J. (1953) A graphical method for the analysis of statistical distributions into normal components. Biometrika 40, 460-464.

Rand, W. H. (1971) Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc. 66, 846-850.

Rao, C. R. (1952) Advanced Statistical Methods in Biometric Research. New York: Wiley.

Scott, A. J. and Knott, M. (1974) A cluster analysis method for grouping means in the analysis of variance. Biometrics 30, 507-512.

Scott, A. J. and Symons, M. J. (1971) Clustering methods based on likelihood ratio criteria. Biometrics 27, 387-397.

Tan, W. Y. and Chang, W. C. (1972) Some comparisons of the method of moments and the method of maximum likelihood in estimating parameters of a mixture of two normal densities. J. Amer. Statist. Assoc. 67, 702-708.

Wolfe, J. H. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 5, 329-350.