

DISCRIMINANT FUNCTION ANALYSIS

by

Kuo Hsiung Su

A report submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

in

Applied Statistics

Plan B Approved:

UTAH STATE UNIVERSITY
Logan, Utah
1975

ACKNOWLEDGEMENTS

My sincere thanks go to Dr. Rex L. Hurst for his help and valuable suggestions in the preparation of this report. He is one of the teachers I most respect.

I would also like to thank the members of my committee, Dr. Michael

P. Windham and Dr. David White for their contributions to my education.

Finally, I wish to express sincere gratitude to my parents and my wife for their support during the time I studied at Utah State University.

TABLE OF CONTENTS

Chapter Page

I. INTRODUCTION 1

II. CLASSIFICATION PROCEDURES 3

III. MATHEMATICS OF DISCRIMINANT FUNCTION 9

(3.1) Orthogonal procedure 10
(3.2) Non-orthogonal procedure 20
(3.3) Classification of group membership using discriminant function score 23
(3.4) The use of categorical variable in discriminant function analysis 24

IV. SCREENING PROCEDURE FOR SELECTING VARIABLES IN DISCRIMINANT ANALYSIS 27

V. TESTING SIGNIFICANCE 32

(5.1) Test for difference between mean vectors of preassigned two groups 32
(5.2) Test for differences among mean vectors of all groups 33
(5.3) Test for significant power of discriminant function 35

VI. ILLUSTRATIVE EXAMPLE OF DISCRIMINANT ANALYSIS 37

LITERATURE CITED 44

VITA 46

LIST OF TABLES

Table Page

1. Group and grand mean vectors 38

2. Classification results using the non-orthogonal discriminant function 39

3. Eigenvalues and their vectors 40

4. Group centroids of orthogonal discriminant functions 41

5. Classification results using the orthogonal function 42

6. Classification results using the first orthogonal discriminant function 43

LIST OF FIGURES

Figure Page

1. Location of group centroid in discriminant space 41

2. Location of group mean along discriminant line 42

CHAPTER I

INTRODUCTION

The technique of discriminant function analysis was originated by

R.A. Fisher and first applied by Barnard (1935). Two very useful summaries

of the recent work in this technique can be found in Hodges (1950) and

in Tatsuoka and Tiedeman (1954). The techniques have been used primarily

in the fields of anthropology, psychology, biology, medicine, and education,

and have only begun to be applied to other fields in recent years.

Classification and discriminant function analyses are two phases in

the attempt to predict which of several populations an observation might

be a member of, on the basis of multivariate measurements. Both procedures

require that variables are measured on a series of observations of known

population membership.

Classification procedures match the profile of an observation on the

original variables with the mean profile of the various predefined groups.

Discriminant function procedures precede classification and produce a small

number of linear functions of the original variables. These linear functions

are derived so that they retain the maximum amount of information necessary

to properly classify observations into groups.

Discriminant analysis assumes that: (1) the groups under investigation

are discrete and identifiable, (2) each observation in each group can be

characterized by a set of measurements on P variables, and (3) these P

variables are distributed multi-normally in each group. The last assumption

can be relaxed in practical applications. Waite (1971) has shown that dummy variables (1, 0, -1) may be successfully used in discrimination problems.

Since the discriminant function is a linear combination of the original variables, one would expect these linear functions to approach the multivariate normal because of the central limit theorem. Waite (1971) found that wide departures from the multivariate normal still permit good classification results. The discriminant function analysis seems to be as robust as the analysis of variance.

The problem of classification may be viewed as an attempt to assign

a probability to the future occurrence of each of the mutually exclusive

groups. In the statistical approach of discriminant analysis to the problems

of prediction, the main questions center on:

1. Which "antecedent" variables are most appropriate and contribute

most significantly to the predictability?

2. How can these appropriate variables be combined to form a set of

reduced compound scores?

3. After establishing the reduced compound scores, how can a probability

to the future occurrence of each group be assigned?

The first question can be solved by applying a screening procedure for

selecting variables in discriminant analysis. The next question concerns the

principal purpose of discriminant analysis, that is, to construct discriminant

functions of the variables which differentiate optimally between or among

the groups. The functions must transform the P dimensional space to a reduced

discriminant space. Classification procedures use whatever variables are

available in building rules for classification. These rules apply whether or not discriminant functions are used.

In this manuscript, the various questions will be discussed and evaluation of results will be described. Finally, an illustrative example using Fisher's Iris data of 1936 is presented.

CHAPTER II

CLASSIFICATION PROCEDURES

Classification procedures match the characteristics of observations with either the known or estimated characteristics of populations. For each population, a multivariate density function is assumed to describe the distribution of the observations in multivariate space. An observation is predicted to be a member of that population which is most likely to have it as one of its observations. In some instances, the population distribution functions are completely known; but in others, the parameters of the distributions must be estimated from a sample. For simplicity of discussion in this chapter, only the multivariate normal distribution is considered. Parameters will be estimated from the sample data.

There are methods which take the a priori probabilities into consideration. The a priori probability represents an independent knowledge of group sizes unrelated to frequency of observed data. It is not required that there be different a priori probabilities. An equal a priori probability of 1/G, G = number of groups, could be assigned to each group to simplify development of classification schemes.

It is noted that many methods include loss functions. Such schemes have not greatly improved the accuracy of the decision rule mainly because of the difficulty of their assessment (Rao, 1965). For this reason, the discussion of classification scheme will take its simplest form in this chapter.

Consider two groups, each containing observations on which two variables are measured. Each observation may be represented as a point in a plane. In the following figure, ellipse A contains all observations for Group 1 and ellipse B contains all observations for Group 2; $\bar{X}_A$ and $\bar{X}_B$ are the centroids of these groups. A centroid is a point defined by computing the group means on the different variables. If we have an observation such as Point S, it would be predicted to be a member of Group 2 in the multivariate space because S is nearer to $\bar{X}_B$. Similarly, an observation represented by Point T would be classified as belonging to Group 1.

[Figure: two ellipses, A for Group 1 and B for Group 2, with their centroids and the points S and T plotted in the plane of the two variables.]

Mathematical development is along the following lines. Let there be G groups, each consisting of $n_g$ observations, and P available measurements (or variables). Each measurement is denoted as $X_p$, $p=1,2,\ldots,P$. X, the measurement vector, is distributed multivariate normally with the density in each group

$$P_i(X) = \frac{1}{(2\pi)^{P/2}|S_i|^{1/2}} \exp\left[-\frac{1}{2}(X-M_i)'S_i^{-1}(X-M_i)\right], \qquad i=1,2,\ldots,G \qquad (2.1)$$

where

$M_i$ is the mean vector of the variables in Group i
$S_i$ is the variance covariance matrix of the variables in Group i

If a priori probabilities are used and defined as $\Pi_i$, $i=1,2,\ldots,G$, the

conditional probability of X given Group i is

$$P(X|H_i) = \Pi_i P_i(X) = \frac{\Pi_i}{(2\pi)^{P/2}|S_i|^{1/2}} \exp\left[-\frac{1}{2}(X-M_i)'S_i^{-1}(X-M_i)\right], \qquad i=1,2,\ldots,G \qquad (2.2)$$

Fundamentally, an observation with X will be classified into Group i if the following rule holds:

$$P(X|H_i) > P(X|H_j), \qquad j=1,2,\ldots,G \text{ and } i \neq j \qquad (2.3)$$

If we take the logarithm of (2.2), it becomes

$$\ln P(X|H_i) = \ln \Pi_i - \frac{P}{2}\ln 2\pi - \frac{1}{2}\ln|S_i| - \frac{1}{2}(X-M_i)'S_i^{-1}(X-M_i), \qquad i=1,2,\ldots,G$$

Omitting the term $-\frac{P}{2}\ln 2\pi$ common to all groups and multiplying the equation by 2, we get a new quantity, denoted $C_i$:

$$C_i = -\ln|S_i| + 2\ln\Pi_i - (X-M_i)'S_i^{-1}(X-M_i), \qquad i=1,2,\ldots,G$$

If X is univariate, $(X-M_i)'S_i^{-1}(X-M_i)$ is a Chi-square with one degree of freedom. If X is multivariate, $(X-M_i)'S_i^{-1}(X-M_i)$ can also be represented by a Chi-square symbol, $\chi_i^2$, with P degrees of freedom. So $C_i$ may be rewritten as

$$C_i = -\ln|S_i| + 2\ln\Pi_i - \chi_i^2, \qquad i=1,2,\ldots,G$$

Accordingly, classify an observation with X into ith group if

$$C_i > C_j \qquad\text{or}\qquad \chi_i^2 < \chi_j^2 - \ln\frac{|S_i|}{|S_j|} + 2\ln\frac{\Pi_i}{\Pi_j}, \qquad j=1,2,\ldots,G \text{ and } i \neq j \qquad (2.4)$$

If the group variance covariance matrices are equal for all groups and the a priori probabilities are also equal, as may be assumed, (2.4) may be rewritten as

$$\chi_i^2 < \chi_j^2, \qquad j=1,2,\ldots,G \text{ and } i \neq j \qquad (2.5)$$

In many instances, especially in research with a small sample size in each group, it will be found necessary to use a pooled variance covariance matrix for the following reasons: (1) the variance covariance matrix estimated for a group having a few observations is very unreliable, and (2) when observations in a group have very close to the same values on certain measurements, or when some measurements are very dependent, the variance covariance matrix may be close to singularity and an attempt to find its inverse will abort and the Chi-square will be undefined.

Another way to derive (2.5) is given by Cooley and Lohnes (1971). A boundary of the group swarm in measurement space is studied, within which K proportion of the observations will be found when each of them is represented by the deviation of its measurement vector from the group centroid. Such a boundary is the locus of all points for which a generalized (or standardized) distance function from the group centroid is a constant value such as

$$\chi_i^2 = (X-M_i)'S_i^{-1}(X-M_i)$$

This is defined as a centour. The proportion of observations within the boundary for a known $\chi_i^2$ can be obtained by computing the cumulative distribution function of the given $\chi_i^2$ with P degrees of freedom. If we let $P(H_i|X)$ be the hypothesis probability of Group i membership given measurement vector X, the maximum likelihood classification rule is to assign an observation with X to Group i if

$$P(H_i|X) > P(H_j|X), \qquad j=1,2,\ldots,G \text{ and } i \neq j$$

The relation of $P(H_i|X)$ to the cumulative probability for the distance function $\chi_i^2$ is simply taken as

$$P(H_i|X) = 1 - P(\chi_i^2) \qquad (2.6)$$

Equation (2.6) is inverse monotonic with the Chi-square value for each group.

In addition, when a priori probabilities differ among groups, another procedure, called the Bayesian posterior probability, may be applied. It is the probability that an observation belongs to Group i given its measurement vector X, defined as

$$P(H_i|X) = \frac{P(X|H_i)}{\sum_{j=1}^{G} P(X|H_j)} = \frac{\Pi_i P_i(X)}{\sum_{j=1}^{G} \Pi_j P_j(X)} \qquad (2.7)$$

where $P_i(X)$ is defined in (2.1). In this case, the rule of classification is to assign an observation to Group i if

$$P(H_i|X) > P(H_j|X), \qquad j=1,2,\ldots,G \text{ and } i \neq j \qquad (2.8)$$

Evidently, the result of the Bayesian method agrees with that of (2.5) when the pooled variance covariance matrix and equal a priori probabilities are used.
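As a concrete illustration of the classification rule of this chapter, the following is a minimal Python sketch, assuming the group means, group dispersion matrices, and a priori probabilities have already been estimated from observations of known membership. The function name and all numbers below are illustrative placeholders, not taken from the report.

```python
import numpy as np

def classification_scores(x, means, covs, priors):
    """Return C_i = -ln|S_i| + 2 ln(pi_i) - (x - M_i)' S_i^{-1} (x - M_i)
    for each group; the observation is assigned to the group with the
    largest score, as in rule (2.4)."""
    scores = []
    for M, S, p in zip(means, covs, priors):
        d = x - M
        chi_sq = d @ np.linalg.solve(S, d)   # generalized (Mahalanobis) distance
        scores.append(-np.log(np.linalg.det(S)) + 2.0 * np.log(p) - chi_sq)
    return np.array(scores)

# Hypothetical two-group, two-variable illustration
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
x = np.array([1.8, 0.7])
print(np.argmax(classification_scores(x, means, covs, priors)) + 1)   # predicted group number
```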

CHAPTER III

MATHEMATICS OF DISCRIMINANT FUNCTION

When P variables are observed on each individual, it is mathematically possible to derive a set of linear functions of such variables. Then, instead of P variables, the individual may be characterized by the linear functions. In other words, the discriminant analysis produces a set of reduced compound scores which, of course, is the linear functions of P variables. The fact that the number of compound scores may be one less than

P is one of the advantages of the discriminant analysis. Another advantage, probably the most important, is that the compound scores are more likely to be distributed as the multivariate normal. Dummy variables (1, 0, -1) may be used to incorporate categorical variables in the construction of such scores. Compound scores tend to be closer to the multivariate normal distribution, and therefore more applicable in classification procedures.

If the function scores are correlated with each other, the method of finding them is called a non-orthogonal transformation. In this case, the number of dimensions of the discriminant space will equal the minimum of (G, P). On the other hand, if the functions derived are independent of each other, the method is called an orthogonal transformation. This results in a discriminant space whose maximum number of dimensions is the minimum of (G-1, P).

Let the set of all linear transformation functions be

$$Y_k = v_{k1}X_1 + v_{k2}X_2 + \cdots + v_{kP}X_P, \qquad k=1,2,\ldots,K \qquad (3.1)$$

where $K = \min(G, P)$ if non-orthogonal and $K = \min(G-1, P)$ if orthogonal.

3.1 Orthogonal procedure

A detailed discussion of this procedure, in terms of mathematical development, is given by Cooley and Lohnes (1971).

First of all, let there be G groups, each having $n_g$ observations. Let the measurement vector be $X_{ij}' = [X_{ij1}, X_{ij2}, \ldots, X_{ijP}]$, where P is the number of measurements or variables. The group and observation are given by i and j. Let m be the grand centroid, or vector of total sample means, and let $m_i$ be the centroid for Group i. The deviation vector of an observation from the grand centroid is

$$x_{ij} = X_{ij} - m = (m_i - m) + (X_{ij} - m_i)$$

The terms on the right hand side of the above equation are referred to as the hypothesis effect and the error effect, respectively. Summing over all observations in the total sample the squares and cross-products of the elements of the score vector and its two partition terms yields

$$\sum_{i=1}^{G}\sum_{j=1}^{n_i} x_{ij}x_{ij}' = \sum_{i=1}^{G} n_i(m_i - m)(m_i - m)' + \sum_{i=1}^{G}\sum_{j=1}^{n_i}(X_{ij} - m_i)(X_{ij} - m_i)' \qquad (3.1.1)$$

The left term of (3.1.1), denoted as T for "Total", is the matrix of sums of squares and cross-products of deviations of all observations from the grand centroid. The first term on the right hand side of (3.1.1), denoted as A for "Among-groups", is the matrix of weighted squares and cross-products of deviations of group centroids from the grand centroid. The second term on the right hand side of (3.1.1), denoted as W for "Within-groups", is the matrix of squares and cross-products of deviations of observations from their group centroids, pooled over all groups. If $N = \sum_{i=1}^{G} n_i$, then W/(N-G) is an estimator of the common group dispersion based on the pooled within-groups deviations. At this point, it seems necessary to show how the matrices A, W and T

are calculated. Let the data be given as follows:

                  Group 1              Group 2
                  Variable             Variable
Observation       1    2    3          1    2    3

1                 3    1    4          4    3    2
2                 2    2    3          3    3    2
3                 2    1    3          5    2    1

in which P=3, G=2, $n_1=3$, $n_2=3$, and $N=n_1+n_2=6$; and $X_{111}=3$, $X_{112}=1$, and so on. The desired matrices are easily found by applying matrix algebra; but for those who are unfamiliar with matrix algebra, a sum of squares and products approach is presented first, followed by the matrix algebra method.

The diagonal elements of T, A and W may be found by using the computational procedures of the univariate analysis of variance for a completely randomized design. The data for variable one would appear as

Observation      Group 1      Group 2

1                   3            4
2                   2            3
3                   2            5

Sum              $X_{1\cdot 1}=7$    $X_{2\cdot 1}=12$    $X_{\cdot\cdot 1}=19$
                 $n_1=3$             $n_2=3$              $N=6$

We first compute the total sum of squares in the above table, ignoring the classification into groups

$$\text{Total S.S.} = \sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ij1}^2 - \left(\sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ij1}\right)^2 / N = 3^2 + 2^2 + \cdots + 5^2 - (19)^2/6 = 6.83$$

Next, we compute the sum of squares among group means

$$\text{S.S. among group means} = \sum_{i=1}^{G}\left(\sum_{j=1}^{n_i} X_{ij1}\right)^2 / n_i - \left(\sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ij1}\right)^2 / N = (7)^2/3 + (12)^2/3 - (19)^2/6 = 4.16$$

The sum of squares within groups is found as

$$\text{S.S. within groups} = 6.83 - 4.16 = 2.67$$

These figures form a table of analysis of variance as:

Source of variation       D.F.      S.S.

Groups, $A_{11}$            1        4.16
Error,  $W_{11}$            4        2.67
Total,  $T_{11}$            5        6.83

The off-diagonal elements of W, T and A are determined by the following formulas. They should be recognized as the computations for the cross product columns in a covariance analysis.

$$W_{pq} = \sum_{i=1}^{G}\sum_{j=1}^{n_i} (X_{ijp} - \bar{X}_{i\cdot p})(X_{ijq} - \bar{X}_{i\cdot q}) = \sum_{i=1}^{G}\left[\sum_{j=1}^{n_i} X_{ijp}X_{ijq} - \left(\sum_{j=1}^{n_i} X_{ijp}\right)\left(\sum_{j=1}^{n_i} X_{ijq}\right)\Big/n_i\right] \qquad (3.1.2)$$

$$T_{pq} = \sum_{i=1}^{G}\sum_{j=1}^{n_i} (X_{ijp} - \bar{X}_{\cdot\cdot p})(X_{ijq} - \bar{X}_{\cdot\cdot q}) = \sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ijp}X_{ijq} - \left(\sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ijp}\right)\left(\sum_{i=1}^{G}\sum_{j=1}^{n_i} X_{ijq}\right)\Big/\sum_{i=1}^{G} n_i \qquad (3.1.3)$$

$$A_{pq} = T_{pq} - W_{pq} \qquad (3.1.4)$$

When p=q, the above formulas also apply to calculate the values of the diagonal elements of T, W and A. To illustrate, we compute:

$$W_{11} = 3\cdot 3 + 2\cdot 2 + 2\cdot 2 + 4\cdot 4 + 3\cdot 3 + 5\cdot 5 - (7\cdot 7)/3 - (12\cdot 12)/3 = 67 - 64.33 = 2.67$$

$$W_{12} = 3\cdot 1 + 2\cdot 2 + 2\cdot 1 + 4\cdot 3 + 3\cdot 3 + 5\cdot 2 - (7\cdot 4)/3 - (12\cdot 8)/3 = 40 - 41.33 = -1.33$$

$$T_{11} = 3\cdot 3 + 2\cdot 2 + 2\cdot 2 + 4\cdot 4 + 3\cdot 3 + 5\cdot 5 - (19\cdot 19)/6 = 67 - 60.17 = 6.83$$

$$T_{12} = 40 - (19\cdot 12)/6 = 40 - 38 = 2$$

$$A_{11} = T_{11} - W_{11} = 6.83 - 2.67 = 4.16$$

$$A_{12} = T_{12} - W_{12} = 2.00 - (-1.33) = 3.33$$

Similarly, the values of the other elements of W, T and A can also be found. Finally, the values of the desired matrices are

$$W = \begin{bmatrix} 2.67 & -1.33 & -.33 \\ -1.33 & 1.33 & .33 \\ -.33 & .33 & 1.33 \end{bmatrix} \qquad T = \begin{bmatrix} 6.83 & 2.00 & -4.50 \\ 2.00 & 4.00 & -3.00 \\ -4.50 & -3.00 & 5.50 \end{bmatrix}$$

$$\text{and} \qquad A = \begin{bmatrix} 4.16 & 3.33 & -4.17 \\ 3.33 & 2.67 & -3.33 \\ -4.17 & -3.33 & 4.17 \end{bmatrix}$$

Next, in matrix form, the same data are

$$X_1 = \begin{bmatrix} 3 & 1 & 4 \\ 2 & 2 & 3 \\ 2 & 1 & 3 \end{bmatrix} \text{ (Group 1)} \qquad\text{and}\qquad X_2 = \begin{bmatrix} 4 & 3 & 2 \\ 3 & 3 & 2 \\ 5 & 2 & 1 \end{bmatrix} \text{ (Group 2)}$$

Define a vector $1_n' = [1, 1, \ldots, 1]$, in which the number of elements is n. The group and grand mean vectors are determined as

$$m_1 = X_1' 1_{n_1}/n_1 = \frac{1}{3}\begin{bmatrix} 3 & 2 & 2 \\ 1 & 2 & 1 \\ 4 & 3 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 7 \\ 4 \\ 10 \end{bmatrix} = \begin{bmatrix} 2.33 \\ 1.33 \\ 3.33 \end{bmatrix}$$

$$m_2 = X_2' 1_{n_2}/n_2 = \frac{1}{3}\begin{bmatrix} 4 & 3 & 5 \\ 3 & 3 & 2 \\ 2 & 2 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 12 \\ 8 \\ 5 \end{bmatrix} = \begin{bmatrix} 4.00 \\ 2.67 \\ 1.67 \end{bmatrix}$$

$$m = [X_1' 1_{n_1} + X_2' 1_{n_2}]/N = \frac{1}{6}\begin{bmatrix} 19 \\ 12 \\ 15 \end{bmatrix} = \begin{bmatrix} 3.17 \\ 2.00 \\ 2.50 \end{bmatrix}$$

So the deviation matrices $x_1'$, $x_2'$ and $x'$ are formed as

$$x_1' = X_1' - m_1 1_{n_1}' = \begin{bmatrix} 3 & 2 & 2 \\ 1 & 2 & 1 \\ 4 & 3 & 3 \end{bmatrix} - \begin{bmatrix} 2.33 & 2.33 & 2.33 \\ 1.33 & 1.33 & 1.33 \\ 3.33 & 3.33 & 3.33 \end{bmatrix} = \begin{bmatrix} .67 & -.33 & -.33 \\ -.33 & .67 & -.33 \\ .67 & -.33 & -.33 \end{bmatrix}$$

$$x_2' = X_2' - m_2 1_{n_2}' = \begin{bmatrix} .0 & -1.0 & 1.0 \\ .33 & .33 & -.67 \\ .33 & .33 & -.67 \end{bmatrix}$$

$$x' = [X_1' \;\; X_2'] - m 1_N' = \begin{bmatrix} -.17 & -1.17 & -1.17 & .83 & -.17 & 1.83 \\ -1.0 & .0 & -1.0 & 1.0 & 1.0 & .0 \\ 1.5 & .5 & .5 & -.5 & -.5 & -1.5 \end{bmatrix}$$

and

$$x'x = T = \begin{bmatrix} 6.83 & 2.00 & -4.50 \\ 2.00 & 4.00 & -3.00 \\ -4.50 & -3.00 & 5.50 \end{bmatrix} \qquad x_1'x_1 = \begin{bmatrix} .67 & -.33 & .67 \\ -.33 & .67 & -.33 \\ .67 & -.33 & .67 \end{bmatrix} \qquad x_2'x_2 = \begin{bmatrix} 2.0 & -1.0 & -1.0 \\ -1.0 & .67 & .67 \\ -1.0 & .67 & .67 \end{bmatrix}$$

Finally, W and A are

$$W = x_1'x_1 + x_2'x_2 = \begin{bmatrix} 2.67 & -1.33 & -.33 \\ -1.33 & 1.33 & .33 \\ -.33 & .33 & 1.33 \end{bmatrix}$$

$$\text{and} \qquad A = T - W = \begin{bmatrix} 4.16 & 3.33 & -4.17 \\ 3.33 & 2.67 & -3.33 \\ -4.17 & -3.33 & 4.17 \end{bmatrix}$$
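The hand computations above can be checked with a short numpy sketch. This is merely an illustrative verification aid using the same six observations; it is not part of the original report's computations.

```python
import numpy as np

# The same six observations on three variables, three in each group
X1 = np.array([[3, 1, 4], [2, 2, 3], [2, 1, 3]], dtype=float)   # Group 1
X2 = np.array([[4, 3, 2], [3, 3, 2], [5, 2, 1]], dtype=float)   # Group 2
X = np.vstack([X1, X2])

m = X.mean(axis=0)                      # grand centroid
T = (X - m).T @ (X - m)                 # total SSCP matrix
W = sum((G - G.mean(axis=0)).T @ (G - G.mean(axis=0)) for G in (X1, X2))
A = T - W                               # among-groups SSCP matrix

print(np.round(T, 2))
print(np.round(W, 2))
print(np.round(A, 2))
```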

Let the orthogonal discriminant function vector be

$$y = V'x \qquad (3.1.5)$$

where x is a measurement deviation vector from the grand centroid and V is a P × min(G-1, P) coefficient matrix.

The among-groups sum of squares on the function of (3.1.5) is the quadratic form

$$V'AV$$

Similarly, the within-groups sum of squares on the function is the quadratic form

$$V'WV$$

Since the greater the distance between the vector means the better the discriminant, and the narrower the spread about the vector means the more effective the discriminant, the best discriminant function may be found by maximizing the ratio of the among-groups sum of squares to the within-groups sum of squares,

$$\lambda = \frac{V'AV}{V'WV} = \text{maximum} \qquad\text{or}\qquad V'W^{-1}AV = \text{maximum} \qquad (3.1.6)$$

subject to the restriction $V'V = 1$ (normalizing V).

For determining the elements of the first column $v_1$ of V, we want to maximize $\lambda_1 = v_1'W^{-1}Av_1$ subject to $v_1'v_1 = 1$. We introduce the restriction on $v_1$ by means of the Lagrange multiplier $\lambda_1$ and differentiate with respect to $v_1$:

$$\frac{\partial}{\partial v_1}\left[v_1'W^{-1}Av_1 - \lambda_1(v_1'v_1 - 1)\right] \qquad (3.1.7)$$

Setting (3.1.7) equal to zero produces the equation

$$(W^{-1}A - \lambda_1 I)v_1 = 0 \qquad (3.1.8)$$

(3.1.8) is known as the problem of the eigenstructure of $W^{-1}A$. The largest eigenvalue $\lambda_1$ may be found by solving the characteristic equation $|W^{-1}A - \lambda I| = 0$ and then finding its associated eigenvector $v_1$. This produces

the 1st column of V, or the coefficients of first discriminant function.

A second discriminant function can be derived the same way by maximizing '-1 v2w A v2 subject to the restriction that this function be orthogonal to the first, v ' v = O, and so on. The extraction of eigenstructure may be 1 2 continued until a trivial eigenvalue, sa y A , is reached. n 1 Based on the full eigenstructure of w-1A, (W- A)V=VL (Lis diagcmal matrix of ei genvalues), the propert i es followed are evident:

$$\sum_{j=1}^{P} \lambda_j = \text{trace}(W^{-1}A) \qquad\text{and}\qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_P$$

Therefore, $\lambda_1$ is clearly an index indicating the largest power of discrimination among groups extracted by the first discriminant function; $\lambda_2$ is the next largest, and so on. Usually only two or three discriminant functions contain the useful discriminating power even though there are many groups and many variables.
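A brief sketch of extracting the eigenstructure of $W^{-1}A$ for the numerical example of this section follows; it is an illustrative check in numpy, assuming the W and A matrices computed earlier, and is not the procedure actually used in the report.

```python
import numpy as np

# W and A from the numerical example of Section 3.1
W = np.array([[2.67, -1.33, -0.33], [-1.33, 1.33, 0.33], [-0.33, 0.33, 1.33]])
A = np.array([[4.16, 3.33, -4.17], [3.33, 2.67, -3.33], [-4.17, -3.33, 4.17]])

# Eigenstructure of W^{-1}A: eigenvalues order the discriminant functions
# by discriminating power; eigenvectors give the coefficient columns of V.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, A))
order = np.argsort(eigvals.real)[::-1]
eigvals, V = eigvals.real[order], eigvecs.real[:, order]

print(np.round(eigvals, 3))   # with G = 2 only one non-trivial eigenvalue is expected
print(np.round(V[:, 0], 3))   # coefficients of the first discriminant function
```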

Cooley and Lohnes (1971) further transform discriminant functions to

standardized functions of a standardized measurement vector. Again letting x be an observation's deviation vector from the grand centroid and T be the matrix of total sums of squared deviations about the grand centroid, the grand mean vector of Y is zero and the grand variance covariance matrix is

$$\Theta = V'\left(\frac{1}{N-1}T\right)V \qquad (3.1.9)$$

Further, if D is the diagonal matrix formed from the diagonal elements of $(1/(N-1))T$ and n is the number of discriminant functions with non-trivial eigenvalues, the standardized discriminant scores are

$$f = \Theta^{-1/2}Y = C'Z \qquad (3.1.10)$$

where $C = D^{1/2}V\Theta^{-1/2}$ and $Z = D^{-1/2}x$ is the standardized measurement vector.

C is the P x n matrix of standardized function coefficients, which converts an N x P roster of standard scores to an N x n roster of standardized discriminant scores. If we let R be a correlation matrix based on T, the vector of correlation coefficients between the variables and jth discriminant factor is found as

$$S_j = \frac{1}{N}\sum_{i=1}^{N}(Z_i - m_Z)(f_{ij} - m_f) = \frac{1}{N}\sum_{i=1}^{N} Z_i f_{ij} = \frac{1}{N}\sum_{i=1}^{N} Z_i Z_i' C_j = R C_j \qquad (3.1.11)$$

The matrix of factor structure coefficients is

$$S = RC$$

This is the matrix of correlations between the variables and the discriminant functions. In the non-orthogonal procedure, such a factor structure can also be found. When the content of the variables is known, the correlations in the ith column of S explain the ith discriminant factor. On the other hand, the jth row of S describes the portion of the information contained in the jth variable that is extracted by each of the discriminant functions; the sum of squares of the elements of a row is called the communality of a variable, which measures the proportion of variation of that variable retained in the discriminant functions concerned. When all discriminant functions, G-1 or P, are considered, the communality of any variable is equal to one.

3.2 Non-orthogonal procedure

Rao (1965) gives the discriminant score of the ith group for an observation with measurement vector X as

$$S_i = \Pi_i P_i(X), \qquad i=1,2,\ldots,G \qquad (3.2.1)$$

with known a priori probability $\Pi_i$ for the ith group and regardless of losses due to wrong identification. X is multivariate normal in each of the groups or populations with density

$$P_i(X) = \frac{1}{(2\pi)^{P/2}|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(X-\mu_i)'\Sigma_i^{-1}(X-\mu_i)\right], \qquad i=1,2,\ldots,G \qquad (3.2.2)$$

where

P is the number of elements in the measurement vector X
$\mu_i$ is the mean vector of the ith population
$\Sigma_i$ is the dispersion matrix of the ith population

Taking the logarithm of $\Pi_i P_i(X)$ of (3.2.1) and omitting the term $-\frac{P}{2}\ln 2\pi$ common to all populations, $S_i$ becomes

$$S_i = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(X-\mu_i)'\Sigma_i^{-1}(X-\mu_i) + \ln\Pi_i, \qquad i=1,2,\ldots,G \qquad (3.2.3)$$

With known $\mu_i$ and $\Sigma_i$, this function is quadratic in X and is called a quadratic discriminant score. When the populations have a common dispersion matrix $\Sigma$, the discriminant score is

$$S_i = -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(X-\mu_i)'\Sigma^{-1}(X-\mu_i) + \ln\Pi_i$$

$$= -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}X'\Sigma^{-1}X + \mu_i'\Sigma^{-1}X - \frac{1}{2}\mu_i'\Sigma^{-1}\mu_i + \ln\Pi_i, \qquad i=1,2,\ldots,G \qquad (3.2.4)$$

Since the first two terms of (3.2.4) are the same for all groups, it can be rewritten as

$$S_i = (\mu_i'\Sigma^{-1})X - \frac{1}{2}\mu_i'\Sigma^{-1}\mu_i + \ln\Pi_i \qquad (3.2.5)$$

When the case of only two groups is investigated, the comparison of $S_1$ and $S_2$ forms the linear discriminant function given by R.A. Fisher in 1936, which is $(\mu_1 - \mu_2)'\Sigma^{-1}X$. If the function score $S_i$ of (3.2.5) is treated as a variable, the exclusion of the constant terms will not change its nature. Let the ith function be

$$Y_i = \mu_i'\Sigma^{-1}X = v_i'X \qquad (3.2.6)$$

where $v_i = \Sigma^{-1}\mu_i$ is the vector of coefficients of the ith function. The vector of all G non-orthogonal discriminant functions is

$$Y = V'X \qquad (3.2.7)$$

where V is the P × G coefficient matrix whose first column contains the coefficients of the 1st function.

Therefore, the solution for all non-orthogonal discriminant functions is to find the matrix V satisfying

$$V = \Sigma^{-1}\mu \qquad (3.2.8)$$

where

$\mu$ is the matrix of group means of the variables; the first row contains the group means for the 1st variable, etc.
$\Sigma$ is the common dispersion or variance covariance matrix
V is the coefficient matrix; the ith column contains the coefficients of equation (3.1) when k=i.

It is very clear that E(Y) = Var(Y), since

$$E(Y) = E(\mu'\Sigma^{-1}X) = \mu'\Sigma^{-1}E(X) = \mu'\Sigma^{-1}\mu$$

and

$$\text{Var}(Y) = \text{Var}(\mu'\Sigma^{-1}X) = \mu'\Sigma^{-1}\text{Var}(X)\Sigma^{-1}\mu = \mu'\Sigma^{-1}\Sigma\Sigma^{-1}\mu = \mu'\Sigma^{-1}\mu$$

And a look into Var(Y) reveals that the non-orthogonal functions are highly correlated with each other.

When the $\Sigma$ and $\mu$ matrices are not available, estimates of them may be substituted in (3.2.8). Usually, a pooled variance covariance matrix S, based on the samples from all groups, is used; and M, the matrix of sample group means, is used instead of $\mu$.
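A minimal sketch of the non-orthogonal solution (3.2.8) with sample estimates substituted is given below. The pooled covariance matrix, the group means, and the function name used here are hypothetical placeholders for illustration only.

```python
import numpy as np

def nonorthogonal_coefficients(S, M):
    """Columns of the result are the coefficient vectors v_i = S^{-1} m_i of
    the G linear discriminant scores Y_i = v_i' X (constant terms omitted)."""
    return np.linalg.solve(S, M)

# Hypothetical pooled covariance matrix S and group mean matrix M
# (2 variables, 2 groups; rows of M are variables, columns are groups)
S = np.array([[1.0, 0.3],
              [0.3, 0.5]])
M = np.array([[0.0, 2.0],
              [0.0, 1.0]])
print(nonorthogonal_coefficients(S, M))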

3.3 Classification of group membership using discriminant function score

With the establishment of discriminant functions, it is possible to use their scores, instead of the originally observed variables, in predicting group membership. It is already known that discriminant function scores are distributed multinormally in each group, due to the central limit theorem.

According to the rules of Chapter II, an observation with a function score vector Y will be assigned to the ith group if

$$\chi_i^2 = (Y-M_i)'S^{-1}(Y-M_i) < \chi_j^2 = (Y-M_j)'S^{-1}(Y-M_j), \qquad j=1,2,\ldots,G \text{ and } i \neq j$$

where

$M_i$ is the mean vector of the function scores in the ith group
S is the pooled dispersion matrix of the function scores

The hypothesis probability that it comes from the ith group is

$$P(H_i|Y) = 1 - P(\chi_i^2)$$

It has been shown repeatedly that the non-orthogonal and orthogonal discriminant functions produce nearly identical classification results. When there are many groups to be discriminated, the analysis using the orthogonal procedure may be better because it may produce only two or three dominant functions which contain a high proportion of the discriminant power. In addition, there are two other reasons why the orthogonal procedure is often applied. First, the discriminant functions are independent; this may enable a researcher to relate a function to a certain property of his data. Secondly, the classification computations are fewer than for the non-orthogonal procedure, although the original computations to derive the functions are greater.

3.4 The use of categorical variable in discriminant function analysis

The robustness of the discriminant function allows the use of qualitative variables. For instance, if the tabulation of residence of students by religion is available, the following might result:

                         Group
Category          Residence    Non-residence    Total

LDS                   90             0            90
Catholicism            2             5             7
Christianity           0            30            30
Buddhism               0             2             2
Others                 0             2             2

This table indicates that the two group distributions overlap on the category of Catholicism. There are 90 resident students in LDS and 34 non-resident students in either Christianity or Buddhism or others. A rule of discrimination can simply be set as

If a student is in LDS, assign him to the resident group.

If a student is in Christianity or Buddhism or others, assign him to the non-resident group.

If a student is in Catholicism, undecided.

This should demonstrate that there is information in qualitative variables

that can be used to predict group membership for an individual.

Through the use of dummy variables, categorical variables can be included in the discriminant function analysis. For example, if $X_i$ is a categorical variable with 4 levels, two possible methods of constructing dummy variables are as follows.

Form 4 dummy variables as

Level     $X_{i1}$   $X_{i2}$   $X_{i3}$   $X_{i4}$

1            1          0          0          0
2            0          1          0          0
3            0          0          1          0
4            0          0          0          1

By imposing the condition $\sum_{k=1}^{4} a_{ijk} = 0$, produce 3 dummy variables as

Level     $X_{i1}$   $X_{i2}$   $X_{i3}$

1            1          0          0
2            0          1          0
3            0          0          1
4           -1         -1         -1

There is an infinity of ways of producing dummy variables; the above are the most common. Theoretically, it does not matter in the orthogonal discriminant function analysis whether the dummy variables created by the first method or those created by the second are used. Practically, the dummy variables of the second type would more likely be used, since (1) the number of them is one less, and (2) the variance covariance matrix using them is more likely to be of full rank. The non-orthogonal discriminant function requires the second type of dummy variable because singular matrices do not have true inverses.
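The following short sketch illustrates the second (sum-to-zero) coding for a hypothetical categorical variable with 4 levels; the function name and data are illustrative only, not part of the report.

```python
import numpy as np

def effect_code(levels, n_levels):
    """Map integer levels 1..n_levels to (n_levels - 1) dummy variables,
    coding the last level as -1 in every column."""
    codes = np.vstack([np.eye(n_levels - 1), -np.ones(n_levels - 1)])
    return codes[np.asarray(levels) - 1]

# One observation at each of the four levels of a hypothetical variable
print(effect_code([1, 2, 3, 4], 4))
```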

When, among the P variables, $P_1$ are continuous and $P_2$ are categorical, the set of all linear orthogonal transformation functions is

$$Y_k = \sum_{i=1}^{P_1} v_{ki}X_i + \sum_{j=1}^{P_2}\sum_{t=1}^{C_j - 1} v_{kjt}^* X_{jt}^*, \qquad k = 1, 2, \ldots, \min\left\{G-1,\ P_1 + \sum_{j=1}^{P_2}(C_j - 1)\right\}$$

where

$X_i$ is the ith continuous variable
$X_{jt}^*$ is the tth dummy variable of the jth categorical variable
$C_j$ is the number of levels of the jth categorical variable.

CHAPTER IV

SCREENING PROCEDURE FOR SELECTING VARIABLES IN DISCRIMINANT ANALYSIS

In problems of classification, an even more accurate result of

prediction may be achieved if more characteristics, or variables, are

observed. However, when the number of variables is large, the computations

required for a solution of discriminant analysis become expensive.

Fortunately, practical applications have generally shown that most

of the variation accounted for in an entire set of variables can be found

in a small set of these variables (Hotelling, 1940). This is due to the

high correlation among some of the variables and small contributions of

others. The problem of selecting the smaller set of contributing variables

is to choose variables which actually discriminate among groups; this is

the main topic in this section.

Assume there are G groups and P possible antecedent variables denoted by $X_p$, $p=1,2,\ldots,P$. Within each of the G groups, for each variable $X_p$, the pooled within-group sum of squared deviations about the group mean, denoted as $W(X_p)$, can be determined by formula (3.1.2) with q=p; the total sum of squared deviations about the grand mean, $T(X_p)$, is also determined by (3.1.3) with q=p; and the sum of squared deviations between the group means and the grand mean, $A(X_p)$, may be obtained by the subtraction

$$A(X_p) = T(X_p) - W(X_p)$$

For each of the antecedent variables, $W(X_p)$, $T(X_p)$ and $A(X_p)$ are the same as the error, the total and the treatment effects in a univariate analysis of variance. Consequently, the criterion for selecting the first variable, say

$X^{(1)}$, is

$$\frac{A(X^{(1)})}{W(X^{(1)})} > \frac{A(X_p)}{W(X_p)}, \qquad p=1,2,\ldots,P \text{ and } X_p \neq X^{(1)} \qquad (3.1)$$

This is equivalent to choosing the variable which has the largest F ratio in the univariate analysis of variance compared with those of the remaining P-1 variables.

Variables showing large and significant F values may be giving the same discriminant information due to a high degree of association between or among variables. Instead, the criterion of (3.1) could be interpreted as the trace of the matrix $W^{-1}A$. A stepwise addition procedure should add variables in such a manner as to produce the maximum trace. Let

$$\text{trace}(Q) = q_{11} + q_{22} + \cdots + q_{PP}$$

If the eigenvalues of Q are found to be $\lambda_1, \lambda_2, \ldots, \lambda_P$, then $\text{trace}(Q) = \sum_{i=1}^{P}\lambda_i$. Since $\sum_{i=1}^{P}\lambda_i$ represents the degree of discriminant power, trace(Q) is applied as a criterion of the effectiveness of the discriminating ability of the antecedent variables.

The first variable selected, $X^{(1)}$, has to agree with the criterion

$$\text{trace } W^{-1}A(X^{(1)}) > \text{trace } W^{-1}A(X_p), \qquad p=1,2,\ldots,P \text{ and } X_p \neq X^{(1)}$$

The criterion for deciding the second variable, $X^{(2)}$, becomes

$$\text{trace } W^{-1}A(X^{(1)}X^{(2)}) > \text{trace } W^{-1}A(X^{(1)}X_p), \qquad p=1,2,\ldots,P \text{ and } X_p \neq X^{(1)}, X^{(2)}$$

where

$$W^{-1}A(X^{(1)}X_p) = \begin{bmatrix} W(X^{(1)}) & PW(X^{(1)}X_p) \\ PW(X^{(1)}X_p) & W(X_p) \end{bmatrix}^{-1} \begin{bmatrix} A(X^{(1)}) & PA(X^{(1)}X_p) \\ PA(X^{(1)}X_p) & A(X_p) \end{bmatrix}, \qquad p=1,2,\ldots,P \text{ and } X_p \neq X^{(1)}$$

Values of the terms $PW(X^{(1)}X_p)$ and $PA(X^{(1)}X_p)$ can be found by using the same formulas (3.1.2) and (3.1.4) with $X_{ijp} = X^{(1)}$ and $X_{ijq} = X_p$. Obviously, the general criterion for the selection of variable $X^{(s)}$ is

$$\text{trace } W^{-1}A(X^{(1)}X^{(2)}\cdots X^{(s-1)}X^{(s)}) > \text{trace } W^{-1}A(X^{(1)}X^{(2)}\cdots X^{(s-1)}X_p), \qquad p=1,2,\ldots,P \text{ and } X_p \neq X^{(1)}, X^{(2)}, \ldots, X^{(s)}$$

Again take the same numerical example used in Chapter III, with

$$W = \begin{bmatrix} 2.67 & -1.33 & -.33 \\ -1.33 & 1.33 & .33 \\ -.33 & .33 & 1.33 \end{bmatrix} \qquad\text{and}\qquad A = \begin{bmatrix} 4.16 & 3.33 & -4.17 \\ 3.33 & 2.67 & -3.33 \\ -4.17 & -3.33 & 4.17 \end{bmatrix}$$

Since

$$\frac{A(X_3)}{W(X_3)} = \frac{4.17}{1.33} > \frac{A(X_2)}{W(X_2)} = \frac{2.67}{1.33} > \frac{A(X_1)}{W(X_1)} = \frac{4.16}{2.67}$$

Variable 3, $X^{(1)} = X_3$, is selected first. Next, find

$$W^{-1}A(X^{(1)}X_1) = \begin{bmatrix} 1.33 & -.33 \\ -.33 & 2.67 \end{bmatrix}^{-1}\begin{bmatrix} 4.17 & -4.17 \\ -4.17 & 4.16 \end{bmatrix} = \begin{bmatrix} 2.83 & -2.84 \\ -1.21 & 1.21 \end{bmatrix}$$

$$\text{and}\qquad W^{-1}A(X^{(1)}X_2) = \begin{bmatrix} 1.33 & .33 \\ .33 & 1.33 \end{bmatrix}^{-1}\begin{bmatrix} 4.17 & -3.33 \\ -3.33 & 2.67 \end{bmatrix} = \begin{bmatrix} 4.00 & -3.20 \\ -3.50 & 2.80 \end{bmatrix}$$

and since $\text{trace } W^{-1}A(X^{(1)}X_2) = 6.80 > \text{trace } W^{-1}A(X^{(1)}X_1) = 4.04$, variable 2 is selected next in the discriminant analysis.

There is no theoretical test available to determine which variables can be considered statistically significant; the selection of variables is therefore decided arbitrarily and may be halted when the incremental value of the trace fails to exceed a reasonable preset limit. For example, the increments of the traces may be tabulated as

$W^{-1}A$                                     Trace     Increment

$X^{(1)}$                                     $t_1$     $t_1$
$X^{(1)}X^{(2)}$                              $t_2$     $t_2 - t_1$
$X^{(1)}X^{(2)}X^{(3)}$                       $t_3$     $t_3 - t_2$
...
$X^{(1)}X^{(2)}X^{(3)}\cdots X^{(P)}$         $t_P$     $t_P - t_{P-1}$

and a value V is set. Suppose $t_n - t_{n-1} < V$; then $X^{(1)}, X^{(2)}, \ldots,$ and $X^{(n)}$ will be selected. There is no nice way to relate the trace to discriminating power, such as is possible with $R^2$ in multiple regression.
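The stepwise trace criterion described above can be sketched in numpy as follows; this is an illustrative implementation (not the program used for the report), applied to the same W and A matrices of the three-variable example, and the function and variable names are placeholders.

```python
import numpy as np

def stepwise_trace(W, A, n_select):
    """At each step, add the variable whose inclusion gives the largest
    trace of W^{-1}A over the currently selected subset."""
    P = W.shape[0]
    selected = []
    for _ in range(n_select):
        best, best_trace = None, -np.inf
        for p in range(P):
            if p in selected:
                continue
            idx = selected + [p]
            t = np.trace(np.linalg.solve(W[np.ix_(idx, idx)], A[np.ix_(idx, idx)]))
            if t > best_trace:
                best, best_trace = p, t
        selected.append(best)
        print(f"step {len(selected)}: add variable {best + 1}, trace = {best_trace:.2f}")
    return selected

# W and A of the three-variable example above
W = np.array([[2.67, -1.33, -0.33], [-1.33, 1.33, 0.33], [-0.33, 0.33, 1.33]])
A = np.array([[4.16, 3.33, -4.17], [3.33, 2.67, -3.33], [-4.17, -3.33, 4.17]])
stepwise_trace(W, A, 2)     # selects variable 3, then variable 2
```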

For the selection of qualitative variables, Hurst (1971) suggests a two-way independence Chi-square test for each categorical variable, computed as

$$\chi^2 = \sum_{i=1}^{G}\sum_{c=1}^{C_k} \frac{(n_{ic} - e_{ic})^2}{e_{ic}} \qquad\text{with } (G-1)(C_k-1) \text{ D.F.}$$

where

$C_k$ is the number of categories (levels) of the kth variable
$n_{ic}$ is the number of observations at the cth level of the kth variable for the ith group
$e_{ic} = (n_{i\cdot}\, n_{\cdot c})/N$ is the expected number of observations at the cth level for the ith group.

Then, the variables with the largest Chi-square values may be selected in

discriminant analysis.
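A brief sketch of this Chi-square screening computation is given below, using a small hypothetical contingency table of group-by-level counts; the counts and function name are illustrative only.

```python
import numpy as np

def chi_square_screen(counts):
    """Two-way independence chi-square for one categorical variable, where
    counts[i, c] is the number of observations of level c in group i."""
    counts = np.asarray(counts, dtype=float)
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    chi_sq = ((counts - expected) ** 2 / expected).sum()
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return chi_sq, dof

# A hypothetical 2-group, 3-level categorical variable
print(chi_square_screen([[20, 5, 2], [3, 15, 10]]))
```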

In addition to the Chi-square test, the trace procedure may also be applied if the categorical variable is introduced by means of dummy variables. The dummy variables for a categorical variable are handled as a subset in the determination of the trace.

CHAPTER V

TESTING SIGNIFICANCE

From the discussion of Chapter III, the discriminant function analysis can be applied only when homogeneity of the group dispersions exists. In addition, the analysis makes sense only when the hypothesis of the equality of group centroids is rejected. For these reasons, tests in this regard will be necessary.

A test for the equality of dispersions of groups can be found in

Cooley and Lohnes (1971). According to Miller (1961), testing significance for discriminant analysis deals with the following aspects:

1. Whether the observed variables are able to discriminate between particular pairs of groups and among the groups as a whole.

2. How many discriminant functions are required to include adequately the variation contained in the observed variables.

Therefore, only the statistical tests concerning these aspects will be covered in this chapter. The notation used in the later discussion is that of Chapter III.

5.1 Test for difference between mean vectors of two preassigned groups

To test the hypothesis that the difference of the mean vectors between a pair of groups is zero, Hotelling's $T^2$ test, which is sometimes called the Mahalanobis $D^2$, is used.

Let the mean vectors $m_i$ and $m_j$ of Groups i and j be tested. It is required to compute the "Within-groups" matrix of Groups i and j, denoted as $W_{ij}$. Then the Mahalanobis $D^2$ is defined as

$$D^2 = d'W_{ij}^{-1}d \qquad (5.1.1)$$

where $d = (m_i - m_j)$. And Hotelling's $T^2$ statistic is expressed as

$$T^2 = \frac{n_i + n_j - 2}{\frac{1}{n_i} + \frac{1}{n_j}}\, D^2 \qquad (5.1.2)$$

Under the hypothesis that the vector difference is zero,

$$\frac{n_i + n_j - P - 1}{(n_i + n_j - 2)\,P}\, T^2 \qquad (5.1.3)$$

or

$$\frac{n_i\, n_j\,(n_i + n_j - P - 1)}{P\,(n_i + n_j)(n_i + n_j - 2)}\, D^2 \qquad (5.1.4)$$

where

P is the number of variables
$n_i$ is the number of observations in the ith group
$n_j$ is the number of observations in the jth group

is approximately distributed as F with P and $(n_i + n_j - P - 1)$ degrees of freedom.
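A small numerical sketch of this two-group test follows, assuming $m_i$, $m_j$ are the sample mean vectors and $W_{ij}$ is the "Within-groups" SSCP matrix of the two groups as defined above. Here the first two variables of the numerical example of Chapter III are used purely for illustration; the function name is a placeholder.

```python
import numpy as np
from scipy.stats import f as f_dist

def two_group_test(m_i, m_j, W_ij, n_i, n_j):
    """Mahalanobis D^2 (5.1.1), Hotelling's T^2 (5.1.2), and the F
    approximation with P and (n_i + n_j - P - 1) D.F."""
    P = len(m_i)
    d = np.asarray(m_i) - np.asarray(m_j)
    D2 = d @ np.linalg.solve(W_ij, d)
    T2 = (n_i + n_j - 2) / (1.0 / n_i + 1.0 / n_j) * D2
    F = (n_i + n_j - P - 1) / ((n_i + n_j - 2) * P) * T2
    p_value = 1 - f_dist.cdf(F, P, n_i + n_j - P - 1)
    return D2, T2, F, p_value

# First two variables of the numerical example of Chapter III
m_i, m_j = np.array([2.33, 1.33]), np.array([4.00, 2.67])
W_ij = np.array([[2.67, -1.33], [-1.33, 1.33]])
print(two_group_test(m_i, m_j, W_ij, n_i=3, n_j=3))
```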

5.2 Test for differences among mean vectors of all groups

The multivariate analysis of variance is employed to test the hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_G$$

First, the "Within-groups" matrix W, the "Among-groups" matrix A, and the

"Total" mat_rix T, of all groups have to be determined. The statistic, called 34

Lambda, $\Lambda$, for testing the hypothesis was developed by Wilks in 1932 and is defined as

$$\Lambda = |W|/|T| \qquad (5.2.1)$$

Lambda has been tabulated for certain special parameters G, P and $n_i$. Based on Wilks' $\Lambda$, Rao derived an F statistic in 1952, which is powerful even for a very small number of degrees of freedom. In the same notation used previously, let

$$N = \sum_{i=1}^{G} n_i, \qquad S = \sqrt{\frac{P^2(G-1)^2 - 4}{P^2 + (G-1)^2 - 5}}$$

$$f_1 = P(G-1)$$

$$f_2 = S\left[(N-1) - \frac{P + G}{2}\right] - \frac{P(G-1) - 2}{2}$$

Then Rao's F ratio is

$$F = \frac{1 - \Lambda^{1/S}}{\Lambda^{1/S}} \cdot \frac{f_2}{f_1} \qquad (5.2.2)$$

with $f_1$ and $f_2$ D.F. The null hypothesis is rejected if the calculated F value is significant at a chosen level of significance on $f_1$ and $f_2$ degrees of freedom.

When the sample is large, an alternative Chi-square criterion, also dependent on $\Lambda$, is

$$\chi^2 = -\left(N - 1 - \frac{P + G}{2}\right)\ln\Lambda \qquad (5.2.3)$$

with P(G - 1) degrees of freedom.
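A sketch of Wilks' Lambda with Rao's F approximation, (5.2.1)-(5.2.2), is given below, assuming W and T are the within-groups and total SSCP matrices; it is applied here to the small example of Chapter III purely as an illustration, and the function name is a placeholder.

```python
import numpy as np

def wilks_rao_F(W, T, N, G, P):
    """Wilks' Lambda (5.2.1) and Rao's F approximation (5.2.2) with
    f1 = P(G-1) and f2 degrees of freedom."""
    lam = np.linalg.det(W) / np.linalg.det(T)
    denom = P**2 + (G - 1)**2 - 5
    s = np.sqrt((P**2 * (G - 1)**2 - 4) / denom) if denom > 0 else 1.0
    f1 = P * (G - 1)
    f2 = s * ((N - 1) - (P + G) / 2) - (P * (G - 1) - 2) / 2
    F = (1 - lam**(1 / s)) / lam**(1 / s) * f2 / f1
    return lam, F, f1, f2

# W and T of the small example of Chapter III (N = 6, G = 2, P = 3)
W = np.array([[2.67, -1.33, -0.33], [-1.33, 1.33, 0.33], [-0.33, 0.33, 1.33]])
T = np.array([[6.83, 2.00, -4.50], [2.00, 4.00, -3.00], [-4.50, -3.00, 5.50]])
print(wilks_rao_F(W, T, N=6, G=2, P=3))
```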

5.3 Test for significant power of discriminant function

The test to be discussed is concerned only with the orthogonal

discriminant function.

Wilks' Lambda criterion for the discriminant power of the variables among groups may also be determined as a function of the eigenvalues of $W^{-1}A$, since

$$\Lambda = \frac{|W|}{|T|} = \frac{|W|}{|W + A|} = \frac{|I|}{|I + W^{-1}A|} = \prod_{j=1}^{r}\frac{1}{1 + \lambda_j}$$

in which the $\lambda$'s are the roots of $|W^{-1}A - \lambda I| = 0$ and $r = \min(G-1, P)$. By this property, Bartlett, in 1934, derived a set of approximate Chi-square tests for testing the statistical significance of each of the discriminant functions, as follows:

Discriminant Function      Test Statistic                                      Distributed as

First, k=1                 $[N-1-(P+G)/2]\cdot\ln(1+\lambda_1)$                $\chi^2$ with $(P+G-2)$ D.F.
Second, k=2                $[N-1-(P+G)/2]\cdot\ln(1+\lambda_2)$                $\chi^2$ with $(P+G-4)$ D.F.
...
Last, k=min(G-1,P)         $[N-1-(P+G)/2]\cdot\ln(1+\lambda_{\min(G-1,P)})$    $\chi^2$ with $\{P+G-2[\min(G-1,P)]\}$ D.F.

Meanwhile, a test for the significant power of the remaining functions, say $\min(G-1, P) - n$ of them, after the acceptance of the first n functions is possible:

$$\chi^2 = \left[N - 1 - (P+G)/2\right]\sum_{j=n+1}^{r}\ln(1 + \lambda_j) \qquad\text{with } (P-n)(r-n) \text{ D.F.}$$

where $r = \min(G-1, P)$.
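The Bartlett tests above can be sketched as follows, assuming eigvals holds the eigenvalues of $W^{-1}A$ in descending order. The eigenvalues and sample sizes used in the example call are hypothetical, and the function names are placeholders.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_tests(eigvals, N, P, G):
    """Chi-square statistic [N - 1 - (P + G)/2] * ln(1 + lambda_k) for the
    kth discriminant function, with P + G - 2k degrees of freedom."""
    c = N - 1 - (P + G) / 2
    return [(c * np.log(1 + lam), P + G - 2 * (k + 1)) for k, lam in enumerate(eigvals)]

def remaining_test(eigvals, N, P, G, n):
    """Joint test for the functions remaining after the first n are accepted."""
    r = len(eigvals)                                   # r = min(G - 1, P)
    stat = (N - 1 - (P + G) / 2) * np.sum(np.log(1 + np.asarray(eigvals[n:])))
    dof = (P - n) * (r - n)
    return stat, dof, 1 - chi2.cdf(stat, dof)

# Hypothetical eigenvalues of W^{-1}A for a P = 4 variable, G = 3 group problem
print(bartlett_tests([5.0, 0.2], N=100, P=4, G=3))
print(remaining_test([5.0, 0.2], N=100, P=4, G=3, n=1))
```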

CHAPTER VI

ILLUSTRATIVE EXAMPLE OF DISCRIMINANT ANALYSIS

The data used here is from Fisher's first paper of 1936 on discriminant

function, in which four variables (sepal length, sepal width, petal length,

and petal width) of the flowers of plants of three species (Iris Setosa,

Iris Versicolor, and Iris Virginica) were measured, with fifty plants in

each species.

The required computations for demonstrating the example were produced by use of the computer statistical program package, STATPAC, by Hurst.

We have assumed that equality of the group dispersions exists. Table 1 gives the group and grand mean vectors. To test the equality of the group mean vectors, we have as a hypothesis

$$H_0: \mu_1 = \mu_2 = \mu_3$$

and we find F = 199.15, which is greater than the tabulated $F_{.05;\,8,\,288} = 2.7$. So the rejection of the hypothesis makes sense and we may proceed to the discriminant function analysis.

Table 1. Group and grand mean vectors

Variable Group 1 Group 2 Group 3 Grand Mean

1 5.006 5.936 6.588 5.843

2 3.428 2.770 2.974 3.057

3 1.462 4.260 5.552 3.758

4 .246 1.326 2.026 1.199

Given the estimated dispersion and mean matrices as

$$S = \begin{bmatrix} .265 & .093 & .168 & .038 \\ .093 & .115 & .055 & .033 \\ .168 & .055 & .185 & .043 \\ .038 & .033 & .043 & .042 \end{bmatrix} \qquad\text{and}\qquad M = \begin{bmatrix} 5.006 & 5.936 & 6.588 \\ 3.428 & 2.770 & 2.974 \\ 1.462 & 4.260 & 5.552 \\ .246 & 1.326 & 2.026 \end{bmatrix}$$

The solution to the non-orthogonal discriminant function is

$$V = S^{-1}M = \begin{bmatrix} 23.544 & 15.698 & 12.446 \\ 23.588 & 7.073 & 3.685 \\ -16.431 & 5.211 & 12.767 \\ -17.398 & 6.434 & 21.079 \end{bmatrix}$$

Accordingly, three non-orthogonal discriminant functions are formed as

$$Y_1 = 23.544 X_1 + 23.588 X_2 - 16.431 X_3 - 17.398 X_4$$
$$Y_2 = 15.698 X_1 + 7.073 X_2 + 5.211 X_3 + 6.434 X_4 \qquad (6.1)$$
$$Y_3 = 12.446 X_1 + 3.685 X_2 + 12.767 X_3 + 21.079 X_4$$

The variance covariance matrix of the function scores is

$$S_y = \begin{bmatrix} 170.420 & 112.032 & 98.787 \\ 112.032 & 143.508 & 166.423 \\ 98.787 & 166.423 & 206.539 \end{bmatrix}$$

The mean matrix, $M_y = [m_{y1}, m_{y2}, m_{y3}]$, of the function scores is exactly $S_y$. Taking a look at the $S_y$ matrix, we see that the function scores are highly correlated with each other.

Now, let us take the measurement vector of the first plant in Iris Setosa, $X' = [5.1, 3.5, 1.4, .2]$; substituting it in (6.1) produces its function score vector $Y' = [170.15, 113.40, 98.46]$. Then, based on $(Y-m_{yi})'S_y^{-1}(Y-m_{yi})$, $i=1, 2$ and 3, the Chi-square values computed are .289, 89.883, and 191.786, and the hypothesis probabilities are .743, .20E-19, and 0. The plant with Y will therefore be assigned to Group 1, Iris Setosa. Table 2 shows the results of prediction for all observations by this method. It should be noted that

since the a priori probability, 1/3, is the same for all groups, the predicted results by the Bayesian posterior probability are no different from Table 2.

Table 2. Classification results using the non-orthogonal discriminant function

Predicted Group Membership

Actual Group        Iris        Iris           Iris
Membership          Setosa      Versicolor     Virginica     Total

Iris Setosa 50 0 0 50

Iris Versicolor 0 48 2 50

Iris Virginica 0 1 49 50

Total 50 49 51

Percent hits = 98

When the orthogonal procedure is applied, the matrices of "Among-groups" A and "Within-groups" W should first be determined as

$$A = \begin{bmatrix} 63.21 & -19.95 & 165.25 & 71.28 \\ -19.95 & 11.35 & -57.24 & -22.93 \\ 165.25 & -57.24 & 437.10 & 186.77 \\ 71.28 & -22.93 & 186.77 & 80.41 \end{bmatrix} \qquad\text{and}\qquad W = \begin{bmatrix} 38.96 & 13.64 & 24.63 & 5.65 \\ 13.64 & 16.96 & 8.12 & 4.81 \\ 24.63 & 8.12 & 27.22 & 6.27 \\ 5.65 & 4.81 & 6.27 & 6.16 \end{bmatrix}$$

The eigenvalues of $W^{-1}A$ and their eigenvectors are given in Table 3. Both

functions have significant Chi-square values, 1138.2 and 453.7. From the

percent of trace, the first function retains almost all the variation of

original variables so that the discriminant space could be reduced to

only one dimension with little loss of discriminant power and greater simplicity of interpretation.
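As a rough cross-check of this chapter's orthogonal analysis, the sketch below recomputes the eigenstructure of $W^{-1}A$ on Fisher's Iris data using scikit-learn's copy of the data and plain numpy. This is not the STATPAC package used for the tables here, and the absolute eigenvalue scale may differ from the values reported in Table 3, though the percent of trace should agree.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
group_data = [X[y == g] for g in np.unique(y)]

m = X.mean(axis=0)                                      # grand centroid
T = (X - m).T @ (X - m)                                 # total SSCP
W = sum((Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0)) for Xg in group_data)
A = T - W                                               # among-groups SSCP

eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(W, A)).real)[::-1][:2]
print(np.round(eigvals, 3))                             # the two non-trivial eigenvalues
print(np.round(100 * eigvals / eigvals.sum(), 1))       # percent of trace (about 99.1 and 0.9)
```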

Table 3. Eigenvalues and their vectors

                      Function 1      Function 2
Variable              $v_1$           $v_2$

1 .2087 .0065

2 .3862 .5866

3 -.5540 -.2526

4 -.7074 .7695

Eigenvalue 2366.1070 20.9760

Percent of Trace 99.1 .9

Chi-square Value 1138.2 453.7

Table 4 gives the group centroids of the two discriminant function scores, and Figure 1 depicts the locations of the group centroids in the discriminant space. The variance covariance matrix of the discriminant function scores, which is diagonal, is

$$S_y = \begin{bmatrix} .06335 & .0 \\ .0 & .07345 \end{bmatrix}$$

Table 4. Group centroids of orthogonal discriminant functions

Group Function 1 Function 2

Iris Setosa 1. 385 1.864

Iris Versicolor -.989 1. 608

Iris Virginica -1.985 1.944

[Figure: the centroids of Groups 1, 2 and 3 and the point Y = (1.499, 1.887) plotted on the axes of the first and second discriminant functions.]

Figure 1. Location of group centroids in discriminant space

Given the two orthogonal functions

$$Y_1 = .2087 X_1 + .3862 X_2 - .5540 X_3 - .7074 X_4$$
$$Y_2 = .0065 X_1 + .5866 X_2 - .2526 X_3 + .7695 X_4 \qquad (6.2)$$

Similarly, when we take $X' = [5.1, 3.5, 1.4, 0.2]$ and substitute it into (6.2), we find the function score vector $Y' = [1.499, 1.887]$, whose location is also shown in Figure 1. The generalized distance of the point Y from the centroid of Group 1 is found to be the smallest, so the plant with Y will be identified as coming from the first group. Based on the orthogonal function scores, Table 5 is tabulated for all observations by the method of generalized distance, or Chi-square.

Table 5. Classification results using the orthogonal function

Predicted Group Membership

Actual Group        Iris        Iris           Iris
Membership          Setosa      Versicolor     Virginica     Total

Iris Setosa           50         0              0             50

Iris Versicolor        0        48              2             50

Iris Virginica 0 1 49 50

Total 50 49 51

Percent hits = 98

As indicated previously, discarding the second function results in almost no loss of discriminant power. Therefore, the discriminant space reduces to one dimension, a straight line, and the group means may be plotted along the line as in Figure 2. It is clear that the point Y is nearest to the mean of Group 1 on the axis of the first function score. The same method of classification gives the results of prediction in Table 6, which are almost the same as in Tables 2 and 5.

[Figure: the group means of Groups 3, 2 and 1 and the point Y = 1.499 plotted along the axis of the first discriminant function.]

Figure 2. Location of group mean along discriminant line

Table 6. Classification results using the first orthogonal discriminant function

Predicted Group Membership

Actual Group        Iris        Iris           Iris
Membership          Setosa      Versicolor     Virginica     Total

Iris Setosa           50         0              0             50

Iris Versicolor        0        48              2             50

Iris Virginica 0 0 50 50

Total 50 48 52

Percent hits = 98.7

LITERATURE CITED

Anderson, T. W. 1958. An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, Inc., New York, London.

Barnard, M. M. 1935. The Secular Variations of Skull Characters in Four Series of Egyptian Skulls, Annals of Eugenics, Vol. 6, pp. 352-371.

Cooley, William W., and Lohnes Paul R. 1971. Multivariate Data Analysis. John Wiley and Sons, Inc., New York, London.

Eisenbeis, Robert A., and Avery, Robert B. 1972. Discriminant Analysis and Classification Procedures. D. C. Heath and Company, Lexington, Massachusetts, Toronto, London.

Fisher, R. A. 1936. The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, 7:179-188.

Hotelling, H. 1931. The Generalization of Student's Ratio, Annals of Mathematical Statistics, Vol. 2, pp. 360-378 .

Hotelling, H. 1940. The Selection of Variables for Use in Prediction with Some Comments on the General Problems of Nuisance Parameters, Annals of Mathematical Statistics, Vol. 11, pp. 271-283.

Hodges, J. L., Jr. 1950. Discriminatory Analysis, Report No. 1, USAF School of Aviation Medicine, Randolph Field, Texas.

Hurst, Rex L. 1971. Classification and Discriminant Function Analysis. Unpublished Paper, Department of Applied Statistics, Utah State University, Logan, Utah.

Hurst, Rex L. 1971. Statistical Program Package. Department of Applied Statistics, Utah State University, Logan, Utah.

Kendall, M. G. 1966. Discrimination and Classification. Paper, C-E-I-R Ltd., London, England.

Miller, Robert G. 1961. An Application of Multiple Discriminant Analysis to the Probabilistic Prediction of Meteorological Conditions Affecting Operational Decisions. Tech. Memo. No. 4, The Travelers Research Center Inc., Hartford, Connecticut.

Miller, Robert G. 1960. Selecting Variables for Multiple Discriminant Analysis. AFCRC-TR-60-254, The Travelers Weather Research Center, Hartford, Connecticut.

Morrison, Donald F. 1967. Multivariate Statistical Methods. McGraw-Hill Book Company, New York, London.

Press, S. James, 1972. Applied Multivariate Analysis. Holt, Rinehart, and Winston, Inc., New York, London. 13:369-386.

Rao C. Radhakrishna, 1965. Linear Statistical Inference and Its Applications, John Wiley and Sons, Inc., New York, London. 8:435-510.

Tatsuoka, M. M., and Tiedeman, D. V. 1954. Discriminant Analysis, Review of Educational Research, Vol. 24, pp. 402-420.

Van de geer, John P. 1971. Introduction to Multivariate Analysis for the Social Sciences. W. H. Freeman and Company, San Francisco.

Waite, Preston Jay, 1971. The Effectiveness of Categorical Variables in Discriminant Function Analysis. Thesis, Department of Applied Statistics, Utah State University, Logan, Utah.

VITA

Kuo Hsiung Su

Candidate for the Degree of

Master of Science

Report: Discriminant Function Analysis

Major Field: Applied Statistics

Biographical Information:

Personal Data: Born in Taipei, Taiwan, December 10, 1943, son of Ting Hsen and Tri Chih Su; married Hui-yi March 10, 1971.

Education: Attended both elementary and high schools in Taipei, Taiwan; received the Bachelor of Art degree from National Chung Hsing University, Taiwan, Republic of China, in Statistics in 1967; completed requirements for the Master of Science degree, in Applied Statistics, at Utah State University in 1975.

Professional Experience: July 1968 to April 1969, research assistant in Taiwan Population Studies Center; April 1969 to August 1972, research assistant in Taiwan Provincial Committee on Family Planning; August 1972 to August 1973, acting chief of data processing division in Taiwan Provincial Committee on Family Planning; January 1974 to March 1975, statistical and computer programming consultant, Department of Applied Statistics and Computer Science, Utah State University.