Journal of Artificial Intelligence Research 2 (1995) 263-286

Solving Multiclass Learning Problems via
Error-Correcting Output Codes

Thomas G. Dietterich                                      tgd@cs.orst.edu
Department of Computer Science, Dearborn Hall
Oregon State University
Corvallis, OR, USA

Ghulum Bakiri                                             ebis@acc.uob.bh
Department of Computer Science
University of Bahrain
Isa Town, Bahrain

Abstract

Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k values (i.e., k classes). The definition is acquired by studying collections of training examples of the form ⟨x_i, f(x_i)⟩. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.

Introduction

The task of learning from examples is to find an approximate definition for an unknown function f(x) given training examples of the form ⟨x_i, f(x_i)⟩. For cases in which f takes only the values {0, 1} (binary functions), there are many algorithms available. For example, the decision-tree methods, such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984), can construct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991).

In many real-world learning tasks, however, the unknown function f often takes values from a discrete set of classes {c_1, ..., c_k}. For example, in medical diagnosis, the function might map a description of a patient to one of k possible diseases. In digit recognition (e.g., LeCun, Boser, Denker, Henderson, Howard, Hubbard, & Jackel, 1989), the function maps each hand-printed digit to one of k = 10 classes. Phoneme recognition systems (e.g., Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) typically classify a speech segment into one of several dozen phonemes.

Decision-tree algorithms can be easily generalized to handle these multiclass learning tasks. Each leaf of the decision tree can be labeled with one of the k classes, and internal nodes can be selected to discriminate among these classes. We will call this the direct multiclass approach.

Connectionist algorithms are more difficult to apply to multiclass problems. The standard approach is to learn k individual binary functions f_1, ..., f_k, one for each class. To assign a new case x to one of these classes, each of the f_i is evaluated on x, and x is assigned the class j of the function f_j that returns the highest activation (Nilsson, 1965). We will call this the one-per-class approach, since one binary function is learned for each class.
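To make the one-per-class decision rule concrete, here is a minimal Python sketch (ours, not the paper's); it assumes the k binary functions have already been trained and shows only the final argmax step. The array layout, function name, and class labels are illustrative.

```python
import numpy as np

def one_per_class_predict(activations, class_names):
    """activations[i, j] is the output of the binary function f_j (trained to be
    1 for class j and 0 otherwise) on example i. Each example is assigned the
    class of the function with the highest activation."""
    best = np.argmax(activations, axis=1)
    return [class_names[j] for j in best]

# Example: three classes, two new examples.
acts = np.array([[0.1, 0.8, 0.3],
                 [0.6, 0.2, 0.5]])
print(one_per_class_predict(acts, ["c1", "c2", "c3"]))   # ['c2', 'c1']
```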

An alternative approach, explored by some researchers, is to employ a distributed output code. This approach was pioneered by Sejnowski and Rosenberg (1987) in their widely known NETtalk system. Each class is assigned a unique binary string of length n; we will refer to these strings as "codewords." Then n binary functions are learned, one for each bit position in these binary strings. During training on an example from class i, the desired outputs of these n binary functions are specified by the codeword for class i. With artificial neural networks, these n functions can be implemented by the n output units of a single network.
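The training setup can be sketched as follows (an illustration under assumed inputs; the codeword bits and class names below are invented, not those of Table 1): each class label is expanded into its codeword, and column j of the resulting matrix becomes the training target for the j-th binary function.

```python
import numpy as np

# Hypothetical 6-bit distributed code in the spirit of Table 1: one codeword
# (row) per class, one column per binary function. The bit values below are
# invented for illustration; they are NOT the codewords of Table 1.
code = {
    "0": [0, 1, 0, 1, 0, 0],
    "1": [1, 0, 0, 0, 0, 0],
    "2": [0, 1, 1, 0, 1, 0],
}

def bit_targets(labels, code):
    """Expand class labels into the n per-bit training targets: column j of the
    returned matrix is the target vector for the j-th binary function."""
    return np.array([code[y] for y in labels])

targets = bit_targets(["1", "0", "2", "1"], code)
print(targets.shape)   # (4, 6): 4 training examples, 6 binary functions
```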

New values of x are classified by evaluating each of the n binary functions to generate an n-bit string s. This string is then compared to each of the k codewords, and x is assigned to the class whose codeword is closest, according to some distance measure, to the generated string s.

As an example, consider Table 1, which shows a six-bit distributed code for a ten-class digit-recognition problem. Notice that each row is distinct, so that each class has a unique codeword. As in most applications of distributed output codes, the bit positions (columns) have been chosen to be meaningful. Table 2 gives the meanings for the six columns. During learning, one binary function will be learned for each column. Notice that each column is also distinct and that each binary function to be learned is a disjunction of the original classes. For example, f_vl(x) = 1 exactly when f(x) is one of the digit classes that contains a vertical line.

To classify a new hand-printed digit x, the six functions f_vl, f_hl, f_dl, f_cc, f_ol, and f_or are evaluated to obtain a six-bit string. Then the distance from this string to each of the ten codewords is computed. The nearest codeword, according to Hamming distance (which counts the number of bits that differ), determines the class predicted for f(x).
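A minimal sketch of this nearest-codeword (Hamming) decoding step, with invented codewords rather than those of Table 1:

```python
import numpy as np

def hamming_decode(bits, codewords, classes):
    """bits: the n-bit string produced by the n learned binary functions.
    Returns the class whose codeword has the smallest Hamming distance
    (number of disagreeing bit positions) to the predicted string."""
    bits = np.asarray(bits)
    dists = [int(np.sum(bits != np.asarray(w))) for w in codewords]
    return classes[int(np.argmin(dists))], dists

# Illustrative codewords (not the ones in Table 1).
codewords = [[0, 1, 0, 1, 0, 0],
             [1, 0, 0, 0, 0, 0],
             [0, 1, 1, 0, 1, 0]]
pred, dists = hamming_decode([0, 1, 1, 0, 0, 0], codewords, ["0", "1", "2"])
print(pred, dists)   # class "2" is nearest (distance 1 in this made-up example)
```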

This process of mapping the output string to the nearest codeword is identical to the decoding step for error-correcting codes (Bose & Ray-Chaudhuri, 1960; Hocquenghem, 1959). This suggests that there might be some advantage to employing error-correcting codes as a distributed representation. Indeed, the idea of employing error-correcting distributed representations can be traced to early research in machine learning (Duda, Machanik, & Singleton, 1963).


Table 1: A distributed code for the digit recognition task.

            Codeword
Class   vl  hl  dl  cc  ol  or

Table 2: Meanings of the six columns for the code in Table 1.

Column position   Abbreviation   Meaning
1                 vl             contains vertical line
2                 hl             contains horizontal line
3                 dl             contains diagonal line
4                 cc             contains closed curve
5                 ol             contains curve open to left
6                 or             contains curve open to right

Table 3: A 15-bit error-correcting output code for a ten-class problem.

            Codeword
Class   f0  f1  f2  ...  f14


Table 3 shows a 15-bit error-correcting code for the digit-recognition task. Each class is represented by a codeword drawn from an error-correcting code. As with the distributed encoding of Table 1, a separate Boolean function is learned for each bit position of the error-correcting code. To classify a new example x, each of the learned functions f_0(x), ..., f_14(x) is evaluated to produce a 15-bit string. This is then mapped to the nearest of the ten codewords. This code can correct up to three errors out of the 15 bits.

This error-correcting code approach suggests that we view machine learning as a kind of communications problem in which the identity of the correct output class for a new example is being transmitted over a channel. The channel consists of the input features, the training examples, and the learning algorithm. Because of errors introduced by the finite training sample, poor choice of input features, and flaws in the learning process, the class information is corrupted. By encoding the class in an error-correcting code and transmitting each bit separately (i.e., via a separate run of the learning algorithm), the system may be able to recover from the errors.

This perspective further suggests that the one-per-class and "meaningful" distributed output approaches will be inferior, because their output representations do not constitute robust error-correcting codes. A measure of the quality of an error-correcting code is the minimum Hamming distance between any pair of codewords. If the minimum Hamming distance is d, then the code can correct at least ⌊(d-1)/2⌋ single-bit errors. This is because each single-bit error moves us one unit away from the true codeword (in Hamming distance); if we make only ⌊(d-1)/2⌋ errors, the nearest codeword will still be the correct codeword. The code of Table 3 has minimum Hamming distance seven, and hence it can correct errors in any three bit positions. The Hamming distance between any two codewords in the one-per-class code is two, so the one-per-class encoding of the k output classes cannot correct any errors.
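Both quantities in this argument are easy to compute. The following sketch (an illustration, not code from the paper) measures the minimum inter-row Hamming distance of a code matrix and the number of single-bit errors it can therefore correct; applied to a one-per-class code, it confirms that no errors can be corrected.

```python
from itertools import combinations

def min_row_distance(code):
    """Minimum Hamming distance between any pair of codewords (rows)."""
    return min(sum(a != b for a, b in zip(r1, r2))
               for r1, r2 in combinations(code, 2))

def correctable_errors(code):
    """A code with minimum distance d corrects floor((d - 1) / 2) bit errors."""
    d = min_row_distance(code)
    return (d - 1) // 2

# One-per-class code for k = 4: every pair of rows differs in exactly 2 bits,
# so no errors can be corrected.
opc = [[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]]
print(min_row_distance(opc), correctable_errors(opc))   # 2 0
```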

The minimum Hamming distance between pairs of codewords in a "meaningful" distributed representation tends to be very low. For example, in Table 1 there are pairs of classes whose codewords differ in only one bit. In these kinds of codes, new columns are often introduced to discriminate between only two classes. Those two classes will therefore differ in only one bit position, so the Hamming distance between their output representations will be one. This is also true of the distributed representation developed by Sejnowski and Rosenberg (1987) for the NETtalk task.

In this paper, we compare the performance of the error-correcting code approach to the three existing approaches: the direct multiclass method (using decision trees), the one-per-class method, and (in the NETtalk task only) the meaningful distributed output representation approach. We show that error-correcting codes produce uniformly better generalization performance across a variety of multiclass domains for both the C4.5 decision-tree learning algorithm and the backpropagation neural network learning algorithm. We then report a series of experiments designed to assess the robustness of the error-correcting code approach to various changes in the learning task: length of the code, size of the training set, assignment of codewords to classes, and decision-tree pruning. Finally, we show that the error-correcting code approach can produce reliable class probability estimates.

The paper concludes with a discussion of the open questions raised by these results. Chief among these questions is the issue of why the errors being made in the different bit positions of the output are somewhat independent of one another. Without this independence, the error-correcting output code method would fail. We address this question, for the case of decision-tree algorithms, in a companion paper (Kong & Dietterich, 1995).

Table 4: Data sets employed in the study.

Name                  Number of    Number of    Number of            Number of
                      Features     Classes      Training Examples    Test Examples
glass                                                                10-fold xval
vowel
POS                                                                  10-fold xval
soybean
audiologyS
ISOLET
letter
NETtalk (phonemes)
NETtalk (stresses)

Methods

This section describes the data sets and learning algorithms employed in this study. It also discusses the issues involved in the design of error-correcting codes and describes four algorithms for code design. The section concludes with a brief description of the methods applied to make classification decisions and evaluate performance on independent test sets.

Data Sets

Table 4 summarizes the data sets employed in the study. The glass, vowel, soybean, audiologyS, ISOLET, letter, and NETtalk data sets are available from the Irvine Repository of machine learning databases (Murphy & Aha, 1994). The POS (part-of-speech) data set was provided by C. Cardie (personal communication); an earlier version of the data set was described by Cardie (1993). We did not use the entire NETtalk data set, which consists of a dictionary of words and their pronunciations. Instead, to make the experiments feasible, we chose a training set and a disjoint test set of words at random from the NETtalk dictionary. In this paper, we focus on the percentage of letters pronounced correctly, rather than whole words. To pronounce a letter, both the phoneme and the stress of the letter must be determined. Although there are many syntactically possible combinations of phonemes and stresses, only a subset of these appear in the training and test sets we selected. (The repository refers to the soybean data set as soybean-large, the audiologyS data set as audiology-standardized, and the letter data set as letter-recognition.)


Learning Algorithms

We employed two general classes of learning methods: algorithms for learning decision trees and algorithms for learning feedforward networks of sigmoidal units (artificial neural networks). For decision trees, we performed all of our experiments using an early release of C4.5, which is older than, but substantially identical to, the program described by Quinlan (1993). We have made several changes to C4.5 to support distributed output representations, but these have not affected the tree-growing part of the algorithm. For pruning, the confidence factor was set to a fixed value. C4.5 contains a facility for creating "soft thresholds" for continuous features. We found experimentally that this improved the quality of the class probability estimates produced by the algorithm in the glass, vowel, and ISOLET domains, so the results reported for those domains were computed using soft thresholds.

For neural networks, we employed two implementations. In most domains, we used the extremely fast backpropagation implementation provided by the CNAPS neurocomputer (Adaptive Solutions). This performs simple gradient descent with a fixed learning rate. The gradient is updated after presenting each training example; no momentum term was employed. A potential limitation of the CNAPS is that inputs are represented with only eight bits of accuracy, and weights are also represented with limited precision. Weight-update arithmetic does not round but instead performs "jamming" (i.e., forcing the lowest-order bit to 1) when low-order bits are lost due to shifting or multiplication. On the speech recognition, letter recognition, and vowel data sets, we employed the opt system distributed by the Oregon Graduate Institute (Barnard & Cole, 1989). This implements the conjugate-gradient algorithm and updates the gradient after each complete pass through the training examples (known as per-epoch updating). No learning rate is required for this approach.

Both the CNAPS and opt attempt to minimize the squared error between the computed and desired outputs of the network. Many researchers have employed other error measures, particularly cross-entropy (Hinton, 1989) and classification figure-of-merit (CFM) (Hampshire & Waibel, 1990). Many researchers also advocate using a softmax normalizing layer at the outputs of the network (Bridle, 1990). While each of these configurations has good theoretical support, Richard and Lippmann (1991) report that squared error works just as well as these other measures in producing accurate posterior probability estimates. Furthermore, cross-entropy and CFM tend to overfit more easily than squared error (Lippmann, personal communication; Weigend, 1993). We chose to minimize squared error because this is what the CNAPS and opt systems implement.

With either neural network algorithm, several parameters must be chosen by the user. For the CNAPS, we must select the learning rate, the initial random seed, the number of hidden units, and the stopping criteria. We selected these to optimize performance on a validation set, following the methodology of Lang, Hinton, and Waibel (1990). The training set is subdivided into a subtraining set and a validation set. While training on the subtraining set, we observed generalization performance on the validation set to determine the optimal settings of learning rate and network size and the best point at which to stop training. The training-set mean squared error at that stopping point is computed, and training is then performed on the entire training set using the chosen parameters, stopping at the indicated mean squared error. Finally, we measure network performance on the test set.


For most of the data sets, this procedure worked very well. However, for the letter recognition data set, it was clearly choosing poor stopping points for the full training set. To overcome this problem, we employed a slightly different procedure to determine the stopping epoch. We trained on a series of progressively larger training sets, all of which were subsets of the final training set. Using a validation set, we determined the best stopping epoch on each of these training sets. We then extrapolated from these training sets to predict the best stopping epoch on the full training set.

For the glass and POS data sets, we employed ten-fold cross-validation to assess generalization performance. We chose training parameters based on only one fold of the ten-fold cross-validation. This creates some test-set contamination, since examples in the validation-set data of one fold are in the test-set data of other folds. However, we found that there was little or no overfitting, so the validation set had little effect on the choice of parameters or stopping points.

The other data sets all come with designated test sets, which we employed to measure generalization performance.

Error-Correcting Code Design

We define an error-correcting code to be a matrix of binary values, such as the matrix shown in Table 3. The length of a code is the number of columns in the code. The number of rows in the code is equal to the number of classes in the multiclass learning problem. A codeword is a row in the code.

A good error-correcting output code for a k-class problem should satisfy two properties:

1. Row separation. Each codeword should be well-separated in Hamming distance from each of the other codewords.

2. Column separation. Each bit-position function f_i should be uncorrelated with the functions to be learned for the other bit positions, f_j, j ≠ i. This can be achieved by insisting that the Hamming distance between column i and each of the other columns be large and that the Hamming distance between column i and the complement of each of the other columns also be large.

The power of a code to correct errors is directly related to the row separation, as discussed above. The purpose of the column-separation condition is less obvious. If two columns i and j are similar or identical, then when a deterministic learning algorithm such as C4.5 is applied to learn f_i and f_j, it will make similar (correlated) mistakes. Error-correcting codes only succeed if the errors made in the individual bit positions are relatively uncorrelated, so that the number of simultaneous errors in many bit positions is small. If there are many simultaneous errors, the error-correcting code will not be able to correct them (Peterson & Weldon, 1972).

The errors in columns i and j will also be highly correlated if the bits in those columns are complementary. This is because algorithms such as C4.5 and backpropagation treat a class and its complement symmetrically: C4.5 will construct identical decision trees if the 0-class and 1-class are interchanged. The maximum Hamming distance between two columns is attained when the columns are complements. Hence, the column-separation condition attempts to ensure that columns are neither identical nor complementary.
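A direct way to test the column-separation condition is to require every pair of columns to be at least some number of bits away from each other and from each other's complements. The sketch below illustrates this check; the threshold d_min is an illustrative parameter, not a value taken from the paper.

```python
from itertools import combinations

def column_separation_ok(code, d_min):
    """Check the column-separation condition: every pair of columns must differ
    from each other, and from each other's complement, in at least d_min bit
    positions. (d_min is an illustrative threshold; the paper does not commit
    to a single value here.)"""
    n_rows = len(code)
    n_cols = len(code[0])
    cols = [[row[j] for row in code] for j in range(n_cols)]
    for a, b in combinations(cols, 2):
        dist = sum(x != y for x, y in zip(a, b))
        # Distance to the complement of b is n_rows - dist.
        if dist < d_min or (n_rows - dist) < d_min:
            return False
    return True

# One-per-class code for k = 4: every pair of columns differs in exactly
# two positions, so the check fails for any threshold above two.
opc = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(column_separation_ok(opc, 3))   # False
```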


Table 5: All possible columns for a three-class problem. Note that the last four columns are complements of the first four and that the first column does not discriminate among any of the classes.

            Codeword
Class   f0  f1  f2  f3  f4  f5  f6  f7
c0       0   0   0   0   1   1   1   1
c1       0   0   1   1   0   0   1   1
c2       0   1   0   1   0   1   0   1

Unless the number of classes is at least five, it is difficult to satisfy both of these properties. For example, when the number of classes is three, there are only 2^3 = 8 possible columns (see Table 5). Of these, half are complements of the other half, so this leaves us with only four possible columns. One of these will be either all zeros or all ones, which makes it useless for discriminating among the rows. The result is that we are left with only three possible columns, which is exactly what the one-per-class encoding provides.

In general, if there are k classes, there will be at most 2^(k-1) - 1 usable columns after removing complements and the all-zeros (or all-ones) column. For four classes, we get a seven-column code with minimum inter-row Hamming distance 4. For five classes, we get a 15-column code, and so on.

We have employed four methods for constructing good error-correcting output codes in this paper: (a) an exhaustive technique, (b) a method that selects columns from an exhaustive code, (c) a method based on a randomized hill-climbing algorithm, and (d) BCH codes. The choice of which method to use is based on the number of classes k. Finding a single method suitable for all values of k is an open research problem. We describe each of our four methods in turn.

Exhaustive Codes

When the number of classes k is small (3 <= k <= 7), we construct a code of length 2^(k-1) - 1 as follows. Row 1 is all ones. Row 2 consists of 2^(k-2) zeros followed by 2^(k-2) - 1 ones. Row 3 consists of 2^(k-3) zeros, followed by 2^(k-3) ones, followed by 2^(k-3) zeros, followed by 2^(k-3) - 1 ones. In row i, there are alternating runs of 2^(k-i) zeros and ones. Table 6 shows the exhaustive code for a five-class problem. This code has inter-row Hamming distance 8; no columns are identical or complementary.
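The construction above is simple enough to state as code. The following sketch (ours, for illustration) generates the exhaustive code for a given k by building each row from alternating runs of 2^(k-i) zeros and ones; for k = 5 it reproduces the code of Table 6.

```python
def exhaustive_code(k):
    """Construct the exhaustive code described above for small k:
    length 2**(k-1) - 1, with row 1 all ones and row i (i >= 2) built from
    alternating runs of 2**(k-i) zeros and ones."""
    length = 2 ** (k - 1) - 1
    code = [[1] * length]                                 # row 1: all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)
        row = [(j // run) % 2 for j in range(length)]     # zeros first, then ones, ...
        code.append(row)
    return code

for row in exhaustive_code(5):                            # reproduces Table 6
    print("".join(str(b) for b in row))
```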

Column Selection from Exhaustive Codes

Table 6: Exhaustive code for k = 5.

Row   Codeword
 1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
 3    0 0 0 0 1 1 1 1 0 0 0 0 1 1 1
 4    0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
 5    0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

Figure 1: Hill-climbing algorithm for improving row and column separation. The two closest rows and columns are indicated by lines. Where these lines intersect, the bits in the codewords are changed to improve separations, as shown on the right.

When the number of classes is somewhat larger (8 <= k <= 11), we construct an exhaustive code and then select a good subset of its columns. We formulate this as a propositional satisfiability problem and apply the GSAT algorithm (Selman, Levesque, & Mitchell, 1992) to attempt a solution. A solution is required to include exactly L columns (the desired length of the code), while ensuring that the Hamming distance between every two columns is between d and L - d, for some chosen value of d. Each column is represented by a Boolean variable. A pairwise mutual exclusion constraint is placed between any two columns that violate the column-separation condition. To support these constraints, we extended GSAT to support mutual exclusion and m-of-n constraints efficiently.
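As a rough illustration of the constraint being enforced (not of the GSAT procedure itself), the following greedy sketch walks over candidate columns and keeps a column only if it is sufficiently far from every previously kept column and from its complement. The greedy strategy, the parameters, and the random candidate matrix are ours; the paper's own method is the extended GSAT search described above.

```python
import random

def greedy_column_selection(code, L, d):
    """Greedily pick up to L columns of `code` such that every kept pair of
    columns differs in at least d positions and is at least d positions away
    from being complementary. This is a simplified stand-in, NOT the paper's
    GSAT-based algorithm."""
    k = len(code)
    columns = [tuple(row[j] for row in code) for j in range(len(code[0]))]
    chosen = []
    for col in columns:
        ok = all(d <= sum(x != y for x, y in zip(col, c)) <= k - d for c in chosen)
        if ok:
            chosen.append(col)
        if len(chosen) == L:
            break
    return chosen            # may contain fewer than L columns if the pass fails

# Toy demonstration on a random 8-class code with 40 candidate columns.
random.seed(0)
candidate = [[random.randint(0, 1) for _ in range(40)] for _ in range(8)]
picked = greedy_column_selection(candidate, L=15, d=2)
print(len(picked))
```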

Randomized Hill Climbing

For larger numbers of classes (k > 11), we employed a random search algorithm that begins by drawing k random strings of the desired length L. Any pair of such random strings will be separated by a Hamming distance that is binomially distributed with mean L/2. Hence, such randomly generated codes are generally quite good on average. To improve them, the algorithm repeatedly finds the pair of rows closest together in Hamming distance and the pair of columns that have the most extreme Hamming distance (i.e., either too close or too far apart). The algorithm then computes the four codeword bits where these rows and columns intersect and changes them to improve the row and column separations, as shown in Figure 1. When this hill-climbing procedure reaches a local maximum, the algorithm randomly chooses pairs of rows and columns and tries to improve their separations. This combined hill-climbing/random-choice procedure is able to improve the minimum Hamming-distance separation quite substantially.
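The following simplified sketch conveys the flavor of this random search, though not its exact move: instead of editing the four bits where the closest rows and the most extreme columns intersect, it flips single random bits and keeps any flip that does not reduce the minimum row separation. It is an illustration only, not the authors' algorithm.

```python
import random
from itertools import combinations

def min_row_distance(code):
    return min(sum(a != b for a, b in zip(r1, r2))
               for r1, r2 in combinations(code, 2))

def random_code_search(k, L, iters=2000, seed=0):
    """Start from a random k x L code and locally improve its minimum row
    separation by single-bit flips (simplified stand-in for the paper's
    hill-climbing/random-choice procedure)."""
    rng = random.Random(seed)
    code = [[rng.randint(0, 1) for _ in range(L)] for _ in range(k)]
    best = min_row_distance(code)
    for _ in range(iters):
        i, j = rng.randrange(k), rng.randrange(L)
        code[i][j] ^= 1                       # flip one bit
        d = min_row_distance(code)
        if d >= best:
            best = d                          # keep the flip
        else:
            code[i][j] ^= 1                   # undo it
    return code, best

code, d = random_code_search(k=20, L=31)
print("minimum row separation:", d)
```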


BCH Codes

For larger numbers of classes (k > 11), we also applied the BCH algorithm to design codes (Bose & Ray-Chaudhuri, 1960; Hocquenghem, 1959). The BCH algorithm employs algebraic methods from Galois field theory to design nearly optimal error-correcting codes. However, there are three practical drawbacks to using this algorithm. First, published tables of the primitive polynomials required by this algorithm only produce codes up to a certain length, since that is the largest word size employed in computer memories. Second, the codes do not always exhibit good column separations. Third, the number of rows in these codes is always a power of two. If the number of classes k in our learning problem is not a power of two, we must shorten the code by deleting rows (and possibly columns) while maintaining good row and column separations. We have experimented with various heuristic greedy algorithms for code shortening. For most of the codes used in the NETtalk, ISOLET, and Letter Recognition domains, we have used a combination of simple greedy algorithms and manual intervention to design good shortened BCH codes.

In each of the data sets that we studied, we designed a series of error-correcting codes of increasing lengths. We executed each learning algorithm with each of these codes. We stopped lengthening the codes when performance appeared to be leveling off.

Making Classification Decisions

Each approach to solving multiclass problems (direct multiclass, one-per-class, and error-correcting output coding) assumes a method for classifying new examples. For the C4.5 direct multiclass approach, the C4.5 system computes a class probability estimate for each new example; this estimates the probability that the example belongs to each of the k classes. C4.5 then chooses the class having the highest probability as the class of the example.

For the one-per-class approach, each decision tree or neural network output unit can be viewed as computing the probability that the new example belongs to its corresponding class. The class whose decision tree or output unit gives the highest probability estimate is chosen as the predicted class. Ties are broken arbitrarily in favor of the class that comes first in the class ordering.

For the error-correcting output code approach, each decision tree or neural network output unit can be viewed as computing the probability that its corresponding bit in the codeword is one. Call these probability values B = ⟨b_1, ..., b_n⟩, where n is the length of the codewords in the error-correcting code. To classify a new example, we compute the L1 distance between this probability vector B and each of the codewords W_i, 1 <= i <= k, in the error-correcting code. The L1 distance between B and W_i is defined as

    L1(B, W_i) = Σ_{j=1}^{n} |b_j - W_{ij}|.

The class whose codeword has the smallest L1 distance to B is assigned as the class of the new example. Ties are broken arbitrarily in favor of the class that comes first in the class ordering.
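A minimal sketch of this L1 decoding rule (the codeword bits and per-bit probabilities below are invented for illustration):

```python
import numpy as np

def ecoc_predict(bit_probs, codewords, classes):
    """Decode a vector B of per-bit probabilities by choosing the class whose
    codeword W_i minimizes the L1 distance sum_j |b_j - W_ij|, as in the text."""
    B = np.asarray(bit_probs, dtype=float)
    W = np.asarray(codewords, dtype=float)         # shape (k, n)
    dists = np.abs(W - B).sum(axis=1)              # L1 distance to each codeword
    return classes[int(np.argmin(dists))], dists

# Illustrative 5-bit codewords for three classes (values invented).
codewords = [[1, 1, 0, 0, 1],
             [0, 1, 1, 0, 0],
             [1, 0, 1, 1, 0]]
pred, dists = ecoc_predict([0.9, 0.2, 0.7, 0.6, 0.1], codewords, ["a", "b", "c"])
print(pred, np.round(dists, 2))   # class "c" is nearest: distances 3.1, 2.7, 1.1
```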


[Figure 2: bar chart over the domains Glass, Vowel, POS, Soybean, Audiology, ISOLET, Letter, and NETtalk; vertical axis: performance relative to C4.5 multiclass (percentage points); bars: C4.5 one-per-class and C4.5 ECOC.]

Figure 2: Performance (in percentage points) of the one-per-class and ECOC methods relative to the direct multiclass method, using C4.5. An asterisk indicates that the difference is statistically significant.

Results

We now present the results of our experiments. We begin with the results for decision trees. Then we consider neural networks. Finally, we report the results of a series of experiments to assess the robustness of the error-correcting output code method.

Decision Trees

Figure 2 shows the performance of C4.5 in all eight domains. The horizontal line corresponds to the performance of the standard multiclass decision-tree algorithm. The light bar shows the performance of the one-per-class approach, and the dark bar shows the performance of the ECOC approach with the longest error-correcting code tested. Performance is displayed as the number of percentage points by which each pair of algorithms differ. An asterisk indicates that the difference is statistically significant according to a test for the difference of two proportions, using the normal approximation to the binomial distribution (see Snedecor & Cochran, 1989).

From this figure, we can see that the one-per-class method performs significantly worse than the multiclass method in four of the eight domains and that its behavior is statistically indistinguishable in the remaining four domains. Much more encouraging is the observation that the error-correcting output code approach is significantly superior to the multiclass approach in six of the eight domains and indistinguishable in the remaining two.


In the NETtalk domain, we can also consider the performance of the meaningful distributed representation developed by Sejnowski and Rosenberg. The percentage of letters classified correctly under this representation was lower than under the ECOC configuration, and the differences among the configurations are statistically significant, except that the one-per-class and direct-multiclass configurations are not statistically distinguishable.

Backpropagation

Figure 3 shows the results for backpropagation in five of the most challenging domains. The horizontal line corresponds to the performance of the one-per-class encoding for this method. The bars show the number of percentage points by which the error-correcting output coding representation outperforms the one-per-class representation. In four of the five domains, the ECOC encoding is superior; the differences are statistically significant in the Vowel, NETtalk, and ISOLET domains (the difference for ISOLET is only detectable using a test for paired differences of proportions; see Snedecor & Cochran, 1989).

In the letter recognition domain, we encountered great difficulty in successfully training networks using the CNAPS machine, particularly for the ECOC configuration. Experiments showed that the problem arose from the fact that the CNAPS implementation of backpropagation employs a fixed learning rate. We therefore switched to the much slower opt program, which chooses the learning rate adaptively via conjugate-gradient line searches. This behaved better for both the one-per-class and ECOC configurations.

We also had some difficulty training ISOLET in the ECOC configuration on large networks (those with many hidden units), even with the opt program. Some sets of initial random weights led to local minima and poor performance on the validation set.

In the NETtalk task, we can again compare the performance of the Sejnowski-Rosenberg distributed encoding to the one-per-class and ECOC encodings. The distributed encoding yielded performance comparable to the one-per-class encoding and below that of the ECOC encoding; the difference between the distributed encoding and the one-per-class encoding is not statistically significant. From these results, and the previous results for C4.5, we can conclude that the distributed encoding has no advantages over the one-per-class and ECOC encodings in this domain.

Robustness

These results show that the ECOC approach performs as well as, and often better than, the alternative approaches. However, there are several important questions that must be answered before we can recommend the ECOC approach without reservation.

[Figure 3: bar chart over the domains Glass, Vowel, ISOLET, Letter, and NETtalk; vertical axis: performance of backpropagation ECOC relative to backpropagation one-per-class (percentage points).]

Figure 3: Performance of the ECOC method relative to the one-per-class method, using backpropagation. An asterisk indicates that the difference is statistically significant.

1. Do the results hold for small samples? We have found that decision trees learned using error-correcting codes are much larger than those learned using the one-per-class or multiclass approaches. This suggests that, with small sample sizes, the ECOC method may not perform as well, since complex trees usually require more data to be learned reliably. On the other hand, the experiments described above covered a wide range of training set sizes, which suggests that the results may not depend on having a large training set.

2. Do the results depend on the particular assignment of codewords to classes? The codewords were assigned to the classes arbitrarily in the experiments reported above, which suggests that the particular assignment may not be important. However, some assignments might still be much better than others.

3. Do the results depend on whether pruning techniques are applied to the decision-tree algorithms? Pruning methods have been shown to improve the performance of multiclass C4.5 in many domains.

4. Can the ECOC approach provide class probability estimates? Both C4.5 and backpropagation can be configured to provide estimates of the probability that a test example belongs to each of the k possible classes. Can the ECOC approach do this as well?

Small Sample Performance

As we have noted, we became concerned about the small-sample performance of the ECOC method when we noticed that the ECOC method always requires much larger decision trees than the one-per-class (OPC) method. Table 7 compares the sizes of the decision trees learned by C4.5 under the multiclass, one-per-class, and ECOC configurations for the letter recognition task and the NETtalk task. For the OPC and ECOC configurations, the tables show the average number of leaves in the trees learned for each bit position of the output representation. For letter recognition, the trees learned for the error-correcting code are more than six times larger than those learned for the one-per-class representation. For the phoneme-classification part of NETtalk, the ECOC trees are likewise much larger than the OPC trees. Another way to compare the sizes of the trees is to consider the total number of leaves in the trees. The tables clearly show that the multiclass approach requires much less memory (many fewer total leaves) than either the OPC or the ECOC approaches.

Table 7: Size of decision trees learned by C4.5 for the letter recognition task and the NETtalk task.

Letter Recognition    Leaves per bit    Total leaves
Multiclass
One-per-class
ECOC

NETtalk               Leaves per bit           Total leaves
                      phoneme    stress        phoneme    stress
Multiclass
One-per-class
ECOC

With backpropagation, it is more difficult to determine the amount of network resources consumed in training the network. One approach is to compare the number of hidden units that give the best generalization performance. In the ISOLET task, for example, the error-correcting encoding attained peak validation-set performance with a substantially larger hidden layer than the one-per-class encoding required. The same pattern held in the letter recognition task: peak performance for the error-correcting output code was obtained with more hidden units than peak performance for the one-per-class encoding.

From the decision tree and neural network sizes, we can see that, in general, the error-correcting output representation requires more complex hypotheses than the one-per-class representation. From learning theory and statistics, we know that complex hypotheses typically require more training data than simple ones. On this basis, one might expect that the performance of the ECOC method would be very poor with small training sets. To test this prediction, we measured performance as a function of training set size in two of the larger domains: NETtalk and letter recognition.

Figure 4 presents learning curves for C4.5 on the NETtalk and letter recognition tasks, which show accuracy for a series of progressively larger training sets. From the figure, it is clear that the error-correcting code configuration consistently outperforms the other two configurations by a nearly constant margin. Figure 5 shows corresponding results for backpropagation on the NETtalk and letter recognition tasks. On the NETtalk task, the results are the same: sample size has no apparent influence on the benefits of error-correcting output coding. However, for the letter-recognition task, there appears to be an interaction: error-correcting output coding works best for small training sets, where there is a statistically significant benefit. With the largest training set, the one-per-class method very slightly outperforms the ECOC method.

[Figure 4: learning curves (percent correct versus training set size, logarithmic scale) for C4.5 on NETtalk (multiclass, one-per-class, and 61-bit ECOC) and letter recognition (multiclass, one-per-class, and 62-bit ECOC).]

Figure 4: Accuracy of C4.5 in the multiclass, one-per-class, and error-correcting output coding configurations for increasing training set sizes in the NETtalk and letter recognition tasks. Note that the horizontal axis is plotted on a logarithmic scale.

[Figure 5: learning curves (percent correct versus training set size) for backpropagation on NETtalk (CNAPS one-per-class and CNAPS 61-bit ECOC) and letter recognition (opt one-per-class and opt 62-bit ECOC).]

Figure 5: Accuracy of backpropagation in the one-per-class and error-correcting output coding configurations for increasing training set sizes on the NETtalk and letter recognition tasks.

From these experiments, we conclude that error-correcting output coding works very well with small samples, despite the increased size of the decision trees and the increased complexity of training neural networks. Indeed, with backpropagation on the letter recognition task, error-correcting output coding worked better for small samples than it did for large ones. This effect suggests that ECOC works by reducing the variance of the learning algorithm. For small samples, the variance is higher, so ECOC can provide more benefit.

Table 8: Five random assignments of codewords to classes for the NETtalk task. Each column shows the percentage of letters correctly classified by C4.5 decision trees.

                              61-Bit Error-Correcting Code Replications
Multiclass   One-per-class      a      b      c      d      e

Assignment of Codewords to Classes

In all of the results reported thus far, the codewords in the error-correcting code have been arbitrarily assigned to the classes of the learning task. We conducted a series of experiments in the NETtalk domain with C4.5 to determine whether randomly reassigning the codewords to the classes had any effect on the success of ECOC. Table 8 shows the results of five random assignments of codewords to classes. There is no statistically significant variation in the performance of the different random assignments. This is consistent with similar experiments reported by Bakiri (1991).

Effect of Tree Pruning

Pruning of decision trees is an important technique for preventing overfitting. However, the merit of pruning varies from one domain to another. Figure 6 shows the change in performance due to pruning in each of the eight domains and for each of the three configurations studied in this paper: multiclass, one-per-class, and error-correcting output coding.

From the figure, we see that in most cases pruning makes no statistically significant difference in performance, aside from the POS task, where it decreases the performance of all three configurations. Aside from POS, only one of the statistically significant changes involves the ECOC configuration, while two affect the one-per-class configuration and one affects the multiclass configuration. These data suggest that pruning only occasionally has a major effect on any of these configurations. There is no evidence to suggest that pruning affects one configuration more than another.

Class Probability Estimates

In many applications, it is important to have a classifier that can not only classify new cases well but also estimate the probability that a new case belongs to each of the k classes. For example, in medical diagnosis, a simple classifier might classify a patient as healthy, because, given the input features, that is the most likely class. However, if there is a non-zero probability that the patient has a life-threatening disease, the right choice for the physician may still be to prescribe a therapy for that disease.

A more mundane example involves automated reading of handwritten postal codes on envelopes. If the classifier is very confident of its classification (i.e., because the estimated probabilities are very strong), then it can proceed to route the envelope. However, if it is uncertain, then the envelope should be rejected and sent to a human being, who can attempt to read the postal code and process the envelope (Wilkinson, Geist, Janet, et al., 1992).

[Figure 6: bar chart over the domains Glass, Vowel, POS, Soybean, Audiology, ISOLET, Letter, and NETtalk; vertical axis: change in performance relative to no pruning (percentage points); bars: C4.5 multiclass, C4.5 one-per-class, and C4.5 ECOC.]

Figure 6: Change (in percentage points) in the performance of C4.5 with and without pruning in three configurations. The horizontal line indicates performance with no pruning. An asterisk indicates that the difference is statistically significant.

compute a rejection curve When the learning algorithm classies a new case we require

it to also output a condence level Then we plot a curve showing the p ercentage of

correctly classied test cases whose condence level exceeds a given value A rejection curve

that increases smo othly demonstrates that the condence level pro duced by the algorithm

can b e transformed into an accurate probability measure

For one-per-class neural networks, many researchers have found that the difference in activity between the class with the highest activity and the class with the second-highest activity is a good measure of confidence (e.g., LeCun et al., 1989). If this difference is large, then the chosen class is clearly much better than the others. If the difference is small, then the chosen class is nearly tied with another class. This same measure can be applied to the class probability estimates produced by C4.5.

An analogous measure of confidence for error-correcting output codes can be computed from the L1 distance between the vector B of output probabilities (for each bit) and the codewords of each of the classes. Specifically, we employ the difference between the L1 distance to the second-nearest codeword and the L1 distance to the nearest codeword as our confidence measure. If this difference is large, an algorithm can be quite confident of its classification decision. If the difference is small, the algorithm is not confident.
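This confidence measure is a one-line extension of the L1 decoding rule sketched earlier; the following illustration (with invented codewords and probabilities) returns the gap between the two smallest codeword distances.

```python
import numpy as np

def ecoc_confidence(bit_probs, codewords):
    """Confidence measure described above: the difference between the L1
    distance to the second-nearest codeword and the distance to the nearest."""
    B = np.asarray(bit_probs, dtype=float)
    dists = np.sort(np.abs(np.asarray(codewords, dtype=float) - B).sum(axis=1))
    return dists[1] - dists[0]

codewords = [[1, 1, 0, 0, 1],
             [0, 1, 1, 0, 0],
             [1, 0, 1, 1, 0]]
print(ecoc_confidence([0.9, 0.2, 0.7, 0.6, 0.1], codewords))   # 1.6: a confident decision
print(ecoc_confidence([0.5, 0.5, 0.5, 0.5, 0.5], codewords))   # 0.0: maximally uncertain
```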

Figure 7 compares the rejection curves for various configurations of C4.5 and backpropagation on the NETtalk task. These curves are constructed by first running all of the test examples through the learned decision trees and computing the predicted class of each example and the confidence value for that prediction. To generate each point along the curve, a value is chosen for a parameter θ, which defines the minimum required confidence. The classified test examples are then processed to determine the percentage of test examples whose confidence level is less than θ (these are rejected) and the percentage of the remaining examples that are correctly classified. The value of θ is progressively incremented, starting at 0, until all test examples are rejected.
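The curve construction just described can be sketched as follows (an illustration with made-up confidences and correctness flags, not the paper's data): for each threshold, examples below the threshold are rejected and accuracy is measured on the rest.

```python
import numpy as np

def rejection_curve(correct, confidence, thresholds):
    """For each threshold theta, reject every test example whose confidence is
    below theta and report (percent rejected, percent correct among the rest).
    `correct` is a boolean array; `confidence` holds per-example confidences."""
    correct = np.asarray(correct, dtype=bool)
    confidence = np.asarray(confidence, dtype=float)
    points = []
    for theta in thresholds:
        keep = confidence >= theta
        if keep.sum() == 0:
            break                                   # all examples rejected
        rejected = 100.0 * (1 - keep.mean())
        accuracy = 100.0 * correct[keep].mean()
        points.append((rejected, accuracy))
    return points

# Tiny made-up example: five test cases with confidences and correctness flags.
conf = [0.9, 0.1, 0.6, 0.3, 0.8]
corr = [True, False, True, False, True]
for r, a in rejection_curve(corr, conf, thresholds=np.linspace(0.0, 1.0, 6)):
    print(f"reject {r:5.1f}% -> {a:5.1f}% correct")
```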

The lower left portion of the curve shows the performance of the algorithm when θ is small, so only the least confident cases are rejected. The upper right portion of the curve shows the performance when θ is large, so only the most confident cases are classified. Good class probability estimates produce a curve that rises smoothly and monotonically. A flat or decreasing region in a rejection curve reveals cases where the confidence estimate of the learning algorithm is unrelated, or inversely related, to the actual performance of the algorithm.

The rejection curves often terminate prior to rejecting 100% of the examples. This occurs when the final increment in θ causes all examples to be rejected. This gives some idea of the number of examples for which the algorithm was highly confident of its classifications. If the curve terminates early, this shows that there were very few examples that the algorithm could confidently classify.

In Figure 7, we see that, with the exception of the multiclass configuration, the rejection curves for all of the various configurations of C4.5 increase fairly smoothly, so all of them are producing acceptable confidence estimates. The two error-correcting configurations have smooth curves that remain above all of the other configurations. This shows that the performance advantage of error-correcting output coding is maintained at all confidence levels: ECOC improves classification decisions on all examples, not just the borderline ones.

[Figure 7: rejection curves (percent correct versus percent rejected) on the NETtalk task; C4.5 panel: 61-bit ECOC, 159-bit ECOC, OPC, Distributed, and Multiclass; backpropagation panel: 61-bit ECOC, 159-bit ECOC, OPC, and Distributed.]

Figure 7: Rejection curves for various configurations of C4.5 and backpropagation on the NETtalk task. The "Distributed" curve plots the behavior of the Sejnowski-Rosenberg distributed representation.

Similar behavior is seen in the rejection curves for backpropagation. Again, all configurations of backpropagation give fairly smooth rejection curves. However, note that the longer, 159-bit code actually decreases at high rejection rates. By contrast, the 61-bit code gives a monotonic curve that eventually reaches 100%. We have seen this behavior in several of the cases we have studied: extremely long error-correcting codes are usually the best method at low rejection rates, but at high rejection rates, codes of intermediate length typically behave better. We have no explanation for this behavior.

Figure 8 compares the rejection curves for various configurations of C4.5 and backpropagation on the ISOLET task. Here we see that the ECOC approach is markedly superior to either the one-per-class or multiclass approaches. This figure illustrates another phenomenon we have frequently observed: the curve for multiclass C4.5 becomes quite flat and terminates very early, and the one-per-class curve eventually surpasses it. This suggests that there may be opportunities to improve the class probability estimates produced by C4.5 on multiclass trees. (Note that we employed softened thresholds in these experiments.) In the backpropagation rejection curves, the ECOC approach consistently outperforms the one-per-class approach until both are very close to 100% correct. Note that both configurations of backpropagation can confidently classify a large fraction of the test examples with very high accuracy.

From these graphs, it is clear that the error-correcting approach, with codes of intermediate length, can provide confidence estimates that are at least as good as those provided by the standard approaches to multiclass problems.

[Figure 8: rejection curves (percent correct versus percent rejected) for C4.5 and backpropagation on the ISOLET task; curves include 107-bit, 45-bit, and 30-bit ECOC, multiclass, and one-per-class configurations.]

Figure 8: Rejection curves for various configurations of C4.5 and backpropagation on the ISOLET task.

Conclusions

In this paper, we experimentally compared four approaches to multiclass learning problems: multiclass decision trees, the one-per-class (OPC) approach, the meaningful distributed output approach, and the error-correcting output coding (ECOC) approach. The results clearly show that the ECOC approach is superior to the other three approaches. The improvements provided by the ECOC approach can be quite substantial: improvements on the order of ten percentage points were observed in several domains. Statistically significant improvements were observed in six of eight domains with decision trees and three of five domains with backpropagation.

The improvements were also robust:

- ECOC improves both decision trees and neural networks,
- ECOC provides improvements even with very small sample sizes, and
- the improvements do not depend on the particular assignment of codewords to classes.

The error-correcting approach can also provide estimates of the confidence of classification decisions that are at least as accurate as those provided by existing methods.

There are some additional costs to employing error-correcting output codes. Decision trees learned using ECOC are generally much larger and more complex than trees constructed using the one-per-class or multiclass approaches. Neural networks learned using ECOC often require more hidden units and longer and more careful training to obtain the improved performance (see the discussion of robustness above). These factors may argue against using error-correcting output coding in some domains. For example, in domains where it is important for humans to understand and interpret the induced decision trees, ECOC methods are not appropriate, because they produce such complex trees. In domains where training must be rapid and completely autonomous, ECOC methods with backpropagation cannot be recommended, because of the potential for encountering difficulties during training.


Finally, we found that error-correcting codes of intermediate length tend to give better confidence estimates than very long error-correcting codes, even though the very long codes give the best generalization performance.

There are many open problems that require further research. First and foremost, it is important to obtain a deeper understanding of why the ECOC method works. If we assume that each of the learned hypotheses makes classification errors independently, then coding theory provides the explanation: individual errors can be corrected because the codewords are far apart in the output space. However, because each of the hypotheses is learned using the same algorithm on the same training data, we would expect that the errors made by individual hypotheses would be highly correlated, and such errors cannot be corrected by an error-correcting code. So the key open problem is to understand why the classification errors at different bit positions are fairly independent. How does the error-correcting output code result in this independence?

A closely related open problem concerns the relationship between the ECOC approach and various ensemble, committee, and boosting methods (Perrone & Cooper, 1993; Schapire, 1990; Freund, 1992). These methods construct multiple hypotheses, which then vote to determine the classification of an example. An error-correcting code can also be viewed as a very compact form of voting in which a certain number of incorrect votes can be corrected. An interesting difference between standard ensemble methods and the ECOC approach is that in the ensemble methods, each hypothesis is attempting to predict the same function, whereas in the ECOC approach, each hypothesis predicts a different function. This may reduce the correlations between the hypotheses and make them more effective voters. Much more work is needed to explore this relationship.

Another open question concerns the relationship between the ECOC approach and the flexible discriminant analysis technique of Hastie, Tibshirani, and Buja (in press). Their method first employs the one-per-class approach (e.g., with neural networks) and then applies a kind of discriminant analysis to the outputs. This discriminant analysis maps the outputs into a lower-dimensional space such that each class has a defined center point. New cases are classified by mapping them into this space and then finding the nearest center point and its class. These center points are similar to our codewords, but in a continuous space. It may be that the ECOC method is a kind of randomized, higher-dimensional variant of this approach.

Finally, the ECOC approach shows promise of scaling neural networks to very large classification problems, with hundreds or thousands of classes, much better than the one-per-class method. This is because a good error-correcting code can have a length n that is much less than the total number of classes, whereas the one-per-class approach requires that there be one output unit for each class. Networks with thousands of output units would be expensive and difficult to train. Future studies should test the scaling ability of these different approaches on such large classification tasks.

Acknowledgements

The authors thank the anonymous reviewers for their valuable suggestions, which improved the presentation of the paper. The authors also thank Prasad Tadepalli for proofreading the final manuscript. The authors gratefully acknowledge the support of the National Science Foundation under its IRI and CDA grant programs. Bakiri also thanks Bahrain University for its support of his doctoral research.

References

Adaptive Solutions. CNAPS backpropagation guide. Tech. rep., Adaptive Solutions, Inc., Beaverton, OR.

Bakiri, G. (1991). Converting English text to speech: A machine learning approach. Tech. rep., Department of Computer Science, Oregon State University, Corvallis, OR.

Barnard, E., & Cole, R. A. (1989). A neural-net training program based on conjugate-gradient optimization. Tech. rep., Oregon Graduate Institute, Beaverton, OR.

Bose, R. C., & Ray-Chaudhuri, D. K. (1960). On a class of error-correcting binary group codes. Information and Control.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group.

Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Touretzky, D. S. (Ed.), Neural Information Processing Systems, Vol. 2. San Francisco, CA: Morgan Kaufmann.

Cardie, C. (1993). Using decision trees to improve case-based learning. In Proceedings of the Tenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.

Duda, R. O., Machanik, J. W., & Singleton, R. C. (1963). Function modeling experiments. Tech. rep., Stanford Research Institute.

Freund, Y. (1992). An improved boosting algorithm and its implications on learning complexity. In Proceedings of the Annual Workshop on Computational Learning Theory. New York, NY: ACM Press.

Hampshire II, J. B., & Waibel, A. H. (1990). A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Transactions on Neural Networks.

Hastie, T., Tibshirani, R., & Buja, A. (in press). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association.

Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence.

Hocquenghem, A. (1959). Codes correcteurs d'erreurs. Chiffres.


Kong, E. B., & Dietterich, T. G. (1995). Why error-correcting output coding works with decision trees. Tech. rep., Department of Computer Science, Oregon State University, Corvallis, OR.

Lang, K. J., Hinton, G. E., & Waibel, A. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation.

Murphy, P., & Aha, D. (1994). UCI repository of machine learning databases [machine-readable data repository]. Tech. rep., University of California, Irvine.

Natarajan, B. K. (1991). Machine Learning: A Theoretical Approach. San Mateo, CA: Morgan Kaufmann.

Nilsson, N. J. (1965). Learning Machines. New York: McGraw-Hill.

Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble methods for hybrid neural networks. In Mammone, R. J. (Ed.), Neural Networks for Speech and Image Processing. Chapman and Hall.

Peterson, W. W., & Weldon, Jr., E. J. (1972). Error-Correcting Codes. Cambridge, MA: MIT Press.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.

Richard, M. D., & Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning.

Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems.

Selman, B., Levesque, H., & Mitchell, D. (1992). A new method for solving hard satisfiability problems. In Proceedings of AAAI-92. AAAI/MIT Press.

Snedecor, G. W., & Cochran, W. G. (1989). Statistical Methods (Eighth edition). Ames, IA: Iowa State University Press.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM.


Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing.

Weigend, A. (1993). Measuring the effective number of dimensions during backpropagation training. In Proceedings of the Connectionist Models Summer School. San Francisco, CA: Morgan Kaufmann.

Wilkinson, R. A., Geist, J., Janet, S., et al. (1992). The first census optical character recognition systems conference. Tech. rep. NISTIR, National Institute of Standards and Technology.