Solving Multiclass Learning Problems Via Error-Correcting Output Codes
Journal of Artificial Intelligence Research 2 (1995) 263-286

Thomas G. Dietterich (tgd@cs.orst.edu)
Department of Computer Science, Dearborn Hall, Oregon State University, Corvallis, OR, USA

Ghulum Bakiri (ebis@acc.uob.bh)
Department of Computer Science, University of Bahrain, Isa Town, Bahrain

Abstract

Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k values (i.e., k classes). The definition is acquired by studying collections of training examples of the form ⟨x_i, f(x_i)⟩. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting-avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.

1. Introduction

The task of learning from examples is to find an approximate definition for an unknown function f(x) given training examples of the form ⟨x_i, f(x_i)⟩. For cases in which f takes only the values {0, 1} (binary functions), there are many algorithms available. For example, the decision-tree methods, such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984), can construct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991).

In many real-world learning tasks, however, the unknown function f often takes values from a discrete set of classes {c_1, ..., c_k}. For example, in medical diagnosis, the function might map a description of a patient to one of k possible diseases. In digit recognition (e.g., LeCun, Boser, Denker, Henderson, Howard, Hubbard, & Jackel, 1989), the function maps each hand-printed digit to one of k = 10 classes. Phoneme recognition systems (e.g., Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) typically classify a speech segment as one of several dozen phonemes.

Decision-tree algorithms can easily be generalized to handle these multiclass learning tasks. Each leaf of the decision tree can be labeled with one of the k classes, and internal nodes can be selected to discriminate among these classes. We will call this the direct multiclass approach.
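As a concrete illustration of the direct multiclass approach, the following sketch trains a single decision tree whose leaves are labeled directly with class identifiers. It uses scikit-learn's CART-style DecisionTreeClassifier rather than the C4.5 system discussed in this paper, and the tiny feature matrix and labels are invented purely for illustration.

    # Minimal sketch of the direct multiclass approach (assumes scikit-learn;
    # this is a CART-style learner, not the C4.5 system used in the paper).
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: four binary features per example, labels drawn from k = 3 classes.
    X_train = [[0, 1, 0, 1],
               [1, 0, 0, 0],
               [1, 1, 1, 0],
               [0, 0, 1, 1]]
    y_train = [0, 1, 2, 1]

    tree = DecisionTreeClassifier(criterion="entropy")  # information-gain splits
    tree.fit(X_train, y_train)
    print(tree.predict([[1, 0, 1, 0]]))  # the tree outputs one of the k classes directly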
Connectionist algorithms are more difficult to apply to multiclass problems. The standard approach is to learn k individual binary functions f_1, ..., f_k, one for each class. To assign a new case, x, to one of these classes, each of the f_i is evaluated on x, and x is assigned the class j of the function f_j that returns the highest activation (Nilsson, 1965). We will call this the one-per-class approach, since one binary function is learned for each class.

An alternative approach, explored by some researchers, is to employ a distributed output code. This approach was pioneered by Sejnowski and Rosenberg (1987) in their widely known NETtalk system. Each class is assigned a unique binary string of length n; we will refer to these strings as code words. Then n binary functions are learned, one for each bit position in these binary strings. During training for an example from class i, the desired outputs of these n binary functions are specified by the code word for class i. With artificial neural networks, these n functions can be implemented by the n output units of a single network.

New values of x are classified by evaluating each of the n binary functions to generate an n-bit string s. This string is then compared to each of the k code words, and x is assigned to the class whose code word is closest, according to some distance measure, to the generated string s.

As an example, consider Table 1, which shows a six-bit distributed code for a ten-class digit-recognition problem. Notice that each row is distinct, so that each class has a unique code word. As in most applications of distributed output codes, the bit positions (columns) have been chosen to be meaningful. Table 2 gives the meanings for the six columns. During learning, one binary function will be learned for each column. Notice that each column is also distinct and that each binary function to be learned is a disjunction of the original classes. For example, f_vl(x) = 1 exactly when f(x) is one of the digit classes whose code word contains a 1 in the vl position.

To classify a new hand-printed digit, x, the six functions f_vl, f_hl, f_dl, f_cc, f_ol, and f_or are evaluated to obtain a six-bit string. Then the distance of this string to each of the ten code words is computed, and x is assigned the class of the nearest code word according to the Hamming distance, which counts the number of bits that differ.

This process of mapping the output string to the nearest code word is identical to the decoding step for error-correcting codes (Bose & Ray-Chaudhuri, 1960; Hocquenghem, 1959). This suggests that there might be some advantage to employing error-correcting codes as a distributed representation. Indeed, the idea of employing error-correcting distributed representations can be traced to early research in machine learning (Duda, Machanik, & Singleton).

Table 1: A distributed code for the digit-recognition task. Each of the ten digit classes (0-9) is assigned a six-bit code word over the columns vl, hl, dl, cc, ol, and or.

Table 2: Meanings of the six columns for the code in Table 1.

    Column position    Abbreviation    Meaning
    1                  vl              contains vertical line
    2                  hl              contains horizontal line
    3                  dl              contains diagonal line
    4                  cc              contains closed curve
    5                  ol              contains curve open to left
    6                  or              contains curve open to right

Table 3: A 15-bit error-correcting output code for a ten-class problem. Each of the ten classes is assigned a 15-bit code word over the bit positions f_0 through f_14.

Table 3 shows a 15-bit error-correcting code for the digit-recognition task. Each class is represented by a code word drawn from an error-correcting code. As with the distributed encoding of Table 1, a separate Boolean function is learned for each bit position of the error-correcting code. To classify a new example x, each of the learned functions f_0(x), ..., f_14(x) is evaluated to produce a 15-bit string, which is then mapped to the nearest of the ten code words.
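The decoding step just described is nearest-code-word lookup under the Hamming distance. The following sketch, in plain Python, shows this decoding; the names codewords and bit_classifiers are illustrative placeholders (they assume one already-trained binary classifier per bit position), not names from the paper.

    # Minimal sketch of classification with a distributed (or error-correcting)
    # output code: evaluate each learned bit function, then decode the resulting
    # bit string to the nearest codeword in Hamming distance.
    def hamming_distance(a, b):
        # Number of bit positions in which two strings differ.
        return sum(x != y for x, y in zip(a, b))

    def ecoc_predict(x, bit_classifiers, codewords):
        s = [clf(x) for clf in bit_classifiers]                    # the n-bit output string
        distances = [hamming_distance(s, cw) for cw in codewords]
        return min(range(len(codewords)), key=distances.__getitem__)  # index of nearest class

    # Toy 3-class, 5-bit code with trivial stand-ins for the learned functions f_i.
    codewords = [[0, 0, 0, 0, 0],
                 [1, 1, 1, 0, 0],
                 [0, 0, 1, 1, 1]]
    bit_classifiers = [lambda x, i=i: x[i] for i in range(5)]
    print(ecoc_predict([1, 1, 0, 0, 0], bit_classifiers, codewords))  # -> 1 (nearest codeword)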
This code can correct up to three errors out of the 15 bits.

This error-correcting code approach suggests that we view machine learning as a kind of communications problem in which the identity of the correct output class for a new example is being transmitted over a channel. The channel consists of the input features, the training examples, and the learning algorithm. Because of errors introduced by the finite training sample, poor choice of input features, and flaws in the learning process, the class information is corrupted. By encoding the class in an error-correcting code and transmitting each bit separately (i.e., via a separate run of the learning algorithm), the system may be able to recover from the errors.

This perspective further suggests that the one-per-class and meaningful distributed output approaches will be inferior, because their output representations do not constitute robust error-correcting codes. A measure of the quality of an error-correcting code is the minimum Hamming distance between any pair of code words. If the minimum Hamming distance is d, then the code can correct at least ⌊(d - 1)/2⌋ single-bit errors. This is because each single-bit error moves us one unit away from the true code word (in Hamming distance); if we make only ⌊(d - 1)/2⌋ errors, the nearest code word will still be the correct code word. The code of Table 3 has minimum Hamming distance seven, and hence it can correct errors in any three bit positions. The Hamming distance between any two code words in the one-per-class code is two, so the one-per-class encoding of the k output classes cannot correct any errors.

The minimum Hamming distance between pairs of code words in a meaningful distributed representation tends to be very low. For example, in Table 1, the Hamming distance between the code words of two of the classes is only one. In these kinds of codes, new columns are often introduced to discriminate between only two classes. Those two classes will therefore differ in only one bit position, so the Hamming distance between their output representations will be one. This is also true of the distributed representation developed by Sejnowski and Rosenberg (1987).
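The quality measure just described, the minimum pairwise Hamming distance d of a code and the resulting bound of ⌊(d - 1)/2⌋ correctable bits, is easy to compute. The sketch below uses the same illustrative list-of-codewords representation as above; the small example codes are invented and are not the codes of Tables 1 or 3.

    # Minimal sketch: minimum pairwise Hamming distance of a code and the number
    # of single-bit errors it can correct, floor((d - 1) / 2).
    from itertools import combinations

    def hamming_distance(a, b):
        return sum(x != y for x, y in zip(a, b))

    def min_distance(codewords):
        return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

    def correctable_errors(codewords):
        return (min_distance(codewords) - 1) // 2

    # One-per-class code for k = 4: every pair of codewords differs in exactly
    # two bits, so no errors can be corrected.
    one_per_class = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(min_distance(one_per_class), correctable_errors(one_per_class))  # 2 0

    # A code with minimum distance 3 corrects any single-bit error.
    ecc = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1], [1, 1, 0, 1, 1, 1]]
    print(min_distance(ecc), correctable_errors(ecc))  # 3 1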