Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization

Hwee Tou Ng (DSO National Laboratories, Science Park Drive, Singapore; nhweetou@dso.org.sg)
Wei Boon Goh (Ministry of Defence, Gombak Drive, Singapore)
Kok Leong Low (Ministry of Defence, Gombak Drive, Singapore)

Abstract

In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric called the correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection. In particular, our new feature selection method yields considerable improvement.

We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to that of a rule-based expert system approach that uses a text categorization shell built by Carnegie Group. Although our automated learning approach still gives a lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined semi-automated approach yields accuracy close to the rule-based approach.

1 Introduction

We live in a world of information explosion. The phenomenal growth of the Internet has resulted in the availability of huge amounts of online information. Much of this information is in the form of natural language texts. Hence, the ability to catalog and organize textual information automatically by computers is highly desirable. In particular, a computer system that can categorize real-world, unrestricted English texts into a predefined set of categories would be most useful.

In this paper, we present an automated learning approach to building a robust, efficient, and practical text categorization system called Classi, using the perceptron learning algorithm. We also describe a new feature selection metric called the correlation coefficient, which yields considerable improvement in categorization accuracy. When tested on the standard Reuters text categorization collection, our approach outperforms the best published results on this Reuters corpus.

We also conducted a usability case study by comparing the performance of such an automated learning approach with the more traditional rule-based expert system approach to building text categorization systems. In the rule-based expert system approach, the developer of the system manually codes up a set of rules to categorize texts. In contrast, our learning approach alleviates the knowledge acquisition bottleneck inherent in a rule-based approach.

For comparison, we use an existing text categorization system, Tcs, developed using a text categorization shell built by Carnegie Group (Hayes et al., 1990). The input to Tcs consists of newswire articles, and the output categories form a tree. Our evaluation indicates that a completely automated learning approach still gives lower accuracy. However, by manually modifying and augmenting the set of words to be used as features in a topic categorizer, we achieve accuracy very close to the manual rule-based approach. This suggests that, at present, a semi-automated approach is perhaps the best way to build a high-performance text categorization system.

The rest of this paper is organized as follows. Section 2 gives a description of the text categorization task. Section 3 discusses the text representation and feature selection metrics used. Section 4 describes the perceptron algorithm used. Section 5 presents the empirical results achieved by our approach on the standard Reuters corpus. Section 6 describes the case study conducted to compare our automated learning approach with Tcs. This is followed by Section 7 on related work, and Section 8 gives the conclusion.

2 Task Description

The input to our text categorization system, called Classi (CLASsification System for Information), consists of unrestricted English texts. The system is also given a set of predefined categories. There is no restriction on what can form a category. For example, a category can be about a particular country (like USA or Japan) or a particular subject topic (like economics, politics, etc.). One text can belong to more than one category if it mentions multiple topics, as may happen in a long text. Also, the categories need not be exhaustive: some texts may belong to none of the predefined set of categories.

Unlike most existing work on text categorization, we allow the categories to form a tree. We will describe in greater detail how hierarchical categorization is achieved when we discuss the usability case study in Section 6.

Given an input text, a text categorization system assigns zero, one, or more categories to the text.

3 Text Representation and Feature Selection

To use an automated learning approach, we first need to transform a text into a feature vector representation. This transformation process requires the appropriate choice of features to use in a feature vector. These feature vectors form the training examples. Feature vectors derived from the relevant texts of a category C form the positive training examples for the category, while feature vectors derived from the irrelevant texts of category C form the negative examples. Next, an automated learning algorithm learns the necessary association knowledge from the training examples to build a classifier for each category C. In this section, we focus on the text representation and feature selection issues, while the next section discusses the perceptron learning algorithm used.

We use single words as the basic units to represent text. A word is defined as a contiguous string of characters delimited by spaces. Specifically, each text is preprocessed in the following steps:

1. Punctuation marks are separated from words.
2. Numbers and punctuation marks are removed.
3. All words are converted to lower case.
4. Words like prepositions, conjunctions, auxiliary verbs, etc. are removed. These stopwords are those given in Lewis (1992).
5. Each word is replaced by its morphological root form. For example, plural nouns like "interests" are replaced with the singular form "interest"; inflectional verb forms like "ate", "eaten", "eating", etc. are replaced with the infinitive form "eat"; and so on.

We use the morphological routines from WordNet (Miller, 1990) to convert each word into its morphological root form. The preprocessing stage of Classi is fast on a Pentium personal computer.
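As an illustration, the following sketch runs these five steps in Python. NLTK's English stopword list and its WordNet interface are stand-ins here, not the paper's own resources: the paper uses the stopword list given in Lewis (1992) and WordNet's morphology routines directly, so outputs may differ slightly.

    # Requires: nltk.download('stopwords'); nltk.download('wordnet')
    import re
    from nltk.corpus import stopwords, wordnet

    STOPWORDS = set(stopwords.words("english"))  # stand-in for Lewis's list

    def preprocess(text):
        # Steps 1 and 2: separate punctuation from words, then drop
        # numbers and punctuation (only alphabetic strings survive).
        tokens = re.findall(r"[A-Za-z]+", text)
        words = []
        for token in tokens:
            token = token.lower()          # step 3: lower case
            if token in STOPWORDS:         # step 4: remove stopwords
                continue
            # Step 5: replace a word by its morphological root form,
            # e.g. "interests" -> "interest".
            root = wordnet.morphy(token)
            words.append(root if root is not None else token)
        return words

    print(preprocess("The interests ate 3 apples, quickly!"))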

The remaining words after preprocessing are potential candidates for use as features, with each word as one feature in the feature vectors. Feature selection refers to the process of choosing a subset of these remaining words to use as features to form the training examples.

Previous research on text categorization (Apte et al., 1994) suggests two possible ways in which the words to be used as features can originate: from the relevant texts only (local dictionary), or from both the relevant and irrelevant texts (universal dictionary). Apte et al. reported results indicating that a local dictionary gives better performance. In our work, we also found that a local dictionary gives better accuracy. In particular, results from our case study show that a local dictionary gives considerably higher performance (see Section 6). Hence, we use only the local dictionary method on the Reuters corpus.

We also require a word to occur at least five times in the training texts to be chosen as a feature. This measure is quite widely used, for example in the work of Hearst et al. (1996) and Lewis and Ringuette (1994). It is beneficial since infrequent words are not reliable indicators for use as features.

After words are chosen according to the local dictionary method, and after eliminating words with infrequent occurrence, we select a set of n features by applying a feature selection metric. The features chosen are the top n features with the highest feature selection metric score. We experimented with three feature selection metrics: correlation coefficient, χ², and frequency.

We define the correlation coefficient C of a word w as

    C = \frac{\sqrt{N}\,(N_{r+} N_{n-} - N_{n+} N_{r-})}{\sqrt{(N_{r+} + N_{n+})(N_{r-} + N_{n-})(N_{r+} + N_{r-})(N_{n+} + N_{n-})}}

where N_{r+} (N_{n+}) is the number of relevant (non-relevant) texts in which the word w occurs, N_{r-} (N_{n-}) is the number of relevant (non-relevant) texts in which the word w does not occur, and N is the total number of training texts.

Our correlation coefficient is a variant of the χ² metric used in Schutze et al. (1995), where C² = χ². C can be viewed as a "one-sided" χ² metric. The rationale behind the use of our new correlation coefficient C is related to the finding that a local dictionary yields a better set of features, as reported in Apte et al. (1994) and confirmed in our own work. That is, we are looking for words that come only from the relevant texts of a category C and are indicative of membership in C. Words that come from the irrelevant texts, or that are highly indicative of non-membership in C, are not as useful. The correlation coefficient C selects exactly those words that are highly indicative of membership in a category, whereas the χ² metric will pick out not only this set of words but also those words that are indicative of non-membership in the category. Our empirical results suggest that using words which come from the relevant texts and are indicative of membership in a category is better than using words that are indicative of membership as well as non-membership of a category.
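A direct implementation of this formula is straightforward. The following sketch computes C from the four contingency counts defined above; the function and variable names are ours, not from the paper.

    from math import sqrt

    def correlation_coefficient(n_r_pos, n_n_pos, n_r_neg, n_n_neg):
        """C for a word w and a category, from the 2x2 contingency counts:
        n_r_pos / n_n_pos: relevant / non-relevant texts containing w,
        n_r_neg / n_n_neg: relevant / non-relevant texts without w.
        C**2 equals the chi-square statistic, but C keeps its sign, so
        only words positively correlated with membership score highly."""
        n = n_r_pos + n_n_pos + n_r_neg + n_n_neg
        num = sqrt(n) * (n_r_pos * n_n_neg - n_n_pos * n_r_neg)
        den = sqrt((n_r_pos + n_n_pos) * (n_r_neg + n_n_neg)
                   * (n_r_pos + n_r_neg) * (n_n_pos + n_n_neg))
        return num / den if den > 0 else 0.0

    # A word occurring in most relevant texts and few non-relevant ones:
    print(correlation_coefficient(80, 5, 20, 95))   # large positive C
    # A word indicative of NON-membership scores strongly negative,
    # so it is never among the top-n features under C (unlike chi-square):
    print(correlation_coefficient(5, 80, 95, 20))   # large negative C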

The third feature selection metric we experimented with is frequency, which selects the words that occur most frequently in the training texts to use as features.

The value of a feature in a feature vector is the normalized frequency of the corresponding word in the training text. We use normalized frequencies so that training texts of different lengths contribute equally during training. Let t_1, t_2, ..., t_n be the set of words chosen as features when building the classifier for a category C. Then

    ((t_1, f_1), (t_2, f_2), ..., (t_n, f_n))

is the derived training example, where f_i is a normalized frequency.
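As an illustration, the following sketch derives such a training example from a preprocessed text. Dividing each word count by the text length is an assumed normalization; any per-text normalization that lets texts of different lengths contribute equally would fit the description above.

    from collections import Counter

    def feature_vector(words, features):
        """Map a preprocessed text (a list of words) to the training-example
        form ((t1, f1), ..., (tn, fn)) for the chosen feature words t_i.
        Each f_i is a normalized frequency of t_i in the text."""
        counts = Counter(words)
        total = max(len(words), 1)   # guard against empty texts
        return [(t, counts[t] / total) for t in features]

    features = ["interest", "rate", "bank"]
    print(feature_vector(["interest", "rate", "cut", "interest"], features))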

When training the classifier for a category C, we use all the texts in the training corpus that belong to C as the positive training texts. On the other hand, it is often the case that there are many non-relevant texts not belonging to category C in the training corpus. We employ the same technique described in Hearst et al. (1996) to select a subset of the non-relevant texts to use as the negative training texts. These texts are the "most relevant non-relevant" texts. First, we form the vector sum of all the positive training vectors. Then the negative training vectors are ranked by their dot product score with this positive aggregate vector. The higher the dot product score, the more relevant the negative text is to the category.
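The following sketch shows this ranking step. The function name and the NumPy array representation are ours; the paper attributes the technique itself to Hearst et al.

    import numpy as np

    def select_negative_texts(pos_vectors, neg_vectors, k):
        """Pick the k 'most relevant non-relevant' training texts by
        ranking negative vectors on their dot product with the sum of
        all positive training vectors (one row per text)."""
        aggregate = np.sum(pos_vectors, axis=0)   # positive aggregate vector
        scores = neg_vectors @ aggregate          # dot product per negative text
        ranked = np.argsort(-scores)              # highest score first
        return ranked[:k]                         # indices of chosen negatives

    pos = np.array([[0.2, 0.1, 0.0], [0.3, 0.0, 0.1]])
    neg = np.array([[0.1, 0.0, 0.4], [0.25, 0.05, 0.0], [0.0, 0.0, 0.3]])
    print(select_negative_texts(pos, neg, k=2))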

4 Perceptron Learning

Having chosen a (word, frequency) list representation of text, we now consider the task of building a classifier for a category C. Let t_1, t_2, ..., t_n be the set of words chosen as features based on a set of training texts for C, as described in the last section. Given a new text T to be classified, Classi first preprocesses the text as described earlier. The feature vector representation of the new text T is

    ((t_1, f_1), (t_2, f_2), ..., (t_n, f_n))

where f_1, ..., f_n are the normalized frequencies of the word occurrences in T.

Our classifier arrives at a classification decision by finding an appropriate set of real-valued weights w_0, w_1, ..., w_n such that

    \sum_{j=0}^{n} w_j f_j > 0

if and only if T belongs to category C. We let f_0 = 1 in computing the weighted sum of frequencies.

As our classifier arrives at a decision by taking a linearly weighted summation, it functions as a linear threshold unit (LTU); that is, our classifier is a linear classifier. The perceptron learning algorithm (PLA) (Rosenblatt, 1958) is a well-known algorithm for learning such a set of weights for an LTU, and we use this algorithm in Classi.

Let E_1, ..., E_l be the positive examples derived from the relevant texts of category C, and let E_{l+1}, ..., E_m be the negative examples derived from the non-relevant texts not of category C. Let E_i be of the form

    ((t_1, f_{i1}), (t_2, f_{i2}), ..., (t_n, f_{in}))

for 1 <= i <= m. We set f_{i0} = 1. The goal of the perceptron learning algorithm is to find a set of weights w_0, w_1, ..., w_n such that

    \sum_{j=0}^{n} w_j f_{ij} > 0   for 1 <= i <= l

and

    \sum_{j=0}^{n} w_j f_{ij} <= 0   for l+1 <= i <= m.

Table 1: The perceptron learning algorithm

1. Initialize the weights W = (w_0, w_1, ..., w_n) to random real values.
2. Compute the weighted sum of frequencies \sum_{j=0}^{n} w_j f_{ij} for all training examples E_i.
3. If all positive examples have a nonnegative sum and all negative examples have a negative sum, then output the weights and stop. Otherwise, compute the vector sum S of the misclassified examples: if E_i is a positive example that is misclassified as negative, then S := S + (1, f_{i1}, ..., f_{in}); conversely, if E_i is a negative example that is misclassified as positive, then S := S - (1, f_{i1}, ..., f_{in}).
4. Update the weights as W := W + ηS, where η is a constant scale factor, and go to step 2.
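A minimal Python transcription of Table 1 follows. The learning rate and iteration cap are placeholders (the paper fixes both constants, but their values are not reproduced here), and positives are required to have a strictly positive sum with negatives non-positive, following the goal inequalities above.

    import numpy as np

    def perceptron(examples, labels, eta=0.1, max_iters=1000):
        """examples: m x n array of normalized frequencies;
        labels: +1 for positive examples, -1 for negative ones."""
        rng = np.random.default_rng(0)
        m, n = examples.shape
        # Prepend f_0 = 1 to every example so that w_0 acts as a threshold.
        x = np.hstack([np.ones((m, 1)), examples])
        w = rng.uniform(-1.0, 1.0, n + 1)            # step 1: random weights
        for _ in range(max_iters):
            sums = x @ w                              # step 2: weighted sums
            misclassified = (labels > 0) != (sums > 0)
            if not misclassified.any():
                return w                              # step 3: all correct, stop
            # Vector sum S of misclassified examples, signed by their label.
            s = (labels[misclassified, None] * x[misclassified]).sum(axis=0)
            w = w + eta * s                           # step 4: update, repeat
        return w                                      # iteration cap reached

    x = np.array([[0.5, 0.0], [0.4, 0.1], [0.0, 0.6], [0.1, 0.5]])
    y = np.array([1, 1, -1, -1])
    w = perceptron(x, y)
    print(np.hstack([np.ones((4, 1)), x]) @ w > 0)    # [True True False False]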

Although PLA is a well-known algorithm, for completeness' sake we give a formal description of PLA in Table 1. PLA is essentially a hill-climbing, gradient-descent search algorithm. It starts with a random set of weights and iteratively refines the weights to minimize the number of misclassified examples. η is a constant that controls the learning rate; we fixed η in all our evaluation runs. Since PLA may not converge, or may converge too slowly in practice, we also fixed a maximum number of iterations in all our evaluation runs.

Note that, as part of the process of deciding whether a new text belongs to a category, the classifier computes a weighted sum \sum_{j=0}^{n} w_j f_j. This linearly weighted sum can be taken as a measure of the degree of membership of a text in a category. As such, besides being able to decide whether a new text belongs to a category, we can also use this weighted sum to rank a set of new texts from the most closely matching text to the least matching text.

We define recall R as the ratio of truly relevant texts that are classified by Classi as relevant, and precision P as the ratio of texts classified as relevant by Classi that are truly relevant.

5 Reuters Test Corpus

In order to compare the performance of Classi with other state-of-the-art text categorization systems, we tested Classi on a standard test collection for text categorization used in the literature. This collection of texts, known as Reuters-22173, consists of 22,173 Reuters newswire articles on financial topics (Lewis, 1992). The full corpus amounts to millions of words of text over a large set of categories. (The collection is available by anonymous ftp from ftp.cs.umass.edu in pub/doc/reuters, courtesy of Reuters, Carnegie Group, and David Lewis.)

We used the same training/testing set split as Apte et al. (1994). First, texts that were used as testing in a separate study are removed from consideration. Of the remaining texts, some do not have any category assigned to them. After ignoring such texts, the training set consists of the texts dated on or before April 7, 1987, and the testing set consists of the texts dated on or after April 8, 1987. As in Apte et al., we consider only the categories which occur more than once in the training texts.

For the Reuters corpus, we choose the features using the local dictionary method, which was found to yield better results on this corpus by Apte et al. (1994). That is, the words are taken only from the positive training texts. For each category C, we used all training texts belonging to C as the positive examples of C, and the top-ranked most relevant non-relevant texts as the negative examples of C. The technique used to pick the most relevant non-relevant texts is described in Section 3.

The number of features chosen is an important parameter that affects the performance of Classi on the Reuters corpus. We experimented with several different numbers of features, as listed in Table 2. We also show the effect of the three feature selection methods used. The accuracy measure in Table 2 is the micro-averaged breakeven point, the same measure as used in Lewis and Ringuette (1994). At a breakeven point, recall and precision are the same. Micro-averaging combines the recall and precision values of all the categories by summing the true positive, true negative, false positive, and false negative counts across all categories.
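A small sketch of micro-averaging follows, assuming per-category contingency counts are available. In the example, the two summed ratios coincide, which is exactly the breakeven condition.

    def micro_average(confusions):
        """Micro-averaged recall and precision: sum the per-category
        true-positive / false-positive / false-negative counts across
        all categories before forming the ratios.
        confusions: list of (tp, fp, fn) tuples, one per category."""
        tp = sum(c[0] for c in confusions)
        fp = sum(c[1] for c in confusions)
        fn = sum(c[2] for c in confusions)
        return tp / (tp + fn), tp / (tp + fp)   # (recall, precision)

    # Two categories; here recall == precision, i.e. a breakeven point.
    print(micro_average([(50, 10, 10), (20, 5, 5)]))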

From Table 2, it is evident that our new feature selection method based on the correlation coefficient consistently outperforms the χ² and frequency selection methods at all feature set sizes. Also, performance improves as more features are used. We plan to investigate more thoroughly the relationship between feature set size and performance, and to find the optimal feature set size at which performance peaks. As the number of features used has a significant impact on accuracy, it may be beneficial to apply cross-validation techniques like those of Kohavi and John (1995) to automatically determine the best number of features to use for each category from the training examples.

Previously published test results on this training/testing set split of the Reuters corpus include the system SWAP-1 of Apte et al. (1994), Ripper and Experts of Cohen and Singer (1996), and an implementation of Rocchio's algorithm (Rocchio, 1971) by Cohen and Singer. We list in Table 3 the best micro-averaged breakeven points achieved by Classi, Ripper, SWAP-1, Experts, and Rocchio. The accuracy figures listed are based on representations that do not give special treatment to the headlines of a text. Classi outperforms all previously published results on the Reuters corpus.

Wiener et al. (1995) also tested their neural network approach on the Reuters corpus. Although they reported breakeven-point results, the list of categories they consider is different from the categories reported here, and so the results are not directly comparable. For example, they consider categories like cbond, loan, ebond, gbond, tbill, and tbond, which are not among the categories considered in our present study (see the list of categories in Lewis (1992) from which the categories are chosen).

6 A Usability Case Study

To evaluate the usability of our automated learning approach, we also conducted a case study comparing the performance of our approach with an existing text categorization system. This system, called Tcs, was built using a text categorization shell developed by Carnegie Group (Hayes et al., 1990). Tcs was built using a rule-based expert system approach. The input to Tcs consists of daily newswire articles. The output categories of Tcs form the leaf nodes of a tree; a fragment of this hierarchical organization of the categories is shown in Figure 1. The first level denotes the division into various countries, such as USA, Japan, Australia, etc. The second level denotes the division into primary subject topics, such as economics, politics, etc. The third level denotes detailed subject topics, such as the subdivision of politics into law and political party. As an example, a text that talks about passing a legislative bill in the US Senate will come under the leaf category USA-politics-law. There are many leaf categories in Tcs, and it took a number of person-years to develop the rules needed for categorization in Tcs.

To achieve hierarchical categorization, Classi forms the internal, non-leaf categories. An internal non-leaf category denotes the union of all its children categories. Classi builds one classifier for each category (leaf and non-leaf node) in the tree. The output categories of an input text can be zero, one, or more leaf categories in the tree. When an input text is presented to Classi, it first checks, for each country at the top level, whether the text belongs to that country category. If not, the text does not belong to any category. However, if a text belongs to a country according to the classifier built for that country category, Classi then recursively checks for membership in the categories of the subtree rooted at that country category. If at any node it is determined that the input text does not belong to any of its children categories, then categorization stops for that branch of the recursion. The recursive process terminates at a leaf category.
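The following sketch illustrates this top-down recursion. The tree structure, the `accepts` predicate, and the keyword-matching toy classifier are hypothetical stand-ins for the per-node perceptron classifiers described above.

    class Node:
        def __init__(self, name, children=()):
            self.name, self.children = name, list(children)

    def categorize(node, text, accepts, assigned):
        # Descend into a subtree only if the classifier at its root
        # accepts the text; otherwise the whole branch is pruned.
        if not accepts(node.name, text):
            return
        if not node.children:           # leaf category reached: output it
            assigned.append(node.name)
            return
        for child in node.children:     # recursively check the subtree
            categorize(child, text, accepts, assigned)

    # Toy stand-in for the trained per-node classifiers:
    KEYWORDS = {
        "USA": ["senate"], "USA-politics": ["senate"],
        "USA-politics-law": ["bill"], "USA-politics-party": ["convention"],
        "USA-economics": ["trade"],
    }
    accepts = lambda cat, text: any(kw in text for kw in KEYWORDS[cat])

    tree = Node("USA", [
        Node("USA-economics"),
        Node("USA-politics", [Node("USA-politics-law"),
                              Node("USA-politics-party")]),
    ])
    assigned = []
    categorize(tree, "the senate passed a bill", accepts, assigned)
    print(assigned)                     # -> ['USA-politics-law']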

As the number of categories increases, this hierarchical approach to categorization is more efficient than a linear approach which considers every category sequentially. To our knowledge, no previous work on text categorization has dealt with a tree or hierarchy of categories.

The positive training texts of a leaf category C are the relevant texts of category C. For a non-leaf, internal category C, the positive training texts are taken from the descendant leaf node categories under C, with an equal number of texts from each descendant leaf node category. We used a fixed number of positive training texts for each category.

For the negative training texts of a category C, we used the positive training texts that belong to the sibling categories of C. For example, for the category USA-economics, we used texts belonging to USA-politics-law, USA-politics-party, etc. as the negative training texts. We used a fixed number of negative training texts for each category, selecting the most relevant non-relevant texts whenever more negative texts are available. The total size of our training corpus is modest, measured in megabytes. Since we do not have training texts whose assigned categories are verified by a human, the categories of the training texts that we use to train Classi are those assigned automatically by the Tcs system. Since these training text categories are only partially accurate, the use of such a noisy training corpus tends to lower the accuracy of Classi.

We used a different number of features in the training examples for the first, second, and third levels of the category tree, and only a small number of features for a country category, since it tends to have a smaller number of indicative features.

To compare the accuracy of Classi versus Tcs, we manually assigned the correct categories to a new, randomly chosen set of texts not used in the training corpus; this set serves as the test corpus.

Table 4 lists the performance figures of successive versions of Classi as compared to Tcs. The F-measure (van Rijsbergen, 1979) is defined as

    F = \frac{2PR}{P + R}

where P is the precision and R is the recall. We use the F-measure that gives equal weightage to both recall and precision.

Table 2: Effect of feature selection method and feature set size on breakeven point

    Feature selection metric  | ... features | ... features | ... features | ... features
    Correlation coefficient   |     ...      |     ...      |     ...      |     ...
    χ²                        |     ...      |     ...      |     ...      |     ...
    Frequency                 |     ...      |     ...      |     ...      |     ...

Table 3: Results on the Reuters test corpus

    System   | Option              | Breakeven point
    Classi   | ... features        |       ...
    Ripper   | ... negative tests  |       ...
    SWAP-1   | freq, ... features  |       ...
    Experts  | ... words           |       ...
    Rocchio  |                     |       ...

Figure 1: The tree of categories

    (root)
      USA
        economics
          communication
          industry
          ...
        politics
          law
          party
          ...
        ...
      Japan
        economics
        politics
        ...
      Australia
      ...

Table 4: Successive improvements to Classi and comparison with Tcs

    Method                                        | Recall | Precision | F-measure
    1. universal dictionary, frequency            |  ...   |    ...    |    ...
    2. local dictionary, frequency                |  ...   |    ...    |    ...
    3. local dictionary, χ²                       |  ...   |    ...    |    ...
    4. local dictionary, correlation coefficient  |  ...   |    ...    |    ...
    5. + features from children categories        |  ...   |    ...    |    ...
    6. + two-cycle generation                     |  ...   |    ...    |    ...
    7. + manual features                          |  ...   |    ...    |    ...
    Tcs                                           |  ...   |    ...    |    ...

Version 1 of Classi listed in Table 4 uses a universal dictionary, where the features come from both positive and negative training texts, and relies on frequency to select the features. Its F-measure performance is the lowest of all the versions. Versions 2 to 4 switch to using a local dictionary, where the features are taken from the positive training texts only. Versions 2, 3, and 4 use the frequency, χ², and correlation coefficient metrics, respectively, to select the features. Again, the correlation coefficient achieves higher accuracy compared with the χ² and frequency metrics.

In version 5 of Classi, we add the following to version 4: for the second-level categories (the general subject topics in the category tree), Classi uses the features from the leaf children categories of a subject category C as C's features. This results in a moderate improvement in accuracy.

From version 5, we added a new training method called "two-cycle generation" to yield version 6 of Classi. This new training method is related to the rationale of wanting to use only words that are indicative of membership of a category, and not words that are indicative of non-membership. This is the same rationale behind the use of the local dictionary and our correlation coefficient. Basically, in two-cycle generation, after a set of features is chosen and a classifier is formed by the perceptron learning algorithm, we discard the features with negative weights assigned by the perceptron learning algorithm. The remaining set of features then forms the final set of features used, and the perceptron algorithm is applied again to learn a new classifier. This two-cycle generation method yields considerable improvement in accuracy, as shown in Table 4.
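A sketch of two-cycle generation follows, reusing the perceptron() sketch from Section 4; the array layout and function names are ours.

    import numpy as np

    def two_cycle_train(examples, labels, features):
        """Cycle 1: train a perceptron and drop the features that received
        negative weights (they indicate non-membership).  Cycle 2: retrain
        on the surviving features only."""
        w = perceptron(examples, labels)
        keep = np.where(w[1:] >= 0)[0]        # w[0] is the threshold weight
        pruned_features = [features[j] for j in keep]
        w2 = perceptron(examples[:, keep], labels)
        return pruned_features, w2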

Up to version 6, we have applied completely automated learning techniques, with no special use of domain-specific or manual effort. The F-measure accuracy achieved is still substantially lower than the accuracy achieved by Tcs. We then decided to use some manual engineering effort. In particular, we incorporated a set of manually chosen words to use as features, but only for the country categories at the top level of the category tree. This manually chosen set of words is obtained by pruning some of the non-indicative words found by our automated learning method and by adding the words that were used in Tcs for the country categories. Using these manually chosen words as features, and the same set of training texts as in previous versions, we obtained version 7 of Classi. The performance of Classi now falls only slightly below the accuracy achieved by Tcs. This result is encouraging, especially considering that the training text used by Classi is noisy, since the training categories are assigned by Tcs and contain mistakes.

Thus, our case study suggests that, at present, a semi-automated approach is perhaps the best way to build a high-performance text categorization system. Existing learning methods are good at tuning a set of weights, compared with manual engineering of weights. However, feature selection methods are still not good enough, especially for training sets of smaller size, for a useful set of features to be selected.

The categorization speed of Classi is quite fast: a daily collection of newswire articles, amounting to more than a megabyte of text, takes only minutes to be categorized on a Pentium PC. Hence, Classi is a practical system that runs efficiently.

7 Related Work

Many learning methods have been applied to text categorization (Schutze et al., 1995), including decision rule induction (Apte et al., 1994), decision tree induction (Lewis and Ringuette, 1994), nearest neighbor algorithms (Masand et al., 1992), Bayesian classifiers (Lewis and Ringuette, 1994), discriminant analysis (Hull, 1994), neural networks (Schutze et al., 1995; Wiener et al., 1995; Lewis et al., 1996), etc.

Schutze et al. and Wiener et al. made use of nonlinear neural networks, but reported only a slight improvement in accuracy over the use of linear neural networks. The activation function used in the linear neural networks of Schutze et al. (1995), Wiener et al. (1995), and Lewis et al. (1996) is the sigmoid activation function, while our perceptron uses a stepwise activation function. To our knowledge, none of the previous work has used the perceptron learning algorithm in text categorization. We used the perceptron algorithm since it has been shown to achieve surprisingly high accuracy (Mooney et al., 1989), and it has a very fast training time, at least an order of magnitude faster than the backpropagation algorithm of nonlinear neural networks, making it a good choice for building a practical text categorization system. However, we do not claim that the perceptron algorithm is the best learning algorithm to use for text categorization. More experimentation needs to be done to evaluate the relative strengths of various learning algorithms.

Our new correlation coefficient is based on a variation of the χ² metric used in Schutze et al. (1995). To our knowledge, none of the previous work has adopted the use of such a correlation coefficient for feature selection in text categorization. It appears that the improvement resulting from the use of better feature selection methods is at least as significant as the improvement achieved from better learning algorithms.

Finally, no previous work has reported the comparative accuracy of a semi-automated learning approach against the manual rule-based expert system approach.

8 Conclusion

We have successfully built a robust, efficient, and practical text categorization system, Classi, using the perceptron learning algorithm. Our evaluation has shown that Classi outperforms existing approaches on the standard Reuters corpus. The use of a new correlation coefficient in feature selection results in considerable improvement in categorization performance. We also conducted a case study which indicates that a semi-automated approach can achieve categorization performance close to the manual expert system approach of building text categorization systems.

References

Chidanand Apte, Fred Damerau, and Sholom M. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, July 1994.

William W. Cohen and Yoram Singer. Context-sensitive learning methods for text categorization. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.

P. J. Hayes, P. M. Andersen, I. B. Nirenburg, and L. M. Schmandt. TCS: A shell for content-based text categorization. In Proceedings of the Sixth IEEE Conference on Artificial Intelligence Applications, 1990.

Marti Hearst, Jan Pedersen, Peter Pirolli, Hinrich Schutze, Gregory Grefenstette, and David Hull. Xerox TREC-4 site report. In Proceedings of the Fourth Text REtrieval Conference (TREC-4), 1996.

David Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.

Ron Kohavi and George H. John. Automatic parameter selection by minimizing estimated error. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.

David Lewis. Representation and Learning in Information Retrieval. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, 1992.

David Lewis and Marc Ringuette. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval, 1994.

David D. Lewis, Robert E. Schapire, James P. Callan, and Ron Papka. Training algorithms for linear text classifiers. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.

Brij Masand, Gordon Linoff, and David Waltz. Classifying news stories using memory based reasoning. In Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992.

George A. Miller. Five papers on WordNet. International Journal of Lexicography, 1990.

Raymond J. Mooney, Jude W. Shavlik, G. Towell, and A. Gove. An experimental comparison of symbolic and connectionist learning algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1989.

C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.

J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.

Hinrich Schutze, David A. Hull, and Jan O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.

Erik Wiener, Jan O. Pedersen, and Andreas S. Weigend. A neural network approach to topic spotting. In Symposium on Document Analysis and Information Retrieval, 1995.