
Journal of Artificial Intelligence Research 8 (1998) 129-164

Integrative Windowing

Johannes Fürnkranz  juffi@cs.cmu.edu

School of Computer Science

Carnegie Mellon University

Pittsburgh, PA 15213

Abstract

In this paper we re-investigate windowing for rule learning algorithms. We show that, contrary to previous results for decision tree learning, windowing can in fact achieve significant run-time gains in noise-free domains, and explain the different behavior of rule learning algorithms by the fact that they learn each rule independently. The main contribution of this paper is integrative windowing, a new type of algorithm that further exploits this property by integrating good rules into the final theory right after they have been discovered. Thus it avoids re-learning these rules in subsequent iterations of the windowing process. Experimental evidence in a variety of noise-free domains shows that integrative windowing can in fact achieve substantial run-time gains. Furthermore, we discuss the problem of noise in windowing and present an algorithm that is able to achieve run-time gains in a set of experiments in a simple domain with artificial noise.

1. Introduction

Windowing is a subsampling technique proposed by Quinlan (1983) for the ID3 decision tree learning algorithm. Its goal is to reduce the complexity of a learning problem by identifying an appropriate subset of the original data from which a theory of sufficient quality can be learned. For this purpose it maintains a subset of the available data, the so-called window, which is used as the training set for the learning algorithm. The window is initialized with a small random sample of the available data, and the learning algorithm induces a theory from this sample. This theory is then tested on the remaining examples. If the quality of the learned theory is not sufficient, the window is adjusted, usually by adding more examples from the training data, and a new theory is learned. This process is repeated until a theory of sufficient quality has been found.

There are at least three motivations for studying windowing techniques:

Memory Limitations: Almost all learning algorithms still require all training examples and all background knowledge to be in main memory. Although memory has become cheap and the capacity of the main memory of the available hardware platforms is increasing rapidly, there certainly are datasets too big to fit into the main memory of conventional computer systems.

Efficiency Gain: Learning time usually increases, most often super-linearly, with the complexity of a learning problem. Reducing this complexity may be necessary to make a learning problem tractable.

© 1998 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.


procedure Windowing(Examples, InitSize, MaxIncSize)
    Window := RandomSample(Examples, InitSize)
    Test := Examples \ Window
    repeat
        Theory := Induce(Window)
        NewWindow := ∅
        OldTest := ∅
        for Example ∈ Test
            Test := Test \ {Example}
            if Classify(Theory, Example) ≠ Class(Example)
                NewWindow := NewWindow ∪ {Example}
            else
                OldTest := OldTest ∪ {Example}
            if |NewWindow| = MaxIncSize
                exit for
        Test := Append(Test, OldTest)
        Window := Window ∪ NewWindow
    until NewWindow = ∅
    return(Theory)

Figure 1: The basic windowing algorithm

Accuracy Gain: It has been observed that windowing may also lead to an increase in predictive accuracy. A possible explanation for this phenomenon is that learning from a subset of examples may often result in a less overfitting theory.

In this paper, our major concern is the appropriateness of windowing techniques for increasing the efficiency of inductive rule learning algorithms, such as those using the popular separate-and-conquer (or covering) learning strategy (Michalski, 1980; Clark & Niblett, 1989; Quinlan, 1990; Fürnkranz, 1997). We will argue that windowing is more suitable for these algorithms than for divide-and-conquer decision tree learning (Quinlan, 1986) in Section 3. Thereafter we will introduce integrative windowing, a technique that exploits the fact that rule learning algorithms learn each rule independently. We will show that this method allows us to significantly improve the performance of windowing by integrating good rules learned from different iterations of the basic windowing procedure into the final theory (Section 4). While we have primarily worked with noise-free domains, Section 5 will discuss the problem of noise in windowing, as well as some preliminary work that shows how windowing techniques can be adapted for noisy domains. Parts of this work have previously appeared as Fürnkranz (1997c, 1997d, 1998a).

2. A Brief History of Windowing

Windowing dates back to early versions of the ID3 decision tree learning algorithm, where it was devised as an automated teaching procedure that allowed a preliminary version of ID3 to discover complete and consistent descriptions of various problems in a KRKN chess endgame (Quinlan, 1979). Figure 1 shows the basic windowing algorithm as described in Quinlan's subsequent seminal paper on ID3 (Quinlan, 1983). It starts by picking a random

sample of a user-settable size InitSize from the total set of Examples. These examples are used for inducing a theory with a given learning algorithm. This theory is then tested on the remaining examples, and all misclassified examples are removed from the test set and added to the window of the next iteration. In order to keep the size of the training set small, another parameter, MaxIncSize, controls the maximum number of examples that can be added to the training set in one iteration. If this number is reached, no further examples are tested, and the next theory is learned from the new training set. To make sure that all examples are tested in the first few iterations, the examples that have already been tested (OldTest) are Appended to the examples that have not yet been used, so that testing will start with new examples in the next iteration.
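The control flow of Figure 1 is easy to render in a few lines of Python. The following minimal sketch is only an illustration (the experiments below use a Common LISP implementation); induce(window) and classify(theory, x) stand for an arbitrary learning algorithm and its classification function, and examples are (features, label) pairs:

    import random

    def windowing(examples, induce, classify, init_size, max_inc_size):
        """Basic windowing: grow the window with misclassified examples."""
        examples = list(examples)
        random.shuffle(examples)
        window, test = examples[:init_size], examples[init_size:]
        while True:
            theory = induce(window)
            new_window, old_test = [], []
            for i, (x, label) in enumerate(test):
                if classify(theory, x) != label:
                    new_window.append((x, label))
                else:
                    old_test.append((x, label))
                if len(new_window) == max_inc_size:
                    # Append: untested examples come first in the next pass
                    test = test[i + 1:] + old_test
                    break
            else:
                test = old_test
            if not new_window:   # theory consistent with all tested examples
                return theory
            window = window + new_window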

Quinlan (1983) argued that windowing is necessary for ID3 to tackle very large classification problems. Nevertheless, windowing has not played a major role in machine learning research. One reason for this certainly is the rapid development of computer hardware, which made the motivation for windowing seem less compelling. However, a good deal of this lack of interest can be attributed to an empirical study (Wirth & Catlett, 1988) in which the authors studied windowing with ID3 in various domains and concluded that it cannot be recommended as a procedure for improving efficiency. The best results were achieved in noise-free domains, such as the Mushroom domain, where windowing was able to perform on the same level as simple ID3, while its performance in noisy domains was considerably worse.

Despite the discouraging experimental evidence of Wirth and Catlett (1988), Quinlan implemented a new version of windowing in the C4.5 learning algorithm (Quinlan, 1993). It differs from the simple windowing version originally proposed for ID3 (Quinlan, 1983) in several ways:

- While selecting examples, it takes care to make the class distribution as uniform as possible. This can lead to accuracy gains in domains with skewed distributions (Catlett, 1991b).

- It includes at least half of the misclassified examples in the next window, which supposedly guarantees faster convergence (fewer iterations) in noisy domains.

- It can stop before all examples are correctly classified if it appears that no further gains in accuracy are possible.

- C4.5's -t parameter, which invokes windowing, allows it to perform multiple runs of windowing and to select the best tree.

Nevertheless, windowing is arguably one of C4.5's least frequently used options.

Recent work in the areas of Knowledge Discovery in Databases (Kivinen & Mannila, 1994; Toivonen, 1996) and Intelligent Information Retrieval (Lewis & Catlett, 1994; Yang, 1996) has re-emphasized the importance of subsampling procedures for reducing both learning time and memory requirements. Thus the interest in windowing techniques has revived as well. We discuss some of the more recent approaches in a later section.

(Quinlan does not explicitly specify how this case should be handled, but we think it makes sense that way.)


[Figure: two runtime plots over training examples (×10³) in the Mushroom domain; left: tree learning and windowing (C4.5 vs. C4.5 -t 1), right: rule learning and windowing (DOS vs. DOS + Win), both in seconds]

Figure 2: Results for windowing with the decision tree learner C4.5 and a rule learner in the Mushroom domain

3. A Closer Look at Windowing

A Motivating Example

The motivation for our study came from a brief experiment in which we compared windowing with a decision tree algorithm to windowing with a rule learning algorithm in the noise-free Mushroom domain. This domain contains 8,124 examples represented by 22 symbolic attributes. The task is to discriminate between poisonous and edible mushrooms. Figure 2 shows the results of this experiment.

The left graph shows the run-time behavior over different training set sizes of C4.5 invoked with its default parameters versus C4.5 invoked with one pass of windowing (parameter setting -t 1). No significant differences can be observed, although windowing seems to eventually achieve a little runtime gain. The graph is quite similar to one resulting from experiments with ID3 and windowing shown by Wirth and Catlett (1988), so that we believe the differences between the original version of windowing and the one implemented in C4.5 are negligible in this domain. The results are also consistent with those of Quinlan (1993), who for this domain reported only minor runtime savings for windowing with appropriate parameter settings. In any case, it is obvious that the runtime of both C4.5 and C4.5 with windowing grows linearly with the number of training examples. (Such linear growth was already observed by Quinlan in some early experiments with windowing.)

On the other hand, the right half of Figure 2 depicts a similar experiment with windowing and DOS, a simple separate-and-conquer rule learning algorithm (the name stands for Dull Old Separate-and-conquer). DOS is basically identical to pFoil, a propositional version of Foil (Quinlan, 1990) that uses information gain as a search heuristic but no stopping criterion. Our implementation of windowing closely follows the algorithm depicted in Figure 1.


    rule                                    examples   perc.
    1. A=C ∧ B=D                                …        …
    2. C=E ∧ D=F                                …        …
    3. adjacent(A,E) ∧ adjacent(B,F)            …        …
    4. C=E ∧ A≠C                                …        …
    5. D=F ∧ B≠D                                …        …
    6. C=E ∧ ¬between(B,D,F)                    …        …
    7. D=F ∧ ¬between(A,C,E)                    …        …

Table 1: Coverage of a correct theory for the KRK domain

The results show a large difference between DOS without windowing, which again takes linear time, and DOS with windowing, which seems to take only constant time beyond a certain training set size. This surprising constant-time learning behavior can be explained by the fact that, if windowing is able to reliably discover a correct theory from a subset of the examples, it will never use more than this amount of examples for learning this theory. The remaining examples are only used for verifying that the theory has been correctly learned. These costs, albeit linear in principle, are negligible compared to the costs of learning the correct theory.

Windowing versus Random Subsampling

Of course, the intuitive arguments for the constant-time learning behavior of windowing could equally well apply to the decision tree learning method. However, the experimental evidence seems to show that in that case the learning time is at least linear in the number of examples. Why does windowing work better with rule learning algorithms than with decision tree learning algorithms?

To understand this behavior, let us take a closer look at the characteristics of windowing. Contrary to random subsampling, which will retain the distribution of examples of the original training set in the subsample, windowing purposefully tries to skew this example distribution by adding only examples that are misclassified by the current theory. As an example, consider the KRK illegality domain, where the task is to discriminate legal from illegal positions in a king-rook-king chess endgame, given features that compare the coordinates of two pieces and determine whether they are equal or adjacent, or whether one is between the others.

Table 1 shows the bodies of the rules that form a correct theory for illegal KRK positions in a first-order logic representation. The head of each rule is the literal illegal(A,B,C,D,E,F). Next to the rules we give the number of different positions that are covered by each rule, along with their percentage of the total number of different examples. About a third of the positions are illegal. One can see that some of the rules each cover a large number of examples; they are usually easily found from small random samples of the dataset. On the other hand, the remaining rules, in particular the first two rules of Table 1, cover a significantly smaller number of examples. In order to obtain enough examples for learning these rules, one has to take a much larger random sample of the data. This problem is closely related to the small disjuncts problem discussed by Holte, Acker, and Porter (1989). (Note, however, that not all different positions form a different feature vector with the specified features, as has been pointed out by Pfahringer.)

How does windowing deal with this situation? Recall that it starts with a small random subsample of the available training data and successively adds examples that are misclassified by the previously learned theory to this window. By doing so, it skews the distribution of examples in a way that increases the proportion of examples covered by rules that are hard to learn, thereby decreasing the proportion of examples covered by rules that are easy to learn. Thus, for each individual rule in the target theory, windowing tries to identify a minimum number of examples from which the rule can be learned.

Rule Learning versus Decision Tree Learning

Let us now return to the question of why windowing seems to work better with rule learning algorithms. We believe that its way of skewing the example distribution has different effects on divide-and-conquer decision tree learning algorithms and on separate-and-conquer rule learning algorithms. The nature of this difference lies in the way a condition is selected. Typically, a separate-and-conquer rule learner selects the test that maximizes the number of covered positive examples and at the same time minimizes the number of negative examples that pass the test. It usually does not pay any attention to the examples that do not pass the test. The selection of a condition in a divide-and-conquer decision tree learning algorithm, on the other hand, tries to optimize for all outcomes of the test. For example, the decision tree learner C4.5 (Quinlan, 1993) selects a condition by minimizing the average entropy over all its outcomes, while the original version of the rule learner CN2 (Clark & Niblett, 1989) selects the test that minimizes the entropy for all examples that pass the selected test.
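To make the difference concrete, the following Python sketch (a simplified illustration for a two-class problem, not taken from any of the systems discussed here) contrasts the two evaluation styles; a test is a Boolean function of an example's features:

    from math import log2

    def entropy(labels):
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    def tree_score(test, examples):
        """Divide-and-conquer style: average entropy over all outcomes."""
        passed = [y for x, y in examples if test(x)]
        failed = [y for x, y in examples if not test(x)]
        n = len(examples)
        return (len(passed) * entropy(passed) + len(failed) * entropy(failed)) / n

    def rule_score(test, examples):
        """Separate-and-conquer style: entropy of the covered examples only."""
        return entropy([y for x, y in examples if test(x)])

Only tree_score changes when examples that fail the test are added, which is why newly added window examples influence every node of a tree but leave already finished rules untouched.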

The consequence of this difference for the windowing process is that additional examples in the window of a decision tree learner will have a strong influence on the selection of tests at all levels of the tree. In particular, all examples have an equal influence on the selection of the root attribute. Thus windowing's way of skewing the example distribution may cause significant changes in the learned tree. Separate-and-conquer rule learning algorithms, on the other hand, evaluate a rule using only the examples that are covered by it. As no examples will be added for an already correctly learned rule, the evaluation of this rule will not change if more examples are added to the window by incorrect rules. There might be some changes in the evaluation of some of its tests, because an incomplete rule can cover some of the uncovered positive examples or covered negative examples that were added in the last iteration of the windowing algorithm, but in general we think that this influence is not nearly as significant as for decision tree learning algorithms, where it is certain that each added example will influence the evaluation of the tests at the root of the tree.

(This is even worse in the usual representation of this problem (Muggleton, Bain, Hayes-Michie, & Michie, 1989), where instead of the between relation a less-than predicate is given in the background knowledge. In this representation, which will be used in all our experiments, the rules using the between relation have to be reformulated, which is only possible if each of them is replaced with two new rules using the less-than predicate.)

(Strictly speaking, C4.5 maximizes information gain, which is computed by subtracting the average entropy from an information value that is constant for all considered tests.)


More evidence for the sensitivity of ID3 to changes in the example distribution also comes from some experiments with C4.5 (Quinlan, 1993), where it turned out that changing windowing in a way that makes the class distribution in the initial window as uniform as possible produces better results. (This equal-frequency subsampling method was first described by Breiman, Friedman, Olshen, and Stone (1984) for dynamically drawing a subsample at each node in a decision tree, from which the best split for this node is determined; in C4.5 it is used to seed the initial window.) As discussed above, we believe that this sensitivity is caused by the need of decision tree algorithms to optimize the class distribution in all successor nodes of an interior node. On the other hand, different class distributions will only have a minor effect on separate-and-conquer learners, because they are learning one rule at a time. Adding uncovered positive examples to the current window will not alter the evaluation of rules that do not cover the new examples, but it may cause the selection of a new root node in decision tree learning. We hypothesize from these deliberations that, for windowing, rule learning algorithms exhibit more stability than decision tree learning algorithms. This in turn can lead to better runtimes, because the newly added examples will not interact with the parts of the theory that have already been learned well.

Domain Redundancy

It is clear that windowing requires a redundant training set in order to work effectively. Intuitively, we say a training set is redundant if it contains more examples than are actually needed for inducing the correct domain theory. If this is not the case, i.e., if every example is needed for inferring a correct theory, windowing is unlikely to work well, because eventually all examples have to be added to the window and there is nothing to be gained.

A computable measure for the redundancy of a given domain would enable us to evaluate the potential effectiveness of windowing in this domain. Unfortunately, we do not know of many approaches that deal with this problem. A notable exception is the work by Möller, where the use of the conditional population entropy (CPE) is suggested for measuring the redundancy of a domain. The CPE is defined as

    CPE = - \sum_{i=1}^{n_c} p(c_i) \sum_{a=1}^{n_a} \sum_{v=1}^{n_{v_a}} p(x_{av} \mid c_i) \log p(x_{av} \mid c_i)

where n_c is the number of classes, n_a is the number of attributes, and n_{v_a} is the number of values for attribute a; c_i stands for the i-th class, and x_{av} represents the v-th value of attribute a. One can interpret the formula as computing the sum of the respective average entropies of the class variable in one-level decision trees for predicting each attribute. Möller normalizes the CPE as follows:

    Red = 1 - \frac{CPE}{\sum_{a=1}^{n_a} \log n_{v_a}}

The normalization factor is the maximum value that the CPE can assume, which is attained when p(c_i) = 1/n_c and p(x_{av} \mid c_i) = 1/n_{v_a}. Using this normalization, redundancy can be measured with a value between 0 and 1, 1 meaning high redundancy and 0 meaning no redundancy.
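For illustration, the following Python function computes the normalized redundancy Red from a list of (attribute-tuple, class) pairs with symbolic values; it is our reconstruction of the formulas above, not Möller's original code:

    from collections import Counter
    from math import log2

    def redundancy(examples):
        """Red = 1 - CPE / sum_a log(n_v_a) for symbolic attributes."""
        n = len(examples)
        classes = Counter(c for _, c in examples)
        n_attrs = len(examples[0][0])
        cpe, max_cpe = 0.0, 0.0
        for a in range(n_attrs):
            max_cpe += log2(len({x[a] for x, _ in examples}))
            for cls, n_c in classes.items():
                value_counts = Counter(x[a] for x, c in examples if c == cls)
                for n_av in value_counts.values():
                    p_av = n_av / n_c                     # p(x_av | c_i)
                    cpe -= (n_c / n) * p_av * log2(p_av)  # weighted by p(c_i)
        return 1.0 - cpe / max_cpe if max_cpe else 1.0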

However, it is unclear how this measure corresponds to our intuitive notion of redundancy. For example, two example sets of a domain that differ only in the fact that the second contains each example of the first set twice would have exactly the same redundancy estimate. Intuitively, however, the second example set should be more redundant than the first.

A radically different approach to defining redundancy, in a somewhat different context, was undertaken by Gamberger and Lavrač (1997). They call a training set n-saturated iff there is no subset of size n whose removal would cause a different theory to be learned. They propose to use the notion of n-saturation for approximating the concept of saturation, which intuitively means that the training set contains enough examples for inducing the correct target hypothesis (Gamberger & Lavrač, 1997). This notion corresponds quite closely to our intuitive concept of redundancy. However, the computational complexity of testing for n-saturation can be quite high.
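A brute-force sketch shows where this complexity comes from: already for fixed n, testing n-saturation as defined above requires one re-learning run per size-n subset of the training set (learn is an assumed learner whose theories can be compared for equality):

    from itertools import combinations

    def is_n_saturated(examples, learn, n):
        """True iff removing any n examples leaves the learned theory unchanged."""
        reference = learn(examples)
        for subset in combinations(range(len(examples)), n):
            rest = [e for i, e in enumerate(examples) if i not in set(subset)]
            if learn(rest) != reference:
                return False
        return True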

More importantly, the notion of saturation as defined above does not take into account that different regions of the example space may have different degrees of redundancy. We have already seen in Table 1 that the rules in a target theory often have different degrees of coverage. Consequently, a randomly chosen training set will typically contain more examples that are covered by a high-coverage rule of the target theory than examples that are covered by a low-coverage rule. In other words, the subset of the examples from which the high-coverage rules are learned can already be redundant, while the subset from which the low-coverage rules should be learned does not yet contain enough examples for inducing the correct rules. In Gamberger and Lavrač's notion of saturation, such a training set would be considered non-saturated.

Windowing tries to exploit these different degrees of redundancy in a training set. If some parts of the example space are already covered by correct rules, no more examples from these regions will be added to the window. We have already discussed that this skewing of the example distribution is more appropriate for separate-and-conquer rule learning algorithms than for divide-and-conquer decision tree learning algorithms, because the former learn each rule independently. In the next section, we will discuss a new windowing algorithm for rule learning systems that aims at further exploiting this property.

4. Integrative Windowing

One thing that happens frequently when using windowing with a rule learning algorithm is that good rules have to be discovered again and again in subsequent iterations of the windowing procedure. Although correctly learned rules will add no more examples to the current window, they have to be re-learned in the next iteration as long as the current theory is not complete and consistent with the entire training set. We have developed a new version of windowing which tries to exploit the fact that regions of the example space that are already covered by good rules need not be further considered in subsequent iterations. Because of its technique of successively integrating learned rules into the final theory, we have named our method Integrative Windowing.

The Algorithm

The algorithm shown in Figure 3 starts just like basic windowing: it selects a random subset of the examples, learns a theory from these examples, and tests it on the remaining examples. However, contrary to basic windowing, it does not merely add incorrectly classified examples


procedure IntegrativeWindowing(Examples, InitSize, MaxIncSize)
    Window := RandomSample(Examples, InitSize)
    Test := Examples \ Window
    OldRules := ∅
    repeat
        NewRules := Induce(Window)
        Theory := NewRules ∪ OldRules
        NewWindow := ∅
        OldTest := ∅
        for Example ∈ Test
            Test := Test \ {Example}
            if Classify(Theory, Example) ≠ Class(Example)
                NewWindow := NewWindow ∪ {Example}
            else
                OldTest := OldTest ∪ {Example}
            if |NewWindow| = MaxIncSize
                exit for
        Test := Append(Test, OldTest)
        Window := Window ∪ NewWindow ∪ Cover(OldRules)
        OldRules := ∅
        for Rule ∈ Theory
            if Consistent(Rule, NewWindow)
                OldRules := OldRules ∪ {Rule}
                Window := Window \ Cover(Rule)
    until NewWindow = ∅
    return(Theory)

Figure 3: Integrative Windowing

to the window for the next iteration, but also removes examples from the window if they are covered by consistent rules. A rule is considered consistent when it did not cover a negative example during the testing phase. Note that this does not necessarily mean that the rule is consistent with all examples in the training set, because it may contradict an example that has not yet been tested at the point where MaxIncSize misclassified examples have been found. Thus, apparently consistent rules have to be remembered and tested again in the next iteration. However, testing is much cheaper than learning, so we expect that removing the examples that are covered by these rules from the window should keep the window size small and thus decrease learning time.
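In the same illustrative Python style as before (and again not the LISP implementation evaluated below), the loop of Figure 3 can be sketched as follows for a two-class problem, with induce_rules(window) returning a list of rules and covers(rule, x) testing coverage:

    import random

    def integrative_windowing(examples, induce_rules, covers,
                              init_size, max_inc_size):
        """Sketch of Figure 3 for (features, positive) example pairs."""
        examples = list(examples)
        random.shuffle(examples)
        window, test = examples[:init_size], examples[init_size:]
        old_rules, removed = [], []   # saved rules and the examples they cover
        while True:
            theory = induce_rules(window) + old_rules
            new_window, old_test = [], []
            for i, (x, pos) in enumerate(test):
                covered = any(covers(r, x) for r in theory)
                (new_window if covered != pos else old_test).append((x, pos))
                if len(new_window) == max_inc_size:
                    test = test[i + 1:] + old_test
                    break
            else:
                test = old_test
            if not new_window:
                return theory
            # re-insert what the saved rules covered, then save the rules that
            # look consistent on the new window and remove their cover
            window = window + new_window + removed
            old_rules, removed = [], []
            for rule in theory:
                if not any(covers(rule, x) for x, pos in new_window if not pos):
                    old_rules.append(rule)
                    removed += [(x, p) for x, p in window if covers(rule, x)]
                    window = [(x, p) for x, p in window if not covers(rule, x)]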

Implementation

To test this hypothesis, we implemented several algorithms in Common LISP, building on Ray Mooney's publicly available Machine Learning library. The implemented algorithms are DOS, a basic separate-and-conquer rule learning algorithm using information gain as a search heuristic and no stopping criterion for noise handling; WIN-DOS-3.1, an implementation of windowing as shown in Figure 1 that wraps a windowing procedure around DOS; and WIN-DOS-95, an algorithm that integrates windowing into DOS as shown in Figure 3. All algorithms are limited to two-class problems, i.e., they learn a theory that discriminates positive from negative examples by classifying all examples that are covered by the rules as positive and all examples that are not covered by the rules as negative. Note, however, that this is not a principal limitation of the approach, as there are several ways for solving multi-class problems with binary rule learning algorithms of the type discussed in this paper (Clark & Boswell, 1991; Ali & Pazzani, 1993; Dietterich & Bakiri, 1995).

In preliminary experiments it turned out that one problem that happens more frequently in integrative windowing than in regular windowing or basic separate-and-conquer learning is that of overspecialized rules. Often a consistent rule is found at a low example size, but other rules are found later that cover all of the examples this special rule covers. Note that this problem cannot be removed with a syntactic generality test: consider, for example, the case where a rule stating that a KRK position is illegal if the two kings are on the same square is learned from a small set of the data, and a more general rule is discovered later which states that all positions are illegal in which the two kings occupy adjacent squares. Sometimes the examples of the special case can also be covered by more than one of the other rules.

We have implemented a heuristic procedure for removing such redundant rules. After the final theory has been learned by either of the algorithms, each of its rules is tested on the complete training set, and the rules are ordered according to the number of examples they cover. Starting with the rule with the least coverage, each rule is tested as to whether the examples it covers are also covered by the remaining rules. If so, the rule is removed. This procedure can be implemented quite efficiently and is only performed once, at the end of each of the three algorithms.
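A minimal sketch of this post-processing step, assuming the same covers(rule, x) interface as above:

    def remove_redundant_rules(rules, examples, covers):
        """Drop rules whose cover is contained in the cover of the others."""
        cover = {id(r): {i for i, (x, _) in enumerate(examples) if covers(r, x)}
                 for r in rules}
        kept = sorted(rules, key=lambda r: len(cover[id(r)]))  # least coverage first
        for rule in list(kept):
            others = set().union(*[cover[id(r)] for r in kept if r is not rule])
            if cover[id(rule)] <= others:
                kept.remove(rule)
        return kept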

Experimental Setup

We compared both versions of windowing on a variety of noise-free domains. In each domain we ran a series of experiments with varying training set sizes. For each training set size, several different subsets of this size were selected from the entire set of preclassified examples. All three algorithms (DOS, WIN-DOS-3.1, and WIN-DOS-95) were run on each of these subsets, and the results of the experiments were averaged. For each experiment we measured the accuracy of the learned theory on the entire example set and the total runtime of the algorithm. To have a more reliable complexity measure than the implementation-dependent runtime, we also measured the total number of examples that were processed by the basic learning algorithm. For DOS this number is identical to the size of the respective training set, while for the windowing algorithms it is computed as the sum of the training set sizes of all iterations of windowing. For the noise-free case it turned out that this factor determined the runtime of the algorithms, so that its graphs were almost identical to the graphs of the runtime results. Therefore, for reasons of space efficiency, we will only present the runtime curves, except in Figure 4.

All experiments shown below were conducted with a setting of InitSize = 100 and MaxIncSize = 50. These settings are briefly discussed in the note on parameter settings at the end of this section.

(Runtime was measured in CPU seconds on a microSPARC running compiled Allegro Common Lisp code under SUN Unix.)

(We did not compute this measurement for the experiments in noisy domains (Section 5), because the way we compute the completeness check there invalidates these measurements.)


    Domain           Size   C4.5 -t 1 vs. C4.5   Redundancy
    Mushroom           …            …                 …
    KRKN               …            …                 …
    KRKP               …            …                 …
    KRK (prop.)        …            …                 …
    Tic-Tac-Toe        …            …                 …
    Binary Shuttle     …            …                 …

Table 2: Domains used in the experiments, along with their size, the performance of C4.5 with windowing versus C4.5 without windowing, and an estimate for the redundancy of the domain

Domains

We evaluated the algorithms on a variety of reasonably large and noise-free training sets from the UCI collection of Machine Learning databases. As our implementation can only handle two-class problems, we constructed a binary version of the multi-class Shuttle domain by discriminating examples of the majority class from all other classes. In the KRK illegality domain we used a propositional version of the original relational learning problem (Muggleton et al., 1989), where each position is encoded with features that correspond to the truth values of the different meaningful instantiations of the adjacent, equal, and less-than relations in the background knowledge.

Table 2 shows the total number of examples available for each domain and the ratio of the average runtime of C4.5 with windowing (invoked using the parameter setting -t 1) to that of C4.5 without windowing. The last column shows the redundancy of the domain, estimated with Möller's conditional population entropy heuristic. (The measure could not be computed for the Shuttle domain, because this domain contains numerical attributes, which cannot be handled in a straightforward fashion.) Interestingly enough, there seems to be a negative correlation between the performance of C4.5's windowing algorithm and this redundancy measure. In general, the results with C4.5 confirm the finding of Wirth and Catlett (1988) that not much can be gained from the use of windowing for ID3-like learners. The only exception is the Shuttle domain, where windowing can save almost half of C4.5's runtime.

Results

Figure 4 shows the accuracy, the number of processed examples, and the runtime results for the three algorithms in the Mushroom domain. WIN-DOS-3.1 seems to be effective in this domain, at least for larger training sets. Our improved version, WIN-DOS-95, clearly outperforms simple windowing in terms of runtime, while there are no significant differences in terms of accuracy.

[Figure: Mushroom domain; three plots over training examples (×10³) for DOS, WIN-DOS-3.1, and WIN-DOS-95: test accuracy (% correct), number of examples processed by DOS (×10³), and training time (seconds)]

Figure 4: Results in the Mushroom domain

In a typical run with the above-mentioned parameter settings, WIN-DOS-3.1 needs only a few iterations for learning the correct concept, and WIN-DOS-95 needs about the same number of iterations but rarely keeps more than a small fraction of the training examples in its window; in total, it can save about half of the examples that WIN-DOS-3.1 submits to the DOS learner. Figure 4 shows that these numbers remain almost constant after a saturation point is reached, which is the point where all learners learn correct theories. Learning from smaller randomly selected samples does not result in correct theories, as can be inferred from the DOS accuracy curve. Thus, in this domain, a performance gain comparable to that achieved by windowing cannot be achieved with random subsampling. Hence we conclude that the systematic subsampling performed by windowing has its merits.

Figure 5 shows the accuracy and runtime results for the three algorithms in the KRK and KRKN domains. The picture here is quite similar to Figure 4: WIN-DOS-3.1 seems to be effective in both domains, at least for larger training set sizes.

[Figure: KRK and KRKN domains; test accuracy (% correct) and training time (seconds) over training examples (×10³) for DOS, WIN-DOS-3.1, and WIN-DOS-95]

Figure 5: Results in the KRK and KRKN domains

Our improved version, WIN-DOS-95, clearly outperforms basic windowing in terms of runtime, while there are no significant differences in terms of accuracy. In the KRK domain, predictive accuracy reaches 100% for all three algorithms once the training sets are sufficiently large. At lower training set sizes WIN-DOS-95 is a little ahead, but in general there are no significant differences. The runtime of both windowing algorithms reaches a plateau at about the same size of training examples, which again shows that windowing does not use additional examples once it is able to discover a correct theory from a certain sample size.

The results in the KRKN chess endgame domain, with training set sizes of up to 10,000 examples, are quite similar. However, for smaller training set sizes, which presumably do not contain enough redundancy, basic windowing can take significantly longer than learning from the complete data set.


[Figure: Tic-Tac-Toe and KRKP domains; test accuracy (% correct) and training time (seconds) over training examples (×10³) for DOS, WIN-DOS-3.1, and WIN-DOS-95]

Figure 6: Results in the Tic-Tac-Toe and KRKP domains

This behavior is more obvious in the results for the Tic-Tac-Toe endgame domain, shown in the top portion of Figure 6. Here the predictive accuracy of all three algorithms reaches 100% at about half the total example set size. Interestingly, WIN-DOS-95 reaches this point considerably earlier. On the other hand, WIN-DOS-3.1 is not able to achieve an advantage over DOS in terms of runtime, although it seems quite likely that it would overtake DOS at slightly larger training set sizes. WIN-DOS-95 reaches this break-even point much earlier and continues to build up a significant gain in runtime at larger training set sizes. (Note that this example set is only a subset of the Tic-Tac-Toe data set that has been studied by Wirth and Catlett (1988). We did not have access to the full data set.)


In all domains considered so far, removing a few randomly chosen examples from the larger training sets did not affect the learned theories. Intuitively, we would call such training sets redundant, as discussed in Section 3. In the 3,196-example KRKP data set, on the other hand, the algorithms are not able to learn theories that are 100% correct when tested on the complete data set unless they use the entire data set for training. Note that in this case the 100% accuracy estimate is in fact a resubstitution estimate, which may be a bad approximation of the true accuracy. We would call such a data set non-redundant, as it seems to be the case that randomly removing only a few examples will already affect the learned theories. In this domain, WIN-DOS-3.1 processes about twice as many examples as DOS for each training set size. Our improved version of windowing, on the other hand, processes only a few more examples than DOS at lower sizes, but seems to be able to exploit some redundancies of the domain at larger training set sizes. We interpret this as evidence for our hypothesis that even in non-redundant training sets, some parts of the example space may be redundant (see the discussion of domain redundancy in Section 3). This is consistent with previous findings, which showed that in this domain a few accurate rules with high coverage can be found easily from small training sets, while the majority of the rules have low coverage and are very error-prone (Holte et al., 1989). Interestingly enough, however, Möller's redundancy estimate, which seemed to correlate well with C4.5's performance in these domains (Table 2), is a poor predictor of the performance of our windowing algorithms: the KRKP set, which exhibits the worst performance and is not redundant according to our definition, has a much better redundancy estimate than the KRK data set, for which windowing works much better.

Figure 7 shows the results in the 43,500-example Shuttle domain. For this domain a separate test set of 14,500 examples is available; the accuracy graphs show the performance of the algorithms on this test set. The upper half shows the results when training on the minority class, i.e., learning a theory for all examples with a non-majority classification. Both windowing algorithms perform exceptionally well in this domain, with WIN-DOS-95 being about twice as fast as WIN-DOS-3.1. In terms of accuracy, WIN-DOS-95 is more accurate than both other algorithms, with WIN-DOS-3.1 being a little ahead of DOS. As this example set has a relatively skewed class distribution, the vast majority of the examples belonging to a single class, we decided to also try the algorithms on the reverse problem by swapping the labels of the positive and negative examples. The results in terms of runtime did not change. However, in terms of accuracy, the performance of DOS improved so that it equals WIN-DOS-95, with WIN-DOS-3.1 being behind both. We have not yet found a good explanation for this phenomenon.

It is interesting to note that this is the only domain we have tried that contains numeric attributes. We handled these in the common way, using threshold tests. The candidate thresholds for each test are selected between changes in the class variable in the sorted list of values of the continuous attributes (Fayyad & Irani, 1992). As the windowing algorithms will typically have to choose a threshold from a much smaller example set, they are likely to choose different thresholds. Whether this has a positive effect (less overfitting), a negative effect (the optimal threshold may be missed), or no effect at all is an open question, which has already been raised by Breiman et al. (1984) and has been further explored in Catlett's work on peepholing (Catlett, 1992). (The sorting causes the slightly super-linear slope in the runtime curves of DOS.)
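For completeness, a sketch of this candidate generation (our illustration; numeric attribute values paired with binary class labels):

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent distinct values where the class changes."""
        pairs = sorted(zip(values, labels))
        return [(a + b) / 2.0
                for (a, la), (b, lb) in zip(pairs, pairs[1:])
                if la != lb and a != b]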


[Figure: binarized Shuttle domain, (+) and (-) versions; test accuracy (% correct) and training time (seconds) over training examples (×10³) for DOS, WIN-DOS-3.1, and WIN-DOS-95]

Figure 7: Results in the binarized Shuttle domain

In summary, the results have shown us that significant runtime gains can be achieved with windowing, usually without losing accuracy, and even more so with integrative windowing, in a variety of noise-free domains. However, we have also seen that redundancy is in fact a crucial condition for the effectiveness of windowing, so that a further exploration of domain redundancy is suggested as a promising area for further research.

A Note on Parameter Settings

All experiments reported above were performed with InitSize = 100 and MaxIncSize = 50. Different variations of the InitSize parameter have been investigated by Wirth and Catlett (1988). Their results indicate that the algorithm is quite insensitive to this parameter, which

[Figure: KRK domain, varying increments; examples processed (×10³) by WIN-DOS-3.1 and WIN-DOS-95 over log(Increment)]

Figure 8: Number of examples processed by the windowing algorithms for varying values of their MaxIncSize parameter (shown on a logarithmic scale)

we have empirically confirmed. Nevertheless, it might be worthwhile to use a theoretical analysis of subsampling, similar to the one performed by Kivinen and Mannila (1994), for determining promising values of this parameter for a particular domain.

However, we attribute more significance to the choice of the MaxIncSize parameter, which specifies the maximum number of examples that can be added to a window. Figure 8 shows the results of experiments in the KRK domain in which we varied MaxIncSize from 10 to 3,000, plotted on a logarithmic scale on the x-axis. The performance of the algorithms in terms of the number of processed examples is best if this parameter is kept comparably low, and in a range of relatively small values the parameter is insensitive to its exact setting. If more examples are added to the window, performance degrades: at high settings of MaxIncSize, WIN-DOS-3.1 not only has to process about twice as many examples, but windowing also takes more iterations to converge. Similar behavior can be observed for WIN-DOS-95. Thus it seems to be important to continuously evaluate the learned theories in order to focus the learner on the parts of the search space that have not yet been correctly learned. This finding contradicts the heuristic that is currently employed in C4.5, namely to add at least half of the total misclassified examples. However, this heuristic was formed in order to make windowing more effective in noisy domains (Quinlan, 1993), a goal that in our opinion cannot be achieved by merely using a noise-tolerant learner inside the windowing loop, for reasons discussed in the next section.


5. Windowing and Noise

So far we have only dealt with noise-free domains. While we believe that there are many rewarding learning tasks that can be successfully attacked with noise-free learning algorithms, noise is commonly considered a typical property of real-world data. (Think, e.g., of Ken Thompson's CD-ROMs of chess endgames, to which every competitive chess program has an interface; a compression of these endgames into a simple set of rules would certainly be appreciated by the computer chess industry; Fürnkranz, 1997b.) In this section we will take a closer look at the behavior of windowing with noisy data, analyze the problems, and discuss a possible solution.

Noise is a Problem

An efficient adaptation of the basic windowing technique shown in Figure 1 to noisy domains is a non-trivial endeavor. In particular, it cannot be expected that the use of a noise-tolerant learning algorithm inside the windowing loop will lead to performance gains in noisy domains. In our opinion, the main problem with windowing in a noisy domain lies in the fact that a good theory will misclassify most of the noisy examples and consequently incorporate them into the learning window for the next iteration. On the other hand, the window will typically only contain a subset of the original learning examples. Hence, after a few iterations, the proportion of noisy examples in the learning window can be much higher than the noise level in the entire data set. Naturally, this makes the task for the learning module considerably more difficult.

Assume, for example, that your favorite noise-tolerant learner has learned a correct theory from a randomly selected starting window of 100 examples in a 1,000-example domain. Further assume that 10% of the examples are labeled incorrectly. The correct theory will then misclassify the roughly 90 noisy examples among the remaining 900, and these examples will consequently be added to the window, thus almost doubling its size. Assuming that the original window also contained about 10% noise, more than half of the examples in the new window are now erroneous, so that the classification of the examples in the window is in fact random. It can be assumed that many more examples have to be added to the window in order to recover the structure that is inherent in the data. This conjecture is consistent with the experimental results of Wirth and Catlett (1988) and Catlett (1991a), which showed that windowing is highly sensitive to noise.
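The arithmetic of this example generalizes; a few lines of Python show how a single windowing iteration with a correct theory concentrates the noise:

    def window_noise(total=1000, window=100, noise=0.1):
        """Noise proportion in the window after adding all misclassified
        examples, i.e., all noisy examples outside the initial window."""
        added = noise * (total - window)
        return (noise * window + added) / (window + added)

    print(window_noise())   # about 0.53: over half the window is now noise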

Towards Noise-Tolerant Windowing

The integrative windowing algorithm described in Section 4, which is only applicable to noise-free domains, is based on the observation that rule learning algorithms will rediscover good rules again and again in subsequent iterations of the windowing procedure. Such consistent rules do not add examples to the current window, hence they are unlikely to change, but they nevertheless have to be rediscovered in subsequent iterations. Integrative windowing detects these rules early on, saves them, and removes all examples they cover from the window, thus gaining computational efficiency.

In order to adapt this procedure to noisy domains, three parts of the algorithm have to be modified:

Consistency Check: When is a rule learned from the window good enough? In the noise-free case, all rules that do not cover any negative examples are added to the final theory. For noisy data, a criterion has to be found that allows rules to cover some noisy negative examples.

Completeness Check: When should we stop adding rules to a theory? In the noise-free case, rules are added to the theory until all positive examples are covered by at least one rule. For noisy data, we have to find a criterion that estimates whether the remaining uncovered positive examples can be considered as noise, or whether another rule should be added to the theory that explains some of them.

Resampling: Which examples should be added to the current window? In the noise-free case, all misclassified examples are candidates for being added to the window. However, we have seen above that this can lead to severe problems, because noisy examples will be misclassified by a correct theory.

We have built a prototype system that deals with these three questions in the way described in the following sections.

Consistency Check

The first choice we have to make is to find a criterion for determining which rules should be added to the final theory and which rules should be further improved. A straightforward idea is to use some sort of significance test in order to determine whether a rule that has been learned from the current window is significant on the entire set of training examples. We have experimented with a variety of criteria known from the literature, but found that they are insufficient for our purposes. For example, it turned out that at higher training set sizes, CN2's likelihood-ratio significance test (Clark & Niblett, 1989), which accepts a rule when the distribution of positive and negative examples covered by the rule differs significantly from the class distribution in the entire set, will deem any rule learned from the window significant.

Eventually, we have settled for the following criterion. For each rule r learned from the current window, we compute two accuracy estimates: AccWin(r), which is determined using only examples from the current window, and AccTot(r), which is estimated on the entire training set. The criterion we use for detecting good rules consists of two parts:

1. The AccWin(r) estimate has to be significantly above the default accuracy of the domain. This is ensured by requiring that

    AccWin(r) - SE(AccWin(r)) > DA    (1)

where DA is the default accuracy and SE(p) = \sqrt{p(1-p)/n} is the standard error of classification (Breiman et al., 1984).

2. The accuracy of the learned rule on the window has to be within a certain interval of its overall accuracy:

    |AccWin(r) - AccTot(r)| \le \gamma \cdot SE(AccTot(r))    (2)

γ is a user-settable parameter that can be used to adjust the width of this range.

The purpose of the first heuristic is to avoid rules with a bad classification performance; in particular, it weeds out many rules that have been derived from too few training examples. The second criterion aims at making sure that the accuracy estimates that were computed on the current window, and were used in the heuristics of the learning algorithm, are not too optimistic compared to the true accuracy of the rule, which is approximated by the accuracy measured on the entire training set. (Note that (2) differs from the version of Fürnkranz (1997d) by using SE(AccTot(r)) instead of SE(AccWin(r)) on the right-hand side; we have found this version to work somewhat more reliably.)
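The check is easy to state in code; the following Python sketch assumes the accuracy estimates and the respective numbers of covered examples have been computed elsewhere:

    from math import sqrt

    def standard_error(p, n):
        """SE(p) = sqrt(p * (1 - p) / n)."""
        return sqrt(p * (1.0 - p) / n)

    def significant(acc_win, n_win, acc_tot, n_tot, default_acc, gamma):
        """Two-part consistency check for a rule learned from the window."""
        above_default = acc_win - standard_error(acc_win, n_win) > default_acc
        not_optimistic = (abs(acc_win - acc_tot)
                          <= gamma * standard_error(acc_tot, n_tot))
        return above_default and not_optimistic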

The parameter γ determines the degree to which the estimates AccWin(r) and AccTot(r) have to correspond. A setting of γ = 0 requires that AccWin(r) = AccTot(r), which in general will only happen if r is consistent or has been learned from the entire training set. This is the recommended setting in noise-free domains. In noisy domains, values γ > 0 have to be used, because the rules returned by the learning algorithm will typically be inconsistent on the training set. Note, however, that choosing γ low in a noisy domain will not lead to overfitting and a decrease in predictive accuracy, because overfitting is caused by too optimistic estimates of a rule's accuracy, and the chance of accepting such rules only shrinks with lower values of γ. Clearly, if the chance of a rule being accepted decreases, the runtime of the algorithm increases, because windowing has to perform more iterations. The extreme case γ = 0 causes the window to grow until it contains all examples; then the rule is accepted by the second criterion, because the examples used for estimating AccWin(r) and AccTot(r) are basically the same. Typical settings in noisy domains are small positive values, while γ = ∞ will move all rules that have survived the first criterion into the final rule set. In as much as the learner is sufficiently noise-tolerant and the initial example size is sufficiently large, the algorithm implements learning from a random subsample if γ = ∞.

Completeness Check

A straightforward approach for attacking the completeness problem, i.e., the decision when to stop learning more rules, would be to stop whenever the learner can find no more rules from the current window. This approach, however, might miss some important rules that could be found if only more of the uncovered positive examples had been added to the window. A different approach is to continue the windowing process until all remaining positive examples are in the learning window; this, on the other hand, may lead to many unnecessary iterations in the case where no more meaningful rules can be found.

We aimed at a compromise here. The strategy we employ is to double the window size whenever the learner has not discovered any rules from the current window, which ensures fast convergence for the case when no more rules can be found, but also tries to find the rules from lower window sizes first. Also note that this approach relies on a truly


noise-tolerant learning algorithm: if, for example, the learning algorithm discovers rules even in randomly classified training data (i.e., pure noise), this criterion will never kick in. The windowing algorithm will then discover that these rules are insignificant and continue to add more examples to the window. This will go on until all examples are in the window, in which case the found rules will be retained.

Resampling

Our main concern with the resampling problem was that adding only misclassified examples is likely to increase the noise level inside the window. To avoid this, we form a set of candidates containing all examples that are not yet in the window and that are covered by insignificant rules, plus all uncovered positive examples. The algorithm then selects MaxIncSize of these candidate examples and adds them to the window. We stick to adding uncovered positive examples only, because after more and more rules have been discovered, the proportion of positive examples in the remaining training set will considerably decrease, so that the chance of randomly picking a positive example from the set of all uncovered examples would decrease, which in turn might slow down the learner. Although adding only positive uncovered examples may increase the chances of learning overgeneral rules, these will be discovered by the second part of our criterion, and appropriate counterexamples will eventually be added to the window.

The Algorithm

Figure 9 shows the final algorithm. At the beginning it proceeds just like the basic or integrative windowing algorithms described earlier in this paper: it selects a random subset of the examples, learns a theory from these examples, and tests it on the remaining examples. Like integrative windowing, it does not merely add examples that have been incorrectly classified to the window for the next iteration, but it also removes all examples from this window that are covered by good rules. To determine good rules, it tests the individual rules that have been learned from the current window on the entire data set and performs the consistency check described above (procedure Significant). If a rule passes the test, it is added to the final theory, and all examples that are covered by it are removed from the training set and the window. Otherwise, the examples covered by this rule become candidates for being added to the window in the next iteration. After all uncovered positive examples have also been added to this candidate set, the algorithm randomly selects MaxIncSize examples that are added to the window. Not shown in the algorithm is the completeness check described above, which doubles the window size if the noise-tolerant learner does not find any rules.

With a setting of γ = 0, NoiseTolerantWindowing is very similar to the IntegrativeWindowing algorithm of Figure 3, with the difference that the latter only tests a theory until it has collected MaxIncSize new examples to add to the current window. Thus it cannot determine whether a rule has already been tested on all examples, and it has to test the stored rules in all subsequent iterations. NoiseTolerantWindowing, on the other hand, tests a rule on the entire training set. If it finds the rule to be significant, it will add it to the final rule set and will never test it again. Consequently, the examples covered by such a rule can be removed not only from the window, but from the entire training set.


procedure NoiseTolerantWindowing(Examples, InitSize, MaxIncSize)
    Window := RandomSample(Examples, InitSize)
    Theory := ∅
    repeat
        NewRules := NoiseTolerantLearner(Window)
        NewWin := Window
        NewExs := Examples
        Candidates := ∅
        for Rule ∈ NewRules
            if Significant(Rule, Window, Examples)
                Theory := Theory ∪ {Rule}
                NewWin := NewWin \ Cover(Rule, Window)
                NewExs := NewExs \ Cover(Rule, Examples)
            else
                Candidates := Candidates ∪ Cover(Rule, Examples)
        for Example ∈ Positive(Examples)
            if Example ∉ Cover(NewRules, Examples)
                Candidates := Candidates ∪ {Example}
        Window := NewWin ∪ RandomSample(Candidates, MaxIncSize)
        Examples := NewExs
    until Candidates = ∅
    return(Theory)

Figure 9: A noise-tolerant version of integrative windowing

Experimental Evaluation

We implemented the algorithm described in the last section in the same LISP environment as the noise-free algorithms. The noise-tolerant learning algorithm used was I-RIP, a separate-and-conquer learner halfway between IREP (Fürnkranz & Widmer; Fürnkranz, e) and RIPPER (Cohen). Like IREP, it learns single rules by greedily adding one condition at a time, using Foil's information gain heuristic (Quinlan), until the rule no longer makes incorrect predictions on the growing set, a randomly chosen subset of two thirds of the training examples. Thereafter, the learned rule is simplified by greedily deleting conditions as long as the performance of the rule does not decrease on the remaining set of examples, the pruning set. All examples covered by the resulting rule are then removed from the training set, and a new rule is learned in the same way, until all positive examples are covered by at least one rule or the stopping criterion fires. Like RIPPER, I-RIP evaluates the quality of a rule on the pruning set by computing the heuristic (p - n)/(p + n), where p and n are the numbers of covered positive and negative examples, respectively, and stops adding rules to the theory whenever the fraction of positive examples that are covered by the best rule does not exceed a fixed threshold. These choices have been shown to outperform IREP's original choices (Cohen). I-RIP is quite similar to IREP*, which is also described by Cohen, but it retains IREP's method of considering all conditions for pruning, instead of considering only a final sequence of conditions.
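The pruning step can be illustrated with a short sketch. The heuristic below is the (p - n)/(p + n) measure named above; the greedy deletion of arbitrary (not only final) conditions follows the description of I-RIP, but the helper names, the list-of-conditions representation of rules, and the covers/positive conventions are assumptions of this illustration.

    def prune_value(rule, prune_set, covers):
        # RIPPER-style pruning heuristic (p - n) / (p + n) on the pruning set.
        covered = [e for e in prune_set if covers(rule, e)]
        p = sum(1 for e in covered if e.positive)
        n = len(covered) - p
        return (p - n) / (p + n) if covered else -1.0

    def prune_rule(rule, prune_set, covers):
        # Greedily delete any single condition (rule = list of conditions)
        # as long as the heuristic value does not decrease.
        best, best_value = rule, prune_value(rule, prune_set, covers)
        improved = True
        while improved:
            improved = False
            for i in range(len(best)):
                candidate = best[:i] + best[i + 1:]
                if candidate and prune_value(candidate, prune_set, covers) >= best_value:
                    best = candidate
                    best_value = prune_value(candidate, prune_set, covers)
                    improved = True
                    break
        return best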

[Two line plots: test accuracy (% correct) and training time (seconds) as functions of the number of training examples (x 10^3), for I-RIP, WIN-RIP-3.1, WIN-RIP-95, and WIN-RIP-NT-0.0.]

Figure: Results for the noise-free KRK domain

Around I-RIP we wrapped the windowing procedure of the figure above. The resulting algorithm is referred to as Win-RIP-NT. Again, the implementations of both algorithms remove semantically redundant rules in a post-processing phase, as described earlier.

We studied the behavior of the algorithms in the KRK illegality domain with varying levels of artificial noise. A noise level of n means that in about n% of the examples the class label has been replaced with a random class label. In each of the experiments described in this section, we report results averaged over several different subsets, in terms of the runtime of the algorithm and the accuracy measured on a separate noise-free test set. We did not evaluate the algorithms according to the total number of examples processed by the basic learning algorithm, because the way we compute the completeness check (doubling the size of the window if no rules can be learned from the current window, see earlier) makes these results less useful. All algorithms were run on identical data sets, but some random variation results from the fact that I-RIP uses internal random splits of the training data. All experiments shown below were conducted with the same parameter settings of InitSize and MaxIncSize, which performed well in the noise-free case. We have not yet made an attempt to evaluate their appropriateness for noisy domains.
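For concreteness, a short sketch of this noise-injection scheme (the representation of examples and labels is an assumption of the illustration):

    import random

    def add_class_noise(examples, classes, noise_level):
        # Replace the class label of about noise_level percent of the
        # examples with a randomly chosen class label (which may happen
        # to coincide with the original one).
        for example in examples:
            if random.random() < noise_level / 100.0:
                example.label = random.choice(classes)
        return examples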

First, we aimed at making sure that the noise-free setting of the Win-RIP-NT algorithm still performs reasonably well, so that the noise-tolerant generalization of the IntegrativeWindowing algorithm does not lose its efficiency in noise-free domains. The figure above shows the performance of I-RIP, Win-RIP-3.1 (basic windowing using I-RIP in its loop), Win-RIP-95 (integrative windowing using I-RIP), and Win-RIP-NT-0.0 (the noise-tolerant version of integrative windowing with a parameter setting of 0.0). The graphs very much resemble the corresponding graphs for the KRK domain shown earlier, except that I-RIP occasionally overgeneralizes, so that the accuracy never reaches exactly 100%. Win-RIP-NT-0.0 performs a little worse than integrative windowing, but it is still better than regular windowing.

(NT, of course, stands for Noise-Tolerant. Special thanks to the reviewer who suggested this notation.)


[Two line plots: test accuracy (% correct) and training time (seconds) as functions of the number of training examples (x 10^3), for I-RIP, Win-RIP-3.1, and WIN-RIP-NT-0.0.]

Figure: Results of the noise-free windowing versions for the KRK domain with 5% artificial noise

A similar graph for the Mushroom domain is shown in previous work (Fürnkranz, d). So we can conclude that the changes in the algorithm did some harm, but the algorithm still performs reasonably well in noise-free domains.

The figure above shows the performance of three of these algorithms (we did not test Win-RIP-95 on datasets with artificial noise). Here, I-RIP is clearly the fastest algorithm; the windowed version takes about twice as long. Although this is only a moderate level of noise, integrative windowing with Win-RIP-NT-0.0 is already unacceptably slow, because it successively adds all examples to the window, as we have discussed earlier.

The interesting question, of course, is what happens if we increase the noise-tolerance parameter, which supposedly should be able to deal with the noise in the data. To this end, we have performed a series of experiments with varying noise levels and varying values for this parameter. Our basic finding is that the algorithm is very sensitive to the choice of the parameter. The first thing to notice is that there is an inverse relationship between the parameter value and the runtime of the algorithm. This is not surprising, because higher values will make it easier for rules to be added to the final theory; thus the algorithm is likely to terminate earlier. On the other hand, we also notice a correlation between predictive accuracy and the value of the parameter. The reason for this is that lower values impose more stringent restrictions on the quality of the accepted rules. Again, this can be explained by the fact that higher values are likely to admit less accurate rules (see the earlier discussion).

Therefore, what we hope to find is an optimal range for the parameter, for which Win-RIP-NT outperforms I-RIP in terms of runtime while maintaining about the same performance in terms of accuracy. For example, the figure below shows the results of I-RIP and Win-RIP-NT with a setting of 0.5 in the KRK domain with 20% noise added. Both perform about the same in terms of accuracy, but there is some gain in terms of runtime.

[Two line plots: test accuracy (% correct) and training time (seconds) as functions of the number of training examples (x 10^3), for I-RIP and WIN-RIP-NT-0.5.]

Figure: Results of Win-RIP-NT with a setting of 0.5 for the KRK domain with 20% artificial noise

Although it is not as remarkable as in noise-free domains, there is a clear difference in the growth functions: I-RIP's runtime grows faster with increasing sample set sizes, while Win-RIP-NT's growth function appears to be sublinear. Unfortunately, the performance of the algorithm seems to be quite sensitive to the choice of the parameter, so that we cannot give a general recommendation for its setting. The figure below shows the graphs for three different settings of the parameter for varying levels of noise. In terms of runtime, it is obvious that higher parameter values produce lower runtimes. In terms of accuracy, the lowest setting seems to be a consistently good choice. However, at some noise levels the version with an intermediate setting is clearly preferable when considering the performance in both dimensions. With further increasing noise levels, all algorithms become prohibitively expensive, but are able to achieve surprisingly good results in terms of accuracy. From this and similar experiments we conclude that there seems to be some correlation between an optimal choice of the parameter and the noise level in the data.

We have also tested Win-RIP-NT on a discretized, two-class version of Quinlan's thyroid diseases database, where we achieved improvements in terms of both runtime and accuracy with the windowing algorithms. These experiments are discussed in previous work (Fürnkranz, d). However, later experiments showed that this domain exhibited the same kind of behavior as the Shuttle domain shown in an earlier figure, i.e., most versions of Win-RIP-NT consistently outperformed I-RIP in terms of accuracy when learning on the minority class, while this picture reversed when learning on the majority class. Runtime gains could be observed in both cases. However, these results can only be regarded as preliminary. In order to answer the question whether our results in the KRK domain with

(In a new domain, we would advise trying these settings in that order.)


[Two line plots: accuracy and run-time as functions of the noise level (10% to 50%), for I-RIP, WIN-RIP-NT-0.25, WIN-RIP-NT-0.5, and WIN-RIP-NT-1.0.]

Figure: Results of I-RIP and three versions of Win-RIP-NT (settings 0.25, 0.5, and 1.0) for the 10,000-example KRK domain with varying levels of artificial noise

artificial class noise are predictive of the algorithms' performance on real-world problems, a more elaborate study must be performed.

In summary, we have discussed why noisy data appear to be a problem for windowing algorithms, and have argued that three basic problems have to be addressed in order to adapt windowing to noisy domains. We have also shown that, with three particular choices for these procedures, it is in fact possible to achieve runtime gains without sacrificing accuracy. However, we do not claim that these three particular choices are optimal; we are convinced that better choices are possible and should lead to a more robust learning system.

Related Work

Windowing techniques are related to several other research areas in the field of Machine Learning, including subsampling, active learning, and techniques for complexity reduction such as feature subset selection. In this section we will briefly discuss some of these related techniques. For a slightly different angle on many of the following topics, the reader may also wish to consult the work by Blum and Langley.

Sub-Sampling Techniques

Several authors have discussed approaches that use subsampling algorithms different from windowing. For decision tree algorithms, it has been proposed to use dynamic subsampling at each node in order to determine the optimal test. This idea was originally proposed, but not evaluated, by Breiman et al. It was further explored in Catlett's work on peepholing (Catlett), a sophisticated procedure for using subsampling to eliminate unpromising attributes and thresholds from consideration.


John and Langley discuss a different approach that successively expands the learning window: ELC adds randomly selected examples to the window until an extrapolation of the learning curve (accuracy over window size) no longer promises significant gains for further enlargements of the window. In terms of runtime, the authors note that this technique can in general only gain efficiency for incremental learning algorithms.

Partitioning, proposed by Domingos (b, a), splits the example space into segments of equal size and combines the rules learned on each partition. This technique produced promising results in noisy domains, but substantially decreased learning accuracy in non-noisy domains. Besides, the technique seems to be tailored to a specific learning algorithm and not applicable to a wider group of rule learning algorithms (Domingos, a).

Subsampling techniques have recently been used frequently for increasing the accuracy of a classifier. The basic idea is to train classifiers on multiple subsamples of the data and combine their predictions, usually by voting. Such approaches include bagging (Breiman, b), where all examples are subsampled with equal probabilities, and boosting (Drucker, Schapire, & Simard; Freund & Schapire), also known as arcing (Breiman, a), where examples that have been misclassified in previous iterations are more likely to be selected in subsequent iterations. Interestingly, Breiman (b) has noted that these techniques rely on unstable base learners, while we conjecture that the better performance of windowing with rule learning algorithms is due to more stability (see the earlier discussion). While the main focus of these works is to improve the accuracy of a given learning algorithm, it would be interesting to evaluate bagging and boosting techniques along their runtime and memory requirements as well. For example, Breiman (c) discusses arcing algorithms for datasets where limited memory requires the use of subsampling.

In Knowledge Discovery in Databases, subsampling techniques have been investigated for the discovery of association rules. Toivonen describes a straightforward approach where association rules are discovered from a subsample of the data and their validity is checked on the complete dataset. Kivinen and Mannila give theoretical bounds for the sample size that is required for establishing, with a given maximum error probability, the truth of association rules (Kivinen & Mannila) or functional dependencies (Kivinen & Mannila) that have been discovered from the sample.

Active Learning

Windowing techniques are also closely related to the field of active learning. According to the term's original definition within the field of Machine Learning, active learning includes any form of learning in which the learning program has some control over the inputs it trains on (Cohn, Atlas, & Ladner). While this definition would be broad enough to include windowing techniques, subsequent work in this area has mostly concentrated on the use of membership queries, i.e., on giving the learner the means to query the classification of examples of its own choice, instead of providing it with a fixed set of labeled data.

Such approaches are based on a new motivation for studying subsampling techniques, in addition to the three motivations listed in the Introduction of this paper:

Expensive Labeling: In many domains it is very expensive to obtain preclassified training examples, while unlabeled examples are cheaply available. Subsampling techniques can help to solve this problem by focusing on minimizing the number of examples that have to be labeled in order to learn a satisfying theory.

The prevalent example of such a domain is the World-Wide Web, which provides an abundance of training pages for text categorization problems. However, significant effort is required to assign semantic categories to these pages. Not surprisingly, much of the recent work in active learning has concentrated on text categorization problems.

Closely related to windowing is uncertainty sampling (Lewis & Gale; Lewis & Catlett). The difference is that the window is not adjusted on the basis of misclassified examples, but on the basis of the learner's confidence in its own predictions: the examples that are classified with the least confidence will be added to the training set in the next iteration, after obtaining their class labels.
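A minimal sketch of this selection step, assuming a scikit-learn-style classifier exposing predict_proba; this is a generic illustration of uncertainty sampling, not code from the cited papers:

    import numpy as np

    def uncertainty_sample(model, pool, batch_size):
        # Return the indices of the pool examples the model is least
        # confident about.
        probabilities = model.predict_proba(pool)   # (n_examples, n_classes)
        confidence = probabilities.max(axis=1)      # top-class probability
        return np.argsort(confidence)[:batch_size]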

However, not all learning algorithms are able to attach uncertainty estimates to their predictions. Besides, using uncertainty estimates from a single learning algorithm may be problematic in some cases (Cohn et al.). Thus it has been suggested to use a committee of classifiers and to measure the uncertainty in the predictions by the degree of disagreement among the classifiers. The selective sampling technique proposed by Cohn et al. is one such technique, based on a theoretical framework that uses the entire version space of consistent theories as a committee. Another version of this approach is the query-by-committee algorithm (Seung, Opper, & Sompolinsky; Freund, Seung, Shamir, & Tishby), which uses a probability distribution over hypotheses to randomly select two consistent hypotheses for classifying a new example. If their predictions differ, the algorithm asks for the true label of the example and adds it to the training set. Committee-based sampling (Dagan & Engelson) is an adaptation of this idea to probabilistic classifiers. Liere and Tadepalli compare several ways of combining the predictions of a committee of Winnow-based learners on a text categorization task.
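The core decision of query-by-committee can be sketched in a few lines; here a uniform choice among trained committee members stands in for sampling from a distribution over consistent hypotheses, and predict is an assumed classifier interface:

    import random

    def should_query(committee, example):
        # Sample two hypotheses and request the true label only if
        # their predictions disagree.
        h1, h2 = random.sample(committee, 2)
        return h1.predict(example) != h2.predict(example)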

Obviously, the above-mentioned approaches can also be applied to situations where a large amount of labeled data is available. An investigation of the suitability of these techniques for increasing the efficiency of learning algorithms, and a comparison to our solutions for this problem, is left for future work. However, it should be noted that these techniques are not susceptible to the problem of noise in the form we discussed earlier, because they would simply ignore the class labels during the subsampling process.

It is less obvious that windowing techniques might also be useful in the presence of large amounts of unlabeled data. However, several authors have recently started to investigate the idea that, instead of submitting the predictions that the learner is most uncertain about to a teacher for labeling, the learner might autonomously label the predictions it is most certain about and add them to the training set. Proposals include the use of a committee of learners for co-training (Blum & Mitchell) or techniques based on the Expectation-Maximization (EM) algorithm (Nigam, McCallum, Thrun, & Mitchell), possibly coupled with an active learning procedure (McCallum & Nigam). If research in this direction is pushed further, the number of labeled examples available to a learning algorithm might be greatly enlarged at the expense of some incorrectly labeled examples. Noise-tolerant windowing algorithms may turn out to be an appropriate choice for such data sets.


procedure WindowingAlgorithm(LP)
    RedLP := InitializeReduction(LP)
    loop
        Theory := CallAlgorithm(RedLP)
        Q := Evaluate(Theory, LP)
        if StoppingCriterion(Q, LP, RedLP)
            return(Theory)
        else
            RedLP := ExpandReduction(Q, LP, RedLP)

Figure: A general view of windowing

Complexity Reduction

Windowing can be viewed as a special case of a wider variety of optimization techniques that try to reduce the complexity of a learning problem by identifying appropriate low-complexity versions of the problem. The complexity of a learning problem mostly depends on the number of training examples and the size of the searched hypothesis space. The figure above shows an abstraction of the windowing algorithm. It starts by initializing the learning problem with a reduced learning problem (e.g., with a subsample of the examples), then applies the learning algorithm to this reduced problem and analyzes the resulting theory with respect to the original problem. Unless some stopping criterion specifies that the quality of the learned theory is already sufficient (e.g., if no exceptions could be found on the complete data set), the reduced learning problem will be expanded to incorporate more information (e.g., by adding all misclassified examples), and a new theory is induced.
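The abstract loop can be stated directly; all five callables are assumptions to be supplied by a concrete instantiation, e.g. windowing (subsample, induce, test on all data, check for exceptions, add misclassified examples):

    def reduce_and_conquer(problem, initialize, learn, evaluate,
                           good_enough, expand):
        # Abstract complexity-reduction loop in the sense of the figure.
        reduced = initialize(problem)        # e.g. a random subsample
        while True:
            theory = learn(reduced)          # solve the reduced problem
            quality = evaluate(theory, problem)
            if good_enough(quality, problem, reduced):
                return theory
            reduced = expand(quality, problem, reduced)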

Note that this abstract framework also describes other approaches for reducing the complexity of a learning problem, such as hypothesis space reduction. As an example, think of an algorithm that attempts to learn a theory in a simple hypothesis space first, and only switches to more complex hypothesis spaces if the result in the simple space is unsatisfactory. For example, many Inductive Logic Programming systems provide some explicit control over the complexity of their hypothesis space, which might be controlled with an instantiation of the generalized windowing algorithm. Such an approach has in fact been realized in CLINT (De Raedt & Bruynooghe), which has a predefined set of hypothesis spaces with increasing complexity and is able to switch to a more expressive hypothesis language if it is not able to find a satisfactory theory in its current search space. Similar approaches could also be imagined for other ILP algorithms. For example, in Foil (Quinlan) one could systematically vary certain parameters that influence the complexity of the hypothesis space, like the number of new variables that can be introduced in the body of a clause, or the maximum length of a clause, in order to define increasingly complex hypothesis spaces. This procedure could be automatized in a way similar to the one described by Kohavi and John. The crucial point is how to efficiently evaluate that no progress can be made

(In representation languages that extend flat feature vectors, such as first-order logic, the complexity of a learning problem also depends crucially on the average cost of matching an instance with a rule. Sebag and Rouveirol demonstrate a technique for reducing these potentially exponential costs via subsampling.)


by shifting to a more complex hypothesis space. In the propositional case, forward selection approaches to feature subset selection, i.e., algorithms that select the best subset of features by adding one feature at a time to an initially empty set of features, can be viewed in this framework (Caruana & Freitag; Kohavi & John).

All of the above-mentioned approaches may be viewed as a particular type of bias shift operator (Utgoff; desJardins & Gordon), focussing on shifts to computationally more expensive inductive biases. Turney investigates this in more detail, but suggests that, in order to maximize accuracy, one should start with a weak bias and gradually shift to stronger biases. Our results suggest the opposite strategy if efficiency is the main concern. This is consistent with results comparing forward and backward feature subset selection (Kohavi & John).

We believe that thinking in this framework may lead to more general approaches for reducing the complexity of a learning problem, which aim at reducing both hypothesis and example space at the same time. As an example, consider the peepholing technique introduced by Catlett (b), where subsampling is used to reliably eliminate unpromising candidate conditions from the hypothesis space.

Future Work

There are several ways in which future work can build on the presented results. Clearly, a deeper exploration of the effects of the parameter settings of our algorithms is needed. In particular, a better understanding of the parameter in the noise-handling heuristics is necessary for practical applications of the algorithm. Alternative solution attempts for the three problems outlined earlier should also be investigated, thus maybe entirely eliminating the need for this parameter. As discussed earlier, the applicability of windowing algorithms crucially depends on the presence of some redundancy in the training set. Thus, better methods for characterizing redundant domains are a rewarding topic for further research. The efficiency of the presented algorithms could maybe be further improved by trying to specialize overgeneral rules, instead of entirely removing them from the current theory and relying on the next windowing iteration to find a better rule. Ideas from theory revision might turn out to be useful in this context.

While our major concern was to demonstrate that significant increases in runtime are possible, it might be promising to further investigate the use of windowing techniques for increasing predictive accuracy, or for increasing the effectiveness of learning with limited memory resources. For the former problem, techniques for combining the results of multiple runs of windowing with different random seeds, as implemented in C4.5 (Quinlan), are quite similar to bagging and boosting techniques and could produce similar results. For learning in the presence of memory limitations, we have found that, in noise-free domains, the curve for the total number of examples submitted to the learning algorithms flattens at the point where 100% training accuracy is reached. Thus, the windowing algorithms need asymptotically less memory than the learning algorithms per se, if one assumes that the testing of learned theories can be performed on disk. However, no guarantees can be made whether the use of windowing will allow a given learning problem to be solved without exceeding a given memory bound. A closer investigation of these issues would also be a rewarding topic for further research.


Currently, the applicability of our algorithms is limited to propositional two-class problems. A straightforward adaptation to multi-class problems can be performed in several ways (Clark & Boswell; Ali & Pazzani; Dietterich & Bakiri). A different approach might adapt the algorithm for learning ordered rule sets, as in CN2 (Clark & Niblett). For example, one could use a separate windowing process for each rule, so that the rules that are successively added to the final theory can be of different classes. Orthogonally, an extension of windowing to first-order learning techniques is another challenging and rewarding task. Note that the subsampling problem in inductive logic programming is considerably harder than for propositional learning. For example, first-order learning systems often allow an implicit definition of training examples (Muggleton), or learn from positive examples only by making some form of closed-world assumption (Quinlan), which prevents the straightforward use of subsampling. Furthermore, the examples in training sets for ILP algorithms can depend on each other. For example, learning a recursive concept often requires that all instantiations of the target predicate that are used in the derivation of a positive example are present in the training data, a property which is not likely to hold in the presence of subsampling.

Conclusion

In this paper, we have presented a re-evaluation of windowing with separate-and-conquer rule learning algorithms. For this type of algorithm, significant gains in computational efficiency are possible without a loss in terms of predictive accuracy. We explain this by the fact that separate-and-conquer rule learning algorithms learn each rule independently, whereas an attribute chosen by a divide-and-conquer decision tree learning algorithm will be part of all rules that are represented by the subtree below this attribute. Based on this finding, we have further demonstrated a more flexible technique for integrating windowing into rule learning algorithms: good rules are immediately added to the final theory, and the covered examples are removed from the window. This avoids relearning these rules in all subsequent iterations of the windowing process, thus reducing the complexity of the learning problem. While most of our results have been obtained in noise-free domains, we believe that the idea of integrative windowing can be generalized for attacking the problem of noise in windowing. To that end, we have outlined three basic problems that have to be solved. A first implementation of straightforward solutions to these problems has achieved promising results in a simple domain with artificial noise.

Acknowledgements

This research is sponsored by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) under grant number JINF (Schrödinger-Stipendium). I would like to thank Hendrik Blockeel, William Cohen, Tom Mitchell, Bernhard Pfahringer, Michèle Sebag, Ashwin Srinivasan, Gerhard Widmer, and the anonymous reviewers of this and previous versions of the paper for helpful suggestions, discussions, and pointers to relevant literature. Thanks are also due to Ray Mooney for making his Common LISP Machine Learning library available, and to the maintainers of the UCI Machine Learning repository.


References

Ali, K. M., & Pazzani, M. J. HYDRA: A noise-tolerant relational concept learning algorithm. In Bajcsy, R. (Ed.), Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Chambery, France. Morgan Kaufmann.

Blum, A. L., & Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence. Special Issue on Relevance.

Blum, A. L., & Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), Madison, WI. ACM. Forthcoming.

Breiman, L. (a). Arcing classifiers. Tech. rep., Statistics Department, University of California at Berkeley.

Breiman, L. (b). Bagging predictors. Machine Learning.

Breiman, L. (c). Pasting bites together for prediction in large data sets and on-line. Unpublished manuscript, available from ftp://ftp.stat.berkeley.edu/users/breiman/pastebite.ps.Z.

Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA.

Caruana, R., & Freitag, D. Greedy attribute selection. In Cohen, W., & Hirsh, H. (Eds.), Proceedings of the International Conference on Machine Learning (ML), New Brunswick, NJ. Morgan Kaufmann.

Catlett, J. (a). Mega-induction: A test flight. In Birnbaum, L., & Collins, G. (Eds.), Proceedings of the International Workshop on Machine Learning (ML), Evanston, IL. Morgan Kaufmann.

Catlett, J. (b). Megainduction: Machine Learning on Very Large Databases. Ph.D. thesis, Basser Department of Computer Science, University of Sydney.

Catlett, J. Peepholing: Choosing attributes efficiently for megainduction. In Proceedings of the International Conference on Machine Learning (ML). Morgan Kaufmann.

Clark, P., & Boswell, R. Rule induction with CN2: Some recent improvements. In Proceedings of the European Working Session on Learning (EWSL), Porto, Portugal. Springer-Verlag.

Clark, P., & Niblett, T. The CN2 induction algorithm. Machine Learning.

Cohen, W. W. Fast effective rule induction. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the International Conference on Machine Learning (ML), Lake Tahoe, CA. Morgan Kaufmann.

Cohn, D. A., Atlas, L., & Ladner, R. Improving generalization with active learning. Machine Learning.

Dagan, I., & Engelson, S. P. Committee-based sampling for training probabilistic classifiers. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the International Conference on Machine Learning (ML), Lake Tahoe, CA. Morgan Kaufmann.

De Raedt, L., & Bruynooghe, M. Indirect relevance and bias in inductive concept learning. Knowledge Acquisition.

desJardins, M., & Gordon, D. F. (Eds.). Special issue on bias evaluation and selection. Machine Learning.

Dietterich, T. G., & Bakiri, G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research.

Domingos, P. (a). Efficient specific-to-general rule induction. In Simoudis, E., & Han, J. (Eds.), Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD). AAAI Press.

Domingos, P. (b). Using partitioning to speed up specific-to-general rule induction. In Proceedings of the AAAI Workshop on Integrating Multiple Learned Models.

Drucker, H., Schapire, R. E., & Simard, P. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence.

Fayyad, U. M., & Irani, K. B. On the handling of continuous-valued attributes in decision tree generation. Machine Learning.

Freund, Y., & Schapire, R. E. Experiments with a new boosting algorithm. In Saitta, L. (Ed.), Proceedings of the International Conference on Machine Learning, Bari, Italy. Morgan Kaufmann.

Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning.

Fürnkranz, J. (a). Dimensionality reduction in ILP: A call to arms. In De Raedt, L., & Muggleton, S. (Eds.), Proceedings of the IJCAI Workshop on Frontiers of Inductive Logic Programming, Nagoya, Japan.

Fürnkranz, J. (b). Knowledge discovery in chess databases: A research proposal. Tech. rep. OEFAI-TR, Austrian Research Institute for Artificial Intelligence.

Fürnkranz, J. (c). More efficient windowing. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Providence, RI. AAAI Press.

Fürnkranz, J. (d). Noise-tolerant windowing. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan. Morgan Kaufmann.

Fürnkranz, J. (e). Pruning algorithms for rule learning. Machine Learning.

Fürnkranz, J. Separate-and-conquer rule learning. Artificial Intelligence Review. In press.

Fürnkranz, J., & Widmer, G. Incremental Reduced Error Pruning. In Cohen, W., & Hirsh, H. (Eds.), Proceedings of the International Conference on Machine Learning (ML), New Brunswick, NJ. Morgan Kaufmann.

Gamberger, D., & Lavrač, N. Conditions for Occam's razor applicability and noise elimination. In van Someren, M., & Widmer, G. (Eds.), Proceedings of the European Conference on Machine Learning (ECML), Prague, Czech Republic. Springer-Verlag.

Holte, R., Acker, L., & Porter, B. Concept learning and the problem of small disjuncts. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Detroit, MI. Morgan Kaufmann.

John, G. H., & Langley, P. Static versus dynamic sampling for data mining. In Simoudis, E., & Han, J. (Eds.), Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD). AAAI Press.

Kivinen, J., & Mannila, H. The power of sampling in knowledge discovery. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).

Kivinen, J., & Mannila, H. Approximate dependency inference from relations. Theoretical Computer Science.

Kohavi, R., & John, G. H. Automatic parameter selection by minimizing estimated error. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the International Conference on Machine Learning (ML), Lake Tahoe, CA. Morgan Kaufmann.

Kohavi, R., & John, G. H. Wrappers for feature subset selection. Artificial Intelligence. Special Issue on Relevance.

Lewis, D. D., & Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the International Conference on Machine Learning (ML), New Brunswick, NJ. Morgan Kaufmann.

Lewis, D. D., & Gale, W. Training text classifiers by uncertainty sampling. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Liere, R., & Tadepalli, P. Active learning with committees for text categorization. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Providence, RI. AAAI Press.

McCallum, A., & Nigam, K. Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), Madison, WI. Morgan Kaufmann.

Michalski, R. S. On the quasi-minimal solution of the covering problem. In Proceedings of the International Symposium on Information Processing (FCIP), Vol. A (Switching Circuits), Bled, Yugoslavia.

Møller, M. Supervised learning on large redundant training sets. International Journal of Neural Systems.

Muggleton, S., Bain, M., Hayes-Michie, J., & Michie, D. An experimental comparison of human and machine learning formalisms. In Proceedings of the International Workshop on Machine Learning (ML). Morgan Kaufmann.

Muggleton, S. H. Inverse entailment and Progol. New Generation Computing. Special Issue on Inductive Logic Programming.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. Learning to classify from labeled and unlabeled documents. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, WI. AAAI Press.

Pfahringer, B. Practical Uses of the Minimum Description Length Principle in Inductive Learning. Ph.D. thesis, Technische Universität Wien.

Quinlan, J. R. Discovering rules by induction from large collections of examples. In Michie, D. (Ed.), Expert Systems in the Micro Electronic Age. Edinburgh University Press.

Quinlan, J. R. Learning efficient classification procedures and their application to chess end games. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.), Machine Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, CA.

Quinlan, J. R. Learning logical definitions from relations. Machine Learning.

Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Sebag, M., & Rouveirol, C. Tractable induction and classification in first order logic via stochastic matching. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan.

Seung, H. S., Opper, M., & Sompolinsky, H. Query by committee. In Proceedings of the Annual ACM Workshop on Computational Learning Theory (COLT).

Toivonen, H. Sampling large databases for association rules. In Proceedings of the 22nd Conference on Very Large Data Bases (VLDB), Mumbai, India.

Turney, P. How to shift bias: Lessons from the Baldwin effect. Evolutionary Computation.

Utgoff, P. E. Shift of bias for inductive concept learning. In Michalski, R., Carbonell, J., & Mitchell, T. (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. II. Morgan Kaufmann, Los Altos, CA.

Wirth, J., & Catlett, J. Experiments on the costs and benefits of windowing in ID3. In Laird, J. (Ed.), Proceedings of the International Conference on Machine Learning (ML), Ann Arbor, MI. Morgan Kaufmann.

Yang, Y. Sampling strategies and learning efficiency in text categorization. In Hearst, M., & Hirsh, H. (Eds.), Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access. AAAI Press Technical Report.