Learning Binary Relations and Total Orders

Sally A. Goldman                         Ronald L. Rivest
Department of Computer Science           MIT Laboratory for Computer Science
Washington University                    Cambridge, MA
St. Louis, MO                            rivest@theory.lcs.mit.edu
sg@cs.wustl.edu

Robert E. Schapire
AT&T Bell Laboratories
Murray Hill, NJ
schapire@research.att.com

Abstract

We study the problem of learning a binary relation between two sets of objects or between a set and itself. We represent a relation between a set of size n and a set of size m as an n × m matrix of bits whose (i, j) entry is 1 if and only if the relation holds between the corresponding elements of the two sets. We present polynomial prediction algorithms for learning binary relations in an extended online learning model, where the examples are drawn by the learner, by a helpful teacher, by an adversary, or according to a uniform probability distribution on the instance space.

In the first part of this paper, we present results for the case that the matrix of the relation has at most k row types. We present upper and lower bounds on the number of prediction mistakes any prediction algorithm makes when learning such a matrix under the extended online learning model. Furthermore, we describe a technique that simplifies the proof of expected mistake bounds against a randomly chosen query sequence.

In the second part of this paper, we consider the problem of learning a binary relation that is a total order on a set. We describe a general technique that uses a fully polynomial randomized approximation scheme (fpras) to implement a randomized version of the halving algorithm. We apply this technique to the problem of learning a total order, using a fpras for counting the number of extensions of a partial order, to obtain a polynomial prediction algorithm that with high probability makes at most n lg n + (lg e) lg n mistakes when an adversary selects the query sequence. We also consider the case that a teacher or the learner selects the query sequence.

(This paper was prepared while all authors were at the MIT Laboratory for Computer Science, with support from an NSF grant, an ARO grant, a DARPA contract, and a grant from the Siemens Corporation. S. Goldman currently receives support from a GE Foundation Junior Faculty Grant and an NSF grant. Preliminary versions of this paper appeared in the Proceedings of the IEEE Symposium on Foundations of Computer Science and as an MIT Laboratory for Computer Science Technical Memo.)

Keywords: Machine Learning, Computational Learning Theory, Online Learning, Mistake-bounded Learning, Binary Relations, Total Orders, Fully Polynomial Randomized Approximation Schemes

1 Introduction

In many domains it is important to acquire information about a relation between two sets. For example, one may wish to learn a "has-part" relation between a set of animals and a set of attributes. We are motivated by the problem of designing a prediction algorithm to learn such a binary relation when the learner has limited prior information about the predicate forming the relation. While one could model such problems as concept learning, they are fundamentally different problems. In concept learning there is a single set of objects and the learner's task is to classify these objects, whereas in learning a binary relation there are two sets of objects and the learner's task is to learn the predicate relating the two sets. Observe that the problem of learning a binary relation can be viewed as a concept learning problem by letting the instances be all ordered pairs of objects from the two sets. However, the ways in which the problem may be structured are quite different when the true task is to learn a binary relation as opposed to a classification rule. That is, instead of a rule that defines which objects belong to the target concept, the predicate defines a relationship between pairs of objects.

A binary relation is defined between two sets of objects. Throughout this paper we assume that one set has cardinality n and the other has cardinality m. We also assume that, for all possible pairings of objects, the predicate relating the two sets of variables is either true or false. Before defining a prediction algorithm, we first discuss our representation of a binary relation. Throughout this paper we represent the relation as an n × m binary matrix in which an entry contains the value of the predicate for the corresponding elements. Since the predicate is binary-valued, all entries in this matrix are either false or true. The two-dimensional structure arises from the fact that we are learning a binary relation.

For the sake of comparison, we now briefly mention other possible representations. One could represent the relation as a table with two columns, where each entry in the first column is an item from the first set and each entry in the second column is an item from the second set. The rows of the table consist of the subset of the potential nm pairings for which the predicate is true. One could also represent the relation as a bipartite graph with n vertices in one vertex set and m vertices in the other set. An edge is placed between two vertices exactly when the predicate is true for the corresponding items.

Having introduced our method for representing the problem, we now informally discuss the basic learning scenario. The learner is repeatedly given a pair of elements, one from each set, and asked to predict the corresponding matrix entry. After making its prediction, the learner is told the correct value of the matrix entry. The learner wishes to minimize the number of incorrect predictions it makes. Since we assume that the learner must eventually make a prediction for each matrix entry, the number of incorrect predictions depends on the size of the matrix.

Unlike typically studied problems, in which the natural measure of the size of the learner's problem is the size of an instance or example, for this problem it is the size of the matrix. Such concept classes with polynomial-sized instance spaces are uninteresting in Valiant's probably approximately correct (PAC) model of learning. In this model, instances are chosen randomly from an arbitrary unknown probability distribution on the instance space. A concept class is PAC-learnable if the learner, after seeing a number of instances that is polynomial in the problem size, can output a hypothesis that with high probability is correct on all but an arbitrarily small fraction of the instances. For concepts whose instance space has cardinality polynomial in the problem size, by asking to see enough instances the learner can see almost all of the probability weight of the instance space. Thus it is not hard to show that these concept classes are trivially PAC-learnable. One goal of our research is to build a framework for studying such problems.

To study learning algorithms for these concept classes, we extend the basic mistake-bound model to the cases that a helpful teacher or the learner selects the query sequence, in addition to the cases where instances are chosen by an adversary or according to a probability distribution on the instance space. Previously, helpful teachers have been used to provide counterexamples to conjectured concepts or to break up the concept into smaller subconcepts. In our framework the teacher only selects the presentation order for the instances.

If the learner is to have any hope of doing better than random guessing, there must be some structure in the relation. Furthermore, since there are so many ways to structure a binary relation, we give the learner some prior knowledge about the nature of this structure. Not surprisingly, the learning task depends greatly on the prior knowledge provided. One way to impose structure is to restrict one set of objects to have relatively few types. For example, a circus may contain many animals but only a few different species. In the first part of this paper we study the case where the learner has a priori knowledge that there are a limited number of object types. Namely, we restrict the matrix representing the relation to have at most k distinct row types. Two rows are of the same type if they agree in all columns. We define a k-binary-relation to be a binary relation for which the corresponding matrix has at most k row types. This restriction is satisfied whenever there are only k types of objects in the set of n objects being considered in the relation. The learner receives no other knowledge about the predicate forming the relation. With this restriction, we prove that any prediction algorithm makes at least (1 − β)km + n⌊lg(βk)⌋ − (1 − β)k⌊lg(βk)⌋ mistakes in the worst case, for any fixed 0 < β ≤ 1, against any query sequence. So for β = 1/2 we get a lower bound of km/2 + (n − k/2)⌊lg(k/2)⌋ on the number of mistakes made by any prediction algorithm. If computational efficiency is not a concern, the halving algorithm makes at most km + (n − k) lg k mistakes against any query sequence. The halving algorithm predicts according to the majority of the feasible relations (or concepts), and thus each mistake halves the number of remaining relations.

We present an efficient algorithm making at most km + (n − k)⌊lg k⌋ mistakes in the case that the learner chooses the query sequence. We prove a tight mistake bound of km + (n − k)(k − 1) in the case that the helpful teacher selects the query sequence. When the adversary selects the query sequence, we present an efficient algorithm for k = 2 that makes at most 2m + n − 2 mistakes, and for arbitrary k we present an efficient algorithm making at most km + n√((k − 1)m) mistakes. We prove that any algorithm makes at least km + (n − k)⌊lg k⌋ mistakes in the case that an adversary selects the query sequence, and use the existence of projective geometries to improve this lower bound to km + (n − k)⌊lg k⌋ + Ω(min{n√m, m√n}) for a large class of algorithms. Finally, we describe a technique to simplify the proof of expected mistake bounds when the query sequence is chosen at random, and use it to prove an O(km + nk√H) expected mistake bound for a simple algorithm. Here H is the maximum Hamming distance between any two rows.

(Footnote: Throughout this paper we use lg to denote the logarithm base 2.)

(Footnote: The mistake bound in the helpful-teacher case is a worst case mistake bound taken over all consistent learners; see Section 2 for formal definitions.)

Another possibility for known structure is the problem of learning a binary relation on a set where the predicate induces a total order on the set. For example, the predicate may be "less than." In the second half of this paper we study the case in which the learner has a priori knowledge that the relation forms a total order. Once again we see that the halving algorithm yields a good mistake bound against any query sequence. This motivates a second goal of this research: to develop efficient implementations of the halving algorithm. We uncover an interesting application of randomized approximation schemes to computational learning theory. Namely, we describe a technique that uses a fully polynomial randomized approximation scheme (fpras) to implement a randomized version of the halving algorithm. We apply this technique, using a fpras due to Dyer, Frieze, and Kannan and to Matthews for counting the number of linear extensions of a partial order, to obtain a polynomial prediction algorithm that makes at most n lg n + (lg e) lg n mistakes with very high probability against an adversary-selected query sequence. The small probability of making too many mistakes is determined by the coin flips of the learning algorithm and not by the query sequence selected by the adversary. We contrast this result with an O(n) mistake bound when the learner selects the query sequence and an O(n) mistake bound when a teacher selects the query sequence.

The remainder of this paper is organized as follows. In the next section we formally introduce the basic problem, the learning scenario, and the extended mistake bound model. In Section 3 we present our results for learning k-binary-relations. We first give a motivating example and present some general mistake bounds; in the following subsections we consider query sequences selected by the learner, by a helpful teacher, by an adversary, or at random. In Section 4 we turn our attention to the problem of learning total orders. We begin by discussing the relationship between the halving algorithm and approximate counting schemes; in particular, we describe how a fpras can be used to implement an approximate halving algorithm. We then present our results on learning a total order. Finally, in Section 5 we conclude with a summary and discussion of related open problems.

2 Learning Scenario and Mistake Bound Model

In this section we give formal definitions and discuss the learning scenario used in this paper. To be consistent with the literature, we discuss these models in terms of concept learning. As we have mentioned, the problem of learning a binary relation can be viewed in this framework by letting the instance space be all pairs of objects, one from each of the two sets.

A concept c is a Boolean function on some domain of instances. A concept class C is a family of concepts. The learner's goal is to infer some unknown target concept chosen from some known concept class. Often C is decomposed into subclasses C_n according to some natural dimension measure n. That is, for each n ≥ 1, let X_n denote a finite learning domain. Let X = ∪_{n≥1} X_n, and let x ∈ X denote an instance. To illustrate these definitions, we consider the concept class of monomials. A monomial is a conjunction of literals, where each literal is either some Boolean variable or its negation. For this concept class, n is just the number of variables. Thus |X_n| = 2^n, where each x ∈ X_n is chosen from {0, 1}^n and represents an assignment for each variable. For each n ≥ 1, let C_n be a family of concepts on X_n, and let C = ∪_{n≥1} C_n denote a concept class over X. For example, if C_n contains the monomials over n variables, then C is the class of all monomials. Given any concept c ∈ C_n, we say that x is a positive instance of c if c(x) = 1, and x is a negative instance of c if c(x) = 0. In our example, the target concept for the class of monomials over five variables might be a particular conjunction of literals; any assignment satisfying the conjunction is a positive instance, and any assignment falsifying it is a negative instance. Finally, the hypothesis space of algorithm A is simply the set of all hypotheses (or rules) h that A may output. A hypothesis for C_n must make a prediction for each x ∈ X_n.

A prediction algorithm for C is an algorithm that runs under the following scenario. A learning session consists of a sequence of trials. In each trial, the learner is given an unlabeled instance x ∈ X_n. The learner uses its current hypothesis to predict whether x is a positive or a negative instance of the target concept c ∈ C_n, and then the learner is told the correct classification of x. If the prediction is incorrect, the learner has made a mistake. Note that in this model there is no training phase; instead, the learner receives unlabeled instances throughout the entire learning session. However, after each prediction the learner discovers the correct classification, and this feedback can then be used by the learner to improve its hypothesis. A learner is consistent if on every trial there is some concept in C_n that agrees both with the learner's prediction and with all the labeled instances observed on preceding trials.
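To make the trial structure concrete, the following minimal sketch (our own illustration, not part of the original model definition) shows one learning session; the learner object with predict and update methods and the target function are hypothetical stand-ins for the prediction algorithm and the feedback source.

    def learning_session(learner, query_sequence, target):
        """Run one online learning session and count prediction mistakes.

        learner must provide predict(x) and update(x, label);
        target(x) returns the true classification of instance x.
        """
        mistakes = 0
        for x in query_sequence:           # one trial per instance
            guess = learner.predict(x)     # prediction from the current hypothesis
            label = target(x)              # feedback: the correct classification
            if guess != label:
                mistakes += 1              # the learner has made a mistake
            learner.update(x, label)       # feedback may be used to revise the hypothesis
        return mistakes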

The number of mistakes made by the learner depends on the sequence of instances presented. We extend the mistake bound model to include several methods for the selection of instances. A query sequence σ is a permutation ⟨x₁, x₂, ..., x_{|X_n|}⟩ of X_n, where x_t is the instance presented to the learner at the t-th trial. We call the agent selecting the query sequence the director. We consider the following directors.

Learner: The learner chooses σ. To select x_t, the learner may use time polynomial in n and all information obtained in the first t − 1 trials. In this case we say that the learner is self-directed.

Helpful Teacher: A teacher who knows the target concept and wants to minimize the learner's mistakes chooses σ. To select x_t, the teacher uses knowledge of the target concept, of x₁, ..., x_{t−1}, and of the learner's predictions on x₁, ..., x_{t−1}. To avoid allowing the learner and teacher to have a coordinated strategy, in this scenario we consider the worst case mistake bound over all consistent learners. In this case we say the learner is teacher-directed.

Adversary: The adversary who selected the target concept chooses σ. This adversary, who tries to maximize the learner's mistakes, knows the learner's algorithm and has unlimited computing power. In this case we say the learner is adversary-directed.

Random: In this model, σ is selected randomly according to a uniform probability distribution on the permutations of X_n. Here the number of mistakes made by the learner for some target concept c in C_n is defined to be the expected number of mistakes over all possible query sequences. In this case we say the learner is randomly-directed.

We consider how a prediction algorithm's performance depends on the director. Namely, we let MB_Z(A, C_n) denote the worst case number of mistakes made by A for any target concept in C_n when the query sequence is provided by Z. When Z = adversary, MB_Z(A, C_n) = M_A(C_n) in the notation of Littlestone. We say that A is a polynomial prediction algorithm if A makes each prediction in time polynomial in n.

3 Learning Binary Relations

In this section we apply the learning scenario of the extended mistake bound model to the concept class C of k-binary-relations. For this concept class the dimension measure is denoted by n and m, and X_{n,m} = {1, ..., n} × {1, ..., m}. An instance (i, j) is in the target concept c ∈ C_{n,m} if and only if the matrix entry in row i and column j is a 1. So in each trial the learner is given an instance x from X_{n,m} and asked to predict the corresponding matrix entry. After making its prediction, the learner is told the correct value of the matrix entry. The learner wishes to minimize the number of incorrect predictions it makes during a learning session, in which the learner must eventually make a prediction for each matrix entry.

We begin this section with a motivating example from the domain of allergy testing. We use this example to motivate both the restriction that the matrix has k row types and the use of the extended mistake bound model. We then present general upper and lower bounds on the number of mistakes made by the learner, regardless of the director. Finally, we study the complexity of learning a k-binary-relation under each director.

3.1 Motivation: Allergist Example

In this section we use the following example, taken from the domain of allergy testing, to motivate the problem of learning a k-binary-relation.

Consider an allergist with a set of patients to be tested for a given set of allergens. Each patient is either highly allergic, mildly allergic, or not allergic to any given allergen. The allergist may use either an epicutaneous (scratch) test, in which the patient is given a fairly low dose of the allergen, or an intradermal (under the skin) test, in which the patient is given a larger dose of the allergen. The patient's reaction to the test is classified as strong positive, weak positive, or negative. Figure 1 describes the reaction that occurs for each combination of allergy level and dosage level. Finally, we assume a strong positive reaction is extremely uncomfortable to the patient, but not dangerous.

What options does the allergist have in testing a patient for a given allergen? He could just perform the intradermal test (option 0). Another option (option 1) is to perform an epicutaneous test and, if it is not conclusive, then perform an intradermal test. See Figure 2 for decision trees describing these two testing options. Which testing option is best?

                       Epicutaneous (Scratch)    Intradermal (Under the Skin)
    Not Allergic       negative                  negative
    Mildly Allergic    negative                  weak positive
    Highly Allergic    weak positive             strong positive

Figure 1: Summary of testing reactions for the allergy testing example.

[Figure 2: The testing options available to the allergist. Option #0: perform the under-the-skin test directly (negative → not allergic; weak positive → mildly allergic; strong positive → highly allergic). Option #1: perform the scratch test first (weak positive → highly allergic); on a negative result, follow with the under-the-skin test (negative → not allergic; weak positive → mildly allergic).]

If the patient has no allergy or a mild allergy to the given allergen, then testing option 0 is best, since the patient need not return for the second test. However, if the patient is highly allergic to the given allergen, then testing option 1 is best, since the patient does not experience a bad reaction. We assume the inconvenience of going to the allergist twice is approximately the same as having a bad reaction; that is, the allergist has no incentive to err in a particular direction. While the allergist's final goal is to determine each patient's allergies, we consider the problem of learning the optimal testing option for each combination of patient and allergen.

The allergist interacts with the environment as follows. In each trial the allergist is asked to predict the best testing option for a given patient-allergen pair. He is then told the testing results, thus learning whether the patient is not allergic, mildly allergic, or highly allergic to the given allergen. In other words, the allergist receives feedback as to the correct testing option. Note that we make no restrictions on how the hypothesis is represented, as long as it can be evaluated in polynomial time. In other words, all we require is that, given any patient-allergen pair, the allergist decides which test to perform in a reasonable amount of time.

How can the allergist possibly predict a patient's allergies? If the allergies of the patients are completely random, then there is not much hope. What prior knowledge does the allergist have? He knows that people often have exactly the same allergies, so there is a set of allergy types that occur often. (We do not assume that the allergist has a priori knowledge of the actual allergy types.) This knowledge can help guide the allergist's predictions.

Having specified the problem, we discuss our choice of the extended mistake bound model to evaluate learning algorithms for this problem. First of all, observe that we want an online model: there is no training phase here, and the allergist wants to predict the correct testing option for each patient-allergen pair. Also, we expect that the allergist has time to test each patient for each allergen; that is, the instance space is polynomial-sized. Thus, as discussed in Section 1, the distribution-free model is not appropriate.

How should we judge the performance of the learning algorithm? For each wrong prediction made, a patient is inconvenienced either with making a second trip or with having a bad reaction. Since the learner wants to give all patients the best possible service, he strives to minimize the number of incorrect predictions made. Thus we want to use the absolute mistake bound success criterion. Namely, we judge the performance of the learning algorithm by the number of incorrect predictions made during a learning session in which he must eventually test each patient for each allergen.

Up to now, the standard online model using absolute mistake bounds to judge the learner appears to be the appropriate model. We now discuss the selection of the instances. Since the allergist has no control over the target relation (i.e., the allergies of his patients), it makes sense to view the feedback as coming from an adversary. However, do we really want an adversary to select the presentation order for the instances? It could be that the allergist is working for a cosmetic company, and due to restrictions of the Food and Drug Administration and the cosmetic company, the allergist is essentially told when to test each person for each allergen; in this case it is appropriate to have an adversary select the presentation order. However, in the typical situation the allergist can decide in what order to perform the testing so that he can make the best predictions possible; in this case we want to allow the learner to select the presentation order. One could also imagine a situation in which an intern is being guided by an experienced allergist, and thus a teacher helps to select the presentation order. Finally, random selection of the presentation order may provide us with a better feeling for the behavior of an algorithm.

3.2 Learning k-Binary-Relations

In this section we begin our study of learning k-binary-relations by presenting general lower and upper bounds on the mistakes made by the learner, regardless of the director.

Throughout this section we use the following notation. We say an entry (i, j) of the matrix M is known if the learner was previously presented that entry. We assume without loss of generality that the learner is never asked to predict the value of a known entry. We say rows i and i' are consistent, given the current state of knowledge, if M_{ij} = M_{i'j} for all columns j in which both entries (i, j) and (i', j) are known.

We now look at general lower and upper bounds on the number of mistakes that apply for all directors. First of all, note that k ≤ 2^m, since there are only 2^m possible row types for a matrix with m columns. Clearly, any learning algorithm makes at least km mistakes for some matrix, regardless of the query sequence: the adversary can divide the rows into k groups and, for each group, reply that the prediction was incorrect for the first entry queried in each column of that group. We generalize this approach to force mistakes for more than one row of each type.

[Figure 3: The final matrix created by the adversary in the proof of Theorem 1. The matrix has n rows and m columns; the unmarked area consists of the first ⌊lg(βk)⌋ columns together with the first (1 − β)k rows. All entries in the unmarked area contain the bit not predicted by the learner; that is, a mistake is forced on each entry in the unmarked area. All entries in the marked area are zero.]

Theorem 1: For any 0 < β ≤ 1, any prediction algorithm makes at least (1 − β)km + n⌊lg(βk)⌋ − (1 − β)k⌊lg(βk)⌋ mistakes, regardless of the query sequence.

Proof: The adversary selects its feedback for the learner's predictions as follows. For each entry in the first ⌊lg(βk)⌋ columns, the adversary replies that the learner's response is incorrect; at most βk new row types are created by this action. Likewise, for each entry in the first (1 − β)k rows, the adversary replies that the learner's response is incorrect; this creates at most (1 − β)k new row types. The adversary makes all remaining entries in the matrix zero (see Figure 3). The number of mistakes is at least the area of the unmarked region. Thus the adversary has forced at least (1 − β)km + n⌊lg(βk)⌋ − (1 − β)k⌊lg(βk)⌋ mistakes while creating at most βk + (1 − β)k = k row types.

By letting β = 1/2 we obtain the following corollary.

Corollary 2: Any algorithm makes at least km/2 + (n − k/2)⌊lg(k/2)⌋ mistakes in the worst case, regardless of the query sequence.

If computational efficiency is not a concern, then for all query sequences the halving algorithm provides a good mistake bound.

Observation 3: The halving algorithm achieves a km + (n − k) lg k mistake bound.

Proof: We use a simple counting argument on the size of the concept class C_{n,m}. There are at most 2^{km} ways to select the k row types, and k^{n−k} ways to assign one of the k row types to each of the remaining n − k rows. Thus |C_{n,m}| ≤ 2^{km} k^{n−k}. Littlestone proves that the halving algorithm makes at most lg |C_{n,m}| mistakes. Thus the number of mistakes made by the halving algorithm for this concept class is at most lg(2^{km} k^{n−k}) = km + (n − k) lg k.
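As an illustration of the halving rule itself (not a polynomial prediction algorithm for this class, since the version space is exponentially large here), a minimal sketch follows; concepts are modeled as hypothetical predicate functions over instances.

    def halving_predict(version_space, x):
        """Predict by a majority vote over all concepts that are still
        consistent with the feedback seen so far (the version space)."""
        votes_for_positive = sum(1 for c in version_space if c(x))
        return votes_for_positive * 2 >= len(version_space)

    def halving_update(version_space, x, label):
        """Discard every concept that disagrees with the revealed label;
        after a mistake, at least half of the version space is removed."""
        return [c for c in version_space if c(x) == label]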

In the remainder of this section we study efficient prediction algorithms designed to perform well against each of the directors. In some cases we are also able to prove lower bounds that are better than that of Theorem 1. In Section 3.3 we consider the case that the query sequence is selected by the learner. We study the helpful-teacher director in Section 3.4. In Section 3.5 we consider the case of an adversary director. Finally, in Section 3.6 we consider the case in which the instances are drawn uniformly at random from the instance space.

3.3 Self-Directed Learning

In this section we present an efficient algorithm for the case of self-directed learning.

Theorem 4: There exists a polynomial prediction algorithm that achieves a km + (n − k)⌊lg k⌋ mistake bound with a learner-selected query sequence.

Proof: The query sequence selected simply specifies the entries of the matrix in row-major order. The learner begins by assuming there is only one row type. Let k' denote the learner's current estimate for k; initially k' = 1. For the first row, the learner guesses each entry; this row becomes the template for the first row type. Next, the learner assumes that the second row is the same as the first row. If he makes a mistake, then the learner revises his estimate for k to be k' = 2, guesses for the rest of the row, and uses that row as the template for the second row type. In general, to predict M_{ij} the learner predicts according to a majority vote of the recorded row templates that are consistent with row i, breaking ties arbitrarily. Thus, if a mistake is made, then at least half of the row types can be eliminated as the potential type of row i. If more than ⌊lg k'⌋ mistakes are made in a row, then a new row type has been found; in this case k' is incremented, the learner guesses for the rest of the row, and makes this row the template for row type k'.

How many mistakes are made by this algorithm? Clearly, at most m mistakes are made for the first row found of each of the k types. For each of the remaining n − k rows, since k' ≤ k, at most ⌊lg k⌋ mistakes are made.
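A minimal sketch of this self-directed strategy follows, assuming a hypothetical feedback oracle reveal(i, j) that returns the true bit of the target matrix; ties in the majority vote are broken toward 1, and a completed row is stored as a new template whenever it matches none of the existing templates.

    import random

    def self_directed_learn(reveal, n, m):
        """Learn an n x m binary matrix entry by entry in row-major order,
        predicting each entry by a majority vote over the stored row
        templates that are still consistent with the current row."""
        templates = []                     # one completed row per row type seen so far
        M = [[None] * m for _ in range(n)]
        mistakes = 0
        for i in range(n):
            for j in range(m):
                consistent = [t for t in templates
                              if all(M[i][c] == t[c] for c in range(j))]
                if consistent:
                    ones = sum(t[j] for t in consistent)
                    guess = 1 if 2 * ones >= len(consistent) else 0
                else:
                    guess = random.randint(0, 1)   # no template left: guess
                M[i][j] = reveal(i, j)             # feedback reveals the true bit
                if guess != M[i][j]:
                    mistakes += 1
            if M[i] not in templates:
                templates.append(list(M[i]))       # first completed row of a new type
        return M, mistakes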

Observe that this upper bound is within a constant factor of the lower bound of Corollary 2. Furthermore, we note that this algorithm need not know k a priori. In fact, it obtains the same mistake bound even if an adversary tells the learner which row to examine and in what order to predict the columns, provided that the learner sees all of a row before going on to the next. As we will later see, this problem becomes harder if the adversary can select the query sequence without restriction.

3.4 Teacher-Directed Learning

In this section we present upper and lower bounds on the number of mistakes made under the helpful-teacher director. Recall that in this model we consider the worst case mistake bound over all consistent learners. Thus the question asked here is: what is the minimum number of matrix entries a teacher must reveal so that there is a unique completion of the matrix? That is, until there is a unique completion of the partial matrix, a mistake could be made on the next prediction.

We now prove an upper bound on the number of entries needed to uniquely define the target matrix.

Theorem 5: The number of mistakes made with a helpful teacher as the director is at most km + (n − k)(k − 1).

Proof: First, the teacher presents the learner with one row of each type. For each of the remaining n − k rows, the teacher presents an entry to distinguish the given row from each of the k − 1 incorrect row types. After these km + (n − k)(k − 1) entries have been presented, we claim that there is a unique matrix with at most k row types that is consistent with the partial matrix. Since all k distinct row types have been revealed in the first stage, all remaining rows must be the same as one of the first k rows presented. However, each of the remaining rows has been shown to be inconsistent with all but one of these k row templates.
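The teacher's two-stage strategy can be sketched as follows (our own illustration; M is the full target matrix given as a list of 0/1 rows, and the function returns the km + (n − k)(k − 1) entries the teacher would reveal).

    def teaching_sequence(M):
        """Return (row, column) index pairs that a helpful teacher could reveal
        so that M becomes the unique completion with at most k row types."""
        templates = {}                  # row type -> index of the first row of that type
        for i, row in enumerate(M):
            templates.setdefault(tuple(row), i)

        queries = []
        # Stage 1: reveal one complete row of each of the k row types.
        for key, i in templates.items():
            queries.extend((i, j) for j in range(len(key)))
        # Stage 2: for every remaining row, reveal one entry distinguishing it
        # from each of the k - 1 incorrect row types.
        for i, row in enumerate(M):
            key = tuple(row)
            if templates[key] == i:
                continue                # already revealed in full during stage 1
            for other in templates:
                if other != key:
                    j = next(c for c in range(len(row)) if other[c] != row[c])
                    queries.append((i, j))
        return queries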

Is Theorem 5 the best such result possible? Clearly the teacher must present a row of each type. But in general, is it really necessary to present k − 1 entries of the remaining rows to uniquely define the matrix? We now answer this question in the affirmative by presenting a matching lower bound.

Theorem 6: The number of mistakes made with a helpful teacher as the director is at least min{nm, km + (n − k)(k − 1)}.

[Figure 4: The matrix created by the adversary against the helpful-teacher director. In this example there are 5 row types, which appear in the first five rows of the matrix: the first row type is all zeros, and row type z (for 2 ≤ z ≤ 5) has a single one in column z − 1.]

Proof: The adversary selects the following matrix. The first row type consists of all zeros. For 2 ≤ z ≤ min{m + 1, k}, row type z contains z − 2 zeros, followed by a one, followed by m − z + 1 zeros. The first k rows are each assigned to be a different one of the k row types. Each remaining row is assigned to be of the first row type (see Figure 4). Until there is a unique completion of the partial matrix, by definition there exists a consistent learner that could make a mistake. Clearly, if the learner has not seen each column of each row type, then the final matrix is not uniquely defined; this part of the argument accounts for km mistakes. When m ≥ k − 1, for each of the remaining rows, unless all of the first k − 1 columns are known there is some row type besides the first row type that is consistent with the given row; this argument accounts for (n − k)(k − 1) mistakes. Likewise, when m < k − 1, if any of the first m columns are not known, then there is some row type besides the first row type that is consistent with the given row; this accounts for (n − k)m mistakes. Thus the total number of mistakes is at least min{nm, km + (n − k)(k − 1)}.

Due to the requirement that mistake bounds in the teacher-directed case apply to all consistent learners, we note that it is possible to get mistake bounds that are not as good as those obtained when the learner is self-directed. Recall that in the previous section we proved a km + (n − k)⌊lg k⌋ mistake bound for the learner director. This bound is better than that obtained with a teacher because the learner uses a majority vote among the known row types for making predictions, whereas a consistent learner may use a minority vote and could thus make km + (n − k)(k − 1) mistakes.

3.5 Adversary-Directed Learning

In this section we derive upper and lower bounds on the number of mistakes made when the adversary is the director. We first present a stronger information-theoretic lower bound on the number of mistakes an adversary can force the learner to make. Next, we present an efficient prediction algorithm that achieves an optimal mistake bound when k = 2. We then consider the related problem of computing the minimum number of row types needed to complete a partially known matrix. Finally, we consider learning algorithms that work against an adversary for arbitrary k.

We now present an information-theoretic lower bound on the number of mistakes made by any prediction algorithm when the adversary selects the query sequence. We obtain this result by modifying the technique used in Theorem 1.

Theorem 7: Any prediction algorithm makes at least min{nm, km + (n − k)⌊lg k⌋} mistakes against an adversary-selected query sequence.

Proof: The adversary starts by presenting all entries in the first ⌊lg k⌋ columns (or all m columns if m < ⌊lg k⌋), replying that each prediction is incorrect. If m ≥ ⌊lg k⌋, this step causes the learner to make n⌊lg k⌋ mistakes; otherwise, this step causes the learner to make nm mistakes. Each row can now be classified as one of k row types. Next, the adversary presents the remaining columns for one row of each type, again replying that each prediction is incorrect. For m ≥ ⌊lg k⌋, this step causes the learner to make k(m − ⌊lg k⌋) additional mistakes. For the remaining matrix entries, the adversary replies as dictated by the completed row of the same row type as the given row. So the number of mistakes made by the learner is at least min{nm, n⌊lg k⌋ + k(m − ⌊lg k⌋)} = min{nm, km + (n − k)⌊lg k⌋}.

3.5.1 Special Case: k = 2

We now consider efficient prediction algorithms for learning the matrix under an adversary-selected query sequence. Recall that if efficiency is not a concern, the halving algorithm makes at most km + (n − k) lg k mistakes. In this section we consider the case that k = 2 and present an efficient prediction algorithm that performs optimally.

Theorem 8: There exists a polynomial prediction algorithm that makes at most 2m + n − 2 mistakes against an adversary-selected query sequence for k = 2.

Proof: The algorithm uses a graph G whose vertices correspond to the rows of the matrix and that initially has no edges. To predict M_{ij}, the algorithm 2-colors the graph G and then:

1. If no entry of column j is known, it guesses randomly.
2. Else, if every known entry of column j is zero (respectively, one), it guesses zero (respectively, one).
3. Else, it finds a row i' assigned the same color as i and known in column j, and guesses M_{i'j}.

Finally, after the prediction is made and the feedback received, the graph G is updated by adding an edge (i, i') to G for each row i' known in column j for which M_{ij} ≠ M_{i'j}. Note that one of the above cases always applies. Also, since k = 2, it will always be possible to find a 2-coloring (edges join only rows known to be of different types, so G is bipartite).

How many mistakes can this algorithm make? It is not hard to see that cases 1 and 2 each occur at most once for every column, so at most m mistakes are made in each of these cases. Furthermore, the first edge of G is added by a case 2 mistake, and case 3 cannot apply until such an edge exists; so by the time the first case 3 prediction is made, G has at most n − 1 connected components. We now argue that each case 3 mistake reduces the number of connected components of G by at least one. We use a proof by contradiction: assume that a case 3 mistake does not reduce the number of connected components. Then it follows that the edge e = (v₁, v₂) added to G must form a cycle (see Figure 5). We now separately consider the cases that this cycle contains an odd number of edges or an even number of edges.

Case 1: Odd-length cycle. Since G is known to be 2-colorable, this case cannot occur.

Case 2: Even-length cycle. Before adding e, since v₁ and v₂ were connected by a path with an odd number of edges, in any legal coloring they must have been assigned different colors. Since step 3 of the algorithm picks nodes of the same color, an edge could never have been placed between v₁ and v₂. Thus we again have a contradiction.

In both cases we reach a contradiction, and thus we have shown that every case 3 mistake reduces the number of connected components of G.

[Figure 5: The situation occurring if a case 3 mistake does not reduce the number of connected components of G. The thick grey edges and the thick black edge show the cycle created in G; e, shown as a thick black edge, is the edge added between v₁ and v₂ to form the cycle.]

Thus, after at most n − 2 case 3 mistakes, G must be connected, and hence there is a unique 2-coloring of G and no more case 3 mistakes can occur. Thus the worst case number of mistakes made by this algorithm is 2m + n − 2.
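A sketch of this k = 2 algorithm follows, assuming the target matrix really has at most two row types (so that the inconsistency graph G stays bipartite); the class below recolors G from scratch on every prediction, which is slower than necessary but keeps the sketch short.

    import random
    from collections import defaultdict, deque

    class TwoTypePredictor:
        """Rows are vertices of a graph G; an edge joins two rows known to be
        inconsistent, and predictions are read off a 2-coloring of G."""

        def __init__(self, n, m):
            self.n, self.m = n, m
            self.known = {}                       # (i, j) -> revealed bit
            self.edges = defaultdict(set)         # adjacency lists of G

        def _two_color(self):
            color = {}
            for s in range(self.n):               # BFS-color every component
                if s in color:
                    continue
                color[s] = 0
                queue = deque([s])
                while queue:
                    u = queue.popleft()
                    for v in self.edges[u]:
                        if v not in color:
                            color[v] = 1 - color[u]
                            queue.append(v)
            return color

        def predict(self, i, j):
            col = [(r, b) for (r, c), b in self.known.items() if c == j]
            if not col:
                return random.randint(0, 1)                   # case 1
            if len({b for _, b in col}) == 1:
                return col[0][1]                              # case 2
            color = self._two_color()
            for r, b in col:                                  # case 3
                if color[r] == color[i]:
                    return b
            return random.randint(0, 1)   # fallback; not needed with two row types

        def update(self, i, j, value):
            for (r, c), b in self.known.items():
                if c == j and b != value:                     # observed inconsistency
                    self.edges[i].add(r)
                    self.edges[r].add(i)
            self.known[(i, j)] = value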

Note that for k = 2 this upper bound matches the information-theoretic lower bound of Theorem 7. We also note that if there is only one row type, then the algorithm given in Theorem 8 makes at most m mistakes, matching the information-theoretic lower bound. An interesting theoretical question is to find a linear mistake bound for constant k when provided with a k-colorability oracle. However, such an approach would have to be greatly modified to yield a polynomial prediction algorithm, since a polynomial-time k-colorability oracle exists only if P = NP. Furthermore, even good polynomial-time approximations to a k-colorability oracle are not known.

The remainder of this section focuses on designing polynomial prediction algorithms for the case that the matrix has at least three row types. One approach that may seem promising is to make predictions as follows: compute a matrix that is consistent with all known entries and that has the fewest possible row types, and then use this matrix to make the next prediction. We now show that even computing the minimum number of row types needed to complete a partially known matrix is NP-complete. Formally, we define the matrix k-complexity problem as follows: given an n × m binary matrix M that is partially known, decide if there is some matrix with at most k row types that is consistent with M. The matrix k-complexity problem can be shown to be NP-complete by a reduction from graph k-colorability, for any fixed k ≥ 3.

(Footnote: Two colorings that are identical under a renaming of the colors are considered to be the same.)

[Figure 6: An example of the reduction used in Theorem 9. The graph G is the instance of the graph coloring problem; the partial matrix M is the instance of the matrix complexity problem. There exists a completion of M that uses only three row types; the corresponding coloring of G is demonstrated by the node colorings used in G.]

Theorem 9: For fixed k ≥ 3, the matrix k-complexity problem is NP-complete.

Proof: Clearly this problem is in NP, since we can easily verify that a guessed matrix has at most k row types and is consistent with the given partial matrix.

To show that the problem is NP-hard, we use a reduction from graph k-colorability. Given an instance G = (V, E) of graph k-colorability, we transform it into an instance of the matrix k-complexity problem. Let m = n = |V|. For each edge {v_i, v_j} ∈ E, we add entries to the matrix so that row i and row j cannot be of the same row type. Specifically, for each vertex v_i we set M_{ii} = 1, and M_{ji} = 0 for each neighbor v_j of v_i. An example demonstrating this reduction is given in Figure 6.

We now show that there is some matrix of at most k row types that is consistent with this partial matrix if and only if G is k-colorable. We first argue that if there is a matrix M' consistent with M that has at most k row types, then G is k-colorable. By construction, if two rows are of the same type, there cannot be an edge between the corresponding nodes; so just let the color of each node be the type of the corresponding row in M'.

Conversely, if G is k-colorable, then there exists a matrix M' consistent with M that has at most k row types. By the construction of M, if a set of vertices are the same color in G, then the corresponding rows are consistent with each other. Thus there exists a matrix with at most k row types that is consistent with M.
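The construction used in this reduction is straightforward to state in code; the following sketch builds the partial matrix (None marks an unknown entry) from a graph given by its number of vertices and an edge list of index pairs.

    def colorability_to_matrix(n, edges):
        """Build the partial n x n matrix used in the reduction above.
        Rows i and j are forced to be different row types exactly when
        {i, j} is an edge of the graph."""
        M = [[None] * n for _ in range(n)]
        for i in range(n):
            M[i][i] = 1                   # M[i][i] = 1 for every vertex v_i
        for i, j in edges:
            M[j][i] = 0                   # neighbor v_j of v_i gets a 0 in column i
            M[i][j] = 0                   # and symmetrically for v_i in column j
        return M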

3.5.2 Row-Filter Algorithms

In this section we study the performance of a whole class of algorithms designed to learn a matrix with arbitrary complexity k when an adversary selects the query sequence. We say that an algorithm A is a row-filter algorithm if A makes its prediction for M_{ij} strictly as a function of j and of all entries in the set I of rows that are consistent with row i and defined in column j. That is, A's prediction is f(I, j), where f is some (possibly probabilistic) function. So, to make a prediction for M_{ij}, a row-filter algorithm considers all rows that could be the same type as row i and whose value for column j is known, and uses these rows in any way one could imagine to make a prediction. For example, it could take a majority vote on the entries in column j of all rows that are consistent with row i. Or, of the rows defined in column j, it could select the row that has the most known values in common with row i and predict according to its entry in column j. We have found that many of the prediction algorithms we considered are row-filter algorithms.

Consider the simple row-filter algorithm ConsMajorityPredict, in which f(I, j) computes the majority vote of the entries in column j of the rows in I, guessing randomly in the case of a tie. Note that ConsMajorityPredict takes only time linear in the number of known entries of the matrix to make a prediction. A short sketch of this rule is given below.
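The sketch is our own illustration; M is the partially known matrix, with None marking unknown entries.

    import random

    def cons_majority_predict(M, i, j):
        """Row-filter prediction for entry (i, j): take a majority vote over
        column j of the rows that are consistent with row i and defined in
        column j, guessing randomly on a tie or when no such row exists."""
        m = len(M[0])

        def consistent(r):
            return all(M[i][c] is None or M[r][c] is None or M[i][c] == M[r][c]
                       for c in range(m))

        votes = [M[r][j] for r in range(len(M))
                 if r != i and M[r][j] is not None and consistent(r)]
        if not votes or 2 * sum(votes) == len(votes):
            return random.randint(0, 1)        # no information, or a tie: guess
        return 1 if 2 * sum(votes) > len(votes) else 0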

We now give an upper bound on the number of mistakes made by ConsMajorityPredict.

Theorem 10: The algorithm ConsMajorityPredict makes at most km + n√((k − 1)m) mistakes against an adversary-selected query sequence.

Proof: For all i, let d(i) be the number of rows consistent with row i. We define the potential of a partially known matrix to be Φ = Σ_{i=1}^{n} d(i). We first consider how much the potential function can change over the entire learning session.

Lemma 11: The potential function Φ decreases by at most n²(k − 1)/k during the learning session.

Proof: Initially, for all i, d(i) = n, so Φ_init = n². Let C(z) be the number of rows of type z, for 1 ≤ z ≤ k. By definition, Φ_final = Σ_{z=1}^{k} C(z)². Thus our goal is to minimize Σ_{z=1}^{k} C(z)² under the constraint that Σ_{z=1}^{k} C(z) = n. Using the method of Lagrange multipliers, we obtain that Φ_final is minimized when C(z) = n/k for all z. Thus Φ_final ≥ k(n/k)² = n²/k. So Φ_init − Φ_final ≤ n² − n²/k = n²(k − 1)/k.

Now that the total decrease in Φ over the learning session is bounded, we need to determine how many mistakes can be made without decreasing Φ by more than n²(k − 1)/k. We begin by noting that Φ is strictly non-increasing: once two rows are found to be inconsistent, they remain inconsistent. So, to bound the number of mistakes made by ConsMajorityPredict, we must compute a lower bound on the amount Φ is decreased by each mistake. Intuitively, one expects Φ to decrease by larger amounts as more of the matrix is seen. We formalize this intuition in the next two lemmas. For a given row type z, let B(j, z) denote the set of matrix entries that are in column j of a row of type z.

Lemma 12: The r-th mistake made when predicting an entry in B(j, z) causes Φ to decrease by at least 2(r − 1).

Proof: Suppose that this mistake occurs in predicting entry (i, j), where row i is of type z. Consider all the rows of type z. Since r − 1 mistakes have already occurred when predicting entries in B(j, z), at least r − 1 entries of B(j, z) are known. Since ConsMajorityPredict is a row-filter algorithm, these rows must be in I. Furthermore, ConsMajorityPredict uses a majority-voting scheme, and thus if a mistake occurs there must be at least r − 1 entries in I (and thus consistent with row i) that differ in column j from row i. Thus, if a mistake is made, row i is found to be inconsistent with at least r − 1 rows it was thought to be consistent with. When two previously consistent rows are found to be inconsistent, Φ decreases by two. Thus the total decrease in Φ caused by the r-th mistake made when predicting an entry in B(j, z) is at least 2(r − 1).

From Lemma 12 we see that the more entries known in B(j, z), the greater the decrease in Φ for future mistakes on such entries. So intuitively, it appears that the adversary can maximize the number of mistakes made by the learner by balancing the number of entries seen in B(j, z) over all j and z. We prove that this intuition is correct and apply it to obtain a lower bound on the amount Φ must have decreased after the learner has made μ mistakes.

Lemma 13: After μ ≥ km mistakes are made, the total decrease in Φ is at least μ²/(km) − μ.

Proof: From Lemma 12, after the r-th mistake made in predicting an entry from B(j, z), the total decrease in Φ from its initial value is at least Σ_{x=1}^{r} 2(x − 1) = r(r − 1). Let W(j, z) be the number of mistakes made in column j of rows of type z. The total decrease in Φ is then at least

    D = Σ_{j=1}^{m} Σ_{z=1}^{k} W(j, z)(W(j, z) − 1),

subject to the constraint Σ_{j=1}^{m} Σ_{z=1}^{k} W(j, z) = μ. Using the method of Lagrange multipliers, we obtain that D is minimized when W(j, z) = μ/(km) for all j and z. Since any algorithm can clearly be forced to make km mistakes, we may assume μ ≥ km, and thus μ/(km) ≥ 1. So the total decrease in Φ is at least

    Σ_{j=1}^{m} Σ_{z=1}^{k} (μ/(km))(μ/(km) − 1) = μ²/(km) − μ.

We now complete the proof of the theorem. Combining Lemma 11 and Lemma 13, along with the observation that Φ is strictly non-increasing, we have shown that

    μ²/(km) − μ ≤ n²(k − 1)/k.

This implies that μ ≤ km + n√((k − 1)m).

We note that by using the simpler argument that each mistake, except for the first mistake in each column of each row type, decreases Φ by at least 2, we obtain a km + n²(k − 1)/(2k) mistake bound for any row-filter algorithm. Also, Manfred Warmuth has independently given an algorithm, based on the weighted majority algorithm of Littlestone and Warmuth, that achieves an O(km + n√(m lg k)) mistake bound. Warmuth's algorithm builds a complete graph on n vertices, where row i corresponds to vertex v_i, and all edges have an initial weight of 1. To predict a value for (i, j), the learner takes a weighted majority vote of all active neighbors of v_i (v_k is active if M_{kj} is known). After receiving feedback, the learner sets the weight on the edge from v_i to v_k to be 0 if M_{kj} ≠ M_{ij}. Finally, if a mistake occurs, the learner doubles the weight of (v_i, v_k) if M_{kj} = M_{ij}, i.e., the edges to neighbors that predicted correctly. We note that this algorithm is not a row-filter algorithm.
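The following sketch shows one possible reading of this weighted-majority variant; it is our own reconstruction from the description above, not the cited algorithm itself, and details such as the exact update schedule are assumptions. The caller reports whether the last prediction was a mistake via the made_mistake flag.

    import random

    class WeightedNeighborPredictor:
        """One weight per pair of rows; votes are taken over rows already
        known in the queried column, weighted by the pair weights."""

        def __init__(self, n, m):
            self.n, self.m = n, m
            self.w = [[1.0] * n for _ in range(n)]      # edge weights; w[i][i] unused
            self.known = [[None] * m for _ in range(n)]

        def predict(self, i, j):
            ones = zeros = 0.0
            for r in range(self.n):
                if r != i and self.known[r][j] is not None:   # active neighbor
                    if self.known[r][j] == 1:
                        ones += self.w[i][r]
                    else:
                        zeros += self.w[i][r]
            if ones == zeros:
                return random.randint(0, 1)
            return 1 if ones > zeros else 0

        def update(self, i, j, value, made_mistake):
            self.known[i][j] = value
            for r in range(self.n):
                if r == i or self.known[r][j] is None:
                    continue
                if self.known[r][j] != value:
                    self.w[i][r] = self.w[r][i] = 0.0     # rows now known inconsistent
                elif made_mistake:
                    self.w[i][r] *= 2.0                   # reward correct voters
                    self.w[r][i] = self.w[i][r]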

Does ConsMajorityPredict give the best performance possible by a row-filter algorithm? We now present an information-theoretic lower bound on the number of mistakes an adversary can force against any row-filter algorithm.

Theorem 14: Any row-filter algorithm for learning an n × m matrix with m ≥ n and k = 2 makes Ω(n√m) mistakes when the adversary selects the query sequence.

[Figure 7: A projective geometry for p = 2, m' = 7: a 7 × 7 incidence matrix in which an × in entry (i, j) indicates that point j is on line i; each line contains exactly three points and each point lies on exactly three lines.]

Proof: We assume that the adversary knows the learner's algorithm and has access to any random bits it uses. One can prove a similar lower bound on the expected mistake bound when the adversary cannot access the random bits.

Let m' = p² + p + 1 be the largest number of the given form such that p is prime and m' ≤ m. Without loss of generality, we assume in the remainder of this proof that the matrix has m' columns, and prove an Ω(n√m') mistake bound. From Bertrand's conjecture it follows that m' = Ω(m), and thus that the adversary forces Ω(n√m) mistakes in the original matrix.

Our proof depends upon the existence of a projective geometry on m' points and lines. That is, there exists a set of m' points and a set of m' lines such that each line contains exactly p + 1 points and each point is at the intersection of exactly p + 1 lines. Furthermore, any pair of lines intersects at exactly one point, and any two points define exactly one line. The choice of m' = p² + p + 1 for p prime comes from the fact that projective geometries are only known to exist for such values. Figure 7 shows a matrix representation of such a geometry: an × in entry (i, j) indicates that point j is on line i. Let Λ denote such a geometry, and let Λ' denote the first ⌊n/2⌋ lines of Λ. Note that, since m' ≥ ⌊n/2⌋, all entries of Λ' are contained within M.

The matrix M consists of two row types: the odd rows are filled with ones and the even rows with zeros. Two consecutive rows of M are assigned to each line of Λ' (see Figure 8).

(Footnote: Bertrand's conjecture states that for any integer n > 1 there exists a prime p such that n < p < 2n. Although this is known as Bertrand's conjecture, it was proven by Tchebychef.)

Although this is known as Bertrands conjecture it was proven byTchebychef in 0 000000 111111 1 0000000 111 1111 0000000 2i-1,j 1 111111 0000000 1111111

2i,j

Figure The matrix created by the adversary in the pro of of Theorem The shaded

regions corresp ond to the entries in The learner is forced to make a mistakeononeof

the entries in each shaded rectangle

We now prove that the adversary can force a mistake for each entry of Λ'. The adversary's query sequence maintains the condition that an entry (i, j) is not revealed unless line ⌈i/2⌉ of Λ' contains point j. In particular, the adversary will begin by presenting one entry of the matrix for each entry of Λ'. We prove that for each entry of Λ' the learner must predict the same value for the two corresponding entries of the matrix. Thus the adversary forces a mistake for each of the ⌊n/2⌋(p + 1) = Ω(n√m') entries of Λ'. The remaining entries of the matrix are then presented in any order.

Let I₁ be the set of rows that may be used by the row-filter algorithm when predicting entry (2i − 1, j), and let I₂ be the set of rows that may be used by the row-filter algorithm when predicting entry (2i, j). We prove by contradiction that I₁ = I₂. If I₁ ≠ I₂, then it must be the case that there is some row r that is defined in column j and consistent with row 2i − 1 yet inconsistent with row 2i (or vice versa). By definition of the adversary's query sequence, it must be the case that lines ⌈r/2⌉ and i of Λ' contain point j. Furthermore, since (2i, j) is being queried, that entry is not known. Thus rows r and 2i must both be known in some other column j', since they are known to be inconsistent. Thus, since only entries in Λ' are shown, it follows that lines ⌈r/2⌉ and i of Λ' also contain point j', for j' ≠ j. So this implies that lines ⌈r/2⌉ and i of Λ' must intersect at two points, giving a contradiction. Thus I₁ = I₂, and so f(I₁, j) = f(I₂, j) for entry (2i − 1, j) and entry (2i, j).

Since rows 2i − 1 and 2i differ in each column, and the adversary has access to the random bits of the learner, he can compute f(I₁, j) just before making his query and then ask the learner to predict the entry for which a mistake will be made. This procedure is repeated for the pair of entries corresponding to each element of Λ'.

We use a similar argument to get an Ω(m√n) bound for m ≤ n. Combining this with the km + (n − k)⌊lg k⌋ lower bound of Theorem 7 and with Theorem 14, we obtain a km + (n − k)⌊lg k⌋ + Ω(min{n√m, m√n}) lower bound on the number of mistakes made by a row-filter algorithm.

Corollary 15: Any row-filter algorithm makes km + (n − k)⌊lg k⌋ + Ω(min{n√m, m√n}) mistakes against an adversary-selected query sequence.

Comparing this lower bound to the upper bound proven for ConsMajorityPredict, we see that for fixed k the mistake bound of ConsMajorityPredict is within a constant factor of optimal.

Given this lower bound, one may question the 2m + n − 2 upper bound for k = 2 given in Theorem 8. However, the algorithm described there is not a row-filter algorithm. Also, compared to our results for the learner-selected query sequence, it appears that allowing the learner to select the query sequence is quite helpful.

3.6 Randomly-Directed Learning

In this section we consider the case that the learner is presented at each step with one of the remaining entries of the matrix, selected uniformly and independently at random. We present a prediction algorithm that makes O(km + nk√H) mistakes on average, where H is the maximum Hamming distance between any two rows of the matrix. We note that when H ≥ m/k, the result of Theorem 10 supersedes this result. A key result of this section is a proof relating two different probabilistic models for analyzing mistake bounds under a random presentation. We first consider a simple probabilistic model in which the requirement that t matrix entries are known is simulated by assuming that each entry of the matrix is seen independently with probability t/(nm). We then prove that any upper bound obtained on the number of mistakes under this simple probabilistic model holds, to within a constant factor, under the true model, in which there are exactly t entries known. This result is extremely useful since, in the true model, the dependencies among the probabilities that matrix entries are known make the analysis significantly more difficult.

We define the algorithm RandomConsistentPredict to be the row-filter algorithm in which the learner makes his prediction for M_{ij} by choosing one row i' of I uniformly at random and predicting the value M_{i'j}. If I is empty, then RandomConsistentPredict makes a random guess. A short sketch of this rule is given below.
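The sketch is our own illustration; M is the partially known matrix, with None marking unknown entries.

    import random

    def random_consistent_predict(M, i, j):
        """Pick uniformly at random one row that is consistent with row i and
        known in column j, and copy its bit; guess randomly if none exists."""
        m = len(M[0])
        candidates = [r for r in range(len(M))
                      if r != i and M[r][j] is not None
                      and all(M[i][c] is None or M[r][c] is None or M[i][c] == M[r][c]
                              for c in range(m))]
        if not candidates:
            return random.randint(0, 1)
        return M[random.choice(candidates)][j]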

Theorem 16: Let H be the maximum Hamming distance between any two rows of M. Then the expected number of mistakes made by RandomConsistentPredict is O(k(m + n√H)).

Proof: Let U_t be the probability that the prediction rule makes a mistake on the (t + 1)-st step; that is, U_t is the chance that a prediction error occurs on the next randomly selected entry, given that exactly t other randomly chosen entries are already known. Clearly the expected number of mistakes is Σ_{t=0}^{S−1} U_t, where S = nm. Our goal is to find an upper bound for this sum.

The condition that exactly t entries are known makes the computation of U_t rather messy, since the probability of having seen some entry of the matrix is not independent of knowing the others. Instead, we compute the probability V_t of a mistake under the simpler assumption that each entry of the matrix has been seen with probability t/S, independent of the rest of the matrix. We first compute an upper bound for the sum Σ_{t=0}^{S} V_t, and then show that the sum Σ_{t=0}^{S} U_t is within a constant factor of it.

Lemma 17: Σ_{t=0}^{S} V_t = O(km + nk√H).

Proof: Fix t and let p = t/S. Also let d(i) be the number of rows of the same type as row i. We bound V_0 by 1 trivially, and assume henceforth that p > 0.

By definition, V_t is the probability of a mistake occurring when a randomly selected unknown entry is presented, given that all other entries are known with probability p. Since each entry (i, j) is presented next with probability 1/S, it follows that

    V_t = (1/S) Σ_{i,j} R_{ij},

where R_{ij} is the probability of a mistake occurring given that entry (i, j) is unknown and presented next.

Let I_{ij} be the random variable describing the set of rows consistent with row i and known in column j, and let J_{ij} be the random variable describing the set of rows i' in I_{ij} for which M_{i'j} ≠ M_{ij}. If I_{ij} is nonempty, then the probability of choosing a row i' for which M_{i'j} ≠ M_{ij} is clearly |J_{ij}| / |I_{ij}|. Thus the probability of a mistake is just the expected value of this fraction, assuming I_{ij} ≠ ∅.

Unfortunately, expectations of fractions are often hard to deal with. To handle this situation, we therefore place a probabilistic lower bound on the denominator of this ratio, i.e., on |I_{ij}|. Note that if i and i' are of the same type, then the probability that i' ∈ I_{ij} is just the chance p that (i', j) is known. Since there are d(i) rows of type i (including i itself), we see that Pr[|I_{ij}| < y] is at most the chance that fewer than y of the other d(i) − 1 rows of the same type as i are in I_{ij}. In other words, this probability is bounded by the chance of fewer than y successes in a sequence of d(i) − 1 Bernoulli trials, each succeeding with probability p.

We use the following form of Chernoff bounds, due to Angluin and Valiant, to bound this probability.

Lemma 18 (Angluin and Valiant): Consider a sequence of m independent Bernoulli trials, each succeeding with probability p. Let S be the random variable describing the total number of successes. Then for 0 ≤ α ≤ 1 the following hold:

    Pr[S ≥ (1 + α)mp] ≤ e^{−α²mp/3},   and
    Pr[S ≤ (1 − α)mp] ≤ e^{−α²mp/2}.

Thus, letting y = p·d(i)/2 and applying this lemma with α = 1/2, it follows that

    Pr[|I_{ij}| < p·d(i)/2] ≤ e^{−p·d(i)/8}.

Note that this bound applies even if d(i) = 1.

Thus we have

    R_{ij} ≤ Pr[|I_{ij}| < y] + E[ |J_{ij}|/|I_{ij}| ∣ |I_{ij}| ≥ y ] · Pr[|I_{ij}| ≥ y]
           ≤ Pr[|I_{ij}| < y] + E[ |J_{ij}| ∣ |I_{ij}| ≥ y ] · Pr[|I_{ij}| ≥ y] / y
           ≤ Pr[|I_{ij}| < y] + E[|J_{ij}|] / y.

So, to bound R_{ij}, it will be useful to bound E[|J_{ij}|].

ij ij

Wehave

X

EjJ j Pr i I

ij ij

i iM M

ij

i j

If M M then Pri I isthechance that i jisknown and that i and i are

ij i j ij

consistent Entry i jisknown with probability pand i and i are consistent if either

i j ori j isunknown for each column j j in which i and i dier If hi i isthe

hii

Hamming distance b etween rows i and i then this probabilityis p

Combining these facts (and handling rows $i$ with $d(i)=1$ by the trivial bound $R_{ij}\le 1$), we have
$$V_t \ \le\ \frac{1}{S}\sum_{i:\,d(i)=1}\,\sum_{j} 1 \;+\; \frac{1}{S}\sum_{i:\,d(i)\ge 2}\,\sum_{j}\left(e^{-p(d(i)-1)/8} + \frac{\sum_{i':\,M_{i'j}\ne M_{ij}} p\,(1-p^2)^{h(i,i')-1}}{p\,(d(i)-1)/2}\right)$$
$$\le\ \sum_{i}\frac{1}{n}\,e^{-p(d(i)-1)/8} \;+\; \sum_{i:\,d(i)\ge 2}\frac{2}{S\,(d(i)-1)}\sum_{i'\ne i} h(i,i')\,(1-p^2)^{h(i,i')-1},$$
where in the last step we used $m/S=1/n$, the fact that $1=e^{-p(d(i)-1)/8}$ when $d(i)=1$, and the fact that each row $i'\ne i$ contributes to the inner sum for exactly the $h(i,i')$ columns $j$ in which rows $i$ and $i'$ differ.

Recall that our goal is to upper bound the sum $\sum_{t=0}^{S-1}V_t$. Applying the above upper bound for $V_t$ (with $p=t/S$), we get
$$\sum_{t=0}^{S-1}V_t \ \le\ \frac{1}{n}\sum_{i}\sum_{t=0}^{S-1} e^{-t(d(i)-1)/(8S)} \;+\; \sum_{t=0}^{S-1}\ \sum_{i:\,d(i)\ge 2}\frac{2}{S\,(d(i)-1)}\sum_{i'\ne i} h(i,i')\left(1-\frac{t^2}{S^2}\right)^{h(i,i')-1}.$$

We now bound the first part of the above expression. We begin by noting that
$$\frac{1}{n}\sum_i\sum_{t=0}^{S-1}e^{-t(d(i)-1)/(8S)} \ \le\ \frac{1}{n}\sum_i\left(1+\int_0^S e^{-t(d(i)-1)/(8S)}\,dt\right) \ \le\ \frac{1}{n}\sum_i\left(1+\frac{16S}{d(i)}\right),$$
where this last bound follows by evaluating the integral in the two cases that $d(i)=1$ and $d(i)\ge 2$. This last expression equals
$$1 + 16\sum_i\frac{m}{d(i)} \ \le\ 1+16km,$$
where the last step is obtained by rewriting the summation to go over all the row types: there are $d(i)$ terms for rows of the same type as row $i$, and thus each row type contributes $m$ to the summation.

We next bound the second part of the above expression. Since $2/(d(i)-1)\le 4/d(i)$ for $d(i)\ge 2$, to complete the proof of the lemma it suffices to show that
$$\sum_{t=0}^{S-1}\ \sum_{i:\,d(i)\ge 2}\ \sum_{i'\ne i}\frac{h(i,i')}{S\,d(i)}\left(1-\frac{t^2}{S^2}\right)^{h(i,i')-1} \ =\ O\!\left(nk\sqrt{H}\right).$$
We begin by noting that, since each summand is nonincreasing in $t$, this expression is bounded above by
$$\sum_{i:\,d(i)\ge 2}\ \sum_{i'\ne i}\frac{h(i,i')}{S\,d(i)}\left(1+\int_0^S\left(1-\frac{t^2}{S^2}\right)^{h(i,i')-1}dt\right).$$
If $h(i,i')=1$, then this integral is trivially evaluated to be $S$. Otherwise, applying the inequality $1-x\le e^{-x}$, we get
$$\int_0^S\left(1-\frac{t^2}{S^2}\right)^{h(i,i')-1}dt \ \le\ \int_0^S\exp\!\left(-\frac{(h(i,i')-1)\,t^2}{S^2}\right)dt.$$
A standard integral table gives
$$\int_0^S\exp\!\left(-\frac{(h(i,i')-1)\,t^2}{S^2}\right)dt \ \le\ \frac{\sqrt{\pi}\,S}{2\sqrt{h(i,i')-1}}.$$
Combining these bounds, we have
$$\int_0^S\left(1-\frac{t^2}{S^2}\right)^{h(i,i')-1}dt \ \le\ \frac{\sqrt{\pi}\,S}{\sqrt{h(i,i')}}$$
for $h(i,i')\ge 1$. Thus we arrive at an upper bound of
$$\sum_{i:\,d(i)\ge 2}\ \sum_{i'\ne i}\frac{h(i,i')}{S\,d(i)}\left(1+\frac{\sqrt{\pi}\,S}{\sqrt{h(i,i')}}\right)
\ \le\ \frac{(n-1)\,k\,m}{S} \;+\; \sqrt{\pi}\sum_{i}\ \sum_{i'\ne i}\frac{\sqrt{h(i,i')}}{d(i)}
\ \le\ k \;+\; \sqrt{\pi}\,\sqrt{H}\,(n-1)\,k \ =\ O\!\left(nk\sqrt{H}\right)$$
(when $H\ge 1$; the expression is identically zero if $H=0$). This implies the desired bound.

To complete the theorem, we prove the main result of this section, namely that the upper bound obtained under this simple probabilistic model holds, to within a constant factor, for the true model. In other words, to compute an upper bound on the number of mistakes made by a prediction algorithm when the instances are selected according to a uniform distribution on the instance space, one can replace the requirement that exactly $t$ matrix entries are known by the requirement that each matrix entry is known with probability $t/(nm)$.

Lemma  $\sum_{t=0}^{S-1} U_t = O\!\left(\sum_{t=0}^{S-1} V_t\right)$.

Proof: We first note that
$$V_t \ =\ \sum_{r=0}^{S}\binom{S}{r}\left(\frac{t}{S}\right)^{r}\left(1-\frac{t}{S}\right)^{S-r} U_r.$$
To see this, observe that for each $r$, where $r$ is the number of known entries, we need just multiply $U_r$ by the probability that exactly $r$ entries are known, assuming each entry is known with probability $t/S$. Therefore
$$\sum_{t=0}^{S} V_t \ =\ \sum_{t=0}^{S}\sum_{r=0}^{S}\binom{S}{r}\left(\frac{t}{S}\right)^{r}\left(1-\frac{t}{S}\right)^{S-r} U_r
\ =\ \sum_{r=0}^{S} U_r\left(\sum_{t=0}^{S}\binom{S}{r}\left(\frac{t}{S}\right)^{r}\left(1-\frac{t}{S}\right)^{S-r}\right).$$

Thus to prove the lemma it suffices to show that the inner summation is bounded below by a positive constant. By symmetry, assume that $r \le S/2$, and let $y = S - r$. (The case $r=0$ is trivial, since then the $t=0$ term of the inner summation is already 1; so assume $r\ge 1$.) Stirling's approximation implies that
$$\binom{S}{r}\ \ge\ c_1\sqrt{\frac{S}{r\,y}}\ \frac{S^S}{r^r\,y^y}$$
for some positive constant $c_1$. Applying this formula to the desired summation, we obtain that
$$\sum_{t=0}^{S}\binom{S}{r}\left(\frac{t}{S}\right)^{r}\left(1-\frac{t}{S}\right)^{S-r}
\ \ge\ c_1\sqrt{\frac{S}{ry}}\ \sum_{t=0}^{S}\frac{t^r\,(S-t)^{y}}{r^r\,y^{y}}
\ \ge\ c_1\sqrt{\frac{S}{ry}}\ \sum_{x=0}^{\lfloor\sqrt{ry/S}\rfloor}\left(\frac{r+x}{r}\right)^{r}\left(\frac{y-x}{y}\right)^{y}.$$
The last step above follows by letting $x = t - r$ and reducing the limits of the summation. Since the number of terms in this last summation is at least $\sqrt{ry/S}$, to complete the proof that the inner summation is bounded below by a positive constant we need just prove that
$$\left(\frac{r+x}{r}\right)^{r}\left(\frac{y-x}{y}\right)^{y}\ \ge\ c_2 \qquad\text{for all } 0\le x\le\sqrt{ry/S}$$
and some positive constant $c_2$. Using the inequality $1+z\le e^{z}$ (equivalently, $\ln(1+z)\ge z/(1+z)$ and $\ln(1-z)\ge -z/(1-z)$), we get
$$\left(\frac{r+x}{r}\right)^{r}\left(\frac{y-x}{y}\right)^{y}
\ \ge\ \exp\!\left(x-\frac{x^2}{r+x}\right)\exp\!\left(-x-\frac{x^2}{y-x}\right)
\ =\ \exp\!\left(-\frac{S\,x^2}{(r+x)(y-x)}\right).$$
Since $x\le\sqrt{ry/S}$, it follows that $Sx^2\le ry$. Applying this observation to the above inequality, it follows that
$$\left(\frac{r+x}{r}\right)^{r}\left(\frac{y-x}{y}\right)^{y}
\ \ge\ \exp\!\left(-\frac{ry}{(r+x)(y-x)}\right)
\ \ge\ \exp\!\left(-\frac{ry}{r\,(y/2)}\right)\ =\ e^{-2},$$
where we used $r+x\ge r$ and $y-x\ge y/2$ (the latter holds since $x\le\sqrt{ry/S}\le\sqrt{y/2}\le y/2$ for $y\ge 2$). Finally, we note that for $S\ge 4$ the inner summation is therefore at least $c_1\sqrt{S/(ry)}\cdot\sqrt{ry/S}\cdot e^{-2} = c_1 e^{-2}$, a positive constant. This completes the proof of the lemma.

Clearly, the two lemmas above together imply that $\sum_{t=0}^{S-1} U_t = O\!\left(km + nk\sqrt{H}\right)$, giving the desired result.

This completes our discussion of learning $k$-binary-relations.

Learning a Total Order

In this section we present our results for learning a binary relation on a set where it is known a priori that the relation forms a total order. One can view this problem as that of learning a total order on a set of $n$ objects, where an instance corresponds to comparing which of two objects is greater in the target total order. Thus this problem is like comparison-based sorting, except for two key differences: we vary the agent selecting the order in which comparisons are made (in sorting the learner does the selection), and we charge the learner only for incorrectly predicted comparisons.

Before describing our results, we motivate this section with the following example. There are $n$ basketball teams that are competing in a round-robin tournament; that is, each team will play all other teams exactly once. Furthermore, we make the (admittedly simplistic) assumption that there is a ranking of the teams such that a team wins its match if and only if its opponent is ranked below it. A gambler wants to place a bet on each game: if he bets on the winning team he wins his bet, and if he bets on the losing team he loses his bet. Of course, his goal is to win as many bets as possible.

We formalize the problem of learning a total order as follows. The instance space is $X_n = \{1,\ldots,n\}\times\{1,\ldots,n\}$. An instance $(i,j)$ in $X_n$ is in the target concept if and only if object $i$ precedes object $j$ in the corresponding total order.
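Concretely, a concept in this class can be represented by the rank of each object in the target order; the following tiny Python sketch (names are illustrative only) shows how an instance $(i,j)$ would be classified.

    def make_total_order_concept(rank):
        """rank[i] gives object i's position in the target total order.
        Instance (i, j) is positive iff object i precedes object j."""
        return lambda i, j: rank[i] < rank[j]

    # Example: with ranks {1: 0, 2: 2, 3: 1}, instance (1, 3) is positive
    # and instance (2, 3) is negative.
    c = make_total_order_concept({1: 0, 2: 2, 3: 1})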

If computation time is not a concern, then the halving algorithm makes at most $n\lg n$ mistakes. However, we are interested in efficient algorithms, and thus our goal is to design an efficient version of the halving algorithm. In the next section we discuss the relation between the halving algorithm and approximate counting. Then we show how to use an approximate counting scheme to implement a randomized version of the approximate halving algorithm, and apply this result to the problem of learning a total order on a set of $n$ elements. Finally, we discuss how a majority algorithm can be used to implement a counting algorithm.

The Halving Algorithm and Approximate Counting

In this section we review the halving algorithm and approximate counting schemes. We first cover the halving algorithm. Let $V$ denote the set of concepts in $C_n$ that are consistent with the feedback from all previous queries. Given an instance $x$ in $X_n$, for each concept in $V$ the halving algorithm computes the prediction of that concept for $x$, and predicts according to the majority. Finally, all concepts in $V$ that are inconsistent with the correct classification are deleted. Littlestone shows that this algorithm makes at most $\lg|C_n|$ mistakes. Now suppose the prediction algorithm predicts according to the majority of concepts in the set $V$ of all concepts in $C_n$ consistent with all incorrectly predicted instances. Littlestone also proves that this space-efficient halving algorithm makes at most $\lg|C_n|$ mistakes.

We define an approximate halving algorithm to be the following generalization of the halving algorithm. Given instance $x$ in $X_n$, an approximate halving algorithm predicts in agreement with at least $\beta|V|$ of the concepts in $V$, for some constant $\beta > 0$.

Theorem  An approximate halving algorithm makes at most $\log_{1/(1-\beta)}|C_n|$ mistakes for learning $C_n$.

Proof: Each time a mistake is made, the number of concepts that remain in $V$ is reduced by a factor of at least $1-\beta$. Thus after at most $\log_{1/(1-\beta)}|C_n|$ mistakes there is only one consistent concept left in $C_n$.

We note that the above result holds also for the space-efficient version of the approximate halving algorithm.
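As a concrete instance of this bound: if each prediction is guaranteed to agree with at least a third of the surviving concepts ($\beta = 1/3$), then every mistake eliminates at least a third of $V$, so the number of mistakes is at most $\log_{3/2}|C_n| \approx 1.71\,\lg|C_n|$.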

When given an instance $x \in X_n$, one way to predict as dictated by the halving algorithm is to count the number of concepts in $V$ for which $c(x)=0$ and for which $c(x)=1$, and then predict with the majority. As we shall see, using these ideas we can use an approximate counting scheme to implement the approximate halving algorithm.

We now introduce the notion of an approximate counting scheme for counting the number of elements in a finite set $S_x$. Let $x$ be a description of a set $S_x$ in some natural encoding. An exact counting scheme on input $x$ outputs $|S_x|$ with probability 1. Such a scheme is polynomial if it runs in time polynomial in $|x|$. Sometimes exact counting can be done in polynomial time; however, many counting problems are #P-complete and thus assumed to be intractable. (For a discussion of the class #P, see Valiant.) For many #P-complete problems good approximations are possible. A randomized approximation scheme $R$ for a counting problem satisfies the following condition for all $\epsilon,\delta>0$:
$$\Pr\!\left[\frac{|S_x|}{1+\epsilon}\ \le\ R(x)\ \le\ (1+\epsilon)\,|S_x|\right]\ \ge\ 1-\delta,$$
where $R(x)$ is $R$'s estimate on input $x$. In other words, with high probability $R$ estimates $|S_x|$ to within a factor of $1+\epsilon$. Such a scheme is fully polynomial if it runs in time polynomial in $|x|$, $1/\epsilon$, and $\lg(1/\delta)$. For further discussion see Sinclair.

We now review work on counting the number of linear extensions of a partial order. That is, given a partial order on a set of $n$ elements, the goal is to compute the number of total orders that are linear extensions of the given partial order. We discuss the relationship between this problem and that of computing the volume of a convex polyhedron; for more details on this subject, see Lovász. Given a convex set $S\subseteq\Re^n$ and an element $a$ of $\Re^n$, a weak separation oracle either asserts that $a\in S$, or asserts that $a\notin S$ and supplies a reason why. In particular, for closed convex sets in $\Re^n$, if $a\notin S$ then there exists a hyperplane separating $a$ from $S$; so if $a\notin S$, the oracle responds with such a separating hyperplane as the reason why $a\notin S$.

We now discuss how to reduce the problem of counting the number of extensions of a partial order on $n$ elements to that of computing the volume of a convex $n$-dimensional polyhedron given by a separation oracle. The polyhedron built in the reduction will be a subset of $[0,1]^n$, the unit hypercube in $\Re^n$, where each dimension corresponds to one of the $n$ elements. Observe that any inequality $x_i\le x_j$ defines a halfspace in $\Re^n$. Let $\Phi(t)$ denote the polyhedron obtained by taking the intersection of the unit hypercube with the halfspaces given by the inequalities of the partial order $t$. (See the figure below for an example with $n=3$.) For any pair of total orders $t$ and $t'$, the polyhedra $\Phi(t)$ and $\Phi(t')$ are simplices that intersect only in a face of zero volume.

[Figure: The polyhedron formed by the total order $z,y,x$, drawn inside the unit cube.]

To see this, note that a pair of elements, say $x_i$ and $x_j$, that are ordered differently in $t$ and $t'$ (such a pair must exist) defines a hyperplane $x_i=x_j$ that separates $\Phi(t)$ and $\Phi(t')$. Let $T_n$ be the set of all $n!$ total orders on $n$ elements. Then
$$\bigcup_{t\in T_n}\Phi(t)\ =\ [0,1]^n.$$
In other words, the union of the polyhedra associated with all total orders yields the unit hypercube. We have already seen that the polyhedra associated with the $t\in T_n$ are disjoint (up to faces of zero volume). To see that they cover all of $[0,1]^n$, observe that any point $y\in[0,1]^n$ defines some total order $t$ (by sorting its coordinates), and clearly $y\in\Phi(t)$. Let $P$ be a partial order on a set of $n$ elements. From the equation above and the observation that the volumes of the polyhedra formed by the total orders are all equal, it follows that the volume of the polyhedron defined by any total order is $1/n!$. Thus it follows that for any partial order $P$,
$$\text{volume of the polyhedron defined by } P\ =\ \frac{\text{number of extensions of } P}{n!}.$$
Rewriting this equation, we obtain
$$\text{number of extensions of } P\ =\ n!\,\cdot\,(\text{volume of the polyhedron defined by } P).$$
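For concreteness, take $n=3$ and let $P$ consist of the single constraint that element 1 precedes element 2. The corresponding polyhedron is $\{y\in[0,1]^3 : y_1\le y_2\}$, which has volume $1/2$, and indeed $P$ has $3!\cdot\frac12 = 3$ linear extensions, namely the three total orders on $\{1,2,3\}$ in which element 1 precedes element 2.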

Finally, we note that the weak separation oracle is easy to implement for any partial order. Given inputs $a$ and $S$, it just checks each inequality of the partial order to see if $a$ is in the convex polyhedron $S$. If $a$ does not satisfy some inequality, then reply that $a\notin S$ and return that inequality as the separating hyperplane. Otherwise, if $a$ satisfies all inequalities, reply that $a\in S$.
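A sketch of this oracle in Python might look as follows; the function and argument names are illustrative, and the partial order is assumed to be given as a list of precedence pairs.

    def separation_oracle(a, precedes):
        """Weak separation oracle for the polyhedron of a partial order.

        `precedes` is a list of pairs (i, j) meaning element i precedes element
        j, contributing the halfspace a[i] <= a[j]; the polyhedron also lies in
        the unit cube.  Returns (True, None) if `a` is in the polyhedron, and
        otherwise (False, witness) where `witness` names a violated inequality,
        which serves as the separating hyperplane.
        """
        for x, coord in enumerate(a):
            if coord < 0.0:
                return (False, f"x_{x} >= 0")
            if coord > 1.0:
                return (False, f"x_{x} <= 1")
        for (i, j) in precedes:
            if a[i] > a[j]:
                return (False, f"x_{i} <= x_{j}")
        return (True, None)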

Dyer, Frieze and Kannan give a fully polynomial randomized approximation scheme (fpras) to approximate the volume of a polyhedron given by a weak separation oracle. From the equation above, we see that this fpras for estimating the volume of a polyhedron can easily be applied to estimate the number of extensions of a partial order. Furthermore, Dyer and Frieze prove that it is #P-hard to exactly compute the volume of a polyhedron given either by a list of its facets or by its vertices.

Independently, Matthews has described an algorithm to generate a random linear extension of a partial order. Consider the convex polyhedron $K$ defined by the partial order. Matthews' main result is a technique to sample nearly uniformly from $K$. Given such a procedure to sample nearly uniformly from $K$, one can sample nearly uniformly from the set of extensions of a partial order by choosing a random point in $K$ and then selecting the total order corresponding to the ordering of the coordinates of the selected point. A procedure to generate a random linear extension of a partial order can then be used repeatedly to approximate the number of linear extensions of a partial order.
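To make the connection concrete, here is a small Python sketch (an illustration only, with hypothetical function names): it estimates the number of extensions directly from the volume identity by naive Monte Carlo sampling of the unit cube, and it shows the correspondence from a sampled point to a linear extension. This is not the fpras of Dyer, Frieze and Kannan, nor Matthews' near-uniform sampler; plain cube sampling can require exponentially many samples when the polyhedron is small.

    import math
    import random

    def estimate_extensions(n, precedes, samples=100_000):
        """Estimate the number of linear extensions of a partial order on
        {0, ..., n-1} by estimating the volume of its polyhedron with uniform
        samples from the unit cube and multiplying by n! (naive Monte Carlo)."""
        hits = 0
        for _ in range(samples):
            y = [random.random() for _ in range(n)]
            if all(y[i] <= y[j] for (i, j) in precedes):
                hits += 1
        return math.factorial(n) * hits / samples

    def order_from_point(y):
        """A point of the cube determines the total order obtained by
        sorting its coordinates."""
        return sorted(range(len(y)), key=lambda i: y[i])

    # estimate_extensions(3, [(0, 1)]) should return a value close to 3.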

Application to Learning

We begin this section by studying the problem of learning a total order under teacher-directed and self-directed learning. Then we show how to use a fpras to implement a randomized version of the approximate halving algorithm, and apply this result to the problem of learning a total order on a set of $n$ elements.

Under the teacher-selected query sequence we obtain an $n-1$ mistake bound: the teacher can uniquely specify the target total order by giving the $n-1$ instances that correspond to consecutive elements in the target total order. Since $n-1$ instances are needed to uniquely specify a total order, we get a matching lower bound. Winkler has shown that under the learner-selected query sequence one can also obtain an $n-1$ mistake bound. To achieve this bound the learner uses an insertion sort, as described for instance by Cormen, Leiserson and Rivest, where for each new element the learner guesses that it is smaller than each of the ordered elements, starting with the largest, until a mistake is made. When a mistake occurs, this new element is properly positioned in the chain. Thus at most $n-1$ mistakes will be made by the learner. In fact, the learner can be forced to make at least $n-1$ mistakes. The adversary gives feedback using the following simple strategy: the first time an object is involved in a comparison, reply that the learner's prediction is wrong. In doing so one creates a set of chains, where a chain is a total order on a subset of the elements. If $c$ chains are created by this process, then the learner has made $n-c$ mistakes. Since all these chains must be combined to get a total order, the adversary can force $c-1$ additional mistakes by always replying that a mistake occurs the first time that elements from two different chains are compared. It is not hard to see that the above steps can be interleaved. Thus the adversary can force $n-1$ mistakes.
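A sketch of this self-directed strategy in Python, under the assumption that a feedback function truth(a, b) reports whether a precedes b in the target order, might be:

    def self_directed_insertion_sort(elements, truth):
        """Self-directed learner for a total order: insert each element into a
        sorted chain, predicting 'smaller' against chain members from largest
        to smallest; the first wrong prediction pins down its position.
        Returns (sorted chain, number of prediction mistakes)."""
        chain, mistakes = [], 0
        for x in elements:
            pos = 0                      # front of chain if every prediction holds
            for idx in range(len(chain) - 1, -1, -1):
                # predict: x precedes chain[idx]
                if truth(x, chain[idx]):
                    continue             # prediction correct, keep walking down
                mistakes += 1            # wrong: x comes after chain[idx]
                pos = idx + 1
                break
            chain.insert(pos, x)
            # at most one mistake per element, so at most n - 1 mistakes overall
        return chain, mistakes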

Next we consider the case that an adversary selects the query sequence. We first prove an $\Omega(n\lg n)$ lower bound on the number of mistakes made by any prediction algorithm. We use the following result of Kahn and Saks: given any partial order $P$ that is not a total order, there exists an incomparable pair of elements $x_i, x_j$ such that
$$\frac{3}{11}\ \le\ \frac{\text{number of extensions of } P \text{ in which } x_i \text{ precedes } x_j}{\text{number of extensions of } P}\ \le\ \frac{8}{11}.$$
So the adversary can always pick a pair of elements so that, regardless of the learner's prediction, the adversary can report that a mistake was made while eliminating only a constant fraction of the remaining total orders.

Finally, we present a polynomial prediction algorithm making $n\lg n + (\lg e)\lg n$ mistakes with very high probability. We first show how to use an exact counting algorithm $R$, for counting the number of concepts in $C_n$ consistent with a given set of examples, to implement the halving algorithm.

Lemma  Given a polynomial algorithm $R$ to exactly count the number of concepts in $C_n$ consistent with a given set $E$ of examples, one can construct an efficient implementation of the halving algorithm for $C_n$.

Proof: We show how to use $R$ to efficiently make the predictions required by the halving algorithm. To make a prediction for an instance $x$ in $X_n$, the following procedure is used. Construct $E_0$ from $E$ by appending $x$ as a negative example to $E$. Use the counting algorithm $R$ to count the number of concepts $C_0\subseteq V$ that are consistent with $E_0$. Next construct $E_1$ from $E$ by appending $x$ as a positive example to $E$. As before, use $R$ to count the number of concepts $C_1\subseteq V$ that are consistent with $E_1$. Finally, if $|C_0|\ge|C_1|$ then predict that $x$ is a negative example; otherwise predict that $x$ is a positive example.

Clearly a prediction is made in polynomial time, since it just requires calling $R$ twice. It is also clear that each prediction is made according to the majority of concepts in $V$.
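In code, this procedure amounts to two calls to the counter; a minimal sketch, assuming a counting routine count_consistent(E) for the concept class at hand:

    def halving_predict(E, x, count_consistent):
        """Predict the label of instance x by majority of the version space,
        using a counting algorithm: count the consistent concepts under each
        possible label for x and predict the larger side.
        `E` is the list of (instance, label) examples seen so far."""
        c_neg = count_consistent(E + [(x, 0)])   # concepts calling x negative
        c_pos = count_consistent(E + [(x, 1)])   # concepts calling x positive
        return 0 if c_neg >= c_pos else 1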

We modify this basic technique to use a fpras instead of the exact counting algorithm to obtain an efficient implementation of a randomized version of the approximate halving algorithm. In doing so we obtain the following general theorem describing when the existence of a fpras leads to a good prediction algorithm. We then apply this theorem to the problem of learning a total order.

Theorem  Let $R$ be a fpras for counting the number of concepts in $C_n$ consistent with a given set $E$ of examples. If $|X_n|$ is polynomial in $n$, one can produce a prediction algorithm that, for any $0<\delta\le 1$, runs in time polynomial in $n$ and $\lg(1/\delta)$, and makes at most $\left(1+\frac{\lg e}{n}\right)\lg|C_n|$ mistakes with probability at least $1-\delta$.

Proof: The prediction algorithm implements the procedure described in the lemma above with the exact counting algorithm replaced by the fpras $R$, run with accuracy parameter $\epsilon$ and confidence parameter $\delta/(2|X_n|)$, where we take $\epsilon = 1/(2n)$. Consider the prediction for an instance $x\in X_n$. Let $V$ be the set of concepts that are consistent with all previous instances. Let $r_1$ (respectively, $r_0$) be the number of concepts in $V$ for which $x$ is a positive (negative) instance. Let $\hat r_1$ (respectively, $\hat r_0$) be the estimate output by $R$ for $r_1$ ($r_0$). Since $R$ is a fpras, with probability at least $1-\delta/|X_n|$ we have
$$\frac{r_0}{1+\epsilon}\ \le\ \hat r_0\ \le\ (1+\epsilon)\,r_0 \qquad\text{and}\qquad \frac{r_1}{1+\epsilon}\ \le\ \hat r_1\ \le\ (1+\epsilon)\,r_1.$$
Without loss of generality, assume that the algorithm predicts that $x$ is a negative instance, and thus $\hat r_0\ge\hat r_1$. Combining the above inequalities with the observation that $r_0+r_1=|V|$, we obtain that
$$r_0\ \ge\ \frac{|V|}{1+(1+\epsilon)^2}.$$
We define an appropriate prediction to be a prediction that agrees with at least $\frac{|V|}{1+(1+\epsilon)^2}$ of the concepts in $V$. To analyze the mistake bound for this algorithm, suppose that each prediction is appropriate. For a single prediction to be appropriate, both calls to the fpras $R$ must output a count that is within a factor of $1+\epsilon$ of the true count. So any given prediction is appropriate with probability at least $1-\delta/|X_n|$, and thus the probability that all predictions are appropriate is at least
$$\left(1-\frac{\delta}{|X_n|}\right)^{|X_n|}\ \ge\ 1-\delta.$$
Clearly, if all predictions are appropriate, then the above procedure is in fact an implementation of the approximate halving algorithm with $\beta = \frac{1}{1+(1+\epsilon)^2}$, and thus by the theorem above at most $\log_{1/(1-\beta)}|C_n|$ mistakes are made. Substituting for $\beta$ and simplifying the expression, we obtain that with probability at least $1-\delta$ the algorithm makes at most
$$\frac{\lg|C_n|}{\lg\!\left(1+\frac{1}{(1+\epsilon)^2}\right)}$$
mistakes. Since $\epsilon = \frac{1}{2n}$, we have $\frac{1}{(1+\epsilon)^2}\ge 1-2\epsilon = 1-\frac1n$, and hence
$$\lg\!\left(1+\frac{1}{(1+\epsilon)^2}\right)\ \ge\ \lg\!\left(2-\frac1n\right)\ =\ 1+\lg\!\left(1-\frac1{2n}\right).$$
Applying the inequality $\lg(1-x)\ge\frac{-x\lg e}{1-x}$ (a consequence of $1+z\le e^z$), it follows that
$$\lg\!\left(1+\frac{1}{(1+\epsilon)^2}\right)\ \ge\ 1-\frac{\lg e}{2n-1}\ \ge\ \frac{n}{n+\lg e}$$
for $n\ge 3$; the cases $n=1,2$ are easily checked directly. Finally, applying these inequalities to the bound above yields that, with probability at least $1-\delta$, the algorithm makes at most
$$\frac{n+\lg e}{n}\,\lg|C_n|\ =\ \left(1+\frac{\lg e}{n}\right)\lg|C_n|$$
mistakes.

Note that we could modify the above proof by not requiring that all predictions be appropriate. In particular, each prediction that fails to be appropriate costs at most one additional mistake, so allowing a few inappropriate predictions increases the mistake bound only by their number while relaxing the confidence required of the fpras.

We now apply this result to obtain the main result of this section. Namely, we describe a randomized polynomial prediction algorithm for learning a total order in the case that the adversary selects the query sequence.

Theorem  There exists a prediction algorithm $A$ for learning total orders such that, on input $\delta$, for all $0<\delta\le 1$ and for any query sequence provided by the adversary, $A$ runs in time polynomial in $n$ and $\lg(1/\delta)$ and makes at most $n\lg n + (\lg e)\lg n$ mistakes with probability at least $1-\delta$.

Proof Sketch: We apply the result of the previous theorem, using the fpras for counting the number of extensions of a partial order given independently by Dyer, Frieze and Kannan and by Matthews. We know that with probability at least $1-\delta$ the number of mistakes is at most $\left(1+\frac{\lg e}{n}\right)\lg|C_n|$. Since $|C_n| = n!$, the desired result is obtained.

We note that the probability that $A$ makes more than $n\lg n + (\lg e)\lg n$ mistakes does not depend on the query sequence selected by the adversary; the probability is taken over the coin flips of the randomized approximation scheme.

Thus, as in learning a $k$-binary-relation using a row-filter algorithm, we see that a learner can do asymptotically better with self-directed learning than with adversary-directed learning. Furthermore, while the self-directed learning algorithm is deterministic, here the adversary-directed algorithm is randomized.

As a final note, observe that we have just seen how a counting algorithm can be used to implement the halving algorithm. In her thesis, Goldman has described conditions under which the halving algorithm can be used to implement a counting algorithm.

Conclusions and Open Problems

We have formalized and studied the problem of learning a binary relation between two sets of objects, or between a set and itself, under an extension of the on-line learning model. We have presented general techniques to help develop efficient versions of the halving algorithm. In particular, we have shown how a fully polynomial randomized approximation scheme can be used to efficiently implement a randomized version of the approximate halving algorithm. We have also extended the mistake-bound model by adding the notion of an instance selector. The specific results are summarized in the table below. In this table all lower bounds are information-theoretic bounds, and all upper bounds are for polynomial-time learning algorithms. Also, unless otherwise stated, the results listed are for deterministic learning algorithms.

From the table one can see that several of the above bounds are tight, and several others are asymptotically tight. However, for the problem of learning a $k$-binary-relation there is a gap in the bounds for the random and adversary (except $k=2$) directors. Note that the bounds for row-filter algorithms are asymptotically tight for $k$ constant. Clearly, if we want asymptotically tight bounds that include a dependence on $k$, we cannot use only two row types in the matrix used for the projective geometry lower bound.

Concept Class      Director       Lower Bound                              Upper Bound                    Notes
---------------------------------------------------------------------------------------------------------------------------
Binary Relation    Learner        km + (n-k)⌊lg k⌋                         km + (n-k)⌊lg k⌋
(k row-types)      Teacher        km + (n-k)(k-1)                          km + (n-k)(k-1)
                   Adversary      km + (n-k)⌊lg k⌋                         O(km + n√(m lg k))             *
                   Adversary      m + n                                    m + n                          k = 2
                   Adversary      km + (n-k)lg k + min{n√m, m√n}           km + (n-k)√m                   row-filter algorithms
                   Uniform Dist.  km + (n-k)⌊lg k⌋                         O(km + nk√H)                   average case, row-filter alg.
Total Order        Teacher        n - 1                                    n - 1
                   Learner        n - 1                                    n - 1                          †
                   Adversary      Ω(n lg n)                                n lg n + (lg e)lg n            randomized algorithm

* Due to Manfred Warmuth. Note that if computation time is not a concern, we have shown that the halving algorithm
  makes at most km + (n-k)lg k mistakes.
† Due to Peter Winkler.

Table: Summary of our results.

For the problem of learning a total order, all the above bounds are tight or asymptotically tight. Although the fully polynomial randomized approximation scheme for approximating the number of extensions of a partial order is a polynomial-time algorithm, the exponent on $n$ is somewhat large and the algorithm is quite complicated. Thus an interesting problem is to find a practical prediction algorithm for the problem of learning a total order. Another interesting direction of research is to explore other ways of modeling the structure in a binary relation. Finally, we hope to find other applications of fully polynomial randomized approximation schemes to learning theory.

Acknowledgments

The results on the approximate halving algorithm and on learning a total order were inspired by an open problems session led by Manfred Warmuth at our weekly Machine Learning Reading Group meeting, where he proposed the basic idea of using an approximate halving algorithm based on approximate and probabilistic counting, and also suggested the problem of learning a total order on $n$ elements. We thank Tom Leighton for helping to improve one of our lower bounds. We also thank Nick Littlestone, Bob Sloan and Ken Goldman for their comments.

Footnote: We note that William Chen has recently extended the projective geometry argument to obtain a lower bound of $n\sqrt{m\lg k}$ for $m \ge n\lg k$.

References

Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, November.

Dana Angluin. Queries and concept learning. Machine Learning.

Dana Angluin and Leslie G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, April.

J. Barzdin and R. Freivald. On the prediction of general recursive functions. Soviet Mathematics Doklady.

Avrim Blum. An O(n^0.4)-approximation algorithm for 3-coloring. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, May.

R. Carmichael. Introduction to the Theory of Groups of Finite Order. Dover Publications, New York.

William Chen. Personal communication.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill, Cambridge, Massachusetts.

M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algorithm for estimating the volumes of convex bodies. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing, May.

Martin Dyer and Alan Frieze. On the complexity of computing the volume of a polyhedron. SIAM Journal on Computing.

Sally Ann Goldman. Learning Binary Relations, Total Orders, and Read-once Formulas. PhD thesis, MIT Department of Electrical Engineering and Computer Science, July.

T. Gradshteyn and I. Ryzhik. Table of Integrals, Series, and Products. Academic Press, New York. Corrected and enlarged edition by A. Jeffrey.

David Haussler, Michael Kearns, Nick Littlestone, and Manfred K. Warmuth. Equivalence of models for polynomial learnability. In Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann, August.

David Haussler, Nick Littlestone, and Manfred Warmuth. Expected mistake bounds for on-line learning algorithms. Unpublished manuscript.

Mark Jerrum and Alistair Sinclair. Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, May.

Jeff Kahn and Michael Saks. Balancing poset extensions. Order.

Nati Linial and Umesh Vazirani. Graph products and chromatic numbers. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, October.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, October.

László Lovász. An Algorithmic Theory of Numbers, Graphs and Convexity. CBMS-NSF Regional Conference Series in Applied Mathematics.

Peter Matthews. Generating a random linear extension of a partial order. Unpublished manuscript.

Ronald L. Rivest and Robert Sloan. Learning complicated concepts reliably and usefully. In David Haussler and Leonard Pitt, editors, First Workshop on Computational Learning Theory, Morgan Kaufmann, August.

Alistair Sinclair. Randomised Algorithms for Counting and Generating Combinatorial Structures. PhD thesis, University of Edinburgh, Department of Computer Science, November.

Larry Stockmeyer. The complexity of approximate counting. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, May.

Leslie Valiant. The complexity of computing the permanent. Theoretical Computer Science.

Leslie Valiant. A theory of the learnable. Communications of the ACM, November.

Manfred Warmuth. Personal communication.

Peter Winkler. Personal communication.