Learning Nested Concept Classes with Limited Storage*

David Heath, Simon Kasif, S. Rao Kosaraju, Steven Salzberg, and Gregory Sullivan
Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218

Abstract

Many existing learning methods use incremental algorithms that construct a generalization in one pass through a set of training data and modify it in subsequent passes (e.g., perceptrons, neural nets, and decision trees). Most of these methods do not store the entire training set, in essence employing a limited storage requirement that abstracts the notion of a compressed representation. The question we address is, how much additional processing time is required for methods with limited storage? Processing time for learning algorithms is equated in this paper with the number of passes through a data set necessary to obtain a correct generalization. For instance, neural nets require many passes through a data set before converging. Decision trees require fewer passes, but precise bounds are unknown.

We consider limited storage algorithms for a particular concept class, nested hyperrectangles. We prove bounds that illustrate the fundamental trade-off between storage requirements and the processing time required to learn an optimal structure. It turns out that our lower bounds apply to other algorithms and concept classes (e.g., decision trees) as well. Notably, imposing storage limitations on the learning task forces one to devise a completely different algorithm to reduce the number of passes. We also briefly discuss parallel learning algorithms.

* This research was supported in part by the Air Force Office of Scientific Research under Grant AFOSR-89-1151, the National Science Foundation under Grant IRI-88-09324, and NSF/DARPA under Grant CCR-8908092. Also supported by NSF under grant CCR-8804284 and NSF/DARPA under grant CCR-8808092.

1 Introduction

Many existing learning methods attempt to create concise generalizations from a set of examples. Besides saving storage, small generalizations are easier to summarize and communicate to others. A common learning method will construct a generalization after one pass through a set of training data, and modify it in subsequent passes to make it smaller or more accurate. Perceptron methods, neural nets, and decision tree techniques all fit this paradigm. Most of these methods do not store the entire set of training data. When processing is completed, the only thing they store is a generalized data structure such as a tree, a matrix of weights, or a set of geometric clusters. The issue of limiting the storage of a learning algorithm is one abstraction of the notion of a compressed representation. The question we address is how many passes through a data set are required to obtain a "correct" (i.e., accurate but minimum in size) generalization if we are only allowed to store the generalization. This issue is equivalent to analyzing an algorithm that has a limited storage requirement.

Fixed storage is an important consideration for several reasons. First of all, some learning models always use fixed storage. Neural net learning algorithms, for example, have a fixed number of nodes and edges, and only change the weights on the edges. Decision tree algorithms can in principle grow without bound, but in practice researchers have devised many techniques for restricting their growth [Quinlan 1986, Utgoff 1989]. Instance-based techniques such as those of Salzberg [1989ab] and Aha and Kibler [1989] attempt to store as few examples as possible in order to minimize storage. Second, fixed storage is a realistic constraint from the perspective of cognitive modelling: human learning behavior clearly must adhere to some storage limitations. Finally, there is experimental evidence that restricting storage actually leads to better performance, especially if the input data is noisy [Aha and Kibler 1989, Quinlan 1986]. The intuition behind this result is that by throwing away noisy data, an algorithm can construct a more accurate generalization.

Our results show that if a fixed-storage algorithm attempts to create a simple concept structure, then it cannot generalize on-line without losing accuracy. By "simple" we mean a generalization that is both minimum in size and an accurate model of the data; i.e., it classifies all the training examples correctly. We also show that by making a number of additional passes (depending on the number of concepts) through the data set, an algorithm can create an optimal structure. Our main goal is to demonstrate that there exists a fundamental trade-off between the storage available and the number of passes required for learning an optimal structure. There are few results comparable to ours on the number of passes required to learn a concept. Typical theory results, rather, give estimates of the size of the input data set, not of the number of presentations required. This work is an important step towards formalizing the capabilities of incremental vs. non-incremental learning algorithms. Any algorithm that does not save all training examples is to some extent incremental, since upon presentation of new inputs the algorithm cannot re-compute a generalization using all previous examples.

The learning framework considered in this paper is computing generalizations in the form of geometric concept classes. Many important learning algorithms fall in this category, e.g., perceptron learning [Rosenblatt 1959], instance-based learning [Aha and Kibler 1989], decision tree models [Quinlan 1986], and hyperrectangles [Salzberg 1990]. We focus on the concept class defined by nested hyperrectangles, although many of our results are applicable to other concept classes. In fact, our impossibility results apply to any convex partitioning of feature space, such as decision trees and perceptrons.

The learning model we consider has a fixed number of storage locations; i.e., it is not permitted to store and process all training examples at once. The limited storage requirement is applicable only during the learning (or training) phase. That is, our algorithms must operate incrementally, storing a limited number of intermediate results during each pass through a data set and modifying the partially constructed generalization in subsequent passes. For both cognitive and practical reasons, much experimental learning research has focused on the development of incremental models [Utgoff 1989].

2 Nested Hyperrectangles

Recent experimental research on learning from examples has shown that concepts in the form of hyperrectangles are a useful generalization in a variety of real-world domains [Salzberg 1989ab, 1990]. In this work, rectangular-shaped generalizations are created from the training examples, and are then used for classification. Rectangles may be nested inside one another to arbitrary depth, and new examples are classified by the innermost rectangle containing them. Thus nested rectangles may be thought of as exceptions to the surrounding rectangles. This learning model is called the Nested Generalized Exemplar (NGE) model. Experimental results with this model thus far have shown that it compares very favorably with several other models, including decision trees, rule-based methods, statistical techniques, and neural nets [Salzberg 1989ab].

Independently of the experimental work cited above, Helmbold, Sloan, and Warmuth [1989] have produced very promising theoretical results for nested rectangular concepts. In particular, they have developed a learning algorithm for binary classification problems that creates strictly nested rectangles, and that makes predictions about new examples on-line. They have proven that their algorithm is optimal with respect to several criteria, including the probability that the algorithm produces a hypothesis with small error [Valiant 1984] and the expected total number of mistakes for classification of the first t examples. The algorithm applies to all intersection-closed classes, which include orthogonal rectangles in R^n, monomials, and other concepts. The main assumption behind their model is that it must be possible to classify the training examples with strictly nested rectangles.

Given that the learning community has found nested rectangles a useful concept class, and that the theoretical community has proven some further results about this same concept class, we have been led to investigate the ability of an algorithm to construct an optimal number of nested rectangles given only limited storage. We will argue below that our results apply to other well-known concept classes, including the partitionings induced by decision trees and perceptrons.

3 Preliminaries

An example is defined simply as a vector of real-valued numbers, plus a category label. For instance, we may be considering a problem where medical patients are represented by a set of real numbers including heart rate, blood pressure, etc., and our task is to categorize the patients as "in-patient" or "out-patient." For our purposes, an example is just a point in d-dimensional space, where the dimensionality of the space is determined by the number of attributes measured for each example. We will categorize points by using axis-parallel hyperrectangles, where each rectangle R_i is labeled with a category C(R_i) such as "out-patient." A point is categorized by the innermost rectangle containing it.

Figure 1: Categorizing points using nested rectangles.

Figure 1 illustrates how categories are assigned. In the figure, lower-case letters indicate points belonging to categories a and b, and upper-case letters indicate the category labels of the rectangles. Notice that points not contained by any rectangle are assigned to category A, which corresponds to a default category. Only two rectangles are required to classify all the points in Figure 1.

The general problem definition is as follows: we are given n points in a d-dimensional space, and we are asked to construct a minimum set of strictly nested hyperrectangles that will correctly classify the set. We will assume that each point belongs to one of two classes (i.e., we have a binary classification problem).
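To make the classification rule concrete, here is a minimal sketch (not taken from the paper) of how a point is categorized by a list of nested, axis-parallel rectangles. The representation of a rectangle as (low, high, label) tuples and the default category name are assumptions made for illustration.

```python
def contains(rect, point):
    """True if point lies inside the axis-parallel rectangle (inclusive)."""
    low, high, _label = rect
    return all(lo <= x <= hi for lo, hi, x in zip(low, high, point))

def classify(point, nested_rects, default="A"):
    """Assign the label of the innermost rectangle containing the point.

    nested_rects is assumed to be ordered from outermost to innermost,
    matching the strictly nested structures considered in the paper.
    """
    label = default
    for rect in nested_rects:          # walk inward
        if contains(rect, point):
            label = rect[2]            # innermost containing rectangle wins
        else:
            break                      # nesting: no deeper rectangle can contain it
    return label

# Two nested rectangles in the plane: outer labeled "B", inner labeled "A".
rects = [((0.0, 0.0), (10.0, 10.0), "B"), ((3.0, 3.0), (6.0, 6.0), "A")]
print(classify((4.0, 4.0), rects))    # -> "A" (inner exception region)
print(classify((1.0, 9.0), rects))    # -> "B"
print(classify((20.0, 20.0), rects))  # -> "A" (default category)
```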

Our algorithms learn by processing examples one at a time, in a random order. The algorithm is allowed to store no more than a fixed number S of the examples. In addition, the algorithm may have some additional constant amount of storage. On each pass through the data, the algorithm sees all of the examples exactly once. In most cases, the order of the examples on each pass is independent of the order on other passes.

Given these definitions, we would like to answer the following general question: how many passes p through the data are required to construct a minimum set of nested hyperrectangles? We will present several algorithms, and show how the number of passes required changes as a function of the amount of storage S and of the minimum number of rectangles R.

4 Learnability

There are some input sets which cannot be learned with nested isothetic rectangles. We say a set of examples is learnable if the nested rectangle problem for this set has a solution. Here we present a necessary and sufficient condition for the learnability of a set of examples.

Definition 1: In a binary classification problem, an isothetic hyperrectangle with nonzero area is called a blocking rectangle if every edge of the hyperrectangle intersects points of both classes.
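As an illustration of Definition 1 (not part of the paper), the following sketch checks whether a given axis-parallel rectangle in the plane is a blocking rectangle for a labeled point set. The tuple representation and the exact-equality test for a point lying on an edge are simplifying assumptions.

```python
def is_blocking_rectangle(rect, points):
    """rect = ((xlo, ylo), (xhi, yhi)); points = [((x, y), label), ...].

    The rectangle is blocking if every one of its four edges passes
    through points of both classes (Definition 1, two dimensions).
    """
    (xlo, ylo), (xhi, yhi) = rect
    if xlo >= xhi or ylo >= yhi:          # zero area: cannot be a blocking rectangle
        return False

    def on_edge(p, edge):
        x, y = p
        if edge == "left":   return x == xlo and ylo <= y <= yhi
        if edge == "right":  return x == xhi and ylo <= y <= yhi
        if edge == "bottom": return y == ylo and xlo <= x <= xhi
        return y == yhi and xlo <= x <= xhi   # "top"

    for edge in ("left", "right", "bottom", "top"):
        labels = {label for p, label in points if on_edge(p, edge)}
        if len(labels) < 2:               # this edge misses one of the two classes
            return False
    return True
```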

Theorem 4.1: A set of points from two classes is learnable if and only if there does not exist a blocking rectangle for the set.

A proof may be found in Heath et al. [1990].

5 Static Algorithm

When all of the examples can be stored in memory (S >= n), the nested rectangle problem can be solved by a modified plane sweep. We make one pass through the examples, during which we store all examples.

We sweep 2d axis-parallel hyperplanes, two along each axis. Each hyperplane defines two halfspaces: inside and outside. The intersection of the 2d inside halfspaces is a hyperrectangle. Initially we position the hyperplanes so that the hyperrectangle, R_0, is the smallest that contains all of the input points. Let the category of R_0 (which is not a partitioning rectangle) be C(R_0). The computation proceeds in R steps. In each step, we find the next hyperrectangle, R_{i+1}, nested inside R_i, by sweeping each hyperplane inward (towards the inside halfspace it defines) until it intersects a point not belonging to category C(R_i). In some steps, some hyperplanes may not move at all. The new positions of the hyperplanes define the hyperrectangle which is output.

Note that this algorithm requires that the points are sorted along all d axes, which requires O(dn log n) time. Each pair of hyperplanes sweeps over at most n points, so the sweep takes O(dn) time. Thus, the total time needed, including that for sorting, is O(dn log n).

Theorem 5.1: Given n points in d-dimensional space, the nested rectangle problem can be solved in O(dn log n) time and O(dn) space.
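The sketch below (an illustration, not the authors' implementation) mirrors the structure of the static algorithm: each nested rectangle is the bounding box of the opposite-category points remaining inside the current one. It recomputes bounding boxes directly rather than maintaining sorted sweep lines, so it does not achieve the O(dn log n) bound; the outer_category argument, naming the category of the background region R_0, is an assumption of the sketch.

```python
def bounding_box(points):
    """Smallest axis-parallel box containing the given points."""
    dim = len(points[0])
    lows = tuple(min(p[i] for p in points) for i in range(dim))
    highs = tuple(max(p[i] for p in points) for i in range(dim))
    return lows, highs

def inside(box, p):
    lows, highs = box
    return all(lo <= x <= hi for lo, x, hi in zip(lows, p, highs))

def static_nested_rectangles(examples, outer_category):
    """examples: list of (point, label); outer_category: label of R_0.

    Binary classification is assumed.  Repeatedly shrink to the bounding
    box of the opposite-category points still inside the current box,
    emitting one partitioning rectangle per step.
    """
    labels = sorted({lab for _, lab in examples})
    other = {labels[0]: labels[-1], labels[-1]: labels[0]}
    current_box = bounding_box([p for p, _ in examples])
    current_cat = outer_category
    rects = []
    while True:
        contrary = [p for p, lab in examples
                    if lab == other[current_cat] and inside(current_box, p)]
        if not contrary:                      # every remaining point agrees: done
            return rects
        current_box = bounding_box(contrary)  # sweep the 2d hyperplanes inward
        current_cat = other[current_cat]
        rects.append((current_box, current_cat))
```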

6 Limited Memory Algorithms

6.1 Simple limited memory algorithm

The static algorithm can easily be converted to run with limited memory. First, suppose we have fixed memory sufficient to store all rectangles, but not all examples. We show that R passes will suffice to find R rectangles. Intuitively, to find the outermost rectangle, our algorithm finds the maximum and the minimum point in each dimension. These 2d points define the edges of the first hyperrectangle. Inductively, assume i rectangles have been constructed and the category of R_i is C(R_i). Let inside(R_i) be the set of points inside the i-th rectangle. We now find the maximum and minimum point in each dimension for category C(R_{i+1}) among the points that belong to inside(R_i). This can be done in one pass by storing, for each dimension, the smallest and largest point of category C(R_{i+1}) in inside(R_i) seen so far. We revise our current estimate of the smallest and largest points (if necessary) and continue as each new point is processed. Since we can find the next partitioning hyperrectangle in one pass, we can solve the partitioning problem with 4d+1 storage locations in R passes.

Theorem 6.1: Given 4d+1 storage locations, it is possible to solve the d-dimensional nested rectangle problem in R passes, where R is the number of rectangles.

This algorithm can be seen as a line adjustment algorithm similar to perceptrons. Unlike the standard perceptron algorithm, it has the following characteristics (a sketch of the one-pass rectangle-finding step appears at the end of this subsection):

• It is guaranteed to converge in R passes.
• Hyperplanes always move in the same direction.
• A hyperplane makes large incremental adjustments towards its final location.
• The network can classify input data that is not classifiable using the standard perceptron algorithm.
• A hyperplane defining rectangle R_{i+1} will not be adjusted until R_i has reached its final location.

Our major results, described below, illustrate that in many cases we can improve the performance of our algorithm to learn the concept structure correctly in far fewer passes. Additionally, if R is small with respect to n, then R passes are required to define the rectangles.
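The following sketch (an illustration, not the authors' code) shows the single pass that finds R_{i+1}: it streams the examples once, keeping only per-dimension minima and maxima for the target category among points inside the current rectangle, i.e., O(d) working storage.

```python
def next_rectangle_one_pass(stream, current_rect, target_category):
    """One pass over the example stream using O(d) working storage.

    stream yields (point, label) pairs; current_rect is (lows, highs) or
    None when finding the first (outermost) rectangle.  Returns the
    bounding box of the target_category points inside current_rect, or
    None if there are no such points.
    """
    lows, highs = None, None
    for point, label in stream:
        if label != target_category:
            continue
        if current_rect is not None:
            clo, chi = current_rect
            if not all(a <= x <= b for a, x, b in zip(clo, point, chi)):
                continue                      # outside the current rectangle
        if lows is None:                      # first qualifying point
            lows, highs = list(point), list(point)
        else:
            lows = [min(a, x) for a, x in zip(lows, point)]
            highs = [max(b, x) for b, x in zip(highs, point)]
    return None if lows is None else (tuple(lows), tuple(highs))
```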

6.2 Speeding up the limited memory algorithm

In this section we show that, with a simple modification of the static algorithm, we can design an algorithm for the nested rectangle problem that runs in fewer than R passes, where R is the number of rectangles. A more efficient scheme will be described in the next section.

As above, the outermost rectangle can be found in one pass with only 2d+1 storage locations. Assume inductively that i rectangles have been found using storage S = 2d(2 + s) + 1. We use 2d storage locations for maintaining a hyperrectangle W, which is a window outside of which we have found all rectangles {R_1, ..., R_i}, where R_i is innermost. Initially, W contains all the examples (no rectangles have been found). All points outside W can be ignored while the algorithm positions one or more rectangles within R_i. For each of the 2d hyperplanes that define W, we allot s memory locations, which will be used to find the s closest points (of any color) to the hyperplane that are inside W. All of these points can be found in one pass. Once they are in memory, the set of s points associated with each hyperplane is sorted. We define an alternation to be a pair of points that belong to different categories and are adjacent in a sorted list. The alternations in each list can be found and then matched up to find partitioning rectangles. If each list has at least one alternation, the outermost alternations in each list define a partitioning rectangle R_{i+1}. Every point stored that is not inside R_{i+1} is removed from the lists, since these points cannot define rectangles nested within R_{i+1}. Now, the outermost alternations in each list define R_{i+2}. We continue placing rectangles this way until at least one list has no alternations. Note that some list could have no alternations at the beginning of the pass. In this case, we simply find no rectangles in that pass. In any case, we redefine W as follows: for each list that still contains alternations, we move its associated hyperplane to the last alternation that was deleted (if any). We move every other hyperplane to the innermost point in its list. Because these lists have no alternations, they cannot generate partitioning rectangles. Since at least one list has no alternations, at least one hyperplane has moved in by s points (i.e., at least s points have been excluded from W). Since no point that leaves W can reenter, W will be empty in no more than n/s passes.

Theorem 6.2: Given S storage locations, it is possible to solve the d-dimensional nested rectangle problem in O(nd/S) passes.

In particular, when the number of rectangles R is larger than nd/S, the O(nd/S) bound means the number of passes is smaller than R. In other words, we can always solve the problem in min(R, O(nd/S)) passes irrespective of the number of rectangles.
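A faithful multi-dimensional implementation needs all of the bookkeeping described above. As a simplified, hypothetical illustration of the per-pass work, the sketch below handles one dimension: a single pass keeps the s points closest to each end of the window W, and alternations are then read off the two sorted lists.

```python
import heapq

def one_pass_window(stream, window, s):
    """One pass over (x, label) pairs, keeping the s points closest to
    each end of window = (lo, hi).  Points outside W are ignored.

    Returns (left_pts, right_pts): each a list of (x, label) sorted in
    inward order (ascending x for the left end, descending for the right).
    """
    lo, hi = window
    left, right = [], []    # heaps keyed so the farthest kept point pops first
    for x, label in stream:
        if not (lo <= x <= hi):
            continue
        heapq.heappush(left, (-(x - lo), x, label))   # keep s smallest distances to lo
        if len(left) > s:
            heapq.heappop(left)
        heapq.heappush(right, (-(hi - x), x, label))  # keep s smallest distances to hi
        if len(right) > s:
            heapq.heappop(right)
    left_pts = sorted((x, lab) for _, x, lab in left)
    right_pts = sorted(((x, lab) for _, x, lab in right), reverse=True)
    return left_pts, right_pts

def alternations(sorted_pts):
    """Indices i where consecutive stored points change category."""
    return [i for i in range(len(sorted_pts) - 1)
            if sorted_pts[i][1] != sorted_pts[i + 1][1]]
```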
6.3 Limited memory with sampling

All of the preceding algorithms work by finding alternations from the outside, working inwards. In this section we present a different algorithm that works in one dimension. This algorithm finds alternations on the real line, so it is applicable to any convex partitioning of the line, such as decision trees and intervals. The algorithm is fairly complex and demonstrates the difficulty of learning efficiently with limited storage even in one dimension.

As before, we define an alternation to be a pair of adjacent points which belong to different categories. Note that once we detect all the alternations it is easy to find the concept structure. Clearly, if the concept being acquired is classifiable with R rectangles, then there are no more than 2R alternations. Our algorithm will detect and store all the alternations. The intuitive explanation is as follows. We use one pass to find the size of the input space, and then we partition the input space into intervals of approximately equal size. The algorithm maintains markers for approximately I intervals, where I is chosen appropriately (see below). Once these intervals are established, we can find the outermost two alternations (if they exist) in each interval in one pass. If an interval contains no alternations, the input points in the interval are ignored in subsequent passes. Our algorithm keeps track of only the active intervals, namely intervals that have alternations in them.

The following schematic code fragment is executed repeatedly until all alternations have been found (a runnable sketch appears after the analysis below):

1. while |Int| < I and some interval is splittable
2.   split all splittable intervals
3.   check each interval for an alternation
4.   discard intervals not containing alternations
5. end while
6. detect one or two alternations from each interval i in Int

Int is the current set of intervals, initially the entire real line, containing all n of the input points. The method we use to split the intervals in step 2 is guaranteed to split each interval into a small constant number of subintervals, each of which contains no more than a constant fraction of the interval's points. Since the intervals will be shrinking in size by a constant fraction each time step 2 is executed, the while loop will be iterated no more than O(log n) times. Let s be the storage available per interval (roughly S/I). The while loop guarantees that there are at least I intervals, and each one contains an alternation. Step 6 will find I alternations in one pass, so it will require at most 2R/I passes. O(I) storage locations will be used to maintain the intervals.

In step 2, we make use of a limited storage splitting technique developed by Munro and Paterson [1980]. It splits each interval into three subintervals, each of which is no larger than a constant fraction of the size of the interval. To split I intervals it takes O(log^2 n / s) passes, using sI storage locations. (If s is o(log n), the examples must appear in the same order for the O(log^2 n / s) passes.) Since step 2 is executed at most log n times, a total of O(log^3 n / s) passes will be needed for step 2. The algorithm thus uses O(R/I + log^3 n / s) passes with O(sI) memory locations.

The choice of I depends on R and the memory/speed trade-off desired (see Table 1). In practice, we always have enough space for the rectangles themselves, and therefore we can find all the alternations within the bound above.

Table 1: Choosing the number of intervals I.
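A minimal sketch (illustrative only) of steps 3-6 appears below. It assumes that each active interval's points fit in its per-interval budget, which is precisely the assumption that the Munro-Paterson splitting of step 2 is used to enforce in the real algorithm; the interval representation and stream format are choices made for the sketch.

```python
from bisect import bisect_right

def scan_active_intervals(stream, intervals):
    """One simplified pass of steps 3-6.

    intervals: sorted, non-overlapping (lo, hi) active intervals on the line.
    stream yields (x, label).  Points falling in an active interval are
    buffered; each interval then reports its alternations (adjacent points
    with different labels) and intervals without any are dropped (step 4).
    """
    buckets = {iv: [] for iv in intervals}
    starts = [lo for lo, _ in intervals]
    for x, label in stream:
        j = bisect_right(starts, x) - 1
        if j >= 0 and intervals[j][0] <= x <= intervals[j][1]:
            buckets[intervals[j]].append((x, label))   # point in an active interval

    still_active, found = [], []
    for iv in intervals:
        pts = sorted(buckets[iv])
        alts = [(pts[i], pts[i + 1]) for i in range(len(pts) - 1)
                if pts[i][1] != pts[i + 1][1]]
        if alts:
            still_active.append(iv)
            found.append((iv, alts[0], alts[-1]))      # outermost one or two alternations
    return still_active, found
```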

There are many situations in which the number of rectangles R is not known beforehand. In this case, a binary search technique can be applied to the above algorithm. Initially, we run the algorithm assuming that 2 rectangles are sufficient to completely learn the structure. In one pass, it is possible to verify that the solution generated does indeed correctly classify the input. If not, we can double the number of rectangles and try again. We continue doubling the number of rectangles until the algorithm successfully finds the nested rectangles, using O(R) storage. Because the number of rectangles is doubled in each step, and successful classification is guaranteed when the number of rectangles is at least R, the partitioning algorithm will run at most log R times. Thus, the total number of passes needed to determine R and find the rectangles is at most a log R factor larger than the bound above.

A generalization of the algorithm to higher dimensions is not obvious and is being investigated. We are considering a randomized variant of this algorithm for multiple dimensions. It partitions the input set into regions, each containing a fixed fraction of the points. Then the algorithm finds a rectangle in each region in each pass. While such an algorithm may work well in practice, it may not find the minimum set of rectangles, and may fail to find a correct generalization when one exists.
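A minimal sketch (illustrative only) of the doubling strategy described above; learn_with_budget and classifies_correctly stand in for the limited-memory learner and the one-pass verification, and are assumptions of the sketch.

```python
def learn_unknown_R(examples, learn_with_budget, classifies_correctly):
    """Double the rectangle budget until the learned structure verifies.

    learn_with_budget(examples, budget) -> candidate nested-rectangle structure
    classifies_correctly(structure, examples) -> bool, checkable in one pass
    """
    budget = 2
    while True:
        structure = learn_with_budget(examples, budget)
        if classifies_correctly(structure, examples):   # one verification pass
            return structure, budget
        budget *= 2                                     # at most log R doublings
```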
7 Lower Bound

In this section we show that when the concept structure is fairly simple, and the amount of storage necessary to represent the concept is therefore small, the number of passes required is proportional to the number of alternations. Intuitively, the alternations represent the concept boundaries, so the size of the representation constructed by many learning algorithms is proportional to the number of alternations in the input set. For example, decision tree learning algorithms must construct one branch for every alternation. The theorem below is stated in terms of rectangles, but our lower bound technique also applies to the number of passes necessary to find alternations. The same result holds for decision trees or any other method of partitioning into convex regions.

We will use a comparison-based model which is sufficiently strong to support our previous algorithm. The following theorem shows that any comparison-based algorithm that has kR storage, where k is some constant, needs at least R passes over the examples when R is sufficiently small compared to n.

Theorem 7.1: Any comparison-based algorithm for solving the nested rectangle problem with kR storage requires R passes to find R partitioning rectangles, when there are n points in the input and R is sufficiently small compared to n.

A full proof of this theorem may be found in Heath et al. [1990]. We give an intuitive explanation for the case when R = 2, where two passes are needed to solve the nested rectangle problem. We will find, for any algorithm, two distinct sets of inputs indistinguishable by the algorithm that cannot be categorized with the same set of two nested rectangles. Suppose we have two categories, "blue" and "green." Consider a trainer that presents points in the blue category first. Because the learner has limited storage, it will eventually forget at least one blue point. After all n/2 blue points are presented, the trainer presents three points from the green category, surrounding the forgotten blue point. It can choose either of two ways to position the green points (see Figure 2). The algorithm will have the opportunity to find the outermost (green) partitioning rectangle. To correctly place the inner (blue) rectangle, the program must surround the forgotten blue point. However, the program will not be able to tell if the input is given by Case 1 or Case 2 of Figure 2. If the program were to decide one way, the trainer could have chosen to present the other case. Because no comparisons between the forgotten blue points and the three green points are performed, the two cases are indistinguishable.

Figure 2: Two possible input patterns.

8 Parallel Algorithms

Connectionist architectures have natural parallel implementations. This has motivated us to examine whether our algorithms also have natural parallel implementations. Given n points on the line, we can solve the nested rectangle problem efficiently in parallel using P processors. For details, see Heath et al. [1990].

Parallel algorithms can be developed to solve the nested rectangle problem in any fixed number of dimensions. One would hope that our parallel algorithm could be generalized to work in any number of dimensions. However, we show that when the dimensionality is not fixed, the nested rectangle problem becomes log-space complete for P, or P-complete. This suggests that when the dimensionality of the problem is not fixed, there is no efficient (i.e., polylog time with a polynomial number of processors) parallel algorithm. This condition is usually considered a strong indication of inherent sequentiality, since there are no known methods to achieve significant speed-ups of such problems on parallel architectures. See Heath et al. [1990] for a proof of this result.

9 Conclusion

Intuitively, a learning program with fixed storage cannot create an optimal set of nested rectangles to classify a set of points (examples). The problem is that, since memory is limited, the program must forget some of the examples, and these examples may be misclassified as a result. It is clear, however, that a partial generalization constructed in p passes can be further refined using more passes. This paper makes precise the trade-offs between storage, number of passes, and classification accuracy.

Table 2: Number of passes required as R increases.

When an algorithm only has enough memory to store the R rectangles themselves, it requires (in the worst case) no more than R passes through the data. In addition, when R is large with respect to the number of examples n, we can do much better. For instance, when R is at least n^r for some constant r, far fewer than R passes through the data are needed. For most learning problems, we do not know the size of R in advance (i.e., we do not know how compact the generalization might be). However, our algorithm can still learn both R and the target concept accurately, at the cost of at most a log R factor more passes. Table 2 summarizes our results for different ranges of R.

From a practical standpoint, these results allow one to make some statements about the trade-off between processing time and storage for algorithms that learn nested hyperrectangles. As long as storage is not limited, we can use an algorithm that (with one pass through the data) runs in O(dn log n) time, where d is the number of features for each example, to find an optimal (i.e., minimum) set of rectangles. If storage is fixed and the number of examples is large, then min(R, O(nd/S)) passes are sufficient to find the optimal set of nested rectangles, depending on the size of R with respect to n. If fewer passes are allowed, then the rectangular concepts learned by the algorithm may misclassify some of the examples.

We have studied the complexity of non-incremental learning algorithms for parallel models of computation. In Heath et al. [1990], we present an optimal parallel algorithm for learning in one dimension, and show that learning nested hyperrectangles when the dimensionality is not fixed is P-complete (inherently sequential).

The major open problems we are currently considering include developing efficient limited memory algorithms for data sets with multiple dimensions. We plan to implement these algorithms for experimental tests on real data. We also are working on improvements to our currently crude parallel algorithms for multiple dimensions. Additionally, we are considering the problem of efficiently constructing near-optimal concept structures.

References

[Aha and Kibler, 1989] D. Aha and D. Kibler. Noise-tolerant instance-based learning algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI, 1989. Morgan Kaufmann.

[Heath et al., 1990] D. Heath, S. Kasif, R. Kosaraju, S. Salzberg, and G. Sullivan. Learning nested concept classes with limited storage. Technical Report JHU-90/9, Computer Science Department, The Johns Hopkins University, 1990.

[Helmbold et al., 1989] D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection closed concept classes. In Proceedings of the 1989 Workshop on Computational Learning Theory, San Mateo, 1989.

[Karp and Ramachandran, 1988] R. Karp and V. Ramachandran. A survey of parallel algorithms for shared-memory machines. Technical Report UCB/CSD 88/408, Computer Science Division, University of California, Berkeley, 1988.

[Munro and Paterson, 1980] J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315-323, 1980.

[Quinlan, 1986] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[Salzberg, 1989a] S. Salzberg. Learning with Nested Generalized Exemplars. PhD thesis, Harvard University, 1989. Technical Report TR-14-89.

[Salzberg, 1989b] S. Salzberg. Nested hyper-rectangles for exemplar-based learning. In K.P. Jantke, editor, Analogical and Inductive Inference: International Workshop AII '89, pages 184-201. Springer-Verlag, Berlin, 1989.

[Salzberg, 1991] S. Salzberg. A nested hyperrectangle learning method. Machine Learning, 6:251-276, 1991.

[Utgoff, 1989] P. Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161-186, 1989.

[Valiant, 1984] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
