
Decision Tree Classification of Spatial Data Streams Using Peano Count Tree1, 2

Qiang Ding, Qin Ding and William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
{Qiang_Ding, Qin_Ding, William_Perrizo}@ndsu.nodak.edu

Abstract Many organizations have large quantities of spatial data collected in various application areas, including remote sensing, geographical information systems (GIS), astronomy, computer cartography, environmental assessment and planning, etc. Many of these data collections are growing rapidly and can therefore be considered as spatial data streams [15, 16]. For data stream classification, time is a major issue. However, many of these spatial data sets are too large to be classified effectively in a reasonable amount of time using existing methods. In this paper, we develop a P-tree method for decision tree classification. The Peano Count Tree (P-tree) is a spatial data organization that provides a lossless compressed representation of a spatial data set and facilitates classification and other data mining techniques. We compare P-tree decision tree induction classification and a classical decision tree induction method with respect to the speed at which the classifier can be built (and rebuilt when substantial amounts of new data arrive). We show that the P-tree method is significantly faster than existing training methods, making it the preferred method for working with spatial data streams.

Keywords Data mining, Classification, Decision Tree Induction, Spatial Data, Peano Count Tree, Data Stream

1. Introduction
Data streams are very common today. Classifying open-ended data streams brings challenges and opportunities since traditional techniques often cannot complete the work as quickly as the data is arriving in the stream [15, 16]. Spatial data collected from sensor platforms in space, from airplanes or from other platforms are typically updated periodically. For example, AVHRR (Advanced Very High Resolution Radiometer) data is updated every hour or so (8 times each day during daylight hours). Such data sets can be very large (multiple gigabytes) and are often archived in deep storage before valuable information can be obtained from them.

1 Patents are pending on the bSQ and Ptree technology.

2 This work is partially supported by GSA Grant ACT#: K96130308 CIT:142.0.K00PA 131.50.25.H40.516, DARPA Grant DAAH04-96-1-0329, NSF Grant OSR-9553368.

An objective of spatial data stream mining is to mine such data in near real time, prior to deep-storage archiving. In this paper we focus on classification data mining. In classification, a training or learning set is identified for the construction of a classifier. Each record in the learning set has several attributes, one of which, the goal or class label attribute, indicates the class to which each record belongs. The classifier, once built and tested, is used to predict the class label of new records which do not yet have a class label attribute value. A test set is used to test the accuracy of the classifier once it has been developed; the classifier, once certified, is used to predict the class label of future unclassified data. Different models have been proposed for classification, such as decision trees, neural networks, Bayesian belief networks, fuzzy sets, and genetic models. Among these models, decision trees are widely used for classification, and we focus on inducing decision trees in this paper. ID3 (and its variants such as C4.5) [1] [2] and CART [4] are among the best known classifiers that use decision trees. Other decision tree classifiers, such as Interval Classifier [3] and SPRINT [5], concentrate on making it possible to mine databases that do not fit in main memory by requiring only sequential scans of the data. Classification has been applied in many fields, such as retail target marketing, customer retention, fraud detection and medical diagnosis [14]. Spatial data is a promising area for classification.

In this paper, we propose a decision tree based model to perform classification on spatial data. The Peano Count Tree (P-tree) [17] is used to build the decision tree classifier. P-trees represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. With the information in P-trees, we can rapidly build the decision tree. Each new component in a spatial data stream is converted to P-trees and added to the training set as soon as possible. Typically, a window of data components from the stream is used to build (or rebuild) the classifier. There are many ways to define the window, depending on the data and application. In this paper, we focus on a fast classifier-building algorithm.

The rest of the paper is organized as follows. In Section 2, we briefly introduce the data formats of spatial data and describe the P-tree data structure and algebra. In Section 3, we detail our decision tree induction classifier using P-trees. A performance analysis is given in Section 4. Section 5 discusses some related work. Finally, Section 6 concludes the paper.

2. Data Structures
There are several formats used for spatial data, such as Band Sequential (BSQ), Band Interleaved by Line (BIL) and Band Interleaved by Pixel (BIP). In previous work [17] we proposed a new format called bit Sequential (bSQ). Each bit of each band is stored separately as a bSQ file. Each bSQ file can be reorganized into a quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ file and its basic P-tree.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

P-tree: level 3 (root): 55
        level 2:        16   8   15   16
        level 1:        under the 8: 3 0 4 1;  under the 15: 4 4 3 4
        level 0 leaves: 1110 (under the 3), 0010 (under the 1), 1101 (under the 3 below 15)

Figure 1. 8-by-8 image and its P-tree

In this example, 55 is the count of 1's in the entire image; the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits, we do not need sub-trees for these two quadrants. This pattern continues recursively. The recursive raster ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf level (level 0), where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including those for quadrants that are pure 1-bits, the leaf sequence would be just the Peano space-filling curve for the original raster image.

For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit position. For the i-th band, Bi, we label the basic P-trees Pi,1, Pi,2, ..., Pi,8; thus, Pi,j is a lossless representation of the j-th bits of the values from the i-th band. However, Pi,j provides more information than the raw bits and is structured to facilitate data mining processes. Some of the useful features of P-trees are collected below; details on these features can be found later in this paper or in our earlier work [17].

- P-trees contain the 1-count for every quadrant of every dimension.
- The P-tree for any sub-quadrant at any level is simply the sub-tree rooted at that sub-quadrant.
- A P-tree leaf sequence (depth-first) is a partial run-length compressed version of the original bit-band.
- Basic P-trees can be combined to reproduce the original data (P-trees are lossless representations).
- P-trees can be partially combined to produce upper and lower bounds on all quadrant counts.
- P-trees can be used to smooth data by bottom-up quadrant purification (bottom-up replacement of mixed counts with their closest pure counts; see Section 3 below).

For a spatial data set with 2^n rows and 2^n columns, there is a convenient mapping from raster coordinates (x, y) to Peano coordinates (quadrant id). If x and y are expressed as n-bit strings, x1x2...xn and y1y2...yn, then the mapping is (x, y) = (x1x2...xn, y1y2...yn) → x1y1.x2y2...xnyn. Thus, in an 8-by-8 image, the pixel at (3,6) = (011,110) has quadrant id 01.11.10 = 1.3.2. To locate all quadrants containing (3,6) in the tree, start at the root, follow sub-tree pointer 1, then sub-tree pointer 3, then sub-tree pointer 2 (where the 4 sub-tree pointers are 0, 1, 2, 3 at each node). Finally, we note that these P-trees can be generated quite quickly [18] and can be viewed as a "data mining ready", lossless format for storing spatial data.

The basic P-trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce P-trees for the original values (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, ..., or 8-bit precision. Using all 8 bits, for instance, Pb,11010011 can be constructed from the basic P-trees as:

Pb,11010011 = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8, where ' indicates the bit-complement (which is simply the count complement in each quadrant). This is called a value P-tree. The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ (BIL, BIP) data for a Landsat TM satellite image (approximately 60 million pixels) to its basic P-trees can be done in just a few seconds on a high-performance PC; this is a one-time cost. We also note that we store the basic P-trees in a "depth-first" data structure, which specifies the pure-1 quadrants only. Using this data structure, each AND can be completed in a few milliseconds and the resulting counts can be computed easily once the AND and COMPLEMENT program has completed. Finally, we note that data in the relational format can also be represented as P-trees. For any combination of values, (v1,v2,...,vn), where vi is from band i, the quadrant-wise count of occurrences of this tuple of values is given by P(v1,v2,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn. This is called a tuple P-tree.
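To make the value and tuple P-tree constructions concrete, the following is a minimal sketch (ours, not the authors' implementation) that emulates them on uncompressed bit-bands stored as flat Python lists of 0/1 values; the root counts it produces are the same ones a compressed P-tree AND would deliver.

# Minimal sketch: a value P-tree's root count equals the number of 1s in the
# AND of the band's basic bit-bands, complemented wherever the corresponding
# bit of v is 0; a tuple P-tree ANDs the value masks of several bands.

def value_mask(bit_bands, v, nbits=8):
    """AND the basic bit-bands of one band for value v (Pb,v)."""
    npix = len(bit_bands[0])
    mask = [1] * npix
    for j in range(nbits):                       # bit j of v, most significant first
        want = (v >> (nbits - 1 - j)) & 1
        plane = bit_bands[j]
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, plane)]
    return mask

def tuple_mask(bands, values, nbits=8):
    """P(v1,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn."""
    mask = [1] * len(bands[0][0])
    for bit_bands, v in zip(bands, values):
        mask = [m & x for m, x in zip(mask, value_mask(bit_bands, v, nbits))]
    return mask

def root_count(mask):
    """Root count of the corresponding P-tree: the number of matching pixels."""
    return sum(mask)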

3. The Classifier
Classification is a data mining technique which typically involves three phases: a learning phase, a testing phase and an application phase. A learning model or classifier is built during the learning phase. It may be in the form of classification rules, a decision tree or a mathematical formula. Since the class label of each training sample is provided, this approach is known as supervised learning. In unsupervised learning (clustering), the class labels are not known in advance. In the testing phase, test data are used to assess the accuracy of the classifier. If the classifier passes the test phase, it is used for the classification of new, unclassified data tuples. This is the application phase; the classifier predicts the class label for these new data samples. In this paper, we consider the classification of spatial data in which the resulting classifier is a decision tree (decision tree induction). Our contributions include:

- a set of classification-ready data structures called Peano Count trees, which are compact, rich in information and facilitate classification;
- a data structure for organizing the inputs to decision tree induction, the Peano count cube; and
- a fast decision tree induction algorithm, which employs these structures.

We point out that the classifier is precisely the classifier built by the ID3 decision tree induction algorithm [4]. The point of the work is to reduce the time it takes to build and rebuild the classifier as new data continue to arrive in the data stream.

3.1 Data Smoothing and Attribute Relevance
In the overall classification effort, as in most data mining approaches, there is a data preparation stage in which the data is prepared for classification. Data preparation can involve cleaning (noise reduction by applying smoothing techniques and missing-value management techniques). The P-tree data structure facilitates a proximity-based data smoothing method, which can reduce the data classification time considerably. The smoothing method is called bottom-up purity shifting. By replacing 3-counts with 4 and 1-counts with 0 at level 1 (and making the resultant changes on up the tree), the data is smoothed and the P-tree is compressed. More drastic smoothing can also be effected: the user can select which set of counts to replace with pure-1 and which set of counts to replace with pure-0. The most important thing to note is that this smoothing can be done almost instantaneously once P-trees are constructed. With this method it is feasible to actually smooth data from the data stream before mining.

Another important pre-classification step is relevance analysis (selecting only a subset of the feature attributes, so as to improve algorithm efficiency). This step can involve removal of irrelevant attributes or removal of redundant attributes. An effective way to determine relevance is to roll up the P-cube to the class label attribute and each other potential decision attribute in turn. If any of these rollups produces counts that are uniformly distributed, then that attribute is not going to be effective in classifying the class label attribute. The rollup can be computed from the basic P-trees without necessitating the actual creation of the P-cube. This can be done by ANDing the class label attribute P-trees with the P-trees of the potential decision attribute. Only an estimate of uniformity in the root counts is needed. By ANDing down to a fixed depth of the P-trees, progressively better estimates can be obtained: ANDing to depth 1 provides the roughest distribution information, ANDing to depth 2 provides better distribution information, and so forth. Again, the point is that P-trees facilitate simple real-time relevance analysis, which makes it feasible for data streams.
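The bottom-up purity shifting described at the start of this subsection can be illustrated directly on a bit-band. The following is a toy sketch (ours, not the paper's code): it replaces level-1 quadrants holding three 1-bits with pure-1 and those holding a single 1-bit with pure-0, the default replacement named in the text; the paper applies the same replacement to P-tree counts rather than raw bits.

# Toy sketch of bottom-up purity shifting on a 2^n x 2^n bit-band.
def purity_shift(bits, to_pure1=frozenset({3}), to_pure0=frozenset({1})):
    """Replace each 2x2 quadrant whose 1-count is in to_pure1 with all 1s
    and each whose 1-count is in to_pure0 with all 0s."""
    n = len(bits)
    out = [row[:] for row in bits]
    for r in range(0, n, 2):
        for c in range(0, n, 2):
            cnt = bits[r][c] + bits[r][c + 1] + bits[r + 1][c] + bits[r + 1][c + 1]
            if cnt in to_pure1:
                fill = 1
            elif cnt in to_pure0:
                fill = 0
            else:
                continue                      # quadrant left unchanged
            for dr in (0, 1):
                for dc in (0, 1):
                    out[r + dr][c + dc] = fill
    return out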

3.2 Classification by Decision Tree Induction Using P-trees
A decision tree is a flowchart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and the leaf nodes represent classes or class distributions. Unknown samples can be classified by testing attributes against the tree. The path traced from root to leaf holds the class prediction for that sample. The basic algorithm for inducing a decision tree from the learning or training sample set is as follows ([2] [13]):

- Initially the decision tree is a single node representing the entire training set.
- If all samples are in the same class, this node becomes a leaf and is labeled with that class label.
- Otherwise, an entropy-based measure, "information gain", is used as a heuristic for selecting the attribute which best separates the samples into individual classes (the "decision" attribute).
- A branch is created for each value of the test attribute and samples are partitioned accordingly.
- The algorithm advances recursively to form the decision tree for the sub-sample set at each partition. Once an attribute has been used, it is not considered in descendent nodes.
- The algorithm stops when all samples for a given node belong to the same class or when there are no remaining attributes (or some other stopping condition).

The attribute selected at each decision tree level is the one with the highest information gain. The information gain of an attribute is computed as follows. Assume B[0] is the class attribute; the others are non-class attributes. We store the decision path for each node. For example, in the decision tree below (Figure 2), the decision path for node N09 is "Band2, value 0011; Band3, value 1000". We use RC to denote the root count of a P-tree. Given node N's decision path B[1], V[1], B[2], V[2], ..., B[t], V[t], let the P-tree P = PB[1],V[1] ^ PB[2],V[2] ^ ... ^ PB[t],V[t]. We can calculate node N's information I(P) as

I(P) = - Σ(i=1..n) pi * log2 pi,   where pi = RC(P ^ PB[0],V0[i]) / RC(P),

and V0[1]..V0[n] are the possible B[0] values if classified by B[0] at node N. If N is the root node, then P is the full P-tree (its root count is the total number of transactions). Now, to evaluate the information gain of attribute A at node N, we use the formula:

Gain(A) = I(P) - E(A),   where the entropy E(A) = Σ(i=1..n) [ RC(P ^ PA,VA[i]) / RC(P) ] * I(P ^ PA,VA[i]),

and VA[1]..VA[n] are the possible A values if classified by attribute A at node N.
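A sketch (ours) of this gain computation follows, where rc(conds) stands for the root count of the AND of the value P-trees named by conds, a list of (band, value) pairs; over an uncompressed training set that root count is simply the number of samples matching every condition, which is how make_rc emulates it here.

from math import log2

def make_rc(samples):
    """Emulate P-tree root counts: samples is a list of tuples of band values."""
    def rc(conds):
        return sum(all(s[b] == v for b, v in conds) for s in samples)
    return rc

def info(rc, path, class_band, class_values):
    """I(P) = -sum_i pi*log2(pi), with pi = RC(P ^ P_B0,vi) / RC(P)."""
    total = rc(path)
    if total == 0:
        return 0.0
    ps = [rc(path + [(class_band, v)]) / total for v in class_values]
    return -sum(p * log2(p) for p in ps if p > 0)

def gain(rc, path, class_band, class_values, attr_band, attr_values):
    """Gain(A) = I(P) - sum_i RC(P ^ P_A,ai)/RC(P) * I(P ^ P_A,ai)."""
    total = rc(path)
    expected = sum(rc(path + [(attr_band, a)]) / total *
                   info(rc, path + [(attr_band, a)], class_band, class_values)
                   for a in attr_values)
    return info(rc, path, class_band, class_values) - expected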

[Figure 2. A Decision Tree Example: the root node N01 tests B2, with branches for the values 0010, 0011, 0111, 1010 and 1011 leading to nodes N02-N06; those nodes test B1 or B3, leading to nodes N07-N12; a further level of B1 tests leads to nodes N13 and N14.]

3.3 Example
In this example the data is a remotely sensed image (e.g., a satellite image or aerial photo) of an agricultural field, together with the soil moisture levels for the field measured at the same time. We use the whole data set for mining so as to get as good accuracy as we can; these data are divided to make up the learning and test data sets. The goal is to classify the data using soil moisture as the class label attribute and then to use the resulting classifier to predict the soil moisture levels for future times (e.g., to determine capacity to buffer flooding or to schedule crop planting). Branches are created for each value of the selected attribute and subsets are partitioned accordingly. The following training set contains 4 bands of 4-bit data values (expressed in decimal and binary). B1 stands for soil moisture; B2, B3, and B4 stand for channels 3, 4, and 5 of AVHRR, respectively.

FIELD   | CLASS LABEL | REMOTELY SENSED REFLECTANCES
X , Y   |     B1      |     B2     |     B3     |     B4
0,0     |  3 (0011)   |  7 (0111)  |  8 (1000)  | 11 (1011)
0,1     |  3 (0011)   |  3 (0011)  |  8 (1000)  | 15 (1111)
0,2     |  7 (0111)   |  3 (0011)  |  4 (0100)  | 11 (1011)
0,3     |  7 (0111)   |  2 (0010)  |  5 (0101)  | 11 (1011)
1,0     |  3 (0011)   |  7 (0111)  |  8 (1000)  | 11 (1011)
1,1     |  3 (0011)   |  3 (0011)  |  8 (1000)  | 11 (1011)
1,2     |  7 (0111)   |  3 (0011)  |  4 (0100)  | 11 (1011)
1,3     |  7 (0111)   |  2 (0010)  |  5 (0101)  | 11 (1011)
2,0     |  2 (0010)   | 11 (1011)  |  8 (1000)  | 15 (1111)
2,1     |  2 (0010)   | 11 (1011)  |  8 (1000)  | 15 (1111)
2,2     | 10 (1010)   | 10 (1010)  |  4 (0100)  | 11 (1011)
2,3     | 15 (1111)   | 10 (1010)  |  4 (0100)  | 11 (1011)
3,0     |  2 (0010)   | 11 (1011)  |  8 (1000)  | 15 (1111)
3,1     | 10 (1010)   | 11 (1011)  |  8 (1000)  | 15 (1111)
3,2     | 15 (1111)   | 10 (1010)  |  4 (0100)  | 11 (1011)
3,3     | 15 (1111)   | 10 (1010)  |  4 (0100)  | 11 (1011)

This learning dataset is converted to bSQ format. We display the bSQ bit-band values in their spatial positions, rather than displaying them in 1-column files. The Band-1 bit-bands are:

B11     B12     B13     B14
0000    0011    1111    1111
0000    0011    1111    1111
0011    0001    1111    0001
0111    0011    1111    0011
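This decomposition can be reproduced mechanically; a small sketch (ours, not the authors' code) that splits 4-bit band values into their bSQ bit-bands and recovers B11 through B14 for the example:

def to_bsq(band, nbits=4):
    """band: 2-D list of integers; returns nbits bit-planes, most significant first."""
    return [[[(v >> j) & 1 for v in row] for row in band]
            for j in range(nbits - 1, -1, -1)]

# Band-1 (soil moisture) values of the training set, rows indexed by X, columns by Y.
b1 = [[3, 3, 7, 7],
      [3, 3, 7, 7],
      [2, 2, 10, 15],
      [2, 10, 15, 15]]
b11, b12, b13, b14 = to_bsq(b1)   # matches the B11..B14 bit-bands shown above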

Thus, the Band-1 basic P-trees are as follows (tree pointers are omitted).

P1,1: root count 5;  quadrant counts 0 0 1 4;  leaf under the third quadrant: 0001
P1,2: root count 7;  quadrant counts 0 4 0 3;  leaf under the fourth quadrant: 0111
P1,3: root count 16 (pure-1, no sub-trees)
P1,4: root count 11; quadrant counts 4 4 0 3;  leaf under the fourth quadrant: 0111
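These trees can be generated with a short recursion; the sketch below (ours) records the 1-bit count of every quadrant and stops at pure quadrants. The actual implementation stores a compressed depth-first structure of pure-1 quadrants only, but the counts agree: build_ptree applied to the B11 bit-band gives root count 5 with quadrant counts 0, 0, 1, 4, matching P1,1 above.

def build_ptree(bits, r=0, c=0, size=None):
    """Return a nested dict of 1-bit counts; recurse only into mixed quadrants."""
    if size is None:
        size = len(bits)
    count = sum(bits[r + i][c + j] for i in range(size) for j in range(size))
    node = {"count": count}
    if size > 1 and 0 < count < size * size:          # mixed quadrant
        half = size // 2
        node["children"] = [
            build_ptree(bits, r, c, half),                 # quadrant 0 (upper left)
            build_ptree(bits, r, c + half, half),          # quadrant 1 (upper right)
            build_ptree(bits, r + half, c, half),          # quadrant 2 (lower left)
            build_ptree(bits, r + half, c + half, half),   # quadrant 3 (lower right)
        ]
    return node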

We can use the AND and COMPLEMENT operations to calculate all the value P-trees of Band-1, as below.

(e.g., P1,0011 = P1,1’ AND P1,2’ AND P1,3 AND P1,4 )

P1,0000 = 0    P1,0100 = 0    P1,1000 = 0    P1,1100 = 0
P1,0010: root count 3; quadrant counts 0 0 3 0; leaf 1110
P1,0110 = 0
P1,1010: root count 2; quadrant counts 0 0 1 1; leaves 0001 and 1000
P1,1110 = 0

P1,0001 = 0    P1,0101 = 0    P1,1001 = 0    P1,1101 = 0
P1,0011: root count 4; quadrant counts 4 0 0 0
P1,0111: root count 4; quadrant counts 0 4 0 0
P1,1011 = 0
P1,1111: root count 3; quadrant counts 0 0 0 3; leaf 0111

Then we generate basic P-trees and value P-trees similarly for B2, B3 and B4. The basic algorithm for inducing a decision tree from the learning set is as follows.
1. The tree starts as a single node representing the set of training samples, S.
2. If all samples are in the same class (the same B1 value), S becomes a leaf with that class label.
3. Otherwise, use the entropy-based measure, information gain, as the heuristic for selecting the attribute which best separates the samples into individual classes (the test or decision attribute).

Start with A = B2 to classify S into {A1..Av}, where Aj = { t | t(B2) = aj } and aj ranges over those B2 values, v', such that the root count of P2,v' is non-zero. The symbol sij counts the number of samples of class Ci in subset Aj, that is, the root count of P1,v AND P2,v', where v ranges over those B1 values such that the root count of P1,v is non-zero. Thus, the sij counts (from which the probabilities pij below are computed) are as follows (i = row, classes ordered by B1 value; j = column, subsets ordered by B2 value):

           B2=0010  B2=0011  B2=0111  B2=1010  B2=1011
B1=0010       0        0        0        0        3
B1=0011       0        2        2        0        0
B1=0111       2        2        0        0        0
B1=1010       0        0        0        1        1
B1=1111       0        0        0        3        0

Let attribute, A, have v distinct values, {a1..av}. A could be used to classify S into {S1..Sv}, where Sj is the set of samples having value, aj. Let sij be the number of samples of class, Ci, in a subset, Sj. The expected information needed to classify the sample is

I(s1,...,sm) = - Σ(i=1..m) pi * log2 pi,   where pi = si/s

(Here m = 5; si = 3, 4, 4, 2, 3; and pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16, respectively.) Thus,

I = - ( 3/16*log2(3/16) + 4/16*log2(4/16) + 4/16*log2(4/16) + 2/16*log2(2/16) + 3/16*log2(3/16) ) = 2.281

The entropy based on the partition into subsets by B2 is

E(A) = Σ(j=1..v) [ Σ(i=1..m) sij / s ] * I(s1j,...,smj),   where

I(s1j,...,smj) = - Σ(i=1..m) pij * log2 pij,   with pij = sij / |Aj|.

We carry out the computations leading to the gain for B2 in detail.

B2 value (j):                  0010   0011   0111   1010   1011
|Aj| (divisors):                  2      4      2      4      4
p1j:                              0      0      0      0    .75
p2j:                              0     .5      1      0      0
p3j:                              1     .5      0      0      0
p4j:                              0      0      0    .25    .25
p5j:                              0      0      0    .75      0
I(s1j,...,s5j):                   0      1      0   .811   .811
(s1j+...+s5j)*I(s1j,...,s5j)/16:  0    .25      0   .203   .203

E(B2) = 0 + .25 + 0 + .203 + .203 = .656
I(s1,...,sm) = 2.281
Gain(B2) = I(s1,...,sm) - E(B2) = 1.625
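The arithmetic above can be checked in a few lines; this sketch (ours) recomputes I(s1,...,sm), E(B2) and Gain(B2) from the sij counts tabulated earlier and reproduces 2.281, 0.656 and 1.625.

from math import log2

# s[i][j]: count of class Ci (B1 = 0010, 0011, 0111, 1010, 1111) in subset Aj
# (B2 = 0010, 0011, 0111, 1010, 1011), as tabulated above.
s = [[0, 0, 0, 0, 3],
     [0, 2, 2, 0, 0],
     [2, 2, 0, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 3, 0]]
n = sum(map(sum, s))                              # 16 samples in total

def I(counts):
    """Expected information of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

class_totals = [sum(row) for row in s]            # si = 3, 4, 4, 2, 3
columns = list(zip(*s))                           # class counts per B2 value
E_B2 = sum(sum(col) / n * I(col) for col in columns)
print(round(I(class_totals), 3), round(E_B2, 3),
      round(I(class_totals) - E_B2, 3))           # 2.281 0.656 1.625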

Likewise, the gains of B3 and B4 are computed: Gain(B3) = 1.084 and Gain(B4) = 0.568. Thus, B2 is selected as the first-level decision attribute.
4. Branches are created for each value of B2 and samples are partitioned accordingly (if a partition is empty, generate a leaf and label it with the most common class, C2, labeled with 0011).

B2=0010 → Sample_Set_1
B2=0011 → Sample_Set_2
B2=0111 → Sample_Set_3
B2=1010 → Sample_Set_4
B2=1011 → Sample_Set_5

Sample_Set_1 (B2=0010)
X-Y  B1    B3    B4
0,3  0111  0101  1011
1,3  0111  0101  1011

Sample_Set_2 (B2=0011)
X-Y  B1    B3    B4
0,1  0011  1000  1111
0,2  0111  0100  1011
1,1  0011  1000  1011
1,2  0111  0100  1011

Sample_Set_3 (B2=0111)
X-Y  B1    B3    B4
0,0  0011  1000  1011
1,0  0011  1000  1011

Sample_Set_4 (B2=1010)
X-Y  B1    B3    B4
2,2  1010  0100  1011
2,3  1111  0100  1011
3,2  1111  0100  1011
3,3  1111  0100  1011

Sample_Set_5 (B2=1011)
X-Y  B1    B3    B4
2,0  0010  1000  1111
2,1  0010  1000  1111
3,0  0010  1000  1111
3,1  1010  1000  1111

5. The algorithm advances recursively to form a decision tree for the samples at each partition. Once an attribute has been the decision attribute at a node, it is not considered further.
6. The algorithm stops when:
   a. all samples for a given node belong to the same class, or
   b. there are no remaining attributes (label the leaf with the majority class among the samples).

All samples belong to the same class for Sample_Set_1 and Sample_Set_3. Thus, the decision tree so far is:

B2=0010 → B1=0111
B2=0011 → Sample_Set_2
B2=0111 → B1=0011
B2=1010 → Sample_Set_4
B2=1011 → Sample_Set_5

Advancing the algorithm recursively, it is unnecessary to rescan the learning set to form these sub-sample sets, since the P-trees for those samples have already been computed.

For Sample_Set_2, we compute all P2,0011 ^ P1,v ^ P3,w to calculate Gain(B3), and all P2,0011 ^ P1,v ^ P4,w to calculate Gain(B4). The results are Gain(B3) = 1.000 and Gain(B4) = 0.179. Thus, B3 is the decision attribute at this level in the decision tree. For Sample_Set_4 (label B2=1010), the P-trees

P1010,1010,, : root count 1; quadrant counts 0 0 0 1; leaf 1000
P1111,1010,, : root count 3; quadrant counts 0 0 0 3; leaf 0111
P,1010,0100, : root count 4; quadrant counts 0 0 0 4
P,1010,,1011 : root count 4; quadrant counts 0 0 0 4

tell us that there is only one B3 value and one B4 value in the sample set. Therefore we can tell there will be no gain by using either B3 or B4 at this level in the tree, and we can come to that conclusion without producing and scanning Sample_Set_4. The same conclusion can be reached regarding Sample_Set_5 (label B2=1011) simply by examining

P0010,1011,, : root count 3; quadrant counts 0 0 3 0; leaf 1110
P1010,1011,, : root count 1; quadrant counts 0 0 1 0; leaf 0001
P,1011,1000, : root count 4; quadrant counts 0 0 4 0
P,1011,,1111 : root count 4; quadrant counts 0 0 4 0

Here, we are using an additional stopping condition at step 3: any candidate decision attribute that takes only a single value over the entire sample set need not be considered, since its information gain will be zero. If all candidate decision attributes are of this type, the algorithm stops for this sub-sample. Thus, we use the majority class label for Sample_Sets 4 and 5:

B2=0010 → B1=0111
B2=0011 → Sample_Set_2
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010
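The stopping condition used for Sample_Sets 4 and 5 amounts to a simple root-count test; a sketch (ours, reusing the rc(conds) convention of the gain sketch in Section 3.2): an attribute is uninformative at a node when one of its values already covers the whole current sample set.

# Attribute A has a single value over the current sample set, and hence zero
# information gain, exactly when RC(P ^ P_A,a) == RC(P) for some value a of A.
def single_valued(rc, path, attr_band, attr_values):
    total = rc(path)
    return any(rc(path + [(attr_band, a)]) == total for a in attr_values)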

The best test attribute for Sample_Set_2 (B3 or B4) is found in the same way. The result is that B3 maximizes the information gain. Therefore, the decision tree becomes:

B2=0010 → B1=0111
B2=0011 → B3=0100 → Sample_Set_2.1
          B3=1000 → Sample_Set_2.2
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010

Sample_Set_2.1 (B3=0100)          Sample_Set_2.2 (B3=1000)
X-Y  B1    B3    B4               X-Y  B1    B3    B4
0,2  0111  0100  1011             0,1  0011  1000  1111
1,2  0111  0100  1011             1,1  0011  1000  1011

In both cases, there is one class label in the set. Therefore the algorithm terminates with the decision tree:

B2=0010 → B1=0111
B2=0011 → B3=0100 → B1=0111
          B3=1000 → B1=0011
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010

4. Performance Analysis
Often, prediction accuracy is used as the basis of comparison for different classification methods. However, for stream data mining, speed is a significant issue. In this paper, we use ID3 with the P-tree data structure to improve the speed of the algorithm, so the important performance issue here is computation speed relative to ID3. In our method, we only build and store basic P-trees; all the AND operations are performed on the fly and only the corresponding root counts are needed. Our experimental results show that the larger the data size, the more significant the speed improvement (Figure 3). The reason becomes obvious if we analyze the cost theoretically.

First, consider the cost of calculating information gain each time. In ID3, to test whether all the samples are in the same class, one scan of the entire sample set is needed, while using P-trees we only need to calculate the root counts of the AND of some P-trees. These AND operations can also be performed in parallel. Figure 4 shows our experimental results for the cost comparison between scanning the entire data set and performing all the P-tree ANDings.

Second, using P-trees, the creation of sub-sample sets is not necessary. If B is a candidate for the current decision attribute, with kB basic P-trees, we need only AND the P-trees defining the sub-sample set with each of B's value P-trees. If the P-tree of the current sample set is P2,0100 ^ P3,0001, for example, and the current attribute is B1 (with, say, 2-bit values), then P2,0100 ^ P3,0001 ^ P1,00, P2,0100 ^ P3,0001 ^ P1,01, P2,0100 ^ P3,0001 ^ P1,10 and P2,0100 ^ P3,0001 ^ P1,11 identify the partition of the current sample set. To generate the sij, only P-tree ANDings are required.

[Figure 3: chart of total classification cost versus data size (MB), comparing ID3 and the P-tree method.]

Figure 3. Classification cost with respect to the dataset size

[Figure 4: chart of cost (base-2 log of cost in ms) versus data set size (KB), comparing the scan cost in ID3 with the ANDing cost using P-trees.]

Figure 4. Cost comparison between scan and ANDings

5. Related Work
Concepts related to the P-tree data structure include quadtrees [8, 9, 10] and their variants (such as point quadtrees [7] and region quadtrees [8]), and HH-codes [12]. Quadtrees decompose the universe by means of iso-oriented hyperplanes. These partitions do not have to be of equal size, although that is often the case. For example, in a two-dimensional quadtree, each interior node has four descendants, each corresponding to a NW/NE/SW/SE quadrant. The decomposition into subspaces is usually continued until the number of objects in each partition is below a given threshold.

HH-codes, or Helical Hyperspatial Codes, are binary representations of the Riemannian diagonal. The binary division of the diagonal forms the node point from which eight sub-cubes are formed. Each sub-cube has its own diagonal, generating new sub-cubes. These cubes are formed by interlacing one-dimensional values encoded as HH bit codes. When sorted, they cluster in groups along the diagonal. The clustering gives fast (binary) searches along the diagonal. The clusters are ordered in a helical pattern, which is why they are called "Helical Hyperspatial". They are sorted in a Z-ordered fashion.

P-trees, quadtrees and HH-codes are similar in that they are all quadrant-based. The difference is that P-trees focus on the count. P-trees are not indexes; rather, they are representations of the datasets themselves. P-trees are particularly useful for data mining because they contain the aggregate information needed for data mining.

6. Conclusion
In this paper, we propose a new approach to decision tree induction that is especially useful for the classification of streaming spatial data. We use the bit Sequential data organization (bSQ) and a lossless, data-mining-ready data structure, the Peano Count tree (P-tree), to represent the information needed for classification in an efficient and ready-to-use form. The rich and efficient P-tree storage structure and fast P-tree algebra facilitate the development of a fast decision tree induction classifier. The P-tree decision tree induction classifier is shown to improve classifier development time significantly. This makes classification of open-ended streaming datasets feasible in near real time.

References
[1] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle", Information and Computation, 80, 227-248, 1989.
[2] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[3] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, "An interval classifier for database mining applications", VLDB 1992.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Wadsworth, Belmont, 1984.
[5] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining", VLDB 1996.
[6] S. M. Weiss and C. A. Kulikowski, "Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems", Morgan Kaufmann, 1991.
[7] V. Gaede and O. Gunther, "Multidimensional Access Methods", ACM Computing Surveys, 30(2), 1998.
[8] H. Samet, "The quadtree and related hierarchical data structures", ACM Computing Surveys, 16(2), 1984.
[9] H. Samet, "Applications of Spatial Data Structures", Addison-Wesley, Reading, MA, 1990.
[10] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison-Wesley, Reading, MA, 1990.
[11] R. A. Finkel and J. L. Bentley, "Quad trees: A data structure for retrieval on composite keys", Acta Informatica, 4(1), 1974.
[12] http://www.statkart.no/nlhdb/iveher/hhtext.htm
[13] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[14] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, "Machine Learning, Neural and Statistical Classification", Ellis Horwood, 1994.
[15] P. Domingos and G. Hulten, "Mining high-speed data streams", Proceedings of ACM SIGKDD 2000.
[16] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[17] W. Perrizo, Q. Ding, Q. Ding, and A. Roy, "On Mining Satellite and Other Remotely Sensed Images", DMKD 2001.
[18] W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
