
Vertical Set Inner Products: A Fast and Scalable Technique to Compute Total Variation in Large Data Set

Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
[email protected]

Abstract

1. INTRODUCTION

The tremendous growth of data stored in databases and data warehouses has motivated data mining researchers to develop techniques capable of handling large data sets. In clustering and outlier detection tasks, for example, one often needs to determine the closeness of a certain point to a set of points in a data set. One way to measure the closeness of a point to a predefined collection of points is to examine the total variation of those points about the designated point. Not to be confused with the total variation terminology commonly used in image restoration, the total variation we mean here is essentially the total positive difference of a set of vectors about a target vector, which differs slightly from the total variation defined in [3].

The main concern of a data mining algorithm is to handle huge data sets in a fast, accurate and scalable way. In this paper, we introduce a new technique called vertical set inner products (VSIPs) that precisely computes the total variation of a set of vectors about a target vector in a fast, accurate and scalable way. Because VSIPs employ the P-tree(1) vertical data structure, the technique is also called PSIPs(2). Our performance evaluation shows that PSIPs are fast and scalable for computing total variation in large data sets, much faster and more scalable than the analogous technique computed in a horizontal record-based approach, which in this paper we will call horizontal set inner products (HSIPs).

This paper is organized as follows. Section 2 provides a short review of the P-tree vertical data representation. Section 3 presents the inner product formulas and examples. Section 4 discusses the performance evaluation. Section 5 presents our conclusion and future work.

(1) Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
(2) The terms VSIPs and PSIPs will be used interchangeably in this paper.

2. VERTICAL REPRESENTATION

Vertical representation, which consists of a set of P-trees rather than a set of relational records, was initially developed for mining spatial data [5][7]. Since then, this vertical representation has been widely adopted and used as a fundamental data structure for mining other types of data [6][8].

The assumption that data mining algorithms should handle large data sets without sub-sampling, which requires considerable knowledge about the data in the first place, makes the P-tree vertical data structure particularly well suited for data mining. The creation of P-trees typically starts by converting a relational table of horizontal records, normally the training set, into a set of P-trees: each attribute in the table is decomposed into separate bit vectors, one for each bit position of the numeric values in that attribute. Such vertical partitioning guarantees that the original attribute values are retained, so that no information is lost.

A vertical data structure can be constructed as 0-dimensional P-trees, which are simply the vertical bit vectors themselves, or as 1-dimensional, 2-dimensional or multi-dimensional P-trees in tree form. In this paper, for simplicity, we only discuss the construction of 1-dimensional P-trees.

Let R be a relational table consisting of three numeric attributes, R(A1, A2, A3). For each attribute in R, we convert each numerical value into its binary representation; for example, the value (5)_10 is converted to (101)_2. Then each corresponding bit position is vertically decomposed and stored in a separate file. For a 1-dimensional P-tree, each file is subsequently converted into a P-tree by recursively dividing the bit vector into halves and recording the truth of the "purely 1-bits" predicate: a 1 indicates that the bits in the division are all 1, and a 0 indicates otherwise.
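To make the construction concrete, the sketch below decomposes a numeric column into vertical bit vectors and builds a 1-dimensional P-tree by recursively halving each bit vector and recording the purely-1-bits predicate. This is a minimal illustration, not the authors' implementation; the names `decompose` and `build_ptree` are ours, and the tree is kept unpruned for clarity.

```python
def decompose(values, b):
    """Vertically decompose a column of b-bit integers into b bit vectors,
    ordered from the highest bit position (b-1) down to the lowest (0)."""
    return [[(v >> j) & 1 for v in values] for j in range(b - 1, -1, -1)]

def build_ptree(bits):
    """Recursively halve the bit vector, recording the truth of the
    'purely 1-bits' predicate at each node (1 iff every bit below is 1)."""
    if len(bits) == 1:
        return {"pure1": bits[0], "children": None}
    mid = len(bits) // 2
    left, right = build_ptree(bits[:mid]), build_ptree(bits[mid:])
    return {"pure1": 1 if (left["pure1"] and right["pure1"]) else 0,
            "children": (left, right)}

column = [5, 7, 3, 1, 4, 6, 2, 0]      # sample values; (5)_10 -> (101)_2
bit_vectors = decompose(column, 3)     # three vertical bit vectors
ptrees = [build_ptree(bv) for bv in bit_vectors]
```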
Figure 1(b) and (c) depict how a numerical value of attribute A1 is converted into its binary representation, decomposed into three separate vertical bit vectors, and then constructed into a 1-dimensional P-tree.

Figure 1. 1-dimensional P-tree of attribute A1.

The logical operations AND (∧), OR (∨) and NOT or complement (') are the main operations used on P-trees, and they are performed level by level. Figure 2 illustrates these operations: Figure 2(a) shows the result of the complement of P13, while Figures 2(b) and 2(c) show the results of P11 ∧ P12 and P11 ∨ P12, respectively. Refer to [2] for an excellent overview of P-tree algebra and its logical operations.

Figure 2. Results of 1-D P-tree logical operations.

The count of 1-bits in the P-tree resulting from logical operations is called the root count, and it can be computed quickly by summing from the bottom up. For example, the root count of P11 ∧ P12 equals 7, computed as $1 \cdot 2^0 + 1 \cdot 2^1 + 1 \cdot 2^2$, since a single 1-bit is found at each of levels 0, 1 and 2.

3. SET INNER PRODUCTS

In this section, we define the formulas for vertical set inner products and their variations, performed on a training set with n dimensions, each represented in b-bit binary. We also demonstrate simple examples of how to use the formulas. Due to space limitations, the proofs of the formulas are not included in this paper, but they can be found in [1]. At the end of this section, we define the formula for horizontal set inner products.

3.1 Binary Representation

Binary representation is intrinsically a fundamental concept in vertical data structures. Let $x$ be a numeric value of attribute $A_1$; then the representation of $x$ in $b$ bits is written as:

$$x_{1(b-1)} \cdots x_{10} = \sum_{j=0}^{b-1} 2^j \cdot x_{1j}$$

where $x_{1(b-1)}$ and $x_{10}$ are the highest and lowest order bits, respectively.

3.2 Vertical Set Inner Products

Let $X$ be any set of vectors in R(A1, ..., An) with P-tree class mask $PX$, where each $x \in X$ is represented in $b$ bits,

$$x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0})$$

and

$$a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0})$$

be a target vector; then the vertical set inner product $(X \circ a)$ is defined as:

$$X \circ a = \sum_{x \in X} x \circ a = \sum_{i=1}^{n} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} rc(PX \wedge P_{ij}) \cdot 2^{j+k} \cdot a_{ik}$$

We demonstrate how to obtain a result from this formula with the help of an example. Suppose there is a data set with three attributes A1, A2 and A3, each with a numerical domain, and another attribute Rank with a categorical domain, as illustrated in Table 1. The P-tree class masks PXi are obtained by creating one vertical bit vector per class, with bit 1 assigned to every tuple containing that class and bit 0 to every other tuple. Assume that attribute Rank is chosen as the class attribute; since it contains two distinct values, two P-tree class masks will be created, one for each value.

Table 1. Training set example.

A1 | A2 | A3 | Rank
 9 | 31 |  6 | Low
11 | 20 |  5 | Low
11 | 21 |  4 | High
 7 | 23 |  3 | High
 7 | 27 |  1 | High
 8 | 31 |  0 | High

Subsequently, we convert each numerical value into base-two representation with a uniform width of b bits, where the maximum width is determined by the largest value in the training set. After that, we create the P-tree class masks as depicted in Table 2. For any attribute that could be represented in fewer than b bits, we pad zeros to reach the uniform bit width; zero padding is a prerequisite for the formula to produce a correct result. In Table 2, attribute A3 has been padded with two additional zero bits to make a uniform 5-bit width, because the largest value found in the training set, 31 (in attribute A2), is (11111)_2.

Table 2. P-tree class masks of attribute Rank.

A1    | A2    | A3    | PX1 | PX2
01001 | 11111 | 00110 |  0  |  1
01011 | 10100 | 00101 |  0  |  1
01011 | 10101 | 00100 |  1  |  0
00111 | 10111 | 00011 |  1  |  0
00111 | 11011 | 00001 |  1  |  0
01000 | 11111 | 00000 |  1  |  0

As mentioned before, a root count is the total number of 1-bits counted from the result of operations on P-tree operands. For example, the root count of PX1 ∧ P13, written rc(PX1 ∧ P13), is equal to 2, where PX1 is the P-tree class mask of class High and P13 is the P-tree of attribute A1 at the fourth bit position. Let the center (target) vector be a = (14, 10, 19), or in binary a = (01110, 01010, 10011). The root counts of each P-tree class mask with all corresponding P-trees are listed below.

rc(PX1 ∧ P14) = 0    rc(PX2 ∧ P14) = 0
rc(PX1 ∧ P13) = 2    rc(PX2 ∧ P13) = 2
rc(PX1 ∧ P12) = 2    rc(PX2 ∧ P12) = 0
rc(PX1 ∧ P11) = 3    rc(PX2 ∧ P11) = 1
rc(PX1 ∧ P10) = 3    rc(PX2 ∧ P10) = 2
rc(PX1 ∧ P24) = 4    rc(PX2 ∧ P24) = 2
rc(PX1 ∧ P23) = 2    rc(PX2 ∧ P23) = 1
rc(PX1 ∧ P22) = 3    rc(PX2 ∧ P22) = 2
rc(PX1 ∧ P21) = 3    rc(PX2 ∧ P21) = 1
rc(PX1 ∧ P20) = 4    rc(PX2 ∧ P20) = 1
rc(PX1 ∧ P34) = 0    rc(PX2 ∧ P34) = 0
rc(PX1 ∧ P33) = 0    rc(PX2 ∧ P33) = 0
rc(PX1 ∧ P32) = 1    rc(PX2 ∧ P32) = 2
rc(PX1 ∧ P31) = 1    rc(PX2 ∧ P31) = 1
rc(PX1 ∧ P30) = 2    rc(PX2 ∧ P30) = 1

Hence, using the formula, we compute the set inner product for class High as follows:

(X1 ∘ a) = 0 · (2^8·0 + 2^7·1 + 2^6·1 + 2^5·1 + 2^4·0) +
           2 · (2^7·0 + 2^6·1 + 2^5·1 + 2^4·1 + 2^3·0) +
           2 · (2^6·0 + 2^5·1 + 2^4·1 + 2^3·1 + 2^2·0) +
           3 · (2^5·0 + 2^4·1 + 2^3·1 + 2^2·1 + 2^1·0) +
           3 · (2^4·0 + 2^3·1 + 2^2·1 + 2^1·1 + 2^0·0) +
           4 · (2^8·0 + 2^7·1 + 2^6·0 + 2^5·1 + 2^4·0) +
           2 · (2^7·0 + 2^6·1 + 2^5·0 + 2^4·1 + 2^3·0) +
           3 · (2^6·0 + 2^5·1 + 2^4·0 + 2^3·1 + 2^2·0) +
           3 · (2^5·0 + 2^4·1 + 2^3·0 + 2^2·1 + 2^1·0) +
           4 · (2^4·0 + 2^3·1 + 2^2·0 + 2^1·1 + 2^0·0) +
           0 · (2^8·1 + 2^7·0 + 2^6·0 + 2^5·1 + 2^4·1) +
           0 · (2^7·1 + 2^6·0 + 2^5·0 + 2^4·1 + 2^3·1) +
           1 · (2^6·1 + 2^5·0 + 2^4·0 + 2^3·1 + 2^2·1) +
           1 · (2^5·1 + 2^4·0 + 2^3·0 + 2^2·1 + 2^1·1) +
           2 · (2^4·1 + 2^3·0 + 2^2·0 + 2^1·1 + 2^0·1)

         = 0·224 + 2·112 + 2·56 + 3·28 + 3·14 +
           4·160 + 2·80 + 3·40 + 3·20 + 4·10 +
           0·304 + 0·152 + 1·76 + 1·38 + 2·19

         = 1634

Similarly, for class Low we compute (X2 ∘ a) = 999.
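The vertical result can be sanity-checked against the definition $X \circ a = \sum_{x \in X} x \circ a$ computed horizontally over Table 1. The snippet below is a hypothetical cross-check, not part of the paper's method:

```python
# Cross-check of the worked example: the horizontal sum of inner products
# per class must match the vertical results (X1 o a) = 1634, (X2 o a) = 999.
rows = [((9, 31, 6), "Low"), ((11, 20, 5), "Low"), ((11, 21, 4), "High"),
        ((7, 23, 3), "High"), ((7, 27, 1), "High"), ((8, 31, 0), "High")]
a = (14, 10, 19)

def set_inner_product(cls):
    return sum(sum(xi * ai for xi, ai in zip(x, a))
               for x, c in rows if c == cls)

assert set_inner_product("High") == 1634
assert set_inner_product("Low") == 999
```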
3.3 Vertical Set Vector Difference

The vertical set vector difference is another formula we introduce in this paper. The formula, denoted (X − a), computes the sum of vector differences from a set of vectors X to a target vector a, where the x ∈ X are vectors belonging to the same class.

Let $X$ be any set of vectors in R(A1, ..., An) with class mask $PX$, and let the vectors $x$ and $a$ both be represented in $b$-bit binary,

$$x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0})$$
$$a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0})$$

Then the vertical set vector difference (X − a) is defined as:

$$X - a = (v_1, v_2, \ldots, v_i, \ldots, v_n), \quad 1 \le i \le n$$

where

$$v_i = \sum_{x \in X} (x_i - a_i) = \sum_{j=b-1}^{0} 2^j \cdot \bigl( rc(PX \wedge P_{ij}) - rc(PX) \cdot a_{ij} \bigr)$$

Assume we are dealing with vectors in two-dimensional space, where the $x_i$, $1 \le i \le 4$, are vectors from the same class. The total separation of the vectors $x_i$ from a center vector $a$ can be measured by (X − a), that is, the summation of the individual vector differences $(x_1 - a) + (x_2 - a) + (x_3 - a) + (x_4 - a)$. Figure 3 illustrates these vector differences.

Figure 3. A set of vectors from the same class and their differences from a.

The formula returns a single vector that represents the cumulative difference of a set of vectors from a center. However, since vectors have direction, the final summation can misrepresent the actual separation, especially when negative vectors are involved and cancel positive ones. To avoid this, we develop another formula, combining the vertical set inner product and vector difference concepts, that computes the total variation of vectors about a center. Since a vertical data structure is used, we call this concept vertical total variation. We describe it thoroughly in the next section.

3.4 Vertical Set Inner Products of Vector Difference (Computing Total Variation)

To alleviate the cancellation problem that arises when a set of vector differences is summed (due to the direction of vectors), we introduce the vertical set inner product of vector difference, which measures the total variation of a set of vectors X about a center vector a by dividing the result by the number of vectors in X. The formula combines the set inner product and set vector difference concepts and is defined as follows:

$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a)$$
$$= \sum_{x \in X} \Bigl( \sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i a_i + \sum_{i=1}^{n} a_i^2 \Bigr)$$
$$= \sum_{x \in X} \sum_{i=1}^{n} x_i^2 - 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i + \sum_{x \in X} \sum_{i=1}^{n} a_i^2$$
$$= T_1 + T_2 + T_3$$

where

$$T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} \Bigl( \sum_{j=b-1}^{0} 2^{2j} \cdot rc(PX \wedge P_{ij}) + \sum_{j=b-1}^{1} \sum_{l=j-1}^{0} 2^{j+l+1} \cdot rc(PX \wedge P_{ij} \wedge P_{il}) \Bigr)$$

$$T_2 = -2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i = -2 \sum_{i=1}^{n} a_i \sum_{j=b-1}^{0} 2^j \cdot rc(PX \wedge P_{ij})$$

$$T_3 = \sum_{x \in X} \sum_{i=1}^{n} a_i^2 = rc(PX) \cdot \sum_{i=1}^{n} a_i^2$$

The formula measures the sum of squared lengths of the vectors connecting X and a. Therefore,

$$\frac{(X - a) \circ (X - a)}{N} = \frac{\sum_{x \in X} (x - a) \circ (x - a)}{N}$$

where N is the total number of vectors in X, intrinsically measures the total variation of X about a. Note that, in the vertical representation, N can easily be computed as rc(PX), the total number of 1-bits in the P-tree class mask of X. When the set of vectors is relatively close to a, the total variation will be small.
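The sketch below mirrors this decomposition: it computes T1, T2 and T3 purely from root counts and then divides by N = rc(PX). As before, the bit-list layout and helper names are our illustrative assumptions, with root counts taken over plain bit lists rather than actual P-trees.

```python
def rc(*bvs):
    """Root count over plain bit lists (stands in for P-tree root counts)."""
    return sum(all(bits) for bits in zip(*bvs))

def total_variation(P, PX, a, b):
    """(X - a) o (X - a) / N via T1 + T2 + T3; P[i][r] is attribute i's
    bit vector at list index r, which holds bit position b-1-r."""
    n = len(P)
    pos = lambda r: b - 1 - r                   # list index -> bit position
    t1 = sum(2 ** (2 * pos(j)) * rc(PX, P[i][j])
             for i in range(n) for j in range(b))
    t1 += sum(2 ** (pos(j) + pos(l) + 1) * rc(PX, P[i][j], P[i][l])
              for i in range(n) for j in range(b) for l in range(j + 1, b))
    t2 = -2 * sum(a[i] * 2 ** pos(j) * rc(PX, P[i][j])
                  for i in range(n) for j in range(b))
    t3 = rc(PX) * sum(ai * ai for ai in a)
    return (t1 + t2 + t3) / rc(PX)              # divide by N = rc(PX)
```

Run on the Table 2 bit columns for class High with a = (14, 10, 19), this yields (2969 − 3268 + 2628) / 4 = 582.25, matching the worked example that follows.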
The advantage of vertical set inner products is that root count values can be pre-computed and stored, because the root count operations are clearly independent of a in PSIPs, allowing us to compute them in advance. These root counts include the root count of the P-tree class mask PX itself, the root counts rc(PX ∧ Pij), and the root counts rc(PX ∧ Pij ∧ Pil), where Pij and Pil are the corresponding P-trees of the training set.

We demonstrate the calculation of total variation, again with the help of an example, using the same training set found in Table 2 and target vector a = (14, 10, 19). We start the computation using the set of vectors in X1, hence (X1 − a) ∘ (X1 − a) = T1 + T2 + T3, and calculate T1, T2 and T3 separately.

T1 = 2^8·rc(PX1∧P14) + 2^8·rc(PX1∧P14∧P13) + 2^7·rc(PX1∧P14∧P12) + 2^6·rc(PX1∧P14∧P11) + 2^5·rc(PX1∧P14∧P10) +
     2^6·rc(PX1∧P13) + 2^6·rc(PX1∧P13∧P12) + 2^5·rc(PX1∧P13∧P11) + 2^4·rc(PX1∧P13∧P10) +
     2^4·rc(PX1∧P12) + 2^4·rc(PX1∧P12∧P11) + 2^3·rc(PX1∧P12∧P10) +
     2^2·rc(PX1∧P11) + 2^2·rc(PX1∧P11∧P10) +
     2^0·rc(PX1∧P10) +
     2^8·rc(PX1∧P24) + 2^8·rc(PX1∧P24∧P23) + 2^7·rc(PX1∧P24∧P22) + 2^6·rc(PX1∧P24∧P21) + 2^5·rc(PX1∧P24∧P20) +
     2^6·rc(PX1∧P23) + 2^6·rc(PX1∧P23∧P22) + 2^5·rc(PX1∧P23∧P21) + 2^4·rc(PX1∧P23∧P20) +
     2^4·rc(PX1∧P22) + 2^4·rc(PX1∧P22∧P21) + 2^3·rc(PX1∧P22∧P20) +
     2^2·rc(PX1∧P21) + 2^2·rc(PX1∧P21∧P20) +
     2^0·rc(PX1∧P20) +
     2^8·rc(PX1∧P34) + 2^8·rc(PX1∧P34∧P33) + 2^7·rc(PX1∧P34∧P32) + 2^6·rc(PX1∧P34∧P31) + 2^5·rc(PX1∧P34∧P30) +
     2^6·rc(PX1∧P33) + 2^6·rc(PX1∧P33∧P32) + 2^5·rc(PX1∧P33∧P31) + 2^4·rc(PX1∧P33∧P30) +
     2^4·rc(PX1∧P32) + 2^4·rc(PX1∧P32∧P31) + 2^3·rc(PX1∧P32∧P30) +
     2^2·rc(PX1∧P31) + 2^2·rc(PX1∧P31∧P30) +
     2^0·rc(PX1∧P30)

   = 256·0 + 256·0 + 128·0 + 64·0 + 32·0 +
     64·2 + 64·0 + 32·1 + 16·1 +
     16·2 + 16·2 + 8·2 +
     4·3 + 4·3 +
     1·3 +
     256·4 + 256·2 + 128·3 + 64·3 + 32·4 +
     64·2 + 64·1 + 32·2 + 16·2 +
     16·3 + 16·2 + 8·3 +
     4·3 + 4·3 +
     1·4 +
     256·0 + 256·0 + 128·0 + 64·0 + 32·0 +
     64·0 + 64·0 + 32·0 + 16·0 +
     16·1 + 16·0 + 8·0 +
     4·1 + 4·1 +
     1·2

   = 2969

T2 = −2 · ( 14 · (2^4·rc(PX1∧P14) + 2^3·rc(PX1∧P13) + 2^2·rc(PX1∧P12) + 2^1·rc(PX1∧P11) + 2^0·rc(PX1∧P10)) +
            10 · (2^4·rc(PX1∧P24) + 2^3·rc(PX1∧P23) + 2^2·rc(PX1∧P22) + 2^1·rc(PX1∧P21) + 2^0·rc(PX1∧P20)) +
            19 · (2^4·rc(PX1∧P34) + 2^3·rc(PX1∧P33) + 2^2·rc(PX1∧P32) + 2^1·rc(PX1∧P31) + 2^0·rc(PX1∧P30)) )

   = −2 · (462 + 1020 + 152)

   = −3268

T3 = rc(PX1) · (14^2 + 10^2 + 19^2) = 4 · 657 = 2628

Hence, the total variation of X1 about a is:

(X1 − a) ∘ (X1 − a) / N = (T1 + T2 + T3) / rc(PX1) = (2969 − 3268 + 2628) / 4 = 582.25

Similarly, the total variation of X2 about a is 470. Therefore, we conclude that X2 is closer to a.

3.5 Horizontal Set Inner Products

Let X be a set of vectors in R(A1, ..., An), where x = (x1, x2, ..., xn) are the vectors belonging to class X, and let a = (a1, a2, ..., an) be a target vector. The horizontal set inner product (X ∘ a) is defined as:

$$X \circ a = \sum_{x \in X} x \circ a = \sum_{x \in X} \sum_{i=1}^{n} x_i a_i$$

Similarly, the horizontal set inner product of vector difference is defined as:

$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2$$
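For contrast with the vertical computation, a direct horizontal implementation scans every record per query. The illustrative snippet below reproduces the section 3.4 results:

```python
def hsip_total_variation(records, a):
    """Horizontal (X - a) o (X - a) / N: one full scan of the records."""
    total = sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in records)
    return total / len(records)

high = [(11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
low = [(9, 31, 6), (11, 20, 5)]
print(hsip_total_variation(high, (14, 10, 19)))  # 582.25
print(hsip_total_variation(low, (14, 10, 19)))   # 470.0
```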
4. EXPERIMENTAL RESULTS

This section reports the experiments we conducted to evaluate the PSIPs algorithm. The experiments were conducted using both real and synthetic datasets. The objective was to compare the execution time and scalability of our algorithm employing the vertical approach (vertical data structure and horizontal bitwise AND operation) against the horizontal approach (horizontal data structure and vertical scan operation). We show the results of the execution time experiments with respect to scalability. The performance of both algorithms was observed under different machine specifications, including an SGI Altix CC-NUMA machine. Table 3 summarizes the machines used for the experiments.

Table 3. The specification of machines used.

Machine   | Specification
AMD1GB    | AMD Athlon K7 1.4GHz, 1GB RAM
P42GB     | Intel P4 2.4GHz, 2GB RAM
SGI Altix | SGI Altix CC-NUMA, 12 processors, shared memory (12 x 4GB RAM)

4.1 Dataset

The experimental data were generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97° 42' 18" W), taken in 1998. The image contains three bands: red, green and blue reflectance values. We use the original image of size 1024x1024 pixels (cardinality 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate and crop yield were also used in the experimental evaluations. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions.

Additional datasets of different sizes were synthetically generated from the original dataset to study the timing and scalability of the PSIPs technique presented in this paper. Both timing and scalability were evaluated with respect to data size. Because of the small cardinality of the original dataset (1,048,576 records), we super-sampled it, using a simple image processing tool on the original data, to produce five larger datasets with cardinalities of 2,097,152; 4,194,304 (2048x2048 pixels); 8,388,608; 16,777,216 (4096x4096 pixels); and 25,165,824 (5016x5016 pixels).

4.2 Timing and Scalability Results

The first performance evaluation was done using the P42GB machine. We used the synthetic datasets having 4.1 and 8.3 million rows to evaluate the execution time of the algorithms computing total variation for 100 different test cases. Datasets larger than 8.3 million rows could not be processed on this machine, since an out-of-memory error occurs when running HSIPs. Figures 4 and 5 depict the execution time comparison between PSIPs and HSIPs.

Figure 4. Time comparison on the 4.1 million row dataset.

Figure 5. Time comparison on the 8.3 million row dataset.

As the figures show, up to 8.3 million rows both algorithms apparently scale; however, PSIPs is significantly faster than HSIPs. PSIPs requires only 0.0003 seconds on average to complete the calculation on both datasets, far less than HSIPs, which needs 7.4206 and 14.9013 seconds on average on the respective datasets. These significant disparities are due to PSIPs' reuse of the same root count values, pre-computed and stored during P-tree creation, even as different test case vectors are fed into the calculation. Referring back to the general PSIPs formula defined in section 3.4, a test case vector a appears only in the T2 and T3 calculations and is clearly independent of the root count operations rc(PX ∧ Pij). This allows us to pre-compute these operations once and reuse their values regardless of how many total variations are computed, as long as the dataset and the set of classes X are unchanged. Notice that PSIPs tends to have constant execution time even as the dataset size grows, whereas the execution time of HSIPs increases significantly.

One may argue that the pre-calculation of root counts makes this comparison fallacious. However, consider the time required to load the vertical data structure into memory and perform the one-time root count operations for PSIPs, versus loading horizontal records into memory for HSIPs, given in Table 4. In this respect the performance of PSIPs is comparable to HSIPs: there is only a slight increase in the time required to load horizontal records compared to loading P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to load efficiently and compute the simple counts. These timings were measured on a P4 with 2GB of memory.

Table 4. Time required to load data and compute root counts (seconds).

Cardinality of Dataset | PSIPs: Root Count Pre-Computation and P-tree Loading | HSIPs: Horizontal Dataset Loading
1,048,576  |  3.900 |  4.974
2,097,152  |  8.620 | 10.470
4,194,304  | 18.690 | 19.914
8,388,608  | 38.450 | 39.646
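The near-constant per-query time follows directly from this separation of work. A sketch of the split (illustrative names; plain dictionaries stand in for the stored root counts):

```python
def precompute_root_counts(P, PX, b):
    """One-time pass over the data: rc(PX), rc(PX ^ Pij), rc(PX ^ Pij ^ Pil).
    After this, answering a query never touches the dataset again."""
    def rc(*bvs):
        return sum(all(bits) for bits in zip(*bvs))
    n = len(P)
    single = {(i, j): rc(PX, P[i][j]) for i in range(n) for j in range(b)}
    pair = {(i, j, l): rc(PX, P[i][j], P[i][l])
            for i in range(n) for j in range(b) for l in range(j + 1, b)}
    return rc(PX), single, pair

def query_total_variation(counts, a, b):
    """Per-query cost is O(n * b^2), independent of the number of rows."""
    N, single, pair = counts
    pos = lambda r: b - 1 - r
    t1 = sum(2 ** (2 * pos(j)) * c for (i, j), c in single.items())
    t1 += sum(2 ** (pos(j) + pos(l) + 1) * c for (i, j, l), c in pair.items())
    t2 = -2 * sum(a[i] * 2 ** pos(j) * c for (i, j), c in single.items())
    t3 = N * sum(ai * ai for ai in a)
    return (t1 + t2 + t3) / N
```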

Our next experiment observed the timing and scalability of the algorithms when executing on machines with different specifications, which matters especially for HSIPs, whose successful execution is very sensitive to the available memory. This sensitivity was evident when we ran HSIPs on the AMD1GB machine: HSIPs successfully completed the total variation computation for the datasets with cardinalities 1,048,576, 2,097,152 and 4,194,304, yet suffered an out-of-memory error when computing total variation on datasets with cardinality above 4.1 million. Similarly, on the P42GB machine, HSIPs scaled only to datasets with fewer than 8.3 million rows. HSIPs performed better in terms of scalability on the SGI Altix, successfully computing total variation for all the datasets, but it too ran out of memory when trying to load a dataset with more than 25 million rows. However, the timing of HSIPs on this machine degrades significantly compared to its timing on the P42GB machine, because our implementation does not utilize the full capability of the machine's 12-processor shared-memory parallel architecture, which is beyond the scope of this paper. This machine, with 12x4GB of RAM, was used in the performance study because it was the only machine capable of loading the entire dataset for HSIPs at the larger sizes.

On the other hand, the PSIPs technique was successful in both timing and scalability. It had no memory problems, and it effectively computed total variation on datasets with more than 25 million rows. The same result was evident when running PSIPs on the other two machines, and its average time was extremely stable, around 0.0003 to 0.0004 seconds. Table 5 presents the actual average times when executing the two techniques on the different machines, and Figure 6 further illustrates performance with respect to scalability.

Table 5. Average time running under different machines (seconds).

Cardinality of Dataset | HSIPs (AMD-1GB) | HSIPs (P4-2GB) | HSIPs (SGI Altix 12x4GB) | PSIPs (AMD-1GB)
 1,048,576 | 2.200 |  1.840 |   5.480 | 0.0003
 2,097,152 | 4.410 |  3.640 |   8.320 | 0.0003
 4,194,304 | 8.580 |  7.380 |  15.864 | 0.0004
 8,388,608 |   —   | 15.160 |  33.900 | 0.0004
16,777,216 |   —   |   —    |  66.540 | 0.0004
25,165,824 |   —   |   —    | 115.204 | 0.0004

—: Out of memory.

Figure 6. Average time running under different machines.

5. CONCLUSION

In this paper we have presented, defined and evaluated the performance of a new concept for computing total variation, called vertical set inner products (P-tree based set inner products, abbreviated PSIPs). Experiments indicate that vertical set inner products are fast and scalable compared to conventional horizontal set inner products, especially when dealing with large data sets, without compromising the accuracy of the results. We believe the PSIPs technique could be very useful in clustering tasks, as it can measure the closeness of a group of feature vectors to a target vector. Presumably, in classification tasks, applying PSIPs in the voting phase would greatly accelerate class assignment, because the calculation over correlated points can be done entirely in one computation, without having to visit each individual point as in the horizontal approach. However, these hypotheses require careful observation and testing, which will be our future work.

6. REFERENCES

[1] Abidin, T., and Perrizo, W. Vertical Set Inner Products Formula. http://midas.cs.ndsu.nodak.edu/~abidin/research/PSIPs.pdf

[2] Ding, Q., Khan, M., Roy, A., and Perrizo, W., (2002). The P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing. 426-431.

[3] Weisstein, E. W., et al. Total Variation. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/TotalVariation.html

[4] Han, J., and Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA., Morgan Kaufmann.

[5] Khan, M., Ding, Q., and Perrizo, W., (2002). K- Nearest Neighbor Classification of Spatial Data Streams using P-trees, Proceedings of the PAKDD. 517-528.

[6] Perera, A., Denton, A., Kotala, P., Jockheck, W., Granda, W. V., and Perrizo, W., (2002). P-tree Classification of Yeast Gene Deletion Data. SIGKDD Explorations, 4(2): 108-109.

[7] Perrizo, W. (2001). Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1.

[8] Rahal, I., and Perrizo, W., (2004). An Optimized Approach for KNN Text Categorization using P-Trees. Proceedings of the ACM Symposium on Applied Computing. 613-617.
