<p>Vertical Set Inner Products: A Fast and Scalable Technique to Compute Total Variation in Large Data Set</p><p>Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo. Computer Science Department, North Dakota State University, Fargo, ND 58105, USA</p><p>1. INTRODUCTION</p><p>The rapidly growing amount of data stored in databases and data warehouses has motivated data mining researchers to develop techniques capable of handling large data sets. In clustering and outlier detection, for example, one often needs to determine how close a certain point is to a set of points in a data set. One way to measure the closeness of a point to a predefined collection of points is to examine the total variation of the points about the designated point. This should not be confused with the total variation terminology used in image restoration: the total variation we mean here is the total positive difference of a set of vectors about a target vector, which differs slightly from the total variation defined in [3].</p><p>The main concern of a data mining algorithm is to handle huge data sets in a fast, accurate and scalable way. In this paper we introduce a new technique called vertical set inner products (VSIPs) that computes the total variation of a set of vectors about a target vector exactly, and does so in a fast and scalable way. Because VSIPs employ the P-tree<sup>1</sup> vertical data structure, the technique is also called PSIPs<sup>2</sup>.</p><p>Our performance evaluation shows that PSIPs compute total variation in large data sets much faster, and with better scalability, than the same computation carried out in a horizontal record-based approach, which in this paper we call horizontal set inner products (HSIPs).</p><p>This paper is organized as follows. Section 2 provides a short review of the P-tree vertical data representation. Section 3 presents the inner product formulas and examples. Section 4 discusses the performance evaluation. Section 5 presents our conclusion and future work.</p><p><sup>1</sup> Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308. <sup>2</sup> The terms VSIPs and PSIPs are used interchangeably in this paper.</p><p>2. VERTICAL REPRESENTATION</p><p>The vertical representation, which consists of a set of P-trees rather than a set of relational records, was initially developed for mining spatial data [5][7]. Since then it has been widely adopted as a fundamental data structure for mining other types of data [6][8].</p><p>Data mining algorithms should handle large data sets without sub-sampling, since sub-sampling requires considerable prior knowledge about the data. This requirement makes the P-tree vertical data structure particularly well suited to data mining. The creation of P-trees typically starts by converting a relational table of horizontal records, normally the training set, into a set of P-trees: each attribute in the table is decomposed into separate bit vectors, one for each bit position of the numeric values in that attribute. This vertical partitioning guarantees that the original attribute values can be recovered, so no information is lost.</p><p>A vertical data structure can be built as a 0-dimensional P-tree, which is simply the vertical bit vectors themselves, or as a 1-dimensional, 2-dimensional or multi-dimensional P-tree in the form of a tree. For simplicity, this paper discusses only the construction of 1-dimensional P-trees.</p><p>Let R be a relational table with three numeric attributes, R(A1, A2, A3). For each attribute in R, we convert every numerical value into its binary representation; for example, the value (5)<sub>10</sub> is converted to (101)<sub>2</sub>. Each bit position is then vertically decomposed and stored in a separate file. For a 1-dimensional P-tree, each file is subsequently converted into a P-tree by recursively dividing the bit vector into halves and recording the truth of the "purely 1-bits" predicate. A predicate value of 1 indicates that the bits in the division are all 1, and 0 indicates otherwise. Figure 1 depicts how the numerical values of attribute A1 are converted into binary representation, decomposed into three separate vertical bit vectors, and then built into 1-dimensional P-trees.</p><p>Figure 1. 1-dimensional P-tree of attribute A1.</p><p>The logical operations AND (∧), OR (∨) and NOT or complement (') are the main operations on P-trees and are performed level by level. Figure 2 illustrates how these operations are used. Figure 2(a) shows the result of the complement operation on P13, while Figures 2(b) and 2(c) show the results of P11 ∧ P12 and P11 ∨ P12 respectively. Refer to [2] for an excellent overview of the P-tree algebra and its logical operations.</p><p>Figure 2. Result of 1-D P-trees logical operations.</p><p>The count of 1-bits in the P-tree that results from logical operations is called the root count, and it can be computed quickly by summing from the bottom up. For example, the root count of P11 ∨ P12 is equal to 7, computed as 1 · 2<sup>0</sup> + 1 · 2<sup>1</sup> + 1 · 2<sup>2</sup>, because there is exactly one 1-bit at each of levels 0, 1 and 2.</p><p>3. SET INNER PRODUCTS</p><p>In this section we define the formulas for vertical set inner products and their variations, evaluated on a training set with n dimensions, each represented as a b-bit binary number. We also give simple examples of how to use the formulas. Because of space limitations, the proofs of the formulas are not included in this paper; they can be found in [1]. At the end of the section we define the formula for horizontal set inner products.</p><p>3.1 Binary Representation</p><p>Binary representation is a fundamental concept in vertical data structures. Let x be a numeric value of attribute A1; then the representation of x in b bits is written as:</p><p>\[ x_1 = x_{1(b-1)} \cdots x_{10} = \sum_{j=b-1}^{0} 2^{j} x_{1j} \]</p><p>where \(x_{1(b-1)}\) and \(x_{10}\) are the highest and lowest order bits respectively.</p><p>3.2 Vertical Set Inner Products</p><p>Let X be any set of vectors in R(A1…An) with P-tree class mask PX, where each x ∈ X is represented in b bits,</p><p>\[ x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0}) \]</p><p>and let</p><p>\[ a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0}) \]</p><p>be a target vector. Then the vertical set inner product (X ∘ a) is defined as:</p><p>\[ X \circ a = \sum_{x \in X} x \circ a = \sum_{i=1}^{n} \sum_{j=b-1}^{0} rc(PX \wedge P_{ij}) \sum_{k=0}^{b-1} 2^{j+k} a_{ik} \]</p><p>We demonstrate how to evaluate this formula with an example. Suppose a data set has three attributes A1, A2 and A3, each with a numerical domain, and another attribute Rank with a single categorical domain, as illustrated in Table 1. The P-tree class masks PXi are obtained by creating one vertical bit vector per class, in which bit 1 is assigned to every tuple containing that class and bit 0 to every other tuple. Assume that Rank is chosen as the class attribute and has two distinct values; then two P-tree class masks are created, one for each distinct value.</p><p>Table 1. Training set example.</p>
<table>
<tr><td>A1</td><td>A2</td><td>A3</td><td>Rank</td></tr>
<tr><td>9</td><td>31</td><td>6</td><td>Low</td></tr>
<tr><td>11</td><td>20</td><td>5</td><td>Low</td></tr>
<tr><td>11</td><td>21</td><td>4</td><td>High</td></tr>
<tr><td>7</td><td>23</td><td>3</td><td>High</td></tr>
<tr><td>7</td><td>27</td><td>1</td><td>High</td></tr>
<tr><td>8</td><td>31</td><td>0</td><td>High</td></tr>
</table>
<p>Subsequently, we convert each numerical domain into base-two representation with a uniform width of b bits. The width is determined by the largest value in the training set. After that, we create the P-tree class masks as depicted in Table 2. Any attribute that can be represented in fewer than b bits is padded with leading zeros to reach the uniform bit width. Zero padding is a prerequisite for the formula to produce a correct result. In Table 2, attribute A3 has been padded with two additional zero bits to reach a uniform width of 5 bits, because 31 (the value of attribute A2 in tuples 1 and 6, the largest value found in the training set) is (11111)<sub>2</sub>.</p><p>Table 2. P-tree class masks of attribute Rank.</p>
<table>
<tr><td>A1</td><td>A2</td><td>A3</td><td>Rank: PX1</td><td>Rank: PX2</td></tr>
<tr><td>01001</td><td>11111</td><td>00110</td><td>0</td><td>1</td></tr>
<tr><td>01011</td><td>10100</td><td>00101</td><td>0</td><td>1</td></tr>
<tr><td>01011</td><td>10101</td><td>00100</td><td>1</td><td>0</td></tr>
<tr><td>00111</td><td>10111</td><td>00011</td><td>1</td><td>0</td></tr>
<tr><td>00111</td><td>11011</td><td>00001</td><td>1</td><td>0</td></tr>
<tr><td>01000</td><td>11111</td><td>00000</td><td>1</td><td>0</td></tr>
</table>
<p>As mentioned before, a root count is the total number of 1-bits in the result of an operation on P-tree operands. The root count (rc) of PX1 ∧ P13, written rc(PX1 ∧ P13), is equal to 2, where PX1 is the P-tree class mask of class High and P13 is the P-tree of attribute A1 at the fourth bit position. Let the center vector be a = (14, 10, 19), or in binary a = (01110, 01010, 10011). The root counts of each P-tree class mask with all corresponding P-trees are listed below.</p>
<table>
<tr><td>P-tree P</td><td>rc(PX1 ∧ P)</td><td>rc(PX2 ∧ P)</td></tr>
<tr><td>P14</td><td>0</td><td>0</td></tr>
<tr><td>P13</td><td>2</td><td>2</td></tr>
<tr><td>P12</td><td>2</td><td>0</td></tr>
<tr><td>P11</td><td>3</td><td>1</td></tr>
<tr><td>P10</td><td>3</td><td>2</td></tr>
<tr><td>P24</td><td>4</td><td>2</td></tr>
<tr><td>P23</td><td>2</td><td>1</td></tr>
<tr><td>P22</td><td>3</td><td>2</td></tr>
<tr><td>P21</td><td>3</td><td>1</td></tr>
<tr><td>P20</td><td>4</td><td>1</td></tr>
<tr><td>P34</td><td>0</td><td>0</td></tr>
<tr><td>P33</td><td>0</td><td>0</td></tr>
<tr><td>P32</td><td>1</td><td>2</td></tr>
<tr><td>P31</td><td>1</td><td>1</td></tr>
<tr><td>P30</td><td>2</td><td>1</td></tr>
</table>
<p>Hence, using the formula, we compute the set inner product for class High as follows:</p><p>\[ \begin{gathered}
(X_1 \circ a) = 0\cdot(2^8\cdot0 + 2^7\cdot1 + 2^6\cdot1 + 2^5\cdot1 + 2^4\cdot0) + 2\cdot(2^7\cdot0 + 2^6\cdot1 + 2^5\cdot1 + 2^4\cdot1 + 2^3\cdot0) \\
+\; 2\cdot(2^6\cdot0 + 2^5\cdot1 + 2^4\cdot1 + 2^3\cdot1 + 2^2\cdot0) + 3\cdot(2^5\cdot0 + 2^4\cdot1 + 2^3\cdot1 + 2^2\cdot1 + 2^1\cdot0) \\
+\; 3\cdot(2^4\cdot0 + 2^3\cdot1 + 2^2\cdot1 + 2^1\cdot1 + 2^0\cdot0) \\
+\; 4\cdot(2^8\cdot0 + 2^7\cdot1 + 2^6\cdot0 + 2^5\cdot1 + 2^4\cdot0) + 2\cdot(2^7\cdot0 + 2^6\cdot1 + 2^5\cdot0 + 2^4\cdot1 + 2^3\cdot0) \\
+\; 3\cdot(2^6\cdot0 + 2^5\cdot1 + 2^4\cdot0 + 2^3\cdot1 + 2^2\cdot0) + 3\cdot(2^5\cdot0 + 2^4\cdot1 + 2^3\cdot0 + 2^2\cdot1 + 2^1\cdot0) \\
+\; 4\cdot(2^4\cdot0 + 2^3\cdot1 + 2^2\cdot0 + 2^1\cdot1 + 2^0\cdot0) \\
+\; 0\cdot(2^8\cdot1 + 2^7\cdot0 + 2^6\cdot0 + 2^5\cdot1 + 2^4\cdot1) + 0\cdot(2^7\cdot1 + 2^6\cdot0 + 2^5\cdot0 + 2^4\cdot1 + 2^3\cdot1) \\
+\; 1\cdot(2^6\cdot1 + 2^5\cdot0 + 2^4\cdot0 + 2^3\cdot1 + 2^2\cdot1) + 1\cdot(2^5\cdot1 + 2^4\cdot0 + 2^3\cdot0 + 2^2\cdot1 + 2^1\cdot1) \\
+\; 2\cdot(2^4\cdot1 + 2^3\cdot0 + 2^2\cdot0 + 2^1\cdot1 + 2^0\cdot1)
\end{gathered} \]</p><p>\[ \begin{gathered}
= 0\cdot224 + 2\cdot112 + 2\cdot56 + 3\cdot28 + 3\cdot14 \\
+\; 4\cdot160 + 2\cdot80 + 3\cdot40 + 3\cdot20 + 4\cdot10 \\
+\; 0\cdot304 + 0\cdot152 + 1\cdot76 + 1\cdot38 + 2\cdot19 \\
= 1634
\end{gathered} \]</p><p>Similarly, for class Low we obtain (X2 ∘ a) = 999.</p><p>3.3 Vertical Set Vector Difference</p><p>The vertical set vector difference is another formula introduced in this paper. Denoted (X − a), it computes the sum of the vector differences from a set of vectors X to a target vector a, where the vectors x ∈ X belong to the same class. 
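</p><p>The worked example of Section 3.2 can be reproduced with a short sketch. It is only an illustration under stated assumptions: plain Python lists of 0/1 stand in for the 0-dimensional P-trees (the vertical bit vectors), and the helper names <code>bit_slices</code>, <code>class_mask</code>, <code>rc</code> and the two inner-product functions are hypothetical, not part of any P-tree library. The horizontal computation is included solely as a cross-check.</p>

```python
# Training set from Table 1: (A1, A2, A3, Rank).
ROWS = [
    (9, 31, 6, "Low"),
    (11, 20, 5, "Low"),
    (11, 21, 4, "High"),
    (7, 23, 3, "High"),
    (7, 27, 1, "High"),
    (8, 31, 0, "High"),
]
B = 5  # uniform bit width; 31 = (11111)2 is the largest value


def bit_slices(rows, n_attrs, b):
    """slices[i][j] is the vertical bit vector of attribute i at bit position j."""
    return [
        [[(row[i] >> j) & 1 for row in rows] for j in range(b)]
        for i in range(n_attrs)
    ]


def class_mask(rows, label):
    """Bit vector with 1 for every tuple of the given class, 0 otherwise."""
    return [1 if row[-1] == label else 0 for row in rows]


def rc(*vectors):
    """Root count: number of positions where all given bit vectors are 1."""
    return sum(all(bits) for bits in zip(*vectors))


def vertical_set_inner_product(slices, mask, a, b):
    total = 0
    for i, a_i in enumerate(a):
        for j in range(b):
            # sum over k of 2^(j+k) * a_ik is simply 2^j * a_i
            total += rc(mask, slices[i][j]) * (a_i << j)
    return total


def horizontal_set_inner_product(rows, label, a):
    """Record-by-record cross-check of the same quantity."""
    return sum(
        sum(x_i * a_i for x_i, a_i in zip(row[:-1], a))
        for row in rows
        if row[-1] == label
    )


SLICES = bit_slices(ROWS, 3, B)
A = (14, 10, 19)
PX1 = class_mask(ROWS, "High")
PX2 = class_mask(ROWS, "Low")

print(rc(PX1, SLICES[0][3]))                           # 2, i.e. rc(PX1 AND P13)
print(vertical_set_inner_product(SLICES, PX1, A, B))   # 1634
print(vertical_set_inner_product(SLICES, PX2, A, B))   # 999
```

<p>In this sketch the target vector enters only through the factor <code>a_i << j</code>, so the root counts themselves do not depend on a.</p><p>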
Let X be any set of vectors in R(A1…An) with class mask PX, and let the vectors x ∈ X and a both be represented in b-bit binary, such that \( x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0}) \) and \( a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0}) \). Then the vertical set vector difference (X − a) is defined as:</p><p>\[ X - a = (v_1, v_2, \ldots, v_i, \ldots, v_n), \quad 1 \le i \le n, \qquad v_i = \sum_{j=b-1}^{0} 2^{j} \left( rc(PX \wedge P_{ij}) - rc(PX) \cdot a_{ij} \right) \]</p><p>Assume that we are dealing with vectors in a two-dimensional space, where the x<sub>i</sub> are vectors from the same class and 1 ≤ i ≤ 4. The total separation of the vectors x<sub>i</sub> from a center vector a can then be measured by (X − a), that is, the sum of the individual vector differences (x<sub>1</sub> − a) + (x<sub>2</sub> − a) + (x<sub>3</sub> − a) + (x<sub>4</sub> − a). Figure 3 illustrates these vector differences.</p><p>Figure 3. A set of vectors from the same class differ to a.</p><p>The formula returns a single vector that represents the cumulative displacement of a set of vectors from a center. However, since a vector has direction, the final sum can misrepresent the actual separation, especially when differences with opposite signs cancel in the summation. To avoid this, we develop another formula that combines the vertical set inner product and vector difference concepts to compute the total variation of vectors about a center. Again, since a vertical data structure is used, we call this concept vertical total variation. It is described in the next section.</p><p>3.4 Vertical Set Inner Products of Vector Difference (Compute Total Variation)</p><p>To alleviate the cancellation that occurs when a set of vector differences is summed, we introduce the vertical set inner product of the vector difference. Divided by the number of vectors in X, it measures the total variation of a set of vectors X about a center vector a. The formula combines the concepts of set inner product and set vector difference, and is defined as follows.</p><p>\[ (X-a)\circ(X-a) = \sum_{x \in X} (x-a)\circ(x-a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2 \]</p><p>\[ = \sum_{x \in X} \left( \sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i a_i + \sum_{i=1}^{n} a_i^2 \right) = \sum_{x \in X}\sum_{i=1}^{n} x_i^2 \;-\; 2\sum_{x \in X}\sum_{i=1}^{n} x_i a_i \;+\; \sum_{x \in X}\sum_{i=1}^{n} a_i^2 = T_1 + T_2 + T_3 \]</p><p>where</p><p>\[ T_1 = \sum_{x \in X}\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} \left( \sum_{j=b-1}^{0} 2^{2j}\, rc(PX \wedge P_{ij}) + \sum_{j=b-1}^{1} \sum_{l=j-1}^{0} 2^{\,j+l+1}\, rc(PX \wedge P_{ij} \wedge P_{il}) \right) \]</p><p>\[ T_2 = -2\sum_{x \in X}\sum_{i=1}^{n} x_i a_i = -2\sum_{x \in X}\sum_{i=1}^{n} \left(\sum_{j=b-1}^{0} 2^{j} x_{ij}\right)\left(\sum_{j=b-1}^{0} 2^{j} a_{ij}\right) = -2\sum_{i=1}^{n} \left(\sum_{j=b-1}^{0} 2^{j}\, rc(PX \wedge P_{ij})\right)\left(\sum_{j=b-1}^{0} 2^{j} a_{ij}\right) = -2\sum_{i=1}^{n} a_i \sum_{j=b-1}^{0} 2^{j}\, rc(PX \wedge P_{ij}) \]</p><p>\[ T_3 = \sum_{x \in X}\sum_{i=1}^{n} a_i^2 = \sum_{x \in X}\sum_{i=1}^{n} \left(\sum_{j=b-1}^{0} 2^{j} a_{ij}\right)^2 = rc(PX) \sum_{i=1}^{n} \left(\sum_{j=b-1}^{0} 2^{j} a_{ij}\right)^2 = rc(PX) \sum_{i=1}^{n} a_i^2 \]</p><p>The formula measures the sum of the squared lengths of the vectors connecting X and a. Therefore \( \frac{(X-a)\circ(X-a)}{N} \), where N is the total number of vectors in X, measures the total variation of X about a. Note that in the vertical setting N is easily computed as rc(PX), the number of 1-bits in the P-tree class mask of X. When the set of vectors is relatively close to a, the total variation is small.</p><p>The advantage of vertical set inner products is that the root count values can be pre-computed and stored, because the root count operations in PSIPs are independent of a. These root counts comprise the root count of the P-tree class mask PX itself, the root counts of PX ∧ P<sub>ij</sub>, and the root counts of PX ∧ P<sub>ij</sub> ∧ P<sub>il</sub>, where P<sub>ij</sub> and P<sub>il</sub> are the corresponding P-trees of the training set.</p><p>We again demonstrate the calculation of total variation with an example, using the same training set found in Table 2 and the target vector a = (14, 10, 19). We start with the set of vectors X<sub>1</sub>, so that \( (X_1 - a)\circ(X_1 - a) = T_1 + T_2 + T_3 \), and calculate T<sub>1</sub>, T<sub>2</sub> and T<sub>3</sub> separately.</p><p>\[ \begin{gathered}
T_1 = \sum_{i=1}^{3} \Big( 2^8\, rc(PX_1 \wedge P_{i4}) + 2^8\, rc(PX_1 \wedge P_{i4} \wedge P_{i3}) + 2^7\, rc(PX_1 \wedge P_{i4} \wedge P_{i2}) + 2^6\, rc(PX_1 \wedge P_{i4} \wedge P_{i1}) + 2^5\, rc(PX_1 \wedge P_{i4} \wedge P_{i0}) \\
+\; 2^6\, rc(PX_1 \wedge P_{i3}) + 2^6\, rc(PX_1 \wedge P_{i3} \wedge P_{i2}) + 2^5\, rc(PX_1 \wedge P_{i3} \wedge P_{i1}) + 2^4\, rc(PX_1 \wedge P_{i3} \wedge P_{i0}) \\
+\; 2^4\, rc(PX_1 \wedge P_{i2}) + 2^4\, rc(PX_1 \wedge P_{i2} \wedge P_{i1}) + 2^3\, rc(PX_1 \wedge P_{i2} \wedge P_{i0}) \\
+\; 2^2\, rc(PX_1 \wedge P_{i1}) + 2^2\, rc(PX_1 \wedge P_{i1} \wedge P_{i0}) + 2^0\, rc(PX_1 \wedge P_{i0}) \Big)
\end{gathered} \]</p><p>\[ \begin{gathered}
= 256\cdot0 + 256\cdot0 + 128\cdot0 + 64\cdot0 + 32\cdot0 + 64\cdot2 + 64\cdot0 + 32\cdot1 + 16\cdot1 + 16\cdot2 + 16\cdot2 + 8\cdot2 + 4\cdot3 + 4\cdot3 + 1\cdot3 \\
+\; 256\cdot4 + 256\cdot2 + 128\cdot3 + 64\cdot3 + 32\cdot4 + 64\cdot2 + 64\cdot1 + 32\cdot2 + 16\cdot2 + 16\cdot3 + 16\cdot2 + 8\cdot3 + 4\cdot3 + 4\cdot3 + 1\cdot4 \\
+\; 256\cdot0 + 256\cdot0 + 128\cdot0 + 64\cdot0 + 32\cdot0 + 64\cdot0 + 64\cdot0 + 32\cdot0 + 16\cdot0 + 16\cdot1 + 16\cdot0 + 8\cdot0 + 4\cdot1 + 4\cdot1 + 1\cdot2 \\
= 2{,}969
\end{gathered} \]</p><p>\[ \begin{gathered}
T_2 = -2\cdot\Big( 14\cdot\big(2^4\, rc(PX_1 \wedge P_{14}) + 2^3\, rc(PX_1 \wedge P_{13}) + 2^2\, rc(PX_1 \wedge P_{12}) + 2^1\, rc(PX_1 \wedge P_{11}) + 2^0\, rc(PX_1 \wedge P_{10})\big) \\
+\; 10\cdot\big(2^4\, rc(PX_1 \wedge P_{24}) + 2^3\, rc(PX_1 \wedge P_{23}) + 2^2\, rc(PX_1 \wedge P_{22}) + 2^1\, rc(PX_1 \wedge P_{21}) + 2^0\, rc(PX_1 \wedge P_{20})\big) \\
+\; 19\cdot\big(2^4\, rc(PX_1 \wedge P_{34}) + 2^3\, rc(PX_1 \wedge P_{33}) + 2^2\, rc(PX_1 \wedge P_{32}) + 2^1\, rc(PX_1 \wedge P_{31}) + 2^0\, rc(PX_1 \wedge P_{30})\big) \Big) \\
= -2\cdot(462 + 1{,}020 + 152) = -3{,}268
\end{gathered} \]</p><p>\[ T_3 = rc(PX_1)\cdot(14^2 + 10^2 + 19^2) = 4\cdot657 = 2{,}628 \]</p><p>Hence, the total variation of X<sub>1</sub> about a is:</p><p>\[ \frac{(X_1 - a)\circ(X_1 - a)}{N} = \frac{T_1 + T_2 + T_3}{rc(PX_1)} = \frac{2{,}969 - 3{,}268 + 2{,}628}{4} = 582.25 \]</p><p>Similarly, the total variation of X<sub>2</sub> about a is 470. Therefore, we conclude that X<sub>2</sub> is closer to a.</p><p>3.5 Horizontal Set Inner Products</p><p>Let X be a set of vectors in R(A1…An), and let x = (x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>) denote the vectors belonging to class X. 
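</p><p>The three terms of Section 3.4 can be checked numerically with a small sketch. It assumes, purely for illustration, that the bit vectors are plain Python lists; the function names <code>variation_terms</code> and <code>total_variation</code> are hypothetical. The cross terms use the weight 2<sup>j+l+1</sup> for each pair of bit positions j &gt; l of the same attribute, as in the worked expansion of T<sub>1</sub>.</p>

```python
# Training set from Table 1: (A1, A2, A3, Rank).
ROWS = [
    (9, 31, 6, "Low"),
    (11, 20, 5, "Low"),
    (11, 21, 4, "High"),
    (7, 23, 3, "High"),
    (7, 27, 1, "High"),
    (8, 31, 0, "High"),
]
B = 5  # uniform bit width


def bit_slices(rows, n_attrs, b):
    """slices[i][j] is the vertical bit vector of attribute i at bit position j."""
    return [
        [[(row[i] >> j) & 1 for row in rows] for j in range(b)]
        for i in range(n_attrs)
    ]


def rc(*vectors):
    """Root count: number of positions where all given bit vectors are 1."""
    return sum(all(bits) for bits in zip(*vectors))


def variation_terms(slices, mask, a, b):
    """Return (T1, T2, T3, N) for the class selected by mask and target a."""
    n = rc(mask)
    t1 = 0
    t2 = 0
    for a_i, attr in zip(a, slices):
        for j in range(b):
            count = rc(mask, attr[j])
            t1 += (1 << (2 * j)) * count          # squared bit terms
            for l in range(j):                    # cross terms, j > l
                t1 += (1 << (j + l + 1)) * rc(mask, attr[j], attr[l])
            t2 += a_i * (1 << j) * count
    t2 *= -2
    t3 = n * sum(a_i * a_i for a_i in a)
    return t1, t2, t3, n


def total_variation(slices, mask, a, b):
    t1, t2, t3, n = variation_terms(slices, mask, a, b)
    return (t1 + t2 + t3) / n


SLICES = bit_slices(ROWS, 3, B)
A = (14, 10, 19)
HIGH = [1 if row[-1] == "High" else 0 for row in ROWS]
LOW = [1 if row[-1] == "Low" else 0 for row in ROWS]

print(variation_terms(SLICES, HIGH, A, B))   # (2969, -3268, 2628, 4)
print(total_variation(SLICES, HIGH, A, B))   # 582.25
print(total_variation(SLICES, LOW, A, B))    # 470.0
```

<p>Only T<sub>2</sub> and T<sub>3</sub> involve the target vector in this sketch; every call to <code>rc</code> depends on the data and the class mask alone.</p><p>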
Let a = (a<sub>1</sub>, a<sub>2</sub>, …, a<sub>n</sub>) be a target vector; then the horizontal set inner product (X ∘ a) is defined as:</p><p>\[ X \circ a = \sum_{x \in X} x \circ a = \sum_{x \in X} \sum_{i=1}^{n} x_i a_i \]</p><p>Similarly, the horizontal set inner product (HSIP) of the vector difference is defined as:</p><p>\[ (X-a)\circ(X-a) = \sum_{x \in X} (x-a)\circ(x-a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2 \]</p><p>4. EXPERIMENTAL RESULTS</p><p>This section reports the experiments we conducted to evaluate the PSIPs algorithm, using both real and synthetic datasets. The objective was to compare the execution time and scalability of our algorithm, which employs a vertical approach (vertical data structure and horizontal bitwise AND operation), with a horizontal approach (horizontal data structure and vertical scan operation). We report execution time as a function of data size. The performance of both algorithms was observed under different machine specifications, including an SGI Altix CC-NUMA machine. Table 3 summarizes the machines used for the experiments.</p><p>Table 3. The specification of machines used.</p>
<table>
<tr><td>Machine</td><td>Specification</td></tr>
<tr><td>AMD1GB</td><td>AMD Athlon K7 1.4GHz, 1GB RAM</td></tr>
<tr><td>P42GB</td><td>Intel P4 2.4GHz processor, 2GB RAM</td></tr>
<tr><td>SGI Altix</td><td>SGI Altix CC-NUMA 12 processor shared memory (12 x 4 GB RAM)</td></tr>
</table>
<p>4.1 Dataset</p><p>The experimental data were generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The image contains three bands: red, green and blue reflectance values. We use the original image of size 1024x1024 pixels (a cardinality of 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate and crop yield were also used in the experimental evaluation. Combining all bands and synchronized data, we obtained a dataset with 6 dimensions.</p><p>Additional datasets of different sizes were synthetically generated from the original data set to study the timing and scalability of the PSIPs technique presented in this paper. Both timing and scalability were evaluated with respect to data size. Because the cardinality of the original dataset is small (1,048,576 records), we super-sampled it with a simple image processing tool to produce five larger datasets, with cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels) and 25,165,824 (approximately 5016x5016 pixels).</p><p>4.2 Timing and Scalability Results</p><p>The first performance evaluation was done on a P4 with 2GB RAM. We used the synthetic datasets with 4.1 and 8.3 million rows (4,194,304 and 8,388,608 rows) to measure the execution time of both algorithms when computing total variation for 100 different test cases. Datasets larger than 8.3 million rows could not be processed on this machine, because an out-of-memory error occurs when running HSIPs. Figures 4 and 5 show the execution time comparison between PSIPs and HSIPs.</p><p>Figure 4. Time comparison on 4.1 million rows dataset. (Plot: PSIPs vs HSIPs time comparison using 100 test cases on the 4,194,304-row dataset; x-axis: test case ID; y-axis: time in seconds; series: PSIPs, HSIPs.)</p><p>Figure 5. Time comparison on 8.3 million rows dataset. (Plot: PSIPs vs HSIPs time comparison using 100 test cases on the 8,388,608-row dataset; x-axis: test case ID; y-axis: time in seconds; series: PSIPs, HSIPs.)</p><p>As the figures show, both algorithms scale up to 8.3 million rows; however, PSIPs is significantly faster than HSIPs. It requires only 0.0003 seconds on average to complete the calculation on both datasets, far less than HSIPs, which needs 7.4206 and 14.9013 seconds on average on the two datasets respectively. 
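</p><p>One way to see how the per-query time can stay nearly flat while the data grows is to separate the one-time root count pass from the per-target arithmetic of T<sub>2</sub> and T<sub>3</sub> in Section 3.4. The sketch below illustrates that split under stated assumptions: bit vectors are packed into Python integers as a stand-in for the P-tree structure, and the names <code>RootCountCache</code>, <code>pack</code> and <code>popcount</code> are hypothetical. It is not the implementation that was timed in the experiments.</p>

```python
def pack(bits):
    """Pack an iterable of 0/1 values into one integer, row r at bit r."""
    value = 0
    for row, bit in enumerate(bits):
        value |= bit << row
    return value


def popcount(x):
    """Number of 1-bits, playing the role of a root count."""
    return bin(x).count("1")


class RootCountCache:
    """Root counts for one class mask, computed once and reused for any target."""

    def __init__(self, columns, mask, b):
        self.n = popcount(mask)   # rc(PX)
        self.t1 = 0               # T1 does not depend on the target
        self.linear = []          # linear[i] = sum_j 2^j * rc(PX AND P_ij)
        for col in columns:
            slices = [pack((v >> j) & 1 for v in col) for j in range(b)]
            lin = 0
            for j in range(b):
                count = popcount(mask & slices[j])
                lin += count << j
                self.t1 += count << (2 * j)
                for l in range(j):
                    pair = popcount(mask & slices[j] & slices[l])
                    self.t1 += pair << (j + l + 1)
            self.linear.append(lin)

    def total_variation(self, a):
        """Per-target work: O(n) arithmetic, no pass over the data."""
        t2 = -2 * sum(a_i * s for a_i, s in zip(a, self.linear))
        t3 = self.n * sum(a_i * a_i for a_i in a)
        return (self.t1 + t2 + t3) / self.n


# Columns of Table 1 (A1, A2, A3) and the class mask for Rank = High.
COLUMNS = [
    [9, 11, 11, 7, 7, 8],
    [31, 20, 21, 23, 27, 31],
    [6, 5, 4, 3, 1, 0],
]
HIGH = pack([0, 0, 1, 1, 1, 1])

cache = RootCountCache(COLUMNS, HIGH, 5)   # one-time pass over the data
for target in [(14, 10, 19), (8, 25, 2), (0, 0, 0)]:
    print(target, cache.total_variation(target))
```

<p>After the cache is built, each additional target costs a handful of multiplications per attribute, regardless of the number of rows; a horizontal scan would instead revisit every record for every target.</p><p>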
These significant disparities arise because PSIPs reuses the same root count values, which are pre-computed and stored during P-tree creation, even though a different test case vector is supplied for each calculation. In the general PSIPs formula of Section 3.4, the test case vector a appears only in the calculation of T<sub>2</sub> and T<sub>3</sub> and is independent of the root count operations rc(PX ∧ P<sub>ij</sub>). We can therefore compute these operations once and reuse their values no matter how many total variations are computed, as long as the dataset and the class set X are unchanged. Notice that PSIPs tends to have a constant execution time even as the dataset size grows, whereas the execution time of HSIPs increases significantly. One may argue that pre-calculating the root counts makes this comparison unfair. However, consider the time required to load the vertical data structure into memory and perform the one-time root count operations for PSIPs, and the time required to load the horizontal records into memory for HSIPs, given in Table 4. In this respect the one-time cost of PSIPs is comparable to that of HSIPs: loading the horizontal records takes slightly longer than loading the P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to load efficiently and compute the simple counts. These timings were measured on a P4 with 2GB of memory.</p><p>Table 4. Time required to load and compute root count.</p>
<table>
<tr><td>Cardinality of Dataset</td><td>PSIPs: Root Count Pre-Computation and P-trees Loading, Time (Seconds)</td><td>HSIPs: Horizontal Dataset Loading, Time (Seconds)</td></tr>
<tr><td>1,048,576</td><td>3.900</td><td>4.974</td></tr>
<tr><td>2,097,152</td><td>8.620</td><td>10.470</td></tr>
<tr><td>4,194,304</td><td>18.690</td><td>19.914</td></tr>
<tr><td>8,388,608</td><td>38.450</td><td>39.646</td></tr>
</table>
<p>Our next experiment observed the timing and scalability of the algorithms when executed on machines with different specifications, especially for HSIPs, which depends heavily on the availability of memory to execute successfully. This sensitivity was evident when we ran HSIPs on the AMD1GB machine. HSIPs successfully completed the total variation computation on the datasets with cardinalities of 1,048,576, 2,097,152 and 4,194,304, yet ran out of memory on datasets with more than 4.1 million rows. Similarly, on the P42GB machine HSIPs could compute total variation only for datasets of up to 8.3 million rows. HSIPs scaled better on the SGI Altix, where it successfully computed total variation for all datasets, although it too ran out of memory when trying to load a dataset with more than 25 million rows. However, the timing of HSIPs on this machine is significantly worse than the timing of HSIPs on the P42GB machine. This is because our implementation does not exploit the full capability of the machine's 12-processor shared memory parallel architecture, which is beyond the scope of this paper. This machine, with 12x4GB of RAM, was used in the performance study because it was the only machine capable of loading the entire data set for HSIPs on the larger data sets.</p><p>Our PSIPs technique, on the other hand, succeeded in both timing and scalability. It encountered no memory problem and computed total variation on a dataset with more than 25 million rows. The same result was observed when running PSIPs on the other two machines, and the average time was extremely stable, at around 0.0003 to 0.0004 seconds. Table 5 presents the average times measured when executing the two techniques on the different machines, and Figure 6 further illustrates performance with respect to scalability.</p><p>Table 5. Average time running under different machines.</p>
<table>
<tr><td>Cardinality of Dataset</td><td>HSIPs, AMD-1GB (Seconds)</td><td>HSIPs, P4-2GB (Seconds)</td><td>HSIPs, SGI Altix 12x4GB (Seconds)</td><td>PSIPs, AMD-1GB (Seconds)</td></tr>
<tr><td>1,048,576</td><td>2.200</td><td>1.840</td><td>5.480</td><td>0.0003</td></tr>
<tr><td>2,097,152</td><td>4.410</td><td>3.640</td><td>8.320</td><td>0.0003</td></tr>
<tr><td>4,194,304</td><td>8.580</td><td>7.380</td><td>15.864</td><td>0.0004</td></tr>
<tr><td>8,388,608</td><td>Out of memory</td><td>15.160</td><td>33.900</td><td>0.0004</td></tr>
<tr><td>16,777,216</td><td>Out of memory</td><td>Out of memory</td><td>66.540</td><td>0.0004</td></tr>
<tr><td>25,165,824</td><td>Out of memory</td><td>Out of memory</td><td>115.204</td><td>0.0004</td></tr>
</table>
<p>Figure 6. Average time running under different machines. (Plot: PSIPs vs HSIPs time comparison using 100 different test cases on different types of machine; x-axis: number of tuples (1024^2); y-axis: time in seconds; series: PSIPs on P4-2G, HSIPs on AMD-1G, HSIPs on P4-2G, HSIPs on SGI-48G, with out-of-memory points marked.)</p><p>5. CONCLUSION</p><p>In this paper we have presented, defined and evaluated a new concept for computing total variation, called vertical set inner products (P-tree based set inner products, abbreviated PSIPs). Experiments indicate that vertical set inner products are fast and scalable compared with conventional horizontal set inner products, especially on large data sets, without compromising the accuracy of the results. We believe that the PSIPs technique could be very useful in clustering tasks, since it measures the closeness of a group of feature vectors to a target vector. In classification tasks, applying PSIPs in the voting phase could presumably accelerate class assignment considerably, because the calculation over correlated points can be done in one computation without visiting each individual point as in the horizontal approach. These hypotheses require careful observation and testing, which we leave as future work.</p><p>6. REFERENCES</p><p>[1] Abidin, T., and Perrizo, W. Vertical Set Inner Products Formula. http://midas.cs.ndsu.nodak.edu/~abidin/research/PSIPs.pdf</p><p>[2] Ding, Q., Khan, M., Roy, A., and Perrizo, W. (2002). The P-tree Algebra. Proceedings of the ACM Symposium on Applied Computing. 426-431.</p><p>[3] Weisstein, E. W., et al. Total Variation. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/TotalVariation.html</p><p>[4] Han, J., and Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann.</p><p>[5] Khan, M., Ding, Q., and Perrizo, W. (2002). K-Nearest Neighbor Classification of Spatial Data Streams using P-trees. Proceedings of the PAKDD. 517-528.</p><p>[6] Perera, A., Denton, A., Kotala, P., Jockheck, W., Granda, W. V., and Perrizo, W. (2002). P-tree Classification of Yeast Gene Deletion Data. SIGKDD Explorations, 4(2): 108-109.</p><p>[7] Perrizo, W. (2001). Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.</p><p>[8] Rahal, I., and Perrizo, W. (2004). An Optimized Approach for KNN Text Categorization using P-Trees. Proceedings of ACM Symposium on Applied Computing. 613-617.</p>