International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(3): 1041-1050. The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)

Dimension Reduction of Health Data Clustering

Rahmat Widia Sembiring 1, Jasni Mohamad Zain 2, Abdullah Embong 3
1,2 Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Kuantan, Pahang Darul Makmur, Malaysia
3 School of Computer Science, Universiti Sains Malaysia, 11800 Minden, Pulau Pinang, Malaysia
[email protected], [email protected], [email protected]

ABSTRACT

Current data tend to be more complex than conventional data and need dimension reduction. Dimension reduction is important in cluster analysis: it creates a smaller data volume that yields the same analytical results as the original representation. A clustering process needs data reduction to obtain an efficient processing time while clustering and to mitigate the curse of dimensionality. This paper proposes a model for extracting multidimensional data clustering of health data. We implemented four dimension reduction techniques: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Self-Organizing Map (SOM) and FastICA. The results show that dimension reduction significantly reduces the dimension, shortens the processing time and increases the cluster performance on several health datasets.

KEYWORDS

DBSCAN, dimension reduction, SVD, PCA, SOM, FastICA.

1 Introduction

Current data tend to be multidimensional and high dimensional, and more complex than conventional data. Many clustering algorithms have been proposed, but they often produce clusters that are less meaningful. The use of multidimensional data results in more noise, complex data, and the possibility of unconnected data entities. This problem can be addressed with a clustering algorithm; clustering algorithms can be grouped into cell-based clustering, density-based clustering, and clustering-oriented methods. To obtain an efficient processing time and to mitigate the curse of dimensionality while clustering, a clustering process needs data reduction. Dimension reduction is a technique that is widely used in various applications to overcome the curse of dimensionality.

Dimension reduction is important in cluster analysis: it not only makes high dimensional data addressable and reduces the computational cost, but can also provide users with a clearer picture and visual examination of the data of interest [6]. Many emerging dimension reduction techniques have been proposed. Local Dimensionality Reduction (LDR) tries to find local correlations in the data and performs dimensionality reduction on the locally correlated clusters of data individually [3]; dimension reduction can also be treated as a dynamic process adaptively adjusted and integrated with the clustering process [4]. Sufficient Dimensionality Reduction (SDR) is an iterative algorithm [8],


which converges to a local minimum and hence solves the Max-Min problem as well. A number of optimizations can solve this minimization problem, and a reduction algorithm based on a Bayesian inductive cognitive model is used to decide which dimensions are advantageous [11].

Developing an effective and efficient clustering method to process multidimensional and high dimensional datasets is a challenging problem. This paper is organized into a few sections. Section 2 presents the related work. Section 3 explains the materials and method. Section 4 elucidates the results, followed by discussion in Section 5. Section 6 deals with the concluding remarks.

2 Related Work

The functions of data mining are association, correlation, prediction, clustering, classification, analysis, trends, outliers and deviation analysis, and similarity and dissimilarity analysis. Clustering techniques are applied when there is no class to predict but rather when the instances divide into natural groups [20]. Clustering multidimensional data poses many challenges: noise, complexity of data, and data redundancy. To mitigate these problems, dimension reduction is needed. In statistics, dimension reduction is the process of reducing the number of random variables. The process is classified into feature selection and feature extraction [18], and the taxonomy of dimension reduction problems [16] is shown in Figure 1.

[Figure 1. Taxonomy of dimension reduction problems: dimension reduction aims at increasing learning performance and at reducing irrelevant and redundant dimensions; it divides into attribute reduction (feature selection, attribute decomposition, function decomposition, simple decomposition, variable selection) and record reduction (record selection).]

Dimension reduction is the ability to identify a small number of important inputs (for predicting the target) from a much larger number of available inputs, and it is effective in cases when there are more inputs than cases or observations.

Dimension reduction methods are associated with regression, additive models, neural network models, and Hessian-based methods [6]. One of them is local dimension reduction (LDR), which looks for correlations in the dataset, reduces the dimensions of each correlated group individually, and then uses a multidimensional index structure [3]. A nonlinear algorithm gives better performance than PCA for sound and image data [14]; other studies mention that a dimension reduction and texture classification scheme based on Principal Component Analysis (PCA) can be applied in a manifold statistical framework [3].

In most applications, dimension reduction is performed as a pre-processing step [5], carried out with traditional statistical methods that parse an increasing number of observations [6]. Reduction of dimensions creates a more effective domain characterization [1]. Sufficient Dimension Reduction (SDR) is a generalization of nonlinear regression problems, where the extraction of features is as important as the matrix factorization [8], while SSDR (Semi-supervised Dimension Reduction)


is used to maintain the original structure of high dimensional data [27].

The goals of dimension reduction methods are to reduce the number of predictor components and to help ensure that these components are independent. These methods are designed to provide a framework for interpretability of the results, and to find a mapping F : R^p -> R^d (d < p) that maps the input data from its original space to a lower dimensional feature space [26, 15]. Dimension reduction techniques such as principal component analysis (PCA) and partial least squares (PLS) can be used to reduce the dimension of microarray data before a certain classifier is applied [25].

We compared four dimension reduction techniques, each embedded in DBSCAN. These dimension reduction techniques are:

A. SVD

The Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. The SVD of X is X = USV^T [24], where U is an m x n matrix, S is an n x n diagonal matrix, and V^T is also an n x n matrix. The columns of U are called the left singular vectors, {u_k}, and form an orthonormal basis for the assay expression profiles, so that u_i·u_j = 1 for i = j, and u_i·u_j = 0 otherwise. The rows of V^T contain the elements of the right singular vectors, {v_k}, and form an orthonormal basis for the gene transcriptional responses. The elements of S are nonzero only on the diagonal and are called the singular values; thus S = diag(s_1, ..., s_n). Furthermore, s_k > 0 for 1 <= k <= r, and s_k = 0 for (r+1) <= k <= n. By convention, the ordering of the singular vectors is determined by high-to-low sorting of the singular values, with the highest singular value in the upper left index of the S matrix.
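For illustration (not part of the original experiments), a minimal NumPy sketch of SVD-based reduction follows; the random stand-in data and the choice k = 2 are assumptions of this example:

    import numpy as np

    # Minimal sketch of SVD-based dimension reduction (illustrative only).
    # X is assumed to be an (m x n) data matrix: m samples, n attributes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 8))   # stand-in for a health dataset
    k = 2                           # number of dimensions to keep (assumption)

    # Economy-size SVD: X = U S V^T, singular values sorted high to low.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Project onto the top-k right singular vectors to get the reduced data.
    X_reduced = X @ Vt[:k].T        # equivalently U[:, :k] * s[:k]

    print(X_reduced.shape)          # (150, 2)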

B. PCA

PCA is a dimension reduction technique that uses variance as a measure of interestingness and finds orthogonal vectors (principal components) in the feature space that account for the most variance in the data [19]. Principal component analysis is probably the oldest and best known technique of multivariate analysis; it was first introduced by Pearson and developed independently by Hotelling [12]. The advantages of PCA are identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. It is a powerful tool for analysing data by finding these patterns and then compressing the data by dimension reduction without much loss of information [23]. The PCA algorithm [7] is as follows:
a. Recover basis: calculate XX^T and let U = eigenvectors of XX^T corresponding to the top d eigenvalues.
b. Encode training data: Y = U^T X, where Y is a d x t matrix of encodings of the original data.
c. Reconstruct training data: X' = UY = UU^T X.
d. Encode test example: y = U^T x, where y is a d-dimensional encoding of x.
e. Reconstruct test example: x' = Uy = UU^T x.
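The steps above translate directly into a short NumPy sketch (illustrative only; the data matrix and the target dimension d = 2 are placeholder assumptions, and X is assumed mean-centered as the algorithm requires):

    import numpy as np

    # Minimal sketch of the PCA algorithm above (illustrative only).
    # X is an (n x t) matrix: n attributes (rows), t samples (columns).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 150))
    X = X - X.mean(axis=1, keepdims=True)   # center each attribute
    d = 2                                   # target dimension (assumption)

    # a. Recover basis: top-d eigenvectors of X X^T.
    vals, vecs = np.linalg.eigh(X @ X.T)    # eigh returns ascending eigenvalues
    U = vecs[:, ::-1][:, :d]                # take the top-d eigenvectors

    # b. Encode training data: Y is d x t.
    Y = U.T @ X

    # c. Reconstruct training data from the encoding.
    X_hat = U @ Y                           # = U U^T X

    print(Y.shape, np.linalg.norm(X - X_hat))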


C. SOM

A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map [14]. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen and is sometimes called a Kohonen map.
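As a rough illustration of the idea (not the paper's implementation), the following NumPy sketch trains a small SOM and uses each sample's best-matching unit as a two-dimensional representation; the grid size, decay schedule and stand-in data are all assumptions of this example:

    import numpy as np

    # Minimal SOM sketch (illustrative only): a 5x5 map trained on an
    # (n_samples x n_features) matrix X; each sample's best-matching unit
    # (BMU) gives 2-D grid coordinates as its reduced representation.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 8))
    rows, cols, n_iter, lr0, sigma0 = 5, 5, 1000, 0.5, 2.0

    W = rng.normal(size=(rows, cols, X.shape[1]))         # codebook vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # unit coordinates

    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # BMU: the codebook vector closest to x
        bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)), (rows, cols))
        # learning rate and neighborhood radius decay over time
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Gaussian neighborhood pulls nearby units toward x
        dist2 = ((grid - np.array(bmu)) ** 2).sum(-1)
        h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        W += lr * h * (x - W)

    # reduced representation: BMU grid coordinates of each sample
    coords = np.array([np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)),
                                        (rows, cols)) for x in X])
    print(coords[:5])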

D. FastICA

Independent Component Analysis (ICA) was introduced by Jeanny Hérault and Christian Jutten in 1986 and later clarified by Pierre Comon in 1994 [22]. FastICA is one of the extensions of ICA; it is based on a fixed-point iteration scheme to find maximum nongaussianity [9] and can also be derived as an approximate Newton iteration. FastICA uses the following update formula [9]:

w+ = E{x g(w^T x)} - E{g'(w^T x)} w, followed by normalization w = w+ / ||w+||,

where g is the derivative of the nonquadratic contrast function and g' is its derivative; when several components are estimated, the matrix W of weight vectors needs to be orthogonalized after each iteration has been processed.
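A minimal NumPy sketch of this one-unit fixed-point update follows (illustrative only; the two synthetic sources, the tanh contrast function and the whitening step are assumptions of this example, and a full FastICA would orthogonalize several such vectors):

    import numpy as np

    # One-unit FastICA sketch implementing the update above (illustrative
    # only). X must be whitened data of shape (n_features x n_samples).
    rng = np.random.default_rng(0)
    S = np.vstack([np.sign(rng.normal(size=2000)),   # non-Gaussian sources
                   rng.laplace(size=2000)])
    A = rng.normal(size=(2, 2))
    X = A @ S
    # whiten: zero mean, identity covariance
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    X = E @ np.diag(d ** -0.5) @ E.T @ X

    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(100):
        wx = w @ X
        # w+ = E{x g(w^T x)} - E{g'(w^T x)} w, with g = tanh
        w_new = (X * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)      # renormalize (orthogonalization
        if abs(abs(w_new @ w) - 1) < 1e-9:  # would be needed for several units)
            w = w_new
            break
        w = w_new

    print(w)                                # one estimated unmixing direction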

3 Material and Method

This study is designed to find the most efficient dimension reduction technique. To achieve this objective, we implemented a model in which the efficiency of the clustering is obtained by first reducing the dimensions of the datasets [21]. Four dimension reduction techniques are tested in the proposed model, namely SVD, PCA, SOM and FastICA.

[Figure 2. Proposed model: the original datasets pass through dimension reduction (DR) into the clustering technique, which yields performance-1 directly and, via data-to-similarity and filtering, a cluster model that yields performance-2.]

The dimension reduction result is processed by the DBSCAN clustering technique. DBSCAN needs ε (eps) and the minimum number of points required to form a cluster (minPts), with mixed Euclidean distance as the distance measure. For the result of DBSCAN clustering, a data-to-similarity function calculates a similarity measure from the given data (attribute based); the other output of DBSCAN that is measured is performance-1, which simply reports the number of clusters as a value.

The result of data-to-similarity is taken as an example set by a filter that returns a new example set including only the examples that fulfil a condition. By specifying an implementation of a condition and a parameter string, arbitrary filters can be applied, and performance-2 is derived directly as a measure from a specific data or statistics value. The expectation maximization clustering is then run with parameters k=2, max runs=5, max optimization steps=100, quality=1.0E-10 and initial distribution=k-means.
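A loose scikit-learn analogue of this pipeline may help fix the ideas (an assumption of ours; the paper's experiments were run in RapidMiner, and the stand-in data, the PCA choice and the operator mapping are illustrative only):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    # Sketch of the pipeline: dimension reduction, then DBSCAN, then
    # performance-1 = number of clusters found (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 8))          # stand-in for a health dataset

    X_reduced = PCA(n_components=2).fit_transform(X)  # any of SVD/PCA/SOM/FastICA

    # eps and min_samples follow the paper's eps=1, MinPts=5 setting
    labels = DBSCAN(eps=1.0, min_samples=5,
                    metric="euclidean").fit_predict(X_reduced)

    # performance-1: the number of clusters (noise points are labeled -1)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters)

    # performance-2 path (very loose analogue of the RapidMiner operators):
    # filter out noise examples, then fit an EM mixture with k=2
    mask = labels != -1
    gm = GaussianMixture(n_components=2, n_init=5, max_iter=100,
                         tol=1e-10, init_params="kmeans").fit(X_reduced[mask])
    print(gm.converged_)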


4 Result

Testing of the model performance was conducted on four datasets: e-coli, acute implant, blood transfusion and prostate cancer. Using RapidMiner, we ran the testing process without dimension reduction and clustering, and then compared the results with those of the clustering process using dimension reduction. By implementing the four dimension reduction techniques SVD, PCA, SOM and FastICA, and then applying the density-based cluster method, we obtained the attribute counts shown in Table 1.

Table 1. Attribute dimension reduction (number of attributes for each dataset)

Dimension reduction          E-coli   Acute implant   Blood transfusion   Prostate cancer
with SVD                        1           1                1                  1
with PCA                        5           4                1                  3
with SOM                        2           2                1                  2
with FastICA                    8           8                5                 18
without dimension reduction     8           8                5                 18

Figures 3a-3e present the e-coli dataset clustered with DBSCAN: without dimension reduction (3a) and with SVD (3b), PCA (3c), SOM (3d) and FastICA (3e).

[Figures 3a-3e. E-coli dataset based on DBSCAN: without dimension reduction (3a) and with SVD (3b), PCA (3c), SOM (3d) and FastICA (3e).]

To find out the efficiency, we recorded the processing time of each test, as shown in Table 2.

Table 2. Processing time

Processing time for each dataset

Dimension reduction          E-coli   Acute implant   Blood transfusion   Prostate cancer
with SVD                       19           9               61                 39
with PCA                       27          14               47                 35
with SOM                       34          22               51                 41
with FastICA                   67          12               58                148
without dimension reduction    22          11              188                 90

Using SVD, PCA, SOM and FastICA we also conducted the testing process and recorded the cluster performance as the number of clusters found, as shown in Table 3.

Table 3. Cluster performance (number of clusters for each dataset)

Dimension reduction          E-coli   Acute implant   Blood transfusion   Prostate cancer
with SVD                        2          10               13                  1
with PCA                        2           2                2                  2
with SOM                        2           7               17                  1
with FastICA                    1           1               51                  2
without dimension reduction     8          10               13                  1


Other results were obtained for the acute implant dataset: Figures 4a-4e present the acute implant dataset based on DBSCAN without dimension reduction and with the various dimension reduction techniques.

[Figures 4a-4e. Acute implant dataset based on DBSCAN: without dimension reduction (4a) and with SVD (4b), PCA (4c), SOM (4d) and FastICA (4e).]

The third dataset tested was blood transfusion. The results for the blood transfusion dataset, based on DBSCAN without dimension reduction and with the various dimension reduction techniques, are presented in Figures 5a-5e.

[Figures 5a-5e. Blood transfusion dataset based on DBSCAN: without dimension reduction (5a) and with SVD (5b), PCA (5c), SOM (5d) and FastICA (5e).]

Using the same dimension reduction techniques, we clustered the prostate cancer dataset; the results, based on DBSCAN without dimension reduction and with the various dimension reduction techniques, are presented in Figures 6a-6e.

[Figures 6a-6e. Prostate cancer dataset based on DBSCAN: without dimension reduction (6a) and with SVD (6b), PCA (6c), SOM (6d) and FastICA (6e).]


Each cluster process used a predetermined value of ε=1 and MinPts=5, while the number of clusters to be produced (k=2) was also determined beforehand.

5 Discussion

Dimension reduction before the clustering process is intended to obtain an efficient processing time and to increase the accuracy of the cluster performance. Based on the results in the previous section, dimension reduction shortens the processing time and lowers the number of attributes. Figure 7 shows that DBSCAN with SVD has the lowest number of attributes after reduction.

[Figure 7. Number of attributes after reduction.]

Another evaluation of the model implementation is the comparison of processing times. In general, dimension reduction decreased the time to process; for several datasets we found that DBSCAN with SVD had the lowest processing time (Figure 8).

[Figure 8. Processing time comparison.]

The cluster process with FastICA dimension reduction has the highest cluster performance for the blood transfusion dataset (Figure 9), but the lowest for the other datasets, while PCA has the lowest performance over the datasets overall.

[Figure 9. Number of clusters comparison.]

6 Conclusion

The discussion above has shown that applying a dimension reduction technique shortens the processing time. Dimension reduction before the clustering process is intended to obtain an efficient processing time and to increase cluster performance. DBSCAN with SVD had the lowest processing time for several datasets, and SVD also produced the lowest number of attributes after reduction. In general, dimension reduction shows an increased cluster performance.


References

1. Bi, Jinbo, Kristin Bennett, Mark Embrechts, Curt Breneman, Minghu Song: "Dimensionality Reduction via Sparse Support Vector Machines", Journal of Machine Learning Research 3, pp. 1229-1243 (2003)
2. Chakrabarti, Kaushik, Sharad Mehrotra: "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", Proceedings of the 26th VLDB Conference, Cairo, Egypt, pp. 89-100 (2000)
3. Choi, S. W., Martin, E. B., Morris, A. J., Lee, I.-B.: "Fault Detection Based on a Maximum Likelihood PCA Mixture", Ind. Eng. Chem. Res. 44, pp. 2316-2327 (2005)
4. Ding, Chris, Tao Li: "Adaptive Dimension Reduction Using Discriminant Analysis and K-means Clustering", International Conference on Machine Learning, Corvallis, OR (2007)
5. Ding, Chris, Xiaofeng He, Hongyuan Zha, Horst Simon: "Adaptive Dimension Reduction for Clustering High Dimensional Data", Lawrence Berkeley National Laboratory, pp. 1-8 (2002)
6. Fodor, I.K.: "A Survey of Dimension Reduction Techniques", LLNL Technical Report UCRL-ID-148494, pp. 1-18 (2002)
7. Ghodsi, Ali: "Dimensionality Reduction, A Short Tutorial", Technical Report 2006-14, Department of Statistics and Actuarial Science, University of Waterloo, pp. 5-6 (2006)
8. Globerson, Amir, Naftali Tishby: "Sufficient Dimensionality Reduction", Journal of Machine Learning Research 3, pp. 1307-1331 (2003)
9. Hyvärinen, Aapo, Erkki Oja: "Independent Component Analysis: Algorithms and Applications", Neural Networks, pp. 411-430 (2002)
10. Hyvärinen, A., Oja, E.: "Independent Component Analysis: Algorithms and Applications", Neural Networks 13, pp. 411-430 (2000)
11. Jin, Longcun, Wanggen Wan, Yongliang Wu, Bin Cui, Xiaoqing Yu, Youyong Wu: "A Robust High-Dimensional Data Reduction Method", The International Journal of Virtual Reality 9(1), pp. 55-60 (2010)
12. Jolliffe, I.T.: "Principal Component Analysis", Springer-Verlag New York Inc., New York, pp. 7-26 (2002)
13. Kambhatla, Nanda, Todd K. Leen: "Fast Non-Linear Dimension Reduction" (1994)
14. Kohonen, T., Kaski, S., Lappalainen, H.: "Self-Organized Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM", Neural Computation 9, pp. 1321-1344 (1997)
15. Larose, Daniel T.: "Data Mining Methods and Models", John Wiley & Sons Inc., New Jersey, pp. 1-15 (2006)
16. Maimon, Oded, Lior Rokach: "Data Mining and Knowledge Discovery Handbook", Springer Science+Business Media Inc., pp. 94-97 (2005)
17. Maimon, Oded, Lior Rokach: "Decomposition Methodology for Knowledge Discovery and Data Mining", World Scientific Publishing Co. Pte. Ltd., Danvers MA, pp. 253-255 (2005)
18. Nisbet, Robert, John Elder, Gary Miner: "Statistical Analysis & Data Mining Applications", Elsevier Inc., California, pp. 111-269 (2009)
19. Poncelet, Pascal, Maguelonne Teisseire, Florent Masseglia: "Data Mining Patterns: New Methods and Applications", Information Science Reference, Hershey PA, pp. 120-121 (2008)
20. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Clustering High Dimensional Data Using Subspace and Projected Clustering Algorithms", International Journal of Computer Science & Information Technology (IJCSIT) 2(4), pp. 162-170 (2010)
21. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Alternative Model for Extracting Multidimensional Data Based-On Comparative Dimension Reduction", ICSECS (2), pp. 28-42 (2011)
22. Sembiring, Rahmat Widia, Jasni Mohamad Zain: "Cluster Evaluation of Density Based Subspace Clustering", Journal of Computing 2(11), pp. 14-19 (2010)
23. Smith, Lindsay I.: "A Tutorial on Principal Component Analysis", http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, pp. 12-16 (2002)
24. Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha: "Singular Value Decomposition and Principal Component Analysis", in: Berrar, D.P., Dubitzky, W., Granzow, M. (eds.), A Practical Approach to Microarray Data Analysis, Kluwer, Norwell, MA, LANL LA-UR-02-4001, pp. 91-109 (2003)
25. Wang, John: "Encyclopaedia of Data Warehousing and Data Mining", Idea Group Reference, Hershey PA, p. 812 (2006)
26. Xu, Rui, Donald C. Wunsch II: "Clustering", John Wiley & Sons Inc., New Jersey, pp. 237-239 (2009)
27. Zhang, Daoqiang, Zhi-Hua Zhou, Songcan Chen: "Semi-Supervised Dimensionality Reduction", 7th SIAM International Conference on Data Mining, pp. 629-634 (2008)
