
International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(3): 1041-1050
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)

Dimension Reduction of Health Data Clustering

Rahmat Widia Sembiring 1, Jasni Mohamad Zain 2, Abdullah Embong 3
1,2 Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Kuantan, Pahang Darul Makmur, Malaysia
3 School of Computer Science, Universiti Sains Malaysia, 11800 Minden, Pulau Pinang, Malaysia
[email protected], [email protected], [email protected]

ABSTRACT

Current data tend to be more complex than conventional data and to need dimension reduction. Dimension reduction is important in cluster analysis: it creates a smaller data volume that yields the same analytical results as the original representation. A clustering process needs data reduction to obtain an efficient processing time and to mitigate the curse of dimensionality. This paper proposes a model for clustering multidimensional health data. We implemented four dimension reduction techniques: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Self-Organizing Map (SOM), and FastICA. The results show that dimension reduction significantly reduces dimensionality, shortens processing time, and increases cluster performance on several health datasets.

KEYWORDS

DBSCAN, dimension reduction, SVD, PCA, SOM, FastICA.

1 Introduction

Current data tend to be multidimensional and high-dimensional, and more complex than conventional data. Many clustering algorithms have been proposed, but they often produce clusters that are less meaningful. Multidimensional data also bring more noise, greater complexity, and the possibility of unconnected data entities. Clustering algorithms, which can be grouped into cell-based clustering, density-based clustering, and clustering-oriented methods, address part of this problem. To obtain an efficient processing time and to mitigate the curse of dimensionality while clustering, a clustering process additionally needs data reduction. Dimension reduction is a technique widely used across applications to combat the curse of dimensionality.

Dimension reduction is important in cluster analysis: it not only makes high-dimensional data addressable and reduces the computational cost, but can also provide users with a clearer picture and a visual examination of the data of interest [6]. Many dimension reduction techniques have been proposed. Local Dimensionality Reduction (LDR) tries to find local correlations in the data and performs dimensionality reduction on each locally correlated cluster of data individually [3]; dimension reduction can also be treated as a dynamic process, adaptively adjusted and integrated with the clustering process [4].

Sufficient Dimensionality Reduction (SDR) is an iterative algorithm [8] which converges to a local minimum of its objective and hence also solves the Max-Min problem.
A number of optimization methods can solve this minimization problem, and a reduction algorithm based on a Bayesian inductive cognitive model can be used to decide which dimensions are advantageous [11].

Developing an effective and efficient clustering method for multidimensional, high-dimensional datasets is a challenging problem. This paper is organized into a few sections. Section 2 presents the related work. Section 3 explains the materials and method. Section 4 elucidates the results, followed by discussion in Section 5. Section 6 deals with the concluding remarks.

2 Related Work

The functions of data mining are association, correlation, prediction, clustering, classification, trend analysis, outlier and deviation analysis, and similarity and dissimilarity analysis. Clustering is applied when there is no class to predict but rather the instances are to be divided into natural groups [20]. Clustering multidimensional data faces several challenges: noise, complexity of the data, and data redundancy. To mitigate these problems, dimension reduction is needed. In statistics, dimension reduction is the process of reducing the number of random variables under consideration; the process is classified into feature selection and feature extraction [18], and the taxonomy of dimension reduction problems [16] is shown in Figure 1.

Figure 1. Taxonomy of the dimension reduction problem (recoverable labels of the diagram: increasing learning performance; reducing irrelevant dimensions; reduction of redundant dimensions; attribute reduction with attribute selection, function decomposition, and simple decomposition; record reduction with record selection; feature reduction with variable selection).

Dimension reduction is the ability to identify a small number of important inputs (for predicting the target) from a much larger number of available inputs, and it is effective in cases where there are more inputs than cases or observations.

Dimension reduction methods are associated with regression, additive models, neural network models, and Hessian-based methods [6]. One of these, local dimension reduction (LDR), looks for relationships in the dataset, reduces the dimensions of each individual cluster, and then uses a multidimensional index structure [3]. Nonlinear algorithms give better performance than PCA for sound and image data [14], while other studies note that a dimension reduction and texture classification scheme based on Principal Component Analysis (PCA) can be applied in a manifold statistical framework [3].

In most applications, dimension reduction is performed as a pre-processing step [5], carried out with traditional statistical methods that must parse an increasing number of observations [6]. Reducing the dimensions creates a more effective domain characterization [1]. Sufficient Dimension Reduction (SDR) is a generalization of nonlinear regression problems in which the extraction of features is as important as the matrix factorization [8], while SSDR (Semi-supervised Dimension Reduction) is used to maintain the original structure of high-dimensional data [27].

The goals of dimension reduction methods are to reduce the number of predictor components and to help ensure that these components are independent.
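Before turning to the individual techniques, the following sketch makes the pre-processing role of dimension reduction concrete: a health dataset is reduced with PCA and the reduced data are then clustered with DBSCAN. This is a minimal illustration assuming scikit-learn; the dataset and the eps/min_samples values are placeholder choices, not the configuration evaluated in this paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data                # a health dataset: 569 x 30
X = StandardScaler().fit_transform(X)        # zero mean, unit variance

# Pre-processing step: reduce 30 attributes to 2 principal components.
X_low = PCA(n_components=2).fit_transform(X)

# Density-based clustering on the reduced data; eps and min_samples are
# hypothetical values that would need tuning for each dataset.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_low)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", (labels == -1).sum())
```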
The methods are designed to provide a framework for interpretability of the results and to find a mapping F that maps the input data from the space R^N to a lower-dimensional feature space R^K, denoted F: R^N -> R^K [26, 15]. Dimension reduction techniques such as principal component analysis (PCA) and partial least squares (PLS) can be used to reduce the dimension of microarray data before a certain classifier is applied [25].

We compared four dimension reduction techniques, each embedded in DBSCAN; the techniques tested in the proposed model are SVD, PCA, SOM, and FastICA (brief code sketches illustrating the first three follow Section C).

A. SVD

The Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. The SVD of X is X = USV^T [24], where U is an m x n matrix, S is an n x n diagonal matrix, and V^T is also an n x n matrix. The columns of U are called the left singular vectors, {u_k}, and form an orthonormal basis for the assay expression profiles, so that u_i . u_j = 1 for i = j and u_i . u_j = 0 otherwise. The rows of V^T contain the elements of the right singular vectors, {v_k}, and form an orthonormal basis for the gene transcriptional responses. The elements of S are nonzero only on the diagonal and are called the singular values; thus S = diag(s_1, ..., s_n). Furthermore, s_k > 0 for 1 <= k <= r, and s_k = 0 for (r+1) <= k <= n. By convention, the ordering of the singular vectors is determined by high-to-low sorting of the singular values, with the highest singular value in the upper-left entry of S.

B. PCA

PCA is a dimension reduction technique that uses variance as a measure of interestingness and finds orthogonal vectors (principal components) in the feature space that account for the most variance in the data [19]. Principal component analysis is probably the oldest and best known technique of multivariate analysis, first introduced by Pearson and developed independently by Hotelling [12].

The advantages of PCA are that it identifies patterns in data and expresses the data in a way that highlights their similarities and differences. It is a powerful tool for analysing data by finding these patterns and then compressing the data by dimension reduction without much loss of information [23]. The PCA algorithm [7] is as follows:
a. Recover basis: calculate XX^T and let U = the eigenvectors of XX^T corresponding to the top d eigenvalues.
b. Encode training data: Y = U^T X, where Y is a d x t matrix of encodings of the original data.
c. Reconstruct training data: X_hat = UY = UU^T X.
d. Encode test example: y = U^T x, where y is a d-dimensional encoding of x.
e. Reconstruct test example: x_hat = Uy = UU^T x.

C. SOM

A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples.
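The factorization described in Section A can be reproduced directly in NumPy. The sketch below is illustrative only: a random toy matrix stands in for real assay data, and numpy.linalg.svd is relied on to return the singular values already sorted high-to-low.

```python
import numpy as np

X = np.random.rand(8, 5)                    # toy m x n data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values arrive sorted high-to-low, so the largest one occupies
# the upper-left entry of S = diag(s1, ..., sn), as noted in Section A.
S = np.diag(s)
assert np.allclose(X, U @ S @ Vt)           # X = U S V^T holds exactly

# Dimension reduction: keep only the top-k singular triplets.
k = 2
X_k = U[:, :k] @ S[:k, :k] @ Vt[:k, :]      # best rank-k approximation of X
print("rank-%d approximation error: %.4f" % (k, np.linalg.norm(X - X_k)))
```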
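Steps a-e of the PCA algorithm in Section B map almost line-for-line onto NumPy. In this sketch the toy data matrix (columns as observations, as in the algorithm) and the target dimension d = 2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 100))                     # N = 5 inputs, t = 100 samples
X = X - X.mean(axis=1, keepdims=True)        # centre each input dimension
d = 2                                        # target dimension

# a. Recover basis: eigenvectors of XX^T for the top d eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigh sorts ascending
U = eigvecs[:, ::-1][:, :d]                  # top-d eigenvectors as columns

# b. Encode training data: Y = U^T X, a d x t matrix of encodings.
Y = U.T @ X

# c. Reconstruct training data: X_hat = U Y = U U^T X.
X_hat = U @ Y
print("training reconstruction error:", np.linalg.norm(X - X_hat))

# d./e. Encode and reconstruct a test example; in practice x should first
# be centred with the training mean.
x = rng.random(5)
y = U.T @ x                                  # d-dimensional encoding of x
x_hat = U @ y                                # reconstruction U U^T x
```

For centred data, the top-d eigenvectors of XX^T coincide with the top-d left singular vectors of X, so this route spans the same subspace as the SVD sketch above.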
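The sketch below is one compact way to realise the SOM idea of Section C: a two-dimensional grid of weight vectors is pulled toward the training inputs, with updates that decay over time and with grid distance from the winning node. It follows the generic algorithm, not any implementation used in this paper; the grid size, rates, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 4))                  # 200 samples, 4 attributes

rows, cols, n_iter = 10, 10, 1000
W = rng.random((rows, cols, 4))              # one weight vector per map node
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)   # node coordinates

for t in range(n_iter):
    x = data[rng.integers(len(data))]        # random training sample
    # Winner: the node whose weight vector is closest to the input.
    winner = np.unravel_index(
        np.linalg.norm(W - x, axis=-1).argmin(), (rows, cols))
    # Learning rate and neighbourhood radius both shrink over time.
    lr = 0.5 * np.exp(-t / n_iter)
    sigma = (max(rows, cols) / 2) * np.exp(-t / n_iter)
    # Gaussian neighbourhood around the winner on the 2-D grid.
    g = np.exp(-np.sum((grid - np.array(winner)) ** 2, axis=-1)
               / (2 * sigma ** 2))
    W += lr * g[..., None] * (x - W)         # pull nodes toward the input

# Each sample is now summarised by its winning node's 2-D grid position,
# i.e. a discretized two-dimensional representation of the input space.
```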