Proceedings of the 31st International Society Conference on Computers and Their Applications (CATA 2016), pp. 251-256, April 5, 2016

Principal Component Analysis as an Integral Part of Data Mining in Health Informatics

Chaman Lal Sabharwal (1) and Bushra Anjum (2)
(1) Missouri University of Science and Technology, Rolla, MO 63128, USA, [email protected]
(2) Amazon Inc., 1194 Pacific St., San Luis Obispo, CA 93401, USA, [email protected]

Abstract

Linear and logistic regression are well-known data mining techniques; however, their ability to deal with interdependent variables is limited. Principal component analysis (PCA) is a prevalent data reduction tool that both transforms the data orthogonally and reduces its dimensionality. In this paper we explore an adaptive hybrid approach in which PCA is used in conjunction with logistic regression to yield models that have both a better fit and a reduced set of factors than those produced by logistic regression alone. We use an example dataset from HealthData.gov to demonstrate the simplicity, applicability and usability of our approach.

Keywords: principal component analysis, regression analysis, healthcare analytics, big data analytics

1. Introduction

From doctors' fees and medical tests to the price of medications and the cost of hospital stays, health-care costs around the world are skyrocketing. Much of this is attributed to wasteful spending on such things as ineffective drugs, futile procedures and redundant paperwork, as well as missed disease-prevention opportunities. This calls for efficient data reduction mechanisms and diagnostic tools that can help identify the problems and the factors which are most responsible for them. This is where Big Data comes into the picture.

It is estimated that developing and using predictive models in the health-care industry could save billions by using big-data health analytics to mine the treasure trove of information in electronic health records, insurance claims, prescription orders, clinical studies, government reports, and laboratory results. Improving the care of chronic diseases, uncovering the clinical effectiveness of treatments, and reducing readmissions are expected to be top priority use cases for big data analytics in healthcare [1]. According to the Harvard School of Public Health publication entitled The Promise of Big Data, petabytes of raw information could provide clues for everything from preventing tuberculosis to shrinking health care costs [6]. Some of the initiatives taken in this domain are discussed below.

Today, there is a significant opportunity to improve efficiencies in the healthcare industry by using an evidence-based learning model, which can in turn be powered by Big Data analytics [8]. A few examples follow. The company Asthmapolis has created a global positioning system (GPS) enabled tracker that monitors inhaler usage by patients, eventually leading to more effective treatment of asthma [9]. The Centers for Disease Control and Prevention (CDC) is using Big Data analytics to combat influenza. Every week, the CDC receives over 700,000 flu reports, including details on the sickness, what treatment was given, and whether or not the treatment was successful. The CDC has made this information available to the general public through FluView, an application that organizes and sifts through this extremely large amount of data to create a clearer picture for doctors of how the disease is spreading across the nation in near real-time [3]. Another area of interest is the surveillance of adverse drug reactions (ADRs), which have been a leading cause of death in the United States [4]. It is estimated that approximately 2 million patients in the USA are affected by ADRs, and the researchers in [11, 13] propose an analytical framework for extracting patient-reported adverse drug events from online patient forums such as DailyStrength and PatientsLikeMe.

Simplistically speaking, in all the above examples the researchers are trying to model and predict a dependent phenomenon based on a number of predictors that have been observed. The dependent parameter can be discrete or nominal or even binary/logical. There are two problems at hand. The first problem is analyzing whether some event occurred or not, given the success or failure, acceptance or rejection, presence or absence of observed stimulators. If such a dependence and correlation can be established, then there is a second, equally interesting problem of optimization: how to minimize the set of predictors while still maintaining high prediction accuracy for the dependent variable.

There are several approaches to modeling prediction variables, such as linear regression analysis, logistic regression analysis, and principal component analysis. Each has its own advantages and disadvantages. A mathematical background to these is presented in Section 2. Though regression analysis has been well known as a statistical analysis technique, the understanding of PCA has been lacking among non-academic clinicians. In this paper we explore an adaptive hybrid approach where PCA is used in conjunction with logistic regression to yield models which have both a better fit and a reduced number of variables than the models produced by standalone logistic regression or unsupervised learning. We apply our findings to a dataset obtained from HealthData.gov that lists the quality ratings of California's hospitals based on their location, medical procedure, mortality rate, number of cases, etc. [7]

The paper is organized as follows. Section 2 presents the mathematical background on linear regression, logistic regression, and PCA. Section 3 describes the hybrid algorithms for regression using PCA. In Section 4 we use the hospital rating dataset from HealthData.gov to present experimental results of our algorithm, and Section 5 concludes the paper.

2. Background

2.1. Mathematical Notation

Here we describe the mathematical notation used in this paper. A vector is a sequence of variables. All vectors are column vectors and are written in lowercase bold letters such as x. The n-tuple [x_1, ..., x_n] denotes a row vector with n elements. A superscript T denotes the transpose of a vector x, so that x^T is a row vector whereas x = [x_1, ..., x_n]^T is a column vector. This notation is overloaded in some places, where the pair [x_1, x_2] may be used as a row vector, a point in the plane, or a closed interval on the real line. Matrices are denoted with uppercase letters, e.g., A, B. The nxn identity matrix is denoted by I_n, or simply I when the dimension is implicit in the context; its elements satisfy I_ij = 0 if i ≠ j and I_ii = 1 for 1 ≤ i, j ≤ n. For vectors x, y, the covariance is denoted by cov(x, y), whereas cov(x) is used as a shortcut for cov(x, x) [2].

If we have m vector values x_1, ..., x_m of an n-dimensional vector x = [x_1, ..., x_n]^T, these m observations are represented by an mxn data matrix A. The kth row of A is the row vector x_k^T; thus the (i, j) element of A corresponds to the jth element of the ith observation x_i^T. There are several ways to represent data so that implicit/hidden information becomes transparent. For a linear representation of vector data, a vector space is equipped with a basis of linearly independent vectors. Usually in data mining, the data is represented as a matrix of row vectors, or data instances. Two of the methods for efficient representation of data are regression and principal component analysis, see Figure 1.

2.2. Linear Regression

The linear regression model for one dependent variable y and n independent variables x_1, ..., x_n is y = b_0 + Σ_{k=1..n} b_k x_k, where the error between each observed value y_i and its estimate b_0 + Σ_{k=1..n} b_k x_{ik} is to be small. For m data points, we compute the coefficients b_k by the method of least squares, which minimizes

    Σ_{i=1..m} (y_i − b_0 − Σ_{k=1..n} b_k x_{ik})²

Thus linear regression determines a hyperplane which is a least squares approximation of the data points. If the data is mean centered, then b_0 = 0. If n = 1, the hyperplane is a regression line. It is always better to mean center the data, as it simplifies the computations.

It is important to note that the data points may not be at the least distance from the regression line. We show in the next section that there is a better least-distance line, see Figure 2. For the direction vectors and approximation errors of the data points from the two lines, see Table 1. A minimal code sketch of the least squares fit is given at the end of this subsection.
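To make the least squares computation concrete, here is a minimal MATLAB sketch (ours, not from the paper) for the one-variable case, with hypothetical data; the backslash operator solves the least squares problem for the design matrix:

    % Least squares fit of a line y = b0 + b1*x (hypothetical sample data).
    x = [1; 2; 3; 4; 5];
    y = [1.2; 1.9; 3.1; 4.2; 4.8];
    X = [ones(numel(x),1) x];   % design matrix with an intercept column
    b = X \ y;                  % b(1) = b0, b(2) = b1; minimizes the sum of squared errors
    err = sum((y - X*b).^2);    % residual sum of squares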

2.3. Principal Component Analysis

In linear regression, the distance of an observed point (x_i, y_i) from the line y = a + bx is minimized along the y direction. In principal component analysis, the distance of the observed point (x_i, y_i) is minimized along the direction orthogonal to the line y = a + bx, as the following sketch illustrates.
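The difference between the two criteria can be stated in two lines of MATLAB (a sketch in our own notation, not code from the paper):

    % Distances of a point (xi, yi) from the line y = a + b*x (hypothetical values).
    a = 1; b = 2; xi = 3; yi = 5;
    d_vertical   = abs(yi - (a + b*xi));                 % minimized by linear regression
    d_orthogonal = abs(yi - (a + b*xi)) / sqrt(1 + b^2); % minimized by PCA

The factor 1/(1 + b²) reappears in the hybrid objective of Section 3.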

Principal component analysis has been a well-known data reduction tool in academia for over 100 years. PCA is beneficial for representing physical-world data/objects more clearly in terms of independent, uncorrelated, orthogonal parameters. In the next section we explore an adaptive hybrid approach where PCA is used not only for data reduction but also to yield models which have a better fit than those produced by using logistic regression or unsupervised learning alone.

PCA and Singular Value Decomposition (SVD) are used interchangeably in the literature. However, there is a clear distinction between them, as can be noted from the discussion below.

Definition 1. For a real square matrix A, if there is a real number λ and a non-zero vector x such that Ax = λx, then λ is called an eigenvalue and x is called an eigenvector.

Definition 2. For a real matrix A (square or rectangular), if there is a non-negative real number σ and non-zero vectors x and y such that A^T x = σy and Ay = σx, then σ is called a singular value and x and y represent a pair of singular vectors [10].

Note 1. λ can be negative or positive, but σ is never negative.

Note 2. σ² is an eigenvalue of the covariance matrices AA^T and A^T A.

In data mining, the m observations are represented as an mxn matrix A, where each observation is a vector with n components. The direction vectors from which there is least deviation, or along which there is most variation, are computed as follows. If A is a real square symmetric matrix, then its eigenvalues are real and its eigenvectors are orthogonal [11]. If we form U as the matrix of eigenvectors and D as the diagonal matrix of corresponding eigenvalues, then A can be written as

    AU = UD, or A = U D U^T

This is called the principal component decomposition (PCA) of A. Since the columns of U are orthogonal eigenvectors of A, U is an orthogonal matrix and its inverse is its transpose, i.e., U^T = U^{-1}. PCA orders the eigenvalues in descending order of magnitude, and the columns of U and the diagonal entries of D are arranged correspondingly. Since eigenvalues can be negative, the diagonal entries of D are ordered by the absolute values of the eigenvalues.

Note 3. For an eigenvector u of A, any non-zero multiple of u is also an eigenvector. In U, the eigenvectors are normalized to unit length. If u is a unit eigenvector, then -u is also a unit eigenvector. The sign can be chosen arbitrarily; some authors make the first non-zero element of the vector positive to make it unique. We do not follow this convention. Since the eigenvectors are ordered, we make the kth element of vector u_k positive; only if the kth element is zero is the first non-zero element made positive. This is a better representation of the eigenvectors, as it is both visually appealing and replaces the asymmetrical ordering with a right-handed system of representation, see Figure 1 and the sketch following its caption.

Figure 1: Black axes are the standard xy-system. Blue dots are a set of two data points. Red dots are the projections of the blue dots on the principal components. In Figure 1(a), the axes v1, v2 in red are the eigenvectors computed by Matlab from the simple set of two points. In Figure 1(b), the blue axes v1, v2 are the directions chosen so that each eigenvector is made unique by making its first non-zero component positive (the standard scheme in the literature). In Figure 1(c), the green v1, v2 are the eigenvectors chosen by our scheme.
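The sign convention of Note 3 can be written as a short MATLAB function (a sketch of our reading of the scheme; the function name is ours):

    % Normalize the signs of ordered unit eigenvectors (columns of U):
    % make the k-th element of the k-th eigenvector positive; if that
    % element is zero, make the first non-zero element positive instead.
    % Assumes U is square, as returned by eig on a symmetric matrix.
    function U = normalizeSigns(U)
        for k = 1:size(U, 2)
            if U(k,k) ~= 0
                s = sign(U(k,k));
            else
                s = sign(U(find(U(:,k) ~= 0, 1, 'first'), k));
            end
            U(:,k) = s * U(:,k);  % flipping a unit eigenvector keeps it a unit eigenvector
        end
    end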
Using Matlab svd(A) on a simple set of only two data points, the algorithm generates two orthogonal vectors v1, v2 as in Figure 1(a, b). Our algorithm finds the green vectors in Figure 1(c), which is a more natural representation with a right-handed orientation. The vectors in U and V are such that the data has the most variation along these directions, in descending order. The data is most spread along v1, see Figure 1(c).

Next we claim, and present an example, that PCA can give us a better approximation - a better least-distance line - than standard linear regression. We have a data set of 20 randomly created points, and we use both methods, i.e., linear regression (also termed least squares approximation, LSA) and PCA, in Matlab to compute vectors such that the variation of the observed points from the computed vector is minimum. The results are given in Figure 2, where the linear regression is shown as the red line and the PCA approximation as the blue line; a code sketch of the comparison follows below.

Figure 2: The traditional linear regression line (LSA) is shown in red and the principal component analysis (PCA) line in blue. We can see that the points are at a lesser distance from the blue line than from the red line.

The approximation error, as given in Table 1, indicates that the PCA adapted regression line is a better fit than the LSA approximation. Data reduction is attributed to the non-zero eigenvalues of the data matrix A.
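The comparison of Figure 2 can be reproduced in outline with the following MATLAB sketch (synthetic data; the paper's exact 20 points and script are not reproduced here):

    % Least squares line vs. PCA (orthogonal least distance) line on 20 points.
    x = 10*rand(20,1);
    y = 0.8*x + randn(20,1);                 % hypothetical noisy linear data
    c  = [ones(20,1) x] \ y;                 % LSA line: y = c(1) + c(2)*x
    Xc = [x - mean(x), y - mean(y)];         % mean centered data matrix
    [~, ~, V] = svd(Xc, 'econ');             % V(:,1) is the direction of most variation
    bPca = V(2,1) / V(1,1);                  % slope of the PCA line
    aPca = mean(y) - bPca*mean(x);           % the PCA line passes through the mean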

An mxn data matrix is decomposed as A = U S V^T, where U is mxm, S is mxn, and V is nxn. If there are only k non-zero singular values, where k ≤ min(m, n), the matrix A has a lossless representation using only k columns of U, k columns of V, and a kxk diagonal matrix S. If some singular value is very small compared to the others, then it can lead to further data reduction while retaining almost the same information, as sketched below.

Table 1. Comparison of least squares and PCA adapted regression lines.

                                 Direction vector          Relative regression error
    Normal regression line       [0.642388, 0.766380]      0.391169
    PCA adapted regression line  [0.514757, 0.857336]      0.173438
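A rank-k reduction of this kind can be sketched as follows (hypothetical matrix; k is chosen by inspecting the singular values):

    % Keep only the k largest singular values of an m x n data matrix A.
    A = randn(8, 5);                          % hypothetical data matrix
    [U, S, V] = svd(A, 'econ');
    k = 2;                                    % number of retained components
    Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % reduced-rank representation of A
    relErr = norm(A - Ak, 'fro') / norm(A, 'fro');  % information lost by the reduction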

2.4. Logistic Regression

Linear regression is suitable for data that exhibits a linear relation, but this is not always the case. The logistic model focuses on probabilistic estimates and is applicable to "S-shaped" data [5]; it is particularly suitable for population growth under a limiting condition. Linear and logistic (log-linear) regression have been popular mining techniques due to their simplicity, but their ability to deal with interdependent data variables is limited. However, this limitation can be removed when they are coupled with PCA.

Example. We have a training data set [12] of 20 students who studied for an exam for a given number of hours (horizontal axis) and passed or failed the test (vertical axis). We determine the trained logistic regression predictor for the chances of passing the exam. Fail and Pass are coded numerically as 0 and 1, see Figure 3.

Figure 3: Black dots are data points. The red curve is the logistic regression curve computed with standard logistic regression; the blue curve is the one created with the hybrid model.

As before, the blue curve (which corresponds to the PCA regression curve) shows a much better approximation, with 60% less approximation error. The errors for both methods are given in Table 2 below.

Table 2. Comparison of logistic regression methods.

    Logistic regression relative error    0.443061
    PCA relative error                    0.157216

2.5. How Do We Measure the Goodness of a Model?

In data mining there are standard measures, called the gold standard, for labeling and measuring prediction accuracy. The following metrics are extremely useful in comparing the results of a classification: precision (P), recall (R) and the F-metrics. In terms of true positives (TP), false positives (FP) and false negatives (FN), they are given as:

    Precision P = TP / (TP + FP)

    Recall R = TP / (TP + FN)

The measure F_1 is the harmonic average of P and R. It is given by

    F_1 = 2PR / (P + R) = 2TP / (2TP + FP + FN)

The reader may consult [12] for further details on these metrics. A short sketch of these computations follows.
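These metrics reduce to a few counting operations; a minimal MATLAB sketch with hypothetical label vectors:

    % Precision, recall and F1 from true and predicted binary labels.
    yTrue = [1 1 0 1 0 0 1 1];
    yPred = [1 0 0 1 0 1 1 1];
    TP = sum(yPred == 1 & yTrue == 1);   % true positives
    FP = sum(yPred == 1 & yTrue == 0);   % false positives
    FN = sum(yPred == 0 & yTrue == 1);   % false negatives
    P  = TP / (TP + FP);
    R  = TP / (TP + FN);
    F1 = 2*TP / (2*TP + FP + FN);        % harmonic average of P and R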

3. Hybrid Model

Regression analysis is normally designed independently of PCA: PCA is applied to the data, not to the regression algorithm, whether linear or non-linear (logistic). In conjunction with PCA, we propose several improvements to the existing practices in designing linear and logistic regression. Regression is improved in two ways: data reduction and a hybrid algorithm. As a first step, PCA is used for data reduction. In the second step, standard regression analysis may be used for prediction. This makes PCA and regression analysis two separate algorithms applied in sequence. Further, these steps may be followed by a human reviewer before the final inference is made. The following paragraph explains why the human reviewer may be necessary in some applications.

Consumers and physicians use different terminologies for the same concept, disease or medicine. In fact, consumers have difficulty understanding medical terminology, and the resulting misinterpretation can be harmful for the consumer. The Consumer Health Vocabulary (CHV) is an attempt to bridge this gap, see [11]. Work continues on extending the CHV, which is not complete. Consumer content is available through various social networks where consumers talk about drugs, their effects, and adverse reactions. Using NLP parsing techniques, the consumer data is parsed into medical expressions and stored in term frequency-inverse document frequency (TF-IDF) form. The data is bound to have redundancies, which can be eliminated using principal component analysis. The CHV is a map from consumer expressions to the Unified Medical Language System (UMLS). For example, the terms "head ache" and "cranial pain" are both synonyms of the UMLS concept "headache." If a consumer expression is not in the CHV, then it can be a candidate to extend the CHV. The English language is ambiguous, and NLP does not check semantics; for example, if a consumer says "heart attack", it will be interpreted literally, yet for a young person it may mean hardship, while for an old person it may mean heart disease. This is where the human reviewer comes in. Before an expression can become a candidate CHV term, the human reviewer validates or invalidates the consumer expression as a medical term against the context where the expression is used, recall the TF-IDF structures.

As noted in Figure 2, the error of the regression line approximation is much larger than that of the line created with PCA. For improved performance in the second step, we create a hybrid logistic regression coupled with the PCA adapted regression line. An additional advantage occurs when we make use of this knowledge in designing a better logistic regression for prediction in mortality analysis, see Figure 3. This greatly enhances the existing algorithms. Here is the hybrid algorithm in 2D; it can easily be extended to higher dimensions. The following two algorithms replace the existing traditional algorithms.

Improved Linear Regression
Input: array of data points (x, y)
Output: line y = a + bx
Method:
    Traditional: compute a and b by minimizing
        Σ_{i=1..m} (y_i − a − b x_i)²
    Let error1 be the computed minimum error value.
    New: compute a and b by minimizing
        Σ_{i=1..m} (y_i − a − b x_i)² / (1 + b²)
    This step is easily carried out using covariance matrices, as done in principal component analysis.
    Let error2 be the minimum error value.
    Compare error1 and error2.

From Table 1, error2 is much smaller than error1. Thus the adaptive PCA regression line is better.

Improved Non-linear Logistic Regression
Input: array of data points (x, y)
Output: non-linear PCA adapted logistic function
Method:
    For logistic regression, map
        y → log(y / (1 − y))
    Apply the improved regression line (the new approach above) to the mapped y values to obtain the line y = a + bx.
    Map the y values back:
        y → e^y / (1 + e^y)

This gives the hybrid logistic regression function, see Figure 3, and for the approximation error see Table 2. We have designed the improved regression line and logistic regression algorithms by adapting principal component analysis. A sketch of the combined procedure is given below.
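The two algorithms combine into the following MATLAB sketch (our reading of the method; the clamping constant eps0 is an assumption we add, since the logit of a 0/1 outcome is unbounded and the paper does not specify how this is handled):

    % Hybrid logistic regression: logit transform, PCA adapted line fit, map back.
    x = (0.5:0.5:10)';                        % hypothetical hours studied (20 points)
    y = double(x + randn(20,1) > 5);          % hypothetical pass/fail outcomes in {0,1}
    eps0 = 1e-3;                              % assumed clamping constant
    p = min(max(y, eps0), 1 - eps0);          % keep the logit finite
    z = log(p ./ (1 - p));                    % forward map y -> log(y/(1-y))
    Xc = [x - mean(x), z - mean(z)];          % PCA adapted line fit on (x, z):
    [~, ~, V] = svd(Xc, 'econ');              % minimizes sum((z-a-b*x).^2)/(1+b^2)
    b = V(2,1) / V(1,1);
    a = mean(z) - b*mean(x);
    pHat = exp(a + b*x) ./ (1 + exp(a + b*x));  % backward map: hybrid logistic curve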

4. Experimental Analysis

We used the California Hospital Ratings dataset from HealthData.gov [7] for our experiments. Our experiments focus on both data reduction using PCA and hospital rating prediction using the improved logistic regression. The data is published by California's Office of Statewide Health Planning and Development (OSHPD). It consists of information on 11,169 patients in the years 2012-2013, ranging over 55 counties in California, 17 types of diseases (labeled as causes of mortality), and the rating - acceptable or unacceptable - of the hospital. The hospitals were classified on county, general description, OSHPD ID, procedure/condition, risk adjusted mortality rate, number of deaths, number of reported cases per procedure/condition, and hospital location. There were two types of ratings for each hospital for treatment recommendation: unacceptable (worse) and acceptable (as expected, better).

For classification, this is a fair size of data for rating hospitals for treatment, about 12,000 cases. The goal is not data mining per se, but to show the feasibility of the improved algorithms over the existing algorithms, and of data reduction, for rating hospitals. The reduction of one attribute reduces the data size by 9%.

After updating the data set by ignoring incomplete records with missing information, the data set was reduced to 3,033 records with six attributes plus the rating (County, OSHPD ID, Procedure/Condition, Risk Adjusted Mortality Rate, # of Deaths, # of Cases, Hospital Rating). We took the last attribute, Hospital Rating, as the dependent variable on which classification would be performed, and applied PCA to the data with the first five attributes.

For the experiment, we created two versions of the data set: the first version raw data, and the second version mean centered with unit standard deviation. We calculated the eigenvalues next. Four of the five eigenvalues were zero (up to at least ten digits after the decimal); the non-zero eigenvalue is shown in Table 3. This indicated that only one new attribute was sufficient to rate the hospitals. But this does not specify which original attributes contributed to the reduction. The principal components on the normalized data are more realistic in this case, as the normalized attribute values are evenly distributed. For nominal attributes, the mapping from nominal to numerical values can make a difference; however, the covariance and correlation approaches are complementary.

The principal component corresponding to the non-zero eigenvalue is a linear combination of the original attributes; each coefficient in it is a fractional contribution of an original data attribute. It is clear that two of the five coefficients (# of deaths and # of cases) are more dominant than the others; raw data analysis, however, found only one dominant coefficient. In either case, the contribution of these two original attributes is more than 95%. After eliminating the other attributes, we compute the approximation error due to dimension reduction (from five to two attributes). As can be seen in Table 3, the error is less than 0.05 percent.

Table 3. PCA eigenpairs and error in data reduction.

                   Raw Data    Normalized Data
    Eigenvalue     -71.5188    -108.94
    Eigenvector     0.0601       0.0856
                   -0.1819       0.0264
                    0.1183       0.0166
                    0.7228       0.0461
                    0.6534       0.9948
    Error           0.0542       0.0347

So the PCA analysis shows that even after a 60% reduction, using only 40% of the data, the precision is almost the same, whereas the gain in computation performance is significant. For recall, the reduced-data regression misses more positives (false negatives), see Table 4; it is preferable to miss as few positives as possible. Table 4 corresponds to the traditional logistic regression and Table 5 to the hybrid logistic regression algorithm. They show that the hybrid algorithm consistently outperforms the traditional algorithms.

Table 4. Error comparison metrics, traditional algorithm.

                 Raw      PCA 40% Data
    Precision    0.952    0.951
    Recall       0.807    0.744

Table 5. Error comparison metrics, hybrid algorithm.

                 Raw      PCA 40% Data
    Precision    0.965    0.957
    Recall       0.897    0.759

5. Conclusion

We presented a hybrid algorithm that adaptively uses PCA to improve the linear and logistic regression algorithms, and we have shown the effectiveness of the enhancement with experiments. All data mining applications that dwell on these two algorithms will benefit extensively from our enhanced algorithms, but they are especially useful in healthcare informatics, where regression is the most popular statistical analysis tool. We applied our algorithms to the quality ratings dataset for California hospitals to demonstrate their simplicity and applicability.

6. References

[1] Allen Bernard. Healthcare Industry Sees Big Data As More Than a Bandage. CIO, August 5, 2013.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Julie Bort. How the CDC Is Using Big Data To Save You From The Flu. Business Insider, December 13, 2012.
[4] Brant Chee, Richard Berlin and Bruce Schatz. Predicting Adverse Drug Events from Personal Health Messages. In AMIA Annual Symposium Proceedings, pages 217-226, 2011.
[5] Anupam Goel, Richard G. Pinckney and Benjamin Littenberg. APACHE II Predicts Long-term Survival in COPD Patients Admitted to a General Medical Ward. Journal of General Internal Medicine, pages 824-830, October 2003.
[6] The Promise of Big Data. Harvard School of Public Health Magazine, pages 15-43, 2012.
[7] HealthData.gov dataset: California Hospital Ratings. http://www.healthdata.gov/dataset/california-hospital-inpatient-mortality-rates-and-quality-ratings-2012-2013
[8] The Global Use of Medicines: Outlook Through 2016. IMS Institute for Healthcare Informatics, pages 1-36, 2012.
[9] Shel Israel. Contextual Health vs The Elephant in the Hospital. Forbes Tech, May 23, 2013.
[10] Jim Hefferon. Linear Algebra. http://joshua.smcvt.edu/linearalgebra, 2014.
[11] Ling Jiang, Chris Yang and Jiexun Li. Discovering Consumer Health Expressions from Consumer-Contributed Content. In Social Computing, Behavioral-Cultural Modeling and Prediction, Lecture Notes in Computer Science Vol. 7812, pages 164-174, 2013.

[12] Wikipedia. F1 score. https://en.wikipedia.org/wiki/F1_score, August 2015.
[13] Christopher Yang, Ling Jiang, Haodong Yang and Mi Zhang. Social Media Mining for Drug Safety Signal Detection. In Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, pages 33-40, Maui, Hawaii, USA, October 29, 2012.