30-DAY ALL-CAUSE PREDICTION MODEL FOR READMISSIONS FOR

HEART FAILURE PATIENTS

A COMPARATIVE STUDY OF MACHINE LEARNING APPROACHES

A Dissertation Presented

By

Amal Abdullah Bukhari

to

The Department of Engineering

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

In the field of

Interdisciplinary Engineering

Northeastern University Boston, Massachusetts

November, 2019


Northeastern University Graduate School of Engineering Dissertation Signature Page

Dissertation Title: 30-Day All-Cause Prediction Model for Readmissions for Heart Failure Patients: A Comparative Study of Machine Learning Approaches

Author: Amal Bukhari. NUID: 000034724.

Department: The Department of Engineering – Interdisciplinary Engineering

Approved for Dissertation Requirement for the Doctor of Philosophy Degree

Dissertation Advisor: Professor Sagar Kamarthi ______ Print Name, Title   Signature   Date

Dissertation Committee Member: Professor Kal Bugrara ______ Print Name, Title   Signature   Date

Dissertation Committee Member: Dr. Kamal Jethwani ______ Print Name, Title   Signature   Date

Dissertation Committee Member: Dr. Stephen Agboola ______ Print Name, Title   Signature   Date

Department Chair

______ Print Name, Title   Signature   Date

Associate Dean of the Graduate School

______Senior Associate Dean for Academic Affairs Signature Date


ACKNOWLEDGMENTS

I would like to express my special appreciation and thanks to my advisor, Professor Sagar Kamarthi, for the patient guidance, encouragement, and advice he has provided throughout my time as his student. I would also like to thank the members of my dissertation committee, Professor Kal Bugrara, Dr. Stephen Agboola, and Dr. Kamal Jethwani, for their contributions and suggestions. I gratefully acknowledge the scholarship I received from the Saudi Arabian Cultural Mission and the University of Jeddah.

Lastly, I owe my deepest gratitude to my lovely family for their support and encouragement during my Ph.D. journey and for always believing in me and encouraging me to follow my dreams.


I dedicate this dissertation to

my beloved family, my father, my mother, my sister, and my brothers, for their constant support and unconditional love.

I love you all dearly.


TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER 1: INTRODUCTION AND OVERVIEW
  1.1 Heart Failure Overview
  1.2 Research Goal and Objectives
  1.3 Organization of the Dissertation

CHAPTER 2: LITERATURE REVIEW
  2.1 Heart Failure Hospitalization
  2.2 Risk Prediction Models of Readmission for Heart Failure
    2.2.1 Risk Factors
    2.2.2 Model Development and Performance

CHAPTER 3: METHODOLOGY
  3.1 Aims and Objectives
  3.2 Data Mining Software Selection
  3.3 Data Description
  3.4 Inclusion / Exclusion Criteria
  3.5 Data Preprocessing
    3.5.1 Data Wrangling
    3.5.2 Data Cleaning
    3.5.3 Data Transforming
  3.6 Dataset
  3.7 Label Definition / Outcome Definition
    Definition of Index Admission
  3.8 Modeling
    3.8.1 Feature Selection
    3.8.2 Class Imbalance
    3.8.3 Experiments and Selected Algorithms
  3.9 Validation Set Approach (Data Split)
  3.10 Evaluation / Performance Metrics
    3.10.1 Accuracy (Acc)
    3.10.2 Precision (p)
    3.10.3 Sensitivity or Recall (r)
    3.10.4 Specificity
    3.10.5 F-Measure (FM)
    3.10.6 Area under the ROC Curve (AUC)
  3.11 Summary

CHAPTER 4: RESULT AND ANALYSIS
  4.1 Logistic Regression
  4.2 Decision Tree
  4.3 Random Forest
  4.4 Naïve Bayes
  4.5 Support Vector Machine
  4.6 XGBoost
  4.7 Summary

CHAPTER 5: CONCLUSION AND FUTURE WORK

REFERENCES


LIST OF TABLES

Table 1.1 Common medications
Table 1.2 Common procedures
Table 1.3 Stages of heart failure
Table 1.4 NYHA classification for heart failure
Table 2.1 Summary of Model Characteristics Predicting 30-Day All-Cause Readmission for Patients with Heart Failure
Table 3.1 Tables description
Table 3.2 Pressure stages
Table 3.3 Distribution of readmission
Table 3.4 Frequency distribution of variables by 30-day readmission
Table 3.5 Descriptive measures for the continuous variables
Table 3.6 The features in dataset: Info Gain and Backward Selection
Table 3.7 Confusion matrix of a two-class classifier
Table 4.1 Performance Results of Logistic Regression for 30-Day Readmission
Table 4.2 Performance Results of Decision Tree for 30-Day Readmission
Table 4.3 Performance Results of Random Forest for 30-Day Readmission
Table 4.4 Performance Results of Random Forest for 30-Day Readmission
Table 4.5 Performance Results of Naïve Bayes for 30-Day Readmission
Table 4.6 Performance Results of SVM for 30-Day Readmission
Table 4.7 Performance Results of XGBoost for 30-Day Readmission


LIST OF FIGURES

Figure 1.1 ACCF/AHA guideline for the management of heart failure
Figure 2.1 Heart Disease Hospitalization Rates (Benjamin et al., 2017)
Figure 3.1 Data preprocessing tasks
Figure 3.2 Machine learning process
Figure 3.3 Machine learning validation process
Figure 3.4 Dataset overview
Figure 3.5 Exclusion / Inclusion criteria
Figure 3.6 Accepted data from lab table
Figure 3.7 Handling missing data
Figure 3.8 Histogram of different attributes
Figure 3.9 The Feature Filter Approach
Figure 3.10 The Wrapper Approach
Figure 3.11 The Embedded Approach
Figure 3.12 Undersampling and oversampling techniques (Karagod, n.d.)


LIST OF ABBREVIATIONS

HF Heart Failure

ML Machine Learning

DT Decision Tree

LR Logistic Regression

NV Naïve Bayes

RF Random Forest

SVM Support Vector Machine

XGBoost Extreme Gradient Boosting

CV Cross Validation

InfoGain Information Gain


ABSTRACT

The value of machine learning in healthcare comes from its ability to process large amounts of healthcare data to extract clinical insights that help physicians plan and provide care with better outcomes and lower costs. Recent studies exploring machine learning techniques suggest that predictive models have the potential to identify high-risk patients; however, the advantage of machine learning methods over classical methods is neither evident nor universal. Moreover, only a few studies address the challenges posed by the class-imbalanced data commonly encountered in healthcare applications.

In this work, we compared different machine learning algorithms for predicting all-cause readmissions within 30 days of discharge from a heart failure hospitalization. In this research we addressed the feature selection and class imbalance issues in healthcare data.

We developed various machine learning models and studied their performance. The models explored include logistic regression, decision tree, random forest, Naïve Bayes, support vector machine, and XGBoost. We compared their performance using metrics such as the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity.


We identified 5894 patients admitted with heart failure complications between 2011 and 2015. The dataset included 8684 records and 61 variables. Among the study patients, 16.44% were readmitted within 30 days of hospital discharge. This research explored the effectiveness of different class balancing and feature selection approaches.

The models produced AUCs in the range of 0.62 to 0.79 and sensitivities in the range of 0.25 to 0.73. On the current dataset, machine learning techniques did not outperform the standard regression model in predicting 30-day readmission for heart failure patients. However, the results achieved by all the classifiers agree with those reported in the literature.

CHAPTER 1

INTRODUCTION AND OVERVIEW

This chapter provides a general introduction to the dissertation. Section 1 presents an overview of heart failure. Section 2 highlights the research goals and objectives. Section 3 describes the organization of the dissertation.

1.1 Heart failure Overview

This section gives an overview of heart failure, starting with its definition, treatment, and stages. Next, we review heart failure prevalence in the United States.

Definition

Heart failure is a progressive condition in which the heart muscle is injured by events such as a heart attack or high blood pressure and from then onwards gradually loses its ability to pump enough blood to supply the body's needs. The heart can be affected in two ways: it either becomes weak and unable to pump blood, which is called systolic heart failure, or it becomes stiff and unable to fill with blood adequately, which is called diastolic heart failure. Ultimately, both conditions lead to retention of extra fluid, or congestion. When patients develop these symptoms, the condition is called congestive heart failure (Yancy et al., 2013).

Heart Failure Diagnosis

Heart failure is diagnosed by a range of symptoms and signs of fluid overload due to either a weak heart (heart failure with reduced ejection fraction) or a healthy heart with inadequate relaxation (heart failure with preserved ejection fraction). Symptoms include shortness of breath, dry cough, poor appetite, nausea, and fatigue. Signs include leg swelling and increased abdominal girth. Medical providers often order an echocardiogram to determine the strength of the heart. An echocardiogram is an ultrasound of the heart that measures the ejection fraction (EF), wall thickness, and the flow of blood through the valves of the heart. People with a healthy heart have an EF of about 60%, while people with heart failure have either a reduced ejection fraction with EF < 40% (HFrEF) or a preserved ejection fraction with EF > 50% (HFpEF). Prescription of medical therapy, including pills and devices, depends on the stage of heart failure and the functional state (Yancy et al., 2013).

Heart Failure Treatment

Early detection and treatment of heart failure allow patients to continue living an active lifestyle for a longer time while reducing the risk of hospitalization. The initial treatment regimen will vary depending on the type of heart failure and the severity of the condition. Treatment regimens generally include a combination of medications, lifestyle changes (quitting smoking, diet, exercise), and surgical procedures (“Heart Failure | National Heart, Lung, and Blood Institute (NHLBI),” n.d.).


Common Medications

The medications listed in Table 1.1 work together to help the heart regain strength over time, while also controlling some other symptoms of heart failure (“Heart Failure | National Heart, Lung, and Blood Institute (NHLBI),” n.d.).

Table 1.1 Common medications

Medication Class | Medications | Why It Is Useful
ACE-Inhibitors (ACE-I) or Angiotensin Receptor Blockers (ARB) | ACE-I: Captopril, Enalapril, Lisinopril, Ramipril; ARB: Losartan, Valsartan, Olmesartan, Candesartan | Lowers blood pressure and reduces the workload of the heart so that it can regain strength
ARB + Neprilysin Inhibitor | Sacubitril-Valsartan | Lowers blood pressure and reduces the workload of the heart so that it can regain strength; may be used instead of an ACE or ARB
Beta-Blockers | Carvedilol, Metoprolol Succinate | Slows the heart rate and allows the heart to pump blood more efficiently; also helps the heart regain strength
Mineralocorticoid Antagonists | Eplerenone, Spironolactone | Removes excessive fluid and prevents loss of potassium
Nitrates | Isosorbide + Hydralazine | Lowers blood pressure and reduces the workload of the heart so that it can regain strength; may be particularly useful in African American patients
Cardiac Glycosides | Digoxin | Helps the heart beat stronger and pump blood more efficiently
Diuretics | Bumetanide, Furosemide, Torsemide | Remove extra fluid that accumulates in the lungs and legs, causing discomfort


Common Procedures

As heart failure gets worse, lifestyle changes and medicines might not control the symptoms, and patients may require a medical procedure or surgery (“Heart Failure | National Heart, Lung, and Blood Institute (NHLBI),” n.d.). Table 1.2 summarizes common procedures.

Table 1.2 Common procedures

Procedure | Why It Is Useful
Defibrillator (ICD) Placement | Heart failure can increase the risk of developing harmful heart rhythms, which may lead to cardiac arrest. An ICD will provide an electrical shock to prevent this from occurring.
Cardiac Resynchronization Therapy (CRT) | If the left and right sides of the heart are not beating simultaneously, a pacemaker device may be implanted to synchronize the two sides; this can lead to improvements in heart failure over time and reduce symptoms.
Left Ventricular Assist Device (LVAD) | For those with end-stage heart failure, mechanical support with an LVAD may be necessary to prevent the heart from completely failing. The device helps the left side of the heart continue to pump blood throughout the circulation.
Heart Transplant | For eligible patients with end-stage heart failure who have failed medical therapy (medications, LVAD, etc.), a heart transplant may be an option. A healthy heart from a deceased donor is transplanted into the recipient. This requires the recipient to be on lifelong immune-suppressing medications.


Stages of Heart Failure

In order to determine the best course of therapy, physicians often evaluate the stage of heart failure (HF) as well as the patient's functional status. The American College of Cardiology/American Heart Association classification of heart failure has four stages (Yancy et al., 2013). Table 1.3 lists the stages of heart failure.

Table 1.3 Stages of heart failure

Stage | Definition
A | High risk for HF but without structural heart disease or symptoms of HF
B | Structural heart disease but without signs or symptoms of HF
C | Structural heart disease with prior or current symptoms of HF
D | Refractory HF requiring specialized interventions

The patient's functional status is also assessed according to the New York Heart Association (NYHA) functional classification system. This system relates symptoms to everyday activities and the patient's quality of life (Yancy et al., 2013).

Table 1.4 NYHA classification for heart failure

NYHA Class | Patient Symptoms
Class I (Mild) | No limitation of physical activity. Ordinary physical activity does not cause undue fatigue, palpitation, or dyspnea (shortness of breath).
Class II (Mild) | Slight limitation of physical activity. Comfortable at rest, but ordinary physical activity results in fatigue, palpitation, or dyspnea.
Class III (Moderate) | Marked limitation of physical activity. Comfortable at rest, but less than ordinary activity causes fatigue, palpitation, or dyspnea.
Class IV (Severe) | Unable to carry out any physical activity without discomfort. Symptoms of cardiac insufficiency at rest. If any physical activity is undertaken, discomfort is increased.

Generally, physicians use clinical pathways to decide treatment options. Figure 1.1 summarizes the usual clinical pathways for heart failure based on the ACCF/AHA guideline for the management of heart failure (Yancy et al., 2013).

Figure 1.1 ACCF/AHA guideline for the management of heart failure: stages in the development of HF and recommended therapy by stage (adapted from Yancy et al., 2013)

Heart Failure Prevalence

Heart failure is quite prevalent in the US population. Although there has been progress in the treatment of heart disease, heart failure is a growing problem in the US. Current estimates are that nearly 6.5 million Americans over the age of 20 have heart failure. In the US alone, there were 960,000 new heart failure cases in 2017, and this number is expected to continue to increase over the years with the aging population. It has been estimated that by 2030, the prevalence of heart failure in the US will exceed 8 million people. Not only is heart failure a major problem affecting many people, it is also a major killer. Heart failure directly accounts for about 8.5% of all heart disease deaths in the United States. Moreover, by some estimates heart failure contributes to about 36% of all cardiovascular disease deaths. One study notes that heart failure is mentioned in one in every eight death certificates (Yancy et al., 2013).

Heart failure is the most expensive diagnosis for hospitalizations in the US (Roger VL et al., 2004). Annual hospitalization costs account for approximately 70% of the total costs of heart failure treatment and management (Blecker et al., 2014). In an evaluation of US costs published in 2014, the direct and indirect costs of heart failure were calculated from publicly available resources to be about US$60.2 billion and US$115.4 billion, respectively (Voigt J et al., 2014).

Moreover, heart failure is the most frequent diagnosis for 30-day readmission (Roger VL et al., 2004). A projection for heart failure from 2010 to 2030 shows an increase of 215% in direct costs, 80% in indirect costs, and 25% in prevalence (McIlvennan, 2015). Thus, identifying the patients at highest risk for HF readmission could provide initial information for developing targeted interventions to prevent or delay readmission while stabilizing costs for health systems.

1.2 Research goal and objectives

30-day hospital readmissions clearly pose a significant problem in the healthcare system in terms of cost and quality. Several readmission models have been developed to predict 30-day readmission in patients with heart failure. Although most of these models are built with statistical techniques, in recent years machine learning approaches have emerged as promising techniques that can improve the predictive ability of readmission risk prediction models. Recent studies show that machine learning (ML) methods outperform classical methods. However, further comparative studies between ML and statistical techniques are needed to assess the real impact of these techniques in the domain of readmission risk prediction (Artetxe et al., 2018). Moreover, medical data are often limited by class imbalance, a problem in machine learning where the majority of the data belong to the negative outcome class, causing poor performance of prediction models. Lastly, most ML studies have used the area under the receiver operating characteristic curve (AUC) as the only performance metric to assess model performance.

Our primary goal is to study the suitability of ML models for healthcare data. The focus of our study is methodological, from a computational intelligence point of view: specifically, we compare models in terms of internal validation and different performance measures.


The purpose of this study is to develop ML models to predict all-cause readmissions 30 days after discharge from an index HF hospitalization and to compare the performance of different ML models, with attention to feature selection and sampling methods. The primary questions for the proposed study are:

1. Does the accuracy of a model improve when data preprocessing is applied?

2. Can ML algorithms improve the accuracy of predicting the risk for readmission within 30 days for HF patients?


1.3 Organization of the Dissertation

The dissertation is divided into five chapters, which are structured as follows:

• Chapter 1 provides an introduction and overview of heart failure condition, treatment, and management.

• Chapter 2 reviews the literature of studies in the context of heart failure patient readmission prediction.

• Chapter 3 describes the dataset and presents the research methodology, highlighting the steps taken for model development and machine learning algorithms used in the experimental design.

• Chapter 4 presents an in-depth discussion of the results generated from the research methodology.

• Chapter 5 provides the conclusions along with some recommendations for future work.

The following two appendices are included in the dissertation:

• Appendix A: Supplementary appendix

• Appendix B: R source code


CHAPTER 2

LITERATURE REVIEW

In this chapter we first provide a review of the literature performed to explore in more detail the prediction models for 30-day readmission of heart failure patients, and we analyze the most significant readmission prediction studies. We then present the most popular and powerful machine learning techniques used to model and predict heart failure re-hospitalization.

2.1 Heart Failure Hospitalization

The prevalence of heart failure is expected to increase from 6.5 million Americans to more than 8 million by 2030. This increase is the result of many factors, including a growing elderly population, an increase in the prevalence of risk factors like hypertension, improved survival after myocardial infarction, and improved survival with heart failure. Reflecting this growing burden, 121.5 million American adults had some form of cardiovascular disease between 2013 and 2016. This leads to negative outcomes, the most costly of which is the high rate of HF hospitalizations. Figure 2.1 provides a representation of heart failure hospitalization in the US. In fact, heart failure is one of the leading primary diagnoses, with an estimated 1 million hospitalizations annually (Yancy et al., 2013).

A study using nationally representative data in the United States found that the overall CHF hospitalization rate did not change significantly over a one-decade period (Hall, 2012).

Hospital admissions and readmissions are a significant burden on our healthcare system. It is crucial to assess and monitor readmission rates locally and nationally and to determine the patient populations at highest risk (Albert et al., 2015).

Figure 2.1 Heart Disease Hospitalization Rates (Benjamin et al., 2017)

Readmissions cost the American public more than $15 billion per year; around 20% of Medicare beneficiaries are readmitted within 30 days of discharge (Bradley et al., 2013).

Although the risk of readmission drops over time, patients with an index HF hospitalization have a significantly elevated risk of readmission for at least one year. Approximately 25% of patients admitted due to HF are readmitted within 30 days, and 34% are readmitted within 90 days of discharge (Albert et al., 2015). Readmission is a key measure of the quality of patient care in U.S. hospitals. National initiatives such as the Centers for Medicare & Medicaid Services Hospital Readmissions Reduction Program (HRRP) and the Partnership for Patients (PfP) are focused on decreasing preventable readmissions. The program applies to discharges from October 1, 2012 forward and includes readmission for any reason within 30 days of discharge for three groups of patients: HF, pneumonia, and myocardial infarction. Therefore, 30-day readmission has emerged as a benchmark for reimbursement and an indicator of hospital quality (Vaduganathan M, Bonow RO, & Gheorghiade M, 2013).

A variety of reasons can lead to readmissions, including early discharge of patients, improper discharge planning, and poor care transitions. A few studies noted that ineffective communication and medication delays or discrepancies are among the reasons for 30-day readmission (Stevens, 2015).

Other studies have shown that targeted interventions and strategies that provide increased support at discharge, improved communication, and early and close outpatient follow-up are associated with lower readmission risk (Ziaeian & Fonarow, 2016). From this perspective, predicting the risk of readmission within a given time frame (such as 30 days) will support the development of interventions to prevent readmissions (Annema, Luttik, & Jaarsma, 2009), thereby reducing the readmission rate and the cost of these readmissions and improving the quality of care across institutions.


2.2 Risk Prediction Models of Readmission for Heart Failure

As mentioned in the previous section, readmission not only degrades the quality of health care but also increases medical expenses. Accordingly, it is essential to identify and predict the causes of readmission in order to prevent it. Predictive models are advanced mathematical techniques that can be used to identify patients based on their health status. Moreover, predictive models are one solution that allows hospitals to plan interventions, improve health care, and manage costs. They are also useful for understanding the risks attributable to the measured factors for critical outcomes (Califf & Pencina, 2013).

Readmission prediction models are not new; there are many studies in the literature addressing this problem. Yet there is much room for improving the usability and performance of risk prediction tools (Rahimi et al., 2014). Several systematic reviews combine the literature on prediction models for the estimation of readmission risk and attempt to analyze models for readmission risk prediction for heart failure. In 2008, Ross et al. conducted a systematic review of statistical models to predict a heart failure patient's risk of readmission from 1950 to 2007 and found substantial contradictions in the patient characteristics that were predictive (Ross JS, Mulvey GK, Stauffer B, & et al, 2008).

Another systematic review, in 2011 by Kansagara, focused on risk prediction models for hospital readmission after generic and HF index admissions for hospital comparison and clinical purposes. Twenty-six models were reviewed; most models relied on retrospective administrative data, and few relied on real-time administrative data. The authors concluded that readmission risk prediction is a complex problem and that most models had poor predictive ability, with c-statistics ranging from 0.55 to 0.65 for models using administrative data.

In 2012, Betihavas et al. updated the review of Ross and colleagues and identified only one additional model (Amarasingham et al., 2010), which predicts death or readmission within 30 days. This review highlighted the need for additional models that extend the scope to include non-clinical factors such as social and socioeconomic status as predictors of hospitalization (Betihavas et al., 2012).

In 2013, Hersh et al. updated the review of Kansagara with a newer timeline from 2011 to 2013 and listed both general and heart failure risk models, focusing on the patient, health system, and environment, and did not report any new model. The review concluded that conceptualizing HF readmission as a sociobiological process rather than a discrete physiological occurrence will help to better predict, characterize, and ultimately mitigate risk (Hersh, Masoudi, & Allen, 2013).

In 2016, O'Connor et al. conducted a review of the literature on the effects of patient-level factors on readmission within a maximum timeframe of 60 days for heart failure patients, which is discussed further in the risk factors section below (O'Connor et al., 2016).

The latest review, in 2018 by Mahajan et al., stated that large volumes of diverse electronic data and statistical methods have improved the predictive power of the models over the past two decades; across 25 multivariate predictive models, overall predictive accuracy in terms of C-statistics ranged from 0.59 to 0.84, but more work is still needed for calibration, external validation, and deployment of such models in clinical use.

There are several essential aspects to a readmission prediction model, including the outcome, the data the model uses, the risk factors included, and the development and performance of the model. The outcomes of most studies of HF prediction models have been readmission, mortality, or both. In our study, we focus on the readmission outcome, more specifically 30-day readmission after an index hospitalization for heart failure. The following subsections review the risk factors and the development of the models.

Prediction of 30-day all-cause readmission for HF patients has been addressed by many researchers. In the latest systematic review, by Mahajan et al. in 2018, on predictive models for identifying the risk of readmission after an index hospitalization for heart failure, 12 studies out of 25 indicated all-cause readmission as a single outcome and 30 days as the time frame for standard readmission. All of them were built in the USA (Mahajan et al., 2018).

2.2.1 Risk Factors

Once the outcome of choice has been selected it will be predicted based on patient risk factor data. Most risk prediction models use a multivariable approach to determine important predictors to provide outcome probabilities for different combinations of predictors.

A review of the literature by O'Connor et al. in 2016 on the effects of patient-level factors on readmission within a maximum timeframe of 60 days divided the factors into individual, contextual, and health behavior factors (Betihavas et al., 2012). According to (Betihavas et al., 2012), contextual factors include demographic and social characteristics at the community level, including health services; health policies, financing, and organizational characteristics; along with need characteristics such as community health indicators of disease prevalence and mortality rates. Individual factors include predisposing (e.g., gender, age, health beliefs), enabling (e.g., income level, means of transportation, regular source of care), and need characteristics (perceived and evaluated). There are three types of health behaviors: personal health practices (e.g., diet, use of tobacco), medical care processes, and use of health services. The review concludes that studies varied in patient age at readmission; that demographics are not consistent in their effects; and that predisposing factors other than demographics, as well as enabling factors, are infrequently reported and appear to be understudied, as do health behaviors. Other need characteristics include delirium, functional limitations, and B-type natriuretic peptide (BNP). The review also noted that the only common predictors of readmission in the models identified were a history of diabetes mellitus and a history of prior hospitalization (Betihavas et al., 2012). Additionally, models of patient-level factors (comorbidities, demographic and clinical) are much better able to predict mortality than readmission risk. Other social, environmental, and medical factors (access to care, social support, substance abuse, and functional status) contribute to readmission risk in some models, but the utility of such factors has not been broadly studied (Kansagara et al., 2011).

Other studies show that associated diagnoses, including atrial fibrillation, ischemic heart disease, and hypertension, lead to a higher risk of HF readmission, whereas non-cardiac illnesses including chronic kidney disease, diabetes mellitus, anemia, and pulmonary disease raise the risk of both HF and non-HF readmission (Albert et al., 2015) (Desai & Stevenson, 2012).

Many researchers have addressed the need to add psychosocial factors to HF predictive modeling (Hersh et al., 2013), whereas another study concluded that socioeconomic, health status, and psychosocial variables are not dominant factors in predicting the risk of readmission for heart failure (Krumholz et al., 2016), which stresses the need for evaluation on other datasets before a final conclusion can be drawn. A recent systematic review by Mahajan et al. revealed that systolic blood pressure, urea nitrogen, and hemoglobin were the significant factors in the clinical domain, while prior heart failure status, discharge disposition, and emergency department visits appear to be significant administrative risk factors (Mahajan et al., 2018). Notably, patients discharged against medical advice were at clearly elevated risk of 30-day readmission (Betihavas et al., 2012).

It is clear from the review that the patient predictors in predictive models for heart failure readmission mostly fall into three domains: clinical, administrative, and psychosocial. Most studies predicting 30-day all-cause readmission for patients with heart failure used administrative predictors, mainly consisting of demographics, comorbidities, and prior procedures, along with some clinical predictors; only a few studies have used psychosocial predictors.

Table 2.1 below summarizes the risk factors included in the studies that indicated all-cause readmission as a single outcome and 30 days as the time frame for standard readmission.

Table 2.1 Summary of Model Characteristics Predicting 30-Day All-Cause Readmission For Patients with Heart Failure

Domain: Clinical Predictors
Respiratory rate; Heart rate ⩽ 80 (per 10 beats/min); Admission SBP; Creatinine; Blood urea nitrogen; GFR; Hemoglobin; Sodium; BNP/NT-proBNP; Troponin; ACEI/ARB; Number of medications

Domain: Administrative Predictors
Age; Sex; Black race vs. White; Prior heart failure; Ischemic heart disease; MI; Valvular heart disease; Peripheral vascular disease; Arrhythmias; Diabetes mellitus; COPD/Asthma; Renal disease; Liver disease; Metastatic cancer/acute leukemia; Cerebrovascular; Number of comorbidities; Prior cardiac surgery; Prior admission; Length of stay; Discharge to skilled nursing facility; Discharge disposition; ED visits in prior period; Number of discharges

Domain: Psychosocial Predictors
Single marital status; Use of health system pharmacy; History of missed clinic visits; Drugs / alcohol; Depression, anxiety, or major psychiatric disorder; Disability


2.2.2 Model Development and Performance

The development of a predictive model for risk prediction is extremely challenging. A variety of literature exists on statistical approaches for assessing the risk of readmission. Logistic regression and Cox regression (proportional hazards regression) are the dominant modeling methods in the most recent studies. However, in recent years machine learning (ML) methods have been gaining importance over the classical techniques; machine learning algorithms such as random forests, neural networks, and SVMs have been increasingly used (from none to 38% of yearly publications in the last five years) (Artetxe et al., 2018). In the latest systematic review analyzing models for risk prediction for heart failure patients, 20 out of 25 studies followed methodologies based on statistical approaches, where logistic regression and Cox proportional hazards models are the most widely used techniques; only five studies used machine learning algorithms such as decision trees, random forests, and support vector machines (S. M. Mahajan et al., 2018).

Machine learning and data mining have emerged as approaches with great potential to improve the predictive ability of readmission risk prediction models. Classification algorithms are one technique widely used in the field of predictive modeling. However, machine learning techniques are not restricted to building the classifier; they also involve a wider set of techniques such as feature selection, variable discretization and normalization, missing value imputation, and many others (Artetxe, Beristain, et al., 2018).

Approaches reported in different studies cannot be directly compared, since each study has its own particular characteristics in terms of population, definition of the problem, computational methods, and evaluation metrics. One study, which used approximately 205 variables, found that the use of ML algorithms did not improve the prediction of 30-day heart failure readmission compared with more traditional prediction models (Frizzell et al., 2017).

In contrast, a recent systematic review stated that recent studies introducing machine learning techniques report promising results and anticipate advantages over classical methods (Artetxe, Beristain, et al., 2018). However, the authors stated that further comparative studies are needed to assess the real impact of these techniques in the domain of readmission risk prediction. Prediction of readmission using machine learning has been tackled by several studies; according to the latest systematic review on re-hospitalization, around 15 studies used ML between 2013 and 2018.

Zolfaghar et al. (2013) studied a big-data-driven solution to predict the risk of 30-day readmission for HF patients, using the Random Forest algorithm on five different data-size scenarios; the best scenario accuracy was 87.12%.

Vedomske et al. (2013) used the Random Forest algorithm with administrative data (procedure data, diagnosis data, or both) from 6,904 visits to predict 30-day readmission for HF patients. The procedure data were applied twice, with and without weighting, to help with the imbalance issue. The models without prior weighting performed best. Model discrimination was assessed by splitting the data into training and testing sets.

Meadem et al. (2013) studied 30-day all-cause readmission prediction in patients with HF, with 8,600 patients and 49 attributes. The focus of the study was feature extraction from the data (attribute selection, missing value imputation, data balancing), comparing the performance of three algorithms: LR, SVM, and Naïve Bayes. The best accuracy was 64%, achieved using SVM.

Walsh et al. (2014) studied the prediction of 30-day hospital readmissions for 25,691 unique patients using regularized regression (LASSO: least absolute shrinkage and selection operator) and SVM. The study focused on the effect of reasons for readmission, available data and data types, and cohort selection. They found that targeting the reason for readmission impacted discriminatory performance and that data source contributions varied by reason for readmission, but in general, laboratory data and visit history data contributed the most to prediction; cohort selection had a large effect on model performance.

Shah et al. (2014) focused on the distinction of heart failure with preserved ejection fraction (HFpEF) and its association with adverse outcomes. They used 527 HFpEF patients grouped into 3 distinct pheno-groups in terms of clinical characteristics, cardiac structure and function, hemodynamics, and outcomes. The results show that the created pheno-groups provided better discrimination compared with clinical parameters and B-type natriuretic peptide. The AUROC of SVM was 0.76 for predicting the combined outcome, 0.72 for cardiovascular hospitalization, and 0.70 for HF hospitalization.

Basu Roy et al. (2015) followed dynamic hierarchical classification (DHC) for predicting patients' risk of readmission for HF. The prediction problem was divided into several layers. At each layer of the dynamic hierarchical framework, LR, RF, AdaBoost, NB, and SVM classifiers were tested, and the best classifier at each stage was determined through a 10-fold cross-validation procedure on the training set. Random Forest achieved the highest AUC in all three layers.

Zheng et al. (2015) studied several machine learning algorithms to predict 30-day readmission of patients with HF on a dataset of 1,641 patients, using NN, SVM with different kernels, and RF. The best accuracy was 0.78, with a sensitivity of 0.97, using a particle swarm optimization SVM.

Futoma et al. (2015) studied the prediction of 30-day hospital readmission by comparing several predictive models (logistic regression, logistic regression with multi-step variable selection, penalized logistic regression, random forest, support vector machine, and deep learning) on five diseases. They found that random forest, penalized logistic regression, and deep learning have significantly better predictive performance than other methods previously applied to this problem, resulting in an AUC for the HF cohort of 0.67 with NN and 0.65 with PLR. However, the study only reported AUC, which is not an adequate performance metric for the imbalanced data they had.

Kang et al. (2016) predicted rehospitalization during a 60-day home healthcare episode among a cohort of tele-homecare patients, using bivariate analysis for variable selection and a decision tree for prediction, obtaining an accuracy of 0.59.

Koulaouzidis et al. (2016) predicted heart failure readmissions within 8 days based on daily physiological data from home telemonitoring from a single center as part of HF care, using a Naïve Bayes classifier. The best predictive results were obtained by combining weight and diastolic blood pressure, with an AUROC of 0.82.

Turgeman et al. (2016) developed a 30-day hospital readmission predictive model on 20,321 patients, of whom 4,840 were patients with CHF, using an ensemble model combining boosted C5.0 and SVM and achieving an accuracy of 0.84. However, their model had a low sensitivity of 25.8%.

Mortazavi et al. (2016) studied the prediction of 30- and 180-day, all-cause and HF readmission using telemonitoring data from 977 patients with 236 attributes, comparing different techniques: logistic regression (LR), Poisson regression (PR), random forest (RF), boosting, and RF combined hierarchically with a support vector machine (SVM). The accuracy for 30-day readmission was 0.54 using LR and 0.615 using boosting, and for 180 days it was 0.669 with RF and 0.678 with boosting.

S. Mahajan et al. (2016) used a clinical dataset of 1,037 HF patients with 48 clinical predictors to build a model for predicting 30-day readmission using LR and RF. They obtained better predictive results with LR (C-statistic 0.65) than with RF (C-statistic 0.61).

Jamei et al. (2017), using data from more than 300,000 hospital stays and 1,667 features, predicted the all-cause risk of 30-day hospital readmission using an artificial neural network and compared it with the LACE model (the industry standard) and other models proposed in the literature; they found that the NN had significantly better performance in predicting readmission, with an AUC of 0.78 and a recall of 0.60.


Artetxe et al. (2018) used a clinical dataset with 119 patients and 60 attributes to build a predictive model for unplanned readmission or death in HF patients. The focus of this study was feature extraction; they used RF and SVM, and the best accuracy they obtained was 0.647, with SVM.

Recent studies show that machine learning (ML) methods outperform classical methods. However, further comparative studies between ML and statistical techniques are needed to assess the real impact of these techniques in the domain of readmission risk prediction (Artetxe et al., 2018). Moreover, medical data are often limited by class imbalance, a problem in machine learning where the majority of the data belong to the negative outcome class, causing poor performance of prediction models. Lastly, most ML studies have used the area under the receiver operating characteristic curve (AUC) as the only performance metric to assess model performance.


2.3 Summary and Research Gap

Heart failure is a chronic disease that affects millions of people in the United States, but it can be managed; early detection of HF, assessment of risk factors, and early prediction of adverse events will improve patients' quality of life and reduce the associated medical costs. A large volume of studies has addressed readmission risk prediction; statistical modeling techniques like logistic regression and Cox regression (proportional hazards regression) are the dominant modeling methods. However, in recent years, machine learning techniques have been increasingly used. Overall, ML models outperform traditional models such as LR. However, additional comparative studies are needed to assess the real impact of these techniques in the domain of readmission risk prediction (Artetxe et al., 2018).

Moreover, there is still a need to combine many different types of predictors with statistical methods that can handle large and complex data from a variety of data sources (Walsh et al., 2014).

In this study, we aim to develop models using an ML approach to predict all-cause readmissions 30 days after discharge from an HF hospitalization and to compare them with a traditional prediction model, logistic regression. We focus on the data preprocessing step to improve prediction outcomes, including feature selection and data balancing, using a real dataset provided by Partners. Chapter 3 presents the data summary and processing.


CHAPTER 3

METHODOLOGY

In this chapter we illustrate the approach used for model development and the steps we followed in creating the 30-day readmission prediction model. We begin with an overview of the objectives and the software we used to create the models, followed by a detailed description of the data. We then describe the data preprocessing used to formulate the final list of possible variables available for predicting 30-day readmission. Finally, we give short descriptions of the classification algorithms we used for prediction.

3.1 Aims and objectives

This research implements machine learning techniques to predict the risk of all-cause 30-day readmission for heart failure patients. The main aim of the research is to find the best machine learning techniques in terms of performance and accuracy and to compare them with the regression methods typically applied in the healthcare literature. Accordingly, our methodology can be described as follows:

1. Obtain and prepare the data we want our model to work with

This phase involves exploring the raw data. The dataset for implementing the 30-day personalized readmission prediction model was gathered from both the Enterprise Data Warehouse (EDW) and the Research Patient Data Registry (RPDR) system of Partners HealthCare, a clinical data registry that gathers medical records from various hospitals affiliated with Partners HealthCare and stores them in a central location. Access to patients' data associated with Massachusetts General Hospital (MGH) was authorized by the Partners HealthCare Institutional Review Board (IRB) (no. 2016-P001258). Partners HealthCare is a non-profit health care system that is committed to patient care, research, teaching, and service to the community locally and globally.

2. Perform data preprocessing

Data preprocessing is an important step that helps transform the raw data into an understandable format, especially medical data, which exhibit high dimensionality, irregularity, missing data, noise, and bias. To achieve better results from a machine learning model, the data should be in a specified format, because some of the modeling techniques are sensitive to the form of the predictors. Data preprocessing includes data cleaning, normalization, transformation, feature selection, balancing, and related tasks. Later in this chapter we discuss the preprocessing steps in detail.

Figure 3.1 Data preprocessing tasks (data wrangling, data cleaning, data transformation, variable selection)

3. Modeling

This phase involves selecting and applying various machine learning prediction techniques. We use various classification techniques to build a model that can predict the readmission of HF patients. A comprehensive set of experiments is performed with several classification models to determine the efficiency of these models for the given dataset.

Figure 3.2 Machine learning process
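As an illustration of this modeling phase (and of the data split used for validation in the next phase), a hedged sketch using the caret package in R is shown below. The data frame hf_data, the outcome column readmit30, and the specific settings (80/20 split, 10-fold cross-validation on the training set) are illustrative assumptions, not the exact configuration used in this work.

# Illustrative sketch: split a prepared data frame `hf_data` (hypothetical name)
# with a two-level factor outcome `readmit30` (yes/no) and train two candidate
# classifiers. Settings are examples only.
library(caret)

set.seed(123)
in_train  <- createDataPartition(hf_data$readmit30, p = 0.8, list = FALSE)
train_set <- hf_data[in_train, ]
test_set  <- hf_data[-in_train, ]

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

lr_fit <- train(readmit30 ~ ., data = train_set,   # logistic regression baseline
                method = "glm", family = "binomial",
                trControl = ctrl, metric = "ROC")

rf_fit <- train(readmit30 ~ ., data = train_set,   # random forest candidate
                method = "rf",
                trControl = ctrl, metric = "ROC")

The same pattern extends to the other classifiers listed above by changing the method argument.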


4. Validate the Model

In this phase we determine the effectiveness of our models with the help of various evaluation metrics and compare the performance of the different models. The developed models are evaluated in order to assess the quality and accuracy of prediction of the various classification models. Once the models have been evaluated and show accurate results, they can be used to predict the risk of hospitalization for new patients.

Figure 3.3 Machine learning validation process
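A minimal sketch of this evaluation step is shown below, assuming a fitted model (rf_fit, as in the earlier sketch) and a held-out test set test_set; both names are hypothetical. It computes the confusion-matrix-based metrics and the AUC discussed later in this chapter.

# Minimal evaluation sketch; object names (rf_fit, test_set) are hypothetical.
library(caret)
library(pROC)

pred <- predict(rf_fit, newdata = test_set)               # predicted classes
cm   <- confusionMatrix(pred, test_set$readmit30,         # confusion matrix
                        positive = "yes")

cm$overall["Accuracy"]                                     # accuracy
cm$byClass[c("Sensitivity", "Specificity")]                # recall and specificity

prob <- predict(rf_fit, newdata = test_set, type = "prob")[, "yes"]
auc(roc(test_set$readmit30, prob))                         # area under the ROC curve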


3.2 Data Mining Software Selection

There are various software packages for data mining and predictive modeling; for the purposes of this research we selected R and RStudio for data cleaning and analysis and for building and training the predictive models. R is free and open source, and it is one of the most popular programming languages used by data analysts and data scientists.

3.3 Data Description

The original de-identified data files for this research were in CSV format. The dataset provides information on 20,629 patients, with 75,053 records of patients who were admitted with a diagnosis of CHF at least once between 2011 and 2015. The dataset contains 22 tables: five diagnosis tables (one per year), five lab tables (one per year), five outpatient tables (one per year), two medication tables (EDW and RPDR), two vitals tables (EDW and RPDR), one inpatient table, one demographics table, one procedure table, one diagnosis description table, and one procedure description table.

Figure 3.4 Dataset overview


Table 3.1 Tables description

Table Category | Category Description | Examples
Inpatient admissions | Traditional admission factors | Readmission flag, length of stay, condition flag, ...
Demographics | Traditional demographic factors | Age, gender, language, race, ...
Diagnosis | The presence of a primary diagnosis | Number of diagnoses, ICD9 codes
Procedures | The occurrence and number of procedures during the index admission | Inpatient flag, number of procedures, ICD9 codes
Vitals | Vital sign measurements collected during the index admission | Heart rate, weight, height, BMI, ...
Medications | Presence of specific medications, type, dosage, ... | Medication name, code type, strength, ...
Labs | Lab tests and results | Test date, test name, test result, ...


3.4 Inclusion / Exclusion Criteria

• The outpatient tables were excluded from the working dataset, because our focus in this research is on the inpatient setting, and we count a readmission as a subsequent inpatient admission to any acute care facility that occurs within 30 days of the discharge date of an eligible index admission.

• From the admission set, cases whose death flag (deathFLG) is 1, indicating that the patient has died, were excluded, since such patients cannot have any readmission.

• Only patients with an index admission for CHF are included, because the goal of the research is to predict heart failure patients being readmitted to the hospital within 30 days after discharge.

Figure 3.5 Exclusion / inclusion criteria (total inpatient records 2011-2015: 75053; patients who died excluded: 41806; only HF index hospitalizations included: 8686)


3.5 Data Preprocessing

Data preprocessing is a vital step in the data mining process. We clean the data to make it sufficiently accurate and well structured to support the analysis we want to perform; a clean dataset leads to more accurate predictions. The following subsections explain the main preprocessing steps that were taken to make the dataset ready.

3.5.1 Data Wrangling

By data wrangling we mean the process of combining, merging, reshaping and joining all the data into one single table, in addition to creating new attributes from the given dataset.

Combining, merging and reshaping

We started by merging the inpatients table with the demographics table by patient key, and then with the procedures table by the patient key and the patient account key. The medication, vitals and labs tables are time-series, longitudinal data, so we had to reshape them from wide form to long form in order to merge them with the other tables. Before reshaping the medication table, each drug was classified into its group, either by the chemical type of the active ingredient or by the way it is used to treat a particular condition. We did this step manually using Drugs.com (see appendix for a table showing each medication and its group).

After reshaping comes merging. Because these tables contain only a patient key, and no patient account key linking each record to a specific visit, the vitals and medication tables were merged based on the service/shift date, keeping records whose date falls between the admission date and the discharge date; vitals are usually taken on admission to a hospital, and a medication history is usually taken on admission with medications prescribed at discharge. For the labs tables, we merged records whose service/shift date falls between the admission date and the discharge date, or up to 7 days before or after the admission, because some lab tests are performed prior to admission, especially for inpatient surgery.

Figure 3.6 Accepted data from the labs table (admission date, 7-day windows, length of stay, discharge date)

Creating new attributes

• Based on the literature, we created attributes for the most common comorbidities in the medical history of patients with heart failure: hypertension, diabetes, hyperlipidemia, ischemic cardiomyopathy, atrial fibrillation, COPD, chronic bronchitis, asthma and depression.

• Based on the two attributes (BloodPressureSystolicNBR) and (BloodPressureDiastolicNBR), we created a blood pressure condition attribute with the values (Normal, Elevated, High Stage 1, High Stage 2, Crisis), following the American Heart Association categories presented in Table 3.2 below.

Table 3.2 Blood pressure stages (American Heart Association)

Blood pressure category                    | Systolic        |        | Diastolic
Normal                                     | Less than 120   | and    | Less than 80
Elevated                                   | 120 - 129       | and    | Less than 80
High blood pressure (Hypertension) Stage 1 | 130 - 139       | or     | 80 - 89
High blood pressure (Hypertension) Stage 2 | 140 or higher   | or     | 90 or higher
Hypertensive crisis (seek emergency care)  | Higher than 180 | and/or | Higher than 120
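As an illustration, this derived attribute could be computed in R along the following lines; this is a sketch based on the categories in Table 3.2, using the column names above, and is not the exact code used in the study.

```r
library(dplyr)

# Illustrative systolic/diastolic readings (mmHg).
vitals <- data.frame(
  BloodPressureSystolicNBR  = c(118, 124, 135, 150, 185),
  BloodPressureDiastolicNBR = c(75, 78, 85, 95, 125)
)

# Derive the blood pressure condition following the categories in Table 3.2;
# case_when() returns the first matching category, so the most severe is tested first.
vitals <- vitals %>%
  mutate(BloodPressureCondition = case_when(
    BloodPressureSystolicNBR > 180  | BloodPressureDiastolicNBR > 120 ~ "Crisis",
    BloodPressureSystolicNBR >= 140 | BloodPressureDiastolicNBR >= 90 ~ "High Stage 2",
    BloodPressureSystolicNBR >= 130 | BloodPressureDiastolicNBR >= 80 ~ "High Stage 1",
    BloodPressureSystolicNBR >= 120                                   ~ "Elevated",
    TRUE                                                              ~ "Normal"
  ))
```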

3.5.3 Data Cleaning

After joining all the data into one single table, we started the second phase of preprocessing: data cleaning, which prepares the data for training. By data cleaning we mean dealing with missing data and detecting and removing errors and inconsistencies from the data in order to improve its quality.

Remove Zero- and Near Zero-Variance Predictors

After combining all the tables into one table, the dataset contained quite a few attributes holding only null values or a single unique value. These attributes have no impact and are therefore considered irrelevant and should be removed. (See appendix for a table showing the NZV test result.)
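A minimal sketch of this step, assuming the caret package's nearZeroVar() function and a small illustrative data frame:

```r
library(caret)

# Toy data: 'constant' takes a single value, so it carries no information.
df <- data.frame(
  age      = c(67, 72, 80, 55, 90, 61),
  gender   = c(0, 1, 1, 0, 1, 0),
  constant = rep(1, 6)
)

# nearZeroVar() returns the indices of zero- and near zero-variance predictors
# (the thresholds are controlled by its freqCut and uniqueCut arguments).
nzv_cols <- nearZeroVar(df)
if (length(nzv_cols) > 0) df <- df[, -nzv_cols, drop = FALSE]
```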


Missing values

The dataset had a lot of missing data (see appendix for a table showing the number of missing values in each column or attribute). We dealt with missing values as follows:

Figure 3.7 Handling missing data (deletion; treating NA as a categorical level; imputation with the mean, median or mode for continuous attributes)

• For the medications and labs tables, missing values were converted to zero.

• For vitals, missing values were imputed with the mean.

• For days attributes, NA values were converted to zero.

• For all text attributes, we created a new level called "not applicable".

• Two records were deleted because the gender was missing.
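These rules could be expressed in base R roughly as follows; the column names are hypothetical examples of each attribute type rather than actual dataset columns.

```r
# Toy data containing one example of each missing-value rule described above.
df <- data.frame(
  HeartRate    = c(88, NA, 102, 75),                    # vital: impute with the mean
  LoopDiuretic = c(1, NA, NA, 1),                       # medication: NA means not given
  DaysToEvent  = c(10, NA, 3, NA),                      # days attribute: NA becomes zero
  MaritalGRP   = c("Married", NA, "Single", "Widow"),   # text: new "not applicable" level
  stringsAsFactors = FALSE
)

df$HeartRate[is.na(df$HeartRate)]       <- mean(df$HeartRate, na.rm = TRUE)
df$LoopDiuretic[is.na(df$LoopDiuretic)] <- 0
df$DaysToEvent[is.na(df$DaysToEvent)]   <- 0
df$MaritalGRP[is.na(df$MaritalGRP)]     <- "not applicable"
```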


3.5.3 Data Transforming

Machine learning models are only as good as the data used to train them. A key characteristic of good training data is that it is provided in a form optimized for learning and generalization. The process of putting the data into this optimal format is known as feature transformation. The attributes in our dataset are of three types: binary, continuous and categorical.

• Binary: we kept binary data as it is, for example readmissionflag (0, 1).

• Many machine learning algorithms expect numerical input, so categorical data must be converted to numerical form. We used the dummy variable method, a representation that turns each category value into a binary vector (see appendix for the dummy data).

• For simplicity, in the medication data we changed the quantity of each medication to zero or one: one if the patient had the medication and zero if not.

• Continuous data were kept as they are.
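As an example of the dummy variable encoding, base R's model.matrix() expands a factor into 0/1 indicator columns; the RaceGRP values below are illustrative.

```r
# Toy data with one categorical attribute.
df <- data.frame(
  AgeYrDeident = c(70, 55, 82),
  RaceGRP      = factor(c("White", "Asian", "Black or African American"))
)

# model.matrix() turns each factor level into a 0/1 indicator column;
# the "- 1" keeps one column per level instead of dropping a reference level.
dummies    <- model.matrix(~ RaceGRP - 1, data = df)
df_numeric <- cbind(df["AgeYrDeident"], as.data.frame(dummies))
```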


3.6 Dataset

The complete clean dataset contains information on 5894 patients with CHF, in 8684 records and 61 variables. Of these records, 1428 correspond to a readmission within 30 days, 2340 to a readmission after more than 30 days, and 4916 to no readmission. Table 3.3 shows the distribution of readmission.

Table 3.3 Distribution of readmission

Variable               | Value | Frequency | Percent
ReadmissionFLG         | No    | 4916      | 56.6
ReadmissionFLG         | Yes   | 3768      | 43.4
ReadmissionFLG         | Total | 8684      | 100.0
PHSReadmission30DayFLG | No    | 7256      | 83.6
PHSReadmission30DayFLG | Yes   | 1428      | 16.4
PHSReadmission30DayFLG | Total | 8684      | 100.0


Table 3.4 Frequency distribution of variables by 30-days readmission

Variables Values Frequency Percent Gender 0 4072 46.89 1 4612 53.11 Language English 7631 94.68 Spanish 429 5.32 Race More than 1 race 161 1.90 Asian 195 2.30 Black or African American 999 11.78 Hispanic or Latino 318 3.75 Hispanic or Latino 72 .85 Hispanic or Latino - black 9 .11 Hispanic or Latino - white 32 .38 White 6698 78.95 Marital Divorced / separated 874 10.56 Married / partnered 3715 44.88 Single 2052 24.79 Widow 1637 19.78 Education Graduate of college or postgraduate 1986 33.04 school High school graduate/GED 2632 43.79 Some college/vocational/technical 378 6.29 program Some high school or less 1014 16.87 Employment Disability 637 11.32 Employed 1500 26.66 Home environment 29 .52 Retired 2771 49.24 Self employed 186 3.31 Student 1 .02 Unemployed 503 8.94 Age1 <65 3194 36.78 >=65 5490 63.22 Age2 0-18 1 .01 19-44 544 6.26 45-64 2445 28.16 65-84 4082 47.01 85+ 1612 18.56


Variables Values Frequency Percent EmergencySeverIndexNBR 0 5675 65.35 1 28 .32 2 701 8.07 3 851 9.80 4 10 .12 9 1419 16.34 EmergencyChargeFLG 0 2078 23.93 1 6606 76.07 ReadmissionFLG 0 4917 56.62 1 3767 43.38 PHSReadmission30DayFLG 0 7256 83.56 1 1428 16.44 PHSPayerCategoryDSC Commercial 1690 19.84 Government 6829 80.16 ClinicalObservationDaysNBR 0 8329 95.91 1 302 3.48 2 36 .41 3 13 .15 4 1 .01 5 2 .02 6 1 .01 ClinicalOperativeDaysNBR 0 8375 96.44 1 229 2.64 2 42 .48 3 19 .22 4 6 .07 5 5 .06 6 2 .02 8 3 .03 10 3 .03 AdmissionSourceCommonDSC ADMIT FROM OBSERVATION 364 4.20 EMERGENCY ROOM 2896 33.43 OP DEPT/CLINIC/PHYSICIAN 2715 31.34 REFERRAL Outside Health Care Facility 19 .22 Outside Hospital 208 2.40 Physician or Clinic Referral 401 4.63 Self-Referral 1164 13.44 Skilled Nursing Facility 44 .51 TRANSFER FROM ACUTE 734 8.47 HOSPITAL TRANSFER FROM NON ACUTE 118 1.36 FACILITY AdmissionServiceCommonDSC Not Specified 154 1.78 CARDIAC SURGERY 78 .90 CARDIOLOGY 1914 22.10 Emergency Medicine 1338 15.45 MEDICINE 5017 57.94 ONCOLOGY 34 .39 Pulmonology 28 .32 RENAL MEDICINE 16 .18 SURGERY 80 .92


Variables Values Frequency Percent ServiceLineCommonDSC Surgery 46 .53 Cardiac 8490 97.86 Onco 53 .61 Vasclr 87 1.00 ServiceLineSubServiceDSC Surgery 46 .53 Clinical 6597 76.17 EP & Arrhythmias 407 4.70 Invasive 1222 14.11 Medical 43 .50 Surgery 259 2.99 Vascular 87 1.00 DischargeDispositionCommonDSC Acute Hospital 142 1.64 Discharge to Institution/shelter/care 15 .17 Home 3448 39.82 Home Care 3596 41.53 Hospice 45 .52 Left Against Medical Advice 76 .88 Long Term Care 175 2.02 Rehab Facility 177 2.04 Skilled Nursing Facility 985 11.38 DischargeServiceCommonDSC Cardiac surgery 209 2.42 Cardiology 1975 22.84 Medicine 6280 72.63 Oncology 44 .51 Renal medicine 20 .23 Surgery 119 1.38 ICD9DiagnosisDSC AC DIASTOLIC HRT FAILURE 510 5.90 AC ON CHR DIAST HRT FAIL 2185 25.26 AC ON CHR SYST HRT FAIL 1726 19.95 AC SYST/DIASTOL HRT FAIL 61 .71 AC SYSTOCIL HRT FAILURE 482 5.57 AC/CHR SYST/DIA HRT FAIL 394 4.55 CHR DIASTOLIC HRT FAIL 71 .82 CHR SYST/DIASTL HRT FAIL 14 .16 CHR SYSTOLIC HRT FAILURE 73 .84 CONGESTIVE HEART 2175 25.14 FAIL,UNSPECIF DIASTOLC HRT FAILURE NOW 182 2.10 HYP HRT/REN NOS W/HRT 404 4.67 FAILURE HYP HRT/REN NOS W/HRT 81 .94 FLR&KIDN HYPERTEN HEART DIS W CHF 157 1.81 MAL HYP HTR/REN W/ HRT 16 .18 FLR&W/ UNSPEC SYST & DIAST HEART 17 .20 FAIL UNSPEC SYSTOLIC HEART 103 1.19 FAILURE ICD9DiagnosisCategoryDSC Diseases of the circulatory system 8684 100.00


Table 3.4 (Continued) Frequency distribution of variables by 30-days readmission

Variables Values Frequency Percent PHSDRGDSC No answer 9 .10 Circulatory System Procedures 14 .16 Vascular Procedures 14 .16 VASCULAR PROCEDURES W CC 13 .15 VASCULAR PROCEDURES W MAJOR CC 95 1.11 Cardiac Catheterization w/ Circ Disord Exc Ischemic Heart Disease 313 3.65 CARDIAC DEFIB IMPLANT W CARDIAC CATH W 32 .37 AMI/HF/SHOCK Cardiac Defibrillator & Heart Assist Anomaly 91 1.06 CARDIAC DEFIBRILLATOR W/O CARDIAC CATHETER 124 1.45 CARDIAC VALVE OR CARDIAC DEFIB IMPLANT 105 1.22 PROCEDURE W MAJOR CC CHF & CARDIAC ARRHYTHMIA W MAJOR CC 2005 23.39 CIRC DISORDERS EXCEPT AMI, W CARD CATH & 479 5.59 COMPLEX DIAG CIRCULATORY DISORDERS W AMI & MAJOR COMP, 57 .66 DISCHARGED ALIVE ECMO OR TRACH W MV 96+ HR OR TRACH W PDX EXC 14 .16 FACE/MTH/NCK DX EXTEN O.R. PROCEDURE UNRELATED TO PRINCIPAL 14 .16 DIAGNOSIS Heart &/or Lung Transplant 16 .19 Heart Failure 2167 25.28 HEART FAILURE & SHOCK 2729 31.83 HEART TRANSPLANT 50 .58 MAJOR CARDIOVASCULAR PROCEDURES W MAJOR CC 95 1.11 NON-EXTENSIVE O.R. PROC UNRELATED TO PRINCIPAL 11 .13 DIAGNOSIS PERCUTANEOUS CARDIOVAS PROC W DRUG ELUTING 13 .15 STENT W/O AMI PERCUTANEOUS CARDIOVASC PROC W AMI, HF OR 22 .26 SHOCK Percutaneous Cardiovascular Procedures w/o AMI 41 .48 Permanent Cardiac Pacemaker Implant w/ AMI, Heart Failure or 18 .21 Shock PRM CARD PACEM IMPL W AMI, HRT FAIL OR SHK, OR 32 .37 AICD LEAD OR GN


Table 3.4 (Continued) Frequency distribution of variables by 30-days readmission

Variables Values Frequency Percent NumberOfProcedures 0 4385 50.50 1 1807 20.81 2 874 10.06 3 543 6.25 4 323 3.72 5 191 2.20 6 126 1.45 7 95 1.09 8 68 .78 9 40 .46 10 34 .39 11 37 .43 12 16 .18 13 26 .30 14 14 .16 15 18 .21 16 14 .16 17 5 .06 18 10 .12 19 16 .18 20 42 .48 PrincipalICD9ProcedureCD Miscellaneous diagnostic and therapeutic 1668 19.21 procedures None 4385 50.50 Operations on the digestive system 242 2.79 Operations on the endocrine system 73 .84 Operations on the integumentary system 48 .55 Operations on the musculoskeletal system 46 .53 Operations on the nervous system 2200 25.33 Operations on the urinary system 22 .25 NumberOfDiagnoses 1 1 .01 2 3 .03 3 7 .08 4 23 .26 5 64 .74 6 107 1.23 7 145 1.67 8 234 2.69 9 335 3.86 10 387 4.46 11 1944 22.39 12 439 5.06 13 495 5.70 14 497 5.72 15 477 5.49 16 468 5.39 17 422 4.86 18 426 4.91 19 425 4.89 20 1785 20.56


Variables Values Frequency Percent Hypertension 0 3240 37.31 1 5444 62.69 diabetes 0 8509 97.98 1 175 2.02 X.depression 0 6543 75.35 1 2141 24.65 Hyperlipidemia 0 2661 30.64 1 6023 69.36 Ischemic.cardiomyopathy 0 5567 64.11 1 3117 35.89 Atrial.fibrillation 0 3905 44.97 1 4779 55.03 COPD.chronic.bronchitis.and.asthma 0 5350 61.61 1 3334 38.39 OutOfRangeCD.GFR (estimated) 0 1700 19.58 Low 1269 14.61 Normal 5514 63.50 UknownAbnormal 201 2.31 OutOfRangeCD.Potassium 0 1730 19.92 High 261 3.01 Low 383 4.41 Normal 6309 72.65 UknownAbnormal 1 .01 OutOfRangeCD.Creatinine 0 1756 20.22 High 2875 33.11 Low 63 .73 Normal 3989 45.94 UknownAbnormal 1 .01 OutOfRangeCD.Sodium 0 1778 20.47 High 72 .83 Low 1436 16.54 Normal 5397 62.15 UknownAbnormal 1 .01 OutOfRangeCD.BUN 0 1766 20.34 High 4084 47.03 Low 18 .21 Normal 2816 32.43 OutOfRangeCD.Chloride 0 1789 20.60 High 62 .71 Low 3140 36.16 Normal 3692 42.51 UknownAbnormal 1 .01 OutOfRangeCD.Carbon Dioxide 0 1794 20.66 High 1111 12.79 Low 499 5.75 Normal 5279 60.79 UknownAbnormal 1 .01 OutOfRangeCD.Anion Gap 0 1837 21.15 High 477 5.49 Low 156 1.80 Normal 6213 71.55 UknownAbnormal 1 .01 46

Table 3.4 (Continued) Frequency distribution of variables by 30-days readmission

Variables Values Frequency Percent OutOfRangeCD.Glucose 0 2465 28.39 High 2831 32.60 Low 99 1.14 Normal 3289 37.87 OutOfRangeCD.PLT 0 3317 38.20 High 222 2.56 Low 738 8.50 Normal 4406 50.74 UknownAbnormal 1 .01 OutOfRangeCD.WBC 0 3317 38.20 High 736 8.48 Low 273 3.14 Normal 4358 50.18 OutOfRangeCD.MCHC 0 3315 38.17 High 7 .08 Low 1502 17.30 Normal 3860 44.45 OutOfRangeCD.MCH 0 3318 38.21 High 576 6.63 Low 1022 11.77 Normal 3768 43.39 OutOfRangeCD.MCV 0 3318 38.21 High 667 7.68 Low 465 5.35 Normal 4234 48.76 OutOfRangeCD.RBC 0 3318 38.21 High 58 .67 Low 3433 39.53 Normal 1875 21.59 OutOfRangeCD.RDW 0 3341 38.47 High 3306 38.07 Low 3 .03 Normal 2034 23.42 OutOfRangeCD.Hgb 0 3547 40.85 High 34 .39 Low 3583 41.26 Normal 1520 17.50 OutOfRangeCD.Calcium 0 2803 32.28 High 61 .70 Low 1122 12.92 Normal 4698 54.10 OutOfRangeCD.Magnesium 0 4334 49.91 High 322 3.71 Low 157 1.81 Normal 3870 44.56 UknownAbnormal 1 .01 OutOfRangeCD.PT 0 5145 59.25 High 2948 33.95 Low 3 .03 Normal 588 6.77 47

Table 3.4 (Continued) Frequency distribution of variables by 30-days readmission

Variables Values Frequency Percent OutOfRangeCD.HCT 0 3288 37.86 High 67 .77 Low 3627 41.77 Normal 1702 19.60 StrengthAMT.Loop diuretics 0 7550 86.94 1 1134 13.06 StrengthAMT.Cardioselective beta blockers 0 7550 86.94 1 1134 13.06 StrengthAMT.Statins 0 7550 86.94 1 1134 13.06 StrengthAMT.Salicylates 0 7550 86.94 1 1134 13.06 StrengthAMT.Minerals and electrolytes 0 7550 86.94 1 1134 13.06 BloodPressure Elevated 4 .05 High Stage 1 25 .29 High stage 2 8613 99.18 Normal 42 .48


Table 3.5 Descriptive measures for the continuous variables

Variable                  | Mean  | SD    | Variance | CV (%) | Median | Minimum | Maximum | Range
EmergencySeverIndexNBR    | 1.93  | 3.27  | 10.72    | 169.43 | 0      | 0       | 9       | 9
AgeYrDeident              | 69.64 | 14.97 | 224.13   | 21.5   | 72.00  | 18      | 90      | 72
LengthOfStayNBR           | 7.21  | 11.30 | 127.65   | 156.73 | 5.00   | 1       | 372     | 371
ClinicalRoutineDaysNBR    | 3.24  | 4.93  | 24.33    | 152.16 | 1.00   | 0       | 59      | 59
ClinicalICUDaysNBR        | .63   | 4.53  | 20.51    | 719.05 | .00    | 0       | 151     | 151
ClinicalObservationDayNBR | .05   | .27   | .07      | 540    | .00    | 0       | 6       | 6
ClinicalOperativeDaysNBR  | .06   | .39   | .15      | 650    | .00    | 0       | 10      | 10
NumberOfProcedures        | 1.51  | 2.82  | 7.98     | 186.75 | .00    | 0       | 20      | 20
NumberOfDiagnoses         | 14.30 | 4.19  | 17.59    | 29.3   | 14.00  | 1       | 20      | 19


Figure 3.8 Histograms of selected attributes: age (AgeYrDeident), gender, readmission flag, 30-day readmission flag, number of diagnoses, number of procedures, emergency flag and length of stay


3.7 Label Definition / Outcome Definition

Readmission Rate: A hospitalization that occurs within 30 days after discharge

Measure Title: Hospital 30-day, all cause readmission rate following heart failure hospitalization

Target Population: patients aged 18 years and older

Definition of Index Admission

• Qualifying Event: Discharged alive.

• Clinical Scope: Index admissions are identified by CHF.

• Patients may have multiple records due to recurrent labs and medications.

Definition of Readmission

• Qualifying Event: An admission that occurs within 30 days of an index admission, for any cause.

• Clinical Scope: Readmission for all causes.

• Readmissions at all facilities are considered.


3.8 Modeling

In this section we present the computational methods used in the experiments and provide a brief description of each of the methods utilized to perform the classification task.

3.8.1 Feature Selection

Feature selection is the process of obtaining a subset of the original variable set containing the relevant features by discarding redundant or irrelevant variables. It is an important step in model building because it reduces the complexity of the model. In this phase we aim to determine which subset of factors has a significant impact on readmission in our dataset, which contains 61 attributes. There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods.

• Filter Method

As stated by Kohavi and John (1997), filter methods select features in a preprocessing step and attempt to assess the predictive value of features from the data alone, without recourse to the classifier learning algorithm. The main disadvantage of the filter approach is that it totally ignores the effect of the selected feature subset on the performance of the induction algorithm. Examples of filter methods include the chi-squared test, information gain and correlation coefficient scores.

Figure 3.9 The filter approach (input features → feature subset selection → induction algorithm)

• Wrapper Method

The wrapper methodology consists in using the prediction performance of a given machine learning algorithm to assess the relative usefulness of subsets of variables (Guyon & Elisseeff, 2003). We take a subset of features and train a model using them; based on the inferences drawn from that model, we decide to add or remove features from the subset. Examples of wrapper methods include forward feature selection, backward feature elimination and recursive feature elimination.

Figure 3.10 The wrapper approach (set of all features → generate a subset → learning algorithm → performance; select the best subset)


• Embedded Method

Embedded methods perform variable selection during the training process and are usually specific to a given machine learning algorithm (Guyon & Elisseeff, 2003). They are implemented by algorithms that have their own built-in feature selection mechanism and combine the advantages of the filter and wrapper methods.

Figure 3.11 The embedded approach (set of all features → generate a subset → learning algorithm with built-in selection → performance; select the best subset)


We generated three different sets of features and conducted our experiments using all of them. The first feature set, which we call All, consists of all the features remaining in the dataset after cleaning; these variables are shown in Table 3.4. The second feature set, Info Gain, was obtained with the filter feature selection method information gain (IG), an entropy-based measure that ranks features by the information they provide about the class; this set is shown in Table 3.6. The third feature set was obtained with a wrapper method, backward selection, and is also shown in Table 3.6.

Table 3.6 The features in the Info Gain and Backward Selection feature sets

Info Gain Features            | Backward Selection Features
ICD9DiagnosisDSC              | StrengthAMT.Loop.diuretics
AgeYrDeident                  | StrengthAMT.Cardioselective.beta.blockers
LengthOfStayNBR               | StrengthAMT.Statins
NumberOfDiagnoses             | StrengthAMT.Salicylates
AdmissionSourceCommonDSC      | StrengthAMT.Minerals.and.electrolytes
EmploymentGRP                 | RaceGRP
EducationGRP                  | DischargeDispositionCommonDSC
MaritalGRP                    | ICD9DiagnosisDSC
DischargeDispositionCommonDSC | AdmissionServiceCommonDSC
ClinicalRoutineDaysNBR        | EmploymentGRP
RaceGRP                       | AdmissionSourceCommonDSC
NumberOfProcedures            | DischargeServiceCommonDSC
PrincipalICD9ProcedureCD      | Ischemic.cardiomyopathy
AdmissionServiceCommonDSC     | OutOfRangeCD.RDW
OutOfRangeCD.Glucose          | LanguageGRP
OutOfRangeCD.PT               | OutOfRangeCD.HCT
OutOfRangeCD.Magnesium        | OutOfRangeCD.RBC
OutOfRangeCD.Carbon.Dioxide   | NumberOfDiagnoses
Ischemic.cardiomyopathy       | ServiceLineSubServiceDSC
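A rough R sketch of how two such feature sets could be generated is shown below, using the FSelector package for the information gain filter and AIC-based backward elimination via MASS::stepAIC as a stand-in for backward selection; the toy data, the choice of k and the AIC criterion are illustrative assumptions, not the exact procedure used here.

```r
library(FSelector)   # information gain filter
library(MASS)        # stepAIC for backward elimination

# Toy data: a binary readmission label and a few candidate predictors.
set.seed(42)
df <- data.frame(
  Readmit30 = factor(rbinom(200, 1, 0.3)),
  Age       = rnorm(200, 70, 12),
  LOS       = rpois(200, 5),
  NumDx     = rpois(200, 12),
  Gender    = factor(rbinom(200, 1, 0.5))
)

# Filter method: rank attributes by information gain and keep the top k.
ig     <- information.gain(Readmit30 ~ ., data = df)
top_ig <- cutoff.k(ig, k = 3)

# Wrapper-style backward elimination: start from the full logistic model and
# drop terms stepwise (here by AIC).
full_fit     <- glm(Readmit30 ~ ., data = df, family = binomial)
backward_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)
selected     <- attr(terms(backward_fit), "term.labels")
```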


3.8.2 Imbalance Data/Class Imbalance

The data imbalance issue is one of the major problems in data mining and machine learning. A dataset is imbalanced when the majority class dominates the minority class, causing machine learning classifiers to be biased towards the majority class. This leads to poor classification of the minority class; classifiers may even predict all test data as the majority class (Kotsiantis et al., 2005). To deal with this problem, several approaches have been developed that can be applied during the preprocessing phase. One of them is resampling, which includes undersampling and oversampling techniques.

Undersampling techniques delete instances of the majority class, whereas oversampling techniques replicate or create new instances of the minority class (Kotsiantis et al., 2005).

Figure 3.12 Undersampling and oversampling techniques (Karagod, n.d.)
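As a sketch of how these resampling techniques can be applied, the ROSE package's ovun.sample() function supports over-, under- and combined sampling; the data frame, outcome column and target sizes below are illustrative, and this is not necessarily the package used in this study.

```r
library(ROSE)

# Toy imbalanced training set (roughly 16% positives, like the 30-day label).
set.seed(1)
train <- data.frame(
  Readmit30 = factor(rbinom(1000, 1, 0.16)),
  Age       = rnorm(1000, 70, 12),
  LOS       = rpois(1000, 5)
)
n_maj <- sum(train$Readmit30 == "0")
n_min <- sum(train$Readmit30 == "1")

# Oversample the minority class, undersample the majority class, or do both.
over  <- ovun.sample(Readmit30 ~ ., data = train, method = "over",
                     N = 2 * n_maj, seed = 1)$data
under <- ovun.sample(Readmit30 ~ ., data = train, method = "under",
                     N = 2 * n_min, seed = 1)$data
both  <- ovun.sample(Readmit30 ~ ., data = train, method = "both",
                     p = 0.5, N = nrow(train), seed = 1)$data

table(train$Readmit30); table(under$Readmit30)
```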


3.8.3 Experiments and selected algorithms

In supervised learning, algorithms learn from labeled data. After training, the algorithm determines which label should be given to new data by recognizing patterns and associating them with the unlabeled data. Supervised learning techniques are divided into two categories: classification and regression. Various classification techniques have been applied to the readmission problem.

We chose six techniques to model and predict readmission; each algorithm is trained and validated. Below is a list of the algorithms used in the experiments, with a brief description of each.

3.8.3.1 Logistic Regression (LR)

Logistic regression is one of the most popular techniques for binary classification problems. The model estimates the probability of the target variable from a linear combination of the predictors by fitting a logit function:

$$\mathrm{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$$

where $p_i$ is the probability that the outcome is true given the linear combination of the predictors, $\beta_0, \dots, \beta_k$ are the coefficients associated with each variable, $x_{i1}, \dots, x_{ik}$ are the predictor variables and $i$ indexes the observations (Hosmer, Lemeshow, & Sturdivant, 2013). Once the coefficients are learned, the predicted probability for a new observation is obtained as

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}$$
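A minimal R sketch of fitting such a model with glm() is shown below; the toy data frame, the Readmit30 outcome column and the 80:20 split mirror the setup described in this chapter but are purely illustrative.

```r
# Toy stand-in for the prepared dataset and the 80:20 split of Section 3.9.
set.seed(7)
df <- data.frame(
  Readmit30 = factor(rbinom(500, 1, 0.16)),
  Age       = rnorm(500, 70, 12),
  LOS       = rpois(500, 5),
  NumDx     = rpois(500, 12)
)
idx   <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Logistic regression with a binomial (logit) link.
fit  <- glm(Readmit30 ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")    # predicted P(readmission)
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1)) # class at a 0.5 cutoff
```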

3.8.3.2 Decision Tree (DT)

Decision trees are among the most popular and intuitive data mining approaches. They start by selecting the variable that provides the best separation of the target class, generating nodes; this process is then repeated on each node until no further splits are needed (Tuffery, 2011). The core algorithm for building decision trees is ID3 (Quinlan, 1986), which employs a top-down search through the space of possible branches with no backtracking and uses entropy and information gain to construct the tree.
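A short sketch with the rpart package (a CART-style implementation rather than ID3), reusing the illustrative train/test split from the logistic regression sketch above:

```r
library(rpart)

# Reuses the toy train/test split from the logistic regression sketch above.
tree      <- rpart(Readmit30 ~ ., data = train, method = "class")
tree_pred <- predict(tree, newdata = test, type = "class")
```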

3.8.3.3 Naïve Bayes (NB)

This is a statistical classifier that can predict membership probabilities such as the probability that a given set belongs to a class. This type of classification is based on Bayes theorem; this

Bayes classifier has been found to be comparable in performance to decision tree and some selected neural network classifiers. This type of Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes

(Tuffery, 2011).
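A minimal sketch with e1071::naiveBayes(), again reusing the illustrative train/test split defined in the logistic regression sketch:

```r
library(e1071)

# Reuses the toy train/test split from the logistic regression sketch above.
nb      <- naiveBayes(Readmit30 ~ ., data = train)
nb_pred <- predict(nb, newdata = test)                 # predicted class labels
nb_prob <- predict(nb, newdata = test, type = "raw")   # posterior probabilities
```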

3.8.3.4 Random Forest (RF)

A random forest is an ensemble method that can be employed to increase the accuracy of a decision tree. Ensemble methods combine a series of k models with the aim of generating a compound classification model. In a random forest, multiple decision tree classifiers are combined to form a forest; each tree is grown using randomly selected attributes at each node to generate the split, and during classification each tree votes for the most popular class (Han et al., 2012).
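A minimal sketch with the randomForest package, reusing the illustrative split above; ntree = 500 is an assumed setting, not a tuned value from this study.

```r
library(randomForest)

# Reuses the toy train/test split from the logistic regression sketch above.
rf      <- randomForest(Readmit30 ~ ., data = train, ntree = 500)
rf_pred <- predict(rf, newdata = test)                  # majority vote over the trees
rf_prob <- predict(rf, newdata = test, type = "prob")   # class probabilities
importance(rf)                                          # variable importance scores
```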


3.8.3.5 Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs (Hearst, 1998).
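A minimal sketch with e1071::svm() and a radial kernel, reusing the illustrative split above:

```r
library(e1071)

# Reuses the toy train/test split from the logistic regression sketch above.
svm_fit  <- svm(Readmit30 ~ ., data = train, kernel = "radial", probability = TRUE)
svm_pred <- predict(svm_fit, newdata = test)
```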

3.8.3.6 eXtreme Gradient Boosting (xgboost)

XGBoost is an implementation of gradient boosting machines created by Tianqi Chen. It is one of the fastest implementations of gradient boosted trees and follows the principle of gradient boosting, using more accurate approximations to find the best tree model. It employs a number of practical optimizations that make it exceptionally successful, particularly on structured data, and it uses a more regularized model formalization to control over-fitting, which often gives it better performance.
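A minimal sketch with the xgboost package, which expects a numeric feature matrix and a 0/1 label vector; the hyperparameters shown are assumed defaults for illustration, not tuned values from this study.

```r
library(xgboost)

# xgboost expects a numeric feature matrix and a 0/1 label vector;
# this reuses the toy train/test split from the logistic regression sketch above.
x_train <- model.matrix(Readmit30 ~ . - 1, data = train)
x_test  <- model.matrix(Readmit30 ~ . - 1, data = test)
y_train <- as.numeric(train$Readmit30) - 1

bst <- xgboost(data = x_train, label = y_train, nrounds = 100,
               objective = "binary:logistic", max_depth = 4, eta = 0.1, verbose = 0)
xgb_prob <- predict(bst, x_test)
```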


3.9 Validation Set Approach (Data Split)

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the other set is used to test it.

The process works as follows:

1. Build (train) the model on the training data set

2. Apply the model to the test data set to predict the outcome of new unseen observations

3. Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.

We divided the dataset randomly, to avoid selection bias, into a training set and a test set in an 80:20 ratio, which means we train the model on 80% of the data and use the remaining 20% to assess its performance.
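For illustration, a stratified 80:20 split can be obtained with caret::createDataPartition(), which samples within each level of the outcome so that both sets keep the original class proportions; df and Readmit30 are assumed names for the prepared data frame and the outcome column.

```r
library(caret)

# Stratified 80:20 split: createDataPartition() samples within each level of the
# outcome so both sets keep the original class proportions.
set.seed(123)
train_idx <- createDataPartition(df$Readmit30, p = 0.8, list = FALSE)
train <- df[train_idx, ]
test  <- df[-train_idx, ]
```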


3.10 Evaluation / Performance Metrics

Once a classification model is built, we measure how accurately it predicts the outcome using evaluation metrics. The most common metric is basic accuracy; however, with imbalanced classes this metric can be misleading, because a high accuracy does not demonstrate predictive capacity for the minority class, so precision and recall are used to focus on the small positive class. In supervised classification, the confusion matrix is the basis of every evaluation metric. A confusion matrix has two dimensions, the actual and the predicted class, each with two values, positive and negative, as shown in Table 3.7 below. There are various metrics for evaluating predictive accuracy; the following are the metrics we used.

Table 3.7 Confusion matrix of a two-class classifier

                   | Actual Positive | Actual Negative
Predicted Positive | TP              | FP
Predicted Negative | FN              | TN


3.10.1 Accuracy (Acc)

The accuracy metric measures the ratio of correct predictions over the total number of instances evaluated (M & M.N, 2015):

$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$

3.10.2 Precision (p)

Precision measures the positive patterns that are correctly predicted out of the total predicted patterns in the positive class (M & M.N, 2015):

$$p = \frac{TP}{TP + FP}$$

3.10.3 Sensitivity or Recall (r)

Recall measures the fraction of positive patterns that are correctly classified (M & M.N, 2015):

$$r = \frac{TP}{TP + FN}$$

3.10.4 Specificity

Specificity is the proportion of actual negative cases that are correctly identified:

$$Specificity = \frac{TN}{TN + FP}$$

3.10.5 F-Measure (FM)

The F-measure is the harmonic mean of precision and recall (M & M.N, 2015):

$$FM = \frac{2 \cdot p \cdot r}{p + r}$$

3.10.6 Area under the ROC Curve (AUC)

AUC is one of the popular ranking-type metrics; its value reflects the overall ranking performance of a classifier. For a binary class problem, the AUC value can be calculated as

$$AUC = \frac{S_p - n_p(n_p + 1)/2}{n_p \, n_n}$$

where $S_p$ is the sum of the ranks of all positive examples, and $n_p$ and $n_n$ denote the number of positive and negative examples, respectively (M & M.N, 2015).
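For illustration, these metrics can be computed directly from the confusion matrix counts; the sketch below uses small made-up prediction vectors and the pROC package for the AUC.

```r
library(pROC)

# Made-up actual labels, predicted classes and predicted probabilities.
actual <- c(0, 0, 1, 1, 0, 1, 0, 0, 1, 0)
pred   <- c(0, 1, 1, 0, 0, 1, 0, 0, 0, 0)
prob   <- c(0.10, 0.60, 0.80, 0.40, 0.20, 0.90, 0.30, 0.10, 0.45, 0.05)

# Confusion matrix counts.
tp <- sum(pred == 1 & actual == 1); fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1); tn <- sum(pred == 0 & actual == 0)

accuracy    <- (tp + tn) / (tp + fp + tn + fn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)        # sensitivity
specificity <- tn / (tn + fp)
f_measure   <- 2 * precision * recall / (precision + recall)
auc_value   <- auc(roc(actual, prob))
```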

3.11 Summary

The objective of this research is to develop models using machine learning algorithms to predict 30-day readmission after discharge from a heart failure admission and to compare their performance with a logistic regression model. This chapter detailed the data used for the models and described our approach to developing the prediction models, along with the algorithms and evaluation metrics we used.


CHAPTER 4

RESULT AND ANALYSIS

As mentioned in the first chapter, the objective of this study is to develop a model using ML to predict all-cause 30-day readmission for heart failure patients and to compare ML model performance with standard logistic regression. In this chapter we describe the experimental results; in particular, we discuss how the different classifiers perform with different feature selection and class balancing techniques. We ran the different ML algorithms with different attribute sets and various class balancing techniques. All the machine learning algorithms were evaluated using data split validation (80:20), and we additionally used 10-fold cross validation for DT and RF. The evaluation metrics used are accuracy, sensitivity, specificity, precision, recall, F-measure and AUC.

The dataset is high dimensional and imbalanced. We wanted to find a feature subset that reduces the complexity of the model and identifies the most significant variables or groups of variables while improving prediction performance; for these reasons we used the filter method with information gain (InfoGain) as the metric and the wrapper method backward selection. Results for each machine learning algorithm are shown in the following sections.


4.1 Logistic Regression

We first examine the performance of the logistic regression classification technique with the different attribute sets and sampling techniques described in Chapter 3. The results for 30-day readmission are shown in Table 4.1. As the table shows, the class-balanced datasets achieved better sensitivity and recall than the original dataset. Feature selection reduced recall compared with the full feature set, although the wrapper method performed better than the filter method. The best results were obtained with backward selection using the combined (both) sampling technique, with an accuracy of 0.63 and a recall of 0.61.

Table 4.1 Performance Results of Logistic Regression for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.5384 0.49798 0.54515 0.15511 0.49798 0.23654 0.5107

Over 0.3267 0.8583 0.2376 0.1588 0.8583 0.2680 0.5339

Under 0.436 0.7368 0.3856 0.1674 0.7368 0.2729 0.5324

Both 0.5517 0.47368 0.56483 0.15435 0.47368 0.23284 0.5096

Filter No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Info- Over 0.6105 0.60324 0.61168 0.20666 0.60324 0.30785 0.5543

Gain Under 0.5994 0.57490 0.60353 0.19559 0.57490 0.29188 0.545

Both 0.618 0.59919 0.62118 0.20963 0.59919 0.31060 0.556

Wrapper No Sample 0.8458 0.026515 0.994498 0.466667 0.026515 0.050179 0.6579

Backward Over 0.6452 0.53409 0.66551 0.22596 0.53409 0.31757 0.5562

Selection Under 0.6373 0.5587 0.6528 0.2415 0.5587 0.3373 0.5618

Both 0.6311 0.61511 0.63405 0.23361 0.61511 0.33861 0.5672


4.2 Decision Tree

The same set of experiments performed for logistic regression was repeated with the decision tree classifier; the results are shown in Table 4.2. As the table shows, the balanced datasets perform better; specifically, the undersampling technique yields the best sensitivity and recall. There was not much difference in performance between the original dataset and the datasets obtained with feature selection. For the 10-fold cross-validated DT classifier with undersampling, AUC and recall were slightly, but not significantly, better. The best results were obtained with the full dataset using cross-validated undersampling, with an accuracy of 0.55 and a recall of 0.62.

Table 4.2 Performance Results of Decision Tree for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Over 0.6483 0.44534 0.68228 0.19031 0.44534 0.26667 0.5352

Under 0.5581 0.61134 0.54922 0.18528 0.61134 0.28437 0.5396

Both 0.5581 0.57895 0.55465 0.17897 0.57895 0.27342 0.533

Filter No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Info- Gain Over 0.6483 0.44534 0.68228 0.19031 0.44534 0.26667 0.5352

Under 0.6343 0.4575 0.6640 0.1859 0.4575 0.2643 0.5327

Both 0.6227 0.4575 0.6504 0.1799 0.4575 0.2583 0.5286

Wrapper No Sample 0.8381 0.0000 1.0000 NA 0.0000 NA NA

Backward Over 0.6203 0.45364 0.65445 0.21207 0.45364 0.28903 0.533

Selection Under 0.5645 0.59109 0.56008 0.18388 0.59109 0.28050 0.5374

Both 0.6227 0.4575 0.6504 0.1799 0.4575 0.2583 0.5286

Embedded No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Over 0.6483 0.44534 0.68228 0.19031 0.44534 0.26667 0.5352

Under 0.6343 0.4575 0.6640 0.1859 0.4575 0.2643 0.5327

Both 0.6227 0.4575 0.6504 0.1799 0.4575 0.2583 0.5286

10- Fold DT-Under 0.5558 0.62753 0.54379 0.18742 0.62753 0.28864 0.5422



4.3 Random Forest

A similar set of experiments was conducted for random forest. As shown in the table below, the class-balanced datasets achieved better results, and undersampling had the highest recall and sensitivity. The feature selection methodology did not improve the results as expected. The best results were obtained with the full dataset using the undersampling technique, with an accuracy of 0.55 and a recall of 0.67.

Table 4.3 Performance Results of Random Forest for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.857 0.028340 0.995927 0.538462 0.028340 0.053846 0.6989

Over 0.861 0.09312 0.98982 0.60526 0.09312 0.16140 0.736

Under 0.5552 0.67206 0.53564 0.19529 0.67206 0.30264 0.5511

Both 0.8087 0.23482 0.90496 0.29293 0.23482 0.26067 0.5844

Filter No Sample 0.8279 0.10526 0.94908 0.25743 0.10526 0.14943 0.5605

Info- Over 0.7785 0.18623 0.87780 0.20354 0.18623 0.19450 0.5345

Gain Under 0.5023 0.66802 0.47454 0.17572 0.66802 0.27825 0.5354

Both 0.6791 0.38057 0.72912 0.19067 0.38057 0.25405 0.533

Wrapper No Sample 0.8299 0.035461 0.988003 0.370370 0.035461 0.064725 0.6038

Backward Over 0.8272 0.061303 0.965422 0.242424 0.061303 0.097859 0.5466

Selection Under 0.536 0.62348 0.52138 0.17928 0.62348 0.27848 0.5356

Both 0.7487 0.24113 0.84968 0.24199 0.24113 0.24156 0.5455


Table 4.4 Performance Results of Random Forest for 30 Days Readmission

Embedded No Sample 0.8465 0.068826 0.976918 0.333333 0.068826 0.114094 0.5978

Over 0.8326 0.11741 0.95248 0.29293 0.11741 0.16763 0.5792

Under 0.5203 0.64372 0.49966 0.17746 0.64372 0.27822 0.5353

Both 0.7523 0.31984 0.82485 0.23442 0.31984 0.27055 0.5565

10-Fold RF-Under 0.5471 0.6316 0.5329 0.1848 0.6316 0.2860 0.5405


4.4 Naïve Bayes

We applied the same set of experiments with Naïve Bayes. The feature subset obtained with information gain (InfoGain) gave the best results, and the class-balanced datasets had better recall and sensitivity. The best results were obtained with the combined (both) sampling technique, with an accuracy of 0.61 and a recall of 0.59.

Table 4.5 Performance Results of Naïve Bayes for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.8122 0.12551 0.92736 0.22464 0.12551 0.16104 0.5441

Over 0.7465 0.36437 0.81059 0.24390 0.36437 0.29221 0.5638

Under 0.7907 0.1741 0.8941 0.2161 0.1741 0.1928 0.541

Both 0.7924 0.24291 0.88459 0.26087 0.24291 0.25157 0.5677

Filter No Sample 0.8395 0.052632 0.971487 0.236364 0.052632 0.086093 0.5479

Info- Over 0.6041 0.57490 0.60896 0.19777 0.57490 0.29430 0.5465

Gain Under 0.5971 0.5870 0.5988 0.1970 0.5870 0.2950 0.5467

Both 0.614 0.59109 0.61779 0.20592 0.59109 0.30544 0.553

Wrapper No Sample 0.8242 0.12500 0.95117 0.31731 0.12500 0.17935 0.5871

Backward Over 0.7304 0.33449 0.80499 0.33449 0.28235 0.15856 0.5547

Selection Under 0.7078 0.34839 0.78223 0.24885 0.34839 0.29032 0.5509

Both 0.7476 0.25559 0.85074 0.26403 0.25559 0.25974 0.5546


4.5 Support Vector Machine

A similar set of experiments was also conducted for SVM. In general, the results with feature selection are similar to those of the model using all the variables. As with the other models, the class balancing techniques significantly improve the recall measure. The best result is obtained with the full dataset and the undersampling technique, with an accuracy of 0.63 and a recall of 0.58.

Table 4.6 Performance Results of SVM for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.8552 0.016194 0.995927 0.400000 0.016194 0.031128 0.6289

Under 0.6308 0.58300 0.63883 0.21302 0.58300 0.31203 0.5572

Both 0.6628 0.4777 0.6938 0.2074 0.4777 0.2892 0.5477

Filter No Sample 0.8564 0.000 1.0000 NA 0.000 NA 0.4278

Info- Over 0.6209 0.49798 0.64155 0.18894 0.49798 0.27394 0.5365

Gain Under 0.5983 0.57085 0.60285 0.19421 0.57085 0.28983 0.5438

Both 0.6203 0.4575 0.6477 0.1788 0.4575 0.2571 0.5278

Wrapper No Sample 0.8552 0.998 0.000 0.000 0.000 NA 0.4281

Backward Over 0.664 0.44939 0.69993 0.20072 0.44939 0.27750 0.5421

Selection Under 0.6052 0.51822 0.61982 0.18605 0.51822 0.27380 0.5354

Both 0.6529 0.47368 0.68296 0.20034 0.47368 0.28159 0.5443


4.6 XGBoost

We conducted the same experiments with XGBoost and obtained similar results: the class-balanced datasets had better sensitivity and recall, and there is not much difference in performance when feature selection is applied. The best result is obtained with InfoGain and the oversampling technique, with an accuracy of 0.597 and a recall of 0.595.

Table 4.7 Performance Results of XGBoost for 30 Days Readmission

Data Sets  Sampling Techniques  Accuracy  Sensitivity  Specificity  Precision  Recall  F-Measure  AUC

All No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Over 0.6105 0.56275 0.61847 0.19829 0.56275 0.29325 0.5462

Under 0.6267 0.57085 0.63612 0.20827 0.57085 0.30519 0.5533

Both 0.6634 0.44534 0.69993 0.19928 0.44534 0.27534 0.541

Filter No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Info- Over 0.5971 0.59514 0.59742 0.19865 0.59514 0.29787 0.5483

Gain Under 0.607 0.54656 0.61711 0.19313 0.54656 0.28541 0.5417

Both 0.6343 0.47368 0.66124 0.18994 0.47368 0.27115 0.5361

Wrapper No Sample 0.8564 0.0000 1.0000 NA 0.0000 NA NA

Backward Over 0.6186 0.54656 0.63069 0.19882 0.54656 0.29158 0.5456165

Selection Under 0.614 0.53846 0.62661 0.19473 0.53846 0.28602 0.5423983

Both 0.65 0.4777 0.6789 0.1997 0.4777 0.2816 0.5427006


4.7 Summary

We investigated different ML-based models to predict 30-day readmission and conclude that the ML-based models did not outperform the standard logistic regression model. In our study we took into consideration the class imbalance issue, which is frequently encountered in medical data.

We first applied the predictive models LR, DT, RF, NB, SVM and XGBoost without feature selection or balancing. The models predicted the majority class (0) for nearly every case, resulting in high accuracy (the highest was 85%). It was therefore crucial to tackle the class imbalance problem, so we used oversampling, undersampling and a combination of both, along with the feature selection techniques InfoGain, backward selection, embedded DT and embedded RF.

According to the results above, there was no significant difference in performance between the algorithms with respect to the evaluation metrics. For all the models, all the sampling techniques improve the recall value, which increases the ability of the model to correctly classify patients who are at higher risk of being readmitted; there was no effect on the AUC value. Moreover, the feature selection methods performed similarly in most of the models, except for Naïve Bayes, where the InfoGain feature selection method gave better recall and F-measure than backward selection. The best result was obtained with random forest using all the features and the undersampling technique, with a recall of 0.67 and an accuracy of 0.55.

Furthermore, the AUC is not the most appropriate measure for an imbalanced dataset; recall is more relevant when detecting patients who will be readmitted is the main goal.


CHAPTER 5

CONCLUSION AND FUTURE WORK

In this dissertation we study the issue of patient readmission. We analyzed a high dimensional dataset provided by Partners HealthCare covering a five-year period to predict all-cause hospital readmission within 30 days of discharge for patients discharged with HF. We developed, trained and tested different prediction models. The dataset has 61 features including patient demographics, conditions, medications, labs, procedures, vitals and emotional status (depression). We compared the results of six different classification algorithms, namely logistic regression, decision tree, random forest, naïve Bayes, SVM and XGBoost, with different attribute sets obtained by different feature selection methods. Since the dataset is class imbalanced, we also used different sampling methods and evaluated their effect on the models using appropriate measures such as recall and sensitivity, because accuracy can be misleading with imbalanced data; recall is more relevant when detecting patients who will be readmitted is the main goal. For all the models, the class-balanced dataset performed better: recall improved significantly when class balancing techniques, including oversampling, undersampling and a combination of both, were employed, while feature selection methods brought little improvement. Overall, the results from all the algorithms with the different subsets and sampling methods are similar to those reported in the literature. The best result from training and testing the models was obtained with random forest, with an accuracy of 0.55 and a recall of 0.67.

Our experiments confirm that preprocessing improves the predictive models. They also demonstrate the importance of reporting performance measures in addition to the AUC when predicting readmission on medical datasets with a highly imbalanced class distribution.

Furthermore, the use of different ML algorithms did not improve prediction over the standard logistic regression model for readmission of HF patients.

Limitations of this study include the readmission timeframe and the definition of the index discharge, which may result in some loss of information and limit generalizability.

Additionally, patients may have had a previous admission before the index date and may have had a readmission either before or after the data collection period. Some important features were not available in this study, such as lifestyle habits (exercise, smoking) and socioeconomic status. Living with HF is challenging and cannot be understood from reviewing medical data alone. Besides the objective (medical) data, subjective data such as interviews and surveys are needed to capture the occurrence of readmission among patients with HF.

As future work, the presented classifiers will be trained on larger datasets, with a focus on further improving predictive ability through more advanced methods.

Next, we aim to integrate these models with the monitoring system to help clinicians manage their patients.


APPENDIX A: Supplementary Appendix

Crosstabulation of the categorical variables by the readmission flag (ReadmissionFLG), with Chi-square test P-values

Variables Readmission FLG P-Value Values No Yes PHSPayer Commercial 979 (57.93) 711 (42.07) .142 CategoryDSC Government 3821 (55.95) 3008 (44.05) AdmissionSource Admit from observation 209 (57.42) 155 (42.58) <.0001> CommonDSC Emergency room 1682 (58.08) 1214 (41.92) Op dept/clinic/physician referral 1492 (54.95) 1223 (45.05) Outside health care facility 9 (47.37) 10 (52.63) Outside hospital 127 (61.06) 81 (38.94) Physician or clinic referral 209 (52.12) 192 (47.88) Self-referral 631 (54.21) 533 (45.79) Skilled nursing facility 18 (40.91) 26 (59.09) Transfer from acute hospital 471 (64.17) 263 (35.83) Transfer from non-acute facility 61 (51.69) 57 (48.31) AdmissionService /Not specified 80 (51.95) 74 (48.05) .73 CommonDSC Cardiac surgery 40 (51.28) 38 (48.72) Cardiology 1103 (57.63) 811 (42.37) Emergency medicine 722 (53.96) 616 (46.04) Medicine 2869 (57.19) 2148 (42.81) Oncology 19 (55.88) 15 (44.12) Pulmonology 11 (39.29) 17 (60.71) Renal medicine 7 (43.75) 9 (56.25) Surgery 52 (65) 28 (35) ServiceLine Surgery 26 (56.52) 20 (43.48) .934 CommonDSC Cardiac 4804 (56.58) 3686 (43.42) Onco 31 (58.49) 22 (41.51) Vasclr 52 (59.77) 35 (40.23) ServiceLine Surgery 26 (56.52) 20 (43.48) <.0001> SubServiceDSC Clinical 3665 (55.56) 2932 (44.44) EP & Arrhythmias 286 (70.27) 121 (29.73) Invasive 721 (59) 501 (41) Medical 25 (58.14) 18 (41.86) Surgery 129 (49.81) 130 (50.19) Vascular 52 (59.77) 35 (40.23)


DischargeDisposition Acute Hospital 32 (22.54) 110 (77.46) <.0001> CommonDSC Discharge to Institution/shelter/care 9 (60) 6 (40) Home 2133 (61.86) 1315 (38.14) Home Care 1918 (53.34) 1678 (46.66) Hospice 33 (73.33) 12 (26.67) Left Against Medical Advice 40 (52.63) 36 (47.37) Long Term Care 88 (50.29) 87 (49.71) Rehab Facility 99 (55.93) 78 (44.07) Skilled Nursing Facility 556 (56.45) 429 (43.55)


Variables Readmission FLG P-Value Values No Yes DischargeService CARDIAC SURGERY 98 (46.89) 111 (53.11) .095 CommonDSC CARDIOLOGY 1133 (57.37) 842 (42.63) MEDICINE 3563 (56.74) 2717 (43.26) ONCOLOGY 27 (61.36) 17 (38.64) RENAL MEDICINE 10 (50) 10 (50) SURGERY 66 (55.46) 53 (44.54) ICD9DiagnosisDSC AC DIASTOLIC HRT FAILURE 290 (56.86) 220 (43.14) <.0001> AC ON CHR DIAST HRT FAIL 1193 (54.6) 992 (45.4) AC ON CHR SYST HRT FAIL 950 (55.04) 776 (44.96) AC SYST/DIASTOL HRT FAIL 45 (73.77) 16 (26.23) AC SYSTOCIL HRT FAILURE 312 (64.73) 170 (35.27) AC/CHR SYST/DIA HRT FAIL 220 (55.84) 174 (44.16) CHR DIASTOLIC HRT FAIL 35 (49.3) 36 (50.7) CHR SYST/DIASTL HRT FAIL 4 (28.57) 10 (71.43) CHR SYSTOLIC HRT FAILURE 39 (53.42) 34 (46.58) CONGESTIVE HEART FAIL,UNSPECIF 1245 (57.24) 930 (42.76) DIASTOLC HRT FAILURE NOW 119 (65.38) 63 (34.62) HYP HRT/REN NOS W/HRT FAILURE 216 (53.47) 188 (46.53) HYP HRT/REN NOS W/HRT FLR&KIDN 38 (46.91) 43 (53.09) HYPERTEN HEART DIS W CHF 116 (73.89) 41 (26.11) MAL HYP HTR/REN W/ HRT FLR&W/ 9 (56.25) 7 (43.75) UNSPEC SYST & DIAST HEART FAIL 7 (41.18) 10 (58.82) UNSPEC SYSTOLIC HEART FAILURE 62 (60.19) 41 (39.81) 89

Readmission FLG P-Value Variables Values No Yes PHSDRG Others 2 (22.22) 7 (77.78) <.0001> DSC Circulatory System Procedures 7 (50) 7 (50) Vascular Procedures 9 (64.29) 5 (35.71) VASCULAR PROCEDURES W CC 8 (61.54) 5 (38.46) VASCULAR PROCEDURES W MAJOR CC 56 (58.95) 39 (41.05) 125 Cardiac Catheterization w/ Circ Disord Exc Ischemic Heart Disease 188 (60.06) (39.94) CARDIAC DEFIB IMPLANT W CARDIAC CATH W AMI/HF/SHOCK 24 (75) 8 (25) Cardiac Defibrillator & Heart Assist Anomaly 49 (53.85) 42 (46.15) CARDIAC DEFIBRILLATOR W/O CARDIAC CATHETER 91 (73.39) 33 (26.61) CARDIAC VALVE OR CARDIAC DEFIB IMPLANT PROCEDURE W MAJOR CC 57 (54.29) 48 (45.71) 936 CHF & CARDIAC ARRHYTHMIA W MAJOR CC 1069 (53.32) (46.68) 181 CIRC DISORDERS EXCEPT AMI, W CARD CATH & COMPLEX DIAG 298 (62.21) (37.79) CIRCULATORY DISORDERS W AMI & MAJOR COMP, DISCHARGED ALIVE 39 (68.42) 18 (31.58) ECMO OR TRACH W MV 96+ HR OR TRACH W PDX EXC FACE/MTH/NCK DX 6 (42.86) 8 (57.14) EXTEN O.R. PROCEDURE UNRELATED TO PRINCIPAL DIAGNOSIS 9 (64.29) 5 (35.71) Heart &/or Lung Transplant 8 (50) 8 (50) 1010 Heart Failure 1157 (53.39) (46.61) 1132 HEART FAILURE & SHOCK 1597 (58.52) (41.48) HEART TRANSPLANT 29 (58) 21 (42) MAJOR CARDIOVASCULAR PROCEDURES W MAJOR CC 59 (62.11) 36 (37.89) NON-EXTENSIVE O.R. PROC UNRELATED TO PRINCIPAL DIAGNOSIS 6 (54.55) 5 (45.45) PERCUTANEOUS CARDIOVAS PROC W DRUG ELUTING STENT W/O AMI 9 (69.23) 4 (30.77) PERCUTANEOUS CARDIOVASC PROC W AMI, HF OR SHOCK 13 (59.09) 9 (40.91) Percutaneous Cardiovascular Procedures w/o AMI 21 (51.22) 20 (48.78) Permanent Cardiac Pacemaker Implant w/ AMI, Heart Failure or Shock 13 (72.22) 5 (27.78) PRM CARD PACEM IMPL W AMI, HRT FAIL OR SHK, OR AICD LEAD OR GN 24 (75) 8 (25) 90

Variables Readmission FLG P-Value Values No Yes X.depression 0 3997 (61.09) 2546 (38.91) <.0001> 1 920 (42.97) 1221 (57.03) Hyperlipidemia 0 1726 (64.86) 935 (35.14) <.0001> 1 3191 (52.98) 2832 (47.02) Ischemic.cardiomyopathy 0 3461 (62.17) 2106 (37.83) <.0001> 1 1456 (46.71) 1661 (53.29) Atrial.fibrillation 0 2344 (60.03) 1561 (39.97) <.0001> 1 2573 (53.84) 2206 (46.16) COPD.chronic.bronchitis.and.asthma 0 3319 (62.04) 2031 (37.96) <.0001> 1 1598 (47.93) 1736 (52.07) OutOfRangeCD.GFR (estimated) 0 950 (55.88) 750 (44.12) <.0001>

Low 661 (52.09) 608 (47.91)

Normal 3200 (58.03) 2314 (41.97) UknownAbnormal 106 (52.74) 95 (47.26) OutOfRangeCD.Potassium 0 966 (55.84) 764 (44.16) <.001>

High 133 (50.96) 128 (49.04) Low 185 (48.3) 198 (51.7) Normal 3633 (57.58) 2676 (42.42) OutOfRangeCD.Creatinine 0 988 (56.26) 768 (43.74) <.0001>

High 1485 (51.65) 1390 (48.35) Low 36 (57.14) 27 (42.86) Normal 2408 (60.37) 1581 (39.63) OutOfRangeCD.Sodium 0 999 (56.19) 779 (43.81) <.002>

High 37 (51.39) 35 (48.61) Low 758 (52.79) 678 (47.21) Normal 3123 (57.87) 2274 (42.13) OutOfRangeCD.BUN 0 993 (56.23) 773 (43.77) <.0001>

High 2233 (54.68) 1851 (45.32)

Low 7 (38.89) 11 (61.11) Normal 1684 (59.8) 1132 (40.2) OutOfRangeCD.Chloride 0 1007 (56.29) 782 (43.71) <.001>

High 30 (48.39) 32 (51.61) Low 1710 (54.46) 1430 (45.54) Normal 2170 (58.78) 1522 (41.22)


Variables Readmission FLG P-Value Values No Yes OutOfRangeCD.Carbon Dioxide 0 1011 (56.35) 783 (43.65) .06

High 611 (55) 500 (45)

Low 264 (52.91) 235 (47.09) Normal 3031 (57.42) 2248 (42.58) OutOfRangeCD.Anion Gap 0 1036 (56.4) 801 (43.6) .54

High 264 (55.35) 213 (44.65) Low 94 (60.26) 62 (39.74) Normal 3523 (56.7) 2690 (43.3) OutOfRangeCD.HCT 0 1929 (58.67) 1359 (41.33) <.0001>

High 32 (47.76) 35 (52.24)

Low 1923 (53.02) 1704 (46.98) Normal 1033 (60.69) 669 (39.31) OutOfRangeCD.Glucose 0 1384 (56.15) 1081 (43.85) .337

High 1584 (55.95) 1247 (44.05)

Low 51 (51.52) 48 (48.48) Normal 1898 (57.71) 1391 (42.29) OutOfRangeCD.PLT 0 1945 (58.64) 1372 (41.36) <.024>

High 131 (59.01) 91 (40.99) Low 414 (56.1) 324 (43.9) Normal 2426 (55.06) 1980 (44.94) OutOfRangeCD.WBC 0 1944 (58.61) 1373 (41.39) <..004>

High 389 (52.85) 347 (47.15)

Low 148 (54.21) 125 (45.79) Normal 2436 (55.9) 1922 (44.1) OutOfRangeCD.MCHC 0 1944 (58.64) 1371 (41.36) <.001>

High 4 (57.14) 3 (42.86)

Low 787 (52.4) 715 (47.6) Normal 2182 (56.53) 1678 (43.47) OutOfRangeCD.MCH 0 1946 (58.65) 1372 (41.35) <.005>

High 302 (52.43) 274 (47.57)

Low 552 (54.01) 470 (45.99) Normal 2117 (56.18) 1651 (43.82) OutOfRangeCD.MCV 0 1946 (58.65) 1372 (41.35) <.012>

High 363 (54.42) 304 (45.58)

Low 245 (52.69) 220 (47.31) Normal 2363 (55.81) 1871 (44.19)


Variables Readmission FLG P-Value Values No Yes OutOfRangeCD.RBC 0 1946 (58.65) 1372 (41.35) <.0001>

High 28 (48.28) 30 (51.72)

Low 1831 (53.34) 1602 (46.66) Normal 1112 (59.31) 763 (40.69) OutOfRangeCD.RDW 0 1957 (58.58) 1384 (41.42) <.0001>

High 1731 (52.36) 1575 (47.64)

Normal 1227 (60.32) 807 (39.68) OutOfRangeCD.Hgb 0 2059 (58.05) 1488 (41.95) <.0001>

High 18 (52.94) 16 (47.06)

Low 1912 (53.36) 1671 (46.64) Normal 928 (61.05) 592 (38.95) OutOfRangeCD.Calcium 0 1572 (56.08) 1231 (43.92) .743

High 36 (59.02) 25 (40.98)

Low 626 (55.79) 496 (44.21) Normal 2683 (57.11) 2015 (42.89) OutOfRangeCD.Magnesium 0 2481 (57.25) 1853 (42.75) <.006>

High 162 (50.31) 160 (49.69) Low 76 (48.41) 81 (51.59) Normal 2198 (56.8) 1672 (43.2) OutOfRangeCD.PT 0 2956 (57.45) 2189 (42.55) <.0001>

High 1620 (54.95) 1328 (45.05)

Normal 339 (57.65) 249 (42.35) BloodPressure High Stage 1 18 (72) 7 (28) .291

High stage 2 4874 (56.59) 3739 (43.41)

Normal 23 (54.76) 19 (45.24)


Independent samples t-tests comparing the continuous variables across the readmission flag (equality of variances examined using Levene's test)

Variable                                    No: Mean (SD)    Yes: Mean (SD)   P-Value
AgeYrDeident                                70.73 (14.522)   68.21 (15.422)   <.0001
EmergencySeverIndexNBR                      1.99 (3.313)     1.86 (3.223)     <.05
EmergencyChargeFLG                          0.75 (0.434)     0.78 (0.416)     <.001
PHSReadmission30DayFLG                      0.01 (0.088)     0.37 (0.483)     <.0001
LengthOfStayNBR                             6.64 (8.922)     7.94 (13.765)    <.0001
ClinicalRoutineDaysNBR                      3.06 (4.539)     3.48 (5.396)     <.0001
ClinicalICUDaysNBR                          0.43 (2.968)     0.88 (5.973)     <.0001
ClinicalObservationDaysNBR                  0.05 (0.275)     0.05 (0.268)     .532
ClinicalOperativeDaysNBR                    0.04 (0.321)     0.07 (0.466)     <.0001
NumberOfProcedures                          1.49 (2.717)     1.54 (2.959)     .478
NumberOfDiagnoses                           13.94 (4.18)     14.76 (4.167)    <.0001
StrengthAMT.Loop diuretics                  0.13 (0.337)     0.13 (0.338)     .893
StrengthAMT.Cardioselective beta blockers   0.13 (0.337)     0.13 (0.338)     .893
StrengthAMT.Statins                         0.13 (0.337)     0.13 (0.338)     .893
StrengthAMT.Salicylates                     0.13 (0.337)     0.13 (0.338)     .893
StrengthAMT.Minerals and electrolytes       0.13 (0.337)     0.13 (0.338)     .893
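The tests reported in these appendix tables can be reproduced with base R and the car package; a brief sketch follows (appendix_df and the column names used here stand in for the analysis dataset and are assumptions for illustration):

library(car)   # leveneTest

# Chi-square test of association between a categorical variable and the readmission flag
tab <- table(appendix_df$DischargeDispositionCommonDSC, appendix_df$ReadmissionFLG)
chisq.test(tab)

# Independent-samples t-test for a continuous variable across the readmission groups;
# Levene's test is used to decide whether equal variances can be assumed
lev <- leveneTest(LengthOfStayNBR ~ factor(ReadmissionFLG), data = appendix_df)
equal_var <- lev[1, "Pr(>F)"] > 0.05
t.test(LengthOfStayNBR ~ ReadmissionFLG, data = appendix_df, var.equal = equal_var)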


Data Dictionary

Variables Notes 1 PatientKEY The coded facility encounter / patient account number of this record, generated by PCH for deidentification. Unique for each encounter. E.g. HIT12345-1, HIT12345-2. 2 AgeYrDeident "Age in whole years, calculated by subtracting the discharge date from the birth date of the patient. DOB not provided to Hitachi. Discharge date to be date-shifted for deidentification." 3 IndexFacilityCD "Code used to differentiate the hospital or entity associated with the encounter, transformed from original FacilityCD (e.g. MGH, FKH) by PCH for deidentification. E.g. F1, F2, etc." 4 EmergencySeverIndexNBR "Emergency severity index, assigned values of 1 - 9. 1 = ESI Level 1, 2 = ESI Level 2, etc." 5 EmergencyChargeFLG This field holds a flag, 0 and 1. 1 if the patient admitted by emergency department 6 DaysToReadmissionNBR The days between the discharge date of the index encounter and the admission date of the subsequent readmission encounter, inclusive. 7 ReadmissionFLG "This flag is set to 1 when a subsequent readmission occurred for this patient within 180 days of the discharge date. Otherwise, this flag is equal to 0. When this flag is 1, the Readmit fields are valid, otherwise the contents should be ignored" 95

8 PHSReadmission30DayFLG This is used to identify 30-day readmissions 9 IndexAMIConditionFLG This field holds a flag, 0 - not AMI condition or 1 - is AMI condition 10 IndexCHFConditionFLG This CHF field holds a flag, 0 - not CHF condition or 1 - is CHF comndition 11 IndexPneumoniaConditionFLG This field holds a flag, 0 - not Pneumonia condition or 1 - is Pneumonia condition 12 ReadmitFacilityCD "Readmiditted FacilityCD (e.g. MGH, FKH) by PCH for deidentification. E.g. F1, F2, etc." 13 ReadmitAMIConditionFLG This field holds a flag, 0 and 1. 1 if the patient got readmitted for AMI condition 14 ReadmitCHFConditionFLG This field holds a flag, 0 and 1. 1 if the patient got readmitted for CHF condition 15 ReadmitPneumoniaConditionFLG This field holds a flag, 0 and 1. 1 if the patient got readmitted for Pneumonia condition 16 PHSPayerCategoryDSC Highest level grouping of the PHSPayer rollups: Commercial, Government, Other. 17 LengthOfStayNBR "Length of stay in days, calculated as discharge date minus admit date for inpatient, same day surgery and admit to observation" 18 ClinicalRoutineDaysNBR Number of days of admission recorded as routine. 19 ClinicalICUDaysNBR Number of days of admission recorded as ICU. 20 ClinicalObservationDaysNBR Number of days of admission recorded as observation. 21 ClinicalOperativeDaysNBR Number of days of admission recorded as operative. 22 AdmissionSourceCommonDSC Description associated with the admit source code 23 AdmissionServiceCommonDSC Description associated with the admit service code 96

24 ServiceLineCommonDSC Description associated with the service line assigned to the encounter 25 ServiceLineSubServiceDSC Description associated with the service line subservice 26 DischargeDispositionCommonDSC Description associated with the discharge disposition code 27 DischargeServiceCommonDSC Description associated with the discharge service code 28 PrincipalICD9DiagnosisCD Coded ICD9 diagnosis code 29 ICD9DiagnosisDSC "Description of the associated ICD9/ICD10 diagnosis, provided for up to 20 listed diagnoses for each encounter , where applicable." 30 ICD9DiagnosisCategoryDSC "Description of the rollup category associated with the ICD9/ICD10 diagnosis, provided for up to 20 listed diagnoses for each encounter, where applicable." 31 PHSDRG Diagnosis-related group is a system to classify hospital cases for payment 32 PHSDRGDSC Description of DRG group 33 PayerPlan01DSC Payers :Commercial, Government, Other. 34 GenderCD Patient gender 35 DeathFLG Value = 1 if patient is deceased, = 0 if patient is alive at time of data extraction. 36 LanguageGRP Patient Language Description 37 RaceGRP Patient Race 38 MaritalGRP Patient Marital Status 39 EducationGRP Education status 40 EmploymentGRP Last known employment status of patient, e.g. Employed, Retired, Disability, Unemployed, Unknown. 41 NumberOfProcedures Derived value equal to the number of ICD procedures associated with the encounter. 42 PrincipalICD9ProcedureCD Coded ICD9/ICD10 procedure code, provided for up to 20 97

listed procedures for each encounter, where applicable. 43 Hypertension History, created feature 44 diabetes History, created feature 45 X.depression History, created feature 46 Hyperlipidemia History, created feature 47 Ischemic.cardiomyopathy History, created feature 48 Atrial.fibrillation History, created feature 49 COPD.chronic.bronchitis.and.asthma History, created feature 50 HeartRateNBR Vitals value 51 WeightPoundsNBR Vitals value 52 HeightNBR Vitals value 53 BodyMassIndexNBR Vitals value 54 BloodPressureSystolicNBR Vitals value 55 BloodPressureDiastolicNBR Vitals value 56 TemperatureFahrenheitNBR Vitals value 57 RespiratoryVAL Vitals value inpatientvisits Created feature; number of inpatient vists 58 StrengthAMT.Aldosterone.receptor.antagonists Medication flag (0,1) 59 StrengthAMT.Miscellaneous.anxiolytics..sedatives. Medication flag (0,1) and.hypnotics 60 StrengthAMT.Non.cardioselective.beta.blockers Medication flag (0,1) 61 StrengthAMT.Angiotensin.receptor.blockers Medication flag (0,1) 62 StrengthAMT.Nutraceutical.products Medication flag (0,1) 63 StrengthAMT.Loop.diuretics Medication flag (0,1) 64 StrengthAMT.Vitamin.and.mineral.combinations Medication flag (0,1) 65 StrengthAMT.Minerals.and.electrolytes Medication flag (0,1) 66 StrengthAMT.Quinolones Medication flag (0,1) 67 StrengthAMT.Selective.serotonin.reuptake.inhibito Medication flag (0,1) rs 68 StrengthAMT.Agents.for.pulmonary.hypertension Medication flag (0,1) 69 StrengthAMT.Adrenergic.bronchodilators Medication flag (0,1) 70 StrengthAMT.Calcium.channel.blocking.agents Medication flag (0,1) 71 StrengthAMT.Statins Medication flag (0,1) 72 StrengthAMT.Platelet.aggregation.inhibitors Medication flag (0,1) 73 StrengthAMT.Azole.antifungals Medication flag (0,1) 74 StrengthAMT.Gamma.aminobutyric.acid.analogs Medication flag (0,1) 75 StrengthAMT.Thiazide.diuretics Medication flag (0,1) 98

76 StrengthAMT.Angiotensin.Converting.Enzyme.Inh Medication flag (0,1) ibitors 77 StrengthAMT.Cardioselective.beta.blockers Medication flag (0,1) 78 StrengthAMT.Miscellaneous.analgesics Medication flag (0,1) 79 StrengthAMT.Antigout.agents Medication flag (0,1) 80 StrengthAMT.Salicylates Medication flag (0,1) 81 StrengthAMT.Macrolides Medication flag (0,1) 82 StrengthAMT.Antianginal.agents Medication flag (0,1) 83 StrengthAMT.Proton.pump.inhibitors Medication flag (0,1) 84 StrengthAMT.Phenylpiperazine.antidepressants Medication flag (0,1) 85 StrengthAMT.Serotonin.norepinephrine.reuptake.i Medication flag (0,1) nhibitors 86 StrengthAMT.Iron.products Medication flag (0,1) 87 StrengthAMT.Inhaled.corticosteroids Medication flag (0,1) 88 StrengthAMT.Laxatives Medication flag (0,1) 89 StrengthAMT.Thyroid.drugs Medication flag (0,1) 90 StrengthAMT.Narcotic.analgesics Medication flag (0,1) 91 StrengthAMT.Vitamins Medication flag (0,1) 92 StrengthAMT.Ophthalmic.glaucoma.agents Medication flag (0,1) 93 StrengthAMT.Nonsteroidal.anti.inflammatory.agen Medication flag (0,1) ts 94 StrengthAMT.GI.stimulants Medication flag (0,1) 95 StrengthAMT.Glucocorticoids Medication flag (0,1) 96 StrengthAMT.Phenothiazine.antiemetics Medication flag (0,1) 97 StrengthAMT.Topical.steroids Medication flag (0,1) 98 StrengthAMT.Purine.nucleosides Medication flag (0,1) 99 StrengthAMT.Group.III.antiarrhythmics Medication flag (0,1) 100 StrengthAMT.H2.antagonists Medication flag (0,1) 101 StrengthAMT.Heparins Medication flag (0,1) 102 StrengthAMT.Group.I.antiarrhythmics Medication flag (0,1) 103 StrengthAMT.Antiadrenergic.agents..peripherally.a Medication flag (0,1) cting 104 StrengthAMT.Atypical.antipsychotics Medication flag (0,1) 105 StrengthAMT.Insulin Medication flag (0,1) 106 StrengthAMT.Vasodilators Medication flag (0,1) 107 StrengthAMT.Topical.anesthetics Medication flag (0,1) 108 StrengthAMT.Antihistamines Medication flag (0,1) 109 StrengthAMT.Narcotic.analgesic.combinations Medication flag (0,1) 110 StrengthAMT.Anticholinergic.bronchodilators Medication flag (0,1) 111 StrengthAMT.Benzodiazepines Medication flag (0,1) 99

112 StrengthAMT.Sulfonamides Medication flag (0,1) 113 StrengthAMT.Aminopenicillins Medication flag (0,1) 114 StrengthAMT.Coumarins.and.indandiones Medication flag (0,1) 115 StrengthAMT.First.generation.cephalosporins Medication flag (0,1) 116 StrengthAMT.Group.V.antiarrhythmics Medication flag (0,1) 117 StrengthAMT.Non.sulfonylureas Medication flag (0,1) 118 StrengthAMT.Topical.antifungals Medication flag (0,1) 119 StrengthAMT.Tetracyclic.antidepressants Medication flag (0,1) 120 StrengthAMT.Factor.Xa.inhibitors Medication flag (0,1) 121 StrengthAMT.Fibric.acid.derivatives Medication flag (0,1) 122 StrengthAMT.Miscellaneous.GI.agents Medication flag (0,1) 123 StrengthAMT.Bronchodilator.combinations Medication flag (0,1) 124 StrengthAMT.Dopaminergic.antiparkinsonism.age Medication flag (0,1) nts 125 StrengthAMT.Selective.immunosuppressants Medication flag (0,1) 126 StrengthAMT.Phosphate.binders Medication flag (0,1) 127 StrengthAMT.Sulfonylureas Medication flag (0,1) 128 StrengthAMT.Miscellaneous.antibiotics Medication flag (0,1) 129 StrengthAMT.Calcineurin.inhibitors Medication flag (0,1) 130 StrengthAMT.Otic.steroids.with.anti.infectives Medication flag (0,1) 131 StrengthAMT.Topical.acne.agents Medication flag (0,1) 132 StrengthAMT.5HT3.receptor.antagonists Medication flag (0,1) 133 StrengthAMT.Glycopeptide.antibiotics Medication flag (0,1) 134 StrengthAMT.Antiseptic.and.germicides Medication flag (0,1) 135 StrengthAMT.Antirheumatics Medication flag (0,1) 136 StrengthAMT.Cardiac.stressing.agents Medication flag (0,1) 137 StrengthAMT.Anticholinergic.antiparkinson.agents Medication flag (0,1) 139 OutOfRangeCD.Albumin "Labartory Range @ no out of range code was assigned, so the value is assumed to be normal. A Abnormal H High L Low U abnormal but uknown range 0 is replaced for NA " 140 OutOfRangeCD.Calcium Labartory Range 141 OutOfRangeCD.Globulin Labartory Range 142 OutOfRangeCD.Total.Protein Labartory Range 143 OutOfRangeCD.Alkaline.Phosphatase Labartory Range 144 OutOfRangeCD.Transaminase.SGPT Labartory Range 145 OutOfRangeCD.Transaminase.SGOT Labartory Range 146 OutOfRangeCD.Total.Bilirubin Labartory Range 100

147 OutOfRangeCD.Plasma.Anion.GAP Labartory Range 148 OutOfRangeCD.Plasma.Urea.Nitrogen Labartory Range 149 OutOfRangeCD.Plasma.Carbon.Dioxide Labartory Range 150 OutOfRangeCD.Plasma.Chloride Labartory Range 151 OutOfRangeCD.Plasma.Creatinine Labartory Range 152 OutOfRangeCD.eGFR Labartory Range 153 OutOfRangeCD.Plasma.Glucose Labartory Range 154 OutOfRangeCD.Plasma.Potassium Labartory Range 155 OutOfRangeCD.Plasma.Sodium Labartory Range 156 OutOfRangeCD.HCT Labartory Range 157 OutOfRangeCD.HGB Labartory Range 158 OutOfRangeCD.MCH Labartory Range 159 OutOfRangeCD.MCHC Labartory Range 160 OutOfRangeCD.MCV Labartory Range 161 OutOfRangeCD.PLT Labartory Range 162 OutOfRangeCD.RBC Labartory Range 163 OutOfRangeCD.RDW Labartory Range 164 OutOfRangeCD.WBC Labartory Range 165 OutOfRangeCD.NT.proBNP Labartory Range 166 OutOfRangeCD.Absolute.Basos Labartory Range 167 OutOfRangeCD.Absolute.EOS Labartory Range 168 OutOfRangeCD.Absolute.Lymphs Labartory Range 169 OutOfRangeCD.Absolute.Monos Labartory Range 170 OutOfRangeCD.Absolute.Neuts Labartory Range 171 OutOfRangeCD.Basos Labartory Range 172 OutOfRangeCD.EOS Labartory Range 173 OutOfRangeCD.Lymphs Labartory Range 174 OutOfRangeCD.Monos Labartory Range 175 OutOfRangeCD.Poly Labartory Range 176 OutOfRangeCD.PT Labartory Range 177 OutOfRangeCD.PT.INR Labartory Range 178 OutOfRangeCD.APTT Labartory Range 179 OutOfRangeCD.UA.pH Labartory Range 180 OutOfRangeCD.UA.Specific.Gravity Labartory Range 181 OutOfRangeCD.Base.Excess Labartory Range 182 OutOfRangeCD.Potassium Labartory Range 183 OutOfRangeCD.Troponin.T Labartory Range 184 OutOfRangeCD.Magnesium Labartory Range 185 OutOfRangeCD.Phosphorus Labartory Range 101

186 OutOfRangeCD.Superstat.PT Labartory Range 187 OutOfRangeCD.Superstat.PT.INR Labartory Range 188 OutOfRangeCD.Superstat.APTT Labartory Range 189 OutOfRangeCD.Lipase Labartory Range 190 OutOfRangeCD.Direct.Bilirubin Labartory Range 191 OutOfRangeCD.CALCIUM Labartory Range 192 OutOfRangeCD.MAGNESIUM Labartory Range 193 OutOfRangeCD.PHOSPHOROUS Labartory Range 194 OutOfRangeCD.ANION.GAP Labartory Range 195 OutOfRangeCD.BUN Labartory Range 196 OutOfRangeCD.CARBON.DIOXIDE Labartory Range 197 OutOfRangeCD.CHLORIDE Labartory Range 198 OutOfRangeCD.CREATININE Labartory Range 199 OutOfRangeCD.GLUCOSE Labartory Range 200 OutOfRangeCD.SODIUM Labartory Range 201 OutOfRangeCD.PROTIME Labartory Range 202 OutOfRangeCD.INR Labartory Range 203 OutOfRangeCD.ABS.NEUT.COUNT Labartory Range 204 OutOfRangeCD.BASOPHIL Labartory Range 205 OutOfRangeCD.EOSINOPHIL Labartory Range 206 OutOfRangeCD.LYMPHOCYTE Labartory Range 207 OutOfRangeCD.MONOCYTE Labartory Range 208 OutOfRangeCD.NEUTROPHIL Labartory Range 209 OutOfRangeCD.HEMOGLOBIN Labartory Range 210 OutOfRangeCD.PLATELET Labartory Range 211 OutOfRangeCD.RDW.CV Labartory Range 212 OutOfRangeCD.POTASSIUM Labartory Range 213 OutOfRangeCD.CK Labartory Range 214 OutOfRangeCD.GLUCOSE.POC Labartory Range 215 OutOfRangeCD.ALBUMIN Labartory Range 216 OutOfRangeCD.LIPASE Labartory Range 217 OutOfRangeCD.TOTAL.PROTEIN Labartory Range 218 OutOfRangeCD.CHOLESTEROL Labartory Range 219 OutOfRangeCD.HDL Labartory Range 220 OutOfRangeCD.ALK.PHOS Labartory Range 221 OutOfRangeCD.ALT Labartory Range 222 OutOfRangeCD.AST Labartory Range 223 OutOfRangeCD.BILIRUBIN.TOTAL Labartory Range 224 OutOfRangeCD.FERRITIN Labartory Range 102

225 OutOfRangeCD.IRON Labartory Range 226 OutOfRangeCD.TIBC Labartory Range 227 OutOfRangeCD.UIBC Labartory Range 228 OutOfRangeCD.HGB.A1C Labartory Range 229 OutOfRangeCD.GLU.POC Labartory Range 230 OutOfRangeCD.GLOBULIN Labartory Range 231 OutOfRangeCD.TOT.PROT Labartory Range 232 OutOfRangeCD.ALT.GPT Labartory Range 233 OutOfRangeCD.AST.GOT Labartory Range 234 OutOfRangeCD.TOT.BILI Labartory Range 235 OutOfRangeCD.UREA.N Labartory Range 236 OutOfRangeCD.TOTAL.CO2 Labartory Range 237 OutOfRangeCD..BASO.. Labartory Range 238 OutOfRangeCD..EOS.. Labartory Range 239 OutOfRangeCD..LYMP.. Labartory Range 240 OutOfRangeCD..MONO.. Labartory Range 241 OutOfRangeCD..NEUT.. Labartory Range 242 OutOfRangeCD..BASO...1 Labartory Range 243 OutOfRangeCD..EOS...1 Labartory Range 244 OutOfRangeCD..LYMP...1 Labartory Range 245 OutOfRangeCD..MONO...1 Labartory Range 246 OutOfRangeCD..NEUT...1 Labartory Range 247 OutOfRangeCD.RETIC.CT Labartory Range 248 OutOfRangeCD.CLDL Labartory Range 249 OutOfRangeCD.TRIGLYCERIDES Labartory Range 250 OutOfRangeCD.VLDL Labartory Range 251 OutOfRangeCD.TSH Labartory Range 252 OutOfRangeCD.PT.INR. Labartory Range 253 OutOfRangeCD.PTT Labartory Range 254 OutOfRangeCD..GLOM.FILT.RATE.AFRICAN. Labartory Range AMERIC 255 OutOfRangeCD..GLOM.FILT.RATE.NON.AFRI Labartory Range CAN.AM 256 OutOfRangeCD.GLUCOSE..SERUM Labartory Range 257 OutOfRangeCD.SEGMENTED.NEUTROPHIL Labartory Range 258 OutOfRangeCD.MPV Labartory Range 259 OutOfRangeCD.PLATELET.COUNT Labartory Range 260 OutOfRangeCD.PROTHROMBIN.TIME Labartory Range 261 OutOfRangeCD.BAND.NEUTROPHIL Labartory Range 103

262 OutOfRangeCD.PH Labartory Range 263 OutOfRangeCD.URIC.ACID Labartory Range 264 OutOfRangeCD.HEMATOCRIT Labartory Range 265 OutOfRangeCD.C.REACTIVE.PROTEIN Labartory Range 266 OutOfRangeCD.BASE.X Labartory Range 267 OutOfRangeCD.TCO2 Labartory Range 268 OutOfRangeCD.Hct Labartory Range 269 OutOfRangeCD.HgB Labartory Range 270 OutOfRangeCD.SO2.calc. Labartory Range 271 OutOfRangeCD.PCO2 Labartory Range 272 OutOfRangeCD.pH Labartory Range 273 OutOfRangeCD.PO2 Labartory Range 274 OutOfRangeCD.CKMB.QUANT Labartory Range 275 OutOfRangeCD.TROPONIN.T Labartory Range 276 OutOfRangeCD.LACTIC.ACID Labartory Range 277 OutOfRangeCD.PHOSPHATE Labartory Range 278 OutOfRangeCD.HYAL.CAST Labartory Range 279 OutOfRangeCD.SP.GRV Labartory Range 280 OutOfRangeCD.LDH Labartory Range 281 OutOfRangeCD.UREA.NITROGEN Labartory Range 282 OutOfRangeCD.BAND Labartory Range 283 OutOfRangeCD.BASO Labartory Range 284 OutOfRangeCD.LYMPH Labartory Range 285 OutOfRangeCD.ATYP Labartory Range 286 OutOfRangeCD.MONO Labartory Range 287 OutOfRangeCD.POLY Labartory Range 288 OutOfRangeCD.HAPTOGLOBIN Labartory Range 289 OutOfRangeCD.METAMY Labartory Range 290 OutOfRangeCD.MYELO Labartory Range 291 OutOfRangeCD.Bands Labartory Range 292 OutOfRangeCD.DIR.BILI Labartory Range 293 OutOfRangeCD.IOCA Labartory Range 294 OutOfRangeCD.PREALBUMIN Labartory Range 295 OutOfRangeCD.ESR Labartory Range 296 OutOfRangeCD.VANCOMYCIN Labartory Range 297 OutOfRangeCD.LYMPHS Labartory Range 298 OutOfRangeCD.MEAN.CORPUSCULAR.HGB Labartory Range 299 OutOfRangeCD.MEAN.CORPUSCULAR.HGB.C Labartory Range ONC 104

300 OutOfRangeCD.MEAN.CORPUSCULAR.VOLU Labartory Range ME 301 OutOfRangeCD.RED.BLOOD.COUNT Labartory Range 302 OutOfRangeCD.RED.CELL.DIST.CV Labartory Range 303 OutOfRangeCD.Lactic.Dehydrogenase Labartory Range 304 OutOfRangeCD.NUCLEATED.RBC Labartory Range 305 OutOfRangeCD.NUC.RBC Labartory Range 306 OutOfRangeCD.Total.Cells.Counted Labartory Range 307 OutOfRangeCD.K.pl Labartory Range 308 OutOfRangeCD.GLUCOSE.wb Labartory Range 309 OutOfRangeCD.vBASE.X Labartory Range 310 OutOfRangeCD.vTCO2 Labartory Range 311 OutOfRangeCD.vHCT Labartory Range 312 OutOfRangeCD.vHgB Labartory Range 313 OutOfRangeCD.vSO2 Labartory Range 314 OutOfRangeCD.vPCO2 Labartory Range 315 OutOfRangeCD.vpH Labartory Range 316 OutOfRangeCD.vPO2 Labartory Range 317 OutOfRangeCD.FIBRINOGEN Labartory Range 318 OutOfRangeCD..ANTI.Xa.LEVEL Labartory Range 319 OutOfRangeCD.NRBC Labartory Range 320 OutOfRangeCD.Whole.Blood.Glucose Labartory Range 321 OutOfRangeCD.PLASMA.HGB Labartory Range 322 OutOfRangeCD.UREA.NITROGEN..BUN. Labartory Range 323 OutOfRangeCD.GLOM.FILT.RATE.NON.AFRIC Labartory Range AN.A 324 OutOfRangeCD.Absolute.NRBC Labartory Range 325 OutOfRangeCD.GLOM.FILT.RATE.AFRICAN.A Labartory Range MERI 326 OutOfRangeCD.XNRBC.. Labartory Range 327 OutOfRangeCD.ABSOLUTE.EOS Labartory Range 328 OutOfRangeCD.ABSOLUTE.MONOS Labartory Range 329 OutOfRangeCD.ABSOLUTE.NEUTS Labartory Range 330 OutOfRangeCD.XNRBC...1 Labartory Range 331 OutOfRangeCD.ABSOLUTE.LYMPHS Labartory Range 332 OutOfRangeCD.ABSOLUTE.BASOS Labartory Range 333 OutOfRangeCD.ABSOLUTE.NRBC Labartory Range 334 OutOfRangeCD.MEAN.PLT.VOLUME Labartory Range 335 OutOfRangeCD.ABANDS Labartory Range


APPENDIX B: R Source Code

Final-ReadingData.R install.packages("readxl") install.packages("xlsx") library(readxl) require(xlsx) library(openxlsx) install.packages("gdata") require(gdata) install.packages("rmarkdown") install.packages("reshape") library(reshape) require(stats) install.packages("icd9") install.packages("splitstackshape") library(splitstackshape) install.packages("devtools") install.packages("dplyr") require(dplyr) install.packages("data.table") install.packages("icdcoder") install.packages("kohonen") install.packages("class") install.packages("MASS") library(kohonen) require(graphics) install.packages("hclust") install.packages("ggvis") library(ggvis) library(stats) install.packages("factoextra") install.packages("cluster") library("cluster") library("factoextra") install.packages("graphics") install.packages("ggplot2") library(ggplot2) memory.size(10000000000000)

Inpdf<-read.table("InpatientAdmissions.txt",header=TRUE,sep="\t",quote= "\"", dec=".", fill =TRUE,comment.char="") inpdf <-Inpdf[order(Inpdf$PatientKEY,Inpdf$IndexPatientAccountKEY),]

106 inpdf[inpdf=="UNKNOWN"]<- NA inpdf[inpdf=="N/A"]<- NA inpdf[inpdf=="NULL"]<- NA demodf<-read.table("Demographics.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill =TRUE,comment.char="") demodf[demodf=="UNKNOWN"]<- NA demodf[demodf=="N/A"]<- NA demodf[demodf=="NULL"]<- NA inpdemodf<-merge(inpdf,demodf,by="PatientKEY") write.table(inpdemodf, "C:/Users/aambukhari/Desktop/R Project/My Projec t/inpdemodf.txt", sep="\t") #Diagnosis diag11<-read.table("11DiagnosisCDs.txt",header=TRUE,sep="\t",quote="\"" , dec=".", fill =TRUE,comment.char="") diag12<-read.table("12DiagnosisCDs.txt",header=TRUE,sep="\t",quote="\"" , dec=".", fill =TRUE,comment.char="") diag13<-read.table("13DiagnosisCDs.txt",header=TRUE,sep="\t",quote="\"" , dec=".", fill =TRUE,comment.char="") diag14<-read.table("14DiagnosisCDs.txt",header=TRUE,sep="\t",quote="\"" , dec=".", fill =TRUE,comment.char="") diag15<-read.table("15DiagnosisCDs.txt",header=TRUE,sep="\t",quote="\"" , dec=".", fill =TRUE,comment.char="")

DiaCDsAndDSCs<-read.table("DiaCDsAndDSCs.txt",header=TRUE,sep="\t",quot e="\"", dec=".", fill =TRUE,comment.char="") diagnosis <- rbind(diag11,diag12,diag13,diag14,diag15) diagnosis1<-diagnosis[!(diagnosis$PatientKEY==""),] diagnosis2<- diagnosis1[with(diagnosis1,order(PatientKEY,decreasing=FAL SE)),] diagnosis3<-apply(diagnosis2,2,function(x)(gsub("\\[","",x))) diagnosis3<-apply(diagnosis3,2,function(x)(gsub("\\]","",x))) diagnosis4<-as.data.frame(diagnosis3) diagnosis5<-diagnosis4[which(diagnosis4$InpatientFLG == 1),] write.table(diagnosis5, "C:/Users/aambukhari/Desktop/R Project/My Proje ct/diagnosis5.txt", sep="\t") diagnosis5<-read.table("diagnosis5.txt",header=TRUE,sep="\t",quote="\"" , 107

dec=".", fill =TRUE,comment.char="")

Procedures<-read.table("AllProceduresCDs.txt",header=TRUE,sep="\t",quot e="\"", dec=".", fill =TRUE,comment.char="")

Procedures<-apply(Procedures,2,function(x)(gsub("\\[","",x))) Procedures<-apply(Procedures,2,function(x)(gsub("\\]","",x)))

Procedures[Procedures=="NULL"]<- NA Procedures[Procedures=="UNKNOWN"]<- NA Procedures[Procedures=="N/A"]<- NA Procedures[Procedures==" "]<- NA

Procedures<- as.data.frame(Procedures) inProcedures<-Procedures[which(Procedures$InpatientFLG == 1),] colnames(inpdemodf)[colnames(inpdemodf)=="IndexPatientAccountKEY"] <- " PatientAccountKEY" write.table(inProcedures, "C:/Users/aambukhari/Desktop/R Project/My Pro ject/inProcedures.txt", sep="\t") inProcedures<-read.table("inProcedures.txt",header=TRUE,sep="\t",quote= "\"", dec=".", fill =TRUE,comment.char="")

lab11<-read.table("Hit_EDWLabs_Year11DEIDENTb.txt",header=FALSE,sep="\t ",quote="\"", dec=".", fill =TRUE,comment.char="") lab11<-data.frame(PatientKEY=lab11$V1,ServiceDTShift=lab11$V2,TypeCD=la b11$V3,LabQDMID=lab11$V4, LOINC=lab11$V5, RPDRTestID=lab11$V6, PanelNM=lab 11$V7, SubPanelNM=lab11$V8, GroupNM=lab11$V9, TestNM=lab11$V10,

ResultTypeCD=lab11$V11, TextResultVAL=lab11$V12, NumericResultVAL=lab11$V13, UnitCD=lab11$V14, OutOfRangeCD=lab11$V15 ) lab12<-read.table("Hit_EDWLabs_Year12DEIDENTb.txt",header=FALSE,sep="\t ",quote="\"", dec=".", fill =TRUE,comment.char="") lab12<-data.frame(PatientKEY=lab12$V1,ServiceDTShift=lab12$V2,TypeCD=la b12$V3,LabQDMID=lab12$V4, LOINC=lab12$V5, RPDRTestID=lab12$V6, PanelNM=lab 12$V7, SubPanelNM=lab12$V8, GroupNM=lab12$V9, TestNM=lab12$V10,

ResultTypeCD=lab12$V11, TextResultVAL=lab12$V12, 108

NumericResultVAL=lab12$V13, UnitCD=lab12$V14, OutOfRangeCD=lab12$V15 ) lab13<-read.table("Hit_EDWLabs_Year13DEIDENTb.txt",header=FALSE,sep="\t ",quote="\"", dec=".", fill =TRUE,comment.char="") lab13<-data.frame(PatientKEY=lab13$V1,ServiceDTShift=lab13$V2,TypeCD=la b13$V3,LabQDMID=lab13$V4, LOINC=lab13$V5, RPDRTestID=lab13$V6, PanelNM=lab 13$V7, SubPanelNM=lab13$V8, GroupNM=lab13$V9, TestNM=lab13$V10,

ResultTypeCD=lab13$V11, TextResultVAL=lab13$V12, NumericResultVAL=lab13$V13, UnitCD=lab13$V14, OutOfRangeCD=lab13$V15 ) lab14<-read.table("Hit_EDWLabs_Year14DEIDENTb.txt",header=FALSE,sep="\t ",quote="\"", dec=".", fill =TRUE,comment.char="") lab14<-data.frame(PatientKEY=lab14$V1,ServiceDTShift=lab14$V2,TypeCD=la b14$V3,LabQDMID=lab14$V4, LOINC=lab14$V5, RPDRTestID=lab14$V6, PanelNM=lab 14$V7, SubPanelNM=lab14$V8, GroupNM=lab14$V9, TestNM=lab14$V10,

ResultTypeCD=lab14$V11, TextResultVAL=lab14$V12, NumericResultVAL=lab14$V13, UnitCD=lab14$V14, OutOfRangeCD=lab14$V15 )
# Year-15 labs: read into lab15 (the original listing assigned this read to Labs, leaving lab15 undefined for the rbind below)
lab15<-read.table("Hit_EDWLabs_Year15DEIDENTb.txt",header=FALSE,sep="\t",quote="\"", dec=".", fill=TRUE,comment.char="")
lab15<-data.frame(PatientKEY=lab15$V1,ServiceDTShift=lab15$V2,TypeCD=lab15$V3,LabQDMID=lab15$V4, LOINC=lab15$V5, RPDRTestID=lab15$V6, PanelNM=lab15$V7, SubPanelNM=lab15$V8, GroupNM=lab15$V9, TestNM=lab15$V10,
                  ResultTypeCD=lab15$V11, TextResultVAL=lab15$V12, NumericResultVAL=lab15$V13, UnitCD=lab15$V14, OutOfRangeCD=lab15$V15)

Labs<- rbind(lab11,lab12,lab13,lab14,lab15)

# Drop rows with an empty PatientKEY (the original negative-indexed grep("") matches every row)
Labs<-Labs[Labs$PatientKEY != "",]
Labs[Labs=="NULL"]<- NA
Labs[Labs=="UNKNOWN"]<- NA
Labs[Labs=="N/A"]<- NA
Labs[Labs==" "]<- NA

Labs<-read.table("Labs.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill =TRUE,comment.char="") Labs<-data.frame(PatientKEY=Labs$V1,ServiceDTShift=Labs$V2,TypeCD=Labs$ V3,LabQDMID=Labs$V4, LOINC=Labs$V5, RPDRTestID=Labs$V6, PanelNM=Labs$V7 109

, SubPanelNM=Labs$V8, GroupNM=Labs$V9, TestNM=Labs$V10, ResultTypeCD=Labs$V11, TextResultVAL=Labs$V12, Num ericResultVAL=Labs$V13, UnitCD=Labs$V14, OutOfRangeCD=Labs$V15)

Medications<-read.table("Medications.txt",header= TRUE,sep="\t",quote=" \"", dec=".", fill =TRUE,comment.char="")

Medications<-data.frame(PatientKEY=Medications$V1,ServiceDTShift=Medica tions$V2, RecordAuditSEQ=Medications$V3,FactAuditFLG=Medi cations$V4,StatusCD=Medications$V5, TypeCD=Medications$V6,MedicationQDMID=Medicatio ns$V7,RecordID=Medications$V8, FDBMedicationID=Medications$V9,GenericID=Medica tions$V10,BrandNameID=Medications$V11, RollupID=Medications$V12,StrengthAndFormID=Medi cations$V13,MedicationNM_1=Medications$V14, MedicationNM_2=Medications$V15,DoseAMT=Medicati ons$V16,DoseUnitNM=Medications$V17, StrengthAMT=Medications$V18,TakeDSC=Medications $V19,FrequencyMnemonicCD=Medications$V20, RouteCD=Medications$V21,RouteDSC=Medications$V2 2,DurationNBR=Medications$V23, DispenseQTY=Medications$V24,DispenseUnitNM=Medi cations$V25,FormDSC=Medications$V26, LastActionTakenCD=Medications$V27,RefillCNT=Med ications$V28,VerifyActionFLG=Medications$V29, PrescriptionFLG=Medications$V30,PRNFLG=Medicati ons$V31,PRNReasonDSC=Medications$V32, RetailMailPharmacyCD=Medications$V33)

# Drop rows with an empty PatientKEY (the original negative-indexed grep("") matches every row)
Medications<-Medications[Medications$PatientKEY != "",]

Medications[Medications=="NULL"]<- NA Medications[Medications=="UNKNOWN"]<- NA Medications[Medications=="N/A"]<- NA Medications[Medications==" "]<- NA

Medications$ServiceDTShift<-gsub("00:00:00.000", "", Medications$Servic eDTShift)

#Vitals
vitals<- read.table("Vitals-EDW.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill=TRUE,comment.char="")
vitals[vitals=="NULL"]<- NA
vitals[vitals=="UNKNOWN"]<- NA
vitals[vitals=="N/A"]<- NA
vitals$ServiceDTShift <- as.Date(vitals$ServiceDTShift ,"%Y-%m-%d")

inProcedures<-read.table("inProcedures.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill=TRUE,comment.char="")
colnames(inProcedures)[colnames(inProcedures)=="PatientAccountKEY"] <- "IndexPatientAccountKEY"

inp_demod_Procedures<-left_join(inpdemodf,inProcedures,by= c("IndexPatientAccountKEY","PatientKEY"))
colnames(diagnosis5)[colnames(diagnosis5)=="PatientAccountKEY"] <- "IndexPatientAccountKEY"
inp_demod_Procedures_Diag <- merge(inp_demod_Procedures,diagnosis5,by=c("IndexPatientAccountKEY","PatientKEY"))
# comorbidities is assumed to be a per-patient comorbidity table loaded earlier (not shown in this listing)
inp_demod_Procedures_Diag_Corm<-left_join(inp_demod_Procedures_Diag,comorbidities,by= "PatientKEY")
# the original listing wrote an undefined object (Final_inp_demod_Procedures_Diag_Corm); the merged object is written instead
write.table(inp_demod_Procedures_Diag_Corm, "C:/Users/Amal/Desktop/Final_inp_demod_Procedures_Diag_Corm.txt", sep="\t")
write.csv(inp_demod_Procedures_Diag_Corm, file = "Final_inp_demod_Procedures_Diag_Corm.csv")


DataCleaning.R

#Read Final_inp_demod_Procedures_Diag_Corm Final_inp <- read.csv(file="C:/Users/Amal/Desktop/FinalProject/Final_in p_demod_Procedures_Diag_Corm.csv", header=TRUE, sep=",") countFinal_inp<- apply(Final_inp, 2, function(x) length(which(!is.na(x) ))) countFinal_inp<- data.frame(countFinal_inp)

#Filtering the Data 75053 => 41806 -41806 => 8686 library(help = "graphics") Final_Data_Death0 <- Final_inp[which(Final_inp$DeathFLG == 0),]

Final_Data_Death0_CHF1 <- Final_Data_Death0[which(Final_Data_Death0$Ind exCHFConditionFLG == 1),] countData<- apply(Final_Data_Death0_CHF1, 2, function(x) length(which(! is.na(x)))) countData<- data.frame(countData)

#Vitals vitals<- read.table("Vitals-EDW.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill =TRUE,comment.char="") vitals[vitals=="NULL"]<- NA vitals$ServiceDTShift <- as.Date(vitals$ServiceDTShift ,"%Y-%m-%d") test <- Final_Data_Death0_CHF1[c(3,2,10,11)] test$IndexAdmissionDTShift <- as.Date(test$IndexAdmissionDTShift , "%Y- %m-%d") test$IndexDischargeDTShift <- as.Date(test$IndexDischargeDTShift , "%Y- %m-%d")

Data_vitals<- test %>% left_join(vitals, by = "PatientKEY") %>% filter((ServiceDTShift >= IndexAdmissionDTShift & ServiceDTShift <= I ndexDischargeDTShift)) countVitals<- apply(Data_vitals, 2, function(x) length(which(!is.na(x)) )) countVitals<- data.frame(countVitals)

#Medication Meds<- read.csv(file="C:/Users/Amal/Desktop/FinalProject/MedicationGrou psEDW4 (2).csv", header=TRUE, sep=",") levels(Meds$StrengthAMT) <- c(levels(Meds$StrengthAMT), "1") Meds$StrengthAMT[is.na(Meds$StrengthAMT)] <- "1" Medications<-Meds[!is.na(Meds$Medication.Group),]

ReshapedMedications<- reshape(Medications,direction = "wide",idvar= c(" 112

PatientKEY","ServiceDTShift"), timevar= "Medication.Group") ReshapedMedications$ServiceDTShift <- as.Date(ReshapedMedications$Servi ceDTShift, format = "%m/%d/%Y")

ReshapedMedications$ServiceDTShift <-format(ReshapedMedications$Service DTShift,"%Y-%m-%d") require(dplyr) Data_Meds<- test %>% left_join(ReshapedMedications, by = "PatientKEY") %>% filter((ServiceDTShift >= IndexAdmissionDTShift - 7 & ServiceDTShift <= IndexDischargeDTShift + 7)) countMeds<- apply(Data_Meds, 2, function(x) length(which(!is.na(x)))) countMeds<- data.frame(countMeds)

Labs<-read.table("Labs.txt",header=TRUE,sep="\t",quote="\"", dec=".", fill =TRUE,comment.char="") Labs<- Labs[c(1,2,9,15)] Labs <- Labs[!duplicated(Labs), ] Labs$ServiceDTShift <- as.Date(Labs$ServiceDTShift ,"%Y-%m-%d")

ReshapedLabs<- reshape(Labs,direction = "wide",idvar= c("PatientKEY","S erviceDTShift"), timevar= "GroupNM")

Data_Labs<- test %>% left_join(ReshapedLabs, by = "PatientKEY") %>% filter((ServiceDTShift >= IndexAdmissionDTShift - 7 & ServiceDTShift <= IndexDischargeDTShift + 7))

countLabs<- apply(Data_Labs, 2, function(x) length(which(!is.na(x)))) countLabs<- data.frame(countLabs)

#NZV library(caret) nzv<- nearZeroVar(Data_Labs, saveMetrics= TRUE)

Data_Labs_1<- Data_Labs[c("PatientKEY", "IndexPatientAccountKEY", "IndexAdmissionDTShift", "IndexDischargeDTShift", "ServiceDTShift", "OutOfRangeCD.GFR (estimated)", "OutOfRangeCD.Potassium", "OutOfRangeCD.Creatinine", 113

"OutOfRangeCD.Sodium", "OutOfRangeCD.BUN", "OutOfRangeCD.Chloride", "OutOfRangeCD.Carbon Dioxide", "OutOfRangeCD.Anion Gap", "OutOfRangeCD.HCT", "OutOfRangeCD.Glucose", "OutOfRangeCD.PLT", "OutOfRangeCD.WBC", "OutOfRangeCD.MCHC", "OutOfRangeCD.MCH", "OutOfRangeCD.MCV", "OutOfRangeCD.RBC", "OutOfRangeCD.RDW", "OutOfRangeCD.Hgb", "OutOfRangeCD.Calcium", "OutOfRangeCD.Magnesium", "OutOfRangeCD.PT")]

Data_Meds_1 <- Data_Meds[c("PatientKEY", "IndexPatientAccountKEY", "IndexAdmissionDTShift", "IndexDischargeDTShift", "ServiceDTShift", "StrengthAMT.Loop diuretics", "StrengthAMT.Cardioselective beta blockers", "StrengthAMT.Statins", "StrengthAMT.Salicylates", "StrengthAMT.Minerals and electrolytes")]

Data_Vitals_1 <- Data_vitals[c("PatientKEY", "IndexPatientAccountKEY", "IndexAdmissionDTShift", "IndexDischargeDTShift", "ServiceDTShift", "BloodPressureSystolicNBR", "BloodPressureDiastolicNBR")]

Data_Labs_Meds<- full_join(Data_Labs_1,Data_Meds_1,by= c("PatientKEY"," IndexPatientAccountKEY")) Data_Labs_Meds_Vitals <- full_join(Data_Labs_Meds,Data_Vitals_1,by= c(" PatientKEY","IndexPatientAccountKEY")) x<- Data_Labs_Meds_Vitals countx<- apply(x, 2, function(x) length(which(!is.na(x)))) countx<- data.frame(countx)

Full_Data <- left_join(Final_Data_Death0_CHF1,x, by= c("PatientKEY","In dexPatientAccountKEY")) countFull_Data<- apply(Full_Data, 2, function(x) length(which(!is.na(x) ))) 114 countFull_Data<- data.frame(countFull_Data)

Full_Data<- Full_Data[!duplicated(Full_Data), ] #> length(unique(Full_Data$PatientKEY))[1] 5895 #> length(unique(Full_Data$IndexPatientAccountKEY))[1] 8686 Full_Data_1 <- Full_Data[-c(1,5,6,7,10,11,12,13,15:25,41,42,44,45,46,48 , 49,50,51,57,58,61:81,83:102,110,111,112,134 ,135,136, 142,143,144)] Full_Data_1<-Full_Data_1[!duplicated(Full_Data_1), ]

# Medication: convert to 0/1 indicator (prescribed or not)
Full_Data_1[c(60:64)] <- as.data.frame(ifelse(is.na(Full_Data_1[c(60:64)]), 0, 1))

# Labs: impute NA to zero
levels = c("@", "A", "H", "L", "U", "0")
Full_Data_1[c(39:59)] <- lapply(Full_Data_1[c(39:59)], factor, levels = levels)

Full_Data_1[c(39:59)][is.na(Full_Data_1[c(39:59)])] <- 0
countFull_Data_1 <- apply(Full_Data_1, 2, function(x) length(which(!is.na(x))))
countFull_Data_1 <- data.frame(countFull_Data_1)

Full_Data_1$ClinicalRoutineDaysNBR[is.na(Full_Data_1$ClinicalRoutineDaysNBR)] <- 0
Full_Data_1$ClinicalICUDaysNBR[is.na(Full_Data_1$ClinicalICUDaysNBR)] <- 0
Full_Data_1$ClinicalObservationDaysNBR[is.na(Full_Data_1$ClinicalObservationDaysNBR)] <- 0
Full_Data_1$ClinicalOperativeDaysNBR[is.na(Full_Data_1$ClinicalOperativeDaysNBR)] <- 0

# Race
levels(Full_Data_1$RaceGRP) <- c(levels(Full_Data_1$RaceGRP), "OTHER")
Full_Data_1$RaceGRP[is.na(Full_Data_1$RaceGRP)] <- "OTHER"

# Marital
Full_Data_1$MaritalGRP[is.na(Full_Data_1$MaritalGRP)] <- "OTHER"

# Education
Full_Data_1$EducationGRP[is.na(Full_Data_1$EducationGRP)] <- "OTHER"

# Employment
levels(Full_Data_1$EmploymentGRP) <- c(levels(Full_Data_1$EmploymentGRP), "OTHER")
Full_Data_1$EmploymentGRP[is.na(Full_Data_1$EmploymentGRP)] <- "OTHER"

# Number of procedures
Full_Data_1$NumberOfProcedures[is.na(Full_Data_1$NumberOfProcedures)] <- 0

# Language
Full_Data_1$LanguageGRP[is.na(Full_Data_1$LanguageGRP)] <- "OTHER"

library(Hmisc)
# BloodPressureSystolicNBR: impute missing values with the median
Full_Data_1$BloodPressureSystolicNBR <- with(Full_Data_1, impute(BloodPressureSystolicNBR, median))

# BloodPressureDiastolicNBR: impute missing values with the median
Full_Data_1$BloodPressureDiastolicNBR <- with(Full_Data_1, impute(BloodPressureDiastolicNBR, median))

# Blood pressure condition (conditions evaluated in order)
Full_Data_1$BloodPressure <- NA
Full_Data_1$BloodPressureSystolicNBR <- as.integer(Full_Data_1$BloodPressureSystolicNBR)
Full_Data_1$BloodPressureDiastolicNBR <- as.integer(Full_Data_1$BloodPressureDiastolicNBR)
Full_Data_1$BloodPressure <- with(Full_Data_1,
  ifelse(BloodPressureSystolicNBR < 120 & BloodPressureDiastolicNBR < 80, 'Normal',
  ifelse(BloodPressureSystolicNBR >= 120 & BloodPressureSystolicNBR <= 129 & BloodPressureDiastolicNBR < 80, 'Elevated',
  ifelse(BloodPressureSystolicNBR >= 130 & BloodPressureSystolicNBR <= 139 | BloodPressureDiastolicNBR >= 80 & BloodPressureDiastolicNBR <= 89, 'High Stage 1',
  ifelse(BloodPressureSystolicNBR >= 140 | BloodPressureDiastolicNBR >= 90, 'High stage 2',
  ifelse(BloodPressureSystolicNBR >= 180 & BloodPressureDiastolicNBR > 120, 'Crisis',
  ifelse(BloodPressureSystolicNBR >= 180 | BloodPressureDiastolicNBR > 120, 'Crisis', 'NULL')))))))
countFull_Data_1 <- apply(Full_Data_1, 2, function(x) length(which(!is.na(x))))
countFull_Data_1 <- data.frame(countFull_Data_1)
y <- Full_Data_1
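# Editorial note: the systolic/diastolic cut-offs above (120/80, 130/80, 140/90,
# 180/120) appear to follow the 2017 ACC/AHA blood pressure categories
# (Normal, Elevated, Stage 1, Stage 2, Crisis).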

# Group principal ICD-9 procedure codes into procedure chapters
Full_Data_1$PrincipalICD9ProcedureCD <- as.character(Full_Data_1$PrincipalICD9ProcedureCD)
pc <- Full_Data_1$PrincipalICD9ProcedureCD
pc[pc >= 01 & pc <= 05] <- "Operations on the nervous system"
pc[pc >= 06 & pc <= 07] <- "Operations on the endocrine system"
pc[pc >= 08 & pc <= 16] <- "Operations on the eye"
pc[pc >= 18 & pc <= 20] <- "Operations on the ear"
pc[pc >= 21 & pc <= 29] <- "Operations on the nose, mouth and pharynx"
pc[pc >= 30 & pc <= 34] <- "Operations on the respiratory system"
pc[pc >= 35 & pc <= 39] <- "Operations on the cardiovascular system"
pc[pc >= 40 & pc <= 41] <- "Operations on the hemic and lymphatic system"
pc[pc >= 42 & pc <= 54] <- "Operations on the digestive system"
pc[pc >= 420 & pc <= 540] <- "Operations on the digestive system"
pc[pc >= 540 & pc <= 549] <- "Operations on the digestive system"
pc[pc >= 4200 & pc <= 5400] <- "Operations on the digestive system"
pc[pc >= 5400 & pc <= 5499] <- "Operations on the digestive system"
pc[pc >= 55 & pc <= 59] <- "Operations on the urinary system"
pc[pc >= 590 & pc <= 599] <- "Operations on the urinary system"
pc[pc >= 5900 & pc <= 5999] <- "Operations on the urinary system"
pc[pc >= 60 & pc <= 64] <- "Operations on the male genital organs"
pc[pc >= 65 & pc <= 71] <- "Operations on the female genital organs"
pc[pc >= 710 & pc <= 719] <- "Operations on the female genital organs"
pc[pc >= 72 & pc <= 75] <- "Obstetrical procedures"
pc[pc >= 76 & pc <= 84] <- "Operations on the musculoskeletal system"
pc[pc >= 8400 & pc <= 8499] <- "Operations on the musculoskeletal system"
pc[pc >= 85 & pc <= 86] <- "Operations on the integumentary system"
pc[pc >= 8600 & pc <= 8699] <- "Operations on the integumentary system"
pc[pc >= 87 & pc <= 99] <- "Miscellaneous diagnostic and therapeutic procedures"
pc[pc >= 9900 & pc <= 9999] <- "Miscellaneous diagnostic and therapeutic procedures"
Full_Data_1$PrincipalICD9ProcedureCD <- pc
levels(Full_Data_1$PrincipalICD9ProcedureCD) <- c(levels(Full_Data_1$PrincipalICD9ProcedureCD), "OTHER")
Full_Data_1$PrincipalICD9ProcedureCD[is.na(Full_Data_1$PrincipalICD9ProcedureCD)] <- "OTHER"

levels1 = c("@", "A", "H", "L", "U", "Normal", "Abnormal", "High", "Low", "UknownAbnormal", "0")
Full_Data_1[c(39:59)] <- lapply(Full_Data_1[c(39:59)], factor, levels = levels1)

Full_Data_1[, 39:59][Full_Data_1[, 39:59] == '@'] <- 'Normal'
Full_Data_1[, 39:59][Full_Data_1[, 39:59] == 'A'] <- 'Abnormal'
Full_Data_1[, 39:59][Full_Data_1[, 39:59] == 'H'] <- 'High'
Full_Data_1[, 39:59][Full_Data_1[, 39:59] == 'L'] <- 'Low'
Full_Data_1[, 39:59][Full_Data_1[, 39:59] == 'U'] <- 'UknownAbnormal'


Full<- Full_Data_1[complete.cases(Full_Data_1), ]

Final <- Full[!duplicated(Full), ]
write.csv(Final, file = "Final.csv")

Final_1 <- Full %>%
  group_by(PatientKEY) %>%
  arrange(PHSReadmission30DayFLG) %>%
  slice(1)
write.csv(Final_1, file = "Final_1.duplicate.csv")

Final_Dummy <- read.csv(file = "C:/Users/Amal/Desktop/Final_Dummy.csv", header = TRUE, sep = ",")
f <- read.csv(file = "C:/Users/Amal/Desktop/Final_Dummy.csv", header = FALSE, sep = ",")
testt <- FLMV %>%
  group_by(PatientKEY) %>%
  arrange(desc(PHSReadmission30DayFLG)) %>%
  slice(1)

D<- testt[-c(1,2,3)]


DataCleaning.R

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)

temp <- L %>%
  group_by(PatientKEY, IndexPatientAccountKEY, !!as.name(var)) %>%
  summarise(counts = n()) %>%
  slice(which.max(counts)) %>%
  select(-counts)

# For each lab, keep the most frequent out-of-range flag per admission
aggregate_lab_results <- function(var){
  temp <- L %>%
    group_by(PatientKEY, IndexPatientAccountKEY, !!as.name(var)) %>%
    summarise(counts = n()) %>%
    slice(which.max(counts)) %>%
    select(-counts)
  return(temp)
}

b <- aggregate_lab_results("OutOfRangeCD.Sodium")
labs_vars <- names(L)[names(L) %>% str_detect("OutOfRange")]
res <- map(labs_vars, aggregate_lab_results)
labs_df <- Reduce(inner_join, res)
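# Editorial note: aggregate_lab_results() collapses repeated lab draws within an
# admission to the modal (most frequent) OutOfRangeCD value for that lab;
# map() applies it to every OutOfRangeCD.* column and Reduce(inner_join, ...)
# merges the per-lab summaries back into one row per admission key.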

# Same aggregation for medications
aggregate_Med_results <- function(var){
  temp <- M %>%
    group_by(PatientKEY, IndexPatientAccountKEY, !!as.name(var)) %>%
    summarise(counts = n()) %>%
    slice(which.max(counts)) %>%
    select(-counts)
  return(temp)
}

Meds_vars <- names(M)[names(M) %>% str_detect("StrengthAMT")]
res_meds <- map(Meds_vars, aggregate_Med_results)
meds_df <- Reduce(inner_join, res_meds)
meds_labs <- full_join(meds_df, labs_df, by = c("PatientKEY", "IndexPatientAccountKEY"))
meds_labs_vitals <- full_join(meds_labs, V, by = c("PatientKEY", "IndexPatientAccountKEY"))

x <- left_join(Final_Data_Death0_CHF1, meds_labs_vitals, by = c("PatientKEY", "IndexPatientAccountKEY"))
count_x <- apply(x, 2, function(x) length(which(!is.na(x))))
count_x <- data.frame(count_x)
x <- FullFLMV[-c(1, 5, 6, 7, 10, 11, 12, 13, 15:25, 41:46, 48:51, 57, 58, 61:81, 84:102, 136:138)]

# Medication: convert to 0/1 indicator (prescribed or not)
x[c(39:42)] <- as.data.frame(ifelse(is.na(x[c(39:42)]), 0, 1))
# Labs
x[c(44:64)][is.na(x[c(44:64)])] <- 0
library(Hmisc)
library(imputeMissings)
# BloodPressureSystolicNBR: impute missing values with the median
x$BloodPressureSystolicNBR <- with(x, impute(BloodPressureSystolicNBR, median))

# BloodPressureDiastolicNBR: impute missing values with the median
x$BloodPressureDiastolicNBR <- with(x, impute(BloodPressureDiastolicNBR, median))

# Blood pressure condition (conditions evaluated in order)
x$BloodPressure <- NA
x$BloodPressureSystolicNBR <- as.integer(x$BloodPressureSystolicNBR)
x$BloodPressureDiastolicNBR <- as.integer(x$BloodPressureDiastolicNBR)
x$BloodPressure <- with(x,
  ifelse(BloodPressureSystolicNBR < 120 & BloodPressureDiastolicNBR < 80, 'Normal',
  ifelse(BloodPressureSystolicNBR >= 120 & BloodPressureSystolicNBR <= 129 & BloodPressureDiastolicNBR < 80, 'Elevated',
  ifelse(BloodPressureSystolicNBR >= 130 & BloodPressureSystolicNBR <= 139 | BloodPressureDiastolicNBR >= 80 & BloodPressureDiastolicNBR <= 89, 'High Stage 1',
  ifelse(BloodPressureSystolicNBR >= 140 | BloodPressureDiastolicNBR >= 90, 'High stage 2',
  ifelse(BloodPressureSystolicNBR >= 180 & BloodPressureDiastolicNBR > 120, 'Crisis',
  ifelse(BloodPressureSystolicNBR >= 180 | BloodPressureDiastolicNBR > 120, 'Crisis', 'NULL')))))))

x$ClinicalRoutineDaysNBR[is.na(x$ClinicalRoutineDaysNBR)] <- 0
x$ClinicalICUDaysNBR[is.na(x$ClinicalICUDaysNBR)] <- 0
x$ClinicalObservationDaysNBR[is.na(x$ClinicalObservationDaysNBR)] <- 0
x$ClinicalOperativeDaysNBR[is.na(x$ClinicalOperativeDaysNBR)] <- 0

# Race
levels(x$RaceGRP) <- c(levels(x$RaceGRP), "OTHER")
x$RaceGRP[is.na(x$RaceGRP)] <- "OTHER"

# Marital
x$MaritalGRP[is.na(x$MaritalGRP)] <- "OTHER"

# Education
x$EducationGRP[is.na(x$EducationGRP)] <- "OTHER"

# Employment
levels(x$EmploymentGRP) <- c(levels(x$EmploymentGRP), "OTHER")
x$EmploymentGRP[is.na(x$EmploymentGRP)] <- "OTHER"

# Number of procedures
x$NumberOfProcedures[is.na(x$NumberOfProcedures)] <- 0

# Language
x$LanguageGRP[is.na(x$LanguageGRP)] <- "OTHER"

# Group principal ICD-9 procedure codes into procedure chapters
x$PrincipalICD9ProcedureCD <- as.character(x$PrincipalICD9ProcedureCD)
pc <- x$PrincipalICD9ProcedureCD
pc[pc >= 01 & pc <= 05] <- "Operations on the nervous system"
pc[pc >= 06 & pc <= 07] <- "Operations on the endocrine system"
pc[pc >= 08 & pc <= 16] <- "Operations on the eye"
pc[pc >= 18 & pc <= 20] <- "Operations on the ear"
pc[pc >= 21 & pc <= 29] <- "Operations on the nose, mouth and pharynx"
pc[pc >= 30 & pc <= 34] <- "Operations on the respiratory system"
pc[pc >= 35 & pc <= 39] <- "Operations on the cardiovascular system"
pc[pc >= 40 & pc <= 41] <- "Operations on the hemic and lymphatic system"
pc[pc >= 42 & pc <= 54] <- "Operations on the digestive system"
pc[pc >= 420 & pc <= 540] <- "Operations on the digestive system"
pc[pc >= 540 & pc <= 549] <- "Operations on the digestive system"
pc[pc >= 4200 & pc <= 5400] <- "Operations on the digestive system"
pc[pc >= 5400 & pc <= 5499] <- "Operations on the digestive system"
pc[pc >= 55 & pc <= 59] <- "Operations on the urinary system"
pc[pc >= 590 & pc <= 599] <- "Operations on the urinary system"
pc[pc >= 5900 & pc <= 5999] <- "Operations on the urinary system"
pc[pc >= 60 & pc <= 64] <- "Operations on the male genital organs"
pc[pc >= 65 & pc <= 71] <- "Operations on the female genital organs"
pc[pc >= 710 & pc <= 719] <- "Operations on the female genital organs"
pc[pc >= 72 & pc <= 75] <- "Obstetrical procedures"
pc[pc >= 76 & pc <= 84] <- "Operations on the musculoskeletal system"
pc[pc >= 8400 & pc <= 8499] <- "Operations on the musculoskeletal system"
pc[pc >= 85 & pc <= 86] <- "Operations on the integumentary system"
pc[pc >= 8600 & pc <= 8699] <- "Operations on the integumentary system"
pc[pc >= 87 & pc <= 99] <- "Miscellaneous diagnostic and therapeutic procedures"
pc[pc >= 9900 & pc <= 9999] <- "Miscellaneous diagnostic and therapeutic procedures"
x$PrincipalICD9ProcedureCD <- pc
levels(x$PrincipalICD9ProcedureCD) <- c(levels(x$PrincipalICD9ProcedureCD), "None")
x$PrincipalICD9ProcedureCD[is.na(x$PrincipalICD9ProcedureCD)] <- "None"

FLMV <- x[-c(1, 5, 6, 7, 10, 11, 12, 13, 15:25, 38, 41, 42, 44, 45, 46, 48:51, 57, 58, 61:79, 80, 81,
             83:102, 110:112, 134:136, 142:146)]
count_FLMV <- apply(FLMV, 2, function(x) length(which(!is.na(x))))
count_FLMV <- data.frame(count_FLMV)

FLMV <- FLMV[complete.cases(FLMV$GenderCD), ]  # drop rows with missing Gender
write.csv(Dummydata, file = "Dummydata.csv")
write.csv(Data, file = "Data.csv")

DataCleaning.R

L <- Data_Labs_1 %>%
  group_by(PatientKEY, IndexPatientAccountKEY) %>%
  arrange(ServiceDTShift)
L <- L[order(L$PatientKEY, L$IndexPatientAccountKEY), ]

L[c(6:26)] <- lapply(L[c(6:26)], factor, levels = levels)

L[c(6:26)][is.na(L[c(6:26)])] <- 0
levels1 = c("@", "A", "H", "L", "U", "Normal", "Abnormal", "High", "Low", "UknownAbnormal", "0")
L[c(6:26)] <- lapply(L[c(6:26)], factor, levels = levels1)

L[, 6:26][L[, 6:26] == '@'] <- 'Normal'
L[, 6:26][L[, 6:26] == 'A'] <- 'Abnormal'
L[, 6:26][L[, 6:26] == 'H'] <- 'High'
L[, 6:26][L[, 6:26] == 'L'] <- 'Low'
L[, 6:26][L[, 6:26] == 'U'] <- 'UknownAbnormal'

write.csv(L, file = "L.csv")
h <- head(L)

a <- L[!duplicated(L$IndexPatientAccountKEY), ]            # first lab record per admission
z <- L[!rev(duplicated(rev(L$IndexPatientAccountKEY))), ]  # last lab record per admission
library("plyr")
s <- ddply(h, .(IndexPatientAccountKEY), function(x) x[c(1, nrow(x)), ])
library("dplyr")
temp <- L %>%
  group_by(IndexPatientAccountKEY) %>%
  slice(c(1, n())) %>%
  ungroup()
require(dplyr)
detach("package:plyr", unload = TRUE)
s <- L %>%
  group_by(PatientKEY, IndexPatientAccountKEY, OutOfRangeCD.Sodium) %>%
  summarize(counts = n())

M <- Data_Meds_1 %>%
  group_by(PatientKEY, IndexPatientAccountKEY) %>%
  arrange(ServiceDTShift)


M<-M[order(M$PatientKEY,M$IndexPatientAccountKEY),]

M[c(6:10)]<- as.data.frame(ifelse(is.na(M[c(6:10)]), 0, 1))

ZM <- M[!rev(duplicated(rev(M$IndexPatientAccountKEY))),]

V <- Data_Vitals_1
V <- V %>%
  group_by(PatientKEY, IndexPatientAccountKEY) %>%
  arrange(ServiceDTShift)
V <- V[order(V$PatientKEY, V$IndexPatientAccountKEY), ]

V <- V[!rev(duplicated(rev(V$IndexPatientAccountKEY))), ]  # last vitals record per admission
library(Hmisc)
# BloodPressureSystolicNBR: impute missing values with the median
V$BloodPressureSystolicNBR <- with(V, impute(BloodPressureSystolicNBR, median))

# BloodPressureDiastolicNBR: impute missing values with the median
V$BloodPressureDiastolicNBR <- with(V, impute(BloodPressureDiastolicNBR, median))

# Blood pressure condition
V$BloodPressure <- NA
V$BloodPressureSystolicNBR <- as.integer(V$BloodPressureSystolicNBR)
V$BloodPressureDiastolicNBR <- as.integer(V$BloodPressureDiastolicNBR)

V$BloodPressure <- with(V,
  ifelse(BloodPressureSystolicNBR < 120 & BloodPressureDiastolicNBR < 80, 'Normal',
  ifelse(BloodPressureSystolicNBR >= 120 & BloodPressureSystolicNBR <= 129 & BloodPressureDiastolicNBR < 80, 'Elevated',
  ifelse(BloodPressureSystolicNBR >= 130 & BloodPressureSystolicNBR <= 139 | BloodPressureDiastolicNBR >= 80 & BloodPressureDiastolicNBR <= 89, 'High Stage 1',
  ifelse(BloodPressureSystolicNBR >= 140 | BloodPressureDiastolicNBR >= 90, 'High stage 2',
  ifelse(BloodPressureSystolicNBR >= 180 & BloodPressureDiastolicNBR > 120, 'Crisis',
  ifelse(BloodPressureSystolicNBR >= 180 | BloodPressureDiastolicNBR > 120, 'Crisis', 'NULL')))))))

LM <- full_join(z, ZM, by = c("PatientKEY", "IndexPatientAccountKEY"))
LMV <- full_join(LM, V, by = c("PatientKEY", "IndexPatientAccountKEY"))

FullLMV <- left_join(Final_Data_Death0_CHF1, LMV, by = c("PatientKEY", "IndexPatientAccountKEY"))
count_FLMV <- apply(FullLMV, 2, function(x) length(which(!is.na(x))))
count_FLMV <- data.frame(count_FLMV)

# Medication: convert to 0/1 indicator (prescribed or not)
FullLMV[c(137:141)] <- as.data.frame(ifelse(is.na(FullLMV[c(137:141)]), 0, 1))
# Labs
FullLMV[c(113:133)][is.na(FullLMV[c(113:133)])] <- 0
library(Hmisc)
# BloodPressureSystolicNBR: impute missing values with the median
FullLMV$BloodPressureSystolicNBR <- with(FullLMV, impute(BloodPressureSystolicNBR, median))

# BloodPressureDiastolicNBR: impute missing values with the median
FullLMV$BloodPressureDiastolicNBR <- with(FullLMV, impute(BloodPressureDiastolicNBR, median))

# Blood pressure condition
FullLMV$BloodPressure <- NA
FullLMV$BloodPressureSystolicNBR <- as.integer(FullLMV$BloodPressureSystolicNBR)
FullLMV$BloodPressureDiastolicNBR <- as.integer(FullLMV$BloodPressureDiastolicNBR)

FullLMV$BloodPressure <- with(FullLMV,
  ifelse(BloodPressureSystolicNBR < 120 & BloodPressureDiastolicNBR < 80, 'Normal',
  ifelse(BloodPressureSystolicNBR >= 120 & BloodPressureSystolicNBR <= 129 & BloodPressureDiastolicNBR < 80, 'Elevated',
  ifelse(BloodPressureSystolicNBR >= 130 & BloodPressureSystolicNBR <= 139 | BloodPressureDiastolicNBR >= 80 & BloodPressureDiastolicNBR <= 89, 'High Stage 1',
  ifelse(BloodPressureSystolicNBR >= 140 | BloodPressureDiastolicNBR >= 90, 'High stage 2',
  ifelse(BloodPressureSystolicNBR >= 180 & BloodPressureDiastolicNBR > 120, 'Crisis',
  ifelse(BloodPressureSystolicNBR >= 180 | BloodPressureDiastolicNBR > 120, 'Crisis', 'NULL')))))))

FullLMV$ClinicalRoutineDaysNBR[is.na(FullLMV$ClinicalRoutineDaysNBR)] <- 0
FullLMV$ClinicalICUDaysNBR[is.na(FullLMV$ClinicalICUDaysNBR)] <- 0
FullLMV$ClinicalObservationDaysNBR[is.na(FullLMV$ClinicalObservationDaysNBR)] <- 0
FullLMV$ClinicalOperativeDaysNBR[is.na(FullLMV$ClinicalOperativeDaysNBR)] <- 0

# Race
levels(FullLMV$RaceGRP) <- c(levels(FullLMV$RaceGRP), "OTHER")
FullLMV$RaceGRP[is.na(FullLMV$RaceGRP)] <- "OTHER"

# Marital
FullLMV$MaritalGRP[is.na(FullLMV$MaritalGRP)] <- "OTHER"

# Education
FullLMV$EducationGRP[is.na(FullLMV$EducationGRP)] <- "OTHER"

# Employment
levels(FullLMV$EmploymentGRP) <- c(levels(FullLMV$EmploymentGRP), "OTHER")
FullLMV$EmploymentGRP[is.na(FullLMV$EmploymentGRP)] <- "OTHER"

# Number of procedures
FullLMV$NumberOfProcedures[is.na(FullLMV$NumberOfProcedures)] <- 0

# Language
FullLMV$LanguageGRP[is.na(FullLMV$LanguageGRP)] <- "OTHER"

# Group principal ICD-9 procedure codes into procedure chapters
FullLMV$PrincipalICD9ProcedureCD <- as.character(FullLMV$PrincipalICD9ProcedureCD)
pc <- FullLMV$PrincipalICD9ProcedureCD
pc[pc >= 01 & pc <= 05] <- "Operations on the nervous system"
pc[pc >= 06 & pc <= 07] <- "Operations on the endocrine system"
pc[pc >= 08 & pc <= 16] <- "Operations on the eye"
pc[pc >= 18 & pc <= 20] <- "Operations on the ear"
pc[pc >= 21 & pc <= 29] <- "Operations on the nose, mouth and pharynx"
pc[pc >= 30 & pc <= 34] <- "Operations on the respiratory system"
pc[pc >= 35 & pc <= 39] <- "Operations on the cardiovascular system"
pc[pc >= 40 & pc <= 41] <- "Operations on the hemic and lymphatic system"
pc[pc >= 42 & pc <= 54] <- "Operations on the digestive system"
pc[pc >= 420 & pc <= 540] <- "Operations on the digestive system"
pc[pc >= 540 & pc <= 549] <- "Operations on the digestive system"
pc[pc >= 4200 & pc <= 5400] <- "Operations on the digestive system"
pc[pc >= 5400 & pc <= 5499] <- "Operations on the digestive system"
pc[pc >= 55 & pc <= 59] <- "Operations on the urinary system"
pc[pc >= 590 & pc <= 599] <- "Operations on the urinary system"
pc[pc >= 5900 & pc <= 5999] <- "Operations on the urinary system"
pc[pc >= 60 & pc <= 64] <- "Operations on the male genital organs"
pc[pc >= 65 & pc <= 71] <- "Operations on the female genital organs"
pc[pc >= 710 & pc <= 719] <- "Operations on the female genital organs"
pc[pc >= 72 & pc <= 75] <- "Obstetrical procedures"
pc[pc >= 76 & pc <= 84] <- "Operations on the musculoskeletal system"
pc[pc >= 8400 & pc <= 8499] <- "Operations on the musculoskeletal system"
pc[pc >= 85 & pc <= 86] <- "Operations on the integumentary system"
pc[pc >= 8600 & pc <= 8699] <- "Operations on the integumentary system"
pc[pc >= 87 & pc <= 99] <- "Miscellaneous diagnostic and therapeutic procedures"
pc[pc >= 9900 & pc <= 9999] <- "Miscellaneous diagnostic and therapeutic procedures"
FullLMV$PrincipalICD9ProcedureCD <- pc
levels(FullLMV$PrincipalICD9ProcedureCD) <- c(levels(FullLMV$PrincipalICD9ProcedureCD), "None")

FullLMV$PrincipalICD9ProcedureCD[is.na(FullLMV$PrincipalICD9ProcedureCD)] <- "None"

FLMV <- FullLMV[-c(1, 5, 6, 7, 10, 11, 12, 13, 15:25, 38, 41, 42, 44, 45, 46, 48:51, 57, 58, 61:79, 80, 81,
                   83:102, 110:112, 134:136, 142:146)]
count_FLMV <- apply(FLMV, 2, function(x) length(which(!is.na(x))))
count_FLMV <- data.frame(count_FLMV)

FLMV<- FLMV[complete.cases(FLMV$GenderCD), ] # Gender

R <- FullLMV[c(3, 14)]
RR <- left_join(FLMV, R, by = c("IndexPatientAccountKEY"))
RR <- RR[-c(1, 2, 3)]

write.csv(FLMV, file = "FLMV.csv")
Final_inp <- read.csv(file = "C:/Users/Amal/Desktop/FinalProject/Final_inp_demod_Procedures_Diag_Corm.csv", header = TRUE, sep = ",")
FLMV <- read.csv(file = "/Users/amalbukhari/Desktop/FinalProject/FLMV.csv", header = TRUE, sep = ",")
FullLMV <- read.csv(file = "/Users/amalbukhari/Desktop/FullLMV.csv", header = TRUE, sep = ",")


FinalLogesticRegression.R

library(ROSE)
library(caret)

# Full data: 80/20 train/test split
set.seed(1234)
ind <- sample(2, nrow(Data), replace = T, prob = c(0.8, 0.2))
train <- Data[ind == 1, ]
test <- Data[ind == 2, ]
# Dummy-coded data
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]

# InfoGain feature subset
set.seed(1234)
ind <- sample(2, nrow(Fsubsetdummy), replace = T, prob = c(0.8, 0.2))
train <- Fsubsetdummy[ind == 1, ]
test <- Fsubsetdummy[ind == 2, ]

# Wrapper (backward) feature subset
ind <- sample(2, nrow(WsubsetDummy), replace = T, prob = c(0.8, 0.2))
train <- WsubsetDummy[ind == 1, ]
test <- WsubsetDummy[ind == 2, ]

#WithoutSampling

# Logistic regression model
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)
mymodel <- glm(PHSReadmission30DayFLG ~ ., data = train, family = binomial(link = 'logit'))
summary(mymodel)

# Prediction
p1 <- predict(mymodel, train, type = 'response')
head(p1)
head(train)

# Misclassification error - train data (predicted probability cut at 0.5)
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = train$PHSReadmission30DayFLG)
tab1
1 - sum(diag(tab1))/sum(tab1)

# Misclassification error - test data
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)

p2 <- predict(mymodel, test, type = 'response', se.fit = FALSE)
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$PHSReadmission30DayFLG)
tab2
1 - sum(diag(tab2))/sum(tab2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(factor(pred2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred2 <- as.numeric(pred2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred2, test$PHSReadmission30DayFLG)

# Over-sampling the minority class with ROSE::ovun.sample
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)
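# Editorial note: ovun.sample() with method = "over" duplicates minority-class
# rows until the resampled training set contains N rows in total; method = "under"
# (used further below) instead drops majority-class rows down to N, and
# method = "both" combines the two around p = 0.5. The N values used in these
# scripts (11566, 2362, 6964) were presumably chosen from the class counts of
# this particular training split.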

# Train model on the over-sampled data
mymodel <- glm(PHSReadmission30DayFLG ~ ., data = data_balanced_over, family = 'binomial')
summary(mymodel)

# Prediction
p1 <- predict(mymodel, data_balanced_over, type = 'response')
head(p1)
head(data_balanced_over)

# Misclassification error - train data
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = data_balanced_over$PHSReadmission30DayFLG)
tab1
1 - sum(diag(tab1))/sum(tab1)

# Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$PHSReadmission30DayFLG)
tab2
1 - sum(diag(tab2))/sum(tab2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(factor(pred2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred2 <- as.numeric(pred2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred2, test$PHSReadmission30DayFLG)

# Under-sampling the majority class
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "under", N = 2362, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
# Train model
mymodel <- glm(PHSReadmission30DayFLG ~ ., data = data_balanced_under, family = 'binomial')
summary(mymodel)

# Prediction
p1 <- predict(mymodel, data_balanced_under, type = 'response')
head(p1)
head(data_balanced_under)
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
# Misclassification error - train data
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = data_balanced_under$PHSReadmission30DayFLG)
tab1
1 - sum(diag(tab1))/sum(tab1)

# Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$PHSReadmission30DayFLG)
tab2
1 - sum(diag(tab2))/sum(tab2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(factor(pred2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred2 <- as.numeric(pred2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred2, test$PHSReadmission30DayFLG)

# Both over- and under-sampling
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)

# Train model
mymodel <- glm(PHSReadmission30DayFLG ~ ., data = data_balanced_both, family = 'binomial')
summary(mymodel)

# Prediction
p1 <- predict(mymodel, data_balanced_both, type = 'response')
head(p1)
head(data_balanced_both)
data_balanced_both$PHSReadmission30DayFLG <- as.factor(data_balanced_both$PHSReadmission30DayFLG)
# Misclassification error - train data
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = data_balanced_both$PHSReadmission30DayFLG)
tab1
1 - sum(diag(tab1))/sum(tab1)

# Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$PHSReadmission30DayFLG)
tab2
1 - sum(diag(tab2))/sum(tab2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(factor(pred2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred2 <- as.numeric(pred2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred2, test$PHSReadmission30DayFLG)

FinalDecisionTree.R

# Install/load packages
install.packages("ROSE")
library(ROSE)
library(caret)

# Embedded feature subset
set.seed(1234)
ind <- sample(2, nrow(DTdata), replace = T, prob = c(0.8, 0.2))
train <- DTdata[ind == 1, ]
test <- DTdata[ind == 2, ]
# Full data
set.seed(1234)
ind <- sample(2, nrow(Data), replace = T, prob = c(0.8, 0.2))
train <- Data[ind == 1, ]
test <- Data[ind == 2, ]
# Wrapper feature subset
ind <- sample(2, nrow(Wsubset), replace = T, prob = c(0.8, 0.2))
train <- Wsubset[ind == 1, ]
test <- Wsubset[ind == 2, ]

# InfoGain feature subset
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubset[ind == 1, ]
test <- Fsubset[ind == 2, ]
# Dummy-coded data
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]

#No Sampling

# Train model
library(rpart)
library(rpart.plot)
dt <- rpart(PHSReadmission30DayFLG ~ ., train, method = "class")
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)
# Plot tree
rpart.plot(dt)
text(dt)
dt

# If there aren't enough branches, decrease the complexity parameter (default is .01)
dt <- rpart(PHSReadmission30DayFLG ~ ., control = rpart.control(cp = 0.01), train)


# Predict
results <- predict(dt, test)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
# Variable importance
dt$variable.importance
head(results[, 2])
res_binary <- ifelse(results[, 2] > .5, 1, 0)  #decrease
library(caret)
library(pROC)
# Confusion matrix of test set
confusionMatrix(factor(res_binary), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
res_binary <- as.numeric(res_binary)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(res_binary, test$PHSReadmission30DayFLG)
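# Editorial note: for an rpart classification tree, predict(dt, test) returns a
# matrix of class probabilities (one column per class), so results[, 2] is the
# predicted probability of readmission (class "1") and the 0.5 cut-off converts
# it into a hard 0/1 label for the confusion matrix and AUC above.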

# Over-sampling
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)

# Train model
dt <- rpart(PHSReadmission30DayFLG ~ ., data_balanced_over, method = "class")
data_balanced_over$PHSReadmission30DayFLG <- as.factor(data_balanced_over$PHSReadmission30DayFLG)
# Plot tree
plot(dt)
text(dt)
dt
# If you see that there aren't enough branches, decrease the complexity parameter (default is .01)
dt <- rpart(PHSReadmission30DayFLG ~ ., control = rpart.control(cp = 0.01), data_balanced_over)

# Predict
results <- predict(dt, test)

# Variable importance
dt$variable.importance
head(results[, 2])
res_binary <- ifelse(results[, 2] > .5, 1, 0)  #decrease


# Confusion matrix of test set
confusionMatrix(factor(res_binary), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(res_binary, test$PHSReadmission30DayFLG)

# Under-sampling
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "under", N = 2362, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)

# Train model
dt <- rpart(PHSReadmission30DayFLG ~ ., data_balanced_under, method = "class")
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
# Plot tree
plot(dt)
text(dt)
dt
# If you see that there aren't enough branches, decrease the complexity parameter (default is .01)
dt <- rpart(PHSReadmission30DayFLG ~ ., control = rpart.control(cp = 0.01), data_balanced_under)

# Predict
results <- predict(dt, test)

# Variable importance
dt$variable.importance
head(results[, 2])
res_binary <- ifelse(results[, 2] > .5, 1, 0)  #decrease

# Confusion matrix of test set
confusionMatrix(factor(res_binary), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(res_binary, test$PHSReadmission30DayFLG)

# Both over- and under-sampling
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)

# Train model
dt <- rpart(PHSReadmission30DayFLG ~ ., data_balanced_both, method = "class")
data_balanced_both$PHSReadmission30DayFLG <- as.factor(data_balanced_both$PHSReadmission30DayFLG)
# Plot tree
plot(dt)
text(dt)
dt
# If you see that there aren't enough branches, decrease the complexity parameter (default is .01)
dt <- rpart(PHSReadmission30DayFLG ~ ., control = rpart.control(cp = 0.01), data_balanced_both)

# Predict
results <- predict(dt, test)

# Variable importance
dt$variable.importance
head(results[, 2])
res_binary <- ifelse(results[, 2] > .5, 1, 0)  #decrease

# Confusion matrix of test set
confusionMatrix(factor(res_binary), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(res_binary, test$PHSReadmission30DayFLG)

# ROSE synthetic sampling
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubset[ind == 1, ]
test <- Fsubset[ind == 2, ]
data.rose <- ROSE(PHSReadmission30DayFLG ~ ., data = train, seed = 1)$data
table(data.rose$PHSReadmission30DayFLG)
# Train model
dt <- rpart(PHSReadmission30DayFLG ~ ., data.rose, method = "class")
data.rose$PHSReadmission30DayFLG <- as.factor(data.rose$PHSReadmission30DayFLG)
# Plot tree
plot(dt)
text(dt)
dt
# If you see that there aren't enough branches, decrease the complexity parameter (default is .01)
dt <- rpart(PHSReadmission30DayFLG ~ ., control = rpart.control(cp = 0.01), data.rose)

# Predict
results <- predict(dt, test)

# Variable importance
dt$variable.importance
head(results[, 2])
res_binary <- ifelse(results[, 2] > .5, 1, 0)  #decrease

# Confusion matrix of test set
confusionMatrix(factor(res_binary), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(res_binary, test$PHSReadmission30DayFLG)
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)

ROSE.holdout <- ROSE.eval(PHSReadmission30DayFLG ~ ., data = train, learner = rpart,
                          method.assess = "holdout", extr.pred = function(obj) obj[, 2], seed = 1)
ROSE.holdout
data.rose$PHSReadmission30DayFLG <- as.factor(data.rose$PHSReadmission30DayFLG)
# Cross-validation (10-fold, repeated 3 times)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dt <- train(PHSReadmission30DayFLG ~ ., data = data.rose, method = "rpart",
            parms = list(split = "information"), trControl = trctrl, tuneLength = 10)

# Plot tree
plot(dt)
text(dt)
dt
prp(dt$finalModel, box.palette = "Reds", tweak = 1.2)

# Predict
results <- predict(dt, test)

library(caret)
# Confusion matrix of test set
confusionMatrix(factor(results), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
auc(results, test$PHSReadmission30DayFLG)


FinalRandomForest.R

# Full data: 80/20 train/test split
set.seed(1234)
ind <- sample(2, nrow(Data), replace = T, prob = c(0.8, 0.2))
train <- Data[ind == 1, ]
test <- Data[ind == 2, ]
# Embedded feature subset
set.seed(1234)
ind <- sample(2, nrow(EmbededRF), replace = T, prob = c(0.8, 0.2))
train <- EmbededRF[ind == 1, ]
test <- EmbededRF[ind == 2, ]
# Dummy-coded data
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]

# InfoGain feature subset
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubset[ind == 1, ]
test <- Fsubset[ind == 2, ]

# Wrapper (backward) feature subset
ind <- sample(2, nrow(Wsubset), replace = T, prob = c(0.8, 0.2))
train <- Wsubset[ind == 1, ]
test <- Wsubset[ind == 2, ]

# Without sampling
# Random Forest
library(randomForest)
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)
set.seed(222)
rf <- randomForest(PHSReadmission30DayFLG ~ ., data = train,
                   ntree = 300,
                   mtry = 8,
                   importance = TRUE,
                   proximity = TRUE)
print(rf)
attributes(rf)
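# Editorial note: ntree = 300 grows 300 trees and mtry = 8 samples 8 candidate
# predictors at each split; importance = TRUE stores the variable-importance
# measures reported later, and proximity = TRUE additionally computes the
# (memory-hungry) case-proximity matrix.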

# Prediction & confusion matrix - train data
library(caret)
p1 <- predict(rf, train)
confusionMatrix(p1, train$PHSReadmission30DayFLG)

# Prediction & confusion matrix - test data
p2 <- predict(rf, test)
p2 <- as.factor(p2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(p2, test$PHSReadmission30DayFLG)

# Confusion matrix of test set
confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

# Over-sampling
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)

# Train model
library(randomForest)
data_balanced_over$PHSReadmission30DayFLG <- as.factor(data_balanced_over$PHSReadmission30DayFLG)
set.seed(222)
rf <- randomForest(PHSReadmission30DayFLG ~ ., data_balanced_over,
                   ntree = 300,
                   mtry = 8,
                   importance = TRUE,
                   proximity = TRUE)
print(rf)
attributes(rf)
importance(rf)

# Prediction & confusion matrix - train data
library(caret)
p1 <- predict(rf, data_balanced_over)
confusionMatrix(p1, data_balanced_over$PHSReadmission30DayFLG, positive = '1')

# Prediction & confusion matrix - test data
p2 <- predict(rf, test)
p2 <- as.factor(p2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(p2, test$PHSReadmission30DayFLG, positive = '1')


# Confusion matrix of test set
confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

# Under-sampling
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "under", N = 2362, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)
# Train model
library(randomForest)
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
set.seed(222)
rf <- randomForest(PHSReadmission30DayFLG ~ ., data_balanced_under,
                   ntree = 300,
                   mtry = 8,
                   importance = TRUE,
                   proximity = TRUE)
print(rf)
attributes(rf)
# Prediction & confusion matrix - train data
library(caret)
p1 <- predict(rf, data_balanced_under)
confusionMatrix(p1, data_balanced_under$PHSReadmission30DayFLG, positive = '1')

# Prediction & confusion matrix - test data
p2 <- predict(rf, test)
p2 <- as.factor(p2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(p2, test$PHSReadmission30DayFLG, positive = '1')

# Confusion matrix of test set
confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

# Both over- and under-sampling
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)

# Train model
library(randomForest)
data_balanced_both$PHSReadmission30DayFLG <- as.factor(data_balanced_both$PHSReadmission30DayFLG)
set.seed(222)
rf <- randomForest(PHSReadmission30DayFLG ~ ., data_balanced_both,
                   ntree = 300,
                   mtry = 8,
                   importance = TRUE,
                   proximity = TRUE)
print(rf)
attributes(rf)
# Prediction & confusion matrix - train data
library(caret)
p1 <- predict(rf, data_balanced_both)
confusionMatrix(p1, data_balanced_both$PHSReadmission30DayFLG, positive = '1')

# Prediction & confusion matrix - test data
p2 <- predict(rf, test)
p2 <- as.factor(p2)
test$PHSReadmission30DayFLG <- as.factor(test$PHSReadmission30DayFLG)
confusionMatrix(p2, test$PHSReadmission30DayFLG, positive = '1')

# Confusion matrix of test set
confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)


RandomForestCrossValidated.R

# Cross-validation with under-sampling
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]

library(e1071)
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.01, 0.5, 0.01))
data_balanced_under$PHSReadmission30DayFLG <- as.character(data_balanced_under$PHSReadmission30DayFLG)
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
train(PHSReadmission30DayFLG ~ ., data = data_balanced_under, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)
stevensTreeCV <- rpart(PHSReadmission30DayFLG ~ ., data = data_balanced_under, method = "class", cp = 0.01)
predictionCV <- predict(stevensTreeCV, newdata = test, type = "class")
table(test$PHSReadmission30DayFLG, predictionCV)
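# Editorial note: despite the file name, this block cross-validates an rpart
# decision tree on the under-sampled training data: caret::train() runs 10-fold
# CV over the cp grid (0.01-0.5), and the final tree is then refit with cp = 0.01
# and evaluated on the untouched test split.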

# Confusion matrix of test set
confusionMatrix(factor(predictionCV), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(rpart)
library(rpart.plot)
prp(stevensTreeCV)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(predictionCV, test$PHSReadmission30DayFLG)


FinalNäiveBayes.R

install.packages("naivebayes")
library(naivebayes)
library(dplyr)
library(ggplot2)
install.packages("psych")
library(psych)

# Full data: 80/20 train/test split
set.seed(1234)
ind <- sample(2, nrow(Data), replace = T, prob = c(0.8, 0.2))
train <- Data[ind == 1, ]
test <- Data[ind == 2, ]

# InfoGain feature subset
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubset[ind == 1, ]
test <- Fsubset[ind == 2, ]

# Wrapper (backward) feature subset
ind <- sample(2, nrow(Wsubset), replace = T, prob = c(0.8, 0.2))
train <- Wsubset[ind == 1, ]
test <- Wsubset[ind == 2, ]

# No sampling
# Naive Bayes model
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)
model <- naive_bayes(PHSReadmission30DayFLG ~ ., data = train)
model
plot(model)
# Predict
p <- predict(model, train, type = 'prob')
head(cbind(p, train))
# Confusion matrix - train data
p1 <- predict(model, train)
(tab1 <- table(p1, train$PHSReadmission30DayFLG))
1 - sum(diag(tab1))/sum(tab1)

# Confusion matrix - test data
p2 <- predict(model, test)
(tab2 <- table(p2, test$PHSReadmission30DayFLG))
1 - sum(diag(tab2))/sum(tab2)
summary(tab2, type = c("Fscore", "Recall", "Precision"))
plot(tab2)

# Confusion matrix of test set
confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

# Over-sampling (N = 11566)
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)

# Train Naive Bayes model on the over-sampled data
data_balanced_over$PHSReadmission30DayFLG <- as.factor(data_balanced_over$PHSReadmission30DayFLG)
model <- naive_bayes(PHSReadmission30DayFLG ~ ., data = data_balanced_over)
model
plot(model)
# Predict
p <- predict(model, data_balanced_over, type = 'prob')
head(cbind(p, data_balanced_over))
# Confusion matrix - train data
p1 <- predict(model, data_balanced_over)
(tab1 <- table(p1, data_balanced_over$PHSReadmission30DayFLG))
1 - sum(diag(tab1))/sum(tab1)

# Confusion matrix - test data
p2 <- predict(model, test)
(tab2 <- table(p2, test$PHSReadmission30DayFLG))
1 - sum(diag(tab2))/sum(tab2)

confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

# Under-sampling
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "under", N = 2236, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)
# Train Naive Bayes model
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
model <- naive_bayes(PHSReadmission30DayFLG ~ ., data = data_balanced_under)
model
plot(model)
# Predict
p <- predict(model, data_balanced_under, type = 'prob')
head(cbind(p, data_balanced_under))
# Confusion matrix - train data
p1 <- predict(model, data_balanced_under)
(tab1 <- table(p1, data_balanced_under$PHSReadmission30DayFLG))
1 - sum(diag(tab1))/sum(tab1)

# Confusion matrix - test data
p2 <- predict(model, test)
(tab2 <- table(p2, test$PHSReadmission30DayFLG))
1 - sum(diag(tab2))/sum(tab2)

confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)

set.seed(1234)
# Both over- and under-sampling
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)
data_balanced_both$PHSReadmission30DayFLG <- as.factor(data_balanced_both$PHSReadmission30DayFLG)
# Naive Bayes model
model <- naive_bayes(PHSReadmission30DayFLG ~ ., data = data_balanced_both)
model
plot(model)
# Predict
p <- predict(model, data_balanced_both, type = 'prob')
head(cbind(p, data_balanced_both))
# Confusion matrix - train data
p1 <- predict(model, data_balanced_both)
(tab1 <- table(p1, data_balanced_both$PHSReadmission30DayFLG))
1 - sum(diag(tab1))/sum(tab1)

# Confusion matrix - test data
p2 <- predict(model, test)
(tab2 <- table(p2, test$PHSReadmission30DayFLG))
1 - sum(diag(tab2))/sum(tab2)

confusionMatrix(factor(p2), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
p2 <- as.numeric(p2)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(p2, test$PHSReadmission30DayFLG)


FinalXGboost.R

install.packages("xgboost")
library(xgboost)
library(magrittr)
library(dplyr)
library(Matrix)

# Full data: 80/20 train/test split
set.seed(1234)
ind <- sample(2, nrow(Data), replace = T, prob = c(0.8, 0.2))
train <- Data[ind == 1, ]
test <- Data[ind == 2, ]
# Dummy-coded data
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]

# InfoGain feature subset
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubsetdummy[ind == 1, ]
test <- Fsubsetdummy[ind == 2, ]

# Wrapper (backward) feature subset
set.seed(1234)
ind <- sample(2, nrow(WsubsetDummy), replace = T, prob = c(0.8, 0.2))
train <- WsubsetDummy[ind == 1, ]
test <- WsubsetDummy[ind == 2, ]

# No sampling
train$PHSReadmission30DayFLG <- as.integer(train$PHSReadmission30DayFLG)

# Create matrix - one-hot encoding for factor variables
trainm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = train)
head(trainm)
train_label <- train$PHSReadmission30DayFLG
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = test)
test_label <- test$PHSReadmission30DayFLG
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)

# Parameters
nc <- length(unique(train_label))
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)
nc <- as.numeric(as.character(nc))
# eXtreme Gradient Boosting model
bst_model <- xgb.train(params = xgb_params,
                       data = train_matrix,
                       nrounds = 1000,
                       watchlist = watchlist,
                       eta = 0.001,
                       max.depth = 3,
                       gamma = 0,
                       subsample = 1,
                       colsample_bytree = 1,
                       missing = NA,
                       seed = 333)
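# Editorial note: with objective "multi:softprob" and num_class = nc (two classes
# for this binary outcome), predict() returns the class probabilities as one flat
# vector, which is why the prediction step below reshapes it into an nc-column
# matrix and takes max.col(...) - 1 as the predicted label. The watchlist logs
# train and test mlogloss at every round, so the small learning rate (eta = 0.001)
# paired with 1000 boosting rounds can be monitored in the error plot.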

# Training & test error plot
e <- data.frame(bst_model$evaluation_log)
plot(e$iter, e$train_mlogloss, col = 'blue')
lines(e$iter, e$test_mlogloss, col = 'red')
min(e$test_mlogloss)
e[e$test_mlogloss == 0.463865, ]

# Feature importance
imp <- xgb.importance(colnames(train_matrix), model = bst_model)
print(imp)
xgb.plot.importance(imp)

# Prediction & confusion matrix - test data
p <- predict(bst_model, newdata = test_matrix)
prediction <- matrix(p, nrow = nc, ncol = length(p)/nc) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
table(Prediction = prediction$max_prob, Actual = prediction$label)

# Confusion matrix of test set
confusionMatrix(factor(prediction$max_prob), factor(prediction$label),
                mode = "everything", positive = '1')
library(pROC)
auc(prediction$max_prob, prediction$label)

# Over-sampling
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)
data_balanced_over$PHSReadmission30DayFLG <- as.integer(data_balanced_over$PHSReadmission30DayFLG)

# Create matrix - one-hot encoding for factor variables
trainm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = data_balanced_over)
head(trainm)
train_label <- data_balanced_over$PHSReadmission30DayFLG
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = test)
test_label <- test$PHSReadmission30DayFLG
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)

# Parameters
nc <- length(unique(train_label))
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)

# eXtreme Gradient Boosting model
bst_model <- xgb.train(params = xgb_params,
                       data = train_matrix,
                       nrounds = 1000,
                       watchlist = watchlist,
                       eta = 0.001,
                       max.depth = 3,
                       gamma = 0,
                       subsample = 1,
                       colsample_bytree = 1,
                       missing = NA,
                       seed = 333)

# Training & test error plot
e <- data.frame(bst_model$evaluation_log)
plot(e$iter, e$train_mlogloss, col = 'blue')
lines(e$iter, e$test_mlogloss, col = 'red')
min(e$test_mlogloss)
e[e$test_mlogloss == 0.667025, ]

# Feature importance
imp <- xgb.importance(colnames(train_matrix), model = bst_model)
print(imp)
xgb.plot.importance(imp)

# Prediction & confusion matrix - test data
p <- predict(bst_model, newdata = test_matrix)
prediction <- matrix(p, nrow = nc, ncol = length(p)/nc) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
table(Prediction = prediction$max_prob, Actual = prediction$label)

# Confusion matrix of test set
confusionMatrix(factor(prediction$max_prob), factor(prediction$label),
                mode = "everything", positive = '1')
library(pROC)
auc(prediction$max_prob, prediction$label)

# Under-sampling
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "under", N = 2362, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)
# Train model
data_balanced_under$PHSReadmission30DayFLG <- as.integer(data_balanced_under$PHSReadmission30DayFLG)

# Create matrix - one-hot encoding for factor variables
trainm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = data_balanced_under)
head(trainm)
train_label <- data_balanced_under$PHSReadmission30DayFLG
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = test)
test_label <- test$PHSReadmission30DayFLG
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)

# Parameters
nc <- length(unique(train_label))
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)

# eXtreme Gradient Boosting model
bst_model <- xgb.train(params = xgb_params,
                       data = train_matrix,
                       nrounds = 1000,
                       watchlist = watchlist,
                       eta = 0.001,
                       max.depth = 3,
                       gamma = 0,
                       subsample = 1,
                       colsample_bytree = 1,
                       missing = NA,
                       seed = 333)

# Training & test error plot
e <- data.frame(bst_model$evaluation_log)
plot(e$iter, e$train_mlogloss, col = 'blue')
lines(e$iter, e$test_mlogloss, col = 'red')
min(e$test_mlogloss)
e[e$test_mlogloss == 0.67407, ]

# Feature importance
imp <- xgb.importance(colnames(train_matrix), model = bst_model)
print(imp)
xgb.plot.importance(imp)

# Prediction & confusion matrix - test data
p <- predict(bst_model, newdata = test_matrix)
prediction <- matrix(p, nrow = nc, ncol = length(p)/nc) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
table(Prediction = prediction$max_prob, Actual = prediction$label)

# Confusion matrix of test set
confusionMatrix(factor(prediction$max_prob), factor(prediction$label),
                mode = "everything", positive = '1')
library(pROC)
auc(prediction$max_prob, prediction$label)

set.seed(1234)
# Both over- and under-sampling
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train, method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)
data_balanced_both$PHSReadmission30DayFLG <- as.integer(data_balanced_both$PHSReadmission30DayFLG)

# Create matrix - one-hot encoding for factor variables
trainm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = data_balanced_both)
head(trainm)
train_label <- data_balanced_both$PHSReadmission30DayFLG
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(PHSReadmission30DayFLG ~ . - 1, data = test)
test_label <- test$PHSReadmission30DayFLG
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)

# Parameters
nc <- length(unique(train_label))
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)


# eXtreme Gradient Boosting Model bst_model <- xgb.train(params = xgb_params, data = train_matrix, nrounds = 1000, watchlist = watchlist, eta = 0.001, max.depth = 3, gamma = 0, subsample = 1, colsample_bytree = 1, missing = NA, seed = 333)

# Training & test error plot
e <- data.frame(bst_model$evaluation_log)
plot(e$iter, e$train_mlogloss, col = 'blue')
lines(e$iter, e$test_mlogloss, col = 'red')
min(e$test_mlogloss)
e[e$test_mlogloss == 0.668152, ]

# Feature importance
imp <- xgb.importance(colnames(train_matrix), model = bst_model)
print(imp)
xgb.plot.importance(imp)

# Prediction & confusion matrix - test data
p <- predict(bst_model, newdata = test_matrix)
prediction <- matrix(p, nrow = nc, ncol = length(p)/nc) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
table(Prediction = prediction$max_prob, Actual = prediction$label)

# confusion matrix of test set
confusionMatrix(factor(prediction$max_prob), factor(prediction$label),
                mode = "everything", positive = '1')
library(pROC)
auc(prediction$max_prob, prediction$label)


FinalSupportVectorMachine.R

# Data
set.seed(1234)
ind <- sample(2, nrow(Dummydata), replace = T, prob = c(0.8, 0.2))
train <- Dummydata[ind == 1, ]
test <- Dummydata[ind == 2, ]
write.csv(Dummydata, file = "Dummydata.csv")

# InfoGain
set.seed(1234)
ind <- sample(2, nrow(Fsubset), replace = T, prob = c(0.8, 0.2))
train <- Fsubset[ind == 1, ]
test <- Fsubset[ind == 2, ]
write.csv(Fsubset, file = "Fsubset.csv")

# Wrapper-Backward
ind <- sample(2, nrow(Wsubset), replace = T, prob = c(0.8, 0.2))
train <- Wsubset[ind == 1, ]
test <- Wsubset[ind == 2, ]
write.csv(Wsubset, file = "Wsubset.csv")
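# Optional sketch: a stratified alternative to the random 80/20 splits above,
# using caret's createDataPartition so both partitions keep a similar
# readmission rate. 'Dummydata' is used for illustration (the same pattern
# applies to Fsubset and Wsubset); 'train_strat' / 'test_strat' are
# illustrative names.
library(caret)
set.seed(1234)
idx <- createDataPartition(factor(Dummydata$PHSReadmission30DayFLG),
                           p = 0.8, list = FALSE)
train_strat <- Dummydata[idx, ]
test_strat  <- Dummydata[-idx, ]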

# WithoutSampling
# Support Vector Machine
library(e1071)
train$PHSReadmission30DayFLG <- as.factor(train$PHSReadmission30DayFLG)
mymodel <- svm(PHSReadmission30DayFLG ~ ., data = train, kernel = "radial")
summary(mymodel)

# Tuning
set.seed(123)
tmodel <- tune(svm, PHSReadmission30DayFLG ~ ., data = train,
               ranges = list(epsilon = seq(0, 1, 0.02), cost = 2^(2:6)))
plot(tmodel)
summary(tmodel)
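# Note (sketch): in e1071, 'epsilon' belongs to epsilon-regression and has no
# effect on C-classification; for classification the usual grid is over cost
# and gamma. An illustrative alternative grid follows ('tmodel_cg' is an
# illustrative name, not the grid used for the reported results).
tmodel_cg <- tune(svm, PHSReadmission30DayFLG ~ ., data = train,
                  ranges = list(gamma = 2^(-5:1), cost = 2^(2:6)))
summary(tmodel_cg)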

# best Model
mymodel <- tmodel$best.model
summary(mymodel)

# Confusion Matrix and Misclassification Error
pred <- predict(mymodel, test)
tab <- table(Predicted = pred, Actual = test$PHSReadmission30DayFLG)
tab
1 - sum(diag(tab)) / sum(tab)

confusionMatrix(factor(pred), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred <- as.numeric(pred)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred, test$PHSReadmission30DayFLG)
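# Optional sketch: AUC from SVM class probabilities instead of hard class
# predictions. This requires refitting with probability = TRUE; 'mymodel_prob'
# is an illustrative name, and the "1" column label assumes the flag levels are
# "0"/"1".
mymodel_prob <- svm(PHSReadmission30DayFLG ~ ., data = train,
                    kernel = "radial", probability = TRUE)
prob_pos <- attr(predict(mymodel_prob, test, probability = TRUE),
                 "probabilities")[, "1"]
auc(test$PHSReadmission30DayFLG, prob_pos)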

# Over sampling
data_balanced_over <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train,
                                  method = "over", N = 11566)$data
table(data_balanced_over$PHSReadmission30DayFLG)

# Support Vector Machine
library(e1071)
data_balanced_over$PHSReadmission30DayFLG <- as.factor(data_balanced_over$PHSReadmission30DayFLG)
mymodel2 <- svm(PHSReadmission30DayFLG ~ ., data = data_balanced_over, kernel = "radial")
summary(mymodel2)

# Tuning
set.seed(123)
tmodel2 <- tune(svm, PHSReadmission30DayFLG ~ ., data = data_balanced_over,
                ranges = list(epsilon = seq(0, 1, 0.02), cost = 2^(2:6)))
plot(tmodel2)
summary(tmodel2)

# best Model
mymodel2 <- tmodel2$best.model
summary(mymodel2)

# Confusion Matrix and Misclassification Error
pred <- predict(mymodel2, test)
tab <- table(Predicted = pred, Actual = test$PHSReadmission30DayFLG)
tab
1 - sum(diag(tab)) / sum(tab)
confusionMatrix(factor(pred), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred <- as.numeric(pred)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred, test$PHSReadmission30DayFLG)

# Under sampling
data_balanced_under <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train,
                                   method = "under", N = 2362, seed = 1)$data
table(data_balanced_under$PHSReadmission30DayFLG)

# Train model
# Support Vector Machine
library(e1071)
data_balanced_under$PHSReadmission30DayFLG <- as.factor(data_balanced_under$PHSReadmission30DayFLG)
mymodel2 <- svm(PHSReadmission30DayFLG ~ ., data = data_balanced_under, kernel = "radial")
summary(mymodel2)

# Tuning
set.seed(123)
tmodel2 <- tune(svm, PHSReadmission30DayFLG ~ ., data = data_balanced_under,
                ranges = list(epsilon = seq(0, 1, 0.02), cost = 2^(2:6)))
plot(tmodel2)
summary(tmodel2)

# best Model
mymodel2 <- tmodel2$best.model
summary(mymodel2)

# Confusion Matrix and Misclassification Error
pred <- predict(mymodel2, test)
tab <- table(Predicted = pred, Actual = test$PHSReadmission30DayFLG)
tab
1 - sum(diag(tab)) / sum(tab)
confusionMatrix(factor(pred), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred <- as.numeric(pred)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred, test$PHSReadmission30DayFLG)

# BOTH
data_balanced_both <- ovun.sample(PHSReadmission30DayFLG ~ ., data = train,
                                  method = "both", p = 0.5, N = 6964, seed = 1)$data
table(data_balanced_both$PHSReadmission30DayFLG)

# Train model
# Support Vector Machine
library(e1071)
data_balanced_both$PHSReadmission30DayFLG <- as.factor(data_balanced_both$PHSReadmission30DayFLG)
mymodel2 <- svm(PHSReadmission30DayFLG ~ ., data = data_balanced_both, kernel = "radial")
summary(mymodel2)

# Tuning
set.seed(123)
tmodel2 <- tune(svm, PHSReadmission30DayFLG ~ ., data = data_balanced_both,
                ranges = list(epsilon = seq(0, 1, 0.02), cost = 2^(2:6)))
plot(tmodel2)
summary(tmodel2)

# best Model
mymodel2 <- tmodel2$best.model
summary(mymodel2)

# Confusion Matrix and Misclassification Error
pred <- predict(mymodel2, test)
tab <- table(Predicted = pred, Actual = test$PHSReadmission30DayFLG)
tab
1 - sum(diag(tab)) / sum(tab)
confusionMatrix(factor(pred), factor(test$PHSReadmission30DayFLG),
                mode = "everything", positive = '1')
library(pROC)
pred <- as.numeric(pred)
test$PHSReadmission30DayFLG <- as.numeric(test$PHSReadmission30DayFLG)
auc(pred, test$PHSReadmission30DayFLG)