The Use of Activity Monitoring and Machine Learning for the Functional Classification of Heart Failure

by

Jonathan-F. Benjamin Jason Jérémy Baril

A thesis submitted in conformity with the requirements for the degree of Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto

CC BY 4.0 by Jonathan-F. Benjamin Jason Jérémy Baril, unless otherwise prohibited

The Use of Activity Monitoring and Machine Learning for the Functional Classification of Heart Failure

Jonathan-F. Benjamin Jason Jérémy Baril
Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto
2018

Abstract

Background: Assessing the functional status of a heart failure patient is a highly subjective task.

Objective: This thesis aimed to find an accessible, objective means of assessing the New York Heart Association (NYHA) functional classification (FC) of a patient by leveraging modern machine learning techniques.

Methods: We first identified relevant quantitative data and upgraded Medly, a remote patient monitoring system (RPMS), to support data collection. We then built six different machine learning classifiers, including hidden Markov model, generalized linear model (GLM), random forest, and neural network based classifiers.

Results: The best overall classifier was found to be a boosted GLM, which achieved a classification performance (Cohen's kappa statistic κ = 0.73, balanced accuracy = 85%) comparable to human-level performance (κ = 0.75).

Conclusions: Although the investigated classifiers are not ready for implementation into a real RPMS, they show promise for making the evaluation of NYHA FC more universally consistent and reliable.


dedicated to Papa, without your encouragement this thesis would never have existed


Acknowledgments

Ah! The acknowledgements. As painful and lonely as it may be to compose a thesis, the acknowledgements section is by far the easiest and most pleasant section to write. It is both heart-warming and humbling to be reminded of how much, and how many others, have sacrificed to breathe life into this work - truly, without the help of these people this project would still be a mere figment of an idea in someone's mind. If you've contributed to this work, whether directly or indirectly, know that, even if I've somehow forgotten to include your name here, I am eternally grateful for your help and contribution to this work.

Firstly, I need to acknowledge our patients: it is probably only those of us who do health research who truly understand how much these projects live and die by the pure, selfless generosity of patients. Thank you for trusting us with your health and your data. I can only hope this work will somehow contribute to ultimately making the need for your generosity obsolete.

Second, my committee: Drs. Joe Cafazzo, Cedric Manlhiot, Heather Ross, and Babak Taati. Your contributions to this project cannot be overstated – in fact, my biggest regret in this project is not having taken greater advantage of your experience and wisdom. Your guidance, correction, teaching, encouragement and advice were invaluable in getting this project anywhere. Thanks also go to Dr. Rob Nolan for taking time to serve as the external examiner for this thesis.

I am also hugely indebted to Simon Bromberg, Raghad Abdulmajeed and Dr. Yasbanoo Moayedi, not only for your foundational work on which I was able to build my own, but also for leaving behind a treasure trove of data that was indispensable for getting this project started.

Special thanks to Edgar Crowdy, Steven Fan, Bridgette Mueller, Mohammad Peikari, Emily Somerset, and Kabir Sakhrani at the Cardiovascular Data Management Centre for your advice and tips with regards to the analytics but also your incredible help with much of the last-minute data collection, analytics, processing and people-power that went into the ‘research’ part of this project.

Heartfelt thanks also go to Jason Hearn, not only for your contributions to this work as part of the aforementioned group, but also for your puns, listening ear and friendship while journeying through the adventure of doing an MHSc at the Centre these last 2 years. If only all graduate students were so fortunate.

Enormous thanks to Iqra Ashfaq, Alana Tibbles, Patrick Ware, Dr. Emily Seto, and Mary O'Sullivan. Goodness knows how many times I interrupted your work for this project. Thank you so much for your patience and for being so willing to share your time, your resources, and expertise around all things Medly (as well as for rooting for me all along the way).

Additional thanks go to:

Stephanie Wilson, Diane De Sousa and especially Larissa Maderia for all the hard work you put in so we could get Fitbit integrated into Medly.

Damon Pfaff, Owen Thijssen and Mike Lovas for your design advice and allowing me to leech off your expertise.

James Agnew and Vlad Voloshyn for your technical help.

Melanie Yeung and Akib Uddin, not only for your operational and project management help on the Fitbit integration (and for the internship) but also for your timely encouragement and advice for getting through this degree.

Aarti Mathur and Alison Bison for your always joyous help with various admin and purchasing issues. Similarly, Jess Fifield, who also deserves additional accolades for her eternal patience in filtering my incessant requests, and for arranging, rearranging and further rearranging Dr. Cafazzo's calendar and always managing to find an available slot for Jason or for myself to meet with Dr. Cafazzo when necessary. Thanks also to Anna Yuan for managing to wrangle the schedules of 5 incredibly busy university professors so I could defend on time.

Quynh Pham, for your mentorship and encouragement, and for your unwavering enthusiasm at the Centre; for always always [sic] finding time to thoughtfully answer my questions, whether on REB applications, thesis writing, EPR or the myriad other elements of the research student life.

Plinio Morita, for your help and suggestions regarding some of the analytics in this project.

Shivani Goyal, especially for your help and advice regarding my OGS/CGS-M proposal. And speaking of:

Many thanks are owed to the Ted Rogers Centre for Heart Research and Peter Munk Cardiac Centre, Health Support through Information Technology Enhancements (hSITE), the Natural Sciences and Engineering Research Council (NSERC), the Canadian Institutes for Health Research (CIHR), the Government of Ontario, and the University of Toronto for funding various parts of this project at various times.


And of course, thank you to everyone else at Healthcare Human Factors and at eHealth Innovation who at various times pitched in, shared their expertise, provided advice or an encouraging word, or even just expressed interest in the work. Thank you also to Wayne, Chris and Anjum for extending the opportunity to learn, work and travel with the human factors team as part of my internships.

Thanks to Rhonda Marley, our wonderful Clin. Eng. coordinator, for alleviating, as much as you could, a lot of the burdensome administrative workload involved in a graduate degree.

Thank you to BESA, the IBBME community and especially the Clin. Eng. students who were part of our program. It was a true pleasure. We made it.

And lastly, on a personal note, none of this work would have been possible without friends and family who supported and encouraged me over these last 2 years - words cannot express how grateful I am for you. Thank you Maman, Papa, Alisson, Benjamin; Ruth and Alvis (my home away from home); Kyle F, Thomas, Esteban (when I needed a nice invigorating round of PUBG or GTA); Vanessa, Rebecca, Theresa, Duela, Sara & Matthew, Matt & Moni, Rachel & Justin, Melanie, Kyle N, Shawn, Valerie, Jamie, and Courtney (all of whom graciously let me go to the big TO but would probably rather I have stayed with them in Winnipeg). Special thanks in particular have to go to: Paul White, who had the dubious honor of reviewing the first draft of this thesis; Cameron MacGregor, who brought this program to my attention and joined me on the adventure; Knox Church (and my home church in particular; Sam, Chris, Hendrick, Stephen, Andrew, Bella, Roydon, Sarah, Lori, Thomas, Emily, Deborah, Larissa, Katie, Jackie, Danielle, and so many others), for your open arms and being my much-needed community in this new city; Tanisha Strachan, for keeping me sane these past few months, even though no one warned you that dating a grad student is often too much akin to dating a hermit; and of course, Jesus, because ultimately this was all for you.

Thank you all for your love, for your encouragement, and for your patience.

Now on to the main event…


Table of Contents

Acknowledgments

List of Tables

List of Figures

List of Abbreviations

Chapter 1 – Introduction
    1.1 Thesis Objective
    1.2 Formal Thesis Statement
    1.3 Thesis Summary
        1.3.1 Phase 1 – Replication of Previous Study
        1.3.2 Phase 2 – Activity Tracker Monitoring Implementation
        1.3.3 Phase 3 – Machine Learning Implementation & Validation

Chapter 2 – Background & Literature Review
    2.1 Congestive Heart Failure
        2.1.1 New York Heart Association Functional Classification
    2.2 Assessing Exercise Capacity
        2.2.1 The Medical Interview (Standardized & Unstandardized Questioning)
        2.2.2 Standardized In-Clinic Exercise Testing
        2.2.3 Fitness Trackers/Monitors
    2.3 Remote Patient Monitoring
        2.3.1 Medly
    2.4 Artificial Intelligence & Machine Learning
        2.4.1 Machine Learning
        2.4.2 Supervised, Unsupervised and Reinforcement Learning
        2.4.3 Classification vs Prediction Problems
        2.4.4 The Effect of Sample Size on Machine Learning
        2.4.5 State-of-the-art
    2.5 Summary

Chapter 3 – Replication of Previous Study
    3.1 Abstract
    3.2 Introduction
    3.3 Methods
        3.3.1 Recruitment
        3.3.2 Statistics
    3.4 Results and Discussion
        3.4.1 Principal Results
        3.4.2 Strengths and Limitations
    3.5 Conclusion
        3.5.1 Acknowledgements
        3.5.2 Ethics Approval
        3.5.3 Conflicts of Interest

Chapter 4 – Activity Tracker Monitoring Implementation
    4.1 Medly User Interface Overview
    4.2 Requirements
    4.3 Design & Implementation
        4.3.1 Activity Tracker Selection
        4.3.2 User Interface Design
    4.4 Summary

Chapter 5 – Assessment of NYHA Functional Classification using Hidden Markov Models
    5.1 Hidden Markov Models
        5.1.1 Rationale for the use of HMMs
    5.2 Methods
        5.2.1 Training Data
        5.2.2 Model Design
        5.2.3 Model Validation
    5.3 Results and Discussion
        5.3.1 Classification Performance
        5.3.2 Training Challenges
    5.4 Summary

Chapter 6 – Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning Models
    6.1 Machine Learning Models
        6.1.1 Generalized Linear Models
        6.1.2 Boosted Generalized Linear Models
        6.1.3 Random Forest
        6.1.4 Artificial Neural Networks
        6.1.5 Principal Component Analysis Artificial Neural Networks
    6.2 Methods
        6.2.1 Training Data
        6.2.2 Model Design
        6.2.3 Model Validation
    6.3 Results and Discussion
        6.3.1 Classification Performance
        6.3.2 Best Features
        6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation
    6.4 Summary

Chapter 7 – Conclusions, Recommendations & Future Work
    7.1 Conclusions
    7.2 Recommendations
    7.3 Future Work

References

Appendix A – Research Ethics
    I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of Heart Failure Severity
    II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers for Monitoring Heart Failure
    III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program for Heart Failure Patients
    IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a mobile phone-based telemonitoring program for heart failure patients

Appendix B – A Primer on Hidden Markov Models
    I. Basics of Markov Models (Hidden or Otherwise)
    II. Semi-Markov Model
    III. Hidden Markov & Semi-Markov Models Parameters
    IV. Determining Markov Model Parameters

Appendix C – Software Repository

Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier Performance Measures


List of Tables

Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79]
Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41]
Table 3: Inclusion criteria
Table 4: Exclusion criteria
Table 5: Study dataset demographics
Table 6: Study dataset demographics (overall and just NYHA II or III)
Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*)
Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class II vs. III
Table 9: Significant findings for comparisons between group II* and group III*
Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class II vs. III
Table 11: Non-significant findings for comparisons between group II* and group III*
Table 12: Candidate activity trackers
Table 13: Medly inclusion criteria
Table 14: Medly exclusion criteria
Table 15: iPhone vs. Android patients on Medly system using Fitbit a) all patients onboarded, b) only new Medly patients onboarded during thesis
Table 16: Patient adherence on Fitbit
Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT
Table 18: Minute-by-minute step count features
Table 19: Cardiopulmonary exercise testing data features
Table 20: Patient demographic data features
Table 21: Header abbreviations for Table 22
Table 22: Cross-sectional machine learning classifier performance metrics


List of Figures

Figure 2-1: Renin-Angiotensin-Aldosterone system [286]
Figure 2-2: Nervous system response to drop in blood pressure [287]
Figure 2-3: PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288]
Figure 3-1: Histogram of per minute step count values for each patient, grouped by individual NYHA class
Figure 3-2: Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0). Stacked internal segments indicate relative contributions by each patient
Figure 3-3: Individual frequency of per minute step counts for each patient (zoomed in to step counts > 0), grouped by NYHA class
Figure 3-4: Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each individual NYHA class
Figure 3-5: Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count maximums for each individual NYHA class
Figure 3-6: Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count maximums for each individual NYHA class
Figure 3-7: Number of zero step count minutes as a percentage of individual patient two-week data stream
Figure 4-1: Medly system patient smartphone user interface a) home screen b) trends screen [289]
Figure 4-2: Medly system clinical user web interface
Figure 4-3: Fitbit data flow diagram
Figure 4-4: Fitbit authentication process with a client app
Figure 4-5: Medly Fitbit patient access sequence
Figure 4-6: Medly Fitbit clinician access sequence
Figure 4-7: Proposed designs for patient user interface (home screen) a) combined heart rate and steps data on one card, b) combined heart rate with pictorial representations, c) separated heart rate and step data, d) only pictorial representation with mini graph
Figure 4-8: Proposed designs for patient user interface (trends) a) simple sparklines, b) data with bands to indicate min (resting), mean and max values for each time period, c) whisker plot to indicate daily range, d) heart rate (maximum and resting) and average step count values broken out for each time period, and e) Tufte style medical data visualization as per f), which is reproduced from [201]
Figure 4-9: Proposed design for authorization of new Fitbit by patient via Medly smartphone application
Figure 4-10: Proposed designs for clinical user interface (activity and heart rate graphs) a) simple graph design with indicator lines for alert levels and mean, b) design inspired by the Sick Kids T3 (tracking, trajectory and trigger) tool [206–208], c) mix of T3 tool with Medly range bands, d) whisker plot style, and e) simple graph with range bands and NYHA class prediction display (bottom of the more info page for step count graph)
Figure 4-11: Final web interface Fitbit authorization flow
Figure 4-12: Final web interface activity tracker profile & deauthorization flow
Figure 4-13: Final web interface activity tracker data display
Figure 4-14: Distribution of patient Fitbit adherence (as percent of days using the system)
Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model
Figure 5-2: Architecture for hidden Markov model based classifier
Figure 5-3: Distribution of per-minute step count for patients with NYHA class II and NYHA III (* grouped)
Figure 5-4: Overview of HMM based classifier performance
Figure 5-5: Example patient step count data (per 6 hour resolution)
Figure 5-6: Example patient step count data (per minute resolution)
Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236]
Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the distribution belongs in the family only when certain parameters are fixed). Adapted from [290]
Figure 6-2: Example of a decision tree (above) with corresponding feature space (below)
Figure 6-3: A perceptron
Figure 6-4: A neural network
Figure 6-5: k-fold cross-validation
Figure 6-6: Performance of the best CPET only classifier
Figure 6-7: Performance of the best step data only classifier
Figure 6-8: Performance of the best CPET + step data classifier
Figure 6-9: Performance of the second best CPET + step data classifier
Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with CPET & step data (with no data imputation)
Figure 6-11: Feature importance scores for GLM classifier using only step count data
Figure 6-12: Feature importance scores for random forest classifier using CPET + step count data
Figure 6-13: Performance of the best model with cross-validation performance difference
Figure B-1: Markov model


List of Abbreviations

6MWT 6 minute walk test
Acc accuracy
AI artificial intelligence
API application programming interface
AT anaerobic threshold
BNP brain natriuretic peptide
BP blood pressure
bpm beats per minute
CART classification and regression tree
CC correlation coefficient
CHF congestive heart failure
CI confidence interval
CO2 carbon dioxide
CPET cardiopulmonary exercise test
CV cross validation
DPMSC daily per minute step count
ECG electrocardiography (alternatively: electrocardiogram or electrocardiograph)
GLM generalized linear model
HF heart failure
HFrEF heart failure with reduced ejection fraction
HMM hidden Markov model
HMMBC hidden Markov model based classifier
HR heart rate
HRV heart rate variability
HT home telemonitoring
ICC intraclass correlation coefficient
IMU inertial measurement unit
LED light-emitting diode
LOOCV leave-one-out cross validation
LVEF left ventricular ejection fraction
ML machine learning
MVP minimum viable product
NIR no information rate
NNet neural net
NYHA New York Heart Association
O2 oxygen
PCA principal components analysis
PPG photoplethysmography
QI quality improvement
RCT randomized control trial
REB research ethics board
RER respiratory exchange ratio
RF random forest
ROC receiver operating characteristic
RPM remote patient monitoring
SC step count
SEM standard error of the mean
TGH Toronto General Hospital
UHN University Health Network
UI user interface

Chapter 1 – Introduction

Heart failure (HF), a complex chronic terminal phase of many cardiovascular diseases, is slowly becoming a worldwide silent pandemic [1]. The symptoms of heart failure are complex and difficult to manage for both patients and their physicians [2–4]. Care is made even more difficult because there is no reliable objective method for assessing the symptomatic (functional) status of a given HF patient, or by extension, if their symptoms have recently measurably deteriorated [5–7].

The current clinical gold standard for assessing a patient's symptom state is the New York Heart Association (NYHA) functional classification [8,9]. This system grades a patient's degree of heart failure based on a physician's interpretation of the patient's reported symptoms (mainly with respect to their degree of intolerance to exercise/physical activity) and is by its nature highly subjective. Despite these limitations, years of medical research and clinical observations have established many important relationships between a patient's symptom status and their prognostic outcomes [7,10], which makes it undesirable to simply replace or modify the existing NYHA functional classification scheme. However, finding an objective means of determining a patient's NYHA class would be of great benefit to both HF care and research as it would allow intra- and inter-physician and patient assessments of HF functional status to be more consistent [7,11,12]. At the very least, consistency would make communication of patient heart failure functional status in research, clinic notes, or other medical documentation more transparent and reliable.

1.1 Thesis Objective

The objective of this thesis is to design and develop a means of making the evaluation of NYHA functional class more consistent and reliable for the medical research and clinical community. The larger goal of this research work can be subdivided into four major sub-objectives:

1. To identify available relevant, objective data which may be useful for providing insights into patients' underlying NYHA functional class and, where required, to start the collection of this data.

2. To establish a basic foundational procedure for use by future researchers, data scientists and engineers to develop and assess machine learning based methods of evaluating NYHA functional class (trained to replicate classification by experienced physicians).

3. To perform a pilot analytics experiment, using data collected during an initial brief data collection period, to explore the viability of a few machine learning algorithms which could form the core of an objective and consistent system for evaluating NYHA functional class (and mirror classification by experienced physicians).

4. To provide a reflection on 'lessons learned', potential pitfalls and hazards to be mitigated in a real-life implementation of a machine learning based NYHA functional classification system.

1.2 Formal Thesis Statement

We hypothesize that it is possible to assess NYHA functional class with an expected level of performance at least equal to that of skilled humans, namely trained cardiologists, using objective data readily available or recordable as part of routine care.

1.3 Thesis Summary

The three phases of this thesis are summarized in the following sections 1.3.1 to 1.3.3. We first replicated a previous scientific study as part of initial investigations into relevant data. A basic physical activity data collection system was then implemented as part of an established remote patient monitoring system at the TGH HF clinic. Once sufficient data had been gathered by this system, we sought to train and validate several machine learning models and assess their potential usefulness for the task of classifying patients into their appropriate NYHA functional class. All research performed as part of this thesis was reviewed and received the required approvals from the UHN Research Ethics Board (REB). The approval letters are included as part of Appendix A.

1.3.1 Phase 1 – Replication of Previous Study

A previously published pilot study [13] showed a statistically significant association between NYHA functional class and total daily step count activity measured by wrist-worn activity monitors in patients with heart failure. However, the study's small sample had the unfortunate side-effect of limiting scientific confidence in the generalizability of these findings. Since step count activity is expected to be a highly relevant, useful, and massively feature-rich dataset, we replicated the study on a separate (otherwise limited) dataset collected during another previous study, to increase our confidence in the relevance and usefulness of step data for this particular thesis. This phase of the thesis was approved and covered under REB #15-9832.

1.3.2 Phase 2 – Activity Tracker Monitoring Implementation

Having validated the relevance of step data for this particular application, we upgraded Medly, the remote patient monitoring system already in use at the TGH HF clinic, so it could support the collection and display of continuous free-living activity data from a commercially available fitness tracker (a Fitbit), including minute-by-minute step count and heart rate data, which would form an important cornerstone in the rest of our analysis. This phase of the thesis, upon review by the UHN REB, was accorded a waiver of requirement for REB approval under REB #18-0221. The analysis of patient compliance was approved and covered under REB #16-5789.
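For concreteness, the sketch below shows the kind of request involved in pulling such minute-by-minute step data from the Fitbit Web API. It is a minimal illustration under stated assumptions, not the actual Medly integration code: the endpoint shown is Fitbit's public intraday time-series endpoint (which requires special intraday access approval from Fitbit), and the OAuth token handling is elided.

```python
import requests

ACCESS_TOKEN = "..."  # hypothetical OAuth 2.0 bearer token, obtained separately

def fetch_minute_steps(date: str) -> list:
    """Fetch one day of minute-by-minute step counts for the authorized user."""
    url = ("https://api.fitbit.com/1/user/-/activities/steps/"
           f"date/{date}/1d/1min.json")
    resp = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
    resp.raise_for_status()
    # Each dataset entry has the form {"time": "00:00:00", "value": 0}.
    return resp.json()["activities-steps-intraday"]["dataset"]

# e.g. fetch_minute_steps("2018-01-01")
```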

1.3.3 Phase 3 – Machine Learning Implementation & Validation

In the final phase of this research thesis, we identified potential candidate machine learning algorithms and implemented six of them in an attempt to create a classifier that could use the collected clinical data to objectively assess patient NYHA class. We also evaluated the performance of these systems against the expected ability of experienced physicians to perform the same task. This phase of research, upon review by the UHN REB, was accorded a waiver of requirement for REB approval under REB #18-0221.
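To give a flavour of this kind of validation, the sketch below trains one classifier family on synthetic stand-in data and scores it with the same two metrics reported in the abstract (Cohen's kappa and balanced accuracy). The feature matrix and labels here are random placeholders, not the thesis dataset, and the actual pipeline (six model families, feature sets, and cross-validation variants) is described in Chapters 5 and 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))     # placeholder: 60 patients x 5 features
y = rng.integers(0, 2, size=60)  # placeholder: NYHA II (0) vs. III (1) labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
y_pred = cross_val_predict(clf, X, y, cv=10)  # 10-fold cross-validated predictions
print("kappa:", cohen_kappa_score(y, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y, y_pred))
```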

The following chapters provide, first, the necessary background to understand the rest of the research discussed in this thesis, followed by a detailed description of the methods employed in each phase of the research and the corresponding results.


Chapter 2 – Background & Literature Review

2.1 Congestive Heart Failure

Congestive Heart Failure (CHF), or Heart Failure (HF), as previously stated, is a complex chronic terminal phase of many cardiovascular diseases, and is slowly becoming a worldwide silent pandemic [1,14]. Aside from being complex, it is also an incurable, constantly exacerbating condition that looms threateningly even over a myriad of relatively more 'benign' heart problems. In the words of Dr. Paul Fedak, it is the "end result of all cardiac disease. You get heart failure from everything that goes wrong with your heart – all roads lead to heart failure" [2]. Recent estimates suggest that in 2016 at least 50,000 new Canadians officially joined an existing cohort of more than 600,000 Canadians, and 26 million persons globally, living with heart failure [2,14]. Of course, these numbers are only expected to grow as the population of persons at high risk of developing cardiac disease, and almost inevitably the prevalence of cardiac disease in general, continues to increase. Globally, the prognosis of HF patients is bleak [1,14]. Even in Canada, despite its relatively advanced medical system, the expected median survival time of Canadian HF patients is still very short: 2.1 years [15].

But what is heart failure? In short, heart failure is when the heart suffers a reduced ability to pump blood, and by extension is unable to adequately supply the body with the nutrients and oxygen it requires [1,2,14]. This inability of the heart to pump blood is sometimes termed cardiac insufficiency. This term helps to avoid the popular misconception that heart failure is when a person's heart has stopped, as in the case of a heart attack [2,16]. While cardiac insufficiency has the (likely obvious) effect of reducing a person's ability to perform demanding physical activities at any given moment, the full effects of heart failure are rather more insidious.

Galen is perhaps the first recorded physician to have conjectured that organs aside from the heart and arterial-venous network might be involved in regulating circulation [17]. While he erroneously concluded that the liver was the body's main blood producing organ (due to its high degree of vascularization, i.e. it has lots of blood vessels), an error which remained regrettably uncorrected for 15 centuries, it turns out that the liver, along with the lungs and adrenal glands, but most importantly the kidneys, do have major biochemical involvement in regulating a hugely important aspect of the circulatory system: blood pressure [17]. The natural response of these organs to an event of cardiac decompensation (i.e. cardiac insufficiency) is to attempt to correct these drops by activating a series of body systems and reflexes to increase both blood volume and blood pressure, and by extension cardiac output [18,19]. This is done primarily through the renin-angiotensin-aldosterone system (see Figure 2-1), which effects an increase in sodium and fluid retention along with an increase in vasoconstriction (narrowing of blood vessels) [18,19]. The autonomic nervous system also contributes by increasing vasoconstriction, but also by attempting to increase heart rate and contraction force (see Figure 2-2) [18,19]. In short, the body engages an emergency response: the body's 'fight-or-flight' mechanism.

While the aforementioned response is highly appropriate for acute events of cardiac insufficiency, such as significant blood loss, or even to prevent fainting as a result of standing up suddenly from a resting position, it is the incorrect response to chronic persistent heart failure [18,19]. Not only does this response not resolve the underlying cause of the chronic heart failure, such as abnormal heart rhythms or damage to or malformation of the heart, among other root causes, but constantly engaging the body's 'fight-or-flight' mechanism has damaging side-effects [19]. Elevated blood pressure (hypertension) is associated with increased risk for a myriad of other conditions including: pulmonary edema (leaking of fluid into the lung), atherosclerosis (hardening of arteries as a result of plaques formed due to damage to the vessels), and hemorrhagic stroke (rupture of a blood vessel) [18,19]. Increased sodium and fluid retention causes not just the blood to retain more water, but the whole body; fluid often builds up in other organs and in the arms and legs, which can cause undesirable compression of internal organs and result in damage to those organs [19]. Furthermore, the reduced blood flow combined with inappropriate pressure increases in certain organs can cause fluid in general to back up, or become congested, in areas along the circulatory network, which is what gives congestive heart failure its name [18,19]. In addition, the whole response system has the effect of causing what is known as 'cardiac remodelling', whereby the actual physical structure of the heart changes to adapt to its new environment [19,20]. Many of these changes have an overall damaging effect in the long-run, and the exact nature and extent of this remodelling depends greatly on the type of heart failure, for example whether it is localized in the left or right side of the heart (or both), whether it has the effect of weakening or stiffening the heart muscles, or whether the heart failure is due to other causes such as abnormal heart rhythms or blockages [19,20]. Suffice it to say that the symptoms and pathology of heart failure are complex.

Figure 2-1. Renin-Angiotensin-Aldosterone system [286]

As a result of the complexity of heart failure, it can be difficult to manage for both patients and their physicians [2–4,19]. This is especially unfortunate because heart failure is essentially impossible to cure since the heart, unlike many other muscles, does not heal or regenerate naturally, and modern medicine has not yet found a way to cause it to do so [19,21]. Care is made even more difficult because there is no reliable objective method for assessing the functional state of any given patient's HF, never mind determining if it is likely to worsen irreparably [5–7].

Figure 2-2. Nervous system response to drop in blood pressure [287]

2.1.1 New York Heart Association Functional Classification

The current clinical gold standard for communicating the severity of symptoms experienced by a CHF patient is the New York Heart Association (NYHA) functional classification system [8,9,22]. Under this system, patients are classified based on the physician's interpretation of patient-reported symptoms (mainly with respect to their degree of exercise/activity intolerance). The physician will then assign the patient into whichever of the four NYHA functional classes they believe is most appropriate, based on their clinical experience, professional judgement and the NYHA class definitions. These definitions are copied below for the reader's convenience [23]:

I. “Patients with cardiac disease but without resulting limitation of physical activity. Ordinary physical activity does not cause undue fatigue, palpitation, dyspnea, or anginal pain.”

II. “... slight limitation of physical activity. They are comfortable at rest. Ordinary physical activity results in fatigue, palpitation, dyspnea, or anginal pain.”

III. “... marked limitation of physical activity. They are comfortable at rest. Less than ordinary activity causes fatigue, palpitation, dyspnea, or anginal pain.”

IV. “Patients with cardiac disease resulting in inability to carry on any physical activity without discomfort. Symptoms of heart failure or the anginal syndrome may be present even at rest. If any physical activity is undertaken, discomfort is increased.”

This classification system is highly subjective [6,7], especially for NYHA classes II and III, which call for patients experiencing "slight" versus "marked limitation of physical activity" [9]. The application of the criteria thus varies widely based on the patient's self-report and the individual physician's interpretation of that report [6,7]. Despite these limitations, clinical evidence and medical research have established many important relationships between a patient's symptom status and their prognostic outcomes, which makes the assessment of NYHA functional class a useful part of care [7,10]. Aside from this prognostic utility, it also provides clinicians and medical researchers a standardized way of quickly communicating the clinical severity of a given patient's heart failure [19,24]. As such, scientific papers dealing with CHF often report the NYHA class of their patient population (amongst other metrics) to provide a universally recognized, although perhaps imprecise, description of the clinical make-up of their population. Unfortunately, approximately 99% of these papers also fail to provide details as to how the NYHA functional classes were assessed [6].

2.2 Assessing Exercise Capacity

The core determinant of NYHA class is the impact of a patient's heart failure on their ability to perform physical activity without "undue fatigue, palpitation, dyspnea, or anginal pain". While the NYHA functional classification system does not prescribe a standardized method by which to evaluate limitations of physical activity, there are certainly several methods of evaluating a patient's exercise capacity, whether for NYHA functional class assessment or for other purposes. These include questions posed as part of a medical interview, cardiopulmonary exercise testing, and physical activity/fitness trackers/monitors.

2.2.1 The Medical Interview (Standardized & Unstandardized Questioning)

The familiar medical interview, whereby a clinician carefully queries a patient to elucidate the patient's relevant medical history and symptoms, is a staple of medical care. It is also the classic method of assessing NYHA functional class; adding a few pertinent questions is inexpensive, relatively quick, fits neatly into the existing workflow of clinicians and also happens to be the established best practice. It is, however, highly inconsistent with regard to NYHA class assessment, both between physicians and for the same physician across time, and is thus highly unreliable [6,11,25–27]. Carroll et al. report (bibliographic reference numbers updated to reflect ours):

[One study] used two physicians to estimate NYHA functional class in 75 patients on the same day without chronic heart failure, reporting an interrater reliability of 56% (weighted kappa = 0.41)[11]. In a second study, two cardiologists assessed the same 50 chronic heart failure patients on the same day in random order, observing 54% agreement in NYHA classes [6]. In a third study, two physicians assigned NYHA class to 56 patients with stable angina within the same hour, resulting in the highest reported agreement of 75% [26]. Among these studies, disagreement by more than one functional class was low and, for the most part, was concentrated on determining the discrete differences between Classes II and III. Taken together, the reliability of the NYHA system is limited in the few trials that have measured it directly [25].

These agreement levels are low: 54% and 56% represent only weak agreement between physicians, and even a 75% level of agreement still implies that only about 56% of the examined cases should be considered correct [28].
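For reference, here is the standard definition of Cohen's kappa used in these reliability figures, together with the arithmetic behind the "75% agreement, about 56% correct" reading (our illustration of the logic, not a formula taken verbatim from [28]):

```latex
% Cohen's kappa corrects observed agreement p_o for chance agreement p_e:
\kappa = \frac{p_o - p_e}{1 - p_e}
% If each physician independently assigns the correct class with probability c,
% then (ignoring chance agreements on a wrong class) both are correct on about
% c^2 of cases; taking the 75% agreement itself as c gives
% 0.75 \times 0.75 = 0.5625 \approx 56\%.
```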

It should be noted that the third study (Christensen et al.) examined only NYHA functional classes I to III, and the first study (Goldman et al.) examined all four functional classes [11,26]. In the second study (Raphael et al.), the researchers investigated class II and III assessments specifically [6]. Furthermore, each study had an imbalanced distribution of classes, which makes reporting raw accuracy somewhat misleading since classes I and IV end up being relatively easy to distinguish in clinical practice, whereas the middle classes II and III generally represent the actual classification challenge for physicians [25]. Approximately half of the patients in Goldman et al.'s study exhibited NYHA class I symptoms, which may have contributed to the slightly higher agreement found in this study compared to Raphael et al.'s study. Unfortunately, Christensen et al. neglected to provide any information on their class distribution, although it appears to be slightly unbalanced since visual examination of their figures indicates that a significant subset (possibly a quarter to a third) of their study population were also patients with NYHA class I. We agree with the authors (Christensen et al.), however, that the real reason they saw higher agreement was likely because "they used the same two physicians through the study … who, in addition, had a small training session prior to data collection" [26].

In normal practice, clinicians usually differ in the exact criteria and questions they use to assess the NYHA class of their patients [6]. The most popular criteria are self-reported walking distance (70% of the 30 cardiologists surveyed), difficulty in climbing stairs (60%), ability to walk to a recognized local landmark (30%) and breathlessness interfering with performing daily activities or when walking around the house (23%) [6]. Thirteen percent of cardiologists had no specific question or criterion for assessing NYHA class [6]. Even among those who used a common question or criterion, its application often differed. For example, in choosing between class II and III, two thirds of physicians would classify a patient who couldn't make it up a flight of stairs without stopping as class II, while one third would classify them as class III [6].

Assessment at the Toronto General Hospital Heart Function Clinic

At the TGH HF clinic, NYHA class is typically assessed for every patient with known cardiac disease, which is first objectively verified using some form of medical imaging. NYHA class is then reassessed at every clinic visit by the physician responsible for the patient's care as part of the medical interview. At minimum, the physician will pose questions to attempt to elucidate the patient's degree of exercise intolerance, for example: "How far can you walk before becoming short of breath?", although the established preferred criterion is "How many flights of stairs can you climb before needing to stop?" The classes are broken down as follows:

Class I. Asymptomatic; able to perform physical activity normally.¹

Class II. Able to walk up more than one flight of stairs, or 100+ meters, before being breathless.

Class III. Only able to walk up one flight of stairs before being breathless/requiring a break. Alternatively, gets tired walking to the washroom.

Class IV. Always breathless; symptoms even at rest.

¹ As a specialized tertiary care centre, the Heart Function Clinic rarely sees NYHA class I patients as they are often asymptomatic with regards to their heart failure, or at least rarely require the specialized level of care offered by the clinic.

Of course, these questions are adjusted as clinical demands dictate. For example, the stair question is unsuitable for a patient who is wheelchair-bound or has significant mobility impairment, but the principle of using internally consistent criteria remains the same.
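To make the clinic's rule of thumb concrete, here is a toy encoding of the stair-climbing criteria above. This is purely illustrative of the decision logic, not a clinical tool and not part of the thesis software: real assessment weighs many additional factors and, as just noted, substitutes other questions for mobility-impaired patients.

```python
def nyha_class_from_stairs(flights: float,
                           symptomatic_at_rest: bool,
                           asymptomatic: bool) -> str:
    """Toy mapping of the TGH stair-climbing rule of thumb to an NYHA class."""
    if symptomatic_at_rest:
        return "IV"   # always breathless; symptoms even at rest
    if asymptomatic:
        return "I"    # able to perform physical activity normally
    if flights > 1:
        return "II"   # >1 flight (or 100+ m) before breathlessness
    return "III"      # <=1 flight before needing to stop

# e.g. nyha_class_from_stairs(2, symptomatic_at_rest=False, asymptomatic=False)
# returns "II"
```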

Unsurprisingly, prior agreement on assessment criteria has been demonstrated to improve inter-physician agreement drastically [27]. Kubo et al., for example, developed a patient questionnaire with the express intent of addressing the problem of inconsistent NYHA classification in multi-centre trials, although the questionnaire was "not meant to replace or improve the traditional method by which clinicians assess NYHA in everyday clinical encounters" [27]. The questionnaire is composed of 7 major questions that echo some of the popular interview questions, such as: "How often do you walk up and down stairs?" and "How often do you go for walks, either outside or inside, on level ground at a normal pace under normal conditions?", with follow-up questions including "Do you avoid stairs [/walks] because it makes you tired or short of breath?" and "How often would you get short of breath when you walk up or down a flight of stairs at a normal pace under normal conditions?", typically answered with one of 'Never, Rarely, Some or Frequently' and occasionally with just a simple 'Yes/No' response [27]. The questionnaire uses a separate scoring tool (not provided) that assesses the frequency of both activities and their associated symptoms, including symptoms or lack of symptoms at rest [27]. The scoring tool, however, at least in its current state, eschews the use of an automated algorithm "because of the inability of simple algorithms to reconcile inconsistent patient responses" [27]. In validating the use of this questionnaire, Kubo et al. found about 60% agreement comparing independent assessments performed at a remote site and their core central site, 75% agreement comparing independent assessments performed at the same core central site, and 90% agreement on repeat assessment of a random subset of the same questionnaires 3 months later [27]. These results are in the same range as Christensen et al.'s, which is possibly an indication that even informal agreement on NYHA class criteria (in the form of a preparatory training session) drastically improves inter-physician agreement. Of course, subjectivity in NYHA classification is not just introduced by clinicians. It is also introduced by patients.


2.2.2 Standardized In-Clinic Exercise Testing

A second challenge of NYHA class assessment is that it relies heavily on patient reported symptoms and on patient memory, which can be unreliable even in the best of circumstances [29–31]. Clinicians, who face this challenge on a routine basis in the field, even outside the context of NYHA class assessment, have come up with a myriad of ways to address this problem. In fact, a great deal of research tries to identify or create tests that measure physical fitness, maximum exercise capacity, or some proxy thereof in a standardized way [32–39]. In general, these tests measure a patient's exertion over a period of time [32,34–36,38–40]. Exertion is usually calculated by raw distance traveled (being generally more convenient to measure) [32,34,36,40], patient step count (which can be linked to distance if the patient's stride length is known) [38,41–47], movement recorded by raw accelerometer data [39,48–50], activity difficulty (e.g. surface incline, resistance band strength) [41,46] or energy consumption (e.g. Metabolic Equivalents: METS) [8,32,37].

Timed Walking Tests

Timed walking tests are an excellent example of basic, easy-to-run, standardized in-clinic exercise tests. The 6 minute walk test (6MWT), one of the more recently developed timed walking tests, typifies the general approach used in these tests. For this particular test, a patient is asked to walk as far as they can (being permitted to rest as needed) over a hard flat surface over a period of 6 minutes; the total distance walked is then used as an indicator of the exercise capacity of the individual [40] and, by inference, their symptomatic limitations due to heart failure [7].

While timed walking tests have shown that measures of exertion over time (whether distance, step count or otherwise) are correlated with the NYHA functional classification of patients, there often remains a notable gap in the explanatory power of these measures. For example, Demers et al. found that for the 768 patients in their multi-centre study the "baseline 6MWT distance was ... moderately inversely correlated to the New York Heart Association functional classification (NYHA-FC) (r = -0.43, P=.001)" [51]. One would expect walking distance to be correlated with evaluated NYHA functional class, but distance travelled in this case explains only approximately 18.5% of the variance in the data (r² = 0.1849). This may be because NYHA functional class is not predominantly attempting to ascertain maximal exercise capacity but rather the degree of abnormally symptomatic response to exercise – a much more nuanced question. Therefore, tests, measures, or metrics which can reliably mirror NYHA functional class will likely need to measure not just exertion, but the patient's physiological response to that exertion - beyond the simple binary yes/no response of being able to continue the exertion demanded (the case for all the previously mentioned tests).

Cardiopulmonary Exercise Test (CPET)

The cardiopulmonary exercise test (CPET), or more colloquially 'the treadmill test', is the gold standard for in-clinic exercise testing [52]. It is a supervised test run by trained staff in a controlled clinical environment. In this test, the patient walks on a treadmill or cycles on a stationary bicycle, typically until they become exhausted or experience muscle fatigue, respiratory difficulty or some other symptom that indicates the test should be terminated [32,53]. While the patient is exercising, their detailed physiological response to increasing resistance on the treadmill/bike is measured using:

• surface electrocardiography (ECG), to measure pulse and cardiac waveform (sinus rhythm);

• pulse oximetry, to measure blood oxygen saturation;

• a blood pressure (BP) cuff, to measure blood pressure;

• spirometry equipment, to measure lung capacity, volumes and flow; and

• pulmonary gas equipment, to measure oxygen (O2) and carbon dioxide (CO2) exchange [32,53].

Together, this data provides an informative picture from which clinicians can further derive metrics measuring a patient’s lung and cardiac response to exercise [24,32,53,54]. Some of the more unique and important measures derived from this test include:

• Peak V̇O2 [mL/kg/min] (relative peak V̇O2), the peak oxygen volume output, is an estimate for the true maximal aerobic capacity V̇O2max [mL/kg/min] of a patient [32]. V̇O2max, or relative V̇O2max, is the body-weight-normalized version of (absolute) V̇O2max [L/min]. Absolute V̇O2max is "considered to be the metric that defines the limits of the cardiopulmonary system. It is defined by the Fick equation as the product of cardiac output [heart rate & stroke volume] and arteriovenous oxygen difference … at peak exercise" [32] (see the sketch of these relationships after this list). Reporting the relative (normalized) version is preferred since patients with higher body weight will naturally have a higher V̇O2max due to increased body weight but will not necessarily have fundamentally increased functional capacity, exercise capacity or exercise tolerance [32]. It is also important to note that peak V̇O2 is always an estimate of true maximal aerobic capacity; its recorded value depends not only on the test modality used (treadmill or bike) but is importantly predicated on the attainment of maximal/peak exercise by the patient during the test [32].

• Ventilatory threshold (VT) [mL/kg/min], an estimate for, and sometimes interchangeably known as, anaerobic threshold (AT), attempts to measure the exertion level at which a patient's body stops being able to keep up with their muscles' oxygen demands [32]. It is an alternate index used to infer exercise capacity but is predicated on the idea that people do not constantly perform activities at maximal effort. AT, in a sense, is a measure of maximum continuously sustainable exertion [32]. As AT is a submaximal index of exercise capacity, it is sometimes reported as a percentage of peak V̇O2 [32].

• Respiratory exchange ratio (RER), the ratio between exhaled CO2 and inhaled O2 [32]. Of particular interest is the peak RER, which can be used to gauge whether a subject is likely to have achieved peak (or at the very least sufficient) exerted effort as part of the test [32]. It is known to be more robust than heart rate response for measuring exertion, as heart rate response is often highly variable even in healthy populations (and worse in patients with heart failure, since their response is often modulated by medications).

• V̇E/V̇CO2 [breaths/L], or the relationship between minute ventilation and carbon dioxide output, is used to estimate ventilatory efficiency: how many breaths it takes for the body to clear a given unit of CO2 [32]. The relationship most often reported is a linear approximation of the V̇E/V̇CO2 slope, which is highly robust against test modality and attainment of peak exercise by the patient [32]. It is often used to infer the possible existence of ventilation-perfusion mismatching: where the lungs are unable to efficiently clear CO2 from the circulatory system, either due to circulatory problems causing poor blood flow or inefficient CO2 transfer due to some sort of lung damage or disease [32].
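The key relationships named in this list can be summarized as follows (a sketch in our own notation, following the definitions above rather than any one cited source):

```latex
% Fick equation: absolute maximal aerobic capacity as the product of
% cardiac output and the arteriovenous oxygen difference at peak exercise
\dot{V}O_{2max} \;=\; \underbrace{(HR \times SV)}_{\text{cardiac output}} \times
                      \underbrace{(C_aO_2 - C_{\bar v}O_2)}_{\text{arteriovenous O}_2\text{ difference}}

% Respiratory exchange ratio: exhaled CO2 over inhaled O2
RER \;=\; \frac{\dot{V}CO_2}{\dot{V}O_2}

% Ventilatory efficiency is reported as the slope m of the linear fit
\dot{V}_E \;\approx\; m\,\dot{V}CO_2 + b
```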

Many of these CPET measurements have been clinically validated and recommended to help inform important decisions regarding heart failure care. For example, peak V̇O2 is used to risk stratify certain classes of HF patients when considering a heart transplant [55].

Others have already attempted to discover the relationship between NYHA class and various CPET measures [11,24,25,56]. Rostagno et al. looked at 143 HF patients with NYHA functional class ranging from I to IV, but found low agreement both between peak V̇O2 and AT (41.7%) and with NYHA class (35%) [24].


Goldman et al. looked at the duration of treadmill tests and similarly found low agreement, with only 51% of their 150 estimates (75 patients, each assessed once by two independent physicians) agreeing with the assigned NYHA class [11]. This is not terribly surprising, but is instead consistent with what we would expect based on Demers et al.'s 6MWT findings.

In a more recent analysis, Lim et al. performed a systematic review of 38 studies that investigated the correlation between NYHA classification and peak V̇O2 (other CPET metrics were not reported consistently enough for analysis) [56]. They found a significant difference between pooled peak V̇O2 values for NYHA classes I vs. II and II vs. III (P < 0.0001 in both cases) [56]. However, they did not find a significant difference when looking at classes III vs. IV [56]. Peak V̇O2 and NYHA classes I to III were inversely correlated, although the strength of the correlation was not quantified [56].

To our knowledge no one else has published attempts to characterize the relationship between NYHA class and other CPET measures. Despite the lack of research and evidence surrounding most of the CPET metrics, Lim et al.'s findings regarding peak V̇O2 and NYHA class are an encouraging waypoint in the quest to objectively assess NYHA classification. However, CPET studies do have some important drawbacks.

One of the biggest drawbacks of running CPET studies is that they require access to expensive equipment, trained personnel and a lab environment in which to perform the test [32]. Due to the financial cost and time burden alone, it is likely that relying on CPET studies to assess NYHA class will severely limit how often NYHA class can be re-assessed, which makes it less desirable for use in creating a quick and easy method of assessing the severity of patients’ HF symptoms [54].

2.2.3 Fitness Trackers/Monitors

Modern commercially available fitness trackers, such as those developed by Fitbit Inc. [57–59], are a promising, albeit little-used, candidate for assessing patient exercise capacity that would overcome many of the drawbacks of cardiopulmonary exercise tests.

Activity & Step Detection

Activity trackers are small, portable devices that are worn on one's person. They may be worn on one's feet or shoes, clipped on the belt near one's hip, or worn on one's wrist like a wristwatch [41,43,45,57–65]. The classic pedometers of yore are in fact a type of activity tracker, but they are limited to only counting steps [65,66]. Most modern activity trackers are more precise and often more multi-functional than the classic pedometer [57–59,64]. Even from a pure motion detection perspective, older pedometers were often limited to single-axis accelerometers, which could only detect movement (specifically acceleration) along one axis [66].

Newer, modern activity trackers have been found to track minute-by-minute step count fairly reliably [37,41,43,45,46,65,67–70]. Straiton et al. [70], in a systematic review of 7 observational studies including a total of 290 elderly patients (mean age 70.2 ± 4.8 years), discovered a high correlation between step counts recorded by the test devices and those recorded by the reference devices used in each study. The reference devices used in the individual studies varied but were typically a previously validated research-grade activity monitor such as an ActiGraph™ [71] or BodyMedia SenseWear device (no longer available). In their review they found that "daily step count for all consumer wearables correlated highly with validation criterion, especially the ActiGraph device: intraclass correlation coefficients (ICC) were 0.94 for Fitbit One, 0.94 for [Fitbit] Zip, 0.86 for [Fitbit] Charge HR and 0.96 for Misfit Shine. Slower walking pace and impaired ambulation reduced the levels of agreement" [70]. Physical activity and energy expenditure estimation, as supported by these devices, was also found to be accurate, but generally less so than step count measurement.

Evenson et al. (2015) [68], who cast a wider net and conducted a systematic review that included 22 observational studies of adults and youth (20:2), similarly found generally high correlations between the step measurements of the various Fitbit and Jawbone devices investigated in these studies and the reference devices used. The correlation coefficients (intraclass or Pearson) were found to be ≥0.8 for all the devices (Fitbit and Jawbone) investigated across all the laboratory studies reviewed. Many of the studies found an even higher correlation, in the >0.9 range, and even up to 0.99, for both Jawbone and Fitbit devices [68]. Evenson et al. also found that physical activity and energy expenditure estimation correlated less strongly with reference measurements than pure step tracking.

In 2015, El-Amrawy et al. [44] recorded 4 participants who performed 40 repeated sets of 200, 500 and 1000 step walks and found that step count accuracy, as compared to the steps counted by an observer equipped with a tally counter, varied from an average of 99.1% for the MisFit Shine down to 79.8% for the least accurate device tested. Other popular mainstream contenders fell across this range, including the Fitbit Flex (80.5%), the Jawbone UP (82.51%) and the Xiaomi Mi Band (96.6%).

Overall, research points to step-tracking by modern mainstream commercial activity trackers as being highly correlated with equivalent research-grade reference devices. Certain activity trackers, such as the MisFit Shine, appear to be more consistently in agreement with validated reference devices, which may make them optimal for studies where step count values must be as accurate as possible. However, we maintain that all the activity trackers discussed are likely suitable for practical applications of step count tracking. Other features that should be considered include easier access to gathered data, lower cost, improved ease of use for the patient, and the ability to detect other important physiological markers.

Heart Rate Detection

With respect to other physiological markers, some of the major players in the commercial activity tracker market, namely Fitbit™ [58] and Apple™ [64], have recently pioneered the integration of heart rate monitoring capability alongside the step counting provided by their devices. These augmented fitness trackers, which are worn on the wrist, also monitor heart rate non-invasively by detecting the flow of blood under the surface of the wearer’s skin [41,44,72–74]. This technique, known as photoplethysmography (PPG), has been well validated since its discovery in the 1930s and is commonly used in various clinical settings [75,76]. In fact, it is the core technology that underpins pulse oximetry [75,76].

The fundamental principle that underpins PPG itself is the absorption and reflection of light by various body tissues [75,76]. By shining carefully selected frequencies of light on the surface of the skin and recording either the light reflected off of, or transmitted through, the skin, one can detect changes in perfusion of the surface tissues being illuminated. An example of the resulting waveform is shown in Figure 2-3. Although the precise physiological cause of the perfusion changes measured by the PPG waveform is still a matter of debate [76], it is clear that certain characteristics of the waveform are synchronized with the heartbeat, and can thus be used to track heart rate. The shape of the waveform is also known to be correlated with arterial blood pressure, another clinically important physiological marker [75,76].

Figure 2-3: PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288].

One important parameter that can also affect the PPG waveform is the choice of light frequency [75,76]. The light absorption/reflection characteristics of various body tissues are highly frequency-dependent [75,76]. One of the most important applications of PPG, arterial blood oxygen measurement, depends on this fact [75,76]. In particular, the frequency response of oxygen-saturated versus desaturated blood is known to differ. If we record separate PPG waveforms using red and near-infrared light, we can measure the relative difference in light absorbed at these two frequencies [75,76]. The resulting difference can then be used to infer the degree to which the blood is saturated vs. desaturated [75,76]. While fitness trackers do not yet measure arterial blood pressure or use different types of light to measure oxygen saturation, some newer models of fitness trackers (e.g. the Fitbit Charge HR 2 [58] and Apple Watch [64]) take advantage of the varying light frequency response of blood by instead using green light, which has been found to be more reliable for pulse rate monitoring [77].
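To make the red/near-infrared comparison concrete, below is a minimal Python sketch of the classic 'ratio of ratios' computation that underpins pulse oximetry. The function name and the linear calibration constants (110 and 25) are illustrative placeholders only, not values taken from any cited device; real oximeters use empirically fitted calibration curves.

```python
import numpy as np

def spo2_ratio_of_ratios(red: np.ndarray, ir: np.ndarray) -> float:
    """Estimate SpO2 from simultaneously sampled red and near-infrared
    PPG waveforms using the classic 'ratio of ratios' method."""
    # AC: peak-to-peak amplitude of the pulsatile component;
    # DC: mean baseline absorption. Normalizing AC by DC makes the two
    # channels comparable independently of LED intensity and skin tone.
    ac_red, dc_red = np.ptp(red), np.mean(red)
    ac_ir, dc_ir = np.ptp(ir), np.mean(ir)
    r = (ac_red / dc_red) / (ac_ir / dc_ir)
    # Linear calibration curve; 110 and 25 are illustrative placeholders.
    return 110.0 - 25.0 * r
```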

Research has shown that consumer heart rate trackers are fairly reliable as compared to clinical-grade devices [41,44,73,74,78,79]. However, they do provide considerably less detail: consumer devices generally capture only a minute-by-minute pulse rate, as opposed to the complete ECG waveform provided by a Holter monitor or non-portable ECG setup.

In a 2016 study, Wang et al. monitored 50 healthy participants on a treadmill test, compared the heart rate measured by various fitness trackers to the heart rate recorded by an ECG, and found them all to be highly correlated [78]. The concordance coefficients were 0.99 for the Polar H7, 0.91 for the Apple Watch, 0.91 for the Mio Fuse, 0.84 for the Fitbit Charge HR and 0.83 for the Basis Peak.

In the previously mentioned study, El-Amrawy et al. recorded 4 participants who performed 40 repeated sets of 200, 500 and 1000 step walks. As part of this study they also compared the heart rate reported by various activity monitors to the heart rate reported by a research-validated professional clinical pulse oximeter [44]. The devices investigated, with their corresponding heart rate accuracy (as percent mean deviation from the average recorded heart rate) and the associated standard deviation (σ) of the measurements, ordered from most to least accurate, were the Apple Watch (99.9%, σ = 5.7%), Samsung Galaxy Note Edge (99.6%, σ = 14.4%), Apple iPhone 6 running the Cardioo App [80] (99.2%, σ = 6.3%), Samsung Galaxy S6 Edge (98.8%, σ = 11.6%), an additional device (97.7%, σ = 16.5%), Apple iPhone 5S running the Cardioo App (97.6%, σ = 12.4%), a further device (97.4%, σ = 28.8%), Samsung Gear S (95.0%, σ = 20.9%), and Motorola Moto 360 (92.8%, σ = 14.1%).

Cadmus-Bertram et al., in a 2017 study, also investigated the heart rate accuracy of several wrist-worn activity trackers [79]. They were particularly interested in the limits of agreement of the reported beats per minute (bpm) of each of the devices at different heart rate intensity levels. They also studied the devices' accuracy by measuring the mean difference between the heart rates measured by the trackers and a simultaneously recorded reference ECG. The limits of agreement were defined as the 95% prediction interval for the mean difference between the tracker and ECG measurements. They also compared the measurement agreement of different devices from the same model series (i.e. comparing measurements between 2 Fitbit Surges in otherwise identical test conditions), which they termed measurement repeatability. As for the different heart rate intensity levels, they investigated heart rate accuracy at rest and at 65% of each participant's maximum heart rate while running on a treadmill (as determined by the maximum heart rate equation: max heart rate = 220 − age). The 40 study participants were all healthy and between 30 and 65 years old (mean ± σ of 49.3 ± 9.5 [years]), and wore 2 trackers on each wrist (randomly assigned left vs. right, and proximal vs. distal on the wrist). Cadmus-Bertram et al.'s findings, including the mean difference, limits of agreement and measurement repeatability results, are reproduced for easier reading in Table 1. They found that the activity trackers had excellent accuracy, with a mean difference of ≤ ±2.8 [bpm] between activity trackers and the reference device, whether at rest or while exercising. No further quantitative comparison was made between the mean difference at rest vs. during exercise. For reference, a 1 [bpm] agreement error at 65% of the maximum heart rate of a 30-, 49.3- and 65-year-old (the minimum, mean and maximum participant ages in this study) represents a percent error of 0.8, 0.9 and 1.0% respectively. At rest, or rather at heart rates of 60 and 100 [bpm] - the lower and upper limits of the commonly accepted resting heart rate range [81,82] - the same 1 [bpm] agreement error represents a percent error of 1.6 and 1.0%². The precision, as measured by the limits of agreement, was found to be less impressive. At rest, the limits ranged from good, -5.1 to 4.5 [bpm] (Fitbit Surge), to relatively poor, -17.1 to 22.6 [bpm] (Basis Peak). The performance of the intermediate devices investigated (Fitbit Charge and Mio Fuse), which had limits of agreement of ~±10 [bpm], was closer to that of the Fitbit Surge than the Basis Peak. During exercise (at 65% maximum heart rate), the precision degraded considerably, with lower limits of agreement ranging from -41.0 [bpm] in the worst case (Fitbit Charge) to -22.5 [bpm] (Mio Fuse) in the best case, and upper limits of agreement ranging from 39.0 [bpm] (Fitbit Surge) in the worst case to 26.0 [bpm] (Mio Fuse) in the best case. With respect to repeatability between devices, most devices were found to be around half as repeatable as the ECG, whether at rest or during exercise, with only two exceptions: 1) the Fitbit Surge, which was found to be possibly slightly more repeatable than the ECG at rest (unfortunately no significance test was provided), and 2) the Basis Peak, which was found to be only a quarter as repeatable as the ECG at rest.

² ∴ as a rule of thumb for mental calculations: 1 [bpm] error ≈ 1% (≈2% when in the 40-60 [bpm] range)

Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79]

                   @ Rest                                        @ 65% Maximum Heart Rate
Device             Mean Diff.   Limits of        Repeat-         Mean Diff.   Limits of        Repeat-
                   [bpm]        Agreement [bpm]  ability [bpm]   [bpm]        Agreement [bpm]  ability [bpm]
ECG                reference    - to -           5.3             reference    - to -           9.1
Fitbit Surge       2.8          -5.1 to 4.5      4.2             1.0          -34.8 to 39.0    20.6
Mio Fuse           -0.7         -7.8 to 9.9      10.9            -2.5         -22.5 to 26.0    23.7
Fitbit Charge      -0.3         -10.5 to 9.2     9.3             2.1          -41.0 to 36.0    21.6
Basis Peak         1.0          -17.1 to 22.6    19.3            1.8          -27.1 to 29.2    20.2
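The percent-error figures quoted above (and the rule of thumb in the footnote) follow directly from the maximum heart rate equation; as a quick check, a few lines of Python reproducing those numbers:

```python
def pct_error_at_65pct_max(age: float, error_bpm: float = 1.0) -> float:
    """Percent error that a fixed agreement error [bpm] represents
    at 65% of the age-predicted maximum heart rate (220 - age)."""
    return error_bpm / (0.65 * (220 - age)) * 100

for age in (30, 49.3, 65):
    print(f"age {age}: {pct_error_at_65pct_max(age):.1f}%")  # 0.8, 0.9, 1.0
```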

Our lab, the Centre for Global eHealth Innovation, also recently investigated the heart rate accuracy of two of the most popular activity trackers at the time: the Fitbit Charge HR and the Apple Watch [41]. In this 2016 study, R. Abdulmajeed studied 8 healthy participants using a methodology similar to Cadmus-Bertram et al.'s, although at different exercise intensity levels, which were controlled using a variable-resistance stationary bicycle. The accuracy of the two trackers (worn simultaneously) was measured against the ECG results of a portable Holter monitor. Abdulmajeed found slightly worse percent agreement at rest between the Holter monitor and the trackers investigated (Fitbit Charge HR: 6.00%; Apple Watch: 3.32%) than Cadmus-Bertram et al.'s findings would suggest. Abdulmajeed's findings also hint at a possibly slightly non-linear relationship between percent agreement and workload/heart rate: the mean difference first grew slightly with increasing workload (Fitbit Charge HR: peaking at 8.68% at 40 [watts]; Apple Watch: peaking at 7.51% at 30 [watts]) before improving to near-complete agreement at higher workloads (Fitbit Charge HR: <±0.5% when ≥80 [watts]; Apple Watch: <±0.75% when ≥60 [watts], except at 90 [watts] where the agreement was -1.64%). These findings are reproduced in an easier-to-read format in Table 2, along with the heart rates corresponding to the quoted workload intensities.


Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41]

Workload   Holter Monitor Heart Rate [bpm]       Mean Heart Rate Difference [%]    Pearson Correlation Coefficient
[Watts]    Minimum   Average [sic]   Maximum     Fitbit Charge HR   Apple Watch    Fitbit Charge HR   Apple Watch
0          68        85              102         6.00               3.32           0.406              0.567
10         69        86              102         6.93               4.56           0.593              0.305
20         68        89              114         5.41               6.12           0.951              0.597
30         68        93              129         8.34               7.51           0.973              0.61
40         73        96              129         8.68               5.49           0.93               0.78
50         84        102             132         8.10               2.27           0.88               0.811
60         87        109             136         3.69               -0.45          0.957              0.965
70         88        116             142         1.63               -0.75          0.98               0.994
80         95        122             150         -0.20              -0.72          0.994              0.997
90         99        129             155         -0.10              -1.64          0.986              0.993
100        105       136             161         0.46               0.37           0.992              0.994

Summarizing the findings of these 4 studies: Wang et al., Cadmus-Bertram et al. and Abdulmajeed are in clear agreement that the heart rate measurements of activity monitors generally have high accuracy and correlate well with measurements performed by clinical-grade equipment. It also appears, based on El-Amrawy et al.'s findings, that there is very high agreement among the individual heart rate measurements of the many commercial trackers on the market - perhaps unsurprising, as most of the contenders leverage the same well-validated PPG technology with some minor modifications to fit the form factor of the wearable device. Where the performance of these trackers appears to differ greatly from clinical reference devices is in the variance of repeated measurements. Of the trackers investigated in their study, Cadmus-Bertram et al. found that the devices were typically half as consistent as an ECG, regardless of whether the measurements were taken while active or at rest.

Comparison to Cardiopulmonary Exercise Testing

Based on recent research findings it is clear that modern activity trackers are fairly reliable at tracking both step count and heart rate [37,41,79,43–45,65,67–69,73]. It is also clear, however, that these devices are definitely less accurate and less precise than the gold-standard CPET.

That being said, these devices have significantly lower upfront costs than CPET equipment and require little to no dedicated personnel or physical space in the hospital to run tests. "Replacing" patient memory with activity trackers could still eliminate a significant source of subjectivity and potential error while being potentially easier and less costly to administer than a full CPET.

Of course, fitness trackers provide fewer distinct data streams than a CPET, usually limited to just steps and possibly heart rate. Few researchers have attempted to examine the interplay between the fitness tracker heart rate and step count data streams. However, in the same way that an inertial measurement unit (IMU) can combine the disparate, independently error-prone sensor outputs of an accelerometer, gyroscope and magnetometer using sensor fusion, the same might be done with activity tracker step count and heart rate data for HF patients, thereby reducing or removing the need for the extra data provided by a CPET. Whether these two data streams alone are sufficient to objectively assess NYHA class or perform a useful clinical function for HF patients, though, is yet to be determined. The concept, however, is clearly not unreasonable: even though hospitals have only recently begun to consider the use of fitness trackers as part of regular care, there have been some very early successes in using single data streams from trackers to perform useful clinical functions, such as monitoring step count for post-surgical readmission prediction, or using heart rate data for arrhythmia detection outside the hospital [83–88].

Fitness monitors have another advantage over CPETs: the low cost and portable nature of fitness trackers means that patients can be monitored outside the hospital, during free-living. Capturing the real-world free-living activity of HF patients might provide quantitative insight into the limitations brought about by a patient's HF symptoms. In fact, a recent exploratory study investigated this exact concept, sending 8 HF patients home with activity trackers for a period of two weeks. The study found a statistically significant difference between the daily average step counts of patients in different NYHA functional classes [13]. Unfortunately, the study's very small sample size greatly limits scientific confidence in the generalizability of these findings. In response, as the first phase of this work (detailed in Chapter 3), we replicated this study using a larger sample size to independently verify these very promising findings. It would be hugely beneficial to patient care if data streams of regular real-world free-living activity made it possible to more routinely reassess NYHA class, and even allowed for more prompt detection of important HF status changes.


Remote Patient Monitoring

Regular reassessment of a patient’s status and the continued monitoring of said patient while they are outside the hospital falls under the broader umbrella of telemedicine [89] and is formally termed Remote Patient Monitoring (RPM).

RPM, as a specific application of telemedicine, is of particular interest for patients with chronic conditions [90–92]. An acute exacerbation of a chronic condition can often bring patients into costly hospital emergency rooms for post-hoc care, instead of the less costly pre-emptive care/management that might have prevented the exacerbation in the first place [4,14,92,93]. This leads both to suboptimal care for the patient and to misallocation of resources in an already, and increasingly, strained health sector [4,14,93–95].

There have been many documented attempts at creating RPM systems targeted towards HF patients. Even though researchers have not come to a consensus about the exact effect of RPM systems on outcomes, based on several meta-analyses of recent literature, it appears that these systems are sometimes capable of delivering on the promise of providing better care at lower cost.

In a 2018 meta-analysis, Yun et al. [96] reviewed 37 randomized controlled trials (RCTs) covering a total of 9582 HF patients and found that the patient groups receiving telemonitoring care had significantly lower HF-related mortality (risk ratio: 0.68, 95% confidence interval (CI): 0.50-0.91, no P-value reported) as well as all-cause mortality (risk ratio: 0.81, 95% CI: 0.70-0.94, no P-value reported) compared to standard care. Patients were found to benefit significantly when their RPM system transmitted data at least once per day, or when it transmitted multiple (≥3) streams of biological data (e.g. weight, blood pressure and heart rate). Yun et al. also noted that monitoring patient symptoms, medication adherence and prescription changes was associated with reduced mortality risk.

Klersy et al. [97], in their 2014 meta-analysis of 21 RCTs covering a total of 5715 patients, investigated the healthcare utilization and economic impact of RPM on HF care. They found that, compared to the control groups, the telemonitored patient groups experienced significantly fewer HF-related hospitalizations (incidence rate ratio: 0.77, 95% CI: 0.65-0.91) as well as all-cause hospitalizations (incidence rate ratio: 0.87, 95% CI: 0.79-0.96), resulting in a per-patient quality-adjusted life year gain of 0.06 years (approximately 22 days). Furthermore, RPM was associated with yearly per-patient cost savings of €300 to €1000 (approximately $460 to $1535 CAD based on the 2014 exchange rate). The cost savings were conservatively estimated solely from the third-party payer hospitalization reimbursement costs for the patients in the meta-analysis.

As mentioned though, not all evidence points towards RPM being a universally positive effector of change: of note are 3 commonly cited, large, high-powered RCTs that found no significant effect on outcomes for HF patients undergoing telemonitoring [50,98,99]. While these 3 studies are certainly not the only ones to have found little positive change from RPM implementations, their scope makes them hard to simply dismiss. Ware et al. [100], in a comprehensive review piece, discuss the various reasons why it is so hard to form a definitive consensus regarding the effects of home telemonitoring systems in healthcare. They argue that RPM implementations are often viewed as simple, one-size-fits-all interventions (perhaps like a silver bullet), but they are in fact complex socio-technological systems that are (or should be) adequately tailored to suit the specific context in which they are implemented - a fact that is often overlooked when assessing them. Some of the very important factors that impact the successful implementation of any technology often go unreported or unaddressed in studies. These include: appropriate characterization of the intended and actual user groups (both patient population and clinical staff); suitability of the home telemonitoring (HT) service for the implementation context (e.g. how the system is resourced, and which actual user needs it attempts to address); the implementation strategy used (including training and methods of ensuring adherence to the 'system as-intended'); the suitability of the evaluation approach for capturing the desired outcomes (e.g. are RCTs an adequate trial design for capturing outcomes in an evolving socio-technical system?); and what the actual desired outcomes of the intervention are (reduced mortality? increased patient quality of life? pure cost reduction?) and whether these outcomes match up with stakeholder expectations. In their words:

“HT has been shown to reduce mortality and HF hospitalizations and improve clinical outcomes in HF patients. Despite this evidence, significant heterogeneity exists in the design of HT interventions, the implementation context, and outcomes of individual studies, leading to ambiguity about the true effect of HT on HF outcomes. HT is not one, but rather a collection of complex interventions for which success or failure is linked to a range of contextual factors. These factors cannot be ignored if we are to design studies that will offer more definitive answers about the effect of HT on HF outcomes.” [100]


2.3.1 Medly

For this particular thesis we piggy-backed off of a specific RPM system: Medly, a mobile-phone-based HF patient telemonitoring system currently in place at (and adapted for use by) the Ted Rogers Centre of Excellence for Heart Function, a tertiary care clinic for HF patients located at TGH in Toronto, Canada [101,102]. A previous iteration of Medly, and thus its core features, was previously validated through a 6-month RCT, which found that its telemonitored patient user group had, relative to baseline and compared to the control group, improved self-care maintenance (Δ = +7 points, P = .05) and management (Δ = +14 points, P = .03) as measured with the Minnesota Living with Heart Failure Questionnaire, improved levels of brain natriuretic peptide (BNP), a biomarker associated with HF stability (Δ = -150 pg/ml, P = .03), and improved left ventricular ejection fraction (LVEF) (Δ = +7.4%, P = .005) [103]. In recognition of the complex, multi-faceted nature of telemonitoring interventions, we provide a more detailed discussion of the intervention and its unique context in Chapter 4, as part of the larger discussion of how we implemented an initial version of activity tracker monitoring as part of Medly.

One of the important core features of Medly is an innovative computer algorithm capable of generating timely, safe, and clinically relevant messages (instructions or alerts) to patients and clinical staff [104]. The intent of this feature is to enable Medly to provide a cost-effective and scalable way of monitoring patients on a daily basis by limiting the impact on the workload of clinical staff while simultaneously leveraging 'teachable moments' to improve patient self-care maintenance and management [3,104,105]. This is accomplished by imbuing the system with a limited ability to mimic the decision-making and actioning process of the expert clinical staff at the Heart Function clinic, so that the system can adequately triage clinical concerns, responding to them or escalating them to staff as necessary, while providing patients with regular feedback about their own condition [104]. Of course, the concept of imbuing a machine with decision-making ability (limited or otherwise) belongs to the now-resurging field of artificial intelligence.

Artificial Intelligence & Machine Learning

Artificial intelligence (AI) broadly refers to the concept of intelligence (e.g. learning, decision making, perception and recognition, creativity and problem solving) exhibited by machines (typically computers, but formally, anything not imbued with natural intelligence the way humans or animals are) [106–109]. The field of AI is as fascinating as it is expansive. Although the field only became a formal academic discipline unto itself in 1956³ [108,109], it spans and draws from the mathematical, statistical and computer sciences, delves into psychology and neurology, and is even starting to pose new and challenging philosophical, ethical and economic questions (such as: what actually is intelligence? which decisions should and shouldn't we delegate to a computer? what will be the place of humanity if computers can beat us at everything?).

One of the early successful approaches to creating artificial intelligence was to train a computer program (like Medly) to mimic the decisions of a human expert, like a cardiologist or nurse, in what is formally termed an 'expert system' [106,110]. Expert systems are typically created by first extracting a series of formalized facts from the target experts and translating them, typically, into formal conditional ('if-then') logic statements. For example: if a patient is male and older than 35 and has chest pain, then suspect a heart attack; if a heart attack is suspected, then perform an ECG. These facts form the 'knowledge base' of the expert system. The machine can then use this knowledge base in conjunction with an 'inference engine', which uses some formal logic system - such as zeroth-order propositional logic (i.e. modus ponens⁴, modus tollens⁵, etc.) - to manipulate the contents of the knowledge base and draw conclusions, make decisions or supply recommendations (if a patient is male and older than 35 and has chest pains, then perform an ECG). The machine can then also be asked to 'show its work' by displaying the exact step-by-step deductive, inductive and/or abductive logic processes used to reach its final conclusion [110]. Expert systems have seen application in various sectors, but are especially useful where demand for expertise is high but supply is relatively low or expensive, for example in the health care, finance, and legal sectors [106,110].
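To make the mechanics concrete, below is a minimal Python sketch of forward chaining (repeated modus ponens) over a toy knowledge base built from the chest-pain example above. The rules are the illustrative ones from the text, not Medly's actual rule set.

```python
# Toy forward-chaining inference over an if-then knowledge base.
# Rules and facts mirror the chest-pain example in the text; they are
# illustrative only and bear no relation to Medly's actual rules.
rules = [
    ({"male", "age>35", "chest pain"}, "suspect heart attack"),
    ({"suspect heart attack"}, "perform ECG"),
]

def infer(facts: set[str]) -> set[str]:
    """Repeatedly apply rules (modus ponens) until no new facts fire."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= derived and consequent not in derived:
                derived.add(consequent)
                changed = True
    return derived

print(infer({"male", "age>35", "chest pain"}))
# -> includes 'suspect heart attack' and 'perform ECG'
```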

In the case of NYHA functional class assessment (a function not presently performed by Medly), one might theoretically create an expert system which could mimic the grading performed by (an) experienced 'model' physician(s). However, in doing so one would run into one of the major issues with expert systems: the knowledge acquisition problem. Since creating traditional expert systems relies on the premise that there are experts available who can formalize their knowledge into statements suitable for interpretation by some inference engine, the actual implementation of these expert systems becomes compromised when 1) there are insufficient experts available, or 2) their knowledge cannot be formalized adequately (or even at all). In the case of objective NYHA functional class assessment (an unsolved problem), the situation is fairly simple: there are no such experts available - which precludes the creation of a traditional expert system entirely. Fortunately, the field of AI has developed beyond just expert systems.

³ McCarthy et al. famously "proposed a 2 month, 10 man study of artificial intelligence to be carried out during the summer of 1956… [they thought] that a significant advance [could] be made… if a carefully selected group of scientists work on it together for a summer." Suffice it to say, the problem of AI turned out to need more than a small summer research project to solve.
⁴ affirming the antecedent: If P then Q; P; ∴ Q
⁵ denying the consequent: If P then Q; not Q; ∴ not P

2.4.1 Machine Learning

An alternative to having experts supply, a priori, all the knowledge required for an AI to 'think' is to instead make an AI that can 'learn' that knowledge by itself from input data or example cases. This is a sub-domain of AI called machine learning⁶ [106,107]. This sub-domain is also fairly large, as many different approaches have been developed since 1956 as part of different attempts to get computers to extract useful knowledge from data [111]. Some of these approaches are more suitable for certain types of machine learning problems, so it may be helpful to first clarify, broadly, how machine learning problems are classified before determining into which category the problem of NYHA functional class assessment falls.

2.4.2 Supervised, Unsupervised and Reinforcement Learning

The first important way to classify machine learning problems is by learning modality. Machine learning problems come in 3 major types: supervised learning, unsupervised learning and reinforcement learning problems [111–113].

1) Supervised learning problems, the most common type, are those where both the input and output variables are provided. The computer learns a mapping function to accurately convert the inputs to outputs, even inputs that haven't been seen before [111,112]. In other words, for a given input variable x and output variable y, where y = f(x), find a suitable f [111,112].

2) Unsupervised learning problems are those where neither the output variable (y) nor the mapping function (f) is known - the objective of unsupervised learning is usually to have the machine discover underlying patterns in the data [111,112].

⁶ Colloquially, the terms 'artificial intelligence' and 'machine learning' are sometimes used interchangeably (e.g. [107]). However, machine learning technically refers to the task of getting machines to mimic the 'learning' aspect of intelligence, whereas artificial intelligence refers to the field (inclusive of all its subdomains) as a whole. In this work we use the technical terms exclusively.


3) Reinforcement learning approaches the concept of learning from an entirely different perspective than supervised and unsupervised learning [113]. In reinforcement learning there is, in a sense, neither a static x, y nor f. Rather, the machine learns by trial and error, from successive interactions with an external environment, what actions it should take to optimize the value of some future reward [113]. In other words, the machine must not only consider how to interpret the present state of its environment, but also which actions to take (and by extension which additional input data to collect about its environment), and finally decide which actions are most appropriate to bring it closest to its goal based on the past success or failure of previous actions [113]. Reinforcement learning methods are thus the realm of 'game-playing' AIs, such as AlphaGo [114], which 'plays' the board game Go, OpenAI Five [115,116], which competes at Dota 2 (a multiplayer online battle arena video game), and the various AIs that compete at real-time strategy video games like Starcraft/Starcraft 2 [117].

The question of objective NYHA class assessment clearly falls under the class of supervised learning, since we have a known output label – NYHA functional class – that we wish to determine based on some input variables, or ‘features’, in our dataset. Our question is whether it is possible to find an adequate mapping function given our input data.
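As a concrete (if toy) illustration of this supervised set-up, the sketch below fits a random forest - one of the classifier families explored later in this thesis - to made-up feature and label arrays standing in for a real dataset. The use of scikit-learn and the random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# x: input features (e.g. per-patient activity summaries); y: known
# output labels (NYHA class). Random stand-ins for the real dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))        # 100 patients, 5 features each
y = rng.integers(2, 4, size=100)     # labels: NYHA class II (2) or III (3)

clf = RandomForestClassifier(n_estimators=100).fit(x, y)  # learn f: x -> y
print(clf.predict(x[:3]))            # apply the learned f to new inputs
```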

2.4.3 Classification vs Prediction Problems

Supervised learning algorithms can be further categorized by the expected output of the algorithm: either a categorical label or a numerical prediction. The former is termed a ‘classification’ problem, and the latter a ‘prediction’ or ‘regression’ problem [111–113]. Note that while the term ‘prediction’ has a temporal connotation, prediction problems need not be temporal in nature – a prediction need not necessarily be a forecast for or of the future. Inferring a missing value in a dataset, such as a missing grade for a student’s assignment based on their other assignments would be as equally valid a prediction problem as forecasting the next day’s temperature based on historical temperature data. In contrast, forecasting whether the next day will be ‘hot’ or ‘cold’ is an example of a classification problem. Determining the probability that a patient falls within a given NYHA class would be a supervised prediction problem. However, since we wish to assign a categorical label (i.e. a NYHA functional class) to each patient, we are instead tackling a supervised classification learning problem.

There are various algorithms for addressing supervised classification problems, including Generalized Linear Models, Random Trees & Forests, Neural Networks and Support Vector Machines. The author whole-heartedly recommends the book "Programming Collective Intelligence" by T. Segaran for an accessible, yet thorough primer on these and other modern machine learning techniques [111]. Segaran's book mostly discusses machine learning algorithms that are fed with cross-sectional data (i.e. where all the data is acquired at a particular 'slice' of time, or where the order or sequence of the data is not considered important). Since our application involves time series data, where the order of the data is important, we also specifically explored the use of hidden Markov models (HMMs), a type of machine learning algorithm considered highly suitable for learning from time series data. HMMs have been applied to problems as disparate as speech recognition [118], stock market pricing analysis [119], seizure classification [120] and human physical activity recognition [62,121]. A brief intro to HMMs is provided for the reader's convenience in Appendix B.
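For readers who prefer code to prose, the following toy sketch fits a Gaussian HMM to a synthetic minute-by-minute step count series using the hmmlearn package; the synthetic data and the two-state structure are illustrative assumptions, not the models developed later in this thesis.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic stand-in: a univariate time series resembling minute-by-minute
# step counts, with a low-activity regime followed by a high-activity one.
rng = np.random.default_rng(0)
steps = np.concatenate([rng.poisson(5, 200), rng.poisson(60, 200)])
X = steps.reshape(-1, 1).astype(float)   # shape: (n_samples, n_features)

# Fit a 2-state Gaussian HMM, then recover the most likely hidden state
# sequence (Viterbi decoding) - e.g. 'resting' vs. 'active'.
model = hmm.GaussianHMM(n_components=2, n_iter=100).fit(X)
hidden_states = model.predict(X)
```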

2.4.4 The Effect of Sample Size on Machine Learning

Before we address the current state of research at the intersection of machine learning and HF assessment, we briefly comment on an important consideration of machine learning: the amount of data required to train a machine learning algorithm. Machine learning is notorious for being particularly data intensive [119,122,123]. This notoriety likely explains why the term Big Data is often (incorrectly) used interchangeably with machine learning in popular parlance [124].

Machine learning practitioners generally consider data sets on the order of hundreds of samples to be relatively small [122,123,125]. In fact, many traditional ML algorithms are difficult to validate properly unless the training dataset contains upwards of 200 events of interest per candidate ML feature - and even some of the simplest models, such as logistic regression, require at least 20-50 events per candidate feature [126]. The exact size of the dataset required to properly train a typical hidden Markov model (or any machine learning algorithm in general) depends on a number of factors, including the method of classification, the complexity of the classifier, the separation between classes, and the variance and noise present in the data. The noisier, the more complex, and the greater the variance in the data, the larger the dataset typically required to achieve good performance. There is no upper limit on how much data should be used for training, but there is a point at which increasing the input data begins to yield diminishing returns in predictive performance [123]. The exact relationship between training set size and predictive performance for a given algorithm and problem is often shown as a 'learning curve' graph, which plots training set size versus prediction error. To the best of the author's knowledge, the learning curve for this particular application (or a sufficiently analogous one) has not yet been determined. However, given that we expect the data collected in this study to be relatively noisy and complex, we expect the model to lean towards requiring more data rather than less. Since biomedical data is typically in short supply, we will endeavour to collect as much data as possible in order to not prematurely limit the power or the generalizability of the algorithm developed.
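As an illustration of how such a curve might be generated in practice, the sketch below uses scikit-learn's learning_curve helper on a synthetic dataset; both the dataset and the logistic regression model are placeholders, not the classifiers developed later in this work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; in practice X, y would be patient features/labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Plotting mean test score against training set size yields the 'learning
# curve': performance typically rises steeply, then flattens out as
# additional data begins to yield diminishing returns.
print(sizes, test_scores.mean(axis=1))
```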

2.4.5 State-of-the-art

Tripoliti et al. [127] published a comprehensive review in 2017 on the state-of-the-art in machine learning applications for HF management. They found that, across the 45+ unique studies reviewed, various machine learning techniques have been applied both to a) the prediction of adverse HF events, including destabilizations, mortality and hospitalization, and b) the diagnosis of HF, including HF detection, recognition of HF sub-types, and estimation of severity (e.g. NYHA functional class). Input data included standard demographic data, but also, variously: clinical history, laboratory and ECG data, and various features extracted or computed from the input data. NYHA functional class was often included in the studies as part of the input demographic data, but only 4 studies investigated it specifically as a classification task.

In 2011, Pecchia et al. [128] presented a telemonitoring system that collected and used patient ECG data for HF detection and classified patients as exhibiting either NYHA class III (labelled 'severe HF') or NYHA class I or II (labelled 'mild HF'). The detection and severity classification tasks are each performed with a single decision tree, specifically one generated using the Classification And Regression Trees (CART) algorithm. The decision trees each use different heart rate variability (HRV) features [129] extracted from the ECG waveform, HRV having already been shown to be useful for discriminating between patients of different NYHA classes [130–134]. Pecchia et al. trained and tested their severity classifier on Holter monitor data available from a public database, the Congestive Heart Failure RR Interval Database [135] (i.e. not data recorded using their telemonitoring system). The dataset consisted of 29 patients (12 mild, 17 severe), with which they were able to achieve an overall classification accuracy⁷ of 79.31%, sensitivity⁸ of 82.35%, specificity⁹ of 75.00%, and precision¹⁰ of 82.35% - although the authors did not specify the validation technique used.
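For illustration, two of the classic time-domain HRV features can be computed from a sequence of RR intervals in just a few lines; this is a generic sketch, not Pecchia et al.'s exact feature set.

```python
import numpy as np

def time_domain_hrv(rr_ms: np.ndarray) -> dict:
    """Two classic time-domain HRV features from RR intervals [ms]."""
    diffs = np.diff(rr_ms)
    return {
        "SDNN": np.std(rr_ms, ddof=1),           # overall variability
        "RMSSD": np.sqrt(np.mean(diffs ** 2)),   # beat-to-beat variability
    }

rr = np.array([810, 790, 845, 860, 795, 805], dtype=float)  # example RRs
print(time_domain_hrv(rr))
```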

⁷ The proportion of patients correctly classified into their actual true class.
⁸ a.k.a. recall, or true positive rate: the proportion of patients correctly identified as belonging to the 'positive' test class (e.g. class A in A vs. B).
⁹ a.k.a. true negative rate: the proportion of patients correctly identified as belonging to the 'negative' test class (e.g. class B in A vs. B).
¹⁰ a.k.a. positive predictive value: the proportion of patients correctly classified as belonging to the 'positive' test class amongst all the patients identified by the classifier as belonging to the 'positive' test class.
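To make the four metrics defined in the footnotes concrete, the sketch below computes them from 2×2 confusion-matrix counts. The example counts are one reconstruction consistent with Pecchia et al.'s reported percentages (treating 'severe' as the positive class), not figures taken from their paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """The metrics defined in the footnotes, from confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),   # positive predictive value
    }

# 14 TP, 3 FN, 9 TN, 3 FP over 29 patients (17 severe, 12 mild) reproduce
# the quoted 79.31% / 82.35% / 75.00% / 82.35% figures.
print(binary_metrics(tp=14, fp=3, tn=9, fn=3))
```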


In 2013, Melillo et al. [136] performed a similar study, but using a larger superset of data containing additional patients from the publicly available BIDMC Congestive Heart Failure Database [137,138]. This data superset also included class IV patients, who were grouped with the class III patients in the 'severe HF' class. In this study, Melillo et al. applied some additional corrections to their decision trees to allow them to perform feature selection in a way that accounted for the still small and rather unbalanced dataset (12:32, mild:severe). Melillo et al. also compared the performance of their single CART decision tree to a random forest classifier [111,139], as well as to a single tree generated using the more popular C4.5 algorithm [139]. Of the 3 classifiers, they found that their revised CART performed best, with a classification accuracy of 85.40% (Δ = +6.09% compared to [128]), sensitivity of 93.30% (Δ = +10.95%), specificity of 63.60% (Δ = -11.4%), and precision of 87.50% (Δ = +5.15%). In this paper, Melillo et al. specified that they used 10-fold cross-validation. 10-fold, or more generally k-fold, cross-validation is a common technique for validating machine learning algorithms whereby the complete dataset is separated into k groups or 'folds' (in this case 10). One of the folds is held aside as the initial test set, while the remaining folds constitute the initial training set [140,141]. The folds held aside as the test and training sets are then rotated such that each fold is held aside exactly once as a test set, with the non-test folds in that round used as the training set [140,141]. In this way each data point in the dataset is well utilized, supplying information for both testing and training [140,141].
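A minimal sketch of this fold rotation, using scikit-learn's KFold (and LeaveOneOut, the k = N special case discussed next) on a toy dataset:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(40).reshape(20, 2)   # toy dataset of N = 20 samples

# 10-fold CV: each sample appears in a held-out test fold exactly once.
for train_idx, test_idx in KFold(n_splits=10).split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Leave-one-out CV is the special case k = N: one held-out sample per round.
assert sum(1 for _ in LeaveOneOut().split(X)) == len(X)
```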

In 2015, Shahbazi et al. [142] used the same dataset and labelling schema, although they dropped 5 patients based on a pre-established data-reliability measure, for a final dataset of 10:29 (mild:severe). In this study, Shahbazi et al. used a different machine learning algorithm known as k-Nearest Neighbours [111,139]. Since the k-Nearest Neighbours algorithm does not have feature selection baked in (in contrast to decision trees), Shahbazi et al. performed feature selection using a method known as generalized discriminant analysis [143] to select a reduced subset of the best available features to present to their k-Nearest Neighbours algorithm. The whole feature selection-classifier chain was validated using leave-one-out cross-validation, a variant of k-fold cross-validation where k is equal to the number of data points [140]. In other words, for a dataset of size N, leave-one-out cross-validation is k-fold cross-validation with k = N. Leave-one-out cross-validation is thus often preferred when the dataset in question is particularly small, since only 1 data point is held out as the test set in each round, maximizing the amount of data available for training. In any case, Shahbazi et al. were able to achieve a remarkable 100% and 97.43% accuracy for classifiers trained using only non-linear HRV features and using both linear and non-linear HRV features, respectively¹¹.

Lastly, in a 2010 study, Yang et al. attempted to perform both diagnosis and severity assessment together, using a dataset of 153 patients labelled as either 'Healthy', 'HF-prone' or 'HF' (65:30:58). The 'Healthy' group corresponded to those with no cardiac dysfunction, the 'HF-prone' group to those patients with NYHA class I symptoms, and the 'HF' group to those with either NYHA class II or III symptoms. Due to their relative abundance of data points, Yang et al. opted for a simple training/test set split, allocating 63 (24:14:25) samples for training and 90 for testing (41:16:33). Yang et al. chose to use a support vector machine (SVM) algorithm [111,139], which is a supervised prediction algorithm. As such, they had to convert the numeric prediction value into a final output classification, which they did by first mapping the SVM prediction v to a new mapped output value y using the following tan-sigmoid function:

y = 4 / (1 + e^(−4v)) − 2    (1)

and then determining the decision cutoff points for the groups using Youden's index [144]. Their approach gave them an overall accuracy of 74.44%, with an accuracy of 87.50% and 65.85% for the NYHA I group and the NYHA II and III group respectively (78.79% for the healthy group). As input data, Yang et al. used parameters from blood tests (specifically sodium and BNP levels), ECGs (including HRV features), chest radiography (i.e. LVEF and cardiac dimensions), the 6MWT (distance) and a "physical test" [145]. Other noteworthy parameters employed by the SVM models include peak V̇O2.
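For concreteness, below is a small Python sketch of the mapping in equation (1) together with an illustrative Youden's-index cutoff search. The cutoff routine is a generic implementation under the stated definition of Youden's index, not Yang et al.'s actual cutpoints or code.

```python
import numpy as np

def tan_sigmoid(v: np.ndarray) -> np.ndarray:
    """Equation (1): squash raw SVM outputs v into the interval (-2, 2).
    Algebraically equivalent to 2*tanh(2v)."""
    return 4.0 / (1.0 + np.exp(-4.0 * v)) - 2.0

def youden_cutoff(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the cutoff maximizing Youden's index J = sens + spec - 1.
    labels: array of 0/1 class indicators; scores: mapped outputs y."""
    best_j, best_c = -1.0, 0.0
    for c in np.unique(scores):
        pred = scores >= c
        sens = np.mean(pred[labels == 1])    # true positive rate
        spec = np.mean(~pred[labels == 0])   # true negative rate
        if sens + spec - 1 > best_j:
            best_j, best_c = sens + spec - 1, c
    return best_c
```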

To the author's knowledge, no other studies have used machine learning for assessing NYHA functional class. Certainly, no study appears to have done more than a binary (two-class) prediction of NYHA class. Of course, this is likely a result of the difficult and time-consuming nature of acquiring a sufficiently large dataset that includes all 4 NYHA functional classes. Fortunately, as previously mentioned, the practical challenge of NYHA functional class assessment mostly centers on distinguishing the middle two classes, II and III, such that studies using a 'mild'/'severe' labelling scheme like the one used by the Pecchia, Melillo, and Shahbazi studies are essentially addressing the central NYHA functional class assessment challenge. It also appears clear from these studies that machine learning methods are a potent tool for objectively assessing NYHA functional class: case in point, Shahbazi et al.'s k-Nearest Neighbours approach appears to have achieved incredible accuracy at separating HF patients with class I or II symptoms from those with class III or IV symptoms - albeit on what is still a relatively small sample of 39 patients. All of the aforementioned studies, however, relied solely on data recorded in the clinic, and on HRV specifically. While we do not doubt the utility of HRV measurements for various aspects of cardiovascular care, they do have some important drawbacks. For example, the preferred standard recording interval for an ECG used for HRV analysis is 24 hours, although it is possible to record very long-term ECGs (i.e. for longer than a period of 1-2 days) [129,146,147]. However, very long-term ECGs require slightly different treatment than shorter-term ECGs: the longer an ECG recording, the less reasonable it is to maintain the assumption that the ECG signal is stationary - an important assumption for much of the mathematics that underpins HRV signal processing [146]. While some researchers have developed new approaches for very long-term HRV signal analysis, these have not yet been validated against outcomes [146]. This is an important step: it is known that the HRV features used for short- and long-term ECGs are not always interchangeable [129], and it is only reasonable to assume that the same would apply to very long-term ECGs. ECG-based HRV analysis is also not common practice in many clinics and requires specialized knowledge and equipment (in particular for use in telemonitoring). As an additional drawback, ECGs are often replete with artefacts and noise, and so sometimes require manual cleaning before they can be used for HRV analysis [129]. Altogether, this makes HRV analysis a powerful but relatively inaccessible tool (at least at present) for performing regular assessment of NYHA class as part of care. It would be useful to determine whether it is possible to objectively assess NYHA class using more commonly accessible technology such as the standard CPET, or fitness trackers, which, although not ubiquitous in the hospital, are ubiquitous in the consumer space and would be an ideal tool for remotely monitoring HF patients and regularly reassessing their NYHA class.

¹¹ granted, a model with 100% accuracy is very possibly overfit to the dataset used.

Summary

To summarize: heart failure, a global epidemic, is a complex chronic progressive condition associated with significant morbidity and mortality. Patients often present with exacerbations to acute care centers and hospital emergency rooms, at significant cost.

Exercise intolerance, one of the main manifestations of heart failure (HF), is an integral part of HF care evaluations. The New York Heart Association (NYHA) classification is a functional assessment of exercise capacity where a higher NYHA class is associated with increased symptoms, decreased quality of life and poor survival. This classification system is highly subjective, especially for NYHA classes II and III, which call for patients experiencing "slight" versus "marked limitation of physical activity" [9]. The application of the criteria thus varies widely based on the patient's self-report and the individual physician's interpretation [6,7]. A quantifiable measure that removes this subjectivity, making the assignment of NYHA class more repeatable and objective, is highly desirable - especially if such a measurement could be made on a regular basis to more closely track the progression of the disease.

In common clinical practice, most assessments of exercise intolerance are performed through standardized or non-standardized questions posed as part of the medical interview. More quantifiably, CardioPulmonary Exercise Testing (CPET) is a validated clinical tool used to assess exercise intolerance. Other researchers have identified some relationships between CPET measures, specifically peak V̇O2, and NYHA class, although none have attempted to predict NYHA class from CPET measures. Performing CPET studies also has some important drawbacks: they require access to expensive equipment in a lab environment, and trained personnel to run the tests. Consumer-targeted wearable physical activity trackers overcome these disadvantages: they are inexpensive, simple to use, and can measure moment-to-moment physical activity (and thus, hopefully, infer exercise intolerance) during free-living activities instead of simulated activity in a lab. A previous exploratory study [13] investigated wearable activity trackers in HF patients and found a link between patients' daily average step counts and their corresponding NYHA functional classes. However, the study's small sample (n=8) limits scientific confidence in the generalizability of this finding, so we resolve to first investigate (in the next chapter) whether these results generalize to a larger study sample.

Activity trackers could thus also be used to remotely monitor patients to help both patients and clinical staff better manage their condition. Remote monitoring has been shown to improve HF patient outcomes when properly implemented. To maximize chances of successful implementation, we proposed integrating activity tracker monitoring as part of Medly [101,102], an existing well validated phone-based HF patient monitoring solution already integrated and in use at our hospital.

One of the important features of Medly is that it leverages an expert system (an early type of artificial intelligence algorithm) to triage, respond to, or escalate clinical concerns to staff as necessary, while handling regular 'run-of-the-mill' clinical tasks without needing human intervention, thus providing a cost-effective and scalable way of monitoring patients on a daily basis. We suggest that a similar intelligent system could be used for NYHA class assessment. By using an artificial intelligence system that could translate relevant data into the desired clinical outcome (NYHA classification), or a sufficiently equivalent outcome (an 'NYH-AI' or 'NYHAI' classification, if you will), we could provide a way to assess a patient's functional classification in an objective, consistent manner while still leveraging the advantages of the existing 'traditional' NYHA classification method. Some researchers have already investigated intelligent classification algorithms, but unfortunately these all relied on analysing heart rate variability from ECGs. We suggest that it might be possible to perform the same classification using more accessible or ubiquitous technology like a CPET or fitness tracker.


Replication of Previous Study

As discussed in Section 2.2.3.3, a previous exploratory study [13] investigated wearable activity trackers in HF patients and demonstrated a statistically significant difference between the daily average step counts of patients experiencing NYHA class II vs. NYHA class III symptoms. However, the study's small sample (n=8) limits scientific confidence in the generalizability of these findings. Since step count activity is expected to be a highly relevant, useful and massively feature-rich dataset, we replicated the study on a separate, otherwise limited, dataset collected during another previous study, to increase our confidence in the relevance and usefulness of step data for this particular research thesis. Our primary objective was to validate the pilot study on a larger sample of patients with HF with reduced ejection fraction (HFrEF). Our secondary objective, in analyzing the larger dataset, was to better characterize the distribution of step counts for patients in different NYHA classes.

The remaining part of this chapter, our replication of the pilot study, has been submitted for publication to a peer-reviewed journal [148]. The thesis author was responsible for the direction and execution of the research as well as the drafting of the initial paper. The other authors on the submitted paper (S. Bromberg, M. Yasbanoo, B. Taati, H. Ross, C. Manlhiot, and J. Cafazzo) contributed feedback and edits to subsequent drafts of the manuscript. Additionally, S. Bromberg collected the original dataset used in the study, H. Ross & M. Yasbanoo provided clinical guidance, and J. Cafazzo and C. Manlhiot provided general consultation.

Abstract

Background: A previously published pilot study showed a statistically significant difference in step count activity, as measured by wrist-worn activity monitors, between patients with heart failure (HF) in different New York Heart Association (NYHA) functional classes. However, the study's small sample size severely limits scientific confidence in the generalizability of this finding to the larger HF population.

Objective: Validate the pilot study on a larger sample of patients with HF with reduced ejection fraction (HFrEF) and attempt to characterize the step count distribution.

Methods: We repeated the analysis performed during the pilot study on an independently recorded dataset consisting of a total of 50 patients with HFrEF (35 NYHA II and 15 NYHA III). Participants were monitored for step count with a Fitbit Flex for a period of two weeks in a free-living environment.


Results: Patients exhibiting NYHA class III symptoms had a significantly lower recorded mean of daily total step count (4012 ± 1933 vs. 5484 ± 2640 [steps/day], P = .04), a lower recorded mean of daily mean per-minute step count (2.8 ± 1.3 vs. 3.8 ± 1.8 [steps/minute], P = .04), and lower mean and maximum of the daily per-minute step count maximums (80.5 vs. 95.6, and 112.9 vs. 125.7 [steps/minute]; P = .02 and .004, respectively).

Conclusions: Patients with NYHA II and III symptoms differed significantly by various aggregate measures of free-living step count, including 1) mean daily total step count, as well as, newly discovered, 2) the mean and 3) the maximum of the daily per-minute step count maximums. These findings affirm that the degree of exercise intolerance of NYHA II and III patients, as groups, is quantifiable in a replicable manner. This is a novel and promising finding, highly suggestive of a possible completely objective measure for assessing HF functional class - something that would be a great boon in the continuing quest to improve patient outcomes for this burdensome and costly disease.

Introduction

Heart Failure (HF), a global epidemic [1,14], is a complex chronic progressive condition associated with significant morbidity and mortality. HF is the leading cause of hospitalizations, costing Canadians an estimated 3 billion dollars annually [2]. Clinicians caring for patients with HF have a strong desire to reduce hospitalizations from both a systems and a patient-centered perspective [2,4]. To do so, it is important for clinicians caring for these patients to understand each patient's physiologic parameters.

Evaluating exercise intolerance, one of the main manifestations of HF, is an integral part of HF care. The New York Heart Association (NYHA) classification is a functional assessment of exercise capacity where a higher NYHA class is associated with increased symptoms, decreased quality of life and poor survival [8,10,149]. This classification system is highly subjective [6,7], especially for NYHA classes II and III [9]. The application of the criteria thus varies widely based on the patient's self-report and the individual physician's interpretation [6,7]. A quantifiable measure that removes this subjectivity, making the assignment of NYHA class more repeatable and objective, would be beneficial.

A previous exploratory study [13] investigated wearable activity trackers in HF patients and demonstrated a statistically significant difference in daily average step counts, a proxy for exercise intolerance, between patients with class II and class III symptoms. However, the study's small sample (n=8) limits the generalizability of these findings. The aim of this study is to determine whether these findings can be replicated using a larger sample collected independently of the original pilot study data.


Methods

As a replication, we repeated the analysis performed during the pilot study [13], but on an independently recorded dataset consisting of a total of 50 patients with HFrEF (9 NYHA I/II, 26 NYHA II, 4 NYHA II/III, and 11 NYHA III). Participants were monitored for step count with a Fitbit Flex [59] for a period of two weeks in a free-living environment.

3.3.1 Recruitment

The patients in this larger dataset (n=50) were originally recruited consecutively from the Heart Function Clinic at Toronto General Hospital (TGH) in Toronto, Canada from September 2014 to June 2015. The inclusion and exclusion criteria are outlined in Table 3 and Table 4, respectively.

Table 3: Inclusion criteria

- Adults (18+ years of age)
- Stable chronic HF
- NYHA Class II or III
- LVEF (Left Ventricular Ejection Fraction) ≤ 35%
- Able to walk without walking aids
- Capable of undergoing consent, understanding English instructions and complying with the use of the study devices

Table 4: Exclusion criteria

- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study costs – i.e. roaming charges)

Data Collection

Patients were supplied with a Fitbit Flex [57], an Android smartphone (Moto G), the associated charging equipment for both devices, as well as a data plan to facilitate syncing the tracker to the Fitbit server. Patients were instructed to wear the Fitbit daily on the same wrist, preferably their non-dominant hand, for a period of 2 weeks, except during water activities like showering or swimming, as the Flex is not waterproof. Patients were also instructed to charge the Fitbit at least every three days, preferably while they slept. The Fitbit data was retrieved using an open source script published on GitHub and adapted for this study [150].


Population

Patients in our larger dataset were labeled as either NYHA class II or III, or, when a physician was uncertain about the classification or felt that a patient exhibited symptoms from different class levels, as a borderline/mixed class I/II or II/III. Table 5 provides demographic information for the patients in the dataset according to their NYHA class; Table 6 provides the same for all patients overall and for just the subset of patients labelled NYHA class II or III. In either case, the patients are predominantly male (86 vs. 89 [%]), middle-aged (54 ± 14 vs. 56 ± 14 [years]), and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 [kg/m2]).

Table 5: Study dataset demographics

                            NYHA I/II    NYHA II      NYHA II/III  NYHA III
Total Participants (n [%])  9 (18%)      26 (52%)     4 (8%)       11 (22%)
# Male (n [%])              6 (67%)      23 (89%)     4 (100%)     10 (91%)
Age [years]                 52 ± 16      55 ± 14      52 ± 13      58 ± 13
Height [cm]                 171 ± 12     174 ± 8      177 ± 3      175 ± 10
Weight [kg]                 79.5 ± 25.5  87.6 ± 18.6  88.4 ± 22.7  94.4 ± 17.4
BMI [kg/m2]                 26.6 ± 7.1   29.0 ± 6.1   28.4 ± 7.5   30.9 ± 6.7

Table 6: Study dataset demographics (overall and just NYHA II or III)

                            Overall      NYHA II or III*
Total Participants (n [%])  50           37 (74% of total)
# Male (n [%])              43 (86%)     33 (89%)
Age [years]                 54 ± 14      56 ± 14
Height [cm]                 174 ± 9      174 ± 9
Weight [kg]                 87.7 ± 20.0  89.6 ± 18.5
BMI [kg/m2]                 28.9 ± 6.4   29.6 ± 6.3

Since NYHA class I/II and II/III are not formally recognized NYHA classes, we performed our analysis using the original class labels, and then a second time with the borderline/mixed classes grouped into one of the traditional 4 NYHA classes. Since NYHA class I corresponds to 'no limitation of physical activity', a binary distinction, we reasoned that a patient assigned as class I/II must have exhibited something more than 'no limitation of physical activity', however slight. Since NYHA class II corresponds to 'a slight limitation of physical activity', we reasoned that class I/II and class II should be grouped together. We designated the class I/II and class II group as Group II*. We extended the same line of reasoning to II/III patients, noting that patients assigned as class II/III must have experienced some more marked limitation of physical activity beyond class II limitations. As such, taking a conservative approach and assuming the worst-case scenario, we grouped them with class III. We designated the class II/III and III group as Group III*. Table 7 provides demographic information for the patients when the dataset is re-grouped according to this labeling scheme.

Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*)

                            NYHA Group II*  NYHA Group III*
Total Participants (n [%])  35 (70%)        15 (30%)
# Male (n [%])              29 (83%)        14 (93%)
Age [years]                 54 ± 14         56 ± 13
Height [cm]                 173 ± 9         176 ± 8
Weight [kg]                 85.5 ± 20.6     92.8 ± 18.3
BMI [kg/m2]                 28.4 ± 6.3      30.2 ± 6.7
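As a minimal sketch of this re-grouping rule (assuming a data frame named patients with an nyha label column; the names are illustrative, not the study's actual code):

    # Fold the borderline labels into the traditional classes:
    # I/II joins II to form Group II*; II/III joins III to form Group III*.
    patients$group <- ifelse(patients$nyha %in% c("I/II", "II"),
                             "Group II*", "Group III*")
    table(patients$group)  # expected: 35 in Group II*, 15 in Group III*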

3.3.2 Statistics

Consistent with our previous study [13], we used a Kruskal-Wallis rank test to compare the experimental variables of interest, including the mean daily total step count. Since the data are clearly not normally distributed, as can be seen in Figure 3-1, Figure 3-2 and Figure 3-3, we also computed various other aggregations of the minute by minute step count data to attempt to better characterize them. Namely, we calculated statistical summaries (mean, standard deviation, five number summary, interquartile range, skewness and kurtosis) for each patient's overall two week period and then for each individual patient-day of step data. We then calculated the max, min, mean and standard error across each patient's daily summaries (producing a maximum daily mean, minimum daily mean, mean of daily means, etc.) to assess overall variation on a daily basis. We then performed a Kruskal-Wallis rank test on each of the overall statistical summaries. The analysis was performed using R [151] and RStudio [152] with supporting packages [153–158].
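As an illustrative sketch of this aggregation pipeline (assuming a long-format data frame steps with columns patient, nyha, day and steps, one row per patient-minute; the column names are ours, not the study's actual code):

    library(dplyr)

    # Per patient-day summaries of the minute by minute step counts.
    daily <- steps %>%
      group_by(patient, nyha, day) %>%
      summarise(daily_total = sum(steps),
                daily_mean  = mean(steps),
                daily_max   = max(steps),
                .groups = "drop")

    # Per patient aggregates across the daily summaries (mean of daily means,
    # maximum of daily maximums, etc.).
    per_patient <- daily %>%
      group_by(patient, nyha) %>%
      summarise(mean_daily_total = mean(daily_total),
                mean_daily_max   = mean(daily_max),
                max_daily_max    = max(daily_max),
                .groups = "drop")

    # One Kruskal-Wallis rank test per aggregate, comparing NYHA classes.
    kruskal.test(mean_daily_total ~ factor(nyha), data = per_patient)
    kruskal.test(max_daily_max    ~ factor(nyha), data = per_patient)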


Figure 3-1. Histogram of per minute step count values for each patient, grouped by individual NYHA class


Figure 3-2. Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0). Stacked internal segments indicate relative contributions by each patient.


Figure 3-3. Individual frequency of per minute step counts for each patient (zoomed in to step counts > 0), grouped by NYHA class

Results and Discussion

Table 8 and Table 9 include results that were found to be significant at the P=.05 level in at least one comparison. Table 10 and Table 11 contain the remaining non-significant results, excluding any statistical summary that returned a 0 value for all classes (e.g. aggregations involving the daily or overall minimum, 1st, 2nd and 3rd quartiles) due to the overwhelming frequency of 0 per minute step counts. Table 8 and Table 10 tabulate the results of the comparison using the original class labels, i.e. comparisons between class II vs. III, and the comparison of all available classes, i.e. I/II vs. II vs. II/III vs. III, whereas Table 9 and Table 11 tabulate the results of the comparison of the relabeled dataset, i.e. group II* vs. group III*. The mean daily total steps, and the mean and maximum of the daily per minute step count maximums (with standard error bars), are plotted graphically in Figure 3-4, Figure 3-5, and Figure 3-6 respectively.


Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class II vs. III.

                                                        I/II     II       II/III   III      P-value        P-value
                                                                                            (all classes)  (II vs. III)
Maximum
  Maximum 2 Week PMSCa [steps/minute]                   126.33   125.54   112.75   112.91   .04*           .0104*
  Maximum of Maximum DPMSCb [steps/minute]              126.33   125.54   112.75   112.91   .04*           .0104*
  Mean of Maximum DPMSCb [steps/minute]                 96.94    95.10    80.26    80.65    .12            .04*
Mean
  Mean 2 Week PMSCa [steps/minute]                      3.85     3.79     3.12     2.66     .22            .0499*
  Maximum of Mean DPMSCb [steps/minute]                 6.33     7.53     6.11     5.02     .07            .014*
  Mean of Mean DPMSCb [steps/minute]                    3.85     3.79     3.12     2.66     .22            .0499*
  Standard Deviation of Mean DPMSCb [steps/minute]      1.40     1.98     1.70     1.21     .054           .0095**
  Standard Error of Mean DPMSCb [steps/minute]          0.36     0.50     0.43     0.31     .07            .013*
Standard Deviation
  Standard Deviation of 2 Week PMSCa [steps/minute]     12.90    13.09    10.51    9.99     .15            .03*
  Maximum of DPMSCb Standard Deviation [steps/minute]   18.61    20.10    15.53    14.94    .02*           .0053**
  Mean of DPMSCb Standard Deviation [steps/minute]      12.24    11.87    9.44     9.23     .17            .0499*
Standard Error
  Standard Error of 2 Week PMSCa [steps/minute]         0.088    0.087    0.071    0.067    .16            .04*
  Maximum of DPMSCb Standard Error [steps/minute]       0.49     0.53     0.41     0.39     .02*           .005**
  Mean of DPMSCb Standard Error [steps/minute]          0.32     0.31     0.25     0.24     .17            .0499*
Total
  Total 2 Week SCc [kilosteps]                          8.19     8.51     6.95     5.87     .16            .03*
  Maximum of Total DPMSCb [steps]                       9113     10837    8803     7232     .07            .014*
  Mean of Total DPMSCb [steps]                          5542     5464     4499     3835     .22            .0499*
  Standard Deviation of Total DPMSCb [steps]            2019     2856     2452     1745     .054           .0095**
  Standard Error of Total DPMSCb [steps]                523      713      624      441      .07            .013*

aPMSC: Per Minute Step Count; bDPMSC: Daily Per Minute Step Count; cSC: step count

Table 9: Significant findings for comparisons between group II* and group III*

                                                        Group II*       Group III*       P-value
                                                        (= I/II + II)   (= II/III + III)
Maximum
  Maximum 2 Week PMSCa [steps/minute]                   125.74          112.87           .004**
  Maximum of Maximum DPMSCb [steps/minute]              125.74          112.87           .004**
  Mean of Maximum DPMSCb [steps/minute]                 95.57           80.55            .02*
Mean
  Mean 2 Week PMSCa [steps/minute]                      3.81            2.79             .04*
  Maximum of Mean DPMSCb [steps/minute]                 7.22            5.31             .03*
  Mean of Mean DPMSCb [steps/minute]                    3.81            2.79             .04*
  Standard Deviation of Mean DPMSCb [steps/minute]      1.83            1.34             .04*
  Standard Error of Mean DPMSCb [steps/minute]          0.46            0.34             .045*
Standard Deviation
  Standard Deviation of 2 Week PMSCa [steps/minute]     13.04           10.13            .02*
  Maximum of DPMSCb Standard Deviation [steps/minute]   19.72           15.09            .002**
  Mean of DPMSCb Standard Deviation [steps/minute]      11.97           9.29             .03*
Standard Error
  Standard Error of 2 Week PMSCa [steps/minute]         0.09            0.07             .02*
  Maximum of DPMSCb Standard Error [steps/minute]       0.52            0.40             .002**
  Mean of DPMSCb Standard Error [steps/minute]          0.32            0.24             .03*
Total
  Total 2 Week SCc [steps]                              84293           61612            .03*
  Maximum of Total DPMSCb [steps]                       10393           7651             .03*
  Mean of Total DPMSCb [steps]                          5484            4012             .04*
  Standard Deviation of Total DPMSCb [steps]            2640            1933             .04*
  Standard Error of Total DPMSCb [steps]                664             490              .045*

aPMSC: Per Minute Step Count; bDPMSC: Daily Per Minute Step Count; cSC: step count


Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class II vs. III.

                                                        I/II     II       II/III   III      P-value        P-value
                                                                                            (all classes)  (II vs. III)
Demographics
  Sex [M=0, F=1]                                        0.33     0.12     0.00     0.09     .29            .83
  Age [years]                                           51.56    54.96    51.50    57.82    .65            .55
  Height [cm]                                           171.44   173.96   176.50   175.27   .76            .69
  Weight [kg]                                           79.53    87.62    88.35    94.35    .53            .21
  BMIa [kg/m2]                                          26.59    29.00    28.41    30.88    .53            .39
  Righthanded?b [No=0, Yes=1]                           0.89     0.88     1.00     1.00     .61            .25
  Wristband Preferencec [Left=0, Right=1]               0.67     0.35     0.25     0.20     .18            .40
Maximum
  Standard Deviation of Maximum DPMSCd [steps/minute]   19.91    26.21    29.13    21.45    .31            .30
  Standard Error of Maximum DPMSCd [steps/minute]       5.06     6.43     7.42     5.26     .28            .32
  Minimum of Maximum DPMSCd [steps/minute]              58.89    36.81    17.75    40.82    .22            .62
75th Percentile
  Maximum of 75th Percentile of DPMSCd [steps/minute]   0.56     3.02     4.00     1.09     .46            .36
  Mean of 75th Percentile of DPMSCd [steps/minute]      0.04     0.50     0.72     0.08     .44            .33
  Standard Deviation of 75th Percentile of DPMSCd
  [steps/minute]                                        0.14     0.91     1.41     0.29     .43            .33
  Standard Error of 75th Percentile of DPMSCd
  [steps/minute]                                        0.04     0.23     0.35     0.08     .43            .33
Mean
  Minimum of Mean DPMSCd [steps/minute]                 1.31     0.67     0.57     0.88     .21            .36
Standard Deviation
  Minimum of DPMSCd Standard Deviation [steps/minute]   5.42     3.01     2.07     3.67     .21            .42
Standard Error
  Minimum of DPMSCd Standard Error [steps/minute]       0.14     0.08     0.05     0.10     .21            .42
Total
  Minimum of Total DPMSCd [steps]                       1887     971      818      1270     .21            .36
IQR (Interquartile Range)
  Maximum of DPMSCd IQRg [steps/minute]                 0.56     3.02     4.00     1.09     .46            .36
  Mean of DPMSCd IQRg [steps/minute]                    0.04     0.50     0.72     0.08     .44            .33
  Standard Deviation of DPMSCd IQRg [steps/minute]      0.14     0.91     1.41     0.29     .43            .33
  Standard Error of DPMSCd IQRg [steps/minute]          0.04     0.23     0.35     0.08     .43            .33
Skewness
  2 Week PMSCe Skewness                                 5.14     5.20     5.29     6.50     .62            .27
  Maximum of Daily SCf Skewness                         11.36    13.22    5.24     12.39    .56            .91
  Mean of Daily SCf Skewness                            5.20     5.30     4.11     5.77     .76            .65
  Standard Deviation of Daily SCf Skewness              2.00     2.54     0.58     2.18     .37            .73
  Standard Error of Daily SCf Skewness                  0.51     0.65     0.16     0.55     .33            .73
  Minimum of Daily SCf Skewness                         3.61     3.21     2.59     3.70     .42            .34
Kurtosis
  2 Week PMSCe Kurtosis                                 35.32    33.44    36.72    61.42    .61            .24
  Maximum of Daily SCf Kurtosis                         249.66   283.85   31.17    237.06   .58            .87
  Mean of Daily SCf Kurtosis                            43.12    44.82    19.12    49.53    .68            .57
  Standard Deviation of Daily SCf Kurtosis              59.92    68.46    5.53     54.44    .39            .78
  Standard Error of Daily SCf Kurtosis                  15.08    17.33    1.48     13.55    .39            .87
  Minimum of Daily SCf Kurtosis                         15.38    10.74    6.62     15.64    .36            .23

aBMI: Body Mass Index; bRighthanded?: is patient righthanded?; cWristband Preference: right or left handed preference for wristband; dDPMSC: Daily Per Minute Step Count; ePMSC: Per Minute Step Count; fSC: step count; gIQR: interquartile range

Table 11: Non-significant findings for comparisons between group II* and group III*

                                                        Group II*       Group III*       P-value
                                                        (= I/II + II)   (= II/III + III)
Demographics
  Sex [M=0, F=1]                                        0.17            0.07             .33
  Age [years]                                           54.09           56.13            .71
  Height [cm]                                           173.31          175.60           .38
  Weight [kg]                                           85.54           92.75            .17
  BMIa [kg/m2]                                          28.38           30.22            .28
  Righthanded?b [No=0, Yes=1]                           0.89            1.00             .18
  Wristband Preferencec [Left=0, Right=1]               0.43            0.21             .16
Maximum
  Standard Deviation of Maximum DPMSCd [steps/minute]   24.59           23.50            .76
  Standard Error of Maximum DPMSCd [steps/minute]       6.08            5.84             .86
  Minimum of Maximum DPMSCd [steps/minute]              42.49           34.67            .58
75th Percentile
  Maximum of 75th Percentile of DPMSCd [steps/minute]   2.39            1.87             .93
  Mean of 75th Percentile of DPMSCd [steps/minute]      0.38            0.25             .89
  Standard Deviation of 75th Percentile of DPMSCd
  [steps/minute]                                        0.71            0.59             .91
  Standard Error of 75th Percentile of DPMSCd
  [steps/minute]                                        0.18            0.15             .91
Mean
  Minimum of Mean DPMSCd [steps/minute]                 0.84            0.80             .90
Standard Deviation
  Minimum of DPMSCd Standard Deviation [steps/minute]   3.63            3.24             .80
Standard Error
  Minimum of DPMSCd Standard Error [steps/minute]       0.10            0.09             .80
Total
  Minimum of Total DPMSCd [steps]                       1207            1149             .90
IQR (Interquartile Range)
  Maximum of DPMSCd IQRg [steps/minute]                 2.39            1.87             .93
  Mean of DPMSCd IQRg [steps/minute]                    0.38            0.25             .89
  Standard Deviation of DPMSCd IQRg [steps/minute]      0.71            0.59             .91
  Standard Error of DPMSCd IQRg [steps/minute]          0.18            0.15             .91
Skewness
  2 Week PMSCe Skewness                                 5.18            6.18             .29
  Maximum of Daily SCf Skewness                         12.60           11.68            .97
  Mean of Daily SCf Skewness                            5.26            5.60             .76
  Standard Deviation of Daily SCf Skewness              2.36            2.02             .76
  Standard Error of Daily SCf Skewness                  0.60            0.51             .79
  Minimum of Daily SCf Skewness                         3.34            3.59             .65
Kurtosis
  2 Week PMSCe Kurtosis                                 33.93           54.83            .25
  Maximum of Daily SCf Kurtosis                         272.45          216.47           .97
  Mean of Daily SCf Kurtosis                            44.25           46.49            .71
  Standard Deviation of Daily SCf Kurtosis              65.62           49.55            .73
  Standard Error of Daily SCf Kurtosis                  16.58           12.34            .79
  Minimum of Daily SCf Kurtosis                         12.29           14.74            .47

aBMI: Body Mass Index; bRighthanded?: is patient righthanded?; cWristband Preference: right or left handed preference for wristband; dDPMSC: Daily Per Minute Step Count; ePMSC: Per Minute Step Count; fSC: step count; gIQR: interquartile range

3.4.1 Principal Results

This study, using an independent, larger group of participants, replicated and validated the findings of our previous pilot study: that the daily free-living step counts of HF patients exhibiting NYHA class II vs class III symptoms are statistically different [13].

Specifically, HF patients categorized as NYHA II vs. III were found to differ significantly by mean of daily total step count (5464 vs. 3835, P = .0499), as well as by mean of daily mean step count (3.8 vs. 2.7, P = .0499). NYHA II vs. III patients also differed significantly by mean (95.1 vs. 80.7, P = .04) and maximum (125.5 vs. 112.9, P = .0104) of the daily per minute step count maximums.

Figure 3-4. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each individual NYHA class


Similarly, group II* and group III* also differed significantly by mean of daily total step count (5484 vs. 4012, P = .04) and mean of daily mean step count (3.8 vs. 2.8, P = .04), as well as by the mean (95.6 vs. 80.5, P = .02) and maximum (125.7 vs. 112.9, P = .004) of the daily per minute step count maximums.

In both cases quoted above, the daily step count results mimicked the two-week overall step count analysis.

Of the 4 metrics identified above only the maximum daily per minute step count maximum was found to differ significantly between the 4 classes I/II, II, II/III and III (126.3 vs. 125.5 vs. 112.8 vs. 112.9, P = .04). It is reasonable that step count maximum, which better captures a patient’s peak exercise during the day, might as a result better capture the “limitation of physical activity” experienced by a patient and thus differentiate more consistently between NYHA classes (compared to a simple mean or sum of a patient’s activity over said day). Visual inspection of the overall step count density (see Figure 3-2) corroborates this suspicion.

Figure 3-5. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count maximums for each individual NYHA class

We however suggest another alternative. As can clearly be seen in Figure 3-1 (which shows a histogram of the step count data for each NYHA class), zero per minute step counts made up an overwhelming portion of the data. Specifically, they accounted for a mean of 87.3% (standard deviation 4.9%) of the two week data stream for each patient, and for as much as 97.6% of the two week data stream for one patient - the full breakdown can be seen in Figure 3-7. Unfortunately, the meaning of these 0 per minute step count values is ambiguous, since the trackers used in this study record a 0 value not only during patient inactivity but also when the patient was simply not wearing the device. As a result, it is challenging to accurately determine if a given series of zeroes indicates a pattern of low physical activity - presumably explanatory of NYHA class - or simply a pattern of non-device use - essentially introducing noise into the physical activity signal.

Figure 3-6. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count maximums for each individual NYHA class

A visual inspection of Figure 3-2 and Figure 3-3, both of which show different perspectives of the non-zero per minute step count data distribution, seems to strongly suggest that there is a difference in the activity patterns of patients, for example, a longer, fatter tail for class I/II and II patients. Quantitatively, however, we failed to extract many insights into the shape of the activity distribution. Notably, the 1st, 2nd, and 3rd quartiles (and thus the interquartile range) were all found to be fairly consistently 0 for all patients. In other words, 0's typically accounted for more than 75% of the data points for any given patient-day. In fact, when looking at the two week period as a whole, they accounted for at least 76.7% of all the data points for any given patient (the complete breakdown is shown in Figure 3-7).

The maximum daily per minute step counts, on the other hand, are naturally the least susceptible to the ambiguous 0 per minute step count values. We suggest that this may have contributed to their being the most consistent at differentiating between patients in different NYHA classes. Ultimately though, we believe that the disambiguation of inactive vs. disengaged time in pedometer-like trackers, and the subsequent effect on the aforementioned step data distribution, are worth investigating further to better understand the true nature of the relationship between free-living step count and NYHA functional classification.

Figure 3-7. Number of zero step count minutes as a percentage of individual patient two-week data streams (stem-and-leaf plot; the decimal point is at the |, i.e. 76 | 7 represents 76.7%):

    76 | 78
    78 | 9
    80 | 2728
    82 | 13678
    84 | 022605688
    86 | 03902226
    88 | 024846
    90 | 164668
    92 | 14056
    94 | 9
    96 | 027
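As a minimal sketch (not the study's actual code) of how this zero-minute breakdown can be computed, assuming the same illustrative long-format steps data frame as above, with one row per patient-minute:

    # Percentage of zero step count minutes in each patient's two-week stream.
    pct_zero <- tapply(steps$steps, steps$patient,
                       function(x) 100 * mean(x == 0))
    summary(pct_zero)  # in this study: minimum ~76.7%, mean ~87.3%, max ~97.6%
    stem(pct_zero)     # stem-and-leaf display, as in Figure 3-7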

3.4.2 Strengths and Limitations

A strength of this replication study is that it uses a separate dataset collected by a different researcher (S.B.), independently of (and prior to the analysis performed in) the referenced pilot study [13]. Except for one patient who participated in both studies, the dataset is also comprised of completely different patients. On the other hand, the data being sourced as a convenience sample at the same single site as the pilot study, i.e. consecutively recruited from the TGH Heart Function Clinic, represents a limitation of this study with regards to generalizing our findings. Our analysis was also limited in that it did not include any patients with NYHA class I or IV symptoms. While these are not typically as difficult to classify as NYHA class II or III patients, analysis of all 4 NYHA classes would have potentially provided additional useful insight into the true underlying relationship between step count and NYHA class. Knowing this relationship might be of tremendous value if it could allow us to invert the question posed in this study: to instead see if step count could be used to assess NYHA class, or gradation changes in NYHA class, for a patient. We suggest that this might be the subject of an important future study. The most significant limitation of our study, though, was the step tracker utilized, since it introduced significant ambiguity into the 0 per minute step count values which comprised most of each patient's step data stream. This limits our ability to precisely quantify the distribution of the activity/inactivity of patients, especially since it is as yet unclear how much significance patient inactivity should be accorded when it comes to capturing 'physical activity limitation' and by extension NYHA functional class.

Conclusion

NYHA II and NYHA III patients differ significantly by various aggregate measures of step count, including 1) mean daily total step count, but also importantly by 2) the mean and 3) the maximum of the daily per minute step count maximums. These findings validate our previous pilot study. However, the discovery of additional significant aggregate measures raises several questions, amongst them: what is the exact underlying relationship between NYHA class and step count? What features of the step count waveform are most associated or correlated with NYHA class? These questions will no doubt feature as the subjects of future studies, but the findings of this study are an important milestone on the road to an objective means of assessing HF functional classification, in our continuing quest to improve the outcomes of patients with the burdensome and costly disease that is congestive heart failure.

3.5.1 Acknowledgements

This project was supported by funds from: the Ted Rogers Centre for Heart Research, the Peter Munk Cardiac Centre, Healthcare Support through Information Technology Enhancements (hSITE), the Natural Sciences and Engineering Research Council (NSERC), the Canadian Institutes of Health Research (CIHR), the Government of Ontario, and the University of Toronto.

3.5.2 Ethics Approval

This study is covered by institutional and research ethics approval (REB #14-7595) received from the University Health Network REB.

3.5.3 Conflicts of Interest

None declared.


Activity Tracker Monitoring Implementation

Having confirmed the potential utility of remotely monitoring the physical activity of heart failure patients, we moved to update Medly, the remote patient monitoring system in use at the TGH HF clinic, as part of a Quality Improvement (QI) initiative, so that it could support the collection and display of the aforementioned data.

In this chapter we provide a brief overview of the Medly user interface before discussing the activity tracker monitoring implementation requirements. We then discuss the proposed designs, what was ultimately implemented, and the success of the implementation in terms of the patients onboarded and their adherence to the system.

4.1 Medly User Interface Overview

The concept behind the Medly remote monitoring system is relatively simple: patients download the Medly app on their smartphone (provided by the clinic if required), and use the app every morning to input their weight, blood pressure and pulse – either manually or using a 'smart' weight scale and blood pressure cuff which can wirelessly transmit the corresponding data to the smartphone app. Additionally, patients answer a series of questions about the symptoms they experienced the day before. Medly's computer algorithm then assesses the patient's state and alerts them about further actions they may need to take, such as: taking an additional dose of medication, calling their physician, or even going to the nearest emergency room (if the patient is assessed as being in a high-risk state). By shortening the cause-effect feedback cycle and leveraging 'teachable moments', the system helps improve patient self-care maintenance and management. Patients can also review past readings and observe their overall trends on a separate screen.

Examples of two of the primary screens of the patient user interface, the home and trends screens, are shown in Figure 4-1. In the example home screen, a patient has been alerted to 'contact the heart function clinic or [their] family doctor' due to their elevated heart rate (156 bpm) and reported symptoms (tired, short of breath and lightheaded), which are highlighted in orange. A patient can also take additional readings by pressing the green '+' circle near the bottom right corner of the screen, although new readings will not remove previous alerts. In the example trends screen, the patient appears to be maintaining a constant weight above the light blue target weight band (~160 lbs), with two unrecorded days (Nov 2nd and 3rd). Their blood pressure (BP), in contrast, appears to be fluctuating: initially trending downwards with the diastolic BP stabilizing, but the systolic BP recently trending upwards to exceed the gray target BP band.

Figure 4-1. Medly system patient smartphone user interface: a) home screen, b) trends screen [289]

All of the patients' readings are sent back to servers at the hospital (UHN) and are displayed on a web interface accessible to clinical staff, where they can review alerts and patient trend data. An example of the main screen of the clinical web interface, showing the weight data for a Mr./Mrs. Demo Patient, is shown in Figure 4-2. In this example, the patient had 1 of 3 readings during the period of July 12th to July 19th, 2018 fall outside of their target normal weight range (this time indicated by a gray coloured band on the graph). The user could also scroll down to see the patient's BP and pulse readings as well as a chart of their answers to the symptom questions.

4.2 Requirements

In keeping with engineering best practice, we performed some basic requirements gathering before proceeding to implement changes to the Medly system. Initial requirements gathering was performed by discussing the proposed system update with the developers, designers, researchers, project managers and telehealth personnel at the Centre for Global eHealth Innovation, who already had significant expertise in designing, developing, implementing and working with Medly. Their suggestions were supplemented with findings from previously published studies discussing insights on the design and implementation of previous versions of Medly [95,103–105,159].


Figure 4-2. Medly system clinical user web interface

The following requirements were identified with regards to fitness tracker selection:

1. The selected activity tracker must be readily available for purchase by patients (as established by the 'Best Buy Test': is the fitness tracker available at a local big box electronics store such as Best Buy?)
2. The fitness tracker must be compatible with Apple iOS v9.3.5 and above.
3. The fitness tracker must be compatible with the 2014 Samsung Galaxy Grand Prime (Android 5.1 Lollipop) and above.
4. The fitness tracker must be able to record minute by minute step data.
5. The fitness tracker must be able to record minute by minute heart rate data.
6. The data recorded from the fitness tracker must be able to be retrieved for storage and archival at UHN.
7. The fitness tracker must be able to operate continuously for a minimum of 2 days without requiring syncing or charging (to ensure recording continuity in the event that a patient forgets or is unable to sync or charge the device overnight).

The following additional user experience requirements were identified:

1. The system must provide a method to de-authenticate a fitness tracker or authenticate a new fitness tracker.
2. The system must allow for connection and authentication of a fitness tracker.
3. The system must provide a means by which activity tracker functionality can be enabled/disabled for a patient.
4. The system must provide feedback to clinicians that the fitness tracker is working.
5. The system must provide a means by which clinicians can view patient heart rate data.
6. The system must provide a means by which clinicians can view patient activity data.
7. The system must provide a means by which fitness tracker data can be accessed and downloaded, including:
   a. anonymized bulk data
   b. analytics data (e.g. usage, interaction patterns)
8. Clinical access must continue to be secured against access by non-authorized (non-clinical) staff.
9. Research data access must be secured against access by non-authorized (non-QI/research) staff.

The following were also identified as being important for providing an optimal user experience:

1. The system should provide feedback to clinicians that the fitness tracker is being worn by the patient.12
2. Data visualization should be done in such a manner that clinical staff are able to easily & simultaneously relate heart rate and contextual 'explainers' of heart rate (e.g. activity data, medications, etc.)
3. The system should provide feedback to the patient that the fitness tracker is connected.
4. The system should provide feedback to the patient that the fitness tracker is working and collecting data.

12 where technically feasible

4.3 Design & Implementation

After completing the initial requirements gathering, we moved to the design and implementation phase.

4.3.1 Activity Tracker Selection

To select an appropriate activity tracker, an initial search of modern consumer activity trackers was performed, revealing 33 potential candidates. These are briefly detailed in Table 12. Most of these activity trackers did not support continuous heart rate monitoring, had battery lives that did not meet the continuity requirement outlined in fitness tracker requirement 7 of Section 4.2, or were simply no longer available on the market (e.g. the Basis Peak, which was recalled by Intel Corporation for safety reasons [160], and the Jawbone devices, since Jawbone (the company) filed for bankruptcy in July of 2017 [161]). The short list of remaining activity trackers included the Fitbit Charge 2, Ionic and Versa; the Garmin Vivosmart 3; the Nokia/Withings Steel HR; the Wavelet Health Biostrap; and the Xiaomi Band 2 (all highlighted in Table 12). We quickly eliminated a) the Nokia/Withings Steel HR, since it was not yet released in the Canadian market at the time of the study, b) the Garmin devices in general, since access to the device data through their application programming interface (API) required a steep access fee of $5000, and c) the Xiaomi Band 2, since it did not appear to have a reliable manufacturer-supported method of accessing device data. Although the Xiaomi Band 2 was advertised as supporting data download using Google Fit, anecdotal evidence from user forums suggested that this approach was unreliable – and notwithstanding this possible unreliability, there was no way to access the data using iOS (fitness tracker requirement 2 of Section 4.2). This left us with the Fitbit devices and the Wavelet Health Biostrap. We eliminated the Wavelet Health device after encountering unresolvable issues while attempting to connect a trial device to our Android devices, although the device worked fine on iOS. Furthermore, in choosing between Fitbit devices and a relatively new and unproven contender on the relatively volatile activity tracker market (Wavelet Health), we determined that it was more prudent to opt for the market leader, Fitbit. Additionally, due to the popularity of Fitbit devices, investigating the accuracy and reliability of these devices is a more active area of research [41,46,48,65,67,68,84,162]. We opted to use the Fitbit Charge 2, the successor to the Fitbit Charge HR, since it was the lowest cost option of the three short-listed Fitbit devices.


Table 12: Candidate activity trackers

Company          Product       Step Count  Heart Rate  Battery Life13      Data Access                                                Price        Link
Apple            Watch         Yes         Yes         1 day               HealthKit [163]                                            360-590 CAD  [64]
Empatica         E4 Wristband  Yes         Yes         1 day               Unclear                                                    1700 USD     [164]
Fitbit           Alta HR       Yes         Yes         5 days              Fitbit API [165]                                           200 CAD      [166]
Fitbit           Alta          Yes         No          5 days              Fitbit API [165]                                           170 CAD      [167]
Fitbit           Charge 2      Yes         Yes         5 days              Fitbit API [165]                                           200 CAD      [58]
Fitbit           Flex 2        Yes         No          5 days              Fitbit API [165]                                           80 CAD       [168]
Fitbit           Ionic         Yes         Yes         5 days              Fitbit API [165]                                           400 CAD      [169]
Fitbit           Versa         Yes         Yes         4 days              Fitbit API [165]                                           250 CAD      [170]
Garmin           Fenix         Yes         Yes         1 day               Garmin API [171]                                           600 USD      [172]
Garmin           Vivosmart 3   Yes         Yes         < 5 days            Garmin API [171]                                           150 USD      [173]
Huawei           Watch 2       Yes         Yes         1 day               Google Fit [174]                                           350 USD      [175]
Intel            Basis Peak    recalled August 1, 2016 [160]
Jawbone          Various       company undergoing liquidation [161]
LG               Watch Sport   Yes         Yes         1 day               Google Fit [174]                                           350 USD      [176]
mc10             BioStampRC    Yes         Yes         1.5 days            Unclear                                                    500 USD      [177]
Misfit           Flare         Yes         No          4 months            Misfit API [178] or Google Fit [174]                       70 CAD       [179]
Misfit           Phase         Yes         No          6 months            Misfit API [178] or Google Fit [174]                       150 CAD      [180]
Misfit           Ray           Yes         No          4 months            Misfit API [178] or Google Fit [174]                       80 CAD       [181]
Misfit           Shine         Yes         No          6 months            Misfit API [178] or Google Fit [174]                       80 CAD       [182]
Misfit           Shine 2       Yes         No          6 months            Misfit API [178] or Google Fit [174]                       80 CAD       [183]
Misfit           Vapor         Yes         Yes         1 day               Misfit API [178] or Google Fit [174]                       200 CAD      [184]
Moov             HR            Yes         Yes         < 1 day             None                                                       60-100 CAD   [185]
Moov             Now           Yes         No          6 months            None                                                       60 CAD       [186]
Nokia/Withings   Go            Yes         No          > 8 months          Nokia Health API [187]                                     50 USD       [188]
Nokia/Withings   Steel         Yes         No          > 8 months          Nokia Health API [187]                                     130 USD      [189]
Nokia/Withings   Steel HR      Yes         Yes         25 days             Nokia Health API [187]a                                    180 USD      [190]
TomTom           Spark 3       Yes         NCb         < 1 day to 3 weeks  No new users [191]                                         290 CAD      [192]
TomTom           Touch         Yes         NCb         5 days              No new users [191]                                         130 CAD      [193]
Under Armour     UA Band       Yes         NCb         2.5 days            Unclear                                                    170-230 CAD  [194]
Wavelet Health   Biostrap      Yes         Yes         5 days              Wavelet API [195]                                          250 USD      [195]
Xiaomi           Band          Yes         No          30 days             Google Fit [174], via unofficial API [161], or via BLEc    15 USD       [196]
Xiaomi           Band 2        Yes         Yes         20 days             Google Fit [174] or via BLEc                               30 USD       [197]

13 Listed battery life is always approximate.
aheart rate data access unclear; bNC: non-continuous; cBLE: bluetooth low energy (N.B. device commands are obfuscated by manufacturer)


4.3.1.1 Proposed Data Access Design

Third party access to Fitbit data is mediated exclusively through the Fitbit web API [165]. It is possible to both write and read data through the API, but impossible to access data directly from the device, as illustrated in Figure 4-3. Access to intraday time series data (i.e. step count and heart rate data at a resolution of less than 1 day, e.g. at the minute level) is also restricted to either 'personal' applications or authorized entities. Authorization to access this data is granted on a case-by-case basis by Fitbit. After submitting an initial request on June 22nd, 2017, we received approval to access intraday data 2.5 months later, on September 5th, 2017. Access to individual patient data is mediated through the OAuth 2.0 authentication framework, which specifies a secure communications protocol by which Fitbit and third party servers can confidentially exchange security access tokens to maintain secured and encrypted transmission of data between the Fitbit servers and the client – in this case UHN – servers. The complete process for authentication (including initial authentication and maintenance of expired security tokens) and data retrieval is mapped out in a sequence diagram in Figure 4-4. Since the individual patient access tokens, which must be refreshed after each use, must be shared between several users (the patient, clinical staff and research admin/QI personnel), the system was designed such that the central Medly server would mediate requests for data, supplying the requested data from its internal database and negating the need to re-request data from the Fitbit servers for each user request. The Medly server then periodically updates this internal database with new data, archiving it according to hospital policy and local, provincial and federal requirements. Figure 4-5 illustrates this proposed design for patient users and Figure 4-6 illustrates the proposed design for clinical users. The sequence for research admin/QI personnel is essentially identical to that of clinical users.

Figure 4-3. Fitbit data flow diagram

Figure 4-4. Fitbit authentication process with a client app
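To make the token exchange concrete, the following is a hedged sketch of the OAuth 2.0 handshake in R using the httr package (the app name and credentials are placeholders; this is illustrative, not the production Medly code):

    library(httr)

    # Fitbit's public OAuth 2.0 endpoints.
    fitbit <- oauth_endpoint(
      authorize = "https://www.fitbit.com/oauth2/authorize",
      access    = "https://api.fitbit.com/oauth2/token"
    )

    # Placeholder application credentials registered with Fitbit.
    app <- oauth_app("example-medly-client",
                     key    = "YOUR_CLIENT_ID",
                     secret = "YOUR_CLIENT_SECRET")

    # Opens a browser so the user can log in on fitbit.com and grant access;
    # httr caches the resulting token and refreshes it when it expires.
    token <- oauth2.0_token(fitbit, app,
                            scope = c("activity", "heartrate"),
                            use_basic_auth = TRUE)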

4.3.1.2 Final Data Access Implementation

Figure 4-5. Medly Fitbit patient access sequence

The final implementation for data access was managed by the development team at the Centre for Global eHealth Innovation (a partner of UHN). As a result, the final implementation differed slightly from the proposed design due to time constraints and a lack of programming resources, themselves a result of concurrent updates, bug fixes and general QI updates to Medly that were deemed to be a higher priority. The final implementation therefore did not include an update to the client side patient smartphone application. The proposed design was instead reduced to a pared down Minimum Viable Product14 (MVP). In this pared down version, clinical admin staff (such as the onboarding coordinator) authenticated Fitbits on behalf of patients via the clinical client application. No functionality was provided for patients to authenticate Fitbits with Medly or to access data through the Medly application. Furthermore, the ability to authenticate new devices and access patient data was only available for patients using Medly on an Apple iPhone15.

Clinicians wishing to access data for patients using the standard Android device usually provided as part of the Medly patient kit were only able to access said patient data through the official Fitbit website. Patients, whether Android or iPhone users, were able to access their data either through the Fitbit website or through the Fitbit app that had to be installed on their smartphone. No provisions were made for data access by research/QI personnel - in fact, the Medly server was implemented to only receive daily step data summaries and not intraday data. The server also did not retrieve heart rate data.

Figure 4-6. Medly Fitbit clinician access sequence

14 a feature-sparse software platform that includes only the bare minimum functionality required to operate.
15 as of the time of publication, Medly now supports Fitbit authentication and data access for patients using either Apple iPhone or Android devices.

To access intraday heart rate and step data, the author created an open source script using the R programming language [151] (available with the rest of the software artifacts generated from this thesis as per Appendix C, or directly from [198]). This script connects to the Fitbit API, manages the security access tokens for the patients in the study (both Android and iPhone patients) and is able to download both the minute-by-minute step count and heart rate data for analysis. It is also registered as a separate third-party application with Fitbit to permit separate administration from the clinical system and to avoid technical issues with the script affecting the clinical system. This script was based on previous work by S. Bromberg [46,150], whose original script is available on GitHub [150].
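For illustration, an intraday request like those made by the script might look as follows (the endpoint shown follows Fitbit's public intraday time series API; token is an OAuth token like the one sketched in Section 4.3.1.1, and the date is arbitrary):

    # Minute-level step counts for a single day for the authorized user ("-").
    resp <- GET(
      "https://api.fitbit.com/1/user/-/activities/steps/date/2017-06-01/1d/1min.json",
      config(token = token)
    )
    stop_for_status(resp)

    parsed  <- content(resp, as = "parsed")
    minutes <- parsed[["activities-steps-intraday"]][["dataset"]]  # time/value pairs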


4.3.2 User Interface Design

The Medly user interface (UI) also required updates to support the addition of fitness tracker functionality.

4.3.2.1 Proposed User Interface Designs

Several designs were proposed, based on best practices from the fields of data visualization [199–201] and human factors & user experience design [202–205], as well as on insights from consultations with the Medly design team at Healthcare Human Factors (a partner of UHN) and the development team at the Centre for Global eHealth Innovation.

To provide an optimal user experience, patients should receive feedback that their device is operating as expected. In the case of the Fitbit activity tracker, this means not only that the device is charged and collecting data, but also that the device is syncing data to the patient's smartphone, and ultimately to UHN. Displaying the patient's Fitbit data in the Medly app on the patient's smartphone would provide this feedback, since it requires an unbroken chain of communication between the Fitbit, the Fitbit app, the Fitbit servers, the UHN servers and the Medly app, as shown in Figure 4-3. We proposed 4 designs each for the home and trends screens, consistent with the UI design language already established by Medly. The 4 proposed home screen designs are illustrated in Figure 4-7; the designs for displaying trends data are shown in Figure 4-8. Since the fitness tracker step count and heart rate data are generated at every moment, instead of being collected usually only once a day in the morning, the proposed designs, although adhering loosely to the established design language of Medly, intentionally treat fitness tracker data in a visually distinct manner so as to help users identify the less static nature of the fitness tracker data (compare Figure 4-1a and Figure 4-7). Similarly, the proposed trends screens are slightly modified to better adapt to the nature of the fitness tracker data. For example, daily or weekly heart rate summaries not only report mean heart rate, but also the lower and upper range of heart rate during those periods.

Along with the aforementioned changes to the trends and home screens, we designed a UI flow for changes to the Medly smartphone app to allow patients to link a Fitbit account to their Medly account; this UI flow is illustrated in Figure 4-9. However, as mentioned in Section 4.3.1.2, this flow was ultimately not implemented. Instead, Fitbit account linking was redesigned to be done through the clinician web interface. The final authentication flow is discussed in Section 4.3.2.2.


Figure 4-7. Proposed designs for patient user interface (home screen): a) combined heart rate and steps data on one card, b) combined heart rate and steps with pictorial representations, c) separated heart rate and step data, d) only pictorial representation with mini graph


Figure 4-8. Proposed designs for patient user interface (trends): a) simple sparklines, b) data with bands to indicate min (resting), mean and max values for each time period, c) whisker plot to indicate daily range, d) heart rate (maximum and resting) and average step count values broken out for each time period, and e) Tufte style medical data visualization as per f), which is reproduced from [201]


Figure 4-9. Proposed design for authorization of new Fitbit by patient via Medly smartphone application.


With respect to the clinician web interface, changes were much more limited and mostly centered on adding new graph components to display the new fitness tracker data, which differs from the rest of the data collected by Medly since it is available at up to minute-level resolution. The proposed web interface graph designs are shown in Figure 4-10 (which can be contrasted with the existing graph design in Figure 4-2).

The design of the clinical user interface was approached in a similar fashion to the patient smartphone trends screen. Although the web interface has more available screen real estate than the smartphone screen, the performance of the web interface was known to drop drastically when made to process large numbers of data points for display on graphs. As such, the design of the clinical user interface presented a challenge similar to the smartphone trends screen: the need to collapse voluminous, high resolution minute-by-minute data into more concise daily or weekly summaries; this explains the successive data simplification that occurs while transitioning from Figure 4-10b to Figure 4-10d. The design shown in Figure 4-10b, for example, is inspired by the UI of an intensive care monitoring system designed for use in the data rich environment of the pediatric critical care units at SickKids: The Hospital for Sick Children in Toronto and Boston Children's Hospital in Boston [206–208]. Consequently, it is the most ideal of the proposed designs from a data fidelity point of view, since it cuts out minimal data and allows a user to more easily visualize concurrent trends in multiple data streams. However, due to the technical limitations of the Medly web interface, it is also the least feasible to implement. Figure 4-10c and Figure 4-10d were later design iterations attempting to reduce the number of visual elements that the interface would need to process and draw while still maintaining as much information content as possible. Figure 4-10e returns to the same simple graph style of Figure 4-10a and Figure 4-2, but with range bands and a UI element for displaying something useful derived from the step count data, such as the predicted NYHA class (compared to the last assessed NYHA class). This UI element also provides the option for clinical staff to provide feedback as to whether they agree with the prediction or not, by pressing on the 'x' or check mark and correcting the prediction (this latter pop-up is not shown). This functionality would be useful for collecting feedback (and training examples) from the user to assess the accuracy of (and dynamically teach) an NYHA functional classification suggestion algorithm once it is implemented into Medly. Lastly, we proposed simple alerts for both step count and heart rate, consistent with those implemented for weight, blood pressure and pulse: namely a lower limit alert for step count and upper and lower limit alerts for heart rate. We also proposed adding adherence phone call functionality for the fitness tracker, similar to the already implemented system that triggers an automated reminder phone call when a patient does not submit their daily readings.
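As a minimal sketch of the proposed alert rules (the threshold values are placeholders for illustration, not clinically validated limits):

    # Lower limit alert for daily step count, as proposed.
    step_alert <- function(daily_steps, lower_limit = 1000) {
      daily_steps < lower_limit
    }

    # Upper and lower limit alerts for heart rate, as proposed.
    heart_rate_alert <- function(bpm, lower_limit = 40, upper_limit = 120) {
      bpm < lower_limit | bpm > upper_limit
    }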


Figure 4-10. Proposed designs for clinical user interface (activity and heart rate graphs): a) simple graph design with indicator lines for alert levels and mean, b) design inspired by the Sick Kids T3 (tracking, trajectory and trigger) tool [206–208], c) mix of T3 tool with Medly range bands, d) whisker plot style, and e) simple graph with range bands and NYHA class prediction display (bottom of the more info page for step count graph)


Figure 4-11. Final web interface Fitbit authorization flow


Figure 4-12. Final web interface activity tracker profile & deauthorization flow

Figure 4-13. Final web interface activity tracker data display


4.3.2.2 Final User Interface Design

As with the back-end components required to download and access the fitness tracker data, the actual programming of the UI components required for the activity tracker update to Medly was managed by the development team at the Centre for Global eHealth Innovation (a partner of UHN). Again, due to time and resource constraints caused by higher priority fixes and updates, the final UI implementation was reduced to a proof-of-concept. Due to a lack of available iOS and Android programmers, no updates were possible to the patient smartphone UI, so patients were instead instructed to use the Fitbit app on their phone to confirm that data was being collected and synced to the Fitbit servers. The task of confirming that the Fitbit data was being properly synced to the Medly servers was instead left to the author as part of the research work documented in this thesis. Going forward, this task is anticipated to be delegated to the clinical admin staff, to be performed manually using elements that were added to the clinician web interface. The inability to update the smartphone UI also necessitated the creation of a new UI design for the task of linking patient Fitbits (whether provided by the clinic, or patients' personal Fitbits) to the Medly servers through the clinician web interface. The final version of this UI flow is shown in Figure 4-11.

As required by the Fitbit application programming interface (API) for web applications, as part of the authorization process the user is redirected to the official Fitbit website (Figure 4-11 step 3) so they can confirm that they are connecting to the genuine Fitbit.com site [209]. Once logged into the Fitbit website, the user can then select what data to share (Figure 4-11 step 4).

When linking activity trackers, we instructed users to select 'Allow All' to allow all data to be shared (refer to Figure 4-11 step 4). Normally this violates an old principle of computer security, the principle of least privilege (or least authority), which dictates that user access rights be a) limited to the bare minimum required to perform the desired task and b) provided only for the duration required for said task. However, we recognized that Medly would likely receive updates in the near future to enable more complete use of Fitbit functionality, and that if these future updates used data outside of the already required 'heart rate' and 'activity and exercise' data, it would necessitate manually unlinking and then relinking all of the Fitbit accounts to select the additional permissions, likely at significant time cost. Furthermore, clicking the single 'Allow All' button was a simpler task for users to perform compared to having users select the separate individual 'heart rate', 'activity and exercise' and 'Fitbit devices and settings' radio buttons. A less complicated task is predicted to reduce the likelihood of error when linking a Fitbit account. Lastly, even in the case of a real security concern such as a data breach, the tokens exchanged through the authorization process, which provide the Fitbit data access rights in the first place, can be remotely revoked through the Fitbit website, both on an individual basis and en masse. This reduced the actual security risk to what we deemed to be an acceptable level.

We were actually able to confirm this revocation of data access to linked Fitbit accounts as a result of a simulated security breach inadvertently caused during data collection. The incident occurred on May 31st, 2017 while authenticating patients using the custom R script written to download the minute-by-minute heart rate and step count data and manage the associated access tokens.

The script accepts a list of user accounts and loops through a pared down version of the authentication flow shown in Figure 4-11 (i.e. just steps 3 and 4) for each account, one immediately after another. This makes it possible to quickly add and retrieve access tokens for multiple patients in bulk, reducing the workload for research/QI work. Fitbit's automated security system interpreted the rapid automated linking of multiple Fitbit accounts as suspicious and potentially indicative of malicious activity. As a result, Fitbit's security system subsequently banned the internet address of the machine running the script and flagged the 34 recently linked accounts as potentially compromised, forcing password resets and invalidating the access tokens for each of these accounts (both for the script and the clinical system).

It took approximately 3 weeks to: 1) confirm with Fitbit that we were the actual cause of the suspected 'data breach' (as opposed to an actual malicious third party), 2) reset patient passwords, 3) relink accounts on the clinical system, 4) contact patients to ensure that they had successfully logged back into the Fitbit app on their phone, and 5) slowly relink accounts to the research system (which we did at a rate no higher than 1 per 30 seconds, and in batches no larger than 25 with a pause of at least 45 minutes between batches). As we experienced delays in reaching patients to inform them that they needed to log back into their Fitbit account (at least half were initially unreachable on the first day and had to be left a voicemail message or equivalent), some of the patients may have suffered about 1-2 weeks of data loss.
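The cautious re-linking schedule described above can be expressed as a simple pacing loop; the following is an illustrative sketch in which link_account stands in for the script's per-account authorization step (not the actual code used):

    # Re-link accounts at <= 1 per 30 s, in batches of <= 25,
    # with >= 45 minutes between batches.
    relink_all <- function(accounts, link_account) {
      batches <- split(accounts, ceiling(seq_along(accounts) / 25))
      for (b in seq_along(batches)) {
        for (acct in batches[[b]]) {
          link_account(acct)
          Sys.sleep(30)                                # one link per 30 seconds at most
        }
        if (b < length(batches)) Sys.sleep(45 * 60)    # 45 minute pause between batches
      }
    }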

The potential data loss would have been caused by the limited internal memory of the Fitbit; since the Fitbit only has sufficient internal memory to record 1 full week of minute-by-minute data, it must be synced at least once a week to the Fitbit servers, usually via the Fitbit app, to make room for new data. Due to the accounts being flagged as compromised, patients needed to log back into their account using their new password to re-enable syncing between their Fitbit and the Fitbit servers. Since Fitbit only provides the last device sync date (as opposed to a complete sync history), and this was not actively monitored during this period, we were unable to confirm the actual extent of data loss for patients. We were also unable to ascertain the extent of data loss simply by examining the data, since it is difficult to determine if a potential lack of data during this period was due to the incident or simply due to patient disengagement, in particular since those patients most likely to have not noticed that they had been logged out of the Fitbit app are almost by definition those least engaged with the system.

Aside from the potential loss of data, the incident had no other reported impact on the system. The loss of data also had minimal impact on the QI/research objectives of this study since most patients impacted by the incident had already been using the monitoring system for several weeks (and even months), and data collection for all patients would still continue for several weeks post incident (to attain a minimum 3 week recording period for each patient).

Returning to the UI: once users proceed through the authentication flow in Figure 4-11 - thus enabling syncing of their Fitbit account to Medly - they are returned to the patient profile page, which now displays status information about the connected Fitbit account and the option to unlink the account if desired (see Figure 4-12). This profile page displays information about the last time the Medly server was synced with the Fitbit server – 'Last Server Sync' – as well as the last time a Fitbit device was synced to the Fitbit account16 – 'Last Device Sync' – the latter of which can never be more recent than the 'Last Server Sync'. These two values were added to help users determine if a lack of displayed step count data is caused by: a communication problem between the Fitbit server and Medly server (the 'Last Server Sync' value is not up to date and does not update even when the user presses the 'Force Sync' button); the Fitbit device not yet having been synced (the 'Last Device Sync' value is not up to date although the 'Last Server Sync' value is up to date); or the patient simply not having used the Fitbit or performed any physical activity (both the 'Last Device Sync' and 'Last Server Sync' values are up to date but no step data shows up on the web interface).

16 This process occurs automatically every time the user opens the Fitbit app on their smartphone.

As for displaying the Fitbit data: heart rate data was deemed to be non-essential for inclusion as part of the activity tracker MVP in particular since it would further cause confusion with the existing displayed daily recorded pulse data (recorded using a blood pressure cuff). As a result no graphical display was implemented to display the Fitbit acquired heart rate data. The step data graph on the other hand was redesigned after the existing graph design (Figure 4-2) showing total daily steps for each day in the view windows (see Figure 4-13). In the ‘More Info’ page to the immediate left of the graph, the whole time period being viewed was summarized by providing the lowest, average, and highest daily step count and

16 This process occurs automatically every time the user opens the Fitbit app on their smartphone.

total readings during the period in question. It is worth noting that this final step data graph design also only represents a minimum technically viable product, as it does not fully honor the best practices and principles outlined in the Fitbit API terms of service, the most relevant being the following:

“Offer Users a clear path back to their Fitbit Account.

• Always provide clear documentation and links for Users to access their Fitbit Account from your Application.

• Paths to Fitbit User accounts should be available wherever User Data is displayed.

• Paths to Users’ Fitbit accounts should be available in "Setting," "Account," or a similar location from within your Application.

• When displaying Fitbit Data in your Application, Fitbit must be noted as the source of Fitbit Data using the text link and/or logo icon made available to you through the Fitbit Developer Portal.” [210]

As is, the step data graph adheres to none of these provisions.

Despite all of the aforementioned limitations, we were able to onboard 46 patients onto the upgraded system over a 5 month period (from January 9th to June 13th). These patients were subject to the same inclusion and 'exclusion' criteria used for the general Medly system. The inclusion criteria are detailed in Table 13. While there are no explicit exclusion criteria for Medly, we note that since the system (and by extension this update) is used as part of the prevailing standard of care at the Heart Function clinic, the decision to prescribe or exclude a patient from the Medly program is ultimately up to the professional judgement of the attending cardiologist. As of the time of writing, a total of 7 attending cardiologists use Medly as part of patient care, although one of the cardiologists (the medical director of the clinic) is disproportionately responsible for the majority of the patients monitored. During this period, 2 (4%) of the 46 patients later changed their minds about being monitored via Fitbit and subsequently chose to return their devices and be removed from QI initiatives related to Fitbit monitoring. On the other end of the spectrum, 3 (7%) of the 44 patients who remained in the study chose to supply and use their own Fitbit device and Fitbit account instead of being provided one by the clinic (these patients were unsurprisingly all very adherent with their Fitbits).


Table 13: Medly inclusion criteria

- a consenting adult (18+ years of age),
- diagnosed with heart failure,
- followed by a licensed cardiologist at the UHN Heart Function Clinic (who in turn bears the primary responsibility for the management and care of that patient's heart failure diagnosis),
- sufficiently capable of speaking and reading English, or having an informal caregiver (spouse, parent, etc.) capable of the same, so as to both:
  o undergo the process of and provide informed consent for participation in the Medly program
  o understand and follow the text prompts provided by the Medly patient-side application
- capable of complying with the use of Medly (e.g. capable of truthfully answering symptom questions, capable of safely and correctly using the peripherals such as the weight scale, activity tracker and blood pressure cuff)

Table 14: Medly exclusion criteria

- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study costs – i.e. roaming charges)

Of the 44 patients who remained on the monitoring system, 12 (27.3%) used and provided their own Apple iPhone devices, and 32 (72.7%) used Android devices provided by the clinic. Based on the number of mobile wireless subscribers in Ontario (88.1% in 2015 [211]), the iPhone market share in Canada (51.37% in October 2017 [212]), and the proportion of devices using an iOS version supported by Medly (version 9.4 or above; 96.75% in October 2017 [213]), the expected proportion of iPhone to Android users was closer to 43.8% (19:25). These expected values and the actual proportions of onboarded patients by device are tabulated for easier reading in Table 15. By proportion, the number of iPhone users onboarded was slightly less than expected. We anticipated that the relative proportion of Android users would be higher since we recruited Android users not just from the pool of new patients onboarded onto Medly during the 5 month period but also from patients who had already been onboarded onto Medly and happened to be returning to the clinic for follow-up during this period. No iPhone users had previously been onboarded onto Medly, therefore all of the 7 returning patients (16%) upgraded with Fitbits were Android users. Removing these patients, 32.4% of new patients used iPhones and 67.6% used an Android device, which is closer to the distribution expected based on market share calculations. In either case, the relative proportion of iPhone to Android users was not found to be statistically different from the expected proportion at the 5% level of significance given the sample size (P=0.18 and P=0.47 respectively for the cases discussed above; assessed using a chi-squared test with R [151]).

Table 15: iPhone vs. Android patients on Medly system using Fitbit a) all patients onboarded, b) only new Medly patients onboarded during thesis

a)
                 All Onboarded    Expected (by Market Share)    P-value
 iPhone Users    12 (27.3%)       19 (43.8%)                    .18
 Android Users   32 (72.7%)       25 (56.2%)

b)
                 New Patients Only    Expected (by Market Share)    P-value
 iPhone Users    12 (32.4%)           16 (43.8%)                    .47
 Android Users   25 (67.6%)           21 (56.2%)
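For concreteness, a minimal sketch of one construction that reproduces the reported P-values: arranging the observed counts against the market-share-derived expected counts as a 2x2 table and applying R's chisq.test (Yates continuity correction on by default):

    # Minimal sketch: observed iPhone/Android counts vs. the counts expected
    # from market share (Table 15a), arranged as a 2x2 contingency table.
    counts_a <- matrix(c(12, 32,   # observed: iPhone, Android
                         19, 25),  # expected by market share
                       ncol = 2,
                       dimnames = list(c("iPhone", "Android"),
                                       c("Onboarded", "Expected")))
    chisq.test(counts_a)  # P ~ .18; swapping in 12/25 vs. 16/21 gives P ~ .47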

Patient adherence was also recorded at two points during the study: at 3 months into the study (April 9th, 2018) and at the end of the data recording period (August 1st, 2018; 7 months). At both of these junctures, patients were found to be overall moderately adherent with using the Fitbit – e.g. at the 3 and 7 month timepoints, 50% of patients had used the Fitbit (recorded steps or heart rate) on at least half of the days they were on the system. Only around 1/3 to 1/4 of patients (at 3 and 7 months respectively) had excellent levels of adherence (averaging at least 9 of 10 days using the system). A more complete breakdown of adherence is available in Table 16, with the stem and leaf plots in Figure 4-14 illustrating the comparative distribution of the percentage of days patients had used the system (relative to the total number of days they were on the upgraded system) at the 3 and 7 month timepoints. A paired Wilcoxon signed rank test (used since the data is non-normal, as can clearly be discerned from Figure 4-14) revealed that there was no statistically significant difference between the adherence at 3 and 7 months (P = 0.625).

Figure 4-14: Distribution of patient Fitbit adherence (as percent of days using the system); back-to-back stem and leaf plots at 3 and 7 months (the decimal point is 1 digit to the right of the |, e.g. 9 | 1 represents 91%)
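A minimal sketch of that comparison, assuming per-patient adherence vectors (percent of days used) adherence_3m and adherence_7m restricted to the patients present at both timepoints:

    # Minimal sketch: paired Wilcoxon signed-rank test of per-patient adherence
    # at 3 vs. 7 months (non-parametric, as the data is visibly non-normal).
    wilcox.test(adherence_3m, adherence_7m, paired = TRUE)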

Compared to the adherence levels recorded during the original Medly RCT, where “about 42, 33, and 16 out of the 50 telemonitoring group patients (84%, 66%, and 32%) completed at least 91 (50%), 146 (80%), and 173 (95%) of possible daily readings over the six months respectively (prior to the adherence phone call deadline at 10am)” [103], patients using activity trackers in this study were found to be significantly

less adherent (at the 5% level of significance) at both the 50% and 80% adherence thresholds (but not the 95% threshold); detailed results are tabulated in Table 17.

Table 16: Patient adherence on Fitbit

                                          # of Patients
                                      3 Months                 7 Months
 Adherence     Definition             suma        deltab       suma         deltab
 Near Perfect  > 95% of days used     7 (26.9%)   -            7 (15.9%)    -
 Excellent     > 90% of days used     9 (34.6%)   2 (7.7%)     12 (27.3%)   5 (11.4%)
 Consistent    > 68% of days used17   13 (50.0%)  4 (15.4%)    18 (40.1%)   6 (14.6%)
 50-50         > 1/2 of days used     13 (50.0%)  0 (0%)       22 (50%)     4 (9.1%)
 Sporadic      > 1/7 of days used     21 (80.8%)  7 (30.8%)    33 (75%)     11 (25%)
 Onboarded     all patients           26 (100%)   5 (19.2%)    44 (100%)    11 (25%)

a i.e. # (%) of patients meeting or exceeding the specified level of adherence
b i.e. difference between the # (%) of patients at the specified level of adherence and the next highest adherence level

Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT

                      Medly RCT [103]   Fitbit @ 3 Months         Fitbit @ 7 Months
 Adherence Level      # of patients     # of patients   P-value   # of patients   P-value
 > 95% of days used   16 (32%)          7 (26.9%)       .85       7 (15.9%)       .12
 > 80% of days used   33 (66%)          10 (38.5%)      .04*      17 (38.6%)      .014*
 > 50% of days used   42 (84%)          13 (50.0%)      .004**    22 (50.0%)      <.001
 Total                50                26              -         44              -
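As an illustration (a minimal sketch; the exact test construction used for Table 17 may differ), the 3 month comparison at the 50% threshold comes down to a two-sample test of proportions:

    # Minimal sketch: share of patients above the 50% adherence threshold,
    # this study at 3 months (13 of 26) vs. the original Medly RCT (42 of 50).
    prop.test(x = c(13, 42), n = c(26, 50))  # P ~ .004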

A recent study by Hermsen et al. [214], which examined sustained use of a provided Fitbit activity tracker in 711 patients, found that, 232 days into their study, of those who were non-adherent at that stage (187 patients): 56.7% had stopped adhering due to technical problems or difficulties18; 12.8% had lost the device; 12.8% forgot to wear the device; 9.7% felt they had no use or motivation to use the particular device given to them (including because they used a different device); 3.7% stopped due to health issues; and 5.4% did not want to use the device for various other reasons (excluding health issues).

From this study we can infer that people, broadly speaking, are non-adherent to technology for one of three reasons:

17 68% of days equates to roughly 20-21 days out of the month (i.e. every weekday)
18 in our study we had 2 devices (both replaced) reported as non-functional (one that over-reported steps and one that simply didn't work).


1) they are (humanly) unable to use the technology, namely because the technology is non-functional, whether due to technical or human factors problems;

2) they want to use the technology but forget to do so; or

3) they don’t want to use the technology, for example because they have concerns about detrimental effects of the technology on their wellbeing, or generally don’t recognize any benefits to using the technology.

For patients who are unable to use the technology, in particular due to human factors problems, the pared down UI designs ultimately implemented do little to make the Fitbit more usable from a patient perspective. However, they also do little to make things worse. Since no UI updates were made to the Medly patient app to support the fitness tracker, a patient's interactions with the Fitbit are limited to interactions with the device itself and the proprietary Fitbit app (and optionally the Fitbit website). As a result, difficulties interacting with the technology are in a way more representative of Fitbit as a technology than of our RPM system. Our findings therefore form a baseline for patient adherence on a Fitbit RPM system, since the components implemented into our system represent the bare minimum required to make a Fitbit enabled RPM system function. Furthermore, the fact that the Fitbit user experience design is largely outside the control of third-party researchers and programmers makes it harder to make real improvements to this part of the user experience, perhaps aside from providing better user education (generally considered by human factors experts to be the least effective means of effecting meaningful change [215,216]).

In the other case of patients who simply forget to wear the tracker, a solution already exists: adherence phone calls. These were, coincidentally, used with great effectiveness during the Medly RCT, although they were not added as part of the Medly Fitbit MVP.

As for patients who did not want to use our technology: we suspect that these were a less likely contributor to non-adherence in our particular study since the patients onboarded onto this system all willingly consented to participate. That being said, we fully expect this willingness to decrease as time goes on. In the same Hermsen et al. study (which examined the sustained use of a provided Fitbit activity tracker), the authors found a "slow exponential decay in Fitbit use, with 73.9% (526/711) of participants still tracking after 100 days and 16.0% (114/711) … after 320 days." [214]. Although, as previously mentioned, we found no significant difference between adherence at 3 and 7 months, our study was not powered ahead of time to address this question.


We suspect that the easiest and most cost-effective solution to most if not all of the aforementioned problems is adding the fitness tracker to the adherence phone call system already implemented as part of Medly. Adherence phone calls would not only help to address the problem of patients simply forgetting to wear the activity tracker (which might otherwise necessitate an update to the Medly UI), but they would also provide increased opportunity to address technical or usability issues experienced by patients, by giving patients an additional compelling reason to get these issues addressed by contacting Medly support staff (i.e. avoiding nuisance phone calls). If the Medly UI were to be updated, adding some sort of alert or reminder when a patient was taking their morning readings would be even better, since it would prevent more unintentional data loss. An ideal system would also notify this same Medly support staff of patients who are consistently experiencing difficulties with the activity tracker, to properly close the feedback loop between patients and the clinic and ensure that patient difficulties are being properly addressed. While adherence phone calls would help catch technical or usability issues earlier, they might also help patients see the benefit of this system, in that they would be held accountable to this element of their self-care and management. From a research perspective, having already established the baseline adherence of the Fitbit system, we could even quantify the actual impact of adherence phone calls by re-running this analysis after this feature is implemented.

As for the usage of the updated system by clinical staff: we unfortunately have no quantitative data to perform an analysis similar to the one done for patient users, as the upgraded iteration of the Medly system did not record data that would permit the assessment of clinician usage of the newly available Fitbit data.

The analysis in this chapter was performed using R [151] and supporting packages [217–219].

Summary

In summary, we updated Medly, the remote patient monitoring system in use at the TGH HF clinic, to support the collection and partial display of Fitbit activity tracker data. Although the system supports all Fitbits, we specifically selected to provide patients at the clinic with the Fitbit Charge 2, which was the most inexpensive tracker that met our requirements: namely, that it was readily available for purchase, supported the hardware (smartphones) being used as part of the Medly program, could last at least a few (2) days without syncing or charging (to help avoid data loss), and provided a means for downloading and accessing continuous minute-by-minute step count and heart rate data from the device – even if indirectly. Data access was performed through the Fitbit API, with a separate connection for the clinical system (which allowed clinicians to monitor patient activity through Medly's custom web interface) and for the research system (a custom R script which allows research/QI staff to manage access tokens and download patient activity data in bulk for offline analysis – see Appendix C or [198]).
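As an illustration of the research-side data access (a hypothetical sketch, not the actual Appendix C script), a single day of intraday step data can be requested from the Fitbit Web API given a valid OAuth 2.0 access token:

    library(httr)
    library(jsonlite)

    # Hypothetical sketch: fetch one day of minute-by-minute step data for the
    # patient associated with the supplied access token.
    get_intraday_steps <- function(access_token, date) {
      url <- sprintf(
        "https://api.fitbit.com/1/user/-/activities/steps/date/%s/1d/1min.json",
        date)
      resp <- GET(url, add_headers(Authorization = paste("Bearer", access_token)))
      stop_for_status(resp)
      fromJSON(content(resp, as = "text"))$`activities-steps-intraday`$dataset
    }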

Updating Medly to support Fitbit activity tracker data also required an update to the UI of the system to allow users to 1) link a Fitbit account to the corresponding Medly patient account and 2) monitor patient activity through the Medly system. In view of this, several UI designs were proposed to the professional development team whose task it was to program the final design into the existing Medly system. However, due to time and resource constraints caused by other concurrent higher priority updates and bug fixes to Medly, all of the initially proposed designs were eschewed in favor of producing a pared down minimum viable product which demonstrated the technical viability of the solution. As a result, no changes were made to support the Fitbit activity tracker on the patient smartphone applications. Patients were instead instructed to use the Fitbit app alone to access their Fitbit data. As for linking patients' Fitbit accounts to their Medly accounts, the authentication flow was adapted so it could be performed by clinical staff through their clinical web interface. The display of Fitbit activity tracker data on said web interface was limited to daily step data only, since heart rate data was deemed non-essential. The updated system also only supported patients using Apple iPhones – clinicians wanting to monitor patients who were using the standard Android phones provided as part of the Medly system instead had to go through the Fitbit website directly (although as of the time of publication the Medly system now fully supports patients using both iPhone and Android).

Despite these limitations, we were able to monitor 44 patients over a 5 month period (from January 9th to June 13th), with an additional 2 patients who were onboarded but later changed their minds. 3 of the 44 patients brought and used their own Fitbit. 12 (27.3%) of the patients used iPhones (and could be monitored using the updated Medly web interface), whereas 32 (72.7%) of the patients used Android devices (which were not supported by the updated Medly web interface). Overall, patients were found to be only moderately adherent with using the Fitbit. At the 3 and 7 month time points, 50% of patients had used the Fitbit (recorded steps or heart rate) on at least half of the days they were on the system. Only around 1/3 to 1/4 of patients (respectively at the 3 month and 7 month timepoints) had excellent levels of adherence (averaging at least 9 of 10 days using the system). We proposed that adding adherence phone calls or reminder notifications would help improve patient adherence to the system, or at least help staff catch and address patient issues in a timely manner.


Chapter 5 – Assessment of NYHA Functional Classification using Hidden Markov Models

Having completed the essential groundwork of building a system to collect relevant input data, we set out to assess the NYHA functional classification of patients in an example dataset using 6 different machine learning (ML) algorithms, specifically: Hidden Markov Models (HMM); Generalized Linear Models (GLM); a variant thereof: boosted GLMs; Random Forests (RF); Artificial Neural Networks (NNet); and a variant thereof: Principal Component Analysis Neural Networks (PCA NNet). Since the approach used to create the HMM based classifier (HMMBC) differed slightly from the rest of the candidate models, we discuss the HMMBC separately as part of this chapter, while the remaining ML models are treated in Chapter 6.

First, we provide a brief refresher on HMMs - a more detailed introduction is provided in Appendix B – followed by our rationale for using HMMs in the first place. We then proceed to explain our methodology for training and testing a HMMBC. Finally, we discuss the results of our investigation and, since our HMMBC approach was ultimately unsuccessful, we touch on the problems encountered and provide recommendations for future attempts.

Hidden Markov Models

Any introduction to hidden Markov models must start with Markov models. Markov models are probabilistic state machines where the transitions between states occur randomly according to some pre-determined and pre-specified transition probabilities between each of the states [118,220–223]. Hidden Markov models (HMM) are simply Markov models where the underlying states cannot be directly observed [118,220,222,224,225]. Instead, the underlying states of the HMM are inferred from an associated set of possible observations that are linked to each state – in other words, from the possible outputs that can be produced when the system is in a particular state. These observed outputs could be speech phonemes, written characters of the alphabet, or genome sequences [118,226], or, in our case, step count or heart rate readings, amongst others.

5.1.1 Rationale for the use of HMMs

The rationale for using hidden Markov Models is that they can embrace the complexity and nuance of the entire time series data streams (and sequential data in general). In contrast, the remaining ML models

investigated in this thesis (in their standard form) must be provided with input predictors formulated as cross-sectional data (i.e. with the observations coming from a single point in time).

Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model

Of course, it is possible to format, or distill, time series data into cross-sectional data. For example, one could use the values at discrete time points in a time series as separate independent input features for a ML model. This is illustrated in Figure 5-1, where the value at time $t_n$ and the $m$ values preceding it, $t_{n-1}, t_{n-2}, t_{n-3}, \ldots, t_{n-m}$, are provided as separate inputs to the ML model. But, by decoupling the individual time points one loses an, if not the, essential characteristic of time series data (and of sequential data generally): the interrelationship between individual data points in the series. A ML model trained in this manner will therefore be robbed of very important information about the time series in question. To avoid completely throwing away this interrelationship information, one could instead compute various metrics or characteristics describing the entire time series, such as: the mean and variance of the signal, the total number or location of peaks, the signal auto-correlation, cross-correlation, frequency distribution, and so on, using these as input features. Ultimately though, any computation which takes an entire time series signal and boils it down to a single parameter before providing it to the ML model must be prematurely throwing away possibly relevant information. This is not to say that feature extraction is something to be avoided - in fact, it forms a core part of most machine learning pipelines and is also something we performed as part of training the cross-sectional models detailed in Chapter 6. That being said, we reasoned that a HMM, which has access to the full time series waveform, with all its complexities, nuances and interrelationships, would be a better initial candidate for attempting replication of the complex task that is assessing NYHA functional class.
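To make the cross-sectional alternative concrete, a minimal sketch of such feature extraction (the particular feature set here is illustrative only):

    # Minimal sketch: collapse one patient's per-minute step count series into
    # a handful of whole-series summary features for a cross-sectional model.
    extract_features <- function(steps_per_min) {
      c(mean_steps   = mean(steps_per_min),
        var_steps    = var(steps_per_min),
        n_peaks      = sum(diff(sign(diff(steps_per_min))) == -2),  # local maxima
        lag1_autocor = cor(head(steps_per_min, -1), tail(steps_per_min, -1)))
    }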

Methods

In the following section we briefly detail our methodology used for a) implementing and b) subsequently assessing the performance of our HMMBC.

The work done in this chapter was performed using the R programming language [151] in conjunction with RStudio [152], an integrated development environment for R, along with various other supporting R packages [153–158,217]. The R package depmixS4 was used specifically for the training of the HMM models [227,228].
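For orientation, the basic training call looks roughly as follows (a minimal sketch under assumed variable names, not the actual analysis script):

    library(depmixS4)

    # Minimal sketch: fit a 3-state Gaussian HMM to one class group's pooled
    # step count streams. train_df is an assumed data frame with a 'steps'
    # column (all patients' series stacked) and series_lengths gives each
    # patient's series length (depmixS4's 'ntimes' argument).
    mod <- depmix(steps ~ 1, data = train_df, nstates = 3,
                  family = gaussian(), ntimes = series_lengths)
    fit_mod <- fit(mod)  # EM training: the step that failed to converge on
                         # the raw per-minute data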

5.2.1 Training Data

Dataset

Although we originally intended to use the new data collected from the upgraded Medly system (with the additional activity tracker functionality), we opted to instead use data that was collected during a previous study (the same data used in Chapter 3). Analysis of the data collected, and continuing to be collected, from the upgraded Medly system is instead left to future work. The reasoning for this choice was three-fold.

First, the previous (Chapter 3) study data had a marginally larger sample size of 50 patients, vs. a nominal 44 patients from the new Medly data. Furthermore, since 5 of the 44 had almost no recorded activity, and an additional 6 had less than 1 week of recorded activity, the practical size of the Medly dataset is really closer to 33 patients. While neither of these datasets is large even when considered from a classical statistics perspective, machine learning is notorious for being particularly data intensive, and typically the noisier, the more complex, and the more variable the data, the larger the dataset required to achieve good classification performance. Given that we expect continuous daily step data to be simultaneously noisy, complex, and highly variable, we expect that the model may lean towards requiring more data rather than less. Aside from the complexity and nature of the machine learning algorithms we are investigating, the use of the somewhat larger 50 patient dataset is further justified since some fraction of the 50 samples would also need to be set aside for testing and validation of the models.


The second reason we chose to use the previous study data was that we had insufficient time to download the last bits of activity data, collect the additional non-activity portions of the data set (e.g. demographics, NYHA class and CPET data), and subsequently properly clean the data and re-run the analysis that follows on the new Medly Fitbit data set. The lack of time was mostly a result of pushing back the final deadline for the inclusion of newly onboarded patients into the study dataset, in order to scrape together as much data as possible for ML in the face of the relatively low onboarding rate (~1.5 patients/week including both new patients and upgraded returning patients) and the delays in implementing the required data collection infrastructure (as discussed in Chapter 4).

The third reason we opted to use the previous study data is that it included summary cardiopulmonary exercise testing data for all the patients in the dataset (a by-product of the inclusion criteria), whereas approximately half of the patients on the upgraded Medly system had not had a CPET performed and therefore had no such data available at the time of publication. Using the previous study data therefore had the benefit of allowing us to create models and perform some initial comparisons of the classification performance of models trained using only CPET data (recall, the gold standard test for assessing exercise capacity) as compared to models which use activity tracker data.

Our choice of dataset however did come with a significant drawback. As already mentioned, the previous study data used an activity tracker that did not collect heart rate data. As a result, the dataset only consisted of the following data:

1. Minute-by-minute step count data – recorded using a commercially available activity-tracker, a Fitbit Flex [59], continuously throughout the day.

2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients who received it prior to recruitment19).

3. Patient demographic/meta data – recorded as part of onboarding, and specifically including:
a. Sex [Male or Female],
b. Age [years],
c. Height [cm],
d. Weight [kg],
e. Handedness [left or right], and
f. Wristband Preference [left or right].

19 Specifically, 1, 15, 20 and 22 days prior to recruitment.

Population

In short, the data ultimately used in the development and validation of all the ML classifiers discussed in this work is the same data used to perform the replication study in Chapter 3. Recall that the data was originally sourced between September 2014 and June 2015 from a closed (prospective) cohort of adult outpatients at the Heart Function Clinic (a tertiary care clinic specializing in the management of heart failure) at Toronto General Hospital, part of the University Health Network (UHN) in Toronto, Canada. The inclusion and exclusion criteria are respectively detailed in Table 3 (page 37) and Table 4 (page 37). The dataset includes 50 patients whose demographics are fully detailed in Table 5 (page 38), Table 6 (page 38) and Table 7 (page 39), but in short, to reiterate, the patients are predominantly male (86 vs. 89 [%]), aged 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 [kg/m2]), with no significant difference in handedness or wristband preference (see Table 11).

Patients in the dataset were recorded for 2 weeks, during which time their HF, and by extension their NYHA class, is assumed to be stable (stability of HF being one of the criteria for inclusion into the study which originally generated this dataset; see Table 3).

Label Assignment

The “true” underlying NYHA class of a patient was assessed at onboarding by their physician as either NYHA functional class II (n=26) or III (n=11), according to the criteria outlined in Section 2.2.1.1, or as some intermediate/mixed class I/II (n=9) or II/III (n=4). Patients were assessed as an intermediate/mixed class when a physician was uncertain about the classification or felt that patients exhibited symptoms from different class levels. However, since class I/II and II/III are not formally recognized NYHA classes (nor are the sample sizes for the classes in question large enough for any sort of machine learning), it was necessary to group these intermediate/mixed classes together with the existing traditional NYHA classes for the purpose of developing our ML classifiers. We grouped the intermediate/mixed classes according to the most 'severe' NYHA class in the set20, i.e. I/II with NYHA class II, and II/III with NYHA class III.

20 recall our extended reasoning on page 39 for grouping according to the more severe class in the mix.


5.2.2 Model Design

Predictor(s)

In order to predict the class labels, the HMMBC was supplied with only one predictor: the step count data, since this was the only available time series data. Adding in either the demographic or available cardiopulmonary testing data would have required stratifying our patients into groups and training separate sub-classifiers for each group. Since our dataset was so small and relatively homogeneous, we reasoned that stratification was not likely to significantly improve performance, but would definitely have at least some detrimental impact by further reducing the already meager number of examples available to train any given classifier.

We did however use multiple variations of the step count data after encountering difficulties getting our classifier to converge to a valid model using the high-resolution minute-by-minute data. We re-attempted training our classifier using data at progressively lower temporal resolutions, from 2 to 6 hours. The algorithm was finally able to converge when we used a resolution of 6 hours21. The result is that we investigated five separate variant classifiers as part of this work, each variant supplied with step count data at a different time resolution (a binning sketch is provided after the list below), specifically at either:

a) a per minute level resolution [steps/minute], or;

b) a per 2-hour level resolution [steps/2 hours], or;

c) a per 3-hour level resolution [steps/3 hours], or;

d) a per 4-hour level resolution [steps/4 hours], or;

e) a per 6-hour level resolution [steps/6 hours]
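The binning referred to above amounts to the following (a minimal sketch; the function name is ours):

    # Minimal sketch: sum per-minute counts into non-overlapping bins,
    # e.g. bin_minutes = 360 for the 6-hour resolution of variant (e).
    bin_steps <- function(steps_per_min, bin_minutes = 360) {
      bins <- ceiling(seq_along(steps_per_min) / bin_minutes)
      as.numeric(tapply(steps_per_min, bins, sum))
    }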

Normalization

Additionally, before using the step count data for training, we also normalized the per-minute values to between 0 and 1 via linear scaling, from a minimum of 0 and using a maximum of 300 [steps/minute]. Normalizing predictors typically has beneficial effects on training speed, but is usually most important for ensuring each predictor is considered equally by the learning algorithm (as a result of being similarly weighted). In our case, since our HMMBC does not use multiple predictor inputs at the same time, we normalized the data for its secondary effect on learning speed and efficiency.

21 i.e. the per minute data summed into non-overlapping 6 hour intervals
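The scaling itself is trivial (a minimal sketch):

    # Minimal sketch: linearly map per-minute step counts from the assumed
    # [0, 300] steps/minute range onto [0, 1].
    normalize_steps <- function(steps_per_min, max_steps = 300) {
      pmin(pmax(steps_per_min, 0), max_steps) / max_steps
    }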

Architecture

Figure 5-2: Architecture for hidden Markov model based classifier

In order to actually construct a classifier using the aforementioned predictors, we used one HMM per classification label - 2 total: 1 each for NYHA functional class II and III22 - combined as per Figure 5-2.

Each HMM is trained with data from the subset of patients corresponding to the target NYHA class label, i.e. one HMM is trained using the 35 patients with NYHA class II and the second with data from the 15 patients with NYHA class III. Classification of new patients can then be performed by evaluating the likelihood that the given patient's predictor sequence (i.e. step count data stream) was generated by each of the corresponding HMMs in the set. Evaluating this likelihood, or similarity score, is done using an 'inference' algorithm, typically the 'forward' or 'backward' algorithm, whose functionality is included in most HMM programming libraries. The interested reader can read up on the finer details of these inference algorithms in any of the referenced works [118,220,222–224]. Regardless of the algorithm used, the NYHA class of the patient in question is deemed to correspond to the class of the HMM with the highest similarity score returned by the inference algorithm. In other words, the class of the model with the highest likelihood of having generated a sequence similar to the input predictor corresponds to the predicted class of the input patient data stream.

22 by extension, a 3 or 4 class multi-class classifier would contain an additional HMM trained using NYHA class I or IV patients as required.
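A minimal sketch of this decision rule with depmixS4, assuming fitted class models fit_II and fit_III and a new patient's sequence new_steps (the score helper is ours):

    # Hypothetical sketch: rebuild each class model around the new sequence,
    # copy in the trained parameters, and compare log-likelihoods from the
    # forward-backward inference routine.
    score <- function(fitted_model, new_steps) {
      m <- depmix(steps ~ 1, data = data.frame(steps = new_steps),
                  nstates = 3, family = gaussian())
      m <- setpars(m, getpars(fitted_model))  # reuse trained parameters
      forwardbackward(m)$logLike
    }
    predicted <- ifelse(score(fit_II, new_steps) >= score(fit_III, new_steps),
                        "NYHA II", "NYHA III")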

Model Generation and Selection

As to how we generate the individual HMM models, the process can be divided, at least logically, into two separate parts. The first is that of generating a model for each of the classes. The second involves generating different variant models within each class group using different initial HMM parameters, with the goal of finding the parametrization that creates the single best model that most accurately represents the class group in question - in other words, to find as close to the globally optimal set of parameters as possible (as opposed to simply a local optimum).

The first part, model generation for each class, as already touched on, is accomplished by simply selecting all the patients that belong to a given class (NYHA class II or NYHA class III) and using these as the training data for the model training function of our HMM library for R: depmixS4 [227,228]. The depmixS4 training function outputs a potential model which we can add to a list of potential models for that class. This list of models will later be passed on to the optimal model set selection process.

The second part, generating different parametrizations, simply involves repeating the first part of the process, but updating the initial parameters that form the second part of the required input for the depmixS4 model training function, until we have swept through all the desired parameter variations. Each of these models is in turn added to the appropriate list of potential class II or class III models.

As for selecting the final model pair, this can be accomplished by simply taking every paired combination of class II and class III models in the potential model lists, assessing the performance of each of these combinations against an example test set of data, and selecting the model set with the best overall performance. Unfortunately, we did not actually investigate this last part of the model generation process, as a result of the critical problems encountered in the first part: namely, that we were unable to get the training algorithms to converge, or actually train a HMM model using the step count data (whether with the depmixS4 library or others [223,225]). Although we were able to discover a way to overcome these training difficulties - using lower resolution step count data (the per-minute step count data summed over 6-hour periods) - this solution fundamentally violated the whole rationale for using a HMM model based approach in the first place (being able to use the complete per-minute time series waveform without having to dilute it down). This prompted us to instead pursue and focus on the other more classic cross-sectional ML methods discussed in Chapter 6. As a result, although we managed to train a single set of HMMs, which we used to build an initial HMMBC, the performance of the classifier was so obviously poor (as discussed in Section 5.3) that we eschewed spending significant time optimizing the algorithm performance once the cross-sectional ML methods proved more effective.

Initial Parameterization

The initial parameterization for the successfully trained classifier, with some rationale for the selection, is provided below. We emphasize however that little weight should be given to these parameters, since they are hand-picked, essentially arbitrary, and not verified against other parameters. Although we attempted several different variations of model parameterization as part of the debugging process, none of these were thoroughly documented.

1. States: 3

Although we only tested an HMMBC built with 3 underlying states (per HMM), our original intent was to sweep the state parameter from 3 to 6-8 states depending on available computational power. We started with the lowest number in that range - 3 states - to help with debugging our training problems. Since we never performed the optimal parameterization search, our final successfully trained classifier only had 3 states23.

2. Starting State Probabilities: [0.95 0.00 0.05]

Based on our initial exploration of the data (Chapter 3), patients spent most of their time in a non-active state. In other words, at any given moment, if we were to look at the step count time stream, it is most likely that a patient would be in a non-active state as opposed to any other state. We assumed the HMM would likely detect this as a strong pattern and model the non-active state as one of the 3 states, so we set our starting state probabilities to suggest this in advance.

3. Transition Probabilities:

    [0.90 0.30 0.33]
    [0.05 0.50 0.33]
    [0.05 0.20 0.33]

23 The computational power limit is important since the computational cost increases with the square of the number of states (since each state is interconnected). That is, with 3 states there are 9 possible transitions between states which must be solved. Doubling the number of states to 6 causes a quadrupling of the number of possible transitions to 36, and at 8 states there are 64 possible transitions, almost double that of the 6 state case.


The selection of initial transition probabilities was done almost completely arbitrarily, due to a lack of relevant precedent information. However, to remain consistent with the assumption made for the starting state probabilities - that a patient was likely to remain in the non-active state the majority of the time - we did tweak the initial transition probabilities for the corresponding state (dictated by the starting state probability matrix) to heavily favor remaining in that state. The remainder of the transition probabilities were selected completely arbitrarily, with the only restrictions being that the sum of each state's transition probabilities should of course be equal to 1 and that no transition probability should be 0.

4. Emission Probabilities: normally distributed with means ± variances (in steps/minute) of [1 40 100] ± [10 80 1000]

The emission probabilities were based on the range of values graphically observed from the per-minute step count distribution (shown in Figure 5-3). The specific choices of mean and variance were arbitrarily selected, although in such a way that they very loosely separated the distribution into three equidistant parts.

Figure 5-3: Distribution of per-minute step count for patients with NYHA class II and NYHA III (* grouped)
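For illustration only (a hypothetical sketch reusing the assumed train_df from earlier), depmixS4 accepts such starting values directly; only the starting state probabilities are wired in below, since the trstart and respstart vector layouts are library conventions best checked against the depmixS4 documentation:

    library(depmixS4)

    # Hypothetical sketch: supply the hand-picked starting state probabilities
    # via instart; trstart/respstart would carry the transition and emission
    # starting values analogously.
    mod <- depmix(steps ~ 1, data = train_df, nstates = 3, family = gaussian(),
                  instart = c(0.95, 0.00, 0.05))
    fit_mod <- fit(mod, emcontrol = em.control(maxit = 500))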

5.2.3 Model Validation

Since the classifier did not perform well even when tested with the training data, which should provide overly optimistic performance estimates, we did not feel it necessary to perform additional internal or external validation of the HMMBC discussed in this chapter. The performance reported in the Results and Discussion section that follows is therefore based on using identical training and testing sets (all n=50 patients) and should be considered overly optimistic about the real-life performance of the HMMBC on actual new data.


Results and Discussion

As previously mentioned (in Section 5.2.2.1), we encountered significant difficulties during the HMM training process. Specifically, the HMM training algorithm was unable to converge to a valid model when supplied with the per-minute step count data. The resolution to this problem was ultimately to supply the HMM training algorithm with progressively lower and lower resolution data. The algorithm was finally able to converge when the data supplied had a temporal resolution of 6 hours.

5.3.1 Classification Performance

The performance of the HMM based classifier produced using the per 6-hour step count data is presented in Figure 5-4. As can be seen from the confusion matrix, only 19 of the total 35 NYHA class II patients and 10 of the total 15 NYHA class III patients were correctly classified by the HMMBC, yielding an overall raw (unbalanced) accuracy of 58%. The balanced accuracy (not shown in Figure 5-4) - which corrects for the unequal distribution of class II and class III patients - can be calculated to be 60%. Unfortunately, the HMMBC accuracy is lower than the no information rate (70%). This indicates that, given the class distribution in the dataset - 70% of patients with NYHA class II - the classifier actually performs no better than if we had simply randomly assigned NYHA classes to patients. The poor agreement between the physician assigned NYHA class and the classifier assigned NYHA class is also reflected in the low value of the Cohen's Kappa coefficient24 (κ=0.18).

Figure 5-4: Overview of HMM based classifier performance

                Physician
                II    III
    AI   II     19     5
         III    16    10

    No Information Rate (NIR): 0.70
    Unbalanced Accuracy (Acc): 0.58
    Cohen's Kappa: 0.18
    Sensitivity: 0.5429
    Specificity: 0.6667
    Positive Predictive Value: 0.7917
    Negative Predictive Value: 0.3846
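These summary statistics can be recovered from the raw counts alone; a minimal sketch, assuming the caret package:

    library(caret)

    # Minimal sketch: rebuild Figure 5-4's summary statistics from the counts.
    cm <- as.table(matrix(c(19, 16, 5, 10), nrow = 2,
                          dimnames = list(Predicted = c("II", "III"),
                                          Physician = c("II", "III"))))
    confusionMatrix(cm)  # accuracy 0.58, kappa ~0.18, NIR 0.70, sens 0.54, ...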

5.3.2 Training Challenges

That the HMMBC performance is sub-par does not necessarily come as a surprise. The amount of training data, for one, is possibly simply insufficient to adequately train the HMMBC: 35 examples of NYHA class II patients and 15 of NYHA class III is not a lot of training examples. This potential

24 The Cohen’s Kappa coefficient quantifies agreement between independent raters, correcting for the degree of agreement that would be expected if the raters were simply guessing by chance [28]. Since Cohen’s Kappa is a standardized statistic it is particularly useful for comparing performance between algorithms (and studies) [28].

problem is easily resolved by simply collecting more data – something which is currently still in progress as a result of the activity tracker update made to Medly as part of this research.

Another likely explanation for the low performance is that the 6-hour resolution step data is significantly less nuanced than the per-minute resolution data. Measured by number of data points alone, the 6-hour resolution step data contains 360 times (over 2 orders of magnitude) fewer data points than the per-minute resolution data. It is likely that this lower resolution data yielded coarser and less nuanced models (due to the reduced data stream size) that did not take full advantage of the modelling capabilities of HMMs. These coarse models may not have been sufficiently differentiated to allow for accurate discrimination between the different NYHA classes. In a similar vein, it is possible that binning the per-minute data over 6 hours washed out many of the important nuances in the data that might in fact be the key to discriminating between patients in the different NYHA classes.

Compare for example Figure 5-5 and Figure 5-6, respectively the per-6-hour and per-minute step count data for the same patient. Observe, in Figure 5-5 at the 6-hour resolution, that for days 12 and 13 the step count pattern is visually similar, with only a small variation in the overall step count. One might be led to conclude from these similarities that the patient perhaps had a slightly more intense workout session or a little longer walk near the middle of day 12 compared to day 13, but that the underlying activity pattern remained essentially the same. Visualization of the underlying data in Figure 5-6 quickly dispels this notion. The activity near the middle of day 12 is best characterized as isolated but extended high-intensity physical activity, in contrast to day 13, where the activity is better characterized as punctuated, frequent, low-duration, low-intensity activity. The former might be proposed to be characteristic of NYHA class II activity, with the latter being more characteristic of a patient experiencing NYHA class III symptoms; but where one might be able to assess this difference based on the per-minute data, it is clearly harder to gauge between these two activity patterns on the basis of the 6-hour aggregate data alone.

Figure 5-5: Example patient step count data (per 6 hour resolution)


Figure 5-6: Example patient step count data (per minute resolution)


In any case, it is clear that unlocking the potential in the per-minute resolution data is highly preferable to being stuck with using low resolution data.

Analysis of Potential Root Cause

This brings us back to the question of why we were unable to get the HMM training algorithm to work with the per-minute resolution data in the first place. As mentioned, although we tried various initialization parameters, ultimately the resolution was to aggregate the data. We hypothesize that the root cause may simply be that most of the per-minute step count values in any given day are simply 025, and furthermore, that these 0 values, although sometimes briefly interspersed between long periods of activity, more often exist as long uninterrupted sequences. These sequences occur not only in the mornings and evenings – such as when a person is sleeping – but also at random intervals during the middle of the day – for example when a person might simply be inactive; see, for example, days 3, 5, 8, 11, 12, and 13 in Figure 5-6.

Recall that HMMs are stochastic models; in other words, the underlying models they use to represent a process are constrained by the rules of probability. There is, therefore, some expectation of inherent variance in the training data, which the training algorithm must capitalize on to start formulating a model of the underlying process. The presence of low (or no) variance sequences may therefore present a real problem to training.

For example, take a very long uninterrupted sequence of identical values, like a string of 0's. Depending on the length of the sequence and the expected nature of the distribution, it may in fact be considered statistically impossible. The probability of a given sequence being produced by some Markov model can be calculated using the forward algorithm, which relies on the chain rule: namely, the probability of a chain of events $E_n$ to $E_1$ can be calculated as the probability of event $E_n$ occurring, given that the sequence $E_{n-1}$ to $E_1$ has occurred, multiplied by the probability of the sequence $E_{n-1}$ to $E_1$ having occurred:

$$P(E_n, \ldots, E_1) = P(E_n \mid E_{n-1}, \ldots, E_1) \cdot P(E_{n-1}, \ldots, E_1) \quad (2)$$

The probability of the sequence $E_{n-1}$ to $E_1$ having occurred can be recursively calculated using the same formula, continuously chaining (thus lending the rule its name) the conditional probabilities of the new event in question, $E_{n-1}$, on all the prior events in the sequence. In the case of a produced sequence $S_{repeat}$ of length $n$, composed of the same repeated event, which is known to occur with some probability $p$, Equation 2 simplifies to the following:

$$P(S_{repeat}) = p^n \quad (3)$$

25 recall that for our dataset, more than 75% of the per-minute step count values for any given patient are 0 (as measured over their whole two-week monitoring period).

An oft quoted value for the threshold of statistical impossibility is $10^{-50}$, but the exact cut-off is rather arbitrary [229]. Since our objective is not to provide a rigorous proof of our hypothesis, but rather to suggest a theory to future researchers interested in tackling this problem, $10^{-50}$ is a reasonable choice of threshold. The choice of probability $p$ is by extension also somewhat arbitrary. Suppose, for simplicity's sake, that since the per-minute step count ranges from approximately 0 to approximately 125 in our patients, the probability of a 0 step count value lies around $\frac{1}{125} \approx \frac{1}{100} = 10^{-2}$. Since a more conservative reader might prefer we use the actual probability of 0 step counts in our sample – approximately 75% of the dataset – and say that $p$ should be closer to $0.75 = 10^{\log(3/4)/\log(10)} \approx 10^{-0.12}$, we also perform the calculation with this value for comparison. The overall conclusion remains the same.

Assuming a rest period of approximately 8 hours (which occurs fairly consistently, once a day), the sequence length, $n = 480$ minutes, has an associated probability of:

$$P(S_{repeat,\,8\,hours}) = 10^{-2n} = 10^{-2 \cdot 480} = 10^{-960}$$

or conservatively:

$$P(S_{repeat,\,8\,hours\,|\,conservative}) = 10^{-0.12n} = 10^{-57.6}$$

Whether by the more conservative estimate or not, these probabilities fall well below the statistical impossibility threshold of $10^{-50}$.
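The arithmetic is easily reproduced (a minimal sketch):

    # Minimal sketch: log10 probability of an n-minute run of a value that is
    # emitted with probability p at each step.
    log10_prob_run <- function(n_minutes, p) n_minutes * log10(p)

    log10_prob_run(480, 1e-2)  # 8 hour rest, p = 10^-2 -> -960
    log10_prob_run(480, 0.75)  # conservative p = 0.75  -> ~ -60
                               # (the text rounds log10(0.75) to -0.12, giving -57.6)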

All this is not to say that these sequences are impossible – they quite clearly are not – however, from the perspective of the Markov model, and of the hidden Markov model attempting to guess at the underlying hidden model, such sequences are considered very unlikely26, and therefore not likely to be interpreted as regular parts of the sequence, although they actually are. Even a one hour period is found to be highly, although relatively less, unlikely: $P(S_{repeat,\,1\,hour}) = 10^{-120}$ and $P(S_{repeat,\,1\,hour\,|\,conservative}) = 10^{-7.2}$.

26 For a 6 hour sequence, $P(S_{repeat,\,6\,hours}) = 10^{-720}$ and $P(S_{repeat,\,6\,hours\,|\,conservative}) = 10^{-43.2}$. For a 4 hour sequence, $P(S_{repeat,\,4\,hours\,|\,conservative}) = 10^{-28.8}$. Although these conservative values do not cross the $10^{-50}$ threshold, the sequence probabilities are still tremendously small, and the sequences thus less, but still very, unlikely.

Of course, any given long predetermined sequence of variables being produced by a Markov model will have a low associated probability. So why do we feel that we can make this special claim about a string of 0’s, or a string of identical values generally? Because of the key fact that the values in the series are identical.

Take an arbitrary sequence of two or more alternating values with the same length $n$ as the sequence $S_{repeat}$ above. It would have the same probability as calculated above, yet such a sequence is unlikely to present a problem to an HMM. Why? Because the different values are easily associated with different underlying states. With a single unchanging value, however, it becomes impossible to determine which value belongs to a particular state: is a single state producing the sequence and we have yet to transition to another state (what our probability calculations above actually represent), or are all states producing this same value – in which case what makes them different states, except perhaps their transition probabilities? And how does one determine the transition probabilities of the underlying states if the emitted symbols observed from the states are identical? We believe that, ultimately, the intractability of these questions may explain why the HMM training algorithm has difficulty converging, and why decreasing the resolution – which reduces the length of identical-value sequences and generally increases the variance in the sequences, making possible states more differentiable – resolves the training problem.

Proposed Solution 1: Dithering

It would actually be very easy to test this hypothesis by using a signal processing technique known as dithering. Dithering is the act of introducing dither, that is, very low amplitude random noise intentionally introduced into a system to improve its performance [230]. It was famously found to have the curious effect of improving navigation and ordnance trajectory calculations performed on aircraft-based mechanical computers during the Second World War, as a result of the aircraft-induced vibrations, which smoothed out the operation of the moving mechanical parts [231]. Since then, it has been successfully used to improve performance in applications as diverse as analog-to-digital conversion in microelectronics [232] and trading on stock exchanges (where it is used to reduce high frequency trading – an oft maligned trading practice) [233]. More commonly though, it is used to increase the visual quality of low resolution images [234,235] – an excellent example of which has been reproduced from Wikipedia [236] in Figure 5-7. Compare in particular sub-figures: 1, the raw image; 2, a lower resolution version of the same image; and 3, the low resolution image dithered using a classic image dithering algorithm [234]. Note in particular that image 3, despite having the same resolution as image 2, approaches the visual fidelity of image 1. We propose that, in an analogous way, careful application of dithering to the step count signal might counterintuitively improve our ability to train an HMMBC with high resolution data. A small amount of noise would at least eliminate the impossibly long uniform sequences in the data, and provide the necessary variance required for the HMM training algorithm to perform as intended, while simultaneously not meaningfully degrading the overall quality of the step count data stream.

Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236].
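A minimal sketch of the proposed experiment (the noise magnitude is an illustrative guess, not a validated choice):

    # Hypothetical sketch: add small zero-mean Gaussian noise so that long runs
    # of identical zeros gain the variance the EM training appears to require.
    set.seed(42)
    dither <- function(steps_per_min, noise_sd = 0.1) {
      steps_per_min + rnorm(length(steps_per_min), mean = 0, sd = noise_sd)
    }
    # dithered <- dither(patient_steps)  # then retry the per-minute HMM fit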

Proposed Solution 2: Activity Segmentation

An alternative to dithering is to do away with the inactive sequences altogether, ignoring all the long periods of 0 per-minute step counts, and instead training a HMMBC on activity segments as opposed to the complete raw daily signal. Unfortunately, this alternative, although conceptually simpler, is likely harder to put into practice and test than dithering. Dithering can be fairly easily tested by adding various different types and magnitudes of random noise to the high-resolution test signal and seeing if the HMM training algorithm can successfully converge. Training on activities, however, first requires determining what should constitute an activity segment, i.e. where it should begin, but also where it ends, including how many (if any) inactive minutes should be allowed in the middle of the activity (in case of missed readings, brief pauses, etc.). Additionally, it likely requires the development of some sort of automated or quasi-automated data segmentation algorithm (a naive sketch of which follows), not only for the case where the HMMBC might be implemented in practice as part of, say, a remote patient monitoring system, but also to help consistently and accurately segment the relatively large volume of data that would be required to train and improve such a classifier. Activity segmentation therefore likely involves first investigating in more detail the finer characteristics of the per-minute step count data stream generally. Although the task of activity classification, at least for healthy patients, is already a very active area of research, the data used is typically raw accelerometry data as opposed to per-minute step count data.
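As a starting point, a naive segmentation rule might look like the following minimal sketch (both thresholds are illustrative assumptions, and choosing them well is precisely the open question raised above):

    # Hypothetical sketch: treat gaps of up to max_gap consecutive zero-step
    # minutes as part of the surrounding activity bout, then keep bouts at
    # least min_length minutes long. Returns minute indices of each bout.
    segment_activities <- function(steps_per_min, max_gap = 3, min_length = 5) {
      active <- steps_per_min > 0
      r <- rle(active)
      r$values[!r$values & r$lengths <= max_gap] <- TRUE  # close short gaps
      runs <- rle(inverse.rle(r))
      ends <- cumsum(runs$lengths)
      starts <- ends - runs$lengths + 1
      keep <- runs$values & runs$lengths >= min_length
      data.frame(start = starts[keep], end = ends[keep])
    }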

Although more challenging, training on separate activity segments might provide significant additional secondary benefits not attainable through simple dithering. For example, assessing patients using smaller periods of activity, as opposed to an entire day's or week's worth of data, might reduce assessment latency, thereby improving response time for any application that depends on assessments provided through an activity tracker. Alternatively, it might provide additional insight into the specific physical exercise routines of patients, which might enable the provision of timely and relevant feedback to patients regarding this aspect of their HF self-management.

Both dithering and activity segmentation have their relative advantages and disadvantages as possible solutions to the training challenges encountered with the HMMBC, when compared with simply reducing the temporal resolution of the input data. Ultimately though, since dithering and activity segmentation represent very different but complementary approaches to the training challenge, they are likely both worth investigating in their own right.

Summary

To summarize, in this chapter we discussed a proposed method for building a hidden Markov model based machine learning classifier and the results of implementing and testing said classifier. We chose to use hidden Markov models, which are a tool for modeling a system as a stochastic process, because we hypothesized that these might be able to fully embrace the complexity and nuance of the entire time series data streams produced by the activity trackers worn by patients in free-living conditions. We detailed the architecture of the model, which used two hidden Markov models, one each to model the activity patterns of patients with NYHA class II and III symptoms. Instead of using the new 44 person dataset collected from the activity tracker monitoring system detailed in Chapter 4, we opted to use the same 50 person dataset investigated in Chapter 3, primarily because it provided more data with which to train machine learning classifiers. Since the 50 person dataset does not also have heart rate data, the only time series input provided to the hidden Markov model was patient step count data. Unfortunately, we encountered difficulties in getting the hidden Markov model training algorithms to converge using the per-minute step count data, which we were ultimately able to resolve by converting the data to a coarser 6-hour temporal resolution. Regrettably, using lower resolution data

contradicted our entire rationale for using hidden Markov models in the first place: attempting to use the entire unadulterated time series data stream. Furthermore, although the hidden Markov model based classifier we did train using the per-6-hour step count data was able to classify patients, it did not perform any better than one that simply assigns patient classes by chance (58% unbalanced accuracy for the HMMBC vs. 70% accuracy for the random classifier). The Cohen's Kappa statistic (0.18) confirmed the poor agreement between the physician assigned NYHA class and that assigned by the hidden Markov model based classifier. Of note, since the performance of our classifier was evaluated on the exact same data used to train said classifier, the performance reported above should also be interpreted as highly optimistic compared to the real expected performance of the classifier on new data it hasn't seen before.

Although our initial attempts to use a hidden Markov model based classifier were met with some significant setbacks, we don't believe this means that the approach has no value, but rather that it might require more dedicated attention to get such an approach to work. We posited a possible theory for why the training algorithm has difficulty creating hidden Markov models of the step count data, namely that the presence of long, low variance sequences of identical step count values makes it impossible for the training algorithm to determine the transitions between states. In response we proposed two possible approaches which might be investigated as part of future work: 1) dithering, that is, intentionally applying low-amplitude random noise to the time series step count data, thereby artificially introducing variance into the low variance sequences (which might allow the hidden Markov model training algorithm to function properly while not meaningfully degrading the overall performance of the system), and 2) doing away with the inactive sequences altogether and approaching the task of NYHA class assessment from the perspective of individual periods of activity, as opposed to attempting to classify the whole free-living time series data in one fell swoop.

Ultimately, we opted to take a third approach for the purpose of this thesis and put the hidden Markov model based classifier to the side and instead investigate the effectiveness of some other more classic approaches to supervised classification, which we discuss in the next chapter.


Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning Models

As mentioned in the introduction of the previous chapter, we set out to attempt to objectively assess the NYHA functional classification of some example patients using modern machine learning (ML) algorithms. Having discussed our unsuccessful attempt to build a useful hidden Markov model based classifier, we decided to investigate some cross-sectional machine learning algorithms that are popular starting points for supervised classification problems: Generalized Linear Models (GLM) and a variant thereof, boosted GLMs; Random Forests (RF); and Artificial Neural Networks (NNet), along with a variant thereof, Principal Component Analysis Neural Networks (PCA NNet).

In this chapter we first provide a brief refresher on the above ML techniques. The curious reader is invited to consult T. Segaran’s book, “Programming Collective Intelligence: Building Smart Web 2.0 Applications” [111], for a more thorough introduction to these and other popular ML algorithms. We then proceed to explain our methodology for training and testing the ML models investigated and finally, we discuss the results of our investigation and detail some possible future directions to take this research.

Machine Learning Models

What follows is a very brief introduction to the cross-sectional machine learning models investigated in this chapter, in order of relative algorithm complexity.

6.1.1 Generalized Linear Models

The generalized linear model, or GLM, is, unsurprisingly, a generalized version of classic linear regression [237,238].

Recall that the idea behind ordinary linear regression is that we can model some randomly distributed response variable y as a linear combination of predictors X = {x_1, x_2, …, x_n}, subject to some noise/error represented as the error term ε. If we define B = {β_0, β_1, β_2, …, β_n} as the regression parameters, with β_0 being the intercept term, we can express the relationship formally as:

y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n + ε    (4)

This equation, which defines linear regression, can be decomposed into two parts: 1) a linear part and 2) a random error part. The linear part, β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n, tells us that there is some expected value for y conditional on the value of x: E(y|x). The error term then tells us that there is some

random error or variance about this expected value; in classic linear regression this error is specifically assumed to be normally distributed with some constant variance, σ². If we call the expected value E(y|x) the mean of the normal distribution for y as a function of x, μ(x), we can alternatively represent Equation 4 as:

y ~ N(μ(x), σ²)    (5)

The generalized27 linear model asks us: what if the relationship between y and x were not normally distributed but were instead modelled by some other distribution? Specifically, what if we could use any distribution within the wider family of exponential distributions, of which the normal distribution is just

Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the distribution belongs in the family only when certain parameters are fixed). Adapted from [290].

27 N.B. not ‘general’ linear model, which is just a special case of the GLM, namely the one expressed in eq. 4

one example (see Figure 6-1 for more examples)? To effect this change, thus generalizing the linear model, we need to modify the way we link together the expectation value, E(y|x), produced by the linear predictors, and the mean value, μ(x), of our error distribution. That is, instead of defining the link between E(y|x) and μ(x) as:

E(y|x) = μ(x)    (6)

we would first generalize the relationship, expressing E(y|x) as a link function g of μ(x):

E(y|x) = g(μ(x))    (7)

The link function for the normal distribution is then simply the identity, g(a) = a. The link function must always be smooth, invertible, and linearizing, and is changed according to the desired noise distribution. A list of common link functions, and their inverses, can be found in most basic texts on GLMs, for example [237]. A model can then be fit using maximum likelihood estimation [237,238]. The end result of this entire process is that we gain a fairly simple yet powerful and versatile method of modelling a wide variety of processes. As a result, although oft forgotten, GLMs usually make a great first choice before moving on to more sophisticated ML models.
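To make this concrete, a minimal sketch in R of fitting a GLM for a binary outcome follows; the data frame `d`, its predictor columns, and the factor `nyha` are hypothetical stand-ins for illustration, not the exact models trained later in this chapter:

    # Minimal sketch: a binomial GLM (logit link) for a two-class outcome.
    # `d` is a hypothetical data frame with a factor column `nyha` (II/III).
    fit <- glm(nyha ~ mean_daily_steps + age + bmi,
               family = binomial(link = "logit"),  # a member of the exponential family
               data = d)
    summary(fit)  # maximum likelihood estimates of the regression parameters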

6.1.2 Boosted Generalized Linear Models

Boosting, or rather gradient boosting, is an ensemble learning technique [239,240]. Instead of using one single strong predictive model, the idea behind gradient boosting is to use an ensemble of weakly performant models that build on each other, learning from the mistakes of previous models, to create a final model that is more accurate than any single (strong or weak) constituent model. Although boosting can make overall more performant classifiers, it must be carefully managed to prevent overfitting the model, that is, training the model to be too good at predicting the training data at the expense of making the model generalizable to data it has never seen before. The full algorithm used to perform gradient boosting is fairly complex and well outside the scope of this thesis, though a conceptual sketch is given below. The algorithm, however, supports a range of possible underlying ML models [240], and a boosted GLM is one that specifically uses generalized linear models as the underlying ML model.
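The sketch below illustrates only the core boosting loop for squared error loss; it is illustrative, not a faithful implementation (real packages such as mboost are considerably more sophisticated), and the names `d`, `y`, and the tuning values are hypothetical:

    # Minimal conceptual sketch of gradient boosting with least-squares base
    # learners. `d` is a data frame of numeric predictors, `y` the response.
    boost_sketch <- function(d, y, n_rounds = 100, nu = 0.1) {
      f <- rep(mean(y), length(y))                 # start from the constant model
      for (m in seq_len(n_rounds)) {
        r <- y - f                                 # pseudo-residuals under squared error
        base <- lm(r ~ ., data = cbind(d, r = r))  # weak learner fit to the residuals
        f <- f + nu * predict(base)                # small shrunken step, to limit overfitting
      }
      f                                            # ensemble prediction on the training data
    }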

6.1.3 Random Forest

The second type of ensemble learning technique is known as bagging. Bagging forms a core part of Random Forests (RF). The best place to start discussing random forests, however, is with decision trees.


A decision tree is simply a branching set of rules, or boundary cut-points, that separate a feature space into various partitions, each of which is associated with some sort of classification or decision outcome [111]. A very simple example is shown in Figure 6-2. In this example, the decision tree is used to classify the three different colors of data points (green, orange, and purple) according to two arbitrary features, A & B, associated with the data points. Note that due to the placement of the boundaries, some of the dots are misclassified.

One simple approach to training a decision tree is to start from the top of the tree (the root) and go down, selecting several candidate boundary cut-points that divide the dataset, and then computing how well the data is split by each boundary [111]. For example, one could use the Gini impurity (a measure of diversity in the dataset), or the measure of information gain (reduction in entropy) that results from the split. One then selects the best candidate boundary and repeats this process down each new branch. As with all ML algorithms, one must be wary of over-fitting the learner. In the case of decision trees this is especially true as the complexity, or even just the number, of the boundaries used increases. Even with just the use of linear boundaries, a decision tree can get very precise as the tree gets deeper and larger, with more branches and leaves to cut the feature space into smaller and ever more ultra-specific partitions. As a result, many decision tree creation algorithms feature a way to stop growing the tree (usually by setting a hard limit on the depth) or to prune the tree after growth (removing unnecessary, unhelpful or weak branches), all to help avoid overfitting.

Figure 6-2: Example of a decision tree (above) with corresponding feature space (below).
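For concreteness, a minimal sketch in R of scoring one candidate cut-point with the Gini impurity follows; the toy labels and feature values are hypothetical:

    # Gini impurity of a set of labels: 1 minus the sum of squared class proportions.
    gini <- function(y) 1 - sum(prop.table(table(y))^2)

    # Weighted impurity of a candidate split; `left` marks points left of the cut.
    split_gini <- function(labels, left) {
      n <- length(labels)
      (sum(left) / n) * gini(labels[left]) + (sum(!left) / n) * gini(labels[!left])
    }

    labels <- factor(c("green", "green", "orange", "orange", "purple"))
    a <- c(1, 2, 4, 5, 6)          # toy values for feature A
    split_gini(labels, a < 3)      # lower is purer; compare across candidate cuts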

Decision trees are hugely useful since they are interpretable; in other words, a human can look at a decision tree and understand the decisions being made. This is why decision trees, albeit expert trained

ones, are often popular for use in expert systems where the decision process may need to be inspected; the Medly algorithm in fact uses an expert trained decision tree for triaging patients [104].

Despite all this, ML decision trees are still often highly sensitive to the input training data and have a tendency to over-fit and not generalize well to new data. One solution to this problem is the ensemble learning technique of bagging (the counterpoint to boosting). Bagging, in a similar fashion to boosting, uses an ensemble of learners to improve learner performance, but whereas in boosting the learners build on each other sequentially, in bagging one trains several independent learners - in this case, multiple independently trained decision trees - and aggregates their responses. Each tree (learner) in the forest (ensemble) produces its own separate prediction using the input predictor data, and the resulting ensemble of independent predictions is combined, for example using a majority voting scheme, to produce the overall final prediction. The aptly named random forest is a variation on tree bagging whereby a random subset of features is used to train each individual tree in the forest, as opposed to the entire feature space being provided to each tree. This reduces the likelihood of having highly correlated trees, while retaining the random forest's beneficial properties, such as its ability to naturally perform feature selection: the most predictive features will tend to figure more prominently in the random forest, whereas less important features will tend to be more sparsely distributed and therefore be less heavily weighted as part of the forest.

All in all, bagging together decision trees into a random forest creates an ML model that has additional useful emergent properties (e.g. natural feature selection) and can better generalize to new data, while maintaining many of the inherent advantages of the underlying decision trees. Because of their simplicity and ease of use, RFs (along with GLMs) are therefore often a good early candidate for ML tasks [111].
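A minimal sketch of training such a forest in R with the randomForest package follows; `features` and `nyha` are hypothetical stand-ins for a predictor data frame and the class labels:

    library(randomForest)
    set.seed(42)
    rf <- randomForest(x = features, y = nyha,
                       ntree = 500,                          # number of bagged trees
                       mtry = floor(sqrt(ncol(features))))   # random feature subset per split
    importance(rf)  # the forest's emergent feature importance scores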

6.1.4 Artificial Neural Networks

In contrast, Neural Networks (NNet), or as they are more formally termed, artificial neural networks, are far on the other end of the complexity spectrum; they are the bazooka to the RF's and GLM's pea-shooters. The use of NNets for the sole purpose of assessing NYHA class is therefore likely overkill since, as previously discussed, less complex models are likely to actually perform better due to their simplicity in the face of limited data. However, in the context of assessing NYHA class as part of a remote monitoring system, NNets have an interesting property that makes them particularly worth investigating. NNets support what is known as online learning, which means that the trained model can be progressively and continuously updated and improved as more and more data becomes available, without needing to retrain

the entire model from scratch. This is a particularly useful property within the context of a remote patient monitoring system, where new data becomes available each and every day. While the specific NNet investigated as part of this work may not necessarily be immediately transferable to the task of daily assessment of NYHA class, an initial foray into training NNets with activity monitoring data is likely to provide useful insights for future work.

The fundamental building block of the NNet is the perceptron. The perceptron is a digital neuron and operates in an analogous fashion: it sums its weighted input signals and converts the sum to an output signal using some predefined thresholding function. An example is shown in Figure 6-3.

Figure 6-3: A perceptron
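A minimal sketch of a perceptron in R, with hypothetical inputs, weights and bias:

    # A perceptron: weighted sum of inputs passed through a step threshold.
    perceptron <- function(x, w, b) {
      s <- sum(w * x) + b    # weighted sum of the input signals
      as.integer(s > 0)      # simple thresholding (activation) function
    }

    perceptron(x = c(0.5, 1.2, -0.3), w = c(0.4, -0.1, 0.8), b = 0.05)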

A NNet is built by creating a weighted directed network of perceptrons, as shown in Figure 6-4 (for clarity, the inter-perceptron weights are not shown). The network is arranged in a layered fashion and these layers are logically divided into three types depending on their function. At the front of the NNet is the input layer, which connects each input feature to a perceptron. The input layer of the NNet shown in Figure 6-4, for example, would be suitable for use with 4 input predictors or features. The input layer acts as the interface, connecting the input features to the first layer of perceptrons in the next set of layers in the network: the hidden layers.

Figure 6-4: A neural network

The hidden layers of the NNet are the innermost layers and form the bulk of the network. They are where the NNet learns the various complex relationships and patterns in the data. Unfortunately, the nature of this method of learning is that NNets typically remain black boxes, and it is never quite clear how or what relationships the NNet has learned from the data. The number of hidden layers and the number of nodes in each layer can be altered to make a deeper and wider NNet capable of learning more complicated relationships. While NNets can theoretically be made arbitrarily large, training large NNets is computationally expensive and therefore limited by the computational power available. Although NNets have existed since the 1950s, it is only due to modern advances in computing that training large multi-layered NNets, known as deep neural networks [241], has recently become feasible [111,124,242,243]. The success of deep neural nets at tackling complex problems is generally credited as a cause of the recent popular resurgence in AI research [242].


Once the hidden layers, regardless of depth, have processed the input data, the data is picked up by the output layer.

The purpose of the output layer is simply to extract data from the hidden network and convert it to a final output prediction. The output layer of the NNet in Figure 6-4, for example, produces 3 output predictions.

Training the hidden network is commonly performed using what is known as the backpropagation algorithm, which is essential for enabling online learning for the NNet. Essentially, new data is provided, one data point at a time, to the input layer of the NNet. Examining the values produced at the output layer of the network, one can then determine how far off the current NNet prediction is, and then work backwards through the network, making minor adjustments to the weights of the links to slowly push the output in the correct direction. The degree of tweaking, the learning rate, is carefully controlled to make sure that the NNet is neither underfit (insufficiently trained) nor overfit (overshooting or overgeneralizing from individual data points). The finer details of the backpropagation algorithm are outside the scope of this thesis, but the interested reader is invited to refer to either [111] or [241] for further reading.

Overall, NNets are a complex but very powerful class of ML algorithm that has been successfully used to learn the relationships present in challenging, highly non-linear data and systems. They also support continuous incremental learning, which may be particularly useful in the context of remote patient monitoring.
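As a small illustration, a single-hidden-layer NNet can be trained in R with the nnet package; the names `d` and `nyha` are hypothetical, and the hyper-parameter values shown are arbitrary illustrative choices:

    library(nnet)
    fit <- nnet(nyha ~ ., data = d,
                size = 5,      # number of hidden-layer perceptrons
                decay = 0.1,   # weight decay, to guard against overfitting
                maxit = 500)   # cap on training iterations
    predict(fit, d, type = "class")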

6.1.5 Principal Component Analysis Artificial Neural Networks

Aside from computational cost, NNets also have the drawback of typically requiring a lot of data to train well. The latter point is particularly challenging given our small dataset. One way to make more effective use of a data limited but feature rich dataset is to perform dimensionality reduction on the feature set prior to presenting it to a ML algorithm [244–247]. Dimensionality reduction is related to the concepts of feature selection and extraction. In both cases, a large set of features is reduced to a smaller, more concise set of principal features that encodes, as much and as accurately as possible, all the information originally contained in the large feature set [247].

Principal Component Analysis (PCA) is a standard and hugely popular technique for performing dimensionality reduction [248]. In PCA, the larger m-dimensional feature set is projected onto the best n-dimensional orthogonal subspace (where n < m) in such a way that the greatest variance in the projected data comes to lie on the lowest (first) order coordinate axis of the n-dimensional subspace, with successively lower variance data being reserved for successively higher order coordinate axes28 [244,248]. In this way PCA trims out the features (dimensions) that provide the least new information, either because the information is already accounted for as part of another correlated feature, or because the feature has low variance and therefore provides little additional information to consider. The interested reader can find a more complete mathematical treatment of the algorithm in [248].

28 i.e. the second greatest variance lies on the second order axis, the third greatest variance on the third order axis, and so on, up to the least variant data, which resides on the final n-th order coordinate axis.

In theory, by applying PCA to our set of features before passing it to a NNet, the resulting PCA NNet should perform better, since the algorithm should be able to focus on learning the high information patterns common to the limited dataset, while being less distracted by low value features. Furthermore, the PCA NNet should be trainable at a reduced overall computational cost, since the reduced number of features will likely require a lower complexity NNet to model.
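A minimal sketch of this pre-processing step in R follows; `features` is a hypothetical numeric data frame, and the 95% variance cut-off is an arbitrary illustrative choice:

    # PCA on centered, scaled features; keep components covering ~95% of variance.
    pca <- prcomp(features, center = TRUE, scale. = TRUE)
    var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    n_comp <- which(var_explained >= 0.95)[1]
    reduced <- pca$x[, 1:n_comp]   # lower-dimensional input for the NNet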

Methods

We chose to use the R programming language [151] in combination with RStudio [152], the open-source integrated development environment for R, and various supporting R packages for the research work documented in this chapter [153–158,217,249,250]. For the specific tasks of building, training, and validating the ML models we used the caret (Classification And Regression Training) package for R [251,252]. We also used the caret package for data pre-processing including normalization and imputation, although we used the leaps package [253] for feature selection.

To simplify comparison between the sometimes-disparate models discussed in this chapter (as well as the hidden Markov model based classifier discussed in the previous chapter), we kept the methodology as consistent as possible between the different machine learning approaches. We also aligned our methodology as much as possible with current best practice for the creation and validation of supervised classification ML models.

6.2.1 Training Data

Dataset

We used the same data to develop and validate the cross-sectional algorithms that we used for the hidden Markov model based classifier investigated in Chapter 5. This data, again, is the same data used for the replication study discussed in Chapter 3. Recall that the dataset was selected primarily because it had the largest sample size of the available datasets, but also because it contained cardiopulmonary exercise testing data, permitting us to establish a helpful baseline performance (based on the gold-standard CPET) against which to evaluate the impact of step count data on our algorithm performance.

Population

Recall that the Chapter 3/Chapter 5 dataset included 50 patients, predominantly male (86 vs. 89 [%]), aged 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 [kg/m2]), whose demographics are fully detailed in Table 5 (page 38), Table 6 (page 38) and Table 7 (page 39). These patients came from a closed (prospective) cohort of adult outpatients at a tertiary care clinic specializing in the management of heart failure at a major hospital in Toronto, Canada. The exact inclusion and exclusion criteria are detailed in Table 3 (page 37) and Table 4 (page 37) respectively.

Label Assignment

Again, recall that the patients in the dataset were originally classified at onboarding by their physician as either NYHA functional class II (n=26) or III (n=11), according to the criteria outlined in Section 2.2.1.1, or as some intermediate/mixed class I/II (n=9) or II/III (n=4), as outlined in Section 5.2.1.3. However, for the purposes of the ML classification task being investigated, patients assigned the intermediate/mixed class I/II were relabelled as NYHA class II patients, and patients assigned class II/III were relabelled as NYHA class III. The final dataset was therefore composed only of patients labelled as NYHA class II (n=35=26+9) and NYHA class III (n=15=11+4).

6.2.2 Model Design

Predictors

In order to predict the outcome label, each of the machine learning models was fed with a series of predictors (or features) built from available data in the dataset. Recall that the dataset consisted of the following data:

1. Minute-by-minute step count data – recorded continuously throughout the day using a commercially available activity tracker (Fitbit Flex), from which we extracted the same metrics calculated and explored in Chapter 3, as listed in Table 18 below (a brief sketch of computing a few of these features is given after this list):

Table 18: Minute-by-minute step count features

Maximum
1 Maximum 2 Week PMSCa [steps/minute]
2 Maximum of Maximum DPMSCb [steps/minute]
3 Mean of Maximum DPMSCb [steps/minute]
4 Standard Deviation of Maximum DPMSCb [steps/minute]
5 Standard Error of Maximum DPMSCb [steps/minute]
6 Minimum of Maximum DPMSCb [steps/minute]

75th Percentile
7 Maximum of 75th Percentile of DPMSCb [steps/minute]
8 Mean of 75th Percentile of DPMSCb [steps/minute]
9 Standard Deviation of 75th Percentile of DPMSCb [steps/minute]
10 Standard Error of 75th Percentile of DPMSCb [steps/minute]

Mean
11 Mean 2 Week PMSCa [steps/minute]
12 Maximum of Mean DPMSCb [steps/minute]
13 Mean of Mean DPMSCb [steps/minute]
14 Standard Deviation of Mean DPMSCb [steps/minute]
15 Standard Error of Mean DPMSCb [steps/minute]
16 Minimum of Mean DPMSCb [steps/minute]

Standard Deviation
17 Standard Deviation of 2 Week PMSCa [steps/minute]
18 Maximum of DPMSCb Standard Deviation [steps/minute]
19 Mean of DPMSCb Standard Deviation [steps/minute]
20 Minimum of DPMSCb Standard Deviation [steps/minute]

Standard Error
21 Standard Error of 2 Week PMSCa [steps/minute]
22 Maximum of DPMSCb Standard Error [steps/minute]
23 Mean of DPMSCb Standard Error [steps/minute]
24 Minimum of DPMSCb Standard Error [steps/minute]

Total
25 Total 2 Week SCc [steps]
26 Maximum of Total DPMSCb [steps]
27 Mean of Total DPMSCb [steps]
28 Standard Deviation of Total DPMSCb [steps]
29 Standard Error of Total DPMSCb [steps]
30 Minimum of Total DPMSCb [steps]

IQR (Interquartile Range)
31 Maximum of DPMSCb IQRd [steps/minute]
32 Mean of DPMSCb IQRd [steps/minute]
33 Standard Deviation of DPMSCb IQRd [steps/minute]
34 Standard Error of DPMSCb IQRd [steps/minute]

Skewness
35 2 Week PMSCa Skewness
36 Maximum of Daily SCc Skewness
37 Mean of Daily SCc Skewness
38 Standard Deviation of Daily SCc Skewness
39 Standard Error of Daily SCc Skewness
40 Minimum of Daily SCc Skewness

Kurtosis
41 2 Week PMSCa Kurtosis
42 Maximum of Daily SCc Kurtosis
43 Mean of Daily SCc Kurtosis
44 Standard Deviation of Daily SCc Kurtosis
45 Standard Error of Daily SCc Kurtosis
46 Minimum of Daily SCc Kurtosis

a PMSC: Per Minute Step Count. b DPMSC: Daily Per Minute Step Count. c SC: Step Count. d IQR: Interquartile Range.

2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients

who received it prior to recruitment29). From this data we extracted the following features:

Table 19: Cardiopulmonary exercise testing data features

CPET Feature – Brief Description of Feature
1 CPET Duration [frac. min.] – duration of CPET in fractional minutes
2 CPET Max Watts [W] – max resistance achieved at end of CPET
3 % Predicted CPET Watts [%] – percentage of expected CPET Max Watts for patient
4 SBP, Resting [mmHg] – resting systolic blood pressure before CPET
5 DBP, Resting [mmHg] – resting diastolic blood pressure before CPET
6 HR, Resting [bpm] – resting heart rate before CPET
7 O2 Sat., Resting [%] – resting oxygen saturation before CPET
8 FEV, Resting [L] – resting forced expiratory volume before CPET
9 % Predicted Resting FEV [%] – percentage of expected forced expiratory volume achieved by patient during CPET
10 FVC, Resting – resting forced vital capacity before CPET
11 % Predicted Resting FVC [%] – percentage of expected forced vital capacity achieved by patient during CPET
12 SBP [mmHg] – systolic blood pressure at end of CPET
13 DBP [mmHg] – diastolic blood pressure at end of CPET
14 HR [bpm] – maximum heart rate at end of CPET
15 HR 1 min. Post Test [bpm] – heart rate 1 minute after end of CPET
16 HR Drop in 1 min. [bpm] – heart rate drop (recovery) 1 minute after end of CPET
17 O2 Saturation [%] – oxygen saturation at end of CPET
18 VO2 Peak (rel.) [ml/kg/min] – peak oxygen consumption during CPET relative to patient body weight
19 Predicted VO2 Peak (rel.) [ml/kg/min] – expected peak oxygen consumption for patient (relative to body weight) during CPET
20 % Predicted VO2 Peak (rel.) [%] – percentage of predicted peak oxygen consumption for patient (relative to body weight) achieved during CPET
21 VO2 Peak [L/min] – peak oxygen consumption during CPET (not corrected for patient body weight)
22 Predicted VO2 Peak [L/min] – expected peak oxygen consumption for patient during CPET
23 % Predicted VO2 Peak [%] – percentage of predicted peak oxygen consumption for patient achieved during CPET
24 Anaerobic Threshold [ml/kg/min] – patient's anaerobic threshold
25 AT as % Measured VO2 Peak [%] – anaerobic threshold as a percentage of the measured peak oxygen consumption of the patient (relative to their body weight)
26 AT as % Predicted VO2 Peak [%] – anaerobic threshold as a percentage of the predicted peak oxygen consumption of the patient
27 VE Peak [L] – peak minute ventilation during CPET
28 VCO2 Peak [L] – peak CO2 expiration during CPET
29 VE/VCO2 Slope @ AT – slope of minute ventilation to CO2 output at anaerobic threshold during CPET
30 VE/VCO2 Slope @ Peak – slope of minute ventilation to CO2 output at CPET peak
31 RER Peak – peak respiratory exchange ratio during CPET

29 Specifically, 1, 15, 20 and 22 days prior to recruitment.

3. Patient demographic/meta data – recorded as part of onboarding, specifically:

Table 20: Patient demographic data features

Feature
1 Sex [Male or Female]
2 Age [years]
3 Height [cm]
4 Weight [kg]
5 BMI (Body Mass Index) [kg/m2]
6 Handedness [left or right]
7 Wristband preference [left or right]
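As noted above, a minimal sketch of computing a few of the Table 18 step count features in R; the data frame `steps`, with hypothetical columns `date` and `spm` (steps in each minute), is a stand-in for the Fitbit data stream:

    # Per-day summaries of the per-minute step counts.
    daily <- aggregate(spm ~ date, data = steps, FUN = function(x)
      c(max = max(x), mean = mean(x), total = sum(x)))
    daily <- do.call(data.frame, daily)   # flatten the matrix column

    features <- c(
      max_2wk_pmsc   = max(steps$spm),        # Table 18, feature 1
      mean_2wk_pmsc  = mean(steps$spm),       # Table 18, feature 11
      mean_max_dpmsc = mean(daily$spm.max),   # Table 18, feature 3
      sd_total_dpmsc = sd(daily$spm.total)    # Table 18, feature 28
    )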

We tested three different variants of models using three different combinations of the above features:

a) The ‘CPET feature group’, to establish a baseline performance using only data available from CPET tests. This feature set consisted of all the CPET features and the patient demographic features, for a total of 38 features.


b) The ‘CPET + Step Data Metrics feature group’, to establish the additional benefit derived from adding the basic step data features. This feature set consisted of all the CPET features, all the step data features and the patient demographic features, for a total of 84 features.

c) The ‘Step Data Metrics only feature group’, to investigate the effectiveness of using only data derived from an activity tracker. This feature set consisted of all the step data features and the patient demographic features, for a total of 53 features.

Normalization

We normalized the input predictors as the first step in the training process for our cross-sectional ML classifiers: 1) to improve training speed, and 2) to ensure that each of the predictors was similarly weighted for consideration by the learning algorithm. Specifically, we shifted each predictor to be centered about its mean value and scaled the predictors by their corresponding standard deviations using the preProcess function in the caret R package.

Treatment of Missing Data

Some of the CPET data was missing from the records of some patients. Since the algorithms used do not handle missing data by themselves, we removed patients with missing data from the training data supplied to the models, only including the complete cases (those without missing data). However, because the aforementioned caret package's preProcess function also has the ability to perform data imputation, we also trained a variant of each model where the missing training data was imputed, to salvage as many of the otherwise incomplete cases in the dataset as possible. The preProcess function used a k-Nearest Neighbour algorithm (k was set to 5), which chooses an imputation value based on the k nearest neighbouring non-missing data points, as measured by their Euclidean (straight-line) distance from the missing data point [254].
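A minimal sketch of this pre-processing in R with caret; `features` is a hypothetical data frame of numeric predictors, possibly containing NAs for the missing CPET values:

    library(caret)
    # Center, scale, and (for the imputed variants) k-NN impute missing values.
    pp <- preProcess(features, method = c("center", "scale", "knnImpute"), k = 5)
    features_pp <- predict(pp, features)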

Feature Selection

Since we had such a large list of input predictors for each model (up to 84), we compared the impact of performing feature selection on the input list of predictors being provided to the model training function. The purpose of automated feature selection is to try to prevent the model from overfitting the data, thereby improving the ability of the classifier to generalize to new data. Traditional machine learning heuristics dictate that, given our sample size of 50, the number of features used to train our


algorithms should be somewhere around 5-10, but possibly up to 49 features, to prevent overfitting30. In view of this, we used an R package called leaps [253], which uses linear regression, to identify and separate out the single best combination of up to 10 features. We evaluated the best feature combination using the Bayes information criterion, usually abbreviated BIC [255], which is very similar to the more commonly used Akaike information criterion, usually abbreviated AIC. In both cases, models with lower values are preferred; however, the Bayes information criterion penalizes complex, feature rich models more heavily and should therefore favor models that use fewer features. Based on the previously mentioned heuristics, lower featured models are likely to be more appropriate given the limited size of our dataset.

Feature selection was done as a last step before generating the ML classifier models. Note also that the feature selection was performed using only the data made available for training the model and did not include any of the validation data, which would have skewed our estimation of the overall final classifier performance.

All this said, in a similar fashion to the normalization and missing data treatment process, we also created variant models where the pre-processing step was not applied, i.e. feature selection was not performed and instead the whole unaltered list of input predictors was provided to the model for training.
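For the variants that did use feature selection, a minimal sketch of the leaps-based step in R; `features` and `nyha` are hypothetical stand-ins, with the factor labels coded numerically for the linear fit:

    library(leaps)
    fit <- regsubsets(x = features, y = as.numeric(nyha) - 1, nvmax = 10)
    sel <- summary(fit)
    best_size <- which.min(sel$bic)                   # BIC-preferred model size
    chosen <- names(which(sel$which[best_size, -1]))  # the selected feature names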

Model Generation

To actually generate and train the ML classifiers, we provided the appropriate set of preprocessed features to the model training function of the R caret package. Instead of setting fixed hyper-parameters for the models - e.g. a maximum decision tree depth of 5 in the RFs, 4 hidden layers for the NNets, etc. - we had the model training function perform a grid search of the model hyper-parameters to identify the optimal hyper-parameterization for each model, assessing the performance of each model using k-fold cross-validation (CV).
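A minimal sketch of this step in R; `features_pp` and `nyha` are hypothetical stand-ins, and the method string is swapped per model type (e.g. "glm", "glmboost", "rf", "nnet"):

    library(caret)
    ctrl <- trainControl(method = "cv", number = 10)  # k-fold CV for tuning
    fit <- train(x = features_pp, y = nyha, method = "rf",
                 tuneLength = 5,     # breadth of the hyper-parameter grid search
                 trControl = ctrl)
    fit$bestTune  # the optimal hyper-parameterization found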

30 Pre-hoc determination of the optimal number of features for a given dataset size is unfortunately still very much a matter of debate in the field. As a result, various researchers have developed and published various heuristics for the task, which can sometimes greatly vary in their recommendations. Some of these heuristics include: having 10 data points per model parameter/feature [283]; having "3-5 independent cases per class and feature" [284] for training stable, albeit not necessarily 'good', models [125]; or, for a dataset of size n, using about √n features when said features are highly correlated, up to about n − 1 features when they are completely uncorrelated [285]. For our dataset this puts us at 5, 3-5, or 7 (highly correlated) to 49 (uncorrelated) features.

k-fold CV is a technique for performing training and testing/validation where it is undesirable for an already small dataset to be further divided into proportionately smaller separate training, testing and validation datasets, but where it is still necessary to assess how well a classifier is expected to perform on data it has never seen before [256]. In k-fold CV, the original dataset is instead first segmented into k, typically approximately equally sized, partitions termed folds. Testing and training of a given model is then performed k times such that each fold is used once as part of a test set, with the

remaining k − 1 folds in each round used to train a model for evaluation on the test fold. The overall performance is then reported as the mean of the performance of the models across the rounds. The process is shown visually in Figure 6-5.

Figure 6-5: k-fold cross-validation

In each case, we set the number of folds for the testing CV procedure to be the same as the number used for the overall model CV procedure detailed in the next section.

6.2.3 Model Validation

Since a suitable external validation dataset was not available, we again performed CV using the Chapter 3/Chapter 5 dataset to perform an internal validation of our ML classifiers and estimate the real-world performance of our classifiers against new, unseen data. Specifically, we validated the models using both nested 10-fold CV and nested leave-one-out cross-validation (LOOCV). In other words, we cross-validated the overall pre-processing, feature selection and models, but nested within the evaluation of each model we used a further round of cross-validation (splitting out new further training and test folds) to select the optimally hyper-parameterized model. LOOCV is a special case of k-fold CV where the number of folds, k, is set to be equal to the number of observations in the dataset. In other words, every training/test set split repeatedly leaves out one new data point for testing or validation and uses the rest for training. Before proceeding to discuss the rationale for using both 10-fold and leave-one-out cross-validation, we first define some important terms for assessing ML model performance.
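Before that, a minimal sketch of the nested LOOCV procedure just described; `features_pp` and `nyha` are hypothetical stand-ins, and the inner CV (inside train) tunes hyper-parameters using only the data remaining after the outer hold-out:

    library(caret)
    n <- nrow(features_pp)
    preds <- factor(rep(NA, n), levels = levels(nyha))
    for (i in seq_len(n)) {
      inner <- trainControl(method = "cv", number = 10)   # inner tuning CV
      fit <- train(x = features_pp[-i, ], y = nyha[-i],
                   method = "glmboost", trControl = inner)
      preds[i] <- predict(fit, features_pp[i, , drop = FALSE])  # outer hold-out
    }
    confusionMatrix(preds, nyha)  # kappa, balanced accuracy, etc.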


On Bias and Variance

The bias of a machine learner is simply its error rate: i.e. how much or how little the algorithm errs in performing whatever task it is attempting to accomplish; it reflects the "erroneous assumptions in the model" [257]. Notably though, the bias is separate from the unavoidable or irreducible error of the problem and only measures how distant the learner is from the 'optimal' overall error rate. For example, if a system were trying to recognize speech from very noisy, low quality audio streams where even humans failed at the task 10% of the time, and a machine learning algorithm was able to recognize the speech with an error rate of 15%, the bias of the algorithm would only be 5%, since the gold-standard classifier for this problem, the human ear, still erred 10% of the time due to the inherent nature of the problem [258].

In contrast, the variance is how well, or rather how badly, the ML classifier generalizes to never before seen data; i.e. how much the classifier errs due to 'sensitivity to small fluctuations in the training set' [257]. For example, if the same speech recognition classifier were provided with new test data (separate from the data used to train it) and found to have a new error rate of 27%, the bias of the classifier would still be 5% but the variance would be estimated at 12%, since the algorithm suffered an additional 12% loss in performance in the face of the new test data. Knowing a classifier's bias and variance allows us to estimate how under-, over-, or both under- & over-fit a given classifier may be: high bias being indicative of an under-fit classifier, high variance indicative of an over-fit classifier, and high bias & variance indicative of an under- and over-fit classifier [259,260]. By extension, most changes made to an ML classifier have an associated bias-variance trade-off, where an improvement in one results in a deterioration of the other; e.g. decreasing bias (reducing under-fitting) results in increased variance (increased over-fitting). Somewhere in the middle lies the optimal fit point where the combination of bias and variance is minimized.

Rationale for multiple cross-validation

Returning to 10-fold and leave-one-out cross-validation: LOOCV is known to be the least pessimistically biased estimator of model performance [256,261–265]. However, it has been accused of having "high [estimator] variance, leading to unreliable estimates (Efron 1983)" [263]. This accusation is typically attributed to the cited paper by R. Kohavi, presumably citing alleged findings by B. Efron [266]. Efron, however, only elaborates on CV generally and does not appear to investigate, or make any claims about, the effect of higher k values on the variance of the estimate provided by the CV process. Kohavi's research findings in fact also repudiate the claim of higher variance, as do the findings and simulations of a myriad of other investigators, who in fact suggest quite the opposite [261,264,265,267]. Only in special, highly specific cases do simulations suggest that higher variance performance estimates result from LOOCV


[267]. The conclusion that LOOCV results in higher variance estimates therefore appears likely to be an erroneous intuitive over-generalization (dare we say overfitting) of the bias-variance trade-off, so ever-present in ML performance assessment, to the performance estimators themselves.

Our rationale for also performing 10-fold cross-validation is therefore not to improve our estimate of model performance, although in the event that both the 10-fold and leave-one-out cross-validation estimates are similar, we would have additional confirmation that the performance estimates are in fact accurate. Rather, our objective is to measure the difference in the estimate of model performance using different sized training datasets, to roughly determine our location on the learning curve of these algorithms and ascertain if collecting more training data is likely to provide improved model performance. It may seem strange to do this using 10-fold cross-validation, since we have previously mentioned that LOOCV is known to be a less biased estimator of model performance than lower k-fold CV and we could simply perform LOOCV on an artificially reduced dataset. However, to do so we would have to arbitrarily throw away data we could otherwise use for some useful purpose, namely testing, which is why we opted to use 10-fold CV vs. LOOCV. Furthermore, previous simulations and experiments have demonstrated that for most datasets, even those as small as 40 data points, 10-fold cross-validation provides an estimate that is nearly as unbiased as LOOCV, or at least within 7-9 percentage points of the LOOCV value [261,263,267].

Since performing nested 10-fold cross-validation on our dataset represents a large (roughly 15%) reduction in available training data31, most of the performance delta above 7-9 percentage points is reasonably attributable to the reduced training data in our already small dataset, and can therefore be used to make a rough approximation of our location on the learning curve (i.e. to determine if we are still in the region of high increase in performance for small increase in dataset size). Of course, if the performance delta is within 7-9 percentage points we unfortunately will not be able to approximate our location on the learning curve, since we will be unable to differentiate the bias delta due to using 10-fold CV vs. LOOCV from the improvement resulting from an increase in training data. However, in the unlikely event that the performance delta is very low, i.e. both 10-fold and LOOCV converge to the same estimate, we can conclude that either method is suitable for cross-validation of our algorithm given our sample size, and recommend that future

31 From 50 patients, nested leave-one-out results in 2 hold-outs, for a total training set size of 48 patients. Nested 10-fold cross-validation results in a hold-out of 5 data points for validation, and a further 4.5 (on average) for the second hold-out for model optimization, leaving a total of 40.5 patients for training. (48 − 40.5) / 48 ≈ 15.6%.

work utilize 10-fold CV, take advantage of the associated decreased computational cost, and simply use the data points generated by this work to start plotting the learning curve.

Results and Discussion

Using the methodology detailed in the previous section, we were able to successfully train GLMs, boosted GLMs, RFs, NNets and PCA NNets for each of the outlined feature groups: the CPET feature group, the CPET + Step Data Metrics feature group, and the Step Data Metrics only feature group.

6.3.1 Classification Performance

The final overall validation performance of each of the variant classifiers is tabulated, for completeness, in Table 22, located in Appendix D. For brevity's sake, however, we summarize only the top performing classifiers for each feature group in this chapter. In general, we found that pre-selecting features did not change the classification performance of the models, and although imputing missing data did have an effect on classifier performance, 3 of the 4 best performing models were built by simply excluding incomplete cases as opposed to performing imputation.

The best CPET only classifier (and the third best classifier variant overall), summarized in Figure 6-7, was found to be a simple boosted GLM with no imputed data, either with or without feature pre-selection. The classifier achieved an unbalanced accuracy of 79%, better than the no-information rate of 70%, which translates to a balanced accuracy of 72%. The level of agreement as measured by Cohen's Kappa was moderate (κ=0.47). This classifier is a huge improvement over the hidden Markov model based classifier trained in Chapter 5. That being said, the 47% agreement between the GLM and the physician assigned label is still lower than the lower end of comparable human-level performance; recall that the interrater agreement between physicians was found to be between 54-75%32 [6,26]. Solely

32 The study by Goldman et al. [11] which found a 41% agreement is excluded as their result is not directly comparable since they used a weighted kappa to account for disagreements by more than 1 NYHA class. The other cited studies did not encounter this problem.

based on the performance of this classifier, human performance remains the gold-standard baseline against which to compare the agreement in assessed NYHA functional class.

Unfortunately, the ML classifiers provided with just the step data did not fare as well as the CPET based classifiers. The best of these step data only classifiers, tied between a regular GLM, a boosted GLM and a NNet, all using imputed data and either with or without feature selection, only achieved an unbalanced accuracy of 72% (63% balanced), only marginally higher than the no-information rate of 70%. The low agreement between the classifier and physician assigned labels was also affirmed by the low kappa coefficient (κ=0.28). That being said, the step data GLM/NNet/boosted GLM still performed better than the hidden Markov model based classifier.

Figure 6-7: Performance (confusion matrix vs. physician assigned class) of the best CPET only classifier (boosted GLM; no imputed data; with or without feature pre-selection).

Figure 6-8: Performance of the best step data only classifier ((boosted) GLM/NNet; imputed data; with or without feature pre-selection).

Figure 6-9: Performance of the best and second best CPET + step data classifiers (boosted GLM and random forest; no imputed data; with or without feature pre-selection).

The best performing classifier overall, another boosted GLM which used only complete cases (i.e. no imputed data), either with or without feature selection, used the combination of CPET and step count data to achieve a solid 89% unbalanced accuracy (85% balanced), which was significantly higher than the no-information rate of the dataset (at the 5% level of significance, since P=.02). There was substantial agreement between the machine and physician assigned labels (κ=0.73), approaching that of the best reported human analogues (κ=0.75 [26]).


The second best performing classifier overall was a RF in the same variant class as the best overall GLM (no imputed data, with or without feature pre-selection, and using CPET and step count data). It achieved an equivalent unbalanced accuracy (89%) with a corresponding significance level (compared to the no-information rate), but it had a marginally lower agreement coefficient (κ=0.70) and balanced accuracy (81%).

Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with CPET & step data (with no data imputation)

The receiver operating characteristic (ROC) curve, which graphically represents the sensitivity (true positive rate) and specificity (the mathematical complement of the false positive rate33) trade-off of a classifier, is shown in Figure 6-10 for the best RF and boosted GLM built using CPET and step data. It also includes the NNet, PCA NNet and GLM in the same variant class: no imputed data, with or without feature selection. We can see from this curve that the diagnostic error rate for the boosted

33 i.e. 1 – the false positive rate


GLM is always expected to be more favorable than, or at least as favorable as, that of the RF based classifier, regardless of the discrimination threshold chosen.

As an aside, we can also see from this graph that our choice to use PCA for dimensionality reduction before providing our features to the NNet was well justified, since the PCA NNet shows greatly improved discriminatory ability compared to the pure NNet. This suggests that a NNet might still have use for assessing NYHA functional class, but may require more careful selection of input features, or at least more data, to properly take advantage of its powerful modelling capabilities.

Regardless, both our boosted GLM and RF based CPET + step data classifiers showed improved performance over the classifiers using heart rate variability (HRV) data created by 1) Pecchia et al. [128], a cross-validated classification and regression tree that had moderate agreement (κ=0.57) and good discrimination accuracy (79.3%, unbalanced) on a slightly unbalanced dataset (12:17, 59% severe), and 2) Melillo et al. [136], another classification and regression tree, 10-fold cross-validated, which achieved a marginally better level of agreement (κ=0.60) and discrimination accuracy (85.4%, unbalanced) than Pecchia et al.'s tree, but on a different, more unbalanced dataset (12:32, 73%). Our classifier, however, does not approach the performance of Shahbazi et al.'s [142] leave-one-out cross-validated HRV based k-Nearest Neighbour classifier (with generalized discriminant analysis feature selection), which achieved perfect agreement (κ=1.0) and accuracy (100%) at the classification task (I or II vs. III or IV) on their unbalanced dataset (10:29, 74% severe). We suspect that Shahbazi's classifier may possibly be overfit to their data.

Unfortunately, the practical applications of our classifier are not clear cut. Our early investigation of the combination of data from the relatively more established CPET and the simpler to administer activity tracker monitoring does demonstrate that it is possible to create a classifier that performs comparably to those that use relatively esoteric HRV data. Administering a CPET augmented with two weeks of activity tracker data might therefore prove a useful alternative for clinicians or researchers wishing to objectively assess NYHA functional classification without requiring access to the specialized software and know-how required to perform an HRV analysis. Unfortunately, this alternative still requires the administration of a CPET, which remains an expensive, cumbersome, and labor-intensive ordeal. Furthermore, to achieve near-human levels of classification performance, it presently appears necessary to augment CPET data with activity tracker step data, since neither CPET nor step data alone suffice to achieve reasonable levels of classification agreement. While activity tracker data is less expensive and labor-intensive to collect than CPET data, in its currently investigated form it is associated with at least a two-week delay. Although two weeks is not necessarily longer than the time required to get certain blood or pathology

tests, which can sometimes also take several weeks [268–270], this time delay certainly limits the practical applications of our classifier.

While an obvious next step is to investigate smaller monitoring periods, we suggest that an equally profitable step may be to identify better features in the step count data, and ideally alternate data sources, to reduce the dependence on CPET data outright.

6.3.2 Best Features

As it stands, the top 5 features for the best step count data classifier (GLM) were, in order of decreasing importance: 1) the total 2 week step count, 2) the mean 2 week per minute step count (PMSC), 3) patient weight, 4) the standard error of the 2 week per minute step count (PMSC), and 5) the standard error of the total daily per minute step count.

The features were assessed by summing their weighted importance scores across folds. The raw importance score was computed using the default variable importance scores for the specific model in question, using the varImp function in the caret package [271]. Each of these scores was then scaled to be between 0 and 1 (from least to most important). Therefore, the highest possible importance score is 50, which is possible if a variable scores as most important for all 50 leave-one-out cross-validated folds.
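A minimal sketch of this aggregation in R; `fits` is a hypothetical list of the 50 per-fold caret models:

    library(caret)
    imp <- sapply(fits, function(fit) {
      v <- varImp(fit)$importance       # model-specific raw importance scores
      s <- v[, 1] / max(v[, 1])         # rescale to [0, 1] within the fold
      setNames(s, rownames(v))
    })
    sort(rowSums(imp), decreasing = TRUE)  # summed scores; the maximum possible is 50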

The full ordered list of top features for the step count data only GLM is shown in Figure 6-11. We can see from the graph that very few of the features clearly stood out as being relatively more important; in fact, only the total 2 week step count and the mean 2 week per minute step count scored higher than 25 importance points (of 50). The third scoring feature, weight, is not even a step count metric, and is already known to be not significantly different between classes (P=.21) at the 5% level of significance in this dataset (see Table 10). Given the ML classifier used in this case, a GLM (which is linear regression based), it is not unreasonable to conclude that features at and below this level likely provided increasingly little discriminatory value, which goes a long way towards explaining the relatively low performance of this classifier.


Figure 6-11: Feature importance scores for GLM classifier using only step count data

Unfortunately, at the time of writing, the caret package’s varImp function did not adequately support variable importance analysis for boosted GLMs, the model type of our best performing model and the CPET only model. We instead provide as contrast the top 10 features identified by our second best performing classifier, the CPET + step count data RF classifier. The top 10 features for the RF classifier are shown in Figure 6-12.

Only two of the top 10 features used by the RF classifier are step count derived metrics: 1) the mean of maximum daily per minute step count and 2) the standard deviation of total daily per minute step count.

The remaining 8 features are all CPET features, of which the respiratory exchange ratio peak (RER Peak) is particularly noteworthy, having scored the highest possible importance score of 50 points, indicating that it was voted the single most important feature by every single leave-one-out cross-validated fold. The next most important overall feature (also from the CPET data) is the slope of


minute ventilation (VE) to CO2 output (VCO2) at anaerobic threshold (AT) during CPET (VE/VCO2 Slope @ AT), which scored less than 20 importance points, indicating relatively low importance across folds. The third most important feature, the duration of CPET in fractional minutes (CPET Duration), scored less than 10 importance points.

For reference, weight (the 3rd best feature for the step data only GLM) was found to be only the 31st most important feature for the RF, with a score of 0.878, which would indicate that weight actually has relatively low overall predictive helpfulness. Interestingly, leanness in HF patients has been found to be associated with worse prognostic outcomes, in what is known as the 'obesity paradox' [272–275]. However, more recent findings from a large 300 thousand patient study suggest that this association is likely the result of other unaccounted for confounding factors [276]. This might explain the low ranking of weight (correlated with BMI) in the face of other explanatory variables. The mean 2 week per minute step count and the total 2 week step count, the top two highest scoring features for the GLM trained using only step count data, also scored as being of low importance for the RF classifier: 0.967 and 0.945 respectively.

Figure 6-12: Feature importance scores for random forest classifier using CPET + step count data

127

The RF classifier in fact scored 14 other step count derived features as being more important than these (although none of these 14 others scored any higher than 2.6 points).

It is curious that the step count metrics as a whole appear to be considered relatively unimportant by the classifiers in contributing to the successful assessment of patient NYHA class, yet our holistic analysis of the models indicates that the interaction of the step data metrics with the CPET data notably enhances the overall performance of the classifier.

We suspect that one possible cause of this paradox is that the step data metrics, which were originally selected for their ability to characterize the step count distributions and not for their predictive capacity, are in fact only weakly correlated with NYHA class, noisy, and uncontextualized, and in general only weakly explanatory of NYHA functional class on their own. Furthermore, these metrics are likely also highly intercorrelated. This makes it difficult for a ML algorithm to identify which single metric is most helpful. This is evidenced by the pattern visible in Figure 6-11, where most of the metrics are considered only mildly important with none standing out as specifically important. This pattern, although not shown in Figure 6-12, is also reflected in the RF classifier's scoring, with similar metrics closely neighbouring each other.
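The suspected intercorrelation could be checked quickly along the following lines; this is a sketch only, and `step_features` is a hypothetical data frame with one column per 2 week step count metric.

library(caret)

# Pairwise correlations between the step count metrics.
corr_matrix <- cor(step_features, use = "pairwise.complete.obs")

# findCorrelation flags features whose pairwise correlation exceeds the
# cutoff; a long list here would support the intercorrelation hypothesis.
findCorrelation(corr_matrix, cutoff = 0.9, names = TRUE)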

When framed around CPET data - which helps contextualize and account for some noise in the step count data - some of the step count metrics begin to stand out as being more explanatory (they are in the top 10 features). These features therefore appear to be explaining otherwise unexplained variance in the CPET data. However, feature importance is rated inconsistently between models. Although this is not necessarily unexpected, it may indicate that although the RF classifier assesses these features as important, they are in fact only interpreted as important as a result of the chance subset of training data within the folds. This leads us to an alternative explanation: that the classifier is simply overfit. This is a less compelling explanation than the step data simply being weakly explanatory, since the RF classifier clearly assesses the step count data as being relatively unimportant overall. That being said, the possibility of overfitting cannot be ruled out, but it could be easily checked by computing the variance of the importance scores across the random folds (high overall variance being an indicator of potential overfitting to individual training folds).
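The proposed check is simple to express in R, reusing the hypothetical `folds` list and `fold_importance()` helper sketched earlier:

# Build a features-by-folds matrix of rescaled importance scores.
imp_by_fold <- sapply(folds, fold_importance)

# High variance for a feature means its importance depended heavily on
# which patients happened to land in that training fold - the proposed
# signature of fold-level overfitting.
imp_variance <- apply(imp_by_fold, 1, var)
sort(imp_variance, decreasing = TRUE)[1:10]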

The overall conclusion of our feature analysis, however, is that the step count metrics provided to the ML classifiers for training are generally inadequate and that most of the predictive power resides in the CPET features. In light of the desire to not be dependent on CPET for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, any continuation of this work should therefore seriously consider investing time in identifying and engineering more relevant step count features, as well as adding other data sources like heart rate, which would be complementary to step count and would help contextualize the step data, hopefully reducing the dependence on cumbersome CPET data. However, we also note the lack of impact feature pre-selection had on the performance of our variant models, and suggest that increasing the amount of training data available may be a better approach than pre-trimming the available features. That being said, other researchers have had significant success performing clever feature selection to improve their algorithm performance [142].

6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation

Recall that we cross-validated our classifiers not only with leave-one-out cross-validation (LOOCV), but also with 10-fold CV, to try to approximate our location on the classifier learning curve. Excluding models whose unbalanced accuracy was less than the no-information rate, the smallest difference in performance estimation for 10-fold CV versus LOOCV of the same classifier was 19% (κ|LOOCV = 0.47, κ|10-fold CV = 0.28). The classifier in question, with the smallest estimator difference, was in fact the CPET only classifier discussed in Section 6.3.1. A summary of the performance estimations of this classifier (the CPET Only GLM) is shown in Figure 6-13.

Figure 6-13: Performance of the best model with cross-validation performance difference. Confusion matrices of classifier-assigned (AI) versus physician-assigned NYHA class (II/III) under LOOCV and 10-fold CV. No Information Rate (NIR): 0.70; Unbalanced Accuracy (Acc): 0.79 (LOOCV) vs 0.72 (10-fold); Cohen's Kappa: 0.47 vs 0.28; P-value [Acc > NIR]: 0.12 vs 0.45. Model Type: Boosted GLM; Imputed Data: No; Pre-selected Features: Yes or No; Data Source: CPET Only.
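For readers unfamiliar with how these agreement statistics fall out of a confusion matrix, the following worked computation reproduces the reported κ = 0.28. The matrix entries (classifier rows II: 6, 9 and III: 5, 30 against the physician columns, n = 50) are recovered from the extraction-damaged Figure 6-13, and the assignment of this panel to the 10-fold CV condition is inferred from the reported statistics, so the numbers should be read as approximate:

p_o = (6 + 30) / 50 = 0.72
p_e = [(6 + 9)(6 + 5) + (5 + 30)(9 + 30)] / 50² = (165 + 1365) / 2500 = 0.612
κ = (p_o - p_e) / (1 - p_e) = (0.72 - 0.612) / (1 - 0.612) ≈ 0.28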

The largest and second largest performance differences were associated with the best performing classifier (CPET + Step Data GLM, κ|LOOCV = 0.73, κ|10-fold CV = 0.10) and the second best performing classifier (CPET + Step Data RF, κ|LOOCV = 0.70, κ|10-fold CV = 0.10). It is worth noting that the 10-fold CV versions of these classifiers in fact had unbalanced accuracies (68%) that were marginally less than the associated no-information rate (70%) for the classifiers.


Since, as previously mentioned in Section 6.2.3.2, we expect at most about a 7-9% difference in performance estimation due to the bias of 10-fold CV versus LOOCV, these large differences in performance estimation are clear indications that our model is still highly sensitive to the amount of input data used to train it, and may possibly be overfit to the training data. From a learning curve perspective, these values indicate that we are still at the point on the curve where we are likely to derive significant benefit from adding more training data. Since adding more training data is often an adequate solution to overfitting, an adequate solution in either case is to collect more data. Certainly, we appear to have been justified in using the larger 50 patient dataset for our experiments, as opposed to the 44 patient dataset, despite the associated loss of activity monitor heart rate data.
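For concreteness, the two cross-validation set-ups being compared here differ only in the caret trainControl specification; in the sketch below, `nyha_data` (a data frame with the class label in `nyha`) is an illustrative stand-in, not the thesis dataset.

library(caret)

ctrl_loocv  <- trainControl(method = "LOOCV")
ctrl_10fold <- trainControl(method = "cv", number = 10)

fit_loocv  <- train(nyha ~ ., data = nyha_data, method = "glmboost",
                    trControl = ctrl_loocv)
fit_10fold <- train(nyha ~ ., data = nyha_data, method = "glmboost",
                    trControl = ctrl_10fold)

# A large gap between the two resampled Kappa estimates is the signal
# discussed above.
fit_loocv$results$Kappa
fit_10fold$results$Kappa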

Fortunately, as a result of the activity tracker monitoring upgrade made to Medly, as detailed in Chapter 4, more data (containing both heart rate and step count) is still actively being collected and should soon result in a larger (n > 50) activity monitoring dataset than the one used for the classification experiments in this thesis.

As the dataset increases in size, we suggest that future work performed with the dataset continue to be assessed using both 10-fold and LOOCV until the estimates from these approaches are found to converge. This will not only increase confidence in the performance estimates of the classifiers, but also help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. Furthermore, recording the performance of otherwise identical ML models as the amount of available data continues to increase would permit more accurate mapping of the learning curve than our initial single datapoint [258]. Knowing the actual learning curve associated with this problem would be helpful for diagnosing the source of classifier errors and ascertaining possible future steps to improve algorithm performance, and it would also be helpful for determining the incremental cost/benefit of continuing to collect increasingly more data [258].
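The learning-curve mapping suggested above amounts to retraining an otherwise identical model on growing subsets of the data and recording the cross-validated performance at each size. A minimal sketch, again with the hypothetical `nyha_data`:

library(caret)
set.seed(42)

sizes <- seq(20, nrow(nyha_data), by = 10)
kappa_at_size <- sapply(sizes, function(n) {
  idx <- sample(nrow(nyha_data), n)
  fit <- train(nyha ~ ., data = nyha_data[idx, ], method = "glmboost",
               trControl = trainControl(method = "cv", number = 10))
  max(fit$results$Kappa)
})

# One point per dataset size traces out the empirical learning curve.
plot(sizes, kappa_at_size, type = "b",
     xlab = "Training set size", ylab = "10-fold CV Kappa")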

Summary

To summarize, in this chapter we discussed a method for building cross-sectional machine learning classifiers to assess NYHA functional class using CPET and activity monitoring step data. We chose to investigate some popular starting points for supervised classification problems: Generalized Linear Models (GLM) and a variant thereof, boosted GLMs; Random Forests (RF); and Artificial Neural Networks (NN) and a variant thereof, Principal Component Analysis Neural Networks (PCA NN). We trained multiple variants of each model to investigate the effect of a) performing separate feature selection ahead of model training, b) imputing missing data instead of just dropping incomplete cases, and c) supplying different groups of input predictors to our models for training. Specifically, we investigated the performance of the classifiers when supplied with demographic data and a) just CPET data, b) just the step data metrics investigated in Chapter 3, and c) the combination of both the CPET data and step data metrics.
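For reference, each of these model families maps onto a caret method string, so a sketch of the training calls is compact (tuning grids and pre-processing are omitted here, and `nyha_data` remains a hypothetical stand-in for the real dataset):

library(caret)

# caret method names corresponding to the five investigated model families.
methods <- c(GLM        = "glm",       # generalized linear model
             BoostedGLM = "glmboost",  # boosted GLM variant
             RF         = "rf",        # random forest
             NN         = "nnet",      # single hidden layer neural network
             PCANN      = "pcaNNet")   # PCA pre-processing + neural network

fits <- lapply(methods, function(m)
  train(nyha ~ ., data = nyha_data, method = m,
        trControl = trainControl(method = "cv", number = 10)))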

To properly determine the expected performance of the classifiers in the face of new data, we also cross-validated all the models using 10-fold cross-validation and leave-one-out cross-validation. Since we also optimized the model hyper-parameters and cross-validated these selections, we ended up performing nested 10-fold and nested leave-one-out cross-validation of each of the models.
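The nesting structure can be sketched as an outer leave-one-out loop for performance estimation, with caret's inner resampling handling hyper-parameter selection on each outer training fold. This is an illustrative outline under the same hypothetical `nyha_data`, not a reproduction of the thesis pipeline:

library(caret)

# Outer loop: hold out one patient at a time for performance estimation.
outer_preds <- sapply(seq_len(nrow(nyha_data)), function(i) {
  # Inner loop: caret tunes hyper-parameters by 10-fold CV on the rest.
  fit <- train(nyha ~ ., data = nyha_data[-i, ], method = "glmboost",
               trControl = trainControl(method = "cv", number = 10))
  as.character(predict(fit, newdata = nyha_data[i, ]))
})

# Outer-loop performance estimate, untouched by the hyper-parameter tuning.
confusionMatrix(factor(outer_preds, levels = levels(nyha_data$nyha)),
                nyha_data$nyha)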

In general, we found that pre-selecting features did not change the classification performance of the models, and although imputing missing data sometimes had an effect on classifier performance, 3 of the 4 best performing models (all except the step data only classifier) discussed in this chapter were built by simply excluding incomplete cases as opposed to performing 5-Nearest Neighbour imputation.
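The imputation alternative mentioned above is available directly in caret; a minimal sketch, with `predictor_data` as a hypothetical numeric predictor data frame:

library(caret)

# preProcess learns the 5-nearest-neighbour imputation model; note that
# caret's "knnImpute" also centres and scales the predictors as a side
# effect.
pre <- preProcess(predictor_data, method = "knnImpute", k = 5)
imputed <- predict(pre, newdata = predictor_data)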

The best overall classifier was found to be a boosted GLM, trained using only complete cases of both CPET and step data, which achieved an unbalanced accuracy of 89% (85% balanced) versus a no-information rate of 70%. As a result, this classifier had a substantial level of agreement with the physician assigned NYHA class (κ=0.73). The performance of the classifier was therefore comparable to human level performance (κ=0.75 [26]).
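All of the metrics quoted in this summary are reported together by caret's confusionMatrix, which is one reasonable way to reproduce them; `predicted_class` and `physician_class` below are hypothetical factor vectors of classifier and physician labels.

library(caret)

cm <- confusionMatrix(data = predicted_class, reference = physician_class,
                      positive = "III")
# Accuracy, Cohen's Kappa, the no-information rate (AccuracyNull), and the
# one-sided p-value for accuracy exceeding it:
cm$overall[c("Accuracy", "Kappa", "AccuracyNull", "AccuracyPValue")]
# Balanced accuracy for the two-class problem:
cm$byClass["Balanced Accuracy"]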

The CPET + step data classifier exceeded the baseline level of performance established by the best CPET data only classifier. The best classifier trained with only CPET data (another boosted GLM) achieved an unbalanced accuracy of 79% (72% balanced), which was also better than the no-information rate of 70%. The CPET only classifier therefore showed a moderate level of agreement with the physician assigned label (κ=0.47), which was lower than the lower end of comparable human-level performance (κ=0.54 [6]).

The step data only classifiers (tied between a regular GLM, boosted GLM and NNet) fared much worse, achieving an unbalanced accuracy of 72% (63% balanced) - only marginally higher than the no-information rate of 70% - and with a low level of agreement between the classifier and physician assigned label (κ=0.28).

When comparing which features were considered most important by the classifiers, we found that the step data metrics as a whole were less important than the CPET metrics. We theorized that this is because the step data metrics, which were originally selected for their ability to characterize the step count distributions and not their predictive capacity, are in fact only weakly correlated with NYHA class, noisy, and uncontextualized, and in general only weakly explanatory of NYHA functional class on their own. This makes it difficult for a ML algorithm to use the features effectively for classification. In light of the desire to also not remain dependent on CPET data for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, we suggested that a reasonable next step would be to invest in engineering more relevant step count features. We also recommend adding other data sources like heart rate, which are presumed to be complementary to step count and would help contextualize the step data - hopefully replacing the currently required CPET data.

In comparing the performance estimations from the 10-fold and leave-one-out cross-validation, we found that there was a notable difference between the measurements of agreement (κ), varying from 19-63% for the well performing algorithms, but always in favor of the leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but is likely also largely attributable to the 15% reduction in the already limited data available for training the classifier that results from nesting the 10-fold cross-validation process (compared to nesting leave-one-out cross-validation), and thus more indicative of our location on the learning curve. Regardless, these numbers indicated that there is likely considerable benefit to collecting more training data. We suggested that future work performed with larger datasets should continue to assess performance using 10-fold and LOOCV until the estimates from these approaches are found to converge. This would increase confidence in the performance estimates of the classifiers, as well as help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. We also suggested that, at minimum, keeping the number of folds consistent for cross-validation would be helpful for better mapping out the learning curve for this problem - which would be a helpful tool for diagnosing classifier error and assessing the cost/benefit of continuing to collect more and more data.


Conclusions, Recommendations & Future Work

In this chapter we reflect on this work as a whole, briefly reiterating its major conclusions and findings and providing some recommendations and suggested directions for future work.

Conclusions

The objective of this thesis was to design and develop a means of making New York Heart Association (NYHA) classification more consistent and reliable for the medical research and clinical community. We proposed that a good way to accomplish this objective was to find a means of objectively assessing NYHA functional class. In light of this, we performed a thorough review of the current state of the art for assessing NYHA functional class, including the state of the art in applying machine learning algorithms to the task of classifying patients into their NYHA functional class.

We found that other researchers have already attempted to use machine learning for NYHA functional classification. These attempts, however, used heart rate variability data, which is not necessarily readily accessible or usable by all heart function clinics, nor, at least at present, highly suitable for long-term remote patient monitoring. Remote patient monitoring is a growing trend in the pursuit of more cost-efficient care for chronic conditions, and specifically in the quest to improve patient- and physician-management of the heart failure condition. We proposed that a useful but more accessible data source that would synergize well with remote patient monitoring would be activity tracker data.

We proposed updating an existing remote patient monitoring system with the ability to collect and display activity tracker data, which could then provide data for use by a machine learning algorithm to perform automated assessment of NYHA functional class. For this task we selected Medly, the remote patient monitoring system presently in use at the Toronto General Hospital Heart Function Clinic, as a suitable candidate system. However, since activity tracker data has not seen wide use in actual clinic settings - in fact we only found one small pilot study that investigated the relationship between NYHA class and activity tracker step count - we first replicated the pilot study on a larger dataset that we had available from a previous study performed at our lab. This verified the findings of the pilot study: that NYHA II and NYHA III patients differ significantly by mean daily total step count. Additionally, we discovered that these patients actually differed by various other aggregate measures of step count, including the mean and maximum of the daily per minute step count maximums. Overall, our findings reaffirmed those of the previous pilot study, giving us some additional reassurance that remotely monitored step count might be beneficial for objectively assessing NYHA class. We noted, however, that the recorded step count data was often ambiguous: the fitness trackers used in this study recorded only step count, which did not allow us to differentiate between the wearer being inactive and the tracker simply not being worn. This significantly limited our ability to draw precise practical conclusions from the dataset.

We then proceeded to engineer an upgrade to the Medly remote patient monitoring system to allow it to support activity tracker monitoring data from Fitbit devices, specifically the Fitbit Charge HR 2, which supported collection of both step count and heart rate data (to avoid the ambiguity problems identified in the replication study). Despite delays in the actual implementation of the activity tracker upgrade, we successfully onboarded 44 patients over a 5 month period, with some (3) of the patients even providing their own Fitbit for use with the system. Unfortunately, the patients were found to be only moderately adherent with using the Fitbit, with only around 1/3 to 1/4 of patients (at 3 months and 7 months respectively) having excellent levels of adherence (using the system on average at least 9 of every 10 days). We theorized that the many compromises made to the user experience throughout the implementation process may have detrimentally impacted patient adherence.

Since the effective size of the Medly Fitbit dataset was drastically reduced to 33 patients after removing those patients with less than 1 week of recorded activity, we opted to instead use the dataset investigated as part of the replication study to explore whether it would be possible to assess NYHA class using free-living fitness tracker data. The marginally larger replication dataset we opted to use consisted of 50 patients (35 NYHA class II; 15 NYHA class III), and although it lacked activity monitor heart rate data to complement the step count data, all of the patients in the dataset had recorded cardiopulmonary exercise test data, which we proposed to use to establish a baseline performance level against which to evaluate our classifiers.

We investigated 6 different types of supervised machine learning classifiers to assess NYHA functional classification: a hidden Markov model based classifier, several Generalized Linear Models, boosted Generalized Linear Models, Random Forests, Artificial Neural Networks and Principal Component Analysis Neural Networks.

We found that the hidden Markov model based classifier performed worst overall, and in many cases failed to train properly. The hidden Markov model based classifier we did manage to train had poor agreement (Cohen's Kappa statistic, κ=0.18) between the physician assigned NYHA class and that assigned by the classifier, with a resulting low (unbalanced) accuracy of 58% (assessed on the same data used to train the classifier), which was actually worse than the no-information rate of the dataset (70%).

In contrast, the best overall classifier was found to be a boosted GLM (leave-one-out cross-validated), trained using only complete cases of both CPET and step data, which demonstrated substantial agreement with the physician assigned NYHA class (κ=0.73), comparable to human level performance (κ=0.75 [26]) and better than 2 of the 3 heart rate variability based machine learning classifiers. The level of agreement of our classifier corresponded to an unbalanced accuracy of 89% (85% balanced) against a no-information rate of 70%.

The best classifier trained with only CPET data - our proposed performance baseline - (another boosted GLM) showed a moderate level of agreement with the physician assigned label (κ=0.47), with a corresponding unbalanced accuracy of 79% (72% balanced), again better than the no-information rate of 70%. The performance of this classifier, however, was lower than the reported lower range of human-level performance (κ=0.54 [6]) and as a result, surprisingly, did not dislodge physicians as the gold standard against which to assess NYHA functional class agreement, despite the notoriously high degree of subjectivity in their assessments.

The step data only classifier (tied between a regular GLM, boosted GLM and NNet) fared even worse than the classifier trained with only CPET data, although still better than the hidden Markov model based classifier, achieving an unbalanced accuracy of 72% (63% balanced) - only marginally higher than the no-information rate of 70% - and with a low level of agreement between the classifier and physician assigned label (κ=0.28).

An analysis of the important input features revealed that, of the CPET + step data features investigated, the respiratory exchange ratio was rated the most consistently important. The step data metrics as a whole were found to be less important than the CPET metrics, and were also inconsistent in their ratings of relative importance amongst themselves.

We also found a notable difference between the estimates of the measurements of agreement (|Δκ| ranging from 0.19 to 0.63) generated using 10-fold versus leave-one-out cross-validation for the well performing classifiers, always in favor of leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but is more likely an indication that 10-fold cross-validation caused a severe reduction in the already limited amount of data available for classifier training.

In summary, we found that it is possible to objectively assess NYHA functional classification with a level of performance comparable to that of human physicians by using a combination of CPET and step count data. Although CPET data and step count data were each found to be generally inadequate for performing objective NYHA functional classifications by themselves, this may have been due to the lack of data and the lack of useful and relevant features. In particular, for the step count data metrics, which were originally selected for their ability to characterize the step count distributions and not their predictive capacity, more intentional feature engineering of relevant step count metrics might further improve performance using this data. As well, adding other data sources, for example heart rate data, which is presumed complementary to step count and might help re-contextualize and clean up ambiguity in the data, might further improve classifier performance.

In general, although the machine learning classifiers developed in this work are not yet ready for implementation into a real-life remote patient monitoring system, the classifiers investigated in this thesis certainly show promise for making the assessment of NYHA functional class more objective and by extension more universally consistent and reliable.

Recommendations

In this section we propose several recommendations and ‘lessons learned’ in light of our findings:

1. Avoid activity trackers that label disengagement with the monitoring solution and patient inactivity identically. These contribute significant ambiguity to later data analysis that is often difficult or impossible to reconcile.

2. For data collected remotely from patients, provide a means of helping staff catch and address patient issues in a timely manner, thereby improving the overall quality of the data. For example, adding automated adherence phone calls or reminder notifications (for a smartphone-based application) may improve adherence at little cost.

3. When adding new sources of data to an existing system, either a) begin data collection as soon as possible, improving as required, and collect lots of lower quality data which can be cleaned and noise-corrected post-hoc, or b) fully commit to designing a user experience that will result in high adherence, collecting a smaller amount of high-quality data. Delaying data collection to design an incomplete user experience will likely only result in collecting an insufficient amount of moderate quality data that will be more challenging to analyze.

4. Notwithstanding the above, prefer collecting more data (especially for machine learning applications). While it is possible to build a machine learning classifier with little data, it becomes significantly more difficult to properly assess if the classifier is of good quality.


5. The corollary to 3 and 4 is to invest in data collection infrastructure. Collecting a suitably large dataset can take a long time and should be started well in advance of a proposed research project.

6. Invest time in visualizing and understanding the data being collected. In the case of this thesis, we discovered several limitations in our data, for example the prevalence of 0 step count values, that had drastic implications on model design and development. This could have been addressed in a more timely fashion with foresight derived from a more thorough earlier investigation of the source data.

7. Prefer simpler machine learning classifiers over more complex ones, especially in the face of smaller datasets. Almost all of the best performing classifiers investigated in this thesis were simple generalized linear models or variants thereof.

8. Prefer the use of the R programming language (along with the tidyverse package by H. Wickham [217]) for analysis and visualization of data, but use Python along with the well established scikit-learn library to accelerate creation of the machine learning pipeline required to build and adequately assess a series of machine learning classifiers. Aside from cleaning data, building the machine learning pipeline is one of the most time-consuming parts of a machine learning project.

Future Work

Having outlined some general recommendations and lessons that should be taken from this work, we now provide some suggested future directions:

1. A more thorough study of the characterization of the minute-by-minute step count waveform, for both healthy persons and patients with congestive heart failure, should be undertaken. This would provide very valuable insights for projects investigating the use of fitness trackers for monitoring tasks.

2. Revisit the user interfaces and user experience design of the fitness tracker upgrade applied to Medly. Aside from the fact that the system as-is does not fully honor the best practices and principles outlined in the Fitbit API terms of service, patients using the system are only moderately adherent, which reduces the amount and quality of data being collected for use by patients, by clinicians, and as part of any future quality improvement or research projects. Adding adherence phone calls or reminder notifications would likely provide significant benefit at little cost.

3. Investigate the effects of applying dithering to the training of the HMMBC.


4. Repeat the work performed in this thesis but using the combination of activity tracker step count and heart rate data. The data being collected from Medly patients would be suitable for this purpose once a sufficient number of patients are onboarded onto the upgraded system.

5. Furthermore, investigate the effect of including other data available from the Medly system, such as daily symptoms data, which could potentially help further contextualize patient step count data.

6. Investigate the effect of reducing the analysis window duration for the step count data from 2 weeks to some shorter time period.

7. In a similar vein, investigate activity segmentation with an eye towards using it in combination with a HMMBC (or a more standard cross-sectional ML model).

8. Perform careful manual feature engineering or automated feature extraction to identify more relevant features from available time series data streams (including step count).

9. And finally, regardless of other work performed, continue to assess the cross-validated performance of otherwise identical models as dataset size increases, to better map the learning curve associated with the NYHA functional class supervised classification problem.


References

1. Mehra MR, Butler J. Heart Failure: A Global Pandemic and Not Just a Disease of the West. Heart Fail Clin [Internet] 2015 Oct [cited 2017 Oct 13];11(4):xiii–xiv. PMID:26462110

2. Heart and Stroke Foundation. 2016 Report on the Health of Canadians: The Burden of Heart Failure. 2016 [cited 2016 Oct 29]; Available from: https://www.heartandstroke.ca/-/media/pdf-files/canada/2017-heart-month/heartandstroke-reportonhealth-2016.ashx?la=en&hash=0478377DB7CF08A281E0D94B22BED6CD093C76DB (Archived by WebCite® at http://www.webcitation.org/706UliccA)

3. Seto E, Leonard KJ, Cafazzo JA, Masino C, Barnsley J, Ross HJ. Self-care and quality of life of heart failure patients at a multidisciplinary heart function clinic. J Cardiovasc Nurs [Internet] 2011;26(5):377–85. PMID:21263339

4. Lawrence S. Canada is failing our heart failure patients - Heart and Stroke Foundation of Canada [Internet]. Marketwired. 2016 [cited 2016 Oct 7]. Available from: http://www.marketwired.com/press-release/canada-is-failing-our-heart-failure-patients-2093022.htm (Archived by WebCite® at http://www.webcitation.org/706U7G8oI)

5. Cox J, Naylor CD. The Canadian Cardiovascular Society Grading Scale for Angina Pectoris: Is It Time for Refinements? Ann Intern Med [Internet] American College of Physicians; 1992 Oct 15 [cited 2016 Oct 30];117(8):677. [doi: 10.7326/0003-4819-117-8-677]

6. Raphael C, Briscoe C, Davies J, Whinnett ZI, Manisty C, Sutton R, Mayet J, Francis DP. Limitations of the New York Heart Association functional classification system and self-reported walking distances in chronic heart failure. Heart [Internet] 2007 Apr 1 [cited 2016 Oct 30];93(4):476–482. [doi: 10.1136/hrt.2006.089656]

7. Bennett JA, Riegel B, Bittner V, Nichols J. Validity and reliability of the NYHA classes for measuring research outcomes in patients with cardiac disease. Hear Lung J Acute Crit Care 2002;31(4):262–270. PMID:12122390

8. Heart Foundation. New York Heart Association (NYHA) Classification [Internet]. Heart Foundation; 2014 [cited 2017 Jun 30]. p. 1. Available from: http://www.heartonline.org.au/media/DRL/New_York_Heart_Association_(NYHA)_classification.pdf

9. American Heart Association. Classes of Heart Failure [Internet]. 2015 [cited 2016 Oct 30]. Available from: http://www.heart.org/HEARTORG/Conditions/HeartFailure/AboutHeartFailure/Classes-of-Heart-Failure_UCM_306328_Article.jsp#.WvyuQYgvyiN (Archived by WebCite® at http://www.webcitation.org/6zT3C5Rpx)

10. Ahmed A, Aronow WS, Fleg JL. Higher New York Heart Association classes and increased mortality and hospitalization in patients with heart failure and preserved left ventricular function. Am Heart J [Internet] NIH Public Access; 2006 Feb [cited 2017 Oct 30];151(2):444–50. PMID:16442912

11. Goldman L, Hashimoto B, Cook EF, Loscalzo A. Comparative reproducibility and validity of systems for assessing cardiovascular functional class: advantages of a new specific activity scale. Circulation [Internet] 1981;64(6):1227–1234. PMID:7296795

12. Williams BA, Doddamani S, Troup MA, Mowery AL, Kline CM, Gerringer JA, Faillace RT. Agreement between heart failure patients and providers in assessing New York Heart Association functional class. Hear Lung J Acute Crit Care [Internet] Elsevier Inc; 2017 Jul 1 [cited 2017 Oct 30];46(4):293–299. PMID:28558929

13. Moayedi Y, Abdulmajeed R, Duero Posada J, Foroutan F, Alba AC, Cafazzo J, Ross HJ. Assessing the Use of Wrist-Worn Devices in Patients With Heart Failure: Feasibility Study. JMIR Cardio [Internet] JMIR Cardio; 2017 Dec 19 [cited 2018 Jan 25];1(2):8. [doi: 10.2196/cardio.8301]

14. Savarese G, Lund LH. Global Public Health Burden of Heart Failure. Card Fail Rev [Internet] Radcliffe Cardiology; 2017 Apr [cited 2018 Jun 4];3(1):7–11. PMID:28785469

15. University of Toronto Faculty of Medicine. The State of the Heart in Canada [Internet]. 2014. Available from: http://medicine.utoronto.ca/sites/default/files/TRCHR_StateOfHeart_Infographsm.png

16. cardiac insufficiency. McGraw-Hill Concise Dict Mod Med [Internet] The McGraw-Hill Companies, Inc.; 2018 [cited 2018 Jul 21]. Available from: https://medical-dictionary.thefreedictionary.com/cardiac+insufficiency


17. Aird WC. Discovery of the cardiovascular system: From Galen to William Harvey. J Thromb Haemost [Internet] 2011;9(1 S):118–129. PMID:21781247

18. Silverthorn DU, Johnson BR, Ober WC, Garrison CW, Silverthorn AC. Blood Flow and the Control of Blood Pressure. Hum Physiol An Integr Approach 5th ed Pearson Benjamin Cummings; 2009. p. 512–545.

19. Shah SJ. Heart Failure (HF) [Internet]. Merck Manuals Prof Ed. 2017 [cited 2018 Jul 21]. Available from: https://www.merckmanuals.com/en-ca/professional/cardiovascular- disorders/heart-failure/heart-failure-hf

20. Azevedo PS, Polegato BF, Minicucci MF, Paiva SAR, Zornoff LAM. Cardiac Remodeling: Concepts, Clinical Impact, Pathophysiological Mechanisms and Pharmacologic Treatment. Arq Bras Cardiol [Internet] Arquivos Brasileiros de Cardiologia; 2016 Jan [cited 2018 Jul 21];106(1):62– 9. PMID:26647721

21. Laflamme MA, Murry CE. Heart regeneration. Nature [Internet] NIH Public Access; 2011 May 19 [cited 2018 Jul 21];473(7347):326–35. PMID:21593865

22. National Heart Foundation of Australia and the Cardiac Society of Australia and New Zealand (Chronic Heart Failure Guidelines Expert Writing Panel). Guidelines for the prevention, detection and management of chronic heart failure in Australia. 2011 [cited 2018 May 10];84. Available from: https://www.heartfoundation.org.au/images/uploads/publications/Chronic_Heart_Failure_Guidelines_2011.pdf

23. The Criteria Committee of the New York Heart Association. Classification of Functional Capacity and Objective Assessment [Internet]. 9th ed. Nomencl Criteria Diagnosis Dis Hear Gt Vessel. Boston, Mass.: Little, Brown and Co.; 1994 [cited 2017 Oct 13]. Available from: http://professional.heart.org/professional/General/UCM_423811_Classification-of-Functional-Capacity-and-Objective-Assessment.jsp

24. Rostagno C, Galanti G, Comeglio M, Boddi V, Olivo G, Gastone G, Serneri N. Comparison of different methods of functional evaluation in patients with chronic heart failure. Eur J Heart Fail [Internet] 2000 [cited 2018 Jun 4];2:273–280. Available from: https://onlinelibrary.wiley.com/doi/pdf/10.1016/S1388-9842(00)00091-X

25. Carroll SL, Harkness K, Mcgillion MH. A Comparison of the NYHA Classification and the Duke Treadmill Score in Patients with Cardiovascular Disease. Open J Nurs [Internet] 2014 [cited 2017 Nov 3];4:774–783. [doi: 10.4236/ojn.2014.411083]

26. Christensen HW, Haghfelt T, Vach W, Johansen A, Hoilund-Carlsen PF. Observer reproducibility and validity of systems for clinical classification of angina pectoris: comparison with radionuclide imaging and coronary angiography. Clin Physiol Funct Imaging [Internet] Blackwell Science Ltd; 2006 Jan [cited 2017 Nov 6];26(1):26–31. [doi: 10.1111/j.1475-097X.2005.00643.x]

27. Kubo SH, Schulman S, Starling RC, Jessup M, Wentworth D, Burkhoff D. Development and validation of a patient questionnaire to determine New York heart association classification. J Card Fail [Internet] Churchill Livingstone; 2004 [cited 2017 Nov 3];10(3):228–235. [doi: 10.1016/J.CARDFAIL.2003.10.005]

28. McHugh ML. Interrater reliability: the kappa statistic. Biochem medica [Internet] Croatian Society for Medical Biochemistry and Laboratory Medicine; 2012 [cited 2018 Aug 25];22(3):276–82. PMID:23092060

29. Sallis JF, Saelens BE. Assessment of Physical Activity by Self-Report: Status, Limitations, and Future Directions. Res Q Exerc Sport. 2015 [cited 2018 Jul 24]; [doi: 10.1080/02701367.2000.11082780]

30. Okura Y, Urban LH, Mahoney DW, Jacobsen SJ, Rodeheffer RJ. Agreement between self-report questionnaires and medical record data was substantial for diabetes, hypertension, myocardial infarction and stroke but not for heart failure. J Clin Epidemiol [Internet] Pergamon; 2004 Oct 1 [cited 2018 Jul 24];57(10):1096–1103. [doi: 10.1016/J.JCLINEPI.2004.04.005]

31. Baranowski T. Validity and Reliability of Self Report Measures of Physical Activity: An Information-Processing Perspective. Res Q Exerc Sport [Internet] 1988 [cited 2018 Jul 24];59(4):314–327. [doi: 10.1080/02701367.1988.10609379]

32. Balady GJ, Arena R, Sietsema K, Myers J, Coke L, Fletcher GF, Forman D, Franklin B, Guazzi M, Gulati M, Keteyian SJ, Lavie CJ, Macko R, Mancini D, Milani R V. AHA Scientific Statement Clinician’s Guide to Cardiopulmonary Exercise Testing in Adults A Scientific Statement From the American Heart Association. Am Hear Assoc Exerc Clin Cardiol Counc Epidemiol Prev [Internet] [cited 2017 May 2]; [doi: 10.1161/CIR.0b013e3181e52e69]

33. Uth N, Sørensen H, Overgaard K, Pedersen PK. Estimation of VO2max from the Ratio between HRmax and HRrest - the Heart Rate Ratio Method. Eur J Appl Physiol [Internet] 2004 [cited 2017 May 2];91(1):111–115. [doi: 10.1007/s00421-003-0988-y]

34. Kline GM, Porcari JP, Hintermeister R, Freedson PS, Ward A, McCarron RF, Ross J, Rippe JM. Estimation of VO2max from a one-mile track walk, gender, age, and body weight. Med Sci Sports Exerc [Internet] 1987 Jun [cited 2017 May 2];19(3):253–9. PMID:3600239

35. Cooper KH. Aerobics. Bantam Books; 1969. ISBN:9780553144901

36. Saalasti S, Pulkkinen A. Method and system for determining the fitness index of a person [Internet]. United States Patent Office; 2012 [cited 2017 May 2]. Available from: https://www.google.com/patents/US20140088444

37. Butte NF, Ekelund U, Westerterp KR. Assessing Physical Activity Using Wearable Monitors: Measures of Physical Activity. Med Sci Sport Exerc [Internet] 2012 [cited 2017 Jun 15];44(1S):5– 12. [doi: 10.1249/MSS.0b013e3182399c0e]

38. ap507. Study shows slow walking pace is good predictor of heart-related deaths — University of Leicester [Internet]. Univ Leicester News. 2017 [cited 2017 Aug 30]. Available from: https://www2.le.ac.uk/news/blog/2017-archive/august/study-shows-slow-walking-pace-good-predictor-of-heart-related-deaths

39. Zhao S, Chen K, Su Y, Hua W, Chen S, Liang Z, Xu W, Dai Y, Liu Z, Fan X, Hou C, Zhang S. Association between patient activity and long-term cardiac death in patients with implantable cardioverter-defibrillators and cardiac resynchronization therapy defibrillators. Eur J Prev Cardiol [Internet] 2017;24(7):760–767. [doi: 10.1177/2047487316688982]

40. Roul G, Germain P, Bareiss P. Does the 6-minute walk test predict the prognosis in patients with NYHA class II or III chronic heart failure? Am Heart J [Internet] 1998 Sep [cited 2017 Jun 30];136(3):449–457. [doi: 10.1016/S0002-8703(98)70219-4]

41. Abdulmajeed R. The Use of Continuous Monitoring of Heart Rate as a Prognosticator of Readmission in Heart Failure Patients. University of Toronto; 2016.

42. Eapen ZJ, Turakhia MP, McConnell MV, Graham G, Dunn P, Tiner C, Rich C, Harrington RA, Peterson ED, Wayte P. Defining a Mobile Health Roadmap for Cardiovascular Health and Disease. J Am Heart Assoc [Internet] 2016 Jul 12 [cited 2016 Oct 30];5(7):e003119. [doi: 10.1161/JAHA.115.003119]

43. Wen D, Zhang X, Liu X, Lei J. Evaluating the Consistency of Current Mainstream Wearable Devices in Health Monitoring: A Comparison Under Free-Living Conditions. J Med Internet Res [Internet] Journal of Medical Internet Research; 2017 Mar 7 [cited 2017 Mar 9];19(3):e68. PMID:28270382

44. El-Amrawy F, Nounou MI, Volpp K, Patel M, Lin N, Lewis R. Are Currently Available Wearable Devices for Activity Tracking and Heart Rate Monitoring Accurate, Precise, and Medically Beneficial? Healthc Inform Res [Internet] Apress Media; 2015 [cited 2017 Jul 7];21(4):315. [doi: 10.4258/hir.2015.21.4.315]

45. An H-S, Jones GC, Kang S-K, Welk GJ, Lee J-M. How valid are wearable physical activity trackers for measuring steps? Eur J Sport Sci [Internet] Routledge; 2017 Mar 16 [cited 2017 Jul 12];17(3):360–368. [doi: 10.1080/17461391.2016.1255261]

46. Bromberg SE. Consumer Wristband Activity Monitors as a Simple and Inexpensive Tool for Remote Heart Failure Monitoring. 2015.

47. Abeles A, Kwasnicki RM, Pettengell C, Murphy J, Darzi A. The relationship between physical activity and post-operative length of hospital stay: A systematic review. Int J Surg [Internet] 2017 Jul [cited 2017 Jul 12]; [doi: 10.1016/j.ijsu.2017.06.085]

48. Bornstein DB, Beets MW, Byun W, Welk G, Bottai M, Dowda M, Pate R. Equating accelerometer estimates of moderate-to-vigorous physical activity: In search of the Rosetta Stone. J Sci Med Sport [Internet] BioMed Central; 2011 Sep [cited 2017 Jul 12];14(5):404–410. [doi: 10.1016/j.jsams.2011.03.013]

49. Awais M, Mellone S, Chiari L. Physical activity classification meets daily life: Review on existing methodologies and open challenges. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS 2015;2015–Novem:5050–5053. PMID:26737426

50. Jehn M, Prescher S, Koehler K, Von Haehling S, Winkler S, Deckwart O, Honold M, Sechtem U, Baumann G, Halle M, Anker SD, Koehler F. Tele-accelerometry as a novel technique for assessing functional status in patients with heart failure: Feasibility, reliability and patient safety. Int J Cardiol [Internet] 2013 [cited 2017 Sep 5];168:4723–4728. [doi: 10.1016/j.ijcard.2013.07.171]


51. Demers C, McKelvie RS, Negassa A, Yusuf S. Reliability, validity, and responsiveness of the six- minute walk test in patients with heart failure. Am Heart J 2001;142(4):698–703. PMID:11579362

52. Guazzi M, Myers J, Arena R. Cardiopulmonary Exercise Testing in the Clinical and Prognostic Assessment of Diastolic Heart Failure. J Am Coll Cardiol [Internet] Elsevier; 2005 Nov 15 [cited 2018 Jul 25];46(10):1883–1890. [doi: 10.1016/J.JACC.2005.07.051]

53. Albouaini K, Egred M, Alahmar A, Wright DJ. Cardiopulmonary exercise testing and its application. Postgrad Med J [Internet] BMJ Group; 2007 Nov [cited 2016 Sep 20];83(985):675–82. PMID:17989266

54. Chatterjee S, Sengupta S, Nag M, Kumar P, Goswami S, Rudra A. Cardiopulmonary Exercise Testing: A Review of Techniques and Applications. 2013 [cited 2018 Jul 25]; [doi: 10.4172/2155-6148.1000340]

55. Mehra MR, Canter CE, Hannan MM, Semigran MJ, Uber PA, Baran DA, Danziger-Isakov L, Kirklin JK, Kirk R, Kushwaha SS, Lund LH, Potena L, Ross HJ, Taylor DO, Verschuuren EAM, Zuckermann A. The 2016 International Society for Heart Lung Transplantation listing criteria for heart transplantation: A 10-year update. [cited 2018 Jun 2]; [doi: 10.1016/j.healun.2015.10.023]

56. Lim FY, Yap J, Gao F, Teo LL, Lam CSP, Yeo KK. Correlation of the New York Heart Association classification and the cardiopulmonary exercise test: A systematic review. Int J Cardiol [Internet] Elsevier; 2018 Jul 15 [cited 2018 Jun 4];263:88–93. [doi: 10.1016/J.IJCARD.2018.04.021]

57. Fitbit Inc. Fitbit Official Site for Activity Trackers & More [Internet]. 2016. Available from: https://www.fitbit.com/en-ca/home (Archived by WebCite® at http://www.webcitation.org/6zTITrK95)

58. Fitbit Inc. Fitbit Charge 2TM Heart Rate + Fitness Wristband [Internet]. 2018 [cited 2018 Apr 17]. Available from: https://client.fitbit.com/en-ca/charge2 (Archived by WebCite® at http://www.webcitation.org/6zTIzBoj5)

59. Fitbit Inc. Fitbit Flex [Internet]. [cited 2018 Apr 17]. Available from: https://client.fitbit.com/en-ca/shop/flex (Archived by WebCite® at http://www.webcitation.org/6zTIrGkAE)

60. Bromberg SE. Consumer wristband activity monitors as a simple and inexpensive tool for remote heart failure monitoring [Internet]. [Toronto]: University of Toronto; 2015. Available from: http://hdl.handle.net/1807/70232

61. Piwek L, Ellis DA, Andrews S, Joinson A. The Rise of Consumer Health Wearables: Promises and Barriers. PLoS Med [Internet] Public Library of Science; 2016 Feb [cited 2016 Sep 20];13(2):e1001953. PMID:26836780

62. Attal F, Mohammed S, Dedabrishvili M, Chamroukhi F, Oukhellou L, Amirat Y. Physical Human Activity Recognition Using Wearable Sensors. Sensors (Basel) [Internet] 2015;15(12):31314–38. PMID:26690450

63. James CJ. Editorial: “Longer term monitoring through wearables brings with it the promise of predicting the onset of disease - moving from managing illness to maintaining wellness.”. Healthc Technol Lett [Internet] IET: Institution of Engineering and Technology; 2015 Feb [cited 2016 Sep 20];2(1):1. PMID:26609395

64. Apple Inc. Watch - Apple (CA) [Internet]. 2016. Available from: https://www.apple.com/ca/watch/

65. Storm FA, Heller BW, Mazzà C. Step detection and activity recognition accuracy of seven physical activity monitors. PLoS One [Internet] Public Library of Science; 2015 [cited 2018 May 7];10(3):e0118723. PMID:25789630

66. Fitbit Inc. Help article: How does my Fitbit device count steps? [Internet]. Fitbit Help. 2017 [cited 2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1143

67. Diaz KM, Krupka DJ, Chang MJ, Peacock J, Ma Y, Goldsmith J, Schwartz JE, Davidson KW. Fitbit: An accurate and reliable device for wireless physical activity tracking. Int J Cardiol. 2015. PMID:25795203

68. Evenson KR, Goto MM, Furberg RD. Systematic review of the validity and reliability of consumer-wearable activity trackers. Int J Behav Nutr Phys Act [Internet] 2015 Dec 18 [cited 2017 May 18];12(1):159. PMID:26684758

69. Al M. Personalization of energy expenditure and cardiorespiratory fitness estimation using wearable sensors in supervised and ... Eindhoven University of Technology; 2015.

70. Straiton N, Alharbi M, Bauman A, Neubeck L, Gullick J, Bhindi R, Gallagher R. The validity and reliability of consumer-grade activity trackers in older, community-dwelling adults: A systematic review. Maturitas [Internet] Elsevier; 2018 Jun 1 [cited 2018 Jul 30];112:85–93. [doi: 10.1016/J.MATURITAS.2018.03.016]

71. ActiGraph Corporation. ActiGraph [Internet]. [cited 2018 Jul 30]. Available from: https://www.actigraphcorp.com/

72. Fitbit Inc. Help article: What should I know about my heart rate data? [Internet]. Fitbit Help. 2017 [cited 2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1565

73. Kroll RR, Boyd JG, Maslove DM. Accuracy of a Wrist-Worn Wearable Device for Monitoring Heart Rates in Hospital Inpatients: A Prospective Observational Study. J Med Internet Res [Internet] 2016 [cited 2016 Sep 22];18(9):e253. PMID:27651304

74. Ra H-K, Ahn J, Yoon HJ, Yoon D, Son SH, Ko J. I am a "Smart" watch, Smart Enough to Know the Accuracy of My Own Heart Rate Sensor. [cited 2017 May 15]; [doi: 10.1145/3032970.3032977]

75. Allen J. Photoplethysmography and its application in clinical physiological measurement. Physiol Meas [Internet] 2007 [cited 2017 Nov 7];28:1–39. [doi: 10.1088/0967-3334/28/3/R01]

76. Alian AA, Shelley KH. Photoplethysmography. Best Pract Res Clin Anaesthesiol [Internet] Baillière Tindall; 2014 Dec 1 [cited 2018 Jul 30];28(4):395–406. [doi: 10.1016/J.BPA.2014.08.006]

77. Maeda Y, Sekine M, Tamura T. The Advantages of Wearable Green Reflected Photoplethysmography. J Med Syst [Internet] 2011 Oct 18 [cited 2018 Jul 30];35(5):829–834. PMID:20703690

78. Wang R, Blackburn G, Desai M, Phelan D, Gillinov L, Houghtaling P, Gillinov M. Accuracy of Wrist-Worn Heart Rate Monitors. JAMA Cardiol [Internet] 2016 Oct 12 [cited 2016 Nov 10];313(6):625–626. [doi: 10.1001/jamacardio.2016.3340]

79. Cadmus-Bertram L, Gangnon R, Wirkus EJ, Thraen-Borowski KM, Gorzelitz-Liebhauser J. The Accuracy of Heart Rate Monitoring by Some Wrist-Worn Activity Trackers. Ann Intern Med [Internet] 2017;10–13. PMID:28395305

80. Cardioo Inc. Cardiio: Heart Rate Monitor (iOS App) [Internet]. Apple Inc; 2012. Available from: https://itunes.apple.com/ca/app/cardiio-heart-rate-monitor/id542891434?mt=8


81. Laskowski ER. Heart rate: What's normal? [Internet]. Mayo Clin. 2015 [cited 2018 Jul 31]. Available from: https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/heart-rate/faq-20057979

82. American Heart Association. All About Heart Rate (Pulse) [Internet]. Am Hear Assoc Website. 2015 [cited 2018 Jul 31]. Available from: https://www.heart.org/en/health-topics/high-blood-pressure/the-facts-about-high-blood-pressure/all-about-heart-rate-pulse#.Wg1mcBO0OCU

83. Low CA, Bovbjerg DH, Ahrendt S, Choudry MH, Holtzman M, Jones HL, Pingpank JF, Ramalingam L, Zeh HJ, Zureikat AH, Bartlett DL. Fitbit step counts during inpatient recovery from cancer surgery as a predictor of readmission. Ann Behav Med [Internet] Oxford University Press; 2018 Jan 5 [cited 2018 Jul 26];52(1):88–92. [doi: 10.1093/abm/kax022]

84. Hartman SJ, Nelson SH, Weiner LS. Patterns of Fitbit Use and Activity Levels Throughout a Physical Activity Intervention: Exploratory Analysis from a Randomized Controlled Trial. JMIR mHealth uHealth [Internet] JMIR mHealth and uHealth; 2018 Feb 5 [cited 2018 Mar 8];6(2):e29. PMID:29402761

85. Wicklund E. Hospital’s mHealth Project Finds Value in Fitbit Data [Internet]. mHealthIntelligence. 2016 [cited 2018 Jul 26]. Available from: https://mhealthintelligence.com/news/hospitals-diabetes-mhealth-project-finds-value-in-fitbit-data

86. Apple Inc. Apple Heart Study launches to identify irregular heart rhythms [Internet]. Apple Newsroom. 2017 [cited 2018 Jul 31]. Available from: https://www.apple.com/newsroom/2017/11/apple-heart-study-launches-to-identify-irregular-heart-rhythms/

87. Eadicicco L. EXCLUSIVE: Fitbit Working On Atrial Fibrillation Detection [Internet]. Time. 2017 [cited 2018 Jul 31]. Available from: http://time.com/4907284/fitbit-detect-atrial-fibrillation/

88. Griffith E. When Your Fitbit Goes From Activity Tracker to Personal Medical Device [Internet]. Wired. 2018 [cited 2018 Jul 26]. Available from: https://www.wired.com/story/when-your-activity-tracker-becomes-a-personal-medical-device/

89. Field MJ, Grigsby J. Telemedicine and Remote Patient Monitoring. JAMA [Internet] American Medical Association; 2002 Jul 24 [cited 2018 Aug 1];288(4):423. [doi: 10.1001/jama.288.4.423]


90. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253

91. Noah B, Keller MS, Mosadeghi S, Stein L, Johl S, Delshad S, Tashjian VC, Lew D, Kwan JT, Jusufagic A, Spiegel BMR. Impact of remote patient monitoring on clinical outcomes: an updated meta-analysis of randomized controlled trials. npj Digit Med [Internet] Nature Publishing Group; 2018 Dec 15 [cited 2018 Aug 1];1(1):20172. [doi: 10.1038/s41746-017-0002-4]

92. Hanlon P, Daines L, Campbell C, McKinstry B, Weller D, Pinnock H. Telehealth Interventions to Support Self-Management of Long-Term Conditions: A Systematic Metareview of Diabetes, Heart Failure, Asthma, Chronic Obstructive Pulmonary Disease, and Cancer. J Med Internet Res [Internet] Journal of Medical Internet Research; 2017 May 17 [cited 2017 May 18];19(5):e172. [doi: 10.2196/jmir.6688]

93. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253

94. Clark RA, Inglis SC, McAlister FA, Cleland JGF, Stewart S. Telemonitoring or structured telephone support programmes for patients with chronic heart failure: systematic review and meta- analysis. BMJ [Internet] 2007 May 5 [cited 2018 Apr 4];334(7600):942. PMID:17426062

95. Ware P, Ross HJ, Cafazzo JA, Laporte A, Gordon K, Seto E. Evaluating the Implementation of a Mobile Phone–Based Telemonitoring Program: Longitudinal Study Guided by the Consolidated Framework for Implementation Research. JMIR mHealth uHealth [Internet] JMIR mHealth and uHealth; 2018 Jul 31 [cited 2018 Aug 1];6(7):e10768. [doi: 10.2196/10768]

96. Yun JE, Park J-E, Park H-Y, Lee H-Y, Park D-A. Comparative Effectiveness of Telemonitoring Versus Usual Care for Heart Failure: A Systematic Review and Meta-analysis. J Card Fail [Internet] 2018 Jan [cited 2018 Aug 1];24(1):19–28. [doi: 10.1016/j.cardfail.2017.09.006]

97. Klersy C, De Silvestri A, Gabutti G, Raisaro A, Curti M, Regoli F, Auricchio A. Economic impact of remote patient monitoring: an integrated economic model derived from a meta-analysis of randomized controlled trials in heart failure. Eur J Heart Fail [Internet] Wiley-Blackwell; 2011 Apr 1 [cited 2018 Aug 1];13(4):450–459. [doi: 10.1093/eurjhf/hfq232]

98. Ong MK, Romano PS, Edgington S, Aronow HU, Auerbach AD, Black JT, De Marco T, Escarce JJ, Evangelista LS, Hanna B, Ganiats TG, Greenberg BH, Greenfield S, Kaplan SH, Kimchi A, Liu H, Lombardo D, Mangione CM, Sadeghi B, Sadeghi B, Sarrafzadeh M, Tong K, Fonarow GC. Effectiveness of Remote Patient Monitoring After Discharge of Hospitalized Patients With Heart Failure. JAMA Intern Med [Internet] American Medical Association; 2016 Mar 1 [cited 2018 Aug 1];176(3):310. [doi: 10.1001/jamainternmed.2015.7712]

99. Chaudhry SI, Mattera JA, Curtis JP, Spertus JA, Herrin J, Lin Z, Phillips CO, Hodshon B V., Cooper LS, Krumholz HM. Telemonitoring in Patients with Heart Failure. N Engl J Med [Internet] Massachusetts Medical Society ; 2010 Dec 9 [cited 2018 Aug 1];363(24):2301–2309. [doi: 10.1056/NEJMoa1010029]

100. Ware P, Seto E, Ross HJ. Accounting for Complexity in Home Telemonitoring: A Need for Context-Centred Evidence. Can J Cardiol [Internet] Elsevier; 2018 Jul 1 [cited 2018 Aug 1];34(7):897–904. [doi: 10.1016/J.CJCA.2018.01.022]

101. Centre for Global eHealth Innovation. Medly - Chronic Complex Diseases Self-care Management [Internet]. 2016 [cited 2016 Oct 30]. Available from: http://ehealthinnovation.org/what-we-do/projects/medly/

102. Healthcare Human Factors. Medly: Managing Chronic Conditions [Internet]. 2016 [cited 2016 Oct 30]. Available from: http://humanfactors.ca/projects/medly/

103. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Mobile phone-based telemonitoring for heart failure management: a randomized controlled trial. J Med Internet Res 2012;14(1):1–14. PMID:22356799

104. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Developing healthcare rule- based expert systems: Case study of a heart failure telemonitoring system. Int J Med Inform [Internet] Elsevier Ireland Ltd; 2012;81(8):556–565. PMID:22465288

105. Seto E, Leonard KJ, Masino C, Cafazzo JA, Barnsley J, Ross HJ. Attitudes of heart failure patients and health care providers towards mobile phone-based remote monitoring. J Med Internet Res 2010;12(4):3–12. PMID:21115435


106. Smith C, McGuire B, Huang T, Yang G. The History of Artificial Intelligence [Internet]. Seattle: University of Washington; 2006 [cited 2018 Apr 4]. p. 27. Available from: https://courses.cs.washington.edu/courses/csep590/06au/projects/history-ai.pdf

107. Anyoha R. The History of Artificial Intelligence [Internet]. Sci News. 2017 [cited 2018 Aug 4]. Available from: http://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/

108. McCarthy J, Minsky ML, Rochester N, Shannon CE. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence [Internet]. Dartmouth; 1955 [cited 2018 Aug 4]. Available from: http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html

109. Coward C. AI and the Ghost in the Machine [Internet]. hackaday. 2017 [cited 2018 Aug 4]. Available from: https://hackaday.com/2017/02/06/ai-and-the-ghost-in-the-machine/

110. Shu-Hsien Liao. Expert system methodologies and applications—a decade review from 1995 to 2004. Expert Syst Appl [Internet] Pergamon; 2005 Jan 1 [cited 2018 Aug 4];28(1):93–103. [doi: 10.1016/J.ESWA.2004.08.003]

111. Segaran T. Programming collective intelligence : building smart web 2.0 applications. O’Reilly; 2007. ISBN:9780596529321

112. Brownlee J. Supervised and Unsupervised Machine Learning Algorithms [Internet]. Mach Learn Mastery. 2016 [cited 2018 Aug 6]. Available from: https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

113. Alpaydin E. Introduction to Machine Learning (Adaptive Computation and Machine Learning) [Internet]. MIT Press; 2004 [cited 2018 Aug 6]. Available from: https://dl.acm.org/citation.cfm?id=1036287 ISBN:0262012111

114. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, Van Den Driessche G, Graepel T, Hassabis D. Mastering the game of Go without human knowledge. Nature [Internet] Nature Publishing Group; 2017;550(7676):354–359. PMID:29052630

115. OpenAI Five [Internet]. OpenAI. 2018 [cited 2018 Aug 6]. Available from: https://blog.openai.com/openai-five/

116. Savov V. The OpenAI Dota 2 bots just defeated a team of former pros [Internet]. The Verge. 2018 [cited 2018 Aug 6]. Available from: https://www.theverge.com/2018/8/6/17655086/dota2-openai-bots-professional-gaming-ai

117. Thompson T. Zerg Rush: A History of StarCraft AI Research [Internet]. Medium. 2018 [cited 2018 Aug 6]. Available from: https://medium.com/@t2thompson/zerg-rush-a-history-of-starcraft-ai-research-4478759a3c53

118. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE [Internet] 1989 [cited 2017 Aug 28];77(2):257–286. [doi: 10.1109/5.18626]

119. Visser I, Raijmakers MEJ, van der Maas HLJ. Hidden Markov Models for Individual Time Series. In: Valsiner J, Molenaar PCM, Lyra MCDP, Chaudhary N, editors. Dyn Process Methodol Soc Dev Sci 2009. p. 269–289. PMID:25246403

120. Iskandar J. RPubs - Classifying Seizure State (using R package depmixS4) [Internet]. RPubs; 2014 [cited 2017 Aug 30]. p. 6. Available from: https://rpubs.com/jimmyiskandar/30484

121. Mannini A, Sabatini AM. Machine Learning Methods for Classifying Human Physical Activity from On-Body Accelerometers. Sensors [Internet] Molecular Diversity Preservation International; 2010 Feb 1 [cited 2017 Aug 22];10(2):1154–1175. [doi: 10.3390/s100201154]

122. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak [Internet] 2012 Dec 15 [cited 2017 Oct 7];12(1):8. [doi: 10.1186/1472-6947-12-8]

123. Brownlee J. How Much Training Data is Required for Machine Learning? [Internet]. Mach Learn Mastery. 2017 [cited 2017 Oct 7]. Available from: https://machinelearningmastery.com/much-training-data-required-machine-learning/

124. Denham L. Aren’t The IoT, Big Data And Machine Learning The Same? [Internet]. Innov Enterp. 2017 [cited 2018 Aug 20]. Available from: https://channels.theinnovationenterprise.com/articles/aren-t-the-iot-big-data-and-machine-learning-the-same

125. Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification models. Anal Chim Acta [Internet] Elsevier; 2013 Jan 14 [cited 2017 Oct 7];760:25–33. [doi: 10.1016/j.aca.2012.11.007]

126. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol [Internet] 2014 Dec 22 [cited 2018 Aug 20];14(1):137. [doi: 10.1186/1471-2288-14-137]

127. Tripoliti EE, Papadopoulos TG, Karanasiou GS, Naka KK, Fotiadis DI. Heart Failure: Diagnosis, Severity Estimation and Prediction of Adverse Events Through Machine Learning Techniques. Comput Struct Biotechnol J [Internet] 2017 [cited 2017 Oct 7];15:26–47. [doi: 10.1016/j.csbj.2016.11.001]

128. Pecchia L, Melillo P, Bracale M. Remote Health Monitoring of Heart Failure With Data Mining via CART Method on HRV Features. IEEE Trans Biomed Eng [Internet] 2011 Mar [cited 2018 Aug 6];58(3):800–804. [doi: 10.1109/TBME.2010.2092776]

129. Shaffer F, Ginsberg JP. An Overview of Heart Rate Variability Metrics and Norms. Front public Heal [Internet] Frontiers Media SA; 2017 [cited 2018 Aug 7];5:258. PMID:29034226

130. Melillo P, Fusco R, Sansone M, Bracale M, Pecchia L. Discrimination power of long-term heart rate variability measures for chronic heart failure detection. Med Biol Eng Comput [Internet] Springer-Verlag; 2011 Jan 4 [cited 2018 Aug 6];49(1):67–74. [doi: 10.1007/s11517-010-0728-5]

131. Pecchia L, Melillo P, Sansone M, Bracale M. Discrimination Power of Short-Term Heart Rate Variability Measures for CHF Assessment. IEEE Trans Inf Technol Biomed [Internet] 2011 Jan [cited 2018 Aug 6];15(1):40–46. [doi: 10.1109/TITB.2010.2091647]

132. Panina G, Khot UN, Nunziata E, Cody RJ, Binkley PF. Role of spectral measures of heart rate variability as markers of disease progression in patients with chronic congestive heart failure not treated with angiotensin-converting enzyme inhibitors. Am Heart J [Internet] Mosby; 1996 Jan 1 [cited 2018 Aug 6];131(1):153–157. [doi: 10.1016/S0002-8703(96)90064-2]

133. Mietus JE, Peng C-K, Henry I, Goldsmith RL, Goldberger AL. The pNNx files: re-examining a widely used heart rate variability measure. Heart [Internet] BMJ Publishing Group Ltd; 2002 Oct 1 [cited 2018 Aug 6];88(4):378–80. PMID:12231596

134. Casolo GC, Stroder P, Sulla A, Chelucci A, Freni A, Zerauschek M. Heart rate variability and functional severity of congestive heart failure secondary to coronary artery disease. Eur Heart J [Internet] Oxford University Press; 1995 Mar 1 [cited 2018 Aug 6];16(3):360–367. [doi: 10.1093/oxfordjournals.eurheartj.a060919]

135. Goldsmith R. Congestive Heart Failure RR Interval Database [Internet]. [cited 2018 Aug 6]. [doi: 10.13026/C2F598]

136. Melillo P, De Luca N, Bracale M, Pecchia L. Classification Tree for Risk Assessment in Patients Suffering From Congestive Heart Failure via Long-Term Heart Rate Variability. IEEE J Biomed Heal Informatics [Internet] 2013 May [cited 2018 Aug 6];17(3):727–733. [doi: 10.1109/JBHI.2013.2244902]

137. Beth Israel Deaconess Medical Center. The BIDMC Congestive Heart Failure Database [Internet]. PhysioNet. 1986 [cited 2018 Aug 6]. [doi: 10.13026/C29G60]

138. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Plamen CI, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet Components of a New Research Resource for Complex Physiologic Signals. Circulation [Internet] 2000 [cited 2018 Aug 6];(101):215–220. [doi: 10.1161/circ.101.23.e215]

139. Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. ISBN:9780128042915

140. Vanwinckelen G, Blockeel H. On Estimating Model Accuracy with Repeated Cross-Validation. [cited 2018 Apr 25]; Available from: https://lirias.kuleuven.be/bitstream/123456789/346385/3/OnEstimatingModelAccuracy.pdf

141. Forman G, Scholz M. Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement. SIGKDD Explor [Internet] 2010 [cited 2017 Nov 3];12(1):49–57. Available from: http://www.kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf

142. Shahbazi F, Asl BM. Generalized discriminant analysis for congestive heart failure risk assessment based on long-term heart rate variability. Comput Methods Programs Biomed [Internet] Elsevier; 2015 Nov 1 [cited 2018 Aug 6];122(2):191–198. [doi: 10.1016/J.CMPB.2015.08.007]

143. Baudat G, Anouar F. Generalized Discriminant Analysis Using a Kernel Approach. Neural Comput [Internet] MIT Press; 2000 Oct 13 [cited 2018 Aug 6];12(10):2385–2404. [doi: 10.1162/089976600300014980]

144. Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point. Biom J [Internet] 2005 Aug [cited 2018 Aug 7];47(4):458–72. PMID:16161804

145. Guiqiu Yang, Yinzi Ren, Qing Pan, Gangmin Ning, Shijin Gong, Guolong Cai, Zhaocai Zhang, Li Li, Jing Yan. A heart failure diagnosis model based on support vector machine. 2010 3rd Int Conf Biomed Eng Informatics [Internet] IEEE; 2010 [cited 2018 Aug 6]. p. 1105–1108. [doi: 10.1109/BMEI.2010.5639619]

146. Wu H-T, Soliman EZ. A new approach for analysis of heart rate variability and QT variability in long-term ECG recording. Biomed Eng Online [Internet] BioMed Central; 2018 Dec 3 [cited 2018 Aug 7];17(1):54. [doi: 10.1186/s12938-018-0490-8]

147. Pang D, Igasaki T, Maehara J. Long-term monitoring of heart rate variability toward practical use in intensive/high care unit. 2016 9th Biomed Eng Int Conf [Internet] IEEE; 2016 [cited 2018 Aug 7]. p. 1–6. [doi: 10.1109/BMEiCON.2016.7859631]

148. Baril J-F, Bromberg S, Moayedi Y, Taati B, Manlhiot C, Ross HJ, Cafazzo J. Use of free-living step count monitoring for heart failure functional classification: a validation study. Toronto: JMIR Cardio; 2018. [doi: 10.2196/preprints.12122]

149. Stein KM, Mittal S, Merkel S, Meyer TE. Baseline Physical Activity and NYHA Classification Affects Future Ventricular Event Rates in a General ICD Population. J Card Fail [Internet] Churchill Livingstone; 2006 Aug 1 [cited 2017 Oct 13];12(6):S58. [doi: 10.1016/J.CARDFAIL.2006.06.203]

150. Bromberg SE. googlefitbit [Internet]. Toronto; 2015. Available from: https://github.com/simonbromberg/googlefitbit

151. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria; 2017. Available from: https://www.r-project.org

152. RStudio Team. RStudio: Integrated Development Environment for R [Internet]. Boston, MA; 2015. Available from: http://www.rstudio.com/

153. Wickham H. A Layered Grammar of Graphics. 2010 [cited 2017 May 31]; [doi: 10.1198/jcgs.2009.07098]

154. Arnold JB. ggthemes: Extra Themes, Scales and Geoms for “ggplot2” [Internet]. 2017. Available from: https://cran.r-project.org/package=ggthemes

155. Wickham H. The Split-Apply-Combine Strategy for Data Analysis. J Stat Softw [Internet] 2011;40(1):1–29. Available from: http://www.jstatsoft.org/v40/i01/

156. Wickham H, Francois R, Henry L, Müller K. dplyr: A Grammar of Data Manipulation [Internet]. 2017. Available from: https://cran.r-project.org/package=dplyr

157. Wickham H. Reshaping Data with the {reshape} Package. J Stat Softw [Internet] 2007;21(12):1– 20. Available from: http://www.jstatsoft.org/v21/i12/

158. Hester J. glue: Interpreted String Literals [Internet]. 2017. Available from: https://cran.r-project.org/package=glue

159. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Perceptions and experiences of heart failure patients and clinicians on the use of mobile phone-based telemonitoring. J Med Internet Res 2012;14(1):1–15. PMID:22328237

160. Intel Corporation. Safety Recall Notice for all Basis Peak™ Watches [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.intel.ca/content/www/ca/en/support/articles/000025310/emerging-technologies/wearable-devices.html

161. Somerville H. Jawbone’s demise a case of “death by overfunding” in Silicon Valley | Reuters [Internet]. Thomson Reuters. 2018 [cited 2018 Aug 14]. Available from: https://www.reuters.com/article/us-jawbone-failure/jawbones-demise-a-case-of-death-by-overfunding-in-silicon-valley-idUSKBN19V0BS

162. Alharbi M, Straiton N, Gallagher R. Harnessing the Potential of Wearable Activity Trackers for Heart Failure Self-Care. [cited 2017 May 15]; [doi: 10.1007/s11897-017-0318-z]

163. Apple Inc. HealthKit - Apple Developer [Internet]. 2018 [cited 2018 Aug 14]. Available from: https://developer.apple.com/healthkit/

164. empatica. E4 wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.empatica.com/research/e4/

165. Fitbit Inc. Fitbit SDK [Internet]. 2018. Available from: https://dev.fitbit.com/

166. Fitbit Inc. AltaHR [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.fitbit.com/en-ca/altahr

167. Fitbit Alta™ Fitness Wristband [Internet]. [cited 2018 Aug 13]. Available from: https://www.fitbit.com/en-ca/alta

168. Fitbit Inc. Fitbit Flex 2™ Fitness Wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.fitbit.com/en-ca/flex2

169. Fitbit Inc. Fitbit Ionic™ Watch [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.fitbit.com/en-ca/ionic

170. Fitbit Inc. Fitbit Versa [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.fitbit.com/en-ca/versa

171. Garmin. Home | Garmin Developers [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://developer.garmin.com/

172. Garmin. fenix 5 [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://buy.garmin.com/en-CA/CA/p/552982

173. Garmin. vivosmart [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://buy.garmin.com/en-US/US/p/154886

174. Google Developers. Google Fit [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://developers.google.com/fit/

175. Huawei Technology Co. Ltd. Watch 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://consumer.huawei.com/ca/wearables/watch2/

176. LG Electronics. LG Smart Watch Sport for AT&T With Android Wear 2.0 | LG USA [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.lg.com/us/smart-watches/lg-W280A-sport

177. mc10. BiostampRC System [Internet]. Available from: https://www.mc10inc.com/our-products/biostamprc

178. Misfit. Build @ Misfit [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://build.misfit.com/

179. Misfit. Misfit Flare [Internet]. 2018. Available from: https://misfit.com/misfit-flare

180. Misfit. Misfit Phase [Internet]. 2018. Available from: https://misfit.com/misfit-phase

181. Misfit. Misfit Ray [Internet]. 2018. Available from: https://misfit.com/misfit-ray

182. Misfit. Misfit Shine. 2018.

183. Misfit. Misfit Shine 2 [Internet]. 2018. Available from: https://misfit.com/misfit-shine-2

184. Misfit. Misfit Vapor [Internet]. 2018. Available from: https://misfit.com/misfit-vapor

185. Moov Inc. Moov HR [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://welcome.moov.cc/moovhr/

186. Moov Inc. Moov Now [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://welcome.moov.cc/moovnow/

187. Nokia. Nokia Health API [Internet]. 2018 [cited 2018 Aug 13]. Available from: http://developer.health.nokia.com/oauth2/

188. Nokia | Withings. Nokia Go [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://health.nokia.com/ca/en/go

189. Nokia | Withings. Nokia Steel [Internet]. [cited 2018 Aug 13]. Available from: https://health.nokia.com/ca/en/steel

190. Nokia | Withings. Nokia Steel HR [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://health.nokia.com/ca/en/steel-hr

191. TomTom Sports Team. TomTom Sports Cloud [Internet]. 2018. Available from: https://developer.tomtom.com/tomtom-sports-cloud

192. TomTom. TomTom Spark 3 Cardio + Music GPS Fitness Watch [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.tomtom.com/en_ca/sports/fitness-trackers/gps-fitness-watch-cardio-music-spark3/black-large/

193. TomTom. TomTom Touch Fitness Tracker [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.tomtom.com/en_ca/sports/fitness-trackers/fitness-tracker-touch/black-large/

194. Under Armour Inc. Under Armour UA Band [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.underarmour.com/en-ca/ua-band

195. Wavelet Health. Products [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://wavelethealth.com/products/

196. MI. Mi Band [Internet]. 2018. [cited 2018 Aug 13]. Available from: https://www.mi.com/en/miband/

197. MI. Mi Band 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://www.mi.com/en/miband2/

198. Baril J-F. fitbit4research [Internet]. Toronto; 2018 [cited 2018 Aug 16]. Available from: https://github.com/cosmomeese/fitbit4research

199. Tufte ER. The visual display of quantitative information. Graphics Press; 2001. ISBN:1930824130

200. Wong DM. The Wall Street journal guide to information graphics : the dos and don’ts of presenting data, facts, and figures. ISBN:0393347281

201. Tufte ER, McKay SR, Christian W, Matey JR. Visual Explanations: Images and Quantities, Evidence and Narrative. Comput Phys 1998; PMID:1659109

202. Zhang J, Johnson TR, Patel VL, Paige DL, Kubose T. Using usability heuristics to evaluate patient safety of medical devices. J Biomed Inform 2003;36:23–30. [doi: 10.1016/S1532-0464(03)00060-1]

203. Tognazzini B. First Principles of Interaction Design (Revised & Expanded) | askTog [Internet]. askTog.com. [cited 2017 Jan 13]. Available from: http://asktog.com/atc/principles-of-interaction-design/

204. Nielsen J. 10 Heuristics for User Interface Design [Internet]. Nielsen Norman Gr. 1995 [cited 2017 Jan 13]. Available from: https://www.nngroup.com/articles/ten-usability-heuristics/

205. Norman DA. The Design of Everyday Things [Internet]. Hum Factors Ergon Manuf. 2013. PMID:13182255 ISBN:0465067107

206. Laussen PC, Almodovar M, Goodwin A, Sick Kids: The Hospital for Sick Children. T3 - Tracking, trajectory and trigger tool [Internet]. Crit Care Med Programs Serv. 2018. Available from: http://www.sickkids.ca/Critical-Care/programs-and-services/T3/index.html

207. Laussen PC. Precision monitoring. Crit Care Canada Forum [Internet] Toronto; 2015 [cited 2018 Aug 15]. Available from: https://criticalcarecanada.com/presentations/2015/precision_monitoring.pdf

208. Guerguerian A-M. BME1439 Critical Care Instrumentation Lecture. Toronto; 2016.

209. Fitbit Inc. Accessing the Fitbit API [Internet]. Fitbit Dev Website. 2018. Available from: https://dev.fitbit.com/build/reference/web-api/oauth2/

210. Fitbit Inc. Fitbit Platform Terms of Service (Revised August 1st, 2018) [Internet]. Fitbit Dev Website. 2018. Available from: https://dev.fitbit.com/legal/platform-terms-of-service/

211. Canadian Radio-television and Telecommunications Commission. Communications Monitoring Report 2017: Canada’s Communication System: An Overview for Canadians (Table 2.0.6) [Internet]. Ottawa; 2017. Available from: https://crtc.gc.ca/eng/publications/reports/policymonitoring/2017/cmr2.htm#s20i

212. Mobile Operating System Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29]. Available from: http://gs.statcounter.com/os-market-share/mobile/canada/#monthly-201706-201711

213. Mobile iOS Version Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29]. Available from: http://gs.statcounter.com/ios-version-market-share/mobile/canada/#monthly-201611-201711

214. Hermsen S, Moons J, Kerkhof P, Wiekens C, De Groot M. Determinants for Sustained Use of an Activity Tracker: Observational Study. JMIR mHealth uHealth [Internet] JMIR Publications Inc.; 2017 Oct 30 [cited 2018 Aug 18];5(10):e164. PMID:29084709

215. Cafazzo J, St-Cyr O. From Discovery to Design: The Evolution of Human Factors in Healthcare. Healthc Q [Internet] 2012 Apr 11 [cited 2018 Aug 18];15(sp):24–29. [doi: 10.12927/hcq.2012.22845]

216. Canadian Patient Safety Institute, Institute for Safe Medication Practices Canada, Saskatchewan Health, Patients for Patient Safety Canada, Beard P, Hoffman CE, Ste-Marie M. Canadian Incident Analysis Framework [Internet]. Edmonton, AB; 2012. Available from: http://www.patientsafetyinstitute.ca/en/toolsResources/PatientSafetyIncidentManagementToolkit/Documents/CIAF Key Features - Analysis Process.pdf

217. Wickham H. tidyverse: Easily Install and Load the “Tidyverse” [Internet]. 2017. Available from: https://cran.r-project.org/package=tidyverse

218. Wolf HP. aplpack: Another Plot Package: “Bagplots”, “Iconplots”, “Summaryplots”, Slider Functions and Others [Internet]. 2018 [cited 2018 Aug 17]. Available from: https://cran.r-project.org/web/packages/aplpack/index.html

219. Champely S. PairedData: Paired Data Analysis [Internet]. 2018 [cited 2018 Aug 17]. Available from: https://cran.r-project.org/web/packages/PairedData/index.html

220. Jurafsky D, Martin J. Hidden Markov Models. Speech Lang Process [Internet] 3rd ed Pearson; 2017 [cited 2017 Nov 11]. p. 21. Available from: https://web.stanford.edu/~jurafsky/slp3/9.pdf

221. Bobick A, Essa I, Chakraborty A, Udacity. Markov Models [Internet]. Udacity Introd to Comput Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from: https://www.youtube.com/watch?v=4XqWadvEj2k

222. Gagniuc PA. Markov chains: from theory to implementation and experimentation. 1st ed. John Wiley and Sons, Inc; 2017. [doi: 10.1002/9781119387596] ISBN:9781119387558

223. O’Connell J, Højsgaard S. Hidden Semi Markov Models for Multiple Observation Sequences: The mhsmm Package for R. J Stat Softw [Internet] 2011 [cited 2017 Nov 1];39(4):1–22. [doi: 10.18637/jss.v039.i04]

224. Bobick A, Essa I, Chakraborty A, Udacity. Hidden Markov Models [Internet]. Udacity Introd to Comput Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from: https://www.youtube.com/watch?v=5araDjcBHMQ

225. O’Connell J, Højsgaard S. Package “mhsmm.” CRAN 2017;(0.4.16).

226. Altman RM. Mixed Hidden Markov Models: An Extension of the Hidden Markov Model to the Longitudinal Data Setting. J Am Stat Assoc [Internet] 2007 [cited 2017 Aug 28];102(477):201–210. [doi: 10.1198/016214506000001086]

227. Visser I, Speekenbrink M. depmixS4: An R Package for Hidden Markov Models [Internet]. Available from: http://cran.r-project.org/package=depmixS4.

228. Visser I, Speekenbrink M. depmixS4: Dependent Mixture Models - Hidden Markov Models of GLMs and Other Distributions in S4 [Internet]. 2016 [cited 2018 Aug 23]. Available from: https://cran.r-project.org/web/packages/depmixS4/index.html

229. Rohan. Can something be statistically impossible? [Internet]. Math Stack Exch. 2016 [cited 2018 Aug 24]. Available from: https://math.stackexchange.com/q/2049722

230. Pohlmann KC. Principles of digital audio. McGraw-Hill; 2011. ISBN:9780071663465

231. Farmer WC, editor. Ordnance Field Guide: Restricted, Volume 2 [Internet]. Military service publishing company; 1944 [cited 2018 Aug 24]. Available from: https://books.google.ca/books?id=15ffO4UVw8QC&q=dither&redir_esc=y

232. Analog Devices. A Technical Tutorial on Digital Signal Synthesis [Internet]. 1999. Available from: http://www.analog.com/media/cn/training-seminars/tutorials/450968421DDS_Tutorial_rev12-2-99.pdf

233. Mannix BF. Races, Rushes, and Runs: Taming the Turbulence in Financial Trading [Internet]. Washington; 2013. Available from: www.regulatorystudies.gwu.edu

234. Floyd RW, Steinberg L. An Adaptive Algorithm for Spatial Greyscale. Proc Soc Inf Disp 1976;17(2):75–77.

235. Roberts LG. Picture Coding Using Pseudo-Random Noise. IRE Trans Inf Theory 1962;8(2):145–154. [doi: 10.1109/TIT.1962.1057702]

236. Wikipedia Contributors. Dither [Internet]. Wikipedia, Free Encycl. 2018 [cited 2018 Aug 24]. Available from: https://en.wikipedia.org/wiki/Dither

237. Fox J. Generalized Linear Models. Appl Regres Gen Linear Model [Internet] SAGE Publications; 2015 [cited 2018 Aug 27]. p. 379–424. Available from: http://kilpatrick.eeb.ucsc.edu/wp-content/uploads/2015/04/GLMs-Chapter_15.pdf

238. Rigollet P. Lecture 21. Generalized Linear Models from MIT 18.650: Statistics for Applications [Internet]. YouTube; 2016 [cited 2018 Aug 27]. Available from: https://www.youtube.com/watch?v=X-ix97pw0xY

239. Gao J, Fan W, Han J. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled. Tutor SIAM Data Min Conf [Internet] Columbus, OH; 2010 [cited 2018 Aug 27]. Available from: https://cse.buffalo.edu/~jing/sdm10ensemble.htm

240. Grover P. Gradient Boosting from scratch [Internet]. ML Rev. 2017 [cited 2018 Aug 27]. Available from: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

241. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436–444. PMID:26017442

242. Parloff R. The AI Revolution: Why Deep Learning Is Suddenly Changing Your Life [Internet]. Fortune. 2016 [cited 2018 Aug 29]. Available from: http://fortune.com/ai-artificial-intelligence-deep-machine-learning/

243. Goodfellow I, Bengio Y, Courville A. Deep Learning [Internet]. 2016. Available from: http://www.deeplearningbook.org

244. Zekić-Sušac M, Šarlija N, Pfeifer S. Combining PCA Analysis And Artificial Neural Networks In Modelling Entrepreneurial Intentions Of Students. Croat Oper Res Rev [Internet] 2013 Feb 1 [cited 2018 Aug 29];4(1):306–317. Available from: https://hrcak.srce.hr/index.php?id_clanak_jezik=143365&show=clanak

245. Seuret M, Alberti M, Ingold R, Liwicki M. PCA-Initialized Deep Neural Networks Applied To Document Image Analysis [Internet]. Available from: https://arxiv.org/pdf/1702.00177.pdf

246. Marsupial D. Does Neural Networks based classification need a dimension reduction [Internet]. Cross Validated. 2013 [cited 2018 Aug 29]. Available from: https://stats.stackexchange.com/q/67988

247. Hartmann WM. Dimension Reduction vs. Variable Selection. Springer, Berlin, Heidelberg; 2006 [cited 2018 Aug 29]. p. 931–938. [doi: 10.1007/11558958_113]

248. Sorzano COS, Vargas J, Pascual-Montano A. A survey of dimensionality reduction techniques [Internet]. arXiv:1403.2877

249. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics [Internet] BioMed Central; 2011 Mar 17 [cited 2017 Nov 1];12(77). [doi: 10.1186/1471-2105-12-77]

250. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M, Siegert S. Package “pROC.” CRAN [Internet] 2017 [cited 2017 Nov 1];(1.10). Available from: https://cran.r-project.org/web/packages/pROC/pROC.pdf

251. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. caret: Classification and Regression Training [Internet]. 2017. Available from: https://cran.r-project.org/package=caret

252. Kuhn M. Predictive Modeling with R and the caret Package. useR! R User Conf [Internet] Albacete, Spain; 2013 [cited 2018 Aug 21]. Available from: http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf

253. Lumley T, Miller A. leaps: Regression Subset Selection [Internet]. 2017. Available from: https://cran.r-project.org/package=leaps

254. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. preProcess function [Internet]. R Doc. 2017 [cited 2018 Aug 30]. Available from: https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/preProcess

255. Schwarz G. Estimating the Dimension of a Model. Ann Stat [Internet] Institute of Mathematical Statistics; 1978 Mar [cited 2018 Aug 30];6(2):461–464. [doi: 10.1214/aos/1176344136]

256. Refaeilzadeh P, Tang L, Liu H. Cross-Validation. In: Liu L, Özsu MT, editors. Encycl Database Syst [Internet] Boston, MA: Springer US; 2009 [cited 2018 Aug 25]. p. 532–538. [doi: 10.1007/978-0-387-39940-9_565]

257. Zemel R. Ensemble Methods from University of Toronto CSC411 Machine Learning & Data Mining [Internet]. Toronto; 2014. Available from: http://www.cs.toronto.edu/~rsalakhu/CSC411/notes/lecture_ensemble1.pdf

258. Ng A. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep Learning [draft] [Internet]. deeplearning.ai. 2018. Available from: https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/704291d2-365e-45bf-a9f5-719959dfe415/Ng_MLY01.pdf

259. Brownlee J. Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning [Internet]. Mach Learn Mastery. 2016 [cited 2018 Aug 25]. Available from: https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/

260. Geng D, Shih S. Machine Learning Crash Course: Part 4 - The Bias-Variance Dilemma [Internet]. Mach Learn @ Berkeley. 2017 [cited 2018 Aug 25]. Available from: https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/

261. Sicotte XB. Bias and variance in leave-one-out vs K-fold cross validation [Internet]. Cross Validated. 2018 [cited 2018 Aug 25]. Available from: https://stats.stackexchange.com/q/357749

262. Little MA, Varoquaux G, Saeb S, Lonini L, Jayaraman A, Mohr DC, Kording KP. Using and understanding cross-validation strategies. Perspectives on Saeb et al. Gigascience [Internet] Oxford University Press; 2017 May 1 [cited 2018 Aug 25];6(5):1–6. PMID:28327989

263. Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Proc 14th Int Jt Conf Artif Intell - Vol 2 [Internet] Montreal: Morgan Kaufmann Publishers Inc.; 1995 [cited 2018 Aug 30]. p. 1137–1143. Available from: http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf

264. Bengio Y, Grandvalet Y. No Unbiased Estimator of the Variance of K-Fold Cross-Validation. J Mach Learn Res [Internet] 2004 [cited 2018 Aug 31];5:1089–1105. Available from: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf

265. Zhang Y, Yang Y. Cross-validation for selecting a model selection procedure. J Econom [Internet] 2015 Jul [cited 2018 Aug 31];187(1):95–112. [doi: 10.1016/j.jeconom.2015.02.006]

266. Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J Am Stat Assoc [Internet] 1983 Jun [cited 2018 Aug 31];78(382):316–331. [doi: 10.1080/01621459.1983.10477973]

267. Sicotte XB. Variance of K-fold cross-validation estimates as f(K): what is the role of “stability”? [Internet]. Cross Validated. 2018. Available from: https://stats.stackexchange.com/q/358278

268. National Health Service. Blood tests - Overview [Internet]. Natl Heal Serv. 2016 [cited 2018 Aug 31]. Available from: https://www.nhs.uk/conditions/blood-tests/

269. The Royal College of Pathologists of Australasia. Pathology: The Facts [Internet]. 2013. Available from: http://www.health.gov.au/internet/publications/publishing.nsf/Content/CA2578620005D57ACA257B6A000862D3/$File/What I Should Know Pathology-FS.pdf

270. Dynacare. After My Test [Internet]. [cited 2018 Aug 31]. Available from: https://www.dynacare.ca/patients-and-individuals/preparation-and-tips/after-my-test.aspx

271. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. varImp function [Internet]. R Doc. 2017 [cited 2018 Aug 31]. Available from: https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/varImp

272. Habbu A, Lakkis NM, Dokainish H. The Obesity Paradox: Fact or Fiction? Am J Cardiol [Internet] Excerpta Medica; 2006 Oct 1 [cited 2018 Sep 24];98(7):944–948. [doi: 10.1016/J.AMJCARD.2006.04.039]

273. Curtis JP, Selter JG, Wang Y, Rathore SS, Jovin IS, Jadbabaie F, Kosiborod M, Portnay EL, Sokol SI, Bader F, Krumholz HM. The Obesity Paradox. Arch Intern Med [Internet] 2005 Jan 10 [cited 2018 Sep 24];165(1):55. [doi: 10.1001/archinte.165.1.55]

274. Kenchaiah S, Evans JC, Levy D, Wilson PWF, Benjamin EJ, Larson MG, Kannel WB, Vasan RS. Obesity and the Risk of Heart Failure. N Engl J Med [Internet] 2002 Aug [cited 2018 Sep 24];347(5):305–313. [doi: 10.1056/NEJMoa020245]

275. Mosterd A. The prognosis of heart failure in the general population. The Rotterdam Study. Eur Heart J [Internet] 2001 Aug 1 [cited 2018 Sep 24];22(15):1318–1327. [doi: 10.1053/euhj.2000.2533]

276. Iliodromiti S, Celis-Morales CA, Lyall DM, Anderson J, Gray SR, Mackay DF, Nelson SM, Welsh P, Pell JP, Gill JMR, Sattar N. The impact of confounding on the associations of different adiposity measures with the incidence of cardiovascular disease: a cohort study of 296 535 adults of white European descent. Eur Heart J [Internet] Oxford University Press; 2018 May 1 [cited 2018 Sep 24];39(17):1514–1520. [doi: 10.1093/eurheartj/ehy057]

277. Mailund T, Storm Pedersen CN. Machine Learning in Bioinformatics, Lecture Week 5: Hidden Markov Models - selecting model parameters or “training” [Internet]. Aarhus, Denmark; 2014 [cited 2017 Aug 28]. p. 56. Available from: http://users-birc.au.dk/cstorm/courses/MLiB_f14/slides/hidden-markov-models-4.pdf

278. Jelinek B. Review on Training Hidden Markov Models with Multiple Observations. [cited 2017 Aug 28]; Available from: https://www.isip.piconepress.com/courses/msstate/ece_8443/papers/2001_spring/multi_obs/p00_paper_v0.pdf

279. user34790, de Azevdeo R, Morat, hxd1011, Bulatov Y, Masterfool, Dernoncourt F. What is the difference between the forward-backward and Viterbi algorithms? [Internet]. Cross Validated. 2016 [cited 2017 Nov 11]. Available from: https://stats.stackexchange.com/questions/31746/what-is-the-difference-between-the-forward-backward-and-viterbi-algorithms

280. Rodríguez LJ, Torres I. Comparative Study of the Baum-Welch and Viterbi Training Algorithms Applied to Read and Spontaneous Speech Recognition. Pattern Recognit Image Anal [Internet] Springer, Berlin, Heidelberg; 2003 [cited 2017 Nov 11]. p. 847–857. [doi: 10.1007/978-3-540-44871-6_98]

281. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É. Scikit-learn: Machine Learning in Python. J Mach Learn Res [Internet] 2011 [cited 2018 Aug 22];12:2825–2830. Available from: http://scikit-learn.org/stable/about.html#citing-scikit-learn

282. Baril J-F. mhsc-thesis [Internet]. Toronto; 2018. Available from: https://github.com/cosmomeese/mhsc-thesis

283. Abu-Mostafa Y. Lecture 07 - The VC Dimension from Caltech CS 156: Learning Systems [Internet]. YouTube; 2012 [cited 2018 Aug 30]. Available from: https://www.youtube.com/watch?v=Dc0sr0kdBVI&hd=1#t=57m20s

284. Beleites C, Klein A. Any “rules of thumb” on number of features versus number of instances? (small data sets) [Internet]. Data Science Stack Exchange. 2018. Available from: https://datascience.stackexchange.com/a/29478

285. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER. Optimal number of features as a function of sample size for various classification rules. Bioinformatics [Internet] Oxford University Press; 2005 Apr 15 [cited 2018 Aug 30];21(8):1509–1515. [doi: 10.1093/bioinformatics/bti171]

286. Häggström M. Renin-angiotensin_system_in_man_shadow. Wikimedia Commons; 2009.

287. Ober WC, Garrison CW, Silverthorn DU. Adapted from Figure 15-24, The baroreceptor reflex: the response to orthostatic hypotension. Hum Physiol An Integr Approach. Pearson Benjamin Cummings; 2009. p. 991.

288. Alian AA, Shelley KH. Fig. 3. The effect of cardiac arrhythmia (PVCs) on the PPG. Best Pract Res Clin Anaesthesiol [Internet] 2014 [cited 2018 Jul 30];28(4). [doi: 10.1016/j.bpa.2014.08.006]

289. University Health Network (UHN). Medly for Heart Failure [Internet]. iTunes; 2018. Available from: https://itunes.apple.com/ca/app/medly-for-chronic-conditions/id1310832707?mt=8

290. Owen S. Common Probability Distributions: The Data Scientist’s Crib Sheet [Internet]. Cloudera Eng Blog. 2015 [cited 2018 Aug 27]. Available from: https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Appendix A - Research Ethics

I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of Heart Failure Severity

II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers for Monitoring Heart Failure

III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program for Heart Failure Patients

IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a mobile phone-based telemonitoring program for heart failure patients

Appendix B – A Primer on Hidden Markov Models

I. Basics of Markov Models (Hidden or Otherwise)

Markov Models (hidden or otherwise) are probabilistic state machines where the transitions between states are executed randomly according to pre-specified transition probabilities between states [118,220–223]. Markov Models are used to model Markov chains/processes, which are stochastic (i.e. random) processes that satisfy the Markovian property. That is, the transition from a given state in the chain to the next immediate state (and by extension all future states) must depend solely on the current state of the model [118,220–224]. It must not depend on the path taken to arrive at that state, i.e. on any previous states in which the system has existed. The Markovian property is alternatively known as the 'memoryless' property: essentially, the Markov process or Markov chain has no memory of the past [118,220–224]. The transition probabilities, along with the number of states, form the fundamental model parameters which uniquely describe the Markov Model. Where relevant, a Markov Model may also have initial starting parameters which dictate the likelihood of the Markov Model starting in each possible state (e.g. 10% chance to start in State S1, 20% chance to start in State S2 and so on) [118,220–224].
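
The following minimal sketch (written for this primer, not taken from the thesis code; all probabilities are invented) simulates a three-state Markov chain in base R. Note that each step is drawn using only the current state's row of the transition matrix, which is precisely the Markovian property in action.

# A minimal, illustrative 3-state Markov chain simulation in base R.
# The transition probabilities below are invented for demonstration only.
set.seed(42)

states <- c("S1", "S2", "S3")
trans <- matrix(c(0.70, 0.20, 0.10,   # transitions out of S1
                  0.30, 0.50, 0.20,   # transitions out of S2
                  0.10, 0.30, 0.60),  # transitions out of S3
                nrow = 3, byrow = TRUE,
                dimnames = list(states, states))
start <- c(0.40, 0.30, 0.30)          # initial starting probabilities

n <- 20
chain <- character(n)
chain[1] <- sample(states, 1, prob = start)
for (t in 2:n) {
  # the next state is sampled using only the current state's row,
  # with no memory of how the chain arrived there (memorylessness)
  chain[t] <- sample(states, 1, prob = trans[chain[t - 1], ])
}
print(chain)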

In many Markov Models (and in every Hidden Markov Model) there is also an associated set of possible observations linked to each state, i.e. that can possibly be output when the system is in a given state. For example, Figure B-1 shows a Markov Model of the weather outside an office with possible states S1 = Sunny, S2 = Cloudy and S3 = Rainy, and associated transition probabilities between each state [221]. The observations associated with each state might be the clothing that a given person in a stream of passers-by is wearing, say a shirt, a sweater or a rainjacket [221]. A person might be wearing any of these types of clothing in any given type of weather, but the likelihood of observing each clothing type will differ based on the underlying weather state; for example, rainjackets are probably more likely to be observed in rainy weather than in sunny weather [221]. These probabilities are termed observation probabilities and link the states in the Markov Model to the observations that are measured as outputs of the Markov Model. These observations could equally be speech phonemes, written characters of the alphabet, or genome sequences [118,226]. Observe that in Figure B-1, our hypothetical Markov Model of the weather includes the starting, transition and observation probabilities. The starting probabilities are indicated by very light lines between the rectangular 'start' & the state circles, and are almost uniformly distributed with a slight bias towards state S1: Sunny (perhaps unjustified optimism). The transition probabilities, indicated by lines between the three state circles, favor the state remaining the same, with low probability of the state jumping directly between the S1: Sunny and S3: Rainy states. The observation probabilities model our hypothesis that shirts are most likely to be associated with sunny weather, and rainjackets with rainy weather. In cloudy weather, people are almost equally likely to wear shirts, sweaters or rainjackets, with a minor preference towards sweaters.

Figure B-1: Markov model
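
To make the weather example concrete, the sketch below encodes starting, transition and observation probabilities as matrices in base R and generates a joint sequence of hidden weather states and observed clothing. The numbers are illustrative stand-ins only; they are not the exact values depicted in Figure B-1.

# An illustrative encoding of the Figure B-1 weather model in base R;
# all probability values are stand-ins, not the figure's exact numbers.
set.seed(1)
states <- c("Sunny", "Cloudy", "Rainy")
garments <- c("shirt", "sweater", "rainjacket")

start <- c(0.40, 0.30, 0.30)            # slight bias towards Sunny
trans <- matrix(c(0.80, 0.15, 0.05,     # states tend to persist; direct
                  0.20, 0.60, 0.20,     # Sunny <-> Rainy jumps are
                  0.05, 0.15, 0.80),    # least likely
                nrow = 3, byrow = TRUE, dimnames = list(states, states))
emis <- matrix(c(0.70, 0.20, 0.10,      # Sunny: mostly shirts
                 0.30, 0.40, 0.30,      # Cloudy: near-uniform, sweaters favored
                 0.10, 0.20, 0.70),     # Rainy: mostly rainjackets
               nrow = 3, byrow = TRUE, dimnames = list(states, garments))

n <- 10
weather <- clothing <- character(n)
weather[1] <- sample(states, 1, prob = start)
clothing[1] <- sample(garments, 1, prob = emis[weather[1], ])
for (t in 2:n) {
  weather[t] <- sample(states, 1, prob = trans[weather[t - 1], ])
  clothing[t] <- sample(garments, 1, prob = emis[weather[t], ])
}
rbind(weather, clothing)  # an observer sees only the 'clothing' row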

The appropriately named Hidden Markov Models (HMM) are simply Markov Models where the underlying states are hidden - i.e. cannot be directly observed [118,220,222,224,225]. Specifically, we don't know the number of states the system has, nor the transition probabilities between states, the sequence of states it has been through, or even the present state of the system [118,220,222,224,225]. However, if we assume the system has a certain number of states (e.g. 3) for which we have some given observation probabilities, it is actually possible for us to work backwards and try to infer the current state of the hidden underlying Markov Model, including the sequence of states that the particular model went through, and generally to create a model of the underlying process [118,220,222–224,277]. We can then use the model to replicate the modelled process. A relatable example is text prediction, where an HMM might be trained using text a user inputs into their smartphone and then used to dynamically suggest the next word as the user types new text. Alternatively, one could use a model to quantify how similar a new process is to an existing modelled process: for example, one could model the stock market using the trade volume and price of a major index during a known bullish (rising) period, and then provide this bull-market-trained HMM a recent sample of the index trade volume and pricing information to quantify how similar the current market is to the known bull market period.

Of course, modelling an underlying process using an HMM relies on many assumptions, both about the input data and about properties of the underlying process. As previously mentioned, one of the major assumptions of hidden Markov Models, as with Markov Models in general, is that the underlying process they model adheres to the Markovian property: the future state of the model does not depend on the past states or sequence of states, only the present state [118,220,221,224]. That being said, Hidden Markov Models have in certain cases been found to fairly successfully model processes that violate this Markovian assumption, as in the classic cases of speech recognition and gesture recognition [118,226,278]. Of course, both patient activity and heart rate data likely violate the Markovian assumption 'demanded' of hidden Markov Models, and although HMMs have been used successfully in some applications of physical activity recognition using accelerometer data [62], the jury is still out when it comes to modelling heart rate data or even minute-by-minute step count data.

II. Semi-Markov Model

The violation of the pure Markovian assumption leads us to a variation on Hidden Markov Models: Hidden Semi-Markov Models (HSMM) [223]. HSMMs are HMMs that formally relax the Markovian assumption by permitting the model to retain a memory of how long it has been in a given state (sometimes to force the model not to remain in a state for more than a desired time) [223]. As such, HSMMs require that an additional set of parameters be defined: the sojourn distribution of each state [223], that is, the distribution of waiting times in each given state. These waiting times can follow any distribution desired - normal, geometric, gamma, etc. - as appropriate for the problem at hand [223]. For example, in the case of patient activity and heart rate, where it might be unreasonable to assume that there is no time-dependence in state changes given the dynamic nature of human exercise and activity (e.g. people performing high-intensity activity are less likely to continue as time goes by, since they tire), one might train equivalent multivariate hidden semi-Markov models to explore and measure the effect of formally relaxing the Markovian assumption (or time-independence) of a pure Markov model; a minimal simulation of this idea is sketched below. Although HSMMs are likely highly relevant to the problem of assessing NYHA class, they were not investigated as part of the research documented in this thesis.
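
The sojourn idea can be illustrated with a hand-rolled simulation in base R (written for this primer with invented parameters; it does not use the mhsmm package [223] and is not thesis code). Each state's dwell time is drawn explicitly from a gamma distribution, rather than arising implicitly from repeated self-transitions as in a plain Markov chain.

# A minimal semi-Markov simulation: 'rest' bouts last longer on average,
# 'active' bouts are short and right-skewed. All parameters are invented.
set.seed(7)
states <- c("rest", "active")
trans <- matrix(c(0, 1,
                  1, 0), nrow = 2, byrow = TRUE)  # two states: always switch
sojourn <- list(rest   = function() rgamma(1, shape = 4, scale = 10),
                active = function() rgamma(1, shape = 2, scale = 3))

state_seq <- character(0)
s <- "rest"
while (length(state_seq) < 100) {
  dur <- max(1, round(sojourn[[s]]()))  # dwell in state s for 'dur' minutes
  state_seq <- c(state_seq, rep(s, dur))
  s <- states[sample.int(2, 1, prob = trans[match(s, states), ])]
}
table(state_seq[1:100])  # minutes spent in each state over the first 100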

III. Hidden Markov & Semi-Markov Model Parameters

To summarize, the complete set of parameters that determines a Hidden Markov Model is as follows:

1. the number of states in the model

2. the starting probabilities (for each state)

3. the transition probabilities (between each state)

4. the (observation) emission probabilities (of the observable by-products of each state; e.g. shirt/sweater/rainjacket)

For Hidden Semi-Markov Models, the individual state sojourn distributions must also be specified.

IV. Determining Markov Model Parameters

Determining the single best or most optimal Hidden Markov Model parametrization for a given data stream is, unfortunately, an intractable problem [118,220,222]. That being said, there are known algorithms for efficiently computing a locally optimal maximum likelihood parametrization for a stream. Generally speaking, the specific sub-class of algorithms used to solve this problem in the Markov model space are known as expectation-maximization (EM) algorithms [118,220,222]. One of the most common EM algorithm implementations used for Hidden Markov Model training is the Baum-Welch algorithm [118,220,222,279]. Another common algorithm used to approximate EM is the Viterbi training algorithm (N.B. not the Viterbi algorithm), which can yield less accurate models than the Baum-Welch algorithm but is usually much less computationally intensive [279,280]. We eschew further discussion of the implementation details of either of these algorithms, since the availability of pre-programmed libraries implementing them makes it unnecessary for a new student of HMMs to have the in-depth knowledge required to implement the algorithms, and because there are many excellent sources available that explore the finer details of the algorithms much more completely than can be done as part of a quick primer [118,220,222,280]. In any case, none of these algorithms is able to determine all of the parameters by itself. Some of the parameters must be provided as 'initial conditions' for the algorithm to execute. Typically these are the emission probabilities, the starting probabilities, the sojourn distributions (and sometimes even initial transition probabilities). Depending on the library used, it may try to make an educated guess for starting points or leave the 'initial conditions' to be specified solely by the user. It is possible (and encouraged) to try various combinations of parameters to determine the most effective set - in fact, more fully featured software libraries will sometimes offer to do this automatically, although it is ultimately up to the researcher to determine appropriate 'initial conditions.'

In the case of this work, where we used the R package depmixS4 [227,228], the user must provide the number of desired states, the emission probabilities (which are assumed to remain fixed), as well as an initial starting point for the state probabilities and transition probabilities, which the algorithm then adjusts as it searches for a local optimum. Other hidden Markov model packages exist for R as well as for other programming languages, including Python (as part of the package scikit-learn [281]), which is particularly popular for machine learning.
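
For completeness, a minimal depmixS4 fit is sketched below on synthetic minute-by-minute step counts. This is a generic illustration with invented data and default settings; it does not reproduce the model configuration used in this thesis.

# Fitting a 2-state hidden Markov model to fake step-count data with
# depmixS4 [227,228]; the data and settings are illustrative only.
library(depmixS4)
set.seed(3)

# synthetic data: 200 min of low activity, then 200 min of higher activity
steps <- data.frame(count = c(rpois(200, lambda = 4),
                              rpois(200, lambda = 40)))

mod <- depmix(count ~ 1, data = steps, nstates = 2, family = poisson())
fm <- fit(mod)           # EM search for a locally optimal parametrization
summary(fm)              # estimated transition and emission parameters
head(posterior(fm))      # inferred hidden state sequence with probabilities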

Appendix C – Software Repository

All of the software written by the author and used for, or as part of, this project can be accessed at [282]:

https://github.com/cosmomeese/mhsc-thesis

The Fitbit data management and access script can also be found at [198]:

https://github.com/cosmomeese/fitbit4research

Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier Performance Measures

An exhaustive list of all the performance measures recorded for the final cross-sectional machine learning classifiers evaluated in Chapter 6 is tabulated in Table 22. To maximize the legibility of the tables, the headers were abbreviated; Table 21 provides the key to these abbreviations, along with the relevant codes used in Table 22. For ease of navigation, similar model variants are grouped together in Table 22, in roughly descending order of performance (due to the model grouping). Furthermore, the column with the performance metric used for model comparison in this thesis - Cohen's Kappa (indicated by the κ symbol) - is highlighted purple. Models whose unbalanced accuracy does not improve over their no-information rate are highlighted red, and the best performing models are highlighted in green. The models with the lowest |Δκ| (of the models that improve over their default no-information rate) are highlighted in yellow.
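
As a quick illustration of how the tabulated quantities relate to one another, the short base R sketch below (not thesis code) recovers Cohen's Kappa and the balanced accuracy of the best-performing configuration in Table 22 (the boosted GLM, with TP=6, FN=2, FP=1, TN=19) directly from its confusion-matrix counts.

# Recomputing two Table 22 columns from the best model's confusion matrix.
tp <- 6; fn <- 2; fp <- 1; tn <- 19
n <- tp + fn + fp + tn

sens <- tp / (tp + fn)                 # sensitivity: 0.75
spec <- tn / (tn + fp)                 # specificity: 0.95
bal_acc <- (sens + spec) / 2           # balanced accuracy: 0.85

po <- (tp + tn) / n                    # observed agreement
pe <- ((tp + fn) * (tp + fp) +         # agreement expected by chance
       (fp + tn) * (fn + tn)) / n^2
kappa <- (po - pe) / (1 - pe)          # Cohen's kappa: ~0.73
c(balanced_accuracy = bal_acc, kappa = round(kappa, 2))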

Table 21: Header abbreviations for Table 22

Header Abbreviation | Expanded Header | Coding
Type | Machine learning model type
Feats | Features used | C = CPET Only, S = Step Data Only, C+S = CPET and Step Data
Imp | Imputed missing data?
F Sel | Feature selection performed?
K | k-fold cross-validation method used | -1 = leave-one-out cross-validation, 10 = 10-fold cross-validation
κ | Cohen's Kappa
|Δκ| | Absolute value of the difference between the leave-one-out cross-validation kappa and the 10-fold cross-validation kappa for the particular model configuration
Bal Acc | Balanced Accuracy
Raw Acc | Unbalanced Accuracy
Acc UB | Unbalanced Accuracy Upper Bound
Acc LB | Unbalanced Accuracy Lower Bound
NIR | No Information Rate
P | P-Value (Unbalanced Accuracy)
McN P | McNemar P-Value
Sens | Sensitivity
Spec | Specificity
+ve PV | Positive (NYHA Class II) Predictive Value
-ve PV | Negative (NYHA Class III) Predictive Value
Pre | Precision
Rec | Recall
F1 | F1 Score
Prev | Prevalence
DR | Detection Rate
DP | Detection Prevalence
AUC | Area Under ROC Curve
TP | True Positive (Correct NYHA II Classification) Count
FN | False Negative (Incorrect NYHA III Classification) Count
FP | False Positive (Incorrect NYHA II Classification) Count
TN | True Negative (Correct NYHA III Classification) Count

Table 22: Cross-sectional machine learning classifier performance metrics

Type | Feats | Imp | F Sel | K | κ | |Δκ| | Bal Acc | Raw Acc | Acc UB | Acc LB | NIR | P | McN P | Sens | Spec | +ve PV | -ve PV | Pre | Rec | F1 | Prev | DR | DP | AUC | TP | FN | FP | TN
Boosted GLM | C+S | No | No | -1 | 0.73 | 0.63 | 0.85 | 0.89 | 0.98 | 0.72 | 0.71 | .02 | 1.00 | 0.75 | 0.95 | 0.86 | 0.90 | 0.86 | 0.75 | 0.80 | 0.29 | 0.21 | 0.25 | 0.94 | 6 | 2 | 1 | 19
Boosted GLM | C+S | No | Yes | -1 | 0.73 | 0.63 | 0.85 | 0.89 | 0.98 | 0.72 | 0.71 | .02 | 1.00 | 0.75 | 0.95 | 0.86 | 0.90 | 0.86 | 0.75 | 0.80 | 0.29 | 0.21 | 0.25 | 0.94 | 6 | 2 | 1 | 19
Boosted GLM | C+S | No | No | 10 | 0.10 | 0.63 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.54 | 3 | 12 | 4 | 31
Boosted GLM | C+S | No | Yes | 10 | 0.10 | 0.63 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.54 | 3 | 12 | 4 | 31
Random Forest | C+S | No | No | -1 | 0.70 | 0.60 | 0.81 | 0.89 | 0.98 | 0.72 | 0.71 | .02 | .25 | 0.63 | 1.00 | 1.00 | 0.87 | 1.00 | 0.63 | 0.77 | 0.29 | 0.18 | 0.18 | 0.80 | 5 | 3 | 0 | 20
Random Forest | C+S | No | Yes | -1 | 0.70 | 0.60 | 0.81 | 0.89 | 0.98 | 0.72 | 0.71 | .02 | .25 | 0.63 | 1.00 | 1.00 | 0.87 | 1.00 | 0.63 | 0.77 | 0.29 | 0.18 | 0.18 | 0.80 | 5 | 3 | 0 | 20
Random Forest | C+S | No | No | 10 | 0.10 | 0.60 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.46 | 3 | 12 | 4 | 31
Random Forest | C+S | No | Yes | 10 | 0.10 | 0.60 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.46 | 3 | 12 | 4 | 31
Boosted GLM | C | No | No | -1 | 0.47 | 0.19 | 0.72 | 0.79 | 0.90 | 0.64 | 0.70 | .12 | .50 | 0.54 | 0.90 | 0.70 | 0.82 | 0.70 | 0.54 | 0.61 | 0.30 | 0.16 | 0.23 | 0.80 | 7 | 6 | 3 | 27
Boosted GLM | C | No | Yes | -1 | 0.47 | 0.19 | 0.72 | 0.79 | 0.90 | 0.64 | 0.70 | .12 | .50 | 0.54 | 0.90 | 0.70 | 0.82 | 0.70 | 0.54 | 0.61 | 0.30 | 0.16 | 0.23 | 0.80 | 7 | 6 | 3 | 27
Boosted GLM | C | No | No | 10 | 0.28 | 0.19 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.55 | 6 | 9 | 5 | 30
Boosted GLM | C | No | Yes | 10 | 0.28 | 0.19 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.55 | 6 | 9 | 5 | 30
PCA NNet | C | Yes | No | -1 | 0.45 | 0.31 | 0.73 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | .77 | 0.67 | 0.80 | 0.59 | 0.85 | 0.59 | 0.67 | 0.63 | 0.30 | 0.20 | 0.34 | 0.68 | 10 | 5 | 7 | 28
PCA NNet | C | Yes | Yes | -1 | 0.45 | 0.31 | 0.73 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | .77 | 0.67 | 0.80 | 0.59 | 0.85 | 0.59 | 0.67 | 0.63 | 0.30 | 0.20 | 0.34 | 0.68 | 10 | 5 | 7 | 28
PCA NNet | C | Yes | No | 10 | 0.14 | 0.31 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.54 | 4 | 11 | 5 | 30
PCA NNet | C | Yes | Yes | 10 | 0.14 | 0.31 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.54 | 4 | 11 | 5 | 30
Boosted GLM | C | Yes | No | -1 | 0.43 | 0.29 | 0.71 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | 1.00 | 0.60 | 0.83 | 0.60 | 0.83 | 0.60 | 0.60 | 0.60 | 0.30 | 0.18 | 0.30 | 0.76 | 9 | 6 | 6 | 29
Boosted GLM | C | Yes | Yes | -1 | 0.43 | 0.29 | 0.71 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | 1.00 | 0.60 | 0.83 | 0.60 | 0.83 | 0.60 | 0.60 | 0.60 | 0.30 | 0.18 | 0.30 | 0.76 | 9 | 6 | 6 | 29
NNet | C | Yes | No | -1 | 0.43 | 0.29 | 0.71 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | 1.00 | 0.60 | 0.83 | 0.60 | 0.83 | 0.60 | 0.60 | 0.60 | 0.30 | 0.18 | 0.30 | 0.73 | 9 | 6 | 6 | 29
NNet | C | Yes | Yes | -1 | 0.43 | 0.29 | 0.71 | 0.76 | 0.87 | 0.62 | 0.70 | .22 | 1.00 | 0.60 | 0.83 | 0.60 | 0.83 | 0.60 | 0.60 | 0.60 | 0.30 | 0.18 | 0.30 | 0.73 | 9 | 6 | 6 | 29
Boosted GLM | C | Yes | No | 10 | 0.14 | 0.29 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.53 | 4 | 11 | 5 | 30
Boosted GLM | C | Yes | Yes | 10 | 0.14 | 0.29 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.53 | 4 | 11 | 5 | 30
NNet | C | Yes | No | 10 | 0.14 | 0.29 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.56 | 4 | 11 | 5 | 30
NNet | C | Yes | Yes | 10 | 0.14 | 0.29 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.56 | 4 | 11 | 5 | 30
NNet | C | No | No | -1 | 0.41 | 0.45 | 0.71 | 0.74 | 0.86 | 0.59 | 0.70 | .32 | 1.00 | 0.62 | 0.80 | 0.57 | 0.83 | 0.57 | 0.62 | 0.59 | 0.30 | 0.19 | 0.33 | 0.73 | 8 | 5 | 6 | 24
NNet | C | No | Yes | -1 | 0.41 | 0.45 | 0.71 | 0.74 | 0.86 | 0.59 | 0.70 | .32 | 1.00 | 0.62 | 0.80 | 0.57 | 0.83 | 0.57 | 0.62 | 0.59 | 0.30 | 0.19 | 0.33 | 0.73 | 8 | 5 | 6 | 24
NNet | C | No | No | 10 | -0.05 | 0.45 | 0.48 | 0.56 | 0.70 | 0.41 | 0.70 | .99 | 1.00 | 0.27 | 0.69 | 0.27 | 0.69 | 0.27 | 0.27 | 0.27 | 0.30 | 0.08 | 0.30 | 0.55 | 4 | 11 | 11 | 24
NNet | C | No | Yes | 10 | -0.05 | 0.45 | 0.48 | 0.56 | 0.70 | 0.41 | 0.70 | .99 | 1.00 | 0.27 | 0.69 | 0.27 | 0.69 | 0.27 | 0.27 | 0.27 | 0.30 | 0.08 | 0.30 | 0.55 | 4 | 11 | 11 | 24
GLM | C | Yes | No | -1 | 0.37 | 0.23 | 0.68 | 0.74 | 0.85 | 0.60 | 0.70 | .33 | 1.00 | 0.53 | 0.83 | 0.57 | 0.81 | 0.57 | 0.53 | 0.55 | 0.30 | 0.16 | 0.28 | 0.70 | 8 | 7 | 6 | 29
GLM | C | Yes | Yes | -1 | 0.37 | 0.23 | 0.68 | 0.74 | 0.85 | 0.60 | 0.70 | .33 | 1.00 | 0.53 | 0.83 | 0.57 | 0.81 | 0.57 | 0.53 | 0.55 | 0.30 | 0.16 | 0.28 | 0.70 | 8 | 7 | 6 | 29
GLM | C | Yes | No | 10 | 0.14 | 0.23 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.49 | 4 | 11 | 5 | 30
GLM | C | Yes | Yes | 10 | 0.14 | 0.23 | 0.56 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .21 | 0.27 | 0.86 | 0.44 | 0.73 | 0.44 | 0.27 | 0.33 | 0.30 | 0.08 | 0.18 | 0.49 | 4 | 11 | 5 | 30
PCA NNet | C+S | No | No | -1 | 0.36 | 0.43 | 0.68 | 0.75 | 0.89 | 0.55 | 0.71 | .43 | 1.00 | 0.50 | 0.85 | 0.57 | 0.81 | 0.57 | 0.50 | 0.53 | 0.29 | 0.14 | 0.25 | 0.63 | 4 | 4 | 3 | 17
PCA NNet | C+S | No | Yes | -1 | 0.36 | 0.43 | 0.68 | 0.75 | 0.89 | 0.55 | 0.71 | .43 | 1.00 | 0.50 | 0.85 | 0.57 | 0.81 | 0.57 | 0.50 | 0.53 | 0.29 | 0.14 | 0.25 | 0.63 | 4 | 4 | 3 | 17
PCA NNet | C+S | No | No | 10 | -0.06 | 0.43 | 0.47 | 0.52 | 0.66 | 0.37 | 0.70 | 1.00 | .54 | 0.33 | 0.60 | 0.26 | 0.68 | 0.26 | 0.33 | 0.29 | 0.30 | 0.10 | 0.38 | 0.56 | 5 | 10 | 14 | 21
PCA NNet | C+S | No | Yes | 10 | -0.06 | 0.43 | 0.47 | 0.52 | 0.66 | 0.37 | 0.70 | 1.00 | .54 | 0.33 | 0.60 | 0.26 | 0.68 | 0.26 | 0.33 | 0.29 | 0.30 | 0.10 | 0.38 | 0.56 | 5 | 10 | 14 | 21
PCA NNet | C | No | No | -1 | 0.34 | 0.24 | 0.67 | 0.72 | 0.85 | 0.56 | 0.70 | .44 | 1.00 | 0.54 | 0.80 | 0.54 | 0.80 | 0.54 | 0.54 | 0.54 | 0.30 | 0.16 | 0.30 | 0.74 | 7 | 6 | 6 | 24
PCA NNet | C | No | Yes | -1 | 0.34 | 0.24 | 0.67 | 0.72 | 0.85 | 0.56 | 0.70 | .44 | 1.00 | 0.54 | 0.80 | 0.54 | 0.80 | 0.54 | 0.54 | 0.54 | 0.30 | 0.16 | 0.30 | 0.74 | 7 | 6 | 6 | 24
PCA NNet | C | No | No | 10 | 0.10 | 0.24 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.53 | 3 | 12 | 4 | 31
PCA NNet | C | No | Yes | 10 | 0.10 | 0.24 | 0.54 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .08 | 0.20 | 0.89 | 0.43 | 0.72 | 0.43 | 0.20 | 0.27 | 0.30 | 0.06 | 0.14 | 0.53 | 3 | 12 | 4 | 31
GLM | S | Yes | No | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.72 | 6 | 9 | 5 | 30
GLM | S | Yes | Yes | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.72 | 6 | 9 | 5 | 30
Boosted GLM | S | Yes | No | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.72 | 6 | 9 | 5 | 30
Boosted GLM | S | Yes | Yes | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.72 | 6 | 9 | 5 | 30
NNet | S | Yes | No | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.69 | 6 | 9 | 5 | 30
NNet | S | Yes | Yes | -1 | 0.28 | 0.28 | 0.63 | 0.72 | 0.84 | 0.58 | 0.70 | .45 | .42 | 0.40 | 0.86 | 0.55 | 0.77 | 0.55 | 0.40 | 0.46 | 0.30 | 0.12 | 0.22 | 0.69 | 6 | 9 | 5 | 30
GLM | S | Yes | No | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.47 | 0 | 4 | 0 | 8
GLM | S | Yes | Yes | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.47 | 0 | 4 | 0 | 8
Boosted GLM | S | Yes | No | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.47 | 0 | 4 | 0 | 8
Boosted GLM | S | Yes | Yes | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.47 | 0 | 4 | 0 | 8
NNet | S | Yes | No | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.33 | 0 | 4 | 0 | 8
NNet | S | Yes | Yes | 10 | 0.00 | 0.28 | 0.50 | 0.67 | 0.90 | 0.35 | 0.67 | .63 | .13 | 0.00 | 1.00 | NaN | 0.67 | NA | 0.00 | NA | 0.33 | 0.00 | 0.00 | 0.33 | 0 | 4 | 0 | 8
Random Forest | C | Yes | No | -1 | 0.21 | 0.13 | 0.60 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .80 | 0.40 | 0.80 | 0.46 | 0.76 | 0.46 | 0.40 | 0.43 | 0.30 | 0.12 | 0.26 | 0.68 | 6 | 9 | 7 | 28
Random Forest | C | Yes | Yes | -1 | 0.21 | 0.13 | 0.60 | 0.68 | 0.80 | 0.53 | 0.70 | .68 | .80 | 0.40 | 0.80 | 0.46 | 0.76 | 0.46 | 0.40 | 0.43 | 0.30 | 0.12 | 0.26 | 0.68 | 6 | 9 | 7 | 28
Random Forest | C | Yes | No | 10 | 0.08 | 0.13 | 0.54 | 0.62 | 0.75 | 0.47 | 0.70 | .92 | 1.00 | 0.33 | 0.74 | 0.36 | 0.72 | 0.36 | 0.33 | 0.34 | 0.30 | 0.10 | 0.28 | 0.52 | 5 | 10 | 9 | 26
Random Forest | C | Yes | Yes | 10 | 0.08 | 0.13 | 0.54 | 0.62 | 0.75 | 0.47 | 0.70 | .92 | 1.00 | 0.33 | 0.74 | 0.36 | 0.72 | 0.36 | 0.33 | 0.34 | 0.30 | 0.10 | 0.28 | 0.52 | 5 | 10 | 9 | 26
Boosted GLM | C+S | Yes | No | -1 | 0.17 | 0.23 | 0.59 | 0.66 | 0.79 | 0.51 | 0.70 | .78 | 1.00 | 0.40 | 0.77 | 0.43 | 0.75 | 0.43 | 0.40 | 0.41 | 0.30 | 0.12 | 0.28 | 0.65 | 6 | 9 | 8 | 27
Boosted GLM | C+S | Yes | Yes | -1 | 0.17 | 0.23 | 0.59 | 0.66 | 0.79 | 0.51 | 0.70 | .78 | 1.00 | 0.40 | 0.77 | 0.43 | 0.75 | 0.43 | 0.40 | 0.41 | 0.30 | 0.12 | 0.28 | 0.65 | 6 | 9 | 8 | 27
Boosted GLM | C+S | Yes | No | 10 | -0.06 | 0.23 | 0.48 | 0.64 | 0.77 | 0.49 | 0.70 | .86 | .03 | 0.07 | 0.89 | 0.20 | 0.69 | 0.20 | 0.07 | 0.10 | 0.30 | 0.02 | 0.10 | 0.45 | 1 | 14 | 4 | 31
Boosted GLM | C+S | Yes | Yes | 10 | -0.06 | 0.23 | 0.48 | 0.64 | 0.77 | 0.49 | 0.70 | .86 | .03 | 0.07 | 0.89 | 0.20 | 0.69 | 0.20 | 0.07 | 0.10 | 0.30 | 0.02 | 0.10 | 0.45 | 1 | 14 | 4 | 31
Random Forest | S | Yes | No | -1 | 0.14 | 0.38 | 0.58 | 0.62 | 0.75 | 0.47 | 0.70 | .92 | .65 | 0.47 | 0.69 | 0.39 | 0.75 | 0.39 | 0.47 | 0.42 | 0.30 | 0.14 | 0.36 | 0.62 | 7 | 8 | 11 | 24
Random Forest | S | Yes | Yes | -1 | 0.14 | 0.38 | 0.58 | 0.62 | 0.75 | 0.47 | 0.70 | .92 | .65 | 0.47 | 0.69 | 0.39 | 0.75 | 0.39 | 0.47 | 0.42 | 0.30 | 0.14 | 0.36 | 0.62 | 7 | 8 | 11 | 24
Random Forest | S | Yes | No | 10 | -0.24 | 0.38 | 0.38 | 0.42 | 0.72 | 0.15 | 0.67 | .98 | 1.00 | 0.25 | 0.50 | 0.20 | 0.57 | 0.20 | 0.25 | 0.22 | 0.33 | 0.08 | 0.42 | 0.41 | 1 | 3 | 4 | 4
Random Forest | S | Yes | Yes | 10 | -0.24 | 0.38 | 0.38 | 0.42 | 0.72 | 0.15 | 0.67 | .98 | 1.00 | 0.25 | 0.50 | 0.20 | 0.57 | 0.20 | 0.25 | 0.22 | 0.33 | 0.08 | 0.42 | 0.41 | 1 | 3 | 4 | 4
Random Forest | C | No | No | -1 | 0.11 | 0.23 | 0.55 | 0.67 | 0.81 | 0.51 | 0.70 | .70 | .18 | 0.23 | 0.87 | 0.43 | 0.72 | 0.43 | 0.23 | 0.30 | 0.30 | 0.07 | 0.16 | 0.65 | 3 | 10 | 4 | 26
Random Forest | C | No | Yes | -1 | 0.11 | 0.23 | 0.55 | 0.67 | 0.81 | 0.51 | 0.70 | .70 | .18 | 0.23 | 0.87 | 0.43 | 0.72 | 0.43 | 0.23 | 0.30 | 0.30 | 0.07 | 0.16 | 0.65 | 3 | 10 | 4 | 26
Random Forest | C | No | No | 10 | -0.12 | 0.23 | 0.44 | 0.54 | 0.68 | 0.39 | 0.70 | .99 | 1.00 | 0.20 | 0.69 | 0.21 | 0.67 | 0.21 | 0.20 | 0.21 | 0.30 | 0.06 | 0.28 | 0.47 | 3 | 12 | 11 | 24
Random Forest | C | No | Yes | 10 | -0.12 | 0.23 | 0.44 | 0.54 | 0.68 | 0.39 | 0.70 | .99 | 1.00 | 0.20 | 0.69 | 0.21 | 0.67 | 0.21 | 0.20 | 0.21 | 0.30 | 0.06 | 0.28 | 0.47 | 3 | 12 | 11 | 24
GLM | S | No | No | -1 | 0.10 | 0.09 | 0.55 | 0.65 | 0.80 | 0.46 | 0.71 | .83 | .77 | 0.30 | 0.79 | 0.38 | 0.73 | 0.38 | 0.30 | 0.33 | 0.29 | 0.09 | 0.24 | 0.65 | 3 | 7 | 5 | 19
GLM | S | No | Yes | -1 | 0.10 | 0.09 | 0.55 | 0.65 | 0.80 | 0.46 | 0.71 | .83 | .77 | 0.30 | 0.79 | 0.38 | 0.73 | 0.38 | 0.30 | 0.33 | 0.29 | 0.09 | 0.24 | 0.65 | 3 | 7 | 5 | 19
GLM | S | No | No | 10 | 0.01 | 0.09 | 0.50 | 0.52 | 0.66 | 0.37 | 0.70 | 1.00 | .15 | 0.47 | 0.54 | 0.30 | 0.70 | 0.30 | 0.47 | 0.37 | 0.30 | 0.14 | 0.46 | 0.49 | 7 | 8 | 16 | 19
GLM | S | No | Yes | 10 | 0.01 | 0.09 | 0.50 | 0.52 | 0.66 | 0.37 | 0.70 | 1.00 | .15 | 0.47 | 0.54 | 0.30 | 0.70 | 0.30 | 0.47 | 0.37 | 0.30 | 0.14 | 0.46 | 0.49 | 7 | 8 | 16 | 19
NNet | C+S | Yes | No | -1 | 0.08 | 0.24 | 0.54 | 0.60 | 0.74 | 0.45 | 0.70 | .95 | .82 | 0.40 | 0.69 | 0.35 | 0.73 | 0.35 | 0.40 | 0.38 | 0.30 | 0.12 | 0.34 | 0.49 | 6 | 9 | 11 | 24
NNet | C+S | Yes | Yes | -1 | 0.08 | 0.24 | 0.54 | 0.60 | 0.74 | 0.45 | 0.70 | .95 | .82 | 0.40 | 0.69 | 0.35 | 0.73 | 0.35 | 0.40 | 0.38 | 0.30 | 0.12 | 0.34 | 0.49 | 6 | 9 | 11 | 24
NNet | C+S | Yes | No | 10 | -0.15 | 0.24 | 0.43 | 0.58 | 0.72 | 0.43 | 0.70 | .97 | .19 | 0.07 | 0.80 | 0.13 | 0.67 | 0.13 | 0.07 | 0.09 | 0.30 | 0.02 | 0.16 | 0.51 | 1 | 14 | 7 | 28
NNet | C+S | Yes | Yes | 10 | -0.15 | 0.24 | 0.43 | 0.58 | 0.72 | 0.43 | 0.70 | .97 | .19 | 0.07 | 0.80 | 0.13 | 0.67 | 0.13 | 0.07 | 0.09 | 0.30 | 0.02 | 0.16 | 0.51 | 1 | 14 | 7 | 28
Random Forest | C+S | Yes | No | -1 | 0.07 | 0.10 | 0.53 | 0.64 | 0.77 | 0.49 | 0.70 | .86 | .48 | 0.27 | 0.80 | 0.36 | 0.72 | 0.36 | 0.27 | 0.31 | 0.30 | 0.08 | 0.22 | 0.61 | 4 | 11 | 7 | 28
Random Forest | C+S | Yes | Yes | -1 | 0.07 | 0.10 | 0.53 | 0.64 | 0.77 | 0.49 | 0.70 | .86 | .48 | 0.27 | 0.80 | 0.36 | 0.72 | 0.36 | 0.27 | 0.31 | 0.30 | 0.08 | 0.22 | 0.61 | 4 | 11 | 7 | 28
Random Forest | C+S | Yes | No | 10 | -0.03 | 0.10 | 0.49 | 0.60 | 0.74 | 0.45 | 0.70 | .95 | .50 | 0.20 | 0.77 | 0.27 | 0.69 | 0.27 | 0.20 | 0.23 | 0.30 | 0.06 | 0.22 | 0.62 | 3 | 12 | 8 | 27
Random Forest | C+S | Yes | Yes | 10 | -0.03 | 0.10 | 0.49 | 0.60 | 0.74 | 0.45 | 0.70 | .95 | .50 | 0.20 | 0.77 | 0.27 | 0.69 | 0.27 | 0.20 | 0.23 | 0.30 | 0.06 | 0.22 | 0.62 | 3 | 12 | 8 | 27
NNet | C+S | No | No | -1 | 0.05 | 0.04 | 0.53 | 0.64 | 0.81 | 0.44 | 0.71 | .85 | .75 | 0.25 | 0.80 | 0.33 | 0.73 | 0.33 | 0.25 | 0.29 | 0.29 | 0.07 | 0.21 | 0.46 | 2 | 6 | 4 | 16
NNet | C+S | No | Yes | -1 | 0.05 | 0.04 | 0.53 | 0.64 | 0.81 | 0.44 | 0.71 | .85 | .75 | 0.25 | 0.80 | 0.33 | 0.73 | 0.33 | 0.25 | 0.29 | 0.29 | 0.07 | 0.21 | 0.46 | 2 | 6 | 4 | 16
NNet | C+S | No | No | 10 | 0.02 | 0.04 | 0.51 | 0.58 | 0.72 | 0.43 | 0.70 | .97 | 1.00 | 0.33 | 0.69 | 0.31 | 0.71 | 0.31 | 0.33 | 0.32 | 0.30 | 0.10 | 0.32 | 0.53 | 5 | 10 | 11 | 24
NNet | C+S | No | Yes | 10 | 0.02 | 0.04 | 0.51 | 0.58 | 0.72 | 0.43 | 0.70 | .97 | 1.00 | 0.33 | 0.69 | 0.31 | 0.71 | 0.31 | 0.33 | 0.32 | 0.30 | 0.10 | 0.32 | 0.53 | 5 | 10 | 11 | 24

GLM C+S Yes No -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25

GLM C+S Yes Yes -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25

GLM C+S Yes No 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24

GLM C+S Yes Yes 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24 PCA C+S Yes No -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24 N N et PCA C+S Yes Yes -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24 N N et PCA C+S Yes No 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28 N N et PCA C+S Yes Yes 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28 N N et PCA S No No -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17 N N et PCA S No Yes -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17 N N et PCA S No No 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28 N N et PCA S No Yes 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28 N N et

GLM C No No 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15

GLM C No Yes 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15

GLM C No No -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16

GLM C No Yes -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16

N N et S No No -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16

N N et S No Yes -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16

N N et S No No 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24

N N et S No Yes 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24

GLM C+S No No 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23

GLM C+S No Yes 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23

GLM C+S No No -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10

GLM C+S No Yes -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10

184

Bal Ra Ac Ac Feat F Ac w c c NI Mc Sen Spe +ve -ve Re Pre AU T F F T Type s Imp Sel k 휿 |Δ 휿| c Acc UB LB R P N P s c PV PV Pre c F1 v DR DP C P N P N Boosted S No No 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28 GLM Boosted S No Yes 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28 GLM Boosted S No No -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18 GLM Boosted S No Yes -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18 GLM Rando m S No No 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32 Forest Rando m S No Yes 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32 Forest Rando m S No No -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20 Forest Rando m S No Yes -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20 Forest PCA S Yes No 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8 N N et PCA S Yes Yes 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8 N N et
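Because every derived column in the table above follows arithmetically from the four confusion-matrix counts (TP, FN, FP, TN), the relationships can be made explicit with a short worked example. The following Python sketch is illustrative only: it is not the analysis code used in this thesis, and the function name binary_metrics is hypothetical. It recomputes the derived metrics from the counts of the GLM (Fts = S, Imp = Yes, k = -1) row above.

    # Illustrative sketch only: recompute the derived table columns from the
    # raw confusion-matrix counts. Rows whose denominators are zero are the
    # ones reported as NaN/NA in the table.
    def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
        n = tp + fn + fp + tn
        sens = tp / (tp + fn)            # Sens (= Rec)
        spec = tn / (tn + fp)            # Spec
        ppv = tp / (tp + fp)             # PPV (= Prec)
        npv = tn / (tn + fn)             # NPV
        acc = (tp + tn) / n              # Acc
        nir = max(tp + fn, fp + tn) / n  # NIR: accuracy of always guessing the majority class
        # Cohen's kappa: observed accuracy corrected for chance agreement
        p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
        return {
            "kappa": (acc - p_chance) / (1 - p_chance),
            "BalAcc": (sens + spec) / 2,
            "Acc": acc, "NIR": nir,
            "Sens": sens, "Spec": spec, "PPV": ppv, "NPV": npv,
            "F1": 2 * ppv * sens / (ppv + sens),
            "Prev": (tp + fn) / n,       # prevalence
            "DR": tp / n,                # detection rate
            "DP": (tp + fp) / n,         # detection prevalence
        }

    # Counts from the GLM (Fts = S, Imp = Yes, k = -1) row: TP=6, FN=9, FP=5, TN=30
    for name, value in binary_metrics(6, 9, 5, 30).items():
        print(f"{name}: {value:.2f}")

For those counts the sketch reproduces the tabulated values (κ = 0.28, BalAcc = 0.63, Sens = 0.40, Spec = 0.86, F1 = 0.46, and so on), which makes it a convenient consistency check on any row of the table.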