<<

DETECTING DYSKINESIA AND TREMOR IN PEOPLE WITH PARKINSON’S DISEASE

OR ESSENTIAL TREMOR DURING ACTIVITIES OF DAILY LIVING USING BODY

WORN ACCELEROMETERS AND MACHINE LEARNING ALGORITHMS

By

NATHANIEL DAVID DARNALL

A dissertation submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

WASHINGTON STATE UNIVERSITY

School of Mechanical and Materials Engineering

DECEMBER 2014

©Copyright by NATHANIEL DAVID DARNALL, 2014
All Rights Reserved


To the Faculty of Washington State University:

The members of the Committee appointed to examine the dissertation of NATHANIEL DAVID DARNALL find it satisfactory and recommend that it be accepted.

______David C. Lin, Ph.D., Chair

______Anita N. Vasavada, Ph.D.

______Diane J. Cook, Ph.D.

______Maureen Schmitter-Edgecombe, Ph.D.

ACKNOWLEDGEMENT

My primary thanks go to my adviser, Dr. David Lin, for providing direction and leadership throughout my graduate degree. Dr. Lin’s technical knowledge and dedication to excellence in learning parallel his kind heart and desire to contribute to the greater good of humanity. It was for these qualities that I sought him as an adviser, and I now consider him an inspiration both as my mentor and friend. Dr. Lin displayed an ability to synthesize technical engineering problems with clinical approaches, with the aim of producing a clinically relevant study. His vision gave me direction in the early stages of research design, and helped me maintain focus through the research challenges I faced.

Along with Dr. Lin, Dr. Anita Vasavada welcomed me into their laboratory. She and David together provided an open and nurturing environment which encouraged new scientific discoveries, as well as new friendships. Through her knowledge of biomechanics, Dr. Vasavada called my attention to technical perspectives relating to dyskinesia and tremor. She also assisted me in the statistical analyses.

Dr. Diane Cook was instrumental in my acceptance as an IGERT fellow, which ultimately provided the opportunities and contacts I needed to conduct research with people with Parkinson’s disease. She believed in me throughout my time at WSU, and provided technical knowledge of machine learning uses and applications.

Dr. Maureen Schmitter-Edgecombe offered me insight into neuropsychology, and collaborated in participant recruitment and data collection. With her input, I was able to diversify my degree to include fields of study that are not typical for a mechanical engineer.

Dr. Narayanan “CK” Chatapuram Krishnan guided me through the use and interpretation of machine learning algorithms.

Thanks to Jamie Mark, ARNP, Dr. Jonathan Carlson, MD, Ph.D., Dr. David Greeley, MD, FAAN, Pat Kautzman, and the staff of Northwest Neurological, PLLC, for directing me on the clinical perspective, for providing me the opportunity to interact with patients in a clinic, and for including me as an observer of surgeries.

To my wonderful wife, Rebecca Darnall, I offer my thanks for your support and love throughout my years at graduate school. If it were not for your influence, I would have missed out on the challenge, education, and friendships I have grown fond of at WSU. We spent our first married years as graduate students together at WSU. I will always remember how much you gave of yourself during that time, and our good memories together. You continue to teach me more about life and love than I could ever learn in my profession.

To my father, David Darnall, I offer my thanks for both his education in the scientific method and his training in mechanical engineering. He offered me my first job out of my BSME degree, and provided me the experience I needed to succeed in my career.

Thank you to my mother, Dorothy Darnall; Johann and Brian McDougall and other family members; Shuai Shao and other friends; and my lab mates Vladimir, Beth, Derek, and Katie, for their support.

Thank you to every one of the participants in my studies. It was a joy to work with each one of them, and I wish them all the best.

DETECTING DYSKINESIA AND TREMOR IN PEOPLE WITH PARKINSON’S DISEASE

OR ESSENTIAL TREMOR DURING ACTIVITIES OF DAILY LIVING USING BODY

WORN ACCELEROMETERS AND MACHINE LEARNING ALGORITHMS

Abstract

By Nathaniel David Darnall, Ph.D.
Washington State University
December 2014

Chair: David C. Lin

Parkinson’s disease (PD) is a progressive neurodegenerative disorder that causes fluctuating motor deficits such as akinesia, bradykinesia, impaired balance, and tremor. Time periods characterized by severe deficits are referred to as “OFF” periods, while periods of relatively normal function are considered “ON” periods. Clinicians treat these deficits through the combined administration of carbidopa/levodopa medication, dopamine agonist medication, and deep brain stimulation. While motor deficits can be reduced, overmedication or overstimulation can cause dyskinesia, an involuntary, rhythmic or choreic, exaggeration of movements. Clinicians assess the occurrence of motor deficits and dyskinesia, in part, by asking patients to retrospectively self-report how frequently these periods occurred over the several months prior to the clinical visit. This method is subject to recall bias. To augment the clinical assessment, several systems have been developed to provide clinical ratings from body-worn sensor data using computational algorithms. However, these systems place a substantial time burden and inconvenience on both clinicians and patients. Our ultimate goal is to develop an objective system that continuously identifies tremor, dyskinesia, and non-dyskinesia periods that occur during activities of daily living without placing a time burden on a clinician. We hypothesize that we can classify body-worn accelerometer data into tremor, dyskinesia, and non-dyskinesia periods using signal analysis, feature extraction, and machine learning algorithms (MLAs). This research will focus on three specific aims:

1. Classify kinematic data collected during clinical assessment tasks onto tremor severity ratings using machine learning algorithms.

2. Develop a system that classifies features derived from body-worn accelerometer data onto dyskinesia presence, as determined from visual observation of participants performing unconstrained activities of daily living.

3. Determine factors that would generalize the dyskinesia detection system for continuous in-home use.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
DEDICATION

1. INTRODUCTION
1.1 TREATMENT OF MOTOR SYMPTOMS OF PARKINSON’S DISEASE
1.2 CLINICAL ASSESSMENT OF PARKINSON’S DISEASE
1.3 STATE OF THE ART MONITORING SYSTEMS FOR THE ASSESSMENT OF PARKINSON’S DISEASE
1.3.1 “Mercury Live” system for replicating UPDRS ratings
1.3.2 “Kinesia” home monitoring system
1.3.3 Quantification of Levodopa-Induced Dyskinesia
1.3.4 Identification of On/Off periods
1.3.5 Identification of dyskinesia periods
1.4 CLINICIAN AND PATIENT ACCEPTANCE CRITERIA FOR BODY-WORN SENSOR SYSTEMS
1.5 STUDY HYPOTHESIS
1.6 SPECIFIC AIMS OF THE STUDY
BIBLIOGRAPHY

2. APPLICATION OF MACHINE LEARNING AND NUMERICAL ANALYSIS TO CLASSIFY TREMOR IN PATIENTS AFFECTED WITH ESSENTIAL TREMOR OR PARKINSON’S DISEASE
2.1 INTRODUCTION
2.2 METHODS
2.2.1 Participant Description
2.2.2 Clinicians’ tremor rating method
2.2.3 Deep brain stimulation device
2.2.4 Hardware and software
2.2.5 Digital pen
2.2.6 Experimental procedure
2.2.7 Machine learning
2.2.8 Signal analysis
2.3 RESULTS
2.4 DISCUSSION
2.5 CONCLUSIONS
BIBLIOGRAPHY

3. DETECTING DYSKINESIA IN PEOPLE WITH PARKINSON’S DISEASE USING BODY WORN ACCELEROMETERS AND MACHINE LEARNING ALGORITHMS
3.1 INTRODUCTION
3.2 METHODS
3.2.1 Participant Description
3.2.2 Observations
3.2.3 Hardware
3.2.4 Data Processing
3.2.5 Feature Extraction
3.2.6 Machine Learning Algorithms
3.2.7 Optimization
3.3 RESULTS
3.3.1 Classification accuracy of different machine learning algorithms
3.3.2 Activity level effect
3.3.3 Class imbalance
3.3.4 Most effective sensor locations
3.3.5 Feature with greatest effect on classification accuracy
3.4 DISCUSSION
BIBLIOGRAPHY

4. GENERALIZATION OF DYSKINESIA SYSTEM
4.1 INTRODUCTION
4.2 HYPOTHESES PROPOSED
4.3 HYPOTHESIS TESTING
4.3.1 Hypothesis 1: Dyskinesia feature variations between participants
4.3.2 Hypothesis 2: Differences in severity
4.3.3 Hypothesis 3: Fluctuations
4.3.4 Hypothesis 4: Transitions into or out of dyskinesia
4.4 CONCLUSIONS
BIBLIOGRAPHY

5. CONCLUSIONS
5.1 STUDY LIMITATIONS
5.2 FUTURE DIRECTIONS
BIBLIOGRAPHY

APPENDIX A
APPENDIX B

LIST OF TABLES

Table 1.1. Parkinson’s Disease Monitoring Systems
Table 2.1. Comparison of Tremor Rating Method Results
Table 2.2. Tremor Severity for Digital Pen Spiral Trace in Patient 5
Table 2.3. Exact Match Accuracy
Table 4.1. Dyskinesia Signs
Table A.1. Features

LIST OF FIGURES

2.1 Shimmer Wireless Sensor Unit Signal Processing
2.2 Digital Pen Tremor Scaling
2.3 Machine Learning Accuracy
2.4 Spiral Trace
2.5 Linearized Spiral Trace
2.6 Raw Gyroscope Data at Rest of Patient 10
2.7 Power Spectral Density Roll Axis Spiral Trace of Patient 10
2.8 Raw Gyroscope Data Roll Axis at Rest of Patient 7
2.9 Power Spectral Density Roll Axis Spiral Trace of Patient 7
3.1 Right Wrist Triaxial Accelerometer Signals
3.2 Classification Accuracy
3.3 MLA Accuracy and F-measure
3.4 Activity Level vs. AIMS
3.5 Activity Level vs. MLP Classification Accuracy
3.6 J48 Bias to Uniform Sampling
3.7 Accuracy of Sensor Combinations
3.8 Energy in Frequency Bands
4.1 Principal Component Analysis
4.2 First Two Principal Components
4.3 First Two Dyskinesia Principal Components
4.4 F-measure for Instances Added to Training Set
4.5 F-measure vs. Symptom Count
4.6 F-measure vs. AIMS
4.7 Constant Dyskinesia Feature Time-Series
4.8 No Dyskinesia Feature Time-Series
4.9 Fluctuating Dyskinesia Feature Time-Series
4.10 Histograms, Constant Dyskinesia
4.11 Histograms, Fluctuating Dyskinesia
4.12 F-measure vs. Skew
4.13 F-measure vs. Kurtosis
4.14 Worst Case Misclassifications
4.15 Dyskinetic MLP Sensitivity and Specificity, and Percent Dyskinesia

APPENDIX A Feature Descriptions
APPENDIX B Additional Graphs
B.1 F-measure vs. Skew for Energy_Low_High_Mean
B.2 F-measure vs. Skew for Low_Frequency_Energy
B.3 F-measure vs. Kurtosis for Energy_Low_High_Mean
B.4 F-measure vs. Kurtosis for Low_Frequency_Energy

LIST OF ABBREVIATIONS

ADL ...... Activities of Daily Living
AIMS ...... Abnormal Involuntary Movement Scale
ANN ...... Artificial Neural Network
DBS ...... Deep Brain Stimulation
DNN ...... Dynamic Neural Network
DSVM ...... Dynamic Support Vector Machine
EMA ...... Ecological Momentary Assessment
EMG ...... Electromyographic
ET ...... Essential Tremor
FFT ...... Fast Fourier Transform
GP ...... Globus Pallidus
HY ...... Hoehn and Yahr
HMM ...... Hidden Markov Model
ICC ...... Interclass Correlation Coefficient
J48 ...... WEKA’s J48 Decision Tree
LID ...... Levodopa Induced Dyskinesia
m-AIMS ...... Modified Abnormal Involuntary Movement Scale
MDS-UPDRS ...... Movement Disorder Society - Unified Parkinson’s Disease Rating Scale
MLA ...... Machine Learning Algorithm
MLP ...... Multilayer Perceptron
MRA ...... Multiple Regression Analysis
NN ...... Neural Network
PD ...... Parkinson’s Disease
RBF ...... Radial Basis Function
RMS ...... Root Mean Squared
ROC ...... Receiver Operator Characteristic
SPRS ...... Short Parkinson’s Rating Scale
STN ...... Subthalamic Nucleus
SVM ...... Support Vector Machine
SWSU ...... Shimmer Wireless Sensor Unit
TRS ...... Tremor Rating Scale
TS ...... Tremor Severity
UPDRS ...... Unified Parkinson’s Disease Rating Scale
WEKA ...... Waikato Environment for Knowledge Analysis


Dedication

This work is dedicated to my grandfather, Marvin Ewing Darnall, for whom I cared as a youth during his final battle with supranuclear palsy.

Soli Deo Gloria

1. INTRODUCTION

1.1 Treatment of motor symptoms of Parkinson’s disease

Motor symptoms of Parkinson’s disease (PD), such as rest tremor, bradykinesia, and rigidity [1], are treated with medication and/or surgical interventions [2]. Carbidopa/levodopa medication is the gold standard treatment for PD [3]. It replenishes dopamine levels in the brain that are abnormally low in PD due to the death of dopamine-producing cells [4]. Carbidopa/levodopa increases voluntary movement, reduces tremor, and reduces rigidity [5]. Dopamine agonists improve mobility [6] by increasing the uptake sensitivity to dopamine [7]. Deep brain stimulation (DBS) devices are inserted into the subthalamic nucleus (STN) or globus pallidus (GP) [8] of one or both sides of the brain [9]. Electrodes are connected to a controller which controls voltage, electrode configuration, and electric pulse generation, all of which are adjustable after implantation [10, 11]. DBS increases movement and reduces tremor and rigidity on the contralateral side of the body [2, 12].

Tremor can be defined as a rhythmic shaking and involuntary rhythmic movements of body segments. It occurs in healthy individuals, as so-called physiological tremor. Tremor is composed of two oscillations, mechanical reflex and central neurogenic, which are superimposed on a background of irregular and involuntary fluctuations in muscle forces and displacements. In patients with neurological disorders, tremor is clinically described as rest, postural, and kinetic tremor. Rest tremor appears during resting while postural tremor is triggered by maintenance of a posture or a position against gravity. Kinetic tremor is evoked by a voluntary movement and is maximal while near the movement target.

Dyskinesias in PD are involuntary, rhythmic or choreic exaggerated movements, and are a result of either too much levodopa [13] and/or overstimulation from a DBS device [14]. Dyskinesia in advanced PD may be either diphasic or occur as dyskinesia-improvement-dyskinesia syndrome during on/off transitions, but not at peak dose [3]. Carbidopa/levodopa loses its effectiveness over time, often requiring dosages to be increased, which contributes to side effects such as dyskinesia [7]. Patients who have DBS normally need less carbidopa/levodopa after the surgery, which helps to reduce dyskinesia [2, 15]. However, if the DBS device is not configured properly, it can cause dyskinesia, a situation which is corrected by reducing the voltage, modifying the impulse duration, altering the frequency, changing the configuration of electrodes, or reducing the amount of carbidopa/levodopa prescribed concurrently with the DBS device [16].

1.2 Clinical Assessment of Parkinson’s Disease

Clinicians assess motor deficits and the occurrence of dyskinesia during clinical visits typically ranging in frequency from once every few months to over a year [17, 18]. Clinicians adjust medication dosage and/or DBS settings to reduce off and dyskinesia periods [2]. To optimize this process, a clinician must know the occurrence and duration of dyskinesias, the occurrence of off periods, as well as relationships between off and dyskinesia periods and medication dosage and DBS settings to tailor therapy individually [3].

Common clinical rating scales used to evaluate disease progression, including the occurrence of off and dyskinesia periods and symptom severity, are the Unified Parkinson’s Disease Rating Scale (UPDRS) [19], the Movement Disorders Society - Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) [20], the Tremor Rating Scale (TRS) [21], Modified Hoehn and Yahr Staging (HY) [22], and the Schwab and England Activities of Daily Living Scale [23]. These scales combine retrospective patient self-reporting and clinical observations to assess motor abilities. Both patients and clinicians rate subjectively perceived disability severities on ordinal scales. This method is subject to recall bias, and has well-documented intra- and inter-rater variability [24]. Recall for critical daily events over a 6 month retention period has been reported at approximately 80% accuracy [25]. The TRS has an inter-rater reliability kappa statistic of 0.53 for tremor and 0.41 for handwriting [26]; the UPDRS has an inter-rater reliability ranging from 0.76 to 0.95 for the four parkinsonian domain scores [27] and an intra-rater reliability, expressed as an interclass correlation coefficient (ICC), of 0.85 for activities of daily living (ADL) and 0.90 for the motor rating section [24]. The MDS-UPDRS has an internal consistency, calculated as Cronbach’s alpha, of 0.79–0.93 across parts [20]. Furthermore, dyskinesia research in PD has been limited by the lack of an established, reliable, and valid clinical rating instrument that is known to be sensitive to changes in disease severity [28].
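The agreement statistics quoted above are straightforward to compute. As a minimal illustration, assuming two raters’ ordinal scores are available as plain lists (the data below are made up, not from the cited studies), Cohen’s kappa can be obtained with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Two hypothetical raters scoring the same ten patients on a 0-4 ordinal scale.
rater_a = [0, 1, 2, 2, 3, 1, 0, 4, 2, 1]
rater_b = [0, 1, 2, 3, 3, 1, 1, 4, 2, 2]

# kappa = 1 indicates complete agreement; kappa = 0 indicates chance-level agreement.
print(cohen_kappa_score(rater_a, rater_b))
```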

An alternative to retrospective self-reporting, ecological momentary assessment (EMA) encompasses a variety of diary approaches and technologies used to collect real-time data about a subject’s current state, either on a schedule or in response to an event. By contrast, autobiographical memory processes can introduce bias into retrospective self-reports, which form the bulk of clinical assessments [29, 30]. EMA aims to minimize recall bias by having study participants complete multiple assessments over time, which provides a sample of how their experiences and behaviors fluctuate over time and across situations [30]. Disadvantages of EMA include poor compliance and missed critical events due to limited sampling. Such obstacles can be minimized by planning assessments at times when participants are likely to respond, not annoying participants by over-sampling, and targeting samples for times when critical events are likely to happen.

1.3 State of the Art Monitoring Systems for the Assessment of PD

In-home monitoring for the chronically ill has been identified as a cost-effective method to complement traditional human-provided care [31]. Current state of the art in-home monitoring systems for the assessment of PD have several commonalities: they correlate data from multiple sensors to clinical ratings, they incorporate bulky or inconvenient-to-wear sensors, and they involve the participant performing tests or scripted activities. Importantly, clinicians have to provide the clinical ratings that form the basis for correlations between sensor data and clinical ratings.

1.3.1 “Mercury Live” system for replicating UPDRS ratings

The goals of the Mercury Live home monitoring system are to reproduce clinicians’ UPDRS ratings from a subset of UPDRS tasks performed in tandem with activities of daily living (ADL), and to classify sensor data onto tremor, bradykinesia, and dyskinesia severity using a support vector machine (SVM) [32-34]. Participants in this study performed motor evaluation tasks from the UPDRS in four 30-minute sessions per day, each comprising six 30-second tests separated by 20-minute intervals. Clinicians rated UPDRS severity for the tasks from videos of the participant performing the tasks at home. Participants wore a total of 8 triaxial accelerometers on the upper and lower arms and upper and lower legs, from which six features were derived for each accelerometer axis: the root mean square (RMS) value of the linearly de-trended accelerometer signal, the range of amplitude of each channel, the dominant frequency in the 0.5-5 Hz range, the signal modulation frequency, the ratio of energy associated with the dominant frequency component to the total energy in the 0.5-5 Hz range, and signal entropy. Features relevant to the task being performed were compiled into a feature vector with clinician ratings as the target class, and run through a random forest regression classifier to classify onto UPDRS ratings. The RMS error of the algorithms was 0.4 when compared to clinicians’ ratings (on a 0-4 scale). Their major contributions were that they successfully identified six features and a random forest classifier useful in classifying accelerometer features onto UPDRS ratings, and ranked features in order of importance using cluster analysis. For dyskinesia, they achieved a classification error of 3.7% using the signal entropy feature and of 1.9% using signal cross-correlation and signal entropy as features. The 5 subjects were constrained to UPDRS tasks instead of ADL, which is not applicable to a continuous monitoring system.

1.3.2 “Kinesia” home monitoring system

The goal of the Kinesia home-monitoring system is to replicate the average of 2 clinicians’ UPDRS ratings for periodic assessments [35-37]. Kinesia utilizes an in-home video interface that prompts patients to perform UPDRS motor task 20 to assess rest, postural, and kinetic tremor of the upper extremities over a 65 second testing period, followed by a short questionnaire. Users wear a finger- and hand-worn set of gyroscopic and accelerometer sensors. Clinicians review the video to rate UPDRS severity. The system extracts the features peak power, frequency of peak power, RMS of angular velocity, and RMS of angle. These features are correlated to the clinicians’ UPDRS scores using multiple regression analysis (MRA). The MRA model is used to predict the UPDRS score from new data that has not been rated. Kinesia demonstrated the best MRA fit of r2=0.90, using power features derived from all axes of both accelerometer and gyroscopic sensors while evaluating postural tremor. Rest tremor had a slightly lower fit value at r2=0.89, and kinetic tremor had a poor fit at r2=0.69. The major contribution of this study was identifying peak power and frequency of peak power as useful features in classifying onto tremor severity.
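The regression step can be sketched as a linear fit of clinician scores on the four kinematic features, which is then reused to predict scores for unrated recordings. The data below are synthetic and the column ordering is an assumption, not the Kinesia implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Columns stand in for: peak power, frequency of peak power,
# RMS of angular velocity, RMS of angle (illustrative only).
X = rng.normal(size=(60, 4))
y = X @ np.array([0.8, 0.1, 0.5, 0.3]) + rng.normal(0, 0.2, size=60)  # stand-in UPDRS scores

mra = LinearRegression().fit(X, y)
print("r^2 on training data:", mra.score(X, y))
print("predicted score for a new recording:", mra.predict(rng.normal(size=(1, 4))))
```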

1.3.3 Quantification of Levodopa-Induced Dyskinesia (LID)

The goal of the study by Mera et al. was to capture levodopa-induced dyskinesia (LID) using 2 “Kinesia” hand-worn sensors [31, 37] while subjects sat with the hand resting in the lap, and then with the hand extended, for 20 seconds each. Testing frequency was 3-6 times per day over 3-6 days for each of their 15 subjects. Features were calculated from frequency-domain analyses and were classified onto clinician-rated modified Abnormal Involuntary Movement Scale (AIMS) ratings of participant videos using a multilayer perceptron (MLP). Threshold feature values for distinguishing dyskinesia presence were determined using receiver operator characteristic (ROC) analysis. The most important features in identifying LID severity were the RMS ratio between frequency bands (r=0.64) and the log of median power frequency (r=-0.70). Features were also ranked in order of dyskinesia detection performance, of which the most important feature was the RMS ratio between frequency bands above and below 3 Hz (sensitivity 0.73, specificity 1.00). They did not report overall accuracy.
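A feature threshold can be read off an ROC curve by maximizing a criterion such as Youden’s J (sensitivity + specificity - 1); the cited study may have used a different criterion, and the labels and feature values below are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
is_dyskinetic = rng.integers(0, 2, size=200)            # stand-in clinician labels
feature = is_dyskinetic + rng.normal(0, 0.7, size=200)  # stand-in scalar feature

fpr, tpr, thresholds = roc_curve(is_dyskinetic, feature)
best = np.argmax(tpr - fpr)  # index maximizing Youden's J
print(f"threshold={thresholds[best]:.3f}, "
      f"sensitivity={tpr[best]:.2f}, specificity={1 - fpr[best]:.2f}")
```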

1.3.4 Identification of On/Off periods

Keijsers et al. (2006) conducted a study in which their goal was to classify sensor data onto on/off states observed in 23 PD participants over a 3 hour period while they performed scripted daily life activities [38]. Six triaxial accelerometers located on the arms, legs, torso, and most affected wrist recorded data. Features were calculated from both time-domain and frequency-domain analyses and were classified with an MLP neural network, with one experimenter’s and one trained physician’s ratings as target values. Results showed that one feature, the percentage of peak frequencies above 4 Hz of the trunk, was the most distinguishing feature in characterizing off periods in the group of 23 participants. This feature alone gave a sensitivity of 96% and specificity of 95%. The entire group of sensors and features gave a sensitivity and specificity of 97%, indicating the algorithms generalize well over the sample population, even when only data from the torso accelerometer were considered. Drawbacks to this system included a simulated ADL scenario in which participants were given continuous instructions to keep them active, the need for neurologists to evaluate 3 hours of video per subject, and the exclusion of some periods from the dataset, including walking periods greater than 3 minutes and transition periods between on and off states.

1.3.5 Identification of dyskinesia periods

One research group developed a system with the goal of classifying sensor data onto dyskinesia occurrences within a simulated in-home environment (Keijsers et al., 2003a; Keijsers et al., 2003b). The system used multilayer perceptron (MLP) neural networks for 15-minute intervals over a 2.5 hour period with data from 6 body-worn triaxial accelerometers located at the upper arms, upper legs, most dyskinetic wrist, and top of the sternum [39]. Features were calculated from time-domain and frequency-domain analyses and were evaluated with an MLP with one input layer, one hidden layer, and one output layer. Training and validation were performed with an 80/20% data split. The most valuable variables were the ratio between frequencies above and below 3 Hz of the most affected leg, the percentage of time the trunk was moving, and the standard deviation of the leg, which gave Spearman rank correlations of 0.38, 0.44, and 0.37 for the arm, trunk, and leg, respectively. Cross-correlations between various limb segments’ accelerations contributed the most to distinguishing LID from voluntary movements, except for the walking condition. For walking conditions, power in the frequency ranges below and above 3 Hz was the most useful feature in detecting dyskinesia. Different neural networks were trained for each sensor location on the body. The neural network correctly classified features onto dyskinesia or the absence of dyskinesia in 15-minute intervals 93.7%, 99.7%, and 97.0% of the time for the arm, trunk, and leg, respectively [40]. Contributions of this study included distinguishing dyskinesia from voluntary movements using a signal frequency of 3 Hz as a cutoff, and identifying cross-correlation of values across sensors as useful in distinguishing dyskinesia.
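The recurring 3 Hz cutoff amounts to a band-energy ratio computed from a window’s power spectral density. A minimal sketch, assuming a 1-D window and sampling rate fs (the exact band edges in the cited studies may differ):

```python
import numpy as np
from scipy.signal import welch

def band_energy_ratio(x, fs=100.0, f_cut=3.0, f_max=10.0):
    """Ratio of spectral power above vs. below the cutoff frequency."""
    f, pxx = welch(x, fs=fs, nperseg=min(256, len(x)))
    low = pxx[(f > 0) & (f < f_cut)].sum()
    high = pxx[(f >= f_cut) & (f <= f_max)].sum()
    return high / (low + 1e-12)  # small constant guards against division by zero
```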

Tsipouras et al. hypothesized that they could generate an automated system for assessing levodopa-induced dyskinesia by incorporating MLAs with signals from 2 body-worn triaxial gyroscopic and 6 body-worn triaxial accelerometer sensors [41]. Subjects wore 4 accelerometers on the wrists and ankles and the combined accelerometers and gyroscopes on the chest and waist while performing a prescribed set of ADLs over a 15-minute session. Video was recorded and annotated by 2 neurologists for the occurrence of PD symptoms, on or off state, and dyskinesia presence and severity. Features were calculated by time-domain and frequency-domain analyses over 1 second time windows with 0.5 second overlap between windows. MLP (92.99%), random forest (92.55%), and C4.5 decision tree (92.51%) were found to be the most accurate classifiers of feature data onto dyskinesia using 10-fold cross validation. Any combination of 2 or more sensors with any of these 3 classifiers produced accuracies between 89% and 93% in detecting the presence of dyskinesia. Including more than 2 sensors had a negligible effect on accuracy, with the exception of including all sensors, which only improved accuracy by 1 percentage point over the 2-sensor method. Sensitivity and positive predictive values for dyskinesia were 80.5% and 76.84%, respectively, using 10-fold cross validation across subjects. Notably, the frequency band 2-5 Hz was related to dyskinesia, while the frequency band 5-10 Hz was related to tremor. Dyskinesia was highly associated with entropy in the frequency domain.
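The 1 second / 0.5 second windowing scheme is simple to express directly; the sketch below segments a 1-D signal into overlapping fixed-length windows (parameter names are ours):

```python
import numpy as np

def sliding_windows(signal, fs, win_s=1.0, overlap_s=0.5):
    """Segment a 1-D signal into win_s-second windows overlapping by overlap_s seconds."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    n_windows = (len(signal) - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n_windows)])

# A 100 Hz recording yields 100-sample windows starting every 50 samples.
windows = sliding_windows(np.arange(1000.0), fs=100)
print(windows.shape)  # (19, 100)
```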

Cole et al. identified the presence and severity of PD tremor and dyskinesia in 23 participants [42]. They used dynamic neural networks (DNN), dynamic support vector machines (DSVM), and hidden Markov models (HMM) to classify 1 second instances onto dyskinesia presence. HMMs classify sequences of features probabilistically, DNNs divide the feature space using a series of linear hyperplanes, and DSVMs divide the feature space using a series of non-linear hyperplanes. They devised two HMMs: one described instances containing dyskinesia, while the other described instances containing non-dyskinesia. They reported global error rates of 8.8% for the DNN, 9.1% for the DSVM, and 12.3% for detection of dyskinesia presence, for all activity states of their participants. When they classified dyskinesia severity with the HMM, they found a best specificity of 98.6% for severe dyskinesia and a worst specificity of 91.9% for moderate dyskinesia. They did not describe whether their MLA validation method included a leave-one-participant-out validation. While they reported that their participants were monitored while they conducted unconstrained ADL in the home environment, they did not specify what the home environment was, how the participants chose to perform ADL, or what activities participants performed. They also discarded instances in which dyskinesia severity was not clear. They classified 1 second instances, which may have been too fine a resolution to adequately capture complete dyskinesia symptom cycles, which occur at a frequency of 1-3.5 Hz, or to allow the clinicians who rated individual instances for dyskinesia presence and severity to rate dyskinesia accurately.
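The two-model HMM scheme can be sketched with the third-party hmmlearn package: fit one HMM to dyskinesia feature sequences and one to non-dyskinesia sequences, then label a new sequence by whichever model assigns it the higher log-likelihood. This is an illustration under assumed settings (the state count and covariance type are arbitrary here), not the authors’ implementation:

```python
import numpy as np
from hmmlearn import hmm  # third-party package, assumed installed

def fit_hmm(sequences, n_states=3):
    """Fit one Gaussian HMM to a list of (n_samples, n_features) sequences."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(X, lengths)
    return model

def classify(seq, hmm_dysk, hmm_non):
    """Assign a sequence to the model with the higher log-likelihood."""
    return "dyskinesia" if hmm_dysk.score(seq) > hmm_non.score(seq) else "non-dyskinesia"
```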

The various studies chose different analyses to report the classification ability of their algorithms. We provide a brief summary representing the state of the art (Table 1.1).

Study: Patel et al., 2011, “Mercury Live” System
Objective: Reproduce UPDRS ratings for tremor, bradykinesia, and dyskinesia.
Methods: Data from accelerometers correlate to UPDRS clinician ratings using decision tree regression analysis.
Pros: Identified signal entropy, SVM, and a random forest classifier as useful in classifying onto UPDRS ratings and dyskinesia severity.
Cons: Subjects were required to perform UPDRS tasks while wearing 9 accelerometers. Algorithm is subject-specific. RMS error is 0.4 on UPDRS tasks (out of 0-4 scale).

Study: Giuffrida et al., 2009 and Mera et al., 2012, “Kinesia” System
Objective: Correlate body-worn sensor data to UPDRS tremor severity and detect dyskinesia.
Methods: Correlate features and feature thresholds derived from finger-worn accelerometers and gyroscopes to clinicians’ ratings of recorded video using a multiple regression algorithm (MRA).
Pros: Established an r2 fit of 0.90 for resting tremor using MRA. Ranked features in order of dyskinesia detection performance. Identified the most important feature, RMS ratio between high and low frequency bands (sensitivity 0.73, specificity 1.00).
Cons: Finger-worn hardware cannot be worn continuously; the video rating system interrupts the patient’s day and requires clinician time to review. Subject-specific; did not generalize for participants unseen in the training data set. Assessment not during ADL.

Study: Keijsers et al., 2003 and 2006
Objective: Distinguish between on/off states and identify dyskinesia periods during scripted ADL.
Methods: Using the data from 6 body-worn accelerometers, extract features and classify with neural networks trained from clinician ratings of video.
Pros: Determined features identifying bradykinesia, tremor, and off periods. Generalization across all PD participants. Likely to be successful in unsupervised ambulatory conditions. Classified features onto dyskinesia with 93.7%, 99.7%, and 97.0% accuracy for the arm, trunk, and leg, respectively, using the leave-one-out technique.
Cons: Participants were given instructions about ADL to perform, which may not reflect realistic ADLs. On/off states were induced by withholding or administering L-dopa medication, which may not replicate normal on/off state fluctuations. Study required 3-hour continuous observation by clinicians to obtain training data.

Study: Tsipouras et al., 2012
Objective: Identify the presence of levodopa-induced dyskinesia.
Methods: Derive signal features from 6 body-worn accelerometers and 2 gyroscopes. Divide data into 1 second time windows, extract features, and classify onto dyskinesia with MLAs trained on neurologist-rated dyskinesia severity.
Pros: Distinguished dyskinesia from tremor, freezing of gait, and voluntary movements in a simulated real-life environment that included random events. Identified the best classifier as an MLP with 92.99% accuracy. Established that a minimum of 2 sensors is needed for classification accuracy within 2% of that of all sensors. High generalization across study participants.
Cons: Did not identify off periods. Subjects in a controlled environment were prompted to perform tasks. Requires clinician review of 2.5 hours of video per participant and 6 body-worn accelerometers; the environment simulated ADL but with scripted activities, and dyskinesia was purposefully induced.

Study: Cole et al., 2014
Objective: Identify the presence and severity of PD tremor and dyskinesia.
Methods: Classify 1 second instances of raw accelerometer data and 5 features calculated from accelerometers and electromyographic (EMG) sensors onto dyskinesia presence, tremor presence, and dyskinesia severity using dynamic neural networks and hidden Markov models.
Pros: Monitored participants during unscripted ADL in the home environment. Used only 5 features, which calculated energies and autocorrelation from accelerometer signals, to classify dyskinesia instances. Had a low global error rate (8.8% minimum) for detection of dyskinesia presence with dynamic neural networks, and high sensitivity and specificity (92% or greater) for detection of dyskinesia severity with hidden Markov models.
Cons: Did not specify what was meant by “unconstrained ADL” or describe the “home environment” used for participant monitoring. Discarded instances in which dyskinesia severity was not clear. Used 1 second instances, which may have been too fine a resolution to adequately capture complete dyskinesia symptom cycles (1-3.5 Hz) or to allow clinicians who rated individual instances to accurately rate dyskinesia.

Table 1.1: PD Monitoring Systems. This table presents a summary of the state of the art for systems assessing PD.

1.4 Clinician and Patient Acceptance Criteria for Body-worn Sensor Systems

Critical to the clinical utility and commercial viability of any body-worn sensor system is the question, “What do patients and clinicians want?” In a study addressing this issue [43], patients disclosed that they wanted a system that has a high acceptance rate, will not affect bodily behavior, will not replace the clinician, is easy to use, is small, and is unobtrusive.

Clinicians placed importance on a system that requires no technology training, has a simple interface, maintains a low cost, requires little upkeep, and has a low time demand.

Body-worn sensor systems in the previously mentioned studies do not meet all of these requirements. All systems require some form of time-consuming input from the clinician, the patient, or both in order to calibrate the system; this includes clinician review of between 65 seconds and 3 hours of video per participant. Some systems have a complicated interface requiring training and time from the clinician to obtain and interpret the results, such as the “Mercury Live” system. Additionally, current body-worn sensor systems tend to be obtrusive, requiring multiple sensors of varying bulk. The Tsipouras study participants in particular indicated that they did not like wearing the chest sensors.

1.5 Study Hypothesis

We hypothesize that we can classify body-worn accelerometer data into tremor, dyskinesia, and non-dyskinesia periods using signal analysis, feature extraction, and machine learning algorithms (MLAs). Unlike other sensor systems available commercially or for research, the proposed system determines tremor, dyskinesia, and non-dyskinesia states related to PD from continuous body-worn sensor data without placing a time-consuming severity rating requirement on the clinician. State of the art systems relate sensor data to clinician-rated clinical scales, but no system currently exists that will relate sensor data to tremor, dyskinesia, and non-dyskinesia states purely from continuous monitoring of activities of daily living, without either a clinician’s rating or a patient’s performing a set of clinical tests. One of our objectives is to determine which sensor location(s) provide(s) the most accurate classification of tremor, dyskinesia, and non-dyskinesia states, with the purpose of eliminating extraneous sensors.

1.6 SPECIFIC AIMS OF THE STUDY

The specific aims of this study are to:

1. Classify kinematic data collected during clinical assessment tasks onto tremor severity ratings using machine learning algorithms. Complete details are found in Chapter 2, and are presented in summary below:

The overall goal of this study was to compare the accuracy of various data analysis techniques to quantify tremor severity (TS) in a clinical context, with the aim of improving the reliability (context consistency and inter-rater agreement) of tremor evaluation in patients with Parkinson’s disease (PD) or essential tremor (ET). Ten patients with either PD or ET were asked to perform several tasks used in clinical practice for the characterization of tremor. Three-axis gyroscopes in a Shimmer device measured angular velocities of the wrist of each subject for postural, kinetic, spiral tracing, and resting scenarios, and a digital pen recorded subjects’ tracings of an Archimedes’ spiral printed on paper. Gyroscope data were used for training and testing a supervised machine learning algorithm to classify TS and for root mean squared (RMS) numerical rating of TS, while digital pen data were analyzed numerically to quantify tracing deviations from the spiral and obtain a tremor rating. We evaluated the performance of our proposed methods compared to clinicians’ diagnostic ratings. The machine learning method matched the clinical rating with 82% accuracy, the digital pen with 78% accuracy, and RMS with 42% accuracy. We obtained the best accuracy of 82% using the decision tree machine learning approach with gyroscope data measured with the Shimmer.

This part of the study was published in the journal Gerontechnology under the title “Application of machine learning and numerical analysis to classify tremor in patients affected with essential tremor or Parkinson’s disease” [44]. Chapter 2 follows the format requirements of the journal and varies slightly from the format of the other chapters.

2. Develop a system that classifies features derived from body-worn accelerometer data onto dyskinesia presence, as determined from visual observation of participants performing unconstrained activities of daily living. Complete details are found in Chapter 3, and are presented in summary below:

The motivation of our study is to develop and optimize a dyskinesia monitoring and classification system that has clinical relevance. Specific aims were to: extract features from accelerometer data recorded under ADL conditions, classify feature instances onto dyskinesia or non-dyskinesia, and optimize the system by examining practical considerations of implementing a dyskinesia monitoring system. We considered the following practical considerations: how many sensors are needed for dyskinesia classification, which sensor locations are most effective for dyskinesia classification, which learning algorithms have the greatest classification accuracy, and the size of the generalization effect, i.e., whether algorithms can be developed for a population without requiring clinician-observed training data for new subjects.

While we found high classification accuracy (96%) for all participants with 10-fold cross validation, we found lower classification accuracy (86%) for all participants with leave-one-participant-out validation. Our system did not generalize well for participants who were unseen in the training set, as shown by low classification accuracies between 22 and 80% for the dyskinetic participant left out. We had lower classification accuracies than other studies; however, our participants were less constrained in their environment and were allowed to go about their ADL in their daily environment without scripted activities. Because our study was less constrained than previous studies, it has implications for in-home use, which could be realized if we could improve the ability of our system to generalize for new participants. (A sketch contrasting the two validation schemes follows this summary.)
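As referenced above, the gap between the two validation schemes can be sketched with scikit-learn: 10-fold cross validation lets every participant contribute instances to training, while leave-one-group-out holds each participant’s instances out entirely. The data are synthetic stand-ins and the MLP settings are arbitrary, not the WEKA configuration used in Chapter 3:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))          # stand-in feature vectors
y = rng.integers(0, 2, size=300)        # dyskinesia / non-dyskinesia labels
groups = np.repeat(np.arange(10), 30)   # participant ID for each instance

clf = MLPClassifier(max_iter=500)
acc_kfold = cross_val_score(clf, X, y, cv=KFold(10, shuffle=True, random_state=0))
acc_loso = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print("10-fold:", acc_kfold.mean(), "leave-one-participant-out:", acc_loso.mean())
```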

3. Determine which factors would generalize the dyskinesia detection system for continuous in-home use. Complete details are found in Chapter 4, and are presented in summary below:

We investigated several hypotheses about why our classification models did not generalize with high accuracy on participants whose instances were not included in the training set. During data collection with participants, we observed variations in the way each participant presented dyskinesia, including: severity of dyskinesia, locations on the body which were affected by dyskinesia, fluctuation in dyskinesia severity throughout the session, and transitions into or out of a dyskinesia period. We proposed four hypotheses (an illustrative sketch of the principal component analysis considered under Hypothesis 1 appears after this summary):

Hypothesis 1: Dyskinesia feature variations between participants

If individual participants had feature sets that were unique, multiple principal components will be required to account for feature vector variations, and adding a few dyskinesia instances to the training set from the participant left out of the training set will increase the classification accuracy of that participant.

Hypothesis 2: Differences in body location and severity of dyskinesia

Participants affected by dyskinesia on different locations of their body will have different classification accuracies, while participants affected by dyskinesia on the same body location will have similar classification accuracies. If differences in overall dyskinesia severity affected classification accuracy, then a relationship exists between severity and accuracy.

Hypothesis 3: Fluctuations

If dyskinesia fluctuations in severity throughout a dyskinesia period affected classification accuracy, such fluctuations can be quantified from the feature set and correlated to the classification accuracy.

Hypothesis 4: Transitions into or out of dyskinesia

If transitions into or out of dyskinesia affected classification accuracy, instances surrounding a transition will contain the majority of misclassifications for transitioning participants, and those participants will have a lower classification sensitivity than non-transitioning dyskinetic participants.

Dyskinesia fluctuations quantified by the distribution of acceleration energy levels seemed to cause problems with classification accuracy. From our hypothesis investigations, we saw that low MLP classification ability seemed to be affected by fluctuations in dyskinesia severity within a dyskinesia period and by transitions into or out of a dyskinesia period, but not by dyskinesia location on the body or by maximum dyskinesia severity within the dyskinesia period.
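Hypothesis 1 turns on how many principal components are needed to account for the variance of the dyskinesia feature vectors. A minimal PCA sketch with a synthetic stand-in feature matrix (the real features come from the extraction step in Chapter 3):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
features = rng.normal(size=(500, 20))   # stand-in dyskinesia feature matrix

pca = PCA().fit(features)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# If feature sets were shared across participants, few components would suffice.
print("components for 95% of variance:", int(np.searchsorted(cumulative, 0.95)) + 1)
```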

BIBLIOGRAPHY

1. Cunningham, L., et al., Computer-Based Assessment of Bradykinesia, Akinesia and Rigidity in Parkinson’s Disease Ambient Assistive Health and Wellness Management in the Heart of the City, M. Mokhtari, et al., Editors. 2009, Springer Berlin / Heidelberg. p. 1-8.

2. Fahn, S., How do you treat motor complications in Parkinson's disease: Medicine, surgery, or both? Annals of Neurology, 2008. 64(S2): p. S56-S64.

3. Olanow, C.W., R.L. Watts, and W.C. Koller, An algorithm (decision tree) for the management of Parkinson's disease (2001): Treatment Guidelines. Neurology, 2001. 56(11 Suppl 5): p. S1-S88.

4. Hughes, A.J., et al., What features improve the accuracy of clinical diagnosis in Parkinson's disease: A clinicopathologic study. Neurology., 1992. 42(7): p. 1436-1436.

5. Obeso, J.A., et al., Pathophysiology of the basal ganglia in Parkinson's disease. Trends in neurosciences, 2000. 23, Supplement 1(0): p. S8-S19.

6. Deuschl, G., et al., A randomized trial of deep-brain stimulation for Parkinson's disease. The New England journal of medicine, 2006. 355(9): p. 896-908.

7. Goetz, C.G., et al., Evidence-based medical review update: pharmacological and surgical treatments of Parkinson's disease: 2001 to 2004. Movement disorders : official journal of the Movement Disorder Society, 2005. 20(5): p. 523-39.

8. Weaver, F.M., et al., Bilateral deep brain stimulation vs best medical therapy for patients with advanced Parkinson disease: A randomized controlled trial. JAMA, 2009. 301(1): p. 63-73.

9. Sturman, M.M., et al., Effects of subthalamic nucleus stimulation and medication on resting and postural tremor in Parkinson's disease. Brain, 2004. 127(Pt 9): p. 2131-43.

10. Carlson, J.D., et al., Deep Brain Stimulation Does Not Silence Neurons in Subthalamic Nucleus in Parkinson's Patients. Journal of Neurophysiology, 2010. 103(2): p. 962-967.

11. Deep-brain stimulation of the subthalamic nucleus or the pars interna of the globus pallidus in Parkinson's disease. N Engl J Med, 2001. 345(13): p. 956-63.

12. Brown, R.G., et al., Impact of deep brain stimulation on upper limb akinesia in Parkinson's disease. Annals of Neurology, 1999. 45(4): p. 473-488.

13. Obeso, J.A., et al., Pathophysiology of levodopa-induced dyskinesias in Parkinson's disease: problems with the current model. Annals of Neurology, 2000. 47(4): p. 22-32.

14. Pollak, P. and P. Krack, Deep-Brain Stimulation for Movement Disorders, in Parkinson's Disease and Movement Disorders, J. Jankovic and E. Tolosa, Editors. 2007, Lippincott Williams and Wilkins: Philadelphia. p. 653-691.

15. Follett, K.A., The Surgical Treatment of Parkinson's Disease. Annual Review of Medicine, 2000. 51(1): p. 135-147.

16. Groiss, S.J., et al., Review: Deep brain stimulation in Parkinson's disease. Therapeutic Advances in Neurological Disorders, 2009. 2(6): p. 379-391.

17. Chung, K.A., et al., Objective measurement of dyskinesia in Parkinson's disease using a force plate. Mov Disord, 2010. 25(5): p. 602-8.

18. Ahlskog, J.E., Parkinson's disease treatment guide for physicians. 2009, Oxford; New York: Oxford University Press.

19. Fahn, S., R.L. Elton, and UPDRS Program Members, Unified Parkinson's Disease Rating Scale, in Recent Developments in Parkinson's Disease, S. Fahn, C.D. Marsden, M. Goldstein, and D.B. Calne, Editors. 1987, Macmillan Healthcare Information: Florham Park, NJ. p. 153–163.

20. Goetz, C.G., et al., Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Movement Disorders, 2008. 23(15): p. 2129-2170.

21. Kompoliti, K., C.L. Comella, and C.G. Goetz, Clinical rating scales in movement disorders, in Parkinson's disease and movement disorders, J. Jankovic and E. Tolosa, Editors. 2007, Lippincott Williams and Wilkins: Philadelphia. p. 692-701.

22. Hoehn, M.M. and M.D. Yahr, Parkinsonism: onset, progression and mortality. Neurology, 1967. 17(5): p. 427-42.

23. Waters, C., Diagnosis and Management of Parkinson’s Disease. 5 ed. 2006, Caddo: Professional Communications Inc.

24. Siderowf, A., et al., Test–Retest reliability of the Unified Parkinson's Disease Rating Scale in patients with early Parkinson's disease: Results from a multicenter clinical trial. Movement Disorders, 2002. 17(4): p. 758-763.

25. Bradburn, N.M., L.J. Rips, and S.K. Shevell, Answering Autobiographical Questions: The Impact of Memory and Inference on Surveys. Science, 1987. 236(4798): p. 157-161.

26. Stacy, M.A., et al., Assessment of interrater and intrarater reliability of the Fahn–Tolosa– Marin Tremor Rating Scale in essential tremor. Movement Disorders, 2007. 22(6): p. 833-838.

27. Bennett, D.A., et al., Metric properties of nurses' ratings of parkinsonian signs with a modified Unified Parkinson's Disease Rating Scale. Neurology, 1997. 49(6): p. 1580.

28. Goetz, C.G., et al., Which Dyskinesia Scale Best Detects Treatment Response? Mov Disord, 2013. 28(3): p. 341-346.

29. Hoff, J.I., V. van der Meer, and J.J. van Hilten, Accuracy of objective ambulatory accelerometry in detecting motor complications in patients with Parkinson disease. Clin Neuropharmacol, 2004. 27(2): p. 53-7.

30. Shiffman, S., A.A. Stone, and M.R. Hufford, Ecological Momentary Assessment. Annual Review of Clinical Psychology, 2008. 4(1): p. 1-32.

31. Mera, T.O., M.A. Burack, and J.P. Giuffrida. Quantitative assessment of levodopa- induced dyskinesia using automated motion sensing technology. in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. 2012.

32. Patel, S., et al. Home monitoring of patients with Parkinson's disease via wearable technology and a web-based application. in Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE. 2010.

33. Patel, S., et al., Longitudinal monitoring of patients with Parkinson's disease via wearable sensor technology in the home setting. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), 2011: p. 1552-1555.

34. Bor-Rong, C., et al., A Web-Based System for Home Monitoring of Patients With Parkinson's Disease Using Wearable Sensors. Biomedical Engineering, IEEE Transactions on, 2011. 58(3): p. 831-836.

35. Giuffrida, J.P., et al., Clinically deployable Kinesia™ technology for automated tremor assessment. Movement Disorders, 2009. 24(5): p. 723-730.

36. Mera, T.O., et al., Feasibility of home-based automated Parkinson's disease motor assessment. J Neurosci Methods, 2012. 203(1): p. 152-6.

37. Mera, T., et al., Kinematic optimization of deep brain stimulation across multiple motor symptoms in Parkinson's disease. J Neurosci Methods, 2011. 198(2): p. 280-6.

38. Keijsers, N.L., M.W. Horstink, and S.C. Gielen, Ambulatory motor assessment in Parkinson's disease. Mov Disord, 2006. 21(1): p. 34-44.

39. Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Movement parameters that distinguish between voluntary movements and levodopa-induced dyskinesia in Parkinson’s disease. Human Movement Science, 2003. 22(1): p. 67-89.

40. Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Automatic assessment of levodopa-induced dyskinesias in daily life by neural networks. Movement Disorders, 2003. 18(1): p. 70-80.

41. Tsipouras, M.G., et al., An automated methodology for levodopa-induced dyskinesia: Assessment based on gyroscope and accelerometer signals. Artificial Intelligence in Medicine, 2012. 55(2): p. 127-135.

42. Cole, B.T. et al., Dynamical Learning and Tracking of Tremor and Dyskinesia from Wearable Sensors. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2014. 22(5): p.982-91.

43. Bergmann, J.H.M. and A.H. McGregor, Body-Worn Sensor Design: What Do Patients and Clinicians Want? Annals of Biomedical Engineering, 2011. 39(9): p. 2299-2312.

44. Darnall, N.D., et al., Application of machine learning and numerical analysis to classify tremor in patients affected with essential tremor or Parkinson’s disease. Gerontechnology, 2012. 10(4).

Published in Gerontechnology Vol. 10 No 4 (2012)

2. APPLICATION OF MACHINE LEARNING AND NUMERICAL ANALYSIS TO

CLASSIFY TREMOR IN PATIENTS AFFECTED WITH ESSENTIAL TREMOR OR

PARKINSON’S DISEASE

Nathan D. Darnall BSME
School of Mechanical and Materials Engineering, College of Engineering and Architecture, Washington State University, Pullman, Washington 99164, USA
E: [email protected]

Conrad K. Donovan MSEE
Syeda Aktar BSc
School of Electrical Engineering and Computer Science, College of Engineering and Architecture, Washington State University, Pullman, Washington 99164, USA

Han-yun Tseng MA
Department of Human Development, College of Agricultural, Human, & Natural Resource Sciences, Washington State University, Pullman, Washington 99164, USA

Paulo Barthelmess PhD
Philip R. Cohen PhD
Adapx Inc, Seattle, WA, USA

David C. Lin PhD
Voiland School of Chemical Engineering and Bioengineering and Department of Veterinary and Comparative Anatomy, Pharmacology, and Physiology, Washington State University, Pullman, Washington 99164, USA

2.1 Introduction

The overall goal of this study was to compare the accuracy of various data analysis techniques to quantify tremor severity (TS) in a clinical context, with the aim of improving the reliability (context consistency and inter-rater agreement) of tremor evaluation in patients with Parkinson’s disease (PD) or essential tremor (ET). Ten patients with either PD or ET were asked to perform several tasks used in clinical practice for the characterization of tremor. Three-axis gyroscopes in a Shimmer device measured angular velocities of the wrist of each subject for postural, kinetic, spiral tracing, and resting scenarios, and a digital pen recorded subjects’ tracings of an Archimedes’ spiral printed on paper. Gyroscope data were used for training and testing a supervised machine learning algorithm to classify TS and for root mean squared (RMS) numerical rating of TS, while digital pen data were analyzed numerically to quantify tracing deviations from the spiral and obtain a tremor rating. We evaluated the performance of our proposed methods compared to clinicians’ diagnostic ratings. The machine learning method matched the clinical rating with 82% accuracy, the digital pen with 78% accuracy, and RMS with 42% accuracy. We obtained the best accuracy of 82% using the decision tree machine learning approach with gyroscope data measured with the Shimmer.

Tremor can be defined as a rhythmic shaking and involuntary rhythmic movements of body segments. It occurs in healthy individuals, as so-called physiological tremor1. Tremor is composed of two oscillations, mechanical reflex and central neurogenic, which are superimposed on a background of irregular and involuntary fluctuations in muscle forces and displacements2.

In patients with neurological disorders, tremor is clinically described as rest, postural, and kinetic tremor. Rest tremor appears during resting while postural tremor is triggered by maintenance of a posture or a position against gravity. Kinetic tremor is evoked by a voluntary movement and is maximal while near the movement target1.

Parkinson’s disease (PD) is a progressive neurodegenerative disorder. The motor symptoms of PD include rest tremor, bradykinesia, and rigidity, and these develop gradually during the progression of the disorder3. Motor (such as tremor) and non-motor (such as memory loss) PD impairments can be rated from a combination of self-reporting and subjective clinical assessments within different scales4, including the Unified Parkinson’s Disease Rating Scale (UPDRS), Hoehn and Yahr Scale (HY), and Short Parkinson’s Rating Scale (SPRS). The severity ratings of these scales range from 0-4 (UPDRS), 1-5 (HY), and 0-3 (SPRS)5.

Essential tremor (ET) is a neurological disorder with no known cause and is characterized by postural and kinetic tremor6. The tremor can affect almost any part of the body, but it occurs most often in the hands, especially when the patients are maintaining a given posture or executing tasks, such as drinking from a glass, tying shoelaces, writing or shaving.

Tremor severity (TS) in ET or PD is commonly rated clinically using the Fahn-Tolosa-Marin Tremor Rating Scale (TRS), which is on a scale of 0 to 47. The rating is based upon the clinician’s observation of tremor location and amplitude, the patient’s ability to perform motor functions (such as writing and drawing), and the patient’s self-report of their functional disability resulting from tremor8. One of the most widely used clinical procedures for measuring the severity of arm tremor is tracings of Archimedes spirals9. Patients with tremor show irregularity with swerves compared to individuals without tremor10.
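One simple way to quantify a tracing’s deviation from an ideal Archimedes spiral (r = a·θ) is to convert the pen samples to polar coordinates, fit the linear r-versus-θ relation, and score the residual. This is an illustrative sketch that assumes the trace is centered at the spiral’s origin; it is not the analysis pipeline used in this chapter:

```python
import numpy as np

def spiral_deviation(x, y):
    """RMS radial deviation of a pen trace from a best-fit Archimedes spiral."""
    theta = np.unwrap(np.arctan2(y, x))          # cumulative angle of each pen sample
    r = np.hypot(x, y)                           # radius of each pen sample
    a = np.sum(r * theta) / np.sum(theta ** 2)   # least-squares slope of r = a*theta
    return np.sqrt(np.mean((r - a * theta) ** 2))
```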

More objective assessments of tremor have been made by quantitative measurement of tremor characteristics [11-13]. Namely, tremor frequency varies by both tremor type and tremor location. Rest tremor frequency is typically in the 3-6 Hz range [3], the frequency of postural tremor is between 4 and 12 Hz, and kinetic tremor has a frequency between 2 and 7 Hz.

Mechanical-reflex tremor depends on limb inertia and joint stiffness. For example, normal elbow tremor occurs at 3-5 Hz, while wrist tremor has a natural frequency of 8-12 Hz due to the lower inertia of the hand [14]. Arm and leg tremor frequencies are 5.2 Hz and 3.8 Hz, respectively [15]. Postural and kinetic wrist tremor in PD patients has a prominent coherence peak at 5-8 Hz, which is distinguishable from the 8-12 Hz peak of healthy controls [16]. These findings suggest a need to identify parameters through which the assessment of tremor is independent of tremor type and body segment.

A general problem with the clinical rating of TS is subjectivity due to inter-rater and patient self-reporting variability [17,18]. Inter-rater reliability for the TRS in ET patients has a Kappa statistic of k=0.53 (k=1 means complete agreement and k=0 means no agreement) for postural and action tremor and k=0.41 for handwriting [8]. Quantitative measurement of tremor is objective and context independent and could improve this low reliability. However, the bridge between these measurements and the commonly used TRS clinical scale has not been well established.

Quantitative data collected during standard clinical tests have been used to classify on, off, and dyskinetic states in PD patients [19] or to classify the UPDRS score [20]. However, these methods require lengthy procedures for the patient and measurements from many different sensors and body segments. A link between quantitative data and clinical ratings specifically for tremor assessment would not have these drawbacks and would still provide clinicians valuable information.

Therefore, the study aim was to capture objective tremor data, quantify TS from the data, and assess the ability of different computational methods to classify TS by matching them to clinical diagnostic assessments.

2.2 Methods

2.2.1 Participant Description

This study was approved by the Washington State University Institutional Review Board (IRB). Patients signed both informed consent (IC) and Health Insurance Portability and Accountability Act (HIPAA) authorization forms. Ten participants were selected at two clinics, Northwest Neurology and Inland Neurosurgery and Spine: three exhibited a history of ET, six a history of PD, and one a history of both PD and ET. The presence of predominant postural tremor in addition to a resting tremor, as observed in this last patient, has been described as a co-occurrence of PD and ET [21]. Two thirds of the patients were female. All patients had Deep Brain Stimulation (DBS) implants. Twenty data sets were gathered: ten in which the participants performed tasks with DBS on, and ten in which they performed the same tasks with DBS off. Although some patients used various medications to treat their PD motor symptoms, information regarding their medication type, dosage, and medication schedule was not considered relevant to the purpose of this study. Clinicians rated TS for each patient with DBS on and again with DBS off, prior to each set of tasks. Clinically rated TS ranged from 0-4 with a mean TS of 1.40 for all patients under all conditions: 0-4 with mean TS 1.20 for DBS on, and 0-4 with mean 1.45 for DBS off (note that TS is only rated on an integer scale in these methods). The participant description is listed alongside the study results (Table 1).

2.2.2 Clinicians’ tremor rating method

Immediately prior to performing each set of tasks for this study, each patient was evaluated by a clinician who rated their current clinical TS based on the TRS. A total of three clinicians participated in rating individual patients: one neurosurgeon and two nurse practitioners. Participants were asked to perform four tasks using their dominant hand while seated. The tasks were repeatedly extending the arm full length and then touching the nose five times, holding the hand at full horizontal extension for five seconds, resting the hand in the lap for five seconds, and tracing a printed spiral. During the spiral tracing task, the patient traced a clinician-supplied spiral with a standard pen using the dominant hand. Clinicians established the clinical TS score on a scale from 0-4 for each patient at each condition (DBS on and DBS off) based on the patient's performance on the tasks, as described in the TRS [5].

2.2.3 Deep brain stimulation device

A DBS device is a surgically implanted electronic device that emits low-voltage pulses into the brain [21,22] and has been shown to reduce tremor [23,24]. For each patient, a clinician customizes the voltage, pulse width, electrode configuration, and frequency of the DBS signal to minimize tremor [25,26].

A clinician adjusted the settings of each participant's DBS device to optimally reduce tremor prior to the tasks participants performed with their DBS device turned on. Clinicians determined the optimum DBS setting by changing the setting and then observing the change in the clinical tasks used to classify TS. A lower TS score was interpreted as resulting from a more optimal DBS setting. DBS settings were not recorded because the effect of DBS on tremor was not a primary topic of interest in this study.

2.2.4 Hardware and software

Four hardware devices were incorporated in this study: a gyroscope (an angular velocity sensor) contained in a Bluetooth wireless Shimmer unit, an Anoto digital pen, a laptop computer equipped with a Bluetooth receiver, and DBS devices specific to each patient. Data were collected from the Shimmer device via a Bluetooth link to the laptop, while the digital pen transmitted data to the laptop via a USB interface. Software used in the study included National Instruments LabVIEW, the Weka machine learning tool, Adapx Capturx digital pen software, and Microsoft Excel.

The selection criteria for the Shimmer device were low cost, wireless operation, a 3-axis gyroscope, light weight, and small size. The Shimmer Wireless Sensor Unit (SWSU) [27] is lightweight (15 g) with a small form factor (50 x 2 x 12.5 mm) suitable for PD and ET patients. The SWSU also supports wireless communication through Bluetooth and 802.15.4 radio.

2.2.5 Digital pen

Adapx's Capturx™ system for digital paper and pen integrates standard digital pen technology with standard office software applications. Similar technology has been used in the medical field to quantify physical impairment of drivers under the influence of alcohol [28]. The pen records all the X-Y coordinates that it traverses and uploads the data to a computer via a USB interface.

2.2.6 Experimental procedure

Participants, with the SWSU strapped to their wrist, were asked to perform the same set of four tasks used in the clinical evaluation. Each set was performed twice by each participant: once with their DBS device on and once with it off. A period of 5 to 10 minutes between the first set of tasks (DBS on) and the second set (DBS off) was allowed for any residual effects of the DBS to wear off. Turning the DBS device off resulted in a visible resting tremor in PD patients, which was allowed to continue for about 2 minutes before beginning the second set of tasks.

2.2.7 Machine learning

Six classifiers were used: Random Forest, Decision Tree, Nearest Neighbor (NN), Bayes, Multilayer Perceptron (MLP), and Support Vector Machine (SVM). In a decision tree classifier, entropy is measured as:

$Entropy(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$ (1)

where S is the set of data points, $p_{+}$ is the proportion of data points that belong to the positive class, and $p_{-}$ is the proportion of data points that belong to the negative class.

The information gain for each attribute is described by the equation:

$Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$ (2)

where Values(A) is the set of all possible values for feature A and $S_v$ is the subset of S for which feature A has value v. Gain(S,A) measures how well a given feature separates the training examples according to their target classification [29]. We used the J48 decision tree provided with the Weka software distribution to classify TS.
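For illustration only (not the authors' implementation), Equations (1) and (2) can be computed as in the following Python sketch, where `feature_values` holds one discrete attribute and `labels` holds the TS classes; both names are assumptions:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, Eq. (1) generalized to any class mix."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Information gain from splitting `labels` on a discrete feature, Eq. (2)."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```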

Random Forest is an ensemble classifier that consists of many decision trees and outputs the most popular class. Each tree is grown from independent random vectors using a training set, resulting in a classifier. After a large number of trees are generated, the random forest outputs the class that is the mode of the classes output by the individual trees [30].

NN treats instances as points in an n-dimensional space and compares them using Euclidean distance. The algorithm assigns to a data point the class label that is most common among the k training examples nearest to that point [31]. We used the IBk scheme from Weka with parameter k=1 in our experiment.

SVM maximizes the margin between the training examples and the class boundary. SVM generates a hyperplane which provides a class label for each data point described by a set of feature values [32].

Artificial Neural Networks (ANNs) are computational models mimicking a neuronal organizational structure [33]. ANNs are built from an interconnected set of simple units, each of which takes a number of real-valued inputs and produces a single real-valued output [31]. Using back propagation, an ANN minimizes the squared error between the network output and the target values. We applied this technique by using Weka's MLP algorithm to classify TS.

The Naïve Bayes classifier is a probabilistic classifier which assumes that the presence of a particular feature of a class is independent of the other features. It learns a classification label by mapping features with Bayes' theorem:

$t = \underset{t_i \in T}{\operatorname{argmax}}\; P(t_i \mid F) = \underset{t_i \in T}{\operatorname{argmax}}\; \frac{P(F \mid t_i)\,P(t_i)}{P(F)}$ (3)

where T represents the tremor class labels and F represents the feature values. P(t_i) is estimated by counting the frequency with which each target value t_i occurs in the training data, and P(F) is calculated from the frequency of feature values. Based on the simplifying assumption that feature values are independent given the target value, the probability of observing the features is the product of the probabilities for the individual features [34].

2.2.8 Signal analysis

The SWSU signal processing overview (Figure 1) shows that data were gathered for three axes: yaw (perpendicular to the arm, with positive pointing away from the back of the hand), pitch (perpendicular to the arm, with positive pointing to the right), and roll (aligned with the arm, with positive pointing away from the body). The gyroscope range for each axis is +/-500°/s. The raw gyroscope data for each axis (°/s), the power spectral density (PSD) for each axis ((°/s)²/Hz), the peak frequency and magnitude for each axis, the RMS value for each axis, and the TS were recorded by the computer.

Figure 1: SWSU (Shimmer Wireless Sensor Unit) signal processing
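The per-axis PSD and peak extraction can be sketched as follows (an illustrative reconstruction with SciPy, not the LabVIEW code actually used; the function and variable names are assumptions):

```python
import numpy as np
from scipy.signal import welch

FS = 10.0  # Hz, rate at which gyroscope samples were written to file

def psd_peak(angular_velocity):
    """PSD ((deg/s)^2/Hz), peak frequency (Hz), and peak magnitude for one axis."""
    freqs, psd = welch(angular_velocity, fs=FS,
                       nperseg=min(len(angular_velocity), 128))
    i = np.argmax(psd)
    return freqs, psd, freqs[i], psd[i]
```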


The TS was calculated using an RMS method with a 5 s time window. The RMS of each gyroscope signal was calculated for a finite series $\{x_t, x_{t+1}, x_{t+2}, \ldots, x_{t+n}\}$ using (4), where n designates the number of samples in the finite series. Data were sampled at 100 Hz but only written to file at 10 Hz in order to reduce file size. Because PD tremor usually lasts at least a few seconds, selecting a short time window (less than 1 s) could increase TS scores and give false positives, while selecting a larger time window, such as 10 s, would reduce the TS resolution [35]. The RMS values for the three axes are used to calculate TS, which has units of °/s and a range of 0 to 4, where 0 represents no tremor and 4 represents severe tremor.

$x_{rms} = \sqrt{\dfrac{x_t^2 + x_{t+1}^2 + x_{t+2}^2 + \cdots + x_{t+n}^2}{n}}$ (4)

TS was scaled to a 0-4 scale based on maximum and minimum RMS values recorded for tremor.
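A minimal sketch of this RMS rating, assuming a 5 s window at the 10 Hz file rate, a simple average of the three axis RMS values, and min-max scaling to 0-4 (the axis-combination rule and the bounds `rms_min`/`rms_max` are assumptions standing in for the recorded extremes):

```python
import numpy as np

FS = 10            # Hz, file rate
WINDOW = 5 * FS    # 5 s window -> 50 samples

def ts_rms(gyro, rms_min, rms_max):
    """gyro: (N, 3) yaw/pitch/roll angular velocities in deg/s.
    Returns one 0-4 TS score per 5 s window (Eq. 4 plus min-max scaling)."""
    scores = []
    for start in range(0, len(gyro) - WINDOW + 1, WINDOW):
        w = gyro[start:start + WINDOW]
        rms_axes = np.sqrt(np.mean(w ** 2, axis=0))  # Eq. (4), per axis
        r = rms_axes.mean()                          # assumed axis combination
        scores.append(np.clip(4 * (r - rms_min) / (rms_max - rms_min), 0, 4))
    return np.array(scores)
```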

We studied the effect of combinations of different features and different algorithms on the accuracy of computing TS. Each data point used to train and test a classifier was treated as an instance. We processed the six features from the gyroscope data in two different ways prior to applying the machine learning algorithms. In the first approach, we defined each instance as a combination of 10 samples in a 1 s time window; to convert the 10 samples into one data point, we calculated root mean square values for each of the six data features over the 10 samples. In the alternate approach, each sample (at an interval of 0.1 s) from the triaxial gyroscope was considered one instance, so that instead of averaging over 10 samples, each gyroscope-recorded sample was its own instance.

Using 10-fold cross-validation, the accuracy of each classifier in classifying TS was obtained by comparing the categorizations of the learned classifier to the clinicians' evaluation. Cross-validation estimates how accurately a model would perform in practice by separating the dataset into training and validation sets: the model is built from the training data, and the result is validated on the validation (test) data set. To reduce variability, a k-fold cross-validation approach was used, in which cross-validation is performed k different times, each time using a different partitioning of the data into training and validation sets, and the results are then averaged [31]. We used the machine learning tool Weka [36] to perform the 10-fold cross-validation on the tri-axial gyroscope data for the six different classifiers.
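For illustration, the same 10-fold protocol can be reproduced outside Weka with scikit-learn's decision tree (a sketch; the synthetic data merely stand in for the six RMS features and clinician TS labels):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 instances of six features, TS labels 0-4
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 5, size=200)

clf = DecisionTreeClassifier()  # analogous to Weka's J48
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold accuracy: %.1f%%" % (100 * scores.mean()))
```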

In a second approach to measuring TS, we analyzed the patient's digital pen tracings of a printed Archimedes' spiral. The printed spiral was generated by the Archimedes' spiral equation (5), where the coefficient of offset (α) is a positive real number defining the amount θ increases for a given r. For the printed spiral, α = 0.07.

$\theta = \alpha r$ (5)

Successive spiral tracing data points were compared to the printed data points that were defined by (5) and linearized by (6). Prior to analysis, we linearized both the printed and traced spirals by plotting spiral radius (r) vs. angle (θ), which were derived from the x and y coordinates recorded by the pen in the case of the traced spiral. The radius r was calculated as shown in (6), where $x_0$ and $y_0$ are the spiral center coordinates.

$r = \sqrt{(x - x_0)^2 + (y - y_0)^2}$ (6)


θ was calculated from the polar expression shown in (7).

$\theta = \tan^{-1}\!\left(\dfrac{y - y_0}{x - x_0}\right)$ (7)

As returned by the arctangent, θ is a series of increasing or decreasing positive or negative values depending on the Cartesian quadrant in which the data were recorded. Consequently, θ was recalculated by determining the active quadrant from the change in sign and value of θ and then adding the radian angle θ from the previous iteration. The result was a consistent progression of θ that correctly corresponded to the angular progression of the spiral.
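A compact way to reproduce this linearization, sketched with NumPy's four-quadrant arctangent and phase unwrapping in place of the quadrant bookkeeping described above (an implementation assumption, not the authors' code):

```python
import numpy as np

def linearize_spiral(x, y, x0, y0):
    """Convert pen coordinates to a monotonic (theta, r) progression."""
    r = np.hypot(x - x0, y - y0)          # Eq. (6)
    theta = np.arctan2(y - y0, x - x0)    # Eq. (7), quadrant-aware
    theta = np.unwrap(theta)              # remove the +/- pi jumps between quadrants
    return theta - theta[0], r            # start the angular progression at zero
```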

To determine a TS score for each spiral tracing, we calculated five factors from the linearized pen tracing data (r vs. θ): the maximum difference between the radius of the printed spiral and the tracing (Δr_max), the average radius difference between the printed spiral and the tracing (Δr_avg), the square of the Pearson product moment correlation coefficient for the tracing's r and θ data points (R²), the RMS of the radius difference, and the standard deviation (σ) of the radius difference between the printed spiral and the tracing [37]. We derived equations scaling each of the above factors to a 5-point TS scale from four spiral tracings rated for TS, by plotting each factor vs. the TS rating for that trace and then fitting a best-fit curve to the data in Excel (Figure 2). This method resulted in 5 equations scaling TS from 0-4, one for each of the 5 factors. We averaged the 5 tremor ratings derived from the 5 factors characterizing deviation for a spiral tracing to obtain a single rating of TS, which was then rounded to the nearest whole number between 0 and 4. The resulting TS score was compared to the clinician's rating for the patient to determine the ability of the system to classify the same level of tremor the clinicians identified.
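The five deviation factors can be sketched as below; the Excel-fitted scaling curves are omitted, so `scale_to_ts` is a hypothetical stand-in for those five fitted equations:

```python
import numpy as np

def deviation_factors(r_traced, r_printed, theta):
    """Five deviation factors between traced and printed spirals sampled
    on the same theta grid: max, mean, R^2, RMS, and SD of the difference."""
    d = r_traced - r_printed
    return (np.max(np.abs(d)),                         # delta-r max
            np.mean(np.abs(d)),                        # delta-r avg
            np.corrcoef(theta, r_traced)[0, 1] ** 2,   # R^2 of r vs theta
            np.sqrt(np.mean(d ** 2)),                  # RMS of radius difference
            np.std(d))                                 # SD of radius difference

def spiral_ts(factors, scale_to_ts):
    """Average the five per-factor ratings and round to an integer 0-4."""
    ratings = [f_to_ts(f) for f_to_ts, f in zip(scale_to_ts, factors)]
    return int(round(np.clip(np.mean(ratings), 0, 4)))
```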

Figure 2: Digital pen tremor scaling

2.3 Results

TS ratings from the Shimmer RMS method, the machine learning approach, and the digital pen data analysis were compared to the clinician's tremor rating for each patient under each condition (DBS on or off) (Table 1). The accuracies of all three methods were compared to determine the most accurate method.


Patient | Disease | DBS | Clinician (TRS) | Spiral | SWSU | Clinician-SWSU | Spiral-SWSU | Clinician-Spiral
1 | PD | off | 4 | 2 | 1 | 3 | 1 | 2
1 | PD | on | 0 | 0 | 0 | 0 | 0 | 0
2 | ET | off | 2 | 2 | 3 | -1 | -1 | 0
2 | ET | on | 2 | 1 | 2 | 0 | -1 | 1
3 | PD&ET | off | 0 | 1 | 1 | -1 | 0 | -1
3 | PD&ET | on | 1 | 1 | 1 | 0 | 0 | 0
4 | PD | off | 1 | 0 | 1 | 0 | -1 | 1
4 | PD | on | 1 | 1 | 1 | 0 | 0 | 0
5 | PD | off | 1 | 1 | 1 | 0 | 0 | 0
5 | PD | on | 1 | 0 | 1 | 0 | -1 | 1
6 | PD | off | 0 | 1 | 0 | 0 | 1 | -1
6 | PD | on | 1 | 0 | 1 | 0 | -1 | 1
7 | PD | off | no data | 4 | 0 | no data | 4 | no data
7 | PD | on | 1 | 1 | 1 | 0 | 0 | 0
8 | PD | off | 1 | 1 | 1 | 0 | 0 | 0
8 | PD | on | 1 | 0 | 0 | 1 | 0 | 1
9 | ET | off | 2 | 2 | 4 | -2 | -2 | 0
9 | ET | on | 1 | 0 | 1 | 0 | -1 | 1
10 | ET | off | 4 | 3 | 4 | 0 | -1 | 1
10 | ET | on | 4 | 2 | 4 | 0 | -2 | 2

Table 1: Comparison of tremor rating method results for the patients taking part in the study; the three rightmost columns are signed differences between rating methods. DBS = deep brain stimulation; ET = essential tremor; PD = Parkinson's disease; SWSU = Shimmer wireless sensor unit; TRS = Fahn-Tolosa-Marin Tremor Rating Scale.

Accuracies were compared for Random Forest, Decision Tree, Nearest Neighbor (NN), Bayes, Multilayer Perceptron (MLP), and Support Vector Machine (SVM), for both time-segmented data and raw data (Figure 3).


Figure 3: Machine learning accuracy

The accuracy of each classification method is reported for the entire data set, rather than for individual participants, due to the nature of 10-fold cross-validation. The best accuracy, 82%, was obtained using a decision tree classifier on raw data.

Spiral tracings (Figure 4) were linearized and plotted against the printed spiral (Figure 5). The numerical analysis methods were applied to each linearized spiral trace to determine a tremor rating. The decimal tremor rating for patient 5 with DBS off (1.35) differs considerably from that with DBS on (0.70), although both round to 1 (Table 2). For both conditions, each type of tracing deviation (Δr_max, Δr_avg, etc.) is listed above the scaled TS score for that factor, with the overall tremor rating averaged from the five scaled tremor severities. Digital pen tracings matched clinicians' ratings with 74% accuracy.

Figure 4: Spiral trace of patient 5, DBS (Deep Brain Stimulation) off, PD (Parkinson's Disease)


Figure 5: Linearized spiral trace of patient 5, DBS (Deep Brain Stimulation) off, PD (Parkinson's Disease)

DBS off | Δr_max | Δr_avg | R² | RMS | SD | TS rating
Patient spiral trace data | 63.46 | 16.05 | 0.97 | 21.34 | 14.07 |
Decimal TS (average of 5 deviation ratings) | 1.69 | 1.27 | 0.86 | 1.37 | 1.58 | 1.35
Rounded TS rating | | | | | | 1

DBS on | Δr_max | Δr_avg | R² | RMS | SD | TS rating
Patient spiral trace data | 26.75 | 11.03 | 0.99 | 13.32 | 7.47 |
Decimal TS (average of 5 deviation ratings) | 0.51 | 0.80 | 0.67 | 0.77 | 0.74 | 0.70
Rounded TS rating | | | | | | 1

Table 2: TS (tremor severity) for the digital pen spiral trace of patient 5. Δr_max = maximum difference between the radius of the printed spiral and the tracing; Δr_avg = average radius difference between the printed spiral and the tracing; R² = square of the Pearson product moment correlation coefficient for the tracing's r (radius) and θ (angle) data points; RMS = root mean square of the radius difference; SD = standard deviation of the radius difference between the printed spiral and the tracing; DBS = deep brain stimulation.

We identified one outlier that was removed from the data set. The digital pen spiral tracing for one patient displayed very high tremor, with tracings off the page, yielding a TS rating of four as assigned by the digital pen data analysis. However, when the clinician rated the patient, the patient traced a clinician-provided spiral with a standard pen while pressing his hand hard against the writing surface. This effectively dampened the tremor during the clinician-provided spiral trace and yielded a clinician tremor rating of one. For all other tracings, it was ensured that the patient did not press his hand hard against the paper. Because this one patient kept his hand off the paper during the digital pen spiral trace, this data collection was not conducted in the same manner as the others, and we concluded that the data for patient 1 with DBS off should be discarded as an outlier. With this outlier removed, the digital pen yielded an exact match accuracy of 78% against the clinicians' ratings.

Figure 6 shows the raw gyroscope data from the SWSU while patient 10 was resting. With the DBS device off, the RMS value of the tremor was 23.35°/s; with the device on, the RMS value was 11.21°/s, a total RMS improvement of 12.14°/s when using DBS. These results parallel those in the literature, which report that displacement RMS measured with accelerometers decreased by approximately half for the DBS on vs. DBS off condition in medicated PD patients experiencing postural and kinetic tremor [38]. Frequency decreased by 0.8 Hz and the power density of the tremor decreased by 25.26 (°/s)²/Hz when the DBS device was on (Figure 7). Figure 8 shows patient 7's raw gyroscope data for the roll axis while the patient was resting. With DBS off, the RMS value of the tremor was 148.54°/s; with DBS on, the RMS value was 2.05°/s, a total RMS improvement of 146.50°/s when using the DBS device. Figure 9 shows patient 7's power spectral density for the roll axis while the patient was drawing a spiral. With the DBS device off, a large power density of 779.18 (°/s)²/Hz occurred at 3.8 Hz. By contrast, when the DBS was on, very little tremor was noticeable.


Figure 6: Raw gyroscope data at rest of patient 10.


Figure 7: PSD (Power Spectral Density) roll axis, spiral trace of patient 10.


Figure 8: Raw gyroscope data roll axis at rest of patient 7.


Figure 9: PSD (Power Spectral Density) roll axis, spiral trace of patient 7.

Using the RMS method, the tremor severity was matched to the clinician assessment 42% of the time with the aforementioned outlier removed.

2.4 Discussion

We compared the spiral TS ratings with the clinicians' TS ratings and the SWSU TS ratings. Table 3 shows a match comparison for tremor ratings based on digital pen spiral tracings, SWSU data, and the clinicians' analysis for the ten patients, both for the DBS off condition and the DBS on condition, with one missing data set for patient 7 with DBS off.

Parameter | Accuracy, %
Random match probability | 20
Machine learning, exact match | 82
Spiral-clinician, exact match, outliers removed | 78
SWSU-clinician, exact match, outliers removed | 42
Spiral-SWSU, exact match, outliers removed | 44

Table 3: Exact match accuracy.

SWSU TS ratings matched clinicians' ratings with 42% accuracy, the digital pen with 78%, and machine learning with 82%. Note that a 20% match rate is what random guessing on the five-point scale would produce.

These results represent a substantial improvement over the clinical inter-rater reliability for the TRS (Kappa statistic ≈ 0.5) [8]. Because the machine learning algorithms used the clinical ratings from three different raters as the target data, a 100% match would not be expected due to inter-rater variability. Although the upper limit for matching clinical ratings was not established due to the limited number of subjects, the 82% match (machine learning algorithms) and 78% match (digital pen) show that these methods have the capability to provide more reliable assessments on the TRS scale.

The spiral to SWSU exact match accuracy was 44%, demonstrating poor correspondence between the two rating methods. We concluded that the RMS method applied to the gyroscope data alone, at a sample rate of 10 Hz, was not an adequate assessment tool for TS. However, the same data were useful in classifying TS using a random forest machine learning algorithm.

We chose to record SWSU data at 10 Hz even though the device sampled at 100 Hz: although we initially sampled and recorded data at 100 Hz, we observed a peak PSD at about 4 Hz in our data and decreased the recorded rate to 10 Hz in order to reduce file size. This methodology has been used in previous studies. For example, Wu et al. [16] reported a maximum coherence peak below 4 Hz for postural and kinetic tremor in PD patients, with a lesser, secondary peak between 6 and 8 Hz. While our 10 Hz rate most likely captured the maximum PSD peak for tremor, some of the lesser PSD content around 8 Hz may have been lost. This could have contributed to our poor accuracy in matching clinical TS ratings with the Shimmer RMS method.

This study presents a feasible approach to bridge objective measurements of tremor to ratings (TRS) familiar to clinicians. Furthermore, this approach used moderately demanding motor tasks that the PD and ET patients could accomplish with reasonable effort. However, due to the low number of samples (20 data sets from 10 participants under two DBS conditions), the impact of these methods could be more clearly defined with the inclusion of additional participants. Notably, our findings imply that machine learning algorithms can reliably classify wrist TS from gyroscope data onto the same scale used by clinicians. Furthermore, the reliability could be further improved by incorporating digital pen data into the machine learning approach along with the gyroscope data to achieve an even higher accuracy in the evaluation of TS.

Conclusions

Applying three computational methods to assess patient tremor, we found that we could match the qualitative clinical tremor rating 78% of the time with the digital pen spiral tracing analysis, 42% of the time with Shimmer data using the RMS method, and 82% of the time with a machine learning decision tree algorithm applied to the gyroscope data. Combining machine learning with RMS gyroscope tremor data proved to be the most reliable method. This computational method has the potential to substantially increase the reliability of tremor assessment.

Acknowledgements

Thanks to Jonathan Carlson, MD, PhD, and Jamie Mark, ARNP, for medical advice, and to Adapx Inc. for the donation of digital pens and Capturx™ software. This work was supported by NSF under Grant No. DGE-0900781.

BIBLIOGRAPHY

1. Grimaldi G, Manto M. Neurological tremor: sensors, signal processing and emerging applications. Sensors 2010;10(2):1399-1422; doi:10.3390/s100201399
2. Elble R. Characteristics of physiologic tremor in young and elderly adults. Clinical Neurophysiology 2003;114(4):624-635; doi:10.1016/S1388-2457(03)00006-3
3. Cunningham L, Nugent C, Moore G, Finlay D, Craig D. Computer-based assessment of bradykinesia, akinesia, and rigidity in Parkinson's disease. ICOST 2009; LNCS 5597; pp 1-8
4. Waters C. Diagnosis and Management of Parkinson's Disease. 5th edition. Caddo: Professional Communications Inc; 2006; pp 39-252
5. Fahn S, Tolosa E, Marin C. Clinical rating scale in tremor. Pp 225-234 in Jankovic J, Tolosa E, editors. Parkinson's Disease and Movement Disorders. Baltimore: Urban and Schwarzenberg; 1988
6. Heldman D, Jankovic J, Vaillancourt D, Prodoehl J, Elble R, Giuffrida J. Essential tremor quantification during activities of daily living. Parkinsonism and Related Disorders 2011;17(7):537-542; doi:10.1016/j.parkreldis.2011.04.017
7. Kompoliti K, Comella C, Goetz C. Clinical rating scales in movement disorders. Pp 692-701 in Jankovic J, Tolosa E, editors. Parkinson's Disease and Movement Disorders (5). Philadelphia: Williams and Wilkins; 2007
8. Stacy M, Elble R, Ondo W, Wu S, Hulihan J. Assessment of interrater and intrarater reliability of the Fahn-Tolosa-Marin Tremor Rating Scale in essential tremor. Movement Disorders 2007;22(6):833-838; doi:10.1002/mds.21412
9. Miotto G, Andrade A, Soares A. Measurement of tremor using digitizing tablets. V CEEL; 26-28 September 2007; pp 1-4
10. Factor S, Weiner W. Parkinson's Disease: Diagnosis and Clinical Management. New York: Demos; 2002
11. Rubchinsky L, Kuznetsov A, Wheelock V, Sigvardt K. Tremor. Scholarpedia 2007;2(10):1379; doi:10.4249/scholarpedia.1379
12. Lo G, Suresh A, Stocco L, Gonzalez-Valenzuela S, Leung V. A wireless sensor system for motion analysis of Parkinson's disease patients. IEEE PerCom 20 ; pp 564-567
13. Someren E, Vonk B, Thijssen W, Speelman J, Schuurman P, Mirmiran M, Swaab D. A new actigraph for long-term registration of the duration and intensity of tremor and movement. IEEE Transactions on Biomedical Engineering 1998;45(3):386-395
14. Elble R. Clinical mechanisms of tremor. Journal of Clinical Neurophysiology 1996;13(2):133-144; doi:10.1097/00004691-199603000-00004
15. O'Suilleabhain P, Matsumoto J. Time-frequency analysis of tremor. Brain 1998;121(11):2127-2134; doi:10.1093/brain/121.11.2127
16. Wu P, Lin C, Wang C, Hwang I. Atypical task-invariant organization of multi-segment in patients with Parkinson's disease during manual tracking. Journal of Electromyography and Kinesiology 2009;19:e144-e153; doi:10.1016/j.jelekin.2007.12.003
17. Siderowf A, McDermott M, Kieburtz K, Blindauer K, Plumb S, Shoulson I. Test-retest reliability of the Unified Parkinson's Disease Rating Scale in patients with early Parkinson's disease: results from a multicenter clinical trial. Movement Disorders 2002;17(4):758-763; doi:10.1002/mds.10011
18. Bennett D, Shannon K, Beckett L, Goetz C, Wilson R. Metric properties of nurses' ratings of parkinsonian signs with a modified Unified Parkinson's Disease Rating Scale. Neurology 1997;49(6):1580-1587
19. Bonato P, Sherrill D, Standaert D, Salles S, Akay M. Data mining techniques to detect motor fluctuations in Parkinson's disease. Proceedings of the 26th Annual International Conference of the IEEE EMBS 2004; pp 4766-4769; doi:10.1109/IEMBS.2004.1404319
20. Papapetropoulos S, Katzen H, Scanlon B, Guevara A, Singer C, Levin B. Objective quantification of neuromotor symptoms in Parkinson's disease: implementation of a portable, computerized measurement tool. Parkinson's Disease 2010; vol. 2010; Article ID 760196; pp 1-6; doi:10.4061/2010/760196
21. Deuschl G, Volkmann J, Raethjen J. Tremors: differential diagnosis, pathophysiology, and therapy. Pp 298-320 in Jankovic J, Tolosa E, editors. Parkinson's Disease and Movement Disorders (5). Philadelphia: Williams and Wilkins; 2007
22. Fisman G, Herzog J, Fishman D, Tamma F, Lyons K, Pahwa R, Lang A, Deuschl G. Subthalamic nucleus deep brain stimulation: summary and meta-analysis of outcomes. Movement Disorders 2006;21(14):S290-S304; doi:10.1002/mds.20962
23. Benabid A, Pollak P, Gao D, Hoffmann D, Limousin P, Gay E, Payen I, Benazzouz A. Chronic electrical stimulation of the ventralis intermedius nucleus of the thalamus as a treatment of movement disorders. Journal of Neurosurgery 1996;84(2):203-214; doi:10.3171/jns.1996.84.2.0203
24. Obwegeser A, Uitti R, Witte R, Lucas J, Turk M, Wharen R. Quantitative and qualitative outcome measures after thalamic deep brain stimulation to treat disabling tremors. Neurosurgery 2001;48(2):274-284; doi:10.1097/00006123-200102000-00004
25. Bronstein J, Tagliati M, Alterman R, Lozano A, Volkmann J, Stefani A, Horak F, Okun M, Foote K, Krack P, Pahwa R, Henderson J, Hariz M, Bakay R, Rezai A, Marks W, Moro E, Vitek J, Weaver F, Gross R, DeLong M. Deep brain stimulation for Parkinson disease. Archives of Neurology 2011;68(2):165-171; doi:10.1001/archneurol.2010.260
26. Obeso J, Olanow C, Rodriguez-Oroz C. Deep-brain stimulation of the subthalamic nucleus or the pars interna of the globus pallidus in Parkinson's disease. New England Journal of Medicine 2001;345(13):956-963; doi:10.1093/brain/124.9.1777
27. www.shimmer-research.com; retrieved July 19, 2011
28. Davies S, Beale R, Tiplady. An investigation into the measurement of driver impairment at the roadside using a Logitech digital pen. 17th International Conference on Alcohol, Drugs, and Traffic Safety; Glasgow; August 2004
29. Quinlan J. Induction of decision trees. Machine Learning 1986;1(1):81-106; doi:10.1007/BF00116251
30. Breiman L. Random forests. Machine Learning 2001;45(1):5-32; doi:10.1023/A:1010933404324
31. Mitchell T. Machine Learning. New York: McGraw Hill; 1997
32. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory; Pittsburgh: ACM; 1992; pp 144-152
33. Zornetzer S, Davis J, Lau C. An Introduction to Neural and Electronic Networks. San Diego: Academic Press; 1990
34. Rish I. An empirical study of the naive Bayes classifier. IJCAI-01 Workshop on Empirical Methods in AI; 2001
35. Salarian A, Russmann H, Vingerhoets FJG, Burkhard PR, Blanc Y, Dehollain C, Aminian K. An ambulatory system to quantify bradykinesia and tremor in Parkinson's disease. 4th International IEEE EMBS Special Topic Conference on Information Technology Applications in Biomedicine; 2003; pp 35-38; doi:10.1109/TBME.2006.886670
36. Bouckaert R, Frank E, Hall M, Kirkby R, Reutemann P, Seewald A, Scuse D. WEKA Manual for Version 3-6-3. University of Waikato; Hamilton, New Zealand; July 27, 2010
37. Wang H, Yu Q, Kurtis M, Floyd A, Smith W, Pullman S. Spiral analysis: improved clinical utility with center detection. Journal of Neuroscience Methods 2008;171(2):264-270; doi:10.1016/j.jneumeth.2008.03.009
38. Sturman M, Vaillancourt D, Metman L, Bakay R, Corcos D. Effects of subthalamic nucleus stimulation and medication on resting and postural tremor in Parkinson's disease. Brain 2004;127(9):2131-2143; doi:10.1093/brain/awh237

3. DETECTING DYSKINESIA IN PEOPLE WITH PARKINSON'S DISEASE USING BODY WORN ACCELEROMETERS AND MACHINE LEARNING ALGORITHMS

Nathaniel D. Darnall1 and David C. Lin1, 2, 3

1School of Mechanical and Materials Engineering, Washington State University, Pullman, WA, USA
2School of Chemical Engineering and Bioengineering, Washington State University, Pullman, WA, USA
3Integrative Physiology and Neuroscience, Washington State University, Pullman, WA, USA

3.1 Introduction

Parkinson's disease (PD) is a progressive neurodegenerative disorder that causes both motor and non-motor deficits. Motor symptoms, which are prevalent during so-called "OFF" periods, include tremor, bradykinesia, rigidity, and postural instability. These deficits typically fluctuate in severity on an hourly and daily basis (Keijsers et al, 2003a; Keijsers et al, 2003b). Clinicians are able to treat the symptoms through the combined administration of carbidopa/levodopa medication, dopamine agonist medication, and deep brain stimulation (DBS) devices. Overmedication or overstimulation can cause dyskinesia, an involuntary, often rhythmic or choreic, exaggeration of movements (Rush, 2000). Clinicians aim to reduce dyskinesia periods when they prescribe treatment; to optimize treatments, they must know the timing of these periods with respect to medication dosage and DBS settings. In common clinical practice, clinicians ask patients to recall the occurrence of their OFF and dyskinesia periods since the last clinical visit. Because the patient typically must recall their symptoms over a period of several months, this method of retrospective self-report is subject to recall bias.

In-home monitoring for the chronically ill has been identified as a cost-effective method to complement traditional human-provided care (Mera et al, 2012a; Mera et al, 2012b). To augment clinical assessment, several systems have been developed that use machine learning algorithms (MLAs) to map body-worn sensor data onto clinical ratings.

Several systems have been developed for in-home monitoring of PD symptoms. These systems can have several undesirable characteristics: participants wear multiple bulky or inconvenient sensors; surface electromyographic (EMG) sensors are used in addition to accelerometers (Roy et al, 2013; Cole et al, 2010; Cole et al, 2014); participants perform tests or scripted ADLs (Patel et al, 2011; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Tsipouras et al, 2012); dyskinesia is artificially induced during sensor recordings and ground truth observations by increasing the carbidopa/levodopa dose (Keijsers et al, 2003; Keijsers et al, 2006; Tsipouras et al, 2012); and clinicians are required to provide the clinical ratings which enable the correlations between sensor data and clinical ratings. While these studies reported high classification accuracies for the detection of dyskinesia, purposefully inducing dyskinesia or constraining participants to scripted activities of daily living (ADL) or movement disorder tests from clinical exams may not replicate the way voluntary movements are interspersed with dyskinesia in an unconstrained setting.

Some of the methods used in these studies were not highly conducive to continuous dyskinesia monitoring in which the participant performs elective or voluntary ADL. In a daily living situation, participants cannot be expected to wear multiple sensors that would impede their daily activities or cause social stigma. Neither can participants be relied upon to replace sensors, such as EMG sensors, that are highly sensitive to placement on a specific muscle, or accelerometers, which must be replaced in the same location on the same limb; removing the sensors for showering or sleeping introduces the error of not replacing the sensor in exactly the same location. If clinician observations are required to provide ground truth for training the algorithm on new participants, that time demand does not reduce the burden on the clinician. Therefore, a useful dyskinesia monitoring system should generalize well across study participants, displaying the ability to accurately detect dyskinesia in new participants who were not seen in the algorithm training data.

The most promising system reported high dyskinesia classification accuracies (95.0% sensitivity and 98.6% specificity) and the ability to generalize to new participants from body-worn accelerometers and EMG sensors during unscripted participant activities in the home setting (Cole et al, 2014; Cole et al, 2010; Roy et al, 2011; Roy et al, 2013). However, in this study the authors discarded any instances for which the 4 clinicians rating dyskinesia could not agree on the dyskinesia severity. This likely removes instances during which dyskinesia is fluctuating. Since fluctuations are common in dyskinesia, removing such instances from the data set may result in higher classification accuracy but does not reflect an unconstrained, continuous-use ADL scenario.

The motivation of our study is to develop and optimize a dyskinesia monitoring and classification system that has clinical relevance. The Specific Aims of this research are to:

1. Extract features from accelerometer data recorded under ADL conditions.

2. Classify feature instances as dyskinesia or non-dyskinesia.

3. Optimize practical considerations of implementing a dyskinesia monitoring system:

   a. Address how many sensors are needed for dyskinesia classification.

   b. Examine the most effective sensor locations for dyskinesia classification.

   c. Determine which learning algorithms have the greatest classification accuracy.

   d. Quantify the generalization effect, to determine if algorithms can be developed for a population without requiring clinician-observed training data for new subjects.

3.2 Methods

3.2.1 Participant Description

Nineteen participants (8 female, 11 male) with all severities of PD were recruited, with the following exclusion criteria: age less than 18 years, English not the primary language, current or past-year history of psychoactive substance, or diagnosis of . The Washington State University Institutional Review Board (IRB) approved this study, and informed consent was obtained from all participants.

Participants ranged in age from 49 to 85 years, with a mean age of 64.4 and standard deviation of 9.4 years. Time since clinical PD diagnosis ranged from 0.5 to 14 years, with a mean of 7.9 and standard deviation of 4.1 years. For all participants, the UPDRS score ranged from 35 to 181, with a mean of 66.6 and standard deviation of 35.0. The scores from the motor section of the UPDRS for all participants ranged from 19 to 76, with a mean of 34.0 and standard deviation of 14.3. Eleven participants had a DBS device, 3 of whom also displayed dyskinesia during our study.

Five participants displayed dyskinesia during the study. For these participants, the UPDRS score ranged from 38 to 65, with a mean of 51.4 and standard deviation of 10.3. The scores from the motor section of the UPDRS for dyskinetic participants ranged from 24 to 41, with a mean of 34.2 and standard deviation of 8.2. Abnormal Involuntary Movement Scale (AIMS) (Rush, 2000) ratings, for questions 5-10 only, ranged from 5 to 19, with a mean of 11.0 and standard deviation of 4.6.

3.2.2 Observations

We observed each participant over a 1-2 hour period while they wore five Geneactiv triaxial accelerometers on the right wrist (RW), left wrist (LW), right ankle (RA), left ankle (LA), and right hip (RH), completed a UPDRS exam, and conducted elective ADLs. Most participants sat in a chair, but a few also walked around inside, talked and gestured while seated, washed dishes, and/or typed on a computer. The 1-2 hour period was variable to allow flexibility for the participants, who were monitored in their neurology clinic, at their workplace, or at home. Participants who displayed no dyskinesia after an hour were permitted to leave without finishing the remaining hour.

We used a single observer to avoid the variability of multiple observers. The single researcher (Darnall) was trained to rate dyskinesia through 9 months of observations at a neurological clinic and successful completion of the International Parkinson and Movement Disorder Society's MDS-UPDRS online training course (Goetz et al, 2008). The researcher made visual observations of each non-dyskinetic participant every 15 minutes to determine if dyskinesia was occurring. Once participants displayed dyskinesia, they were observed on a continuous basis. During the observation of an episode of dyskinesia, the dyskinesia varied from hardly visible to extreme. AIMS scores were rated according to the worst dyskinesia observed, in accordance with standard clinical practice. A transition from dyskinesia to non-dyskinesia was determined by 10 successive one-minute observations with no dyskinesia; a transition from no dyskinesia to dyskinesia was determined by the first visually observable dyskinesia symptom. We used this method to best address the clinically relevant question, "Is dyskinesia occurring?", even if it is intermittent or fluctuating in severity. Dyskinesia periods were annotated by time of occurrence and synchronized with the accelerometer readings.

3.2.3 Hardware

The accelerometer sensors, the Geneactiv made by Activeinsights, were selected to meet patient and clinician requirements, which included: small form factor (the size of a wristwatch), ease of use, not interfering with daily activities, unobtrusiveness (it looks like a regular watch), low cost, and little upkeep (long battery life and waterproof construction) (Zhang et al, 2012). The accelerometer has 0.0036 g (gravitational acceleration units) resolution and a range of ±8 g. The sampling frequency is adjustable from 10-100 Hz in 10 Hz increments, which met our criterion of a minimum of 26 Hz, i.e., at least twice the highest frequency associated with dyskinesia or other human movements during ADL (Wu et al, 2009). We recorded accelerations at 50 Hz. Dimensionally, the sensor is 4.3 x 4.0 x 1.3 cm and weighs 16 g, which met the small form factor requirement. Battery life is up to 2 months, and memory storage is up to 45 days with 0.5 GB of total space, which provided adequate power and memory allocation for a 7-day study.

3.2.4 Data Processing

Data from each accelerometer axis were converted from hexadecimal to base ten and then converted to units of g. Axial acceleration values were composed into a time-series based on start time and sampling rate. Accelerometer signal graphs display a visible difference in frequency and magnitude between dyskinesia and non-dyskinesia time periods (Figure 3.1).

Figure 3.1: Right Wrist Triaxial Accelerometer Signals. The left graph shows accelerations for a participant with no dyskinesia, while the right graph shows accelerations for a participant with dyskinesia.

Triaxial sensor signals were band-pass filtered with a tenth-order Butterworth filter in the 1 to 13 Hz band to remove low-frequency gravity effects and eliminate high-frequency components that are outside the frequency bands of human voluntary movement. We filtered axial accelerations into four frequency bands identified in previous studies: dyskinesia or low frequency (1-3.5 Hz) (Keijsers et al, 2003a; Keijsers et al, 2003b), PD tremor frequency (5-8 Hz) (Cunningham et al, 2009; Cunningham et al, 2011; Tsipouras et al, 2012; Hoff et al, 2004), high frequency (3.5-8 Hz) (Hoff et al, 2004), and full frequency (1-13 Hz) (Wu et al, 2009). For each of the 4 frequency bands, scalar acceleration values within that band were calculated as the square root of the sum of squares of the axial accelerations within that band. The mean was subtracted from the scalar values in each band to further remove the gravitational component of acceleration.

Jerk was calculated for each axis as the difference between two successive accelerometer readings of that axis divided by the sampling interval. Scalar jerk was computed as the square root of the sum of squares of the axial jerk values.
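This preprocessing can be sketched with SciPy as follows (an illustrative reconstruction; the band edges follow the text, while the function names and the zero-phase filtering choice are assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50.0  # Hz, accelerometer sampling rate

def bandpass(acc_xyz, lo, hi, order=5):
    """Band-pass an (N, 3) acceleration array; an order-5 design yields a
    10th-order band-pass filter, matching the tenth order cited in the text."""
    b, a = butter(order, [lo, hi], btype="bandpass", fs=FS)
    return filtfilt(b, a, acc_xyz, axis=0)

def scalar_acc(acc_xyz):
    """Scalar acceleration in a band: vector magnitude with the mean removed."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    return mag - mag.mean()

def jerk(acc_xyz):
    """Per-axis jerk: successive differences divided by the sample interval."""
    return np.diff(acc_xyz, axis=0) * FS  # dividing by dt = 1/FS
```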

3.2.5 Feature Extraction

Features were calculated on a time-window basis with no overlap between windows. Features were derived for successive, non-overlapping 1-minute time windows because previous studies reported good accuracy and cited clinical relevance with 1-minute windows (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers, 2006). Some previous studies used moving time windows of 1 second, but because dyskinesia occurs in the 1-3.5 Hz frequency range, a 1 second window is not enough time for a complete dyskinesia cycle (Cole et al, 2014; Roy et al, 2013). Therefore, we chose to use 1-minute time windows. Because the sampling rate was 50 Hz, each window incorporated 3,000 sensor values per axis or scalar value.
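The segmentation step might look like the following sketch (the names are assumptions):

```python
import numpy as np

FS = 50          # Hz
WIN = 60 * FS    # one-minute window = 3,000 samples per axis or scalar value

def one_minute_windows(signal):
    """Yield successive, non-overlapping one-minute segments of a 1-D signal."""
    for i in range(len(signal) // WIN):
        yield signal[i * WIN:(i + 1) * WIN]

def rms_per_window(scalar_acc):
    """Example per-window feature: RMS of the scalar acceleration."""
    return np.array([np.sqrt(np.mean(w ** 2))
                     for w in one_minute_windows(scalar_acc)])
```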

The features we generated for our study were based on the features that previous studies identified as most useful in detecting dyskinesia from accelerometer recordings (Appendix A). The 20 features for each of the 5 sensors, combined with the observed dyskinesia state for each time window, formed a 101-dimension feature vector, or instance, for each time window. The features of axial acceleration and jerk were each processed through a Fast Fourier Transform (FFT) and summed for each set of 3 axis values, from which the maximum power, mean power, and dominant frequency over the window were individually calculated for acceleration and jerk. Additional features were calculated from the scalar values of acceleration and jerk over the time window, independent of the FFT, with the exception of the energy mean and standard deviation in the high and low frequency bands, which were calculated over a 15-minute set of instances for that participant. Notably, the feature of entropy of acceleration was calculated in a 3-step process: window scalar acceleration values were divided into a 10-bin histogram; the probability that a single acceleration value within a window falls in a given bin was found from that window's histogram; and the window entropy was computed as the negative of the sum, over bins, of each bin's probability times the log of that probability. The feature of entropy of power was calculated using the same process with values of power (derived from the FFT of acceleration) rather than values of acceleration. The features of acceleration cross-correlation maximum and mean were calculated for each 2-sensor combination, bringing the feature vector size to a total of 120 features, plus the observed dyskinesia state, per one-minute instance. The individual features are defined in the list below.
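The histogram entropy step, as a Python sketch (the 10-bin histogram follows the text; the natural log is an assumption, since the text does not specify the base):

```python
import numpy as np

def histogram_entropy(values, bins=10):
    """Entropy of one window of scalar accelerations (or FFT powers):
    the negative sum over bins of p * log(p)."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                  # empty bins contribute 0 (lim p*log p = 0)
    return -np.sum(p * np.log(p))
```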

- RMS Jerk: the root mean square of the rate of acceleration change between 50 Hz samples in the 1-13 Hz frequency band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Mean Scalar: the mean of the square root of the sum of squared tri-axial accelerometer readings in the 1-13 Hz frequency band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Energy Low: the sum of squared scalar accelerations in the 1-3.5 Hz frequency band over a 60 s time window (Roy et al, 2011; Hoff et al, 2004).
- Energy PD: the sum of squared scalar accelerations in the 5-8 Hz frequency band over a 60 s time window (Hoff et al, 2004).
- Energy High: the sum of squared scalar accelerations in the 3.5-8 Hz frequency band over a 60 s time window (Hoff et al, 2004).
- RMS Scalar: the root mean square of scalar accelerations in the 1-13 Hz frequency band over a 60 s time window (Roy et al, 2011; Giuffrida et al, 2009).
- Jerk % Above Threshold: the percentage of RMS jerk above an empirically derived threshold of 0.05 m/s³ in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Max Scalar Power: the maximum of the discrete Fourier transform power of the triaxial accelerations in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Mean Scalar Power: the mean of the discrete Fourier transform power of the triaxial accelerations in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Dominant Frequency: the frequency of the maximum discrete Fourier transform power of the triaxial accelerations in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Dominant Frequency Jerk: the frequency of the sum of the maximum discrete Fourier transform power of the triaxial jerks in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Entropy in Time Domain: the negative of the sum, over the bins of a 10-bin histogram, of the probability that a scalar acceleration value falls in the bin times the log of that probability, in the 1-13 Hz band over a 60 s time window (Patel et al, 2010; Patel et al, 2011).
- Entropy in Frequency Domain: the same quantity computed from acceleration power frequency values rather than scalar accelerations, in the 1-13 Hz band over a 60 s time window (Tsipouras et al, 2012).
- Max Jerk Power: the maximum of the sum of the discrete Fourier transform power of the triaxial jerk in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Mean Jerk Power: the mean of the sum of the discrete Fourier transform power of the triaxial jerk in the 1-13 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Energy Above Threshold: the sum of squared scalar accelerations in the 5-8 Hz frequency band times 4, less those of the 1-3.5 and 3.5-8 Hz bands, over a 60 s time window (Keijsers, 2006).
- Energy Ratio High/Low Freq: the ratio of acceleration energy in the 1-3.5 Hz band to acceleration energy in the 3.5-8 Hz band over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b).
- Dominant Frequency Power Ratio: for each sensor, the ratio of the FFT power at the dominant frequency to the power of all other FFT frequencies in the window (Keijsers et al, 2003a; Keijsers et al, 2003b).
- Energy Mag High/Low Freq: the square root of the sum of squared acceleration energies in the 1-3.5 and 3.5-8 Hz bands over a 60 s time window (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b).
- Energy Mean High/Low Freq: the mean of the Energy Mag High/Low Freq feature over a 15-minute moving time window that advances in 1-minute increments (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b).
- Energy Std Dev High/Low Freq: the standard deviation of the Energy Mag High/Low Freq feature over a 15-minute moving time window that advances in 1-minute increments (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b).
- Cross-correlation Max: the maximum correlation between two sensors' accelerations displaced by up to 100 samples, centered in each 60-second window and normalized to the time duration of 200 samples (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Patel et al, 2010).
- Cross-correlation Mean: the mean correlation between two sensors' accelerations displaced by up to 100 samples, centered in each 60-second window and normalized to the time duration of 200 samples (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Patel et al, 2010).

3.2.6 Machine Learning Algorithms

Features were classified onto dyskinesia occurrences for each participant over the observed period using Weka's decision tree machine learning algorithm, which produced the best accuracy in our previous study as well as in other studies (Patel et al, 2011; Hall et al, 2009; Darnall et al, 2012). Previous studies reported several machine learning algorithms as returning high dyskinesia classification accuracy under scripted ADL conditions: Decision Tree (J48), Support Vector Machine (SVM), and Multilayer Perceptron (MLP) (Darnall et al, 2012; Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Patel et al, 2011; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Tsipouras et al, 2012).

In a decision tree classifier such as J48, entropy is measured as:

$Entropy(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$ (3.1)

where S is the set of data points, $p_{+}$ is the proportion of data points that belong to the positive class, and $p_{-}$ is the proportion of data points that belong to the negative class. The information gain for each attribute is described by the equation:

$Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$ (3.2)

where Values(A) is the set of all possible values for feature A and $S_v$ is the subset of S for which feature A has value v. Gain(S,A) measures how well a given feature separates the training examples according to their target classification (Quinlan, 1986).

SVM maximizes the margin between the training examples and the class boundary. SVM generates a hyperplane which provides a class label for each data point described by a set of feature values (Boser et al, 1992).

MLP is a type of Artificial Neural Network (ANN), a computational model mimicking a neuronal organizational structure (Zornetzer et al, 1990). ANNs are built from an interconnected set of simple units, each of which takes a number of real-valued inputs and produces a single real-valued output (Mitchell, 1997). Using back propagation, an ANN minimizes the squared error between the network output and the target values.

MLA classification ability can be expressed by a variety of statistics. Accuracy is the proximity of the classification results to the true values. Accuracy combines the ability of an algorithm to both identify dyskinesia and to reject non-dyskinesia, without distinguishing the two conditions. Because accuracy does not distinguish between false positives and false negatives, it can be misleading when the algorithm is sensitive to detecting positives but not specific to detecting negatives, or vice versa. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}$ (3.3)

Sensitivity, or recall rate, is the ratio of correctly classified positive instances to actual positive instances. Sensitivity measures our algorithm's ability to identify dyskinesia, but it does not measure the algorithm's ability to detect non-dyskinesia.

$Sensitivity = \dfrac{TP}{TP + FN}$ (3.4)

Specificity is the ratio of correctly classified negative instances to actual negative instances. Specificity measures the ability of an algorithm to exclude instances not containing dyskinesia, but it does not measure the algorithm's ability to detect dyskinesia.

$Specificity = \dfrac{TN}{TN + FP}$ (3.5)

F-measure is the harmonic mean of precision (or positive predictive value; the fraction of retrieved instances that are relevant) and recall (or sensitivity):

$F_B = \dfrac{(1 + B^2) \cdot precision \cdot recall}{B^2 \cdot precision + recall}$ (3.6)

where B is a positive weight; values of B greater than 1 place a higher weight on recall, while fractional values of B less than 1 place a higher weight on precision. In WEKA, the weighted F-score is a weighted average of the classes' F-scores, weighted by the proportion of elements in each class. Because this F-measure accounts for both positive and negative rates and is weighted by the proportion of instances in each class, it is a better representation of how well a classifier identifies true positives and true negatives than accuracy, sensitivity, or specificity alone.
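These statistics follow directly from a confusion matrix; a minimal sketch with scikit-learn (illustrative, not the authors' Weka workflow):

```python
from sklearn.metrics import confusion_matrix, f1_score

def classification_stats(y_true, y_pred):
    """Accuracy, sensitivity, and specificity (Eqs. 3.3-3.5) plus the
    class-weighted F-score for binary dyskinesia (1) / non-dyskinesia (0)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_weighted = f1_score(y_true, y_pred, average="weighted")  # B = 1
    return accuracy, sensitivity, specificity, f_weighted
```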

3.2.7 Optimization

We characterized practical considerations of implementing a dyskinesia monitoring system. Namely, we investigated classification accuracy with combinations of reduced numbers of sensors at specific body locations. Further, we determined which MLAs had the greatest accuracy. We evaluated classification ability with 10-fold cross validation and a leave-one-subject-out training/test split, and we assessed the effects of activity level and class imbalance on classification accuracy.

Classification accuracy of different MLAs

10-fold cross validation splits the data from all participants into training and test sets by random instance selection. The classifier model is trained on the training set data. The model is validated by applying it to the held-out validation data, from which classification accuracy, receiver operating characteristic (ROC) area, F-measure, and a confusion matrix of true positives, true negatives, false positives, and false negatives are calculated. This type of validation can be biased toward participants in the study population, because the model was built on training data from all participants in the study population.

Leave one subject out training/test split divides the data into a training set of all data except those from one participant, and a validation set composed only of the data from the participant left out of the training set. This type of validation is useful in determining how well an MLA generalizes across participants, or how well it applies to new participants not seen in the training data set.
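A minimal sketch of the leave-one-subject-out procedure, written here with scikit-learn’s LeaveOneGroupOut rather than the WEKA tooling used in the study; the toy arrays stand in for the real feature vectors, labels, and participant IDs.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data: X = feature vectors, y = dyskinesia labels,
# groups = participant ID for each instance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    # Train on all participants except one; validate on the one left out.
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
    print(f"participant {groups[test_idx][0]}: accuracy {acc:.2f}")
```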

We compared the leave-one-participant-out accuracies for the J48, MLP, and SVM algorithms. The J48 algorithm built a C4.5 decision tree with a pruning confidence factor of 0.25, a minimum of 2 instances per leaf, and 3 folds for reduced-error pruning. In SVM, which used the sequential minimal optimization algorithm to train the support vector classifier, we determined classification accuracy and F-measure for varying kernel type and the c, g, and e parameters. Kernel types included both the polynomial kernel (polykernel) and the radial basis function (RBF). We individually varied the complexity parameter (c) by multiples of 10. Attributes were normalized. When we used the polykernel, we varied the exponent value (e) from 1 to 3. When we used RBF, we varied the gamma value (g) from 0.01 to 100.00 by multiples of 10. In MLP, which used backpropagation to build a network with sigmoid nodes, with attributes normalized to a range between -1 and +1 and hidden units equal to half the sum of attributes plus classes, we compared accuracy and F-measure results from learning rates of 0.01 and 0.10.
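This parameter sweep can be expressed as a grid search. The sketch below mirrors the swept values using scikit-learn’s SVC as a stand-in for WEKA’s SMO; the parameter names follow scikit-learn, not WEKA, and the specific c values are an assumed example since the text states only the multiplicative step.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Polynomial exponent e from 1 to 3; RBF gamma g from 0.01 to 100 by
# multiples of 10; complexity c swept by multiples of 10 (example range).
param_grid = [
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10, 100],
     "svc__degree": [1, 2, 3]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10, 100],
     "svc__gamma": [0.01, 0.1, 1, 10, 100]},
]
pipe = make_pipeline(StandardScaler(), SVC())  # normalize attributes first
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=10)
# search.fit(X, y)  # X, y: the feature matrix and dyskinesia labels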

Activity level effect

To determine if our algorithms were influenced by activity level, which is driven by voluntary movements, we calculated activity levels for each participant and plotted them against dyskinesia classification accuracy. If our algorithms were sensitive to activity level rather than dyskinesia, we would expect higher levels of activity to correspond to greater dyskinesia classification accuracy. Activity levels were determined by the number of accelerometer “counts” per minute. The “count” is a quantitative summary measurement of the raw acceleration data on an arbitrary scale (Corder et al, 2008; Chen et al, 2005; van Hees et al, 2010). While there is debate about where activity thresholds should be set (Corder et al, 2008; Chen et al, 2005; van Hees et al, 2010), we set our thresholds as a combination of thresholds from two studies (Freedson et al, 1997; van Hees et al, 2010): less than or equal to 100, greater than 100, greater than 1040, and greater than 1951 counts per minute for sedentary, light, moderate, and vigorous activity levels, respectively.
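Expressed as code, this threshold scheme maps counts per minute to one of the four activity levels; the function name is ours, and the thresholds are exactly those just described.

```python
def activity_level(counts_per_minute):
    """Map accelerometer counts per minute to an activity level using the
    thresholds adopted in the text (Freedson et al, 1997; van Hees et al, 2010)."""
    if counts_per_minute <= 100:
        return "sedentary"
    elif counts_per_minute <= 1040:
        return "light"       # >100 and <=1040
    elif counts_per_minute <= 1951:
        return "moderate"    # >1040 and <=1951
    else:
        return "vigorous"    # >1951
```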

Class imbalance

Class imbalance, or having more instances in one class than in another, can influence the accuracy of an MLA. Because we had more instances of non-dyskinesia than dyskinesia, we evaluated the J48 algorithm with leave one participant out validation in which we biased the training data set to uniform class sampling. Bias to uniform sampling is a procedure of either randomly oversampling the underrepresented class or randomly undersampling the overrepresented class, until the number of instances in each class is equal in the training set. Although we obtained similar results with either method of uniform sampling, we report the results from an oversampled dyskinesia class, which was the underrepresented class.
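A minimal sketch of random oversampling to uniform class counts, assuming numpy arrays of features X and labels y; the function name is ours, and WEKA offers comparable resampling filters.

```python
import numpy as np

def oversample_to_uniform(X, y, rng=np.random.default_rng(0)):
    """Randomly oversample the underrepresented class until both classes
    have equal counts in the training set (a minimal sketch)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Draw (with replacement) enough extra minority instances to balance.
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```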

Most effective sensor locations

We listed the J48 algorithm’s 10-fold cross validation accuracy for different combinations of 1, 2, 3, 4, and 5 sensors. From this evaluation, we determined which individual sensor location gave the best classification accuracy. We also determined the most accurate combination of two sensors, and how much additional accuracy was obtained by adding more than 2 sensors to the data set. We based our selection of sensor combinations on the method of Tsipouras et al (2012), who evaluated combinations of 1, 2, 4, and 5 sensors for classification accuracy.

Feature with greatest effect on classification accuracy

The J48 algorithm trains a decision tree model in which the first node of the tree indicates the feature that most reduced the entropy in the model. We selected the first node of each tree as the most important feature in classifying dyskinesia instances.
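In scikit-learn terms, the analogous step is reading the feature tested at the root node of a fitted entropy-based decision tree; the sketch below uses toy data and hypothetical feature names, not the study’s feature set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; in practice X would hold the per-sensor feature vectors.
rng = np.random.default_rng(0)
feature_names = ["Energy_Low_High_Mean", "Low_Frequency_Energy", "RMS_Mag"]
X = rng.normal(size=(200, len(feature_names)))
y = (X[:, 0] > 0).astype(int)  # make the first feature informative

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
# tree_.feature[0] is the index of the feature tested at the root node,
# i.e. the split that most reduced entropy (analogous to J48's first node).
print("Root-node feature:", feature_names[clf.tree_.feature[0]])
```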

3.3 Results

3.3.1 Classification accuracy of different MLAs

Using 10-fold cross validation on the J48 algorithm, we achieved an overall classification performance of 96% accuracy and 0.96 ROC area across all participants. While these results were very good for participants included in the training set, we proceeded to examine how well J48 generalized to unseen participants by training and validating with the leave one participant out method. Accuracy was lower for the dyskinetic participants, numbers 4, 6, 10, 13, and 15. The results for this method are shown by participant (Figure 3.2).


Figure 3.2: Classification Accuracy. This is a plot of the classification accuracy from the J48, MLP, and SVM algorithms, which trained by leaving one participant out of the training set and validating on the participant left out. Participants 6, 10, 13, and 15 had dyskinesia and classified worse than participants without dyskinesia. Participant 4 also had dyskinesia, but classified with higher accuracy than the other dyskinetic participants.

Since non-dyskinetic participants always classified well regardless of which classifier we used, we focused on the accuracy and F-measure performance of the MLAs for the dyskinetic participants. After optimizing the kernel type, complexity parameter, exponent value, gamma value, and learning rate of the classifiers, we found that MLP produced better accuracy and F-measure than J48 and SVM for dyskinetic participants in leave one participant out validation (Figure 3.3).


Figure 3.3: MLA Accuracy and F-measure. This is (A) the classification accuracy and (B) the F-measure, with standard error bars (Appendix B), for the algorithms J48, MLP, and SVM, for leave one participant out validation, for 5 dyskinetic participants.

3.3.2 Activity level effect

To test if activity level was correlated with dyskinesia, for each participant we plotted activity level by AIMS score (Figure 3.4) and activity level by MLP classification accuracy (Figure 3.5).


Figure 3.4: Activity Level vs. AIMS. This is the average activity level for each participant vs. that participant’s AIMS severity. Participants without dyskinesia rated 0 on the AIMS.


Figure 3.5: Activity Level vs. MLP Classification Accuracy. This is the average activity level for each participant vs. that participant’s MLP classification accuracy.

We see from these graphs that dyskinetic participants had activity levels similar to those of non-dyskinetic participants. Since we did not find a relationship between activity level and dyskinesia, we concluded our algorithms were not biased by activity level.

3.3.3 Class imbalance

To determine if larger quantities of non-dyskinesia instances biased our classification algorithms, we compared classification accuracies for all training sets and for training sets biased to uniform class sampling by oversampling the dyskinesia class (Figure 3.6); graphs were generated from leave one participant out training sets.


Figure 3.6: J48 Bias to Uniform Sampling. This graph shows results from J48 leave one participant out validation, trained on all data except the instances from the participant left out, and on the same data set randomly sampled with bias toward uniform sampling by oversampling the underrepresented class, which was dyskinesia.

We found the classification accuracy of the J48 algorithm was less than one percent different from the classification accuracy of the J48 algorithm with bias to uniform sampling, so the graphs of the two methods are almost identical. Both J48 models were built on the same training set instances, but one model was built by re-sampling the underrepresented dyskinesia class. Similar classification results from models built on the same data instances, one of which used oversampled data, show that the J48 algorithm built a model that was not biased toward the overrepresented non-dyskinesia class.

3.3.4 Most effective sensor locations

Reducing the number of sensors, we examined the J48 10-fold cross validation classification accuracy for single sensors, all sensors, and all 2-sensor combinations (Figure 3.7). Classifications from all sensors combined produced an accuracy of 96%. The best 2-sensor combinations were right wrist with right ankle (96%) and right wrist with right hip (95%). The worst 2-sensor combination was right wrist with left ankle (94%). The best single sensor was either the right wrist (92%) or the left wrist (92%), and the worst single sensor was the left ankle (88%). All combinations of sensors, and single sensor data sets, produced high classification accuracies (Figure 3.7).


Figure 3.7: Accuracy of Sensor Combinations. This graph shows the J48 10-fold cross validation accuracy from feature vectors generated from individual sensors, 2-sensor combinations, and all sensors combined.


3.3.5 Feature with greatest effect on classification accuracy

Based on J48 10-fold cross validation on all participants, the most important feature was identified as the first node of the decision tree. That feature was Energy_Low_High_Mean, which is the mean of the square root of the sum of squared acceleration energy in the 1-3.5 and 3.5-8 Hz bands, over a 15-minute moving time window that advances in 1-minute increments. Plotting high-frequency energy vs. low-frequency energy from the right wrist sensor (Figure 3.8), we saw that the dyskinetic participants had a higher mean (119.97) and lower standard deviation (107.32) than the non-dyskinetic participants’ mean (75.40) and standard deviation (121.21).


Figure 3.8: Energy in Frequency Bands. This figure is a plot of the square root of the sum of squared right wrist accelerations in the 3.5-8 Hz frequency band vs. the square root of the sum of squared accelerations in the 1-3.5 Hz frequency band, for all participants. Dyskinetic participants are indicated by “O”, while non-dyskinetic participants are indicated by “ ”.

3.4 Discussion

We compared generalization for the algorithms J48, SVM, and MLP with leave one participant out validation, and found that MLP had the greatest F-measure and classification accuracy for dyskinetic participants alone (0.62 F-measure, 63% accuracy), as well as the greatest accuracy for all participants together (86% accuracy). Several studies in the literature also found that the MLP algorithm produced the highest accuracy, or sensitivity and specificity, in classifying instances as dyskinesia (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Tsipouras et al, 2012). The Keijsers studies reported a dyskinesia severity classification accuracy of 93.7% using MLP trained on features calculated from arm accelerations and 15-minute instance windows, with leave one participant out validation. They reported 77% accuracy under the same scenario using 1-minute instance windows. For dyskinesia presence with many voluntary movements, they found an MLP accuracy of 90.4% for the trunk and 79.4% for the arm, for 1-minute windows with an 80/20 data split and 50-fold cross validation (Keijsers et al, 2003a). While our system, with leave one participant out validation, had 19% lower classification accuracy for dyskinetic participants alone, we achieved 9% higher classification accuracy for all participants together using arm acceleration data with 1-minute time windows. For 10-fold cross validation, we had an overall accuracy of 96%, or 6% greater accuracy than the Keijsers study with 50-fold cross validation on dyskinesia presence. When we compare the Keijsers system results with multiple voluntary movements to our system results, our system produced higher dyskinesia detection accuracies than the Keijsers study did with cross validation, but our system did not generalize as well for participants who were unseen in the algorithm training set when we used leave one participant out validation.

The Cole system used neural networks, of which MLP is a type. They specifically used dynamic neural networks (DNN), dynamic support vector machines (DSVM), and hidden Markov models (HMM) to classify 1-second instances for dyskinesia presence. HMMs classify sequences of features probabilistically, DNNs divide the feature space using a series of linear hyperplanes, and DSVMs divide the feature space using a series of non-linear hyperplanes (Cole et al, 2014). The DNN used backpropagation with a learning rate of 0.05 over 1000 iterations. It was a two-layered network with two hidden nodes to track dyskinesia, with each node applying weights of a five-point FIR filter to time-advanced and time-delayed features calculated from input data. Their DSVM used a sigmoid-based function with a tradeoff coefficient of 0.125, a scale factor of 0.5, and 80 support vectors. They devised two HMMs; one described instances containing dyskinesia, while the other described instances containing non-dyskinesia. The features were quantized into 2 bins with cutoffs dividing the range of possible values between the two classes, dyskinesia and non-dyskinesia. They reported global error rates of 8.8% for the DNN, 9.1% for the DSVM, and 12.3% for the HMM in detection of dyskinesia presence, for all activity states of their participants. When they classified dyskinesia severity with the HMM, they found a best specificity of 98.6% for severe dyskinesia and a worst specificity of 91.9% for moderate dyskinesia. They did not describe whether their MLA validation method included a leave one participant out validation. Our error rate, measured as one minus accuracy, was 14% for all participants with leave one participant out validation and 8% for all participants with 10-fold cross validation, which was in the same range as their error rates.

We did not find a correlation between activity level and dyskinesia AIMS severity.

Dyskinetic participants in our study varied in activity level similarly to non-dyskinetic participants. Although body worn accelerometers are commonly used to measure activity level (Corder et al, 2008; Chen et al, 2005; van Hees et al, 2010), we did not find that previous studies considered activity level when using accelerometer data to detect dyskinesia (Patel et al, 2011; Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Tsipouras et al, 2012; Cole et al, 2010; Roy et al, 2011; Cole et al, 2014; Roy et al, 2013). By showing that our system was sensitive to dyskinesia but not to activity level, we provide additional validity for the method of detecting dyskinesia from accelerometers, which are also used to detect activity level.

Correcting for class imbalance by oversampling the underrepresented dyskinesia class when building our algorithm training set, we did not see a resulting improvement in J48 classification accuracy. Our system appears to be robust to data that contain more instances of non-dyskinesia than dyskinesia. Other studies in the literature did not mention a correction for class imbalance in their methods (Patel et al, 2011; Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Tsipouras et al, 2012; Cole et al, 2010; Roy et al, 2011; Cole et al, 2014; Roy et al, 2013).

We found the single most effective sensor location was either the right wrist or left wrist, which each had a J48 10-fold cross validation accuracy of 92%. This result paralleled a previous study that identified the wrist extensor muscle in the dominant arm as the most important sensor location (Cole et al, 2014). Considering that 4 of 5 dyskinetic participants in our study were right-side dominant (participant 13 was the exception), it is interesting that either the left or right wrist location alone produced the same classification accuracy in our study.

We found only 1 sensor was needed to determine dyskinesia states with high accuracy (92%), and that adding more than 2 sensors added little accuracy (94%), with 10-fold cross validation. A previous study found that only two sensors were needed to produce 93% classification accuracy with leave one participant out validation (Tsipouras et al, 2012). One study found that the right wrist sensor alone was sufficient to train their algorithms, which produced dyskinesia severity sensitivity and specificity ranging from 91.9 to 98.6% for various severities (Cole et al, 2014).

Out of the 20 features we generated per sensor, we found that Energy_Low_High_Mean, the feature that compared acceleration energy in the low (1-3.5 Hz) and high (3.5-8 Hz) frequency bands, was the most important feature in distinguishing dyskinesias. This result parallels previous studies, which found that dyskinesias occur at lower frequencies than voluntary movements or tremor, and which identified that the most useful features in distinguishing dyskinesia from voluntary movements or tremor were ones that compared change in acceleration, quantity of acceleration, or energy of acceleration above or below a threshold frequency that varied by study from 3 to 5 Hz (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006; Cole et al, 2014; Giuffrida et al, 2009; Mera et al, 2012a; Mera et al, 2012b; Tsipouras et al, 2012). The Giuffrida and Mera studies reported that their most important feature was the RMS ratio between high and low frequency bands. This feature paralleled our feature, Energy_Low_High_Mag, from which we derived our most important feature, Energy_Low_High_Mean. Their study achieved a 10-fold cross validation sensitivity of 1.00 and specificity of 0.73 for detecting dyskinesia during periodic scripted evaluations, which participants accomplished at home while wearing a finger-mounted accelerometer.

While we found high classification accuracy (96%) for all participants with 10-fold cross validation, we found lower classification accuracy (86%) for all participants with leave one participant out validation. Our system does not generalize well to participants who were unseen in the training set, as shown by low classification accuracies between 22 and 80% for the dyskinetic participant left out. By comparison, the Cole study found average sensitivity and specificity greater than 92% for each dyskinesia severity level, and an average error rate of 8.8% (Cole et al, 2014); the Keijsers study found average accuracies as low as 75% for certain specific ADL and an average accuracy of 82% for leave one participant out validation with 1-minute intervals (Keijsers et al, 2003a; Keijsers et al, 2003b); and the Tsipouras study found individual participant accuracies varied between 63 and 98%, with a best average classification between participants of 93% with leave one participant out validation (Tsipouras et al, 2012).

Overall, we had lower classification accuracies than other studies; however, our participants were less constrained in their environment and were allowed to go about their ADL in their daily environment without scripted activities. The Keijsers study only rated participants during scripted ADL, with artificially induced dyskinesia (Keijsers et al, 2003a; Keijsers et al, 2003b). The Tsipouras study likewise monitored participants while they performed a sequence of scripted activities (Tsipouras et al, 2012). The Giuffrida study required participants to perform clinical evaluation tasks in front of a camera several times per day, while wearing a finger-mounted accelerometer tethered to a processor box attached to the arm (Giuffrida et al, 2009). The Cole study acquired sensor data during unconstrained ADL in a home environment; however, they divided their study into sitting, standing, and walking mobility states without clearly describing how their participants were motivated to present ADL in each of these states (Cole et al, 2014). The Cole study might have been as unconstrained as ours, but their protocol description was not very specific. Because our study was less constrained than previous studies, it has implications for continuous in-home use, which could be realized if we could improve the ability of our system to generalize to new participants. We investigate the causes of poor generalization within our system in chapter 4.

Acknowledgements

Thanks to Jamie Mark, ARNP, Dr. Jonathan Carlson, MD, Ph.D., and Dr. David Greeley, MD, FAAN, for guidance in clinical perspectives, and to Dr. Narayanan “CK” Chatapuram Krishnan for insight into machine learning applications. This work was supported by NSF under Grant No. DGE-0900781.

BIBLIOGRAPHY

Boser B., Guyon I., Vapnik V., A training algorithm for optimal margin classifiers. Proceedings of the 5th annual workshop on computational learning theory, 1992: p. 144-152.

Chen, K. Y. and D. R. Bassett, Jr., The technology of accelerometry based activity monitors: current and future. Med Sci Sports Exerc, 2005. 37(11 Suppl): S490-500.

Cole, B.T. et al., Dynamic neural network detection of tremor and dyskinesia from wearable sensor data. Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, 2010: p. 6062-6065.

Cole, B.T. et al., Dynamical Learning and Tracking of Tremor and Dyskinesia from Wearable Sensors. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2014. 22(5): p. 982-91.

Corder, K. et al., Assessment of physical activity in youth. J Appl Physiol, 2008. 105(3): p. 977-87.

Cunningham, L., et al., Computer-Based Assessment of Bradykinesia, Akinesia and Rigidity in Parkinson’s Disease, in Ambient Assistive Health and Wellness Management in the Heart of the City, M. Mokhtari, et al., Editors. 2009, Springer Berlin / Heidelberg. p. 1-8.

Cunningham, L., et al., Home-Based Monitoring and Assessment of Parkinson's Disease. Information Technology in Biomedicine, IEEE Transactions on, 2011. 15(1): p. 47-53.

Darnall, N.D., et al., Application of machine learning and numerical analysis to classify tremor in patients affected with essential tremor or Parkinson’s disease. 2012.

Deuschl G., Volkmann J., Raethjen J., Tremors: differential diagnosis, pathophysiology, and therapy, in Parkinson’s disease and movement disorders, Jankovic J. and Tolosa E., Editors. 2007, Williams and Wilkins: Philadelphia. p. 298-320.

Freedson, P.S., E. Melanson, and J. Sirard, Calibration of the Computer Science and Applications, Inc. accelerometer. Medicine & Science in Sports & Exercise, 1997.

Giuffrida, J.P., et al., Clinically deployable Kinesia™ technology for automated tremor assessment. Movement Disorders, 2009. 24(5): p. 723-730.

Goetz, C.G., et al., Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Movement Disorders, 2008. 23(15): p. 2129-2170.

Hall, M., et al., The WEKA data mining software: an update. SIGKDD Explor. Newsl., 2009. 11(1): p. 10-18.

Hoff, J.I., V. van der Meer, and J.J. van Hilten, Accuracy of objective ambulatory accelerometry in detecting motor complications in patients with Parkinson disease. Clin Neuropharmacol, 2004. 27(2): p. 53-7.

Keijsers, N.L., M.W. Horstink, and S.C. Gielen, Ambulatory motor assessment in Parkinson's disease. Mov Disord, 2006. 21(1): p. 34-44.

Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Automatic assessment of levodopa- induced dyskinesias in daily life by neural networks. Movement Disorders, 2003. 18(1): p. 70-80. [A]

Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Movement parameters that distinguish between voluntary movements and levodopa-induced dyskinesia in Parkinson’s disease. Human Movement Science, 2003. 22(1): p. 67-89. [B]

Mera, T.O., et al., Feasibility of home-based automated Parkinson's disease motor assessment. J Neurosci Methods, 2012. 203(1): p. 152-6.

Mera, T.O., M.A. Burack, and J.P. Giuffrida. Quantitative assessment of levodopa-induced dyskinesia using automated motion sensing technology. in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. 2012.

Mitchell, T.M., Machine Learning. 1997, New York: McGraw-Hill.

Patel, S., et al. Home monitoring of patients with Parkinson's disease via wearable technology and a web-based application. in Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE. 2010.

Patel, S., et al., Longitudinal monitoring of patients with Parkinson's disease via wearable sensor technology in the home setting. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, 2011: p. 1552-1555.

Quinlan J., Induction of decision trees. Machine learning, 1986. 1(1): p.81-106.

Roy, S. H., et al. High-resolution tracking of motor disorders in Parkinson’s disease during unconstrained activity. Movement Disorders, 2013. 28(8), p. 1080-1087.

Roy, S. H., et al. Resolving signal complexities for ambulatory monitoring of motor function in Parkinson’s disease. 33rd Annual International Conference of the IEEE EMBS, 2011: p. 4836- 4839.

Rush, A. J. Handbook of Psychiatric Measures. Washington, DC: American Psychiatric Association, 2000.

Tsipouras, M.G., et al., An automated methodology for levodopa-induced dyskinesia: Assessment based on gyroscope and accelerometer signals. Artificial Intelligence in Medicine, 2012. 55(2): p. 127-135.

van Hees, V.T., et al., A method to compare new and traditional accelerometry data in physical activity monitoring. World of Wireless Mobile and Multimedia Networks (WoWMoM), 2010 IEEE International Symposium on a, 2010: p. 1-6.

Wu P., Lin C., Wang C., Hwang I., Atypical task-invariant organization of multi-segment tremors in patients with Parkinson’s disease during manual tracking. Journal of Electromyography and Kinesiology, 2009. 19: p. 144–153.

Zhang, S., et al., Physical activity classification using the GENEA wrist-worn accelerometer. Med Sci Sports Exerc, 2012. 44(4): p. 742-8.

Zornetzer S., Davis J., Lau C., An introduction to neural and electronic networks. 1990, San Diego: Academic Press.

4. GENERALIZATION OF DYSKINESIA SYSTEM

4.1 Introduction

Our study produced high classification accuracies when we classified dyskinesia instances across individuals using 10-fold cross validation and the WEKA J48 decision tree classifier (see chapter 3), in which the entire population of participants was included in the training and validation sets (Bouckaert et al, 2010). However, a dyskinesia classifying system is most clinically useful if it can learn classification rules on a population set and then apply those rules to new participants who were not included in the training data set. Because our machine learning algorithms did not apply learned classification models to unseen participants with high accuracy, the generalization was inadequate for clinical relevance. To generalize our system in the future, we proposed to identify the sources of algorithm misclassifications.

4.2 Hypotheses Proposed

During data collection with participants, we observed variations in the way each participant presented dyskinesia. These differences included: severity of dyskinesia, locations on the body affected by dyskinesia, fluctuation in dyskinesia severity throughout the session, and transitions into or out of a dyskinesia period. Dyskinesia can fluctuate in severity throughout a dyskinesia period (Keijsers et al, 2003a; Keijsers et al, 2003b). Fluctuations in dyskinesia often appear after long-term levodopa use as either levodopa peak-dose or wearing-off motor complications (Patel et al, 2010; Bonato et al, 2004). In section 4.3, we describe the location, severity, and fluctuation of dyskinesia we observed in our participants.

Dyskinesia may involve nearly any portion of the body, including the head, neck, torso, limbs, and respiratory muscles (Olanow et al, 2001). Severity of dyskinesia was rated on the five-point clinical Abnormal Involuntary Movement Scale (AIMS): none (0), minimal (1), mild (2), moderate (3), severe (4), and is rated at the maximum severity seen during an observation period (Rush, 2000). In the AIMS section for extremity movements, choreic movements in the arms, wrists, hands, and fingers are rated, as well as lateral knee movement, foot tapping, heel dropping, foot squirming, and inversion and eversion of the foot. In the AIMS section for trunk movements, dyskinesia severity is rated on the rocking, twisting, squirming, and gyrations of the neck, shoulders, and hips. In the global judgment section of the AIMS, dyskinesia severity is rated on overall abnormal movements, incapacitation due to abnormal movements, and the patient’s awareness of abnormal movements.

We proposed several hypotheses regarding how dyskinesia inconsistencies we observed across participants may have reduced machine learning algorithm (MLA) generalization across participants.

Hypothesis 1: Dyskinesia feature variations between participants

If individual participants had feature sets that were unique, principal component analysis would show distinguishable feature vector sets, and adding a few dyskinesia instances to the training set from the participant left out of the training set would increase the classification accuracy for that participant.

Hypothesis 2: Differences in body location and severity of dyskinesia

Participants affected by dyskinesia on different locations of their body will have different classification accuracies, while participants affected by dyskinesia on the same body location will have similar classification accuracies. If differences in overall dyskinesia severity affected classification accuracy, then a relationship exists between severity and accuracy.

Hypothesis 3: Fluctuations

If dyskinesia fluctuations in severity throughout a dyskinesia period affected classification accuracy, such fluctuations can be quantified from the feature set and correlated to the classification accuracy.

Hypothesis 4: Transitions into or out of dyskinesia

If transitions into or out of dyskinesia affected classification accuracy, instances surrounding a transition will contain the majority of misclassifications for transitioning participants, and those participants will have a lower classification sensitivity than non-transitioning dyskinetic participants.

4.3 Hypotheses Testing

4.3.1 Hypothesis 1: Dyskinesia feature variations between participants

Hypothesis Statement

If individual participants had feature sets that were unique, principal component analysis would show distinguishable feature vector sets, and adding a few dyskinesia instances to the training set from the participant left out of the training set would increase the classification accuracy for that participant.

Methods

Principal component analysis (PCA) expresses the data as linear combinations of uncorrelated vectors, the principal components, which explain the variation in the data. It reduces a multi-dimensional feature vector down to a few principal components that explain the most variation. We determined from this analysis how much variation could be accounted for by 1, 2, or 3 principal components. We examined plots of the first two principal components to determine if dyskinesia instances were clustered together apart from non-dyskinesia instances. We also examined the plots to determine if dyskinetic participants had an overlapping cluster of instances. PCA is a method to determine the utility of semi-supervised clustering analysis, in which clusters of class data are identified and labeled from sparse ground-truth labels. Semi-supervised approaches are useful when there is a lack of ground-truth class labels, as could be the case with observer-identified dyskinesia instances for new study participants (Witten et al, 2005).
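As a sketch of this PCA step, the following Python code computes the cumulative variance explained and the first two component scores, using scikit-learn on toy data standing in for the real feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the feature matrix (rows = 1-minute instances).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

pca = PCA()
scores = pca.fit_transform(X)
# Cumulative variance explained by the first components (cf. Figure 4.1).
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative variance explained by 1-3 components:", cumulative[:3])
# scores[:, 0] and scores[:, 1] are the coordinates plotted in Figures 4.2-4.3.
```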

We investigated whether including a few dyskinesia instances in the training set from the individual left out of the training set improved F-measure for that individual in leave-one-out validation. This is a way to test if the algorithm is too specific to the participants in the training set, and how many instances from a new participant are needed for the algorithm to generalize across all previous participants plus the new participant.

We used the multilayer perceptron (MLP) algorithm with a learning rate of 0.01, the configuration that gave the highest F-measure of all the MLAs we tested (see chapter 3), to build a model on the training set. We validated the model on all the instances from the participant who was left out of the training set, from which we calculated F-measures.

F-measure is the harmonic mean of precision and recall. Because F-measure accounts for both positive and negative rates and is weighted by the proportion of instances in each class, it is a better representation of how well a classifier identifies true positives and true negatives than classification accuracy. F-measure was used in chapter 3 to determine the MLA with the best classification ability.

Next, the first 10 dyskinesia instances from the participant were added to the training set and left out of the test set. Finally, the first 20 dyskinesia instances from the participant were added to the training set and left out of the test set. This process was repeated for each participant who had dyskinesia. F-measures for each participant were compared across the conditions of 0, 10, and 20 dyskinesia instances in the training set.

Results

Principal component analysis showed that 55% of the variation was accounted for by the first three principal components (Figure 4.1). Each principal component after the third component accounted for less than 4% of the variability.


Figure 4.1: Principal Component Analysis. The bars show the variation in dyskinesia feature vectors accounted for by each additional principal component. The line shows the cumulative variation explained by increasing numbers of principal components.

When we plotted the first 2 principal components of all the participants (Figure 4.2), we observed a large spread of dyskinesia participant data points that were interspersed with non-dyskinesia points.


Figure 4.2: First Two Principal Components. This plot shows the first two principal components for all participant instances. Dyskinetic participants are marked “O”. Non-dyskinetic participants are marked “ ”.

When we plotted dyskinesia participants only, we observed that dyskinesia instances did not overlap well between participants (Figure 4.3).


Figure 4.3: First Two Dyskinesia Principal Components. This is a plot of the first two principal components for dyskinetic participant instances. The instances from the five dyskinetic participants are labeled as: participant 4 “X”, participant 6 “ ”, participant 10 “O”, participant 13 “□”, participant 15 “”.

When we added dyskinesia instances from the participant left out of the training set to the training set, F-measure increased for each dyskinesia participant. Considering all participants but number 13, when 10 instances were added to the training set, F-measure increased from a low of

0.45 for no instances added to a low of 0.87 for 10 instances added (Figure 4.4). Adding 20 instances to the training set increased the lowest F-measure to 0.98 for all participants but participant 13. For this participant, F-measure started at 0.11 for no instances added, increased to

0.23 for 10 instances added, then dropped to 0.19 for 20 instances added. Participant 13 was the

only participant that did not classify well for added instances, and the only one that had a lower

F-measure for 20 added instances than for 10 added instances.


Figure 4.4: F-measure for Instances Added to Training Set. F-measure is shown for individual participants left out of the training set with 0, 10, or 20 of their instances added to training set and removed from the test set.

Adding 10 dyskinesia instances to the training set decreased the dyskinesia instances in the test set by 9.3, 24.4, 14.1, 14.1, and 58.8% for participants 4, 6, 10, 13, and 15, respectively.

Adding 20 dyskinesia instances to the training set decreased the dyskinesia instances in the test set by 18.5, 48.8, 28.2, and 28.2% for participants 4, 6, 10, and 13, respectively. Since participant

15 only had 17 instances of dyskinesia, we only added 17 dyskinesia instances to the training set instead of 20, in the add 20 test.


Discussion

PCA showed that only 55% of the variation in the feature vectors could be accounted for by 3 components (Figure 4.1). PCA uses an orthogonal transformation to convert a set of variables, or features, into a set of values of linearly uncorrelated variables, or principal components (Pearson, 1901). Reducing a feature set to a minimum number of principal components simplifies instance clustering, which is a separation of groups of instances by variations in the principal components (Witten et al, 2005). If a low percentage of the variation is accounted for by the principal components, clustering the instances is unlikely to define the data classes within the clusters with high accuracy.

If common dyskinesia features occurred between participants, dyskinesia clusters would have been visible in the PCA plot (Figure 4.3). Because dyskinesia instances did not form an observable cluster apart from non-dyskinesia instances, and dyskinetic instances did not form a clear cluster across participants, we concluded that there may have been other sources of variation in the feature set we designed to distinguish dyskinesia from non-dyskinesia.

Adding limited instances from the left-out individual increased F-measure for each dyskinetic participant. We concluded that each individual participant’s presentation of dyskinesia resulted in feature vectors that differed from those of the sample population, and were therefore unique to that participant. Adding dyskinesia instances from that individual to the population expanded the training data to include dyskinesia features indicative of the individual left out, which resulted in an MLA model that better represented the characteristic features of the participant left out, and therefore classified instances as dyskinesia or non-dyskinesia with greater F-measure. Because these results suggest that dyskinesia signs varied in some way between participants, we analyzed which variations in dyskinesia signs may have decreased classification accuracy, and tested our hypotheses.

4.3.2 Hypothesis 2: Differences in body locations and severity of dyskinesia

Hypothesis Statement

Participants affected by dyskinesia on different locations of their body will have different classification accuracies, while participants affected by dyskinesia on the same body location will have similar classification accuracies. If differences in overall dyskinesia severity affected classification accuracy, then a relationship exists between severity and accuracy.

Methods

We hypothesized that leave one out validation had poor accuracy due to differences in the location of dyskinesia signs on the body. This case would have resulted in feature vectors containing features indicative of dyskinesia at different locations within the feature vector. Perhaps our MLAs were unable to distinguish which features indicated dyskinesia, if the same features did not represent dyskinesia across all participants.

To test this hypothesis, we compiled a table relating the dyskinesia sign locations we observed in each participant to the dyskinesia classification F-measure for each MLA with leave one participant out validation. We calculated F-measure for the 3 algorithms that yielded the best classification accuracy: WEKA’s Decision Tree (J48), Multilayer Perceptron (MLP), and Support Vector Machine (SVM). WEKA implemented a sequential minimal optimization (SMO) algorithm for training the support vector classifier. To test if dyskinesia severity had an effect on classification F-measure, we tabulated the AIMS dyskinesia score, which ranks the severity of the most severe dyskinesia observed, against F-measure.

Results

The dyskinesia signs we observed were: legs dancing, arms twisting, foot slowly lifting or curling, foot rolls, torso twisting, and head rolling (Table 4.1). To investigate if there was a relationship between classification ability and number of dyskinesia locations on the body, we plotted F-measure against number of dyskinesia body locations (Figure 4.5) and F-measure against AIMS rating (Figure 4.6).

Participant #        4       6       10      13      15
Symptom count        5       3       3       6       2
J48 F-measure        100%    23%     33%     15%     71%
MLP F-measure        96%     45%     86%     11%     71%
SVM F-measure        73%     19%     76%     7%      72%
AIMS score           14      9       10      17      5

Dyskinesia signs observed (number of participants affected): legs dancing (2), arms twisting (4), foot slowly lifting/curling (5), foot rolling (4), torso twisting (3), head rolling (1).

Table 4.1: Dyskinesia Signs. Body locations of dyskinesia signs, symptom count, per-algorithm F-measure, and AIMS score are tabulated by participant number.


Figure 4.5: F-measure vs. Symptom Count. F-measure is shown for number of dyskinesia body locations, for the J48, MLP, and SVM algorithms.


Figure 4.6: F-measure vs. AIMS. F-measure is shown for different AIMS dyskinesia severities, for the J48, MLP, and SVM algorithms.

Discussion

F-measure did not appear to be related to the number of dyskinesia signs we visually observed, so we concluded that changes in the body location of dyskinesia signs did not have an effect on MLA classification ability.

Figure 4.6 did not show a relationship between F-measure and AIMS score for either the J48 or MLP algorithm, so we concluded that those MLAs did not identify dyskinesia instances with less accuracy for participants with mild dyskinesia. The SVM algorithm did indicate a decreasing F-measure with increasing AIMS score, so that algorithm may not have classified as well for participants with more severe dyskinesia.

Other studies did not report the locations on the body at which their participants experienced dyskinesia. Some studies found that the classification ability of dyskinesia severity did not vary much for different levels of dyskinesia severity, while another study found that classification ability changed with severity. One study reported sensitivities of 93.9, 91.9, and 95.0% and specificities of 95.5, 94.6, and 98.6% for mild, moderate, and severe dyskinesia detection, respectively, as defined by UPDRS severity levels one through three (Cole et al, 2014). Another study reported global error rates of 5.3, 6.8, and 3.2% for mild, moderate, and severe dyskinesia detection, respectively, as defined by UPDRS severity levels one through three (Roy et al, 2013). Neither study reported which validation method they used. Another study found that sensitivity to dyskinesia severity classification initially decreased, then increased with increasing dyskinesia severity (93.37, 66.22, 73.37, and 88.95% sensitivity for UPDRS item 33 dyskinesia severity scores of 0, 1, 2, and 3, respectively) using a C4.5 MLA and leave one participant out stratified cross validation (Tsipouras et al, 2012). The Tsipouras study indicates that mild dyskinesias may be difficult for certain MLAs to detect.


4.3.3 Hypothesis 3: Fluctuations

Hypothesis Statement

If dyskinesia fluctuations in severity throughout a dyskinesia period affected classification accuracy, such fluctuations can be quantified from the feature set and correlated to the classification accuracy.

Methods

While we only recorded the presence or absence of dyskinesia over several minutes as ground truth for our class labels, we actually observed seconds to minutes of time within periods we labeled as “dyskinetic” where dyskinesia was severe, mild, moderate, or non-observable.

We hypothesized that fluctuations in dyskinesia during an observed session reduced MLA classification accuracy. To test this hypothesis, we explored how MLA F-measure was related to variations within specific features. We calculated F-measure for the MLP algorithm, which yielded the best classification accuracy in chapter 3. To characterize how well MLP classified instances of dyskinesia for different amounts of fluctuation, we plotted F-measure against variations within two features from our feature set. Results for the J48 and SVM classifiers were also calculated (Appendix B).

The two features used to measure fluctuations were Energy_Low_High_Mean and Low_Frequency_Energy. The feature Energy_Low_High_Mean was identified in chapter 3 as the most important feature used in the detection of dyskinesia. It is the mean of the square root of the sum of squared acceleration energy in the 1-3.5 and 3.5-8 Hz bands, over a 15-minute moving time window that advances in 1-minute increments (Appendix A). Previous studies identified acceleration energy at low frequencies as useful in detecting dyskinesia, and acceleration energy at high frequencies as useful in detecting non-dyskinetic movements (Keijsers, 2006; Keijsers et al, 2003a; Keijsers et al, 2003b; Giuffrida, 2009). One study found the most important feature in detecting dyskinesia was the ratio of acceleration RMS energy above and below 3 Hz (Mera et al, 2011; Mera et al, 2012a; Mera et al, 2012b), while another study used 3.5 Hz (Keijsers, 2006).

The feature Low_Frequency_Energy is the amount of accelerometer signal energy in the frequency spectrum associated with dyskinesia. Previous studies identified the magnitude of acceleration energy in the low frequency range as useful in classifying instances of dyskinesia (Patel, 2010; Patel, 2011; Bor-Rong, 2011; Roy, 2011; Hoff, 2004). Low_Frequency_Energy was calculated as the sum of squared scalar accelerations in the 1-3.5 Hz frequency band, over a 60 s time window (Appendix A).
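A minimal sketch of how such band-energy features can be computed from windowed accelerometer data via the discrete Fourier transform. The exact windowing and normalization of our features are given in Appendix A; the function names here are ours, and non-overlapping windows are an assumption.

```python
import numpy as np

def band_energy(signal, fs, f_lo, f_hi):
    """Sum of squared spectral magnitude of `signal` within [f_lo, f_hi] Hz,
    a sketch of the per-window acceleration energy used by both features."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.sum(np.abs(spectrum[band]) ** 2)

def low_frequency_energy(accel, fs):
    """Low_Frequency_Energy per 60 s window: energy in the 1-3.5 Hz band."""
    window = int(60 * fs)
    return [band_energy(accel[i:i + window], fs, 1.0, 3.5)
            for i in range(0, len(accel) - window + 1, window)]
```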

Variations within features over the entire segment when dyskinesia occurred were characterized by histograms of the two features over the entire dyskinesia period. The feature histogram plots how many times feature instances occur within a range of feature values, called a bin. We expected that participants with more dyskinesia fluctuations would have variable peaks and valleys in their feature time-series data, which would be characterized by a non-normal distribution in a histogram plot. Participants with mild or little dyskinesia interspersed with periods of moderate or severe dyskinesia would have predominantly low energy levels with occasional spikes in energy during the more severe dyskinesia instances. We expected a distribution plot of energy instances would show a sharp peak that was off-center from a normal distribution, a trend which would increase for participants who fluctuated more in their dyskinesia.

Kurtosis is a measure of how peaked the feature’s histogram is relative to a normal distribution. Positive kurtosis indicates a distribution that is more peaked than a normal distribution, while negative kurtosis indicates a distribution that is flatter than a normal distribution. We anticipated that participants with more dyskinesia fluctuations would have a greater quantity of low-energy instances, which would generate a high peak in the distribution graph at the predominant low-energy values. Participants who fluctuated more in dyskinesia would therefore have a more positive kurtosis.

Skew, or skewness, is a measure of asymmetry in the feature’s histogram relative to a normal distribution. Negative skew indicates the left tail of the distribution is longer, while positive skew indicates the right tail of the distribution is longer. We anticipated that participants with more dyskinesia fluctuations would have a greater quantity of low-energy instances mixed with much higher energy values during more severely dyskinetic instances. This would generate a distribution graph that peaked off-center around the predominant low-energy values. Participants who fluctuated more in dyskinesia would therefore have a more positive skew.
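Both fluctuation measures can be computed directly from a feature’s time series. A minimal sketch using scipy, which, like the text above, treats a normal distribution as the zero reference for kurtosis (i.e., excess kurtosis):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def fluctuation_measures(feature_series):
    """Skew and excess kurtosis of a feature's distribution over a dyskinesia
    period; higher values are expected for fluctuating dyskinesia."""
    values = np.asarray(feature_series, dtype=float)
    # scipy's kurtosis defaults to Fisher's definition: normal distribution = 0.
    return skew(values), kurtosis(values)
```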

Results

When the feature Low_Frequency_Energy was plotted over time for participants with constant dyskinesia, no dyskinesia, and fluctuating dyskinesia as defined by the observer, we saw differences in magnitude and quantity of peaks. A consistently large quantity of energy was found along with multiple high peaks and valleys in the Low_Frequency_Energy graph associated with constant dyskinesia (Figure 4.7).


Figure 4.7: Constant Dyskinesia Feature Time-Series. This is a time-series plot of the right wrist feature, Low_Frequency_Energy, for participant 4, who had constant dyskinesia.

A graph with a lower quantity of energy and no large peaks was found for a participant who did not have dyskinesia (Figure 4.8).


Figure 4.8: No Dyskinesia Feature Time-Series. This is a time-series plot of the right wrist feature, Low_Frequency_Energy, for participant 1, who did not have dyskinesia.

The signal changed from large peaks to small infrequent peaks for a participant who transitioned from intense dyskinesia, to a period of low-intensity dyskinesia, to non-dyskinesia (Figure 4.9).


Figure 4.9: Fluctuating Dyskinesia Feature Time-Series. This is a time-series plot of the right wrist feature, Low_Frequency_Energy, for participant 13, who had fluctuating dyskinesia in the first 71 minutes of data collection.

We plotted histograms for a participant with high classification F-measure and consistent dyskinesia severity (Figure 4.10), and for a participant with low classification F-measure and fluctuating dyskinesia severity (Figure 4.11). Kurtosis and skew are calculated from these histograms.



Figure 4.10: Histograms, Constant Dyskinesia. These are participant 4 histograms for the right wrist features (A) Energy_Low_High_Mean and (B) Low_Frequency_Energy. Participant 4 had constant dyskinesia.


Figure 4.11: Histograms, Fluctuating Dyskinesia. These are participant 13 histograms for the right wrist features (A) Energy_Low_High_Mean and (B) Low_Frequency_Energy. Participant 13 had fluctuating dyskinesia.

We investigated the relationship between classification ability and skew or kurtosis by plotting F-measure against skew and kurtosis for each dyskinetic participant (Figures 4.12-4.13).


Figure 4.12: F-measure vs. Skew. These plots show five dyskinetic participants’ MLP F-measure vs. skew for the features (A) Energy_Low_High_Mean and (B) Low_Frequency_Energy.


Figure 4.13: F-measure vs. Kurtosis. These plots show five dyskinetic participants’ MLP F-measure vs. kurtosis for the features (A) Energy_Low_High_Mean and (B) Low_Frequency_Energy.


Figures 4.12-4.13 show greater F-measures for lower values of skew and kurtosis for both features. For the feature Energy_Low_High_Mean (Figure 4.13A), F-measure did not show as clear a relationship to kurtosis. In this graph, kurtosis was close to -1.0 for all participants, with the exception of participant 13, whose kurtosis was about +4.0. The graph of F-measure vs. kurtosis for the feature Low_Frequency_Energy (Figure 4.13B) shows the same difference between participant 13 and the other participants, but there is greater variation among the other participants, characterizing a decreasing trend of F-measure with increasing kurtosis.

Discussion

We quantified dyskinesia fluctuations as kurtosis and skew within the features Energy_Low_High_Mean and Low_Frequency_Energy because we found that fluctuating dyskinetic participants had higher skew and kurtosis than non-fluctuating dyskinetic participants. We found that classification ability, quantified as F-measure, decreased with increasing kurtosis and increasing skew. From visual inspection of the plots, F-measure presented a more defined relationship to the feature Low_Frequency_Energy than to the feature Energy_Low_High_Mean. From these results, we concluded that a more fluctuating feature set, characterized by periods of low dyskinesia level interspersed with periods of higher dyskinesia level, and therefore higher skew and higher kurtosis, resulted in a lower MLP classification ability. Conversely, more consistent feature sets containing dyskinesia that did not fluctuate, and therefore lower skew and lower kurtosis, resulted in higher dyskinesia classification ability.

Fluctuations within dyskinesia are a common occurrence (Keijsers et al, 2003a; Keijsers et al, 2003b). To account for dyskinesia fluctuations, recent studies have incorporated dynamic classifiers which consider time-weighted values of acceleration energy over a moving time window of instances (Cole, 2014). We used static classifiers in our study, which did not account for incremental dyskinesia fluctuations. However, we considered whether dyskinesia fluctuations could be identified within the feature set. We could add features that quantify dyskinesia fluctuations to the feature set to improve classification accuracy during dyskinesia fluctuations.

Our observation method identified time periods in which dyskinesias were present.

During our observations, we considered whether dyskinesia occurred at all within a 15 minute set of one-minute instances and labeled a time period as dyskinetic regardless of whether we observed constant dyskinesia or some dyskinesia interspersed with seconds to minutes of little or no visible dyskinesia. If we were not able to observe dyskinesia for small time periods within a dyskinesia period, it is reasonable that MLAs would not classify instances within those small time windows as dyskinesia. If we could replicate that decision by generating a feature that detects dyskinesia fluctuation over a set of instances, then the MLA would likely classify instances as dyskinesia with greater accuracy. A recent study reports using dynamic neural networks, which include time-weighted values of instances before and after the instance being classified, to exploit the time-varying nature of dyskinesia in order to identify dyskinesia presence and severity with an error rate of less than 10% when compared to clinician-annotated video of sensor recordings (Cole et al, 2010; Roy et al, 2011; Roy et al, 2013; Cole et al, 2014). They reported that surface electromyographic (EMG) signals caused by dyskinetic movements varied in amplitude, duration, and time between bursts, which suggests that dyskinesia was fluctuating. However, they only used accelerometer-based features as neural network inputs because EMG signals corresponding to dyskinesia were difficult to discern from EMG signals corresponding to voluntary movements. Instead of using 1 minute instance windows, they had independent clinicians rate each 1 second instance for dyskinesia severity.
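One way to realize the fluctuation feature proposed above is sketched below: the standard deviation of a per-minute feature over a 15-instance moving window, in the spirit of the "Energy Std Dev High/Low Freq" feature in Table A.1. The window length and the choice of standard deviation as the fluctuation measure are illustrative assumptions.

```python
# A hedged sketch of a feature that quantifies dyskinesia fluctuation across a
# set of one-minute instances, using a centered 15-instance moving window.
import numpy as np

def rolling_fluctuation(feature_values, window=15):
    """Standard deviation of a per-minute feature over a centered moving window."""
    x = np.asarray(feature_values, dtype=float)
    half = window // 2
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)  # clip at the edges
        out[i] = np.std(x[lo:hi])
    return out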

It is important to note that the Cole et al study discarded instances in which clinicians did not agree unanimously on the severity rating (Cole et al, 2014). While removing an unspecified number of instances where dyskinesia severity was unclear might lead to the high classification accuracies they report (error rate less than 10%), it may also have removed instances in which dyskinesia was fluctuating. Therefore, it is unknown how well their algorithm classifies instances as dyskinesia when dyskinesia is fluctuating.

4.3.4 Hypothesis 4: Transitions into or out of dyskinesia

Hypothesis Statement

If transitions into or out of dyskinesia affect classification accuracy, instances surrounding a transition will contain the majority of misclassifications for transitioning participants, and those participants will have a lower classification sensitivity than non-transitioning dyskinetic participants.

Methods

We considered whether dyskinesia fluctuations during a transition period caused the majority of dyskinesia misclassifications. To test this hypothesis, we plotted a time series of correct and incorrect classifications for the dyskinetic participant with the lowest J48 10-fold cross validation classification accuracy, and calculated the percentage of misclassifications that occurred near the transition point. We used J48 because it classified with similar accuracy (96%) to MLP (91%) with 10-fold cross validation, and provided a basis for comparing results from two different classifiers. If the majority of misclassifications occurred near the transition point, then the classifier may not be able to distinguish when a transition occurs, and the transition may have affected classification accuracy. However, if the majority of misclassifications did not occur around the transition point, then the algorithm shows the ability to distinguish when a transition occurs, and the transition may not have affected classification accuracy. Recent dyskinesia studies used 15 minute windows when classifying dyskinesia occurrence (Keijsers et al, 2003a; Keijsers et al, 2003b; Keijsers et al, 2006). To determine if the majority of misclassifications occurred near a transition point, we calculated the percentage of misclassified instances that occurred within 15 minutes before or after a transition.

Second, we compared the sensitivity and specificity of all dyskinetic participants for MLP, the algorithm that had the highest F-measure in Chapter 3, for leave one participant out validation. If the classifier shows low sensitivity and high specificity, or high sensitivity and low specificity, for transitioning participants, then the algorithm is not distinguishing between dyskinesia and non-dyskinesia instances for those participants, and the transition may be affecting classification accuracy.
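The per-participant sensitivity and specificity can be computed from confusion-matrix counts as sketched below; the implementation is illustrative, with dyskinesia treated as the positive class.

```python
# Sensitivity = true positive rate; specificity = true negative rate.
# Either is undefined when a participant has no instances of that class.
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn) if (tp + fn) else float('nan')  # undefined with no positives
    spec = tn / (tn + fp) if (tn + fp) else float('nan')  # undefined with no negatives
    return sens, spec
```

The undefined case matters here: participants who were dyskinetic for the entire session contribute no non-dyskinesia instances, so specificity cannot be computed for them.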

Results

We plotted correct and incorrect classifications for the participant with the lowest dyskinesia 10-fold cross validation classification accuracy (Figure 4.14). A number of misclassifications occurred near the transition point for participant 13 (Figure 4.14). We calculated that 39, 57, and 56% of misclassifications occurred within 15 minutes of the transition point (the "transition window"), and 61, 43, and 44% of misclassifications occurred outside the transition window, for transitioning participants 6, 13, and 15, respectively. On average, 51% of the misclassifications occurred within 15 minutes of the transition and 49% occurred outside the transition window for the 3 participants who transitioned into or out of dyskinesia.

[Figure 4.14 image: time series of correct and incorrect classifications vs. instance index (1046-1126); the vertical axis is dyskinesia state (1 = dyskinesia, 0 = no dyskinesia); a vertical line marks the transition.]

Figure 4.14: Worst Case Misclassifications. This is a plot of the J48 10-fold cross validation instance misclassifications for participant 13, who had the greatest number of misclassifications. A misclassification is an instance the MLA did not classify as the same class that the observer recorded. Each instance is a 1 minute window. The vertical line indicates the transition time as defined by the observer.

We saw that participants had much higher accuracies for 10-fold cross validation than for leave one participant out validation. Both the specificity and sensitivity are different for each validation type. We plotted sensitivity and specificity for MLP leave one participant out validation (Figure 4.15).

[Figure 4.15 image: (A) bar chart of specificity and sensitivity (0-100% correct rate) for dyskinetic participants 4, 6, 10, 13, and 15; (B) bar chart of percent of recording time with observed dyskinesia (0-100%) for the same participants.]

Figure 4.15: Dyskinetic MLP Sensitivity and Specificity, and Percent Dyskinesia. These are plots of (A) the MLP leave one participant out sensitivity and specificity for dyskinetic participants and (B) the percent of recording time during which dyskinetic participants were observed to have dyskinesia.

We found high sensitivity for the participants (4 and 10) who did not transition into or out of dyskinesia. Specificity was undefined for these participants because there were no non-dyskinesia instances to identify. All transitioning participants (6, 13, and 15) had high specificity but low or no sensitivity. The algorithm was able to identify non-dyskinesia for transitioning participants, but had problems identifying dyskinesia.

Discussion

For leave one participant out validation, participants 13 and 15 both displayed good detection of non-dyskinesia but poor detection of dyskinesia, as measured by the sensitivity metric. Participant 13 had a leave one participant out MLP accuracy of 22%, even though dyskinesia was present in 80% of the data set for that participant. Likewise, participant 15 had an overall MLP accuracy of 79%, although dyskinesia was only present in 17% of the data set, and participant 6 had a MLP accuracy of 45% even though dyskinesia was only present in 39% of the data set. The statistics for participants 13 and 15 agree with the sensitivity and specificity rates for these participants; in both cases, the MLP algorithm classified most of the instances as non-dyskinesia even when dyskinesia was present. These participants had a high specificity (detecting non-dyskinesia correctly) but almost no sensitivity (correctly identifying dyskinesia).

Therefore, we determined that the algorithm was not detecting dyskinesia for transitioning participants. We concluded that a transition into or out of dyskinesia was likely a significant contributor to leave one participant out misclassifications.

4.4 Conclusions

We proposed and investigated several hypotheses as to why our classification models did not generalize with high accuracy to participants whose instances were not included in the training set. Because we only had 5 participants in our study who presented with dyskinesia, we did not have enough data to reach significance through statistical tests. Instead, we examined plots to determine trends in dyskinesia classification ability from variations in our data. From these investigations, we saw that MLP classification F-measure seemed to be lowered by fluctuations in dyskinesia severity within a dyskinesia period and by transitions into or out of a dyskinesia period, but not by dyskinesia location on the body or by maximum dyskinesia severity within the dyskinesia period.

The Keijsers studies (Keijsers et al, 2003a; Keijsers et al, 2003b) reported observing dyskinesia fluctuations within their participant population, which varied in AIMS severity from 0-3 in some participants. In order to induce dyskinesia, participants who did not display dyskinesia halfway into the 2.5 hour monitoring session were purposefully given extra levodopa until dyskinesia occurred. One of the features they used as an input to their classifier cross-correlated acceleration signals between accelerometers placed at different body locations. Signals between limbs that did not co-vary were indicative of dyskinetic movements. Instances were classified by dyskinesia severity corresponding to a four-point AIMS scale of 0-3 as well as by dyskinesia presence. Two experienced physicians rated video recordings of participants for AIMS severity at 1 minute increments. They trained a different neural network for each body location. Neural network performance for classification of dyskinesia presence was calculated as a set of 80/20 split tests, in which 80% of the data was randomly selected and used as a training set and the remaining 20% was used as a test set. They iterated the 80/20 split test 50 times to obtain classification accuracy. This validation method is similar in nature to the 10-fold cross validation we performed on the J48 classifier, in which we also obtained high classification accuracies. They tested how well their classifier generalized dyskinesia severity across participants with leave one participant out validation, which we also performed for detection of dyskinesia presence, with lower classification accuracies.
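The distinction between the two validation schemes is sketched below; the classifier choice, feature matrix, and participant labels are illustrative assumptions, not the implementations used in either study.

```python
# Repeated random 80/20 splits mix every participant's instances into training
# and testing; leave-one-participant-out holds an entire participant back,
# which is the harder, more clinically relevant generalization test.
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier

def repeated_split_accuracy(X, y, n_iter=50, seed=0):
    accs = []
    for i in range(n_iter):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=seed + i)
        accs.append(MLPClassifier(max_iter=500).fit(Xtr, ytr).score(Xte, yte))
    return np.mean(accs)

def leave_one_participant_out_accuracy(X, y, participant_ids):
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=participant_ids):
        accs.append(MLPClassifier(max_iter=500).fit(X[tr], y[tr]).score(X[te], y[te]))
    return np.mean(accs)
```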

The Keijsers studies reported high classification accuracy (79.4, 90.4, and 82.6% for arm, trunk, and leg, respectively) for detection of dyskinesia presence for participants who transitioned into dyskinesia, with 80/20 data split and 50-fold cross validation (Keijsers et al, 2003a). Furthermore, they reported a mean square error of 0.14 for dyskinesia severity classification using 1 minute windows, and 97.5% dyskinesia severity classification accuracy using 15 minute windows and leave one participant out validation (Keijsers et al, 2003a; Keijsers et al, 2003b).

While participants did transition into dyskinesia during these studies, the recording sessions only occurred while the participants performed scripted ADL, and were therefore not continuous. Although participants were "continuously monitored," video for severity rating by clinicians was only recorded while participants performed scripted ADL during the sessions. Accelerometer data recorded during times that were not recorded by video did not have clinician-observed ground truth dyskinesia severity or presence ratings, and could therefore not be used in algorithm training or validation. Because dyskinesia was purposefully induced and results were only reported for dyskinetic participants, it is likely that participants did not transition into dyskinesia during scripted ADL activities, and dyskinesia monitoring was likely performed only after the transition into dyskinesia was complete. Because participants performed scripted activities, they may not have presented dyskinesia as naturally as our participants, who could obscure their dyskinesia by bracing their arms or legs against a chair or the floor if they became aware of their symptoms. Had the participants in the Keijsers study been continuously monitored as ours were, including times they transitioned into dyskinesia and times they were not performing scripted activities, the classification accuracy in their study (99.5% for neural network dyskinesia presence detection and 15 minute windows) may have been closer to the lower accuracy (86% with MLP and leave one participant out validation) we found in our study.

The Keijsers study also found lower classification accuracy (82.4%) for 1 minute windows than for 15 minute windows (99.5%). Our 1 minute window accuracy for all participants (86%) exceeded their accuracy; however, our 1 minute window accuracy for dyskinesia participants only (62%) was less than theirs. Our dataset was much smaller than theirs; we only had 308 minutes of dyskinesia monitoring. Had we divided our dyskinesia data into 15 minute instance windows, we would have only had 20 instances of dyskinesia for all participants, which was not enough data to train and test a statistically significant generalization of an MLA. The Keijsers study amounted to approximately 1950 minutes of monitoring data for 13 participants at 2.5 hours each, which would provide 130 instances of 15 minute windows if all monitoring data contained dyskinesia. This would be enough data to calculate a statistically significant generalization ability of an MLA, although they did not disclose how much of the monitoring session time they actually used for dyskinesia identification.
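The window arithmetic behind these instance counts, using the figures quoted above, can be made explicit:
\[
\frac{308\ \text{min of dyskinesia data}}{15\ \text{min/window}} \approx 20\ \text{windows},
\qquad
\frac{13 \times 2.5\ \text{h} \times 60\ \text{min/h}}{15\ \text{min/window}} = \frac{1950\ \text{min}}{15\ \text{min/window}} = 130\ \text{windows}.
\]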

A recent study reports using dynamic neural networks, which include time-weighted values of instances before and after the instance being classified, to exploit the time-varying nature of dyskinesia in order to identify dyskinesia presence and severity with an error rate of less than 10%, using an unspecified validation method, when compared to clinician-annotated video of sensor recordings (Cole et al, 2010; Roy et al, 2011; Roy et al, 2013; Cole et al, 2014). Like our study, this study was conducted with participants performing unscripted ADL in a home environment. Algorithm validation was performed on individual participants who were not included in the algorithm training set, which we also performed in our leave one participant out validation. This study differed from ours in that they used 1 second instance windows rather than 1 minute instance windows, they identified dyskinesia severity as well as occurrence, they used dynamic neural networks and hidden Markov models instead of a static MLP, they trained their algorithms on one sensor's features, they calculated only 5 features, and they removed any instances from the training or validation sets in which clinicians could not agree on dyskinesia severity. They also input one feature, the height of the first peak in the autocorrelation of an accelerometer epoch, into a Bayesian maximum likelihood classifier. They reported sensitivity and specificity that varied from 91.9% for moderate dyskinesia to 98.6% for severe dyskinesia. They did not specify whether they used leave one participant out validation or another validation method, so their results may not be an equivalent comparison to ours.

Instead of using 1 minute instance windows, the Cole study had independent clinicians rate each 1 second instance for dyskinesia severity. They then discarded instances in which clinicians did not agree unanimously on the severity rating. While removing an unspecified number of instances where dyskinesia severity was unclear might lead to the high classification accuracies they report (error rate less than 10%), it is unknown how well their algorithm would classify instances as dyskinesia during continuous dyskinesia monitoring, in which instances that might be problematic to an MLA cannot be removed from the data.

Because neither the Cole nor Keijsers studies included all the instances in their data sets, nor did Keijsers use unscripted ADL, both studies may have obtained higher accuracies than would be obtained in a fully continuous system. In contrast, our study trained and tested on continuous data while our participants performed voluntary and elective ADL. While our generalization ability was lower than that of the Cole or Keijsers studies, our system may represent a more realistic classification ability for a fully continuous, clinically relevant, in-home system.

Acknowledgements

Thanks to Jamie Mark, ARNP, Dr. Jonathan Carlson, MD, Ph.D., and Dr. David Greeley, MD, FAAN, for guidance in clinical perspectives, and to Dr. Narayanan "CK" Chatapuram Krishnan for insight into machine learning applications. This work was supported by NSF under Grant No. DGE-0900781.

BIBLIOGRAPHY

Bonato P, Sherrill D, Standaert D, Salles S, Akay M. Data mining techniques to detect motor fluctuations in Parkinson's disease. Proceedings of the 26th Annual International Conference of the IEEE EMBS, 2004: pp 4766-4769.

Chen, B.-R., et al., A Web-Based System for Home Monitoring of Patients With Parkinson's Disease Using Wearable Sensors. IEEE Transactions on Biomedical Engineering, 2011. 58(3): p. 831-836.

Bouckaert R, Frank E, Hall M, Kirkby R, Reutemann P, Seewald A, Scuse D. WEKA Manual for Version 3-6-3; University of Waikato; Hamilton, New Zealand; July 27, 2010.

Burkhard, P.R., et al., Quantification of dyskinesia in Parkinson's disease: Validation of a novel instrumental method. Movement Disorders, 1999. 14(5): p. 754-763.

Cole, B.T. et al., Dynamic neural network detection of tremor and dyskinesia from wearable sensor data. Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, 2010: p. 6062-6065.

Cole, B.T. et al., Dynamical Learning and Tracking of Tremor and Dyskinesia from Wearable Sensors. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2014. 22(5): p.982-91.

Giuffrida, J.P., et al., Clinically deployable Kinesia technology for automated tremor assessment. Movement Disorders, 2009. 24(5): p. 723-730.

Hoff, J.I., V. van der Meer, and J.J. van Hilten, Accuracy of objective ambulatory accelerometry in detecting motor complications in patients with Parkinson disease. Clin Neuropharmacol, 2004. 27(2): p. 53-7.

Keijsers, N.L., M.W. Horstink, and S.C. Gielen, Ambulatory motor assessment in Parkinson's disease. Movement Disorders, 2006. 21(1): p. 34-44.

Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Automatic assessment of levodopa- induced dyskinesias in daily life by neural networks. Movement Disorders, 2003. 18(1): p. 70-80.

Keijsers, N.L.W., et. al., Movement parameters that distinguish between voluntary movements and levodopa-induced dyskinesia in Parkinson’s disease. Human Movement Science, 2003. 22(1): p. 67-89.

Mera, T.O., et al., Kinematic optimization of deep brain stimulation across multiple motor symptoms in Parkinson's disease. J Neurosci Methods, 2011. 198(2): p. 280-6.

Mera, T.O., et al., Feasibility of home-based automated Parkinson's disease motor assessment. J Neurosci Methods, 2012. 203(1): p. 152-6.

Mera, T.O., M.A. Burack, and J.P. Giuffrida. Quantitative assessment of levodopa-induced dyskinesia using automated motion sensing technology. in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. 2012.

Olanow, C.W., et al., An algorithm (decision tree) for the management of Parkinson's disease (2001): Treatment Guidelines. Neurology, 2001. 56(11) Supplement(5): p. S1-S88.

Patel, S., et al. Home monitoring of patients with Parkinson's disease via wearable technology and a web-based application. Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE. 2010.

Patel, S., et al., Longitudinal monitoring of patients with Parkinson's disease via wearable sensor technology in the home setting. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), 2011: p. 1552-1555.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 1901. 2: p. 559-572

Roy, S. H., et al. High-resolution tracking of motor disorders in Parkinson’s disease during unconstrained activity. Movement Disorders, 2013. 28(8), p. 1080-1087.

Roy, S. H., et al. Resolving signal complexities for ambulatory monitoring of motor function in Parkinson's disease. 33rd Annual International Conference of the IEEE EMBS, 2011: p. 4836-4839.

Rush, A. J. Handbook of Psychiatric Measures. Washington, DC: American Psychiatric Association, 2000.

Witten, Ian H., and Frank, Eibe. Data Mining : Practical Machine Learning Tools and Techniques. Burlington, MA, USA: Morgan Kaufmann, 2005.

5. CONCLUSIONS

5.1 Study Limitations

Our study was limited by a small quantity of dyskinesia data, a small number of participants who had dyskinesia during the recording session, a lack of constant and severe dyskinesia, a lack of resolution in dyskinesia observations, and many participants who had deep brain stimulation (DBS) devices.

Of our 19 study participants, only 5 had some dyskinesia during the recording sessions. Other participants reported that they experienced dyskinesia on a daily basis, but they did not show signs of dyskinesia during the recording and observation period. Of the participants who displayed signs of dyskinesia, only 2 showed signs of dyskinesia throughout the recording session. The other 3 participants transitioned once into or out of dyskinesia during the recording session. Participant 15 in particular only showed dyskinesia for 17 minutes, or 19% of the observation time. Participants 13 and 6 displayed signs of dyskinesia for 71 minutes (80% of observation time) and 41 minutes (39% of observation time), respectively. Both participants 4 and 10 showed signs of dyskinesia for 100% of the observation time, or 108 and 71 minutes, respectively. Across all 5 participants, we had 308 minutes of dyskinesia data, which was a small quantity of training and test data compared to approximately 1950 minutes of total data from the Keijsers study and 4830 minutes of total data from the Cole study (Keijsers et al, 2003a; Keijsers et al, 2003b; Cole et al, 2014; Roy et al, 2013).

Because we only had 5 participants with dyskinesia, and because they each displayed dyskinesia in a unique way (Chapter 4), our machine learning algorithms (MLAs) may not have had enough training data to generalize across new participants. Training the algorithms on more data from more participants would capture both more of the dyskinesia variation across participants and more of the variation in dyskinesia fluctuations within individual participants. A broader training set may allow the algorithm to classify instances of dyskinesia with higher accuracy across individuals, as both Keijsers and Cole reported.

We observed many fluctuations in dyskinesia severity in our participants, which both Keijsers and Cole also observed in their participants (Keijsers et al, 2003a; Keijsers et al, 2003b; Cole et al, 2014; Roy et al, 2013). Cole reported that dyskinesia only occurred in their participants for a continuous duration of 62.6 seconds on average. However, both the Keijsers and Cole studies included participants with moderate to severe dyskinesias. Cole reported 5 mild, 5 moderate, and 4 severe participants, as defined by the Modified Abnormal Involuntary Movement Scale (m_AIMS). While we had 2 participants who displayed severe dyskinesia on the Abnormal Involuntary Movement Scale (AIMS), one of those participants displayed severe dyskinesia for only a few minutes and mild dyskinesia for most of the time, and eventually transitioned out of the dyskinetic state. The other 3 participants were mostly in the mild dyskinetic state, and occasionally showed a few seconds or minutes of moderate dyskinesia. Cole reported recording between 69 and 92 hours, or between 248,400 and 331,200 one-second instances, in their study of 23 subjects with 3-4 hours of monitoring per subject (Cole et al, 2014).

We only had 5 hours of dyskinesia monitoring, or 308 one-minute instances, plus non-dyskinesia monitoring. Our data set likely did not contain enough instances of different dyskinesia severity levels to properly train an MLA to classify all instances of dyskinesia.

We used live observations of participants to establish dyskinesia class labels for our training and validation sets. Because we only recorded dyskinesia presence, and used the criterion that fluctuations of little or no dyskinesia for periods up to 10 minutes did not indicate a transition out of the dyskinetic state, our observations contained a lower resolution than we may have needed to correctly train an algorithm on 1-minute instances. If the observer was not able to detect consistent dyskinesia for a few seconds up to a few minutes, instances within that time frame are not likely to be classified as dyskinesia with high accuracy if they are included in the test set, or they may confuse the algorithm if they are used in the training set. Cole avoided this problem by using an observation resolution of 1 second, and by discarding instances in which dyskinesia severity was unclear.

Many of the participants in our study had a DBS device. While DBS reduces tremor, it can induce dyskinesia if the voltage is too high (Fahn, 2008; Brown et al, 1999; Pollak et al, 2007). Participant 13, who displayed the most severe dyskinesia fluctuations, adjusted the DBS device on 2 occasions during the recording session in order to reduce dyskinesia. At the time of each adjustment, dyskinesia severity changed from severe to mild within a few seconds. Such a rapid change in dyskinesia is an extreme example, containing both high severity dyskinesia and mild dyskinesia within the same instance; therefore, instances during or near the time a DBS setting was changed may not have provided a good basis for training the algorithms.

5.2 Future Directions

Future directions for this study would begin with obtaining a dataset in which dyskinesia observations were made at both higher severity resolution and higher time resolution. Because dyskinesia fluctuations within a dyskinesia period reduced classification F-measure, incorporating more ground truth labels of dyskinesia per minute would yield a clearer training set on which to train algorithms for the classification of dyskinesia. Incorporating longer recording sessions per participant would provide more instances containing different levels of dyskinesia severity, and would also provide more instances of both dyskinesia and non-dyskinesia.

Likewise, including more dyskinetic participants within the study would allow a more thorough statistical testing of our hypotheses proposing which factors contributed to low classification F-measures. Adding more participants to the study would provide a better representation of the dyskinetic Parkinson's disease population. This would provide more data we could use to further quantify how dyskinesia feature vectors vary between participants. If the population could be better characterized in the training set data, then MLA models could be trained that would have higher F-measure for new participants during a leave one participant out validation.

BIBLIOGRAPHY

Brown, R.G., et al., Impact of deep brain stimulation on upper limb akinesia in Parkinson's disease. Annals of Neurology, 1999. 45(4): p. 473-488.

Cole, B.T. et al., Dynamical Learning and Tracking of Tremor and Dyskinesia from Wearable Sensors. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2014. 22(5): p.982-91.

Fahn, S., How do you treat motor complications in Parkinson's disease: Medicine, surgery, or both? Annals of Neurology, 2008. 64(S2): p. S56-S64.

Keijsers, N.L.W., et. al., Movement parameters that distinguish between voluntary movements and levodopa-induced dyskinesia in Parkinson’s disease. Human Movement Science, 2003. 22(1): p. 67-89.

Keijsers, N.L.W., M.W.I.M. Horstink, and S.C.A.M. Gielen, Automatic assessment of levodopa- induced dyskinesias in daily life by neural networks. Movement Disorders, 2003. 18(1): p. 70-80.

Pollak, P. and Krack, P., Deep-Brain Stimulation for Movement Disorders, in Jankovic, J. and Tolosa, E. (eds.), Parkinson's Disease and Movement Disorders. 2007, Lippincott Williams and Wilkins: Philadelphia. p. 653-691.

Roy, S. H., et al. High-resolution tracking of motor disorders in Parkinson’s disease during unconstrained activity. Movement Disorders, 2013. 28(8), p. 1080-1087.

APPENDIX A

We present a table of feature vectors we generated for this study (Table A.1).

Feature | Description | Reference
RMS Jerk* | Root mean square of the rate of acceleration change between 50 Hz samples in the 1-13 Hz frequency band, over a 60 s time window. | Keijsers 2003 (2 papers)
Mean Scalar* | Mean of the square root of the sum of squared tri-axial accelerometer readings in the 1-13 Hz frequency band, over a 60 s time window. | Keijsers 2003 (2 papers)
Energy Low* | Sum of squared scalar accelerations in the 1-3.5 Hz frequency band, over a 60 s time window. | Roy 2011, Hoff 2004
Energy PD* | Sum of squared scalar accelerations in the 5-8 Hz frequency band, over a 60 s time window. | Hoff 2004
Energy High* | Sum of squared scalar accelerations in the 3.5-8 Hz frequency band, over a 60 s time window. | Hoff 2004
RMS Scalar* | Root mean square of scalar accelerations in the 1-13 Hz frequency band, over a 60 s time window. | Roy 2011, Giuffrida 2009
Jerk % Above Threshold* | Percentage of RMS jerk above a preset threshold of 0.05 m/s^3 in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers)
Max Scalar Power* | Maximum of the discrete Fourier transform power of triaxial accelerations in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers), Giuffrida 2009
Mean Scalar Power* | Mean of the discrete Fourier transform power of triaxial accelerations in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers)
Dominant Frequency* | Frequency of the maximum discrete Fourier transform power of triaxial accelerations in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers), Giuffrida 2009
Dominant Frequency Jerk* | Frequency of the sum of maximum discrete Fourier transform power of triaxial jerks in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers)
Entropy in Time Domain* | The difference between zero and the sum times the log of the probability that a value of scalar acceleration is in a 10-bin histogram category, in the 1-13 Hz band, over a 60 s time window. | Patel 2010, Patel 2011
Entropy in Frequency Domain* | The difference between zero and the sum times the log of the probability that an acceleration power frequency value is in a 10-bin histogram category, in the 1-13 Hz band, over a 60 s time window. | Tsipouras 2012
Max Jerk Power* | Maximum of the sum of discrete Fourier transform power of triaxial jerk in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers)
Mean Jerk Power* | Mean of the sum of discrete Fourier transform power of triaxial jerk in the 1-13 Hz band, over a 60 s time window. | Keijsers 2003 (2 papers)
Energy Above Threshold* | Sum of squared scalar accelerations in the 5-8 Hz frequency band times 4, less those of the 1-3.5 and 3.5-8 Hz bands, over a 60 s time window. | Expanded from Keijsers 2006
Energy Ratio High/Low Freq* | Ratio of acceleration energy in the 1-3.5 Hz band over acceleration energy in the 3.5-8 Hz band, over a 60 s time window. | Expanded from Keijsers 2006 & 2003, Giuffrida 2009, Mera 2012
Energy Mag High/Low Freq* | Square root of the sum of squared acceleration energy in the 1-3.5 and 3.5-8 Hz bands, over a 60 s time window. | Expanded from Keijsers 2006 & 2003, Giuffrida 2009, Mera 2012
Energy Mean High/Low Freq* | Mean of "Energy Mag High/Low Freq" over a 15-minute moving time window that moves at 1 minute increments. | Expanded from Keijsers 2006 & 2003, Giuffrida 2009, Mera 2012
Energy Std Dev High/Low Freq* | Standard deviation of "Energy Mag High/Low Freq" over a 15-minute moving time window that moves at 1 minute increments. | Expanded from Keijsers 2006 & 2003, Giuffrida 2009, Mera 2012
Cross-correlation max** | The maximum correlation between two sensors' accelerations displaced by 100 samples, centered in each 60-second window, and normalized to the time duration of 200 samples. | Keijsers 2003 (2 papers), Patel 2010
Cross-correlation mean** | The mean correlation between two sensors' accelerations displaced by 100 samples, centered in each 60-second window, and normalized to the time duration of 200 samples. | Keijsers 2003 (2 papers), Patel 2010

Table A.1: Features. This is a table of features we calculated for each sensor location. *These features are calculated for each of 5 sensors. One sensor is located at each of the locations: left ankle, left wrist, right ankle, right hip, and right wrist. **Cross-correlations are between sensors; 5 sensors yield 20 cross-correlations.
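As an illustration of how the band-energy entries in Table A.1 can be computed, the sketch below band-pass filters a tri-axial 60 s window, forms the scalar magnitude, and sums the squares. The Butterworth filter and its order are assumptions; only the band edges, sampling rate, and window length come from the table.

```python
# A minimal sketch of a band-energy feature (e.g. Energy Low, Energy PD,
# Energy High) for one 60 s window of tri-axial accelerometer samples.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50  # sampling rate in Hz, per the RMS Jerk description in Table A.1

def band_energy(xyz, low_hz, high_hz, fs=FS, order=4):
    """Sum of squared scalar accelerations in [low_hz, high_hz] over the window."""
    b, a = butter(order, [low_hz / (fs / 2), high_hz / (fs / 2)], btype='band')
    filtered = filtfilt(b, a, xyz, axis=0)           # filter each axis
    scalar = np.sqrt(np.sum(filtered ** 2, axis=1))  # scalar magnitude per sample
    return np.sum(scalar ** 2)

# e.g. Energy Low (1-3.5 Hz) and Energy PD (5-8 Hz) for a synthetic window:
window = np.random.default_rng(1).normal(size=(60 * FS, 3))
print(band_energy(window, 1.0, 3.5), band_energy(window, 5.0, 8.0))
```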

Symbols:
n = number of samples in a participant's recording session
m = number of samples in an instance
WindowSamples = number of accelerometer readings within an instance window
JerkThreshold = 0.05 m/s^3
P_S = probability that a value of scalar acceleration is in an instance's 10-bin histogram category
P_J = probability that a value of scalar jerk is in an instance's 10-bin histogram category
b = number of bins in the histogram
FFT = Fast Fourier Transform function
N = number of window samples in 15 windows

Filtered x-axis acceleration, Full (1-15 Hz): xf

Filtered y-axis acceleration, Full (1-15 Hz): yf

Filtered z-axis acceleration, Full (1-15 Hz): zf

Filtered x-axis acceleration, Low (1-3.5 Hz): xflow

Filtered y-axis acceleration, Low (1-3.5 Hz): yflow

Filtered z-axis acceleration, Low (1-3.5 Hz): zflow

Filtered x-axis acceleration, PD (5-8 Hz): xfPD

Filtered y-axis acceleration, PD (5-8 Hz): yfPD

Filtered z-axis acceleration, PD (5-8 Hz): zfPD

Filtered x-axis acceleration, High (3.5-8 Hz): xfhigh

Filtered y-axis acceleration, High (3.5-8 Hz): yfhigh

Filtered z-axis acceleration, High (3.5-8 Hz): zfhigh

Acceleration Equations:

Filtered scalar (1-15 Hz):
$a_f = \sqrt{x_f^2 + y_f^2 + z_f^2}$  (A1)

Filtered scalar, Low (1-3.5 Hz):
$a_{flow} = \sqrt{x_{flow}^2 + y_{flow}^2 + z_{flow}^2}$  (A2)

Filtered scalar, PD (5-8 Hz):
$a_{fPD} = \sqrt{x_{fPD}^2 + y_{fPD}^2 + z_{fPD}^2}$  (A3)

Filtered scalar, High (3.5-8 Hz):
$a_{fhigh} = \sqrt{x_{fhigh}^2 + y_{fhigh}^2 + z_{fhigh}^2}$  (A4)

Jerk Equations:

Filtered x-axis jerk, Full (1-15 Hz):
$j_x[i] = \dfrac{x_f[i+1] - x_f[i]}{\Delta t}, \quad \Delta t = 1/50\ \mathrm{s}$  (A5)

Filtered y-axis jerk, Full (1-15 Hz):
$j_y[i] = \dfrac{y_f[i+1] - y_f[i]}{\Delta t}, \quad \Delta t = 1/50\ \mathrm{s}$  (A6)

Filtered z-axis jerk, Full (1-15 Hz):
$j_z[i] = \dfrac{z_f[i+1] - z_f[i]}{\Delta t}, \quad \Delta t = 1/50\ \mathrm{s}$  (A7)

Jerk scalar, Full (1-15 Hz):
$j = \sqrt{j_x^2 + j_y^2 + j_z^2}$  (A8)
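A short sketch of the jerk computations (A5)-(A8) follows, taking finite differences of the filtered axes at the 50 Hz sampling rate; the finite-difference form is an assumption consistent with the "rate of acceleration change between 50 Hz samples" description in Table A.1.

```python
# Jerk as the finite difference of each filtered acceleration axis between
# consecutive 50 Hz samples, followed by the scalar magnitude.
import numpy as np

FS = 50  # Hz

def jerk_scalar(xf, yf, zf, fs=FS):
    jx = np.diff(xf) * fs  # (A5): x-axis jerk
    jy = np.diff(yf) * fs  # (A6): y-axis jerk
    jz = np.diff(zf) * fs  # (A7): z-axis jerk
    return np.sqrt(jx**2 + jy**2 + jz**2)  # (A8): jerk scalar
```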

Feature Equations:

(A9)

(A10)

(A11)

(A12)

(A13)

(A14)

(A15)

(A16)

(A17)

(A18)

(A19)

(A20)

(A21)

(A22)

(A23)

(A24)

(A25)

(A26)

(A27)

(A28)

APPENDIX B

F-measure is plotted against quantified skew and kurtosis for three MLAs (Figures B.1-B.4).

[Figure B.1 image: scatter plot of F-measure (0.00-1.00) vs. skew (0.0-2.5) for J48, MLP, and SVM.]

Figure B.1: F-measure vs. Skew for Energy_Low_High_Mean. This is a graph of F-measure vs. skew for the feature of Energy_Low_High_Mean for the algorithms J48, MLP, and SVM, and five dyskinetic participants.

[Figure B.2 image: scatter plot of F-measure (0.00-1.00) vs. skew (0-7) for J48, MLP, and SVM.]

Figure B.2: F-measure vs. Skew for Low_Frequency_Energy. This is a graph of F-measure vs. skew for the feature of Low_Frequency_Energy for the algorithms J48, MLP, and SVM, and five dyskinetic participants.


[Figure B.3 image: scatter plot of F-measure (0.00-1.00) vs. kurtosis (-2 to 5) for J48, MLP, and SVM.]

Figure B.3: F-measure vs. Kurtosis for Energy_Low_High_Mean. This is a graph of F-measure vs. kurtosis of the feature Energy_Low_High_Mean for the algorithms J48, MLP, and SVM, and five dyskinetic participants.

[Figure B.4 image: scatter plot of F-measure (0.00-1.00) vs. kurtosis (-10 to 60) for J48, MLP, and SVM.]

Figure B.4: F-measure vs. Kurtosis for Low_Frequency_Energy. This is a graph of F-measure vs. kurtosis of the feature Low_Frequency_Energy for the algorithms J48, MLP, and SVM, and five dyskinetic participants.

Standard error (SE) bars in charts were calculated using the formula for SE.

$\mathrm{SE} = \sqrt{\dfrac{\sum_{s}\sum_{i} y_{is}^{2} - \dfrac{\left(\sum_{s}\sum_{i} y_{is}\right)^{2}}{n_y}}{n_y \left(n_y - 1\right)}}$  (B1)

i = point number in series s
m = number of series for point y in the chart
n = number of points in each series
y_is = data value of series s and the i-th point
n_y = total number of data values in all series
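Assuming (B1) takes the pooled form implied by these symbol definitions, a short sketch of the computation follows; the data layout is illustrative. Note that this form reduces to the sample standard deviation of all pooled values divided by the square root of n_y.

```python
# Pooled standard error over all n_y data values in a chart, per (B1) as
# reconstructed above; equivalent to np.std(y, ddof=1) / np.sqrt(n_y).
import numpy as np

def chart_standard_error(y):
    """y has shape (m_series, n_points); returns one SE value for the chart."""
    y = np.asarray(y, dtype=float).ravel()  # pool all n_y data values
    n_y = y.size
    return np.sqrt((np.sum(y**2) - np.sum(y)**2 / n_y) / (n_y * (n_y - 1)))
```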
