COVER FEATURE SMART HEALTH AND WELL-BEING

Machine Learning in Cardiac Health and Decision Support

Shurouq Hijazi and Alex Page, University of Rochester Burak Kantarci, University of Ottawa Tolga Soyata, SUNY Albany

Portable medical devices generate volumes of data that could be useful in identifying health risks. The proposed method filters patients’ electrocardiograms (ECGs) and applies machine-learning classifiers to identify cardiac health risks and estimate severity. The authors present the results of applying their method in a case study.

s personalized becomes increas- decision-support systems based on machine learning ingly more sophisticated and affordable, (ML) can ease the review burden by filtering noise, errors, portable medical devices have become ubiq- and irrelevant information so that the data reviewed uitous and monitoring applications have contains only relevant clinical markers. ML algorithms begunA to blend a range of functions. AliveCor, for exam- learn patterns within the data, which serve as the basis ple, offers an inexpensive smartphone electrocardio- for predictions about patient health. A machine can look gram (ECG) attachment that can sample an individual’s through millions of reports and medical records to iden- ECG, calculate real-time statistics, and share the record- tify previously unknown drug interactions.1 Such algo- ing with a . rithms can significantly improve diagnostic accuracy, In aggregate, portable medical devices generate data healthcare quality, and patients’ quality of life. at a much higher rate than conventional systems, which ML algorithms have many potential applications in can overwhelm medical personnel who must review smart health. To explore one of these, we developed a reports for many patients. However, the data also pres- method that filters data from long-term ECG recordings ents scientists and engineers with the opportunity to of patients with Long QT Syndrome (LQTS), uses ML to create health-monitoring and decision-support systems identify circadian patterns that signal risk of symp- that enhance and personalize healthcare. For example, toms such as cardiac arrhythmia, and estimates the

38 COMPUTER PUBLISHED BY THE IEEE COMPUTER SOCIETY 0018-9162/16/$33.00 © 2016 IEEE Acquired data Patient Physician EHRs

Delineation

Q, R, S, T (n leads) QTc > 470 ms ST elevated Clinical severity of that risk. LQTS is a disor- markers Database Alerts and der that primarily affects ion chan- queries thresholding QT, RR, QTc, nels in muscle cells, allowing ST amplitude abnormal electrical activity to occur (n leads) Syncope 87% that can lead to sudden and dangerous Database TdP 54% Merge Seizure 22% arrhythmias. We tested our method (optional) using four classification methods AMI 4% QT, RR, QTc, against a database of 434 24-hour ECG Filtering Classi cation Visualization ST amplitude (optional) and ML recordings. We also explored how our (1 lead) method might scale with the volume Decision support of medical data. Scalability is rapidly becoming critical in medical studies. FIGURE 1. Conceptual workflow of a remote health-monitoring and decision-support Analyzing massive amounts of data system. Data from a patient at a remote location is acquired, preprocessed, and used to not only promotes a deeper under- provide decision support in several forms. Visualization provides simple summaries to standing of the mechanisms that the physician or other clinician without any recommendations. Alerts are triggered by cause diseases but also makes per- more urgent events, such as the violation of an established threshold. Classification of sonalized treatment possible. Such the patient’s condition is based on the results of machine learning (ML), which involves analyses can lead to breakthroughs comparing the patient to existing electronic health records (EHRs). AMI: acute myocardial in relating genes to diseases as well as infarction; Q, R, S, T: waves that indicate cardiac electrical state (on an electrocardiogram providing the basis for treatment ori- [ECG]); QT, RR, QTc, and ST: intervals in the cardiac electrical cycle, also measured on an ented to a particular patient’s lifestyle ECG; TdP: torsades de pointes, a cardiac arrhythmia. and genetic makeup.

AN ML-BASED SYSTEM Work flow cases, it would be desirable to obtain We envision incorporating our method Figure 1 is a conceptual diagram of more detailed information about cer- in a remote health-­monitoring system the workflow for a healthcare system tain patients from their — that can provide feedback and decision that stores patient data electroni- an impossibility because it would support to a clinician. The system would cally. After preprocessing and filter- violate HIPAA. Consequently, regula- use devices that acquire data through ing patient data, the system stores it tion, not just technology, can limit the the Internet of Things and are connected as an electronic health record (EHR). acquisition of needed data. to a cloud-based decision-support sys- Each EHR gradually enriches the data- Even after removing identifying tem.2 The technological components base, which will improve the accuracy information, what remains could of such a system are within reach, and of future ML results. A large database be combined to statistically reveal a advanced devices for acquiring medical with many patients’ records might not patient’s identity. On one hand, PHI data are becoming commercially avail- be as useful as a database with fewer information such as age, gender, able.3 Sophisticated and powerful ML patients but more information on race, and genetic disorders, is critical algorithms are already well understood each patient. to developing an effective decision-­ and accessible.4 When many patients’ records are support system. On the other, includ- However, the human brain has aggregated and analyzed, steps must ing too much information on the unmatched reasoning abilities, so the be taken to protect the individuals’ wrong computer system risks violat- physician is still the most important privacy. Most protected health infor- ing HIPAA. Applications can also cre- part of any medical decision-­support mation (PHI), such as names and birth- ate privacy violations because some system. Thus, the goal of our envisioned days, can be removed from records in require explicitly protected informa- system is to provide physicians or other compliance with the Health Insurance tion such as a patient’s voice print6 or clinicians with concise, relevant infor- Portability and Accountability Act city of residence. Researchers must mation that can increase their diagnos- (HIPAA) without detriment to the data keep these restrictions in mind during tic efficiency and accuracy. mining process.5 However, in some all stages of a study.

NOVEMBER 2016 39 SMART HEALTH AND WELL-BEING

Decision support typical ECG report that a cardiologist pointes (TdP), which are likely to cause An effective ML-based healthcare reviews. When real-time monitor- serious symptoms such as seizures, system capitalizes on the computer’s ing reveals an urgent issue, an alert fainting, or sudden death. It is there- vast computational capability and the would immediately be sent to both the fore critical to monitor the QT inter- physician’s reasoning ability. Both patient and physician through SMS, val in patients prone to this disorder machine and physician are looking for pager, or an application. using long-term ECG recordings. Data patterns, but the physician cannot ana- recordings of ambulatory patients over lyze every heartbeat of every patient Evolving symbiosis. The physician several hours or days are called Holter or be familiar with every disease’s is still at the head of this process—­ recordings or simply Holters. nuances. The machine can do all these ordering tests, analyzing records, Our study focused on congenital tasks and then present its conclusions adjusting prescriptions, and so on. LQTS rather than the drug-induced to the physician for confirmation. The machine’s visualizations and form. In the database used for the recommendations are simply addi- study, we knew which recordings were Support types. As Figure 1 shows (see tional decision-­making tools. Over from patients with symptoms, but we box at lower right), we envision three time, the database will expand, and did not know if the symptoms came types of decision support: visualiza- the machine’s classifications will be before or after the ECG recording. Con- tion, alerts, and classification. more accurate. But improvements will sequently, we had no way to use ML to Visualization puts long-term mon- be symbiotic: as the machine’s accu- predict when symptoms would occur itoring data in a concise and intuitive racy grows, the physician will develop or to detect symptoms in real time. format,7 which could significantly an intuition for how and when the Instead, we attempted to identify when reduce the physician’s data burden machine makes those accurate classifi- a recording came from a patient whom and enable timely and accurate deci- cations and recognize when it might be we knew had symptoms in the past sion making. fallible. For example, a patient might or would have them in the future. In Alerts are alarms that activate when have an abnormal T-wave morphology other words, we attempted to identify a value crosses an established thresh- that the algorithms did not process the patient’s risk—an important con- old. The value can be simple to check, correctly, or a patient’s had cern for physicians, who must often or the result of a more advanced algo- not reached the point at which prob- prescribe medications and implantable rithm. The threshold could be a clini- lems would be identifiable. The physi- devices on the basis of the perceived cal standard—for example, 480 ms for cian can recognize the machine’s lim- risk of symptoms. (Symptoms in this QTc, which is a measure of the ventric- itations, and might opt for additional context are events, such as syncope, ular depolarization and repolarization methods to measure risk such as pre- that are triggered by prolonged QT.) duration—­­­­­​or it could be tailored to a scribing a drug or challenge or Identifying high-risk patients who patient. For example, the physician conducting a manual ECG analysis. need extra prescriptions or monitor- might want to be notified only if a par- ing or low-risk patients who would not ticular patient’s QTc exceeds 600 ms. CASE STUDY PARAMETERS benefit from those burdens would be Classification is the process of pre- In a case study to evaluate our method, highly valuable to both the physician dicting the group a patient belongs in, we exploited ML’s pattern-­recognition and patient. Additionally, despite the such as people with a specific geno­ abilities to classify risk in LQTS limitations of this particular data- type or people at risk for certain car- patients. The QT interval, which is typ- base, this study laid the groundwork diac events. Prediction of short-term ically used to measure the duration of for a future study at a time when outcomes is a primary goal. For exam- ventricular repolarization (a clinical a dataset with clinical outcomes ple, the machine might predict that a marker of the heart’s electrical activ- becomes available. patient is at high risk for a myocardial ity), can be abnormally long in some infarction in the next 12 hours. people who are taking certain medi- Data preprocessing The outputs from these support cations or have certain genetic disor- Figure 2 illustrates the steps that systems, such as plots and recom- ders.8 A prolonged QT interval can trig- transform raw ECG data into clini- mendations, would be attached to the ger arrhythmias such as torsades de cally useful measurements. Raw data

40 COMPUTER WWW.COMPUTER.ORG/COMPUTER Clinical markers

QRS complex Acquired Delineation R raw ECG data

R ST PR segment segment P T P T

Q PR interval Q S S contains massive redundancies as QT interval well as ectopic heartbeats and obvi- ous noise and errors, which will not be FIGURE 2. Detailed preprocessing steps using the workflow from Figure 1. The raw ECG useful in later processing. Preprocess- signal contains too much data to feed into most ML algorithms, and the data has massive ing extracts only clinically relevant redundancy across leads (sensor locations) and heartbeats, as well as noise and errors. markers from the data, which reduces Preprocessing filters the raw ECG waveforms to extract only relevant clinical markers, the data fed into the ML algorithm by such as the durations of the QRS complex and P and T waves. multiple orders of magnitude while drastically improving classification accuracy and execution time. The is of interest in heart-attack cases, and reduction and faster processing in sub- delineated heart­beat (Clinical markers the shape of the P wave is of interest to sequent steps. box) highlights several important mea- diagnosing atrial enlargement. Many Reduced dimensionality is import- surements of the cardiac electrical diseases affect the QT interval and its ant because the “curse of dimension- cycle. Atrial and ventricular depolar- subintervals (QRS, the J point to T​ peak, ality” remains a difficult problem in ization and repolarization are repre- and T peak to T end). categorizing big data. That is, large sented on the ECG as a series of waves: In our study, the QT and RR inter- volumes of data have a daunting num- the P wave followed by the QRS com- vals were the most relevant markers. ber of features, with each feature hav- plex and the T wave. The P wave occurs The QT interval alone is not enough ing myriad possible values. There are during atrial depolarization—­when information, because it will naturally so many dimensions to work with the atria contract. The QRS complex lengthen and shorten in all individ- that it is too easy to separate the train- indicates ventricular depolarization—­ uals as their heart rate decreases or ing data into the correct groups. The when the ventricles contract; at its increases. The RR interval—the dura- learned model becomes specific to the end is the J point. The T wave occurs tion of a complete —­ training set rather than generalizing during ventricular repolarization— provides enough information to correct to other data—a problem known as when the ventricles relax. Durations, the QT for heart rate. We calculated overfitting. Thus, an enormous amount amplitudes, and shapes at various the corrected QT (QTc) using the Frid- of training data is required to ensure parts of the ECG can be used to diag- ericia equation:11 that there are several samples with nose myriad illnesses. each combination of values. With a QT QTc = fixed number of training samples, RR Algorithm training 3 ML’s predictive power can decrease as To identify patients at risk for LQTS s dimensionality increases. symptoms, we trained ML algorithms using input variables extracted from Reducing dimensionality Method selection raw ECG data. As Figure 2 shows, Preprocessing substantially reduces the Among the ML methods for classifica- useful intervals and amplitudes are data to be reviewed. The raw ECG data— tion, supervised learning and cluster- available after only two preprocessing sampled at 200 Hz, 16 bits per sample, ing are the most popular. In our study, steps: delineation and the computa- on 3 leads, and over 24 hours—needed we focused on supervised learning. The tion of clinical markers. 100 Mbytes of storage. If only QTc inter- alternative to supervised learning is We first annotated important val values are required to detect LQTS clustering, also known as unsupervised markers in the ECG recording (such as symptoms, the storage requirement learning, which generally tries to group the P, Q, R, S, and T peaks, onsets, and lowers to 1 Mbyte. Storage capacity data points into clusters according to offsets) using delineation software to alone is not a sufficient reason to have their proximity to one another. A new identify these points.9,10 preprocessing; additional storage space data point can then be classified on the The relevance of each clinical is relatively inexpensive. Rather, pre- basis of the cluster into which it best fits. marker depends on the disease being processing is necessary when using ML Artificial neural networks, inspired studied. For example, STe, the eleva- algorithms because the data reduction by the neuron web in the human brain, tion (voltage) during the ST segment, translates directly to a dimensionality can be used for both supervised and

NOVEMBER 2016 41 SMART HEALTH AND WELL-BEING

TABLE 1. Pros and cons of four supervised learning classifiers.

Classifier Advantages Disadvantages

k-nearest neighbors Simple to implement Sensitive to noisy data and anomalies Easy to understand and interpret Computationally expensive for large datasets

Support vector machine Flexible with nonlinear data Difficult to interpret feature importance Scales up with large sets of data Yields possibly unreliable confidence estimates Relatively resistant to the “curse of dimensionality”

Random forest Alleviates overfitting problem Increases bias relative to single decision tree Easy to extract feature importance Different results possible in retraining on same data Resilient to missing data Scales to large datasets

AdaBoost Automatically reduces dimensionality Sensitive to noisy data and anomalies Relatively fast

unsupervised learning. Deep networks to create hyperplanes that divide the LQTS genotypes (LQT1 and LQT2) and use many layers of artificial neurons data into classes, which it keeps as far the most complete demographic infor- to form input data abstractions, which apart as possible (thereby maximizing mation (such as age and gender). The lead to the formation of a final classi- the distance between the hyperplane subjects’ average age was 25 ± 18 years fication layer. In each layer, weights and different class samples). SVMs (newborns to senior citizens); 55 per- are applied to features of the previous rely on a linear or nonlinear feature cent of the subjects were female, and layer to optimize performance. combination, depending on the kernel 67 percent had the LQT1 mutation. The supervised learning methods declared in the algorithm. We tested a Our goal was to determine which of that best fit our study were support vec- linear kernel as well as a radial basis the patients would show symptoms tor machine (SVM), decision tree, and function (RBF) kernel. of LQTS, such as seizures or syncope. nearest neighbors. These algorithms That is, we were trying to identify classify previously unseen data points Random forest. The random forest ECG patterns that could reveal which based on some function of the data algorithm is an ensemble learning genotype-­positive patients will also points they have already seen. In an method that averages the results of be phenotype-­positive. Given some attempt to improve accuracy, some ML several decision trees to classify its measurements from an ECG, a classi- algorithms factor in the results of sev- samples. As the name implies, each fier should simply tell us “symptoms eral classifiers in order to make their decision tree is trained on a random expected” or “no symptoms expected,” decision. Random forest and AdaBoost training data subset, perhaps using perhaps with a confidence value. are among the most popular methods to random features as well. use this ensemble learning technique. Algorithm implementation Table 1 lists the pros and cons of the AdaBoost. AdaBoost, short for We implemented all the classification four methods we considered in our adaptive boosting, aggregates the algorithms—k-nearest neighbors, lin- evaluation: k-nearest neighbors, SVM, results from many weak classifiers ear SVM and RBF SVM, random forest, random forest, and AdaBoost. The by iteratively retraining them to and AdaBoost—using scikit-learn, an table is useful in identifying the best focus on fixing mistakes from the open source Python library built on classifier for a given problem and set of previous round. It then averages the SciPy and NumPy.4 To assess a clas- computational constraints. results. In our experiments, Ada- sifier’s accuracy, we set aside 30 per- Boost always used decision trees as cent of the samples for testing, and k-nearest neighbors. The k-nearest the weak classifier. trained only on the remaining 70 per- neighbors algorithm finds the short- cent. Because some algorithms include est distance between a new testing RUNNING THE CLASSIFIERS inherent randomness in their opera- point and adjacent training points. We accessed a database containing tion and because the division between It then classifies the testing point as 24-hour ECG recordings of 480 LQTS training and testing data is also ran- the most common class among its patients, including demographic dom, we repeated the cycle of selecting ­k-nearest neighbors. information such as gender, age, training data, training, and testing 50 and specific LQTS genotype.12 We times for each classifier. Support vector machine. The SVM restricted our study to 434 recordings The average result from these method uses a training points subset of patients with the most common trials—­the Monte Carlo cross-­validation

42 COMPUTER WWW.COMPUTER.ORG/COMPUTER 0.75 0.65

0.60 0.70 0.55

0.65 0.50 y on test data y on test data 0.45 Linear SVM 0.60 RBF SVM 0.40 Accurac AdaBoost Accurac Random forest 0.35 0.55 k-nearest neighbors Guessing 0.35

0.50 0.25 50 100 150 200 250 300 50 100 150 200 250 300 (a) No. of training samples (b) No. of training samples 103 101

102

101 100 raining time (ms) T Classi cation time (ms)

100

10-1 10-1 50 100 150 200 250 300 50 100 150 200 250 300 (c) (d) No. of training samples No. of training samples

FIGURE 3. Classifier performance in the study. The input features were hourly QT and RR measurements for one day. Each data point is an average of 50 trials for which we randomly selected different training and testing data. Approximately 52 percent of the patients did not have symptoms, so the thin dashed line in (a) average accuracy and (b) minimum accuracy represents the performance achievable by simply guessing “no symptoms” every time. (c) Training time represents training conducted for increasingly larger subsets of the full training dataset of 304 samples, and (d) classification time represents how long validation took for the full test set of 130 samples. AdaBoost always used decision trees as the weak classifier. score—told us how well a classifier Feature selection classifiers. Each of our samples for would likely perform. The worst result The four ML methods we used have training or classification consisted of showed how low a classifier’s accuracy inherent strengths and weaknesses, 48 values (24 for QT and 24 for RR); could be when the data was not evenly but their performance was also con- increasing that number risked inflict- distributed across the random training/ strained by the data we provided ing the curse of dimensionality. testing split (based on its underlying them. We knew that QTc, and there- To reduce dimensionality even properties). For example, if most of the fore QT and RR, are the measurements more, we used chi-square (χ2) tests to outliers or noisy recordings end up in that cardiologists use most often to automatically select features that were the test set, or almost all of the asymp- determine whether a LQTS patient is likely to be the most useful. In general, tomatic patients end up in the training in danger. We also knew that people the fewer the dimensions in the input, set, our trained model will not match with different LQT genotypes tend to the fewer training samples are needed up well with the data it’s tested against. show more QTc prolongation at dif- to achieve good performance, and the For classifiers that require input data to ferent times of the day.13 We there- faster classifiers will run. Feature be normalized, we used scikit-learn’s fore decided to provide hourly QT and selection methods are also useful in ­StandardScaler() function. RR measurements as input to the ML the discovery of previously unknown

NOVEMBER 2016 43 SMART HEALTH AND WELL-BEING

patterns and correlations between testing. However, we conducted train- percent with the best classifiers, RBF variables. For example, the machine ing over increasingly larger subsets of SVM and random forest. When the might find that patients with a very the remaining 70 percent to determine computer was correct, its confidence specific genetic mutation are more at how many samples would produce was higher—around 68 to 74 percent. risk than others with seemingly sim- optimal results. This test showed us the possibility of ilar mutations or that a measurement setting a threshold, below which the that is typically not used in the clinic Average and minimum accuracy decision-support system could report actually carries significant informa- Figure 3a shows average accuracy— “inconclusive” rather than marking a tion. Even if no new relationships are the accuracies that we could expect patient as having a high or low risk. discovered, feature selection is use- from each classifier on the basis of ful to confirm the chosen model—to 50 random training-data selections. Scalability see if the machine picks the features Figure 3b shows minimum accuracy Figures 3c and 3d show the results of expected. over the 50 trials—assuming we chose measuring the runtime of the train- training data poorly, how well could ing and classification stages, which RESULTS each algorithm do? We found that in we used to estimate each classifier’s We used the four classifiers to deter- this worst-case scenario, more than scalability. Runtime was not a prob- mine if the genotype-positive LQTS 100 training samples could be required lem with this particular dataset, but patients in our database had suffered simply to break even—to exceed 52 it could be a limitation in other stud- or would suffer from any symptoms. percent (the guessing line). The high- ies. Although the ensemble classi- Figure 3 illustrates how the perfor- est scores, both minimum and aver- fiers (AdaBoost and random forest) mance of each classifier changes as age, came from random forest and the took longer than the others in both we provided more training samples. RBF SVM, which achieved 60 to 65 per- stages, adding training samples We configured the k-nearest neigh- cent accuracy even with a poor selec- barely affected their runtimes. As we expected, k­-nearest neighbors had essentially zero training time, but classification time increased with IN FEATURE SELECTION, OUR AIM ​ the number of samples because the algorithm had to compute distances WAS TO REDUCE INPUT DIMENSIONALITY to every point in the training set. In WHILE MINIMIZING RELEVANT fact, at around 240 training samples, INFORMATION LOSS. classification actually became slower than with AdaBoost. Because the two ensemble methods have very flat run- times, SVM will also become slower bors classifier to weight samples by tion of data and fewer than 100 train- than even random forest, given distance rather than uniformly and ing samples. The best classifier in our enough training samples. composed the random forest with 100 tests, the RBF SVM, averaged about 70 All the classifiers exceptk -nearest trees rather than the default of 10. We percent accuracy. neighbors (which does not really have set both random forest and the two Obviously, it is not desirable for the a training stage) could not incorpo- SVMs to use balanced class weights. machine to make bad classifications. rate new data after training was com- All other parameters were the scikit- However, relatively low accuracy is plete. When a database grows, classi- learn defaults. Each of the classifiers’ manageable if we know when the fiers must be entirely retrained or the input samples contained 48 values. computer was unsure of a result. We process of adding samples must use We computed runtime results on therefore tested the machine’s aver- nontrivial techniques. The ability to an Intel i7-5930K and always used age confidence in its responses. When add one or more training examples 30 percent of the 434 samples in the the computer was incorrect, its aver- to a model without complete retrain- full dataset (training plus testing) for age confidence was around 64 to 69 ing, referred to as online ML, is an

44 COMPUTER WWW.COMPUTER.ORG/COMPUTER 12

Top 100 features Top 25 features 10

8 important scalability feature. Con- 6 sequently, we tested the perceptron online algorithm14 and found that it

achieved 68 percent accuracy, a level of highly ranked features 4 comparable to that of random forest. No.

Effects of feature reduction We next investigated the impact of 2 providing classifiers with QTc alone instead of QT and RR separately, which 0 reduced the number of features from 0 5101520 48 to 24. Because QTc is designed to Hour of day contain the LQTS-relevant informa- tion from QT and RR, we expected to FIGURE 4. Feature importance versus time of day. Starting at 0, which represents mid- see improved runtimes without lower night, the peaks indicate that late night, awakening, and the end of the workday are the accuracy. However, we found that best times to detect cardiac issues for this patient group. accuracy decreased for random forest and AdaBoost (both of which are based on decision trees) as did the learning each patient’s EHR: gender, age, LQT Figure 4 is a histogram of the rate (more samples were required to type (1 or 2), mutation type, and muta- times of day in which the top 25 and reach peak performance). The feature tion location. top 100 features appeared. As the fig- reduction did not hurt RBF SVM’s per- Because we had only 434 training ure shows, all the top 25 features are formance, which consistently yielded samples, we expected to have to reduce evident around 5 pm to 6 pm, which 70 percent accuracy when provided the new feature set’s dimensional- implies that perhaps fatigue at the end with enough training samples. ity to mitigate overfitting. Our aim of the workday is unmasking cardiac AdaBoost and random forest saw no was to reduce input dimensionality issues. Expanding the search to the improvement in runtime with fewer while minimizing information loss, top 100 features begins to reveal other features. Predictably, k-nearest neigh- which both the principal component important times. One is first thing bors’ classifications were faster, and analysis (PCA) and the χ2 method sat- in the morning (6 am to 7 am), which all SVM runtimes improved. Because isfy. PCA projects the data to a lower is another stressful time of day.15 of these results, we revisited our fea- dimensional space, while χ2 selects Another is late night (1 am to 2 am), ture choice, attempting to narrow the the statistically best features. We mea- which makes sense for the LQT2 sub- list as much as possible. sured classifier accuracy varying the set of patients who tend to show more number of preserved features from QTc prolongation during sleep.13 Also Searching for better features 1 to 512. Surprisingly, both feature important is the lack of highly ranked Our use of QT, RR, and QTc was based selection methods allowed the clas- features around 8 am to 11 am, which on knowing what physicians mea- sifiers to achieve 70 percent accuracy implies that a clinical checkup in the sure in practice. However, we wanted with only one feature or attribute. morning might not be sufficient for to be sure that we did not overlook The most important features seemed the physician to accurately assess a any other useful cardiac features, so to be QT-like measurements taken in patient’s risk. we decided to evaluate 23 features at the evening, where “QT-like” means every hour of the day—a total of 552 QT, JT, QTp, JTp, or versions of these STUDY IMPLICATIONS measurements that included QT, RR, corrected for heart rate. Using the top Our study had several implications QRS, ST segment duration and ele- 20 features with the random forest for future analyses, including the vation, QTp, JT, JTp, TpTe, and T-wave classifier yielded 72 percent accuracy effects of beta blockers, gender influ- duration and amplitude. Addition- (69 percent sensitivity and 75 percent ence on classification, and measure- ally, we used several features from specificity). ment type.

NOVEMBER 2016 45 SMART HEALTH AND WELL-BEING

Beta blocker effects patients had symptoms. When com- virtually no weight placed on that fea- In one of our experiments, we found bined, the line was 52 percent. ture. It will be interesting to see under that separating QTc into QT and RR Our second experiment was to train what conditions gender or the other improved accuracy. However, many the classifier on both groups as before, static inputs become significant. high-risk LQTS patients are on beta but to test the accuracy against each blockers—drugs prescribed to slow group separately. Results showed little Measurement type the heart rate—so it was possible that difference: classification of only BB or For our study, the feature of interest classifiers in that experiment were only non-BB patients remained at 66 to (QTc) has characteristics that fit well actually reacting to the increased RR 68 percent accuracy. with hourly average measurements because QTC changes slowly and is corrected for heart rate. However, for other applications, even the exten- THE ML SYSTEM WE ENVISION WILL sive feature set we tested might not be PROVIDE A PERSONALIZED ASSESSMENT enough. Heart rate, for example, can vary greatly during one hour. Perhaps OF EACH PATIENT BASED ON A a different measure such as heart rate COMBINATION OF CRITICAL MARKERS. variability would be more suitable. Future analyses might even include more exotic measurements, such as “ST elevation at 60 ms after J point that is characteristic of beta blockers, Finally, in our experiment that used during high heart rate” or “percentage not to any novel pattern we had hoped only QTc as input, we were not (ide- of beats that T wave is inverted.” to find. In particular, beta blockers ally) providing any heart-rate infor- might explain why the classifiers mation to the classifier. The classifier based on decision trees (AdaBoost needed more training samples to reach e have presented a workflow and random forest) had higher accu- peak accuracy, but that peak was still and a conceptual ML-based racy when heart rate was available. To around 70 percent. system for health monitor- determine if this was the case, we con- We concluded that the presence of Wing ​that aims to analyze the ECGs of ducted three evaluations. BBs does not affect overall accuracy, patients with an LQTS genetic disorder The first was to train the clas- but classifiers are more accurate when and to identify patients with increased sifiers on only beta blockers (BBs) the groups are separated. risk of adverse cardiac events. The or non-BB patients and then check envisioned system will provide a per- accuracy. When the classifiers were Gender sonalized assessment of each patient trained on only BB patients, overall We expected gender to be a feature of by considering a combination of criti- accuracy remained at approximately significant importance in ML classifi- cal markers. 70 percent. Accuracy within the BB cation, as gender differences are known The results in Figure 3 provide group increased to approximately to influence many coronary heart dis- insights into which classifier works 90 percent, but accuracy in the eases,16 and males and females have best under constraints such as avail- non-BB group fell to approximately distinct average QTc values and clini- able training data or computational 60 percent. When we trained the cal prolongation thresholds. However, power. RBF SVM or random forest classifier on only the non-BB group, feature selection eliminated gender will yield the highest accuracy. RBF we observed the opposite results. (along with age and mutation infor- SVM is probably the better choice Although 90 percent accuracy is a mation) as very insignificant relative for experiments with relatively few marked improvement, the “guessing” to most ECG measurements. We con- training samples, but random for- line is higher in these subgroups: firmed this by running RBF SVM and est’s faster runtime will be prefer- 67 percent of non-BB patients had random forest with gender as an input, able when training data grows to no symptoms, and 62 percent of BB and found no difference in results and thousands of samples. As runtime

46 COMPUTER WWW.COMPUTER.ORG/COMPUTER ABOUT THE AUTHORS

SHUROUQ HIJAZI is a technology consultant at Ernst & Young. While conduct- ing the research reported in this article, she was a research assistant in the machine-learning (ML) laboratory at the University of Rochester. Her research interests include ML techniques, cybersecurity, computer networks, the Inter- net of Things (IoT), and virtualization. Hijazi received a BS in electrical and com- becomes more of a concern, feature puter engineering from the University of Rochester. She is a student member of selection gets more important. IEEE. Contact her at [email protected]. In Figure 3a, all classifiers seem to be close to reaching a horizontal ALEX PAGE is a postdoctoral associate in the Heart Research Follow-up Program asymptote, meaning that their perfor- at the University of Rochester Medical Center. While conducting the research mance will not improve simply by add- reported in this article, he was a PhD student in electrical engineering at the Uni- ing more training samples. Instead, versity of Rochester. His research interests include computer systems for analyz- their inputs and parameters will need ing medical data, such as databases, GPU acceleration, and ML techniques. Page to be optimized. In previous work, we received a PhD in electrical engineering from the University of Rochester. He is a found that time of day was important student member of IEEE. Contact him at [email protected]. in classifying patients as having the 13,17 LQT1 or LQT2 genotype. Both types BURAK KANTARCI is an assistant professor in the School of Electrical Engi- show QT prolongation, but during dif- neering and Computer Science at the University of Ottawa and a courtesy ferent activities and different times of assistant professor in the Electrical and Computer Engineering Department day. This finding is in large part why at Clarkson University. His research interests include the IoT, big data in the we structured our inputs as hourly network, crowdsensing and social networks, cloud networking, and digital data points in the study described. The health. Kantarci received a PhD in computer engineering from Istanbul Tech- dimensions also reduce well in this nical University. He is an editor of IEEE Communications Surveys and Tutori- structure, as usually only a few hours als, a Senior Member of IEEE, and a member of ACM. Contact him at burak during sleep are enough to differenti- [email protected]. ate LQTS types. However, other input structures and measurements should TOLGA SOYATA an associate professor in the Department of Electrical and be investigated. In the selection of Computer Engineering at SUNY Albany. His research interests include cyber- appropriate input features, the physi- physical systems, digital health, and GPU-based high-performance comput- cian’s knowledge and intuition remain ing. Soyata received a PhD in electrical and computer engineering from the critical. University of Rochester. He is a Senior Member of IEEE and ACM. Contact him The annotation algorithm we used at [email protected]. had some trouble with noisy or abnor- mal ECGs; improved accuracy might require cleaner inputs and more accu- rate annotations. Additionally, we could construct more complex fea- tures such as T-wave symmetry mea- illnesses. We expect that the refine- REFERENCES surements. Another possible approach ment of our method and the growth 1. T. Lorberbaum et al., “An Integrative is the use of a voting classifier, which of EHR databases will greatly improve Data Science Pipeline to Identify attempts to aggregate the predictions the quality of care for patients with a Novel Drug Interactions That Pro- of several other classifiers to reach a variety of disorders. long the QT interval,” Drug Safety, better result. However, our experi- vol. 39, no. 5, 2016, pp. 433–441. ence suggests that a voting classifier 2. M. Hassanalieragh et al., “Health will be only slightly more accurate ACKNOWLEDGMENTS Monitoring and Management Using than the best individual classifier. We thank Mehmet Aktas and Jean- Internet-of-Things (IoT) Sensing with Finally, a complete set of experiments Philippe Couderc from the University of Cloud-Based Processing: Opportu- will require trying other classification Rochester’s Department of Medicine for nities and Challenges,” Proc. IEEE methods such as clustering and artifi- motivating this study and providing guid- Int’l Conf. Services Computing (SCC 15), cial neural networks. ance about its clinical applications. This 2015, pp. 285–292. The steps we used can be general- work was supported in part by National 3. D. Son et al., “Multifunctional ized to other types of medical data and Science Foundation grant CNS-1239423. Wearable Devices for Diagnosis and

NOVEMBER 2016 47 SMART HEALTH AND WELL BEING

Therapy of Movement Disorders,” Data,” IEEE Access, vol. , , Normal Humans and in Patients Nature Nanotechnology, vol. , , pp. –. with Heart Disease,“ vol. ,  , pp.  –. . C.E. Chiang, “Congenital and pp.  – (in German). . F. Pedregosa et al., “Scikit-Learn: Acquired Long QT Syndrome: Cur- . J. Couderc, “The Telemetric and Machine Learning in Python,” J. rent Concepts and Management,” Holter ECG Warehouse Initiative Machine Learning Research, vol. , Rev., vol. , no. , , (THEW): A Data Repository for the , pp. –. pp. –. Design, Implementation and Valida- . “Summary of HIPAA Privacy Rule,” . Y. Chesnokov, D. Nerukh, and R. tion of ECG-Related Technologies,” US Department of Health and Glen, “Individually Adaptable Auto- Proc. IEEE Int’l Conf. Eng. Medicine Human Services, ; www.hhs matic QT Detector,” Proc. IEEE Com- and Biology Soc. (EMBC ), , pp. .gov/hipaa/for-professionals/privacy puters in Cardiology (CinC ), , –. /laws-regulations. pp. –. . A. Page et al., “QT Clock to Improve . E.C. Larson et al., “Spirosmart: Using . A. Demski and M.L. Soria, “Ecg-Kit: Detection of QT Prolongation in a Microphone to Measure Lung Func- A Matlab Toolbox for Cardiovas- Long QT Syndrome Patients,” Heart tion on a Mobile Phone,” Proc. ACM cular Signal Processing,” J. Open Rhythm, vol. , no. , , pp. Conf. Ubiquitous Computing (UbiComp Research Software, vol. , no. , ;  – . ), , pp. – . openresearchsoftware.metajnl . F. Rosenblatt, “The Perceptron: A . A. Page et al., “An Open Source ECG .com/articles/./jors.. Probabilistic Model for Information Clock Generator for Visualization . L.S. Fridericia, “The Duration of Storage and Organization in the of Long-Term Cardiac Monitoring Systole in an Electrocardiogram in Brain.” Psychological Rev., vol. , no. ,  , pp. –. . W.B. White, “Cardiovascular Risk and Therapeutic Intervention for the Early Morning Surge in and Heart Rate,” Blood Pressure Monitoring, 2017 B. Ramakrishna Rau Award vol. , no. , , pp. –. . A. Maas and Y. Appelman, “Gender Call for Nominations Di erences in Coronary Heart Dis- ease,” Netherlands Heart J., vol. , no. Honoring contributions to the computer microarchitecture field , , pp.  –. New Deadline: 1 May 2017 . A. Page et al., “Research Directions in Cloud-Based Decision Support Established in memory of Dr. B. (Bob) Ramakrishna Rau, the award recognizes his distinguished career in promoting and expanding the Systems for Health Monitoring use of innovative computer microarchitecture techniques, including Using Internet-of-Things Driven Data his innovation in complier technology, his leadership in academic and industrial computer architecture, and his extremely high personal Acquisition,” Int’l J. Services Comput- and ethical standards. ing, vol. , no. , , pp. –. WHO IS ELIGIBLE?: The candidate will have made an outstanding innovative contribution or contributions to microarchitecture, use of novel microarchitectural techniques or compiler/architecture interfacing. It is hoped, but not required, that the winner will have also contributed to the computer microarchitecture community through teaching, mentoring, or community service. AWARD: Certificate and a $2,000 honorarium. PRESENTATION: Annually presented at the ACM/IEEE International Symposium on Microarchitecture NOMINATION SUBMISSION: This award requires 3 endorsements. Nominations are being accepted electronically: www.computer.org/web/awards/rau Read your subscriptions CONTACT US: Send any award-related questions to [email protected] through the myCS publications portal at www.computer.org/awards http://mycs.computer.org.

48 COMPUTER WWW.COMPUTER.ORG/COMPUTER