Prognostic applications for Alzheimer’s disease using magnetic resonance imaging and machine-learning

by

Nikhil Bhagwat

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Institute of Biomaterials & Biomedical Engineering
University of Toronto

© Copyright 2018 by Nikhil Bhagwat

Abstract

Prognostic applications for Alzheimer’s disease using magnetic resonance imaging and machine-learning

Nikhil Bhagwat
Doctor of Philosophy
Graduate Department of Institute of Biomaterials & Biomedical Engineering
University of Toronto
2018

Alzheimer’s disease (AD), the most common form of dementia, is a neurodegenerative disorder that leads to cognitive deficits, particularly in the memory domain. Recent advances in magnetic resonance (MR) imaging techniques and computational tools, such as machine-learning (ML), provide promising opportunities for prognostic applications in AD. Imaging biomarkers can improve our understanding of the etiology and progression of the disease, as well as assist clinicians in decision-making pertaining to patient monitoring, intervention, and treatment selection. The overarching goal of this thesis is to develop several computational methods that facilitate the use of MR imaging data in translational applications to improve personalized patient care.

The work in this thesis is divided into three projects. The first project aims to improve MR image segmentation - a commonly used MR image processing step to delineate anatomical structures used in a multitude of downstream quantitative analyses. The goal of the second project is to leverage MR-based anatomical features towards subject-level clinical severity prediction. Methodologically, the work provides a novel ML framework for high-dimensional, multimodal analysis customized for MR imaging data. The third project extends the subject-level prediction towards longitudinal prognosis with several practical considerations. The methodological contributions involve modeling and prediction of clinical trajectories from longitudinal MR and clinical measures using ML approaches. The work addresses many challenges faced in a clinical setting, such as missing data points, and provides a powerful framework for early detection and accurate prognosis of at-risk AD patients via continuous monitoring.

The comprehensive validation of the methods presented in this work with multiple datasets and studies demonstrates the utility of MR images towards AD prognosis. The proposed tools complement existing clinical workflows and can be leveraged in conjunction with current clinical assessments. With the increasing availability of large-scale datasets, further improvements and validations can be made to adapt this work for individual intervention and prognosis, as well as for improving recruitment strategies for clinical trials in AD.

Dedicated to my childhood storybooks from around the world that have proved to be inspiring and grounding at the same time...

Acknowledgements

Research is innately collaborative in nature, and hence I would like to extend my appreciation to the many people whose support has been invaluable for the work presented in this thesis.

First, I want to thank my supervisor, Dr. M. Mallar Chakravarty, who entrusted me as his very first PhD student. Mallar found the tricky balance between mentoring an initially naive academic trainee and providing freedom to pursue my own ideas, even the seemingly absurd ones. Apart from academic training, I must also thank Mallar for the enriching collaborations in Toronto and Montreal, wonderful conference trips, and many exciting gastronomic excursions around the globe.

Second, great gratitude goes to Dr. Aristotle Voineskos, who not only served as my committee member, but also provided lab space and resources during my four years of PhD work at CAMH in Toronto. As a clinician, Aristotle's advice has been crucial in addressing the clinical goals of this work. Special mention also goes to my other committee members, Dr. Chris Honey, Dr. Jo Knight, Dr. Richard Zemel, and Dr. Babak Taati, who formed an incredibly multidisciplinary committee. Their advice has helped me challenge my own preconceived notions and develop an understanding of my projects from diverse perspectives.

I am also grateful to my department for supporting me throughout this stint with many unexpected logistical difficulties. Particularly, I want to thank Dr. Christopher Yip and Dr. Julie Audet for their advice and understanding. I am also grateful to the IBBME staff members, Jeffrey Little, Rhonda Marley, and Elizabeth Flannery, for dealing with my many incessant queries with great patience.

I was also fortunate to have wonderful colleagues over the past years at both the Toronto and Montreal labs. Particularly, I thank Jon Pipitone and Gabriel Devenyi for helping me with the computational resources and impromptu brainstorming sessions! Typically, my research insights have been realized at the most unexpected moments during semi-academic, borderline silly interactions with my labmates and friends.

Beyond my academic circles, I must attribute significant credit to many of my old and new friends, especially Neda, Julie, Anwesha, Joseph, and Sophia, who indulged my many ventures and ventings, and helped me minimize the work-life imbalance during this period.

Finally, I want to thank my parents, Vandana and Parag, along with my little brother Tejas. My family's courageous life decisions, unwavering support, and multi-timezonal video conferences have allowed and prepared me to take on these intellectual pursuits, for which I am always grateful.

Contents

Acknowledgements

Table of Contents

List of Tables

List of Figures

1 Introduction
  1.1 Research contributions and thesis outline

2 Background
  2.1 Alzheimer’s Disease
    2.1.1 AD diagnosis and staging
    2.1.2 AD risk factors
    2.1.3 AD treatment
  2.2 Pathophysiology of AD
    2.2.1 Beta-amyloid (Aβ)
    2.2.2 Neurofibrillary tangles (NFTs) of protein tau
    2.2.3 Progression of Aβ and NFTs
  2.3 Neuroanatomy
  2.4 AD biomarkers
    2.4.1 CSF and PET markers
    2.4.2 MR imaging markers
  2.5 MR-based Neuroimaging
    2.5.1 MR acquisition
    2.5.2 MR image preprocessing
    2.5.3 Image registration
    2.5.4 Image segmentation
    2.5.5 Cortical surface estimation
  2.6 Computational neuroscience and machine-learning
    2.6.1 Machine-learning
    2.6.2 Supervised learning
    2.6.3 Reference models
    2.6.4 Artificial neural networks and deep learning

    2.6.5 Performance metrics for supervised learning
    2.6.6 Supervised ML and AD
    2.6.7 Unsupervised learning
    2.6.8 Performance metrics for unsupervised learning
    2.6.9 Unsupervised ML and AD
    2.6.10 Performance evaluation
  2.7 Project synopses
    2.7.1 Manual-protocol inspired technique for improving automated MR image segmentation during label fusion (published online 2016 Jul 19, doi: 10.3389/fnins.2016.00325)
    2.7.2 An artificial neural network model for clinical score prediction in Alzheimer’s disease using structural neuroimaging measures (accepted in the Journal of Psychiatry and Neuroscience)
    2.7.3 Modeling and prediction of clinical symptom trajectories in Alzheimer’s disease using longitudinal data (accepted in PLOS Computational Biology)

3 Project 1: MR Image Segmentation
  3.1 Abstract
  3.2 Author Contributions
  3.3 Introduction
  3.4 Materials and Methods
    3.4.1 Methodological Novelty of AWoL-MRF
    3.4.2 Baseline Multi-Atlas Segmentation Method
    3.4.3 Proposed Label-Fusion Method: AWoL-MRF
  3.5 Validation Experiments
    3.5.1 Datasets
    3.5.2 Label-Fusion Methods Compared
    3.5.3 Evaluation Criteria
  3.6 Results
    3.6.1 Experiment I: ADNI Validation
    3.6.2 Experiment II: FEP Validation
    3.6.3 Experiment III: Preterm Neonatal Cohort Validation
    3.6.4 Experiment IV: Hippocampal Volumetry
    3.6.5 Parameter Selection
  3.7 Discussion and Conclusion
  3.8 Acknowledgments
  3.9 Supplementary Material
    S3.1 Experiment I: ADNI Validation
    S3.2 Experiment II: First Episode Psychosis (FEP) Validation
    S3.3 Experiment III: Preterm Neonatal Cohort Validation
    S3.4 Experiment IV: Hippocampal Volumetry
    S3.5 Surface-Distance error analysis

4 Project 2: Clinical Score Prediction
  4.1 Abstract
  4.2 Author Contributions
  4.3 Introduction
  4.4 Materials and Methods
    4.4.1 Datasets
    4.4.2 MR image processing
    4.4.3 Anatomically Partitioned Artificial Neural Network (APANN)
    4.4.4 Empirical distributions
    4.4.5 Performance Validation
  4.5 Results
    4.5.1 Experiment 1: ADNI1 Cohort
    4.5.2 Experiment 2: ADNI2 Cohort
    4.5.3 Experiment 3: ADNI1 + ADNI2 Cohort
    4.5.4 Longitudinal prediction
  4.6 Discussion
    4.6.1 Clinical scale comparisons
    4.6.2 Input modality comparisons
    4.6.3 Dataset comparisons
    4.6.4 Longitudinal analysis
    4.6.5 Related work
    4.6.6 Clinical translation
  4.7 Limitations
  4.8 Conclusion
  4.9 Acknowledgements
  4.10 Supplementary Material
    S4.1 Performance comparison to reference models
    S4.2 Empirical sampling: standardization across modalities
    S4.3 Computational resource requirements
    S4.4 Performance bias in combined ADNI1and2 cohort

5 Project 3: Prognosis in AD
  5.1 Abstract
  5.2 Author Summary
  5.3 Author Contributions
  5.4 Introduction
  5.5 Materials and Methods
    5.5.1 Datasets
    5.5.2 Preprocessing
    5.5.3 Analysis workflow
    5.5.4 Trajectory modeling
    5.5.5 Trajectory prediction
    5.5.6 Performance evaluation
  5.6 Results

    5.6.1 Trajectory modeling
    5.6.2 Trajectory Prediction
    5.6.3 MMSE trajectories (binary classification)
    5.6.4 ADAS-13 trajectories (3-way classification)
    5.6.5 AIBL results
    5.6.6 Effect of trajectory modeling on predictive performance
    5.6.7 Computational specifications and training times
  5.7 Discussion
    5.7.1 Clinical Implications
    5.7.2 Trajectory modeling
    5.7.3 Trajectory prediction
    5.7.4 AIBL analysis
    5.7.5 Effect of trajectory modeling on predictive performance
    5.7.6 Comparison with related work
    5.7.7 Limitations
    5.7.8 Conclusions
  5.8 Acknowledgements
  S5 Supplementary Material
    S5.1 Subject Lists
    S5.2 Hyperparameter search
    S5.3 Clinical score distributions
    S5.4 Prediction performance results
    S5.5 Effect of available timepoints (duration) on prediction performance
    S5.6 K-fold nested cross-validation procedure

6 Discussion
  6.1 Challenges and limitations
    6.1.1 Project 1: Hippocampal segmentation
    6.1.2 Project 2: Clinical severity prediction in AD
    6.1.3 Project 3: Clinical trajectory modeling and prediction in AD
  6.2 Future directions
    6.2.1 MR image segmentation
    6.2.2 Severity prediction
    6.2.3 Longitudinal prediction
    6.2.4 Clinical tasks
    6.2.5 Clinical translation
  6.3 Concluding remarks

Bibliography

List of Tables

2.1 National Institute on Aging (NIA) proposed clinical and preclinical stages of Alzheimer’s disease and their corresponding symptomatic criteria

2.2 Comparison of commonly used linear and non-linear registration methods

3.1 ADNI1 cross-validation subset demographics. CN: Cognitively Normal. LMCI: Late-onset Mild Cognitive Impairment. AD: Alzheimer’s Disease. CDR-SB: Clinical Dementia Rating-Sum of Boxes. ADAS: Alzheimer’s Disease Assessment Scale. MMSE: Mini-Mental State Examination. Values are presented as lower quartile, median, and upper quartile for continuous variables.

3.2 First episode psychosis subject demographics. Ambi: ambidextrous. SES: Socioeconomic Status score. FSIQ: Full Scale IQ. Values are presented as lower quartile, median, and upper quartile for continuous variables. N* is the number of non-missing values out of 81.

3.3 Hippocampal Volumetry Statistics of the ADNI1: Complete Screening 1.5T dataset per diagnosis (AD: Alzheimer’s patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). Top: volumetric statistics of segmentations provided by each method. Middle: effect sizes of pairwise differences between diagnostic groups based on Cohen’s d metric. Bottom: t-values and significance levels from a linear model comprising “Age”, “Sex”, and “total-brain-volume” as covariates (*: p < 0.05, **: p < 0.01, ***: p < 0.001).

3.4 Summary of automated segmentation methods of the hippocampus. AD = Alzheimer’s Disease; MCI = Mild Cognitive Impairment; CN = Cognitively Normal; FEP = First Episode of Psychosis; LOOCV = Leave-one-out cross-validation; MCCV = Monte Carlo cross-validation; SNT = Surgical Medtronic Navigation Technologies semi-automated labels; L-HC = Left hippocampus; R-HC = Right hippocampus. (a): AD: 0.838, MCI: n/a, CN: 0.883; (b): see [149] for manual segmentation protocol details; (c): the method was applied in the 2012 MICCAI Multi-Atlas Labeling Challenge.

S3.1 Experiment I surface-distance errors based on a variant of the Hausdorff distance. The error is measured in number of voxels, and mean and standard deviation values are reported over all the subjects in the dataset. The validation configuration comprised 9 atlases and 19 templates.

4.1 Dataset demographics for the ADNI1 and ADNI2 cohorts used in this study. CN: Cognitively Normal, SMC: Significant Memory Concern, EMCI: Early Mild Cognitive Impairment, LMCI: Late Mild Cognitive Impairment, AD: Alzheimer’s Disease; ADAS: Alzheimer’s Disease Assessment Scale, MMSE: Mini-Mental State Examination.

4.2 Hyperparameter search space for the four models. Grid search of the hyperparameters was performed using a nested inner loop for each cross-validation round. For the APANN model, the fixed hyperparameters refer to broader network design choices that remained identical for all cross-validation rounds. The tunable hyperparameters for APANN were optimized for each fold.

4.3 Prediction performance for Alzheimer’s Disease Assessment Scale-13 scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RF: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std).

4.4 Prediction performance for Mini-Mental State Examination (MMSE) scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RF: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std).

S4.1 Bias measures for Experiment 3

5.1 Demographics of the ADNI and AIBL datasets. TM: trajectory modeling cohort, TP: trajectory prediction cohort, R: replication cohort.

5.2 Cluster demographics of the ADNI trajectory prediction (TP) cohort based on MMSE and ADAS-13 scales. *GDS: Geriatric Depression Scale.

5.3 Trajectory membership comparison between MMSE and ADAS-13 scales. Note that MMSE has only a single decline trajectory.

5.4 Longitudinal siamese network (LSN) architecture

S5.1 Hyperparameter grid search for the longitudinal siamese network (LSN)

S5.2 Predictive performance: All subjects, MMSE scale, CA input
S5.3 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA input
S5.4 Predictive performance: All subjects, MMSE scale, CT input
S5.5 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CT input
S5.6 Predictive performance: All subjects, MMSE scale, CA+CT input
S5.7 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA+CT input
S5.8 Predictive performance: All Subjects, ADAS-13 scale, CA input
S5.9 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA input
S5.10 Predictive performance: All Subjects, ADAS-13 scale, CT input
S5.11 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CT input
S5.12 Predictive performance: All Subjects, ADAS-13 scale, CA+CT input
S5.13 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA+CT input
S5.14 AIBL predictive performance: All Subjects, MMSE scale, CA+CT input

List of Figures

2.1 The hippocampal formation
2.2 Hypothetical model of biomarker progression in AD
2.3 MR image preprocessing
2.4 3T in-vivo high-resolution atlas of the hippocampal subfields [373]
2.5 3T in vivo high-resolution atlas of the hippocampal subfields and white-matter structures [12]
2.6 CIVET stages for extracting cortical surface
2.7 Logistic (sigmoid) function
2.8 Support vector machine
2.9 Decision tree
2.10 A feed-forward artificial neural network
2.11 Nested k-fold cross-validation

3.1 The segmentation of a sample hippocampus with AWoL-MRF
3.2 Experiment I DSC improvement
3.3 Experiment I DSC boxplots
3.4 Experiment I Bland-Altman analysis
3.5 Experiment I qualitative analysis
3.6 Experiment II DSC improvement
3.7 Experiment II DSC boxplots
3.8 Experiment II Bland-Altman analysis
3.9 Experiment II qualitative analysis
3.10 Experiment III DSC improvement
3.11 Experiment III DSC boxplots
3.12 Experiment III Bland-Altman analysis
3.13 Experiment III qualitative analysis
3.14 Hippocampal volume vs. diagnoses
3.15 Parameter selection

4.1 ANN architectures and APANN
4.2 Empirical sampling
4.3 A custom cortical surface parcellation
4.4 APANN performance: ADAS13 and MMSE
4.5 APANN performance: scatter plots

4.6 APANN longitudinal performance
S4.1 Performance comparison of all models
S4.2 Correlation performance of reference models with high-dimensional input
S4.3 Performance of all models for the HC+CT input in Experiment 3 split by subject-dataset membership

5.1 Analysis workflow of the longitudinal framework
5.2 Trajectory modeling
5.3 Longitudinal Siamese network (LSN) model
5.4 Potential clinical workflow
5.5 MMSE prediction AUC and accuracy performance for ADNI dataset
5.6 MMSE prediction ROC curves for ADNI dataset
5.7 ADAS-13 prediction accuracy performance for ADNI dataset
5.8 MMSE prediction AUC and accuracy performance for AIBL replication cohort
5.9 MMSE prediction ROC curves for AIBL replication cohort
S5.1 Clinical score distributions of different trajectories at two timepoints. Note: for subjects who are missing the 12 month timepoint, 6 month scores are used instead.
S5.2 Distribution of number of available timepoints with clinical score data
S5.3 Effect of available timepoints on predictive performance (MMSE)
S5.4 Effect of available timepoints on predictive performance (ADAS-13)
S5.5 Project 3 supplement: K-fold nested cross-validation

6.1 Machine-learning models: underfitting vs. overfitting

Chapter 1

Introduction

Translational applications of computational neuroscientific methods can have a meaningful impact on global challenges. This multidisciplinary field of study can help model the biological processes that govern the healthy and diseased states of the human brain and map them onto observable clinical presentations. In the last decade, the rapid increase in large biomedical datasets (neuroimaging and related biological data), concurrent with advances in the field of machine-learning, has opened new avenues towards the development of diagnostic and prognostic applications for neurodegenerative and neuropsychiatric disorders. From the computational perspective, this has spawned the development of tools that can incorporate several patient-specific observations towards prediction-making to improve the clinical outcomes of those suffering from these disorders. This thesis leverages these multidisciplinary advances towards developing prognostic applications for late-onset Alzheimer’s disease (AD) using magnetic resonance (MR) imaging data and machine-learning (ML) techniques. The overarching goal of these applications is to improve early detection and personalized treatment planning by identifying individuals at the highest risk for AD and AD-related symptom decline.

AD, the most common form of dementia, was first identified by Dr. Alois Alzheimer in 1906 and is a distinctively different neurobiological process compared to normal aging. AD is a progressive brain disorder that alters synaptic connectivity due to the aggregation of specific brain pathologies. The altered neuronal connectivity ultimately results in the death of brain cells, causing declining episodic memory function and cognitive ability. In advanced stages, patients are unable to perform even basic tasks required for daily living (eating, bathing, and dressing themselves), resulting in a need for constant monitoring and care. Currently, there is no cure for this debilitating and ultimately fatal disease.

As an affliction of the ageing population, the socio-economic burden of AD is set to increase rapidly with the rising percentage of this demographic in the developed world. Currently, over 560,000 Canadians are living with dementia, and in the next fifteen years this number is projected to grow to over 930,000 [11]. One in thirteen Canadians between ages 65 and 74 years is affected by AD and related dementias. This number increases to one in nine after age 75 and one in four after age 85. In Canada, 10.4 billion dollars are spent annually on treatment and caregiving costs. In the United States, approximately 5.5 million people were living with AD in 2017 [11]. Globally, it is estimated that the prevalence of AD will reach over 100 million by 2050 [46]. Therefore, it is critical to develop new intervention and treatment strategies together with caregiving infrastructure to deal with this impending global healthcare crisis.


An early identification of presymptomatic individuals at risk of AD or related symptomatic decline would potentially have a significant impact on the development of intervention strategies that could treat or delay the onset of AD. A prognostic tool that can predict the future decline of an at-risk individual would help greatly in clinical trial recruitment and the design of potential disease-modifying therapies. Targeting subject populations in earlier stages of the disease also increases the success rate of a treatment [320, 116, 283]. In efforts towards early detection, continuous monitoring of at-risk individuals via MR imaging and the development of applicable computational tools for prognostic predictions is of great interest from the public healthcare perspective, and serves as a motivating rationale for the work in this thesis.

1.1 Research contributions and thesis outline

The work presented in this thesis contributes to the growing body of research in the area of clinical applications in AD, leveraging MR imaging data and ML techniques. The projects undertaken as part of this thesis make specific methodological advancements in the areas of MR image segmentation as well as ML-based model development for multimodal and longitudinal data analysis aimed towards subject-level clinical predictions. These clinical applications can contribute greatly towards early detection and prognostic predictions at the individual level, which can assist interventions, treatment planning, as well as the design of clinical trials. The research is progressively divided into three projects as follows:

• Project 1: MR image segmentation of the hippocampus

• Project 2: Symptom severity prediction based on neuroanatomy

• Project 3: Modeling and prediction of clinical progression

The next chapter details the background pertaining to AD and associated neuropathology, as well as the current state of the art in MR-based neuroimaging and the computational approaches that form the basis for this thesis.

Chapter 3 describes project 1, which aims at improving MR image segmentation - a commonly used MR processing step that produces anatomically meaningful feature sets that serve as input to a multitude of clinical applications. Specifically, the work in project 1 presents a novel automated method, inspired by manual protocols, that improves the performance of multi-atlas based segmentation frameworks. The presented method is validated for whole hippocampal segmentation on three different datasets.

Chapter 4 describes project 2, which aims at leveraging MR features towards subject-level clinical severity prediction. Methodologically, the work contributes by providing a novel ML framework for high-dimensional, multimodal data analysis. Specifically, the work presents a novel artificial neural network that combines hippocampal segmentations and cortical thickness measures to predict scores from multiple clinical scales simultaneously. The presented model is validated on multiple large AD datasets towards baseline as well as longitudinal score prediction. Clinically, this can assist clinicians in making (or validating) diagnoses based on quantitative MR measures.

Chapter 5 describes project 3, which aims at modeling and prediction of clinical trajectories from longitudinal MR and clinical measures. First, a data-driven clinical subtyping approach is presented that characterizes the differential symptomatic progression of individuals based on longitudinal clinical scores. Methodologically, the approach allows graceful handling of missing data points. Then, the work presents a novel longitudinal ML framework that can combine MR data from two timepoints, namely baseline and follow-up, towards prediction of these symptom trajectories. The work in this project provides a powerful computational tool for early detection and accurate prognosis of at-risk AD patients via continuous monitoring.

Lastly, chapter 6 discusses the overall contributions as well as the limitations of the work in this thesis, and concludes with overarching remarks on future directions.

It should be noted that this is a manuscript-based thesis. Chapters 3-5 are research papers that are either published or accepted at various journals. Hence, readers may notice some overlap between certain sections of the papers and the stand-alone chapters of the thesis.

Chapter 2

Background

2.1 Alzheimer’s Disease

Alzheimer’s disease (AD) progressively affects brain function and cognitive processes related to learning and memory. Despite its high prevalence within the ageing population, AD is not a part of the normal ageing process. Clinical diagnosis of AD is made according to consensus criteria which have been refined over the years. One of the earlier sets of guidelines was proposed by the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (the latter now known as the Alzheimer’s Association) [245]. These criteria proposed the classification of patients into definite, probable, or possible AD. A diagnosis of definite AD requires that neuropathological findings be confirmed by a direct analysis of brain tissue samples, which may be obtained either at autopsy or from a brain biopsy.

In 2007, an International Working Group (IWG) of dementia experts developed guidelines that incorporated new knowledge about the prodromal symptomatic stage of AD [105]. These were later updated to address atypical clinical presentations of AD and to identify clinically asymptomatic individuals with positive biomarkers of AD pathology [104]. In 2011, the Alzheimer’s Association and the National Institute on Aging (NIA) issued four diagnostic criteria and guidelines for AD that focus on three stages of AD: (1) dementia due to Alzheimer’s - characterized by impairment in memory, thinking, and behavior that compromises a person’s ability to function independently in everyday life [181, 246]; (2) mild cognitive impairment (MCI) due to Alzheimer’s - characterized by mild changes in memory and thinking that are noticeable and can be measured on mental status tests, but are not severe enough to disrupt a person’s day-to-day life [7]; and (3) preclinical (presymptomatic) Alzheimer’s - characterized by measurable biomarker changes in the brain, which may occur years before symptoms [325]. The diagnosis of preclinical AD is primarily used within research settings to investigate and parse the substantial pathophysiological and behavioural heterogeneity present in this stage [139].

There are several commonalities as well as differences between these two sets of diagnostic criteria proposed by the IWG and the NIA, and harmonization efforts have been made to reach consensus [255]. The NIA-proposed guidelines conceptualize AD on a continuum and recognize that the pathological processes of AD may not be clinically expressed. This thesis focuses on the NIA diagnostic criteria, which are used by the studies leveraged in this work.


2.1.1 AD diagnosis and staging

The NIA-proposed criteria outline three stages of AD [255], among which the preclinical staging is based on a hypothetical temporal ordering of biomarkers. The neuropathology and AD biomarkers are described in detail in subsequent sections. The diagnostic and preclinical stages are described in Table 2.1.

2.1.2 AD risk factors

The risk of AD onset and related symptomatic severity is attributed to several contributing factors. Among the modifiable, lifestyle-related risk factors, the US National Institutes of Health has identified diabetes mellitus, smoking, depression, mental inactivity, physical inactivity, and poor diet as being associated with increased risk of cognitive decline and AD [93]. These factors can be taken into consideration for developing primary strategies for preventing or delaying the disease onset via healthier living. Other non-modifiable risk factors include age and genetics, which have been implicated in the onset of the disease [78, 185, 261, 232]. Notably, the risk of AD and the severity of related symptoms increase with age. Apolipoprotein E (ApoE), the leading genetic risk factor for AD, supports lipid transport and injury repair in the brain [122]. There are three polymorphic alleles of ApoE: ε2, ε3, and ε4, with ε3 being the most common. Individuals carrying the ε4 allele are at increased risk of AD compared with those carrying the ε3 allele, whereas the ε2 allele decreases the risk of onset [141]. In the case of the inherited form of AD, genetic mutations in presenilin 1 (PSEN1), presenilin 2 (PSEN2), or amyloid precursor protein (APP) have been implicated in the onset [40, 305].

2.1.3 AD treatment

Although there is no cure for AD, there are several medication and treatment options that can help manage the symptoms and improve quality of life. The pharmacological interventions approved to treat AD symptoms relating to memory, cognition, language, etc. include cholinesterase inhibitors and the N-methyl-D-aspartate (NMDA) receptor partial antagonist memantine [248, 33, 289]. Cholinesterase inhibitors aim to improve cell-to-cell communication by increasing the availability of acetylcholine, a neurotransmitter that is depleted due to the disease. Memantine attempts to block the overstimulation of the NMDA receptor by glutamate, which is linked to memory processes and implicated in neurodegenerative disorders. The effect of these drugs varies from person to person, and they are typically prescribed during the early stages of the disease in order to delay the onset and slow down the worsening of symptoms.

As an alternative to pharmacological medication, cognitive interventions have been suggested as a preventive measure in the earlier stages of the disease. The intervention techniques include cognitive training, stimulation, and rehabilitation to improve or at least maintain the functioning of several cognitive domains [64, 259, 337]. Despite several studies, there is no strong evidence to support significant improvement or slowing down of cognitive decline with these techniques. However, it is possible that well-designed studies, focusing on specific subsets of the preclinical population, might be needed to measure the efficacy of these methods on different cognitive domains [23]. Recently, there has been growing interest in investigating the cognitive benefits of physical exercise; as lack of physical exercise is a known risk factor for age-related cognitive decline, it has been suggested that physical activity may improve brain function by increasing the expression of nerve growth factors and proteins related to cognitive function [24, 121].

A positive effect of exercise, although with a small effect size, has been documented by a few studies in MCI and AD populations [271]. The recommendations from these studies investigating AD treatment have a few commonalities. First, an early intervention is crucial for delaying or slowing down the clinical progression of the disease. Second, the limited success thus far has been in terms of developing symptomatic treatment regimens, rather than a comprehensive disease treatment. Thus, an early diagnosis of AD and the identification of individuals at high risk of cognitive decline and conversion to AD are extremely important for timely intervention, treatment planning, and making caregiving arrangements.

AD dementia:
1. The presence of dementia, as determined by intra-individual decline in cognition and function.
2. Insidious onset and progressive cognitive decline.
3. Impairment in two or more cognitive domains; although an amnestic presentation is most common, the criteria allow for diagnosis based on nonamnestic presentations (e.g. impairment in executive function and visuospatial abilities).
4. Absence of prominent features associated with other dementing disorders.
5. Increased diagnostic confidence may be suggested by the biomarker algorithm discussed in the MCI due to AD section.

MCI due to AD:
1. A change in cognition from previously attained levels, as noted by self or informant report and/or the judgment of a clinician.
2. Impaired cognition in at least one domain (but not necessarily episodic memory) relative to age- and education-matched normative values; impairment in more than one cognitive domain is permissible.
3. Preserved independence in functional abilities, although the criteria also accept ‘mild problems’ in performing instrumental activities of daily living (IADL) even when this is only with assistance (i.e. rather than insisting on independence, the criteria now allow for mild dependence due to functional loss).
4. No dementia, which nominally is a function of criterion 3.
5. A clinical presentation consistent with the phenotype of AD in the absence of other potentially dementing disorders. Increased diagnostic confidence may be suggested by:
   • Optimal: a positive beta-amyloid (Aβ) biomarker and a positive degeneration biomarker
   • Less optimal: a positive beta-amyloid (Aβ) biomarker without a degeneration biomarker, or a positive degeneration biomarker without testing for beta-amyloid (Aβ) biomarkers

Preclinical staging:
1. Stage 1 is marked by Aβ42 peptide dysregulation (reflected in reduced levels of CSF Aβ42 or by elevated cerebral cortical amyloid burden as determined by PET amyloid imaging).
2. Stage 2 adds synaptic/neuronal dysfunction and loss (i.e. neurodegeneration), as evidenced by increased CSF p-tau levels, or hypometabolism or cortical thinning/hippocampal atrophy as determined by FDG PET and MRI, respectively.
3. Stage 3 features the abnormalities in Stages 1 and 2 plus subtle cognitive decline.

Table 2.1: National Institute on Aging (NIA) proposed clinical and preclinical stages of Alzheimer’s disease and their corresponding symptomatic criteria.

2.2 Pathophysiology of AD

The two defining pathological features of AD consist of extracellular beta-amyloid (Aβ) peptides and intracellular neurofibrillary tangles (NFTs) of protein tau [43, 155, 312, 228, 37].

2.2.1 Beta-amyloid (Aβ)

Aβ oligomers and plaques are synaptotoxins that stimulate inflammatory processes. Aβ denotes peptides of 36–43 amino acids derived from the amyloid precursor protein (APP). Aβ circulates in plasma, cerebrospinal fluid (CSF), and brain interstitial fluid (ISF) mainly as soluble Aβ40. APP is cleaved by beta secretase and gamma secretase to yield Aβ. Due to the imprecise nature of the gamma secretase cleavage process, along with the most common form, Aβ40 (~80–90%), several other Aβ variants are produced, including Aβ42 (~5–10%), which is more hydrophobic and fibrillogenic [333, 306]. This form of Aβ is predominantly found in amyloid plaques. Overproduction or lack of clearance of Aβ causes its accumulation outside the cell in the form of oligomers. Further aggregation of Aβ oligomers with other proteins and cellular material develops into insoluble plaques. These insoluble plaques can bind strongly with neuronal receptors, eventually destroying the synapses and consequently spreading interneuron dysfunction [341, 228, 197].

2.2.2 Neurofibrillary tangles (NFTs) of protein tau

Tau is a microtubule-associated protein that facilitates axonal transport [173, 120, 174]. A single gene on chromosome 17 encodes six molecular isoforms of tau, which are generated through alternative splicing of its pre-mRNA. Tau is crucial for a stable intracellular microtubule network, and its function is regulated by the degree of phosphorylation. The normal adult human brain contains 2–3 moles of phosphate per mole of tau protein [173]. In AD, tau is translocated to the somatodendritic compartment and undergoes hyperphosphorylation. Subsequently, its misfolding and aggregation give rise to NFTs, disrupting the microtubule assembly, which ultimately leads to neuronal death [313].

2.2.3 Progression of Aβ and NFTs

In 1991, the seminal work by Braak and Braak examined the distribution of Aβ and NFTs in 83 brains obtained at autopsy. The work showed a characteristic progression pattern of NFTs that was categorized into six stages [43]. In stages I and II, NFTs are confined mainly to the transentorhinal region. In stages III and IV, they spread into the limbic regions such as the hippocampus. Finally, in stages V and VI they permeate the neocortex in frontal, superolateral, and occipital directions [43, 44].

In contrast to the topological distribution of NFTs, Aβ deposition progression is less predictable. The entorhinal cortex, hippocampal formation, basal ganglia, brainstem, and cerebellum are the regions commonly impacted by Aβ deposits. Braak and Braak proposed three stages of Aβ progression. In stage I, Aβ deposits are found mainly in the basal regions of the frontal, temporal, and occipital lobes. Progressively, in stage II, isocortical association areas are heavily affected, whereas the hippocampal formation is partially involved, and the primary sensory, motor, and visual cortices are yet unaffected by Aβ deposits. Lastly, in stage III, Aβ affects the primary isocortical areas, as well as the cerebellum and subcortical nuclei such as the striatum, thalamus, hypothalamus, subthalamic nucleus, and red nucleus. Subsequent work [341] summarized these stages into “isocortical”, “allocortical or limbic”, and “subcortical” categories.

In 1992, the amyloid cascade hypothesis was proposed [156, 290]. Notably, there is evidence suggesting that NFTs can develop independently of Aβ deposition, as well as evidence of a lack of symptomatic presentation despite substantial Aβ deposition in the brain [156, 290]. For instance, several studies have reported Aβ-positive subjects showing no cognitive decline, who undergo a healthy ageing process [120, 59]; conversely, several clinically diagnosed late-stage AD patients have shown a disproportionate Aβ burden as well [113]. These heterogeneous clinical presentation patterns suggest the presence of upstream causal neurodegenerative processes that produce Aβ and NFTs and/or protective mechanisms that increase the functional and cognitive resiliency of the brain for certain populations [286, 290, 331].

2.3 Neuroanatomy

The human brain, the central organ in the nervous system, consists of the cerebrum, the brainstem, and the cerebellum. The cerebrum comprises two hemispheres which are further divided into four lobes: frontal, temporal, parietal, and occipital. The outer layer of gray matter of the cerebrum, referred to as the cerebral cortex, comprises neuronal cell bodies. The folding of the cortex, the result of the migration of neural progenitors across radial glial units during neurodevelopment, manifests as ridges (gyri) and grooves (sulci) [287]. The white matter within the cerebrum consists of myelinated axons that connect different neuronal cell bodies. At the base of the cerebrum are the cerebellum and brainstem. The latter connects the rest of the brain with the spinal cord and is responsible for many of the body's autonomic functions. Underneath the protective skull, there are three layers of tissue: dura, arachnoid, and pia, collectively referred to as the meninges. The subarachnoid space is filled with cerebrospinal fluid (CSF), which further helps with the protection and support of the brain. CSF also fills the four cavities within the brain, known as ventricles, as well as the central canal of the spinal cord.

The pathological progression of AD is typically reflected by anatomical changes that begin with the entorhinal cortex and the hippocampus within the temporal lobe of the brain, followed by regions of the neocortex. The temporal lobe, specifically the medial temporal lobe (MTL) structures, plays an important role in the consolidation of information and the formation of short-term and long-term memory [311, 327]. The hippocampus, which sits through the length of the MTL and belongs to the limbic system, is a major component of the memory circuitry and is one of the first regions to be affected by AD [180]. The hippocampal complex includes the dentate gyrus, four subfields referred to as Cornu Ammonis (CA) 1-4, and the subiculum (see Fig. 2.1) [373, 214]. The information flow into the hippocampus begins with pyramidal cell axons from the entorhinal cortex that perforate the subiculum and project into the dentate gyrus. The information is then passed on to CA3 and subsequently to CA1, which projects back to the entorhinal cortex, completing the circuit. This feedback loop is an important excitatory-inhibitory mechanism for memory processing [94, 340]. Other output pathways from the hippocampus project into several cortical areas, including the prefrontal cortex and the lateral septal area.

Figure 2.1: The hippocampal formation, as drawn by Santiago Ramon y Cajal. Notations: DG: dentate gyrus; Sub: subiculum; EC: entorhinal cortex; CA: Cornu Ammonis.

Functionally, the role of the hippocampus in learning and memory has been thoroughly investigated. Historically, a remarkable case study that highlighted the significance of the hippocampus in episodic memory formation was the investigation of Henry Molaison (“patient H.M.”), who underwent bilateral medial temporal lobe resection in an attempt to control epileptic seizures. The procedure alleviated the seizures by removing the epileptogenic focus, but resulted in severe memory impairment that caused him to forget daily events nearly as fast as they occurred [311, 328]. Subsequently, a multitude of studies have associated the hippocampus and its subfields with different learning and memory processes in the brain [326, 344, 346, 131, 266]. Several studies have also reported a hippocampal role in navigation via the use of place cells that encode spatial information [272, 238]. Consequently, neuroanatomical abnormalities in the hippocampus have been associated with memory dysfunction and related disorders such as AD [195, 44, 177, 261, 263, 94]. Particularly from the clinical perspective, the hippocampus has been a region of interest with respect to biomarker development for AD and an important predictor for the classification of the diagnostic and prognostic states of an individual [176, 41, 260, 299].

In addition to regionally specific structural changes within the MTL, more global neuroanatomical changes throughout the entire cerebrum have been associated with AD. The cerebral cortex is a folded sheet of neurons with a laminar organization comprising six separate layers throughout the neocortex [241, 124]. Going from the surface towards the white matter, these include: 1) the molecular layer, 2) the corpuscular layer, 3) the pyramidal layer, 4) the granular layer, 5) the ganglionic layer, and 6) the multiform layer. The cortical ribbon is constrained by the gray/white and pial surfaces, and the distance between these two surfaces quantifies the thickness of the cortical gray matter. The thickness values vary between 1 and 4.5 mm, with an overall average of approximately 2.5 mm [124]. Cortical atrophy has been shown to correlate with cognitive decline [221, 286, 299]. Cortical atrophy is reflected in a loss of gray matter, which results in a reduction of cortical thickness. In comparison to the volumetric atrophy of temporal lobe structures, cortical thickness provides a more robust quantitative measure of AD-related neuroanatomical changes, as it is less sensitive to inter-subject variations in head or brain size that confound volumetric measurements [180, 310]. A similar issue is also observed with surface area measures of the cortex [27, 310]. Thus, cortical thickness measures are more commonly employed in individual-level diagnostic and prognostic analyses.

2.4 AD biomarkers

A biomarker is a surrogate biological measure that serves as an indicator of a normal or pathological process. Biomarkers are key components of secondary prevention strategies and clinical trials for AD. Commonly used AD biomarkers include both imaging and biofluid measures. Although several CSF-derived and radiotracer-based PET imaging markers have been studied extensively, this thesis focuses on structural MR imaging markers due to their non-invasive nature.

A prominent hypothesis explaining the temporal progression of several AD biomarkers was initially proposed in [181]. The hypothetical model was subsequently updated (see Fig. 2.2) by the authors in [182]. The original model provided a prototypical progression of AD biomarkers with the level of abnormality expressed as a function of the pathophysiological pathway. The model denoted CSF Aβ42 and amyloid PET as upstream biomarkers and structural MR-based neurodegenerative measures as downstream biomarkers, followed by clinical symptoms. The revised model expressed biomarker abnormality as an explicit function of time instead of clinical disease stage. The model also represented cognitive outcomes on a spectrum to account for the inter-subject symptomatic variability observed in the clinic. Lastly, the revised model reordered certain biomarkers: specifically, CSF Aβ42 was moved before amyloid PET, which was followed by CSF tau. FDG PET and MRI were also redrawn to represent the concurrent progression of the two. According to this model, the earliest detectable changes are caused by amyloid accumulation, typically measured from a CSF Aβ sample, making it one of the most promising biomarkers for early detection. Studies show that AD patients have reduced CSF Aβ and elevated CSF tau levels compared to cognitively normal individuals [174, 336]. Moreover, these levels of Aβ and tau are more extreme for AD patients with one or two ApoE ε4 alleles compared to patients with no ε4 alleles [339]. Although CSF biomarkers can provide early evidence of clinical decline, these measures change relatively slowly through the course of disease progression. In contrast, PET- and MR-based measures tend to be more dynamic biomarkers, providing a better characterization of disease progression and related clinical decline. Therefore, non-invasive MR-based biomarkers in particular are well suited for continuous monitoring of asymptomatic at-risk individuals.

Although many studies have adopted this model [182] as a guiding hypothesis, alternative theories have been proposed by a few studies to better explain the heterogeneity and loose coupling between the pathological burden and observable symptom severity within certain AD populations. A notable hypothesis among these studies postulates a vascular dysfunction process that alters the balance between the blood flow substrate delivery and the neuronal/glial energy demands, leading to downstream dysfunction and the disease [395, 175]. In particular, [175] present a data-driven model as a more realistic characterization of biomarker progression compared to traditional observational disease models. Although the investigation of these alternative hypotheses is crucial in order to understand the etiology of Alzheimer's disease, it is beyond the scope of this thesis, which focuses on downstream neurodegenerative processes.

2.4.1 CSF and PET markers

Characterization of the neurobiological processes associated with Aβ and NFTs and early detection of the consequent anatomical changes are essential for the development of intervention strategies that could limit or prevent downstream clinical symptoms. Typically, the presence of Aβ and NFTs can be confirmed via highly invasive lumbar punctures or brain biopsies, or during post mortem autopsy [66, 120, 37]. In the lumbar puncture procedure, a needle is inserted between the lumbar vertebrae into the subarachnoid space of the spinal canal to extract cerebrospinal fluid (CSF). Studies have shown that AD patients have reduced amounts of Aβ and elevated amounts of tau in CSF compared to cognitively normal individuals [174, 66]. Studies have also suggested that MCI patients have CSF Aβ and tau levels in between those of AD and cognitively normal individuals [315, 340].

Figure 2.2: Hypothetical model of biomarker progression in AD. Aβ is identified by CSF Aβ42 or PET amyloid imaging. Tau-mediated neuronal injury and dysfunction is identified by CSF tau or fluorodeoxyglucose-PET. Brain structure is measured by use of structural MR imaging. Aβ = β-amyloid. MCI = mild cognitive impairment. Image adopted from Jack et al. 2013 with reuse permission.

As an alternative to invasive lumbar puncture procedures, several surrogate techniques have been proposed for in vivo detection, such as positron emission tomography (PET). A PET scan acquisition involves injecting the patient with a tracer labelled with a positron-emitting radionuclide. The tracer molecules are selected to target a particular physiological process [129]. The images are acquired in several planes through the brain, which provide visual information showing the radiotracer distribution. The two nuclei commonly used in PET imaging are 11C and 18F, which have half-lives of 20 and 110 minutes, respectively. Pittsburgh compound B (PiB) is a 11C-labelled thioflavin-T derivative that binds to amyloid plaques in vivo. AD patients typically have increased PiB retention in areas known to accumulate significant amyloid deposits in comparison with healthy individuals. Cortical PiB retention is also observed in MCI patients, but to a lesser extent than in AD [183, 356]. Patients are often classified as PiB positive or negative, where a global cortical to cerebellar ratio is defined to separate the two groups. Independent studies have consistently found that approximately 30% of cognitively normal elderly individuals would be classified as PiB positive according to such criteria [183]. This suggests that PiB alone is not a sufficient marker for AD. Florbetapir (18F-AV-45) is another radiopharmaceutical compound that contains the radionuclide fluorine-18. Florbetapir has a strong affinity for amyloid proteins in the AD brain and faster in vivo kinetics [51, 65]. The longer half-life of 18F has facilitated the use of Florbetapir in several amyloid imaging studies, which demonstrate the feasibility of this compound to differentiate patients with AD and MCI from healthy controls [376, 50]. Similar to PiB-based techniques, thus far the utility of Florbetapir remains limited to qualitative amyloid imaging, and it will require significant investigation into the feasibility, tolerability, and reliability of the biomarker before it can be used for clinical diagnostic applications [50]. Given these findings, along with the undesirable radioactive nature of the tracers used in PET, MR imaging has gained more attention for the development of neuroanatomical biomarkers.
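To make the ratio-based classification described above concrete, the following is a minimal sketch of how a global cortical-to-cerebellar uptake ratio (SUVR) and a PiB-positivity label might be computed. The region names, uptake values, and the 1.5 cutoff are illustrative assumptions for this sketch only, not values taken from the studies cited above.

import numpy as np

def global_suvr(cortical_uptake, cerebellar_uptake):
    """Mean cortical tracer uptake normalized by mean cerebellar uptake."""
    return np.mean(cortical_uptake) / np.mean(cerebellar_uptake)

def classify_pib(cortical_uptake, cerebellar_uptake, cutoff=1.5):
    """Label a scan PiB-positive if its global SUVR exceeds the chosen cutoff."""
    return "PiB+" if global_suvr(cortical_uptake, cerebellar_uptake) > cutoff else "PiB-"

# Example with made-up regional uptake values for one subject
cortical = np.array([1.9, 2.1, 1.7, 2.0])   # e.g., frontal, parietal, temporal, cingulate
cerebellar = np.array([1.1, 1.0])           # cerebellar reference region
print(classify_pib(cortical, cerebellar))   # -> "PiB+" for these illustrative values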

2.4.2 MR imaging markers

Structural MR images offer a rich source of information that can measure anatomical alterations at voxel-level granularity. Structural volumetry characterizing atrophy patterns is a simple approach for utilizing this high-dimensional information towards biomarker development [134, 355]. Studies demonstrate that temporal lobe atrophy is strongly associated with AD. Histological data validate that the entorhinal cortex, hippocampus, and amygdala are particularly vulnerable structures affected by AD pathology. Several studies have investigated the association between the rate of temporal lobe atrophy, as measured by MR imaging, and current as well as future cognitive decline. Longitudinal studies have shown accelerated rates of atrophy in AD and MCI patients compared to cognitively normal groups. Specifically, entorhinal cortex and hippocampal degeneration has been established as a marker of AD and explains memory-related symptoms [43, 57, 98, 81]. Several studies have demonstrated group-wise differences in the total hippocampal volume as well as its subfields across healthy, MCI, and AD populations [180, 176, 62, 142, 299, 80, 94]. However, hippocampal atrophy alone is not a good predictor for MCI-to-AD conversion. This is potentially due to 1) substantial symptomatic and neurobiological heterogeneity within the MCI population, 2) lack of consensus regarding the anatomical definition of the hippocampus as captured by MR imaging, and 3) sensitivity of acquisition and segmentation techniques across studies [284, 178, 257, 214, 38, 136, 303]. These challenges have made the clinical translation of hippocampal volumetry and morphometric techniques for early detection and prognosis difficult. Recently, [303] showed aberrant hippocampal volumetric fluctuations in a longitudinal AD sample, cautioning against the use of hippocampal volume as a stand-alone biomarker. This in turn necessitates the development of more sensitive structural biomarkers to model disease progression at the individual level. In this pursuit, studies have explored both more global and more granular structural biomarkers [19].
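As a concrete illustration of the kind of group-wise volumetric comparison referred to above, an effect size such as Cohen's d can quantify the separation between two diagnostic groups on hippocampal volume. The sketch below uses fabricated volumes (in mm^3) purely for illustration; it is not data from the cited studies.

import numpy as np

def cohens_d(group_a, group_b):
    """Effect size between two groups using the pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

cn_volumes = [3900, 4100, 4050, 3980, 4200]   # cognitively normal group (illustrative)
ad_volumes = [3300, 3450, 3200, 3500, 3350]   # AD group (illustrative)
print(f"Cohen's d (CN vs AD): {cohens_d(cn_volumes, ad_volumes):.2f}")

Note that a large group-wise effect size does not by itself guarantee good subject-level prediction, which is the distinction drawn in the following paragraphs.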

One approach to developing these complex biomarkers involves the incorporation of multiple brain regions implicated in AD-related neurodegeneration. Using this approach, many studies have explored atrophy patterns in the cortex to identify different progression stages [224, 222, 286, 230, 299]. Another approach towards building more sensitive biomarkers involves voxel-wise or vertex-wise analysis, which provides a way to detect subtle and distributed changes that could discriminate between different clinical states of the disease [117]. Voxel-based morphometry (VBM) is a popular technique for such analysis [15]. Although it provides an extremely powerful tool to perform group-wise comparisons, it cannot provide individual-specific measures that would enable the use of such a biomarker towards individual diagnosis or prognosis.

The use of high-dimensional information from MR images towards the development of complex biomarkers necessitates the employment of multivariate statistical techniques for quantitative representation of neurodegeneration. Moreover, different validation regimens need to be applied to assess biomarker performance when used towards group-wise versus individual-level modeling and prediction tasks. These statistical approaches in the computational neuroscience domain are described in Section 2.6.

2.5 MR-based Neuroimaging

MR-based neuroimaging can provide qualitative and quantitative information regarding brain structure and function. Advances in MR imaging technology over the past decade have opened new avenues for mental health research utilizing neuroimaging data. In typical use, MR techniques are used for imaging soft tissue, which allows researchers to investigate structural and functional characteristics of the human brain in vivo. MR image acquisition is a non-invasive process, safe for humans, that produces detailed three-dimensional anatomical images without the use of harmful radiation [88, 268, 226]. Neuroimaging processing pipelines consist of multiple sequential tasks prior to statistical analysis. The pipeline begins with the acquisition of MR images comprising biases and artifacts induced by the hardware and the acquisition protocol itself. These artifacts are subsequently “corrected for” using appropriate image processing techniques. Images are also cropped to remove areas beyond the region of interest. After these preprocessing steps, images are used in subsequent statistical analyses. Structural MR images form the basis of this thesis; as a result, these steps are described in detail in the following sections.

2.5.1 MR acquisition

An MR imaging scanner consists of a main magnet, which generates a strong primary magnetic field; a radiofrequency (RF) coil, which transmits and receives radiofrequency energy to and from the tissue; and gradient coils (typically in the x, y, and z directions), which generate field gradients that enable the frequency and spatial encoding used for signal source localization. The primary magnetic field, denoted as B0, causes protons from the abundantly present water molecules in the human body to align with the field. Conventionally, this field defines the coordinate frame of reference, with B0 oriented along the z-axis. B0 causes protons to precess at a frequency, known as the Larmor frequency, that is proportional to the field strength:

ωL = γB0 (2.1)

where the gyromagnetic ratio γ is characteristic of the nuclei under consideration. The alignment along the z-axis is perturbed out of equilibrium by the application of a radiofrequency (RF) pulse perpendicular to this axis. Once the RF field is turned off, the sensors can detect the energy released by the protons as they realign with the primary magnetic field. The amplitude of this signal is maximal immediately following the RF pulse, and decays with time. By employing magnetic field gradients, the signal source can be spatially localized by inducing differential Larmor frequencies along the z-axis. Additionally, gradient-induced phase encoding is used to resolve the signal location in the xy plane. The sampling of this frequency- and phase-encoded signal generates a complex-valued three-dimensional array in a spatial frequency domain referred to as k-space. Finally, the image itself is reconstructed using the Fourier transform of this k-space representation [268, 350]. Several parameters of the acquisition protocol influence the quality of the image (i.e. signal-to-noise ratio, resolution, field of view, etc.), but typically, image quality improves with higher B0, which is commonly 1.5T or 3T, although 7T scanners have become commercially available. The signal is quantified using a time constant characterizing signal decay. The excited protons generate magnetization components along the z-axis (longitudinal) as well as in the xy (transverse) plane. A set of macroscopic equations to calculate the nuclear magnetization (M) as a function of time was first introduced by Bloch in 1946 [36], and is written in matrix form as follows:

\[
\frac{d}{dt}\begin{pmatrix} M_x \\ M_y \\ M_z \end{pmatrix} =
\begin{pmatrix} -1/T_2 & \gamma B_z & -\gamma B_y \\ -\gamma B_z & -1/T_2 & \gamma B_x \\ \gamma B_y & -\gamma B_x & -1/T_1 \end{pmatrix}
\begin{pmatrix} M_x \\ M_y \\ M_z \end{pmatrix} +
\begin{pmatrix} 0 \\ 0 \\ M_0/T_1 \end{pmatrix} \qquad (2.2)
\]

Where γ is the gyromagnetic ratio, and T1 and T2 are the time constants associated with the decay of the signal corresponding to the longitudinal and transverse components, respectively. The recovery of longitudinal magnetisation as the protons align with B0 is known as spin-lattice (T1) relaxation.

\[
M_z(t) = M_0\left(1 - e^{-t/T_1}\right) \qquad (2.3)
\]

Decay of transverse magnetization during realignment is known as spin-spin (T2) relaxation.

\[
M_{xy}(t) = M_0\, e^{-t/T_2} \qquad (2.4)
\]

Where M0 is the equilibrium nuclear spin magnetization parallel to the external magnetic field B0.

These time constants depend on the surrounding environment, i.e. the biological tissue. Thus the resultant MR contrast depends on these time constants as well as on the proton density of each tissue type.

A basic MR acquisition sequence, referred to as spin-echo, comprises two RF pulses. First, a 90 degree pulse tips the net magnetization into the transverse plane. Once the RF transmitter is turned off, the transverse magnetization (Mxy) decays, and the longitudinal magnetization recovers as the protons align themselves with B0. The protons re-radiate the absorbed energy, which can be detected by the receiver coils. The signal received in the transverse plane decays faster than T2 would predict. This is modeled by a modified time constant T2*, which comprises pure T2 decay as well as static inhomogeneities in the magnetic field that accelerate the dephasing process. The 90 degree pulse is followed by a 180 degree pulse in order to rephase the spins in the transverse plane and reverse the static field inhomogeneities. The signal is measured after phase coherence is achieved. This time epoch is called the echo time (TE), which is the time between the 90 degree pulse and the MR signal sampling. The 180 degree pulse is applied at time TE/2. This process is repeated several times, and the time between two 90 degree pulses is referred to as the repetition time (TR). Because each tissue has different T1 and T2 values, MR contrast can be modified with different configurations of TE and TR. With short TR and TE, contrast depends primarily on tissue-specific differences in longitudinal magnetization recovery, i.e. T1. This is referred to as a T1-weighted sequence. With longer TR and TE, T1 differences diminish, and tissue contrast results mainly from the T2 properties of the tissue. This is referred to as a T2-weighted sequence. A configuration of long TR and short TE produces a proton density (PD) image, in which the contrast is a function of the differences in the proton density of the tissues, as the long TR allows near-complete longitudinal recovery and the short TE minimizes transverse decay differences. Tissues with long T1 and T2 (e.g. water) appear dark in a T1-weighted image and bright in a T2-weighted image. Conversely, tissues with a short T1 (e.g. fat) appear bright in the T1-weighted image and relatively dark in the T2-weighted image.
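To make the TR/TE trade-offs above concrete, the following minimal Python sketch evaluates the standard spin-echo signal approximation S ≈ PD·(1 − e^(−TR/T1))·e^(−TE/T2) for three tissue classes. The proton-density and relaxation values are illustrative round numbers (not measurements from this thesis), so the printed signals only demonstrate the qualitative weighting behaviour described above.

```python
import numpy as np

# Approximate spin-echo signal model: S = PD * (1 - exp(-TR/T1)) * exp(-TE/T2).
# Tissue T1/T2 values (ms) below are illustrative round numbers, not measured data.
tissues = {
    "white matter": {"pd": 0.7, "t1": 600.0, "t2": 80.0},
    "grey matter":  {"pd": 0.8, "t1": 950.0, "t2": 100.0},
    "CSF":          {"pd": 1.0, "t1": 4000.0, "t2": 2000.0},
}

def spin_echo_signal(pd, t1, t2, tr, te):
    """Relative spin-echo signal for one tissue given TR and TE (all in ms)."""
    return pd * (1.0 - np.exp(-tr / t1)) * np.exp(-te / t2)

# Short TR/TE -> T1-weighted; long TR/TE -> T2-weighted; long TR, short TE -> PD-weighted.
for name, (tr, te) in {"T1w": (500, 15), "T2w": (4000, 100), "PDw": (4000, 15)}.items():
    signals = {t: spin_echo_signal(p["pd"], p["t1"], p["t2"], tr, te) for t, p in tissues.items()}
    print(name, {t: round(s, 3) for t, s in signals.items()})
```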

2.5.2 MR image preprocessing

Raw image acquisitions usually comprise noise and artifacts that need to be accounted for prior to downstream computational analysis [29, 321, 347]. In addition, it is also important to extract the brain tissue from the raw image and discard the background, skull, and other regions irrelevant to the computational analysis of interest. These preprocessing steps alter the signal-to-noise and contrast-to-noise ratios (SNR, CNR) of the image and thus influence subsequent image analysis, such as brain segmentation. Therefore it is crucial to carefully design preprocessing pipelines to achieve accurate as well as reproducible results, especially within multi-site and multi-study experimental paradigms.

MR image denoising

The noise confounds in the MR signal result from thermal vibrations of ions and electrons in the receiving coil and the tissue, manifested as intensity fluctuations [15, 392]. The noise in magnitude MR images generally follows a Rician or non-central Chi distribution. In theory, the SNR can be improved by averaging multiple repeatedly acquired images. However, this requires substantially more acquisition time, which is not feasible in practice. Another simple approach to mitigate high-frequency noise is to use a low-pass Gaussian filter, which essentially averages neighbouring pixels [15]. However, this results in blurred images, diminishing high-frequency spatial information such as structural boundaries. Several advanced denoising methods have been proposed and applied, including anisotropic diffusion filters [211], wavelet-based filters [8, 387], and adaptive non-local means [47, 240, 84]. We note that, based on our assessment of the quality of the publicly available, standardized datasets utilized in this thesis, we did not include a denoising step in our preprocessing pipeline for any of the three projects.

Intensity inhomogeneity correction

The low-frequency, intensity non-uniformity artifacts are referred to as bias, inhomogeneity, illumination nonuniformity, or gain field. The bias field, in an MRI context, causes intensity inhomogeneity within an image, resulting in a smooth signal variation within tissue of the same type. This artifact can be caused by spatial inhomogeneity in the magnetic field, spatially varying receiver coil sensitivity, and interaction of the magnetic field with human tissue. The magnitude of this effect is dependent on the magnetic field strength; thus, images obtained using higher-field scanners will be more susceptible to image inhomogeneities [96]. Among the several bias correction algorithms, nonparametric nonuniform normalization (N3) [321] and its subsequent improved revision N4 [347] are among the most commonly used techniques. N3 is an iterative algorithm which maximises the high-frequency content of the tissue intensity distribution of the corrected scan. The algorithm uses a b-spline approximation to obtain a smoothed estimate of the non-uniformity field of the scan and iteratively removes this from the original scan until the non-uniformity estimate converges. N3 does not require prior knowledge of tissue types, and can be applied at an early stage in automated image analysis. N4, a more recent, improved version of this algorithm, modifies the iterative optimization technique used in N3 using a multiresolution iterative optimization framework. Specifically, in the N4 version, the b-spline is initially fit at a lower resolution, and the resolution is hierarchically increased to achieve the best fit of the bias field. The bias field correction is performed in an iterative manner such that the corrected image from the first step is used as input in the next iteration, and so on, to estimate the residual bias field each time, allowing for iterative incremental updates of the bias field. The N4 correction step was included in the preprocessing pipeline for the three projects in this thesis. An example of an N4-corrected T1 MR image is shown in Fig. 2.3.
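As a rough illustration of where this correction sits in a pipeline, the sketch below applies N4 through the SimpleITK interface. This is not necessarily the exact tooling used in this thesis (the pipelines here relied on the ANTs/ITK implementations); the file names, mask strategy, and iteration schedule are placeholder assumptions.

```python
import SimpleITK as sitk

# Placeholder file name; any T1-weighted volume readable by SimpleITK would do.
raw = sitk.ReadImage("t1_raw.nii.gz", sitk.sitkFloat32)

# A rough foreground mask restricts the bias-field estimate to tissue voxels.
mask = sitk.OtsuThreshold(raw, 0, 1, 200)

corrector = sitk.N4BiasFieldCorrectionImageFilter()
corrector.SetMaximumNumberOfIterations([50, 50, 50, 50])  # one entry per resolution level
corrected = corrector.Execute(raw, mask)

sitk.WriteImage(corrected, "t1_n4.nii.gz")
```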

Brain extraction

Brain extraction, or masking of non-brain tissue such as the skull, fat, and neck regions, is another common preprocessing step. Such cropping of the region of interest improves subsequent image analysis. Brain extraction involves binary classification of each voxel from the raw scan as brain or non-brain tissue, where the brain comprises grey matter, white matter, and cerebrospinal fluid (CSF) [96]. A common method for brain extraction is BET (Brain Extraction Tool). BET uses a deformable model of a sphere’s surface, which expands one vertex at a time until the boundary of the brain’s surface is reached. A more recently developed patch-based segmentation tool, BEaST (Brain Extraction based on nonlocal Segmentation Technique) [119], uses a large library of priors to perform nonlocal segmentation in a multi-resolution framework, employing varying patch sizes to improve segmentation accuracy and computation time. BEaST has been shown to have significantly higher accuracy than BET, especially when analysing data from psychiatric populations who may have pathology [119]. The BEaST extraction step was included in the preprocessing pipeline for the three projects in this thesis. Fig. 2.3 shows an example of a T1 MR image with an extracted brain region using BEaST.
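For completeness, a minimal brain-extraction call is sketched below using the FSL BET interface exposed through Nipype. Note that the thesis pipeline itself used BEaST, so this is only an assumed stand-in for the general step, with placeholder file names and an FSL installation assumed.

```python
from nipype.interfaces import fsl

# Assumes FSL is installed and "t1_n4.nii.gz" exists; file names are placeholders.
bet = fsl.BET()
bet.inputs.in_file = "t1_n4.nii.gz"
bet.inputs.out_file = "t1_brain.nii.gz"
bet.inputs.mask = True   # also write the binary brain mask
bet.inputs.frac = 0.5    # fractional intensity threshold (default value)
result = bet.run()

print(result.outputs.mask_file)  # path to the extracted brain mask
```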

2.5.3 Image registration

Image registration is an alignment problem that deals with transforming raw data into a common frame of reference [394, 17]. In medical imaging, registration is crucial for establishing comparability across different individuals, timepoints, and modalities. For structural MR images, this typically implies estimating a one-to-one mapping between two image spaces to establish anatomical correspondence. Registration approaches can be divided by choice of feature space (i.e. pixels/voxels, landmarks), transformation process (i.e. affine, nonlinear), degrees of freedom, and similarity metric (i.e. cross-correlation, mutual information) [209]. Mathematically, the registration of image J into the space of image I can be formulated as an optimization problem with the goal of finding the optimal transformation as follows.

\[
T^{*} = \underset{M \in \Omega}{\operatorname{argmax}}\; S(I, J, M) \qquad (2.5)
\]

where:
I = reference image
J = image to be transformed
M = transformation
Ω = search space for transformations
S = similarity measure
T* = optimal transformation

In the case of 3-dimensional affine transformations, the mapping from each point (x1, x2, x3) of an image to a point (y1, y2, y3) in the transformed space can be expressed as:

Figure 2.3: T1-weighted MR image before and after preprocessing stages. The image is randomly selected from the ADNI2 cohort used in the analysis in this thesis.

y1 = m11x1 + m12x2 + m13x3

y2 = m21x1 + m22x2 + m23x3

y3 = m31x1 + m32x2 + m33x3

which can be represented concisely as the matrix multiplication (y = Mx):

\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ 1 \end{pmatrix} =
\begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ 1 \end{pmatrix} \qquad (2.6)
\]

For a rigid-body transformation, M is typically decomposed in terms of translation (T) and rotation (R) matrices as M = TR. This can be further parameterized as follows:

\[
T = \begin{pmatrix} 1 & 0 & 0 & q_1 \\ 0 & 1 & 0 & q_2 \\ 0 & 0 & 1 & q_3 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (2.7)
\]

and

\[
R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos q_4 & \sin q_4 & 0 \\ 0 & -\sin q_4 & \cos q_4 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \cos q_5 & 0 & \sin q_5 & 0 \\ 0 & 1 & 0 & 0 \\ -\sin q_5 & 0 & \cos q_5 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \cos q_6 & \sin q_6 & 0 & 0 \\ -\sin q_6 & \cos q_6 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (2.8)
\]

The estimation of the translation parameters (q1, q2, and q3) is trivially given by the last column of M, whereas the rotational parameters are computed from the matrix multiplication as follows:

\[
R = \begin{pmatrix}
\cos q_5 \cos q_6 & \cos q_5 \sin q_6 & \sin q_5 & 0 \\
-\sin q_4 \sin q_5 \cos q_6 - \cos q_4 \sin q_6 & -\sin q_4 \sin q_5 \sin q_6 + \cos q_4 \cos q_6 & \sin q_4 \cos q_5 & 0 \\
-\cos q_4 \sin q_5 \cos q_6 + \sin q_4 \sin q_6 & -\cos q_4 \sin q_5 \sin q_6 - \sin q_4 \cos q_6 & \cos q_4 \cos q_5 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix} \qquad (2.9)
\]
which yields,

\[
q_5 = \operatorname{asin}(R_{13})
\]

\[
q_4 = \operatorname{atan2}\!\left(\frac{R_{23}}{\cos q_5},\; \frac{R_{33}}{\cos q_5}\right), \qquad
q_6 = \operatorname{atan2}\!\left(\frac{R_{12}}{\cos q_5},\; \frac{R_{11}}{\cos q_5}\right) \qquad (2.10)
\]
where atan2 is the four-quadrant inverse tangent. Thus, a rigid-body transformation can be defined with 6 parameters (qi), whereas for the more general case of an affine transformation, which comprises scaling and shearing in addition to the translation and rotation operations, the linear part consists of 9 parameters. These parameters need to be optimized to maximize the similarity between the transformed and reference images. In the case of nonlinear registration, the linear transformation stage is typically followed by a deformation operation. This operation computes a nonlinear mapping between the affine transform of J and the reference image I that further maximizes the similarity between I and the transformed J. A multitude of affine and nonlinear registration approaches, along with their implementations in image preprocessing pipelines, have been proposed over the last decade. ANIMAL [68], FLIRT [186], HAMMER [317], ART [13], and Mindboggle [200] are some of the commonly used methods. Table 2.2 provides an algorithmic comparison of these methods based on transformation type, degrees of freedom, and similarity metric used. A comprehensive comparative study by Klein et al. [200] suggests that ART and symmetric image normalization (SyN) deliver consistently high performance across various subject populations.
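Before turning to Table 2.2, the rigid-body parameterization above can be verified numerically. The short sketch below composes R from (q4, q5, q6) following Eq. 2.8 and recovers the angles with Eq. 2.10; it is a self-contained illustration (valid away from cos(q5) = 0), not part of any registration package.

```python
import numpy as np

def rotation_matrix(q4, q5, q6):
    """Compose R = Rx(q4) @ Ry(q5) @ Rz(q6), following the convention of Eq. 2.8."""
    rx = np.array([[1, 0, 0], [0, np.cos(q4), np.sin(q4)], [0, -np.sin(q4), np.cos(q4)]])
    ry = np.array([[np.cos(q5), 0, np.sin(q5)], [0, 1, 0], [-np.sin(q5), 0, np.cos(q5)]])
    rz = np.array([[np.cos(q6), np.sin(q6), 0], [-np.sin(q6), np.cos(q6), 0], [0, 0, 1]])
    return rx @ ry @ rz

def decompose_rotation(R):
    """Recover (q4, q5, q6) from R using Eq. 2.10 (valid away from cos(q5) = 0)."""
    q5 = np.arcsin(R[0, 2])
    c5 = np.cos(q5)
    q4 = np.arctan2(R[1, 2] / c5, R[2, 2] / c5)
    q6 = np.arctan2(R[0, 1] / c5, R[0, 0] / c5)
    return q4, q5, q6

angles = (0.1, -0.2, 0.3)
print(np.allclose(decompose_rotation(rotation_matrix(*angles)), angles))  # True
```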

Algorithm (year) | Transformation | Degrees of Freedom | Similarity
ANIMAL (1997) | Local translations | 69K | Cross correlation
FLIRT (2001) | Linear, rigid-body | 9, 6 | Normalized correlation ratio
HAMMER (2002) | Hierarchical deformation | n/a | Geometric moment invariants
ART (2005) | Non-parametric, homeomorphic | 7M | Normalized cross correlation
SPM5 - Unified (2005) | Discrete cosine transforms | 1K | Generative segmentation model
SPM5 - DARTEL (2007) | Finite difference model of velocity | 6.4M | Multinomial model
SyN (2008) | Bi-directional diffeomorphism | 28M | Cross correlation
Diffeomorphic Demons (2009) | Non-parametric, diffeomorphic | 21M | Sum of square differences

Table 2.2: Comparison of commonly used linear and non-linear registration methods

The modern methods, with a large number of parameters (or degrees of freedom), tend to perform better at the expense of additional computational cost. SyN belongs to a family of diffeomorphic image registration algorithms [345]. Diffeomorphic approaches are symmetric with respect to the image inputs (source and target) and allow probabilistic similarity measures [21]. These are usually contrasted against inverse-consistent image registration approaches previously popularized by Thirion’s Demons algorithm [343]. However, the latter approaches can only approximate symmetry and inverse transformations with respect to the input images. The authors showed that SyN’s symmetric diffeomorphic optimizer outperforms the inverse-consistent image registration with the elastic optimizer used in HAMMER [317]. SyN is part of the Advanced Normalization Tools (ANTs) within the Insight ToolKit (ITK). ITK is a popular computational framework for customizing MR registration pipelines [20, 22], and was employed to process the datasets used in this thesis.
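As an illustration of how such a registration might be invoked in practice, the sketch below uses the ANTsPy wrapper around the ANTs SyN implementation discussed above. The thesis processing did not necessarily use this Python interface, and the file names are placeholders.

```python
import ants

# Placeholder file names; assumes the ANTsPy package ("antspyx") is installed.
fixed = ants.image_read("mni_icbm152_t1.nii.gz")
moving = ants.image_read("subject_t1_n4.nii.gz")

# Affine initialization followed by the SyN diffeomorphic stage.
reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyN")

warped = reg["warpedmovout"]               # moving image resampled into the fixed space
forward_transforms = reg["fwdtransforms"]  # affine + warp field file paths
ants.image_write(warped, "subject_in_mni.nii.gz")
```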

2.5.4 Image segmentation

The process of image segmentation refers to parcellating pixels or voxels into labelled salient regions. Segmentation provides meaningful representation of an image that facilitates quantitative analysis. In MR imaging, segmentation is commonly performed at different levels of granularity as well as anatomical categories. Several classes of segmentation methods exist, including manual segmentation, intensity- based methods, and atlas-based methods.

Manual segmentation

The gold standard for anatomical segmentation identifying various cortical and subcortical structures is defined by an expert human rater through a manual delineation process. This manual delineation is a tedious and time-consuming process, and also introduces inter- and intra-rater variability into the segmentations. Nevertheless, several protocols have been proposed in an effort to standardize the process and mitigate human biases [195, 284, 98, 373, 384, 12]. Although manual labeling of large datasets is infeasible, these protocols have produced several moderately sized labeled datasets that serve as reference and validation for automated techniques.

Automatic segmentation

In the last two decades, rapid progress has been made in automatic techniques to improve the performance and efficiency of cortical and subcortical segmentations. The earlier approaches classified healthy brain tissue into grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF) broadly based on the differential intensity profiles of each tissue type [16]. Nevertheless, thresholding these intensity distributions to assign each voxel a discrete categorical label is highly subjective and not trivial. Alternatively, several region growing, classification, and clustering approaches have been proposed for intensity-based segmentation [154]. Region growing requires selection of seed points (voxels) that belong to a region of interest. The algorithm then examines the local neighbourhood intensities and assigns labels based on a predefined similarity criterion. Region growing methods are suitable for structures with large connected regions, such as brain vessels, tumors, etc. [365].

The classification methods make use of a training set of labeled images to automatically learn the mapping between intensity profiles and corresponding categorical labels. One of the simplest nonparametric classifiers used towards segmentation is k-nearest neighbour, where voxels are classified according to a majority vote of the closest training data [362]. Another commonly used parametric approach is the Bayesian classifier. During training, a Bayesian classifier models the probabilistic relationship between the image intensities and the class labels. A new image then gets assigned labels using an inference technique, such as maximum a posteriori estimation, based on Bayes’ rule. These types of classifiers are commonly implemented in an expectation-maximization framework in several MR segmentation software packages such as SPM [18], FAST [393], FreeSurfer [125], and 3DSlicer [281].

In contrast with classification methods, clustering methods belong to the unsupervised learning paradigm and segment images into voxel clusters with similar intensities. These methods usually rely on an iterative process that updates the voxel-cluster membership and the estimation of the tissue-intensity mapping for the image to be segmented. Some of the commonly used clustering methods include k-means [67], fuzzy C-means [4], and expectation-maximization methods [276]. Similar to classification methods, clustering methods typically do not incorporate spatial neighbourhood information, making them vulnerable to noise and intensity inhomogeneities. Although several extensions have been proposed to mitigate this issue [97, 229], atlas-based approaches have become the more popular choice for leveraging prior anatomical knowledge for localization and identification of several brain regions.

The atlas-based approaches are extremely powerful in their ability to transfer a priori spatial anatomical information to the new image during segmentation [338, 70, 314]. Traditionally, atlas-based techniques use a single template derived from manual segmentation that serves as a reference atlas for automated techniques. This atlas, in a given stereotaxic space, provides spatially localized prior probabilities of voxel membership to a certain tissue or structure. An image to be segmented is aligned to this atlas via affine and then nonlinear registration techniques.
Once the new image and the atlas are in the same reference frame, all the label information can be propagated from the atlas to the new image via the transform information from the registration step. The performance of such segmentation is consequently contingent upon the quality of the registration [90, 138, 54]. Several methods have been proposed to refine the post-registration segmentation quality by unifying these two processes [18, 281].

Multi-atlas label-fusion based segmentation

More recently, an alternative approach known as “multi-atlas label-fusion (MALF)”, which uses multiple atlases, has shown great success towards segmentation [363, 160, 302, 82, 360, 278, 153, 32]. Briefly, MALF methods begin with a set of manually labeled images, referred to as atlases, which are registered to the new image based on intensity values to enforce spatial correspondence. Then the labels of the anatomical structures under consideration are propagated from each atlas to the new image. This provides a label distribution at each voxel based on the anatomical labeling of the atlases. This distribution is converted into a single categorical label value via a label-fusion method of choice, such as a majority vote [72, 55, 278] (see the sketch below). Several variants of MALF have been proposed that provide strategies for atlas selection, atlas weighting, and local patch-based methods, as well as optimization algorithms for label fusion. Atlas selection and weighting methods use a similarity metric to identify a subset of atlases for the image to be labeled while minimizing the discrepancies between them [201, 378, 9, 375, 72]. The patch-based methods tackle the segmentation problem at a local neighbourhood scale instead of across the entire image. These methods aim to leverage redundancy present in the image to naturally inflate the number of examples considered during label estimation [82, 296]. The optimization methods during the label-fusion stage focus on maximizing agreement within several candidate segmentations from multiple atlases, and on meeting anatomically plausible spatial constraints [363, 160, 236, 302, 360, 32]. These label-fusion methods in the context of hippocampal segmentation are discussed in detail in Chapter 3.
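The simplest fusion rule, the per-voxel majority vote mentioned above, can be written in a few lines of NumPy. The sketch below assumes the atlas labels have already been propagated (resampled) into the target space; the array shapes and toy data are illustrative only.

```python
import numpy as np

def majority_vote_fusion(propagated_labels):
    """Fuse candidate segmentations by per-voxel majority vote.

    propagated_labels: array of shape (n_atlases, X, Y, Z) holding the integer label
    assigned by each atlas after registration to the target image.
    Returns the per-voxel modal label; ties resolve to the smallest label value.
    """
    labels = np.unique(propagated_labels)
    # Count, for every voxel, how many atlases assigned each candidate label.
    counts = np.stack([(propagated_labels == lab).sum(axis=0) for lab in labels])
    return labels[np.argmax(counts, axis=0)]

# Toy example: 5 atlases voting on a 2x2x1 target (0 = background, 1 = hippocampus).
votes = np.random.randint(0, 2, size=(5, 2, 2, 1))
print(majority_vote_fusion(votes))
```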

Hippocampal segmentation

As mentioned in Section 2.2, the hippocampus has been a region of great interest for AD research. Therefore, accurate delineation of the hippocampus from structural MR images has received a lot of attention over the past decade. MR-based atlases are typically derived using group-wise registration techniques that capture neuroanatomical variability as well as commonalities within a group of subjects. Several methods also make use of reconstructed, warped histological data to enhance visual information lacking in MR images [53, 3]. Differences in the anatomical definition of the hippocampus as captured by MR imaging [206], intersubject variability in hippocampal shape, and heterogeneity of MR acquisition parameters (resolution, contrast, etc.) have led to the development of several manual segmentation protocols for identifying the whole hippocampus [284, 385], as well as its subfields [373, 384, 213] and white-matter regions [12] (see Figs. 2.4, 2.5). The reliability of manual tracing protocols, based on intra-rater Dice overlaps, varies depending on the granularity of the segmentations. For the whole hippocampus, it ranges from 0.79 to 0.92, whereas for subfields the ranges are CA1: 0.78-0.88, CA2/CA3: 0.70-0.85, and CA4/dentate gyrus: 0.80-0.84 [373, 385, 351]. There has been a significant push towards harmonization of these different protocols in efforts to facilitate clinical applications based on hippocampal morphometry [38, 109, 178]. A recent outcome of these efforts is a harmonized protocol (HarP) for whole hippocampal segmentation that shows high reproducibility on MR images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [137]. Although manual tracing is considered the gold standard of hippocampal segmentation, it requires significant time and resource investment from expert raters. Thus, for practical purposes, accurate automated segmentation methods are critical for hippocampal volumetric and morphometric studies as well as potential clinical applications. Several proposed automated techniques have also been applied towards hippocampal segmentation, and have reported performance comparable to the manual gold standard [363, 236, 80, 360, 278, 32, 351].

Figure 2.4: 3T in vivo high-resolution atlas of the hippocampal subfields [373]

Figure 2.5: 3T in vivo high-resolution atlas of the hippocampal subfields and white-matter structures [12]

Specific to hippocampal segmentation, the similar intensity values of grey matter in the hippocampus and its neighbouring structures, such as the amygdala, caudate nucleus, and thalamus, complicate the boundary definitions [125]. This motivates the use of atlases to incorporate prior anatomical knowledge in order to resolve the ambiguous intensity profiles of neighbouring structures. Hence the MALF approaches have shown great success towards accurate hippocampal segmentation. Sabuncu et al. propose a generative framework for label-fusion algorithms by probabilistically modeling the relationship between atlas and target images [302]. Many MALF-based techniques can be compared within this framework. Another notable work by Coupe et al. extends the MALF approach to a patch-based procedure that captures similarity between subsets of voxels in the atlas and target images to assign structural labels [82]. A similar idea of incorporating voxel-neighbourhood similarity information towards label fusion is used by [189] as an extension of the classical STAPLE [363] algorithm to address local versus global image matching problems. Further extending the similarity comparison techniques to consider pairwise dependencies within the atlas pool, [360] propose a joint label-fusion approach to mitigate systematic errors resulting from similar atlases during label fusion. In efforts to minimize the number of atlases required for MALF approaches, [278] propose a bootstrapping method to boost the number of labeled templates from a small atlas pool without loss of segmentation performance. Another interesting approach by [153] leverages machine-learning classifiers and a local labelling strategy to estimate the target image segmentation. A comparative study of these state-of-the-art methods, along with other published works, reports performance based on Dice overlap ranging from 0.64 to 0.91 [100]. The common factors contributing to the performance variation are 1) choice of gold-standard segmentation (manual segmentation protocol), 2) automation level (semi- or fully automatic), and 3) choice of test cohort, which determines the sample size, demographics, and image acquisition parameters. Nevertheless, several approaches have reported consistently high Dice scores (> 0.88) [82, 360, 278] on large cohorts (N > 60), encouraging their use towards large-scale hippocampal volumetric and morphometric studies. Chapter 3 focuses on hippocampal segmentation methods, where these techniques are discussed in detail along with the description and validation of the label-fusion method proposed as part of this thesis.

Several automated segmentation protocols have been utilized in the analysis of AD subject populations, particularly for developing hippocampal, or hippocampal subfield, volumes as a discriminative biomarker between different diagnostic stages [180, 62, 74, 194, 142, 132, 163]. Studies have shown volumetric differences between cognitively normal, MCI, and AD groups; however, validation of significant differences within subgroups of MCI, i.e. early and late MCI or stable or declining MCI, has proven to be challenging. The use of hippocampal segmentation as a biomarker is discussed in detail in Section 2.6, as well as in Chapters 3, 4, and 5.

2.5.5 Cortical surface estimation

Cortical thickness and surface area are commonly used metrics for examining neuroanatomical properties and alterations of the cerebral cortex as captured by MR images. Cortical thickness and surface area are known to reflect differential neurobiological processes as well as genetic influences [273, 287]. The layered organization of the cortex can be parsed into columnar units [258]. In this arrangement, cortical thickness and surface area are postulated to reflect the number of cells within a cortical column and the number of columns themselves, respectively [287]. Furthermore, cortical thickness is thought to reflect dendritic arborisation and pruning [171], and surface area is thought to reflect cortical folding and gyrification. The developmental trajectories of these two measures depend on genetic factors, which impact the division of progenitor cells in the periventricular area during embryogenesis [58, 273]. Thus, investigation of these two measures and how they separately contribute to cortical architecture can provide important information about neuroanatomical development and the potential underlying cellular mechanisms, as well as about the neuroanatomical correlates of various diseases and neuropsychiatric conditions [224, 286, 99, 106, 310]. Advances over the last twenty-plus years have made accurate, automated estimation of cortical thickness from MR images possible [124, 237, 221], allowing for detailed, regionally specific analysis of the cerebral cortex. Cortical thickness estimates are derived by linearly registering images to a model in stereotaxic space, classifying the brain into grey matter, white matter, and CSF, and defining the boundaries of the white matter and pial surfaces. Inner and outer surfaces are then extracted, and the distance between these two surfaces at a given point is calculated, which represents the cortical thickness at that vertex [124, 221]. Two of the most widely used tools for estimating cortical thickness values are CIVET [221] and FreeSurfer [124]. These algorithms differ in the manner in which the cortical surfaces are reconstructed.

CIVET uses the Constrained Laplacian Anatomical Segmentation using Proximities (CLASP) method [196], in which the pial surface is expanded from the white matter surface to the GM-CSF boundary along a Laplacian field. The tlink method, which calculates the distance between corresponding vertices on the inner and outer surfaces, is then used to estimate the cortical thickness. The overall CIVET process on a sample image is shown in Fig. 2.6. In FreeSurfer, a deformable mesh is used to reconstruct the inner and pial surfaces, and the cortical thickness is estimated as the shortest distance between the two surfaces at any point in the cortex. An alternative approach for estimating cortical thickness leverages voxel-based methods, which do not require deformable mesh models. However, these methods are more sensitive to voxel sizes and partial volume effects [187, 89]. A head-to-head study [288] comparing CIVET (v1.1.9) and FreeSurfer (v5.3.0) using an AD cohort demonstrated that both pipelines offer similar performance, with CIVET providing slightly higher sensitivity to atrophy patterns at the MCI stage. The cortical surface extraction for all the subjects in projects 2 and 3 was performed using CIVET 1.1.12.

Figure 2.6: CIVET stages for extracting cortical surface. The image is randomly selected from the ADNI2 cohort used in the analysis in this thesis. 1) linear/affine registration of the MR images from native to stereotaxic space, using the average MNI ICBM152 model as the target of registration, 2) tissue classification into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), 3) the boundary between cortical GM and subcortical WM is extracted using a deformable surface model, starting from an ellipsoid that contracts to take the shape of the white matter mask. The pial surface, or the boundary between the cortical GM and the extra-cortical CSF expands outwards from the WM surface to the CSF, 4) The surfaces are registered to the MNI ICBM152 surface template for comparability. The cortical thickness is computed by evaluating the distance, in mm, between the original WM and GM surfaces transformed back to the native space of the original MR images, then interpolated onto the surface template.

2.6 Computational neuroscience and machine-learning

Statistical analysis of neuroimaging data can be used to describe and explain brain structure and function at the population as well as the individual level. Conceptually, these inference analyses can be divided into decoding and encoding problems [265]. The classical univariate and mass-univariate approaches with structural (or functional) phenotypes as dependent variables belong to the family of encoding models.

The encoding objective involves modeling the dependency between these phenotypes and the population under study via regression or group-wise difference analysis within the general linear model (GLM) framework. The encoding effect can be hypothesized at different granularities, such as regional volumes or voxel-specific measures. In the context of AD, several studies have investigated structural differences across the diagnostic groups [177, 57, 125, 41, 224, 286, 80]. Many volumetric, cortical thickness, and voxel-wise studies have established strong evidence for significant differences in medial temporal lobe and cortical surface regions, particularly between AD and matched healthy subjects [299, 180, 176, 355, 117, 222]. Studies investigating MCI have shown varying patterns of structural differences, potentially due to the large heterogeneity in symptom presentation in this group.

The decoding models typically consist of classification or regression tasks that use structural (or functional) phenotypes as predictors. Typically, the computational objective here is to predict the individual state (sensory, motor, cognitive, diagnostic, etc.). Recently popular machine-learning based approaches to statistical analysis have shown remarkable success in tackling multivariate decoding problems [158, 128, 95, 202, 116, 253]. The decoding models also hypothesize predictive features at different granularities, represented by volumetric, vertex-wise, or voxel-wise input depending on the task at hand. There is an implicit circularity between encoding and decoding modeling, as encoding analyses informing on neurological processes can serve as a prior for the decoding analysis to achieve better performance. Conversely, individual-level predictive performance of decoding models can validate group-level findings beyond significance testing.

2.6.1 Machine-learning

Machine-learning (ML) is a branch of statistics that heavily utilizes advanced computational algorithms for extracting meaningful relationships within large amounts of data and making accurate predictions [34, 167, 30, 217]. A subset of ML approaches with deep artificial neural network based architectures, also known as deepnets, has enjoyed tremendous success in the last decade in the areas of computer vision, speech processing, natural language processing, as well as control systems and game playing [166, 217]. Leveraging recent advances in computational hardware, such as graphical processing units (GPUs), and the availability of large datasets, ML has tackled problems pertaining to object and face recognition [212, 159], image generation [147], speech and sentiment recognition [145], language translation [379], caption generation [191], personal assistance [382], autonomous vehicles [140], and Atari, Go, and chess game playing [251, 318, 319], to name a few.

In the context of neuroimaging, ML approaches have been proposed to tackle modeling and prediction challenges pertaining to the high-dimensional, multimodal, and longitudinal data resulting from MR and other imaging modalities. Traditional hypothesis-driven statistical methods have had limited success in handling complex, high-dimensional datasets for individual prediction tasks. In particular, univariate, linear computational models are unable to capture the spatiotemporal structure and nonlinear relationships within the imaging data. ML approaches allow these relationships to be learned from the data itself instead of relying on strongly hypothesis-driven predictive models, and have shown promising results with neuroimaging data.

Broadly, ML approaches can be divided into three categories based on their purpose. 1) Supervised learning: a family of classification or regression tasks that aim at predicting a discrete label or a real value from input (e.g. object or diagnostic classification). 2) Unsupervised learning: problems relating to the discovery of parsimonious, meaningful features representative of typically high-dimensional input data (e.g. matrix factorization, clustering). 3) Reinforcement learning: devising a strategy that should be used to maximize eventual payout (e.g. chess, Go). Only supervised and unsupervised learning approaches are within the scope of this thesis, as they are applicable to many predictive and descriptive tasks commonly encountered in computational neuroscience.

2.6.2 Supervised learning

Supervised learning techniques make use of labeled data to learn relationships between input (X) and output (y) pairs. During training of supervised models, the learning process aims to identify a set of weights that can accurately predict the output. The output can be of a variety of data types, including categorical, continuous, and structured vector forms. Broadly, supervised classifiers can be divided into generative and discriminative models. Given a set of paired (X, y) examples, generative classifiers first aim to learn the joint probability distribution P(X, y), followed by calculation of P(y|X) using Bayes’ rule to infer the most likely label based on the input data. Generative models require a relatively large pool of labeled examples in order to accurately model the joint distribution P(X, y). Generative models provide valuable insights into the data distribution, facilitating model interpretation. In addition, they can be practically useful for imputing missing data points and augmenting the training set by generating new examples. Commonly used generative models include naive Bayes, Gaussian mixture models, and generative adversarial networks [144].

Discriminative classifiers, in contrast, circumvent estimation of the joint probability distribution and directly learn the conditional probability distribution P(y|X). For binary classification, this essentially implies defining a single cut-off boundary that separates the input space for each class without explicitly specifying the data distribution. In comparison to generative models, discriminative approaches require fewer labeled examples. Logistic regression [85], decision trees, random forests [45], support vector machines [79], and artificial neural networks [298] are some of the commonly used discriminative classifiers. Many of these classifiers can also be modified to perform regression for continuous-valued output prediction. Among these, linear regression (LR), support vector machine (SVM), random forest (RF), and artificial neural network (ANN) variants are used in this thesis. Specifically, two customized ANN models are presented in projects 2 and 3, to handle multimodal and longitudinal input, respectively. A brief description of the reference ML models, i.e. LR, SVM, and RF, along with ANNs, which form the basis of many deep learning approaches, is provided below.

2.6.3 Reference models

Given below is a brief description of the three reference models used to mark the baseline performances in projects 2 and 3. Project 2 uses implementations of these models for regression analysis, whereas project 3 uses them for classification tasks. For the sake of simplicity, the descriptions are limited to the classification versions.

Logistic regression

Logistic regression (LR) is one of the most commonly used discriminative models for binary classification tasks. The model is based on the logistic, or sigmoid, function (see Fig. 2.7), given as:

\[
\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad z = \theta^{T}x + \theta_0 \qquad (2.11)
\]

where z itself is a linear combination of the input variables xi, which can take any real values. The sigmoid function acts as a squashing transformation that bounds the output between (0, 1), allowing a probabilistic interpretation of the prediction. The parameters of LR can be learned via maximum likelihood estimation, which can be solved using iterative optimization algorithms such as gradient descent [34].
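A minimal NumPy sketch of this model is given below: it implements the sigmoid of Eq. 2.11 and fits θ and θ0 by batch gradient descent on the negative log-likelihood. The toy two-class data are synthetic and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Minimal batch gradient descent on the negative log-likelihood (Eq. 2.11)."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ theta + theta0)    # predicted P(y = 1 | x)
        grad_theta = X.T @ (p - y) / n     # gradient of the average loss w.r.t. theta
        grad_theta0 = np.mean(p - y)
        theta -= lr * grad_theta
        theta0 -= lr * grad_theta0
    return theta, theta0

# Toy data: two Gaussian blobs standing in for e.g. two diagnostic groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta, theta0 = fit_logistic_regression(X, y)
acc = np.mean((sigmoid(X @ theta + theta0) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```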

Figure 2.7: Logistic (sigmoid) function

Support vector machine

The support vector machine (SVM) falls under the family of kernel-based models that perform a feature-space transformation of the input variables prior to the classification step. For binary classification tasks, the model is represented in the following form:

\[
y(x) = \theta^{T}\phi(x) + \theta_0 \qquad (2.12)
\]

where φ(x) denotes the feature-space transformation. The decision boundary in the feature space is constructed using a subset of examples referred to as support vectors. These examples are the closest points on either side of this boundary, which is referred to as a separating hyperplane. The SVM aims to maximize the margin between this hyperplane and the support vectors (see Fig. 2.8). In the case where data are not linearly separable, a soft-margin criterion is used to allow a certain level of misclassification [79]. Alternatively, nonlinear classifiers based on a kernel trick have been proposed to improve model flexibility. Estimation of the hyperplane is a convex optimization problem that allows computation of a global minimum. It should be noted that, unlike LR, SVM does not provide probabilistic output; however, certain software implementations (e.g. scikit-learn) calibrate class probabilities using Platt scaling [279].
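The following short scikit-learn sketch illustrates these ideas on synthetic data: the RBF kernel stands in for the feature-space transformation φ(x), the parameter C controls the soft margin, and probability=True enables the Platt-scaling calibration mentioned above. The data and settings are illustrative, not those used in the thesis experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel = implicit feature-space transformation phi(x); C controls the soft margin.
# probability=True triggers scikit-learn's Platt-scaling calibration.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
svm.fit(X_train, y_train)

print("test accuracy:", svm.score(X_test, y_test))
print("calibrated P(y=1) for first test example:", svm.predict_proba(X_test[:1])[0, 1])
```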

Figure 2.8: Support vector machine: Max margin hyperplane (red line) maximizes the distance from the support vectors (purple circles) representing the two classes. Image adopted from [34] with reuse permission.

Random forest

Random forest (RF) is an ensemble method that combines predictions from multiple decision tree classifiers [45]. RF combines bootstrap aggregation (bagging) and random feature selection techniques to construct a collection of decision trees. Each tree is fit to a random sample (with replacement) of n < N examples from the entire training dataset; this is known as bootstrap sampling. A given decision tree partitions the input space into different regions by thresholding the feature values (see Fig. 2.9). At each node in a tree, d << D features are randomly selected, and the parent node is partitioned using the best possible binary split. The best split is determined according to an impurity criterion which aims to maximize the homogeneity of the child nodes with respect to the parent node. Impurity can be assessed using various measures, such as the Gini index. The Gini index measures the likelihood that an example would be incorrectly labelled if it were randomly classified according to the distribution of labels within the node. The aggregation of predictions from all trees, referred to as bagging, is performed based on a majority-vote criterion. This helps reduce the high variance and consequent overfitting that results from single-tree based prediction.
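A corresponding scikit-learn sketch is shown below, with the hyperparameters that map onto the description above (number of bootstrapped trees, size of the random feature subset, and the Gini impurity criterion). Again, the synthetic data and settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)

# n_estimators = number of bootstrapped trees; max_features = size of the random
# feature subset (d << D) considered at each split; Gini impurity is the split criterion.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            criterion="gini", random_state=0)

print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```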

2.6.4 Artificial neural networks and deep learning

Artificial neural networks (ANNs), also known as feedforward neural networks or multilayer perceptrons (MLPs), are the building blocks of many deep-learning models that have had great success in tackling high-dimensional imaging datasets. Similar to other supervised ML models, the goal of an ANN is to approximate a function f* that can map input x to output y as follows:

\[
y = f^{*}(x, \theta) \qquad (2.13)
\]

Where θ are the model weights (or parameters) that are learned via a training process. Computationally, ANNs differ from traditional ML approaches by representing f as a composite function formed by several nested functions as follows [143, 30]:

Figure 2.9: A: Example of a binary decision tree comprising 3 input variables; B: Corresponding par- titioning of the two-dimensional input space into five regions using axis-aligned boundaries. Images adopted from [34] with reuse permission.

\[
f(x) = f^{(n)}\!\left(\ldots f^{(3)}\!\left(f^{(2)}\!\left(f^{(1)}(x)\right)\right)\right) \qquad (2.14)
\]

These chained connections are reflected by the hierarchical (hidden) layers in a graphical representation (see Fig. 2.10). The depth of an ANN is then determined by the number of these layers. The compute operation at each hidden node of a given layer typically comprises 1) a weighted sum of the inputs from the preceding layer and 2) a nonlinear activation function applied to the weighted sum. The hierarchical structure with nonlinear transformations (e.g. sigmoid, rectified linear unit) enables ANNs to represent complex, nonlinear functions that cannot be approximated accurately using linear models. ANNs are trained using gradient descent based algorithms. The presence of nonlinearity introduces a non-convex loss function for the optimization process, which does not have the global convergence guarantees found in linear optimization problems (e.g. logistic regression or SVM). Therefore, ANNs are optimized using stochastic iterative procedures that refine the model weights as long as performance improvements are made based on a predefined loss function. All gradient descent based optimization algorithms require computation of a gradient. For ANNs, the gradients at each hidden layer are computed using an algorithm called backpropagation [298]. Backpropagation leverages the chain rule of calculus in order to compute gradients at each layer of the composite function f. Although theoretically straightforward, the implementation of gradient descent coupled with backpropagation needs to address several caveats, such as weight initialization, vanishing gradients, and batch normalization.
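To make the forward pass, the nested-function view of Eq. 2.14, and backpropagation concrete, the sketch below trains a one-hidden-layer network with NumPy on a toy nonlinear problem. It is a didactic illustration (fixed learning rate, full-batch updates, no regularization), not the customized ANN architectures developed in projects 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# One hidden layer: f(x) = f2(f1(x)), matching the nested-function view of Eq. 2.14.
def forward(X, W1, b1, W2, b2):
    h = relu(X @ W1 + b1)                      # hidden layer: weighted sum + nonlinearity
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output for binary classification
    return h, p

# Toy nonlinear (XOR-like) target that a linear model cannot capture.
X = rng.normal(size=(200, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(scale=0.1, size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)
lr = 0.5

for _ in range(3000):
    h, p = forward(X, W1, b1, W2, b2)
    # Backpropagation: chain rule applied to the cross-entropy loss.
    dz2 = (p - y) / len(X)                     # gradient at the output pre-activation
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (h > 0)               # propagate through the ReLU derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

_, p = forward(X, W1, b1, W2, b2)
print("training accuracy:", np.mean((p > 0.5) == y))
```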

The core ideas behind the formulation of ANNs have not changed substantially since the 1980s. Their recent success is attributed mainly to the availability of large datasets and powerful computing infrastructure (e.g. GPUs). Algorithmically, the notable changes include 1) use of the cross-entropy loss function in place of the mean squared error and 2) replacing the sigmoid function with the rectified linear unit as the activation function. These innovations, along with novel network architecture designs such as convolutional networks, long short-term memory recurrent networks, Siamese networks, and U-nets, have demonstrated state-of-the-art performance on many supervised tasks involving high-dimensional data. This in turn has motivated the customization and application of ANNs for handling neuroimaging data towards the prediction of clinical tasks in this thesis.

Figure 2.10: A feed-forward artificial neural network (ANN) comprising input, hidden, and output layers. Each node from the hidden layer represents a compute operation, whereas the connections between nodes denote the model weights (parameters) that are learned through a training process.

2.6.5 Performance metrics for supervised learning

Supervised learning tasks are typically employed for individual-level prediction, in contrast with group-level analysis, and hence require a different validation paradigm. Supervised learning performance can be measured with a multitude of metrics depending on the task objective. For classification problems, these metrics include accuracy, the receiver operating characteristic curve, the confusion matrix, specificity, sensitivity, and the F1 score. For regression problems, performance is evaluated via metrics such as mean squared error, mean absolute error, and correlation. These performance metrics are typically collected in a cross-validation framework, which involves permutation and sampling of the available data.
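The sketch below computes several of these metrics with scikit-learn on small, made-up prediction vectors; the numbers carry no meaning beyond illustrating how the quantities are obtained.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, roc_auc_score)

# Hypothetical predictions from a classifier (labels + scores) and a regressor.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3])

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, y_score))
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))

# Regression example (e.g. predicted vs. observed clinical severity scores).
print("MAE:", mean_absolute_error([28.0, 25.0, 21.0], [27.0, 26.5, 20.0]))
```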

2.6.6 Supervised ML and AD

In the neuroimaging domain, and particularly relating to AD, a wide variety of supervised learning algorithms have been developed for tasks such as image segmentation (structured output) [361, 293], diagnostic classification (categorical) [202, 91, 87], clinical symptom severity identification (continuous) [332, 390, 383], and prognostic prediction (categorical/continuous) [322, 253, 170]. The performance of these methods is discussed in detail in Chapters 4 and 5.

2.6.7 Unsupervised learning

Unsupervised learning aims at inferring a function to describe hidden structure from “unlabeled” data. Two common classes of unsupervised techniques are latent variable models, typically implemented via matrix factorization, and clustering models. The latent variable models are commonly used for dimensionality reduction, transforming the high-dimensional input into fewer components that parsimoniously represent the useful information. Principal component analysis [264, 151], independent component analysis [244, 49], and non-negative matrix factorization [218] are some examples of this class of techniques. In the context of dimensionality reduction, these models can be thought of as feature engineering operators [34, 30]. In contrast with a variable selection process, which simply drops a subset of variables in order to reduce input dimensionality, the feature engineering approach defines a mapping from the high-dimensional input space onto a low-dimensional representation. The variables in the new space are transformed versions of the original input rather than a selected subset. The second class of techniques, clustering, aims at grouping the original set of variables based on certain criteria. Unlike the latent variable models, clustering approaches yield categorical labels denoting cluster membership. Different types of clustering approaches are used depending on the task goals, the input data distribution, and the notion of similarity between two examples. K-means, Gaussian mixture models, spectral clustering, and hierarchical clustering are some of the commonly used clustering techniques [342, 1, 34].

2.6.8 Performance metrics for unsupervised learning

Since there is no output label, unsupervised learning has a different set of evaluation metrics that measure the stability or reproducibility of the learned features or clusters. Explained variance (for a given number of principal components), the silhouette coefficient [297], and pairwise cluster stability are some of the metrics used for such validation. When unsupervised learning is used as a preprocessing step for dimensionality reduction, it is usually validated within the hyperparameter optimization module of the cross-validation framework.
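The sketch below ties these two subsections together on synthetic data: a latent-variable model (PCA) reduces the dimensionality, k-means clusters the reduced representation, and explained variance and the silhouette coefficient serve as the evaluation metrics just described. It is illustrative only and unrelated to the subtyping analyses in this thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic high-dimensional data standing in for e.g. vertex-wise cortical thickness.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 200)) for m in (-1.0, 0.0, 1.0)])

# Latent-variable model: project 200 variables onto a few principal components.
pca = PCA(n_components=10)
X_low = pca.fit_transform(X)
print("explained variance of 10 components:", pca.explained_variance_ratio_.sum())

# Clustering on the low-dimensional representation, evaluated with the silhouette coefficient.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_low)
print("silhouette coefficient:", silhouette_score(X_low, kmeans.labels_))
```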

2.6.9 Unsupervised ML and AD

In the neuroimaging domain, and particularly relating to AD, unsupervised learning algorithms have been applied to discover subtypes of the disease based on clinical and neuroimaging measures [263, 370, 123]. Clustering-based subtyping is discussed in detail in Chapter 5.

2.6.10 Performance evaluation

Cross-validation (CV) is a procedure commonly used for training and testing supervised ML models in scenarios with small sample sizes [205, 111, 14]. The primary goal of CV is to evaluate the predictive performance and generalizability of ML models on unseen data. The CV framework comprises several computational stages, including preprocessing steps such as feature selection and feature transformation, which are design choices to be made by the investigator. The CV framework is split into two processing pipelines, namely train and test. The available data is first split into two subsets to be used in each of these pipelines. All operations pertaining to “learning”, i.e. model parameter estimation, are performed within the training pipeline. The learned models and/or transformations are then applied in the test pipeline to unseen data. This is repeated multiple times with different train and test splits of the available data, referred to as folds.

The training pipeline begins with raw data acquired from different modalities such as MR imaging, genetics, and demographics. In the field of neuroimaging, the dimensionality of the raw input data typically exceeds the number of available samples substantially [220, 210]. Without sufficient samples during training, models are likely to memorize the one-to-one mapping between each input and output pair. Consequently, such models are unable to learn meaningful patterns that generalize to unseen data. This is referred to as the overfitting problem [34, 210]. Thus it becomes imperative to reduce the dimensionality of the raw input data in order to mitigate overfitting by unnecessarily complex models with a large number of parameters. This can be achieved via feature selection or feature transformation (or both) [264]. Feature selection involves selecting a subset of input variables based on certain criteria. These criteria can be based on a prior hypothesis (e.g. anatomical regions of interest based on known pathology) or be data-driven (e.g. anatomical regions of interest based on significant group-wise differences). In feature selection, the raw values of the selected variables are preserved. In comparison, feature transformation comprises a filtering operation that maps the original multivariate space into another multivariate space of smaller dimensionality. These transformations can also be based on a hypothesis (e.g. average values over an anatomical region of interest) or data-driven (e.g. matrix factorization methods such as principal component analysis and independent component analysis). After transformation, the input features are represented with a new set of values derived from weighted combinations of the original raw input [165, 30, 264]. It is important to note that feature selection and transformation operations need to be performed using only the training data subset to avoid “double-dipping”, which implies utilization of information from unseen data and results in performance inflation [210]. During performance evaluation, these learned selection and transformation operations are directly applied to the test data without any further tuning based on the test data distribution.

As previously stated, these operations are repeated multiple times on permuted train and test splits of the available data (see Fig. 2.11). Several data-splitting strategies are available, including leave-one-out, k-fold, stratified k-fold, and Monte-Carlo [14]. Stratified k-fold is a commonly used approach that involves splitting the data into k mutually exclusive partitions, or folds. During each iteration, k-1 folds are used for training and the remaining fold is used for testing; this is repeated k times so that each sample is tested exactly once. Additionally, the available data are stratified prior to sampling so that each fold comprises a similar proportion of output labels for the task at hand, yielding balanced label proportions in the train and test subsets. Further stratification can also be enforced to control for other confounding factors such as demographics (sex, race, etc.) and acquisition peculiarities (study, site, etc.). Stratification typically helps with model training and reduces variability in classifier performance during cross-validation.
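A minimal sketch of stratified splitting, assuming a hypothetical site covariate; the composite label-by-site key used for stratification is one simple way to approximate the additional stratification described above, not the scheme used in this thesis:

```python
# Minimal sketch: stratified k-fold splits that balance the diagnostic label in
# every fold; a composite key additionally balances a hypothetical "site" array.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100)             # diagnostic labels (illustrative)
site = rng.integers(0, 3, size=100)          # hypothetical acquisition site
strata = np.char.add(y.astype(str), site.astype(str))   # label-by-site strata

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros((len(y), 1)), strata)):
    # each test fold has roughly the same label/site mix as the whole sample
    print(fold, np.bincount(y[test_idx]), np.bincount(site[test_idx]))
```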

Another approach to evaluating model performance and generalizability involves training and testing models on independent datasets: models are trained on one dataset and tested on the other. In the more general case, with n independent datasets available, this is extended to a leave-one-dataset-out approach in which models are trained on n-1 datasets and tested on the held-out dataset. This allows evaluation of dataset-specific biases and invariances of the trained models. It is a relatively challenging validation paradigm, as different datasets carry inherent site- and study-specific biases in their acquisition protocols, introducing markedly different data distributions. Nevertheless, it is important to address these challenges for practical use of ML models.
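A leave-one-dataset-out evaluation can be expressed with a group-aware splitter. The sketch below is illustrative (the cohort names, features, and classifier are hypothetical placeholders), not the evaluation code used in this thesis:

```python
# Minimal sketch of leave-one-dataset-out evaluation: train on n-1 cohorts,
# test on the held-out cohort, using cohort membership as the "group" label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)
dataset = np.repeat(["ADNI1", "ADNI2", "AIBL"], 100)   # cohort per subject (hypothetical)

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=dataset, cv=logo)
for held_out, score in zip(np.unique(dataset), scores):
    print(f"test on {held_out}: accuracy = {score:.3f}")
```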

2.7 Project synopses

The next three chapters comprise the published and accepted (in production) manuscripts that describe and discuss the three projects completed as part of this thesis. Brief summaries of the scope and findings of each project are given below.

Figure 2.11: Nested k-fold cross-validation paradigm. During each iteration, k−1 folds are used as the train subset, with the remaining fold as the test subset. The samples from each train subset are further divided into j inner folds to define and evaluate various data preprocessing operations, including feature selection / transformation, data scaling / normalization, and hyperparameter configuration. The top-performing hyperparameter configuration is then used to train a single model on the entire train subset, which is subsequently applied to the test subset from the outer fold. The model is finally evaluated based on the test-set performance.

2.7.1 Manual-protocol inspired technique for improving automated MR image segmentation during label fusion (published online 2016 Jul 19; doi: 10.3389/fnins.2016.00325)

The first project presents a novel method, "Autocorrecting Walks over Localized Markov Random Fields (AWoL-MRF)", that aims at mimicking the sequential process of manual segmentation, which is the gold standard for virtually all segmentation methods. AWoL-MRF begins with a set of candidate labels generated by a multi-atlas segmentation pipeline as an initial label distribution and refines low-confidence regions based on a localized Markov random field (L-MRF) model using a novel sequential inference process (walks). The results demonstrate that AWoL-MRF produces state-of-the-art results, with superior accuracy and robustness with a small atlas library compared to existing methods. The method is validated by performing hippocampal segmentations on three independent datasets: (1) the Alzheimer's Disease Neuroimaging Initiative (ADNI) database; (2) a first-episode psychosis patient cohort; and (3) a cohort of preterm neonates scanned early in life and at term-equivalent age. AWoL-MRF is compared qualitatively as well as quantitatively to other label-fusion techniques, including majority vote, STAPLE, and Joint Label Fusion. AWoL-MRF reaches a maximum accuracy of 0.881 (dataset 1), 0.897 (dataset 2), and 0.807 (dataset 3) based on the Dice similarity coefficient metric, offering significant performance improvements with a smaller atlas library (< 10) over the compared methods. The diagnostic utility of AWoL-MRF is also discussed by analyzing the volume differences within diagnostic categories based on the ADNI1: Complete Screening dataset.

2.7.2 An artificial neural network model for clinical score prediction in Alzheimer's disease using structural neuroimaging measures (accepted in the Journal of Psychiatry and Neuroscience)

The second project presents a novel anatomically partitioned artificial neural network (APANN) model for predicting individual-level clinical scores from mini-mental state exam (MMSE) and Alzheimer's Disease Assessment Scale (ADAS-13) assessments. APANN combines input from two structural MR imaging measures relevant to neurodegenerative patterns observed in AD, namely hippocampal segmentations and cortical thickness. Performance of APANN is evaluated with 10 rounds of 10-fold cross-validation in three sets of experiments using the ADNI1, ADNI2, and ADNI1+ADNI2 cohorts. Pearson's correlation and root mean square error between the actual and predicted scores for ADAS-13 (ADNI1: r = 0.60; ADNI2: r = 0.68; ADNI1and2: r = 0.63) and MMSE (ADNI1: r = 0.52; ADNI2: r = 0.55; ADNI1and2: r = 0.55) demonstrate that APANN can accurately infer clinical severity from MR imaging data. Furthermore, APANN is also validated in a proof-of-concept longitudinal analysis comprising prediction of future clinical scores. The results show that APANN provides a highly robust and scalable framework for prediction of clinical severity at the individual level utilizing high-dimensional, multimodal neuroimaging data.

2.7.3 Modeling and prediction of clinical symptom trajectories in Alzheimer's disease using longitudinal data (accepted in PLOS Computational Biology)

The third project presents a computational framework comprising machine-learning techniques for 1) modeling symptom trajectories and 2) predicting symptom trajectories using multimodal and longitudinal data. The project comprises a primary analysis performed using three cohorts from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and a replication analysis performed using subjects from the Australian Imaging, Biomarker and Lifestyle (AIBL) Flagship Study of Ageing. In the modeling stage, prototypical symptom-trajectory classes are defined using clinical assessment scores from the mini-mental state exam (MMSE) and the Alzheimer's Disease Assessment Scale (ADAS-13) at nine timepoints spanning six years, based on a hierarchical clustering approach. Subsequently, in the prediction stage, these trajectory classes are predicted for each individual using magnetic resonance (MR) imaging, genetic, and clinical variables from two timepoints (baseline + follow-up). For prediction, a longitudinal Siamese neural network (LSN) with a novel architecture for combining the multimodal data from the two timepoints is presented. The trajectory modeling yields two (stable and decline) and three (stable, slow-decline, fast-decline) trajectory classes for the MMSE and ADAS-13 assessments, respectively. For the predictive tasks, LSN offers highly accurate performance, with 0.900 accuracy and 0.968 AUC for the binary MMSE task and 0.760 accuracy for the 3-way ADAS-13 task on the ADNI datasets, as well as 0.715 accuracy and 0.907 AUC for the binary MMSE task on the replication AIBL dataset.

Chapter 3

Manual-Protocol Inspired Technique for Improving Automated MR Image Segmentation during Label Fusion

Nikhil Bhagwat [1,2,3,*], Jon Pipitone [3], Julie L. Winterburn [1,2,3], Ting Guo [4,5], Emma G. Duerden [4,5], Aristotle N. Voineskos [3,6], Martin Lepage [2,7], Steven P. Miller [4,5], Jens C. Pruessner [2,8], M. Mallar Chakravarty [1,2,7], and Alzheimer’s Disease Neuroimaging Initiative.

1. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada

2. Cerebral Imaging Centre, Douglas Mental Health University Institute, Verdun, QC, Canada

3. Kimel Family Translational Imaging-Genetics Research Lab, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada

4. Neurosciences and Mental Health, The Hospital for Sick Children Research Institute, Toronto, ON, Canada

5. Department of Paediatrics, The Hospital for Sick Children and the University of Toronto, Toronto, ON, Canada

6. Department of Psychiatry, University of Toronto, Toronto, ON, Canada

7. Department of Psychiatry, McGill University, Montreal, QC, Canada

8. McGill Centre for Studies in Aging, Montreal, QC, Canada

Correspondence: Nikhil Bhagwat, Email: [email protected]

Keywords: MR Imaging, Segmentation, Multi-Atlas Label-Fusion, Markov Random Fields, Hippocampus, Alzheimer's disease, First Episode Psychosis, Schizophrenia, Premature Birth and Neonates.


3.1 Abstract

Recent advances in multi-atlas based algorithms address many of the previous limitations in model-based and probabilistic segmentation methods. However, at the label-fusion stage, a majority of algorithms focus primarily on optimizing weight-maps associated with the atlas library based on a theoretical objective function that approximates the segmentation error. In contrast, we propose a novel method - Autocorrecting Walks over Localized Markov Random Fields (AWoL-MRF) - that aims at mimicking the sequential process of manual segmentation, by which the gold standard is defined for virtually all segmentation methods. AWoL-MRF begins with a set of candidate labels generated by a multi-atlas segmentation pipeline as an initial label distribution and uses it to partition the given image into high- and low-confidence segmentation regions. Then, the labels of the low-confidence regions are updated based on a localized Markov random field (L-MRF) model and a novel sequential inference process (walks), which captures the behavior of a manual rater. The approach combines the strong a priori information from the atlas library with the local spatial and intensity information from the target image, without depending on computationally expensive pairwise comparisons with the atlas library. We show that AWoL-MRF produces state-of-the-art results with a small atlas library (< 10) and improves the accuracy and robustness of existing segmentation pipelines. We validate the proposed approach by performing hippocampal segmentations on three independent datasets: 1) the Alzheimer's Disease Neuroimaging Initiative (ADNI) database; 2) a first-episode psychosis patient cohort; and 3) a cohort of preterm neonates scanned early in life and at term-equivalent age. We assess the improvement in performance qualitatively as well as quantitatively by comparing AWoL-MRF with majority vote, STAPLE, and Joint Label Fusion methods. AWoL-MRF reaches a maximum accuracy of 0.881 (dataset 1), 0.897 (dataset 2), and 0.810 (dataset 3) based on the Dice similarity coefficient metric, offering significant improvements with a smaller atlas library over the compared methods. We also evaluate the diagnostic utility of the presented method by analyzing the volume differences per disease category in the ADNI1: Complete Screening dataset. The source code for AWoL-MRF can be found at https://github.com/CobraLab/AWoL-MRF.

3.2 Author Contributions

Nikhil Bhagwat (NB) worked on the development of the AWoL-MRF algorithm and its subsequent implementation and validation. He performed preprocessing and quality control of the MR image datasets. He also wrote the manuscript of the published research paper. Jon Pipitone assisted with the preprocessing of MR images and provided feedback on the proposed methodological approach. Additionally, he provided support for computational resources. Julie Winterburn served as an expert manual rater for neonatal MR image segmentation. Ting Guo, Emma G. Duerden, and Steven P. Miller performed acquisition and curation of the neonatal dataset. Martin Lepage performed acquisition and curation of the FEP dataset. Jens C. Pruessner served as an expert anatomist who developed the manual protocol for hippocampal segmentation on the ADNI dataset. He also advised on the manual segmentation protocols for the FEP and neonatal datasets. Aristotle N. Voineskos served as a clinical advisor and is a member of NB's thesis committee. He also served as a supervisor for NB in a lab at CAMH that provided significant computational resources. M. Mallar Chakravarty is the thesis supervisor for NB. He provided guidance on the development and validation of all proposed models and techniques, as well as manuscript writing.

3.3 Introduction

The volumetric and morphometric analysis of neuroanatomical structures is of growing importance in many clinical applications. For instance, structural characteristics of the hippocampus have been used as an important biomarker in many neurological and psychiatric disorders, including Alzheimer's disease (AD), schizophrenia, major depression, and bipolar disorder [157, 133, 223, 192, 247, 366]. The gold standard for neuroanatomical segmentation is manual delineation by an expert human rater. However, with the increasing ubiquity of magnetic resonance (MR) imaging technology and neuroimaging studies targeting larger populations, the time and expertise required for manual segmentation of large MR datasets become a critical bottleneck in analysis pipelines [243, 242]. A manual rater's performance is dependent on his or her specialized knowledge of the neuroanatomy. A generic manual segmentation protocol leverages this anatomical knowledge and uses it in tandem with voxel intensities to enforce structural boundary conditions during the delineation process. This is, of course, the premise of many automated model-based segmentation approaches. Multi-atlas based approaches have been shown to improve segmentation accuracy and precision over model-based approaches [71, 364, 285, 161, 10, 56, 227, 235, 301, 374, 162, 359, 386, 54]. The processing pipelines of these approaches can be divided into multiple stages. First, each atlas image is registered to a target image, the image to be segmented. Subsequently, the atlas labels are propagated to produce several candidate segmentations of the target image. Finally, a label-fusion technique, such as voxel-wise voting, is used to merge these candidate labels into the final segmentation for the target image. For the remainder of the paper we refer to this latter stage within a multi-atlas based segmentation pipeline as "label fusion", which is a core interest of this work. In many image processing and computer vision applications, the Markov random field (MRF) has been a popular approach for modeling spatial dependencies, and particularly in the context of neuroimaging, it has been used in several model-based segmentation techniques. Existing software packages, such as FreeSurfer [126] and FSL [324], use MRFs for gray matter, white matter, and cerebrospinal fluid classification, as well as for segmentation of multiple subcortical structures. For example, FreeSurfer uses an anisotropic, non-stationary MRF that encodes the inter-voxel dependencies as a function of location within the brain. Pertaining to multi-atlas label-fusion techniques, STAPLE (Simultaneous Truth And Performance Level Estimation) [364] uses a probabilistic performance framework consisting of an MRF model and an Expectation-Maximization (EM) inference method to compute a probabilistic estimate of the true segmentation based on an optimal combination of a collection of segmentations. STAPLE has been explored in several studies for improving a variety of segmentation tasks [75, 76, 188, 6]. Alternatively, a majority of modern multi-atlas approaches treat label fusion as a weight-estimation problem, where the objective is to estimate optimal weights for the candidate segmentation propagated from each atlas. In a trivial case with uniform weights, this label-fusion technique boils down to a simple majority vote. In other cases, the weights can be used to exclude atlases that are dissimilar to a target image [10] to minimize the errors from unrepresentative anatomy.
In a more general case, weight values are estimated using some similarity metric between the atlas library and the target image. A comprehensive probabilistic generative framework that models the underlying relationship between the atlas and target data, exploited by the methods belonging to this class, is provided by [301]. More recently, several methods [83, 295, 359] have extended this label-fusion approach by adopting spatially varying weight-maps to capture similarity at a local level. These algorithms usually introduce bias during label fusion when the weights are assigned independently to each atlas, allowing several atlases to produce similar label errors. These systematic (i.e. consistent across the subject cohort) errors can be mitigated by taking pairwise dependencies between atlases into account during weight assignment, as proposed by [359, 386]. In contrast, the proposed method - Autocorrecting Walks over Localized Markov Random Fields (AWoL-MRF) - pursues a different idea for tackling the label-fusion problem. Since the gold standard for virtually all segmentation methods is defined by manual labels, we hypothesize that we could achieve superior performance by mimicking the behavior of the manual rater. Consequently, the label-fusion objective we pursue here is to capture the sequential process of manual segmentation rather than to optimize atlas-library weights based on similarity-measure proxies and/or to perform iterative inference to estimate optimal label configurations based on MRFs. Hence the novelty of the approach lies in the methodological procedure, as we combine the strong prior anatomical information provided by the multi-atlas framework with the local neighborhood information specific to the given subject. In the context of segmentation of anatomical structures such as the hippocampus, the challenging areas for label assignment are mainly located at the surface regions of the structure. We observe that a manual rater traces these boundary regions by balancing intensity information and anatomical knowledge, while enforcing smoothness requirements and tackling partial volume effects. In practice, this behavior translates into a sequential labeling process that depends on the information offered by the local neighborhood around a voxel of interest. The proposed label-fusion method attempts to incorporate these observations into an automated procedure and is implemented as part of a segmentation pipeline previously developed by our group [277]. The algorithmic steps of AWoL-MRF can be summarized as follows. First, based on a given multi-atlas segmentation method, we initialize the label distribution for the neuroanatomical structure to be segmented. This initial label-vote distribution is leveraged to partition the given target volume into two disjoint subsets comprising regions with high- and low-confidence label values based on the vote distribution at the voxels. Next, we construct a set of local 3-dimensional patches comprising a certain ratio of high- and low-confidence voxels. The spatial dependencies in these patches are modeled using independent MRFs. Finally, we traverse these patches, moving from high-confidence voxels to low-confidence voxels in a sequential manner, and perform the label-distribution updates based on a localized (patch-based) MRF model. We implement a novel spanning-tree method to build these ordered sequences of voxels (walks).
A detailed description of this entire procedure and the key features differentiating it from existing approaches is provided in the next section. We provide an explanation and extensive validation of our approach in this paper, which is organized as follows. First, we describe the AWoL-MRF method and the underlying assumptions in detail. Then, we provide a thorough validation of the method for whole-hippocampus segmentation by conducting multi-fold validation over three independent datasets that span the entire human lifespan. The quantitative accuracy evaluations are performed on three datasets: 1) a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset; 2) a cohort of first episode psychosis (FEP) patients; and 3) a cohort of preterm neonates scanned early in life and at term-equivalent age. Additionally, we evaluate the diagnostic utility of the method by analyzing the volume differences per disease category in the ADNI1: Complete Screening dataset. We assess the accuracy and robustness of the proposed method (source code: https://github.com/CobraLab/AWoL-MRF) by comparing it with three other approaches. Our group has recently validated the performance of MAGeT Brain [277] against several other automated methods. Here, we make use of MAGeT Brain to generate candidate labels, on which a variety of label-fusion methods can be implemented. We first compare the performance of AWoL-MRF with the default majority-vote based label fusion used in MAGeT Brain. In addition, we compare AWoL-MRF with STAPLE [364], a more sophisticated label-fusion approach that uses an MRF model and estimates rater performance using an EM technique. Lastly, we compare it against one of the more recent methods, Joint Label Fusion (JLF) [359], which estimates atlas weights by taking into account the effect of pairwise dependencies approximated by the intensity similarity between atlases.

3.4 Materials and Methods

3.4.1 Methodological Novelty of AWoL-MRF

As mentioned earlier, the novelty of this approach stems from its methodological similarity to the manual labeling process. For instance, a manual rater would begin by marking the boundary of a structure that they believe to be correct (high confidence) based on anatomical knowledge. Next, the rater would identify certain regions that require further refinement (low confidence). Then, region by region (patches), the rater would perform these refinements by moving from high-confidence areas to low-confidence areas in a sequential manner, while taking into account the information offered by neighborhood voxels from orthogonal planes. Furthermore, the voxel intensity distribution conditioned on a label class leveraged by a manual rater is derived purely from the neighborhood of the target image itself and not from the atlas library. AWoL-MRF translates this into estimating the intensity distributions based on statistics computed from the high-confidence voxels in a given localized patch of the target image. Thus, the key differences between AWoL-MRF and existing multi-atlas label fusion include the decoupling from the atlas library after the registration stage. Once we obtain the label-vote distribution, we rely entirely on the intensity profile of the target image and avoid any computationally expensive pairwise similarity comparisons with the atlas library. Additionally, even though we use a commonly used MRF approach to model spatial dependencies, the novel spanning-tree based inference technique that attempts to mimic the delineation process of a manual rater differentiates AWoL-MRF from traditional iterative optimization techniques such as iterated conditional modes (ICM) or Expectation-Maximization (EM). The key benefits of the AWoL-MRF implementation are twofold. First, it offers state-of-the-art performance using a small atlas library (< 10), whereas most segmentation pipelines typically make use of large atlas libraries comprising from 30 up to 80 manually segmented image volumes [285, 161] that require specialized knowledge and experience to generate. Second, from a computational perspective, AWoL-MRF mitigates many expensive operations common to multi-atlas label-fusion methods. By eliminating the need for pairwise similarity-metric estimation, we avoid computationally expensive registration operations whose number increases rapidly with the size of the atlas library. Furthermore, several extensions based on patch-based comparisons between the atlas library and the target image make use of a variant of a local search algorithm or a supervised learning approach [83, 295, 377, 361, 152]. For instance, [83] uses a non-local means approach to carry out label transfer based on multiple patch comparisons; [152] uses a supervised machine-learning method to train a classifier using similar patches from the atlas library. Computationally, these patch-based approaches, especially the implementations that incorporate non-local means, are expensive [361] and require a considerable number of labeled images [377, 152]. Moreover, compared to single unified MRF models, the localized MRF model reduces the computational complexity while maintaining the spatial homogeneity constraints in the given neighborhood. It also allows local characteristics of the image to be captured based on high-confidence regions, without requiring iterative parameter estimation and inference methods such as EM.

3.4.2 Baseline Multi-Atlas Segmentation Method

MAGeT Brain (https://github.com/CobraLab/MAGeTbrain), a segmentation pipeline previously developed by our group, is used as the baseline method of comparison [277]. MAGeT Brain uses multiple manually labeled anatomical atlases and a bootstrapping method to generate a large set of candidate labels (votes) for each voxel of a given target image to be segmented. These labels are generated by propagating atlas segmentations to a template library, formed from a subset of target images, via transformations estimated by nonlinear image registration. Subsequently, template-library segmentations are propagated to each target image, and these candidate labels are fused using a label-fusion method. The number of candidate labels depends on the number of available atlases and the number of templates. In the default MAGeT Brain configuration, the candidate labels are fused by a majority vote. (In previous investigations by our group [56, 277], we noticed no improvements with cross-correlation and normalized mutual information based weighted voting. Hence, our default implementation uses a simple majority vote at the label-fusion stage of the algorithm.) These candidate labels from MAGeT Brain serve as the input to the AWoL-MRF, STAPLE, and default majority-vote label-fusion methods. Use of these candidate labels is not straightforward with the JLF implementation, for the following reason. JLF requires coupled atlas image and label pairs as input. The permutations in the registration stage of the MAGeT Brain pipeline generate candidate labels totaling the number of atlases x the number of templates; these candidate labels no longer have unique corresponding intensity images associated with them. Use of identical atlas (or template) library images as proxies is likely to deteriorate the performance of JLF, as it models the joint probability of two atlases making a segmentation error based on the intensity similarity between a pair of atlases and the target image [359]. Therefore, no template library is used during the JLF evaluation. Note that even though MAGeT Brain is used as the baseline method for the performance validation in this work, AWoL-MRF is a generic label-fusion algorithm that can be used with any multi-atlas segmentation pipeline that produces a set of candidate labels.
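For illustration, a voxel-wise majority vote over a stack of candidate binary segmentations can be implemented in a few lines; the array shapes and the number of candidates below are hypothetical, and this is a sketch rather than the MAGeT Brain implementation:

```python
# Minimal sketch of voxel-wise majority-vote label fusion over J candidate
# binary segmentations (e.g., atlases x templates candidates).
import numpy as np

rng = np.random.default_rng(3)
J = 45                                       # e.g., 9 atlases x 5 templates (illustrative)
candidates = rng.integers(0, 2, size=(J, 64, 64, 64)).astype(np.uint8)

votes = candidates.mean(axis=0)              # V(S): fraction of candidates voting "structure"
majority = (votes > 0.5).astype(np.uint8)    # simple majority-vote fusion
print(majority.shape, int(majority.sum()))
```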

3.4.3 Proposed Label-Fusion Method: AWoL-MRF

A generic label-fusion method involves some form of voting, such as a simple majority or a variant of weighted voting, which combines labels from a set of candidate segmentations derived from a multi-atlas library. These voting techniques normally yield accurate performance at labeling the core regions of an anatomical structure; however, the overall performance depends on the structural variability accounted for by the atlas library. Especially in cases where only a small number of expert atlases are available, the resultant segmentation of a target image can be split into two distinct regions: areas with (near) unanimous label votes and areas with divided label votes. The proposed method incorporates this observation by partitioning the given image volume into two subsets based on the label-vote distribution (number of votes per label per voxel) obtained from the candidate segmentations. Subsequently, these partitions are used to generate a set of patches on which we construct MRF models to impose homogeneity constraints in the neighborhood spanned by each patch. Finally, the voxels in these localized MRFs are updated in a sequential manner, incorporating the intensity values and label information of the neighboring voxels. A detailed description of this procedure is provided below.

Image Partitioning

Let S be a set comprising all voxels in a given 3-dimensional volume. Then an image I comprising gray-scale intensities and the corresponding label volume are defined as:

I(S): \{x \in S\} \rightarrow \mathbb{R}
L_j(S): \{x \in S\} \rightarrow \{0, 1\}    (3.1)

Thus, L_j represents the j-th candidate segmentation volume, comprising binary label values (background: 0 and structure: 1) for a given image. Then, with J candidate segmentations, we can obtain a label-vote distribution through voxel-wise normalization:

V(S) = \frac{\sum_j w_j L_j(S)}{J}    (3.2)

where w_j is the weight assigned to the j-th candidate segmentation. Now, V(S) represents the label probability distribution over all the voxels in the given image. For an individual voxel, it provides the probability of belonging to a particular structure: V(x_i) = P(L(x_i) = 1) = 1 - P(L(x_i) = 0). Now, we split the set S into two disjoint subsets, S_H (high-confidence region) and S_L (low-confidence region), such that

S_H = \{x_i \in S \mid V(x_i) > L_T^1 \,\cup\, (1 - V(x_i)) > L_T^0\}    (3.3)
S_L = \{x_i \in S \mid x_i \notin S_H\}

where L_T^0 and L_T^1 are the voting confidence thresholds for L = 0 and L = 1, respectively. Note that in the generic majority-vote scenario L_T^0 = L_T^1 = 0.5 and S_L collapses to an empty set. In order to identify and separate low-confidence regions, these thresholds are set at higher values (> 0.5) and can be adjusted based on empirical evidence (see Section 3.6.5). As mentioned earlier, voting distributions usually form a near consensus (uni-modal) towards a particular label at certain locations, such as the core regions of structures, and therefore these voxels are assigned to the high-confidence subset. In contrast, other areas that have a split (flat) label distribution are assigned to the low-confidence subset.
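A minimal sketch of this partitioning step, assuming the thresholds act on the vote fraction V(S) as reconstructed in Eq. 3.3; the threshold values are illustrative, not the ones used in the experiments:

```python
# Minimal sketch: split voxels into high-confidence (S_H) and low-confidence
# (S_L) sets based on the vote distribution and two confidence thresholds.
import numpy as np

def partition(votes, lt0=0.75, lt1=0.75):
    """votes: array of P(label = 1) per voxel; returns boolean masks (S_H, S_L)."""
    high_1 = votes > lt1           # confident "structure" voxels
    high_0 = (1.0 - votes) > lt0   # confident "background" voxels
    s_h = high_0 | high_1
    return s_h, ~s_h

rng = np.random.default_rng(4)
votes = rng.uniform(size=(64, 64, 64))     # toy vote distribution
s_h, s_l = partition(votes)
print(int(s_h.sum()), int(s_l.sum()))
```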

Patch Based Sub-graph Generation

The partitioning operation significantly reduces the number of nodes to be re-labeled (note: voxels are referred to as nodes in the context of graphs). However, considering the size of MR images, a single MRF model consisting of all S_L nodes and their neighbors is computationally expensive. Additionally, a unified model usually considers global averages over an entire structure during parameter estimation for the choice of prior distributions, such as P(intensity | label), which may not be ideal in cases where local signal characteristics show spatial variability. Therefore, we propose a patch-based approach, which further divides the given image into smaller subsets (3-dimensional cubes) comprising S_H as well as S_L nodes. The subsets are created with a criterion imposing a minimum number of S_H nodes in a given patch. This criterion essentially dictates the relative composition of S_H and S_L nodes in the patch, which is referred to as the "mixing ratio" parameter in this paper. The impact of this heuristic method of patch generation is discussed in Section 3.6.5. The basic idea behind this approach is to utilize the information offered by the S_H neighbors via pairwise interactions

(doubleton cliques), along with the local intensity information, to update the label likelihood of the S_L voxels. The algorithm implemented to generate these patches is described below.

First, the S_L nodes are sorted based on the number of S_H nodes in their 26-node neighborhood. Next, by thresholding on the mixing-ratio parameter, the top S_L nodes from the sorted list are selected as seeds. Then, patches of pre-defined length (L_patch) are constructed centered at these seeds. Fig. 3.1-A shows a schematic representation of the S_H and S_L partitions based on the initial label distribution (V(S)), as well as the overlaid patch-based subsets comprising S_H and S_L nodes. Note that, depending on the parameter choices (mixing ratio and patch length), these patches may not be strictly disjoint. In this case, nodes in overlapping patches are assigned to a single patch based on a simple metric, such as the distance from the seed node. Additionally, these patches may not cover the entire S_L region. These unreachable S_L nodes are labeled according to the baseline majority vote. These two edge cases could be mitigated with more sophisticated graph-partitioning methods; nevertheless, in our preliminary investigation such methods proved to be computationally expensive and yielded minimal accuracy improvements.
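A simplified sketch of the seed-selection heuristic described above; the mixing-ratio criterion is approximated here by a plain neighbour-count threshold, and this is an illustration rather than the AWoL-MRF implementation:

```python
# Minimal sketch: rank low-confidence voxels by the number of high-confidence
# voxels in their 26-voxel neighbourhood and keep the well-supported ones as seeds.
import numpy as np
from scipy.ndimage import convolve

def select_seeds(s_h, s_l, min_high_neighbours=13):
    kernel = np.ones((3, 3, 3)); kernel[1, 1, 1] = 0       # 26-connected neighbourhood
    high_counts = convolve(s_h.astype(int), kernel, mode="constant")
    candidates = s_l & (high_counts >= min_high_neighbours)
    return np.argwhere(candidates)                          # seed coordinates

rng = np.random.default_rng(5)
votes = rng.uniform(size=(32, 32, 32))                      # toy vote distribution
s_h = (votes > 0.75) | (votes < 0.25)
seeds = select_seeds(s_h, ~s_h)
print(len(seeds), "seeds")
```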

Figure 3.1: A) The segmentation of a sample hippocampus in sagittal view during various stages of algorithm. Row 1: The target intensity image to be segmented. Row 2: The voxel-wise label vote distribution map for the target image based on candidate labels. Row 3: Image partitioning comprising two disjoint regions (high confidence: red, low confidence: white). Row 4: Orange Patches (localized MRFs) comprising low confidence voxels. Row 5: Fused target labels. B) Image partitioning into certain and uncertain regions and generation of patches. C) Transformation of MRF graph into spanning tree representation. The tree is traversed starting from the root (seed) node and successively moving towards the leaf nodes.

Localized Markov Random Field Model

As seen in Fig. 3.1-A,B, the MRF model is built on the nodes in a given patch (S_P). The probability distribution associated with a particular field configuration (the label values of the voxels in the patch) can be factorized based on the cliques of the underlying graph topology. With a first-order connectivity assumption, we get a 3-dimensional grid topology, where each node (excluding patch edges) has six connected neighbors along the Cartesian axes. Consequently, this graph topology yields two types of cliques. The singleton cliques (C_1) of S_P correspond to the set of all voxels contained in the patch, whereas the doubleton cliques (C_2) correspond to the set of all pairs of neighboring voxels in the given patch. Then, for the MRF model, the total energy (U) of a given label configuration (y) is given by the sum of the clique potentials (V_C) over all cliques in the model:

U(y) = \sum_{c \in C} V_C(y) = \sum_{i \in C_1} V_{C_1}(y_i) + \sum_{(i,j) \in C_2} V_{C_2}(y_i, y_j)    (3.4)

where y : \{L(x_i) \mid x_i \in S_P\}. Now, assuming that the voxel gray-scale intensities (f_i = I(x_i)) follow a Gaussian distribution given the label value, we get the following relation for the singleton clique potential based on the MRF model:

V_{C_1}(y_i) = \log\big(P(f_i \mid y_i)\big) = -\log\big(\sqrt{2\pi}\,\sigma_{y_i}\big) - \frac{(f_i - \mu_{y_i})^2}{2\sigma_{y_i}^2}    (3.5)

The mean and variance of the Gaussian model can be estimated empirically for each patch, utilizing the S_H nodes in the given patch as a training set. This approach proves to be advantageous especially in the context of T1-weighted images of the brain, as intensity distributions tend to fluctuate spatially. The doubleton clique potentials are modeled to favor similar labels at neighboring nodes and are given by the following relation:

V_{C_2}(y_i, y_j) = -\beta\, d(y_i, y_j) =
\begin{cases}
-\beta & \text{if } y_i = y_j \\
+\beta & \text{if } y_i \neq y_j
\end{cases}    (3.6)

The beta parameter can be estimated empirically using the atlas library [301]. As beta increases, the regions become more homogeneous. This is discussed further in Section 3.6.5. Finally, the posterior probability distribution of the label configuration can be computed using the Hammersley-Clifford theorem and is given by:

P(y \mid f) = \frac{1}{Z}\exp\big(-U(y)\big)
P(y \mid f) \propto \exp\left( \sum_{i \in C_1} \left( \log\big(\sqrt{2\pi}\,\sigma_{y_i}\big) + \frac{(f_i - \mu_{y_i})^2}{2\sigma_{y_i}^2} \right) + \beta \sum_{(i,j) \in C_2} d(y_i, y_j) \right)    (3.7)

where Z is the partition function that normalizes the configuration energy (U) into a probability distribution. The maximum a posteriori (MAP) label configuration is given by:

y^{\mathrm{MAP}} = \arg\max_y P(y \mid f) = \arg\min_y U(y)    (3.8)

The posterior segmentation can be computed using a variety of optimization algorithms as described in the next section.
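For concreteness, the sketch below evaluates the patch energy of Eqs. 3.4-3.6 for a given label configuration, estimating the Gaussian parameters from the high-confidence voxels of the patch; the data structures and the default beta are illustrative assumptions, not the AWoL-MRF implementation:

```python
# Minimal sketch of the localized MRF energy: Gaussian singleton potentials
# estimated from the S_H voxels of the patch, plus +/- beta doubleton terms
# over 6-connected neighbour pairs.
import numpy as np

def patch_energy(intensities, labels, s_h_mask, beta=0.5):
    """intensities, labels: 3-D arrays over one patch; s_h_mask: boolean S_H mask."""
    # Gaussian parameters per label class, estimated from S_H voxels only
    mu, sigma = {}, {}
    for lab in (0, 1):
        vals = intensities[s_h_mask & (labels == lab)]
        mu[lab] = vals.mean() if vals.size else intensities.mean()
        sigma[lab] = (vals.std() if vals.size > 1 else intensities.std()) + 1e-6

    # singleton potentials: V_C1(y_i) = log P(f_i | y_i) under the Gaussian model
    u = 0.0
    for lab in (0, 1):
        f = intensities[labels == lab]
        u += np.sum(-np.log(np.sqrt(2 * np.pi) * sigma[lab])
                    - (f - mu[lab]) ** 2 / (2 * sigma[lab] ** 2))

    # doubleton potentials over 6-connected pairs: -beta if labels agree, +beta otherwise
    for axis in range(3):
        a = np.take(labels, range(labels.shape[axis] - 1), axis=axis)
        b = np.take(labels, range(1, labels.shape[axis]), axis=axis)
        u += np.sum(np.where(a == b, -beta, beta))
    return u

# toy usage: a 5x5x5 patch with random intensities and an initial labelling
rng = np.random.default_rng(0)
inten = rng.normal(size=(5, 5, 5))
labs = (inten > 0).astype(int)
print(patch_energy(inten, labs, s_h_mask=np.ones_like(labs, dtype=bool)))
```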

Inference

This section provides the details of the optimization technique used to compute the posterior label distribution. Common iterative inference and learning methods such as iterated conditional modes (ICM) and expectation-maximization (EM) are computationally intensive, and ICM variants often suffer from greedy behavior that results in local optima. Here, we present an alternative approach that computes the posterior label distribution in a non-iterative, online process, minimizing computational costs. The intuition behind this approach is to mimic manual tracing protocols, in which the delineation process traverses from higher-confidence regions to lower-confidence regions in a sequential manner. In order to follow such a process, we transform the undirected graph structures defined by the MRF patches into directed spanning trees (see Fig. 3.1-C). Then we compute the posterior label distributions one voxel at a time as we traverse (walk) through the directed tree exhaustively. The directed tree structure mitigates the need for iterative inference over loops within the original undirected graph. The following is a brief outline of the implementation of the inference procedure (a minimal code sketch of the walk ordering is given after these steps):

1. Initialize all voxels to the labels given by the mode of baseline label distribution.

2. Transform the subgraph consisting of the S_L nodes within an MRF patch into a directed tree, specifically a spanning tree with the seed voxel as the root. This transformation is computed using a minimum spanning tree method (Prim's algorithm [282]), which finds the optimal tree structure based on a predefined edge-weight criterion. In this method, the weights are assigned based on node adjacency and voxel intensity gradients:

w(x_i, x_j) =
\begin{cases}
(f_i - f_j)^2 & \text{if } d(x_i, x_j) = 1 \\
\infty & \text{if } d(x_i, x_j) \neq 1
\end{cases}    (3.9)

where d(xi, xj) is a graph metric representing distance between two vertices.

3. Traverse through the entire ordered sequence of the minimum spanning tree to update the label at each voxel using Eq. 3.8.

4. Repeat this process for all MRF patches.
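The following is a minimal sketch of the walk ordering referenced in the steps above: it builds a spanning tree over the low-confidence voxels of a patch using the edge weights of Eq. 3.9 and returns a seed-first visit order. SciPy's generic minimum-spanning-tree routine stands in here for the Prim's-algorithm step, the label update itself is omitted, and the coordinates and intensities are toy values:

```python
# Minimal sketch: spanning-tree walk order over the S_L nodes of one patch.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def walk_order(coords, intensities, seed_index):
    """coords: (n, 3) integer voxel coordinates of the S_L nodes of one patch."""
    n = len(coords)
    w = lil_matrix((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.abs(coords[i] - coords[j]).sum() == 1:          # 6-connected neighbours
                w[i, j] = (intensities[i] - intensities[j]) ** 2 + 1e-9
    mst = minimum_spanning_tree(w.tocsr())                        # stands in for Prim's algorithm
    order, _ = breadth_first_order(mst, seed_index, directed=False)
    return order                                                  # seed first, leaves last

# toy usage: a flat 3x3 patch with arbitrary intensities
coords = np.argwhere(np.ones((3, 3, 1), dtype=bool))
intens = np.arange(len(coords), dtype=float)
print(walk_order(coords, intens, seed_index=0))
```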

                 CN (N=20)            LMCI (N=20)           AD (N=20)             Combined (N=60)
Age (Years)      72.2, 75.5, 80.3     70.9, 75.6, 80.4      69.4, 74.9, 80.1      70.9, 75.2, 80.2
Sex (M/F)        10/10                10/10                 10/10                 30/30
Education        14.0, 16.0, 18.0     13.8, 16.0, 16.5      12.0, 15.5, 18.0      13.0, 16.0, 18.0
CDR-SB           0.00, 0.00, 0.00     1.00, 2.00, 2.50      3.50, 4.00, 5.00      0.00, 1.75, 3.62
ADAS 13          6.00, 7.67, 11.00    14.92, 20.50, 25.75   24.33, 27.00, 32.09   9.50, 18.84, 26.25
MMSE             28.8, 29.5, 30.0     26.0, 27.5, 28.2      22.8, 23.0, 24.0      24.0, 27.0, 29.0

Table 3.1: ADNI1 cross-validation subset demographics. CN: Cognitively Normal. LMCI: Late-onset Mild Cognitive Impairment. AD: Alzheimer’s Disease. CDR-SB: Clinical Dementia Rating-Sum of Boxes. ADAS: Alzheimer’s Disease Assessment Scale. MMSE: Mini-Mental State Examination. Values are presented as lower quartile, median, and upper quartile for continuous variables

3.5 Validation Experiments

3.5.1 Datasets

For complete details please refer to supplementary materials.

Experiment I: ADNI Validation

Data used in this experiment was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). The dataset consists of 60 baseline scans in the ADNI1: Complete 1Yr 1.5T standardized dataset [380]. The expert manual segmentations for the hippocampus (ADNI-specific) were obtained based on the Pruessner protocol [285], which is used for validation and performance comparisons.

Experiment II: First Episode Psychosis (FEP) Validation

Data used in preparation of this experiment were obtained from the Prevention and Early Intervention Program for Psychoses (PEPP-Montreal), a specialized early intervention service at the Douglas Mental Health University Institute in Montreal, Canada [239]. The dataset consists of structural MRIs of 81 subjects. Expert whole hippocampal manual segmentations of each subject were produced following the Pruessner protocol [285].

Experiment III: Preterm Neonatal Cohort Validation

This cohort consists of 22 premature neonates whose anatomical images were acquired at two time points, once in the first weeks after birth when clinically stable and again at term equivalent age (total of 44 images: 22 early-in-life and 22 term-equivalency images). The whole hippocampus was manually segmented by an expert rater using a 3-step segmentation protocol. The protocol adapts the histological definitions of [110], as well as existing whole hippocampal segmentation protocols for MR images [285, 372, 39] to the preterm infant brain.

Experiment IV: Hippocampal Volumetry

The volumetric analysis was performed using the standardized ADNI1: Complete Screening 1.5T dataset [380], comprising 811 ADNI T1-weighted screening and baseline MR images of healthy elderly (227), MCI (394), and AD (190) patients.

                  N*    FEP
Age               80    21, 23, 26
Gender (M/F)      81    51/30
Handedness        81    ambi: 5, left: 4, right: 72
Education         81    11, 13, 15
FSIQ              79    88, 102, 109

Table 3.2: First episode psychosis subject demographics. Ambi: ambidextrous. SES: Socioeconomic Status score. FSIQ: Full Scale IQ. Values are presented as lower quartile, median, and upper quartile for continuous variables. N* is the number of non-missing values out of 81.

3.5.2 Label-Fusion Methods Compared

The performance of AWoL-MRF is compared against MAGeT Brain majority vote, STAPLE, and Joint Label Fusion (JLF). The basic process of each of these label-fusion methods is described below.

MAGeT Brain Majority Vote

As described earlier, the MAGeT Brain pipeline uses a template library sampled from the subject image pool. Consequently, the total number of candidate labels (votes) prior to label fusion equals the number of atlases x the number of templates. In the default MAGeT Brain (MB) configuration, these candidate labels are fused based on a simple majority vote.

Simultaneous Truth and Performance level Estimation (STAPLE)

STAPLE (Simultaneous Truth and Performance Level Estimation) [364] is a probabilistic performance model that estimates the underlying ground-truth labeling from a set of manual or automatic segmentations generated by multiple raters or methods. Note that STAPLE does not consider the intensity values of the subject image in its MRF-based model. STAPLE carries out label fusion in an Expectation-Maximization framework and estimates the performance of each manual rater or automatic segmentation method for each label class, which is then used to find the optimal segmentation for the subject image. The software implementation of STAPLE was obtained from the Computational Radiology Laboratory (http://www.crl.med.harvard.edu/software/STAPLE/index.php).

Joint Label Fusion (JLF)

Among the modern label-fusion approaches incorporating spatially varying weight distributions, JLF also accounts for dependencies within the atlas library [359]. These dependencies are estimated based on an intensity-similarity measure between a pair of atlases and the target image in a small neighborhood surrounding a voxel. This approach mitigates the bias typically incurred by the presence of similar atlases. The software implementation of JLF was obtained from the ANTs repository on GitHub (https://github.com/stnava/ANTs/blob/master/Scripts/antsJointLabelFusion.sh).

CN vs. MCI vs. AD comparisons

Volumetric statistics, mean (stdev):
                 CN               MCI              AD
Majority Vote    2084.7 (615.3)   1960.5 (599)     1897.2 (582.3)
STAPLE           2236.6 (659)     2124.2 (649.3)   2068.2 (655.4)
JLF              1943.6 (593.5)   1803.3 (572.6)   1697.3 (551.6)
AWoL-MRF         2312.9 (676.3)   2147.5 (652)     2047.7 (631.3)

Cohen's d:
                 CN v MCI   CN v AD   MCI v AD
Majority Vote    0.1727     0.3194    0.123
STAPLE           0.1463     0.2688    0.1005
JLF              0.202      0.4343    0.2155
AWoL-MRF         0.2092     0.4111    0.1783

Linear model (t-values):
                 CN v MCI        CN v AD         MCI v AD
Majority Vote    -3.875942***    -3.662264***    -0.402088
STAPLE           -3.424026***    -3.001039**     -0.101867
JLF              -4.19533***     -4.884486***    -1.451038
AWoL-MRF         -4.424061***    -4.657673***    -0.987672

MCI-converters vs. MCI-stable comparisons

Volumetric statistics, mean (stdev):
                 MCI-converters   MCI-stable
Majority Vote    1846.2 (489.6)   1995.7 (619.3)
STAPLE           2000.7 (542.8)   2163.6 (668.0)
JLF              1686.8 (483.9)   1842.2 (586.3)
AWoL-MRF         2007.4 (534.6)   2186.6 (672.6)

Cohen's d and linear model (MCI-converters vs. MCI-stable):
                 Cohen's d   Linear model
Majority Vote    0.185       -1.708
STAPLE           0.181       -1.616
JLF              0.192       -1.844
AWoL-MRF         0.204       -1.965*

Table 3.3: Hippocampal volumetry statistics for the ADNI1: Complete Screening 1.5T dataset per diagnosis (AD: Alzheimer's patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). Top: volumetric statistics of the segmentations provided by each method. Middle: effect sizes of pairwise differences between diagnostic groups based on Cohen's d metric. Bottom: t-values and significance levels from a linear model comprising "Age", "Sex", and "total-brain-volume" as covariates (∗ : p < 0.05, ∗∗ : p < 0.01, ∗∗∗ : p < 0.001).

Atlases | DSC | Reference study | Validation | Dataset (ground truth)
9 | 0.881 | AWoL-MRF | 3-Fold MCCV, N=60 | ADNI (Pruessner)
9 | 0.897 | AWoL-MRF | 3-Fold MCCV, N=81 | ADNI (Pruessner)
9 | 0.81 | AWoL-MRF | 1-Fold MCCV, N=44 | 3-step segmentation protocol (b)
9 | 0.869 | MAGeT Brain [277] | 10-Fold MCCV, N=60 | ADNI (Pruessner)
9 | 0.892 | MAGeT Brain [277] | 5-Fold MCCV, N=81 | FEP subjects
9 | 0.79 | MAGeT Brain [149] | 1-Fold MCCV, N=44 | 3-step segmentation protocol (b)
30 | 0.82 | Decision Fusion [161] | LOOCV | Controls
21 | 0.862 | Auto Context Model [254] | LOOCV | ADNI (SNT)
55 | 0.86 | Barnes et al. [26] | LOOCV | Controls and AD
275 | 0.835 | Aljabar et al. [10] | LOOCV | Controls
80 | 0.89 | Collins et al. [73] | LOOCV | Controls
30 | 0.885 | Lotjonen et al. [235] | N=60 | ADNI (SNT)
55 | 0.89 | MAPS [227] | N=30 | ADNI (SNT)
30 | 0.848 | LEAP [374] | N=182 | ADNI (SNT)
16 | 0.861 | Patch-based [83] (a) | LOOCV | ADNI (Pruessner)
20 | 0.897 (L), 0.888 (R) | JLF [359] | 10-Fold MCCV, N=20 | semi-automatic + manual correction
15 | 0.862 (L), 0.861 (R) | JLF [361] (c) | N=20 | brainCOLOR
15 | 0.872 (L), 0.871 (R) | JLF (with corrective learning) [361] (c) | N=20 | brainCOLOR
9 | 0.841 | MAGeT Brain [277] | 10-Fold MCCV, N=69 | ADNI (SNT)

Table 3.4: Summary of automated segmentation methods for the hippocampus. AD = Alzheimer's Disease; MCI = Mild Cognitive Impairment; CN = Cognitively Normal; FEP = First Episode of Psychosis; LOOCV = Leave-one-out cross-validation; MCCV = Monte Carlo cross-validation; SNT = Surgical Navigation Technologies (Medtronic) semi-automated labels; L-HC = Left hippocampus; R-HC = Right hippocampus. (a): AD: 0.838, MCI: n/a, CN: 0.883; (b): see [149] for manual segmentation protocol details; (c): the methods were applied in the 2012 MICCAI Multi-Atlas Labeling Challenge.

3.5.3 Evaluation Criteria

We performed both quantitative and qualitative assessment of the results. The segmentation accuracy was measured using Dice similarity coefficient (DSC) given as follows:

\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}    (3.10)

where A and B are the three-dimensional label volumes being compared. We also evaluated the level of agreement between automatically computed volumes and manual segmentations using Bland-Altman plots [35]. The Bland-Altman plots were created with the segmentations yielded by the 5 atlases and 19 templates configuration. For the ADNI and FEP datasets, we performed three-fold cross-validation and obtained the quantitative scores by averaging over all validation rounds, as well as over the left and right hippocampal segmentations. Constrained by the size of the Premature Birth and Neonatal dataset and the quality of certain images, which caused difficulties in the registration pipeline, we performed a single round of validation to determine whether the results found in Experiments I and II generalize to brains with radically different neuroanatomy. Due to incomplete myelination of the neonatal brain, the MR image contrast levels for this dataset are drastically different: the intensity values for the hippocampus are reversed relative to T1-weighted images of the adolescent or adult human brain. These distinct attributes make it an excellent "held-out sample" or "independent test set" for performance evaluation. Thus, for this dataset, the quantitative scores are averages over the left and right hippocampi over a single validation round.
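A minimal sketch of the DSC computation of Eq. 3.10 for two binary label volumes (toy random volumes are used here purely to exercise the function):

```python
# Minimal sketch: Dice similarity coefficient between two binary label volumes.
import numpy as np

def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

rng = np.random.default_rng(6)
auto = rng.integers(0, 2, size=(64, 64, 64))     # toy automated segmentation
manual = rng.integers(0, 2, size=(64, 64, 64))   # toy manual segmentation
print(f"DSC = {dice(auto, manual):.3f}")
```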

3.6 Results

3.6.1 Experiment I: ADNI Validation

For the ADNI dataset, the mean Dice score of AWoL-MRF maximizes at 0.881 with 9 atlases and 19 templates. As seen in Fig. 3.2, AWoL-MRF outperforms the majority vote (0.862), STAPLE (0.858), and JLF (0.873) label-fusion methods. Particularly compared to JLF, more improvement is seen with fewer atlases, as AWoL-MRF reaches a mean Dice score of 0.88 with only six atlases. The improvement diminishes with an increasing number of atlases and a smaller number of templates (the bootstrapping parameter for generating candidate labels). Additionally, AWoL-MRF helps reduce the bias introduced by certain majority-vote techniques that arbitrarily break vote ties in cases with an even number of atlases, as previously described by our group and others [161, 277]. We find that AWoL-MRF corrects these dips in performance, which is evident from the extra accuracy boosts for even numbers of atlases. DSC distribution comparisons for four configurations (number of atlases = 3, 5, 7, 9; number of templates = 11) are shown in Fig. 3.3. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. As expected, we also notice a reduction in variance with an increasing number of atlases. The Bland-Altman plots reveal the biases incurred with the application of each automatic segmentation method during volumetric analysis. Fig. 3.4 shows that all four methods have a proportional bias associated with their volume estimates. Specifically, we see that in all four methods, the volumes of smaller hippocampi are overestimated, whereas those of larger hippocampi are underestimated. Nevertheless, AWoL-MRF shows the smallest mean bias magnitude along with tighter limits of agreement across the cohort. STAPLE displays similar mean bias values but higher variance in volume estimation compared to AWoL-MRF, which is evident from its steeper line slope and wider limits of agreement. Majority vote and JLF show the largest positive mean bias, indicating a tendency towards underestimation of hippocampal volume. Qualitatively, improvement in segmentations is seen at the surface regions of the hippocampus. As seen in Fig. 3.5, spatial homogeneity is improved as well.

3.6.2 Experiment II: FEP Validation

For the FEP dataset, the mean Dice score of AWoL-MRF maximizes at 0.897 with 9 atlases and 19 templates. Similar to Experiment I, AWoL-MRF consistently outperforms the majority vote (0.891), STAPLE (0.892), and JLF (0.888) methods; however, the improvement is comparatively modest. A larger improvement is seen with fewer atlases when compared to JLF, as AWoL-MRF surpasses a mean Dice score of 0.89 with only three atlases (see Fig. 3.6). The improvement diminishes with an increasing number of atlases and a smaller number of templates. In addition to the smaller atlas-library requirement, the ability to reduce the bias introduced by the majority-vote technique is also observed in this experiment. DSC distribution comparisons for four sample configurations (number of atlases = 3, 5, 7, 9; number of templates = 11) are shown in Fig. 3.7. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. Similar to the accuracy gains, the variance of the Dice score distribution is also smaller compared to the ADNI experiment. The Bland-Altman plots (see Fig. 3.8) show that AWoL-MRF and majority vote exhibit the smallest mean proportional biases. In comparison, STAPLE and JLF show strong biases characterized by considerable overestimation (negative bias) and underestimation (positive bias) of hippocampal volume across the cohort, respectively. Quantitatively, AWoL-MRF still outperforms the other three methods, as evident from the smaller line slope and tighter limits of agreement. Similar to the ADNI experiment, qualitative improvement is seen at the surface regions of the hippocampus (see Fig. 3.9).

Figure 3.2: Experiment I DSC: All results show the average performance values of left and right hippocampi over 3-fold validation. The top-left subplot shows the mean DSC performance of all the methods. The remaining subplots show the mean DSC improvement over the compared methods for different numbers of templates (the bootstrapping parameter of MAGeT Brain).

Figure 3.3: Experiment I DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗ : p < 0.05, ∗∗ : p < 0.01, ∗∗∗ : p < 0.001).

Figure 3.4: Experiment I Bland-Altman analysis: Comparison between computed and manual volumes (in mm3) for a single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume and the limits of agreement (LA+/LA-: 1.96 SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa.

3.6.3 Experiment III: Preterm Neonatal Cohort Validation

The mean Dice score of AWoL-MRF maximizes at 0.810 with 9 atlases and 19 templates. Note that due to incomplete myelination of the neonatal brain, the intensity values for the hippocampus are reversed relative to T1-weighted images of the child, adolescent, and adult human brain. Nevertheless, no manual interventions were carried out for the implementation of the methods. Similar to the first two experiments, AWoL-MRF consistently outperforms the majority vote (0.775), STAPLE (0.775), and JLF (0.771) methods, by a large margin. More improvement is seen with fewer atlases when compared to JLF, as AWoL-MRF surpasses a mean Dice score of 0.80 with only four atlases (see Fig. 3.10). The improvement diminishes with an increasing number of atlases and a smaller number of templates. Also, due to the single-fold experimental design for this dataset, higher performance variability is observed, especially with a smaller number of templates. DSC distribution comparisons for four sample configurations (number of atlases = 3, 5, 7, 9; number of templates = 11) are shown in Fig. 3.11. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. The Bland-Altman plots show that both AWoL-MRF and JLF provide volume estimates with an extremely small proportional bias (see Fig. 3.12). Compared to the ADNI and FEP datasets, the magnitude of the bias is significantly lower, with AWoL-MRF producing the best result. In comparison, majority vote consistently underestimates and STAPLE consistently overestimates hippocampal volumes across the cohort. Similar to the previous two experiments, qualitative improvement is seen at the surface regions of the hippocampus (see Fig. 3.13). Note that the intensity values for the hippocampus are reversed due to incomplete myelination.

Figure 3.5: Experiment I qualitative analysis: comparison of manual versus automatic segmentation methods. The red rectangle illustrates a section where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: Majority Vote: 0.806; STAPLE: 0.833; JLF: 0.804; AWoL-MRF: 0.854. The segmentation of the left hippocampus is shown in sagittal view.

3.6.4 Experiment IV: Hippocampal Volumetry

The volumetric analysis was performed on the standardized ADNI1: Complete Screening 1.5T dataset [380]. The segmentations were produced using 9 atlases with each method. For majority vote, STAPLE, and AWoL-MRF, the number of templates was set to 19. As mentioned earlier, the use of templates is not possible with JLF due to the coupling between the image and label volumes from the atlas library.

Group Comparisons between CN, MCI, and AD

In this part of the analysis, we compared the mean hippocampal volume measurements per diagnosis (AD: Alzheimer’s patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). As seen in Fig. 3.14 (top pane), the mean volume decreases with the severity of the disease for all methods. The volumetric statistics are summarized in Table 3.3. Based on the Cohen’s d metric as a measure of effect size, we see the largest separation between the “CN vs. AD” diagnostic categories, followed by the “CN vs. MCI” categories, and lastly the “MCI vs. AD” categories. We also constructed a linear model predictive of hippocampal volume based on diagnostic category along with “age”, “sex”, and “total-brain-volume” as covariates.

Figure 3.6: Experiment II DSC: All results show the average performance values of the left and right hippocampi over 3-fold validation. The top-left subplot shows the mean DSC performance of all methods. The remaining subplots show the mean DSC improvement over the compared methods for different numbers of templates (the bootstrapping parameter of MAGeT Brain).

Figure 3.7: Experiment II DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001).

Figure 3.8: Experiment II Bland-Altman analysis: comparison between computed and manual volumes (in mm3) for a single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume and the limits of agreement (LA+/LA−: 1.96 SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa.

The results show that the effect sizes are most pronounced for AWoL-MRF and JLF in all pairwise comparisons. All four methods show strong volumetric differences (p < 0.001 or p < 0.01) between the “CN vs. AD” categories, followed by “CN vs. MCI”, which shows relatively weaker significance levels. JLF also shows volumetric differences between the “MCI vs. AD” categories, but with a weaker significance level (p < 0.05). In the linear model based comparison, we see that all four methods show significant differences (p < 0.001 or p < 0.01) only for the “CN vs. AD” and “CN vs. MCI” comparisons.
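As a concrete illustration of the group comparison described above, the sketch below computes Cohen's d between two diagnostic groups and fits a linear model of hippocampal volume with diagnosis, age, sex, and total brain volume as covariates. It is a minimal sketch using pandas and statsmodels under assumed column names (hc_volume, dx, age, sex, tbv); the analysis in this chapter may have been implemented differently.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def cohens_d(x, y):
    """Effect size between two groups using the pooled standard deviation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

def group_comparison(df: pd.DataFrame):
    """df columns (hypothetical): hc_volume, dx in {'CN', 'MCI', 'AD'}, age, sex, tbv."""
    d_cn_ad = cohens_d(df.loc[df.dx == "CN", "hc_volume"],
                       df.loc[df.dx == "AD", "hc_volume"])
    # Linear model of volume on diagnosis plus covariates, as described in the text.
    model = smf.ols("hc_volume ~ C(dx, Treatment('CN')) + age + C(sex) + tbv",
                    data=df).fit()
    return d_cn_ad, model.summary()
```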

Group Comparisons between MCI converters and non-converters

In this part of the analysis, we compared the mean hippocampal volume measurements of two MCI subgroups: MCI-converters (65 subjects converting from MCI to AD diagnosis within 1 year of screening) and MCI-stable (285 subjects with a stable MCI diagnosis within 1 year of screening). The volumetric statistics are summarized in Table 3.3. Fig. 3.14 (bottom pane) shows that the MCI-converters have relatively smaller volumes compared to the MCI-stable group. JLF and AWoL-MRF show the strongest effect sizes based on the Cohen’s d metric, with statistically significant (p < 0.05) differences between these two groups.

3.6.5 Parameter Selection

We studied the impact of parameter selection on the performance of AWoL-MRF, jointly considering segmentation accuracy and computational cost. The four parameters which need to be chosen a priori are: the confidence thresholds (L_T^0, L_T^1), the patch length (L_patch), the mixing ratio ((S_H/S_L)_patch), and the β parameter of the MRF model.

Figure 3.9: Experiment II qualitative analysis: Comparison of manual versus automatic segmentation methods. The red rectangle illustrates a section where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: majority vote: 0.875, STAPLE: 0.878, JLF: 0.856, AWoL-MRF: 0.891. The segmentation of the right hippocampus is shown in sagittal view.

Figure 3.10: Experiment III DSC: preterm neonate cohort validation: All results show the average performance values of the left and right hippocampi over 3-fold validation. The top-left subplot shows the mean DSC performance of all methods. The remaining subplots show the mean DSC improvement over the compared methods for different numbers of templates (the bootstrapping parameter of MAGeT Brain).

Figure 3.11: Experiment III DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001).

Figure 3.12: Experiment III Bland-Altman analysis: comparison between computed and manual volumes (in mm3) for single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume, and limits of agreement (LA+/LA-: 1.96SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that the points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa.

Figure 3.13: Experiment III qualitative analysis: Comparison of manual versus automatic segmentation methods. The red rectangles illustrate sections where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: majority vote: 0.748, STAPLE: 0.760, JLF: 0.746, AWoL-MRF: 0.807. The segmentation of the left hippocampus is shown in sagittal view. Note that for this particular dataset, the brain structures are mostly unmyelinated, causing a reversal of the intensity values for the hippocampal structure - as shown in the top row.

Recall that the Gaussian distribution parameters in the MRF are estimated for each patch automatically using the S_H nodes in the given patch. First, the confidence threshold parameters are selected heuristically based on the voting distribution. As mentioned before, both L_T^0 and L_T^1 values need to be greater than 0.5 to produce a non-empty low-confidence voxel set. Based on the assumption that the high-confidence region (S_H) comprises more structural voxels

(L(x_i) = 1) than the total number of voxels in the low-confidence region (S_L), we define the following metric:

\rho = \frac{|S_L|}{|\{\, x_i \in S_H : L(x_i) = 1 \,\}|} \qquad (3.11)

Then we choose the confidence thresholds (L_T^0 and L_T^1) that fall in the parameter space bounded by ρ ∈ (0.5, 1). Fig. 3.15-Left shows an example of these bound values, computed for the ADNI dataset in Experiment I (left hippocampus). Note that larger threshold values imply a larger S_L region, and consequently a higher computational time. Based on this heuristic, we chose L_T^0 = 0.8 and L_T^1 = 0.6 for Experiments I, II, and IV, and L_T^0 = L_T^1 = 0.7 for Experiment III. As described in Section 3.4.3, the patch length and the mixing ratio parameters are interrelated and directly affect the coverage of the S_L region.

Figure 3.14: Hippocampal volume (in mm3) vs. diagnosis. Cohen’s d scores (effect sizes) and statistical significance are reported for pairwise comparisons between diagnostic groups.

From a performance perspective, these parameters have a higher impact on the computational time than on the segmentation accuracy (see Fig. 3.15-Middle, Right). A higher L_patch implies a larger MRF model on the sub-volume and therefore requires a higher computational time. Conversely, smaller patches would reduce the computational time, but would run the risk of insufficient

coverage of the S_L region and consequently offer a poorer accuracy improvement. The third parameter choice, the mixing ratio, affects the total number of seeds/patches for a given image. A higher ratio necessitates a search for S_L nodes surrounded by a large number of S_H nodes, which reduces the total number of patches as well as the computational time. Based on the accuracy versus computational cost trade-off analysis with respect to these parameter choices, we selected a patch length of 11 voxels and a minimum mixing ratio of 0.0075, which translates into seed nodes surrounded by a minimum of 10 S_H nodes in the 26-node neighborhood, for all validation experiments. Lastly, the β parameter of the MRF model controls the homogeneity of the segmentation. It depends on the image intensity distribution and the structural properties of the anatomical structure. A large value of β results in more homogeneous regions, giving a smoothed appearance to a structure. We selected β = −0.2 based on the results of a training phase in which we split the atlas pool into two groups and used one set to segment the other.
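To make the partitioning and seed-selection heuristics concrete, the sketch below computes ρ from a voxel-wise voting map and selects seed voxels by counting high-confidence neighbors in the 26-voxel neighborhood. It is a simplified Python illustration of the heuristics described above, not the Matlab implementation used in this chapter; in particular, the assignment of L_T^0 and L_T^1 to the background and structure labels, and the exact neighbor set counted for the mixing ratio, are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def partition_and_seeds(vote_map, lt0=0.8, lt1=0.6, min_sh_neighbors=10):
    """Partition voxels by label confidence and pick seed voxels.

    vote_map: 3D array of voting proportions in [0, 1] for the structure label
    (fraction of candidate segmentations voting 'structure' at each voxel).
    Assumption: lt1 thresholds the structure label and lt0 the background label.
    """
    s_h_struct = vote_map >= lt1                     # high-confidence structural voxels
    s_h_bg = (1.0 - vote_map) >= lt0                 # high-confidence background voxels
    s_l = ~(s_h_struct | s_h_bg)                     # low-confidence (uncertain) voxels

    # rho: low-confidence voxels over high-confidence structural voxels (Eq. 3.11)
    rho = s_l.sum() / max(s_h_struct.sum(), 1)

    # Count S_H neighbors in the 26-voxel neighborhood of every voxel.
    kernel = np.ones((3, 3, 3))
    kernel[1, 1, 1] = 0
    sh_count = convolve((s_h_struct | s_h_bg).astype(float), kernel,
                        mode="constant").round()

    # Seeds: uncertain voxels with at least `min_sh_neighbors` S_H neighbors.
    seeds = s_l & (sh_count >= min_sh_neighbors)
    return rho, seeds
```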

Figure 3.15: Parameter selection. A: Effect of the confidence threshold values on image partitioning. ρ represents the ratio of low-confidence voxels over high-confidence structural voxels. The highlighted region denotes the heuristically 'good' region for threshold selection. B: Effect of the mixing ratio and L_patch on DSC performance. The mixing ratio is the minimum required number of S_H nodes in the 26-node neighborhood for a given seed voxel. Note that performance improves with a larger L_patch, whereas it worsens with a smaller L_patch or a higher mixing ratio due to poor coverage of the S_L region. C: Effect of the mixing ratio and L_patch on computational cost. The light blue line shows the number of patches for a given configuration as a reference. Note that the computation time increases exponentially with a higher L_patch and a smaller mixing ratio. (Note: mixing ratio* represents the equivalent minimum number of S_H nodes required in the 26-node neighborhood for seed node selection.)

3.7 Discussion and Conclusion

In this work we presented a novel label-fusion method that can be incorporated into any multi-atlas segmentation pipeline for improved accuracy and robustness. We validated the performance of AWoL-MRF on three independent datasets spanning a wide range of demographics and anatomical variations. In Experiment I, we validated AWoL-MRF on an Alzheimer’s disease cohort (N = 60) with a median age of 75. In Experiment II, validation was performed on a first-episode psychosis cohort (N = 81) with a median age of 23. In Experiment III, we applied AWoL-MRF to a unique cohort (N = 22 × 2) comprising preterm neonates scanned in the first weeks after birth and again at term-equivalent age, with distinctly different brain sizes and MR scan characteristics. In all of these exceptionally heterogeneous subject groups, AWoL-MRF provided superior segmentation results compared to all three reference methods (majority vote, STAPLE, and JLF), based on the DSC metric as well as proportional bias measurements.

In all three experiments, one of the most desirable benefits is that AWoL-MRF offers superior performance with a remarkably small atlas library. AWoL-MRF provides mean DSC scores over 0.88 with only six atlases (Experiment I), 0.89 with only three atlases (Experiment II), and 0.80 with only four atlases (Experiment III), whereas the other methods require larger atlas libraries to deliver similar performance. This is an important benefit, as it reduces the resource expenditure on the manual delineation of MR images and speeds up the analysis pipelines. From a robustness perspective, we also notice a reduction of two types of biases. First, AWoL-MRF mitigated the degradation in accuracy caused by vote ties when the atlas library is small and even-numbered. Second, and more importantly, we see a consistent reduction of proportional bias, as evident from the Bland-Altman analysis.

We believe that the performance boosts provided by AWoL-MRF can be explained by two major factors. First, we argue that the utilization of intensity values and local neighborhood constraints acts as a regularizer, which helps avoid over-fitting to the hippocampal model represented by the atlas library. Neither majority vote nor STAPLE considers intensity values in the label-fusion stage, and thus both are more likely to ignore minute variations near the surface of the structure that are not well represented within the atlas library. JLF, which does take intensity information into account and implements a patch-based approach, tends to perform better than majority vote and STAPLE with a relatively higher number of atlases (> 4 in Experiment I and > 6 in Experiment III). We therefore speculate that JLF is more likely to deliver superior performance when a larger atlas library is available, which again comes at the cost of generating manual segmentations. Second, the spanning-tree-based inference method tries to mimic the manual delineation process by starting with regions with strong neighborhood label information and moving progressively towards more uncertain areas. Compared to iterative methods (e.g. EM), the sequential inference process may not be optimal in a theoretical sense; nevertheless, the similarity between the automatic and manual labeling processes provides more accurate results, since the ground truth is defined by the latter.
Additionally, decoupling the label-fusion process from similarity comparisons with the atlas library allows AWoL-MRF to utilize bootstrapping techniques that augment the pool of candidate labels, as used by the baseline segmentation pipeline (MAGeT Brain) in this work [277]. The use of such techniques is not trivial for approaches using intensity information from the atlas library.

From a diagnostics perspective, the volumetric assessment of all four methods shows significant differences (p < 0.001 or p < 0.01) between the “CN vs. AD” and “CN vs. MCI” comparisons. Consistent with the Bland-Altman analysis (see Fig. 3.4), JLF and majority vote underestimate the volume compared to AWoL-MRF and STAPLE across all diagnostic categories. Even though the direct volumetric comparisons based on JLF yield significant differences (p < 0.05) between the “MCI vs. AD” categories, these differences vanish in the linear model comprising “age”, “sex”, and “total-brain-volume” as covariates. These findings are consistent with a variety of studies [256, 215, 300] highlighting the heterogeneity in MCI subjects, which results in a large variation of hippocampal volume and consequently smaller differences between MCI and AD subjects. This is particularly typical of the MCI subjects in the ADNI-1 cohort used in this analysis, who are now classified under more progressed stages of MCI, or late MCI [5]. The volumetric comparison between the MCI-converter and MCI-stable groups reveals larger hippocampal volumes for the latter group. These findings are consistent with a previous study conducted on the ADNI baseline cohort [291]. We also find that these differences remain statistically significant in the linear model comprising “age”, “sex”, and “total-brain-volume” as covariates.

A direct comparison against other methods from the current literature is difficult due to differences in the choices of gold standards, evaluation metrics, hyper-parameter configurations, etc. Nevertheless, Table 3.4 shows a brief survey of several segmentation studies. Note that many of these studies have relied on the SNT labels provided by ADNI for the ground-truth (manual) segmentations. A performance comparison of the baseline method based on SNT labels is discussed in our previous work [277], where we noted several shortcomings of the SNT protocol [372, 277]; we have therefore evaluated the presented method against manual labels based on the Pruessner protocol [285]. Despite the differences in experimental designs, comparisons with the other methods show that AWoL-MRF delivers superior performance with a significantly smaller atlas library requirement. For the ADNI cohort validation, barring the ground-truth label dissimilarities, the methods presented by [227, 235] report equivalent DSC scores; however, their atlas library sizes are 30 and 55, respectively. Moreover, to the best of our knowledge, no other study has validated its method on three drastically different datasets spanning the entire human lifespan, demonstrating this level of robustness.

The computational cost of the algorithm implementation, as described in the previous section, depends on the parameter selection. From a theoretical perspective, the minimum spanning tree (MST) transformation is the most expensive task in this method. The current implementation of the MST uses Prim's algorithm with a simple adjacency-matrix graph representation, which requires O(|V|^2) running time (|V|: the number of uncertain voxels in the patch). However, this can be reduced to O(|E| log |V|) or O(|E| + |V| log |V|) using a binary heap or a Fibonacci heap data structure, respectively (|E|: the number of edges in the patch).
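As an aside, the complexity reduction mentioned above can be illustrated with a binary-heap ("lazy") version of Prim's algorithm, sketched below in Python over an adjacency-list graph. This is a generic illustration rather than the Matlab implementation used in this work, and the edge weights between uncertain voxels are left abstract.

```python
import heapq

def prim_mst(adj, root=0):
    """Prim's algorithm with a binary heap (lazy deletion).

    adj: adjacency list {node: [(weight, neighbor), ...]}.
    Returns the MST as a list of (parent, child, weight) edges.
    Runs in O(|E| log |E|) ~ O(|E| log |V|) time, versus O(|V|^2) for a
    plain adjacency-matrix scan.
    """
    visited = {root}
    heap = [(w, root, v) for w, v in adj[root]]
    heapq.heapify(heap)
    mst_edges = []
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue                      # stale heap entry; skip it
        visited.add(v)
        mst_edges.append((u, v, w))
        for wv, nxt in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (wv, v, nxt))
    return mst_edges

# Example on a tiny 4-node graph:
# g = {0: [(1.0, 1), (4.0, 2)], 1: [(1.0, 0), (2.0, 2), (6.0, 3)],
#      2: [(4.0, 0), (2.0, 1), (3.0, 3)], 3: [(6.0, 1), (3.0, 2)]}
# prim_mst(g)  # -> [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0)]
```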
The computational times for Experiment I with the current implementation for different parameter configurations are shown in Fig. 3.15 (Right). The code was implemented in Matlab R2013b and run on a single CPU (Intel x86-64, 3.59 GHz). A direct computational time comparison with other methods is not practical due to hardware and software implementation differences. However, the non-iterative nature of AWoL-MRF provides considerably faster run times compared to EM-based approaches, where the convergence of the algorithm depends on the agreement between candidate labels and can be highly variable [352, 364].

In conclusion, AWoL-MRF attempts to mimic the behavior of a manual segmentation protocol within a multi-atlas segmentation framework. We validated its performance on three independent datasets comprising significantly different subject cohorts. Even though this work focused on hippocampal segmentation, AWoL-MRF can be easily applied to other structures and to scenarios with multiple label classes, which is a part of future studies. The validations indicate that the method delivers state-of-the-art performance with a remarkably small library of manually labeled atlases, which motivates its use as a highly efficient label-fusion method for rapid deployment of automatic segmentation pipelines.

3.8 Acknowledgments

NB receives support from the Alzheimer’s Society. MMC is funded by the Weston Brain Institute, the Alzheimer’s Society, the Michael J. Fox Foundation for Parkinson’s Research, the Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and the Fondation de Recherches Santé Québec. ANV is funded by the Canadian Institutes of Health Research, the Ontario Mental Health Foundation, the Brain and Behavior Research Foundation, and the National Institute of Mental Health (R01MH099167 and R01MH102324). FEP data collection was supported by a CIHR grant (#68961) to Dr. Martin Lepage and Dr. Ashok Malla. The preterm neonate cohort is supported by Canadian Institutes of Health Research (CIHR) operating grants MOP-79262 (SPM) and MOP-86489 (Dr. Ruth Grunau). SPM is supported by the Bloorview Children’s Hospital Chair in Pediatric Neuroscience. The authors thank Drs. Ruth Grunau, Anne Synnes, Vann Chau and Kenneth J. Poskitt for their contributions in studying the preterm neonatal cohort and providing access to the MR images. Computations were performed on the GPC supercomputer at the SciNet HPC Consortium [234]. SciNet is funded by the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; the Ontario Research Fund - Research Excellence; and the University of Toronto. In addition, computations were performed on the CAMH Specialized Computing Cluster. The SCC is funded by the Canada Foundation for Innovation, Research Hospital Fund.

ADNI Acknowledgments: Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research Development, LLC.; Johnson & Johnson Pharmaceutical Research Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for NeuroImaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514. We would also like to thank Curt Johnson and Robert Donner for inspiring some of the ideas in this work.

3.9 Supplementary Material

S3.1 Experiment I: ADNI Validation

Data used in this experiment were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). The dataset consists of 60 baseline scans from the ADNI1: Complete 1Yr 1.5T standardized dataset [380]. Twenty subjects were chosen from each diagnostic category: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD). All images were acquired using 1.5T scanners (General Electric Healthcare, Philips Medical Systems, or Siemens Medical Solutions) at multiple sites using a protocol previously described in [179]. Representative 1.5T imaging parameters were TR=2400ms, TI=1000ms, TE=3.5ms, flip angle=8°, field of view=240x240mm, and a 192x192x166 matrix (x, y, and z directions), yielding voxel dimensions of 1.25mm x 1.25mm x 1.2mm. The manual segmentations of the hippocampus were generated by expert raters following the Pruessner protocol [285], which is used for validation and performance comparisons. The choice of Pruessner labels was motivated by our previous validation of the baseline MAGeT Brain pipeline [277], in which we noted inconsistencies in the SNT labels provided by ADNI.

S3.2 Experiment II: First Episode Psychosis (FEP) Validation

Data used in preparation of this experiment were obtained from the Prevention and Early Intervention Program for Psychoses (PEPP-Montreal), a specialized early intervention service at the Douglas Mental Health University Institute in Montreal, Canada [239]. The dataset consists of structural MRIs of 81 subjects, which were acquired at the Montreal Neurological Institute on a 1.5T Siemens whole-body MRI system. Structural T1 volumes were acquired for each participant using a three-dimensional (3D) gradient echo pulse sequence with sagittal volume excitation (repetition time=22ms, echo time=9.2ms, flip angle=30°, 180 contiguous 1mm sagittal slices). The rectangular field of view for the images was 256mm (SI) x 204mm (AP). The manual segmentations of the hippocampus were generated by expert raters following the Pruessner protocol [285], which is identical to the manual segmentation protocol used in our previous validation work [277]. FEP data are not publicly available.

S3.3 Experiment III: Preterm Neonatal Cohort Validation

This cohort consists of 22 premature neonates whose anatomical images were acquired with a specialized neonatal head coil (Advanced Imaging Research, Cleveland, OH) on a Siemens 1.5T Avanto scanner (Erlangen, Germany) at two time points: once in the first weeks after birth when clinically stable, and again at term-equivalent age (a total of 44 images: 22 early-in-life and 22 term-equivalency images). The 22 neonates (7 males) were born at a mean gestational age of 27.7 weeks (SD 1.9), and were scanned early-in-life at 32.1 weeks (SD 1.9) and again at term-equivalent age at 40.4 weeks (SD 2.1). Sequence parameters for the 3D volumetric T1-weighted images were: TR=36ms, TE=9.2ms, flip angle=30°, voxel size 1mm x 1.04mm x 1.04mm. The whole hippocampus was manually segmented by an expert rater using a 3-step segmentation protocol. The protocol adapts the histological definitions of [110], as well as existing whole-hippocampus segmentation protocols for MR images [285, 372, 39], to the preterm infant brain. This dataset was previously used by our group in the validation of an adaptation of MAGeT Brain to the specific needs of the neonatal and prematurely born infant brain. For complete details on the acquisition and manual segmentation process see [149]. Neonatal data are not publicly available.

Method          Left (number of voxels)    Right (number of voxels)
Majority Vote   4.6 (1.5)                  4.4 (1.1)
STAPLE          4.6 (1.7)                  4.5 (1.5)
JLF             5.3 (2.5)                  4.8 (2.0)
AWoL-MRF        4.6 (1.9)                  4.6 (1.7)

Table S3.1: Experiment I surface-distance errors based on a variant of the Hausdorff distance. The error is measured in number of voxels; mean and standard deviation values are reported over all subjects in the dataset. The validation configuration comprised 9 atlases and 19 templates.

S3.4 Experiment IV: Hippocampal Volumetry

The volumetric analysis was performed using the standardized ADNI1: Complete Screening 1.5T dataset comprising 811 ADNI T1-weighted screening and baseline MR images of healthy elderly (227), MCI (394), and AD (190) subjects. (Note: the standardized ADNI1: Complete Screening 1.5T dataset consists of 818 subjects, of which seven failed during the registration stage of the segmentation pipeline.)

S3.5 Surface-Distance error analysis

We performed a surface distance analysis identical to that in previous work [54]. The surface distance metric (M) estimates the maximum distance between the surfaces of the manual and fused labels and is an approximation of the symmetric Hausdorff distance. M was calculated using contour maps generated from the manual label and a 26-connected voxel erosion of the fused label. The border labels (surface) from the eroded fused volume were intersected with the manual label contour maps to compute M1 = H(a, b) and M2 = H(b, a), where H is the Hausdorff distance; then M = max(M1, M2). The results [mean (sd)] for the ADNI validation (Experiment I) with 9 atlases and 19 templates are shown in Table S3.1. This preliminary analysis shows that majority vote, STAPLE, and AWoL-MRF produce similar performance. In comparison, JLF yields slightly higher surface-distance errors.
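A rough sketch of this surface-distance computation is given below: border voxels are obtained with a 26-connected binary erosion and the directed Hausdorff distances are taken in both directions. It is an approximation for illustration (voxel units, isotropic spacing assumed) rather than the exact implementation of [54].

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def surface_voxels(label, structure=np.ones((3, 3, 3))):
    """Border voxels of a binary label: voxels removed by a 26-connected erosion."""
    label = label.astype(bool)
    return np.argwhere(label & ~binary_erosion(label, structure=structure))

def surface_distance_metric(manual, fused):
    """Symmetric Hausdorff-style surface distance M = max(H(a, b), H(b, a)), in voxels."""
    a = surface_voxels(manual)
    b = surface_voxels(fused)
    d = cdist(a, b)                      # pairwise Euclidean distances between surfaces
    h_ab = d.min(axis=1).max()           # directed Hausdorff H(a, b)
    h_ba = d.min(axis=0).max()           # directed Hausdorff H(b, a)
    return max(h_ab, h_ba)
```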

Chapter 4

An artificial neural network model for clinical score prediction in Alzheimer’s disease using structural neuroimaging measures

Nikhil Bhagwat [1,2,3,*], Jon Pipitone [3], Aristotle N. Voineskos [3,4], M. Mallar Chakravarty [1,2,5], and Alzheimer’s Disease Neuroimaging Initiative.

1. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada

2. Cerebral Imaging Centre, Douglas Mental Health University Institute, Verdun, QC, Canada

3. Kimel Family Translational Imaging-Genetics Research Lab, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada

4. Department of Psychiatry, University of Toronto, Toronto, ON, Canada

5. Department of Psychiatry, McGill University, Montreal, QC, Canada

Correspondence: Nikhil Bhagwat, M. Mallar Chakravarty; Email: [email protected], [email protected]


4.1 Abstract

Background: Development of diagnostic and prognostic tools for Alzheimer’s disease (AD) is complicated by the substantial clinical heterogeneity observed in prodromal stages. Many neuroimaging studies have focussed on case-control classification and on predicting conversion from mild cognitive impairment to AD. However, prediction of scores from clinical assessments, such as the MMSE and ADAS-13, from MR imaging data has received less attention. Prediction of clinical scores can be crucial for providing a nuanced prognosis as well as quantifying symptomatic disease severity. Methods: In this work, we predict clinical scores at the individual level using a novel anatomically partitioned artificial neural network (APANN) model. APANN combines input from two structural MR imaging measures relevant to the neurodegenerative patterns observed in AD, namely hippocampal segmentations and cortical thickness. We evaluate the performance of APANN with 10 rounds of 10-fold cross-validation in three sets of experiments using the ADNI1, ADNI2, and ADNI1+ADNI2 cohorts. Results: Pearson’s correlations and root mean square errors between the actual and predicted scores for ADAS-13 (ADNI1: r=0.60; ADNI2: r=0.68; ADNI1and2: r=0.63) and MMSE (ADNI1: r=0.52; ADNI2: r=0.55; ADNI1and2: r=0.55) demonstrate that APANN can accurately infer clinical severity from MR imaging data. Limitations: In an effort to rigorously validate the presented model, we have primarily focussed on large cross-sectional baseline datasets, with only proof-of-concept longitudinal results. Conclusion: APANN provides a highly robust and scalable framework for prediction of clinical severity at the individual level utilizing high-dimensional, multimodal neuroimaging data.

4.2 Author Contributions

Nikhil Bhagwat (NB) worked on the development of the proposed anatomically partitioned artificial neural network (APANN) model and other machine-learning techniques, along with their subsequent implementation and validation. He performed preprocessing and quality control of the MR image datasets. He also wrote the manuscript of the research paper that is currently under review. Jon Pipitone assisted with the preprocessing of MR images. He also provided feedback on the proposed methodological approach. Additionally, he provided support for computational resources. Aristotle N. Voineskos served as a clinical advisor and is a member of NB’s thesis committee. He also served as a supervisor for NB in a lab at CAMH that provided significant computational resources. M. Mallar Chakravarty is the thesis supervisor for NB. He provided guidance on the development and validation of all proposed models and techniques, as well as manuscript writing.

4.3 Introduction

Machine-learning methods have been used extensively to identify individuals suffering from Alzheimer’s disease (AD) and its prodromal stages from healthy controls [52, 62, 87, 142, 391]. However, predicting symptomatic severity at the individual level remains a challenging problem, and one that is potentially more intimately related to personalized care and prognosis. This prediction is confounded by the substantial pathophysiological and clinical heterogeneity observed in prodromal stages, such as mild cognitive impairment (MCI) or significant memory concern (SMC) [82, 118, 216, 263, 331, 370].

Although much is known about the spatiotemporal progression of amyloid plaques, neurofibrillary tangles, and the resultant downstream neurodegeneration [44], the heterogeneous patterns of neuroanatomical atrophy and AD-related cognitive impairment remain an open question. Understanding the complex pathophysiological processes that characterize the varying clinical presentations across these groups is essential for biomarker development and early detection of at-risk individuals [19, 123, 304]. Furthermore, neuroanatomically informed subject-level prediction of clinical performance is an important step towards biomarker assessment and the development of assistive tools for prognosis and treatment planning.

As a structural biomarker, the hippocampus has long been associated with AD-related pathophysiology and impairment [80, 108, 117, 134, 142, 183, 300]. However, hippocampal volume measures lack the sensitivity to act as a standalone biomarker [32, 215, 275, 277, 303]. In efforts to achieve a more nuanced characterization of disease states, studies have explored hippocampal subfield-based biomarkers [12, 215, 261] and other neurodegeneration indicators, such as cortical atrophy quantified by cortical thickness [117, 202, 225, 222, 286, 299]. Nevertheless, no characteristic localized patterns of atrophy have been associated with the prodromal disease states and symptomatic severity levels, which are likely to be heavily influenced by cognitive reserve [286, 331]. This motivates an approach that incorporates multiple, distributed phenotypes for prediction of clinical severity in service of robust diagnostic and prognostic applications.

Previously, computational approaches using neuroimaging measures in the context of AD have focused on predicting diagnosis in cross-sectional datasets [52, 62, 87, 391] or predicting conversion from MCI to AD in longitudinal analyses [80, 253, 91]. However, clinicians are more likely to treat symptoms based on structured assessments rather than a specific diagnosis. Thus, in this work, we focus on predicting clinical scores of disease severity (e.g. the Alzheimer’s Disease Assessment Scale [ADAS-13] [294] and the Mini Mental State Examination [MMSE] [127]) directly from neuroimaging data [332, 390]. Such neuroanatomically informed prediction of clinical performance at baseline, and subsequently at future timepoints, particularly for MCI or SMC individuals, can help clinicians parse the clinical heterogeneity and make accurate diagnostic and prognostic decisions. Although the ultimate clinical goal of this work is to provide longitudinal prognosis, here we primarily focus on thorough validation using single time-point (baseline) datasets, which is an important first step in model development for longitudinal tasks. Additionally, we perform a proof-of-concept analysis to verify the capability of the proposed model for longitudinal prediction.
For this prediction task, we propose an anatomically partitioned artificial neural network (APANN) model. Artificial neural networks (ANNs) and related deep-learning approaches have delivered state-of-the-art performance in classification and prediction problems in computer vision, speech recognition, natural language processing, and several other domains [30, 165, 199, 212, 280, 293]. ANNs provide highly flexible computational frameworks that can be used to extract latent features corresponding to the hierarchical structural and functional organization of the brain, and they are well suited for problems with high-dimensional data, unlike more standard models [30, 280]. To this end, the primary objective of this manuscript is to assess whether ANN models can accurately predict the ADAS-13 and MMSE clinical scores of individuals using T1-weighted brain MR imaging data. In a larger context, we aim to build an ANN-based computational framework that can process high-dimensional and distributed structural changes captured by multiple phenotypic measures towards the development of a biomarker predictive of symptomatic progression.

We designed, trained, and tested our model using participants from two Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts. We used a combination of high-dimensional (> 30000) features derived from two neuroanatomical measures extracted from the T1-weighted images, namely: 1) hippocampal segmentations and 2) cortical thickness. Hippocampal segmentations and cortical thickness measures were generated using the MAGeT Brain and CIVET pipelines (see sections 2.2.1 and 2.2.2), respectively. We present a model with an innovative modular design that enables analysis of this high-dimensional, multimodal input. Additionally, it allows the inclusion of new input modalities in the analysis without having to retrain the whole model and offers simultaneous prediction of multiple clinical scores (ADAS-13, MMSE). We address the need for large numbers of training examples, given the high dimensionality of the input data, by introducing a novel data augmentation method. The methodology presented in this paper is not limited to the prediction of disease severity in AD, but can be applied to train a variety of deep-learning models that use high-dimensional neuroimaging data to tackle a multitude of diagnostic and prognostic questions.

4.4 Materials and Methods

4.4.1 Datasets

In this work, we used baseline data from participants in the ADNI1 (N = 818) and ADNI2 (N = 788) databases, respectively [380] (http://adni.loni.usc.edu/, download date: April 2017). The final numbers of subjects used were 669 and 690, after exclusion based on quality control of the image preprocessing outputs (see Table 4.1 for demographic details). We set our objective as the prediction of MMSE and ADAS-13 scores. The MMSE is one of the most widely used cognitive assessments for the diagnosis of AD and related dementias [270, 316] and ranges from 0 to 30, with lower scores indicating greater cognitive impairment. The ADAS-13 is a modified version of the ADAS-cog assessment with a maximum score of 85. Although it has some overlap with the MMSE, it also includes additional assessment components targeting memory, language, and praxis. In contrast to the MMSE, higher scores indicate greater cognitive impairment on the ADAS-13. We note that we pool subjects from all diagnostic categories to build models for the entire spectrum of clinical performance. Diagnostic grouping is not used in the analysis, as we model AD progression on a continuum, an approach which has been shown to be useful in other studies of AD progression [175, 182].

                                     ADNI-1 (N=669)                   ADNI-2 (N=690)
Acquisition                          Scanner: 1.5T; voxel sizes:      Scanner: 3.0T; voxel sizes:
                                     1.2mm x 1.25mm x 1.25mm          1.2mm x 1mm x 1mm
Diagnosis                            CN: 198, LMCI: 326, AD: 145      CN: 179, SMC: 77, EMCI: 162,
                                                                      LMCI: 149, AD: 123
Sex                                  Male: 377, Female: 292           Male: 361, Female: 329
Age in years (mean, stdev)           (75.0, 6.7)                      (72.6, 7.2)
Education in years (mean, stdev)     (15.5, 3.1)                      (16.3, 2.6)
ADAS-13 (mean, stdev, [min, max])    (18.4, 9.2, [1.0, 54.7])         (16.1, 10.14, [1.0, 52.0])
MMSE (mean, stdev, [min, max])       (26.7, 2.7, [18.0, 30.0])        (27.5, 2.7, [19.0, 30.0])

Table 4.1: Dataset demographics for the ADNI1 and ADNI2 cohorts used in this study. CN: Cognitively Normal, SMC: Significant Memory Concern, EMCI: Early Mild Cognitive Impairment, LMCI: Late Mild Cognitive Impairment, AD: Alzheimer’s Disease; ADAS: Alzheimer’s Disease Assessment Scale, MMSE: Mini–Mental State Examination.

4.4.2 MR image processing

MR images were first preprocessed using the bpipe pipeline (https://github.com/CobraLab/minc-bpipe-library/), comprising N4 correction [347], neck-cropping to improve linear registration, and BEaST brain extraction [119]. The preprocessed data were then used to extract 1) hippocampal (HC) segmentations and 2) cortical thickness (CT) measures, referred to as input modalities in this work.

Hippocampal Segmentation

HC segmentations of T1-weighted MR images were produced using the MAGeT Brain pipeline [55, 277]. Briefly, this pipeline begins with five manually segmented high-resolution 3T T1-weighted images [372], which are each registered non-linearly to fifteen of the ADNI images selected at random (known as the template library). Then, each image in the template library is non-linearly registered to all images in the ADNI datasets, and the segmentations from each atlas are warped via the template library transformations to each ADNI image. This results in 75 (number of atlases × number of templates) candidate segmentations for each image, which are fused into a single segmentation using voxel-wise majority voting.
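For illustration, the voxel-wise majority voting used to fuse the candidate segmentations can be expressed in a few lines of NumPy; the array name below is hypothetical and this is not the MAGeT Brain code itself.

```python
import numpy as np

def majority_vote(candidates):
    """Fuse candidate binary segmentations by voxel-wise majority voting.

    candidates: array of shape (n_candidates, x, y, z) with values in {0, 1},
    e.g. the 75 candidate labels produced per subject (5 atlases x 15 templates).
    Returns a fused binary segmentation of shape (x, y, z).
    """
    votes = np.asarray(candidates).mean(axis=0)   # fraction of candidates voting 1
    return (votes > 0.5).astype(np.uint8)
```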

Cortical Thickness Estimation

The preprocessed images were input to the CIVET pipeline [2, 69, 196, 221, 237] to estimate cortical thickness measures at 40,962 vertices per hemisphere, which can be subsequently grouped by region of interest (ROI) based on a surface atlas.

4.4.3 Anatomically Partitioned Artificial Neural Network (APANN)

Artificial neural networks (ANNs) are a biologically inspired family of graphical machine-learning (ML) models which can perform prediction tasks using high-dimensional input (see Fig. 4.1-A). ANN models can be designed to contain multiple hidden layers, which hierarchically encode latent features that are informative of the objective task. The neuron connections represent a set of weights for the preceding input values, which are then combined and filtered with a nonlinear function. In neuroimaging, a few variants of ANN models, such as autoencoders and restricted Boltzmann machines, have been investigated for classification and prediction tasks [280, 335]. In comparison to these existing approaches, the model presented in this work differs significantly in its design and implementation. From a design perspective, we leverage the hierarchical structure of ANNs to build a modular architecture (see Fig. 4.1-B) that is capable of multimodal input integration (see Fig. 4.1-C) and multitask predictions (see Fig. 4.1-D). We achieve these objectives in three stages (see Fig. 4.1-E). Stage I consists of anatomically partitioned modules (two hidden layers per module) which extract features from the individual anatomical input sources (hippocampus and cortical surface). These individual anatomical features from stage I serve as input to stage II of the network, where they are combined at a higher layer within the hidden-layer hierarchy. Lastly, we use these integrated features to perform multiple tasks simultaneously. These task-specific hidden layers are represented by the higher layers in stage III (four hidden layers in total). This anatomically partitioned ANN (APANN) mitigates overfitting by reducing the number of model parameters compared to classical fully connected hidden-layer architectures, and it allows independent pretraining of each input source in a single branch. These individual pretrained branches can subsequently be used to train stage II in order to integrate features efficiently.
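To make the three-stage design concrete, the sketch below re-expresses the same modular idea in PyTorch. The thesis implementation used Caffe, so this is an illustrative re-implementation under assumed layer sizes (drawn from the hyperparameter ranges in Table 4.2), not the original network definition.

```python
import torch
import torch.nn as nn

def branch(in_dim, hidden=100):
    # Stage I feature module: two hidden layers per input modality.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU())

class APANNSketch(nn.Module):
    """Illustrative anatomically partitioned network: per-modality branches
    (stage I), an integration layer (stage II), and per-task heads (stage III).
    Layer sizes are placeholders, not the tuned values used in the thesis."""

    def __init__(self, dims=(16086, 16471, 686), h1=100, h2=50, h3=50):
        super().__init__()
        self.branches = nn.ModuleList([branch(d, h1) for d in dims])   # left HC, right HC, CT
        self.integrate = nn.Sequential(nn.Linear(h1 * len(dims), h2), nn.ReLU())
        self.head_adas = nn.Sequential(nn.Linear(h2, h3), nn.ReLU(), nn.Linear(h3, 1))
        self.head_mmse = nn.Sequential(nn.Linear(h2, h3), nn.ReLU(), nn.Linear(h3, 1))

    def forward(self, inputs):
        feats = [b(x) for b, x in zip(self.branches, inputs)]   # stage I
        z = self.integrate(torch.cat(feats, dim=1))             # stage II
        return self.head_adas(z), self.head_mmse(z)             # stage III (multi-task)
```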

4.4.4 Empirical distributions

The input dimensionality of MR data greatly exceeds the available number of samples, leaving ML models susceptible to overfitting [19, 202]. This necessitates the critical step of feature engineering: the transformation of high-dimensional raw input into a meaningful and computationally manageable feature space [264].

Figure 4.1: A) Structure of a generic artificial neural network (ANN) model. A neural net may comprise multiple hidden layers that encode a hierarchical set of features from the input, informative of the prediction/classification task at hand. The connections between layers represent the model weights, which are updated via backpropagation based on the loss function associated with the task. B) A single feature module comprising multiple hidden layers. This is a building block of the APANN architecture, which facilitates pretraining of individual branches per input modality. C) A multi-modal ANN with a single output task. This design consists of stage 1 and stage 2 feature modules. Stage 1 modules learn features from each modality that are subsequently combined at the stage 2 feature module. Only single-task performance is used to update the weights of the model in this architecture. D) A multi-task ANN with a single input modality. This design consists of stage 1 and stage 3 feature modules. The stage I module learns individual features from a given modality that are then fed into task-specific feature modules connected to the output nodes for joint prediction of the two tasks (ADAS-13 and MMSE score prediction). Prediction performance from both tasks is used to update the weights of the stage I feature module. The left HC, right HC, and CT input modalities are trained separately using this design to learn input feature modules for each modality. E) The proposed multi-modal, multi-task APANN model comprising anatomical partitioning. This design consists of stage 1, stage 2, and stage 3 feature modules. Stage I comprises pretrained feature modules from each modality. These input features are fed into stage II to learn integrated features, which in turn are fed into the task-specific feature modules in stage III. The stage III modules are connected to the output nodes for joint prediction of the two tasks (ADAS-13 and MMSE score prediction). Prediction performance from both tasks is used to update the weights of the stage I and stage II feature modules. The partitioned architecture reduces the number of model parameters, which, along with the pretrained feature modules, helps mitigate overfitting. Note 1: Input data dimensionality is as follows: 16086 (left HC), 16471 (right HC), and 686 (CT). Note 2: For details regarding the hyperparameters (number of hidden nodes, learning policies, weight regularization, etc.) of APANN, see Table 4.2.

Techniques for addressing the high dimensionality include downsampling, handcrafting features based on biological priors (e.g. atlases), and principal component analysis. Alternatively, one can increase the sample size by adding transformed data (e.g. linear transformations, image patches). In this work, we present a novel data augmentation method that leverages the MR preprocessing pipelines to produce a set of empirical samples for both the HC and CT input modalities in place of a single point estimate per subject. This boost in training sample size makes it feasible to train models with a large parameter space and helps prevent overfitting by exposing the model to a large set of possible variations in the anatomical input associated with a given severity level. Adding linear and non-linear transformations of the original input data is a common practice in machine-learning [212, 293]. In computer vision applications this typically means translation, rotation, or dropping of certain pixels in an effort to capture a larger set of commonly encountered variations of input features to which the classifier should be invariant. In structural MR data, we are more interested in modeling the joint voxel distribution of anatomical segmentations than achieving high translational invariance, since the location of anatomical structures is relatively consistent across subjects. Thus, the empirical samples, generated as part of common segmentation and cortical surface extraction pipelines, help train the model to be invariant to methodologically driven perturbations of the input values. This in turn mitigates overfitting and helps the model learn anatomical patterns relevant to clinical performance. For the HC inputs, the empirical samples refer to a set of “candidate segmentations” generated by a multi-atlas segmentation pipeline (see Fig. 4.2-A) [55, 277] that model the underlying joint label distribution over the set of voxels for a given subject. For the CT inputs, the empirical samples refer to cortical thickness values from the set of vertices belonging to a given cortical ROI (see Fig. 4.2-B). In traditional approaches, these samples are usually fused to produce a point estimate of the feature [286, 87]. We detail the sample generation process for both of these input types below.

Hippocampal segmentations

• Segmentation generation: 75 candidate and 1 fused segmentation are produced for each subject via the MAGeT Brain pipeline [277]. The ADNI1 and ADNI2 datasets are segmented using separate template libraries, each with fifteen images chosen from the respective cohort. These candidate segmentations are binary masks of the left and right hippocampal voxels.

• Segmentation alignment: We rigidly align candidate segmentations to a common space (a subject chosen at random from the ADNI1 dataset) to maximize anatomical correspondence across subjects. Each segmentation is split into left and right hemisphere segmentations, and both are rigidly aligned to this common space using the ANTS registration toolkit [20].

• Segmentation filtering: In order to remove outlier segmentations due to misregistration or poor segmentation quality, we compute the Dice kappa score between each rigidly aligned candidate segmentation and the preselected common-space segmentation, and then exclude any candidate segmentation whose Dice score is more than one standard deviation below the mean Dice score over all subjects.

• Voxel filtering: To further compact the bounding box of all the candidate segmentations, we exclude voxels with low information density by keeping only structural voxels present in at least 25% of candidate segmentations across the ADNI1 and ADNI2 datasets. After the filtering operations, the three-dimensional volumes are flattened into a one-dimensional vector of included voxels per candidate segmentation.

Upon completion of this process, the vectorized voxels represent the hippocampal input for the APANN model. The length of the input vector was 16086 and 16471 for the left and right hippocampus, respectively.
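A condensed NumPy sketch of the filtering steps listed above is shown below. For brevity, the Dice-based outlier threshold and the 25% voxel-occupancy cut-off are computed within a single subject's candidate set, whereas the thesis pools these statistics across subjects; the array names are hypothetical.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * (a & b).sum() / (a.sum() + b.sum())

def filter_and_flatten(candidates, reference, occupancy=0.25):
    """candidates: (n_candidates, x, y, z) rigidly aligned binary segmentations.

    1) Drop candidates whose Dice with the common-space reference is more than
       one SD below the mean Dice (outlier segmentations / misregistrations).
    2) Keep only voxels present in at least `occupancy` of the retained candidates.
    3) Flatten each retained candidate to a 1D vector over the kept voxels.
    """
    candidates = np.asarray(candidates)
    scores = np.array([dice(c, reference) for c in candidates])
    keep = scores >= scores.mean() - scores.std()
    kept = candidates[keep]
    voxel_mask = kept.mean(axis=0) >= occupancy      # voxel occupancy across candidates
    return kept[:, voxel_mask], voxel_mask
```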

Figure 4.2: A) Schematic of the multi-atlas segmentation pipeline depicting the registration and label-fusion stages. The red box highlights the “candidate labels” derived from different atlases, which are treated as empirical samples in the context of structural labels. These labels are usually fused into a single label which serves as a point-estimate mask of a given structure. B) Schematic of the cortical thickness estimation pipeline comprising surface registration, parcellation, and average thickness estimation steps. The red box highlights the individual vertices in a given region of interest, which are treated as empirical samples in the context of the cortical thickness measure. The thickness values of these vertices are usually averaged to estimate the mean thickness over a region of interest (ROI).

Cortical thickness

CIVET preprocessing produces CT values at 40,962 vertices per hemisphere. We assign these cortical vertices to unique ROIs based on a predefined atlas. In this work, we created a custom atlas (see Fig. 4.3) comprising 686 ROIs, maintaining bilateral symmetry (343 ROIs per hemisphere), using a data-driven parcellation based on spectral clustering. Spectral clustering allows the creation of ROIs with a similar number of vertices, which is desirable for unbiased sampling of vertices for cortical thickness estimation. Also, work by others [193] suggests that increasing the spatial resolution of a cortical parcellation may improve predictive performance, which further motivates the use of this data-driven atlas over neuroanatomically derived parcellations [190, 348]. The connectivity information from the cortical mesh of the template was used as the adjacency matrix during implementation. Upon generating sets of vertices per ROI, we simply treat each vertex as a sample from a distribution that characterizes the thickness of that ROI. Thus, the CT features of each subject can now be characterized by a distribution of thickness values per ROI instead of the mean thickness values computed as point estimates (see Fig. 4.2-B). The independent empirical sampling processes for the HC and CT inputs necessitate a standardization step, which is described in the supplementary material, Section 2.
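A minimal sketch of such a data-driven parcellation, using scikit-learn's SpectralClustering with the mesh adjacency matrix as a precomputed affinity, is given below; loading the mesh and building the adjacency matrix are assumed to be done elsewhere.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def parcellate_hemisphere(adjacency, n_rois=343, seed=0):
    """Group mesh vertices into ROIs of roughly equal size.

    adjacency: (n_vertices, n_vertices) binary matrix built from the triangle
    connectivity of one hemisphere's cortical mesh.
    Returns an array of ROI labels, one per vertex.
    """
    sc = SpectralClustering(n_clusters=n_rois, affinity="precomputed",
                            assign_labels="discretize", random_state=seed)
    return sc.fit_predict(adjacency)

# Hypothetical usage: per-ROI thickness samples instead of a single ROI mean
# labels = parcellate_hemisphere(adj)
# roi_samples = {r: thickness[labels == r] for r in np.unique(labels)}
```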

Figure 4.3: A custom cortical surface parcellation (atlas) comprising 686 regions of interest (ROIs), each comprising a roughly equal number of vertices. The parcellation was based on a triangular surface mesh obtained from a CIVET model. The vertices of the mesh were grouped together based on spatial proximity using a spectral clustering method¹. Bilateral symmetry within the vertices of the hemispheres was preserved. The atlas was propagated to each subject to obtain thickness samples per ROI. 1 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html

Training procedure

The training procedure consists of two parts: 1) training the individual branches per input modality, and 2) fine-tuning the unified model comprising the pretrained branches from part 1 along with the additional integrated and task-specific feature layers. In the first part, we trained separate models using the individual HC and CT modalities independently (see Fig. 4.1-D). Each model was trained to jointly predict both tasks (ADAS-13 and MMSE scores). At the end of this training procedure, we obtained the set of weights for the hidden layers in stage I for each input branch. Subsequently, we extended the model with the stage II and III hidden layers and further trained it to learn the integrated and task-specific feature layers (see Fig. 4.1-E). We used both tasks during this training procedure as well. For both parts, the hyperparameters of the model (see Table 4.2) were determined using an inner cross-validation loop. The code utilizing the Caffe toolbox (http://caffe.berkeleyvision.org/) for the APANN design and training is available here: https://github.com/CobraLab/NI-ML/tree/master/projects/APANN. The computational resource requirements are provided in the supplementary material, Section 3.

4.4.5 Performance Validation

We compared the performance of the APANN model separately for MMSE and ADAS-13 score prediction. We conducted three experiments to compare performance for each cohort separately as well as combined: 1) ADNI1, 2) ADNI2, and 3) ADNI1+ADNI2.

Model                                  Hyperparameters
Linear Regression with Lasso (LR L1)   L1-penalty: 0.001 to 1 (with increment of 0.01)
Support Vector Regression (SVR)        kernel: [linear, rbf]; C: [0.001, 0.01, 1, 10, 100]
Random Forest Regression (RFR)         N estimators: 10 to 210 (with increment of 25);
                                       min sample split: [2, 4, 6, 8]
Artificial Neural Network (APANN)      Fixed hyperparameters:
                                       • Stage I (input features): two hidden layers with equal nodes in each layer
                                       • Stage II (integrated features): one hidden layer
                                       • Stage III (task features): one hidden layer
                                       • Activation nonlinearity: ReLU
                                       Tunable hyperparameters:
                                       • Stage I number of hidden nodes: [25, 50, 100, 200]
                                       • Stage II number of hidden nodes: [25, 50]
                                       • Stage III number of hidden nodes: [25, 50]
                                       • Learning rate: [1e-6, 1e-5, 1e-4]
                                       • Learning policy: [Nesterov, Adagrad]
                                       • Weight decay: [1e-4, 1e-3, 1e-2]
                                       • Dropout rate: [0, 0.25, 0.5] (only for Stage I)

Table 4.2: Hyperparameter search space for the four models. Grid search over the hyperparameters was performed using a nested inner loop for each cross-validation round. For the APANN model, the fixed hyperparameters refer to broader network design choices that remained identical for all cross-validation rounds. The tunable hyperparameters for APANN were optimized for each fold.

The latter is an effort to evaluate model robustness in the context of multi-cohort, multi-site studies, which are becoming increasingly prevalent in the field. In each experiment, we compared the performance of the two inputs separately as well as combined: 1) HC, 2) CT, and 3) HC+CT. We used Pearson’s correlation (r) and root mean square error (rmse) values between true and predicted clinical scores as our performance metrics. All experiments were evaluated using 10 rounds of a 10-fold nested cross-validation procedure. The outer folds were created by dividing the subject pool into 10 non-overlapping subsets. During each run, 9 out of the 10 subsets were chosen as the training set and performance was evaluated on the held-out test subset. During model training, three inner folds were created by further dividing the training set under consideration to determine the optimal combination of hyperparameters (e.g. number of hidden nodes) using grid search. For Experiment 3, the outer folds were stratified to maintain a similar ratio of ADNI1 and ADNI2 subjects in each fold. We compared the performance of APANN in all experiments against three commonly used ML models: linear regression with Lasso, support vector machine, and random forest. These results are provided in the supplementary materials, Section 1.

Our secondary, proof-of-concept analysis comprises a longitudinal experiment to predict clinical scores at baseline and at 1 year simultaneously, using only baseline MR data. This is an effort to demonstrate the applicability of APANN from a clinical standpoint, where the end goal is to predict future diagnostic and/or prognostic states of a subject. We limit our analysis to the ADAS-13 scale, whose larger score range offers better sensitivity to longitudinal changes, and to the individual ADNI1 and ADNI2 cohorts. We note that for this experiment, due to missing timepoints, the number of subjects was reduced to 553 for ADNI1 and 590 for ADNI2.
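The validation scheme (outer folds for testing, an inner grid search for hyperparameters, and Pearson's r and rmse on the held-out data) can be summarized with the scikit-learn sketch below. A support vector regressor with the grid from Table 4.2 stands in for the models compared here, so the sketch illustrates the evaluation protocol rather than reproducing the APANN training code; repeating it with 10 different seeds would give the "10 rounds" described above.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import SVR

def nested_cv(X, y, n_outer=10, n_inner=3, seed=0):
    """Nested cross-validation returning mean Pearson's r and rmse over outer folds."""
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    param_grid = {"kernel": ["linear", "rbf"], "C": [0.001, 0.01, 1, 10, 100]}
    r_scores, rmse_scores = [], []
    for train_idx, test_idx in outer.split(X):
        # Inner loop: grid search over hyperparameters on the training folds only.
        search = GridSearchCV(SVR(), param_grid, cv=n_inner,
                              scoring="neg_root_mean_squared_error")
        search.fit(X[train_idx], y[train_idx])
        pred = search.predict(X[test_idx])
        r_scores.append(pearsonr(y[test_idx], pred)[0])
        rmse_scores.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return np.mean(r_scores), np.mean(rmse_scores)
```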

4.5 Results

The mean correlation (r) and root mean square error (rmse) performance values for all three experiments with the three input modality configurations are summarized in Figure 4.4 and Tables 4.3 and 4.4. Scatter plots for predicted and actual ADAS-13 and MMSE scores are shown in Figure 4.5. Scatter plots were generated using scores from all the test subsets from a randomly chosen round of a 10-fold run. Results for the longitudinal experiment are shown in Figure 4.6. Individual results for each experiment are detailed below. The comparative results with other models are provided in the supplementary materials Section 1. Briefly, results from all three experiments indicate that the APANN model offers better predictive performance with HC inputs. In comparison, the CT input modality, when used independently, does not offer an improvement. However, the HC+CT input to the APANN model offers significantly higher performance improvement over the reference models across all three experiments.

4.5.1 Experiment 1: ADNI1 Cohort

The combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.60, rmse = 7.11. Similar trends are observed for MMSE prediction with the HC+CT input, with r = 0.52, rmse = 2.25. The sole HC input yields r = 0.53, rmse = 7.56, and r = 0.40, rmse = 2.41, for ADAS-13 and MMSE score prediction, respectively, whereas the sole CT input yields r = 0.51, rmse = 7.67, and r = 0.50, rmse = 2.29, for ADAS-13 and MMSE score prediction, respectively.

4.5.2 Experiment 2: ADNI2 Cohort

Similar to Experiment 1, the combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.68, rmse = 7.17. Similar trends are also observed for MMSE prediction with the HC+CT input, with r = 0.55, rmse = 2.25. The sole HC input yields r = 0.52, rmse = 8.32, and r = 0.40, rmse = 2.51, for ADAS-13 and MMSE score prediction, respectively, whereas the sole CT input yields r = 0.63, rmse = 7.58, and r = 0.52, rmse = 2.31, for ADAS-13 and MMSE score prediction, respectively.

4.5.3 Experiment 3: ADNI1 + ADNI2 Cohort

Similar to Experiments 1 and 2, the combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.63, rmse = 7.32. Similar trends are observed for MMSE prediction with the HC+CT input, with r = 0.55, rmse = 2.25. The sole HC input yields r = 0.54, rmse = 7.99, and r = 0.45, rmse = 2.42, for ADAS-13 and MMSE score prediction, respectively, whereas the sole CT input yields r = 0.57, rmse = 7.79, and r = 0.50, rmse = 2.37, for ADAS-13 and MMSE score prediction, respectively. A further analysis of the results in this experiment, stratified by subject-cohort membership (ADNI1 vs. ADNI2), shows that APANN has a smaller performance bias towards any particular cohort (i.e. performing well on only a single cohort) compared to the other models (see supplementary materials for details).

4.5.4 Longitudinal prediction

Similar to Experiments 1-3, the combined HC+CT input provides the best results, with r = 0.58, rmse = 7.1 for baseline and r = 0.59, rmse = 9.08 for 1-year score prediction for ADNI1; and r = 0.64, rmse = 7.07 for baseline and r = 0.65, rmse = 9.07 for 1-year score prediction for ADNI2. The sole HC input yields better performance than the sole CT input for baseline and 1-year score prediction for ADNI1, whereas the sole CT input yields better performance than the sole HC input for baseline and 1-year score prediction for ADNI2.

Figure 4.4: Performance of APANN subject to individual and combined input modalities (r and rmse for ADAS-13 and MMSE across the ADNI1, ADNI2, and ADNI1and2 cohorts). The correlation and rmse values are averaged over 10 rounds of 10 folds. All models were trained with a nested inner loop that searched for optimal hyperparameters.

Figure 4.5: Scatter plots for predicted and actual ADAS-13 and MMSE scores for the three cohorts (ADNI1, ADNI2, ADNI1and2). The per-panel R^2 values are 0.321 (ADNI1), 0.430 (ADNI2), and 0.379 (ADNI1and2) for ADAS-13, and 0.245, 0.266, and 0.279 for MMSE, respectively. Scatter plots are generated by concatenating scores from all the test subsets from a randomly chosen round of a 10-fold validation run.

ADNI1
LR L1  | HC: r 0.22, 0.11; rmse 8.72, 0.81 | CT: r 0.56, 0.08; rmse 7.44, 0.72 | HC+CT: r 0.56, 0.08; rmse 7.42, 0.74
SVR    | HC: r 0.23, 0.11; rmse 8.70, 0.85 | CT: r 0.52, 0.08; rmse 7.68, 0.76 | HC+CT: r 0.53, 0.08; rmse 7.62, 0.78
RFR    | HC: r 0.15, 0.10; rmse 9.27, 0.80 | CT: r 0.54, 0.08; rmse 7.55, 0.76 | HC+CT: r 0.54, 0.08; rmse 7.51, 0.77
APANN  | HC: r 0.53, 0.09; rmse 7.56, 0.76 | CT: r 0.51, 0.10; rmse 7.67, 0.76 | HC+CT: r 0.60, 0.08; rmse 7.11, 0.72

ADNI2
LR L1  | HC: r 0.14, 0.11; rmse 9.69, 0.70 | CT: r 0.61, 0.07; rmse 7.77, 0.71 | HC+CT: r 0.61, 0.07; rmse 7.78, 0.71
SVR    | HC: r 0.21, 0.10; rmse 9.75, 0.79 | CT: r 0.63, 0.07; rmse 7.65, 0.68 | HC+CT: r 0.63, 0.07; rmse 7.66, 0.70
RFR    | HC: r 0.24, 0.09; rmse 9.77, 0.76 | CT: r 0.58, 0.07; rmse 7.97, 0.65 | HC+CT: r 0.58, 0.08; rmse 7.97, 0.67
APANN  | HC: r 0.52, 0.07; rmse 8.32, 0.79 | CT: r 0.63, 0.07; rmse 7.58, 0.71 | HC+CT: r 0.68, 0.06; rmse 7.17, 0.71

ADNI1+ADNI2
LR L1  | HC: r 0.12, 0.08; rmse 9.37, 0.50 | CT: r 0.58, 0.06; rmse 7.71, 0.48 | HC+CT: r 0.58, 0.06; rmse 7.71, 0.48
SVR    | HC: r 0.18, 0.07; rmse 9.39, 0.54 | CT: r 0.59, 0.05; rmse 7.65, 0.42 | HC+CT: r 0.59, 0.05; rmse 7.65, 0.42
RFR    | HC: r 0.18, 0.09; rmse 9.63, 0.61 | CT: r 0.57, 0.05; rmse 7.76, 0.46 | HC+CT: r 0.57, 0.05; rmse 7.75, 0.46
APANN  | HC: r 0.54, 0.06; rmse 7.99, 0.59 | CT: r 0.57, 0.05; rmse 7.79, 0.51 | HC+CT: r 0.63, 0.05; rmse 7.32, 0.53

Table 4.3: Prediction performance for Alzheimer’s Disease Assessment Scale-13 scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RFR: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std).

ADNI1
LR L1  | HC: r 0.23, 0.12; rmse 2.54, 0.18 | CT: r 0.49, 0.08; rmse 2.28, 0.17 | HC+CT: r 0.50, 0.08; rmse 2.27, 0.17
SVR    | HC: r 0.25, 0.12; rmse 2.59, 0.19 | CT: r 0.48, 0.07; rmse 2.31, 0.16 | HC+CT: r 0.50, 0.07; rmse 2.28, 0.16
RFR    | HC: r 0.22, 0.11; rmse 2.63, 0.21 | CT: r 0.48, 0.08; rmse 2.30, 0.17 | HC+CT: r 0.49, 0.08; rmse 2.28, 0.17
APANN  | HC: r 0.40, 0.09; rmse 2.41, 0.15 | CT: r 0.50, 0.09; rmse 2.29, 0.20 | HC+CT: r 0.52, 0.08; rmse 2.23, 0.17

ADNI2
LR L1  | HC: r 0.19, 0.12; rmse 2.64, 0.19 | CT: r 0.46, 0.08; rmse 2.39, 0.19 | HC+CT: r 0.47, 0.08; rmse 2.39, 0.19
SVR    | HC: r 0.28, 0.14; rmse 2.72, 0.24 | CT: r 0.52, 0.07; rmse 2.32, 0.18 | HC+CT: r 0.54, 0.07; rmse 2.30, 0.18
RFR    | HC: r 0.25, 0.12; rmse 2.67, 0.24 | CT: r 0.50, 0.09; rmse 2.33, 0.17 | HC+CT: r 0.51, 0.08; rmse 2.31, 0.17
APANN  | HC: r 0.40, 0.09; rmse 2.51, 0.21 | CT: r 0.52, 0.12; rmse 2.31, 0.25 | HC+CT: r 0.55, 0.10; rmse 2.25, 0.21

ADNI1+ADNI2
LR L1  | HC: r 0.15, 0.08; rmse 2.64, 0.12 | CT: r 0.50, 0.07; rmse 2.31, 0.13 | HC+CT: r 0.50, 0.07; rmse 2.31, 0.13
SVR    | HC: r 0.22, 0.07; rmse 2.71, 0.13 | CT: r 0.52, 0.07; rmse 2.31, 0.13 | HC+CT: r 0.52, 0.07; rmse 2.30, 0.13
RFR    | HC: r 0.17, 0.08; rmse 2.74, 0.14 | CT: r 0.50, 0.07; rmse 2.31, 0.14 | HC+CT: r 0.50, 0.07; rmse 2.31, 0.14
APANN  | HC: r 0.45, 0.06; rmse 2.42, 0.14 | CT: r 0.50, 0.07; rmse 2.37, 0.15 | HC+CT: r 0.55, 0.06; rmse 2.25, 0.12

Table 4.4: Prediction performance for Mini-Mental State Examination (MMSE) scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RFR: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std).

Figure 4.6: Simultaneous predictions of baseline and 1-year ADAS-13 scores. The top two rows show the Pearson’s r values based on predicted and actual ADAS-13 scores over 10-fold cross-validation for the ADNI1 and ADNI2 cohorts, respectively. The bottom two rows show the root mean square error (rmse) between predicted and actual ADAS-13 scores for the ADNI1 and ADNI2 cohorts, respectively. The first column shows performance at baseline, whereas the second column shows performance at month 12. Models were trained separately for each input: HC, CT, and HC+CT, which are represented by different colors in the plots.

4.6 Discussion

In this manuscript, we presented an artificial neural network model for prediction of cognitive scores in AD using high-dimensional structural MR imaging data. We showed that information from voxel-level hippocampal segmentations and highly granular cortical parcellations can be leveraged to infer cognitive performance and clinical severity at the single-subject level. This capability of the APANN model to predict MMSE and ADAS-13 scores based on structural MR features may prove valuable from a clinical perspective to help build prognostic tools. The proof-of-concept longitudinal experiment demonstrated that APANN can successfully predict future (1-year) scores from baseline MR data. The results comparing APANN against several other models are provided in the supplementary materials Section 1. These results highlight the performance gains offered by high-dimensional features as input to APANN. Below we discuss the performance of APANN with respect to 1) clinical scale, 2) input modality, 3) dataset, and 4) related literature.

4.6.1 Clinical scale comparisons

Performance comparison between the clinical scales based on correlation values indicates that predicting MMSE scores is the more challenging of the two across all inputs and cohorts. This disparity in performance is possibly due to the higher sensitivity of the ADAS-13 assessment, whose comparatively larger scoring range improves its association with the structural measures.

4.6.2 Input modality comparisons

The results from all three experiments indicate that the APANN model offers better predictive performance with the combined HC+CT inputs. The CT input outperforms the HC input in all three experiments for both scales, except for ADAS-13 prediction in the ADNI1 cohort, where the HC input offers slightly higher performance. This highlights the importance of incorporating multiple phenotypes when developing biomarkers indicative of cognitive performance. The capability of APANN to handle multimodal input is crucial for building clinical tools that leverage disparate MR, clinical, as well as genetic markers.

4.6.3 Dataset comparisons

Between Experiments 1 and 2, we observe that the ADNI2 cohort yields better performance compared to ADNI1 across all models. This may be due to differences in acquisition protocols, as ADNI2 images were acquired at a higher field strength with better resolution. The improvement in image acquisition likely provides superior quality segmentations and cortical thickness measures [61]. In Experiment 3 we combined the ADNI1 and ADNI2 cohorts. Pooling data from different datasets is becoming increasingly important to verify the generalizability of a model on a larger population that extends beyond a single study. Interestingly, Experiment 3 outperforms Experiment 1, but underperforms compared to Experiment 2. This is partially expected due to substantial differences in the individual feature distributions (e.g. hippocampal segmentations) resulting from the aforementioned differences in the acquisition protocols. In such cases it becomes imperative to build models invariant to dataset-specific biases resulting from non-uniform data collection practices. The results from Experiment 3 show that APANN offers consistent performance comparable to Experiments 1 and 2, and low dataset-specific bias (i.e. the model does not perform well on only a single dataset), compared to other models (see supplementary material Section 4 for details). We speculate that models incorporating high-dimensional, multimodal input are less susceptible to multi-cohort and multi-site study design artifacts, which is desirable for the development of clinical tools in practical settings.

4.6.4 Longitudinal analysis

Consistent with the first three experiments, the combined HC+CT input offers the best performance for 1-year score prediction, with similar correlation results but higher rmse. This suggests that uncertainty is likely to increase with the larger timespans under consideration for longitudinal tasks (1 year vs. 2 years vs. 5 years), making the predictions more challenging. Further considerations are also needed for cases where information from multiple timepoints (baseline + 1-year) is utilized towards subsequent (2-year and beyond) performance prediction. Missing timepoints become an increasingly important caveat for such tasks. Nevertheless, APANN shows promising results for investigating more sophisticated longitudinal predictions.

4.6.5 Related work

As we mentioned earlier, prediction of clinical scores is a relatively underexplored task. For a fair comparison we limit our discussion to two recent studies involving baseline prediction with MR imaging features [332, 390]. Both works use structural MR images from the ADNI1 baseline dataset for prediction of MMSE and ADAS-Cog scales (ADAS-Cog uses 11 of the 13 subscales of ADAS-13; http://adni.loni.usc.edu/data-samples/data-faq/). Consequently, ADAS-Cog and ADAS-13 scores are strongly correlated (r > 0.9 for the ADNI1 and ADNI2 subjects considered in this manuscript). Stonington et al. use relevance vector regression (RVR) models with a sample size of 586 subjects, and report correlation values of 0.48 (MMSE) and 0.57 (ADAS-Cog). [390] propose a computational framework called Multi-Modal Multi-Task (M3T) that offers multi-task feature selection and multi-modal support vector machines (SVM) for regression and classification tasks. With only MR-based features, M3T achieves correlations of 0.50 (MMSE) and 0.60 (ADAS-Cog) on a sample size of 186 subjects. The APANN model, in comparison, offers correlations of 0.52 (MMSE) and 0.60 (ADAS-13) with a much larger cohort of 669 ADNI1 subjects. Although APANN offers similar performance for the ADNI1 dataset, there are several key advantages. In contrast to M3T, which implements two separate stages for feature extraction and regression (or classification) tasks, APANN provides a unified model that performs feature extraction and multi-task prediction using multimodal input in a seamless manner. From a scalability perspective, the results show that APANN is capable of handling high-dimensional input and extending the model to incorporate new modalities without retraining the entire model, whereas M3T is limited to 93 MR atlas-based features [190] within a total of 189 multimodal (MRI, FDG-PET, and cerebrospinal fluid) features [390]. Moreover, with APANN we replicate performance on the ADNI2 cohort and demonstrate improved correlation performance of 0.55 (MMSE) and 0.68 (ADAS-13) with 690 subjects, further validating its generalizability. Other recent works address clinical score prediction using sparse Bayesian learning [358] and graph-guided feature selection [383] with 98 and 93 imaging features, respectively. Both works report high performance on AD and CN subject groups; however, performance degrades after inclusion of MCI subjects. For example, [383] report correlations of 0.745 (MMSE) and 0.74 (ADAS-cog) for specific subsets of AD/CN subjects; however, the performance degrades to 0.382 (MMSE) and 0.472

(ADAS-cog) for the subset of MCI/CN subjects. Clinically, MCI subjects are of high interest from an early intervention and prognostic standpoint, and therefore prediction of their cognitive performance is crucial. To the best of our knowledge, APANN is the first work tackling such high input dimensionality (> 30k features), which we have validated across the continuum from healthy control to AD dementia, on multiple cohorts with site and scanner differences. Such validation is increasingly important with the availability of newer and larger datasets, such as the UK Biobank (http://www.ukbiobank.ac.uk/about-biobank-uk/).

4.6.6 Clinical translation

As mentioned earlier, the ultimate clinical goal of this work is to provide longitudinal prognosis that can predict future clinical states of an individual. The rigorously validated APANN provides a computational platform for a variety of longitudinal tasks, such as the 1-year ADAS-13 prediction task investigated in the proof-of-concept experiment (see section 3.4). We envision APANN being applied to the MR data of at-risk individuals from prodromal stages (MCI, SMC, etc.) and even early AD stages towards prediction of future clinical scores and other clinical state proxies. The ability of APANN to capture relevant subtle neuroanatomical changes from high-dimensional, multimodal MR imaging data can be leveraged towards nuanced diagnosis and prognosis on various symptom subdomains to either assist or verify decision-making by clinicians. Such prognosis can help with early intervention, clinical trial recruitment, and caregiver arrangements for patients.

4.7 Limitations

In this work we applied APANN primarily to cross-sectional datasets and a proof-of-concept longitudinal dataset. From the clinical perspective, it is crucial to note that the use of a specific clinical or cognitive test is subjective, contingent on availability, and associated with its own set of biases. Further, similar to a clinical diagnosis that uses several sources of information in order to create a composite of the patient’s clinical profile, we envision the proposed MR-based prediction framework as another assistive instrument to be interpreted in the larger context of the overall clinical picture. We acknowledge that the cross-sectional experiments in this work are a first step towards building assistive MR-based models. We believe that the design flexibility of APANN can be utilized towards this goal: APANN can handle multimodal input as well as multiple scale predictions, which could minimize modality-specific and scale-specific biases, respectively.

Large-scale models, such as APANN, subjected to high-dimensional input require significant computational resources. Thus we have limited the scope of this work to classical ANNs as a prototypical example to demonstrate the feasibility of large-scale analysis with structural neuroimaging data. Nevertheless, the training regimes discussed in this work should motivate further development of state-of-the-art neural network architectures, such as 3-dimensional convolutional networks, for various neuroimaging applications. Another common drawback of models with deep architectures pertains to the lack of interpretability of the model parameters compared to simpler models, which prohibits localizing the most predictive brain regions. In our view, this is a model design trade-off that in turn allows capturing the distributed changes often present in heterogeneous atrophy patterns in AD prodromes. The computational flexibility of ANNs allows us to model the collective impact of these complex atrophy patterns and predict clinical performance more accurately.

4.8 Conclusion

The presented APANN model, together with empirical sampling procedures, offers a sophisticated machine-learning framework for high-dimensional, multimodal structural neuroimaging analysis. By going beyond low-dimensional, anatomical prior-based feature sets, we can build more sensitive models capable of capturing subtle neuroanatomical changes associated with cognitive symptoms in AD. The results validate the strong predictive performance of the APANN model across two independent cohorts, as well as its robustness in the case of the combination of these two cohorts. From a clinical standpoint, these attributes make APANN a promising approach towards building diagnostic and prognostic tools that would help identify at-risk individuals and provide clinical trajectory assessments, facilitating early intervention and treatment planning.

4.9 Acknowledgements

NB receives support from the Alzheimer’s Society of Canada. MMC is funded by the Weston Brain Institute, the Alzheimer’s Society, the Michael J. Fox Foundation for Parkinson’s Research, Canadian Institutes for Health Research, National Sciences and Engineering Research Council Canada, and Fondation de Recherches Santé Québec. ANV is funded by the Canadian Institutes of Health Research, Ontario Mental Health Foundation, the Brain and Behavior Research Foundation, and the National Institute of Mental Health (R01MH099167 and R01MH102324). Computations were performed on the GPC supercomputer at the SciNet HPC Consortium [234] and the Kimel Family Translational Imaging-Genetics Research (TIGR) Lab computing cluster. SciNet is funded by the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto. The TIGR Lab cluster is funded by the Canada Foundation for Innovation, Research Hospital Fund. ADNI Acknowledgments: Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research Development, LLC.; Johnson & Johnson Pharmaceutical Research Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for NeuroImaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.

4.10 Supplementary Material

S4.1 Performance comparison to reference models

Models

We compared the performance of the APANN model with three commonly used models: 1) linear regression with Lasso regularizer (LR L1) [52, 369], 2) support vector regression (SVR) [91, 164, 354, 390], and 3) random forest regression (RFR) [146]. Separate instances of these baseline models were trained for the MMSE and ADAS-13 prediction tasks. Separate instances of these models were also trained to compare the performance of each input, namely: 1) HC, 2) CT, and 3) HC+CT. The input features from each individual modality for the three baseline models were as follows:

• HC: 2 continuous variables representing left and right hippocampal volumes

• CT: 78 continuous variables representing thickness values based on AAL atlas ROIs [348]

The difference in input feature sets for the baseline models was prompted by the use of anatomically driven, low-dimensional features in many instances in the relevant literature [335, 389]. Moreover, we also investigated high-dimensional HC and CT input choices (identical to those of the APANN model) for the baseline models. However, the baseline models considerably underperformed with high-dimensional input compared to the low-dimensional feature choices. The input values for the LR L1 and SVR models were preprocessed with an additional step in which the data were mean-centered and feature-wise scaled to unit variance. All the baseline models were implemented using the scikit-learn toolbox (http://scikit-learn.org/stable/index.html).
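As a concrete illustration of this setup, the snippet below sketches how the three reference regressors and their Table 4.2 search ranges could be instantiated with scikit-learn, with the mean-centering and unit-variance scaling applied only to LR L1 and SVR. The Pipeline wiring and names are illustrative assumptions, not the thesis code.

```python
# Sketch of the three reference regressors with the hyperparameter grids from Table 4.2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

baseline_models = {
    "LR_L1": (Pipeline([("scale", StandardScaler()), ("reg", Lasso(max_iter=10000))]),
              {"reg__alpha": np.arange(0.001, 1.0, 0.01)}),
    "SVR": (Pipeline([("scale", StandardScaler()), ("reg", SVR())]),
            {"reg__kernel": ["linear", "rbf"], "reg__C": [0.001, 0.01, 1, 10, 100]}),
    "RFR": (RandomForestRegressor(),
            {"n_estimators": list(range(10, 211, 25)), "min_samples_split": [2, 4, 6, 8]}),
}
# Each (estimator, grid) pair can be dropped into the nested cross-validation loop sketched
# earlier, e.g. GridSearchCV(estimator, grid, cv=3) inside every outer training fold.
```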

Results

The correlation performance comparison between APANN and the reference models for both tasks and all three experiments is shown below in Fig. S4.1. All correlation and rmse values are also tabulated in Tables 4.3 and 4.4 in the manuscript. Correlation performance of the baseline models with high-dimensional input (identical to that of the APANN model) is shown in Fig. S4.2.

Discussion

Results from all three experiments indicate that the APANN model offers better predictive performance with the HC and HC+CT inputs. Specifically for the HC input modality, we see a substantial improvement with the APANN model utilizing voxel-wise information from the segmented hippocampal masks. We note that this performance gain is attributed to 1) the added information from the voxel-wise input compared to two volumetric measures, and 2) the computational capacity of APANN to extract useful features from this voxel-wise input compared to the baseline models. As described earlier, the baseline models do show improved performance with the voxel-wise input; however, the gains are smaller compared to the APANN model. In comparison, the CT input modality, when used independently, does not offer an improvement with the APANN model. For the ADAS-13 prediction task, the baseline models outperform the APANN model in Experiments 1 and 3, and offer similar performance in Experiment 2, when comparing performance with the CT input alone. However, the HC+CT input to the APANN model offers significantly higher performance improvement over the baseline models across all three experiments.

Figure S4.1: Performance of all models subject to individual and combined input modalities. The correlation values are averaged over 10 rounds of 10 folds. All models were trained with a nested inner loop that searched for optimal hyperparameters. The statistical significance annotations for pairwise method comparisons are as follows: ∗ : p < 0.05, ∗∗ : p < 0.01, ∗ ∗ ∗ : p < 0.001. Plots only show statistical comparisons for the HC+CT modality.

Figure S4.2: Correlation performance of LR L1, SVR, and RFR models with high-dimensional input comprising HC: 32557 voxel-wise hippocampal features and CT: 686 thickness values from ROIs.

S4.2 Empirical sampling: standardization across modalities

Since we use independent procedures for augmenting the number of samples for the two input modalities (HC and CT), each subject may end up with a different number of empirical samples from each input. This raises an issue when training models with combined input from both modalities (HC+CT). Additionally, within each modality, two subjects may have drastically different numbers of empirical samples, which can result in biased training. In order to avoid these issues, we standardize the empirical sample sizes by enforcing the following constraints: 1) the total number of samples per subject needs to be equal across modalities (number of HC samples = number of CT samples for the ith subject), and 2) the number of samples per subject needs to be similar across subjects (number of samples for the ith subject ≈ number of samples for the jth subject). Based on these constraints, the samples are randomly chosen from the available pool of empirical samples for a given subject. At the end of this process, the resultant augmented training set sample size was approximately 35 times the number of subjects in the cohort.
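A minimal sketch of this standardization step is shown below, assuming the augmented samples for each modality are held in dictionaries keyed by subject ID; the function name and data layout are hypothetical.

```python
# Hypothetical sketch: enforce equal per-subject sample counts across the HC and CT pools
# by random subsampling, and cap the count so it is similar across subjects.
import numpy as np

def standardize_sample_counts(hc_samples, ct_samples, n_per_subject, seed=0):
    rng = np.random.default_rng(seed)
    hc_out, ct_out = {}, {}
    for sid in hc_samples:
        # Constraint 1: equal counts across modalities for this subject.
        # Constraint 2: roughly equal counts across subjects (capped at n_per_subject).
        n = min(n_per_subject, len(hc_samples[sid]), len(ct_samples[sid]))
        hc_idx = rng.choice(len(hc_samples[sid]), size=n, replace=False)
        ct_idx = rng.choice(len(ct_samples[sid]), size=n, replace=False)
        hc_out[sid] = np.asarray(hc_samples[sid])[hc_idx]
        ct_out[sid] = np.asarray(ct_samples[sid])[ct_idx]
    return hc_out, ct_out
```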

S4.3 Computational resource requirements

The ANN models were trained using an NVIDIA GeForce GTX TITAN X GPU and the Caffe deep learning framework (http://caffe.berkeleyvision.org/). Training durations for a single model for each input modality, averaged over the different hyperparameter combinations (number of hidden nodes, learning rate, etc.), are as follows: 1) CT (input dimensionality: 686): 4 minutes (Experiments 1 and 2), 7 minutes (Experiment 3); 2) HC (input dimensionality: 32557): 8 minutes (Experiments 1 and 2), 15 minutes (Experiment 3); 3) HC+CT (input dimensionality: 33243): 11 minutes (Experiments 1 and 2), 20 minutes (Experiment 3).

S4.4 Performance bias in combined ADNI1and2 cohort

In Experiment 3, we used the pooled ADNI1 and ADNI2 datasets during cross-validation. Therefore, models were trained and tested on a mixture of ADNI1 and ADNI2 subjects. In order to evaluate whether these trained models show any dataset bias in their predictive performance, we stratified the test performance based on subject-dataset membership (ADNI1 vs. ADNI2). The performance bias was then computed as follows:

\[
\text{bias (normalized)} = \frac{\text{mean}\left(\lvert r_{\text{adni2}} - r_{\text{adni1}} \rvert\right)}{\text{mean}\left(r_{\text{adni1+2}}\right)} \tag{4.1}
\]
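For clarity, the metric in Eq. 4.1 can be computed directly from the per-fold correlations stratified by dataset membership; the sketch below uses illustrative variable names rather than the thesis code.

```python
# Sketch of the normalized dataset bias (Eq. 4.1): mean absolute gap between the ADNI1 and
# ADNI2 per-fold correlations, normalized by the mean correlation on the pooled test sets.
import numpy as np

def dataset_bias(r_adni1, r_adni2, r_pooled):
    r_adni1, r_adni2 = np.asarray(r_adni1), np.asarray(r_adni2)
    return np.mean(np.abs(r_adni2 - r_adni1)) / np.mean(r_pooled)
```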


Figure S4.3: Correlation (r) performance of the four models for the HC+CT input in Experiment 3, split by subject-dataset membership. In contrast with Experiments 1 and 2, all models were cross-validated on pooled ADNI1+ADNI2 subjects. The stratification of results offers insight into performance biases that may have been caused by the dataset membership itself (e.g., high prediction performance for only ADNI1 subjects). The APANN model outperforms the other models in dataset-wise comparisons for both clinical scales and exhibits low bias towards a single dataset.

       | LR L1  | SVR    | RFR    | APANN
ADAS13 | 0.1735 | 0.1958 | 0.1773 | 0.1284
MMSE   | 0.2129 | 0.2150 | 0.2320 | 0.1991

Table S4.1: Bias measures for Experiment 3.

Based on this metric, a model exhibits a high bias if there is a large difference between the performances on the ADNI1 and ADNI2 splits (i.e. the model performs well on only a single dataset). The results show that APANN has the smallest performance bias towards any particular dataset compared to the other models for both the ADAS13 and MMSE tasks. See Fig. S4.3 and Table S4.1.

Chapter 5

Modeling and prediction of clinical symptom trajectories in Alzheimer’s disease using longitudinal data

Nikhil Bhagwat [1,2,3,*], Joseph D. Viviano [3], Aristotle N. Voineskos [3,4], M. Mallar Chakravarty [1,2,5], and Alzheimer’s Disease Neuroimaging Initiative.

1. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada

2. Cerebral Imaging Centre, Douglas Mental Health University Institute, Verdun, QC, Canada

3. Kimel Family Translational Imaging-Genetics Research Lab, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada

4. Department of Psychiatry, University of Toronto, Toronto, ON, Canada

5. Biological and Biomedical Engineering, McGill University, Montreal, QC, Canada

Correspondence: Nikhil Bhagwat, M. Mallar Chakravarty; Email: [email protected], [email protected]


5.1 Abstract

Computational models predicting symptomatic progression at the individual level can be highly beneficial for early intervention and treatment planning for Alzheimer’s disease (AD). Individual prognosis is complicated by many factors, including the definition of the prediction objective itself. In this work, we present a computational framework comprising machine-learning techniques for 1) modeling symptom trajectories and 2) prediction of symptom trajectories using multimodal and longitudinal data. We perform primary analyses on three cohorts from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), and a replication analysis using subjects from the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL). We model the prototypical symptom trajectory classes using clinical assessment scores from the mini-mental state exam (MMSE) and the Alzheimer’s Disease Assessment Scale (ADAS-13) at nine timepoints spanning six years, based on a hierarchical clustering approach. Subsequently, we predict these trajectory classes for a given subject using magnetic resonance (MR) imaging, genetic, and clinical variables from two timepoints (baseline + follow-up). For prediction, we present a longitudinal Siamese neural network (LSN) with novel architectural modules for combining multimodal data from two timepoints. The trajectory modeling yields two (stable and decline) and three (stable, slow-decline, fast-decline) trajectory classes for MMSE and ADAS-13 assessments, respectively. For the predictive tasks, LSN offers highly accurate performance with 0.900 accuracy and 0.968 AUC for the binary MMSE task and 0.760 accuracy for the 3-way ADAS-13 task on the ADNI datasets, as well as 0.724 accuracy and 0.883 AUC for the binary MMSE task on the replication AIBL dataset.

5.2 Author Summary

With an aging global population, the prevalence of Alzheimer’s disease (AD) is rapidly increasing, creating a heavy burden on public healthcare systems. It is, therefore, critical to identify those most likely to decline towards AD in an effort to implement preventative treatments and interventions. However, predictions are complicated by the substantial heterogeneity present in the clinical presentation of the prodromal stages of AD. Longitudinal data comprising cognitive assessments and magnetic resonance images, along with genetic and demographic information, can help model and predict symptom progression patterns at the single-subject level. Additionally, recent advances in machine-learning techniques provide the computational framework for extracting combinatorial longitudinal and multimodal feature sets. To this end, we have used multiple AD datasets consisting of 1000 subjects with longitudinal visits spanning up to six years for 1) modeling stable versus declining clinical symptom trajectories and 2) predicting these trajectories using data from both baseline and a follow-up visit within one year. From a computational standpoint, we validated that a machine-learning model is capable of combining longitudinal, multimodal data towards accurate predictions. Our validations demonstrate that the presented model can be used for early detection of individuals at risk for clinical decline, and therefore holds crucial clinical utility for interventions in AD as well as other neurodegenerative diseases.

5.3 Author Contributions

Nikhil Bhagwat (NB) worked on the development of the longitudinal Siamese neural network (LSN) model and other machine-learning techniques, along with their subsequent implementation and validation. He performed preprocessing and quality control of the MR image datasets. He also wrote the manuscript of the research paper that is currently under review. Joseph Viviano provided feedback on the proposed methodological approach. Additionally, he provided support for computational resources. Aristotle N. Voineskos served as a clinical advisor and is a member of NB’s thesis committee. He also served as a supervisor for NB in a lab at CAMH that provided significant computational resources. M. Mallar Chakravarty is the thesis supervisor for NB. He provided guidance on the development and validation of all proposed models and techniques, as well as manuscript writing.

5.4 Introduction

Clinical decline towards Alzheimer’s disease (AD) and its preclinical stages (significant memory concern [SMC] and mild cognitive impairment [MCI]) increases the burden on healthcare and support systems [11]. Identification of declining individuals a priori would provide a critical window for timely intervention and treatment planning. Individual-level clinical forecasting is complicated by many factors, including the definition of the prediction objective itself. Several previous efforts have focused on diagnostic conversion within a fixed time window as a prediction end point (e.g. the conversion of MCI to frank AD onset) [91, 107, 292, 381, 389, 117, 252, 207, 170, 250]. Other studies, which model the clinical states as a continuum instead of discrete categories, investigate prediction problems pertaining to symptom severity. These studies define their objective as predicting future clinical scores from assessments such as the mini-mental state exam (MMSE) and the Alzheimer’s Disease Assessment Scale-cognitive (ADAS-cog) [108, 389, 170]. All of these tasks have proved to be challenging due to the heterogeneity in clinical presentation, comprising highly variable and nonlinear longitudinal symptom progression exhibited throughout the continuum of AD prodromes and onset [323, 371, 299, 216, 182, 184]. In the pursuit of predictive biomarker identification, several studies have reported varying neuroanatomical patterns associated with functional and cognitive decline in AD and its prodromes [135, 299, 117, 323, 371, 299, 216, 184, 182, 286, 148, 219, 231, 103]. The lack of locally specific, canonical atrophy signatures could be attributed to cognitive reserve, genetics, or environmental factors [286, 331, 263, 216, 269, 353, 123]. As a result, local anatomical features, such as hippocampal volume, may be insufficient for predicting future clinical decline at the single-subject level [299, 303, 182, 323, 371, 299, 216, 184, 182]. Thus, models incorporating ensembles of imaging features with clinical and genotypic information have been proposed [389, 252, 207]. However, in such multimodal models, the performance gains offered by the imaging data are unclear. In particular, insight into the prediction improvement from magnetic resonance (MR) images is crucial, as it may aid decision-making regarding the necessity of MR acquisition (a relatively expensive, time-consuming, and possibly stressful requirement) for a given subject in the aim of improving prognosis. Furthermore, there is increasing interest in incorporating data from multiple timepoints within a short span (i.e. follow-up patient visits), in an effort to improve long-term prognosis. However, this is a challenging task requiring longitudinally consistent feature selection and mitigation of missing timepoints [389, 170]. The overarching goal of this work is to provide a longitudinal analysis framework for predicting symptom progression in AD that addresses the aforementioned challenges pertaining to task definition (model output) as well as ensemble feature representation (model input). The contributions of this work are two-fold. First, we present a novel data-driven approach for modeling long-term symptom trajectories derived solely from clustering of longitudinal clinical assessments. We show that the resultant trajectory classes represent relatively stable and declining trans-diagnostic subgroups of the subject population.
Second, we present a novel machine-learning (ML) model called the longitudinal Siamese network (LSN) for prediction of these symptom trajectories based on multimodal and longitudinal data. Specifically, we use cortical thickness as our MR measure due to its higher robustness against typical confounds, such as head size and total brain volume, compared to local volumetric measures [184], and its previous use in biomarker development and clinical applications in AD [99, 299, 286, 222]. The choice of excluding other potential biomarkers related to AD progression, such as PET or CSF data, was based on their invasive acquisition and lack of availability in practice and in the databases leveraged in this work.

We evaluate the performance of the trajectory modeling and prediction tasks on Alzheimer’s Disease Neuroimaging Initiative (ADNI) datasets (ADNI1, ADNIGO, ADNI2) [http://adni.loni.usc.edu/]. Moreover, we also validate the predictive performance on a completely independent replication cohort from the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL) [http://adni.loni.usc.edu/study-design/collaborative-studies/aibl/]. We compare LSN with other ML models including logistic regression (LR), support vector machine (SVM), random forest (RF), and a classical artificial neural network (ANN). We examine the added value of MR information in combination with clinical and demographic data, as well as the benefit of the follow-up timepoint information towards the prediction task, to assist prioritization of MR data acquisition and periodic patient monitoring.

5.5 Materials and Methods

5.5.1 Datasets

ADNI1, ADNIGO, ADNI2, and AIBL datasets were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). The Australian Imaging, Biomarker & Lifestyle (AIBL) Flagship Study of Ageing is a collaborative study that shares many common goals with ADNI (http://adni.loni.usc.edu/study-design/collaborative-studies/aibl/). Only ADNI-compliant subjects with at least three timepoints from AIBL were included in this study. For all ADNI cohorts, subjects with clinical assessments from at least three visits, in a timespan longer than one year, were included in the analysis. Subjects were further excluded based on manual quality checks after the MR preprocessing pipelines (described below). Age, Apolipoprotein E4 (APOE4) status, clinical scores from the mini-mental state exam (MMSE) and Alzheimer’s Disease Assessment Scale (ADAS-13), and T1-weighted MR images were used in the analysis. Subject demographics are shown in Table 5.1, and the complete list of included subjects is provided in the supplementary materials (S1).

Figure 5.1: Analysis workflow of the longitudinal framework. The workflow comprises two tasks, 1) trajectory modeling (TM), and 2) trajectory prediction (TP). Data from 69 ADNI-1 MCI subjects with 9 visits within 6 years are used for the TM task using hierarchical clustering. 1116 ADNI subjects pooled from the ADNI1, ADNIGO, and ADNI2 cohorts are used for the TP task. Data from baseline and a follow-up timepoint are used for trajectory prediction. The trained models from k-fold cross-validation of ADNI subjects are then tested on 117 AIBL subjects as part of the replication analysis.

The ADNI sample comprising pooled ADNI1, ADNIGO, and ADNI2 subjects was used to perform the primary analysis comprising the trajectory modeling and prediction tasks, whereas AIBL subjects were used as the independent replication cohort for the prediction task. There are a few important differences between the ADNI and AIBL cohorts. First, ADNI has clinical measures from both the MMSE and ADAS-13 scales, with visits separated by 12 months or less, spanning 72 months. In contrast, the AIBL sample only has MMSE scores available, with visits separated by 18 months or more, spanning 54 months.

Dataset     | N   | Age (years) | APOE4 (# alleles)  | Sex          | Dx (at baseline)
ADNI1 (TM)  | 69  | 74.5        | 0:30, 1:32, 2:7    | M:51, F:18   | LMCI:69
ADNI1 (TP)  | 513 | 75.3        | 0:275, 1:180, 2:58 | M:289, F:224 | CN:177, LMCI:226, AD:110
ADNI2 (TP)  | 515 | 72.4        | 0:277, 1:187, 2:58 | M:281, F:234 | CN:150, SMC:26, EMCI:142, LMCI:130, AD:67
ADNIGO (TP) | 88  | 71.1        | 0:54, 1:26, 2:8    | M:47, F:41   | EMCI:88
AIBL (R)    | 117 | 71.7        | 0:65, 1:46, 2:6    | M:61, F:56   | CN:99, MCI:18

Table 5.1: Demographics of the ADNI and AIBL datasets. TM: trajectory modeling cohort, TP: trajectory prediction cohort, R: replication cohort.

5.5.2 Preprocessing

T1-weighted MR images are used in this study. First, the MR images were preprocessed using the bpipe pipeline (https://github.com/CobraLab/minc-bpipe-library/), comprising N4 correction [347], neck cropping to improve linear registration, and BEaST brain extraction [119]. Then, the preprocessed images were input into the CIVET pipeline [69, 221, 196, 237] to estimate cortical surface measures at 40,962 vertices per hemisphere. Lastly, cortical vertices were grouped into 78 regions of interest (ROIs) based on the Automated Anatomical Labeling (AAL) atlas [348], which were used to estimate mean cortical thickness across each ROI.
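The final ROI-averaging step can be illustrated with a short sketch, assuming a per-vertex thickness array from CIVET and a matching array of AAL labels; the label convention (0 for unlabeled vertices) is an assumption for this example.

```python
# Sketch: collapse vertex-wise cortical thickness into mean thickness per AAL ROI.
import numpy as np

def roi_mean_thickness(thickness, roi_labels):
    """thickness: (n_vertices,) CIVET values; roi_labels: (n_vertices,) AAL ROI IDs."""
    rois = np.unique(roi_labels)
    rois = rois[rois != 0]  # assume 0 marks unlabeled / non-cortical vertices
    return {int(r): float(thickness[roi_labels == r].mean()) for r in rois}
```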

5.5.3 Analysis workflow

The presented longitudinal framework comprises two tasks, 1) trajectory modeling and 2) trajectory prediction. The overall process is outlined in Fig 5.1.

5.5.4 Trajectory modeling

We aim to characterize the symptomatic progression of subjects based on clinical scores from multiple timepoints using a data-driven approach. In this pursuit, we used 69 late-MCI subjects from the ADNI1 cohort (see Table 5.1) with all 6 years (total 9 timepoints) of available clinical data as input to hierarchical clustering (see Fig 5.2). The goal here is to group subjects with similar clinical progression in order to build a template of differential trajectories. We note that the primary goal of this exercise is not to discover unknown subtypes of AD progression, but to create trajectory prototypes against which all participants can be compared and assigned a trajectory (prognostic) label. These labels provide a goal for the prediction task, which in a traditional setting is defined by diagnosis or change in diagnosis, or even the symptom profile at a specific timepoint. We used the Euclidean distance between longitudinal clinical score vectors as a similarity metric and Ward’s method as the linkage criterion for clustering. Note that the number of clusters is a design choice in this approach, which depends on the specificity of the trajectory progression that we desire. A higher number of clusters allows modeling of trajectory progression with higher specificity (i.e. slow vs. fast decline). Clinically, it would be useful to have a more specific prognosis to prioritize and personalize intervention and treatment options. However, it also increases the difficulty of early prediction. In this work, the choice of 2 vs. 3 clusters is made based on the dynamic score range of the clinical assessment. Each of the resultant clusters represents a stable or declining symptom trajectory. We modeled MMSE and ADAS-13 trajectories separately. We note that this does not assume independence between the two scales; rather, it reflects the high prevalence of these scales in research and the clinic, as well as the feasibility of a clustering approach that allows modeling of >2 trajectories with a more symptom-specific clinical assessment such as ADAS-13. After clustering, we simply average the clinical scores of subjects from each cluster at each individual timepoint to determine trajectory-templates for each of the classes (i.e. stable, decline).
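The clustering step can be sketched as follows with SciPy, assuming a subjects-by-timepoints score matrix; this is an illustrative outline rather than the exact analysis script.

```python
# Sketch of trajectory modeling: Ward-linkage hierarchical clustering of longitudinal
# clinical score vectors, followed by per-cluster trajectory templates.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def model_trajectories(scores, n_clusters):
    """scores: (n_subjects, n_timepoints) array of MMSE or ADAS-13 values."""
    Z = linkage(scores, method="ward", metric="euclidean")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Template for each trajectory class: mean score at each timepoint within the cluster.
    templates = {int(c): scores[labels == c].mean(axis=0) for c in np.unique(labels)}
    return labels, templates
```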

Figure 5.2: Trajectory modeling. A) 69 MCI subjects (rows) with six years of clinical scores (columns) were used as input to hierarchical clustering. The color indicates the clinical score at a given timepoint. Euclidean distance between score vectors was used as the similarity metric between two subjects. Ward’s method was used as the linkage criterion. B) Clinical score distribution at each timepoint of the different trajectories (stable vs. decliners) derived from hierarchical clustering. Mean scores at each timepoint are used to build a template for each trajectory class.

Subsequently, these trajectory-templates are used to assign trajectory labels to the rest of the subjects (not used in clustering) based on Euclidean proximity computed from all available timepoints of a given subject (see Fig. S5.2). There are two advantages to this approach. First, it allows us to group subjects without having to enforce strict cut-offs for defining boundaries, such as MCI to AD conversion within a certain number of months [86, 389, 252, 170]. Second, it offers a relatively simple way of dealing with missing timepoints, as the trajectory-template can be sampled based on the clinical data availability of a given subject. In contrast, prediction tasks that are defined based on a specific time window have to exclude subjects with missing data and ignore data from additional timepoints beyond the set cut-off (e.g. late AD converters). We assign trajectory labels to all remaining ADNI and AIBL subjects based on their proximity to each of the trajectory-templates, computed from at least 3 timepoints in a timespan longer than one year. The demographics stratified by trajectory labels are shown in Table 5.2, whereas Table 5.3 shows the subject membership overlap between the MMSE and ADAS-13 based trajectory assignments. These trajectory labels are then used as the task outcome (ground truth) in the predictive analysis described next.
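A sketch of this label assignment is given below, assuming each template spans the full nine-timepoint grid and is sampled at whichever visits a subject actually has; the function and argument names are illustrative.

```python
# Sketch: assign the nearest trajectory template using only the subject's available visits.
import numpy as np

def assign_trajectory(subject_scores, visit_idx, templates):
    """subject_scores: scores at the available visits; visit_idx: positions of those visits
    on the 9-timepoint grid; templates: dict of class label -> full 9-timepoint template."""
    subject_scores = np.asarray(subject_scores)
    dists = {label: np.linalg.norm(subject_scores - template[visit_idx])
             for label, template in templates.items()}
    return min(dists, key=dists.get)  # label of the closest template (Euclidean proximity)
```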

Trajectory   | N   | Age (years) | APOE4 (# alleles)  | Sex          | Dx (at baseline)                         | GDS* (at baseline)
MMSE
stable       | 674 | 72.8        | 0:443, 1:195, 2:36 | M:371, F:303 | CN:319, SMC:23, EMCI:187, LMCI:144, AD:1 | mean:1.21, stdev:1.37
decline      | 442 | 74.8        | 0:163, 1:198, 2:81 | M:246, F:196 | CN:8, SMC:3, EMCI:43, LMCI:212, AD:176   | mean:1.63, stdev:1.39
ADAS-13
stable       | 558 | 72.3        | 0:399, 1:163, 2:23 | M:308        | CN:307, SMC:25, EMCI:167, LMCI:83, AD:3  | mean:1.18, stdev:1.38
slow-decline | 184 | 74.9        | 0:98, 1:61, 2:25   | M:122, F:62  | CN:19, SMC:1, EMCI:51, LMCI:105, AD:8    | mean:1.50, stdev:1.38
fast-decline | 346 | 74.8        | 0:108, 1:169, 2:69 | M:186, F:160 | CN:1, SMC:0, EMCI:12, LMCI:168, AD:165   | mean:1.66, stdev:1.38

Table 5.2: Cluster demographics of the ADNI trajectory prediction (TP) cohort based on the MMSE and ADAS-13 scales. *GDS: Geriatric Depression Scale.

             | ADAS-13 stable | ADAS-13 slow-decline | ADAS-13 fast-decline
MMSE stable  | 558            | 107                  | 9
MMSE decline | 27             | 77                   | 337

Table 5.3: Trajectory membership comparison between the MMSE and ADAS-13 scales. Note that MMSE only has a single decline trajectory.

5.5.5 Trajectory prediction

The goal of this predictive analysis is to identify the prognostic trajectories of individual subjects as early as possible. In the simplest (and perhaps the most ideal) case, the prediction models use information available from a single timepoint, also referred to as the baseline timepoint. However, input from a single timepoint lacks information regarding short-term changes (or the rate of change) in neuroanatomy and clinical status, which would potentially be useful for increasing the specificity of future predictions of decline. In order to leverage this information, we propose a longitudinal Siamese network (LSN) model that can effectively combine data from two timepoints and improve predictive performance. We build the LSN model with two design objectives. First, we want to combine the MR information from two timepoints in a way that encodes the structural changes predictive of different clinical trajectories. Second, we want to incorporate genetic (APOE4 status) and clinical information with the structural MR information in an effective manner. We achieve these objectives with an augmented artificial neural network architecture (see Fig. 5.3) comprising a siamese network and a multiplicative module, which combines MR, genetic, and clinical data systematically.

Figure 5.3: Longitudinal Siamese network (LSN) model. LSN consists of three stages. The first stage is a siamese artificial neural network with twin weight-sharing branches. Weight-sharing implies an identical weight configuration at each layer across the two branches. These branches process the MR input (2 x 78 CT values) from the same subject at two timepoints and produce a transformed output that is representative of change (atrophy) over time. This change pattern is referred to as the “distance embedding”. In the second stage, this embedding is modulated by the APOE4 status with a multiplicative operation. Lastly, in the third stage, the modulated distance embedding is concatenated with the two clinical scores and age, and used towards the final trajectory prediction. The weights (model parameters) of all operations are learned jointly in a single unified model framework.

The first stage of LSN is an artificial neural network model based on a siamese architecture with twin weight-sharing branches. Typically, siamese networks are employed for tasks, such as signature or face verification, that require learning a “similarity” measure between two comparable inputs [60, 203].

Network module: Siamese net
• Input: 2 x 78 CT values per subject
• Number of hidden layers: 4 in each branch (fixed for all folds)
• Number of nodes per layer: 25 to 50 (grid search)
• Output: distance embedding; number of nodes: 10 to 20 (grid search)

Network module: Multiplicative modulation
• Input: distance embedding, APOE4 status (0, 1, 2)
• Number of hidden layers: 1 (fixed for all folds)
• Number of nodes: 1 (fixed for all folds)
• Output: modulated distance embedding

Network module: Concatenation and prediction
• Input: modulated distance embedding, clinical scores, age
• Number of hidden layers: 1 (fixed for all folds)
• Number of nodes: 10 to 20 (grid search)
• Output: trajectory class

Table 5.4: Longitudinal siamese network (LSN) architecture.

Here, we utilize this architecture to encode a “distance embedding” between two MR images in order to represent the amount of atrophy over time that is predictive of the clinical trajectory. In the second stage, this embedding is modulated by the APOE4 status with a multiplicative operation. The choice of a multiplicative operation is based on an underlying assumption of non-additive interaction between the MR and genetic modalities. Subsequently, the modulated distance embedding is concatenated with the baseline and follow-up clinical scores and age, which is used towards the final trajectory prediction. The weights (model parameters) of all operations are learned jointly in a single unified model framework. The complete design specifications of the model are outlined in Table 5.4. The complete input data consist of 2 x 78 CT values, 2 x clinical scores, age, and APOE4 status. Two different LSN models are trained separately to perform the binary classification task for MMSE-based trajectories and the three-way classification task for ADAS-13-based trajectories. All models are trained using the TensorFlow library (version: 0.12.1, https://www.tensorflow.org/), and the code is available at https://github.com/CobraLab/NI-ML/tree/master/projects/LSN.
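To make the three-stage architecture concrete, the sketch below expresses the same forward pass in modern tf.keras rather than the TensorFlow 0.12 code released with the thesis; the layer sizes and the width of the APOE4 gate are illustrative choices, not the exact published configuration.

```python
# Illustrative tf.keras sketch of the LSN stages: shared siamese encoder, APOE4-driven
# multiplicative modulation, and concatenation with clinical scores and age.
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_lsn(n_ct=78, n_embed=10, n_hidden=50, n_classes=2):
    # Stage 1: weight-sharing branch applied to baseline and follow-up cortical thickness.
    branch = tf.keras.Sequential(
        [layers.Dense(n_hidden, activation="relu") for _ in range(4)]
        + [layers.Dense(n_embed, activation="relu")], name="siamese_branch")
    ct_bl = layers.Input(shape=(n_ct,), name="ct_baseline")
    ct_fu = layers.Input(shape=(n_ct,), name="ct_followup")
    distance = layers.Subtract(name="distance_embedding")([branch(ct_fu), branch(ct_bl)])

    # Stage 2: multiplicative modulation of the distance embedding by APOE4 allele count.
    apoe4 = layers.Input(shape=(1,), name="apoe4")
    gate = layers.Dense(n_embed, activation="sigmoid", name="apoe4_gate")(apoe4)
    modulated = layers.Multiply(name="modulated_embedding")([distance, gate])

    # Stage 3: concatenate with baseline/follow-up clinical scores and age, then classify.
    clinical = layers.Input(shape=(3,), name="scores_and_age")
    merged = layers.Concatenate()([modulated, clinical])
    hidden = layers.Dense(20, activation="relu")(merged)
    output = layers.Dense(n_classes, activation="softmax", name="trajectory")(hidden)
    return Model([ct_bl, ct_fu, apoe4, clinical], output)

model = build_lsn(n_classes=2)  # binary MMSE task; use n_classes=3 for ADAS-13 trajectories
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```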

Selection of second-timepoint as input

The LSN model is agnostic to the time difference between the two MR scans. Therefore, we use the latest available datapoint for a given subject within 12 months of the baseline visit. This allows us to include subjects who are missing data at the 12-month timepoint but have MR and clinical data available at the 6-month timepoint (N=43). We note that this is an “input” timepoint selection criterion, as the ground truth trajectory labels (output) are assigned based on all available timepoints (3 or more) for each subject.
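A pandas sketch of this selection rule is shown below; the DataFrame layout (one row per visit with 'subject' and 'month' columns) is an assumption for illustration.

```python
# Hypothetical sketch: pick the latest follow-up visit within 12 months of baseline, so a
# subject missing the 12-month visit falls back to the 6-month visit if it is available.
import pandas as pd

def select_followup(visits: pd.DataFrame) -> pd.Series:
    eligible = visits[(visits["month"] > 0) & (visits["month"] <= 12)]
    return eligible.groupby("subject")["month"].max()
```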

Performance gains from MR modality and the second timepoint

In order to gain insight into identifying subjects that would benefit from added MR and second timepoint information, we divide the subjects into three groups based on a potential clinical workflow as shown in Fig 5.4.

• Baseline edge-cases (BE): Subjects with very high or very low cognitive performance at baseline

• Follow-up edge-cases (FE): Subjects with non-extreme cognitive performance at baseline but marked change in performance at follow-up

• Cognitively consistent (CC): Subjects with non-extreme cognitive performance at baseline and marginal change in performance at follow-up

The cutoff thresholds for the quantitative stratification, and the corresponding subject-group memberships, are shown in Fig. 5.4. We note that these threshold values are not a fixed choice, and are selected here only for the purpose of demonstrating a clinical use case. Based on this grouping, we can expect that the clinical scores of CC subjects provide very little information regarding potential trajectories and thus we need to rely on MR information for prediction, making it the target group for multimodal, longitudinal prediction.
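The stratification can be expressed as a small rule, sketched below with placeholder cut-offs; the actual scale-specific thresholds are those listed in Fig. 5.4, not the values used here.

```python
# Sketch of the BE / FE / CC grouping rule; low, high, and delta are placeholder thresholds.
def assign_group(score_baseline, score_followup, low=20, high=29, delta=2):
    if score_baseline <= low or score_baseline >= high:
        return "BE"  # baseline edge-case: extreme cognitive performance at baseline
    if abs(score_followup - score_baseline) >= delta:
        return "FE"  # follow-up edge-case: marked short-term change in performance
    return "CC"      # cognitively consistent: the target group for MR-based prediction
```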

Figure 5.4: Potential clinical workflow for subject specific decision-making. The goal of this flowchart is to identify subjects benefitting from MR and additional timepoint information. Qualitatively, baseline edge-cases (BE) group includes subjects with very high or very low cognitive performance at baseline. Follow-up edge-cases (FE) group includes subjects with non-extreme cognitive performance at baseline but substantial change in performance at follow-up. And cognitively consistent (CC) group includes subjects with non-extreme cognitive performance at baseline and marginal change in performance at follow-up. The table shows the threshold values used for MMSE and ADAS-13 scales and corresponding trajectory class distribution within each group.

5.5.6 Performance evaluation

We computed the performance of 2x3 input choices based on two timepoints, 1) baseline and 2) baseline + follow-up, and three feature sets: 1) clinical attributes (CA) comprising clinical score, age, and APOE4 status, 2) cortical thickness (CT) data comprising 78 ROIs, and 3) the combination of CA and CT features. This allows us to evaluate the benefits of added MR-derived information and the second timepoint. Evaluating the added benefit of the MR modality is critical due to the implicit dependency between clinical scores and outcome variables commonly encountered during prognostic predictions. Accuracy, area under the curve (AUC), confusion matrix (CM), and receiver operating characteristic (ROC) were used as evaluation metrics.

All models were evaluated using pooled ADNI1, ADNI2, and ADNIGO subjects in a 10-fold nested cross-validation paradigm. By pooling subjects from ADNI1, ADNIGO, and ADNI2 we show that the models trained in this work are not sensitive to differences in acquisition and quality of the MR data (e.g. 1.5T vs. 3T). Hyperparameter tuning was performed using nested inner folds. The ranges for the hyperparameter grid search are provided in the supplementary material (S5.2). The train and test subsets were stratified by cohort membership and the trajectory class distribution. This stratification minimizes the risk of overfitting models towards a particular cohort and reduces cohort-specific biases during evaluation of the results.

We compare the performance of LSN against four reference models: logistic regression with a Lasso regularizer (LR), support vector machine (SVM), random forest (RF), and a default artificial neural network (ANN). The comparison of LSN against other models allows us to quantify the performance gains offered by LSN with data from two timepoints. We hypothesize that LSN would provide better predictive performance than existing prediction modeling approaches comprising single as well as two timepoints. For the reference models, all input features under consideration were concatenated into a single array. For the LR and SVM models, all features in a given training set were standardized by removing the mean and scaling to unit variance; the test set was standardized using the mean and standard deviation computed from the training set. The statistical comparison of LSN against the other models was performed using the Mann-Whitney U test. All trained models were later directly used to test predictive performance with AIBL subjects. The AIBL performance statistics were averaged over 10 instances of trained models (one from each fold) for each input combination.
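A hedged sketch of this evaluation loop for one reference model is shown below, using scikit-learn; the toy data, variable names, and grid values are illustrative stand-ins rather than the exact configuration used in this work.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 82))        # toy stand-in for concatenated CA+CT features
y = rng.integers(0, 2, size=200)      # toy binary (MMSE-style) trajectory labels
strata = y                            # in the thesis: cohort membership x trajectory class

def evaluate_reference_model(estimator, param_grid, X, y, strata):
    """Outer 10-fold CV stratified on `strata`; inner grid search tunes
    hyperparameters; scaling statistics are fit on the training fold only."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    aucs = []
    for train, test in outer.split(X, strata):
        pipe = make_pipeline(StandardScaler(), estimator)
        search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
        search.fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], search.predict_proba(X[test])[:, 1]))
    return np.array(aucs)

lr_aucs = evaluate_reference_model(
    LogisticRegression(penalty="l1", solver="liblinear"),
    {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]},
    X, y, strata,
)
# Fold-wise metrics from two models can then be compared with
# scipy.stats.mannwhitneyu, as done for the LSN-versus-reference comparisons.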

5.6 Results

5.6.1 Trajectory modeling

MMSE- and ADAS-13-based clustering yielded two and three clusters, respectively. The larger score range of the ADAS-13 scale (see Fig. S5.1) allows modeling of symptom progression with higher specificity, providing slow- and fast-decline trajectories. The cluster assignment of the entire ADNI dataset based on the trajectory templates yielded 674 stable and 442 decline subjects for the MMSE scale, and 585 stable, 184 slow-decline, and 346 fast-decline subjects for the ADAS-13 scale. The complete cluster demographics are shown in Table 5.2. Notably, the stable clusters comprise a higher percentage of subjects with zero copies of the APOE4 allele, and conversely the declining clusters comprise a higher percentage of subjects with two copies of the APOE4 allele. Subjects with a single APOE4 copy are evenly distributed across stable and declining clusters. Table 5.3 shows the trajectory membership overlap between MMSE and ADAS-13. The results indicate that a large number of subjects belong to corresponding trajectories for MMSE and ADAS-13. In the non-overlapping cases, the stable MMSE subjects predominantly belong to the slow-decline ADAS-13 trajectory.

5.6.2 Trajectory prediction

Below we provide the MMSE- and ADAS-13-based trajectory prediction results for single-timepoint (baseline) and two-timepoint models with 1) CA, 2) CT, and 3) CA+CT input. Results are summarized in Figs. 5.5 to 5.9 and the supplementary material (Tables S5.2-S5.13). Note that the ADAS-13-based trajectory results are from a 3-class prediction task. We first report the performance of the LSN model, which is only applicable to the two-timepoint input comprising combined CA and CT features, followed by the performance of the single- and two-timepoint reference models.

5.6.3 MMSE trajectories (binary classification)

The combined CA and CT input feature set with two timepoints provides the most accurate prediction of trajectory classes. LSN outperforms all four reference models, with 0.94 accuracy and 0.99 AUC for all subjects, and 0.900 accuracy and 0.968 AUC for the “cognitively consistent” group (see Figs. 5.5 and 5.6, supplement Tables S5.2-S5.7).

Single (baseline) timepoint (All subjects)

With only CA input, all four reference models (LR, SVM, RF, ANN) provide highly similar performance with top accuracy of 0.84 and AUC of 0.91. With CT input only, the performance diminishes for all models with top accuracy of 0.77 and AUC of 0.83. With combined CA+CT input, all four models see a boost with top accuracy of 0.86 and AUC of 0.93.

Two timepoints (baseline + follow-up) (All subjects)

With only CA input, all four models (LR, SVM, RF, ANN) provide highly similar performance with top accuracy of 0.91 and AUC of 0.97. With CT input only, the performance diminishes for all models with top accuracy of 0.77 and AUC of 0.83. With combined CA+CT input, the performance stays similar to CA input for all four models with top accuracy of 0.91 and AUC of 0.96.

Figure 5.5: MMSE prediction AUC and accuracy performance for the ADNI dataset. Top pane: area under the ROC curve (AUC); bottom pane: accuracy. Results are stratified by the groups defined by the clinical workflow. Abbreviations are as follows. BL: baseline visit, CA: clinical attributes, CT: cortical thickness, BE: baseline edge-cases, FE: follow-up edge-cases, CC: cognitively consistent. The statistical comparison of LSN against the other models was performed using the Mann-Whitney U test; note that only the BL+follow-up, CA+CT input is relevant for this comparison. For the AUC comparison, LSN offered significantly better performance than all four models for the ‘All’, ‘BE’, and ‘CC’ subsets. LSN also offered statistically significant improvements over the RF and ANN models for the ‘FE’ subset. For the accuracy comparison, LSN offered significantly better performance than the SVM, RF, and ANN models for the ‘All’, ‘FE’, and ‘CC’ subsets. No statistically significant results were obtained for the ‘BE’ subset.

[Figure 5.6 panels: ROC curves (true positive rate vs. false positive rate) for the CA+CT input, one panel per subset (All, BE, FE, CC). Panel legends report per-model AUCs for baseline and baseline + follow-up inputs; LSN (BL+follow-up) attains the highest AUC in each panel (All: 0.98, BE: 0.99, FE: 0.98, CC: 0.96).]

Figure 5.6: MMSE prediction ROC curves for ADNI dataset. Receiver operating characteristic curves for CA+CT input. Results are stratified by groups defined by the clinical workflow. Abbreviations are as follows. BL: baseline visit, CA: clinical attributes, CT: cortical thickness. BE: baseline edge-cases, FE: follow-up edge-cases, CC: cognitively consistent.

Clinical group trends

The groupwise results (see Figs. 5.5 and 5.6) show that the trajectories of “baseline edge-cases (BE)” can be predicted with high accuracy from their baseline clinical attributes. Consequently, inclusion of CT features and a second timepoint offers minimal gains in predictive accuracy. Nevertheless, the two-timepoint LSN model with CA and CT inputs does offer the best performance, with 0.95 accuracy and 0.992 AUC. Interestingly, trajectory prediction of “follow-up edge-cases (FE)” yields poor performance with baseline CA features. Inclusion of baseline CT features improves performance for the FE group. Higher predictive gains are seen after inclusion of the second-timepoint information, with the LSN model offering the best performance, with 0.942 accuracy and 0.987 AUC. Lastly, trajectory prediction of “cognitively consistent (CC)” subjects shows intermediate performance compared to BE and FE subjects with baseline data. However, inclusion of the second timepoint does not improve performance substantially, resulting in poorer prediction of CC subjects compared to FE and BE with two timepoints, with LSN still offering the best performance.

5.6.4 ADAS-13 trajectories (3-way classification)

Similar to the MMSE task, the combined CA and CT feature set with two timepoints provides the most accurate prediction of trajectory classes. LSN outperforms all four reference models, with 0.91 accuracy for all subjects and 0.76 accuracy for the “cognitively consistent” group (see Fig. 5.7, supplement Tables S5.8-S5.13).


Figure 5.7: ADAS-13 prediction accuracy performance for the ADNI dataset. Results are stratified by the groups defined by the clinical workflow. Abbreviations are as follows. BL: baseline visit, CA: clinical attributes, CT: cortical thickness, BE: baseline edge-cases, FE: follow-up edge-cases, CC: cognitively consistent. The statistical comparison of LSN against the other models was performed using the Mann-Whitney U test; note that only the BL+follow-up, CA+CT input is relevant for this comparison. For the accuracy comparison, LSN offered significantly better performance than the LR, SVM, RF, and ANN models for the ‘All’, ‘FE’, ‘BE’, and ‘CC’ subsets.

Single (baseline) timepoint (All subjects)

With only CA input, LR provides the top accuracy of 0.81. The confusion matrix (CM) shows that all four models perform relatively better at distinguishing subjects in the stable and fast-declining trajectories, compared to identifying the slow-declining trajectory from the other two. The ANN model does not offer consistent performance for slow-declining trajectory prediction, as it fails to predict that class in at least one fold of cross-validation. With CT input only, the performance diminishes for all models, with a top accuracy of 0.68 with the LR model. The issue with slow-declining trajectory prediction persists with CT features, as both the LR and RF models fail to predict that class in at least one fold of cross-validation. With combined CA+CT input, the performance remains similar to CA input for all four models, with a top accuracy of 0.81.

Two timepoints (baseline + follow-up) (All subjects)

With only CA input, LR and SVM provide similar performance with an accuracy of 0.88. As with the single-timepoint models, the confusion matrix (CM) shows that all four models perform relatively better at distinguishing subjects in the stable and fast-declining trajectories, compared to identifying the slow-declining trajectory from the other two. With CT input only, the performance diminishes for all models, with a top accuracy of 0.67 with the LR model. Both the LR and RF models fail to predict the slow-declining trajectory with CT features in at least one fold of cross-validation. With combined CA+CT input, the performance diminishes compared to CA input for all four models, with a top accuracy of 0.84.

Clinical group trends

The groupwise results (see Fig. 5.7) show that the trajectories of “baseline edge-cases (BE)” can be predicted with high accuracy from their baseline clinical attributes. Consequently, inclusion of CT features and the second timepoint offers minimal gains in predictive accuracy. The confusion matrix shows that only the stable and fast-declining trajectories can be predicted due to the extreme scores of BE subjects. The two-timepoint LSN model with CA and CT features offers the best performance with 0.98 accuracy. Trajectory prediction of “follow-up edge-cases (FE)” yields poor performance with baseline CA features, and inclusion of baseline CT features only incrementally improves performance. Substantial predictive gains are seen after inclusion of the second-timepoint information, with the LSN model offering the best performance with 0.86 accuracy. The confusion matrix shows predictability of all three classes. Lastly, trajectory prediction of “cognitively consistent (CC)” subjects shows intermediate performance with baseline data compared to BE and FE. However, inclusion of a second timepoint does not improve performance substantially, resulting in poorer prediction of CC subjects compared to FE and BE with two timepoints. Nevertheless, LSN still offers the best performance compared to the reference models.

5.6.5 AIBL results

The use of AIBL as a replication cohort has a few caveats due to differences in the study design and data collection. Notably, AIBL has an 18-month interval between timepoints, with at most 4 timepoints per subject, and does not have ADAS-13 assessments available. Therefore, the results are provided only for MMSE assessments. The trajectory membership assignment based on the ADNI templates yielded 99 stable and 18 declining subjects in the AIBL cohort. For the MMSE-based trajectory classification task, LSN offers 0.724 accuracy and 0.883 AUC with the combined CA and CT feature set from two timepoints (see Figs. 5.8 and 5.9, supplement Table S5.14).

Single (baseline) timepoint (All subjects)

With only CA input, LR, SVM, and RF provide similar performance, with SVM offering the best results with 0.853 accuracy and 0.887 AUC. With only CT input, the performance of all models degrades substantially, with a maximum of 0.72 accuracy and 0.74 AUC. With combined CA+CT input, the performance of the reference models is similar to CA input, with ANN offering the best performance with 0.856 accuracy and 0.851 AUC.


Figure 5.8: MMSE prediction AUC and accuracy performance for AIBL replication cohort. Top pane: Area under the ROC curve (AUC), Bottom pane: Accuracy. Abbreviations are as follows. BL: baseline visit, CA: clinical attributes, CT: cortical thickness.

Two timepoints (baseline + follow-up) (All subjects)

With only CA input, LR, SVM, and RF provide similar performance, with LR and SVM offering the best results with 0.90 accuracy and 0.95 AUC. With only CT input, the performance of all models degrades, and in particular LR and SVM offer biased performance, with at least one model instance predicting only the majority class (i.e. stable). With combined CA+CT input, LR, SVM, and ANN again offer biased performance, with at least one model instance predicting only the majority class (i.e. stable). RF offers relatively balanced performance with 0.871 accuracy and 0.803 AUC.

5.6.6 Effect of trajectory modeling on predictive performance

The use of a variable number of clinical timepoints per subject, based on availability during trajectory assignment (ground truth), impacts the predictive performance (see supplement Section S5.3, Figs. S5.3-S5.4).

[Figure 5.9 panel: ROC curves (true positive rate vs. false positive rate) for AIBL, CA+CT input. The legend reports per-model AUCs: baseline: LR 0.86, SVM 0.86, RF 0.86, ANN 0.85; baseline + follow-up: LR 0.84, SVM 0.50, RF 0.80, ANN 0.84, LSN 0.88.]

Figure 5.9: MMSE prediction ROC curves for AIBL replication cohort. Receiver operating characteristic curves for CA+CT input. Abbreviations are as follows. BL: baseline visit, CA: clinical attributes, CT: cortical thickness.

The stratification of performance based on the last available timepoint for a given subject yielded 605 and 510 subjects for the 18 to 36 month (near future) and 48 to 72 month (distant future) spans, respectively. The AUC results for the MMSE-based trajectories show that all models perform better when predicting trajectories for subjects with visits from the shorter timespan (up to 36 months). The predictive performance worsens for subjects with available timepoints between 48 and 72 months. This difference in performance is largest with CA input and smallest with CA+CT input. The 3-way accuracy results for the ADAS-13-based trajectories show a similar trend with CA input, as all models perform better for subjects with visits from the shorter timespan. However, with CT and CA+CT input, the amount of bias varies with the choice of model. Comparing models, LSN shows the least amount of predictive bias with respect to timepoint availability for both prediction tasks (MMSE and ADAS-13).

5.6.7 Computational specifications and training times

The computation was performed on an Intel(R) Xeon(R) CPU E5-1660 v3 @ 3.00GHz machine with 16 cores and a GeForce GTX TITAN X graphics card. The Scikit-learn (http://scikit-learn.org/stable/) and TensorFlow (version 0.12.1, https://www.tensorflow.org/) libraries were used to implement and evaluate the machine-learning models. The training times can vary based on the hyperparameter grid search. The average training times for a single model are as follows: 10 minutes for LSN, 3 minutes for ANN, and less than a minute for LR, SVM, and RF.

5.7 Discussion

In this manuscript, we aimed to build a framework for longitudinal analysis, comprising modeling and prediction tasks, that can be used for accurate clinical prognosis in AD. We proposed a novel data-driven method to model clinical trajectories based on clustering of clinical scores measuring cognitive performance, which were then used to assign prognostic labels to individuals. Subsequently, we demonstrated that longitudinal clinical and structural MR imaging data, along with genetic information, can be successfully combined to predict these trajectories using machine-learning techniques. Below, we highlight the clinical implications of the proposed work, followed by a discussion of the trajectory modeling, prediction, and replication tasks, comparisons to related work in the field, and the limitations.

5.7.1 Clinical Implications

A critical component of any computational work seeking to predict patient outcome is its clinical relevance. How can the proposed methodology be integrated into a clinical workflow? Is the use of expensive and potentially stressful tests worthwhile in a clinical context? How can we use change in clinical status and neuroanatomical progression in the short term to better predict long-term outcomes for patient populations? We envision two overarching clinical uses for the presented work. First, our trajectory modeling efforts provide a symptom-centric prognostic objective across the AD and prodromal populations. Second, the prediction model facilitates decision-making pertaining to the frequency and types of assessments that should be conducted in preclinical and prodromal individuals. If an individual is predicted to be in fast decline, this may lead a clinician to recommend more frequent assessments, additional MRI sessions, and preventative therapeutic interventions. Conversely, if an individual is deemed to be stable (in spite of an MCI diagnosis), this may lead a clinician to reduce the frequency of assessments. Another application would be assisting clinical trial recruitment based on the projected rate of decline. Pre-selection of individuals for clinical trials has become an important topic in AD given the recent challenges in drug development [283, 28, 307]. LSN can help identify suitable candidates who are at high risk of decline over the next few years.

The workflow shown in Fig. 5.4 provides a use case for motivating continual monitoring and adopting LSN in the clinic. We note that the stratification of the clinical spectrum is not a trivial problem [184], and is confounded by several demographic and genetic factors. Nonetheless, the example clinical scenario in Fig. 5.4 may help identify specific individuals whose prognosis may be improved by acquiring additional data (such as MRI or a follow-up timepoint). The results (see Figs. 5.5 to 5.7) suggest that the prognosis of subjects on the extreme ends of the spectrum, i.e. baseline edge-cases (BE), can be simply predicted from their CA features. However, the same features yield poor performance for follow-up edge-case (FE) subjects, who exhibit substantial change in clinical performance in the one-year period from baseline. This shows that models using baseline CA inputs simply predict continuation of the symptom severity seen at baseline. Inclusion of CT features does help compensate for some of this prediction bias, as seen by the improved performance with the CA+CT input at baseline. Lastly, cognitively consistent (CC) subjects, who perform in the mid-ranges of clinical performance at baseline and do not show substantial change within a year, are predicted poorly with CA input at baseline and with the inclusion of the second timepoint. This is expected since, unlike the BE and FE groups, clinical scores from two timepoints for the CC group do not point towards one trajectory or the other. Inclusion of MR features from both timepoints seems to be highly beneficial for these subjects, especially with the LSN model. These observations demonstrate the utility of short-term longitudinal MR input towards long-term clinical prediction.

5.7.2 Trajectory modeling

Of importance to our modeling work are the challenges addressed in the longitudinal clinical task definition. The modeling approach defines stable and declining trajectories over six years without the need for hard thresholds (e.g. a time window, a four-point change, etc.). Moreover, it also offers a solution for dealing with subjects with missing timepoints. The hierarchical clustering approach provides control over the desired level of trajectory specificity. The higher memory-related specificity of ADAS-13, coupled with its larger score range, allows us to resolve three trajectories (clusters), with further subdivision of the decline trajectory. The trajectory membership comparison between the MMSE and ADAS-13 scales (see Table 5.3) shows that a large number of subjects overlap in the stable and fast-declining groups. This is expected due to several commonalities between the two assessments. The non-overlapping cases could result from the assessment of specific symptom subdomains and the higher specificity offered by the three ADAS-13 trajectories over the two MMSE trajectories.
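As a minimal illustration of this control over trajectory specificity, the sketch below applies scipy hierarchical clustering to toy score trajectories and cuts the dendrogram at two and three clusters; it is not the exact procedure used here (which derives trajectory templates from complete-data subjects and then assigns the remaining subjects to them), and the data are synthetic.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy score trajectories (subjects x timepoints over roughly six years).
rng = np.random.default_rng(0)
scores = 28 + np.cumsum(rng.normal(-0.3, 1.0, size=(69, 7)), axis=1)

Z = linkage(scores, method="ward")                   # agglomerative clustering of trajectories
two_way = fcluster(Z, t=2, criterion="maxclust")     # e.g. stable vs. decline (MMSE-like)
three_way = fcluster(Z, t=3, criterion="maxclust")   # e.g. stable / slow / fast decline (ADAS-13-like)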

5.7.3 Trajectory prediction

Pertaining to prognostic performance, it is imperative to discuss 1) the prediction accuracy versus trajectory specificity trade-off and 2) the prediction gains offered by the additional information (MR modality, follow-up timepoint). The comparison between MMSE and ADAS-13 prediction confirms that the choice of three trajectories for ADAS-13 offers more prognostic specificity, with a lower mean accuracy for the 3-way prediction. While evaluating longitudinal clinical prediction, it is important to note the implicit dependency between the task outcomes (e.g. future clinical score prediction, diagnostic conversion, etc.) and baseline clinical scores. Seldom do studies disambiguate what can be achieved in longitudinal prediction tasks using only baseline clinical scores. Our results on the entire cohort (Figs. 5.5 to 5.7, subset=All) suggest that at baseline, for a large number of subjects, high prediction accuracy can be achieved solely with clinical inputs, with incremental gains from CT inputs. Interestingly, follow-up clinical information improves predictions over the multimodal performance at baseline. Lastly, the multimodal, two-timepoint input offers the best performance compared to all other inputs with LSN, underscoring the importance of model architecture for multimodal, longitudinal input. We attribute this performance to the design novelties that allow the model to learn feature embeddings that are more predictive of the clinical task compared to models that simply concatenate all the features into a single vector, as is often done [204, 170]. The siamese network with shared weights provides an effective way of combining temporally associated multivariate input to represent change patterns using two CT feature sets from the same individual. Subsequently, LSN modulates the anatomical change by APOE4 status and combines it with the clinical scores. We also note that LSN can be extended to incorporate high-dimensional voxel- or vertex-wise input as well; however, considering the performance offered by the ROI-based feature selection approach, we estimate marginal improvements at substantially higher computational cost.

5.7.4 AIBL analysis

The goal of the AIBL analysis was to go beyond standard K-fold cross-validation and evaluate model robustness with subjects from an independent study that was not used to train the model. The differences in study design and data collection, such as the 18-month interval between timepoints and the last visit at the 54-month mark, do introduce a few issues. First, although the flexibility of the trajectory modeling approach allows trajectory membership assignment with any number of available timepoints, the shorter study span introduces bias compared to ADNI subjects, who typically have more available timepoints spread over a larger timespan. Consequently, we see a highly skewed trajectory membership assignment, with 99 stable and 18 declining subjects. As a result, the accuracy and confusion matrix results show that the reference models are skewed towards prediction of the majority class with both baseline and two-timepoint inputs. In particular, with the inclusion of CT features, the two-timepoint LR and SVM models seem to predict only the majority class. This can potentially be attributed to a feature standardization issue: different datasets have different MR feature distributions, which can make the application of scaling factors learned on one dataset (ADNI) to the other dataset (AIBL) problematic, especially for subject-level prediction tasks. The predictions with CA input from two timepoints with the 18-month interval are highly accurate, diminishing the need for MR features. This again can be attributed to the shorter timespan availability (near future) during the trajectory assignment for the AIBL subjects. We speculate that with the availability of more timepoints (distant future), LSN would see a boost in performance. Nevertheless, from a model stability perspective, LSN offers the best AUC (0.883) with CA+CT input from two timepoints, validating its robustness with multimodal, longitudinal input data types.

5.7.5 Effect of trajectory modeling on predictive performance

To better understand performance, it is critical to understand the impact of the number of timepoints available for establishing the ground-truth classification. Specifically, for MMSE-based predictions (see Fig. S5.2), the models provide improved performance for individual trajectories over the near future (36 months) as opposed to long-term predictions (72 months). There is variable bias shown by the models with CT and CA+CT inputs for the ADAS-13-based trajectories (see Fig. S5.3). This could be attributed to the different performance metrics (accuracy vs. AUC) and the 3-way classification task as opposed to the binary MMSE task. LSN shows the least amount of predictive bias, offering the best results for short-term and long-term predictions for both MMSE and ADAS-13. This could potentially be due to the set of features extracted by the LSN model, which captures not only the pronounced short-term markers but also the subtle changes that are indicative of long-term clinical progression.

5.7.6 Comparison with related work

Past studies [323, 371] propose different methods for trajectory modeling and report varying numbers of trajectories. Small et al. construct a two-group quadratic growth mixture model to describe the progression of MMSE scores over a six-year period. Wilkosz et al. used a latent class mixture model with quadratic trajectories to model cognitive and psychotic symptoms over 13.5 years; the results indicated the presence of six trajectory courses. The more popular, diagnostic-change-based trajectory models usually define only two classes, such as AD converters vs. non-converters or, more specifically in the context of the MCI population, stable vs. progressive MCI groups [86, 252, 170]. The two-group model is computationally convenient; hence the nonparametric modeling approach presented here starts with two trajectories, but can be easily extended to incorporate more trajectories, as shown by the 3-class ADAS-13 trajectory definitions.

Pertaining to prediction tasks, it is not trivial to compare LSN performance with previous studies due to differences in task definitions and input data types. The authors of [86, 252, 170] have provided a comparative overview of studies tackling AD conversion tasks. In these compared studies, the conversion times under consideration range from 18 to 54 months, and the reported AUCs from 0.70 to 0.902, with higher performance obtained by studies using a combination of MR and cognitive features. The best performing study among these [252] proposes a two-step approach in which MR features are first learned via a semi-supervised low density separation technique and then combined with cognitive measures using a random forest classifier. The authors report an AUC of 0.882 with cognitive measures, 0.799 with MR features, and 0.902 with aggregate data on 264 ADNI1 subjects. Another recent study [207] presents a probabilistic multiple kernel learning (pMKL) classifier with input features comprising a variety of risk factors, cognitive and functional assessments, structural MR imaging data, and plasma proteomic data. The authors report an AUC of 0.83 with cognitive assessments, 0.76 with MR data, and 0.87 with multi-source data on 259 ADNI1 subjects.

A few studies have also explored longitudinal data for predictive tasks. The authors of [389] use a sparse linear regression model on imaging data and clinical scores (ADAS and MMSE), with group regularization for longitudinal feature extraction, which is fed into an SVM classifier. The authors report an AUC of 0.670 with cognitive scores, 0.697 with MR features, and 0.768 with the proposed longitudinal multimodal classifier on 88 ADNI1 subjects; they also report an AUC of 0.745 using baseline multimodal data. Another recent study [170] uses a hierarchical classification framework for selecting longitudinal features and reports an AUC of 0.754 with baseline features and 0.812 with longitudinal features solely derived from MR data from 131 ADNI1 subjects. Despite the differences in task definitions, sample sizes, etc., we note a couple of trends. First, as mentioned earlier, due to the implicit dependency between the task definition and cognitive assessments, we see a strong contribution of clinical scores to the predictive performance in the larger cohorts. Second, the longitudinal studies show promising results, with performance gains from the added timepoint, further motivating models that can handle both multimodal and longitudinal data.

5.7.7 Limitations

In this work, we used 69 LMCI subjects for trajectory modeling based on the availability of complete clinical data. Ideally, this cohort could be expanded in the future to better represent the clinical heterogeneity in the at-risk population, as more longitudinal data of subjects across the pre-AD spectrum become available. This would help improve the specificity of the clinical progression patterns used as trajectory templates. The proposed LSN model is designed to leverage data from two timepoints. Therefore, the availability of longitudinal CT and CA data along with APOE4 status is essential for prediction. Although we offer solutions to mitigate issues with missing timepoints (i.e. flexibility in the choice of baseline and follow-up visits for LSN), addressing scenarios with a missing modality (i.e. APOE4 status) is beyond the scope of this work. We acknowledge that missing data is an important open challenge in the field that affects the sample size of subjects under consideration, making it harder to train and test models. Potentially, one could employ flexible model architectures that aggregate predictions from the available modalities independently or via a Bayesian framework. However, we defer investigation of these strategies to future work. Further, in this work we compared LSN against standard implementations of the reference models. We note that modifications can be made to the reference models to incorporate longitudinal input, instead of simple concatenation of variables, through regularization and feature selection techniques [389, 170]. However, such modifications are not trivial, and the flexibility offered by the reference models is limited in comparison with ANNs. Lastly, the hierarchical architecture (multiple hidden layers) poses challenges in interpreting the learned features from the CT data. Inferring the most predictive cortical regions from the distance embedding learned by a siamese network is not a trivial operation. Computationally, the weights in the higher layers that learn robust combinatorial features cannot be uniquely mapped back onto the input (i.e. cortical regions). Moreover, from a biological perspective, if there is significant heterogeneity in the spatial distribution of atrophy patterns, as observed in MCI and AD, then the presence of underlying neuroanatomical subtypes relating to disparate atrophy patterns is quite plausible. In such a scenario, a frequentist ranking of ROI contributions, even if computable, would not be accurate, as it would average the feature importance across subtypes. These limitations are common trade-offs encountered with multivariate, ANN-based ML approaches in order to gain more predictive power. We plan to address these issues in future work as suitable techniques become available for neuroimaging data.

5.7.8 Conclusions

In summary, we presented a longitudinal framework that provides a data-driven, flexible way of modeling and predicting clinical trajectories. We introduced a novel LSN model that combines clinical and MR data from two timepoints and provides state-of-the-art predictive performance. We demonstrated the robustness of the model via successful cross-validation using three different ADNI cohorts with varying data acquisition protocols and scanner resolutions. We also verified the generalizability of LSN on a replication AIBL dataset. Lastly, we provided an example use case that could further help clinicians identify subjects who would benefit the most from LSN model predictions. We believe this work will further motivate the exploration of multimodal, longitudinal models that would improve prognostic predictions and patient care in AD.

5.8 Acknowledgements

The authors thank Dr. Christopher Honey, Dr. Richard Zemel, and Dr. Jo Knight for helpful feedback. We also thank Gabriel A. Devenyi and Sejal Patel for assistance with MR image preprocessing and quality control.

S5 Supplementary Material

S5.1 Subject Lists

ADNI (Trajectory Modeling)

023_S_0042, 136_S_0107, 123_S_0108, 127_S_0112, 027_S_0116, 023_S_0126, 018_S_0142, 098_S_0160, 010_S_0161, 021_S_0178, 032_S_0214, 027_S_0256, 021_S_0276, 130_S_0285, 130_S_0289, 035_S_0292, 031_S_0294, 027_S_0307, 023_S_0331, 116_S_0361, 131_S_0384, 027_S_0408, 037_S_0539, 005_S_0546, 027_S_0644, 016_S_0702, 126_S_0709, 002_S_0729, 068_S_0802, 027_S_0835, 033_S_0906, 053_S_0919, 033_S_0922, 052_S_0952, 032_S_0978, 137_S_0994, 127_S_1032, 003_S_1074, 037_S_1078, 022_S_1097, 003_S_1122, 006_S_1130, 126_S_1187, 116_S_1243, 129_S_1246, 057_S_1269, 123_S_1300, 052_S_1346, 137_S_1414, 041_S_1418, 127_S_1427, 021_S_0626, 014_S_0169, 023_S_0217, 014_S_0563, 005_S_0572, 014_S_0658, 137_S_0668, 137_S_0722, 116_S_0752, 137_S_0800, 029_S_0914, 027_S_1045, 031_S_1066, 002_S_1155, 002_S_1268, 029_S_1318, 016_S_1326, 052_S_1352

ADNI (Trajectory Prediction)

002_S_0295,002_S_0413,002_S_0619,002_S_0685,002_S_0782,002_S_0816,002_S_0938,002_S_0954,002 _S_1018,002_S_1070,002_S_1261,002_S_1280,002_S_2010,002_S_2073,002_S_4171,002_S_4213,002 _S_4219,002_S_4225,002_S_4229,002_S_4237,002_S_4262,002_S_4270,002_S_4447,002_S_4473,002 _S_4521,002_S_4654,002_S_4746,002_S_4799,002_S_5018,002_S_5178,002_S_5230,002_S_5256,003 _S_0907,003_S_0981,003_S_1057,003_S_2374,003_S_4119,003_S_4136,003_S_4288,003_S_4350,003 _S_4354,003_S_4441,003_S_4555,003_S_4644,003_S_4872,003_S_4892,003_S_4900,003_S_5130,005 _S_0221,005_S_0222,005_S_0223,005_S_0324,005_S_0448,005_S_0553,005_S_0602,005_S_0610,005 _S_0814,005_S_1224,005_S_1341,005_S_2390,005_S_4168,005_S_4185,005_S_4707,005_S_5038,006 _S_0498,006_S_0547,006_S_0675,006_S_0681,006_S_0731,006_S_4150,006_S_4153,006_S_4192,006 _S_4346,006_S_4357,006_S_4363,006_S_4485,006_S_4515,006_S_4679,006_S_4713,006_S_4960,007 _S_0041,007_S_0068,007_S_0070,007_S_0101,007_S_0128,007_S_0249,007_S_0293,007_S_0316,007 _S_0414,007_S_0698,007_S_1206,007_S_1222,007_S_2394,007_S_4272,007_S_4387,007_S_4467,007 _S_4488,007_S_4516,007_S_4568,007_S_4611,007_S_4620,007_S_4637,007_S_4911,009_S_0842,009 _S_0862,009_S_1030,009_S_2208,009_S_2381,009_S_4324,009_S_4337,009_S_4359,009_S_4388,009 _S_4530,009_S_4543,009_S_4612,009_S_4741,009_S_4814,009_S_4958,009_S_5000,009_S_5027,009 _S_5037,009_S_5125,010_S_0067,010_S_0419,010_S_0420,010_S_0422,010_S_0472,010_S_0786,010 _S_0829,010_S_0904,010_S_4345,010_S_4442,011_S_0016,011_S_0021,011_S_0022,011_S_0023,011 _S_0053,011_S_0183,011_S_0241,011_S_0326,011_S_0362,011_S_0856,011_S_0861,011_S_1080,011 _S_1282,011_S_2274,011_S_4075,011_S_4105,011_S_4120,011_S_4222,011_S_4235,011_S_4278,011 _S_4366,011_S_4547,011_S_4827,011_S_4893,011_S_4912,012_S_0634,012_S_0637,012_S_0689,012 _S_0712,012_S_0720,012_S_0803,012_S_0932,012_S_1009,012_S_1033,012_S_1133,012_S_1165,012 _S_1292,012_S_1321,012_S_4012,012_S_4026,012_S_4094,012_S_4128,012_S_4188,012_S_4545,012 _S_4643,012_S_4849,012_S_4987,013_S_0240,013_S_0325,013_S_0502,013_S_0575,013_S_0860,013 _S_1035,013_S_1120,013_S_1186,013_S_1205,013_S_1275,013_S_4268,013_S_4395,013_S_4579,013 _S_4580,013_S_4595,013_S_4616,013_S_4791,013_S_4917,013_S_4985,013_S_5137,014_S_0328,014 _S_0519,014_S_0520,014_S_0548,014_S_0557,014_S_0558,014_S_1095,014_S_2185,014_S_4039,014 _S_4058,014_S_4079,014_S_4080,014_S_4263,014_S_4328,014_S_4401,014_S_4576,014_S_4577,016 _S_0354,016_S_0359,016_S_0538,016_S_0991,016_S_1028,016_S_1117,016_S_1121,016_S_2007,016 _S_2031,016_S_4009,016_S_4121,016_S_4584,016_S_4591,016_S_4638,016_S_4646,016_S_4902,016 _S_4951,016_S_4952,016_S_5007,016_S_5031,016_S_5057,018_S_0055,018_S_0080,018_S_0087,018 Chapter 5. Project 3: Prognosis in AD 126

_S_0155,018_S_0286,018_S_0335,018_S_0369,018_S_0425,018_S_0450,018_S_0633,018_S_2133,018 _S_2155,018_S_2180,018_S_4257,018_S_4313,018_S_4349,018_S_4399,018_S_4400,018_S_4597,018 _S_4696,018_S_4809,018_S_4868,018_S_4889,019_S_4252,019_S_4285,019_S_4293,019_S_4367,019 _S_4477,019_S_4548,019_S_4549,019_S_4680,019_S_4835,019_S_5019,019_S_5242,020_S_0097,020 _S_0213,020_S_0883,020_S_0899,020_S_1288,020_S_4920,021_S_0141,021_S_0159,021_S_0231,021 _S_0273,021_S_0332,021_S_0337,021_S_0343,021_S_0424,021_S_0642,021_S_0647,021_S_0753,021 _S_0984,021_S_1109,021_S_2077,021_S_2100,021_S_2125,021_S_2142,021_S_4245,021_S_4659,021 _S_4718,022_S_0066,022_S_0096,022_S_0129,022_S_0544,022_S_0750,022_S_0961,022_S_1351,022 _S_4173,022_S_4196,022_S_4266,022_S_4291,022_S_4320,022_S_4444,022_S_4805,022_S_4922,022 _S_5004,023_S_0030,023_S_0031,023_S_0058,023_S_0061,023_S_0078,023_S_0081,023_S_0083,023 _S_0084,023_S_0139,023_S_0376,023_S_0604,023_S_0625,023_S_0855,023_S_0926,023_S_0963,023 _S_1046,023_S_1126,023_S_1247,023_S_1262,023_S_4020,023_S_4035,023_S_4115,023_S_4122,023 _S_4164,023_S_4243,023_S_4448,023_S_4501,023_S_4502,023_S_4796,024_S_0985,024_S_1171,024 _S_1393,024_S_2239,024_S_4084,024_S_4158,024_S_4169,024_S_4223,024_S_4280,024_S_4392,024 _S_4674,024_S_4905,024_S_5054,027_S_0074,027_S_0120,027_S_0179,027_S_0403,027_S_0404,027 _S_0417,027_S_0461,027_S_0485,027_S_0850,027_S_1081,027_S_1082,027_S_1213,027_S_1254,027 _S_1277,027_S_1387,027_S_2183,027_S_2219,027_S_2245,027_S_2336,027_S_4729,027_S_4757,027 _S_4802,027_S_4804,027_S_4869,027_S_4873,027_S_4919,027_S_4926,027_S_4936,027_S_4938,027 _S_4955,027_S_4962,027_S_4964,027_S_4966,027_S_5079,027_S_5083,027_S_5093,027_S_5109,027 _S_5110,027_S_5118,027_S_5127,029_S_0824,029_S_0836,029_S_0843,029_S_0845,029_S_0866,029 _S_0878,029_S_0999,029_S_1056,029_S_1384,029_S_2376,029_S_2395,029_S_4279,029_S_4290,029 _S_4327,029_S_4384,029_S_4385,029_S_4585,029_S_4652,031_S_0321,031_S_0351,031_S_0554,031 _S_0568,031_S_0618,031_S_0830,031_S_1209,031_S_2018,031_S_2022,031_S_2233,031_S_4005,031 _S_4021,031_S_4024,031_S_4029,031_S_4032,031_S_4042,031_S_4149,031_S_4194,031_S_4203,031 _S_4218,031_S_4496,031_S_4721,032_S_0095,032_S_0187,032_S_0479,032_S_0677,032_S_0718,032 _S_1101,032_S_1169,032_S_5289,033_S_0511,033_S_0513,033_S_0514,033_S_0516,033_S_0567,033 _S_0723,033_S_0724,033_S_0725,033_S_0733,033_S_0734,033_S_0739,033_S_0741,033_S_0923,033 _S_1281,033_S_1284,033_S_1285,033_S_1308,033_S_1309,033_S_4176,033_S_4177,033_S_4179,033 _S_4508,033_S_5087,035_S_0033,035_S_0048,035_S_0156,035_S_0204,035_S_0341,035_S_0555,035 _S_0997,035_S_2061,035_S_2074,035_S_4082,035_S_4114,035_S_4256,035_S_4414,035_S_4464,035 _S_4582,035_S_4784,036_S_0576,036_S_0577,036_S_0656,036_S_0672,036_S_0673,036_S_0748,036 _S_0759,036_S_0760,036_S_0813,036_S_0869,036_S_0945,036_S_0976,036_S_1001,036_S_1135,036 _S_1240,036_S_2378,036_S_2380,036_S_4389,036_S_4430,036_S_4491,036_S_4538,036_S_4562,036 _S_4714,036_S_4715,036_S_4736,036_S_4878,036_S_4894,036_S_4899,037_S_0150,037_S_0303,037 _S_0327,037_S_0454,037_S_0467,037_S_0501,037_S_0552,037_S_0566,037_S_0588,037_S_0627,037 _S_1225,037_S_1421,037_S_4001,037_S_4015,037_S_4028,037_S_4030,037_S_4071,037_S_4146,037 _S_4214,041_S_0125,041_S_0282,041_S_0446,041_S_0549,041_S_1002,041_S_1010,041_S_1260,041 _S_1368,041_S_1412,041_S_1425,041_S_4138,041_S_4271,041_S_4510,041_S_4874,041_S_4876,041 _S_4989,041_S_5026,041_S_5078,041_S_5082,041_S_5097,041_S_5100,041_S_5131,041_S_5141,051 
_S_1040,051_S_1072,051_S_1123,051_S_1131,051_S_1296,051_S_1331,051_S_4929,051_S_4980,051 _S_5005,052_S_0671,052_S_0951,052_S_1054,052_S_1168,052_S_1250,052_S_1251,052_S_4626,052 _S_4807,052_S_4885,052_S_4944,052_S_4945,053_S_0389,053_S_0507,053_S_1044,053_S_2357,053 _S_2396,053_S_4557,053_S_4578,053_S_4661,053_S_4813,053_S_5070,057_S_0464,057_S_0474,057 _S_0643,057_S_0779,057_S_0818,057_S_0839,057_S_0934,057_S_0941,057_S_1007,057_S_1217,057 _S_1265,057_S_1373,057_S_2398,057_S_4888,057_S_4897,062_S_0578,062_S_0690,062_S_0730,062 _S_0768,062_S_1099,062_S_1182,062_S_1299,067_S_0019,067_S_0029,067_S_0038,067_S_0056,067 _S_0059,067_S_0076,067_S_0077,067_S_0098,067_S_0176,067_S_0177,067_S_0284,067_S_0290,067 Chapter 5. Project 3: Prognosis in AD 127

_S_0336,067_S_0607,067_S_2195,067_S_2196,067_S_2301,067_S_2304,067_S_4054,067_S_4072,067 _S_4184,067_S_4212,067_S_4310,067_S_4767,067_S_4782,067_S_4918,068_S_0109,068_S_0127,068 _S_0210,068_S_0442,068_S_0473,068_S_0872,068_S_2187,068_S_2248,068_S_4061,068_S_4067,068 _S_4134,068_S_4174,068_S_4217,068_S_4340,068_S_4424,070_S_5040,072_S_0315,072_S_2027,072 _S_2037,072_S_2072,072_S_2083,072_S_2093,072_S_2116,072_S_2164,072_S_4007,072_S_4057,072 _S_4063,072_S_4102,072_S_4103,072_S_4131,072_S_4206,072_S_4226,072_S_4383,072_S_4390,072 _S_4391,072_S_4394,072_S_4445,072_S_4462,072_S_4465,072_S_4522,072_S_4539,072_S_4610,072 _S_4613,072_S_4769,072_S_4871,072_S_4941,072_S_5207,073_S_0089,073_S_0311,073_S_0312,073 _S_0386,073_S_0518,073_S_0746,073_S_0909,073_S_2153,073_S_2182,073_S_2190,073_S_2191,073 _S_2225,073_S_2264,073_S_4155,073_S_4216,073_S_4259,073_S_4300,073_S_4311,073_S_4312,073 _S_4360,073_S_4382,073_S_4393,073_S_4443,073_S_4552,073_S_4559,073_S_4614,073_S_4762,073 _S_4777,073_S_4825,073_S_5023,082_S_0928,082_S_1119,082_S_1256,082_S_1377,082_S_2099,082 _S_2121,082_S_2307,082_S_4090,082_S_4208,082_S_4224,082_S_4244,082_S_4339,082_S_4428,082 _S_5014,082_S_5029,094_S_0434,094_S_0526,094_S_0531,094_S_0692,094_S_0711,094_S_0921,094 _S_1027,094_S_1090,094_S_1164,094_S_1188,094_S_1241,094_S_1267,094_S_1293,094_S_1417,094 _S_2201,094_S_2216,094_S_2238,094_S_2367,094_S_4089,094_S_4234,094_S_4434,094_S_4503,094 _S_4560,094_S_4649,098_S_0171,098_S_0172,098_S_0269,098_S_0896,098_S_2047,098_S_2079,098 _S_4003,098_S_4018,098_S_4215,098_S_4275,098_S_4506,099_S_0040,099_S_0051,099_S_0054,099 _S_0060,099_S_0090,099_S_0111,099_S_0291,099_S_0352,099_S_0372,099_S_0470,099_S_0533,099 _S_0534,099_S_0551,099_S_0880,099_S_1034,099_S_1144,099_S_2042,099_S_2063,099_S_2146,099 _S_2205,099_S_4076,099_S_4086,099_S_4104,099_S_4157,099_S_4202,099_S_4205,099_S_4463,099 _S_4475,099_S_4480,099_S_4498,099_S_4565,100_S_0015,100_S_0035,100_S_0047,100_S_0069,100 _S_0190,100_S_0296,100_S_0995,100_S_4469,100_S_4512,100_S_4556,100_S_5096,109_S_0950,109 _S_0967,109_S_1014,109_S_1114,109_S_1157,109_S_1183,109_S_1343,109_S_2200,109_S_4380,109 _S_4455,109_S_4499,109_S_4531,109_S_4594,114_S_0166,114_S_0173,114_S_0374,114_S_0378,114 _S_0410,114_S_0416,114_S_0458,114_S_0601,114_S_0979,114_S_1103,114_S_1106,114_S_1118,114 _S_2392,114_S_4404,114_S_5047,116_S_0370,116_S_0382,116_S_0392,116_S_0487,116_S_0648,116 _S_0649,116_S_0657,116_S_1232,116_S_1249,116_S_1271,116_S_1315,116_S_4010,116_S_4043,116 _S_4092,116_S_4167,116_S_4175,116_S_4195,116_S_4199,116_S_4338,116_S_4453,116_S_4483,116 _S_4625,116_S_4635,116_S_4855,116_S_4898,121_S_1322,123_S_0050,123_S_0072,123_S_0088,123 _S_0091,123_S_0106,123_S_0113,123_S_0162,123_S_0390,123_S_2055,123_S_2363,123_S_4096,123 _S_4127,123_S_4170,123_S_4526,123_S_4780,123_S_4806,126_S_0605,126_S_0606,126_S_0680,126 _S_0708,126_S_0865,126_S_0891,126_S_1221,126_S_2360,126_S_2405,126_S_2407,126_S_4458,126 _S_4494,126_S_4507,126_S_4514,126_S_4675,126_S_4712,126_S_4743,126_S_4891,126_S_4896,127 _S_0259,127_S_0260,127_S_0393,127_S_0394,127_S_0431,127_S_0684,127_S_0754,127_S_0844,127 _S_1140,127_S_1419,127_S_2213,127_S_2234,127_S_4148,127_S_4197,127_S_4198,127_S_4210,127 _S_4240,127_S_4301,127_S_4500,127_S_4604,127_S_4624,127_S_4645,127_S_4844,127_S_4928,127 _S_4940,127_S_5058,127_S_5067,127_S_5095,127_S_5132,128_S_0200,128_S_0230,128_S_0863,128 _S_0947,128_S_1043,128_S_1088,128_S_1148,128_S_1242,128_S_1407,128_S_1408,128_S_2002,128 
_S_2036,128_S_2045,128_S_2123,128_S_2130,128_S_2151,128_S_2220,128_S_4553,128_S_4571,128 _S_4586,128_S_4607,128_S_4636,128_S_4653,128_S_4742,128_S_4772,128_S_4832,128_S_4842,128 _S_5066,129_S_0778,129_S_2332,129_S_4073,129_S_4220,129_S_4287,129_S_4369,129_S_4371,129 _S_4396,129_S_4422,130_S_0102,130_S_0232,130_S_0423,130_S_0505,130_S_0783,130_S_0886,130 _S_0956,130_S_0969,130_S_1200,130_S_1201,130_S_1290,130_S_1337,130_S_2391,130_S_2403,130 _S_4250,130_S_4294,130_S_4343,130_S_4352,130_S_4405,130_S_4415,130_S_4417,130_S_4468,130 _S_4542,130_S_4605,130_S_4641,130_S_4660,130_S_4817,130_S_4883,130_S_4925,130_S_4982,130 _S_4984,130_S_4990,130_S_5006,130_S_5142,131_S_0123,131_S_0441,131_S_1389,132_S_0987,133 Chapter 5. Project 3: Prognosis in AD 128

_S_0433,133_S_0488,133_S_0525,133_S_0629,133_S_0638,133_S_0727,133_S_0771,133_S_0792,133 _S_0912,133_S_0913,133_S_1170,135_S_4281,135_S_4309,135_S_4356,135_S_4406,135_S_4446,135 _S_4489,135_S_4566,135_S_4598,135_S_4657,135_S_4676,135_S_4689,135_S_4722,135_S_4723,135 _S_4863,135_S_5015,136_S_0086,136_S_0184,136_S_0186,136_S_0194,136_S_0195,136_S_0196,136 _S_0299,136_S_0426,136_S_0429,136_S_0579,136_S_0695,136_S_0873,136_S_0874,136_S_1227,136 _S_4189,136_S_4269,137_S_0158,137_S_0283,137_S_0301,137_S_0366,137_S_0443,137_S_0459,137 _S_0481,137_S_0631,137_S_0669,137_S_0686,137_S_0796,137_S_0825,137_S_0972,137_S_0973,137 _S_4211,137_S_4258,137_S_4299,137_S_4303,137_S_4331,137_S_4351,137_S_4466,137_S_4482,137 _S_4520,137_S_4536,137_S_4587,137_S_4596,137_S_4623,137_S_4631,137_S_4632,137_S_4672,137 _S_4678,137_S_4815,137_S_4816,137_S_4852,141_S_0696,141_S_0717,141_S_0767,141_S_0790,141 _S_0810,141_S_0851,141_S_0852,141_S_0853,141_S_0915,141_S_0982,141_S_1004,141_S_1052,141 _S_1094,141_S_1137,141_S_1152,141_S_1245,141_S_1255,141_S_1378,141_S_4053,141_S_4160,141 _S_4232,141_S_4426,153_S_2109,153_S_2148,153_S_4077,153_S_4125,153_S_4133,153_S_4151,153 _S_4159,153_S_4172,153_S_4372,153_S_4621,153_S_4838,941_S_1194,941_S_1197,941_S_1202,941 _S_1311,941_S_4036,941_S_4100,941_S_4187,941_S_4255,941_S_4292,941_S_4365,941_S_4376,941 _S_4420,941_S_4764,941_S_5124

AIBL

14, 17, 18, 20, 21, 22, 23, 26, 27, 28, 29, 31, 33, 38, 39, 43, 44, 50, 52, 55, 56, 57, 61, 62, 68, 75, 80, 88, 90, 111, 117, 118, 121, 126, 152, 153, 156, 186, 194, 212, 229, 232, 236, 244, 273, 277, 284, 287, 294, 310, 314, 315, 317, 331, 335, 354, 362, 365, 367, 380, 388, 390, 406, 471, 480, 482, 483, 509, 516, 518, 527, 528, 550, 551, 557, 570, 571, 572, 573, 588, 605, 609, 666, 681, 696, 697, 698, 699, 707, 716, 721, 722, 736, 737, 740, 757, 771, 784, 796, 798, 808, 814, 827, 851, 891, 904, 914, 931, 938, 942, 945, 1050, 1067, 1146, 1147, 1153, 1157

S5.2 Hyperparameter search

The internal validation procedure searches through a reasonable set of hyperparameter permutations to decide the optimal model configuration, balancing accuracy and generalizability. Below are the hyperparameter search spaces for each model used in the analysis.

Model: hyperparameter search space

LR: C: [1e-3, 5e-2, 1e-2, 5e-1, 1e-1, 1, 1e1, 1e2]

SVM: kernel: [linear, rbf]; C: [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1]

RF: n_estimators: [10, 50, 75, 100, 150]; min_samples_split: [2, 4, 8]

ANN: n_layers: [2, 3, 4]; n_hidden_nodes (per layer): [5, 10, 25, 50, 100]; dropout: [0, 0.1, 0.2]; learning_rate: [1e-2, 1e-3, 1e-4]

LSN:
  Siamese network: number of hidden layers: 4 per branch (fixed for all folds); number of nodes per layer: [25, 50]; distance embedding output nodes: [10, 20]
  Multiplicative module: number of hidden layers: 1 (fixed for all folds); number of hidden nodes: 1 (fixed for all folds)
  Concatenation and prediction: number of hidden layers: 1 (fixed for all folds); number of nodes: [10, 20]

Table S5.1: Hyperparameter grid search for Longitudinal siamese network (LSN)
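For reference, the search spaces above for the scikit-learn models translate directly into parameter grids; the sketch below mirrors Table S5.1 under the assumption that LR uses an L1 (Lasso) penalty as described in Section 5.5.6, while the ANN and LSN architecture searches (layer counts, node counts, dropout, learning rate) were run separately and are omitted here.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grids mirroring Table S5.1 for the scikit-learn reference models.
grids = {
    "LR": (LogisticRegression(penalty="l1", solver="liblinear"),
           {"C": [1e-3, 5e-2, 1e-2, 5e-1, 1e-1, 1, 1e1, 1e2]}),
    "SVM": (SVC(probability=True),
            {"kernel": ["linear", "rbf"], "C": [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1]}),
    "RF": (RandomForestClassifier(),
           {"n_estimators": [10, 50, 75, 100, 150], "min_samples_split": [2, 4, 8]}),
}

searches = {name: GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
            for name, (estimator, grid) in grids.items()}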

S5.3 Clinical score distributions

Figure S5.1 shows the clinical score distributions of the subjects used in the analysis at the two timepoints, separated by trajectory membership. The substantial overlap between the distributions makes it difficult to differentiate between trajectories solely based on scores. The comparison between the MMSE and ADAS-13 scales shows that the larger score range of the ADAS-13 scale allows modeling of symptom progression with higher specificity, providing slow- and fast-decline trajectories (see Fig. S5.1).

Figure S5.1: Clinical score distributions of different trajectories at two timepoints. Note: for subjects who are missing the 12-month timepoint, 6-month scores are used instead.

S5.4 Prediction performance results

Below are the results for the two scales (MMSE, ADAS-13), two timepoint configurations (baseline, baseline + follow-up), and three feature sets (CA, CT, CA+CT). Moreover, the results are provided for all subjects and for the cognitively consistent (CC) group based on the example clinical workflow. Note that only the BL+follow-up, CA+CT input is applicable for LSN. Performance metrics include: accuracy (Acc), area under the ROC curve (AUC), and confusion matrix (CM).

All Subjects, MMSE, CA input: Baseline (Acc, AUC, CM) and Baseline + Follow-up (Acc, AUC, CM)
LR: Baseline: Acc 0.84 (0.01), AUC 0.90 (0.01), CM [0.88 0.12; 0.21 0.79]. Baseline + Follow-up: Acc 0.90 (0.01), AUC 0.96 (0.01), CM [0.926 0.074; 0.138 0.862].
SVM: Baseline: Acc 0.84 (0.01), AUC 0.91 (0.01), CM [0.87 0.13; 0.19 0.81]. Baseline + Follow-up: Acc 0.90 (0.01), AUC 0.96 (0.01), CM [0.931 0.069; 0.157 0.843].
RF: Baseline: Acc 0.83 (0.01), AUC 0.90 (0.01), CM [0.87 0.13; 0.23 0.77]. Baseline + Follow-up: Acc 0.89 (0.01), AUC 0.95 (0.01), CM [0.913 0.087; 0.148 0.852].
ANN: Baseline: Acc 0.84 (0.01), AUC 0.91 (0.01), CM [0.88 0.12; 0.2 0.8]. Baseline + Follow-up: Acc 0.91 (0.01), AUC 0.97 (0.01), CM [0.907 0.093; 0.093 0.907].

Table S5.2: Predictive performance: All subjects, MMSE scale, CA input

Cognitively Consistent (CC) Group, MMSE, CA input: Baseline (Acc, AUC, CM) and Baseline + Follow-up (Acc, AUC, CM)
LR: Baseline: Acc 0.79 (0.02), AUC 0.85 (0.02), CM [0.834 0.166; 0.228 0.772]. Baseline + Follow-up: Acc 0.84 (0.02), AUC 0.89 (0.02), CM [0.869 0.131; 0.158 0.842].
SVM: Baseline: Acc 0.80 (0.02), AUC 0.88 (0.01), CM [0.807 0.193; 0.197 0.803]. Baseline + Follow-up: Acc 0.82 (0.02), AUC 0.89 (0.02), CM [0.872 0.128; 0.187 0.813].
RF: Baseline: Acc 0.81 (0.03), AUC 0.90 (0.02), CM [0.808 0.192; 0.204 0.796]. Baseline + Follow-up: Acc 0.81 (0.03), AUC 0.89 (0.02), CM [0.829 0.171; 0.161 0.839].
ANN: Baseline: Acc 0.80 (0.02), AUC 0.87 (0.02), CM [0.827 0.173; 0.207 0.793]. Baseline + Follow-up: Acc 0.831 (0.03), AUC 0.90 (0.02), CM [0.803 0.197; 0.114 0.886].

Table S5.3: Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA input

All Subjects, MMSE, CT input: Baseline (Acc, AUC, CM) and Baseline + Follow-up (Acc, AUC, CM)
LR: Baseline: Acc 0.76 (0.01), AUC 0.82 (0.02), CM [0.824 0.176; 0.321 0.679]. Baseline + Follow-up: Acc 0.77 (0.02), AUC 0.83 (0.02), CM [0.83 0.17; 0.31 0.69].
SVM: Baseline: Acc 0.77 (0.01), AUC 0.83 (0.01), CM [0.83 0.17; 0.313 0.687]. Baseline + Follow-up: Acc 0.76 (0.01), AUC 0.82 (0.01), CM [0.829 0.171; 0.317 0.683].
RF: Baseline: Acc 0.75 (0.01), AUC 0.81 (0.01), CM [0.761 0.239; 0.276 0.724]. Baseline + Follow-up: Acc 0.76 (0.01), AUC 0.82 (0.01), CM [0.781 0.219; 0.28 0.72].
ANN: Baseline: Acc 0.75 (0.02), AUC 0.81 (0.01), CM [0.829 0.171; 0.333 0.667]. Baseline + Follow-up: Acc 0.75 (0.01), AUC 0.83 (0.02), CM [0.833 0.167; 0.334 0.666].

Table S5.4: Predictive performance: All subjects, MMSE scale, CT input

Cognitively Consistent (CC) Group, MMSE, CT input: Baseline (Acc, AUC, CM) and Baseline + Follow-up (Acc, AUC, CM)
LR: Baseline: Acc 0.73 (0.03), AUC 0.77 (0.04), CM [0.671 0.329; 0.188 0.812]. Baseline + Follow-up: Acc 0.70 (0.03), AUC 0.74 (0.03), CM [0.652 0.348; 0.216 0.784].
SVM: Baseline: Acc 0.75 (0.02), AUC 0.77 (0.04), CM [0.698 0.302; 0.179 0.821]. Baseline + Follow-up: Acc 0.70 (0.02), AUC 0.73 (0.03), CM [0.639 0.361; 0.211 0.789].
RF: Baseline: Acc 0.65 (0.03), AUC 0.74 (0.04), CM [0.581 0.419; 0.199 0.801]. Baseline + Follow-up: Acc 0.64 (0.03), AUC 0.74 (0.03), CM [0.574 0.426; 0.229 0.771].
ANN: Baseline: Acc 0.73 (0.03), AUC 0.75 (0.04), CM [0.687 0.313; 0.216 0.784]. Baseline + Follow-up: Acc 0.70 (0.04), AUC 0.73 (0.03), CM [0.671 0.329; 0.241 0.759].

Table S5.5: Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CT input

All Subjects, MMSE, CA+CT input: Baseline (Acc, AUC, CM) and Baseline + Follow-up (Acc, AUC, CM)
LR: Baseline: Acc 0.86 (0.01), AUC 0.93 (0.01), CM [0.89 0.11; 0.19 0.81]. Baseline + Follow-up: Acc 0.91 (0.01), AUC 0.96 (0.01), CM [0.93 0.07; 0.11 0.89].
SVM: Baseline: Acc 0.85 (0.01), AUC 0.93 (0.01), CM [0.88 0.12; 0.19 0.81]. Baseline + Follow-up: Acc 0.89 (0.01), AUC 0.96 (0.01), CM [0.92 0.08; 0.14 0.86].
RF: Baseline: Acc 0.85 (0.01), AUC 0.91 (0.01), CM [0.86 0.14; 0.15 0.85]. Baseline + Follow-up: Acc 0.88 (0.01), AUC 0.96 (0.00), CM [0.88 0.12; 0.11 0.89].
ANN: Baseline: Acc 0.84 (0.01), AUC 0.91 (0.01), CM [0.89 0.11; 0.23 0.77]. Baseline + Follow-up: Acc 0.89 (0.01), AUC 0.96 (0.01), CM [0.92 0.08; 0.14 0.86].
LSN: Baseline: na. Baseline + Follow-up: Acc 0.94 (0.01), AUC 0.99 (0.00), CM [0.94 0.06; 0.06 0.94].

Table S5.6: Predictive performance: All subjects, MMSE scale, CA+CT input

Cognitively Consistent (CC) Group, MMSE scale, CA+CT input
Model | CA+CT Baseline: Acc, AUC, CM | CA+CT Baseline + Follow-up: Acc, AUC, CM
LR    | 0.80 (0.03), 0.86 (0.03), [0.819 0.181; 0.201 0.799]  | 0.85 (0.03), 0.90 (0.02), [0.842 0.158; 0.123 0.877]
SVM   | 0.80 (0.03), 0.86 (0.02), [0.801 0.199; 0.169 0.831]  | 0.82 (0.03), 0.90 (0.02), [0.81 0.19; 0.16 0.84]
RF    | 0.79 (0.04), 0.85 (0.03), [0.762 0.238; 0.158 0.842]  | 0.79 (0.02), 0.90 (0.02), [0.762 0.238; 0.148 0.852]
ANN   | 0.76 (0.04), 0.811 (0.04), [0.768 0.232; 0.227 0.773] | 0.82 (0.04), 0.88 (0.02), [0.792 0.208; 0.153 0.847]
LSN   | na, na, na                                            | 0.90 (0.01), 0.97 (0.01), [0.943 0.057; 0.137 0.863]

Table S5.7: Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA+CT input

All Subjects, ADAS-13 scale, CA input
Model | CA Baseline: Acc, CM | CA Baseline + Follow-up: Acc, CM
LR    | 0.81 (0.01), [0.855 0.128 0.017; 0.241 0.496 0.263; 0.053 0.161 0.788] | 0.88 (0.01), [0.921 0.076 0.004; 0.21 0.698 0.09; 0.003 0.123 0.875]
SVM   | 0.78 (0.01), [0.953 0.045 0.002; 0.38 0.406 0.212; 0.003 0.081 0.916]  | 0.88 (0.01), [0.967 0.031 0.002; 0.255 0.597 0.147; 0.003 0.028 0.969]
RF    | 0.79 (0.01), [0.883 0.104 0.013; 0.325 0.435 0.238; 0.038 0.1 0.862]   | 0.87 (0.02), [0.938 0.059 0.003; 0.225 0.608 0.167; 0.003 0.66 0.931]
ANN   | 0.78 (0.00), [0.88 0.109 0.012; nan nan nan; 0.101 0.224 0.671]        | 0.82 (0.00), [0.893 0.1 0.007; nan nan nan; 0.03 0.247 0.724]

Table S5.8: Predictive performance: All Subjects, ADAS-13 scale, CA input

Cognitively Consistent (CC) Group, ADAS-13 scale, CA input
Model | CA Baseline: Acc, CM | CA Baseline + Follow-up: Acc, CM
LR    | 0.58 (0.05), [0.538 0.422 0.04; 0.166 0.699 0.133; 0.046 0.385 0.568]  | 0.69 (0.04), [0.631 0.343 0.024; 0.18 0.748 0.07; 0 0.317 0.683]
SVM   | 0.64 (0.04), [nan nan nan; 0.296 0.562 0.141; 0 0.136 0.864]           | 0.70 (0.03), [nan nan nan; 0.212 0.621 0.166; 0 0.09 0.91]
RF    | 0.63 (0.04), [0.541 0.402 0.056; 0.259 0.595 0.145; 0.031 0.209 0.76]  | 0.69 (0.03), [0.751 0.234 0.014; 0.224 0.624 0.154; 0 0.154 0.846]
ANN   | 0.46 (0.03), [0.623 0.352 0.025; nan nan nan; 0.112 0.485 0.4]         | 0.50 (0.03), [0.555 0.419 0.024; nan nan nan; 0.054 0.487 0.458]

Table S5.9: Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA input

All Subjects, ADAS-13 scale, CT input
Model | CT Baseline: Acc, CM | CT Baseline + Follow-up: Acc, CM
LR    | 0.68 (0.01), [0.725 0.157 0.118; nan nan nan; 0.184 0.186 0.634]        | 0.67 (0.01), [0.729 0.158 0.112; nan nan nan; 0.2 0.177 0.622]
SVM   | 0.61 (0.02), [0.764 0.146 0.089; 0.518 0.183 0.302; 0.157 0.185 0.658]  | 0.60 (0.01), [0.772 0.149 0.082; 0.502 0.18 0.319; 0.168 0.174 0.658]
RF    | 0.67 (0.01), [0.683 0.163 0.153; nan nan nan; 0.191 0.175 0.637]        | 0.66 (0.01), [0.695 0.163 0.142; nan nan nan; 0.205 0.171 0.626]
ANN   | 0.66 (0.01), [0.767 0.14 0.092; 0.565 0.19 0.245; 0.223 0.189 0.589]    | 0.65 (0.02), [0.774 0.147 0.078; 0.527 0.134 0.338; 0.193 0.19 0.616]

Table S5.10: Predictive performance: All Subjects, ADAS-13 scale, CT input

Cognitively Consistent (CC) Group, ADAS-13 scale, CT input
Model | CT Baseline: Acc, CM | CT Baseline + Follow-up: Acc, CM
LR    | 0.37 (0.02), [0.371 0.498 0.131; nan nan nan; 0.162 0.429 0.407]       | 0.40 (0.02), [0.393 0.464 0.143; nan nan nan; 0.11 0.446 0.441]
SVM   | 0.42 (0.03), [0.408 0.492 0.099; 0.172 0.426 0.403; 0.155 0.441 0.4]   | 0.44 (0.02), [0.389 0.466 0.143; 0.256 0.486 0.259; 0.099 0.429 0.471]
RF    | 0.36 (0.02), [0.315 0.478 0.208; nan nan nan; 0.175 0.417 0.407]       | 0.37 (0.03), [0.35 0.451 0.197; nan nan nan; 0.119 0.461 0.418]
ANN   | 0.36 (0.03), [0.381 0.474 0.146; nan nan nan; 0.196 0.419 0.382]       | 0.38 (0.03), [0.381 0.49 0.127; nan nan nan; 0.152 0.435 0.413]

Table S5.11: Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CT input

All Subjects, ADAS-13 scale, CA+CT input
Model | CA+CT Baseline: Acc, CM | CA+CT Baseline + Follow-up: Acc, CM
LR    | 0.81 (0.01), [0.85 0.13 0.02; 0.28 0.55 0.17; 0.04 0.17 0.79]           | 0.84 (0.01), [0.89 0.1 0.01; 0.3 0.61 0.09; 0.02 0.15 0.83]
SVM   | 0.78 (0.01), [0.92 0.07 0.01; 0.38 0.4 0.22; 0.01 0.09 0.9]             | 0.84 (0.01), [0.93 0.06 0.01; 0.27 0.53 0.19; 0.01 0.07 0.92]
RF    | 0.78 (0.01), [0.78 0.17 0.05; 0.3 0.46 0.24; 0.04 0.14 0.82]            | 0.84 (0.01), [0.82 0.15 0.03; 0.13 0.75 0.12; 0.01 0.1 0.89]
ANN   | 0.79 (0.01), [0.795 0.466 0.041; nan nan nan; 0.036 0.148 0.816]        | 0.80 (0.01), [0.887 0.108 0.008; 0.336 0.397 0.265; 0.029 0.177 0.794]
LSN   | na, na                                                                  | 0.91 (0.01), [0.96 0.04 0; 0.18 0.71 0.11; 0 0.05 0.95]

Table S5.12: Predictive performance: All Subjects, ADAS-13 scale, CA+CT input

Cognitively Consistent (CC) Group, ADAS-13 scale, CA+CT input
Model | CA+CT Baseline: Acc, CM | CA+CT Baseline + Follow-up: Acc, CM
LR    | 0.56 (0.04), [0.533 0.432 0.035; nan nan nan; 0.024 0.416 0.56]         | 0.59 (0.04), [0.532 0.406 0.062; 0.235 0.695 0.07; 0.029 0.382 0.589]
SVM   | 0.64 (0.04), [0.752 0.248 0; 0.239 0.593 0.167; 0.028 0.214 0.758]      | 0.63 (0.03), [0.608 0.351 0.042; 0.181 0.616 0.205; 0.049 0.289 0.663]
RF    | 0.49 (0.03), [0.416 0.491 0.092; 0.183 0.617 0.2; 0.019 0.391 0.59]     | 0.58 (0.02), [0.445 0.461 0.092; 0.087 0.797 0.116; 0.022 0.28 0.697]
ANN   | 0.47 (0.03), [0.427 0.488 0.084; nan nan nan; 0.038 0.405 0.557]        | 0.52 (0.04), [0.505 0.438 0.056; 0.208 0.54 0.251; 0.055 0.424 0.519]
LSN   | na, na                                                                  | 0.76 (0.02), [0.797 0.188 0.014; 0.159 0.713 0.128; 0 0.137 0.863]

Table S5.13: Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA+CT input

MMSE scale, CA+CT input
Model | CA+CT Baseline: Acc, AUC, CM | CA+CT Baseline + Follow-up: Acc, AUC, CM
LR    | 0.812 (0.01), 0.859 (0.00), [0.933 0.067; 0.056 0.439] | 0.857 (0.00), 0.835 (0.02), [0.862 0.138; nan nan]
SVM   | 0.829 (0.00), 0.863 (0.00), [0.932 0.068; 0.532 0.468] | 0.855 (0.00), 0.503 (0.00), [0.855 0.145; nan nan]
RF    | 0.815 (0.01), 0.857 (0.00), [0.947 0.053; 0.55 0.45]   | 0.871 (0.01), 0.803 (0.02), [0.912 0.088; 0.417 0.583]
ANN   | 0.856 (0.01), 0.851 (0.01), [0.918 0.082; 0.393 0.607] | 0.846 (0.01), 0.835 (0.01), [0.886 0.114; nan nan]
LSN   | na, na, na                                             | 0.715 (0.01), 0.907 (0.00), [0.976 0.024; 0.671 0.329]

Table S5.14: AIBL predictive performance: All Subjects, MMSE scale, CA+CT input

S5.5 Effect of available timepoints (duration) on prediction performance

The use of a variable number of clinical timepoints per subject, based on availability (see Fig. S5.2), during trajectory assignment (ground truth) impacts the predictive performance. Stratifying performance by the last available timepoint for a given subject yielded 605 and 510 subjects for the 18 to 36 month (near future) and 48 to 72 month (distant future) spans, respectively. The predictive performance worsens for subjects with available timepoints between 48 and 72 months. This difference in performance is largest with the CA input and smallest with the CA+CT input (see Figs. S5.3 and S5.4).

Figure S5.2: Distribution of number of available timepoints with clinical score data.

Figure S5.3: Effect of available timepoints on predictive performance (MMSE). The results are shown for the 2x3 combinations of baseline and baseline + follow-up timepoints with CA, CT, and CA+CT features. Note that only the baseline + follow-up, CA+CT input is applicable for LSN.

Figure S5.4: Effect of available timepoints on predictive performance (ADAS-13). The results are shown for the 2x3 combinations of baseline and baseline + follow-up timepoints with CA, CT, and CA+CT features. Note that only the baseline + follow-up, CA+CT input is applicable for LSN.

S5.6 K-fold nested cross-validation procedure

In a K-fold nested cross-validation paradigm (see Fig. S5.5), subjects are randomly divided into K subsets. Then K-1 subsets are chosen as a training set and the remaining subset is held out as a test set. Subsequently, the samples from each training set are further divided into j inner folds to define and evaluate various data preprocessing operations, including feature selection, data normalization, and hyperparameter configuration. The validation subset within the inner folds is used to evaluate the performance generalizability of each model architecture. The hyperparameters (e.g. L1 penalty, number of layers, hidden nodes, etc.) of any ML model are essentially “tuning knobs” that adjust the complexity of the model. For ANNs, with a large number of hidden nodes, the model is able to handle more complex relationships within the input, allowing more accurate predictions. However, at the same time, it is also likely to overfit to the training data and offer poor performance on the test data. The internal validation procedure searches through the grid of hyperparameters to select the optimal model architecture that balances accuracy against overfitting. The nested validation set within the training set prevents double-dipping issues that may exaggerate the model performance. This optimal model (i.e. the top performing hyperparameter configuration) is then used to evaluate performance on the held-out test data. This process is then repeated K times to iterate through all combinations of training and test sets.
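For concreteness, a minimal sketch of such a nested procedure using scikit-learn is given below; the synthetic data, the logistic regression model, and the hyperparameter grid are illustrative placeholders rather than the exact configuration used in this work.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Inner loop (j folds): data normalization and a grid search over the penalty strength.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop (K folds): each held-out test fold only ever sees the tuned model.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print("Nested CV AUC: %.2f (%.2f)" % (scores.mean(), scores.std()))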

Figure S5.5: K-fold nested cross-validation. Subjects are randomly divided into K subsets. Then K-1 subsets are chosen as a training set and the remaining subset is held out as a test set. Subsequently, the samples from each training set are further divided into j inner folds to define and evaluate various data preprocessing operations, including feature selection, data normalization, and hyperparameter configuration. The validation subset within the inner folds is used to evaluate the performance generalizability of each model architecture. The optimal model is then used to evaluate performance on the held-out test data.

Chapter 6

Discussion

The work in this thesis consisted of three projects pertaining to the development of prognostic applications for late-onset Alzheimer's disease (AD) using magnetic resonance (MR) imaging data and machine-learning (ML) techniques. Brief summaries of these projects follow: 1) MR image segmentation of the hippocampus, 2) clinical severity prediction based on neuroanatomy, and 3) modeling and prediction of clinical progression. The project summaries are followed by the challenges encountered in the experimental designs, the trade-offs of the choices made during methodological implementation, and the consequent limitations resulting from these design choices. Finally, potential avenues for future research are discussed based on the findings and the lessons learned from these projects.

The first project aimed at accurate delineation of neuroanatomical structures, specifically the hippocampus, from MR images - a critical prerequisite for the development of biomarkers and prognostic applications towards AD and other disorders. The objective of the first project was to improve the label-fusion performance of multi-atlas label-fusion (MALF) approaches by replacing a simple majority vote technique. Towards this goal, the first project presented a novel method - autocorrecting walks over localized Markov random fields (AWoL-MRF) - inspired by manual segmentation protocol procedures. The segmentation performance of AWoL-MRF was validated over three independent datasets spanning a wide range of demographics and anatomical variations. The results showed that AWoL-MRF achieves state-of-the-art Dice scores with substantially fewer labeled atlases compared to other MALF methods, reducing the manual resource requirements. The method was also tested for diagnostic utility by comparing hippocampal volume across cognitively normal (CN), mild-cognitive-impairment (MCI), and AD subject groups. Significant volumetric differences between “CN vs. AD” and “CN vs. MCI” groups, as well as “MCI-converters vs. MCI-stable” groups, were found based on AWoL-MRF segmentations.

The second project aimed at leveraging high-dimensional features derived from MR-based neuroanatomical delineation techniques towards predicting the cognitive performance of an individual. Commonly used clinical assessments, such as MMSE and ADAS-13, were used as proxies for symptom severity associated with AD. A novel machine-learning model - the anatomically partitioned artificial neural network (APANN) - was presented for the prediction of these assessment scores from the high-dimensional, multimodal features derived from hippocampal segmentations and cortical thickness measures. The results showed that APANN outperforms other models based on Pearson's correlation and root mean square error metrics. APANN also demonstrated robust performance on multiple cohorts with varying data acquisition protocols and quality. From a clinical standpoint, the accurate prediction of clinical performance from nuanced neuroanatomy was an important step towards building diagnostic and prognostic tools for early detection and future clinical trajectory prediction of individuals at risk for AD.

The third project extended the work from the previous chapter by developing a computational framework for predicting the future clinical states of an individual based on longitudinal, multimodal data. The framework aimed to predict symptomatic decline based on the MMSE and ADAS-13 scales up to five years in advance. The project used machine-learning techniques, specifically a hierarchical clustering approach, to characterize prototypical stable and declining symptom trajectories. Subsequently, an artificial neural network model was developed to predict these trajectories using clinical and MR imaging features. The proposed model - a longitudinal siamese neural network (LSN) - provides a means to combine MR-based information from baseline and follow-up visits to model the structural change incurred by an individual that is predictive of the trajectories. The results showed excellent performance for both scales with subjects from multiple cohorts with varying data acquisition protocols. A replication study based on a completely independent subject pool also demonstrated promising results. Such performance replication further validates the utility of the presented framework towards early detection and prognosis of at-risk individuals via a combination of clinical and MR data acquired through continuous monitoring of individuals.

6.1 Challenges and limitations

During each project, several challenges were encountered due to a multitude of factors, including the complex nature of MR image data, subjectivity in clinical practices, and heterogeneity in populations. These challenges are discussed below, along with the consequent trade-offs of the choices made during the experimental design process.

6.1.1 Project 1: Hippocampal segmentation

There were two main challenges for the segmentation project. First, the MR image processing pipeline comprising nonlinear image registration requires substantial computing resources, creating a bottleneck. The number of registrations between labeled atlases and target images compounds with the use of the MAGeT Brain method, which uses a template library to inflate the number of candidate segmentations to be fused by AWoL-MRF. Thus, the validations were limited to three folds to evaluate the performance of AWoL-MRF with the ADNI and FEP datasets. Permuting more folds in the validation would have been a more rigorous approach, with the added computational cost. For the preterm-neonate cohort, only a single fold validation was performed due to the added challenges of its smaller sample size (44 images: 22 early-in-life and 22 term-age equivalent) and the quality of the images themselves. Acquisition and manual delineation of preterm-neonate scans is an extremely difficult task and therefore the evaluation of automated segmentation of these images was limited to a proof-of-concept [150].

Another common limitation of automatic segmentation protocols pertains to the choice of ground-truth labelled atlases. The scope of the segmentation work in this thesis was limited to whole hippocampus delineation, and the performance of the proposed segmentation method was tied to the choice of atlas library used with the given cohort. This choice dictates the anatomical definitions of the structures (i.e. whole hippocampus), as well as the image quality requirements (e.g. resolution), both of which impact the performance of the segmentations. Depending on these attributes, the segmentation process might be affected by several issues, such as the partial volume effect, that can limit the performance of manual and automated protocols. For instance, the Dice overlap accuracy measure for automated segmentation methods shows a ceiling effect bounded by the inter- and intra-rater performance on the manual segmentations used in a given atlas library. These atlas-library-driven differences in segmentation performance in turn introduce variability in the derived biomarker measures, such as volume, which needs to be assessed and accounted for while developing prognostic models. Moreover, during segmentation efficacy evaluation, the use of groupwise biomarker differences as a validation metric should be analyzed carefully. Due to the lack of access to the true anatomical ground truth on large datasets, the groupwise differences may not be purely biological, but are very likely to be confounded by methodological biases. Hence, using them as a validation of segmentation protocols must be performed with care and replicated on multiple datasets. These observations motivated the use of the “empirical samples” in Project 2 to help build robust machine-learning models invariant to some of these variations introduced by the application of a segmentation pipeline to subject cohorts with different image qualities.
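For reference, the Dice overlap mentioned above is a simple set-overlap measure between two binary labels; a minimal sketch of its computation, with toy arrays standing in for an automated and a manual segmentation, is shown below.

import numpy as np

def dice(a, b):
    # Dice coefficient: 2 * |A intersection B| / (|A| + |B|) for binary masks.
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

auto = np.zeros((10, 10, 10), dtype=bool)
manual = np.zeros((10, 10, 10), dtype=bool)
auto[2:7, 2:7, 2:7] = True    # toy automated label
manual[3:8, 2:7, 2:7] = True  # toy manual label, shifted by one voxel
print(dice(auto, manual))     # 0.8 for this toy example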

6.1.2 Project 2: Clinical severity prediction in AD

The challenges in the second project dealt with the task definition, the high-dimensional, multimodal input, and the training of the large-scale anatomically partitioned artificial neural network (APANN) model.

Task definition

Compared to diagnostic classification, the clinical prediction task chosen here focuses on performance related to cognitive symptoms in AD as captured by the MMSE and ADAS-13 clinical assessments. This deviation from typical diagnostic classification tasks was an effort to characterize symptom severity using MR imaging as an assistive tool for clinicians, which can be important for decisions pertaining to interventions and lifestyle changes [390, 383]. Nevertheless, the symptomatic domains covered in this project are limited by the scope of the assessments used in this work, i.e. MMSE and ADAS-13. Consequently, the precision of prediction is limited by the range and sensitivity of these scales. For instance, the results from Chapter 4 suggest that the MMSE prediction models offered diminished performance due to a smaller, less sensitive scale (range: 0-30). The severity prediction task is defined as a regression problem with continuous-valued output (dependent) variables. Although this offers the possibility of a more nuanced characterization of clinical states on a spectrum compared to categorical classification problems, it concurrently makes the prediction task more difficult [390]. For regression tasks, the performance is susceptible to ceiling and floor effects, which can make prediction models biased towards either end of the spectrum [349, 169]. These effects also contributed to the performance loss in the case of the MMSE prediction task.

Input choice

As outlined in Chapter 4, the choice of hippocampal and cortical thickness measures for clinical severity prediction was driven by prior research showing their involvement in cognitive performance. The specific use of voxel-wise hippocampal segmentations was intended to capture subtle, focal changes in the neuroanatomy, which may be indistinguishable with total volume measurements. We also note that AWoL-MRF was not used for segmentation in this project due to the overlapping timelines of Chapters 3 and 4. The use of the cortical thickness measure was motivated by its relative insensitivity to confounds, such as head size or total brain volume, which is desirable in subject-specific predictions.

The use of high-dimensional input (> 30k variables) allows subtle changes in the neuroanatomy to be captured. However, it also raises two main concerns for model development. First, the large number of variables per subject adds to the computational cost (i.e. hardware requirements, training times). Second, it also poses issues for the learning process due to its susceptibility to the phenomena of “overfitting” and the “curse of dimensionality” [31, 102]. The issue of overfitting is generally encountered in machine-learning tasks due to a lack of sufficient training examples for the model to learn generalizable patterns from the data. Such scenarios usually result in the model performing well only on the training set and producing poor results on the held-out test set (see Fig. 6.1). The lack of sufficient data can result in the trained model exhibiting high variance (e.g. decision trees) or high bias (e.g. linear classifiers) [102]. This problem is exacerbated by the use of models with a large number of parameters, such as artificial neural networks (ANNs). Several strategies exist to combat overfitting, including regularization and data augmentation, both of which were employed in this work. Regularization offers a way to incorporate prior knowledge about the data distribution that can help the models learn parsimonious, generalizable functions predictive of the task at hand [30, 34, 329]. This commonly used technique was applied in the form of L1 and L2 penalties and dropout factors during model training in this work.

Figure 6.1: Underfitting (M=0,1) vs. overfitting (M=9): Specifying model complexity (e.g. polynomial order) is a design choice that can have a significant impact on the generalizability of the model performance. An overly simplistic model may not have enough capacity to capture the true data distribution, which is referred to as model underfitting. In contrast, highly complex models (with a large number of free parameters) can lead to better prediction accuracy on a training set; however, this can result in an implausible model (e.g. M=9) that will perform poorly on an unseen test set. A large sample size and regularization of model parameters are essential for avoiding these issues. (Figure adapted from Bishop 2006 with reuse permission.)
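As a concrete illustration of the two regularization strategies named above, the following minimal PyTorch sketch adds an L2 penalty (weight decay) and a dropout layer to a small fully connected network; the layer sizes, rates, and toy data are illustrative and are not the settings used for APANN.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(300, 64),   # high-dimensional input mapped to a hidden layer
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zero hidden units during training
    nn.Linear(64, 1),     # continuous severity score output (regression)
)

# weight_decay adds an L2 penalty on the parameters to the loss.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 300), torch.randn(32, 1)  # toy batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()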

The key contribution of the second project pertains to the generation of empirical samples for data augmentation. Data augmentation typically involves inflating the training set by either adding plausible transformations of the available examples or generating synthetic examples from the estimated data distribution. Subjecting a model to potential variations in the input via data augmentation usually helps build robust predictors [293, 212]. This non-trivial process is strongly dependent upon the input datatype, as estimating and sampling from a high-dimensional space is often an intractable problem. Therefore, sampling from the empirical distributions resulting from existing MR image processing pipelines provides an effective way to perform data augmentation. The results from Chapter 4 show that the empirical samples do help against overfitting when implemented with the APANN. The high performance of APANN compared against other models also suggests that the computational cost of leveraging empirical samples is an acceptable trade-off for the performance gains obtained from the high-dimensional input choice with sophisticated models with a large number of parameters.

The curse of dimensionality is a relatively more complex problem compared to the overfitting issue. In the ML context, it implies that achieving generalizability gets exponentially more difficult as the data dimensionality grows. This difficulty is not merely a consequence of the number of raw dimensions, as the examples may congregate on a manifold conforming to a low-dimensional space, but results from the increased complexity of the function that needs to be learned by the model in order to produce generalizable performance [30, 31]. Consequently, many intuitive and useful notions, such as Euclidean distance for defining nearest neighbours, or Gaussian distribution approximations, fail to scale in high-dimensional cases. The volume of a multivariate space is increasingly concentrated away from the origin of the coordinate frame, making it difficult to estimate the shape of data distributions from a small set of examples. These issues prevent simpler models such as K-nearest-neighbours, linear classifiers, and kernel-based methods from capturing the complex mapping between the high-dimensional space and the desired output task. The recent success of ANNs arises from the capability of deep (hierarchical) architectures to learn structured and nonlinear patterns in the high-dimensional space, which offers some mitigation against the curse of dimensionality [30, 31, 34]. This has been an important motivating factor towards the use of ANN-based models to tackle high-dimensional structured image data in this thesis. Additionally, the anatomically partitioned design of APANN was employed in an effort to reduce the complexity of the learning task in the high-dimensional space by segregating the training process for each input modality.

Multimodal modeling

In addition to the challenges imposed by the high dimensionality of the input, a few other computational modeling issues arise from the nature of the data-type itself. In particular, the development of models capable of handling multiple modalities or data-types is a difficult task. Multimodal modeling is a highly active area of research encompassing issues pertaining to feature 1) representation, 2) fusion, 3) co-learning, 4) translation, and 5) alignment [25, 199, 330, 267]. A comprehensive discussion of all of these areas is beyond the scope of this thesis; nevertheless, some of the issues related to this work are discussed as follows. Finding a good representation of raw data that simplifies the model training process is a non-trivial task even for single modalities. For multiple modalities, this problem is further complicated by the different data-types (e.g. categorical, continuous), varying amounts of noise, and missing data. A naive approach for handling multiple modalities is to concatenate all variables, and potentially apply a scaling or normalizing operation to establish comparability among variables. Although this approach might be effective with a small number of weakly correlated variables, it fails to preserve information within high-dimensional, highly structured data-types (e.g. images) [30]. The concatenation operation on the input variables is also referred to as early or shallow multimodal fusion. Alternatively, a joint multimodal representation can be achieved by first extracting features from individual modalities and then projecting these features into a joint higher-order feature space. ANNs provide an extremely flexible, hierarchical architecture to learn such representations [199, 330, 198]. In the context of AD, a deep belief network based approach has been used to tackle multimodal representation of MR and PET imaging data [334]. These examples motivated the choice of APANN to represent features from high-dimensional, structured hippocampal and cortical input modalities. The anatomical partitioning allows training of individual, single-modality networks, which subsequently can be combined and fine-tuned to learn higher-order multimodal representations. Apart from ANNs, a few other approaches, notably multi-kernel learning (MKL), have been employed for the multimodal feature fusion task in the context of AD and MR imaging [233, 390]. Independent modality-specific kernels can offer better fusion of heterogeneous data-types. Kernel-based approaches also provide a convex loss function, simplifying the optimization process. Nevertheless, empirical evidence shows that kernel-based approaches tend to perform poorly when compared to ANN-based approaches, especially in high-dimensional scenarios [30, 31, 166]. The MKL analysis in the second project concurred with these observations.
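To make the idea of higher-order (deep) fusion concrete, the following minimal PyTorch sketch encodes each modality with its own subnetwork and then combines the learned representations in a joint layer, in the spirit of the anatomically partitioned design; the dimensions and layer sizes are illustrative placeholders, not the APANN configuration.

import torch
import torch.nn as nn

class PartitionedFusionNet(nn.Module):
    def __init__(self, dim_hc=1000, dim_ct=2000, hidden=64):
        super().__init__()
        # modality-specific subnetworks (e.g. hippocampal and cortical thickness features)
        self.hc_branch = nn.Sequential(nn.Linear(dim_hc, hidden), nn.ReLU())
        self.ct_branch = nn.Sequential(nn.Linear(dim_ct, hidden), nn.ReLU())
        # joint layer operating on the concatenated higher-order representations
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_hc, x_ct):
        z = torch.cat([self.hc_branch(x_hc), self.ct_branch(x_ct)], dim=1)
        return self.joint(z)

net = PartitionedFusionNet()
out = net(torch.randn(8, 1000), torch.randn(8, 2000))  # toy batch of 8 subjects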

6.1.3 Project 3: Clinical trajectory modeling and prediction in AD

The challenges in the third project primarily dealt with the longitudinal modeling of the prognostic task and of the input data from two timepoints.

Longitudinal modeling of clinical performance

Defining tasks comprising future outcomes is complicated by several factors related to the time span under consideration. These factors depend on the underlying theorized assumptions, such as the short-term versus long-term predictability of the biomarker, as well as the practical limitations, such as missing data points [170, 230, 92, 250]. In the context of AD, a typical approach towards defining a future task involves setting a predefined time-window (e.g. two years from baseline) for the prediction of a clinical state. For diagnostic classification, this translates into predicting whether a subject will transition to a successive clinical stage within the aforementioned duration. This however puts a hard upper bound on the permitted transition period, and any subject that may have undergone diagnostic change, even shortly after two years, will be considered diagnostically unchanged. Setting a hard threshold imposes even stricter constraints for clinical severity prediction, which is typically measured by continuous variables. A clinical score at a particular time epoch is a noisy sample of the clinical performance of an individual, which tends to fluctuate substantially when measured over a few months [309, 77]. These observations motivated the work in the third project, which aimed towards modeling data-driven clinical performance trajectories. The trajectory based characterization of clinical states allows consideration of longer time spans without any hard cutoff thresholds. Additionally, it provides a way to deal with missing timepoints, which is an important practical challenge in virtually all longitudinal tasks. The lack of hard thresholding also allows a post-hoc insight into biomarker performance towards long-term and short-term predictive performance without devising separate experimental setups. However, there are two trade-offs of trajectory based clinical state assignment. First, it induces subjectivity in the characterization of the progression model, referred to as trajectory templates in the third project. Since trajectories are modeled using a semi-arbitrary group of subjects, the set of scores of these templates provides a comparative and not an absolute definition of clinical state progression. Second, as the categorical trajectories are resolved based on the hierarchical clustering method, the number of trajectories is another design choice that needs to be addressed. A higher number of trajectories offers a more nuanced prognosis, but it also complicates the modeling and prediction tasks. These choices are dependent on the dynamic range of the clinical scale, as well as the stability of the data-driven clusters, both of which should offer plausible differential progression patterns. This prompted the use of two and three cluster solutions for the MMSE and ADAS-13 scales, respectively, in the third project.
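A minimal sketch of how such data-driven trajectory groups can be obtained with hierarchical clustering is shown below using SciPy; the toy score sequences and the choice of Ward linkage are illustrative rather than the exact settings used in this work.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# rows = subjects, columns = clinical scores at successive timepoints
scores = np.array([
    [28, 28, 27, 27],   # stable
    [29, 28, 28, 27],   # stable
    [27, 25, 22, 18],   # declining
    [26, 24, 20, 17],   # declining
], dtype=float)

Z = linkage(scores, method="ward")                # build the cluster hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 trajectory groups
print(labels)                                     # e.g. [1 1 2 2]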

Longitudinal modeling of input

Development of models for temporally related input is a challenging task in many ML domains. Although several successful approaches have been proposed to handle continuous time-series data, such as speech or functional MR imaging signals, the temporal relationship in the third project was limited to two timepoints: baseline and follow-up. A fundamental idea in subject-specific longitudinal modeling is to capture the change or rate of change associated with the biomarker represented by the input variables. This implies that the additional timepoint can improve the clinical prognosis compared to the information captured in a biomarker snapshot at a single time epoch. The objective of the ML model is then to identify combinations of input variables whose change patterns are predictive of the clinical task of interest. This can be implemented as a feature selection or a feature transformation process (see Section 2.6 for details). Similar to the case with multimodal input, this can also be modeled with an early (shallow) or a higher-order (deep) feature representation. In the simplest case, the input from two timepoints can be concatenated. This however removes the temporal relation between the variables. Some approaches mitigate this issue with a longitudinal feature selection process. In such an approach, during the dimensionality reduction stage, variables are jointly selected from multiple timepoints, preserving the temporal dependencies [170, 389]. Nevertheless, this still belongs to the class of early (shallow) feature representations. As an alternative to the joint feature selection approach, a deep feature engineering approach was employed in this work using the longitudinal siamese network (LSN) architecture. This choice was motivated by the effort to preserve the temporal relationship between higher-order feature representations that is predictive of the clinical task. The siamese network learns a joint transformation function that maps each timepoint into a common space. The representation in this common space can be used to measure the relative change (difference) between the two timepoints that is predictive of the clinical task. The use of an ANN-based approach provides an end-to-end model, mitigating the need for a regularized temporal feature selection stage. It also offers an insight into the “change” or “rate of change” derived from a multivariate and potentially high-dimensional longitudinal input that is informative of the clinical performance.
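The core idea can be sketched in PyTorch as follows: a shared encoder maps the baseline and follow-up inputs into a common space, and the difference of the two embeddings, i.e. the learned change, drives the trajectory prediction. The input size, single-layer encoder, and the use of a plain difference are simplifications for illustration rather than the exact LSN architecture.

import torch
import torch.nn as nn

class SiameseChangeNet(nn.Module):
    def __init__(self, in_dim=78, hidden=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared weights
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x_baseline, x_followup):
        z_bl = self.encoder(x_baseline)      # same encoder applied to both timepoints
        z_fu = self.encoder(x_followup)
        return self.classifier(z_fu - z_bl)  # change measured in the common space

net = SiameseChangeNet()
logits = net(torch.randn(8, 78), torch.randn(8, 78))  # toy batch of 8 subjects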

Input choice

As outlined in Chapter 5, the use of cortical thickness measures was motivated by their strong association with cognitive decline, as suggested by prior studies as well as the cross-sectional analysis in Chapter 4. The deviation from using high-dimensional input, as in the case of Chapter 4, was to minimize modeling complications and computational burden. The focus of this work was to model longitudinal change, and therefore anatomical priors, as defined by the AAL atlas, were used for input dimensionality reduction to avoid model overfitting. With lower input dimensionality, the available sample size proved to be sufficient for successful model training, and hence empirical sampling based data augmentation was not employed in this work.

Model interpretability

A common criticism applicable to virtually all ANN-based approaches pertains to the interpretation of the informative variables. This stems from the fact that the higher-order features of hierarchical models are difficult to link back to the original input variables. Although a few methods have proposed potential solutions to this issue [388], evaluating variable importance in a high-dimensional input setting remains an open challenge. It should be noted that it is also possible that a single canonical subset of highly informative variables (e.g. voxels or vertices) common across all subjects may not exist. A differential and distributed change pattern among the variables may be more predictive of the clinical task, which would not be properly identified by a frequentist computation of ranked variable importance. In such scenarios, the ANN architectures might be able to extract nonlinear, informative patterns from raw variables, which may not be readily interpretable, but could offer better results [30, 166].
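For reference, a frequentist ranked-variable-importance computation of the kind mentioned above can be sketched with permutation importance in scikit-learn; the synthetic data and the random forest model below are illustrative placeholders.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each variable in turn and measure the drop in performance.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]
print(ranked[:5])  # indices of the five most informative input variables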

Defining cutpoints for subject selection

Although biomarker progression and clinical performance exist on a continuum, dichotomization of these values is necessary in various scenarios, such as clinical trial selection, interventions, and lifestyle changes [184]. In the context of the third project, clinical scale cutpoints were used to identify edge-case and prognostically uncertain individuals who needed further monitoring. The edge-case individuals were at the ends of the clinical performance spectrum, i.e. individuals scoring very high or very low on the MMSE and ADAS-13 scales. The clinical trajectories of these individuals can be determined with high confidence simply based on their scores at the baseline or follow-up. In contrast, individuals in the middle of the clinical spectrum could not be classified under a trajectory with high certainty, and therefore additional biomarker (MR modality) evidence and more frequent monitoring (follow-up visit data) are necessary to improve their prognostic predictions. From a computational perspective, this grouping of subjects is also necessary for a fair evaluation of the contributions of the additional modality and follow-up timepoint towards the model performance. In Chapter 5, a sample use case with heuristically chosen MMSE and ADAS-13 cutpoints at baseline and follow-up visits was provided. This yielded three groups of subjects: 1) baseline edge-cases, 2) follow-up edge-cases, and 3) cognitively consistent. This post-hoc analysis provides some insight into which individuals' prognostic evaluation can be improved based on a new modality and a follow-up timepoint. However, the heuristic nature of the cutpoints induces some subjectivity that can affect the inclusion-exclusion criteria for each group. A more principled approach might leverage a Bayesian framework during model development, where the incremental addition of subject-specific information (biomarker or timepoint) can be incorporated to update the posterior probability of individual trajectories. The cutpoints can then be determined based on desired confidence levels for the trajectory prediction instead of raw clinical scores. This formulation was however beyond the scope of this thesis, and will be explored in future work.
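A minimal numerical sketch of the Bayesian formulation suggested above is given below: the posterior probability over two trajectories is updated as each new piece of subject-specific evidence arrives, and cutpoints could then be defined on the resulting confidence. All likelihood values are hypothetical and chosen purely for illustration.

import numpy as np

prior = np.array([0.5, 0.5])                 # P(stable), P(decline)

def update(posterior, likelihood):
    # One Bayes update: posterior is proportional to prior times likelihood of new evidence.
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

post = update(prior, np.array([0.7, 0.3]))   # hypothetical evidence from baseline MR features
post = update(post, np.array([0.4, 0.6]))    # hypothetical evidence from a follow-up timepoint
print(post)                                  # cutpoints could then be set on this posterior confidence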

Dataset biases

The work in this thesis was predominantly based on the ADNI datasets. The first project used ADNI1; the second project used ADNI1 and ADNI2; and the third project used ADNI1, ADNI2, and ADNIGO subjects for the analysis. There are two main limitations of ADNI that are relevant to the data used in these projects. First, as disclosed by the ADNI investigators, the subjects represent an amnestic clinical trial population and not an epidemiologically selected real life population [367]. Second, the age range of ADNI is limited to 55-90 years, and there is growing evidence suggesting that AD pathology may begin prior to this age range [367, 368]. For early detection and better characterization of preclinical AD stages, a younger at-risk cohort would be highly desirable. Similarly, AIBL, a replication cohort used in the third project for comparison with the ADNI sample, also has several shortcomings. The age range of AIBL is 60 to 96 years, which also excludes the younger population. The clinical battery in AIBL comprises only MMSE scores, and ADAS-13 or ADAS-cog assessments are not conducted. Furthermore, the follow-up visit duration is 18 months, which is relatively longer than that of ADNI [112]. The impact of these and other differences from ADNI, such as the MR acquisition protocol, is described in Chapter 5.

Medication confounds

The medication use of the participants in ADNI is a potential confound in the analysis. Many MCI and AD patients in ADNI have commonly received cholinesterase inhibitors and memantine hydrochloride as anti-dementia medications [262, 114]. Use of several other medications, common in an ageing population, is also reported by the participants. Use of these medications in ADNI, specifically by the MCI individuals, has been associated with clinical decline and thus affects the interpretation of results based on these individuals [308]. The polypharmacy confounds are difficult to quantify or account for during the predictive analysis. Thus, further investigation is needed to disambiguate the medication effects from disease-related changes and the impact of the medication regimen on the prognostic predictions.

6.2 Future directions

6.2.1 MR image segmentation

The work in this thesis, specifically projects two and three, has used ANN architectures for prediction tasks. However, ANNs were not investigated during the first project. Although convolutional ANNs (convnets) and related deep architectures have been extremely successful in image classification and segmentation problems in computer vision, their application in the medical imaging field has proven difficult. This is primarily due to the lack of the large numbers of labelled (manually segmented) examples required to train these models. Moreover, the highly successful convnets have been focused on two-dimensional (2-d) natural images. Extending the architecture of these 2-d convnets to handle 3-d MR imaging data is not computationally trivial and requires substantially more resources. However, recently a few convnet architectures (e.g. U-net) have been proposed that have demonstrated promising results with 2-d as well as 3-d medical imaging data [293, 63, 101]. The major advantage of such ANN-based approaches is that they significantly reduce the time required to segment a new image. During the training stage, a large amount of computational operations are embedded within the network parameters via the exhaustive and slow learning process. This consequently circumvents several preprocessing operations at test time, reducing the computational burden substantially. The authors of U-net successfully tested a 3-d convnet for volumetric segmentation of microscopy data of the Xenopus kidney. Subsequently, 3-d convnets were also applied to segmenting MR imaging data of the prostate [249] and brain tumors [115] by training the networks to optimize the Dice overlap metric. These emerging implementations are only the beginning of a new class of segmentation methods that needs to be explored for classical MR image segmentation tasks pertaining to tissue, as well as cortical and subcortical structural parcellations. Adaptation of data augmentation approaches along with patch-based techniques to facilitate training, along with the formulation of novel deep network architectures suitable for 3-d MR imaging data segmentation, promise to be an exciting area of research in the near future.
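As an illustration of what extending 2-d convolutional blocks to volumetric data involves, the following minimal PyTorch sketch applies a single 3-d convolutional block to a toy MR volume; the channel counts, kernel size, and volume size are illustrative, and this is a single building block rather than a full U-net style architecture.

import torch
import torch.nn as nn

block3d = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm3d(16),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),  # halves each spatial dimension
)

volume = torch.randn(1, 1, 64, 64, 64)  # one single-channel MR volume (toy size)
features = block3d(volume)              # output shape: (1, 16, 32, 32, 32)
print(features.shape)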

6.2.2 Severity prediction

Similar to the MR image segmentation tasks, 3-d convnets can be applied towards severity prediction as well. It should be noted that the high input dimensionality of the raw images still tends to be a prohibitive factor for these deep architectures. Moreover, unlike segmentation tasks, patch-based approaches may not be suitable for the clinical severity prediction or diagnostic classification problems. Recently, a few studies have demonstrated the use of 3-d convnets with AD diagnostic classification tasks, reporting comparable performance to the traditional approaches [274, 208, 168]. The results from these proof-of-concept studies suggest that the performance can be substantially improved with more labeled data and computational resources at disposal to tune these models better. Nevertheless, the advances in these deep-learning approaches have the potential to remove the time-consuming preprocessing and handcrafted feature engineering stages from the computational pipelines and produce state-of-the-art results. In the context of project two, another exciting area of research to be explored pertains to the development of models capable of handling surface-based phenotypical input, such as the cortical surface. The cortical surface measures, i.e. thickness values over a mesh of vertices, can be modeled as a 2-d manifold or a graph. However, traditional ML models and even the state-of-the-art convnets expect the input structured in an array or a grid. Thus, more work is needed to extend these approaches to accommodate and leverage the dependencies within a generic graph-based input.

6.2.3 Longitudinal prediction

The third project presented a longitudinal siamese network (LSN) architecture that learned relevant structural change patterns from cortical thickness measures based on an anatomical atlas. The LSN can be extended to incorporate high-dimensional vertex-wise cortical thickness measures to capture more subtle anatomical changes. Furthermore, LSN can also be applied to voxel-wise data from anatomical segmentations or raw intensity values. Finally, LSN can be extended to incorporate multiple modalities with high-dimensional input, similar to Project 2. These advancements are contingent upon addressing the aforementioned challenges pertaining to the availability of data and improvements to the convnet architectures. Nevertheless, a siamese network capable of handling high-dimensional, multimodal, and longitudinal data can provide an extremely powerful tool for making prognostic predictions through continuous monitoring.

6.2.4 Clinical tasks

In this thesis, prediction tasks based on clinical assessments were primarily addressed. Although such tasks, along with diagnostic classification problems, are the predominant choice for the application of ML models, several other clinical problems can benefit from ML-based techniques. Particularly, questions regarding treatment selection and corresponding response prediction at the individual level can help recruitment for clinical trials and help clinicians devise personalized treatment plans. The application and validation of ML models towards these problems will attract more interest as large longitudinal datasets comprising various treatment trials become available in the research domain.

6.2.5 Clinical translation

The development of ML applications towards AD and other mental health disorders in a research setting essentially yields a working prototype limited to the confines of the experimental design and available data. Despite the use of principled validation paradigms, the model performance remains confounded by the dataset-specific idiosyncrasies of the subject population. In order to adopt these ML models in a clinical setting, several practical challenges need to be addressed. The first and foremost issue pertains to the availability of annotated clinical data. There are marked differences between data collected in a controlled research environment and existing healthcare systems. The heterogeneity in the data acquisition protocols (e.g. MR scanners), the clinical assessments (e.g. MMSE, ADAS, MoCA) used for symptomatic evaluation, and the frequency of patient monitoring poses numerous challenges that cannot be resolved using current implementations of image processing and ML tools. More robust computational frameworks need to be developed that are invariant to the differences in the phenotypic distributions induced by the varying scanner types, field strengths, and other factors. This however is a non-trivial task, as the effect of these nuisance factors cannot be linearly regressed out from the structured, multivariate data distributions. Detailed higher-order statistical modeling of these factors is an important first step towards mitigating their effects. From a model evaluation perspective, large-scale cross-validation comprising multiple datasets, such as comparing performance in a leave-one-dataset-out manner, is critical towards building generalizable models that can reproduce performance on completely independent datasets from multiple sites and studies. Several large initiatives (e.g. ADNI, UK Biobank) have made great efforts towards improving data sharing. However, pertaining to the accessibility of patient data, several ethical and legal concerns need to be addressed prior to data distribution. These concerns include issues relating to privacy and anonymity of the participants, as well as regulatory issues concerning intellectual property rights, as established by the ethical review boards across the globe. These challenges need to be dealt with to create diverse, large-scale data repositories.

The second important consideration pertains to improving our understanding of ML model behaviour. This is particularly critical for deep-learning models, which suffer from valid criticism regarding their interpretability and lack of a mechanistic approach towards task prediction [172, 19]. Thus, it is important to devise validation paradigms to interrogate the sensitivity of different model parameters, as well as to carry out an exhaustive sweep of probable input scenarios to assess the model biases and failures. A well defined set of conditions that could guarantee the model efficacy would make the adaptation of these quasi-deterministic tools easier in a practical setting. In this regard, significant further work is needed towards the analysis of outliers, type-I versus type-II errors, as well as permutation based significance testing and estimation of confidence intervals for any clinical predictive task of interest.

Lastly, there are several practical issues that pertain to the delivery of ML tools within a clinic or a hospital.
The deployment of ML software onto existing healthcare systems used by clinicians or radiologists may involve developing software capabilities that can handle proprietary data formats used by different device manufacturers, as well as electronic medical record systems. Considering the computationally demanding nature of the preprocessing pipelines that transform the raw imaging data into some standardized format, hardware upgrades (e.g. GPUs) might be needed at many healthcare facilities in order to meet reasonable operational time requirements. The rapid ongoing advances in ML techniques and the underlying hardware technologies will certainly help the migration of legacy systems and facilitate the integration of assistive ML tools into our healthcare system.

6.3 Concluding remarks

“All models are wrong, but some are useful” is a common aphorism in statistics [42]. This observation holds true for ML based approaches as well, since any trained model is an approximation of the underlying data distributions and the input-output mapping, subject to sampling biases. As highlighted in this thesis, the usefulness of ML models derives primarily from their ability to handle complex datatypes and make accurate individualized predictions. In the era of large-scale data analysis comprising high-dimensional MR images, computational neuroscience research will definitely benefit from leveraging well-established ML techniques towards improving our understanding of brain and behavior [48]. However, it should be acknowledged that the lack of a mechanistic model building approach makes it even harder to judge their approximation of the biological systems. Therefore, it is imperative to think of ML approaches not as a replacement, but as a complement to hypothesis-driven methods [357]. With the increasing availability of computing resources and black-box ML tools, it is easier to perform quantitative research that produces descriptive and inferential statistics, amassing an explosion of disparate results. As per the critical allegory titled “Chaos in the Brickyard” [130], such practices should be avoided in the absence of a continual pursuit of overarching theory that can inform the data-driven findings. Therefore, particularly pertinent to mental health applications, rigorous validation and replication are critical, along with hypothesis-driven confirmation, prior to clinical translation of these tools [172]. With the employment of robust large-scale methodologies, this powerful multidisciplinary field would improve our understanding of the biological processes underlying AD and other mental health disorders, and subsequently help develop clinical applications towards the betterment of outcomes for patients.

Bibliography

[1] Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, and Gael Varoquaux. Machine learning for neuroimaging with scikit-learn. Front. Neuroinform., 8:14, 2014.

[2] Y Ad-Dab’bagh, O Lyttelton, J S Muehlboeck, C Lepage, D Einarson, K Mok, O Ivanov, R D Vincent, J Lerch, E Fombonne, and others. The CIVET image-processing environment: a fully automated comprehensive pipeline for anatomical neuroimaging research. In Proceedings of the 12th annual meeting of the Organization for Human Brain Mapping, page 2266, 2006.

[3] Daniel H Adler, Alex Yang Liu, John Pluta, Salmon Kadivar, Sylvia Orozco, Hongzhi Wang, James C Gee, Brian B Avants, and Paul A Yushkevich. Reconstruction of the human hippocampus in 3D from histology and high-resolution ex-vivo MRI. Proc. IEEE Int. Symp. Biomed. Imaging, 2012:294–297, December 2012.

[4] Mohamed N Ahmed, Sameh M Yamany, Nevin Mohamed, Aly A Farag, and Thomas Moriarty. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans. Med. Imaging, 21(3):193–199, March 2002.

[5] Paul S Aisen, Ronald C Petersen, Michael C Donohue, Anthony Gamst, Rema Raman, Ronald G Thomas, Sarah Walter, John Q Trojanowski, Leslie M Shaw, Laurel A Beckett, Clifford R Jack, William Jagust, Arthur W Toga, Andrew J Saykin, John C Morris, Robert C Green, and Michael W Weiner. Clinical Core of the Alzheimer’s Disease Neuroimaging Initiative: progress and plans. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association, 6(3):239–46, 2010.

[6] Alireza Akhondi-Asl and Simon K Warfield. Estimation of the prior distribution of ground truth in the STAPLE algorithm: an empirical Bayesian approach. Medical image computing and computer-assisted intervention: MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention, 15(Pt 1):593–600, January 2012.

[7] M S Albert, S T DeKosky, D Dickson, B Dubois, and others. The diagnosis of mild cognitive impairment due to Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimers. Dement., 2011.

[8] M. E. Alexander, R. Baumgartner, A. R. Summers, C. Windischberger, M. Klarhoefer, E. Moser, and R. L. Somorjai. A wavelet-based method for improving signal-to-noise ratio and contrast in MR images. Magnetic Resonance Imaging, 18(2):169–180, 2000.


[9] P Aljabar, R A Heckemann, A Hammers, J V Hajnal, and D Rueckert. Multi-atlas based segmen- tation of brain images: atlas selection and its effect on accuracy. Neuroimage, 46(3):726–738, July 2009.

[10] P Aljabar, R A Heckemann, A Hammers, J V Hajnal, and D Rueckert. Multi-atlas based segmentation of brain images: atlas selection and its effect on accuracy. NeuroImage, 46(3):726–38, July 2009.

[11] Alzheimer’s Association. 2017 Alzheimer’s Disease Facts and Figures. Alzheimers. Dement., 2017.

[12] Robert S C Amaral, Min Tae M Park, Gabriel A Devenyi, Vivian Lynn, Jon Pipitone, Julie Winterburn, Sofia Chavez, Mark Schira, Nancy J Lobaugh, Aristotle N Voineskos, Jens C Pruessner, M Mallar Chakravarty, and Alzheimer’s Disease Neuroimaging Initiative. Manual segmentation of the fornix, fimbria, and alveus on high-resolution 3T MRI: Application via fully-automated mapping of the human memory circuit white and grey matter in healthy and pathological aging. Neuroimage, October 2016.

[13] Babak A Ardekani, Stephen Guckemus, Alvin Bachman, Matthew J Hoptman, Michelle Wojtaszek, and Jay Nierenberg. Quantitative comparison of algorithms for inter-subject registration of 3D volumetric brain MRI scans. J. Neurosci. Methods, 142(1):67–76, March 2005.

[14] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Stat. Surv., 4:40–79, 2010.

[15] J Ashburner and K J Friston. Voxel-based morphometry–the methods. Neuroimage, 11(6 Pt 1):805–821, June 2000.

[16] J Ashburner and K J Friston. Image segmentation. Human Brain Function, 2003.

[17] J Ashburner and K J Friston. Rigid body registration. Statistical parametric mapping: The analysis of functional brain images, 2007.

[18] John Ashburner and Karl J Friston. Unified segmentation. Neuroimage, 26(3):839–851, July 2005.

[19] Gowtham Atluri, Kanchana Padmanabhan, Gang Fang, Michael Steinbach, Jeffrey R Petrella, Kelvin Lim, Angus Macdonald, 3rd, Nagiza F Samatova, P Murali Doraiswamy, and Vipin Kumar. Complex biomarker discovery in neuroimaging data: Finding a needle in a haystack. Neuroimage Clin, 3:123–131, August 2013.

[20] B B Avants, C L Epstein, M Grossman, and J C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal., 12(1):26–41, February 2008.

[21] Brian B Avants, C L Epstein, and J C Gee. Geodesic image normalization and temporal param- eterization in the space of diffeomorphisms. In Medical Imaging and Augmented Reality, Lecture Notes in Computer Science, pages 9–16. Springer, Berlin, Heidelberg, August 2006.

[22] Brian B Avants, Nicholas J Tustison, Michael Stauffer, Gang Song, Baohua Wu, and James C Gee. The Insight ToolKit image registration framework. Front. Neuroinform., 8:44, April 2014.

[23] A Bahar Fuchs, L Clare, and B Woods. Cognitive training and cognitive rehabilitation for mild to moderate alzheimer’s disease and vascular dementia. The Cochrane Library, 2013.

[24] Laura D Baker, Laura L Frank, Karen Foster-Schubert, Pattie S Green, Charles W Wilkinson, Anne McTiernan, Stephen R Plymate, Mark A Fishel, G Stennis Watson, Brenna A Cholerton, Glen E Duncan, Pankaj D Mehta, and Suzanne Craft. Effects of aerobic exercise on mild cognitive impairment: a controlled trial. Arch. Neurol., 67(1):71–79, January 2010.

[25] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. May 2017.

[26] J. Barnes, J. Foster, R. G. Boyes, T. Pepple, E. K. Moore, J. M. Schott, C. Frost, R. I. Scahill, and N. C. Fox. A comparison of methods for the automated calculation of volumes and atrophy rates in the hippocampus. NeuroImage, 40(4):1655–1671, 2008.

[27] Josephine Barnes, Gerard R. Ridgway, Jonathan Bartlett, Susie M.D. Henley, Manja Lehmann, Nicola Hobbs, Matthew J. Clarkson, David G. MacManus, Sebastien Ourselin, and Nick C. Fox. Head size, age and gender adjustment in MRI studies: A necessary nuisance? NeuroImage, 53(4):1244–1255, 2010.

[28] Jennifer H Barnett, Lily Lewis, Andrew D Blackwell, and Matthew Taylor. Early intervention in alzheimer’s disease: a health economic study of the effects of diagnostic timing. BMC Neurol., 14:101, May 2014.

[29] Boubakeur Belaroussi, Julien Milles, Sabin Carme, Yue Min Zhu, and Hugues Benoit-Cattin. Intensity non-uniformity correction in MRI: existing methods and their validation. Med. Image Anal., 10(2):234–246, April 2006.

[30] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.

[31] Yoshua Bengio and Yann Le Cun. Scaling learning algorithms towards AI.

[32] Nikhil Bhagwat, Jon Pipitone, Julie L Winterburn, Ting Guo, Emma G Duerden, Aristotle N Voineskos, Martin Lepage, Steven P Miller, Jens C Pruessner, and M Mallar Chakravarty. Manual- Protocol inspired technique for improving automated MR image segmentation during label fusion. Front. Neurosci., 10:325, 2016.

[33] J Birks. Cholinesterase inhibitors for alzheimer’s disease. Cochrane Database Syst. Rev., (1):CD005593, January 2006.

[34] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, August 2006.

[35] J M Bland and D G Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Technical report, 1986.

[36] F Bloch. Nuclear induction. Phys. Rev., 70(7-8):460–474, October 1946.

[37] George S Bloom. Amyloid-β and tau: the trigger and bullet in Alzheimer disease pathogenesis. JAMA Neurol., 71(4):505–508, April 2014.

[38] M Boccardi, R Ganzola, S Duchesne, A Redolfi, and others. Survey of segmentation protocols for manual hippocampal volumetry: Preparatory phase for an EADC-ADNI harmonization protocol. Alzheimers. Dement., 2010.

[39] Marina Boccardi, Martina Bocchetta, Rossana Ganzola, Nicolas Robitaille, Alberto Redolfi, Simon Duchesne, Clifford R. Jack, and Giovanni B. Frisoni. Operationalizing protocol differences for EADC-ADNI manual hippocampal segmentation, 2013.

[40] David R Borchelt, Gopal Thinakaran, Christopher B Eckman, Michael K Lee, Frances Davenport, Tamara Ratovitsky, Cristian-Mihail Prada, Grace Kim, Sophia Seekins, Debra Yager, Hilda H Slunt, Rong Wang, Mary Seeger, Allan I Levey, Samuel E Gandy, Neal G Copeland, Nancy A Jenkins, Donald L Price, Steven G Younkin, and Sangram S Sisodia. Familial Alzheimer's Disease–Linked presenilin 1 variants elevate Aβ1–42/1–40 ratio in vitro and in vivo. Neuron, 17(5):1005–1013, November 1996.

[41] Cássio M C Bottino, Cláudio C Castro, Regina L E Gomes, Carlos A Buchpiguel, Renato L Marchetti, and Mário R Louzã Neto. Volumetric MRI measurements can differentiate Alzheimer's disease, mild cognitive impairment, and normal aging. Int. Psychogeriatr., 14(1):59–72, March 2002.

[42] G E P Box. Robustness in the strategy of scientific model building. Robustness in Statistics, pages 201–236, 1979.

[43] H Braak and E Braak. Neuropathological stageing of alzheimer-related changes. Acta Neuropathol., 82(4):239–259, 1991.

[44] H Braak and E Braak. Frequency of stages of alzheimer-related lesions in different age categories. Neurobiol. Aging, 18(4):351–357, July 1997.

[45] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.

[46] Ron Brookmeyer, Elizabeth Johnson, Kathryn Ziegler-Graham, and H Michael Arrighi. Forecast- ing the global burden of alzheimer’s disease. Alzheimers. Dement., 3(3):186–191, July 2007.

[47] A. Buades, B. Coll, and J. M. Morel. A Review of Image Denoising Algorithms, with a New One. Multiscale Modeling & Simulation, 4(2):490–530, 2005.

[48] Danilo Bzdok and B T Thomas Yeo. Inference in the age of big data: Future perspectives on neuroscience. Neuroimage, 155:549–564, July 2017.

[49] Vince D Calhoun, Jingyu Liu, and T¨ulay Adali. A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data. Neuroimage, 45(1 Suppl):S163–72, March 2009.

[50] V Camus, P Payoux, L Barré, B Desgranges, T Voisin, C Tauber, R La Joie, M Tafani, C Hommet, G Chételat, K Mondon, V de La Sayette, J P Cottier, E Beaufils, M J Ribeiro, V Gissot, E Vierron, J Vercouillie, B Vellas, F Eustache, and D Guilloteau. Using PET with 18F-AV-45 (florbetapir) to quantify brain amyloid load in a clinical environment. Eur. J. Nucl. Med. Mol. Imaging, 39(4):621–631, April 2012.

[51] A P Carpenter, Jr, M J Pontecorvo, F F Hefti, and D M Skovronsky. The use of the exploratory IND in the evaluation and development of 18F-PET radiopharmaceuticals for amyloid imaging in the brain: a review of one company’s experience. Q. J. Nucl. Med. Mol. Imaging, 53(4):387, 2009.

[52] Ramon Casanova, Fang-Chi Hsu, Kaycee M Sink, Stephen R Rapp, Jeff D Williamson, Susan M Resnick, Mark A Espeland, and for the Alzheimer’s Disease Neuroimaging Initiative. Alzheimer’s disease risk assessment using Large-Scale machine learning methods. PLoS One, 8(11):e77949, 2013.

[53] M Mallar Chakravarty, Gilles Bertrand, Charles P Hodge, Abbas F Sadikot, and D Louis Collins. The creation of a brain atlas for image guided neurosurgery using serial histological data. Neuroimage, 30(2):359–376, April 2006.

[54] M Mallar Chakravarty, Abbas F Sadikot, Jürgen Germann, Pierre Hellier, Gilles Bertrand, and D Louis Collins. Comparison of piece-wise linear, linear, and nonlinear atlas-to-patient warping techniques: analysis of the labeling of subcortical nuclei for functional neurosurgical applications. Hum. Brain Mapp., 30(11):3574–3595, November 2009.

[55] M Mallar Chakravarty, Patrick Steadman, Matthijs C van Eede, Rebecca D Calcott, Victoria Gu, Philip Shaw, Armin Raznahan, D Louis Collins, and Jason P Lerch. Performing label-fusion-based segmentation using multiple automatically generated templates. Hum. Brain Mapp., 34(10):2635– 2654, October 2013.

[56] M. Mallar Chakravarty, Patrick Steadman, Matthijs C. van Eede, Rebecca D. Calcott, Victoria Gu, Philip Shaw, Armin Raznahan, D. Louis Collins, and Jason P. Lerch. Performing label-fusion-based segmentation using multiple automatically generated templates. Human Brain Mapping, 34:2635–2654, 2013.

[57] D Chan, N C Fox, R I Scahill, W R Crum, J L Whitwell, G Leschziner, A M Rossor, J M Stevens, L Cipolotti, and M N Rossor. Patterns of temporal lobe atrophy in semantic dementia and alzheimer’s disease. Ann. Neurol., 49(4):433–442, April 2001.

[58] Anjen Chenn and Christopher A Walsh. Regulation of cerebral cortical size by control of cell cycle exit in neural precursors. Science, 297(5580):365–369, July 2002.

[59] Gaël Chételat, Renaud La Joie, Nicolas Villain, Audrey Perrotin, Vincent de La Sayette, Francis Eustache, and Rik Vandenberghe. Amyloid imaging in cognitively normal individuals, at-risk populations and preclinical Alzheimer's disease. Neuroimage Clin, 2:356–365, March 2013.

[60] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546, 2005.

[61] N Chow, K S Hwang, S Hurtz, A E Green, J H Somme, P M Thompson, D A Elashoff, C R Jack, M Weiner, L G Apostolova, and Alzheimer's Disease Neuroimaging Initiative. Comparing 3T and 1.5T MRI for mapping hippocampal atrophy in the Alzheimer's Disease Neuroimaging Initiative. AJNR Am. J. Neuroradiol., 36(4):653–660, April 2015.

[62] Marie Chupin, Emilie Gérardin, Rémi Cuingnet, Claire Boutet, Louis Lemieux, Stéphane Lehéricy, Habib Benali, Line Garnero, Olivier Colliot, and Alzheimer's Disease Neuroimaging Initiative. Fully automatic hippocampus segmentation and classification in Alzheimer's disease and mild cognitive impairment applied on data from ADNI. Hippocampus, 19(6):579–587, June 2009.

[63] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. June 2016.

[64] Linda Clare and Robert T Woods. Cognitive training and cognitive rehabilitation for people with early-stage alzheimer’s disease: A review. Neuropsychol. Rehabil., 14(4):385–401, September 2004.

[65] Christopher M Clark, Julie A Schneider, Barry J Bedell, Thomas G Beach, Warren B Bilker, Mark A Mintun, Michael J Pontecorvo, Franz Hefti, Alan P Carpenter, Matthew L Flitter, Michael J Krautkramer, Hank F Kung, R Edward Coleman, P Murali Doraiswamy, Adam S Fleisher, Marwan N Sabbagh, Carl H Sadowsky, Eric P Reiman, P Eric M Reiman, Simone P Zehntner, Daniel M Skovronsky, and AV45-A07 Study Group. Use of florbetapir-PET for imaging beta-amyloid pathology. JAMA, 305(3):275–283, January 2011.

[66] Christopher M Clark, Sharon Xie, Jesse Chittams, Douglas Ewbank, Elaine Peskind, Douglas Galasko, John C Morris, Daniel W McKeel, Martin Farlow, Sharon L Weitlauf, Joseph Quinn, Jeffrey Kaye, David Knopman, Hiroyuki Arai, Rachelle S Doody, Charles DeCarli, Susan Leight, Virginia M-Y Lee, and John Q Trojanowski. Cerebrospinal fluid tau and β-Amyloid: How well do these biomarkers reflect Autopsy-Confirmed dementia diagnoses? Arch. Neurol., 60(12):1696–1702, December 2003.

[67] G B Coleman and H C Andrews. Image segmentation by clustering. Proc. IEEE, 67(5):773–785, May 1979.

[68] D L Collins and A C Evans. Animal: Validation and applications of nonlinear Registration-Based segmentation. Int. J. Pattern Recognit Artif Intell., 11(08):1271–1294, December 1997.

[69] D L Collins, P Neelin, T M Peters, and A C Evans. Automatic 3D intersubject registration of MR volumetric data in standardized talairach space. J. Comput. Assist. Tomogr., 18(2):192–205, March 1994.

[70] D Louis Collins, C J Holmes, T M Peters, and A C Evans. Automatic 3-D model-based neuroanatomical segmentation. Hum. Brain Mapp., 3(3):190–208, January 1995.

[71] D Louis Collins, C J Holmes, T M Peters, and A C Evans. Automatic 3-D model-based neuroanatomical segmentation. Human Brain Mapping, 3(3):190–208, October 1995.

[72] D Louis Collins and Jens C Pruessner. Towards accurate, automatic segmentation of the hippocampus and amygdala from MRI by augmenting ANIMAL with a template library and label fusion. Neuroimage, 52(4):1355–1366, October 2010.

[73] D Louis Collins and Jens C Pruessner. Towards accurate, automatic segmentation of the hippocampus and amygdala from MRI by augmenting ANIMAL with a template library and label fusion. NeuroImage, 52(4):1355–1366, October 2010.

[74] O Colliot, G Chételat, M Chupin, B Desgranges, and others. Discrimination between Alzheimer disease, mild cognitive impairment, and normal aging by using automated segmentation of the hippocampus. Radiology, 2008.

[75] Olivier Commowick, Alireza Akhondi-Asl, and Simon K Warfield. Estimating a reference standard segmentation with spatially varying performance parameters: local MAP STAPLE. IEEE Transactions on Medical Imaging, 31(8):1593–1606, August 2012.

[76] Olivier Commowick and Simon K Warfield. Incorporating priors on expert performance parameters for segmentation validation and label fusion: a maximum a posteriori STAPLE. Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention, 13(Pt 3):25–32, jan 2010.

[77] Donald J Connor and Marwan N Sabbagh. Administration and scoring variance on the ADAS-Cog. J. Alzheimers. Dis., 15(3):461–464, November 2008.

[78] E H Corder, A M Saunders, W J Strittmatter, D E Schmechel, P C Gaskell, G W Small, A D Roses, J L Haines, and M A Pericak-Vance. Gene dose of apolipoprotein E type 4 allele and the risk of alzheimer’s disease in late onset families. Science, 261(5123):921–923, August 1993.

[79] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995.

[80] Pierrick Coupé, Simon F Eskildsen, José V Manjón, Vladimir S Fonov, D Louis Collins, and Alzheimer's Disease Neuroimaging Initiative. Simultaneous segmentation and grading of anatomical structures for patient's classification: application to Alzheimer's disease. Neuroimage, 59(4):3736–3747, February 2012.

[81] Pierrick Coupé, Simon F Eskildsen, José V Manjón, Vladimir S Fonov, Jens C Pruessner, Michèle Allard, D Louis Collins, and Alzheimer's Disease Neuroimaging Initiative. Scoring by nonlocal image patch estimator for early detection of Alzheimer's disease. Neuroimage Clin, 1(1):141–152, October 2012.

[82] Pierrick Coupé, José V Manjón, Vladimir Fonov, Jens Pruessner, Montserrat Robles, and D Louis Collins. Patch-based segmentation using expert priors: application to hippocampus and ventricle segmentation. Neuroimage, 54(2):940–954, January 2011.

[83] Pierrick Coupé, José V Manjón, Vladimir Fonov, Jens Pruessner, Montserrat Robles, and D Louis Collins. Patch-based segmentation using expert priors: application to hippocampus and ventricle segmentation. NeuroImage, 54(2):940–954, January 2011.

[84] Pierrick Coupé, Pierre Yger, Sylvain Prima, Pierre Hellier, Charles Kervrann, and Christian Barillot. An optimized blockwise nonlocal means denoising filter for 3-D magnetic resonance images. IEEE Transactions on Medical Imaging, 27(4):425–441, 2008.

[85] D R Cox. The regression analysis of binary sequences. J. R. Stat. Soc. Series B Stat. Methodol., 20(2):215–242, 1958.

[86] Rémi Cuingnet, Emilie Gerardin, Jérôme Tessieras, Guillaume Auzias, Stéphane Lehéricy, Marie-Odile Habert, Marie Chupin, Habib Benali, and Olivier Colliot. Automatic classification of patients with Alzheimer's disease from structural MRI: A comparison of ten methods using the ADNI database. Neuroimage, 56(2):766–781, May 2011.

[87] Rémi Cuingnet, Emilie Gerardin, Jérôme Tessieras, Guillaume Auzias, Stéphane Lehéricy, Marie-Odile Habert, Marie Chupin, Habib Benali, Olivier Colliot, and Alzheimer's Disease Neuroimaging Initiative. Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database. Neuroimage, 56(2):766–781, May 2011.

[88] Stuart Currie, Nigel Hoggard, Ian J Craven, Marios Hadjivassiliou, and Iain D Wilkinson. Understanding MRI: basic MR physics for physicians. Postgrad. Med. J., 89(1050):209–223, April 2013.

[89] Sandhitsu R Das, Brian B Avants, Murray Grossman, and James C Gee. Registration based cortical thickness measurement. Neuroimage, 45(3):867–879, April 2009.

[90] C Davatzikos. Spatial transformation and registration of brain images using elastically deformable models. Comput. Vis. Image Underst., 66(2):207–222, May 1997.

[91] Christos Davatzikos, Priyanka Bhatt, Leslie M Shaw, Kayhan N Batmanghelich, and John Q Trojanowski. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol. Aging, 32(12):2322.e19–27, December 2011.

[92] Christos Davatzikos, Feng Xu, Yang An, Yong Fan, and Susan M Resnick. Longitudinal progression of alzheimer’s-like patterns of atrophy in normal older adults: the SPARE-AD index. Brain, 132(Pt 8):2026–2035, August 2009.

[93] Martha L Daviglus, Carl C Bell, Wade Berrettini, Phyllis E Bowen, E Sander Connolly, Jr, Nancy Jean Cox, Jacqueline M Dunbar-Jacob, Evelyn C Granieri, Gail Hunt, Kathleen McGarry, Dinesh Patel, Arnold L Potosky, Elaine Sanders-Bush, Donald Silberberg, and Maurizio Trevisan. National Institutes of Health State-of-the-Science Conference statement: preventing Alzheimer disease and cognitive decline. Ann. Intern. Med., 153(3):176–181, August 2010.

[94] Robin de Flores, Renaud La Joie, Brigitte Landeau, Audrey Perrotin, Florence Mézenge, Vincent de La Sayette, Francis Eustache, Béatrice Desgranges, and Gaël Chételat. Effects of age and Alzheimer's disease on hippocampal subfields. Hum. Brain Mapp., 36(2):463–474, February 2015.

[95] Federico De Martino, Giancarlo Valente, Noël Staeren, John Ashburner, Rainer Goebel, and Elia Formisano. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns. Neuroimage, 43(1):44–58, October 2008.

[96] Ivana Despotović, Bart Goossens, and Wilfried Philips. MRI segmentation of the human brain: challenges, methods, and applications. Comput. Math. Methods Med., 2015:450341, March 2015.

[97] Ivana Despotović, Bart Goossens, Ewout Vansteenkiste, and Wilfried Philips. T1- and T2-weighted spatially constrained fuzzy c-means clustering for brain MRI segmentation. In Medical Imaging 2010: Image Processing, volume 7623, page 76231V. International Society for Optics and Photonics, March 2010.

[98] Leyla deToledo Morrell, T R Stoub, M Bulgakova, R S Wilson, D A Bennett, S Leurgans, J Wuu, and D A Turner. MRI-derived entorhinal volume is a good predictor of conversion from MCI to AD. Neurobiol. Aging, 25(9):1197–1203, October 2004.

[99] Bradford C Dickerson, David A Wolk, and Alzheimer's Disease Neuroimaging Initiative. MRI cortical thickness biomarker predicts AD-like CSF and cognitive decline in normal adults. Neurology, 78(2):84–90, January 2012.

[100] Vanderson Dill, Alexandre Rosa Franco, and Márcio Sarroglia Pinho. Automated methods for hippocampus segmentation: the evolution and a review of the state of the art. Neuroinformatics, 13(2):133–150, April 2015.

[101] Jose Dolz, Christian Desrosiers, and Ismail Ben Ayed. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. Neuroimage, 170:456–470, April 2018.

[102] Pedro Domingos. A few useful things to know about machine learning. Commun. ACM, 55(10):78– 87, October 2012.

[103] An-Tao Du, Norbert Schuff, Joel H Kramer, Howard J Rosen, Maria Luisa Gorno-Tempini, Katherine Rankin, Bruce L Miller, and Michael W Weiner. Different regional patterns of cortical thinning in Alzheimer's disease and frontotemporal dementia. Brain: A Journal of Neurology, 2007.

[104] Bruno Dubois, Howard H Feldman, Claudia Jacova, Jeffrey L Cummings, Steven T Dekosky, Pascale Barberger-Gateau, André Delacourte, Giovanni Frisoni, Nick C Fox, Douglas Galasko, Serge Gauthier, Harald Hampel, Gregory A Jicha, Kenichi Meguro, John O'Brien, Florence Pasquier, Philippe Robert, Martin Rossor, Steven Salloway, Marie Sarazin, Leonardo C de Souza, Yaakov Stern, Pieter J Visser, and Philip Scheltens. Revising the definition of Alzheimer's disease: a new lexicon. Lancet Neurol., 9(11):1118–1127, November 2010.

[105] Bruno Dubois, Howard H Feldman, Claudia Jacova, Steven T DeKosky, Pascale Barberger-Gateau, Jeffrey Cummings, André Delacourte, Douglas Galasko, Serge Gauthier, Gregory Jicha, Kenichi Meguro, John O'Brien, Florence Pasquier, Philippe Robert, Martin Rossor, Steven Salloway, Yaakov Stern, Pieter J Visser, and Philip Scheltens. Research criteria for the diagnosis of Alzheimer's disease: revising the NINCDS–ADRDA criteria. Lancet Neurol., 6(8):734–746, August 2007.

[106] Simon Ducharme, Matthew D Albaugh, Tuong-Vi Nguyen, James J Hudziak, J M Mateos-Pérez, Aurelie Labbe, Alan C Evans, Sherif Karama, and Brain Development Cooperative Group. Trajectories of cortical thickness maturation in normal brain development–the importance of quality control procedures. Neuroimage, 125:267–279, January 2016.

[107] Simon Duchesne, Anna Caroli, C Geroldi, Christian Barillot, Giovanni B Frisoni, and D Louis Collins. MRI-based automated computer classification of probable AD versus normal controls. IEEE Trans. Med. Imaging, 2008.

[108] Simon Duchesne, Anna Caroli, Cristina Geroldi, D Louis Collins, and Giovanni B Frisoni. Relating one-year cognitive change in mild cognitive impairment to baseline MRI features. Neuroimage, 2009.

[109] Simon Duchesne, Fernando Valdivia, Nicolas Robitaille, Abderazzak Mouiha, F Abiel Valdivia, Martina Bocchetta, Liana G Apostolova, Rossana Ganzola, Greg Preboske, Dominik Wolf, Marina Boccardi, Clifford R Jack, Jr, Giovanni B Frisoni, and EADC-ADNI Working Group on The Harmonized Protocol for Manual Hippocampal Segmentation and for the Alzheimer's Disease Neuroimaging Initiative. Manual segmentation qualification platform for the EADC-ADNI harmonized protocol for hippocampal segmentation project. Alzheimers. Dement., 11(2):161–174, February 2015.

[110] Henri M. Duvernoy, Françoise Cattin, Pierre Yves Risold, J. L. Vannson, and M. Gaudron. The human hippocampus: Functional anatomy, vascularization and serial sections with MRI, fourth edition. 2013.

[111] Bradley Efron and Robert Tibshirani. Improvements on Cross-Validation: The 632+ bootstrap method. J. Am. Stat. Assoc., 92(438):548–560, June 1997.

[112] Kathryn A Ellis, Ashley I Bush, David Darby, Daniela De Fazio, Jonathan Foster, Peter Hudson, Nicola T Lautenschlager, Nat Lenzo, Ralph N Martins, Paul Maruff, Colin Masters, Andrew Milner, Kerryn Pike, Christopher Rowe, Greg Savage, Cassandra Szoeke, Kevin Taddei, Victor Villemagne, Michael Woodward, David Ames, and AIBL Research Group. The australian imaging, biomarkers and lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease. Int. Psychogeriatr., 21(4):672– 687, August 2009.

[113] Henry Engler, Anton Forsberg, Ove Almkvist, Gunnar Blomquist, Emma Larsson, Irina Savitcheva, Anders Wall, Anna Ringheim, Bengt Långström, and Agneta Nordberg. Two-year follow-up of amyloid deposition in patients with Alzheimer's disease. Brain, 129(Pt 11):2856–2866, November 2006.

[114] Noam U Epstein, Andrew J Saykin, Shannon L Risacher, Sujuan Gao, and Martin R Farlow. Differences in medication use in the Alzheimer's Disease Neuroimaging Initiative: analysis of baseline characteristics. Drugs & Aging, 2010.

[115] Bora Erden, Noah Gamboa, and Sam Wood. 3D convolutional neural network for brain tumor segmentation.

[116] Javier Escudero, John P Zajicek, Emmanuel Ifeachor, and Alzheimer’s Disease Neuroimaging Initiative. Machine learning classification of MRI features of alzheimer’s disease and mild cognitive impairment subjects to reduce the sample size in clinical trials. Conf. Proc. IEEE Eng. Med. Biol. Soc., 2011:7957–7960, 2011.

[117] S F Eskildsen, P Coupe, V S Fonov, J C Pruessner, D L Collins, and Alzheimer's Disease Neuroimaging Initiative. Structural imaging biomarkers of Alzheimer's disease: predicting disease progression. Neurobiol. Aging, 2015.

[118] S F Eskildsen, P Coupe, D Garcia-Lorenzo, V Fonov, J C Pruessner, D L Collins, and Alzheimer's Disease Neuroimaging Initiative. Prediction of Alzheimer's disease in subjects with mild cognitive impairment from the ADNI cohort using patterns of cortical thinning. Neuroimage, 2013.

[119] Simon F Eskildsen, Pierrick Coupé, Vladimir Fonov, José V Manjón, Kelvin K Leung, Nicolas Guizard, Shafik N Wassef, Lasse Riis Østergaard, D Louis Collins, and Alzheimer's Disease Neuroimaging Initiative. BEaST: brain extraction based on nonlocal segmentation technique. Neuroimage, 59(3):2362–2373, February 2012.

[120] Anne M Fagan, Mark A Mintun, Aarti R Shah, Patricia Aldea, Catherine M Roe, Robert H Mach, Daniel Marcus, John C Morris, and David M Holtzman. Cerebrospinal fluid tau and ptau(181) increase with cortical amyloid deposition in cognitively normal individuals: implications for future clinical trials of alzheimer’s disease. EMBO Mol. Med., 1(8-9):371–380, November 2009.

[121] Nicolas Farina, Jennifer Rusted, and Naji Tabet. The effect of exercise interventions on cognitive outcome in alzheimer’s disease: a systematic review. Int. Psychogeriatr., 26(1):9–18, January 2014.

[122] Lindsay A Farrer, L Adrienne Cupples, Jonathan L Haines, Bradley Hyman, Walter A Kukull, Richard Mayeux, Richard H Myers, Margaret A Pericak-Vance, Neil Risch, and Cornelia M van Duijn. Effects of age, sex, and ethnicity on the association between apolipoprotein E genotype and alzheimer disease: A meta-analysis. JAMA, 278(16):1349–1356, October 1997.

[123] Daniel Ferreira, Chloë Verhagen, Juan Andrés Hernández-Cabrera, Lena Cavallin, Chun-Jie Guo, Urban Ekman, J-Sebastian Muehlboeck, Andrew Simmons, José Barroso, Lars-Olof Wahlund, and Eric Westman. Distinct subtypes of Alzheimer's disease based on patterns of brain atrophy: longitudinal trajectories and clinical applications. Sci. Rep., 7:srep46263, April 2017.

[124] B Fischl and A M Dale. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl. Acad. Sci. U. S. A., 97(20):11050–11055, September 2000.

[125] Bruce Fischl, David H Salat, Evelina Busa, Marilyn Albert, Megan Dieterich, Christian Haselgrove, Andre van der Kouwe, Ron Killiany, David Kennedy, Shuna Klaveness, Albert Montillo, Nikos Makris, Bruce Rosen, and Anders M Dale. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron, 33(3):341–355, January 2002.

[126] Bruce Fischl, David H. Salat, Evelina Busa, Marilyn Albert, Megan Dieterich, Christian Hasel- grove, Andre Van Der Kouwe, Ron Killiany, David Kennedy, Shuna Klaveness, Albert Montillo, Nikos Makris, Bruce Rosen, and Anders M. Dale. Whole brain segmentation: Automated labeling of neuroanatomical structures in the human brain. Neuron, 33:341–355, 2002.

[127] M F Folstein, S E Folstein, and P R McHugh. “mini-mental state”. a practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res., 12(3):189–198, November 1975.

[128] Elia Formisano, Federico De Martino, and Giancarlo Valente. Multivariate analysis of fMRI time series: classification and regression of brain responses using machine learning. Magn. Reson. Imaging, 26(7):921–934, September 2008.

[129] Anton Forsberg, Henry Engler, Ove Almkvist, Gunnar Blomquist, Göran Hagman, Anders Wall, Anna Ringheim, Bengt Långström, and Agneta Nordberg. PET imaging of amyloid deposition in patients with mild cognitive impairment. Neurobiol. Aging, 29(10):1456–1465, October 2008.

[130] B K Forscher. Chaos in the brickyard. Science, 142(3590):339, October 1963.

[131] Norbert J Fortin, Kara L Agster, and Howard B Eichenbaum. Critical role of the hippocampus in memory for sequences of events. Nat. Neurosci., 5(5):458–462, May 2002.

[132] N C Fox, E K Warrington, P A Freeborough, P Hartikainen, A M Kennedy, J M Stevens, and M N Rossor. Presymptomatic hippocampal atrophy in alzheimer’s disease. a longitudinal MRI study. Brain, 119 ( Pt 6):2001–2007, December 1996.

[133] Benicio N Frey, Ana C Andreazza, Fabiano G Nery, Marcio R Martins, João Quevedo, Jair C Soares, and Flávio Kapczinski. The role of hippocampus in the pathophysiology of bipolar disorder. Behavioural Pharmacology, 18:419–430, 2007.

[134] Giovanni B Frisoni, Nick C Fox, Clifford R Jack, Jr, Philip Scheltens, and Paul M Thompson. The clinical use of structural MRI in alzheimer disease. Nat. Rev. Neurol., 6(2):67–77, February 2010.

[135] Giovanni B Frisoni, Nick C Fox, Clifford R Jack, Jr, Philip Scheltens, and Paul M Thompson. The clinical use of structural MRI in alzheimer disease. Nat. Rev. Neurol., 6(2):67–77, February 2010.

[136] Giovanni B Frisoni and Clifford R Jack. Harmonization of magnetic resonance-based manual hippocampal segmentation: a mandatory step for wide clinical use. Alzheimers. Dement., 7(2):171– 174, March 2011.

[137] Giovanni B Frisoni, Clifford R Jack, Jr, Martina Bocchetta, Corinna Bauer, Kristian S Frederiksen, Yawu Liu, Gregory Preboske, Tim Swihart, Melanie Blair, Enrica Cavedo, Michel J Grothe, Mariangela Lanfredi, Oliver Martinez, Masami Nishikawa, Marileen Portegies, Travis Stoub, Chadwich Ward, Liana G Apostolova, Rossana Ganzola, Dominik Wolf, Frederik Barkhof, George Bartzokis, Charles DeCarli, John G Csernansky, Leyla deToledo Morrell, Mirjam I Geerlings, Jeffrey Kaye, Ronald J Killiany, Stephane Lehéricy, Hiroshi Matsuda, John O'Brien, Lisa C Silbert, Philip Scheltens, Hilkka Soininen, Stefan Teipel, Gunhild Waldemar, Andreas Fellgiebel, Josephine Barnes, Michael Firbank, Lotte Gerritsen, Wouter Henneman, Nikolai Malykhin, Jens C Pruessner, Lei Wang, Craig Watson, Henrike Wolf, Mony deLeon, Johannes Pantel, Clarissa Ferrari, Paolo Bosco, Patrizio Pasqualetti, Simon Duchesne, Henri Duvernoy, Marina Boccardi, and EADC-ADNI Working Group on The Harmonized Protocol for Manual Hippocampal Volumetry and for the Alzheimer's Disease Neuroimaging Initiative. The EADC-ADNI harmonized protocol for manual hippocampal segmentation on magnetic resonance: evidence of validity. Alzheimers. Dement., 11(2):111–125, February 2015.

[138] Klaus A Ganser, Hartmut Dickhaus, Roland Metzner, and Christian R Wirtz. A deformable digital brain atlas system according to talairach and tournoux. Med. Image Anal., 8(1):3–22, March 2004.

[139] Serge Gauthier, Christopher Patterson, Howard Chertkow, Michael Gordon, Nathan Herrmann, Kenneth Rockwood, Pedro Rosa-Neto, and Jean-Paul Soucy. Recommendations of the 4th Canadian Consensus Conference on the Diagnosis and Treatment of Dementia (CCCDTD4). Can. Geriatr. J., 15(4):120–126, December 2012.

[140] A Geiger, P Lenz, and R Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, June 2012.

[141] E Genin, D Hannequin, D Wallon, K Sleegers, M Hiltunen, O Combarros, M J Bullido, S Engelborghs, P De Deyn, C Berr, F Pasquier, B Dubois, G Tognoni, N Fiévet, N Brouwers, K Bettens, B Arosio, E Coto, M Del Zompo, I Mateo, J Epelbaum, A Frank-Garcia, S Helisalmi, E Porcellini, A Pilotto, P Forti, R Ferri, E Scarpini, G Siciliano, V Solfrizzi, S Sorbi, G Spalletta, F Valdivieso, S Vepsäläinen, V Alvarez, P Bosco, M Mancuso, F Panza, B Nacmias, P Bossù, O Hanon, P Piccardi, G Annoni, D Seripa, D Galimberti, F Licastro, H Soininen, J-F Dartigues, M I Kamboh, C Van Broeckhoven, J C Lambert, P Amouyel, and D Campion. APOE and Alzheimer disease: a major gene with semi-dominant inheritance. Mol. Psychiatry, 16(9):903–907, September 2011.

[142] Emilie Gerardin, Gaël Chételat, Marie Chupin, Rémi Cuingnet, Béatrice Desgranges, Ho-Sung Kim, Marc Niethammer, Bruno Dubois, Stéphane Lehéricy, Line Garnero, Francis Eustache, Olivier Colliot, and Alzheimer's Disease Neuroimaging Initiative. Multidimensional classification of hippocampal shape features discriminates Alzheimer's disease and mild cognitive impairment from normal aging. Neuroimage, 47(4):1476–1486, October 2009.

[143] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, 2016.

[144] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z Ghahramani, M Welling, C Cortes, N D Lawrence, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[145] A Graves, A r. Mohamed, and G Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645– 6649, May 2013.

[146] Katherine R Gray, Paul Aljabar, Rolf A Heckemann, Alexander Hammers, Daniel Rueckert, and Alzheimer's Disease Neuroimaging Initiative. Random forest-based similarity measures for multimodal classification of Alzheimer's disease. Neuroimage, 65:167–175, January 2013.

[147] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. February 2015.

[148] Michael D Greicius and Daniel L Kimmel. Neuroimaging insights into network-based neurodegeneration. Curr. Opin. Neurol., 2012.

[149] Ting Guo, Julie L. Winterburn, Jon Pipitone, Emma G. Duerden, Min Tae M. Park, Vann Chau, Kenneth J. Poskitt, Ruth E. Grunau, Anne Synnes, Steven P. Miller, and M. Mallar Chakravarty. Automatic segmentation of the hippocampus for preterm neonates from early-in-life to term-equivalent age. NeuroImage: Clinical, 9:176–193, 2015.

[150] Ting Guo, Julie L Winterburn, Jon Pipitone, Emma G Duerden, Min Tae M Park, Vann Chau, Kenneth J Poskitt, Ruth E Grunau, Anne Synnes, Steven P Miller, and M Mallar Chakravarty. Automatic segmentation of the hippocampus for preterm neonates from early-in-life to term-equivalent age. Neuroimage Clin, 9:176–193, August 2015.

[151] Christian Habeck, Yaakov Stern, and Alzheimer’s Disease Neuroimaging Initiative. Multivariate data analysis for neuroimaging data: overview and application to alzheimer’s disease. Cell Biochem. Biophys., 58(2):53–67, November 2010.

[152] Yongfu Hao, Tianyao Wang, Xinqing Zhang, Yunyun Duan, Chunshui Yu, Tianzi Jiang, and Yong Fan. Local label learning (LLL) for subcortical structure segmentation: Application to hippocampus segmentation. Human Brain Mapping, 35:2674–2697, 2014.

[153] Yongfu Hao, Tianyao Wang, Xinqing Zhang, Yunyun Duan, Chunshui Yu, Tianzi Jiang, Yong Fan, and Alzheimer's Disease Neuroimaging Initiative. Local label learning (LLL) for subcortical structure segmentation: application to hippocampus segmentation. Hum. Brain Mapp., 35(6):2674–2697, June 2014.

[154] Robert M Haralick and Linda G Shapiro. Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29(1):100–132, January 1985.

[155] J A Hardy and G A Higgins. Alzheimer’s disease: the amyloid cascade hypothesis. Science, 256(5054):184–185, April 1992.

[156] John Hardy and Dennis J Selkoe. The amyloid hypothesis of alzheimer’s disease: progress and problems on the road to therapeutics. Science, 297(5580):353–356, July 2002.

[157] Paul J Harrison. The hippocampus in schizophrenia: a review of the neuropathological evidence and its pathophysiological implications. Psychopharmacology, 174:151–162, 2004.

[158] John-Dylan Haynes and Geraint Rees. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci., 7(7):523–534, July 2006.

[159] K He, X Zhang, S Ren, and J Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[160] Rolf A Heckemann, Joseph V Hajnal, Paul Aljabar, Daniel Rueckert, and Alexander Hammers. Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. Neuroimage, 33(1):115–126, October 2006.

[161] Rolf a Heckemann, Joseph V Hajnal, Paul Aljabar, Daniel Rueckert, and Alexander Hammers. Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. NeuroImage, 33(1):115–26, oct 2006.

[162] Rolf A. Heckemann, Shiva Keihaninejad, Paul Aljabar, Katherine R. Gray, Casper Nielsen, Daniel Rueckert, Joseph V. Hajnal, and Alexander Hammers. Automatic morphometry in Alzheimer’s disease and mild cognitive impairment. NeuroImage, 56:2024–2037, 2011.

[163] W J P Henneman, J D Sluimer, J Barnes, W M van der Flier, I C Sluimer, N C Fox, P Scheltens, H Vrenken, and F Barkhof. Hippocampal atrophy rates in alzheimer disease: added value over whole brain volume measures. Neurology, 72(11):999–1007, March 2009.

[164] Chris Hinrichs, Vikas Singh, Guofan Xu, Sterling C Johnson, and Alzheimer's Disease Neuroimaging Initiative. Predictive markers for AD in a multi-modality framework: an analysis of MCI progression in the ADNI population. Neuroimage, 55(2):574–589, March 2011.

[165] G E Hinton and R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.

[166] Geoffrey E Hinton. Learning multiple layers of representation. Trends Cogn. Sci., 11(10):428–434, October 2007.

[167] Geoffrey E Hinton. Machine learning for neuroscience. Neural Syst. Circuits, 1(1):12, August 2011.

[168] E Hosseini-Asl, R Keynton, and A El-Baz. Alzheimer’s disease diagnostics by adaptation of 3D convolutional network. In 2016 IEEE International Conference on Image Processing (ICIP), pages 126–130, 2016.

[169] I-Chan Huang, Constantine Frangakis, Mark J Atkinson, Richard J Willke, Walter L Leite, W Bruce Vogel, and Albert W Wu. Addressing ceiling effects in health status measures: a comparison of techniques applied to measures for people with HIV disease. Health Serv. Res., 43(1 Pt 1):327–339, February 2008.

[170] Meiyan Huang, Wei Yang, Qianjin Feng, and Wufan Chen. Longitudinal measurement and hierarchical classification framework for the prediction of Alzheimer's disease. Sci. Rep., 7:39880, January 2017.

[171] P R Huttenlocher. Morphometric study of human cerebral cortex development. Neuropsychologia, 28(6):517–527, 1990.

[172] Quentin J M Huys, Tiago V Maia, and Michael J Frank. Computational psychiatry as a bridge from neuroscience to clinical applications. Nat. Neurosci., 19:404, February 2016.

[173] K Iqbal, F Liu, C-X Gong, and I Grundke-Iqbal. Tau in alzheimer disease and related tauopathies. Curr. Alzheimer Res., 7(8):656–664, December 2010.

[174] K Ishiguro, H Ohno, H Arai, H Yamaguchi, K Urakami, J M Park, K Sato, H Kohno, and K Imahori. Phosphorylated tau in human cerebrospinal fluid is a diagnostic marker for alzheimer’s disease. Neurosci. Lett., 270(2):91–94, July 1999.

[175] Y Iturria-Medina, R C Sotero, P J Toussaint, J M Mateos-Pérez, A C Evans, and Alzheimer's Disease Neuroimaging Initiative. Early role of vascular dysregulation on late-onset Alzheimer's disease based on multifactorial data-driven analysis. Nat. Commun., 7:11934, June 2016.

[176] C R Jack, Jr, R C Petersen, Y C Xu, P C O’Brien, G E Smith, R J Ivnik, B F Boeve, S C Waring, E G Tangalos, and E Kokmen. Prediction of AD with MRI-based hippocampal volume in mild cognitive impairment. Neurology, 52(7):1397–1403, April 1999.

[177] C R Jack, Jr, R C Petersen, Y C Xu, S C Waring, P C O’Brien, E G Tangalos, G E Smith, R J Ivnik, and E Kokmen. Medial temporal atrophy on MRI in normal aging and very mild alzheimer’s disease. Neurology, 49(3):786–794, September 1997.

[178] Clifford R Jack, Frederik Barkhof, Matt A Bernstein, Marc Cantillon, Patricia E Cole, Charles DeCarli, Bruno Dubois, Simon Duchesne, Nick C Fox, Giovanni B Frisoni, Harald Hampel, Derek L G Hill, Keith Johnson, Jean-François Mangin, Philip Scheltens, Adam J Schwarz, Reisa Sperling, Joyce Suhy, Paul M Thompson, Michael Weiner, and Norman L Foster. Steps to standardization and validation of hippocampal volumetry as a biomarker in clinical trials and diagnostic criterion for Alzheimer's disease. Alzheimers. Dement., 7(4):474–485.e4, 2011.

[179] Clifford R. Jack, Frederik Barkhof, Matt A. Bernstein, Marc Cantillon, Patricia E. Cole, Charles Decarli, Bruno Dubois, Simon Duchesne, Nick C. Fox, Giovanni B. Frisoni, Harald Hampel, Derek L G Hill, Keith Johnson, Jean-François Mangin, Philip Scheltens, Adam J. Schwarz, Reisa Sperling, Joyce Suhy, Paul M. Thompson, Michael Weiner, and Norman L. Foster. Steps to standardization and validation of hippocampal volumetry as a biomarker in clinical trials and diagnostic criterion for Alzheimer's disease, 2011.

[180] Clifford R Jack, Ronald C Petersen, Peter C O'Brien, and Eric G Tangalos. MR-based hippocampal volumetry in the diagnosis of Alzheimer's disease. Neurology, 42(1):183–183, January 1992.

[181] Clifford R Jack, Jr, Marilyn S Albert, David S Knopman, Guy M McKhann, Reisa A Sperling, Maria C Carrillo, Bill Thies, and Creighton H Phelps. Introduction to the recommendations from the national institute on Aging-Alzheimer’s association workgroups on diagnostic guidelines for alzheimer’s disease. Alzheimers. Dement., 7(3):257–262, May 2011.

[182] Clifford R Jack, Jr, David S Knopman, William J Jagust, Ronald C Petersen, Michael W Weiner, Paul S Aisen, Leslie M Shaw, Prashanthi Vemuri, Heather J Wiste, Stephen D Weigand, Timothy G Lesnick, Vernon S Pankratz, Michael C Donohue, and John Q Trojanowski. Tracking pathophysiological processes in Alzheimer's disease: an updated hypothetical model of dynamic biomarkers. Lancet Neurol., 12(2):207–216, February 2013.

[183] Clifford R Jack, Jr, Val J Lowe, Stephen D Weigand, Heather J Wiste, Matthew L Senjem, David S Knopman, Maria M Shiung, Jeffrey L Gunter, Bradley F Boeve, Bradley J Kemp, Michael Weiner, Ronald C Petersen, and Alzheimer’s Disease Neuroimaging Initiative. Serial PIB and MRI in normal, mild cognitive impairment and alzheimer’s disease: implications for sequence of pathological events in alzheimer’s disease. Brain, 132(Pt 5):1355–1365, May 2009.

[184] Clifford R Jack, Jr., Heather J Wiste, Stephen D Weigand, Terry M Therneau, Val J Lowe, David S Knopman, Jeffrey L Gunter, Matthew L Senjem, David T Jones, Kejal Kantarci, Mary M Machulda, Michelle M Mielke, Rosebud O Roberts, Prashanthi Vemuri, Denise A Reyes, and Ronald C Petersen. Defining imaging biomarker cut points for brain aging and alzheimer’s disease. Alzheimers. Dement., 13(3):205–216, March 2017.

[185] G P Jarvik, E M Wijsman, W A Kukull, G D Schellenberg, and others. Interactions of apolipoprotein E genotype, total cholesterol level, age, and sex in prediction of Alzheimer's disease: a case–control study. Neurology, 1995.

[186] M Jenkinson and S Smith. A global optimisation method for robust affine registration of brain images. Med. Image Anal., 5(2):143–156, June 2001.

[187] S E Jones, B R Buchbinder, and I Aharon. Three-dimensional mapping of cortical thickness using Laplace's equation. Hum. Brain Mapp., 2000.

[188] M Jorge Cardoso, Kelvin Leung, Marc Modat, Shiva Keihaninejad, David Cash, Josephine Barnes, Nick C Fox, and Sebastien Ourselin. STEPS: Similarity and Truth Estimation for Propagated Segmentations and its application to hippocampal segmentation and brain parcelation. Medical image analysis, 17(6):671–84, aug 2013.

[189] M Jorge Cardoso, Kelvin Leung, Marc Modat, Shiva Keihaninejad, David Cash, Josephine Barnes, Nick C Fox, Sebastien Ourselin, and Alzheimer's Disease Neuroimaging Initiative. STEPS: Similarity and truth estimation for propagated segmentations and its application to hippocampal segmentation and brain parcelation. Med. Image Anal., 17(6):671–684, August 2013.

[190] Noor Jehan Kabani. 3D anatomical atlas of the human brain. Neuroimage, 7:P–0717, 1998.

[191] Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, April 2017.

[192] Matthew J Kempton, Zainab Salvador, Marcus R Munafò, John R Geddes, Andrew Simmons, Sophia Frangou, and Steven C R Williams. Structural neuroimaging studies in major depressive disorder. Meta-analysis and comparison with bipolar disorder. Archives of General Psychiatry, 68:675–690, 2011.

[193] Budhachandra S Khundrakpam, Jussi Tohka, Alan C Evans, and Brain Development Cooperative Group. Prediction of brain maturity based on cortical thickness at different spatial resolutions. Neuroimage, 111:350–359, May 2015.

[194] R J Killiany, T Gomez-Isla, M Moss, R Kikinis, T Sandor, F Jolesz, R Tanzi, K Jones, B T Hyman, and M S Albert. Use of structural magnetic resonance imaging to predict who will get alzheimer’s disease. Ann. Neurol., 47(4):430–439, April 2000.

[195] R J Killiany, M B Moss, M S Albert, T Sandor, J Tieman, and F Jolesz. Temporal lobe regions on magnetic resonance imaging identify patients with early alzheimer’s disease. Arch. Neurol., 50(9):949–954, September 1993.

[196] June Sic Kim, Vivek Singh, Jun Ki Lee, Jason Lerch, Yasser Ad-Dab’bagh, David MacDonald, Jong Min Lee, Sun I Kim, and Alan C Evans. Automated 3-D extraction and evaluation of the inner and outer cortical surfaces using a laplacian map and partial volume effect classification. Neuroimage, 27(1):210–221, August 2005.

[197] Taeho Kim, George S Vidal, Maja Djurisic, Christopher M William, Michael E Birnbaum, K Christopher Garcia, Bradley T Hyman, and Carla J Shatz. Human LilrB2 is a β-amyloid receptor and its murine homolog PirB regulates synaptic plasticity in an Alzheimer's model. Science, 341(6152):1399–1404, September 2013.

[198] Y Kim, H Lee, and E M Provost. Deep learning for robust feature generation in audiovisual emotion recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3687–3691, May 2013.

[199] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying Visual-Semantic embeddings with multimodal neural language models. November 2014.

[200] Arno Klein, Jesper Andersson, Babak A Ardekani, John Ashburner, Brian Avants, Ming-Chang Chiang, Gary E Christensen, D Louis Collins, James Gee, Pierre Hellier, Joo Hyun Song, Mark Jenkinson, Claude Lepage, Daniel Rueckert, Paul Thompson, Tom Vercauteren, Roger P Woods, J John Mann, and Ramin V Parsey. Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. Neuroimage, 46(3):786–802, July 2009.

[201] Arno Klein, Brett Mensh, Satrajit Ghosh, Jason Tourville, and Joy Hirsch. Mindboggle: automated brain labeling with multiple atlases. BMC Med. Imaging, 5:7, October 2005.

[202] Stefan Klöppel, Cynthia M Stonnington, Carlton Chu, Bogdan Draganski, Rachael I Scahill, Jonathan D Rohrer, Nick C Fox, Clifford R Jack, John Ashburner, and Richard S J Frackowiak. Automatic classification of MR scans in Alzheimer's disease. Brain, 131(3):681–689, March 2008.

[203] G Koch and R Zemel. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.

[204] Omid Kohannim, Xue Hua, Derrek P Hibar, Suh Lee, Yi-Yu Chou, Arthur W Toga, Clifford R Jack, Jr, Michael W Weiner, Paul M Thompson, and Alzheimer’s Disease Neuroimaging Initiative. Boosting power for clinical trials using classifiers based on multiple biomarkers. Neurobiol. Aging, 31(8):1429–1442, August 2010.

[205] Ron Kohavi and Others. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, volume 14, pages 1137–1145, 1995.

[206] C Konrad, T Ukas, C Nebel, V Arolt, A W Toga, and K L Narr. Defining the human hippocampus in cerebral magnetic resonance images–an overview of current segmentation protocols. Neuroimage, 47(4):1185–1195, October 2009.

[207] Igor O Korolev, Laura L Symonds, and Andrea C Bozoki. Predicting progression from mild cognitive impairment to Alzheimer’s dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLoS One, 11(2), 2016.

[208] Sergey Korolev, Amir Safiullin, Mikhail Belyaev, and Yulia Dodonova. Residual and plain convolutional neural networks for 3D brain MRI classification. January 2017.

[209] P J Kostelec and S Periaswamy. Image registration for MRI. Modern signal processing, 2003.

[210] Nikolaus Kriegeskorte, W Kyle Simmons, Patrick S F Bellgowan, and Chris I Baker. Circular analysis in systems neuroscience: the dangers of double dipping. Nat. Neurosci., 12(5):535–540, May 2009.

[211] Karl Krissian and Santiago Aja-Fernández. Noise-driven anisotropic diffusion filtering of MRI. IEEE Transactions on Image Processing, 18(10):2265–2274, 2009.

[212] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[213] Renaud La Joie, Marine Fouquet, Florence Mézenge, Brigitte Landeau, Nicolas Villain, Katell Mevel, Alice Pélerin, Francis Eustache, Béatrice Desgranges, and Gaël Chételat. Differential effect of age on hippocampal subfields assessed using a new high-resolution 3T MR sequence. Neuroimage, 53(2):506–514, November 2010.

[214] Renaud La Joie, Audrey Perrotin, Vincent de La Sayette, Stéphanie Egret, Loïc Doeuvre, Serge Belliard, Francis Eustache, Béatrice Desgranges, and Gaël Chételat. Hippocampal subfield volumetry in mild cognitive impairment, Alzheimer's disease and semantic dementia. Neuroimage Clin, 3:155–162, August 2013.

[215] Renaud La Joie, Audrey Perrotin, Vincent de La Sayette, Stéphanie Egret, Loïc Doeuvre, Serge Belliard, Francis Eustache, Béatrice Desgranges, and Gaël Chételat. Hippocampal subfield volumetry in mild cognitive impairment, Alzheimer's disease and semantic dementia. NeuroImage: Clinical, 3:155–162, 2013.

[216] Benjamin Lam, Mario Masellis, Morris Freedman, Donald T Stuss, and Sandra E Black. Clinical, imaging, and pathological heterogeneity of the alzheimer’s disease syndrome. Alzheimers. Res. Ther., 5(1):1, January 2013.

[217] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[218] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.

[219] Manja Lehmann, Pia M Ghosh, Cindee Madison, Robert Laforce, Chiara Corbetta-Rastelli, Michael W Weiner, Michael D Greicius, William W Seeley, Maria L Gorno-Tempini, Howard J Rosen, Bruce L Miller, William J Jagust, and Gil D Rabinovici. Diverging patterns of amyloid deposition and hypometabolism in clinical variants of probable Alzheimer’s disease. Brain, 2013.

[220] Steven Lemm, Benjamin Blankertz, Thorsten Dickhaus, and Klaus-Robert M¨uller. Introduction to machine learning for brain imaging. Neuroimage, 56(2):387–399, May 2011.

[221] Jason P Lerch and Alan C Evans. Cortical thickness analysis examined through power analysis and a population simulation. Neuroimage, 24(1):163–173, January 2005.

[222] Jason P Lerch, Jens Pruessner, Alex P Zijdenbos, D Louis Collins, Stefan J Teipel, Harald Hampel, and Alan C Evans. Automated cortical thickness measurements from MRI can accurately separate alzheimer’s patients from normal elderly controls. Neurobiol. Aging, 29(1):23–30, January 2008.

[223] Jason P. Lerch, Jens Pruessner, Alex P. Zijdenbos, D. Louis Collins, Stefan J. Teipel, Harald Hampel, and Alan C. Evans. Automated cortical thickness measurements from MRI can accurately separate Alzheimer’s patients from normal elderly controls. Neurobiology of Aging, 29:23–30, 2008.

[224] Jason P Lerch, Jens C Pruessner, Alex Zijdenbos, Harald Hampel, Stefan J Teipel, and Alan C Evans. Focal decline of cortical thickness in Alzheimer's disease identified by computational neuroanatomy. Cereb. Cortex, 15(7):995–1001, July 2005.

[225] Jason P Lerch, Jens C Pruessner, Alex Zijdenbos, Harald Hampel, Stefan J Teipel, and Alan C Evans. Focal decline of cortical thickness in Alzheimer's disease identified by computational neuroanatomy. Cereb. Cortex, 15(7):995–1001, July 2005.

[226] Jason P Lerch, André J W van der Kouwe, Armin Raznahan, Tomáš Paus, Heidi Johansen-Berg, Karla L Miller, Stephen M Smith, Bruce Fischl, and Stamatios N Sotiropoulos. Studying neuroanatomy using MRI. Nat. Neurosci., 20(3):314–326, February 2017.

[227] Kelvin K Leung, Josephine Barnes, Gerard R Ridgway, Jonathan W Bartlett, Matthew J Clarkson, Kate Macdonald, Norbert Schuff, Nick C Fox, and Sebastien Ourselin. Automated cross-sectional and longitudinal hippocampal volume measurement in mild cognitive impairment and Alzheimer’s disease. NeuroImage, 51(4):1345–59, jul 2010.

[228] H LeVine, III. Alzheimer’s disease and the β-amyloid peptide. J. Alzheimers. Dis., 2010.

[229] Bing Nan Li, Chee Kong Chui, Stephen Chang, and S H Ong. Integrating spatial fuzzy clustering with level set methods for automated medical image segmentation. Comput. Biol. Med., 41(1):1–10, January 2011.

[230] Yang Li, Yaping Wang, Guorong Wu, Feng Shi, Luping Zhou, Weili Lin, and Dinggang Shen. Discriminant analysis of longitudinal cortical thickness changes in Alzheimer's disease using dynamic and network features. Neurobiol. Aging, 33(2):427.e15–427.e30, February 2012.

[231] Yang Li, Yaping Wang, Guorong Wu, Feng Shi, Luping Zhou, Weili Lin, Dinggang Shen, and The Alzheimer’s Disease Neuroimaging Initiative. Discriminant analysis of longitudinal cortical thickness changes in Alzheimer’s disease using dynamic and network features. Neurobiol. Aging, 2011.

[232] Chia-Chen Liu, Chia-Chan Liu, Takahisa Kanekiyo, Huaxi Xu, and Guojun Bu. Apolipoprotein E and alzheimer disease: risk, mechanisms and therapy. Nat. Rev. Neurol., 9(2):106–118, February 2013.

[233] Fayao Liu, Luping Zhou, Chunhua Shen, and Jianping Yin. Multiple kernel learning in the primal for multimodal alzheimer’s disease classification. IEEE J Biomed Health Inform, 18(3):984–990, May 2014.

[234] Chris Loken, Daniel Gruner, Leslie Groer, Richard Peltier, Neil Bunn, Michael Craig, Teresa Henriques, Jillian Dempsey, Ching-Hsing Yu, Joseph Chen, L Jonathan Dursi, Jason Chong, Scott Northrup, Jaime Pinto, Neil Knecht, and Ramses Van Zon. SciNet: Lessons Learned from Building a Power-efficient Top-20 System and Data Centre, 2010.

[235] Jyrki M. P. Lötjönen, Robin Wolz, Juha R. Koikkalainen, Lennart Thurfjell, Gunhild Waldemar, Hilkka Soininen, and Daniel Rueckert. Fast and robust multi-atlas segmentation of brain magnetic resonance images. NeuroImage, 49:2352–2365, 2010.

[236] Jyrki M. P. Lötjönen, Robin Wolz, Juha R Koikkalainen, Lennart Thurfjell, Gunhild Waldemar, Hilkka Soininen, Daniel Rueckert, and Alzheimer's Disease Neuroimaging Initiative. Fast and robust multi-atlas segmentation of brain magnetic resonance images. Neuroimage, 49(3):2352–2365, February 2010.

[237] D MacDonald, N Kabani, D Avis, and A C Evans. Automated 3-D extraction of inner and outer surfaces of cerebral cortex from MRI. Neuroimage, 12(3):340–356, September 2000.

[238] E A Maguire, D G Gadian, I S Johnsrude, C D Good, J Ashburner, R S Frackowiak, and C D Frith. Navigation-related structural change in the hippocampi of taxi drivers. Proc. Natl. Acad. Sci. U. S. A., 97(8):4398–4403, April 2000.

[239] Ashok Malla, Ross Norman, Terry McLean, Derek Scholten, and Laurel Townsend. A Canadian programme for early intervention in non-affective psychotic disorders, 2003.

[240] José V Manjón, Pierrick Coupé, Luis Martí-Bonmatí, D. Louis Collins, and Montserrat Robles. Adaptive non-local means denoising of MR images with spatially varying noise levels. Journal of Magnetic Resonance Imaging, 31(1):192–203, 2010.

[241] John H Martin. Neuroanatomy Text and Atlas, volume XXXIII. 2012.

[242] J Mazziotta, A Toga, A Evans, P Fox, J Lancaster, K Zilles, R Woods, T Paus, G Simpson, B Pike, C Holmes, L Collins, P Thompson, D MacDonald, M Iacoboni, T Schormann, K Amunts, N Palomero-Gallagher, S Geyer, L Parsons, K Narr, N Kabani, G Le Goualher, D Boomsma, T Cannon, R Kawashima, and B Mazoyer. A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM). Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 356:1293–1322, 2001.

[243] J C Mazziotta, A W Toga, A Evans, P Fox, and J Lancaster. A probabilistic atlas of the human brain: theory and rationale for its development. The International Consortium for Brain Mapping (ICBM). NeuroImage, 2:89–101, 1995.

[244] Martin J McKeown, Lars Kai Hansen, and Terrence J Sejnowsk. Independent component analysis of functional MRI: what is signal and what is noise? Curr. Opin. Neurobiol., 13(5):620–629, October 2003.

[245] G McKhann, D Drachman, M Folstein, R Katzman, and others. Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology, 1984.

[246] Guy M McKhann, David S Knopman, Howard Chertkow, Bradley T Hyman, Clifford R Jack, Claudia H Kawas, William E Klunk, Walter J Koroshetz, Jennifer J Manly, Richard Mayeux, Richard C Mohs, John C Morris, Martin N Rossor, Philip Scheltens, Maria C Carrillo, Bill Thies, Sandra Weintraub, and Creighton H Phelps. The diagnosis of dementia due to Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimers. Dement., 7(3):263–269, May 2011.

[247] Shashwath A. Meda, Mary Ellen I Koran, Jennifer R. Pryweller, Jennifer N. Vega, and Tricia A. Thornton-Wells. Genetic interactions associated with 12-month atrophy in hippocampus and entorhinal cortex in Alzheimer’s Disease Neuroimaging Initiative. Neurobiology of Aging, 34, 2013.

[248] Michael S Mega. The cholinergic deficit in Alzheimer’s disease: impact on cognition, behaviour and function. Int. J. Neuropsychopharmacol., 3(7):3–12, July 2000.

[249] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. June 2016.

[250] Chandan Misra, Yong Fan, and Christos Davatzikos. Baseline and longitudinal patterns of brain atrophy in MCI patients, and their use in prediction of short-term conversion to AD: Results from ADNI. Neuroimage, 44(4):1415–1422, 2009.

[251] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[252] Elaheh Moradi, Antonietta Pepe, Christian Gaser, Heikki Huttunen, and Jussi Tohka. Machine learning framework for early MRI-based Alzheimer’s conversion prediction in MCI subjects. Neuroimage, 104:398–412, 2015.

[253] Elaheh Moradi, Antonietta Pepe, Christian Gaser, Heikki Huttunen, Jussi Tohka, and Alzheimer’s Disease Neuroimaging Initiative. Machine learning framework for early MRI-based Alzheimer’s conversion prediction in MCI subjects. Neuroimage, 104:398–412, January 2015.

[254] Jonathan H. Morra, Zhuowen Tu, Liana G. Apostolova, Amity E. Green, Christina Avedissian, Sarah K. Madsen, Neelroop Parikshak, Xue Hua, Arthur W. Toga, Clifford R. Jack, Michael W. Weiner, and Paul M. Thompson. Validation of a fully automated 3D hippocampal segmentation method using subjects with Alzheimer’s disease, mild cognitive impairment, and elderly controls. NeuroImage, 43(1):59–68, 2008.

[255] J C Morris, K Blennow, L Froelich, A Nordberg, H Soininen, G Waldemar, L-O Wahlund, and B Dubois. Harmonized diagnostic criteria for Alzheimer’s disease: recommendations. J. Intern. Med., 275(3):204–213, March 2014.

[256] Abderazzak Mouiha and Simon Duchesne. Hippocampal atrophy rates in Alzheimer’s disease: automated segmentation variability analysis. Neuroscience letters, 495:6–10, 2011.

[257] Abderazzak Mouiha, Simon Duchesne, and Alzheimer’s Disease Neuroimaging Initiative. Hippocampal atrophy rates in Alzheimer’s disease: automated segmentation variability analysis. Neurosci. Lett., 495(1):6–10, May 2011.

[258] V B Mountcastle. The columnar organization of the neocortex. Brain, 120(Pt 4):701–722, April 1997.

[259] Loren Mowszowski, Jennifer Batchelor, and Sharon L Naismith. Early intervention for cognitive decline: can cognitive training be used as a selective prevention technique? Int. Psychogeriatr., 22(4):537–548, June 2010.

[260] Susanne G Mueller, Norbert Schuff, Kristine Yaffe, Catherine Madison, Bruce Miller, and Michael W Weiner. Hippocampal atrophy patterns in mild cognitive impairment and Alzheimer’s disease. Hum. Brain Mapp., 31(9):1339–1347, September 2010.

[261] Susanne G Mueller and Michael W Weiner. Selective effect of age, apo e4, and Alzheimer’s disease on hippocampal subfields. Hippocampus, 19(6):558–564, June 2009.

[262] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford R Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 2005.

[263] Melissa E Murray, Neill R Graff-Radford, Owen A Ross, Ronald C Petersen, Ranjan Duara, and Dennis W Dickson. Neuropathologically defined subtypes of Alzheimer’s disease with distinct clinical characteristics: A retrospective study. Lancet Neurol., 10(9):785–796, 2011.

[264] Benson Mwangi, Tian Siva Tian, and Jair C Soares. A review of feature reduction techniques in neuroimaging. Neuroinformatics, 12(2):229–244, April 2014.

[265] Thomas Naselaris, Kendrick N Kay, Shinji Nishimoto, and Jack L Gallant. Encoding and decoding in fMRI. Neuroimage, 56(2):400–410, May 2011.

[266] Guilherme Neves, Sam F Cooke, and Tim V P Bliss. Synaptic plasticity, memory and the hippocampus: a neural network approach to causality. Nat. Rev. Neurosci., 9(1):65–75, January 2008.

[267] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 689–696, USA, 2011. Omnipress.

[268] D G Nishimura. Principles of magnetic resonance imaging. 1996.

[269] Young Noh, Seun Jeon, Jong Min Lee, Sang Won Seo, Geon Ha Kim, Hanna Cho, Byoung Seok Ye, Cindy W Yoon, Hee Jin Kim, Juhee Chin, Kee Hyung Park, Kenneth M Heilman, and Duk L Na. Anatomical heterogeneity of Alzheimer disease. Neurology, 2014.

[270] Sid E O’Bryant, Joy D Humphreys, Glenn E Smith, Robert J Ivnik, Neill R Graff-Radford, Ronald C Petersen, and John A Lucas. Detecting dementia with the mini-mental state examination in highly educated individuals. Arch. Neurol., 65(7):963–967, July 2008.

[271] Hanna Öhman, Niina Savikko, Timo E Strandberg, and Kaisu H Pitkälä. Effect of physical exercise on cognitive performance in older adults with mild cognitive impairment or dementia: a systematic review. Dement. Geriatr. Cogn. Disord., 38(5-6):347–365, August 2014.

[272] J O’Keefe and J Dostrovsky. The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34(1):171–175, November 1971.

[273] Matthew S Panizzon, Christine Fennema-Notestine, Lisa T Eyler, Terry L Jernigan, Elizabeth Prom-Wormley, Michael Neale, Kristen Jacobson, Michael J Lyons, Michael D Grant, Carol E Franz, Hong Xian, Ming Tsuang, Bruce Fischl, Larry Seidman, Anders Dale, and William S Kremen. Distinct genetic influences on cortical surface area and cortical thickness. Cereb. Cortex, 19(11):2728–2735, November 2009.

[274] Adrien Payan and Giovanni Montana. Predicting Alzheimer’s disease: a neuroimaging study with 3D convolutional neural networks. February 2015.

[275] Richard J Perrin, Anne M Fagan, and David M Holtzman. Multimodal techniques for diagnosis and prognosis of Alzheimer’s disease. Nature, 461:916, October 2009.

[276] D L Pham, C Xu, and J L Prince. Current methods in medical image segmentation. Annu. Rev. Biomed. Eng., 2:315–337, 2000.

[277] Jon Pipitone, Min Tae M Park, Julie Winterburn, Tristram A Lett, Jason P Lerch, Jens C Pruessner, Martin Lepage, Aristotle N Voineskos, and M Mallar Chakravarty. Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. NeuroImage, 101:494–512, April 2014.

[278] Jon Pipitone, Min Tae M Park, Julie Winterburn, Tristram A Lett, Jason P Lerch, Jens C Pruessner, Martin Lepage, Aristotle N Voineskos, M Mallar Chakravarty, and Alzheimer’s Disease Neuroimaging Initiative. Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage, 101:494–512, November 2014.

[279] John Platt and others. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[280] Sergey M Plis, Devon R Hjelm, Ruslan Salakhutdinov, Elena A Allen, Henry J Bockholt, Jeffrey D Long, Hans J Johnson, Jane S Paulsen, Jessica A Turner, and Vince D Calhoun. Deep learning for neuroimaging: a validation study. Front. Neurosci., 8:229, 2014.

[281] Kilian M Pohl, John Fisher, W Eric L Grimson, Ron Kikinis, and William M Wells. A Bayesian model for joint segmentation and registration. Neuroimage, 31(1):228–239, May 2006.

[282] R. C. Prim. Shortest Connection Networks And Some Generalizations. Bell System Technical Journal, 36(6):1389–1401, nov 1957.

[283] Martin Prince, Renata Bryce, and Cleusa Ferri. World Alzheimer Report 2011: The benefits of early diagnosis and intervention. Alzheimer’s Disease International, 2011.

[284] J C Pruessner, L M Li, W Serles, M Pruessner, D L Collins, N Kabani, S Lupien, and A C Evans. Volumetry of hippocampus and amygdala with high-resolution MRI and three-dimensional analysis software: minimizing the discrepancies between laboratories. Cereb. Cortex, 10(4):433–442, April 2000.

[285] J C Pruessner, L M Li, W Serles, M Pruessner, D L Collins, N Kabani, S Lupien, and A C Evans. Volumetry of hippocampus and amygdala with high-resolution MRI and three-dimensional analysis software: minimizing the discrepancies between laboratories. Cerebral cortex (New York, N.Y. : 1991), 10(4):433–42, apr 2000.

[286] Olivier Querbes, Florent Aubry, Jérémie Pariente, Jean-Albert Lotterie, Jean-François Démonet, Véronique Duret, Michèle Puel, Isabelle Berry, Jean-Claude Fort, Pierre Celsis, and Alzheimer’s Disease Neuroimaging Initiative. Early diagnosis of Alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain, 132(Pt 8):2036–2047, August 2009.

[287] P Rakic. Specification of cerebral cortical areas. Science, 241(4862):170–176, July 1988.

[288] Alberto Redolfi, David Manset, Frederik Barkhof, Lars-Olof Wahlund, Tristan Glatard, Jean-François Mangin, Giovanni B Frisoni, and neuGRID Consortium, for the Alzheimer’s Disease Neuroimaging Initiative. Head-to-head comparison of two popular cortical thickness extraction algorithms: a cross-sectional and longitudinal study. PLoS One, 10(3):e0117692, March 2015.

[289] Barry Reisberg, Rachelle Doody, Albrecht Stöffler, Frederick Schmitt, Steven Ferris, and Hans Jörg Möbius. Memantine in Moderate-to-Severe Alzheimer’s Disease. New England Journal of Medicine, 2003.

[290] Christiane Reitz. Alzheimer’s disease and the amyloid cascade hypothesis: a critical review. Int. J. Alzheimers. Dis., 2012:369808, March 2012.

[291] S L Risacher, A J Saykin, J D West, L Shen, H A Firpi, B C McDonald, and Alzheimer’s Disease Neuroimaging Initiative. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Curr Alzheimer Res, 6(4):347–361, 2009.

[292] S L Risacher, A J Saykin, J D West, L Shen, H A Firpi, B C McDonald, and Alzheimer’s Disease Neuroimaging Initiative. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Curr. Alzheimer Res., 6(4):347–361, 2009.

[293] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. May 2015.

[294] W G Rosen, R C Mohs, and K L Davis. A new rating scale for Alzheimer’s disease. Am. J. Psychiatry, 141(11):1356–1364, November 1984.

[295] François Rousseau, Piotr A. Habas, and Colin Studholme. A supervised patch-based approach for human brain labeling. IEEE Transactions on Medical Imaging, 30:1852–1862, 2011.

[296] François Rousseau, Piotr A Habas, and Colin Studholme. A supervised patch-based approach for human brain labeling. IEEE Trans. Med. Imaging, 30(10):1852–1862, October 2011.

[297] Peter J Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20:53–65, November 1987.

[298] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533, October 1986.

[299] Mert R Sabuncu, Rahul S Desikan, Jorge Sepulcre, Boon Thye T Yeo, Hesheng Liu, Nicholas J Schmansky, Martin Reuter, Michael W Weiner, Randy L Buckner, Reisa A Sperling, and Bruce Fischl. The Dynamics of Cortical and Hippocampal Atrophy in Alzheimer Disease. Arch. Neurol., 68(8):1040–1048, 2011.

[300] Mert R Sabuncu, Rahul S Desikan, Jorge Sepulcre, Boon Thye T Yeo, Hesheng Liu, Nicholas J Schmansky, Martin Reuter, Michael W Weiner, Randy L Buckner, Reisa A Sperling, and Bruce Fischl. The dynamics of cortical and hippocampal atrophy in Alzheimer disease. Archives of neurology, 68(8):1040–8, 2011.

[301] Mert R Sabuncu, B T Thomas Yeo, Koen Van Leemput, Bruce Fischl, and Polina Golland. A generative model for image segmentation based on label fusion. IEEE transactions on medical imaging, 29(10):1714–29, oct 2010.

[302] Mert R Sabuncu, B T Thomas Yeo, Koen Van Leemput, Bruce Fischl, and Polina Golland. A generative model for image segmentation based on label fusion. IEEE Trans. Med. Imaging, 29(10):1714–1729, October 2010.

[303] Tejas Sankar, Min Tae M Park, Tasha Jawa, Raihaan Patel, Nikhil Bhagwat, Aristotle N Voineskos, Andres M Lozano, M Mallar Chakravarty, and Alzheimer’s Disease Neuroimaging Initiative. Your algorithm might think the hippocampus grows in Alzheimer’s disease: Caveats of longitudinal automated hippocampal volumetry. Hum. Brain Mapp., 38(6):2875–2896, June 2017.

[304] Nienke M E Scheltens, Francisca Galindo-Garre, Yolande A L Pijnenburg, Annelies E van der Vlies, Lieke L Smits, Teddy Koene, Charlotte E Teunissen, Frederik Barkhof, Mike P Wattjes, Philip Scheltens, and Wiesje M van der Flier. The identification of cognitive subtypes in Alzheimer’s disease dementia using latent class analysis. J. Neurol. Neurosurg. Psychiatry, 87(3):235–243, March 2016.

[305] D Scheuner, C Eckman, M Jensen, X Song, M Citron, N Suzuki, T D Bird, J Hardy, M Hutton, W Kukull, and others. Secreted amyloid β-protein similar to that in the senile plaques of Alzheimer’s disease is increased in vivo by the presenilin 1 and 2 and APP mutations linked to familial Alzheimer’s disease. Nat. Med., 2(8):864, 1996.

[306] D E Schmechel, A M Saunders, W J Strittmatter, B J Crain, C M Hulette, S H Joo, M A Pericak-Vance, D Goldgaber, and A D Roses. Increased amyloid beta-peptide deposition in cerebral cortex as a consequence of apolipoprotein E genotype in late-onset Alzheimer disease. Proc. Natl. Acad. Sci. U. S. A., 90(20):9649–9653, October 1993.

[307] L S Schneider, F Mangialasche, and others. Clinical trials and late-stage drug development for Alzheimer’s disease: an appraisal from 1984 to 2014. J. Intern. Med., 2014.

[308] Lon S. Schneider, Philip S. Insel, and Michael W. Weiner. Treatment with cholinesterase inhibitors and memantine of patients in the Alzheimer’s disease neuroimaging initiative. Archives of Neurology, 2011.

[309] Lon S Schneider and Mary Sano. Current Alzheimer’s disease clinical trials: methods and placebo outcomes. Alzheimers. Dement., 5(5):388–397, September 2009.

[310] Christopher G Schwarz, Jeffrey L Gunter, Heather J Wiste, Scott A Przybelski, Stephen D Weigand, Chadwick P Ward, Matthew L Senjem, Prashanthi Vemuri, Melissa E Murray, Dennis W Dickson, Joseph E Parisi, Kejal Kantarci, Michael W Weiner, Ronald C Petersen, Clifford R Jack, Jr, and Alzheimer’s Disease Neuroimaging Initiative. A large-scale comparison of cortical thickness and volume methods for measuring Alzheimer’s disease severity. Neuroimage Clin, 11:802–812, May 2016.

[311] W B Scoville and B Milner. Loss of recent memory after bilateral hippocampal lesions. J. Neurol. Neurosurg. Psychiatry, 20(1):11–21, February 1957.

[312] Dennis J Selkoe. Amyloid β-protein and the genetics of Alzheimer’s disease. J. Biol. Chem., 271(31):18295–18298, August 1996.

[313] Alberto Serrano-Pozo, Matthew P Frosch, Eliezer Masliah, and Bradley T Hyman. Neuropathological alterations in Alzheimer disease. Cold Spring Harb. Perspect. Med., 1(1):a006189, September 2011.

[314] David W Shattuck, Mubeena Mirza, Vitria Adisetiyo, Cornelius Hojatkashani, Georges Salamon, Katherine L Narr, Russell A Poldrack, Robert M Bilder, and Arthur W Toga. Construction of a 3D probabilistic atlas of human cortical structures. Neuroimage, 39(3):1064–1080, February 2008.

[315] Leslie M Shaw, Hugo Vanderstichele, Malgorzata Knapik-Czajka, Christopher M Clark, Paul S Aisen, Ronald C Petersen, Kaj Blennow, Holly Soares, Adam Simon, Piotr Lewczuk, Robert Dean, Eric Siemers, William Potter, Virginia M-Y Lee, John Q Trojanowski, and Alzheimer’s Disease Neuroimaging Initiative. Cerebrospinal fluid biomarker signature in Alzheimer’s Disease Neuroimaging Initiative subjects. Ann. Neurol., 65(4):403–413, April 2009.

[316] Bart Sheehan. Assessment scales in dementia. Ther. Adv. Neurol. Disord., 5(6):349–358, November 2012.

[317] Dinggang Shen and Christos Davatzikos. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imaging, 21(11):1421–1439, November 2002.

[318] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.

[319] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. December 2017.

[320] Asha Singanamalli, Haibo Wang, and Anant Madabhushi. Cascaded multi-view canonical correlation (CaMCCo) for early diagnosis of Alzheimer’s disease via fusion of clinical, imaging and omic features. Sci. Rep., 7(1):8137, August 2017.

[321] J G Sled, A P Zijdenbos, and A C Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17(1):87–97, February 1998.

[322] Brent J Small and Lars Bäckman. Longitudinal trajectories of cognitive change in preclinical Alzheimer’s disease: A growth mixture modeling analysis. Cortex, 43(7):826–834, January 2007.

[323] Brent J Small and Lars Bäckman. Longitudinal trajectories of cognitive change in preclinical Alzheimer’s disease: A growth mixture modeling analysis. Cortex, 43(7):826–834, 2007.

[324] Stephen M. Smith, Mark Jenkinson, Mark W. Woolrich, Christian F. Beckmann, Timothy E.J. Behrens, Heidi Johansen-Berg, Peter R. Bannister, Marilena De Luca, Ivana Drobnjak, David E. Flitney, Rami K. Niazy, James Saunders, John Vickers, Yongyue Zhang, Nicola De Stefano, J. Michael Brady, and Paul M. Matthews. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage, 23, 2004.

[325] Reisa A Sperling, Paul S Aisen, Laurel A Beckett, David A Bennett, Suzanne Craft, Anne M Fagan, Takeshi Iwatsubo, Clifford R Jack, Jr, Jeffrey Kaye, Thomas J Montine, Denise C Park, Eric M Reiman, Christopher C Rowe, Eric Siemers, Yaakov Stern, Kristine Yaffe, Maria C Carrillo, Bill Thies, Marcelle Morrison-Bogorad, Molly V Wagster, and Creighton H Phelps. Toward defining the preclinical stages of Alzheimer’s disease: recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimers. Dement., 7(3):280–292, May 2011.

[326] L R Squire. Memory and the hippocampus: a synthesis from findings with rats, monkeys, and humans. Psychol. Rev., 99(2):195–231, April 1992.

[327] L R Squire and S Zola-Morgan. The medial temporal lobe memory system. Science, 253(5026):1380–1386, September 1991.

[328] Larry R Squire. The legacy of patient HM for neuroscience. Neuron, 61(1):6–9, 2009.

[329] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.

[330] Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In International conference on machine learning workshop, volume 79, 2012.

[331] Yaakov Stern. Cognitive reserve in ageing and Alzheimer’s disease. Lancet Neurol., 11(11):1006–1012, November 2012.

[332] Cynthia M Stonnington, Carlton Chu, Stefan Klöppel, Clifford R Jack, Jr, John Ashburner, Richard S J Frackowiak, and Alzheimer Disease Neuroimaging Initiative. Predicting clinical scores from magnetic resonance scans in Alzheimer’s disease. Neuroimage, 51(4):1405–1413, July 2010.

[333] W J Strittmatter, K H Weisgraber, D Y Huang, L M Dong, G S Salvesen, M Pericak-Vance, D Schmechel, A M Saunders, D Goldgaber, and A D Roses. Binding of human apolipoprotein E to synthetic amyloid beta peptide: isoform-specific effects and implications for late-onset Alzheimer disease. Proc. Natl. Acad. Sci. U. S. A., 90(17):8098–8102, September 1993.

[334] Heung-Il Suk, Seong-Whan Lee, Dinggang Shen, and Alzheimer’s Disease Neuroimaging Initiative. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. Neuroimage, 101:569–582, November 2014.

[335] Heung-Il Suk, Seong-Whan Lee, Dinggang Shen, and Alzheimer’s Disease Neuroimaging Initiative. Latent feature representation with stacked auto-encoder for AD/MCI diagnosis. Brain Struct. Funct., 220(2):841–859, March 2015.

[336] Trey Sunderland, Gary Linker, Nadeem Mirza, Karen T Putnam, David L Friedman, Lida H Kimmel, Judy Bergeson, Guy J Manetti, Matthew Zimmermann, Brian Tang, John J Bartko, and Robert M Cohen. Decreased beta-amyloid1-42 and increased tau levels in cerebrospinal fluid of patients with Alzheimer disease. JAMA, 289(16):2094–2103, 2003.

[337] Jennifer Y Y Szeto and Simon J G Lewis. Current treatment options for Alzheimer’s disease and Parkinson’s disease dementia. Curr. Neuropharmacol., 14(4):326–338, 2016.

[338] J Talairach and P Tournoux. Co-planar stereotaxic atlas of the human brain. 3-Dimensional proportional system: an approach to cerebral imaging. Thieme, 1988.

[339] Tero Tapiola, Tuula Pirttilä, Pankaj D Mehta, Irina Alafuzoff, Maarit Lehtovirta, and Hilkka Soininen. Relationship between apoE genotype and CSF β-amyloid (1–42) and tau in patients with probable and definite Alzheimer’s disease. Neurobiol. Aging, 21(5):735–740, 2000.

[340] Christine L Tardif, Gabriel A Devenyi, Robert S C Amaral, Sandra Pelleieux, Judes Poirier, Pedro Rosa-Neto, John Breitner, M Mallar Chakravarty, and PREVENT-AD Research Group. Regionally specific changes in the hippocampal circuitry accompany progression of cerebrospinal fluid biomarkers in preclinical Alzheimer’s disease. Hum. Brain Mapp., 39(2):971–984, February 2018.

[341] Dietmar R Thal, Udo Rüb, Mario Orantes, and Heiko Braak. Phases of A beta-deposition in the human brain and its relevance for the development of AD. Neurology, 58(12):1791–1800, June 2002.

[342] Bertrand Thirion, Gaël Varoquaux, Elvis Dohmatob, and Jean-Baptiste Poline. Which fMRI clustering gives good brain parcellations? Front. Neurosci., 8:167, 2014.

[343] J P Thirion. Non-rigid matching using demons. In Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 245–251, June 1996.

[344] Alessandro Treves and Edmund T Rolls. Computational analysis of the role of the hippocampus in memory. Hippocampus, 4(3):374–391, June 1994.

[345] Alain Trouvé. Diffeomorphisms groups and pattern matching in image analysis. Int. J. Comput. Vis., 28(3):213–221, July 1998.

[346] E Tulving and H J Markowitsch. Episodic and declarative memory: role of the hippocampus. Hippocampus, 8(3):198–204, 1998.

[347] Nicholas J Tustison, Brian B Avants, Philip A Cook, Yuanjie Zheng, Alexander Egan, Paul A Yushkevich, and James C Gee. N4ITK: improved N3 bias correction. IEEE Trans. Med. Imaging, 29(6):1310–1320, June 2010.

[348] N Tzourio-Mazoyer, B Landeau, D Papathanassiou, F Crivello, O Etard, N Delcroix, B Mazoyer, and M Joliot. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. Neuroimage, 15(1):273–289, January 2002.

[349] Werner Vach. Regression Models as a Tool in Medical Research. CRC Press, November 2012.

[350] Robert-Jan M van Geuns, Piotr A Wielopolski, Hein G de Bruin, Benno J Rensing, Peter M A van Ooijen, Marc Hulshoff, Matthijs Oudkerk, and Pim J de Feyter. Basic principles of magnetic resonance imaging. Prog. Cardiovasc. Dis., 42(2):149–156, September 1999.

[351] Koen Van Leemput, Akram Bakkour, Thomas Benner, Graham Wiggins, Lawrence L Wald, Jean Augustinack, Bradford C Dickerson, Polina Golland, and Bruce Fischl. Automated segmentation of hippocampal subfields from ultra-high resolution in vivo MRI. Hippocampus, 19(6):549–557, June 2009.

[352] Koen Van Leemput, Frederik Maes, Dirk Vandermeulen, and Paul Suetens. A unifying framework for partial volume segmentation of brain MR images. IEEE transactions on medical imaging, 22(1):105–19, jan 2003.

[353] Erdem Varol, Aristeidis Sotiras, Christos Davatzikos, and Alzheimer’s Disease Neuroimaging Initiative. HYDRA: Revealing heterogeneity of imaging and genetic patterns through a multiple max-margin discriminative analysis framework. Neuroimage, 145(Pt B):346–364, January 2017.

[354] Prashanthi Vemuri, Jeffrey L Gunter, Matthew L Senjem, Jennifer L Whitwell, Kejal Kantarci, David S Knopman, Bradley F Boeve, Ronald C Petersen, and Clifford R Jack, Jr. Alzheimer’s disease diagnosis in individual subjects using structural MR images: validation studies. Neuroimage, 39(3):1186–1197, February 2008.

[355] Prashanthi Vemuri and Clifford R Jack. Role of structural MRI in Alzheimer’s disease. Alzheimers. Res. Ther., 2(4):23, August 2010.

[356] V L Villemagne, K E Pike, D Darby, P Maruff, G Savage, S Ng, U Ackermann, T F Cowie, J Currie, S G Chan, G Jones, H Tochon-Danguy, G O’Keefe, C L Masters, and C C Rowe. Abeta deposits in older non-demented individuals with cognitive decline are indicative of preclinical Alzheimer’s disease. Neuropsychologia, 46(6):1688–1697, February 2008.

[357] Mai-Anh T Vu, Tulay Adali, Demba Ba, Gyorgy Buzsaki, David Carlson, Katherine Heller, Conor Liston, Cynthia Rudin, Vikaas Sohal, Alik S Widge, Helen S Mayberg, Guillermo Sapiro, and Kafui Dzirasa. A shared vision for machine learning in neuroscience. J. Neurosci., pages 0508–0517, January 2018.

[358] J Wan, Z Zhang, J Yan, T Li, B D Rao, S Fang, S Kim, S L Risacher, A J Saykin, and L Shen. Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer’s disease. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 940–947, June 2012.

[359] Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John Pluta, Caryne Craige, and Paul A Yushkevich. Multi-Atlas Segmentation with Joint Label Fusion. IEEE transactions on pattern analysis and machine intelligence, jun 2012.

[360] Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John B Pluta, Caryne Craige, and Paul A Yushkevich. Multi-Atlas segmentation with joint label fusion. IEEE Trans. Pattern Anal. Mach. Intell., 35(3):611–623, March 2013.

[361] Hongzhi Wang and Paul Yushkevich. Multi-atlas segmentation with joint label fusion and corrective learning—an open source implementation. Front. Neuroinform., 7:27, 2013.

[362] Simon K Warfield, Michael Kaus, Ferenc A Jolesz, and Ron Kikinis. Adaptive template moderated spatially varying statistical classification. In William M Wells, Alan Colchester, and Scott Delp, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI’98, volume 1496 of Lecture Notes in Computer Science, pages 431–438. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[363] Simon K Warfield, Kelly H Zou, and William M Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging, 23(7):903–921, July 2004.

[364] Simon K Warfield, Kelly H Zou, and William M Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE transactions on medical imaging, 23(7):903–21, jul 2004.

[365] T Wegliński and A Fabijańska. Brain tumor segmentation from MRI data sets using region growing approach. In Perspective Technologies and Methods in MEMS Design, pages 185–188, May 2011.

[366] Michael W Weiner. Dementia in 2012: Further insights into Alzheimer disease pathogenesis. Nature reviews. Neurology, 9:65–6, 2013.

[367] Michael W Weiner, Paul S Aisen, Clifford R Jack, Jr, William J Jagust, John Q Trojanowski, Leslie Shaw, Andrew J Saykin, John C Morris, Nigel Cairns, Laurel A Beckett, Arthur Toga, Robert Green, Sarah Walter, Holly Soares, Peter Snyder, Eric Siemers, William Potter, Patricia E Cole, Mark Schmidt, and Alzheimer’s Disease Neuroimaging Initiative. The Alzheimer’s Disease Neuroimaging Initiative: progress report and future plans. Alzheimers. Dement., 6(3):202–11.e7, May 2010.

[368] Michael W Weiner, Dallas P Veitch, Paul S Aisen, Laurel A Beckett, Nigel J Cairns, Jesse Cedarbaum, Michael C Donohue, Robert C Green, Danielle Harvey, Clifford R Jack, Jr, William Jagust, John C Morris, Ronald C Petersen, Andrew J Saykin, Leslie Shaw, Paul M Thompson, Arthur W Toga, John Q Trojanowski, and Alzheimer’s Disease Neuroimaging Initiative. Impact of the Alzheimer’s Disease Neuroimaging Initiative, 2004 to 2014. Alzheimers. Dement., 11(7):865–884, July 2015.

[369] Eric Westman, J-Sebastian Muehlboeck, and Andrew Simmons. Combining MRI and CSF measures for classification of Alzheimer’s disease and prediction of mild cognitive impairment conversion. Neuroimage, 62(1):229–238, August 2012.

[370] Jennifer L Whitwell, Dennis W Dickson, Melissa E Murray, Stephen D Weigand, Nirubol Tosakulwong, Matthew L Senjem, David S Knopman, Bradley F Boeve, Joseph E Parisi, Ronald C Petersen, Clifford R Jack, Jr, and Keith A Josephs. Neuroimaging correlates of pathologically defined subtypes of Alzheimer’s disease: a case-control study. Lancet Neurol., 11(10):868–877, October 2012.

[371] Patricia A Wilkosz, Howard J Seltman, Bernie Devlin, Elise A Weamer, Oscar L Lopez, Steven T DeKosky, and Robert A Sweet. Trajectories of cognitive decline in Alzheimer’s disease. Int. Psychogeriatr., 22(2):281–290, 2010.

[372] Julie L Winterburn, Jens C Pruessner, Sofia Chavez, Mark M Schira, Nancy J Lobaugh, Aristotle N Voineskos, and M Mallar Chakravarty. A novel in vivo atlas of human hippocampal subfields using high-resolution 3 T magnetic resonance imaging. NeuroImage, 74:254–65, 2013.

[373] Julie L Winterburn, Jens C Pruessner, Sofia Chavez, Mark M Schira, Nancy J Lobaugh, Aristotle N Voineskos, and M Mallar Chakravarty. A novel in vivo atlas of human hippocampal subfields using high-resolution 3T magnetic resonance imaging. Neuroimage, 74:254–265, 2013.

[374] Robin Wolz, Paul Aljabar, Joseph V Hajnal, Alexander Hammers, and Daniel Rueckert. LEAP: learning embeddings for atlas propagation. NeuroImage, 49(2):1316–25, jan 2010.

[375] Robin Wolz, Paul Aljabar, Joseph V Hajnal, Alexander Hammers, Daniel Rueckert, and Alzheimer’s Disease Neuroimaging Initiative. LEAP: learning embeddings for atlas propagation. Neuroimage, 49(2):1316–1325, January 2010.

[376] Dean F Wong, Paul B Rosenberg, Yun Zhou, Anil Kumar, Vanessa Raymont, Hayden T Ravert, Robert F Dannals, Ayon Nandi, James R Brašić, Weiguo Ye, John Hilton, Constantine Lyketsos, Hank F Kung, Abhinay D Joshi, Daniel M Skovronsky, and Michael J Pontecorvo. In vivo imaging of amyloid deposition in Alzheimer disease using the radioligand 18F-AV-45 (florbetapir F 18). J. Nucl. Med., 51(6):913–920, June 2010.

[377] Guorong Wu, Qian Wang, Daoqiang Zhang, Feiping Nie, Heng Huang, and Dinggang Shen. A generative probability model of joint label fusion for multi-atlas based brain segmentation. Medical Image Analysis, 18:881–890, 2014.

[378] Minjie Wu, Caterina Rosano, Pilar Lopez-Garcia, Cameron S Carter, and Howard J Aizenstein. Optimum template selection for atlas-based segmentation. Neuroimage, 34(4):1612–1618, February 2007.

[379] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. September 2016.

[380] Bradley T. Wyman, Danielle J. Harvey, Karen Crawford, Matt A. Bernstein, Owen Carmichael, Patricia E. Cole, Paul K. Crane, Charles Decarli, Nick C. Fox, Jeffrey L. Gunter, Derek Hill, Ronald J. Killiany, Chahin Pachai, Adam J. Schwarz, Norbert Schuff, Matthew L. Senjem, Joyce Suhy, Paul M. Thompson, Michael Weiner, and Clifford R. Jack. Standardization of analysis sets for reporting results from ADNI MRI data, 2013.

[381] Jieping Ye, Michael Farnum, Eric Yang, Rudi Verbeeck, Victor Lobanov, Nandini Raghavan, Gerald Novak, Allitia DiBernardo, and Vaibhav A Narayan. Sparse learning and stability selection for predicting MCI to AD conversion using baseline ADNI data. BMC Neurol., 12(1):46, 2012.

[382] Dong Yu and Li Deng. Automatic Speech Recognition: A Deep Learning Approach. Springer, November 2014.

[383] Guan Yu, Yufeng Liu, and Dinggang Shen. Graph-guided joint prediction of class label and clinical scores for the Alzheimer’s disease. Brain Struct. Funct., 221(7):3787–3801, September 2016.

[384] Paul A Yushkevich, Robert S C Amaral, Jean C Augustinack, Andrew R Bender, Jeffrey D Bernstein, Marina Boccardi, Martina Bocchetta, Alison C Burggren, Valerie A Carr, M Mallar Chakravarty, Gaël Chételat, Ana M Daugherty, Lila Davachi, Song-Lin Ding, Arne Ekstrom, Mirjam I Geerlings, Abdul Hassan, Yushan Huang, J Eugenio Iglesias, Renaud La Joie, Geoffrey A Kerchner, Karen F LaRocque, Laura A Libby, Nikolai Malykhin, Susanne G Mueller, Rosanna K Olsen, Daniela J Palombo, Mansi B Parekh, John B Pluta, Alison R Preston, Jens C Pruessner, Charan Ranganath, Naftali Raz, Margaret L Schlichting, Dorothee Schoemaker, Sachi Singh, Craig E L Stark, Nanthia Suthana, Alexa Tompary, Marta M Turowski, Koen Van Leemput, Anthony D Wagner, Lei Wang, Julie L Winterburn, Laura E M Wisse, Michael A Yassa, Michael M Zeineh, and Hippocampal Subfields Group (HSG). Quantitative comparison of 21 protocols for labeling hippocampal subfields and parahippocampal subregions in in vivo MRI: towards a harmonized segmentation protocol. Neuroimage, 111:526–541, May 2015.

[385] Paul A Yushkevich, Brian B Avants, John Pluta, Sandhitsu Das, David Minkoff, Dawn Mechanic- Hamilton, Simon Glynn, Stephen Pickup, Weixia Liu, James C Gee, Murray Grossman, and John A Detre. A high-resolution computational atlas of the human hippocampus from postmortem magnetic resonance imaging at 9.4 T. Neuroimage, 44(2):385–398, January 2009.

[386] Paul A Yushkevich, Hongzhi Wang, John Pluta, and Brian B Avants. From label fusion to correspondence fusion: a new approach to unbiased groupwise registration. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 956–963, January 2012.

[387] Saleem Zaroubi and Gadi Goelman. Complex denoising of MR data via wavelet analysis: Application for functional MRI. Magnetic Resonance Imaging, 18(1):59–68, 2000.

[388] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer International Publishing, 2014.

[389] Daoqiang Zhang and Dinggang Shen. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLoS One, 7(3), 2012.

[390] Daoqiang Zhang, Dinggang Shen, and Alzheimer’s Disease Neuroimaging Initiative. Multimodal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. Neuroimage, 59(2):895–907, January 2012.

[391] Daoqiang Zhang, Yaping Wang, Luping Zhou, Hong Yuan, Dinggang Shen, and Alzheimer’s Disease Neuroimaging Initiative. Multimodal classification of Alzheimer’s disease and mild cognitive impairment. Neuroimage, 55(3):856–867, April 2011.

[392] Xinyuan Zhang, Guirong Hou, Ma Jianhua, Wei Yang, Bingquan Lin, Yikai Xu, Wufan Chen, and Yanqiu Feng. Denoising MR images using non-local means filter with combined patch and pixel similarity. PLoS ONE, 9(6), 2014.

[393] Y Zhang, M Brady, and S Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging, 20(1):45–57, January 2001.

[394] Barbara Zitová and Jan Flusser. Image registration methods: a survey. Image Vis. Comput., 21(11):977–1000, October 2003.

[395] Berislav V. Zlokovic. Neurovascular pathways to neurodegeneration in Alzheimer’s disease and other disorders, 2011.