Prognostic applications for Alzheimer’s disease using magnetic resonance imaging and machine-learning
by
Nikhil Bhagwat
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Institute of Biomaterials & Biomedical Engineering
University of Toronto
© Copyright 2018 by Nikhil Bhagwat

Abstract
Prognostic applications for Alzheimer’s disease using magnetic resonance imaging and machine-learning
Nikhil Bhagwat
Doctor of Philosophy
Graduate Department of Institute of Biomaterials & Biomedical Engineering
University of Toronto
2018
Alzheimer’s disease (AD), the most common form of dementia, is a neurodegenerative disorder that leads to cognitive deficits, particularly in the memory domain. Recent advances in magnetic resonance
(MR) imaging techniques and computational tools, such as machine-learning (ML), provide promising opportunities for prognostic applications in AD. Imaging biomarkers can improve our understanding of the etiology and progression of the disease, as well as assist clinicians in decision-making pertaining to patient monitoring, intervention, and treatment selection. The overarching goal of this thesis is to develop several computational methods that facilitate the use of MR imaging data in translational applications to improve personalized patient care.
The work in this thesis is divided into three projects. The first project aims to improve MR image segmentation - a commonly used MR image processing step to delineate anatomical structures used in a multitude of downstream quantitative analyses. The goal of the second project is to leverage MR-based anatomical features towards subject-level clinical severity prediction. Methodologically, the work provides a novel ML framework for high-dimensional, multimodal analysis customized for MR imaging data. The third project extends the subject-level prediction towards longitudinal prognosis with several practical considerations. The methodological contributions involve modeling and prediction of clinical trajectories from longitudinal MR and clinical measures using ML approaches. The work addresses many challenges faced in a clinical setting, such as missing data points, and provides a powerful framework for early detection and accurate prognosis of at-risk AD patients via continuous monitoring.
The comprehensive validation of the methods presented in this work with multiple datasets and studies demonstrates the utility of MR images towards AD prognosis. The proposed tools complement existing clinical workflows and can be leveraged in conjunction with current clinical assessments. With the increasing availability of large-scale datasets, further improvements and validations can be made to adapt this work for individual intervention and prognosis, as well as for improving recruitment strategies for clinical trials in AD.
Dedicated to my childhood storybooks from around the world that have proved to be inspiring and grounding at the same time...
Acknowledgements
Research is innately collaborative in nature, and hence I would like to extend my appreciation to many people whose support has been invaluable for the work presented in this thesis. First, I want to thank my supervisor - Dr. M. Mallar Chakravarty, who entrusted me as his very first PhD student. Mallar found the tricky balance between mentoring an initially naive academic trainee and providing freedom to pursue my own ideas, even the seemingly absurd ones. Apart from academic training, I must also thank Mallar for the enriching collaborations in Toronto and Montreal, wonderful conference trips, and many exciting gastronomic excursions around the globe.

Second, great gratitude goes to Dr. Aristotle Voineskos, who not only served as my committee member, but also provided lab space and resources during my four years of PhD work at CAMH in Toronto. As a clinical psychiatrist, Aristotle’s advice has been crucial in addressing the clinical goals of this work. Special mention also goes to my other committee members, Dr. Chris Honey, Dr. Jo Knight, Dr. Richard Zemel, and Dr. Babak Taati, who formed an incredibly multidisciplinary committee. Their advice has helped me challenge my own preconceived notions and develop an understanding of my projects from diverse perspectives.

I am also grateful to my department for supporting me throughout this stint with many unexpected logistical difficulties. Particularly, I want to thank Dr. Christopher Yip and Dr. Julie Audet for their advice and understanding. I am also grateful to the IBBME staff members - Jeffrey Little, Rhonda Marley, and Elizabeth Flannery, for dealing with my many incessant queries with great patience. I was also fortunate to have wonderful colleagues over the past years at both the Toronto and Montreal labs. Particularly, I thank Jon Pipitone and Gabriel Devenyi, for helping me with the computational resources and impromptu brainstorming sessions!
Typically, my research insights have been realized at the most unexpected moments during semi-academic, borderline silly interactions with my labmates and friends. Beyond my academic circles, I must attribute significant credit to many of my old and new friends, especially Neda, Julie, Anwesha, Joseph, and Sophia, who indulged my many ventures and ventings, and helped me minimize the work-life imbalance during this period.

Finally, I want to thank my parents, Vandana and Parag, along with my little brother Tejas. My family’s courageous life decisions, unwavering support, and multi-timezonal video conferences have allowed and prepared me to take on these intellectual pursuits, for which I am always grateful.
Contents
Acknowledgements iv
Table of Contents v
List of Tables ix
List of Figures xi
1 Introduction 1
1.1 Research contributions and thesis outline ...... 2
2 Background 4
2.1 Alzheimer’s Disease ...... 4
2.1.1 AD diagnosis and staging ...... 5
2.1.2 AD risk factors ...... 5
2.1.3 AD treatment ...... 5
2.2 Pathophysiology of AD ...... 7
2.2.1 Beta-amyloid (Aβ) ...... 7
2.2.2 Neurofibrillary tangles (NFTs) of protein tau ...... 7
2.2.3 Progression of Aβ and NFTs ...... 7
2.3 Neuroanatomy ...... 8
2.4 AD biomarkers ...... 10
2.4.1 CSF and PET markers ...... 10
2.4.2 MR imaging markers ...... 12
2.5 MR-based Neuroimaging ...... 13
2.5.1 MR acquisition ...... 13
2.5.2 MR image preprocessing ...... 15
2.5.3 Image registration ...... 16
2.5.4 Image segmentation ...... 19
2.5.5 Cortical surface estimation ...... 23
2.6 Computational neuroscience and machine-learning ...... 24
2.6.1 Machine-learning ...... 25
2.6.2 Supervised learning ...... 26
2.6.3 Reference models ...... 26
2.6.4 Artificial neural networks and deep learning ...... 28
2.6.5 Performance metrics for supervised learning ...... 30
2.6.6 Supervised ML and AD ...... 30
2.6.7 Unsupervised learning ...... 30
2.6.8 Performance metrics for unsupervised learning ...... 31
2.6.9 Unsupervised ML and AD ...... 31
2.6.10 Performance evaluation ...... 31
2.7 Project synopses ...... 32
2.7.1 Manual-protocol inspired technique for improving automated MR image segmentation during label fusion (Published online 2016 Jul 19. doi: 10.3389/fnins.2016.00325) ...... 33
2.7.2 An artificial neural network model for clinical score prediction in Alzheimer’s disease using structural neuroimaging measures (accepted in the Journal of Psychiatry and Neuroscience) ...... 34
2.7.3 Modeling and prediction of clinical symptom trajectories in Alzheimer’s disease using longitudinal data (accepted in PLOS Computational Biology) ...... 34
3 Project 1: MR Image Segmentation 35
3.1 Abstract ...... 36
3.2 Author Contributions ...... 37
3.3 Introduction ...... 38
3.4 Materials and Methods ...... 41
3.4.1 Methodological Novelty of AWoL-MRF ...... 41
3.4.2 Baseline Multi-Atlas Segmentation Method ...... 41
3.4.3 Proposed Label-Fusion Method: AWoL-MRF ...... 42
3.5 Validation Experiments ...... 47
3.5.1 Datasets ...... 47
3.5.2 Label-Fusion Methods Compared ...... 48
3.5.3 Evaluation Criteria ...... 51
3.6 Results ...... 52
3.6.1 Experiment I: ADNI Validation ...... 52
3.6.2 Experiment II: FEP Validation ...... 52
3.6.3 Experiment III: Preterm Neonatal Cohort Validation ...... 54
3.6.4 Experiment IV: Hippocampal Volumetry ...... 55
3.6.5 Parameter Selection ...... 57
3.7 Discussion and Conclusion ...... 63
3.8 Acknowledgments ...... 65
3.9 Supplementary Material ...... 66
S3.1 Experiment I: ADNI Validation ...... 66
S3.2 Experiment II: First Episode Psychosis (FEP) Validation ...... 66
S3.3 Experiment III: Preterm Neonatal Cohort Validation ...... 66
S3.4 Experiment IV: Hippocampal Volumetry ...... 67
S3.5 Surface-Distance error analysis ...... 67
4 Project 2: Clinical Score Prediction 68
4.1 Abstract ...... 69
4.2 Author Contributions ...... 70
4.3 Introduction ...... 71
4.4 Materials and Methods ...... 73
4.4.1 Datasets ...... 73
4.4.2 MR image processing ...... 73
4.4.3 Anatomically Partitioned Artificial Neural Network (APANN) ...... 74
4.4.4 Empirical distributions ...... 74
4.4.5 Performance Validation ...... 78
4.5 Results ...... 81
4.5.1 Experiment 1: ADNI1 Cohort ...... 81
4.5.2 Experiment 2: ADNI2 Cohort ...... 81
4.5.3 Experiment 3: ADNI1 + ADNI2 Cohort ...... 81
4.5.4 Longitudinal prediction ...... 81
4.6 Discussion ...... 87
4.6.1 Clinical scale comparisons ...... 87
4.6.2 Input modality comparisons ...... 87
4.6.3 Dataset comparisons ...... 87
4.6.4 Longitudinal analysis ...... 88
4.6.5 Related work ...... 88
4.6.6 Clinical translation ...... 89
4.7 Limitations ...... 90
4.8 Conclusion ...... 91
4.9 Acknowledgements ...... 92
4.10 Supplementary Material ...... 93
S4.1 Performance comparison to reference models ...... 93
S4.2 Empirical sampling: standardization across modalities ...... 96
S4.3 Computational resource requirements ...... 96
S4.4 Performance bias in combined ADNI1 and ADNI2 cohort ...... 97
5 Project 3: Prognosis in AD 98
5.1 Abstract ...... 99
5.2 Author Summary ...... 99
5.3 Author Contributions ...... 100
5.4 Introduction ...... 101
5.5 Materials and Methods ...... 103
5.5.1 Datasets ...... 103
5.5.2 Preprocessing ...... 104
5.5.3 Analysis workflow ...... 104
5.5.4 Trajectory modeling ...... 104
5.5.5 Trajectory prediction ...... 106
5.5.6 Performance evaluation ...... 109
5.6 Results ...... 111
5.6.1 Trajectory modeling ...... 111
5.6.2 Trajectory Prediction ...... 111
5.6.3 MMSE trajectories (binary classification) ...... 111
5.6.4 ADAS-13 trajectories (3-way classification) ...... 114
5.6.5 AIBL results ...... 115
5.6.6 Effect of trajectory modeling on predictive performance ...... 116
5.6.7 Computational specifications and training times ...... 118
5.7 Discussion ...... 119
5.7.1 Clinical Implications ...... 119
5.7.2 Trajectory modeling ...... 120
5.7.3 Trajectory prediction ...... 120
5.7.4 AIBL analysis ...... 120
5.7.5 Effect of trajectory modeling on predictive performance ...... 121
5.7.6 Comparison with related work ...... 121
5.7.7 Limitations ...... 122
5.7.8 Conclusions ...... 123
5.8 Acknowledgements ...... 124
S5 Supplementary Material ...... 125
S5.1 Subject Lists ...... 125
S5.2 Hyperparameter search ...... 129
S5.3 Clinical score distributions ...... 130
S5.4 Prediction performance results ...... 131
S5.5 Effect of available timepoints (duration) on prediction performance ...... 137
S5.6 K-fold nested cross-validation procedure ...... 140
6 Discussion 141
6.1 Challenges and limitations ...... 142
6.1.1 Project 1: Hippocampal segmentation ...... 142
6.1.2 Project 2: Clinical severity prediction in AD ...... 143
6.1.3 Project 3: Clinical trajectory modeling and prediction in AD ...... 146
6.2 Future directions ...... 149
6.2.1 MR image segmentation ...... 149
6.2.2 Severity prediction ...... 150
6.2.3 Longitudinal prediction ...... 150
6.2.4 Clinical tasks ...... 150
6.2.5 Clinical translation ...... 151
6.3 Concluding remarks ...... 152
Bibliography 153
List of Tables
2.1 National Institute on Aging (NIA) proposed clinical and preclinical stages of Alzheimer’s disease and their corresponding symptomatic criteria ...... 6
2.2 Comparison of commonly used linear and non-linear registration methods ...... 19
3.1 ADNI1 cross-validation subset demographics. CN: Cognitively Normal. LMCI: Late-onset Mild Cognitive Impairment. AD: Alzheimer’s Disease. CDR-SB: Clinical Dementia Rating-Sum of Boxes. ADAS: Alzheimer’s Disease Assessment Scale. MMSE: Mini-Mental State Examination. Values are presented as lower quartile, median, and upper quartile for continuous variables ...... 47
3.2 First episode psychosis subject demographics. Ambi: ambidextrous. SES: Socioeconomic Status score. FSIQ: Full Scale IQ. Values are presented as lower quartile, median, and upper quartile for continuous variables. N* is the number of non-missing values out of 81 ...... 48
3.3 Hippocampal Volumetry Statistics of ADNI1:Complete Screening 1.5T dataset per diagnosis (AD: Alzheimer’s patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). Top: volumetric statistics of segmentations provided by each method. Middle: effect sizes of pairwise differences between diagnostic groups based on Cohen’s d metric. Bottom: t-values and significance levels from a linear model comprising “Age”, “Sex”, and “total-brain-volume” as covariates (∗ : p < 0.05, ∗∗ : p < 0.01, ∗ ∗ ∗ : p < 0.001) ...... 49
3.4 Summary of automated segmentation methods of the hippocampus. AD = Alzheimer’s Disease; MCI = Mild Cognitive Impairment; CN = Cognitively Normal; FEP = First Episode of Psychosis; LOOCV = Leave-one-out cross-validation; MCCV = Monte Carlo cross-validation; SNT = Surgical Medtronic Navigation Technologies semi-automated labels; L-HC = Left hippocampus; R-HC = Right hippocampus. (a): AD: 0.838, MCI: n/a, CN: 0.883, (b): See [149] for manual segmentation protocol details, (c): The method was applied in the 2012 MICCAI Multi-Atlas Labeling Challenge ...... 50
S3.1 Experiment I surface-distance errors based on a variant of the Hausdorff distance. The error is measured in number of voxels, and mean and standard deviation values are reported over all the subjects in the dataset. The validation configuration comprised 9 atlases and 19 templates ...... 67
4.1 Dataset demographics for ADNI1 and ADNI2 cohorts used in this study. CN: Cognitively Normal, SMC: Significant Memory Concern, EMCI: Early Mild Cognitive Impairment, LMCI: Late Mild Cognitive Impairment, AD: Alzheimer’s Disease; ADAS: Alzheimer’s Disease Assessment Scale, MMSE: Mini–Mental State Examination ...... 73
4.2 Hyperparameter search space for the four models. Grid search of the hyperparameters was performed using a nested inner loop for each cross-validation round. For the APANN model, the fixed hyperparameters refer to broader network design choices that remained identical for all cross-validation rounds. The tunable hyperparameters for APANN were optimized for each fold ...... 79
4.3 Prediction performance for Alzheimer’s Disease Assessment Scale-13 scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RF: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std) ...... 84
4.4 Prediction performance for Mini–Mental State Examination (MMSE) scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RF: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson’s correlation (mean, std), rmse: root mean square error (mean, std) ...... 85
S4.1 Bias measures for Experiment 3 ...... 97
5.1 Demographics of ADNI and AIBL datasets. TM: trajectory modeling cohort, TP: trajectory prediction cohort, R: replication cohort ...... 104
5.2 Cluster demographics of ADNI trajectory prediction (TP) cohort based on MMSE and ADAS-13 scales. *GDS: Geriatric Depression Scale ...... 106
5.3 Trajectory membership comparison between MMSE and ADAS-13 scales. Note that MMSE only has a single decline trajectory ...... 106
5.4 Longitudinal siamese network (LSN) architecture ...... 108
S5.1 Hyperparameter grid search for Longitudinal siamese network (LSN) ...... 129
S5.2 Predictive performance: All subjects, MMSE scale, CA input ...... 131
S5.3 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA input ...... 131
S5.4 Predictive performance: All subjects, MMSE scale, CT input ...... 132
S5.5 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CT input ...... 132
S5.6 Predictive performance: All subjects, MMSE scale, CA+CT input ...... 132
S5.7 Predictive performance: Cognitively Consistent (CC) Group, MMSE scale, CA+CT input ...... 133
S5.8 Predictive performance: All Subjects, ADAS-13 scale, CA input ...... 133
S5.9 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA input ...... 134
S5.10 Predictive performance: All Subjects, ADAS-13 scale, CT input ...... 134
S5.11 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CT input ...... 135
S5.12 Predictive performance: All Subjects, ADAS-13 scale, CA+CT input ...... 135
S5.13 Predictive performance: Cognitively Consistent (CC) Group, ADAS-13 scale, CA+CT input ...... 136
S5.14 AIBL predictive performance: All Subjects, MMSE scale, CA+CT input ...... 136
List of Figures
2.1 The hippocampal formation ...... 9
2.2 Hypothetical model of biomarker progression in AD ...... 11
2.3 MR image preprocessing ...... 17
2.4 3T in-vivo high-resolution atlas of the hippocampal subfields [373] ...... 22
2.5 3T in vivo high-resolution atlas of the hippocampal subfields and white-matter structures [12] ...... 22
2.6 CIVET stages for extracting cortical surface ...... 24
2.7 Logistic (sigmoid) function ...... 27
2.8 Support vector machine ...... 28
2.9 Decision tree ...... 29
2.10 A feed-forward artificial neural network ...... 30
2.11 Nested k-fold cross-validation ...... 33
3.1 The segmentation of a sample hippocampus with AWoL-MRF ...... 44
3.2 Experiment I DSC improvement ...... 53
3.3 Experiment I DSC boxplots ...... 53
3.4 Experiment I Bland-Altman analysis ...... 54
3.5 Experiment I qualitative analysis ...... 55
3.6 Experiment II DSC improvement ...... 56
3.7 Experiment II DSC boxplots ...... 56
3.8 Experiment II Bland-Altman analysis ...... 57
3.9 Experiment II qualitative analysis ...... 58
3.10 Experiment III DSC improvement ...... 58
3.11 Experiment III DSC boxplots ...... 59
3.12 Experiment III Bland-Altman analysis ...... 59
3.13 Experiment III qualitative analysis ...... 60
3.14 Hippocampal volume vs. diagnoses ...... 61
3.15 Parameter selection ...... 62
4.1 ANN architectures and APANN ...... 75
4.2 Empirical sampling ...... 77
4.3 A custom cortical surface parcellation ...... 78
4.4 APANN performance: ADAS13 and MMSE ...... 82
4.5 APANN performance: scatter plots ...... 83
4.6 APANN longitudinal performance ...... 86
S4.1 Performance comparison of all models ...... 94
S4.2 Correlation performance of reference models with high-dimensional input ...... 95
S4.3 Performance of all models for the HC+CT input in Experiment 3 split by subject-dataset membership ...... 97
5.1 Analysis workflow of the longitudinal framework ...... 103
5.2 Trajectory modeling ...... 105
5.3 Longitudinal Siamese network (LSN) model ...... 107
5.4 Potential clinical workflow ...... 109
5.5 MMSE prediction AUC and accuracy performance for ADNI dataset ...... 112
5.6 MMSE prediction ROC curves for ADNI dataset ...... 113
5.7 ADAS-13 prediction accuracy performance for ADNI dataset ...... 114
5.8 MMSE prediction AUC and accuracy performance for AIBL replication cohort ...... 116
5.9 MMSE prediction ROC curves for AIBL replication cohort ...... 117
S5.1 Clinical score distributions of different trajectories at two timepoints. Note: for subjects who are missing the 12-month timepoint, 6-month scores are used instead ...... 130
S5.2 Distribution of number of available timepoints with clinical score data ...... 137
S5.3 Effect of available timepoints on predictive performance (MMSE) ...... 138
S5.4 Effect of available timepoints on predictive performance (ADAS-13) ...... 139
S5.5 Project 3 supplement: K-fold nested cross-validation ...... 140
6.1 Machine-learning models: underfitting vs. overfitting ...... 144
Chapter 1
Introduction
Translational applications of computational neuroscientific methods can have a meaningful impact on global mental health challenges. This multidisciplinary field of study can help model the biological processes that govern the healthy and diseased states of the human brain and map them onto observable clinical presentations. In the last decade, the rapid increase in large biomedical (neuroimaging and related biological data) datasets, concurrent with advances in the field of machine-learning, has opened new avenues towards the development of diagnostic and prognostic applications for neurodegenerative and neuropsychiatric disorders. From the computational perspective, this has spawned the development of tools that can incorporate several patient-specific observations into predictions to improve clinical outcomes of those suffering from these disorders. This thesis leverages these multidisciplinary advances towards developing prognostic applications for late-onset Alzheimer’s disease (AD) using magnetic resonance (MR) imaging data and machine-learning (ML) techniques. The overarching goal of these applications is to improve early detection and personalized treatment planning by identifying individuals at the highest risk for AD and AD-related symptom decline.

AD, the most common form of dementia, was first identified by Dr. Alois Alzheimer in 1906 and is a distinctively different neurobiological process compared to normal aging. AD is a progressive brain disorder that alters synaptic connectivity due to the aggregation of specific brain pathologies. The altered neuronal connectivity ultimately results in the death of brain cells, causing declining episodic memory function and cognitive ability. In advanced stages, patients are unable to perform even basic tasks required for daily living (eating, bathing, and dressing themselves), resulting in a need for constant monitoring and care. Currently, there is no cure for this debilitating and ultimately fatal disease.
As an affliction of the ageing population, the socio-economic burden of AD is set to increase rapidly with the rising percentage of this demographic in the developed world. Currently, over 560,000 Canadians are living with dementia, and in the next fifteen years this number is projected to grow to over 930,000 [11]. One in thirteen Canadians between the ages of 65 and 74 is affected by AD and related dementias; this increases to one in nine after age 75 and one in four after age 85. Approximately 10.4 billion dollars are spent on treatment and caregiving annually in Canada. In the United States, approximately 5.5 million people were living with AD in 2017 [11]. Globally, it is estimated that the prevalence of AD will reach over 100 million by 2050 [46]. It is therefore critical to develop new intervention and treatment strategies, together with caregiving infrastructure, to deal with this impending global healthcare crisis.
Early identification of presymptomatic individuals at risk of AD or related symptomatic decline would potentially have a significant impact on the development of intervention strategies that could treat or delay the onset of AD. A prognostic tool that can predict future decline for an at-risk individual would greatly help in clinical trial recruitment and the design of potential disease-modifying therapies. Targeting subject populations in earlier stages of the disease also increases the success rate of a treatment [320, 116, 283]. In efforts towards early detection, continuous monitoring of at-risk individuals via MR imaging and the development of applicable computational tools for prognostic prediction are of great interest from the public healthcare perspective, and serve as the motivating rationale for the work in this thesis.
1.1 Research contributions and thesis outline
The work presented in this thesis contributes to the growing body of research on clinical applications in AD leveraging MR imaging data and ML techniques. The projects undertaken as part of this thesis make specific methodological advancements in the areas of MR image segmentation and ML-based model development for multimodal and longitudinal data analysis, aimed at subject-level clinical predictions. These clinical applications can contribute greatly towards early detection and prognostic prediction at the individual level, which can assist interventions, treatment planning, as well as the design of clinical trials. The research is progressively divided into three projects as follows:
• Project 1: MR image segmentation of the hippocampus
• Project 2: Symptom severity prediction based on neuroanatomy
• Project 3: Modeling and prediction of clinical progression
The next chapter details the background pertaining to AD and its associated neuropathology, as well as the current state of the art in MR-based neuroimaging and the computational approaches that form the basis for this thesis.

Chapter 3 describes project 1, which aims at improving MR image segmentation - a commonly used MR processing step that produces anatomically meaningful feature sets serving as input to a multitude of clinical applications. Specifically, the work in project 1 presents a novel automated method, inspired by manual protocols, that improves the performance of multi-atlas segmentation frameworks. The presented method is validated on whole hippocampal segmentation using three different datasets.

Chapter 4 describes project 2, which aims at leveraging MR features towards subject-level clinical severity prediction. Methodologically, the work contributes a novel ML framework for high-dimensional, multimodal data analysis. Specifically, the work presents a novel artificial neural network that combines hippocampal segmentations and cortical thickness measures to predict scores from multiple clinical scales simultaneously. The presented model is validated on multiple large AD datasets for baseline as well as longitudinal score prediction. Clinically, this can assist clinicians in making (or validating) diagnoses based on quantitative MR measures.

Chapter 5 describes project 3, which aims at modeling and predicting clinical trajectories from longitudinal MR and clinical measures. First, a data-driven clinical subtyping approach is presented that characterizes the differential symptomatic progression of individuals based on longitudinal clinical scores.
Methodologically, the approach allows graceful handling of missing data points. Then, the work presents a novel longitudinal ML framework that can combine MR data from two timepoints, namely baseline and follow-up, to predict these symptom trajectories. The work in this project provides a powerful computational tool for early detection and accurate prognosis of at-risk AD patients via continuous monitoring.

Lastly, Chapter 6 discusses the overall contributions as well as the limitations of the work in this thesis, and concludes with overarching remarks on future directions. It should be noted that this is a manuscript-based thesis. Chapters 3-5 are research papers that are either published or accepted at various journals. Hence, readers may notice some overlap between certain sections of the papers and the stand-alone chapters of the thesis.

Chapter 2
Background
2.1 Alzheimer’s Disease
Alzheimer’s disease (AD) progressively affects brain function and cognitive processes related to learning and memory. Despite its high prevalence within the ageing population, AD is not part of the normal ageing process. Clinical diagnosis of AD is made according to consensus criteria which have been refined over the years. One of the earlier sets of guidelines was proposed by the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (the latter now known as the Alzheimer’s Association) [245]. These criteria proposed the classification of patients into definite, probable, or possible AD. A diagnosis of definite AD requires that neuropathological findings be confirmed by a direct analysis of brain tissue samples, which may be obtained either at autopsy or from a brain biopsy.

In 2007, an International Working Group (IWG) of dementia experts developed guidelines that incorporated new knowledge about the prodromal symptomatic stage of AD [105]. These were later updated to address atypical clinical presentations of AD and to identify clinically asymptomatic individuals with positive biomarkers of AD pathology [104]. In 2011, the Alzheimer’s Association and the National Institute on Aging (NIA) issued four diagnostic criteria and guidelines for AD that focus on three stages of AD: (1) dementia due to Alzheimer’s - characterized by impairment in memory, thinking, and behavior that compromises a person’s ability to function independently in everyday life [181, 246]; (2) mild cognitive impairment (MCI) due to Alzheimer’s - characterized by mild changes in memory and thinking that are noticeable and can be measured on mental status tests, but are not severe enough to disrupt a person’s day-to-day life [7]; and (3) preclinical (presymptomatic) Alzheimer’s - characterized by measurable biomarker changes in the brain, which may occur years before symptoms [325].
The diagnosis of preclinical AD is primarily used within research settings to investigate and parse the substantial pathophysiological and behavioural heterogeneity present in this stage [139]. There are several commonalities as well as differences between the two sets of diagnostic criteria proposed by the IWG and the NIA, and harmonization efforts have been made to reach consensus [255]. The NIA guidelines conceptualize AD as a continuum and recognize that the pathological processes of AD may not be clinically expressed. This thesis focuses on the NIA diagnostic criteria, which are used by the studies leveraged in this work.
2.1.1 AD diagnosis and staging
The NIA proposed criteria outline three stages of AD [255], among which the preclinical staging is based on a hypothetical temporal ordering of biomarkers. The neuropathology and AD biomarkers are described in detail in subsequent sections. The diagnostic and preclinical stages are described in Table 2.1.
2.1.2 AD risk factors
The risk of AD onset and related symptomatic severity is attributed to several contributing factors. Among the modifiable, lifestyle-related risk factors, the US National Institutes of Health has identified diabetes mellitus, smoking, depression, mental inactivity, physical inactivity, and poor diet as being associated with increased risk of cognitive decline and AD [93]. These factors can be taken into consideration when developing primary strategies for preventing or delaying disease onset via healthier living. Other non-modifiable risk factors include age and genetics, which have been implicated in the onset of the disease [78, 185, 261, 232]. Notably, the risk of AD and the severity of related symptoms increase with age. Apolipoprotein E (ApoE), the leading genetic risk factor for AD, supports lipid transport and injury repair in the brain [122]. There are three polymorphic alleles of ApoE: ε2, ε3, and ε4, with ε3 being the most common. Individuals carrying the ε4 allele are at increased risk of AD compared with those carrying the ε3 allele, whereas the ε2 allele decreases the risk of onset [141]. In the case of the inherited form of AD, genetic mutations in presenilin 1 (PSEN1), presenilin 2 (PSEN2), or amyloid precursor protein (APP) have been implicated in the onset [40, 305].
2.1.3 AD treatment
Although there is no cure for AD, several medication and treatment options can help manage the symptoms and improve quality of life. The pharmacological interventions approved to treat AD symptoms relating to memory, cognition, and language include cholinesterase inhibitors and the partial N-methyl-D-aspartate (NMDA) receptor antagonist memantine [248, 33, 289]. Cholinesterase inhibitors aim to improve cell-to-cell communication by preserving levels of acetylcholine, a neurotransmitter that is depleted by the disease. Memantine attempts to block overstimulation by glutamate of the NMDA receptor, which is linked to memory processes and implicated in neurodegenerative disorders. The effect of these drugs varies from person to person, and they are typically prescribed during the early stages of the disease in order to delay onset and slow the worsening of symptoms. As an alternative to pharmacological medication, cognitive interventions have been suggested as a preventive measure in the earlier stages of the disease. These intervention techniques include cognitive training, stimulation, and rehabilitation to improve, or at least maintain, functioning across several cognitive domains [64, 259, 337]. Despite several studies, there is no strong evidence to support significant improvement or slowing of cognitive decline with these techniques. However, it is possible that well-designed studies, focusing on specific subsets of the preclinical population, might be needed to measure the efficacy of these methods on different cognitive domains [23]. Recently, there has been growing interest in investigating the cognitive benefits of physical exercise. Since lack of physical exercise is a known risk factor for age-related cognitive decline, it has been suggested that physical activity may improve brain function by increasing the expression of nerve growth factors and neurotransmitters related to cognitive function [24, 121].
Stage: AD dementia
1. The presence of dementia, as determined by intra-individual decline in cognition and function.
2. Insidious onset and progressive cognitive decline.
3. Impairment in two or more cognitive domains; although an amnestic presentation is most common, the criteria allow for diagnosis based on nonamnestic presentations (e.g. impairment in executive function and visuospatial abilities).
4. Absence of prominent features associated with other dementing disorders.
5. Increased diagnostic confidence may be suggested by the biomarker algorithm discussed in the MCI due to AD section above.

Stage: MCI due to AD
1. A change in cognition from previously attained levels, as noted by self or informant report and/or the judgment of a clinician.
2. Impaired cognition in at least one domain (but not necessarily episodic memory) relative to age- and education-matched normative values; impairment in more than one cognitive domain is permissible.
3. Preserved independence in functional abilities, although the criteria also accept 'mild problems' in performing instrumental activities of daily living (IADL), even when this is only with assistance (i.e. rather than insisting on independence, the criteria now allow for mild dependence due to functional loss).
4. No dementia, which nominally is a function of criterion 3.
5. A clinical presentation consistent with the phenotype of AD in the absence of other potentially dementing disorders. Increased diagnostic confidence may be suggested by:
   - Optimal: a positive beta-amyloid (Aβ) biomarker and a positive degeneration biomarker
   - Less optimal: a positive beta-amyloid (Aβ) biomarker without a degeneration biomarker, or a positive degeneration biomarker without testing for beta-amyloid (Aβ) biomarkers

Stage: Preclinical staging
1. Stage 1 is marked by Aβ42 peptide dysregulation (reflected in reduced levels of CSF Aβ42 or by elevated cerebral cortical amyloid burden as determined by PET amyloid imaging).
2. Stage 2 adds synaptic/neuronal dysfunction and loss (i.e. neurodegeneration), as evidenced by increased CSF p-tau levels, or hypometabolism or cortical thinning/hippocampal atrophy as determined by FDG PET and MRI, respectively.
3. Stage 3 features the abnormalities in Stages 1 and 2 plus subtle cognitive decline.

Table 2.1: National Institute on Aging (NIA) proposed clinical and preclinical stages of Alzheimer's disease and their corresponding symptomatic criteria.
Positive effects of exercise, although with small effect sizes, have been documented by a few studies in MCI and AD populations [271]. The recommendations from these studies investigating AD treatment share a few commonalities. First, early intervention is crucial for delaying or slowing the clinical progression of the disease. Second, the limited success thus far has been in developing symptomatic treatment regimens rather than a comprehensive disease treatment. Thus, early diagnosis of AD and identification of individuals at high risk of cognitive decline and conversion to AD are extremely important for timely intervention, treatment planning, and making caregiving arrangements.
2.2 Pathophysiology of AD
The two defining pathological features of AD are extracellular deposits of beta-amyloid (Aβ) peptides and intracellular neurofibrillary tangles (NFTs) of the protein tau [43, 155, 312, 228, 37].
2.2.1 Beta-amyloid (Aβ)
Aβ oligomers and plaques are synaptotoxins that stimulate inflammatory processes. Aβ denotes peptides of 36–43 amino acids derived from the amyloid precursor protein (APP). Aβ circulates in plasma, cerebrospinal fluid (CSF), and brain interstitial fluid (ISF), mainly as soluble Aβ40. APP is cleaved by beta secretase and gamma secretase to yield Aβ. Due to the imprecise nature of the gamma secretase cleavage process, several Aβ variants are produced alongside the most common form, Aβ40 (~80–90%), including Aβ42 (~5–10%), which is more hydrophobic and fibrillogenic [333, 306]. This form of Aβ is predominantly found in amyloid plaques. Overproduction or lack of clearance of Aβ causes its accumulation outside the cell in the form of oligomers. Further aggregation of Aβ oligomers with other proteins and cellular materials develops into insoluble plaques. These insoluble plaques can bind strongly to neuronal receptors, eventually destroying the synapses and consequently spreading interneuronal dysfunction [341, 228, 197].
2.2.2 Neurofibrillary tangles (NFTs) of protein tau
Tau is a microtubule-associated protein that facilitates axonal transport [173, 120, 174]. A single gene on chromosome 17 encodes six molecular isoforms of tau, which are generated through alternative splicing of its pre-mRNA. Tau is crucial for a stable intracellular microtubule network, and its function is regulated by the degree of phosphorylation. The normal adult human brain contains 2–3 moles of phosphate per mole of tau protein [173]. In AD, tau is translocated to the somatodendritic compartment and undergoes hyperphosphorylation. Subsequently, its misfolding and aggregation give rise to NFTs, disrupting microtubule assembly, which ultimately leads to neuronal death [313].
2.2.3 Progression of Aβ and NFTs
In 1991, the seminal work by Braak and Braak examined the distribution of Aβ and NFTs in 83 brains obtained at autopsy. The work showed a characteristic progression pattern of NFTs that was categorized into six stages [43]. In stages I and II, NFTs are confined mainly to the transentorhinal region. In stages III and IV, they spread into limbic regions such as the hippocampus. Finally, in stages V and VI, they permeate the neocortex in frontal, superolateral, and occipital directions [43, 44]. In contrast to the topological distribution of NFTs, the progression of Aβ deposition is less predictable. The entorhinal cortex, hippocampal formation, basal ganglia, brainstem, and cerebellum are the regions commonly impacted by Aβ deposits. Braak and Braak proposed three stages of Aβ progression. In stage I, Aβ deposits are found mainly in the basal regions of the frontal, temporal, and occipital lobes. Progressively, in stage II, isocortical association areas are heavily affected, whereas the hippocampal formation is partially involved, and the primary sensory, motor, and visual cortices are as yet unaffected by Aβ deposits. Lastly, in stage III, Aβ affects the primary isocortical areas, as well as the cerebellum and subcortical nuclei such as the striatum, thalamus, hypothalamus, subthalamic nucleus, and red nucleus. Subsequent work [341] summarized these stages into "isocortical", "allocortical or limbic", and "subcortical" categories. In 1992, the amyloid cascade hypothesis, which posits Aβ deposition as the initiating event of AD pathology, was proposed [156, 290]. Notably, however, there is evidence suggesting that NFTs can develop independently of Aβ deposition, as well as a lack of symptomatic presentation despite substantial Aβ deposition in the brain [156, 290]. For instance, several studies have reported Aβ-positive subjects showing no cognitive decline, who undergo a healthy ageing process [120, 59]; conversely, several clinically diagnosed late-stage AD patients have shown disproportionate Aβ burden [113].
These heterogeneous clinical presentation patterns suggest the presence of upstream causal neurodegenerative processes that produce Aβ and NFTs and/or protective mechanisms that increase the functional and cognitive resiliency of the brain for certain populations [286, 290, 331].
2.3 Neuroanatomy
The human brain, the central organ of the nervous system, consists of the cerebrum, the brainstem, and the cerebellum. The cerebrum comprises two hemispheres, which are further divided into four lobes: frontal, temporal, parietal, and occipital. The outer layer of gray matter of the cerebrum, referred to as the cerebral cortex, comprises neuronal cell bodies. The folding of the cortex, a result of the migration of neural progenitors across radial glial units during neurodevelopment, manifests as ridges (gyri) and grooves (sulci) [287]. The white matter within the cerebrum consists of myelinated axons that connect different neuronal cell bodies. At the base of the cerebrum are the cerebellum and the brainstem; the latter connects the rest of the brain with the spinal cord and is responsible for many of the body's autonomic functions. Underneath the protective skull, there are three layers of tissue: the dura, arachnoid, and pia, collectively referred to as the meninges. The subarachnoid space is filled with cerebrospinal fluid (CSF), which further helps protect and support the brain. CSF also fills the four cavities within the brain, known as ventricles, as well as the central canal of the spinal cord. The pathological progression of AD is typically reflected by anatomical changes that begin in the entorhinal cortex and the hippocampus within the temporal lobe, followed by regions of the neocortex. The temporal lobe, and specifically the medial temporal lobe (MTL) structures, plays an important role in the consolidation of information and the formation of short-term and long-term memory [311, 327]. The hippocampus, which runs along the length of the MTL and belongs to the limbic system, is a major component of the memory circuitry and is one of the first regions to be affected by AD [180].
The hippocampal complex comprises the dentate gyrus, four subfields referred to as Cornu Ammonis (CA) 1–4, and the subiculum (see Fig. 2.1) [373, 214]. The information flow into the hippocampus begins with the pyramidal cell axons from the entorhinal cortex, which perforate the subiculum and project into the dentate gyrus. The information is then passed on to CA3 and subsequently to CA1, which projects back to the entorhinal cortex, completing the circuit. This feedback loop is an important excitatory-inhibitory mechanism for memory processing [94, 340]. Other output pathways from the hippocampus project into several cortical areas, including the prefrontal cortex and the lateral septal area.
Figure 2.1: The hippocampal formation, as drawn by Santiago Ramón y Cajal. Notations, DG: dentate gyrus; Sub: subiculum; EC: entorhinal cortex; CA: Cornu Ammonis.
Functionally, the role of the hippocampus in learning and memory has been thoroughly investigated. Historically, a remarkable case study that highlighted the significance of the hippocampus in episodic memory formation was the investigation of Henry Molaison ("patient H. M."), who underwent bilateral medial temporal lobe resection in an attempt to control epileptic seizures. The procedure alleviated the seizures by removing the epileptogenic focus, but resulted in severe memory impairment that caused him to forget daily events nearly as fast as they occurred [311, 328]. Subsequently, a multitude of studies have associated the hippocampus and its subfields with different learning and memory processes in the brain [326, 344, 346, 131, 266]. Several studies have also reported a hippocampal role in navigation via place cells that encode spatial information [272, 238]. Consequently, neuroanatomical abnormalities in the hippocampus have been associated with memory dysfunction and related disorders such as AD [195, 44, 177, 261, 263, 94]. Particularly from the clinical perspective, the hippocampus has been a region of interest for biomarker development in AD and an important predictor in the classification of an individual's diagnostic and prognostic state [176, 41, 260, 299].
In addition to regionally specific structural changes within the MTL, more global neuroanatomical changes throughout the entire cerebrum have been associated with AD. The cerebral cortex is a folded sheet of neurons with a laminar organization comprising six separate layers throughout the neocortex [241, 124]. Going from the surface towards the white matter, these are: 1) the molecular layer, 2) the corpuscular layer, 3) the pyramidal layer, 4) the granular layer, 5) the ganglionic layer, and 6) the multiform layer. The cortical ribbon is bounded by the gray/white and pial surfaces, and the distance between these two surfaces quantifies the thickness of the cortical gray matter. Thickness values vary between 1 and 4.5 mm, with an overall average of approximately 2.5 mm [124]. Cortical atrophy, reflected in a loss of gray matter and a consequent reduction in cortical thickness, has been shown to correlate with cognitive decline [221, 286, 299]. In comparison to volumetric measures of temporal lobe atrophy, cortical thickness provides a more robust quantitative measure of AD-related neuroanatomical change, as it is less sensitive to the inter-subject variations in head or brain size that confound volumetric measurements [180, 310]. A similar issue is also observed with surface area measures of the cortex [27, 310]. Thus, cortical thickness measures are more commonly employed in individual-level diagnostic and prognostic analyses.
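The head-size confound noted above for volumetric measures is commonly handled by normalizing regional volumes to total intracranial volume (ICV). A minimal sketch of the proportional-scaling variant follows; the function name and the example volumes are illustrative assumptions, not values or methods from this thesis:

```python
def icv_normalize(volume_mm3, icv_mm3, mean_icv_mm3):
    """Proportionally rescale a regional volume to a common head size.

    volume_mm3   : regional volume (e.g. hippocampus) for one subject
    icv_mm3      : that subject's total intracranial volume
    mean_icv_mm3 : mean ICV of the reference cohort
    """
    return volume_mm3 * (mean_icv_mm3 / icv_mm3)

# Hypothetical subject: 3500 mm^3 hippocampus in a large head (ICV 1600 cm^3),
# referenced against a cohort mean ICV of 1500 cm^3.
adjusted = icv_normalize(3500.0, 1_600_000.0, 1_500_000.0)
print(round(adjusted, 1))  # -> 3281.2
```

The subject's raw volume is scaled down because their head is larger than the cohort average, so a comparison across subjects reflects relative rather than absolute atrophy.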
2.4 AD biomarkers
A biomarker is a surrogate biological measure that serves as an indicator of a normal or pathological process. Biomarkers are key components of secondary prevention strategies and clinical trials for AD. Commonly used AD biomarkers include both imaging and biofluid measures. Although several CSF-derived and radiotracer-based PET imaging markers have been studied extensively, this thesis focuses on structural MR imaging markers due to their non-invasive nature. A prominent hypothesis explaining the temporal progression of several AD biomarkers was initially proposed in [181] and subsequently updated (see Fig. 2.2) by the authors in [182]. The original model provided a prototypical progression of AD biomarkers, with level of abnormality expressed as a function of the pathophysiological pathway. The model denoted CSF Aβ42 and amyloid PET as upstream biomarkers and structural MR-based neurodegenerative measures as downstream biomarkers, followed by clinical symptoms. The revised model expressed biomarker abnormality as an explicit function of time instead of clinical disease stage. The model also represented cognitive outcomes on a spectrum, to account for the inter-subject symptomatic variability observed in the clinic. Lastly, the revised model reordered certain biomarkers: specifically, CSF Aβ42 was moved before amyloid PET, which was followed by CSF tau; FDG PET and MRI were also redrawn to represent their concurrent progression. According to this model, the earliest detectable changes are caused by amyloid accumulation, typically measured from a CSF Aβ sample, making it one of the most promising biomarkers for early detection. Studies show that AD patients have reduced CSF Aβ and elevated CSF tau levels compared to cognitively normal individuals [174, 336]. Moreover, these Aβ and tau levels are more extreme in AD patients with one or two ApoE ε4 alleles compared to patients with no ε4 alleles [339].
Although CSF biomarkers can provide early evidence of clinical decline, these measures change relatively slowly over the course of disease progression. In contrast, PET- and MR-based measures tend to be more dynamic biomarkers, providing a better characterization of disease progression and related clinical decline. Therefore, non-invasive MR-based biomarkers in particular are well suited for continuous monitoring of asymptomatic at-risk individuals. Although many studies have adopted this model [182] as a guiding hypothesis, alternative theories have been proposed to better explain the heterogeneity and loose coupling between pathological burden and observable symptom severity within certain AD populations. A notable hypothesis among these postulates a vascular dysfunction process that alters the balance between blood-flow substrate delivery and neuronal/glial energy demands, leading to downstream dysfunction and disease [395, 175]. In particular, [175] presents a data-driven model as a more realistic characterization of biomarker progression compared to traditional observational disease models. Although the investigation of these alternative hypotheses is crucial for understanding the etiology of Alzheimer's disease, it is beyond the scope of this thesis, which focuses on downstream neurodegenerative processes.
2.4.1 CSF and PET markers
Characterization of the neurobiological processes associated with Aβ and NFTs, and early detection of the consequent anatomical changes, is essential for the development of intervention strategies that could limit or prevent downstream clinical symptoms. Typically, the presence of Aβ and NFTs can be confirmed via highly invasive lumbar punctures or brain biopsies, or during post mortem autopsy [66, 120, 37]. In the lumbar puncture procedure, a needle is inserted between the lumbar vertebrae into the subarachnoid
Figure 2.2: Hypothetical model of biomarker progression in AD. Aβ is identified by CSF Aβ42 or PET amyloid imaging. Tau-mediated neuronal injury and dysfunction is identified by CSF tau or fluorodeoxyglucose-PET. Brain structure is measured by use of structural MR imaging. Aβ = beta-amyloid; MCI = mild cognitive impairment. Image adapted from Jack et al. 2013 with reuse permission.

space of the spinal canal to extract cerebrospinal fluid (CSF). Studies have shown that AD patients have reduced amounts of Aβ and elevated amounts of tau in CSF compared to cognitively normal individuals [174, 66]. Studies have also suggested that MCI patients have CSF Aβ and tau levels in between those of AD and cognitively normal individuals [315, 340]. As an alternative to invasive lumbar puncture procedures, several surrogate techniques have been proposed for in vivo detection, such as positron emission tomography (PET). A PET scan acquisition involves injecting the patient with a tracer labelled with a positron-emitting radionuclide. The tracer molecules are selected to target a particular physiological process [129]. The images are acquired in several planes through the brain, which provide visual information showing the radiotracer distribution. The two nuclei commonly used in PET imaging are 11C and 18F, which have half-lives of 20 and 110 minutes, respectively. Pittsburgh compound B (PiB) is a 11C-labelled thioflavin-T derivative that binds to amyloid plaques in vivo. AD patients typically show increased PiB retention in areas known to accumulate significant amyloid deposits, in comparison with healthy individuals. Cortical PiB retention is also observed in MCI patients, but to a lesser extent than in AD [183, 356]. Patients are often classified as PiB positive or negative, where a global cortical-to-cerebellar ratio is defined to separate the two groups.
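The cortical-to-cerebellar ratio used for this classification can be sketched as a standardized uptake value ratio (SUVR). In the sketch below, the regional uptake values and the positivity cutoff are hypothetical illustrations, not validated clinical thresholds:

```python
def suvr(cortical_uptake, cerebellar_uptake):
    """Global cortical-to-cerebellar standardized uptake value ratio."""
    cortical_mean = sum(cortical_uptake) / len(cortical_uptake)
    cerebellar_mean = sum(cerebellar_uptake) / len(cerebellar_uptake)
    return cortical_mean / cerebellar_mean

def is_amyloid_positive(ratio, cutoff=1.4):
    """Binary PiB classification; the cutoff here is an assumed, study-specific value."""
    return ratio > cutoff

# Hypothetical mean tracer uptake in cortical ROIs vs. a cerebellar reference region
cortex = [2.1, 1.9, 2.3, 2.0]
cerebellum = [1.3, 1.2, 1.3]
r = suvr(cortex, cerebellum)
print(round(r, 2), is_amyloid_positive(r))  # -> 1.64 True
```

Using the cerebellum as the reference region exploits the fact that it accumulates comparatively little amyloid, so the ratio normalizes away global tracer-dose and scanner effects.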
Independent studies have consistently found that approximately 30% of cognitively normal elderly individuals would be classified as PiB positive according to such criteria [183]. This suggests that PiB alone is not a sufficient marker for AD. Florbetapir (18F-AV-45) is another radiopharmaceutical compound, containing the radionuclide fluorine-18. Florbetapir has a strong affinity for amyloid proteins in the AD brain and faster in vivo kinetics [51, 65]. The longer half-life of 18F has facilitated the use of florbetapir in several amyloid imaging studies, which demonstrate the feasibility of this compound for differentiating patients with AD and MCI from healthy controls [376, 50]. Similar to PiB-based techniques, the utility of florbetapir thus far remains limited to qualitative amyloid imaging, and it will require significant investigation into the feasibility, tolerability, and reliability of the biomarker before it can be used in clinical diagnostic applications [50]. Given these findings, along with the undesirable radioactive nature of the tracers used in PET, MR imaging has gained more attention for the development of neuroanatomical biomarkers.
2.4.2 MR imaging markers
Structural MR images offer a rich source of information that can measure anatomical alterations at voxel-level granularity. Structural volumetry characterizing atrophy patterns is a simple approach for utilizing this high-dimensional information towards biomarker development [134, 355]. Studies demonstrate that temporal lobe atrophy is strongly associated with AD. Histological data validate that the entorhinal cortex, hippocampus, and amygdala are particularly vulnerable structures affected by AD pathology. Several studies have investigated the association between the rate of temporal lobe atrophy, as measured by MR imaging, and current as well as future cognitive decline. Longitudinal studies have documented accelerated rates of atrophy in AD and MCI patients compared to cognitively normal groups. Specifically, entorhinal cortex and hippocampal degeneration has been established as a marker of AD and explains memory-related symptoms [43, 57, 98, 81]. Several studies have demonstrated group-wise differences in total hippocampal volume, as well as its subfields, across healthy, MCI, and AD populations [180, 176, 62, 142, 299, 80, 94]. However, hippocampal atrophy alone is not a good predictor of MCI-to-AD conversion. This is potentially due to 1) substantial symptomatic and neurobiological heterogeneity within the MCI population, 2) lack of consensus regarding the anatomical definition of the hippocampus as captured by MR imaging, and 3) the sensitivity of acquisition and segmentation techniques across studies [284, 178, 257, 214, 38, 136, 303]. These challenges have made the clinical translation of hippocampal volumetry and morphometric techniques for early detection and prognosis difficult. Recently, [303] showed aberrant hippocampal volumetric fluctuations in a longitudinal AD sample, cautioning against the use of hippocampal volume as a stand-alone biomarker.
This in turn necessitates the development of more sensitive structural biomarkers to model disease progression at the individual level. In this pursuit, studies have explored both more global and more granular structural biomarkers [19].
One approach to developing these complex biomarkers involves the incorporation of multiple brain regions implicated in AD-related neurodegeneration. Using this approach, many studies have explored atrophy patterns in the cortex to identify different progression stages [224, 222, 286, 230, 299]. Another approach towards building more sensitive biomarkers involves voxel-wise or vertex-wise analysis, which provides a way to detect subtle and distributed changes that could discriminate between different clinical states of the disease [117]. Voxel-based morphometry (VBM) is a popular technique for such analysis [15]. Although it provides an extremely powerful tool for group-wise comparisons, it cannot provide individual-specific measures that would enable the use of such biomarkers for individual diagnosis or prognosis.
The use of high-dimensional information from MR images towards the development of complex biomarkers necessitates multivariate statistical techniques for the quantitative representation of neurodegeneration. Moreover, different validation regimens need to be applied to assess biomarker performance in group-wise versus individual-level modeling and prediction tasks. These statistical approaches in the computational neuroscience domain are described in Section 2.6.
2.5 MR-based Neuroimaging
MR-based neuroimaging can provide qualitative and quantitative information regarding brain structure and function. Advances in MR imaging technology over the past decade have opened new avenues for mental health research utilizing neuroimaging data. In typical use, MR techniques image soft tissue, allowing researchers to investigate structural and functional characteristics of the human brain in vivo. MR image acquisition is a non-invasive process, safe for humans, that produces detailed three-dimensional anatomical images without the use of harmful radiation [88, 268, 226]. Neuroimaging processing pipelines consist of multiple sequential tasks prior to statistical analysis. The pipeline begins with the acquisition of MR images, which contain biases and artifacts induced by the hardware and the acquisition protocol itself. These artifacts are subsequently "corrected for" using appropriate image processing techniques. Images are also cropped to remove areas beyond the region of interest. After these preprocessing steps, images are used in subsequent statistical analyses. Structural MR images form the basis of this thesis; as a result, these steps are described in detail in the following sections.
2.5.1 MR acquisition
An MR imaging scanner consists of a main magnet, which generates a strong primary magnetic field; a radiofrequency (RF) coil, which transmits and receives radiofrequency energy to and from the tissue; and gradient coils (typically in the x, y, and z directions), which are used to generate field gradients that enable the frequency and spatial encoding used for signal source localization. The primary magnetic field, denoted B0, causes protons from the abundant water molecules in the human body to align with the field. Conventionally, this field defines the coordinate frame of reference, with B0 oriented along the z-axis. B0 causes protons to precess at a frequency, known as the Larmor frequency, that is proportional to the field strength:
ωL = γB0 (2.1)
where the gyromagnetic ratio γ is characteristic of the nuclei under consideration. The alignment along the z-axis is perturbed out of equilibrium by the application of a radiofrequency (RF) pulse perpendicular to this axis. Once the RF field is turned off, the sensors can detect the energy released by the protons as they realign with the primary magnetic field. The amplitude of this signal is maximal immediately following the RF pulse and decays with time. By employing magnetic field gradients, the signal source can be spatially localized by inducing differential Larmor frequencies along the z-axis. Additionally, gradient-induced phase encoding is used to resolve the signal location in the xy plane. The sampling of this frequency- and phase-encoded signal generates a complex-valued three-dimensional array in a spatial frequency domain referred to as k-space. Finally, the image itself is reconstructed using the Fourier transform of this k-space representation [268, 350]. Several parameters of the acquisition protocol influence the quality of the image (i.e. signal-to-noise ratio, resolution, field of view, etc.), but typically, image quality improves with higher B0, which is commonly 1.5 T or 3 T, although 7 T scanners have become commercially available. The signal is quantified using time constants characterizing signal decay. The excited protons generate magnetization components along the z-axis (longitudinal) as well as in the xy (transverse) plane. A set of macroscopic equations to calculate nuclear magnetization (M) as a function of time was first introduced by Bloch in 1946 [36], and is written in matrix form as follows:
$$\frac{d}{dt}\begin{bmatrix} M_x \\ M_y \\ M_z \end{bmatrix} = \begin{bmatrix} -1/T_2 & \gamma B_z & -\gamma B_y \\ -\gamma B_z & -1/T_2 & \gamma B_x \\ \gamma B_y & -\gamma B_x & -1/T_1 \end{bmatrix} \begin{bmatrix} M_x \\ M_y \\ M_z \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ M_0/T_1 \end{bmatrix} \quad (2.2)$$
where γ is the gyromagnetic ratio, and T1 and T2 are the time constants associated with the decay of the longitudinal and transverse components of the signal, respectively. The recovery of longitudinal magnetisation as the protons align with B0 is known as spin-lattice (T1) relaxation:
$$M_z(t) = M_0\left(1 - e^{-t/T_1}\right) \quad (2.3)$$
The decay of transverse magnetization during realignment is known as spin-spin (T2) relaxation:
$$M_{xy}(t) = M_{xy}(0)\,e^{-t/T_2} \quad (2.4)$$
where M0 is the equilibrium nuclear spin magnetization, parallel to the external magnetic field B0, and Mxy(0) is the transverse magnetization immediately after the RF pulse.
These time constants depend on the surrounding environment, i.e. the biological tissue. Thus, the resulting MR contrast depends on these time constants as well as on the proton density of each tissue type.
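Equations 2.1, 2.3, and 2.4 can be evaluated directly. In the sketch below, the gray-matter time constants (T1 ≈ 900 ms, T2 ≈ 100 ms at 1.5 T) are rough textbook values assumed for illustration only:

```python
import math

GAMMA_BAR = 42.58e6  # Hz/T: gyromagnetic ratio of hydrogen divided by 2*pi

def larmor_hz(b0_tesla):
    """Larmor precession frequency (Eq. 2.1) in Hz for hydrogen nuclei."""
    return GAMMA_BAR * b0_tesla

def mz(t_ms, t1_ms, m0=1.0):
    """Longitudinal recovery after a 90-degree pulse (Eq. 2.3)."""
    return m0 * (1.0 - math.exp(-t_ms / t1_ms))

def mxy(t_ms, t2_ms, m0=1.0):
    """Transverse decay (Eq. 2.4)."""
    return m0 * math.exp(-t_ms / t2_ms)

print(f"{larmor_hz(1.5) / 1e6:.1f} MHz")  # -> 63.9 MHz at 1.5 T
print(f"{mz(900, 900):.3f}")              # ~63% of Mz recovered after one T1
print(f"{mxy(100, 100):.3f}")             # ~37% of Mxy remaining after one T2
```

After one time constant, longitudinal magnetization has recovered to 1 − 1/e ≈ 63% of equilibrium while transverse magnetization has decayed to 1/e ≈ 37%, which is the behaviour exploited by the sequence timings described next.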
A basic MR acquisition sequence, referred to as spin-echo, comprises two RF pulses. First, a 90-degree pulse tips the net magnetization into the transverse plane. Once the RF transmitter is turned off, the transverse magnetization (Mxy) decays, and the longitudinal magnetization recovers as the protons realign with B0. The protons re-radiate the absorbed energy, which can be detected by the receiver coils. The signal received in the transverse plane decays faster than T2 would predict. This is modeled by a modified time constant, T2*, comprising pure T2 decay as well as static inhomogeneities in the magnetic field, which accelerate the dephasing process. The 90-degree pulse is followed by a 180-degree pulse in order to rephase the spins in the transverse plane and reverse the static field inhomogeneities. The signal is measured after phase coherence is achieved. The time between the 90-degree pulse and MR signal sampling is called the echo time (TE); the 180-degree pulse is applied at time TE/2. This process is repeated several times, and the time between two 90-degree pulses is referred to as the repetition time (TR). Because each tissue has different T1 and T2 values, MR contrast can be modified with different configurations of TE and TR. With short TR and TE, contrast depends primarily on tissue-specific differences in longitudinal magnetization recovery, i.e. T1; this is referred to as a T1-weighted sequence. With longer TR and TE, T1 differences diminish, and tissue contrast results mainly from the T2 properties of the tissue; this is referred to as a T2-weighted sequence. A configuration of long TR and short TE produces a proton density (PD) image, in which contrast is a function of differences in tissue proton density, as the long TR allows near-complete longitudinal recovery while the short TE limits transverse decay. Tissues with long T1 and T2 (e.g. water) appear dark in T1-weighted images and bright in T2-weighted images. Conversely, tissues with short T1 and short T2 (e.g. fat) appear bright in T1-weighted images and darker in T2-weighted images.
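The TE/TR trade-offs described above can be made concrete with the idealized spin-echo signal equation, S = PD · (1 − e^(−TR/T1)) · e^(−TE/T2), which combines Equations 2.3 and 2.4. The tissue parameters below are approximate 1.5 T textbook values assumed purely for illustration:

```python
import math

def spin_echo_signal(pd, t1, t2, tr, te):
    """Idealized spin-echo signal: S = PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)

# Assumed approximate values at 1.5 T: (relative PD, T1 in ms, T2 in ms)
tissues = {"white matter": (0.70, 600.0, 80.0),
           "gray matter":  (0.85, 900.0, 100.0),
           "CSF":          (1.00, 4000.0, 2000.0)}

# (TR, TE) in ms: short/short -> T1-weighted, long/long -> T2-weighted
for name, (seq_tr, seq_te) in {"T1-weighted": (500.0, 15.0),
                               "T2-weighted": (4000.0, 100.0)}.items():
    signals = {t: round(spin_echo_signal(pd, t1, t2, seq_tr, seq_te), 3)
               for t, (pd, t1, t2) in tissues.items()}
    print(name, signals)
```

Running this shows CSF (long T1 and T2) with the lowest signal under the T1-weighted setting and the highest signal under the T2-weighted setting, matching the contrast behaviour described in the text.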
2.5.2 MR image preprocessing
Raw image acquisitions usually contain noise and artifacts that need to be accounted for prior to downstream computational analysis [29, 321, 347]. In addition, it is also important to extract brain tissue from the raw image and discard the background, skull, and other regions irrelevant to the computational analysis of interest. These preprocessing steps alter the signal-to-noise and contrast-to-noise ratios (SNR, CNR) of the image and thus influence subsequent image analyses, such as brain segmentation. It is therefore crucial to carefully design preprocessing pipelines to achieve accurate as well as reproducible results, especially within multi-site and multi-study experimental paradigms.
MR image denoising
The noise confounds in the MR signal result from thermal vibrations of ions and electrons in the receiver coil and the tissue, manifesting as intensity fluctuations [15, 392]. The noise in magnitude MR images generally follows a Rician or non-central Chi distribution. In theory, the SNR can be improved by averaging multiple repeated acquisitions; however, this requires substantially more acquisition time and is not feasible in practice. Another simple approach to mitigate high-frequency noise is a low-pass Gaussian filter, which essentially averages neighbouring pixels [15]. However, this blurs the image, diminishing high-frequency spatial information such as structural boundaries. Several advanced denoising methods have been proposed and applied, including anisotropic diffusion filters [211], wavelet-based filters [8, 387], and adaptive non-local means [47, 240, 84]. We note that, based on our assessment of the quality of the publicly available, standardized datasets utilized in this thesis, we did not include a denoising step in our preprocessing pipeline for any of the three projects.
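The noise-versus-blur trade-off of Gaussian low-pass filtering can be illustrated on a synthetic step-edge image (a toy sketch, not part of the thesis pipeline; assumes scipy is available):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Synthetic 2D "image": a sharp step edge (two tissue intensities) plus Gaussian noise.
clean = np.zeros((64, 64))
clean[:, 32:] = 100.0
noisy = clean + rng.normal(0.0, 10.0, clean.shape)

smoothed = gaussian_filter(noisy, sigma=2.0)

# Noise suppression: intensity spread in a flat region drops after smoothing.
flat_noisy_std = noisy[:, :16].std()
flat_smooth_std = smoothed[:, :16].std()

# Cost: the step edge is blurred, spreading intermediate intensities across columns.
edge_width_noisy = int((np.abs(noisy.mean(axis=0) - 50.0) < 30.0).sum())
edge_width_smooth = int((np.abs(smoothed.mean(axis=0) - 50.0) < 30.0).sum())

print(flat_smooth_std < flat_noisy_std)       # True: noise reduced
print(edge_width_smooth > edge_width_noisy)   # True: boundary blurred
```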
Intensity inhomogeneity correction
The low-frequency, intensity non-uniformity artifact is referred to as bias, inhomogeneity, illumination nonuniformity, or gain field. The bias field, in an MRI context, causes intensity inhomogeneity within an image, resulting in a smooth signal variation within tissue of the same type. This artifact can be caused by spatial inhomogeneity in the magnetic field, spatially varying receiver coil sensitivity, and interaction of the magnetic field with human tissue. The magnitude of this effect depends on the magnetic field strength; thus, images obtained using higher-field scanners are more susceptible to intensity inhomogeneities [96]. Among the several bias correction algorithms, the nonparametric nonuniform intensity normalization (N3) [321] and its improved revision N4 [347] are among the most commonly used techniques. N3 is an iterative algorithm that maximizes the high-frequency content of the tissue intensity distribution of the corrected scan. The algorithm uses a b-spline approximation to obtain a smoothed estimate of the non-uniformity field of the scan and iteratively removes it from the original scan until the non-uniformity estimate converges. N3 does not require prior knowledge of tissue types and can be applied at an early stage in automated image analysis. N4, a more recent, improved version of this algorithm, modifies the iterative optimization used in N3 with a multiresolution framework. Specifically, in N4 the b-spline is initially fit at a lower resolution, and the resolution is hierarchically increased to achieve the best fit of the bias field. The correction is performed iteratively, such that the corrected image from one step is used as input to the next to estimate the residual bias field, allowing for incremental updates of the bias field estimate. The N4 correction step was included in the preprocessing pipeline for the three projects in this thesis. An example of an N4-corrected T1 MR image is shown in Fig. 2.3.
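The core idea behind these corrections, estimating a smooth multiplicative gain field and dividing it out, can be sketched in a toy log-domain form. This is a deliberately simplified illustration, not the actual N3/N4 algorithm (which fits a b-spline iteratively); it assumes scipy is available:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)

# Toy single-tissue "scan": constant intensity 100 corrupted by a smooth
# multiplicative bias field (low-frequency gain varying ~±20%) and mild noise.
size = 128
x = np.linspace(-1.0, 1.0, size)
bias = 1.0 + 0.2 * np.sin(np.pi * x)[:, None] * np.cos(np.pi * x)[None, :]
true_image = np.full((size, size), 100.0)
observed = true_image * bias + rng.normal(0.0, 1.0, (size, size))

# Log domain: multiplicative bias becomes additive, so a heavy low-pass filter
# gives a crude bias estimate.
log_img = np.log(observed)
log_bias_est = gaussian_filter(log_img, sigma=20.0)
log_bias_est -= log_bias_est.mean()          # fix the arbitrary global scale
corrected = np.exp(log_img - log_bias_est)

# The corrected image is far more uniform within the (single) tissue class.
cv_before = observed.std() / observed.mean()
cv_after = corrected.std() / corrected.mean()
print(cv_after < cv_before)  # True
```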
Brain extraction
Brain extraction, or masking of non-brain tissue such as the skull, fat, and neck regions, is another common preprocessing step. Such cropping to the region of interest improves subsequent image analysis. Brain extraction involves binary classification of each voxel from the raw scan as brain or non-brain tissue, where the brain comprises grey matter, white matter, and cerebrospinal fluid (CSF) [96]. A common method for brain extraction is BET (Brain Extraction Tool). BET uses a deformable model of a sphere's surface, which expands one vertex at a time until the boundary of the brain's surface is reached. A more recently developed patch-based segmentation tool, BEaST (Brain Extraction based on nonlocal Segmentation Technique) [119], uses a large library of priors to perform nonlocal segmentation in a multi-resolution framework, employing varying patch sizes to improve segmentation accuracy and computation time. BEaST has been shown to have significantly higher accuracy than BET, especially when analysing data from psychiatric populations who may present with pathology [119]. The BEaST extraction step was included in the preprocessing pipeline for the three projects in this thesis. Fig. 2.3 shows an example of a T1 MR image with the brain region extracted using BEaST.
2.5.3 Image registration
Image registration is an alignment problem that deals with transforming raw data into a common frame of reference [394, 17]. In medical imaging, registration is crucial for establishing comparability across different individuals, timepoints, and modalities. For structural MR images, this typically implies estimating a one-to-one mapping between two image spaces to establish anatomical correspondence. Registration approaches can be categorized by choice of feature space (i.e. pixels/voxels, landmarks), transformation model (i.e. affine, nonlinear), degrees of freedom, and similarity metric (i.e. cross-correlation, mutual information) [209]. Mathematically, the registration of image J into the space of image I can be formulated as an optimization problem with the goal of finding the optimal transformation as follows:
T* = argmax_{M ∈ Ω} S(I, J, M)   (2.5)
where,
I = reference image
J = image to be transformed
M = transformation
Ω = search space for transformation
S = similarity measure
T* = optimal transformation
In the case of 3-dimensional affine transformations, the mapping from each point (x1, x2, x3) of an image to a point (y1, y2, y3) in the transformed space can be expressed as:
Figure 2.3: T1-weighted MR image before and after preprocessing stages. The image is randomly selected from the ADNI2 cohort used in the analysis in this thesis.
y1 = m11x1 + m12x2 + m13x3
y2 = m21x1 + m22x2 + m23x3
y3 = m31x1 + m32x2 + m33x3
which can be represented concisely as the matrix multiplication (y = Mx):
[y1]   [m11 m12 m13 m14] [x1]
[y2] = [m21 m22 m23 m24] [x2]   (2.6)
[y3]   [m31 m32 m33 m34] [x3]
[1 ]   [ 0   0   0   1 ] [1 ]
For rigid-body transformation, M is typically decomposed in terms of translation (T) and rotation (R) matrices as M = TR. These can be further parameterized as follows:
    [1 0 0 q1]
T = [0 1 0 q2]   (2.7)
    [0 0 1 q3]
    [0 0 0 1 ]
and
    [1     0        0      0] [ cos(q5)  0  sin(q5)  0] [ cos(q6)  sin(q6)  0  0]
R = [0  cos(q4)  sin(q4)   0] [   0      1    0      0] [-sin(q6)  cos(q6)  0  0]   (2.8)
    [0 -sin(q4)  cos(q4)   0] [-sin(q5)  0  cos(q5)  0] [   0        0      1  0]
    [0     0        0      1] [   0      0    0      1] [   0        0      0  1]
The estimation of the translation parameters (q1, q2, and q3) is trivially given by the last column of M, whereas the rotation parameters are computed from the matrix product as follows:
    [ cos(q5)cos(q6)                            cos(q5)sin(q6)                           sin(q5)         0]
R = [-sin(q4)sin(q5)cos(q6) - cos(q4)sin(q6)   -sin(q4)sin(q5)sin(q6) + cos(q4)cos(q6)   sin(q4)cos(q5)  0]   (2.9)
    [-cos(q4)sin(q5)cos(q6) + sin(q4)sin(q6)   -cos(q4)sin(q5)sin(q6) - sin(q4)cos(q6)   cos(q4)cos(q5)  0]
    [ 0                                         0                                        0               1]

which yields,
q5 = asin(R13)
q4 = atan2(R23/cos(q5), R33/cos(q5))   (2.10)
q6 = atan2(R12/cos(q5), R11/cos(q5))

where atan2 is the four-quadrant inverse tangent. Thus, a rigid-body transformation can be defined with 6 parameters (qi), whereas in the more general case of affine transformation, which comprises scaling and shearing in addition to translation and rotation, the 3×3 submatrix of M contains 9 parameters (12 in total with translation). These parameters are optimized to maximize similarity between the transformed and reference images. In the case of nonlinear registration, the linear transformation stage is typically followed by a deformation operation. This operation computes a nonlinear mapping between the affine transform of J and the reference image I that further maximizes the similarity between I and transformed J. A multitude of affine and nonlinear registration approaches, along with their implementations in image preprocessing pipelines, have been proposed over the last decade. ANIMAL [68], FLIRT [186], HAMMER [317], ART [13], and Mindboggle [200] are some of the commonly used methods. Table 2.2 provides an algorithmic comparison of these methods based on transformation type, degrees of freedom, and similarity metric. A comprehensive comparative study by Klein et al. [200] suggests that ART and symmetric image normalization (SyN) deliver consistently high performance across various subject populations.
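The parameterization in Eqs. 2.7-2.10 can be checked numerically by composing M = TR from six chosen parameters and then recovering them from the matrix entries. The sketch below uses NumPy and assumes the angle range (|q5| < π/2) for which the asin/atan2 recovery is valid:

```python
import numpy as np

def rigid_body_matrix(q):
    """Compose M = T @ Rx(q4) @ Ry(q5) @ Rz(q6) as in Eqs. 2.7-2.8."""
    q1, q2, q3, q4, q5, q6 = q
    T = np.eye(4)
    T[:3, 3] = [q1, q2, q3]
    c4, s4 = np.cos(q4), np.sin(q4)
    c5, s5 = np.cos(q5), np.sin(q5)
    c6, s6 = np.cos(q6), np.sin(q6)
    Rx = np.array([[1, 0, 0, 0], [0, c4, s4, 0], [0, -s4, c4, 0], [0, 0, 0, 1]])
    Ry = np.array([[c5, 0, s5, 0], [0, 1, 0, 0], [-s5, 0, c5, 0], [0, 0, 0, 1]])
    Rz = np.array([[c6, s6, 0, 0], [-s6, c6, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
    return T @ Rx @ Ry @ Rz

def recover_parameters(M):
    """Recover (q1..q6) from M using the last column and Eq. 2.10."""
    q1, q2, q3 = M[:3, 3]
    R = M[:3, :3]
    q5 = np.arcsin(R[0, 2])
    c5 = np.cos(q5)
    q4 = np.arctan2(R[1, 2] / c5, R[2, 2] / c5)
    q6 = np.arctan2(R[0, 1] / c5, R[0, 0] / c5)
    return np.array([q1, q2, q3, q4, q5, q6])

params = np.array([10.0, -5.0, 2.0, 0.3, -0.4, 0.7])  # translations (mm), rotations (rad)
M = rigid_body_matrix(params)
recovered = recover_parameters(M)
print(np.allclose(recovered, params))  # True
```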
Algorithm (year)             | Transformation                       | Degrees of Freedom | Similarity
ANIMAL (1997)                | Local translations                   | 69K                | Cross correlation
FLIRT (2001)                 | Linear, rigid-body                   | 9, 6               | Normalized correlation ratio
HAMMER (2002)                | Hierarchical deformation             | n/a                | Geometric moment invariants
ART (2005)                   | Non-parametric, homeomorphic         | 7M                 | Normalized cross correlation
SPM5 - Unified (2005)        | Discrete cosine transforms           | 1K                 | Generative segmentation model
SPM5 - DARTEL (2007)         | Finite difference model of velocity  | 6.4M               | Multinomial model
SyN (2008)                   | Bi-directional diffeomorphism        | 28M                | Cross correlation
Diffeomorphic Demons (2009)  | Non-parametric, diffeomorphic        | 21M                | Sum of square differences
Table 2.2: Comparison of commonly used linear and non-linear registration methods

The modern methods with a large number of parameters (or degrees of freedom) tend to perform better, at additional computational cost. SyN belongs to a family of diffeomorphic image registration algorithms [345]. Diffeomorphic approaches are symmetric with respect to the image inputs (source and target) and allow probabilistic similarity measures [21]. These are usually contrasted with inverse-consistent image registration approaches, previously popularized by Thirion's Demons algorithm [343]; however, the latter can only approximate symmetry and inverse transformations with respect to the input images. The authors showed that SyN's symmetric diffeomorphic optimizer outperforms inverse-consistent image registration with an elastic optimizer, as used in HAMMER [317]. SyN is part of Advanced Normalization Tools (ANTs) within the Insight ToolKit (ITK). ITK is a popular computational framework for customizing MR registration pipelines [20, 22], and was employed to process the datasets used in this thesis.
2.5.4 Image segmentation
The process of image segmentation refers to parcellating pixels or voxels into labelled salient regions. Segmentation provides meaningful representation of an image that facilitates quantitative analysis. In MR imaging, segmentation is commonly performed at different levels of granularity as well as anatomical categories. Several classes of segmentation methods exist, including manual segmentation, intensity- based methods, and atlas-based methods.
Manual segmentation
The gold standard for anatomical segmentation, identifying various cortical and subcortical structures, is defined by an expert human rater through a manual delineation process. Manual delineation is a tedious and time-consuming process, and also introduces inter- and intra-rater variability into segmentations. Nevertheless, several protocols have been proposed in an effort to standardize the process and mitigate human biases [195, 284, 98, 373, 384, 12]. Although manual labeling of large datasets is infeasible, these protocols have produced several moderately sized labeled datasets that serve as reference and validation for automated techniques.
Automatic segmentation
In the last two decades, rapid progress has been made in automatic techniques to improve the performance and efficiency of cortical and subcortical segmentations. Earlier approaches classified healthy brain tissue into grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF), broadly based on the differential intensity profiles of each tissue type [16]. Nevertheless, thresholding these intensity distributions to assign each voxel a discrete categorical label is highly subjective and not trivial. Alternatively, several region-growing, classification, and clustering approaches have been proposed for intensity-based segmentation [154]. Region growing requires selection of seed points (voxels) that belong to a region of interest. The algorithm then examines the local neighbourhood intensities and assigns labels based on a predefined similarity criterion. Region-growing methods are suitable for structures with large connected regions, such as brain vessels and tumors [365]. Classification methods make use of a training set of labeled images to automatically learn the mapping between intensity profiles and corresponding categorical labels. One of the simplest nonparametric classifiers used for segmentation is k-nearest neighbour, where voxels are classified according to a majority vote of the closest training data [362]. Another commonly used parametric approach is the Bayesian classifier. During training, a Bayesian classifier models the probabilistic relationship between the image intensities and the class labels. A new image is then assigned labels using an inference technique, such as maximum a posteriori estimation, based on Bayes' rule. These types of classifiers are commonly implemented in an expectation-maximization framework in several MR segmentation software packages, such as SPM [18], FAST [393], FreeSurfer [125], and 3DSlicer [281].
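As an illustration of the Bayesian classification idea, a minimal Gaussian naive Bayes classifier can be trained on synthetic voxel intensities for the three tissue classes. The intensity means and spreads below are invented for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic training intensities: CSF, GM, WM as Gaussians (arbitrary units).
means, std, n = {"csf": 30.0, "gm": 60.0, "wm": 90.0}, 5.0, 500
classes = list(means)
train = {c: rng.normal(means[c], std, n) for c in classes}

# "Training": estimate per-class mean and variance (equal priors assumed).
fitted = {c: (train[c].mean(), train[c].var()) for c in classes}

def classify(intensity):
    """Assign the class with the highest Gaussian log-likelihood (MAP, flat prior)."""
    def log_lik(c):
        mu, var = fitted[c]
        return -0.5 * np.log(2 * np.pi * var) - (intensity - mu) ** 2 / (2 * var)
    return max(classes, key=log_lik)

# Evaluate on fresh samples from each class.
test_x, test_y = [], []
for c in classes:
    test_x.extend(rng.normal(means[c], std, 200))
    test_y.extend([c] * 200)
accuracy = np.mean([classify(x) == y for x, y in zip(test_x, test_y)])
print(accuracy > 0.95)  # classes are well separated (6 std apart)
```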
In contrast with classification methods, clustering methods belong to the unsupervised learning paradigm; they segment images into voxel clusters with similar intensities. These methods usually rely on an iterative process that updates the voxel-cluster memberships and the estimated tissue-intensity mapping for the image to be segmented. Some of the commonly used clustering methods include k-means [67], fuzzy C-means [4], and expectation-maximization methods [276]. Similar to classification methods, clustering methods typically do not incorporate spatial neighbourhood information, making them vulnerable to noise and intensity inhomogeneities. Although several extensions have been proposed to mitigate this issue [97, 229], atlas-based approaches have become a more popular choice for leveraging prior anatomical knowledge for the localization and identification of brain regions. Atlas-based approaches are extremely powerful in their ability to transfer a priori spatial anatomical information to the new image during segmentation [338, 70, 314]. Traditionally, atlas-based techniques use a single template derived from manual segmentation that serves as a reference atlas for automated techniques. This atlas, in a given stereotaxic space, provides spatially localized prior probabilities of voxel membership to a certain tissue or structure. An image to be segmented is aligned to this atlas via affine and then nonlinear registration. Once the new image and atlas are in the same reference frame, all the label information can be propagated from the atlas to the new image using the transform from the registration step. The performance of such segmentation is consequently contingent upon the quality of the registration [90, 138, 54]. Several methods have been proposed to refine post-registration segmentation quality by unifying these two processes [18, 281].
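A minimal 1-D k-means over voxel intensities illustrates the iterative cluster-update idea (toy data; the percentile-based initialization is one common heuristic, not a claim about any specific package):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic voxel intensities from three tissue-like Gaussian modes.
intensities = np.concatenate([
    rng.normal(30.0, 5.0, 1000),   # CSF-like
    rng.normal(60.0, 5.0, 1000),   # GM-like
    rng.normal(90.0, 5.0, 1000),   # WM-like
])

# Initialize cluster centers at spread-out percentiles, then iterate:
# (1) assign each voxel to its nearest center, (2) recompute centers.
centers = np.percentile(intensities, [10, 50, 90])
for _ in range(20):
    assignment = np.argmin(np.abs(intensities[:, None] - centers[None, :]), axis=1)
    centers = np.array([intensities[assignment == k].mean() for k in range(3)])

print(np.sort(centers))  # approximately [30, 60, 90]
```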
Multi-atlas label-fusion based segmentation
More recently, an alternative approach known as "multi-atlas label-fusion" (MALF), which uses multiple atlases, has shown great success in segmentation [363, 160, 302, 82, 360, 278, 153, 32]. Briefly, MALF methods begin with a set of manually labeled images, referred to as atlases, which are registered to the new image based on intensity values to enforce spatial correspondence. Labels of the anatomical structures under consideration are then propagated from each atlas to the new image. This provides a label distribution at each voxel based on the anatomical labeling of the atlases. This distribution is converted into a single categorical label value via a label-fusion method of choice, such as a majority vote [72, 55, 278]. Several variants of MALF have been proposed that provide strategies for atlas selection, atlas weighting, and local patch-based methods, as well as optimization algorithms for label fusion. Atlas selection and weighting methods use a similarity metric between the atlases and the image to be labeled to identify a subset of atlases, minimizing the discrepancies between them [201, 378, 9, 375, 72]. Patch-based methods tackle the segmentation problem at a local neighbourhood scale instead of across the entire image. These methods aim to leverage redundancy present in the image to naturally inflate the number of examples considered during label estimation [82, 296]. The optimization methods at the label-fusion stage focus on maximizing agreement among the candidate segmentations from multiple atlases while meeting anatomically plausible spatial constraints [363, 160, 236, 302, 360, 32]. These label-fusion methods are discussed in detail, in the context of hippocampal segmentation, in Chapter 3.
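The simplest fusion rule, a per-voxel majority vote across propagated atlas labels, can be sketched as follows (toy arrays standing in for registered label volumes):

```python
import numpy as np

def majority_vote_fusion(propagated_labels):
    """Fuse a stack of propagated label maps (n_atlases, *volume_shape)
    into one segmentation by taking the most frequent label per voxel."""
    stack = np.asarray(propagated_labels)
    n_labels = stack.max() + 1
    flat = stack.reshape(stack.shape[0], -1)
    # Count votes for each label at each voxel, then take the argmax.
    votes = np.stack([(flat == lab).sum(axis=0) for lab in range(n_labels)])
    return votes.argmax(axis=0).reshape(stack.shape[1:])

# Three toy 2x3 "atlas" label maps (0 = background, 1 = structure), with one
# atlas disagreeing at two voxels after registration.
atlas_a = np.array([[0, 1, 1], [0, 0, 1]])
atlas_b = np.array([[0, 1, 1], [0, 0, 1]])
atlas_c = np.array([[1, 1, 0], [0, 0, 1]])

fused = majority_vote_fusion([atlas_a, atlas_b, atlas_c])
print(fused)  # matches the two agreeing atlases: [[0 1 1], [0 0 1]]
```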
Hippocampal segmentation
As mentioned in Section 2.2, the hippocampus has been a region of great interest for AD research. Accurate delineation of the hippocampus from structural MR images has therefore received a lot of attention over the past decade. MR-based atlases are typically derived using group-wise registration techniques that capture neuroanatomical variability as well as commonalities within a group of subjects. Several methods also make use of reconstructed, warped histological data to enhance visual information lacking in MR images [53, 3]. Differences in the anatomical definition of the hippocampus as captured by MR imaging [206], intersubject variability in hippocampal shape, and heterogeneity of MR acquisition parameters (resolution, contrast, etc.) have led to the development of several manual segmentation protocols for identifying the whole hippocampus [284, 385], as well as its subfields [373, 384, 213] and white-matter regions [12] (see Figs. 2.4, 2.5). The reliability of manual tracing protocols, based on intra-rater Dice overlaps, varies depending on the granularity of the segmentations. For the whole hippocampus, it ranges from 0.79 to 0.92, whereas for subfields the ranges are: CA1: 0.78-0.88, CA2/CA3: 0.70-0.85, CA4/dentate gyrus: 0.80-0.84 [373, 385, 351]. There has been a significant push towards harmonization of these different protocols in an effort to facilitate clinical applications based on hippocampal morphometry [38, 109, 178]. A recent outcome of these efforts is a harmonized protocol (HarP) for whole-hippocampal segmentation that shows high reproducibility on MR images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [137]. Although manual tracing is considered the gold standard of hippocampal segmentation, it requires significant time and resource investment from expert raters.
Thus, for practical purposes, accurate automated segmentation methods are critical for hippocampal volumetric and morphometric studies as well as potential clinical applications. Several proposed automated techniques have been applied to hippocampal segmentation and have reported performance comparable to the manual gold standard [363, 236, 80, 360, 278, 32, 351].
Figure 2.4: 3T in-vivo high-resolution atlas of the hippocampal subfields [373]
Figure 2.5: 3T in-vivo high-resolution atlas of the hippocampal subfields and white-matter structures [12]
Specific to hippocampal segmentation, the similar intensity values of grey matter in the hippocampus and its neighbouring structures, such as the amygdala, caudate nucleus, and thalamus, complicate the boundary definitions [125]. This motivates the use of atlases to incorporate prior anatomical knowledge in order to resolve the ambiguous intensity profiles of neighbouring structures. Hence, MALF approaches have shown great success in accurate hippocampal segmentation. Sabuncu et al. propose a generative framework for label-fusion algorithms by probabilistically modeling the relationship between atlas and target images [302]; many MALF-based techniques can be compared within this framework. Another notable work by Coupe et al. extends the MALF approach to a patch-based procedure that captures similarity between subsets of voxels in the atlas and target images to assign structural labels [82]. A similar idea of incorporating voxel-neighbourhood similarity information in label fusion is used by [189] as an extension of the classical STAPLE [363] algorithm to address local vs. global image matching problems. Further extending the similarity comparison techniques to consider pairwise dependencies within the atlas pool, [360] propose a joint label-fusion approach to mitigate systematic errors resulting from similar atlases during label fusion. In an effort to minimize the number of atlases required for MALF approaches, [278] propose a bootstrapping method to boost the number of labeled templates from a small atlas pool without loss of segmentation performance. Another interesting approach by [153] leverages machine-learning classifiers and a local labelling strategy to estimate the target image segmentation. A comparative study of these state-of-the-art methods, along with other published works, reports performance based on Dice overlap ranging from 0.64 to 0.91 [100].
The common factors contributing to this performance variation are 1) the choice of gold-standard segmentation (manual segmentation protocol), 2) the automation level (semi- or fully automatic), and 3) the choice of test cohort, which dictates the sample size, demographics, and image acquisition parameters. Nevertheless, several approaches have reported consistently high Dice scores (> 0.88) [82, 360, 278] on large cohorts (N > 60), encouraging their use in large-scale hippocampal volumetric and morphometric studies. Chapter 3 focuses on hippocampal segmentation methods, where these techniques are discussed in detail along with the description and validation of the label-fusion method proposed as part of this thesis. Several automated segmentation protocols have been utilized in the analysis of AD subject populations, particularly for developing hippocampal (or subfield) volumes as a discriminative biomarker between different diagnostic stages [180, 62, 74, 194, 142, 132, 163]. Studies have shown volumetric differences between cognitively normal, MCI, and AD groups; however, validation of significant differences within subgroups of MCI, i.e. early and late MCI, or stable versus declining MCI, has proven challenging. The use of hippocampal segmentation as a biomarker is discussed in detail in Section 2.6, as well as in Chapters 3, 4, and 5.
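The Dice overlap used throughout these comparisons is straightforward to compute from two binary masks; a minimal version:

```python
import numpy as np

def dice_overlap(mask_a, mask_b):
    """Dice coefficient: 2*|A intersect B| / (|A| + |B|), in [0, 1]."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy "automated" vs "manual" masks on a 4x4 slice (purely illustrative).
manual = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0]])
auto = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 1, 1],
                 [0, 0, 0, 0]])

print(dice_overlap(manual, auto))  # 2*5 / (6+6) ≈ 0.833
```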
2.5.5 Cortical surface estimation
Cortical thickness and surface area are commonly used metrics for examining the neuroanatomical properties and alterations of the cerebral cortex as captured by MR images. Cortical thickness and surface area are known to reflect differential neurobiological processes as well as genetic influences [273, 287]. The layered organization of the cortex can be parsed into columnar units [258]. In this arrangement, cortical thickness and surface area are postulated to reflect the number of cells within a cortical column and the number of columns themselves, respectively [287]. Furthermore, cortical thickness is thought to reflect dendritic arborisation and pruning [171], while surface area is thought to reflect cortical folding and gyrification. The developmental trajectories of these two measures depend on genetic factors, which impact the division of progenitor cells in the periventricular area during embryogenesis [58, 273]. Thus, investigating these two measures and how they separately contribute to cortical architecture can provide important information about neuroanatomical development and the potential underlying cellular mechanisms, as well as about the neuroanatomical correlates of various diseases and neuropsychiatric conditions [224, 286, 99, 106, 310]. Advances over the last twenty-plus years have made accurate, automated estimation of cortical thickness from MR images possible [124, 237, 221], allowing for detailed, regionally specific analysis of the cerebral cortex. Cortical thickness estimates are derived by linearly registering images to a model in stereotaxic space, classifying the brain into grey matter, white matter, and CSF, and defining the boundaries of the white matter and pial surfaces. Inner and outer surfaces are then extracted, and the distance between these two surfaces at a given point is calculated, representing the cortical thickness at that vertex [124, 221].
Two of the most widely used tools for estimating cortical thickness are CIVET [221] and FreeSurfer [124]. These algorithms differ in the manner in which the cortical surfaces are reconstructed.
CIVET uses the Constrained Laplacian Anatomical Segmentation using Proximities (CLASP) method [196], in which the pial surface is expanded from the white matter surface to the GM-CSF boundary along a Laplacian field. The tlink method is then used to estimate the cortical thickness by calculating the distance between corresponding vertices on the inner and outer surfaces. The overall CIVET process on a sample image is shown in Fig. 2.6. In FreeSurfer, a deformable mesh is used to reconstruct the inner and pial surfaces, and the cortical thickness is estimated as the shortest distance between the two surfaces at any point in the cortex. An alternative approach for estimating cortical thickness leverages voxel-based methods, which do not require deformable mesh models; however, these methods are more sensitive to voxel sizes and partial volume effects [187, 89]. A head-to-head study [288] comparing CIVET (v1.1.9) and FreeSurfer (v5.3.0) on an AD cohort demonstrated that both pipelines offer similar performance, with CIVET providing slightly higher sensitivity to atrophy patterns at the MCI stage. The cortical surface extraction for all subjects in projects 2 and 3 was performed using CIVET 1.1.12.
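The tlink-style thickness measure, the distance between corresponding vertices on the two surfaces, reduces to a per-vertex Euclidean norm once the surfaces share a vertex correspondence. A toy sketch with synthetic surfaces (not CIVET's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "white matter" surface: 100 vertices with unit outward normals, and a
# "pial" surface displaced along those normals by known thicknesses.
n_vertices = 100
wm_vertices = rng.normal(size=(n_vertices, 3))
normals = rng.normal(size=(n_vertices, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
true_thickness = rng.uniform(1.5, 4.5, n_vertices)      # mm, cortex-like range
pial_vertices = wm_vertices + normals * true_thickness[:, None]

# tlink-style estimate: distance between corresponding vertices.
estimated = np.linalg.norm(pial_vertices - wm_vertices, axis=1)

print(np.allclose(estimated, true_thickness))  # True
```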
Figure 2.6: CIVET stages for extracting cortical surface. The image is randomly selected from the ADNI2 cohort used in the analysis in this thesis. 1) linear/affine registration of the MR images from native to stereotaxic space, using the average MNI ICBM152 model as the target of registration, 2) tissue classification into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), 3) the boundary between cortical GM and subcortical WM is extracted using a deformable surface model, starting from an ellipsoid that contracts to take the shape of the white matter mask. The pial surface, or the boundary between the cortical GM and the extra-cortical CSF expands outwards from the WM surface to the CSF, 4) The surfaces are registered to the MNI ICBM152 surface template for comparability. The cortical thickness is computed by evaluating the distance, in mm, between the original WM and GM surfaces transformed back to the native space of the original MR images, then interpolated onto the surface template.
2.6 Computational neuroscience and machine-learning
Statistical analysis of neuroimaging data can be used to describe and explain brain structure and function at the population as well as individual level. Conceptually, these inference analyses can be divided into decoding and encoding problems [265]. The classical univariate and mass-univariate approaches with structural (or functional) phenotypes as dependent variables belong to the family of encoding models.
The encoding objective involves modeling the dependency between these phenotypes and the population under study via regression or group-wise difference analyses using the general linear model (GLM) framework. The encoding effect can be hypothesized at different granularities, such as regional volumes or individual voxels. In the context of AD, several studies have investigated structural differences across the diagnostic groups [177, 57, 125, 41, 224, 286, 80]. Many volumetric, cortical thickness, and voxel-wise studies have established strong evidence for significant differences in medial temporal lobe and cortical surface regions, particularly between AD patients and matched healthy subjects [299, 180, 176, 355, 117, 222]. Studies investigating MCI have shown varying patterns of structural differences, potentially due to the large heterogeneity in symptom presentation in this group. The decoding models typically consist of classification or regression tasks that use structural (or functional) phenotypes as predictors. Typically, the computational objective here is to predict an individual's state (sensory, motor, cognitive, diagnostic, etc.). Recently popular machine-learning approaches to statistical analysis have shown remarkable success in tackling multivariate decoding problems [158, 128, 95, 202, 116, 253]. The decoding models also hypothesize predictive features at different granularities, represented by volumetric, vertex-wise, or voxel-wise input depending on the task at hand. There is an implicit circularity between encoding and decoding modeling, as an encoding analysis informing on neurological processes can serve as a prior for a decoding analysis to achieve better performance. Conversely, the individual-level predictive performance of decoding models can validate group-level findings beyond significance testing.
2.6.1 Machine-learning
Machine-learning (ML) is a branch of statistics that heavily utilizes advanced computational algorithms for extracting meaningful relationships from large amounts of data and making accurate predictions [34, 167, 30, 217]. A subset of ML approaches with deep artificial neural network architectures, also known as deep nets, has enjoyed tremendous success in the last decade in areas of computer vision, speech processing, and natural language processing, as well as control systems and game playing [166, 217]. Leveraging recent advances in computational hardware, such as graphics processing units (GPUs), and the availability of large datasets, ML has tackled problems pertaining to object and face recognition [212, 159], image generation [147], speech and sentiment recognition [145], language translation [379], caption generation [191], personal assistance [382], autonomous vehicles [140], and Atari, Go, and chess game playing [251, 318, 319], to name a few. In the context of neuroimaging, ML approaches have been proposed to tackle modeling and prediction challenges pertaining to the high-dimensional, multimodal, and longitudinal data resulting from MR and other imaging modalities. Traditional hypothesis-driven statistical methods have had limited success in handling complex, high-dimensional datasets for individual prediction tasks. In particular, univariate, linear computational models are unable to capture the spatiotemporal structure and nonlinear relationships within imaging data. ML approaches allow these relationships to be learned from the data itself instead of from strongly hypothesis-driven predictive models, and have shown promising results with neuroimaging data. Broadly, ML approaches can be divided into three categories based on their purpose. 1) Supervised learning: a family of classification or regression tasks that aim at predicting a discrete label or a real value from input (e.g. object or diagnostic classification).
2) Unsupervised learning: problems relating to the discovery of parsimonious, meaningful features representative of typically high-dimensional input data (e.g. matrix factorization, clustering). 3) Reinforcement learning: problems that involve devising a strategy to maximize an eventual payout (e.g. chess, Go). Only supervised and unsupervised learning approaches are within the scope of this thesis, as they are applicable to many predictive and descriptive tasks commonly encountered in computational neuroscience.
2.6.2 Supervised learning
Supervised learning techniques make use of labeled data to learn relationships between input (X) and output (y) pairs. During training of supervised models, the learning process aims to identify a set of weights that can accurately predict the output. The output can take a variety of data types, including categorical, continuous, and structured vector forms. Broadly, supervised classifiers can be divided into generative and discriminative models. Given a set of paired (X, y) examples, generative classifiers first aim to learn the joint probability distribution P(X, y), followed by calculation of P(y|X) using Bayes' rule to infer the most likely label based on the input data. Generative models require a relatively large pool of labeled examples in order to accurately model the joint distribution P(X, y). Generative models provide valuable insights into the data distribution, facilitating model interpretation. In addition, they can be practically useful for imputing missing data points and for augmenting the training set by generating new examples. Commonly used generative models include naive Bayes, Gaussian mixture models, and generative adversarial networks [144]. Discriminative classifiers, in contrast, circumvent estimation of the joint probability distribution and directly learn the conditional probability distribution P(y|X). For binary classification, this essentially implies defining a single cut-off boundary that separates the input space for each class without explicitly specifying the data distribution. In comparison to generative models, discriminative approaches require fewer labeled examples. Logistic regression [85], decision trees, random forests [45], support vector machines [79], and artificial neural networks [298] are some of the commonly used discriminative classifiers. Many of these classifiers can also be modified to perform regression for continuous-valued output prediction.
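The generative/discriminative distinction described above can be illustrated with scikit-learn (assuming its availability); Gaussian naive Bayes models P(X, y) and applies Bayes' rule, while logistic regression fits P(y|X) directly. The synthetic dataset is a stand-in for imaging features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for an imaging-feature classification task
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: models P(X, y), then applies Bayes' rule for P(y | X)
gen = GaussianNB().fit(X_tr, y_tr)
# Discriminative: models P(y | X) directly
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(gen.score(X_te, y_te), disc.score(X_te, y_te))
```

Both reach similar accuracy on this easy task; the practical differences (sample efficiency, interpretability, imputation) show up on harder, smaller datasets.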
Among these, linear regression (LR), support vector machine (SVM), random forest (RF), and artificial neural network (ANN) variants are used in this thesis. Specifically, two customized ANN models are presented in projects 2 and 3 to handle multimodal and longitudinal input, respectively. A brief description of the reference ML models, i.e. LR, SVM, and RF, along with ANNs, which form the basis of many deep-learning approaches, is provided below.
2.6.3 Reference models
Given below is a brief description of the three reference models used to establish baseline performances in projects 2 and 3. Project 2 uses implementations of these models for regression analysis, whereas project 3 uses them for classification tasks. For the sake of simplicity, the description is limited to the classification versions.
Logistic regression
Logistic regression (LR) is one of the most commonly used discriminative models for binary classification tasks. The model is based on the logistic (sigmoid) function (see Fig. 2.7), given as:
σ(z) = 1 / (1 + exp(−z))   (2.11)
z = θ^T x + θ_0
where z is a linear combination of the input variables x_i and can take any real value. The sigmoid function acts as a squashing transformation that bounds the output between (0, 1), allowing a probabilistic interpretation of the prediction. The parameters of LR can be learned via maximum likelihood estimation, solved using iterative optimization algorithms such as gradient descent [34].
Figure 2.7: Logistic (sigmoid) function
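Eq. 2.11 can be checked numerically in a few lines; the weight values below are arbitrary, chosen only to illustrate the squashing behavior.

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Eq. 2.11: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# z = theta^T x + theta_0 for an illustrative weight vector
theta, theta0 = np.array([0.8, -0.5]), 0.1
x = np.array([1.2, 0.3])
z = theta @ x + theta0

print(sigmoid(z))     # probability of the positive class
print(sigmoid(0.0))   # exactly 0.5 at the decision boundary
```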
Support vector machine
Support vector machine (SVM) falls under the family of kernel-based models, which perform a feature-space transformation of the input variables prior to the classification step. For binary classification tasks, the model is represented in the following form:
y(x) = θ^T φ(x) + θ_0   (2.12)
where φ(x) denotes the feature-space transformation. The decision boundary in the feature space is constructed using a subset of examples referred to as support vectors. These examples are the closest points on either side of the boundary, which is referred to as a separating hyperplane. The SVM aims to maximize the margin between this hyperplane and the support vectors (see Fig. 2.8). In cases where the data are not linearly separable, a soft-margin criterion is used to allow a certain level of misclassification [79]. Alternatively, nonlinear classifiers based on the kernel trick can be used to improve model flexibility. Estimation of the hyperplane is a convex optimization problem, which allows computation of the global minimum. It should be noted that unlike LR, SVM does not provide probabilistic output; however, certain software implementations (e.g. scikit-learn) calibrate class probabilities using Platt scaling [279].
Figure 2.8: Support vector machine: the max-margin hyperplane (red line) maximizes the distance from the support vectors (purple circles) representing the two classes. Image adapted from [34] with reuse permission.
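A minimal scikit-learn sketch of the above, on synthetic data: a soft-margin RBF-kernel SVM, with `probability=True` enabling the Platt-scaling calibration mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Soft-margin SVM; C trades margin width against misclassification
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0).fit(X, y)

# SVM outputs are not probabilistic by default; scikit-learn's
# probability=True adds Platt scaling [279] on top of the margins
print(clf.predict(X[:2]))
print(clf.predict_proba(X[:2]))     # calibrated class probabilities
print(clf.support_vectors_.shape)   # the support vectors defining the margin
```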
Random forest
Random forest (RF) is an ensemble method that combines predictions from multiple decision tree classifiers [45]. RF combines bootstrap aggregation (bagging) and random feature selection techniques to construct a collection of decision trees. Each tree is trained on a random sample (with replacement) of n &lt; N examples from the entire training dataset; this is known as bootstrap sampling. A given decision tree partitions the input space into different regions by thresholding the feature values (see Fig. 2.9). At each node in a tree, d ≪ D features are randomly selected, and the parent node is partitioned using the best possible binary split. The best split is determined according to an impurity criterion, which aims to maximize the homogeneity of the child nodes with respect to the parent node. Impurity can be assessed using various measures, such as the Gini index, which measures the likelihood that an example would be incorrectly labelled if it were randomly classified according to the distribution of labels within the node. The aggregation of predictions from all trees is performed based on a majority-vote criterion. This helps reduce the high variance, and consequent overfitting, that results from single-tree prediction.
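The Gini impurity and the bagged-forest construction above can be sketched as follows; the scikit-learn defaults (100 bootstrapped trees, √D features per split) match the described scheme.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def gini(labels):
    """Gini impurity: the probability of mislabeling a random example
    if it were labeled according to the node's label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # pure node -> 0.0
print(gini([0, 0, 1, 1]))   # maximally mixed binary node -> 0.5

X, y = make_classification(n_samples=300, random_state=0)
# 100 bootstrapped trees, sqrt(D) candidate features at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            criterion="gini", random_state=0).fit(X, y)
print(rf.score(X, y))
```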
2.6.4 Artificial neural networks and deep learning
Artificial neural networks (ANNs), also known as feedforward neural networks or multilayer perceptrons (MLPs), are the building blocks of many deep-learning models that have had great success in tackling high-dimensional imaging datasets. Similar to other supervised ML models, the goal of ANNs is to approximate a function f* that can map input x to output y as follows:
y = f*(x, θ)   (2.13)
where θ are the model weights (or parameters) learned via a training process. Computationally, ANNs differ from traditional ML approaches by representing f as a composite function formed by several nested functions as follows [143, 30]:
Figure 2.9: A: Example of a binary decision tree comprising 3 input variables; B: Corresponding partitioning of the two-dimensional input space into five regions using axis-aligned boundaries. Images adapted from [34] with reuse permission.
f(x) = f^(n)(... f^(3)(f^(2)(f^(1)(x))))   (2.14)
These chained connections are reflected by the hierarchical (hidden) layers in a graphical representation (see Fig. 2.10). The depth of an ANN is determined by the number of these layers. The compute operation at each hidden node in a given layer typically comprises 1) a weighted sum of inputs from the preceding layer and 2) a nonlinear activation function applied to the weighted sum. This hierarchical structure with nonlinear transformations (e.g. sigmoid, rectified linear unit) enables ANNs to represent complex, nonlinear functions that cannot be approximated accurately by linear models. ANNs are trained using gradient-descent-based algorithms. The presence of nonlinearity introduces a non-convex loss function for the optimization process, which does not have the global convergence guarantees of linear optimization problems (e.g. logistic regression or SVM). Therefore, ANNs are optimized using stochastic iterative procedures that refine the model weights to minimize a predefined loss function. All gradient-descent-based optimization algorithms require computation of a gradient. For ANNs, the gradients at each hidden layer are computed using an algorithm called backpropagation [298]. Backpropagation leverages the chain rule of calculus to compute gradients at each layer of the composite function f. Although theoretically straightforward, implementation of gradient descent coupled with backpropagation needs to address several caveats, such as weight initialization, vanishing gradients, and batch normalization.
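The forward pass of Eq. 2.14 can be sketched in a few lines of NumPy; the layer sizes and random weights below are arbitrary, and no training (backpropagation) is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    # Rectified linear unit: a common nonlinear activation
    return np.maximum(0.0, a)

def layer(x, W, b, act):
    """One layer: weighted sum of the previous layer, then activation."""
    return act(W @ x + b)

# A toy network implementing f(x) = f3(f2(f1(x))) as in Eq. 2.14
sizes = [4, 8, 8, 1]
params = [(rng.normal(0, 0.5, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(params):
        last = i == len(params) - 1
        # linear output layer, ReLU on hidden layers
        x = layer(x, W, b, (lambda a: a) if last else relu)
    return x

print(forward(rng.normal(size=4)))   # a single real-valued output
```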
The core ideas behind the formulation of ANNs have not changed substantially since the 1980s. Their recent success is attributed mainly to the availability of large datasets and powerful computing infrastructure (e.g. GPUs). Algorithmically, the notable changes include 1) use of the cross-entropy loss function in place of the mean squared error and 2) replacement of the sigmoid function by the rectified linear unit as the activation function. These innovations, along with novel network architecture designs such as convolutional networks, long short-term memory recurrent networks, Siamese networks, and U-nets, have demonstrated state-of-the-art performance on many supervised tasks involving high-dimensional data. This, in turn, has motivated the customization and application of ANNs for handling neuroimaging data for the clinical prediction tasks in this thesis.
Figure 2.10: A feed-forward artificial neural network (ANN) comprising input, hidden, and output layers. Each node from the hidden layer represents a compute operation, whereas the connections between nodes denote the model weights (parameters) that are learned through a training process.
2.6.5 Performance metrics for supervised learning
Supervised learning tasks are typically employed for individual-level prediction, in contrast with group-level analysis, and hence require a different validation paradigm. Supervised learning performance can be measured with a multitude of metrics depending on the task objective. For classification problems, these metrics include accuracy, the receiver operating characteristic (ROC) curve, the confusion matrix, specificity, sensitivity, and the F1 score. For regression problems, performance is evaluated via metrics such as mean squared error, mean absolute error, and correlation. These performance metrics are typically collected within a cross-validation framework, which involves permutation and sampling of the available data.
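The metrics listed above are all available in scikit-learn; the toy labels, probabilities, and clinical-score predictions below are invented purely to show the calls.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             roc_auc_score)

# Classification metrics on toy predictions
y_true, y_pred = [0, 0, 1, 1, 1], [0, 1, 1, 1, 0]
scores = [0.2, 0.6, 0.9, 0.8, 0.4]        # class probabilities for the ROC
print(accuracy_score(y_true, y_pred))      # 0.6
print(confusion_matrix(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, scores))

# Regression metrics on toy clinical-score predictions
s_true, s_pred = [22.0, 27.0, 30.0], [24.0, 26.0, 29.0]
print(mean_squared_error(s_true, s_pred))  # 2.0
print(mean_absolute_error(s_true, s_pred))
```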
2.6.6 Supervised ML and AD
In the neuroimaging domain, and particularly in relation to AD, a wide variety of supervised learning algorithms have been developed for tasks such as image segmentation (structured output) [361, 293], diagnostic classification (categorical) [202, 91, 87], clinical symptom severity identification (continuous) [332, 390, 383], and prognostic prediction (categorical/continuous) [322, 253, 170]. Performance of these methods is discussed in detail in Chapters 4 and 5.
2.6.7 Unsupervised learning
Unsupervised learning aims at inferring a function that describes hidden structure in “unlabeled” data. Two common classes of unsupervised techniques are latent variable models, typically implemented via matrix factorization, and clustering models. Latent variable models are commonly used for dimensionality reduction, transforming high-dimensional input into fewer components that parsimoniously represent the useful information. Principal component analysis [264, 151], independent component analysis [244, 49], and non-negative matrix factorization [218] are some examples of this class of techniques. In the context of dimensionality reduction, these models can be thought of as feature engineering operators [34, 30]. In contrast with a variable selection process, which simply drops a subset of variables to reduce input dimensionality, the feature engineering approach defines a mapping from the high-dimensional input space onto a low-dimensional representation. The variables in the new space are transformed versions of the original input, not a selected subset. The second class of techniques, clustering, aims at grouping examples based on a certain criterion. Unlike latent variable models, clustering approaches yield categorical labels denoting cluster membership. Different types of clustering approaches are used depending on the task goals, the input data distribution, and the notion of similarity between two examples. K-means, Gaussian mixture models, spectral clustering, and hierarchical clustering are some of the commonly used clustering techniques [342, 1, 34].
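Both classes of techniques can be sketched with scikit-learn on synthetic data; PCA plays the latent variable / feature engineering role, and k-means yields the categorical cluster memberships.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional data with 3 latent groups
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Latent variable model: map 50 dimensions onto 2 engineered features
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)
print(X_low.shape)                           # (300, 2)
print(pca.explained_variance_ratio_.sum())   # variance retained

# Clustering: categorical labels denoting cluster membership
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(labels[:10])
```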
2.6.8 Performance metrics for unsupervised learning
Since there is no output label, unsupervised learning has a different set of evaluation metrics, which measure the stability or reproducibility of the learned features or clusters. Explained variance (for a given number of principal components), the silhouette coefficient [297], and pairwise cluster stability are some of the metrics used for such validation. When unsupervised learning is used as a preprocessing step for dimensionality reduction, it is usually validated within the hyperparameter optimization module of the cross-validation framework.
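The silhouette coefficient mentioned above can be used, for example, to compare candidate cluster counts in the absence of ground-truth labels; the blob data below is synthetic and the sweep over k is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Silhouette values near +1 indicate tight, well-separated clusters;
# comparing across k suggests a plausible number of clusters.
sil = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)
print(sil)
```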
2.6.9 Unsupervised ML and AD
In the neuroimaging domain, and particularly in relation to AD, unsupervised learning algorithms have been applied to discover subtypes of the disease based on clinical and neuroimaging measures [263, 370, 123]. Clustering-based subtyping is discussed in detail in Chapter 5.
2.6.10 Performance evaluation
Cross-validation (CV) is a procedure commonly used for training and testing supervised ML models in scenarios with small sample sizes [205, 111, 14]. The primary goal of CV is to evaluate the predictive performance and generalizability of ML models on unseen data. The CV framework comprises several computational stages, including preprocessing steps such as feature selection and feature transformation, which are design choices to be made by the investigator. The CV framework is split into two processing pipelines, namely train and test. The available data are first split into two subsets to be used in each of these pipelines. All operations pertaining to “learning”, i.e. model parameter estimation, are performed within the training pipeline. The learned models and/or transformations are then applied in the test pipeline to unseen data. This is repeated multiple times with different train and test splits of the available data, referred to as folds. The training pipeline begins with raw data acquired from different modalities such as MR imaging, genetics, and demographics. In the field of neuroimaging, the dimensionality of the raw input data typically far exceeds the number of available samples [220, 210]. Without sufficient samples during training, models are likely to memorize the one-to-one mapping between each input and output pair. Consequently, such models are unable to learn meaningful patterns that generalize to unseen data. This is referred to as overfitting [34, 210]. It thus becomes imperative to reduce the dimensionality of the raw input data in order to mitigate overfitting by unnecessarily complex models with large numbers of parameters. This can be achieved via feature selection or feature transformation techniques (or both) [264]. Feature selection involves selecting a subset of input variables based on certain criteria. These criteria can be based on a prior hypothesis (e.g.
anatomical regions of interest based on known pathology) or data-driven (e.g. anatomical regions of interest based on significant group-wise differences). In feature selection, the raw values of the selected variables are preserved. In comparison, feature transformation comprises a filtering operation that maps the original multivariate space into another multivariate space of lower dimensionality. These transformations can also be hypothesis-based (e.g. average values over an anatomical region of interest) or data-driven (e.g. matrix factorization methods such as principal component analysis and independent component analysis). After transformation, the input features are represented by a new set of values derived from weighted combinations of the original raw input [165, 30, 264]. It is important to note that feature selection and transformation operations must be performed using only the training data subset to avoid “double-dipping”, i.e. utilization of information from unseen data, which results in performance inflation [210]. During performance evaluation, the learned selection and transformation operations are applied directly to the test data without any further tuning based on the test data distribution.
As previously stated, these operations are repeated multiple times on permuted train and test splits of the available data (see Fig. 2.11). Several strategies for data splitting are available, including leave-one-out, k-fold, stratified k-fold, and Monte-Carlo sampling [14]. Stratified k-fold is one commonly used approach that involves splitting the data into k mutually exclusive partitions, or folds. During each iteration, k−1 folds are used for training and the remaining fold is used for testing. This is repeated k times, so that each sample is tested exactly once. Additionally, the available data are stratified prior to sampling, so that each fold comprises a similar proportion of output labels for the task at hand. This provides balanced proportions of output labels in the train and test subsets. Additional stratification can also be enforced to control for other confounding factors such as demographics (sex, race, etc.) and acquisition peculiarities (study, site, etc.). Stratification typically helps with model training and reduces variability in classifier performance during cross-validation.
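A stratified k-fold loop that keeps feature selection inside the training pipeline (and so avoids double-dipping) can be sketched with a scikit-learn pipeline; the synthetic dataset and the choice of k = 20 features are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)

# Scaling and feature selection live inside the pipeline, so they are
# fit on each train split only -- avoiding "double-dipping" [210]
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=20),
                      SVC(kernel="linear"))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())
```

Had the selection been fit on all the data before splitting, the test folds would have leaked into the selected features, inflating the reported accuracy.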
Another approach to evaluating model performance and generalizability involves training and testing models on independent datasets: models are trained on one dataset and tested on another. In the more general case, with n independent datasets available, this is extended to a leave-one-dataset-out approach, where models are trained on n−1 datasets and tested on the held-out dataset. This allows evaluation of dataset-specific biases and invariances of the trained models. It is a relatively challenging validation paradigm, as different datasets comprise inherent site- and study-specific biases in acquisition protocols that introduce markedly different data distributions. Nevertheless, it is important to address these challenges for the practical use of ML models.
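The leave-one-dataset-out scheme maps directly onto scikit-learn's `LeaveOneGroupOut` splitter; here the three "datasets" are hypothetical group labels attached to one synthetic pool, so, unlike real multi-site data, they share a single distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# Hypothetical dataset/site membership for each subject (3 studies)
groups = np.repeat([0, 1, 2], 100)

# Train on n-1 datasets, test on the held-out one, for each in turn
logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=logo)
print(scores)   # one score per held-out dataset
```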
2.7 Project synopses
The next three chapters comprise the published and accepted (in production) manuscripts that describe and discuss the three projects completed as part of this thesis. Brief summaries of the scope and findings of each project are given below.
Figure 2.11: Nested k-fold cross-validation paradigm. During each iteration, k−1 folds are used as the train subset, with the remaining fold as the test subset. The samples from each train subset are further divided into j inner folds to define and evaluate various data preprocessing operations, including feature selection / transformation, data scaling / normalization, and hyperparameter configuration. The top-performing hyperparameter configuration is then used to train a single model on the entire train subset, which is subsequently applied to the test subset from the outer fold. The model is finally evaluated based on the test-set performance.
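The nested scheme of Fig. 2.11 corresponds to wrapping an inner hyperparameter search inside an outer evaluation loop; the grid of C values and fold counts below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop (j folds): hyperparameter selection on each train subset
inner = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]},
                     cv=StratifiedKFold(n_splits=3))

# Outer loop (k folds): unbiased performance estimate on untouched folds
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer)
print(scores.mean())
```

Because hyperparameters are chosen only on inner folds, the outer test folds never influence model selection.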
2.7.1 Manual-protocol inspired technique for improving automated MR image segmentation during label fusion (Published online 2016 Jul 19. doi: 10.3389/fnins.2016.00325)
The first project presents a novel method, “Autocorrecting Walks over Localized Markov Random Fields (AWoL-MRF)”, that aims at mimicking the sequential process of manual segmentation, which is the gold standard for virtually all segmentation methods. AWoL-MRF begins with a set of candidate labels generated by a multi-atlas segmentation pipeline as an initial label distribution and refines low-confidence regions based on a localized Markov random field (L-MRF) model using a novel sequential inference process (walks). The results demonstrate that AWoL-MRF produces state-of-the-art results, with superior accuracy and robustness with a small atlas library compared to existing methods. The method is validated by performing hippocampal segmentations on three independent datasets: (1) the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database; (2) a first-episode psychosis patient cohort; and (3) a cohort of preterm neonates scanned early in life and at term-equivalent age. AWoL-MRF is compared qualitatively as well as quantitatively to other label-fusion techniques, including majority vote, STAPLE, and Joint Label Fusion. AWoL-MRF reaches a maximum accuracy of 0.881 (dataset 1), 0.897 (dataset 2), and 0.807 (dataset 3) based on the Dice similarity coefficient metric, offering significant performance improvements with a smaller atlas library (&lt; 10) over the compared methods. The diagnostic utility of AWoL-MRF is also discussed by analyzing volume differences within diagnostic categories based on the ADNI1: Complete Screening dataset.
2.7.2 An artificial neural network model for clinical score prediction in Alzheimer’s disease using structural neuroimaging measures (accepted in the Journal of Psychiatry and Neuroscience)
The second project presents a novel anatomically partitioned artificial neural network (APANN) model for predicting individual-level clinical scores from the mini-mental state exam (MMSE) and the Alzheimer’s Disease Assessment Scale (ADAS-13). APANN combines input from two structural MR imaging measures relevant to the neurodegenerative patterns observed in AD, namely hippocampal segmentations and cortical thickness. Performance of APANN is evaluated with 10 rounds of 10-fold cross-validation in three sets of experiments using the ADNI1, ADNI2, and ADNI1+ADNI2 cohorts. Pearson’s correlation and root mean square error between the actual and predicted scores for ADAS-13 (ADNI1: r = 0.60; ADNI2: r = 0.68; ADNI1and2: r = 0.63) and MMSE (ADNI1: r = 0.52; ADNI2: r = 0.55; ADNI1and2: r = 0.55) demonstrate that APANN can accurately infer clinical severity from MR imaging data. Furthermore, APANN is also validated in a proof-of-concept longitudinal analysis comprising prediction of future clinical scores. The results show that APANN provides a highly robust and scalable framework for prediction of clinical severity at the individual level utilizing high-dimensional, multimodal neuroimaging data.
2.7.3 Modeling and prediction of clinical symptom trajectories in Alzheimer’s disease using longitudinal data (accepted in PLOS Computational Biology)
The third project presents a computational framework comprising machine-learning techniques for 1) modeling symptom trajectories and 2) predicting symptom trajectories using multimodal and longitudinal data. The project comprises a primary analysis performed using three cohorts from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and a replication analysis performed using subjects from the Australian Imaging, Biomarker and Lifestyle (AIBL) Flagship Study of Ageing. In the modeling stage, prototypical symptom trajectory classes are defined using clinical assessment scores from the mini-mental state exam (MMSE) and the Alzheimer’s Disease Assessment Scale (ADAS-13) at nine timepoints spanning six years, based on a hierarchical clustering approach. Subsequently, in the prediction stage, these trajectory classes are predicted for each individual using magnetic resonance (MR) imaging, genetic, and clinical variables from two timepoints (baseline + follow-up). For prediction, a longitudinal Siamese neural network (LSN) with a novel architecture for combining the multimodal data from two timepoints is presented. The trajectory modeling yields two (stable, decline) and three (stable, slow-decline, fast-decline) trajectory classes for the MMSE and ADAS-13 assessments, respectively. For the predictive tasks, LSN offers highly accurate performance with 0.900 accuracy and 0.968 AUC for the binary MMSE task and 0.760 accuracy for the 3-way ADAS-13 task on the ADNI datasets, as well as 0.715 accuracy and 0.907 AUC for the binary MMSE task on the replication AIBL dataset.
Chapter 3
Manual-Protocol Inspired Technique for Improving Automated MR Image Segmentation during Label Fusion
Nikhil Bhagwat [1,2,3,*], Jon Pipitone [3], Julie L. Winterburn [1,2,3], Ting Guo [4,5], Emma G. Duerden [4,5], Aristotle N. Voineskos [3,6], Martin Lepage [2,7], Steven P. Miller [4,5], Jens C. Pruessner [2,8], M. Mallar Chakravarty [1,2,7], and Alzheimer’s Disease Neuroimaging Initiative.
1. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada
2. Cerebral Imaging Centre, Douglas Mental Health University Institute, Verdun, QC, Canada
3. Kimel Family Translational Imaging-Genetics Research Lab, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
4. Neurosciences and Mental Health, The Hospital for Sick Children Research Institute, Toronto, ON, Canada
5. Department of Paediatrics, The Hospital for Sick Children and the University of Toronto, Toronto, ON, Canada
6. Department of Psychiatry, University of Toronto, Toronto, ON, Canada
7. Department of Psychiatry, McGill University, Montreal, QC, Canada
8. McGill Centre for Studies in Aging, Montreal, QC, Canada
Correspondence: Nikhil Bhagwat, Email: [email protected]
Keywords: MR Imaging, Segmentation, Multi-Atlas Label-Fusion, Markov Random Fields, Hippocampus, Alzheimer’s disease, First Episode Psychosis, Schizophrenia, Premature Birth and Neonates.
3.1 Abstract
Recent advances in multi-atlas based algorithms address many of the previous limitations of model-based and probabilistic segmentation methods. However, at the label-fusion stage, a majority of algorithms focus primarily on optimizing weight-maps associated with the atlas library based on a theoretical objective function that approximates the segmentation error. In contrast, we propose a novel method, Autocorrecting Walks over Localized Markov Random Fields (AWoL-MRF), that aims at mimicking the sequential process of manual segmentation, by which the gold standard is defined for virtually all segmentation methods. AWoL-MRF begins with a set of candidate labels generated by a multi-atlas segmentation pipeline as an initial label distribution and uses it to partition the given image into high- and low-confidence segmentation regions. Then, the labels of the low-confidence regions are updated based on a localized Markov random field (L-MRF) model and a novel sequential inference process (walks), which captures the behavior of a manual rater. The approach combines the strong a priori information from the atlas library with the local spatial and intensity information from the target image, without depending on computationally expensive pairwise comparisons with the atlas library. We show that AWoL-MRF produces state-of-the-art results with a small atlas library (&lt; 10) and improves the accuracy and robustness of existing segmentation pipelines. We validate the proposed approach by performing hippocampal segmentations on three independent datasets: 1) the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database; 2) a first-episode psychosis patient cohort; and 3) a cohort of preterm neonates scanned early in life and at term-equivalent age. We assess the improvement in performance qualitatively as well as quantitatively by comparing AWoL-MRF with majority vote, STAPLE, and Joint Label Fusion methods.
AWoL-MRF reaches a maximum accuracy of 0.881 (dataset 1), 0.897 (dataset 2), and 0.810 (dataset 3) based on the Dice similarity coefficient metric, offering significant improvements with a smaller atlas library over the compared methods. We also evaluate the diagnostic utility of the presented method by analyzing volume differences per disease category in the ADNI1: Complete Screening dataset. The source code for AWoL-MRF can be found at https://github.com/CobraLab/AWoL-MRF.
3.2 Author Contributions
Nikhil Bhagwat (NB) worked on the development of the AWoL-MRF algorithm and its subsequent implementation and validation. He performed preprocessing and quality control of the MR image datasets. He also wrote the manuscript of the published research paper. Jon Pipitone assisted with the preprocessing of MR images, provided feedback on the proposed methodological approach, and additionally provided support for computational resources. Julie Winterburn served as an expert manual rater for neonatal MR image segmentation. Ting Guo, Emma G. Duerden, and Steven P. Miller performed acquisition and curation of the neonatal dataset. Martin Lepage performed acquisition and curation of the FEP dataset. Jens C. Pruessner served as an expert anatomist who developed the manual protocol for hippocampal segmentation on the ADNI dataset; he also advised on the manual segmentation protocols for the FEP and neonatal datasets. Aristotle N. Voineskos served as a clinical advisor and is a member of NB’s thesis committee. He also served as a supervisor for NB in a lab at CAMH that provided significant computational resources. M. Mallar Chakravarty is NB’s thesis supervisor. He provided guidance on the development and validation of all proposed models and techniques, as well as manuscript writing.
3.3 Introduction
The volumetric and morphometric analysis of neuroanatomical structures has growing importance in many clinical applications. For instance, structural characteristics of the hippocampus have been used as an important biomarker in many neurological and psychiatric disorders, including Alzheimer’s disease (AD), schizophrenia, major depression, and bipolar disorder [157, 133, 223, 192, 247, 366]. The gold standard for neuroanatomical segmentation is manual delineation by an expert human rater. However, with the increasing ubiquity of magnetic resonance (MR) imaging technology and neuroimaging studies targeting larger populations, the time and expertise required for manual segmentation of large MR datasets become a critical bottleneck in analysis pipelines [243, 242]. A manual rater’s performance depends on his or her specialized knowledge of neuroanatomy. A generic manual segmentation protocol leverages this anatomical knowledge and uses it in tandem with voxel intensities to enforce structural boundary conditions during the delineation process. This is, of course, the premise of many automated model-based segmentation approaches. Multi-atlas based approaches have been shown to improve segmentation accuracy and precision over model-based approaches [71, 364, 285, 161, 10, 56, 227, 235, 301, 374, 162, 359, 386, 54]. The processing pipelines of these approaches can be divided into multiple stages. First, each atlas image is registered to a target image, the image to be segmented. Subsequently, the atlas labels are propagated to produce several candidate segmentations of the target image. Finally, a label-fusion technique, such as voxel-wise voting, is used to merge these candidate labels into the final segmentation of the target image. For the remainder of the paper we refer to this latter stage within a multi-atlas segmentation pipeline as “label-fusion”, which is the core interest of this work.
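The final stage of the pipeline described above can be sketched with a toy voxel-wise majority vote; the 4×4×4 image, the 5-atlas library, and the 80% agreement rate are all illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate segmentations: 5 atlas labels propagated onto a
# 4x4x4 target image (binary labels; shapes are illustrative only)
true_seg = (rng.random((4, 4, 4)) > 0.5).astype(int)
candidates = np.stack([
    np.where(rng.random(true_seg.shape) < 0.8, true_seg, 1 - true_seg)
    for _ in range(5)
])

# Majority vote: the uniform-weight special case of label fusion
votes = candidates.sum(axis=0)
fused = (votes >= 3).astype(int)

# Voxels with near-split votes are the low-confidence regions that a
# method like AWoL-MRF would subsequently refine
low_conf = (votes > 1) & (votes < 4)
print(fused.shape, int(low_conf.sum()))
```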
In many image processing and computer vision applications, the Markov Random Field (MRF) has been a popular approach for modeling spatial dependencies, and particularly in the context of neuroimaging, it has been used in several model-based segmentation techniques. Existing software packages, such as FreeSurfer [126] and FSL [324], use MRFs for gray matter, white matter, and cerebrospinal fluid classification as well as for segmentation of multiple subcortical structures. For example, FreeSurfer uses an anisotropic non-stationary MRF that encodes the inter-voxel dependencies as a function of location within the brain. Pertaining to multi-atlas label fusion techniques, STAPLE (Simultaneous Truth And Performance Level Estimation) [364] uses a probabilistic performance framework consisting of an MRF model and an Expectation-Maximization (EM) inference method to compute the probabilistic estimate of a true segmentation based on an optimal combination of a collection of segmentations. STAPLE has been explored in several studies for improving a variety of segmentation tasks [75, 76, 188, 6]. Alternatively, a majority of modern multi-atlas approaches treat label-fusion as a weight-estimation problem, where the objective is to estimate optimal weights for the candidate segmentation propagated from each atlas. In a trivial case with uniform weights, this label-fusion technique boils down to a simple majority vote. In other cases, the weights can be used to exclude atlases that are dissimilar to a target image [10] to minimize the errors from unrepresentative anatomy. In a more general case, weight values are estimated using some similarity metric between the atlas library and the target image. A comprehensive probabilistic generative framework that models the underlying relationship between the atlas and target data, exploited by the methods belonging to this class, is provided by [301].
More recently, several methods [83, 295, 359] have extended this label-fusion approach by adopting spatially varying weight-maps to capture similarity at a local level. These algorithms usually introduce bias during label fusion when the weights are assigned independently to each atlas, allowing several atlases to produce similar label errors. These systematic (i.e., consistent across the subject cohort) errors can be mitigated by taking pairwise dependencies between atlases into account during weight assignment, as proposed by [359, 386]. In contrast, the proposed method – Autocorrecting Walks over Localized Markov Random Field (AWoL-MRF) – pursues a different idea for tackling the label-fusion problem. Since, for virtually all segmentation methods, the gold standard is defined by the manual labels, we hypothesize that we could achieve superior performance by mimicking the behavior of the manual rater. Consequently, the label-fusion objective we pursue here is to capture the sequential process of manual segmentation rather than to optimize atlas library weights based on similarity-measure proxies and/or to perform iterative inference to estimate optimal label configurations based on MRFs. Hence, the novelty of the approach lies in the methodological procedure, as we combine the strong prior anatomical information provided by the multi-atlas framework with the local neighborhood information specific to the given subject. In the context of segmentation of anatomical structures such as the hippocampus, the challenging areas for label assignment are mainly located at the surface regions of the structure. We observe that a manual rater traces these boundary regions by balancing intensity information and anatomical knowledge, while enforcing smoothness requirements and tackling partial volume effects.
In practice, this behavior translates into a sequential labeling process that depends on information offered by the local neighborhood around a voxel of interest. The proposed label-fusion method attempts to incorporate these observations into an automated procedure and is implemented as part of a segmentation pipeline previously developed by our group [277]. The algorithmic steps of AWoL-MRF can be summarized as follows. First, based on a given multi-atlas segmentation method, we initialize the label distribution for a neuroanatomical structure to be segmented. This initial label-vote distribution is leveraged to partition the given target volume into two disjoint subsets comprising regions with high- and low-confidence label values based on the vote distribution at the voxels. Next, we construct a set of local 3-dimensional patches comprising a certain ratio of high- and low-confidence voxels. The spatial dependencies in these patches are modeled using independent MRFs. Finally, we traverse these patches, moving from high-confidence voxels to low-confidence voxels in a sequential manner, and perform the label distribution updates based on a localized (patch-based) MRF model. We implement a novel spanning-tree method to build these ordered sequences of voxels (walks). The detailed description of this entire procedure and the key differentiating features in comparison to the existing approaches are provided in the next section. We provide an explanation and extensive validation of our approach in this paper, which is organized as follows. First, we describe the AWoL-MRF method and the underlying assumptions in detail. Then, we provide a thorough validation of the method for whole hippocampus segmentation by conducting multi-fold validation over three independent datasets that span the entire human lifespan.
The quantitative accuracy evaluations are performed on three datasets: 1) a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset; 2) a cohort of first episode psychosis (FEP) patients; and 3) a cohort of preterm neonates scanned early in life and at term-equivalent age. Additionally, we evaluate the diagnostic utility of the method by analyzing the volume differences per disease category in the ADNI1: Complete Screening dataset. We assess the accuracy and robustness of this proposed method (source code: https://github.com/CobraLab/AWoL-MRF) by comparing it with three other approaches. Our group has recently validated the performance of MAGeT Brain [277] against several other automated methods. Here, we make use of MAGeT Brain to generate candidate labels - on which a variety of label-fusion methods can be implemented. We first compare the performance of AWoL-MRF with the default majority-vote based label fusion used in MAGeT Brain. In addition, we compare AWoL-MRF with STAPLE [364], which is a more sophisticated label-fusion approach that uses an MRF model and estimates the rater performance using an EM technique. Lastly, we compare it against one of the more recent methods – Joint Label Fusion (JLF) [359] – which estimates atlas weights by taking into account the effect of pairwise dependencies approximated by the intensity similarity between atlases.
3.4 Materials and Methods
3.4.1 Methodological Novelty of AWoL-MRF
As mentioned earlier, the novelty of this approach stems from its methodological similarity with the manual labeling process. For instance, a manual rater would begin by marking a boundary of a structure that they believe to be correct (high-confidence) based on anatomical knowledge. Next, the rater would identify certain regions that require further refinement (low-confidence). Then, region-by-region (patches), the rater would perform these refinements by moving from high-confidence areas to low-confidence areas in a sequential manner, while taking into account the information offered by neighborhood voxels from orthogonal planes. Furthermore, the voxel intensity distribution conditioned on a label class leveraged by a manual rater is derived purely from the neighborhood of the target image itself and not from the atlas library. AWoL-MRF translates this into estimating the intensity distributions based on the statistics estimated from the high-confidence voxels in a given localized patch in a target image. Thus, a key difference between AWoL-MRF and existing multi-atlas label-fusion methods is that AWoL-MRF decouples from the atlas library after the registration stage. Once we obtain the label-vote distribution, we completely rely on the intensity profile of the target image and avoid any computationally expensive pairwise similarity comparisons with the atlas library. Additionally, even though we use a commonly used MRF approach to model spatial dependencies, the novel spanning-tree based inference technique that attempts to mimic the delineation process of a manual rater differentiates AWoL-MRF from traditional iterative optimization techniques such as iterated conditional modes (ICM) or Expectation-Maximization (EM). The key benefits of the AWoL-MRF implementation are twofold.
First, we offer state-of-the-art performance using a small atlas library (< 10 atlases), whereas most segmentation pipelines typically make use of large atlas libraries comprising from 30 up to 80 manually segmented image volumes [285, 161], which require specialized knowledge and experience to generate. Secondly, from a computational perspective, AWoL-MRF mitigates many expensive operations common to many multi-atlas label fusion methods. By eliminating the need for pairwise similarity metric estimation, we avoid computationally expensive registration operations whose cost increases rapidly with the size of the atlas library. Furthermore, several extensions based on patch-based comparisons between the atlas library and target image make use of a variant of a local search algorithm or a supervised learning approach [83, 295, 377, 361, 152]. For instance, [83] uses a non-local means approach to carry out label transfer based on multiple patch comparisons; [152] uses a supervised machine-learning method to train a classifier using similar patches from the atlas library. Computationally, these patch-based approaches, especially the implementations that incorporate non-local means, are expensive [361] and require a considerable number of labeled images [377, 152]. Moreover, compared to single unified MRF models, the localized MRF model reduces the computational complexity while maintaining the spatial homogeneity constraints in the given neighborhood. It also allows capturing local characteristics of the image based on high-confidence regions without requiring iterative parameter estimation and inference methods such as EM.
3.4.2 Baseline Multi-Atlas Segmentation Method
MAGeT Brain (https://github.com/CobraLab/MAGeTbrain), a segmentation pipeline previously developed by our group, is used as a baseline method of comparison [277]. MAGeT Brain uses multiple manually labeled anatomical atlases and a bootstrapping method to generate a large set of candidate labels (votes) for each voxel of a given target image to be segmented. These labels are generated by propagating atlas segmentations to a template library, formed from a subset of target images, via transformations estimated by nonlinear image registration. Subsequently, template library segmentations are propagated to each target image and these candidate labels are fused using a label-fusion method. The number of candidate labels is dependent on the number of available atlases and the number of templates. In a default MAGeT Brain configuration, the candidate labels are fused by a majority vote. (In previous investigations by our group [56, 277], we noticed no improvements based on cross-correlation and normalized mutual information based weighted voting. Hence, our default implementation uses a simple majority vote at the label fusion stage of the algorithm.) These candidate labels from MAGeT Brain serve as the input to the AWoL-MRF, STAPLE, and default majority-vote label-fusion methods. Use of these candidate labels is not trivial with the JLF implementation for the following reason. JLF requires coupled atlas image and label pairs as input. The permutations in the registration stage of the MAGeT Brain pipeline generate candidate labels totaling (number of atlases) × (number of templates). These candidate labels no longer have unique corresponding intensity images associated with them.
Use of identical atlas (or template) library images as proxies is likely to deteriorate the performance of JLF, as it models the joint probability of two atlases making a segmentation error based on intensity similarity between a pair of atlases and the target image [359]. Therefore, no template library is used during the JLF evaluation. Note that even though MAGeT Brain is used as the baseline method for the performance validation in this work, AWoL-MRF is a generic label-fusion algorithm that can be used with any multi-atlas segmentation pipeline that produces a set of candidate labels.
3.4.3 Proposed Label-Fusion Method: AWoL-MRF
A generic label fusion method involves some sort of voting technique, such as a simple majority or some variant of weighted voting, which combines labels from a set of candidate segmentations derived from a multi-atlas library. These voting techniques normally yield accurate performance at labeling the core regions of an anatomical structure; however, the overall performance is dependent on the structural variability accounted for by the atlas library. Especially in cases where only a small number of expert atlases are available, the resultant segmentation of a target image can be split into two distinct regions - areas with (near) unanimous label votes and areas with divided label votes. The proposed method incorporates this observation by partitioning the given image volume into two subsets based on the label vote distribution (number of votes per label per voxel) obtained from candidate segmentations. Subsequently, these partitions are used to generate a set of patches on which we construct MRF models to impose homogeneity constraints in the neighborhood spanned by each patch. Finally, the voxels in these localized MRFs are updated in a sequential manner, incorporating the intensity values and label information of the neighboring voxels. A detailed description of this procedure is provided below.
Image Partitioning
Let S be a set comprising all voxels in a given 3-dimensional volume. Then an image I comprising gray-scale intensities and the corresponding label volume are defined as:
I(S) : {x ∈ S} → ℝ    (3.1)
Lj(S) : {x ∈ S} → {0, 1}
Thus, Lj represents the jth candidate segmentation volume comprising binary label values (background: 0 and structure: 1) for a given image. Then, with J candidate segmentations, we can obtain a label-vote distribution through voxel-wise normalization.
V(S) = (1/J) Σj wj Lj(S)    (3.2)

where wj is the weight assigned to the jth candidate segmentation. Now, V(S) represents the label probability distribution over all the voxels in the given image. For an individual voxel, it provides the probability of belonging to a particular structure: V(xi) = P(L(xi) = 1) = 1 − P(L(xi) = 0). Now, we split set S into two disjoint subsets SH (high-confidence region) and SL (low-confidence region) such that
SH = {xi ∈ S | V(xi) > LT¹ ∪ (1 − V(xi)) > LT⁰}    (3.3)
SL = {x ∈ S | x ∉ SH}
where LT⁰ and LT¹ are the voting confidence thresholds for L = 0 and L = 1, respectively. Note that in the generic majority-vote scenario LT⁰ = LT¹ = 0.5 and SL collapses to an empty set. In order to identify and separate low-confidence regions, these thresholds are set at higher values (> 0.5) and can be adjusted based on empirical evidence (see Section 3.6.5). As mentioned earlier, voting distributions usually form a near consensus (uni-modal) towards a particular label at certain locations, such as the core regions of structures, and therefore these voxels are assigned to the high-confidence subset. In contrast, other areas that have a split (flat) label distribution are assigned to the low-confidence subset.
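As a minimal illustration of Eqs. 3.2-3.3, the vote distribution and the high-/low-confidence split can be sketched as follows (a toy sketch assuming equal weights wj = 1 and binary candidate labels; the function and variable names are ours, not part of the released implementation):

```python
import numpy as np

def partition(candidate_labels, lt0=0.75, lt1=0.75):
    """Split voxels into high-/low-confidence sets from J candidate labels.

    candidate_labels: (J, ...) binary arrays; lt0/lt1 are the confidence
    thresholds for labels 0 and 1 (both > 0.5, per the text).
    """
    V = np.mean(candidate_labels, axis=0)        # Eq. 3.2 with w_j = 1
    high_conf = (V > lt1) | ((1.0 - V) > lt0)    # Eq. 3.3: near-consensus voxels
    return V, high_conf

votes = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [1, 1, 0, 0]])                 # J = 3 candidates, 4 voxels
V, s_h = partition(votes)
# voxels 0 and 2 are unanimous -> S_H; voxels 1 and 3 (2/3 votes) -> S_L
```

In the unanimous voxels the vote fraction is 0 or 1, so they clear either threshold; the split-vote voxels fall into SL and become candidates for re-labeling.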
Patch Based Sub-graph Generation
The partitioning operation significantly reduces the number of nodes (note: voxels are referred to as nodes in the context of graphs) to be re-labeled. However, considering the size of MR images, selecting a single MRF model consisting of all SL nodes and their neighbors is a computationally expensive task. Additionally, a unified model usually considers global averages over an entire structure during parameter estimation for the choice of prior distributions, such as P(intensity|label), which may not be ideal in cases where local signal characteristics show spatial variability. Therefore, we propose a patch-based approach, which further divides the given image into smaller subsets (3-dimensional cubes) comprising SH as well as SL nodes. The subsets are created with a criterion imposing a minimum-number requirement of SH nodes in a given patch. This criterion essentially dictates the relative composition of SH and SL nodes in the patch - which is referred to as the "mixing ratio" parameter in this paper. The impact of this heuristic method of patch generation is discussed in Section 3.6.5. The basic idea behind this approach is to utilize the information offered by the SH neighbors via pairwise interactions
(doubleton clique) along with the local intensity information to update the label-likelihood of SL voxels. The implemented algorithm to generate these patches is described below.
First, the SL nodes are sorted based on the number of SH nodes in their 26-node neighborhood. Next, thresholding on the mixing ratio parameter, the top SL nodes from the sorted list are selected as seeds. Then, the patches are constructed centered at these seeds with a pre-defined length (Lpatch). Fig. 3.1-A shows the schematic representation of the SH and SL partitions based on the initial label distribution (V(S)), as well as the overlaid patch-based subsets comprising SH and SL nodes. Note that depending on parameter choice (mixing ratio and patch length), these patches may not be strictly disjoint. In this case, the nodes in overlapping patches are assigned to a single patch based on a simple metric, such as the distance from the seed node. Additionally, these patches may not cover the entire SL region. These unreachable
SL nodes are labeled according to the baseline majority vote. These two edge cases could be mitigated with sophisticated graph-partitioning methods - nevertheless, as per our preliminary investigation, such methods prove to be computationally expensive and yield minimal accuracy improvements.
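The seed-selection heuristic described above - ranking SL voxels by the number of SH voxels in their 26-node neighborhood and thresholding on the mixing ratio - can be sketched roughly as follows; the zero-padded boundary handling and all names are illustrative assumptions, and patch construction itself is omitted:

```python
import numpy as np

def neighbour_counts(high_conf):
    """Count S_H voxels in each voxel's 26-node neighbourhood (zero-padded)."""
    nx, ny, nz = high_conf.shape
    padded = np.pad(high_conf.astype(int), 1)
    counts = np.zeros((nx, ny, nz), dtype=int)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if dx == dy == dz == 0:
                    continue  # skip the centre voxel itself
                counts += padded[1 + dx:1 + dx + nx,
                                 1 + dy:1 + dy + ny,
                                 1 + dz:1 + dz + nz]
    return counts

def select_seeds(high_conf, mixing_ratio=0.2):
    """S_L voxels whose neighbourhood meets the mixing-ratio criterion."""
    counts = neighbour_counts(high_conf)
    seeds = (~high_conf) & (counts >= mixing_ratio * 26)
    return np.argwhere(seeds)

mask = np.zeros((3, 3, 3), dtype=bool)
mask[:, :, 0] = True          # one face is high-confidence (S_H)
seeds = select_seeds(mask)    # seeds lie in the adjacent low-confidence plane
```

Seeds cluster next to the high-confidence face, matching the intuition that patches should straddle the certain/uncertain boundary.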
Figure 3.1: A) The segmentation of a sample hippocampus in sagittal view during various stages of the algorithm. Row 1: The target intensity image to be segmented. Row 2: The voxel-wise label-vote distribution map for the target image based on candidate labels. Row 3: Image partitioning comprising two disjoint regions (high confidence: red, low confidence: white). Row 4: Orange patches (localized MRFs) comprising low-confidence voxels. Row 5: Fused target labels. B) Image partitioning into certain and uncertain regions and generation of patches. C) Transformation of the MRF graph into a spanning-tree representation. The tree is traversed starting from the root (seed) node and successively moving towards the leaf nodes.
Localized Markov Random Field Model
As seen from Fig. 3.1-A, B, the MRF model is built on the nodes in a given patch (SP). The probability distribution associated with a particular field configuration (the label values of the voxels in the patch) can be factorized based on the cliques of the underlying graph topology. With a first-order connectivity assumption, we get a 3-dimensional grid topology, where each node (excluding patch edges) has six connected neighbors along the Cartesian axes. Consequently, this graph topology yields two types of cliques. The singleton clique (C1) of SP is the set of all voxels contained in that patch, whereas the doubleton clique (C2) is the set consisting of all pairs of neighboring voxels in the given patch. Then, for the MRF model, the total energy (U) of a given label configuration (y) is given by the sum of the clique potentials (VC) over all cliques in this MRF model:
U(y) = Σc∈C VC(y) = Σi∈C1 VC1(yi) + Σ(i,j)∈C2 VC2(yi, yj)    (3.4)
where y = {L(xi) | xi ∈ SP}. Now, assuming that voxel gray-scale intensities (fi = I(xi)) follow a Gaussian distribution given the label value, we get the following relation for the singleton clique potential based on the MRF model:
VC1(yi) = log(P(fi | yi)) = −log(√(2π) σyi) − (fi − µyi)² / (2σyi²)    (3.5)

The mean and variance of the Gaussian model can be estimated for each patch empirically, utilizing the SH nodes in the given patch as a training set. This approach proves to be advantageous especially in the context of T1-weighted images of the brain, as intensity distributions tend to fluctuate spatially. The doubleton clique potentials are modeled to favor similar labels at the neighboring nodes and are given by the following relation:
VC2(yi, yj) = −β d(yi, yj) = { −β if yi = yj ; +β if yi ≠ yj }    (3.6)
The β parameter can be estimated empirically using the atlas library [301]. As β increases, the regions become more homogeneous. This is discussed further in Section 3.6.5. Finally, the posterior probability distribution of the label configuration can be computed using the Hammersley-Clifford theorem, and is given by:
P(y|f) = (1/Z) exp(−U(y))    (3.7)
−log P(y|f) ∝ Σi∈C1 (log(√(2π) σyi) + (fi − µyi)² / (2σyi²)) + β Σ(i,j)∈C2 d(yi, yj)

where Z is the partition function that normalizes the configuration energy (U) into a probability distribution. The maximum a posteriori (MAP) label distribution is given by:
yMAP = argmaxy P(y|f) = argminy U(y)    (3.8)
The posterior segmentation can be computed using a variety of optimization algorithms as described in the next section.
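For a single voxel, the MAP update of Eq. 3.8 amounts to choosing the label that minimizes the sum of the singleton (Eq. 3.5) and doubleton (Eq. 3.6) potentials. A toy, hedged sketch follows; the class-conditional intensity parameters here are invented for illustration (in the method they would be estimated from the SH nodes of a patch):

```python
import math

def voxel_energy(label, intensity, mu, sigma, neighbour_labels, beta=0.5):
    """Per-voxel energy: negative log-Gaussian singleton + Potts-like doubleton."""
    singleton = (math.log(math.sqrt(2.0 * math.pi) * sigma[label])
                 + (intensity - mu[label]) ** 2 / (2.0 * sigma[label] ** 2))
    doubleton = sum(beta if label != n else -beta for n in neighbour_labels)
    return singleton + doubleton

def map_label(intensity, mu, sigma, neighbour_labels, beta=0.5):
    """argmin_y U(y) over the binary label set (Eq. 3.8)."""
    return min((0, 1), key=lambda y: voxel_energy(
        y, intensity, mu, sigma, neighbour_labels, beta))

# toy class-conditional intensity models (illustrative values only)
mu, sigma = {0: 20.0, 1: 80.0}, {0: 10.0, 1: 10.0}
map_label(75.0, mu, sigma, neighbour_labels=[1, 1, 0])   # bright voxel -> 1
```

Both the intensity likelihood and the neighborhood agreement pull the bright voxel toward label 1; a dark voxel surrounded by background would be assigned label 0.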
Inference
This section provides the details of the optimization technique used to compute the posterior label distribution. Common iterative inference and learning methods such as Iterated Conditional Modes (ICM) and Expectation-Maximization (EM) are computationally intensive, and ICM variants often suffer from greedy behavior that results in local optima. Here, we present an alternative approach that computes the posterior label distribution in a non-iterative, online process, minimizing computational costs. The intuition behind this approach is to mimic manual tracing protocols, where the delineation process traverses from higher-confidence regions to lower-confidence regions in a sequential manner. In order to follow such a process, we transform the undirected graph structures defined by the MRF patches into directed spanning trees (see Fig. 3.1-C). Then we compute the posterior label distributions one voxel at a time as we traverse (walk) through the directed tree exhaustively. The directed tree structure mitigates the need for iterative inference over loops within the original undirected graph. The following is a brief outline of the implementation of the inference procedure:
1. Initialize all voxels to the labels given by the mode of baseline label distribution.
2. Transform the subgraph consisting of the SL nodes within an MRF patch into a directed tree graph, specifically a spanning tree with the seed voxel as the root of the tree. This transformation is computed using a minimum spanning tree method (Prim's algorithm [282]), which finds the optimal tree structure based on a predefined edge-weight criterion. In this method, the weights are assigned based on node adjacency and voxel intensity gradients.
w(xi, xj) = { (fi − fj)² if d(xi, xj) = 1 ; ∞ if d(xi, xj) ≠ 1 }    (3.9)

where d(xi, xj) is a graph metric representing the distance between two vertices.
3. Traverse through the entire ordered sequence of the minimum spanning tree to update the label at each voxel using Eq. 3.8.
4. Repeat this process for all MRF patches.
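Steps 2-3 above can be sketched with a standard heap-based Prim's algorithm using the edge weights of Eq. 3.9. The toy four-node "patch" and all names below are illustrative assumptions, and the per-voxel label update is reduced to a comment:

```python
import heapq

def prim_walk(intensity, adjacency, seed):
    """Order in which nodes join the minimum spanning tree grown from the seed.

    Edge weights follow Eq. 3.9: (f_i - f_j)^2 between adjacent voxels;
    non-adjacent pairs (weight infinity) simply never appear in the
    adjacency lists.
    """
    visited, order = {seed}, [seed]
    heap = [((intensity[seed] - intensity[n]) ** 2, n) for n in adjacency[seed]]
    heapq.heapify(heap)
    while heap:
        _, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        order.append(node)   # the label update (Eq. 3.8) would happen here
        for n in adjacency[node]:
            if n not in visited:
                heapq.heappush(heap, ((intensity[node] - intensity[n]) ** 2, n))
    return order

intensity = {"a": 10, "b": 11, "c": 30, "d": 12}
adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
prim_walk(intensity, adjacency, seed="a")   # -> ['a', 'b', 'd', 'c']
```

Note how the walk reaches the intensity-similar neighbors b and d before the outlier c: low intensity gradients are traversed first, so labels propagate through homogeneous tissue before crossing apparent boundaries.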
             CN (N=20)          LMCI (N=20)          AD (N=20)            Combined (N=60)
Age (years)  72.2, 75.5, 80.3   70.9, 75.6, 80.4     69.4, 74.9, 80.1     70.9, 75.2, 80.2
Sex (M/F)    10/10              10/10                10/10                30/30
Education    14.0, 16.0, 18.0   13.8, 16.0, 16.5     12.0, 15.5, 18.0     13.0, 16.0, 18.0
CDR-SB       0.00, 0.00, 0.00   1.00, 2.00, 2.50     3.50, 4.00, 5.00     0.00, 1.75, 3.62
ADAS 13      6.00, 7.67, 11.00  14.92, 20.50, 25.75  24.33, 27.00, 32.09  9.50, 18.84, 26.25
MMSE         28.8, 29.5, 30.0   26.0, 27.5, 28.2     22.8, 23.0, 24.0     24.0, 27.0, 29.0
Table 3.1: ADNI1 cross-validation subset demographics. CN: Cognitively Normal. LMCI: Late-onset Mild Cognitive Impairment. AD: Alzheimer's Disease. CDR-SB: Clinical Dementia Rating-Sum of Boxes. ADAS: Alzheimer's Disease Assessment Scale. MMSE: Mini-Mental State Examination. Values are presented as lower quartile, median, and upper quartile for continuous variables.
3.5 Validation Experiments
3.5.1 Datasets
For complete details please refer to supplementary materials.
Experiment I: ADNI Validation
Data used in this experiment was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). The dataset consists of 60 baseline scans in the ADNI1: Complete 1Yr 1.5T standardized dataset [380]. The expert manual segmentations for the hippocampus (ADNI-specific) were obtained based on the Pruessner protocol [285] and used for validation and performance comparisons.
Experiment II: First Episode Psychosis (FEP) Validation
Data used in preparation of this experiment were obtained from the Prevention and Early Intervention Program for Psychoses (PEPP-Montreal), a specialized early intervention service at the Douglas Mental Health University Institute in Montreal, Canada [239]. The dataset consists of structural MRIs of 81 subjects. Expert whole hippocampal manual segmentations of each subject were produced following the Pruessner protocol [285].
Experiment III: Preterm Neonatal Cohort Validation
This cohort consists of 22 premature neonates whose anatomical images were acquired at two time points, once in the first weeks after birth when clinically stable and again at term equivalent age (total of 44 images: 22 early-in-life and 22 term-equivalency images). The whole hippocampus was manually segmented by an expert rater using a 3-step segmentation protocol. The protocol adapts the histological definitions of [110], as well as existing whole hippocampal segmentation protocols for MR images [285, 372, 39] to the preterm infant brain.
Experiment IV: Hippocampal Volumetry
The volumetric analysis was performed using the standardized ADNI1: Complete Screening 1.5T dataset [380] comprising 811 ADNI T1-weighted screening and baseline MR images of healthy elderly (227), MCI (394), and AD (190) patients.

                N*   FEP
Age             80   21, 23, 26
Gender (M/F)    81   51/30
Handedness      81   ambidextrous: 5, left: 4, right: 72
Education       81   11, 13, 15
FSIQ            79   88, 102, 109

Table 3.2: First episode psychosis subject demographics. Ambi: ambidextrous. SES: Socioeconomic Status score. FSIQ: Full Scale IQ. Values are presented as lower quartile, median, and upper quartile for continuous variables. N* is the number of non-missing values out of 81.
3.5.2 Label-Fusion Methods Compared
The performance of AWoL-MRF is compared against MAGeT Brain majority vote, STAPLE, and JLF. The basic process of these label-fusion methods is described below.
MAGeT Brain Majority Vote
As described earlier, the MAGeT Brain pipeline uses a template library sampled from the subject image pool. Consequently, the total number of candidate labels (votes) prior to label fusion equals (number of atlases) × (number of templates). In a default MAGeT Brain (MB) configuration, these candidate labels are fused by a simple majority vote.
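For reference, the default fusion step amounts to a voxel-wise majority vote over the candidate labels; a minimal sketch (an illustration of the idea, not the actual MAGeT Brain code):

```python
import numpy as np

def majority_vote(candidate_labels):
    """Fuse (J, ...) binary candidate segmentations by simple majority."""
    return (np.mean(candidate_labels, axis=0) > 0.5).astype(int)

candidates = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [0, 1, 1]])     # 3 candidate labels, 3 voxels
majority_vote(candidates)              # fused labels: [1, 1, 0]
```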
Simultaneous Truth and Performance level Estimation (STAPLE)
STAPLE (Simultaneous Truth And Performance Level Estimation) [364] is a probabilistic performance model that estimates the underlying ground-truth labeling from a set of manual or automatic segmentations generated by multiple raters or methods. Note that STAPLE does not consider the intensity values from the subject image in its MRF-based model. STAPLE carries out label fusion in an Expectation-Maximization framework and estimates the performance of a manual rater or an automatic segmentation method for each label class - which is then used to find the optimal segmentation for the subject image. The software implementation of STAPLE was obtained from the Computational Radiology Laboratory (http://www.crl.med.harvard.edu/software/STAPLE/index.php).
Joint Label Fusion (JLF)
Among the modern label-fusion approaches incorporating spatially varying weight distributions, JLF also accounts for the dependencies among the atlas library [359]. These dependencies are estimated based on an intensity similarity measure between a pair of atlases and the target image in a small neighborhood surrounding a voxel. This approach mitigates the bias typically incurred by the presence of similar atlases. The software implementation of JLF was obtained from the ANTs repository on GitHub (https://github.com/stnava/ANTs/blob/master/Scripts/antsJointLabelFusion.sh).
CN vs. MCI vs. AD Comparisons

Volumetric statistics, mean (stdev):
               CN               MCI              AD
Majority Vote  2084.7 (615.3)   1960.5 (599)     1897.2 (582.3)
STAPLE         2236.6 (659)     2124.2 (649.3)   2068.2 (655.4)
JLF            1943.6 (593.5)   1803.3 (572.6)   1697.3 (551.6)
AWoL-MRF       2312.9 (676.3)   2147.5 (652)     2047.7 (631.3)

Cohen's d:
               CN v MCI   CN v AD   MCI v AD
Majority Vote  0.1727     0.3194    0.123
STAPLE         0.1463     0.2688    0.1005
JLF            0.202      0.4343    0.2155
AWoL-MRF       0.2092     0.4111    0.1783

Linear model (t-values):
               CN v MCI       CN v AD        MCI v AD
Majority Vote  -3.875942***   -3.662264***   -0.402088
STAPLE         -3.424026***   -3.001039**    -0.101867
JLF            -4.19533***    -4.884486***   -1.451038
AWoL-MRF       -4.424061***   -4.657673***   -0.987672

MCI-converters vs. MCI-stable Comparisons

Volumetric statistics, mean (stdev):
               MCI-converters   MCI-stable
Majority Vote  1846.2 (489.6)   1995.7 (619.3)
STAPLE         2000.7 (542.8)   2163.6 (668.0)
JLF            1686.8 (483.9)   1842.2 (586.3)
AWoL-MRF       2007.4 (534.6)   2186.6 (672.6)

               Cohen's d   Linear model (t-value)
Majority Vote  0.185       -1.708
STAPLE         0.181       -1.616
JLF            0.192       -1.844
AWoL-MRF       0.204       -1.965*
Table 3.3: Hippocampal volumetry statistics of the ADNI1: Complete Screening 1.5T dataset per diagnosis (AD: Alzheimer's patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). Top: volumetric statistics of segmentations provided by each method. Middle: effect sizes of pairwise differences between diagnostic groups based on Cohen's d metric. Bottom: t-values and significance levels from a linear model comprising "Age", "Sex", and "total brain volume" as covariates (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001).
Atlases  DSC                     Reference Study                            Validation           Dataset (ground truth)
9        0.881                   AWoL-MRF                                   3-Fold MCCV, N=60    ADNI (Pruessner)
9        0.897                   AWoL-MRF                                   3-Fold MCCV, N=81    ADNI (Pruessner)
9        0.81                    AWoL-MRF                                   1-Fold MCCV, N=44    3-step segmentation protocol (b)
9        0.869                   MAGeT Brain [277]                          10-Fold MCCV, N=60   ADNI (Pruessner)
9        0.892                   MAGeT Brain [277]                          5-Fold MCCV, N=81    FEP subjects
9        0.79                    MAGeT Brain [149]                          1-Fold MCCV, N=44    3-step segmentation protocol (b)
30       0.82                    Decision Fusion [161]                      LOOCV                Controls
21       0.862                   Auto Context Model [254]                   LOOCV                ADNI (SNT)
55       0.86                    Barnes et al. [26]                         LOOCV                Controls and AD
275      0.835                   Aljabar et al. [10]                        LOOCV                Controls
80       0.89                    Collins et al. [73]                        LOOCV                Controls
30       0.885                   Lotjonen et al. [235]                      N=60                 ADNI (SNT)
55       0.89                    MAPS [227]                                 N=30                 ADNI (SNT)
30       0.848                   LEAP [374]                                 N=182                ADNI (SNT)
16       0.861                   Patch-based [83] (a)                       LOOCV                ADNI (Pruessner)
20       0.897 (L), 0.888 (R)    JLF [359]                                  10-Fold MCCV, N=20   semi-automatic + manual correction
15       0.862 (L), 0.861 (R)    JLF [361] (c)                              N=20                 brainCOLOR
15       0.872 (L), 0.871 (R)    JLF (with corrective learning) [361] (c)   N=20                 brainCOLOR
9        0.841                   MAGeT Brain [277]                          10-Fold MCCV, N=69   ADNI (SNT)
Table 3.4: Summary of automated segmentation methods of the hippocampus. AD = Alzheimer’s Disease; MCI = Mild Cognitive Impairment; CN = Cognitively Normal; FEP = First Episode of Psychosis; LOOCV = Leave-one-out cross-validation; MCCV = Monte Carlo cross-validation; SNT = Medtronic Surgical Navigation Technologies semi-automated labels; L-HC = Left hippocampus; R-HC = Right hippocampus. (a): AD: 0.838, MCI: n/a, CN: 0.883. (b): See [149] for manual segmentation protocol details. (c): The methods were applied in the 2012 MICCAI Multi-Atlas Labeling Challenge.
3.5.3 Evaluation Criteria
We performed both quantitative and qualitative assessment of the results. The segmentation accuracy was measured using Dice similarity coefficient (DSC) given as follows:
DSC = 2|A ∩ B| / (|A| + |B|)    (3.10)

where A and B are the three-dimensional label volumes being compared. We also evaluated the level of agreement between automatically computed volumes and manual segmentations using Bland-Altman plots [35]. The Bland-Altman plots are created with segmentations yielded by the 5-atlas and 19-template configuration. For the ADNI and FEP datasets, we performed three-fold cross-validation and obtained the quantitative scores by averaging over all the validation rounds, as well as over the left and right hippocampal segmentations. Constrained by the size of the Premature Birth and Neonatal dataset and the quality of certain images, which caused difficulties in the registration pipeline, we performed a single round of validation to determine whether the results that we found in Experiments I and II generalize to brains with radically different neuroanatomy. Due to incomplete myelination of the neonatal brains, the MR image contrast levels for this dataset are drastically different: the intensity values for the hippocampus are reversed relative to T1-weighted images of adolescent or adult human brains. These distinct attributes make it an excellent “held-out sample” or “independent test-set” for performance evaluation. Thus, for this dataset, the quantitative scores are averages over the left and right hippocampi over a single validation round.
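For illustration, the DSC of Eq. 3.10 reduces to a few lines of NumPy when the labels are stored as binary volumes (a minimal sketch, not the thesis pipeline code; the array names are hypothetical):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary label volumes (Eq. 3.10)."""
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

# Toy example: two overlapping 5x5x5 cubic "segmentations" in a 10x10x10 volume.
A = np.zeros((10, 10, 10), dtype=bool); A[2:7, 2:7, 2:7] = True  # 125 voxels
B = np.zeros((10, 10, 10), dtype=bool); B[3:8, 3:8, 3:8] = True  # 125 voxels
print(dice(A, B))  # overlap is 4*4*4 = 64 voxels, so DSC = 128/250 = 0.512
```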
3.6 Results
3.6.1 Experiment I: ADNI Validation
For the ADNI dataset, the mean Dice score of AWoL-MRF maximizes at 0.881 with 9 atlases and 19 templates. As seen in Fig. 3.2, AWoL-MRF outperforms the majority vote (0.862), STAPLE (0.858), and JLF (0.873) label-fusion methods. Compared to JLF in particular, more improvement is seen with fewer atlases, as AWoL-MRF reaches a mean Dice score of 0.88 with only six atlases. The improvement diminishes with an increasing number of atlases and a smaller number of templates (the bootstrapping parameter for generating candidate labels). Additionally, AWoL-MRF helps reduce the bias introduced by certain majority vote techniques that arbitrarily break vote-ties in the case of an even number of atlases, as previously described by our group and others [161, 277]. We find that AWoL-MRF corrects these dips in performance, as is evident from the extra accuracy boosts for even numbers of atlases. DSC distribution comparisons for four configurations (Number of Atlases = 3, 5, 7, 9; Number of Templates = 11) are shown in Fig. 3.3. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. As expected, we also notice a reduction in variance with an increasing number of atlases. The Bland-Altman plots reveal the biases incurred with the application of each automatic segmentation method during volumetric analysis. Fig. 3.4 shows that all four methods have a proportional bias associated with their volume estimates. Specifically, we see that in all four methods, the volumes of the smaller hippocampi are overestimated, whereas the larger hippocampi are underestimated. Nevertheless, AWoL-MRF shows the smallest mean bias magnitude along with tighter limits of agreement across the cohort. STAPLE displays similar mean bias values, but higher variance in volume estimation compared to AWoL-MRF, as is evident from its steeper line-slope and wider limits of agreement.
Majority vote and JLF show the highest amount of positive mean bias, indicating a tendency towards underestimation of hippocampal volume. Qualitatively, improvement in segmentations is seen at the surface regions of the hippocampus. As seen in Fig. 3.5, spatial homogeneity is improved as well.
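The Bland-Altman quantities discussed above (mean bias, 1.96 SD limits of agreement, and the slope of the linear fit that indicates proportional bias) can be computed as in the following NumPy sketch (illustrative only, not the code used to generate Fig. 3.4; the data values are hypothetical):

```python
import numpy as np

def bland_altman_stats(auto_vol, manual_vol):
    """Bland-Altman summary statistics for two sets of volume measurements.

    Differences are taken as manual - automatic, so a positive mean bias
    indicates that the automatic method underestimates volume.
    """
    auto_vol = np.asarray(auto_vol, dtype=float)
    manual_vol = np.asarray(manual_vol, dtype=float)
    diff = manual_vol - auto_vol
    mean = (manual_vol + auto_vol) / 2.0
    bias = diff.mean()                          # overall mean difference
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # limits of agreement
    slope = np.polyfit(mean, diff, 1)[0]        # proportional-bias slope
    return bias, loa, slope

# Hypothetical volumes (mm^3): the automatic method reads 50 mm^3 low everywhere,
# so the bias is exactly 50 and there is no proportional trend.
bias, loa, slope = bland_altman_stats([1950, 2050, 2150, 2250],
                                      [2000, 2100, 2200, 2300])
```

A non-zero slope of the difference-versus-mean fit is what the figures visualize as proportional bias.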
3.6.2 Experiment II: FEP Validation
For the FEP dataset, the mean Dice score of AWoL-MRF maximizes at 0.897, with 9 atlases and 19 templates. Similar to Experiment I, AWoL-MRF consistently outperforms the majority vote (0.891), STAPLE (0.892), and JLF (0.888) methods; however, the improvement is comparatively modest. Higher improvement is seen with fewer atlases when compared to JLF, as AWoL-MRF surpasses a mean Dice score of 0.89 with only three atlases (see Fig. 3.6). The improvement diminishes with an increasing number of atlases and a smaller number of templates. In addition to a smaller atlas library requirement, the ability to reduce the bias introduced by the majority vote technique is also observed in this experiment. DSC distribution comparisons for four sample configurations (Number of Atlases = 3, 5, 7, 9; Number of Templates = 11) are shown in Fig. 3.7. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. Similar to the accuracy gains, the variance of the Dice score distribution is also smaller compared to the ADNI experiment. The Bland-Altman plots (see Fig. 3.8) show that AWoL-MRF and majority vote exhibit the smallest mean proportional biases. In comparison, STAPLE and JLF show strong
Figure 3.2: Experiment I DSC: All results show the average performance values of left and right hip- pocampi over 3-fold validation. The top-left subplot shows Mean DSC score performance of all the methods. Remaining subplots show the mean DSC score improvement over compared methods for different number of templates (bootstrapping parameter of MAGeT Brain).
Figure 3.3: Experiment I DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗ : p < 0.05, ∗∗ : p < 0.01, ∗ ∗ ∗ : p < 0.001). biases characterizing considerable overestimation (negative bias) and underestimation (positive bias) of hippocampal volume across the cohort respectively. Quantitatively, AWoL-MRF still outperforms the
Figure 3.4: Experiment I Bland-Altman analysis: Comparison between computed and manual volumes (in mm3) for a single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume, and limits of agreement (LA+/LA-: 1.96SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that the points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa. other three methods, as evident from the smaller line-slope and tighter limits of agreement. Similar to the ADNI experiment, qualitative improvement is seen at the surface regions of the hippocampus (see Fig. 3.9).
3.6.3 Experiment III: Preterm Neonatal Cohort Validation
The mean Dice score of AWoL-MRF maximizes at 0.810, with 9 atlases and 19 templates. Note that due to incomplete myelination of the neonatal brains, the intensity values for the hippocampus are reversed relative to T1-weighted images of the child, adolescent, and adult human brains. Nevertheless, no manual interventions were carried out for the implementation of the methods. Similar to the first two experiments, AWoL-MRF consistently outperforms the majority vote (0.775), STAPLE (0.775), and JLF (0.771) methods by a large amount. More improvement is seen with fewer atlases when compared to JLF, as AWoL-MRF surpasses a mean Dice score of 0.80 with only four atlases (see Fig. 3.10). The improvement diminishes with an increasing number of atlases and a smaller number of templates. Also, due to the single-fold experimental design for this dataset, higher performance variability is observed, especially with a smaller number of templates. DSC distribution comparisons for four sample configurations (Number of Atlases = 3, 5, 7, 9; Number of Templates = 11) are shown in Fig. 3.11. These plots reveal that AWoL-MRF provides statistically significant improvement over all other methods regardless of the size of the atlas library. The Bland-Altman plots show that both AWoL-MRF and JLF offer volume estimates with an extremely small proportional bias (see Fig. 3.12). Compared to the ADNI and FEP datasets, the magnitude of the bias is significantly lower, with AWoL-MRF producing the best result.
Figure 3.5: Experiment I qualitative analysis: comparison of manual versus automatic segmentation methods. The red rectangle illustrates a section where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: Majority Vote: 0.806, STAPLE: 0.833, JLF: 0.804, AWoL-MRF: 0.854. The segmentation of the left hippocampus is shown in sagittal view.
In comparison, majority vote consistently underestimates and STAPLE consistently overestimates hippocampal volumes across the cohort. Similar to the previous two experiments, the qualitative improvement is seen at the surface regions of the hippocampus (see Fig. 3.13). Note that the intensity values for the hippocampus are reversed due to incomplete myelination.
3.6.4 Experiment IV: Hippocampal Volumetry
The volumetric analysis was performed on the standardized ADNI1: Complete Screening 1.5T dataset [380]. The segmentations were produced using 9 atlases with each method. For majority vote, STAPLE, and AWoL-MRF, the number of templates was set to 19. As mentioned earlier, the use of templates is not possible with JLF due to the coupling between image and label volumes from the atlas library.
Group Comparisons between CN, MCI, and AD
In this part of the analysis we compared the mean hippocampal volume measurements per diagnosis (AD: Alzheimer’s patients, MCI: subjects with mild cognitive impairment, CN: healthy subjects). As seen in Fig. 3.14 (top pane), mean volume decreases with the severity of the disease for all methods. The volumetric statistics are summarized in Table 3.3. Based on Cohen’s d metric as a measure of effect size, we see the largest separation between the “CN vs. AD” diagnostic categories, followed by “CN vs. MCI”, and lastly between “MCI vs. AD”. We also constructed a linear model predictive of hippocampal volume based on diagnostic category along with “age”, “sex”, and “total-brain-volume” as covariates.
Figure 3.6: Experiment II DSC: All results show the average performance values of left and right hippocampi over 3-fold validation. The top-left subplot shows Mean DSC score performance of all the methods. Remaining subplots show the mean DSC score improvement over compared methods for different number of templates (bootstrapping parameter of MAGeT Brain).
Figure 3.7: Experiment II DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗ : p < 0.05, ∗∗ : p < 0.01, ∗ ∗ ∗ : p < 0.001). The results show that the effect sizes are most pronounced in AWoL-MRF and JLF in all pairwise comparisons. All four methods show strong volumetric differences (p < 0.001
Figure 3.8: Experiment II Bland-Altman analysis: Comparison between computed and manual volumes (in mm3) for a single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume, and limits of agreement (LA+/LA-: 1.96SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that the points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa. or p < 0.01) between the “CN vs. AD” categories, followed by “CN vs. MCI”, which shows relatively weaker significance levels. JLF also shows volumetric differences between the “MCI vs. AD” categories, but with a weaker significance level (p < 0.05). In the linear-model-based comparison, we see that all four methods show significant differences (p < 0.001 or p < 0.01) only between the “CN vs. AD” and “CN vs. MCI” comparisons.
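The Cohen's d values in Table 3.3 are effect sizes of pairwise group differences. Assuming the common pooled-standard-deviation definition (the exact pooling convention used in the thesis is not stated), the computation looks like this sketch:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d effect size between two groups, using the pooled standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Toy example: group means differ by 1.0 and both groups have unit variance, so d = 1.
d = cohens_d([0.0, 1.0, 2.0], [-1.0, 0.0, 1.0])
```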
Group Comparisons between MCI converters and non-converters
In this part of the analysis we compared the mean hippocampal volume measurements of two MCI subgroups: MCI-converters (65 subjects converting from MCI to AD diagnosis within 1 year of screening) and MCI-stable (285 subjects with a stable MCI diagnosis within 1 year of screening). The volumetric statistics are summarized in Table 3.3. Fig. 3.14 (bottom pane) shows that the MCI-converters have relatively smaller volumes compared to the MCI-stable group. JLF and AWoL-MRF both show the strongest effect sizes based on Cohen’s d metric, with statistically significant (p < 0.05) differences between these two groups.
3.6.5 Parameter Selection
We studied the impact of parameter selection on the performance of AWoL-MRF with joint consideration of the segmentation accuracy and computational cost. The four parameters which need to be chosen a priori are: the confidence thresholds (L_T^0, L_T^1), the patch-length (L_patch), the mixing-ratio ((S_H/S_L)_patch), and the β parameter of the MRF model. Recall that the Gaussian distribution parameters in the MRF are
Figure 3.9: Experiment II qualitative analysis: Comparison of manual versus automatic segmentation methods. The red rectangle illustrates a section where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: majority vote: 0.875, STAPLE: 0.878, JLF: 0.856 AWoL-MRF: 0.891. The segmentation of the right hippocampus is shown in sagittal view.
Figure 3.10: Experiment III DSC: preterm neonate cohort validation: All results show the average performance values of left and right hippocampi over a single validation round. The top-left subplot shows Mean DSC score performance of all the methods. Remaining subplots show the mean DSC score improvement over compared methods for different number of templates (bootstrapping parameter of MAGeT Brain).
Figure 3.11: Experiment III DSC: statistical comparison of the performance of all methods for different atlas library sizes. The statistical significance is reported for pairwise comparisons (∗ : p < 0.05, ∗∗ : p < 0.01, ∗ ∗ ∗ : p < 0.001).
Figure 3.12: Experiment III Bland-Altman analysis: comparison between computed and manual volumes (in mm3) for single parameter configuration of 9 atlases and 19 templates. The overall mean difference in volume, and limits of agreement (LA+/LA-: 1.96SD) are shown by dashed horizontal lines. Linear fit lines are shown for each method. Note that the points above the mean difference indicate underestimation of the volume with respect to the manual volume, and vice versa.
Figure 3.13: Experiment III qualitative analysis: Comparison of manual versus automatic segmentation methods. The red rectangles illustrate sections where the superiority of the AWoL-MRF approach is particularly apparent. The segmentations are performed using 3 atlases, and the Dice scores are as follows: majority vote: 0.748, STAPLE: 0.760, JLF: 0.746, AWoL: 0.807. The segmentation of the left hippocampus is shown in sagittal view. Note that for this particular dataset, the brain structures are mostly unmyelinated causing a reversal of the intensity values for the hippocampal structure - as shown in the top row.
estimated for each patch automatically using the S_H nodes in the given patch. First, the confidence threshold parameters are selected heuristically from the voting distribution. As mentioned before, both L_T^0 and L_T^1 values need to be greater than 0.5 to produce a non-empty low-confidence voxel set. Based on the assumption that the high-confidence region (S_H) comprises more structural voxels
(L(x_i) = 1) than the total number of voxels in the low-confidence region (S_L), we define the following metric:

ρ = |S_L| / |{x_i ∈ S_H : L(x_i) = 1}|    (3.11)
Then we choose confidence thresholds (L_T^0 and L_T^1) which fall in the parameter space bounded by ρ ∈ (0.5, 1). Fig. 3.15-Left shows an example of these bound values, computed for the ADNI dataset in Experiment I (left hippocampus). Note that larger threshold values imply a larger S_L region, and consequently higher computational time. Based on this heuristic, we chose L_T^0 = 0.8 and L_T^1 = 0.6 for Experiments I, II, and IV; and L_T^0 = L_T^1 = 0.7 for Experiment III. As described in Section 3.4.3, the patch-length and the mixing ratio parameters are interrelated and directly affect the coverage of the S_L region. From a performance perspective, these have higher impact
Figure 3.14: Hippocampal volume (in mm3) vs. diagnoses. Cohen’s d scores (effect size) and statistical significance are reported for pairwise comparisons between diagnostic groups. on the computational time than the segmentation accuracy (see Fig. 3.15-Middle, Right). Higher
L_patch implies a larger MRF model on the sub-volume and therefore requires higher computational time. Conversely, smaller patches would reduce the computational time, but would run a risk of insufficient
coverage of the S_L region and consequently offer poor accuracy improvement. The third parameter choice, the mixing ratio, affects the total number of seeds/patches for a given image. A higher ratio necessitates a search for S_L nodes surrounded by a large number of S_H nodes, which reduces the total number of patches as well as the computational time. Based on the accuracy vs. computational cost trade-off analysis with respect to these parameter choices, we selected a patch-length of 11 voxels and a minimum mixing ratio of 0.0075, which translates into seed nodes surrounded by a minimum of 10 S_H nodes in the 26-node neighborhood, for all validation experiments. Lastly, the β parameter of the MRF model controls the homogeneity of the segmentation. It is dependent on the image intensity distribution and the structural properties of the anatomical structure. A large value of β results in more homogeneous regions, giving a smoothed appearance to a structure. We selected β = −0.2 based on the results of the training phase, where we split the atlas pool into two groups and used one set to segment the other.
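The seed-selection rule above (a low-confidence voxel with at least 10 high-confidence S_H voxels among its 26 neighbors) can be sketched with NumPy as follows. The function names are illustrative, and np.roll wraps around at the volume borders, which is harmless as long as the structure does not touch the volume edges:

```python
import numpy as np
from itertools import product

def count_sh_neighbors(s_h: np.ndarray) -> np.ndarray:
    """For every voxel, count S_H voxels among its 26 neighbors (3D connectivity)."""
    counts = np.zeros(s_h.shape, dtype=int)
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue  # a voxel is not its own neighbor
        counts += np.roll(s_h.astype(int), shift=(dx, dy, dz), axis=(0, 1, 2))
    return counts

def select_seeds(s_l, s_h, min_sh_neighbors=10):
    """Seeds: low-confidence voxels with at least `min_sh_neighbors` S_H neighbors."""
    return s_l & (count_sh_neighbors(s_h) >= min_sh_neighbors)

# Toy volume: a 3x3x3 high-confidence block with one uncertain voxel at its center.
s_h = np.zeros((5, 5, 5), dtype=bool); s_h[0:3, 0:3, 0:3] = True; s_h[1, 1, 1] = False
s_l = np.zeros((5, 5, 5), dtype=bool); s_l[1, 1, 1] = True
seeds = select_seeds(s_l, s_h)  # the center voxel has all 26 of its neighbors in S_H
```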
Figure 3.15: Parameter selection. A: Effect of confidence threshold values on image partitioning. ρ represents the ratio of low-confidence voxels over high-confidence structural voxels. The highlighted region denotes the heuristically ‘good’ region for threshold selection. B: Effect of the mixing ratio and L_patch on DSC performance. The mixing ratio is the minimum required number of S_H nodes in the 26-node neighborhood for a given seed voxel. Note that performance improves with a larger L_patch, whereas with a smaller L_patch or a higher mixing ratio the performance worsens due to poor coverage of the S_L region. C: Effect of the mixing ratio and L_patch on computational cost. The light blue line shows the number of patches for a given configuration as a reference. Note that the computation time increases exponentially with a higher L_patch and a smaller mixing ratio. (Note: mixing ratio* represents the equivalent minimum number of S_H nodes required in the 26-node neighborhood for seed node selection.)
3.7 Discussion and Conclusion
In this work we presented a novel label-fusion method that can be incorporated into any multi-atlas segmentation pipeline for improved accuracy and robustness. We validated the performance of AWoL-MRF over three independent datasets spanning a wide range of demographics and anatomical variations. In Experiment I, we validated AWoL-MRF on an Alzheimer’s disease cohort (N = 60) with a median age of 75. In Experiment II, validation was performed on a first-episode psychosis cohort (N = 81) with a median age of 23. In Experiment III, we applied AWoL-MRF to a unique cohort (N = 22 × 2) comprising preterm neonates scanned in the first weeks after birth and again at term-equivalent age, with distinctly different brain sizes and MR scan characteristics. In all of these exceptionally heterogeneous subject groups, AWoL-MRF provided superior segmentation results compared to all three reference methods (majority vote, STAPLE, and JLF), based on the DSC metric as well as proportional bias measurements. In all three experiments, we see that, as one of its most desirable benefits, AWoL-MRF offers superior performance with a remarkably small atlas library. AWoL-MRF provides mean DSC scores over 0.88 with only six atlases (Experiment I), 0.89 with only three atlases (Experiment II), and 0.80 with only four atlases (Experiment III), whereas the other methods require larger atlas libraries to deliver similar performance. This is an important benefit, as it reduces the resource expenditure on the manual delineation of MR images and speeds up the analysis pipelines. From a robustness perspective, we also notice a reduction of two types of biases. First, AWoL-MRF mitigated the issue of degenerating accuracy caused by vote-ties with a small, even number of atlases. Second, and more importantly, we see a consistent reduction of proportional bias, as evident from the Bland-Altman analysis. We believe that the performance boosts provided by AWoL-MRF can be explained by two major factors.
First, we argue that the utilization of intensity values and local neighborhood constraints acts as a regularizer, which helps avoid over-fitting to the hippocampal model represented by the atlas library. Neither majority vote nor STAPLE considers intensity values in the label-fusion stage, and thus they are more likely to ignore minute variations near the surface areas of the structure that are not well represented within the atlas library. JLF, which does take intensity information into account and implements a patch-based approach, tends to perform better than majority vote and STAPLE with a relatively higher number of atlases: > 4 in Experiment I and > 6 in Experiment III. Therefore, we speculate that JLF is more likely to deliver superior performance in cases with a larger atlas library, which again comes at the cost of generating manual segmentations. Second, the spanning-tree-based inference method tries to mimic the manual delineation process by starting with regions with strong neighborhood label information and moving progressively towards more uncertain areas. Compared to iterative methods (e.g. EM), the sequential inference process may not be optimal in a theoretical sense; nevertheless, the similarity between the automatic and manual labeling processes provides more accurate results, since the ground truth is defined by the latter. Additionally, decoupling the label-fusion process from similarity comparisons with the atlas library allows AWoL-MRF to utilize bootstrapping techniques that augment the pool of candidate labels, as used by the baseline segmentation pipeline (MAGeT Brain) in this work [277]. The use of such techniques is not trivial for approaches using intensity information from the atlas library. From a diagnostics perspective, the volumetric assessment of all four methods shows significant differences (p < 0.001 or p < 0.01) between the “CN vs. AD” and “CN vs. MCI” comparisons. Consistent with the Bland-Altman analysis (see Fig.
3.4), JLF and majority vote underestimate the volume compared to AWoL-MRF and STAPLE across all diagnostic categories. Even though the direct volumetric comparisons based on JLF yield significant differences (p < 0.05) between the “MCI vs. AD” categories, these differences vanish in the linear model comprising “age”, “sex”, and “total-brain-volume” as covariates. These findings are consistent with a variety of studies [256, 215, 300] highlighting the heterogeneity in MCI subjects, which results in a large variation of hippocampal volume and consequently smaller differences between MCI and AD subjects. This is particularly typical of the ADNI-1 cohort MCI subjects used in this analysis, which are now classified under more progressed stages of MCI or late-MCI [5]. The volumetric comparison between the MCI-converters and MCI-stable groups reveals larger hippocampal volumes for the latter group. These findings are consistent with a previous study conducted on the ADNI baseline cohort [291]. We also find that these differences remain statistically significant in the linear model comprising “age”, “sex”, and “total-brain-volume” as covariates. A direct comparison against other methods from the current literature is difficult due to differences in the choices of gold standards, evaluation metrics, hyper-parameter configurations, etc. Nevertheless, Table 3.4 shows a brief survey of several segmentation studies. Note that many of these studies have relied on SNT labels – provided by ADNI – for the ground-truth (manual) segmentations. A performance comparison of the baseline method based on SNT labels is discussed in our previous work [277], where we noticed several shortcomings of the SNT protocol [372, 277]; therefore, we have evaluated the presented method against manual labels based on the Pruessner protocol [285].
Despite the differences in experimental designs, comparisons with the other methods show that AWoL-MRF delivers superior performance with a significantly smaller atlas library requirement. For the ADNI cohort validation, barring the ground-truth label dissimilarities, the methods presented by [227, 235] have equivalent DSC scores; however, the atlas library sizes for these methods are 30 and 55, respectively. Moreover, to the best of our knowledge, no other study has validated its method on three drastically different datasets that span the entire human lifespan, demonstrating the robustness of the method. The computational cost of the algorithm implementation, as described in the previous section, depends on the parameter selection. From a theoretical perspective, the minimum spanning tree (MST) transformation is the most expensive task in this method. The current implementation of the MST uses Prim’s algorithm with a simple adjacency matrix graph representation, which requires O(|V|^2) running time (|V|: number of uncertain voxels in the patch). However, this can be reduced to O(|E| log |V|) or O(|E| + |V| log |V|) using a binary heap or Fibonacci heap data structure, respectively (|E|: number of edges in the patch). The computational times for Experiment I with the current implementation for different parameter configurations are shown in Fig. 3.15 (Right). The code was implemented in Matlab R2013b and run on a single CPU (Intel x86-64, 3.59 GHz). A direct computational time comparison with other methods is not practical due to hardware and software implementation differences. However, the non-iterative nature of AWoL-MRF provides considerably faster run times compared to EM-based approaches, where the convergence of the algorithm is dependent on the agreement between candidate labels and can be highly variable [352, 364]. In conclusion, AWoL-MRF attempts to mimic the behavior of a manual segmentation protocol in a multi-atlas segmentation framework.
We validated its performance over three independent datasets comprising significantly different subject cohorts. Even though this work focused on hippocampal segmentations, AWoL-MRF can easily be applied to other structures and to scenarios with multiple label classes, which is part of future work. The validations indicate that the method delivers state-of-the-art performance with a remarkably small library of manually labeled atlases, which motivates its use as a highly efficient label-fusion method for rapid deployment of automatic segmentation pipelines.
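As a supplement to the complexity discussion above: the O(|E| log |V|) bound quoted for a binary heap corresponds to the standard lazy-deletion variant of Prim's algorithm, sketched here in Python (a generic sketch, not the thesis's Matlab implementation):

```python
import heapq

def prim_mst(n, adj):
    """Prim's algorithm with a binary heap: O(|E| log |V|).

    n   : number of vertices, labeled 0..n-1 (the graph must be connected)
    adj : adjacency list, adj[u] = [(weight, v), ...]
    Returns the total MST weight and the parent of each vertex in the tree.
    """
    visited = [False] * n
    parent = [-1] * n
    total = 0.0
    heap = [(0.0, 0, -1)]  # (edge weight, vertex, parent in the tree)
    while heap:
        w, u, p = heapq.heappop(heap)
        if visited[u]:
            continue  # stale heap entry: vertex already added to the tree
        visited[u] = True
        parent[u] = p
        total += w
        for weight, v in adj[u]:
            if not visited[v]:
                heapq.heappush(heap, (weight, v, u))
    return total, parent

# Toy graph: the MST of this 4-vertex graph uses the edges of weight 1, 2, and 3.
adj = {0: [(1, 1), (4, 2)], 1: [(1, 0), (2, 2)], 2: [(4, 0), (2, 1), (3, 3)], 3: [(3, 2)]}
total, parent = prim_mst(4, adj)
```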
3.8 Acknowledgments
NB receives support from the Alzheimer’s Society. MMC is funded by the Weston Brain Institute, the Alzheimer’s Society, the Michael J. Fox Foundation for Parkinson’s Research, the Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and the Fondation de Recherches Santé Québec. ANV is funded by the Canadian Institutes of Health Research, the Ontario Mental Health Foundation, the Brain and Behavior Research Foundation, and the National Institute of Mental Health (R01MH099167 and R01MH102324). FEP data collection was supported by a CIHR grant (#68961) to Dr. Martin Lepage and Dr. Ashok Malla. The preterm neonate cohort is supported by Canadian Institutes of Health Research (CIHR) operating grants MOP-79262 (SPM) and MOP-86489 (Dr. Ruth Grunau). SPM is supported by the Bloorview Children’s Hospital Chair in Pediatric Neuroscience. The authors thank Drs. Ruth Grunau, Anne Synnes, Vann Chau, and Kenneth J. Poskitt for their contributions in studying the preterm neonatal cohort and providing access to the MR images. Computations were performed on the GPC supercomputer at the SciNet HPC Consortium [234]. SciNet is funded by the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; the Ontario Research Fund - Research Excellence; and the University of Toronto. In addition, computations were performed on the CAMH Specialized Computing Cluster. The SCC is funded by the Canada Foundation for Innovation, Research Hospital Fund. ADNI Acknowledgments: Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904).
ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514. We would also like to thank Curt Johnson and Robert Donner for inspiring some of the ideas in this work.
3.9 Supplementary Material
S3.1 Experiment I: ADNI Validation
Data used in this experiment were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). The dataset consists of 60 baseline scans from the ADNI1: Complete 1Yr 1.5T standardized dataset [380]. Twenty subjects were chosen from each diagnostic category: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD). All images were acquired on 1.5T scanners (General Electric Healthcare, Philips Medical Systems, or Siemens Medical Solutions) at multiple sites using a protocol previously described in [179]. Representative 1.5T imaging parameters were TR=2400ms, TI=1000ms, TE=3.5ms, flip-angle=8°, field of view=240x240mm, and a 192x192x166 matrix (x, y, and z directions), yielding voxel dimensions of 1.25mm x 1.25mm x 1.2mm. Manual segmentations of the hippocampus, used here for validation and performance comparisons, were generated by expert raters following the Pruessner protocol [285]. The choice of Pruessner labels was motivated by our previous validation of the baseline MAGeT Brain pipeline [277], in which we noted inconsistencies in the SNT labels provided by ADNI.
S3.2 Experiment II: First Episode Psychosis (FEP) Validation
Data used in this experiment were obtained from the Prevention and Early Intervention Program for Psychoses (PEPP-Montreal), a specialized early intervention service at the Douglas Mental Health University Institute in Montreal, Canada [239]. The dataset consists of structural MRIs of 81 subjects, acquired at the Montreal Neurological Institute on a 1.5T Siemens whole-body MRI system. Structural T1-weighted volumes were acquired for each participant using a three-dimensional (3D) gradient echo pulse sequence with sagittal volume excitation (repetition time=22ms, echo time=9.2ms, flip-angle=30°, 180 1mm contiguous sagittal slices). The rectangular field of view for the images was 256mm (SI) x 204mm (AP). Manual segmentations of the hippocampus were generated by expert raters following the Pruessner protocol [285], identical to the manual segmentation protocol used in our previous validation work [277]. FEP data are not publicly available.
S3.3 Experiment III: Preterm Neonatal Cohort Validation
This cohort consists of 22 premature neonates whose anatomical images were acquired with a specialized neonatal head coil (Advanced Imaging Research, Cleveland, OH) on a Siemens 1.5T Avanto scanner (Erlangen, Germany) at two time points: once in the first weeks after birth when clinically stable, and again at term-equivalent age (44 images in total: 22 early-in-life and 22 term-equivalency images). The 22 neonates (7 males) were born at a mean gestational age of 27.7 weeks (SD 1.9), and scanned early-in-life at 32.1 weeks (SD 1.9) and again at term-equivalent age at 40.4 weeks (SD 2.1). Sequence parameters for the 3D volumetric T1-weighted images were: TR=36ms, TE=9.2ms, flip-angle=30°, voxel size 1mm x 1.04mm x 1.04mm. The whole hippocampus was manually segmented by an expert rater using a 3-step segmentation protocol. The protocol adapts the histological definitions of [110], as well as existing whole-hippocampus segmentation protocols for MR images [285, 372, 39], to the preterm infant brain. This dataset was previously used by our group in the validation of an adaptation of MAGeT Brain to the specific needs of the neonatal and prematurely born infant brain. For complete details on the acquisition and manual segmentation process, see [149]. Neonatal data are not publicly available.
Method          Left (number of voxels)    Right (number of voxels)
Majority Vote   4.6 (1.5)                  4.4 (1.1)
STAPLE          4.6 (1.7)                  4.5 (1.5)
JLF             5.3 (2.5)                  4.8 (2.0)
AWoL-MRF        4.6 (1.9)                  4.6 (1.7)
Table S3.1: Experiment I surface-distance errors based on a variant of the Hausdorff distance. The error is measured in numbers of voxels; mean and standard deviation values are reported over all subjects in the dataset. The validation configuration comprised 9 atlases and 19 templates.
S3.4 Experiment IV: Hippocampal Volumetry
The volumetric analysis was performed using the standardized ADNI1: Complete Screening 1.5T dataset, comprising 811 ADNI T1-weighted screening and baseline MR images of healthy elderly (227), MCI (394), and AD (190) participants. (Note: the standardized ADNI1: Complete Screening 1.5T dataset consists of 818 subjects, of which seven were excluded because they failed the registration stage of the segmentation pipeline.)
S3.5 Surface-Distance error analysis
We performed a surface-distance analysis identical to our previous work [54]. The surface-distance metric (M) estimates the maximum distance between the surfaces of the manual and fused labels and is an approximation of the symmetric Hausdorff distance. M was calculated using contour maps generated from the manual label and a 26-connected voxel erosion of the fused label. The border labels (surface) from the eroded fused volume were intersected with the manual-label contour maps to compute M1 = H(a,b) and M2 = H(b,a), where H is the Hausdorff distance; then M = max(M1, M2). The results [mean (sd)] for the ADNI validation (Experiment I) with 9 atlases and 19 templates are shown in Table S3.1. This preliminary analysis shows that majority vote, STAPLE, and AWoL-MRF produce similar performance; in comparison, JLF yields slightly higher surface-distance errors.
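This max-of-directed-distances computation can be sketched in Python. The sketch below is a simplified illustration operating directly on 3D binary masks (the actual analysis used contour maps from the segmentation pipeline); function names are illustrative.

```python
import numpy as np
from scipy import ndimage

def surface_voxels(mask):
    # Border voxels: the mask minus its 26-connected erosion.
    eroded = ndimage.binary_erosion(mask, structure=np.ones((3, 3, 3)))
    return np.argwhere(mask & ~eroded)

def surface_distance(a, b):
    # Directed Hausdorff: greatest distance from a surface voxel of `a`
    # to its nearest surface voxel of `b`, in voxel units.
    sa, sb = surface_voxels(a), surface_voxels(b)
    d = np.sqrt(((sa[:, None, :] - sb[None, :, :]) ** 2).sum(-1))
    h_ab = d.min(axis=1).max()  # M1 = H(a, b)
    h_ba = d.min(axis=0).max()  # M2 = H(b, a)
    return max(h_ab, h_ba)      # M = max(M1, M2)
```

For well-behaved fusion methods this metric stays within a few voxels, as reported in Table S3.1.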
Chapter 4

An artificial neural network model for clinical score prediction in Alzheimer’s disease using structural neuroimaging measures
Nikhil Bhagwat [1,2,3,*], Jon Pipitone [3], Aristotle N. Voineskos [3,4], M. Mallar Chakravarty [1,2,5], and Alzheimer’s Disease Neuroimaging Initiative.
1. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada
2. Cerebral Imaging Centre, Douglas Mental Health University Institute, Verdun, QC, Canada
3. Kimel Family Translational Imaging-Genetics Research Lab, Research Imaging Centre, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
4. Department of Psychiatry, University of Toronto, Toronto, ON, Canada
5. Department of Psychiatry, McGill University, Montreal, QC, Canada
Correspondence: Nikhil Bhagwat, M. Mallar Chakravarty; Email: [email protected], [email protected]
Chapter 4. Project 2: Clinical Score Prediction
4.1 Abstract
Background: Development of diagnostic and prognostic tools for Alzheimer’s disease (AD) is complicated by the substantial clinical heterogeneity observed in prodromal stages. Many neuroimaging studies have focused on case-control classification and on predicting conversion from mild cognitive impairment to AD. However, prediction of scores from clinical assessments, such as the MMSE and ADAS-13, from MR imaging data has received less attention. Prediction of clinical scores can be crucial in providing nuanced prognosis as well as gauging symptomatic disease severity. Methods: In this work, we predict clinical scores at the individual level using a novel anatomically partitioned artificial neural network (APANN) model. APANN combines input from two structural MR imaging measures relevant to the neurodegenerative patterns observed in AD, namely hippocampal segmentations and cortical thickness. We evaluate the performance of APANN with 10 rounds of 10-fold cross-validation in three sets of experiments using the ADNI1, ADNI2, and ADNI1+ADNI2 cohorts. Results: Pearson’s correlations and root mean square errors between the actual and predicted scores for ADAS-13 (ADNI1: r=0.60; ADNI2: r=0.68; ADNI1and2: r=0.63) and MMSE (ADNI1: r=0.52; ADNI2: r=0.55; ADNI1and2: r=0.55) demonstrate that APANN can accurately infer clinical severity from MR imaging data. Limitations: In an effort to rigorously validate the presented model, we have primarily focused on large cross-sectional baseline datasets, with only proof-of-concept longitudinal results. Conclusion: APANN provides a highly robust and scalable framework for prediction of clinical severity at the individual level utilizing high-dimensional, multimodal neuroimaging data.
4.2 Author Contributions
Nikhil Bhagwat (NB) worked on the development of the proposed anatomically partitioned artificial neural network (APANN) model and other machine-learning techniques, along with their subsequent implementation and validation. He performed preprocessing and quality control of the MR image datasets. He also wrote the manuscript of the research paper, which is currently under review. Jon Pipitone assisted with the preprocessing of MR images, provided feedback on the proposed methodological approach, and provided support for computational resources. Aristotle N. Voineskos served as a clinical advisor and is a member of NB’s thesis committee. He also served as a supervisor for NB in a lab at CAMH that provided significant computational resources. M. Mallar Chakravarty is the thesis supervisor for NB. He provided guidance on the development and validation of all proposed models and techniques, as well as on manuscript writing.
4.3 Introduction
Machine-learning methods have been used extensively to distinguish individuals suffering from Alzheimer’s disease (AD) and its prodromes from healthy controls [52, 62, 87, 142, 391]. However, predicting symptomatic severity at the individual level, which is potentially more intimately related to personalized care and prognosis, remains a challenging problem. This prediction is confounded by the substantial pathophysiological and clinical heterogeneity observed in prodromal stages, such as mild cognitive impairment (MCI) or significant memory concern (SMC) [82, 118, 216, 263, 331, 370]. Although much is known about the spatiotemporal progression of amyloid plaques, neurofibrillary tangles, and the resultant downstream neurodegeneration [44], the heterogeneous patterns of neuroanatomical atrophy and AD-related cognitive impairment remain an open question. Understanding the complex pathophysiological processes that characterize the varying clinical presentations across these groups is essential for biomarker development and early detection of at-risk individuals [19, 123, 304]. Furthermore, neuroanatomically informed subject-level prediction of clinical performance is an important step towards biomarker assessment and the development of assistive tools for prognosis and treatment planning. As a structural biomarker, the hippocampus has long been associated with AD-related pathophysiology and impairment [80, 108, 117, 134, 142, 183, 300]. However, hippocampal volume measures lack the sensitivity to act as a standalone biomarker [32, 215, 275, 277, 303]. In efforts to achieve a more nuanced characterization of disease states, studies have explored hippocampal subfield-based biomarkers [12, 215, 261] and other neurodegeneration indicators, such as cortical atrophy quantified by cortical thickness [117, 202, 225, 222, 286, 299].
Nevertheless, no characteristic localized patterns of atrophy have been associated with the prodromal disease states and symptomatic severity levels, which are likely to be heavily influenced by cognitive reserve [286, 331]. This motivates an approach that incorporates multiple, distributed phenotypes for prediction of clinical severity in service of robust diagnostic and prognostic applications. Previously, computational approaches using neuroimaging measures in the context of AD have focused on predicting diagnosis in cross-sectional datasets [52, 62, 87, 391] or predicting conversion from MCI to AD in longitudinal analyses [80, 253, 91]. However, clinicians are more likely to treat symptoms based on structured assessments rather than a specific diagnosis. Thus, in this work, we focus on predicting clinical scores of disease severity (e.g. the Alzheimer’s Disease Assessment Scale [ADAS-13] [294] and the Mini Mental State Examination [MMSE] [127]) directly from neuroimaging data [332, 390]. Such neuroanatomically informed prediction of clinical performance at baseline, and subsequently at future timepoints, particularly for MCI or SMC individuals, can help clinicians parse through the clinical heterogeneity and make accurate diagnostic and prognostic decisions. Although the ultimate clinical goal of this work is to provide longitudinal prognosis, here we primarily focus on thorough validation on single time-point (baseline) datasets, which is an important first step in model development for longitudinal tasks. Additionally, we perform a proof-of-concept analysis to verify the capability of the proposed model for longitudinal prediction. For this prediction task, we propose an anatomically partitioned artificial neural network (APANN) model.
Artificial neural networks (ANNs) and related deep-learning approaches have delivered state-of-the-art performance in classification and prediction problems in computer vision, speech recognition, natural language processing, and several other domains [30, 165, 199, 212, 280, 293]. ANNs provide highly flexible computational frameworks that can be used to extract latent features corresponding to the hierarchical structural and functional organization of the brain, and, unlike more standard models, are well suited for problems with high-dimensional data [30, 280]. To this end, the primary objective of this manuscript is to assess whether ANN models can accurately predict the ADAS-13 and MMSE clinical scores of individuals using T1-weighted brain MR imaging data. In a larger context, we aim to build an ANN-based computational framework that can process high-dimensional and distributed structural changes captured by multiple phenotypic measures towards the development of a biomarker predictive of symptomatic progression. We designed, trained, and tested our model using participants from two Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts. We used a combination of high-dimensional (> 30000) features derived from two neuroanatomical measures computed from the T1-weighted images, namely: 1) hippocampal segmentations, and 2) cortical thickness. Hippocampal segmentations and cortical thickness measures were generated using the MAGeT Brain and CIVET pipelines (see sections 2.2.1 and 2.2.2), respectively. We present a model with an innovative modular design that enables analysis of this high-dimensional, multimodal input. Additionally, it allows inclusion of new input modalities without having to retrain the whole model, and offers simultaneous prediction of multiple clinical scores (ADAS-13, MMSE).
We address the need for large numbers of training examples, given the high dimensionality of the input data, by introducing a novel data augmentation method. The methodology presented in this paper is not limited to the prediction of disease severity in AD; it can be applied to train a variety of deep-learning models that use high-dimensional neuroimaging data to tackle a multitude of diagnostic and prognostic questions.
4.4 Materials and Methods
4.4.1 Datasets
In this work, we used baseline data from participants in the ADNI1 (N = 818) and ADNI2 (N = 788) databases, respectively [380] (http://adni.loni.usc.edu/, download date: April 2017). The final numbers of subjects used were 669 and 690 after exclusions based on quality control of the image preprocessing outputs (see Table 4.1 for demographic details). We set our objective as prediction of MMSE and ADAS-13 scores. The MMSE is one of the most widely used cognitive assessments for diagnosis of AD and related dementias [270, 316]; it ranges from 0 to 30, with lower scores indicating greater cognitive impairment. The ADAS-13 is a modified version of the ADAS-cog assessment with a maximum score of 85. Although it has some overlap with the MMSE, it also includes additional assessment components targeting memory, language, and praxis. In contrast to the MMSE, higher ADAS-13 scores indicate greater cognitive impairment. We note that we pool subjects from all diagnostic categories to build models over the entire spectrum of clinical performance. Diagnostic grouping is not used in the analysis, as we model AD progression on a continuum, an approach which has been shown to be useful in other studies of AD progression [175, 182].
                                  ADNI-1 (N=669)                ADNI-2 (N=690)
Acquisition                       Scanner: 1.5T; voxel sizes:   Scanner: 3.0T; voxel sizes:
                                  1.2mm x 1.25mm x 1.25mm       1.2mm x 1mm x 1mm
Diagnosis                         CN: 198, LMCI: 326, AD: 145   CN: 179, SMC: 77, EMCI: 162,
                                                                LMCI: 149, AD: 123
Sex                               Male: 377, Female: 292        Male: 361, Female: 329
Age in years (mean, stdev)        (75.0, 6.7)                   (72.6, 7.2)
Education in years (mean, stdev)  (15.5, 3.1)                   (16.3, 2.6)
ADAS-13 (mean, stdev, [min, max]) (18.4, 9.2, [1.0, 54.7])      (16.1, 10.14, [1.0, 52.0])
MMSE (mean, stdev, [min, max])    (26.7, 2.7, [18.0, 30.0])     (27.5, 2.7, [19.0, 30.0])
Table 4.1: Dataset demographics for ADNI1 and ADNI2 cohorts used in this study. CN: Cognitively Normal, SMC: Significant Memory Concern, EMCI: Early Mild Cognitive Impaired, LMCI: Late Mild Cognitive Impaired, AD: Alzheimer’s Disease; ADAS: Alzheimer’s Disease Assessment Scale, MMSE: Mini–Mental State Examination.
4.4.2 MR image processing
MR images were first preprocessed using the bpipe pipeline (https://github.com/CobraLab/minc-bpipe-library/), comprising N4-correction [347], neck-cropping to improve linear registration, and BEaST brain extraction [119]. The preprocessed data were then used to extract 1) hippocampal (HC) segmentations and 2) cortical thickness (CT) measures, referred to as input modalities in this work.
Hippocampal Segmentation
HC segmentations of the T1-weighted MR images were produced using the MAGeT Brain pipeline [55, 277]. Briefly, this pipeline begins with five manually segmented high-resolution 3T T1-weighted atlas images [372], each of which is registered non-linearly to fifteen ADNI images selected at random (known as the template library). Each image in the template library is then non-linearly registered to all images in the ADNI datasets, and the segmentations from each atlas are warped via the template-library transformations to each ADNI image. This results in 75 (number of atlases x number of templates) candidate segmentations for each image, which are fused into a single segmentation using voxel-wise majority voting.
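The voxel-wise majority-vote fusion at the end of this pipeline is conceptually simple; a minimal sketch, assuming the candidate segmentations are stacked binary arrays (the function name is illustrative):

```python
import numpy as np

def majority_vote_fusion(candidates):
    """Fuse candidate binary segmentations by voxel-wise majority vote.

    candidates: array-like of shape (n_candidates, ...), one binary mask
    per atlas-template pair (75 masks in the MAGeT Brain configuration).
    A voxel is labeled foreground when more than half of the candidates
    agree on it.
    """
    candidates = np.asarray(candidates, dtype=bool)
    votes = candidates.sum(axis=0)
    return votes > candidates.shape[0] / 2.0
```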
Cortical Thickness Estimation
The preprocessed images were input to the CIVET pipeline [2, 69, 196, 221, 237] to estimate cortical thickness measures at 40,962 vertices per hemisphere, which can be subsequently grouped by region of interest (ROI) based on a surface atlas.
4.4.3 Anatomically Partitioned Artificial Neural Network (APANN)
Artificial neural networks (ANNs) are a biologically inspired family of graphical machine-learning (ML) models that can perform prediction tasks using high-dimensional input (see Figure 4.1-A). ANN models can be designed to contain multiple hidden layers, which hierarchically encode latent features that are informative for the objective task. The neuron connections represent a set of weights on the preceding input values, which are combined and passed through a nonlinear function. In neuroimaging, a few variants of ANN models, such as autoencoders and restricted Boltzmann machines, have been investigated for classification and prediction tasks [280, 335]. In comparison to these existing approaches, the model presented in this work differs significantly in its design and implementation. From a design perspective, we leverage the hierarchical structure of ANNs to build a modular architecture (see Figure 4.1-B) that is capable of multimodal input integration (see Figure 4.1-C) and multitask prediction (see Figure 4.1-D). We achieve these objectives in three stages (see Figure 4.1-E). Stage I consists of anatomically partitioned modules (two hidden layers per module) that extract features from the individual anatomical input sources (hippocampus and cortical surface). These individual anatomical features from stage I serve as input to stage II of the network, where they are combined at a higher layer within the hidden-layer hierarchy. Lastly, we use these integrated features to perform multiple tasks simultaneously. The task-specific hidden layers are represented by the higher layers in stage III (four hidden layers total). This anatomically partitioned ANN (APANN) mitigates overfitting by reducing the number of model parameters compared to classical fully-connected architectures, and allows independent pretraining of each input source in a single branch.
These individually pretrained branches can subsequently be used to train stage II in order to integrate features efficiently.
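To make the three-stage design concrete, the forward pass can be sketched in plain NumPy. This is a structural illustration only: layer sizes are placeholders, biases and backpropagation are omitted, and the class and variable names are ours (the actual implementation used the Caffe toolbox).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Branch:
    # Stage I feature module: two ReLU hidden layers for one input modality.
    def __init__(self, n_in, n_hidden, rng):
        self.W1 = rng.normal(0, 0.01, (n_in, n_hidden))
        self.W2 = rng.normal(0, 0.01, (n_hidden, n_hidden))

    def forward(self, x):
        return relu(relu(x @ self.W1) @ self.W2)

class APANNSketch:
    # Stage I: one pretrainable branch per modality (e.g. left HC, right HC, CT).
    # Stage II: one integration layer over the concatenated branch features.
    # Stage III: one task-specific layer plus a linear output node per task.
    def __init__(self, dims, n_hidden=50, rng=None):
        rng = rng or np.random.default_rng(0)
        self.branches = [Branch(d, n_hidden, rng) for d in dims]
        self.W_int = rng.normal(0, 0.01, (n_hidden * len(dims), n_hidden))
        self.tasks = {t: (rng.normal(0, 0.01, (n_hidden, n_hidden)),
                          rng.normal(0, 0.01, (n_hidden, 1)))
                      for t in ("ADAS13", "MMSE")}

    def forward(self, inputs):
        feats = np.concatenate(
            [b.forward(x) for b, x in zip(self.branches, inputs)], axis=1)
        integrated = relu(feats @ self.W_int)  # stage II
        return {t: relu(integrated @ Wt) @ Wo  # stage III, joint prediction
                for t, (Wt, Wo) in self.tasks.items()}
```

The partitioning is visible in the parameter count: each modality connects only to its own branch, rather than every input unit connecting to every first-layer hidden unit.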
4.4.4 Empirical distributions
The input dimensionality of MR data greatly exceeds the available number of samples, leaving ML models susceptible to overfitting [19, 202]. This necessitates the critical step of feature engineering: the transformation of high-dimensional raw input into a meaningful and computationally manageable feature space
Figure 4.1: A) Structure of a generic artificial neural network (ANN) model. A neural net may comprise multiple hidden layers that encode a hierarchical set of features from the input, informative for the prediction/classification task at hand. The connections between layers represent the model weights, which are updated via backpropagation based on the loss function associated with the task. B) A single feature module comprising multiple hidden layers. This is a building block of the APANN architecture, which facilitates pretraining of individual branches per input modality. C) A multi-modal ANN with a single output task. This design consists of stage I and stage II feature modules. Stage I modules learn features from each modality that are subsequently combined in the stage II feature module. Only single-task performance is used to update the weights of the model in this architecture. D) A multi-task ANN with a single input modality. This design consists of stage I and stage III feature modules. The stage I module learns individual features from the given modality that are then fed into task-specific feature modules connected to the output nodes for joint prediction of the two tasks (ADAS-13 and MMSE score prediction). Prediction performance from both tasks is used to update the weights of the stage I feature module. The left HC, right HC, and CT input modalities are trained separately using this design to learn input feature modules for each modality. E) The proposed multi-modal, multi-task APANN model comprising anatomical partitioning. This design consists of stage I, stage II, and stage III feature modules. Stage I comprises pretrained feature modules from each modality. These input features are fed into stage II to learn integrated features, which in turn are fed into the task-specific feature modules in stage III. The stage III modules are connected to the output nodes for joint prediction of the two tasks (ADAS-13 and MMSE score prediction).
Prediction performance from both tasks is used to update the weights of the stage I and stage II feature modules. The partitioned architecture reduces the number of model parameters, which, along with the pretrained feature modules, helps mitigate overfitting. Note 1: Input data dimensionality is as follows: 16086 (left HC), 16471 (right HC), and 686 (CT). Note 2: For details regarding the hyperparameters of APANN (number of hidden nodes, learning policies, weight regularization, etc.), see Table 4.2.
[264]. Techniques for addressing the high dimensionality include downsampling, handcrafting features based on biological priors (e.g. atlases), principal component analysis, etc. Alternatively, one can increase the sample size by adding transformed data (e.g. linear transformations, image patches). In this work, we present a novel data augmentation method that leverages the MR preprocessing pipelines to produce a set of empirical samples for both the HC and CT input modalities in place of a single point estimate per subject. This boost in training sample size makes it feasible to train models with large parameter spaces, and helps prevent overfitting by exposing the model to a large set of possible variations in anatomical input associated with a given severity level. Adding linear and non-linear transformations of the original input data is a common practice in machine learning [212, 293]. In computer vision applications this typically means translation, rotation, or dropping of certain pixels, in an effort to capture a larger set of commonly encountered variations of input features to which the classifier should be invariant. In structural MR data, we are more interested in modeling the joint voxel distribution of anatomical segmentations than in achieving high translational invariance, since the location of anatomical structures is relatively consistent across subjects. Thus, the empirical samples, generated as part of common segmentation and cortical surface extraction pipelines, help train the model to be invariant to methodologically driven perturbations of the input values. This in turn mitigates overfitting and helps the model learn anatomical patterns relevant to clinical performance.
For the HC inputs, the empirical samples refer to a set of “candidate segmentations” generated from a multi-atlas segmentation pipeline (see Figure 4.2-A) [55, 277] that model the underlying joint label distribution over the set of voxels for a given subject. For the CT inputs, the empirical samples refer to cortical thickness values from the set of vertices belonging to a given cortical ROI (see Figure 4.2-B). In traditional approaches, these samples are usually fused to produce a point estimate of the feature [286, 87]. We detail the sample generation process for both of these input types below.
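The core idea — drawing one candidate segmentation and one thickness value per ROI to form each augmented training example, instead of the fused mask and ROI means — can be sketched as follows (function and variable names are illustrative, not from the released code):

```python
import numpy as np

def augment_subject(candidate_segs, roi_thickness, n_samples, seed=0):
    """Generate augmented training vectors for one subject.

    candidate_segs: (n_candidates, n_voxels) binary HC segmentation vectors.
    roi_thickness: list of 1-D arrays; vertex-wise CT values for each ROI.
    Each augmented sample pairs a randomly drawn candidate segmentation
    with one thickness value drawn per ROI.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        seg = candidate_segs[rng.integers(len(candidate_segs))]
        ct = np.array([rng.choice(v) for v in roi_thickness])
        samples.append(np.concatenate([seg, ct]))
    return np.stack(samples)
```

Each subject thus contributes many plausible input vectors rather than one, which is the source of the sample-size boost described above.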
Hippocampal segmentations
• Segmentation generation: 75 candidate segmentations and 1 fused segmentation are produced for each subject via the MAGeT Brain pipeline [277]. The ADNI1 and ADNI2 datasets are segmented using separate template libraries, each with fifteen images chosen from the respective cohort. These candidate segmentations are binary masks of the left and right hippocampal voxels.
• Segmentation alignment: We rigidly align the candidate segmentations to a common space (a subject chosen at random from the ADNI1 dataset) to maximize anatomical correspondence across subjects. Each segmentation is split into left and right hemisphere segmentations, and both are rigidly aligned to this common space using the ANTS registration toolkit [20].
• Segmentation filtering: To remove outlier segmentations caused by misregistration or poor segmentation, we compute the Dice kappa score between each rigidly aligned candidate segmentation and the preselected common-space segmentation, and then exclude any candidate segmentation whose Dice score is more than one standard deviation below the mean Dice score over all subjects.
• Voxel filtering: To further compact the bounding box of all the candidate segmentations, we exclude voxels with low information density by keeping only structural voxels present in at least 25% of candidate segmentations across the ADNI1 and ADNI2 datasets. After these filtering operations, the 3-dimensional volumes are flattened into a one-dimensional vector of included voxels per candidate segmentation.
Upon completion of this process, the vectorized voxels represent the hippocampal input to the APANN model. The lengths of the input vectors were 16086 and 16471 for the left and right hippocampus, respectively.
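The segmentation- and voxel-filtering steps above can be sketched as follows. This is a simplified per-subject version (in the actual pipeline the Dice threshold is computed over all subjects and the voxel occupancy over both datasets); names are illustrative.

```python
import numpy as np

def dice(a, b):
    # Dice overlap between two binary masks.
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def filter_candidates(candidates, reference, min_occupancy=0.25):
    """Drop outlier candidate segmentations, then low-information voxels.

    candidates: (n_candidates, n_voxels) binary masks in a common space.
    reference: (n_voxels,) binary common-space segmentation.
    Candidates whose Dice score falls more than one standard deviation
    below the mean are excluded; the remaining masks keep only voxels
    present in at least `min_occupancy` of the candidates.
    """
    candidates = np.asarray(candidates, dtype=bool)
    scores = np.array([dice(c, reference) for c in candidates])
    kept = candidates[scores >= scores.mean() - scores.std()]
    voxels = kept.mean(axis=0) >= min_occupancy
    return kept[:, voxels]
```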
Figure 4.2: A) Schematic of the multi-atlas segmentation pipeline depicting the registration and label-fusion stages. The red box highlights the “candidate labels” derived from different atlases, which are treated as empirical samples in the context of structural labels. These labels are usually fused into a single label, which serves as a point-estimate mask of a given structure. B) Schematic of the cortical thickness estimation pipeline comprising surface registration, parcellation, and average thickness estimation steps. The red box highlights the individual vertices in a given region of interest (ROI), which are treated as empirical samples in the context of the cortical thickness measure. The thickness values of these vertices are usually averaged to estimate the mean thickness over an ROI.
Cortical thickness
CIVET preprocessing produces CT values at 40,962 vertices per hemisphere. We assign these cortical vertices to unique ROIs based on a predefined atlas. In this work, we created a custom atlas (see Figure 4.3) comprising 686 ROIs, maintaining bilateral symmetry (343 ROIs per hemisphere), using data-driven parcellation based on spectral clustering. Spectral clustering allows the creation of ROIs with similar numbers of vertices, which is desirable for unbiased sampling of vertices for cortical thickness estimation. Also, work by others [193] suggests that increasing the spatial resolution of a cortical parcellation may improve predictive performance, which further motivates the use of this data-driven atlas over neuroanatomically derived parcellations [190, 348]. The connectivity information from the cortical mesh of the template was used as the adjacency matrix during implementation. Upon generating the sets of vertices per ROI, we simply treat each vertex as a sample from a distribution that characterizes the thickness of that ROI. Thus, the CT features of each subject can now be characterized by a distribution of thickness values per ROI instead of the mean thickness values computed as point estimates (see Figure 4.2-B). The independent empirical sampling processes for the HC and CT inputs necessitate a standardization step, which is described in the supplementary material, Section 2.
Figure 4.3: A custom cortical surface parcellation (atlas) comprising 686 regions of interest (ROIs), each comprising a roughly equal number of vertices. The parcellation was based on a triangular surface mesh obtained from a CIVET model. The vertices of the mesh were grouped together based on spatial proximity using the spectral clustering method1. Bilateral symmetry within the vertices of the hemispheres was preserved. The atlas was propagated to each subject to obtain thickness samples per ROI. 1 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
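A minimal sketch of this parcellation step, using scikit-learn's SpectralClustering with the mesh connectivity as a precomputed affinity matrix (the bilateral-symmetry constraint used for the actual atlas is not shown, and the function name is ours):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def parcellate_mesh(adjacency, n_rois, seed=0):
    # Group mesh vertices into ROIs using spectral clustering on the
    # precomputed vertex-adjacency (affinity) matrix of the surface mesh.
    model = SpectralClustering(n_clusters=n_rois, affinity="precomputed",
                               assign_labels="kmeans", random_state=seed)
    return model.fit_predict(adjacency)
```

Because the affinity encodes mesh connectivity, spatially adjacent vertices land in the same ROI, yielding contiguous patches of roughly equal size.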
Training procedure
The training procedure consists of two parts: 1) training individual branches per input modality, and 2) fine-tuning the unified model comprising the pretrained branches from part 1 along with the additional integrated and task-specific feature layers. In the first part, we trained separate models using the individual HC and CT modalities independently (see Figure 4.1-D). Each model was trained to jointly predict both tasks (ADAS-13 and MMSE scores). At the end of this procedure, we obtained the set of weights for the stage I hidden layers for each input branch. Subsequently, we extended the model with the stage II and III hidden layers and further trained it to learn the integrated and task-specific feature layers (see Figure 4.1-E). We used both tasks during this training procedure as well. For both parts, the hyperparameters of the model (see Table 4.2) were determined using an inner cross-validation loop. The code, which uses the Caffe toolbox (http://caffe.berkeleyvision.org/) for the APANN design and training, is available at: https://github.com/CobraLab/NI-ML/tree/master/projects/APANN. The computational resource requirements are provided in the supplementary material, Section 3.
4.4.5 Performance Validation
We compared the performance of the APANN model separately for MMSE and ADAS-13 score prediction. We conducted three experiments to compare performance on each cohort separately as well as
Linear Regression with Lasso (LR L1)
• L1 penalty: 0.001 to 1 (in increments of 0.01)

Support Vector Regression (SVR)
• kernel: [linear, rbf]
• C: [0.001, 0.01, 1, 10, 100]

Random Forest Regression (RFR)
• N estimators: 10 to 210 (in increments of 25)
• min sample split: [2, 4, 6, 8]

Artificial Neural Network (APANN)
Fixed hyperparameters:
• Stage I (input features): two hidden layers with an equal number of nodes in each layer
• Stage II (integrated features): one hidden layer
• Stage III (task features): one hidden layer
• Activation nonlinearity: ReLU
Tunable hyperparameters:
• Stage I number of hidden nodes: [25, 50, 100, 200]
• Stage II number of hidden nodes: [25, 50]
• Stage III number of hidden nodes: [25, 50]
• Learning rate: [1e-6, 1e-5, 1e-4]
• Learning policy: [Nesterov, Adagrad]
• Weight decay: [1e-4, 1e-3, 1e-2]
• Dropout rate: [0, 0.25, 0.5] (Stage I only)
Table 4.2: Hyperparameter search space for the four models. A grid search over the hyperparameters was performed using a nested inner loop for each cross-validation round. For the APANN model, the fixed hyperparameters refer to broader network design choices that remained identical across all cross-validation rounds; the tunable hyperparameters were optimized for each fold.

combined: 1) ADNI1, 2) ADNI2, and 3) ADNI1+ADNI2. The latter is an effort to evaluate model robustness in the context of multi-cohort, multi-site studies, which are becoming increasingly prevalent in the field. In each experiment, we compared the performance of two inputs separately as well as combined: 1) HC, 2) CT, and 3) HC+CT. We used Pearson's correlation (r) and root mean square error (rmse) between true and predicted clinical scores as our performance metrics. All experiments were evaluated using 10 rounds of a 10-fold nested cross-validation procedure. The outer folds were created by dividing the subject pool into 10 non-overlapping subsets. During each run, 9 of the 10 subsets were chosen as a training set and performance was evaluated on the held-out test subset. During model training, three inner folds were created by further dividing the training set under consideration to determine the optimal combination of hyperparameters (e.g. number of hidden nodes) using grid search. For Experiment 3, the outer folds were stratified to maintain a similar ratio of ADNI1 and ADNI2 subjects in each fold. We compared the performance of APANN in all experiments against three commonly used ML models: linear regression with Lasso, support vector regression, and random forest regression. These results are provided in supplementary materials Section 1. Our secondary, proof-of-concept analysis comprises a longitudinal experiment to predict clinical scores at baseline and at 1 year simultaneously, using only baseline MR data.
This is in an effort to demonstrate the applicability of APANN from a clinical standpoint, where the end goal is to predict future diagnostic and/or prognostic states of a subject. We limit our analysis to the ADAS-13 scale, whose larger score range offers better sensitivity to longitudinal changes, and to the individual ADNI1 and ADNI2 cohorts. We note that for this experiment, due to missing timepoints, the number of subjects was reduced to 553 for ADNI1 and 590 for ADNI2.
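The nested cross-validation scheme described above can be sketched as follows. This is a pure-Python skeleton with a toy scoring function: `fit_and_score` stands in for actual model training and evaluation, and the grids of Table 4.2 would replace `grid`.

```python
# Skeleton of the evaluation scheme: repeated 10-fold outer cross-validation,
# with a 3-fold inner grid search over hyperparameters per outer training set.
import random
from itertools import product
from statistics import mean

def k_folds(indices, k, rng):
    idx = indices[:]
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]          # k non-overlapping subsets

def fit_and_score(train_idx, test_idx, params):
    # Stand-in for training a model on train_idx and scoring it on test_idx.
    return random.Random(repr((len(train_idx), sorted(params.items())))).random()

def nested_cv(subjects, param_grid, rounds=10, outer_k=10, inner_k=3, seed=0):
    rng = random.Random(seed)
    grid = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]
    scores = []
    for _ in range(rounds):
        for test in k_folds(subjects, outer_k, rng):          # outer loop
            train = [s for s in subjects if s not in test]
            # Inner loop: grid search over hyperparameters on the training set.
            best = max(grid, key=lambda p: mean(
                fit_and_score([s for s in train if s not in val], val, p)
                for val in k_folds(train, inner_k, rng)))
            scores.append(fit_and_score(train, test, best))   # held-out score
    return mean(scores)

subjects = list(range(100))
grid = {"hidden_nodes": [25, 50, 100, 200], "lr": [1e-6, 1e-5, 1e-4]}
avg = nested_cv(subjects, grid, rounds=2)   # 2 rounds here for brevity
print(avg)
```

The design point is that the held-out test subset never influences hyperparameter selection: the grid search sees only inner splits of the outer training set.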
4.5 Results
The mean correlation (r) and root mean square error (rmse) values for all three experiments with the three input modality configurations are summarized in Figure 4.4 and Tables 4.3 and 4.4. Scatter plots of predicted versus actual ADAS-13 and MMSE scores are shown in Figure 4.5. Scatter plots were generated using scores from all the test subsets of a randomly chosen round of a 10-fold run. Results for the longitudinal experiment are shown in Figure 4.6. Individual results for each experiment are detailed below. The comparative results with other models are provided in supplementary materials Section 1. Briefly, results from all three experiments indicate that the APANN model offers better predictive performance with HC inputs. In comparison, the CT input modality, when used independently, does not offer improvement. However, the HC+CT input to the APANN model offers a significantly higher performance improvement over the reference models across all three experiments.
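For reference, the two performance metrics reported throughout this section can be computed as below (toy scores for illustration; in practice these would be the actual and predicted clinical scores of a test fold):

```python
# Pearson's correlation (r) and root mean square error (rmse) between
# actual and predicted clinical scores.
from math import sqrt

def pearson_r(y_true, y_pred):
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mt) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov / sqrt(var_t * var_p)

def rmse(y_true, y_pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy ADAS-13-like scores:
actual    = [10.0, 22.0, 31.0, 15.0, 27.0]
predicted = [12.0, 20.0, 28.0, 18.0, 25.0]
print(round(pearson_r(actual, predicted), 3),
      round(rmse(actual, predicted), 3))      # 0.983 2.449
```

Note that r measures linear association regardless of scale, while rmse is in the units of the clinical score, which is why both are reported.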
4.5.1 Experiment 1: ADNI1 Cohort
The combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.60, rmse = 7.11. Similar trends are observed for MMSE prediction, where the HC+CT input yields r = 0.52, rmse = 2.25. The sole HC input yields r = 0.53, rmse = 7.56, and r = 0.40, rmse = 2.41, for ADAS-13 and MMSE score prediction, respectively. The sole CT input yields r = 0.51, rmse = 7.67, and r = 0.50, rmse = 2.29, for ADAS-13 and MMSE score prediction, respectively.
4.5.2 Experiment 2: ADNI2 Cohort
Similar to Experiment 1, the combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.68, rmse = 7.17. Similar trends are observed for MMSE prediction, where the HC+CT input yields r = 0.55, rmse = 2.25. The sole HC input yields r = 0.52, rmse = 8.32, and r = 0.40, rmse = 2.51, for ADAS-13 and MMSE score prediction, respectively. The sole CT input yields r = 0.63, rmse = 7.58, and r = 0.52, rmse = 2.31, for ADAS-13 and MMSE score prediction, respectively.
4.5.3 Experiment 3: ADNI1 + ADNI2 Cohort
Similar to Experiments 1 and 2, the combined HC+CT input provides the best results for ADAS-13 prediction with r = 0.63, rmse = 7.32. Similar trends are observed for MMSE prediction, where the HC+CT input yields r = 0.55, rmse = 2.25. The sole HC input yields r = 0.54, rmse = 7.99, and r = 0.45, rmse = 2.42, for ADAS-13 and MMSE score prediction, respectively. The sole CT input yields r = 0.57, rmse = 7.79, and r = 0.50, rmse = 2.37, for ADAS-13 and MMSE score prediction, respectively. A further analysis of the results in this experiment, stratified by subject-cohort membership (ADNI1 vs. ADNI2), shows that APANN has a smaller performance bias towards any particular cohort (i.e. performing well on only a single cohort) than the other models (see supplementary materials for details).
4.5.4 Longitudinal prediction
Similar to Experiments 1-3, the combined HC+CT input provides the best results, with r = 0.58, rmse = 7.1 at baseline and r = 0.59, rmse = 9.08 for 1-year score prediction for ADNI1; and r = 0.64, rmse = 7.07 at baseline and r = 0.65, rmse = 9.07 for 1-year score prediction for ADNI2. The sole HC input yields better performance than the sole CT input for baseline and 1-year score prediction for ADNI1, whereas the sole CT input yields better performance than the sole HC input for baseline and 1-year score prediction for ADNI2.

Figure 4.4: Performance of APANN subject to individual and combined input modalities (correlation and rmse for ADAS-13 and MMSE across the ADNI1, ADNI2, and ADNI1and2 cohorts). The correlation and rmse values are averaged over 10 rounds of 10 folds. All models were trained with a nested inner loop that searched for optimal hyperparameters.
Figure 4.5: Scatter plots of predicted versus actual ADAS-13 and MMSE scores for the three cohorts (ADNI1, ADNI2, ADNI1and2). Scatter plots were generated by concatenating scores from all the test subsets of a randomly chosen round of a 10-fold validation run. R^2 values — ADAS-13: 0.321 (ADNI1), 0.430 (ADNI2), 0.379 (ADNI1and2); MMSE: 0.245 (ADNI1), 0.266 (ADNI2), 0.279 (ADNI1and2).
ADNI1          HC                   CT                   HC+CT
LR L1          r: 0.22, 0.11        r: 0.56, 0.08        r: 0.56, 0.08
               rmse: 8.72, 0.81     rmse: 7.44, 0.72     rmse: 7.42, 0.74
SVR            r: 0.23, 0.11        r: 0.52, 0.08        r: 0.53, 0.08
               rmse: 8.70, 0.85     rmse: 7.68, 0.76     rmse: 7.62, 0.78
RFR            r: 0.15, 0.10        r: 0.54, 0.08        r: 0.54, 0.08
               rmse: 9.27, 0.80     rmse: 7.55, 0.76     rmse: 7.51, 0.77
APANN          r: 0.53, 0.09        r: 0.51, 0.10        r: 0.60, 0.08
               rmse: 7.56, 0.76     rmse: 7.67, 0.76     rmse: 7.11, 0.72

ADNI2          HC                   CT                   HC+CT
LR L1          r: 0.14, 0.11        r: 0.61, 0.07        r: 0.61, 0.07
               rmse: 9.69, 0.70     rmse: 7.77, 0.71     rmse: 7.78, 0.71
SVR            r: 0.21, 0.10        r: 0.63, 0.07        r: 0.63, 0.07
               rmse: 9.75, 0.79     rmse: 7.65, 0.68     rmse: 7.66, 0.70
RFR            r: 0.24, 0.09        r: 0.58, 0.07        r: 0.58, 0.08
               rmse: 9.77, 0.76     rmse: 7.97, 0.65     rmse: 7.97, 0.67
APANN          r: 0.52, 0.07        r: 0.63, 0.07        r: 0.68, 0.06
               rmse: 8.32, 0.79     rmse: 7.58, 0.71     rmse: 7.17, 0.71

ADNI1+ADNI2    HC                   CT                   HC+CT
LR L1          r: 0.12, 0.08        r: 0.58, 0.06        r: 0.58, 0.06
               rmse: 9.37, 0.50     rmse: 7.71, 0.48     rmse: 7.71, 0.48
SVR            r: 0.18, 0.07        r: 0.59, 0.05        r: 0.59, 0.05
               rmse: 9.39, 0.54     rmse: 7.65, 0.42     rmse: 7.65, 0.42
RFR            r: 0.18, 0.09        r: 0.57, 0.05        r: 0.57, 0.05
               rmse: 9.63, 0.61     rmse: 7.76, 0.46     rmse: 7.75, 0.46
APANN          r: 0.54, 0.06        r: 0.57, 0.05        r: 0.63, 0.05
               rmse: 7.99, 0.59     rmse: 7.79, 0.51     rmse: 7.32, 0.53

Table 4.3: Prediction performance for Alzheimer's Disease Assessment Scale-13 scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RFR: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson's correlation (mean, std), rmse: root mean square error (mean, std).
ADNI1          HC                   CT                   HC+CT
LR L1          r: 0.23, 0.12        r: 0.49, 0.08        r: 0.50, 0.08
               rmse: 2.54, 0.18     rmse: 2.28, 0.17     rmse: 2.27, 0.17
SVR            r: 0.25, 0.12        r: 0.48, 0.07        r: 0.50, 0.07
               rmse: 2.59, 0.19     rmse: 2.31, 0.16     rmse: 2.28, 0.16
RFR            r: 0.22, 0.11        r: 0.48, 0.08        r: 0.49, 0.08
               rmse: 2.63, 0.21     rmse: 2.30, 0.17     rmse: 2.28, 0.17
APANN          r: 0.40, 0.09        r: 0.50, 0.09        r: 0.52, 0.08
               rmse: 2.41, 0.15     rmse: 2.29, 0.20     rmse: 2.23, 0.17

ADNI2          HC                   CT                   HC+CT
LR L1          r: 0.19, 0.12        r: 0.46, 0.08        r: 0.47, 0.08
               rmse: 2.64, 0.19     rmse: 2.39, 0.19     rmse: 2.39, 0.19
SVR            r: 0.28, 0.14        r: 0.52, 0.07        r: 0.54, 0.07
               rmse: 2.72, 0.24     rmse: 2.32, 0.18     rmse: 2.30, 0.18
RFR            r: 0.25, 0.12        r: 0.50, 0.09        r: 0.51, 0.08
               rmse: 2.67, 0.24     rmse: 2.33, 0.17     rmse: 2.31, 0.17
APANN          r: 0.40, 0.09        r: 0.52, 0.12        r: 0.55, 0.10
               rmse: 2.51, 0.21     rmse: 2.31, 0.25     rmse: 2.25, 0.21

ADNI1+ADNI2    HC                   CT                   HC+CT
LR L1          r: 0.15, 0.08        r: 0.50, 0.07        r: 0.50, 0.07
               rmse: 2.64, 0.12     rmse: 2.31, 0.13     rmse: 2.31, 0.13
SVR            r: 0.22, 0.07        r: 0.52, 0.07        r: 0.52, 0.07
               rmse: 2.71, 0.13     rmse: 2.31, 0.13     rmse: 2.30, 0.13
RFR            r: 0.17, 0.08        r: 0.50, 0.07        r: 0.50, 0.07
               rmse: 2.74, 0.14     rmse: 2.31, 0.14     rmse: 2.31, 0.14
APANN          r: 0.45, 0.06        r: 0.50, 0.07        r: 0.55, 0.06
               rmse: 2.42, 0.14     rmse: 2.37, 0.15     rmse: 2.25, 0.12

Table 4.4: Prediction performance for Mini-Mental State Examination (MMSE) scores. LR L1: Linear Regression model with Lasso regularizer, SVR: Support Vector Regression, RFR: Random Forest Regression, APANN: Anatomically Partitioned Artificial Neural Network; HC: hippocampal input, CT: cortical thickness input, HC+CT: combined hippocampal and cortical thickness input; r: Pearson's correlation (mean, std), rmse: root mean square error (mean, std).
Figure 4.6: Simultaneous predictions of baseline and 1-year ADAS-13 scores. The top two rows show the Pearson's r values based on predicted and actual ADAS-13 scores over 10-fold cross-validation for the ADNI1 and ADNI2 cohorts, respectively. The bottom two rows show the root mean square error (rmse) between predicted and actual ADAS-13 scores for the ADNI1 and ADNI2 cohorts, respectively. The first column shows performance at baseline, whereas the second column shows performance at month 12. Models were trained separately for each input: HC, CT, and HC+CT, represented by different colors in the plots.
4.6 Discussion
In this manuscript, we presented an artificial neural network model for the prediction of cognitive scores in AD using high-dimensional structural MR imaging data. We showed that information from voxel-level hippocampal segmentations and highly granular cortical parcellations can be leveraged to infer cognitive performance and clinical severity at the single-subject level. This capability of the APANN model to predict MMSE and ADAS-13 scores based on structural MR features may prove valuable from a clinical perspective in building prognostic tools. The proof-of-concept longitudinal experiment demonstrated that APANN can successfully predict future (1-year) scores from baseline MR data. The results comparing APANN against several other models are provided in supplementary materials Section 1. These results highlight the performance gains offered by high-dimensional features as input to APANN. Below we discuss the performance of APANN with respect to 1) clinical scale, 2) input modality, 3) dataset, and 4) related literature.
4.6.1 Clinical scale comparisons
Performance comparison between the clinical scales based on correlation values indicates that predicting MMSE scores is the more challenging of the two across all inputs and cohorts. This disparity is possibly due to the higher sensitivity of the ADAS-13 assessment, whose comparatively larger scoring range improves its association with the structural measures.
4.6.2 Input modality comparisons
The results from all three experiments indicate that the APANN model offers better predictive performance with the combined HC+CT inputs. The CT input outperforms the HC input in all three experiments for both scales, except for ADAS-13 prediction in the ADNI1 cohort, where the HC input offers slightly higher performance. This highlights the importance of incorporating multiple phenotypes when developing biomarkers indicative of cognitive performance. The capability of APANN to handle multimodal input is crucial for building clinical tools that leverage disparate MR, clinical, as well as genetic markers.
4.6.3 Dataset comparisons
Between Experiments 1 and 2, we observe that the ADNI2 cohort yields better performance than ADNI1 across all models. This may be due to differences in acquisition protocols, as ADNI2 images were acquired at higher field strength with better resolution. The improvement in image acquisition likely provides superior-quality segmentations and cortical thickness measures [61]. In Experiment 3 we combined the ADNI1 and ADNI2 cohorts. Pooling data from different datasets is becoming increasingly important to verify the generalizability of a model on a larger population that extends beyond a single study. Interestingly, Experiment 3 outperforms Experiment 1 but underperforms Experiment 2. This is partially expected due to substantial differences in the individual feature distributions (e.g. hippocampal segmentations) resulting from the aforementioned differences in acquisition protocols. In such cases it becomes imperative to build models invariant to dataset-specific biases resulting from non-uniform data collection practices. The results from Experiment 3 show that APANN offers consistent performance comparable to Experiments 1 and 2, and low dataset-specific bias (i.e. the model performing well only on a single dataset) compared to other models (see supplementary material Section 4 for details). We speculate that models incorporating high-dimensional, multimodal input are less susceptible to multi-cohort and multi-site study design artifacts, which is desirable for the development of clinical tools in practical settings.
4.6.4 Longitudinal analysis
Consistent with the first three experiments, the combined HC+CT input offers the best performance for 1-year score prediction, with similar correlation results but higher rmse. This suggests that uncertainty is likely to increase with the timespan under consideration for longitudinal tasks (1 year vs. 2 years vs. 5 years), making the predictions more challenging. Further considerations are also needed for cases where information from multiple timepoints (baseline + 1 year) is utilized towards subsequent (2-year +) performance prediction. Missing timepoints become an increasingly important caveat for such tasks. Nevertheless, APANN shows promising results for investigating more sophisticated longitudinal predictions.
4.6.5 Related work
As we mentioned earlier, prediction of clinical scores is a relatively underexplored task. For fair comparison we limit our discussion to two recent studies involving baseline prediction with MR imaging features [332, 390]. Both works use structural MR images from the ADNI1 baseline dataset for prediction of the MMSE and ADAS-Cog scales (ADAS-Cog uses 11 of the 13 subscales of ADAS-13; http://adni.loni.usc.edu/data-samples/data-faq/). Consequently, ADAS-Cog and ADAS-13 scores are strongly correlated (r > 0.9 for the ADNI1 and ADNI2 subjects considered in this manuscript). Stonnington et al. [332] use relevance vector regression (RVR) models with a sample size of 586 subjects, and report correlation values of 0.48 (MMSE) and 0.57 (ADAS-Cog). [390] propose a computational framework called Multi-Modal Multi-Task (M3T) that offers multi-task feature selection and multi-modal support vector machines (SVM) for regression and classification tasks. With only MR-based features, M3T achieves correlations of 0.50 (MMSE) and 0.60 (ADAS-Cog) on a sample size of 186 subjects. The APANN model, in comparison, offers correlations of 0.52 (MMSE) and 0.60 (ADAS-13) with a much larger cohort of 669 ADNI1 subjects. Although APANN offers similar performance for the ADNI1 dataset, there are several key advantages. In contrast to M3T, which implements two separate stages for feature extraction and regression (or classification), APANN provides a unified model that performs feature extraction and multi-task prediction using multimodal input in a seamless manner. From a scalability perspective, the results show that APANN is capable of handling high-dimensional input and of incorporating new modalities without retraining the entire model, whereas M3T is restricted to 93 MR atlas-based features [190] within a total of 189 multimodal (MRI, FDG-PET, and cerebrospinal fluid) features [390].
Moreover, with APANN we replicate performance on the ADNI2 cohort and demonstrate improved correlation performance of 0.55 (MMSE) and 0.68 (ADAS-13) with 690 subjects, further validating its generalizability. Other recent works address clinical score prediction using sparse Bayesian learning [358] and graph-guided feature selection [383], with 98 and 93 imaging features, respectively. Both works report high performance on AD and CN subject groups; however, performance degrades after inclusion of MCI subjects. For example, [383] report correlations of 0.745 (MMSE) and 0.74 (ADAS-Cog) for specific subsets of AD/CN subjects; however, the performance degrades to 0.382 (MMSE) and 0.472 (ADAS-Cog) for a subset of MCI/CN subjects. Clinically, MCI subjects are of high interest from an early intervention and prognostic standpoint, and prediction of their cognitive performance is therefore crucial. To the best of our knowledge, APANN is the first work tackling such high input dimensionality (> 30k features), validated across the continuum from healthy control to AD dementia on multiple cohorts with site and scanner differences. Such validation is increasingly important with the availability of newer and larger datasets, such as the UK Biobank (http://www.ukbiobank.ac.uk/about-biobank-uk/).
4.6.6 Clinical translation
As mentioned earlier, the ultimate clinical goal of this work is to provide longitudinal prognosis that can predict the future clinical states of an individual. The rigorously validated APANN provides a computational platform for a variety of longitudinal tasks, such as the 1-year ADAS-13 prediction task investigated in the proof-of-concept experiment (see section 3.4). We envision APANN being applied to the MR data of at-risk individuals from prodromal stages (MCI, SMC, etc.) and even early AD stages towards prediction of future clinical scores and other clinical state proxies. The ability of APANN to capture relevant subtle neuroanatomical changes from high-dimensional, multi-modal MR imaging data can be leveraged towards nuanced diagnosis and prognosis on various symptom subdomains, to either assist or verify clinicians' decision-making. Such prognosis can help with early intervention, clinical trial recruitment, and caregiver arrangements for patients.
4.7 Limitations
In this work we applied APANN primarily to cross-sectional datasets and a proof-of-concept longitudinal dataset. From the clinical perspective, it is crucial to note that the use of a specific clinical or cognitive test is subjective, contingent on availability, and associated with its own set of biases. Further, similar to clinical diagnosis, which uses several sources of information to create a composite of the patient's clinical profile, we envision the proposed MR-based prediction framework as another assistive instrument to be interpreted in the larger context of the overall clinical picture. We acknowledge that the cross-sectional experiments in this work are a first step towards building assistive MR-based models. We believe that the design flexibility of APANN can be utilized towards this goal. APANN can handle multi-modal input as well as multiple-scale predictions, which could minimize modality-specific and scale-specific biases, respectively. Large-scale models, such as APANN, subjected to high-dimensional input require significant computational resources. Thus we have limited the scope of this work to classical ANNs as a prototypical example to demonstrate the feasibility of large-scale analysis with structural neuroimaging data. Nevertheless, the training regimes discussed in this work should motivate further development of state-of-the-art neural network architectures, such as 3-dimensional convolutional networks, towards various neuroimaging applications. Another common drawback of models with deep architectures is the lack of interpretability of the model parameters compared to simpler models, which prohibits localizing the most predictive brain regions. In our view, this is a model design trade-off that in turn allows capturing the distributed changes often present in heterogeneous atrophy patterns in AD prodromes.
The computational flexibility of ANNs allows us to model the collective impact of these complex atrophy patterns and predict clinical performance more accurately.
4.8 Conclusion
The presented APANN model, together with empirical sampling procedures, offers a sophisticated machine-learning framework for high-dimensional, multimodal structural neuroimaging analysis. By going beyond low-dimensional, anatomical-prior-based feature sets, we can build more sensitive models capable of capturing subtle neuroanatomical changes associated with cognitive symptoms in AD. The results validate the strong predictive performance of the APANN model across two independent cohorts, as well as its robustness when the two cohorts are combined. From a clinical standpoint, these attributes make APANN a promising approach towards building diagnostic and prognostic tools that would help identify at-risk individuals and provide clinical trajectory assessments, facilitating early intervention and treatment planning.
4.9 Acknowledgements
NB receives support from the Alzheimer's Society of Canada. MMC is funded by the Weston Brain Institute, the Alzheimer's Society, the Michael J. Fox Foundation for Parkinson's Research, the Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and the Fondation de Recherches Santé Québec. ANV is funded by the Canadian Institutes of Health Research, the Ontario Mental Health Foundation, the Brain and Behavior Research Foundation, and the National Institute of Mental Health (R01MH099167 and R01MH102324). Computations were performed on the GPC supercomputer at the SciNet HPC Consortium [234] and the Kimel Family Translational Imaging-Genetics Research (TIGR) Lab computing cluster. SciNet is funded by the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; the Ontario Research Fund - Research Excellence; and the University of Toronto. The TIGR Lab cluster is funded by the Canada Foundation for Innovation, Research Hospital Fund. ADNI Acknowledgments: Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research Development, LLC.; Johnson & Johnson Pharmaceutical Research Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for NeuroImaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.
4.10 Supplementary Material
S4.1 Performance comparison to reference models
Models
We compared the performance of the APANN model with three commonly used models: 1) linear regression with a Lasso regularizer (LR L1) [52, 369], 2) support vector regression (SVR) [91, 164, 354, 390], and 3) random forest regression (RFR) [146]. Separate instances of these baseline models were trained for the MMSE and ADAS-13 prediction tasks. Separate instances were also trained to compare the performance of each input, namely: 1) HC, 2) CT, and 3) HC+CT. The input features from each individual modality for the three baseline models were as follows:
• HC: 2 continuous variables representing left and right hippocampal volumes
• CT: 78 continuous variables representing thickness values based on AAL atlas ROIs [348]
The difference in input feature sets for the baseline models was prompted by the use of anatomically driven, low-dimensional features in many instances in the relevant literature [335, 389]. We also investigated high-dimensional HC and CT input choices (identical to those of the APANN model) for the baseline models; however, the baseline models considerably underperformed with high-dimensional input compared to the low-dimensional features. The input values for the LR L1 and SVR models were preprocessed with an additional step in which the data were mean centered and feature-wise scaled to unit variance. All baseline models were implemented using the scikit-learn toolbox (http://scikit-learn.org/stable/index.html).
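The extra preprocessing step for the LR L1 and SVR baselines can be sketched as below: each feature is mean centered and scaled to unit variance, with the statistics estimated on the training set only and reused on the test set. The volumes here are illustrative toy values.

```python
# Feature-wise standardization: mean 0, unit variance per feature, with
# training-set statistics applied unchanged to the test set.
from statistics import mean, pstdev

def fit_scaler(X_train):
    cols = list(zip(*X_train))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]

def transform(X, mu, sigma):
    return [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in X]

# Toy example with two features, e.g. left/right hippocampal volumes:
X_train = [[3.1, 2.9], [2.5, 2.4], [3.4, 3.2], [2.8, 2.7]]
X_test  = [[3.0, 2.6]]

mu, sigma = fit_scaler(X_train)
Z_train = transform(X_train, mu, sigma)
Z_test = transform(X_test, mu, sigma)   # test uses training statistics only

# Each standardized training feature now has mean 0 and unit variance.
print([abs(round(mean(c), 6)) for c in zip(*Z_train)])   # [0.0, 0.0]
```

In scikit-learn this corresponds to the standard scaling step placed before the estimator, fit on the training fold and applied to the held-out fold.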
Results
The correlation performance comparison between APANN and the reference models for both tasks and all three experiments is shown below in Fig. S4.1. All correlation and rmse values are also tabulated in Table 1 in the manuscript. The correlation performance of the baseline models with high-dimensional input (identical to that of the APANN model) is shown in Fig. S4.2.
Discussion
Results from all three experiments indicate that the APANN model offers better predictive performance with the HC and HC+CT inputs. Specifically for the HC input modality, we see substantial improvement with the APANN model utilizing voxel-wise information from segmented hippocampal masks. We note that this performance gain is attributed to 1) the added information in the voxel-wise input compared to two volumetric measures, and 2) the computational capacity of APANN to extract useful features from this voxel-wise input compared to the baseline models. As described earlier, the baseline models do show improved performance with the voxel-wise input; however, the gains are smaller compared to the APANN model. In comparison, the CT input modality, when used independently, does not offer improvement with the APANN model. For the ADAS-13 prediction task, the baseline models outperform the APANN model in Experiments 1 and 3, and offer similar performance in Experiment 2, when comparing performance with the CT input alone. However, the HC+CT input to the APANN model offers significantly higher performance improvement over the baseline models across all three experiments.