Spatio-Temporal Representations and Analysis of Brain Function from fMRI
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Firdaus Janoos, B.E., M.S.
Graduate Program in Computer Science
*****
The Ohio State University
2011
Dissertation Committee:
Prof. Raghu Machiraju, PhD, Adviser
Prof. Steffen Sammet, MD, PhD
Dr. István Ákos Mórocz, MD, PhD
Prof. Michael V. Knopp, MD, PhD
Prof. Lee Potter, PhD

© Copyright by
Firdaus Janoos
2011

ABSTRACT
Understanding the highly complex, spatially distributed and temporally organized phenomena entailed by mental processes using functional MRI is an important research problem in cognitive and clinical neuroscience. Classically, the analysis of functional Magnetic Resonance Imaging (fMRI) has focused either on the creation of static maps localizing the metabolic fingerprints of neural processes or on studying their temporal evolution in a few pre-selected regions in the human brain. However, it is widely acknowledged that cognition recruits the entire brain and that the underlying mental processes are fundamentally spatio-temporal in nature. By neglecting either the temporal dimension or the spatial entirety of brain function, such methods must necessarily compromise on extracting and representing all the information contained in the data.
In this thesis, I present new paradigms and an accompanying suite of tools to facilitate a time-resolved exploration of mental processes as captured by fMRI. The first part of the thesis describes a method for visualizing the metabolic activity recorded in the data and a method for studying the timing differences in the recruitment of the different functional modules during a task. In the next part, a state-space formalism is used to model the brain transitioning through a sequence of mental states as it solves a task, enabling study of the spatial distribution of activity along with its temporal structure. Efficient algorithms for estimating the parameters, state-sequence and the hemodynamic behavior of the brain have been developed. In addition to revealing the mental patterns of an individual subject, such a generative model enables comparing mental processes between subjects in their entirety, not just as spatial activation maps.
The methods developed here were applied to fMRI studies of developmental disorders such as dyslexia and dyscalculia (i.e., math learning disability) and of visuo-spatial working memory. I show the types of inferences possible with these methods in analyzing and differentiating mental capabilities and the neuro-scientific conclusions that they provide.
This thesis is dedicated to my parents for their unconditional love and unswerving support
during this long and sometimes arduous journey.
ACKNOWLEDGMENTS
I would like to thank Raghu for his many years of guidance – intellectual and philosophical – for his pragmatic wisdom, his forbearance, his friendship and his genuine solicitude. I am grateful to Steffen for sharing his deep knowledge of MRI and radiology and for his infectious and friendly spirit. I owe special gratitude to Pisti for continuously reminding me to keep the big picture in mind and not to develop algorithms for algorithms’ sake, for his wild but brilliant visions, and most of all for his warmth and humanism.
Getting through graduate school would have been harder if it were not for the support of my friends, who are too many to list here. Among these, I am especially thankful to
Kishore, Okan and Shantanu for sharing this experience with me. Most of all, I have to thank Zeenat for giving me the impetus to finally graduate.
VITA
2001 ...... B.E. Computer Science, University of Pune, India
2009 ...... M.Sc. Computer Science, Ohio State University, USA
2009–present ...... PhD Candidate, Ohio State University, USA
PUBLICATIONS
Research Publications
F. Janoos, R. Machiraju and I.Á. Mórocz, “Spatio-temporal Models of Cognitive Processes with fMRI,” NeuroImage, In review.
T.K. Dey, F. Janoos and J.A. Levine, “Meshing interfaces of multi-label data with Delaunay refinement,” Engineering with Computers, In review.
F. Janoos, R. Machiraju, S. Sammet, M.V. Knopp and I.Á. Mórocz, “Unsupervised Learning of Brain States from fMRI Data,” Proceedings of the 13th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Vol. 6362, 201–208, 2010.
F. Janoos, M.O. Irfanoglu, O. Afacan, R. Machiraju, S.K. Warfield, L.L. Wald and I.Á. Mórocz, “Brain State Identification from fMRI Using Unsupervised Learning,” Proceedings of the 16th Annual Meeting of the Organization for Human Brain Mapping (OHBM), 2010.
O. Afacan, D.H. Brooks, F. Janoos, W.S. Hoge and I.Á. Mórocz, “Multi-shot high-speed 3D-EPI fMRI using GRAPPA and UNFOLD,” Proceedings of the 16th Annual Meeting of the Organization for Human Brain Mapping (OHBM), 2010.
C. Lehr, M.O. Irfanoglu, F. Janoos, M.V. Knopp and S. Sammet, “Disease Progression in Multiple Sclerosis: Correlations to Diffusion Tensor Imaging Features,” Proceedings of the 18th Annual Meeting of the International Society for Magnetic Resonance in Medicine (ISMRM), 2010.
M.O. Irfanoglu, R. Machiraju, F. Janoos, M.V. Knopp and S. Sammet, “Effect of Gradient Resolution in Diffusion Tensor Imaging on the Appearance of Multiple Sclerosis Lesions at 3T,” Proceedings of the 18th Annual Meeting of the International Society for Magnetic Resonance in Medicine (ISMRM), 2010.
F. Janoos, R. Machiraju, S. Sammet, M.V. Knopp, S.K. Warfield and I.Á. Mórocz, “Measuring Effects of Latency in Brain Activity with fMRI,” Proceedings of the IEEE Symposium on Biomedical Imaging (ISBI), 2010.
K. Mosaliganti, F. Janoos, A. Gelas, R. Noche, N. Obholzer, R. Machiraju and S. Megason, “Anisotropic Plate Diffusion Filtering for Detection of Cell Membranes in 3D Microscopy Images,” Proceedings of the IEEE Symposium on Biomedical Imaging (ISBI), 2010.
F. Janoos, B. Nouansengsy, R. Machiraju, H.W. Shen, S. Sammet, M. Knopp and I.Á. Mórocz, “Visual Analysis of Brain Activity from fMRI Data,” Computer Graphics Forum, Vol. 28(3), 903–910, June 2009.
K. Mosaliganti, F. Janoos, O. Irfanoglu, R. Ridgway, R. Machiraju, K. Huang, J. Saltz, G. Leone and M. Ostrowski, “Tensor Classification of N-point Correlation Function Features for Histology Tissue Segmentation,” Medical Image Analysis, Vol. 13(1), 156–166, Feb. 2009.
F. Janoos, K. Mosaliganti, X. Xu, R. Machiraju and S.T.C. Wong, “Robust 3D Reconstruction and Identification of Dendritic Spines from Optical Microscopy Imaging,” Medical Image Analysis, Vol. 13(1), 167–179, Feb. 2009.
F. Janoos, B. Nouansengsy, X. Xu, R. Machiraju, K. Huang and S.T.C. Wong, “Classification and Uncertainty Visualization of Dendritic Spines from Optical Microscopy Imaging,” Computer Graphics Forum, Vol. 27(3), 879–886, Sep. 2008.
K. Mosaliganti, F. Janoos, R. Sharp, R. Ridgway, R. Machiraju, K. Huang, P. Wenzel, A. de Bruin, G. Leone and J. Saltz, “Detection and Visualization of Surface-Pockets to enable Phenotyping Studies,” IEEE Transactions on Medical Imaging, Vol. 26(9), 1283–90, Sep. 2007.
K. Mosaliganti, J. Chen, F. Janoos, R. Machiraju, W. Xia, X. Xu and K. Huang, “Automated Quantification of Colony Growth in Clonogenic Assays,” Proceedings of Medical Image Analysis with Applications in Biology (MIAAB), 2007.
F. Janoos, S. Singh, O. Irfanoglu, R. Machiraju and R. Parent, “Activity Analysis Using Spatio-Temporal Trajectory Volumes in Surveillance Applications,” Proceedings of IEEE Symposium on Visual Analytics Science and Technology (VAST), 3–10, Nov. 2007.
F. Janoos, O. Irfanoglu, K. Mosaliganti, R. Machiraju, K. Huang, P. Wenzel, A. de Bruin and G. Leone, “Histology Image Segmentation using the N-Point Correlation Functions,” Proceedings of the IEEE Symposium on Biomedical Imaging (ISBI), 300–303, Apr. 2007.
K. Mosaliganti, F. Janoos, X. Xu, R. Machiraju, K. Huang and S.T.C. Wong, “Temporal Matching of Dendritic Spines in Confocal Microscopy Images of Neuronal Tissue Sections,” Proceedings of the 9th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 106–113, 2006.
F. Janoos, R. Machiraju, R. Parent, J.W. Davis and A. Murray, “Sensor Orientation for Coverage Optimization for Surveillance Applications,” Proceedings of the IS&T/SPIE Symposium on Electronic Imaging, Vol. 6491, 1–12, Jan. 2007.
Instructional Publications
F. Janoos, R. Machiraju, S. Singh and I.Á. Mórocz, “Spatio-temporal Representations and Decoding Cognitive Processes from fMRI,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR19, 2010.
F. Janoos, R. Machiraju, S. Sammet, I.Á. Mórocz, M.V. Knopp and S.K. Warfield, “Linear Models for fMRI with Varying Hemodynamics,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR20, 2010.
F. Janoos, O. Irfanoglu, R. Machiraju and I.Á. Mórocz, “Visualizing Brain Activity from fMRI Data,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR21, 2010.
T.K. Dey, F. Janoos and J.A. Levine, “Meshing interfaces of multi-label data with Delaunay refinement,” Ohio State Univ. Tech. Report OSU-CISRC-8/09-TR40, 2008.
FIELDS OF STUDY
Major Field: Computer Science and Engineering
Studies in:
Medical Image Analysis: Prof. Raghu Machiraju
Machine Learning: Prof. Yoonkyung Lee
Computer Graphics: Prof. Tamal K. Dey
TABLE OF CONTENTS
Page
Abstract ...... ii
Dedication ...... iv
Acknowledgments ...... v
Vita ...... vi
List of Tables ...... xvii
List of Figures ...... xviii
List of Algorithms ...... xxi
Introduction ...... 1
1 Background of Problem ...... 1
2 Research Statement ...... 5
3 Outline of Solution ...... 6
4 Organization of Thesis ...... 8
I Background 10
Chapters:
1. Background: fMRI Principles ...... 11
1.1 Nuclear Magnetic Resonance ...... 12
1.2 Magnetic Resonance Imaging ...... 14
1.3 Functional Magnetic Resonance Imaging ...... 16
1.3.1 The BOLD Contrast ...... 17
1.3.2 Relationship between BOLD and Physiology ...... 18
1.3.3 fMRI Noise ...... 20
2. Background: fMRI Methods ...... 22
2.1 Neuroscience Principles ...... 23
2.2 fMRI Methods Taxonomy ...... 25
2.3 Pre-processing ...... 25
2.3.1 Motion Correction ...... 25
2.3.2 Distortion Correction ...... 27
2.3.3 De-noising and Drift Estimation ...... 27
2.3.4 Inter–subject Registration ...... 28
2.3.5 Hemodynamic Response Modeling and Estimation ...... 29
2.4 Functional Specialization ...... 29
2.4.1 Unsupervised ...... 30
2.4.2 Supervised ...... 31
2.4.3 Semi-supervised ...... 35
2.5 Functional Integration ...... 37
2.5.1 Functional Connectivity ...... 37
2.5.2 Effective Connectivity ...... 38
2.6 Functional Representation ...... 40
2.6.1 Multivariate Pattern Recognition ...... 41
2.6.2 Multivariate Linear Models ...... 44
2.7 Summary ...... 44
3. Background: Neuroscientific Setting ...... 45
3.1 Visuo–Spatial Working Memory ...... 46
3.1.1 Data-set ...... 47
3.2 Mental Arithmetic ...... 48
3.2.1 Dyscalculia ...... 50
3.2.2 Dyslexia ...... 51
3.2.3 Data-set ...... 53
II Functional Chronometry 56
4. Mental Chronometry: Theory ...... 57
4.1 Significance ...... 57
4.2 Mental Chronometry with BOLD ...... 59
4.3 Chronometry Methods ...... 62
5. Mental Chronometry: A Visual Analysis ...... 64
5.1 Outline of Solution ...... 65
5.1.1 Challenges ...... 65
5.1.2 Proposed Solution ...... 67
5.2 Related Work ...... 68
5.3 VOI Selection ...... 70
5.3.1 Distance Metric ...... 71
5.3.2 Hierarchical Clustering ...... 76
5.4 User Interaction ...... 78
5.5 Results ...... 80
5.6 Conclusion ...... 81
6. Mental Chronometry: Measuring Latency in Brain Activity ...... 86
6.1 Outline of Solution ...... 87
6.1.1 Motivation ...... 87
6.1.2 Proposed Solution ...... 87
6.2 Method ...... 88
6.2.1 Robust Estimation of Latency ...... 89
6.2.2 Parametric Effects with Factorial Designs ...... 90
6.2.3 Hyper-parameter Selection ...... 92
6.3 Results ...... 93
6.3.1 Simulated Data ...... 93
6.3.2 Latency Analysis for VSWM Task ...... 94
6.4 Conclusion ...... 99
III Spatio-temporal Representations for Cognitive Processes 100
7. Spatio-temporal Representations: Theory ...... 101
7.1 Functional Representation ...... 102
7.2 Supervised, Unsupervised and Semi-supervised ...... 105
7.3 Pattern Recognition vs. Linear Models ...... 106
7.3.1 Multivariate Pattern Recognition (MVPR) ...... 106
7.3.2 Multivariate Linear Models (MVLM) ...... 107
7.4 Motivation ...... 108
8. Brain–States: Investigating Spatio–Temporal Patterns in the Data ...... 111
8.1 Inspiration ...... 112
8.2 Method ...... 113
8.3 Results ...... 115
8.4 Conclusion ...... 115
9. Brain–States: The Notion of Functional Distance ...... 118
9.1 Functional Distance ...... 119
9.2 Functional Networks ...... 121
9.3 Computing the Functional Distance ...... 123
9.4 The Diffusion Distance ...... 127
9.4.1 Hierarchical Clustering ...... 129
9.5 Results ...... 129
9.6 Conclusion ...... 133
10. Feature–Space: A Linear Embedding of the Functional Distance ...... 135
10.1 Motivation ...... 137
10.1.1 Other Feature–Spaces in fMRI ...... 138
10.2 Feature–Space ...... 139
10.2.1 Cost–Function ...... 139
10.2.2 Orthogonal Basis for the Feature–Space ...... 140
10.3 Linear Approximation for Functional Distance ...... 142
10.4 Feature Selection ...... 145
10.5 Evaluation of Feature–Space ...... 146
10.6 Conclusion ...... 149
11. State–Space Models : Towards a Spatio-Temporal Representation ...... 151
11.1 Related Work ...... 153
11.2 State–space Model ...... 154
11.3 Model Estimation ...... 159
11.3.1 Parameter Estimation ...... 159
11.3.2 HRF Marginalization ...... 161
11.3.3 Optimal State–Sequence Estimation ...... 162
11.4 Model Size Selection ...... 162
11.5 Results ...... 164
11.5.1 Simulation ...... 164
11.5.2 Mental Arithmetic ...... 167
11.6 Conclusion ...... 174
12. State–Space Models : A Semi–Supervised Approach ...... 177
12.1 The State–Space Model ...... 182
12.1.1 Feature–Space Transform ...... 185
12.2 Parameter Estimation ...... 185
12.2.1 E-Step ...... 186
12.2.2 M-Step ...... 188
12.2.3 Spatial Activation Maps ...... 194
12.3 Estimating the Optimal State–Sequence ...... 194
12.4 Model Hyper-Parameter Selection ...... 196
12.5 Results ...... 198
12.5.1 Simulation ...... 199
12.5.2 Data-Set 1: Visuo–Spatial Motor Task ...... 203
12.5.3 Data-Set 2: Mental Arithmetical Task ...... 210
12.6 Conclusion ...... 225
Epilogue ...... 229
Appendices: 233
A. Proofs for Activation Onset Latency Estimator ...... 233
B. Functional Connectivity Estimation ...... 235
B.1 Hierarchical Agglomerative Clustering ...... 235
B.2 Shrinkage ...... 236
B.3 Voxel-wise Correlations ...... 237
C. Construction of the Feature–Space ...... 238
C.1 Orthogonal Partitioning of F ...... 238
C.2 Primal and Dual Formulations ...... 239
C.2.1 Augmented Formulation ...... 240
C.3 Proofs for the Linear Approximation ...... 241
D. Proofs for Unsupervised State–Space Model ...... 246
D.1 Expectation Maximization ...... 247
D.2 Forward Backward Recursions ...... 250
D.3 Marginalizing the HRF Filter h ...... 253
D.4 State–Sequence Estimation ...... 254
D.5 Estimation in Full vs. Reduced Models ...... 255
D.6 Mutual Information ...... 256
E. Proofs for Semi–Supervised State–Space Model ...... 257
E.1 Proofs for the E-Step ...... 257
E.2 Proofs for the M-Step ...... 261
E.2.1 Estimating State Transition Parameters w and Missing Stimulus u ...... 261
E.2.2 Estimating Emission Parameters ϑ ...... 265
E.2.3 Estimating Hemodynamic and Noise Parameters h, Σ ...... 267
E.3 Estimating the Optimal State–Sequence x∗ ...... 269
E.4 Error–rate and Mutual Information ...... 270
References 271
LIST OF TABLES
Table Page
9.1 Effect of Gaussian Smoothing on Number of Clusters ...... 123
10.1 Notation for Linear Embedding ...... 136
11.1 Notation for Unsupervised State–Space Model ...... 158
11.2 Effect of SNR on Running–Time and Error–Rates ...... 166
11.3 Results for Mental Arithmetic Data–set ...... 170
12.1 Notation for Semi–Supervised State–Space Model ...... 181
12.2 Effect of SNR on Estimation Error ...... 202
12.3 Effect of Shrinkage on Estimation and Prediction Errors ...... 203
12.4 Prediction Error versus Different HRF Models ...... 217
12.5 Overall Results for Mental Arithmetic Data-Set ...... 220
12.6 Assessment of Separation between Groups for Mental Arithmetic Task . . . 223
LIST OF FIGURES
Figure Page
1.1 Precession of the Nuclear Magnetic Moment ...... 13
1.2 RF Excitation ...... 14
1.3 MR k-Space Acquisition ...... 16
1.4 BOLD Response ...... 19
2.1 Taxonomy of fMRI Methods ...... 26
2.2 GLM Analysis Pipeline ...... 34
2.3 Dynamic Causal Models ...... 39
3.1 Visuo–Spatial Working Memory Task ...... 48
3.2 Mental Arithmetic Task ...... 54
4.1 Cascade of Functional Recruitment with fMRI ...... 61
5.1 Overview of Visual Analytics Tool ...... 68
5.2 The Raw fMRI Time–Courses ...... 72
5.3 HR Curves and SVD Spectrum ...... 76
5.4 User Interface of the Visual Analytics Tool ...... 83
5.5 Visual Confirmation of SPM Results ...... 84
5.6 Visual Analysis of Recruitment Cascade ...... 85
6.1 MSE of Regularized Estimator ...... 94
6.2 Quantitative Evaluation of Parametric Effects ...... 95
6.3 Activation Maps for VSWM Task ...... 96
6.4 Parametric Effect on Latency ...... 97
6.5 Parametric Effects of Latency and Amplitude ...... 98
7.1 Representation of Visual Categories in the Ventral Temporal Cortex . . . . 103
7.2 Predicting Spatial Maps for Given Stimulus Words ...... 104
8.1 EEG Microstates ...... 112
8.2 Brain–State Labels ...... 116
9.1 Conceptual Illustration of the Functional Distance ...... 122
9.2 Regularized Correlation Estimator ...... 124
9.3 Functional Distance Approximation Error ...... 130
9.4 Brain–State Sequences ...... 131
9.5 Spatial Activation Maps of Brain–States ...... 132
10.1 Computation of Feature–Space ...... 147
10.2 Dimensionality Reduction ...... 148
10.3 Approximation Quality ...... 149
10.4 Cross-Correlations in Feature–Space ...... 150
11.1 The Full state–space Model (Unsupervised) ...... 154
11.2 The Reduced State–Space Model (Unsupervised) ...... 157
11.3 Computation of Pair–Wise MI ...... 169
11.4 Phase–wise Error Rates and MI ...... 171
11.5 Multidimensional Scaling (MDS) Plots for Mental Arithmetic Study . . . . 173
11.6 Spatial Maps for Mental Arithmetic Task ...... 176
12.1 Outline of the Method for the Semi–supervised State–Space Model. . . . . 180
12.2 The Semi–Supervised State–Space Model ...... 183
12.3 Simulation Results ...... 201
12.4 Spatial Maps for Visuo-Motor Task ...... 205
12.5 Error Rates for Visuo-Motor Task ...... 206
12.6 Brain–State Probabilities ...... 208
12.7 Effect of Model–Size on Prediction Error ...... 209
12.8 Estimated Hemodynamic Response ...... 210
12.9 Comparative Analysis for Mental Arithmetic Data-Set ...... 213
12.10 Effect of Hyper–Parameters on ERRSSM:PH ...... 215
12.11 Estimated HRF FIR filter for Mental Arithmetic Data-Set ...... 218
12.12 Spatial Maps for Mental Arithmetic ...... 219
12.13 Prediction Error for Mental Arithmetic Task ...... 221
12.14 Multidimensional Scaling (MDS) Plots for Mental Arithmetic Study ...... 222
LIST OF ALGORITHMS
5.1 Hierarchical Clustering Algorithm ...... 77
5.2 Octree Clustering Algorithm ...... 78
9.1 Recursive Approximation of the Functional Distance ...... 125
10.1 Construction of Orthogonal Basis Functions ...... 141
B.1 Hierarchical Agglomerative Clustering ...... 235
INTRODUCTION
Begin at the beginning and go on till you come to the end, then stop.
Lewis Carroll (1832–1898), Alice’s Adventures in Wonderland.
This thesis describes a new paradigm for analyzing fMRI data that preserves the informa- tion contained in the temporal dimension of mental processes. The Introduction opens by presenting the larger scientific context of the research problem in Section 1, followed by the specific research statement in Section 2 and the proposed solutions in Section 3. Finally, the organization of the rest of this book is laid out in Section 4.
1 Background of Problem
Even though the human mind has been the subject of philosophical and scientific study throughout the ages, many aspects of brain function are still unknown. Perception, informa- tion processing and decision making are only a few of the cognitive capacities intensively investigated in today’s neuroscience, which seeks to understand the relationship between the mind and the brain. Our cognitive capacities rely on synergistic activities of large neural populations distributed throughout the brain. Therefore understanding the mind not only requires a comprehension of the workings of low–level neural networks but also demands
a detailed map of the brain’s functional architecture and a description of the large–scale connections between populations of neurons and insights into how relations between these simpler networks give rise to higher–level thought [111].
Until recently, systems neuroscience relied on the single micro-electrode technique to measure the action potentials produced by an isolated neuron or a small assembly of neurons [78]. Although very useful in characterizing the small–scale behavior of individual neural networks, the method clearly falls short of providing information on spatio–temporal cooperativeness and on the global, associational operations performed by these neural networks. The development of non-invasive neuroimaging technologies opened new possibilities to study brain physiology in vivo at this macroscopic / integrationist level.
Since its development in 1992, functional magnetic resonance imaging (fMRI) has become one of the most popular functional neuroimaging tools due to its unprecedented spatial resolution at temporal resolutions measured in seconds. The method is based on nuclear magnetic resonance (NMR) signal changes due to hemodynamic and metabolic responses at the sites of neural synaptic activity induced by external and internal stimuli to the brain [139]. Inspired by the concepts of functional specialization and functional integration [69], the primary application of fMRI has been the localization of the different functional domains of the human perceptual and cognitive apparatus in the brain and elucidating inter–connectivity structures [107]. In addition to these, an emergent theme has been to reveal the representation, i.e. encoding, of mental percepts, affects and concepts of the subject in the recorded data, with a view of understanding the “neural code” [169]. Multivariate pattern analysis / pattern recognition (MVPA / MVPR) has been widely used to explore these encodings of different mental phenomena in the spatially distributed patterns of metabolic activity recorded by fMRI [176].
Methods for fMRI analysis have concentrated on creating static pictures of the foci of activity, of the interconnected networks, or of the distributed neural representations. This paradigm of studying brain physiology is starkly inadequate to explain the fundamentally dynamic relationships between the neural substrates involved in mental processes [33], which are spatially distributed, temporally transient and occur at multiple scales of space and time. One classical argument for this static approach is that the hemodynamic response, which has a lag of 4–12s and which depends on the complex interaction of several metabolic and vascular parameters [140], is orders of magnitude slower than neural events that occur in the 10–100ms time range. However, there are many high–level mental processes, such as attending, learning and task–solving, along with oscillations in the default–state networks, that occur at time scales accessible to modern-day fMRI [62]. In addition, there is increasing evidence that the hemodynamic response may not be as sluggish as long assumed: capillaries and arterioles change their diameter within a fraction of a second following neuronal activity in their neighbourhood [175]. These rapid effects may now be measurable with the advent of high–speed fMRI with scan rates < 1Hz [2]. The argument for static analysis methods is becoming increasingly untenable.
One important application of a time–resolved analysis is to understand the ordering of information processing in the various functional nodes, in order to better understand their function in terms of an anatomical wiring diagram [88]. There are a few methods reported in the literature that recognize the importance of temporality in fMRI [59, 61, 98, 156, 164, 204]. However, they make significant compromises in terms of the spatio–temporal scales and physiological assumptions under which these phenomena are investigated.
For example, methods for mental chronometry (cf. Chapter 4) measure the relative timing of activity at each voxel independently, neglecting the interactions between them and thereby reducing fMRI to a collection of single–unit recordings [61, 103, 156, 190]. More sophisticated models treating the brain response as a collection of processes have also been developed, but again resort to single–voxel analysis [59, 98]. On the other hand, methods to study the dynamics of interaction between regions, such as dynamic causal models (DCM) and dynamic Bayesian networks (DBN) (cf. Section 2.5), suffer from enormous computational complexity and have to be restricted to a few (∼ 10) pre–selected regions–of–interest (ROI) [204, 227]. While this approach is feasible for studying low–level functions such as early–stage visual perception, it does not scale well towards revealing the integration of the multiplicity of functional modules involved in cognition. Multivariate methods that attempt spatio–temporal analysis – by simultaneously representing the spatial interactions and temporal evolution – are fundamentally limited to block–design experiments where the subjects are exposed to a very simple and limited set of alternatives [159, 164], which are not representative of natural perception and which may fail to engage the higher–level aspects of human cognition.
This inability of fMRI analysis methods to reveal the role of time is becoming increasingly relevant in the study of neurological and psychiatric disorders like dementia, schizophrenia, autism, multiple sclerosis, etc. [28, 37], or common learning disabilities like dyscalculia or dyslexia [181], where group-level inference of spatial activity–maps has been inconclusive due to the lack of clear differences between populations. This is mainly because inter–subject comparisons are usually made by spatially normalizing the data to a common anatomical space. However, there are fundamental differences between the anatomies of subjects, and they cannot be easily aligned into a single coordinate system [207]. This is especially the case in the presence of neuropathologies, where significant anatomical changes confound the determination of functional differences [211]. Secondly, the mapping between structure and function is not rigid across subjects, and moreover the low spatial resolution of fMRI precludes the possibility of finding accurate spatial correspondences.
Furthermore, there is growing acknowledgement of the neurophysiological fact that the similarities and differences between the brain function of subjects may not reside in the spatial layout of the mental activity but more so in its temporal organization [88, 92].
2 Research Statement
In this thesis, I shall present methodological contributions that enable investigation of the following neurophysiological questions:
Mental Chronometry Is the chronoarchitecture [10] of the cerebral cortex organized with respect to its functional architecture? Can this organization be identified and tested for specific hypotheses? Can it account for the fact that the hemodynamic response of the brain is unknown and is spatially and temporally non–stationary?
Spatio-temporal Representations Do fMRI data contain information about mental processes? Is there a spatio–temporal representation of this information, at a multiplicity of scales of space and time, that can be used to enrich the study of brain function under differing mental tasks and mental deficits? Can this representation be arrived at in a computationally tractable fashion?
3 Outline of Solution
In order to study the chronoarchitecture of the human brain, I developed two complementary approaches:
i An exploratory method to aid in the examination of the time–series data and its temporal characteristics with a minimal amount of pre-processing and data manipulation, as an analog to the visual inspection of EEG and MEG recordings for “interesting events” [8, 78]. This is posed as a volume-of-interest (VOI) selection problem, where the method automatically determines a set of candidate VOIs that exhibit coherent activity with respect to the experimental task. The investigator is then able to select and compare the time–series data of VOIs that are relevant to the task and its expected neural recruitments [105].
ii A confirmatory method based on a general linear model (GLM) of the fMRI signal that estimates the experimental effects not only on the amplitude of the hemodynamic response but also on its latency. The result of this method is a map of activation latency under different task conditions, similar to that of activation amplitude [103]. This method improves on alternative approaches by proposing a more stable estimator of latency and makes allowances for spatially and temporally variable hemodynamics.
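To make the latency idea concrete, the sketch below illustrates a common baseline for GLM-based latency mapping (not the estimator developed in Chapter 6): each voxel's time course is regressed on a canonical hemodynamic response and its temporal derivative, and the first-order expansion h(t − δ) ≈ h(t) − δh′(t) turns the coefficient ratio into an approximate latency shift. The double-gamma shape parameters, sampling grid, and stimulus timing are illustrative assumptions, not the thesis's settings.

```python
import math
import numpy as np

def double_gamma_hrf(t):
    """Canonical double-gamma HRF (SPM-like shape; illustrative parameters)."""
    def gpdf(x, shape, scale=1.0):
        x = np.maximum(x, 1e-12)  # no response before stimulus onset
        return x ** (shape - 1) * np.exp(-x / scale) / (math.gamma(shape) * scale ** shape)
    h = gpdf(t, 6.0) - gpdf(t, 16.0) / 6.0  # main peak minus undershoot
    return h / np.max(h)

def latency_via_derivative(y, stim, dt=1.0):
    """Fit y ~ b1*(stim*h) + b2*(stim*h') + c and return the latency shift.

    By the Taylor expansion h(t - delta) ~ h(t) - delta*h'(t), the
    shift in seconds is approximately -b2/b1.
    """
    t = np.arange(0.0, 32.0, dt)
    h = double_gamma_hrf(t)
    dh = np.gradient(h, dt)
    X = np.column_stack([
        np.convolve(stim, h)[: len(y)],   # canonical regressor
        np.convolve(stim, dh)[: len(y)],  # temporal-derivative regressor
        np.ones(len(y)),                  # baseline
    ])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return -b[1] / b[0]
```

This single-voxel estimator becomes unstable when the canonical coefficient is small or the shift is large relative to the HRF width, which is precisely the motivation for the more stable, regularized estimator of Chapter 6.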
While these approaches are complementary to classical methods of functional localization, they only partially capture the spatio–temporal signatures of neural activity. To build a more descriptive representation of spatially varying and temporally recurring neural phenomena, the paradigm of the functional brain stepping through a mental state–space as it performs a task was invoked. This concept of fMRI as a recording of a sequence of “brain–states” was developed and refined through the following iterations:
i In an initial exploration of this concept, I used an ICA decomposition with hierarchical clustering to demonstrate that the fMRI data were organized, intrinsically, with respect to the mental states of the subject [101].
ii After this initial proof–of–concept, I proposed a measure of the functional distance between two fMRI volumes of the same subject that quantified the similarity of their activation patterns by the “transport” of activity over the functional networks of the brain [102].
iii I then modeled the functional brain transitioning through a mental state–space by a multivariate hidden Markov chain. Parameters were estimated from the fMRI data without reference to experimental conditions using Monte–Carlo methods [104]. The model not only discovered the spatial distribution and timing of neural recruitment in each subject, but moreover was able to contrast the spatio–temporal patterns arising from the underlying neural processes between two populations.
iv The model was augmented to allow a partial set of stimuli to be given as input, resulting in a semi–supervised framework. Its motivation was to let the stimuli guide the estimation towards a model more relevant to the investigator in a non-convex optimization landscape marred by multiple local minima, without precluding discovery of new and un–hypothesized patterns in the data.
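The state–space abstraction behind these iterations can be illustrated with a deliberately minimal example: a hidden Markov model with a few "brain states" and scalar Gaussian emissions, decoded by the Viterbi algorithm. This is only a toy sketch of the formalism; the actual models of Chapters 11 and 12 are multivariate, account for hemodynamic convolution, and are estimated with Monte–Carlo and EM methods. All parameter values here are invented for illustration.

```python
import numpy as np

def viterbi_gaussian(y, pi, A, means, var):
    """Most likely hidden state sequence for an HMM with scalar Gaussian emissions.

    y: observations (T,); pi: initial state probabilities (K,);
    A: transition matrix (K, K); means: per-state emission means (K,).
    Works in the log domain to avoid numerical underflow.
    """
    T, K = len(y), len(pi)
    logA = np.log(A)
    # emission log-likelihoods, up to an additive constant shared by all states
    logB = -0.5 * (y[:, None] - means[None, :]) ** 2 / var
    delta = np.log(pi) + logB[0]       # best log-score of paths ending in each state
    psi = np.zeros((T, K), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: best path ... i -> j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(K)] + logB[t]
    x = np.zeros(T, dtype=int)
    x[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):              # backtrack through the pointers
        x[t] = psi[t + 1, x[t + 1]]
    return x
```

Given two well-separated states and sticky transitions, the decoder recovers the block structure of a rest/task sequence even from noisy observations; the thesis's models play the same game with whole fMRI volumes, projected into a low-dimensional feature space, as the emissions.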
In contrast to other methods, the approaches developed here are not restricted to pre–selected VOIs but instead operate on data from the entire brain. Also, they can be used to study arbitrarily complex paradigms and higher–level cognitive functions. These methods were developed and applied in the context of larger neuroscientific investigations into the workings of visuo–spatial memory [30] in young children and in the study of arithmetical deficits in individuals suffering from dyscalculia and dyslexia.
4 Organization of Thesis
This book is organized into three self–contained parts.
Part I sets the context for the research problems dealt with in this thesis. Chapter 1 provides a brief background of the principles behind NMR, MRI, and blood oxygenation level dependent (BOLD) fMRI. Then, in Chapter 2, a taxonomy of fMRI analysis methods is presented. Finally, Chapter 3 gives a brief background of the neuroscientific problems addressed in the course of this dissertation.
Part II deals with the solutions for the problem of mental chronometry, which is defined in
Chapter 4. Then in Chapter 5, I talk about the exploratory visual analysis of the temporal
dimension of brain activity, followed by the GLM based estimation of activation latency in
Chapter 6.
Then in Part III, I present the spatio–temporal models of mental processes via the abstraction of brain–states. Chapter 7 provides a discussion of the concepts behind neural representation and decoding, the need for spatio–temporal models, and the salient issues concerning supervised vs. unsupervised vs. semi–supervised analysis techniques. After a report of the initial investigations about the information contained in fMRI about brain–states in Chapter 8, I describe the development of the distance measure in Chapter 9 and the low–dimensional linear embedding of the data based on this metric in Chapter 10.
The unsupervised state–space formalism is presented in Chapter 11 followed by the semi– supervised method in Chapter 12.
Finally, the Epilogue concludes this thesis with a summary of the research presented here and shares some thoughts on future directions.
PART I
Background
CHAPTER 1
BACKGROUND: FMRI PRINCIPLES
In this age of specialization men who thoroughly know one field are often incompetent to discuss another. The great problems of the relations between one and another aspect of human activity have for this reason been discussed less and less in public.
Richard Feynman (1918–1988), Remarks at a Caltech YMCA lunch forum, 1956.
The physical principle behind fMRI is nuclear magnetic resonance (NMR), which depends on the Zeeman effect [224]. The physics underlying this phenomenon was refined in the
1920s and 1930s, and practical instruments for measuring it were developed by Felix Bloch and Edward Purcell with coworkers in the 1940s. For these developments, Bloch and Purcell shared the Nobel prize in 1952. In work pioneered by Paul Lauterbur in the 1970s, methods for generating tomographic images of objects based on the NMR phenomenon were developed [128], leading to magnetic resonance imaging (MRI).
In the 1980s, MRI was established as an indispensable clinical diagnostic tool due to its ability to non-invasively produce high quality anatomical images of the human body. During the 1990s, an array of MRI techniques for studying human physiology was developed.
Examples of such techniques now available are MR angiography and arterial spin labeling
(ASL) for imaging blood vessels and blood flow, real-time cardiac imaging and perfusion measurements, diffusion tensor MRI for tracing white matter fibres, and functional MRI for mapping brain activity [107].
This chapter begins with a brief description of the theory of NMR in Section 1.1, and of its application to MRI in Section 1.2. The principles behind fMRI are then explained in
Section 1.3, starting with an overview of the blood oxygenation level dependent (BOLD) contrast mechanism in Section 1.3.1, an explanation of the neurobiological basis of the
BOLD contrast in Section 1.3.2 and a discussion of the noise and artifacts found in fMRI data in Section 1.3.3.
1.1 Nuclear Magnetic Resonance
The sub-atomic particles in an atomic nucleus, viz. protons and neutrons, possess a spin of ±1/2, imparting the nucleus a net magnetic moment. Of interest to NMR is hydrogen (1H), with one unpaired proton and a total nuclear spin of 1/2. Tomographic images are generated by measuring the spatial distribution of the magnetic moment of 1H, abundantly present in living tissue in the form of water.
When placed in a large external magnetic field (the B0 field), hydrogen nuclei align either parallel or anti-parallel with the direction of the magnetic field (cf. Fig. 1.1). The detectable signal that is produced at room temperature depends on the manipulation of the few parts per million protons aligned with the magnetic field (in the z direction of the B0 field). At the same time, the magnetization vector of the proton precesses at a frequency ω0 which
depends upon its gyromagnetic ratio γ, given by the Larmor equation ω0 = γB0. The gyromagnetic ratio is a nucleus-specific constant; for hydrogen, γ = 42.6 MHz/Tesla.
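As a quick numerical check of the Larmor relation ω0 = γB0, the precession frequency of hydrogen at common clinical field strengths can be computed in a few lines (an illustrative sketch, not part of any analysis pipeline described in this thesis):

```python
# Larmor precession frequency f0 = gamma * B0 for the 1H nucleus.
GAMMA_H = 42.6  # gyromagnetic ratio of hydrogen, MHz per Tesla

def larmor_mhz(b0_tesla):
    """Precession frequency (MHz) of 1H in a static field of b0_tesla Tesla."""
    return GAMMA_H * b0_tesla

for b0 in (1.5, 3.0, 7.0):
    print(f"B0 = {b0:.1f} T  ->  f0 = {larmor_mhz(b0):.1f} MHz")
```

At 3 Tesla this gives a resonance frequency of roughly 128 MHz, which is why the RF excitation discussed below lies in the radio-frequency band.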
Figure 1.1: Precession of the Nuclear Magnetic Moment. Hydrogen nuclei attain one of two different energy states when placed in a static magnetic field B0. The nucleus can be seen as a small bar magnet and in the lower energy state the bar magnet is aligned with B0 while the higher energy state corresponds to a counter–aligned magnet.
A pulse of radiation resonant with the precession frequency is applied to tip the small fraction of z-aligned protons by an angle of π/2, into the plane perpendicular to the magnetic field. This rotating magnetic moment then experiences a torque tending to restore it along the B0 field, as per the equations derived by Felix Bloch:
dMz/dt = −(Mz − M0)/T1 − γ(M × B)z   and   dMx,y/dt = −Mx,y/T2 − γ(M × B)x,y,   (1.1.1)
where M is the nuclear magnetization as a function of time, and M0 is the equilibrium magnetization in a steady and uniform field B0 in the z direction. The longitudinal relaxation time T1 gives the time it takes for the magnetization in the z direction to relax back to its equilibrium value M0, while the transverse relaxation time T2 gives the time for the azimuthal angles of the spins to get out of phase with one another. As the polarized protons
precess together and relax to their initial alignment, they produce electromagnetic radiation at a frequency proportional to the strength of the magnetic field, called the free induction decay (FID) signal.
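In the rotating frame (with the precession factored out), the Bloch equations (1.1.1) admit the familiar closed-form relaxation solutions Mz(t) = M0 (1 − e^(−t/T1)) and Mxy(t) = M0 e^(−t/T2) following a π/2 pulse. The following sketch evaluates them for illustrative grey-matter-like relaxation times (assumed values, for illustration only):

```python
import math

# Relaxation after a 90-degree pulse, in the rotating frame, using the
# closed-form solutions of the Bloch equations (1.1.1):
#   Mz(t)  = M0 * (1 - exp(-t / T1))    longitudinal recovery
#   Mxy(t) = M0 * exp(-t / T2)          transverse (FID) decay
M0 = 1.0
T1, T2 = 0.9, 0.1   # illustrative values in seconds (assumed, not measured)

def mz(t):
    return M0 * (1.0 - math.exp(-t / T1))

def mxy(t):
    return M0 * math.exp(-t / T2)

for t in (0.05, 0.1, 0.5, 2.0):
    print(f"t = {t:4.2f} s   Mz = {mz(t):.3f}   Mxy = {mxy(t):.3f}")
```

Because T2 is much shorter than T1, the transverse FID signal vanishes long before the longitudinal magnetization has fully recovered.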
Figure 1.2: Radio Frequency (RF) Excitation. There will be a small excess of hydrogen nuclei in the lower energy state and therefore a resultant magnetic vector pointing in the direction of B0. Energy can be supplied to the nuclei by applying a Radio Frequency (RF) pulse. The resultant magnetic vector is then tilted into the xy-plane and a current is induced in the receiver coil. Due to different relaxation processes, the xy-component of the magnetic vector, as well as the induced current in the receiver coil, will decay.
Variations in the molecular structure of biological substances can cause field inhomogeneities; the spins then experience different local magnetic fields and go out of phase as they precess, reducing the net FID signal. The time constant T2* measures the combined effect of random nuclei interactions and magnetic field inhomogeneities. It holds that T1 >> T2 > T2*.
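The ordering T2 > T2* follows from the standard decomposition of the observed transverse decay rate into an irreversible spin–spin term and a reversible term due to the local field inhomogeneity ΔB0 (a textbook NMR relation, stated here for completeness):

```latex
\frac{1}{T_2^{*}} \;=\; \frac{1}{T_2} \;+\; \frac{1}{T_2'},
\qquad\text{with}\qquad
\frac{1}{T_2'} \;\approx\; \gamma \,\Delta B_0 .
```

Since both rates on the right-hand side are positive, T2* is always shorter than T2, consistent with T1 >> T2 > T2*.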
1.2 Magnetic Resonance Imaging
The main concept of MRI is the spatial selection and localization of the NMR signal [128] through the use of magnetic gradients. As seen in the previous section, the resonant frequency ω of the nuclear spin system is dependent on its gyromagnetic ratio γ and the
magnetic field strength B experienced by it. The use of spatially varying magnetic gradient
fields G = (Gx,Gy,Gz) creates a different net magnetic field at every spatial location in
the sample, thereby changing its intrinsic Larmor frequency. When this frequency changes
linearly with position then the net measured signal becomes the Fourier transform of the
spin density of the sample.
Slice selection is achieved by applying a strong linear gradient in the slice direction z
during excitation with the B1 RF pulse. The slice select gradient Gz changes the Larmor frequency as a function of z coordinate. The RF pulse has a finite bandwidth and thereby excites only the z-extent with Larmor frequency within its bandwidth.
The phase–encode gradient is then switched on for a period of time τy, causing a phase
warping of the protons in the selected slice as a function of y position in space: ky = γGyτy.
After some time this gradient is switched off and the frequency encode gradients Gx are
applied and the signal is sampled. The gradient activity as a function of time on each
orthogonal axis is typically represented, along with RF activity, in the pulse sequence diagram
of Fig. 1.3. The final image is obtained by an inverse Fourier transform of the k-space data.
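The acquisition-and-reconstruction loop can be illustrated with a toy example: a synthetic spin density is forward Fourier-transformed to stand in for a fully sampled Cartesian k-space, then recovered by the inverse 2-D FFT (a sketch only; real reconstructions must additionally handle partial sampling, coil sensitivities and artifacts):

```python
import numpy as np

# Toy k-space reconstruction: on a fully sampled Cartesian grid the
# measured signal is the 2-D Fourier transform of the spin density, so
# the image is recovered with an inverse 2-D FFT.
spin_density = np.zeros((64, 64))
spin_density[20:44, 24:40] = 1.0          # rectangular "phantom"

kspace = np.fft.fft2(spin_density)        # what the scanner would sample
image = np.fft.ifft2(kspace)              # image reconstruction

print(bool(np.allclose(image.real, spin_density)))   # exact for full sampling
```

The grid spacing and extent of the sampled k-space correspond, respectively, to the field-of-view and resolution of the reconstructed image, as noted in the caption of Fig. 1.3.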
The time constants T1, T2 and T2* are tissue-type dependent, allowing delineation of different tissues, and are the main cause of the different types of contrast in clinical MRI. This effect comes from the governing equation of MRI (derived from eqn. 1.1.1)
S = S0 (1 − e^(−TR/T1)) e^(−TE/T2),      (1.2.1)
where S is the signal detected, S0 is the maximum detectable signal, proportional to B = B0 + ⟨G, x⟩ and to the spin density ρ(x). Here TE is the echo time between the B1
excitation and readout, and TR is the repeat time between one B1 excitation and the next.
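Eqn. (1.2.1) directly explains contrast weighting: a short TR and short TE make the signal sensitive mainly to T1, while a long TR and long TE make it sensitive mainly to T2. The sketch below evaluates the equation for two tissues with illustrative (assumed) relaxation times:

```python
import math

# Signal from eqn. (1.2.1): S = S0 * (1 - exp(-TR/T1)) * exp(-TE/T2),
# evaluated for two tissues with illustrative (assumed) relaxation times.
def signal(s0, t1, t2, tr, te):
    return s0 * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)

tissues = {"white matter": (600.0, 80.0),     # (T1, T2) in ms, assumed
           "CSF":          (4000.0, 2000.0)}

for name, (t1, t2) in tissues.items():
    s_t1w = signal(1.0, t1, t2, tr=500.0, te=15.0)    # short TR/TE: T1-weighted
    s_t2w = signal(1.0, t1, t2, tr=4000.0, te=100.0)  # long TR/TE: T2-weighted
    print(f"{name:12s}  T1w = {s_t1w:.2f}   T2w = {s_t2w:.2f}")
```

With these numbers, white matter appears brighter than CSF under the T1-weighted setting and darker under the T2-weighted one, reproducing the familiar clinical contrast inversion between the two sequences.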
Figure 1.3: MR k-Space Acquisition. On the left-hand side is a pulse sequence diagram showing the sequence of events on the three orthogonal gradient axes, the RF excitation and the acquisition window. The first action is the RF excitation along with a slice selection gradient Gz along the z direction, followed by a refocusing gradient. This is followed by the phase-encode gradient Gy along the y-axis and the readout pre-phasing gradient Gx along the x-axis. Finally, the readout (frequency-encode) gradient Gx is applied while the FID signal is measured. In order to fully encode a slice, this sequence is repeated for different amplitudes of the phase encode gradient. The right-hand side shows the corresponding trajectory through k-space. When the phase encode gradient is zero, the pre-phasing gradient moves to the leftmost point in k-space and the readout gradient sweeps across a line of k-space. When the phase encode gradient is applied, it shifts the readout line in k-space along the y direction by an amount Gy·τy. The spacing of the grid in k-space inversely affects the field-of-view, while the extent of the k-space is directly related to imaging resolution.
1.3 Functional Magnetic Resonance Imaging
fMRI has become increasingly popular for brain mapping and for studying neurophysiology because MRI scanners are commonly accessible and studies on healthy subjects can be performed without harmful side effects, making repeated studies of the same subject feasible. Furthermore, it offers unambiguous determination of the source of activation at a spatial resolution similar to that of positron emission tomography (PET) but at a greatly superior temporal resolution. Electroencephalography (EEG) and magnetoencephalography (MEG) offer better temporal resolution, but spatial localization is ambiguous and the intrinsic spatial resolution is very poor. Fundamentally, fMRI can never hope to match the temporal resolution of electrophysiological methods, because the method is based on
an indirect measurement of neural activity via changes in blood flow. However, it can be sufficiently fast to follow the hemodynamic response to a single synaptic event.
1.3.1 The BOLD Contrast
Although the first fMRI experiments used an exogenous gadolinium-based contrast agent, this technique was rapidly superseded by the discovery that de–oxyhemoglobin could be used as an endogenous contrast agent [171]. The origin of the blood oxygenation level dependent (BOLD) effect is that hemoglobin (Hb) is diamagnetic when oxygenated and paramagnetic when deoxygenated. The free electrons of iron in de-oxyhemoglobin alter the local magnetic susceptibility, creating magnetic field distortions within and around the blood vessels, and produce a slight alteration in the local T2* of a voxel.
The BOLD contrast measures the distribution of paramagnetic de–oxyhemoglobin content in a tissue by means of a T2*-weighted MRI sequence with single or multi–shot echo-planar imaging (EPI) methods [147]. Single–shot EPI can obtain an entire image volume with a single RF excitation, but suffers from geometric distortion artifacts due to the long readout times (cf. Section 1.3.3). Multi–shot EPI results in high quality images comparable to conventional MR images, but at slower acquisition rates. On the whole, EPI offers major advantages over conventional MR imaging, including reduced imaging time, decreased motion artifacts and the rapid imaging of physiologic processes.
1.3.2 Relationship between BOLD and Physiology
The BOLD signal is not directly related to electrical neuronal activity, but rather measures
the hemodynamic response to metabolic activity in the neural substrate and depends on a complex interaction of several metabolic and vascular parameters. Following metabolic activity in the brain tissue, oxygen is consumed to replenish the depleted stores of adenosine triphosphate (ATP) from glucose, causing a temporary increase in the amount of deoxyhemoglobin. In response to this, oxygenated blood is rushed to the metabolic site via capillaries, causing an increase in the regional cerebral blood flow (rCBF) greater than the regional cerebral metabolic rate of oxygen consumption, and as a result causing a reduction in the de–oxyhemoglobin fraction.
In healthy human subjects, the increase in rCBF is dominant over the other changes, with the consequence that increased neural activity leads to an increase in BOLD signal as measured by T2*-weighted imaging. The BOLD response to a short stimulus may show three phases, as illustrated in Fig. 1.4. After the stimulus, there may be a negative initial response that attains its minimum value at two to three seconds post-stimulus. This is followed by the main BOLD response that is conventionally used in fMRI experiments, with a time to peak of about five seconds and a response width with full-width half-maximum (FWHM) of roughly four seconds. Thereafter, a post-stimulus undershoot occurs, which may take up to a minute to return to baseline [140].
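The response shape described above is commonly modeled as a difference of two gamma densities, one peaking near five seconds and a later, scaled one producing the undershoot. The following sketch uses typical (assumed) parameter values, not the exact defaults of any particular software package:

```python
import math

# Double-gamma model of the BOLD impulse response: a gamma density
# peaking near 5 s minus a later, scaled gamma density that produces the
# post-stimulus undershoot. Parameter values are assumed for illustration.
def gamma_pdf(t, shape, scale=1.0):
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (
        math.gamma(shape) * scale ** shape)

def hrf(t, peak_shape=6.0, under_shape=16.0, under_ratio=1.0 / 6.0):
    return gamma_pdf(t, peak_shape) - under_ratio * gamma_pdf(t, under_shape)

values = [(t, hrf(t)) for t in range(0, 31)]
peak_t = max(values, key=lambda v: v[1])[0]
print("peak at t =", peak_t, "s; undershoot at 15 s:", hrf(15) < 0)
```

Sampling the function on a one-second grid places the main peak at five seconds and gives a negative (undershoot) value around fifteen seconds, matching the qualitative shape in Fig. 1.4.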
Currently, it is unknown which aspect of neural activity drives the hemodynamic response, and the exact relationship between them is unclear [139]. Experimental studies comparing electrophysiological measurements with BOLD and rCBF have found that the
Figure 1.4: Schematic showing the time course of the BOLD response to a short stimulus. The fast response has a negative peak at about two seconds post-stimulus, the main BOLD response peaks at about five seconds with an FWHM of about four seconds. The signal takes about a minute to return to baseline.
hemodynamic responses correlate better with local field potentials than with local spiking rates, suggesting that the hemodynamic response is driven by input synaptic activity rather than output spiking activity, in accordance with theoretical models of the energy consumption for neuronal signaling¹. In the cerebellum of anesthetized rats, the regional cerebral blood flow was shown to be proportional to the product of the frequency of stimulation and the strength of the evoked local field potential near a Purkinje cell [172]. Monkey studies have provided evidence that the fMRI signal is better correlated with the local field potential² than with multi–unit and single–neuron activity [139]. These studies seem to imply a close relationship of fMRI with input synaptic activity, which is the primary cause of local field potentials. Based on such a coupling, two parameters that characterize the fMRI signal, namely the amplitude of the signal intensity change and the time course of this change, have been used to derive detailed spatio–temporal information about
1. The primary expenditure of energy is to restore the ion gradients degraded during neural activation. The intracellular–extracellular sodium gradient is far from equilibrium, so pumping sodium against this gradient is a strongly uphill reaction in a thermodynamic sense. For this reason, the most costly aspect of neural activity is likely to be excitatory synaptic activity, in which glutamate opens sodium channels. The action of the sodium–potassium pump is thought to consume a large fraction of the ATP energy budget in the brain.
2. The local field potential is generated by extracellular currents that pass through the extracellular space in a closed loop. These voltage changes (in the µV range) are smaller than action potentials but last longer and extend over a larger area of neural tissue. It is a linear sum of current flows to and from intracellular and extracellular spaces.
the underlying neuronal events [172]. There is also evidence that inhibitory activity does not elicit a measurable BOLD response [127].
1.3.3 fMRI Noise
The BOLD signal in fMRI data is corrupted by many different noise sources, including:
Baseline Drift: The fMRI time–series data exhibit a low frequency drift (0.01–0.015 Hz), partially explained by a drift in the baseline magnetization of the scanner and by long-term excitation and spin saturation history [202].
Physiological Noise: The signal is also contaminated with artifacts from physiologic functions such as breathing or pulsating blood. Unless the sampling rate is fast enough (< 1 s per volume), they are aliased into the frequency bands occupied by the response to the experimental presentations. Also, spontaneous low-frequency fluctuations of arterial carbon dioxide have been shown to induce low-frequency BOLD signal variations [7]. The spatio–temporal signatures of physiologic processes confound the identification of the BOLD response to bona fide neural activity, impelling various compensatory strategies in pre–processing and analysis pipelines.
Random Noise: Even in the absence of an experimental effect, fMRI time–series exhibit serial autocorrelations with disproportionate spectral power at low frequencies, i.e., the spectrum is 1/f-like. From studies of cadavers and phantoms, it has been observed that colored noise arises even in the absence of physiological processes and must therefore be due to quantum effects [58].
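One simple stand-in for the drift-removal methods catalogued later (Section 2.3.3) is to fit and subtract a low-order polynomial trend from each voxel time series; the sketch below applies this to an assumed synthetic series, not real scanner data:

```python
import numpy as np

# Drift-removal sketch: fit and subtract a low-order polynomial trend
# from a synthetic voxel time series (TR = 2 s). A toy stand-in for the
# high-pass / detrending methods cited in the text.
rng = np.random.default_rng(42)
n, tr = 200, 2.0
t = np.arange(n) * tr
task = np.sin(2 * np.pi * t / 30.0)             # "task" component
drift = 5.0 + 0.01 * t                           # slow baseline drift
series = task + drift + 0.1 * rng.standard_normal(n)

coeffs = np.polyfit(t, series, deg=1)            # estimate the linear trend
detrended = series - np.polyval(coeffs, t)

print(bool(detrended.std() < series.std()))      # drift variance removed
```

A first-degree polynomial suffices here because the simulated drift is linear; in practice higher-order polynomials, cosine bases or wavelet shrinkage handle more general slow trends.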
Geometric Distortion: EPI sequences are particularly sensitive to the effects of magnetic
field inhomogeneities because of long readout times, leading to a miscalculation of voxel position. The effect is most severe in regions where air–filled sinuses border with bone or tissue, such as the frontal, occipital and temporal lobes, but it is also apparent to a lesser extent in other regions, and arises in the direction in which the acquisition time between adjacent points is greatest. This is the phase encoding direction, often along the anterior–posterior axis (and also along the inferior–superior axis for 3D-EPI). Distortions along the read–out direction (left–right axis) are negligible because the acquisition time between adjacent points is small [114].
Head Motion: Head motion is a significant problem in fMRI data sets [170]: slight movements of the head over the course of the fMRI study can lead to large signal changes in the image time–series, obscuring the subtle signal changes being studied. It manifests in the form of non-linear geometric distortions and non–uniformities in intensity.
CHAPTER 2
BACKGROUND: FMRI METHODS
Because we don’t understand the brain very well we’re constantly tempted to use the latest technology as a model for trying to understand it. In my child- hood we were always assured that the brain was a telephone switchboard. (What else could it be?) And I was amused to see that Sherrington, the great British neuroscientist, thought that the brain worked like a telegraph system. Freud often compared the brain to hydraulic and electromagnetic systems. Leibniz compared it to a mill, and now, obviously, the metaphor is the digi- tal computer.
John R. Searle (1932–).
This chapter opens with a discussion of the general principles of neuroscience investigated with fMRI in Section 2.1. After this, I define a taxonomy of fMRI analysis methods in Section 2.2 from my reading of the literature in this field, starting with pre–processing methods in Section 2.3. Then categories for functional specialization are presented in Section 2.4 and those for functional integration in Section 2.5, while the methods for decoding the representation of brain–states in fMRI are catalogued in Section 2.6.
2.1 Neuroscience Principles
The brain adheres to two fundamental principles of organization, functional specialization
and functional integration. These concepts have become central in functional neuroimag-
ing, which is able to sample evoked responses over the entire brain at the same time.
The principle of functional specialization was first observed by Franz Joseph Gall (1758–1828), who articulated the theory that architectural differences are indicative of functional differences and, conversely, that functional differences demand differences in architecture [74]. In other words, brain function is implemented in the form of neuronal hardware, and differences in function require differences in hardware, visible in terms of cell types, connectivity, synaptic and molecular structures.
The architectural layout of the cerebral cortex has since been investigated using cytoarchitectonics and myeloarchitectonics, the former revealing the arrangement of various cell types and the latter the patterns of myelination in different zones of the cerebral cortex [10].
The identification of functionally specific modules was initially performed by examining the behavioral consequences of localized brain lesions. The most famous example is the study of the lesions in the posterior inferior frontal gyrus of two patients by Paul Pierre Broca, which led to the discovery of the involvement of the eponymous brain region in speech production [96]. These observations received spectacular confirmation from the brain–maps produced by modern imaging studies, diverting the main focus of fMRI towards localizing particular cognitive functions to specific brain regions and creating a large database of isolated structure-function correlations over the years [63].
However, this explanation is incomplete, since it fails to characterize how cognition arises
from local computations through their interactions. More than 60 years ago, Donald Hebb
hypothesized that the fundamental unit of brain operation is not the single neuron but rather
the cell assembly – an anatomically dispersed but functionally integrated ensemble of
neurons. The individual neurons that compose an assembly may reside in widely sepa-
rated brain areas but act as a single functional unit through coordinated network activity.
Dynamic interactions between multiple assemblies may then give rise to the large-scale functional networks found in mammalian brains. Therefore, it has been postulated that higher–level cognition must emerge through information flows across these distributed regions [74].
This information flow is the functional integration of the brain, which can be character- ized in two ways, namely in terms of functional connectivity and effective connectivity.
Functional connectivity, defined as the temporal coherence among the activity of different functional units, is measured by cross-correlating their observed signals (e.g. spikes in EEG, or the BOLD response). Effective connectivity, a more abstract notion, is defined as the simplest circuit that could produce the same temporal relationships as observed experimentally. It therefore requires a model that describes the causal influences that functional units exert over one another [68].
However, a third viewpoint, which has lately emerged in cognitive neuroscience, is that of functional representation: whether the data contain information about the cognitive, perceptual or affective state of the subject and, if so, how this information is encoded in the distributed patterns of activation observed [92]. This problem therefore focusses more on understanding how, rather than where, the brain encodes information. The
question of representation is becoming more important as computational neuroscientists
attempt to build computational models of brain function using theories from computability,
artificial intelligence, information–theory, economics and game–theory [69].
2.2 fMRI Methods Taxonomy
With a PubMed listing of more than 5000 publications, it is impossible to provide even the
briefest of surveys of the literature on fMRI analysis methods. Therefore, in this section
I shall attempt to summarize these methods in the form of a taxonomy along with select
references to landmark and highly cited publications. The overview of this taxonomy is
laid out in Fig. 2.1.
2.3 Pre-processing
This category encompasses all processing of the raw fMRI volumes (in image–space) to
make them suitable for use by inferential and exploratory methods. It does not, strictly
speaking, include algorithms for reconstructing the image–space data from its k–space representation; however, many of the methods listed here operate on the k–space data.
2.3.1 Motion Correction
Head motion is typically corrected for using affine (rigid-body) registration schemes [6].
Alternative approaches based on motion corrected independent components analysis (mcICA) [137] and field–map based methods for simultaneous distortion and motion correction [223] have also been proposed. Oakes et al. provide a comprehensive survey of motion correction tools
Pre-processing
i Motion Correction
ii Distortion Correction
iii De-noising and Drift Estimation
iv Inter–subject Registration
  iv.i Anatomical Registration
  iv.ii Functional Registration
v Hemodynamic Response Modeling and Estimation
  v.i Modeling
  v.ii Estimation

Functional Specialization
i Unsupervised
  i.i Cluster-Analysis
  i.ii Decomposition
ii Supervised
  ii.i Mass Univariate
    ii.i.i Linear
    ii.i.ii Non–linear
  ii.ii Multivariate
iii Semi-supervised
  iii.i Unsupervised Augmented with Stimulus Information
  iii.ii Supervised Augmented with Unlabeled Data

Functional Integration
i Functional Connectivity
  i.i Decomposition
  i.ii Cross-Correlation
  i.iii Cluster-Analysis
ii Effective Connectivity
  ii.i Strongly Causal
    ii.i.i Dynamic
    ii.i.ii Static
  ii.ii Weakly Causal

Functional Representation
i Multivariate Pattern Recognition
  i.i Spatial Only
  i.ii Spatio–Temporal
    i.ii.i Supervised
    i.ii.ii Unsupervised
    i.ii.iii Semi–supervised
ii Multivariate Linear Models

Figure 2.1: Taxonomy of fMRI Methods. The proposed taxonomy of fMRI methods published in literature.
used in fMRI [170]. Due to the low spatial resolution of fMRI, post hoc methods for correcting head motion artifacts are fairly inaccurate, and moreover introduce unquantifiable biases into the analysis. However, due to limitations such as unacceptably long acquisition times and lack of appropriate hardware, biometrically and field–map based motion correction methods are usually impractical, and post hoc methods are generally used despite their drawbacks.
2.3.2 Distortion Correction
The non-linear distortion introduced by inhomogeneities in the magnetic field can be corrected through multiple post hoc methods, including those that require a B0 field–map [106, 214], along with non-rigid post hoc distortion correction schemes that do not require B0 maps
(e.g. [6, 134]).
2.3.3 De-noising and Drift Estimation
Several methods have been developed for reducing physiological noise such as respiration and cardiac activity from the fMRI time–series, including navigator methods [129] that use an auxiliary echo to determine the confounds, k–space methods [221], notch filtering approaches [23], and image–space methods with [109] and without [39] the use of echocardiograms. The methods for drift estimation and removal fall mainly into these categories: low-pass filters, autoregressive filters, Kalman filters, nonlinear low-pass filters and subspace methods [14, 158, 202]. Because of their decorrelating properties for the long–range correlated noise in fMRI, several authors have suggested a variety of wavelet–based noise estimation, de-noising and de-trending schemes [29, 215] as different variations of the wavelet shrinkage concept [52].
2.3.4 Inter–subject Registration
2.3.4.1 Anatomical Registration
Spatial normalization methods are typically used to align the images of multiple subjects into a common anatomical space, for the purposes of inter–subject comparisons and group–level analysis. This normalization is usually carried out on a voxel-by-voxel basis using non-rigid registration techniques, by first co-registering the functional images with a high–resolution structural image of the same subject and then registering the structural images with an anatomical template image in an atlas reference system [6, 207].
2.3.4.2 Functional Registration
Anatomical registration methods suffer from two fundamental problems: (i) dealing with the true anatomical differences between subjects (suggested by sulco–gyral and cytoarchitectonic studies across subjects), especially in the case of neuropathologies, and (ii) the relevance of anatomical alignment for the study of the functional commonalities in populations (i.e. the rigidity of the mapping between structure and function). In response to these criticisms, strategies based on alignment of intra-subject parcellations [207], of functional connectivity [43] or of task–related functional activation maps [192], along with strategies for directly registering the time–series itself [125], have been suggested.
2.3.5 Hemodynamic Response Modeling and Estimation
The problem of understanding and accounting for the variability in the hemodynamic response of the brain has been approached either by building biophysiological models of the response or by estimating the response from the data.
2.3.5.1 Modeling
Methods to model the hemodynamic response function (HRF) include Fourier bases [Henson et al.], Gamma functions [66], cosine bases [218], Gaussian bases [38], anatomically–informed basis functions [113], finite impulse response (FIR) filters [82] and subspace methods [64], along with physiological models of the neuro–vascular coupling [32].
2.3.5.2 Estimation
Direct estimation of the HRF from the data has also been reported, including methods using Bayesian nets [148], non–linear control theory [97] and deconvolution in the time [82], frequency and wavelet [59] domains.
2.4 Functional Specialization
The majority of fMRI literature is concerned with localizing the neural substrates for a particular mental faculty, i.e. with the problem of functional specialization.
2.4.1 Unsupervised
These are exploratory methods to identify salient regions and/or distributed patterns of activation from fMRI data, without reference to information about the task (i.e. the stimulus). Unsupervised methods have been widely used for analyzing non–task related data such as resting state data [84]. For a more detailed look at the issues surrounding supervised, unsupervised and semi–supervised methods, the reader is referred to Section 7.2 of Chapter 7.
2.4.1.1 Cluster-Analysis
In this approach, voxels are clustered together according to varying combinations of similarity of their structural and fMRI time–course information. A large number of clustering methods [99] have been applied to fMRI, including fuzzy k–means clustering [12], vector quantization [11], self-organizing maps [216], neural gas networks [149], clustering in the frequency domain [157] or wavelet domain [105], dynamical cluster analysis [13], temporal cluster analysis [226] and hierarchical clustering [83, 105]. Please refer to Dimitriadou et al. [51] for a review of these methods.
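The cluster-analysis idea can be sketched with a minimal k-means on synthetic voxel time courses: voxels whose time series co-vary end up in the same cluster. This is a toy stand-in (NumPy only, deterministic initialization), not any of the cited algorithms:

```python
import numpy as np

# Minimal k-means on synthetic voxel time courses: voxels whose series
# co-vary are grouped together. Deterministic seeding; assumed toy data.
rng = np.random.default_rng(0)
course_a = np.sin(np.linspace(0, 6 * np.pi, 50))        # "active" pattern
course_b = np.zeros(50)                                  # "baseline" pattern
voxels = np.vstack([course_a + 0.1 * rng.standard_normal((30, 50)),
                    course_b + 0.1 * rng.standard_normal((30, 50))])

def kmeans(data, k=2, iters=20):
    centers = data[[0, len(data) - 1]].copy()   # one seed from each end (k = 2)
    for _ in range(iters):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(voxels)
print(labels[:5], labels[-5:])   # the two voxel blocks fall into different clusters
```

Real fMRI clustering must additionally choose the number of clusters, a similarity measure, and often a feature space (frequency or wavelet coefficients), which is where the cited methods differ.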
2.4.1.2 Decomposition
Activation patterns in the data are explored by decomposing the data into constituent components with linear factor analysis methods such as principal component analysis (PCA) [70], independent component analysis [34] and non–negative matrix factorization [168], or non–linear methods such as non–linear PCA [67] and kernel PCA [206]. Also, dynamical formulations for decomposition have been proposed, wherein the data are modeled as multivariate ARMA processes and the decomposition reduces to estimating the ARMA coefficients [205].
Decomposition–based methods in general, and ICA and its many variants in particular, are very popular, not only for the exploratory analysis of functional localization, but also for functional connectivity, motion correction, de–noising, and as feature vectors for use in further pattern recognition. The reader is referred to the review article by Calhoun and Adali [34] for more details.
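As a minimal sketch of the decomposition idea (on illustrative synthetic data, not any of the cited fMRI pipelines), PCA of a scans–by–voxels data matrix can be computed via the singular value decomposition, yielding temporal modes, spatial maps, and the variance explained by each component:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fMRI-like matrix: T=50 scans x V=200 voxels, one dominant spatio-temporal mode
T, V = 50, 200
time_course = np.sin(np.linspace(0, 6 * np.pi, T))   # latent temporal mode
spatial_map = rng.standard_normal(V)                 # latent spatial mode
Y = np.outer(time_course, spatial_map) + 0.05 * rng.standard_normal((T, V))

# PCA via SVD of the temporally centered data matrix
Yc = Y - Y.mean(0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)

# Columns of U are temporal modes, rows of Vt are spatial maps
explained = s ** 2 / (s ** 2).sum()   # fraction of variance per component
```

On this toy data the first component captures almost all the variance and its spatial mode recovers the planted map (up to sign); in real data many components are needed and their interpretation is the hard part.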
2.4.2 Supervised
These are confirmatory methods that use either classical or Bayesian hypothesis testing to produce statistical measures (i.e. p–values or posterior probabilities) of activation at different locales in the brain in response to a task.
2.4.2.1 Mass Univariate
This is the most popular method of studying functional localization, where every voxel is individually and independently tested for activation. These methods yield parametric maps of parameter estimates representing the amount of activity, as per some model of the hemodynamic response. These models can be either:
i Linear: Most existing analysis tools are based on a linear time–invariant model of brain response, formed by convolving functions representing the experimental stimuli with an approximate HRF to build the design matrix. The loadings of the regressors in the observed time–course of each voxel are estimated using General Linear Modeling (GLM) [66]. The main drawbacks of this approach are its assumptions of linear coupling between stimulus and BOLD response and of spatio–temporally invariant and known hemodynamics. Many studies have shown a relatively large variation in the observed hemodynamic response across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time [201]. Non–linearities in the transfer from a stimulus to the hemodynamic response have also been demonstrated [140], questioning the validity of linear models. However, due to their explanatory power, statistical simplicity and computational efficiency, GLM approaches are still the most widely used methods.
ii Non–linear: Estimating the parameters of activation, such as amplitude, latency and dispersion, through non–linear regression has been variously proposed, wherein the estimated HRF is a non–linear function of these parameters [97, 119, 124, 182]. While these methods do not require assumptions of known and spatially constant hemodynamic responses, and only assume a fixed parametric form of the hemodynamic response function at each voxel, they are computationally expensive, involving non–linear minimization at each voxel. Also, due to their non–linear nature, the p–values of the parameter estimates have to be computed using permutation tests, adding to their computational burden.
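The linear (GLM) variant can be sketched end–to–end: a boxcar stimulus is convolved with an HRF kernel to form a design–matrix regressor, and the per–voxel loading is estimated by ordinary least squares. All constants below (the TR–resolution boxcar, the Gaussian–bump stand-in for the HRF, the simulated effect sizes) are illustrative assumptions, not values from any cited study:

```python
import numpy as np

rng = np.random.default_rng(2)

# Boxcar stimulus: 'on' for scans 10-19 and 40-49 out of T=60 (TR resolution)
T = 60
stim = np.zeros(T)
stim[10:20] = 1.0
stim[40:50] = 1.0

# Crude HRF kernel: a smoothed, delayed bump (a double-gamma would be typical)
kernel = np.exp(-0.5 * ((np.arange(12) - 5.0) / 2.0) ** 2)
kernel /= kernel.sum()

regressor = np.convolve(stim, kernel)[:T]       # expected BOLD response
X = np.column_stack([regressor, np.ones(T)])    # design matrix: task + intercept

# Simulate one "active" voxel (true loading 2.0, baseline 0.3) and fit by OLS
y = 2.0 * regressor + 0.3 + 0.1 * rng.standard_normal(T)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The estimated loading `beta[0]` recovers the simulated effect; a full pipeline would additionally compute a t–statistic per voxel and apply one of the MCP corrections discussed next.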
Multiple Comparisons Problem Activation maps are computed by determining the correct value at which to threshold the per–voxel parameter estimates in order to reject the null hypothesis (of no activation) at the desired size (i.e. the α–value).
Rejecting the null hypothesis at each voxel individually leads to a Multiple Comparisons Problem (MCP)3, and therefore many correction strategies are used in fMRI:
• Bonferroni: Based on the Bonferroni correction4 of the individual tests to give a global test of the desired size, these methods include correcting the threshold for each voxel [66] or thresholding the wavelet coefficients of the statistical parametric map [212].
• False Discovery Rate (FDR): Given a population of V voxels marked as active, an FDR of R implies that, in expectation, not more than R·V of those voxels are false positives. FDR can be implemented either in a voxel–wise [76] or a wavelet–based [212] fashion.
• Random Field Theory: Here, the fMRI voxels are treated as a random field with
specific interdependency structures, such as Gaussian random fields [220] or Markov
random fields [44], in which case the global p–value can be estimated either in closed
form by the Euler characteristic of the excursion set or through MCMC methods.
• Permutation Testing: Permutation testing is a non–parametric test that uses bootstrapped estimates of the global null hypothesis, obtained by resampling from the data, and therefore does not suffer from the local vs. global null problem [166].
3In mass–univariate analysis, testing individual voxels against the null hypothesis of no activity at a size of α (the desired Type–I or false–positive error rate) means that, even in the absence of a real effect, one would observe an α–fraction of the voxels as active simply due to noise. In fMRI, given that there are approximately 10^5 voxels in the brain volume, for an α of 0.05 the number of Type–I (false positive) errors can be on the order of 5000 voxels. Such a univariate test is invalid with respect to the global null hypothesis (i.e. its true size differs from the nominal size), which cannot treat voxels as independent random variables. Instead, a global p–value of the spatially–distributed activation pattern is needed to reject a global null hypothesis. 4Bonferroni correction is a simple method of maintaining the family–wise error rate when testing n hypotheses simultaneously, by testing each individual hypothesis at a statistical significance level of α/n, where α is the global significance level.
• Bayesian: Here the problem of thresholding parameter values is bypassed entirely by instead computing posterior probability maps (PPMs) of the parameters at each voxel, using hierarchical Bayesian models [72]. Methods based on Bayesian segmentation of the parametric maps [184] have also been developed.
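Two of these correction strategies translate directly into code. The sketch below (on made-up toy p–values) contrasts the Bonferroni rule, which tests each voxel at α/n, with the Benjamini–Hochberg step–up procedure, which controls the FDR at level q:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 wherever p < alpha / n (family-wise error control)."""
    n = len(pvals)
    return [p < alpha / n for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/n) * q ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / n * q:
            k = rank
    # ... and reject the k smallest p-values
    rejected = [False] * n
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Toy p-values standing in for per-voxel tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
```

On these eight toy values Bonferroni rejects only the smallest p–value, while the less conservative FDR procedure also rejects the second, illustrating the trade-off between the two criteria.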
The typical mass–univariate analysis pipeline using a GLM is shown in Fig. 2.2.
Figure 2.2: GLM Analysis Pipeline: Shown are the different stages of the pipeline involved in a GLM mass–univariate analysis of fMRI data, including pre–processing, spatial smoothing, GLM–based regression against the design matrix followed by thresholding of the estimated parametric maps at the desired size, using one of the MCP correction procedures.
2.4.2.2 Multivariate Linear Models
Multivariate models relax the naïve independence assumptions of univariate methods and enable inference about distributed responses. They also do not suffer from the multiple comparisons problem of univariate models. Although multivariate methods have been used in functional neuroimaging since the 1980s [161], their popularity was circumscribed by insufficient degrees of freedom, since the number of voxels (i.e. variables) exceeds the number of scans (i.e. observations) by orders of magnitude. It is only recently that there has been a resurgence in these methods, accompanied by some form of dimensionality reduction, mainly because of the increasing interest in the question of the neural representation of mental states, which can be addressed only in a multivariate setting. While non–linear multivariate methods have been used for studying distributed representations in the form of pattern–recognition classifiers, only linear models have been used for studying functional specialization, mainly due to their computational efficiency, statistical power and ease of interpretation. These methods, however, still require the assumption of linear coupling of stimulus to BOLD response and a spatially invariant and known HRF. Included are scaled sub–space profile models [161], partial least squares (PLS) [152], MANCOVA–type methods such as canonical correlation analysis (CCA) [70] and canonical variates analysis (CVA) [219], and hierarchical Bayesian linear models [65].
2.4.3 Semi-supervised
These methods try to effect a compromise between unsupervised and supervised methods, in order to use their complementary advantages – that of supervised methods to enforce quantifiable links with experimental conditions, and that of unsupervised methods to discover new patterns in the data. These methods exist in two flavors:
2.4.3.1 Unsupervised Augmented with Stimulus Information
Here, the results of an unsupervised method are improved or made more relevant by incorporating some information about the experimental task and expected results. The aim is to condition or guide the discovery towards patterns pertinent to the mental task and away from spurious artifacts. These methods include functional PCA [79], constrained ICA (cICA) [142], ICA with reference (ICA–R) [141] and semi–blind ICA [138], wherein prior information, such as statistical properties, reference signals, or spatial templates of expected localization, is used to improve the quality of the decomposition and reduce indeterminacies inherent in ICA. Additionally, approaches that use the stimulus time–series as the reference for the distance metric in a clustering algorithm have also been proposed [64, 83, 105].
2.4.3.2 Supervised Augmented with Unlabeled Data
In this paradigm, unlabeled fMRI data – i.e. data for which there is no associated stimulus information – are used to improve the estimation of spatial maps for task–related studies, for example using resting–state data for Laplacian regularization of the regression of data from subjects watching annotated movies [24].
2.5 Functional Integration
Recent advances in functional neuroimaging technology and image analysis theory have paved the way to investigate brain function in terms of the interactions between neural systems, rather than the individual regions involved in a sensory or cognitive task [204].
2.5.1 Functional Connectivity
Methods for functional connectivity involve assessing the non–causal relationships, typically correlations, between spatially distinct regions based on the time–courses of the voxels in those regions. The reader is referred to Li et al. [133] for a review of the functional connectivity literature in fMRI.
2.5.1.1 Decomposition
These methods involve decomposing the relationships between regions, specified by the statistical moments of their time–courses, into spatially coherent maps. The methods typically used are PCA [71] and PLS [152], which operate on the correlation (i.e. second–order moment) between regions, and non–linear PCA and ICA, which orthogonalize the higher–order statistical dependencies as well [67].
2.5.1.2 Cross-Correlation
These methods measure functional connectivity as the cross–correlation between one region / voxel and another. In seed voxel correlation analysis (SVCA) [36], 3D functional connectivity maps are created that chart the brain regions correlated with a seed region, selected either from brain anatomy or from additional functional activation studies. Methods that analyze the correlation between multiple regions or parcels simultaneously have also been applied [1].
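A seed–correlation map of the SVCA kind can be sketched in a few lines. The data below are synthetic; the "network" of ten voxels sharing the seed's signal, the noise level, and the choice of voxel 0 as the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: V=100 voxel time-courses of length T=80; voxels 0-9 share a common signal
T, V = 80, 100
shared = rng.standard_normal(T)
data = rng.standard_normal((V, T))
data[:10] += 2.0 * shared   # a small "network" correlated with the seed

seed = data[0]   # seed time-course (in practice, a mean over an anatomical ROI)

def correlation_map(data, seed):
    """Pearson correlation of every voxel's time-course with the seed."""
    dz = (data - data.mean(1, keepdims=True)) / data.std(1, keepdims=True)
    sz = (seed - seed.mean()) / seed.std()
    return dz @ sz / len(seed)

cmap = correlation_map(data, seed)
```

Voxels in the planted network show strong correlation with the seed while the rest hover near zero; an actual SVCA study would further threshold `cmap` (with an MCP correction) and render it as a 3D map.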
2.5.1.3 Cluster-Analysis
Many of the clustering methods reviewed in Section 2.4.1.1 have also been used to reveal the time–course similarity patterns between spatially distant voxels.
2.5.2 Effective Connectivity
The main criticism of functional connectivity is that correlations may arise in a variety of ways and may not stem from causal or even functional relationships. For example, in multi–unit electrode recordings, they can result from stimulus–locked transients evoked by a common input, or can reflect stimulus–induced oscillations mediated by synaptic connections [204]. Additionally, the drawback of using approaches such as PCA and ICA is that, among the patterns identified, not all have a clear relationship with neural activity and many are demonstrably artefactual.
2.5.2.1 Strongly Causal
Strongly causal links [174] between interconnected regions are inferred to firmly establish structure–function relationships in one of two ways:
Dynamic Dynamic system theory in the form of dynamic causal models (DCM) [204], and graphical models in the form of dynamic Bayesian networks (DBN) [185], have been used to infer causal links between pre–selected ROIs from fMRI data, typically in the visual processing stream. These methods select amongst competing models of the causal interactions between neural circuits by comparing the evidence for the observed fMRI data arising from each given circuit. They are restricted to examining a very small number of interactions because of their high computational complexity. Also, as they require a model explaining how the BOLD data are generated from neural activity [73], their validity depends on the accuracy of this model.
Figure 2.3: Dynamic Causal Models: Shown is an example of a DCM encoding causal interactions be- tween visual areas V1 and V4, Brodmann areas 39 and 37, and the superior temporal gyrus STG. The dark square boxes represent the transformation of internal state in each region (neuronal activity) into a measured (hemodynamic) response y. Image reproduced from [73].
Static Methods such as structural equation models (SEM) [153] compare the evidence for different models of causal relationships between brain regions. This is a static approach in the sense that the relationships are instantaneous and do not account for the temporality of the data. These methods are also computationally expensive and require selecting among multiple pre–specified alternatives.
2.5.2.2 Weakly Causal
Here, the causal definition of effective connectivity is relaxed to a weaker one, that of Granger causality [80]. Under this definition, if incorporating the past values of time–series X improves the future prediction of time–series Y, then X is said to have a (Granger) causal influence on Y. This definition allows finding functional relationships between large numbers of regions simultaneously, thereby circumventing the drawbacks of dynamic and static models (namely, the pre–specification of alternative models and the pre–selection of a small number of ROIs). In a similar vein, multivariate autoregressive processes [89] have been used to analyze causal relationships at the level of the BOLD signal itself, in terms of the coefficients of an autoregressive process.
The drawback of such “model–free” formalisms is that the neuroscientific interpretation of their results is unclear.
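The Granger definition translates directly into comparing nested autoregressive models: if adding the past of X reduces the residual variance when predicting Y, X Granger–causes Y. Below is a minimal first–order sketch on simulated series; the AR coefficients and noise levels are illustrative, and real analyses would use higher model orders with a formal F–test:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a pair of series where x Granger-causes y (y[t] depends on x[t-1])
T = 500
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + 0.5 * rng.standard_normal()

def residual_var(target, predictors):
    """Ordinary least squares fit; return the variance of the residuals."""
    beta, *_ = np.linalg.lstsq(predictors, target, rcond=None)
    return (target - predictors @ beta).var()

def granger(src, dst):
    """Log-ratio of restricted vs. full AR(1) residual variance (> 0: src helps)."""
    ones = np.ones(len(dst) - 1)
    restricted = residual_var(dst[1:], np.column_stack([ones, dst[:-1]]))
    full = residual_var(dst[1:], np.column_stack([ones, dst[:-1], src[:-1]]))
    return np.log(restricted / full)

gc_x_to_y = granger(x, y)   # large: x's past improves prediction of y
gc_y_to_x = granger(y, x)   # near zero: y's past does not help predict x
</imports>```

The asymmetry of the two scores is what licenses the directed, "weakly causal" interpretation; applied voxel- or region-wise, the same comparison scales to many simultaneous relationships.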
2.6 Functional Representation
Revealing distributed encoding of information about the cognitive state of the subject from fMRI must be, by definition, performed in a multivariate setting.
The use of voxel–based inferential statistics systematically eliminates most of the data, reducing the power of fMRI to that of multiple single–unit recordings [173]. In contrast to univariate analysis, multivariate methods do not make the naïve assumption that activity in voxels is independent, and mine for the information present in the interactions among voxels. Communication among neurons, as well as among larger functional units, is the main basis of neural computation, and by not disregarding these interactions, multivariate methods are able to peer into the neural code. In addition, multivariate methods do not suffer from the multiple comparisons problem of univariate methods (cf. Section 2.4). As importantly, the need for spatial smoothing to increase SNR is obviated, as such methods integrate the information from groups of voxels that individually are weakly activated, but jointly may be highly structured with respect to the task.
The topic of representation of the neural code and decoding of cognitive states from the
data is reprised for a more in–depth discussion in Chapter 7.
2.6.1 Multivariate Pattern Recognition
A popular approach has been the use of multivariate pattern recognition (MVPR) [92], which learns the statistical mapping from the distributed pattern of activation in an individual brain to the experimental conditions experienced during the scans.
2.6.1.1 Spatial–only
Most MVPR methods do not typically take the temporal nature of cognitive processing into account. They make the assumption that all fMRI scans with the same label (e.g., behavioral state) have the same properties, and do not account for the hemodynamic delay between neural activity and the BOLD signal. Therefore, these approaches are inherently restricted to block–design experiments, where such assumptions are permissible.
Typically, linear classifiers, such as correlation–based classifiers [90], single–layer perceptrons [179], linear discriminant analysis [91], linear support vector machines (SVMs) [160] and Gaussian naïve Bayes [159], have been used, due to their simplicity of interpretation without significant loss of accuracy [110]. However, non–linear classifiers such as kernel canonical correlation analysis (kCCA) have also been reported [87], learning the mapping from the fMRI data to scale–invariant feature transform (SIFT) features of natural images.
Such MVPR methods, which predict the behavioral state of the subject, have been applied mainly to the study of visual processing (e.g. [90, 91, 110, 179]), but also to auditory perception [150], motor tasks [122], word recognition [160], and the detection of states such as deception [200] and fear [177]. Alternative approaches based on pattern classifiers attempt to decode the perceptual state of the subject without modification of the sensory input, for example using multi–dimensional scaling (MDS) to understand the representation of shapes in the visual cortex [54].
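The simplest member of this family, the correlation–based classifier, can be sketched in a few lines on synthetic two–condition data. The condition names, voxel count and noise level are illustrative assumptions: each class is summarized by its mean training pattern, and a new scan is assigned to the class whose prototype it correlates with most strongly:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy decoding problem: V=50 voxels, two conditions with distinct mean patterns
V = 50
pattern = {"faces": rng.standard_normal(V), "houses": rng.standard_normal(V)}

def simulate_scan(cond):
    """One noisy scan drawn around the condition's mean pattern."""
    return pattern[cond] + 0.8 * rng.standard_normal(V)

train = [(simulate_scan(c), c) for c in ("faces", "houses") for _ in range(20)]

def fit(train):
    """Summarize each class by the mean of its training scans (its prototype)."""
    proto = {}
    for c in ("faces", "houses"):
        proto[c] = np.mean([s for s, lab in train if lab == c], axis=0)
    return proto

def predict(proto, scan):
    """Assign the scan to the class whose prototype it correlates with best."""
    corr = {c: np.corrcoef(scan, p)[0, 1] for c, p in proto.items()}
    return max(corr, key=corr.get)

proto = fit(train)
test_scans = [(simulate_scan(c), c) for c in ("faces", "houses") for _ in range(25)]
accuracy = float(np.mean([predict(proto, s) == c for s, c in test_scans]))
```

On held–out scans this toy decoder is near–perfect because the planted patterns are well separated; real decoding accuracies are far more modest and must be assessed with proper cross–validation.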
2.6.1.2 Spatio–Temporal
The methods listed in this class attempt to describe the information contained not only in the spatial distribution of activity at one time–instant but also in the temporal evolution of these patterns.
Supervised Temporal variability during a task has been accounted for – in a limited way – through the temporal embedding of all the fMRI scans in one block as the feature vector [159, 164]. Also, a pattern classifier was trained to decode changes in binocular dominance5 on a second–by–second basis, thereby revealing the timing of changes in the neural representations of information and its subsequent availability for report by the participant [179].
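The temporal–embedding device mentioned above amounts to concatenating a window of consecutive scans into a single feature vector, so that a spatial classifier can also see the evolution of the pattern. A minimal sketch (the window length w and the tiny toy matrix are illustrative choices):

```python
import numpy as np

def embed_block(scans, w):
    """scans: (T, V) array of T scans over V voxels.
    Returns a (T - w + 1, w * V) array where each row concatenates w consecutive scans."""
    T, V = scans.shape
    return np.stack([scans[t:t + w].ravel() for t in range(T - w + 1)])

scans = np.arange(12, dtype=float).reshape(6, 2)   # toy block: 6 scans, 2 voxels
emb = embed_block(scans, 3)                        # shape (4, 6)
```

Each embedded row can then be fed to any of the classifiers from the previous section; the cost is a w–fold increase in feature dimension, which aggravates the scans–versus–voxels imbalance noted earlier.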
Unsupervised and Semi-supervised To the best of our knowledge, the models introduced in this thesis are the first examples of unsupervised and semi–supervised methods for studying the spatio–temporal patterns in the data.
5When dissimilar images are presented to the two eyes of the subject, they compete for perceptual dominance, so that each image is visible in turn for a few seconds while the other is suppressed. Because perceptual transitions between each monocular view occur spontaneously, without any change in physical stimulation, neural responses associated with conscious perception can be distinguished from those due to sensory processing.
2.6.2 Multivariate Linear Models
The MVLMs discussed in Section 2.4.2.2 can also be used to study representation. This is
further elaborated upon in Chapter 7.
2.7 Summary
In this chapter, I presented a taxonomy of the different processing and analysis strategies in fMRI. At the top level, the salient classes of neuroscientific problems that are investigated with fMRI were listed, followed by the sub–categories of each overall neuroscientific problem. Each subdivision was then further branched based on methodological specifics and statistical considerations.
Although this classification schema does not specifically address the important problem of mental chronometry (cf. Chapter 4), the visual analysis method for mental chronometry developed in this thesis is categorized as Functional Specialization : Unsupervised : Cluster-analysis, while the GLM–based method for latency determination would be Functional Specialization : Supervised : Mass Univariate : Linear. The methods for studying the representation of the mental state of the subject in the spatio–temporal patterns of metabolic activity, developed in Part III, fall under the category Functional Representation : Multivariate Pattern Recognition : Spatio-Temporal, and are further sub–classified as Unsupervised and Semi-supervised.
CHAPTER 3
BACKGROUND: NEUROSCIENTIFIC SETTING
It ought to be generally known that the source of our pleasure, merriment, laughter, and amusement, as of our grief, pain, anxiety, and tears, is none other than the brain. It is specially the organ which enables us to think, see, and hear, and to distinguish the ugly and the beautiful, the bad and the good, pleasant and unpleasant. It is the brain too which is the seat of madness and delirium, of the fears and frights which assail us, often by night, but sometimes even by day; it is there where lies the cause of insomnia and sleep-walking, of thoughts that will not come, forgotten duties, and eccentricities.
Hippocrates (460BC–370BC).
The methods presented in this thesis were designed and applied for investigating the neural processing cascades for two neurophysiological studies:
i The neurophysiology of retrieval and manipulation of visuo–spatial working memory in Section 3.1.
ii The functional processing of mental arithmetic in adults suffering from developmental dyscalculia and dyslexia in Section 3.2.
In this chapter, the neurological processes being investigated, a background of the neuropathologies involved, and the data–sets used in these studies are described.
3.1 Visuo–Spatial Working Memory
Visuo–spatial working memory (VSWM), the ability to temporarily maintain visuo–spatial information in mind, is a key cognitive function that underlies other cognitive abilities such as complex reasoning, reading, mathematical calculation, and problem solving. The development of working memory and its cognitive control / manipulation is one of the most salient features of the maturation of mental processes during childhood and adolescence [55].
Several brain imaging studies have been conducted to determine precisely what is changing in a child’s brain over time, enabling better control of thoughts and behavior. But compared with what is known about changes in brain structure during development6, far less is known about the resulting changes in brain function. The pattern of developmental changes in brain activation has generally been observed as a shift from diffuse to focal activation and from posterior to anterior activation [30]. The precise pattern of change observed depends on the task, the ages being examined and the brain region in question.
Brain imaging studies of working memory in adults suggest that different parts of the lateral prefrontal cortex (PFC) are involved in maintenance and manipulation, with the ventrolateral (VL) PFC performing online maintenance of information, and the mid–dorsolateral (DL) PFC additionally recruited for manipulation. It has also been hypothesized that representations of magnitude or space in the parietal cortex serve as the substrate for the organization and manipulation of items in working memory [46]. In school–aged children and adolescents, as well, the ability to manipulate information is associated with the strength of recruitment of regions in the dlPFC and bilateral superior parietal cortex (SPC). It is believed that in children the ability to manipulate items in working memory develops more slowly than the ability to simply retain them, and that by the age of 13 this ability is fully developed.
6Structural brain imaging studies of development indicate cortical gray matter loss and white matter increases during late childhood and adolescence, associated with pruning of excessive neurons and increased structural connectivity between brain regions.
3.1.1 Data-set
The study was designed to isolate manipulation requirements by comparing a maintenance + manipulation condition with a pure maintenance condition. fMRI data were recorded while 8 healthy, right–handed children, aged 7 to 11, performed a working memory task with both maintenance and manipulation conditions. Three nameable objects were presented sequentially (Fig. 3.1). During a 6 s delay period, participants were instructed to repeat the objects in forward order (the maintenance task) or to reverse the order of the objects (the manipulation task). After the delay, participants were prompted with one of the objects and indicated with a button press whether this target object occurred first, second, or third in the forward or backward sequence.
Acquisition was done on a Siemens 3T Tim Trio MRI scanner with a quadrature head coil, using a BOLD–sensitized 2D–EPI gradient–echo pulse sequence with the following specifications: echo time 30 ms, flip angle 30◦, volume scan time 2.22 s, and voxel size 3 × 3 × 3.75 mm. A typical study lasted around 10 minutes, with about 30 trials. Raw data were reconstructed off–line, and routine pre–processing (viz. motion and slice–timing correction, spatial normalization to a standard brain space, co–registration of functional and structural scans, and spatial smoothing with an 8 mm Gaussian filter) was done in SPM5 [66].
Figure 3.1: Visuo–Spatial Working Memory Task: The timings for various stages of the paradigm to study visuo–spatial working memory maintenance and manipulation.
3.2 Mental Arithmetic
Cognitive theories of numerical representation suggest that the understanding of numerical quantities is driven by a magnitude representation associated with the intraparietal sulcus, possibly under genetic control [162]. Dehaene et al. [49] proposed a triple–code model of the organization of number–related processes in the parietal lobe, based on neuropsychological evidence derived from behavioral, lesion, PET and fMRI studies. According to this model, there are three distinct systems involved in mental arithmetic:
i A quantity system having a non–verbal semantic representation of the size and distance relations between numbers
ii A verbal system, where numerals are represented lexically, phonologically, and syntactically, like any other type of word
iii A visual system, in which numbers are encoded as strings of Arabic numerals
Cognitive models suggest that exact arithmetic facts are stored in a language–specific format, while approximate knowledge is language–independent and shows a numerical distance effect associated with the non–verbal quantity system.
The horizontal segment of the intraparietal sulcus (hIPS) is a major site of activation in neuroimaging studies of number processing. It is systematically activated whenever numbers are manipulated, independently of number notation, with increased activation corresponding to increased magnitude of the quantities manipulated. The hIPS is more active when subjects estimate the approximate result of an addition problem than when they compute its exact solution. It shows greater activation for subtraction than for multiplication7. The hIPS is also active whenever a comparative operation that needs access to a numerical scale is called for. Parietal activation in number comparison is often larger in the right than in the left hemisphere; however, the parietal activation, although it may be asymmetric, is always present in both hemispheres. It has been speculated that the core quantity system, analogous to an internal “number line,” is localized in this region [49, 195].
A left angular gyrus area (lAG), in connection with other left–hemispheric perisylvian areas, supports the manipulation of numbers in verbal form. This region is part of the language system and contributes to the processing of arithmetic operations, such as multiplication, that make strong demands on the verbal coding of numbers [49]. The lAG is more active in exact calculation than in approximation [47]. Also, it shows greater activation for exact calculations that require access to a rote verbal memory of arithmetic facts, such as multiplication, than for operations that are not stored and require some form of quantity manipulation. Even within a given operation, such as single–digit addition, the left angular gyrus is more active for small problems with a sum below 10 than for large problems with a sum above 10. This probably reflects the fact that small addition facts, just like multiplication tables, are stored in rote verbal memory, while larger addition problems may be solved by resorting to semantic manipulation strategies [195].
7Multiplication tables and small exact addition facts may be stored in rote verbal memory, and hence place minimal requirements on quantity manipulation.
Finally, a bilateral posterior superior parietal system supports attentional orientation on the mental number line, just like on any other spatial dimension. It is active during number comparison, approximation, subtraction of two digits, and counting. It also appears to increase in activation when subjects carry out two operations instead of one. It also plays a central role in a variety of visuo–spatial tasks including hand reaching, grasping, eye and/or attention orienting, mental rotation, and spatial working memory [49]. It may also contribute to attentional selection on other mental dimensions that are analogous to space, such as time or attending to specific quantities on the number line.
The right precuneus, the left and right middle and superior frontal regions, and the pre–central gyrus containing the supplementary motor area (SMA) have also been identified during arithmetic operations. Therefore, mental arithmetic appears to reflect a basic anatomical substrate of working memory, numerical knowledge, and processing based on finger counting, derived from a network originally related to finger movement [60].
3.2.1 Dyscalculia
Developmental dyscalculia (DDC) is defined as difficulty in learning arithmetic that cannot be explained by mental retardation, inappropriate schooling, or a poor social environment [162, 196]. Children can exhibit low math performance in many different ways: some may have particular difficulties with arithmetical facts, others with procedures and strategies, while most disabled children seem to have difficulties across the whole spectrum of numerical tasks [31]. Just as diverse as their symptoms is the wide range of terms referring to these disabilities (developmental dyscalculia, mathematical disability, arithmetical learning disability, number fact disorder, psychological difficulties in mathematics). In adult acalculia, at least two subtypes of dyscalculia may be observed: multiplication deficits are reported in cases of dyscalculia accompanied by dysphasia and/or dyslexia, while subtraction and quantity–manipulation deficits are often present in patients with dyscalculia but without any accompanying dyslexia or language retardation [75].
DDC is relatively frequent, affecting 3–6% of children, and a fraction of those children may suffer from a core conceptual deficit in the numerical domain. This can affect even very simple tasks such as counting or comparing numerical magnitudes [31]. Classified as a developmental Gerstmann syndrome, it is frequently co–morbid with a variety of disorders, such as dyslexia, attention disorders, dysgraphia, left–right disorientation and finger agnosia, poor hand–eye coordination, poor working memory span, epilepsy, fragile–X syndrome, Williams syndrome and Turner syndrome. However, causal relationships between these disorders have not been established, and the genetic and neural bases of DDC remain unknown [121].
3.2.2 Dyslexia
Dyslexia (DL) is a reading disorder defined as a selective inability to build a visual representation of a word, used in subsequent language processing, in the absence of general visual impairment or speech disorders [48]. It can arise from a variety of disorders of the visual word form system (VWFS), which plays a pivotal role in informing other temporal, parietal and frontal areas of the identity of the letter string. This reading network contains processes for the orthographic recognition of word forms and the sublexical conversion from orthography to phonology [48]. While it has been proposed that dyslexia is more generally characterized by a disconnection syndrome of the reading network, the neural correlates of these pathways are not well understood.
Neuropsychological studies have demonstrated that the acquisition of reading skills is re-
flected by progressively greater activation in left occipital, temporal and frontal regions
and progressively less activation in posterior right hemisphere regions [197]. These re-
cruitments depend on the type of words being manipulated. For example, unfamiliar pseu-
dowords8 are thought to increase demands on the sublexical conversion of orthography to
phonology, whereas exception words9 rely on lexico-semantic processing. The effect of
word type on brain activation also depends upon the task (phonological recognition vs. se-
mantic recognition) being performed [21].
In the case of developmental dyslexia, neuronal abnormalities within the reading network are difficult to interpret because they appear to depend upon the task, language, and type of dyslexia. In the case of acquired dyslexia caused by pathological or accidental focal brain damage, the neural correlates are usually clearer. Pure alexia is defined as difficulty reading all types of words in the context of preserved writing skills, and typically occurs following left occipitotemporal damage [181]. Phonological dyslexia, defined as an inability to read pseudowords, is usually caused by large cerebral infarcts in the middle
8 Novel words that have not been encountered before (e.g. floop).
9 When the pronunciation of a whole word is inconsistent with that of its parts (e.g. yacht).
left hemisphere affecting temporoparietal and frontal regions. Surface dyslexia, involving difficulties with exception words, is associated with anterolateral temporal lobe atrophy.
3.2.3 Data-set
Twenty control subjects, thirteen high-performing (full-scale IQ > 95) individuals with pure dyscalculia (DC) and nine with dyslexia (DL) [163] participated (controls: 10 female, one female and one male left-handed, one male ambidextrous, age 21–34 yrs, mean age 25.6 ± 3.0 yrs; DC: 6 female, 1 male left-handed, age 22–23 yrs; DL: two females, two males left-handed, age 18–32 yrs, mean age 24.5 yrs). All subjects were free of neurological and psychiatric illnesses and attention-deficit disorder. All controls denied a history of any calculation difficulties.
The layout in Fig. 3.2 illustrates the self-paced, irregular paradigm used in these experiments. Subjects were exposed visually to simple multiplication problems with single-digit operands, e.g., 4 × 5, and had to decide if the incorrect solution subsequently offered was, e.g., close for 23, too small for 12, or too big for 27, relative to the correct result of 20.
All solutions were within ±50% of the correct answer. Only one solution was presented at a time. The close answer had to be applied for solutions that were within ±25% of the correct result, while the two remaining options exceeded this threshold. Subjects answered by pressing a button with the index finger of the dominant hand for too small, the middle finger for close, and the ring finger for too big. Identical operand pairs were excluded. The simplest operand pair was 3 × 4, while the most demanding pair was 8 × 9. The order of small vs. large operands was approximately counterbalanced. Presentation times were the following: multiplication problem 2.5 s, equal sign (=) 0.3 s, solution 0.8 s, judgment period
up to 4 s, and a rest condition with a fixation point of 1 s until the beginning of a new cycle.
Subjects were encouraged to respond as quickly as possible. Stimulus onset asynchrony
(SOA) ranged from around 4s to 8.6 s. All subjects were exposed to two different sets of multiplication problems, with an interval of approximately 30min between sessions 1 and
2 during which time they solved other nonnumerical tasks.
Figure 3.2: Mental Arithmetic Task. The five phases of each trial of the paradigm used to study arithmetical abilities, with their associated timings.
For each problem presentation, the following variables were recorded:
• The two numbers to be multiplied and the incorrect result
• A product–size (LogPs) and a problem–difficulty (LogDiff) score, described next
• A binary variable indicating correct or incorrect answer
If R_c = a × b is the correct product for the multiplication problem a × b and R_d is the displayed incorrect result, then the product size is scored as LogPs = log(R_c). The score LogDiff is log(|1.25 R_c − (R_c + |R_c − R_d|)| / |1.25 R_c|), which measures the closeness of the incorrect result to the ±25% mark and represents the difficulty subjects would have in judging the correct answer as close vs. too big or too small.
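As a concrete illustration, the two scores can be computed directly from the trial variables. This is a minimal sketch assuming the natural logarithm; the function names are mine, and the worked trial is the 4 × 5 example from the text:

```python
import math

def log_ps(a, b):
    """Product-size score LogPs = log(R_c), with R_c = a * b."""
    return math.log(a * b)

def log_diff(a, b, r_d):
    """Problem-difficulty score LogDiff: closeness of the displayed
    incorrect result R_d to the +/-25% decision boundary around R_c."""
    r_c = a * b
    return math.log(abs(1.25 * r_c - (r_c + abs(r_c - r_d))) / abs(1.25 * r_c))

# Trial from the text: 4 x 5 = 20 with displayed result 23 ("close").
# The boundary lies at 25, so LogDiff = log(|25 - 23| / 25) = log(0.08).
```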
Data were acquired on a GE 3T MRI scanner (vh3) with a quadrature head coil. After localizer scans, a first anatomical, axial-oblique 3D-SPGR volume was acquired. Slab coordinates and matrix size corresponded to those applied during the subsequent fMRI runs using a 3D PRESTO BOLD pulse sequence [209] with phase navigator correction and the following specifications: echo time 40ms, repetition time 26.4ms, echo train length
17, flip angle 17◦, volume scan time 2.64 s, number of scans 280, session scan time 12:19 min, 3D matrix size 51 × 64 × 32, and isotropic voxel size 3.75 mm. At the end of the study, a sagittal 3D-SPGR scan was acquired with a slice thickness of 1.2 mm and in-plane resolution of 0.94 mm.
The first four fMRI scans were discarded, leaving 276 scans for analysis. Raw data were reconstructed off-line. The structural scans were bias-field corrected, normalized to MNI atlas space and segmented into grey and white matter, while the fMRI scans were motion corrected using linear registration and co-registered with the structural scan in SPM8 [66].
Further motion correction was performed using motion corrected independent component analysis (mcICA) [137]. The fMRI data were then de-noised using a wavelet–based Wiener
filter [4] and high-pass filtered to remove artifacts such as breathing, pulsatile effects, and scanner drift.
PART II
Functional Chronometry
CHAPTER 4
MENTAL CHRONOMETRY: THEORY
Man is enabled to find sense in this chaos of experience and discover the meaning and measure of this incomprehensive flux of perpetual ’flourishing and perishing’ which we call Time.
Dr. K. Bhaskaran Nair (1927–1990).
One important application of revealing the temporal aspects of mental processes has been the field of mental chronometry: the attempt to decompose a perceptual, cognitive or motor task into a sequence of processing stages on the basis of measured response times [61]. fMRI-based mental chronometry has the potential to provide a new type of information that goes beyond the identification of activated brain regions. It enables studying the dynamics of these activated regions during the specific processing stages of a mental task, and provides insight into the links between cognition, behavior and brain activity.
4.1 Significance
The importance of timing information comes when trying to determine the hierarchical structure of signal processing in the brain [88]. In a strictly serially connected neuronal
network, activation onset times would give direct information about the hierarchical position of that node within the whole processing chain. In reality, however, the situation is much more complicated, as the network contains feedback connections through which signals can be modulated at an earlier processing stage, or through which some nodes may display several activation “waves”.
Mental chronometry, therefore, studies the role of timing in cognition and the important time windows for different brain functions at a more macroscopic level. This includes understanding how to interpret the timing information in terms of serial versus parallel pathways, and in terms of hierarchical organization of cortical signal processing. The chronoarchitecture [10] of the cerebral cortex has been shown to be highly organized according to its functional modularity, in the sense that functionally related regions exhibit highly correlated and phase–locked metabolic fingerprints of activity10.
Also of interest is determining whether cortical activation sequences differ systematically in subjects with different psychological abilities. Considering brain function as a network of dynamic neural circuits that interact to perform computational tasks [33], it is to be expected that the similarities and differences between processing strategies are reflected even more clearly in the timing of functional recruitment than in the localization of the sites of cortical activation [88].
10 The term “chronoarchitecture” is used in contrast to cytoarchitecture and myeloarchitecture, which have failed to show distinctions within areas that other techniques, such as metabolic methods, have revealed to be functionally separate.
4.2 Mental Chronometry with BOLD
Initially, mental chronometry was based exclusively on analyzing the behavioral response time or reaction time (RT) as a function of the task condition [180]. More recently, behavioral RT information has been complemented with invasive [78] and non-invasive [156] measurements of brain activity.
The basic assumption in fMRI–based mental chronometry is that timing differences in observed hemodynamic response (HR) are attributable to the underlying neural events.
Assessing the degree to which this assumption holds, however, is not simple. A detailed description of the intrinsic physiological variability of the HR is lacking, leaving uncertainty concerning the accuracy of timing-based fMRI response measures. Repeated studies have shown a relatively large variation in the observed hemodynamic response: across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time [117]. Non-linearities in the transfer from a stimulus to the hemodynamic response have been demonstrated [140], questioning the validity of a time-invariant HRF.
Also, the current methods only measure timing differences across voxels, not within the same voxel. Therefore, the comparison of the delay of the response is most likely to make sense only at the same position in the brain or else it may only reflect differences in the microvasculature system across cerebral regions [136].
Despite the fact that the coupling of the neuronal activation and the measured signal is unknown, it has been shown that: 1) a stronger activation leads to an increased BOLD response; 2) a prolonged activation is accompanied by a prolonged response; and 3) a time
difference in the activation onset (e.g. between a sensory and an efferent area) is reflected by a temporal shift in the responses of these areas [117].
Menon et al. [155, 156] showed that the “slow” fMRI can trace sequences of neural events surprisingly well, with an effective temporal resolution of 100–200 ms, which is adequate for many mental chronometric measurements. By manually selecting a set of ROIs and averaging their fMRI time–courses collected from subjects performing the mental arithmetic task (cf. Section 3.2.3), Morocz et al. [163] demonstrated that the time–courses contain evidence of the cascade of functional recruitment, as shown in Fig. 4.1.
When such a temporal resolution is acceptable, using fMRI alone to gain information about time and space simultaneously has several practical advantages over EEG/MEG measurements. Firstly, as brain activation is a distributed phenomenon, it is difficult to measure latency distributions accurately from a few spatial measurements. Secondly, EEG and MEG measurements are not very sensitive to long-lasting, sustained processes and are better suited to detect effects that are closely time-locked to external stimulus onsets.
The interpretability of relative timing differences between arbitrary brain areas can be tested by using several tasks that exert a differential influence on temporal activation or that require execution of the same sub–processes but in a different temporal order. Similarly, by analyzing the dependence of onset latency on experimental parameters at a particular location, any observed systematic timing effect in that area must be attributable to neuronal dynamics because the biophysical parameters do not change [213].
[Figure annotation: 24 ROIs numbered 1–24, each labeled with an anatomical location (e.g. right/left IOG, left fusiform/ITG, left postcentral, left superior parietal cortex, right cerebellum, right IPS, left/right caudate head, anterior RCZ, left supramarginal gyrus, SMA, left M1) and a putative function (e.g. visual perception, attention modulation, rote memory table access, magnitude appreciation, estimation and evaluation, motor response); time axis 0–16 s.]
Figure 4.1: Cascade of Functional Recruitment with fMRI The time–courses from 24 manually selected ROIs are averaged and plotted from a subject performing the mental arithmetic task (cf. Section 3.2.3). The ROIs are shown by the filled balls in the glass–brain, and their time–courses over the duration of one trial are laid out below. A cascade in the recruitment of different functional modules can be observed from the time–course profiles.
4.3 Chronometry Methods
fMRI chronometry typically examines the onset time–difference across voxels via their cross-correlation function with a reference time–series [191]. The effect of behavioral parameters, measured by reaction time, on onset latency has also been studied in a single region [61] through a cross-correlation analysis. This approach does not isolate the component of the signal due to the stimulus of interest, and hence it is unclear how much the latency estimate is affected by confounding factors. In the case of periodic experimental paradigms, onset latencies have been measured by studying the phase of the Fourier transform or through the Hilbert transform of the time–series signal [190].
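A minimal sketch of the cross-correlation approach: the latency of one time–series relative to a reference is taken as the lag maximizing their normalized cross-correlation. The sinusoidal test signal and the function name are my own illustration, not from [191]:

```python
import numpy as np

def xcorr_lag(ts, ref, max_lag):
    """Lag (in scans) of `ts` relative to `ref`: the argmax of their
    normalized cross-correlation over lags -max_lag..max_lag."""
    ts = (ts - ts.mean()) / ts.std()
    ref = (ref - ref.mean()) / ref.std()
    lags = range(-max_lag, max_lag + 1)
    cc = [np.mean(ts[max(0, k):len(ts) + min(0, k)] *
                  ref[max(0, -k):len(ref) + min(0, -k)]) for k in lags]
    return int(np.argmax(cc)) - max_lag

# synthetic check: a sinusoid delayed by 3 scans
ref = np.sin(2 * np.pi * np.arange(200) / 40.0)
lag = xcorr_lag(np.roll(ref, 3), ref, max_lag=10)   # -> 3
```

In a real analysis the stimulus-locked component would be isolated first, precisely because of the confounds noted above.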
The activation latency at a voxel can also be estimated by including a first-order Taylor series expansion of the hemodynamic response function (HRF) [94] in a GLM analysis or by including an orthogonal basis derived from a spectrum of time-shifted HRFs [136].
Estimating HR parameters like amplitude, latency, dispersion, etc. through non-linear regression has been variously proposed [119, 124, 182], wherein the estimated HRF is a non-linear function of these parameters. While these methods can potentially yield a more detailed picture of the hemodynamics, they require expensive non-linear minimization steps at each voxel, and on noisy time–series the estimation algorithms might not converge to the global optimum. Moreover, the validity of these results is limited by the questionable biological accuracy of the hemodynamic models.
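The first-order Taylor approach admits a compact sketch: since h(t − δ) ≈ h(t) − δh′(t), regressing the observed response on h and its temporal derivative recovers the shift as −β₂/β₁. The gamma-shaped HRF and the 0.4 s shift below are illustrative assumptions, not SPM's canonical HRF:

```python
import numpy as np

def gamma_hrf(t):
    """Gamma-shaped HRF peaking near 6 s (illustrative, not SPM's canonical)."""
    h = (t ** 6) * np.exp(-t)
    return h / h.max()

t = np.arange(0.0, 30.0, 0.1)
h = gamma_hrf(t)                     # canonical regressor
dh = np.gradient(h, t)               # temporal-derivative regressor

delta = 0.4                          # true (assumed) onset shift in seconds
y = gamma_hrf(np.clip(t - delta, 0.0, None))   # latency-shifted response

# least-squares fit y ~ b1*h + b2*dh; h(t-d) ~ h(t) - d*h'(t) => d ~ -b2/b1
b1, b2 = np.linalg.lstsq(np.column_stack([h, dh]), y, rcond=None)[0]
latency = -b2 / b1                   # first-order estimate of the shift
```

The estimate is accurate only for shifts small relative to the HRF width, which is exactly the regime where the Taylor expansion holds.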
Data–driven multivariate chronometry has been performed using spatial ICA [10] to investigate the time–course variability in single trials and to detect voxels with unexpected temporal profiles, without requiring a priori knowledge about the shape, nature or coupling of the HR.
In the next chapter a tool for the visual exploration of the ordering in the cascades of functional recruitment, similar to that of Fig. 4.1, is proposed. Then in Chapter 6, a robust method to estimate the onset latency at each voxel in a GLM framework is developed.
CHAPTER 5
MENTAL CHRONOMETRY: A VISUAL ANALYSIS
The only reason for time is so that everything doesn’t happen at once.
Albert Einstein (1879–1955).
As imaging pulse sequences used for fMRI become more efficient in their temporal and spatial resolution, tools that can efficiently depict cascaded and serial brain recruitment during task performance become ever more important. Such mental activity road–maps, shown as time–series for specific brain regions as has been done for years in EEG and MEG, but now enhanced by the tomographic fidelity of MRI, can crucially enhance our understanding of brain physiology [118].
In addition to this, the commonly used statistical methods for fMRI analysis have a large number of parameters that need to be tuned, which in turn profoundly affect the detection of activated brain regions. Therefore, there is a pressing need for a visual analytics tool that allows the neurologist to visualize the raw data, in order to assess the fidelity and veracity of the results obtained from any type of fMRI analysis.
To address these concerns, this chapter presents a software tool to visually analyze the time dimension of brain function with a minimum amount of processing, allowing neurologists
to verify the correctness of the analysis results, and develop a better understanding of the temporal characteristics of the functional behaviour. The system allows studying time–series data through specific volumes–of–interest in the cerebral cortex, the selection of which is guided by a hierarchical clustering algorithm performed in the wavelet domain.
The organization of the chapter is as follows: the proposed solution is outlined in Section 5.1, while Section 5.2 covers the current literature in the domain of visual analysis methods for spatio–temporal data, with special attention to medical images. Section 5.3 introduces the method for automatically selecting a candidate set of VOIs. Here, I discuss the wavelet-based dissimilarity metric and the hierarchical clustering algorithm. Section 5.4 explains the tool and its usage. In Section 5.5 I present some results obtained by using the method to explore the mental arithmetic data–set (cf. Section 3.2.3). Finally, in Section 5.6, I conclude with some remarks on the tool and directions of further investigation.
5.1 Outline of Solution
5.1.1 Challenges
A major challenge in the time–dimension visualization of fMRI is the large number of voxels within the brain (∼10^5). It is obviously impossible to examine the time–series through every voxel individually. Typically, the user manually defines a Volume-of-Interest (VOI) in the brain, and computes the mean time–course through that VOI (e.g. Fig. 4.1 in Chapter 4). Given the limitations of existing tools, only a certain number of VOIs can be manually examined and compared in a practical fashion. Therefore, the investigator not only has to decide, a priori, which locations of the brain are “interesting”, but also the shape and extent of these regions, which have a profound influence on the resulting time–courses.
This limits the power of this avenue of exploration. Therefore, for a visual exploration tool to be useful it should:
(i) Provide a quantitative assessment of the quality of a VOI, in terms of the similarity of activity exhibited by the voxels contained therein.
(ii) Guide the user in selecting good VOIs, as defined by the above quality metric.
(iii) Take into account the specific experimental task and the neuro-physiological phenomena being investigated.
Another challenge in developing a meaningful visualization of fMRI data is the nature of the acquired signal itself. As discussed in Chapter 1, the time–series data at each voxel consists of four main components:
(a) A structural component, which represents anatomical features (much like a conventional MR image).
(b) The blood oxygenation level dependent (BOLD) signal, which measures brain metabolism and is the component of interest.
(c) Noise, which tends to be colored with a 1/f spectrum, where f is the frequency.
(d) Slowly varying drifts in the baseline signal.
Therefore, any naïve visualization based on the raw time–courses will be hard to decipher, and some processing of the data is required. However, it is imperative that this processing be kept to a minimum, in order to limit the bias that is inevitably introduced by any processing and to retain as much information as possible from the raw data.
5.1.2 Proposed Solution
The aim of the tool is to let the user examine the time–series fMRI data and study the temporal aspects of brain activity with a minimal amount of pre-processing and data manipulation. The solution is therefore posed in terms of the problem of VOI selection, by automatically determining a set of candidate VOIs such that the voxels in each VOI exhibit coherent activity with respect to the experimental task under consideration. These VOIs are determined by clustering together voxels with similar activations in a hierarchical fashion. By navigating through the hierarchy, the user is able to select the set of VOIs that matches his expert intuition, in terms of shape, location, and within-cluster error. An overview of the method is shown in Fig. 5.1.
One methodological contribution is a distance metric that quantifies the dissimilarity between voxels based on their time–series. The proposed metric adaptively extracts features from the time–series that correspond to the experimental tasks and the neurophysiological phenomena under study. This metric operates in three steps:
(i) It first transforms the acquired time–series into the wavelet domain, thereby de-correlating its different components.
(ii) It projects the wavelet coefficients into a lower dimensional subspace spanned by the features of interest.
[Figure content: processing pipeline: the fMRI volume time–courses are wavelet transformed and projected onto a low-dimensional subspace, followed by hierarchical clustering and VOI hinting; the subspace basis is built by wavelet-transforming the experiment stimulus function convolved with different HRFs.]
Figure 5.1: Overview of Visual Analytics Tool The pipeline of the processing stages of the visual analytics tool for examining the temporal aspects of fMRI data.
(iii) It then computes a weighted Euclidean distance between the time–series of two voxels in this lower dimensional subspace.
The weights are selected to emphasize certain features and de-emphasize others, depending upon the experimental task. The tool also allows manual delineation of a VOI, and computes its quality using this dissimilarity metric.
5.2 Related Work
There is a large body of work for visualizing data–sets with temporal dependencies. Most of the prior art in spatio–temporal visualization has been concerned with the problem of dealing with massive data sizes and rendering them accurately and efficiently [85]. It has also been studied in the context of flow-field visualization, with the focus mainly on extracting the relevant features from the data and visualizing them in an interactive fashion
[144, 217]. Aigner et al. [3] survey methods for time–series data through the lens of
Visual Analytics. Additionally, they provide a taxonomy that includes structure of time, the data characteristics and abstraction, and representation (esp. dimensionality). As per their proposed categorization, fMRI data can be classified as being linearly ordered time points, univariate and spatial, dynamic and three-dimensional.
In the case of medical data–sets, existing methods concentrate on a comprehensive evalu- ation of the temporal behavior of the data. Tory et al. [208] discuss several methods for
visualizing temporal medical data. The authors discuss the efficacy of surface-based and
iconic visualization methods to produce an animation of consecutive time steps that depict
temporal changes of intensity and signal gradient. There are other techniques that use di-
rect volume rendering to provide static visualizations through the construction of transfer
functions that best represent the temporal characteristics of the data, thereby effecting a
visual and implicit clustering/segmentation of the data in a temporal feature space [217].
Interfaces and interaction often play an important role [210].
In the case of fMRI, there are many tools currently available for analysis and visualization,
including AFNI [45], BrainMap [123], Brede Toolbox [167], SPM [66], FSL [203]. While
all these tools allow for manual selection of VOIs and then display the aggregate temporal
response of these VOIs, none of them provides an automated VOI selection system along the lines proposed here.
Wavelets are extensively used in fMRI, along either the spatial or the temporal dimension. Spatially, they are used to obtain a sparse representation of the activation map, on whose wavelet coefficients the statistical significance of activation is then computed [212].
Alternatively, along the temporal dimension, wavelets are used for de-noising and whitening the time–series data [29], and for activity estimation [158].
The concept of measuring similarity between the time–series of voxels by projecting onto a linear subspace spanned by the “signal of interest” is common [83]. However, in these cases the projection is computed in the temporal domain, as the correlation of the acquired time–series with a reference (ideal) response. This has the disadvantages that the noise is still colored, and that there is no ability to emphasize important features in the signal, like the natural scale of the HRF. In contrast, by moving to the wavelet domain, we are able to effectively whiten the noise, thereby removing any spurious correlations in the data, and we can also determine the natural scales at which the HRF has greatest energy.
5.3 VOI Selection
The problem of automatically selecting VOIs is two-fold:
(a) Determining the number, location, size and shape of VOIs.
(b) Determining the coherency or goodness of the selected VOIs.
For the first problem, there are, in general, no computational solutions and only the expert user can decide this depending upon the phenomena being studied. For this purpose, we use a hierarchical clustering of the voxels, and allow the user to navigate this hierarchy in order to select the correct number, as per his requirements. This is further discussed in
Section 5.3.2.
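Such hierarchy-based VOI hinting can be sketched with SciPy's agglomerative clustering. The toy voxel features and the choice of Ward linkage are my assumptions for illustration; the actual metric is the wavelet-domain one of Section 5.3.1:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# toy "voxel" feature vectors: two coherent groups, stand-ins for the
# projected wavelet coefficients of Section 5.3.1
feats = np.vstack([rng.normal(0.0, 0.1, (20, 5)),
                   rng.normal(1.0, 0.1, (20, 5))])

Z = linkage(feats, method="ward")                  # full merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")    # user cuts it at 2 VOIs
```

Navigating the hierarchy corresponds to varying the cut level t; each resulting cluster is one candidate VOI.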
To deal with the second problem, the notion of the goodness of a VOI requires a measure of similarity of the voxels in the VOI. This metric should satisfy the following properties:
(i) It should quantify the dissimilarity of the activation patterns of the voxels, as given by their time–courses.
(ii) It should be robust against the confounds in the acquired signal, like noise, drift, inhomogeneity, etc.
(iii) It should also quantify the spatial proximity among voxels, encoding the belief that nearby voxels have a higher likelihood of exhibiting similar behaviour, as compared to distant voxels.
The proposed dissimilarity metric satisfies these properties as explained next.
5.3.1 Distance Metric
One of the key requirements of the metric is robustness to noise and other confounds in the acquired time–series data. As briefly mentioned in the introduction to this chapter, the measured signal has the following components:
(i) The structural component.
(ii) The BOLD signal. While the exact shape of the brain hemodynamic response to a stimulus is highly variable across brain regions, stimuli, and subjects, a few typical (canonical) hemodynamic response functions (HRFs) are shown in Fig. 5.2(a).
(iii) Instrumental noise and artifacts due to subject head motion and breathing, which tend to exhibit a 1/f spectrum associated with fractional Brownian motion (fBm) [29].
(iv) Slowly varying drifts in baseline signal intensity, due to temporal variations in the operating characteristics of the MRI scanner and changes in subject physiology like blood pressure, etc.
Figure 5.2: The Raw fMRI Time–Courses. Fig. (a): The shapes of a few typical hemodynamic response functions (HRFs); in reality, the exact shape of the HRF is highly variable. Fig. (b): The raw time–courses (mean shifted) through two voxels (blue and red), both presumably activated. The solid line shows the measured time–course; the dashed line is an estimate of the baseline drift of the intensity over time.
An example of the raw time–courses (mean shifted) through two activated voxels is shown in Fig. 5.2(b). Here, the problems of drift and spatial inhomogeneity in the signal baseline are evident.
The observed time-signal yx(t) at each voxel x is generally modeled as [66]:
y_x(t) = µ_x + θ_x(t) + s_x(t) + υ_x(t),   t = 1, ..., T. (5.3.1)
Here, s_x(t) is the BOLD component, θ_x(t) is the baseline drift, µ_x is the structural component intensity at the voxel, and υ_x(t) is correlated noise with the 1/f spectrum of long-memory fBm noise.
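The four components of eqn. 5.3.1 can be simulated to build intuition. All numbers here are toy values of my choosing, and the 1/f noise is generated by spectral shaping of white noise:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 276                                    # scans retained for analysis

mu_x = 800.0                               # structural intensity mu_x
t = np.arange(T)
theta_x = 5.0 * np.sin(2 * np.pi * t / T)  # slow baseline drift theta_x(t)

s_x = np.zeros(T)                          # BOLD component s_x(t): toy boxcars
s_x[20:40] = s_x[120:140] = 2.0

# 1/f ("pink") noise upsilon_x(t) via spectral shaping of white noise
w = np.fft.rfft(rng.normal(size=T))
f = np.fft.rfftfreq(T)
f[0] = f[1]                                # avoid division by zero at DC
upsilon_x = np.fft.irfft(w / np.sqrt(f), n=T)
upsilon_x /= upsilon_x.std()               # unit-variance noise

y_x = mu_x + theta_x + s_x + upsilon_x     # observed time-series, eqn (5.3.1)
```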
This structure of the fMRI signal motivates the use of a wavelet representation for three main reasons. One, transforming to the wavelet domain gives a sparse representation of the signal, and allows us the flexibility to weight different aspects of the signal depending upon their scale-space characteristics. Two, it has been shown that the different components of the signal occupy different regions of the time-frequency plane [158]. The baseline drift is restricted to the wavelet coefficients at large scales, since it arises from phenomena that vary gradually over relatively large periods of time as compared to the HRF. By selecting a wavelet basis with p vanishing moments and assuming a polynomial approximation of the drift with order less than p, θ_x(t) will have a sparse representation in this basis. Three, the wavelet transform provides an approximation of the Karhunen-Loève transform (KLT) of the fBm noise. The KLT de-correlates a random process by projecting it onto an orthogonal basis given by the eigenfunctions of the auto-covariance kernel. Specifically, if K_y(t_1, t_2) = E[y(t_1)y*(t_2)] − E[y(t_1)]E[y(t_2)] is the auto-covariance kernel of a stochastic process y(t), then the KLT is the decomposition of the process onto an orthogonal basis e_i(t), i = 1 ... ∞, such that y(t) = Σ_i Z_i e_i(t). Here the e_i(t) are the eigenfunctions of K_y(t_1, t_2), and the Z_i are uncorrelated random variables. Therefore, the wavelet transform W{υ_x} of the fBm noise υ_x(t) is composed of coefficients which are almost de-correlated11.
The Cohen-Daubechies-Feauveau 9/7 bi-orthogonal spline wavelets with p = 4 vanishing
moments are used here because they are symmetric with finite support and linear phase,
and do not require special treatment at boundaries. This is implemented using a lifting
scheme which gives a 2× speedup over the standard wavelet transform algorithms [145].
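The drift-suppression property can be sketched with the simplest lifting scheme, the Haar wavelet (p = 1 vanishing moment), standing in here for the CDF 9/7 transform actually used (whose analysis wavelet, with p = 4, annihilates polynomial drift up to order 3):

```python
import numpy as np

def haar_lifting(x):
    """One level of the Haar DWT via lifting: predict odds from evens,
    then update evens to preserve the running mean."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even           # predict step
    approx = even + detail / 2.0  # update step
    return approx, detail

# a constant baseline (polynomial of order 0 < p = 1) is annihilated in
# the detail coefficients and confined to the approximation band
approx, detail = haar_lifting(np.full(64, 3.0))
```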
Applying the wavelet transform W to both sides of eqn. 5.3.1, we get:
W{y_x} = W{θ_x} + W{s_x} + W{υ_x}, (5.3.2)
where y_x = (y_x(1), y_x(2), ..., y_x(T))′, etc., and where the constant structural component µ_x contributes only to the coarsest approximation coefficients.
Now, in order to select the optimal regions of the time-frequency plane and suppress the
undesired components of the signal without compromising the desired component, W{yx}
is projected into a lower dimensional subspace spanned by the features of interest, as de-
fined by the experimental task. The experimental task is specified by the stimulus function
p(t) giving the onset and duration of each stimulus presented to the subject. For example, it
will be a train of Dirac-δ functions in the case of event stimuli, or a train of box-cars in the case of persistent stimuli.
The set of expected brain responses $\{r_{h_i}(t);\ t = 1 \ldots T\}$ for different hemodynamics is then computed by convolving the stimulus function $p(t)$, $t = 1 \ldots T$, with a set of typical HRFs $\{h_i(t);\ i = 1 \ldots B\}$, $B \ll T$, as $r_{h_i}(t) = h_i \star p(t)$. Fig. 5.3(a) shows an experimental task, represented by a Dirac-δ train (dark blue) convolved with a few HRFs.
¹¹ Strictly, the correlation of the wavelet coefficients $w^{\upsilon}_{j,k}$ and $w^{\upsilon}_{j,k'}$ decays like $O(|2^{-j}k - 2^{-j}k'|^{2-p})$. Also, the noise coefficients $[c^{\upsilon}_{J,0}, w^{\upsilon}_{J,0}, \ldots, w^{\upsilon}_{1,T/2-1}]^T$ are well approximated by normally distributed independent random variables, with co-variance matrix $\Sigma_\upsilon = \mathrm{diag}(\sigma_J^2, \ldots, \sigma_1^2)$ [29].
This set of expected responses $R = [r_{h_1}\, r_{h_2} \ldots r_{h_B}]$, where $r_{h_i} = (r_{h_i}(1)\; r_{h_i}(2) \ldots r_{h_i}(T))'$, defines a lower ($B$) dimensional subspace $H$ spanned by the features of interest from the experimental task. An orthogonal basis for this subspace $H$ is obtained by the singular value decomposition $R_{T \times B} = U \Lambda V'$, where $U = [u_1 u_2 \ldots u_T]$ is a $T \times T$ orthonormal matrix spanning the column space of $H$, $\Lambda$ is a $T \times B$ diagonal matrix of the weights of each basis vector, and $V'$ is a $B \times B$ orthonormal matrix spanning the row space. The singular values $\lambda_{i,i}$ show that most of the volume spanned by $H$ is concentrated in the first few basis vectors of $U$. Fig. 5.3(b) shows the percentage volume with respect to the number of basis vectors of $U$ for a particular experiment.
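The construction of the task subspace can be sketched in numpy as follows. All sizes, onset times, and HRF shapes below are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

# Hypothetical sizes: T scans, B candidate HRFs (illustrative only).
T, B = 200, 16

# Stimulus function: a Dirac-delta train at assumed onset times.
p = np.zeros(T)
p[::20] = 1.0

# A family of gamma-shaped curves with varying peak times -- a
# stand-in for a set of typical HRFs.
t = np.arange(24, dtype=float)
hrfs = [t**a * np.exp(-t) for a in np.linspace(4.0, 7.0, B)]

# Expected responses r_{h_i} = h_i * p, stacked as columns of R.
R = np.column_stack([np.convolve(p, h)[:T] for h in hrfs])

# Orthogonal basis for the task subspace H via the SVD.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Retain the fewest bases capturing, say, 99% of the "volume"
# (cumulative singular values), as in Fig. 5.3(b).
frac = np.cumsum(s) / s.sum()
B_tilde = int(np.searchsorted(frac, 0.99) + 1)
U_tilde = U[:, :B_tilde]   # typically far fewer than B columns
```

Projecting a voxel's wavelet-domain time-course onto `U_tilde` then suppresses the signal components outside the task subspace.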
Since $\tilde{U}$ defines an orthogonal basis for subspace $\tilde{H}$, a Euclidean distance metric can be defined on it as:

$$||y_{x_1} - y_{x_2}||_{\tilde{H}} = ||\tilde{y}_{x_1} - \tilde{y}_{x_2}||_2 \quad (5.3.3)$$
which is compatible with the Euclidean distance metric in the wavelet space. The drawback
of this metric is that it takes into consideration neither the different importances of the individual bases $\tilde{u}_i$ of $\tilde{U}$, as expressed by the diagonal matrix $\Lambda$, nor the physical proximity between the voxels $x_1$ and $x_2$. Therefore, we augment this metric to incorporate these characteristics as:

$$||y_{x_1} - y_{x_2}||^2_{\tilde{H}} = \frac{||x_1 - x_2||^2_2}{\theta^2} + (\tilde{y}_{x_1} - \tilde{y}_{x_2})'\, \tilde{\Lambda}\, (\tilde{y}_{x_1} - \tilde{y}_{x_2}), \quad (5.3.4)$$

where $\tilde{\Lambda}$ is a $\tilde{B} \times \tilde{B}$ diagonal matrix containing the $\tilde{B}$ largest singular values from $\Lambda$. Also, $\theta$ is a scaling factor that weights the spatial proximity relative to signal similarity, and is a user-tunable parameter of the tool.
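A minimal numpy sketch of the augmented metric of eqn. 5.3.4 (variable names are illustrative; `U_tilde` and `lam_tilde` stand for the retained bases and the diagonal of $\tilde\Lambda$):

```python
import numpy as np

def dissimilarity(x1, x2, w1, w2, U_tilde, lam_tilde, theta):
    """Squared dissimilarity of eqn. 5.3.4 between two voxels.

    x1, x2     : 3-vectors of voxel coordinates
    w1, w2     : wavelet-domain time-courses of the two voxels
    U_tilde    : T x B~ matrix of retained subspace bases
    lam_tilde  : B~ retained singular values (diagonal of Lambda~)
    theta      : user-tunable spatial scaling factor
    """
    y1 = U_tilde.T @ w1            # project onto the task subspace
    y2 = U_tilde.T @ w2
    d = y1 - y2
    spatial = np.dot(x1 - x2, x1 - x2) / theta**2
    return spatial + d @ (lam_tilde * d)
```

A small θ makes spatial separation dominate, so only near-neighbours are candidates for merging; a large θ lets signal similarity dominate.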
(a) Stimulus function p(t) and a few response curves $r_{h_i}(t)$
(b) The percentage volume with respect to bases retained
Figure 5.3: Fig. (a) The stimulus function p(t) as a Dirac-δ train (dark blue), and a few curves $r_{h_i}$ from the response set, obtained by convolving p(t) with a theoretical HRF $h_i(t)$. Fig. (b) The percentage of the whole volume of subspace H with respect to the number of basis vectors retained.
5.3.2 Hierarchical Clustering
Candidate VOIs are built as clusters of voxels exhibiting similar activity as defined by the dissimilarity metric of Eqn. 5.3.4. These voxels are clustered together with the hierarchical agglomerative clustering (HAC) given in Algorithm 5.1.
1 begin  // Initialization
2   For each voxel x_i, create one cluster c_i of size n_i = 1
3   Each c_i is associated with a time–course Y[i]
4 end
5 while Number of clusters greater than specified value do
6   Find two clusters c_i and c_j that are spatially adjacent to each other and merge them into a new cluster c_k = (c_i, c_j), if and only if Var[c_k] is minimum over all i, j
7   Remove clusters c_i and c_j from the set of clusters, and add c_k
8 end
Algorithm 5.1: Hierarchical Clustering Algorithm
If a cluster $c_k$ has $n_k$ voxels, then the cluster mean $\mu_k$ has physical location $x_{\mu_k} = \sum_{i=1}^{n_k} x_i / n_k$ and feature vector $\tilde{y}_{\mu_k} = \sum_{i=1}^{n_k} \tilde{y}_i / n_k$. The mean of a new cluster $c_k = \{c_i, c_j\}$ can be efficiently computed as

$$\mu_k = \frac{n_i \mu_i + n_j \mu_j}{n_i + n_j}. \quad (5.3.5)$$
The variance of a cluster $c_k$ under the dissimilarity metric is defined as:

$$\mathrm{Var}[c_k] = \frac{1}{n_k} \sum_{i=1}^{n_k} ||y_{x_i} - y_{\mu_k}||^2_{\tilde{H}}, \quad (5.3.6)$$

where $\mu_k$ is the mean of $c_k$. By the variance separation theorem, the combined variance of two clusters $c_i$ and $c_j$ can be efficiently computed as:

$$\mathrm{Var}[c_k = \{c_i, c_j\}] = \frac{n_i\, \mathrm{Var}[c_i] + n_j\, \mathrm{Var}[c_j]}{n_i + n_j} + \frac{n_i n_j\, ||y_{\mu_i} - y_{\mu_j}||^2_{\tilde{H}}}{(n_i + n_j)^2} \quad (5.3.7)$$
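The incremental updates of eqns. 5.3.5 and 5.3.7 can be sketched as follows. This is a simplified version using a plain Euclidean feature metric rather than the full $||\cdot||_{\tilde H}$ metric:

```python
import numpy as np

def merge_stats(n_i, mu_i, var_i, n_j, mu_j, var_j):
    """Combine cluster means (eqn. 5.3.5) and variances (eqn. 5.3.7)
    without revisiting the member voxels.

    mu_* are feature vectors; var_* are scalar in-cluster variances,
    here under a plain Euclidean metric for illustration.
    """
    n_k = n_i + n_j
    mu_k = (n_i * mu_i + n_j * mu_j) / n_k
    d2 = np.dot(mu_i - mu_j, mu_i - mu_j)   # squared distance of means
    var_k = (n_i * var_i + n_j * var_j) / n_k + n_i * n_j * d2 / n_k**2
    return n_k, mu_k, var_k
```

This is what makes step 6 of Algorithm 5.1 cheap: the candidate merged variance is computable in O(1) from the two clusters' summary statistics.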
Since the dissimilarity metric penalizes voxels that are far away in physical space, and
hence are not likely to be clustered together, the clustering algorithm can be accelerated by
using an octree [5] decomposition of the physical brain volume as per Algorithm 5.2.
If Γ1 is the cluster tree generated by hierarchical clustering on the full volume, and Γ2
is that generated by clustering on the octree, then for each cluster c in Γ1, we define the
1 begin
2   Start at the lowest level L = 0 of the octree (i.e. leaf nodes)
3 end
4 while Root node of octree is not reached do
5   Perform hierarchical clustering in each node independently to get a set of clusters {c_i} for each node
6   If all new clusters have an in-cluster variance greater than a certain threshold κ_L, then union all the per-node cluster hierarchies, and move to the next level L + 1 in the octree
7 end
Algorithm 5.2: Octree Clustering Algorithm
cluster error as $\rho(c) = \min_{d \in \Gamma_2} |(c \setminus d) \cup (d \setminus c)| / |c \cap d|$, where $\setminus$ is set difference and $|\cdot|$ is set cardinality. Therefore, $\rho(c)$ measures the smallest error in the overlap of the voxels of
cluster c ∈ Γ1 with all clusters in Γ2. In our experiments, this acceleration resulted in a 3×
speedup in clustering and a cluster error of less than 15%, on average.
5.4 User Interaction
The layout of the user interface is shown in Fig. 5.4, and it consists of the following salient
components:
(a) The main window showing a 3D visualization of a high-resolution MR image of the
brain (as volumetric, cortical surface or orthogonal cutting planes). The user can switch
between these three views, and the VOIs are overlaid as 3D blobs on this rendering.
It also has the three 2D orthographic views (Sagittal, Coronal, Axial). The VOIs are
displayed in this view as 2D blobs. (cf. Fig. 5.4(a))
(b) Functionality to visualize the mean time–series of each cluster in cine (video) mode,
by modulating the intensity of the cluster with the value of the mean time–series at
that point. This feature gives the user additional power in searching for patterns in the
temporal responses across different regions of the brain.
(c) A panel showing the mean time–course (blue) through the selected VOIs. Around the
mean time–course, it also shows an envelope (grey region bounded by black lines) of
±1 standard deviation at each time point, computed from the time–courses of all the
voxels in the cluster as follows: if $\{x_i\}$ is the set of $n$ voxels in a VOI, with measured time–series $y_{x_i}(t)$, then the standard deviation for the time–series of the VOI is $\sigma(t) = \left( \frac{1}{n}\sum_i [y_{x_i}(t)]^2 - \left[\frac{1}{n}\sum_i y_{x_i}(t)\right]^2 \right)^{1/2}$. This envelope gives the user a rough estimate of the dispersion of the time–series within the VOI and serves as a visual indication of the VOI quality. (cf. Fig. 5.4(b))
(d) A tree view for navigating through the cluster hierarchy to select among the automat-
ically generated VOIs. Initially, the tool suggests a set of clusters (meeting a certain
in-cluster variance threshold) as candidate VOIs. The user can examine these clusters,
their time–courses and their associated quality metrics (in-cluster variance and the ±1
standard deviation envelope). If he is not satisfied with them, he can navigate either
up the hierarchy merging multiple clusters to get a larger VOI, or down the hierarchy
sub–dividing a cluster to get smaller VOIs. (cf. Fig. 5.4(c))
(e) VOI selection tools, similar to most standard MRI viewing tools, such as rectangle,
ellipse, polyline and free–form selection. With these tools, the user can manually select
a VOI, and view its mean time–course and associated quality metrics.
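The ±1 std. dev. envelope described in item (c) above can be sketched as follows (an illustrative numpy version, not the tool's implementation):

```python
import numpy as np

def voi_envelope(Y):
    """Mean time-course and +/-1 std. dev. envelope for a VOI.

    Y : n x T array, one row per voxel time-series in the VOI.
    Returns (mean, lower, upper) time-courses of length T.
    """
    mean = Y.mean(axis=0)
    # population standard deviation, matching sigma(t) in the text
    sigma = np.sqrt((Y**2).mean(axis=0) - mean**2)
    return mean, mean - sigma, mean + sigma
```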
5.5 Results
This section presents an application of the tool to help an investigator refine hypotheses, validate the results of automated analyses, and better understand the temporal characteris- tics of the hemodynamic response function for the mental arithmetic task (cf. Section 3.2 of Chapter 3).
One application of this tool is to determine the correct parameters for an SPM–style analysis of the data. For example, consider a hypothesis test for finding those voxels activated only during visual presentation, and not during aural presentation. Fig. 5.5 shows the generated activation maps for different parameters of the analysis. Here α is the statistical significance level, while FDR is the False Discovery Rate method for performing correction for the multiple comparisons problem (cf. Section 2.4). It is not easy to determine the validity of the activations in Fig. 5.5(a) vs. (b). The aim is to eliminate spurious activation foci without discarding true activations. For example, consider the area highlighted by the red arrow in Fig. 5.5(b). It is marked active in Fig. 5.5(b) but not in Fig. 5.5(c). However, from the fact that the VOI around this region had an in-cluster variance of ∼ 20 and a ±1 std. dev. envelope of almost 10× the mean time–course intensity (Fig. 5.5(d)), we were able to confirm that it was in fact an incorrect activation, corroborated by the fact that the region (ldPFC) is not known to be associated with visual tasks, but rather with number processing and working memory. Compare this with a VOI of a truly activated region in the visual area of the brain (Figs. 5.5(e)–(f)), with an in-cluster variance of ∼ 5 and a ±1 std. dev. envelope < 25% of signal intensity.
Another significant application of the tool is to study the chronology of the recruitment of functional substations to perform a task. For our data–set, by selecting from the VOIs suggested by the tool, we were able to observe a temporal cascade in the brain activation pattern that helped us not only corroborate extant theories but also refine our understanding of brain function [163]. This temporal ordering of activity patterns in brain regions is shown in Fig. 5.6.
5.6 Conclusion
With this tool, we have tried to address the important problem of visualizing the time di- mension of fMRI data, in order to understand temporal relationships in brain function. To achieve this, a solution to the VOI selection problem was proposed by merging an algo- rithmic VOI selection system with a user-driven feedback in order to leverage the user’s expert understanding. An activation dissimilarity metric was developed that captured the context of the neuro–functional phenomena being investigated, through a transform into wavelet space and a projection onto a subspace spanned by the expected behaviour. The tool presents a candidate set of VOIs to the user through a hierarchical clustering procedure, and the user can navigate through this hierarchy to select VOIs. This VOI selection is per- formed interactively, where both the quality metrics of the VOI and the expert knowledge of brain function guide the user in his visual analysis task.
Further refinements must include more intuitive and navigable methods of presenting the
VOI hierarchy tree, through the use of tree-maps and similar abstractions. One aim was to minimize the bias introduced into any analysis by the data processing step and retain as much information from the raw data as possible. Though we believe this aim was achieved,
a method for quantitatively estimating the biases would be desirable. Of interest also are
VOI selection methods that do not require knowledge of the experimental paradigm to judge time–course similarity such as ICA based methods.
(a) Main Viewer Window
(b) Cluster time–courses (c) Hierarchy Tree Navigation Pane
Figure 5.4: User Interface of the Visual Analytics Tool Fig. (a) The main window showing the structural volume overlaid with the clusters in 3D and 2D orthographic views. Fig. (b) Cluster time–series displays. Fig. (c) Navigation pane for the cluster hierarchy.
(a) α = 0.001 with no correction for multiple comparisons (b) α = 0.01 with FDR correction
(c) VOI around an inactive region (d) Mean time–course and ±1 std. dev. envelope
(e) VOI around an active region (f) Mean time–course and ±1 std. dev. envelope
Figure 5.5: Visual Confirmation of SPM Results Fig. (a)-(b) Maximum intensity projections of the activity maps for two different settings of the analysis procedures. Darker shades of grey indicate higher levels of activation. Fig. (c) A VOI selected from the set generated by the tool. Fig. (d) The mean time–course of the VOI and its ±1 std. dev. envelope, suggesting that this region of the brain is not activated. Fig. (e) Another VOI selected from the set generated by the tool. Fig. (f) The mean time–course of the VOI and its ±1 std. dev. envelope, suggesting that this region of the brain may be activated.
(a) Selected VOIs (orthogonal projections)
(b) The corresponding mean time–courses of the VOIs
Figure 5.6: Visual Analysis of Recruitment Cascade Fig. (a) Six VOIs (as 3D blobs) overlaid on the three orthogonal planes (cutting planes) through the structural volume. Fig. (b) The corresponding mean time– courses through the selected VOIs, along with their ±1 std. dev. envelopes. Here, a temporal cascade (green lines) in activation can be seen.
CHAPTER 6
MENTAL CHRONOMETRY: MEASURING LATENCY IN BRAIN ACTIVITY
Statistics has been the most successful information science. Those who ignore statistics are condemned to reinvent it.
Bradley Efron (1938–)
Mass–univariate analysis methods based on general linear models (GLM) of brain response
and linear least squares estimation of their parameters are very popular because of their
computational efficiency, statistical simplicity and explanatory power (cf. Section 2.4 of
Chapter 2). While most GLMs are used to estimate the amplitude of the hemodynamic
response (HR) to a stimulus at each voxel, it is possible to estimate the response latency
using a first-order Taylor series expansion of the hemodynamic response function (HRF) in the GLM [94]. This estimator, however, is numerically unstable and biased. Here, we suggest a low-bias estimator for latency and provide an analytical formulation for its variance, needed for deriving confidence intervals.
6.1 Outline of Solution
6.1.1 Motivation
Using GLM methods, it is possible to study the effects of experimental parameters on func- tional activation, in one of two ways: a) parametric effect analysis or b) factorial analysis.
The first method is used when testing whether a real-valued (interval) experimental param- eter p has a statistically significant effect f(p) on the amplitude of activation by adding a regressor weighted by f(p) (typically, a polynomial function) and testing its effect. One drawback is that the correct relationship f between the parameter and amplitude may not be known a priori. Moreover, it assumes that the parameter modulates only the amplitude of the HR while all other aspects remain unchanged, which is known not to be the case [156].
Therefore, it only tests the effect of the parameter on the amplitude, not the latency of the response. The other alternative, a factorial analysis is used with categorical parameters, by adding a regressor corresponding to each level of the variable. The difference in response at each level can be used to deduce the presence of an effect on both amplitude and latency.
While this method does not suffer from the drawbacks of the parametric effect analysis, it cannot be used for real-valued parameters.
6.1.2 Proposed Solution
Here, a method is proposed that combines the strengths of both approaches by measuring the effect of an interval parameter on both HR amplitude and latency, without requiring that the relationship to test for be specified a priori. The idea, as explained in Section 6.2, is to quantize the real–valued parameter into a finite number of levels and analyze it with
a factorial design. The loss in statistical power that would result from such a partitioning of the design is avoided by a regularization of the estimation procedure which significantly improves the quality of the inferences.
In section 6.3 the method is validated on simulated data and is applied to the study for visuo–spatial working memory, as described in Section 3.1.
6.2 Method
Let $s_i(t)$, $i = 1 \ldots q$, be the stimulus function representing the onsets and durations of the neurological stimuli corresponding to a task of type $i$. In conventional analysis of fMRI data, the following two assumptions are made: a) the HR is linear; b) the HRF is spatially and temporally invariant, leading to the following model for the observed signal $y(t)$ at each voxel:

$$Y(t) = \sum_{i=1}^{q} \left[ \beta_i\, x_i(t) + \gamma_i\, \dot{x}_i(t) \right] + \epsilon(t). \quad (6.2.1)$$

Here, $x_i(t) = s_i(t) \star h(t)$, $i = 1 \ldots q$, is the expected BOLD response (with no lag) to $s_i(t)$, obtained by convolving it with a typical HRF $h(t)$. By including the first-order Taylor series expansion $x_i(t + \tau) \approx x_i(t) + \tau \dot{x}_i(t)$, the model is able to explain a certain amount of delay in the observed response. The coefficient of regressor $x_i$ is $\beta_i$ and that of $\dot{x}_i$ is $\gamma_i$. The noise term $\epsilon(t)$ is assumed to be normal and colored, and is typically modeled as an AR(1) process.
If $T$ is the number of fMRI scans in the session, the GLM can be expressed in matrix notation as

$$y = X\vec{\beta} + \epsilon,$$

where $X$ is the $T \times 2q$ design matrix $X = [x_1\, \dot{x}_1 \ldots x_q\, \dot{x}_q]$, with $x_i = (x_i(1) \ldots x_i(T))'$. Also, $\vec{\beta} = (\beta_1, \gamma_1, \ldots, \beta_q, \gamma_q)'$ is the coefficient vector, and $\epsilon = (\epsilon(1) \ldots \epsilon(T))'$ is the noise, distributed as $\mathcal{N}(0, \sigma^2\Sigma)$.

The Gauss-Markov estimate is

$$\hat{\vec{\beta}} = (X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}y \quad \text{with} \quad \mathrm{Var}[\hat{\vec{\beta}}] = \sigma^2 (X'\Sigma^{-1}X)^{-},$$

where $^{-}$ denotes the pseudo-inverse operation.
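A numpy sketch of this Gauss-Markov (generalized least squares) estimate; the residual-based estimator of $\sigma^2$ used below is a standard choice assumed here, not spelled out in the text:

```python
import numpy as np

def gauss_markov(y, X, Sigma):
    """Gauss-Markov estimate beta = (X' S^-1 X)^- X' S^-1 y and its
    covariance sigma^2 (X' S^-1 X)^-, with sigma^2 estimated from
    the residuals (assumed convention)."""
    Si = np.linalg.inv(Sigma)
    A = np.linalg.pinv(X.T @ Si @ X)     # pseudo-inverse, as in the text
    beta = A @ X.T @ Si @ y
    resid = y - X @ beta
    dof = len(y) - np.linalg.matrix_rank(X)
    sigma2 = (resid @ Si @ resid) / dof
    return beta, sigma2 * A
```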
6.2.1 Robust Estimation of Latency
This model provides an estimate of the hemodynamic latency $\tau_i$ as [94]:

$$\tau_i \approx \frac{2\alpha_1}{1 + \exp(\alpha_2 \rho_i)} - \alpha_1, \quad \text{where} \quad \rho_i = \frac{\gamma_i}{\beta_i}. \quad (6.2.2)$$
The non-linear transformation through the logistic function is used to correct for the error
due to the neglected higher order terms of the Taylor expansion, and the values of α1, α2
are determined empirically.
It can be seen, however, that the estimate $\hat\rho_i = \hat\gamma_i / \hat\beta_i$ is Cauchy distributed and therefore biased. It also becomes numerically unstable when $\hat\beta_i$ is small.

Therefore, we propose a low-bias and stable estimator as follows. Since $x_i(t)$ is orthogonal to $\dot{x}_i(t)$, $\hat\beta_i$ and $\hat\gamma_i$ are Gaussian variables with correlation roughly zero, indicating that they are almost independent. Therefore $\mathrm{E}[\hat\rho_i] \approx \mathrm{E}[\hat\gamma_i]\,\mathrm{E}[1/\hat\beta_i]$. Taking a first-order Taylor series expansion of $1/\hat\beta_i$ about $\beta_i$, and using the fact that it is unbiased, we get:
$$\mathrm{E}[\hat\rho_i] = \rho_i \left( 1 + \frac{\mathrm{Var}[\hat\beta_i]}{(\hat\beta_i)^2} \right) = \rho_i \left( 1 + (t_{\beta_i})^{-2} \right), \quad (6.2.3)$$

where

$$t_{\beta_i} = \frac{\hat\beta_i}{\sqrt{\mathrm{Var}[\hat\beta_i]}}$$

is the t-score for the estimate of $\beta_i$. This yields the following corrected estimate for the value of $\rho$ to be used in eqn. 6.2.2: $\hat\rho_i^{\,\mathrm{corr}} = \hat\rho_i \left( 1 + (t_{\beta_i})^{-2} \right)^{-1}$. This correction not only un-biases the estimate of $\rho_i$ but also conditions it numerically when the t-score of $\hat\beta_i$ is low.
An approximate estimate of the variance of $\hat\tau_i := \tau(\hat\beta_i, \hat\gamma_i)$ is obtained by taking its first-order Taylor expansion around $\beta_i$ and $\gamma_i$, and using the fact that their estimates are unbiased and uncorrelated, to give:

$$\mathrm{Var}[\hat\tau_i] \approx \mathrm{Var}\!\left[ \tau(\beta_i, \gamma_i) + \begin{pmatrix} \dfrac{\partial\tau}{\partial\beta_i} & \dfrac{\partial\tau}{\partial\gamma_i} \end{pmatrix} \begin{pmatrix} \hat\beta_i - \beta_i \\ \hat\gamma_i - \gamma_i \end{pmatrix} \right] \quad (6.2.4)$$
$$\approx \left( \frac{\partial\tau}{\partial\beta_i} \right)^2 \mathrm{Var}[\hat\beta_i] + \left( \frac{\partial\tau}{\partial\gamma_i} \right)^2 \mathrm{Var}[\hat\gamma_i]. \quad (6.2.5)$$
Proofs for these equations are given in Appendix A.
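The corrected estimator and its delta-method variance can be sketched as follows. The logistic constants $\alpha_1, \alpha_2$ are placeholders (the text determines them empirically), and the partial derivatives of eqns. 6.2.4-6.2.5 are taken numerically here rather than analytically:

```python
import math

def latency(beta, gamma, var_beta, var_gamma, a1=2.0, a2=1.0):
    """Corrected latency estimate (eqns. 6.2.2-6.2.5).

    a1, a2 are illustrative placeholder values for the empirically
    determined constants alpha_1, alpha_2.
    Returns (tau_hat, var_tau_hat).
    """
    def tau_of(b, g):
        # rho corrected by the t-score factor of eqn. 6.2.3
        t_b = b / math.sqrt(var_beta)
        rho_corr = (g / b) / (1.0 + t_b**-2)
        return 2.0 * a1 / (1.0 + math.exp(a2 * rho_corr)) - a1

    tau = tau_of(beta, gamma)

    # delta-method variance (eqn. 6.2.5) via numerical partials
    h = 1e-6
    d_beta = (tau_of(beta + h, gamma) - tau_of(beta - h, gamma)) / (2 * h)
    d_gamma = (tau_of(beta, gamma + h) - tau_of(beta, gamma - h)) / (2 * h)
    var_tau = d_beta**2 * var_beta + d_gamma**2 * var_gamma
    return tau, var_tau
```

Note that the logistic map keeps $\hat\tau_i$ bounded in $(-\alpha_1, \alpha_1)$ regardless of how extreme $\hat\rho_i$ is, which is part of what stabilizes the estimator.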
6.2.2 Parametric Effects with Factorial Designs
The model of eqn. 6.2.1 can be used to test the effect $f(p(t))$ of an experimental parameter $p(t)$ by adding a regressor of the form $x_i(t) = f(p_i) \star h(t)$. The corresponding $\hat\beta_i$ reflects the contribution of this parameter towards the amplitude of the response, while $\hat\tau_i$ gives the total delay of the HR to this stimulus. Note that it does not measure the change in latency with change in the experimental parameter, i.e. it fails to characterize the effect of the parameter on the latency.
The solution is to quantize $p$ into $n_p$ levels and treat it like a categorical variable, resulting in a factorial design with one regressor corresponding to each level. The most extreme case would be to treat each value of the parameter as an individual level. In matrix notation, $y = X_a \vec\beta_a + \epsilon$, where $X_a$ is the design matrix containing the parametric variable partitioned into one regressor per level, and $\vec\beta_a$ is the regression coefficient vector. The original design matrix $X = X_a C$, where $C_{k,l}$ gives the weight of the $k$-th regressor of $X_a$ toward the $l$-th regressor of $X$. It is easy to verify that $\hat{\vec\beta} = D \hat{\vec\beta}_a$, where $D = C^{-} (X_a'\Sigma^{-1}X_a)^{-} C^{-\prime}\, C'\, X_a'\Sigma^{-1}X_a$.
Unfortunately, it is hard to make reliable inferences with this model due to the inflated variance of the estimate caused by the increased model degrees of freedom, given by the trace of the projection matrix $P_{X_a} = X_a (X_a'\Sigma^{-1}X_a)^{-} X_a'\Sigma^{-1}$. Therefore, to find a tradeoff between the flexibility of the model and its statistical power, we regularize it such that the estimates of the coefficients $\{(\beta_i, \gamma_i)\}_{i=1}^{n_p}$ for parameter $p$ are normally distributed around their mean values for $p$ with variance $\sigma_\beta^2 \Sigma_\beta$. Here $\Sigma_\beta$ is a diagonal matrix representing the relative scale of variation in $\beta_i$ with respect to that of $\gamma_i$.
The Gauss-Markov estimate for $\vec\beta_a$ results in the following ridge-regression formulation:

$$\hat{\vec\beta}_a(\lambda) = \min_{\vec\beta_a} \left[ y - X_a\vec\beta_a \right]' \Sigma^{-1} \left[ y - X_a\vec\beta_a \right] + \lambda \left[ \vec\beta_a - DD^{-}\vec\beta_a \right]' \Sigma_\beta^{-1} \left[ \vec\beta_a - DD^{-}\vec\beta_a \right]$$
$$= \min_{\vec\beta_a} \left[ y - X_a\vec\beta_a \right]' \Sigma^{-1} \left[ y - X_a\vec\beta_a \right] + \lambda\, \vec\beta_a' Q \vec\beta_a$$
$$= \left( X_a'\Sigma^{-1}X_a + \lambda Q \right)^{-} X_a'\Sigma^{-1} y, \quad (6.2.6)$$

where $Q = [I - DD^{-}]'\,\Sigma_\beta^{-1}\,[I - DD^{-}]$. Here, $\lambda$ represents the ratio $\sigma^2/\sigma_\beta^2$. A value of $\lambda = 0$ indicates a flat prior on $\hat{\vec\beta}_a(\lambda)$, and the solution corresponds to the OLS estimate $\hat{\vec\beta}_a$ without regularization.
The model degrees of freedom, given by the trace of the projection matrix $\mathrm{Tr}\{P_{X_a}(\lambda)\}$, is a decreasing function of $\lambda$. Here, the projection matrix

$$P_{X_a}(\lambda) = X_a \left( X_a'\Sigma^{-1}X_a + \lambda Q \right)^{-} X_a'\Sigma^{-1}.$$
Also, $\hat{\vec\beta}_a(\lambda)$ is efficiently calculated as

$$\hat{\vec\beta}_a(\lambda) = \left[ I + \lambda (X_a'\Sigma^{-1}X_a)^{-} Q \right]^{-} \hat{\vec\beta}_a(0).$$
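A numpy sketch of the regularized estimate of eqn. 6.2.6 and of the model degrees of freedom, using pseudo-inverses throughout as in the text:

```python
import numpy as np

def ridge_glm(y, Xa, Sigma_inv, Q, lam):
    """Regularized estimate of eqn. 6.2.6:
    beta_a(lam) = (Xa' S^-1 Xa + lam Q)^- Xa' S^-1 y."""
    A = Xa.T @ Sigma_inv @ Xa + lam * Q
    return np.linalg.pinv(A) @ Xa.T @ Sigma_inv @ y

def model_dof(Xa, Sigma_inv, Q, lam):
    """Model degrees of freedom Tr{P_Xa(lam)}, a decreasing
    function of lam."""
    A = Xa.T @ Sigma_inv @ Xa + lam * Q
    P = Xa @ np.linalg.pinv(A) @ Xa.T @ Sigma_inv
    return np.trace(P)
```

At $\lambda = 0$ the estimate coincides with ordinary least squares, and increasing $\lambda$ monotonically shrinks the effective degrees of freedom, which is what restores statistical power to the finely quantized design.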
6.2.3 Hyper-parameter Selection
The diagonal values of $\Sigma_\beta$ are estimated by first computing the OLS values of $\hat{\vec\beta}_a(0)$, and then setting $\Sigma_\beta(i, i) = 1$ and $\Sigma_\beta(i+1, i+1) = \mathrm{Var}[\hat\gamma_i]/\mathrm{Var}[\hat\beta_i]$.
The optimal value $\lambda^*$ of $\lambda$ is the one at which the mean squared error (MSE) of the estimates is smallest, which, when using linear LS regression, is well-approximated by Mallows' $C_p$ statistic [146]:

$$C_p(\lambda) = y'R(\lambda)y + 2\,\frac{\mathrm{Tr}\{R(\lambda)\}}{N}\,\frac{y'R(0)y}{\mathrm{Tr}\{R(0)\}}, \quad (6.2.7)$$

where $R(\lambda) = [I - P_{X_a}(\lambda)]$ is the residual-forming matrix.
The noise variance $\sigma^2$ and correlation $\Sigma$ are estimated for each value of $\lambda$ as follows: (a) Set $\hat\Sigma(\lambda) = I$ and compute the projection matrix $\hat{P}(\lambda)$, the residuals $r = [I - \hat{P}(\lambda)]y$, and $\hat\sigma^2(\lambda) = r'r/\mathrm{Tr}\{I - \hat{P}(\lambda)\}$. (b) Obtain an empirical estimate of the auto-correlation of the noise as $\hat\phi(t) = \sum_{i=t+1}^{n} r_{i-t} r_i / r'r$. (c) Treat the noise as an AR(1) process and solve the Yule-Walker equations for the AR(1) coefficient $\hat\theta$ from $\hat\phi(t)$. (d) Spatially smooth the estimated $\hat\theta$ in order to reduce its variance. (e) Reconstruct the noise correlation matrix $\hat\Sigma(\lambda)$ from the AR(1) coefficient as $\hat\Sigma(\lambda)_{i,j} = \hat\theta^{|i-j|}/\sqrt{1 - \hat\theta^2}$. (f) Repeat the regression with these new estimates of the noise covariance.
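Steps (b)-(e) can be sketched as follows (a simplified single-voxel version; the spatial smoothing of step (d) is omitted):

```python
import numpy as np

def ar1_noise_model(r):
    """Estimate an AR(1) noise model from GLM residuals r.

    Returns (theta, Sigma): the lag-1 coefficient, which for an AR(1)
    process solves the Yule-Walker equation theta = phi(1), and the
    T x T matrix Sigma_ij = theta^|i-j| / sqrt(1 - theta^2), as in
    the text.
    """
    r = np.asarray(r, dtype=float)
    T = len(r)
    # empirical lag-1 autocorrelation phi(1) = sum_i r_{i-1} r_i / r'r
    theta = np.dot(r[:-1], r[1:]) / np.dot(r, r)
    idx = np.arange(T)
    Sigma = theta ** np.abs(idx[:, None] - idx[None, :])
    Sigma /= np.sqrt(1.0 - theta**2)
    return theta, Sigma
```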
6.3 Results
First, in Section 6.3.1 a quantitative validation of the estimation algorithms is provided using simulated data. Then, the results of the method applied to the visuo–spatial working memory (VSWM) data–set are described in Section 6.3.2.
6.3.1 Simulated Data
For a stimulus function $s(t)$ consisting of a train of delta functions, its HR was generated as $y(t) = \sum_u h(t-u)\, s(u - \tau(p(u)))\, \beta(p(u)) + \epsilon(t)$. Here, $\beta(p)$ is the modulation of the experimental parameter $0 < p(t) \le 10$ on the HR amplitude, while $\tau(p)$ is its effect on latency. Both functions were modeled as cubic polynomials. The noise $\epsilon(t)$ was generated
by an AR(1) process with coefficient 0.3. The empirical MSE for the latency and amplitude estimators with and without regularization is plotted against SNR¹² at different levels of quantization $n_p$ in Fig. 6.1. The reduction in MSE for both latency and amplitude after regularization is apparent, and the method remains robust even at relatively low SNR. It is also observed that initially increasing the levels of quantization (from 5 to 10) reduces the MSE due to an improved fit, but beyond a certain point it begins to degrade ($n_p = 15$, in this case) due to overfitting.
¹² Defined as $20 \log(\|y\|_2\, \sigma^{-1})$.
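A sketch of this simulation, under illustrative (assumed) cubic polynomials for β(p) and τ(p) and a simple gamma-shaped HRF standing in for the one used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400

# Assumed cubic effects of the parameter p on amplitude and latency.
def beta(p):
    return 1.0 + 0.05 * p + 0.002 * p**2 - 0.0001 * p**3

def tau(p):
    return 0.2 * p - 0.01 * p**2 + 0.0005 * p**3

# Delta-train stimulus with a random parameter value per event.
onsets = np.arange(10, T - 40, 25)
p_vals = rng.uniform(0.5, 10.0, size=len(onsets))

# Gamma-shaped HRF, normalized to unit peak.
t = np.arange(30, dtype=float)
h = t**5 * np.exp(-t)
h /= h.max()

# y(t) = sum_u h(t-u) s(u - tau(p(u))) beta(p(u)) + AR(1) noise:
# latency shifts each onset, amplitude scales each response.
y = np.zeros(T)
for onset, p in zip(onsets, p_vals):
    shifted = int(round(onset + tau(p)))
    y[shifted:shifted + len(h)] += beta(p) * h

eps = np.zeros(T)
for i in range(1, T):
    eps[i] = 0.3 * eps[i - 1] + rng.normal(scale=0.1)
y += eps
```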
Figure 6.1: MSE of Regularized Estimator MSE with respect to SNR of amplitude (left) and latency (right) for the regularized and unregularized estimators at np = 5, 10, 15 levels of quantization. Legend. Unregularized: np = 5–circles, np = 10–asterisks , np = 15–filled dots; Regularized: np = 5–solid line, np = 10– dotted line, np = 15–dot-dashed line.
Fig. 6.2 shows parametric effects on latency and amplitude estimates using the regularized
and unregularized estimators (np = 10). It is seen that while the mean value from both
methods is close to the true value, the spread of values (red filled box) for the regularized
method is smaller indicating much lower variance. We also observed that the bias of our
corrected latency estimator was about 70% less than that of the original estimator of Henson
et al. [94], and the empirical variance of τˆ was (1±0.15) times the analytical variance from
eqn. 6.2.4.
6.3.2 Latency Analysis for VSWM Task
This method was applied to the fMRI data–set of 8 children aged 7 to 11, to study visuo–
spatial memory (VSM) maintenance and manipulation as described in Section 3.1.
We performed a group-level fixed-effects analysis for the activation amplitudes estimated across the 8 subjects, to identify common activation foci. Maximum intensity projections (MIP) of the t-scores (p < 0.05, FDR corrected) for the activation amplitude during the instruction phase of the trial are shown in Fig. 6.3 for each of the three methods.
In order to study the functional differences for memory recall (“forward” case) vs. memory manipulation (“backward” case) and identify functional changes in the VSM due to repetition, as habituation, parameterized by experiment time T = 0–10 mins, the data were analyzed using:
[a] A standard GLM with effects for T modelled by linear, quadratic and cubic regressors, and two levels for recall direction D = 1, 2.
[b] A standard GLM method using a factorial design with 10 levels of quantization for T and two levels for D.
[c] The same design, but using the regularized estimation developed here.

Figure 6.2: Quantitative Evaluation of Parametric Effects. The effects of experimental parameter p on amplitude (left) and latency (right) were simulated. Effects are shown for the regularized (red solid line) and unregularized (blue, dash-dotted line) estimators, along with their 95% confidence interval (CI) bands (black dashed lines).
Figure 6.3: Glass brain (MIP) images of activation amplitude assessed at the group level during memory manipulation / recall using methods [a],[b] and [c]. The red arrows locate the intra-parietal sulcus (IPS).
While method [a] strictly cannot be compared with [b] and [c], we observed more activated regions using methods [b] and [c] than [a] at the same significance level. One explanation for this could be that there is a parametric effect in the activation that cannot be explained by a cubic polynomial model, and also that method [a] assumes a fixed delay across all trials. This clearly demonstrates the benefits of testing for effects of real-valued parameters by treating them as categorical parameters. Method [c] exhibited yet larger activation foci than method [b] (12% suprathreshold voxels as percentage of intracranial volume, t-score maximum = 15.23, average = 5.17 vs. 8%, 13.81 and 3.76 respectively) partially due to the larger variance of method [b], reducing its t-scores for the activation amplitudes.
For the purposes of further exposition, we shall consider a locus in the intra-parietal sulcus (IPS), located by the red arrow in Fig. 6.3, that exhibited high t-scores across all three methods and is known to process spatial attributes of numerous cognitive tasks.

Figure 6.4: Parametric Effect on Latency. Axial slice of the brain (z = 55mm, MNI space) for group-wise latency using method [c] at the location marked by red arrows in Fig. 6.3. Top row shows forward recall (D = 1) and bottom row shows backward recall (D = 2), at different experiment times (T). The color scale runs from 0.0s to 0.3s.

An axial slice of the group-wise latency assessed using method [c] is shown in Fig. 6.4, while
Fig. 6.5 shows the effect of parameter T on the amplitude and latency as estimated by the
three methods at this locus. As explained earlier, method [a] cannot test for a parametric effect on latency.
Firstly, we observe that the variance of both the amplitude and latency estimates by our method are consistently much lower than method [b], while there is no noticeable increase in bias. The polynomial weighting of the parameter T explains why the 95% CI band for the effect estimated by method [a] is much wider than those for the other two methods.
Figure 6.5: Parametric effects of latency and amplitude at a region in the IPS. Figs. (i)–(ii) graph ac- tivation amplitude vs. experiment time T = 0–10mins for forward (D = 1) and backward recall(D = 2). Figs. (iii)–(iv) graph activation latency vs. T . Legend. Method [a]: mean effect – black solid line; Method [b]: blue dotted line; Method [c]: red dashed line. Bands show the 95% CIs for the estimates
While no appreciable parametric effect on amplitude can be seen for the forward case, the
activation latency does exhibit an increase over time. For the backward case, there is a
slight increase in activation amplitude over time, followed by a leveling out. This effect
can be seen more clearly using methods [b] and [c]. Latency, on the other hand, starts off high but then reduces with T . These trends point to physiological phenomena that could be due to attention, adaptation (learning) or habituation. This example clearly demonstrates the value of examining parametric effects of activation latency.
6.4 Conclusion
This chapter presented a method to estimate the effects of an experimental parameter on the amplitude and the latency of the hemodynamic response, while exploiting the advantages of a GLM framework, namely computational speed and ease of interpretation. Additionally, a low–bias estimator for latency was developed and validated on simulated data.
It was also demonstrated that latency is capable of exposing aspects of neural recruitment of the visuo–spatial working memory not available through classical GLM analysis. By examining parametric effects on latency, more and interesting differences in the activation patterns could be observed for memory recall and manipulation than by using amplitude information alone.
A promising line of future investigation is to develop a group–analysis framework for la- tency, which may help provide further insight into neuropathologies and the salient differ- ences between populations with different cognitive capabilities. Also, the ability to model more variation in the hemodynamics through non–parametric representations of the HRF might be able to capture latency effects with higher precision. In addition, of interest are extensions that characterize the variation of all the features of the hemodynamic response, which may reveal even more aspects of the neurophysiology of cognition.
PART III
Spatio-temporal Representations for Cognitive Processes
CHAPTER 7
SPATIO-TEMPORAL REPRESENTATIONS: THEORY
Representation of the world, like the world itself, is the work of men; they describe it from their own point of view, which they confuse with the absolute truth.
Simone de Beauvoir (1908-1986) The Second Sex.
In this chapter, Section 7.1 provides a brief background on the problem of studying the
representation of information about cognitive states of the brain contained in fMRI data.
This is followed by a theoretical discussion of the merits and drawbacks of supervised,
unsupervised and semi–supervised analysis methods in Section 7.2. Then Section 7.3
talks about the two main supervised multivariate methodologies for brain–state decoding
viz. multivariate pattern recognition (MVPR) and multivariate linear models (MVLM).
Finally, in Section 7.4, I shall motivate the need for a dynamical unsupervised and semi–supervised approach towards discovering patterns in the data that might indicate the internal / hidden and transient cognitive state of the subject.
7.1 Functional Representation
One of the main challenges in cognitive neuroscience is understanding the neural representation of a cognitive, perceptual or affective state of a subject, i.e. in “cracking the neural code”. The essence of the solution lies in the fact that determining the neural coding of a particular variable provides the means for potentially solving the problem of the cognitive processing of that variable. For example, if we can decipher the neural coding of numbers, then we have a good chance of deciphering the cognitive process of mental arithmetic by observing neural activity during this process, recovering the numbers by decoding, and inferring how they are being operated upon in this particular process.
While fMRI operates at a level much removed from neural processes, there is nonetheless information about the cognitive state of the subject encoded in the distributed pattern of activity. In one of the first publications on this topic, Haxby et al. [90], using a simple correlation based linear classifier, demonstrated the representability of different types of visual stimuli in the human ventral temporal (VT) cortex with fMRI. These results are reproduced in Fig. 7.1.
Since then, the encoding of visual concepts in the human visual cortex – at both higher and lower levels of the processing stream – has been widely investigated [91, 110, 179] with over 200 publications to date. This approach has been applied to study other types of mental representations, including auditory [150] perception, motor tasks [122], word recognition [160], detecting emotional affects such as deception [200] and fear [177], etc.
In a landmark paper, Mitchell et al. [160] presented a method to predict the fMRI response to new words from the responses of other related words. For a given word, they learnt the
Figure 7.1: The category specificity of patterns of response was analyzed with pairwise contrasts between within-category and between-category correlations. The pattern of response to each category was measured separately from data obtained on even-numbered and odd-numbered runs in each individual subject. These patterns were normalized to a mean of zero in each voxel across categories by subtracting the mean response across all categories. Brain images shown here are the normalized patterns of response in two axial slices in a single subject containing the VT cortex. For each pairwise comparison, the within-category correlation is compared with one between-category correlation. (A) Comparisons between the patterns of response to faces and houses in one subject. The within-category correlations for faces (r = 0.81) and houses (r = 0.87) are both markedly larger than the between-category correlations, yielding correct identifications of the category being viewed. (B) Comparisons between the patterns of response to chairs and shoes in the same subject. The category being viewed was identified correctly for all comparisons. (C) Mean response across all categories relative to a resting baseline. Figure reproduced from Haxby et al. [90].
Figure 7.2: Predicting Spatial Maps for Given Stimulus Words. (A) Forming a prediction for the stimulus word “celery” after training on 58 other words. Learnt activation maps for 3 of the 25 semantic features (“eat”, “taste” and “fill”) are depicted by the voxel colors in the three images at the top of the panel. The co–occurrence value for each of these features for the stimulus word “celery” is shown to the left of their respective images. The predicted activation for the stimulus word [shown at the bottom of (A)] is a linear combination of the 25 semantic fMRI signatures, weighted by their co-occurrence values. (B) Predicted and observed fMRI images for celery and airplane after training that uses 58 other words. The two long red and blue vertical streaks near the top (posterior region) of the predicted and observed images are the left and right fusiform gyri.
semantic relationships with other words in terms of their co-occurrence statistics within a trillion word text corpus. In the second step, they predicted the fMRI image of a new word
(e.g. “celery”) as a linear combination of the distributed fMRI responses to these related words (e.g. “eat”, “taste”, “fill”, etc.). I have reproduced these results in Fig. 7.2.
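The prediction step in this scheme is just a weighted sum of learnt per-feature activation maps. A hypothetical miniature sketch (the dimensions, names and random values are illustrative, not Mitchell et al.'s actual data or code):

```python
import numpy as np

rng = np.random.default_rng(5)

# Miniature stand-in for the scheme: 25 semantic features, each with a
# learnt voxel-wise activation signature (here just random placeholders)
n_features, n_voxels = 25, 1000
signatures = rng.normal(size=(n_features, n_voxels))  # "learnt" from training words

# Co-occurrence values of a held-out word with each semantic feature
w = rng.uniform(size=n_features)

# The predicted image is the co-occurrence-weighted sum of the signatures
predicted = w @ signatures
print(predicted.shape)
```

The predicted volume can then be compared against the observed volume for the held-out word, e.g. by correlation, to score the model.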
This is part of a broader trend of machine learning in the analysis of neuro–scientific recordings [20], with wide applications in brain–machine interfaces [57], clinical psychology and cognitive neuroscience [178], real–time biofeedback [122], etc.
7.2 Supervised, Unsupervised and Semi-supervised
One of the main advantages of supervised approaches is that they allow quantitative testing for the effect of an experimental variable on the brain's response. The drawback is that these tests fundamentally boil down to model comparison [65] and therefore require pre–specification of at least two models (e.g. the null and alternative hypotheses), typically as the linear coupling between stimulus and response (cf. Sidebar on Generative vs. Classification Models). Except for the simplest of tasks, there are no good models of brain function, and it is unclear how complete a picture of brain function can be provided by the extreme oversimplifications needed for computational and statistical expediency [74].
Purely unsupervised or data–driven methods, in contrast, do not make assumptions about the linearity and stationarity of the brain's response, but they suffer from problems of interpretability, since they are based on statistical criteria and not on a generative model of brain function. Secondly, they fail to provide quantifiable links to experimental variables of interest.
For example, with PCA, multiple studies have noted that the components with maximum variance often correspond to artifacts such as respiration and head–motion. ICA, on the other hand, does not provide any ordering of components or any quantitative criteria for selecting amongst them. Components are selected either through visual inspection, where the identification of “interesting” components is left completely to the investigator [173], or by (linear) correlation with respect to a reference time–series [222], which again requires knowing the mathematical relationship between the fMRI signal and experimental variables, or through information–theoretic criteria [135], which select components that may or may not be related to brain function.
Similar ambiguities affect clustering–based methods, such as the criteria for setting the appropriate number of clusters, the drastic effects of algorithm initialization on the final results, and interpretability issues due to the lack of an underlying model.
7.3 Pattern Recognition vs. Linear Models
This section discusses the pros and cons of pattern classifiers and linear models in the context of supervised multivariate analysis. Some of the statistical issues alluded to here are elaborated further in the sidebar on Generative vs. Classification Models.
7.3.1 Multivariate Pattern Recognition (MVPR)
MVPR methods (cf. Section 2.6.1) treat the data as an abstract representation of mental states without requiring a model of brain function, i.e. of how neural activity is converted into the fMRI signal. Instead, they are posed in terms of prediction accuracy, through principles such as structural risk minimization and generalization error [22].
The advantage of these methods is that they are not limited, inferentially, by the fact that the physiology of brain function and its translation into the BOLD mechanism is poorly understood and that all models of it are likely to be highly inaccurate. The drawbacks, however, are the reduced temporal resolution of the analysis, limiting them to block designs, and the inability to make quantitative and definitive neurophysiological interpretations from their parameter estimates.
Also, most methods make the assumption that all fMRI scans with the same label (i.e. behavioral state) are equivalent. The attractiveness of this approach is that it allows an investigation of the mental state at each time point independently (restricted, of course, to block designs). The flip side, however, is that it ignores the temporal dependencies, variations and evolutions of patterns that are fundamental to mental processes, and in that sense provides a static, unchanging picture of brain function.
Another limitation of MVPR classifiers is their applicability only to studies where subjects are presented with a fixed number of alternatives (e.g. faces vs. objects [90]). Generalization to complex cognitive paradigms with interval–valued parameters, and further to real–world situations, poses a significant methodological challenge [92].
Finally, all reported studies so far have used classifiers trained and tested on the same subject. An important and unresolved question is the extent to which these strategies can be generalized across subjects and to new situations. This will require detecting the representation of specific mental concepts in a manner that is invariant across humans and task conditions.
7.3.2 Multivariate Linear Models (MVLM)
In contrast to MVPR, MVLMs specify a probabilistic generative model of how the observed data are related to the stimulus, based on some assumptions about brain function.
Both forward (i.e. Y = Xs + ε) and decoding (i.e. s = XY + ε) MVLMs specify this as a linear relationship between regressors, formed by convolving the stimuli with a hemodynamic response function (HRF), and the observed data.
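The contrast between the forward and decoding directions can be sketched on synthetic data. This is a minimal illustration, not the thesis's pipeline: a boxcar regressor stands in for an HRF-convolved stimulus, all dimensions are arbitrary, and ordinary least squares is used for both fits.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 120, 50                                  # scans, voxels
s = (np.arange(T) % 20 < 10).astype(float)      # block-design regressor, standing in
X = s[:, None]                                  # for an HRF-convolved stimulus; T x 1

B = rng.normal(size=(1, V))                     # true voxel-wise response amplitudes
Y = X @ B + 0.5 * rng.normal(size=(T, V))       # forward model: Y = Xs + noise

# Forward (encoding) direction: least-squares estimate of the responses B
B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

# Decoding direction: regress the stimulus on the data
w = np.linalg.lstsq(Y, s, rcond=None)[0]
s_hat = Y @ w
print(np.corrcoef(s, s_hat)[0, 1])              # decoded course tracks the stimulus
```

Note the asymmetry the sidebar discusses: the forward fit has one regressor and many responses, while the decoding fit has as many predictors as voxels, which is what creates the over-fitting risk in real data.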
While advantages include computational efficiency, statistical simplicity and straightforward neurophysiological interpretation of parameters, this methodology requires that the mathematical relationship between experimental variables and fMRI signal be known a priori. This is oftentimes hard to define in a principled manner, especially in experiments for higher–level cognition.
Equally problematic is the assumption of spatially and temporally constant hemodynamics in these models, since multiple studies have shown a large variation in the hemodynamic response (HR) across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time (cf. Section 1.3.2).
7.4 Motivation
Despite the success of multivariate methods for understanding the representation of cognitive states, there are nevertheless many open challenges. Fundamentally, these methods learn a fixed mapping from fMRI data to regressors / labels describing stimuli or subject behavior. Hence, they cannot discover intrinsic patterns that might be present in the data, and therefore their ability to explain the internal mental state is limited to behavioral correlates as recorded by the stimulus. Decoding the internal cognitive states, especially under natural conditions, requires reconstructing the spontaneously changing dynamic “stream of consciousness” from brain activity alone, without reference to extrinsic labels.
An equally significant challenge arises from the fact that mental processes are constituted of dynamically changing patterns of activity. These approaches, by creating static spatial maps corresponding to a particular experimental variable and ignoring its dynamics, disregard a potentially very informative aspect of brain function.
In order to address some of these shortcomings, in the next two chapters I propose two related models of brain function as represented by fMRI, using a dynamical state–space formulation. The first version of this model, an unsupervised framework along with a Monte–Carlo estimation algorithm, is described in Chapter 11. Then, to address the problem of linking the results back to the experimental variables, the model is refined to include stimulus information in Chapter 12. Additionally, a computationally efficient mean–field approximation based estimation algorithm is developed in that chapter.
Generative vs. Classification Models
The question of inferring a link between a distributed pattern of response and a mental state is essentially one of comparing the evidence between alternative models, typically between the null hypothesis H0, which posits the absence of an effect, and the alternative hypothesis H1. Experimental neuroscience rests on comparing generative models that embody competing hypotheses about how data are caused. From a statistical perspective, by the Neyman–Pearson lemma [112], the likelihood–ratio test (or Bayes factors) for a statistic θ:
Λ = p(θ|H1) / p(θ|H0)

is the uniformly most powerful test of a given size α = p(Λ ≥ µ|H0) with a threshold µ (i.e. it has the least Type–II / false–negative error–rate for a given Type–I / false–positive error–rate), and is the basis for most of statistical inference, including classical methods like Wilks' Lambda in canonical correlation analysis (CCA) and the F–test in ANOVA. The null distribution of the likelihood–ratio statistic p(θ|H0) can be determined non–parametrically or under parametric assumptions (e.g., a t–test). To evaluate the marginal likelihood it is necessary to specify the joint density function entailed by a model, typically in parametric form. In Bayesian analysis, the parameters are then integrated out with respect to a prior density to give the model evidence.

Generative models of the form g(θ): s → Y explain how experimental variables s produce observed data Y. Such forward models (e.g. GLMs) treat the experimental conditions as fixed or known variables, with randomness only in the data: Y = g(s) + ε. In multivariate decoding, the direction of the generative model is reversed to give a decoding model s = g(Y) + ε. The advantage of this approach is that it can account for the perceptual uncertainty of a presented stimulus. The drawback is the much higher dimensionality of Y with respect to s, which causes a risk of over–fitting and necessitates some kind of dimensionality reduction.

In classification, one wants to predict or classify a new observation Ynew using a decoding model whose parameters have been estimated from training data and classification pairs. Classification may be based on the predictive density:

p(snew | Ynew, s, Y) = ∫ p(snew | θ, Ynew) p(θ | s, Y) dθ,

although many classifiers (e.g. SVMs) do not even try to estimate the predictive density. Instead, they try to directly maximize prediction ability and can be thought of as point estimators of the parameters.
In most neuroscientific investigations, prediction of new fMRI volumes is not of direct interest, and the predictive density or the generalization error–rate (measured by cross–validation) is used in lieu of model evidence to detect the presence of an effect. This is because classifiers do not yield probabilistic estimates of the parameters, which means their evidence is not defined. Therefore, by the Neyman–Pearson lemma, inferences made by such schemes are sub–optimal. The second problem for classifiers is that the marginal likelihood depends on both accuracy and model complexity. However, many classification schemes do not account for this complexity explicitly, but rather do so in an indirect way, through the generalization error and over–fitting penalties.
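The Neyman–Pearson construction in the sidebar can be illustrated numerically. This is a toy example of my own, not from the thesis: two simple Gaussian hypotheses about a scalar statistic, with the threshold µ calibrated by Monte Carlo under H0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two simple hypotheses about a scalar statistic theta:
#   H0: theta ~ N(0, 1)     H1: theta ~ N(1, 1)
def loglik(theta, m):
    return -0.5 * (theta - m) ** 2           # log N(m, 1) density up to a constant

def lam(theta):                              # likelihood ratio p(theta|H1) / p(theta|H0)
    return np.exp(loglik(theta, 1.0) - loglik(theta, 0.0))

# Calibrate the threshold mu for size alpha = P(Lambda >= mu | H0) by Monte Carlo
alpha = 0.05
null_samples = lam(rng.normal(0.0, 1.0, size=100_000))
mu = np.quantile(null_samples, 1 - alpha)

# Power = P(Lambda >= mu | H1); by Neyman-Pearson, no test of this size does better
alt_samples = lam(rng.normal(1.0, 1.0, size=100_000))
power = (alt_samples >= mu).mean()
print(round(power, 2))                       # roughly 0.26 for these two hypotheses
```

Because Λ is monotone in θ here, thresholding Λ is equivalent to the familiar one-sided z-test, which is exactly the lemma's point: the likelihood ratio is the optimal statistic to threshold.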
CHAPTER 8
BRAIN–STATES: INVESTIGATING SPATIO–TEMPORAL PATTERNS IN THE DATA
Space and time are the framework within which the mind is constrained to con- struct its experience of reality.
Immanuel Kant (1724–1804), Critique of Pure Reason.
As discussed in the previous chapter, identifying the transient intrinsic cognitive states of the brain as it performs a mental task from fMRI data is an important research problem with wide applications in cognitive neuroscience.
In this chapter, I shall present an initial exploration of this concept, to determine whether the spatially distributed BOLD signal recorded at each time–point encodes some information about the instantaneous mental state of the subject. This investigation was inspired by the micro–states discovered in EEG [131], which is further discussed in Section 8.1.
The method adopted for identifying the potential of a similar organization in the metabolic traces of neural processes recorded by the BOLD signal is described in Section 8.2. The results of this investigation are presented in Section 8.3, followed by a discussion in Section 8.4.
8.1 Inspiration
Although a complex system such as the brain comprises many local functional states, these can be aggregated into global functional states or configurations at each moment in time. In their seminal work [132], Lehmann et al. extracted and classified typical or characteristic brief quasi–stable topographies of electric field potentials recorded simultaneously from many EEG electrodes placed on the scalp, which they termed microstates. This is shown in Fig. 8.1.
Figure 8.1: EEG Microstates over 4 seconds of spontaneous EEG using a cluster analysis. The waveforms represent eyes-closed EEG recorded from 42 electrodes. For each time point, the potential distribution map was calculated and all maps of the 4 seconds were subjected to a k-means cluster analysis. A cross-validation criterion identified four characteristic electrical potential landscapes, i.e. microstates, as illustrated in the 3rd row of the figure. Fitting these maps back to the original data revealed that each microstate appeared repeatedly and dominated during certain time segments, as shown in the fourth row of the figure (with each microstate color-coded appropriately). Figure reproduced from [131].
These different electric potential landscapes or microstates are generated by different distributions of neuronal electric activity in the brain and last from 70 ms to 150 ms. They are hypothesized to reflect the activation of different neuro–cognitive networks, each representing specific aspects of cognitive processing, and may be the “atoms of thought” that constitute the seemingly continual “stream of consciousness” [100]. These microstates change in a non–continuous manner: one state may dwell over extended periods in a quasi–stable manner, followed by rapid and major changes of state.
In contrast, the identification of a similar instantaneous state from fMRI, without reference to the experimental task, has been an unexplored problem which could provide important insights into mental processes. While temporal limitations would prevent access to the faster and more fleeting “atoms of thought”, fMRI could potentially reveal relatively longer–lasting and more high–level intrinsic mental states such as attention, intention, planning, etc., that do not necessarily correspond to observable attributes as recorded by the experimental stimuli.
Based on this hypothesis, we developed an unsupervised method for identifying brain– states as described next.
8.2 Method
After the data were pre–processed to correct for head–motion and physiological artifacts, de–noised and the white–matter masked out (cf. Chapter 3 for specifics), each fMRI session was decomposed into c components using spatial ICA. Components whose time–series were highly correlated with head–motion parameters and mean volume intensity fluctuations, identified with multiple regression, were removed and the volumes reconstructed.
The sum of squared differences of voxel values ||Y(t1) − Y(t2)||² between the volumes at two different time–points was then used to perform hierarchical agglomerative clustering (HAC) [99] of all the volumes Y(t), t = 1 ... T, in the data. This process was repeated for 30 ≤ c ≤ 80 and multiple ICA initializations.
The average number of HAC steps at which two volumes Y(t1) and Y(t2) merged into the same cluster was recorded as dhac(t1, t2). For all pairs of volumes, an affinity matrix D(t1, t2) = exp{−dhac(t1, t2)/σ} was constructed, where σ is a manually chosen bandwidth parameter, typically the median value of dhac(t1, t2).
The affinity matrix D was used to find K clusters in the space spanned by the T volumes using spectral graph clustering [198], with K determined manually. Each cluster is labeled with a unique integer value k = 1 ... K using dynamic programming, such that Σ_t |k_{t+1} − k_t| is minimized, where Y(t) is in the cluster labeled k_t and Y(t + 1) is in the cluster labeled k_{t+1}. This results in a time–series of cluster labels in which most transitions are between states close to each other in value.
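The affinity-and-spectral-clustering step can be sketched on synthetic data. This is a simplified stand-in, under stated assumptions: plain sum-of-squared-differences replaces the HAC merge-step distance dhac, and a basic farthest-point-seeded Lloyd iteration replaces the final grouping; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "scans": T flattened volumes drawn from K latent brain-states
T, V, K = 60, 30, 3
centers = 5.0 * rng.normal(size=(K, V))
true_state = np.repeat(np.arange(K), T // K)
Yv = centers[true_state] + rng.normal(size=(T, V))

# Pairwise dissimilarity (SSD here, standing in for the HAC merge-step distance)
d = ((Yv[:, None, :] - Yv[None, :, :]) ** 2).sum(-1)

# Affinity D(t1, t2) = exp(-d / sigma), with sigma the median off-diagonal value
sigma = np.median(d[d > 0])
D = np.exp(-d / sigma)

# Spectral embedding: leading K eigenvectors of the normalized affinity
deg = D.sum(1)
L = D / np.sqrt(np.outer(deg, deg))
_, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
emb = vecs[:, -K:]
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Cluster the embedded rows: farthest-point seeding plus a few Lloyd steps
cent = [emb[0]]
for _ in range(K - 1):
    far = np.min([((emb - c) ** 2).sum(1) for c in cent], axis=0)
    cent.append(emb[np.argmax(far)])
cent = np.array(cent)
for _ in range(20):
    lab = np.argmin(((emb[:, None] - cent[None]) ** 2).sum(-1), axis=1)
    cent = np.array([emb[lab == k].mean(0) for k in range(K)])
```

With well-separated latent states, the label sequence `lab` recovers the true state of each scan up to a permutation of the K label values, which is what the dynamic-programming relabeling step then orders.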
These labels k_t assigned to each time–point Y(t) are treated as an indication of the intrinsic cognitive states of the brain.
8.3 Results
This method was applied to the fMRI study of visuo–spatial working memory (cf. Sec- tion 3.1). The brain–state labels for two subjects over a 200s period are shown in Fig. 8.2(a), where the color coding shows the corresponding phase of the experiment as displayed in
Fig. 3.2. Here, an intriguing synchronization of the assigned state with the phase of the task can be seen. It should be noted that no information about the experiment was used when determining the brain state. In Fig. 8.2(b), the median brain states during the forward and backward recall condition are shown (both subjects), along with the 10 and 90 percentile bands.
As can be seen, there is a clear transition of the brain between distinct states during the different phases of the experiment, along with a separation between the forward and backward conditions in the instruction and probe phases. The high variation towards the end of a trial is probably because the subjects' response times varied and they may have adopted different strategies across trials. The “M”–shaped peaks hint at a two–fold engagement of the pre–frontal areas during the instruction and probe phases. The drop between the peaks implies a “shift of gear” as the brain advances from one phase into the next.
8.4 Conclusion
Using an unsupervised data–driven approach based on clustering brain volumes solely on their voxel intensity distributions, without using information about the experiment, we were able to detect a pattern in the sequence of cluster labels that was highly organized with respect to the experiment.
Figure 8.2: Brain–state Labels for Two Subjects. Fig.(a): Brain–state labels for two subjects over a 200s period. Fig.(b): Median brain states for the two subjects (along with the 10 and 90 percentile bands) in each trial separated in terms of forward and backward recall.
From these transitions, not only a change from one phase to another, but also the effect of different experimental conditions on the measured response of the brain could be observed.
In the next few chapters, I shall explore this concept further, creating a representation that serves as an abstract vehicle for the spatio–temporal patterns recorded in the data, in order to understand the dynamic processes underlying human thought.
CHAPTER 9
BRAIN–STATES: THE NOTION OF FUNCTIONAL DISTANCE
Brain, n.: An apparatus with which we think that we think. Mind, n.: A mysterious form of matter secreted by the brain.
Ambrose Bierce (1842–1914), The Devil’s Dictionary.
Building on the promising results from the previous chapter on the potential of fMRI to reveal the internal cognitive state of the subject, this chapter develops a refinement of the clustering mechanism through the definition of a distance metric that quantifies the functional similarity between the neural activation patterns present in two fMRI scans.
The functional distance metric (FD), as explicated in Section 9.1, measures the amount of “transport of activity” over the functional networks of the brain. A robust, fast and sparse method for determining these functional networks, routinely defined as the “temporal correlations between spatially remote neurophysiological events” [133], is described in Section 9.2.
This is followed by the details of a fast approximation method for computing the functional distance, based on recursive aggregation in Section 9.3.
Once the functional distance between each pair of acquisition time–points is computed, the activation patterns at the T time–points are embedded in a space equipped with a metric obtained by a diffusion process. This metric, explained in Section 9.4, is aware of the geometry of the underlying low–dimensional manifold spanned by these neural patterns. The data are then grouped using hierarchical agglomerative clustering (HAC) in this low–dimensional space, where each cluster represents a characteristic distribution of activity in the brain, i.e. the brain–state at that time–point.
The results of this method applied to the mental arithmetic data–set (cf. Section 3.2) are reported in Section 9.5, followed by a brief discussion in Section 9.6.
9.1 Functional Distance
For the sake of discussion, let Zt1 and Zt2, where t1, t2 = 1 ... T, denote a voxel–wise measure of neural activity or, more accurately, of its metabolic fingerprint13, that evokes the hemodynamic response, which is then measured as the BOLD signal Yt, t = 1 ... T. In the discussion that follows later in this chapter, these activation patterns Z are arrived at by deconvolving the (denoised) fMRI signal Y using a canonical HRF.
The difference FD(Zt1, Zt2) between two activation patterns Zt1 and Zt2 is quantified by the transportation distance14 [183], i.e. the minimal “transport” f : N × N → R of activity over the functional circuits needed to convert Zt1 into Zt2 [102]. Specifically,
13 Let T be the number of fMRI scans in the session, each acquired at 1 TR intervals. 14 This metric, when used in the context of discrete probability distributions, becomes the well–known Earth Mover's Distance [188].
Definition 1 (Functional Distance).
FD(Zt1, Zt2) = min_f Σ_{i=1}^{N} Σ_{j=1}^{N} f[i, j] dF[i, j],    (9.1.1)
subject to the constraints:
f[i, j] ≥ 0
Σ_j f[i, j] ≤ Zt1[i]
Σ_i f[i, j] ≤ Zt2[j]
Σ_{i,j} f[i, j] = min{ Σ_i Zt1[i], Σ_i Zt2[i] }
The cost of the transport of f[i, j] from voxel i to j will depend on a measure dF : N × N → R+ between the voxels that captures their “functional disconnectivity”. If i, j = 1 ... N index two cortical voxels, then the cost of transport between them, dF[i, j], will depend on their functional connectivity F[i, j] ∈ [−1, 1], measured by the correlation of their time–series (cf. Section 9.2). Although in this chapter the relationship between the cost–function and the functional connectivity is defined by the heuristic:

dF[i, j] = 1 − |F[i, j]|,    (9.1.2)

the next chapter (cf. Section 10.2 of Chapter 10) develops a more formal and principled relationship.
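Definition 1 with the heuristic cost (9.1.2) reduces to a small linear program. A sketch on a hypothetical 4-voxel network, using `scipy.optimize.linprog` as the LP solver (the matrix F, the patterns, and the helper name are all illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Toy functional connectivity over N = 4 voxels: voxels {0, 1} and {2, 3}
# form two tightly coupled pairs (|F| near 1 within a pair, near 0 across)
F = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
dF = 1.0 - np.abs(F)                 # heuristic transport cost, Eq. (9.1.2)

def functional_distance(Z1, Z2):
    N = len(Z1)
    c = dF.reshape(-1)               # cost per unit of flow f[i, j]
    A_ub, b_ub = [], []
    for i in range(N):               # row sums: sum_j f[i, j] <= Z1[i]
        row = np.zeros((N, N)); row[i, :] = 1
        A_ub.append(row.reshape(-1)); b_ub.append(Z1[i])
    for j in range(N):               # column sums: sum_i f[i, j] <= Z2[j]
        col = np.zeros((N, N)); col[:, j] = 1
        A_ub.append(col.reshape(-1)); b_ub.append(Z2[j])
    A_eq = [np.ones(N * N)]          # total flow = min of the two masses
    b_eq = [min(Z1.sum(), Z2.sum())]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return res.fun

Za = np.array([1.0, 0.0, 0.0, 0.0])  # activity on voxel 0
Zb = np.array([0.0, 1.0, 0.0, 0.0])  # shifted to its functional partner
Zc = np.array([0.0, 0.0, 1.0, 0.0])  # shifted to an unrelated voxel
print(functional_distance(Za, Zb), functional_distance(Za, Zc))
```

Shifting activity to a strongly connected voxel costs 1 − 0.9 = 0.1, whereas shifting it to an unrelated voxel costs 1 − 0.1 = 0.9, exactly the intuition the definition is meant to capture.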
This definition of the functional distance FD captures the intuitive notion that two activity patterns are functionally more similar if the differences between them are mainly on voxels that are functionally related to each other, indicating the activation of a shared functional network, as illustrated by a toy example in Fig. 9.1. Figs. 9.1(a)–(b) show a simplified functional connectivity network in its correlation matrix and graph representations respectively. Three exemplary distributions of neural activity Zt at the nodes of the network are displayed in Figs. 9.1(c)–(e) at three different time–points t1, t2 and t3. Since the network activated at time–point t1 is functionally more related (as measured by the time–series correlations) to that at t2 than to that at t3, as per Definition 1, the functional distances satisfy FD(t1, t2) < FD(t1, t3).
9.2 Functional Networks
This section lays out an algorithm for computing the functional connectivity (i.e. correlations) F between voxels that is consistent, sparse and computationally efficient. Define Y = {Y1 ... YT} as the fMRI time–series data with N voxels and T scans, where Y[i] is the time–series of voxel i = 1 ... N.
Because N ≫ T, the standard covariance estimator is badly conditioned, and its eigen-system is inconsistent [186]. Therefore, regularization is required to impose sensible structure on the estimated covariance matrix while remaining computationally efficient.
First, the images are smoothed with a Gaussian kernel (FWHM = 8 mm) to increase the spatial coherence of the time–series data. Next, spatially proximal voxels are grouped into a set of Ne < N spatially contiguous clusters using hierarchical agglomerative clustering (HAC) [99]. HAC is repeated until the number of clusters Ne ≈ 0.25 × N, as elaborated upon in Section B.1 of Appendix B.
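The benefit of averaging within spatial clusters can be sketched on synthetic time-series. This is an illustration of the SNR argument only, with the clustering itself assumed given; the sizes and noise level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two spatial clusters of voxels; voxels in a cluster share a latent signal
T, m = 40, 25                                # scans, voxels per cluster
s1 = rng.normal(size=T)
s2 = 0.6 * s1 + 0.8 * rng.normal(size=T)     # latent cluster-level correlation ~ 0.6

def voxels(s):
    # each voxel = shared cluster signal + strong independent noise
    return s[None, :] + 1.5 * rng.normal(size=(m, T))

A, B = voxels(s1), voxels(s2)

# Voxel-wise correlation estimates across the two clusters are badly attenuated
vox_corr = np.mean([np.corrcoef(a, b)[0, 1] for a in A for b in B])

# Averaging within each (spatially contiguous) cluster first recovers most of it
clust_corr = np.corrcoef(A.mean(0), B.mean(0))[0, 1]
print(round(vox_corr, 2), round(clust_corr, 2))
```

Averaging the m voxels of a cluster cuts the noise variance by a factor of m, so the cluster-level correlation estimate is far less attenuated than the voxel-level one, while the number of covariance entries to estimate drops from N² to roughly (0.25 N)².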
Figure 9.1: Conceptual Illustration of the Functional Distance. A toy example of a functional connectiv- ity network represented as a matrix of correlations in Fig.(a) and its graph–based representation in Fig.(b). Figs.(c)–(e) show the neural activity intensity at three time–points from the data–set. The hot color scheme is used in these plots with activation magnitude increasing from left to right.
This procedure has a two-fold benefit of reducing the dimensionality of the estimation problem while simultaneously increasing the SNR of the data through averaging. Table 9.1 shows that the clusters, after Gaussian smoothing, are larger and their sizes are more uni- form for the same number of HAC–steps as compared to those without smoothing.
FWHM     0.5 × N HAC–steps        0.75 × N HAC–steps       0.875 × N HAC–steps
         Ne    Avg    Std.dev     Ne    Avg    Std.dev     Ne    Avg    Std.dev
               (mm³)  (mm³)             (mm³)  (mm³)             (mm³)  (mm³)
0 mm     0.69  11.59  6.32        0.51  15.68  14.62       0.36  21.26  30.33
4 mm     0.63  12.69  4.85        0.42  19.04  9.16        0.29  27.58  17.41
8 mm     0.58  13.74  3.98        0.34  23.52  7.70        0.22  36.16  12.56
Table 9.1: Effect of FWHM of the Gaussian kernel on the number of clusters Ne. The mean and standard deviation of cluster sizes (as a fraction of N) after a certain number of HAC–steps are shown. Values are for the data–set described in Section 3.2.
Next, cluster-wise covariances are computed and regularized using the adaptive soft shrinkage detailed in Section B.2 of Appendix B. Estimates of voxel-wise correlations are then recomputed from the regularized cluster-wise correlations. If i, j = 1 ... N index two cortical voxels, then the resulting functional connectivity map F[i, j] ∈ [−1, 1], for all 1 ≤ i, j ≤ N, is consistent and extremely sparse. It is also easy to verify that this F is positive definite. The results of this procedure on the distribution of the functional connectivity estimates for the mental arithmetic data–set of Section 3.2 are shown in Fig. 9.2.
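The shrinkage step can be sketched as elementwise soft-thresholding of the off-diagonal correlations. This is a simplification of the adaptive scheme in Appendix B: the fixed threshold `lam` and the helper name are illustrative, and the adaptive version chooses the threshold from the data.

```python
import numpy as np

def soft_shrink(R, lam):
    """Soft-threshold correlations: entries smaller than lam in magnitude
    go to exactly zero, larger ones are pulled toward zero by lam."""
    S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    np.fill_diagonal(S, 1.0)     # restore the unit diagonal
    return S

R = np.array([[1.00,  0.65,  0.08],
              [0.65,  1.00, -0.12],
              [0.08, -0.12,  1.00]])
S = soft_shrink(R, lam=0.2)
print(S)    # the weak correlations are zeroed, sparsifying the matrix
```

This is what produces the spike at zero in Fig. 9.2(d): most small correlations are set exactly to zero, leaving a sparse matrix.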
9.3 Computing the Functional Distance
While there exist efficient algorithms for computing the functional distance based on the Hungarian algorithm [165], these methods exhibit worst–case complexity of O(N³ log N). For an fMRI study with a voxel size of 3 × 3 × 3 mm³, the number of grey–matter voxels is ≈ 5 × 10⁴, giving a running time of O(10¹⁴). If the number of scans is T, O(T²) comparisons are required, making the pair–wise computation prohibitively expensive. Moreover, because the cost–function dF is derived from the functional connectivities
Figure 9.2: Results for the Regularized Correlation Estimator. Fig.(a): Without any regularization, most of the mass of the distribution is concentrated in small non–zero correlations, while the strong correlations are only a fraction of the total. Fig.(b): The smoothing procedure shifts the whole distribution towards the right, by strengthening all correlations. Fig.(c): The hierarchical clustering procedure boosts strong correlations without affecting weak correlations. Fig.(d): Finally, the shrinkage step sparsifies the correlation matrix, with most correlations set to zero.
between voxels and is not a Euclidean distance, standard approximations are not readily applicable [199].
Let X be the set of voxels in the cerebral cortex and E be the set of edges between them, with weights given by dF. Define G(X, E) as the graph structure representation of the functional networks. The problem is made tractable using the recursive approximation scheme of Algorithm 9.1. Here, FD(Zt1, Zt2) is solved on graphs of increasing resolution, starting from a very coarse–grained version of the graph. The FDj(Zt1, Zt2) at each resolution j is a lower bound on the true FD, and refinement of the graph is terminated if FDj > τj, a certain threshold, implying that the functional distance between the patterns at the two time–points is high enough that an approximate value will suffice. The inaccuracies introduced by this approximation are compensated for using a diffusion operation described in Section 9.4.
1  begin // Initialization
2    Let Zt1 and Zt2 be the (de–convolved) fMRI volumes at time–points t1 and t2
3  end
4  for j = J to 1 do
5    Create a low resolution graph Gj of N/2^j nodes by clustering 2^j maximally
     correlated voxels into groups using a greedy algorithm
     // Each node in the set I_n^j, n = 1 ... N/2^j now represents a group of 2^j voxels
6    Compute the transportation distance FDj(Zt1, Zt2) on this low resolution graph Gj
     // If the edge weights and vertex values of Gj are calculated appropriately,
     // then FD ≥ FDj, i.e. FDj is a lower bound on the true FD
7    if FDj(Zt1, Zt2) ≥ τj then // τj is a threshold
8      Approximate FD(Zt1, Zt2) ← FDj(Zt1, Zt2)
9      Exit
10   end
11 end
12 if j = 0 then we have effectively solved FD(Zt1, Zt2) = FD0(Zt1, Zt2)

Algorithm 9.1: Recursive Approximation of the Functional Distance
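To make the aggregation argument concrete, the following Python sketch (function names and the toy cost matrix are my own, not from the thesis) solves the transportation problem as a linear program and shows that collapsing groups of nodes, with the inter-group cost set to the minimum pairwise cost, can only lower the optimal value. This is exactly why FDj is a lower bound on the true FD:

```python
import numpy as np
from scipy.optimize import linprog

def transport_distance(delta_z, d):
    """Minimum-cost flow realizing the mass difference delta_z on a graph
    with edge costs d: minimize sum_{i,j} f[i,j]*d[i,j] subject to
    (inflow - outflow)[j] = delta_z[j] and f >= 0."""
    n = len(delta_z)
    A_eq = np.zeros((n, n * n))
    for i in range(n):
        for j in range(n):
            A_eq[i, j * n + i] += 1.0   # flow j -> i enters node i
            A_eq[i, i * n + j] -= 1.0   # flow i -> j leaves node i
    res = linprog(d.flatten(), A_eq=A_eq, b_eq=delta_z,
                  bounds=(0, None), method="highs")
    return res.fun

def coarsen(delta_z, d, groups):
    """Aggregate each group of nodes into one coarse node: the masses add
    up, and the inter-group cost is the minimum pairwise cost, so the
    coarse distance lower-bounds the fine one."""
    m = len(groups)
    dz = np.array([sum(delta_z[i] for i in g) for g in groups])
    dc = np.zeros((m, m))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            if a != b:
                dc[a, b] = min(d[i, j] for i in ga for j in gb)
    return dz, dc
```

On a 4-node toy example with two demand and two supply nodes, the exact distance is 5.0 while the coarse 2-node distance is 4.5, illustrating the bound.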
In what follows, the scheme to correctly aggregate a set of vertices i ∈ I into a new vertex i′, such that the FD computed on the reduced graph G′(X′, E′) is no greater than that on the original graph G(X, E), is given.
Let δZ = Zt2 − Zt1 be the difference between the two distributions. For expository purposes, let Σ_i Zt1[i] = Σ_i Zt2[i], i.e. Σ_i δZ[i] = 0, though the reasoning holds for the general case¹⁵. Then, the optimal flow f* from Zt1 to Zt2 on G is the solution to the transportation problem, subject to the constraints Σ_i f[i,j] − Σ_i f[j,i] = δZ[j] and f[i,j] ≥ 0, for all i, j ∈ X. Now, the total cost can be partitioned as

    Σ_{i,j∈X} f*[i,j]·dF[i,j] = Σ_{i,j∈I} f*[i,j]·dF[i,j] + Σ_{i,j∈X\I} f*[i,j]·dF[i,j] + Σ_{i∈I, j∈X\I} f*[i,j]·dF[i,j].   (9.3.1)
Firstly, through a conservation of mass argument, the total flow from i ∈ I to all j ∈ X\I must be Σ_{i∈I, j∈X\I} f*[i,j] = Σ_{i∈I} δZ[i]. Let f⁺ be the optimal solution to the transportation problem on the graph G′, where the value δZ[i′] = Σ_{i∈I} δZ[i]. Also, let

    d′F[i,j] = dF[i,j] if i, j ∈ X\I,   and   d′F[i′,j] = min_{i∈I} dF[i,j] if j ∈ X\I.
Therefore, the last term in eqn. 9.3.1 satisfies

    Σ_{i∈I, j∈X\I} f*[i,j]·dF[i,j] ≥ Σ_{j∈X′} f⁺[i′,j]·d′F[i′,j].
Secondly, using reductio ad absurdum, the first term in eqn. 9.3.1 satisfies Σ_{i,j∈I} f*[i,j]·dF[i,j] ≥ Σ_{i,j∈I} f°[i,j]·dF[i,j], where f° is the solution to the transportation problem restricted to the subgraph I, subject to the constraints:
    f[i,j] ≥ 0,
    Σ_j f[i,j] ≤ Zt1[i],
    Σ_i f[i,j] ≤ Zt2[j],
    Σ_{i,j} f[i,j] = min{ Σ_i Zt1[i], Σ_i Zt2[i] } − Σ_i δZ[i],
¹⁵This condition can be easily satisfied by adding to the optimization problem of eqn. 9.1.1 a dummy node with index N+1, called the dump, where δZ[N+1] = −Σ_{i=1}^{N} δZ[i] and dF[i, N+1] = 0, ∀i = 1 ... N.
with all i, j ∈ I. This subproblem on the vertices of I could again be solved using the above recursive approximation scheme. However, if |I| is small enough, then the exact algorithm may be used.
Therefore, the recursive approximation scheme gives a lower bound on the transportation distance as:

    FD(Zt1, Zt2) ≥ FDj(Zt1, Zt2) + Σ_{n=1}^{N/2^j} FD_{[I_n^j]}(Zt1, Zt2),   (9.3.2)

where FD_{[I_n^j]}(Zt1, Zt2) is the solution to the transportation problem restricted to the subgraph defined by I_n^j.
9.4 The Diffusion Distance
Although FD provides a well–motivated method to compare brain–states with similar activity patterns (i.e. low FD), its suitability to quantify the distance between patterns with larger differences is more uncertain, apart from the fact that for such comparisons, we have only an approximate FD. Then, assuming the accuracy of the FD only in local neighborhoods on the manifold spanned by the patterns of brain activity, the data are embedded into a lower–dimensional Euclidean space using the concept of diffusion distances [41], as follows.
Each activity pattern Zt, t = 1 ... T is treated as a vertex on a completely connected graph, specified by the T × T affinity matrix W, where

    W_{t1,t2} = exp( −FD(Zt1, Zt2)^2 / 2σ_W^2 ),

where the user-defined parameter σ_W defines a notion of proximity between activation patterns. Let D_{t,t} = Σ_{t′=1}^{T} W_{t,t′} be the T × T diagonal degree matrix. Then M = D⁻¹W can be treated as a stochastic matrix¹⁶ defining a random walk on the graph, with M_{t1,t2} encoding the probability p(t2|t1) of a Markov transition from node Zt1 to Zt2.
The probability p(n, t|t1) that the random walk starting at Zt1 will end at Zt in n steps is given by M^n_{t1,t}. Consider the following notation for the generalized eigen-system M = ΨΛΦ′, with Λ the diagonal matrix of eigenvalues 1 = λ1 ≥ ... ≥ λT ≥ 0, Φ the matrix of right eigenvectors (φ1 ... φT) and Ψ that of left eigenvectors (ψ1 ... ψT). In that case, p(n, t|t1) = φ1(t) + Σ_{j=2}^{T} λ_j^n φ_j(t) ψ_j(t1). Note that φ1(t) is the stationary distribution lim_{n→∞} p(n, t|t1) of the random walk, and is independent of the starting point Zt1.
The diffusion distance between two vertices on this graph quantifies the difference in the probability distributions of a random walk, starting at either of these vertices, arriving at any vertex after n steps:

    ρ_n^2(Zt1, Zt2) = Σ_{t=1}^{T} |p(n, t|t1) − p(n, t|t2)|^2 / φ1(t).   (9.4.1)

The parameter n defines the scale of the diffusion process and controls the sensitivity of the metric to the local geometry, with smaller n making the metric more sensitive to local differences. It can be shown that [41]:
    ρ_n^2(Zt1, Zt2) = Σ_{j=1}^{T} λ_j^{2n} [ψ_j(t2) − ψ_j(t1)]^2,

i.e. a Euclidean distance with the coordinates of vertex Zt defined by {λ_j^n ψ_j(t)}_{j=1}^{T}.
The diffusion distance, though not a geodesic distance, is related to the Laplace-Beltrami
and Fokker-Planck operators on the manifold underlying the graph and therefore provides
¹⁶That is, M ≥ 0 and Σ_{t′=1}^{T} M_{t,t′} = 1.
a geometrically aware embedding for functions intrinsically defined on it [41]. Moreover, since the spectral gap is usually large, with a few eigenvalues close to 1 and most ≈ 0, the diffusion distance is well–approximated by only the first T̂ ≪ T eigenvectors, with an error of the order of O(λ_{T̂+1}^n).
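The embedding can be sketched in a few lines of NumPy (an illustration under the eigenvalue conventions of eqn. 9.4.1, not the thesis implementation): the symmetric conjugate of M is diagonalized, the trivial constant eigenvector is discarded, and the coordinates λ_j^n ψ_j(t) are returned so that their pairwise Euclidean distances reproduce ρ_n:

```python
import numpy as np

def diffusion_embedding(dist, sigma, n_steps=2, n_dims=3):
    """Diffusion-map coordinates for T points with pairwise distances
    `dist` (T x T); Euclidean distances between the returned rows equal
    the n-step diffusion distance rho_n."""
    W = np.exp(-dist ** 2 / (2.0 * sigma ** 2))      # affinity matrix
    D = W.sum(axis=1)                                # vertex degrees
    # Symmetric conjugate S = D^{-1/2} W D^{-1/2} shares its eigenvalues
    # with the row-stochastic Markov matrix M = D^{-1} W.
    S = W / np.sqrt(np.outer(D, D))
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]                  # 1 = lam_1 >= lam_2 >= ...
    evals, evecs = evals[order], evecs[:, order]
    psi = evecs / np.sqrt(D)[:, None]                # eigenvectors of M
    # Drop the constant top eigenvector; the global scale factor makes the
    # distances match rho_n for the stationary density pi = D / sum(D).
    return np.sqrt(D.sum()) * (evals[1:n_dims + 1] ** n_steps) * psi[:, 1:n_dims + 1]
```

With all T − 1 non-trivial coordinates retained the match is exact; truncating to T̂ coordinates introduces only the O(λ_{T̂+1}^n) error noted above.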
9.4.1 Hierarchical Clustering:
After embedding the data acquired at T time–points in a T̂ ≪ T–dimensional Euclidean space, they are grouped into K clusters {c1 ... cK} using k–means [99]. These clusters represent distinctive patterns of the distribution of metabolic activity and are probable indicators of distinctive modes / states of cognition. Next, each cluster ck, k = 1 ... K is labeled with an integer value 0 ≤ lk < K, using dynamic programming to minimize Σ_{t=1}^{T−1} |l(t+1) − l(t)|, where l(t) is the label lk of cluster ck if S(t) ∈ ck. This results in a time–series of labels in which most transitions are between labels close to each other in value.
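The relabeling objective can be illustrated as follows. The thesis solves it with dynamic programming; this sketch (names are my own) simply brute-forces all K! permutations, which is adequate for small K:

```python
import itertools
import numpy as np

def order_labels(assign, K):
    """Relabel clusters 0..K-1 so that the resulting label time-series
    minimizes sum_t |l(t+1) - l(t)|, i.e. temporally adjacent clusters
    receive numerically adjacent labels."""
    best_cost, best_perm = np.inf, None
    for perm in itertools.permutations(range(K)):
        labels = np.array([perm[a] for a in assign])
        cost = np.abs(np.diff(labels)).sum()
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [best_perm[a] for a in assign]
```

For example, the cluster sequence [0, 0, 2, 2, 1, 1] is relabeled so that its two transitions each have unit step, giving a total transition cost of 2.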
9.5 Results
This section reports the results of the method applied to the fMRI data of 4 healthy control subjects (male) who underwent a study of mental arithmetic. The paradigm used here was similar in layout and timing to that presented in Section 3.2, with the difference that the stimulus was either displayed or sounded out, i.e. alternating audio and visual modalities.
Acquisition was done on a GE 3T LX scanner with a quadrature head coil, using a BOLD-sensitized 2D-EPI gradient-echo pulse sequence (TE=35ms, FA=90°, TR=2s, voxel size 3.75 × 3.75 × 3.75 mm^3). A typical session lasted ≈ 18 minutes, with 150 trials and 525 scans.
The algorithms developed in this chapter were implemented using MATLAB® and Star-P® on a 2.6GHz Opteron cluster with 16 processors and 32GB RAM. The recursive FD approximation was done with J = 10, and the thresholds τj were selected adaptively, so that only a small percentage (25%) of the comparisons would need to be performed again at the next level. This resulted in a speed-up of 10^3×, with an average running time of ≈ 23 hours per subject. Fig. 9.3 shows the relative approximation error (FD_true − FD_approx)/FD_true with respect to the true FD. It is observed that the relative error scales linearly with respect to the true distance and is acceptably small for our data–sets.
Figure 9.3: Relative approximation error (FD_true − FD_approx)/FD_true with respect to FD_true. The x–axis is normalized with respect to the maximum FD between all pairs of volumes.
The parameter σ_W used in the affinity matrix W was set such that α% of all the pair-wise FD were less than it, reflecting the assumption that α% of brain patterns should be “close” to any given pattern. We found the results to be reproducible for 5% ≤ α ≤ 20% in our experiments. For the low dimensional embedding in the diffusion metric space, T̂ = 8 was a conservative value, with λ_{T̂+1} < 0.05. The number of clusters K = 11 was selected for all subjects.
In Fig. 9.4, the median brain–state labels for the audio and visual presentations of the trial are shown, for the four subjects. Here, a strong pattern in the assignment of the state with the phase of the task can be seen. Also, there is a clear separation of labels during presentation of the experiment depending on its modality (audio vs. visual), which converge towards the computation phase, as expected. These findings are especially significant given that no information about the experiment was used when determining the brain state. This synchronization of the brain–state labels with the experiment phase becomes more apparent on examining the intensity distributions for each cluster. The t-score maps for the first subject are shown in Fig. 9.5. The t-scores at every voxel were computed as the within-cluster mean divided by the within-cluster standard deviation.
Figure 9.4: The median brain–state labels, for all four subjects, during a single trial of the experiment. The phases of the experiment are color-coded to indicate the 2.5s, 0.3s, 0.8s, 0–4s and 1s intervals of each trial, as per Fig. 3.2. Red and blue lines show the median brain–states for the visual vs. audio presentation of the numbers, respectively. Also shown are the 25 and 75 percentile bands.

State 1 shows strong patterns in the visual cortex, and its occurrence typically corresponds to the visual presentation of the two numbers. State 3, which usually occurs later, is active in visual areas related to size estimation. States 5 and 6, associated with the calculation / judgement phase, are mainly in the frontal and parietal lobes, implicated in higher level cognition and number size assessment. There is also activity in the motor cortex, which may be related to the button press. In states 8 and 10, which usually coincide with the audio presentation of the multiplication problem, the patterns are concentrated in the auditory cortices in the temporal lobe. These findings are in close agreement with those reported in [163] for this paradigm using conventional analysis.
Figure 9.5: The within-cluster t–scores for states 1, 3, 5, 6, 8, 10 for the first subject, overlaid on a high-resolution structural scan. The color coding indicates the intensity in that region. Also shown is the temporal ordering of brain–states for the subject in each trial. The other maps are qualitatively similar and omitted for concision, while t–scores between [−2, +2] are not displayed for clarity.
9.6 Conclusion
This chapter proposed a refinement in identifying the intrinsic cognitive states of the subject through the concept of a functional distance metric. This metric, in combination with the diffusion distance, was used to embed the spatially–distributed instantaneous patterns of brain activity, as captured by fMRI, in a space aware of the underlying topology of these patterns, and clustering was performed in this space. We developed a computationally tractable approximation of the FD based on recursive aggregation.
The method was used to analyze a study of mental arithmetic, and a pattern in the sequence of brain–states was observed that was highly organized with respect to the experiment, with distinct changes from one phase to another. The effect of different experimental conditions on the measured response of the brain was also observed. Brain maps of activity were obtained that were physiologically meaningful with respect to the expected mental phase, and corresponded with the results of conventional analysis. The neurophysiological consistency of these maps and the temporally organized structure of the brain–states validate the functional distance metric proposed in this chapter for analyzing intrinsic cognitive states from fMRI data.
The combined use of the FD and the diffusion distance is the main reason the method is able to extract relevant structures from the data. The diffusion distance can be thought of as an operator that uses locally accurate measures of similarity to induce a globally consistent Euclidean metric on the space. For it to succeed, however, it is crucial that the underlying measure be accurate when two points are close to each other.
In the next chapter, I shall use the definition of functional distance introduced here to derive an embedding of the patterns of brain activity in Euclidean space, and use a state–space formalism to analyze the dynamics of these patterns in the following chapters.
CHAPTER 10
FEATURE–SPACE: A LINEAR EMBEDDING OF THE FUNCTIONAL DISTANCE
This ... obliged us to abandon, on the plane of atomic magnitudes, a causal description of nature in the ordinary space-time system, and in its place to set up invisible fields of probability in multidimensional spaces.
Carl Gustav Jung (1875–1961) and Wolfgang Pauli (1900–1958), The Interpretation of Nature and the Psyche.
In the previous chapter, a distance metric was introduced that quantified the functional distance between the distributed patterns of the metabolic traces of neural activity at two different time–points. In this chapter, the concept of the functional distance is used to derive a low–dimensional linear Euclidean embedding for fMRI data, which provides a good approximation of the functional distance.
The layout of this chapter is as follows: The need for such an embedding is motivated in Section 10.1. The construction of the basis vectors of this feature–space is described in Section 10.2, and the proof for its approximation of the functional distance is derived in Section 10.3. A dimensionality reduction procedure by means of bootstrapped feature–selection [56] is provided in Section 10.4. A quantitative evaluation of the feature–space is reported in Section 10.5, calculated on the mental arithmetic data–set described in Section 3.2.
Note: In the discussion that follows > denotes the matrix transpose operator.
Symbol: Definition
>: Matrix transpose operator
T: Total number of time–points in an fMRI session
N: Total number of (cortical) voxels in an fMRI volume
Yt ∈ R^N: The fMRI scan at 1 ≤ t ≤ T
Y: Defined as (Y1 ... YT)
Zt ∈ R^N: The (pre–HR) brain activation pattern at 1 ≤ t ≤ T
Z: Defined as (Z1 ... ZT)
C = {1 ... N}: Voxel–grid of the volumetric fMRI data
F ∈ [−1, 1]^{N×N}: Functional connectivity (i.e. correlation) map
dF ∈ R_+^{N×N}: The distance metric induced by F
DF ∈ R^{N×N}: The diagonal degree matrix of F
L: The normalized graph Laplacian of F
φ ∈ R^N: One-dimensional distortion-minimizing embedding of F
Φ = {φ^(l,m) ∈ R^N}: Orthogonal basis functions of the feature–space
Table 10.1: A summary of the notation used throughout this chapter.
10.1 Motivation
Although the FD metric was used to extract neurophysiologically meaningful patterns – both spatially and temporally – from the fMRI data in an unsupervised fashion, it has the following limitations. Firstly, it is computationally very expensive, even with the recursive approximation algorithm. This is mainly due to the need for solving the transportation problem between all pairs of fMRI volumes.
More importantly, however, the metric is posed as the solution to an optimization problem and therefore does not have a well–understood topological or geometric structure. For example, there is no closed form solution for computing the centroid of a cluster under this metric¹⁷. As a result, determining the statistical properties of clusters obtained under this metric, let alone developing more sophisticated models of brain–function, is not straightforward. In the following chapters, a state–space representation of the spatio–temporal patterns in the metabolic traces left by neural activity is presented. Using the FD metric, inference in these state–space models becomes mathematically intractable.
Therefore, in this chapter we present a linear feature–space for fMRI which provides a good approximation of this similarity. This embedding is a generalization of the linear approximation for the earth mover’s distance (EMD) defined on Euclidean spaces [199] to arbitrary spaces. The dimensionality of this feature–space is reduced using a bootstrap analysis of stability [17], where only features that are stable across multiple resamples of the data are retained.
¹⁷As compared to a Euclidean metric, where the centroid is nothing but the empirical average of all the points in the cluster. Under the FD metric, the mean is the solution to min_{Z0} Σ_t FD(Zt, Z0)^2.
10.1.1 Other Feature–Spaces in fMRI
In fMRI, the number of voxels N ∼ O(10^5) is orders of magnitude larger than the number of scans T ∼ O(10^2). Therefore, for a multivariate analysis of the data, some type of dimensionality reduction is necessary to prevent over–fitting, typically through a linear transformation of the voxel–wise data into a new basis followed by a feature-selection step, although non-linear transformations have also been used [86]. These transforms have included projecting along the directions of maximum variance (i.e. PCA) [71], along the directions of maximum covariance with experimental variables (i.e. PLS) [152], along the directions of statistical independence (i.e. ICA) [35], projecting onto the original fMRI scans themselves (i.e. support vectors) [65], or harmonic transforms (i.e. Fourier and wavelets) [189].
Supervised methods for dimensionality reduction select features either most correlated with or most predictive of the experimental variables [65, 159]. The aim of the spatio–temporal analysis presented in this dissertation is to reveal the intrinsic mental state of the subject, using the recorded experimental variables only as a guide, without limiting the ability to capture new and unexpected patterns. Hence, supervised feature selection approaches are unsuitable, as they are inherently biased towards the experimental variables for which they were selected, while ignoring intrinsic patterns in the data. At the other extreme, purely unsupervised dimensionality reduction methods, such as retaining components with highest variance, are based on statistical criteria unrelated to any model of brain function. Therefore such approaches, while good for data compression, have been shown to be inadequate for predicting cognitive states [90, 173]. For example, in our data–sets we observed that the largest variance principal components corresponded to motion and physiological noise, such as respiration and pulsatile activity.
In contrast to these methods, the feature–space developed here is derived not from arbitrary statistical criteria but from a definition that captures an intuitive notion of functional similarity. And as feature–selection uses a stability criterion, it does not suffer from the biases of supervised methods or the ambiguities of unsupervised methods.
10.2 Feature–Space
10.2.1 Cost–Function
Consider the cost–function dF in the definition of FD (cf. Definition 1 in Section 9.1). It was, at that point, defined heuristically as dF[i,j] = 1 − |F[i,j]|.
Here, we shall start the construction of the feature–space by redefining this cost–function as that induced by a distortion-minimizing one-dimensional embedding φ* : N → R of the graph with F as its adjacency matrix:

    φ* = arg inf_{φ ⊥ DF·1}  [ Σ_i Σ_j (φ[i] − φ[j])^2 F[i,j] ] / [ Σ_i φ[i]^2 DF[i,i] ],   (10.2.1)
where DF is the diagonal degree matrix of the adjacency matrix F:

    DF[i,i] = Σ_{j≠i} F[i,j],    DF[i,j] = 0, ∀ i ≠ j.
Here, the embedding φ* will take similar values at voxels that have high functional connectivity, and the cost–function between them is dF[i,j] = |φ*[i] − φ*[j]|. The constraint φ ⊥ DF·1 is to prevent φ from taking a value at each vertex proportional to its degree, which is the trivial minimizer of eqn. 10.2.1.
Rewriting eqn. 10.2.1 in matrix notation and using the method of Lagrange multipliers [40], it can be shown that φ* is the solution to the generalized eigenvalue problem:

    (DF − F) φ = λ DF φ   such that   φ> DF·1 = 0.

If η1 is the eigenvector, L η1 = λ1 η1, corresponding to the second smallest eigenvalue λ1 > 0 of the normalized graph Laplacian of F:

    L = DF^{−1/2} (DF − F) DF^{−1/2},   (10.2.2)

then φ* = DF^{−1/2} η1.

10.2.2 Orthogonal Basis for the Feature–Space
Through a recursive partitioning of the voxel–grid based on its embedding φ*, an orthogonal basis Φ = {φ^(l,m) : N → R} is constructed, as elaborated by Algorithm 10.1. The index m = 0 ... log2 N − 1 gives the level of decomposition, while l = 0 ... 2^m − 1 indexes the basis vectors at level m. The first basis vector is φ^(0,0) = DF^{−1/2} η1, where η1 is the eigenvector of L^(0,0) = L corresponding to the second smallest eigenvalue.

The graph is then partitioned into two sub-graphs based on the sign of φ^(0,0), and their graph Laplacians L^(1,1) and L^(2,1) are computed. The details of this partitioning are given in Appendix C.1. The next two basis vectors φ^(1,1) and φ^(2,1) are the second smallest eigenvectors of L^(1,1) and L^(2,1), respectively. The process is repeated until only one voxel is left in the partition.

1  begin // Initialization
2    C^(0,0) ← C;  F^(0,0) ← F;  DF^(0,0) ← DF;  L^(0,0) ← L
3    φ^(0,0) ← DF^{−1/2} η1;  λ^(0,0) ← λ1
4    m ← 0;  l ← 0
5  end
6  while |C^(l,m)| > 1 do
7    for l ← 0 to 2^m − 1 do
       // Recompute the residual connectivity
8      F̂ ← F^(l,m) − λ^(l,m) φ^(l,m) φ^(l,m)>
       // Partition the grid C^(l,m) into C^(2l,m+1) and C^(2l+1,m+1)
       // based on the sign of φ^(l,m)
9      for i, j ∈ C^(l,m) do
10       if φ^(l,m)(i) ≥ 0 AND φ^(l,m)(j) ≥ 0 then
11         F^(2l,m+1)[i,j] ← F̂[i,j]
12         Add i, j to C^(2l,m+1)
13       else if φ^(l,m)(i) < 0 AND φ^(l,m)(j) < 0 then
14         F^(2l+1,m+1)[i,j] ← F̂[i,j]
15         Add i, j to C^(2l+1,m+1)
16       else
17         F^(2l,m+1)[i,j] ← 0
18         F^(2l+1,m+1)[i,j] ← 0
19       end
20     end
       // Recompute the graph Laplacian and its second smallest eigenvector
21     Compute DF^(2l,m+1), DF^(2l+1,m+1) and L^(2l,m+1), L^(2l+1,m+1) from
       F^(2l,m+1) and F^(2l+1,m+1)
22     Calculate φ^(2l,m+1), φ^(2l+1,m+1) from the eigenvectors of L^(2l,m+1),
       L^(2l+1,m+1) corresponding to their second smallest eigenvalues
       λ^(2l,m+1) and λ^(2l+1,m+1), respectively
23   end
24 end

Algorithm 10.1: Construction of Orthogonal Basis Functions

The coefficients of the spatially distributed brain activation at one time instant Zt in this orthogonal linear space are denoted as {zt[l,m] : m = 0 ... log2 N − 1, l = 0 ... 2^m − 1}, where zt[l,m] ≜ 2^{−m} ⟨Zt, φ^(l,m)⟩.

Then the functional distance FD(Zt1, Zt2) is well–approximated by the ℓ2 distance metric in this space:

    ∆(zt1, zt2) = ( Σ_{l,m} |zt1[l,m] − zt2[l,m]|^2 )^{1/2}.   (10.2.3)
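A simplified sketch of the recursive spectral bisection at the core of Algorithm 10.1 is given below. This is my own illustration: F is treated as a zero-diagonal adjacency matrix, and the residual-connectivity update (line 8 of the algorithm) is omitted for clarity, so the resulting vectors illustrate only the hierarchy, not the exact orthogonal basis:

```python
import numpy as np

def fiedler_vector(F):
    """phi = D^{-1/2} eta_1, with eta_1 the eigenvector of the normalized
    Laplacian L = D^{-1/2}(D - F)D^{-1/2} for its second-smallest eigenvalue."""
    d = F.sum(axis=1)
    d[d == 0] = 1e-12                         # guard isolated vertices
    L = np.eye(len(d)) - F / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return evals[1], evecs[:, 1] / np.sqrt(d)

def build_basis(F):
    """One vector per split of the recursive bisection hierarchy."""
    N = F.shape[0]
    basis, stack = [], [np.arange(N)]
    while stack:
        idx = stack.pop()
        if len(idx) <= 1:
            continue
        _, phi = fiedler_vector(F[np.ix_(idx, idx)])
        vec = np.zeros(N)
        vec[idx] = phi
        basis.append(vec)
        pos, neg = idx[phi >= 0], idx[phi < 0]
        if len(pos) and len(neg):             # guard degenerate splits
            stack += [pos, neg]
    return basis
```

On a 4-node path graph the first vector separates the two halves {0, 1} and {2, 3}, and the recursion produces N − 1 = 3 vectors in total.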
The proof for this assertion is the topic of the next section.
10.3 Linear Approximation for Functional Distance
To examine the reasoning behind this approximation, consider the dual formulation of the linear optimization problem of eqn. 9.1.1:

    FD(Zt1, Zt2) = sup_g Σ_{i=1}^{N} g[i]·δZ[i]   (10.3.1)

subject to the constraints:

    g[i] − g[j] ≤ dF[i,j]   and   Σ_i g[i] = 0.
Please consult Section C.2 of Appendix C for a more detailed explanation of the primal–
dual equivalence.
This cost function is nothing but an inner product between g : N → R and the difference vector δZ = Zt1 − Zt2. Since inner products are preserved under orthogonal transformations, it is the case that

    ⟨g, δZ⟩ = ⟨Φ[g], Φ[δZ]⟩.

Denoting the coefficients of δZ in the basis Φ as δz[l,m] ≜ ⟨φ^(l,m), δZ⟩, the following theorem can be proved:
Theorem 1. Let δz[l,m] be the coefficients of δZ = Zt1 − Zt2 in the basis Φ. Then, there exist constants M_{0,0} > 0 and M̂_{0,0} > 0, such that

    M̂_{0,0} Σ_{m=0}^{log2 N − 1} Σ_{l=0}^{2^m − 1} |δz[l,m]| ≤ FD(Zt1, Zt2) ≤ M_{0,0} Σ_{m=0}^{log2 N − 1} Σ_{l=0}^{2^m − 1} |δz[l,m]|,   (10.3.2)

and the tightness of this bound is:

    sup_{||δz||_2 = 1} [ M_{0,0} Σ_m Σ_l |δz[l,m]| − M̂_{0,0} Σ_m Σ_l |δz[l,m]| ] ≈ (M_{0,0} − M̂_{0,0}) / √2.   (10.3.3)
The detailed derivations of the approximation bounds and their tightness are listed in Section C.3 of Appendix C.
As shown in Theorem 2 of Appendix C, similar bounds can be derived (or numerically evaluated) for any orthogonal basis Ψ defined on C with respect to the distance metric induced by F (if Σ_i ψ^(l)[i] = 0). Identifying the basis with the tightest bound (cf. eqn. C.3.3) is a combinatorial optimization problem in N^2 variables with N(N+1)/2 orthogonality constraints.
Given the computational complexity of this problem, we instead selected from among a set of standard bases by choosing the one with the best tightness. The basis Φ defined in Section 10.2 was compared against the following other bases for the functional connectivity maps F over all the subjects in our data–set. The minimum and maximum values of the bound-tightness metric, relative to the average value for Φ, are listed:
i The delta-basis {δ[i], i = 1 ... N}, i.e. the original voxel-wise data itself: 8.43–11.58
ii The PCA-like basis consisting of the eigenvectors of F: 3.21–4.66
iii The Laplacian eigenmap [16] basis containing the eigenvectors of the normalized graph Laplacian of F: 1.79–2.30
iv The basis set containing indicator functions on recursive normalized cuts [198] of the graph defined by F: 2.02–2.95
v The diffusion wavelet basis induced by F [42]: 0.89–1.13
vi An orthogonal basis derived from a spatial ICA decomposition [154]: 3.51–5.87
One reason for the comparatively tight bound of Φ is the fast decay of the coefficients in this basis, which makes their contribution to the error negligible. The relatively similar values of [iii] and [iv] arise because they are obtained from a similar set of operations on F, and their basis vectors share many properties, such as coefficient decay. Although the diffusion wavelet basis is a tighter approximation to the distance metric FD, its marginally better performance is offset by its much greater computational burden. The high variance of the ICA-derived basis could be because it is not directly related to F and also because the coefficients in this basis are not sparse.
10.4 Feature Selection
The feature–space Φ obtained through the orthogonalization of the graph Laplacian L of
F is of dimensionality N, the number of voxels. In order to reduce the dimension of this space, a feature selection strategy based on assessing the stability of the basis vectors through a non-parametric bootstrap analysis [56] is used.
The bootstrap generates a non–parametric estimate of the sampling distribution of a statistic (i.e. the bootstrap distribution) from a single sample of the data, and is used in cases where generating multiple independent samples is infeasible. It creates multiple surrogate samples (i.e. resamples), of the same size as the original sample, by resampling with replacement from the original sample.
Bootstrap estimates of the functional connectivity matrix F are obtained by resampling fMRI volumes from a session Y = {Y1 ... YT} to create a surrogate session. The presence of serial correlations in the time–series data Y prevents a naïve resampling scheme, where T scans are randomly selected (with replacement), since it would destroy the background correlations present in the data. However, as fMRI scans are block exchangeable [166], a block bootstrap method can be used, wherein the T scans are divided into M blocks, and a resample is created by randomly selecting M blocks from this set, with replacement. Although the block length T/M needs to be adapted to the range of temporal dependencies present in the data, the correlation structure in the data is faithfully reproduced over a fairly wide range of lengths [17]. We found T/M ≈ 5 TRs to be adequate for our data–sets.
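A minimal sketch of this block resampling scheme (names are my own; for simplicity T is assumed divisible by the block length):

```python
import numpy as np

def block_bootstrap(Y, block_len=5, rng=None):
    """One surrogate fMRI session: split the T scans (rows of Y) into
    consecutive blocks of `block_len` TRs and draw whole blocks with
    replacement, preserving serial correlations within each block."""
    rng = np.random.default_rng(rng)
    T = Y.shape[0]
    blocks = [np.arange(s, min(s + block_len, T)) for s in range(0, T, block_len)]
    picks = rng.integers(0, len(blocks), size=len(blocks))
    rows = np.concatenate([blocks[p] for p in picks])[:T]
    return Y[rows]
```

Each surrogate session has the same length as the original, and the functional connectivity matrix F would then be re-estimated from it as in Section 9.2.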
The stability of a particular vector φ^(l,m) is defined as its correlation across the resamples of the data–set. Specifically, if φ^(l,m)_(r) is the estimate of φ^(l,m) from the r–th resample of the fMRI data Y, then the absolute correlation across two resamples r1 and r2 is¹⁸

    ρ^(l,m)(r1, r2) ≜ |⟨φ^(l,m)_(r1), φ^(l,m)_(r2)⟩|.

Given the bootstrap distribution of correlations Pr_boot[ρ^(l,m)(r1, r2)], a vector φ^(l,m) is said to be τΦ–stable if Pr_boot[ρ^(l,m)(r1, r2) ≥ τΦ] ≥ 0.75, i.e. the correlation between at least 75% of the resamples of φ^(l,m) is greater than the threshold 0 ≤ τΦ ≤ 1.
If φ^(l,m) is not τΦ–stable, then it is discarded, which also removes all the vectors obtained from the subdivision of φ^(l,m). Therefore, increasing the value of τΦ causes a geometric increase in the number of vectors that are removed.
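The stability test itself is short to state in code (a sketch with my own names; the resampled estimates are assumed unit-norm, so the inner product is the correlation):

```python
import numpy as np

def is_stable(phis, tau):
    """tau-stability check for one basis vector: `phis` holds its unit-norm
    estimates across R resamples; stable if at least 75% of the pairwise
    absolute correlations |<phi_r1, phi_r2>| reach tau. The absolute value
    absorbs the sign indeterminacy of eigenvectors."""
    R = len(phis)
    corrs = np.array([abs(np.dot(phis[a], phis[b]))
                      for a in range(R) for b in range(a + 1, R)])
    return bool(np.mean(corrs >= tau) >= 0.75)
```

A vector that merely flips sign across resamples is still stable, while one that changes direction in even a third of the resamples is discarded.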
A flow–chart of the feature–space computation is shown in Fig. 10.1.
10.5 Evaluation of Feature–Space
The effect of τΦ on the dimensionality D is shown in Fig. 10.2. Initially, there is a steep reduction in dimensionality from O(10^5) to O(10^2). However, after a certain value of τΦ, the reduction slows down significantly. For all the data–sets tested, this knee-point usually occurred at D ≈ 500, corresponding to τΦ = 0.4–0.5. Therefore, τΦ was adaptively set for each fMRI session such that D = 500.

The figure also shows the largest index m of the vectors φ^(l,m) retained for a given τΦ, indicating that the stability of the basis vectors reduces as the level of decomposition m increases.
This observation, along with the 2^{−m} decay of the coefficients in eqn. 10.3.2, implies that
18The absolute value is chosen to account for the indeterminacy in the sign of φ(l,m).
[Figure 10.1 flow–chart: Y → (repeated R times) resampling with replacement → functional connectivity estimation: Gaussian smoothing; HAC until Ñ ≈ 0.25N; cluster-wise correlation estimation and shrinkage; voxel-wise correlation estimation → basis vector φ^(l,m) computation → bootstrap distribution of correlations ρ^(l,m) → feature selection: retain φ^(l,m) if Pr[ρ^(l,m) ≥ τΦ] ≥ 0.75 → Φ]
Figure 10.1: Overview of Feature–Space Computation. The functional connectivity of the brain is estimated from a resample of the fMRI data Y using the method described in Section 9.2. Basis vectors of the feature–space Φ are computed through a recursive orthogonal partitioning of these functional networks as per Algorithm 10.1. Dimensionality reduction is performed by retaining stable basis vectors using the bootstrap analysis of stability.
Figure 10.2: Dimensionality Reduction. The effect of threshold τΦ on the average dimensionality of Φ and on the maximum index m (level of decomposition) of the basis vectors φ(l,m) that were retained. Results are for the data–set of Section 3.2.
the effect of the reduced dimensionality of Φ on the approximation error is small, as most of the discarded vectors have a large index m.
The relative logarithmic error of the linear approximation ∆(z_{t1}, z_{t2}) of FD(Z_{t1}, Z_{t2}), defined as

| log_10 ∆(z_{t1}, z_{t2}) − log_10 FD(Z_{t1}, Z_{t2}) | / log_10 FD(Z_{t1}, Z_{t2}),

using the reduced Φ versus the full basis set is shown in Fig. 10.3.
It can be observed that the linear approximation ∆(zt1 , zt2 ) provided by the full basis Φ
is typically within 2.5× the transportation distance FD(z_{t1}, z_{t2}), while the distance in the reduced–dimensionality basis is within 3× FD(z_{t1}, z_{t2}), providing empirical validation of eqn. 10.3.3. Also, it can be seen that reducing the dimensionality by three orders of magnitude (O(10^3)) on average increases the approximation error by < 20%.
As an interesting side note, since the feature–space Φ is an orthogonalization of the graph Laplacian of the voxel–wise correlations, the feature–space coefficients y_t[l, m] and the
Figure 10.3: Approximation Quality. A scatter plot of the relative logarithmic (base–10) error in the ap- proximation of FD(zt1 , zt2 ) by ∆(zt1 , zt2 ) using the reduced vs. the full basis-set.
fMRI volume Y_t exhibit very low temporal correlation^19, and their covariance matrix is extremely sparse, with 98% of the cross–correlations having an absolute value < 0.2, as shown in Fig. 10.4.
10.6 Conclusion
In the previous chapter we motivated and developed a distance metric to compare the difference between the neural activity distributions at two different time–instants (i.e., TRs) encoded in the metabolic traces recorded by fMRI. Although this functional distance provided a neurophysiologically meaningful indication of intrinsic classes of activity, its
^19 The correlation is exactly zero if the basis vectors are obtained from the orthogonalization of the covariance matrix rather than the correlation matrix.
Figure 10.4: Cross-Correlations in Feature–Space. Histogram of the non-zero cross-correlations (absolute values) of the feature–space coefficients.
formulation made it mathematically intractable for advanced modeling and statistical analysis.
In response to this problem, and to address the issue of its high computational complexity, in this chapter we developed a linear Euclidean embedding of the voxel–wise distribution of activity at each time–point that provides a good approximation of this functional distance. The dimensionality of this feature–space, derived from a recursive orthogonal partitioning of the graph of functional connectivities, was reduced through a bootstrap analysis of stability.
In the next chapter, we shall use this representation of the fMRI data to build a spatio– temporal model using a state–space formalism.
CHAPTER 11
STATE–SPACE MODELS : TOWARDS A SPATIO-TEMPORAL REPRESENTATION
The modern age has a false sense of security because of the great mass of data at its disposal. But the valid issue is the extent to which people know how to form and master the material at their command.
Johann Wolfgang von Goethe (1749–1832).
In Chapter 7 we pointed out multiple challenges with the status quo of multivariate methods for studying the representation of cognitive states from fMRI data, including:
• The bias of supervised methods towards modeled effects, preventing discovery of intrinsic cognitive states not related to behavioral parameters

• The trade–offs of the lack of interpretability and simple designs of multivariate pattern classifier (MVPR) methods vs. the need for specifying the mathematical transformation of stimulus to fMRI signal and homogeneous hemodynamics for multivariate linear models (MVLM)

• The fundamental drawback of all these methods, namely producing a static picture of the inherently dynamic and changing processes of cognition
In this chapter, we shall adduce a solution to some of these issues with an unsupervised, temporally resolved multivariate analysis based on a state–space model of brain function.
Extant literature on state–space models in fMRI is reviewed in Section 11.1.
In this formulation, a first–order Markov chain captures the concept of the functional brain transitioning through a cognitive state–space as it performs a mental task. Each state is associated with a characteristic spatial distribution of activity, and an occurrence of the state corresponds to an activation pattern based on this signature. The observed fMRI data arise from the convolution of these activation patterns with the hemodynamic response function (HRF). The model is described further in Section 11.2.
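The generative story above can be sketched in a few lines of code. Everything here is illustrative: the sizes are toy values, the Gaussian bump is a stand-in for the HRF, and the noise scales are arbitrary rather than the parameterization used later:

```python
import numpy as np

rng = np.random.default_rng(7)
K, D, T, L = 3, 4, 200, 5            # states, feature dims, scans, HRF lag (toy)

alpha = rng.dirichlet(np.ones(K))    # marginal / initial state distribution
pi = rng.dirichlet(np.ones(K), size=K)   # K x K stochastic transition matrix
mu = rng.normal(0.0, 1.0, (K, D))    # per-state activation signatures
h = np.exp(-0.5 * ((np.arange(L + 1) - 2.0) / 1.5) ** 2)  # toy HRF stand-in

# hidden brain states evolve as a first-order Markov chain
x = np.empty(T, dtype=int)
x[0] = rng.choice(K, p=alpha)
for t in range(1, T):
    x[t] = rng.choice(K, p=pi[x[t - 1]])

# activation pattern emitted at each TR, then convolved with the HRF
z = mu[x] + 0.1 * rng.normal(size=(T, D))
y = np.zeros((T, D))
for l in range(L + 1):
    y[l:] += h[l] * z[:T - l]
y += 0.05 * rng.normal(size=(T, D))  # time-stationary measurement noise
```

The three layers of the sketch (x, z, y) correspond directly to the three layers of the full model of Fig. 11.1.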
Then in Section 11.3, an expectation–maximization (EM) [50] algorithm with Gibbs sam- pling [194] is used to estimate the Markov structure, the activation maps and the optimal sequence of brain–states for a given data–set. The effect of assuming a fixed shape of the
HRF is ameliorated by marginalizing it out under a Laplace approximation. A method to determine the correct model size resulting in a model most relevant to the investigator is proposed in Section 11.4.
A quantitative validation of the estimation algorithms is given in Section 11.5.1. The results of this dynamical multivariate analysis on the study of mental arithmetic (cf. Section 3.2) are reported in Section 11.5.2. Finally, the chapter concludes with some remarks and observations in Section 12.6.
Appendix D contains the complete proofs and details about the algorithms developed in this chapter.
11.1 Related Work
State–space models have previously been used in fMRI mainly for determining the activation state of individual voxels in a univariate fashion. Bayesian hidden Markov models (HMM)
with MCMC sampling have been used to determine the activation state of individual voxels
in blocked-design single-trial experiments [95]. In another approach [53], for each voxel,
a two-state HMM was created, and the model parameters were estimated from the voxel
time–series and the stimulus paradigm. Activation detection associated with known stimuli
has also been done with hidden Markov multiple event sequence models (HMMESM) [59], which pre-process the raw time–series into a series of spikes to infer neural events at each voxel.
A hidden process model (HPM) [98] was used to decompose each voxel's time–series into a set of "neural processes" and their instantiations. For each process a spatio–temporal map is generated, giving the voxel locations and the probability of the process onset relative to some external event. Since all the possible processes and the configurations of their instances have to be pre-specified, HPMs are limited to testing specific hypotheses involving simple interactions of a small number of neural processes. A multivariate ARMA formalism was used to effect a dynamical components analysis that extracts spatial components and their time–courses from fMRI data, given the experimental stimuli [205].
Dynamic Bayesian networks have also been used to study the time-varying functional integration [225] of a small number of pre-selected functional modules, from the interdependency structure of their average time–series.
Figure 11.1: The Full State–Space Model. The hidden brain–states are shown by x_t, giving rise to distributed activation patterns z_t. The fMRI data y_t ... y_{t+L−1} are observed after convolution of the neural activations with the hemodynamic response h.
In contrast to the above methods, this chapter presents a data-driven multivariate method for the dynamical analysis of mental processes in a time–resolved fashion using a phenomenological model of brain function.
11.2 State–space Model
The state–space model with K hidden states is parameterized by θ = {α, π, ϑ, Σ_ε}, as shown in Fig. 11.1.
Here, y_t ∈ R^D, t = 1 ... T, is the observed fMRI data after the feature–space transformation^20, with the corresponding experimental conditions given by s_t.
The underlying mental process is represented as a (hidden) state–sequence xt ∈ [1 ...K], for t = 1 ...T . Let the state marginal probability be denoted by α = (α1 . . . αK ), where
^20 The fMRI data in voxel–space is denoted by Y.
α_k ≜ p(x_t = k). The state transition probabilities are given by the K × K stochastic matrix π, where π_{i,j} ≜ p(x_{t+1} = j | x_t = i) is the transition probability from state i to j.
Each state x_t = 1 ... K is associated with a characteristic activation pattern z_t. The emission model has a two-level hierarchy to account for the fact that y_t is the hemodynamic response to the (unobserved) neural activation pattern z_t corresponding to state x_t. The activity signature corresponding to x_t = k is assumed to be normally distributed in the feature–space Φ (cf. Chapter 10) with mean µ_k and variance Σ_k. Let ϑ ≜ {ϑ_1 ... ϑ_K} be the emission parameters of the model, where ϑ_k ≜ (µ_k, Σ_k) are the emission parameters for state k.
The measured fMRI signal y_t is obtained by a linear convolution of the spatially distributed activation signatures with an HRF h as per^21:

y_t = ∑_{l=0}^{L} h_l z_{t−l} + ε_t.

Here, ε_t ∼ N(0, Σ_ε) is a time-stationary noise term. The HRF is an FIR filter of length L + 1 given by the difference of two Gamma functions [66], with non-linear parameters γ controlling its delay, dispersion, and onset-to-undershoot ratio, with prior density p(γ) = N(µ_γ, σ_γ).
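A common way to realize such a difference-of-Gammas FIR filter is the SPM-style double-Gamma parameterization sketched below. The parameter names and default values are the conventional ones, not necessarily the exact parameterization γ of [66]:

```python
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(length, tr=1.0, delay=6.0, undershoot=16.0,
                     dispersion=1.0, u_dispersion=1.0, ratio=1 / 6):
    """FIR filter h with length+1 taps: difference of two Gamma densities.
    `delay` and `dispersion` shape the positive response, `undershoot`
    and `ratio` shape the negative post-stimulus undershoot."""
    t = np.arange(length + 1) * tr
    peak = gamma.pdf(t, delay / dispersion, scale=dispersion)
    under = gamma.pdf(t, undershoot / u_dispersion, scale=u_dispersion)
    h = peak - ratio * under
    return h / h.sum()              # unit-gain normalization
```

Sampling `delay`, `dispersion`, and `ratio` from a prior and rebuilding the filter corresponds to drawing h(γ) from p(γ), as needed for the marginalization of Section 11.3.2.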
Examining this graphical model using the d–separation criteria [174], it can be observed that, conditioned on y, all the x and z variables are dependent on each other. This dependency structure complicates the marginalization of the hidden variables x and z (as required by
EM) since the integration cannot be factorized with respect to these variables. In Chapter 12
^21 The voxel–wise data Y_t = ∑_l h_l Z_{t−l} is modeled as a linear convolution of the activation patterns Z, and because the linear projection z_t ≜ Φ[Z_t] commutes with convolution, it holds that y_t = ∑_l h_l z_{t−l} in the feature space, where y_t ≜ Φ[Y_t].
we shall address this problem through the use of the mean–field approximation. However, in this chapter, we make the additional simplifying assumption that y_t, t = 1 ... T are independent of each other given x, giving the approximative reduced model shown in Fig. 11.2.
This assumption is equivalent to collapsing the z layer into the x layer and neglecting the structured variability introduced by it^22.
Expanding out the linear dependence of y on x through h and considering z as a "fixed effect", the following probability model is obtained:

p(y, x | θ, h, K) = p(y | x, ϑ, h) p(x | α, π)   (11.2.1)

p(x | α, π, K) = ∏_{t=1}^{T} p(x_t | x_{t−1}, α, π) = α_{x_1} ∏_{t=2}^{T} π_{x_{t−1}, x_t}   (11.2.2)

p(y | x, ϑ, h, K) = ∏_{t=1}^{T} p(y_t | x_{t−L} ... x_t, θ)   (11.2.3)

p(y_t | x_{t−L} ... x_t, ϑ, h, K) ∼ N( ∑_{l=0}^{L} h_l µ_{x_{t−l}} , ∑_{l=0}^{L} h_l² Σ_{x_{t−l}} + Σ_ε ),   (11.2.4)

where

θ ≜ {α, π, ϑ, Σ_ε}

µ_{t−L...t} ≜ ∑_{l=0}^{L} h_l µ_{x_{t−l}}

Σ_{t−L...t} ≜ Σ_ε + ∑_{l=0}^{L} h_l² Σ_{x_{t−l}}   (11.2.5)
Thus, the convolution introduces a dependency only between the states x_{t−L} ... x_t when conditioned on the observation y_t. Although this violates the first-order Markov property required for the classical forward–backward recursions, the model parameters can be efficiently estimated using the (L + 1)-order Monte–Carlo algorithm of Section 11.3.
22 Refer to Appendix D.5 for a full justification of this.
Figure 11.2: The Reduced State–Space Model. The hidden brain–states are shown by x_t; the activation pattern is observed in the fMRI data y_t ... y_{t+L−1} (in feature–space) after convolution with the hemodynamic response h. The intermediate z_t layer is dropped.
Also, the following shorthand notation is used throughout this chapter: y_{t1...t2} ≜ {y_{t1} ... y_{t2}}, y ≜ {y_1 ... y_T}, and similarly for x. Also define p_θ(·) ≜ p(·|θ, h, K). Matrix transpose is denoted by the ⊤ operator. The notation introduced here is compiled in Table 11.1.
Symbol: Definition
⊤: Matrix transpose operator
T: Total number of time–points in an fMRI session
N: Total number of (cortical) voxels in an fMRI volume
K: Total number of hidden brain–states
Φ = {φ^(l,m) ∈ R^N}: Orthogonal basis functions of the feature–space
x_t ∈ [1 ... K]: The brain–state at 1 ≤ t ≤ T
Y_t ∈ R^N: The fMRI scan in voxel–space at 1 ≤ t ≤ T
Y: Defined as (Y_1 ... Y_T)
y_t ∈ R^D: The fMRI scan in feature–space at 1 ≤ t ≤ T
y: Defined as (y_1 ... y_T)
s_t: The stimulus vector at time t
α_k: The marginal probability of state k
π_{i,j} ≜ p(x_t = j | x_{t−1} = i): The transition probability from state i to state j
ϑ_k = (µ_k, Σ_k): The emission parameters for state k
h: The hemodynamic finite impulse response (FIR) filter of length L + 1
γ: The non–linear parametrization of h
µ_γ, Σ_γ: The mean and variance of the prior distribution of γ
ε_t ∼ N(0, Σ_ε): Normally distributed noise
θ ≜ {α, π, ϑ, Σ_ε}: The model parameters
p_θ(◦) ≜ p(◦ | θ, h, K): Parameterized probability density function
w: Multinomial logistic regression weights
Table 11.1: A summary of the notation used throughout this chapter on the unsupervised state–space model
11.3 Model Estimation
11.3.1 Parameter Estimation
The maximum likelihood (ML) estimate θ_ML = arg max_θ ln p(y | θ, K) is obtained using EM [22], which involves iterating the following two steps until convergence:

E-step: Q(θ, θ^{(n)}) = ∑_x p(x | y, θ^{(n)}, h, K) ln p(y, x, θ | h, K),   (11.3.1)

M-step: θ^{(n+1)} = arg max_θ Q(θ, θ^{(n)}).
Because of the inclusion of the FIR filter for the HRF, which violates the first-order Markov property of the state–sequence x when conditioned on an observation yt, the EM update equations take the following form (cf. Appendix D.1):
α_k^{n+1} = p_{θ^{(n)}}(x_1 = k | y) / ∑_{k′=1}^{K} p_{θ^{(n)}}(x_1 = k′ | y)

π_{k_1,k_2}^{n+1} = ∑_{t=2}^{T} p_{θ^{(n)}}(x_t = k_1, x_{t+1} = k_2 | y) / ∑_{k′=1}^{K} ∑_{t=2}^{T} p_{θ^{(n)}}(x_t = k_1, x_{t+1} = k′ | y)

µ_k^{n+1} = ∑_{k_0...k_L} H⁻_{k, k_0...k_L} µ_{k_0...k_L}^{n+1}   and   Σ_k^{n+1} = ∑_{k_0...k_L} G⁻_{k, k_0...k_L} Σ_{k_0...k_L}^{n+1}.   (11.3.2)

The updates to µ_{t−L...t} and Σ_{t−L...t} (cf. eqn. 11.2.5) for one particular assignment {k_0 ... k_L} of the sequence of L + 1 states are:

µ_{k_0...k_L}^{n+1} = [ ∑_{t=1}^{T} p_{θ^{(n)}}(x_{t−L...t} = k_0 ... k_L | y) y_t ] / [ ∑_{t=1}^{T} p_{θ^{(n)}}(x_{t−L...t} = k_0 ... k_L | y) ]

Σ_{k_0...k_L}^{n+1} = [ ∑_{t=1}^{T} p_{θ^{(n)}}(x_{t−L...t} = k_0 ... k_L | y) (y_t − µ_{k_0...k_L}^{n+1})(y_t − µ_{k_0...k_L}^{n+1})^⊤ ] / [ ∑_{t=1}^{T} p_{θ^{(n)}}(x_{t−L...t} = k_0 ... k_L | y) ]
Here, H and G are the K^{L+1} × K convolution matrices that give the relationship between the HRF coefficients h_l, the activation pattern means µ_k (and variances Σ_k) for state k, and the µ_{t−L...t}, Σ_{t−L...t} for any assignment of an L + 1 length state–sequence. The row of H indexed by the sequence (k_0 ... k_L) has, in column k, the entry ∑_{l : k_l = k} h_l; G is defined analogously with h_l² in place of h_l. H⁻_{k, k_0...k_L} is the (k, k_0 ... k_L)–th element of the pseudo-inverse of H, given by H⁻ = (H^⊤H)⁻H^⊤. Even though H is a K^{L+1} × K matrix, it is extremely sparse: the only non-zero entries in column k of H correspond to those µ_{k_0...k_L}^{n+1} where k ∈ {k_0 ... k_L}. Therefore, H^⊤H is computed in O(2^{L+1}K²) time and is inverted using the SVD pseudo-inverse; similarly for G.
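The structure of H can be made concrete for toy sizes. `build_H` below is a hypothetical helper that enumerates all K^{L+1} state sequences explicitly (feasible only for small K and L; the sparse bookkeeping of the text avoids this), and the pseudo-inverse is taken with the SVD-based `numpy.linalg.pinv`:

```python
import itertools
import numpy as np

def build_H(h, K):
    """K**(L+1) x K matrix mapping state signatures mu_k to sequence
    means mu_{t-L..t}: row (k0..kL) accumulates h_l in column k for
    every lag l with k_l == k (cf. eqn. 11.3.2)."""
    L = len(h) - 1
    seqs = list(itertools.product(range(K), repeat=L + 1))
    H = np.zeros((len(seqs), K))
    for r, ks in enumerate(seqs):
        for l, k in enumerate(ks):
            H[r, k] += h[l]
    return H

h = np.array([0.1, 0.5, 0.3])       # toy FIR filter, L = 2
H = build_H(h, K=2)                 # 2**3 = 8 rows, 2 columns
H_pinv = np.linalg.pinv(H)          # SVD-based (H^T H)^- H^T
```

The matrix G is obtained by the same construction with `h**2` in place of `h`.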
Using the relationship p_{θ^{(n)}}(x|y) = p_{θ^{(n)}}(y, x)/p_{θ^{(n)}}(y) and the fact that p_{θ^{(n)}}(y) cancels between the numerators and denominators of eqn. 11.3.2, the conditional densities are replaced by their joint densities p_{θ^{(n)}}(y, x_t), p_{θ^{(n)}}(y, x_{t,t+1}) and p_{θ^{(n)}}(y, x_{t−L...t}). These are calculated as:

p_θ(y, x_t) = ∑_{x_{t+1−L...t−1}} a(x_{t+1−L...t}) b(x_{t+1−L...t})

p_θ(y, x_{t,t+1}) = ∑_{x_{t+1−L...t−1}} a(x_{t+1−L...t}) · p_θ(y_{t+1} | x_{t+1−L...t+1}) p_θ(x_{t+1} | x_t) · b(x_{t+2−L...t+1})

p_θ(y, x_{t...t+L}) = a(x_{t...t+L−1}) · p_θ(y_{t+L} | x_{t...t+L}) · p_θ(x_{t+L} | x_{t+L−1}) · b(x_{t+1...t+L}).   (11.3.3)
where a and b are the forward–backward recursion terms:

a(x_{t+1−L...t}) ≜ p_θ(y_{1...t}, x_{t+1−L...t}) = ∑_{x_{t−L}} p_{θ^{(n)}}(y_t | x_{t−L...t}) p_{θ^{(n)}}(x_t | x_{t−1}) · a(x_{t−L...t−1})

b(x_{t+1−L...t}) ≜ p_θ(y_{t+1...T} | x_{t+1−L...t}) = ∑_{x_{t+1}} p_{θ^{(n)}}(y_{t+1} | x_{t+1−L...t+1}) p_{θ^{(n)}}(x_{t+1} | x_t) · b(x_{t+2−L...t+1}).   (11.3.4)
The derivations of these terms are elaborated in Appendix D.2.
The summations (i.e., expectations) over the densities of state–sequences of length L + 1, of the form ∑_{x_{t−L...t}} p_{θ^{(n)}}(y, x_{t−L...t})[...] in eqns. 11.3.2 and 11.3.3, are replaced with Monte–Carlo estimates by Gibbs sampling from the distribution p_{θ^{(n)}}(y, x_{t−L...t}) with stochastic forward–backward recursions [194].
The same EM procedure can estimate θML given multiple fMRI data–sets corresponding to a group of subjects, with slight modifications to the update equations.
11.3.2 HRF Marginalization
The dependence of θ_ML on a specific HRF filter h is removed by marginalizing out h under a Laplace approximation to obtain a Bayesian estimate θ* = ∫ θ_ML(h) p(h) dh, independent of h. It is computed through Monte–Carlo integration by first sampling the parameter γ from N(µ_γ, σ_γ), constructing h(γ), finding θ_ML(h), and then averaging over all samples.
Please consult Appendix D.3 for details.
11.3.3 Optimal State–Sequence Estimation
Given a set of parameters θ and observations y, the most probable sequence of states x* = arg max_x ln p_θ(y, x) is estimated by backtracking through the following recursive system:

max_x ln p_θ(y, x) = max_{x_{T−L...T}} η_T

where η_t = max_{x_{t−1}} [ ln p_θ(y_t | x_{t−L...t}) + ln p_θ(x_t | x_{t−1}) + η_{t−1} ]

and η_1 = ln p_θ(y_1 | x_1) + ln p_θ(x_1).

The maximization over the sequences of L + 1 states x_{t−L...t} is done using iterated conditional modes (ICM) [22], with random restarts. The detailed derivations for this procedure are given in Appendix D.4.
11.4 Model Size Selection
Model–size (i.e., K) selection can be done using Bayes factors [112], information-theoretic criteria [151], reversible–jump MCMC based methods [194] or non-parametric extensions [15]. The reader is referred to the excellent article by Lanterman [126] for a theoretical perspective on these model selection methods.
These methods are equivalent in that they strike a compromise between model complexity
(typically measured by the number of states) and the ability of the model to explain the data (typically measured by p(Y|K), the model evidence). In the absence of a domain– specific way to define model complexity, information based methods select models that minimize Kolmogorov complexity, while Bayesian methods specify a prior derived either
from equivalent definitions of complexity or through empirical ones such as hierarchical models or reference priors [19].
Instead, we adopt an alternative strategy where the experimental conditions s_t are used to select the K that results in a maximally predictive model. The rationale behind this strategy is that fMRI data may contain multiple spatio–temporal patterns of both neurophysiological (such as default–network and other non–task-related mental processes) and extraneous (such as respiratory, pulsatile, or head–motion) origin, of which only task-related effects are of interest to the investigator. This criterion enforces that link by selecting a model which has identified the most relevant patterns. Although this step introduces a dependence of the model on the experimental conditions, the parameters themselves are estimated without reference to the task in an unsupervised fashion.
Let x^{*,K} denote the optimal state–sequence for an fMRI session y produced by the model with K states and optimal parameters θ*_K, and let s = (s_1 ... s_T) denote the corresponding experimental conditions recorded during the session. A multinomial logistic regression
(MLR) classifier [22] with weights w is trained to predict the state xt at time t given a stimulus vector st according to the formula:
Pr[x_t = k] = exp{s_t^⊤ w_k} / ∑_{k′=1}^{K} exp{s_t^⊤ w_{k′}}   (11.4.1)
The optimal K is then selected as K* = arg min_K ERR_predict(x^{*,K}, w), where ERR_predict is the error–rate (i.e., risk) of predicting the optimal state–sequence x^{*,K} from the experimental conditions as per:

ERR_predict(x^{*,K}, w) = 1 − E[ Pr[x_t = x_t^{*,K}] ],   (11.4.2)

and is computed using cross–validation^23.
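The selection criterion can be sketched with an off-the-shelf multinomial logistic regression. Here the 0/1 cross-validated accuracy stands in for the expected probability of eqn. 11.4.2, and all names (and the toy data used to exercise the function) are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def prediction_error(states, stimuli, folds=5):
    """ERR_predict: 1 minus the cross-validated accuracy of an MLR
    classifier predicting the optimal state sequence x*_t from the
    stimulus vectors s_t (0/1 loss stand-in for eqn. 11.4.2)."""
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, stimuli, states, cv=folds).mean()
    return 1.0 - acc

# toy check: states derived from the stimuli should be predictable
rng = np.random.default_rng(0)
s = rng.normal(size=(600, 5))          # stimulus vectors
w = rng.normal(size=(5, 5))            # arbitrary weights
x = (s @ w).argmax(axis=1)             # deterministic stand-in labeling
err = prediction_error(x, s)
```

Sweeping `prediction_error` over the state sequences produced by models of different sizes and taking the arg min implements the selection rule K* = arg min_K ERR_predict.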
Therefore, the model M_i trained on the data–set y of a subject i consists of the tuple M_i = (Φ, θ*_{K*}, K*, w), viz. the feature–space basis, the optimal model parameters, the optimal number of states, and the MLR weights.
11.5 Results
This section starts off with a quantitative validation of the model and state–sequence esti- mation algorithms using a simulated data–set in Section 11.5.1. Then in Section 11.5.2, the results of the method applied to the mental arithmetic task of Section 3.2 are reported.
11.5.1 Simulation
11.5.1.1 Methods and Materials
This section reports a quantitative validation of the model and state–sequence estimation algorithms using a synthetic data–set, created as follows. For all simulations, the number of time–points was T = 600, the dimension of the feature–space was D = 500, and the dimension of the stimulus vector s_t was set to 5, to reflect a typical fMRI data–set. The model size K was varied from 5 to 50.
The MLR weights w were initialized from a uniform distribution over [−1, 1]. The hemodynamic FIR coefficients h were obtained by sampling the HRF parameters γ from N(µ_γ, Σ_γ).
^23 Specifically, the data is partitioned into M folds, each of T/M fMRI time–points selected randomly. The weights w_k are learnt from the (x_t^{*,K}, s_t) pairs of M − 1 folds, and the error–rate is assessed on the (x_t^{*,K}, s_t) pairs of the remaining fold.
The emission parameters (µ_k, Σ_k) for each state k were obtained by sampling Σ_k from a Wishart distribution W(T, I_D) and µ_k from N(0, Σ_k). The noise variance Σ_ε was sampled from W(T, β^{−1} I_D). The parameter β effectively controls the SNR of the data, by controlling the ratio of the noise variance to that of the activation patterns. For each time–point t = 1 ... T, the stimuli s_t were generated from a normal distribution N(0, I_5). The values of x_t were sampled from the multinomial distribution arising from eqn. 11.4.1, then z_t from the normal distribution N(µ_{x_t}, Σ_{x_t}), and then y_t from the convolutive model with Gaussian noise of variance Σ_ε. Note that this sampling scheme corresponds to the full model of
Fig. 11.1, in order to test the accuracy of the approximation implied by Fig. 11.2.
The Gibbs sampler (sampling from sequences L + 1 states long) had a burn–in time of
100 samples and convergence of parameter estimates was defined using the scale reduction
factor criterion [27]. The experiments were repeated with β = 10, 100 and 1000, corresponding to SNRs of 10, 20 and 30 dB respectively.
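The emission- and noise-parameter sampling of this simulation can be sketched as follows; the dimensions are reduced from D = 500 to keep the Wishart draws cheap, and the variable names are illustrative:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
D, T, K, beta = 20, 600, 5, 100     # toy dims; beta controls the SNR

# per-state emission parameters: Sigma_k ~ W(T, I_D), mu_k ~ N(0, Sigma_k)
Sigmas = [wishart.rvs(df=T, scale=np.eye(D), random_state=rng)
          for _ in range(K)]
mus = [rng.multivariate_normal(np.zeros(D), S) for S in Sigmas]

# noise covariance ~ W(T, I_D / beta): larger beta shrinks the noise
# variance relative to that of the activation patterns
Sigma_eps = wishart.rvs(df=T, scale=np.eye(D) / beta, random_state=rng)
```

With this convention the expected noise covariance is T·I_D/β, so β = 10, 100, 1000 scales the noise-to-signal variance ratio down by successive factors of ten, matching the 10/20/30 dB settings.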
11.5.1.2 Discussion
Table 11.2 limns the average running–times and error–rates for the simulation. Included are the error in the parameter estimates ERR_estimate, defined as the average relative error in the estimates w*, µ*_k, Σ*_k, and Σ*_ε; the model–size estimation error ERR_K; and the prediction error ERR_predict (cf. eqn. 11.4.2). Of special interest is the error in the µ_k, k = 1 ... K parameters, which correspond to the spatial distribution of activity representative of each state. The average estimation error of the spatial distribution parameters is defined as:

ERR_spatial = (1/K) ∑_k (µ*_k − µ_k)^⊤ Σ_k^{−1} (µ*_k − µ_k),
SNR (dB)   Running–Time (hours)   ERR_estimate (%)   ERR_K (%)   ERR_predict (%)   ERR_spatial (%)
30   4.305 ± 0.72   21.13 ± 8.25   12.70 ± 5.35   20.77 ± 4.45   17.56 ± 3.11
20   4.63 ± 0.96   28.89 ± 11.14   18.91 ± 8.62   41.59 ± 9.53   26.57 ± 9.30
10   5.28 ± 0.82   52.45 ± 17.93   44.33 ± 15.12   56.60 ± 7.25   43.41 ± 12.39
Table 11.2: Effect of SNR on the total running–time, estimation error ERRestimate, model–size error ERRK , prediction error ERRpredict, and spatial activity error ERRspatial for the simulation study of the state–space model. All values are listed ±1 std. dev.
where µ*_k is the estimate of µ_k.
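ERR_spatial is an average Mahalanobis distance between the true and estimated state signatures and can be computed directly; a minimal sketch:

```python
import numpy as np

def err_spatial(mu_true, mu_est, Sigmas):
    """Average Mahalanobis distance (mu*_k - mu_k)^T Sigma_k^{-1}
    (mu*_k - mu_k) over the K states."""
    K = len(mu_true)
    total = 0.0
    for k in range(K):
        d = mu_est[k] - mu_true[k]
        total += d @ np.linalg.solve(Sigmas[k], d)  # avoids explicit inverse
    return total / K
```

Using `np.linalg.solve` rather than forming Σ_k^{−1} explicitly is the standard numerically stable choice for this quadratic form.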
Firstly, reducing the SNR causes an increase in running–time due to slower convergence of the estimates as a function of the number of Monte–Carlo samples. Also, for sufficiently high SNR the estimate of the model–size using the maximally predictive criterion was within 20% of the true model–size, validating this strategy. The error in the parameter estimates increases drastically as the SNR drops from 20 to 10 dB, indicating a breakdown in estimation stability as the noise level crosses a certain threshold. The error in the activity signature parameters ERR_spatial exhibits a similar trend. Interestingly, the prediction error increases by almost 20% as the SNR drops from 30 to 20 dB, which points to a high sensitivity of prediction accuracy on estimation quality.
The high baseline error of ≈ 20% in parameter estimation is due to the simplification of eliminating the z layer (cf. Fig. 11.2) in the estimation algorithm. The alternative mean–field approximation presented in the next chapter, which does not necessitate this simplification, yields much higher quality estimates for an equivalent simulation (cf. Section 12.5.1).
11.5.2 Mental Arithmetic
11.5.2.1 Methods and Materials
This method was tested on the mental arithmetic data–set described in Section 3.2. We restricted this study to the 20 control and 13 dyscalculic (DC) subjects only. For each t = 1 ... T, the experimental conditions are described by the vector s_t = (Ph, LogPs, LogDiff), where Ph is the phase within the current trial, in 1 TR increments with respect to its start; LogPs gives the product size for the presented problem; and LogDiff gives the expected difficulty in judging the right answer. LogPs and LogDiff are quantized into two levels.
One model M = (Φ, θ, K, w) was trained per subject, and the following statistics were calculated:

ERR^self_predict: the "within–subject" prediction error–rate of the optimal state–sequence assigned to the fMRI run of one subject using the model for that subject (i.e., trained on the same data)

ERR^cross_predict: the "between–subject" prediction error–rate of the optimal state–sequence assigned to the data of one subject using a model trained on the data of another subject

MI(i, j): the mutual information between the state–sequences generated for the same fMRI session y by the models M_i and M_j of two different subjects i and j, described further in Appendix D.6
The mutual information quantifies the similarity of the two models: a higher MI indicates greater similarity, with a maximum of log_2 K*. In general, the correspondence between
the state labels of two different models is unknown. By comparing the state-sequences of
the same data generated by the two models, this correspondence can be determined. A
higher MI indicates a higher level of correspondence between the state-sequences of the
same data when labeled by two different models, while an MI of zero indicates absolutely
no correspondence. This procedure is illustrated in Fig. 11.3.
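The MI between two labelings of the same scans can be computed from their empirical joint histogram. Note that it is invariant to a relabeling of the states, which is exactly why it measures correspondence without knowing the label mapping. A minimal sketch:

```python
import numpy as np

def mutual_information(x1, x2):
    """MI in bits between two state labelings of the same scans,
    estimated from the empirical joint histogram of (x1_t, x2_t)."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    joint = np.zeros((x1.max() + 1, x2.max() + 1))
    for a, b in zip(x1, x2):
        joint[a, b] += 1
    joint /= joint.sum()                # empirical joint distribution
    p1, p2 = joint.sum(1), joint.sum(0)  # marginals
    nz = joint > 0                      # avoid log(0) terms
    return float((joint[nz] * np.log2(joint[nz] / np.outer(p1, p2)[nz])).sum())
```

For a sequence that uses K states equiprobably, labeling it against itself (or against any permutation of its labels) yields the maximum of log_2 K bits.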
This procedure, applied to all (33 choose 2) pairs of subjects, yields a pair–wise similarity matrix. These MI relationships can then be visualized in 2–D using multidimensional scaling (MDS) [120], as shown in Fig. 11.5.
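A sketch of this visualization step, using a simple max-minus-MI conversion from similarity to dissimilarity (one of several reasonable choices, not necessarily the one used in the thesis) and scikit-learn's metric MDS:

```python
import numpy as np
from sklearn.manifold import MDS

def mds_from_mi(mi, max_mi=None):
    """2-D layout of subjects from a pairwise MI similarity matrix:
    convert similarity to dissimilarity, then run metric MDS."""
    mi = np.asarray(mi, dtype=float)
    if max_mi is None:
        max_mi = mi.max()
    dist = max_mi - mi                  # similarity-to-distance transform
    np.fill_diagonal(dist, 0.0)         # a subject is at distance 0 from itself
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dist)
```

Subjects whose models label data similarly (high pairwise MI) end up close together in the 2-D embedding, which is how the control/DC clusters of Fig. 11.5 emerge without any group labels being used.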
Spatial maps of the activation patterns for a specific value of the experimental variables s_t are obtained by first computing the activation–pattern mean and covariance:

µ_{s_t} = ∑_{k=1}^{K} Pr[x_t = k] µ_k

Σ_{s_t} = ∑_{k=1}^{K} Pr[x_t = k] Σ_k
where Pr[x_t = k] are the MLR probabilities from eqn. 11.4.1 for the condition s_t. The z–score map for the activation pattern corresponding to s_t is given by Σ_{s_t}^{−1/2} µ_{s_t} in feature–space, which can then be transformed back into a voxel–wise spatial map of activity. The group–level spatial maps^24 for the three phases of the two groups are shown in Fig. 11.6.
^24 Displayed as a t–score map of the voxel–wise group average divided by the group std. dev.
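The feature-space z-score computation can be sketched directly from the two weighted sums above; here the inverse square root is taken through an eigendecomposition of the SPD matrix Σ_s (function and variable names are illustrative):

```python
import numpy as np

def condition_zmap(prob_k, mus, Sigmas):
    """Feature-space z-score map for one stimulus condition: the
    probability-weighted state mean scaled by Sigma_s^{-1/2}."""
    mu_s = sum(p * m for p, m in zip(prob_k, mus))      # weighted mean
    Sigma_s = sum(p * S for p, S in zip(prob_k, Sigmas))  # weighted covariance
    # Sigma_s^{-1/2} mu_s via eigendecomposition of the SPD matrix
    w, V = np.linalg.eigh(Sigma_s)
    return V @ ((V.T @ mu_s) / np.sqrt(w))
```

The resulting vector lives in the feature–space Φ; applying the inverse feature-space transform maps it back to a voxel-wise spatial map.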
Figure 11.3: Computation of Pair–Wise MI Between All Subjects. One model Mj = {Φj, θj} was trained per subject. The data of another subject / session, say subject i, was then first projected into the feature–space Φj of subject j and then the optimal state–sequence for the data of subject i was computed using θj.
11.5.2.2 Discussion
The average values of these statistics are compiled in Table 11.3. The "chance–level" prediction error is ≈ 83%, calculated by permuting the stimuli with respect to the scans.
Group   K*   ERR^self_predict (%)   ERR^cross_predict (%)   MI (bits)
Controls   19.67 ± 2.44   34.23 ± 5.40   41.69 ± 8.90   3.81 ± 0.12
DC   23.18 ± 6.53   39.37 ± 9.81   58.56 ± 9.76   2.94 ± 0.15
Controls vs. DC   –   –   60.03 ± 9.34   2.72 ± 0.33
Table 11.3: The group–wise average values of the optimal number of states K*, the within–subject and between–subject prediction errors ERR^self_predict and ERR^cross_predict respectively, and the between–subject mutual information. The last row shows the between–subject prediction errors and mutual information comparing control subjects and DC subjects.
Here, a larger variation in model–sizes for the DC group can be observed as compared to the controls, even though their group size is much smaller. This points to a greater heterogeneity in the data across the DCs, necessitating models of different sizes. Also, the consistently higher error–rate of the DC population indicates the relative inaccuracy of their models as compared to the controls. Moreover, the MI between DC subjects is on par with that between DCs and controls. This means that not only does one DC model label an fMRI run quite incongruently with a control model, it even labels the data differently from another DC model. This reaffirms the conclusion of high heterogeneity in the mental strategies adopted by DC subjects.
The phase–wise breakdown (according to 1 TR long phases Ph) of ERR^self_predict, ERR^cross_predict and MI is shown in Fig. 11.4.
Figure 11.4: 1 TR Long Phase–wise Breakdown of Error Rates and MI. Figs. (a)–(c) show the effect of experiment phase Ph, product–size LogPs and problem–difficulty LogDiff on the within–subject error–rate ERR^self_predict, the between–subject error–rate ERR^cross_predict and the mutual information MI. The LogPs effect is measured as the value for high LogPs minus that for low LogPs; similarly for LogDiff. The background color–coding shows the timing of each trial using the color scheme of Fig. 3.2.
Here, for both groups the error is low in the initial (read) phase of the experiment and increases in the second (judgement) phase. The high rates during the third phase are possibly due to the increased conflict involved in making a decision. Additionally, this phase is highly variable between repetitions and very often overlapped with the inter-trial rest interval, reducing predictability. Again, the between–subject error rates of the control vs. DC case are on par with those between the DCs themselves.
The product–size LogPs increases the predictability of the control group during the first two phases while having a negligible effect on the DC group, for both the within–subject and between–subject cases. This effect early in the trial is attributable to a stronger effect of product recall from the rote tables located in the lower left parietal lobe, and to strong number–size effects in the occipital visual areas. The later effect of LogPs is consistent with strong activation of the working verbal (mute-rehearsal) and visual memories [187].
The problem–difficulty LogDiff effect is noticeable after the second 2.64 s phase, which is expected since it depends on the onset of the incorrect result Rd displayed at 2.8 s. It causes a strong reduction in error for controls in the within–subject case while having almost no effect in the between–subject case. Its effect is strongest in the third (response) phase, since it mainly affects attention and conflict resolution, and is probably not a number–related effect.
The trends in the MI plots confirm those of the between–subject prediction error. Of interest here is the slight dip in MI during the second phase, followed by an increase in the third phase for the controls. This could reflect a second recruitment of working memory, located in the right intra-parietal sulcus, as the controls recapitulate the multiplication problem after they have finished the task [163].
From the MDS plots of Fig. 11.5, an overall clustering of control and DC subjects is noticeable, with the separation increasing as LogPs increases. In the first phase of the task the separation increases, while during the second phase the picture gets murkier: both clusters spread out, with a greater dispersal of the DC subjects in the 2–D plot. LogDiff seems to have a slight organizing effect, tightening the clusters. It should be noted that these labels were applied post hoc, after plotting all the subjects using MDS on the pairwise MI matrix; an intrinsic organization in the spatio–temporal patterns of the subjects in each group has therefore been discovered.
[Legend: Control Male, Control Female, DC Male, DC Female. Panels: (a) Overall; (b) Overall – Product Size Effect; (c) Phase 1; (d) Phase 1 – Product Size Effect; (e) Phase 2; (f) Phase 2 – Problem Difficulty Effect.]
Figure 11.5: MDS Plots of the Mutual Information Between All Pairs of Subjects. Fig. (a) shows the MDS plot of the subjects based on their overall MI, and Fig. (b) shows their layout in 2–D for the MI computed on high LogPs stimuli. Figs. (c) and (d) show the relative arrangement of the subjects during the first 2.64s phase of the trial for the overall and high LogPs cases. Similarly, Figs. (e) and (f) plot the subjects during the second 2.64s phase for the overall and high LogDiff cases.
As the spatial maps reported here are qualitatively similar to those of Chapter 12, please refer to Section 12.5.3 for a description of the activation patterns along with a detailed interpretation of the results.
11.6 Conclusion
In this chapter, we put forward an approach towards decoding and representing the spatio–temporal information about mental processes in fMRI data, using a hidden Markov model.
The hemodynamic coupling between metabolic activity and the BOLD response was accounted for by an additional hidden layer. However, due to the computational complexity incurred by a Monte–Carlo based EM, this hidden layer was approximated out through an assumption of independence of the fMRI data given the state–sequence.
The effect of assuming spatially invariant hemodynamics was ameliorated through marginalization under a Laplace approximation. Model selection was then performed using a maximally predictive criterion. We applied the method to a group-wise study of developmental dyscalculia, and demonstrated task-related differences between healthy controls and dyscalculics that were systematically organized in time.
Although this model had the ability to predict experimental conditions at better than chance levels, the estimation and prediction error–rates are quite high and very sensitive to noise, as verified by the simulation study. The fact that absolutely no information about the structure of the experiment is used to stabilize estimation (the link to the experiment occurs only in the model-selection step) exacerbates the problems caused by the independence assumption and the spatially stationary HRF.
In the next chapter, to address these drawbacks, this model is augmented to incorporate stimulus information and spatially varying, unknown hemodynamics, and a fast estimation algorithm is developed that does not necessitate the independence assumption.
[Panels: Ph 1, Ph 2, Ph 3 for (a) the Control Group and (b) the DC Group.]
Figure 11.6: The group–wise activation t–score maps (defined as the group average divided by the group std. dev. per voxel) for the phases of the task for the control and dyscalculic groups in Fig. (a) and Fig. (b) respectively. The hot color scheme, with magnitude increasing from left to right, shows the activation patterns for the corresponding phase of the task, with negative values masked out for visual clarity.
CHAPTER 12
STATE–SPACE MODELS : A SEMI–SUPERVISED APPROACH
Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.
George E. P. Box (1919–) and Norman Draper, Empirical Model-Building and Response Surfaces, 1987.
The previous chapter introduced a dynamic system formulation of brain–function as a sequence of abstract brain–states stepping through a mental state–space, in order to determine the intrinsic states of the subject from the data without reference to experimental conditions.
This chapter augments that model with information about the experimental task to create a semi–supervised, temporally resolved multivariate analysis. This addresses the main challenges of purely unsupervised methods: enforcing a link back to the experimental variables and stabilizing estimation. For example, in the simulation study of the unsupervised method we observed that the errors were generally high and that solutions were very sensitive to noise. At the same time, the semi–supervised approach does not preclude discovery of new and unexpected patterns; it simultaneously guides the discovery towards patterns of interest to the investigator and mitigates the effect of the confounds that plague fMRI.
As previously, a first order Markov chain captures the concept of the functional brain transitioning through an abstract state–space as it performs a mental task, where each state has a characteristic spatial distribution of (metabolic) activity which gives rise to the fMRI signal by convolution with an HRF. The difference is that here the brain–state at each time–point is affected not only by the current experimental conditions, but also by the previous state. The model is developed in Section 12.1.
The previous model assumed a spatially uniform HRF and resorted to reducing its impact through marginalization. This compromise is eschewed in this augmented model by allowing a spatially varying HRF which is estimated from the data.
To make estimation of the model of Chapter 11 computationally tractable, it had to be simplified to assume independence of the fMRI data given the state–sequence. This assumption is obviated here through the use of an efficient algorithm based on the mean–field approximation of expectation–maximization (EM) [50] to estimate the Markov structure, the activation maps and the hemodynamic filter, along with predictions of missing stimuli, as per Section 12.2. The optimal sequence of brain–states for a given data–set is again estimated using EM with the mean–field approximation, as elucidated in Section 12.3.
The maximally predictive method to determine, in an automated fashion, the correct model size and other hyper–parameters that best describe the task being performed by the subject is given in Section 12.4.
The Markov chain of brain–states serves two purposes: a) to enforce a temporal ordering on the states, and b) to decouple the stimulus from the fMRI signal, thereby avoiding specification of the exact mathematical relationship between the two. This second aspect makes it a semi–supervised method, as it uses the stimuli to guide estimation but does not preclude discovery of new patterns in the data and investigation of effects not explicitly encoded in the experimental variables. The model can predict the value of experimental stimuli at new frames and is able to estimate a spatially varying HRF from the data.
The outline of the different processing steps of the method presented in this chapter is given in Fig. 12.1.
A quantitative validation of the estimation algorithms is given in Section 12.5.1. Section 12.5.2 illustrates the single–subject spatio–temporal maps produced by the method and provides a comparative evaluation with GLMs and SVMs, using a visuo–motor task. Section 12.5.3 demonstrates the novel insights provided by such a phenomenological spatio–temporal model on the study of mental arithmetic introduced in Section 3.2. Finally, the chapter concludes with some remarks and observations in Section 12.6.
The notation used during the course of this chapter is collated in Table 12.1. Note: as previously, ⊤ denotes the matrix transpose operator.
[Flowchart: the fMRI data Y is projected onto the feature-space basis Φ; the model parameters θ are estimated by alternating E- and M-steps until convergence; the hyper-parameters K and λ_w are selected by the error of predicting the stimulus s; given θ, the state-sequence x is estimated by a second EM loop.]
Figure 12.1: Outline of the Method. The data (y), after projection into the low dimensional feature space (cf. Chapter 10), are used to estimate the model parameters θ through a generalized EM algorithm (cf. Section 12.2). The model hyper–parameters K and λ_w are selected to minimize the error of predicting the stimulus s (cf. Section 12.4). Given a set of model parameters, the optimal state–sequence x is estimated using EM (cf. Section 12.3).
Symbol | Definition
⊤ | Matrix transpose operator
T | Total number of time–points in an fMRI session
N | Total number of (cortical) voxels in an fMRI volume
K | Total number of hidden brain–states
Φ = {φ_(l,m) ∈ R^N} | Orthogonal basis functions of the feature–space
x_t ∈ [1 ... K] | The brain–state at 1 ≤ t ≤ T
Y_t ∈ R^N | The fMRI scan in voxel–space at 1 ≤ t ≤ T
Y | Defined as (Y_1 ... Y_T)
y_t ∈ R^D | The fMRI scan in feature–space at 1 ≤ t ≤ T
y | Defined as (y_1 ... y_T)
Z_t ∈ R^N | The (pre–HR) brain activation pattern at 1 ≤ t ≤ T in voxel–space
Z | Defined as (Z_1 ... Z_T)
z_t ∈ R^D | The (pre–HR) brain activation pattern at 1 ≤ t ≤ T in feature–space
z | Defined as (z_1 ... z_T)
s_t | The stimulus vector at time t
u_t | Unobserved (hidden) stimulus vector at time t
w ≜ {w_{i,j}, ω_j : i, j = 1 ... K} | State–transition probability parameters
w_{i,j} | State transition probability weight parameter from state i to state j
ω_j | Stimulus–driven state transition probability weight parameter
π_{i,j}(s_t) ≜ p(x_t = j | x_{t−1} = i, s_t, w) | Shorthand for the transition probabilities
λ_w | Hyper–parameter for the state–transition probabilities
ϑ_k = (μ_k, Σ_k) | The emission parameters for state k
h | The hemodynamic finite impulse response (FIR) filter of length L + 1
μ_h, Σ_h | The mean and variance of the prior distribution of h
H_l ≜ diag{(h_l[1] ... h_l[D])^⊤} | Matrix version of h
ε_t ∼ N(0, Σ) | Normally distributed noise
θ ≜ {u, w, ϑ, h, Σ} | The model parameters
p_θ(◦) ≜ p(◦, w, h | s, u, ϑ, Σ, K) | Parameterized probability density function

Table 12.1: A summary of the notation used throughout this chapter on the semi–supervised state–space model.
12.1 The State–Space Model
The functioning brain transitioning through a set of (unobserved) mental states x_t = 1 ... K while performing a task is represented by the state–space model of Fig. 12.2. These brain states are driven by the experimental variables described by a stimulus vector s_t [18]. Therefore, the probability Pr[x_t = k] that the brain–state x_t at time t = 1 ... T (in TR units) is k = 1 ... K depends not only on the previous state of the brain but also on the current experimental stimulus described by the vector s_t. The multinomial transition probability from x_{t−1} = i to x_t = j is given as:
π_{x_{t−1}, x_t}(s_t) ≜ p(x_t = j | x_{t−1} = i, s_t, w) = exp{s_t^⊤(ω_j + w_{i,j})} / Σ_{k=1}^K exp{s_t^⊤(ω_k + w_{i,k})}   (12.1.1)
[Graphical model: the stimuli s_t ... s_{t+L−1} (with hidden stimuli u) drive the state chain x_t ... x_{t+L−1} through the transition weights W with prior precision λ_W; each state emits an activation pattern z_t with parameters (μ_k, Σ_k), k = 1 ... K; the z's generate the observations y_t ... y_{t+L−1} through the HRF h (prior μ_h, Σ_h) with noise covariance Σ_ε, for t = 1 ... T.]
Figure 12.2: The State–Space Model. The experimental parameters are represented by s_t, while the corresponding brain–state is x_t, and the instantaneous activation pattern is z_t. The activation pattern is observed in the fMRI data y_t ... y_{t+L−1} after convolution with the hemodynamic response h.
The probability of being in state j at any instant is parameterized by the vector ω_j. The probability of transitioning from state i at time t − 1 to state j at time t is parameterized by w_{i,j}, which has a normal prior N(0, λ_w^{−1} I) with precision hyper–parameter λ_w. All these transitions are driven by the stimulus vector s_t. Introducing an additional element in the stimulus vector s_t, set to 1, allows the transition probability to include a term independent of the current stimulus. Though the experiment may have a combination of interval- and categorical-valued stimuli, they are converted into standardized normal variables s_t through a probit transformation of their cumulative distribution functions. The hyper–parameter λ_w controls the trade–off between the influence of the current stimulus s_t and the previous state x_{t−1} on the probability of the current state x_t. A high value shrinks the estimates of w_{i,j} towards zero, so that the effective weight ω_j + w_{i,j} approaches its mean value ω_j, reducing the influence of the previous state x_{t−1} = i on p(x_t = j | x_{t−1} = i) and increasing the influence of s_t on the transition.
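The probit conversion of a mixed stimulus variable into a standardized normal one can be sketched as follows. The function name and the (rank − 0.5)/n empirical-CDF convention are illustrative choices, not necessarily those used here:

```python
import numpy as np
from statistics import NormalDist

def probit_standardize(values):
    """Map an interval- or categorical-valued stimulus variable to a
    standardized normal variable via a probit transform of its
    empirical CDF (an illustrative sketch of the construction above)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Ranks -> empirical CDF; the (rank - 0.5)/n convention keeps the
    # CDF strictly inside (0, 1) so the probit stays finite.
    ranks = np.argsort(np.argsort(values)) + 1.0
    ecdf = (ranks - 0.5) / n
    # Probit = inverse CDF of the standard normal.
    return np.array([NormalDist().inv_cdf(p) for p in ecdf])
```

Being a monotone transform of ranks, it preserves the ordering of the stimulus values while giving them a standard normal marginal distribution.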
The SSM allows estimating the value of unobserved or missing stimuli at a subset of the time–points U ≜ {t_1 ... t_U} ⊂ {1 ... T}, represented by the hidden variables u_t, t ∈ U. This feature enables prediction of stimuli from data at these time–points t ∈ U.
The fMRI data in the D dimensional feature–space (D ≪ N, the number of voxels), obtained by projecting the volumetric data Y_t on a linear basis Φ (cf. Chapter 10), is represented by y_t. If the hemodynamic response function (HRF) is L + 1 TRs long, it will induce a correlation in the scans y_{t...t+L} based on the neural activation corresponding to the brain–state x_t at time t. To account for this effect, an additional hidden layer z_t, representing the underlying neural / metabolic activation pattern for x_t, is introduced. For x_t = k, k = 1 ... K, we assume z_t is normally distributed with mean μ_k and covariance Σ_k. Let ϑ ≜ {ϑ_1 ... ϑ_K}, where ϑ_k ≜ (μ_k, Σ_k) are the emission parameters for state k.
Each element d = 1 ... D of the D-dimensional feature space is assumed to have an independent HRF, modeled as a finite impulse response (FIR) filter h[d] ≜ (h_0[d] ... h_L[d])^⊤ of length L + 1. Each h[d] has a normal prior with mean μ_h and variance Σ_h, constructed by varying the delay, dispersion and onset parameters of the canonical HRF of SPM8 [66] and computing their mean and variance. The length L + 1 is typically set to correspond to 32s. The set of HRF parameters is then the D × L matrix h ≜ (h[1]^⊤ ... h[D]^⊤)^⊤. Defining H_l ≜ diag{(h_l[1] ... h_l[D])^⊤}, the fMRI data y_t is obtained by an element-wise convolution y_t = Σ_l H_l z_{t−l} + ε_t, where ε_t ∼ N(0, Σ) is temporally i.i.d. noise.
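Because every H_l is diagonal, the observation model y_t = Σ_l H_l z_{t−l} + ε_t reduces to an independent 1-D convolution per feature dimension. A minimal sketch, with all dimensions and values purely illustrative (and the noise term omitted):

```python
import numpy as np

# Toy dimensions: T time-points, D feature dimensions, L + 1 FIR taps.
T, D, L = 100, 4, 10
rng = np.random.default_rng(0)
z = rng.normal(size=(T, D))        # activation patterns z_t in feature space
h = rng.normal(size=(D, L + 1))    # one FIR filter h[d] per feature dimension

# y_t = sum_l H_l z_{t-l}, with H_l = diag(h_l[1] ... h_l[D]).
# Diagonal H_l => an independent 1-D convolution per dimension d.
y = np.zeros((T, D))
for d in range(D):
    y[:, d] = np.convolve(z[:, d], h[d], mode="full")[:T]
```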
Therefore, denoting θ ≜ {u, w, ϑ, h, Σ}, the full probability model is (cf. Fig. 12.2):

p_θ(y, z, x) ≜ p(y, z, x, w, h | s, u, ϑ, Σ)
            = p(y, h | z, Σ) p(z | x, ϑ) p(x, w | s, u),   (12.1.2)

where^26

p(x, w | s, u) = p(w) [ Π_{t ∈ T\U} π_{x_{t−1}, x_t}(s_t) ] [ Π_{t ∈ U} π_{x_{t−1}, x_t}(u_t) ],
p(z | x, ϑ) = Π_{t=1}^T p(z_t | x_t, ϑ_{x_t}),
p(y, h | z, Σ) = p(h) Π_{t=1}^T p(y_t | z_{t−L...t}, h, Σ).
The model hyperparameters are K, λ_w, μ_h, Σ_h. The hyperparameters K and λ_w are selected using an automatic data-driven procedure described in Section 12.4.
12.1.1 Feature–Space Transform
The voxel-wise data Y_t = Σ_l H_l Z_{t−l} is modeled as a linear convolution of the activation patterns Z, and because the linear projection z_t ≜ Φ[Z_t] commutes with convolution, it holds that y_t = Σ_l H_l z_{t−l} in the feature space, where y_t ≜ Φ[Y_t].
12.2 Parameter Estimation
In this section, a generalized expectation-maximization (GEM) algorithm [50] to estimate the parameters θ as θ* = arg max_θ p_θ(y) is presented. Introducing a variational density q(z, x) over the latent variables z, x, the log-probability of eqn. 12.1.2 is decomposed into a free-energy and a KL-divergence as:

ln p_θ(y) = Q(q, θ) + KL(q || p_θ),   (12.2.1)
where  Q(q, θ) = Σ_x ∫_z q(z, x) ln [ p_θ(y, z, x) / q(z, x) ] dz,
and    KL(q || p_θ) = − Σ_x ∫_z q(z, x) ln [ p_θ(z, x | y) / q(z, x) ] dz.

^26 Define p(x_1 | x_0, s_1) as the marginal density p(x_1 | s_1).
Starting with an initial estimate θ^(0), the GEM algorithm finds a local maximum of ln p_θ(y) by iterating the following two steps:

E-step:  q^(n) ← arg min_q KL(q || p_θ^(n))   (12.2.2)
M-step:  θ^(n+1) ← θ such that Q(q^(n), θ) > Q(q^(n), θ^(n)).   (12.2.3)
The iterations are terminated when the updates to θ fall below a pre-specified tolerance (adaptively set at 1% of the absolute value of the parameter in the n–th iteration), yielding a locally optimal solution θ*.
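The E/M alternation of eqns. 12.2.2 and 12.2.3 can be illustrated on a deliberately simple model, a two-component Gaussian mixture with unit variances and equal priors, where the E-step computes the exact posterior (driving the KL term of eqn. 12.2.1 to zero) and the M-step has a closed form. This is an analogy for the structure of the algorithm only, not the SSM itself; all data and initial values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two well-separated clusters.
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])                  # initial estimate theta^(0)
for _ in range(50):
    # E-step: q(x_t = k) is the exact posterior under unit variances
    # and equal priors, which minimizes the KL divergence to p.
    logp = -0.5 * (y[:, None] - mu[None, :]) ** 2
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: closed-form update that increases the free energy Q.
    mu = (q * y[:, None]).sum(axis=0) / q.sum(axis=0)
```

After a few iterations the means settle near the true cluster centers; the SSM replaces both steps with the far richer computations of Sections 12.2.1 and 12.2.2.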
12.2.1 E-Step
Although the minimizer of KL(q || p_θ^(n)) is q(z, x) = p_θ^(n)(z, x | y), the HR introduces a dependency structure between x_{t−L} ... x_{t+L} and z_{t−L} ... z_{t+L} when conditioned on y_t. Therefore, evaluation of Q(p_θ^(n)(z, x | y), θ) in the M-step would require marginalization over sequences of 2L + 1 variables, resulting in a computational complexity of O(T × K^{2L}) for parameter estimation (as compared to T × K^2 for first-order HMMs).
To avoid this expensive computation, we restrict q to the family of factorizable distributions

q(z, x) = Π_{t=1}^T q_t(z_t, x_t) = Π_{t=1}^T q_t(z_t | x_t) q_t(x_t).
This is known as the mean–field approximation in statistical physics, and it can be shown [22] that if

q*(z, x) = Π_{t=1}^T q_t*(z_t, x_t) = arg min_q KL(q || p_θ(z, x | y)) = arg min_q KL(q || p_θ(y, z, x)),

then

q_t*(z_t, x_t) ∝ exp{ E_{Π_{t′≠t} q_{t′}*} [ ln p_θ(y, z, x) ] }.   (12.2.4)
Introducing the following terms:

α_{q_t*(x_t)} = Σ_{k=1}^K [ q_{t−1}*(x_{t−1} = k) ln p_θ(x_t | k) + q_{t+1}*(x_{t+1} = k) ln p_θ(k | x_t) ],
Σ_{q_t*(z_t | x_t = k)} = [ Σ_{l=0}^L H_l^2 Σ^{−1} + Σ_k^{−1} ]^{−1},
μ_{q_t*(z_t | x_t = k)} = Σ_{q_t*(z_t | x_t = k)} [ Σ^{−1} ( Σ_{l=0}^L H_l · y_{t+l} − Σ_{l=0}^L Σ_{m=0, m≠l}^L H_l · H_m E_{q_{t+l−m}*}[z_{t+l−m}] ) + Σ_k^{−1} μ_k ],   (12.2.5)

each factor of the mean–field approximation in eqn. 12.2.4 becomes a product q_t*(z_t, x_t) = q_t*(z_t | x_t) q_t*(x_t) of a multinomial logistic probability q_t*(x_t) and a normal density q_t*(z_t | x_t) as per^27:

q_t*(x_t) = exp{α_{q_t*(x_t)}} / Σ_{x_t′=1}^K exp{α_{q_t*(x_t′)}}   and   q_t*(z_t | x_t) ∼ N( μ_{q_t*(z_t | x_t)}, Σ_{q_t*(z_t | x_t)} ).   (12.2.6)
27That is, a mixture of Gaussians
Therefore, under this approximation, the E-step involves computing the factorizable density q^(n)(z, x) = Π_{t=1}^T q_t^(n)(z_t, x_t) with the following fixed-point iterations:

i. For all t = 1 ... T, initialize q_t(x_t) ← p_θ^(n)(x_t) and q_t(z_t | x_t = k) ← N(μ_k^(n), Σ_k^(n)).^28
ii. Update q_t*(x_t) and q_t*(z_t | x_t) as per eqn. 12.2.6, for all t.
iii. Iterate step (ii) until the updates to all α_{q_t(x_t)}, Σ_{q_t(z_t | x_t)} and μ_{q_t(z_t | x_t)} fall below pre-specified tolerances.
iv. Set q^(n) ← q*.
As these iterations are a coordinate–descent of the KL-divergence term KL(q || p_θ^(n)), the solution obtained is only a local optimum and depends on the initializations and the order of the updates to q_t*.
The details of these derivations are given in Section E.1 of Appendix E.
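The flavour of these fixed-point updates can be conveyed on a stripped-down chain with a known transition matrix and scalar Gaussian emissions: each state marginal q_t is repeatedly re-estimated from its neighbours' marginals plus the local emission log-likelihood, in the spirit of the α term of eqn. 12.2.5. All parameters and dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 50, 3
logA = np.log(np.array([[.8, .1, .1], [.1, .8, .1], [.1, .1, .8]]))
mu = np.array([-2.0, 0.0, 2.0])                 # emission means per state
x_true = np.repeat([0, 1, 2], [20, 15, 15])     # a piecewise-constant chain
y = rng.normal(mu[x_true], 0.5)                 # noisy scalar observations
loglik = -0.5 * ((y[:, None] - mu[None, :]) / 0.5) ** 2

q = np.full((T, K), 1.0 / K)                    # uniform initialization
for _ in range(30):                             # fixed-point sweeps
    for t in range(T):
        alpha = loglik[t].copy()
        if t > 0:
            alpha += q[t - 1] @ logA            # E_q[ln p(x_t | x_{t-1})]
        if t < T - 1:
            alpha += logA @ q[t + 1]            # E_q[ln p(x_{t+1} | x_t)]
        e = np.exp(alpha - alpha.max())
        q[t] = e / e.sum()                      # normalized marginal q_t
```

Like the full algorithm, this coordinate update only reaches a local optimum that depends on initialization and update order.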
12.2.2 M-Step
12.2.2.1 Estimating State Transition Parameters w
Since the maximization over w is coupled with that over u, and Q(q^(n), θ) is not jointly concave in w and u, we decouple the problem into two concave problems, by first maximizing Q w.r.t. w setting u ← u^(n), and then maximizing w.r.t. u setting w ← w^(n+1). This is explained next.
^28 The invariant density p_θ^(n)(x_t) is the eigenvector corresponding to the eigenvalue 1 of the stochastic matrix π_{x_{t−1}, x_t}.
Defining the vectors

w ≜ (w_{1,1} ... w_{1,K} ... w_{K,1} ... w_{K,K}, ω_1 ... ω_K)^⊤,
π(s_t) ≜ (π_{1,1}(s_t) ... π_{1,K}(s_t) ... π_{K,1}(s_t) ... π_{K,K}(s_t))^⊤,
q_{t−1,t}^(n) ≜ (q_{t−1}^(n)(1) q_t^(n)(1) ... q_{t−1}^(n)(1) q_t^(n)(K) ... q_{t−1}^(n)(K) q_t^(n)(1) ... q_{t−1}^(n)(K) q_t^(n)(K))^⊤,   (12.2.7)

the gradient of Q with respect to w is (with [a ; b] denoting vertical stacking):

∇_w Q = Σ_{t ∈ T\U} [ q_{t−1,t}^(n) − π(s_t) ; (1_K^⊤ ⊗ I_K)[q_{t−1,t}^(n) − π(s_t)] ] ⊗ s_t
      + Σ_{t ∈ U} [ q_{t−1,t}^(n) − π(u_t^(n)) ; (1_K^⊤ ⊗ I_K)[q_{t−1,t}^(n) − π(u_t^(n))] ] ⊗ u_t^(n) − λ_w [ w ; 0_{K×1} ],   (12.2.8)

where 1_K is the K × 1 dimensional vector of ones, 0_{K×1} is the K × 1 dimensional vector of zeros, I_K is the K × K dimensional identity matrix, e_k is the K × 1 dimensional basis vector with a 1 at the k–th element and zeros elsewhere, and ⊗ is the Kronecker product. Also, the Hessian ∇_w^2 Q can be shown to be negative-definite, implying that Q is concave in w with a unique global maximum.
Please refer to Section E.2.1 for the complete derivations.
Although w can be estimated using the iteratively re-weighted least squares (IRLS) method, it involves an expensive inversion of the Hessian ∇_w^2 Q at every iteration. This inversion can be avoided using a bound optimization method [116] that iteratively maximizes a surrogate function, w^(n′+1,n) = arg max_w Q′(w | w^(n′,n)), where the surrogate is a cost function selected such that Q(w^(n′,n)) − Q′(w | w^(n′,n)) attains its minimum at w = w^(n′,n).
One such surrogate is a quadratic function with a constant Hessian B, such that ∇_w^2 Q − B is negative-definite, given as:

B ≜ [ A , AP ; P^⊤A , P^⊤AP ] ⊗ ( Σ_{t ∈ T\U} s_t s_t^⊤ + Σ_{t ∈ U} u_t^(n) u_t^(n)⊤ ) − λ_w [ I_{K²} , 0_{K²×K} ; 0_{K×K²} , 0_{K×K} ],   (12.2.9)

where A ≜ −(1/2) ( I_{K²} − (1/K) 1_K 1_K^⊤ ⊗ I_K ) and P is as defined in Appendix E.2.1. Maximization of this surrogate function leads to the following update equation:

w^(n′+1,n) ← w^(n′,n) − B^{−1} ∇_w Q(w^(n′,n)).   (12.2.10)
The index n′ marks the iterations of the bound maximization of Q′(w | w^(n′,n)) with respect to w, during one iteration of the M-step indexed by n. This inner maximization loop is initialized with w^(0,n) ← w^(n) and terminates when the update ||w^(n′+1,n) − w^(n′,n)||_2 falls below a certain tolerance, and w^(n+1) ← w^(n′+1,n) is the new value for the M-step update. This tolerance can be fairly loose (typically 10% of the absolute value ||w^(n′,n)||_2), as the M-step of the GEM algorithm only requires an increase in the value of Q with respect to its parameters, and not necessarily the maximization of Q.
Although bound optimization takes more iterations to converge than IRLS, on the whole it is much faster since it precludes inverting the Hessian (of the order of the size of w) at each step [116]. Please refer to Appendix E.2.1 for more on the bound optimization algorithm.
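A toy instance of the bound-optimization idea, applied to a plain multinomial logistic model (without the pairwise w_{i,j} structure or the λ_w prior of the SSM): the log-likelihood Hessian is bounded by the constant matrix −½(I_K − 1_K 1_K^⊤/K) ⊗ X^⊤X (Böhning's bound), which is (pseudo-)inverted once outside the loop. All data and dimensions are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S, K = 200, 2, 3
X = rng.normal(size=(N, S))
w_true = rng.normal(size=(K, S))
logits = X @ w_true.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(K, p=p) for p in P])
Y = np.eye(K)[labels]                     # one-hot targets

# Bohning's constant bound on the multinomial-logistic Hessian,
# playing the role of B in the update of eqn. 12.2.10.
A = -0.5 * (np.eye(K) - np.ones((K, K)) / K)
B = np.kron(A, X.T @ X)
Binv = np.linalg.pinv(B)                  # B is singular along the softmax shift

w = np.zeros((K, S))
for _ in range(200):
    logits = X @ w.T
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    grad = ((Y - P).T @ X).reshape(-1)    # gradient of the log-likelihood
    w = w - (Binv @ grad).reshape(K, S)   # w <- w - B^{-1} grad
```

Each step is guaranteed to increase the log-likelihood, mirroring the monotone-ascent requirement of the GEM M-step.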
12.2.2.2 Estimating Missing Stimulus u
After estimating w^(n+1), Q(q^(n), θ) can then be maximized with respect to u_t for all t ∈ U by setting w ← w^(n+1). Expressing ϖ_{i,j}^(n+1) ≜ w_{i,j}^(n+1) + ω_j^(n+1), the gradient and Hessian of Q with respect to u_t are given by:

∇_{u_t} Q = Σ_{i=1}^K Σ_{j=1}^K q_{t−1}^(n)(i) q_t^(n)(j) [ ϖ_{i,j}^(n+1) − Σ_{k=1}^K ϖ_{i,k}^(n+1) π_{i,k}(u_t) ],
∇_{u_t}^2 Q = − Σ_{i=1}^K Σ_{j=1}^K q_{t−1}^(n)(i) q_t^(n)(j) [ Σ_{k=1}^K ϖ_{i,k}^(n+1) ϖ_{i,k}^(n+1)⊤ π_{i,k}(u_t) − ( Σ_{k=1}^K ϖ_{i,k}^(n+1) π_{i,k}(u_t) ) ( Σ_{l=1}^K ϖ_{i,l}^(n+1) π_{i,l}(u_t) )^⊤ ].

Again, the Hessian ∇_{u_t}^2 Q is negative-definite and therefore Q is concave in u_t with a unique global maximum. Since ∇_{u_t}^2 Q is of the dimension of the stimulus vector and is easily invertible, this maximization is done using IRLS because of its faster convergence. Please refer to Appendix E.2.1 for the detailed derivations.
12.2.2.3 Estimating Emission Parameters ϑ
The M-step for the emission parameters ϑ_k = (μ_k, Σ_k) yields the following closed-form updates:

μ_k^(n+1) = (1/T) Σ_{t=1}^T q_t^(n)(k) μ_{q_t^(n)(z_t | x_t = k)},
Σ_k^(n+1) = (1/T) Σ_{t=1}^T q_t^(n)(k) [ Σ_{q_t^(n)(z_t | x_t = k)} + μ_{q_t^(n)(z_t | x_t = k)} · μ_{q_t^(n)(z_t | x_t = k)}^⊤ + μ_k^(n+1) · μ_k^(n+1)⊤ − μ_{q_t^(n)(z_t | x_t = k)} · μ_k^(n+1)⊤ − μ_k^(n+1) · μ_{q_t^(n)(z_t | x_t = k)}^⊤ ],   (12.2.11)

where Σ_{q_t^(n)(z_t | x_t = k)} and μ_{q_t^(n)(z_t | x_t = k)} were defined in eqn. 12.2.5. The details of this formula are given in Appendix E.2.2.
As reported in Section 10.5 of Chapter 10, the feature–space coefficients of an fMRI session exhibit very low temporal correlations. To enforce this high degree of sparsity in the estimates of Σ_k (cf. eqn. 12.2.11), during the n–th iteration of the M-step the estimate of Σ_k^(n) is sparsified using adaptive shrinkage [186], similar to the procedure in Section 9.2 of Chapter 9.
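The closed-form updates of eqn. 12.2.11 amount to q-weighted combinations of the variational moments; the four outer-product terms in the covariance update collapse to Σ_q + (μ_q − μ_k)(μ_q − μ_k)^⊤. A sketch with toy dimensions (all inputs synthetic, and the shrinkage step omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
T, D, K = 60, 3, 2
q = rng.dirichlet(np.ones(K), size=T)            # q_t^(n)(x_t = k)
mu_q = rng.normal(size=(T, K, D))                # mu_{q_t(z_t | x_t = k)}
Sig_q = np.tile(0.1 * np.eye(D), (T, K, 1, 1))   # Sigma_{q_t(z_t | x_t = k)}

# mu_k^(n+1): q-weighted average of the variational means.
mu_new = np.einsum('tk,tkd->kd', q, mu_q) / T
# Sigma_k^(n+1): the outer-product terms collapse to
# Sigma_q + (mu_q - mu_k)(mu_q - mu_k)^T.
diff = mu_q - mu_new[None, :, :]
Sig_new = (np.einsum('tk,tkde->kde', q, Sig_q)
           + np.einsum('tk,tkd,tke->kde', q, diff, diff)) / T
```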
12.2.2.4 Estimating HRF FIR Filter h
The estimation of the coefficients of the FIR filter h[d] ≜ (h_0[d] ... h_L[d])^⊤, the L + 1–tap HRF corresponding to the d–th element of the D–dimensional feature space, is described next. The gradient of the free-energy term Q from eqn. 12.2.1 is:

∂Q/∂h[d] = Σ_{t=1}^T Σ_{d′=1}^D Σ^{−1}[d′, d] ( y_t[d′] ν_t[d] − Λ_t[d′, d] h[d′] ) − Σ_h^{−1} (h[d] − μ_h),   (12.2.12)

where

ν_t[d] = ( ν_{q_t^(n)}[d] ... ν_{q_{t−L}^(n)}[d] )^⊤,   Λ_t[d′, d] = diag( Λ_{q_t^(n)}[d′, d] ... Λ_{q_{t−L}^(n)}[d′, d] ).

Here, the marginal first and second moments of z_t under the variational density q_t^(n)(z_t) are defined as ν_{q_t^(n)} ≜ E_{q_t^(n)(z_t)}{z_t} and Λ_{q_t^(n)} ≜ E_{q_t^(n)}{z_t z_t^⊤}, respectively.
As per eqn. 12.2.12, the gradient ∂Q/∂h[d] for the FIR filter at the d–th element depends on the values of h[d′] at all the other elements d′ ≠ d of the D-dimensional space. Setting ∂Q/∂h[d] = 0 for all d = 1 ... D results in a linear system of D × L equations in D × L unknowns. The unique solution h^(n+1) is computed using conjugate gradient descent [81] initialized at h^(n), and its iterations are terminated when the update ||h^(n+1) − h^(n)||_2 falls below a pre-specified tolerance (set adaptively at 10% of ||h^(n)||_2).
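A minimal conjugate-gradient solver of the kind used for this stationarity system; the matrix below is a stand-in SPD system of the size of the stacked filters, and this sketch stops on the relative residual rather than the text's relative parameter update:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-8):
    """Plain conjugate gradients for an SPD system A x = b,
    warm-started at x0 (cf. initializing at h^(n) above)."""
    x = x0.copy()
    r = b - A @ x                     # initial residual
    p = r.copy()
    while np.linalg.norm(r) > tol * np.linalg.norm(b):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

# Stand-in SPD system (illustrative; not the model's actual matrix).
rng = np.random.default_rng(5)
M = rng.normal(size=(30, 30))
A = M @ M.T + 30 * np.eye(30)
b = rng.normal(size=30)
h = conjugate_gradient(A, b, np.zeros(30))
```

Warm-starting at the previous estimate is what makes the loose tolerance affordable: only an improvement in Q is needed per M-step, not an exact solve.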
12.2.2.5 Estimating Noise Variance Σ
The noise variance Σ has a closed-form estimate:

Σ^(n+1) = (1/T) Σ_{t=1}^T [ y_t y_t^⊤ − y_t ( Σ_{l=0}^L H_l^(n+1) ν_{q_{t−l}^(n)} )^⊤ − ( Σ_{l=0}^L H_l^(n+1) ν_{q_{t−l}^(n)} ) y_t^⊤ + Σ_{l=0}^L (H_l^(n+1))^2 Λ_{q_{t−l}^(n)} + Σ_{l=0}^L Σ_{m=0, m≠l}^L H_l^(n+1) H_m^(n+1) ν_{q_{t−l}^(n)} ν_{q_{t−m}^(n)}^⊤ ],   (12.2.13)

where ν_{q_t^(n)} and Λ_{q_t^(n)} are the marginal moments of z_t under the variational distribution q_t^(n), as defined earlier. The estimation formulae for the HRF and noise parameters are derived in Section E.2.3 of Appendix E.
12.2.3 Spatial Activation Maps
The activation pattern for a specific value of the experimental variables s_t is obtained by first computing the invariant distribution p(x_t | w, s_t) as the eigenvector with eigenvalue 1 of the state–transition matrix π(s_t) (cf. eqn. 12.1.1), and then computing the activation pattern mean as μ_{s_t} = Σ_{k=1}^K p_θ(x_t = k | w, s_t) μ_k and its variance as Σ_{s_t} = Σ_{k=1}^K p_θ(x_t = k | w, s_t) Σ_k. The z–score map for the activation pattern corresponding to ŝ is given by Σ_{s_t}^{−1/2} μ_{s_t} in feature–space, which is then transformed back into a voxel–wise spatial map of activity.
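The invariant-distribution computation can be sketched as follows: the stationary distribution of the row-stochastic π(s_t) is its left eigenvector with eigenvalue 1, and the map is the state-probability-weighted mixture of the emission parameters. Diagonal variances and all numerical values are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
K, D = 3, 4
P = rng.dirichlet(np.ones(K), size=K)      # row-stochastic pi(s_t), toy values
mu = rng.normal(size=(K, D))               # state means mu_k in feature space
var = rng.uniform(0.5, 2.0, size=(K, D))   # diagonal state variances (toy)

# Invariant distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                         # normalize to a probability vector

mu_s = pi @ mu                     # activation-pattern mean mu_{s_t}
var_s = pi @ var                   # its variance Sigma_{s_t} (diagonal case)
zmap = mu_s / np.sqrt(var_s)       # z-score map in feature space
```

In the full method the z-score map would then be projected back through the basis Φ to a voxel-wise map.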
12.3 Estimating the Optimal State–Sequence
Direct estimation of the most probable state–sequence

x* = arg max_x p_θ(x | y) = arg max_x p_θ(y, x),

given model parameters θ, requires joint maximization over all T state variables, since the hidden layer z introduces a dependency between all the y and x variables, preventing factorization of the graph^29. As the size of the search space increases exponentially with T, with a complexity of O(K^T) for the whole chain, exhaustive search soon becomes infeasible and an approximation such as Iterated Conditional Modes (ICM) [22] is required. In the approximation of Chapter 11, the full model was replaced with a reduced one by removing the intermediate hidden layer z. This resulted in a joint maximization over L states with O(T K^L) complexity, which was solved using ICM.
29See the discussion in Appendix D.5
In this chapter, an EM algorithm determines the optimal state–sequence through a mean–field approximation that iteratively transforms the problem into a series of first-order HMMs, which in turn are solved using the standard Viterbi algorithm [22].
As in Section 12.2, by introducing a variational density q(z), the log-probability term is decomposed into a free-energy and a KL-divergence term:

ln p_θ(y, x) = Q(q, x) + KL(q || p_θ(z | y, x)),

where

Q(q, x) = ∫_z q(z) ln [ p_θ(y, z, x) / q(z) ] dz   and   KL(q || p_θ(z | y, x)) = − ∫_z q(z) ln [ p_θ(z | y, x) / q(z) ] dz.
Again as before, restricting q(z) to the family of factorizable distributions q(z) = Π_{t=1}^T q_t(z_t), the E-step estimate

q^(n) = Π_{t=1}^T q_t^(n) = arg min_q KL(q || p_θ(z | y, x^(n))) = arg min_q KL(q || p_θ(y, z | x^(n)))

is obtained by iteratively computing

q*(z_t) ∝ exp{ E_{Π_{t′≠t} q_{t′}*} [ ln p_θ(y, z | x^(n)) ] }

until convergence. The factorized density q*(z_t) is a normal distribution q*(z_t) ∼ N(μ_{q_t*}, Σ_{q_t*}), with mean μ_{q_t*} and variance Σ_{q_t*} identical to the form defined in eqn. 12.2.5, with the difference that x_t is replaced by x_t^(n).
Since the terms

∫_z q(z) ln p_θ(y | z) dz   and   ∫_z q(z) ln q(z) dz

in Q(q, x) are independent of x, the maximization step becomes:

arg max_x Q(q^(n), x) = arg max_x { ln p_θ(x) + ∫_z q^(n)(z) ln p_θ(z | x) dz }
                      = arg max_x Σ_{t=1}^T { ln p_θ(x_t | x_{t−1}) + ∫_{z_t} q^(n)(z_t) ln p_θ(z_t | x_t) dz_t },   (12.3.1)
which is identical to the problem of estimating the optimal state–sequence of a first-order HMM, with the difference that the emission (log) probability is replaced by the expected (log) probability under the variational density q^(n). The solution can be computed using the Viterbi algorithm with O(T × K^2) complexity.
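A standard Viterbi decoder matching the structure of eqn. 12.3.1, with `loglik[t, k]` standing in for the expected emission log-probability under q^(n); the demo chain at the bottom is synthetic:

```python
import numpy as np

def viterbi(log_pi0, logA, loglik):
    """O(T K^2) Viterbi decoding of the most probable state sequence.
    loglik[t, k] plays the role of the expected emission
    log-probability of eqn. 12.3.1."""
    T, K = loglik.shape
    delta = log_pi0 + loglik[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: reach j from i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    x = np.zeros(T, dtype=int)
    x[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):          # backtrack the best path
        x[t] = back[t + 1, x[t + 1]]
    return x

# Toy two-state chain: strong emission evidence for 0,0,0,0,0,1,1,1,1,1.
loglik = np.full((10, 2), -5.0)
for t, k in enumerate([0] * 5 + [1] * 5):
    loglik[t, k] = 0.0
x_hat = viterbi(np.log([0.5, 0.5]), np.log(np.array([[0.9, 0.1], [0.1, 0.9]])), loglik)
```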
The EM iterations terminate when the increments | ln p_θ(y, x^(n+1)) − ln p_θ(y, x^(n)) | fall below a pre-specified tolerance, typically set to 0.0099, corresponding to a < 1% increase in the probability.
The details of the state–sequence estimation algorithm are adduced in Appendix E.3.
12.4 Model Hyper-Parameter Selection
The hyper–parameters of the SSM are the number of hidden states K, the precision λw of the prior distribution of the transition weights w, and the parameters µh, Σh of the prior model of the HRF h. The values of µh and Σh, determined from the canonical HRF of SPM8, are used to enforce domain knowledge by restricting the HRF to the space of physiologically plausible shapes. This provides an optimal trade-off between allowing a spatially varying and unknown HRF against over–fitting the FIR filter to the data.
The hyper–parameter λ_w determines the variance in the weights w_{i,j}; it implements a trade–off between the effect of the stimulus versus the previous state on the current state probability, and mediates a complex set of interactions between the temporal structure of the fMRI data and of the stimulus sequence. A very high value of λ_w causes the state–transitions to be driven mostly by the current stimulus, while a low value increases the contribution of the previous state to the transition probability. It therefore cannot be practically provided as a user–tunable parameter. On the other hand, model–size (i.e. K) selection is typically done using Bayes factors [112], information theoretic criteria [151] or reversible jump MCMC based methods [194]. Implicitly, these methods require an a priori notion of the complexity of a given model.
Here instead, we adopt an automated method for selecting both K and λw based on the maximally predictive criterion, as developed in the previous chapter (cf. Section 11.4), leveraging the ability of the SSM to predict missing stimuli u.
From the stimulus time–series, blocks of T′ consecutive time–points (in TR units), totalling 25% of the total number of scans, are removed at random to serve as missing stimuli U ≜ {t_1 ... t_1 + T′ − 1, ..., t_M ... t_M + T′ − 1}, and the optimal SSM parameters θ* are estimated for a given K and λ_w. The prediction error between the predicted u_t and their true values s_t is then measured as ERR_missing ≜ Σ_{t ∈ U} ||u_t − s_t||_2. The hyper–parameters are then selected to minimize this error–rate. The optimal value of K is obtained by first stepping through different values of K with large step-sizes and then iteratively refining the step-size. The advantage of this procedure is that it allows selecting the model most relevant to the experiment being conducted.
For each setting of K, the optimal λw is determined by searching over the range log10 λw = −3 ... +3 and selecting the value that minimizes ERRmissing. This sets the parameter to effect an optimal compromise between stimulus–driven and previous–state–driven transitions. We observed that the prediction error is relatively insensitive to λw (cf. Section 12.5), and therefore a common value can be selected across a multi–subject data–set for one study.
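As an illustrative sketch (not the thesis implementation), the hold–out and grid–search procedure can be outlined in Python; `fit_ssm` and `predict_missing` are hypothetical stand–ins for the SSM estimation (Section 12.2) and missing–stimulus prediction (Section 12.3) routines, and the coarse K grid is an assumed example:

```python
import numpy as np

def select_hyperparams(stimuli, fit_ssm, predict_missing,
                       T_block=5, K_grid=(5, 15, 25, 35, 45), seed=0):
    """Select K and lambda_w by the maximally predictive criterion.

    `fit_ssm(K, lam_w, held_out)` and `predict_missing(model, t)` are
    hypothetical stand-ins for the SSM estimation and missing-stimulus
    prediction routines."""
    T = len(stimuli)
    rng = np.random.default_rng(seed)
    # Remove random blocks of T_block TRs, totalling ~25% of the scans.
    n_blocks = max(1, int(0.25 * T / T_block))
    starts = rng.choice(T - T_block, size=n_blocks, replace=False)
    held_out = sorted({t for s0 in starts for t in range(s0, s0 + T_block)})

    best_err, best_K, best_lam = np.inf, None, None
    for K in K_grid:                               # coarse sweep over model size
        for log_lam in np.arange(-3.0, 3.5, 0.5):  # log10 grid for lambda_w
            lam = 10.0 ** log_lam
            model = fit_ssm(K, lam, held_out)
            # ERRmissing = sum over held-out t of ||u_t - s_t||^2
            err = sum(float(np.sum((predict_missing(model, t) - stimuli[t]) ** 2))
                      for t in held_out)
            if err < best_err:
                best_err, best_K, best_lam = err, K, lam
    return best_K, best_lam, best_err
```

In practice the coarse sweep over K would be followed by the iterative step–size refinement described above.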
The reader will observe that prediction error is used merely as a statistic (cf. Sidebar on Generative vs. Classification Models in Chapter 7) to select hyper–parameters. The parameters themselves, unlike those of MVPR classifiers, are not estimated to minimize prediction error but rather to fit a model of brain function to the data. It is this distinction that allows interpretation of the estimated parameters (in terms of the underlying neurophysiological model), in contrast to MVPR classifiers.
The effect of these hyper–parameters and the length T′ of a missing–stimulus block on the
model estimation is evaluated in Section 12.5.
12.5 Results
This section starts off with a quantitative validation of the model and state–sequence estimation algorithms using a synthetic data–set. The method is then illustrated on two fMRI studies: a simple block design for a visuo–spatial motor task, and the complex, irregular event–related design for arithmetical processing (cf. Section 3.2). The first example focuses on the spatio–temporal activation maps, demonstrates the ability of the method to discover new patterns in the data, and provides a comparative evaluation with respect to other analysis methods and feature–spaces. The second study is used to perform group–level inferences in the abstract representational space generated by this spatio–temporal phenomenological model.
For all the fMRI data, the mean volume of the time–series was subtracted, white matter was masked out, and all further processing was performed on grey–matter voxels. The algorithms were implemented in MATLAB® with Star-P® on a 2.6GHz Opteron cluster with 16 processors and 32GB RAM.
12.5.1 Simulation
12.5.1.1 Methods and Materials
The algorithms were validated on a synthetic data–set created as follows. For all simulations, the length of the session was T = 600 TRs, the dimension of the feature–space was D = 500, and the dimension of the stimulus vector st was set to 5, to reflect a typical fMRI data–set. Model size K was varied from 5 to 50, while the precision hyper–parameter λw was varied from 10⁻³ to 10³.
The state transition parameters ωj were initialized from a uniform distribution over [0, 1], while the wi,j were sampled from N (0, λw⁻¹ I5), where In denotes the n × n identity matrix. The HRF FIR coefficients h[d] for each element d of the feature–space were obtained by sampling from N (µh, Σh). The emission parameters (µk, Σk) for each state k were obtained by sampling Σk from a Wishart distribution W(T, ID) and µk from N (0, Σk). The noise variance Σ was sampled from W(T, β⁻¹ ID). The parameter β effectively controls the SNR of the data, by controlling the ratio of the noise variance to that of the activation patterns zt. For each time–point t = 1 ... T, the stimuli st were generated from a normal distribution N (0, I5) and then smoothed along the time dimension with a Gaussian filter of varying full–width–at–half–maximum (FWHM), in order to impose a temporal structure on the simulated data. The values of xt, zt and yt were then sampled from their generative distributions as per Section 12.1.
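A minimal sketch of this sampling scheme, assuming NumPy/SciPy; function and argument names are illustrative, and D is kept small here so the sketch runs quickly:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import wishart

def simulate_params(K=5, D=20, T=600, dim_s=5, lam_w=1.0, beta=100.0,
                    fwhm=4.0, seed=0):
    """Sample SSM parameters and a smoothed stimulus sequence, mirroring
    the simulation set-up described above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    omega = rng.uniform(0.0, 1.0, size=(K, K))              # transition params
    w = rng.normal(0.0, lam_w ** -0.5, size=(K, K, dim_s))  # w_ij ~ N(0, lam_w^-1 I)
    # Emission parameters: Sigma_k ~ W(T, I_D), mu_k ~ N(0, Sigma_k)
    Sigma = np.asarray(wishart(df=T, scale=np.eye(D)).rvs(K, random_state=rng))
    mu = np.stack([rng.multivariate_normal(np.zeros(D), Sigma[k])
                   for k in range(K)])
    # Noise covariance ~ W(T, beta^-1 I_D); beta sets the SNR
    noise_cov = wishart(df=T, scale=np.eye(D) / beta).rvs(random_state=rng)
    # Stimuli: white noise smoothed in time with a Gaussian of given FWHM
    s = gaussian_filter1d(rng.normal(size=(T, dim_s)),
                          sigma=fwhm / 2.355, axis=0)
    return omega, w, mu, Sigma, noise_cov, s
```

The FWHM-to-sigma conversion uses the standard factor 2√(2 ln 2) ≈ 2.355.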
We compared the results of the generalized EM (GEM) algorithm under the mean field approximation (MF-GEM) presented here (cf. Section 12.2) against those obtained by an MCMC–based GEM algorithm. The number of MCMC samples was varied to match the MF-GEM algorithm in terms of equal running time (MCMC:RT) and equal estimation error (MCMC:EE). As the MCMC method can produce exact estimates, given a sufficient number of samples, it was also run until convergence (MCMC:CNV) in order to establish baseline accuracy. The experiments were repeated with β = 10, 100 and 1000, corresponding to SNR of 10, 20 and 30dB respectively.
12.5.1.2 Discussion
The relative error ERRestimate in the parameter estimates θ*, the relative error ERRK of the model–size estimate K*, and the prediction error ERRmissing (cf. Section 12.4) for the various experiments are charted in Fig. 12.3.
One of the main observations is that the MCMC algorithm requires almost thrice the total
running–time (including searching for the optimal hyper–parameter values) for the same
estimation error ERRestimate as the mean field EM method (MF-GEM), while the prediction
error of MF-GEM is within 20% of the best ERRmissing as measured by MCMC:CNV. While
reducing SNR does not affect running–time significantly, its effect on the errors is large.
[Figure 12.3: four panels, (a) running time (mins), (b) ERRestimate, (c) ERRpredict, (d) ERRK, comparing MF-GEM, MCMC:RT, MCMC:EE and MCMC:CNV at SNR = 10, 20 and 30dB.]
Figure 12.3: Simulation Results. The GEM method under the mean field approximation (MF-GEM) is compared against an MCMC–based estimation algorithm matched in terms of equal running time (MCMC:RT) and equal estimation error (MCMC:EE), and against MCMC run until convergence of the estimates (MCMC:CNV). The experiments were repeated for SNR = 10, 20 and 30dB. Plotted are the total running time (Fig. (a)), relative estimation error (Fig. (b)), prediction error (Fig. (c)) and relative error in estimating the correct K (Fig. (d)) for the different experiments. Error bars indicate ±1 standard deviation.
Reducing the SNR from 30 to 20dB caused prediction error to increase from < 10% to
≈ 30%. Furthermore, for SNR ≤ 20dB the estimate of the model–size using the maximally predictive criterion is within 10% of the true K.
Although all the parameters θ are important in determining the accuracy of the model, of special interest are the parameters µk, k = 1 ... K, as they correspond to the spatial distribution of activity representative of each state. The average estimation error of the spatial distribution parameters, defined as

ERRspatial ≜ (1/K) ∑_{k=1}^{K} (µ*k − µk)⊤ Σk⁻¹ (µ*k − µk),

where µ*k is the estimate of µk, is listed in Table 12.2 for the MF-GEM and MCMC:CNV cases. It can be observed that, for the 20dB case, the estimated spatial patterns are within ≈ 0.25 standard deviations (as given by Σk) of the true µk.
SNR (dB)   MF-GEM         MCMC:CNV
30         0.151 ± 0.06   0.126 ± 0.05
20         0.223 ± 0.09   0.212 ± 0.09
10         0.361 ± 0.13   0.357 ± 0.14
Table 12.2: Effect of SNR on the error ERRspatial (±1 std.dev.) in µk estimated by the MF-GEM algorithm and MCMC run to convergence (MCMC:CNV).
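The ERRspatial statistic is a per–state Mahalanobis distance averaged over states; a minimal sketch:

```python
import numpy as np

def err_spatial(mu_true, mu_est, Sigma):
    """Average Mahalanobis error between true and estimated state means:
    ERRspatial = (1/K) * sum_k (mu*_k - mu_k)^T Sigma_k^-1 (mu*_k - mu_k)."""
    K = len(mu_true)
    total = 0.0
    for k in range(K):
        d = np.asarray(mu_est[k]) - np.asarray(mu_true[k])
        total += float(d @ np.linalg.solve(Sigma[k], d))  # avoids explicit inverse
    return total / K
```

Solving the linear system per state is preferred over forming Σk⁻¹ explicitly, for numerical stability.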
The effect of the sparsification step during the estimation of Σk (cf. Section 12.2.2.3) on estimation accuracy and prediction rate is given in Table 12.3. While shrinkage of the ML estimate of Σk has a positive effect on the estimation and prediction accuracies, the benefit is more pronounced at lower SNR, indicating the necessity of this step especially when dealing with noisy data.
SNR (dB)   ERRestimate     ERRmissing
30         14.77% ± 2.26   9.85% ± 2.73
20         18.21% ± 3.18   12.29% ± 2.57
10         23.36% ± 5.19   17.87% ± 3.44
Table 12.3: The percentage reduction in estimation error ERRestimate and prediction error ERRmissing due to shrinkage of the ML estimates of Σk, k = 1 ...K, at various SNR levels.
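As a hedged illustration of shrinking a covariance estimate toward its diagonal: the actual sparsification rule is given in Section 12.2.2.3, and the convex–combination form and weight `alpha` here are assumptions for exposition only.

```python
import numpy as np

def shrink_cov(S_ml, alpha=0.2):
    """Shrink an ML covariance estimate toward its diagonal.

    Illustrative only: the convex-combination rule and weight `alpha`
    are assumptions, not the thesis' sparsification procedure."""
    S_ml = np.asarray(S_ml, dtype=float)
    target = np.diag(np.diag(S_ml))            # diagonal shrinkage target
    return (1.0 - alpha) * S_ml + alpha * target
```

Shrinkage of this kind damps noisy off–diagonal entries, which is why its benefit grows at lower SNR.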
12.5.2 Data-Set 1: Visuo–Spatial Motor Task
12.5.2.1 Methods and Materials
In this task, four subjects were visually exposed to oriented wedges filled with high–contrast random noise patterns and displayed in one of four quadrants. Subjects were asked to focus on a center dot and to perform a finger–tapping motion with the right or left hand when the visual wedge was active in the upper right or lower left quadrant, respectively. The block length of each visual wedge stimulation varied from 5 to 15s, and the noise patterns changed at a frequency of 5Hz. A multi–shot 3D Gradient Echo Planar Imaging (EPI) sequence, accelerated in the slice–encoding direction with GRAPPA and UNFOLD, was used on a GE 3T MRI scanner with a quadrature head coil; 171 volumes were acquired at TR = 1.05s and an isotropic resolution of 3mm, with a total imaging time of 3min. The first five volumes were discarded from the analysis.
The data were analyzed using a univariate GLM with SPM8. The design matrix included a regressor for the presentation of the wedge in each quadrant, convolved with a canonical
HRF. The results of this analysis are shown in Fig. 12.4(a), and correspond to the classic retinotopic organization of the primary visual cortex and the hand motor representation areas in both hemispheres.
The model was trained using the GEM algorithm under the mean field approximation (MF-GEM), with the data represented in the following feature–spaces:
• [FS:Φ] The basis vectors of Φ, with D = 500 (cf. Chapter 10)
• [FS:PCA] Coefficients of the scans projected on their principal components, retained if their variance was greater than the mean variance (≈ 80) [71].
• [FS:VOX-AVG] The top 500 significantly activated voxels identified with a GLM using
a contrast for the average effect of all orientations.
• [FS:VOX-ORIENT] The set of 500 voxels maximally responsive to only one orientation
of the wedge, using an appropriate contrast in the GLM.
Two different encodings of the stimulus vectors st were used as input to the training algorithm: (SSM:FULL) each st is a vector containing the post–SOA (stimulus–onset–asynchrony) time of the current fMRI frame t within the current presentation block, and the orientation of the wedge; (SSM:PSOA) each st contains only the post–SOA time. For comparison, we also trained a linear multi–class SVM classifier (SVM-CLASS) [108], with the orientations as class–labels, using the same feature–spaces.
12.5.2.2 Discussion
Fig. 12.5 shows the prediction error (averaged across subjects) of the wedge orientation for the three different cases SSM:FULL, SSM:PSOA and SVM-CLASS and the different
[Figure 12.4 panels: (a) SPM8, (b) SSM:FULL.]

Figure 12.4: Spatial Activation Maps for the Visuo–Motor Task from SPM8 and the State–Space Multivariate Analysis. Fig. (a): Maximum intensity projections of significantly activated voxels (p < 0.05, FWE corrected) in a single subject for the four orientations of the wedge and the hand motor actions, computed using SPM8. The red circles indicate the ROIs for which the estimated HR FIR filters are displayed in Fig. 12.8. Fig. (b): Spatial (z–score) maps showing the distribution of activity for each orientation of the wedge computed from our state–space model, displayed on an inflated surface of the brain. Displayed are the posterio–lateral and posterio–medial views of the left and right hemispheres respectively. Values of z ≤ 1 have been masked out for visual clarity.

[Figure 12.5: error–rate bars for feature–spaces FS:Φ, FS:PCA, FS:VOX-AVG and FS:VOX-ORIENT.]
Figure 12.5: The inter–subject average prediction error–rates for the visuo–motor task using a multi–class SVM (SVM-CLASS), and our model trained with the stimulus vector st containing both post–SOA time and wedge orientation (SSM:FULL) and with the stimulus vector st containing only post–SOA time (SSM:PSOA). The error–rates are shown for different feature–spaces. The last bar in the SVM-CLASS column shows the SVM prediction error for the wedge orientation for which the 500 most active voxels are used as the feature–space (FS:VOX-ORIENT).
feature–spaces. Since the experiment was fully randomized, the chance–level error–rate is 75%. The error in the prediction of the wedge orientation for SSM:FULL is readily assessed during the computation of ERRmissing. In order to measure the prediction error for SSM:PSOA, we first trained the optimal model using only post–SOA times in s and obtained the optimal state–sequence x*. In a second step, we trained a simple multinomial classifier to predict the wedge orientation at time t from the state–label x*t, and measured the prediction error of this classifier using cross–validation.
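Such a multinomial classifier reduces to a conditional frequency table over state labels; a minimal sketch (the add–one smoothing is an assumption, not from the thesis):

```python
import numpy as np

def fit_state_classifier(states, labels, K, C):
    """Estimate Pr[label | state] by conditional frequencies with add-one
    smoothing. `states` are the decoded x*_t, `labels` the wedge orientations."""
    counts = np.ones((K, C))                  # Laplace smoothing (assumed)
    for x, y in zip(states, labels):
        counts[x, y] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def predict_labels(P, states):
    """Most probable label for each state in the sequence."""
    return P[np.asarray(states)].argmax(axis=1)
```

In the cross–validation described above, the table would be fit on training folds and `predict_labels` scored on held–out time–points.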
Firstly we observe that the prediction error of the semi–supervised model (SSM:FULL)
trained on the same experimental variables is very similar to that of the supervised SVM
(SVM-CLASS). More interestingly however, our model was able to predict, at better than
chance levels, the orientation of the wedge without being trained for it (SSM:PSOA). This
indicates that the model has learnt some intrinsic patterns in the data that are strongly related to the mental activity of the subject.
It can be noticed that the prediction error using FS:Φ was slightly better than that using the set of voxels significantly activated under all orientations (FS:VOX-AVG). This is noteworthy, especially given that FS:Φ, in contrast to FS:VOX-AVG, is computed without knowledge of the experimental parameters. One reason for the low accuracy of FS:VOX-AVG is that it includes voxels that are commonly activated for all wedge orientations and are therefore not necessarily selective to any one orientation. To account for this defect, we also trained the various algorithms on a set of voxels that were maximally responsive to only one orientation of the wedge (FS:VOX-ORIENT). The drawback of this feature–set is that while it is very accurate for that orientation, its discriminative ability for the other three orientations is very poor. As can be seen, for the multi–class SVM trained with one such FS:VOX-ORIENT feature–space, the average error for the same orientation is 21.81 ± 2.38%, but the error for the other orientations was much higher, resulting in an overall error–rate of 42.11%. The error–rate of SSM:FULL was comparable.
The poor accuracy of the PCA basis selected using the maximum–variance criterion has also been documented in other fMRI studies [90, 173] and can be attributed to the lack of a specific relationship between the task selectivity of a component and its variance.
The spatial maps of activation patterns (cf. Section 12.2.3) for SSM:FULL using FS:Φ, for a single subject, are shown in Fig. 12.4(b). The maps for the other three subjects are qualitatively similar. The retinotopic character of the activation maps follows the expected anatomical boundaries of the four visual quadrants within the occipital visual system. The same spatio–temporal accuracy is found for the motor component of the paradigm, with the typical cortical activation pattern for hand action in the contralateral hemisphere.

Figure 12.6: Brain–state probabilities for one subject. The size of the circles corresponds to the marginal probabilities of the states (labels 1–8) during the display of the wedge in the lower right, lower left, upper left and upper right quadrants for 4 TRs each. States have been relabeled for expository purposes and transition probabilities have been omitted for visual clarity.
The state transition probabilities πi,j(st) encode information about the temporal dynamics of the model, and reveal the organization of the brain–states xt with respect to the experi- mental variables st. Fig. 12.6 shows the marginal probabilities (i.e. the first eigenvector of
πj(st)) of the brain–states for one subject, corresponding to a sequence of wedges oriented in each quadrant for 4 TRs each.
Here, we see that the probability of a particular state is structured with respect to the ori- entation of the wedge. For example, at the start of the presentation with the wedge in the lower-right quadrant, state 1 is most probable. But by the second TR, state 2 becomes more dominant and this distribution remains stable for the rest of this presentation. Then, as the
display transitions to the lower–left quadrant, states 3 and 4 become equiprobable. However, as this orientation is maintained, the probability distribution peaks about state 4 and remains stable. A similar pattern is observed in the probability distributions for the other orientations.
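The marginal state distribution for a fixed stimulus can be read off as the leading left eigenvector of the row–stochastic transition matrix; a minimal numerical sketch:

```python
import numpy as np

def marginal_state_probs(P):
    """Marginal state distribution for a fixed stimulus: the leading (left)
    eigenvector of the row-stochastic transition matrix pi(s_t),
    normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)           # left eigenvectors of P
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    v = np.abs(v)                             # eigenvector sign is arbitrary
    return v / v.sum()
```

For a row–stochastic P the leading eigenvalue is 1, and the returned vector is the stationary distribution pi satisfying pi P = pi.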
From the plot of prediction error ERRmissing against model–size K for each of the four subjects in Fig. 12.7, one can see that it has a relatively shallow basin, with the minima occurring in a similar range of K = 8 ... 15. This points to the robustness of the model with respect to K and the similarity of the models across subjects.
Figure 12.7: ERRmissing with respect to model–size K. The minimum for each curve is marked by a circle.
Fig. 12.8 graphs the estimates of the hemodynamic FIR filter h for each subject, averaged in ROIs selected in the left primary motor cortex (BA3, BA4), and the left visual cortex
(BA17) (indicated by the red circles in Fig. 12.4). A qualitative difference between the estimated HRs of the two areas is apparent in terms of their rise-time, peak value and dispersion.
The HR of the brain is known to be highly variable [139] and by allowing the hemodynamic
FIR filter to vary spatially (by allowing one filter h[d] per feature–space element d), the model is able to account for this inter-region variability.
Figure 12.8: Estimated hemodynamic FIR filter h. The estimated FIR filter coefficients for each of the four subjects averaged in two ROIs selected in the motor and visual cortices.
12.5.3 Data-Set 2: Mental Arithmetical Task
12.5.3.1 Methods and Materials
This section discusses the inferences of the model applied to the study of mental arithmetic described in Section 3.2. For each t = 1 ... T, the experimental conditions were described by: Ph = 1, 2, 3, which indicates whether t is in the (1) multiplication, (2) subtraction / judgement, or (3) decision–making phase of the experiment; 1 ≤ LogPs ≤ 10, which quantifies the product size of the multiplication problem; and 1 ≤ LogDiff ≤ 5, which quantifies the expected difficulty in judging the right answer.
For each of the 42 subjects (20 control, 13 dyscalculic, 9 dyslexic), one model Mj = {Φj, θj} was trained per subject j = 1 ... 42, with the stimulus vector st containing the post–SOA time, the reaction–time for the current trial, and a parameter quantifying the size of the product of the two numbers displayed (cf. Section 3.2.3 for specifics). In order to balance the group sizes, group–level analysis was done by selecting 8 subjects at random from each group and computing the statistics over multiple resamples.
12.5.3.2 Comparative Analysis
One SSM Mj = {θ, K, λw} was trained per subject j = 1 ... 42 with three different encodings of the stimulus vector st:
SSM:NONE with st = (1)
SSM:PH with st = (Ph, 1)
SSM:FULL with st = (Ph, LogPs, LogDiff, 1)
As the models SSM:PH and SSM:NONE do not encode LogPs, LogDiff (and Ph) in the stimulus vector, they cannot estimate these variables as missing stimuli. Therefore, to assess the ability of the SSMs to predict these variables from the optimal state–sequence x*, we trained three simple multinomial classifiers: one to predict the probability Pr[Ph | x*t] of the phase Ph = 1, 2, 3, one to predict Pr[LogPs | x*t] of LogPs quantized to two levels at a value of 5, and one to predict Pr[LogDiff | x*t] of LogDiff quantized to two levels at a value of 2.5. The error–rates across the three classifiers were accumulated into a single ERRSSM:NONE, ERRSSM:PH and ERRSSM:FULL for each SSM trained. Also, as ERRmissing is not defined for SSM:NONE, its hyper–parameters are selected so as to minimize ERRSSM:NONE.
For comparative evaluation of the SSM with MVPR methods, we trained three linear SVM classifiers per subject: one to predict Ph = 1, 2, 3, one for LogPs = 0, 1 (quantized) and one for LogDiff = 0, 1 (quantized) and accumulated their error–rates into ERRSVM. The
SVMs were trained to predict the stimuli from the fMRI data deconvolved with the canonical HRF. Among the other classifiers evaluated (viz. GNB, LDA, and quadratic and exponential SVMs), none significantly outperformed the linear SVM.
For each of the SSMs and SVMs trained, the following feature–spaces were evaluated:
• [FS:Φ] The basis vectors of Φ, with D = 500 (cf. Chapter 10).
• [FS:PCA-NONE] Approximately 110 basis vectors obtained from a PCA of the fMRI data, retained using a bootstrap analysis of stability [17] at a 75% confidence level, to match the feature–selection criterion for Φ. This confidence level yields 112.92 ± 7.80 principal components (PCs) per subject.
• [FS:PCA-PH] The set of PCs maximally correlated with the HRF–convolved regressor
for Ph. For each subject 65 PCs were retained at a confidence level of 75% (64.27 ± 8.56
PCs).
• [FS:PCA-FULL] The set of PCs maximally correlated with the design matrix containing
HRF–convolved regressors for Ph, LogPs and LogDiff, identified using multiple regression. For each subject 80 PCs were retained at a confidence level of 75% (79.14 ± 9.57
PCs).
The prediction error of the different model and feature–space combinations for the control group is compiled in Fig. 12.9. The other two groups showed similar trends and are omitted for conciseness.
Figure 12.9: Comparative Analysis. The mean prediction error (±1 standard error of the mean (SEM)) for the control group using a linear SVM classifier, and the SSM with three different encodings of the stimulus vector st, viz. SSM:FULL, SSM:PH and SSM:NONE. The error–rates are measured for four different feature–spaces: FS:Φ, FS:PCA-NONE, FS:PCA-PH and FS:PCA-FULL. The last bar in the SVM column shows the SVM prediction error for only Ph, against which the PCs of FS:PCA-PH were selected. The “chance–level” prediction error, calculated through a permutation test, is ≈ 0.87.

It can be observed that the error for SSM:FULL is consistently lower (> 3 SEM for FS:Φ) than that of the SVM. This is noteworthy, especially since the parameters of the SVM were specifically optimized for prediction error. By treating each fMRI scan independently,
MVPR classifiers are unable to leverage the temporal structure in the data and rely only on spatial patterns for prediction. Moreover, the SVM uses a point–estimate of the neural activation through deconvolution, whereas the SSM accounts for spatially varying and un- known hemodynamics in a probabilistic fashion which contributes to its ability to predict the mental state of the subject.
Using only information about the phase Ph to train SSM:PH increased the error as compared to SSM:FULL, but only slightly (≈ 1 SEM). But removing all experimental information (SSM:NONE) caused a dramatic increase in error (ERRSSM:NONE ≈ 0.48). This implies that the semi–supervised SSM can detect the effect of experimental variables from patterns in the data (namely LogPs and LogDiff in the case of SSM:PH) against which it was not explicitly trained. Including some cues (namely Ph) about the experiment guides this discovery, stabilizes estimation and precludes the model from learning spatio–temporal patterns that are not relevant to the task and which may be due to artifacts (unlike SSM:NONE).
It is nevertheless interesting that, despite not using any stimulus information, SSM:NONE had a prediction error much better than chance (ERRchance ≈ 0.87) which implies that it has discovered the mental states of the subject in a purely unsupervised fashion, validating some of the neurophysiological assumptions behind the SSM.
The unsupervised PCA feature–space (FS:PCA-NONE) exhibited performance worse than
FS:Φ in all cases. The poor accuracy of PCA has also been documented in other fMRI studies [90, 160, 173] and can be attributed to the lack of a specific relationship between the task selectivity of a PC and its variance. In contrast, as FS:Φ is obtained from the correlation matrix, it describes the structure of the inter–relationships between the voxel time–series and not their magnitudes. Using PCs selected against the stimuli (FS:PCA-FULL) deteriorates the performance of the SSMs even further. This is because the exact coupling between stimuli and fMRI signal (and therefore PC time–courses) is unknown and may be non–linear, and selecting PCs linearly correlated with HRF–convolved stimuli may not preserve a large proportion of the spatio–temporal patterns in the data. In contrast, the
SVM, based on optimizing prediction error, has the best overall performance with this feature–selection strategy. The limitations of supervised feature–selection, however, are apparent in the case of FS:PCA-PH. Although the SVM predicts Ph with high accuracy (ERRSVM ≈ 0.08 for FS:PCA-PH vs. ≈ 0.11 for FS:PCA-FULL), its ability to predict any other stimulus is severely degraded, with an overall error of ≈ 0.50. The SSMs have similarly poor performance, due to the loss of information about spatio–temporal patterns in this basis.
12.5.3.3 SSM Parameter Estimates
This section further investigates the parameters estimated by SSM:PH trained with FS:Φ and st = (Ph, 1). The prediction error ERRSSM:PH exhibits a relatively shallow basin with respect to model–size K for all three groups in Fig. 12.10(a), with minima occurring in the range K = 18 ... 28. This points to the robustness of the SSM estimation with respect to K for each subject. The robustness of ERRSSM:PH with respect to λw, shown in Fig. 12.10(b), was comparable to that of the simulation study, with the curve of ERRSSM:PH almost flat for 10⁻²·² ≤ λw ≤ 10⁰·⁷.

Figure 12.10: Effect of Hyper–Parameters on ERRSSM:PH. Fig. (a): ERRSSM:PH with respect to model–size K. Fig. (b): ERRSSM:PH with respect to precision hyper–parameter λw. Fig. (c): ERRSSM:PH with respect to dimension D of Φ. Fig. (d): ERRSSM:PH with respect to missing–stimulus block length T′. Error bars indicate ±1 SEM. Legend. Blue solid line: control group, Red dashed line: DC group, Green dot–dashed line: DL group.
From the plot of ERRSSM:PH versus the dimensionality D of the feature–space Φ in Fig. 12.10(c), it can be noticed that ERRSSM:PH initially drops as D increases, with Φ explaining more of
the information in the data and bottoms out at 400 ≤ D ≤ 600, across the three groups.
It then begins to slowly rise as a larger number of unstable basis vectors are included,
capturing an increasing percentage of the noise in the data.
Fig. 12.10(d) graphs ERRSSM:PH versus the length T′ of the missing–stimulus block used in the assessment of ERRmissing (cf. Section 12.4). Here, we observe a very low error for small T′, as prediction is driven primarily by the strong temporal regularity of the stimulus presentation sequence over short durations. However, as in the case of the simulation, the error increases with T′ and stabilizes at a block length of 2 trials (T′ ≈ 5 TRs), after which point there is no structure in the stimulus sequence and prediction is driven mainly by the patterns in the data.
Table 12.4 compares the prediction error of SSMs with: [HRF:NONE] no HRF FIR filter;
[HRF:CONST] a spatially constant HRF of length L+1 = 32s; [HRF:UNCON] a spatially varying HRF of length L+1 = 32s without any constraints; and [HRF:PRIOR] the SSM with a spatially varying and unknown HRF of length L+1 = 32s, constrained by the prior density N (µh, Σh) (cf. Section 12.1). Here, the advantage of the spatially–varying but physiologically constrained HRF (HRF:PRIOR) in dealing with the variable hemodynamics of the brain and accurately predicting the mental state can be seen.
Removing the HRF altogether from the model (HRF:NONE), thereby not accounting for the lag in the fMRI signal due to hemodynamics, leads to the largest deterioration in performance. Although the inclusion of a spatially constant HRF (HRF:CONST) causes some reduction in accuracy, allowing too much variability by putting no constraints on the shape of the HRF (HRF:UNCON) results in even worse performance, due to over–fitting of noise.

             Control      DC           DL
HRF:NONE     0.62±0.11    0.68±0.13    0.64±0.10
HRF:CONST    0.36±0.08    0.46±0.11    0.40±0.09
HRF:UNCON    0.54±0.13    0.59±0.15    0.55±0.16
HRF:PRIOR    0.31±0.05    0.40±0.09    0.33±0.05

Table 12.4: Prediction Error versus Different HRF Models. ERRSSM:PH (±1 SEM) for the SSM with no HRF (HRF:NONE), a spatially constant HRF (HRF:CONST), a spatially varying and unconstrained HRF (HRF:UNCON), and a spatially varying HRF with a prior obtained from the canonical HRF of SPM8 (HRF:PRIOR).
Fig. 12.11 shows the estimates of the spatially varying but constrained HRF FIR filter
(HRF:PRIOR) for each group, averaged in regions–of–interest (ROIs) selected in the left primary motor cortex (BA3, BA4) and the bilateral intraparietal sulcus (IPS) (BA40). A qualitative difference in the estimated HRFs is apparent in terms of their rise–time, peak value and dispersion. The prolonged and repeated recruitment of the IPS in this task may explain the more dispersed shape of its HRF as compared to the motor cortex. No significant differences in HRF estimates were observed between the three groups.
The HRF of the brain is known to be highly variable [139] and by allowing the HRF FIR
filter to vary spatially (by allowing one filter h[d] per feature–space element d), the SSM is able to account for this inter-region variability.
The group–wise spatial maps corresponding to the three phases of each trial are shown in
Fig. 12.12.
Figure 12.11: Estimated HRF FIR filter h. Fig.(a): The locations of the ROIs (in the left hemisphere). Fig.(b-d): The estimated FIR filter coefficients (± 1 std.dev.) for each group averaged in the ROI in the left motor cortex and the left and right IPS. Legend. Blue solid line: control group, Red dashed line: DC group, Green dot–dashed line: DL group.
[Figure 12.12 panels: (a) Control Group, (b) Dyscalculic Group, (c) Dyslexic Group; rows 1–3 are the trial phases.]
Figure 12.12: Spatial Maps for Mental Arithmetic. The group–wise t–score maps on an inflated brain–surface are shown, with columns for the left lateral–posterior, left medial–posterior, right medial–posterior and right lateral–posterior views. Values t < 3 have been masked out for clarity and the color–map shows values ranging from t = 3 to t = 14. Each row shows the activation maps corresponding to the three phases within a single trial of the task.
12.5.3.4 Discussion
The average optimal model size K* and prediction error ERRSSM:PH for the three groups are shown in Table 12.5. Here, we notice that the variation in model–sizes for the DC group is larger than that for the controls, while that for the DL group is almost of the same order. This points to a greater heterogeneity in the DC data, necessitating models with different sizes. Also, the consistently higher error–rate of the DC population indicates the relative inaccuracy of the models for their mental processes, as compared to the other two groups.
        Control       DC            DL
K*      22.57±2.19    26.23±3.95    23.14±2.25
ERR     0.31±0.05     0.40±0.09     0.33±0.05
Table 12.5: Overall Results. The mean optimal model size K∗ and prediction error ERRSSM:PH (± 1 SEM) for the control, dyscalculic (DC) and dyslexic (DL) subjects. The chance-level error–rate for the data-set is ≈ 0.83, computed by permuting the stimuli with respect to the scans.
Similar to the results of the last chapter, these observations concur with the theory that not only are dyscalculics different from each other in their arithmetical strategies, but their lack of an intuitive notion of numerical size may be compensated for by shifting mental strategies, resulting in the poor fit of a single model per subject.
In Fig. 12.13, we show the effect of Ph, LogPs and LogDiff on the error ERRSSM:PH. Note that LogPs and LogDiff were not used to train the model, and therefore the influence of these parameters on the mental patterns of the subjects was effectively discovered by the method.
Figure 12.13: ERRSSM:PH with respect to Ph, LogPs and LogDiff. For each group, the overall error–rate (first bar) followed by the effect for LogPs (second bar) and LogDiff (third bar) are displayed with respect to trial phase Ph. The effects are calculated as the difference in ERRSSM:PH at the high minus that at the low level of the quantized LogPs and LogDiff. Error bars indicate ±1 SEM.
In order to measure the similarity between two models Mi = {Φi, θi} and Mj = {Φj, θj} trained on the data of two different subjects, the mutual information (MI) between the state–sequences of the two models was used. As in the previous chapter, for each fMRI session Y in our data–set the optimal state–sequence was computed using each of the 42 models as per the estimation algorithm of Section 12.3. The MI is derived from the joint histogram of X(i) and X(j), the optimal state sequences for the same fMRI data Y computed from the models Mi and Mj respectively. This procedure, applied to all \binom{42}{2} pairs of subjects, yields a pair-wise similarity matrix. These MI relationships can then be visualized in 2D using multidimensional scaling (MDS) [120], as shown in Fig. 12.14. The specification of the SSM in terms of abstract mental–states allows comparing the spatio–temporal patterns between subjects in their entirety in this abstract representation [115].
Please refer to Appendix E.4 for the computation of these error–rates and MI.
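As a concrete sketch of this computation, the MI between two state–sequences can be obtained from their joint histogram, and the resulting pairwise matrix embedded in 2D with classical (Torgerson) MDS. The function names below are illustrative; a practical pipeline would also convert the MI similarities into dissimilarities (e.g. d_ij = MI_ii + MI_jj − 2 MI_ij) before embedding:

```python
import numpy as np

def mutual_information(x, y, k):
    """MI (in nats) between two state sequences x, y over k states,
    computed from their joint histogram."""
    joint = np.zeros((k, k))
    for a, b in zip(x, y):
        joint[a, b] += 1
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0                                    # avoid log(0) on empty bins
    return float(np.sum(p[nz] * np.log(p[nz] / (px[:, None] * py[None, :])[nz])))

def classical_mds(d, dim=2):
    """Torgerson's classical MDS: embed a dissimilarity matrix in `dim` dims."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j                   # double-centred Gram matrix
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:dim]               # keep the top eigen-directions
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

For identical sequences the MI reduces to the entropy of the state distribution, and classical MDS exactly recovers configurations whose dissimilarities are Euclidean.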
[Figure legend: Control Male/Female, Dyslexic Male/Female, Dyscalculic Male/Female. Panels: (a) Overall; (b) Phase 1; (c) Phase 1: Product Size Effect; (d) Phase 2; (e) Phase 2: Problem Difficulty Effect.]
Figure 12.14: MDS Plots of the Mutual Information Between All Pairs of Subjects. Fig.(a) shows the MDS plot of the subjects based on their overall MI, while Figs.(b) and (d) show the relative arrangement of the subjects based on their MI during the first and second phases of the trial. The effects of product–size in Ph = 1 and problem–difficulty in Ph = 2 on the MI are plotted in Figs.(c) and (e).
Fig. 12.14(a) shows a clustering of subjects in the MDS space with respect to their group (control, DL or DC) along the vertical axis, while along the horizontal axis we see a slight, but not significant, organization dictated by gender. Since this labeling is applied after plotting all the subjects in the MDS space, an intrinsic organization in the spatio–temporal patterns of the subjects in each group has been identified. Interestingly, there are a few DC subjects that cluster along with the DL group, at the top of Fig. 12.14(a). This is not surprising, given that dyscalculia is often comorbid with dyslexia [162], and these DC subjects may exhibit dyslexic deficits during this task.
The separation between the MDS clusters for each group can be quantified using the Cramér test [9], which provides a non–parametric measure of the p–value of the distance between the means of two samples through a permutation method. The p–values of the group–wise differences are compiled in Table 12.6.
              Overall              Ph 1                 Ph 2                 Ph 3
Ctrl vs. DC   0.78 (+0.04,+0.01)   0.80 (+0.10,-0.02)   0.84 (+0.02,+0.06)   0.71 (+0.02,-0.03)
Ctrl vs. DL   0.74 (-0.02,+0.02)   0.86 (-0.01,-0.00)   0.73 (+0.01,+0.01)   0.65 (-0.00,+0.01)
DC vs. DL     0.77 (+0.03,-0.01)   0.79 (+0.07,+0.03)   0.85 (+0.01,+0.08)   0.72 (-0.01,+0.02)
Table 12.6: The separation between the three groups in the MDS plots, assessed using the Cramér non–parametric test. Tabulated are the p–values for the overall distance between the means of the groups in the MDS space, along with the Ph–wise changes of the p–values. Each column also includes the effects of LogPs and LogDiff on the p–value in brackets.
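The flavour of such a permutation test can be sketched as follows, using the energy-distance form of the two-sample statistic (the exact kernel of the Cramér test in [9] differs slightly; the names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_stat(a, b):
    """Energy-distance statistic between two samples of 2-D points."""
    d = lambda u, v: np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
    return 2 * d(a, b).mean() - d(a, a).mean() - d(b, b).mean()

def cramer_pvalue(a, b, n_perm=500):
    """Permutation p-value for the separation of two point clouds:
    shuffle group labels, recompute the statistic, count exceedances."""
    obs = energy_stat(a, b)
    pooled = np.vstack([a, b])
    m = len(a)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        count += energy_stat(pooled[idx[:m]], pooled[idx[m:]]) >= obs
    return (count + 1) / (n_perm + 1)   # add-one to avoid p = 0
```

Identical samples give a statistic of zero, and well-separated clouds yield small permutation p-values; the group-wise MDS coordinates of Fig. 12.14 would play the role of `a` and `b` here.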
From the results in Figs. 12.12, 12.13, 12.14 and Tables 12.5, 12.6 the following observations can be made.
Multiplication Phase. The error rate for the DL group is much higher than that for the controls (cf. Fig. 12.13). An increase in product–size causes a large (> 1.5 SEM) reduction in ERRSSM:PH for controls, while the effect for the DC and DL groups is less pronounced (> 1 SEM). Also, there is a clear separation between the DL and control groups in the MDS space, and product–size increases the separation between the DC and control groups. For the control subjects, high values are seen in the bilateral occipital extra-striate cortices, the left postcentral area, the left angular gyrus (lAG), the medial frontal gyri (MFG), and the left intra-parietal sulcus (IPS). The DC subjects show lower activation in the bilateral IPS, while the DL subjects show increased activation in their left fronto-parietal and left medial frontal gyral (lMFG) regions as compared to controls.
These results may be due to the greater difficulty and conflict experienced by the DL subjects and the multiplicity of mental strategies adopted during the reading phase of the task. The higher error of the DC subjects may be due to irregular patterns in accessing the verbally encoded rote multiplication tables located in the lAG. The reduction in error–rates of all subjects with increasing product–size may be due to increased organization of their mental processes as their multiplication memory is stressed, while the increased separation between the groups could indicate greater divergence of the mental patterns of the DC individuals from the controls.
Judgement Phase. ERRSSM:PH for the DC group increases drastically, while those for the DL and control groups match up. The DC subjects may experience difficulty in judging the difference between the sizes of the correct and incorrect results and may resort to a greater variety of mental strategies. Not surprisingly, as the reading phase of the experiment has ended, the patterns of the DL individuals begin to resemble those of the controls and the separation between these groups reduces in MDS space, while the separation of the DC group increases. The control and DL subjects exhibit high values in the left and right IPS, both pallida, the caudate heads (CdH), the left anterior insula (aIn), the lMFG, the supplementary motor area (SMA) and the left fronto-parietal operculum, while the map for the DC group activates in both aIn, both MFG, the left IPS, the anterior rostral cingulate zone (aRCZ), and the right supramarginal gyrus (SMG). Although LogDiff reduces the error–rate of the control and DL subjects, it has the opposite effect on the DC group, as increased conflict may recruit new functional circuits. The effect of LogPs is consistent with strong activation of the working verbal (mute rehearsal) and visual memories.
Third Phase. This phase involves decision–making and conflict–resolution and is highly variable between repetitions and subjects, causing increased inaccuracy during this phase.
Also, due to the self–paced nature of the task, it very often contained the button–press and inter–trial rest interval. The spatial–maps for the three groups show increased foci in the pre–frontal and motor areas. The left IPS region in the DC group is also strongly activated during this phase, which may point to irregular storage and retrieval of the number size using spatial attributes typically processed in this region.
12.6 Conclusion
In this chapter, we extended the state–space model of Chapter 11 to include information about the experimental task in order to guide the discovery of patterns. Efficient estimation algorithms using a variational formulation of generalized EM under the mean field approximation were developed and quantified with a simulation study. The HRF of the brain is known to be highly variable [139], and by using a spatially varying but unknown FIR filter, the state–space model (SSM) was able to compensate for this variability. Model hyper–parameters were selected in an automated fashion using a maximally predictive criterion. By selecting which stimulus to input to the SSM, the user is able to choose between data–driven and model–driven estimation of the parameters.
The hidden layers in the SSM decouple the stimulus from the data, and therefore neither does the stimulus need to be convolved with an HRF, nor does the exact mathematical relationship between the stimulus and the fMRI signal need to be specified. This allows flexibility in choosing which experimental variables to include and their encoding, without having to worry about statistical issues like the orthogonality of the experiment, the estimability of the design matrix and omitted variable bias. But classical issues like confounding variables will still affect inference and must be addressed through appropriate experimental designs.
As demonstrated by the mental arithmetic study, this method can be used with arbitrarily complex paradigms, where the investigator can decide which stimuli to provide as input, thereby choosing a trade–off between data–driven and model (i.e. stimulus) driven estimation of parameters. The effects of other un–modeled experimental variables on the model can then be tested post hoc. This is in contrast to supervised methods that cannot, by design, capture the effects of experimental variables against which they have not been modeled. However, with simple block design paradigms where the effect of hemodynamics and the temporal structure within a block are insignificant, we observed that MVPR classifiers tended to outperform the SSM in predicting the mental state. Also, its application to default–state and non–task related fMRI studies would require an alternative model–size selection procedure that does not use prediction error as a criterion, or a non–parametric formulation with an infinite number of states [15].
The SSM parameters are estimated through a fitting criterion and consequently have a well–defined interpretation implied by the underlying neurophysiological model. Here, prediction error is used as a statistic to select between models and to infer an effect of the experimental variables on the data, which implicitly involves selecting between alternative hypotheses [65]. For example, the ability to predict mental states at “much better than chance” levels adduces evidence against the null–hypothesis that the SSM does not explain the data. A similar argument applies for the predictability of experimental variables that were not included during the training of the SSM. The SSM, however, due to the lack of a parametric form of the null distribution of the prediction error and the prohibitively high cost of a non–parametric permutation test, cannot measure the confidence level (i.e. a p–value) in a hypothesis test.
Comparing brain–function in abstract representation spaces rather than in the spatial–maps directly has been shown to be a very powerful principle in psychology and neuroscience [115]. For example, Edelman et al. [54] discovered natural groupings within a representational space derived using MDS on the activation patterns under different task conditions and subjects. Here, the abstract state–space representation was used to compare the spatio–temporal signatures of mental processes in their entirety. Systematic differences in the cascades of recruitment of the functional modules between subject populations were shown, indicating the necessity of retaining the temporal dimension. The MDS plots derived from the MI between subject pairs enabled a succinct assessment of the relationships between the different groups with respect to experimental parameters. This ability to reveal and study the group–wise structure in the spatio–temporal patterns could guide the design of more specific experiments to test interesting effects.
Therefore, given its advantages and disadvantages with respect to other analysis methods, we believe that the SSM is a complementary tool in an investigator’s arsenal, providing new and different insight into mental processes.
EPILOGUE
Prediction is very difficult, especially about the future.
Niels Bohr (1885–1962).
In this thesis I have attempted to address the pressing need for solutions to the problem of studying the spatio–temporal patterns implied by mental processes from their metabolic traces recorded by functional magnetic resonance imaging (fMRI). In pursuit of this, I investigated two tracks: one, revealing the temporal ordering in the cascades of recruitment of the functional modules of the brain during the performance of a task (i.e. mental chronometry); and two, building a spatio–temporal representation of mental processes as a sequence of abstract brain–states, each having a spatially distributed signature of neural / metabolic activation.
The methods were developed and applied in the context of two studies, one for studying the development of visuo–spatial working memory in children and one for investigating the neural basis of arithmetical processing deficits.
In Part II, a visual analytic tool was developed to explore the chronoarchitecture of the brain using a semi–supervised clustering algorithm, followed by a statistical method to measure the timing differences between different regions of the brain. The visual tool identifies voxel–clusters of potential interest to the investigator and displays their time–series, in a paradigm reminiscent of the visual examination of EEG data. Then, a robust and efficient estimator for activation latency was developed using a general linear model (GLM) for univariate statistics.
Part III dealt with the creation of a phenomenological model of mental processes, with the brain transitioning through an abstract state–space as it performs a mental task. After an initial confirmation that fMRI data indeed contains the information necessary for such a representation, a distance metric that captured the notion of functional similarity between the activation patterns of two brain–states was defined and used to design a low–dimensional linear feature–space. Then, a spatio–temporal representation based on a hidden Markov model (HMM) formalism of brain function was proposed and an unsupervised estimation procedure based on Monte–Carlo sampling developed. The correct model–size was selected using a maximally predictive criterion that linked the results back to the experimental effects of interest. This method suffered from low accuracy due to the simplifying assumptions needed for a reasonable running–time and due to its unsupervised nature. These drawbacks were corrected by stabilizing the estimation procedure with information about the experimental task and by eliminating the simplifications. Computational efficiency was achieved through estimators that used a mean–field approximation.
The advantages of such a dynamical generative model over other approaches are four–fold:

(i) The fully specified generative model allows definitive neurophysiological interpretation of the parameters.

(ii) It allows comparing the spatio–temporal patterns of mental processes between subjects in their entirety, and not just their static activation maps where the temporal ordering of events is lost.

(iii) It can predict the cognitive state of the subject, not from single time–points but from the time–evolution of patterns in the data.

(iv) The abstraction in terms of brain–states can provide the ability to study dynamical characteristics of the data, such as periods and cycles, surprising or new events, and regime changes.
Some of the drawbacks, on the other hand, include computational complexity, the require- ment of having stimulus information or task labels for model–size selection, the lack of statistical measures of confidence and the need for spatial normalization of the data for inter–subject comparison.
Given the exponential increase of computational power, the widespread availability of high–performance and massively–parallel computing infrastructure and the amenability of many of the algorithms here to parallelization, computational complexity should not be a significant obstacle to the adoption of these methods.
We are currently refining a model–size selection strategy based on comparing model–evidences evaluated through cross–validation. This procedure should eliminate the need for experimental stimulus in model selection, while simultaneously avoiding the specification of model–complexity as required by other model selection techniques.
Unlike GLMs, the nature of the state–space model precludes closed–form expressions for the sampling distributions of the parameters, and hence parametric assessments of confidence. The high computational burden complicates the use of non–parametric methods such as permutation tests. To overcome this difficulty, we are working on a fully Bayesian version of this model that will provide posterior distributions for all the parameters.
One of the most important and vexing problems is the need for spatially normalizing the data of all the subjects into a common anatomical space for inter–subject analysis. As discussed in Section 2.3.4, this step has many fundamental problems, such as the ability to find correspondences of anatomical features between subjects and the validity of extending these to their functional correspondences. As the state–space models use the fMRI data projected into the low–dimensional feature–space obtained from their functional connectivity, it would be natural to perform registration in this feature–space. This might be achieved either through the registration of their functional networks, posed as a graph homomorphism problem, or through the estimation of a common feature–space using a hierarchical model for functional connectivity.
Another exciting avenue for future work is the integration of other modalities into this analysis methodology, such as the connectivity information of DTI to build a feature–space or incorporating the high temporal resolution offered by EEG to improve characterization of mental dynamics. Finally, of interest is the detection of the multiple sub–processes running in parallel that constitute the building blocks of human thought.
APPENDIX A
PROOFS FOR ACTIVATION ONSET LATENCY ESTIMATOR
Consider the case of a single stimulus function s(t), which yields the following regressors x^{(1)}(t) = s \star h(t) and x^{(2)}(t) = s \star \dot{h}(t). The resulting GLM is y = X\vec{\beta} + \epsilon, where X = [x^{(1)}\; x^{(2)}], \vec{\beta} = (\beta, \gamma)', and \epsilon \sim \mathcal{N}(0, \sigma^2\Sigma).
The Gauss–Markov estimator for \vec{\beta} is \hat{\vec{\beta}} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y, and its variance is \mathrm{Var}[\hat{\vec{\beta}}] = \sigma^2 (X'\Sigma^{-1}X)^{-1}.
Assuming \Sigma = I, and by observing that h(t) is orthogonal to \dot{h}(t), we see that the cross-covariance terms of \mathrm{Var}[\hat{\vec{\beta}}] are theoretically zero, indicating that \hat{\beta} and \hat{\gamma} are uncorrelated Gaussian variables. Therefore, E[\hat{\rho}] \approx E[\hat{\gamma}]\,E[1/\hat{\beta}]. Now, taking a first-order Taylor series expansion of 1/\hat{\beta} about \beta, and using the fact that \hat{\beta} is unbiased, we get:

\[ E[\hat{\rho}] = \rho\left(1 + \frac{\mathrm{Var}[\hat{\beta}]}{\beta^2}\right) = \rho\left(1 + \frac{1}{t_\beta^2}\right), \tag{A.0.1} \]

where t_\beta = \beta / \sqrt{\mathrm{Var}[\hat{\beta}]} is the t–score for \hat{\beta}. This expression for the bias in \hat{\rho} is used to derive the corrected estimate \hat{\rho}_{corr}. This correction shrinks the estimate of the delay when the t–score of \hat{\beta} is low, thereby also mitigating the resulting numerical instability of \hat{\rho}.
According to the model in eqn. (A.0.1), we get \hat{\tau} = f(\hat{\beta}, \hat{\gamma}), a non-linear function of \hat{\beta} and \hat{\gamma}. Applying a first-order Taylor expansion of f around the true values \beta and \gamma, and using the fact that the two estimates are unbiased, we get:

\[ \mathrm{Var}[\hat{\tau}] = \nabla f(\beta, \gamma)\, \mathrm{Var}[\hat{\vec{\beta}}]\, \nabla f(\beta, \gamma)', \quad \text{where } \nabla f(u, v) = \left(\frac{\partial f}{\partial u}, \frac{\partial f}{\partial v}\right). \tag{A.0.2} \]
APPENDIX B
FUNCTIONAL CONNECTIVITY ESTIMATION
B.1 Hierarchical Agglomerative Clustering
1  begin  // Initialization
2      For each voxel i, create one cluster c_i of size n_i = 1
3      Each c_i is associated with a time–course Y[i]
4  end
5  while the number of clusters is greater than the specified value do
6      Find two spatially adjacent clusters c_i and c_j and merge them into a new cluster c_k = (c_i, c_j), if and only if Var[c_k] is minimum over all i, j
7      Remove clusters c_i and c_j from the set of clusters, and add c_k
8  end
Algorithm B.1: Hierarchical Agglomerative Clustering
The time–series for the new cluster c_k is defined as Y[k] = \frac{1}{n_k}\sum_{c_i \in c_k} Y[i], and for a merged cluster c_k = (c_i, c_j) it can be efficiently updated as Y[k] = (n_i Y[i] + n_j Y[j]) / (n_i + n_j).
The variance of a cluster c_k is \mathrm{Var}[c_k] = \frac{1}{n_k T}\sum_{c_i \in c_k}\sum_{t=1}^{T}(Y_t[i] - Y_t[k])^2, and is efficiently updated through the variance separation theorem:

\[ \mathrm{Var}[c_k] = \frac{n_i\,\mathrm{Var}[c_i] + n_j\,\mathrm{Var}[c_j]}{n_i + n_j} + \frac{n_i \sum_{t=1}^{T}(Y_t[i] - Y_t[k])^2 + n_j \sum_{t=1}^{T}(Y_t[j] - Y_t[k])^2}{T\,(n_i + n_j)}. \]
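These running updates can be written directly in code; a minimal Python sketch (the class and function names are illustrative) that merges two clusters in O(T) using only their stored means and variances, without revisiting member voxels:

```python
import numpy as np

class Cluster:
    """Running mean time-course and within-cluster variance for one cluster."""
    def __init__(self, y):
        self.n = 1                               # number of member voxels
        self.mean = np.asarray(y, dtype=float)   # Y[k], the cluster time-course
        self.var = 0.0                           # a singleton has zero variance

def merge(ci, cj):
    """Merge two clusters, updating mean and variance via the
    variance-separation identity (within-cluster + between-mean terms)."""
    ck = Cluster(np.zeros_like(ci.mean))
    ck.n = ci.n + cj.n
    ck.mean = (ci.n * ci.mean + cj.n * cj.mean) / ck.n
    T = len(ck.mean)
    within = (ci.n * ci.var + cj.n * cj.var) / ck.n
    between = (ci.n * np.sum((ci.mean - ck.mean) ** 2)
               + cj.n * np.sum((cj.mean - ck.mean) ** 2)) / (T * ck.n)
    ck.var = within + between
    return ck
```

The merged variance agrees with a brute-force recomputation over all member voxels, which is what makes the greedy minimum-variance merge of Algorithm B.1 tractable.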
After hierarchical clustering, the covariance \sigma[k_1, k_2] \triangleq \mathrm{Cov}[c_{k_1}, c_{k_2}] between two clusters c_{k_1} and c_{k_2} is estimated as:

\[ \sigma[k_1, k_2] = \frac{1}{T}\sum_{t=1}^{T} Y_t[k_1]\,Y_t[k_2] - \left(\frac{1}{T}\sum_{t} Y_t[k_1]\right)\left(\frac{1}{T}\sum_{t} Y_t[k_2]\right). \]
B.2 Shrinkage
The regularized estimate of the covariance is computed using an adaptive soft shrinkage estimator [186]: