
Spatio-Temporal Representations and Analysis of Brain Function from fMRI

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Firdaus Janoos, B.E., M.S.

Graduate Program in Computer Science and Engineering

*****

The Ohio State University

2011

Dissertation Committee:

Prof. Raghu Machiraju, Ph.D., Adviser
Prof. Steffen Sammet, M.D., Ph.D.
Dr. István Ákos Mórocz, M.D., Ph.D.
Prof. Michael V. Knopp, M.D., Ph.D.
Prof. Lee Potter, Ph.D.

© Copyright by

Firdaus Janoos

2011

ABSTRACT

Understanding the highly complex, spatially distributed and temporally organized phenomena entailed by mental processes using functional MRI is an important research problem in cognitive and clinical neuroscience. Classically, the analysis of functional Magnetic Resonance Imaging (fMRI) has focused either on the creation of static maps localizing the metabolic fingerprints of neural processes or on studying their temporal evolution in a few pre-selected regions of the human brain. However, it is widely acknowledged that cognition recruits the entire brain and that the underlying mental processes are fundamentally spatio-temporal in nature. By neglecting either the temporal dimension or the spatial entirety of brain function, such methods must necessarily compromise on extracting and representing all the information contained in the data.

In this thesis, I present new paradigms and an accompanying suite of tools to facilitate a time-resolved exploration of mental processes as captured by fMRI. The first part of the thesis describes a method for visualizing the metabolic activity recorded in the data and a method for studying the timing differences in the recruitment of the different functional modules during a task. In the next part, a state-space formalism is used to model the brain transitioning through a sequence of mental states as it solves a task, enabling the study of the spatial distribution of activity along with its temporal structure. Efficient algorithms for estimating the parameters, the state-sequence and the hemodynamic behavior of the brain have been developed. In addition to revealing the mental patterns of an individual subject, such a generative model enables comparing mental processes between subjects in their entirety, not just as spatial activation maps.

The methods developed here were applied to fMRI studies of developmental disorders such as dyslexia and dyscalculia (i.e. math learning disability) and of visuo-spatial working memory. I show the types of inferences these methods make possible in analyzing and differentiating mental capabilities, and the neuro-scientific conclusions that they provide.

This thesis is dedicated to my parents for their unconditional love and unswerving support

during this long and sometimes arduous journey.

ACKNOWLEDGMENTS

I would like to thank Raghu for his many years of guidance – intellectual and philosophical – his pragmatic wisdom, his forbearance, his friendship and his genuine solicitude. I am grateful to Steffen for sharing his deep knowledge of MRI and radiology and for his infectious and friendly spirit. I owe special gratitude to Pisti for continuously reminding me to keep the big picture in mind and not to develop algorithms for algorithms' sake, for his wild but brilliant visions, and most of all for his warmth and humanism.

Getting through graduate school would have been harder if it were not for the support of my friends, who are too many to list here. Among these, I am especially thankful to

Kishore, Okan and Shantanu for sharing this experience with me. Most of all, I have to thank Zeenat for giving me the impetus to finally graduate.

VITA

2001 ...... B.E. Computer Science, University of Pune, India

2009 ...... M.Sc. Computer Science, Ohio State University, USA

2009–present ...... PhD Candidate, Ohio State University, USA

PUBLICATIONS

Research Publications

F. Janoos, R. Machiraju and I.Á. Mórocz, “Spatio-temporal Models of Cognitive Processes with fMRI,” NeuroImage, In review.

T.K. Dey, F. Janoos and J.A. Levine, “Meshing interfaces of multi-label data with Delaunay refinement,” Engineering with Computers, In review.

F. Janoos, R. Machiraju, S. Sammet, M.V. Knopp and I.Á. Mórocz, “Unsupervised Learning of Brain States from fMRI Data,” Proceedings of the 13th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Vol. 6362, 201–208, 2010.

F. Janoos, M.O. Irfanoglu, O. Afacan, R. Machiraju, S.K. Warfield, L.L. Wald and I.Á. Mórocz, “Brain State Identification from fMRI Using Unsupervised Learning,” Proceedings of the 16th Annual Meeting of the Organization for Human Brain Mapping (OHBM), 2010.

O. Afacan, D.H. Brooks, F. Janoos, W.S. Hoge and I.Á. Mórocz, “Multi-shot high-speed 3D-EPI fMRI using GRAPPA and UNFOLD,” Proceedings of the 16th Annual Meeting of the Organization for Human Brain Mapping (OHBM), 2010.

C. Lehr, M.O. Irfanoglu, F. Janoos, M.V. Knopp and S. Sammet, “Disease Progression in Multiple Sclerosis: Correlations to Diffusion Tensor Imaging Features,” Proceedings of the 18th Annual Meeting of The International Society for Magnetic Resonance in Medicine (ISMRM), 2010.

M.O. Irfanoglu, R. Machiraju, F. Janoos, M.V. Knopp and S. Sammet, “Effect of Gradient Resolution in Diffusion Tensor Imaging on the Appearance of Multiple Sclerosis Lesions at 3T,” Proceedings of the 18th Annual Meeting of The International Society for Magnetic Resonance in Medicine (ISMRM), 2010.

F. Janoos, R. Machiraju, S. Sammet, M.V. Knopp, S.K. Warfield and I.Á. Mórocz, “Measuring Effects of Latency in Brain Activity with fMRI,” Proceedings of IEEE Symposium on Bio-medical Imaging (ISBI), 2010.

K. Mosaliganti, F. Janoos, A. Gelas, R. Noche, N. Obholzer, R. Machiraju and S. Megason, “Anisotropic Plate Diffusion Filtering for Detection of Cell Membranes in 3D Microscopy Images,” Proceedings of IEEE Symposium on Bio-medical Imaging (ISBI), 2010.

F. Janoos, B. Nouansengsy, R. Machiraju, H.W. Shen, S. Sammet, M. Knopp and I.Á. Mórocz, “Visual Analysis of Brain Activity from fMRI Data,” Computer Graphics Forum, Vol. 28(3), 903–910, June 2009.

K. Mosaliganti, F. Janoos, O. Irfanoglu, R. Ridgway, R. Machiraju, K. Huang, J. Saltz, G. Leone and M. Ostrowski, “Tensor Classification of N-point Correlation Function Features for Histology Tissue Segmentation,” Medical Image Analysis, Vol. 13(1), 156–166, Feb. 2009.

F. Janoos, K. Mosaliganti, X. Xu, R. Machiraju and S.T.C. Wong, “Robust 3D Reconstruction and Identification of Dendritic Spines from Optical Microscopy Imaging,” Medical Image Analysis, Vol. 13(1), 167–179, Feb. 2009.

F. Janoos, B. Nouansengsy, X. Xu, R. Machiraju, K. Huang and S.T.C. Wong, “Classification and Uncertainty Visualization of Dendritic Spines from Optical Microscopy Imaging,” Computer Graphics Forum, Vol. 27(3), 879–886, Sep. 2008.

K. Mosaliganti, F. Janoos, R. Sharp, R. Ridgway, R. Machiraju, K. Huang, P. Wenzel, A. de Bruin, G. Leone and J. Saltz, “Detection and Visualization of Surface-Pockets to enable Phenotyping Studies,” IEEE Transactions on Medical Imaging, Vol. 26(9), 1283–1290, Sep. 2007.

K. Mosaliganti, J. Chen, F. Janoos, R. Machiraju, W. Xia, X. Xu and K. Huang, “Automated Quantification of Colony Growth in Clonogenic Assays,” Proceedings of Medical Image Analysis with Applications in Biology (MIAAB), 2007.

F. Janoos, S. Singh, O. Irfanoglu, R. Machiraju and R. Parent, “Activity Analysis Using Spatio-Temporal Trajectory Volumes in Surveillance Applications,” Proceedings of IEEE Symposium on Visual Analytics Science and Technology (VAST), 3–10, Nov. 2007.

F. Janoos, O. Irfanoglu, K. Mosaliganti, R. Machiraju, K. Huang, P. Wenzel, A. de Bruin and G. Leone, “Histology Image Segmentation using the N-Point Correlation Functions,” Proceedings of IEEE Symposium on Biomedical Imaging (ISBI), 300–303, Apr. 2007.

K. Mosaliganti, F. Janoos, X. Xu, R. Machiraju, K. Huang and S.T.C. Wong, “Temporal Matching of Dendritic Spines in Confocal Microscopy Images of Neuronal Tissue Sections,” Proceedings of the 9th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 106–113, 2006.

F. Janoos, R. Machiraju, R. Parent, J.W. Davis and A. Murray, “Sensor Orientation for Coverage Optimization for Surveillance Applications,” Proceedings of IS&T/SPIE Symposium on Electronic Imaging, Vol. 6491, 1–12, Jan. 2007.

Instructional Publications

F. Janoos, R. Machiraju, S. Singh and I.Á. Mórocz, “Spatio-temporal Representations and Decoding Cognitive Processes from fMRI,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR19, 2010.

F. Janoos, R. Machiraju, S. Sammet, I.Á. Mórocz, M.V. Knopp and S.K. Warfield, “Linear Models for fMRI with Varying Hemodynamics,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR20, 2010.

F. Janoos, O. Irfanoglu, R. Machiraju and I.Á. Mórocz, “Visualizing Brain Activity from fMRI Data,” Ohio State Univ. Tech. Report OSU-CISRC-9/10-TR21, 2010.

T.K. Dey, F. Janoos and J.A. Levine, “Meshing interfaces of multi-label data with Delaunay refinement,” Ohio State Univ. Tech. Report OSU-CISRC-8/09-TR40, 2008.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in:

Medical Image Analysis: Prof. Raghu Machiraju
Machine Learning: Prof. Yoonkyung Lee
Computer Graphics: Prof. Tamal K. Dey

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... xvii

List of Figures ...... xviii

List of Algorithms ...... xxi

Introduction ...... 1

1 Background of Problem ...... 1
2 Research Statement ...... 5
3 Outline of Solution ...... 6
4 Organization of Thesis ...... 8

I Background 10

Chapters:

1. Background: fMRI Principles ...... 11

1.1 Nuclear Magnetic Resonance ...... 12
1.2 Magnetic Resonance Imaging ...... 14
1.3 Functional Magnetic Resonance Imaging ...... 16
1.3.1 The BOLD Contrast ...... 17

1.3.2 Relationship between BOLD and Physiology ...... 18
1.3.3 fMRI Noise ...... 20

2. Background: fMRI Methods ...... 22

2.1 Neuroscience Principles ...... 23
2.2 fMRI Methods Taxonomy ...... 25
2.3 Pre-processing ...... 25
2.3.1 Motion Correction ...... 25
2.3.2 Distortion Correction ...... 27
2.3.3 De-noising and Drift Estimation ...... 27
2.3.4 Inter–subject Registration ...... 28
2.3.5 Hemodynamic Response Modeling and Estimation ...... 29
2.4 Functional Specialization ...... 29
2.4.1 Unsupervised ...... 30
2.4.2 Supervised ...... 31
2.4.3 Semi-supervised ...... 35
2.5 Functional Integration ...... 37
2.5.1 Functional Connectivity ...... 37
2.5.2 Effective Connectivity ...... 38
2.6 Functional Representation ...... 40
2.6.1 Multivariate Pattern Recognition ...... 41
2.6.2 Multivariate Linear Models ...... 44
2.7 Summary ...... 44

3. Background: Neuroscientific Setting ...... 45

3.1 Visuo–Spatial Working Memory ...... 46
3.1.1 Data-set ...... 47
3.2 Mental Arithmetic ...... 48
3.2.1 Dyscalculia ...... 50
3.2.2 Dyslexia ...... 51
3.2.3 Data-set ...... 53

II Functional Chronometry 56

4. Mental Chronometry: Theory ...... 57

4.1 Significance ...... 57
4.2 Mental Chronometry with BOLD ...... 59
4.3 Chronometry Methods ...... 62

5. Mental Chronometry: A Visual Analysis ...... 64

5.1 Outline of Solution ...... 65
5.1.1 Challenges ...... 65
5.1.2 Proposed Solution ...... 67
5.2 Related Work ...... 68
5.3 VOI Selection ...... 70
5.3.1 Distance Metric ...... 71
5.3.2 Hierarchical Clustering ...... 76
5.4 User Interaction ...... 78
5.5 Results ...... 80
5.6 Conclusion ...... 81

6. Mental Chronometry: Measuring Latency in Brain Activity ...... 86

6.1 Outline of Solution ...... 87
6.1.1 Motivation ...... 87
6.1.2 Proposed Solution ...... 87
6.2 Method ...... 88
6.2.1 Robust Estimation of Latency ...... 89
6.2.2 Parametric Effects with Factorial Designs ...... 90
6.2.3 Hyper-parameter Selection ...... 92
6.3 Results ...... 93
6.3.1 Simulated Data ...... 93
6.3.2 Latency Analysis for VSWM Task ...... 94
6.4 Conclusion ...... 99

III Spatio-temporal Representations for Cognitive Processes 100

7. Spatio-temporal Representations: Theory ...... 101

7.1 Functional Representation ...... 102
7.2 Supervised, Unsupervised and Semi-supervised ...... 105
7.3 Pattern Recognition vs. Linear Models ...... 106
7.3.1 Multivariate Pattern Recognition (MVPR) ...... 106
7.3.2 Multivariate Linear Models (MVLM) ...... 107
7.4 Motivation ...... 108

8. Brain–States: Investigating Spatio–Temporal Patterns in the Data ...... 111

8.1 Inspiration ...... 112

8.2 Method ...... 113
8.3 Results ...... 115
8.4 Conclusion ...... 115

9. Brain–States: The Notion of Functional Distance ...... 118

9.1 Functional Distance ...... 119
9.2 Functional Networks ...... 121
9.3 Computing the Functional Distance ...... 123
9.4 The Diffusion Distance ...... 127
9.4.1 Hierarchical Clustering ...... 129
9.5 Results ...... 129
9.6 Conclusion ...... 133

10. Feature–Space: A Linear Embedding of the Functional Distance ...... 135

10.1 Motivation ...... 137
10.1.1 Other Feature–Spaces in fMRI ...... 138
10.2 Feature–Space ...... 139
10.2.1 Cost–Function ...... 139
10.2.2 Orthogonal Basis for the Feature–Space ...... 140
10.3 Linear Approximation for Functional Distance ...... 142
10.4 Feature Selection ...... 145
10.5 Evaluation of Feature–Space ...... 146
10.6 Conclusion ...... 149

11. State–Space Models : Towards a Spatio-Temporal Representation ...... 151

11.1 Related Work ...... 153
11.2 State–space Model ...... 154
11.3 Model Estimation ...... 159
11.3.1 Parameter Estimation ...... 159
11.3.2 HRF Marginalization ...... 161
11.3.3 Optimal State–Sequence Estimation ...... 162
11.4 Model Size Selection ...... 162
11.5 Results ...... 164
11.5.1 Simulation ...... 164
11.5.2 Mental Arithmetic ...... 167
11.6 Conclusion ...... 174

12. State–Space Models : A Semi–Supervised Approach ...... 177

12.1 The State–Space Model ...... 182

12.1.1 Feature–Space Transform ...... 185
12.2 Parameter Estimation ...... 185
12.2.1 E-Step ...... 186
12.2.2 M-Step ...... 188
12.2.3 Spatial Activation Maps ...... 194
12.3 Estimating the Optimal State–Sequence ...... 194
12.4 Model Hyper-Parameter Selection ...... 196
12.5 Results ...... 198
12.5.1 Simulation ...... 199
12.5.2 Data-Set 1: Visuo–Spatial Motor Task ...... 203
12.5.3 Data-Set 2: Mental Arithmetical Task ...... 210
12.6 Conclusion ...... 225

Epilogue ...... 229

Appendices: 233

A. Proofs for Activation Onset Latency Estimator ...... 233

B. Functional Connectivity Estimation ...... 235

B.1 Hierarchical Agglomerative Clustering ...... 235
B.2 Shrinkage ...... 236
B.3 -wise Correlations ...... 237

C. Construction of the Feature–Space ...... 238

C.1 Orthogonal Partitioning of F ...... 238
C.2 Primal and Dual Formulations ...... 239
C.2.1 Augmented Formulation ...... 240
C.3 Proofs for the Linear Approximation ...... 241

D. Proofs for Unsupervised State–Space Model ...... 246

D.1 Expectation Maximization ...... 247
D.2 Forward Backward Recursions ...... 250
D.3 Marginalizing the HRF Filter h ...... 253
D.4 State–Sequence Estimation ...... 254
D.5 Estimation in Full vs. Reduced Models ...... 255
D.6 Mutual Information ...... 256

E. Proofs for Semi–Supervised State–Space Model ...... 257

E.1 Proofs for the E-Step ...... 257
E.2 Proofs for the M-Step ...... 261
E.2.1 Estimating State Transition Parameters w and Missing Stimulus u ...... 261
E.2.2 Estimating Emission Parameters ϑ ...... 265
E.2.3 Estimating Hemodynamic and Noise Parameters h, Σ ...... 267
E.3 Estimating the Optimal State–Sequence x∗ ...... 269
E.4 Error–rate and Mutual Information ...... 270

References 271

LIST OF TABLES

Table Page

9.1 Effect of Gaussian Smoothing on Number of Clusters ...... 123

10.1 Notation for Linear Embedding ...... 136

11.1 Notation for Unsupervised State–Space Model ...... 158

11.2 Effect of SNR on Running– and Error–Rates ...... 166

11.3 Results for Mental Arithmetic Data–set ...... 170

12.1 Notation for Semi–Supervised State–Space Model ...... 181

12.2 Effect of SNR on Estimation Error ...... 202

12.3 Effect of Shrinkage on Estimation and Prediction Errors ...... 203

12.4 Prediction Error versus Different HRF Models ...... 217

12.5 Overall Results for Mental Arithmetic Data-Set ...... 220

12.6 Assessment of Separation between Groups for Mental Arithmetic Task . . . 223

LIST OF FIGURES

Figure Page

1.1 Precession of the Nuclear Magnetic Moment ...... 13

1.2 RF Excitation ...... 14

1.3 MR k-Space Acquisition ...... 16

1.4 BOLD Response ...... 19

2.1 Taxonomy of fMRI Methods ...... 26

2.2 GLM Analysis Pipeline ...... 34

2.3 Dynamic Causal Models ...... 39

3.1 Visuo–Spatial Working Memory Task ...... 48

3.2 Mental Arithmetic Task ...... 54

4.1 Cascade of Functional Recruitment with fMRI ...... 61

5.1 Overview of Visual Analytics Tool ...... 68

5.2 The Raw fMRI Time–Courses ...... 72

5.3 HR Curves and SVD Spectrum ...... 76

5.4 User Interface of the Visual Analytics Tool ...... 83

5.5 Visual Confirmation of SPM Results ...... 84

5.6 Visual Analysis of Recruitment Cascade ...... 85

6.1 MSE of Regularized Estimator ...... 94

6.2 Quantitative Evaluation of Parametric Effects ...... 95

6.3 Activation Maps for VSWM Task ...... 96

6.4 Parametric Effect on Latency ...... 97

6.5 Parametric Effects of Latency and Amplitude ...... 98

7.1 Representation of Visual Categories in the Ventral Temporal Cortex . . . . 103

7.2 Predicting Spatial Maps for Given Stimulus Words ...... 104

8.1 EEG Microstates ...... 112

8.2 Brain–State Labels ...... 116

9.1 Conceptual Illustration of the Functional Distance ...... 122

9.2 Regularized Correlation Estimator ...... 124

9.3 Functional Distance Approximation Error ...... 130

9.4 Brain–State Sequences ...... 131

9.5 Spatial Activation Maps of Brain–States ...... 132

10.1 Computation of Feature–Space ...... 147

10.2 Dimensionality Reduction ...... 148

10.3 Approximation Quality ...... 149

10.4 Cross-Correlations in Feature–Space ...... 150

11.1 The Full state–space Model (Unsupervised) ...... 154

11.2 The Reduced State–Space Model (Unsupervised) ...... 157

11.3 Computation of Pair–Wise MI ...... 169

11.4 Phase–wise Error Rates and MI ...... 171

11.5 Multidimensional Scaling (MDS) Plots for Mental Arithmetic Study . . . . 173

11.6 Spatial Maps for Mental Arithmetic Task ...... 176

12.1 Outline of the Method for the Semi–supervised State–Space Model. . . . . 180

12.2 The Semi–Supervised State–Space Model ...... 183

12.3 Simulation Results ...... 201

12.4 Spatial Maps for Visuo-Motor Task ...... 205

12.5 Error Rates for Visuo-Motor Task ...... 206

12.6 Brain–State Probabilities ...... 208

12.7 Effect of Model–Size on Prediction Error ...... 209

12.8 Estimated Hemodynamic Response ...... 210

12.9 Comparative Analysis for Mental Arithmetic Data-Set ...... 213

12.10 Effect of Hyper–Parameters on ERR_SSM:PH ...... 215

12.11 Estimated HRF FIR filter for Mental Arithmetic Data-Set ...... 218

12.12 Spatial Maps for Mental Arithmetic ...... 219

12.13 Prediction Error for Mental Arithmetic Task ...... 221

12.14 Multidimensional Scaling (MDS) Plots for Mental Arithmetic Study ...... 222

LIST OF ALGORITHMS

5.1 Hierarchical Clustering Algorithm ...... 77

5.2 Octree Clustering Algorithm ...... 78

9.1 Recursive Approximation of the Functional Distance ...... 125

10.1 Construction of Orthogonal Basis Functions ...... 141

B.1 Hierarchical Agglomerative Clustering ...... 235

INTRODUCTION

Begin at the beginning and go on till you come to the end, then stop.

Lewis Carroll (1832–1898), Alice’s Adventures in Wonderland.

This thesis describes a new paradigm for analyzing fMRI data that preserves the information contained in the temporal dimension of mental processes. The Introduction opens by presenting the larger scientific context of the research problem in Section 1, followed by the specific research statement in Section 2 and the proposed solutions in Section 3. Finally, the organization of the rest of this book is laid out in Section 4.

1 Background of Problem

Even though the human mind has been the subject of philosophical and scientific study throughout the ages, many aspects of brain function are still unknown. Memory, information processing and decision making are only a few of the cognitive capacities intensively investigated in today's neuroscience, which seeks to understand the relationship between the mind and the brain. Our cognitive capacities rely on synergistic activities of large neural populations distributed throughout the brain. Therefore understanding the mind not only requires a comprehension of the workings of low–level neural networks but also demands a detailed map of the brain's functional architecture, a description of the large–scale connections between populations of neurons, and insights into how relations between these simpler networks give rise to higher–level cognition [111].

Until recently, systems neuroscience relied on the single micro-electrode technique to measure the action potentials produced by an isolated neuron or a small assembly of neurons [78]. Although very useful in characterizing the small–scale behavior of individual neural networks, the method clearly falls short of providing information on spatio–temporal cooperativeness and on the global, associational operations performed by these neural networks. The development of non-invasive neuroimaging technologies opened new possibilities to study brain physiology in vivo at this macroscopic / integrationist level.

Since its development in 1992, functional magnetic resonance imaging (fMRI) has become one of the most popular functional neuroimaging tools due to its unprecedented spatial resolution at temporal resolutions measured in seconds. The method is based on nuclear magnetic resonance (NMR) signal changes due to hemodynamic and metabolic responses at the sites of neural synaptic activity induced by external and internal stimuli to the brain [139]. Inspired by the concepts of functional specialization and functional integration [69], the primary application of fMRI has been the localization of the different functional domains of the human perceptual and cognitive apparatus in the brain and the elucidation of inter–connectivity structures [107]. In addition to these, an emergent theme has been to reveal the representation, i.e. encoding, of the mental percepts, affects and concepts of the subject in the recorded data, with a view to understanding the "neural code" [169]. Multivariate pattern analysis / pattern recognition (MVPA / MVPR) has been widely used to explore these encodings of different mental phenomena in the spatially distributed patterns of metabolic activity recorded by fMRI [176].

Methods for fMRI analysis have concentrated on creating static pictures of the foci of activity, of the interconnected networks, or of the distributed neural representations. This paradigm of studying brain physiology is starkly inadequate to explain the fundamentally dynamic relationships between the neural substrates involved in mental processes [33], which are spatially distributed, temporally transient and occur at multiple scales of space and time.

One classical argument for this static approach is that the hemodynamic response, which has a lag of 4–12s and which depends on the complex interaction of several metabolic and vascular parameters [140], is orders of magnitude slower than neural events that occur in the 10–100ms time range. However, there are many high–level mental processes, such as attending, learning and task–solving, along with oscillations in the default–state networks, that occur at time scales accessible to modern fMRI [62]. In addition, there is increasing evidence that the hemodynamic response may not be as sluggish as long assumed: capillaries and arterioles change their diameter within a fraction of a second following neuronal activity in their neighbourhood [175]. These rapid effects may now be measurable with the advent of high–speed fMRI with volume acquisition times < 1s [2]. The argument for static analysis methods is becoming increasingly untenable.

One important application of a time–resolved analysis is to understand the ordering of information processing in the various functional nodes, in order to better understand their function in terms of an anatomical wiring diagram [88]. There are a few methods reported in the literature that recognize the importance of temporality in fMRI [59, 61, 98, 156, 164, 204]. However, they make significant compromises in terms of the spatio–temporal scales and physiological assumptions under which these phenomena are investigated.

For example, methods for mental chronometry (cf. Chapter 4) measure the relative timing of activity at each voxel independently, neglecting the interactions between them and thereby reducing fMRI to a collection of single–unit recordings [61, 103, 156, 190]. More sophisticated models treating the brain response as a collection of processes have also been developed, but again resort to single–voxel analysis [59, 98]. On the other hand, methods to study the dynamics of interaction between regions, such as dynamic causal models (DCM) and dynamic Bayesian networks (DBN) (cf. Section 2.5), suffer from enormous computational complexity and have to be restricted to a few (∼ 10) pre–selected regions–of–interest (ROI) [204, 227]. While this approach is feasible for studying low–level functions such as early–stage visual perception, it does not scale well towards revealing the integration of the multiplicity of functional modules involved in cognition. Multivariate methods that attempt spatio–temporal analysis – by simultaneously representing the spatial interactions and temporal evolution – are fundamentally limited to block design experiments where the subjects are exposed to a very simple and limited set of alternatives [159, 164], which are not representative of natural perception and may fail to engage the higher–level aspects of human cognition.

This inability of fMRI analysis methods to reveal the role of time is becoming increasingly relevant in the study of neurological and psychiatric disorders like dementia, schizophrenia, autism, multiple sclerosis, etc. [28, 37], or common learning disabilities like dyscalculia or dyslexia [181], where group-level inference of spatial activity–maps has been inconclusive due to the lack of clear differences between populations. This is mainly because inter–subject comparisons are usually made by spatially normalizing the data to a common anatomical space. However, there are fundamental differences between the anatomies of subjects, and they cannot be easily aligned into a single coordinate system [207]. This is especially the case in the presence of neuropathologies, where significant anatomical changes confound the determination of functional differences [211]. Secondly, the mapping between structure and function is not rigid across subjects, and moreover the low spatial resolution of fMRI precludes the possibility of finding accurate spatial correspondences. Furthermore, there is the acknowledgement of the neurophysiological fact that the similarities and differences between the brain function of subjects may not reside in the spatial layout of the mental activity but more so in its temporal organization [88, 92].

2 Research Statement

In this thesis, I shall present methodological contributions that enable investigation of the following neurophysiological questions:

Mental Chronometry: Is the chronoarchitecture [10] of the cerebral cortex organized with respect to its functional architecture? Can this organization be identified and tested for specific hypotheses? Can it account for the fact that the hemodynamic response of the brain is unknown and is spatially and temporally non–stationary?

Spatio-temporal Representations: Do fMRI data contain information about mental processes? Is there a spatio–temporal representation of this information, at a multiplicity of scales of space and time, that can be used to enrich the study of brain function under differing mental tasks and mental deficits? Can this representation be arrived at in a computationally tractable fashion?

3 Outline of Solution

In order to study the chronoarchitecture of the human brain, I developed two complementary approaches:

i An exploratory method to aid in the examination of the time–series data and its temporal characteristics with a minimal amount of pre-processing and data manipulation, as an analog to the visual inspection of EEG and MEG recordings for "interesting events" [8, 78]. This is posed as a volume-of-interest (VOI) selection problem, where the method automatically determines a set of candidate VOIs that exhibit coherent activity with respect to the experimental task. The investigator is then able to select and compare the time–series data of VOIs that are relevant to the task and its expected neural recruitments [105].

ii A confirmatory method based on a general linear model (GLM) of the fMRI signal that estimates the experimental effects not only on the amplitude of the hemodynamic response but also on its latency. The result of this method is a map of activation latency under different task conditions, similar to that of activation amplitude [103]. This method improves on alternative approaches by proposing a more stable estimator of latency and making allowances for spatially and temporally variable hemodynamics.

While these approaches are complementary to classical methods of functional localization, they only partially capture the spatio–temporal signatures of neural activity. To arrive at a more descriptive representation of spatially varying and temporally recurring neural phenomena, the paradigm of the functional brain stepping through a mental state–space as it performs a task was invoked. This concept of fMRI as a recording of a sequence of "brain–states" was developed and refined through the following iterations:

i In an initial exploration of this concept, I used an ICA decomposition with hierarchical clustering to demonstrate that the fMRI data were organized, intrinsically, with respect to the mental states of the subject [101].

ii After this initial proof–of–concept, I proposed a measure of the functional distance between two fMRI volumes of the same subject that quantified the similarity of their activation patterns by the "transport" of activity over the functional networks of the brain [102].

iii I then modeled the functional brain transitioning through a mental state–space by a multivariate hidden Markov chain. Parameters were estimated from the fMRI data without reference to experimental conditions using Monte–Carlo methods [104]. The model not only discovered the spatial distribution and timing of neural recruitment in each subject, but moreover was able to contrast the spatio–temporal patterns arising from the underlying neural processes between two populations.

iv The model was augmented to allow a partial set of stimuli to be given as input, resulting in a semi–supervised framework. Its motivation was to let the stimuli guide the estimation towards a model more relevant to the investigator in a non-convex optimization landscape marred by multiple local minima, without precluding discovery of new and un–hypothesized patterns in the data.

In contrast to other methods, these approaches are not restricted to pre–selected VOIs but instead operate on the data for the entire brain. Also, they can be used to study arbitrarily complex paradigms and higher–level cognitive functions. These methods were developed and applied in the context of larger neuroscientific investigations into the working of visuo–spatial memory [30] in young children and into arithmetical deficits in individuals suffering from dyscalculia and dyslexia.

4 Organization of Thesis

This book is organized into three self–contained parts.

Part I sets the context for the research problems dealt with in this thesis. Chapter 1 provides a brief background on the principles behind NMR, MRI, and blood oxygenation level dependent (BOLD) fMRI. Then in Chapter 2, a taxonomy of fMRI analysis methods is presented. Finally, Chapter 3 gives a brief background on the neuroscientific problems addressed in the course of this dissertation.

Part II deals with the solutions for the problem of mental chronometry, which is defined in Chapter 4. Then in Chapter 5, I talk about the exploratory visual analysis of the temporal dimension of brain activity, followed by the GLM-based estimation of activation latency in Chapter 6.

Then in Part III, I present the spatio–temporal models of mental processes via the abstraction of brain–states. Chapter 7 provides a discussion of the concepts behind neural representation and decoding, the need for spatio–temporal models, and the salient issues concerning supervised vs. unsupervised vs. semi–supervised analysis techniques. After a report of the initial investigations into the information contained in fMRI about brain–states in Chapter 8, I describe the development of the distance measure in Chapter 9 and the low–dimensional linear embedding of the data based on this metric in Chapter 10. The unsupervised state–space formalism is presented in Chapter 11, followed by the semi–supervised method in Chapter 12.

Finally, the Epilogue concludes this book with a summary of the research presented here and shares some thoughts on future directions.

PART I

Background

CHAPTER 1

BACKGROUND: FMRI PRINCIPLES

In this age of specialization, men who thoroughly know one field are often incompetent to discuss another. The great problems of the relations between one and another aspect of human activity have for this reason been discussed less and less in public.

Richard Feynman (1918–1988), Remarks at a Caltech YMCA lunch forum, 1956.

The physical principle behind fMRI is nuclear magnetic resonance (NMR), which depends on the Zeeman effect [224]. The physics underlying this phenomenon were refined in the 1920s and 1930s, and practical instruments for measuring it were developed by Felix Bloch and Edward Purcell with coworkers in the 1940s. For these developments, Bloch and Purcell shared the Nobel Prize in Physics in 1952. In work pioneered by Paul Lauterbur in the 1970s, methods for generating tomographic images of objects based on the NMR phenomenon were developed [128], leading to the development of magnetic resonance imaging (MRI).

In the 1980s, MRI was established as an indispensable diagnostic clinical tool due to its ability to non-invasively produce high quality anatomical images of the human body. During the 1990s, an array of MRI techniques for studying human physiology were developed. Examples of such techniques now available are MR angiography and arterial spin labeling (ASL) for imaging blood vessels and blood flow, real-time cardiac imaging and perfusion measurements, diffusion tensor MRI for tracing fibres and functional MRI for mapping brain activity [107].

This chapter begins with a brief description of the theory of NMR in Section 1.1, and of its application to MRI in Section 1.2. The principles behind fMRI are then explained in Section 1.3, starting with an overview of the blood oxygenation level dependent (BOLD) contrast mechanism in Section 1.3.1, an explanation of the neurobiological basis of the BOLD contrast in Section 1.3.2 and a discussion of the noise and artifacts found in fMRI data in Section 1.3.3.

1.1 Nuclear Magnetic Resonance

The sub-atomic particles in the atomic nucleus, viz. protons and neutrons, possess a spin of ±1/2, imparting to the nucleus a net magnetic moment. Of interest to NMR is hydrogen (1H), with one unpaired proton and a total nuclear spin of 1/2. Tomographic images are generated by measuring the spatial distribution of the magnetic moment of 1H, abundantly present in living tissue in the form of water.

When placed in a large external magnetic field (the B0 field), hydrogen nuclei align either parallel or anti-parallel with the direction of the magnetic field (cf. Fig. 1.1). The detectable signal that is produced at room temperature depends on the manipulation of the few parts per million excess protons aligned with the magnetic field (in the z direction of the B0 field). At the same time, the magnetization vector of the proton precesses at a frequency ω0 which depends upon its gyromagnetic ratio γ, as given by the Larmor equation ω0 = γB0. The gyromagnetic ratio is a nucleus-specific constant; for hydrogen, γ = 42.6 MHz/Tesla.
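As a small worked illustration (mine, not from the original text), the Larmor equation can be evaluated for typical clinical and research field strengths using the value of γ quoted above:

```python
# Evaluate the Larmor equation f0 = gamma * B0 for the 1H nucleus.
# The field strengths are typical scanner values, assumed for illustration.
GAMMA_1H = 42.6e6  # gyromagnetic ratio of hydrogen, in Hz per Tesla

for b0 in (1.5, 3.0, 7.0):            # static field strength B0, in Tesla
    f0 = GAMMA_1H * b0                # precession (resonance) frequency, in Hz
    print(f"B0 = {b0:.1f} T  ->  f0 = {f0 / 1e6:.1f} MHz")
```

At 3 Tesla, for example, hydrogen resonates at roughly 128 MHz.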

Figure 1.1: Precession of the Nuclear Magnetic Moment. Hydrogen nuclei attain one of two different energy states when placed in a static magnetic field B0. The nucleus can be seen as a small bar magnet and in the lower energy state the bar magnet is aligned with B0 while the higher energy state corresponds to a counter–aligned magnet.

A pulse of radiation resonant with the precession frequency is applied to turn the small fraction of z-aligned protons by an angle of π/2, to align in the direction perpendicular to the magnetic field. This rotating magnetic moment now experiences a torque tending it back along the B0 field, as per the equations derived by Felix Bloch:

dMz/dt = −(Mz − M0)/T1 − γ(M × B)z    and    dMx,y/dt = −Mx,y/T2 − γ(M × B)x,y,    (1.1.1)

where M is the nuclear magnetization as a function of time, and M0 is the equilibrium magnetization in a steady and uniform field B0 in the z direction. The longitudinal relaxation time T1 gives the time it takes for the magnetization in the z direction to relax back to its equilibrium value M0, while the transverse relaxation time T2 gives the time for the azimuthal angles of the spins to get out of phase with one another. As the polarized protons precess together and relax to their initial alignment, they produce electromagnetic radiation at a frequency proportional to the strength of the magnetic field; this is called the free induction decay (FID) signal.
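The relaxation behavior in eqn. 1.1.1 has a simple closed form after an ideal π/2 pulse: Mz(t) = M0(1 − e^(−t/T1)) and |Mxy|(t) = M0 e^(−t/T2). The sketch below evaluates these solutions with rough gray-matter values of T1 and T2 at 1.5 T, which are assumptions for illustration only:

```python
import numpy as np

# Closed-form relaxation after an ideal 90-degree pulse: the longitudinal
# component recovers toward M0 with time constant T1, while the transverse
# component (the envelope of the FID) decays with the shorter constant T2.
M0, T1, T2 = 1.0, 0.9, 0.1           # equilibrium magnetization; T1, T2 in seconds
t = np.linspace(0.0, 3.0, 7)         # sample times in seconds

Mz = M0 * (1.0 - np.exp(-t / T1))    # longitudinal recovery (Mz = 0 right after the pulse)
Mxy = M0 * np.exp(-t / T2)           # transverse decay, i.e. the FID envelope

for ti, mz, mxy in zip(t, Mz, Mxy):
    print(f"t = {ti:.2f} s   Mz = {mz:.3f}   |Mxy| = {mxy:.3f}")
```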

Figure 1.2: Radio Frequency (RF) Excitation. There will be a small excess of hydrogen nuclei in the lower energy state and therefore a resultant magnetic vector pointing in the direction of B0. Energy can be supplied to the nuclei by applying a Radio Frequency (RF) pulse. The resultant magnetic vector is then tilted into the xy-plane and a current is induced in the receiver coil. Due to different relaxation processes, the xy-component of the magnetic vector, as well as the induced current in the receiver coil, will decay.

Variations in the molecular structure of biological substances can cause field inhomogeneities; the spins then experience different local magnetic fields and go out of phase as they precess, reducing the net FID signal. The time constant T2* measures the combined effect of random nuclei interactions and magnetic field inhomogeneities. It holds that T1 >> T2 > T2*.

1.2 Magnetic Resonance Imaging

The main concept of MRI is the spatial selection and localization of the NMR signal [128] through the use of magnetic gradients. As seen in the previous section, the resonant frequency ω of the nuclear spin system is dependent on its gyromagnetic ratio γ and the magnetic field strength B experienced by it. The use of spatially varying magnetic gradient fields G = (Gx, Gy, Gz) creates a different net magnetic field at every spatial location in the sample, thereby changing its intrinsic Larmor frequency. When this frequency changes linearly with position, the net measured signal becomes the Fourier transform of the spin density of the sample.

Slice selection is achieved by applying a strong linear gradient in the slice direction z during excitation with the B1 RF pulse. The slice-select gradient Gz changes the Larmor frequency as a function of the z coordinate. The RF pulse has a finite bandwidth and thereby excites only the z-extent with Larmor frequency within its bandwidth.

The phase–encode gradient is then switched on for a period of time τy, causing a phase warping of the protons in the selected slice as a function of y position in space, ky = γGyτy. After some time this gradient is switched off, the frequency-encode gradient Gx is applied, and the signal is sampled. The gradient activity as a function of time on each orthogonal axis is typically represented, along with RF activity, in a pulse sequence diagram as in Fig. 1.3. The final image is obtained by an inverse Fourier transform of the k-space data.
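The Fourier relationship between k-space and image space can be demonstrated with a toy phantom. The sketch below is purely illustrative (it mimics a fully sampled Cartesian acquisition by forward-transforming a synthetic spin density, rather than simulating the physics of the pulse sequence):

```python
import numpy as np

# Synthesize a simple "spin density" phantom, forward-FFT it to mimic a fully
# sampled Cartesian k-space, then recover the image with an inverse FFT.
rho = np.zeros((64, 64))
rho[24:40, 24:40] = 1.0                      # a square block of spin density

kspace = np.fft.fftshift(np.fft.fft2(rho))   # simulated k-space (DC at center)
image = np.fft.ifft2(np.fft.ifftshift(kspace))

print(np.allclose(np.abs(image), rho))       # True: the IFFT recovers the phantom
```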

The time constants T1, T2 and T2* are tissue–type dependent, allowing delineation of different tissues, and are the main cause of the different types of contrast in clinical MRI. This effect comes from the governing equation of MRI (derived from eqn. 1.1.1)

S = S0 (1 − e^(−TR/T1)) e^(−TE/T2),    (1.2.1)

where S is the signal detected and S0 is the maximum detectable signal, proportional to B = B0 + ⟨G, x⟩ and to the spin density ρ(x). Here TE is the echo time between the B1 excitation and readout, and TR is the repeat time between one B1 excitation and the next.
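As a worked example of eqn. 1.2.1 (with assumed, approximate T1/T2 values for gray and white matter at 1.5 T, and S0 normalized to 1), the sketch below shows how the choice of TR and TE trades off T1- against T2-weighted contrast:

```python
import numpy as np

def signal(TR, TE, T1, T2):
    # The governing equation 1.2.1 with S0 = 1.
    return (1.0 - np.exp(-TR / T1)) * np.exp(-TE / T2)

tissues = {"gray": (0.95, 0.10), "white": (0.60, 0.08)}   # assumed (T1, T2) in seconds

for name, (TR, TE) in {"T1-weighted": (0.5, 0.015), "T2-weighted": (3.0, 0.10)}.items():
    s = {t: round(signal(TR, TE, *p), 3) for t, p in tissues.items()}
    print(name, s)
```

With the short TR and TE, white matter gives the stronger signal (T1 weighting); with the long TR and TE, gray matter does (T2 weighting).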

Figure 1.3: MR k-Space Acquisition. On the left-hand side is a pulse sequence diagram showing the sequence of events on the three orthogonal gradient axes, the RF excitation and the acquisition window. The first action is the RF excitation along with a slice selection gradient Gz along the z direction, followed by a refocusing gradient. This is followed by the phase-encode gradient Gy along the y-axis and the readout pre-phasing gradient Gx along the x-axis. Finally, the readout (frequency-encode) gradient Gx is applied while the FID signal is measured. In order to fully encode a slice, this sequence is repeated for different amplitudes of the phase encode gradient. The right-hand side shows the corresponding trajectory through k-space. When the phase encode gradient is zero, the pre-phasing gradient moves to the leftmost point in k-space and the readout gradient sweeps across a line of k-space. When the phase encode gradient is applied, it shifts the readout line in k-space along the y direction by an amount Gy·τy. The spacing of the grid in k-space inversely affects the field-of-view, while the extent of the k-space is directly related to imaging resolution.

1.3 Functional Magnetic Resonance Imaging

fMRI has become increasingly popular for brain mapping and for studying neurophysiology because MRI scanners are commonly accessible and studies on healthy subjects can be performed without harmful side effects, making repeated studies of the same subject feasible. Furthermore, it offers unambiguous determination of the site of activation at a spatial resolution similar to that of positron emission tomography (PET) but at a greatly superior temporal resolution. Electroencephalography (EEG) and magnetoencephalography (MEG) offer better temporal resolution, but their spatial localization is ambiguous and the intrinsic spatial resolution is very poor. Fundamentally, fMRI can never hope to match the temporal resolution of electrophysiological methods, because the method is based on an indirect measurement of neural activity via changes in blood flow. However, it can be sufficiently fast to follow the hemodynamic response to a single synaptic event.

1.3.1 The BOLD Contrast

Although the first fMRI experiments used an exogenous gadolinium-based contrast agent, this technique was rapidly superseded by the discovery that de–oxyhemoglobin could be used as an endogenous contrast agent [171]. The origin of the blood oxygenation level dependent (BOLD) effect is that hemoglobin (Hb) is diamagnetic when oxygenated and paramagnetic when deoxygenated. The free electrons of the iron in de-oxyhemoglobin alter the local magnetic susceptibility, creating magnetic field distortions within and around the blood vessels and producing a slight alteration in the local T2* of a voxel.

The BOLD contrast measures the distribution of paramagnetic de–oxyhemoglobin content in a tissue by means of a T2*-weighted MRI sequence with single or multi–shot echo-planar imaging (EPI) methods [147]. Single–shot EPI can obtain an entire image volume with a single RF excitation, but suffers from geometric distortion artifacts due to the long readout (cf. Section 1.3.3). Multi–shot EPI results in high quality images comparable to conventional MR images, but at slower acquisition rates. On the whole, EPI offers major advantages over conventional MR imaging, including reduced imaging time, decreased motion artifacts and the rapid imaging of physiologic processes.

17 1.3.2 Relationship between BOLD and Physiology

The BOLD signal is not directly related to electrical neuronal activity, but rather measures the hemodynamic response to metabolic activity in the neural substrate and depends on a complex interaction of several metabolic and vascular parameters. Following metabolic activity in the brain tissue, oxygen is consumed to replenish the depleted stores of adenosine triphosphate (ATP) from glucose, causing a temporary increase in the amount of deoxyhemoglobin. In response to this, oxygenated blood is rushed to the metabolic site via the capillaries, causing an increase in the regional cerebral blood flow (rCBF) greater than the regional cerebral metabolic rate of oxygen consumption, and as a result causing a reduction in the de–oxyhemoglobin fraction.

In healthy human subjects, the increase in rCBF is dominant over the other changes, with the consequence that increased neural activity leads to an increase in the BOLD signal as measured by T2*-weighted imaging. The BOLD response to a short stimulus may show three phases, as illustrated in Fig. 1.4. After the stimulus, there may be a negative initial response that attains its minimum value at two to three seconds post-stimulus. This is followed by the main BOLD response that is conventionally used in fMRI experiments, with a time to peak of about five seconds and a response width with full-width half-maximum (FWHM) of roughly four seconds. Thereafter, a post-stimulus undershoot occurs, which may take up to a minute to return to baseline [140].
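This shape is commonly summarized by the "canonical" double-gamma hemodynamic response function popularized by packages such as SPM. The sketch below uses the usual default shape parameters, which are assumptions for illustration; it is not the hemodynamic estimator developed later in this thesis:

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(t, peak=6.0, under=16.0, ratio=6.0):
    # Difference of two gamma densities: a positive lobe peaking near 5 s and
    # a smaller, delayed lobe that produces the post-stimulus undershoot.
    h = gamma.pdf(t, peak) - gamma.pdf(t, under) / ratio
    return h / h.max()

t = np.arange(0.0, 32.0, 0.5)        # seconds post-stimulus
hrf = canonical_hrf(t)
print(f"time to peak ~ {t[np.argmax(hrf)]:.1f} s, undershoot minimum ~ {hrf.min():.2f}")
```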

Figure 1.4: Schematic showing the time course of the BOLD response to a short stimulus. The fast response has a negative peak at about two seconds post-stimulus, and the main BOLD response peaks at about five seconds with an FWHM of about four seconds. The signal takes about a minute to return to baseline.

Currently, it is unknown which aspect of neural activity drives the hemodynamic response, and the exact relationship between these terms is unclear [139]. Experimental studies comparing electrophysiological measurements with BOLD and rCBF have found that the hemodynamic responses correlate better with local field potentials than with local spiking rates, suggesting that the hemodynamic response is driven by input synaptic activity rather than output spiking activity, in accordance with theoretical models of the energy consumption for neuronal signaling¹. In the cerebellum of anesthetized rats, the regional cerebral blood flow was shown to be proportional to the product of the frequency of stimulation and the strength of the evoked local field potential near a Purkinje cell [172]. Monkey studies have provided evidence that the fMRI signal is better correlated with the local field potential² than with multi–unit and single–neuron activity [139]. These studies seem to imply a close relationship of fMRI with input synaptic activity, which is the primary cause of local field potentials. Based on such a coupling, two parameters that characterize the fMRI signal, namely the amplitude of the signal intensity change and the time course of this change, have been used to derive detailed spatio–temporal information about the underlying neuronal events [172]. There is also evidence that inhibitory activity does not elicit a measurable BOLD response [127].

¹The primary expenditure of energy is to restore the ion gradients degraded during neural activation. The intracellular–extracellular sodium gradient is far from equilibrium, so pumping sodium against this gradient is a strongly uphill reaction in a thermodynamic sense. For this reason, the most costly aspect of neural activity is likely to be excitatory synaptic activity in which glutamate opens sodium channels. The action of the sodium–potassium pump is thought to consume a large fraction of the ATP energy budget in the brain.
²The local field potential is generated by extracellular currents that pass through the extracellular space in a closed loop. These voltage changes (in the µV range) are smaller than action potentials but last longer and extend over a larger area of neural tissue. It is a linear sum of current flows to and from intracellular and extracellular spaces.

1.3.3 fMRI Noise

The BOLD signal in fMRI data is corrupted by many different causes, including:

Baseline Drift: The fMRI time–series data exhibit a slow drift (0.01–0.015 Hz), partially explained by a drift in the baseline magnetization of the scanner under long excitation and spin saturation [202].
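One standard remedy (an illustration of common practice, not the drift estimator used in this thesis) is to regress out a discrete cosine basis below a cutoff period, as in SPM-style high-pass filtering:

```python
import numpy as np

def dct_highpass(y, tr=2.0, cutoff=128.0):
    # Remove slow drifts by regressing out DCT components with periods longer
    # than `cutoff` seconds; tr is the volume repetition time in seconds.
    n = len(y)
    k = np.arange(1, int(np.floor(2 * n * tr / cutoff)) + 1)
    t = np.arange(n)
    X = np.cos(np.pi * np.outer(t + 0.5, k) / n)   # slow DCT-II regressors
    X = np.column_stack([np.ones(n), X])           # include the mean
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta                            # residual = detrended series

rng = np.random.default_rng(0)
drift = np.sin(2 * np.pi * 0.005 * np.arange(200) * 2.0)   # 0.005 Hz drift
y = drift + rng.normal(0.0, 0.1, 200)
print(round(np.std(y), 3), round(np.std(dct_highpass(y)), 3))  # drift power removed
```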

Physiological Noise: The signal is also contaminated with artifacts from physiologic functions such as breathing or pulsating blood. Unless the sampling rate is fast enough (< 1s per volume), these get aliased into the frequency bands occupied by the response to the experimental presentations. Also, spontaneous low-frequency fluctuations of arterial carbon dioxide were shown to induce low-frequency BOLD signal variations [7]. The spatio–temporal signatures of physiologic processes confound the identification of the BOLD response to bona fide neural activity, impelling various compensatory strategies in pre–processing and analysis pipelines.

Random Noise: Even in the absence of an experimental effect, fMRI time–series exhibit serial autocorrelations with disproportionate spectral power at low frequencies, i.e., the spectrum is 1/f-like. From studies of cadavers and phantoms, it is observed that colored noise arises even in the absence of physiological processes and must therefore be due to quantum effects [58].
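For simulation purposes, such serially autocorrelated noise can be synthesized by shaping the spectrum of white noise; the sketch below is illustrative only and is not a noise model fitted to any data in this thesis:

```python
import numpy as np

# Build a random spectrum whose power falls off as 1/f (amplitude ~ f^-0.5)
# and invert it to obtain a real-valued, 1/f-like time series.
rng = np.random.default_rng(1)
n = 1024
freqs = np.fft.rfftfreq(n, d=1.0)                 # unit sampling interval assumed
amp = np.zeros_like(freqs)
amp[1:] = freqs[1:] ** -0.5                       # power ~ 1/f
spectrum = amp * np.exp(2j * np.pi * rng.random(len(freqs)))
noise = np.fft.irfft(spectrum, n=n)
print(noise.shape, round(noise.std(), 4))
```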

Geometric Distortion: EPI sequences are particularly sensitive to the effects of magnetic field inhomogeneities because of long readout times, leading to a miscalculation of voxel position. The effect is most severe in regions where air–filled sinuses border with bone or tissue, such as the frontal lobes, occipital and temporal lobes, but it is also apparent to a lesser extent in other regions. The distortions arise in the direction in which the acquisition time between adjacent points is greatest. This is the phase encoding direction, often along the anterior–posterior axis (and also along the inferior–superior axis for 3D-EPI). Distortions along the read–out direction (left–right axis) are negligible because the acquisition time between adjacent points is small [114].

Head Motion: Head motion is a significant problem in fMRI data sets [170] where slight movements of the head over the course of the fMRI study can lead to large signal changes in the image time–series, which obscures the subtle signal changes that are being studied. It manifests in the form of non-linear geometric distortions and non–uniformities in intensity.

CHAPTER 2

BACKGROUND: FMRI METHODS

Because we don't understand the brain very well we're constantly tempted to use the latest technology as a model for trying to understand it. In my childhood we were always assured that the brain was a telephone switchboard. (What else could it be?) And I was amused to see that Sherrington, the great British neuroscientist, thought that the brain worked like a telegraph system. Freud often compared the brain to hydraulic and electromagnetic systems. Leibniz compared it to a mill, and now, obviously, the metaphor is the digital computer.

John R. Searle (1932–).

This chapter opens with a discussion of the general principles of neuroscience investigated with fMRI in Section 2.1. After this, I define a taxonomy of fMRI analysis methods in Section 2.2, based on my reading of the literature of this field, starting with pre–processing methods in Section 2.3. The categories for functional specialization are then presented in Section 2.4 and those for functional integration in Section 2.5, while the methods for decoding the representation of brain–states in fMRI are catalogued in Section 2.6.

22 2.1 Neuroscience Principles

The brain adheres to two fundamental principles of organization, functional specialization and functional integration. These concepts have become central in functional neuroimaging, which is able to sample evoked responses over the entire brain at the same time.

The principle of functional specialization was first observed by Franz Joseph Gall (1758–1828), who articulated the theory that architectural differences are indicative of functional differences and, conversely, that functional differences demand differences in architecture [74]. In other words, brain function is implemented in the form of neuronal hardware, and differences in function require differences in hardware, visible in terms of cell types, connectivity, synaptic and molecular structures.

The architectural layout of the cerebral cortex has since been investigated using cytoarchitectonics and myeloarchitectonics, the former revealing the arrangement of various cell types and the latter, the patterns of myelination in different zones of the cerebral cortex [10]. The identification of functionally specific modules was initially performed by examining the behavioral consequences of localized brain lesions. The most famous example is the study of the lesions in the posterior inferior frontal gyrus of two patients by Paul Pierre Broca that led to the discovery of the involvement of the eponymous brain region in speech production [96]. These observations received spectacular confirmation with the brain–maps produced by modern imaging studies, diverting the main focus of fMRI towards localizing particular cognitive functions to specific brain regions and creating a large database of isolated structure-function correlations over the years [63].

However, this explanation is incomplete, since it fails to characterize how cognition arises from local computations through their interactions. More than 60 years ago, Donald Hebb hypothesized that the fundamental unit of brain operation is not the single neuron but rather the cell assembly – an anatomically dispersed but functionally integrated ensemble of neurons. The individual neurons that compose an assembly may reside in widely separated brain areas but act as a single functional unit through coordinated network activity. Dynamic interactions between multiple assemblies may then give rise to the large-scale functional networks found in mammalian brains. Therefore, it has been postulated that higher–level cognition must emerge through information flows across these distributed regions [74].

This information flow is the functional integration of the brain, which can be characterized in two ways, namely in terms of functional connectivity and effective connectivity. Functional connectivity, defined as the temporal coherence among the activity of different functional units, is measured by cross-correlating their observed signals (e.g. spikes in EEG, or the BOLD response). Effective connectivity, a more abstract notion, is defined as the simplest circuit that could produce the same temporal relationships as observed experimentally. It therefore requires a model that describes the causal influences that functional units exert over one another [68].

However, a third viewpoint which has lately emerged in neuroimaging is that of functional representation, in the sense of whether the data contain information about the cognitive, perceptual or affective state of the subject and, if so, how this information is encoded in the distributed patterns of activation observed [92]. This problem therefore focuses more on understanding how, rather than where, the brain encodes information. The question of representation is becoming more important as computational neuroscientists attempt to build computational models of brain function using theories from computability, artificial intelligence, information–theory, economics and game–theory [69].

2.2 fMRI Methods Taxonomy

With a PubMed listing of more than 5000 publications, it is impossible to provide even the briefest of surveys of the literature on fMRI analysis methods. Therefore, in this section I shall attempt to summarize these methods in the form of a taxonomy, along with select references to landmark and highly cited publications. The overview of this taxonomy is laid out in Fig. 2.1.

Figure 2.1: Taxonomy of fMRI Methods. The proposed taxonomy of fMRI methods published in literature.

Pre-processing:
i Motion Correction
ii Distortion Correction
iii De-noising and Drift Estimation
iv Inter–subject Registration: iv.i Anatomical Registration; iv.ii Functional Registration
v Hemodynamic Response Modeling and Estimation: v.i Modeling; v.ii Estimation

Functional Specialization:
i Unsupervised: i.i Cluster-Analysis; i.ii Decomposition
ii Supervised: ii.i Mass Univariate (ii.i.i Linear; ii.i.ii Non–linear); ii.ii Multivariate
iii Semi-supervised: iii.i Unsupervised Augmented with Stimulus Information; iii.ii Supervised Augmented with Unlabeled Data

Functional Integration:
i Functional Connectivity: i.i Decomposition; i.ii Cross-Correlation; i.iii Cluster-Analysis
ii Effective Connectivity: ii.i Strongly Causal (ii.i.i Dynamic; ii.i.ii Static); ii.ii Weakly Causal

Functional Representation:
i Multivariate Pattern Recognition: i.i Spatial Only; i.ii Spatio–Temporal (i.ii.i Supervised; i.ii.ii Unsupervised; i.ii.iii Semi–supervised)
ii Multivariate Linear Models

2.3 Pre-processing

This category encompasses all processing of the raw fMRI volumes (in image–space) to make them suitable for use by inferential and exploratory methods. It does not, strictly speaking, include algorithms for reconstructing the image–space data from its k–space representation; however, many of the methods listed here operate on the k–space data.

2.3.1 Motion Correction

Head motion is typically corrected using rigid–body registration schemes (a six–parameter special case of affine registration) [6]. Alternative approaches based on motion corrected independent components analysis (mcICA) [137] and field–map based methods for simultaneous distortion and motion correction [223] have also been proposed. Oakes et al. provide a comprehensive survey of the motion correction tools used in fMRI [170].

Figure 2.1: Taxonomy of fMRI Methods. The proposed taxonomy of fMRI methods published in the literature:

Pre-processing
    i Motion Correction
    ii Distortion Correction
    iii De-noising and Drift Estimation
    iv Inter–subject Registration
        iv.i Anatomical Registration
        iv.ii Functional Registration
    v Hemodynamic Response Modeling and Estimation
        v.i Modeling
        v.ii Estimation

Functional Specialization
    i Unsupervised
        i.i Cluster-Analysis
        i.ii Decomposition
    ii Supervised
        ii.i Mass Univariate
            ii.i.i Linear
            ii.i.ii Non–linear
        ii.ii Multivariate
    iii Semi-supervised
        iii.i Unsupervised Augmented with Stimulus Information
        iii.ii Supervised Augmented with Unlabeled Data

Functional Integration
    i Functional Connectivity
        i.i Decomposition
        i.ii Cross-Correlation
        i.iii Cluster-Analysis
    ii Effective Connectivity
        ii.i Strongly Causal
            ii.i.i Dynamic
            ii.i.ii Static
        ii.ii Weakly Causal

Functional Representation
    i Multivariate Pattern Recognition
        i.i Spatial Only
        i.ii Spatio–Temporal
            i.ii.i Supervised
            i.ii.ii Unsupervised
            i.ii.iii Semi–supervised
    ii Multivariate Linear Models

Due to the low spatial resolution of fMRI, post hoc methods for correcting head–motion artifacts are fairly inaccurate and, moreover, introduce unquantifiable biases into the analysis. However, because of limitations such as unacceptably long acquisition times and the lack of appropriate hardware, biometric and field–map based motion correction methods are usually impractical, and post hoc methods are generally used despite their drawbacks.

2.3.2 Distortion Correction

The non-linear distortion introduced by inhomogeneities in the magnetic field can be corrected through multiple post hoc methods, including those that require a B0 field–map [106, 214] and non–rigid distortion correction schemes that do not require B0 maps (e.g. [6, 134]).

2.3.3 De-noising and Drift Estimation

Several methods have been developed for removing physiological noise, such as respiratory and cardiac activity, from the fMRI time–series, including navigator methods [129] that use an auxiliary echo to determine the confounds, k–space methods [221], notch–filtering approaches [23], and image–space methods with [109] and without [39] the use of echocardiograms. The methods for drift estimation and removal fall mainly into the following categories: low–pass filters, autoregressive filters, Kalman filters, non–linear low–pass filters and subspace methods [14, 158, 202]. Because of their decorrelating properties for the long–range correlated noise in fMRI, several authors have suggested a variety of wavelet–based noise estimation, de–noising and de–trending schemes [29, 215] as different variations of the wavelet shrinkage concept [52].
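To make the wavelet shrinkage idea concrete, the sketch below soft–thresholds the detail coefficients of a single voxel's time–series using the universal threshold of Donoho and Johnstone [52]. It is a minimal illustration, not the specific scheme of any reference above; it assumes the PyWavelets package and approximately Gaussian noise at the finest scale.

    import numpy as np
    import pywt

    def wavelet_denoise(ts, wavelet="db4", level=4):
        # Decompose the time-series into approximation + detail coefficients.
        coeffs = pywt.wavedec(ts, wavelet, level=level)
        # Estimate the noise level from the finest-scale details (MAD estimator).
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        # Universal threshold: sigma * sqrt(2 log n).
        thresh = sigma * np.sqrt(2.0 * np.log(len(ts)))
        # Soft-threshold every detail band; keep the coarse trend intact.
        coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[: len(ts)]

De–trending variants instead discard (or separately model) the coarsest approximation band, which captures the slowly varying drift.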

2.3.4 Inter–subject Registration

2.3.4.1 Anatomical Registration

Spatial normalization methods are typically used to align the images of multiple subjects into a common anatomical space, for the purposes of inter–subject comparison and group–level analysis. This normalization is usually carried out on a voxel–by–voxel basis using non–rigid registration techniques, by first co–registering the functional images with a high–resolution structural image of the same subject, and then registering the structural images with an anatomical template image in an atlas reference system [6, 207].

2.3.4.2 Functional Registration

Anatomical registration methods suffer from two fundamental problems: (i) dealing with the true anatomical differences between subjects (suggested by sulco–gyral and cytoarchitectonic studies across subjects), especially in the case of neuropathologies, and (ii) the relevance of anatomical alignment for the study of the functional commonalities in populations (i.e. the rigidity of the mapping between structure and function). In response to these criticisms, strategies based on the alignment of intra–subject parcellations [207], of functional connectivity [43] or of task–related functional activation maps [192], along with strategies for directly registering the time–series itself [125], have been suggested.

2.3.5 Hemodynamic Response Modeling and Estimation

The problem of understanding and accounting for the variability in the hemodynamic response of the brain has been approached either by building biophysiological models of the response or by estimating the response from the data.

2.3.5.1 Modeling

Methods to model the hemodynamic response function (HRF) include Fourier basis sets [Henson et al.], Gamma functions [66], cosine bases [218], Gaussian bases [38], anatomically–informed basis functions [113], finite impulse response (FIR) filters [82] and subspace methods [64], along with physiological models of the neuro–vascular coupling [32].
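As one concrete example of such a model, the widely used double–gamma form of the HRF can be written down and convolved with a stimulus train to produce a predicted BOLD regressor. The parameter values below are common defaults, assumed here purely for illustration, and do not reproduce the parameterization of any specific reference above.

    import numpy as np
    from scipy.stats import gamma

    def hrf(t):
        # Double-gamma HRF: a response peaking near 5s minus a later undershoot.
        return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

    TR = 2.0                              # assumed repetition time (seconds)
    t = np.arange(0, 32, TR)              # 32s support of the HRF
    stimulus = np.zeros(200)              # one entry per scan
    stimulus[::20] = 1.0                  # a hypothetical event every 20 scans
    regressor = np.convolve(stimulus, hrf(t))[: len(stimulus)]

Such a regressor forms one column of the GLM design matrix discussed in Section 2.4.2.1.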

2.3.5.2 Estimation

Direct estimation of the HRF from the data has also been reported, including methods using Bayesian networks [148], non–linear control theory [97] and deconvolution in the time [82], frequency and wavelet [59] domains.

2.4 Functional Specialization

The majority of fMRI literature is concerned with localizing the neural substrates for a particular mental faculty, i.e. with the problem of functional specialization.

2.4.1 Unsupervised

These are exploratory methods that identify salient regions and/or distributed patterns of activation from fMRI data, without reference to information about the task (i.e. the stimulus). Unsupervised methods have been widely used for analyzing non–task related data, such as resting–state data [84]. For a more detailed look at the issues surrounding supervised, unsupervised and semi–supervised methods, the reader is referred to Section 7.2 of Chapter 7.

2.4.1.1 Cluster-Analysis

In this approach, voxels are clustered together according to varying combinations of the similarity of their structural and fMRI time–course information. A large number of clustering methods [99] have been applied to fMRI, including fuzzy k–means clustering [12], vector quantization [11], self-organizing maps [216], neural gas networks [149], clustering in the frequency domain [157] or wavelet domain [105], dynamical cluster analysis [13], temporal cluster analysis [226] and hierarchical clustering [83, 105]. Please refer to Dimitriadou et al. [51] for a review of these methods.

2.4.1.2 Decomposition

Activation patterns in the data are explored by decomposing the data into constituent components, with linear methods such as principal component analysis (PCA) [70], independent component analysis (ICA) [34] and non–negative matrix factorization [168], or non–linear methods such as non–linear PCA [67] and kernel PCA [206]. Also, dynamical formulations for decompositions have been proposed, wherein the data are modeled as multivariate ARMA processes and the decomposition reduces to estimating the ARMA coefficients [205].

Decomposition–based methods in general, and ICA and its many variants in particular, are very popular, not only for the exploratory analysis of functional localization, but also for functional connectivity, motion correction and de–noising, and as feature–vectors for use in further pattern recognition. The reader is referred to the review article by Calhoun and Adali [34] for more details.

2.4.2 Supervised

These are confirmatory methods that use either classical or Bayesian hypothesis testing to produce statistical measures (i.e. p–values or posterior probabilities) of activation at different locales in the brain in response to a task.

2.4.2.1 Mass Univariate

This is the most popular approach to studying functional localization: every voxel is individually and independently tested for activation, yielding parametric maps of parameter estimates that represent the amount of activity, as per some model of the hemodynamic response. These models can be either:

i Linear: Most existing analysis tools are based on a linear time–invariant model of the brain's response, formed by convolving functions representing the experimental stimuli with an approximate HRF to build the design matrix. The loadings of the regressors in the observed time–course of each voxel are estimated using General Linear Modeling (GLM) [66]; a minimal sketch of this estimation follows this list. The main drawbacks of this approach are its assumptions of a linear coupling between stimulus and BOLD response and of spatio–temporally invariant and known hemodynamics. Many studies have shown a relatively large variation in the observed hemodynamic response across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time [201]. Non-linearities in the transfer from a stimulus to the hemodynamic response have also been demonstrated [140], questioning the validity of linear models. However, due to their explanatory power, statistical simplicity and computational efficiency, GLM approaches are still the most widely used methods.

ii Non–linear: Estimating the parameters of activation, such as amplitude, latency and dispersion, through non-linear regression has been variously proposed, wherein the estimated HRF is a non-linear function of these parameters [97, 119, 124, 182]. While these methods do not require assumptions of known and spatially constant hemodynamic responses, and only assume a fixed parametric form of the hemodynamic response function at each voxel, they are computationally expensive, involving non-linear minimization steps at each voxel. Also, due to their non–linear nature, the p–values of the parameter estimates have to be computed using permutation tests, adding to their computational burden.
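As promised above, the following is a bare–bones sketch of the mass–univariate GLM: every voxel's time–course is regressed onto a design matrix and a t–statistic is formed for one contrast. It is an illustration only – packages such as SPM [66] additionally model temporal autocorrelation and pre–whiten the data – and the variable names are hypothetical.

    import numpy as np

    def glm_fit(Y, X):
        # Y: scans x voxels data matrix; X: scans x regressors design matrix.
        beta = np.linalg.pinv(X) @ Y                 # OLS parameter estimates
        resid = Y - X @ beta
        dof = X.shape[0] - np.linalg.matrix_rank(X)
        sigma2 = (resid ** 2).sum(axis=0) / dof      # per-voxel noise variance
        c = np.zeros(X.shape[1]); c[0] = 1.0         # contrast on 1st regressor
        var_c = c @ np.linalg.pinv(X.T @ X) @ c
        t = (c @ beta) / np.sqrt(sigma2 * var_c)     # per-voxel t-statistics
        return beta, t

    # Hypothetical usage: the task regressor is the stimulus train convolved
    # with an HRF (cf. Section 2.3.5), plus a constant baseline column.
    # X = np.column_stack([task_regressor, np.ones(n_scans)])
    # beta, t = glm_fit(Y, X)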

Multiple Comparisons Problem: Activation maps are computed by determining the correct value at which to threshold the per-voxel parameter estimates in order to reject the null hypothesis (of no activation) at the desired size (i.e. the α–value). Rejecting the null hypothesis at each voxel individually leads to a Multiple Comparisons Problem (MCP, see footnote 3), and therefore many correction strategies are used in fMRI:

• Bonferroni: Based on the Bonferroni correction (see footnote 4) of the individual tests to give a global test of the desired size, these methods include correcting the threshold for each voxel [66] and thresholding the wavelet coefficients of the statistical parametric map [212].

• False Discovery Rate (FDR): Given a population of V voxels marked as active, an FDR of R implies that, in expectation, not more than R · V of those voxels are false positives. FDR control can be implemented either in a voxel–wise [76] or a wavelet–based [212] fashion.

• Random Field Theory: Here, the fMRI voxels are treated as a random field with a specific interdependency structure, such as a Gaussian random field [220] or a Markov random field [44], in which case the global p–value can be estimated either in closed form, via the Euler characteristic of the excursion set, or through MCMC methods.

• Permutation Testing: Permutation testing is a non–parametric test that uses bootstrapped estimates of the global null hypothesis, obtained by resampling the data, and therefore does not suffer from the local vs. global null problem [166].

Footnote 3: In mass-univariate analysis, testing for individual voxels that violate the null hypothesis of no activity at a size of α (the desired Type-I, or false positive, error rate) means that, even in the absence of a real effect, one would observe an α–fraction of the voxels as active simply due to noise. In fMRI, given that there are approximately 10^5 voxels in the brain volume, for an α of 0.05 the number of Type-I (false positive) errors can be on the order of 5000 voxels. Such a univariate test is invalid with respect to the global null hypothesis (i.e. it is a test whose true size differs from its nominal size), which cannot treat voxels as independent random variables. Instead, a global p–value of the spatially-distributed activation pattern is needed to reject a global null hypothesis.

Footnote 4: The Bonferroni correction is a simple method of maintaining the family–wise error rate when testing n hypotheses simultaneously, by testing each individual hypothesis at a statistical significance level of α/n, where α is the global significance level.

• Bayesian: Here, the problem of thresholding parameter values is bypassed entirely by instead computing posterior probability maps (PPMs) from the posterior distribution of the parameters at each voxel, computed using hierarchical Bayesian models [72]. Methods based on a Bayesian segmentation of the parametric maps [184] have also been developed.
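To make the first two correction strategies concrete, the sketch below computes Bonferroni and FDR decisions on a vector of per–voxel p–values. The Benjamini–Hochberg step–up rule used here is one common way of implementing FDR control, shown for illustration under an assumption of independent (or positively dependent) tests; it is not necessarily the procedure of the cited works.

    import numpy as np

    def mcp_corrections(pvals, alpha=0.05):
        V = pvals.size
        # Bonferroni: controls the family-wise error rate.
        bonferroni = pvals <= alpha / V
        # Benjamini-Hochberg: find the largest k with p_(k) <= (k/V) * alpha.
        order = np.argsort(pvals)
        below = pvals[order] <= alpha * np.arange(1, V + 1) / V
        k = (np.nonzero(below)[0].max() + 1) if below.any() else 0
        fdr = np.zeros(V, dtype=bool)
        fdr[order[:k]] = True            # reject the k smallest p-values
        return bonferroni, fdr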

The typical mass–univariate analysis pipeline using a GLM is shown in Fig. 2.2.

Figure 2.2: GLM Analysis Pipeline: Shown are the different stages of the pipeline involved in a GLM mass–univariate analysis of fMRI data, including pre–processing, spatial smoothing, GLM–based regression against the design matrix followed by thresholding of the estimated parametric maps at the desired size, using one of the MCP correction procedures.

2.4.2.2 Multivariate Linear Models

Multivariate models relax the naïve independence assumptions of univariate methods and enable inference about distributed responses. They also do not suffer from the multiple comparisons problem of univariate models. Although multivariate methods have been used in functional neuroimaging since the 1980s [161], their popularity was circumscribed by the insufficient degrees of freedom, since the number of voxels (i.e. variables) exceeds the number of scans (i.e. observations) by orders of magnitude. It is only recently that there has been a resurgence in these methods, accompanied by some form of dimensionality reduction, mainly because of the increasing interest in the question of the neural representation of mental states, which can be addressed only in a multivariate setting. While non–linear multivariate methods have been used for studying distributed representations in the form of pattern–recognition classifiers, only linear models have been used for studying functional specialization, mainly due to their computational efficiency, statistical power and ease of interpretation. These methods, however, still require the assumption of a linear coupling of stimulus to BOLD response and of a spatially invariant and known HRF. Included are scaled sub–space profile models [161], partial least squares (PLS) [152], MANCOVA–type methods such as canonical correlation analysis (CCA) [70] and canonical variates analysis (CVA) [219], and hierarchical Bayesian linear models [65].

2.4.3 Semi-supervised

These methods try to effect a compromise between unsupervised and supervised methods, in order to use their complementary advantages – that of supervised methods to enforce quantifiable links with the experimental conditions, and that of unsupervised methods to discover new patterns in the data. These methods exist in two flavors:

2.4.3.1 Unsupervised Augmented with Stimulus Information

Here, the results of an unsupervised method are improved or made more relevant by incorporating some information about the experimental task and the expected results. The aim is to condition or guide the discovery towards patterns pertinent to the mental task and away from spurious artifacts. These methods include functional PCA [79], constrained ICA (cICA) [142], the ICA with reference (ICA-R) of Lu and Rajapakse [141] and semi–blind ICA [138], wherein prior information, such as statistical properties, reference signals, or spatial templates of the expected localization, is used to improve the quality of the decomposition and reduce the indeterminacies inherent in ICA. Additionally, approaches that use the stimulus time–series as the reference for the distance metric in a clustering algorithm have also been proposed [64, 83, 105].

2.4.3.2 Supervised Augmented with Unlabeled Data

In this paradigm, unlabeled fMRI data – i.e. data for which there is no associated stimulus information – are used to improve the estimation of spatial maps for task–related studies, for example by using resting–state data for the Laplacian regularization of the regression of data from subjects watching annotated movies [24].

2.5 Functional Integration

The recent advances in functional neuroimaging technology and image analysis theory have paved the way to investigating brain function in terms of the interactions between neural systems, rather than the individual regions involved in a sensory or cognitive task [204].

2.5.1 Functional Connectivity

Methods for functional connectivity involve assessing the non-causal relationships, typically correlation, between spatially distinct regions based on the time–courses of the voxels in those regions. The reader is referred to Li et al. [133] for a review of the functional connectivity literature in fMRI.

2.5.1.1 Decomposition

These methods involve decomposing the relationships between regions, specified by the statistical moments of their time–courses, into spatially coherent maps. Typical methods are PCA [71] and PLS [152], which operate on the correlation (i.e. the second–order moment) between regions, and non–linear PCA [67] and ICA, which orthogonalize the higher–order statistical dependencies as well [67].

2.5.1.2 Cross-Correlation

These methods measure functional connectivity as the cross-correlation between one region or voxel and another. In seed–voxel correlation analysis (SVCA) [36], 3D functional connectivity maps are created that chart the brain regions correlated with a seed region, selected either from brain anatomy or from additional functional activation studies. Methods that analyze the correlations between multiple regions or parcels simultaneously have also been applied [1].
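The core computation behind seed–based correlation is small enough to sketch: correlate the mean time–course of a set of seed voxels with every other voxel. This is a minimal illustration of the idea, not the implementation of [36], and the array names are hypothetical.

    import numpy as np

    def svca_map(data, seed_voxels):
        # data: scans x voxels array; seed_voxels: column indices of the seed.
        seed = data[:, seed_voxels].mean(axis=1)     # mean seed time-course
        X = data - data.mean(axis=0)                 # center each voxel
        s = seed - seed.mean()
        num = s @ X                                  # covariances (up to 1/T)
        den = np.sqrt((s @ s) * (X ** 2).sum(axis=0))
        return num / den                             # Pearson r per voxel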

2.5.1.3 Cluster-Analysis

Many of the clustering methods reviewed in Section 2.4.1.1 have also been used to reveal the time–course similarity patterns between spatially distant voxels.

2.5.2 Effective Connectivity

The main criticism of functional connectivity is that correlations may arise in a variety of ways and may not stem from causal, or even functional, relationships. For example, in multi–unit electrode recordings, they can result from stimulus-locked transients evoked by a common input, or can reflect stimulus-induced oscillations mediated by synaptic connections [204]. Additionally, the drawback of approaches such as PCA and ICA is that not all of the identified patterns have a clear relationship with neural activity, and many are demonstrably artefactual.

2.5.2.1 Strongly Causal

Strongly causal links [174] between interconnected regions are inferred to firmly establish structure–function relationships in one of two ways:

Dynamic: Dynamical systems theory, in the form of dynamic causal models (DCM) [204], and graphical models, in the form of dynamic Bayesian networks (DBN) [185], have been used to infer causal links between pre–selected ROIs from fMRI data, typically in the visual processing stream. These methods select amongst competing models of the causal interactions between neural circuits by comparing the evidence for the observed fMRI data arising from each given circuit. They are restricted to examining a very small number of interactions because of their high computational complexity. Also, as they require a model explaining how the BOLD data are generated from neural activity [73], their validity depends on the accuracy of this model.

Figure 2.3: Dynamic Causal Models: Shown is an example of a DCM encoding causal interactions between visual areas V1 and V4, Brodmann areas 39 and 37, and the superior temporal gyrus (STG). The dark square boxes represent the transformation of the internal state of each region (neuronal activity) into a measured (hemodynamic) response y. Image reproduced from [73].

Static: Methods such as structural equation models (SEM) [153] compare the evidence for different models of causal relationships between brain regions. This approach is static in the sense that the relationships are instantaneous and do not account for the temporality of the data. These methods are also computationally expensive and require selecting among multiple pre–specified alternatives.

2.5.2.2 Weakly Causal

Here, the causal definition of effective connectivity is relaxed to a weaker one, that of Granger causality [80]. Under this definition, if incorporating the past values of a time–series X improves the prediction of the future of a time–series Y, then X is said to have a (Granger) causal influence on Y. This definition allows finding functional relationships between large numbers of regions simultaneously, thereby circumventing the drawbacks of the dynamic and static models (namely, the pre–specification of alternative models and the pre–selection of a small number of ROIs). In a similar vein, multivariate autoregressive processes [89] have been used to analyze causal relationships at the level of the BOLD signal itself, in terms of the coefficients of an autoregressive process.
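A minimal sketch of pairwise Granger causality on two BOLD time–courses: fit an AR(p) model of Y from its own past, then from its own past plus the past of X, and compare the residual variances. This illustrates the definition only – it is not the multivariate formulation of [89] – and the lag order p is an assumption.

    import numpy as np

    def lagged(v, p):
        # Rows t = p..T-1; columns v[t-1], ..., v[t-p].
        return np.column_stack([v[p - k - 1: len(v) - k - 1] for k in range(p)])

    def granger_influence(x, y, p=2):
        target = y[p:]
        ones = np.ones((len(target), 1))
        X0 = np.hstack([ones, lagged(y, p)])                # past of y only
        X1 = np.hstack([ones, lagged(y, p), lagged(x, p)])  # ... plus past of x
        r0 = target - X0 @ np.linalg.lstsq(X0, target, rcond=None)[0]
        r1 = target - X1 @ np.linalg.lstsq(X1, target, rcond=None)[0]
        # Positive values suggest that x helps predict y (Granger influence).
        return np.log(r0.var() / r1.var())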

The drawback of such “model–free” formalisms is that the neuroscientific interpretation of their results is unclear.

2.6 Functional Representation

Revealing distributed encoding of information about the cognitive state of the subject from fMRI must be, by definition, performed in a multivariate setting.

The use of voxel–based inferential statistics systematically eliminates most of the data, reducing the power of fMRI to that of multiple single–unit recordings [173]. In contrast to univariate analysis, multivariate methods do not make the naïve assumption that the activity in voxels is independent, and mine for the information present in the interactions among voxels. Communication among neurons, as well as among larger functional units, is the main basis of neural computation, and by not disregarding these interactions, multivariate methods are able to peer into the neural code. In addition, multivariate methods do not suffer from the multiple comparisons problem of univariate methods (cf. Section 2.4). As importantly, the need for spatial smoothing to increase SNR is obviated, as such methods integrate the information from groups of voxels that individually are weakly activated, but jointly may be highly structured with respect to the task.

The topic of the representation of the neural code and the decoding of cognitive states from the data is reprised for a more in–depth discussion in Chapter 7.

2.6.1 Multivariate Pattern Recognition

A popular approach has been the use of multivariate pattern recognition (MVPR) [92], which learns the statistical mapping from the distributed pattern of activation in an individual brain to the experimental conditions experienced during the scans.

2.6.1.1 Spatial–only

Most MVPR methods do not take the temporal nature of cognitive processing into account. They make the assumption that all fMRI scans with the same label (e.g. behavioral state) have the same properties, and do not account for the hemodynamic delay between neural activity and the BOLD signal. Therefore, these approaches are inherently restricted to block–design experiments, where such assumptions are permissible.

Typically, linear classifiers – such as correlation-based classifiers [90], single-layer perceptrons [179], linear discriminant analysis [91], linear support vector machines (SVMs) [160] and Gaussian Naïve Bayes [159] – have been used, due to their simplicity of interpretation without significant loss of accuracy [110]. However, non–linear classifiers, such as kernel canonical correlation analysis (kCCA), have also been reported [87], learning the mapping from the fMRI data to scale-invariant feature transform (SIFT) features of natural images.
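As an illustration of a linear decoder of this kind, the sketch below cross–validates a linear SVM on a scans × voxels pattern matrix using scikit-learn. The data here are random placeholders and the pipeline is deliberately bare; it does not reproduce any specific study cited above, which would add feature selection and careful run–wise cross–validation.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    # Hypothetical inputs: one activation pattern per labeled scan.
    X = np.random.randn(120, 5000)        # 120 scans x 5000 voxels (placeholder)
    y = np.repeat([0, 1], 60)             # two experimental conditions
    acc = cross_val_score(LinearSVC(C=1.0), X, y, cv=5)   # 5-fold CV accuracy
    print("decoding accuracy: %.2f +/- %.2f" % (acc.mean(), acc.std()))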

Such MVPR methods, which predict the behavioral state of the subject, have been applied mainly to the study of visual processing (e.g. [90, 91, 110, 179]), but also to auditory perception [150], motor tasks [122], word recognition [160], and the detection of emotional affects such as deception [200] and fear [177]. Alternative approaches based on pattern classifiers attempt to decode the perceptual state of the subject without modification of the sensory input [54] (using MDS to understand the representation of shapes in the visual cortex).

2.6.1.2 Spatio–Temporal

The methods listed in this class attempt to describe the information contained not only in the spatial distribution of activity at one time–instant but also in the temporal evolution of these patterns.

Supervised: Temporal variability during a task has been accounted for – in a limited way – through the temporal embedding of all the fMRI scans in one block as the feature vector [159, 164]. Also, a pattern classifier has been trained to decode changes in binocular dominance (see footnote 5) on a second–by–second basis, thereby revealing the timing of changes in the neural representations of information and their subsequent availability for report by the participant [179].

Unsupervised and Semi-supervised: To the best of our knowledge, the models introduced here are the first examples of unsupervised and semi–supervised methods for studying the spatio–temporal patterns in the data.

Footnote 5: When dissimilar images are presented to the two eyes of the subject, they compete for perceptual dominance, so that each image is visible in turn for a few seconds while the other is suppressed. Because the perceptual transitions between each monocular view occur spontaneously, without any change in the physical stimulation, neural responses associated with conscious perception can be distinguished from those due to sensory processing.

2.6.2 Multivariate Linear Models

The MVLMs discussed in Section 2.4.2.2 can also be used to study representation. This is further elaborated upon in Chapter 7.

2.7 Summary

In this chapter, I presented a taxonomy of the different processing and analysis strategies in fMRI. At the top level, the salient classes of neuroscientific problems that are investigated with fMRI were listed, followed by the sub–categories of each overall neuroscientific problem. Each subdivision was then further branched based on methodological specifics and statistical considerations.

Although this classification schema does not specifically address the important problem of mental chronometry (cf. Chapter 4), the visual analysis method for mental chronometry developed in this thesis is categorized as Functional Specialization : Unsupervised : Cluster-Analysis, while the GLM–based method for latency determination would be Functional Specialization : Supervised : Mass Univariate : Linear. The methods for studying the representation of the mental state of the subject in the spatio–temporal patterns of metabolic activity, developed in Part III, fall under the category Functional Representation : Multivariate Pattern Recognition : Spatio-Temporal, and are further sub-classified as Unsupervised and Semi-supervised.

CHAPTER 3

BACKGROUND: NEUROSCIENTIFIC SETTING

It ought to be generally known that the source of our pleasure, merriment, laughter, and amusement, as of our grief, pain, anxiety, and tears, is none other than the brain. It is specially the organ which enables us to think, see, and hear, and to distinguish the ugly and the beautiful, the bad and the good, pleasant and unpleasant. It is the brain too which is the seat of madness and delirium, of the fears and frights which assail us, often by night, but sometimes even by day; it is there where lies the cause of insomnia and sleep-walking, of thoughts that will not come, forgotten duties, and eccentricities.

Hippocrates (460BC–370BC).

The methods presented in this thesis were designed and applied to investigate the neural processing cascades in two neurophysiological studies:

i The neurophysiology of the retrieval and manipulation of visuo–spatial working memory, described in Section 3.1.

ii The functional processing of mental arithmetic in adults suffering from developmental dyscalculia and dyslexia, described in Section 3.2.

In this chapter, the neurological processes being investigated, a background of the neuropathologies involved, and the data–sets used in these studies are described.

3.1 Visuo–Spatial Working Memory

Visuo–spatial working memory (VSWM), the ability to temporarily maintain visuo–spatial information in mind, is a key cognitive function that underlies other cognitive abilities such as complex reasoning, reading, mathematical calculation, and problem-solving. The development of working memory and of its cognitive control and manipulation is one of the most salient features of the maturation of mental processes during childhood and adolescence [55].

Several brain imaging studies have been conducted to determine precisely what is changing in a child's brain over time, enabling better control of thoughts and behavior. But compared with what is known about changes in brain structure during development (see footnote 6), far less is known about the accompanying changes in brain function. The pattern of developmental changes in brain activation has generally been observed as a shift from diffuse to focal activation and from posterior to anterior activation [30]. The precise pattern of change observed depends on the task, the ages being examined and the brain region in question.

Brain imaging studies of working memory in adults suggest that different parts of the lateral prefrontal cortex (PFC) are involved in maintenance and manipulation, with the ventrolateral (VL) PFC performing the online maintenance of information and the mid-dorsolateral (DL) PFC additionally recruited for manipulation. It has also been hypothesized that representations of magnitude or space in the parietal cortex serve as the substrate for the organization and manipulation of items in working memory [46]. In school-aged children and adolescents as well, the ability to manipulate information is associated with the strength of recruitment of regions in the dlPFC and the bilateral superior parietal cortex (SPC). It is believed that in children the ability to manipulate items in working memory develops more slowly than the ability to simply retain them, and that by the age of 13 this ability is fully developed.

Footnote 6: Structural brain imaging studies of development indicate cortical gray matter loss and white matter increases during late childhood and adolescence, associated with the pruning of excessive neurons and increased structural connectivity between brain regions.

3.1.1 Data-set

The study was designed to isolate the manipulation requirements by comparing a maintenance + manipulation condition with a pure maintenance condition. fMRI data were recorded while 8 healthy right–handed children, aged 7 to 11, performed a working memory task with both maintenance and manipulation conditions. Three nameable objects were presented sequentially (Fig. 3.1). During a 6s delay period, participants were instructed to repeat the objects in a forward order (the maintenance task) or to reverse the order of the objects (the manipulation task). After the delay, participants were prompted with one of the objects and indicated with a button press whether this target object occurred first, second, or third in the forward or backward sequence.

Acquisition was done on a Siemens 3T Tim Trio MRI scanner with a quadrature head coil, using a BOLD-sensitized 2D-EPI gradient-echo pulse sequence with the following specifications: echo time 30ms, flip angle 30°, volume scan time 2.22s, and voxel size 3 × 3 × 3.75mm. A typical study lasted around 10 minutes, with about 30 trials. Raw data were reconstructed off-line, and routine pre-processing (viz. motion and slice-timing correction, spatial normalization to a standard brain space, co-registration of the functional and structural scans, and spatial smoothing with an 8mm Gaussian filter) was done in SPM5 [66].

Figure 3.1: Visuo–Spatial Working Memory Task: The timings for the various stages of the paradigm used to study visuo–spatial working memory maintenance and manipulation. [Figure: a 22.5s trial showing the three sequential object presentations, the forward/backward instruction (INSTR), the 6s delay period, and the probe (PROBE) with its 1,2,3? response prompt.]

3.2 Mental Arithmetic

Cognitive theories of numerical representation suggest that the understanding of numerical quantities is driven by a magnitude representation associated with the intraparietal sulcus, and is possibly under genetic control [162]. Dehaene et al. [49] proposed a triple-code model of the organization of number-related processes in the parietal lobe, based on neuropsychological evidence derived from behavioral, lesion, PET and fMRI studies. According to this model, there are three distinct systems involved in mental arithmetic:

i A quantity system, having a nonverbal semantic representation of the size and distance relations between numbers

ii A verbal system, in which numerals are represented lexically, phonologically, and syntactically, like any other type of word

iii A visual system, in which numbers are encoded as strings of Arabic numerals

Cognitive models suggest that exact arithmetic facts are stored in a language–specific format, while approximate knowledge is language–independent and shows a numerical distance effect associated with the nonverbal quantity system.

The horizontal segment of the intraparietal sulcus (hIPS) is a major site of activation in neuroimaging studies of number processing. It is systematically activated whenever numbers are manipulated, independently of number notation, with increased activation corresponding to increased magnitude of the quantities manipulated. The hIPS is more active when subjects estimate the approximate result of an addition problem than when they compute its exact solution. It shows greater activation for subtraction than for multiplication (see footnote 7). The hIPS is also active whenever a comparative operation that needs access to a numerical scale is called for. Parietal activation in number comparison is often larger in the right than in the left hemisphere; however, the parietal activation, although it may be asymmetric, is always present in both hemispheres. It has been speculated that the core quantity system, analogous to an internal “number line,” is localized in this region [49, 195].

A left angular gyrus (lAG) area, in connection with other left-hemispheric perisylvian areas, supports the manipulation of numbers in verbal form. This region is part of the language system and contributes to the processing of arithmetic operations, such as multiplication, that make strong demands on the verbal coding of numbers [49]. The lAG is more active in exact calculation than in approximation [47]. Also, it shows greater activation for exact calculations that require access to a rote verbal memory of arithmetic facts, such as multiplication, than for operations that are not stored and require some form of quantity manipulation. Even within a given operation, such as single-digit addition, the left angular gyrus is more active for small problems, with a sum below 10, than for large problems, with a sum above 10. This probably reflects the fact that small addition facts, just like multiplication tables, are stored in rote verbal memory, while larger addition problems may be solved by resorting to semantic manipulation strategies [195].

Footnote 7: Multiplication tables and small exact addition facts may be stored in rote verbal memory, and hence place minimal requirements on quantity manipulation.

Finally, a bilateral posterior superior parietal system supports attentional orientation on the mental number line, just as on any other spatial dimension. It is active during number comparison, approximation, the subtraction of two digits, and counting. It also appears to increase in activation when subjects carry out two operations instead of one. It also plays a central role in a variety of visuo–spatial tasks, including hand reaching, grasping, eye and/or attention orienting, mental rotation, and spatial working memory [49]. It may also contribute to attentional selection on other mental dimensions that are analogous to space, such as time or attending to specific quantities on the number line.

The right precuneus, the left and right middle and superior frontal regions, and the pre-central gyrus containing the supplementary motor area (SMA) have also been identified during arithmetic operations. Therefore, mental arithmetic appears to reflect a basic anatomical substrate of working memory, numerical knowledge and processing based on finger counting, derived from a network originally related to finger movement [60].

3.2.1 Dyscalculia

Developmental dyscalculia (DDC) is defined as a difficulty in learning arithmetic that cannot be explained by mental retardation, inappropriate schooling, or a poor social environment [162, 196]. Children can exhibit low math performance in many different ways: some may have particular difficulties with arithmetical facts, others with procedures and strategies, while most disabled children seem to have difficulties across the whole spectrum of numerical tasks [31]. Just as diverse as their symptoms is the wide range of terms referring to these disabilities (developmental dyscalculia, mathematical disability, arithmetical learning disability, number fact disorder, psychological difficulties in mathematics). In adult acalculia, at least two subtypes of dyscalculia may be observed: multiplication deficits are reported in cases of dyscalculia accompanied by dysphasia and/or dyslexia, while subtraction and quantity–manipulation deficits are often present in patients with dyscalculia but without any accompanying dyslexia or language retardation [75].

DDC is relatively frequent, affecting 3–6% of children, and a fraction of those children may suffer from a core conceptual deficit in the numerical domain. This could affect even very simple tasks such as counting or comparing numerical magnitudes [31]. Classified as a developmental Gerstmann syndrome, it is frequently co-morbid with a variety of disorders, such as dyslexia, attention disorders, dysgraphia, left-right disorientation and finger agnosia, poor hand–eye coordination, poor working memory span, epilepsy, fragile–X syndrome, Williams syndrome and Turner syndrome. However, causal relationships between these disorders have not been established, and the genetic and neural bases of DDC remain unknown [121].

3.2.2 Dyslexia

Dyslexia (DL) is a reading disorder defined as a selective inability to build a visual representation of a word, used in subsequent language processing, in the absence of general visual impairment or speech disorders [48]. It can arise from a variety of disorders of the visual word form system (VWFS), which plays a pivotal role in informing other temporal, parietal and frontal areas of the identity of the letter string. This reading network contains processes for the orthographic recognition of word forms and for the sublexical conversion from orthography to phonology [48]. While it has been proposed that dyslexia is more generally characterized by a disconnection syndrome of the reading network, the neural correlates of these pathways are not well understood.

Neuropsychological studies have demonstrated that the acquisition of reading skills is reflected by progressively greater activation in left occipital, temporal and frontal regions and progressively less activation in posterior right-hemisphere regions [197]. These recruitments depend on the type of words being manipulated. For example, unfamiliar pseudowords (see footnote 8) are thought to increase demands on the sublexical conversion of orthography to phonology, whereas exception words (see footnote 9) rely on lexico-semantic processing. The effect of word type on brain activation also depends upon the task (phonological recognition vs. semantic recognition) being performed [21].

In the case of developmental dyslexia, neuronal abnormalities within the reading network are difficult to interpret, because they appear to depend upon the task, the language, and the type of dyslexia. In the case of acquired dyslexia, caused by pathological or accidental focal brain damage, the neural correlates are usually clearer. Pure alexia is defined as difficulty reading all types of words in the context of preserved writing skills, and typically occurs following left occipitotemporal damage [181]. Phonological dyslexia, defined as an inability to read pseudowords, is usually caused by large cerebral infarcts in the middle left hemisphere affecting temporoparietal and frontal regions. Surface dyslexia, involving difficulties with exception words, is associated with anterolateral temporal lobe atrophy.

Footnote 8: Novel words that have not been encountered before (e.g. floop).
Footnote 9: When the pronunciation of a whole word is inconsistent with that of its parts (e.g. yacht).

3.2.3 Data-set

Twenty control subjects, thirteen high-performing (full-scale IQ > 95) individuals with pure dyscalculia (DC) and nine with dyslexia (DL) [163] participated (controls: 10 female, one female and one male left-handed, one male ambidextrous, ages 21–34 yrs, mean age 25.6 ± 3.0 yrs; DC: 6 female, 1 male left-handed, ages 22–23 yrs; DL: two female, two male left–handed, ages 18–32 yrs, mean age 24.5 yrs). All subjects were free of neurological and psychiatric illnesses and of attention-deficit disorder. All controls denied a history of any calculation difficulties.

The layout in Fig. 3.2 illustrates the self-paced, irregular paradigm used in these experiments. Subjects were exposed visually to simple multiplication problems with single-digit operands, e.g. 4 × 5, and had to decide whether the incorrect solution subsequently offered was, e.g., close for 23, too small for 12, or too big for 27, relative to the correct result of 20. All solutions were within ±50% of the correct answer. Only one solution was presented at a time. The close answer had to be applied for solutions that were within ±25% of the correct result, while the two remaining categories exceeded this threshold. Subjects answered by pressing a button with the index finger of the dominant hand for too small, the middle finger for close, and the ring finger for too big. Identical operand pairs were excluded. The simplest operand pair was 3 × 4, while the most demanding pair was 8 × 9. The order of small vs. large operands was approximately counterbalanced. Presentation times were the following: multiplication problem 2.5s, equals sign (=) 0.3s, solution 0.8s, judgment period up to 4s, and a rest condition with a fixation point of 1s until the beginning of a new cycle. Subjects were encouraged to respond as quickly as possible. The stimulus onset asynchrony (SOA) ranged from around 4s to 8.6s. All subjects were exposed to two different sets of multiplication problems, with an interval of approximately 30 min between sessions 1 and 2, during which time they solved other non-numerical tasks.

Figure 3.2: Mental Arithmetic Task: The five phases of each trial and their associated timings in the paradigm used to study arithmetical abilities. [Figure: an example trial, 4 × 5 =, with the offered solutions 27 (‘too big’), 23 (‘close’) and 12 (‘too small’), and phase durations of 2.5s, 0.3s, 0.8s, up to 4s, and 1s.]

For each problem presentation, the following variables were recorded:

• The two numbers to be multiplied and the incorrect result

• A product-size score LogPs and a problem-difficulty score LogDiff, described next

• A binary variable indicating correct or incorrect answer

If Rc = a × b is the correct product for the multiplication problem a × b and Rd is the displayed incorrect result, then the product size is scored as LogPs = log(Rc). The difficulty score is LogDiff = log( |1.25 Rc − (Rc + |Rc − Rd|)| / |1.25 Rc| ), which measures the closeness of the incorrect result to the ±25% mark and represents the difficulty subjects would have in judging the offered solution as close vs. too big or too small.
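For concreteness, the two scores can be computed directly from their definitions; natural logarithms are assumed below, since the text does not specify the base.

    import numpy as np

    def scores(a, b, Rd):
        # Rc: correct product; Rd: displayed (incorrect) result.
        Rc = a * b
        log_ps = np.log(Rc)
        log_diff = np.log(abs(1.25 * Rc - (Rc + abs(Rc - Rd))) / abs(1.25 * Rc))
        return log_ps, log_diff

    # Example from the text: 4 x 5 = 20, with 23 offered as the 'close' answer.
    # Here |Rc - Rd| = 3, so LogPs = log(20) ~ 3.00 and
    # LogDiff = log(|25 - 23| / 25) = log(0.08) ~ -2.53.
    print(scores(4, 5, 23))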

Data were acquired on a GE 3T MRI scanner (vh3) with a quadrature head coil. After localizer scans, a first anatomical, axial-oblique 3D-SPGR volume was acquired. Slab coordinates and matrix size corresponded to those applied during the subsequent fMRI runs, which used a 3D PRESTO BOLD pulse sequence [209] with phase-navigator correction and the following specifications: echo time 40ms, repetition time 26.4ms, echo train length 17, flip angle 17°, volume scan time 2.64s, number of scans 280, session scan time 12:19 min, 3D matrix size 51 × 64 × 32, and isotropic voxel size 3.75mm. At the end of the study, a sagittal 3D-SPGR scan was acquired with a slice thickness of 1.2mm and an in-plane resolution of 0.94mm.

The first four fMRI scans were discarded, leaving 276 scans for analysis. Raw data were reconstructed off-line. The structural scans were bias-field corrected, normalized to the MNI atlas space and segmented into grey and white matter, while the fMRI scans were motion corrected using linear registration and co-registered with the structural scan in SPM8 [66]. Further motion correction was performed using motion corrected independent component analysis (mcICA) [137]. The fMRI data were then de-noised using a wavelet–based Wiener filter [4] and high-pass filtered to remove artifacts such as breathing, pulsatile effects, and scanner drift.

PART II

Functional Chronometry

CHAPTER 4

MENTAL CHRONOMETRY: THEORY

Man is enabled to find sense in this chaos of experience and discover the mean- ing and measure of this incomprehensive flux of perpetual ’flourishing and per- ishing’ which we call Time.

Dr. K. Bhaskaran Nair (1927–1990).

One important application of revealing the temporal aspects of mental processes has been the field of mental chronometry: the attempt to decompose a perceptual, cognitive or motor task into a sequence of processing stages on the basis of measured response times [61]. fMRI-based mental chronometry has the potential to provide a new type of information that goes beyond the identification of activated brain regions. It enables studying the dynamics of these activated regions during the specific processing stages of a mental task, and provides insight into the links between cognition, behavior and brain activity.

4.1 Significance

The importance of timing information arises when trying to determine the hierarchical structure of signal processing in the brain [88]. In a strictly serially connected neuronal network, activation onset times would give direct information about the hierarchical position of each node within the whole processing chain. In reality, however, the situation is much more complicated, as the network contains feedback connections, through which signals can be modulated at an earlier processing stage or through which some nodes may display several activation “waves”.

Mental chronometry, therefore, studies the role of timing in cognition and the important time windows for different brain functions at a more macroscopic level. This includes understanding how to interpret the timing information in terms of serial versus parallel pathways, and in terms of the hierarchical organization of cortical signal processing. The chronoarchitecture (see footnote 10) of the cerebral cortex has been shown to be highly organized according to its functional modularity, in the sense that functionally related regions exhibit highly correlated and phase–locked metabolic fingerprints of activity.

Also of interest is determining whether cortical activation sequences differ systematically between subjects with different psychological abilities. Considering brain function as a network of dynamic neural circuits that interact to perform computational tasks [33], it is to be expected that the similarities and differences between processing strategies are reflected even more clearly in the timing of functional recruitment than in the localization of the sites of cortical activation [88].

Footnote 10: The term “chronoarchitecture” is used in contrast to cytoarchitecture and myeloarchitecture, which have failed to show distinctions within areas that other techniques, such as metabolic methods, have revealed to be functionally separate.

4.2 Mental Chronometry with BOLD

Initially, mental chronometry was based exclusively on analyzing the behavioral response time or reaction time (RT) as a function of the task condition [180]. More recently, behav- ioral RT information has been complemented with invasive [78] and non-invasive [156] measurements of brain activity.

The basic assumption in fMRI-based mental chronometry is that timing differences in the observed hemodynamic response (HR) are attributable to the underlying neural events. Assessing the degree to which this assumption holds, however, is not simple. There is no detailed description of the intrinsic physiological variability of the HR, leaving uncertainty concerning the accuracy of timing-based fMRI response measures. Repeated studies have shown a relatively large variation in the observed hemodynamic response across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time [117]. Non-linearities in the transfer from a stimulus to the hemodynamic response have been demonstrated [140], questioning the validity of a time-invariant HRF. Also, the current methods only measure timing differences across voxels, not within the same voxel. Therefore, a comparison of response delays is most likely to make sense only at the same position in the brain; otherwise it may only reflect differences in the microvasculature across cerebral regions [136].

Despite the fact that the coupling between the neuronal activation and the measured signal is unknown, it has been shown that: 1) a stronger activation leads to an increased BOLD response; 2) a prolonged activation is accompanied by a prolonged response; and 3) a time difference in the activation onset (e.g. between a sensory and an efferent area) is reflected by a temporal shift in the responses of these areas [117].

Menon et al. [155, 156] showed that “slow” fMRI can trace sequences of neural events surprisingly well, with an effective temporal resolution of 100–200ms, which is adequate for many mental chronometric measurements. By manually selecting a set of ROIs and averaging their fMRI time–courses, collected from subjects performing the mental arithmetic task (cf. Section 3.2.3), Morocz et al. [163] demonstrated that the time–courses contain evidence of the cascade of functional recruitment, as shown in Fig. 4.1.

When such a temporal resolution is acceptable, using fMRI alone to gain information about time and space simultaneously has several practical advantages over EEG/MEG measurements. Firstly, as brain activation is a distributed phenomenon, it is difficult to measure latency distributions accurately from a few spatial measurements. Secondly, EEG and MEG measurements are not very sensitive to long-lasting, sustained processes and are better suited to detecting effects that are closely time-locked to external stimulus onsets.

The interpretability of relative timing differences between arbitrary brain areas can be tested by using several tasks that exert a differential influence on temporal activation or that require execution of the same sub–processes but in a different temporal order. Similarly, by analyzing the dependence of onset latency on experimental parameters at a particular location, any observed systematic timing effect in that area must be attributable to neuronal dynamics because the biophysical parameters do not change [213].

Figure 4.1: Cascade of Functional Recruitment with fMRI: The time–courses from 24 manually selected ROIs (spanning visual, parietal, frontal, subcortical, cerebellar and motor areas) are averaged and plotted from a subject performing the mental arithmetic task (cf. Section 3.2.3). The ROIs are shown by the filled balls in the glass–brain, and their time–courses over the span of one trial are laid out below. A cascade in the recruitment of the different functional modules can be observed from the time–course profiles.

4.3 Chronometry Methods

fMRI chronometry for the onset time–difference across voxels typically examines their cross-correlation function with a reference time–series [191]. The effect of behavioral parameters, measured by reaction time, on onset latency has also been studied in a single region [61] through a cross–correlation analysis. This approach does not isolate the component of the signal due to the stimulus of interest, and hence it is unclear how much the latency estimate is affected by confounding factors. In the case of periodic experimental paradigms, onset latencies have been measured by studying the phase of the Fourier transform, or through the Hilbert transform, of the time–series signal [190].

The activation latency at a voxel can also be estimated by including a first-order Taylor series expansion of the hemodynamic response function (HRF) [94] in a GLM analysis or by including an orthogonal basis derived from a spectrum of time-shifted HRFs [136].
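The Taylor-series device in [94] admits a compact illustration. Since h(t − d) ≈ h(t) − d·h′(t) to first order, fitting a voxel's time–course with both the HRF–convolved regressor and its temporal derivative yields a latency estimate from the ratio of the two loadings. The sketch below assumes that the regressors x1 (HRF) and x2 (its temporal derivative) have already been built; the names are hypothetical.

    import numpy as np

    def latency_estimate(y, x1, x2):
        # Fit y ~ b1*x1 + b2*x2 + baseline by least squares; from the
        # expansion h(t - d) ~ h(t) - d*h'(t), the shift is d ~ -b2/b1.
        X = np.column_stack([x1, x2, np.ones_like(x1)])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        return -b[1] / b[0]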

Estimating HR parameters such as amplitude, latency and dispersion through non-linear regression has been variously proposed [119, 124, 182], wherein the estimated HRF is a non-linear function of these parameters. While these methods can potentially yield a more detailed picture of the hemodynamics, their drawback is that they require expensive non-linear minimization steps at each voxel, and on noisy time–series the estimation algorithms might not converge to the global optima. Moreover, the validity of these results is limited by the questionable biological accuracy of the hemodynamic models.

Data–driven multivariate chronometry has been performed using spatial ICA [10] to investigate the time–course variability in single trials and to detect voxels with unexpected temporal profiles, without requiring a priori knowledge about the shape, nature or coupling of the HR.

In the next chapter, a tool for the visual exploration of the ordering of the cascades of functional recruitment, similar to that of Fig. 4.1, is proposed. Then, in Chapter 6, a robust method to estimate the onset latency at each voxel in a GLM framework is developed.

CHAPTER 5

MENTAL CHRONOMETRY: A VISUAL ANALYSIS

The only reason for time is so that everything doesn’t happen at once.

Albert Einstein (1879–1955).

As the imaging pulse sequences used for fMRI become more efficient in their temporal and spatial resolution, tools that can efficiently depict the cascadic and serial recruitment of the brain during task performance become ever more important. Such mental activity road–maps, shown as time–series for specific brain regions as has been done for years in EEG and MEG, but now enhanced by the tomographic fidelity of MRI, can crucially enhance our understanding of brain physiology [118].

In addition to this, the commonly used statistical methods for fMRI analysis have a large number of parameters that need to be tuned, which in turn profoundly affect the detection of activated brain regions. Therefore, there is a pressing need for a visual analytics tool that allows the neurologist to visualize the raw data, in order to assess the fidelity and veracity of the results obtained from any type of fMRI analysis.

To address these concerns, this chapter presents a software tool to visually analyze the time dimension of brain function with a minimal amount of processing, allowing neurologists to verify the correctness of analysis results and to develop a better understanding of the temporal characteristics of functional behaviour. The system allows studying the time–series data through specific volumes–of–interest in the cerebral cortex, the selection of which is guided by a hierarchical clustering algorithm performed in the wavelet domain.

The organization of the chapter is as follows: the proposed solution is outlined in Section 5.1, while Section 5.2 covers the current literature in the domain of visual analysis methods for spatio–temporal data, with special attention to medical images. Section 5.3 introduces the method for automatically selecting a candidate set of VOIs. Here, I discuss the wavelet-based dissimilarity metric and the hierarchical clustering algorithm. Section 5.4 explains the tool and its usage. In Section 5.5 I present some results obtained by using the method to explore the mental arithmetic data–set (cf. Section 3.2). Finally, in Section 5.6, I conclude with some remarks on the tool and directions of further investigation.

5.1 Outline of Solution

5.1.1 Challenges

A major challenge in the time–dimension visualization of fMRI is the large number of voxels within the brain ($\sim O(10^5)$). It is obviously impossible to examine the time–series through every voxel individually. Typically, the user manually defines a Volume-of-Interest (VOI) in the brain, and computes the mean time–course through that VOI (e.g. Fig. 4.1 in Chapter 4). Given the limitations of existing tools, only a certain number of VOIs can be manually examined and compared in a practical fashion. Therefore, the investigator not only has to decide, a priori, which locations of the brain are “interesting”, but also the shape and extent of these regions, which have a profound influence on the resulting time–courses.

This limits the power of this avenue of exploration. Therefore, for a visual exploration tool to be useful it should:

(i) Provide a quantitative assessment of the quality of a VOI, in terms of the similarity of activity exhibited by the voxels contained therein.

(ii) Guide the user in selecting good VOIs, as defined by the above quality metric.

(iii) Take into account the specific experimental task and the neuro-physiological phenomena being investigated.

Another challenge in developing a meaningful visualization of fMRI data is the nature of the acquired signal itself. As discussed in Chapter 1, the time–series data at each voxel consists of four main components:

(a) A structural component, which represents anatomical features (much like a conventional MR image).

(b) The blood oxygenation level dependent (BOLD) signal, which measures brain metabolism and is the component of interest.

(c) Noise, which tends to be colored with a 1/f spectrum, where f is the frequency.

(d) Slowly varying drifts in the baseline signal.

Therefore, any naïve visualization based on the raw time–courses will be hard to decipher, and some kind of processing of the data is required. However, it is imperative that this processing be kept to a minimum, in order to minimize the amount of bias that is inevitably introduced by any processing and to retain as much information as possible from the raw data.

5.1.2 Proposed Solution

The aim of the tool is to let the user examine the time–series fMRI data and study the temporal aspect of brain activity with a minimal amount of pre-processing and data manipulation. The solution is therefore posed in terms of the problem of VOI selection, by automatically determining a set of candidate VOIs, such that the voxels in each VOI exhibit coherent activity with respect to the experimental task under consideration. These VOIs are determined by clustering together voxels with similar activations in a hierarchical fashion. By navigating through the hierarchy, the user is able to select the set of VOIs that matches his expert intuition, in terms of shape, location, and within-cluster error. An overview of the method is shown in Fig. 5.1.

One methodological contribution is a distance metric that quantifies the dissimilarity between the voxels based on their time–series. The proposed metric adaptively extracts features from the time–series which correspond to the experimental tasks and the neurophysiological phenomena under study. This metric operates in three steps:

(i) It first transforms the acquired time–series into the wavelet domain, thereby de-correlating its different components.

(ii) It projects the wavelet coefficients into a lower dimensional subspace spanned by the features of interest.

[Figure 5.1 diagram: fMRI volumes → wavelet transform of the time courses → projection onto a low-dimensional subspace → hierarchical clustering → VOI hinting; the subspace basis is built from the experiment stimulus function convolved with different HRFs.]

Figure 5.1: Overview of Visual Analytics Tool. The pipeline of the processing stages of the visual analytics tool for examining the temporal aspects of fMRI data.

(iii) It then computes a weighted Euclidean distance between the time–series of two voxels in this lower dimensional subspace.

The weights are selected to emphasize certain features and de-emphasize others, depending upon the experimental task. The tool also allows manual delineation of a VOI, and computes its quality using this dissimilarity metric.

5.2 Related Work

There is a large body of work for visualizing data–sets with temporal dependencies. Most of the prior art in spatio–temporal visualization has been concerned with the problem of dealing with massive data sizes and rendering them accurately and efficiently [85]. It has also been studied in the context of flow-field visualization, with the focus mainly on extracting the relevant features from the data and visualizing them in an interactive fashion

[144, 217]. Aigner et al. [3] survey methods for time–series data through the lens of Visual Analytics. Additionally, they provide a taxonomy that includes the structure of time, the data characteristics and abstraction, and representation (esp. dimensionality). As per their proposed categorization, fMRI data can be classified as linearly ordered time points, univariate and spatial, dynamic and three-dimensional.

In the case of medical data–sets, existing methods concentrate on a comprehensive evaluation of the temporal behavior of the data. Tory et al. [208] discuss several methods for visualizing temporal medical data. The authors discuss the efficacy of surface-based and iconic visualization methods to produce an animation of consecutive time steps that depicts temporal changes of intensity and signal gradient. There are other techniques that use direct volume rendering to provide static visualizations through the construction of transfer functions that best represent the temporal characteristics of the data, thereby effecting a visual and implicit clustering/segmentation of the data in a temporal feature space [217]. Interfaces and interaction often play an important role [210].

In the case of fMRI, there are many tools currently available for analysis and visualization, including AFNI [45], BrainMap [123], the Brede Toolbox [167], SPM [66], and FSL [203]. While all these tools allow for manual selection of VOIs and then display the aggregate temporal response of these VOIs, none of them provide an automated VOI selection system along the lines proposed here.

Wavelets are extensively used in fMRI, along either the spatial or the temporal dimension. Spatially, they are used to obtain a sparse representation of the activation map, and the statistical significance of activation is then computed on these wavelet coefficients [212]. Alternatively, along the temporal dimension, wavelets are used for de-noising and whitening the time–series data [29] and for activity estimation [158].

The concept of measuring similarity between the time–series of voxels by projecting them into a linear subspace spanned by the “signal of interest” is common [83]. However, in these cases the projection is computed in the temporal domain as the correlation of the acquired time–series with a reference (ideal) response. This has the disadvantages that the noise is still colored, and that there is no way to emphasize important features in the signal, like the natural scale of the HRF. In contrast, by moving to the wavelet domain, we are able to effectively whiten the noise, thereby removing spurious correlations in the data, and we can also determine the natural scales at which the HRF has greatest energy.

5.3 VOI Selection

The problem of automatically selecting VOIs is twofold:

(a) Determining the number, location, size and shape of VOIs.

(b) Determining the coherency or goodness of the selected VOIs.

For the first problem, there are, in general, no computational solutions, and only the expert user can decide this depending upon the phenomena being studied. For this purpose, we use a hierarchical clustering of the voxels, and allow the user to navigate this hierarchy in order to select the correct number, as per his requirements. This is further discussed in Section 5.3.2.

To deal with the second problem, the notion of goodness of a VOI requires a measure of similarity of the voxels in the VOI. This metric should satisfy the following properties:

(i) It should quantify the dissimilarity of the activation patterns of the voxels, as given by their time–courses.

(ii) It should be robust against the confounds in the acquired signal, like noise, drift, inhomogeneity, etc.

(iii) It should also quantify the spatial proximity among voxels, encoding the belief that nearby voxels have a higher likelihood of exhibiting similar behaviour, as compared to distant voxels.

The proposed dissimilarity metric satisfies these properties as explained next.

5.3.1 Distance Metric

One of the key requirements of the metric is robustness to noise and other confounds in the acquired time–series data. As briefly mentioned in the introduction to this chapter, the measured signal has the following components:

(i) The structural component.

(ii) The BOLD signal. While the exact shape of the brain hemodynamic response to a stimulus is highly variable across brain regions, stimuli, and subjects, a few typical (canonical) hemodynamic response functions (HRFs) are shown in Fig. 5.2(a).

(iii) Instrumental noise and artifacts due to subject head motion and breathing, which tend to exhibit a 1/f spectrum associated with fractional Brownian motion (fBm) [29].

(iv) Slowly varying drifts in baseline signal intensity, due to temporal variations in the operating characteristics of the MRI scanner and changes in subject physiology like blood pressure, etc.

[Figure 5.2 plots: (a) canonical HRF, intensity vs. seconds; (b) raw time–courses through two voxels, BOLD intensity vs. seconds.]

Figure 5.2: The Raw fMRI Time–Courses. Fig. (a) The shapes of a few typical hemodynamic response functions (HRFs); in reality, the exact shape of the HRF is highly variable. Fig. (b) The raw time–courses (mean shifted) through two voxels (blue and red), both presumably activated. The solid line shows the measured time course; the dashed line is an estimate of the baseline drift of the intensity over time.

An example of the raw time–courses (mean shifted) through two activated voxels is shown in Fig. 5.2(b). Here, the problems of drift and spatial inhomogeneity in the signal baseline are evident.

The observed time-signal $y_x(t)$ at each voxel $x$ is generally modeled as [66]:
$$y_x(t) = \mu_x + \theta_x(t) + s_x(t) + \upsilon_x(t), \quad t = 1 \ldots N. \quad (5.3.1)$$
Here, $s_x(t)$ is the BOLD component, $\theta_x(t)$ is the baseline drift, $\mu_x$ is the structural component intensity at the voxel, and $\upsilon_x(t)$ is correlated noise with the 1/f spectrum of a long-memory fBm noise.
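To make the roles of these four components concrete, the following minimal sketch simulates a single voxel time–series according to Eqn. 5.3.1. The specific choices (a double-gamma HRF, a quadratic drift, an AR(1) process standing in for the 1/f fBm noise, and all the numerical constants) are illustrative assumptions, not the settings used in the actual experiments.

```python
import numpy as np
from scipy.stats import gamma

def simulate_voxel(T=200, TR=2.0, seed=0):
    """Toy voxel time-series per Eqn. 5.3.1: y = mu + theta(t) + s(t) + noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(T) * TR
    mu = 500.0                                        # structural component (constant)
    theta = 5.0 * (t / t[-1]) ** 2                    # slowly varying baseline drift
    stim = (t // 30) % 2                              # 30 s on/off boxcar stimulus
    u = np.arange(0, 32, TR)
    hrf = gamma.pdf(u, 6) - 0.35 * gamma.pdf(u, 12)   # double-gamma HRF stand-in
    s = 3.0 * np.convolve(stim, hrf)[:T]              # BOLD component s_x(t)
    eps = np.zeros(T)                                 # AR(1) noise as a cheap
    for i in range(1, T):                             # surrogate for 1/f fBm noise
        eps[i] = 0.4 * eps[i - 1] + rng.normal()
    return mu + theta + s + eps

y = simulate_voxel()
```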

This structure of the fMRI signal motivates the use of a wavelet representation for three main reasons. One, transforming to the wavelet domain gives a sparse representation of the signal, and allows us the flexibility to weight different aspects of the signal depending upon their scale-space characteristics. Two, it has been shown that the different components of the signal occupy different regions of the time-frequency plane [158]. The baseline drift is restricted to the wavelet coefficients at large scales, since it arises from phenomena that vary gradually over relatively large periods of time as compared to the HRF. By selecting a wavelet basis with $p$ vanishing moments and assuming a polynomial approximation of the drift with order less than $p$, $\theta_x(t)$ will have a sparse representation in this basis. Three, the wavelet transform provides an approximation of the Karhunen–Loève transform (KLT) of the fBm noise. The KLT de-correlates a random process by projecting it onto an orthogonal basis formed by the eigenfunctions of its auto-covariance kernel.

Specifically, if $K_y(t_1, t_2) = E[y(t_1)y^*(t_2)] - E[y(t_1)]E[y(t_2)]$ is the auto-covariance kernel of a stochastic process $y(t)$, then the KLT is the decomposition of the process onto an orthogonal basis $e_i(t)$, $i = 1 \ldots \infty$, such that $y(t) = \sum_i Z_i e_i(t)$. Here $e_i(t)$ are the eigenfunctions of $K_y(t_1, t_2)$, and $Z_i$ are uncorrelated random variables. Therefore, the wavelet transform $\mathcal{W}\{\upsilon_x\}$ of the fBm noise $\upsilon_x(t)$ is composed of coefficients which are almost de-correlated. (Strictly, the correlation of the wavelet coefficients $w^{\upsilon}_{j,k}$ and $w^{\upsilon}_{j,k'}$ decays like $O(|2^{-j}k - 2^{-j}k'|^{2-p})$. Also, the noise coefficients $[c^{\upsilon}_{J,0}, w^{\upsilon}_{J,0}, \ldots, w^{\upsilon}_{1,T/2-1}]$ are well approximated by normally distributed independent random variables, with covariance matrix $\Sigma_\upsilon = \mathrm{diag}(\sigma^2_J, \ldots, \sigma^2_1)$ [29].)

The Cohen–Daubechies–Feauveau 9/7 bi-orthogonal spline wavelets with $p = 4$ vanishing moments are used here because they are symmetric with finite support and linear phase, and do not require special treatment at boundaries. This is implemented using a lifting scheme, which gives a 2× speedup over the standard wavelet transform algorithms [145].
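As an illustration, this decomposition can be reproduced with the PyWavelets package, where the CDF 9/7 wavelet is available under the name 'bior4.4'. Note that pywt uses a filter-bank rather than a lifting implementation, but it yields the same coefficients; this is a sketch, not the tool's actual code.

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
y = rng.normal(size=256)          # stand-in for one voxel's time-series

# Multi-level decomposition with the CDF 9/7 wavelet ('bior4.4' in pywt);
# 'periodization' keeps the total coefficient count equal to len(y).
coeffs = pywt.wavedec(y, 'bior4.4', mode='periodization')

# coeffs = [cA_J, cD_J, ..., cD_1]: coarse approximation first, then detail
# coefficients from coarse to fine scales. Drift energy concentrates in the
# coarse entries; the fBm noise coefficients are approximately de-correlated.
w_y = np.concatenate(coeffs)      # the vector W{y} used below
```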

Applying the wavelet transform $\mathcal{W}$ to both sides of Eqn. 5.3.1, we get:
$$\mathcal{W}\{y_x\} = \mathcal{W}\{\theta_x\} + \mathcal{W}\{s_x\} + \mathcal{W}\{\upsilon_x\}, \quad (5.3.2)$$
where $y_x = (y_x(1)\; y_x(2) \ldots y_x(T))'$, etc.

Now, in order to select the optimal regions of the time-frequency plane and suppress the undesired components of the signal without compromising the desired component, $\mathcal{W}\{y_x\}$ is projected into a lower dimensional subspace spanned by the features of interest, as defined by the experimental task. The experimental task is specified by the stimulus function $p(t)$, giving the onset and duration of each stimulus presented to the subject. For example, it will be a train of Dirac-$\delta$ functions in the case of event stimuli, or a train of box-cars in the case of persistent stimuli.

The set of expected brain responses $\{r_{h_i}(t);\ t = 1 \ldots T\}$ for different hemodynamics is then computed by convolving the stimulus function $p(t),\ t = 1 \ldots T$, with a set of typical HRFs $\{h_i(t);\ i = 1 \ldots B\}$, $B \ll T$, as $r_{h_i}(t) = h_i \star p(t)$. Fig. 5.3(a) shows an experimental task, represented by a Dirac-$\delta$ train (dark blue), convolved with a few HRFs. This set of expected responses $\mathbf{R} = [r_{h_1}\, r_{h_2} \ldots r_{h_B}]$, where $r_{h_i} = (r_{h_i}(1)\; r_{h_i}(2) \ldots r_{h_i}(T))'$, defines a lower ($B$) dimensional subspace $H$ spanned by the features of interest from the experimental task. An orthogonal basis for this subspace $H$ is obtained by the singular value decomposition of $\mathbf{R}_{T \times B} = U \Lambda V'$, where $U = [u_1 u_2 \ldots u_T]$ is a $T \times T$ orthonormal matrix spanning the column space of $H$, $\Lambda$ is a $T \times B$ diagonal matrix of the weights of each basis vector, and $V'$ is a $B \times B$ orthonormal matrix spanning the row space. The singular values $\lambda_{i,i}$ show that most of the volume spanned by $H$ is concentrated in the first few basis vectors of $U$. Fig. 5.3(b) shows the percentage volume with respect to the number of basis vectors of $U$ for a particular experiment.
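A minimal numerical sketch of this construction is given below. The particular family of HRFs (gamma densities with varying peak times) and the use of cumulative squared singular values as the retention criterion are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma
import pywt

def response_subspace(p, TR=2.0, B=8, keep=0.95):
    """Build R = [r_h1 ... r_hB] in the wavelet domain and return an
    orthogonal basis U~ capturing most of its volume, via the SVD."""
    T = len(p)
    u = np.arange(0, 32, TR)
    cols = []
    for peak in np.linspace(4, 8, B):               # a family of 'typical' HRFs
        h = gamma.pdf(u, peak + 1.0)                # gamma pdf peaking near `peak` s
        r = np.convolve(p, h)[:T]                   # r_hi = h_i * p
        cols.append(np.concatenate(
            pywt.wavedec(r, 'bior4.4', mode='periodization')))
    R = np.column_stack(cols)                       # wavelet-domain response set
    U, lam, _ = np.linalg.svd(R, full_matrices=False)
    frac = np.cumsum(lam**2) / np.sum(lam**2)       # cumulative 'volume' proxy
    B_tilde = int(np.searchsorted(frac, keep)) + 1  # number of bases to retain
    return U[:, :B_tilde], lam[:B_tilde]            # U~ and its singular values
```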

Let $\tilde{U}$ denote the first $\tilde{B}$ columns of $U$, defining an orthogonal basis for a subspace $\tilde{H}$, and let $\tilde{y}_x = \tilde{U}'\, \mathcal{W}\{y_x\}$ denote the projection of a voxel's wavelet coefficients onto it. A Euclidean distance metric can then be defined on $\tilde{H}$ as:
$$\|y_{x_1} - y_{x_2}\|_{\tilde{H}} = \|\tilde{y}_{x_1} - \tilde{y}_{x_2}\|_2, \quad (5.3.3)$$

which is compatible with the Euclidean distance metric in the wavelet space. The drawback of this metric is that it takes into consideration neither the different importances of the individual bases $\tilde{u}_i$ of $\tilde{U}$, as expressed by the diagonal matrix $\Lambda$, nor the physical proximity between the voxels $x_1$ and $x_2$. Therefore, we augment this metric to incorporate these characteristics as:
$$\|y_{x_1} - y_{x_2}\|^2_{\tilde{H}} = \frac{\|x_1 - x_2\|^2_2}{\theta^2} + (\tilde{y}_{x_1} - \tilde{y}_{x_2})'\, \tilde{\Lambda}\, (\tilde{y}_{x_1} - \tilde{y}_{x_2}), \quad (5.3.4)$$
where $\tilde{\Lambda}$ is a $\tilde{B} \times \tilde{B}$ diagonal matrix containing the $\tilde{B}$ largest singular values from $\Lambda$. Also, $\theta$ is a scaling factor that weights the spatial proximity relative to signal similarity, and is a user-tunable parameter of the tool.
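For concreteness, a sketch of the resulting dissimilarity computation is given below, assuming each voxel carries its physical coordinates x and its projected coefficient vector ỹ = Ũ′W{y}; the function and variable names are illustrative.

```python
import numpy as np

def voi_dist_sq(x1, x2, ytil1, ytil2, lam_tilde, theta=10.0):
    """Squared dissimilarity of Eqn. 5.3.4: spatial proximity term plus a
    Lambda~-weighted Euclidean distance between projected coefficients."""
    spatial = np.sum((np.asarray(x1, float) - np.asarray(x2, float))**2) / theta**2
    d = ytil1 - ytil2
    return spatial + d @ (lam_tilde * d)   # lam_tilde: the B~ largest singular values

# e.g., with U_til, lam from response_subspace() above:
#   ytil = U_til.T @ w_y    (projection of one voxel's wavelet coefficients)
```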

[Figure 5.3 plots: (a) the stimulus function p(t) and a few response curves r_{h_i}(t), response functions vs. secs; (b) the percentage volume with respect to the number of bases retained.]

Figure 5.3: Fig. (a) The stimulus function $p(t)$ as a Dirac-$\delta$ train (dark blue), and a few curves $r_{h_i}$ from the response set, obtained by convolving $p(t)$ with a theoretical HRF $h_i(t)$. Fig. (b) The percentage of the whole volume of subspace $H$ with respect to the number of basis vectors retained.

5.3.2 Hierarchical Clustering

Candidate VOIs are built as clusters of voxels exhibiting similar activity, as defined by the dissimilarity metric of Eqn. 5.3.4. These voxels are clustered together with the hierarchical agglomerative clustering (HAC) procedure given in Algorithm 5.1.

Algorithm 5.1: Hierarchical Clustering Algorithm
  Initialization: for each voxel x_i, create one cluster c_i of size n_i = 1; associate each c_i with a time–course Y[i].
  while the number of clusters is greater than the specified value do:
    Find two clusters c_i and c_j that are spatially adjacent to each other and merge them into a new cluster c_k = (c_i, c_j), if and only if Var[c_k] is minimum over all i, j.
    Remove clusters c_i and c_j from the set of clusters, and add c_k.

If a cluster $c_k$ has $n_k$ voxels, then the cluster mean $\mu_k$ has physical location $x_{\mu_k} = \sum_{i=1}^{n_k} x_i / n_k$ and feature vector $\tilde{y}_{\mu_k} = \sum_{i=1}^{n_k} \tilde{y}_i / n_k$. The mean of a new cluster $c_k = \{c_i, c_j\}$ can be efficiently computed as
$$\mu_k = \frac{n_i \mu_i + n_j \mu_j}{n_i + n_j}. \quad (5.3.5)$$
The variance of a cluster $c_k$ under the dissimilarity metric is defined as:
$$\mathrm{Var}[c_k] = \frac{1}{n_k} \sum_{i=1}^{n_k} \|y_{x_i} - y_{\mu_k}\|^2_{\tilde{H}}, \quad (5.3.6)$$
where $\mu_k$ is the mean of $c_k$. By the variance separation theorem, the combined variance of two clusters $c_i$ and $c_j$ can be efficiently computed as:
$$\mathrm{Var}[c_k = \{c_i, c_j\}] = \frac{n_i \mathrm{Var}[c_i] + n_j \mathrm{Var}[c_j]}{n_i + n_j} + \frac{n_i n_j \|y_{\mu_i} - y_{\mu_j}\|^2_{\tilde{H}}}{(n_i + n_j)^2}. \quad (5.3.7)$$
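The following sketch shows how these running statistics support constant-time cluster merges via the variance-separation identity; for brevity it applies only the feature-space part of the metric, and the Cluster container is an illustrative assumption.

```python
import numpy as np

class Cluster:
    """Size, mean feature vector and within-cluster variance (Eqn. 5.3.6)."""
    def __init__(self, feat, var=0.0, n=1):
        self.feat, self.var, self.n = np.asarray(feat, float), var, n

def merge(ci, cj, lam_tilde):
    """Merge two clusters using Eqns. 5.3.5 and 5.3.7 (no pass over members)."""
    n = ci.n + cj.n
    feat = (ci.n * ci.feat + cj.n * cj.feat) / n            # Eqn. 5.3.5
    d = ci.feat - cj.feat
    between = ci.n * cj.n * (d @ (lam_tilde * d)) / n**2    # separation term
    var = (ci.n * ci.var + cj.n * cj.var) / n + between     # Eqn. 5.3.7
    return Cluster(feat, var, n)
```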

Since the dissimilarity metric penalizes voxels that are far away in physical space, and hence are not likely to be clustered together, the clustering algorithm can be accelerated by using an octree [5] decomposition of the physical brain volume, as per Algorithm 5.2.

Algorithm 5.2: Octree Clustering Algorithm
  Start at the lowest level L = 0 of the octree (i.e. the leaf nodes).
  while the root node of the octree is not reached do:
    Perform hierarchical clustering in each node independently to get a set of clusters {c_i} for each node.
    If all new clusters have an in-cluster variance greater than a certain threshold κ_L, then union all the per-node cluster hierarchies, and move to the next level L + 1 in the octree.

If $\Gamma_1$ is the cluster tree generated by hierarchical clustering on the full volume, and $\Gamma_2$ is that generated by clustering on the octree, then for each cluster $c$ in $\Gamma_1$ we define the cluster error as $\rho(c) = \min_{d \in \Gamma_2} |(c \setminus d) \cup (d \setminus c)| / |c \cap d|$, where $\setminus$ is set difference and $|\cdot|$ is set cardinality. Therefore, $\rho(c)$ measures the smallest error in the overlap of the voxels of cluster $c \in \Gamma_1$ with all clusters in $\Gamma_2$. In our experiments, this acceleration resulted in a 3× speedup in clustering and a cluster error of less than 15%, on average.
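Since ρ(c) is just a ratio of set sizes, it reduces to a one-liner over voxel-index sets; a minimal sketch:

```python
def cluster_error(c, gamma2):
    """rho(c): smallest |symmetric difference| / |intersection| between a
    cluster c of Gamma_1 and the clusters of Gamma_2 (clusters as sets)."""
    return min(len(c ^ d) / len(c & d) for d in gamma2 if c & d)

# cluster_error({1, 2, 3, 4}, [{1, 2, 3}, {7, 8}])  ->  1/3
```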

5.4 User Interaction

The layout of the user interface is shown in Fig. 5.4, and it consists of the following salient components:

(a) The main window showing a 3D visualization of a high-resolution MR image of the brain (as volumetric, cortical surface or orthogonal cutting planes). The user can switch between these three views, and the VOIs are overlaid as 3D blobs on this rendering. It also has the three 2D orthographic views (Sagittal, Coronal, Axial); the VOIs are displayed in this view as 2D blobs. (cf. Fig. 5.4(a))

(b) Functionality to visualize the mean time–series of each cluster in cine (video) mode, by modulating the intensity of the cluster with the value of the mean time–series at that point. This feature gives the user additional power in searching for patterns in the temporal responses across different regions of the brain.

(c) A panel showing the mean time–course (blue) through the selected VOIs. Around the mean time–course, it also shows an envelope (grey region bounded by black lines) of ±1 standard deviation at each time point, computed from the time–courses of all the voxels in the cluster as follows: if $\{x_i\}$ is the set of $n$ voxels in a VOI, with measured time–series $y_{x_i}(t)$, then the standard deviation for the time–series of the VOI is $\sigma(t) = \left( \frac{1}{n}\sum_i [y_{x_i}(t)]^2 - \left[\frac{1}{n}\sum_i y_{x_i}(t)\right]^2 \right)^{1/2}$. This envelope gives the user a rough estimate of the dispersion of the time–series within the VOI and serves as a visual indication of the VOI quality. (cf. Fig. 5.4(b))

(d) A tree view for navigating through the cluster hierarchy to select among the automatically generated VOIs. Initially, the tool suggests a set of clusters (meeting a certain in-cluster variance threshold) as candidate VOIs. The user can examine these clusters, their time–courses and their associated quality metrics (in-cluster variance and the ±1 standard deviation envelope). If he is not satisfied with them, he can navigate either up the hierarchy, merging multiple clusters to get a larger VOI, or down the hierarchy, sub–dividing a cluster to get smaller VOIs. (cf. Fig. 5.4(c))

(e) VOI selection tools, similar to most standard MRI viewing tools, such as rectangle, ellipse, polyline and free–form selection. With these tools, the user can manually select a VOI, and view its mean time–course and associated quality metrics.

5.5 Results

This section presents an application of the tool to help an investigator refine hypotheses, validate the results of automated analyses, and better understand the temporal characteristics of the hemodynamic response function for the mental arithmetic task (cf. Section 3.2 of Chapter 3).

One application of this tool is to determine the correct parameters for an SPM–style analysis of the data. For example, consider a hypothesis test for finding those voxels activated only during visual presentation, and not during aural presentation. Fig. 5.5 shows the generated activation maps for different parameters of the analysis. Here α is the statistical significance level, while FDR is the False Discovery Rate method for performing correction for the multiple comparisons problem (cf. Section 2.4). It is not easy to determine the validity of the activations in Fig. 5.5(a) vs. (b). The aim is to eliminate spurious activation foci, without discarding true activations. For example, consider the area highlighted by the red arrow in Fig. 5.5(b). It is marked active in Fig. 5.5(b) but not in Fig. 5.5(c). However, from the fact that the VOI around this region had an in-cluster variance of ∼20, and a ±1 std. dev. envelope of almost 10× the mean time–course intensity (Fig. 5.5(d)), we were able to confirm that it was in fact an incorrect activation, corroborated by the fact that the region (ldPFC) is not known to be associated with visual tasks, but rather with number processing and working memory. Compare this with a VOI of a truly activated region in the visual area of the brain (Figs. 5.5(e)–(f)), with an in-cluster variance of ∼5, and its ±1 std. dev. envelope < 25% of signal intensity.

Another significant application of the tool is to study the ordering of the recruitment of functional substations to perform a task. For our data–set, by selecting from the VOIs suggested by the tool, we were able to observe a temporal cascade in the brain activation pattern that helped us not only corroborate extant theories but also refine our understanding of brain function [163]. This temporal ordering of activity patterns in brain regions is shown in Fig. 5.6.

5.6 Conclusion

With this tool, we have tried to address the important problem of visualizing the time dimension of fMRI data, in order to understand temporal relationships in brain function. To achieve this, a solution to the VOI selection problem was proposed by merging an algorithmic VOI selection system with user-driven feedback, in order to leverage the user’s expert understanding. An activation dissimilarity metric was developed that captures the context of the neuro–functional phenomena being investigated, through a transform into wavelet space and a projection onto a subspace spanned by the expected behaviour. The tool presents a candidate set of VOIs to the user through a hierarchical clustering procedure, and the user can navigate through this hierarchy to select VOIs. This VOI selection is performed interactively, where both the quality metrics of the VOI and the expert knowledge of brain function guide the user in his visual analysis task.

Further refinements must include more intuitive and navigable methods of presenting the VOI hierarchy tree, through the use of tree-maps and similar abstractions. One aim was to minimize the bias introduced into any analysis by the data processing step and to retain as much information from the raw data as possible. Though we believe this aim was achieved, a method for quantitatively estimating the biases would be desirable. Of interest also are VOI selection methods that do not require knowledge of the experimental paradigm to judge time–course similarity, such as ICA-based methods.


Figure 5.4: User Interface of the Visual Analytics Tool. Fig. (a) The main window showing the structural volume overlaid with the clusters in 3D and 2D orthographic views. Fig. (b) Cluster time–series displays. Fig. (c) Navigation pane for the cluster hierarchy.

[Figure 5.5 panels: (a) α = 0.001 with no correction for multiple comparisons; (b) α = 0.01 with FDR correction; (c) VOI around an inactive region; (d) mean time–course and ±1 std. dev. envelope; (e) VOI around an active region; (f) mean time–course and ±1 std. dev. envelope.]

Figure 5.5: Visual Confirmation of SPM Results. Figs. (a)–(b) Maximum intensity projections of the activity maps for two different settings of the analysis procedures; darker shades of grey indicate higher levels of activation. Fig. (c) A VOI selected from the set generated by the tool. Fig. (d) The mean time–course of the VOI and its ±1 std. dev. envelope, suggesting that this region of the brain is not activated. Fig. (e) Another VOI selected from the set generated by the tool. Fig. (f) The mean time–course of the VOI and its ±1 std. dev. envelope, suggesting that this region of the brain may be activated.


Figure 5.6: Visual Analysis of Recruitment Cascade. Fig. (a) Six VOIs (as 3D blobs) overlaid on the three orthogonal planes (cutting planes) through the structural volume. Fig. (b) The corresponding mean time–courses through the selected VOIs, along with their ±1 std. dev. envelopes. Here, a temporal cascade (green lines) in activation can be seen.

CHAPTER 6

MENTAL CHRONOMETRY: MEASURING LATENCY IN BRAIN ACTIVITY

Statistics has been the most successful information science. Those who ignore statistics are condemned to reinvent it.

Bradley Efron (1938–)

Mass–univariate analysis methods based on general linear models (GLM) of brain response and linear least squares estimation of their parameters are very popular because of their computational efficiency, statistical simplicity and explanatory power (cf. Section 2.4 of Chapter 2). While most GLMs are used to estimate the amplitude of the hemodynamic response (HR) to a stimulus at each voxel, it is possible to estimate the response latency using a first-order Taylor series expansion of the hemodynamic response function (HRF) in the GLM [94]. This estimator, however, is numerically unstable and biased. Here, we suggest a low-bias estimator for latency and provide an analytical formulation for its variance, needed for deriving confidence intervals.

6.1 Outline of Solution

6.1.1 Motivation

Using GLM methods, it is possible to study the effects of experimental parameters on functional activation in one of two ways: a) parametric effect analysis, or b) factorial analysis. The first method is used when testing whether a real-valued (interval) experimental parameter $p$ has a statistically significant effect $f(p)$ on the amplitude of activation, by adding a regressor weighted by $f(p)$ (typically, a polynomial function) and testing its effect. One drawback is that the correct relationship $f$ between the parameter and amplitude may not be known a priori. Moreover, it assumes that the parameter modulates only the amplitude of the HR while all other aspects remain unchanged, which is known not to be the case [156]. Therefore, it only tests the effect of the parameter on the amplitude, not the latency, of the response. The other alternative, a factorial analysis, is used with categorical parameters, by adding a regressor corresponding to each level of the variable. The difference in response at each level can be used to deduce the presence of an effect on both amplitude and latency. While this method does not suffer from the drawbacks of the parametric effect analysis, it cannot be used for real-valued parameters.

6.1.2 Proposed Solution

Here, a method is proposed that combines the strengths of both approaches by measuring the effect of an interval parameter on both HR amplitude and latency, without requiring that the relationship to test for be specified a priori. The idea, as explained in Section 6.2, is to quantize the real–valued parameter into a finite number of levels and analyze it with a factorial design. The loss in statistical power that would result from such a partitioning of the design is avoided by a regularization of the estimation procedure, which significantly improves the quality of the inferences.

In Section 6.3 the method is validated on simulated data and is applied to the study of visuo–spatial working memory, as described in Section 3.1.

6.2 Method

Let $s_i(t)$, $i = 1 \ldots q$, be the stimulus function representing the onsets and durations of the neurological stimuli corresponding to a task of type $i$. In conventional analysis of fMRI data, the following two assumptions are made: a) the HR is linear; b) the HRF is spatially and temporally invariant, leading to the following model for the observed signal $Y(t)$ at each voxel:
$$Y(t) = \sum_{i=1}^{q} \left[ \beta_i x_i(t) + \gamma_i \dot{x}_i(t) \right] + \epsilon(t). \quad (6.2.1)$$
Here, $x_i(t) = s_i(t) \star h(t)$, $i = 1 \ldots q$, is the expected BOLD response (with no lag) to $s_i(t)$, obtained by convolving it with a typical HRF $h(t)$. By including the first-order Taylor series expansion $x_i(t + \tau) \approx x_i(t) + \tau \dot{x}_i(t)$, the model is able to explain a certain amount of delay in the observed response. The coefficient of regressor $x_i$ is $\beta_i$ and that of $\dot{x}_i$ is $\gamma_i$. The noise term $\epsilon(t)$ is assumed to be normal, colored, and is typically modeled as an AR(1) process.
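A sketch of the corresponding design-matrix construction follows; the canonical double-gamma HRF and its numerically differentiated derivative are illustrative stand-ins for whatever HRF is used in practice.

```python
import numpy as np
from scipy.stats import gamma

def design_matrix(stims, TR=2.0):
    """T x 2q design matrix X = [x_1, xdot_1, ..., x_q, xdot_q] of Eqn. 6.2.1:
    each stimulus function convolved with an HRF and with its derivative."""
    u = np.arange(0, 32, TR)
    h = gamma.pdf(u, 6) - 0.35 * gamma.pdf(u, 12)   # canonical-style HRF
    hdot = np.gradient(h, TR)                        # dh/dt (finite differences)
    cols = []
    for s in stims:                                  # each s: length-T stimulus function
        T = len(s)
        cols.append(np.convolve(s, h)[:T])           # x_i = s_i * h
        cols.append(np.convolve(s, hdot)[:T])        # xdot_i = s_i * dh/dt
    return np.column_stack(cols)
```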

If $T$ is the number of fMRI scans in the session, the GLM can be expressed in matrix notation as
$$\mathbf{y} = X\vec{\beta} + \vec{\epsilon},$$
where $X$ is the $T \times 2q$ design matrix $X = [x_1\, \dot{x}_1 \ldots x_q\, \dot{x}_q]$, with $x_i = (x_i(1) \ldots x_i(T))'$. Also, $\vec{\beta} = (\beta_1, \gamma_1, \ldots, \beta_q, \gamma_q)'$ is the coefficient vector, and $\vec{\epsilon} = (\epsilon(1) \ldots \epsilon(T))'$ is the noise, distributed as $\mathcal{N}(0, \sigma^2 \Sigma)$. The Gauss–Markov estimate is
$$\hat{\vec{\beta}} = (X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}\mathbf{y} \quad \text{with} \quad \mathrm{Var}[\hat{\vec{\beta}}] = \sigma^2 (X'\Sigma^{-1}X)^{-},$$
where $(\cdot)^{-}$ is the pseudo-inverse operation.

6.2.1 Robust Estimation of Latency

This model provides an estimate of the hemodynamic latency $\tau_i$ as [94]:
$$\tau_i \approx \frac{2\alpha_1}{1 + \exp(\alpha_2 \rho_i)} - \alpha_1, \quad \text{where} \quad \rho_i = \frac{\gamma_i}{\beta_i}. \quad (6.2.2)$$
The non-linear transformation through the logistic function is used to correct for the error due to the neglected higher order terms of the Taylor expansion, and the values of $\alpha_1, \alpha_2$ are determined empirically. It can be seen, however, that the estimate $\hat{\rho}_i = \hat{\gamma}_i / \hat{\beta}_i$ is Cauchy distributed and therefore biased. It also becomes numerically unstable when $\hat{\beta}_i$ is small.

Therefore, we propose a low-bias and stable estimator as follows. Since $x_i(t)$ is orthogonal to $\dot{x}_i(t)$, $\hat{\beta}_i$ and $\hat{\gamma}_i$ are Gaussian variables with correlation roughly zero, indicating that they are almost independent. Therefore $E[\hat{\rho}_i] \approx E[\hat{\gamma}_i]\, E[1/\hat{\beta}_i]$. Taking a first order Taylor series expansion of $1/\hat{\beta}_i$ about $\beta_i$, and using the fact that it is unbiased, we get:
$$E[\hat{\rho}_i] = \rho_i \left( 1 + \frac{\mathrm{Var}[\hat{\beta}_i]}{\beta_i^2} \right) \approx \rho_i \left( 1 + (t_{\beta_i})^{-2} \right), \quad (6.2.3)$$
where
$$t_{\beta_i} = \frac{\hat{\beta}_i}{\sqrt{\mathrm{Var}[\hat{\beta}_i]}}$$
is the t–score for the estimate of $\beta_i$. This yields the following corrected estimate for the value of $\rho$ to be used in Eqn. 6.2.2: $\hat{\rho}_i^{\mathrm{corr}} = \hat{\rho}_i \left( 1 + (t_{\beta_i})^{-2} \right)^{-1}$. This correction not only un-biases the estimate of $\rho_i$, but also conditions it numerically when the t–score of $\hat{\beta}_i$ is low.

An approximate estimate of the variance of $\hat{\tau}_i := \tau(\hat{\beta}_i, \hat{\gamma}_i)$ is obtained by taking its first–order Taylor expansion around $\beta_i$ and $\gamma_i$, and using the fact that their estimates are unbiased and uncorrelated, to give:
$$\mathrm{Var}[\hat{\tau}_i] \approx \mathrm{Var}\left[ \tau(\beta_i, \gamma_i) + \begin{pmatrix} \dfrac{\partial \tau}{\partial \beta_i} & \dfrac{\partial \tau}{\partial \gamma_i} \end{pmatrix} \begin{pmatrix} \hat{\beta}_i - \beta_i \\ \hat{\gamma}_i - \gamma_i \end{pmatrix} \right] \quad (6.2.4)$$
$$\approx \left( \frac{\partial \tau}{\partial \beta_i} \right)^2 \mathrm{Var}[\hat{\beta}_i] + \left( \frac{\partial \tau}{\partial \gamma_i} \right)^2 \mathrm{Var}[\hat{\gamma}_i]. \quad (6.2.5)$$

Proofs for these equations are given in Appendix A.
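A direct transcription of the corrected estimator and its delta-method variance is sketched below; since α1 and α2 are determined empirically, the values used here are placeholders.

```python
import numpy as np

def corrected_latency(beta_hat, gamma_hat, var_beta, a1=4.0, a2=2.0):
    """Latency via Eqn. 6.2.2 with the bias-corrected rho of Eqn. 6.2.3."""
    t_sq = beta_hat**2 / var_beta                       # squared t-score of beta_hat
    rho = (gamma_hat / beta_hat) / (1.0 + 1.0 / t_sq)   # rho^corr
    return 2 * a1 / (1 + np.exp(a2 * rho)) - a1

def latency_variance(beta_hat, gamma_hat, var_beta, var_gamma, a1=4.0, a2=2.0):
    """First-order (delta-method) variance of tau_hat, per Eqn. 6.2.5."""
    rho = gamma_hat / beta_hat
    e = np.exp(a2 * rho)
    dtau_drho = -2 * a1 * a2 * e / (1 + e)**2           # d tau / d rho
    d_beta = dtau_drho * (-gamma_hat / beta_hat**2)     # chain rule: d tau / d beta
    d_gamma = dtau_drho / beta_hat                      # d tau / d gamma
    return d_beta**2 * var_beta + d_gamma**2 * var_gamma
```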

6.2.2 Parametric Effects with Factorial Designs

The model of Eqn. 6.2.1 can be used to test the effect $f(p(t))$ of an experimental parameter $p(t)$ by adding a regressor of the form $x_i(t) = f(p_i) \star h(t)$. The corresponding $\hat{\beta}_i$ reflects the contribution of this parameter towards the amplitude of the response, while $\hat{\tau}_i$ gives the total delay of the HR to this stimulus. Note that it does not measure the change in latency with change in the experimental parameter, i.e. it fails to characterize the effect of the parameter on the latency.

The solution is to quantize $p$ into $n_p$ levels and treat it like a categorical variable, resulting in a factorial design with one regressor corresponding to each level. The most extreme case would be to treat each value of the parameter as an individual level. In matrix notation, $\mathbf{y} = X_a \vec{\beta}_a + \vec{\epsilon}$, where $X_a$ is the design matrix containing the parametric variable partitioned into one regressor per level, and $\vec{\beta}_a$ is the regression coefficient vector. The original design matrix $X = X_a C$, where $C_{k,l}$ gives the weight of the $k^{th}$ regressor of $X_a$ toward the $l^{th}$ regressor of $X$. It is easy to verify that $\hat{\vec{\beta}} = D \hat{\vec{\beta}}_a$, where $D = C^{-} (X_a' \Sigma^{-1} X_a)^{-} C\, C^{-\prime} X_a' \Sigma^{-1} X_a$.

Unfortunately, it is hard to make reliable inferences with this model due to the inflated variance of the estimate caused by the increased model degrees of freedom, given by the trace of the projection matrix $P_{X_a} = X_a (X_a' \Sigma^{-1} X_a)^{-} X_a' \Sigma^{-1}$. Therefore, to find a tradeoff between the flexibility of the model and its statistical power, we regularize it such that the estimates of the coefficients $\{(\beta_i, \gamma_i)\}_{i=1}^{n_p}$ for parameter $p$ are normally distributed around their mean values for $p$ with variance $\sigma^2_\beta \Sigma_\beta$. Here $\Sigma_\beta$ is a diagonal matrix representing the relative scale of variation in $\beta_i$ with respect to that of $\gamma_i$.

The Gauss–Markov estimate for $\vec{\beta}_a$ results in the following ridge-regression formulation:
$$\hat{\vec{\beta}}_a(\lambda) = \min_{\vec{\beta}_a} \left[ \mathbf{y} - X_a \vec{\beta}_a \right]' \Sigma^{-1} \left[ \mathbf{y} - X_a \vec{\beta}_a \right] + \lambda \left[ \vec{\beta}_a - DD^{-}\vec{\beta}_a \right]' \Sigma_\beta^{-1} \left[ \vec{\beta}_a - DD^{-}\vec{\beta}_a \right]$$
$$= \min_{\vec{\beta}_a} \left[ \mathbf{y} - X_a \vec{\beta}_a \right]' \Sigma^{-1} \left[ \mathbf{y} - X_a \vec{\beta}_a \right] + \lambda\, \vec{\beta}_a' Q \vec{\beta}_a$$
$$= \left( X_a' \Sigma^{-1} X_a + \lambda Q \right)^{-} X_a' \Sigma^{-1} \mathbf{y}, \quad (6.2.6)$$
where $Q = [I - DD^{-}]'\, \Sigma_\beta^{-1}\, [I - DD^{-}]$. Here, $\lambda$ represents the ratio $\sigma^2 / \sigma^2_\beta$. A value of $\lambda = 0$ indicates a flat prior on $\vec{\beta}_a(\lambda)$, and the solution corresponds to the OLS estimate $\vec{\beta}_a$ without regularization.

The model degrees of freedom, given by the trace of the projection matrix $\mathrm{Tr}\{P_{X_a}(\lambda)\}$, is a decreasing function of $\lambda$. Here, the projection matrix
$$P_{X_a}(\lambda) = X_a \left( X_a' \Sigma^{-1} X_a + \lambda Q \right)^{-} X_a' \Sigma^{-1}.$$
Also, $\hat{\vec{\beta}}_a(\lambda)$ is efficiently calculated as
$$\hat{\vec{\beta}}_a(\lambda) = \left[ I + \lambda\, (X_a' \Sigma^{-1} X_a)^{-} Q \right]^{-} \hat{\vec{\beta}}_a(0).$$
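A compact sketch of the regularized estimator follows, with numpy's pinv standing in for the pseudo-inverse $(\cdot)^{-}$; matrix shapes are simplified (D treated as square) and all names are illustrative.

```python
import numpy as np

def make_Q(D, Sigma_beta_inv):
    """Q = [I - D D^-]' Sigma_beta^{-1} [I - D D^-]."""
    M = np.eye(D.shape[0]) - D @ np.linalg.pinv(D)
    return M.T @ Sigma_beta_inv @ M

def ridge_glm(Xa, y, Sigma_inv, Q, lam):
    """Eqn. 6.2.6: regularized estimate, projection matrix and model dof."""
    A = np.linalg.pinv(Xa.T @ Sigma_inv @ Xa + lam * Q)
    beta_a = A @ Xa.T @ Sigma_inv @ y
    P = Xa @ A @ Xa.T @ Sigma_inv          # projection matrix P_Xa(lambda)
    return beta_a, P, np.trace(P)          # trace(P) = model degrees of freedom
```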

6.2.3 Hyper-parameter Selection

The diagonal values of $\Sigma_\beta$ are estimated by first computing the OLS values of $\hat{\vec{\beta}}_a(0)$, and then setting $\Sigma_\beta(i, i) = 1$ and $\Sigma_\beta(i+1, i+1) = \mathrm{Var}[\hat{\gamma}_i] / \mathrm{Var}[\hat{\beta}_i]$.

The optimal value $\lambda^*$ of $\lambda$ is the one at which the mean squared error (MSE) of the estimates is smallest, which, when using linear LS regression, is well-approximated by Mallows' $C_p$ statistic [146]:
$$C_p(\lambda) = \mathbf{y}' R(\lambda)\, \mathbf{y} + 2\, \frac{\mathrm{Tr}\{R(\lambda)\}}{N} \cdot \frac{\mathbf{y}' R(0)\, \mathbf{y}}{\mathrm{Tr}\{R(0)\}}, \quad (6.2.7)$$
where $R(\lambda) = [I - P_{X_a}(\lambda)]$ is the residual-forming matrix.
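Selecting λ* then amounts to evaluating the statistic over a grid of λ values; a minimal sketch following Eqn. 6.2.7 (R_of is an assumed helper returning the residual-forming matrix for a given λ):

```python
import numpy as np

def mallows_cp(y, R_lam, R0):
    """Mallows' Cp (Eqn. 6.2.7) for one lambda; R_lam and R0 are the
    residual-forming matrices at lambda and at lambda = 0."""
    N = len(y)
    sigma2 = (y @ R0 @ y) / np.trace(R0)               # noise-variance surrogate
    return y @ R_lam @ y + 2 * (np.trace(R_lam) / N) * sigma2

# lam_star = min(lam_grid, key=lambda l: mallows_cp(y, R_of(l), R_of(0.0)))
```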

The noise variance $\sigma^2$ and correlation $\Sigma$ are estimated for each value of $\lambda$ as follows: (a) Set $\hat{\Sigma}(\lambda) = I$ and compute the projection matrix $\hat{P}(\lambda)$, the residuals $\mathbf{r} = [I - \hat{P}(\lambda)]\mathbf{y}$, and $\hat{\sigma}^2(\lambda) = \mathbf{r}'\mathbf{r} / \mathrm{Tr}\{I - \hat{P}(\lambda)\}$. (b) Obtain an empirical estimate of the auto-correlation of the noise as $\hat{\phi}(t) = \sum_{i=t+1}^{n} r_{i-t}\, r_i / \mathbf{r}'\mathbf{r}$. (c) Treat the noise as an AR(1) process and solve the Yule–Walker equations for the AR(1) coefficient $\hat{\theta}$ from $\hat{\phi}(t)$. (d) Spatially smooth the estimated $\hat{\theta}$ in order to reduce its variance. (e) Reconstruct the noise correlation matrix $\hat{\Sigma}(\lambda)$ from the AR(1) coefficient as $\hat{\Sigma}(\lambda)_{i,j} = \hat{\theta}^{|i-j|} / \sqrt{1 - \hat{\theta}^2}$. (f) Repeat the regression with these new estimates of the noise covariance.

6.3 Results

First, in Section 6.3.1 a quantitative validation of the estimation algorithms is provided using simulated data. Then, the results of the method applied to the visuo–spatial working memory (VSWM) data–set are described in Section 6.3.2.

6.3.1 Simulated Data

For a stimulus function $s(t)$ consisting of a train of delta functions, its HR was generated as $y(t) = \sum_u h(t-u)\, s(u - \tau(p(u)))\, \beta(p(u)) + \epsilon(t)$. Here, $\beta(p)$ is the modulation by the experimental parameter $0 < p(t) \leq 10$ of the HR amplitude, while $\tau(p)$ is its effect on latency. Both functions were modeled as cubic polynomials. The noise $\epsilon(t)$ was generated by an AR(1) process with coefficient 0.3. The empirical MSE for the latency and amplitude estimators, with and without regularization, is plotted against SNR (defined as $20\log(\|y\|_2 \sigma^{-1})$) at different levels of quantization $n_p$ in Fig. 6.1. The reduction in MSE for both latency and amplitude after regularization is apparent, and the method is robust even at relatively low SNR. It is also observed that initially increasing the levels of quantization (from 5 to 10) reduces the MSE due to an improved fit, but beyond a certain point it begins to degrade ($n_p = 15$, in this case) due to overfitting.


Figure 6.1: MSE of Regularized Estimator. MSE with respect to SNR of amplitude (left) and latency (right) for the regularized and unregularized estimators at $n_p = 5, 10, 15$ levels of quantization. Legend. Unregularized: $n_p = 5$ – circles, $n_p = 10$ – asterisks, $n_p = 15$ – filled dots; Regularized: $n_p = 5$ – solid line, $n_p = 10$ – dotted line, $n_p = 15$ – dot-dashed line.

Fig. 6.2 shows parametric effects on latency and amplitude estimated using the regularized and unregularized estimators ($n_p = 10$). It is seen that while the mean value from both methods is close to the true value, the spread of values (red filled box) for the regularized method is smaller, indicating much lower variance. We also observed that the bias of our corrected latency estimator was about 70% less than that of the original estimator of Henson et al. [94], and the empirical variance of $\hat{\tau}$ was $(1 \pm 0.15)$ times the analytical variance from Eqn. 6.2.4.

6.3.2 Latency Analysis for VSWM Task

This method was applied to the fMRI data–set of 8 children aged 7 to 11, to study visuo–spatial memory (VSM) maintenance and manipulation, as described in Section 3.1.

In order to study the functional differences for memory recall and manipulation (“forward” case vs. “backward” case) and to identify functional changes in the VSM due to repetition, such as habituation, parameterized by experiment time T = 0–10 mins, the data were analyzed using:

[a] A standard GLM with effects for T modelled by linear, quadratic and cubic regressors, and two levels for recall direction D = 1, 2.

[b] A standard GLM method using a factorial design with 10 levels of quantization for T and 2 levels for D.

[c] The same design, but using the regularized estimation developed here.

We performed a group–level fixed–effects analysis for the activation amplitudes estimated across the 8 subjects, to identify common activation foci. Maximum intensity projections (MIP) of the t-scores (p < 0.05, FDR corrected) for the activation amplitude during the instruction phase of the trial are shown in Fig. 6.3 for each of the three methods.

Figure 6.2: Quantitative Evaluation of Parametric Effects. The effects of experimental parameter p on amplitude (left, a.u.) and latency (right, s) were simulated (solid black line). Effects are shown for the unregularized (blue, dash-dotted line) and regularized (red, dashed line) estimators, along with their 95% confidence interval (CI) bands.

While method [a] strictly cannot be compared with [b] and [c], we observed more activated regions using methods [b] and [c] than [a] at the same significance level. One explanation for this could be that there is a parametric effect in the activation that cannot be explained by a cubic polynomial model, and also that method [a] assumes a fixed delay across all trials. This clearly demonstrates the benefits of testing for effects of real-valued parameters by treating them as categorical parameters. Method [c] exhibited yet larger activation foci than method [b] (12% suprathreshold voxels as percentage of intracranial volume, t-score maximum = 15.23, average = 5.17 vs. 8%, 13.81 and 3.76 respectively) partially due to the larger variance of method [b], reducing its t-scores for the activation amplitudes.

For the purposes of further exposition, we shall consider a locus in the intra-parietal sulcus (IPS), located by the red arrow in Fig. 6.3, that exhibited high t-scores across all three methods and is known to process spatial attributes of numerous cognitive tasks. An axial slice of the group-wise latency assessed using method [c] is shown in Fig. 6.4, while Fig. 6.5 shows the effect of parameter T on the amplitude and latency as estimated by the three methods at this locus. As explained earlier, method [a] cannot test for a parametric effect on latency.

Figure 6.4: Parametric Effect on Latency. Axial slice of the brain (z = 55mm, MNI space) for group-wise latency using method [c] at the location marked by red arrows in Fig. 6.3. Top row shows forward recall (D = 1) and bottom row shows backward recall (D = 2), at different experiment times (T).

Firstly, we observe that the variances of both the amplitude and latency estimates by our method are consistently much lower than those of method [b], while there is no noticeable increase in bias. The polynomial weighting of the parameter T explains why the 95% CI band for the effect estimated by method [a] is much wider than those for the other two methods.


Figure 6.5: Parametric effects of latency and amplitude at a region in the IPS. Figs. (i)–(ii) graph activation amplitude vs. experiment time T = 0–10 mins for forward (D = 1) and backward (D = 2) recall. Figs. (iii)–(iv) graph activation latency vs. T. Legend. Method [a]: mean effect – black solid line; Method [b]: blue dotted line; Method [c]: red dashed line. Bands show the 95% CIs for the estimates.

While no appreciable parametric effect on amplitude can be seen for the forward case, the activation latency does exhibit an increase over time. For the backward case, there is a slight increase in activation amplitude over time, followed by a leveling out; this effect can be seen more clearly using methods [b] and [c]. Latency, on the other hand, starts off high but then reduces with T. These trends point to physiological phenomena that could be due to attention, adaptation (learning) or habituation. This example clearly demonstrates the value of examining parametric effects of activation latency.

6.4 Conclusion

This chapter presented a method to estimate the effects of an experimental parameter on the amplitude and the latency of the hemodynamic response, while exploiting the advantages of a GLM framework, namely computational speed and ease of interpretation. Additionally, a low–bias estimator for latency was developed and validated on simulated data.

It was also demonstrated that latency is capable of exposing aspects of the neural recruitment of visuo–spatial working memory not available through classical GLM analysis. By examining parametric effects on latency, more numerous and more interesting differences in the activation patterns could be observed for memory recall and manipulation than by using amplitude information alone.

A promising line of future investigation is to develop a group–analysis framework for la- tency, which may help provide further insight into neuropathologies and the salient differ- ences between populations with different cognitive capabilities. Also, the ability to model more variation in the hemodynamics through non–parametric representations of the HRF might be able to capture latency effects with higher precision. In addition, of interest are extensions that characterize the variation of all the features of the hemodynamic response, which may reveal even more aspects of the neurophysiology of cognition.

PART III

Spatio-temporal Representations for Cognitive Processes

CHAPTER 7

SPATIO-TEMPORAL REPRESENTATIONS: THEORY

Representation of the world, like the world itself, is the work of men; they describe it from their own point of view, which they confuse with the absolute truth.

Simone de Beauvoir (1908–1986), The Second Sex.

In this chapter, Section 7.1 provides a brief background on the problem of studying the representation of information about cognitive states of the brain contained in fMRI data. This is followed by a theoretical discussion of the merits and drawbacks of supervised, unsupervised and semi–supervised analysis methods in Section 7.2. Then Section 7.3 talks about the two main supervised multivariate methodologies for brain–state decoding, viz. multivariate pattern recognition (MVPR) and multivariate linear models (MVLM). Finally, in Section 7.4, I shall motivate the need for a dynamical unsupervised and semi–supervised approach towards discovering patterns in the data that might indicate the internal / hidden and transient cognitive state of the subject.

7.1 Functional Representation

One of the main challenges in cognitive neuroscience is understanding the neural representation of a cognitive, perceptual or affective state of a subject, i.e. “cracking the neural code”. The essence of the solution lies in the fact that determining the neural coding of a particular variable provides the means for potentially solving the problem of the cognitive processing of that variable. For example, if we can decipher the neural coding of numbers, then we have a good chance of deciphering the cognitive process of mental arithmetic by observing neural activity during this process, recovering the numbers by decoding, and inferring how they are being operated upon in this particular process.

While fMRI operates at a level much removed from neural processes, there is nonetheless information about the cognitive state of the subject encoded in the distributed pattern of activity. In one of the first publications on this topic, Haxby et al. [90], using a simple correlation-based linear classifier, demonstrated the representability of different types of visual stimuli in the human ventral temporal (VT) cortex with fMRI. These results are reproduced in Fig. 7.1.
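The classifier is simple enough to sketch: a held-out (e.g. odd-run) response pattern is assigned the category whose training-half mean pattern it correlates with most strongly. The data layout and names below are illustrative, not the original implementation.

```python
import numpy as np

def correlation_classify(pattern, category_patterns):
    """Label a response pattern with the category whose (training-half) mean
    pattern has the highest Pearson correlation with it, in the spirit of
    the split-half analysis of Haxby et al. [90]."""
    corrs = {cat: np.corrcoef(pattern, ref)[0, 1]
             for cat, ref in category_patterns.items()}
    return max(corrs, key=corrs.get)

# e.g.: correlation_classify(odd_run_pattern,
#                            {'faces': even_faces_mean, 'houses': even_houses_mean})
```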

Since then, the encoding of visual concepts in the human visual cortex – at both higher and lower levels of the processing stream – has been widely investigated [91, 110, 179], with over 200 publications to date. This approach has been applied to study other types of mental representations, including auditory perception [150], motor tasks [122], word recognition [160], and the detection of emotional affects such as deception [200] and fear [177].

In a landmark paper, Mitchell et al. [160] presented a method to predict the fMRI response to new words from the responses of other related words. For a given word, they learnt the semantic relationships with other words in terms of their co-occurrence statistics within a trillion-word text corpus. In the second step, they predicted the fMRI image of a new word (e.g. “celery”) as a linear combination of the distributed fMRI responses to these related words (e.g. “eat”, “taste”, “fill”, etc.). I have reproduced these results in Fig. 7.2.

Figure 7.1: The category specificity of patterns of response was analyzed with pairwise contrasts between within-category and between-category correlations. The pattern of response to each category was measured separately from data obtained on even-numbered and odd-numbered runs in each individual subject. These patterns were normalized to a mean of zero in each voxel across categories by subtracting the mean response across all categories. Brain images shown here are the normalized patterns of response in two axial slices in a single subject containing the VT cortex. For each pairwise comparison, the within-category correlation is compared with one between-category correlation. (A) Comparisons between the patterns of response to faces and houses in one subject. The within-category correlations for faces (r = 0.81) and houses (r = 0.87) are both markedly larger than the between-category correlations, yielding correct identifications of the category being viewed. (B) Comparisons between the patterns of response to chairs and shoes in the same subject. The category being viewed was identified correctly for all comparisons. (C) Mean response across all categories relative to a resting baseline. Figure reproduced from Haxby et al. [90].

Figure 7.2: Predicting Spatial Maps for Given Stimulus Words. (A) Forming a prediction for the stimulus word “celery” after training on 58 other words. Learnt activation maps for 3 of the 25 semantic features (“eat”, “taste” and “fill”) are depicted by the voxel colors in the three images at the top of the panel. The co–occurrence value for each of these features for the stimulus word “celery” is shown to the left of their respective images. The predicted activation for the stimulus word [shown at the bottom of (A)] is a linear combination of the 25 semantic fMRI signatures, weighted by their co-occurrence values. (B) Predicted and observed fMRI images for “celery” and “airplane” after training that uses 58 other words. The two long red and blue vertical streaks near the top (posterior region) of the predicted and observed images are the left and right fusiform gyri.

This is part of a broader trend of machine learning in the analysis of neuro-scientific recordings [20], with wide applications in brain–machine interfaces [57], clinical psychology and cognitive neuroscience [178], real-time biofeedback [122], etc.

7.2 Supervised, Unsupervised and Semi-supervised

One of the main advantages of supervised approaches is that they allow quantitative testing for the effect of an experimental variable on the brain’s response. The drawback is that these tests fundamentally boil down to model–comparison [65] and therefore require pre–specification of at least two models (e.g. the null and alternative hypotheses), typically as the linear coupling between stimulus and response (cf. the sidebar on Generative vs. Classification Models). Except for the simplest of tasks, there are no good models of brain function, and it is unclear how complete a picture of brain function can be provided by the extreme oversimplifications that are needed for computational and statistical expediency [74].

Purely unsupervised or data–driven methods, in contrast, do not make assumptions about the linearity and stationarity of the brain’s response, but they suffer from problems of interpretability, since they are based on statistical criteria and not on a generative model of brain function. Secondly, they fail to provide quantifiable links to the experimental variables of interest.

For example, with PCA, multiple studies have noted that the components with maximum variance often correspond to artifacts such as respiration and head–motion. ICA, on the other hand, does not provide any ordering of components or any quantitative criteria for selecting amongst them. Components are selected either through visual inspection, where the identification of “interesting” components is left completely to the investigator [173]; by (linear) correlation with respect to a reference time–series [222], which again requires knowing the mathematical relationship between the fMRI signal and experimental variables; or through information theoretic criteria [135], which select components that may or may not be related to brain function.

Similar ambiguities affect clustering-based methods, such as criteria for setting the appropriate number of clusters, the drastic effects of algorithm initialization on the final results, and interpretability issues due to the lack of an underlying model.

7.3 Pattern Recognition vs. Linear Models

This section discusses the pros and cons of pattern classifiers and linear models in the context of supervised multivariate analysis. Some of the statistical issues alluded to here are elaborated further in the sidebar on Generative vs. Classification Models.

7.3.1 Multivariate Pattern Recognition (MVPR)

MVPR methods (cf. Section 2.6.1) treat the data as an abstract representation of mental states, without requiring a model of brain function, i.e. of how neural activity is converted into the fMRI signal. Instead, they are posed in terms of prediction accuracy, through principles such as structural risk minimization and generalization error [22].

The advantage of this approach is that these methods are not limited, inferentially, by the fact that the physiology of brain function and its translation into the BOLD mechanism is poorly understood, and that all models are likely to be highly inaccurate. The drawbacks, however, are the reduced temporal resolution of the analysis, limiting them to block designs, and the inability to make quantitative and definitive neurophysiological interpretations from their parameter estimates.

Also, most methods make the assumption that all fMRI scans with the same label (i.e. behavioral state) are equivalent. The attractiveness of this approach is that it allows an investigation of the mental state at each time point independently (of course, restricted to block designs). The flip side, however, is that it ignores the temporal dependencies, variations and evolutions of patterns that are fundamental to mental processes, and in that sense provides a static, unchanging picture of brain function.

Another limitation of MVPR classifiers is their applicability only to studies where subjects are presented with a fixed number of alternatives (e.g. faces vs. objects [90]). Generalization to complex cognitive paradigms with interval–valued parameters, and further on to real-world situations, poses a significant methodological challenge [92]. Finally, so far all reported studies have used classifiers trained and tested on the same subject. An important and unresolved question is the extent to which these strategies can be generalized across subjects and to new situations. This will require detecting the representation of specific mental concepts in a manner that is invariant across humans and task conditions.

7.3.2 Multivariate Linear Models (MVLM)

In contrast to MVPR, MVLMs specify a probabilistic generative model of how the ob- served data are related to the stimulus, based on some assumptions about brain function.

Both forward (i.e. Y = Xs + ε) and decoding (i.e. s = XY + ε) MVLMs specify this as a linear relationship between regressors, formed by convolving the stimuli with a hemodynamic response function (HRF), and the observed data.

While the advantages include computational efficiency, statistical simplicity and straightforward neurophysiological interpretation of parameters, this methodology requires that the mathematical relationship between experimental variables and the fMRI signal be known a priori. This is often hard to define in a principled manner, especially in experiments for higher–level cognition.

Equally problematic is the assumption of spatially and temporally constant hemodynamics in these models, since multiple studies have shown a large variation in the hemodynamic response (HR) across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time (cf. Section 1.3.2).

7.4 Motivation

Despite the success of the multivariate methods for understanding the representation of cognitive states, there are nevertheless many open challenges. Fundamentally, these methods learn a fixed mapping from fMRI data to regressors / labels describing stimuli or subject behavior. Hence, they cannot discover intrinsic patterns that might be present in the data, and therefore their ability to explain the internal mental state is limited to behavioral correlates as recorded by the stimulus. Decoding the internal cognitive states, especially under natural conditions, requires reconstructing the spontaneously changing dynamic "stream of consciousness" from brain activity alone, without reference to extrinsic labels.

An equally significant challenge arises from the fact that mental processes are constituted of dynamically changing patterns of activity. These approaches, by creating static spatial maps corresponding to a particular experimental variable and ignoring its dynamics, disregard a potentially very informative aspect of brain function.

In order to address some of these shortcomings, I propose two related models of brain function as represented by fMRI, using a dynamical state–space formulation. The first version of this model, an unsupervised framework along with a Monte–Carlo estimation algorithm, is described in Chapter 11. Then, to address the problem of linking the results back to the experimental variables, the model is refined to include stimulus information in Chapter 12, where a computationally efficient mean–field approximation based estimation algorithm is also developed.

Generative vs. Classification Models

The question of inferring a link between a distributed pattern of response and a mental state is essentially one of comparing the evidence between alternative models, typically between the null hypothesis H0, which posits the absence of an effect, and the alternative hypothesis H1. Experimental neuroscience rests on comparing generative models that embody competing hypotheses about how data are caused. From a statistical perspective, by the Neyman–Pearson lemma [112] the likelihood–ratio test (or Bayes factors) for a statistic θ:

Λ = p(θ|H1) / p(θ|H0)

is the uniformly most powerful test for a given size α = p(Λ ≥ µ|H0) with a threshold µ (i.e. it has the least Type–II / false–negative error–rate for a given Type–I / false–positive error–rate), and is the basis for most of statistical inference, including classical methods like Wilk's Lambda in canonical correlation analysis (CCA) and the F–test in ANOVA. The null distribution of the likelihood–ratio statistic p(θ|H0) can be determined non–parametrically or under parametric assumptions (e.g., a t–test). To evaluate the marginal likelihood it is necessary to specify the joint density function entailed by a model, typically in parametric form. In Bayesian analysis, the parameters are then integrated out with respect to a prior density to give the model evidence.

Generative models of the form g(θ) : s → Y explain how experimental variables s produce observed data Y. Such forward models (e.g. GLMs) assume the experimental conditions as fixed or known variables and randomness only in the data, as: Y = g(s) + ε. In multivariate decoding, the direction of the generative model is reversed to give a decoding model s = g(Y) + ε. The advantage of this approach is that it can account for the perceptual uncertainty of a presented stimulus. Its drawback is the much higher dimensionality of the input Y relative to the number of observations, causing a risk of over–fitting and necessitating some kind of dimensionality reduction.

In classification, one wants to predict or classify a new observation Ynew using a decoding model whose parameters have been estimated from pairs of training data and classification labels. Classification may be based on the predictive density:

p(snew | Ynew, s, Y) = ∫ p(snew | θ, Ynew) p(θ | s, Y) dθ,

although many classifiers (e.g. SVMs) do not even try to estimate the predictive density. Instead, they try to directly maximize prediction ability and can be thought of as point estimators of the parameters. In most neuroscientific investigations, prediction of new fMRI volumes is not of direct interest, and the predictive density or the generalization error–rate (measured by cross–validation) is used in lieu of model evidence to detect the presence of an effect. This is because classifiers do not yield probabilistic estimates of the parameters, which means their evidence is not defined. Therefore, by the Neyman–Pearson lemma, inferences made by such schemes are sub–optimal. A second problem for classifiers is that the marginal likelihood depends on both accuracy and model complexity. Many classification schemes do not account for this complexity explicitly, but rather do so indirectly through the generalization error and over–fitting penalties.
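To make the likelihood–ratio construction concrete, the following is a minimal sketch of a Neyman–Pearson test for two simple Gaussian hypotheses. The Gaussian form, the sample sizes and all variable names are illustrative assumptions, not prescribed by this chapter; the threshold µ is estimated by simulating the null distribution, as described above.

# Minimal sketch of a Neyman-Pearson likelihood-ratio test for two simple
# hypotheses H0: theta ~ N(0, 1) vs. H1: theta ~ N(1, 1). The Gaussian
# assumption and all names here are illustrative, not from the dissertation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta = rng.normal(loc=1.0, scale=1.0, size=100)   # data generated under H1

# Likelihood ratio Lambda = p(theta | H1) / p(theta | H0), in log-space
log_lr = (stats.norm.logpdf(theta, loc=1.0, scale=1.0)
          - stats.norm.logpdf(theta, loc=0.0, scale=1.0)).sum()

# Threshold mu for a size-alpha test under H0, estimated by simulation
alpha = 0.05
null_lrs = np.array([
    (stats.norm.logpdf(x, 1.0, 1.0) - stats.norm.logpdf(x, 0.0, 1.0)).sum()
    for x in rng.normal(0.0, 1.0, size=(2000, 100))   # resamples under H0
])
mu = np.quantile(null_lrs, 1.0 - alpha)
print(f"log-LR = {log_lr:.2f}, threshold = {mu:.2f}, reject H0: {log_lr >= mu}")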

CHAPTER 8

BRAIN–STATES: INVESTIGATING SPATIO–TEMPORAL PATTERNS IN THE DATA

Space and time are the framework within which the mind is constrained to construct its experience of reality.

Immanuel Kant (1724–1804), Critique of Pure Reason.

As discussed in the previous chapter, identifying the transient intrinsic cognitive states of the brain as it performs a mental task from fMRI data is an important research problem with wide applications in cognitive neuroscience.

In this chapter, I shall present an initial exploration of this concept, to determine whether the spatially distributed BOLD signal recorded at each time–point encodes some information about the instantaneous mental state of the subject. This investigation was inspired by the micro–states discovered in EEG [131], which is further discussed in Section 8.1.

The method adopted for identifying the potential of a similar organization in the metabolic traces of neural processes recorded by the BOLD signal is described in Section 8.2. The results of this investigation are presented in Section 8.3, followed by a discussion in Section 8.4.

8.1 Inspiration

Although a complex system such as the brain comprises many local functional states, these can be aggregated into global functional states or configurations at each moment in time. In their seminal work [132], Lehmann et al. extracted and classified typical or characteristic brief quasi–stable topographies of electric field potentials recorded simultaneously from many EEG electrodes placed on the scalp, which they termed microstates. This is shown in Fig. 8.1.

Figure 8.1: EEG Microstates over 4 seconds of spontaneous EEG using a cluster analysis. The waveforms represent eyes-closed EEG recorded from 42 electrodes. For each time point, the potential distribution map was calculated and all maps of the 4 seconds were subjected to a k-means cluster analysis. A cross-validation criterion identified four characteristic electrical potential landscapes, i.e. microstates, as illustrated in the 3rd row of the figure. Fitting these maps back to the original data revealed that each microstate appeared repeatedly and dominated during certain time segments, as shown in the fourth row of the figure (with each microstate color-coded appropriately). Figure reproduced from [131].

These different electric potential landscapes or microstates are generated by different distributions of neuronal electric activity in the brain and last from 70ms to 150ms. They are hypothesized to reflect the activation of different neuro–cognitive networks, each representing specific aspects of cognitive processing, and may be the "atoms of thought" that constitute the seemingly continual "stream of consciousness" [100]. These microstates change in a non–continuous manner: one state may dwell over extended periods in a quasi–stable manner, followed by rapid and major changes of state.

In contrast, the identification of a similar instantaneous state from fMRI, without reference to the experimental task, has been an unexplored problem that could provide important insights into mental processes. While temporal limitations would prevent access to the faster and more fleeting "atoms of thought", fMRI could potentially reveal relatively longer–lasting and more high–level intrinsic mental states, such as attention, intention and planning, that do not necessarily correspond to observable attributes as recorded by the experimental stimuli.

Based on this hypothesis, we developed an unsupervised method for identifying brain– states as described next.

8.2 Method

After the data were pre–processed to correct for head–motion and physiological artifacts, de–noised and the white–matter masked out (cf. Chapter 3 for specifics), each fMRI session was decomposed into c components using spatial ICA. Components whose time–series were highly correlated with head–motion parameters and mean volume intensity fluctuations, identified with multiple regression, were removed and the volumes reconstructed.

The sum of squared differences ||Y(t1) − Y(t2)||² of the voxel values of two volumes at two different time–points was then used to perform hierarchical agglomerative clustering (HAC) [99] of all the volumes Y(t), t = 1 ... T, in the data. This process was repeated for 30 ≤ c ≤ 80 and multiple ICA initializations.

The average HAC step at which two volumes Y(t1) and Y(t2) merged into the same cluster was recorded as d_hac(t1, t2). For all pairs of volumes, an affinity matrix D(t1, t2) = exp{−d_hac(t1, t2)/σ} was then constructed, where σ was a manually chosen bandwidth parameter, typically on the order of the observed values of d_hac(t1, t2).

The affinity matrix D was used to find K clusters in the space spanned by the T volumes using spectral graph clustering [198], with K determined manually. Each cluster is labeled with a unique integer value k = 1 ... K using dynamic programming such that Σ_t |k_{t+1} − k_t| is minimized, where Y(t) is in the cluster labeled k_t and Y(t+1) in the cluster labeled k_{t+1}. This results in a time–series of cluster labels in which most transitions are between states close to each other in value.

The label k_t assigned to each time–point Y(t) is treated as an indication of the intrinsic cognitive state of the brain.
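A condensed sketch of this pipeline is given below, assuming Y is a (T × N) array of de–noised, masked fMRI volumes. The cophenetic merge height is used here as a stand–in for the HAC merge–step index d_hac, and the function name, K and the median bandwidth rule are illustrative assumptions.

# Sketch of the brain-state labelling pipeline of Section 8.2. Names, K, and
# the median bandwidth are illustrative assumptions; cophenetic merge height
# stands in for the merge-step index d_hac.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def brain_state_labels(Y, K=6):
    """Y: (T, N) array of de-noised fMRI volumes; returns T state labels."""
    ssd = pdist(Y, metric="sqeuclidean")        # ||Y(t1) - Y(t2)||^2 for all pairs
    Z = linkage(ssd, method="average")          # hierarchical agglomerative clustering
    d_hac = cophenet(Z)                         # height at which volume pairs merge
    sigma = np.median(d_hac)                    # bandwidth (assumption)
    D = squareform(np.exp(-d_hac / sigma))      # affinity matrix D(t1, t2)
    np.fill_diagonal(D, 1.0)
    sc = SpectralClustering(n_clusters=K, affinity="precomputed", random_state=0)
    return sc.fit_predict(D)

# Example on synthetic data: 100 "scans" of 500 voxels
labels = brain_state_labels(np.random.default_rng(1).normal(size=(100, 500)))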

8.3 Results

This method was applied to the fMRI study of visuo–spatial working memory (cf. Section 3.1). The brain–state labels for two subjects over a 200s period are shown in Fig. 8.2(a), where the color coding shows the corresponding phase of the experiment as displayed in Fig. 3.2. Here, an intriguing synchronization of the assigned state with the phase of the task can be seen. It should be noted that no information about the experiment was used when determining the brain–state. In Fig. 8.2(b), the median brain–states during the forward and backward recall conditions are shown (both subjects), along with the 10 and 90 percentile bands.

As can be seen, there is a clear transition of the brain between distinct states during different phases of the experiment, along with a separation between the forward and backward conditions in the instruction and probe phases. The high variation towards the end of a trial is probably because the subject took different response times and may have adopted different strategies across the trials. The "M"–shaped peaks hint at a two–fold engagement of the pre–frontal areas during the instruction and probe phases. The drop between the peaks implies a "shift of gear" as the brain advances from one phase into the next one.

8.4 Conclusion

Using an unsupervised data–driven approach based on clustering brain volumes solely on their voxel intensity distributions, without using information about the experiment, we were able to detect a pattern in the sequence of cluster labels that was highly organized with respect to the experiment.

[Figure 8.2 — panels (a) Brain–States over 200s period and (b) Brain–States over one trial: State# vs. event time (secs), forward and backward conditions for both subjects.]

Figure 8.2: Brain–state Labels for Two Subjects. Fig.(a): Brain–state labels for two subjects over a 200s period. Fig.(b): Median brain states for the two subjects (along with the 10 and 90 percentile bands) in each trial separated in terms of forward and backward recall.

From these transitions, not only a change from one phase to another, but also the effect of different experimental conditions on the measured response of the brain could be observed.

In the next few chapters, I shall explore this concept further, to create a representation that serves as an abstract vehicle for the spatio–temporal patterns recorded in the data, in order to understand the dynamic processes underlying human thought.

CHAPTER 9

BRAIN–STATES: THE NOTION OF FUNCTIONAL DISTANCE

Brain, n.: An apparatus with which we think that we think. Mind, n.: A mysterious form of matter secreted by the brain.

Ambrose Bierce (1842–1914), The Devil’s Dictionary.

Building on the promising results from the previous chapter on the potential of fMRI to reveal the internal cognitive state of the subject, this chapter develops a refinement of the clustering mechanism through the definition of a distance metric that quantifies the functional similarity between the neural activation patterns present in two fMRI scans.

The functional distance metric (FD), as explicated in Section 9.1, measures the amount of "transport of activity" over the functional networks of the brain. A robust, fast and sparse method for determining these functional networks, routinely defined as the "temporal correlations between spatially remote neurophysiological events" [133], is described in Section 9.2.

This is followed by the details of a fast approximation method for computing the functional distance, based on recursive aggregation in Section 9.3.

Once the functional distance between each pair of acquisition time–points is computed, the activation patterns at the T time–points are embedded in a space equipped with a metric obtained by a diffusion process. This metric, explained in Section 9.4, is aware of the geometry of the underlying low–dimensional manifold spanned by these neural patterns. The data are then grouped by clustering in this low–dimensional space, where each cluster represents a characteristic distribution of activity in the brain, i.e. the brain–state at that time–point.

The results of this method applied to the mental arithmetic data–set (cf. Section 3.2) are reported in Section 9.5, followed by a brief discussion in Section 9.6.

9.1 Functional Distance

For the sake of discussion, let Z_{t1} and Z_{t2}, where t1, t2 = 1 ... T, denote a voxel–wise measure of neural activity or, more accurately, its metabolic fingerprint13, that evokes the hemodynamic response, which is then measured as the BOLD signal Y_t, t = 1 ... T. In the discussion that follows later in this chapter, these activation patterns Z are arrived at by deconvolving the (de–noised) fMRI signal Y using a canonical HRF.
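As a sketch of this deconvolution step, the following recovers an activation time–series by ridge–regularized deconvolution with a double–gamma HRF. The HRF parameters and the ridge penalty are illustrative assumptions; the text only specifies that a canonical HRF is used.

# Sketch: recover Z from Y by ridge-regularized deconvolution with a
# double-gamma HRF. HRF parameters and lam are assumptions for illustration.
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import gamma

def canonical_hrf(tr=2.0, length=32.0):
    t = np.arange(0.0, length, tr)
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0   # common double-gamma form
    return h / h.sum()

def deconvolve(y, hrf, lam=1.0):
    """y: (T,) time-series of one voxel; returns a ridge estimate of z."""
    T = len(y)
    col = np.zeros(T); col[:len(hrf)] = hrf
    H = toeplitz(col, np.zeros(T))                 # convolution matrix H z = h * z
    return np.linalg.solve(H.T @ H + lam * np.eye(T), H.T @ y)

# Example: deconvolve a noisy convolved impulse train
rng = np.random.default_rng(0)
z_true = np.zeros(200); z_true[[20, 60, 120]] = 1.0
y = np.convolve(z_true, canonical_hrf(), mode="full")[:200] + 0.01 * rng.normal(size=200)
z_hat = deconvolve(y, canonical_hrf())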

The difference FD(Z_{t1}, Z_{t2}) between two activation patterns Z_{t1} and Z_{t2} is quantified by the transportation distance14 [183], i.e. the minimal "transport" f : N × N → R of activity over the functional circuits needed to convert Z_{t1} into Z_{t2} [102]. Specifically,

13 Let T be the number of fMRI scans in the session, each acquired at 1TR intervals.
14 This metric, when used in the context of discrete probability distributions, becomes the well–known Earth Mover's Distance [188].

Definition 1 (Functional Distance).

FD(Z_{t1}, Z_{t2}) = min_f Σ_{i=1}^{N} Σ_{j=1}^{N} f[i, j] d_F[i, j],    (9.1.1)

subject to the constraints:

f[i, j] ≥ 0
Σ_j f[i, j] ≤ Z_{t1}[i]
Σ_i f[i, j] ≤ Z_{t2}[j]
Σ_{i,j} f[i, j] = min{ Σ_i Z_{t1}[i], Σ_i Z_{t2}[i] }

The cost of the transport of f[i, j] from voxel i to j will depend on a measure d_F : N × N → R⁺ between the voxels that captures their "functional disconnectivity".

If i, j = 1 ... N index two cortical voxels, then the cost of transport between them, d_F[i, j], will depend on their functional connectivity F[i, j] ∈ [−1, 1], measured by the correlation of their time–series (cf. Section 9.2). Although in this chapter the relationship between the cost–function and the functional connectivity is defined by the heuristic:

d_F[i, j] = 1 − |F[i, j]|,    (9.1.2)

the next chapter (cf. Section 10.2 of Chapter 10) develops a more formal and principled relationship.

This definition of the functional distance FD captures the intuitive notion that two activity patterns are functionally more similar if the differences between them lie mainly on voxels that are functionally related to each other, indicating the activation of a shared functional network, as illustrated by a toy example in Fig. 9.1. Figs. 9.1(a)–(b) show a simplified functional connectivity network in its correlation–matrix and graph representations, respectively. Three exemplary distributions of neural activity Z_t over the nodes of the network are displayed in Figs. 9.1(c)–(e), at three different time–points t1, t2 and t3. Since the network activated at time–point t1 is functionally more related (as measured by the time–series correlations) to that at t2 than to that at t3, as per Definition 1, the functional distances satisfy FD(t1, t2) < FD(t1, t3).
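Definition 1 can be solved exactly as a linear program on small graphs. The sketch below uses scipy's LP solver on an invented 4–voxel connectivity matrix, with the cost given by the heuristic of eqn. 9.1.2; it reproduces the behavior of the toy example, with a small FD for a shift of activity within a functional network and a large FD for a shift across networks.

# Sketch of Definition 1: the functional distance as a transportation problem,
# solved exactly with an LP solver on a toy 4-voxel network (values invented).
import numpy as np
from scipy.optimize import linprog

def functional_distance(z1, z2, d):
    """z1, z2: (N,) activity patterns; d: (N, N) cost d_F. Returns FD."""
    N = len(z1)
    c = d.reshape(-1)                              # objective: sum f[i,j] d[i,j]
    A_ub = np.zeros((2 * N, N * N))
    for i in range(N):
        A_ub[i, i * N:(i + 1) * N] = 1.0           # sum_j f[i, j] <= z1[i]
        A_ub[N + i, i::N] = 1.0                    # sum_i f[i, j] <= z2[j]
    b_ub = np.concatenate([z1, z2])
    A_eq = np.ones((1, N * N))                     # total flow constraint
    b_eq = [min(z1.sum(), z2.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

F = np.array([[1.0, 0.9, 0.1, 0.0],                # toy connectivity matrix
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
d_F = 1.0 - np.abs(F)                              # heuristic cost of eqn. 9.1.2
z_t1 = np.array([1.0, 0.0, 0.0, 0.0])
z_t2 = np.array([0.0, 1.0, 0.0, 0.0])              # shift within a network
z_t3 = np.array([0.0, 0.0, 1.0, 0.0])              # shift across networks
print(functional_distance(z_t1, z_t2, d_F))        # small: related voxels (0.1)
print(functional_distance(z_t1, z_t3, d_F))        # large: unrelated voxels (0.9)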

9.2 Functional Networks

This section lays out an algorithm for computing the functional connectivity (i.e. correlations) F between voxels that is consistent, sparse and computationally efficient. Define Y ≜ {Y_1 ... Y_T} as the fMRI time–series data with N voxels and T scans, where Y[i] is the time–series data of voxel i = 1 ... N.

Because N ≫ T, the standard covariance estimator is badly conditioned, and its eigen–system is inconsistent [186]. Therefore, regularization is required to impose sensible structure on the estimated covariance matrix while being computationally efficient.

First, the images are smoothed with a Gaussian kernel (FWHM = 8mm) to increase the spatial coherence of the time–series data. Next, spatially proximal voxels are grouped into a set of Ñ < N spatially contiguous clusters using hierarchical agglomerative clustering (HAC) [99]. HAC is repeated until the number of clusters Ñ ≈ 0.25 × N, as elaborated upon in Section B.1 of Appendix B.

[Figure 9.1 panels: (a) Functional Connectivity: Matrix; (b) Functional Connectivity: Graph; (c) Activity at t1; (d) Activity at t2; (e) Activity at t3.]

Figure 9.1: Conceptual Illustration of the Functional Distance. A toy example of a functional connectiv- ity network represented as a matrix of correlations in Fig.(a) and its graph–based representation in Fig.(b). Figs.(c)–(e) show the neural activity intensity at three time–points from the data–set. The hot color scheme is used in these plots with activation magnitude increasing from left to right.

This procedure has the two–fold benefit of reducing the dimensionality of the estimation problem while simultaneously increasing the SNR of the data through averaging. Table 9.1 shows that the clusters, after Gaussian smoothing, are larger and their sizes are more uniform for the same number of HAC–steps as compared to those without smoothing.

HAC–steps:   0.5 × N                  0.75 × N                 0.875 × N
FWHM     Ñ      Avg mm³  Std.dev   Ñ      Avg mm³  Std.dev   Ñ      Avg mm³  Std.dev
0mm      0.69   11.59    6.32      0.51   15.68    14.62     0.36   21.26    30.33
4mm      0.63   12.69    4.85      0.42   19.04    9.16      0.29   27.58    17.41
8mm      0.58   13.74    3.98      0.34   23.52    7.70      0.22   36.16    12.56

Table 9.1: Effect of the FWHM of the Gaussian kernel on the number of clusters Ñ (shown as a fraction of N) and on the mean and standard deviation of cluster sizes (mm³) after a given number of HAC–steps. Values are for the data–set described in Section 3.2.

Next, cluster–wise covariances are computed and regularized using the adaptive soft shrinkage detailed in Section B.2 of Appendix B. Estimates of voxel–wise correlations are then recomputed from the regularized cluster–wise correlations. If i, j = 1 ... N index two cortical voxels, then the resulting functional connectivity map F[i, j] ∈ [−1, 1], for all 1 ≤ i, j ≤ N, is consistent and extremely sparse. It is also easy to verify that this F is positive definite. The effect of this procedure on the distribution of the functional connectivity estimates for the mental arithmetic data–set of Section 3.2 is shown in Fig. 9.2.
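The following is a compressed sketch of this estimator: voxels are grouped spatially, cluster–wise correlations are computed and soft–thresholded, and voxel–wise correlations are read back from the cluster assignments. The k–means stand–in for spatial HAC and the fixed threshold are simplifying assumptions.

# Compressed sketch of the regularized correlation estimator of Section 9.2.
# k-means stands in for spatial HAC; the threshold is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def regularized_connectivity(Y, coords, n_clusters, thresh=0.3):
    """Y: (T, N) time-series; coords: (N, 3) voxel positions."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(coords)
    # Cluster-mean time-series (averaging also boosts SNR, as in the text)
    means = np.stack([Y[:, labels == k].mean(axis=1) for k in range(n_clusters)], axis=1)
    C = np.corrcoef(means, rowvar=False)                     # cluster-wise correlations
    C = np.sign(C) * np.maximum(np.abs(C) - thresh, 0.0)     # soft shrinkage
    np.fill_diagonal(C, 1.0)
    return C[np.ix_(labels, labels)]                         # voxel-wise map F[i, j]

# Example with synthetic data: 100 scans, 60 voxels on a line
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 60))
coords = np.column_stack([np.arange(60), np.zeros(60), np.zeros(60)])
F = regularized_connectivity(Y, coords, n_clusters=15)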

9.3 Computing the Functional Distance

While there exist efficient algorithms for computing the functional distance based on the Hungarian algorithm [165], these methods exhibit a worst–case complexity of O(N³ log N). For an fMRI study with a voxel size of 3 × 3 × 3mm³, the number of grey–matter voxels is ≈ 5 × 10⁴, giving a running time of O(10¹⁴). If the number of scans is T, then O(T²) such comparisons are required, making the pair–wise computation prohibitively expensive. Moreover, because the cost–function d_F is derived from the functional connectivities between voxels and is not a Euclidean distance, standard approximations are not readily applicable [199].

[Figure 9.2: histograms of the correlation distribution at each stage of the estimation procedure — (a) Raw correlations; (b) After 8mm smoothing; (c) After clustering; (d) After thresholding.]

Figure 9.2: Results for the Regularized Correlation Estimator. Fig.(a): Without any regularization, most of the mass of the distribution is concentrated in small non-zero correlations, while the strong correlations are only a fraction of the total. Fig.(b): The smoothing procedure shifts the whole distribution towards the right, by strengthening all correlations. Fig.(c): The hierarchical clustering procedure boosts strong correlations without affecting weak correlations. Fig.(d): Finally, the shrinkage step sparsifies the correlation matrix, with most correlations set to zero.

Let X be the set of voxels in the cerebral cortex and E the set of edges between them, with weights given by d_F. Define G(X, E) as the graph–structure representation of the functional networks. The problem is made tractable using the recursive approximation scheme of Algorithm 9.1. Here, FD(Z_{t1}, Z_{t2}) is solved on graphs of increasing resolution, starting from a very coarse–grained version of the graph. The FD_j(Z_{t1}, Z_{t2}) at each resolution j is a lower bound on the true FD, and refinement of the graph is terminated if FD_j > τ_j, a certain threshold, implying that the functional distance between the patterns at the two time–points is high enough that an approximate value will suffice. The inaccuracies introduced by this approximation are compensated for using the diffusion operation described in Section 9.4.

1  begin // Initialization
2    Let Z_{t1} and Z_{t2} be the (de–convolved) fMRI volumes at time–points t1 and t2
3  end
4  for j = J to 1 do
5    Create a low–resolution graph G_j of N/2^j nodes by clustering 2^j maximally correlated voxels into groups using a greedy algorithm
     // Each node in the set I_n^j, n = 1 ... N/2^j now represents a group of 2^j voxels
6    Compute the transportation distance FD_j(Z_{t1}, Z_{t2}) on this low–resolution graph G_j
     // If the edge weights and vertex values of G_j are calculated appropriately, then FD ≥ FD_j, i.e. FD_j is a lower bound on the true FD
7    if FD_j(Z_{t1}, Z_{t2}) ≥ τ_j then // τ_j is a threshold
8      Approximate FD(Z_{t1}, Z_{t2}) ← FD_j(Z_{t1}, Z_{t2})
9      Exit
10   end
11 end
12 if j = 0 then we have effectively solved FD(Z_{t1}, Z_{t2}) = FD_0(Z_{t1}, Z_{t2})

Algorithm 9.1: Recursive Approximation of the Functional Distance

In what follows, we give the system for correctly aggregating a set of vertices i ∈ I into a new vertex i′, so that the FD computed on this reduced graph G′(X′, E′) is less than that on the original graph G(X, E).

Let δZ = Z_{t2} − Z_{t1} be the difference between the two distributions. For explicatory purposes, let Σ_i Z_{t1}[i] = Σ_i Z_{t2}[i], i.e. Σ_i δZ[i] = 0, though the reasoning holds for the general case15. Then, the optimal flow f* from Z_{t1} to Z_{t2} on G is the solution to the transportation problem, subject to the constraints Σ_i f[i, j] − Σ_i f[j, i] = δZ[j] and f[i, j] ≥ 0, for all i, j ∈ X. Now, the total cost can be partitioned as:

Σ_{i,j∈X} f*[i, j] d_F[i, j] = Σ_{i,j∈I} f*[i, j] d_F[i, j] + Σ_{i,j∈X\I} f*[i, j] d_F[i, j] + Σ_{i∈I, j∈X\I} f*[i, j] d_F[i, j].    (9.3.1)

Firstly, through a conservation–of–mass argument, the total flow from i ∈ I to all j ∈ X\I must be Σ_{i∈I, j∈X\I} f*[i, j] = Σ_{i∈I} δZ[i]. Let f⁺ be the optimal solution to the transportation problem on the graph G′, where the aggregated vertex takes the value δZ[i′] = Σ_{i∈I} δZ[i], and let

d′_F[i, j] = d_F[i, j] for i, j ∈ X\I,    d′_F[i′, j] = min_{i∈I} d_F[i, j] for j ∈ X\I.

Therefore, the last term in eqn. 9.3.1 satisfies:

Σ_{i∈I, j∈X\I} f*[i, j] d_F[i, j] ≥ Σ_{j∈X′} f⁺[i′, j] d′_F[i′, j].

Secondly, using reductio ad absurdum, the first term in eqn. 9.3.1 satisfies Σ_{i,j∈I} f*[i, j] d_F[i, j] ≥ Σ_{i,j∈I} f°[i, j] d_F[i, j], where f° is the solution to the transportation problem restricted to the subgraph I, subject to the constraints:

f[i, j] ≥ 0,    Σ_j f[i, j] ≤ Z_{t1}[i],    Σ_i f[i, j] ≤ Z_{t2}[j],
Σ_{i,j} f[i, j] = min{ Σ_i Z_{t1}[i], Σ_i Z_{t2}[i] } − Σ_i δZ[i],

with all i, j ∈ I. This subproblem on the vertices of I could again be solved using the above recursive approximation scheme. However, if |I| is small enough, then the exact algorithm may be used.

15 This condition can easily be satisfied by adding to the optimization problem of eqn. 9.1.1 a dummy node with index N + 1, called the dump, where δZ[N + 1] = −Σ_{i=1}^{N} δZ[i] and d_F[i, N + 1] = 0, ∀i = 1 ... N.

Therefore, the recursive approximation scheme gives a lower bound on the transportation distance as:

FD(Z_{t1}, Z_{t2}) ≥ FD_j(Z_{t1}, Z_{t2}) + Σ_{n=1}^{N/2^j} FD_{[I_n^j]}(Z_{t1}, Z_{t2}),    (9.3.2)

where FD_{[I_n^j]}(Z_{t1}, Z_{t2}) is the solution to the transportation problem restricted to the subgraph defined by I_n^j.

9.4 The Diffusion Distance

Although FD provides a well–motivated method to compare brain–states with similar activity patterns (i.e. low FD), its suitability for quantifying the distance between patterns with larger differences is more uncertain, apart from the fact that for such comparisons we have only an approximate FD. Therefore, assuming the accuracy of the FD only in local neighborhoods on the manifold spanned by the patterns of brain activity, the data are embedded into a lower–dimensional Euclidean space using the concept of diffusion distances [41], as follows.

Each activity pattern Z_t, t = 1 ... T is treated as a vertex on a completely connected graph, specified by the T × T affinity matrix W, where

W_{t1,t2} = exp{ −FD(Z_{t1}, Z_{t2})² / 2σ_W² },

where the user–defined parameter σ_W defines a notion of proximity between activation patterns. Let D_{t,t} = Σ_{t′=1}^{T} W_{t,t′} be the T × T diagonal degree matrix. Then M = D⁻¹W can be treated as a stochastic matrix16 defining a random walk on the graph, with M_{t1,t2} encoding the probability p(t2|t1) of a Markov transition from node Z_{t1} to Z_{t2}.

The probability p(n, t|t1) that the random walk starting at Z_{t1} will end at Z_t in n steps is given by Mⁿ_{t1,t}. Consider the following notation for the generalized eigen–system of M = ΨΛΦ⊤, with Λ the diagonal matrix of eigenvalues 1 = λ_1 ≥ ... ≥ λ_T ≥ 0, Φ the matrix of right eigenvectors (φ_1 ... φ_T) and Ψ that of left eigenvectors (ψ_1 ... ψ_T). In that case, p(n, t|t1) = φ_1(t) + Σ_{j=2}^{T} λ_jⁿ φ_j(t) ψ_j(t1). Note that φ_1(t) is the stationary distribution lim_{n→∞} p(n, t|t1) of the random walk, and is independent of the starting point Z_{t1}.

The diffusion distance between two vertices on this graph quantifies the difference in the probability distributions of a random walk starting at either of these vertices arriving at any vertex after n steps:

ρ_n²(Z_{t1}, Z_{t2}) = Σ_{t=1}^{T} |p(n, t|t1) − p(n, t|t2)|² / φ_1(t).    (9.4.1)

The parameter n defines the scale of the diffusion process and controls the sensitivity of the metric to the local geometry, with smaller n making the metric more sensitive to local differences. It can be shown that [41]:

ρ_n²(Z_{t1}, Z_{t2}) = Σ_{j=1}^{T} λ_j^{2n} [ψ_j(t2) − ψ_j(t1)]²,

i.e. a Euclidean distance with the coordinates of vertex Z_t defined by {λ_jⁿ ψ_j(t)}_{j=1}^{T}.

The diffusion distance, though not a geodesic distance, is related to the Laplace–Beltrami and Fokker–Planck operators on the manifold underlying the graph, and therefore provides a geometrically aware embedding for functions intrinsically defined on it [41]. Moreover, since the spectral gap is usually large, with a few eigenvalues close to 1 and most ≈ 0, the diffusion distance is well–approximated by only the first T̂ ≪ T eigenvectors, with an error of the order of O(λⁿ_{T̂+1}).

16 That is, M ≥ 0 and Σ_{t′=1}^{T} M_{t,t′} = 1.
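The embedding defined by these eigenvectors can be computed directly from the matrix of pairwise functional distances, as in the sketch below. The symmetrization trick, the choice of n and the rule for σ_W (a low percentile of the pairwise FDs, as used in Section 9.5) are implementation assumptions.

# Sketch of the diffusion-map embedding of Section 9.4: build M = D^-1 W from
# pairwise FDs and embed each volume with coordinates lambda_j^n psi_j(t).
import numpy as np

def diffusion_embedding(FD, sigma_W, n=2, dim=8):
    """FD: (T, T) symmetric matrix of pairwise functional distances."""
    W = np.exp(-FD**2 / (2.0 * sigma_W**2))          # affinity matrix W
    d = W.sum(axis=1)                                # degree of each volume
    S = W / np.sqrt(np.outer(d, d))                  # symmetric D^-1/2 W D^-1/2
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]                  # 1 = lambda_1 >= lambda_2 >= ...
    evals, evecs = evals[order], evecs[:, order]
    psi = evecs / np.sqrt(d)[:, None]                # eigenvectors of M = D^-1 W
    # Diffusion coordinates lambda_j^n psi_j(t), skipping the constant psi_1
    return (evals[1:dim + 1] ** n) * psi[:, 1:dim + 1]

# Usage: sigma_W chosen so that ~10% of pairwise FDs fall below it
rng = np.random.default_rng(0)
FD = np.abs(rng.normal(size=(50, 50))); FD = (FD + FD.T) / 2; np.fill_diagonal(FD, 0)
sigma_W = np.quantile(FD[np.triu_indices(50, 1)], 0.10)
X = diffusion_embedding(FD, sigma_W)                 # (50, 8) embedding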

9.4.1 Hierarchical Clustering:

After embedding the data acquired at T time–points in a T̂ ≪ T –dimensional Euclidean space, they are grouped into K clusters {c_1 ... c_K} using k–means [99]. These clusters represent distinctive patterns of the distribution of metabolic activity and are probable indicators of the distinctive modes / states of cognition. Next, each cluster c_k, k = 1 ... K is labeled with an integer value 0 ≤ l_k < K, using dynamic programming to minimize Σ_{t=1}^{T−1} |l_k(t+1) − l_k(t)|, where l_k(t) is the label of cluster c_k if S(t) ∈ c_k. This results in a time–series of labels where most transitions are between labels close to each other in value.
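The relabelling step can be illustrated by exhaustive search over label permutations, which is feasible only for small K; the dynamic–programming formulation used in this chapter scales to larger K. The names below are illustrative.

# Sketch of the cluster relabelling: pick the permutation of labels that
# minimizes sum_t |l(t+1) - l(t)|. Exhaustive search shown for clarity;
# only feasible for small K (the text uses dynamic programming instead).
import numpy as np
from itertools import permutations

def order_labels(labels, K):
    best_seq, best_cost = None, np.inf
    for perm in permutations(range(K)):
        seq = np.asarray(perm)[labels]        # apply relabelling k -> perm[k]
        cost = np.abs(np.diff(seq)).sum()     # total label jumps over time
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

# Example: 5 states over 20 time-points
labels = np.array([0, 0, 3, 3, 3, 1, 1, 4, 4, 2, 2, 2, 0, 0, 3, 3, 1, 1, 4, 4])
print(order_labels(labels, K=5))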

9.5 Results

This section reports the results of the method applied to the fMRI data of 4 healthy control subjects (male) who underwent a study of mental arithmetic. The paradigm used here was similar in layout and timing to that presented in Section 3.2, with the difference that the stimulus was either displayed or sounded out, i.e. alternating audio and visual modalities. Acquisition was done on a GE 3T LX scanner with a quadrature head coil using a BOLD–sensitized 2D–EPI gradient–echo pulse sequence (TE=35ms, FA=90°, TR=2s, voxel size 3.75 × 3.75 × 3.75mm³). A typical session lasted ≈ 18 minutes, with 150 trials and 525 scans.

The algorithms developed in this chapter were implemented using MATLAB® and Star-P® on a 2.6GHz Opteron cluster with 16 processors and 32GB RAM. The recursive FD approximation was done with J = 10, and the thresholds τ_j were selected adaptively, so that only a small percentage (25%) of the comparisons would need to be performed again at the next level. This resulted in a speed–up of 10³×, with an average running time of ≈ 23 per subject. Fig. 9.3 shows the relative approximation error (FD_true − FD_approx)/FD_true with respect to the true FD. It is observed that the relative error scales linearly with respect to the true distance and is acceptably small for our data–sets.


Figure 9.3: Relative approximation error (FD_true − FD_approx)/FD_true with respect to FD_true. The x–axis is normalized with respect to the maximum FD between all pairs of volumes.

The parameter σ_W used in the affinity matrix W was set such that α% of all the pair–wise FDs were less than it, reflecting the assumption that α% of brain patterns should be "close" to any given pattern. We found the results to be reproducible for 5% ≤ α ≤ 20% in our experiments. For the low–dimensional embedding in the diffusion metric space, T̂ = 8 was a conservative value with λ_{T̂+1} < 0.05. The number of clusters K = 11 was selected for all subjects.

In Fig. 9.4, the median brain–state labels for the audio and visual presentations of the trial are shown, for the four subjects. Here, a strong pattern in the assignment of the state with the phase of the task can be seen. Also, there is a clear separation of labels during the presentation of the experiment depending on its modality (audio vs. visual), which converges towards the computation phase, as expected. These findings are especially significant given that no information about the experiment was used when determining the brain–state. This synchronization of the brain–state labels with the experiment phase becomes more apparent on examining the intensity distributions for each cluster. The t–score maps for the first subject are shown in Fig. 9.5. The t–scores at every voxel were computed as the within–cluster mean divided by the within–cluster standard deviation.


Figure 9.4: The median brain–state labels, for all four subjects, during a single trial of the experiment. The phases of the experiment are color-coded to indicate the 2.5s, 0.3s, 0.8s, 0–4s and 1s intervals of each trial, as per Fig. 3.2. Red and blue lines show the median brain–states for the visual vs. audio presentation of the numbers, respectively. Also shown are the 25 and 75 percentile bands.

State 1 shows strong patterns in the visual cortex, and its occurrence typically corresponds to the visual presentation of the two numbers. State 3, which usually occurs later, is active in visual areas related to size estimation. States 5 and 6, associated with the calculation / judgement phase, are mainly in the frontal and parietal lobes, implicated in higher–level cognition and number–size assessment. There is also activity in the motor cortex, which may be related to the button press. In states 8 and 10, which usually coincide with the audio presentation of the multiplication problem, the patterns are concentrated in the auditory cortices in the temporal lobe. These findings are in close agreement with those reported [163] for this paradigm using conventional analysis.


Figure 9.5: The within-cluster t–scores for states 1,3,5,6,8,10 for the first subject, overlaid on a high-resolution structural scan. The color coding indicates the intensity in that region. Also shown is the temporal ordering of brain–states for the subject in each trial. The other maps are qualitatively similar and omitted for purposes of concision, while t–scores between [-2,+2] are not displayed for clarity.

9.6 Conclusion

This chapter proposed a refinement in identifying the intrinsic cognitive states of the subject through the concept of a functional distance metric. This metric, in combination with the diffusion distance, was used to embed the spatially distributed instantaneous patterns of brain activity captured by fMRI in a space aware of the underlying topology of these patterns, in which clustering was then performed. We developed a computationally tractable approximation of the FD based on recursive aggregation.

The method was used to analyze a study of mental arithmetic, and a pattern in the sequence of brain–states was observed that was highly organized with respect to the experiment, with distinct changes from one phase to another. The effect of different experimental conditions on the measured response of the brain was also observed. Brain maps of activity were obtained that were physiologically meaningful with respect to the expected mental phase and corresponded with the results of conventional analysis. The neurophysiological consistency of these maps and the temporally organized structure of the brain–states validate the functional distance metric proposed in this chapter for analyzing internal cognitive states from fMRI data.

The combined use of the FD and the diffusion distance is the main reason the method is able to extract relevant structures from the data. The diffusion distance could be thought of as an operator that uses locally accurate measures of similarity to induce a globally consistent Euclidean metric on the space. For it to succeed, however, it is crucial that the underlying measure be accurate when two points are close to each other.

In the next chapter, I shall use the definition of functional distance introduced here to derive an embedding of the patterns of brain activity in Euclidean space, and use a state–space formalism to analyze the dynamics of these patterns in the following chapters.

CHAPTER 10

FEATURE–SPACE: A LINEAR EMBEDDING OF THE FUNCTIONAL DISTANCE

This ... obliged us to abandon, on the plane of atomic magnitudes, a causal description of nature in the ordinary space-time system, and in its place to set up invisible fields of probability in multidimensional spaces.

Carl Gustav Jung (1875–1961) and Wolfgang Pauli (1900–1958), The Interpretation of Nature and the Psyche.

In the previous chapter, a distance metric was introduced that quantified the functional distance between the distributed patterns of the metabolic traces of neural activity at two different time–points. In this chapter, the concept of the functional distance is used to derive a low–dimensional linear Euclidean embedding for fMRI data, which provides a good approximation of the functional distance.

The layout of this chapter is as follows: The need for such an embedding is motivated in Section 10.1. The construction of the basis vectors of this feature–space is described in Section 10.2, and the proof for its approximation of the functional distance is derived in Section 10.3. A dimensionality reduction procedure by means of bootstrapped feature–selection [56] is provided in Section 10.4.

A quantitative evaluation of the feature–space, calculated on the mental arithmetic data–set described in Section 3.2, is reported in Section 10.5.

Note: In the discussion that follows, ⊤ denotes the matrix transpose operator.

Symbol                      Definition
⊤                           Matrix transpose operator
T                           Total number of time–points in an fMRI session
N                           Total number of (cortical) voxels in an fMRI volume
Yt ∈ R^N                    The fMRI scan at 1 ≤ t ≤ T
Y                           Defined as (Y1 ... YT)
Zt ∈ R^N                    The (pre–HR) brain activation pattern at 1 ≤ t ≤ T
Z                           Defined as (Z1 ... ZT)
C = {1 ... N}               Voxel–grid of the volumetric fMRI data
F ∈ [−1, 1]^{N×N}           Functional connectivity (i.e. correlation) map
dF ∈ R₊^{N×N}               The distance metric induced by F
DF ∈ R^{N×N}                The diagonal degree matrix of F
L                           The normalized graph Laplacian of F
φ ∈ R^N                     One–dimensional distortion–minimizing embedding of F
Φ = {φ^(l,m) ∈ R^N}         Orthogonal basis functions of the feature–space

Table 10.1: A summary of the notation used throughout this chapter.

10.1 Motivation

Although the FD metric was used to extract neurophysiologically meaningful patterns – both spatially and temporally – from the fMRI data in an unsupervised fashion, it has the following limitations. Firstly, it is computationally very expensive, even with the recursive approximation algorithm. This is mainly due to the need for solving the transportation problem between all pairs of fMRI volumes.

More importantly, however, the metric is posed as the solution to an optimization problem and therefore does not have a well–understood topological or geometric structure. For example, there is no closed–form solution for computing the centroid of a cluster under this metric17. As a result, determining the statistical properties of clusters obtained under this metric, let alone developing more sophisticated models of brain function, is not straightforward. In the following chapters, a state–space representation of the spatio–temporal patterns in the metabolic traces left by neural activity is presented. Using the FD metric, inference in these state–space models becomes mathematically intractable.

Therefore, in this chapter we present a linear feature–space for fMRI which provides a good approximation of this similarity. This embedding is a generalization of the linear approximation for the earth mover’s distance (EMD) defined on Euclidean spaces [199] to arbitrary spaces. The dimensionality of this feature–space is reduced using a bootstrap analysis of stability [17], where only features that are stable across multiple resamples of the data are retained.

17 As compared to a Euclidean metric, where the centroid is simply the empirical average of all the points in the cluster. Under the FD metric, the mean is the solution to min_{Z0} Σ_t FD(Z_t, Z_0)².

10.1.1 Other Feature–Spaces in fMRI

In fMRI the number of voxels N ∼ O(10⁵) is orders of magnitude larger than the number of scans T ∼ O(10²). Therefore, for a multivariate analysis of the data, some type of dimensionality reduction is necessary to prevent over–fitting, typically through a linear transformation of the voxel–wise data into a new basis followed by a feature–selection step, although non–linear transformations have also been used [86]. These transforms have included projecting along the directions of maximum variance (i.e. PCA) [71], along the directions of maximum covariance with experimental variables (i.e. PLS) [152], along the directions of statistical independence (i.e. ICA) [35], projecting onto the original fMRI scans themselves (i.e. support vectors) [65], or harmonic transforms (i.e. Fourier and wavelets) [189].

Supervised methods for dimensionality reduction select features either most correlated with or most predictive of the experimental variables [65, 159]. The aim of the spatio–temporal analysis presented in this dissertation is to reveal the intrinsic mental state of the subject, using the recorded experimental variables only as a guide, without limiting the ability to capture new and unexpected patterns. Hence, supervised feature–selection approaches are unsuitable, as they are inherently biased towards the experimental variables for which they were selected, while ignoring intrinsic patterns in the data. At the other extreme, purely unsupervised dimensionality reduction methods, such as retaining components with highest variance, are based on statistical criteria unrelated to any model of brain function. Therefore such approaches, while good for data compression, have been shown to be inadequate for predicting cognitive states [90, 173]. For example, in our data–sets we observed that the largest–variance principal components corresponded to motion and physiological noise, such as respiration and pulsatile activity.

In contrast to these methods, the feature–space developed here is derived not from arbitrary statistical criteria but from a definition that captures an intuitive notion of functional similarity. And as feature–selection uses a stability criterion, it does not suffer from the biases of supervised methods or the ambiguities of unsupervised methods.

10.2 Feature–Space

10.2.1 Cost–Function

Consider the cost–function d_F in the definition of FD (cf. Definition 1 in Section 9.1). It was, at that point, defined heuristically as d_F[i, j] = 1 − |F[i, j]|. Here, we shall start the construction of the feature–space by redefining this cost–function as that induced by a distortion–minimizing one–dimensional embedding φ* : N → R of the graph with F as its adjacency matrix:

φ* = arg inf_{φ ⊥ D_F 1} [ Σ_i Σ_j (φ[i] − φ[j])² F[i, j] ] / [ Σ_i φ[i]² D_F[i, i] ],    (10.2.1)

where D_F is the diagonal degree matrix of the adjacency matrix F:

D_F[i, i] = Σ_{j≠i} F[i, j],    D_F[i, j] = 0, ∀i ≠ j.

Here, the embedding φ* will take similar values at voxels that have high functional connectivity, and the cost–function between them is d_F[i, j] = |φ*[i] − φ*[j]|. The constraint φ* ⊥ D_F 1 is to prevent φ from taking a value at each vertex proportional to its degree, which is the trivial minimizer of eqn. 10.2.1.

Rewriting eqn. 10.2.1 in matrix notation and using the method of Lagrange multipliers [40], it can be shown that φ* is the solution to the generalized eigenvalue problem:

(D_F − F)φ = λ D_F φ    such that    φ⊤ D_F 1 = 0.

If η_1 is the eigenvector L η_1 = λ_1 η_1 corresponding to the second smallest eigenvalue λ_1 > 0 of the normalized graph Laplacian of F:

L = D_F^{−1/2} (D_F − F) D_F^{−1/2},    (10.2.2)

then φ* = D_F^{−1/2} η_1.
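A direct sketch of this computation is given below: build the normalized Laplacian of eqn. 10.2.2, take its second–smallest eigenvector, and rescale by D_F^{−1/2}. A dense eigensolver is used for clarity; for fMRI–sized sparse F, a sparse solver such as scipy.sparse.linalg.eigsh would be the practical choice. The toy connectivity matrix is invented.

# Sketch of the distortion-minimizing embedding phi* = D_F^{-1/2} eta_1.
import numpy as np

def embedding(F):
    """F: (N, N) symmetric connectivity matrix with zero diagonal."""
    d = F.sum(axis=1)                                # diagonal of D_F
    L = (np.diag(d) - F) / np.sqrt(np.outer(d, d))   # D^-1/2 (D - F) D^-1/2
    evals, evecs = np.linalg.eigh(L)                 # ascending eigenvalues
    return evecs[:, 1] / np.sqrt(d)                  # phi* = D_F^{-1/2} eta_1

# Toy example: two weakly coupled 3-voxel groups separate by the sign of phi*
F = np.array([[0, .9, .8, .1, 0, 0],
              [.9, 0, .9, 0, 0, 0],
              [.8, .9, 0, 0, .1, 0],
              [.1, 0, 0, 0, .9, .8],
              [0, 0, 0, .9, 0, .9],
              [0, 0, .1, .8, .9, 0]], dtype=float)
print(np.sign(embedding(F)))                         # e.g. [+ + + - - -], up to sign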

10.2.2 Orthogonal Basis for the Feature–Space

Through a recursive partitioning of the voxel–grid based on its embedding φ*, an orthogonal basis Φ = {φ^(l,m) : N → R} is constructed, as elaborated by Algorithm 10.1. The index m = 0 ... log₂N − 1 gives the level of decomposition, while l = 0 ... 2^m − 1 indexes the basis vectors at level m. The first basis vector is φ^(0,0) = D_F^{−1/2} η_1, where η_1 is the eigenvector of L^(0,0) = L corresponding to the second smallest eigenvalue.

The graph is then partitioned into two sub–graphs based on the sign of φ^(0,0), and their graph Laplacians L^(1,1) and L^(2,1) are computed. The details of this partitioning are given in Appendix C.1. The next two basis vectors φ^(1,1) and φ^(2,1) are the second–smallest eigenvectors of L^(1,1) and L^(2,1), respectively. The process is repeated until only one voxel is left in the partition.

1  begin // Initialization
2    C^(0,0) ← C;  F^(0,0) ← F;  D_F^(0,0) ← D_F;  L^(0,0) ← L
3    φ^(0,0) ← D_F^{−1/2} η_1;  λ^(0,0) ← λ_1
4    m ← 0;  l ← 0
5  end
6  while |C^(l,m)| > 1 do
7    for l ← 0 to 2^m − 1 do
       // Recompute the residual connectivity
8      F̂^(l,m) ← F^(l,m) − λ^(l,m) φ^(l,m) φ^(l,m)⊤
       // Partition the grid C^(l,m) into C^(2l,m+1) and C^(2l+1,m+1) based on the sign of φ^(l,m)
9      for i, j ∈ C^(l,m) do
10       if φ^(l,m)(i) ≥ 0 AND φ^(l,m)(j) ≥ 0 then
11         F^(2l,m+1)[i, j] ← F̂[i, j]
12         Add i, j to C^(2l,m+1)
13       else if φ^(l,m)(i) < 0 AND φ^(l,m)(j) < 0 then
14         F^(2l+1,m+1)[i, j] ← F̂[i, j]
15         Add i, j to C^(2l+1,m+1)
16       else
17         F^(2l,m+1)[i, j] ← 0
18         F^(2l+1,m+1)[i, j] ← 0
19       end
20     end
       // Recompute the graph Laplacians and their second–smallest eigenvectors
21     Compute D_F^(2l,m+1), D_F^(2l+1,m+1) and L^(2l,m+1), L^(2l+1,m+1) from F^(2l,m+1) and F^(2l+1,m+1)
22     Calculate φ^(2l,m+1), φ^(2l+1,m+1) from the eigenvectors of L^(2l,m+1), L^(2l+1,m+1) corresponding to their second–smallest eigenvalues λ^(2l,m+1) and λ^(2l+1,m+1), respectively
23   end
24 end

Algorithm 10.1: Construction of Orthogonal Basis Functions
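A compact recursive sketch of Algorithm 10.1 is shown below. The deflation and orthogonalization details are simplified relative to the full algorithm; zero–padding the local embeddings to length N yields (unnormalized) basis vectors φ^(l,m). The toy matrix is invented.

# Simplified recursive sketch of Algorithm 10.1: bisect the graph by the sign
# of the Fiedler embedding, collecting one basis vector per visited subgraph.
import numpy as np

def fiedler(F):
    """D^{-1/2}-rescaled second eigenvector of the normalized Laplacian."""
    d = F.sum(axis=1)
    L = (np.diag(d) - F) / np.sqrt(np.outer(d, d))
    return np.linalg.eigh(L)[1][:, 1] / np.sqrt(d)

def build_basis(F, idx=None, m=0, basis=None):
    if basis is None:
        basis, idx = [], np.arange(F.shape[0])
    sub = F[np.ix_(idx, idx)]
    if len(idx) <= 1 or np.any(sub.sum(axis=1) <= 0):   # stop: singleton / isolated node
        return basis
    phi_local = fiedler(sub)
    phi = np.zeros(F.shape[0]); phi[idx] = phi_local    # pad phi^(l,m) to length N
    basis.append((m, phi))
    build_basis(F, idx[phi_local >= 0], m + 1, basis)   # positive-sign partition
    build_basis(F, idx[phi_local < 0], m + 1, basis)    # negative-sign partition
    return basis

F = np.array([[0, .9, .1, 0], [.9, 0, 0, .1],
              [.1, 0, 0, .9], [0, .1, .9, 0]], dtype=float)
for m, phi in build_basis(F):
    print(m, np.round(phi, 2))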

The coefficients of the spatially distributed brain activation at one time instant Z_t in this orthogonal linear space are denoted as {z_t[l, m], m = 0 ... log₂N − 1, l = 0 ... 2^m − 1}, where z_t[l, m] ≜ 2^{−m} ⟨Z_t, φ^(l,m)⟩.

Then the functional distance FD(Z_{t1}, Z_{t2}) is well–approximated by the ℓ₂ distance metric in this space:

Δ(z_{t1}, z_{t2}) = ( Σ_{l,m} |z_{t1}[l, m] − z_{t2}[l, m]|² )^{1/2}.    (10.2.3)

The proof of this assertion is the topic of the next section.

10.3 Linear Approximation for Functional Distance

To examine the reasoning behind this approximation, consider the dual formulation of the linear optimization problem of eqn. 9.1.1:

FD(Z_{t1}, Z_{t2}) = sup_g Σ_{i=1}^{N} g[i] · δZ[i],    (10.3.1)

subject to the constraints:

g[i] − g[j] ≤ d_F[i, j],    Σ_i g[i] = 0.

Please consult Section C.2 of Appendix C for a more detailed explanation of the primal–dual equivalence.

This cost function is nothing but an inner product between g : N → R and the difference vector δZ = Z_{t1} − Z_{t2}. Since inner products are preserved under orthogonal transformations, it is the case that

⟨g, δZ⟩ = ⟨Φ[g], Φ[δZ]⟩.

Denoting the coefficients of δZ in the basis Φ as δz[l, m] ≜ ⟨φ^(l,m), δZ⟩, the following theorem can be proved:

Theorem 1. Let δz[l, m] be the coefficients of δZ = Z_{t1} − Z_{t2} in the basis Φ. Then, there exist constants M_{0,0} > 0 and M̂_{0,0} > 0 such that

M̂_{0,0} Σ_{m=0}^{log₂N−1} Σ_{l=0}^{2^m−1} |δz[l, m]| ≤ FD(Z_{t1}, Z_{t2}) ≤ M_{0,0} Σ_{m=0}^{log₂N−1} Σ_{l=0}^{2^m−1} |δz[l, m]|,    (10.3.2)

and the tightness of this bound is:

sup_{||δz||₂=1} [ M_{0,0} Σ_{m,l} |δz[l, m]| − M̂_{0,0} Σ_{m,l} |δz[l, m]| ] ≈ (M_{0,0} − M̂_{0,0}) / √2.    (10.3.3)

The detailed derivations of the approximation bounds and their tightness are listed in Section C.3 of Appendix C.

As shown in Theorem 2 of Appendix C, similar bounds can be derived (or numerically evaluated) for any orthogonal basis Ψ defined on C with respect to the distance metric induced by F (provided Σ_i ψ^(l)[i] = 0). Identifying the basis with the tightest bound (cf. eqn. C.3.3) is a combinatorial optimization problem in N² variables with N(N + 1)/2 orthogonality constraints.

Given the computational complexity of this problem, we instead selected from among a set of standard bases by choosing the one with the best tightness. The approximation Φ defined in Section 10.2 was compared against the following other bases for the functional connectivity maps F, over all the subjects in our data–set. The minimum and maximum values of the bound–tightness metric, relative to the average value for Φ, are listed:

i The delta–basis {δ[i], i = 1 ... N}, i.e. the original voxel–wise data itself: 8.43–11.58
ii The PCA–like basis consisting of the eigenvectors of F: 3.21–4.66
iii The Laplacian eigenmap [16] basis containing the eigenvectors of the normalized graph Laplacian of F: 1.79–2.30
iv The basis set containing indicator functions on recursive normalized cuts [198] of the graph defined by F: 2.02–2.95
v The diffusion wavelet basis induced by F [42]: 0.89–1.13
vi An orthogonal basis derived from a spatial ICA decomposition [154]: 3.51–5.87

One reason for the comparatively tight bound of Φ is the fast decay of the coefficients in this basis, which makes their contribution to the error negligible. The relatively similar values of [iii] and [iv] arise because they are obtained from a similar set of operations on F, and the basis vectors share many properties in common, such as coefficient decay. Although the diffusion wavelet basis is a tighter approximation to the distance metric FD, its marginally better performance is offset by its much greater computational burden. The high variance of the ICA–derived basis could be because it is not directly related to F, and also because the coefficients in this basis are not sparse.

10.4 Feature Selection

The feature–space Φ obtained through the orthogonalization of the graph Laplacian L of F is of dimensionality N, the number of voxels. In order to reduce the dimension of this space, a feature–selection strategy is used, based on assessing the stability of the basis vectors through a non–parametric bootstrap analysis [56].

The bootstrap generates a non–parametric estimate of the sampling distribution of a statistic (i.e. the bootstrap distribution) from a single sample of the data, and is used in cases where generating multiple independent samples is infeasible. It creates multiple surrogate samples (i.e. resamples), of the same size as the original sample, by resampling with replacement from the original sample.

Bootstrap estimates of the functional connectivity matrix F are obtained by resampling fMRI volumes from a session Y = {Y_1 ... Y_T} to create a surrogate session. The presence of serial correlations in the time–series data Y prevents a naïve resampling scheme, in which T scans are randomly selected (with replacement), since it would destroy the background correlations present in the data. However, as fMRI scans are block exchangeable [166], a block bootstrap method can be used, wherein the T scans are divided into M blocks, and a resample is created by randomly selecting M blocks from this set, with replacement.

Although the block length T/M needs to be adapted to the range of temporal dependencies present in the data, the correlation structure in the data is faithfully reproduced over a fairly wide range of lengths [17]. We found T/M ≈ 5 TRs to be adequate for our data–sets.
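A sketch of the block–bootstrap resampler is given below: the session is cut into non–overlapping blocks of ≈ 5 TRs and a surrogate session of the same length is assembled from blocks drawn with replacement. The function name and the handling of the trailing partial block are assumptions.

# Block-bootstrap resampling: draw non-overlapping blocks with replacement to
# preserve short-range serial correlations. Names and edge handling assumed.
import numpy as np

def block_bootstrap(Y, block_len=5, rng=None):
    """Y: (T, N) fMRI session; returns a surrogate session of the same size."""
    rng = rng or np.random.default_rng()
    T = Y.shape[0]
    M = int(np.ceil(T / block_len))
    blocks = [Y[i:i + block_len] for i in range(0, T - block_len + 1, block_len)]
    picks = rng.choice(len(blocks), size=M, replace=True)  # blocks with replacement
    return np.concatenate([blocks[p] for p in picks], axis=0)[:T]

surrogate = block_bootstrap(np.random.default_rng(0).normal(size=(100, 50)))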

The stability of a particular vector φ^(l,m) is defined as its correlation across resamples of the data–set. Specifically, if φ^(l,m)_(r) is the estimate of φ^(l,m) from the r–th resample of the fMRI data Y, then the absolute correlation across two resamples r1 and r2 is18

ρ^(l,m)(r1, r2) ≜ |⟨φ^(l,m)_(r1), φ^(l,m)_(r2)⟩|.

Given the bootstrap distribution of correlations Pr_boot[ρ^(l,m)(r1, r2)], a vector φ^(l,m) is said to be τ_Φ–stable if Pr_boot[ρ^(l,m)(r1, r2) ≥ τ_Φ] ≥ 0.75, i.e. the correlation between at least 75% of the resamples of φ^(l,m) is greater than the threshold 0 ≤ τ_Φ ≤ 1.

If φ^(l,m) is not τ_Φ–stable, then it is discarded, which also removes all the vectors obtained from the subdivision of φ^(l,m). Therefore, increasing the value of τ_Φ causes a geometric increase in the number of vectors that are removed.
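The τ_Φ–stability test itself reduces to a few lines, as sketched below, assuming each resample's basis vectors are unit–norm so that the inner product equals the correlation; estimate_phi, which stands for running the basis construction of Algorithm 10.1 on one surrogate session, is hypothetical and not defined here.

# Sketch of the tau-stability criterion: keep phi^(l,m) if at least 75% of its
# pairwise absolute correlations across resamples exceed tau_Phi.
# `estimate_phi` is a hypothetical stand-in for Algorithm 10.1 on a resample.
import numpy as np
from itertools import combinations

def is_stable(phis, tau=0.5):
    """phis: list of R estimates of one basis vector (unit-norm, length N)."""
    rhos = [abs(np.dot(p1, p2)) for p1, p2 in combinations(phis, 2)]
    return np.mean(np.asarray(rhos) >= tau) >= 0.75

# Usage (schematic): phis = [estimate_phi(block_bootstrap(Y)) for _ in range(R)]
# keep = is_stable(phis, tau=0.5)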

A flow–chart of the feature–space computation is shown in Fig. 10.1.

10.5 Evaluation of Feature–Space

The effect of τ_Φ on the dimensionality D is shown in Fig. 10.2. Initially there is a steep reduction in dimensionality, from O(10⁵) to O(10²). However, after a certain value of τ_Φ the reduction slows down significantly. For all the data–sets tested, this knee–point usually occurred at D ≈ 500, corresponding to τ_Φ = 0.4–0.5. Therefore, τ_Φ was adaptively set for each fMRI session such that D = 500.

The figure also shows the largest index m of the vectors φ^(l,m) retained for a given τ_Φ, indicating that the stability of basis vectors reduces as the level of decomposition m increases. This observation, along with the 2^{−m} decay of the coefficients in eqn. 10.3.2, implies that the effect of the reduced dimensionality of Φ on the approximation error is small, as most of the discarded vectors have a large index m.

18 The absolute value is chosen to account for the indeterminacy in the sign of φ^(l,m).

[Figure 10.1 flow–chart: Y → resampling with replacement → functional connectivity estimation (Gaussian smoothing; HAC until Ñ ≈ 0.25N; cluster–wise correlation estimation and shrinkage; voxel–wise correlation estimation), repeated R times → basis vector φ^(l,m) computation → bootstrap distribution of correlations ρ^(l,m) → feature selection: retain φ^(l,m) if Pr[ρ^(l,m) ≥ τ_Φ] ≥ 0.75 → Φ.]

Figure 10.1: Overview of Feature–Space Computation. The functional connectivity of the brain is esti- mated from a resample of the fMRI data Y using the method described in Section 9.2. Basis vectors of the feature–space Φ are computed through a recursive orthogonal partitioning of these functional networks as per Algorithm 10.1. Dimensionality reduction is performed by retaining stable basis vectors using the bootstrap analysis of stability.


Figure 10.2: Dimensionality Reduction. The effect of threshold τΦ on the average dimensionality of Φ and on the maximum index m (level of decomposition) of the basis vectors φ(l,m) that were retained. Results are for the data–set of Section 3.2.


The relative logarithmic error of the linear approximation Δ(z_{t1}, z_{t2}) of FD(Z_{t1}, Z_{t2}), defined as:

| log₁₀ Δ(z_{t1}, z_{t2}) − log₁₀ FD(Z_{t1}, Z_{t2}) | / log₁₀ FD(Z_{t1}, Z_{t2}),

using the reduced Φ versus the full basis set, is shown in Fig. 10.3.

It can be observed that the linear approximation Δ(z_{t1}, z_{t2}) provided by the full basis Φ is typically within 2.5× the transportation distance FD(z_{t1}, z_{t2}), while the distance in the reduced–dimensionality basis is within 3 × FD(z_{t1}, z_{t2}), providing empirical validation of eqn. 10.3.3. Also, it can be seen that reducing the dimensionality by an order of O(10³) increases the approximation error on average by < 20%.

As an interesting side note, since the feature–space Φ is an orthogonalization of the graph Laplacian of the voxel–wise correlations, the feature–space coefficients y_t[l, m] of the fMRI volume Y_t exhibit very low temporal correlation19, and their covariance matrix is extremely sparse, with 98% of the cross–correlations having a value < 0.2, as shown in Fig. 10.4.


Figure 10.3: Approximation Quality. A scatter plot of the relative logarithmic (base–10) error in the ap- proximation of FD(zt1 , zt2 ) by ∆(zt1 , zt2 ) using the reduced vs. the full basis-set.


10.6 Conclusion

In the previous chapter we motivated and developed a distance metric to compare the difference between the neural activity distributions at two different time–instants (i.e. TRs) encoded in the metabolic traces recorded by fMRI. Although this functional distance provided a neurophysiologically meaningful indication of intrinsic classes of activity, its formulation made it mathematically intractable for advanced modeling and statistical analysis.

19 The correlation is exactly zero if the basis vectors are obtained from the orthogonalization of the covariance matrix, as compared to the correlation matrix.

149 0.08

0.06

0.04

0.02

0 0.2 0.4 0.6 0.8 1 Correlations (Abs)

Figure 10.4: Cross-Correlations in Feature–Space. Histogram of the non-zero cross-correlations (absolute values) of the feature–space coefficients.

formulation made it mathematically intractable for advanced modeling and statistical anal- ysis.

In response to this problem, and to address the issue of its high computational complexity, in this chapter we developed a linear Euclidean embedding of voxel–wise distribution of activity for each time–point that provides a good approximation of this functional distance.

The dimensionality of this feature–space, derived from an recursive orthogonal partition- ing of the graph of functional connectivities was reduced through a bootstrap analysis of stability.

In the next chapter, we shall use this representation of the fMRI data to build a spatio– temporal model using a state–space formalism.

150 CHAPTER 11

STATE–SPACE MODELS : TOWARDS A SPATIO-TEMPORAL REPRESENTATION

The modern age has a false sense of security because the great mass of data at its disposal. But the valid issue is the extent to which people know how to form and master the material at their command.

Johann Wolfgang von Goethe (1749–1832).

In Chapter 7 we pointed out multiple challenges with the status–quo of multivariate meth- ods for studying the representation of cognitive states from fMRI data, including:

• The bias of supervised methods towards modeled effects, preventing discovery of

intrinsic cognitive states not related to behavioral parameters

• The trade–offs of lack–of–interpretability and simple designs for multivariate pattern

classifier (MVPR) methods vs. the need for specifying the mathematical transforma-

tion of stimulus to fMRI signal and the homogenous hemodynamics for multivariate

linear models (MVLM)

• The fundamental drawback of all these methods, namely producing a static picture

of the inherently dynamic and changing processes of cognition

151 In this chapter, we shall adduce a solution to some of these issues with an unsupervised temporally resolved multivariate analysis based on a state–space model of brain function.

Extant literature on state–space models in fMRI is reviewed in Section 11.1.

In this formulation, a first order Markov chain captures the concept of the functional brain transitioning through a cognitive state–space as it performs a mental task. Each state is associated with a characteristic spatial distribution of activity and an occurrence of the state corresponds with an activation pattern based on this signature. The observed fMRI data arise from the convolution of these activation patterns with hemodynamic response function (HRF). The model is described further in Section 11.2.

Then in Section 11.3, an expectation–maximization (EM) [50] algorithm with Gibbs sam- pling [194] is used to estimate the Markov structure, the activation maps and the optimal sequence of brain–states for a given data–set. The effect of assuming a fixed shape of the

HRF is ameliorated by marginalizing it out under a Laplace approximation. A method to determine the correct model size resulting in a model most relevant to the investigator is proposed in Section 11.4.

A quantitative validation of the estimation algorithms is given in Section 11.5.1. The results of this dynamical multivariate analysis on the study of mental arithmetic (c.f Section 3.2) are reported in Section 12.5.3. Finally, the chapter concludes with some remarks and ob- servations in Section 12.6.

Appendix D contains the complete proofs and details about the algorithms developed in this chapter.

152 11.1 Related Work

State–space models have been previously used in fMRI either for determining the activation

state of individual voxels in a univariate fashion. Bayesian hidden Markov models (HMM)

with MCMC sampling have been used to determine the activation state of individual voxels

in blocked-design single-trial experiments [95]. In another approach [53], for each voxel,

a two-state HMM was created, and the model parameters were estimated from the voxel

time–series and the stimulus paradigm. Activation detection associated with known stimuli

has also been done with hidden Markov multiple event sequence models (HMMESM) [59], that pre-process the raw time–series into a series of spikes to infer neural events at each voxel.

A hidden process model (HPM) [98] was used decompose each voxel’s time–series into a set of “neural processes” and their instantiations. For each process a spatio–temporal map is generated giving the voxel locations and the probability of the process onset relative to some external event. Since all the possible processes and the configurations of their in- stances have to be pre-specified, HPMs are limited to testing specific hypothesis involving simple interactions of a small number of neural processes. A multi-variate ARMA formal- ism was used to effect a dynamical components analysis that extract spatial components and their time–courses from fMRI data, given the experimental stimuli [205].

Dynamic Bayesian networks have been also used to study the time-varying functional in- tegration [225] of a small number of pre-selected functional modules, from the interdepen- dency structure of their average time–series.

153 α,π . . . xt xt+1 xt+2 xt+L

μk Σk K . . . zt zt+1 zt+2 zt+L

γ μγ σγ

. . . h yt yt+1 yt+2 yt+L

T Σε

Figure 11.1: The Full State–Space Model. The hidden brain–states are shown by xt giving rise to dis- tributed activation patterns zt. The fMRI data yt ... yt+L 1 is observed after convolution of the neural activations with the hemodynamic response h. −

In contrast to the above methods, this chapter presents data-driven multivariate method for the dynamical analysis of mental processes in a time–resolved fashion using a phenomeno- logical model for brain function.

11.2 State–space Model

The state–space model with K hidden states is parameterized by θ = {α, π, ω, Σ} as shown in Fig. 11.1.

N Here, yt ∈ R , t = 1 ...T is the observed fMRI data after feature–space transformation

20 , with the corresponding experimental conditions given by st.

The underlying mental process is represented as a (hidden) state–sequence xt ∈ [1 ...K], for t = 1 ...T . Let the state marginal probability be denoted by α = (α1 . . . αK ), where

20The fMRI data in voxel–space is denote by Y.

154 αk , p(xt = k). The state transition probabilities are given by the K ×K stochastic matrix

π, where πi,j , p(xt+1 = j|xt = i), the transition probability from state i to j.

Each state xt = 1 ...K is associated with a characteristic activation pattern zt. The emis- sion model has a two-level hierarchy to account for the fact that yt is the hemodynamic response to the (unobserved) neural activation pattern zt corresponding to state xt. The activity signature corresponding to xt = k is assumed to be normally distributed in the feature–space Φ (cf. Chapter 10) with mean µk and variance Σk. Let ϑ , {ϑ1 . . . ϑk} be the emission parameters of the model, where ϑk , (µk, Σk), the emission parameters for state k.

The measured fMRI signal yt is obtained by a linear convolution of spatially distributed activation signatures with an HRF h as per 21 :

L X yt = hlzt−l + t. l=0

Here, t ∼ N (0, Σ) is a time-stationary noise term. The HRF is a FIR filter of length

L + 1 given by the difference of two Gamma functions [66], with non-linear parameters γ controlling its delay, dispersion, and ratio of onset-to-undershoot, with prior density p(γ) =

N (µγ, σγ).

Examining this graphical model using the d–separation criteria [174], it can observed that conditioned on y all the x and z variables are dependent on each other. This dependency structure complicates the marginalization of the hidden variables x and z (as required by

EM) since the integration cannot be factorized with respect to these variables. In Chapter 12

21 P The voxel-wise data Yt = l hτ Zt l is modeled as a linear convolution of the activation patterns Z, − P and because the linear projection zt , Φ[Zt] is commutative with convolution, it holds that yt = l Hlzt l − in the feature space, where yt , Φ[Yt].

155 we shall address this problem through the use of the mean–field approximation. However, in this chapter, we make the additional simplifying assumption that yt, t = 1 ...T are inde- pendent of each other given x, giving an approximative reduced model shown in Fig. 11.2.

This assumption is equivalent to collapsing the z layer into the x layer and neglecting the structured variability introduced by it22.

Expanding out the linear dependence of y on x through h and considering z as a “fixed effect”, the following probability model is obtained:

p(y, x|θ, h,K) = p(y|x, ϑ, h)p(x|α, π) (11.2.1) T T Y Y p(x|α, π, K) = p(xt|xt−1, α, π) = αx1 πxt,xt−1 (11.2.2) t=1 t=2 T Y p(y|x, ϑ, h,K) = p(yt|xt−L . . . xt, θ) (11.2.3) t=1 L L ! X X 2 p(yt|xt−L . . . xt, ϑ, h,K) ∼ N hlµxt−l , hl Σxt−l + Σ , (11.2.4) l=0 l=0 where

θ , {α, π, ϑ, Σ} L X µt−L...t , µxt−l hl l=0 L X 2 Σt−L...t , Σ + Σxt−l hl (11.2.5) l=0

Thus, the convolution introduces a dependency only between states xt−L . . . xt, when con- ditioned on observation yt. Although this violates the first-order Markov property required for the classical forward-backward recursions, its parameters can be efficiently estimated using the L + 1–order Monte–Carlo algorithm of Section 11.3.

22 Refer to Appendix D.5 for a full justification of this.

156 α, π μk Σk K

x x x x t t+1 t+2 … t+L-1

γ μγ σγ

y y y y h t t+1 t+2 … t+L T

Σε

Figure 11.2: The Reduced State–Space Model. The hidden brain–states are shown by xt, the activation pattern is observed in the fMRI data yt ... yt+L 1 (in feature–space) after convolution with the hemodynamic − response h. The intermediate zt layer is dropped.

Also, the following short-hand notation is used through out the paper: yt1...t2 , {yt1 ... yt2 }, y , {y1 ... yT } and similarly for x. Also, define pθ(·) , p(·|θ, h,K). Matrix transpose is denoted by the > operator. The notation introduced here is compiled in Table 11.1.

157 Symbol Definition

> Matrix transpose operator T Total number of time–points in an fMRI session N Total number of (cortical) voxels in an fMRI volume K Total number of hidden brain–states Φ = {φ(l,m) ∈ RN } Orthogonal basis functions of the feature–space xt ∈ [1 ...K] The brain–state at 1 ≤ t ≤ T N Yt ∈ R The fMRI scan in voxel–space at 1 ≤ t ≤ T

Y Defined as (Y1 ... YT ) D yt ∈ R The fMRI scan in feature–space at 1 ≤ t ≤ T y Defined as (y1 ... yT ) st The stimulus vector at time t

αk The marginal probability of state k

πxt−1,xt , p(xt = The transition probability from state i to state j j|xt−1 = i)

ϑk = (µk, Σk) The emission parameters for state k h The hemodynamic finite impulse response (FIR) filter of length L + 1 γ The non–linear parametrization of h

µγ, Σγ The mean and variance of the prior distribution of γ

t ∼ N (0, Σ) Normally distributed noise

θ , {α, π, ϑ, Σ} The model parameters pθ(◦) , p(◦|θ, h,K) Parameterized probability density function w Multinomial logistic regression weights

Table 11.1: A summary of the notation used throughout this chapter on the unsupervised state–space model

158 11.3 Model Estimation

11.3.1 Parameter Estimation

The maximum likelihood (ML) estimate θML = arg maxθ ln p(y|θ, K) is obtained using

EM [22] which involves iterating the following two steps until convergence:

X E-step: Q(θ, θn) = p(x|y, θn, h,K) ln p(y, x, θ|h,K), (11.3.1) x M-step: θn+1 = arg max Q(θ, θn). θ

Because of the inclusion of the FIR filter for the HRF, which violates the first-order Markov property of the state–sequence x when conditioned on an observation yt, the EM update equations take the following form (cf. Appendix D.1):

p (n) (x = k|y) αn+1 = θ 1 k PK 0 k0=1 pθ(n) (x1 = k |y) PT p (n) (x = k , x = k |y) πn+1 = t=2 θ t 1 t+1 2 k1,k2 PK PT 0 k0=1 t=2 pθ(n) (xt = k1, xt+1 = k |y) X µn+1 = H− µn+1 k k,k0...kL k0...kL k0...kL X and Σn+1 = G− Σn+1 , (11.3.2) k k,k0...kL k0...kL k0...kL

The updates to µt−L...t and Σt−L...t (cf. eqn. 11.2.5) for one particular assignment {k0 . . . kL} of the sequence L + 1 states long are:

PT p (n) (x = k . . . k |y)y µn+1 = t=1 θ t−L...t 0 L t , k0...kL PT t=1 pθ(n) (xt−L...t = k0 . . . kL|y) PT n+1 n+1 > p (n) (xt−L...t = k0 . . . kL|y) · (yt − µ )(yt − µ ) Σn+1 = t=1 θ k0...kL k0...kL . k0...kL PT t=1 pθ(n) (xt−L...t = k0 . . . kL|y)

159 Here, H and G are the KL+1 × K convolution matrices that give the relationship between

HRF coefficients hl, the activation pattern means µk for state k and the µt−L...t,s Σt−L...t for any assignment of an L + 1 length state–sequence:   hL + ... h0 0 ... 0 0      hL + ... h1 h0 ... 0 0   . . . .  H =  ......  ,  . . . .     0 0 ... h h + h   L L−1 0  0 0 ... 0 hL + h0   2 2 hL + ... h0 0 ... 0 0    2 2 2   hL + ... h1 h0 ... 0 0   . . . .  G =  ......  ,  . . . .     0 0 ... h2 h2 + h2   L L−1 0  2 2 0 0 ... 0 hL + h0 and H− is the (k, k . . . k )–th element of the pseudo-inverse of H, given by H− = k,k0...kL 0 L (H>H)−H>. Even though H is an KL+1 × K matrix, it is extremely sparse with each column k of H having only 2L+1 non-zero entries corresponding to those µn+1 where k0...kL

> L+1 2 k ∈ {k0 . . . kL}. Therefore, H H is computed in O(2 K ) time, and is inverted using

the SVD pseudo-inverse. Similarly for G.

Using the relationship pθ(n) (x|y) = pθ(n) (y, x)/pθ(n) (y) and the fact that pθ(n) (y) is can-

celed out by the numerators and denominators of eqn. 11.3.2, the conditional densities are

replaced by their joint densities pθ(n) (y, xt), pθ(n) (y, xt,t+1) and pθ(n) (y, xt−L...t). These are

calculated as:

X pθ(y, xt) = a(xt+1−L...t)b(xt+1−L...t)

xt+1−L...t−1 X pθ(y, xt,t+1) = a(xt+1−L...t) · pθ(yt+1|xt+1−L...t+1)pθ(xt+1|xt) · b(xt+2−L...t)

xt+1−L...t−1

pθ(y, xt,t+L) = a(xt,t−1+L) · pθ(yt+L, xt,t+L) · pθ(xt,t+L) · b(xt+1...t+L). (11.3.3)

160 where a and b are the forward–backward recursion terms:

X a(xt+1−L...t) = pθ(y1...t, xt+1−L...t) = pθ(n) (yt|xt−L...t)pθ(n) (xt|xt−1) · a(xt−L...t−1)

xt−L X b(xt+1−L...t) = pθ(yt+1...T |xt+1−L...t) = pθ(n) (yt+1|xt+1−L...t+1)b(xt+2−L...t+1). xt+1 (11.3.4)

The derivations of these terms are elaborated in Appendix D.2.

The summations (i.e. expectations) over the densities of state-sequences L long of the P form p (n) (y, x )[...] in eqns. 11.3.2 and 11.3.3 are replaced with Monte– xt−L...t θ t−L...t

Carlo estimates, by Gibbs sampling from the distribution pθ(n) (y, xt−L...t) with stochastic

forward-backward recursions [194].

The same EM procedure can estimate θML given multiple fMRI data–sets corresponding to a group of subjects, with slight modifications to the update equations.

11.3.2 HRF Marginalization

The dependence of θML on a specific HRF filter h is removed by marginalizing out h under a

∗ R Laplace approximation to obtain a Bayesian estimate θ = h θML(h)p(h)dh, independent of h. It is computed through Monte–Carlo integration by first sampling the parameter γ

from N (µγ, σγ), constructing h(γ), finding θML(h) and then averaging over all samples.

Please consult Appendix D.3 for details.

161 11.3.3 Optimal State–Sequence Estimation

Given a set of parameters θ and observations y, the most probable sequence of states x∗ = arg max ln pθ(y, x) is estimated by backtracking through the following recursive system:

max ln pθ(y, x) = max ηT x xt−L...T

where ηt = max [ln pθ(yt, xt−L...t) + ln pθ(xt|xt−1) + ηt−1] xt−1

and η1 = ln pθ(y1|x1) + ln pθ(x1)

The maximization over sequences of states L + 1 long xt−L...t is done using iterated condi- tional modes (ICM) [22], with random restarts. The detailed derivations for this procedure are given in Appendix D.4.

11.4 Model Size Selection

Model–size (i.e. K) selection can be done using Bayes factors [112], information theo- retic criteria [151], reversible jump MCMC based methods [194] or non-parametric ex- tensions [15]. The reader is referred to the excellent article by Lanterman [126] for a theoretical perspective on these model selection methods.

These methods are equivalent in that they strike a compromise between model complexity

(typically measured by the number of states) and the ability of the model to explain the data (typically measured by p(Y|K), the model evidence). In the absence of a domain– specific way to define model complexity, information based methods select models that minimize Kolmogorov complexity, while Bayesian methods specify a prior derived either

162 from equivalent definitions of complexity or through empirical ones such as hierarchical models or reference priors [19].

Instead we adopt an alternative strategy where experimental conditions st are used to select

K that results in a maximally predictive model, The rationale behind this strategy is that fMRI data may contain multiple spatio–temporal patterns of both neurophysiological (such as default–network and other non–task related mental processes) and of extraneous (such as respiration, pulsatile, head–motion) origin, of which only task related effects are of interest to the investigator. This criterion enforces that link by selecting a model which has identified the most relevant patterns. Although this step introduces a dependence of the experimental conditions on the model, the parameters themselves are estimated without reference to the task in an unsupervised fashion.

Let x∗,K denote the optimal state–sequence for an fMRI session y produced by the model

∗ with K states and optimal parameters θK . And, let s = (s1 ... sT ) denote the corresponding experimental conditions recorded during the session. A multinomial logistic regression

(MLR) classifier [22] with weights w is trained to predict the state xt at time t given a stimulus vector st according to the formula:

> exp{st wk} Pr[xt = k] = (11.4.1) PK > 0 k0=1 exp{st wk}

∗ ∗ ∗,K The optimal K is then selected as K = arg min ERRpredict(x , w) where R is the error– rate (i.e. risk) of predicting the optimal state–sequence x∗,K from the experimental condi- tions as per:

h i ∗,K E ∗,K ERRpredict(x , w) = 1 − Pr[xt = xt ] , (11.4.2)

163 and is computed using cross–validation23.

Therefore, the model Mi trained on a data–set y for a subject i consists of the tuple Mi =

∗ ∗ (Φ, θK ,K , w), viz. the feature–space basis, the optimal model parameters, the optimal number of states, and the MLR weights.

11.5 Results

This section starts off with a quantitative validation of the model and state–sequence esti- mation algorithms using a simulated data–set in Section 11.5.1. Then in Section 11.5.2, the results of the method applied to the mental arithmetic task of Section 3.2 are reported.

11.5.1 Simulation

11.5.1.1 Methods and Materials

This section reports a quantitative validation of the model and state–sequence estimation algorithms using synthetic data–set, created as follows. For all simulations, the number of time–points was T = 600, the dimension of the feature–space was D = 500 and the dimension of the stimulus vector st was set to 5, to reflect a typical fMRI data–set. The model size K was varied from 5 to 50.

The MLR weights w were initialized from a uniform distribution over [−1, 1]. The hemo- dynamic FIR coefficients h were obtained by sampling the HRF parameters γ from N (µγ, Σγ).

23 Specifically, the data is partitioned into M–folds, each of T/M fMRI time–points selected randomly. ,K The weights wk are learnt from the (xt∗ , st) pairs of M − 1 folds and the error–rate is assessed on the ,K (xt∗ , st) from the remaining 1 fold.

164 The emission parameters (µk, Σk) for each state k were obtained by sampling Σk from a

Wishart distribution W(T, ID) and µk from N (0, Σk). The noise variance Σ was sampled

−1 from W(T, β ID) The parameter β effectively controls the SNR of the data, by control-

ling the ratio of the noise variance to that of the activation patterns. For each time–point

t = 1 ...T , the stimuli st were generated from a normal distribution N (0, I5). The values

of xt were sampled from the multinomial distribution arising from eqn. 11.4.1, then zt from

the normal distribution N (µxt , Σxt ) and then yt from the convolutive model with Gaussian noise with variance Σ. Note that this sampling scheme corresponds to the full model of

Fig. 11.1, in order to test the accuracy of the approximation implied by Fig. 11.2.

The Gibbs sampler (sampling from sequences L + 1 states long) had a burn–in time of

100 samples and convergence of parameter estimates was defined using the scale reduction

factor criterion [27]. The experiments were repeated with β = 10, 100 and 1000 corre- sponding to SNR of 10, 20 and 30dB respectively.

11.5.1.2 Discussion

Table 11.2 limns the average running–time and error–rates for the simulation. Included are the error in the parameters estimates ERRestimate, defined as the average relative error in the

∗ ∗ ∗ ∗ estimates w , µk, Σk, and Σ , the model–size estimation error ERRK , and the prediction

error ERRpredict (cf. Section 11.4.2). Of special interest is the error in the µk, k = 1 ...K

parameters which correspond to the spatial distribution of activity representative of each

state. The average estimation error of the spatial distribution parameters, defined as:

X ∗ > −1 ∗ ERRspatial = 1/K (µk − µk) Σk (µk − µk), k

165 SNR Running–Time ERRestimate ERRK ERRpredict ERRspatial (dB) (hours) % % % %

30 4.305 ± 0.72 21.13 ± 8.25 12.70 ± 5.35 20.77± 4.45 17.56±3.11 20 4.63 ± 0.96 28.89 ± 11.14 18.91 ± 8.62 41.59± 9.53 26.57±9.30 10 5.28 ± 0.82 52.45 ± 17.93 44.33 ± 15.12 56.60± 7.25 43.41±12.39

Table 11.2: Effect of SNR on the total running–time, estimation error ERRestimate, model–size error ERRK , prediction error ERRpredict, and spatial activity error ERRspatial for the simulation study of the state–space model. All values are listed ±1 std. dev.

∗ where µk is the estimate of µk.

Firstly, reducing SNR causes an increase in running–time due to slower convergence of

the estimates as a function of the number of Monte–Carlo samples. Also, for a sufficiently

high SNR the estimate of model–size using the maximally predictive criterion was within

20% of the true model–size validating this strategy. The error in the parameter estimates in-

creases drastically as the SNR goes from 20 to 30dB, indicating a breakdown in estimation

stability as the noise level crosses a certain threshold. The error in the activity signature

parameters ERRspatial exhibits a similar trend. Interestingly, the prediction error increases

by almost 20% from 10 to 20dB – which points to a high sensitivity of prediction accuracy

on estimation quality.

The high baseline error of ≈ 20% in parameter estimation is due to the simplification of eliminating the z layer (cf. Fig. 11.2) in the estimation algorithm. The alternative mean–field approximation presented in the next chapter, which does not necessitate this simplification, yields much higher quality estimates for an equivalent simulation (cf. Sec- tion 12.5.1).

166 11.5.2 Mental Arithmetic

11.5.2.1 Methods and Materials

This method was tested on the mental arithmetic data–set described in Section 3.2. We restricted this study to the 20 controls and 13 dyscalculic (DC) subjects only. For each t =

1 ...T , the experimental conditions are described by the vector st = (Ph, LogPs, LogDiff), where Ph is the phase within the current trial in 1 TR increments with respect to its start,

LogPs, which gives the product size for the presented problem and LogDiff, which gives the expected difficulty in judging the right answer, are quantized into two levels.

One model M = (Φ, θ, K, w) was trained per subject and the following statistics were calculated:

self ERRpredict: The “within–subject” prediction error–rate of the optimal state–sequence as- signed to the fMRI run of one subject using the model for that subject (i.e. trained on

the same data)

cross ERRpredict: The “between–subject” prediction error–rate of the optimal state–sequence as- signed to the data of one subject using a model trained on the data of another subject

MI(i, j): Mutual Information between the state sequences generated for the same fMRI

session y by the models Mi and Mj for two different subjects i and j, described

further in Appendix D.6

The mutual information quantifies the similarity of the two models, where higher MI indi-

∗ cates better similarity, with a maximum of log2 K . In general the correspondence between

167 the state labels of two different models is unknown. By comparing the state-sequences of

the same data generated by the two models, this correspondence can be determined. A

higher MI indicates a higher level of correspondence between the state-sequences of the

same data when labeled by two different models, while an MI of zero indicates absolutely

no correspondence. This procedure is illustrated in Fig. 11.3.

The reader may peek ahead to Fig. 11.3 in the next chapter for an illustration of this concept.

33 This procedure applied to all ( 2 ) pairs of subjects yields a pair-wise similarity matrix.

These MI relationships can then be visualized in 2D by using multidimensional scal- ing (MDS) [120] as shown in Fig. 11.5.

Spatial maps of the activation patterns for a specific value of the experimental variables st is obtained by first computing the activation pattern mean:

" K # X µst = Pr[Xt = k]µk k=1 " K # X Σst = Pr[Xt = k]Σk k=1

where Pr[Xt = k] are the MLR probabilities from eqn. 11.4.1 for the condition st. The

−1/2 z–score map for the activation pattern corresponding to ˆs is given by Σst µst in feature–

space, which can then be transformed back into a voxel-wise spatial map of activity. The

group–level spatial maps 24 for the three phases of the two groups are shown in Fig. 11.6

24Displayed as a t–score map of voxel–wise group average divided by group std. dev.

168 st st+1 ut+2 st+L-1 … W λW

x x x x t t+1 t+2 … t+L-1

μk Σk xt xt+1 xt+2 xt+L-1 zt zt+1 zt+2 zt+L-1 K … Φ1 …

h μh Σh

y y y y Σ t t+1 t+2 … t+L-1 ε T Subject 1

Φ2 1 2 … 41 42 1 10.00 8.94 … 8.50 5.40 Subject 2 2 8.94 10.00 … 1.54 0.29 Y ⁞ ⁞ ⁞ ⁞ ⁞ 41 8.50 1.54 … 10.00 3.95 fMRI 42 5.40 0.29 … 3.95 10.00 data for . subject i . Pair-wise MI . matrix

Φ42

Subject 42

Figure 11.3: Computation of Pair–Wise MI Between All Subjects. One model Mj = {Φj, θj} was trained per subject. The data of another subject / session, say subject i, was then first projected into the feature–space Φj of subject j and then the optimal state–sequence for the data of subject i was computed using θj.

169 11.5.2.2 Discussion

The average values of these statistics are compiled in Table 11.3. The “chance–level”

prediction error is ≈ 83% calculated by permuting the stimuli with respect to the scans.

self cross Group K∗ ERRpredict (%) ERRpredict (%) MI (bits)

Controls 19.67 ± 2.44 34.23 ± 5.40 41.69 ± 8.90 3.81 ± 0.12 DC 23.18 ± 6.53 39.37 ± 9.81 58.56 ± 9.76 2.94 ± 0.15 Controls vs. DC 60.03 ± 9.34 2.72 ± 0.33

Table 11.3: The group–wise average values for the optimal number of states K∗, within–subject and self cross between–subject prediction errors ERRpredict, ERRpredict respectively and the between–subject mutual infor- mation are tabulated. The last row shows the between–subject prediction errors and mutual information comparing control subjects and DC subjects.

Here, a larger variation model–sizes for the DC group can be observed as compared to the controls, even though their group size is much smaller. This points to a greater het- erogeneity in the data across the DCs necessitating models with different sizes. Also, the consistently higher error–rate of the DC population indicates the relative inaccuracy of their models, as compared to the controls. Moreover, the MI between DC subjects is on par with that between DC and controls. This means that while one DC model labels a fMRI run quite incongruently to a control model, it even labels the data differently as compared to another DC model. This reaffirms the conclusion of high heterogeneity in the mental strategies adopted by DC subjects.

self cross The phase–wise breakdown (according to 1TR long phases Ph) of ERRpredict, ERRpredict and MI is shown in Fig. 11.4.

170 self cross (a) ERRpredict (b) ERRpredict

Controls DCs Controls vs DCs LogPs Effect LogDiff Effect (c) MI (d) Legend

Figure 11.4: 1 TR Long Phase–wise Breakdown of Error Rates and MI. Figs.(a)–(c) show the effect of experiment phase Ph, product–size LogPs and product–difficulty LogDiff on within–subject error–rate self cross ERRpredict, between–subject error–rate ERRpredict and mutual information MI. The LogPs effect is measured as the value for high LogPs minus that for low LogPs. Similarly for LogDiff. The background color-coding shows timing of each trial using the color scheme of Fig. 3.2.

Here, for both groups the error is low in the initial (read) phase of the experiment and in- creases in the second (judgement) phase. The high rates during the third phase, are possibly due to increased conflict involved in making a decision. Additionally, this phase is highly variable between repetitions and very often overlapped with the inter-trial rest interval re- ducing predictability. Again, the between–subject error rates of the control vs. DC case is on par with that between the DCs themselves.

The product–size LogPs increases the predictability of the control group during the first two phases while has negligible effect on the DC group, for both the within–subject and between–subject cases. This effect early on in the trial is attributable to a stronger effect of

171 product recall from the rote tables located in the lower left parietal lobe, and strong number

size effects in the occipital visual areas. The later effect of LogPs is consistent with strong activation of the working verbal (mute-rehearsal) and visual [187].

The problem–difficulty LogDiff effect is noticeable after the second 2.64s phase, which is expected since it depends on the onset of the incorrect result Rd displayed at 2.8s. It causes

a strong reduction in error for controls in the within–subject case while has almost no effect

for the between–subject case. Its effect is strongest in the third (response) phase since it

mainly affects attention and conflict resolution, and is probably not a number–related effect.

The trends in the MI plots tell a story confirmatory to that of the between–subject prediction

error. Of interest here is the slight dip in MI during the second phase followed by an

increase in the third phase for the controls. This could reflect a second recruitment of

working memory located in the right intra-parietal suclus as the controls recapitulate the

multiplication problem, after they have finished the task [163].

From the MDS plots of Fig. 11.5, an overall clustering of control and DC subjects is no-

ticeable with the separation increasing as LogPs increases. In the first phase of the task, the separation increases, while during the second phase the picture gets murkier. Both clusters spread out, with a greater dispersal of the DC subjects in the 2–D plot. LogDiff seems to have a slight organizing effect tightening the clusters. It should be noted that this labeling is applied post hoc, after plotting all the subjects using MDS on the pairwise MI matrix, and therefore an intrinsic organization in the spatio–temporal patterns of the subjects in each group has been discovered.

172 Control Male Control Female DC Male DC Female

(a) Overall (b) Overall – Product Size Effect

(c) Phase 1 (d) Phase 1 – Product Size Effect

(e) Phase 2 (f) Phase 2 – Problem Difficulty Effect

Figure 11.5: MDS plots of the Mutual Information Between All Pairs of Subjects. Fig.(a) shows the MDS plots of the subjects based on their overall MI and Fig.(b) shows their layout in 2–D for the MI computed on high LogPs stimuli. Figs.(c) and (d) show relative arrangement of the subjects during the first 2.64s phase of the trial for the overall and high LogPs cases. Similarly, Figs.(e) and (f) plot the subjects during the second 2.64s phase for the overall and high LogDiff cases.

173 As the spatial maps reported here are qualitatively similar to those of Chapter 12, please refer to Section 12.5.3 for a description of the activation patterns along with a detailed interpretation of the results.

11.6 Conclusion

In this chapter, we put forward an approach towards decoding and representing the spatio– temporal information about mental processes in fMRI data, using a hidden Markov model.

The hemodynamic coupling between metabolic activity and the BOLD response was ac- counted for by an additional hidden layer. However, due to the computational complexity incurred for a Monte–Carlo based EM, this hidden layer was approximated out through an assumption of independence of the fMRI data given the state–sequence.

The effect of assuming spatially invariant hemodynamics was ameliorated through marginal- ization under a Laplace approximation. Model selection was then performed using a max- imally predictive criteria. We applied the method to a group-wise study for developmen- tal dyscalculia, and demonstrated task-related differences between healthy controls and dyscalculics, which were systematically organized in time.

Although this model had the ability to predict experimental conditions at better than chance levels, the estimation and prediction error–rates are quite high and very sensitive to noise, as verified by the simulation study. The fact that absolutely no information about the structure of the experiment is used to stabilize estimation 25 exacerbates the problems caused by the independence assumption and the spatially stationary HRF.

25The link to the experiment happens only in the model selection step.

174 In the next chapter to address these drawbacks, this model is augmented to incorporate stimulus information and spatially varying and unknown hemodynamics and a fast estima- tion algorithm is developed that does not necessitate the independence assumption.

175 Ph 1

Ph 2

Ph 3

(a) Control Group

Ph 1

Ph 2

Ph 3

(b) DC Group

Figure 11.6: The group–wise activation t–score maps (defined as group average divided by group std. dev. per voxel) for the the phases of the task for the control and dyscalculic groups in Fig.(a) and Fig.(b) respec- tively. The hot color scheme , with magnitude increasing from left to right, shows the activation patterns for the corresponding phase of the task, with negative values masked out for visual clarity.

176 CHAPTER 12

STATE–SPACE MODELS : A SEMI–SUPERVISED APPROACH

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.

George P. Box (1919–) and Normal Draper, Empirical Model-Building and Response Surfaces, 1987.

The previous chapter introduced a dynamic system formulation of brain–function as a se- quence of abstract brain–states stepping through a mental state–space, in order to determine the intrinsic states of the subject from the data without reference to experimental conditions.

This chapter augments that model with information about the experimental task to create a semi–supervised temporally resolved multivariate analysis. This is done in order to ad- dress the main challenges of purely unsupervised methods, that of enforcing a link back to the experimental variables and of stabilizing estimation. For example, in the simulation study of the unsupervised method we observed that the errors were generally high and that solutions were very sensitive to noise. In contradistinction, the semi–supervised approach does not preclude discovery of new and unexpected patterns, but simultaneously guides the discovery towards patterns of interest to the investigator and mitigates the effect of the confounds that plague fMRI.

177 As previously, a first order Markov chain captures the concept of the functional brain transi-

tioning through an abstract state–space as it performs a mental task, where each state has a

characteristic spatial distribution of (metabolic) activity which gives rise to the fMRI signal

by convolution with an HRF. The difference is that here the brain–state at each time–point

is affected not only by the current experimental conditions, but also by the previous state.

The model is developed in Section 12.1.

The previous model assumed a spatially uniform HRF and resorted to reducing its impact

through marginalization. This compromise is eschewed in this augmented model by allow-

ing a spatially varying HRF which is estimated from the data.

To make estimation of the model of Chapter 11 computationally tractable, it had to be

simplified to assume independence of the fMRI data, given the state–sequence. This as-

sumption is obviated here through the use of an efficient algorithm based on the mean–field

approximation of expectation–maximization (EM) [50] to estimate the Markov structure, the activation maps, the hemodynamic filter along with predictions of missing stimuli as per Section 12.2. The optimal sequence of brain–states for a given data–set is again esti- mated using EM with the mean–field approximation as elucidated in Section 12.3.

The maximally predictive method to determine the correct model size and other hyper– parameters in an automated fashion that best describe the task being performed by the subject is given in Section 12.4.

The Markov chain of brain–states serves two purposes: a) To enforce a temporal ordering on the states, and b) To decouple the stimulus from the fMRI signal, thereby avoiding specification of the exact mathematical relationship between the two. This second aspect

178 makes it a semi–supervised method as it uses the stimuli to guide estimation but does not

preclude discovery of new patterns in the data and investigation of effects not explicitly

encoded in the experimental variables. The model can predict the value of experimental

stimuli at new frames and is able to estimate a spatially varying HRF from the data.

The outline of the different processing steps of the method presented in this chapter is given

in Fig. 12.1.

A quantitative validation of the estimation algorithms is given in Section 12.5.1. Sec-

tion 12.5.2 illustrates the single–subject spatio–temporal maps produced by the method and

provides a comparative evaluation with GLMs and SVMs, using a visuo–motor task. Sec-

tion 12.5.3 demonstrates the novel insights provided by such a phenomenological spatio–

temporal model on the study of mental arithmetic introduced in Section 3.2. Finally, the

chapter concludes with some remarks and observations in Section 12.6.

The notation used during the course of this chapter is collated in Table 12.1. Note: As previously, > denotes the matrix transpose operator.

179 fMRI Data Y Hyperparameter Selection

Φ Feature-Space Transformation Error Feature-space basis K, λ W Rate y Hyper parameters

Model Estimation E-step Compute q(n)(x,z) from p(y,z,x|θ(n)) Until convergence M-step Estimate θ(n+1) : L(q(n), θ(n+1)) > L(q(n), θ(n))

s Stimulus Parameters θ

State Sequence Estimation E-step Compute q(n)(z) from p(z| y,x(n),θ) Until convergence M-step x(n+1) = argmax L(q(n), x)

x

Figure 12.1: Outline of the Method. The data (y), after projecting into the low dimensional feature space (cf. Chapter 10) are used to estimate model parameters θ through a generalized EM algorithm (cf. Sec- tion 12.2). Model hyper–parameters K and λw are selected to minimize the error of predicting the stimulus s (cf. Section 12.4). Given a set of model parameters, the optimal state–sequence x is estimated using EM (cf. Section 12.3).

180 Symbol Definition

> Matrix transpose operator T Total number of time–points in an fMRI session N Total number of (cortical) voxels in an fMRI volume K Total number of hidden brain–states Φ = {φ(l,m) ∈ RN } Orthogonal basis functions of the feature–space xt ∈ [1 ...K] The brain–state at 1 ≤ t ≤ T N Yt ∈ R The fMRI scan in voxel–space at 1 ≤ t ≤ T

Y Defined as (Y1 ... YT ) D yt ∈ R The fMRI scan in feature–space at 1 ≤ t ≤ T y Defined as (y1 ... yT ) N Zt ∈ R The (pre–HR) brain activation pattern at 1 ≤ t ≤ T in voxel– space

Z Defined as (Z1 ... ZT ) D zt ∈ R The (pre–HR) brain activation pattern at 1 ≤ t ≤ T in feature– space z Defined as (z1 ... zT ) st The stimulus vector at time t ut Unobserved (hidden) stimulus vector at time t w , {wi,j, ωj, i, j = State–transition probability parameters 1 ...K} wi,j State transition probability weight parameter from state i to state j

ωj Stimulus–driven state transition probability weight parameter

πi,j(st) , p(xt = Shorthand for the transition probabilities j|xt−1 = i, st, w)

Table 12.1: A summary of the notation used throughout this chapter on the semi–supervised state–space model

181 Symbol (contd.) Definition (contd.)

λw Hyper–parameter for state–transition probabilities

ϑk = (µk, Σk) The emission parameters for state k h The hemodynamic finite impulse response (FIR) filter of length L + 1

µh, Σh The mean and variance of the prior distribution of h

Hl , Matrix version of h > diag{(hl[1] ... hl[D]) }

t ∼ N (0, Σ) Normally distributed noise

θ , {u, w, ϑ, h, Σ} The model parameters

pθ(◦) , Parameterized probability density function p(◦, w, h|s, u, ϑ, Σ,K)

Contd from previous page

12.1 The State–Space Model

The functioning brain transitioning through a set of (unobserved) mental states xt = 1 ...K while performing task is represented by the state–space model of Fig. 12.2. These brain states are driven by the experimental variables described by a stimulus vector st [18].

Therefore, the probability Pr[xt = k] that the brain–state xt at time t = 1 ...T (in TR units) is k = 1 ...K depends on not only on the previous state of the brain but also on the current experimental stimulus described by the vector st. The multinomial transition probability from xt−1 = i to xt = j, is given as:

> exp{st (ωj + wi,j)} πx ,x (st) p(xt = j|xt−1 = i, st, w) = (12.1.1) t−1 t , PK > k=1 exp{st (ωk + wi,k)}

182 s s u s t t+1 t+2 … t+L-1 W λW

x x x x t t+1 t+2 … t+L-1

μk Σk

z z z z K t t+1 t+2 … t+L-1

h μh Σh

y y y y Σ t t+1 t+2 … t+L-1 ε

T

Figure 12.2: The State–Space Model. The experimental parameters are represented by st, while the corresponding brain–state is xt, and the instantaneous activation pattern is zt. The activation pattern is observed in the fMRI data yt ... yt+L 1 after convolution with the hemodynamic response h −

The probability of being in state j at any instant is parameterized by the vector ωj. The probability of transitioning from state i at time t − 1 to state j at time t is parameterized

−1 by wi,j, which has a normal prior N (0, λw I) with precision hyper–parameter λw. All these

transitions are driven by the stimulus vector st. Introducing an additional element in the stimulus vector st set to 1 allows modifying the transition probability to include a term independent of the current stimulus. Though the experiment maybe have combination of interval and categorical valued stimuli, they are converted into standardized normal vari- ables st through a probit transformation of their cumulative distribution functions. The hyper–parameter λw controls the trade–off between the influence of the current stimulus st and the previous state xt−1 on the probability of the current state xt. A low value biases the estimates of wi,j towards its mean value ωj reducing the influence of the previous state xt−1 = i on p(Xt = j|Xt−1 = i) and increasing the influence of the st on the transition.

183 The SSM allows estimating the value of unobserved or missing stimuli at a subset of the

time–points U , {t1 . . . tU } ⊂ {1 ...T }, represented by the hidden variables ut, t ∈ U. This feature enables prediction of stimuli from data at these time–points t ∈ U.

The fMRI data in the D dimensional (D  N, the number of voxels) feature–space, obtained by projecting the volumetric data Yt on a linear basis Φ (cf. Chapter 10), is rep- resented by yt. If the hemodynamic response function (HRF) is L + 1 TRs long, it will induce a correlation in the scans yt...t+L based on the neural activation corresponding to the brain–state xt at time t. To account for this effect, an additional hidden layer zt rep- resenting the underlying neural / metabolic activation pattern for xt is introduced. For xt = k, k = 1 ...K, we assume zt as normally distributed with mean µk and covariance

Σk. Let ϑ , {ϑ1 . . . ϑk}, where ϑk , (µk, Σk), the emission parameters for state k.

Each element d = 1 ...D of the D-dimensional feature space is assumed to have an inde- pendent HRF, modeled as an finite impulse response (FIR) filter h[d] , (h0[d] ... hL[d]) of length L + 1. Each h[d] has a normal prior with mean µh and variance Σh, constructed by varying the delay, dispersion, and onset parameters of the canonical HRF of SPM8 [66] and computing their mean and variance. The length L + 1 is typically set to 32s. The

> > > set of HRF parameters is then the D × L matrix h , (h[1] ... h[D] ) . Defining

> Hl , diag{(hl[1] ... hl[D]) }, the fMRI data yt is obtained by an element-wise convo- P lution yt = l Hlzt−l + t, where t ∼ N (0, Σ) is temporally i.i.d. noise.

Therefore, denoting θ , {u, w, ϑ, h, Σ}, the full probability model is (cf. Fig. 12.2):

pθ(y, z, x) , p(y, z, x, w, h|s, u, ϑ, Σ)

= p(y, h|z, Σ)p(z|x, ϑ)p(x, w|s, u), (12.1.2)

184 where 26   Y Y p(x, w|s, u) = p(w)  πxt−1,xt (st) πxt−1,xt (ut) , t∈T \U t∈U T Y p(z|x, ϑ) = p(zt|xt, ϑxt ) t=1 " T # Y p(y, h|z, Σ) = p(h) p(yt|zt−L...t, h, Σ) . t=1

,

The model hyperparameters are K, λw, µh, Σh. The hyperparameters K and λw are se- lected using an automatic data-driven procedure described in Section 12.4.

12.1.1 Feature–Space Transform

P The voxel-wise data Yt = l HlZt−l is modeled as a linear convolution of the activation patterns Z, and because the linear projection zt , Φ[Zt] is commutative with convolution, P it holds that yt = l Hlzt−l in the feature space, where yt , Φ[Yt].

12.2 Parameter Estimation

In this section, a generalized expectation-maximization (GEM) algorithm [50] to estimate

∗ the parameters θ as θ = arg maxθ pθ(y) is presented. Introducing a variational density q(z, x) over the latent variables z, x, the log-probability of eqn. 12.1.2 is decomposed into

26 Define p(x1|x0, s1) as the marginal density p(x1|s1).

185 a free-energy and a KL-divergence as:

ln pθ(y) = Q(q, θ) + KL(q||pθ), (12.2.1) Z X pθ(y, z, x) where, Q(q, θ) = q(z, x) ln dz, q(z, x) x z Z X pθ(z, x|y) and, KL(q||p ) = − q(z, x) ln dz. θ q(z, x) x z

(0) Starting with an initial estimate θ , the GEM algorithm finds a local maxima of ln pθ(y) by iterating the following two steps:

(n) E-step q ← arg min KL(q||pθ(n) ) (12.2.2) q M-step θ(n+1) ← θ such that Q(q(n), θ) > Q(q(n), θ(n)). (12.2.3)

The iterations are terminated when the updates to θ fall below a pre-specified tolerance

(adaptively set at 1% of the absolute value of the parameter in the n–th iteration), yielding a locally optimal solution θ∗.

12.2.1 E-Step

Although the minimizer of KL(q||pθ(n) ) is q(z, x) = pθ(n) (z, x|y), the HR introduces a

dependency structure between xt−L . . . xt+L and zt−L ... zt+L when conditioned on yt.

Therefore, evaluation of Q(pθ(n) (z, x|y), θ) in the M-step would require marginalization

over sequences of 2L+1 variables, resulting in a computational complexity of O(T ×K2L)

for parameter estimation (as compared to T × K2 for first-order HMMs).

To avoid this expensive computation, we restrict q to the family of factorizable distributions

T T Y Y q(z, x) = qt(zt, xt) = qt(zt|xt)qt(xt). t=1 t=1

186 This is known as the mean–field approximation in statistical physics and it can be shown [22] that if :

T ∗ Y ∗ q (z, x) = qt (zt, xt) t=1

= arg min KL(q||pθ(z, x|y)) q

= arg min KL(q||pθ(y, z, x)), q then

∗ q (zt, xt) ∝ exp{EQ q∗ [ln pθ(y, z, x)]}. (12.2.4) t t06=t t0

Introducing the following terms:

K Xh ∗ ∗ i α ∗ = q (x = k) ln p (x |k) + q (x = k) ln p (k|x ) qt (xt) t−1 t−1 θ t t+1 t+1 θ t k=1 " L #−1 X 2 −1 −1 Σ ∗ = H Σ + Σ , qt (zt|xt=k) l  k l=0 " L −1X µ ∗ =Σ ∗ Σ H · y qt (zt|xt=k) qt (zt|xt=k)  l t+l l=0 L L # X X  −1 − H · H E ∗ [z ] + Σ µ (12.2.5) l m qt+l−m t+l−m k k l=0 m=0,m6=l

∗ then each factor of the mean–field approximation in eqn. 12.2.4 becomes a product qt (zt, xt)=

∗ ∗ ∗ ∗ qt (zt|xt)qt (xt) of a multinomial logistic probability qt (xt) and a normal density qt (zt|xt) as per27:

exp{α ∗ } ∗ qt (xt) ∗  q (xt) = and q (zt|xt) ∼ N µ ∗ , Σ ∗ . (12.2.6) t PK t qt (zt|xt) qt (zt|xt) 0 exp{αq∗(x0 )} xt=1 t t

27That is, a mixture of Gaussians

187 Therefore, under this approximation, the E-step involves computing the factorizable den-

(n) QT (n) sity q (z, x) = t=1 qt (zt, xt) with the following fixed-point iterations:

28  (n) (n) i For all t = 1 ...T , initialize qt(xt) ← pθ(n) (xt), and qt(zt|xt = k) ← N µk , Σk .

∗ ∗ ii Update qt (xt) and qt (zt|xt) as per eqn. 12.2.6, for all t.

iii Set q(n) ← q∗.

iv Iterate step (ii) until the updates to all αqt(xt), Σqt(zt|xt), and µqt(zt|xt) fall below pre-

specified tolerances.

As these iterations are a coordinate–descent of the KL-divergence term KL(q||pθ(n) ), the

solution obtained is only a local optimum and depends on the initializations and the order

∗ of the updates to qt .

The details of these derivations are given in Section E.1 of Appendix E.

12.2.2 M-Step

12.2.2.1 Estimating State Transition Parameters w

Since the maximization of w is coupled with that of u, and Q(q(n), θ) is not jointly concave in w and u, we decouple the problem into two concave problems, by first maximizing Q w.r.t. w setting u ← u(n), and then maximizing w.r.t. u setting w ← w(n + 1). This is explained next.

28 The invariant density pθ(n) (xt) is eigenvector corresponding to the eigenvalue 1 of the stochastic matrix

πxt−1,xt

188 Defining the vectors:   w1,1  .   .        (n) (n)   π1,1(st) qt−1(1)qt (1)  w       1,K   .   .   .   .   .   .           (n) (n)     π1,K (st)   qt−1(1)qt (K)   wK,1   .  (n)  .  w ,   , π(st) ,  .  , qt−1,t ,  .  ,  .       .         π (s )   q(n) (K)q(n)(1)     K,1 t   t−1 t   wK,K   .   .     .   .   ω       1  (n) (n)  .  πK,K (st) qt−1(K)qt (K)  .    ωK (12.2.7) the gradient of Q with respect to w is:   (n) X qt−1,t − π(st) ∇wQ =   ⊗ st 1> (n) t∈T \U ( K ⊗ IK )[qt−1,t − π(st)]     (n) (n) X qt−1,t − π(ut ) (n) w +   ⊗ ut − λw   (12.2.8) 1> (n) (n) t∈U ( K ⊗ IK )[qt−1,t − π(ut )] 0K×1

where 1K is the K × 1 dimensional vector of ones, 0K×1 is the K × 1 dimensional vector of zeros, IK is the K × K dimensional identity matrix, ek is the K × 1-dimensional basis vector with a 1 at the k–th element and zeros elsewhere, and ⊗ is the Kronecker product.

2 Also the Hessian ∇wQ can be shown to be negative-definite implying that Q is concave in w with a unique global maximum value.

Please refer to Section E.2.1 complete derivations.

Although, w can be estimated using the iteratively re-weighted least squares (IRLS) method,

2 it involves an expensive inversion of the Hessian ∇wQ at every iteration. This inversion

189 can be avoided using a bound optimization method [116] that iteratively maximizes a

(n0+1,n) 0 (n0,n) surrogate function w = arg maxw Q (w|w ). This surrogate is a cost function

selected such that Q(w(n0,n)) − Q0(w|w(n0,n)) attains its minimum at w = w(n0,n). The

index n0 marks the iterations of the bound maximization of Q0(w|w(n0,n)) with respect to

w, during one iteration of the M-step indexed by n. This inner maximization loop is ini-

(0,n) (n) (n0+1,n) (n0,n) tialized with w ← w and terminates when the update ||w − w ||2 falls

below a certain tolerance, and w(n+1) ← w(n0+1,n) is the new value for the M-step update.

(n0,n) This tolerance can be fairly loose (typically 10% of the absolute value ||w ||2), as the

M-step of the GEM algorithm only requires an increase in the value of Q with respect to

its parameters, and not necessarily the maximization of Q.

2 One such surrogate is a quadratic function with a constant Hessian B such that ∇wQ − B

is negative-definite, given as:       AAP  X > X (n) (n)† IK2 0K2×K B ,   ⊗ stst + ut ut − λw   , > > P A P AP t∈T \U t∈U  0K×K2 0K×K (12.2.9)

1  1 1>  where A , − 2 IK2 − K K .

Maximization of this surrogate function leads to the following update equation:

(n0+1,n) (n0,n) −1 (n0,n) w ← w − B ∇wQ(w ). (12.2.10)

The index n0 marks the iterations of the bound maximization of Q0(q(n), θ) with respect to w, during one iteration of the M-step indexed by n. This inner maximization loop is

(0,n) (n) (n0+1,n) (n0,n) initialized with w ← w and terminates when the update ||w − w ||2 falls

below a certain tolerance, and w(n+1) ← w(n0+1,n) is the new value for the M-step update.

(n0,n) This tolerance can be fairly loose (typically 10% of the absolute value ||w ||2), as the

190 M-step of the GEM algorithm only requires an increase in the value of Q with respect to

its parameters, and not necessarily the maximization of Q.

Although bound optimization takes more iterations to converge than IRLS, on the whole it is much faster since it precludes inverting the Hessian (of the order of the size of w) at each step [116]. Please refer to Appendix E.2.1 for more on the bound optimization algorithm.

12.2.2.2 Estimating Missing Stimulus u

(n+1) (n) After estimating w , Q(q , θ) can then be maximized with respect to ut for all t ∈ U

(n+1) (n+1) (n+1) (n+1) by setting w ← w . If we express $i,j , wi,j + ωj , the gradient and

Hessian of Q with respect to ut are given by:

K K " K # X X (n) (n) (n+1) X  (n+1) ∇ut Q = qt−1(i)qt (j) $i,j − $i,k πi,k(ut) , i=1 j=1 k=1 K K " K > 2 X X (n) (n) X  (n+1)  (n+1) ∇ut Q = − qt−1(i)qt (j) $i,k $i,k πi,k(ut) i=1 j=1 k=1 K K # > X (n+1) X  (n+1) + ($i,k )πi,k(ut) $i,l πi,l(ut) k=1 l=1

2 Again, the Hessian ∇ut Q is negative-definite and therefore Q is concave in ut with a

2 unique global maximum. Since ∇ut Q is of the dimension of the stimulus vector and is easily invertible, this maximization is done using IRLS because of its faster convergence.

Please refer to Appendix E.2.1 for detailed derivations.

191 12.2.2.3 Estimating Emission Parameters ϑ

The M-step for the emission parameters ϑk = (µk, Σk) yields the following closed-form

updates: T (n+1) 1 X (n) µk = qt (k)µ (n) , T qt (zt|xt=k) t=1 T  > (n+1) 1 X (n) > (n+1) (n+1) Σk = qt (k) Σq(n)(z |x =k) + µq(n)(z |x =k) · µ (n) + µk · µk T t t t t t t qt (zt|xt=k) t=1 >  (n+1) (n+1) > − µq(n)(z |x =k) · µk − µk · µ (n) , (12.2.11) t t t qt (zt|xt=k) where Σ (n) and µ (n) were defined in eqn. 12.2.5. The details of this formula qt (zt|xt=k) qt (zt|xt) are given in Appendix E.2.2.

As reported in Section 10.5 of Chapter 10, the feature–space coefficients of an fMRI ses- sion exhibit very low temporal correlations. To enforce this high degree of sparsity in the estimates of Σk (cf. eqn. 12.2.11), during the n–th iteration of the M-step the estimate for

(n) Σk is sparsified using adaptive shrinkage [186], similar to the procedure in Section 9.2 of Chapter 9.

12.2.2.4 Estimating HRF FIR Filter h

> The estimation the coefficients of the FIR filter h[d] , (h0[d] ... hL[d]) , the L + 1–tap HRF corresponding to the d–th element of the D–dimension feature space, is described next. The gradient of the free-energy term Q from eqn. 12.2.1 is:

T D ∂Q X X   = Σ−1[d0, d] y [d0]ν [d] − Λ [d0, d]h[d0] − Σ−1(h[d] − µ ), (12.2.12) ∂h[d]  t t t h h t=1 d0=1

192 where     0 νq(n) [d] Λq(n) [d , d] ... 0  t   t   .  0  . .. .  νt[d] =  .  , Λt[d , d] =  . . .  .     0 ν (n) [d] 0 ... Λ (n) [d , d] qt−L−1 qt−L−1...t

(n) Here, the marginal first and second moments of zt under the variational density qt (zt)

E E  > defined as ν (n) (n) {zt} and Λ (n) (n) ztzt respectively. qt , qt (zt) qt , qt

As per eqn. 12.2.12, the gradient ∂Q/∂h[d] for the FIR filter at the d–th element depends on

the values of h[d0] at all the other d0 6= d of the D-dimensional space. Setting ∂Q/∂h[d] =

0, for all d = 1 ...D, results in a linear system of D × L equations in D × L unknowns.

The unique solution h(n+1) is computed using conjugate gradient descent [81] initialized

(n) (n+1) (n) at h , and its iterations are terminated when the update ||h − h ||2 falls below a

(n) pre-specified tolerance (set adaptively at 10% of ||h ||2).

12.2.2.5 Estimating Noise Variance Σ

The noise variance Σ has a closed form estimate:

\[
\Sigma^{(n+1)} = \frac{1}{T}\sum_{t=1}^{T}\Bigg[ y_t y_t^{\top} - y_t \Big(\sum_{l=0}^{L-1} H^{(n+1)}_l \nu_{q^{(n)}_{t-l}}\Big)^{\top} - \Big(\sum_{l=0}^{L-1} H^{(n+1)}_l \nu_{q^{(n)}_{t-l}}\Big) y_t^{\top} + \sum_{l=0}^{L-1}\big(H^{(n+1)}_l\big)^2 \Lambda_{q^{(n)}_{t}} + 2\sum_{l=0}^{L-1}\sum_{\substack{m=0\\ m\neq l}}^{L-1} H^{(n+1)}_l H^{(n+1)}_m\, \nu_{q^{(n)}_{t}}\, \nu^{\top}_{q^{(n)}_{t+l-m}} \Bigg], \qquad (12.2.13)
\]

where $\nu_{q^{(n)}_t}$ and $\Lambda_{q^{(n)}_t}$ are the marginal moments of $z_t$ under the variational distribution $q^{(n)}_t$ as defined earlier. The estimation formulae for the HRF and noise parameters are derived in Section E.2.3 of Appendix E.

12.2.3 Spatial Activation Maps

The activation pattern for a specific value of the experimental variables $s_t$ is obtained by first computing the invariant distribution $p(x_t|w, s_t)$ as the first eigenvector of the state–transition matrix $\pi(s_t)$ (cf. eqn. 12.1.1), and then computing the activation pattern mean as $\mu_{s_t} = \sum_{k=1}^{K} p_\theta(x_t = k|w, s_t)\, \mu_k$ and its variance as $\Sigma_{s_t} = \sum_{k=1}^{K} p_\theta(x_t = k|w, s_t)\, \Sigma_k$. The $z$–score map for the activation pattern corresponding to $\hat{s}$ is given by $\Sigma_{s_t}^{-1/2} \mu_{s_t}$ in feature–space, which is then transformed back into a voxel–wise spatial map of activity.
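A minimal sketch of this computation in feature–space, with illustrative names; the invariant distribution is taken as the normalized leading left eigenvector of the transition matrix, and the variance is the state-weighted average of the $\Sigma_k$ exactly as defined above:

```python
import numpy as np

def activation_zmap(P, mus, Sigmas):
    """Z-score activation map for a fixed stimulus value (Section 12.2.3).

    P      : (K, K) state-transition matrix pi(s_t) for the stimulus of interest
    mus    : (K, D) per-state emission means mu_k
    Sigmas : (K, D, D) per-state emission covariances Sigma_k
    """
    # Invariant distribution: left eigenvector of P for the leading eigenvalue,
    # normalized to sum to one.
    evals, evecs = np.linalg.eig(P.T)
    p = np.real(evecs[:, np.argmax(np.real(evals))])
    p = p / p.sum()

    mu_s = p @ mus                               # state-weighted mean
    Sigma_s = np.tensordot(p, Sigmas, axes=1)    # state-weighted covariance
    # Whitened mean Sigma_s^{-1/2} mu_s via the symmetric inverse square root
    w, V = np.linalg.eigh(Sigma_s)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    return inv_sqrt @ mu_s                       # z-scores in feature space
```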

12.3 Estimating the Optimal State–Sequence

Direct estimation of the most probable state–sequence

\[
x^{*} = \arg\max_x p_\theta(x|y) = \arg\max_x p_\theta(y, x),
\]

given model parameters $\theta$ requires joint maximization over all $T$ state variables, since the hidden layer $z$ introduces a dependency between all the $y$ and $x$ variables, preventing factorization of the graph²⁹. As the size of the search space increases exponentially with $T$, with a complexity of $O(K^T)$ for the whole chain, exhaustive search soon becomes infeasible and an approximation such as Iterated Conditional Modes (ICM) [22] is required. In the approximation of Chapter 11, the full model was replaced with a reduced one by removing the intermediate hidden layer $z$. This resulted in a joint maximization over $L$ states at a time and $O(T K^L)$ complexity, which was solved using ICM.

²⁹ See the discussion in Appendix D.5.

In this chapter, an EM algorithm determines the optimal state–sequence through a mean–field approximation that iteratively transforms the problem into a series of first-order HMMs, which in turn are solved using the standard Viterbi algorithm [22].

As in Section 12.2, the log-probability term is decomposed into a free-energy and a KL-divergence term

\[
\ln p_\theta(y, x) = Q(q, x) + \mathrm{KL}\big(q\,\|\,p_\theta(z|y,x)\big),
\]

where

\[
Q(q, x) = \int_z q(z) \ln \frac{p_\theta(y, z, x)}{q(z)}\, dz \quad \text{and} \quad \mathrm{KL}\big(q\,\|\,p_\theta(z|y,x)\big) = -\int_z q(z) \ln \frac{p_\theta(z|y,x)}{q(z)}\, dz,
\]

by introducing a variational density $q(z)$.

Again as before, restricting $q(z)$ to the family of factorizable distributions $q(z) = \prod_{t=1}^{T} q_t(z_t)$, the E-step estimate

\[
q^{(n)} = \prod_{t=1}^{T} q^{(n)}_t = \arg\min_q \mathrm{KL}\big(q\,\|\,p_\theta(z|y, x^{(n)})\big) = \arg\min_q \mathrm{KL}\big(q\,\|\,p_\theta(y, z|x^{(n)})\big)
\]

is obtained by iteratively computing

\[
q^{*}(z_t) = \exp\Big\{ \mathbb{E}_{\prod_{t' \neq t} q^{*}_{t'}} \big[ \ln p_\theta(y, z|x^{(n)}) \big] \Big\}
\]

until convergence. The factorized density $q^{*}(z_t)$ is a normal distribution $q^{*}(z_t) \sim \mathcal{N}(\mu_{q^{*}_t}, \Sigma_{q^{*}_t})$, with mean $\mu_{q^{*}_t}$ and variance $\Sigma_{q^{*}_t}$ identical to the form defined in eqn. 12.2.5, with the difference that $x_t$ is replaced by $x^{(n)}_t$.

Since the terms

\[
\int_z q(z) \ln p_\theta(y|z)\, dz \quad \text{and} \quad \int_z q(z) \ln q(z)\, dz
\]

in $Q(q, x)$ are independent of $x$, the maximization step becomes:

\[
\arg\max_x Q(q^{(n)}, x) = \arg\max_x \left[ \ln p_\theta(x) + \int_z q^{(n)}(z) \ln p_\theta(z|x)\, dz \right] = \arg\max_x \sum_{t=1}^{T} \left[ \ln p_\theta(x_t|x_{t-1}) + \int_{z_t} q^{(n)}(z_t) \ln p_\theta(z_t|x_t)\, dz_t \right], \qquad (12.3.1)
\]

which is identical to the problem of estimating the optimal state–sequence of a first-order HMM, with the difference that the emission (log) probability is replaced by the expected (log) probability under the variational density $q^{(n)}$. The solution can be computed using the Viterbi algorithm with $O(T \times K^2)$ complexity.
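For reference, a compact sketch of the Viterbi recursion used here, with the expected log emission terms of eqn. 12.3.1 precomputed into an array; the initial state distribution is assumed to be folded into the first row:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Most probable state sequence of a first-order HMM (eqn. 12.3.1).

    log_trans : (K, K) log transition probabilities ln p(x_t = j | x_{t-1} = i)
    log_emit  : (T, K) expected log emission terms, i.e. the integral of
                q(z_t) ln p(z_t | x_t = k) over z_t for each t and k
    Runs in O(T * K^2) time.
    """
    T, K = log_emit.shape
    delta = log_emit[0].copy()          # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (K, K): from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    x = np.zeros(T, dtype=int)                     # trace back the optimum
    x[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        x[t - 1] = back[t, x[t]]
    return x
```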

The EM iterations terminate when the increments $|\ln p_\theta(y, x^{(n+1)}) - \ln p_\theta(y, x^{(n)})|$ fall below a pre-specified tolerance, typically set to 0.0099, corresponding to a $< 1\%$ increase in the probability.

The details of the state–sequence estimation algorithm are presented in Appendix E.3.

12.4 Model Hyper-Parameter Selection

The hyper–parameters of the SSM are the number of hidden states $K$, the precision $\lambda_w$ of the prior distribution of the transition weights $w$, and the parameters $\mu_h, \Sigma_h$ of the prior model of the HRF $h$. The values of $\mu_h$ and $\Sigma_h$, determined from the canonical HRF of SPM8, are used to enforce domain knowledge by restricting the HRF to the space of physiologically plausible shapes. This provides an optimal trade-off between allowing a spatially varying and unknown HRF and over–fitting the FIR filter to the data.

The hyper–parameter $\lambda_w$ determines the variance in the weights $w_{i,j}$; it implements a trade–off between the effect of the stimulus versus the previous state on the current state probability, and mediates a complex set of interactions between the temporal structure of the fMRI data and that of the stimulus sequence. A very high value of $\lambda_w$ causes the state–transitions to be driven mostly by the current stimulus, while a low value increases the contribution of the previous state to the transition probability. It therefore cannot be practically provided as a user–tunable parameter. On the other hand, model–size (i.e. $K$) selection is typically done using Bayes factors [112], information theoretic criteria [151] or reversible jump MCMC based methods [194]. Implicitly, these methods require an a priori notion of the complexity of a given model.

Here instead, we adopt an automated method for selecting both K and λw based on the maximally predictive criterion, as developed in the previous chapter (cf. Section 11.4), leveraging the ability of the SSM to predict missing stimuli u.

From the stimulus time–series, blocks of $T'$ consecutive time–points (in TR units), totalling 25% of the total number of scans, are removed at random to serve as missing stimuli $U \triangleq \{t_1, \ldots, t_1 + T' - 1, \ldots, t_M, \ldots, t_M + T' - 1\}$, and the optimal SSM parameters $\theta^*$ are estimated for a given $K$ and $\lambda_w$. The prediction error between the predicted $u_t$ and their true values $s_t$ is then measured as $\mathrm{ERR}_{\mathrm{missing}} \triangleq \sum_{t \in U} \|u_t - s_t\|_2$. The hyper–parameters are then selected to minimize this error–rate. The optimal value of $K$ is obtained by first stepping through different values of $K$ with large step-sizes and then iteratively refining the step-size. The advantage of this procedure is that it allows selecting the model most relevant to the experiment being conducted.

For each setting of $K$, the optimal $\lambda_w$ is determined by searching over the range $\log_{10} \lambda_w = -3 \ldots +3$, and selecting the value that minimizes $\mathrm{ERR}_{\mathrm{missing}}$. This allows setting the parameter to effect an optimal compromise between stimulus–driven and previous–state–driven transitions. We observed that the prediction error is relatively insensitive to $\lambda_w$ (cf. Section 12.5), and therefore a common value can be selected across a multi–subject data–set for one study.
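A sketch of the overall selection loop, assuming a hypothetical fit_ssm routine that trains the SSM with the chosen blocks withheld and returns ERR_missing:

```python
import numpy as np

def select_hyperparams(fit_ssm, stimuli, Ks, log10_lams, T_block, rng):
    """Grid search over K and lambda_w by missing-stimulus prediction error.

    fit_ssm : hypothetical callable fit_ssm(K, lam, missing_idx) -> err that
              trains the SSM with the stimuli at missing_idx withheld and
              returns ERR_missing = sum_t ||u_t - s_t||_2 over those points.
    """
    T = len(stimuli)
    n_blocks = int(0.25 * T / T_block)        # withhold ~25% of the scans
    starts = rng.choice(T - T_block, size=n_blocks, replace=False)
    missing = np.unique(np.concatenate(
        [np.arange(s, s + T_block) for s in starts]))

    best = (None, None, np.inf)
    for K in Ks:                                       # coarse grid over K
        for lam in 10.0 ** np.asarray(log10_lams):     # log-spaced lambda_w
            err = fit_ssm(K, lam, missing)
            if err < best[2]:
                best = (K, lam, err)
    return best
```

In use, `rng = np.random.default_rng(0)` and `Ks` would first be a coarse grid that is then refined around the minimum, as described above.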

The reader will observe that prediction error is used merely as a statistic (cf. Sidebar on Generative vs. Classification Models in Chapter 7) to select hyper–parameters. The parameters themselves, unlike MVPR classifiers, are not estimated to minimize prediction error but rather to fit a model of brain function to the data. It is this distinction that allows interpretation of the estimated parameters (in terms of the underlying neurophysiological model), in contrast to MVPR classifiers.

The effect of these hyper–parameters and the length $T'$ of a missing–stimulus block on the model estimation is evaluated in Section 12.5.

12.5 Results

This section starts off with a quantitative validation of the model and state–sequence estimation algorithms using a synthetic data–set. Then the method is illustrated on two fMRI studies, one a simple block design for a visuo–spatial motor task, and the other the complex and irregular event-related design for arithmetical processing (cf. Section 3.2). The first example focuses on the spatio–temporal activation maps, demonstrates the ability of the method to discover new patterns in the data, and provides a comparative evaluation with respect to other analysis methods and feature–spaces. The second study is used to perform group–level inferences in the abstract representational space generated by this spatio–temporal phenomenological model.

For all the fMRI data, the mean volume of the time–series was subtracted, white matter was masked out, and all further processing was performed on grey matter voxels. The algorithms were implemented in MATLAB® with Star-P® on a 2.6GHz Opteron cluster with 16 processors and 32GB RAM.

12.5.1 Simulation

12.5.1.1 Methods and Materials

The algorithms were validated on a synthetic data–set created as follows. For all simulations, the length of the session was $T = 600$ TRs, the dimension of the feature–space was $D = 500$, and the dimension of the stimulus vector $s_t$ was set to 5, to reflect a typical fMRI data–set. Model size $K$ was varied from 5 to 50, while the precision hyper–parameter $\lambda_w$ was varied from $10^{-3}$ to $10^{3}$.

The state transition parameters $\omega_j$ were initialized from a uniform distribution over $[0,1]$, while the $w_{i,j}$ were sampled from $\mathcal{N}(0, \lambda_w^{-1} I_5)$, where $I_n$ is the $n \times n$ identity matrix. The HRF FIR coefficients $h[d]$ for each element $d$ of the feature–space were obtained by sampling from $\mathcal{N}(\mu_h, \Sigma_h)$. The emission parameters $(\mu_k, \Sigma_k)$ for each state $k$ were obtained by sampling $\Sigma_k$ from a Wishart distribution $\mathcal{W}(T, I_D)$ and $\mu_k$ from $\mathcal{N}(0, \Sigma_k)$. The noise variance $\Sigma$ was sampled from $\mathcal{W}(T, \beta^{-1} I_D)$; the parameter $\beta$ effectively controls the SNR of the data by controlling the ratio of the noise variance to that of the activation patterns $z_t$. For each time–point $t = 1 \ldots T$, the stimuli $s_t$ were generated from a normal distribution $\mathcal{N}(0, I_5)$ and then smoothed along the time dimension with Gaussian filters of different full–widths–at–half–maximum (FWHM), in order to impose a temporal structure on the simulated data. The values of $x_t$, $z_t$ and $y_t$ were then sampled from their generative distributions as per Section 12.1.
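A scaled-down sketch of this generator, with $D$ reduced from 500 to keep the example light, a generic smooth random filter standing in for the SPM8-derived HRF prior, and i.i.d. Gaussian noise standing in for the Wishart-sampled $\Sigma$:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def simulate_ssm(T=600, D=50, K=10, S=5, L=8, lam_w=1.0, beta=100.0,
                 fwhm=4.0, seed=0):
    """Illustrative generator following Section 12.5.1.1 (simplified)."""
    rng = np.random.default_rng(seed)
    omega = rng.uniform(0, 1, size=K)                  # state offsets
    w = rng.normal(0, lam_w ** -0.5, size=(K, K, S))   # transition weights
    # Smoothed Gaussian stimuli impose temporal structure
    s = gaussian_filter1d(rng.normal(size=(T, S)), fwhm / 2.355, axis=0)
    # Emission model: Sigma_k Wishart-like (scaled by 1/T for moderate size),
    # mu_k ~ N(0, Sigma_k)
    A = rng.normal(size=(K, T, D))
    Sigma = np.einsum('ktd,kte->kde', A, A) / T
    mu = np.stack([rng.multivariate_normal(np.zeros(D), Sigma[k])
                   for k in range(K)])
    h = gaussian_filter1d(rng.normal(size=L + 1), 1.0)  # stand-in HRF taps
    x = np.zeros(T, dtype=int)
    z = np.zeros((T, D))
    y = np.zeros((T, D))
    for t in range(T):
        prev = x[t - 1] if t else 0                    # dummy state at t = 0
        logits = omega + w[prev] @ s[t]
        p = np.exp(logits - logits.max())
        p /= p.sum()                                   # softmax transition
        x[t] = rng.choice(K, p=p)
        z[t] = rng.multivariate_normal(mu[x[t]], Sigma[x[t]])
        past = z[max(0, t - L):t + 1][::-1]            # z_t, ..., z_{t-L}
        y[t] = (h[:len(past)][:, None] * past).sum(0)  # HRF convolution
    y += rng.normal(0, beta ** -0.5, size=y.shape)     # noise sets the SNR
    return s, x, z, y
```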

We compared the results of the generalized EM (GEM) algorithm under the mean field approximation (MF-GEM) presented here (cf. Section 12.2) against those obtained by an MCMC based GEM algorithm. The number of MCMC samples was varied to match the MF-GEM algorithm in terms of equal running time (MCMC:RT) and equal estimation error (MCMC:EE). As the MCMC method can produce exact estimates given a sufficient number of samples, it was also run until convergence (MCMC:CNV) in order to establish baseline accuracy. The experiments were repeated with $\beta = 10, 100$ and $1000$, corresponding to SNRs of 10, 20 and 30dB respectively.

12.5.1.2 Discussion

The relative error $\mathrm{ERR}_{\mathrm{estimate}}$ in the parameter estimates $\theta^*$, the relative error $\mathrm{ERR}_K$ of the model-size estimates $K^*$ and the prediction error $\mathrm{ERR}_{\mathrm{missing}}$ (cf. Section 12.4) for the various experiments are charted in Fig. 12.3.

One of the main observations is that the MCMC algorithm requires almost thrice the total running–time (including searching for the optimal hyper–parameter values) for the same estimation error $\mathrm{ERR}_{\mathrm{estimate}}$ as the mean field EM method (MF-GEM), while the prediction error of MF-GEM is within 20% of the best $\mathrm{ERR}_{\mathrm{missing}}$ as measured by MCMC:CNV. While reducing SNR does not affect running–time significantly, its effect on the errors is large.


Figure 12.3: Simulation Results. The GEM method under the mean field approximation (MF-GEM) is compared against an MCMC based estimation algorithm matched in terms of equal running time (MCMC:RT) and equal estimation error (MCMC:EE), and against MCMC run until convergence of the estimates (MCMC:CNV). The experiments were repeated for SNR = 10, 20 and 30dB. Plotted are total running time (Fig. (a)), relative estimation error (Fig. (b)), prediction error (Fig. (c)) and relative error in estimating the correct K (Fig. (d)) for the different experiments. Error bars indicate ±1 standard deviation.

Reducing the SNR from 30 to 20dB caused prediction error to increase from $< 10\%$ to $\approx 30\%$. Furthermore, for SNR $\leq 20$dB the estimate for model–size using the maximally predictive criterion is within 10% of the true $K$.

Although all the parameters $\theta$ are important in determining the accuracy of the model, of special interest are the $\mu_k$, $k = 1 \ldots K$ parameters, as they correspond to the spatial distribution of activity representative of each state. The average estimation error of the spatial distribution parameters, defined as

\[
\mathrm{ERR}_{\mathrm{spatial}} = \frac{1}{K} \sum_{k} (\mu^{*}_k - \mu_k)^{\top} \Sigma^{-1}_k (\mu^{*}_k - \mu_k),
\]

where $\mu^{*}_k$ is the estimate of $\mu_k$, is listed in Table 12.2 for the MF-GEM and MCMC:CNV cases.

It can be observed that for the 20dB case, the estimated spatial patterns are within ≈ 0.25 standard deviations (given by Σk) of the true µk.

SNR (dB)   MF-GEM         MCMC:CNV
30         0.151 ± 0.06   0.126 ± 0.05
20         0.223 ± 0.09   0.212 ± 0.09
10         0.361 ± 0.13   0.357 ± 0.14

Table 12.2: Effect of SNR on the error ERR_spatial (±1 std. dev.) in μ_k estimated by the MF-GEM algorithm and MCMC run to convergence (MCMC:CNV).

The effect of the sparsification step during the estimation of Σk (cf. Section 12.2.2.3) on estimation accuracy and prediction rate is given in Table 12.3. While shrinkage of the ML estimate of Σk has a positive effect on the estimation and prediction accuracies, the benefit is more pronounced at lower SNR, indicating the necessity of this step especially when dealing with noisy data.

SNR (dB)   ERR_estimate    ERR_missing
30         14.77% ± 2.26   9.85% ± 2.73
20         18.21% ± 3.18   12.29% ± 2.57
10         23.36% ± 5.19   17.87% ± 3.44

Table 12.3: The percentage reduction in estimation error ERR_estimate and prediction error ERR_missing due to shrinkage of the ML estimates of Σ_k, k = 1…K, at various SNR levels.

12.5.2 Data-Set 1: Visuo–Spatial Motor Task

12.5.2.1 Methods and Materials

In this task, four subjects were visually exposed to oriented wedges filled with high-contrast random noise patterns and displayed in one of four quadrants. Subjects were asked to focus on a center dot and to perform a finger-tapping motion with the right or left hand when the visual wedge was active in the upper right or lower left quadrants, respectively. The block length of each visual wedge stimulation varied from 5 to 15s and the noise patterns changed at a frequency of 5Hz. A multi-shot 3D Gradient Echo Planar Imaging (EPI) sequence, accelerated in the slice encoding direction with GRAPPA and UNFOLD, was used on a GE 3T MRI scanner with a quadrature head coil; 171 volumes were acquired at TR=1.05s at an isotropic resolution of 3mm, with a total imaging time of 3min, and the first five volumes were discarded from the analysis.

The data were analyzed using a univariate GLM with SPM8. The design matrix included a regressor for the presentation of the wedge in each quadrant, convolved with a canonical HRF. The results of this analysis are shown in Fig. 12.4(a), and correspond to the classic retinotopic organization of the primary visual cortex and with the hand motor representation areas in both hemispheres.

The model was trained using the GEM algorithm under the mean field approximation (MF-GEM), with the data represented in the following feature–spaces:

• [FS:Φ] The basis vectors of Φ, with D = 500 (cf. Chapter 10).

• [FS:PCA] Coefficients of the scans projected on their principal components, retained if their variance was greater than the mean variance (≈ 80) [71].

• [FS:VOX-AVG] The top 500 significantly activated voxels identified with a GLM using a contrast for the average effect of all orientations.

• [FS:VOX-ORIENT] The set of 500 voxels maximally responsive to only one orientation of the wedge, using an appropriate contrast in the GLM.

Two different encodings of the stimulus vectors $s_t$ were used as input to the training algorithm: (SSM:FULL) each $s_t$ is a vector containing the post-SOA (stimulus-onset-asynchrony) time of the current fMRI frame $t$ within the current presentation block, and the orientation of the wedge; (SSM:PSOA) each $s_t$ contains only the post-SOA time. For comparison, we also trained a linear multi-class SVM classifier (SVM-CLASS) [108] with the orientations as class-labels, using the same feature–spaces.

12.5.2.2 Discussion

Fig. 12.5 shows the prediction error (averaged across subjects) of the wedge orientation for the three different cases SSM:FULL, SSM:PSOA and SVM-CLASS and the different feature–spaces.


(a) SPM8 (b) SSM:FULL

Figure 12.4: Spatial Activation Maps for the Visuo-Motor Task from SPM8 and the State–Space Multivariate Analysis. Fig. (a): Maximum intensity projections of significantly activated voxels (p < 0.05, FWE corrected) in a single subject for the four orientations of the wedge and the hand motor actions, computed using SPM8. The red circles indicate the ROIs for which the estimated HRF FIR filters are displayed in Fig. 12.8. Fig. (b): Spatial (z–score) maps showing the distribution of activity for each orientation of the wedge computed from our state–space model, displayed on an inflated surface of the brain. Displayed are the posterio-lateral and posterio-medial views of the left and right hemispheres respectively. Values of z ≤ 1 have been masked out for visual clarity.


Figure 12.5: The inter–subject average prediction error–rates for the visuo–motor task using a multi-class SVM (SVM-CLASS), and our model trained with the stimulus vector s_t containing both post-SOA time and wedge orientation (SSM:FULL) and with the stimulus vector s_t containing only post–SOA time (SSM:PSOA). The error–rates are shown for different feature–spaces. The last bar in the SVM-CLASS column shows the SVM prediction error for the same wedge orientation for which the 500 most active voxels are used as a feature–space (FS:VOX-ORIENT).

Since the experiment was fully randomized, the chance-level error–rate is 75%. The error in the prediction of the wedge orientation for SSM:FULL is readily assessed during the computation of ERR_missing. In order to measure the prediction error for SSM:PSOA, we first trained the optimal model using only post-SOA times in s, and obtained the optimal state–sequence x*. In a second step, we trained a simple multinomial classifier to predict the wedge orientation at time t from the state-label x*_t, and measured the prediction error of this classifier using cross-validation.

Firstly, we observe that the prediction error of the semi–supervised model (SSM:FULL) trained on the same experimental variables is very similar to that of the supervised SVM (SVM-CLASS). More interestingly, however, our model was able to predict, at better than chance levels, the orientation of the wedge without being trained for it (SSM:PSOA). This indicates that the model has learnt some intrinsic patterns in the data that are strongly related to the mental activity of the subject.

It can be noticed that the prediction error using FS:Φ was slightly better than that using the set of voxels significantly activated under all orientations (FS:VOX-AVG). This is noteworthy, especially given that FS:Φ, in contrast to FS:VOX-AVG, is computed without knowledge of the experimental parameters. One reason for the low accuracy of FS:VOX-AVG is that it includes voxels that are commonly activated for all wedge orientations and are therefore not necessarily selective to any one orientation. To account for this defect, we also trained the various algorithms on a set of voxels that were maximally responsive to only one orientation of the wedge (FS:VOX-ORIENT). The drawback of this feature-set is that while it is very accurate for that orientation, its discriminative ability for the other three orientations is very poor. As can be seen, for the multi-class SVM trained with one such FS:VOX-ORIENT feature space, the average error for the same orientation is 21.81 ± 2.38%, but the error for the other orientations was much higher, resulting in an overall error–rate of 42.11%. The error–rate of SSM:FULL was comparable.

The poor accuracy of the PCA basis selected using maximum variance criteria has also been documented in other fMRI studies [90, 173] and can be attributed to the lack of a specific relationship between the task selectivity of a component and its variance.

The spatial maps of activation patterns (cf. Section 12.2.3) for SSM:FULL using FS:Φ, for a single subject, are shown in Fig. 12.4(b). The maps for the other three subjects are qualitatively similar. The retinotopic character of the activation maps follows the expected anatomical boundaries of the four visual quadrants within the occipital visual system. The same spatio–temporal accuracy is found for the motor component of the paradigm, with the typical cortical activation pattern for hand action in the contralateral hemisphere.

Figure 12.6: Brain–state probabilities for one subject. The size of the circles corresponds to the marginal probabilities of the states during the display of the wedge in the lower right, lower left, upper left and upper right quadrants for 4 TRs each. States have been relabeled for expository purposes and transition probabilities have been omitted for visual clarity.

The state transition probabilities $\pi_{i,j}(s_t)$ encode information about the temporal dynamics of the model, and reveal the organization of the brain–states $x_t$ with respect to the experimental variables $s_t$. Fig. 12.6 shows the marginal probabilities (i.e. the first eigenvector of $\pi(s_t)$) of the brain–states for one subject corresponding to a sequence of wedges oriented in each quadrant for 4×TRs each.

Here, we see that the probability of a particular state is structured with respect to the orientation of the wedge. For example, at the start of the presentation with the wedge in the lower-right quadrant, state 1 is most probable. But by the second TR, state 2 becomes more dominant and this distribution remains stable for the rest of this presentation. Then, as the display transitions to the lower-left quadrant, states 3 and 4 become equiprobable. However, as this orientation is maintained, the probability distribution peaks about state 4 and remains stable. A similar pattern is observed in the probability distributions for the other orientations.

From the plot of prediction error ERR_missing against model–size K for each of the four subjects in Fig. 12.7, one can see that it has a relatively shallow basin and the minima occur in a similar range of K = 8…15. This points to the robustness of the model with respect to K and the similarity of the models across subjects.


Figure 12.7: ERRmissing with respect to model–size K. The minimum for each curve is marked by a circle.

Fig. 12.8 graphs the estimates of the hemodynamic FIR filter h for each subject, averaged in ROIs selected in the left primary motor cortex (BA3, BA4) and the left visual cortex (BA17) (indicated by the red circles in Fig. 12.4). A qualitative difference between the estimated HRs of the two areas is apparent in terms of their rise-time, peak value and dispersion.

The HR of the brain is known to be highly variable [139], and by allowing the hemodynamic FIR filter to vary spatially (one filter h[d] per feature–space element d), the model is able to account for this inter-region variability.

(a) Motor Cortex ROI  (b) Visual Cortex ROI

Figure 12.8: Estimated hemodynamic FIR filter h. The estimated FIR filter coefficients for each of the four subjects averaged in two ROIs selected in the motor and visual cortices.

12.5.3 Data-Set 2: Mental Arithmetical Task

12.5.3.1 Methods and Materials

This section discusses the inferences of this model applied to the study of mental arithmetic described in Section 3.2. For each t = 1…T, the experimental conditions were described by: Ph = 1, 2, 3, which indicates whether t is in the (1) multiplication, (2) subtraction/judgement or (3) decision–making phase of the experiment; 1 ≤ LogPs ≤ 10, which quantifies the product size of the multiplication problem; and 1 ≤ LogDiff ≤ 5, which quantifies the expected difficulty in judging the right answer.

For each of the 42 subjects (20 control, 13 dyscalculic, 9 dyslexic), one model M_j = {Φ_j, θ_j} was trained per subject j = 1…42 with the stimulus vector s_t containing the post–SOA time, the reaction–time for the current trial and a parameter quantifying the size of the product of the two numbers displayed (cf. Section 3.2.3 for specifics). In order to balance the group sizes, group–level analysis was done by selecting 8 subjects at random from each group and computing the statistics over multiple resamples.

12.5.3.2 Comparative Analysis

One SSM M_j = {θ, K, λ_w} was trained per subject j = 1…42 with three different encodings of the stimulus vector s_t:

SSM:NONE with st = (1)

SSM:PH with st = (Ph, 1)

SSM:FULL with st = (Ph, LogPs, LogDiff, 1)

As the models SSM:PH and SSM:NONE do not encode LogPs and LogDiff (and Ph) in the stimulus vector, they cannot estimate these variables as missing stimuli. Therefore, to assess the ability of the SSMs to predict these variables from the optimal state–sequence x*, we trained three simple multinomial classifiers: one to predict the probability Pr[Ph|x*_t] of the phase Ph = 1, 2, 3; one to predict Pr[LogPs|x*_t] of LogPs, quantized to two levels at a value of 5; and one to predict Pr[LogDiff|x*_t] of LogDiff, quantized to two levels at a value of 2.5. The error–rates across the three classifiers were accumulated into a single ERR_SSM:NONE, ERR_SSM:PH and ERR_SSM:FULL for each SSM trained. Also, as ERR_missing is not defined for SSM:NONE, its hyper–parameters were selected so as to minimize ERR_SSM:NONE.

For comparative evaluation of the SSM with MVPR methods, we trained three linear SVM classifiers per subject: one to predict Ph = 1, 2, 3, one for LogPs = 0, 1 (quantized) and one for LogDiff = 0, 1 (quantized), and accumulated their error–rates into ERR_SVM. The SVMs were trained to predict the stimuli from the fMRI data deconvolved with the canonical HRF. Among the other classifiers evaluated (viz. GNB, LDA, quadratic and exponential SVM), none significantly outperformed the linear SVM.

For each of the SSMs and SVMs trained, the following feature–spaces were evaluated:

• [FS:Φ] The basis vectors of Φ, with D = 500 (cf. Chapter 10).

• [FS:PCA-NONE] A set of ≈110 basis vectors obtained from a PCA of the fMRI data, retained using a bootstrap analysis of stability [17] at a 75% confidence level, to match the feature selection criterion for Φ. This confidence level yields 112.92 ± 7.80 principal components (PCs) for each subject.

• [FS:PCA-PH] The set of PCs maximally correlated with the HRF–convolved regressor for Ph. For each subject ≈65 PCs were retained at a confidence level of 75% (64.27 ± 8.56 PCs).

• [FS:PCA-FULL] The set of PCs maximally correlated with the design matrix containing HRF–convolved regressors for Ph, LogPs and LogDiff, identified using multiple regression. For each subject ≈80 PCs were retained at a confidence level of 75% (79.14 ± 9.57 PCs).

The prediction errors of the different model and feature–space combinations for the control group are compiled in Fig. 12.9. The other two groups showed similar trends and are omitted for conciseness.

Figure 12.9: Comparative Analysis. The mean prediction error (±1 standard error of the mean (SEM)) for the control group using a linear SVM classifier, and the SSM with three different encodings of the stimulus vector s_t, viz. SSM:FULL, SSM:PH and SSM:NONE. The error–rates are measured for four different feature–spaces: FS:Φ, FS:PCA-NONE, FS:PCA-PH and FS:PCA-FULL. The last bar in the SVM column shows the SVM prediction error for only Ph, against which the PCs of FS:PCA-PH were selected. The “chance–level” prediction error is ≈ 0.87, calculated through a permutation test.

It can be observed that the error for SSM:FULL is consistently lower (> 3 SEM for FS:Φ) than that of the SVM. This is noteworthy especially since the parameters of the SVM were specifically optimized for prediction error. By treating each fMRI scan independently, MVPR classifiers are unable to leverage the temporal structure in the data and rely only on spatial patterns for prediction. Moreover, the SVM uses a point–estimate of the neural activation through deconvolution, whereas the SSM accounts for spatially varying and unknown hemodynamics in a probabilistic fashion, which contributes to its ability to predict the mental state of the subject.

Using only information about the phase Ph to train SSM:PH increased the error as compared to SSM:FULL, but only slightly (≈ 1 SEM). But removing all experimental information (SSM:NONE) caused a dramatic increase in error (ERR_SSM:NONE ≈ 0.48). This implies that the semi–supervised SSM can detect the effect of experimental variables from patterns in the data (namely LogPs and LogDiff in the case of SSM:PH) against which it was not explicitly trained. Including some cues (namely Ph) about the experiment guides this discovery, stabilizes estimation and prevents the model from learning spatio–temporal patterns that are not relevant to the task and which may be due to artifacts (unlike SSM:NONE).

It is nevertheless interesting that, despite not using any stimulus information, SSM:NONE had a prediction error much better than chance (ERR_chance ≈ 0.87), which implies that it has discovered the mental states of the subject in a purely unsupervised fashion, validating some of the neurophysiological assumptions behind the SSM.

The unsupervised PCA feature–space (FS:PCA-NONE) exhibited performance worse than FS:Φ in all cases. The poor accuracy of PCA has also been documented in other fMRI studies [90, 160, 173] and can be attributed to the lack of a specific relationship between the task selectivity of a PC and its variance. In contrast, as FS:Φ is obtained from the correlation matrix, it describes the structure of the inter–relationships between the voxel time–series and not their magnitudes. Using PCs selected against the stimuli (FS:PCA-FULL) deteriorates the performance of the SSMs even further. This is because the exact coupling between the stimuli and the fMRI signal (and therefore the PC time–courses) is unknown and may be non–linear, and selecting PCs linearly correlated with HRF–convolved stimuli may not preserve a large proportion of the spatio–temporal patterns in the data. In contrast, the SVM, based on optimizing prediction error, has the best overall performance with this feature–selection strategy. The limitations, however, of supervised feature–selection are apparent in the case of FS:PCA-PH. Although the SVM predicts Ph with high accuracy (ERR^Ph_SVM ≈ 0.08 for FS:PCA-PH vs. ≈ 0.11 for FS:PCA-FULL), its ability to predict any other stimulus is severely degraded, with an overall error of ≈ 0.50. The SSMs have similarly poor performance, due to the loss of information about spatio–temporal patterns in this basis.

12.5.3.3 SSM Parameter Estimates

This section further investigates the parameters as estimated by SSM:PH trained with FS:Φ and s_t = (Ph, 1).

Figure 12.10: Effect of Hyper–Parameters on ERR_SSM:PH. Fig. (a): ERR_SSM:PH with respect to model-size K. Fig. (b): ERR_SSM:PH with respect to the precision hyper–parameter λ_w. Fig. (c): ERR_SSM:PH with respect to the dimension D of Φ. Fig. (d): ERR_SSM:PH with respect to the missing–stimulus block length T′. Error bars indicate ±1 SEM. Legend: blue solid line: control group; red dashed line: DC group; green dot–dashed line: DL group.

The prediction error ERR_SSM:PH exhibits a relatively shallow basin with respect to model–size K for all three groups in Fig. 12.10(a), with minima occurring in the range K = 18…28. This points to the robustness of the SSM estimation with respect to K for each subject. The robustness of ERR_SSM:PH with respect to λ_w, shown in Fig. 12.10(b), was comparable to that of the simulation study, with the curve of ERR_SSM:PH almost flat for $10^{-2.2} \leq \lambda_w \leq 10^{0.7}$.

From the plot of ERR_SSM:PH versus the dimensionality D of the feature–space Φ in Fig. 12.10(c), it can be noticed that ERR_SSM:PH initially drops as D increases, with Φ explaining more of the information in the data, and bottoms out at 400 ≤ D ≤ 600 across the three groups. It then begins to slowly rise as a larger number of unstable basis vectors are included, capturing an increasing percentage of the noise in the data.

Fig. 12.10(d) graphs ERR_SSM:PH versus the length T′ of the missing stimulus block used in the assessment of ERR_missing (cf. Section 12.4). Here, we observe a very low error for small T′, as prediction is driven primarily by the strong temporal regularity of the stimulus presentation sequence over short durations. However, as in the case of the simulation, the error increases with T′ and stabilizes at a block length of 2 trials (T′ ≈ 5 TRs), after which point there is no structure in the stimulus sequence and prediction is driven mainly by the patterns in the data.

Table 12.4 compares the prediction error of SSMs with: [HRF:NONE] no HRF FIR filter; [HRF:CONST] a spatially constant HRF of length L+1 = 32s; [HRF:UNCON] a spatially varying HRF of length L+1 = 32s without any constraints; and [HRF:PRIOR] the SSM with a spatially varying and unknown HRF of length L+1 = 32s but constrained by the prior density N(μ_h, Σ_h) (cf. Section 12.1). Here, the advantage of the spatially–varying but physiologically constrained HRF (HRF:PRIOR) in dealing with the variable hemodynamics of the brain and accurately predicting the mental state can be seen.

             Control      DC           DL
HRF:NONE     0.62±0.11    0.68±0.13    0.64±0.10
HRF:CONST    0.36±0.08    0.46±0.11    0.40±0.09
HRF:UNCON    0.54±0.13    0.59±0.15    0.55±0.16
HRF:PRIOR    0.31±0.05    0.40±0.09    0.33±0.05

Table 12.4: Prediction Error versus Different HRF Models. ERR_SSM:PH (±1 SEM) for the SSM with no HRF (HRF:NONE), a spatially constant HRF (HRF:CONST), a spatially varying and unconstrained HRF (HRF:UNCON), and a spatially varying HRF with a prior obtained from the canonical HRF of SPM8 (HRF:PRIOR).

Removing the HRF altogether from the model (HRF:NONE), thereby not accounting for the lag in the fMRI signal due to hemodynamics, leads to the largest deterioration in performance. Although the inclusion of a spatially constant HRF (HRF:CONST) causes some reduction in accuracy, allowing too much variability by putting no constraints on the shape of the HRF (HRF:UNCON) results in even worse performance due to over–fitting of noise.

Fig. 12.11 shows the estimates of the spatially varying but constrained HRF FIR filter (HRF:PRIOR) for each group, averaged in regions–of–interest (ROIs) selected in the left primary motor cortex (BA3, BA4) and the bilateral intraparietal sulcus (IPS) (BA40). A qualitative difference in the estimated HRFs is apparent in terms of their rise–time, peak value and dispersion. The prolonged and repeated recruitment of the IPS in this task may explain the dispersed shape of its HRF as compared to the motor cortex. No significant differences in HRF estimates were observed between the three groups.

The HRF of the brain is known to be highly variable [139], and by allowing the HRF FIR filter to vary spatially (one filter h[d] per feature–space element d), the SSM is able to account for this inter-region variability.

The group–wise spatial maps corresponding to the three phases of each trial are shown in Fig. 12.12.

(a) ROIs (Left Hemisphere)  (b) Motor Cortex  (c) Left IPS  (d) Right IPS

Figure 12.11: Estimated HRF FIR filter h. Fig.(a): The locations of the ROIs (in the left hemisphere). Fig.(b-d): The estimated FIR filter coefficients (± 1 std.dev.) for each group averaged in the ROI in the left motor cortex and the left and right IPS. Legend. Blue solid line: control group, Red dashed line: DC group, Green dot–dashed line: DL group.

(a) Control Group  (b) Dyscalculic Group  (c) Dyslexic Group

Figure 12.12: Spatial Maps for Mental Arithmetic. The group–wise t–score maps on an inflated brain–surface are shown, with columns for the left lateral–posterior, left medial–posterior, right medial–posterior and right lateral–posterior views. Values of t < 3 have been masked out for clarity and the color–map shows values ranging from t = 3 to t = 14. Each row shows the activation maps corresponding to the three phases within a single trial of the task.

12.5.3.4 Discussion

The average optimal model size K* and prediction error ERR_SSM:PH for the three groups are shown in Table 12.5. Here, we notice that the variation in model-sizes for the DC group is larger than for the controls, while that for the DL group is almost of the same order. This points to a greater heterogeneity in the DC data, necessitating models with different sizes. Also, the consistently higher error–rate of the DC population indicates the relative inaccuracy of the models for their mental processes, as compared to the other two groups.

             Control       DC            DL
K*           22.57±2.19    26.23±3.95    23.14±2.25
ERR_SSM:PH   0.31±0.05     0.40±0.09     0.33±0.05

Table 12.5: Overall Results. The mean optimal model size K* and prediction error ERR_SSM:PH (±1 SEM) for the control, dyscalculic (DC) and dyslexic (DL) subjects. The chance-level error–rate for the data-set is ≈ 0.83, computed by permuting the stimuli with respect to the scans.

Similar to the results of the last chapter, these observations concur with the theory that not only are dyscalculics different from each other in their arithmetical strategies, but their lack of an intuitive notion of numerical size may be compensated for by shifting mental strategies, resulting in the poor fit of a single model for a subject.

In Fig. 12.13, we show the effect of Ph, LogPs and LogDiff on the error ERR_SSM:PH. Note that LogPs and LogDiff were not used to train the model, and therefore the influence of these parameters on the mental patterns of the subjects was effectively discovered by the method.


Figure 12.13: ERRSSM:PH with respect to Ph, LogPs and LogDiff. For each group, the overall error–rate (first bar) followed by the effect for LogPs (second bar) and LogDiff (third bar) are displayed with respect to trial phase Ph. The effects are calculated as the difference in ERRSSM:PH at the high minus that at the low level of the quantized LogPs and LogDiff. Error bars indicate ±1 SEM.

In order to measure the similarity between two models $M_i = \{\Phi_i, \theta_i\}$ and $M_j = \{\Phi_j, \theta_j\}$ trained on the data of two different subjects, the mutual information (MI) between the state–sequences of the two models was used. As in the previous chapter, for each fMRI session Y in our data–set the optimal state–sequence was computed using each of the 42 models as per the estimation algorithm of Section 12.3. The MI is derived from the joint histogram of $X^{(i)}$ and $X^{(j)}$, the optimal state sequences for the same fMRI data Y computed from the models $M_i$ and $M_j$ respectively. This procedure, applied to all $\binom{42}{2}$ pairs of subjects, yields a pair-wise similarity matrix. These MI relationships can then be visualized in 2D using multidimensional scaling (MDS) [120], as shown in Fig. 12.14. The specification of the SSM in terms of abstract mental–states allows comparing the spatio–temporal patterns between subjects in their entirety in this abstract representation [115]. Please refer to Appendix E.4 for the computation of these error–rates and the MI.
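The MI computation itself is elementary; a sketch from the joint histogram of the two decoded sequences (illustrative names):

```python
import numpy as np

def state_sequence_mi(xi, xj, K):
    """Mutual information between two state sequences via their joint histogram.

    xi, xj : integer state labels (length T) for the same fMRI session,
             decoded under models M_i and M_j respectively.
    """
    joint = np.zeros((K, K))
    np.add.at(joint, (xi, xj), 1.0)
    joint /= joint.sum()                     # joint distribution p(a, b)
    pi, pj = joint.sum(1), joint.sum(0)      # marginals
    nz = joint > 0                           # sum only over non-zero cells
    outer = pi[:, None] * pj[None, :]
    return float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())
```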

Legend: Control Male / Control Female / Dyslexic Male / Dyslexic Female / Dyscalculic Male / Dyscalculic Female. Panels: (a) Overall; (b) Phase 1; (c) Phase 1: Product Size Effect; (d) Phase 2; (e) Phase 2: Problem Difficulty Effect.

Figure 12.14: MDS plots of the Mutual Information Between All Pairs of Subjects. Fig. (a) shows the MDS plots of the subjects based on their overall MI, while Figs. (b) and (d) show the relative arrangement of the subjects based on their MI during the first and second phases of the trial. The effects of product–size in phase 1 and problem-difficulty in Ph = 2 on the MI are plotted in Figs. (c) and (e).

Fig. 12.14(a) shows a clustering of subjects in the MDS space with respect to their group (control, DL or DC) along the vertical axis, while along the horizontal axis we see a slight, but not significant, organization dictated by gender. Since this labeling is applied after plotting all the subjects in the MDS space, an intrinsic organization in the spatio–temporal patterns of the subjects in each group has been identified. Interestingly, there are a few DC subjects that cluster along with the DL group, at the top of Fig. 12.14(a). This is not surprising, given that dyscalculia is oftentimes comorbid with dyslexia [162], and these DC subjects may exhibit dyslexic deficits during this task.

The separation between the MDS clusters for each group can be quantified using the Cramér test [9], which provides a non–parametric measure of the p–value of the distance between the means of two samples through a permutation method. The p–values of the group–wise differences are compiled in Table 12.6.

             Overall              Ph 1                 Ph 2                 Ph 3
Ctrl vs. DC  0.78 (+0.04,+0.01)   0.80 (+0.10,−0.02)   0.84 (+0.02,+0.06)   0.71 (+0.02,−0.03)
Ctrl vs. DL  0.74 (−0.02,+0.02)   0.86 (−0.01,−0.00)   0.73 (+0.01,+0.01)   0.65 (−0.00,+0.01)
DC vs. DL    0.77 (+0.03,−0.01)   0.79 (+0.07,+0.03)   0.85 (+0.01,+0.08)   0.72 (−0.01,+0.02)

Table 12.6: The separation between the three groups in the MDS plots, assessed using the Cramér non–parametric test. Tabulated are the p–values for the overall distance between the means of the groups in the MDS space, along with the Ph–wise changes of the p–values. Each column also includes the effects of LogPs and LogDiff on the p–value in brackets.

From the results in Figs. 12.12, 12.13, 12.14 and Tables 12.5, 12.6 the following observa- tions can be made.

Multiplication Phase. The error rate for the DL group is much higher than that for the controls (cf. Fig. 12.13). An increase in product–size causes a large (> 1.5 SEM) reduction in ERR_SSM:PH for controls, while the effect for the DC and DL groups is less pronounced (> 1 SEM). Also, there is a clear separation between the DL and control groups in the MDS space, and product–size increases the separation between the DC and control groups. For the control subjects, high values are seen in the bilateral occipital extra-striate cortices, the left postcentral area, the left angular gyrus (lAG), the medial frontal gyri (MFG), and the left intra-parietal sulcus (IPS). The DC subjects show lower activation in the bilateral IPS, while the DL subjects show increased activation in their left fronto-parietal and left medial frontal gyral (lMFG) regions as compared to controls.

These results may be due to the greater difficulty and conflict experienced by the DL subjects and the multiplicity of mental strategies adopted during the reading phase of the task. The higher error of the DC subjects may be due to irregular patterns in accessing the verbally encoded rote multiplication tables located in the lAG. The reduction in error–rates of all subjects with increase in product–size may be due to increased organization of their mental processes as their multiplication memory is stressed, while the increased separation between the groups could indicate greater divergence of the mental patterns of the DC individuals from the controls.

Judgement Phase. ERR_SSM:PH for the DC group increases drastically, while those for the DL and control groups match up. The DC subjects may experience difficulty in judging the difference between the size of the correct and incorrect results and may resort to a greater variety of mental strategies. Not surprisingly, as the reading phase of the experiment has ended, the patterns of the DL individuals begin to resemble those of the controls and the separation between these groups reduces in the MDS space, while the separation of the DC group increases. The control and DL subjects exhibit high values in the left and right IPS, both pallida, the caudate heads (CdH), the left anterior insula (aIn), the lMFG, the supplementary motor area (SMA) and the left fronto-parietal operculum, while the map for the DC group activates in both aIn, both MFG, the left IPS, the anterior rostral cingulate zone (aRCZ), and the right supramarginal gyrus (SMG). Although LogDiff reduces the error–rate of the control and DL subjects, it has the opposite effect on the DC group, as increased conflict may recruit new functional circuits. The effect of LogPs is consistent with strong activation of the working verbal (mute rehearsal) and visual memories.

Third Phase. This phase involves decision–making and conflict–resolution and is highly variable between repetitions and subjects, causing increased inaccuracy during this phase. Also, due to the self–paced nature of the task, it very often contained the button–press and the inter–trial rest interval. The spatial–maps for the three groups show increased foci in the pre–frontal and motor areas. The left IPS region in the DC group is also strongly activated during this phase, which may point to irregular storage and retrieval of the number size using spatial attributes typically processed in this region.

12.6 Conclusion

In this chapter, we extended the state–space model of Chapter 11 to include information about the experimental task in order to guide the discovery of patterns. Efficient estimation algorithms, using a variational formulation of generalized EM under the mean field approximation, were developed and quantified with a simulation study. The HRF of the brain is known to be highly variable [139], and by using a spatially varying but unknown FIR filter, the state–space model (SSM) was able to compensate for this variability. Model hyper–parameters were selected in an automated fashion using a maximally predictive criterion. By selecting which stimulus to input to the SSM, the user is able to choose between data–driven and model–driven estimation of the parameters.

The hidden layers in the SSM decouple the stimulus from the data, and therefore neither does the stimulus need to be convolved with an HRF nor does the exact mathematical relationship between the stimulus and the fMRI signal need to be specified. This allows flexibility in choosing which experimental variables to include and their encoding, without having to worry about statistical issues like the orthogonality of the experiment, the estimability of the design matrix and omitted variable bias. But classical issues like confounding variables will still affect inference and must be addressed through appropriate experimental designs.

As demonstrated by the mental arithmetic study, this method can be used with arbitrarily complex paradigms, where the investigator can decide which stimuli to provide as input, thereby choosing a trade–off between data–driven and model (i.e. stimulus) driven estimation of parameters. The effects of other un–modeled experimental variables on the model can then be tested post hoc. This is in contrast to supervised methods that cannot, by design, capture the effects of experimental variables against which they have not been modeled. However, with simple block design paradigms, where the effect of hemodynamics and the temporal structure within a block are insignificant, we observed that MVPR classifiers tended to outperform the SSM in predicting the mental state. Also, its application to default–state and non–task related fMRI studies would require an alternative model–size selection procedure that does not use prediction error as a criterion, or a non–parametric formulation with an infinite number of states [15].

The SSM parameters are estimated through a fitting criterion and consequently have a well–defined interpretation implied by the underlying neurophysiological model. Here, prediction error is used as a statistic to select between models and to infer an effect of the experimental variables on the data, which implicitly involves selecting between alternative hypotheses [65]. For example, the ability to predict mental states at “much better than chance” levels adduces evidence against the null–hypothesis that the SSM does not explain the data. A similar argument applies for the predictability of experimental variables that were not included during the training of the SSM. The SSM, however, due to the lack of a parametric form of the null distribution of the prediction error and the prohibitively high cost of a non–parametric permutation test, cannot measure the confidence level (i.e. a p–value) in a hypothesis test.

Comparing brain–function in abstract representation spaces rather than the spatial–maps directly has been shown to be a very powerful principle in psychology and neuroscience [115]. For example, Edelman et al. [54] discovered natural groupings within a representational space derived using MDS on the activation patterns under different task conditions and subjects. Here, the abstract state–space representation was used to compare the spatio–temporal signatures of mental processes in their entirety. Systematic differences in the cascades of recruitment of the functional modules between subject populations were shown, indicating the necessity of retaining the temporal dimension. The MDS plots derived from the MI between subject pairs enabled a succinct assessment of the relationships between different groups with respect to experimental parameters. This ability to reveal and study the group–wise structure in the spatio–temporal patterns could guide the design of more specific experiments to test interesting effects.

Therefore, given its advantages and disadvantages with respect to other analysis methods, we believe that it is a complementary tool in an investigator’s arsenal providing a new and different insight into mental processes.

EPILOGUE

Prediction is very difficult, especially about the future.

Niels Bohr (1885–1962).

In this thesis I have attempted to address the pressing need for solutions to the problem of studying the spatio–temporal patterns implied by mental processes from their metabolic traces recorded by functional magnetic resonance imaging (fMRI). In pursuit of this, I investigated two tracks: one, revealing the temporal ordering in the cascades of recruitment of the functional modules of the brain during the performance of a task (i.e. mental chronometry); and two, building a spatio–temporal representation of mental processes as a sequence of abstract brain–states, each having a spatially distributed signature of neural/metabolic activation.

The methods were developed and applied in the context of two studies, one for studying the development of visuo–spatial working memory in children and one for investigating the neural basis of arithmetical processing deficits.

In Part II, a visual analytic tool was developed to explore the chronoarchitecture of the brain using a semi–supervised clustering algorithm, followed by a statistical method to measure the timing differences between different regions of the brain. The visual tool identifies voxel–clusters of potential interest to the investigator and displays their time–series, in a paradigm reminiscent of the visual examination of EEG data. Then, a robust and efficient estimator for activation latency was developed using a general linear model (GLM) for univariate statistics.

Part III dealt with the creation of a phenomenological model of mental processes, with the brain transitioning through an abstract state–space as it performs a mental task. After an initial confirmation that fMRI data indeed contain the information necessary for such a representation, a distance metric that captured the notion of the functional similarity between the activation patterns of two brain–states was defined and used to design a low–dimensional linear feature–space. Then, a spatio–temporal representation based on a hidden Markov model (HMM) formalism of brain function was proposed and an unsupervised estimation procedure based on Monte–Carlo sampling was developed. The correct model–size was selected using a maximally predictive criterion that linked the results back to the experimental effects of interest. This method suffered from low accuracy due to the simplifying assumptions needed for a reasonable running–time and due to its unsupervised nature. These drawbacks were corrected by stabilizing the estimation procedure with information about the experimental task and by eliminating the simplifications. Computational efficiency was achieved through estimators that used a mean–field approximation.

The advantages of such a dynamical generative model over other approaches are four-fold:

(i) The fully specified generative model allows definitive neurophysiological interpretation of the parameters.

(ii) It allows comparing the spatio–temporal patterns of mental processes between subjects in their entirety, and not just their static activation maps, where the temporal ordering of events is lost.

(iii) It can predict the cognitive state of the subject, not from single time–points but from the time–evolution of patterns in the data.

(iv) The abstraction in terms of brain–states can provide the ability to study dynamical characteristics of the data, such as periods and cycles, surprising or new events, and regime changes.

Some of the drawbacks, on the other hand, include computational complexity, the requirement of having stimulus information or task labels for model–size selection, the lack of statistical measures of confidence and the need for spatial normalization of the data for inter–subject comparison.

Given the exponential increase of computational power, the widespread availability of high–performance and massively–parallel computing infrastructure and the amenability of many of the algorithms here to parallelization, computational complexity should not be a significant obstacle to the adoption of these methods.

We are currently refining a model–size selection strategy based on comparing model–evidences evaluated through cross–validation. This procedure should eliminate the need for experimental stimulus in model selection while simultaneously avoiding the specification of model–complexity as required by other model selection techniques.

Unlike GLMs, the nature of the state–space model precludes closed–form expressions for the sampling distributions of the parameters and hence parametric assessments of confidence. The high computational burden complicates the use of non–parametric methods such as permutation tests. To overcome this difficulty, we are working on a fully Bayesian version of this model that will provide posterior distributions for all the parameters.

One of the most important and vexing problems is the need for spatially normalizing the data of all the subjects into a common anatomical space for inter–subject analysis. As discussed in Section 2.3.4, this step has many fundamental problems, such as the ability to find correspondences of anatomical features between subjects and the validity of these for their functional correspondences. As the state–space models use the fMRI data projected into the low–dimensional feature–space obtained from their functional connectivity, it would be natural to perform registration in this feature–space. This might be achieved either through the registration of their functional networks posed as a graph homomorphism problem, or through the estimation of a common feature–space using a hierarchical model for functional connectivity.

Another exciting avenue for future work is the integration of other modalities into this analysis methodology, such as the connectivity information of DTI to build a feature–space or incorporating the high temporal resolution offered by EEG to improve characterization of mental dynamics. Finally, of interest is the detection of the multiple sub–processes running in parallel that constitute the building blocks of human thought.

232 APPENDIX A

PROOFS FOR ACTIVATION ONSET LATENCY ESTIMATOR

Consider the case of a single stimulus function s(t), which yields the following regressors

x(1)(t) = s ? h(t) and x(2)(t) = s ? h˙ (t). The resulting GLM is y = Xβ~ + , where

(1) (2) ~ 0 2 X = [x x ] and β = (β, γ) , and  ∼ N (0, σ Σ)

~ ~ 0 −1 − 0 −1 V ~ The Gauss-Markov estimator for β is βb = (X Σ X) X Σ y, and its variance is ar[βb] =

0 −1 − (X Σ X) .

˙ Assuming Σ = I, and by observing that h(t) is orthogonal to h(t), we see that the cross-

covariance terms of Var[ˆγ] are theoretically zero, indicating that βˆ and γˆ are uncorrelated

Gaussian variables. Therefore, E[ˆρ] ≈ E[ˆγ]E[1/βˆ]. Now, taking a first order Taylor series

expansion of 1/βˆ about β, and using the fact that βˆ is unbiased, we get: ! ! Var[βˆ] 1 E[ˆρ] = ρ 1 + =ρ ˆi 1 + , (A.0.1) 2 2 (βˆ) tβ q ˆ ˆ ˆ where tβ = β/ Var[β] is the t–score for β. This expression for the bias in ρ is used to derive the corrected estimate ρˆcorr. This correction shrinks the estimate of the delay when the t–score of βˆ is low, thereby also mitigating the resulting numerical instability of ρˆ.

233 According to the model in eqn. A.0.1, we get τˆ = f(β,ˆ γˆ), a non-linear function of βˆ and

γˆ. Applying a first order Taylor expansion of f around the true value β and γ, and using the fact that the two estimates are unbiased, we get:

∂f ∂f  Var[ˆτ] = ∇f(β, γ)Var[ˆγ]∇f(β, γ)0 where ∇f(u, v) = (A.0.2) ∂u ∂v

234 APPENDIX B

FUNCTIONAL CONNECTIVITY ESTIMATION

B.1 Hierarchical Agglomerative Clustering

1 begin // Initialization 2 For each voxel i, create one cluster ci of size ni = 1 3 Each ci is associated with a time–course Y[i] 4 end 5 while Number of clusters greater than specified value do 6 Find two clusters ci and cj that are spatially adjacent to each other and merge them into a new cluster ck = (ci, cj), if and only if Var[ck] is minimum over all i, j 7 Remove clusters ci and cj from the set of clusters, and add ck 8 end Algorithm B.1: Hierarchical Agglomerative Clustering

The time–series for the new cluster c is defined as Y[k] = 1/n P Y[i], and for a new k k ci∈ck

cluster ck = (ci, cj) can be efficiently updated according to Y[k] = (niY[i]+njY[j])/(ni+

nj).

235 The variance of a cluster c is Var[c ] = (1/n T ) P PT (Y[i] − Y[k])2, and is k k k ci∈ck t=1 efficiently updated through the variance separation theorem:

V V PT 2 ni ar[ci] + nj ar[cj] t=1(Y[i] − Y[k]) Var[ck] = − . ni + nj T (ni + nj)

V After hierarchical clustering, the covariance σ[k1, k2] , ar[ck1 , ck2 ] between two clusters

ck1 and ck2 is estimated as:

T ! ! 1 X 1 X 1 X σ[k , k ] = Y [k ]Y [k ] − Y [k ] Y [k ] . 1 2 T t 1 t 2 T t 1 T t 2 t=1 t t

B.2 Shrinkage

The regularized estimate of the covariance is computed using an adaptive soft shrinkage estimator [186]:

−1 sλ(σ[k1, k2]) = sgn(σ[k1, k2]) |σ[k1, k2]| − λ|σ[k1, k2]| + . (B.2.1)

This estimator has the property that the shrinkage is continuous with respect to σ[k1, k2], but the amount of shrinkage decreases as σ[k1, k2] increases resulting in less bias than the standard soft shrinkage estimator. The threshold parameter λ is selected by minimizing the risk function R(λ) = E||sλ(σ) − σ||2. Under certain regularity assumptions about the data, a closed form estimate of the optimal threshold is obtained as [130]: P Var[σk ,k ] λ ≈ k16=k2 1 2 , (B.2.2) P σ2 k16=k2 k1,k2 V where ar[σk1,k2 ] is estimated as:

T T !2 T X X Y [i]Y [j] − Y 0 [i]Y 0 [j] . (T − 1)3 t t t t t=1 t0=1

236 This estimator is “sparsistent” [186], that is, in addition to being consistent, it estimates true

zeros as zeros and non-zero elements as non-zero with the correct sign, with probability

tending to 1.

B.3 Voxel-wise Correlations

The covariance between the time–series Y[i] of a voxel i belonging to cluster ck and the

cluster average time–series Y[k] is ! ! 1 X 1 X 1 X σ[i, k] = Y [i]Y [k] − Y [i] Y [k] , T t t T t T t t t t and the correlation coefficient is: σ[i, k] ρ[i, k] = . (B.3.1) pσ[i, i]σ[k, k]

Also, the smoothed (i.e. conditionally expected) time–series Y[i|k] , E[Y[i] | Y[k]] is: ! 1 X 1 X Y[i|k] = Y [i] + σ[i, k]σ[k, k]−1 Y[k] − Y [k] , (B.3.2) T t T t t t σ[i, k]2 and σ[i|k] = σ[i, i] − . σ[k, k]

is its conditional variance σ[i|k] , Var[Y[i] | Y[k]].

Therefore, the expected (smoothed) correlation between two voxels i and j belonging to

clusters cki and ckj respectively are obtained by substituting eqn. B.3.2 and eqn. B.3.1 to get:

Cov [Y[i|ki], Y[j|kj]]} F[i, j] = p σ[i|ki]σ[j|kj]

ρ[i, ki]ρ[j, kj] =sλ(σ[ki, kj]). . (B.3.3) p 2 2 (1 − σ[ki, ki]ρ[i, ki] ) (1 − σ[kj, kj]ρ[j, kj] )

237 APPENDIX C

CONSTRUCTION OF THE FEATURE–SPACE

C.1 Orthogonal Partitioning of F

Since F is a symmetric positive definite kernel, we can consider the functional connectiv- ity F[i, j] = hYe [i], Ye [j]i in some representation of the data Ye [i] and Ye [j] at the voxels i, j [193]. In the definition of Section 9.2, the regularized correlation coefficient defines p this inner-product, i.e. Ye [i] = Y[i] Var[Y[i]]. If Ye = (Ye [1] ... Ye [N]), then F = Ye >Ye .

1 1/2 > PN 2 0 Consider the SVD of Ye = VΛ U = n=0 λn vnun. Eliminating the contribution of

v1, the left singular vector corresponding to the second eigenvector η1, from the functional

0 1/2 0 connectivity, yields Ye − v1v1Ye = Ye − λ1 v1u1, and therefore,

> 0 Fb = Ye Ye − λ1u1u1. (C.1.1)

238 C.2 Primal and Dual Formulations

Consider the original linear minimization problem of Definition 1 which can be written out

as:

N N X X FD(Zt1 , Zt2 ) = min f[i, j]dF[i, j], (C.2.1) f i=1 j=1

subject to the constraints:

f[i, j] ≥ 0 X X f[i, j] − f[j, i] = δZ[i] j j

Here, we have defined the difference between the activation patterns Zt1 and Zt2 as δZ =

PN 30 Zt1 − Zt2 . Also, we have assumed i=1 δZ[i] = 0, without loss of generality .

Writing out the simplex tableau [143] for this problem

f[1, 2] f[1, 3] ... f[2, 1] f[2, 3] ... f[3, 1] f[3, 2] ... g[1] 1 1 ... −1 ... −1 ... δZ[1] g[2] −1 ... 1 1 ... −1 ... δZ[2] , g[3] −1 ... −1 ... 1 1 ... δZ[3] ......

dF[1, 2] dF[1, 3] ... dF[2, 1] dF[2, 3] ... dF[3, 1] dF[3, 2] ...

and adding in the dual variables g ∈ RN using the prescribed procedure, we get:

N X FD(Zt1 , Zt2 ) = sup g[i] · δZ[i] (C.2.2) g i=1

30This condition can be easily satisfied by adding to the optimization problem of eqn. C.2.1 a dummy node PN with index N + 1 called the dump, where δZ[N + 1] = − i=1 δZ[i] and dF[i, N + 1] = 0, ∀i = 1 ...N. However, as will be seen later, even this artifice is not essential to the development of the feature–space.

239 subject to the constraints:

g[i] − g[j] ≤ dF[i, j]

Note that the dual variables g are unrestricted. By the fundamental theorem of duality, if the primal linear programming problem is bounded feasible, then so is its dual, their values are equal, and there exist optimal vectors for both problems.

C.2.1 Augmented Formulation

It can be seen that if g is a feasible solution to eqn. C.2.2 with maximum cost α then g +M, P where M is any scalar, is also a feasible solution with value α (Since i δZ[i] = 0).

Therefore, eqn. C.2.2 is equivalent to:

N X g∗ = arg sup g[i] · δZ[i], (C.2.3) g i=1 subject to the constraints:

g[i] − g[j] ≤ dF[i, j] X g[i] = 0 i

The corresponding primal formulation requires an additional primal variable ζ correspond- P ing to the constraint i g[i] = 0, as follows:

N N ∗ X X f = arg min f[i, j]dF[i, j] + 0 · ζ (C.2.4) f i=1 j=1

240 subject to the constraints:

f[i, j] ≥ 0

ζ unrestricted X X f[i, j] − f[j, i] + ζ = δZ[i] j j

P It also has the same optimal solution. In the primal, this implies that i δZ[i] = K 6= 0 does not change the solution to the problem, which is the property that allows dealing with P P the case when i Zt1 [i] 6= i Zt2 [i].

This primal–dual equivalence can be seen by writing out the augment simplex tableau:

f[1, 2] f[1, 3] ... f[2, 1] f[2, 3] ... f[3, 1] f[3, 2] ... ζ g[1] 1 1 ... −1 ... −1 ... 1 δZ[1] g[2] −1 ... 1 1 ... −1 ... 1 δZ[2] g[3] −1 ... −1 ... 1 1 ... 1 δZ[3] ......

dF[1, 2] dF[1, 3] ... dF[2, 1] dF[2, 3] ... dF[3, 1] dF[3, 2] ... 0 .

C.3 Proofs for the Linear Approximation

This section contains proofs showing that any orthogonal transformation Ψ = {ψ(1) ... ψ(N)}

P (l) where i ψ [i] = 0 yields a lower and an upper bound to the transportation problem, and that the basis Φ constructed in Section 10.2 has a tight bound.

241 Define δZ = Zt1 − Zt2 to be the difference between the two activity patterns Zt1 and Zt2

to be compared. The primal and dual formulations of the functional distance are as per

eqn. C.2.4 and eqn. C.2.3 respectively.

Now, if we define the coefficients of a vector δZ : N → R in the basis Ψ be δz[l] = hψ(l), δZi, the following theorem holds:

Theorem 2. Consider the optimization problem of eqn. C.2.3. Let δz[l] be coefficients δZ in the basis Ψ. Then, there exist constants Ml ≥ Mcl > 0, such that

N N X X X Mcl|δz[l]| ≤ max g[i]δZ[i] ≤ Ml|δz[l]| (C.3.1) g l=0 i∈C l=0

To prove this theorem, the next two lemmas are required. The first lemma will help estab- lish the upper bound property, while the second lemma will be needed to prove the lower bound. P Lemma 1. If i g[i] = 0 and |g[i] − g[j]| ≤ dF[i, j], then there exist constants Ml, l =

(l) 1 ...N, such that |g | ≤ Ml

Proof.

X (l) |g[l]| = g[i]ψ [i] i∈C

X (l) X (l) = (g[i] − g[i0])φ [i] + g[i0] φ [i] , i i ! X (l) X (l) ≤ |g[i] − g[i0]| · |φ [i]| Since, φ [i] = 0 i i X (l) ≤ dF[i, i0]|φ [i]| i X (l) ≤ sup dF[i, j] |φ [i]| i,j∈ C i

= Ml (C.3.2)

242 If we consider the basis vector φ(l,m) as defined in Section 10.2 the upper bound is: " # X (l,m) X (l,m) X (l,m) dF[i, i0]|ψi | = dF[i, i0]|φ [i]| ≤ sup dF[i, j] sup |φ [i]| (l,m) (l,m) i,j∈C l,m i∈C i∈C i and, for every level of decomposition m, it can be seen that supi,j∈C(l,m) dF[i, j] decays faster than 2−m. Therefore, the upper bound on coefficients of the function g decays ac-

(l,m) −m cording to |g | ≤ 2 M0,0, in the basis Φ.

Lemma 2. There exist positive constants Mcl, 0 < Mcl ≤ Ml, l = 1 ...N, such that the

N set of vectors {g ∈ R }, where |g[l]| ≤ Mcl must satisfy the property |g[i]−g[j]| ≤ dF[i, j].

Proof. If any function satisfies |g[i] − g[j]| ≤ dF[i, j] then g + c will also satisfy this property, for any constant c. Therefore, we shall prove the lemma for the subset of vectors

P 0 0 that have the property i g[i] = 0. Now, if ∀i ∈ C, |g[i ]| ≤ infi,j dF[i, j] then it must be that |g[i] − g[j]| ≤ dF[i, j]. Also, because Ψ is an orthogonal basis, it is true that if

|g[l]| ≤ Mcl, then

N X sup sup |g[i]| = sup sup g[l]ψ(l)[i] g i∈ g i∈ C C l=0 " N # X (l) ≤ sup sup Mcl|ψ [i]| g i l=0 " N # X (l) = sup Mcl|ψ [i]| i l=0

There exist many combinations of {Mcl, l = 1 ...N} such that supg supi∈C |g[i]| ≤ infi,j dF[i, j]. For example, by setting:

inf d [i, j] 1 M = i,j F cl P (l) N i |ψ [i]| this property is ensured.

243 P (l,m) P (l,m) 2 For the basis Φ, first observe that, by construction, i φ [i] = 0, i |φ [i]| = 1,

P2m−1 P (l,m) 2 m and therefore, l=0 i |φ [i]| = 2 . Also, note that for an N dimensional vector √ g, ||g||1 ≤ N||g||2. Therefore, this bound becomes:

inf d [i, j] 1 M = i,j F ≈ 2−mM cl,m P (l,m) c0,0 N i |φ [i]|

Using these two lemmas, Theorem 2 is now proved as follows:

Proof. (Theorem 2). Since Ψ is an orthogonal transformation, P g[i]δZ[i] = PN g[l]δz[l],. i∈C l=0 The upper-bound then follows from Lemma 1.

For the lower bound, assume that g∗ is the optimal solution such that P g∗[i]δZ[i] < i∈C PN + PN l=0 Mcl|δz[l]|. However, as per Lemma 2, the function g = l=0 sgn(δz[l])Mclφl is also

PN ∗ a feasible solution with cost l=0 Mcl|δz[l]|. Therefore, g cannot be the optimal solution, resulting in a contradiction.

Theorem 3. The quality of the approximation is evaluated by the tightness of the bound: v N N ! u N X X 1 uX 2 sup Ml|δz[l]| − Mcl|δz[l]| = √ t (Ml − Mcl) , (C.3.3) ||δz||2=1 l=0 l=0 2 2 l=0

obtained through the method of Lagrange multipliers. For the basis Φ, eqn. C.3.3 is ap- √ proximately equal to (M0,0 − Mc0,0)/ 2.

Proof. Defining the symbols δM , M − Mc and λ as the Lagrange multiplier, the La- grangian of the optimization problem of eqn. C.3.3 is:

N N ! N ! ! X X X X 2 sup Ml|δz[l]| − Mcl|δz[l]| = min sup δM[l]|δz[l]| + λ |δz[l]| − 1 λ δz ||δz||2=1 l=0 l=0 l=0 l (C.3.4)

244 Now, by inspection, it is obvious that the supremum will occur for δz > 0 given the constraints31.

Differentiating eqn. C.3.6 with respect to δz[l] and setting to 0, we get δz[l] = −δM[l]/2λ and eqn. C.3.3 becomes:

N ! 1 X min − δM[l]2 − λ (C.3.5) λ 2λ l=0

Differentiating with respect to λ and setting to 0, we get λ = pP δM[l]2/2 and substitut-

ing in eqn. C.3.5 we get: v N N ! u N X X 1 uX 2 sup Ml|δz[l]| − Mcl|δz[l]| = √ t (Ml − Mcl) (C.3.6) ||δz||2=1 l=0 l=0 2 2 l=0

31This property can be verified algebraically by introducing Lagrange multipliers µ[l] for the constraint δz[l] ≥ 0. These multipliers take a value of 0, which by complementary slackness indicates that the constraint is compulsorily satisfied.

245 APPENDIX D

PROOFS FOR UNSUPERVISED STATE–SPACE MODEL

This appendix is organized as follows: in Section D.1, the derivation of the EM algorithm for the proposed model is given, and then Section D.2 explains the forward-backward re- cursions needed in the M-step of the EM algorithm. Next, the procedure to marginalize out the HRF filter from the estimates of the parameters is given in Section D.3. The estimation of the optimal state–sequence x∗ given model parameters θ and observations y is described in Section D.4.

The simplification of estimation algorithms in the reduced model as compared to the full model of Fig. 11.1 is justified in Section D.5. Finally, the precise definition of mutual– information for comparing two state–sequences is given in Section D.6.

246 D.1 Expectation Maximization

The maximum likelihood (ML) estimate θML = arg maxθ ln p(y|θ, h,K) can be obtained using the EM algorithm by decomposing the log-probability into a free-energy and a KL- divergence term as:

X p(y, x, θ|K) ln p(y|θ, h,K) = q(x) ln + KL(q||p(x|y, θ, h,K)), (D.1.1) q(x) x which yields the following two-step iterative algorithm:

X E-step Q(θ, θn) = p(x|y, θn) ln p(y, x|θ), (D.1.2) x M-step θn+1 = arg max Q(θ, θn). (D.1.3) θ

The complete log-likelihood term is:

ln p(y, x|θ) = ln p(y|x, ϑ, Σ) + ln p(x|α, π) (D.1.4) T X where ln p(x|α, π) = ln αx1 + ln πxt,xt+1 . t=2

Since the relationship between the observations y and hidden states x is mediated through the underlying activation patterns z and the hemodynamic response function h, an FIR

PL filter of length L, as per the equation yt = τ=0 zt−τ hτ , we see that, L Z Y pθ(yt|x) = pθ(yt|zt−L...t) pθ(zt−l|xt−l)dzt−L...t = N (µt−L...t, Σt−L...t) , zt−L...t l=0 (D.1.5) where

L X µt−L...t = µxt−l hl l=0 L X 2 Σt−L...t =Σ + Σxt−l hτ . l=0

247 PL If we consider one particular assignment of xt−L...t = {k0 . . . kL} and let µk0...kL = l=0 µkl hL−l, then any element µ(i) of µ of µ is a linear combination of the corresponding k0...kL k0...kL k0...kL elements of µ1 . . . µK , as:

 (i)     (i)  µ1...1 hL + ... h0 0 ... 0 0 µ1        (i)     (i)   µ1...2   hL + ... h1 h0 ... 0 0   µ2   .   . . . .   .   .  =  ......   .  .  .   . . . .   .         µ(i)   0 0 ... h h + h   µ(i)   K...K−1   L L−1 0   K−1  (i) (i) µK...K 0 0 ... 0 hL + h0 µK In matrix notation,

~µ(i) = H~µ(i) and, ~µ(i) = H−~µ(i) , (D.1.6) k0...kL k k k0...kL where H− is the pseudo-inverse of H.

Similarly, each element Σ(i1,i2) of Σ is related to the corresponding elements of k0...kL k0...kL

Σ1 ... ΣK as:       Σ(i1,i2) (i1,i2) 2 2 1 Σ1...1 hL + ... h0 0 ... 0 0        Σ(i1,i2)   (i1,i2)   2 2 2   2  Σ1...2 hL + ... h1 h0 ... 0 0  .       .   .   . . . .     .  =  ......    .      Σ(i1,i2)   (i1,i2)   2 2 2   K−1   ΣK...K−1   0 0 ... hL hL−1 + h0   (i ,i )       Σ 1 2  Σ(i1,i2) 0 0 ... 0 h2 + h2  K  K...K L 0 (i1,i2) Σ

In matrix notation,

(i ,i ) (i ,i ) Σ~ (i1,i2) = GΣ~ 1 2 and, Σ~ 1 2 = G−Σ~ (i1,i2) . (D.1.7) k0...kL k k k0...kL

248 Furthermore, using the independence structure of the emission probabilities as implied by the reduced model of Fig. 11.2, we see:

T X ln p(y|x, ϑ, Σ) = ln p(yt|xt−L...t, ϑ, Σ) t=1 " T # 1 X = − ln |Σ | + (y − µ )>Σ−1 (y − µ ) + c. 2 t−L...t t t−L...t t−L...t t t−L...t t=1 (D.1.8)

Therefore, by substituting the results of eqn. D.1.4 and eqn. D.1.8 in eqn. D.1.2, and inter- changing the order of the summations, the expected complete log-likelihood becomes:

n X n Q(θ, θ ) = p(x|y, θ ) [ln p(y|x, ϑ, Σ) + ln p(x|α, π)] x T n X n = p(x1|y, θ ) ln αx1 + p(xt−1,t|y, θ ) ln πxt−1,xt t=2 T X X n + p(xt−L...t|y, θ ) ln p(yt|xt−L...t, ϑ, Σ). (D.1.9)

t=1 xt−L...t

PK PK The M-step for α, π, constrained to k=1 αk = 1, k0=1 πk,k0 = 1, results in:

p(x = k|y, θn) αn+1 = 1 k PK 0 n k0=1 p(x1 = k |y, θ ) PT p(x = k , x = k |y, θn) πn+1 = t=2 t 1 t+1 2 . (D.1.10) k1,k2 PK PT 0 n k0=1 t=2 p(xt = k1, xt+1 = k |y, θ )

n+1 To determine the M-step update µk , from eqns. D.1.8,D.1.9, first observe that:

T X X n p(xt−L...t|y, θ ) ln p(yt|xt−L...t, ω, Σ) t=1 xt−L...t T X X n  > −1  ∝ p(xt−L...t|y, θ ) ln |Σt−L...t| + (yt − µt−L...t) [Σt−L...t] (yt − µt−L...t) . t=1 xt−L...t (D.1.11)

249 Maximizing eqn. D.1.11 with respect to one specific instantiation of the state–sequence xt−L...t = k0 . . . kL gives:

PT p(x = k . . . k |y, θn)y µn+1 = t=1 t−L...t 0 L t , (D.1.12) k0...kL PT n t=1 p(xt−L...t = k0 . . . kL|y, θ ) and from eqn. D.1.6, we get µn+1 = P H− µn+1 . k k0...kL k,k0...kL k0...kL

Similarly, maximizing eqn. D.1.11 with respect to a specific Σk0...kL , gives

PT n n+1 n+1 > p(xt−L...t = k0 . . . kL|y, θ ) · (yt − µ )(yt − µ ) Σn+1 = t=1 k0...kL k0...kL , (D.1.13) k0...kL PT n t=1 p(xt−L...t = k0 . . . kL|y, θ ) where as before Σn+1 = P G− Σn+1 . A similar relationship applies for Σ . k k0...kL k,k0...kL k0...kL 

D.2 Forward Backward Recursions

This section explains the forward-backward recursions to compute the probabilities of the form pθ(n) (y, xt), pθ(n) (y, xt−1, xt) and pθ(n) (y, xt−L...t), needed in the M-step of the EM algorithm.

From the conditional independence structure implied by the model, we observe that:

X pθ(y, xt) = pθ(y, xt+1−L...t)

xt+1−L...t−1 X = pθ(y1...t, xt+1−L...t)pθ(yt+1...T |xt+1−L...t) (D.2.1)

xt+1−L...t−1

X pθ(y, xt,t+1) = pθ(y, xt+1−L...t−1, xt,t+1)

xt+1−L...t−1 X = pθ(y1...t, xt+1−L...t) · pθ(yt+1|xt+1−L...t+1)

xt+1−L...t−1

· pθ(xt+1|xt) · pθ(yt+2...T |xt+2−L...t+1), (D.2.2)

250 and

pθ(y, xt...t+L) =pθ(y1...t+L, xt...t+L) · pθ(yt+1+L...T |xt...t+L)

=pθ(y1...t−1+L, xt...t−1+L) · pθ(yt+L, xt...t+L) · pθ(xt...t+L)

· pθ(yt+1+L...T |xt+1...t+L), (D.2.3) where L is the length of the FIR filter h.

The forward recursion through this Markov chain is:

a(x1) = pθ(y1, x1)

= pθ(y1|x1)pθ(x1), (D.2.4)

a(x1...2) = pθ(y1...2, x1...2)

= pθ(y2|x1...2)pθ(y1|x1)pθ(x2|x1)pθ(x1)

= pθ(y2|x1...2)pθ(x2|x1) · a(x1). (D.2.5)

Similarly, continuing up to,

a(x1...L) = pθ(y1...L, x1...L)

= pθ(yL|x1...L)pθ(xL|xL−1) · a(x1...L−1).

Now, after we have at least L observations y

a(x2...L+1) = pθ(y1...L+1, x2...L+1) X = pθ(x1...L+1, y1...L+1)

x1

= pθ(yL+1|x1...L+1)pθ(xL+1|xL) · a(x1...L).

251 And similarly,

a(x3...L+2) = pθ(x3...L+2, y1...L+2) X = pθ(x2...L+2, y1...L+2)

x2 X = pθ(yL+2|x2...L+2)pθ(xL+2|xL+1) · a(x2...L+1),

x2 upto,

X a(xt+1−L...t) = pθ(xt−L...t, y1...t)

xt−L X = pθ(yt|xt−L...t)pθ(xt|xt−1) · a(xt−L...t−1). (D.2.6)

xt−L

The backward recursion for this chain is as follows:

b(xT −L...T −1) = pθ(yT |xT −L...T −1) X = pθ(yT |xT −L...T ),

xT

b(xT −1−L...T −2) = pθ(yT −1...T |xT −1−L...T −2) X = pθ(yT −1|xT −1−L...T −1)pθ(yT |xT −L...T −1)

xT −1 X = pθ(yT −1|xT −1−L...T −1)b(xT −L...T −1),

xT −1 and similarly,

b(xt+1−L...t) = pθ(yt+1...T |xt+1−L...t) X = pθ(yt+1|xt+1−L...t+1)b(xt+2−L...t+1). (D.2.7)

xt+1

252 Therefore, substituting in eqns. D.2.1 to D.2.3, the conditional probabilities become:

X pθ(y, xt) = a(xt+1−L...t)b(xt+1−L...t),

xt+1−L...t−1 X pθ(y, xt,t+1) = a(xt+1−L...t) · pθ(yt+1|xt+1−L...t+1)pθ(xt+1|xt) · b(xt+2−L...t),

xt+1−L...t−1

pθ(y, xt,t+L) = a(xt,t−1+L) · pθ(yt+L, xt,t+L) · pθ(xt,t+L) · b(xt+1...t+L). (D.2.8)

D.3 Marginalizing the HRF Filter h

The EM procedure so far determined θML conditioned on a specific HRF filter h. This

dependence is removed by marginalizing out h under a Laplace approximation of the pos-

terior distribution of θ as follows:

Under uninformative priors, the posterior density p(θ|y, h,K) ∝ p(y|θ, h,K) and θMAP =

θML. Then using a Laplace approximation around θML the posterior density is given by:

1  1  p(θ|Y, h,K) ≈ | ∇2|1/2 exp − (θ − θ )0∇2(θ − θ ) , (D.3.1) 2π 2 ML ML where −∇2 is the Hessian matrix of ln p(y|θ, h,K).

Then, the conditional expectation θ∗ independent of h is given by:

∗ θ =E[θ|y,K] = Eh [E[θ|y, h,K]] Z Z  = θp(θ|y, h,K)dθ p(h)dh h θ Z = θML(h)p(h)dh, (D.3.2) h

and is computed through Monte–Carlo integration by first sampling the parameter γ from

N (µγ, σγ), constructing h(γ), finding θML(h) and then averaging over all samples.

253 D.4 State–Sequence Estimation

In this section, we explain the procedure to find the most probable set of states x∗ = arg max ln pθ(y, x) given a set of model parameters θ and observations y.

Note the following recursive relationship:

max ln pθ(y, x) = max [ln pθ(yT |xT −L...T ) + ln pθ(y1...T −1, x1...T −1)] x x   = max ln pθ(yT |xT −L...T ) + max ln pθ(y1...T −1, x1...T −1) xT −L...T x1...T −1−L  = max ln pθ(YT |xT −L...T ) + max [ln pθ(YT −1|xT −1−L...T −1)+ xT −L...T xT −1−L  max ln pθ(y1...T −2, x1...T −2) x1...T −2−L . .

Therefore, if we define:

η1 = ln pθ(y1, x1) = ln pθ(y1|x1) + ln pθ(x1),

η2 = max ln pθ(y1,2, x1,2) = max [ln pθ(y2|x1,2) + ln pθ(x2|x1) + η1] , x1 x1 . .

ηt = max [ln pθ(yt, xt−L...t) + ln pθ(xt|xt−1) + ηt−1] , xt−1

ηt+1 = max [ln pθ(yt+1, xt+1−L...t+1) + ln pθ(xt+1|xt) + ηt] , xt

then it can be verified that maxx ln pθ(y, x) = maxxt−L...T ηT .

Let ϕt(xt...t+L) keep track of the state of xt−1 which is a maximum configuration for ηt,

∗ given xt...t+L. Then, the optimal configuration of states x for a particular θ are obtained

254 by backtracking as follows:

∗ xt−L...T = arg max ηT , xt−L...T ∗ ∗ xt−L−1 = ϕT −1(xt−L...T ), . .

∗ ∗ x1 = ϕL(x2...L). (D.4.1)

D.5 Estimation in Full vs. Reduced Models

The retention of the z layer becomes a problem when computing marginal densities of the form pθ(y, xt) required during the M–step.

In the model of Fig. 11.2, this can be expressed as:

X X pθ(y, xt) = pθ(y, xt−L...t) = pθ(y1...t, xt−L...t)pθ(yt+1...T |xt+1−L...t).

xt−L...t−1 xt−L...t−1

This term is evaluated using the forward–backward recursions (c.f. Appendix D.2) with computation of these marginals through Monte–Carlo integration of sequences of x L + 1 states long. Under the model of Fig. 11.1 this factorization would take the form:

X Z pθ(y, zt, xt) = pθ(y, zt−L...t, xt−L...t) z xt−L...t−1 t−L...t−1

Due to the dependency structure introduced by z, the evaluation of a term in the forward recursion would turn out to be:

X Z a(xt+1−L...t, zt+1−L...t) = pθ(y1...t|z1...t)pθ(z1...t, x1...t), z x1...t−L 1...t−L which would require Monte–Carlo integration over sequences O(t) states long.

255 D.6 Mutual Information

If X(1) and X(2) are the optimal state sequences for one fMRI session computed with respect to two different models M1 and M2, then the mutual information MI(ˆs) between the two models with respect to that fMRI session for experimental conditions st ∈ ˆs is:

MI(ˆs) = H(1)(ˆs) + H(2)(ˆs) − H(1),(2)(ˆs), (D.6.1) where H(1)(ˆs) is the empirical entropy of the states X(1) measured for only those t when st ∈ ˆs as:

PT δ(X(1) = k)δ(s ∈ ˆs) Pr(1)[k] = t=1 t t PT t=1 δ(st ∈ ˆs) K X H(1)(ˆs) = Pr(1)[k] ln Pr(1)[k], k=1 and similarly for H(2)(ˆs). Here Pr(1)[k] is the empirical probability of state k when the experimental variables take a value in ˆs for the fMRI data, given model 1.

The empirical joint entropy H(1),(2)(ˆs) between X(1) and X(2) for stimuli value in ˆs is equivalently defined as:

PT δ(X(1) = k )δ(X(2) = k )δ(s ∈ ˆs) Pr(1),(2)[k , k ] = t=1 t 1 t 2 t 1 2 PT t=1 δ(st ∈ ˆs) (1),(2) X X (1),(2) (1),(2) H (ˆs) = Pr [k1, k2] ln Pr [k1, k2].

k1 k2

256 APPENDIX E

PROOFS FOR SEMI–SUPERVISED STATE–SPACE MODEL

E.1 Proofs for the E-Step

This section contains proofs for the equations given in Section 12.2.1 of Chapter 12.

QT Proof for eqn. 12.2.4 Consider the variational density q(z, x) = t=1 qt(zt, xt). Then

 p (y, z, x) KL(q||p (y, z, x)) = − E log θ θ q q(z, x) " T # Y = − Eq log pθ(y, z, x) + log qt(zt, xt) t=1 X = − Eq [log pθ(y, z, x)] + Eq [log qt(zt, xt)] t =E EQ [log p (y, z, x)] qt t06=t qt0 θ X E E 0 0 0 + qt [log qt(zt, xt)] + qt0 [log qt (zt , xt )] t06=t h  n oi =E log exp EQ [log p (y, z, x)] + E [log q (z , x )] qt t06=t qt0 θ qt t t t X E 0 0 0 + qt0 [log qt (zt , xt )] t06=t  n o = KL q || exp EQ [log p (y, z, x)] t t06=t qt0 θ X E 0 0 0 + qt0 [log qt (zt , xt )] t06=t

257 Therefore, minimizing KL(q||pθ(y, z, x)) with respect to qt(zt, xt) gives

∗ n o q = exp EQ [log p (y, z, x)] t t06=t qt0 θ (E.1.1)

Therefore, the E-step involves iteratively evaluating:

EQ q∗ [ln pθ(y, z, x)] =EQ q∗ [ln pθ(x) + ln pθ(z|x) + ln pθ(y|z)] . (E.1.2) t06=t t0 t06=t t0

.

Combining all the terms not dependent on zt, xt into a constant term, the first term on the

RHS of eqn. E.1.2 becomes,  EQ q∗ [ln pθ(x)] =EQ q∗ ln pθ(xt|xt−1) t06=t t0 t06=t t0  + ln pθ(xt+1|xt) + const  X X ∗ ∗ = qt−1(xt−1)qt+1(xt+1) ln pθ(xt|xt−1) xt−1 xt+1  + ln pθ(xt+1|xt) + const

K X ∗ = qt−1(xt−1) ln pθ(xt|xt−1) xt−1=1 K X ∗ + qt+1(xt+1) ln pθ(xt+1|xt) + const. (E.1.3) xt+1=1

Doing the same for the second term ln pθ(z|x), we get:

EQ q∗ [ln pθ(z|x)] =EQ q∗ [ln pθ(zt|xt)] + const t06=t t0 t06=t t0 K Xh 1 > −1 i =EQ q∗ − (zt − µk) Σ (zt − µk) δ[x =k] + const t06=t t0 2 k t k=1 K 1 Xh i = − (z − µ )>Σ−1(z − µ ) δ + const. (E.1.4) 2 t k k t k [xt=k] k=1

258 And, finally grouping the terms of EQ q∗ [ln pθ(y|z)] not dependent on zt, xt into a con- t06=t t0 stant term, gives:

"t+L # X EQ q∗ [ln pθ(y|z)] =EQ q∗ ln pθ(yi|zi−L...z ) + const t06=t t0 t06=t t0 t i=t t+L L !> L ! X 1 X −1 X EQ ∗ = 0 q  − yi − Hlzi−l Σ yi − Hlzi−l  t 6=t t0 2 i=t l=0 l=0 + const

PL > −1 PL Expanding out the terms in (yi − l=0 Hlzi−l) Σ (yi − l=0 Hlzi−l) and retaining only those dependent on zt gives:

" L ! 1 X 2 > −1 EQ ∗ EQ ∗ 0 q [ln pθ(y|z)] = − 0 q Hl zt Σ zt t 6=t t0 2 t 6=t t0 l=0 L ! L ! X > −1 > −1 X − Hl zt Σ yt − yt Σ Hl zt l=0 l=0 L L # X X > −1 + 2 HlHmzt Σ zt+l−m + const l=0 m=0,m6=l " L ! L ! 1 X X = − H2 z>Σ−1z − z>Σ−1 H y 2 l t  t t  l t+l l=0 l=0 L !> X −1 − Hlyt+l Σ zt l=0 L L !# −1 X X + 2z Σ H H E ∗ [z ] t  l m qt+l−m t+l−m l=0 m=0,m6=l + const. (E.1.5)

259 Substituting eqns. E.1.3, E.1.4 and E.1.5 in eqn. E.1.2 we get:

K X ∗ EQ q∗ [ln pθ(z, x, y)] = q (xt−1) ln pθ(xt|xt−1) t06=t t0 t−1 xt−1=1 K X ∗ + qt+1(xt+1) ln pθ(xt+1|xt) xt+1=1 K " # 1 X − δ (z − µ )>Σ−1(z − µ ) 2 [xt=k] t k k t k k=1 " L ! L ! 1 X X − H2 z>Σ−1z − z>Σ−1 H y 2 l t  t t  l t+l l=0 l=0 L !> X −1 − Hlyt+l Σ zt l=0 L L ! −1 X X + z Σ H H E ∗ [z ] t  l m qt+l−m t+l−m l=0 m=0,m6=l L L ! # X X −1 + H H E ∗ [z ] Σ z l m qt+l−m t+l−m  t l=0 m=0,m6=l + const. (E.1.6)

In order to simplify the expression of the variational density obtained from the exponenti- ation of eqn. E.1.6, we shall introduce the following symbols:

K Xh ∗ ∗ i α ∗ = q (x = k) ln p (x |k) + q (x = k) ln p (k|x ) qt (xt) t−1 t−1 θ t t+1 t+1 θ t k=1 L K −1 hX  −1 X −1i Σ ∗ = H · H Σ + [δ Σ , qt (zt|xt) l l  [xt=k] k l=0 k=1 " L L L ! −1 X X X µ ∗ =Σ ∗ Σ H · y − H · H E ∗ [z ] qt (zt|xt) qt (zt|xt)  l t+l l m qt+l−m t+l−m l=0 l=0 m=0,m6=l K # X −1  + δ[xt=k]Σk µk (E.1.7) k=1

260 Substituting these symbols and completing the square of eqn. E.1.6, we get:

1 ∗ > −1 ∗ EQ ∗ ∗ 0 q [ln pθ(z, x, y)] =αq (xt) − (zt − µq∗(z |x )) Σ ∗ (zt − µq∗(z |x )) + const. t 6=t t0 t 2 t t t qt (zt|xt) t t t (E.1.8)

Therefore, after determining the normalization factor through inspection, the mean–field approximation becomes a product of a multinomial and normal density (i.e. a mixture of

∗ ∗ ∗ Gaussians) as qt (zt, xt) = qt (zt|xt)qt (xt), where:

exp{α ∗ } ∗ qt (xt=k) ∗  q (xt = k) = , and q (zt|xt) = N µ ∗ , Σ ∗ . t PK t qt (zt|xt) qt (zt|xt) 0 exp{α ∗ 0 } k =1 qt (xt=k ) (E.1.9)

E.2 Proofs for the M-Step

This section contains the proofs for the M-Step equations of Section 12.2.2 in the main

text.

E.2.1 Estimating State Transition Parameters w and Missing Stimu- lus u

Grouping all the terms in eqn. 12.1.2 that don’t depend on w and u, the free-energy term

of eqn. 12.2.1 becomes:

(n) E X X 2 Q(q , θ) = q(n)(z,x) [ln p(x|w, s, u)] − λw/2 wi,j + const. (E.2.1) i j

261 By substituting and marginalizing out z the first term on the RHS of eqn. E.2.1, becomes,

E X (n) X q(n)(z,x) [ln p(x|w, s, u)] = q (x) ln p(xt|xt−1, wxt,xt−1 , st) x t∈T \U X (n) X + q (x) ln p(xt|xt−1, wxt,xt−1 , ut) x t∈U X X X (n) (n) = qt−1(xt−1)qt (xt) ln πxt−1,xt (st) t∈T \U xt−1 xt X X X (n) (n) + qt−1(xt−1)qt (xt) ln πxt−1,xt (ut), t∈U xt−1 xt

where the transition probability p(xt|xt−1, wxt,xt−1 , st) is denoted by πxt−1,xt (st).

The gradient of eqn. E.2.1, with respect to ωj, wi,j is:

∂Q X  (n) (n)  = qt−1(i)qt (j) − πi,j(st) · st ∂wi,j t∈T \U X (n) (n) (n)  (n) + qt−1(i)qt (j) − πi,j(ut ) · ut − λwwi,j, t∈U K ∂Q Xh X (n) (n) = (qt−1(i)qt (j) − πi,j(st)) · st ∂ωj i=1 t∈T \U X (n) (n) (n) (n)i + (qt−1(i)qt (j) − πi,j(ut )) · ut (E.2.2) t∈U

262 The second-order derivative terms are:

2 ∂ Q X >   = − stst πi,j(st)δ[i0,j0=i,j] − πi,j(st)πi0,j0 (st)δ[i0=i] ∂wi,j∂wi0,j0 t∈T \U X (n) (n)† h (n) (n) (n) i − ut ut πi,j(ut )δ[i0,j0=i,j] − πi,j(ut )πi0,j0 (ut )δ[i0=i] t∈U

− λδ[i0,j0=i,j] 2 ∂ Q X >   = − stst πi0,j0 (st)δ[j0=j] − πi0,j0 (st)πi0,j(st) ∂ωj∂wi0,j0 t∈T \U X (n) (n)† h (n) (n) (n) i − ut ut πi0,j0 (ut )δ[j0=j] − πi0,j0 (ut )πi0,j(ut ) t∈U 2 " K # ∂ Q X > X  = − stst πi,j0 (st)δ[j0=j] − πi,j0 (st)πi,j(st) ∂ωj∂ωj0 t∈T \U i=1 " K # X (n) (n)† X  (n) (n) (n)  − ut ut πi,j0 (ut )δ[j0=j] − πi,j0 (ut )πi,j(ut ) (E.2.3) t∈U i=1

Therefore in matrix notation, the Hessian of Q with respect to w is:   2 X Π(st) Π(st)P > ∇wQ = −   ⊗ stst > > t∈T \U P Π(st)P Π(st)P   (n) (n) X Π(ut ) Π(ut )P (n) (n)† −   ⊗ ut ut > (n) > (n) t∈U P Π(ut )P Π(ut )P   IK2 0K2×K − λw   (E.2.4) 0K×K2 0K×K where

K X  > 1 P i , diag(π(st)) − (diag(ek) ⊗ IK ) π(st)π(st) and P , K ⊗ IK k=1

Since the matrix Π is positive-definite, using the rules of compositions of positive-definite

2 matrices [77] it can be easily verified that −∇wQ is positive-definite.

263 5.2.1.1 Bound Optimization

If we define w(n) as the EM estimate of w after n steps, the n + 1 M-step will in-

volve maximizing Q(q(n)|θ) with respect to w where q(n) is determined by w(n). For notational simplicity in the following discussion, we rewrite this optimization problem

(n) (n) maxw Q(q (w |w) as maxw Q(w).

The bound optimization method [25, 116] replaces this computationally complex optimiza-

tion problem with a simple one by iteratively maximizing a surrogate function w(n0+1,n) =

0 (n0,n) 0 arg maxw Q (w|w ). We have introduced a new ordinal n that indexes the iterative

maximization of Q0 with respect to w, and is initialized as w(0,n) ← w(n). This surrogate

is a cost function selected such that Q(w(n0,n)) − Q0(w|w(n0,n)) attains its minimum at

w = w(n0,n).

Therefore, an increase in the value of Q0 results in an increase in the value of Q which is

seen as follows:

Q(w(n+1)) =Q(w(n+1)) − Q0(w(n+1)|w(n0,n)) + Q0(w(n+1)|w(n0,n))

≥Q(w(n)) − Q0(w(n)|w(n0,n)) + Q0(w(n+1)|w(n0,n))

≥Q(w(n)) − Q0(w(n)|w(n0,n)) + Q0(w(n)|w(n0,n))

= Q(w(n))

The first inequality is a result of the condition that Q(w(n0,n)) − Q0(w|w(n0,n)) attains its

minimum at w = w(n0,n). The second inequality follows from the fact that Q0(w(n+1)|w(n0,n)) ≥

Q0(w(n)|w(n0,n)).

264 One such surrogate function can be constructed by finding a negative-definite matrix B

2 such that ∇wQ − B is also negative-definite for all w [26]. This is validated by observing:

1 0 0 Q(w) ≥ Q(w0) + (w − w0)>∇ Q0 + (w − w(n ,n))>B(w − w(n ,n)), w 2

2 as a result of B  ∇wQ.

Then, defining

0 (n0,n) (n0,n) (n0,n) > (n0,n) Q (w|w ) ,Q(w ) + (w − w ) ∇wQ(w )

1 0 0 + (w − w(n ,n))>B(w − w(n ,n)) ≤ Q(w) 2

we have that Q(w(n0,n)) − Q0(w|w(n0,n)) attains it minimum at w = w(n0,n). It can be

1  1 1>  shown [26] that if A , − 2 IK2 − K K , then Π(st) − A is negative-definite, for all

2 st. Similarly for Π(ut). Therefore, the Hessian ∇wQ is lower bounded by the constant

negative-definite matrix       AAP  X > X (n) (n)† IK2 0K2×K B ,   ⊗ stst + ut ut − λw   > > P A P AP t∈T \U t∈U  0K×K2 0K×K (E.2.5)

E.2.2 Estimating Emission Parameters ϑ

Grouping all the terms of eqn. 12.1.2 in the main text that don’t depend on ϑ = {(µ1, Σ1) ... (µK , ΣK )} into the constant factor, the free-energy term (c.f. eqn. 12.2.1) becomes:

(n) E Q(q , θ) = q(n)(z,x) [ln p(z|x, ϑ)] + const T 1 X  > −1  E (n) = − ln |Σ(σxt )| + (zt − µxt ) Σxt (zt − µxt ) + const. 2 qt (zt,xt) t=1 (E.2.6)

265 Therefore, maximizing eqn. E.2.6 with respect to µk gives:

T T (n+1) 1 X E 1 X (n) µk = (n) [zt] = qt (xt = k)µ (n) , (E.2.7) T qt (zt,xt=k) T qt (zt|xt=k) t=1 t=1

The covariance parameters Σk for the emission distribution of of state k are optimally estimated by:

T 1 X  > Σk = E (n) (zt − µk)(zt − µk) T qt (zt,xt=k) t=1 T  1 X (n) > = qt (xt = k) Σq(n)(z |x =k) + µq(n)(z |x =k) · µ (n) T t t t t t t qt (zt|xt=k) t=1 > (n+1) (n+1) > − µq(n)(z |x =k) · µk − µk · µ (n) t t t qt (zt|xt=k)  (n+1) (n+1)> + µk · µk . (E.2.8)

266 E.2.3 Estimating Hemodynamic and Noise Parameters h, Σ

To estimate hemodynamic parameter h[d] and noise parameters Σ, all the terms in eqn. 12.1.2

that don’t depend on these parameters are grouped into the constant factor, yielding:

(n) Q(q , θ) =EQ (n) [ln p(y|z, h, Σ)] + ln p(h) + const t qt T X = E (n) [ln p(yt|zt−L...t, h, Σ)] + ln p(h) + const qt−L−1...t t=1 T  L !> L ! 1 1 X E X −1 X = q(n) − ln |Σ| − yt − Hl · zt−l Σ yt − Hl · zt−l  t−L−1...t 2 2 t=1 l=0 l=0 D 1 X > − (h[d] − µ ) Σ−1 (h[d] − µ ) + const 2 h h h d=1 D T 1 X = − ln |Σ | − (h[d] − µ )>Σ−1(h[d] − µ ) 2  2 h h h d=1 T " D D L !> X E 1 X X X + q(n) − yt[d1] − hl[d1] · zt−l[d1] t−L−1...t 2 t=1 d1=1 d2=1 l=0 L !# −1 X Σ [d1, d2] yt[d2] − hl[d2] · zt−l[d2] + const. l=0 (E.2.9)

> > > If we group the time–points zt ... zt−L−1 to define the variable z˜t , (zt ... zt−L−1) , then

D T 1 X Q(q(n), θ) = const − ln |Σ | − (h[d] − µ )>Σ−1(h[d] − µ ) 2  2 h h h d=1 T " 1 X E − q(n) 2 t−L−1...t t=1 D D X X >  −1 >  yt[d1] − h[d1] z˜t[d1] Σ [d1, d2] yt[d2] − h[d2] z˜t[d2] ] d1=1 d2=1 # . (E.2.10)

267 Differentiating wrt. h[d], the gradient is:

T " D # ∂Q X X 0 0 > 0  −1 0 > > −1 E (n) ˜ ˜ = q yt[d ] − h[d ] zt[d ] Σ [d , d]zt[d] − (h[d] − µh) Σh ∂h[d] t−L−1...t t=1 d0=1 T D X X −1 0 0 E = Σ [d , d]yt[d ] (n) [z˜t[d]] qt−L−1...t t=1 d0=1 T D X X −1 0 E  0 > 0 −1 − Σ [d , d] (n) z˜t[d ]z˜t[d] h[d ] − Σh (h[d] − µh), (E.2.11) qt−L−1...t t=1 d0=1 where the terms   ν (n) [d] qt  .  E (n) [z [d]] =  .  , q et  .  t   ν (n) [d] qt−L−1 and   0 Λ (n) [d , d] ... 0 qt  0 >  . . .  E (n) z˜ [d ]z˜ [d] =  . . .  , q t t  . . .  t−L−1...t   0 0 ... Λ (n) [d , d] qt−L−1...t

(n) represent the marginal first and second moments of z˜t under the variational density qt−L−1...t. E E  > Each individual marginal moment ν (n) (n) [zt] and Λ (n) (n) ztzt are computed qt , qt qt , qt as follows:

K X (n) ν (n) = qt (xt)µ (n) , qt qt (zt|xt) xt=1 and

K X (n) h > i Λq(n) = qt (xt) Σq(n)(z |x ) + µq(n)(z |x )µ (n) . (E.2.12) t t t t t t t qt (zt|xt) xt=1

As before, the terms Σ (n) and µ (n) are the mean and variance of the variational qt (zt|xt) qt (zt|xt) (n) density qt (zt|xt) defined in eqn. E.1.7.

268 (n+1) In order to estimate the noise variance Σ , differentiate eqn. E.2.9 w.r.t. Σ to get:

T " L ! L !># 1 (n+1) X E X (n+1) X (n+1) Σ = q(n) yt − Hl · zt−l yt − Hl · zt−l T t−L...t t=1 l=0 l=0 (E.2.13)

, Expanding out, we see that:

T " L L (n+1) 1 X 0 X X (n+1) (n+1) > Σ = ytyt + 2 Hl Hm νq(n) ν (n) T t qt+l−m t=1 l=0 m=0,m6=l L L L ! # X (n+1) > X (n+1) > X (n+1)2 − yt Hl · ν (n) − Hl · νq(n) yt + Hl Λq(n) , qt−l t−l t l=0 l=0 l=0 (E.2.14)

(n) where ν (n) and Λ (n) are the marginal moments of zt under the variational distribution qt qt qt defined in eqn. E.2.12.

E.3 Estimating the Optimal State–Sequence x∗

This section contains the proofs for the E-steps and M-steps in the EM algorithm for estimating the optimal state–sequence in Section 12.3 of Chapter 12.

∗ The factorized density qt (zt) is obtained as follows:

∗ (n)  (n)  ln q (zt) =EQ q∗ [ln pθ(y, z|x )] = EQ q∗ ln pθ(y|z) + ln pθ(z|x ) t t06=t t0 t06=t t0 T L !> L ! X 1 X −1 X EQ ∗ = 0 q − yt − Hlzt−l Σ yt − Hlzt−l + const. t 6=t t0 2 t=1 l=0 l=0

269 By including all terms not dependent on zt into the constant term, rearranging the summa-

∗ q (z ) ∼ N (µ ∗ , Σ ∗ ) tions and completing the square, it follows that t t qt qt , where, L −1 hX  −1 −1 i Σ ∗ = H · H Σ + Σ , qt l l  (n) xt l=0 and " L L L ! # −1 X X X −1 µ ∗ =Σ ∗ Σ H · y − H · H µ ∗ + Σ µ (n) . qt qt  l t+l l m qt+l−m (n) x xt t l=0 l=0 m=0,m6=l (E.3.1)

∗ Therefore, the mean and variance of the variational distribution qt has the same form as

∗ (n) (n) qt (zt|xt = xt ) from eqn. E.1.7, with xt set to the value of xt .

(n) (n) (n) The joint density pθ(y, x ) used in the termination criteria is factorized as pθ(y|x )pθ(x ),

(n) where the density pθ(y|x ) is normal, with parameters: L E (n) X [Yt|x1...T ] = Hlµ (n) , xt−l l=0 L V (n) X 2 ar[Yt|x1...T ] =Σ + Hl Σ (n) , xt−l l=0   0 if τ ≥ L Cov[Y ,Y |x(n) ] = , t t+τ 1...T PL−τ−1  l=0 HlHl+τ Σ (n) if 0 < τ < L xt−l T (n) Y (n) (n) and pθ(x ) = pθ(xt |xt−1). t=1

E.4 Error–rate and Mutual Information

The prediction error ERRpredict(ˆs) given a specific range of values sˆ of the experimental variables st ∈ sˆ is defined as: X ERRpredict(ˆs) , ||ut − st||2 δ(st ∈ sˆ). (E.4.1) t∈U

270 To measure the effect of an interval valued parameter, (e.g. product–size), its range is quan-

tized into “high” and “low”, and then the effect defined as the ERRpredict(ˆs) for “high”

product size minus the ERRpredict(ˆs) for low product–size.

If X(1) and X(2) are the optimal state sequences for one fMRI session computed with

respect to two different models M1 = {Φ1, θ1} and M2 = {Φ2, θ2}, then the mutual infor-

mation MI(ˆs) between the two models with respect to that fMRI session for experimental

conditions st ∈ ˆs is:

MI(ˆs) = H(1)(ˆs) + H(2)(ˆs) − H(1),(2)(ˆs), (E.4.2)

where H(1)(ˆs) is the empirical entropy of the states X(1) measured for only those t when

st ∈ ˆs as:

PT δ(X(1) = k)δ(s ∈ ˆs) Pr(1)[k] = t=1 t t PT t=1 δ(st ∈ ˆs) K X H(1)(ˆs) = Pr(1)[k] ln Pr(1)[k], k=1

and similarly for H(2)(ˆs). Here Pr(1)[k] is the empirical probability of state k when the

experimental variables take a value in ˆs for the fMRI data, given model 1.

The empirical joint entropy H(1),(2)(ˆs) between X(1) and X(2) for stimuli value in ˆs is

equivalently defined as:

PT δ(X(1) = k )δ(X(2) = k )δ(s ∈ ˆs) Pr(1),(2)[k , k ] = t=1 t 1 t 2 t 1 2 PT t=1 δ(st ∈ ˆs) (1),(2) X X (1),(2) (1),(2) H (ˆs) = Pr [k1, k2] ln Pr [k1, k2].

k1 k2

271 REFERENCES

[1] Achard, S., Salvador, R., Whitcher, B., Suckling, J., and Bullmore, E. (2006). A

resilient, low-frequency, small-world human brain functional network with highly con-

nected association cortical hubs. Neurosci, 26(1):63–72.

[2] Afacan, O., Brooks, D. H., Janoos, F., Hoge, W. S., and Morocz, I. A. (2010). Multi-

shot high-speed 3D-EPI fMRI using GRAPPA and UNFOLD. In Proc the 16th Annual

Meeting of the Organization for Hum Brain Map (OHBM).

[3] Aigner, W., Miksch, S., Muller,¨ W., Schumann, H., and Tominski, C. (2007). Visualiz-

ing time-oriented data-a systematic view. Comput. Graph., 31(3):401–409.

[4] Alexander, M., Baumgartner, R., Windischberger, C., Moser, E., and Somorjai, R.

(2000). Wavelet domain de-noising of time-courses in MR image sequences. Mag Res

Imag, 18(9):1129–1134.

[5] Amanatides, A. and Woo, A. (1987). A fast voxel traversal algorithm for ray tracing.

Eurographics ’87. Proc European Comp Graphics Conference Exhibition, pages 3–10.

[6] Ashburner, J. (2007). A fast diffeomorphic image registration algorithm. Neuroimage,

38(1):95–113.

272 [7] Auer, D. P. (2008). Spontaneous low-frequency blood oxygenation level-dependent

fluctuations and functional connectivity analysis of the ’resting’ brain. Mag Res Imag,

26(7):1055–1064.

[8] Baillet, S., Mosher, J., and Leahy, R. (2001). Electromagnetic brain mapping. Sig

Proc. Magazine, IEEE, 18(6):14 –30.

[9] Baringhaus, L. and Franz, C. (2004). On a new multivariate two-sample test. J. Multi-

var. Anal., 88:190–206.

[10] Bartels, A. and Zeki, S. (2005). The chronoarchitecture of the cerebral cortex. Philo-

sophical Transactions of Royal Soc. B: Biological , 360(1456):733–750.

[11] Baudelet, C. and Gallez, B. (2003). Cluster analysis of bold fMRI time series in

tumors to study the heterogeneity of hemodynamic response to treatment. Mag Res in

Medicine, 49(6):985–990.

[12] Baumgartner, R., Ryner, L., Richter, W., Summers, R., Jarmasz, M., and Somorjai, R.

(2000). Comparison of two exploratory data analysis methods for fMRI: fuzzy clustering

vs. principal component analysis. Mag Res Imag, 18(1):89–94.

[13] Baune, A., Sommer, F. T., Erb, M., Wildgruber, D., Kardatzki, B., Palm, G., and

Grodd, W. (1999). Dynamical cluster analysis of cortical fMRI activation. Neuroimage,

9(5):477–489.

[14] Bazargani, N., Nosratinia, A., Gopinath, K., and Briggs, R. (2007). FMRI baseline

drift estimation method by mdl principle. Biomedical Imag: From Nano to Macro, 2007.

ISBI 2007. 4th IEEE Int Symposium on, pages 472–475.

273 [15] Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden

markov model. In Adv Neural Info Proc Sys (NIPS), pages 29–245.

[16] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction

and data representation. Neural Comp, 15(6):1373–1396.

[17] Bellec, P., Rosa-Neto, P., Lyttelton, O. C., Benali, H., and Evans, A. C. (2010).

Multi-level bootstrap analysis of stable clusters in resting-state fMRI. Neuroimage,

51(3):1126–1139.

[18] Bengio, Y. and Frasconi, P. (1996). Input-output hmms for sequence processing.

Neural Networks, IEEE Trans, 7(5):1231 –1249.

[19] Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory (Wiley Series in Prob-

ability and Statistics). Wiley, 1 edition.

[20] Besserve, M., Jerbi, K., Laurent, F., Baillet, S., Martinerie, J., and Garnero, L. (2007).

Classification methods for ongoing EEG and MEG signals. Biol Res, 40(4):415–437.

[21] Binder, J. R., McKiernan, K. A., Parsons, M. E., Westbury, C. F., Possing, E. T.,

Kaufman, J. N., and Buchanan, L. (2003). Neural correlates of lexical access during

visual word recognition. J Cogn Neurosci, 15(3):372–393.

[22] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer, 1st ed.

2006. corr. 2nd printing edition.

[23] Biswal, B. B. and Ulmer, J. L. (1999). Blind source separation of multiple signal

sources of fMRI data sets using independent component analysis. Comput Assist To-

mogr, 23(2):265–271.

274 [24] Blaschko, M., Shelton, J., and Bartels, A. (2009). Augmenting feature-driven fMRI

analyses: Semi-supervised learning and resting state activity. In Adv Neural Info Proc

Sys (NIPS) 22, pages 126–134.

[25]B ohning,¨ D. (1992). Multinomial logistic regression algorithm. Ann Inst of Stat Math,

44(1):197–200.

[26]B ohning,¨ D. and Lindsay, B. G. (1988). Monotonicity of quadratic-approximation

algorithms. Ann Inst of Stat Math, 40(4):641–663.

[27] Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence

of iterative simulations. J Computational Graphical Stat, 7(4):434–455.

[28] Buckner, R. L., Andrews-Hanna, J. R., and Schacter, D. L. (2008). The brain’s default

network: anatomy, function, and relevance to disease. Ann NY Acad Sci., 1124:1–38.

[29] Bullmore, E., Long, C., Suckling, J., J.Fadili, Calvert, G., Zelaya, F., T.A.Carpenter,

and Brammer, M. (2001). Colored noise and computational inference in neurophysio-

logical (fMRI) time series analysis: Resampling methods in time and wavelet domains.

Hum Brain Map, 12(2):61–78.

[30] Bunge, S. A. and Wright, S. B. (2007). Neurodevelopmental changes in working

memory and cognitive control. Curr Op Neurobio, 17(2):243–250.

[31] Butterworth, B. (2005). The development of arithmetical abilities. J Child Psychol

Psychiatry, 46(1):3–18.

[32] Buxton, R. B., Wong, E. C., and Frank, L. R. (1998). Dynamics of blood flow and

oxygenation changes during brain activation: the balloon model. Mag Res in Medicine,

39(6):855–864.

275 [33] Buzsaki, G. (2006). Rhythms of the Brain. Oxford Univ Press, USA.

[34] Calhoun, V. D. and Adali, T. (2006). Unmixing fMRI with independent component

analysis. IEEE Eng Med Biol Mag, 25(2):79–90.

[35] Calhoun, V. D., Liu, J., and Adali, T. (2009). A review of group ica for fMRI data and

ica for joint inference of imaging, genetic, and erp data. Neuroimage, 45(1 Suppl):S163–

S172.

[36] Cao, J. and Worsley, K. (1999). The geometry of correlation fields with an application

to functional connectivity of the brain. Ann. Appl. Probab., 9(4):1021–1057.

[37] Cecchi, G., Rish, I., Thyreau, B., Thirion, B., Plaze, M., Paillere-Martinot, M.-L.,

Martelli, C., Martinot, J.-L., and Poline, J.-B. (2009). Discriminative network models

of schizophrenia. In Adv Neural Info Proc Sys (NIPS) 22, pages 252–260.

[38] Chen, H., Yao, D., and Liu, Z. (2005). A comparison of gamma and gaussian dynamic

convolution models of the fMRI bold response. Mag Res Imag, 23(1):83–88.

[39] Cheng, H. and Li, Y. (2008). Respiratory noise correction using phase information.

BioMed Eng Informatics, 2008. BMEI 2008. Int Conf, 2:733–736.

[40] Chung, F. (1997). Lectures on Spectral Graph Theory. CBMS Reg Conf Series Math.

Am Math Soc.

[41] Coifman, R. and Lafon, S. (2006). Diffusion maps. App Computational Harmonic

Ana, special issue on diffusion maps wavelets, 21:5–30.

[42] Coifman, R. R. and Maggioni, M. (2006). Diffusion wavelets. App Comp Harm Ana,

21(1):53 – 94.

276 [43] Conroy, B., Singer, B., Haxby, J., and Ramadge, P. (2009). fMRI-based inter-subject

cortical alignment using functional connectivity. In Adv Neural Info Proc Sys (NIPS).

[44] Cosman, E., III, J. F., and III, W. W. (2004). Exact map activity detection in fMRI

using a glm with an ising spatial prior. MICCAI 2004, pages 703?–710.

[45] Cox, R. (1996). Software for analysis and visualization of functional magnetic reso-

nance neuroimages. Comp. Biomed. Res., 29:162–17.

[46] Crone, E. A., Wendelken, C., Donohue, S., van Leijenhorst, L., and Bunge, S. A.

(2006). Neurocognitive development of the ability to manipulate information in working

memory. Proc Nat Acad Sci., USA, 103(24):9315–9320.

[47] Dehaene, S. (1992). Varieties of numerical abilities. Cognition, 44(1-2):1–42.

[48] Dehaene, S., Cohen, L., Sigman, M., and Vinckier, F. (2005). The neural code for

written words: a proposal. Trends Cogn Sci, 9(7):335–341.

[49] Dehaene, S., Piazza, M., Pinel, P., and Cohen, L. (2003). Three parietal circuits for

number processing. J Cog Neuropsycho, 20:487–506.

[50] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from

incomplete data via the EM algorithm. J Royal Stat Soc.. Series B (Methodological),

39(1):1–38.

[51] Dimitriadou, E., Barth, M., Windischberger, C., Hornik, K., and Moser, E. (2004). A

quantitative comparison of functional mri cluster analysis. Artificial Intel in Medicine,

31(1):57 – 71.

277 [52] Donoho, D. L. and Johnstone, J. M. (1994). Ideal spatial adaptation by wavelet shrink-

age. Biometrika, 81(3):425–455.

[53] Duan, R., Man, H., Jiang, W., and Liu, W.-C. (2005). Activation detection on fMRI

time series using hidden Markov model. In Neural Eng, IEEE EMBS Conf, pages 510

–513.

[54] Edelman, S., Grill-Spector, K., Kushnir, T., and Malach, R. (1999). Towards direct

visualization of the internal shape representation space by fMRI. Psychobio, 26:309–

321.

[55] Edin, F., Macoveanu, J., Olesen, P., Tegnr, J., and Klingberg, T. (2007). Stronger

synaptic connectivity as a mechanism behind development of working memory-related

brain activity during childhood. J Cogn Neurosci, 19(5):750–760.

[56] Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman &

Hall, New York.

[57] Eklund, A., Ohlsson, H., Andersson, M., Rydell, J., Ynnerman, A., and Knutsson, H.

(2009). Using real-time fMRI to control a dynamical system by brain activity classifi-

cation. Med Image Comput Comput Assist Interv, 12(Pt 1):1000–1008.

[58] Fadili, M. J. and Bullmore, E. T. (2002). Wavelet-generalized least squares: a new

blu estimator of linear regression models with 1/f errors. Neuroimage, 15(1):217–232.

[59] Faisan, S., Thoraval, L., Armspach, J.-P., and Heitz, F. (2007). Hidden Markov mul-

tiple event sequence models: A paradigm for the spatio-temporal analysis of fMRI data.

Med Image Ana, 11(1):1–20.

278 [60] Fehr, T., Code, C., and Herrmann, M. (2007). Common brain regions underlying

different arithmetic operations as revealed by conjunct fMRI-bold activation. Brain Res,

1172:93–102.

[61] Formisano, E. and Goebel, R. (2003). Tracking cognitive processes with functional

MRI mental chronometry. Curr Op Neurobio, 13(2):174–181.

[62] Fox, M. D. and Raichle, M. E. (2007). Spontaneous fluctuations in brain activity

observed with functional magnetic resonance imaging. Nature Rev: Neurosci, 8(9):700–

711.

[63] Frackowiak, R., Friston, K., Frith, C., Dolan, R., Price, C., Zeki, S., Ashburner, J.,

and Penny, W. (2003). Human Brain Function. Acad Press, 2nd edition.

[64] Friman, O., Borga, M., Lundberg, P., and Knutsson, H. (2003). Adaptive analysis of

fMRI data. Neuroimage, 19(3):837–845.

[65] Friston, K., Chu, C., Mouro-Miranda, J., Hulme, O., Rees, G., Penny, W., and Ash-

burner, J. (2008). Bayesian decoding of brain images. Neuroimage, 39(1):181–205.

[66] Friston, K., Holmes, A., Worsley, K., Poline, J., Frith, C., and Frackowiak, R. (1995a).

Statistical parametric maps in functional imaging: A general linear approach. Hum

Brain Map, 2(4):189–210.

[67] Friston, K., Phillips, J., Chawla, D., and Bchel, C. (1999). Revealing interactions

among brain systems with nonlinear pca. Hum Brain Map, 8(2-3):92–97.

[68] Friston, K. J. (1994). Functional and effective connectivity in neuroimaging: a syn-

thesis. Hum Brain Map, 2:56–78.

279 [69] Friston, K. J. (2009). Modalities, modes, and models in functional neuroimaging.

Science, 326(5951):399–403.

[70] Friston, K. J., Frith, C. D., Frackowiak, R. S., and Turner, R. (1995b). Characterizing

dynamic brain responses with fMRI: a multivariate approach. Neuroimage, 2(2):166–

172.

[71] Friston, K. J., Frith, C. D., Liddle, P. F., and Frackowiak, R. S. (1993). Functional

connectivity: the principal-component analysis of large (pet) data sets. Cerebral Blood

Flow Metabolism, 13(1):5–14.

[72] Friston, K. J., Glaser, D. E., Henson, R. N. A., Kiebel, S., Phillips, C., and Ashburner,

J. (2002). Classical and bayesian inference in neuroimaging: applications. Neuroimage,

16(2):484–512.

[73] Friston, K. J., Harrison, L., and Penny, W. (2003). Dynamic causal modelling. Neu-

roimage, 19(4):1273–1302.

[74] Fuster, J. M. (2000). The module: Crisis of a paradigm. Neuron, 26(1):51 – 53.

[75] Geary, D. C., Hamson, C. O., and Hoard, M. K. (2000). Numerical and arithmetical

cognition: a longitudinal study of process and concept deficits in children with learning

disability. Exp Child Psychol, 77(3):236–263.

[76] Genovese, C. R., Lazar, N. A., and Nichols, T. (2002). Thresholding of statistical

maps in functional neuroimaging using the false discovery rate. Neuroimage, 15(4):870–

878.

[77] Genton, M. (2000). Classes of kernels for machine learning: A statistics perspective.

J Mach Learning Research, 2:299–312.

280 [78] Georgopoulos, A. P., Taira, M., and Lukashin, A. (1993). Cognitive neurophysiology

of the motor cortex. Science, 260(5104):47–52.

[79] Ghebreab, S. and Smeulders, A. (2010). Identifying distributed and overlapping clus-

ters of hemodynamic synchrony in fMRI data sets. Pat Ana & App, pages 1–18.

[80] Goebel, R., Roebroeck, A., Kim, D.-S., and Formisano, E. (2003). Investigating

directed cortical interactions in time-resolved fMRI data using vector autoregressive

modeling and granger causality mapping. Mag Res Imag, 21(10):1251–1261.

[81] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. The Johns Hopkins

Univ Press, 3rd edition.

[82] Goutte, C., Nielsen, F., and Hansen, K. (2000). Modeling the hemodynamic response

in fMRI using smooth fir filters. Med Imag, IEEE Trans, 19(12):1188–1201.

[83] Goutte, C., Toft, P., Rostrup, E., Nielsen, F., and Hansen, L. (1999). On clustering

fMRI time series. Neuroimage, 9(3):298 – 310.

[84] Greicius, M. D., Srivastava, G., Reiss, A. L., and Menon, V. (2004). Default-mode

network activity distinguishes alzheimer’s disease from healthy aging: evidence from

functional mri. Proc Nat Acad Sci., USA, 101(13):4637–4642.

[85] Hansen, C. D. and Johnson, C. (2004). Visualization Handbook. Acad Press.

[86] Hanson, S. J., Matsuka, T., and Haxby, J. V. (2004). Combinatorial codes in ven-

tral temporal lobe for object recognition: Haxby (2001) revisited: is there a face area?

Neuroimage, 23(1):156 – 166.

281 [87] Hardoon, D. R., Mouro-Miranda, J., Brammer, M., and Shawe-Taylor, J. (2007).

Unsupervised analysis of fMRI data using kernel canonical correlation. Neuroimage,

37(4):1250 – 1259.

[88] Hari, R., Levanen,¨ S., and Raij, T. (2000). Timing of human cortical functions during

cognition: role of MEG. Trends Cogn Sci, 4(12):455–462.

[89] Harrison, L., Penny, W. D., and Friston, K. (2003). Multivariate autoregressive mod-

eling of fMRI time series. Neuroimage, 19(4):1477–1491.

[90] Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., and Pietrini,

P. (2001). Distributed and overlapping representations of faces and objects in ventral

temporal cortex. Science, 293(5539):2425–2430.

[91] Haynes, J.-D. and Rees, G. (2005). Predicting the orientation of invisible stimuli from

activity in human primary visual cortex. Nature Neurosci, 8(5):686–691.

[92] Haynes, J.-D. and Rees, G. (2006). Decoding mental states from brain activity in

humans. Nature Rev: Neurosci, 7(7):523–534.

[Henson et al.] Henson, R., Rugg, M., and Friston, K. The choice of basis functions in

event-related fMRI. Technical report.

[94] Henson, R. N. A., Price, C. J., Rugg, M. D., Turner, R., and Friston, K. J. (2002). De-

tecting latency differences in event-related bold responses: application to words versus

nonwords and initial versus repeated face presentations. Neuroimage, 15(1):83–97.

[95] Højen-Sørensen, P., Hansen, L. K., and Rasmussen, C. E. (2000). Bayesian modelling

of fMRI time series. In Adv Neural Info Proc Sys (NIPS), pages 754–760.

282 [96] Horwitz, B. (2003). The elusive concept of brain connectivity. Neuroimage, 19(2):466

– 470.

[97] Hu, Z. and Shi, P. (2007). Nonlinear analysis of bold signal: biophysical model-

ing, physiological states, and functional activation. Med Image Comput Comput Assist

Interv, 10(Pt 2):734–741.

[98] Hutchinson, R. A., Niculescu, R. S., Keller, T. A., Rustandi, I., and Mitchell, T. M.

(2009). Modeling fMRI data generated by overlapping cognitive processes with un-

known onsets using Hidden Process Models. Neuroimage, 46(1):87 – 104.

[99] Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pat Recog Letters,

31(8):651 – 666.

[100] James, W. (1950). The principles of psychology, vol. 1.

[101] Janoos, F., Irfanoglu, M., Afacan, O., Machiraju, R., Warfield, S. K., Wald, L. L.,

and Morocz,´ I. A. (2010a). Brain state identification from fMRI using unsupervised

learning. In Hum Brain Map Annual Meeting.

[102] Janoos, F., Machiraju, R., Sammet, S., Knopp, M. V., and Morocz,´ I. (2010b). Un-

supervised learning of brain states from fMRI data. In 13th Int Conf Med Image Comp

& Comp Assist Intervent (MICCAI), volume 6362 of LNCS, pages 201–208.

[103] Janoos, F., Machiraju, R., Sammet, S., Knopp, M. V., Warfield, S. K., and Morocz,´

I. A. (2010c). Measuring effects of latency in brain activity with fMRI. In Biomedical

Imaging: From Nano to Macro, IEEE Int Symposiumon, pages 1141 –1144.

[104] Janoos, F., Machiraju, R., Singh, S., and Mórocz, I. Á. (2010d). Spatio-temporal representations and decoding cognitive processes from fMRI. Technical Report OSU-CISRC-9/10-TR19, Ohio State Univ.

[105] Janoos, F., Nouanesengsy, B., Machiraju, R., Shen, H. W., Sammet, S., Knopp, M., and Mórocz, I. Á. (2009). Visual analysis of brain activity from fMRI data. Comp Graphics Forum, 28:903–910.

[106] Jezzard, P. (2000). Physical basis of spatial distortions in magnetic resonance images. pages 425–438.

[107] Jezzard, P. (2003). Functional MRI: An Introduction to Methods. Oxford

Univ. Press.

[108] Joachims, T., Finley, T., and Yu, C.-N. (2009). Cutting-plane training of structural

SVMs. Mach Learning, 77(1):27–59.

[109] Jones, T. B., Bandettini, P. A., and Birn, R. M. (2008). Integration of motion correc-

tion and physiological noise regression in fMRI. Neuroimage, 42(2):582–590.

[110] Kamitani, Y. and Tong, F. (2005). Decoding the visual and subjective contents of

the human brain. Nature Neurosci, 8(5):679–685.

[111] Kandel, E., Schwartz, J., and Jessell, T. (2000). Principles of Neural Science. McGraw-Hill Medical, 4th edition.

[112] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J Am Stat Assoc, 90(430):773–

795.

[113] Kiebel, S. and Friston, K. J. (2002). Anatomically informed basis functions in multisubject studies. Hum Brain Map, 16(1):36–46.

[114] Kim, D.-S. and Garwood, M. (2003). High-field magnetic resonance techniques for

brain research. Curr Op Neurobio, 13(5):612–619.

[115] Kriegeskorte, N., Mur, M., and Bandettini, P. (2008). Representational similarity

analysis - connecting the branches of systems neuroscience. Front Syst Neurosci, 2:4.

[116] Krishnapuram, B., Carin, L., Figueiredo, M. A. T., and Hartemink, A. J. (2005).

Sparse multinomial logistic regression: fast algorithms and generalization bounds. Pat

Ana Mach Intel, IEEE Trans, 27(6):957–968.

[117] Kruggel, F. and von Cramon, D. Y. (1999a). Modeling the hemodynamic response

in single-trial functional MRI experiments. Mag Res in Medicine, 42(4):787–797.

[118] Kruggel, F. and von Cramon, D. Y. (1999b). Temporal properties of the hemody-

namic response in functional MRI. Hum Brain Map, 8(4):259–271.

[119] Kruggel, F., Zysset, S., and von Cramon, D. Y. (2000). Nonlinear regression of

functional MRI data: an item recognition task study. Neuroimage, 12(2):173–183.

[120] Kruskal, J. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27.

[121] Kucian, K., Loenneker, T., Dietrich, T., Dosch, M., Martin, E., and von Aster, M. (2006). Impaired neural networks for approximate calculation in dyscalculic children: a functional MRI study. Behavioral and Brain Functions, 2:31.

[122] LaConte, S. M., Peltier, S. J., and Hu, X. P. (2007). Real-time fMRI using brain-state classification. Hum Brain Map, 28(10):1033–1044.

[123] Lancaster, J., Chan, E., Mikiten, S., Nguyen, S., and Fox, P. (1997). BrainMap™ search and view. Neuroimage, 5:634.

[124] Lange, N. and Zeger, S. L. (1997). Non-linear Fourier time series analysis for human brain mapping by functional magnetic resonance imaging. App Stat, 46(1):1–29.

[125] Langs, G., Tie, Y., Rigolo, L., Golby, A., and Golland, P. (2010). Functional geometry alignment and localization of brain areas. In Adv Neural Info Proc Sys (NIPS).

[126] Lanterman, A. D. (2001). Schwarz, Wallace, and Rissanen: Intertwining themes in

theories of model selection. Int Stat Review, 69(2):185–212.

[127] Lauritzen, M. and Gold, L. (2003). Brain function and neurophysiological correlates of signals used in functional neuroimaging. J Neurosci, 23(10):3972–3980.

[128] Lauterbur, P. C. (1973). Image formation by induced local interactions: Examples

employing nuclear magnetic resonance. Nature, 242(5394):190–191.

[129] Le, T. H. and Hu, X. (1996). Retrospective estimation and correction of physiological artifacts in fMRI by direct extraction of physiological activity from MR data. Mag Res in Medicine, 35(3):290–298.

[130] Ledoit, O. and Wolf, M. (2004). Honey, I shrunk the sample covariance matrix. J

Portfolio Man, 30(4):110–119.

[131] Lehmann, D., Pascual-Marqui, R. D., Strik, W. K., and Koenig, T. (2010). Core networks for visual-concrete and abstract thought content: a brain electric microstate analysis. Neuroimage, 49(1):1073–1079.

[132] Lehmann, D., Strik, W. K., Henggeler, B., Koenig, T., and Koukkou, M. (1998).

Brain electric microstates and momentary conscious mind states as building blocks of

spontaneous thinking: I. visual imagery and abstract thoughts. Int J Psychophysiol,

29(1):1–11.

[133] Li, K., Guo, L., Nie, J., Li, G., and Liu, T. (2009). Review of methods for functional brain connectivity detection using fMRI. J Comp Med Imag Graphics, 33(2):131–139.

[134] Li, Y., Xu, N., Fitzpatrick, J. M., and Dawant, B. M. (2008). Geometric distor-

tion correction for echo planar images using nonrigid registration with spatially varying

scale. Mag Res Imag, 26(10):1388–1397.

[135] Li, Y.-O., Adali, T., and Calhoun, V. D. (2007). Estimating the number of inde-

pendent components for functional magnetic resonance imaging data. Hum Brain Map,

28(11):1251–1266.

[136] Liao, C. H., Worsley, K. J., Poline, J.-B., Aston, J. A. D., Duncan, G. H., and Evans,

A. C. (2002). Estimating the delay of the fMRI response. Neuroimage, 16(3 Pt 1):593–

606.

[137] Liao, R., Krolik, J. L., and McKeown, M. J. (2005). An information-theoretic cri-

terion for intrasubject alignment of fMRI time series: motion corrected independent

component analysis. Med Imag, IEEE Trans, 24(1):29–44.

[138] Lin, Q.-H., Liu, J., Zheng, Y.-R., Liang, H., and Calhoun, V. D. (2009). Semiblind spatial ICA of fMRI using spatial constraints. Hum Brain Map.

[139] Logothetis, N. K. (2008). What we can do and what we cannot do with fMRI.

Nature, 453(7197):869–878.

[140] Logothetis, N. K. and Wandell, B. A. (2004). Interpreting the BOLD signal. Annu Rev: Physiol, 66:735–769.

[141] Lu, W. and Rajapakse, J. (2005). Approach and applications of constrained ICA. Neural Networks, IEEE Trans, 16(1):203–212.

[142] Lu, W. and Rajapakse, J. C. (2000). Constrained independent component analysis. In Adv Neural Info Proc Sys (NIPS), 13:570–576.

[143] Luenberger, D. G. (2003). Linear and Nonlinear Programming. Springer, 2nd edition.

[144] Ma, K. L. and Lum, E. B. (2003). Visualization Handbook.

[145] Mallat, S. (2008). A Wavelet Tour of Signal Processing: The Sparse Way. Acad Press, 3rd edition.

[146] Mallows, C. (1973). Some comments on Cp. Technometrics, 15:661–675.

[147] Mansfield, P. (1977). Multi-planar image formation using NMR spin echoes. J.

Phys. C., 10:L55–58.

[148] Marrelec, G., Ciuciu, P., Pélégrini-Issac, M., and Benali, H. (2003). Estimation of the hemodynamic response function in event-related functional MRI: directed acyclic graphs for a general Bayesian inference framework. Inf Process Med Imag, 18:635–646.

[149] Martinetz, T. M., Berkovich, S. G., and Schulten, K. J. (1993). ‘Neural-gas’ network for vector quantization and its application to time-series prediction. IEEE Trans Neural Netw, 4(4):558–569.

[150] Martino, F. D., Valente, G., Staeren, N., Ashburner, J., Goebel, R., and Formisano,

E. (2008). Combining multivariate voxel selection and support vector machines for

mapping and classification of fMRI spatial patterns. Neuroimage, 43(1):44–58.

[151] McGrory, C. A. and Titterington, D. M. (2009). Variational Bayesian analyses for hidden Markov models. Aust & New Zeal J Stat, 51(2):227–244.

[152] McIntosh, A. R., Bookstein, F. L., Haxby, J. V., and Grady, C. L. (1996). Spatial

pattern analysis of functional brain images using partial least squares. Neuroimage, 3(3

Pt 1):143–157.

[153] McIntosh, A. R., Grady, C. L., Ungerleider, L. G., Haxby, J. V., Rapoport, S. I., and Horwitz, B. (1994). Network analysis of cortical visual pathways mapped with PET. J Neurosci, 14(2):655–666.

[154] McKeown, M. J., Makeig, S., Brown, G. G., Jung, T. P., Kindermann, S. S., Bell,

A. J., and Sejnowski, T. J. (1998). Analysis of fMRI data by blind separation into

independent spatial components. Hum Brain Map, 6(3):160–188.

[155] Menon, R. S. and Kim, S.-G. (1999). Spatial and temporal limits in cognitive neuroimaging with fMRI. Trends Cogn Sci, 3(6):207–216.

[156] Menon, R. S., Luknowsky, D. C., and Gati, J. S. (1998). Mental chronometry using

latency-resolved functional MRI. Proc Nat Acad Sci., USA, 95(18):10902–10907.

[157] Meyer, F. and Chinrungrueng, J. (2005). Local clustering of fMRI time series in the frequency domain. Med Image Ana, 9(1):51–68.

[158] Meyer, F. G. (2003). Wavelet-based estimation of a semiparametric generalized

linear model of fMRI time-series. Med Imag, IEEE Trans, 22(3):315–322.

[159] Mitchell, T. M., Hutchinson, R., Niculescu, R. S., Pereira, F., Wang, X., Just, M.,

and Newman, S. (2004). Learning to decode cognitive states from brain images. Mach

Learning, 57(1-2):145–175.

[160] Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason,

R. A., and Just, M. A. (2008). Predicting Human Brain Activity Associated with the

Meanings of Nouns. Science, 320(5880):1191–1195.

[161] Moeller, J. R., Strother, S. C., Sidtis, J. J., and Rottenberg, D. A. (1987). Scaled

subprofile model: a statistical approach to the analysis of functional patterns in positron

emission tomographic data. J Cereb Blood Flow Metab, 7(5):649–658.

[162] Molko, N., Cachia, A., Riviere, D., Mangin, J. F., Bruandet, M., LeBihan, D., Cohen,

L., and Dehaene, S. (2003). Functional and structural alterations of the intraparietal

sulcus in a developmental dyscalculia of genetic origin. Neuron, 40(4):847–858.

[163] Mórocz, I., Gross-Tsur, A., von Aster, M., Manor, O., Breznitz, Z., Karni, A., and Shalev, R. (2003). Functional magnetic resonance imaging in dyscalculia: preliminary observations. Ann Neurology, 54(S7):S145.

[164] Mourão-Miranda, J., Friston, K. J., and Brammer, M. (2007). Dynamic discrimination analysis: A spatial-temporal SVM. Neuroimage, 36(1):88–99.

[165] Munkres, J. (1957). Algorithms for the assignment and transportation problems. J Soc. for Industrial and Applied Mathematics, 5(1):32–38.

[166] Nichols, T. E. and Holmes, A. P. (2002). Nonparametric permutation tests for func-

tional neuroimaging: a primer with examples. Hum Brain Map, 15(1):1–25.

[167] Nielsen, F. and Hansen, L. K. (2000). Experiences with Matlab and VRML in func-

tional neuroimaging visualizations.

[168] Nielsen, F. A., Balslev, D., and Hansen, L. K. (2005). Mining the posterior cingulate:

segregation between memory and pain components. Neuroimage, 27(3):520–532.

[169] Norman, K. A., Polyn, S. M., Detre, G. J., and Haxby, J. V. (2006). Beyond mind-

reading: multi-voxel pattern analysis of fMRI data. Trends Cogn Sci, 10(9):424–430.

[170] Oakes, T. R., Johnstone, T., Walsh, K. S. O., Greischar, L. L., Alexander, A. L., Fox,

A. S., and Davidson, R. J. (2005). Comparison of fMRI motion correction software

tools. Neuroimage, 28(3):529–543.

[171] Ogawa, S. and Lee, T. M. (1990). Magnetic resonance imaging of blood vessels

at high fields: in vivo and in vitro measurements and image simulation. Mag Res in

Medicine, 16(1):9–18.

[172] Ogawa, S., Lee, T. M., Stepnoski, R., Chen, W., Zhu, X. H., and Ugurbil, K. (2000). An approach to probe some neural systems interaction by functional MRI at neural time scale down to milliseconds. Proc Nat Acad Sci., USA, 97(20):11026–11031.

[173] O’Toole, A. J., Jiang, F., Abdi, H., Pénard, N., Dunlop, J. P., and Parent, M. A. (2007). Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. J Cog Neurosci, 19(11):1735–1752.

[174] Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ

Press.

[175] Peppiatt, C. M., Howarth, C., Mobbs, P., and Attwell, D. (2006). Bidirectional control of CNS capillary diameter by pericytes. Nature, 443(7112):700–704.

[176] Pereira, F., Mitchell, T., and Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. Neuroimage, 45(1, Supplement 1):S199–S209.

[177] Pessoa, L. and Padmala, S. (2005). Quantitative prediction of perceptual decisions

during near-threshold fear detection. Proc Nat Acad Sci., USA, 102(15):5612–5617.

[178] Poldrack, R. A. (2008). The role of fMRI in cognitive neuroscience: where do we

stand? Curr Op Neurobio, 18(2):223–227.

[179] Polyn, S. M., Natu, V. S., Cohen, J. D., and Norman, K. A. (2005). Category-specific

cortical activity precedes retrieval during memory search. Science, 310(5756):1963–

1966.

[180] Posner, M. I. (1986). Chronometric Explorations of Mind. Oxford Univ Press, USA.

[181] Price, C. J. and Mechelli, A. (2005). Reading and reading disturbance. Curr Op

Neurobio, 15(2):231–238.

[182] Purdon, P. L., Solo, V., Weisskoff, R. M., and Brown, E. N. (2001). Locally regularized spatiotemporal modeling and model comparison for functional MRI. Neuroimage, 14(4):912–923.

[183] Rachev, S. T. and Rüschendorf, L. (1998). Mass Transportation Problems: Volume I: Theory (Probability and Its Applications). Springer.

[184] Rajapakse, J. and Piyaratna, J. (2001). Bayesian approach to segmentation of statistical parametric maps. Biomed Eng, IEEE Trans, 48(10):1186–1194.

[185] Rajapakse, J. C. and Zhou, J. (2007). Learning effective brain connectivity with dynamic Bayesian networks. Neuroimage, 37(3):749–760.

[186] Rothman, A., Levina, E., and Zhu, J. (2009). Generalized thresholding of large

covariance matrices. J Am Stat Assoc (Theory Methods), 104:177–186.

[187] Rotzer, S., Loenneker, T., Kucian, K., Martin, E., Klaver, P., and von Aster, M.

(2009). Dysfunctional neural network of spatial working memory contributes to devel-

opmental dyscalculia. Neuropsychologia, 47(13):2859–2865.

[188] Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision, 40(2):99–121.

[189] Ruttimann, U. E., Unser, M., Rawlings, R. R., Rio, D., Ramsey, N. F., Mattay, V. S.,

Hommer, D. W., Frank, J. A., and Weinberger, D. R. (1998). Statistical analysis of

functional MRI data in the wavelet domain. Med Imag, IEEE Trans, 17(2):142–154.

[190] Saad, Z. S., DeYoe, E. A., and Ropella, K. M. (2003). Estimation of fMRI response

delays. Neuroimage, 18(2):494–504.

[191] Saad, Z. S., Ropella, K. M., Cox, R. W., and DeYoe, E. A. (2001). Analysis and use of fMRI response delays. Hum Brain Map, 13(2):74–93.

[192] Sabuncu, M. R., Singer, B. D., Conroy, B., Bryan, R. E., Ramadge, P. J., and Haxby,

J. V. (2010). Function-based intersubject alignment of human cortical anatomy. Cereb.

Cortex, 20(1):130–140.

[193] Schölkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, 1st edition.

[194] Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J Am Stat Assoc, 97(457):337–351.

[195] Serra-Grabulosa, J. M., Adan, A., Pérez-Pàmies, M., Lachica, J., and Membrives, S. (2010). [Neural bases of numerical processing and calculation]. Rev: Neurol, 50(1):39–46.

[196] Shalev, R. S. (2004). Developmental dyscalculia. J Child Neurol, 19(10):765–771.

[197] Shaywitz, B. A., Shaywitz, S. E., Pugh, K. R., Mencl, W. E., Fulbright, R. K.,

Skudlarski, P., Constable, R. T., Marchione, K. E., Fletcher, J. M., Lyon, G. R., and

Gore, J. C. (2002). Disruption of posterior brain systems for reading in children with

developmental dyslexia. Biol Psychiatry, 52(2):101–110.

[198] Shi, J. and Malik, J. (1997). Normalized cuts and image segmentation. In Comp Vis

Pat Recog, page 731.

[199] Shirdhonkar, S. and Jacobs, D. (2008). Approximate earth mover’s distance in linear time. In Comp Vis Pat Recog., IEEE Conf., pages 1–8.

[200] Sip, K. E., Roepstorff, A., McGregor, W., and Frith, C. D. (2008). Detecting deception: the scope and limits. Trends Cogn Sci, 12(2):48–53.

[201] Sirotin, Y. B. and Das, A. (2009). Anticipatory haemodynamic signals in sensory

cortex not predicted by local neuronal activity. Nature, 457(7228):475–479.

[202] Smith, A., Lewis, B., Ruttimann, U., Ye, F., Sinnwell, T., Yang, Y., Duyn, J., and Frank, J. (1999). Investigation of low frequency drift in fMRI signal. Neuroimage, 9(5):526–533.

[203] Smith, S., Jenkinson, M., Woolrich, M., Beckmann, C., Behrens, T., Johansen-Berg, H., Bannister, P., Luca, M. D., Drobnjak, I., Flitney, D., Niazy, R., Saunders, J., Vickers, J., Zhang, Y., Stefano, N. D., Brady, J., and Matthews, P. (2004). Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage, 23(S1):208–219.

[204] Stephan, K. E., Harrison, L. M., Kiebel, S. J., David, O., Penny, W. D., and Friston, K. J. (2007). Dynamic causal models of neural system dynamics: current state and future extensions. J Biosci, 32(1):129–144.

[205] Thirion, B. and Faugeras, O. (2002). Dynamical components analysis of fMRI data.

[206] Thirion, B. and Faugeras, O. (2003). Dynamical components analysis of fMRI data through kernel PCA. Neuroimage, 20(1):34–49.

[207] Thirion, B., Flandin, G., Pinel, P., Roche, A., Ciuciu, P., and Poline, J.-B. (2006).

Dealing with the shortcomings of spatial normalization: multi-subject parcellation of

fMRI datasets. Hum Brain Map, 27(8):678–693.

[208] Tory, M., Röber, N., Möller, T., Celler, A., and Atkins, M. (2001). 4D space-time techniques: a medical imaging case study. In IEEE Vis, pages 473–592.

[209] van Gelderen, P., Wu, C. W. H., de Zwart, J. A., Cohen, L., Hallett, M., and Duyn,

J. H. (2005). Resolution and reproducibility of BOLD and perfusion functional MRI at

3.0 Tesla. Mag Res in Medicine, 54(3):569–576.

[210] Van Wijk, J. and Van Selow, E. (1999). Cluster and calendar based visualization of time series data. In IEEE Symposium on Information Visualization (InfoVis ’99), pages 4–9, 140.

[211] Varoquaux, G., Baronnet, F., Kleinschmidt, A., Fillard, P., and Thirion, B. (2010). Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling. In Medical Image Computing and Computer Assisted Intervention.

[212] Van De Ville, D., Seghier, M. L., Lazeyras, F., Blu, T., and Unser, M. (2007). WSPM: wavelet-based statistical parametric mapping. Neuroimage, 37(4):1205–1217.

[213] Weilke, F., Spiegel, S., Boecker, H., von Einsiedel, H. G., Conrad, B., Schwaiger, M., and Erhard, P. (2001). Time-resolved fMRI of activation patterns in M1 and SMA during complex voluntary movement. J Neurophysiol, 85(5):1858–1863.

[214] Weiskopf, N., Klose, U., Birbaumer, N., and Mathiak, K. (2005). Single-shot compensation of image distortions and BOLD contrast optimization using multi-echo EPI for real-time fMRI. Neuroimage, 24(4):1068–1079.

[215] Wink, A. and Roerdink, J. (2004). Denoising functional MR images: a comparison

of wavelet denoising and Gaussian smoothing. Med Imag, IEEE Trans, 23:374–387.

[216] Wismüller, A., Meyer-Bäse, A., Lange, O., Auer, D., Reiser, M. F., and Sumners, D. (2004). Model-free functional MRI analysis based on unsupervised clustering. J. of Biomedical Informatics, 37(1):10–18.

[217] Woodring, J., Wang, C., and Shen, H.-W. (2003). High dimensional direct rendering of time-varying volumetric data. In IEEE Visualization (VIS 2003), pages 417–424.

[218] Woolrich, M., Jenkinson, M., Brady, J., and Smith, S. (2004). Fully Bayesian spatio-temporal modeling of fMRI data. Med Imag, IEEE Trans, 23(2):213–231.

[219] Worsley, K., Poline, J., Friston, K., and Evans, A. C. (1997). Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage, 6(4):305–319.

[220] Worsley, K. J., Taylor, J. E., Tomaiuolo, F., and Lerch, J. (2004). Unified univariate

and multivariate random field theory. Neuroimage, 23 Suppl 1:S189–S195.

[221] Wowk, B., McIntyre, M. C., and Saunders, J. K. (1997). k-space detection and

correction of physiological artifacts in fMRI. Mag Res in Medicine, 38(6):1029–1034.

[222] Ye, J., Lazar, N. A., and Li, Y. (2009). Geostatistical analysis in clustering fMRI

time series. Stat in Medicine, 28(19):2490–2508.

[223] Yeo, D. T., Fessler, J. A., and Kim, B. (2008). Concurrent correction of geometric distortion and motion using the map-slice-to-volume method in echo-planar imaging. Mag Res Imag, 26(5):703–714.

[224] Zeeman, P. (1897). The effect of magnetisation on the nature of light emitted by a substance. Nature, 55:347.

[225] Zhang, L., Samaras, D., Alia-Klein, N., Volkow, N., and Goldstein, R. (2006). Modeling neuronal interactivity using Dynamic Bayesian Networks. In Adv Neural Info Proc Sys (NIPS) 18, pages 1593–1600.

[226] Zhao, X., Glahn, D., Tan, L. H., Li, N., Xiong, J., and Gao, J.-H. (2004). Comparison of TCA and ICA techniques in fMRI data processing. Mag Res Imag, 19(4):397–402.

[227] Zheng, X. and Rajapakse, J. C. (2006). Learning functional structure from fMR images. Neuroimage, 31(4):1601–1613.
