Statistical Challenges in Modeling Big Signals

Zhaoxia Yu1, Dustin Pluta1, Tong Shen1, Chuansheng Chen2, Gui Xue3, and Hernando Ombao1, 4

1Department of Statistics, University of California, Irvine, USA 2Department of Psychology and Social Behavior, University of California, Irvine, USA 3State Key Laboratory of Cognitive and Learning and IDG/McGovern Institute for Brain Research, Center for Collaboration and Innovation in Brain and Learning Sciences, Beijing Normal University, Beijing, PR China 4Biostatistics Group, King Abdullah University of Science and Technology, Saudi, Email: [email protected]

November 2, 2017

Abstract

Brain signal data are inherently big: massive in amount, complex in structure, and high in dimensions. These characteristics impose great challenges for statistical inference and learning. Here we review several key challenges, discuss possible solutions, and highlight future research directions. arXiv:1711.00432v1 [stat.ME] 1 Nov 2017 1 Introduction

The brain is one of the most complex organs of human body. Various recording techniques have been developed to get a glimpse of brain activity and to gain a deeper understanding of the neural mechanisms both during rest and in response to various stimuli. One of the most commonly used modalities is functional magnetic resonance imaging (fMRI) which measures changes in the blood oxygen level in localized brain regions (Lindquist (2008)). Another important modality is electroencephalogram (EEG) recordings, which measures the collective electrical activity of populations of neurons (Ombao et al. (2016)). While both

1 fMRI and EEG measure brain function over time, strucutral modalities provide additional information about the relationship of brain regions, for example, diffusion tensor imaging (DTI) maps the structure and orientation of the brain’s white matter fiber tracks through the diffusion of water molecules across the entire brain volume (Basser et al. (1994)). Brain data recorded by these different techniques bring exciting new research opportunities, but statistical inference and analysis of these data poses immense challenges due to the sheer size of data and the complicated biological characteristics under study. For example, fMRI data consists of time series recordings over a three-dimensional image with over 100K voxels and sampled every 1 − 2 seconds. Moreover, a typical scalp EEG is recorded across 256 channels sampled at the rate of 1000 observations per second. These two brain modalities capture different aspects of brain activity (electrical by EEG and hemodynamic by fMRI) and possess different spatio-temporal resolutions (EEG has excellent temporal resolution while fMRI offers high spatial specificity). Nevertheless, they share several common features: the amount of information is massive (current studies routinely generate terabytes worth of raw data), the signal-to-noise ratio is low and the structure is complicated. In particular, studying interactions between brain regions is challenging because of the multi-scale nature of the data. One has to account for both local interactions (within voxels in a region) and global (between regions). In addition, in imaging , we want to study how genetic variants are associated with brain function. Since genetic data are also high-dimensional, imaging genetics faces even more statistical and computational challenges.

2 Brain Connectivity and Imaging Genetics

2.1 Brain Connectivity

Localized brain activations play a critical role in many functions (e.g., visual, motor). However, the execution of higher level cognitive processing (e.g., in memory retrieval, decision making) requires the interaction and transfer of information between many of these localized regions. Recently, there has been great interest in studying brain connectivity, spurred by emerging evidence that connectivity may provide greater insight into a number of mental disorders, the brain-behavior relationship, and variations in cognitive performance across individuals (Woodward and Cascio (2015)). While the general idea of connectivity between two brain regions seems natural, the stag- gering complexity of the brain requires numerous definitions and measurement methods to highlight different aspects of brain connectivity. There are three general concepts of brain “connectivity” that have been of interest, namely structural, functional, and effective con- nectivities. Structural connectivity refers to the physical connections between brain regions and is commonly measured using DTI. Functional connectivity is a symmetric and undi- rected measure of concordant activation between brain regions. Some common measures of

2 functional connectivity are cross-correlation, cross-coherence, partial-correlation, and par- tial coherence (see Ombao and Van Bellegem (2008); Sun et al. (2004); Fiecas and Ombao (2011)). By contrast, effective connectivity is an asymmetric measure of how past activity in one region may influence the future activity of another region. Effective connectivity is closely related to Granger causality (as opposed to physiological causality which requires a carefully designed experiment) and is often assessed using the vector autoregressive (VAR) model ( Gorrostieta et al. (2013), Yu et al. (2016), Chiang et al. (2017)). To illustrate the difference between these measures, we estimated both functional and effective connectivity using a sample of 209 subjects whose fMRI data were acquired during a cups task experiment (Xue et al. (2008)). We found that one ROI in the right prefrontal cortex has relatively weak functional connectivity with the other ROIs. However, in terms of effective connectivity, it is a relatively strong sender to the other ROIs, but is a weak receiver (see Figure 1).

Figure 1: Functional vs. effective connectivity

There is an increasing body of empirical evidence indicating the importance of dynamic characteristics of brain connectivity for understanding . Interesting features of dynamic connectivity can be exhibited across widely varying time scales, ranging from milliseconds (in electrophysiological studies) to seconds (in fMRI studies) and even over years in extended longitudinal studies, such as studies of infant development and Alzheimer’s disease progression. See Hutchison et al. (2013) Adding this extra level of complexity further increases the difficulty of modelling brain connectivity, and has motivated the development of new statistical approaches. A variety of methods have been proposed to estimate dynamic connectivity, including sliding window methods (e.g., Chang and Glover (2010)) change-point detection via VAR models (Kirch et al. (2015)) and hidden Markov switching VAR models (Samdin et al. (2017)). In a recent work, Ting et al. (2017) proposed a method for high dimensional by finding

3 low-dimensional representations via principal components analysis. The proposed model, f- SVAR, is a factor-based vector autoregressive model with Markov-switching between states. By estimating latent brain states that recur throughout the experiment, this method is able to capture connectivity patterns associated with those distinct states.

2.2 Imaging Genetics

Imaging Genetics is an emerging research on determining genetic basis for brain anatomi- cal structure, brain physiological activity during rest and stimulation and behavior. This new area is a potentially important tool in early diagnosis, personalized medicine, and the treatment of neurological disorders (Stingo et al. (2013)). Indeed, numerous work has been recently published to assess the genetic components of structural and functional imaging data. In particular, Ge et al. (2016) demonstrate that several characteristics of brain struc- ture are genetically heritable, such as brain volume and neuroanatomical shape. As we move from structural imaging towards functional imaging, the magnitude of data complex- ity sharply increases. This is especially true for brain connectivity, which examines how genetics may explain variation in between-regions interactions. Recent work suggests that brain connectivity has the qualities of a fingerprint, i.e., everyone one has an own unique profile regardless at which condition the connectivity is measured (Finn et al. (2015)). Like other characteristics of the brain, and probably more so due to its “fingerprinting” prop- erty, connectivity is expected to be heritable. Indeed, a recent study based on families with attention-deficit/hyperactivity disorder (ADHD) suggests significant genetic heritability for the default mode, cognitive control, and ventral attention networks (Sudre et al. (2017)).

3 Statistical Challenges and Strategies

It is widely accepted that statistical inference for data face tremendous chal- lenges. As noted, imaging data are inherently multi-modal and high dimensional. Moreover, they suffer from low signal-to-noise ratios and small effect sizes from individual genetic fac- tors and brain regions. We review several statistical strategies.

3.1 Multiple Testing

Multiple testing is a common and serious issue in analyzing brain imaging data as tests of activation are performed on 100K voxels or thousands of regions of interest (ROIs) in fMRI and hundreds of channels in EEG. In addition, there are tests on connectivity across all pairs of voxels, pairs of ROIs and pairs of channels (Nichols (2012)). The burden of multiple testing is further amplified in imaging genetics, where the curse of dimensionality comes from both the brain signal data and the genetic variables . In the massive univariate approach,

4 each pair of voxel and single nucleotide polymorphism (SNP) will be tested, leading to trillions of tests (Stein et al. (2010)), Given this huge number of tests, it is then necessary to control the expected number of false positives. One method is to control the familywise error rate (FWER), i.e., the probability of making at least one false positive, at a pre- specified value. A simple method is Bonferroni correction where the p-value cutoff is defined as the targeting FWER divided by the number of tests. Although Bonferroni correction successfully controls FWER, it tends to be overly conservative and thus can potentially fail to detect important features in the data. Instead, in many situations, it makes sense to incur the cost of having a small proportion of false positive in return for more true positives. Consequently, it is reasonable to control the false discovery rate (FDR), which is defined as the expected proportion of false positives among the ones that have been declared to be significant (Benjamini and Hochberg (1995)). However, for imaging analysis the issue is further complicated by the spatial dependence of the data, which can decrease overall power if ignored. To address this, resampling methods such as permutation-based evaluation can be used (Nichols and Holmes (2002); Raz et al. (2003)), which have the advantage of preserving the correlation structure within a modality. On the other hand, permutations are computationally expensive, and can be impractical when the computational cost of each permutation is substantial.

3.2 Dimension Reduction, Regularized Analysis, and Low-Rank Approximation

Due to the numerous challenges posed by high-dimensional data, dimension reduction is often necessary for efficient statistical analysis of imaging and genetics data. Some commonly used data drive methods include principal component analysis (PCA), high order singular value decomposition (the high dimension extension of PCA), and independent component analysis (ICA)(Liu et al. (2009)). These methods, although effective in reducing dimensions for the genetic or imaging modality, are often suboptimal for prediction and testing, as the extracted leading components do not necessarily capture the strongest associations between genetic and imaging modalities. Simultaneous dimension reduction methods, such as partial least squares (PLS), canonical correlation analysis (CCA), reduced rank regression (RRR), and parallel ICA Ahn et al. (2015) on the other hand, perform dimension reduction jointly on responses and covariates by minimizing a specific cost function. Recently, tensor regression has been of increasing interest for analyzing multi-way and high- dimensional data such as brain imaging data. By analyzing tensors, i.e., multi-array data, tensor regression can directly incorporate the inherent spatio-temporal structure of imaging data. At the same time, it achieves parsimony by imposing low-rank decompositions (such as PARAFAC) and sparsity assumptions (such as LASSO) on the regression coefficients. By utilizing these reduction methods, which depend on the response data as well as the covariates, tensor regression methods may lead to more accurate parameter estimation and

5 outcome prediction (Lu et al. (2017); Zhou et al. (2013)) An additional advantage of these approaches is that individual coefficients between covariates and responses can be estimated, which makes the results more interpretable than those from traditional dimension reduction methods. Indeed, Zhou et al. (2013) identified brain regions that might be responsible for ADHD by applying tensor regression on 3D MRI data. It is likely that extensions of these methods, such as adding a time dimension with hidden states, will improve our understanding of dynamic connectivity and how it is connected to behavioral traits, cognitive outcomes, and genetic factors.

3.3 Global Tests

The random effect model, also known as variance component model in this setting, has been proposed to jointly model a large number of genetic variants (Ge et al. (2016)). In this approach, the effect from an individual genetic variant is treated as a random variable, rather than a fixed parameter. As a result, the number of variants that can be jointly analyzed is not limited by sample size. It is also noticed that this method is closely related to kernel machine methods, which can naturally model non-additive effects (Ge et al. (2012)). These approaches are often used to analyze sets of genetic variants, such as the variants in a gene or pathway, or in estimating the genetic heritability of a phenotype. For hypothesis testing, score tests are often used because they can be calculated rapidly. We recently discovered that these score tests have a unified form (Pluta et al. (2017)), which we refer to as Mantel’s statistic (Mantel (1967)). Essentially, this approach examines whether there is a relationship between two sets of variables by testing whether the pairwise similarity/distance measurements between the two sets of variables are correlated. This is a very general framework that includes many other proposed approaches as special cases, such as multivariate distance matrix regression, pseudo F-tests, and kernel-based tests (Shehzad et al. (2014); Xu et al. (2017)). Moreover, this method is not limited in application to quantitative and matrix-shaped data, but can be implemented as long as there is a suitable definition of similarity between two subjects. In theory, one can also model interactions between all pairs of genes/SNPs, high-order interactions, and even complicated and unknown functions of genetic variants, through an appropriate choice of similarity measure. This flexibility makes the framework well-suited for testing the association of two sets of high- dimensional, biologically complex features.

3.4 Data Integration

Similar to dimension reduction methods, data integration methods also aim to improve ef- ficiency and power. However, different from dimension reduction, data integration achieves better power by aggregating weak signals from multiple sources. For example, one can con- duct meta-analysis by combining data collected from different centers (Wager et al. (2007);

6 Salimi-Khorshidi et al. (2009)). Another direction, particularly useful for neuroimaging studies, is to integrate information from different imaging modalities, as different modalities often measure complementary characteristics of brain function. For example, fMRI has high spatial but low temporal resolution; on the contrary, EEG has low spatial but high temporal resolution. Multi-modal data can be acquired simultaneously or separately. While simul- taneous acquisition sounds appealing, it faces many technical issues. For example, when fMRI and EEG data are recorded during the same experiment, EEG electrodes can distort the magnetic field, making the fMRI data less accurate (Uluda˘gand Roebroeck (2014)). As a result, we here focus on combining information from separately (non-contemporaneously) recorded data. Previous work in this direction has fused fMRI and DTI data in estimating brain connectivity ( Xue et al. (2015); Kang et al. (2017)), and using structural connec- tivity from DTI to construct informative priors when estimating resting state functional connectivity from fMRI (Kang et al. (2017); Chiang et al. (2017)).

4 Discussion

Brain signals have several unique characteristics. Its low signal-to-noise ratio, the difficulty to select truly associated features in high-dimensional data, and the small individual effect sizes (such as genetic effects) all contribute to the difficulties and low statistical power. In particular, as we are expanding analysis from individual ROIs/voxels to connectivity and from connectivity to dynamic connectivity, there is an urgent need for computationally and statistically efficient methods. Given the various issues in brain signal and other related data, methods for data integration would be valuable. Indeed, methodological development on integrating different brain signal data is a fast evolving area. A practical and useful strategy to aggregate weak signal is to conduct global tests between two complicate and high-dimensional domains. Taking imaging genetics as an example, given the statistical chellenges for realistic sample sizes, a logical first step is to examine whether there is an “overall” association between the brain connectivity modality and the genetic modality. Considering another example, for brain connectivity separately estimated from fMRI and EEG, it is of interest to know the concordance of these measures. Measuring the consistency of connectivity measures across modalities can lead to improved understanding of brain connectivity in general, and inform future statistical analyses. For instance, our recent work has applied the Mantel test to show that brain connectivity estimated using fMRI is consistent with that using EEG (Pluta et al. (2017)), suggesting that it may be possible to develop more powerful models of brain connectivity by combining fMRI and EEG data. With continued improvements in imaging and genetic sequencing technologies, and the availility of relevant publicly available data banks, it is clear that there will be a continued need for further developing a variety of flexible, robust statistical methods for multi-modal data.

7 References

Ahn, M., Shen, H., Lin, W., and Zhu, H. (2015). A sparse reduced rank framework for group analysis of data. Statistica Sinica, 25(1):295. Basser, P. J., Mattiello, J., and LeBihan, D. (1994). Estimation of the effective self-diffusion tensor from the nmr spin echo. Journal of Magnetic Resonance, Series B, 103(3):247–254. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), pages 289–300. Chang, C. and Glover, G. H. (2010). Time–frequency dynamics of resting-state brain con- nectivity measured with fmri. Neuroimage, 50(1):81–98. Chiang, S., Guindani, M., Yeh, H., Haneef, Z., Stern, J., and Vannucci, M. (2017). A bayesian vector autoregressive model for multi-subject effective connectivity inference using multi- modal neuroimaging data. Human , 38:1311–1332. Fiecas, M. and Ombao, H. (2011). The generalized shrinkage estimator for the analysis of functional connectivity of brain signals. Annals of Applied Statistics, 5:1102–1125. Finn, E. S., Shen, X., Scheinost, D., Rosenberg, M. D., Huang, J., Chun, M. M., Pa- pademetris, X., and Constable, R. T. (2015). Functional connectome fingerprinting: iden- tifying individuals using patterns of brain connectivity. Nature neuroscience, 18(11):1664– 1671. Ge, T., Feng, J., Hibar, D. P., Thompson, P. M., and Nichols, T. E. (2012). Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. Neuroimage, 63(2):858–873. Ge, T., Reuter, M., Winkler, A. M., Holmes, A. J., Lee, P. H., Tirrell, L. S., Roffman, J. L., Buckner, R. L., Smoller, J. W., and Sabuncu, M. R. (2016). Multidimensional heritability analysis of neuroanatomical shape. Nature communications, 7:13291. Gorrostieta, C., Fiecas, M., Ombao, H., Burke, E., and Cramer, S. C. (2013). Hierarchical vector auto-regressive models and their applications to multi-subject effective connectivity. Frontiers in Computational Neuroscience, 7(159):1–11. Hutchison, R. M., Womelsdorf, T., Allen, E. A., Bandettini, P. A., Calhoun, V. D., Cor- betta, M., Della Penna, S., Duyn, J. H., Glover, G. H., Gonzalez-Castillo, J., et al. (2013). Dynamic functional connectivity: promise, issues, and interpretations. Neuroim- age, 80:360–378. Kang, H., Ombao, H., Fonnesbeck, C., Ding, Z., and Morgan, V. L. (2017). A bayesian double fusion model for resting-state brain connectivity using joint functional and structural data. Brain Connectivity, 7:219–227.

8 Kirch, C., Muhsal, B., and Ombao, H. (2015). Detection of changes in multivariate time series with application to eeg data. Journal of the American Statistical Association, 110:1197– 1216.

Lindquist, M. (2008). The statistical analysis of fmri data. Statistical Science, pages 439–464.

Liu, J., Pearlson, G., Windemuth, A., Ruano, G., Perrone-Bizzozero, N. I., and Calhoun, V. (2009). Combining fmri and snp data to investigate connections between brain function and genetics using parallel ica. Human brain mapping, 30(1):241–255.

Lu, Z.-H., Khondker, Z., Ibrahim, J. G., Wang, Y., Zhu, H., Initiative, A. D. N., et al. (2017). Bayesian longitudinal low-rank regression models for imaging genetic data from longitudinal studies. NeuroImage, 149:305–322.

Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer research, 27(2 Part 1):209–220.

Nichols, T. E. (2012). Multiple testing corrections, nonparametric methods, and random field theory. Neuroimage, 62(2):811–815.

Nichols, T. E. and Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human brain mapping, 15(1):1–25.

Ombao, H., Lindquist, M., Thompson, W., and Aston, J. (2016). Handbook of Statistical Methods for NeuroImaging. CRC Press.

Ombao, H. and Van Bellegem, S. (2008). Coherence analysis: A linear filtering point of view. IEEE Transactions on Signal Processing, 56(6):2259–2266.

Pluta, D., Shen, T., Xue, G., Chen, C., Yu, Z., and Ombao, H. (2017). Mantel tests for high dimensinoal data: parametric, non-parametric, or penalized analysis? UC Irvine Statistics Research Paper.

Raz, Jonathanand Zheng, H., Ombao, H., and Turetsky, B. (2003). Statistical test for fmri based on experimental randomization. NeuroImage, 19:226–232.

Salimi-Khorshidi, G., Smith, S. M., Keltner, J. R., Wager, T. D., and Nichols, T. E. (2009). Meta-analysis of neuroimaging data: a comparison of image-based and coordinate-based pooling of studies. Neuroimage, 45(3):810–823.

Samdin, S. B., Ting, C.-M., Ombao, H., and Salleh, S.-H. (2017). A unified estimation framework for state-related changes in effective brain connectivity. IEEE Transactions on Biomedical Engineering, 64(4):844–858.

Shehzad, Z., Kelly, C., Reiss, P. T., Craddock, R. C., Emerson, J. W., McMahon, K., Copland, D. A., Castellanos, F. X., and Milham, M. P. (2014). A multivariate distance- based analytic framework for connectome-wide association studies. Neuroimage, 93:74–94.

9 Stein, J. L., Hua, X., Lee, S., Ho, A. J., Leow, A. D., Toga, A. W., Saykin, A. J., Shen, L., Foroud, T., Pankratz, N., et al. (2010). Voxelwise genome-wide association study (vgwas). neuroimage, 53(3):1160–1174. Stingo, F. C., Guindani, M., Vannucci, M., and Calhoun, V. D. (2013). An integrative bayesian modeling approach to imaging genetics. Journal of the American Statistical Association, 108(503):876–891. Sudre, G., Choudhuri, S., Szekely, E., Bonner, T., Goduni, E., Sharp, W., and Shaw, P. (2017). Estimating the heritability of structural and functional brain connectivity in families affected by attention-deficit/hyperactivity disorder. JAMA , 74(1):76– 84. Sun, F., Miller, L., and D’Esposito, M. (2004). Measuring interregional functional connectiv- ity using coherence and partial coherence analyses of fmri data. NeuroImage, 21:647–658. Ting, C.-M., Ombao, H., Samdin, S. B., and Salleh, S.-H. (2017). Estimating time-varying effective connectivity in high-dimensional fmri data using regime-switching factor models. arXiv preprint arXiv:1701.06754. Uluda˘g,K. and Roebroeck, A. (2014). General overview on the merits of multimodal neu- roimaging data fusion. Neuroimage, 102:3–10. Wager, T. D., Lindquist, M., and Kaplan, L. (2007). Meta-analysis of functional neu- roimaging data: current and future directions. Social cognitive and affective neuroscience, 2(2):150–158. Woodward, N. D. and Cascio, C. J. (2015). Resting-state functional connectivity in psychi- atric disorders. JAMA psychiatry, 72(8):743–744. Xu, Z., Xu, G., and Pan, W. (2017). Adaptive testing for association between two random vectors in moderate to high dimensions. Genetic Epidemiology. Xue, G., Lu, Z., Levin, I. P., Weller, J. A., Li, X., and Bechara, A. (2008). Functional dissociations of risk and reward processing in the medial prefrontal cortex. Cerebral Cortex, 19(5):1019–1027. Xue, W., Bowman, F. D., Pileggi, A. V., and Mayer, A. R. (2015). A multimodal approach for determining brain networks by jointly modeling functional and structural connectivity. Frontiers in computational neuroscience, 9. Yu, Z., Prado, R., Burke, E., Cramer, S. C., and Ombao, H. (2016). A hierarchical bayesian model for studying the impact of stroke on brain motor function. Journal of the American Statistical Association, 111:549–563. Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.

10