bioRxiv preprint doi: https://doi.org/10.1101/2020.08.25.265546; this version posted August 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

On stability of Canonical Correlation Analysis and Partial Least Squares with application to brain-behavior associations

Markus Helmer^a, Shaun Warrington^b, Ali-Reza Mohammadi-Nejad^b,c, Jie Lisa Ji^a,d, Amber Howell^a,d, Benjamin Rosand^e, Alan Anticevic^a,d,f, Stamatios N. Sotiropoulos^b,c,g,1, and John D. Murray^a,d,e,1

^a Department of Psychiatry, Yale School of Medicine, New Haven, CT 06511; ^b Sir Peter Mansfield Imaging Centre, School of Medicine, University of Nottingham, Nottingham, NG7 2UH, United Kingdom; ^c National Institute for Health Research (NIHR) Nottingham Biomedical Research Centre, Queens Medical Centre, Nottingham, United Kingdom; ^d Interdepartmental Neuroscience Program, Yale University School of Medicine, New Haven, CT 06511, USA; ^e Department of Physics, Yale University, New Haven, CT 06511, USA; ^f Department of Psychology, Yale University, New Haven, CT 06511, USA; ^g FMRIB, Wellcome Centre for Integrative Neuroimaging, Nuffield Department of Clinical Neurosciences, John Radcliffe Hospital, University of Oxford, Oxford, OX3 9DU, United Kingdom

This manuscript was compiled on August 24, 2020

Abstract

Associations between high-dimensional datasets, each comprising many features, can be discovered through multivariate statistical methods, like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). CCA and PLS are widely used methods which reveal which features carry the association. Despite the longevity and popularity of CCA/PLS approaches, their application to high-dimensional datasets raises critical questions about the reliability of CCA/PLS solutions. In particular, overfitting can produce solutions that are not stable across datasets, which severely hinders their interpretability and generalizability. To study these issues, we developed a generative model to simulate synthetic datasets with multivariate associations, parameterized by feature dimensionality, data variance structure, and assumed latent association strength. We found that resulting CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, the profiles of feature weights exhibit detrimental bias toward leading principal component axes. We confirmed these model trends in state-of-the-art datasets containing neuroimaging and behavioral measurements in large numbers of subjects, namely the Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), where we found that only the latter comprised enough samples to obtain stable estimates. Analysis of the neuroimaging literature using CCA to map brain-behavior relationships revealed that the commonly employed sample sizes yield unstable CCA solutions. Our generative modeling framework provides a calculator of dataset properties required for stable estimates. Collectively, our study characterizes dataset properties needed to limit the potentially detrimental effects of overfitting on stability of CCA/PLS solutions, and provides practical recommendations for future studies.

Keywords: Data fusion | Multivariate associations | Canonical Correlation Analysis | Partial Least Squares | Stability | Brain-behavior associations

Significance Statement

Scientific studies often begin with an observed association between different types of measures. When datasets comprise large numbers of features, multivariate approaches such as canonical correlation analysis (CCA) and partial least squares (PLS) are often used. These methods can reveal the profiles of features that carry the optimal association. We developed a generative model to simulate data, and characterized how obtained feature profiles can be unstable, which hinders interpretability and generalizability, unless a sufficient number of samples is available to estimate them. We determine sufficient sample sizes, depending on properties of datasets. We also show that these issues arise in neuroimaging studies of brain-behavior relationships. We provide practical guidelines and computational tools for future CCA and PLS studies.

Author contributions: Conceptualization: MH, SW, AA, SNS, JDM. Methodology: MH, JDM. Software: MH. Formal analysis: MH, SW, AM, BR. Resources: AA, SNS, JDM. Data Curation: AM, JLJ, AH. Writing - Original Draft: MH, JDM. Writing - Review & Editing: All authors. Visualization: MH. Supervision: JDM. Project administration: JDM. Funding acquisition: AA, SNS, JDM.

Competing interests: J.L.J., A.A. and J.D.M. have received consulting fees from BlackThorn Therapeutics. A.A. has served on the Advisory Board of BlackThorn Therapeutics.

^1 To whom correspondence should be addressed. E-mail: [email protected] or [email protected]

Discovery of associations between datasets is a topic of growing importance across scientific disciplines in analysis of data comprising a large number of samples across high-dimensional sets of features. For instance, large initiatives in human neuroimaging collect, across thousands of subjects, rich multivariate neural measures as one dataset and psychometric and demographic measures as another linked dataset (1–3). A major goal is to determine, in a data-driven way, the dominant latent patterns of association linking individual variation in behavioral features to variation in neural features (4–6).

A widely employed approach to map such multivariate associations is to define linearly weighted composites of features in both datasets (e.g., neural and psychometric) and to choose the sets of weights—which correspond to axes of variation—to maximize the association strength (Fig. 1A). The resulting profiles of weights for each dataset can be examined for how the features form the association. If the association strength is measured by the correlation coefficient, the method is called canonical correlation analysis (CCA) (7), whereas if covariance is used the method is called partial least squares (PLS) (5, 8, 9). CCA and PLS are commonly employed across scientific fields, including behavioral sciences (10), biology (11, 12), biomedical engineering (13), chemistry (14), environmental sciences (15), genomics (16), and neuroimaging (4, 17–19).

Although the utility of CCA and PLS is well established, a number of open challenges exist regarding their stability in characteristic regimes of dataset properties. Stability implies that elements of CCA/PLS solutions, such as association strength and weight profiles, are reliably estimated across different independent sample sets from the same population, despite inherent variability in the data. Instability or overfitting can occur if an insufficient amount of data is available to properly constrain the model.

Manifestations of instability and overfitting in CCA/PLS include inflated association strengths (20–22), cross-validated association strengths that are markedly lower than in-sample estimates (23), or feature profiles that vary from study to study (20, 23–26). Stability of models is essential for their replicability, generalizability, and interpretability. Therefore, it is important to assess how stability of CCA/PLS solutions depends on dataset properties.

Instability of CCA/PLS solutions is in principle a known issue (6, 24). Prior studies using a small number of specific datasets or Monte-Carlo simulations have suggested to use between 10 and 70 samples per feature in order to obtain stable models (21, 25, 27). However, it remains unclear how the various elements of CCA/PLS solutions (including association strengths, weights, and statistical power) differentially depend on dataset properties and sampling error, nor how CCA and PLS as distinct methods may exhibit differential robustness across data regimes. To our knowledge, no framework exists to systematically quantify errors in CCA/PLS results, depending on the numbers of samples and features, the assumed latent correlation, and the variance structure in the data, for both CCA and PLS.

To investigate these issues, we developed a generative statistical model to simulate synthetic datasets with known latent axes of association. Sampling from the generative model allowed quantification of deviations between estimated and true CCA or PLS solutions. We found that stability of CCA/PLS solutions requires more samples than are commonly used in published neuroimaging studies. With too few samples, estimated association strengths were too high, and estimated weights could be unreliable for interpretation. CCA and PLS differed in their dependences and robustness, in part due to PLS exhibiting a detrimental bias of weights toward principal axes. We analyzed two large state-of-the-art neuroimaging-psychometric datasets, the Human Connectome Project (2) and the UK Biobank (3), which followed similar trends as our model. These model and empirical findings, in conjunction with a meta-analysis of estimated stability in the brain-behavior CCA literature, suggest that typical CCA/PLS studies in neuroimaging are prone to instability. Finally, we applied the generative model to develop algorithms and a software package for calculation of estimation errors and required sample sizes for CCA/PLS. We end with 10 practical recommendations for application and interpretation of CCA and PLS in future studies (see also Tab. S1).

Fig. 1. Overview of CCA, PLS and the generative model used to investigate their properties. A) Two multivariate datasets, X and Y, are projected separately onto respective weight vectors, resulting in univariate scores for each dataset. The weight vectors are chosen such that the correlation (for CCA) or covariance (for PLS) between X and Y scores is maximized. B) In the principal component coordinate system, the variance structure within each dataset can be summarized by its principal component spectrum. For simplicity, we assume that these spectra can be modeled as power-laws. CCA, uncovering correlations, disregards the variance structure and can be seen as effectively using whitened data (cf. Methods). C) The association between sets is encoded in the association strength of X and Y scores. D) Datasets X and Y are jointly modeled as a multivariate normal distribution. The within-set variance structure (B) corresponds to the blocks on the diagonal, and the associations between datasets (C) are encoded in the off-diagonal blocks.

Results

A generative model for cross-dataset multivariate associations. To analyze sampling properties of CCA and PLS, we need to generate synthetic datasets of stochastic samples with known properties and with known correlation structure across two multivariate datasets. We therefore developed a generative statistical modeling framework that satisfies these requirements, which we refer to as GEMMR (Generative Modeling of Multivariate Relationships). GEMMR is central to all that follows as it allows us to design and generate synthetic datasets, investigate the dependence of CCA/PLS sampling errors on dataset size and assumed covariances, estimate weight errors in CCAs reported in the literature, and calculate sample sizes required to bound estimation errors.

To describe GEMMR, first note that data for CCA and PLS consist of two datasets, given as data matrices X and Y, with respectively px and py features (columns) and an equal number n of samples (rows). We assume a principal component analysis (PCA) has been applied separately to each dataset so that, without loss of information, the columns of X and Y are principal component (PC) scores. The PC scores' variances, which are also the eigenvalues of the within-set covariance matrices, S_XX and S_YY, are modeled to decay with a power-law dependence (Fig. 1B) for PLS, as empirical variance spectra often follow approximate power-laws (for examples, see Fig. S1). For CCA, which optimizes correlations instead of covariances, the two datasets are effectively whitened during the analysis (see Methods) and we can therefore assume that all scores' variances are 1.

Between-set associations between X and Y (Fig. 1C) are summarized in the cross-covariance matrix S_XY. By performing a singular value decomposition of S_XY, a solution for CCA and PLS can be obtained (after whitening for CCA, see Methods), with the singular values giving the association strengths and the singular vectors encoding the weight vectors for the latent between-set association modes. Conversely, given association strengths and weight vectors for between-set association modes (i.e., the solution to CCA or PLS), the corresponding cross-covariance matrix can be assembled making use of the same singular value decomposition (see Methods and Fig. S2). The joint covariance matrix for X and Y is then composed from the within- and between-set covariances (Fig. 1D), and the normal distribution associated with this joint covariance matrix constitutes our generative model for CCA and PLS.
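To make this construction concrete, the following is a minimal NumPy sketch of a GEMMR-style simulation, restricted for simplicity to the whitened (CCA-like) case with a single association mode. All function names are ours and this is not the GEMMR package's API; for PLS, GEMMR additionally uses power-law within-set variance spectra and constrains the weight vectors, which is omitted here.

```python
# Minimal sketch (not the GEMMR implementation) of the generative construction
# described above: identity within-set blocks (whitened data) plus a rank-1
# between-set block encoding one association mode, and a CCA estimate via SVD.
import numpy as np

rng = np.random.default_rng(0)

def make_joint_cov(px, py, r_true):
    """Joint covariance with within-set identity blocks and cross block
    r_true * wx wy^T (positive definite for |r_true| < 1)."""
    wx = rng.standard_normal(px); wx /= np.linalg.norm(wx)
    wy = rng.standard_normal(py); wy /= np.linalg.norm(wy)
    sigma = np.eye(px + py)
    sigma[:px, px:] = r_true * np.outer(wx, wy)
    sigma[px:, :px] = sigma[:px, px:].T
    return sigma, wx, wy

def sample_data(sigma, px, n):
    """Draw n samples from the multivariate normal generative model."""
    z = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n)
    return z[:, :px], z[:, px:]

def inv_sqrt(m):
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca_svd(X, Y):
    """First CCA mode via SVD of the whitened sample cross-covariance."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n = len(X)
    sxx, syy, sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    k = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    u, s, vt = np.linalg.svd(k)
    a, b = inv_sqrt(sxx) @ u[:, 0], inv_sqrt(syy) @ vt[0]
    return s[0], a, b          # canonical correlation and weight vectors

sigma, wx_true, wy_true = make_joint_cov(px=50, py=50, r_true=0.3)
X, Y = sample_data(sigma, px=50, n=5 * 100)   # 5 samples per feature
r_hat, a_hat, b_hat = cca_svd(X, Y)
print(r_hat)   # should typically overestimate 0.3 at this sample size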


In the following we systematically vary the parameters on which the generative model depends and investigate their downstream effects on the stability of CCA and PLS solutions. Specifically, we vary the number of features (keeping the same number of features for both datasets for simplicity), the assumed between-set correlation, the power-laws describing the within-set variances (for PLS), and the number of samples drawn. Weight vectors are chosen randomly and constrained such that the ensuing X and Y scores explain at least half as much variance as an average principal component in their respective sets. For simplicity, we restrict our present analyses to a single between-set association mode. Of note, throughout the manuscript "number of features" denotes the total number across both X and Y, i.e., px + py.

Sample size dependence of estimation error. Using randomly sampled surrogate datasets from our generative model, we characterized the estimation error in multiple elements of CCA/PLS solutions. First, we asked whether a significant association can robustly be detected, quantified by statistical power. To that end we calculate the association strength in each synthetic dataset as well as in 1000 permutations of sample labels, and calculate the probability that association strengths are stronger in permuted datasets, giving a p-value. We repeat this process, and estimate statistical power as the probability that the p-value is below α = 0.05 across 100 synthetic datasets drawn from the same normal distribution with given covariance matrix. For a sufficient number of samples, which depends on the other parameter values, statistical power eventually becomes 1 (Fig. 2A-B). Note that here we use "samples per feature" as an effective sample size measurement to account for the fact that datasets in practice can have widely varying dimensionalities (Figs. S3-S4). A typical value in the brain-behavior CCA/PLS literature is about 5 samples per feature (Fig. S5A), which is also marked in Fig. 2.

Fig. 2. Sample size dependence of CCA and PLS. For sufficiently large sample sizes, statistical power to detect a non-zero correlation converges to 1 (A, B), between-set covariances approach their assumed true value (C, D), and weight (E, F), score (G, H), and loading (I, J) errors become close to 0. Left and right columns show results for CCA and PLS, respectively. For all metrics, convergence depends on the true between-set correlation rtrue and is slower if rtrue is low. Note in C, D) that estimated between-set association strengths overestimate the true values. The true value in C) is the indicated correlation, whereas in D) it is given by the indicated correlation multiplied by the standard deviations of X and Y scores, which depend on the specific weight vectors. The dashed vertical line at 5 samples per feature represents a typically used value (Fig. S5A).

Second, we evaluated the association strength (Fig. 2C-D). While the observed association strength converges to its true value for sufficiently large sample sizes, it consistently overestimates the true value and decreases monotonically with sample size. Moreover, for very small sample sizes, observed association strengths are very similarly high, independent of the true correlation (Fig. S6). Thus, as above, a sufficient sample size, depending on other parameters of the covariance matrix, is needed to bound the error in the association strength. We also compared in-sample estimates for the association strength to cross-validated estimates. We found that cross-validated estimates underestimate the true value (Fig. S7A-B) to a similar degree as in-sample estimates overestimate it (Fig. S7C-D). Interestingly, the average of in-sample and cross-validated association strength was a better estimator than either of the two alone in our simulations (Fig. S7E-F). Finally, bootstrapped association strengths overestimated, on average, slightly more than in-sample estimates (Fig. S8A-B).
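The permutation-based p-value and power estimate described above can be sketched as follows. The helper functions are ours; the association-strength estimator is passed in as a callable, so that, for example, the cca_svd helper from the earlier sketch could be plugged in (an assumption of this illustration). With 1000 permutations and 100 synthetic datasets the full procedure is computationally heavy, so smaller numbers may be preferred for a quick check.

```python
# Sketch of the permutation test and power estimate described in the text.
import numpy as np

def perm_pvalue(X, Y, assoc, n_perm=1000, rng=np.random.default_rng(1)):
    """p-value from permuting sample labels of Y; `assoc(X, Y)` returns the
    association strength (e.g. the first canonical correlation)."""
    r_obs = assoc(X, Y)
    n_ge = sum(assoc(X, rng.permutation(Y)) >= r_obs for _ in range(n_perm))
    return (n_ge + 1) / (n_perm + 1)

def power(make_dataset, assoc, n_datasets=100, alpha=0.05):
    """Statistical power: fraction of synthetic datasets with p-value < alpha.
    `make_dataset()` draws a fresh (X, Y) pair from the generative model."""
    return float(np.mean([perm_pvalue(*make_dataset(), assoc) < alpha
                          for _ in range(n_datasets)]))

# Example usage, reusing helpers from the previous sketch (assumed in scope):
# power(lambda: sample_data(sigma, 50, 500), lambda X, Y: cca_svd(X, Y)[0])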

Third, CCA and PLS solutions provide weights that encode the nature of the association in each dataset. We quantify the corresponding estimation error as the cosine distance between the true and estimated weights, separately for X and Y, taking the greater of the two. As the sign of weights is ambiguous in CCA and PLS, it is chosen to obtain a positive correlation between observed and true weight. We found that weight error decreases monotonically with sample size (Fig. 2E-F). Bootstrapped weight errors were again, on average, slightly larger than in-sample estimates (Fig. S8C-F), while the variability of individual weight elements across repeated datasets can be well approximated through bootstrapping (Fig. S8G-H).

Fourth, CCA or PLS solutions provide scores which represent a latent value assigned to each sample (e.g., subject). Applying true and estimated weights to common test data to obtain test scores, score error is quantified as 1 − Spearman correlation between true and estimated scores. It also decreased with sample size (Fig. 2G-H).

Finally, some studies report loadings, i.e., the correlations between original data features and CCA/PLS scores (Fig. S9). In practice, original data features are generally different from principal component scores, but as the relation between these two data representations cannot be constrained, we calculate all loadings here with respect to principal component scores. Moreover, to compare loadings across repeated datasets we calculate loadings for a common test set, as for CCA/PLS scores. The loading error is then obtained as 1 − Pearson correlation between test loadings and true loadings. Like the other error metrics, it decayed with sample size (Fig. 2I-J). Interestingly, convergence for PLS is somewhat worse than for CCA across all metrics assessed in Fig. 2.

Weight error and stability. Fig. 2 quantifies the effect of sampling error on various aspects of the model in terms of summary statistics. We next focus on the error and stability of the weights, due to their centrality in CCA/PLS analysis in describing how features carry the between-set association. First we illustrate how weight vectors are affected when typically used sample-to-feature ratios are used. For this illustration we set up a joint covariance matrix with a true between-set correlation of 0.3, assuming 100 features per dataset, and then generated synthetic datasets with either 5 or 50 samples per feature. Using 5 samples per feature, estimated CCA weights varied so strongly that the true weights were not discernible in the confidence intervals (Fig. 3A). In contrast, with 50 samples per feature the true weights became more resolved. For PLS, the confidence interval for weights estimated with 5 or 50 samples per feature did not even align with the true weights (Fig. 3B), indicating that even more samples than for CCA should be used.

Fig. 3. A large number of samples is required to obtain good weight estimates. A) With rtrue = 0.3 and 100 features per dataset, CCA leads to large uncertainty in weights when 5 samples per feature are used. With 50 samples per feature, on the other hand, a much better estimate is possible. The 100 dimensions of the weight vectors are shown along the x-axis, ordered according to the elements of the true weight vector. B) For PLS even more samples are necessary. Note that the confidence intervals for 5 and 50 samples per feature overlap almost entirely. C) Weight stability, i.e. the average cosine similarity between weights across pairs of repetitions, increases from very low values to 1 (identical weights) with more samples. 10 different covariance matrices were simulated and their individual similarity curves overlap with the mean curve (solid). D) For PLS, 100 different covariance matrices (faint dashed curves) have more variable similarity curves than for CCA, but all eventually converge to 1, as does their mean (solid). E-F) Sample sizes required to obtain less than 10% weight error are shown for CCA and PLS, depending on the assumed true correlation rtrue and the total number of features in the data. Unless rtrue is high, 100s to 1000s of samples are required, and more for PLS than for CCA.

We next assessed weight stability, i.e., the consistency of estimated weights across independent sample datasets. We quantified weight stability as the cosine similarity between weights obtained from two independently drawn datasets, averaged across pairs of datasets. When the datasets consisted of only few samples, the average weight stability was close to 0 for CCA and eventually converged to 1 (i.e., perfect similarity) with more samples (Fig. 3C). PLS exhibited striking differences from CCA: mean weight stability had a relatively high value even at low sample sizes where weight error is very high (Figs. 3D, 2F), with high variability across datasets.

Finally, to show the dependence of weight error on the assumed true between-set correlation and the number of features, we estimated the number of samples required to obtain less than 10% weight error (Fig. 3E-F). The required sample size is higher for increasing number of features, and lower for increasing true between-set correlation. More samples were required for PLS than for CCA. We also observe that, by this metric, required sample sizes can be much larger than typically used sample sizes in CCA/PLS studies.

Weight PC1 bias in PLS. Figs. 3 and 2E-F show that at low sample sizes, PLS weights exhibit, on average, high error but also reasonably high stability. This combination suggests a systematic bias in PLS weights toward a different axis than the true latent axis of association. To gain further intuition of this phenomenon, we first consider the case of both datasets comprising 2 features each, so that weight vectors are 2-dimensional unit vectors lying on a circle.
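The error and stability metrics used above can be written compactly. The sketch below uses our own helper names, shows the metrics for one dataset's weights for brevity (the paper takes the greater of the X and Y weight errors), and assumes SciPy is available for the rank and Pearson correlations.

```python
# Sketch of the evaluation metrics described in the text (our naming).
import numpy as np
from scipy.stats import spearmanr, pearsonr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weight_error(w_true, w_hat):
    """Sign-aligned cosine distance: the sign of weights is arbitrary."""
    return 1.0 - abs(cosine(w_true, w_hat))

def score_error(w_true, w_hat, X_test):
    """1 - Spearman correlation between true and estimated test scores."""
    s_true, s_hat = X_test @ w_true, X_test @ w_hat
    return 1.0 - abs(spearmanr(s_true, s_hat)[0])

def loading_error(w_true, w_hat, X_test):
    """1 - Pearson correlation between true and estimated test loadings."""
    s_true, s_hat = X_test @ w_true, X_test @ w_hat
    p = X_test.shape[1]
    load_true = np.array([pearsonr(X_test[:, j], s_true)[0] for j in range(p)])
    load_hat = np.array([pearsonr(X_test[:, j], s_hat)[0] for j in range(p)])
    return 1.0 - abs(pearsonr(load_true, load_hat)[0])

def weight_stability(weights):
    """Mean pairwise cosine similarity of weights from repeated datasets."""
    sims = [abs(cosine(weights[i], weights[j]))
            for i in range(len(weights)) for j in range(i + 1, len(weights))]
    return float(np.mean(sims))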


Setting rtrue = 0.3, we drew synthetic datasets from the normal distribution and performed CCA or PLS on these. When 50 samples per feature were used, all resulting weight vectors scattered tightly around the true weight vectors (Fig. 4A-B). With only 5 samples per feature, which is typical in CCA/PLS studies (Fig. S5A), the distribution was much wider. For CCA the circular histogram peaked around the true value. In contrast, for PLS the peak was shifted towards the first principal component axis when 5 samples per feature were used.

Next, we investigated how this weight bias toward the first principal component in PLS manifests more generally. We first considered an illustrative data regime (64 features/dataset, rtrue = 0.3). We quantified the PC bias as the cosine similarity between estimated weight vectors and a principal component axis. Compared to CCA, for PLS there was a strong bias toward the dominant PCs, even with a large number of samples (Fig. 4C,D). Note also that the average PC bias in permuted datasets was similar to that in unpermuted datasets, for both CCA and PLS. Finally, these observations also held for datasets with differing numbers of features and true correlations. For PLS the weight vectors are biased toward the first principal component axis, compared to CCA, and more strongly than random weight vectors, particularly when few samples per feature were used to estimate them (Fig. 4F).
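Because the synthetic data are expressed in their principal component basis, the PC bias defined above reduces to a normalized weight entry. A minimal sketch (names are ours), including the random-vector reference used for the relative PC1 bias:

```python
# Sketch of the PC-bias measure: cosine similarity between a weight vector and
# the k-th PC axis, which in the PC basis is the k-th canonical basis vector.
import numpy as np

def pc_bias(w, k=0):
    """|cosine similarity| between weight vector w and the k-th PC axis."""
    return abs(w[k]) / np.linalg.norm(w)

def expected_random_pc_bias(p, n_draws=10000, rng=np.random.default_rng(2)):
    """Reference value: average PC1 bias of randomly oriented unit vectors."""
    v = rng.standard_normal((n_draws, p))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.abs(v[:, 0])))

# Relative PC1 bias of an estimated weight vector w_hat with p features:
# pc_bias(w_hat) / expected_random_pc_bias(len(w_hat))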

Empirical brain-behavior data. Do these phenomena observed in synthetic data from our generative modeling framework also hold in empirical data? We focused on two state-of-the-art population neuroimaging datasets: the Human Connectome Project (HCP) (2) and UK Biobank (UKBB) (3). Both datasets provide multi-modal neuroimaging data along with a wide range of behavioral and demographic measures, and both have been used in prior studies using CCA to map brain-behavior relationships (3, 4, 28–32). HCP, comprising around 1200 subjects, is one of the larger neuroimaging datasets available and is of exceptional quality. We analyzed two neuroimaging modalities in the HCP dataset, resting-state functional MRI (fMRI) (in 948 subjects) and diffusion MRI (dMRI) (in 1020 subjects). UKBB is a population-level study and, to our knowledge, the largest available neuroimaging dataset. We analyzed fMRI features from 20000 UKBB subjects. HCP and UKBB thereby provide two independent testbeds, across neuroimaging modalities and with large numbers of subjects, to investigate error and stability of CCA/PLS in brain-behavior data.

Fig. 4. PLS weights are biased toward the first principal component axis, even in the absence of an association. A-B) For illustration, we used a joint covariance matrix assuming a true correlation of 0.3 between datasets and 2 features each for both X and Y datasets. In this 2-dimensional setting weight vectors, scaled to unit length, lie on a circle. Samples of indicated sizes were generated repeatedly from the model. The histogram of estimated weights as a function of the angle on the circle is shown. 50 samples per feature resulted in good estimates of the weight vectors. 5 samples per feature gave good estimates in many cases, but notably all possible weight vectors occurred frequently. The estimation error was worse in PLS, where also the mode of the distribution deviated from the true value and was shifted towards the first principal component axis when 5 samples per feature were used. Dots near the border of the semi-circles indicate directional means of the distributions. C-D) Another example with 64 features per dataset. Here, we define PC bias (on the y-axis) as the cosine similarity between a weight vector and a principal component axis. Compared to CCA (C), for PLS (D) there was a strong bias towards the dominant PC axes. E-F) The bias towards the first principal component (y-axis) was stronger for PLS (F) than for CCA (E) also for datasets with varying number of features and true correlations. Shown is the relative PC1 bias across synthetic datasets with varying number of features, relative to the expected PC1 bias of a randomly chosen vector with dimension matched to each synthetic dataset.

After modality-specific preprocessing (see Methods), both datasets in each of the three analyses were deconfounded and reduced to 100 principal components (see Methods and Fig. S10), in agreement with prior CCA studies of HCP data (4, 28–32) (see Fig. S11 for a re-analysis of HCP functional connectivity vs. behavior in which a smaller number of principal components was selected according to an optimization procedure (33)). Functional connectivity features were extracted from fMRI data and structural connectivity features were extracted from dMRI. Note that, as only a limited number of samples were available in these empirical datasets, we cannot use increasingly more samples to determine how CCA or PLS converge with sample size (as we did with synthetic data above). Instead, we repeatedly subsampled the available data to varying sizes, from 202 up to 80% of the available number of samples.

We found that the first mode of association was statistically significant for all three sets of data and for both CCA and PLS. Association strengths decreased with increasing size of the subsamples, but clearly converged only for the UKBB data. Cross-validated association strength estimates increased with subsample size and, for UKBB, converged to the same value as the in-sample estimate. Fig. 5A overlays reported CCA results from other publications that used 100 features per set in HCP data, which further confirms the decreasing trend of association strength as a function of sample size.
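A sketch of the subsampling analysis just described, under the assumption that X and Y are the deconfounded, PCA-reduced empirical data matrices and that the cca_svd helper from the earlier sketch is available. The cross-validation scheme shown (5-fold, correlating held-out scores) is one simple choice for illustration, not necessarily the exact scheme used in the paper.

```python
# Sketch: in-sample, cross-validated, and permuted association strengths
# across repeated subsamples of an empirical dataset.
import numpy as np

rng = np.random.default_rng(3)

def cv_corr(X, Y, n_folds=5):
    """Cross-validated association strength: fit weights on training folds,
    correlate the held-out scores."""
    idx = rng.permutation(len(X))
    rs = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        _, a, b = cca_svd(X[train], Y[train])      # helper from earlier sketch
        rs.append(np.corrcoef(X[fold] @ a, Y[fold] @ b)[0, 1])
    return float(np.mean(rs))

def subsample_curve(X, Y, sizes, n_rep=10):
    out = []
    for n in sizes:
        for _ in range(n_rep):
            sub = rng.choice(len(X), size=n, replace=False)
            r_in, _, _ = cca_svd(X[sub], Y[sub])
            r_cv = cv_corr(X[sub], Y[sub])
            r_perm, _, _ = cca_svd(X[sub], rng.permutation(Y[sub]))
            out.append((n, r_in, r_cv, r_perm))
    return out   # in-sample estimates should shrink toward r_cv as n grows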



Fig. 5. CCA and PLS analysis of empirical population neuroimaging datasets. We employed CCA (left 2 columns) and PLS (right 2 columns) to investigate multivariate associations between A-D) functional connectivity derived from resting-state functional MRI (fMRI) and behavior from the Human Connectome Project (HCP), E-H) structural connectivity derived from diffusion MRI (dMRI) and behavior in HCP data, and I-L) fMRI-derived resting-state functional connectivity and behavior in data from UK Biobank (UKBB). All datasets were reduced to 100 principal components before CCA or PLS (see Methods for preprocessing details). Note the overall agreement of results between different types of data (first and second row) and between data of similar nature from different sources and with different sample size (first and third row): i) For all datasets and for both CCA and PLS a significant mode of association was detected. p-values were 0.001 for all 3 CCAs and 0.006, 0.001, 0.001 for PLS (top to bottom). ii) We subsampled the available data to varying degree and estimated the association strength. Association strengths monotonically decreased with sample size (orange in column 1, green in column 3). Association strengths for permuted data are shown in grey (with orange and green outlines in columns 1 and 3, respectively). Deviations of the orange and green curves from the grey curves occur for sufficient sample sizes and correspond to significant p-values. Note how the curves clearly flatten for UKBB but not for HCP data, where the number of available subjects is much lower. The circle indicates the estimated value using all available data and the vertical bar in the same color below it denotes the corresponding 95% confidence interval obtained from permuted data. In A) we also overlaid reported canonical correlations from other studies that used HCP data reduced to 100 principal components. iii) Cross-validated association strengths are shown in red (column 1) and blue (column 3). They start deviating from values obtained in permuted datasets (grey with red and blue outlines in columns 1 and 3, respectively) at around the same sample size as in-sample estimates do from their permuted datasets. The triangle indicates the cross-validated association strength using all data and the vertical bar in the same color below it denotes the corresponding 95% confidence interval obtained from permuted data. Cross-validated association strengths were always lower than in-sample estimates and increased with sample size. For UKBB (but not yet for HCP) cross-validated association strengths converged to the same value as the in-sample estimate. iv) Weight stability (columns 2 and 4) reached values of around 0.8-0.9 in subsamples using 80% of the HCP data, whereas with 80% of UKBB data weight stability was essentially 1 (i.e. identical weights for different subsamples of the same size). Weight stability was defined as pairwise cosine similarity between weight estimates from repeated subsamples of the same size. v) Weight PC1 bias was close to 0 (i.e. no overlap with the first principal component axis) for CCA weights and CCA weights estimated from permuted data (column 2). In contrast, weight PC1 bias for PLS was substantially higher. PC1 bias was defined as the maximum across the 2 datasets of the absolute value of the cosine similarity between weights and the first principal component axis.

Weight stabilities (i.e., the cosine similarities between weights estimated for different subsamples of the same size) increased with sample size but reached values close to 1 (perfect similarity) only for UKBB data. Moreover, PC1 bias was close to 0 for CCA but markedly larger for PLS weights. All these results were in agreement with analyses of synthetic data discussed above (Figs. 2-4). Altogether, we emphasize the overall similarity between CCA analyses of different data modalities and features (first and second row in Fig. 5) and data of similar nature from different sources (first and third row in Fig. 5). This suggests that sampling error is a major determinant of CCA and PLS outcomes, across imaging modalities and for independent data sources. Note also that stable CCA and PLS results with a large number of considered features can be obtained with sample sizes that become available with UKBB-level datasets.

Samples per feature alone predicts published CCA strengths. We next examined stability and association strengths in CCA analyses of empirical datasets more generally. To that end we performed an analysis of the published literature using CCA with neuroimaging data to map brain-behavior relationships. From 100 CCAs that were reported in 31 publications (see Methods), we extracted the number of samples, number of features, and association strengths. As the within-set variance spectrum is not typically reported, but would be required to assess PLS results (as described above), we did not perform such an analysis for PLS.

Most studies used fewer than 10 samples per feature (Figs. 6A and S5A). Overlaying reported canonical correlations as a function of samples per feature on top of predictions from our generative model shows that most published CCAs we compiled are compatible with a range of true correlations, from about 0.5 down to 0 (Fig. 6A).
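For illustration only, the following shows a regression of reported canonical correlations on (log) samples per feature, of the kind underlying the prediction quoted below. The arrays are hypothetical placeholders, not the values extracted from the 31 publications, and the log-linear form is our assumption.

```python
# Illustration: predicting reported canonical correlations from samples per
# feature with a simple regression (placeholder data, assumed log-linear form).
import numpy as np

samples_per_feature = np.array([2.5, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0])  # hypothetical
reported_r = np.array([0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30])       # hypothetical

x = np.log10(samples_per_feature)
slope, intercept = np.polyfit(x, reported_r, 1)
pred = slope * x + intercept
r_squared = 1 - np.sum((reported_r - pred) ** 2) / np.sum((reported_r - reported_r.mean()) ** 2)
print(slope, intercept, r_squared)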



Fig. 6. CCAs reported in the literature might often be unstable. A) Canonical correlations and the number of samples per feature are extracted from the literature and overlaid on predictions from the generative model. Many studies employed a small number of samples per feature (cf. also Fig. S5A) and reported a large canonical correlation. These studies fall in the top-left corner of the plot, where predictions from the generative model for rtrue < 0.5 and also the null data (having no between-set correlation, resulting from permuted datasets) are indistinguishable (see also Fig. S6A). In fact, the reported canonical correlation can be predicted from the used number of samples per feature alone using linear regression (R2 = 0.83). We also estimated the weight error (encoded in the colorbar) for each reported CCA (details are illustrated in Fig. S12). The farther away a CCA lies from the predictions for permuted data, the lower the mean estimated weight error (cf. Fig. S5B). B) The distribution of estimated weight errors for each reported CCA is shown along the y-axis. For many studies weight errors could be quite large, suggesting that conclusions drawn from interpreting weights might not be robust.

Fig. 7. For PLS, cross-loadings provide more stable estimates of feature profiles than loadings. The samples per feature required to obtain less than 10% error in either loadings or cross-loadings are compared. Shown here is their relative difference, i.e. the required samples per feature for cross-loadings minus that for loadings, divided by the required samples per feature for loadings. A) Relative differences were small for CCA. B) However, for PLS fewer samples were required with cross-loadings than with loadings to obtain the same error level.
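As a reminder of the two quantities compared in Fig. 7, the sketch below (our helper names, one dataset shown) computes loadings, which correlate a dataset's features with its own CCA/PLS scores, and cross-loadings, which correlate them with the other dataset's scores.

```python
# Sketch: loadings vs. cross-loadings for the X dataset, given weight vectors
# wx, wy estimated by CCA or PLS.
import numpy as np

def loadings_and_crossloadings(X, Y, wx, wy):
    sx, sy = X @ wx, Y @ wy                       # CCA/PLS scores
    def corr_with(features, score):
        f = features - features.mean(0)
        s = score - score.mean()
        return (f.T @ s) / (np.linalg.norm(f, axis=0) * np.linalg.norm(s))
    x_loadings = corr_with(X, sx)                 # X features vs. X scores
    x_crossloadings = corr_with(X, sy)            # X features vs. Y scores
    return x_loadings, x_crossloadings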

Interestingly, despite the fact that these studies investigated different questions using different datasets and modalities, the reported canonical correlation could be well predicted simply by the number of samples per feature alone (R2 = 0.83).

We next asked whether weight errors can be estimated from published CCAs. As these are unknown in principle, we estimated them using our generative modeling framework. We did this by (i) generating synthetic datasets of the same size as a given empirical dataset, sweeping through assumed true correlations between 0 and 1, (ii) selecting those synthetic datasets for which the estimated canonical correlation matches the empirically observed one, and (iii) using the weight errors in these matched synthetic datasets as a proxy for the weight error in the empirical dataset (Fig. S12). This resulted in a distribution of weight errors across the matching synthetic datasets for each published CCA study that we considered. The means of these distributions are overlaid in color in Fig. 6A and the range of the distributions is shown in Fig. 6B. The mean weight error falls off roughly with the distance to the correlation-vs-samples/feature curve for permuted data (see also Fig. S5B). Altogether, these analyses suggest that many published CCA studies might have unstable feature weights due to an insufficient sample size.

Benefit of cross-loadings in PLS. Given the instability associated with estimated weight vectors, we investigated whether other measures provide better feature profiles. Specifically, we compared loadings and cross-loadings. Cross-loadings are the correlations across samples between CCA/PLS scores of one dataset and the original data features of the other dataset (unlike loadings, which are the correlations between CCA/PLS scores and original features of the same dataset). In CCA, they are collinear (see Methods and Fig. S13A), and obtaining estimates with at most 10% loading or cross-loading error required about the same number of samples (Fig. 7A). For PLS, on the other hand, true loadings and cross-loadings were, albeit not collinear, still very similar (Fig. S13B), but cross-loadings could be estimated to within 10% error with about 20% to 50% fewer samples than loadings in our simulations (Fig. 7B).

Calculator for required sample size. In both synthetic and empirical datasets we have seen that sample size plays a critical role in guaranteeing stability and interpretability of CCA and PLS, and that many existing applications may suffer from a lack of samples. How many samples are required, given particular dataset properties? We answer this question with the help of GEMMR, our generative modeling framework described above. Specifically, we suggest to base the decision on a combination of criteria, by bounding statistical power as well as relative error in association strength, weight error, score error and loading error at the same time. Requiring at least 90% power and admitting at most 10% error for the other metrics, we determined the corresponding sample sizes in synthetic datasets by interpolating the curves in Fig. 2 (see Fig. S14 and Methods). The results are shown in Fig. 8 (see also Figs. S15-S16). Assuming, for example, that the decay constants of the variance spectra satisfy ax + ay = −2 for PLS, several hundreds to thousands of samples are necessary to achieve the indicated power and error bounds when the true correlation is 0.3 (Fig. 8A). More generally, the required sample size per feature as a function of the true correlation roughly follows a power-law dependence, with a strong increase in required sample size when the true correlation is low (Fig. 8B). Interestingly, PLS generally needs more samples than CCA. As mentioned above, accurate estimates of the association strength alone (as opposed to power, association strength, weight, score and loading error at the same time) could be obtained in our simulations with fewer samples, by averaging the in-sample with a cross-validated estimate (Fig. S7E-F). Moreover, accurate estimates of a PLS feature profile required fewer samples when assessed as cross-loadings (Fig. 7B).

Given the complexity and computational expense of generating and analyzing enough synthetic datasets to obtain sample size estimates in the way described above, we finally asked whether we could formulate a concise, easy-to-use description of the relationship between model parameters and required sample size. To that end, we fitted a linear model to the logarithm of the required sample size, using logarithms of the total number of features and the true correlation as predictors (Fig. S17). For PLS, we additionally included a predictor for the decay constant of the within-set variance spectrum, |ax + ay|. Using split-half predictions to validate the model, we find very good predictive power for CCA (Fig. S17B), while it is somewhat worse for PLS (Fig. S17C). When we added an additional predictor to the PLS model, measuring the fraction of explained variance along the weight vectors in both datasets, predictions improved notably (Fig. S17D), showing that the linear model approach is also suitable for PLS in principle. As the explained variance along the true weight vectors is unobservable in practice, though, we propose to use the linear model without the explained-variance predictor.
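A hedged sketch of the linear sample-size model just described: the functional form follows the text (log required sample size regressed on the logarithms of the total number of features and the true correlation), but the coefficients below are placeholders; the fitted values are reported in Fig. S17 and implemented in the GEMMR package, not here.

```python
# Placeholder sketch of the log-linear sample-size model (CCA case; a PLS
# variant would add a |ax + ay| term).  Coefficients are illustrative only.
import numpy as np

def required_n(p_total, r_true, coef=(3.0, 1.0, -2.0)):   # placeholder coefficients
    b0, b_p, b_r = coef
    log_n = b0 + b_p * np.log10(p_total) + b_r * np.log10(r_true)
    return 10 ** log_n

# e.g. required_n(200, 0.3) gives a rough estimate under the placeholder
# coefficients above; it is not a value from the fitted model.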


Fig. 8. Required sample sizes. Sample sizes to obtain at least 90% power as well as at most 10% error for the association strength, weights, scores and loadings. Shown PLS estimates are constrained by the within-set variance spectrum (here ax + ay = −2, cf. Fig. S16 for other values). A) Assuming a true between-set correlation of rtrue = 0.3, 100s to several 1000s of samples are required to reach target power and error levels. See Fig. S15A-B for other values of rtrue. B) The required number of samples divided by the total number of features in X and Y scales with rtrue. For rtrue = 0.3 about 50 samples per feature are necessary to reach target power and error levels in CCA, which is much more than typically used (cf. Fig. S5A). More samples are necessary for PLS than for CCA, and if the true correlation is smaller. Every point for a given rtrue represents a different number of features and is slightly jittered for visibility.

Discussion

We characterized CCA and PLS through a parameterized generative modeling framework. CCA and PLS require a sufficient number of samples to work as intended, and the required sample size depends on the number of features in the data, the assumed true correlation, and (for PLS) the principal component variance spectrum of each dataset.

Generative model for CCA and PLS. At least for CCA, the distribution of sample canonical correlations has been reported to be intractable, even for normally distributed data (34). Thus, a generative model is an attractive alternative to investigate sampling properties. Our generative model for CCA and PLS made it possible to investigate all aspects of a solution, beyond just the canonical correlations, at the cost of higher computational expense. For example, the generative model can be used to systematically explore parameter dependencies, to assess stability, to calculate required sample sizes in new studies, and to estimate weight stability in previously published studies. While this generative model was developed for CCA and PLS, it can also be used to investigate related methods like sparse variants (35, 36).

Pitfalls in CCA and PLS. Association strengths can be overestimated and, at least for CCA when the number of samples per feature as well as the true correlation are low, observed canonical correlations can be compatible with a wide range of true correlations, down to zero (Fig. S6). Estimated weight vectors need not resemble the true weights when the number of samples is low and can overfit, i.e. vary strongly between sampled datasets (Fig. 3), significantly affecting their interpretability and generalizability. Furthermore, PLS weights are also biased away from the true value toward the first principal component axis (Fig. 4). As a consequence, similarity of weights from two different samples of the population is necessary but not sufficient to infer replicability. The PC1 bias also existed for null data. Therefore, estimated weights that strongly resemble the first principal component axis need not indicate an association, but could instead indicate the absence of an association, or insufficient sample size. Importantly, we have shown that the same pitfalls also appear in empirical data.

Differences between CCA and PLS. First and foremost, CCA and PLS have different objectives: while CCA finds weighted composites with the strongest possible correlation between datasets, PLS maximizes their covariance. When features do not have a natural commensurate scale, CCA can be attractive due to its scale invariance (see Fig. 1 and Methods). In situations where both analyses make sense, PLS comes with the additional complication that estimated weights are biased towards the first principal component axis. Moreover, our analyses suggest that the required number of samples for PLS depends on the within-set principal component variance spectrum and is generally higher than for CCA. Based on these arguments, CCA might often be preferable to PLS.

Sample size calculator for CCA and PLS. Previous literature, based on small numbers of specific datasets or Monte-Carlo simulations, has suggested using between 10 and 70 samples per feature for CCA (21, 25, 27). Beyond that, our calculator is able to suggest sample sizes for the given characteristics of a dataset, and can do so for both CCA and PLS. As an example, consider the UKBB data in Fig. 5. Both in-sample and cross-validated CCA association strengths converge to about 0.5. Fig. 8B then suggests to use about 20 samples per feature, i.e. 4000 samples, to obtain at least 90% power and at most 10% error in the other metrics. This is compatible with Fig. 5J: at 4000 subjects weight stability is about 0.8 (note that weight stability measures similarity of weights between different repetitions of the dataset; we expect the similarity of a weight vector to the true weight vector—which is the measure going into the sample size calculation—to be slightly higher on average). Our calculator is made available as an open-source Python package named GEMMR (Generative Modeling of Multivariate Relationships).

Brain-behavior associations. CCA and PLS have become popular methods to reveal associations between neuroimaging and behavioral measures (3, 4, 17, 18, 23, 29–32, 37). The main interest in these applications lies in interpreting weights or loadings to understand the profiles of neural and behavioral features carrying the brain-behavior relationship. We have shown, however, that stability and interpretability of weights or loadings are contingent on a sufficient sample size which, in turn, depends on the true between-set correlation.
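A minimal arithmetic check of the worked example above, using only per-feature numbers quoted in the text (about 50 samples per feature for CCA at rtrue = 0.3 per Fig. 8B, about 20 at rtrue ≈ 0.5 per the UKBB example); the dictionary below is a lookup of those quoted values, not a computation.

```python
# Worked example: UKBB-style analysis with 100 + 100 features and r_true ~ 0.5.
samples_per_feature = {0.3: 50, 0.5: 20}   # CCA values quoted in the text
total_features = 100 + 100
required = samples_per_feature[0.5] * total_features
print(required)   # 4000 samples, matching the estimate in the Discussion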


or loadings are contingent on a sufficient sample size which, in turn, depends on the true between-set correlation.

How strong are true between-set correlations? While this depends on the data at hand, and is in principle unknown a priori, our analyses provide estimates in the case of brain-behavior associations. First, we saw in UKBB data that both in-sample and cross-validated canonical correlations converged to a value of around 0.5. As the included behavioral measures comprised a wide range of categories (cognitive, physical and life-style measures, and early life factors), this canonical correlation is probably more on the upper end, such that brain-behavior associations probing more specialized modes are likely lower. Second, we saw in a literature analysis of brain-behavior CCAs that reported canonical correlations, as a function of sample-to-feature ratios, largely follow the trends predicted by our generative model, despite the different datasets investigated in each study. We also saw that the few studies with 10-20 samples per feature reported canonical correlations around 0.5-0.7, while most studies with substantially more than 10 samples per feature appeared to be compatible only with values ≤ 0.3. We therefore conclude that true canonical correlations in brain-behavior applications are probably not greater than 0.3 in many cases.

Assuming a true between-set correlation of 0.3, our generative model implies that about 50 samples per feature are required at minimum to obtain stability in CCA results. We have shown that many published brain-behavior CCAs do not meet this criterion. Moreover, in HCP data we saw clear signs that the available sample size was too small to obtain stable solutions, despite the fact that the HCP data comprise around 1000 subjects and constitute one of the largest and highest-quality neuroimaging datasets available to date. On the other hand, with UKBB data, where we used 20000 subjects, CCA and PLS results appeared to have converged. As the resources required to collect samples of this size go well beyond what is available to typical research groups, this observation supports the accrual of datasets that are shared widely (38, 39).

Generalizability. Small sample and effect sizes have been identified as challenges for neuroimaging that impact replicability and generalizability (40, 41). Here, we have considered the stability of CCA/PLS analyses and found that observed association strengths decrease as the samples-per-feature ratio increases. Similarly, a decrease in reported effect size with increasing sample size has been reported in meta-analyses of various classification tasks on neuroimaging measures (42). These sample-size dependences of the observed effect sizes are an indication of instability.

A judicious choice of sample size, together with an estimate of the effect size, is thus advisable at the planning stage of an experiment or CCA/PLS analysis. Our generative modeling framework provides estimates for both. Beyond that, non-biological factors, such as batch or site effects (43-46), scan duration (47), and flexibility in the data processing pipeline (48, 49), certainly contribute to unstable outcomes and could be addressed in extensions of the generative model. External validation with separate datasets is also necessary to establish the generalizability of findings beyond the dataset under investigation.

Limitations and future directions. For tractability it was necessary to make a number of assumptions in our study. Except for Fig. 6 it was assumed that both datasets had an equal number of features (but see Fig. S4, where we used different numbers of features for the two datasets). We also assumed that data were normally distributed, which is often not true in practice. For example, cognitive scores are commonly recorded on an ordinal scale. To address that, we used empirical datasets and found similar sample size dependencies as in synthetic datasets. In an investigation of the stability of CCA for non-normal data, varying kurtosis had minimal effects (27). We further assumed the existence of a single cross-modality axis of association, but in practice several might be present. In that latter case, theoretical considerations suggest that even larger sample sizes are needed (50, 51). Moreover, we assumed that data are described in a principal component (PC) basis. In practice, however, PCs and the number of PCs need to be estimated, too. This introduces additional uncertainty, although presumably of lesser influence than the inherent sampling error in CCA and PLS. Furthermore, we used "samples per feature" as an effective sample size parameter to account for the fact that datasets in practice have very different dimensionalities. Figs. S3-S4 show that power and error metrics for CCA are parameterized well in terms of "samples per feature", whereas for PLS this parameterization is only approximate. Nonetheless, as "samples per feature" is arguably the most straightforward parameter to interpret, we presented results in terms of "samples per feature" for both CCA and PLS.

Several related methods have been proposed to potentially circumvent shortcomings of standard CCA and PLS (see (19) for a recent review). Regularized or sparse CCA or PLS (35, 36) have been designed to mitigate the problem of small sample sizes. They modify the modeling objective by introducing a penalty for the elements of the weight vectors, encouraging them to "shrink" to smaller values. The goal of this modification is to obtain more accurate predictions, but it will also bias the solutions away from their true values. (We assume that, in general, the true weight vectors are non-sparse.) Conceptually, thus, these variants follow more a "predictive" than an "inferential" modeling goal (52, 53). Our analysis pipeline, evaluated with a commonly used sparse CCA method (35), suggested that in some situations, namely high dimensionalities and low true correlations, fewer samples were required than for CCA to obtain the same bounds on evaluation metrics (Fig. S18). Nonetheless, although sparse CCA can in principle be used with fewer samples than features, these required sample sizes were still many times the number of features: when r_true = 0.3, for example, 35-50 samples per feature (depending on the number of features) were required. We note, however, that a complete characterization of sparse CCA or PLS methods was beyond the scope of this manuscript. PLS has been compared to sparse CCA in a setting with more features than samples, and it has been concluded that the former (latter) performs better when having fewer (more) than about 500 features per sample (54). We note that sparse methods are also often used in classification tasks, where they have been observed to provide better prediction but less stable weights (55, 56), which indicates a trade-off between prediction and inference (55). Correspondingly, it has been suggested to consider weight stability as a criterion in sparsity parameter selection (55, 57, 58).


Moreover, whereas CCA and PLS are restricted to discovering linear relationships between two datasets, there exist non-linear extensions, such as kernel (59, 60), deep (61) or non-parametric (62) CCA, as well as extensions to multiple datasets (63). Due to their increased expressivity, and therefore capacity to overfit, we expect them to require even larger sample sizes. For classification, kernel and deep-learning methods have been compared to linear methods, using neuroimaging-derived features as input (64). Accuracy was found to be similar for kernel, deep-learning and linear methods and also had a similar dependence on sample size, using up to 8000 subjects. Finally, we note that a relative of PLS, PLS regression, treats the two datasets asymmetrically, deriving scores from one dataset to predict the other (5, 8, 9).

The number of features in the datasets was an important determinant of stability. Thus, methods for dimensionality reduction hold great promise. On the one hand, there are data-driven methods that, for example, select the number of principal components in a way that takes the between-set correlation into account (33). Applying this method to HCP data, we saw that the reduced number of features the method suggests leads to slightly better convergence (Fig. S11). On the other hand, previous knowledge could be used to preselect the features hypothesized to be most relevant for the question at hand (65-67).

Recommendations. We end with 10 recommendations for using CCA or PLS in practice (summarized in Tab. S1).

1. Sample size and the number of features in the datasets are crucial determinants of stability. Therefore, any form of dimensionality reduction as a preprocessing step can be useful, as long as it preserves the features that carry the between-set association. PCA is a popular choice and can be combined with a consideration for the between-set correlation (33).

2. Significance tests used with CCA and PLS usually test the null hypothesis that the between-set association strength is 0. This is a different problem than estimating the strength or the nature of the association (68, 69). For CCA we find that the number of samples required to obtain 90% power at significance level α = 0.05 is lower than the number required to obtain stable association strengths or weights, whereas for PLS the numbers are about commensurate with the required sample sizes for other metrics (Fig. S15C-D). As significant results can be obtained even when power is low, detecting a significant mode of association with either CCA or PLS does not in general indicate that association strengths or weights are stable.

3. CCA and PLS overestimate the association strength for small sample sizes, and we found that cross-validated estimators underestimate it. Interestingly, the average of the in-sample and the cross-validated association strength was a much better estimator in our simulations.

4. The main interest of CCA/PLS studies is often the nature of the between-set association, which is encoded in the weight vectors, loadings and cross-loadings. Every CCA and PLS will provide weights, loadings and cross-loadings, but they may be inaccurate or unstable if an insufficient number of samples was used for estimation. In our PLS simulations, cross-loadings required fewer samples than weights and loadings to obtain an error of at most 10%.

5. PLS weights that strongly resemble the first principal component axis can indicate that either no association exists or that an insufficient number of samples was used.

6. As a side effect of this bias of PLS weights towards the first principal component axis, PLS weights can appear stable across different sample sets, although they are inaccurate.

7. Performing CCA or PLS on subsamples of the data can indicate stability, if very similar results are obtained for varying numbers of samples used, as compared to using all data.

8. Bootstrapped estimates were useful in our simulations for assessing the variability or precision of elements of the weight vectors. Estimates were, however, not accurate: they were as biased as in-sample estimates, i.e. they overestimated association strengths, and both association strength and weight error had a similar sample size dependence as in-sample estimates.

9. For CCA and PLS analyses in the literature it can be difficult to deduce what datasets precisely were used. We recommend always explicitly stating the sample size used, the number of features in both datasets, and the obtained association strength. Moreover, as we have argued above, to assess a PLS analysis the within-set principal component variances are required and are thus useful to report.

10. CCA or PLS requires a sufficient number of samples for reliability. Sample sizes can be calculated using GEMMR, the accompanying software package. An assumed but unknown value for the true between-set correlation is needed for the calculation. Our literature survey suggests that between-set correlations are probably not greater than 0.3 in many cases. Assuming a true correlation of 0.3 results in a rule of thumb that CCA requires about 50 samples per feature. The number for PLS is higher and also depends on the within-set variance spectrum.

Conclusion. We have presented a parameterized generative modeling framework for CCA and PLS. It allows analysis of the stability of CCA and PLS estimates, prospectively and retrospectively. Exploiting this generative model, we have seen that a number of pitfalls exist for using CCA and PLS. In particular, we caution against interpreting CCA and PLS models when the available sample size is low. We have also shown that CCA and PLS in empirical data behave similarly to the predictions of the generative model. Sufficient sample sizes depending on characteristics of the data are suggested and can be calculated with the accompanying software package. Altogether, our analyses provide guidelines for using CCA and PLS in practice.

Materials and Methods

Materials and methods are summarized in the SI appendix.

ACKNOWLEDGMENTS. This research was supported by NIH grants R01MH112746 (J.D.M.), R01MH108590 (A.A.), R01MH112189 (A.A.), U01MH121766 (A.A.), and P50AA012870 (A.A.); Wellcome Trust grant 217266/Z/19/Z (S.S.); a SFARI Pilot Award (J.D.M., A.A.); DFG research fellowship HE 8166/1-1 (M.H.); Medical Research Council PhD Studentship UK MR/N013913/1 (S.W.); and the NIHR Nottingham Biomedical Research Centre (A.M.). Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. Data were also provided by the UK Biobank under Project 43822.


In part, computations were performed using the University of Nottingham's Augusta HPC service and the Precision Imaging Beacon Cluster.

1. Biswal BB, et al. (2010) Toward discovery science of human brain function. Proceedings of the National Academy of Sciences 107(10):4734–4739.
2. Van Essen DC, et al. (2013) The WU-Minn Human Connectome Project: An overview. NeuroImage 80:62–79.
3. Miller KL, et al. (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nature Neuroscience 19(11):1523–1536.
4. Smith SM, et al. (2015) A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature Neuroscience 18(11):1565–1567.
5. Krishnan A, Williams LJ, McIntosh AR, Abdi H (2011) Partial Least Squares (PLS) methods for neuroimaging: A tutorial and review. NeuroImage 56(2):455–475.
6. Wang HT, et al. (2020) Finding the needle in a high-dimensional haystack: Canonical correlation analysis for neuroscientists. NeuroImage 216:116745.
7. Hotelling H (1936) Relations Between Two Sets of Variates. Biometrika 28(3/4):321–377.
8. Rosipal R, Krämer N (2006) Overview and Recent Advances in Partial Least Squares. In Subspace, Latent Structure and Feature Selection, Lecture Notes in Computer Science, eds. Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (Springer Berlin Heidelberg), pp. 34–51.
9. Abdi H, Williams LJ (2013) Partial Least Squares Methods: Partial Least Squares Correlation and Partial Least Square Regression. In Computational Toxicology, eds. Reisfeld B, Mayeno AN (Humana Press, Totowa, NJ), Vol. 930, pp. 549–579.
10. Sherry A, Henson RK (2005) Conducting and Interpreting Canonical Correlation Analysis in Personality Research: A User-Friendly Primer. Journal of Personality Assessment 84(1):37–48.
11. Reyment RA, Bookstein FL, Mckenzie KG, Majoran S (1988) Ecophenotypic variation in Mutilus pumilus (Ostracoda) from Australia, studied by canonical variate analysis and tensor biometrics. Journal of Micropalaeontology 7(1):11–20.
12. Tabachnick RE, Bookstein FL (1990) The Structure of Individual Variation in Miocene Globorotalia. Evolution 44(2):416–434.
13. Bin G, Gao X, Yan Z, Hong B, Gao S (2009) An online multi-channel SSVEP-based brain–computer interface using a canonical correlation analysis method. Journal of Neural Engineering 6(4):046002.
14. Mazerolles G, Devaux MF, Dufour E, Qannari EM, Courcoux P (2002) Chemometric methods for the coupling of spectroscopic techniques and for the extraction of the relevant information contained in the spectral data tables. Chemometrics and Intelligent Laboratory Systems 63(1):57–68.
15. Statheropoulos M, Vassiliadis N, Pappa A (1998) Principal component and canonical correlation analysis for examining air pollution and meteorological data. Atmospheric Environment 32(6):1087–1095.
16. Le Floch E, et al. (2012) Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse Partial Least Squares. NeuroImage 63(1):11–24.
17. Ziegler G, Dahnke R, Winkler AD, Gaser C (2013) Partial least squares correlation of multivariate cognitive abilities and local brain structure in children and adolescents. NeuroImage 82:284–294.
18. Kebets V, et al. (2019) Somatosensory-Motor Dysconnectivity Spans Multiple Transdiagnostic Dimensions of Psychopathology. Biological Psychiatry.
19. Zhuang X, Yang Z, Cordes D (2020) A technical review of canonical correlation analysis for neuroscience applications. Human Brain Mapping p. hbm.25090.
20. Weinberg SL, Darlington RB (1976) Canonical Analysis when Number of Variables is Large Relative to Sample Size. Journal of Educational Statistics 1(4):313–332.
21. Thompson B (1990) Finding a Correction for the Sampling Error in Multivariate Measures of Relationship: A Monte Carlo Study. Educational and Psychological Measurement 50(1):15–31.
22. Lee HS (2007) Canonical Correlation Analysis Using Small Number of Samples. Communications in Statistics - Simulation and Computation 36(5):973–985.
23. Dinga R, et al. (2019) Evaluating the evidence for biotypes of depression: Methodological replication and extension of Drysdale et al. (2017). NeuroImage: Clinical p. 101796.
24. Thorndike RM, Weiss DJ (1973) A study of the stability of canonical correlations and canonical components. Educational and Psychological Measurement 33(1):123–134.
25. Barcikowski RS, Stevens JP (1975) A Monte Carlo Study of the Stability of Canonical Correlations, Canonical Weights and Canonical Variate-Variable Correlations. Multivariate Behavioral Research 10(3):353–364.
26. Strand KH, Kossman S (2000) Further Inquiry into the Stabilities of Standardized and Structure Coefficients in Canonical and Discriminant Analyses.
27. Leach L, Henson R (2014) Bias and Precision of the Squared Canonical Correlation Coefficient Under Nonnormal Data Condition. Journal of Modern Applied Statistical Methods 13(1).
28. Rahim M, Thirion B, Bzdok D, Buvat I, Varoquaux G (2017) Joint prediction of multiple scores captures better individual traits from brain images. NeuroImage 158:145–154.
29. Bijsterbosch JD, et al. (2018) The relationship between spatial configuration and functional connectivity of brain regions. eLife 7:e32992.
30. Bijsterbosch JD, Beckmann CF, Woolrich MW, Smith SM, Harrison SJ (2019) The relationship between spatial configuration and functional connectivity of brain regions revisited. eLife 8:e44890.
31. Li J, et al. (2019) Topography and behavioral relevance of the global signal in the human brain. Scientific Reports 9(1):1–10.
32. Han F, Gu Y, Brown GL, Zhang X, Liu X (2020) Neuroimaging contrast across the cortical hierarchy is the feature maximally linked to behavior and demographics. NeuroImage 215:116853.
33. Song Y, Schreier PJ, Ramírez D, Hasija T (2016) Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing 128:449–458.
34. Winkler AM, Renaud O, Smith SM, Nichols TE (2020) Permutation Inference for Canonical Correlation Analysis. NeuroImage p. 117065.
35. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3):515–534.
36. Tenenhaus A, Tenenhaus M (2011) Regularized Generalized Canonical Correlation Analysis. Psychometrika 76(2):257.
37. Drysdale AT, et al. (2017) Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nature Medicine 23(1):28–38.
38. Eickhoff S, Nichols TE, Van Horn JD, Turner JA (2016) Sharing the wealth: Neuroimaging data repositories. NeuroImage 124:1065–1068.
39. Nichols TE, et al. (2017) Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience 20:299–303.
40. Button KS, et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14(5):365–376.
41. Poldrack RA, et al. (2017) Scanning the horizon: towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience 18(2):115–126.
42. Varoquaux G (2018) Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180:68–77.
43. Leek JT, et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11(10):733–739.
44. Chen J, et al. (2014) Exploration of scanning effects in multi-site structural MRI studies. Journal of Neuroscience Methods 230:37–50.
45. Shinohara RT, et al. (2017) Volumetric Analysis from a Harmonized Multisite Brain MRI Study of a Single Subject with Multiple Sclerosis. American Journal of Neuroradiology 38(8):1501–1509.
46. Garcia-Dias R, et al. (2020) Neuroharmony: A new tool for harmonizing volumetric MRI data from unseen scanners. NeuroImage p. 117127.
47. Noble S, et al. (2017) Multisite reliability of MR-based functional connectivity. NeuroImage 146:959–970.
48. Carp J (2012) On the Plurality of (Methodological) Worlds: Estimating the Analytic Flexibility of fMRI Experiments. Frontiers in Neuroscience 6.
49. Botvinik-Nezer R, et al. (2020) Variability in the analysis of a single neuroimaging dataset by many teams. Nature pp. 1–7.
50. Lawley DN (1959) Tests of Significance in Canonical Analysis. Biometrika 46(1/2):59–66.
51. Loukas A (2017) How close are the eigenvectors of the sample and actual covariance matrices? In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (JMLR.org), Vol. 70, pp. 2228–2237.
52. Shmueli G (2010) To Explain or to Predict? Statistical Science 25(3):289–310.
53. Bzdok D, Ioannidis JPA (2019) Exploration, Inference, and Prediction in Neuroscience and Biomedicine. Trends in Neurosciences 42(4):251–262.
54. Grellmann C, et al. (2015) Comparison of variants of canonical correlation analysis and partial least squares for combined analysis of MRI and genetic data. NeuroImage 107:289–310.
55. Rasmussen PM, Hansen LK, Madsen KH, Churchill NW, Strother SC (2012) Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recognition 45(6):2085–2100.
56. Varoquaux G, et al. (2017) Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage 145:166–179.
57. Baldassarre L, Pontil M, Mourão-Miranda J (2017) Sparsity Is Better with Stability: Combining Accuracy and Stability for Model Selection in Brain Decoding. Frontiers in Neuroscience 11.
58. Mihalik A, et al. (2020) Multiple Holdouts With Stability: Improving the Generalizability of Machine Learning Analyses of Brain–Behavior Relationships. Biological Psychiatry 87(4):368–376.
59. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation 16(12):2639–2664.
60. Akaho S (2006) A kernel method for canonical correlation analysis. arXiv preprint.
61. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep Canonical Correlation Analysis. In Proceedings of the 30th International Conference on Machine Learning (Atlanta, Georgia, USA), Vol. 28, p. 9.
62. Michaeli T, Wang W, Livescu K (2016) Nonparametric Canonical Correlation Analysis. arXiv:1511.04839 [cs, stat].
63. Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451.
64. Schulz MA, et al. (2019) Deep learning for brains?: Different linear and nonlinear scaling in UK Biobank brain images vs. machine-learning datasets. bioRxiv p. 757054.
65. Yarkoni T, Poldrack RA, Nichols TE, Van Essen DC, Wager TD (2011) Large-scale automated synthesis of human functional neuroimaging data. Nature Methods 8(8):665–670.
66. Chu C, Hsu AL, Chou KH, Bandettini P, Lin C (2012) Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. NeuroImage 60(1):59–70.
67. Hong SJ, et al. (2020) Toward Neurosubtypes in Autism. Biological Psychiatry 88(1):111–128.
68. Maxwell SE, Kelley K, Rausch JR (2008) Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. Annual Review of Psychology 59(1):537–563.
69. Wasserstein RL, Schirm AL, Lazar NA (2019) Moving to a World Beyond "p < 0.05". The American Statistician 73(sup1):1–19.


Supporting Information Text

1. CCA and PLS

We assume that we have two datasets in the form of data matrices $X$ and $Y$, both of which have $n$ rows representing samples, and, respectively, $p_X$ and $p_Y$ columns representing measured features (or variables). Throughout we also assume that all columns of $X$ and $Y$ have mean 0. If both datasets consisted of only a single variable, we could measure their association by calculating their covariance or correlation. On the other hand, if one or both consist of more than one variable, pairwise between-set associations can be obtained, but the possibly huge number of pairs results in a loss of statistical sensitivity and a difficulty to concisely interpret a potentially large number of significant associations (1). To circumvent these problems, canonical correlation analysis (CCA) and partial least squares (PLS) estimate associations between weighted composites of the original data variables and find those weights that maximize the association strength.

A. Terminology. Given a data matrix, e.g. $X$, composite variables or scores $\vec{t}_X$ (a vector of the same size as the number of samples, $n$) are formed by projection of $X$ onto a weight vector $\vec{w}_X$ (of the same size as the number of variables in $X$, $p_X$), see Fig. 1A:

$$\vec{t}_X = X \vec{w}_X. \qquad [1]$$
Loadings $\vec{\ell}_{XX}$ (of the same size as the number of variables in $X$, $p_X$) characterize these composite variables by measuring their similarities with each of the original data variables in $X$ (Fig. S9)

$$(\vec{\ell}_{XX})_j = \operatorname{corr}_i(X_{ij}, t_{X,i}) = (\vec{x}_j)_z \cdot (\vec{t}_X)_z / (n-1) \qquad [2]$$

where corr means Pearson correlation, $\vec{x}_j$ is the $j$-th column of $X$, and the subscript $z$ represents z-scoring across samples (i.e. subtraction of the mean and subsequent division by the standard deviation across samples). The complete loading vector is then

$$\vec{\ell}_{XX} = X_z^T (\vec{t}_X)_z / (n-1) = \operatorname{diag}(S_{XX})^{-1/2}\, S_{XX}\, \vec{w}_X \big/ \sqrt{\vec{w}_X^T S_{XX} \vec{w}_X} \qquad [3]$$

where $S_{XX}$ is the sample covariance matrix for $X$. Similarly, cross-loadings can be defined as

$$\vec{\ell}_{XY} = X_z^T (\vec{t}_Y)_z / (n-1) = \operatorname{diag}(S_{XX})^{-1/2}\, S_{XY}\, \vec{w}_Y \big/ \sqrt{\vec{w}_Y^T S_{YY} \vec{w}_Y} \qquad [4]$$

where $S_{YY}$ and $S_{XY}$ are, respectively, the sample covariance matrix for $Y$ and the sample cross-covariance matrix between $X$ and $Y$.

B. Partial Least Squares. Partial Least Squares (PLS) finds the maximal covariance achievable between weighted linear combinations of features from two data matrices $X$ and $Y$ (2):

$$\vec{w}_X, \vec{w}_Y = \operatorname*{arg\,max}_{\|\tilde{w}_X\| = 1,\, \|\tilde{w}_Y\| = 1} \operatorname{cov}(X\tilde{w}_X, Y\tilde{w}_Y) \qquad [5]$$

The solution is based on the between-set covariance matrix $\Sigma_{XY}$, which can be estimated from data via its sampled version $S_{XY} = \frac{1}{n-1} X^T Y$. Performing a singular value decomposition yields

$$\Sigma_{XY} = U \operatorname{diag}(\vec{\sigma}_{XY})\, V^T \qquad [6]$$

such that the optimal weights are given by the first columns of $U$ and $V$, and the maximal covariance

$$\max_{\|\tilde{w}_X\| = 1,\, \|\tilde{w}_Y\| = 1} \operatorname{cov}(X\tilde{w}_X, Y\tilde{w}_Y) \qquad [7]$$

by the first singular value $\sigma_{XY,1}$ (3, 4). Multiple modes of association can be estimated in this way: beyond only the first column, every pair of corresponding columns in $U$ and $V$ provides another mode such that $\operatorname{cov}(X\vec{u}_i, Y\vec{v}_i)$ (for $1 \leq i \leq \min(p_x, p_y)$) is maximal given that the covariance of lower modes (those with indices $< i$) has already been accounted for. There are a number of different algorithms for PLS that differ conceptually in how these higher modes are estimated (2, 3). The one presented above (sometimes called "partial least squares correlation" or PLS-SVD) was chosen for its similarity to canonical correlation analysis (see below). Another notable PLS algorithm is "PLS regression" which, in contrast to the above flavor, is asymmetrical in its handling of $X$ and $Y$ in that it estimates weighted composites (scores) for $X$ and re-uses these as predictors for $Y$ (2).
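To make the estimation procedure concrete, the following minimal sketch (our own illustration in Python with numpy, not the accompanying GEMMR package; all names are ours) computes the first PLS-SVD mode from two column-centered data matrices via a singular value decomposition of the sample cross-covariance matrix, following Eqs. (5)-(7):

```python
import numpy as np

def pls_svd(X, Y):
    """First PLS-SVD mode from column-centered data matrices X (n, p_X) and Y (n, p_Y)."""
    n = X.shape[0]
    S_XY = X.T @ Y / (n - 1)              # sample between-set covariance matrix
    U, s, Vt = np.linalg.svd(S_XY)        # singular value decomposition, cf. Eq. (6)
    w_X, w_Y = U[:, 0], Vt[0, :]          # first singular vectors = PLS weights
    t_X, t_Y = X @ w_X, Y @ w_Y           # composite scores, Eq. (1)
    return w_X, w_Y, s[0], t_X, t_Y       # s[0] is the maximal covariance, Eq. (7)
```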


C. Canonical Correlation Analysis. Canonical Correlation Analysis (CCA) (5), as a multivariate extension of Pearson's correlation, finds maximal correlations between weighted linear combinations of variables from $X$ and $Y$:
$$\vec{w}_X, \vec{w}_Y = \operatorname*{arg\,max}_{\vec{\tilde{w}}_X, \vec{\tilde{w}}_Y} \operatorname{corr}(X\vec{\tilde{w}}_X, Y\vec{\tilde{w}}_Y) \qquad [8]$$
Note that $\operatorname{corr}(X\vec{\tilde{w}}_X, Y\vec{\tilde{w}}_Y)$ is independent of the scaling of $\vec{\tilde{w}}_X$ and $\vec{\tilde{w}}_Y$, i.e. if $\vec{w}_X$ and $\vec{w}_Y$ are solutions of Eq. (8), then $c_X \vec{w}_X$ and $c_Y \vec{w}_Y$, where $c_X \in \mathbb{R}$ and $c_Y \in \mathbb{R}$, are also solutions. Also note that, as for PLS, several modes of association can be obtained with this framework by successively discounting the variance that has been explained by lower-order modes. The maximal correlation in Eq. (8) is often called "canonical". The further analysis is then based on the "whitened" between-set covariance matrix

$$\Sigma_{XY}^{(CCA)} = \Sigma_{XX}^{-1/2}\, \Sigma_{XY}\, \Sigma_{YY}^{-1/2} \qquad [9]$$

(6, 7). A singular value decomposition of $\Sigma_{XY}^{(CCA)}$ is performed, yielding

$$\Sigma_{XY}^{(CCA)} = U \operatorname{diag}(\vec{\sigma}_{XY})\, V^T \qquad [10]$$

and the singular values $\vec{\sigma}_{XY}$ turn out to be the canonical correlations from Eq. (8), i.e. the maximal achievable correlations between a weighted linear combination of variables in $X$ on the one hand, and a weighted linear combination of variables in $Y$ on the other hand. The corresponding weights are given by

$$W_X = \Sigma_{XX}^{-1/2}\, U \qquad [11]$$
$$W_Y = \Sigma_{YY}^{-1/2}\, V. \qquad [12]$$
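As a minimal illustration of Eqs. (9)-(12) (our own sketch, not the package implementation; sample covariance matrices are plugged in for the population ones), CCA can be computed by whitening, decomposing the whitened cross-covariance, and back-transforming the singular vectors:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_svd(X, Y):
    """First CCA mode from column-centered data matrices, via Eqs. (9)-(12)."""
    n = X.shape[0]
    S_XX, S_YY = X.T @ X / (n - 1), Y.T @ Y / (n - 1)
    S_XY = X.T @ Y / (n - 1)
    K = inv_sqrt(S_XX) @ S_XY @ inv_sqrt(S_YY)     # whitened cross-covariance, Eq. (9)
    U, s, Vt = np.linalg.svd(K)                    # Eq. (10)
    w_X = inv_sqrt(S_XX) @ U[:, 0]                 # Eq. (11)
    w_Y = inv_sqrt(S_YY) @ Vt[0, :]                # Eq. (12)
    return w_X, w_Y, s[0]                          # s[0] is the first canonical correlation
```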

The use of the "whitened" between-set covariance matrix in CCA leads to an invariance property between datasets that we will exploit later. To see this, let $X_w$, $Y_w$ be whitened data matrices, i.e. $X_w = X \Sigma_{XX}^{-1/2}$ and $Y_w = Y \Sigma_{YY}^{-1/2}$, such that $\Sigma_{X_w X_w} = \mathbb{1}$, $\Sigma_{Y_w Y_w} = \mathbb{1}$. Then,

$$\Sigma_{X_w Y_w}^{(CCA)} = \Sigma_{X_w X_w}^{-1/2}\, \Sigma_{X_w Y_w}\, \Sigma_{Y_w Y_w}^{-1/2} \qquad [13]$$

$$= \Sigma_{X_w Y_w} \qquad [14]$$
$$= \Sigma_{XX}^{-1/2}\, \Sigma_{XY}\, \Sigma_{YY}^{-1/2} \qquad [15]$$
$$= \Sigma_{XY}^{(CCA)} \qquad [16]$$

which is the same as for the original (non-whitened) data. Consequently, canonical correlations for the original and whitened data are the same, given by the singular values of $\Sigma_{XY}^{(CCA)}$; canonical weights for the whitened data are directly its singular vectors, and canonical weights for the original (non-whitened) data differ only by a matrix $\Sigma_{XX}^{-1/2}$ and $\Sigma_{YY}^{-1/2}$ for $X$ and $Y$, respectively (Eq. (11)-Eq. (12)). It can be shown that the invariance property is even more general (6). Let $O^{(X)} \in \mathbb{R}^{p_x \times p_x}$ and $O^{(Y)} \in \mathbb{R}^{p_y \times p_y}$ be non-singular and $\vec{d}^{(X)} \in \mathbb{R}^{p_x}$ and $\vec{d}^{(Y)} \in \mathbb{R}^{p_y}$ be arbitrary vectors. Then $\tilde{X} = O^{(X)} X + d^{(X)}$ and $\tilde{Y} = O^{(Y)} Y + d^{(Y)}$ have the same canonical correlations as $X$ and $Y$, and the canonical vectors are related by

$$\vec{\tilde{w}}_X = O^{(X)-1}\, \vec{w}_X \qquad [17]$$
$$\vec{\tilde{w}}_Y = O^{(Y)-1}\, \vec{w}_Y \qquad [18]$$

Thus, in particular, z-scored data $X_z = \operatorname{diag}(S_{XX})^{-1/2} X$ and $Y_z = \operatorname{diag}(S_{YY})^{-1/2} Y$, as well as whitened data $X_w$ and $Y_w$, have the same canonical correlations as the original data $X$ and $Y$. In CCA, $X$- and $Y$-weights are related by (8)

$$\vec{w}_X = \Sigma_{XX}^{-1}\, \Sigma_{XY}\, \vec{w}_Y / \sigma_{XY} \qquad [19]$$
$$\vec{w}_Y = \Sigma_{YY}^{-1}\, \Sigma_{YX}\, \vec{w}_X / \sigma_{XY} \qquad [20]$$

Replacing sample with population covariance matrices in Eq. (3) and Eq. (4), we thus also see that loadings and cross-loadings are collinear

$$\vec{\ell}_{XY} = \operatorname{diag}(\Sigma_{XX})^{-1/2}\, \Sigma_{XY}\, \vec{w}_Y \big/ \sqrt{\vec{w}_Y^T \Sigma_{YY} \vec{w}_Y}$$
$$= \operatorname{diag}(\Sigma_{XX})^{-1/2}\, \Sigma_{XX}\, \Sigma_{XX}^{-1}\, \Sigma_{XY}\, \vec{w}_Y \big/ \sqrt{\vec{w}_Y^T \Sigma_{YY} \vec{w}_Y}$$
$$\propto \operatorname{diag}(\Sigma_{XX})^{-1/2}\, \Sigma_{XX}\, \vec{w}_X$$
$$\propto \vec{\ell}_{XX} \qquad [22]$$


D. Overestimation of association strength. Let $\Sigma_{XY}$ be a population cross-covariance matrix with singular value decomposition

$$\Sigma_{XY} = U \operatorname{diag}(\vec{\sigma}_{XY})\, V^T \qquad [23]$$

and let $\vec{u}_1$, $\vec{v}_1$ and $\sigma_1$ be, respectively, the first columns of $U$, $V$ and the first entry in $\vec{\sigma}_{XY}$. In PLS, $\vec{u}_1$ and $\vec{v}_1$ are the weight vectors of the first mode and $\sigma_1$ is the corresponding association strength. For CCA, as noted above, whitening the data leaves canonical correlations unchanged, so that we assume now data are white when performing CCA. Then, $\Sigma_{XY}^{(CCA)} = \Sigma_{XY}$ and the weights and association strength of the first mode are also given by $\vec{u}_1$, $\vec{v}_1$ and $\sigma_1$. In both cases, we have $\sigma_1 = \vec{u}_1^T \Sigma_{XY} \vec{v}_1$. The sample covariance matrix $S_{XY} = \frac{1}{n-1} X^T Y$ is an unbiased estimator for $\Sigma_{XY}$, i.e. $E[S_{XY}] = \Sigma_{XY}$. Therefore,

$$E[\vec{u}_1^T S_{XY}\, \vec{v}_1] = \vec{u}_1^T E[S_{XY}]\, \vec{v}_1 = \vec{u}_1^T \Sigma_{XY}\, \vec{v}_1 = \sigma_1 \qquad [24]$$

i.e. if the true (but unknown) weights were applied to a given dataset (between-set covariance matrix), the association strength of the resulting scores would, on average, match the true association strength. However, by definition, CCA and PLS select those weight vectors that maximize the association strength between resulting scores. If $\hat{\vec{u}}_1$ and $\hat{\vec{v}}_1$ are those optimal weights for a given dataset, then
$$\hat{\vec{u}}_1^T S_{XY}\, \hat{\vec{v}}_1 \geq \vec{u}_1^T S_{XY}\, \vec{v}_1 \qquad [25]$$

and consequently also
$$E[\hat{\vec{u}}_1^T S_{XY}\, \hat{\vec{v}}_1] \geq E[\vec{u}_1^T S_{XY}\, \vec{v}_1] = \sigma_1 \qquad [26]$$

i.e. the association strength is overestimated.
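A small simulation illustrates this overestimation (our own sketch; the dimensionalities, sample sizes and true correlation are arbitrary example values, and `cca_svd` refers to the CCA sketch above): for jointly normal data with a known canonical correlation, the mean in-sample canonical correlation exceeds the true value, increasingly so for smaller sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
p_x, p_y, r_true = 5, 5, 0.3

# Joint covariance with identity within-set blocks and a single association mode
w_x, w_y = np.eye(p_x)[:, 0], np.eye(p_y)[:, 0]
Sigma = np.eye(p_x + p_y)
Sigma[:p_x, p_x:] = r_true * np.outer(w_x, w_y)
Sigma[p_x:, :p_x] = Sigma[:p_x, p_x:].T

for n in (50, 200, 1000):
    r_hat = []
    for _ in range(100):
        Z = rng.multivariate_normal(np.zeros(p_x + p_y), Sigma, size=n)
        X, Y = Z[:, :p_x] - Z[:, :p_x].mean(0), Z[:, p_x:] - Z[:, p_x:].mean(0)
        r_hat.append(cca_svd(X, Y)[2])    # in-sample canonical correlation
    print(n, np.mean(r_hat))              # mean estimate exceeds r_true, most strongly for small n
```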

E. Sparse CCA. Multiple sparse CCA and PLS methods exist (9–12). Here, we use penalized matrix decomposition (PMD) (11), which has found widespread application, see e.g. (13–18). Briefly, the PMD algorithm repeats the following steps until convergence (11):

• $\vec{u} \leftarrow \operatorname*{arg\,max}_{\vec{u}}\, \vec{u}^T X^T Y \vec{v}$ subject to $\|\vec{u}\|_1 \leq c_1$ and $\|\vec{u}\|_2 \leq 1$

• $\vec{v} \leftarrow \operatorname*{arg\,max}_{\vec{v}}\, \vec{u}^T X^T Y \vec{v}$ subject to $\|\vec{v}\|_1 \leq c_2$ and $\|\vec{v}\|_2 \leq 1$

to maximize $\vec{u}^T X^T Y \vec{v}$. If $X^T X \approx \mathbb{1}$ and $\|\vec{u}\|_2 = 1$, then $1 = \|\vec{u}\|_2 \approx \|X\vec{u}\|_2$, and analogously for $Y$. Consequently, $\vec{u}^T X^T Y \vec{v} \approx \vec{u}^T X^T Y \vec{v} \big/ \sqrt{\|X\vec{u}\|_2^2\, \|Y\vec{v}\|_2^2} = \operatorname{corr}(X\vec{u}, Y\vec{v})$. Note that the approximation $X^T X \approx \mathbb{1}$ (together with $Y^T Y \approx \mathbb{1}$) makes this sparse "CCA" variant identical to sparse PLS (18, 19).

E.1. Implementation and sparsity parameter selection. We implemented a Python wrapper for the R-package PMA (20), which we used with default parameters. Sparsity parameters were estimated separately for each dataset subjected to sparse CCA via 5-fold cross-validation (11, 21): for $X$ and $Y$ we used 5 different candidate sparsity parameters (0.2, 0.4, 0.6, 0.8 and 1, where smaller values mean more sparsity and 1 corresponds to no sparsity), for a total of 25 parameter pairs. For each candidate parameter pair, sparse CCA was estimated with 80% of the data, the resulting weights were applied to the remaining 20% of the data to obtain test scores, and the Pearson correlation between the test scores was calculated and averaged across the 5 folds. The pair of sparsity parameters for which the test-correlation averaged across folds was maximal was then selected, and sparse CCA was re-estimated on the whole data with these parameters.
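The selection procedure can be outlined as follows (our own sketch; `fit_sparse_cca` is a hypothetical placeholder standing in for the wrapped PMA routine, not the wrapper's actual API):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def select_sparsity(X, Y, fit_sparse_cca, candidates=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """5-fold CV over the 5x5 grid of candidate sparsity pairs; returns the best pair.

    fit_sparse_cca(X, Y, c1, c2) is assumed to return the weight vectors (w_X, w_Y).
    """
    best_pair, best_r = None, -np.inf
    for c1, c2 in product(candidates, repeat=2):
        rs = []
        for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
            w_X, w_Y = fit_sparse_cca(X[train], Y[train], c1, c2)   # fit on 80% of the data
            t_X, t_Y = X[test] @ w_X, Y[test] @ w_Y                 # test scores on held-out 20%
            rs.append(np.corrcoef(t_X, t_Y)[0, 1])                  # test-score correlation
        if np.mean(rs) > best_r:
            best_pair, best_r = (c1, c2), np.mean(rs)
    return best_pair                                                # then refit on all data
```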

2. Generating synthetic data for CCA and PLS

We will analyze properties of CCA and PLS with the help of simulated datasets. These datasets will be drawn from a normal distribution with mean 0 and a covariance matrix $\Sigma$ that encodes assumed relationships in the data. To specify $\Sigma$ we need to specify relationships of features within $X$, i.e. the covariance matrix $\Sigma_{XX} \in \mathbb{R}^{p_x \times p_x}$, relationships of features within $Y$, i.e. the covariance matrix $\Sigma_{YY} \in \mathbb{R}^{p_y \times p_y}$, and relationships between features in $X$ on the one side and $Y$ on the other side, i.e. the matrix $\Sigma_{XY} \in \mathbb{R}^{p_x \times p_y}$. Together, these three covariance matrices form the joint covariance matrix (Fig. 1D)
$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{XY}^T & \Sigma_{YY} \end{pmatrix} \in \mathbb{R}^{(p_x+p_y) \times (p_x+p_y)} \qquad [27]$$

for $X$ and $Y$, and this allows us to generate synthetic datasets by sampling from the associated normal distribution $\mathcal{N}(0, \Sigma)$.

A. The covariance matrices $\Sigma_{XX}$ and $\Sigma_{YY}$. Given a data matrix $X$, the features can be re-expressed in a different coordinate system through multiplication by an orthogonal matrix $O$: $\tilde{X} = XO$. No information is lost in this process, as it can be reversed: $X = \tilde{X}O^T$. Therefore, we are free to make a convenient choice. We select the principal component coordinate system, as in this case the covariance matrix becomes diagonal, i.e. $\Sigma_{XX} = \operatorname{diag}(\vec{\sigma}_{XX})$. Analogously, for $Y$ we choose the principal component coordinate system such that $\Sigma_{YY} = \operatorname{diag}(\vec{\sigma}_{YY})$. For modeling, to obtain a concise description of $\vec{\sigma}_{XX}$ and $\vec{\sigma}_{YY}$ we assume a power-law such that $\sigma_{XX,i} = c_{XX}\, i^{-a_{XX}}$ and $\sigma_{YY,i} = c_{YY}\, i^{-a_{YY}}$ with decay constants $a_{XX}$ and $a_{YY}$ (Fig. 1B). Unless a match to a specific dataset is sought, the scaling factors $c_{XX}$ and $c_{YY}$ can be set to 1, as they would only rescale all results without affecting conclusions.
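As a minimal sketch of this choice (our own illustration; the decay constant value is an arbitrary example), the within-set covariance matrices can be instantiated as diagonal matrices with power-law decaying principal component variances:

```python
import numpy as np

def within_set_cov(p, a, c=1.0):
    """Diagonal covariance in the PC basis with variances c * i**(-a) for i = 1..p."""
    variances = c * np.arange(1, p + 1, dtype=float) ** (-a)
    return np.diag(variances)

Sigma_XX = within_set_cov(p=64, a=1.0)   # example decay constant a_XX = 1
Sigma_YY = within_set_cov(p=64, a=1.0)
```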


B. The cross-covariance matrix $\Sigma_{XY}$. The between-set covariance matrix $\Sigma_{XY}$ encodes relationships between the datasets $X$ and $Y$. One such relationship is completely specified if we are given the weights of the variables in each dataset, $\vec{w}_X$ and $\vec{w}_Y$, and the association strength of the resulting weighted composite scores. For PLS, the relation between the between-set covariance matrix, the weight vectors and association strengths is given by

$$\Sigma_{XY} = W_X \operatorname{diag}(\vec{\sigma}_{XY})\, W_Y^T \quad \text{(for PLS)} \qquad [28]$$
where $W_X^T W_X = \mathbb{1}_{p_x}$, $W_Y^T W_Y = \mathbb{1}_{p_y}$, and $\vec{\sigma}_{XY}$ are the covariances of the composite scores. Arguably, correlations are more accessible to intuition, though, and we therefore re-express $\vec{\sigma}_{XY}$ in terms of the assumed true (canonical) correlations. For each mode with weights $\vec{w}_X$ and $\vec{w}_Y$ and covariance $\sigma_{XY}$ we have
$$\sigma_{XY} = r_{true} \sqrt{\operatorname{var}(X\vec{w}_X)\, \operatorname{var}(Y\vec{w}_Y)} \qquad [29]$$

where $\operatorname{var}(X\vec{w}_X) = \vec{w}_X^T \Sigma_{XX}\, \vec{w}_X$ and $\operatorname{var}(Y\vec{w}_Y) = \vec{w}_Y^T \Sigma_{YY}\, \vec{w}_Y$ are, respectively, the variances along the $X$ and $Y$ composite scores. For CCA, on the other hand, we have to consider the singular value decomposition of $\Sigma_{XY}^{(CCA)} = \Sigma_{XX}^{-1/2}\, \Sigma_{XY}\, \Sigma_{YY}^{-1/2}$:

$$\Sigma_{XY} = \Sigma_{XX}^{1/2}\, \Sigma_{XY}^{(CCA)}\, \Sigma_{YY}^{1/2} = \Sigma_{XX}^{1/2}\, U \operatorname{diag}(\vec{\sigma}_{XY})\, V^T \Sigma_{YY}^{1/2} = \Sigma_{XX}^{1/2} \left( \Sigma_{XX}^{1/2} W_X \right) \operatorname{diag}(\vec{\sigma}_{XY}) \left( \Sigma_{YY}^{1/2} W_Y \right)^T \Sigma_{YY}^{1/2} \qquad [30]$$

where we have used Eq. (11) and Eq. (12). Here, $\vec{\sigma}_{XY}$ are directly the assumed true correlations and, by construction, the weights $W_X$ and $W_Y$ are constrained to satisfy $\mathbb{1} = U^T U = (\Sigma_{XX}^{1/2} W_X)^T\, \Sigma_{XX}^{1/2} W_X$, and analogously for $W_Y$. As mentioned above, we can exploit the property that pre-whitened data result in the same matrix $\Sigma_{XY}^{(CCA)}$. In the following, thus, assume that we have data $X$ and $Y$ with $\Sigma_{XX} = \mathbb{1}_{p_x}$ and $\Sigma_{YY} = \mathbb{1}_{p_y}$. But then,

$$\Sigma_{XY} = W_X \operatorname{diag}(\vec{\sigma}_{XY})\, W_Y^T \quad \text{(for CCA)} \qquad [31]$$

and $\mathbb{1} = U^T U = W_X^T W_X$, as well as $\mathbb{1} = W_Y^T W_Y$. This is identical to the result for PLS, except that for CCA the assumption that the data are white is implicit. Thus, in summary, to specify $\Sigma_{XY}$ we select the number $m$ of between-set association modes, for each of them the association strength in the form of the assumed true correlation, and a set of weight vectors $\vec{w}_{X,i}$ and $\vec{w}_{Y,i}$ (for $1 \leq i \leq m$). The weight vectors for each set need to be orthonormal ($W_X^T W_X = W_Y^T W_Y = \mathbb{1}_m$), and, for CCA, both $X$ and $Y$ need to be white, i.e. $\Sigma_{XX} = \mathbb{1}_{p_X}$ and $\Sigma_{YY} = \mathbb{1}_{p_Y}$.
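Putting the pieces together, a minimal sketch (our own; single association mode, with a given unit-norm weight pair) assembles the joint covariance matrix of Eq. (27) and draws synthetic datasets from it:

```python
import numpy as np

def joint_cov(Sigma_XX, Sigma_YY, w_X, w_Y, r_true, pls=True):
    """Joint covariance matrix (Eq. 27) for a single between-set association mode."""
    if pls:
        # score covariance from the assumed true correlation, Eq. (29)
        sigma = r_true * np.sqrt((w_X @ Sigma_XX @ w_X) * (w_Y @ Sigma_YY @ w_Y))
    else:
        # for CCA the data are assumed white, so sigma is the correlation itself, Eq. (31)
        sigma = r_true
    Sigma_XY = sigma * np.outer(w_X, w_Y)            # Eq. (28) / Eq. (31) with m = 1
    Sigma = np.block([[Sigma_XX, Sigma_XY],
                      [Sigma_XY.T, Sigma_YY]])       # Eq. (27)
    assert np.all(np.linalg.eigvalsh(Sigma) > 0), "weights must yield a positive definite Sigma"
    return Sigma

def draw_datasets(Sigma, n, p_x, rng):
    """Sample n observations from N(0, Sigma) and split them into X and Y."""
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    return Z[:, :p_x], Z[:, p_x:]
```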

C. Choice of weight vectors. We impose two constraints on possible weight vectors:

1. We aim to obtain association modes that explain a "large" amount of variance in the data, otherwise the resulting scores could be strongly affected by noise. The decision is based on the explained variance of only the first mode, and we require that it is greater than 1/2 of the average explained variance of a principal component in the dataset, i.e. we require that

$$\vec{w}_X^T \Sigma_{XX}\, \vec{w}_X > \frac{1}{2}\, \frac{\operatorname{tr} \Sigma_{XX}}{p_X} \qquad [32]$$

and analogously for $Y$.

2. The weight vectors impact the joint covariance matrix $\Sigma$ (via Eq. (27), Eq. (28) and Eq. (31)). Therefore, we require that the chosen weights result in a proper, i.e. positive definite, covariance matrix $\Sigma$.

To increase the chances of finding weights that satisfy the first constraint, we compose them as a linear combination of a high-variance subspace element and another component from the low-variance subspace. The high-variance subspace is defined as the vector space spanned by the first $q_X$ and $q_Y$ (for datasets $X$ and $Y$, respectively) components, where $q_X$ and $q_Y$ are chosen to explain 90% of their respective within-set variances. Having chosen (see below) any unit vectors of the low- and high-variance subspaces, $\vec{w}_{lo}$ and $\vec{w}_{hi}$, they are combined as

$$\vec{w} = c\, \vec{w}_{hi} + \sqrt{1 - c^2}\, \vec{w}_{lo} \qquad [33]$$

so that $\|\vec{w}\| = 1$. Here, $c$ is a uniform random number between 0 and 1 (but see also below). If the resulting weight vectors do not satisfy the imposed constraints, new values for $\vec{w}_{lo}$, $\vec{w}_{hi}$ and $c$ are drawn. Note that, in case the number of between-set association modes $m$ is greater than 1, only the first one is used to test the constraint Eq. (32), but weight vectors for the remaining modes are composed in the same way as just described.


Weight vector components of the low-variance subspace are found by multiplication of its basis vectors $U_{lo} \in \mathbb{R}^{p \times (p-q)}$ with a rotation matrix $R_{lo}$
$$W_{lo} = U_{lo}\, R_{lo} \qquad [34]$$

where the first $m$ columns of $W_{lo}$ are used as the low-variance subspace components of the $m$ between-set association modes. If $q_X \geq m > p_X - q_X$ (and analogously for $Y$), the dimensionality of the low-variance subspace is not large enough to get a component for all $m$ modes in this way, so that a low-variance subspace component will be used only for the first $p_X - q_X$ modes. The rotation matrix $R_{lo}$ is found as the Q-factor of a QR-decomposition of a $(p_X - q_X) \times (p_X - q_X)$ (analogously for $Y$) matrix with elements drawn from a standard normal distribution. Weight vector components of the high-variance subspace are selected in the following way (see Fig. S2). First, 10000 attempts are made to find them in the same way as the low-variance component, i.e. as the first $m$ columns of

$$W_{hi} = U_{hi}\, R_{hi} \qquad [35]$$

where the columns of $U_{hi}$ are the basis vectors for the high-variance subspace, and $R_{hi}$ is found as the Q-factor of a QR-decomposition of a $q_X \times q_X$ (analogously for $Y$) matrix with elements drawn from a standard normal distribution. In case this fails (i.e. if one of the two constraints is not satisfied in all 10000 attempts), another 10000 attempts are made in which the coefficient $c$ is not chosen randomly between 0 and 1, but the lower bound is increased stepwise from 0.5 to 1, to make it more likely that the first constraint is satisfied.

If this also fails (which tends to happen for large ground-truth correlations $r_{true}$ and large dimensionalities $p_X$ and $p_Y$), and if $m = 1$, a differential evolution algorithm (22) is used to maximize the minimum eigenvalue of $\Sigma$, in order to encourage the second constraint to be satisfied. Specifically, $q_X$ coefficients $\vec{c}_X$ and $q_Y$ coefficients $\vec{c}_Y$ are optimized such that the weights $\vec{w}_X = U_{X,hi}\, \vec{c}_X$ and $\vec{w}_Y = U_{Y,hi}\, \vec{c}_Y$ satisfy the constraints. As soon as the minimum eigenvalue of a resulting $\Sigma$ matrix is above $10^{-5}$, the optimization is stopped. 10000 attempts are made to add a low-variance component to the optimized high-variance component in this way, and if unsuccessful, another 10000 attempts are made in which the coefficient $c$ is not chosen randomly between 0 and 1, but the lower bound is increased stepwise from 0.5 to 1.

If this also fails, and if $m = 1$, the high-variance components of the weight vectors are chosen as the first principal component axes as a fallback approach. To see why this works, recall that we have assumed to work in the principal component coordinate system, so that $\vec{w}_{X,hi,1} = (1, 0, \ldots, 0)^T$, $\vec{w}_{Y,hi,1} = (1, 0, \ldots, 0)^T$ and $\Sigma_{XX}$ as well as $\Sigma_{YY}$ are diagonal. In addition, we assume that the principal component variances are normalized such that the highest (i.e. the top-left entry in $\Sigma_{XX}$ and $\Sigma_{YY}$) is 1. We are seeking weight vectors that result in a positive definite covariance matrix $\Sigma$, and $\Sigma$ is positive definite if and only if both $\Sigma_{YY}$ and the Schur complement of $\Sigma$, i.e. $\Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T$, are positive definite. $\Sigma_{YY}$ is positive definite by construction. The between-set covariance matrix here is $\Sigma_{XY} = \sigma_{XY,1}\, \vec{w}_{X,hi,1}\, \vec{w}_{Y,hi,1}^T$. For CCA, $\sigma_{XY,1}$ is the canonical correlation $r_{true} < 1$. For PLS, $\sigma_{XY,1} = r_{true} \sqrt{\operatorname{var}(X\vec{w}_X)\, \operatorname{var}(Y\vec{w}_Y)}$, which, with the specific choices of $\Sigma_{XX}$, $\Sigma_{YY}$, $\vec{w}_X$ and $\vec{w}_Y$ just described, also simplifies to $\sigma_{XY,1} = r_{true}$. Thus, $\Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T = r_{true}^2\, (1, 0, \ldots, 0)^T (1, 0, \ldots, 0)$, and consequently the diagonal entries of $\Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T$ are all greater than 0. That shows that $\Sigma$ is positive definite if the weights are chosen as the first principal component axes. To not end up with the pure principal component axes in all cases, we add a low-variance subspace component as before, i.e. we make 10000 attempts to add a low-variance component with weight $c$ chosen uniformly at random between 0 and 1, and, if unsuccessful, another 10000 attempts in which the lower bound for $c$ is increased stepwise from 0.5 to 1.
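A much-simplified sketch of the basic construction (our own code; it covers only Eqs. (32)-(33) for a single mode and omits the retry loops, the differential-evolution fallback, and the QR-based rotations described above):

```python
import numpy as np

def draw_weight(Sigma_XX, q, rng, c_min=0.0):
    """Draw a unit weight vector c*w_hi + sqrt(1-c^2)*w_lo (Eq. 33) in the PC basis (0 < q < p)."""
    p = Sigma_XX.shape[0]
    w_hi, w_lo = np.zeros(p), np.zeros(p)
    w_hi[:q] = rng.standard_normal(q)                  # random direction in the high-variance subspace
    w_lo[q:] = rng.standard_normal(p - q)              # random direction in the low-variance subspace
    w_hi /= np.linalg.norm(w_hi)
    w_lo /= np.linalg.norm(w_lo)
    c = rng.uniform(c_min, 1.0)
    w = c * w_hi + np.sqrt(1 - c ** 2) * w_lo          # Eq. (33); unit norm by construction
    ok = w @ Sigma_XX @ w > 0.5 * np.trace(Sigma_XX) / p   # explained-variance constraint, Eq. (32)
    return w, ok
```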

D. Summary. Thus, to generate simulated data for CCA and PLS, we vary the assumed between-set correlation strengths $\vec{\rho}_{XY}$, setting them to select levels, while choosing random weights $W_X$ and $W_Y$. For CCA, as outlined in the previous section, we can use pre-whitened data for which $\Sigma_{XX} = \mathbb{1}$ and $\Sigma_{YY} = \mathbb{1}$, and as a result, the cross-covariance matrix $\Sigma_{XY}$ has the same form as for PLS. The columns of the weight matrices $W_X$ and $W_Y$ must be mutually orthonormal, and, in addition, we assume that they are contained within a subspace of, respectively, $q_X$ and $q_Y$ dominant principal components, that is $W_X = U_{XX}^{(q_X)} R_{XX}$ and $W_Y = U_{YY}^{(q_Y)} R_{YY}$, where $U_{XX}^{(q_X)} \in \mathbb{R}^{p_X \times q_X}$ is the matrix of the first $q_X$ columns of $U_{XX}$, $R_{XX} \in \mathbb{R}^{q_X \times q_X}$ is unitary, and analogously for $U_{YY}^{(q_Y)}$ and $R_{YY}$.

E. Performed simulations. For Figs. 2, 3E-F, the colored curves in Fig. 6A, Figs. 7, 8, S15, the CCA results in Fig. S17, and Fig. S3, we ran simulations for $m = 1$ between-set association mode, assuming true correlations of 0.1, 0.3, 0.5, 0.7 and 0.9, used dimensionalities $p_X = p_Y$ of 2, 4, 8, 16, 32, 64, 128, as well as 25 different covariance matrices. $a_X + a_Y$ was fixed at 0 for CCA and at $-2$ for PLS. 100 synthetic datasets were drawn from each instantiated normal distribution. Where not specified otherwise, null distributions were computed with 1000 permutations. Due to computational expense, some simulations did not finish and are reported as blank spaces in heatmaps.

Similar parameters were used for other figures, except for the following deviations. For Fig. 3A-B, $p_X$ was 100, $r_{true} = 0.3$, and we used 1 covariance matrix for CCA and PLS. For Fig. 3C-D, $p_X$ was 100, $r_{true} = 0.3$, and we used 10 and 100 different covariance matrices for CCA and PLS, respectively. For Fig. 4A-B, $p_X$ was 2, $r_{true} = 0.3$, and we used 10000 different covariance matrices for CCA and PLS. For Fig. 4C-D and 4G-H, we used 2, 4, 8, 16, 32 and 64 for $p_X$, 0.1, 0.3 and 0.5 for $r_{true}$, 10 different covariance matrices for CCA and PLS, and 10 permutations. A subset of these, namely $p_X = 64$ and $r_{true} = 0.3$, was used for Fig. 4E-F. For Fig. 6, we varied $r_{true}$ from 0 to 0.99 in steps of 0.01 for each combination of $p_X$ and $p_Y$ for which we have a study in our database of reported CCAs, and 1 covariance matrix for each $r_{true}$.


For Fig. S4, $p_X + p_Y$ was fixed at 64 and for $p_X$ we used 2, 4, 8, 16, 32. In Fig. S6, for $p_X$ we used 4, 8, 16, 32, 64 and, only for CCA, also 128; we used 10 different covariance matrices for both CCA and PLS and varied $r_{true}$ from 0 to 0.99 in steps of 0.01. For Fig. S7, we used 2, 4, 8, 16 and 32 for $p_X$, and 10 different covariance matrices for both CCA and PLS. For Fig. S8, we used 2, 4, 8, 16, 32 and 64 for $p_X$, 5 different covariance matrices for both CCA and PLS, 100 bootstrap iterations, and did not run simulations for $r_{true} = 0.1$. For the PLS results in Fig. S16 and Fig. S17, we used 50 different covariance matrices for $r_{true} = 0.1, 0.9$, as well as for $r_{true} = 0.7$ in combination with $p_X = 128$, 25 for $r_{true} = 0.5$ in combination with $p_X = 64$, and 75 for all other combinations of $p_X$ and $r_{true}$ for which the computational expense was not too high. For each instantiated joint covariance matrix, $a_X + a_Y$ was chosen uniformly at random between $-3$ and 0, and $a_X$ was set to a random fraction of the sum, drawn uniformly between 0 and 1. In Fig. S18, we used 0.3, 0.5, 0.7 and 0.9 for $r_{true}$, 4, 8, 16, 32 and 64 for $p_X$, 6 different covariance matrices, and 100 permutations.

3. Evaluation of sampling error

We use five metrics to evaluate the effects of sampling error on CCA and PLS analyses.

Statistical power. Power measures the capability to detect an existing association. It is calculated, when the true correlation is greater than 0, as the probability across 100 repeated draws of synthetic datasets from the same normal distribution that the observed association strength (i.e. correlation for CCA, covariance for PLS) of a dataset is statistically significant. Significance is declared if the p-value is below $\alpha = 0.05$. The p-value is evaluated as the probability that association strengths are greater in the null distribution of association strengths. The corresponding null distribution is obtained by performing CCA or PLS on 1000 datasets in which the rows of $Y$ were permuted randomly. Power is bounded between 0 and 1 and, unlike for the other metrics (see below), higher values are better.
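The permutation test underlying this metric can be sketched as follows (our own illustration; `estimate` stands for any function returning the first-mode association strength, such as the CCA or PLS sketches above). Power is then the fraction of repeated dataset draws for which the returned p-value falls below 0.05.

```python
import numpy as np

def perm_pvalue(X, Y, estimate, n_perm=1000, rng=None):
    """Permutation p-value for the observed association strength of the first mode."""
    if rng is None:
        rng = np.random.default_rng()
    r_obs = estimate(X, Y)
    null = np.array([estimate(X, Y[rng.permutation(len(Y))])   # permuting rows of Y breaks the association
                     for _ in range(n_perm)])
    return np.mean(null >= r_obs)                              # fraction of null strengths exceeding the observed one
```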

Relative error in between-set covariance. The relative error of the between-set association strength is calculated as

$$\Delta r = \frac{\hat{r} - r}{r} \qquad [36]$$

where $r$ is the true between-set association strength and $\hat{r}$ is its estimate in a given sample.

Weight error. The weight error $\Delta w$ is calculated as 1 minus the absolute value of the cosine similarity between observed ($\hat{\vec{w}}$) and true ($\vec{w}$) weights, separately for datasets $X$ and $Y$, and the greater of the two errors is taken:
$$\Delta w = \max_{s \in \{X, Y\}} \left( 1 - |\operatorname{cossim}(\hat{\vec{w}}_s, \vec{w}_s)| \right) \qquad [37]$$

where

$$\operatorname{cossim}(\hat{\vec{w}}_s, \vec{w}_s) = \frac{\hat{\vec{w}}_s \cdot \vec{w}_s}{\|\hat{\vec{w}}_s\|\, \|\vec{w}_s\|}. \qquad [38]$$

The absolute value of the cosine similarity is used due to the sign ambiguity of CCA and PLS. This error metric is bounded between 0 and 1 and is based on the cosine of the angle between the two unit vectors $\hat{\vec{w}}_s$ and $\vec{w}_s$.

Score error. The score error $\Delta t$ is calculated as 1 minus the absolute value of the Spearman correlation between observed and true scores. The absolute value of the correlation is used due to the sign ambiguity of CCA and PLS. As for weights, the maximum over datasets $X$ and $Y$ is selected:
$$\Delta t = \max_{s \in \{X, Y\}} \left( 1 - |\operatorname{rankcorr}_i(\hat{t}^{(test)}_{s,i}, t^{(test)}_{s,i})| \right) \qquad [39]$$

Each element of the score vector represents a sample (subject). Thus, to be able to compute the correlation between estimated ($\hat{\vec{t}}$) and true ($\vec{t}$) score vectors, corresponding elements must represent the same sample, despite the fact that in each repetition new data matrices are drawn in which the samples have completely different identities. To overcome this problem and to obtain scores which are comparable across repetitions (denoted $\hat{\vec{t}}^{(test)}$ and $\vec{t}^{(test)}$), each time a set of data matrices is drawn from a given distribution $\mathcal{N}(0, \Sigma)$ and a CCA or PLS model is estimated, the resulting model (i.e. the resulting weight vectors) is also applied to a "test" set of data matrices, $X^{(test)}$ and $Y^{(test)}$ (of the same size as $X$ and $Y$), obtained from $\mathcal{N}(0, \Sigma)$ and common across repeated dataset draws. The score error metric $\Delta t$ is bounded between 0 and 1 and reflects the idea that samples (subjects) might be selected on the basis of how extreme they score and that the ordering of samples (subjects) is more important than the somewhat abstract value of their scores.


Loading error. The loading error $\Delta \ell$ is calculated as 1 minus the absolute value of the Pearson correlation between observed and true loadings. The absolute value of the correlation is used due to the sign ambiguity of CCA and PLS. As for weights, the maximum over datasets $X$ and $Y$ is selected:
$$\Delta \ell = \max_{s \in \{X, Y\}} \left( 1 - |\operatorname{corr}_i(\hat{\ell}^{(test)}_{s,i}, \ell^{(test)}_{s,i})| \right) \qquad [40]$$

True loadings are calculated with Eq. (3) (replacing the sample covariance matrix in the formula with its population value). Estimated loadings are obtained by correlating data matrices with score vectors (Eq. (2)). Thus, the same problem as for scores occurs: the elements of estimated and true loadings must represent the same sample. Therefore, we calculate loading errors with loadings obtained from test data ($X^{(test)}$ and $Y^{(test)}$) and test scores ($\hat{\vec{t}}^{(test)}$ and $\vec{t}^{(test)}$) that were also used to calculate score errors. The loading error metric $\Delta \ell$ is bounded between 0 and 1 and reflects the idea that loadings measure the contribution of original data variables to the between-set association mode uncovered by CCA and PLS. Loadings are calculated by correlating scores with data matrices. Of note, all synthetic data matrices in this study are based in the principal component coordinate system. In practice, however, this is not generally the case. Nonetheless, as the transformation between principal component and original coordinate system cannot be constrained, we here do not consider this effect.
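A compact sketch of the weight, score, and loading error metrics (Eqs. (37)-(40); our own code, written per dataset side so that the maximum over $X$ and $Y$ is taken by the caller):

```python
import numpy as np
from scipy.stats import rankdata

def weight_error(w_hat, w_true):
    """1 - |cosine similarity| between estimated and true weights, Eqs. (37)-(38)."""
    cs = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
    return 1 - abs(cs)

def score_error(t_hat_test, t_true_test):
    """1 - |Spearman correlation| of test-set scores, Eq. (39)."""
    r = np.corrcoef(rankdata(t_hat_test), rankdata(t_true_test))[0, 1]
    return 1 - abs(r)

def loading_error(l_hat_test, l_true_test):
    """1 - |Pearson correlation| of test-set loadings, Eq. (40)."""
    r = np.corrcoef(l_hat_test, l_true_test)[0, 1]
    return 1 - abs(r)
```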

4. Weight similarity to principal component axes

The directional means $\mu$ in Figs. 4A-B are obtained via

$$R = \frac{1}{n_\alpha} \sum_j e^{2 i \alpha_j} \qquad [41]$$

as $\mu = \arg(R)/2$. To interpret the distribution of cosine similarities between weights and the first principal component axis, we compare this distribution to a reference, namely to the distribution of cosine similarities between a random $n$-dimensional unit vector and an arbitrary other unit vector $\vec{e}$. This distribution $f$ is given by (23)
$$f_n(x) = \frac{d\, P(X \leq x)}{d x} \qquad [42]$$

where $P$ denotes the cumulative distribution function for the probability that a random unit vector has cosine similarity with $\vec{e}$ (or projection onto $\vec{e}$) $\leq x$. For $-1 \leq x \leq 0$, $P$ can be expressed in terms of the surface area $A_n(h)$ of the $n$-dimensional hyperspherical cap of radius 1 and height $h$ (i.e. $x - h = -1$)

$$P(X \leq x) = \frac{A_n(h)}{A_n(2)} \qquad [43]$$

where $A_n(2)$ is the complete surface area of the hypersphere and
$$A_n(h) = \frac{1}{2} A_n(2)\, I\!\left( h(2-h);\, \frac{n-1}{2},\, \frac{1}{2} \right) \qquad [44]$$
and $I$ is the regularized incomplete beta function. Thus,
$$f_n(x) = \frac{1}{2} \frac{d}{dx}\, I\!\left( (x+1)(1-x);\, \frac{n-1}{2},\, \frac{1}{2} \right) \qquad [45]$$
$$= \frac{1}{2\, B\!\left(\frac{n-1}{2}, \frac{1}{2}\right)}\, (1 - x^2)^{\frac{n-3}{2}}\, (x^2)^{-1/2}\, (-2x) \qquad [46]$$
$$= \frac{1}{B\!\left(\frac{n-1}{2}, \frac{1}{2}\right)}\, (1 - x^2)^{\frac{n-3}{2}} \qquad [47]$$
where $B$ is a beta function and

$$f_n(2\tilde{x} - 1) \propto (2 - 2\tilde{x})^{\frac{n-1}{2} - 1}\, (2\tilde{x})^{\frac{n-1}{2} - 1} \qquad [48]$$
$$\propto f_\beta\!\left( \tilde{x};\, \frac{n-1}{2},\, \frac{n-1}{2} \right) \qquad [49]$$
where $f_\beta$ is the probability density function of the beta distribution. Hence, $2\tilde{X} - 1$ with $\tilde{X} \sim \mathrm{Beta}\!\left(\frac{n-1}{2}, \frac{n-1}{2}\right)$ is a random variable representing the cosine similarity between 2 random vectors (or the projection of a random unit vector onto another).
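A quick numerical check of this reference distribution (our own sketch; the dimensionality is an arbitrary example value) compares cosine similarities of random unit vectors with samples of $2\tilde{X} - 1$, $\tilde{X} \sim \mathrm{Beta}\!\left(\frac{n-1}{2}, \frac{n-1}{2}\right)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                              # dimensionality (example value)
e = np.eye(n)[0]                                    # fixed reference unit vector

v = rng.standard_normal((100_000, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)       # random unit vectors
cos_sim = v @ e                                     # cosine similarities with e

beta_draws = 2 * rng.beta((n - 1) / 2, (n - 1) / 2, size=100_000) - 1
print(np.percentile(cos_sim, [5, 50, 95]))          # the two sets of quantiles
print(np.percentile(beta_draws, [5, 50, 95]))       # should agree closely
```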

5. Analysis of empirical data

We demonstrate CCA and PLS analysis in empirical data using data from the Human Connectome Project (HCP) (24) and UK Biobank (25).


A. Human Connectome Project data.

A.1. fMRI data. We used resting-state fMRI (rs-fMRI) data from 951 subjects from the Human Connectome Project (HCP) 1200-subject data release (03/01/2017) (24). The rs-fMRI data were preprocessed in accordance with the HCP Minimal Preprocessing Pipeline (MPP). The details of the HCP preprocessing can be found elsewhere (26, 27). Following the HCP MPP, BOLD time-series were denoised using ICA-FIX (28, 29) and registered across subjects using surface-based multimodal inter-subject registration (MSMAll) (30). Additionally, global signal, ventricle signal, white matter signal, and subject motion, as well as their first-order temporal derivatives, were regressed out (31).
The rs-fMRI time-series of each subject comprised 2 (69 subjects), 3 (12 subjects), or 4 (870 subjects) sessions. Each rest session was recorded for 15 minutes with a repetition time (TR) of 0.72 s. We removed the first 100 time points from each of the BOLD sessions to mitigate any baseline offsets or signal intensity variation. We subtracted the mean from each session and then concatenated all rest sessions for each subject into a single time-series.
Voxel-wise time series were parcellated to obtain region-wise time series using the "RelatedValidation210" atlas from the S1200 release of the Human Connectome Project (32). Functional connectivity was then computed as the Fisher-z-transformed Pearson correlation between all pairs of parcels.
3 subjects were excluded (see section D below), resulting in a total of 948 subjects with 100 connectivity features each.
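The functional-connectivity step just described can be sketched as follows. Parcellation and denoising are assumed to have been done already; `ts` is an assumed time-points × parcels array for one subject.

```python
# Sketch: Fisher-z-transformed Pearson correlations between all parcel pairs.
import numpy as np

def functional_connectivity(ts):
    r = np.corrcoef(ts, rowvar=False)   # parcels x parcels correlation matrix
    np.fill_diagonal(r, 0)              # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(r)                # Fisher z-transform
```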

A.2. dMRI data. Diffusion MRI (dMRI) data and structural connectivity patterns were obtained as described in (33, 34). In brief, 41 major white matter (WM) bundles were reconstructed from preprocessed HCP diffusion MRI data (35) using FSL's XTRACT toolbox (34). The resultant tracts were vectorised and concatenated, giving a WM voxels × tracts matrix. Further, a structural connectivity matrix was computed using FSL's probtrackx (36, 37), by seeding cortex/white-grey matter boundary (WGB) vertices and counting visitations to the whole white matter, resulting in a WGB × WM matrix. Connectivity "blueprints" were then obtained by multiplying the latter with the former matrix. This matrix was parcellated (along rows) into 68 regions with the Desikan-Killiany atlas (38), giving a final set of 68 × 41 = 2788 connectivity features for each of the 1020 HCP subjects.

A.3. Behavioral measures. The same list of 158 behavioral and demographic data items as in (39) was used.

A.4. Confounders. We used the following items as confounds: Weight, Height, BPSystolic, BPDiastolic, HbA1C, the cube root of FS_BrainSeg_Vol, the cube root of FS_IntraCranial_Vol, the averages of the absolute and the relative values of the root mean square of the head motion, the squares of all of the above, and an indicator variable for whether an earlier or later software version was used for MRI preprocessing. Head motion and software version were only included in the analysis of fMRI vs behavioral data, not in the analysis of dMRI vs behavioral data. Missing values were set to 0. All resulting confounds were z-scored across subjects.

B. UK Biobank data.

B.1. fMRI data. We utilised pre-processed resting-state fMRI data (40) from 20,000 subjects, available from the UK Biobank Imaging study (25).
In brief, EPI unwarping, distortion and motion correction, intensity normalisation and highpass temporal filtering were applied to each subject's functional data using FSL's Melodic (41), data were registered to standard space (MNI), and structured artefacts were removed using ICA and FSL's FIX (28, 29, 41).
A set of resting-state networks common across the cohort was identified using a subset of subjects (≈ 4000 subjects) (40). This was achieved by extracting the top 1200 components from a group-PCA (42) and a subsequent spatial ICA with 100 resting-state networks (41, 43). Visual inspection revealed 55 non-artefactual ICA components. Next, these 55 group-ICA networks were dual-regressed onto each subject's data to define grey matter nodes. The average timeseries of each of the nodes were used to compute partial-correlation parcellated connectomes with a dimensionality of 55 × 55. The connectomes were z-score transformed and the upper triangle vectorised to give 1485 functional connectivity features per subject, for each of the 20,000 subjects.
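The vectorisation step above (55 × 55 connectomes reduced to their 55·54/2 = 1485 upper-triangular entries) can be sketched as follows; `connectomes` is an assumed subjects × 55 × 55 array of z-scored partial-correlation matrices.

```python
# Sketch: vectorise the upper triangle of each subject's connectome.
import numpy as np

def vectorise_connectomes(connectomes):
    n_nodes = connectomes.shape[-1]
    iu = np.triu_indices(n_nodes, k=1)    # indices strictly above the diagonal
    return connectomes[:, iu[0], iu[1]]   # subjects x 1485 feature matrix
```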

B.2. Behavioural measures. The UK Biobank contains a wide range of subject measures (44), including physical measures (e.g. weight, height), food and drink, cognitive phenotypes, lifestyle, early life factors and sociodemographics.
We hand-picked a subset of 3895 cognitive, lifestyle and physical measures, as well as early life factors. For categorical items, we replaced negative values with 0, as in (25); such negative values encode mostly "Do not know"/"Prefer not to answer". We then removed measures that had missing values in more than 50% of subjects (for instance measures that reflected subsequent visits, which were not available for many subjects that only had one visit). We also removed measures that had identical values in at least 90% of subjects, leaving 633 non-imaging measures. We then performed a redundancy check: if the correlation between any two measures was > 0.98, one of the two items was randomly chosen and dropped. This procedure removed a further 62 measures (mostly physical measures, as well as some less informative sections of tests), resulting in a final set of 571 behavioural measures, available for each of the 20,000 subjects.
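The item filtering just described can be sketched as below. The thresholds come from the text; `df` is an assumed subjects × items pandas DataFrame of the hand-picked measures with negative categorical codes already set to 0, and the pairwise loop is a simple, unoptimized illustration.

```python
# Sketch of the behavioural-item filtering (thresholds from the text).
import numpy as np
import pandas as pd

def filter_items(df, rng=np.random.default_rng(0)):
    # drop items with missing values in more than 50% of subjects
    df = df.loc[:, df.isna().mean() < 0.5]
    # drop items with identical values in at least 90% of subjects
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0] < 0.9]
    df = df[keep]
    # redundancy check: correlation > 0.98 -> randomly drop one of the pair
    corr = df.corr()
    to_drop = set()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and b not in to_drop and corr.loc[a, b] > 0.98:
                to_drop.add(str(rng.choice([a, b])))
    return df.drop(columns=list(to_drop))
```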

B.3. Confounds. We used the following items as confounds: acquisition protocol phase (due to slight changes in acquisition protocols over time), scaling of the T1 image to the MNI atlas, brain volume normalized for head size (sum of grey matter and white matter), fMRI head motion, fMRI signal-to-noise ratio, age, and sex. In addition, similar to (25), we used the squares of all


non-categorical items (i.e. T1-to-MNI scaling, brain volume, fMRI head motion, fMRI signal-to-noise ratio, and age), as well as age × sex and age² × sex. Altogether these were 14 confounds.
Finally, we imputed 0 for missing values and z-scored all items.

C. Preprocessing for CCA and PLS. We prepared data for CCA following, for the most part, the pipeline in (39).

Deconfounding. Deconfounding of a matrix $X$ with a matrix of confounds $C$ was performed by subtracting linear predictions, i.e.

$X_{\mathrm{deconfounded}} = X - C\beta$    [50]

where

$\beta = C^{+} X = \left(C^{T} C\right)^{-1} C^{T} X$    [51]

The confounds used were specific to each dataset and are listed in the previous sections.
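A minimal sketch of the deconfounding step of Eqs. [50]–[51], with illustrative names, is shown below; the least-squares solve is equivalent to applying the pseudoinverse $C^{+}$ and is also well defined if $C$ is rank-deficient.

```python
# Sketch: residualise a data matrix X against a confound matrix C (Eqs. [50]-[51]).
import numpy as np

def deconfound(X, C):
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)   # beta = pinv(C) @ X  (Eq. [51])
    return X - C @ beta                            # Eq. [50]
```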

Neuroimaging data. Neuroimaging measures were, on the one hand, z-scored. On the other hand, normalized values were used as additional features: normalization was performed by calculating each feature's absolute value of the mean across subjects and, in case this mean was above 0.1 (otherwise the feature was not used in normalized form), dividing the original values of the feature by this mean and z-scoring the resulting values across subjects.
The resulting data matrix was deconfounded (as described above), decomposed into principal components via a singular value decomposition, and the left singular vectors, multiplied by their respective singular values, were used as data matrix $X$ in the subsequent CCA or PLS analysis.
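The following sketch puts these neuroimaging preprocessing steps together (illustrative names; the number of retained components is a parameter, e.g. 100 in the HCP analysis, cf. Fig. S10, and `C` is an assumed z-scored confound matrix).

```python
# Sketch of the neuroimaging preprocessing: z-scoring, normalized feature
# copies, deconfounding, and PCA via SVD with singular-value scaling.
import numpy as np

def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0)

def prepare_neuroimaging(data, C, n_components=100):
    """data: subjects x features array; C: subjects x confounds."""
    mean = data.mean(axis=0)
    mask = np.abs(mean) > 0.1                        # features that also get a normalized copy
    X = np.hstack([zscore(data), zscore(data[:, mask] / mean[mask])])
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)     # deconfound (Eqs. [50]-[51])
    X = X - C @ beta
    U, s, _ = np.linalg.svd(X, full_matrices=False)  # PCA via SVD
    return U[:, :n_components] * s[:n_components]    # PC scores scaled by singular values
```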

Behavioral and demographic data. The list of behavioral items used was specific to each dataset and is given in the previous sections. Given this list, separately for each item, a rank-based inverse normal transformation (45) was applied and the result z-scored. For both of these steps subjects with missing values were disregarded. Next, a subjects × subjects covariance matrix across variables was computed, considering for each pair of subjects only those variables that were present for both subjects. The nearest positive definite matrix to this covariance matrix was computed using the function cov_nearest from the Python statsmodels package (46). This procedure has the advantage that subjects can be used without the need to impute missing values. An eigenvalue decomposition of the resulting covariance matrix was performed; the eigenvectors, scaled to have standard deviation 1, are principal component scores. They were then scaled by the square roots of their respective eigenvalues (so that their variances correspond to the eigenvalues) and used as matrix $Y$ in the subsequent CCA or PLS analysis.
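A sketch of this behavioural pipeline is shown below. Names are illustrative, the Blom constant c = 3/8 in the inverse normal transform is an assumption (the specific variant is not stated in the text), and `df` is an assumed subjects × items pandas DataFrame.

```python
# Sketch of the behavioural preprocessing: rank-based inverse normal transform,
# pairwise-complete subjects x subjects covariance, nearest positive definite
# projection, and eigenvalue-scaled principal component scores.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.correlation_tools import cov_nearest

def inverse_normal_transform(s, c=3.0 / 8):
    """Rank-based inverse normal transformation of one item (NaNs preserved)."""
    ranks = s.rank()                                   # NaNs keep NaN ranks
    p = (ranks - c) / (s.count() - 2 * c + 1)
    return pd.Series(stats.norm.ppf(p), index=s.index)

def prepare_behavioral(df, n_components=None):
    z = df.apply(inverse_normal_transform)
    z = (z - z.mean()) / z.std()                       # z-score, ignoring NaNs
    # subjects x subjects covariance using only items present for both subjects
    cov = z.T.cov().to_numpy()
    cov = cov_nearest(cov)                             # nearest positive definite matrix
    evals, evecs = np.linalg.eigh(cov)
    evals, evecs = evals[::-1], evecs[:, ::-1]         # descending eigenvalue order
    # eigenvectors scaled to std 1, then by sqrt(eigenvalue)
    pcs = evecs / evecs.std(axis=0) * np.sqrt(np.clip(evals, 0, None))
    return pcs if n_components is None else pcs[:, :n_components]
```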

D. CCA/PLS analysis. Permutation-based p-values in Figs. 5 and S11 were calculated as the probability that the CCA or PLS association strength of permuted datasets was at least as high as in the original, unpermuted data. Specifically, to obtain the p-value, rows of the behavioral data matrix were permuted, and each resulting permuted data matrix, together with the unpermuted neuroimaging data matrix, was subjected to the same analysis as the original, unpermuted data, in order to obtain a null distribution of between-set associations. 1000 permutations were used.
Due to familial relationships between HCP subjects, subjects are not exchangeable, so that not all possible permutations are appropriate (47). To account for this, in the analysis of HCP fMRI vs behavioral data, we calculated the permutation-based p-value as well as the confidence interval for the whole-data (but not the subsampled-data) analysis using only permutations that respect familial relationships. Allowed permutations were calculated using the functions hcp2blocks and palm_quickperms with default options, as described in https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM/ExchangeabilityBlocks (accessed May 18, 2020). No permutation indices were returned for 3 subjects, which were therefore excluded from the functional connectivity vs behavior analysis.
Subsampled analyses (Figs. 5A-D) were performed for 5 logarithmically spaced subsample sizes between 202 and 80% of the total subject number. For each subsample size 100 subsampled data matrices were used.
Cross-validated analyses were performed with 5-fold cross-validation.
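The unrestricted permutation test described at the beginning of this subsection can be sketched as follows; `fit_assoc(X, Y)` is a placeholder standing in for whatever routine returns the first CCA or PLS association strength, and the restricted (exchangeability-block) permutations for the HCP are not covered by this sketch.

```python
# Sketch of the permutation-based p-value: permute rows of Y, re-run the
# analysis, and count how often the permuted association strength is at
# least as high as the observed one.
import numpy as np

def permutation_pvalue(X, Y, fit_assoc, n_perm=1000, rng=np.random.default_rng(0)):
    observed = fit_assoc(X, Y)
    null = np.array([fit_assoc(X, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    # proportion of permutations at least as strong as the observed value
    # (a "+1" correction in numerator and denominator is a common alternative)
    return np.mean(null >= observed)
```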

E. Principal component spectrum decay constants. The decay constant of a principal component spectrum (Fig. S1) was estimated as the slope of a linear regression (including an intercept term) of log(explained variance of a principal component) on log(principal component number). For each dataset in Fig. S1 we included as many principal components in the linear regression as necessary to explain either 30% or 90% of the variance.
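A minimal sketch of this estimate is given below (illustrative names; at least two components are always used so that the regression is defined).

```python
# Sketch: decay constant as the slope of log(explained variance) on
# log(component number), using enough leading components to reach `target`.
import numpy as np

def decay_constant(explained_variance, target=0.3):
    ev = np.sort(np.asarray(explained_variance, dtype=float))[::-1]
    frac = np.cumsum(ev) / ev.sum()
    k = max(int(np.searchsorted(frac, target)) + 1, 2)   # components included
    x = np.log(np.arange(1, k + 1))
    slope, intercept = np.polyfit(x, np.log(ev[:k]), 1)  # regression with intercept
    return slope
```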

6. Meta-analysis of prior literature

A PubMed search was conducted on December 23, 2019 using the query ("Journal Article"[Publication Type]) AND (fmri[MeSH Terms] AND brain[MeSH Terms]) AND ("canonical correlation analysis") with filters requiring full text availability and studies in humans. In addition, studies known to the authors were considered. CCA results were included in the meta-analysis if they related neuroimaging-derived measures (e.g. structural or functional MRI, ...) to behavioral or demographic measures (e.g. questionnaires, clinical assessments, ...) across subjects, if they reported the number of subjects and the number of features of the data entering the CCA analysis, and if they reported the observed canonical correlation. This resulted in 100 CCA analyses reported in 31 publications (39, 48–77), which are summarized in SI Dataset 1.


7. Determination of required sample size

As all evaluation metrics change approximately monotonically with samples per feature, we fit splines of degree 3 to interpolate and to determine the number of samples per feature that approximately results in a given target level for the evaluation metric. For power (higher values are better) we target 0.9; for all other metrics (lower values are better) we target 0.1. Before fitting the splines, all samples-per-feature values are log-transformed and metrics are averaged across repeated datasets from the same covariance matrix. Sometimes the evaluation metrics show non-monotonic behavior (e.g. due to numerical errors); in case the cubic spline has multiple roots, we filter out those roots in whose vicinity the spline fluctuates strongly (suggesting noise), and select the smallest remaining root ñ for which the interpolated metric remains within the allowed error margin for all simulated n > ñ, or discard the synthetic dataset if all roots are filtered out. In case a metric falls within the allowed error margin for all simulated n (i.e. even the smallest simulated n₀), we pick n₀.
We suggest, in particular, a combined criterion to determine an appropriate sample size. This is obtained by first calculating samples-per-feature values with the interpolation procedure just described, separately for the metrics power, relative error of association strength, weight error, score error and loading error. Then, for each parameter set, the maximum is taken across these five metrics.
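The interpolation step can be sketched as follows. Names are illustrative, the input is assumed to be sorted by sample size with metrics already averaged across repetitions, and the root-filtering heuristics for noisy, non-monotonic curves described above are omitted.

```python
# Sketch: cubic-spline interpolation of a metric over log(samples per feature)
# and the smallest sample size at which the metric reaches the target level.
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

def required_samples_per_feature(n_per_ftr, metric, target=0.1):
    n_per_ftr = np.asarray(n_per_ftr, dtype=float)
    metric = np.asarray(metric, dtype=float)
    spline = InterpolatedUnivariateSpline(np.log(n_per_ftr), metric - target, k=3)
    roots = spline.roots()
    if len(roots) == 0:
        # metric never crosses the target: either always within the margin ...
        return n_per_ftr[0] if metric[0] <= target else np.nan  # ... or never
    return float(np.exp(roots.min()))   # smallest crossing, back on the n scale
```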

8. Sample size calculator for CCA and PLS

Estimating an appropriate sample size via the approach described in the previous section is computationally expensive, as multiple potentially large datasets have to be generated and analyzed. To abbreviate this process (see also Fig. S14), we use the approach from the previous section to obtain sample size estimates for rtrue ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, px ∈ {2, 4, 8, 16, 32, 64, 128}, py = px, and (for PLS) ax + ay ∼ U(−3, 0), ax = c(ax + ay), and c ∼ U(0, 1), where U denotes a uniform distribution. We then fit a linear model to the logarithm of the sample size, with predictors log(rtrue), log(px + py), (for PLS) |ax + ay|, and including an intercept term.
We tested the predictions of the linear model using a split-half approach (Fig. S17), i.e. we refitted the model using either only sample size estimates for rtrue ∈ {0.1, 0.3} and half the values for rtrue = 0.5, or the other half of the data, and tested the resulting refitted model on the remaining data in each case.
As for PLS (unlike CCA) the direction of the weight vectors relative to the principal component axes results in varying amounts of explained variance, we also tested an alternative linear model for PLS, in which log(vx vy) was included as an additional predictor, where vx and vy denote, respectively, the explained variance ratios for the X and Y weight vectors, i.e. the variance of the scores divided by the trace of the corresponding within-set covariance matrix. Note that, as the true weights are unknown in practice, this additional predictor is inaccessible, and the alternative linear model only serves to gauge how much of the uncertainty in the linear model is due to this unobservable component.
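A sketch of fitting and applying such a linear model follows (illustrative names; the sign of the log(rtrue) predictor is absorbed in its coefficient, and the model inputs are assumed to be the pre-calculated required sample sizes described above).

```python
# Sketch: ordinary-least-squares fit of log(required sample size) on
# log(r_true), log(px + py) and, for PLS, |ax + ay|, with an intercept.
import numpy as np

def fit_sample_size_model(r_true, px, py, n_required, ax_plus_ay=None):
    cols = [np.ones(len(np.atleast_1d(r_true))),
            np.log(r_true),
            np.log(np.asarray(px) + np.asarray(py))]
    if ax_plus_ay is not None:                      # PLS only
        cols.append(np.abs(ax_plus_ay))
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, np.log(n_required), rcond=None)
    return coef                                     # intercept first

def predict_sample_size(coef, r_true, px, py, ax_plus_ay=None):
    x = [1.0, np.log(r_true), np.log(px + py)]
    if ax_plus_ay is not None:
        x.append(abs(ax_plus_ay))
    return float(np.exp(np.dot(coef, x)))
```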

9. The gemmr software package

We provide an open-source Python package, called gemmr, that implements the generative modeling framework presented in this paper. Among other functionality, it provides estimators for CCA, PLS and sparse CCA; it can generate synthetic datasets for use with CCA and PLS using the algorithm laid out above; it provides convenience functions to perform sweeps of the parameters on which the generative model depends; and it calculates required sample sizes to bound power and other error metrics as described above. For a full description, we refer to the package's documentation.

10. Code and data availability

Our open-source Python software package, gemmr, is freely available at https://github.com/murraylab/gemmr. It has dependencies on scikit-learn (78), statsmodels (46), xarray (79), pandas (80), SciPy (81) and NumPy (82), among others. Jupyter notebooks detailing the analyses and generation of figures presented in the manuscript are made available as part of the package documentation. The outcomes of synthetic datasets that were analyzed with CCA or PLS are available from https://osf.io/8expj/.


[Figure S1 panels A–J: log–log plots of principal component variance ("Variance") against component number ("Principal component id"), with the fitted decay-constant slopes annotated in each panel.]

Fig. S1. Decay constants of principal component spectra in empirical data. Decay constants are estimated as the slope in a linear regression for the logarithm of the explained variance on the logarithm of the associated principal component number. We include enough components into the linear regression as necessary to explain either 30 % (red) or 90 % (yellow) of the variance. Where the two resulting slopes coincide only one is shown. Shown are decay constants for the following data matrices: A) HCP functional connectivity and B) HCP functional connectivity after preprocessing for CCA / PLS (as described in subsection C), both based on 951 subjects. C) HCP functional connectivity for 877 subjects where global signal was not regressed out (cf. subsection A.1) and D) HCP functional connectivity of 877 subjects where global signal was not regressed out after preprocessing for CCA / PLS. E) HCP global brain connectivity (GBC), i. e. the sum across rows of the parcel × parcel functional connectivity matrix (951 subjects) and F) HCP GBC where global signal was not regressed out (877 subjects). G) HCP behavioral data of 951 subjects after preprocessing for CCA / PLS H) HCP diffusion MRI of 1020 subjects after preprocessing for CCA / PLS. I) UK Biobank fMRI of 20000 subjects after preprocessing for CCA / PLS, J) UK Biobank behavioral measures of 20000 subjects after preprocessing for CCA / PLS.



Fig. S2. Algorithm for choosing weight vectors. The flowchart illustrates the main logic of the algorithm. We require weight vectors (i) to be orthonormal within each set, (ii) to result in scores that explain at least a given fraction of variance, and (iii) to result in a proper, i.e. positive definite, joint covariance matrix Σ. Orthonormality is imposed directly when candidate weight vectors are proposed, and if the other two conditions are satisfied we say the weights are admissible. In the first stage of the algorithm, random weight vectors are generated as the Q factor of a QR-factorization of a matrix whose elements are drawn independently from a standard normal distribution. If this fails, an optimization algorithm is used to find weight vectors resulting in a positive definite matrix Σ. If this also fails, the first principal component is used as the first part of the weight vectors. In all three cases, after having found weight vectors in one of these ways, a component from the low-variance subspace is added, referred to in the flowchart as "noise".



Fig. S3. Samples per feature is a key effective parameter (I). Throughout the manuscript we have presented results in terms of the parameter "samples per feature". Here we demonstrate that this is a suitable parameterization for CCA, while only approximately so for PLS. Color hue represents true correlation rtrue; saturated colors are used for pX = pY = 2, and fainter colors for higher pX (and we used pY = pX). In CCA (left column), for a given rtrue, power and error metric curves for various numbers of features are very similar when parameterized as "samples per feature". In PLS (middle column), the same tendency can be observed, albeit the overlap between curves of the same hue (i.e. with the same rtrue but different numbers of features) is worse. When "samples / (number of features)^1.5" is used instead (right column), the curves overlap more. See also Fig. S4. Of note, the downstream effect of the "samples per feature" parameterization can be seen in Fig. 8B, where each dot represents a particular number of features: the dots for a given rtrue do not scatter appreciably for CCA but do for PLS. Likewise, log(px + py) was used as a predictor in the linear model used for prediction of log(n), and the corresponding coefficient was around 1 (indicating n ∼ p) for CCA, but above 1 for PLS (Fig. S17A).



Fig. S4. Samples per feature is a key effective parameter (II). Same format as Fig. S3. Unlike there, here we used pX ≠ pY and pX + pY was fixed at 64. Also unlike in Fig. S3, the difference between the middle and right columns is smaller, suggesting that "samples per feature" is a good parameterization for PLS when the total number of features is fixed, but pX and pY are not necessarily identical. Note that for PLS rtrue = 0.1 is not shown due to computational expense.



Fig. S5. Supplementary results related to meta-analysis. A) Typical number of samples per feature in brain-behavior CCAs. Studies using CCA to analyze brain-behavior relationships often used fewer than 5 samples per feature. B) Distance from the null in the subjects-per-feature vs observed correlation plot predicts weight error. A linear model was fit to the simulated, permuted data shown in Fig. 6A, and for each reported CCA the orthogonal distance to the fit line was measured; it is shown here on the x-axis, with positive values indicating deviations towards the top-right corner of Fig. 6A. The mean estimated weight error for the reported CCAs becomes smaller the farther away from the permuted data the CCA lies in the top-right part of the plot.



Fig. S6. A wide range of true association strengths is compatible with a given observed association strength. Synthetic datasets were generated where the true correlation was varied from 0 to 0.99 in steps of 0.01 and analyzed with A) CCA, B) PLS. We investigated 4, 8, 16, 32, 64 and 128 features per set, set up 10 different covariance matrices with differing true weight vectors for each number of features and true correlation, and drew 100 repeated datasets from each corresponding normal distribution. For every CCA and PLS we recorded the observed association and binned them in bins with width 0.01. The plots show 95 % confidence intervals of the true association strength that were associated with a given observed association strength. Notably, apart from the very strongest observed association strengths which indicate an almost equally strong true correlation, compatible true association strengths can be markedly lower, down to essentially 0, when the number of used samples per feature is low.



Fig. S7. Cross-validated estimation of association strength. A-B) In contrast to in-sample estimates, cross-validated estimates of between-set association strengths underestimate the true value. We tested two different cross-validation strategies here with very similar results (curves overlap): 5-fold cross-validation (dash-dotted line) and a strategy where the data were randomly split 20 times into 80% train and 20% test ("20×5-fold CV", dotted line). C-D) The absolute value of the relative estimation error is similar for in-sample and cross-validated estimates. E-F) Using the average of the in-sample and cross-validated estimates results in a better estimate than either of them, so that fewer samples are required to reach a target error level (here: 10%).



Fig. S8. Bootstrapped estimates are biased but recover variability. Bootstrapping affects CCA (left column) and PLS (right column) in a similar manner. A, B) Bootstrapped association strengths averaged across 100 bootstrap iterations and repeated draws from a given normal distribution (dashed lines) are somewhat worse than estimates obtained from the full samples (solid lines) averaged across repetitions. Likewise, C, D) average weight errors and E, F) the number of samples required to obtain less than 10 % weight error are somewhat worse when estimated by bootstrapping. G, H) On the other hand, the variability of the bootstrap estimates, assessed as the interquartile range (IQR) across bootstrap iterations (and averaged across repetitions) of elements of the estimated weight vectors, match the IQR across repetitions. For each combination of rtrue ∈ {0.3, 0.5, 0.7, 0.9}, px ∈ {2, 4, 8, 16, 32, 64} (py = px) and 5 different covariance matrices (with different true weight vectors), the scatter-plots show one dot for each element of the weight vector.


Fig. S9. Illustration of loadings. Loadings are defined as Pearson correlations across subjects of a feature with the CCA/PLS scores. The loadings vector contains these correlations for all variables. Apart from the illustrated loadings, cross-loadings in which scores of one set are correlated with the original features of the other set can also be computed.



Fig. S10. HCP data analysis workflow. Resting-state functional connectivity data and behavioral and demographic data from corresponding subjects were separately deconfounded, reduced to 100 principal components and then analyzed with CCA and PLS.



Fig. S11. Re-analysis of HCP fMRI vs behavior data with optimized number of principal components. Format is identical to Fig. 5. The only difference is the number of principal components retained for analysis: whereas in Fig. 5 100 principal components were used for both datasets, in agreement with previous studies of HCP data (39, 55, 60, 65, 72, 83), here we chose the number of principal components with the "max-min detector" from (84). As the algorithm provided multiple values for the optimal number of components pX (neuroimaging data) and pY (behavioral and demographic data), we selected here the pair that minimized pX + pY. The optimized values were pX = 55 and pY = 33, along with 12 between-set modes (we only consider the first one here). p-values for CCA and PLS were, respectively, 0.001 and 0.007. While the results are very similar to Fig. 5, (i) the observed correlations in A) appear to have stabilized more and are lower than in Fig. 5A, (ii) in-sample and cross-validated association strengths are more similar here in panels A) and C) than in Fig. 5, and (iii) weight similarities in B) and D) are higher than in Fig. 5. Altogether, results seem to have converged more with the same sample size. This demonstrates the potential benefit of dimensionality reduction for CCA and PLS.



Fig. S12. Schematic for estimating weight errors for published CCA results. For each CCA from the literature in our database, synthetic data for CCA is generated with matching number of samples and features. Separate datasets are generated for assumed ground-truth correlations varying between 0 and 0.99. In each generated dataset the canonical correlation is estimated and if it is close to the value in the reported CCA, the weight error for the synthetic dataset is recorded. The distribution of recorded weight errors across assumed ground-truth correlations and repetitions of the whole process is shown in Fig. 6B and its mean in Fig. 6A.



Fig. S13. True loadings and true cross-loadings are similar. True loadings and cross-loadings were calculated with Eq. (3) and Eq. (4), respectively. A) In CCA, true loadings and true cross-loadings were collinear (as predicted by Eq. (22)). B) For PLS, they were strongly correlated. The shown correlations were averaged across 25 covariance matrices with different true weight vectors. Moreover, for PLS, ax + ay was constrained to -2.



Fig. S14. Algorithm for sample size calculation. Sample sizes can, in principle, be calculated directly with GEMMR, as shown in Fig. 8. However, this is computationally expensive. To quickly obtain sample size estimates, we developed the algorithm illustrated here.



Fig. S15. Sample size dependence on ground-truth correlation rtrue, number of features and metric. A-B) Required sample sizes based on the combined criterion increase with the number of features and for low true correlations. Due to computational expense, values for some parameter sets were not available (white). C-D) The sample size dependences on the number of features, shown here for rtrue = 0.3, scale similarly for all metrics, albeit with slight offsets.



Fig. S16. Sample size dependence of PLS on within-set variances. Simulated parameter sets were averaged across subsets having indicated values for ax + ay (the sum of within-set power-law decay constants) ±0.5. The closer ax + ay was to 0 (i. e. the "whiter" the data) the fewer samples were required.



Fig. S17. Sample size calculator. Especially for low assumed ground-truth correlations and a high number of features it is computationally expensive to estimate the required number of samples by generating synthetic datasets and searching for the sample size such that error bounds are satisfied. To abbreviate this process we pre-calculate required sample sizes using the generative model approach for certain parameter values, fit a linear model to log(nrequired) and then use it to quickly interpolate for parameter values not in the pre-calculated database. Predictors for the linear model are −log(rtrue), log(px + py) and, for PLS only, |ax + ay|, where px and py are the number of features in datasets X and Y, respectively, and ax and ay are the power-law decay constants for the within-set principal component spectrum. Shown here are linear model estimates for the required sample size based on the combined criterion, i.e. the sample sizes required to obtain 90% power and at most 10% error for the between-set association strength, weight, score and loading error. A) Linear model coefficients for CCA and PLS. B-D) The pre-calculated database was split in half, where one half corresponded to rtrue = 0.1 and 0.3, the other to rtrue = 0.7 and 0.9, and entries for rtrue = 0.5 were divided between the two halves. The linear model was re-estimated separately for each half, and used to predict the other half. Predictions are B) good for CCA and C) somewhat worse for PLS. D) In contrast to CCA, which effectively uses whitened data, PLS weights, even for a given ground-truth correlation, can differ by their direction relative to the principal component axes and thus by how much within-set variance the corresponding scores explain. In practice this is unobservable, as it requires knowledge of the true weight vectors. Using synthetic data where we know the true weight vectors, we re-estimated the linear model for PLS with log(vx vy) as an additional predictor, where v indicates the explained variance ratio for the weight vector, i.e. the variance of the scores divided by the trace of the corresponding within-set covariance matrix. This results in much better predictions and indicates that much of the unexplained variance in C) is due to unobservables. E, F) Solving the linear model for rtrue, we aim to predict correlations. We train the model using either simulation outcomes for rtrue ∈ {0.1, 0.3} or rtrue ∈ {0.7, 0.9} and test the predictions on the remaining values of rtrue. E) Good predictions can be obtained in that way for CCA, F) but not for PLS.



Fig. S18. Sparse CCA. We determined required sample sizes with our analysis pipeline for the sparse CCA variant PMD (11). Due to the computational expense we ran only 6 repetitions per cell, and 5 and 4, respectively, for the 2 right-most cells on the bottom. A) Required sample sizes increased with the number of features and with decreasing rtrue. Layout is analogous to Fig. S15A-B. B) When the number of features was large and the true correlation rtrue low, sparse CCA required somewhat fewer samples than CCA. For large rtrue, in particular, we found the opposite.


Table S1. Considerations and recommendations for using CCA and PLS in practice.

1. Importance of sample size and number of features: Sample size and the number of features in the dataset are of critical importance for the stability of CCA and PLS. Dimensionality reduction (e.g. PCA) is a useful preprocessing step, as long as it does not remove components correlated between sets. Methods for selecting the number of components that take into account the correlation between sets have been proposed, e.g. (84).
2. Significance testing: A significant non-zero association does not necessarily indicate that estimated weights are reliable.
3. Association strength error: In-sample estimates for association strengths are too high, cross-validated estimates too low; their average tended to be better.
4. Weights & loadings: Weights and loadings estimated with too few samples are unreliable. For PLS, estimation of cross-loadings required fewer samples than loadings.
5. PC1 bias: In PLS, weights can be biased towards the first principal component.
6. Deceptive weight stability: For PLS, weights can appear stable, scattering around the first principal component axis, and converge to their true values only for very large sample sizes.
7. Subsampling: Subsampling can be used to check the stability of estimated association strengths in empirical data: similar results for varying subsample sizes indicate stability.
8. Bootstrap: Bootstrapped estimates were useful to assess the variability of weights, but not for obtaining accurate estimates of association strengths or weights.
9. Reporting: The number of samples, the number of features (after dimensionality reduction) and the obtained association strength should be reported. For PLS, the within-set variance spectrum is useful as well.
10. Required sample size: Generally, we recommend at least 50 samples per feature for CCA, more for PLS (depending on the variance spectrum). The accompanying Python package can be used to calculate recommended sample sizes for given dataset characteristics.


References

454 1. Smith SM, Nichols TE (2018) Statistical Challenges in “Big Data” Human Neuroimaging. Neuron 97(2):263–268. 455 2. Rosipal R, Krämer N (2006) Overview and Recent Advances in Partial Least Squares in Subspace, Latent Structure and 456 Feature Selection, Lecture Notes in Computer Science, eds. Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J. (Springer 457 Berlin Heidelberg), pp. 34–51. 458 3. Wegelin JA (2000) A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case. University of 459 Washington, Department of Statistics, Tech. Rep. 460 4. Abdi H, Williams LJ (2013) Partial Least Squares Methods: Partial Least Squares Correlation and Partial Least Square 461 Regression in Computational Toxicology, eds. Reisfeld B, Mayeno AN. (Humana Press, Totowa, NJ) Vol. 930, pp. 549–579. 462 5. Hotelling H (1936) Relations Between Two Sets of Variates. Biometrika 28(3/4):321–377. 463 6. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. (Academic Press), 10 edition. 464 7. Härdle WK, Simar L (2019) Applied Multivariate Statistical Analysis. (Springer International Publishing, Cham). 465 8. Uurtio V, et al. (2017) A Tutorial on Canonical Correlation Methods. ACM Computing Surveys (CSUR) 50(6):95:1–95:33. 466 9. Lê Cao KA, Rossouw D, Robert-Granié C, Besse P (2008) A Sparse PLS for Variable Selection when Integrating Omics 467 Data. Statistical Applications in Genetics and Molecular Biology 7(1). 468 10. Parkhomenko E, Tritchler D, Beyene J (2009) Sparse Canonical Correlation Analysis with Application to Genomic Data 469 Integration. Statistical Applications in Genetics and Molecular Biology 8(1):1–34. 470 11. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal 471 components and canonical correlation analysis. Biostatistics 10(3):515–534. 472 12. Tenenhaus A, et al. (2014) Variable selection for generalized canonical correlation analysis. Biostatistics 15(3):569–583. 473 13. Avants BB, Cook PA, Ungar L, Gee JC, Grossman M (2010) Dementia induces correlated reductions in white matter 474 integrity and cortical thickness: A multivariate neuroimaging study with sparse canonical correlation analysis. NeuroImage 475 50(3):1004–1016. 476 14. Mizutani S, Pauwels E, Stoven V, Goto S, Yamanishi Y (2012) Relating drug–protein interaction network with drug side 477 effects. Bioinformatics 28(18):i522–i528. Publisher: Oxford Academic. 478 15. Yahata N, et al. (2016) A small number of abnormal brain connections predicts adult autism spectrum disorder. Nature 479 Communications 7(1):1–12. Number: 1 Publisher: Nature Publishing Group. 480 16. Xia CH, et al. (2018) Linked dimensions of psychopathology and connectivity in functional brain networks. Nature 481 Communications 9(1):3003. 482 17. Stuart T, et al. (2019) Comprehensive Integration of Single-Cell Data. Cell 177(7):1888–1902.e21. 483 18. Mihalik A, et al. (2020) Multiple Holdouts With Stability: Improving the Generalizability of Machine Learning Analyses 484 of Brain–Behavior Relationships. Biological Psychiatry 87(4):368–376. 485 19. Zhuang X, Yang Z, Cordes D (2020) A technical review of canonical correlation analysis for neuroscience applications. 486 Human Brain Mapping p. hbm.25090. 487 20. Witten D, Tibshirani R (2020) PMA: Penalized Multivariate Analysis. 488 21. Le Floch E, et al. (2012) Significant correlation between a set of genetic polymorphisms and a functional brain network 489 revealed by feature selection and sparse Partial Least Squares. 
NeuroImage 63(1):11–24. 490 22. Storn R, Price K (1997) Differential Evolution – A Simple and Efficient Heuristic for global Optimization over Continuous 491 Spaces. Journal of Global Optimization 11(4):341–359. 492 23. Mathematics Stack Exchange (2018) x-coordinate distribution on the n-sphere. Library Catalog: math.stackexchange.com. 493 24. Van Essen DC, et al. (2013) The WU-Minn Human Connectome Project: An overview. NeuroImage 80:62–79. 494 25. Miller KL, et al. (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. 495 Nature Neuroscience 19(11):1523–1536. 496 26. Human Connectome Project (2017) 1200 Subjects Data Release Reference, Technical report. 497 27. Glasser MF, et al. (2013) The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80:105–124. 498 28. Salimi-Khorshidi G, et al. (2014) Automatic denoising of functional MRI data: Combining independent component analysis 499 and hierarchical fusion of classifiers. NeuroImage 90:449–468. 500 29. Griffanti L, et al. (2014) ICA-based artefact removal and accelerated fMRI acquisition for improved resting state network 501 imaging. NeuroImage 95:232–247. 502 30. Robinson EC, et al. (2014) MSM: A new flexible framework for Multimodal Surface Matching. NeuroImage 100:414–426. 503 31. Power JD, et al. (2018) Ridding fMRI data of motion-related influences: Removal of signals with distinct spatial and 504 physical bases in multiecho data. Proceedings of the National Academy of Sciences 115(9):E2105–E2114. 505 32. Glasser MF, et al. (2016) A multi-modal parcellation of human cerebral cortex. Nature 536(7615):171–178. 506 33. Mars RB, et al. (2018) Whole brain comparative anatomy using connectivity blueprints. eLife 7:e35237. 507 34. Warrington S, et al. (2020) XTRACT - Standardised protocols for automated tractography in the human and macaque 508 brain. NeuroImage p. 116923. 509 35. Sotiropoulos SN, et al. (2013) Advances in diffusion MRI acquisition and processing in the Human Connectome Project. 510 NeuroImage 80:125–143. 511 36. Behrens TEJ, Berg HJ, Jbabdi S, Rushworth MFS, Woolrich MW (2007) Probabilistic diffusion tractography with multiple 512 fibre orientations: What can we gain? NeuroImage 34(1):144–155. 513 37. Hernandez-Fernandez M, et al. (2019) Using GPUs to accelerate computational diffusion MRI: From microstructure


estimation to tractography and connectomes. NeuroImage 188:598–615.
38. Desikan RS, et al. (2006) An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage 31(3):968–980.
39. Smith SM, et al. (2015) A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature Neuroscience 18(11):1565–1567.
40. Alfaro-Almagro F, et al. (2018) Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage 166:400–424.
41. Beckmann CF, Smith SM (2004) Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Transactions on Medical Imaging 23(2):137–152.
42. Smith SM, Hyvärinen A, Varoquaux G, Miller KL, Beckmann CF (2014) Group-PCA for very large fMRI datasets. NeuroImage 101:738–749.
43. Hyvärinen A, Oja E (1997) A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation 9(7):1483–1492.
44. Sudlow C, et al. (2015) UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12(3):e1001779.
45. Beasley TM, Erickson S, Allison DB (2009) Rank-Based Inverse Normal Transformations are Increasingly Used, But are They Merited? Behavior Genetics 39(5):580.
46. Seabold S, Perktold J (2010) Statsmodels: Econometric and Statistical Modeling with Python in 9th Python in Science Conference.
47. Winkler AM, Webster MA, Vidaurre D, Nichols TE, Smith SM (2015) Multi-level block permutation. NeuroImage 123:253–268.
48. Akisaki T, et al. (2006) Cognitive dysfunction associates with white matter hyperintensities and subcortical atrophy on magnetic resonance imaging of the elderly diabetes mellitus Japanese elderly diabetes intervention trial (J-EDIT). Diabetes/Metabolism Research and Reviews 22(5):376–384.
49. Tsvetanov KA, et al. (2016) Extrinsic and Intrinsic Brain Network Connectivity Maintains Cognition across the Lifespan Despite Accelerated Decay of Regional Brain Activation. Journal of Neuroscience 36(11):3115–3126.
50. Ball G, et al. (2017) Multimodal image analysis of clinical influences on preterm brain development. Annals of Neurology 82(2):233–246.
51. Drysdale AT, et al. (2017) Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nature Medicine 23(1):28–38.
52. Marquand AF, Haak KV, Beckmann CF (2017) Functional corticostriatal connection topographies predict goal-directed behaviour in humans. Nature Human Behaviour 1(8):1–9.
53. Perry A, et al. (2017) The independent influences of age and education on functional brain networks and cognition in healthy older adults. Human Brain Mapping 38(10):5094–5114.
54. Rahim M, Thirion B, Varoquaux G (2017) Population-Shrinkage of Covariance to Estimate Better Brain Functional Connectivity in Medical Image Computing and Computer Assisted Intervention - MICCAI 2017, Lecture Notes in Computer Science, eds. Descoteaux M, et al. (Springer International Publishing, Cham), pp. 460–468.
55. Bijsterbosch JD, et al. (2018) The relationship between spatial configuration and functional connectivity of brain regions. eLife 7:e32992.
56. Lin HY, et al. (2018) Brain–behavior patterns define a dimensional biotype in medication-naïve adults with attention-deficit hyperactivity disorder. Psychological Medicine 48(14):2399–2408.
57. Lin SJ, Baumeister TR, Garg S, McKeown MJ (2018) Cognitive Profiles and Hub Vulnerability in Parkinson’s Disease. Frontiers in Neurology 9.
58. Rodrigue AL, et al. (2018) Multivariate Relationships Between Cognition and Brain Anatomy Across the Psychosis Spectrum. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 3(12):992–1002.
59. Becker R, Hervais-Adelman A (2019) Resolving the connectome - Spectrally-specific functional connectivity networks and their distinct contributions to behaviour. bioRxiv p. 700278.
60. Bijsterbosch JD, Beckmann CF, Woolrich MW, Smith SM, Harrison SJ (2019) The relationship between spatial configuration and functional connectivity of brain regions revisited. eLife 8:e44890.
61. Dinga R, et al. (2019) Evaluating the evidence for biotypes of depression: Methodological replication and extension of Drysdale et al. (2017). NeuroImage: Clinical p. 101796.
62. Manza P, Shokri-Kojori E, Volkow ND (2019) Reduced Segregation Between Cognitive and Emotional Processes in Cannabis Dependence. Cerebral Cortex.
63. Menon V, et al. (2019) Quantitative modeling links in vivo microstructural and macrofunctional organization of human and macaque insular cortex, and predicts cognitive control abilities. bioRxiv p. 662601.
64. Mihalik A, et al. (2019) Brain-behaviour modes of covariation in healthy and clinically depressed young people. Scientific Reports 9(1):1–11.
65. Li J, et al. (2019) Topography and behavioral relevance of the global signal in the human brain. Scientific Reports 9(1):1–10.
66. Pusil S, Dimitriadis SI, López ME, Pereda E, Maestú F (2019) Aberrant MEG multi-frequency phase temporal synchronization predicts conversion from mild cognitive impairment-to-Alzheimer’s disease. NeuroImage: Clinical 24:101972.
67. Rodriguez C, et al. (2019) Structural Correlates of Personality Dimensions in Healthy Aging and MCI. Frontiers in Psychology 9.
68. Supekar K, Cai W, Krishnadas R, Palaniyappan L, Menon V (2019) Dysregulated Brain Dynamics in a Triple-Network Saliency Model of Schizophrenia and Its Relation to Psychosis. Biological Psychiatry 85(1):60–69.
69. Yu M, et al. (2019) Childhood trauma history is linked to abnormal brain connectivity in major depression. Proceedings of the National Academy of Sciences 116(17):8582–8590.
70. Zarnani K, et al. (2019) Discovering markers of healthy aging: a prospective study in a Danish male birth cohort. Aging 11(16).
71. Alnæs D, Kaufmann T, Marquand AF, Smith SM, Westlye LT (2020) Patterns of sociocognitive stratification and perinatal risk in the child brain. Proceedings of the National Academy of Sciences.
72. Han F, Gu Y, Brown GL, Zhang X, Liu X (2020) Neuroimaging contrast across the cortical hierarchy is the feature maximally linked to behavior and demographics. NeuroImage 215:116853.
73. Hudgens-Haney ME, et al. (2020) Cognitive Impairment and Diminished Neural Responses Constitute a Biomarker Signature of Negative Symptoms in Psychosis. Schizophrenia Bulletin.
74. Jolly AE, Scott GT, Sharp DJ, Hampshire AH (2020) Distinct patterns of structural damage underlie working memory and reasoning deficits after traumatic brain injury. Brain 143(4):1158–1176.
75. Mason NL, et al. (2020) Me, myself, bye: regional alterations in glutamate and the experience of ego dissolution with psilocybin. Neuropsychopharmacology pp. 1–9.
76. Menon V, et al. (2020) Microstructural organization of human insula is linked to its macrofunctional circuitry and predicts cognitive control. eLife 9:e53470.
77. Trotti RL, et al. (2020) Electrophysiological correlates of emotional scene processing in bipolar disorder. Journal of Psychiatric Research 120:83–90.
78. Buitinck L, et al. (2013) API design for machine learning software: experiences from the scikit-learn project. arXiv:1309.0238 [cs].
79. Hoyer S, Hamman JJ (2017) xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software 5:10.
80. McKinney W (2010) Data structures for statistical computing in python in Proceedings of the 9th Python in Science Conference. (Austin, TX), Vol. 445, pp. 51–56.
81. Virtanen P, et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17(3):261–272.
82. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering 13(2):22–30.
83. Rahim M, Thirion B, Bzdok D, Buvat I, Varoquaux G (2017) Joint prediction of multiple scores captures better individual traits from brain images. NeuroImage 158:145–154.
84. Song Y, Schreier PJ, Ramírez D, Hasija T (2016) Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing 128:449–458.