
RICE UNIVERSITY

Computational and Statistical Methodology for Highly Structured Data

by

R. Michael Weylandt, Jr.

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Approved, Thesis Committee:

Katherine B. Ensor, Chair
Noah Harding Professor of Statistics
Director of the Center for Computational Finance and Economic Systems

Richard G. Baraniuk
Victor E. Cameron Professor of Electrical and Computer Engineering
Founder & Director of OpenStax

Frank Jones
Noah Harding Professor of Mathematics

Marina Vannucci
Noah Harding Professor of Statistics

Houston, Texas
September, 2020

ABSTRACT

Computational and Statistical Methodology for Highly Structured Data

by

R. Michael Weylandt Jr.

Modern data-intensive research is typically characterized by large scale data and the impressive computational and modeling tools necessary to analyze it. Equally important, though less remarked upon, is the important structure present in large data sets. Statistical approaches that incorporate knowledge of this structure, whether spatio-temporal dependence or sparsity in a suitable basis, are essential to accurately capture the richness of modern large scale data sets. This thesis presents four novel methodologies for dealing with various types of highly structured data in a statistically rich and computationally efficient manner. The first project considers sparse regression and sparse covariance selection for complex valued data. While complex valued data is ubiquitous in spectral analysis and neuroimaging, typical machine learning techniques discard the rich structure of complex numbers, losing valuable phase information in the process. A major contribution of this project is the development of convex analysis for a class of non-smooth “Wirtinger” functions, which allows high-dimensional statistical theory to be applied in the complex domain. The second project considers clustering of large scale multi-way array (“tensor”) data. Efficient clustering algorithms for convex bi-clustering and co-clustering are derived and shown to achieve an order-of-magnitude speed improvement over previous approaches. The third project considers principal component analysis for data with smooth and/or sparse structure. An efficient manifold optimization technique is proposed which can flexibly adapt to a wide variety of regularization schemes, while efficiently estimating multiple principal components. Despite the non-convexity of the manifold constraints used, it is possible to establish convergence to a stationary point. Additionally, a new family of “deflation” schemes is proposed to allow iterative estimation of nested principal components while maintaining weaker forms of orthogonality. The fourth and final project develops a multivariate volatility model for US natural gas markets. This model flexibly incorporates differing market dynamics across time scales and different spatial locations. A rigorous evaluation shows significantly improved forecasting performance both in- and out-of-sample. All four methodologies are able to flexibly incorporate prior knowledge in a statistically rigorous fashion while maintaining a high degree of computational performance.

Acknowledgements

The work described in this thesis was primarily supported by a National Science Foundation Graduate Research Fellowship under Grant Number 1842494. Additional support was provided by NSF Research Training Grant Number 1547433, by the Brown Foundation, and by a Computational Science and Engineering Enhancement Fellowship and the 2016/2017 BP Graduate Fellowship, both awarded by the Ken Kennedy Institute at Rice University. Computational resources for Chapter 7 were provided by the Center for Computational Finance and Economic Systems (CoFES) at Rice University. The Rice Business Finance Center provided the Bloomberg Terminal used to access the data used in Chapter 7.

Contents

Abstract ii
Acknowledgements iv
List of Illustrations x
List of Tables xix

1 Introduction 1
  1.1 Overview of Part I 4
  1.2 Overview of Part II 7

I Complex Graphical Models 11

2 Optimality and Optimization of Wirtinger Functions 12
  2.1 Optimality Conditions in Real Vector Spaces 13
    2.1.1 Second-Order Approximations and Convexity 19
  2.2 Implications for Wirtinger Convex Optimization 21
    2.2.1 Wirtinger Convex Optimization 21
    2.2.2 Wirtinger Semi-Definite Programming 29
  2.3 Algorithms for Wirtinger Optimization 30
    2.3.1 Background 30
    2.3.2 Proximal Gradient Descent 31
    2.3.3 Coordinate Descent 33
    2.3.4 Alternating Direction Method of Multipliers 34
    2.3.5 Numerical Experiments 35

3 Sparsistency of the Complex Lasso 38
  3.1 Introduction and Background 38
  3.2 Prediction and Estimation Consistency 41
  3.3 Model Selection Consistency 51
  3.4 Algorithms 59
  3.5 Numerical Experiments 61
  3.A Concentration of Complex Valued Sub-Gaussian Random Variables 65

4 Sparsistency of Complex Gaussian Graphical Models 79
  4.1 Introduction and Background 79
  4.2 Likelihood Approach: Sparsistency 80
  4.3 Pseudo-Likelihood Approach: Sparsistency 89
  4.4 Algorithms 99
  4.5 Numerical Experiments 100
  4.A Properties of the Complex Multivariate Normal 103
    4.A.1 Proper Complex Random Variables 104
    4.A.2 Graphical Models for Complex Random Variables 105
  4.B Concentration of Complex Sample Covariance Matrices 106
  4.C Lemmata Used in the Proof of Theorem 4.1 112

II Additional Work 116

5 Splitting Methods for Convex Bi-Clustering 117
  5.1 Introduction 118
    5.1.1 Convex Clustering, Bi-Clustering, and Co-Clustering 118
    5.1.2 Operator Splitting Methods 120
  5.2 Operator-Splitting Methods for Convex Bi-Clustering 123
    5.2.1 Alternating Direction Method of Multipliers 123
    5.2.2 Generalized ADMM 125
    5.2.3 Davis-Yin Splitting 126
  5.3 Simulation Studies 127
  5.4 Complexity Analysis 130
  5.5 Extensions to General Co-Clustering 131
  5.6 Discussion 132
  5.A Step Size Selection 134
  5.B Detailed Derivations 135
    5.B.1 ADMM 135
    5.B.2 Generalized ADMM 138
    5.B.3 Davis-Yin 139

6 Multi-Rank Sparse and Functional PCA 144
  6.1 Introduction 145
  6.2 Manifold Optimization for SFPCA 146
  6.3 Iterative Deflation for SFPCA 151
  6.4 Simulation Studies 154
  6.5 Discussion 157
  6.A Proofs 159
    6.A.1 Deflation Schemes 159
    6.A.2 Simultaneous Multi-Rank Deflation 163
    6.A.3 Solution of the Generalized Unbalanced Procrustes Problem 166
  6.B Algorithmic Details 168
    6.B.1 Manifold Proximal Gradient 168
    6.B.2 Manifold ADMM 170
    6.B.3 Additional Note on Identifiability 170
  6.C Additional Experimental Results 171
  6.D Additional Background 172

7 Natural Gas Market Volatility Modeling 179
  7.1 Introduction 180
    7.1.1 U.S. Natural Gas Markets 183
    7.1.2 Previous Work 185
  7.2 Empirical Characteristics of LNG Prices 188
  7.3 Model Specification 190
    7.3.1 Bayesian Estimation and Prior Selection 195
    7.3.2 Computational Concerns 197
  7.4 Application to Financial Risk Management 198
    7.4.1 In-Sample Model Fit 199
    7.4.2 In-Sample Risk Management 200
    7.4.3 Out-of-Sample Risk Management 201
  7.5 Conclusions 203
  7.A Additional Data Description 205
  7.B Data Set Replication Information 207
  7.C Additional Results 216
    7.C.1 In-Sample Model Fit 216
    7.C.2 In-Sample Risk Management 218
    7.C.3 Out-of-Sample Risk Management 220
  7.D Computational Details 229
    7.D.1 Stan Code 229
    7.D.2 MCMC Diagnostics 235
    7.D.3 Simulation-Based Calibration 237
    7.D.4 Additional Simulation Studies 237
  7.E Additional Background 242
    7.E.1 Volatility Models 242
    7.E.2 Natural Gas Markets 245

8 Conclusion and Discussion 247

Works Cited 254

Illustrations

1.1 Schematic of ECoG Analytic Pipeline: ECoG probes placed directly on the patient’s cortex (left) record neural potentials (next-to-left) which can be transformed into Fourier components (next-to-right) and used to infer a time series graphical model of inter-region connectivity (right). Adapted from Nuwer [93], Blausen.com Staff [94], Chandran et al. [95], and Narayan [96]...... 8

2.1 Run-Time of the Complex Lasso Coordinate Descent Algorithm 1. Increasing the problem dimensionality increases runtime, as we would expect, but surprisingly, run-time also increases with the difficulty of the statistical problem. Run-times are averaged over 100 values of λ (logarithmically spaced) and 50 replicates. Timing tests performed using the classo function in the R package of the same name available at https://github.com/michaelweylandt/classo. .... 36

2.2 Run-Time of the Complex Graphical Lasso ADMM Algorithm 1. Increasing the problem dimensionality increases runtime, as we would expect, but surprisingly, run-time also increases with the difficulty of the statistical problem (smaller n or more edges). Run-times are averaged over 100 values of λ (logarithmically spaced) and 20 replicates. Timing tests performed using the cglasso(method="glasso") function in the R package of the same name available at https://github.com/michaelweylandt/classo. For small values of n, run-time is dominated by the R language run-time and does not accurately reflect true computational cost. .. 37

3.1 Model Selection Consistency of the Complex Lasso (3.1). The CLasso has high true positive rate and low false positive rate in both low- (p < n) and high (p > n) dimensional settings. As ρ → 1, the CLasso struggles to accurately recover the true model in high dimensions, though the decay is slow. Beyond a small minimum signal strength requirement, the signal to noise ratio is not an important driver of accuracy...... 64

4.5.1 Model Selection Consistency of the Complex Graphical Lasso (4.1). When p ≪ n, the CGLasso has both high true positive rate and low false positive rate, leading to high rates of correct model selection for both the likelihood (CGLasso) and pseudo-likelihood (neighborhood selection) procedures. As the number of edges to be estimated increases, model selection performance decays quickly, especially for larger values of p, even as both true positive and false negative rates remain high. For this random graph model, neither estimator is able to perform perfect graph recovery for n ≈ p and high numbers of edges. 102

5.3.1 Per iteration convergence of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA breast cancer data sets. COBRA clearly exhibits the fastest per iteration convergence, appearing to exhibit super-linear convergence. The ADMM and Generalized ADMM have similar convergence rates, while the Davis-Yin iterates converge much more slowly. ...... 128

5.3.2 Wall clock speed of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA data sets. Despite its rapid per iteration convergence, COBRA is less efficient than the ADMM and Generalized ADMM due to the complexity of its iterations. Davis-Yin Splitting remains the least efficient algorithm considered for this problem. ...... 129

6.4.1 Simulation Scenarios Used in Section 6.4. For both the left singular vectors (top row of each scenario) and the right singular vectors (bottom row of each scenario), ManSFPCA is able to recover the signal far more accurately than unregularized PCA (SVD). ManSFPCA is not identifiable up to change of order or sign...... 155

7.1.1 U.S. LNG Pipelines (as of 2008). There are approximately 3 million miles of LNG pipelines in the United States, connecting LNG storage facilities with customers. The majority of these pipelines are concentrated along the Gulf of Mexico and in western Pennsylvania / northern West Virginia. Note the large number of interstate pipelines originating at or near Henry Hub in southern Louisiana. Figure originally published by the U.S. Energy Information Administration (E.I.A.) [298]. ...... 184

7.2.1 Henry Hub spot returns. Henry Hub clearly exhibits many of the same properties as equity returns: volatility clustering, heavy-tails, and autocorrelation in the second moment. ...... 189

7.2.2 Return series of several LNG spots. (Details of each spot series are given in Section 7.B of the Supplementary Materials.) Returns are clearly highly correlated across series, with the major volatility events (late 2009 and early 2014) being observed at all hubs. This high degree of correlation suggests that a single-factor model will suffice. ...... 190

7.2.3 Comparison of Henry Hub Spot and Futures. As shown in the top panel, spot and futures returns are moderately correlated (25-29%), particularly at shorter maturities (29.2%). As shown in the bottom panel, spot and futures volatilities are highly correlated (74-99%) for all of the realized volatility measures considered, including the close-to-close, Garman-Klass (1980), Rogers-Satchell (1991), and Yang-Zhang (2000) estimators. In both cases, we observe strong correlation between the low frequency spot data and the high frequency futures data. ...... 191

7.4.1 Signed-rank-transformed differences of 40 spots over 73 periods. Ignoring periods where our WAIC estimates are corrupted by outliers (purple), the RBGARCH model performs better than the GARCH model in 2333 out of 2444 (95.4%) sub-samples. ...... 201

7.4.2 In-Sample VaR Accuracy of the GARCH and RBGARCH models, as measured by Kupiec’s (1995) unconditional exceedances test. Larger p-values indicate better performance. The RBGARCH model performs well for all portfolios considered, while the GARCH model shows clear signs of inaccurate VaR calculation for all portfolios except the 20% Spot/80% Futures portfolio. Displayed p-values are for in-sample VaR estimates for the entire sample period. ...... 202

7.4.3 Rolling Out-Of-Sample VaR Accuracy of the GARCH and RBGARCH models, as measured by Kupiec’s (1995) unconditional exceedances test. Larger p-values indicate better performance. The RBGARCH model outperforms the GARCH model for the seven different portfolio weights considered here. Displayed p-values are for out-of-sample filtered VaR estimates for the entire sample period. ...... 203

7.B.1 Henry Hub spot prices and rolling estimates of the first four moments. Spot prices peaked in mid-2008, declined from late 2008 to early 2009 and have remained at a relatively constant level since. Spot returns exhibit moderate heteroscedasticity, skewness and kurtosis, with large changes in the sample moments being primarily driven by high-volatility periods in Fall 2009 and Spring 2014. ...... 208

7.B.2 Pearson correlation among daily spot price returns for 40 hubs in the sample period. Spot prices are typically highly correlated with each other and with Henry Hub (not shown here). The Cheyenne, Northeast Transco Zone 6, Texas Eastern Transmission, Tennessee Gas Pipeline, Northeast, and Iroquois Transmission System spot prices, however, are more highly correlated among themselves than to other spots during the sample period. ...... 209

7.B.3 Henry Hub futures move in tandem with spot prices, but generally exhibit fewer large jumps. In particular, the futures markets do not exhibit the large jumps in early 2014 that are a significant driver of the elevated kurtosis observed in spot prices during the same period. ...... 211

7.B.4 Daily returns of Henry Hub futures are highly correlated, but only exhibit moderate correlation with Henry Hub spot returns. Correlations among spot and futures returns are higher at lower frequencies (weekly, monthly, etc.). ...... 212

7.B.5 Pearson correlation among the estimated volatilities of 40 spot prices during the sample period. As we see, there is generally a strong positive correlation among spot volatilities, though it is not as pronounced as the correlation among spot returns. There is some evidence that the structure of volatility is changing over time, as can be seen by the emergence of subgroups in the 2012-2015 period. ...... 213

7.B.6 Pearson correlation among the Henry Hub spot volatility, non-Henry Hub spot volatility, and futures realized volatility during the sample period. The reported correlations with non-Henry spots are the average correlation over the 40 non-Henry spots in our sample. The correlation between Henry Hub volatility and short-maturity (NG1) realized volatility is consistently the highest. The degree of correlation appears robust to the choice of realized volatility measure. ...... 214

7.C.1 Signed-rank-transformed differences of 40 spots over 73 periods. Positive values indicate that the RBGARCH model outperformed the GARCH model for that sample period. p-values from a paired one-sample Wilcoxon test applied to WAIC differences for each sample are displayed on the left axis. The RBGARCH model outperformed the GARCH model across all spots, but was somewhat more sensitive to outliers in the data. ...... 217

7.C.2 Price history for Custer, OK LNG trading hub. The large price jump on February 5th, 2014 causes WAIC instabilities for several sample sub-periods, as highlighted in Figure 7.C.1. Because we use a 50-day rolling window with a 250-day history, a single outlier of this form can impact up to 5 sub-periods. ...... 218

7.C.3 In-Sample WAIC Comparison of GARCH and RBGARCH models, aggregated over all 40 spot prices in our sample. The reported probabilities are the fraction of 250 day sample periods in which the WAIC of the RBGARCH exceeded that of the GARCH model by at least K times the estimated standard error of the difference for K = 0, 1, 2. For all years in the sample period, the RBGARCH model has a consistently higher WAIC than the GARCH model, typically by several standard errors. ...... 219

7.C.4 In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. In order to accurately evaluate VaR predictions at this extreme quantile, the reported p-values were obtained by taking the concatenated in-sample predictions over the entire 10-year sample period. ...... 221

7.C.5 In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.5% (1 in 200 days) confidence level. Compared to Figure 7.C.4, the in-sample advantage of the RBGARCH model is still present, albeit less pronounced, as both models are able to capture this portion of the distribution well. ...... 222

7.C.6 In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. Consistent with Figure 7.C.4, the RBGARCH model consistently outperforms the GARCH model, as can be seen from the upward sloping lines connecting corresponding p-values from the two models. ...... 223

7.C.7 Assessment of the In-Sample VaR estimates of the GARCH and RBGARCH models for a range of sample portfolios. The GARCH model typically underestimates the true VaR (lies above the red line) while the RBGARCH model is more accurate, but with a conservative bias (lies below the red line). ...... 224

7.C.8 Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. Somewhat surprisingly, the RBGARCH model appears to perform almost as well out-of-sample as it does in-sample, while the GARCH model is noticeably less accurate. ...... 225

7.C.9 Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.5% (1 in 200 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. Unlike the in-sample case (Figure 7.C.5), the RBGARCH model here clearly outperforms the GARCH model. ...... 226

7.C.10 Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. Consistent with Figure 7.C.8, the RBGARCH model consistently outperforms the GARCH model, as can be seen from the upward sloping lines connecting corresponding p-values from the two models. ...... 227

7.C.11 Assessment of the Out-of-Sample VaR estimates of the GARCH and RBGARCH models for a range of sample portfolios. The GARCH model typically underestimates the true VaR (lies above the red line) while the RBGARCH model is more accurate, but with a conservative bias (lies below the red line). ...... 228

7.D.1 Simulation-Based Calibration Analysis of Hamiltonian Monte Carlo applied to Model (7.1) with T = 200. The blue bars indicate the number of SBC samples in each rank bin while the green bars give a 99% marginal confidence band for the number of samples in each bin. While there is significant variability in the histograms, commensurate with a complex model, we do not observe any systematic deviations. ...... 238

7.D.2 Empirical coverage probabilities of posterior credible intervals associated with Model (7.1) under a weakly informative prior distribution. Because we drew the parameters from the prior, these intervals are correctly calibrated. ...... 240

7.D.3 Mean width of symmetric posterior intervals associated with Model (7.1) under a weakly informative prior distribution. We observe standard n^{-1/2}-type convergence for all parameters as the length of the sample period (T) increases. Somewhat surprisingly, we note that parameters associated with the mean model (µ, α) are typically more uncertain than the parameters of the GARCH dynamics. ...... 241

Tables

5.4.1 Performance of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA data sets. For COBRA, the numbers in parentheses are the total number of sub-problem iterations taken. The † indicates that the method was stopped after failing to converge in 10,000 iterations...... 131

6.3.1 Properties of Hotelling’s Deflation (HD), Projection Deflation (PD), and Schur Complement Deflation (SD). Only SD captures all of the individual signal of each principal component without re-introducing signal at later iterations. Additionally, only SD allows for the non-unit-norm PCs estimated by SFPCA to be used without rescaling. ...... 153

6.4.1 Comparison of Manifold-ADMM, Manifold Proximal Gradient, and Alternating Manifold Proximal Gradient approaches for Manifold SFPCA (6.2). MADMM is consistently more efficient and obtains better solutions, but both MADMM and A-ManPG perform well in terms of signal recovery, as measured by relative subspace recovery error (rSS-Error = ∥ÛÛᵀ − U*(U*)ᵀ∥ / ∥Û_SVD Û_SVDᵀ − U*(U*)ᵀ∥), true positive rate (TPR) and false positive rate (FPR). ...... 157

6.4.2 Cumulative Proportion of Variance Explained (CPVE) of Rank-One SFPCA with (normalized) Hotelling, Projection, and Schur Complement Deflation and of (order 3) Manifold SFPCA. SD gives the best CPVE of the iterative approaches and appears to be more robust to violations of orthogonality (Scenario 2). Manifold SFPCA outperforms the iterative methods in both scenarios. ...... 158

6.C.1 Extended Version of Table 6.4.2, comparing Cumulative Proportion of Variance Explained (CPVE), relative subspace recovery error (rSS-Error = ∥ÛÛᵀ − U*(U*)ᵀ∥ / ∥Û_SVD Û_SVDᵀ − U*(U*)ᵀ∥), true positive rate (TPR) and false positive rate (FPR) of Manifold SFPCA with Rank-One SFPCA using (normalized) Hotelling, Projection and Schur Complement Deflation. ...... 173

7.3.1 Prior Specifications. Parameters for first-order behavior (mean and correlation of returns) are set using standard weakly informative priors. Priors for GARCH dynamics are calibrated to S&P 500 dynamics over the sample period (2006-2015). The “coupling parameters” β which are the primary focus of this paper are given independent weakly informative N(0, 1) priors. ...... 196

7.3.2 Prior Predictive Checks for Model (7.1): 90% Prior Predictive Intervals and observed S&P 500 and Henry Hub values. The 90% Prior Predictive Intervals cover the observed values for each of the standard summary statistics except those based on unconditional volatility (standard deviation and kurtosis of returns), where the priors imply a slightly higher volatility than actually observed. As we will see in Section 7.4, this is consistent with the out-of-sample evaluation of our model, which is generally accurate, but slightly conservative. ...... 197

7.B.1 Apparent Inconsistencies in Futures Data. These typically occur because the auction-based Open and Close prices are not included in the intra-day trading range used to determine the High and Low prices. To calculate the Yang-Zhang (2000) realized volatility, the Open and Close are truncated to the High-Low range. ...... 210

7.B.2 Bloomberg identifiers of the price series used in our analysis. LNG Futures were used to compute realized volatility measures and served as the market proxy while LNG Spot prices were used for the sparse-data asset. ...... 215

7.D.1 Markov Chain Monte Carlo Diagnostics for Hamiltonian Monte Carlo estimation of Model 7.1. These diagnostics are taken from a representative run of four chains of 2000 iterations each. Reported values for σ_M and σ_i are averages over the entire fitting period. ...... 239

Chapter 1

Introduction

The past two decades have witnessed explosive growth in the use of statistical machine learning methods across science, engineering, and industry. Following early developments in statistics [1] and in signal processing [2], a rich framework for the design and analysis of sparsity-based estimators has been developed. Building on powerful geometric foundations [3], these advances have had a revolutionary impact across disciplines, spawning the compressive sensing approach to signal processing [4–6] and a rich theory of high-dimensional statistics. Within statistics, the principles of sparsity and regularization have spurred advances in regression [1, 7–10], classification [11–13], clustering [14–16], probabilistic graphical models (covariance selection) [17–24], multivariate analysis [25–28], matrix completion [29–31], and time series analysis [32–34]. While broad in scope, these remarkable advances are almost exclusively restricted to real valued data, typically under independent sampling assumptions. In this thesis, I extend this popular paradigm in two important directions: firstly, I develop foundational theory and methodology for complex valued data, with a specific focus on data arising from spectral analysis of stationary stochastic processes; secondly, I consider several estimators defined by optimization problems on “non-standard” spaces, including the space of fixed-dimension multi-linear operators (“tensors”) and the manifold of low-rank orthogonal matrices.

Both of these contributions are examples of statistical machine learning techniques. The statistical approach to machine learning is typified by the M-estimation paradigm, often called the “loss function + penalty” approach, whereby a statistical estimator is defined by minimizing a composite objective function that combines a loss function measuring fidelity to the observed data and a penalty function (regularizer). The loss is typically constructed from a statistical model, while the penalty is chosen to induce desired properties in the estimate, typically some form of sparsity [1, 35–39] to allow for straightforward scientific interpretation. Recent advances in penalized M-estimator theory allow strong accuracy guarantees, among the most remarkable of which are model selection consistency results, which prove that the estimate has the same sparsity pattern as the “true” parameter(s). These results are among the most important for scientific practice as they guarantee correct identification of “important” aspects of the underlying real-world phenomenon.∗
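To fix ideas with a standard example (the real valued lasso, included here purely for illustration and not taken from this thesis), the “loss + penalty” paradigm specializes to

β̂ ∈ arg min_{β ∈ Rᵖ} (1/2)∥y − Xβ∥₂² + λ∥β∥₁,

where the squared-error loss measures fidelity to the data (y, X), the ℓ1 penalty induces sparsity in β̂, and λ ≥ 0 trades the two off. Part I of this thesis develops the analogous machinery when y, X, and β are complex valued.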

The spectral (Fourier) approach to time series analysis is the foundation for much of classical time series statistics and modern signal processing. As noted by Whittle [41, 42], transforming a problem to the spectral domain greatly reduces (and often asymptotically eliminates) the complicated dependency structures typically associated with time series. Because of this reduction, spectral methods have proven immensely popular in signal and radar processing [43–46], neuroscience [47], geosciences [48–50], and econometrics [51], among many other domains. Indeed, in the neuroscientific context, spectral approaches are the default for many analyses: the popular science notion of “brain waves” properly refers to the Fourier decomposition of neural activity. Despite this wide applicability, existing theory for spectral methods is almost exclusively low-dimensional (more time points than series) and not applicable to modern high-dimensional data sets. Additionally, the use of spectral methods in statistical machine learning poses fundamental, and largely unexplored, mathematical questions, further hindering the development of modern spectral methods.

∗Classically, these results require the statistical model to be well-specified, but recent work has relaxed this assumption [40].

This thesis is divided into two large parts: Part I develops a rigorous novel theory for complex valued sparse estimators and is the major contribution of this thesis. Part II

reviews additional work performed during my Ph.D., with a focus on computational

methods useful for high-dimensional time series. The material in Part I is novel and,

in some places, rather more abstract than is typical of a publication in the statistical

sciences. As such, I give a thorough treatment, expanding upon details that are often

elided. To balance this, the treatment in the second part is somewhat more cursory.

In brief, the relevant publications are:

• Part I:

  – M. Weylandt. “Wirtinger Graphical Models for Complex Valued Data: Likelihood and Pseudo-Likelihood Approaches.” In preparation.

• Part II:

  – Chapter 5: M. Weylandt. “Splitting Methods for Convex Bi-Clustering and Co-Clustering.” DSW 2019: Proceedings of the IEEE Data Science Workshop 2019, pp. 237-244. 2019. doi: 10.1109/DSW.2019.8755599. ArXiv: 1901.06075.

  – Chapter 6: M. Weylandt. “Multi-Rank Sparse and Functional PCA: Manifold Optimization and Iterative Deflation Techniques.” CAMSAP 2019: Proceedings of the IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, pp. 500-504. 2019. doi: 10.1109/CAMSAP45676.2019.9022486. ArXiv: 1907.12012.

  – Chapter 7: M. Weylandt, Y. Han, and K. B. Ensor. “Multivariate Modeling of Natural Gas Spot Trading Hubs Incorporating Futures Market Realized Volatility.” ArXiv: 1907.10152. Submitted.

1.1 Overview of Part I

The major contribution of Part I is the development of a “modern” theory of spectral

time series analysis, uniting recently developed tools of high-dimensional statistical

theory [52–54] with classical spectral analysis tools [55–57]. While we focus on the

problem of structure estimation in graphical models, the technical tools developed

in this part provide a thorough theoretical grounding for the statistical analysis of

complex valued data generally and greatly advance machine learning methodology

for high-dimensional time series.

In this part, I develop foundational statistical theory for complex valued M-estimators of the “loss + penalty” form described above. Following the early work of Huber [58], the study of M-estimators, statistical estimators defined as the minimizers of an objective function, has given rise to a widely applicable and richly developed body of statistical theory [59, 60]. In addition to maximum (penalized) likelihood estimators, the class of M-estimators also includes robust estimators, support vector machines (SVMs), quantile regression, and many other statistical workhorses. Much of the recent success of the M-estimation paradigm relies heavily on the well-established machinery of convex analysis, especially the Karush-Kuhn-Tucker (first order) optimality conditions, which can be used to characterize the optima of a convex optimization problem without explicitly computing the solution [61–64]. The KKT conditions can be used to identify deterministic conditions under which an estimator performs “well,” e.g., within some pre-specified accuracy or correctly recovering the sparsity pattern of a parameter [59, 60]. One then applies concentration-of-measure results to show that these conditions are satisfied, and hence the estimator performs well, with high probability [65, 66].

Chapter 2 rigorously develops KKT-type conditions for the type of real valued functions of complex variables which naturally arise in the M-estimation paradigm. I refer to these functions as Wirtinger, in recognition of Wilhelm Wirtinger’s early work on this class. The key construction is an algebraic generalization of the classical (real variables) setting to an arbitrary inner product space over a totally ordered field. Specializing this to the case of complex vectors or positive-definite Hermitian matrices (both considered as vector spaces over R), I give KKT conditions for “Wirtinger” optimization and “Wirtinger” semi-definite programming. Interestingly, when restricted to smooth convex functions, this construction recovers the conjugate construction of Wirtinger [67], also known as the CR-derivative in the signal processing literature [68–73]. In addition to analytic tractability, formulation of M-estimators as convex programs allows for efficient computation. While the past fifteen years have seen an explosive growth in convex optimization techniques tailored for the types of problems arising in sparse estimation [74–79], there has been less activity in complex convex optimization. Building on the results discussed above, Chapter 2 also provides a brief convergence analysis of the optimization methods proposed herein, generalizing the results of Sorber et al. [80] to the non-smooth complex domain functions at the heart of this thesis.

A particularly powerful example of this KKT-based approach is the primal-dual witness (PDW) technique of Wainwright [81], which lies at the heart of almost all model selection consistency results [12, 21, 24, 82–84]. Lee et al. [85] provide a very general formulation of the PDW, using abstractions from convex analysis to precisely characterize the geometric conditions under which model selection consistency holds. Despite this high level of generality, their analysis does not apply to the complex valued (spectral) setting described above. Applying the tools of Chapter 2 to the sparse (ℓ1-regularized) linear model and covariance estimation problems, Chapters 3 and 4 give the first model selection consistency guarantees for complex valued data, enabling state of the art statistical methodologies to be applied to problems in complex spaces. Throughout, care is taken to provide both deterministic and stochastic bounds, potentially allowing for the generalization to dependent data in future work.
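As a concrete building block for intuition, the ℓ1-penalized complex estimators studied in Chapters 2–4 are typically driven by a modulus-based soft-thresholding operation, which shrinks the modulus of a coefficient while preserving its phase. The following is a minimal R sketch of that operator; it is my own illustration, not code from the classo package, and the assumption that the CLasso coordinate update takes exactly this form is mine.

    # Modulus soft-thresholding of a complex vector z at level lambda:
    # shrink |z_j| by lambda while keeping the phase of z_j, setting
    # small coefficients exactly to zero.
    csoft <- function(z, lambda) {
      m <- Mod(z)
      scale <- pmax(0, 1 - lambda / m)
      scale[m == 0] <- 0   # guard against 0/0 when z_j is exactly zero
      scale * z
    }

    csoft(c(3 + 4i, 0.1 - 0.2i), lambda = 2)
    # [1] 1.8+2.4i 0.0+0.0i

Because the threshold acts on the modulus, the real and imaginary parts of each coefficient are zeroed out jointly, one sense in which a complex penalty can respect the phase information discussed above.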

One notable difference between the real and complex domain analyses is in the structure of the noise. In traditional real valued statistics, it is often sufficient to assume that the noise is iid (sub-)Gaussian and to give results under that weak assumption. In the complex domain, iid noise is insufficient to fully specify the statistical aspects of a problem. In particular, we must specify any dependencies between the real and imaginary parts of the noise distribution. If we are able to assume these are independent, e.g., as would be the case for the Fourier transform of a stationary process, significant improvements in performance can be guaranteed. Lemma 3.9 clarifies the different rates of concentration, which percolate through to the complex lasso (CLasso) and complex graphical lasso (CGLasso).

My interest in the statistics of complex valued data arises from a collaboration with neuroscientists specializing in electrocorticography (ECoG) recordings (see Figure 1.1). ECoG is an invasive neural recording technology in which probes placed directly on the patient’s cortex record local electrical potentials. Unlike non-invasive technologies (EEG, fMRI), ECoG signals are not obscured by the patient’s skull, allowing accurate measurement of both power and phase information at relatively high temporal resolution. Current state-of-the-art approaches simply discard this phase information and retain the real valued power (modulus) information for downstream analysis. While straightforward, the loss of phase information has significant scientific and statistical implications. Scientifically, an emerging consensus suggests that coordination between different brain regions is almost exclusively modulated by phase information [86–90]. Statistically, using the power instead of the observed data discards information and transforms direct linear correlations into complicated non-linear phenomena. In the fMRI domain, Daniel Rowe and collaborators [47, 91] have argued for the importance of jointly analyzing real and imaginary parts, suggesting a massively multivariate Bayesian approach to doing so. The work presented in this thesis is the first to study joint models of real and imaginary parts in a penalized maximum likelihood framework. An interesting alternative is to consider only the phase information, discarding the modulus, drawing on techniques of multivariate directional data [92]; we do not pursue that approach here.

Figure 1.1: Schematic of ECoG Analytic Pipeline: ECoG probes placed directly on the patient’s cortex (left) record neural potentials (next-to-left) which can be transformed into Fourier components (next-to-right) and used to infer a time series graphical model of inter-region connectivity (right). Adapted from Nuwer [93], Blausen.com Staff [94], Chandran et al. [95], and Narayan [96].

1.2 Overview of Part II

Chapter 5 considers algorithms for co-clustering of tensor data. Co-clustering refers to the simultaneous clustering along each “face” of a multi-way array. Most frequently encountered in the cluster heatmap popular in genomics [97], the rank-two case of tensor co-clustering, also known as matrix bi-clustering, is particularly widely used in applied work. Chi et al. [98] proposed a convex formulation of co-clustering, building on the convex clustering approach proposed by Pelckmans et al. [14] and later extended to the bi-clustering case by Chi et al. [99]. While Chi et al. [98] propose a simple algorithm based on dual gradient ascent, significant speed-ups can be obtained by the use of operator-splitting methods. In this chapter, I propose and evaluate four different operator-splitting methods for the general co-clustering problem. Overall, a method based on the Generalized ADMM of Deng and Yin [100] consistently achieves the best performance, and can be further improved by use of the acceleration techniques considered by Goldstein et al. [101]. These techniques have the additional advantage of not requiring solutions to a tensor Sylvester equation, efficient methods for which are an active area of research.
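For orientation, the convex bi-clustering problem at the heart of Chapter 5 can be written, roughly in the notation of Chi et al. [99] (this display is my paraphrase, not an excerpt from the chapter), as

minimize over U:  (1/2)∥X − U∥_F² + λ [ Σ_{i<j} w_{ij} ∥U_{i·} − U_{j·}∥ + Σ_{k<l} w̃_{kl} ∥U_{·k} − U_{·l}∥ ],

where X is the observed data matrix, U collects the estimated cluster centroids, and the two weighted fusion penalties pull rows (observations) and columns (features) of U toward common values, producing a checkerboard clustering structure. The operator-splitting algorithms compared in Chapter 5 differ primarily in how they separate the smooth Frobenius loss from these two non-smooth penalty blocks.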

Chapter 6 turns to the classical problem of principal components analysis (PCA), that is, of estimating low-rank approximations to a data matrix or, equivalently, of estimating leading eigenspaces of covariance matrices. PCA is a fundamental and ubiquitous tool in applied statistics, with a rich literature dating back to Hotelling [102]. Recent developments have augmented classical PCA with various forms of regularization, including both smooth (“functional”) and sparse regularization [103–105]. Because the singular vectors estimated in regularized PCA are no longer (exact) eigenvectors, standard numerical methods can no longer be used. Instead, iterative optimization techniques must be used: because the regularized eigenvalue problem is non-convex, the quality of the estimated principal components is often quite sensitive to the specifics of the algorithm in question.

While many methods for regularized PCA have been proposed, they almost all estimate a single principal component at a time and then “deflate” the data matrix before finding the next principal component [25, 27, 106, 107]. Unless additional steps are taken, this iterative deflation approach no longer guarantees orthogonality of the estimated principal components. To address this, I propose a method which simultaneously estimates several principal components with an explicit orthogonality constraint. Using techniques of manifold constrained optimization, this novel approach to regularized PCA is able to flexibly incorporate prior information without sacrificing interpretability of the estimated components. For situations where manifold optimization techniques are too computationally burdensome, I also propose a set of improved deflation techniques which retain some, but not all, of the desired orthogonality properties.
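For reference, the three deflation schemes compared in Chapter 6 (Table 6.3.1) have standard textbook forms, reproduced here from the sparse PCA literature rather than from the chapter itself, and possibly differing in details from the variants analyzed there. Given a working covariance matrix S and an extracted unit-norm component u, the classical updates are

Hotelling’s deflation: S ← S − (uᵀSu) uuᵀ
Projection deflation: S ← (I − uuᵀ) S (I − uuᵀ)
Schur complement deflation: S ← S − S u uᵀ S / (uᵀSu).

When u is an exact eigenvector of S the three updates coincide, but for the regularized, non-eigenvector components produced by sparse and functional PCA they behave quite differently, which is what motivates the comparison of deflation schemes in Chapter 6.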

Chapter 7 is somewhat different in spirit than the preceding chapters. Rather than developing an optimization-based general methodology, this chapter develops a highly tailored Bayesian model for tail risk estimation in domestic (U.S.) natural gas markets. Natural gas markets are among the most important commodity markets in the US and play an essential role in transport and pricing of domestic and industrial energy. Unlike more well-studied equity markets, natural gas markets are highly fragmented and characterized by irregular patterns of data availability. To address these limitations, I develop an integrated model which is both “multivariate,” in that it is able to model over 100 different aspects of natural gas markets simultaneously, and “multi-scale,” in that it is able to incorporate high-frequency (intraday) data when it is available.

I apply this model to the problem of financial risk management, with a focus on out-of-sample tail estimation. A lengthy empirical study demonstrates several favorable properties of the proposed approach:

• Superior tail probability estimation: the proposed model is able to characterize tail events significantly more accurately than univariate or low-frequency models.

• Conservative bias: my model tends to over-estimate risk rather than under-estimate it, allowing for a much more efficient use of risk capital (cf., Paragraph B.4(j) of the 1996 Basel Accords [108]).

• Automated outlier diagnostic: my model is able to automatically identify dislocations in certain spot prices, providing a guide to further analysis.

Additionally, a careful analysis of the computational properties of my model, including MCMC accuracy and parameter recovery rates, is provided.

Taken together, the four projects exhibit exciting directions in contemporary statistical research: computationally efficient methods which can flexibly incorporate prior information to analyze highly structured data. The mathematical complexity of these approaches makes analyzing the statistical properties of the estimators quite difficult. I discuss some opportunities for future research at the end of this thesis.

Part I

Wirtinger Graphical Models for Complex Valued Data

Chapter 2

Optimality and Optimization of Wirtinger Functions

Before considering statistical applications, we first turn to the more fundamental question of optimization in the presence of complex values. Recalling our eventual interest in statistical estimators defined by non-smooth objectives, we may be inclined to consider first-order optimality conditions of the following form:

f(x) ≥ f(x∗) + γᵀ(x − x∗) for all x ∈ X, where γ ∈ ∂f(x∗).

If f(·), γ, x, x∗ are all real valued, this inequality is well-defined and indeed fundamental to much of optimization theory. In our setting, however, we are not guaranteed real valuedness: in particular, if the domain of optimization is Cⁿ or a subset thereof, x − x∗ is an essentially arbitrary complex vector and the first-order inequality is ill-posed, regardless of the value of γ and any guarantees we can make about f. To proceed, it is clear that we need to modify the first-order model γᵀ(x − x∗) in such a way that the first-order conditions are well-defined.

The key insight is quite simple: the solution to an optimization problem

arg min_{z ∈ Z ⊆ Cⁿ} f(z)

depends only on the topology of the domain and range of f, not on its algebraic structure. Indeed, Weierstrass’s celebrated Extreme Value Theorem, as generalized by Fréchet [109], implies that sufficient conditions for the above problem to be well-defined are for Z to be compact and for f to take values in R. As such, we have the freedom to alter the algebraic structure at hand to make the problem tractable. In the sequel, we consider Cⁿ as a real vector space, rather than a complex vector space, and use this alternative construction to develop the necessary optimization theory for our statistical goals.

2.1 Optimality Conditions in Real Vector Spaces

In this section, we give general results relating global optimality, first-order conditions,

and subdifferentials for functions in an arbitrary real vector space. In the next section, we will specialize these results to the complex case of interest in the rest of this

thesis.

Theorem 2.1. Let V be a finite-dimensional vector space over the real numbers, equipped with an inner product ⟨·, ·⟩ : V × V → R. Furthermore, let f : V → R be a convex function: that is, one satisfying the first-order inequality

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ V and λ ∈ [0, 1].

Additionally, assume f is proper and closed. Then

x∗ ∈ arg min_{x ∈ V} f(x) ⇐⇒ 0 ∈ ∂f(x∗),

where

∂f(x) = {y ∈ V : f(z) ≥ f(x) + ⟨x − z, y⟩ for all z ∈ V}

is the subdifferential of f at x. Elements of ∂f(x) are called subgradients of f at x.

We note that it is possible to obtain similar results for a vector space over a totally ordered field F, with suitable notions of convexity, as long as f takes values in F as well. This additional generality is not needed in what follows, so we omit it for now.∗ Similarly, the restriction to finite-dimensional V is not strictly necessary, but it makes the subsequent analysis simpler because we can take advantage of the canonical isomorphism between V and its dual space; Barbu and Precupanu [112] and Zălinescu [113], among others, develop equivalent theory in the setting of (possibly infinite-dimensional) Banach spaces.

Proof. Both directions of the proof follow immediately from the definition of subgradients. For the forward direction (⇒), if x∗ is a minimizer of f, then f(z) ≥ f(x∗) for all z. This implies f(z) ≥ f(x∗) + ⟨x∗ − z, 0⟩ for all z, so 0 is a subgradient of f at x∗. The reverse implication (⇐) is similarly straightforward: suppose 0 ∈ ∂f(x∗). Then f(z) ≥ f(x∗) for all z ∈ V and hence x∗ is a minimizer of f.
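As a concrete one-dimensional illustration (this example is mine, not part of the original text), take V = R and f(x) = |x|. At x = 0 the defining inequality requires |z| ≥ ⟨0 − z, y⟩ = −zy for all z ∈ R, which holds exactly when |y| ≤ 1, so ∂f(0) = [−1, 1]. Since 0 ∈ ∂f(0), Theorem 2.1 certifies x∗ = 0 as the global minimizer of |x|, even though f is not differentiable there. For x ≠ 0, the same inequality forces the subdifferential down to the single point {−sign(x)}, the sign reflecting the ⟨x − z, y⟩ convention used in Theorem 2.1.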

Theorem 2.2. Suppose f : V → R is proper, closed, and convex. Then ∂f(x) is non-empty for all x ∈ V at which f is continuous.

This result is classical and can be proven in several different ways, including via directional derivatives as in Rockafellar [63, Part V] or via a duality argument [114, Propositions 2.5.5 and 4.2.1]. Following Barbu and Precupanu [112, Proposition 2.36], I give an elegant argument based on the supporting hyperplane theorem, which I first pause to prove.

∗It is possible to consider notions of convexity under much weaker constructs than vector spaces, as done by Gudder [110], but, as Stone [111, Theorem 2] notes, these essentially reduce to vector spaces under even weak regularity conditions.

Lemma 2.3 (Supporting Hyperplane Theorem). Suppose C ⊊ V is a convex set with non-empty interior and x is a point on the boundary of C. Then there exists a non-zero y such that ⟨x, y⟩ ≤ ⟨z, y⟩ for all z ∈ C. The affine set {z ∈ V : ⟨y, z⟩ = ⟨y, x⟩} is said to be a supporting hyperplane of C at the support point x.

Proof. First note that the set X = {x} is an affine set in V, albeit a rather uninteresting one, and one that sits at the boundary of C. By assumption, x ∈ bd C and so x ∉ rel int C. Define an essentially arbitrary linear functional f₀(·) on X such that f₀(x) = 1. Consider also the set-distance function g(·) = 1 + min_{w ∈ C} ∥w − ·∥, which is clearly a real convex function by the convexity of C. Taken together, we clearly have f₀(·) ≤ g(·) on X. Since X is a linear subspace of V, by the (analytic) Hahn-Banach theorem there exists a linear functional f defined on all of V satisfying

f(z) ≤ g(z) = 1_C(z) for all z ∈ C.

Finally, since V is a finite-dimensional inner product space, the linear functional f can be identified with some element of V∗, and that can in turn be identified with the desired non-zero y. Note that y must be non-zero, since we have f₀(x) = f(x) = 1 by construction.

The analytic (classical) Hahn-Banach theorem used above can be found in any advanced text on analysis, e.g., Kreyszig [115, Theorems 4.2-1 and 4.3-1]. In the setting of a topological vector space, slightly more general than we consider here, the Supporting Hyperplane Theorem is equivalent to the analytic Hahn-Banach theorem and can indeed be considered a geometric variant thereof.

We are now ready to give the proof of Theorem 2.2, using the supporting hyperplane, or more precisely its associated linear functional, to construct the desired subgradients.

Proof of Theorem 2.2. Let epi f be the epigraph of f: that is,

epi f = {(x, r) ∈ V × R : f(x) ≤ r}.

Suppose f is continuous at x: then (x, f(x)) is a boundary point of the (closure of the) epigraph of f. (Recall that a function f is closed if epi f is a closed set.) Applying the Supporting Hyperplane Theorem to epi f at the support point (x, f(x)), there exists (y, y₀) ∈ V × R such that

⟨(x, f(x)), (y, y₀)⟩_{V×R} ≤ ⟨(z, z₀), (y, y₀)⟩_{V×R} for all (z, z₀) ∈ epi f
⟨x, y⟩_V + y₀ f(x) ≤ ⟨z, y⟩_V + z₀ y₀
⟨x − z, y⟩ ≤ y₀ (z₀ − f(x)).

Note that y₀ cannot be zero. If it were, then we would have ⟨x − z, y⟩ ≤ 0 for all z, which only holds for y = 0, and the Supporting Hyperplane Theorem guarantees (y, y₀) ≠ 0.

Similarly, we know y₀ must be positive. Take z = x − y and the above inequality becomes:

⟨x − (x − y), y⟩ = ∥y∥² ≤ y₀ (z₀ − f(x)).

Since z₀ can be arbitrarily large, this can only hold for y₀ ≥ 0 which, when combined with the above, implies y₀ > 0.

Dividing the above by y₀ and taking z₀ = f(z), we thus obtain the inequality

⟨x − z, y′⟩ ≤ f(z) − f(x) for all z ∈ V,

where y′ = y/y₀ ∈ V is a subgradient of f at x. Since the above argument holds for all x at which f is continuous, we have the desired result.

Having now established the existence of subgradients, we give several useful results

that allow us to manipulate subgradients of functions given subgradients of their

composite parts:

Theorem 2.4 (Subgradient Calculus). Let V be a real inner product space and let f, g : V → R be proper, closed, and convex functions; additionally let A be a linear operator from V to V, b ∈ V, and c ∈ R≥0 be a non-negative scalar. Then:

(a) ∂(cf)(x) = c ∂f(x) = {cy : y ∈ ∂f(x)};

(b) ∂(f + g)(x) = ∂f(x) ⊕ ∂g(x), where A ⊕ B = {a + b : a ∈ A, b ∈ B} is the Minkowski sum of sets;

(c) Suppose f(x) is “coordinate-separable” in that there exist proper, closed, and convex functions f₁, f₂ such that f(x) = f₁(x₁) + f₂(x₂) for x = (x₁, x₂); suppose also that the inner product on V is separable in the same manner, satisfying ⟨a, b⟩ = ⟨a₁, b₁⟩ + ⟨a₂, b₂⟩. Then (∂f₁(x₁), ∂f₂(x₂)) ⊆ ∂f(x).

The above results are sometimes referred to as a strong subgradient calculus because they give the entire subdifferential of composite functions from their component parts, rather than simply a single subgradient. Additional results in the subgradient calculus often consider the composition of two convex functions or the pointwise supremum of convex functions, but we will not make use of them in the sequel. (See Corollary 16.72 and Section 18.2 of the text by Bauschke and Combettes [116] for a discussion of the subtleties of these results in the general case.)

Proof. (a) Suppose γ is a subgradient of f at x. Then

f(z) ≥ f(x) + ⟨x − z, γ⟩
cf(z) ≥ cf(x) + c⟨x − z, γ⟩
      = cf(x) + ⟨x − z, cγ⟩,

so cγ is a subgradient of cf at x as desired.

(b) Suppose γ_f and γ_g are subgradients of f and g at x, respectively. Then

f(z) ≥ f(x) + ⟨x − z, γ_f⟩
g(z) ≥ g(x) + ⟨x − z, γ_g⟩
f(z) + g(z) ≥ f(x) + g(x) + ⟨x − z, γ_f⟩ + ⟨x − z, γ_g⟩
(f + g)(z) ≥ (f + g)(x) + ⟨x − z, γ_f + γ_g⟩,

so γ_f + γ_g is a subgradient of f + g at x as desired.

(c) Suppose γ₁ ∈ ∂f₁(x₁) and γ₂ ∈ ∂f₂(x₂). We thus have

f₁(z₁) ≥ f₁(x₁) + ⟨γ₁, z₁ − x₁⟩
f₂(z₂) ≥ f₂(x₂) + ⟨γ₂, z₂ − x₂⟩.

Adding these, and using the assumed coordinate separability of the inner product, we obtain

f(z) = f₁(z₁) + f₂(z₂) ≥ f₁(x₁) + f₂(x₂) + ⟨(γ₁, γ₂), (z₁ − x₁, z₂ − x₂)⟩
     = f(x) + ⟨(γ₁, γ₂), z − x⟩,

so (γ₁, γ₂) is a subgradient of f as desired.

2.1.1 Second-Order Approximations and Convexity

While unused in the analysis that follows, it is interesting to consider second-order differential information in our analysis of convex problems. While, in the Cartesian setting, it is possible to define a second-order derivative as the “derivative of the derivative,” subject to certain differentiability assumptions, this perspective does not naturally generalize to the more general real vector space setting we consider. In particular, Theorem 2.2 above applies only to convex functions taking values in R, while any natural notion of a first-order derivative will take gradients in V (or, more formally, its isomorphic dual space). Put another way, we have only established how to differentiate functions V → R, but the derivative is a function V → V, so our previous analysis is not useful.

The fundamental result of second-order convex analysis is that of Alexandrov [117], who showed that a second-order Taylor expansion of a convex function f can be found almost everywhere:

f(x + w) = f(x) + ⟨v, w⟩ + ½⟨Aw, w⟩ + o(∥w∥²).

An alternative view of second-order analysis was put forward by Mignot [118], who showed that a linear perturbation of the subgradient can be found almost everywhere:

∂f(x + w) = v + A′w + o(∥w∥²)B,

where B is a suitable unit ball; both constructions exist and coincide almost everywhere for closed, proper, and convex f. Rockafellar [119] considered these two constructions and found that they coincide (in the sense A = A′) whenever both are defined. Borwein and Noll [120] consider similar equivalences in much more general spaces and find that, without separability and finite-dimensionality assumptions, the two notions may not be equivalent.

We state Alexandrov’s theorem, adapted to functions f : V → R, but do not prove it as we do not make use of it in the sequel:

Theorem 2.5. Suppose f : V → R is a convex function. Then the subdifferential of f is single valued almost everywhere and, for any point x where ∂f(x) is single valued, there exists a positive semi-definite operator A satisfying

f(z) = f(x) + ⟨γ, z − x⟩ + ½⟨z − x|A|z − x⟩ + o(∥z − x∥²),

where γ ∈ ∂f(x) and A will occasionally be denoted as ∂²f(x).

In the expansion of f, we make use of the A-quadratic form notation, which we define as

⟨x|A|x⟩ = ⟨A∗x, x⟩ = ⟨x, Ax⟩.

In the context of a real vector space, the matrix A can be assumed self-adjoint.

2.2 Implications for Wirtinger Convex Optimization

Having developed the necessary general theory, we now turn our attention to the

class of functions at the heart of this thesis, those from Cn → R. These functions

are, in general, not differentiable in the standard complex analytic sense described

in texts such as Ahlfors [121]. In particular, for a function with domain C to be

differentiable, it must satisfy the Cauchy-Riemann equations which essentially state

that the derivative of the real and imaginary parts “balance.” For the real valued

functions we care about, the imaginary part, and hence its derivative, are identically

zero. If differentiable in the standard sense, the Cauchy-Riemann equations would

require the derivative of the real part to be zero and hence the function would have

to be constant everywhere. Such functions are clearly of no use for defining statistical

estimators.

We instead consider an alternative construction of the derivative of a complex domain

function, first explored by Wirtinger in 1927 [67]. In honor of his work, we refer to

complex domain real valued functions as Wirtinger functions. As we will see, if f is

a convex Wirtinger function, Wirtinger’s differential coincides with the the notions

of subgradients explored above.

2.2.1 Wirtinger Convex Optimization

We first consider convex functions with domain Cn. Adopting the real inner product

space perspective described above, we take V = Cn and equip it with the real inner

product aH b + aT b ⟨a, b⟩ = . 2 22

For two complex vectors w = wR + iwI and z = zR + izR, the inner product on Cn can also be written as the sum of the inner products of their real and imaginary

parts:† wH z + wT z ⟨w, z⟩ = = ⟨w , z ⟩ + ⟨w , z ⟩. 2 R R I I

The purpose of the 1/2 factor may not be immediately clear: it is included so that this inner product generates the typical norm on Cn:

zH z + zT z ∥z∥2 + ∥z∥2 ⟨z, z⟩ = = = ∥z∥2. 2 2

Importantly, this generates the same norm as the standard inner product ⟨z, z⟩′ = zH z = ∥z∥2 so Cn has the same topology as a vector space over R and C. This is useful computationally, but also importantly assures us that we are indeed solving the optimization problem we intend to solve, as the identity and existance of an optimizer is a topological and not algebraic property.

For smooth functions f, the Wirtinger derivative is obtained by taking the of a function with respect to z while holding z constant. While less fully developed than the standard notion of complex differentiability, Wirtinger differen- tiation has found use in statistical signal processing theory; see, e.g., the work of

Brandwood [68], van den Bos [69], Yan and Fan [122], and Hjørungnes and Gesbert

[123] or the discussion in Appendix 2 of the textbook by Schreier and Scharf [124].

For example, if we were to use Wirtinger calculus to find the (unregularized) complex

†Here and throughout this thesis, the notation z = ℜ(z) − iℑ(z) is used to refer to the conjugate of a complex scalar, vector, or matrix. ·T is the standard matrix transpose and ·H is the conjugate transpose, also called the Hermitian transpose, satisfying XH = (X)T . 23

ordinary least squares estimate, we would proceed as follows:

∥ − ∥2 f(β) = y Xβ 2

f(β, β) = (y − Xβ)H (y − Xβ)

T T = yH y − yH Xβ − β XH y + β XH Xβ

∂f T = −yH X + β XH X ∂β T 0 = −yH X + β XH X

T β = yH X(XH X)−1

β = (XH X)−1XT y

β = (XH X)−1XH y.

Note that, in performing these calculations, it was important to “look through” the

conjugate transpose and to treat it as constant as it depends on β only through

the conjugate β. Compared with real valued OLS, we see that this is essentially the

same result, except with the conjugate transpose (·H ) replacing the transpose operator

(·T ). The above analysis is informal, however, as we have not established a connection

between the formal Wirtinger derivative and the optimality conditions of the previous

section. As Theorem 2.6 below shows, for convex Wirtinger functions, the Wirtinger

derivative captures the same behavior as the convex subdifferential.

Importantly, the above analysis would give the same estimate if we had differentiated with respect to β instead. To formalize this, we give a rigorous definition of Wirtinger

derivative in terms of the (real) derivatives of the real and imaginary parts of its

arguments.

A Wirtinger function f(z): Cn → R can be equivalently understood as a function of 24

its real and imaginary parts f˜(u, v): R2n → R where u = ℜ(z) and v = ℑ(z). The

Wirtinger derivative of f can then be defined as

df df˜ df˜ (z ) = (u , v ) + i (u , v ) dz 0 du 0 0 dv 0 0 while the conjugate Wirtinger derivative can be defined as

df df˜ df˜ (z ) = (u , v ) − i (u , v ). dz 0 du 0 0 dv 0 0

We highlight several key properties of these definitions:

• f is Wirtinger differentiable if and only if f˜ is real differentiable;

• the conjugate Wirtinger derivative is simply the conjugate of the standard

df df Wirtinger derivative ( dz = dz )) so it will suffice to analyze only the standard derivative;

df ⇔ df • in particular, dz = 0 dz = 0, so it does not matter whether we use the Wirtinger derivative or its conjugate when finding stationary points; and

• the standard (Cauchy-Riemann) definition of differentiability can be compactly

df written as dz = 0 this makes clear that a Cauchy-Riemann-differentiable func- tion can only be Wirtinger-differentiable if it is constant everywhere.

Note that Schreier and Scharf [124, Appendix 2] give an intuitive motivation for a

1 similar construction, though they include an extra factor of 2 and an additional factor of −1 in the imaginary parts that we can omit as we use a different inner product.

In particular, when defining a formal Wirtinger-type derivative of a complex valued 25

function, it is common to define

  ∂ 1 ∂ ∂ = − i . ∂z 2 ∂u ∂v

Under the standard complex inner product, this gives the desirable result that

    ∂z 1 ∂ ∂ 1 ∂u ∂v ∂u ∂v = − i (u + iv) = + i − i + = 1. ∂z 2 ∂u ∂v 2 ∂u ∂u ∂v ∂v

Our restriction to real valued functions does not apply here, but we can obtain a

similar result for the univariate function f(z) = ℜ(z), which has the real equivalent

f˜(u, v) = u: df df˜ df˜ = + i = 1 + 0i = 1, dz du dv as one would expect.

This formal definition can be used to construct directional derivatives on Cn, building on the notion of directional derivatives in R2n, and to connect these to Wirtinger derivatives. For f corresponding to a f˜ real , we have

f(z + δγ) − f(z) f ′(z; γ) = lim δ→0+ δ f˜(u + δγ , v + δγ ) − f˜(u, v) = lim u v δ→0+ δ ˜′ = f (u, v; γu, γv) (∗) ˜ = ⟨∇f(u, v), (γu, γv)⟩R2n * + * + df˜ df˜ = , γ + , γ du u dv v  Rn Rn (∗) df = (z), γ dz Cn 26

′ where f (z; γ) is the directional derivative in the direction of a unit vector γ, (γu, γv) are the real and imaginary parts of γ arranged as a real 2n-vector and ∇f˜ is the

standard real gradient. The first starred equality follows from the fact that R2n is a

Fréchet space, while the second starred equality follows from our definitions of the

Wirtinger differential and the inner product on Cn. If γ is not a unit vector, the final

three equations should be divided by the norm of γ.

The next theorem uses this construction to connect Wirtinger derivatives to convex

subdifferentials: as is typical, we handle the multivariate case by restricting toan

arbitrary line segment and utilizing directional derivatives along that segment.

Theorem 2.6 (cf. Proposition 2.40 in Barbu and Precupanu [112]). Suppose f is convex and its (formal) Wirtinger derivative exists at z0. Then ∂f(z0) is a singleton { df } consisting only of the Wirtinger differential at z0: ∂f(z0) = dz (z0) .

Proof. Suppose that f is Wirtinger differentiable at z0, then for any unit vector γ, the above argument gives

  df ′ f(z0 + δγ) − f(z0) (z0), γ = f (z0, γ) = lim dz δ→0+ δ

Because f is convex along the line segment from z0 +γ to z0, we have, for all δ ∈ (0, 1)

f (δ(z0 + γ) + (1 − δ)z0) ≤ δf(z0 + γ) + (1 − δ)f(z0)

f(z0 + δγ) ≤ δf(z0 + γ) + (1 − δ)f(z0)

f(z0 + δγ) − f(z0) ≤ δ [f(z0 + γ) − f(z0)]

−1 =⇒ δ [f(z0 + δγ) − f(z0)] ≤ f(z0 + γ) − f(z0) 27

Taking the as δ → 0, we have

  df (z ), γ ≤ f(z + γ) − f(z ) dz 0 0 0

or equivalently,

    df df f(z + γ) ≥ f(z ) + (z ), γ = f(z ) + (z ), (z + γ) − z 0 0 dz 0 0 dz 0 0 0

df ∈ for arbitrary γ, hence dz (z0) ∂f(z0).

For the other direction, assume y ∈ ∂f(z0); by definition, we have, for any x,

f(x) ≥ f(z0) + ⟨y, x − z0⟩.

Taking x = z0 + δγ, we obtain

f(z0 + δγ) ≥ f(z0) + ⟨y, δγ⟩

or equivalently

−1 δ (f(z0 + δγ) − f(z0)) ≥ ⟨y, γ⟩

Taking the limit as δ → 0, this gives

    df df (z ), γ ≥ ⟨y, γ⟩ =⇒ (z ) − y, γ ≥ 0 dz 0 dz 0

df for all γ, which is only true if y = dz (z0) as desired.

As an important consequence of this result, we have the following result, which follows 28

from our formal analysis of the ordinary least squares problem above:

Cp → R 1 ∥ − ∥2 ∈ Cn×p Corollary 2.7. Let f : be defined as f(z) = 2 Az b 2 for fixed A and b ∈ Cn. Then ∂f(z) = AH (Az − b).

Putting Theorems 2.1, 2.2, and 2.6 together, we have a precise characterization of op- tima of convex Wirtinger functions, analogous to that for convex real functions: z∗ is

a global minimizer of f if and only if 0 ∈ ∂f(z∗) and if f is differentiable at z∗, ∂f(z∗)

is simply the Wirtinger derivative at that point. We now give the subdifferential of

the non-differentiable ℓ1 norm which will be of use in Chapter 3.

Theorem 2.8. Let f = | · | : C → R. Then   {∠z} = {z/|z|} z ≠ 0 ∂f(z) =  {w ∈ C : |w| ≤ 1} z = 0

n For g = ∥ · ∥1 : C → R, we have (∂g(z))i = ∂f(zi).

Proof. Applying the subgradient inequality to the modulus (absolute value) function

at z, we have

|x| ≥ |z| + ⟨γ, x − z⟩.

In the case z = 0, this becomes

|x| ≥ |0| + ⟨w, x − 0⟩ = ⟨w, x⟩ which clearly holds for any unit vector w, with equality holding for w = x/|x| = ∠x. 29

For the case z ≠ 0, taking γ = z/|z| gives

|x| ≥ |z| + ⟨z/|z|, x⟩ − ⟨z/|z|, z⟩ = |z| + ⟨z/|z|, x⟩ − |z| = ⟨z/|z|, x⟩

Because ⟨·, ·⟩ defines an inner product, we have

z ⟨z/|z|, x⟩ ≤ |x| = |x|, |z|

n so z/|z| is indeed a subgradient. The extension to the ℓ1-norm on C follows the coordinate-separable subgradient calculus given above.

2.2.2 Wirtinger Semi-Definite Programming

The results of the previous section can be straightforwardly extended to the setting of

semi-definite programs. In particular, for real valued semi-definite programs, thatis, Cn×n optimization of real valued functions taking arguments in ⪰0 , we recover standard results by situation in a real vector space with inner product:

1 H T ⟨A, B⟩Cn×n Tr(A B + A B). ⪰0 2

Note that while the conical structure of the set of semi-definite programs is usually of primary interest, the space of real or complex symmetric matrices of fixed size is a vector space and the semi-definite cone inherits is inner product.

The matrix equivalent of the Wirtinger derivatives described above was first consid- ered by Hjørungnes and Gesbert [123], combining a matrix parallel of the results of van den Bos [69] with the matrix differential approach pioneered by Magnus and

Neudecker [125]. We state analogues of the theorems given in the previous subsection 30 without proof.

Theorem 2.9. Suppose f is convex and its (formal) Wirtinger derivative exists at Z.

Then ∂f(Z) is a singleton consisting only of the differential considered by Hjørungnes and Gesbert [123].

The following two derivatives will be used in Chapter 4:

d (Tr(AZ)) = A dZ d (log det Z) = Z−1 dZ

P ∥ · ∥ n | · | Cn×n → R Theorem 2.10. Let f = 1,off-diag = i,j=1 ij : ⪰0 . Then i≠ j    {Aij/|Aij|} Aij ≠ 0 ∂f(A)ij =   {w ∈ C : |w| ≤ 1} Aij = 0

Note that, while a general element of ∂f(A) is not necessarily itself Hermitian, there is always at least one element of ∂f(A) which is Hermitian, so it will suffice to restrict our attention to those elements in the sequel.

2.3 Algorithms for Wirtinger Optimization

2.3.1 Background

The theory developed above provides an elegant framework for analyzing the optima

of Wirtinger optimization problems and, as we will see in the following chapters, 31 yields elegant tools for analyzing statistical estimators of this form. These methods

are non-constructive, however, and leave the practical question of computing these

optima unanswered. In this section, I provide convergence claims for three basic

sparse optimization algorithms, proximal gradient descent, coordinate descent, and

ADMM, in the Wirtinger setting. These proofs are essentially the same as their real

counterparts and so are omitted here, with brief remarks added about how existing

proofs could be adapted: no attempt to give the tightest constants or most general

conditions is given. Proofs for these results in the real domain can be found in

standard references, such as the books of Boyd and Vandenberghe [74], Bertsekas et

al. [114], or Beck [126].

In general, complex valued optimization can proceed straightforwardly by “realifying”

the problem, so the relevant literature is rather small. Sorber et al. [80] give a general

treatment of first-order and quasi-Newton methods, though their focus is restricted

to smooth functions and does not apply to the non-smooth functions of interest here.

Furthermore, their analysis focuses on efficiently moving between the real and complex

domains and is not “complex native” like the approach we pursue here. Li et al. [127]

give a similar “realifying” convergence proof for a complex ADMM. Yan and Fan [122]

gives a specialized Newton-type algorithm for blind equalization, while Adali et al.

[128] considers algorithms for complex valued independent components analysis. In a

series of papers, Bouboulis and co-authors [129–131] consider complex valued neural

networks and support vector machines using a similar Wirtinger construct.

2.3.2 Proximal Gradient Descent

Proximal gradient algorithms, first proposed in the sparse regression context by

the “ISTA” algorithm of Daubechies et al. [132] and later popularized by Beck and 32

Teboulle [75], combine a gradient descent-type update for the smooth part of the

objective with the proximal operator associated with the non-smooth part. They are

particularly common for problems where the non-smooth part has a simple (or even

closed form) solution. Proximal gradient methods have been applied to a wide-range

of problems. See, e.g., Parikh and Boyd [77] for a recent review. Candès et al. [133]

use a closely related “Wirtinger flow” algorithm to solve the phase retrieval problem;

see also Waldspurger et al. [134] who give a semidefinite programming approach based

on similar ideas in a block coordinate descent context.

Theorem 2.11. Suppose f(z) = g(z)+h(z) where g is convex, Wirtinger-differentiable

df everywhere, and dz is Lipschitz continuous with constant L > 0 and h is convex and ∥ − ∥2 proximable in the sense proxh(z) = arg minw w z 2/2 + h(w) can be exactly eval- uated. Then the proximal gradient iterates

  dg z(k+1) = prox z(k) + t (z(k)) th dz

for t < L−1 satisfy ∥z(0) − z∗∥ f(z(k)) − f(z∗) ≤ 2tk where z∗ is a minimizer of f.

Convergence proofs for proximal gradient methods typically rely on a “generalized gradient” of the form

x − prox [x − t∇g(x)] G (x) = th t t

For the sum inside the proximal operator to be well-defined in our setting, it suffices

df to note that V is self-dual and to add z and the Wirtinger derivative dz using the 33

natural isomorphism. The Lipschitz assumption is sometimes stated as a bound on

the Hessian of the smooth part - specifically, the maximum leading eigenvalue of

2 the Hessian maxx λmax(∇ g(x)) - but the Lipschitz bound suffices without requiring second order analysis.

2.3.3 Coordinate Descent

Unlike proximal gradient descent, theory for coordinate descent schemes is an active

and rapidly moving area of research. The monographs of Wright [135] and Shi et

al. [136] give two different perspectives on this area. In this thesis, only a “vanilla”

synchronous and cyclical coordinate descent scheme is used, the only difficulty we

face is from the non- of f. In 1973, Powell [137] gave an example of a

non-smooth problem on which coordinate descent fails, but the general convergence

question remained open until 2001 when Tseng [138] analyzed the problem. His

conditions are quite abstract, so we give a simplified version here:

Theorem 2.12. Suppose

Xs F (x) = f(x1,..., xs) + gi(xi): V → R i=1

where f is convex and differentiable and gi are convex, but possibly non-differentiable,

functions for some non-overlapping blocks x = (x1,..., xs). Then the block coordinate descent iterates given by

• Repeat until convergence: 34

– For i = 1, . . . , s:

(k+1) ← (k+1) (k) xi arg min f(x1 ,..., xi,..., xs ) + gi(xi) xi converge to a minimizer of F if the level sets of F are compact.

Tseng’s Theorem 4.1 implies convergence to a stationary point and, under convexity,

to a minimizer. To adapt the proof of this theorem to our setting, it suffices to

note use the directional derivative construction used in the proof of Theorem 2.6

above and to note that Wirtinger differentiability implies Gâteaux differentiability.

The level set assumption is quite mild and holds for all problems considered in the

following chapters.

2.3.4 Alternating Direction Method of Multipliers

Many different convergence proofs for the Alternating Direction Method of Multipli-

ers (ADMM) have been given following its introduction by Glowinski and Marroco

[139] and Gabay and Mercier [140], most notably Eckstein and Bertsekas [141] who

connected the ADMM to the celebrated proximal point algorithm [142]. Many vari-

ants of the ADMM exist, e.g., the generalized variant of Deng and Yin [100], the

stochastic variant of Ouyang et al. [143], or the accelerated version of Goldstein et al. [101], but we restrict our attention to a “vanilla” variant here. Boyd et al. [76,

Appendix A] give a straightforward, if long, convergence proof, which can be directly

adapted to our setting.

Theorem 2.13. Let F (z) = f(z) + g(z) for f, g : V → R closed, proper, and convex. 35

Suppose also that the Langrangian

L(z, w, u) = f(z) + g(w) + ⟨u, z − w⟩ has a saddle point at some (z∗, w∗, u∗) satisfying

L(z∗, w∗, u) ≤ L(z∗, w∗, u∗) ≤ L(z, w, u∗)

for all (z, w, u) ∈ V × V × V∗. Then the ADMM iterates given by

ρ z(k+1) ← arg min f(z) + ∥z − w(k) + u(k)∥2 z 2 ρ w(k+1) ← arg min g(w) + ∥z(k+1) − w + u(k)∥2 w 2 u(k+1) ← u(k) + z(k+1) − w(k+1) converges to the optimum in objective function value (f(z(k) → f ∗) and dual u(k) → u∗.

Under the additional assumption that F is strictly convex, we additionally have primal convergence (z(k) → z∗). The complex graphical lasso studied in Chapter 4 satisfies

all these conditions and so is suitable for that problem.

2.3.5 Numerical Experiments

In this section, I demonstrate the efficiency of these schemes to the two problems which I will focus on in the sequel, the complex lasso (3.1) and complex graphical

lasso (4.1). For the complex lasso, a coordinate descent scheme described in more

detail in Section 3.4 is used; for the complex graphical lasso, an ADMM described in 36

SNR: 1 SNR: 2 SNR: 3 SNR: 4 SNR: 5

0.09

0.06

0.03 Run−Time (s) 0.00 0 5001000150020000 5001000150020000 5001000150020000 5001000150020000 500100015002000 Ambient Dimensionality

ρ 0 0.2 0.4 0.6 0.8

Figure 2.1 : Run-Time of the Complex Lasso Coordinate Descent Algorithm 1. In- creasing the problem dimensionality increases runtime, as we would expect, but sur- prisingly, run-time also increases with the difficulty of the statistical problem. Run- times are averaged over 100 values of λ (logarithmically spaced) and 50 replicates. Timing tests performed using the classo function in the R package of the same name available at https://github.com/michaelweylandt/classo.

more detail in Section 4.4 is used.

For the complex lasso, I generate X from a multivariate Gaussian with autoregressive

|i−j| covariance Σij = ρ for various values of ρ. We vary the ambient dimensionality p while keeping the sample size n and number of non-zero variables s fixed at 200 and

10 respectively. Additionally, the true variables β∗ are generated with fixed signal

size and uniformly random phase. In each case, a warm-started coordinate descent

algorithm was run for 100 values of λ. Results are shown in Figure 2.1. As one would

expect, run-time increases with p, but the growth rate is slow due to the active set

strategy used. More surprisingly, run time is also a function of the statistical difficulty

of the problem: increasing feature correlation or decreasing signal strength increases

runtime. While this phenomenon has been consistently observed by many authors,

the true source of this behavior remains an open question. 37

p = 10 p = 20 p = 40 0.010 0.30

0.10 1e−03 0.005

Time (s) 0.03 8e−04 0.003 7e−04 0.01 1 2 3 4 5 5 10 15 20 20 40 60 80 Number of Edges

Number of Samples 50 100 200 500

Figure 2.2 : Run-Time of the Complex Graphical Lasso ADMM Algorithm 1. Increas- ing the problem dimensionality increases runtime, as we would expect, but surpris- ingly, run-time also increases with the difficulty of the statistical problem (smaller n or more edges). Run-times are averaged over 100 values of λ (logarithmically spaced) and 20 replicates. Timing tests performed using the cglasso(method="glasso") function in the R package of the same name available at https://github.com/ michaelweylandt/classo. For small values of n, run-time is dominated by the R language run-time and does not accurately reflect true computational cost.

For the complex graphical lasso, I generate a graph structure from an Erdős-Rényi

graph with a fixed edge probability and then add a small multiple of an identity

matrix to the adjacency graph to create a positive-definite precision matrix. The

ambient dimensionality p, sample size n, and edge probability are allowed to vary, as

shown in Figure 2.2. In each case, a warm-started ADMM was run for 100 values

of λ. As one would expect, run-time increases rapidly with p and with the number

of true edges. Surprisingly, run-time decreases as n increases: this is contrary to our

intuition, where n does not effect the complexity of the ADMM iterations, butis

consistent with our previous observation that convergence is faster in the presence of

stronger signals. 38

Chapter 3

Model Selection Consistency in Complex Variate Sparse Linear Models

3.1 Introduction and Background

In this chapter, I investigate the complex valued analogue of the celebrated lasso or basis pursuit procedure [1, 2, 144]. While the real valued lasso is well-studied at this point, the complex variant has received less attention, with Yang and Zhang

[44], Maleki et al. [45], and Mechlenbrauker et al. [46] being prominent exceptions.

Investigating the complex lasso, hereafter CLasso, has two benefits: it gives usa

relatively straight-forward environment to demonstrate the use of the tools developed

in the previous chapter and it allows us to characterize the consistency conditions

necessary for the pseudo-likelihood approach to graphical model selection discussed

in the next chapter to perform satisfactorily.

The CLasso is a straight-forward modification of its real counterpart:

1 ∥ − ∥2 ∥ ∥ arg min y Xβ 2 + λ β 1 (3.1) β∈Cp 2n

n n×p where y ∈ C and X ∈ C are observed data, λ ∈ R≥0 is an arbitrary regularization parameter, and β ∈ Cp is the vector of regression coefficients and our estimand of interest. We focus on the so-called “hard sparsity” model where β is s-sparse; that is, β has at most s non-zero elements. 39

In the high-dimensional setting, three forms of non-asymptotic consistency are regu-

larly studied:

1. Prediction Consistency: Does the estimated parameter θˆ give (nearly) the same

predictions as the true parameter θ∗?

2. Estimation Consistency: Is the estimated parameter θˆ close to the true param-

eter θ∗ in a suitable metric?

3. Model Selection Consistency: Does the estimated parameter θˆ have the same

sparsity pattern as the true parameter θ∗?

Model selection consistency is often strengthened to having the same sparsity pattern and sign as the true parameter, but this concept doesn’t generalize straightforwardly to the complex setting. These three forms of consistency are progressively stronger and as such require stronger, and increasingly unverifiable, assumptions to hold.

Prediction consistency of the lasso, sometimes called presistency, was first established by Greenshtein and Ritov [7] under minimal assumptions, building on earlier work by

Juditsky and Nemirovski [145]. Their result gives mean squared prediction error which

−1/2 decreases as OP (n ), which has become known as the “slow rate” of convergence. Under stronger assumptions on X which we detail below, Bickel et al. [10] give

−1 a “fast rate” of convergence with error OP (n ). Bickel et al. [10] use the same conditions to show estimation consistency and we only prove the “fast rate” below.

Model selection consistency of the lasso has been studied by several authors, notably

Meinshausen and Bühlmann [17] and Zhao and Yu [8] who establish probabilistic

guarantees; Wainwright [81] gives the most direct proof and tightest bounds and we

follow his analysis here. 40

The real domain lasso is among the most widely studied mathematical objects of

recent history. While ℓ1-based regularization has a lengthy history in the mathe- matical sciences, current interest in the statistical and signal processing communities was sparked by the seminal papers of Tibshirani [1] and Chen et al. [2]. While first

proposed for the standard regression case, the core idea was extended to generalized

linear models more broadly shortly after the original papers [146, 147]. Traditional

“low-dimensional” asymptotics for the lasso were studied by Fu and Knight [148], but

the first high-dimensional asymptotics are those presented by Greenshtein and Ritov

[7]. Their work was quickly built upon by many others Bickel et al. [10], Bunea et

al. [149, 150], Zhang and Huang [151], Meinshausen and Yu [152], and Zhang [153]

and extended to a much wider class of penalty functions [82, 154]. The wide range of

conditions used to prove various theoretical properties can be overwhelming, but van

de Geer and Bühlmann [155] give a unified treatment. The popularity of the lasso

has also prompted various computational and methodological advances: Osborne et

al. [156, 157] was the first to propose efficient specialized algorithms for the lasso,but the lars approach of Efron et al. [158] was the first to see widespread use. The lars

approach was later superceeded by the coordinate descent approach popularized by

several authors [159–164]. Zou et al. [165] built upon the lars approach to calculate

a ‘degrees of freedom’ measure for the lasso and their approach was later generalized

by Tibshirani and Taylor [166]. As will be seen below, it is necessary to know the true value of the noise variance σ2 to optimally set the value of the regularization parameter λ. While in practice, λ is often set via cross-validation [167, 168], it is also

possible to use a variant of the lasso estimator which is automatically adaptive to

the value of σ2 [169, 170] or one which automatically adapts to the effective noise of

the problem [171]. The reader is referred to the books of Bühlmann and van de Geer 41

[52], Hastie et al. [53], and Wainwright [54] for additional background not covered

here, including the exansive and closely related compressed sensing literature [4–6, 9,

172].

3.2 Prediction and Estimation Consistency

In this section, we give proofs of fast-rate prediction consistency and classical ℓ2 estimation-consistency for the CLasso problem (3.1). Our proofs are almost identical

to the real valued case and mirror Sections 7.3 and 7.4 of the book by Wainwright

[54]. Theorems 11.1 and 11.2 of the book by Hastie et al. [53] give a slightly more

elementary treatment of the same material.

Theorem 3.1 (Estimation Consistency). Suppose a vector-matrix pair (y, X) ∈

Cn × Cn×p are observed, with y = Xβ∗ + ϵ for some s-sparse β∗ ∈ Cp supported on √ a set S. Suppose further that X is column normalized such that ∥X·j∥2 = n for all j ∈ [p] and that X satisfies the following restricted eigenvalue condition:

1 2 2 p ∥X∆∥ ≥ κ∥∆∥ for all ∆ ∈ {∆ ∈ C : ∥∆ c ∥ ≤ 3∥∆ ∥ } = C (S). n 2 2 S 1 S 1 3

Then, the following hold:

H (a) For any λ ≥ 2∥X ϵ/n∥∞, the solution to the lasso problem (3.1) satisfies the

deterministic concentration bound

3 √ ∥βˆ − β∗∥ ≤ sλ κ

(b) If ϵ is a vector of independent (complex) sub-Gaussian random variables with

variance proxy σ2, the lasso problem (3.1) with regularization parameter λ = 42 p σ τ log p/n satisfies r 3σ τs log p ∥βˆ − β∗∥ ≤ 2 κ n

with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16.

(c) If, additionally, the real and imaginary parts of ϵ are independently sub-Gaussian

with variance proxy σ2, then the above result holds with probability at least 1 −

4e−(τ−2) log p/2 for τ ≥ 2.

Comparing this result to the real valued case, we see that we pay an additional

penalty to account for the higher dimensionality of the problem. If the noise in ϵ is

assumed independent, this cost is minimal (a factor of 4 instead of 2), but if we allow

for arbitrary structure in the elements of ϵ, we are forced to go much deeper into

the tails of the distribution to get the same concentration guarantees (a factor of 16

instead of 2).

Proof, following Theorem 7.13(a) and Example 7.14 of Wainwright [54]. We first show ˆ ∗ that the error vector ∆ = β −β falls in the set C3(S) with high probability. Because βˆ optimizes the lasso objective, for which β∗ is also feasible, we have the following inequalities:

1 1 ∥y − Xβˆ∥2 + λ∥βˆ∥ ≤ ∥y − Xβ∗∥2 + λ∥β∗∥ 2n 2 1 2n 2 1 1 1 ∥X(βˆ − β∗) − ϵ∥2 + λ∥βˆ∥ ≤ ∥ϵ∥2 + λ∥β∗∥ 2n 2 1 2n 2 1 1 ⟨X∆, ϵ⟩ 1 1 ∥X∆∥2 − + ∥ϵ∥2 ≤ ∥ϵ∥2 + λ(∥β∗∥ − ∥βˆ∥ ) 2n 2 n 2n 2 2n 2 1 1 1 ⟨X∆, ϵ⟩ 0 ≤ ∥X∆∥2 ≤ + λ(∥β∗∥ − ∥βˆ∥ ) 2n 2 n 1 1 where ⟨a, b⟩ = (aH b + aT b)/2 is the inner product on Cp considered as a real inner 43

product space.∗ In the case where X, β∗, ϵ, y are all real, this reduces to 2ϵT X∆,

giving the real valued basic inequality (cf., Eqn. 7.29 in Wainwright [54]).

In order to bound the final term, note

∗ ˆ ∗ ∗ ∥β ∥1 − ∥β∥1 = ∥β ∥1 − ∥β + ∆∥1 X | ∗| − | ∗ − | = βj βj ∆j j∈[p] X X | ∗| − | ∗ − | −| | = βj βj ∆j + ∆j j∈S j∈Sc ∥ ∗ ∥ − ∥ ∗ ∥ − ∥ ∥ = βS 1 βS + ∆S 1 ∆Sc 1

Substituting into the above, we get

1 ⟨X∆, ϵ⟩ ∥ ∥2 ≤ ∥ ∗ ∥ − ∥ ∗ ∥ − ∥ ∥ X∆ 2 2 +2λ( βS 1 βS + ∆S 1 ∆Sc 1) n | {zn } | {z } (ii) (i)

H ≤ 2∥X ϵ/n∥∞∥∆∥1 + 2λ (∥∆S∥1 − ∥∆Sc ∥1) where the inner product term (i) is simplified using Hölder’s inequality as ⟨X∆, ϵ⟩/n =

H H ⟨∆, X ϵ/n⟩ ≤ ∥X ϵ/n∥∞∥∆∥1 and the second term (ii) can be obtained by noting ∥a∥ − ∥a + b∥ ≤ ∥b∥ by the reverse triangle inequality.

∗Recall that this generates the standard norm on Cp. 44

H (a) Taking λ ≥ 2∥X ϵ/n∥∞, we have

1 2 ∥X∆∥ ≤ λ∥∆∥ + 2λ(∥∆ ∥ − ∥∆ c ∥ ) n 2 1 S 1 S 1

= λ∥∆S∥1 + λ∥∆Sc ∥1 + 2λ(∥∆S∥1 − ∥∆Sc ∥1)

= 3λ∥∆S∥1 − λ∥∆Sc ∥1

Because the left-hand side is clearly non-negative, this implies

∥∆Sc ∥1 ≤ 3∥∆S∥1

so the restricted eigenvalue condition can be applied giving

∥ ∥2 ≤ ∥ ∥ − ∥ ∥ ≤ ∥ ∥ κ ∆ 2 λ(3 ∆S 1 ∆Sc 1) 3λ ∆S 1

√ Finally, for any s-vector, we have ∥ · ∥1 ≤ s∥ · ∥2, so this gives

√ ∥ ∥2 ≤ ∥ ∥ κ ∆ 2 3λ s ∆ 2 3 √ ∥∆∥ = ∥βˆ − β∗∥ ≤ sλ 2 2 κ

as desired.

H (b) We now bound the term ∥X ϵ/n∥∞ stochastically using Lemma 3.10. Specifi-

cally, X satisfies the normalization assumptions of that Lemma and each element

of ϵ/n is sub-Gaussian with variance proxy σ2/n2 so we have

    2 2 H nt nt P [∥X ϵ/n∥∞ ≥ t] ≤ 2p exp − = 2 exp − + log p 16σ2 16σ2 45 p Setting t = σ τ log p/n for some τ > 16, we have

      − H τ log p (τ 16) P ∥X ϵ/n∥∞ ≥ t ≤ 2 exp − + log p = 2 exp − log p . 16 16

p Substituting λ = σ τ log p/n into the previous bound, we have r 3σ τs log p ∥βˆ − β∗∥ ≤ 2 κ n

with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16, as desired.

(c) Under the assumption that the real and imaginary parts of ϵ are independently

sub-Gaussian, we can tighten the above analysis using Part (b) of Lemma 3.10.

Specifically, we have

    2 2 H nt nt P [∥X ϵ/n∥∞ ≥ t] ≤ 4p exp − = 4 exp − + log p 2σ2 2σ2

p so taking λ = t = σ τ log p/n we obtain

      − H τ log p (τ 2) P ∥X ϵ/n∥∞ ≥ t ≤ 4 exp − + log p = 4 exp − log p 2 2

which in turn implies the bound r 3σ τs log p ∥βˆ − β∗∥ ≤ 2 κ n

holds with probability at least 1 − 4e−(τ−2) log p/2 for τ ≥ 2, as desired.

If we are able to estimate β with relative accuracy, it should not come as a surprise that we can also estimate Xβ accurately. Indeed, we can show that both β and 46 √ Xβ can be estimated with error decreasing as n under the same assumptions as

the previous theorem. What is somewhat more surprising is that we can estimate

Xβ, even in situations where we cannot estimate β. Intuitively, we note that the

strongest assumption in the proof of the previous theorem is that X has non-zero

restricted eigenvalues, implying that the columns of X cannot be too correlated. For

predictive purposes, we can discard this assumption since correlated columns of X will ultimately result in the same predictions. The following theorem formalizes this

intuition.

Theorem 3.2 (Prediction Consistency, cf. Theorem 11.2 of Hastie et al. [53]). As before, suppose a vector-matrix pair (y, X) ∈ Cn ×Cn×p is observed, with y = Xβ∗ +ϵ

for some s-sparse β∗ ∈ Cp supported on a set S. Suppose further that X is column √ normalized such that ∥X·j∥2 = n for all j ∈ [p]. Then the solution to the CLasso problem (3.1) satisfies the following mean square prediction error (MSPE) bounds:

∗ (a) If ∥β ∥1 ≤ R, then:

H (i) For λ ≥ 2/n∥X ϵ∥∞, we have

∥X(βˆ − β∗)∥2 MSPE = 2 ≤ 12λR n

(ii) If ϵ is a vector of independent (complex) sub-Gaussian random variables

with variance proxy σ2, the lasso problem (3.1) with regularization parameter p λ = σ τ log p/n satisfies r ∥X(β∗ − βˆ)∥2 τ log p MSPE = 2 ≤ 12σR n n 47

with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16.

(iii) If, additionally, the real and imaginary parts of ϵ are independently sub-

Gaussian with variance proxy σ2, then the above result holds with probability

at least 1 − 4e−(τ−2) log p/2 for τ ≥ 2.

(b) If X satisfies the restricted eigenvalue condition

1 2 2 p ∥X∆∥ ≥ κ∥∆∥ for all ∆ ∈ {∆ ∈ C : ∥∆ c ∥ ≤ 3∥∆ ∥ } = C (S), n 2 2 S 1 S 1 3

then

H (i) For λ ≥ 2/n∥X ϵ∥∞, we have

∥X(βˆ − β∗)∥2 s MSPE = 2 ≤ 9λ2 n κ

(ii) If ϵ is a vector of independent (complex) sub-Gaussian random variables

with variance proxy σ2, the prediction error bound

∥X(βˆ − β∗)∥2 τs log p MSPE ≤ 2 ≤ 9σ2 n nκ

holds with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16.

(iii) If, additionally, the real and imaginary parts of ϵ are independently sub-

Gaussian with variance proxy σ2, then the above result holds with probability

at least 1 − 4e−(τ−2) log p/2 for τ ≥ 2. 48

Proof. From the proof of Theorem 3.1, we have the basic inequality

H   1 ∥X ϵ∥∞∥∆∥ 0 ≤ ∥X∆∥2 ≤ 1 + λ ∥β∗∥ − ∥βˆ∥ 2n 2 n 1 1 H ∥X ϵ∥∞∥∆∥ = 1 + λ (∥β∗∥ − ∥β∗ + ∆∥ ) n 1 1

∗ ˆ ∗ ∗ for ∆ = β − β. We note that −∥β + ∆∥1 ≤ ∥β ∥1 − ∥∆∥1 by the reverse triangle inequality, so we have

H 1 ∥X ϵ∥∞∥∆∥ ≤ ∥ ∥2 ≤ 1 ∥ ∗∥ − ∥ ∗∥ − ∥ ∥ 0 X∆ 2 + λ ( β 1 β 1 ∆ 1) 2n  n H ∥X ϵ∥∞ = − λ ∥∆∥ + 2λ∥β∗∥ n 1 1

H (i) Under the assumption λ ≥ 2/n∥X ϵ∥∞, this gives

  λ λ 0 ≤ − λ ∥∆∥ + 2λ∥β∗∥ = (4∥β∗∥ − ∥∆∥ ) 2 1 2 1 1

∗ This implies that ∥∆∥1 ≤ 4∥β ∥1 ≤ 4R. Substituting this into the above, we thus obtain

2 H ∥∆∥ ∥X ϵ∥∞ 2 ≤ ∥∆∥ + λ(∥β∗∥ − ∥βˆ∥ ) 2n n 1 1 1 H ∥X ϵ∥∞ ∗ ≤ ∥∆∥1 + λ∥∆ ∥1  n  H ∥X ϵ∥∞ = + λ ∥∆∗∥ n 1

≤ 1.5λ ∗ 4R

= 6λR ∥X(β∗ − βˆ)∥2 MSPE = 2 ≤ 12λR n 49

as desired.

(ii) We now bound λ stochastically, using the same techniques used in Part (b) p of Theorem 3.1. Specifically, that proof shows that taking λ = σ τ log p/n

H for τ ≥ 16 suffices to bound λ ≥ 2∥X ϵ/n∥∞ with probability at least 1 −

2e−(τ−16) log p/16 implying the MSPE bound r ∥X(β∗ − βˆ)∥2 τ log p MSPE = 2 ≤ 12σR n n

(iii) Similarly, we bound λ stochastically, using the same techniques used in Part (c) p of Theorem 3.1. That proof shows that taking λ = t = σ τ log p/n suffices for

the above analysis to go through with probability 1 − 4e−(τ−2) log p/2, as desired.

Now we additionally assume the restricted eigenvalue condition:

κ ∥∆∥2 2 2

H (i) As argued in the proof of Theorem 3.1, λ ≥ 2/n∥X ϵ∥∞ implies that

∥ ∥2 √ √ X∆ 2 ≤ 3λ∥∆ ∥ − λ∥∆ c ∥ ≤ 3λ s∥∆ ∥ ≤ 3λ s∥∆∥ n S 1 S 1 S 2 2

Recalling that the vector ∆ lies in the cone C3(S), we have

1 ∥X∆∥2 ∥∆∥2 ≤ 2 2 κ n 50

Using this to bound the right-most term, we get

r ∥ ∥2 X∆ 2 s ≤ 3λ ∥X∆∥2 n rnκ ∥X∆∥ s 2 ≤ 3λ n nκ ∥X∆∥2 s 2 ≤ 9λ2 n2 nκ ∥X(βˆ − β∗)∥2 s MSPE = 2 ≤ 9λ2 n κ

p (ii) As in the proof of Part (a)(ii), we take λ = σ τ log p/n for τ ≥ 16 so we have

∥X(βˆ − β∗)∥2 τs log p MSPE ≤ 2 ≤ 9σ2 n nκ

with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16. p (iii) As in the proof of Part (a)(iii), we take λ = σ τ log p/n for τ ≥ 2 to get the

same bound with probability at least 1 − 4e−(τ−2) log p/2 for τ ≥ 2.

As with the proof of estimation consistency, we see that we pay a significant prediction error penalty if we allow arbitrary dependencies in the real and imaginary parts of ϵ,

but that the penalty reduces to a very manageable factor of two if we assume the real

and imaginary parts of ϵ are independent. Interestingly, the deterministic bounds,

Part (i) of the above, are exactly the same as in the real case, indicating that minor

decrease in statistical efficiency is entirely due to the additional degrees of freedom

available to complex noise vectors. 51

3.3 Model Selection Consistency

We now turn to the more fundamental question of support recovery in sparse regression,

that is, the conditions under which the CLasso is able to correctly identify the zero

and non-zero elements of β∗. In the real domain, it is common to consider signed

support recovery - conditions under which βˆ has the same sparsity pattern and signs as

β∗, but in the complex domain we restrict our claims to simple support recovery.

Our main theorem parallels the celebrated Primal-Dual Witness approach of Wain- wright [81] and follows the presentation of that result as Theorem 7.21 in the book

by Wainwright [54].

Theorem 3.3 (Model Selection Consistency of the Complex Lasso). Suppose a vector- matrix pair (y, X) ∈ Cn × Cn×p are observed, with y = Xβ∗ + ϵ for some s-sparse

β∗ ∈ Cp supported on a set S. Additionally, assume that X satisfies the following properties: √ 1. Column-normalization: ∥X·j∥2 = n for all j ∈ [p];

H 2. Lower-S-Eigenvalue: the smallest eigenvalue of XS XS/n is at least c > 0;

3. Mutual Incoherence: there exists an α ∈ [0, 1) such that

∥ H −1 H ∥ ≤ max (XS XS) XS X·j 1 α j∈Sc where

2 ϵ ≥ H λ XSc ΠS⊥ (X) 1 − α n ∞

− H −1 H where ΠS⊥ (X) is the nullspace projection operator given by I XS(XS XS) XS . Then the following hold: 52

(a) The CLasso problem (3.1) has a unique solution βˆ;

(b) No false positives: supp βˆ ⊆ supp β∗;

(c) ℓ∞-bound: the estimation error satisfies:

∗ H −1 H H −1 ∥βˆ − β ∥∞ ≤ (X X /n) X ϵ/n + ∥(X X /n) ∥ λ | S S S {z∞ S S ℓ∞,1 } =E(λ,X) P ∥ ∥ | | where A ℓ∞,1 = maxi j Aij ; and

(d) No false negatives: if the non-zero elements of β∗ are all greater than E(λ, X) in

norm, then supp βˆ = supp β∗.

Furthermore, the following probabilistic bounds hold if s < p/2:

(i) If ϵ is a vector of independent (complex) sub-Gaussian random variables with

variance proxy σ2, the CLasso problem (3.1) with regularization parameter

p λ = σ τ log(p − s)/n

n o − − (τ−16) − performs exact support recovery with probability at least 1 4 exp 16 log(p s) if   | ∗| −1/2 ∥ H −1∥ βj > σλ 8σc + (XS XS/n) ∞,1

for all j ∈ S and τ ≥ 16.

(ii) If, additionally, the real and imaginary parts are independently sub-Gaussian

with variance proxy σ2 and β∗ satisfies the same bound, then the CLasso per- n o − − (τ−2) − forms exact support recovery with probability at least 1 8 exp 2 log(p s) for τ ≥ 2. 53

Proof. The stationarity conditions associated with the CLasso problem at βˆ are given by

1 1 1 XH ϵ XH (Xβˆ−y)+λz = XH (Xβˆ−Xβ∗−ϵ)+λz = XH X(βˆ−β∗)− +λz = 0 n n n n

ˆ where z ∈ ∂∥β∥1. If we temporarily assume that βSc = 0, then we can write these conditions in block matrix terms as           H H ˆ − ∗ H XS XS XS XSc  βS β  XS ϵ  zS   0S      −   + nλ   =   H H XSc XS XSc XSc 0 XSc ϵ zSc 0Sc

The remainder of the proof will proceed in four parts:

• First, we will show uniqueness by showing that any two solutions (βˆ, z) and

(β˜, z˜) of the above must coincide;

• Next, we will show that there is a solution z satisfying the above: that is, ˆ a complex vector satisfying ∥zSc ∥∞ < 1 and zS = ∠βS such that the above inequality holds.†

• Then, we will show that solution satisfies the ℓ∞ error bound.

• Finally, we show that under assumptions on β∗ and ϵ, the Lagrangian bound

holds with high probability and support recovery is guaranteed.

The pair (βˆ, z) forms the primal dual witness pair for this problem. By showing that

there is a solution with the desired properties, and that the solution is unique, we

† Note that, even though it is possible for elements of zSc to have unit norm, the proof used depends on bounding ∥zSc ∥∞ strictly less than one. This is to guarantee that βSc = 0. If we only c have |zj| = 1 for some j ∈ S , we have to accept the possibility that βj is a non-zero vector with the same phase as zj and cannot guarantee sparsity. 54

know that any CLasso solution has the desired properties.

Suppose for the sake of contradiction that, in addition to (βˆ, z) satisfying the sta- ˜ tionarity conditions with zSc , we have an additional solution β. Both solutions must give the same objective value of the CLasso problem, so we have

1 1 ∥y − Xβˆ∥2 + λ⟨z, βˆ⟩ = ∥y − Xβ˜∥2 + λ∥β˜∥ 2n 2 2n 2 1

Rearranging, we have

1 1 ∥y − Xβˆ∥2 − λ⟨z, β˜ − βˆ⟩ = ∥y − Xβ˜∥2 + λ(∥β˜∥ − ⟨z, β˜⟩) 2n 2 2n 2 1

∥y−Xβˆ∥2 The stationarity conditions imply that λz = − d 2 , so this becomes dβˆ 2n

1 d ∥y − Xβˆ∥2 1 ∥y − Xβˆ∥2 + λ⟨ 2 , β˜ − βˆ⟩ − ∥y − Xβ˜∥2 = λ(∥β˜∥ − ⟨z, β˜⟩) 2n 2 ˆ 2n 2n 2 1 | d{zβ } | {z } (ii) (i)

We recognize that the first term (i) is a first-order approximation of the least squares term around βˆ evaluated at β˜. Because first-order approximations underestimate

convex functions, we know that the left-hand side must be negative. This implies ˜ ˜ ˜ ˜ that ∥β∥1 ≤ ⟨z, β. Applying Hölder’s inequality, this gives ∥β∥1 ≤ ∥z∥∞∥β1∥. Since

∥z∥∞ ≤ 1 by construction, this implies that we must have ∥z∥∞ = 1 and hence ˜ ˜ ∥β∥1 = ⟨z, β⟩ holds as an equality. Since ∥zSc ∥∞ < 1 by assumption, this can only

˜ c ˜ hold if βj is zero for all j ∈ S . Thus, any solution β must have the same support as 55

βˆ. The minimum eigenvalue assumption implies that the “oracle” problem

1 arg min ∥y − XSβS∥ + λ∥βS∥1 β∈Cp 2n βSc =0

ˆ ˜ is strictly convex and hence has a unique solution. Hence, βS = βS and the two solutions must be equal.

The preceding argument implies that, if there is a solution βˆ with the correct sparsity

pattern, it will be the unique solution. Alternatively, we can also show uniqueness

based only on the value of X, using the techniques discussed by Tibshirani [173]. We still need to show that such a solution exists. To do so, we explicitly construct a z, depending on βˆ that satisfies the stationarity and feasibility conditions.

The S-terms in the block-matrix stationarity conditions imply

ˆ − ∗ H −1 H − H −1 βS βS = (XS XS) XS ϵ λn(XS XS) zS

Similarly, the block-matrix stationarity conditions for zSc imply that

  1 H ∗ H ϵ z c = − X c X (βˆ − β ) + X c S λn S S S S S λn

Substituing the former into the latter, we find that   Π ⊥ (X) z S}| {   H H −1 H  H −1 H  ϵ z c = X c X (X X ) z + X c I − X (X X ) X  S | S S {zS S S} S S S S S λn µ1 | {z }

µ2 56

By the triangle inequality, we have

∥zSc ∥∞ ≤ ∥µ1∥∞ + ∥µ2∥∞

In order to bound ∥zSc ∥∞ strictly less than 1, it will suffice to bound µ1, µ2. Recall ˆ that we do not need to bound zS because we know it explicitly to be zS = ∠βS.

∥ H −1 H ∥ The mutual incoherence bound on X implies that (XS XS) XS Xj 1 is bounded above by α for any j ∈ Sc and hence the bound holds for the maximum over all

c elements of S . Because zS is unit scaled, it does not loosen our bound. Hence

∥µ1∥ ≤ α. A bound on ∥µ2∥ follows from our assumed bound on λ. In particular, we have

ϵ ∥ ∥ H µ2 ∞ = XSc ΠS⊥ (X) λn ∞ ϵ ∥ ∥ H λ µ2 ∞ = XSc ΠS⊥ (X) n ∞ 2∥µ ∥∞ ϵ ϵ 2 H ≤ H XSc ΠS⊥ (X) XSc ΠS⊥ (X) 1 − α n ∞ n ∞ 1 − α ∥µ ∥∞ ≤ 2 2

Combining these, we have

1 − α 1 + α ∥z c ∥∞ ≤ ∥µ ∥∞ + ∥µ ∥∞ ≤ α + = < 1 S 1 2 2 2

so zSc is indeed bounded away from zero (recall α < 1 by assumption). This implies ˆ there is a solution satisfying βSc = 0 and, by uniqueness, any solution must satisfy the “no false negatives” property.

ˆ − ∗ We now turn towards getting an ℓ∞ bounds on βS βS. (The previous section 57

∥ ˆ − ∗ ∥ guarantees βS βS ∞ = 0.) As noted above, the S-terms in the block-matrix stationarity conditions imply

ˆ − ∗ H −1 H − H −1 βS βS = (XS XS/n) XS ϵ/n λ(XS XS/n) zS

Applying the triangle inequality, we get

∥ ˆ − ∗ ∥ H −1 H ∥ H −1 ∥ βS βS ∞ = (XS XS/n) XS ϵ/n ∞ + λ (XS XS/n) zS ∞

From here, we factor the second term into a ℓ∞ → ℓ∞ matrix operator norm and a

ℓ1 norm on zS:

∥ ˆ − ∗ ∥ ≤ H −1 H ∥ H −1∥ ∥ ∥ βS βS ∞ (XS XS/n) XS ϵ/n ∞ + λ (XS XS/n) ℓ∞→ℓ∞ zS ∞

We know ∥zS∥∞ = 1 and we recall that the ℓ∞ → ℓ∞ norm is equal to the ℓ∞,1 norm, so this becomes

∥ ˆ − ∗ ∥ ≤ H −1 H ∥ H −1∥ E βS βS ∞ (XS XS/n) XS ϵ/n ∞ + λ (XS XS/n) ∞,1 = (λ, X)

so we have the desired bound. Finally, to show “no false negatives” it suffices to note | ∗| E ˆ that if βj > (λ, X), the error bound implies that βj must be non-zero.

We now turn to the probabilistic claims: as with the previous theorems in this chapter,

the bulk of the work is to identify an upper bound on the ϵ term so that we can fix

λ suitably high. Recall that we need

2 ϵ ≥ H λ XSc ΠS⊥ (X) 1 − α n ∞ 58

H The ΠS⊥ term is a non-expansive projection operator, so XSc ΠS⊥ (X) has the same

H column normalization as XSc . Applying Part (a) of Lemma 3.10 where ϵ/n are sub-Gaussian with variance proxy σ2/n2, we have that

h i   ϵ nt2 H ≥ ≤ − − P XSc ΠS⊥ (X) t 2(p s) exp n ∞ 16σ2

p Setting t = σ τ log(p − s)/n for τ > 16, we have

h i   ϵ τ log(p − s) H ≥ ≤ − − P XSc ΠS⊥ (X) t 2 exp + log(p s) n ∞  16  (τ − 16 = 2 exp − log(p − s) 16

p 2σ − so if we take λ = 1−α τ log(p s)/n, our bounds hold with probability at least 1 − 2e−(τ−16) log p/16 for τ ≥ 16.

The first term in the ℓ∞ error bounds is still stochastic, so we use similar techniques

∥ H −1 H ∥ to bound it. In particular, we note that the term (XS XS/n) XS ϵ/n ∞ can be bounded as the maximum of s sub-Gaussian random variables each of which has variance proxy at most

2 2 2 2 H −1 σ n σ σ /n ∥(X X /n) ∥ ∥X· ≤ = . S S 2 j n2 c nc

Applying Part (c) of Lemma 3.9 and a union bound, we have

    2 H −1 H t cn P ∥(X X /n) X ϵ/n∥∞ ≥ t ≤ 2s exp − S S S 16σ2 59 √ Taking t = 8σλ/ c,we find that

   ∥ H −1 H ∥ ≥ ≤ − 2 P (XS XS/n) XS ϵ/n ∞ t 2s exp 4λ n + log s

so that we have

  ∥ ˆ − ∗ ∥ ≤ −1/2 ∥ H −1∥ βS βS ∞ λ 8σc + (XS XS/n) ∞,1

n o − − (τ−16) − ≥ with probability at least 1 4 exp 16 log(p s) for τ 16.

In the case of ϵ with independent real and imaginary parts, the above analysis can be n o − − (τ−2) − tightened using Part (b) of Lemma 3.10 to hold with probability 1 8 exp 2 log(p s) for τ ≥ 2.

3.4 Algorithms

We solve the CLasso problem (3.1) using the coordinate descent approach popular-

ized for the lasso and other sparse penalties by Friedman et al. [159], Friedman et al.

[160], Wu and Lange [161], and Breheny and Huang [162, 164]. We iteratively update

β coordinate-wise by solving problems of the form:

1 ← ∥ − ∥2 | | βi arg min ri X·iβ 2 + λ β β∈C 2n

Factoring the quadratic, we obtain:

H − H − H H ri ri ri X·iβ βX·i ri + βX·i X·iβ βi ← arg min + λ|β| β∈C 2n 2 2 H ∥X· ∥ |β| − 2⟨β, X r ⟩ = arg min i 2 ·i i + nλ|β| β∈C 2 60 where 2⟨z, w⟩ = zw + zw is the scalar inner product used throughout this chapter.

∥ ∥2 Dividing by X·i 2 and completing the square, we obtain:

| |2 − ⟨ H ∥ ∥2⟩ ← β 2 β, Xi· ri/ X·i nλ βi arg min + 2 β∈C 2 ∥X·i∥ | |2 − ⟨ H ∥ ∥2⟩ β 2 β, Xi· ri/ X·i nλ | | = arg min + 2 β ∈C 2 ∥X· ∥ β i 2 H 2 1 − Xi· ri nλ | | = arg min β 2 + 2 β ∈C 2 ∥X ·∥ ∥X· ∥ β  i 2 2 i 2 H Xi· ri = prox 2 nλ/∥X·i∥ |·| 2 2 ∥X ·∥  i 2 H S Xi· ri = nλ/∥X· ∥2 i 2 ∥ ∥2 Xi· 2

1 ∥ − ∥2 S where proxf (z) = arg minw 2 w z 2 + f(w) and denote the proximal and soft- thresholding operators respectively. To simplify this final term, we recall a homogene- ity property of the soft-thresholding operator:         cx − cλ cλ ≤ cx  x − λ x ≥ λ      S S cλ(cx) =  0 when |cx| ≤ |cλ| = c  0 when |x| ≤ λ = c λ(x).         cx + cλ cx < −cλ x + λ x < −λ

Applying this, the update becomes:

1  β ← S XH r i ∥ ∥2 nλ ·i i X·i 2 which essentially matches the coordinate update for the real lasso.

For the general case of the complex elastic net [36] with regression weights (W ), 61

intercept terms (γ), and offsets (o) defined by,

  − − − H − − − ∥ ∥ ˆ (y Xβ γ1 o) W (y Xβ γ1 o) β 2 βλ = arg min + λ (1 − α) + α∥β∥1 β∈Cp 2n 2 γ∈C p Xn X 1 − α | − − − |2 | |2 | | = arg min wi yi Xi· γ oi 2 + λ βi + α βi , (3.2) ∈Cp 2 β i=1 i=1 γ∈C

the entire algorithm is given in Algorithm 1. Further performance improvements

can be obtained by using “active set” and “warm start” strategies, which we do not

describe here. This algorithm has been implemented in the author’s classo package,

available at https://github.com/michaelweylandt/classo. As discussed in more

detail in Section 2.3, convergence follows from a straight-forward generalization of

the results of Tseng [138].

3.5 Numerical Experiments

In this section, I briefly demonstrate the predictions of Theorem 3.3 in simulation. I

|i−j| generate X from a multivariate Gaussian with autoregressive covariance Σij = ρ for various values of ρ and generate y from the sparse linear regression model with iid Gaussian noise with variance σ2 and independent real and imaginary parts. I vary the problem dimension p while keeping the sample size n and number of non- zero variables s fixed at 200 and 10, respectively. Additionally, the true variables β∗

are generated with fixed signal size and uniformly random phase. The regularization

parameter λ was chosen to give oracle sparsity. That is, the number of true features was considered known, but their identity was estimated. Results are shown in Figure

3.1. As can be seen there, the CLasso performs quite well across a wide range of

conditions, with the inter-feature correlation ρ being the most important determinant 62

Algorithm 1 Coordinate Descent for the Complex Elastic Net Problem (3.2) 1. Input: • Data Matrix: X ∈ Cn×p • Response Vector: y ∈ Cn • Regression Weight Matrix (Diagonal with strictly positive elements): W ∈ Rp×p • Regression Offset Vector: o ∈ Cn • Penalty Parameter: λ ∈ R>0 • Stopping Tolerance: ϵ ∈ R>0 2. Precompute: • Weight Factors: u = diag(XH WX) 3. Initialize: • Intercept: γ(0) = 0 ∈ C • Coefficients: β(0) = 0 ∈ Cp • Working residual: r = y − o ∈ Cn • k = 0 ∥ (k) − − ∥2 | (k) − (k−1)|2 ≤ 4. Repeat until β β(k 1) 2 + γ γ ϵ: • Update Coefficients. For j ∈ 1, . . . , p: (k) (i) r ← r + X· β j j  (k+1) 1 H (ii) β ← Snαλ X· W ri j uj +λ(1−α) i ← − (k+1) (iii) r r X·jβj • Update Intercept: (i) r ← r + γ(k) (ii) γ(k+1) ← rH diag(W )/n (iii) r ← r − γ(k+1) • k ← k + 1 5. Return: • Estimated regression coefficients: β(k−1) ∈ Cp • Estimated regression intercept: γ(k−1) ∈ C 63 of accuracy. 64

p = 20 p = 50 p = 100 p = 200 p = 500 p = 1000 p = 2000 100% Recovery Probability

75%

50%

25%

100% True Positive Rate 90% 80% 70% Accuracy 60%

12.5% False Positive Rate 10.0% 7.5% 5.0% 2.5% 0.0% 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Signal−to−Noise Ratio

ρ 0 0.2 0.4 0.6 0.8

Figure 3.1 : Model Selection Consistency of the Complex Lasso (3.1). The CLasso has high true positive rate and low false positive rate in both low- (p < n) and high (p > n) dimensional settings. As ρ → 1, the CLasso struggles to accurately recover the true model in high dimensions, though the decay is slow. Beyond a small minimum signal strength requirement, the signal to noise ratio is not an important driver of accuracy. 65

Appendix to Chapter 3

3.A Concentration of Complex valued Sub-Gaussian Random

Variables

We say a real valued random variable X is sub-Gaussian with variance proxy σ2 if its

E t(X−E[X]) ≤ σ2t2 centered cumulant generating function ψX (t) = log [e ] satisfies ψX (t) 2 for all t ∈ R≥0. We similarly say a real valued random vector X is sub-Gaussian if cT X is sub-Gaussian for all c with norm 1. For random vectors, the variance proxy

σ2 is defined as the maximum of the variance proxies of the projections cT X over all

unit vectors c.

Sub-Gaussian complex valued random variables can be characterized the same way:

that is, a complex random-variable X is sub-Gaussian if the real valued random ⟨ ⟩ cX+cX ℜ ∈ C variable c, X = 2 = (cX) is real sub-Gaussian for all c with unit norm. Finally, a complex random vector X ∈ Cn is said to be sub-Gaussian if ⟨c, X⟩ =

cH X+cT X ∈ Cn ∥ ∥ 2 is sub-Gaussian for all c such that c = 1.

As we might expect, if a complex random variable has sub-Gaussian real and imagi-

nary parts, it is sub-Gaussian as well:

Lemma 3.4. Suppose Z = X + iY where X and Y are independent real valued sub-Gaussian random variables each with variance proxy σ2. Then Z is complex sub-Gaussian with variance proxy σ2 as well.

Proof. Without loss of generality, we may assume E[Z] = 0 and hence E[X] = E[Y ] =

0. First, we note that the “real projections” ⟨c, Z⟩ are all centered random variables: 66

E[⟨c, Z⟩] = ⟨c, E[Z]⟩ = ⟨c, 0⟩ = 0. From here, we note

ψ⟨c,Z⟩(t) = log E [exp {t⟨c, Z⟩}]

= log E [exp {t(ℜ(c)X + ℑ(c)Y )}]

= log E[exp{tℜ(c)X}] + log E[exp{tℑ(c)Y }]

= ψX (ℜ(c)t) + ψY (ℑ(c)t) σ2t2ℜ(c)2 σ2t2ℑ(c)2 ≤ + 2 2 σ2t2|c|2 = 2 σ2t2 = 2

for all c with unit norm, so Z is sub-Gaussian.

The requirement that X and Y are independent may be relaxed, at the cost of being

able to bound the variance proxy of Z. As noted in Exercise 3.4.3 of Vershynin [66], while Z is sub-Gaussian even without independence, the variance proxy of Z may be

far larger than those of X and Y .

Lemma 3.5. Suppose Z = X + iY where X and Y are real valued sub-Gaussian random variables, not necessarily independent. Then Z is complex sub-Gaussian as well.

Proof. We use an alternate characterization of sub-Gaussianity, based on the Lp norms of a random variable considered as a measurable function. In particular, let

∥ ∥ E | |p 1/p X Lp = [ X ] for any scalar random variable X. Then X is sub-Gaussian if √ ∥ ∥ O and only if X Lp = ( p)[66, Proposition 2.5.2(ii)].

As before, we assume Z, and hence X and Y , are centered and we want to show ⟨c, Z⟩ 67

is sub-Gaussian for all c. We recall that ⟨c, Z⟩ = ℜ(c)X + ℑ(c)Y so

∥⟨c, Z⟩∥ = ∥ℜ(c)X + ℑ(c)Y ∥ Lp Lp ≤ |ℜ |∥ ∥ |ℑ |∥ ∥ (c) X Lp + (c) Y Lp √ √ = |ℜ(c)|O( p) + |ℑ(c)|O( p) √ = O( p) showing ⟨c, Z⟩ is sub-Gaussian for all unit vectors c and hence Z is sub-Gaussian, as desired.

Given two complex Gaussian random variables, we may wish to manipulate their

sums:

Lemma 3.6. Suppose Z1 and Z2 are mean zero sub-Gaussian random variables with

2 2 variance proxies σ1 and σ2 respectively:

| |2 2 (a) W1 = cZ1 is sub-Gaussian with variance proxy c σ1

(b) If Z1 and Z2 are independent, then Z1 + Z2 is sub-Gaussian with variance proxy

2 2 σ1 + σ2.

2 (c) In general, Z1 + Z2 is sub-Gaussian with variance proxy upper bounded by 2(σ1 +

2 σ2).

(d) In general, Z2 + Z2 is sub-Gaussian with variance proxy upper bounded by (σ1 +

2 σ2) .

An important corollary of this lets us manipulate the averages of mean-zero sub-

Gaussians: 68

Corollary 3.7. Suppose Z1,...,Zn are mean zero sub-Gaussian random variables P 2 −1 n with variance proxy at most σ . Then Z = n i=1 is sub-Gaussian with variance

2 2 proxy at most σ ; if the Zi are independent, this bound can be tightened to σ /n.

Proof of Lemma 3.6. Using the cumulant generating function ψZ (t) = log E[exp{⟨t, Z⟩}],

2 we have that a random variable Z is sub-Gaussian with variance proxy σ if ψZ (t) ≤ σ2|t|2/2.

(a) This follows immediately by noting ψW1 (t) = ψZ1 (ct), which follows immediately from

2⟨a, bc⟩ = abc + a(bc)

= (ac)b + (ac)b

= 2⟨ac, b⟩

under our chosen inner product.

(b) Under independence, we can split the expectation E[f(Z1)g(Z2)] = E[f(Z1)]E[g(Z2)] to obtain:

{E {⟨ ⟩} } ψZ1+Z2 (t) = log [exp t, Z1 + Z2 ]

= log {E [exp {⟨t, Z1⟩}] E [exp {⟨t, Z2⟩}]}

= log {E [exp {⟨t, Z1⟩}]} + log {E [exp {⟨t, Z2⟩}]}

= ψZ1 (t) + ψZ2 (t) σ2|t|2 σ2|t|2 ≤ 1 + 2 2 2 (σ2 + σ2)|t|2 = 1 2 2 69

which implies sub-Gaussianity of Z1 + Z2.

(c) Without independence, we cannot ‘split’ the expectation naturally, but we can

still obtain an upper bound by applying the (expectation form of the) Cauchy-

⟨ ⟩ ⟨ ⟩ Schwarz inequality to the random variables e t,Z1 and e t,Z2 to find:

⟨ ⟩ ⟨ ⟩ ⟨ ⟩ E[e t,Z1+Z2 ]2 = E[e t,Z1 e t,Z2 ]2

⟨ ⟩ ⟨ ⟩ ≤ E[e t,Z1 ]2E[e t,Z2 ]2

⇒ ≤ = ψZ1+Z2 (t) 2ψZ1 (t) + 2ψZ2 (t) 2 2 | |2 ≤ 2(σ1 + σ2) t 2

as desired. (Note that the conventional norms in the Cauchy-Schwarz inequality

can be omitted as all terms here are strictly positive, following from the fact that

⟨·, ·⟩ is real valued.)

(d) We proceed as before, but now use Hölder’s inequality instead of Cauchy-Schwarz

to find

⟨ ⟩ ⟨ ⟩ ⟨ ⟩ E[e t,Z1+Z2 ] = E[e t,Z1 e t,Z2 ]2

⟨ ⟩ ⟨ ⟩ ≤ E[ep t,Z1 ]1/pE[eq t,Z2 ]1/q 1 1 =⇒ ψ (t) ≤ ψ (pt) + ψ (qt) Z1+Z2 p Z1 q Z2 1 p2σ2|t|2 1 q2σ2|t|2 ≤ 1 + 2 p 2 q 2 (pσ2 + qσ2)|t|2 = 1 2 2

for a Hölder conjugate pair (p, q) satisfying p−1 + q−1 = 1. To get the tightest

bound, we seek to minimize over (p, q) and taking p = σ2 + 1, we get q = σ1 + 1 σ1 σ2 70

and the bound

(σ2 + σ σ + σ2 + σ σ )|t|2 (σ + σ )2|t|2 ψ (t) ≤ 1 1 2 2 1 2 = 1 2 Z1+Z2 2 2

as desired.

A useful fact of complex sub-Gaussian vectors, like their real counterpart, is that they are closed under linear combinations. In particular, the following result about sums of independent sub-Gaussians holds:

Lemma 3.8. Suppose Zi,...,Zn are independent complex mean sub-Gaussian random

2 variables each with variance proxy σ . Then the vector Z = (Z1,...,Zn) is sub- Gaussian with variance proxy σ2.

Proof. Let u ∈ Cn be a unit vector. Then " ( )# Xn E [exp {t⟨u, Z⟩}] = E exp t ℜ(uiZi) i=1 Yn = E[exp {tℜ(uiZi)} i=1   Yn σ2t2|u |2 ≤ exp i 2 i=1   σ2t2|u|2 = exp  2 σ2t2 = exp . 2

As a corollary, the weighted sum of sub-Gaussian variates is also sub-Gaussian, as 71

can be seen by taking suitable u. In particular, we have the result that ! Xn Xn iid 2 2 2 Zi ∼ sub-Gaussian(0, σ ) =⇒ S = aiZi ∼ sub-Gaussian 0, σ |ai| i=1 i=1

For a real sub-Gaussian variable, X, its absolute value |X| is generally not sub-

Gaussian,‡, but its norm concentrates well around its expected value. For complex

X, we can obtain a similar result:

Lemma 3.9. Suppose Z is mean-zero sub-Gaussian random variable. Then:

(a) if Z is real valued with variance proxy σ2, we have

  t2 P [|Z| ≥ t] ≤ 2 exp − for all t ∈ R≥ ; 2σ2 0

(b) if Z is complex with ℜ(Z) and ℑ(Z) sub-Gaussian, each with variance proxy σ2

(non-necessarily independent):

  t2 P [|Z| ≥ t] ≤ 4 exp − for all t ∈ R≥ ; 2σ2 0

(c) if Z is complex valued with variance proxy σ2:

  t2 P [|Z| ≥ t] ≤ 2 exp − for all t ∈ R≥ . 16σ2 0

In each case, the generalization to arbitrary mean is straight-forward (by applying

the above results to Z˜ = Z − E[Z]), but not useful for this work.

‡For example, in the specific case of real Gaussian X, X2 is exponentially distributed and |X| is χ1-distributed. 72

Proof. (a) We first consider the case where Z is real and apply the Cramèr-Chernoff

technique to bound P [Z ≥ t]. In particular, note that the random variable eλZ

is strictly positive for all λ ∈ R,§ so we can apply Markov’s inequality to obtain

    E[eλZ ] σ2λ2 P [Z ≥ t] = P eλZ ≥ eλt ≤ ≤ exp − λt eλt 2

where the final inequality comes from the sub-Gaussianity of Z. Optimizing this

bound over λ at λ = t/σ2, we get

    t2 t2 t2 P [Z ≥ t] ≤ exp − = exp − 2σ2 σ2 2σ2

Applying the same argument to the random variable −Z, which is again mean-

zero sub-Gaussian with variance proxy σ2, we get an equivalent bound on P [−Z ≥

t] = P [Z ≤ −t]. Putting these together, we get the desired result for the real

case:

P [|Z| ≥ t] = P [Z ≥ t] + P [Z ≤ −t] ≤ e−t2/2σ2 + e−t2/2σ2 = 2e−t2/2σ2 .

(b) Next, we consider the concentration of |Z| for complex Z with independent com-

ponents. As before, we assume Z is mean zero without loss of generality. Note

that √ max{|X|, |Y |} ≤ t/ 2 =⇒ |Z| ≤ t

so it suffices to apply a standard (real valued) maximal inequality to the real

§Recall Z is sub-Gaussian so we know that E[eλZ ] exists for all λ: see, e.g. Proposition 2.5.2(v) of Vershynin [66]. 73

sub-Gaussian 2-vector (X,Y ). In particular, the previous part gives us

  √ t2 P [|X| ≥ t/ 2] ≤ 2 exp − 4σ2

and similarly for |Y |. Combining these with a union bound, we have

  √ t2 P [max{|X|, |Y |} ≥ t/ 2] ≤ 4 exp − 4σ2

Hence,

√ P [max{|X|, |Y |} ≤ t/ 2] ≤ P [|Z| ≤ t] √ 1 − P [max{|X|, |Y |} > t/ 2] ≤ P [|Z| > t] √ P [|Z| > t] ≤ P [max{|X|, |Y |} > t/ 2]   t2 ≤ 4 exp − 4σ2

as desired.

(c) We now turn to the general case of complex Z, armed only with a variance proxy

for the entire vector, not the individual elements. We first consider a tail bound

on ⟨c, Z⟩ for fixed c and then use an ϵ-net argument to “boost” this to all c,

including c = ∠Z which will give us a tail bound for ⟨∠Z,Z⟩ = |Z|. By the

sub-Gaussianity of Z, we have ⟨c, Z⟩ is a real valued sub-Gaussian with variance

proxy at most σ2. Because Z is centered, so is ⟨c, Z⟩, and hence we can apply

the result of the previous section to find:

P [|⟨c, Z⟩| ≥ t] ≤ 2e−t2/2σ2 . 74

We are now ready to extend this argument to the random projection c = ∠Z.

C Let B (1) be the unit circle in C, that is {c ∈ C : |c| = 1} and let W = {wi} be an ϵ-net of BC(1), i.e., a set of points such that for every z ∈ BC(1) there exists w ∈ W such that ∥w −z∥ ≤ ϵ. Let w : C → W be the map from C to the element of W closest to its argument; in the case of ties, w can be defined arbitrarily. We then have

|Z| = ⟨∠Z,Z⟩

= ⟨w(Z),Z⟩ + ⟨∠Z − w(Z),Z⟩

≤ ⟨w(Z),Z⟩ + |∠Z − w(Z)||Z|

≤ ⟨w(Z),Z⟩ + ϵ|Z| where the final two inequalities are obtained by Cauchy-Schwarz and the ϵ-net construction, respectively. Rearranging, we obtain

|Z| ≤ (1 − ϵ)−1⟨w(Z),Z⟩.

Applying a union bound over the entire net, we find:

  1 P [|Z| ≥ t] ≤ P |⟨w(Z),Z⟩| ≥ t 1 − ϵ

≤ P [|⟨w, Z⟩| ≥ (1 − ϵ)t for at least one w ∈ W]

= |W|2e−(1−ϵ)2t2/2σ2

In order to bound |W| as a function of ϵ, we need to calculate the covering number of BC(1). A typical high-dimensional analysis of covering numbers of Euclidean 75 spheres, scales as (3/ϵ)d; see, e.g., Corollary 4.2.13 of Vershynin [66]. Because our focus here is on the low-dimensional set BC(1), we instead use a more direct construction of a net of equally spaced points at the roots of unity on BC(1), given

{ i2πk/K }K−1 by wk = e k=0 .

Half the distance between wk and wk+1 is an upper bound from any point Z to a point in W. The Euclidean distance between the roots of unity is upper-bounded by the arc length 2π/K. Hence, the distance between any Z and w(Z) is at most

π/K and we have that the Kth roots of unity form an π/K-net of size K. Thus, we can get a valid ϵ-net from the Kth roots of unity with K = π/ϵ. We use this to take |W| ≤ π/ϵ for any minimal ϵ-net W. This gives the bound

2π 2 2 2 P [|Z| ≥ t] ≤ e−(1−ϵ) t /2σ ϵ

1 The exact minimizing value of ϵ depends on σ and t, but we can take ϵ = 2 to get an approximate bound of

  2 2 2 t P [|Z| ≥ t] ≤ 4πe−t /8σ = exp − + log(4π) 8σ2

If we wish to retain the conventional leading constant (2), we can rescale the variance proxy. First, we note that if t2/8σ2 < log(4π), the tail bound above is vacuous. Hence, we are only interested in values of t such that t2/8σ2 > log(4π); that is, t2 ∈ [8σ2 log(4π), ∞). On this range, it is straight-forward to see that

t2 t2 − log(4π) > 8σ2 16σ2 76

Hence, t2 t2 − + log(4π) ≤ − 8σ2 16σ2

which in turn implies

P [|Z| ≥ t] ≤ min(1, 4πe−t2/8σ2 ) ≤ min(1, 2e−t2/16σ2 )

√ Hence, |Z| has sub-Gaussian-type tails with variance proxy 2 2σ. The factor

of 2 seems appropriate, given the isomorphism between C and R2n; whether the √ additional factor of 2 is unavoidable remains an open question.

The above bound is very standard in the real case: see, e.g., the discussion in Section

2.1.2 of Wainwright [54]. The complex case, especially without separate bounds on

ℜ(Z) and ℑ(Z) appears to be new, but is closely related to the notion of norm-sub-

Gaussianity recently proposed by Jin et al. [174]. Note that the ϵ-net argument used

for the complex case could be equivalently phrased in terms of concentration bounds  a −b on the operator norm of matrices of the form b a .

Our main use of sub-Gaussian random variables will be in bounding quantities of

the form ∥Aw∥∞ for sub-Gaussian vectors w. In particular, we will make use of the

following tail bounds:

Lemma 3.10. Suppose Z = (Z1,...,Zn) is a vector of independent complex centered

n×p random variables; suppose also that A ∈ C is column-normalized so that ∥A·j∥ = √ n for j ∈ [p]. 77

2 (a) If each Zi is sub-Gaussian with variance proxy σ , then

  2 H t P [∥A Z∥∞ ≥ t] ≤ 2p exp − for all t ∈ R≥ 16nσ2 0

(b) If the real and imaginary parts of each Zi are independent sub-Gaussian with variance proxy σ2, then

  2 H t P [∥A Z∥∞ ≥ t] ≤ 4p exp − . 2nσ2

In each case, we observe that

 ∥ H ∥ | H | A Z ∞ = max Ai· Z 1≤i≤n

is the maximum norm of a linear combination of sub-Gaussian random variables, which is itself sub-Gaussian. (Note that this is the standard complex product, not

the real valued inner product used elsewhere.) As shown above, while the sum of

independent sub-Gaussians is sub-Gaussian, we are not able to speak about the sum

of two arbitrary sub-Gaussians. Due to the “mixing” of real and imaginary parts when multiplying Z by A, we can only use Part (b) of Lemma 3.9 if the real and

imaginary parts of Z are independent. In general, we will have to fall back on Part

(c) of Lemma 3.9 which does not require additional independence assumptions. The

H bulk of the proof is in identifying upper bounds on the variance proxies of the Ai· Z terms to combine Lemma 3.9 with a standard union bound.

P H n Proof. (a) For a fixed i, note that Ai· Z = j=1 AjiZi is a sum of independent

H sub-Gaussian random variables. By the Corollary to Lemma 3.8, Ai· Z is thus 78 P 2 n | T |2 2 sub-Gaussian with variance proxy σ j=1 Aij = nσ . Applying Part (c) of Lemma 3.9, we have

  t2 P [|AH Z| ≥ t] ≤ 2 exp − i· 16nσ2

Applying a union bound over the p rows of AH , we obtain

  2 H t P [∥A Z∥∞ ≥ t] ≤ 2p exp − for all t ∈ R≥ 16nσ2 0

as desired.

(b) If we further assume independence of the real and imaginary parts, we can obtain

a tighter bound. In particular, we have that

H ℜ ℜ ℑ ℑ ℜ ℑ − ℑ ℜ Aij Zi = ( (Aji) (Zi) + (Aji) (Zi)) + i(( (Aji) (Zi) (Aji) (Zi)))

If ℜ(Zi) and ℑ(Zi) are independent, then the real and imaginary parts of this sum are separately sub-Gaussian, again with variance proxy σ2, and we can apply

H Part (b) to the sum Ai· Zi giving the tail bound   t2 P [|AH Z| ≥ t] ≤ 4 exp − . i· 2nσ2

Again, applying a union bound over the n rows of A, we obtain

  2 H t P [∥A Z∥∞ ≥ t] ≤ 4p exp − . 2nσ2 79

Chapter 4

Model Selection Consistency for Complex valued Gaussian Graphical Models: Likelihood and Psuedo-Likelihood Methods

4.1 Introduction and Background

In this chapter, we consider the problem of estimating conditional independence struc-

tures from complex valued data sampled from a multivariate Gaussian distribution.

The problem of sparse covariance estimation can be traced back to Dempster [175], who highlighted the important role of sparsity in the precision matrix. Besag [176]

connected the covariance problem to the more general phenomenon of graphical mod-

els and the important Hammersley-Clifford theorem (see also Grimmett [177]). The

books by Lauritzen [178] and by Wainwright and Jordan [179] review the classical

theory of graphical models.

Modern sparse graph estimation (structure learning) techniques build upon the suc-

cess of sparse regression. Yuan and Lin [18] considered the penalized likelihood

approach and their work was extended by many others shortly thereafter [19–21,

180]. Beyond the standard multivariate Gaussian, several extensions to more struc-

tured Gaussian variants have been proposed including partially observed Gaussians

[181], matrix and tensor Gaussians [182–184], copula Gaussian distributions [185–

187], among others. Ravikumar et al. [21] established model selection consistency

results which we discuss in more detail below. Selecting the regularization parameter 80

is somewhat more difficult in graph estimation than it is in regression: see the papers

by Foygel and Drton [188] and by Liu et al. [189] for more discussion.

An alternative pseudo-likelihood approach to sparse graph estimation, sometimes called neighborhood selection, was proposed by Meinshausen and Bühlmann [17],

building on the results of Besag [176]. This approach has been extended to non-

Gaussian graphical models, including Ising models [12] and general exponential family

graphical models [22–24, 190, 191]. We apply this theory to the complex domain in

Section 4.3.

The problem of covariance estimation for complex valued data has received relatively

little attention, mainly in the signal processing literature. Andersen et al. [192] re- view the conditional independence and graphical model properties of the complex

Gaussian, but does not focus on sparse estimation. Jung et al. [193] give a com-

plex covariance estimator based on a “realified” pseudo-likelihood approach, using

the group lasso [194, 195] to couple real and imaginary components. Jung et al. [196]

consider a related likelihood approach and demonstrate its applicability to time series

data. Tugnait [197–199] considers “complex native” sparse estimation similar to what we consider here, but does not provide comparable theoretical guarantees, focusing

instead on asymptotic theory and algorithms. Tank et al. [200] consider a similar

problem in a Bayesian framework, again motivated by time series spectra.

4.2 Likelihood Approach: Sparsistency

We first consider the likelihood approach to graphical model estimation. Inthis

approach, the multivariate Gaussian likelihood is augmented with a sparsity-inducing

penalty on the non-diagonal elements of the precision matrix. Specifically, given 81

{ }n independent samples zi i=1 from a p-dimensional multivariate Gaussian with mean zero and unknown precision matrix Θ, the likelihood is given by:

Yn L { }n L ( zi i=1) = (zi) i=1  Yn det Θ exp −zH Θz = i i πp i=1 ( ) (det Θ)n Xn = exp − zH Θz πnp i i ( i=1 ) (det Θ)n Xn = exp − Tr(zH Θz ) πnp i i ( i=1 ) (det Θ)n Xn = exp − Tr(Θz zH ) πnp i i  i=1           (det Θ)n  Xn  = exp − Tr Θ  z zH  np    i i  π      |i=1 {z }  =nˆΣ P ˆ −1 n H where Σ = n i=1 zizi is an estimate of the sample covariance. Taking logarithms and dropping constants, we find the log-likelihood is given by

ˆ L({zi}) = n log det Θ − n Tr(ΘΣ)

We multiply through by −n−1 and add a sparsity-inducing penalty to get the complex graphical lasso optimization problem, hereafter CGLasso,

ˆ arg min − log det Θ + Tr(ΘΣ) + λ∥Θ∥1,off-diag (4.1) ∈Cp×p Θ ≻0 82 where X ∥Θ∥1,off-diag = |Θij| i,j=1 i≠ j

It is possible to define several forms of consistency for the covariance estimation

problem, depending on the matrix norm being used. Of these, concentration in the

Frobenius or spectral norms are the most common, as these provide subspace recovery

guarantees [201]. As in the previous chapter, our main focus is on model selection consistency - here referring to the zeros in Θ - so our main goal is consistency in the element-wise ℓ∞ norm. We again make use of the primal-dual framework and extend the analysis of Ravikumar et al. [21, 202] to the complex domain.

Theorem 4.1 (Model Selection Consistency of the Complex Graphical Lasso). Sup- ∗ ∈ Cp×p pose that Θ ≻0 is a strictly positive definite matrix satisfying, supported on { ∗ ̸ } S = (i, j):Θij = 0 and satisfying the first α-incoherence condition,

∥ ∗ ⊗ ∗ ∗ ⊗ ∗ −1∥ ≤ − max (Σ Σ )eS(Σ Σ )SS 1 1 α e∈Sc

for some α ∈ (0, 1) where Σ∗ = (Θ∗)−1, with maximum degree bounded by

X ∥ ∗∥ d = Θ ℓ0,ℓ∞,off-diag = max 1|Θij |̸=0. j∈[p] i≠ j

Suppose further that Σˆ satisfies

∗ αλ ∥Σˆ − Σ ∥∞ ∞ ≤ , 8 83 and that ( s )   α 1 1 αλ 2κΛ∗ λ 1 + ≤ min , , ; ∗ 3 ∗ 3 8 3dκΣ 3dκΣ∗ κΛ 12dκΣ∗

for Λ∗ = (Σ∗ ⊗ Σ∗) and the associated norms

Xp ∗ ∗ ∗ ∥ ∥ | | κΣ = Σ ℓ∞→ℓ∞ = max Σij i∈p j=1 and

∗ −1 ∗ ∥ ∥ κΛ = (ΛSS) ℓ∞→ℓ∞ .

∗ ∗ (Note that κ· is defined differently for Λ and Σ : the former does not depend on S, the sparsity pattern of Θ∗.) Then the solution to the CGLasso problem (4.1) satisfies

∗ ̸ ⇒ ˆ ̸ (a) No false positives: Θij = 0 = Θij = 0; and

ˆ ̸ ⇒ ∗ ̸ (b) No false negatives: Θij = 0 = Θij = 0

{ }n Furthermore, if zi i=1 are n independent samples from a proper complex p-variate Gaussian distribution with known mean 0 and unknown inverse covariance matrix Θ∗ satisfying the conditions above, then the solution to the CGLasso problem (4.1) with P ˆ −1 n H Σ = n i=1 zizi and r σ log p λ = + δ for δ ∈ R α n >0 estimates Θ∗ with no false positives and no false negatives (“sparsistently”) with  − 2 − log p − δn 2 ∗ probability at least 1 3p exp 64σ2 256σ2 where σ = max Σii.

As with the major results of the previous chapter, we divide the proof of this theo- rem into two major claims: one, purely deterministic, which establishes consistency 84

assuming the error in Σˆ is not too large and a second, stochastic, which shows that the error is small with high probability.

Inverting the probabilistic bound in Theorem 4.1, we find that the sample complexity

scales as n = Ω(log p), which does not include the degree parameter d. To get a

picture of the dependence on d, note that we are assuming , note that κΛ∗ grows p 2 as d keeping 2κΛ∗ λ bounded requires that λ ≲ λ/d. Rearranging this, we that d = Ω(λ) = Ω(n−1/2), which in turn implies a sample complexity in d of n = Ω(d2).

A more refined analysis combines these as n = Ω(d2 log p). As Wang et al. [203] note,

quadratic dependence on d is not inherent to the problem and, as we will see in the

next section, different estimators can reduce this to a linear dependence on d.

Proof of Theorem 4.1. Before beginning, we define a useful quantity

W = Σˆ − (Θ∗)−1 = Σˆ − Σ∗ which represents the statistical error in estimating Σ∗ by the sample covariance Σˆ .

Because the CGLasso estimator depends on Σˆ , any error occured in the sample

covariance will be propagated through to our estimate Θˆ . As such, it is necessary to

keep this term as small as possible.

The first (deterministic) part of theorem proceeds by bounding the estimation ∆ =

Θˆ − Θ∗ subject to bounds on W while the second (stochastic) part establishes shows

that those bounds hold with high probability. Additionally, the quantity

R(A) = Θˆ −1 − (Θ∗)−1 + (Θ∗)−1A(Θ∗) 85 will be useful in the following, specifically when applied to ∆˜ = Θ˜ −Θ∗, to be defined

below.

Our proof of the deterministic claim proceeds in the following steps:

i. First, we show that the solution to the CGLasso problem is unique and coincides

with the solution to the “oracle” problem, under some conditions on W and

R(∆˜ );

ii. Next, we show that the boundedness condition on W holds by assumption;

iii. Then, we show that the boundedness condition on R(∆˜ ) holds as a consequence

of our bound on W ;

iv. Finally, we show that

The deterministic proof is as follows:

i. Uniqueness of the CGLasso solution follows immediately from Lemma 4.7 below.

Lemma 4.7 also guarantees us that the CGLasso solution Θˆ satisfies the first-

order condition

Σˆ − Θˆ −1 + λZˆ

ˆ −1 ˆ −1 ˆ for Z ∈ ∂∥Θ∥1,off-diag given by Zij = λ ([Θ ]ij − Σij). From Theorem 2.10, we know that in order to show that Θˆ has no false positives, we need to show that

c |Zij| < 1 for all (i, j) ∈ S . Because elements of Z can only have norm less than one corresponding to zeros in Θˆ , this will suffice to show that Θˆ has no false

positives.

It is difficult to characterize Θˆ or Z directly, so we instead explicitly construct a

different pair (Θ˜ , Z˜) which have the desired sparsity pattern and then show that 86 their sparsity matches that of (Θˆ , Z). Specifically, define the “oracle” problem to be the CGLasso problem as if we knew the sparsity pattern of Θ∗:

˜ ˆ ˜ ˜ arg min − log det Θ + Tr(ΣΘ) + λ∥Θ∥1,off-diag (4.2) ˜ ∈Cp×p Θ ≻0 ˜ ΘSc =0 where the added constraint says that Θ˜ is restricted to have zeros on the “off- support” elements of Θ∗; that is, Θ˜ has no false positives in its sparsity pattern by construction. We now seek to show Θˆ and Θ˜ have the same sparsity pattern, which would prove that Θˆ has no false positives.

To do so, define Z˜ as   ∠Θ˜ (i, j) ∈ S ˜ ij Zij =  ˜ −1 −ˆ  [Θ ]ij Σij ∈ c λ (i, j) S

By construction, Z˜ satisfies the optimality conditions for the CGLasso problem and therefore, by uniqueness, Z˜ = Z. It thus suffices to show that Z˜ satisfies strict dual feasibility (less than unit norm) on Sc, which implies Z satisfies the same, and hence that Θˆ has the correct sparsity pattern. We thus seek to prove

˜ c |Zij| < 1 for all (i, j) ∈ S

Lemma 4.8 implies that this holds so long as

αλ max{∥W ∥∞ ∞, ∥R(∆˜ )∥∞ ∞} ≤ , , 8 87

so we now need to prove those two bounds.

ii. We now seek to prove ∥W ∥∞,∞ ≤ αλ/8. This is assumed in the statement of the theorem.

Before proceeding, we note a useful triple of bounds, which follow from this:

  ∗ αλ 2κ ∗ (∥Σˆ − Σ ∥∞ ∞ + λ) ≤ 2κ ∗ + λ Λ , Λ 8   α = 2κΛ∗ λ 1 + ( 8 s ) 1 1 αλ ≤ min , , . ∗ 3 ∗ 3 3dκΣ 3dκΣ∗ κΛ 12dκΣ∗

Each of these bounds will appear in the next step.

˜ iii. Now, we seek to prove ∥R(∆)∥∞,∞ ≤ αλ/8. We first use Lemma 4.9 to get an

elementwise bound on ∆ by controlling ∥∆∥∞,∞. We have again assumed the necessary conditions in the first and second part of the triple bound, so wehave

˜ ˜ ∗ ∥∆∥∞,∞ = ∥Θ − Θ ∥∞,∞ ≤ 2κΛ∗ (∥W ∥∞,∞ + λ).

By the first of the triple bound above, this satisfies

˜ 1 ∥∆∥∞,∞ ≤ , 3dκΣ∗

so the conditions for Lemma 4.10 hold. Lemma 4.10 then gives

3 2 3 ∥R(∆˜ )∥∞ ∞ ≤ d∥∆˜ ∥ κ ∗ , 2 ∞,∞ Σ 88

˜ Combining this with the bound on ∥∆∥∞,∞ from above, we have

3 3 2 ∥R(∆˜ )∥∞ ∞ ≤ dκ ∗ ∥∆˜ ∥ , 2 Σ ∞,∞ 3 3 2 ≤ dκ ∗ [2κ ∗ (∥W ∥∞ ∞ + λ)] 2 Σ Λ , "s #2 3 αλ ≤ 3 dκΣ∗ 3 2 12dκΣ∗ αλ = 8

where the third part of the triple bound was used in the last inequality. Hence, the

conditions of Lemma 4.8 apply, so Θ˜ and Θˆ are both solutions to the CGLasso

problem (4.1). By uniqueness, this implies Θ˜ = Θˆ and hence they have the same

sparsity pattern, so Θˆ has no false positives.

iv. Finally, we turn to establishing no false negatives in Θˆ . By the argument above, ˜ ˆ ˜ ∗ we have Θ = Θ so it suffices to bound ∥Θ − Θ ∥∞,∞. From Lemma 4.9, we already have ˜ ∗ ∥Θ − Θ ∥∞,∞ ≤ 2κΛ∗ (∥W ∥∞,∞ + λ)

so we can apply Lemma 4.11 to find that there are no false negatives in Θ˜ = Θˆ .

We now turn to the stochastic claims: taking r σ log p λ = + δ α n

ˆ ∗ for some δ ∈ (0, 1), we need to show that ∥W ∥∞,∞ = ∥Σ − Σ ∥∞,∞ is small with 89

high probability. Specifically, we note that taking r 1 log p αλ t = + δ =⇒ tσ2 = 8σ n 8

in Lemma 4.3 implies that

    αλ 2 log p δn P ∥W ∥∞ ∞ ≥ ≤ 3p exp − − , 8 64σ2 256σ2

as desired.

4.3 Pseudo-Likelihood Approach: Sparsistency

In this section, we describe the pseudo-likelihood approach to graphical model se-

lection. In the Gaussian case, this approach relies on the insight that Xi|X−i–Xi conditioned on the other elements of X–is Gaussian and the coefficients obtained by

regression Xi on X−i have the same sparsity pattern as column i of Θ. From this in- sight, we can “stitch” together an estimate of the sparsity of Θ by performing p sparse

regressions of each column of X against the rest. We are not in general guaranteed

that the sparsity estimate formed this way yields a symmetric (undirected) graph so we can adopt a “symmetrization rule” which gives a non-zero value of Θ if both (and

rule) or just one (or rule) regression gives a non-zero value. The full approach is

given in Algorithm 2. As Theorem 4.2 below outlines, the neighborhood selection

approach also gives a consistent estimate of the sparsity pattern of Θ∗. Comparing

the incoherence conditions of Theorems 4.1 and 4.2, we see that they are distinct. In

general, neither approach dominates the other: see Meinshausen [204] and Ravikumar

et al. [21, Section 3.1.1] for further discussion of this point. 90

Algorithm 2 Neighborhood Selection for Complex Covariance Selection 1. Input: • Data Matrix: Z ∈ Cn×p • Penalty Parameter: λ ∈ R>0 2. For each j ∈ [p]: n (a) Set y˜ = Z·j ∈ C ˜ n×(p−1) (b) Set X = Z·(−j) ∈ C (c) Solve the complex lasso problem for (X˜ , y˜) at λ to obtain sparse regression coefficients βˆj,λ Nˆ { ∈ j,λ ̸ } (d) Return the neighborhood estimate (j) = k [p]: βk = 0 {Nˆ }p 3. Combine the neighbornood estimates (j) j=1 using the and or or rules to get an estimate of the graph Gˆ = (V, Eˆ) 4. Return the estimated edge set Eˆ

{ }n Theorem 4.2. Suppose zi i=1 are n independent samples from a proper complex multivariate Gaussian distribution with known mean 0 and unknown inverse covari- ance matrix Θ∗ = (Σ∗)−1 with maximum degree

X ∥ ∗∥ d = Θ ℓ0,ℓ∞,off-diag = max 1|Θij |̸=0. j∈[p] i≠ j

If Θ∗ satisfies the second α-incoherence condition with respect to the sparsity of Θ∗, meaning

∗ ∗ max ∥Θ Σ ∥1 ≤ 1 − α for all j ∈ ∗ eS SS e/supp(Θ·j )

∈ ∥ ∗ −1∥ ≤ ≥ for some α (0, 1) and that (ΣN (j)N (j)) ∞ b for all j for some b 1. Then the output of the neighborhood selection (pseudo-likelihood) procedure at r σ log p λ = + δ α n satisfies the following with probability at least 1 − 20p2e−nλ2α2/128σ2db 91

∗ ̸ ⇒ ˆ ̸ (a) No false positives: Θij = 0 = Θij = 0; and

ˆ ̸ ⇒ ∗ ̸ (b) No false negatives: Θij = 0 = Θij = 0 √ if all non-zero elements of Θ∗ are greater than λ(α + b d) and n > 64σ2bd log p/α2.

Unlike the statement of this result by Meinshausen and Bühlmann [17] or Wainwright

[54], we give explicit but not necessarily tight constants and dependence on n, p, d,

α, and b. At a high level, these dependencies are consistent with their results: in

particular, we require n = Ω(d log p) samples for consistent graph estimation, where d is the true number of edges in each nodewise regression and the log p term is an

unavoidable additional penalty incurred for not knowing the sparsity pattern a priori p [205, cf., Theorem 2]. The suggested penalty level λ = Ω(σ log p/n) is also consistent with the sparsistent penalty level for the CLasso given in Theorem 3.3.

Proof of Theorem 4.2, Following Theorem 11.12 of Wainwright [54]. Our proof of this

theorem builds upon the proof of Theorem 3.3, which establishes sparsistency condi-

tions for the individual node-wise CLasso regressions. Unlike that theorem, we have

to deal with randomness in both X˜ and y˜, rather than being able to “isolate” the

randomness to the observation noise ϵ. This proof will proceed by proving the two

major claims–no false positives and no false negatives–for a single vertex j with high

probability and then will achieve the overall result by taking a union bound over all

j ∈ [p]. Note that we are proving exact recovery at each j, so both the and and or

rules give the same correct edge set.

We introduce some notation that we will use throughout this proof: for fixed j, let

• S = N (j) be the (true) neighborhood of vertex j, that is, the set {i ∈ [p]:

∗ ̸ ̸ } c Θij = 0, i = j and S its complement; 92

∗ E H • Υ = [Z−jZ−j] be the population covariance of the non-j observations and ˆ −1 H Υ = n Z−jZ−j be its sample version;

ˆ ˆ ˆ c • ΥSS and ΥScS be the submatrices of Υ indexed by rows S or S , respectively, and columns S.

− ˜ ∗ • W = y˜ XΘ(−j)j be the n-vector corresponding to the “effective noise” in the neighborhood regression problem.

We also suppress the tilde notation (X˜ , y˜) for clarity, though each X in the following

should be understood in the context of nodewise regression, not the entire data matrix.

Following the proof of Theorem 3.3, we have   =Π ⊥ (X) z S}| {   H H −1 H  H −1 H  W z c = X c X (X X ) z + X c I − X (X X ) X  . S | S S {zS S S} S S S S S λn =µ1 | {z }

=µ2

As argued in that proof, in order to establish the “no false positives” condition, it suffices to show ∥zSc ∥∞ < 1. As before, we do so by establishing strong bounds on both ∥µ1∥∞ and ∥µ2∥∞ and combining them with the triangle inequality:

• Bound on ∥µ1∥∞: the difficulty in this bound comes from relating the sample quanity Υˆ = XH X/n to its population quantity Υ∗. Because our data follows

a multivariate Gaussian, we can relate the random quantities XS and XSc via the population quantity Υ∗ and an additional stochastic component Ψ:

H ∗ ∗ −1 H H XSc = ΥScS(ΥSS) XS + Ψ

where Ψ ∈ Cn×|Sc| is a zero-mean complex proper Gaussian matrix independent 93

of XS. Moreover, because Ψ is “residual” variance, standard properties of Schur complements imply

∗ − ∗ ∗ −1 ∗ ⪯ ∗ Cov(Ψ) = ΥScSc ΥScS(ΥSS) ΥSSc Υ

Because Υ∗ is a submatrix of Σ∗ obtained by marginalizing out j, we have

max diag(Υ∗) ≤ max diag(Σ∗) = σ2 so the elements of Ψ have variance at most σ2. Substituting this decomposition

H of XSc into µ1, we get

H H −1 µ1 = XSc XS(XS XS) zS   ∗ ∗ −1 H H H −1 = ΥScS(ΥSS) XS + Ψ XS(XS XS) zS

∗ ∗ −1 H H −1 = Υ c (Υ ) zS + Ψ XS(X XS) zS S S SS S ΨH X ∥ ∥ ∗ ∗ −1 √ √ S ˆ −1 µ1 ∞ = ΥScS(ΥSS) zS + (ΥSS) zS n n ∞ ΨH X ≤ ∗ ∗ −1 √ √ S ˆ −1 ΥScS(ΥSS) zS ∞ + (ΥSS) zS n n ∞

H Ψ√ X√ S ˆ −1 ≤ (1 − α) + (ΥSS) zS n n | {z } =V ∞ with the final step following from our incoherence assumption.

Turning to now bound V , we have that each element of V can be expressed as a √ ∥ ˆ −1 ∥ mean-zero Gaussian with standard deviation bounded by XS(ΥSS) zS/ n σmax recalling that Ψ and XS are independent by construction. In order to bound 94 this variance, we note that

√ 2 √ 2 ˆ −1 ≤ ˆ −1 ∥ ∥2 XS(ΥSS) zS/ n XS(ΥSS) / n zS 2 σmax σmax ≤ ∥ ˆ −1∥ ∥ ∥2 (ΥSS) σmax zS 2

≤ ∥ ˆ −1∥ d (ΥSS) σmax

∥·∥ → where the inequalities follow from the characterization of σmax as the ℓ2 ℓ2- √ 2 operator norm; the fact that σmax(XS/ n) = λmax(ΥSS); and the construc- tion of z as a vector of at most d unit norm elements respectively. To bound

∥ ˆ −1∥ (ΥSS) σmax , we use Corollary 4.6 to find the variance term is bounded above by db with probability at least 1−e−n/4. We can then apply Part (b) of Lemma

3.9 to V which has Gaussian elements with variance bounded by σ2mb/n to find   nt2 n P [|V | ≥ t] ≤ 4 exp − − . i 2db 4

Taking a union bound over the |Sc| elements of V , we have

  2 c nt n −nt2/2σ2db P [∥V ∥∞ ≥ t] ≤ 4|S | exp − − ≤ 4pe 2σ2db 4

Combining with the above, this gives

  nt2 P [∥µ ∥∞ ≥ (1 − α) + t] ≥ 4p exp − 1 2σ2db

We want to take t to be at most α/4, so we set r 2σ2db log p α2 t = + n 32 95

and note t < α/4 so long as n > 64σ2db log p/α2. With this bound, we now

have     3 nα2 P ∥µ ∥∞ ≥ 1 − α ≥ 4 exp − 1 4 64σ2db

so long as n > 64σ2db log p/α2.

• Bound on ∥µ2∥∞: We now turn to bounding ∥µ2∥∞, where

  H W µ = X c Π ⊥ (X) 2 S S λn

H Decomposing XSc as before and noting that XS is in the null space of ΠS⊥ by construction, this simplifies to

  H W µ = Ψ Π ⊥ 2 S λn

Note that ∥ΠS⊥ W ∥2 = ∥W ∥2 where W are n independent Gaussian random √ 2 variables with variance at most σ . Defining the event E = {∥W ∥2 ≤ 2σ n}, we see that P [E ] ≥ 1 − 4e−n, following from standard Gaussian tail bounds.

Hence, we have

c c P [∥µ2∥∞ ≥ t] = P [∥µ2∥∞ ≥ t|E ]P [E ] + P [∥µ2∥∞ ≥ t|E ]P [E ]

−n ≤ P [∥µ2∥∞ ≥ t|E ] ∗ 1 + 1 ∗ 4e

c Conditional on E , we can bound ∥µ2∥∞ as the maximum of |S | proper Gaussian random variables each with variance at most 4σ2/λ2n so Part (b) of Lemma 3.9

gives

c −nλ2t2/8σ2 −λt2/8σ2 P [∥µ2∥∞ ≥ t|E ] ≤ 4|S |e ≤ 4pe 96

Hence,

−nλ2t2/8σ2 −n P [∥µ2∥∞ ≥ t] ≤ 4pe + 4e

Taking t = α/4, we have

−nλ2α2/128σ2 −n P [∥µ2∥∞ ≥ α/4] ≤ 4pe + 4e

Putting these two bounds together, we have

P [∥µ1∥∞ + ∥µ2∥∞ ≥ 1] = P [∥µ1∥∞ + ∥µ2∥∞ ≥ (1 − 3α/4) + α/4]   h i 3 α ≤ P ∥µ1∥∞ ≥ 1 − α + P ∥µ2∥∞ ≥  4  4 nα2 nλ2α2 ≤ 4 exp − + 4p exp − + 4e−n 64σ2db 128σ2 showing that we have no false negatives with high probability, as desired.

From here, we turn to the “no false negatives” part of the theorem: as before, our proof of Theorem 3.3 implies that

∥ ˆ − ∗ ∥ ≤ ∥ ˆ −1 H ∥ ∥ ˆ −1∥ θS θS ∞ (ΥSS) XS W /n ∞ + λ (ΥSS) ℓ∞→ℓ∞

∗ ˆ where θ = Θ(−j)j is the vector of “effective regression coefficients” and θ is the estimate thereof produced by the CLasso procedure.

∥ ∗ −1∥ We first attempt to bound the second term. While we have a bound on (ΥSS) ℓ∞→ℓ∞ ˆ by assumption, we do not have concentration bounds for the sample estimate ΥSS in this norm. As before, it is useful to move to the spectral norm temporarily and to show

concentration there. In particular, we can bound the matrix ℓ∞ → ℓ∞ norm, equiva- 97

lent to the ℓ∞, ℓ1-mixed elementwise norm, in terms of more well-behaved quantities by noting that v u Xk √ uXk √ ∥ ∥ | | ≤ t | |2 ≤ ∥ ∥ A ℓ∞→ℓ∞ = max Aij k max Aij k A λmax i∈[k] i∈[k] j=1 j=1

× ∥ · ∥ for any k k matrix, where λmax is the leading eigenvalue norm (also called the

ℓ2 → ℓ2 or spectral norm). Applying this bound above, we find

√ ∥ ˆ − ∗ ∥ ≤ ∥ ˆ −1 H ∥ ∥ ˆ −1∥ θS θS ∞ (ΥSS) XS W /n ∞ + λ d (ΥSS) λmax

To bound the second term, we now turn to Corollary 4.6 to find

√ ∥ˆ − ∗ ∥ ≤ ∥ ˆ −1 H ∥ ∥ ∗ −1∥ θS θS ∞ (ΥSS) XS W /n ∞ + λ d (ΥSS) λmax

− −n/4 ∥ · ∥ ≤ ∥ · ∥ with probability at least 1 e . Noting that λmax ℓ∞→ℓ∞ this implies

√ ∥ˆ − ∗ ∥ ≤ ∥ ˆ −1 H ∥ ∥ ∗ −1∥ θS θS ∞ (ΥSS) XS W /n ∞ + λ d (ΥSS) ℓ∞→ℓ∞ √ ≤ ∥ ˆ −1 H ∥ (ΥSS) XS W /n ∞ + λb d with probability at least 1 − e−n/4.

To bound the first term, recall our analysis of

H Ψ√ X√ S ˆ −1 V = (ΥSS) zS n n ∞

in the “no false positives” analysis where Ψ is a matrix of Gaussian random variables

2 with variance at most σ and independent of XS. As shown above, the zS term has 98

no effect, so it can be ignored and the same analysis can be performed heretoshow

h i ∥ ˆ −1 H ∥ ≥ ≤ −nt2/2σ2db P (ΥSS) XS W /n ∞ t 4pe .

Putting these together, we have

√ ∥ ˆ − ∗ ∥ ≤ θS θS ∞ t + λb d with probability at least 1 − e−n/4 − 4pe−nt2/2σ2db. Setting t = αλ, we find that √ ∥ ˆ − ∗ ∥ ≤ − −n/4 − −nλ2α2/2σ2db θS θS ∞ λ(α + b d) with probability at least 1 e 4pe . For ∗ this to imply no false negatives, we need the left hand side to be less than ∥θ ∥∞, which we assume temporarily.

Hence, the pseudo-likelihood procedure given by Algorithm 2 is sparsistent for the jth n o column of Θ with probability at least 1−4p exp {−nλ2α2/2σ2db}−4 exp − nα2 − n o 64σ2db − nλ2α2 − −n 4p exp 128σ2 5e . For this procedure to be sparsistent for the entire graph, we apply a union bound over all p vertices to get sparsistency with probability at n o n o − 2 {− 2 2 2 } − − nα2 − 2 − nλ2α2 − −n least 1 4p exp nλ α /2σ db 4p exp 64σ2db 4p exp 128σ2 5pe as desired.

Finally, we set   σ2 log p λ2 = + δ α2 n for arbitrary δ ∈ R≥0 to find that Algorithm 2 with fixed λ is sparsistent for the entire graph Θ∗ with probability at least

        log p + δ nα2 log p + δ 1−4p2 exp − −4p exp − −4p2 exp − −5pe−n 2db 64σ2db 128 99

if n > 64σ2db log p/α2 and all non-zero elements of Θ∗ have norm greater than   p α2+αdb σ log p/n + δ. The simpler bound in the statement of the theorem is a loose bound for each of the terms in this probability.

4.4 Algorithms

Many algorithms have been proposed for solving the real valued GLasso problem.

Banerjee et al. proposed two algorithms for solving this problem [19, 180]: a block

coordinate descent approach on the dual problem and a smoothing algorithm based

on the work of Nesterov [206]. Friedman et al. [20] extended this first approach, com-

bining it with their previous work on efficient solvers for the Lasso problem [159,

160]. Their method was further refined by Witten et al. [207] and by Mazumder

and Hastie [208, 209], who addressed some non-convergence problems of the original

approach and suggested screening rules to further improve performance. Hsieh et al.

[210] proposed a highly efficient proximal Newton approach termed Quic, which they

later extended to be able to handle tens of thousands of variables [211]. While effi-

cient, Quic is difficult to implement efficiently. A far simpler, and similarly efficient

approach, was proposed by Scheinberg et al. [212], who advocated the use of alter-

nating direction methods for the GLasso problem. Algorithm 3 gives a variant of

their approach for the CGLasso problem; see also the related discussion in Section

6.5 of the manuscript by Boyd et al. [76]. Further performance improvements can be

obtained by using “active set” and “warm start” strategies, which we do not describe

here. This algorithm has been implemented in the author’s classo package, available

at https://github.com/michaelweylandt/classo. Convergence follows from the

discussion of the ADMM in Section 2.3. 100

For the pseudo-likelihood approach, the coordinate descent algorithm described in

Section 3.4 may be used.

4.5 Numerical Experiments

In this section, I briefly demonstrate the predictions of Theorem 4.1 in simulation. I generate a graph structure from an Erdős-Rényi graph with a fixed edge probability and then add a small multiple of an identity matrix to the adjacency graph to create a positive-definite precision matrix, using the graph generation mechanism implemented in the huge package for R [213]. The ambient dimensionality p, sample size n, and edge probability are allowed to vary, as shown in Figure 4.5.1. λ was chosen to give oracle sparsity: that is, the number of true edges was considered known, but their identity was estimated. As can be seen in Figure 4.5.1, the probability of exact model recovery decays quickly in both the number of edges and in p, while true positive rates remain high and false positive rates remain low: this suggests that the difficulty of the covariance selection problem is localized in a few “hard” edges rather than exhibiting the hard phase transition observed for the CLasso. 101

Algorithm 3 Complex ADMM for the Complex Graphical Lasso Problem (4.1) 1. Input: • Data Matrix: X ∈ Cn×p • Penalty Parameter: λ ∈ R>0 • Stopping Tolerance: ϵ ∈ R>0 2. Precompute: • Sample Covariance: Σˆ = (X − X)H (X − X)/n 3. Initialize: (0) • Estimate: Θ = Ip×p • Copy Variable: Z(0) = Θ(0) (0) • Dual Variable: U = 0p×p • k = 0 4. Repeat until ∥Θ(k) − Θ(k−1)∥ ≤ ϵ: • Update Primal Variable: (i) Eigendecomposition:

ΓΛΓH = Eigen(Z(k) − U (k) − Σˆ )

(ii) Eigenvalue shrinkage:   1 √ Λ′ = Λ + Λ2 + 4 2 (iii) Update: Θ(k+1) = ΓΛ′ΓH • Update Copy Variable:

Z(k+1) = Soff-diag(Θ(k+1) + U (k)) (λ ∠ (k+1) (k) | (k+1) (k)| − ̸ (Θij + Uij )( Θij + Uij λ)+ i = j = (k+1) (k) Θij + Uij i = j

• Update Dual Variable:

U (k+1) = Θ(k+1) − Z(k+1) − U (k)

• k ← k + 1 5. Return: (k−1) ∈ Cp×p • Estimated inverse covariance Θ ≻0 102

p = 10 p = 20 p = 50 p = 100 p = 200 p = 500 1.00 Recovery Probability 0.75

0.50

0.25

0.00

1.00

0.75 True Positive Rate

0.50 Accuracy

0.25

0.00

1.00

0.75 False Positive Rate

0.50

0.25

0.00 1 2 3 4 5 5 10 15 20 25 50 75 100125 100200300400500 500 1000150020002500500075001000012500 Number of Edges

Estimate Likelihood Pseudo−Likelihood Number of Samples 50 100 200 500

Figure 4.5.1 : Model Selection Consistency of the Complex Graphical Lasso (4.1). When p ≪ n, the CGLasso has both high true positive rate and low false pos- itive rate, leading to high rates of correct model selection for both the likelihood (CGLasso) and pseudo-likelihood (neighborhood selection) procedures. As the num- ber of edges to be estimated increases, model selection performance decays quickly, especially for larger values of p, even as both true positive and false negative rates re- main high. For this random graph model, neither estimator is able to perform perfect graph recovery for n ≈ p and high numbers of edges. 103

Appendix to Chapter 4

4.A Properties of the Complex Multivariate Normal

In this section, we review key properties of the complex multivariate normal (Gaus-

sian) distribution. Like its real counterpart, the complex Gaussian can be defined as

the multivariate distribution all of whose univariate projections are univariate Gaus-

sian [214, Chapter 3]. This definition points at the most surprising property ofthe complex Gaussian: it is a three parameter distribution, characterized by its mean

E[Z], covariance E[ZZH ], and pseudo-covariance E[ZZT ]. The pseudo-covariance

generalizes the idea of correlation between the real and imaginary parts of the uni- variate complex Gaussian.

The density of a complex Gaussian z ∼ CN (µ, Σ, Γ) with mean µ, covariance Σ, and

pseudo-covariance Γ is given by [124, Section 2.3]:

    −1      H ΣΓ exp − 1 − −   − −  2 z µ z µ z µ z µ   Γ Σ  v p(z) = u   u u u ΣΓ πntdet   Γ Σ

As can be seen from this form, there are several constraints on the matrices Σ and

Γ:

• Σ is Hermitian and positive-definite

• Γ is symmetric

• Σ − ΓH Σ−1Γ must also be positive-definite 104

4.A.1 Proper Complex Random Variables

An important special case of the general complex normal distribution are so-called

proper complex Gaussians: those distributions whose phase is uniformly distributed

or, equivalently, whose distribution is unchanged under rotation. This restriction

arises naturally in the consideration of stationary processes, for which we assume

Law[Z] = Law[eiθZ] for all θ ∈ R.

A proper complex Gaussian can be easily shown to have zero pseudo-covariance:

E[ZZT ] = E[(eiθZ)(eiθZ)T ] = e2iθE[ZZT ] which can only hold for all θ if E[ZZT ] = 0. The density of the circularly symmetric

complex normal is much simpler than the general case and is given by

 exp −zH Σ−1z p(z) = πn det Σ

Because of its similarities with the real Gaussian, the proper complex distribution

has been the primary object of study in the statistical literature [215–220]. For many years, the circularity assumption was ubiquitous and unstated in signal processing

due to the connection with stationarity, until van den Bos [70] and Picinbono [71]

highlighted the importance of the general case. Adali et al. [221] review the various

connections between improper signals and the general case of the complex normal, with applications in signal processing. In this thesis, attention is restricted to the

case of proper Z unless noted otherwise. 105

4.A.2 Graphical Models for Complex Random Variables

Like its real counterpart, the proper complex multivariate Gaussian gives rise to a

probabilistic graphical model (PGM) under certain assumptions on the inverse covari-

ance (precision) matrix Θ = Σ−1. Specifically, we note the density of a random vector z can be expressed as:

Xp Xp H 2 log p(z) = −z Θz = − Θijzizj − Θii|zi| i,j=1 i=1 i≠ j

Applying the Hammersley-Clifford theorem [176, 177] to this density, we see that it

corresponds to a graphical model G = (V, E) whose (undirected) edges correspond to

the non-zero off-diagonal elements of Θ. Specifically, the standard pairwise Markov

property of undirected graphical models applies to Z:

Zi ⊥⊥ Zj|Z−ij ⇔ Θij = 0

as well as its local and global extensions [178, Section 3.2.1]. This suggests the sparse

estimation task discussed above.

We also note a key connection between graphical models and conditional distributions:

if Z is a proper complex Gaussian random vector, then the conditional distribution

of Z conditional on one element is given by

Z−i|Zi = z ∼ CN (Σ−ii/Σiiz, Σ(−i)(−i) − Σ−iiΣi − i/Σii)

Focusing on the conditional mean, we see that the conditional mean of Z−i condi- tioned on a value Zi depends only through the non-zero elements of Σ−ii, i.e., those 106

elements of the ith column of Σ which are non-zero. This motivates the neighborhood

selection approach discussed in Section 4.3.

4.B Concentration of Complex Sample Covariance Matrices

As discussed in Section 4.2, the CGLasso performs well when the sample covariance

Σˆ is ‘close’ to the true covariance Σ∗. The following lemma allows us to guarantee that

these matrices are close in the element-wise maximum norm ∥A∥∞,∞ = maxi,j |Aij| with high-probability for Gaussian samples:

{ }n Lemma 4.3. Let zi i=1 be an iid sample from a proper p-variate complex Gaussian with known mean zero and unknown covariance Σ∗. The sample covariance Σˆ = P −1 n H n i=1 zizi satisfies the following tail bound

h i ˆ ∗ 2 2 −nt2/4 P ∥Σ − Σ ∥∞,∞ ≥ tσ ≤ 3p e

2 ∗ where σ = maxi Σii

We next state a useful χ2-type concentration bound: see Lemma 1 of Laurent and

Massart [222] for related discussion.

Lemma 4.4. Let Zi be independent proper complex Gaussian random variables with variance σ2. Then the following tail bound holds:

 P  n | |2 i=1 Zi −nt2/4 P − 1 ≥ t ≤ 2e n

for all t ∈ (0, 1).

√ ˜ ˜ ˜ ˜ ˜ Proof. Let Zi = Zi ∗ 2/σ and Zi = Xi + iYi for all i. The random variables Xi and 107

˜ Yi are all real normal with variance 1 and independent by propriety. Applying the tail bound given in Example 2.11 of Wainwright [54] to the collection of 2n standard

normal variables given by    ˜ Xi 1 ≤ i ≤ n Vi =   ˜ Yi−n n < i ≤ 2n we obtain:

" P # 2n 2 2 V 2e−nt /4 ≥ P k=1 i − 1 ≥ t 2n " # P n X˜ 2 + Y˜ 2 = P k=1 i i − 1 ≥ t 2n " # P n |Z˜ |2 = P k=1 i − 1 ≥ t 2n  P  n |Z |2 k=1 i − ≥ = P 2 1 t  P nσ  n | |2 k=1 Zi 2 2 = P − σ ≥ tσ n

P 2 | −1 n | |2 − 2| ≥ as desired. Setting δ = tσ , we get the equivalent bound P [ n i=1 Zi σ δ] ≤ 2e−nδ2/4σ4 which is consistent with the Var(S2) ∝ σ4 rate observed for variance estimation from U-statistics.

We are now ready to prove our main concentration result:

Proof of Lemma 4.3, following Lemma 6.26 of Wainwright [54]. While it may be pos-

sible to obtain exact results by using tail bounds for the (complex) Wishart distri-

bution, this approach is problematic in high-dimensions as the Wishart is typically

defined only for n ≥ p, which we do not assume here (though see Uhlig [223], Srivas- 108

tava [224] and Yu et al. [225] for some exceptions). Instead, we establish element-wise

bounds and use a union bound to get the overall ∥ · ∥∞,∞ bound.

We begin by establishing a concentration bound for the diagonal elements of Σˆ : con- P ˆ −1 n | |2 sider each diagonal element Σii = n j=1 zij . Lemma 4.4 gives the bound

h i 2 ˆ − ∗ ≥ 2 ≤ −nt /4 P Σii Σii tσ 2e

Now we turn to the off-diagonal elements of ∆ = Σˆ − Σ∗. Some algebra yields:

2 Xn 2∆ = z z − 2Σ∗ jk n ij ij jk i=1 1 Xn = |z + z |2 − (Σ∗ + Σ∗ − 2Σ∗ ) −( ∆ + ∆ ) n ij ik jj kk jk |{z}jj |{z}kk i=1 | {z } B C A

We break this into three terms which we bound separately. To bound the first term

(A) we consider the random variable Sjk = Zj +Zk which has mean zero and variance at most 4σ2 by Lemma 3.6. Applying the same χ2-type tail bound from Lemma 4.4, we have the bound

P [|A| ≥ 4tσ2] ≤ 2e−nt2/4

We can bound terms B and C directly using the bounds already established:

P [|B| > tσ2] = P [|C| > tσ2] ≤ 2e−nt2/4

Taking a triangle inequality, we have

|A| + |B| + |C| |∆ | ≤ jk 2 109

A union bound then yields

2 −nt2/4 P [|∆ij| ≥ 6tσ ] ≤ 6e

or equivalently

2 −nt2/144 P [|∆ij| ≥ tσ ] ≤ 6e

Finally, we take a union bound over the p diagonal elements and the p(p − 1)/2 off

diagonal elements to get the desired bound:

2 −nt2/4 −nt2/144 P [∥∆∥∞,∞ ≥ tσ ] ≤ 2pe + 3p(p − 1)e

≤ 2pe−nt2/4 + 3p(p − 1)e−nt2/4

= e−nt2/4(2p + 3p2 − 3p)

≤ 3p2e−nt2/4.

We do not need it here, but the assumption t ∈ (0, 1) can be removed, at the cost of

decay at rate min(t, t2) by using a Bernstein-type bound instead of Lemma 4.4.

Additionally, we give a deviation bound on the leading singular value of a random matrix generated from a proper complex Gaussian ensemble with covariance Σ:

Lemma 4.5. Let X ∈ Cn×p be a random matrix drawn from a proper Σ-complex

Gaussian ensemble: that is, each row of X is an independent sample from a proper

p-dimensional Gaussian with mean zero and covariance Σ. Then, the following bound holds for all δ > 0, where σmax(X) is the maximum singular value of X and λmax(Σ) 110 is the maximum eigenvalue of Σ: " r # √ σ (X) Tr(Σ) 2 P max√ ≥ λ ( Σ)(1 + δ) + ≤ e−nδ /4. n max n

A similar bound applies for the trailing singular value as well: " r # √ σ (X) Tr(Σ) 2 P min√ ≤ λ ( Σ)(1 − δ) − ≤ e−nδ /4. n min n

Proof. Consider the matrix   ℜ(X) −ℑ(X) Y =   ∈ R2n×2p ℑ(X) ℜ(X)

and note that σmax(X) = σmax(Y ). To see this, note that if z maximizes ∥Xz∥2 then

∥Y w∥2 is maximized by w = (ℜ(z), ℑ(z)). Now apply Theorem 6.1 of Wainwright [54] to Y to get " r # p σmax(Y ) Tr(ΣY ) −nδ2/2 P √ ≥ λmax( ΣY )(1 + δ) + ≤ e 2n 2n

where ΣY is the covariance of Y given by   1 ℜ(Σ) −ℑ(Σ) Σ =   . Y 2 ℑ(Σ) ℜ(Σ)

To simplify, note that Tr(ΣY ) = Tr(ℜ(Σ)) = Tr(Σ) because Σ is Hermitian and that 111 √ √ λmax( ΣY ) = λmax( Σ) by construction. This gives " r # √ σmax(X) Tr(Σ) −nδ2/2 P √ ≥ λmax( Σ)(1 + δ) + ≤ e 2n 2n

Finally, changing variables n′ = 2n yields the desired result. The lower tail bound on

σmin(X) can be obtained analogously.

This chapter focuses on inverse covariance estimation so the following corollary will

be useful:

Corollary 4.6. Let X ∈ Cn×p be a random matrix drawn from a proper Σ-complex

ˆ H ∈ Cp×p Gaussian ensemble and let Σ = X /X/n ⪰0 be the associated sample covariance. Then the following bounds hold for all δ ≥ 0:

h i ˆ 2 −nδ2/4 P λmin(Σ) ≤ λmin(Σ)(1 + δ) ≤ e h i 2 ˆ −1 −1 −nδ2/4 P (1 + δ) λmax(Σ ) ≥ λmax(Σ ) ≤ e so long as all relevant inverses are defined. As a consequence, we have:

h i ˆ −1 −1 −n/4 P λmax(Σ ) ≥ λmax(Σ ) ≤ e

√ 2 H ˆ Proof. Recall that σmin(X/ n) = λmin(X X/n) = λmin(Σ) and use the tail bound 112

from the previous lemma to find: " r # √ √ Tr(Σ) 2 P σ (X/ n) ≤ λ ( Σ)(1 − δ) − ≤ e−nδ /4 min min n h √ √ i −nδ2/4 P σmin(X/ n) ≤ λmin( Σ)(1 + δ) ≤ e h √ √ i 2 2 2 −nδ2/4 P σmin(X/ n) ≤ λmin( Σ) (1 + δ) ≤ e h i ˆ 2 −nδ2/4 P λmin(Σ) ≤ λmin(Σ)(1 + δ) ≤ e

−1 Noting that λmin(A) = 1/λmax(A ) and rearranging gives the second bound. The final bound follows by setting δ = 0.

4.C Lemmata Used in the Proof of Theorem 4.1

In this section, we collect several technical Lemmata used in the proof of Theorem

4.1. These are straightforward extensions of results due to Ravikumar et al. [21], so we only give proofs where the complex domain changes the analysis.

ˆ Lemma 4.7 (Lemma 3 in Ravikumar et al. [21]). For λ ∈ R>0 and Σ with strictly positive diagonal elements, the CGLasso problem (4.1) has a unique solution Θˆ satisfying

ˆ ˆ −1 ˆ 0 ∈ Σ − Θ + λ∂∥Θ∥1,off-diag.

The proof is identical to that appearing in Appendix A of Ravikumar et al. [21].

Lemma 4.8 (Lemma 4 in Ravikumar et al. [21]). Suppose that

αλ max{∥W ∥∞ ∞, ∥R(∆˜ )∥∞ ∞} ≤ . , , 8 113

Then the dual variable Z given by    ˜ ∗ ̸ Zij Θij = 0 Zij =   −1 ˜ −1 − ˆ ∗ λ ([Θ ]ij Σij)Θij = 0

| | ∗ satisfies the strict dual feasibility condition Zij < 1 for all (i, j) such that Θij = 0, ˜ ˜ where ∥·∥∞,∞ is the elementwise maximum of a matrix, and (Θ, Z) are a primal-dual solution pair to the “oracle” problem (4.2).

The proof is identical to that appearing in Section 4.2.1 of Ravikumar et al. [21].

Lemma 4.9 (Lemma 6 in Ravikumar et al. [21]). Suppose that

  1 1 r = 2κΛ∗ (∥W ∥∞,∞ + λ) ≤ min , . ∗ 3 ∗ 3dκΣ 3dκΣ∗ κΛ

Then ˜ ˜ ∗ ∥∆∥∞,∞ = ∥Θ − Θ ∥∞,∞ ≤ r where Θ˜ is the solution to the oracle problem (4.2).

The proof is the same as that appearing in Appendix C of Ravikumar et al. [21]: Cp×p Brouwer’s fixed point theorem, a key element of that proof, may be applied to ⪰0 since we consider it as a Euclidean (real inner product) space with the inner product

⟨ ⟩ 1 H T A, B = 2 Tr(A B + A B) discussed in Chapter 2.

Lemma 4.10 (Lemma 5 in Ravikumar et al. [21]). Suppose that ∆˜ satisfies the elementwise bound ˜ 1 ∥∆∥∞,∞ ≤ . 3dκΣ∗ 114

Then the remainder matrix R(∆˜ ) has elementwise maximum norm bounded by

3 2 3 ∥R(∆˜ )∥∞ ∞ ≤ p∥∆∥ κ ∗ . , 2 ∞,∞ Σ

The proof is the same as that appearing in Appendix B of Ravikumar et al. [21],

except with the Hermitian (conjugate transpose) replacing the standard transpose in

˜ T H the unit vector analysis used to bound ∥R(∆)∥∞,∞ and J becomes J . Note also ∥ · ∥ ∥ · ∥ that I use the notations ∞,∞ and ℓ∞→ℓ∞ to denote the elementwise maximum

and the ℓ∞ to ℓ∞ operator norm, respectively, while Ravikumar et al. [21] use ∥ · ∥∞

and ~·~∞, which may be confused easily.

Lemma 4.11 (Lemma 7 in Ravikumar et al. [21]). Suppose that

  1 1 2κΛ∗ (∥W ∥∞,∞ + λ) ≤ min , ∗ 3 ∗ 3dκΣ 3dκΣ∗ κΛ and that the non-zero elements of the precision matrix are bounded below as

∗ θmin = min |Θ | ≥ 4κΛ∗ (∥W ∥∞,∞ + λ). ∗ ̸ ij (i,j):Θij =0

Then Θ˜ has the same sparsity pattern as Θ∗, where Θ˜ is the solution to the oracle

problem (4.2).

The proof is similar to that given in Section 4.2.4 of Ravikumar et al. [21]. To adapt

it for the complex domain, we simply note that we cannot establish sign consistency

in the complex domain, only sparsity, which follows from the observation that

|aˆ − a∗| > e =⇒ |aˆ| > 0 115 if |a∗| > |e|, applied elementwise to Θ˜ .

Further discussion of the results of this and the preceeding two chapters can be found in Chapter 8. 116

Part II

Additional Advances in

Multivariate Methods and Models

for Highly Structured Data 117

Chapter 5

Splitting Methods for Convex Bi-Clustering and Co-Clustering

The following was originally published as

M. Weylandt. “Splitting Methods for Convex Bi-Clustering and Co-

Clustering.” DSW 2019: Proceedings of the IEEE Data Science Workshop

2019, pp.237-244. 2019. doi: 10.1109/DSW.2019.8755599 ArXiv: 1901.06075 where it was selected for an Oral Presentation and a Student Travel Award. The Gen-

eralized ADMM proposed in this chapter is impemented in the author’s clustRviz

package, available at https://github.com/DataSlingers/clustRviz/.

Abstract

Co-Clustering, the problem of simultaneously identifying clusters across multiple as-

pects of a data set, is a natural generalization of clustering to higher-order struc-

tured data. Recent convex formulations of bi-clustering and tensor co-clustering, which shrink estimated centroids together using a convex fusion penalty, allow for

global optimality guarantees and precise theoretical analysis, but their computa-

tional properties have been less well studied. In this chapter, I present three efficient

operator-splitting methods for the convex co-clustering problem: i) a standard two-

block ADMM; ii) a Generalized ADMM which avoids an expensive tensor Sylvester

equation in the primal update; and iii) a three-block ADMM based on the operator 118

splitting scheme of Davis and Yin. Theoretical complexity analysis suggests, and

experimental evidence confirms, that the Generalized ADMM is far more efficient for

large problems.

5.1 Introduction

5.1.1 Convex Clustering, Bi-Clustering, and Co-Clustering

In recent years, there as been a surge of interest in the expression of classical statistical

problems as penalized estimation problems. These convex re-formulations have many

noteworthy advantages over classical methods, including improved statistical perfor-

mance, analytical tractability, and efficient and scalable computation. In particular,

a convex formulation of clustering [14–16] has sparked much recent research, as it pro- vides theoretical and computational guarantees not available for classical clustering methods [226–228]. Convex clustering combines a squared Frobenius norm loss term, which encourages the estimated centroids to remain near the original data, with a

convex fusion penalty, typically the ℓq-norm of the row-wise differences, which shrinks the estimated centroids together, inducing clustering behavior in the solution:

1 Xn ˆ ∥ − ∥2 ∥ − ∥ U = arg min X U F + λ wij Ui· Uj· q (5.1) ∈Rn×p 2 U i,j=1 i

ˆ ˆ Observations i and j are assigned to the same cluster if Ui· = Uj·.

For highly structured data, it is often useful to cluster both the rows and the columns of the data. For example, in genomics one may wish to simultaneously estimate dis- ease subtypes (row-wise clusters of patients) while simultaneously identifying genetic pathways (column-wise clusters of genes). While these clusterings can be performed 119

independently, it is often more efficient to jointly estimate the two clusterings, a

problem known as bi-clustering. Building on (5.1), Chi et al. [99] propose a convex

formulation of bi-clustering, given by the following optimization problem:

1 ˆ ∥ − ∥2 U = arg min X U F ∈Rn×p 2 U  n p (5.2) X X  + λ  wij∥Ui· − Uj·∥q + w˜kl∥U·k − U·l∥q i,j=1 k,l=1 i

(n) (p) ∈ R ∈ R 2 ∈ R 2 for fixed λ ≥0, w ≥0 , and w˜ ≥0 . As with standard convex clustering, the fusion penalties fuse elements of U together, but here the rows and columns are both fused together, resulting in a characteristic checkerboard pattern in Uˆ .

Extending this, Chi et al. [98] propose a convex formulation of co-clustering, the problem of jointly clustering along each mode of a tensor (data array). Their estimator is defined by the following optimization problem, where both X and U are order J

j/i th tensors of dimension n1 × n2 × · · · × nJ and X denotes the i slice of X along the jth mode: 1 XJ XnJ Uˆ ∥X − U∥2 j ∥U j/k − U j/l∥ = arg min F + λ wk,l q (5.3) U 2 j=1 k,l=1 k

(nj ) ∈ R j ∈ R 2 ∥X ∥ for fixed λ ≥0 and w ≥0 , j = 1,...,J, and where q denotes the ℓq- norm of the vectorization of X . While many algorithms have been proposed to solve

Problem (5.1), there has been relatively little work on efficient algorithms to solve

Problems (5.2) and (5.3). Operator splitting methods have been shown to be among

the most efficient algorithms for convex clustering [229] and, as we will show below,

they are highly efficient for bi-clustering and co-clustering as well.

5.1.2 Operator Splitting Methods

Many problems in statistical learning, compressive sensing, and sparse coding can be

cast in a “loss + penalty” form, where a typically smooth loss function, measuring

the in-sample accuracy, is combined with a structured penalty function (regularizer) which induces a simplified structure in the resulting estimate to improve performance.

The popularity of this two-term structure has spurred a resurgence of interest in so-

called operator splitting methods, which break the problem into simpler subproblems which can be solved efficiently [230]. Among the most popular of these methods is

the Alternating Direction Method of Multipliers (ADMM), which is typically quite

simple to implement, has attractive convergence properties, is easily parallelizable

and distributed, and has been extended to stochastic and accelerated variants [76,

101, 143].

The ADMM can be used to solve problems of the form

\[
\arg\min_{(u, v) \in \mathcal{H}_1 \times \mathcal{H}_2} f(u) + g(v) \quad \text{subject to} \quad L_1u + L_2v = b \tag{5.4}
\]

where u, v take values in arbitrary (real, finite-dimensional) Hilbert spaces H1, H2,

Li : Hi → H∗, i = 1, 2 are linear operators from Hi to a common range space H∗,

b is a fixed element of H∗, and f, g are closed, proper, and convex functions from

$\mathcal{H}_1$ and $\mathcal{H}_2$ to $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. The ADMM proceeds by Gauss-Seidel updates of the augmented Lagrangian

\[
\mathcal{L}_\rho(u, v, z) = f(u) + g(v) + \langle z, L_1u + L_2v - b\rangle_{\mathcal{H}_*} + \frac{\rho}{2}\|L_1u + L_2v - b\|^2_{\mathcal{H}_*}.
\]

The ADMM steps are given by:

\begin{align*}
u^{(k+1)} &= \arg\min_{u \in \mathcal{H}_1} \mathcal{L}_\rho(u, v^{(k)}, z^{(k)}) \tag{5.5.1}\\
v^{(k+1)} &= \arg\min_{v \in \mathcal{H}_2} \mathcal{L}_\rho(u^{(k+1)}, v, z^{(k)}) \tag{5.5.2}\\
z^{(k+1)} &= z^{(k)} + \rho(L_1u^{(k+1)} + L_2v^{(k+1)} - b). \tag{5.5.3}
\end{align*}

While this level of generality is not typically required in applications, we will see below that judicious use of the general formulation is key to developing efficient ADMMs for the co-clustering problem.

While the simplified sub-problems are often much easier to solve than the original problem, they may still be impractically expensive. Deng and Yin [100] consider

a generalized ADMM, which augments the subproblems (5.5.1-5.5.2) with positive-

definite quadratic operators A and B to obtain the updates:

\begin{align*}
u^{(k+1)} &= \arg\min_{u \in \mathcal{H}_1} \mathcal{L}_\rho(u, v^{(k)}, z^{(k)}) + A(u - u^{(k)}) \tag{5.6.1}\\
v^{(k+1)} &= \arg\min_{v \in \mathcal{H}_2} \mathcal{L}_\rho(u^{(k+1)}, v, z^{(k)}) + B(v - v^{(k)}) \tag{5.6.2}\\
z^{(k+1)} &= z^{(k)} + \rho(L_1u^{(k+1)} + L_2v^{(k+1)} - b). \tag{5.6.3}
\end{align*}

If A and B are appropriately chosen, the generalized subproblems (5.6.1-5.6.2) may

be easier to solve than their standard counterparts.

The wide adoption of the ADMM has led to many attempts to extend the ADMM

to objectives with three or more terms. Recently, Davis and Yin [231, Algorithm 8]

proposed an operator-splitting method for the three-block problem:

\[
\begin{aligned}
\arg\min_{(u, v, w) \in \mathcal{H}_1 \times \mathcal{H}_2 \times \mathcal{H}_3}\; & f(u) + g(v) + h(w)\\
\text{subject to}\quad & L_1u + L_2v + L_3w = b.
\end{aligned} \tag{5.7}
\]

Like the standard ADMM, the Davis-Yin scheme alternately updates the augmented

Lagrangian in a Gauss-Seidel fashion:

\[
\mathcal{L}_\rho(u, v, w, z) = f(u) + g(v) + h(w) + \frac{\rho}{2}\|L_1u + L_2v + L_3w - b - \rho^{-1}z\|^2_{\mathcal{H}_*}
\]

The Davis-Yin iterates bear similarities to the standard ADMM and the Alternating

Minimization Algorithm [232], both of which are special cases:

\begin{align*}
u^{(k+1)} &= \arg\min_{u \in \mathcal{H}_1} f(u) + \langle z^{(k)}, L_1u\rangle\\
v^{(k+1)} &= \arg\min_{v \in \mathcal{H}_2} \mathcal{L}_\rho(u^{(k+1)}, v, w^{(k)}, z^{(k)})\\
w^{(k+1)} &= \arg\min_{w \in \mathcal{H}_3} \mathcal{L}_\rho(u^{(k+1)}, v^{(k+1)}, w, z^{(k)})\\
z^{(k+1)} &= z^{(k)} + \rho(L_1u^{(k+1)} + L_2v^{(k+1)} + L_3w^{(k+1)} - b)
\end{align*}

Like the AMA, the Davis-Yin three-block ADMM requires strong convexity of $f(\cdot)$.

Unlike the standard ADMM, which allows ρ to be arbitrarily chosen or even to vary

over the course of the algorithm, both the AMA and Davis-Yin ADMM place addi-

tional constraints on ρ to ensure convergence.

5.2 Operator-Splitting Methods for Convex Bi-Clustering

In this section, we present three operator-splitting methods for convex co-clustering.

For simplicity of exposition, we focus only on the bi-clustering (second-order tensor)

case in this section and suppress the fusion weights w, w˜. The bi-clustering problem

(5.2) then becomes

\[
\hat{U} = \arg\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2}\|X - U\|_F^2 + \lambda\left(\|D_{\text{row}}U\|_{\text{row},q} + \|UD_{\text{col}}\|_{\text{col},q}\right) \tag{5.9}
\]

where $D_{\text{row}}$ and $D_{\text{col}}$ are directed difference matrices implied by the (non-zero) fusion weights and $\|\cdot\|_{\text{row},q}$ and $\|\cdot\|_{\text{col},q}$ are the sums of the row- and column-wise $\ell_q$-norms of a matrix. The extension of our methods to tensors of arbitrary order is natural

and discussed in more detail in Section 5.5.

5.2.1 Alternating Direction Method of Multipliers

Because the convex bi-clustering problem (5.9) has three terms, it is not immediately

obvious how a standard two-block ADMM can be applied to this problem. However, if

we take $\mathcal{H}_1 = \mathbb{R}^{n \times p}$ to be the natural space of the primal variable and $\mathcal{H}_2 = \mathcal{H}_{\text{row}} \times \mathcal{H}_{\text{col}}$ to be the Cartesian product of the row- and column-edge difference spaces, we can

obtain the ADMM updates:

\begin{align*}
U^{(k+1)} &= \arg\min_{U \in \mathbb{R}^{n\times p}} \frac{1}{2}\|X - U\|_F^2 + \frac{\rho}{2}\|D_{\text{row}}U - V_{\text{row}}^{(k)} + \rho^{-1}Z_{\text{row}}^{(k)}\|_F^2 + \frac{\rho}{2}\|UD_{\text{col}} - V_{\text{col}}^{(k)} + \rho^{-1}Z_{\text{col}}^{(k)}\|_F^2\\
\begin{pmatrix} V_{\text{row}}^{(k+1)} \\ V_{\text{col}}^{(k+1)} \end{pmatrix} &= \begin{pmatrix} \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{row},q}}(D_{\text{row}}U^{(k+1)} + \rho^{-1}Z_{\text{row}}^{(k)}) \\ \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{col},q}}(U^{(k+1)}D_{\text{col}} + \rho^{-1}Z_{\text{col}}^{(k)}) \end{pmatrix}\\
\begin{pmatrix} Z_{\text{row}}^{(k+1)} \\ Z_{\text{col}}^{(k+1)} \end{pmatrix} &= \begin{pmatrix} Z_{\text{row}}^{(k)} + \rho(D_{\text{row}}U^{(k+1)} - V_{\text{row}}^{(k+1)}) \\ Z_{\text{col}}^{(k)} + \rho(U^{(k+1)}D_{\text{col}} - V_{\text{col}}^{(k+1)}) \end{pmatrix}
\end{align*}

The V - and Z-updates are straightforward to derive and parallel those of the ADMM

for the convex clustering problem [229, 233, Appendix A.1]. Additionally, we note

that, due to the separable Cartesian structure of H2, the V and Z-updates can be applied separately, and potentially in parallel, for the row and column blocks. As we will see, this separable structure holds for all algorithms considered in this paper and

for tensors of arbitrary rank.
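For the rotationally invariant $q = 2$ penalty used in our experiments, the proximal operators in the $V$-update reduce to row-wise (or column-wise) group soft-thresholding. The following is a minimal numpy sketch of this update under that assumption; the function names are illustrative (not part of any released package) and fusion weights are suppressed as in (5.9).

import numpy as np

def prox_group_l2(M, t, axis=1):
    # Group soft-thresholding: proximal operator of t * (sum of l2 norms of
    # the rows of M if axis=1, or of its columns if axis=0).
    norms = np.linalg.norm(M, axis=axis, keepdims=True)
    shrink = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return shrink * M

def v_update(U, D_row, D_col, Z_row, Z_col, lam, rho):
    # V-update of the ADMM for q = 2, applied separately (and potentially in
    # parallel) to the row and column blocks.
    V_row = prox_group_l2(D_row @ U + Z_row / rho, lam / rho, axis=1)
    V_col = prox_group_l2(U @ D_col + Z_col / rho, lam / rho, axis=0)
    return V_row, V_col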

The U-update is somewhat more difficult as it requires us to solve

\[
X + \rho D_{\text{row}}^T(V_{\text{row}}^{(k)} - \rho^{-1}Z_{\text{row}}^{(k)}) + \rho(V_{\text{col}}^{(k)} - \rho^{-1}Z_{\text{col}}^{(k)})D_{\text{col}}^T = U + \rho D_{\text{row}}^TD_{\text{row}}U + \rho UD_{\text{col}}D_{\text{col}}^T.
\]

This is a Sylvester equation with coefficient matrices $\frac{1}{2}I + \rho D_{\text{row}}^TD_{\text{row}}$ and $\frac{1}{2}I + \rho D_{\text{col}}D_{\text{col}}^T$ and can be solved using standard numerical algorithms [234, 235]. Typically, the most expensive step in solving a Sylvester equation is either a Schur or Hessenberg decomposition of the coefficient matrices, both of which scale cubically with the size of the coefficient matrices. Since the coefficient matrices are fixed by the

problem structure and do not vary between iterations, this factorization can be cached

and amortized over the ADMM iterates, though, as we will see below, it is often more

efficient to modify the U-update to avoid the Sylvester equation entirely.
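As a concrete illustration, the $U$-update can be written with an off-the-shelf Sylvester solver. The sketch below uses scipy (which performs the Schur decompositions internally) with illustrative variable names; a production implementation would cache the factorizations across iterations rather than re-solving from scratch.

import numpy as np
from scipy.linalg import solve_sylvester

def admm_u_update(X, D_row, D_col, V_row, V_col, Z_row, Z_col, rho):
    # Solve (1/2 I + rho D_row^T D_row) U + U (1/2 I + rho D_col D_col^T) = RHS,
    # equivalently U + rho D_row^T D_row U + rho U D_col D_col^T = RHS.
    n, p = X.shape
    A = 0.5 * np.eye(n) + rho * D_row.T @ D_row
    B = 0.5 * np.eye(p) + rho * D_col @ D_col.T
    rhs = (X
           + rho * D_row.T @ (V_row - Z_row / rho)
           + rho * (V_col - Z_col / rho) @ D_col.T)
    return solve_sylvester(A, B, rhs)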

5.2.2 Generalized ADMM

To avoid the Sylvester equation in the U-update, we turn to the Generalized ADMM

framework of Deng and Yin [100] and augment the U-update with the quadratic

operator $A(U) = \frac{1}{2}(\alpha\|U\|_F^2 - \rho\|L_1U\|_{\mathcal{H}_*}^2)$. The constant $\alpha$ must be chosen to ensure $A$ is positive-definite so that sub-problem (5.6.1) remains convex. The smallest valid

α depends on ρ and the operator norm of L1 which can be bounded above by

\[
\|L_1\|_{\mathcal{H}_1 \to \mathcal{H}_*} \leq \sigma_{\max}(D_{\text{row}}) + \sigma_{\max}(D_{\text{col}})
\]

where σmax(·) is the maximum singular value of a matrix. To avoid expensive singular value calculations, an upper bound based on the maximum degree of any vertex in

the graph can be used instead [229, 236, Section 4.1].

With this augmentation, the U-update (5.6.1) becomes:

\[
U^{(k+1)} = \left[\alpha U^{(k)} + X + \rho D_{\text{row}}^T(V_{\text{row}}^{(k)} - \rho^{-1}Z_{\text{row}}^{(k)} - D_{\text{row}}U^{(k)}) + \rho(V_{\text{col}}^{(k)} - \rho^{-1}Z_{\text{col}}^{(k)} - U^{(k)}D_{\text{col}})D_{\text{col}}^T\right]/(1 + \alpha)
\]

We note that this can be interpreted as a weighted average of a standard update and

the previous estimate of U. This update does not require solving a Sylvester equation

and can be performed in quadratic, rather than cubic, time. As we will see in Section

5.3, this update significantly decreases the computational cost of an iteration without

significantly slowing the per iteration convergence rate, resulting in much improved

performance.
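A minimal numpy sketch of this Generalized ADMM $U$-update (illustrative names, no caching or fusion weights) follows; note that it involves only matrix multiplications.

import numpy as np

def generalized_admm_u_update(U, X, D_row, D_col, V_row, V_col,
                              Z_row, Z_col, rho, alpha):
    # Weighted average of the previous iterate and a gradient-like correction;
    # the cost is dominated by matrix products (quadratic, not cubic).
    row_term = rho * D_row.T @ (V_row - Z_row / rho - D_row @ U)
    col_term = rho * (V_col - Z_col / rho - U @ D_col) @ D_col.T
    return (alpha * U + X + row_term + col_term) / (1.0 + alpha)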

5.2.3 Davis-Yin Splitting

The three-block structure of the convex bi-clustering problem (5.9) naturally lends

itself to the Davis-Yin three-block ADMM, which has provable convergence due to

the strongly convex squared Frobenius norm loss. We take H1 as before, but now split

the row- and column-edge differences into two spaces (H2 = Hrow and H3 = Hcol). The joint range space, which also contains the dual variable, is still the Cartesian

product H∗ = H2 × H3. The Davis-Yin iterates have the same V - and Z-updates as before, but yield a simpler U-update:

\[
U^{(k+1)} = X - D_{\text{row}}^TZ_{\text{row}}^{(k)} - Z_{\text{col}}^{(k)}D_{\text{col}}^T
\]

Unlike the ADMM, the Davis-Yin updates require $\rho$ to be less than $1/\|L_1\|_{\mathcal{H}_1 \to \mathcal{H}_*}$. The same bound used for $\alpha$ in the Generalized ADMM can be used here.

We note that, due to the “Cartesian” structure of the linear operators $L_2V = (-V_{\text{row}}, 0_{\mathcal{H}_{\text{col}}})$ and $L_3W = (0_{\mathcal{H}_{\text{row}}}, -V_{\text{col}})$, the Davis-Yin updates have the same separable structure as the ADMM and Generalized ADMM updates. In fact, it is not hard to see that the Davis-Yin updates can also be derived as updates of the Alternating Minimization Algorithm (AMA) [232]. The AMA is known to be equivalent

to proximal gradient on the dual problem, which Chi et al. [98] recommended to solve

Problem (5.3).

As Chi and Lange [229] note for the convex clustering problem, Moreau’s decomposi-

tion [237] can be used to eliminate the V-update by replacing the proximal operator with a projection onto the dual norm ball, giving the simplified Z-updates

\begin{align*}
Z_{\text{row}}^{(k+1)} &= \operatorname{proj}_{\lambda/\rho\, B_{\|\cdot\|^*_{\text{row},q}}}(Z_{\text{row}}^{(k)} + \rho D_{\text{row}}U^{(k+1)})\\
Z_{\text{col}}^{(k+1)} &= \operatorname{proj}_{\lambda/\rho\, B_{\|\cdot\|^*_{\text{col},q}}}(Z_{\text{col}}^{(k)} + \rho U^{(k+1)}D_{\text{col}}).
\end{align*}

This simplification lowers the computational cost of the Davis-Yin updates, but it

does not improve their per iteration convergence which, as we will see in the next

section, is not competitive with the ADMM variants.
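For the rotationally invariant $q = 2$ penalty, the dual norm is again a row-wise (or column-wise) $\ell_2$ norm, so the dual-ball projection simply clips each row (or column) to radius $\lambda/\rho$; by Moreau's decomposition this is the complement of the group soft-thresholding sketched earlier. A minimal sketch with illustrative names:

import numpy as np

def proj_group_l2_ball(M, radius, axis=1):
    # Project each row (axis=1) or column (axis=0) of M onto the l2 ball of the
    # given radius; equivalently M - prox_{radius * ||.||}(M) by Moreau.
    norms = np.linalg.norm(M, axis=axis, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return scale * M

# Simplified Davis-Yin Z-updates for q = 2:
#   Z_row <- proj_group_l2_ball(Z_row + rho * D_row @ U, lam / rho, axis=1)
#   Z_col <- proj_group_l2_ball(Z_col + rho * U @ D_col, lam / rho, axis=0)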

5.3 Simulation Studies

In this section, we compare the performance of the three algorithms considered on

the presidential speeches data set (n = 44, p = 75) of Weylandt et al. [233] and log-

transformed Level III RPKM gene expression levels from 438 breast cancer samples

collected by The Cancer Genome Atlas (TCGA) (n = 438, p = 353)[238]. In addition

to the three algorithms discussed above, we also compare to the COBRA algorithm of

Chi et al. [99], an application of the Dykstra-Like Proximal Algorithm of Bauschke

and Combettes [239] to Problem (5.2), which works by solving alternating convex

clustering problems (5.1) on the rows and columns of U until convergence. For

each method, we compare a standard version and a version using the acceleration

techniques proposed in Goldstein et al. [101]. (For COBRA, acceleration was applied

to both the row and column sub-problems.)

We show results for the rotationally invariant $q = 2$ penalty with $\lambda = 1 \times 10^4$ for

the presidents data and $\lambda = 1 \times 10^6$ for the breast cancer data, though our results

are similar for other penalty functions and values of λ. For the ADMM variants, we

fixed ρ = 1. The Davis-Yin step size and the Generalized ADMM coefficient (α) were

Figure 5.3.1 : Per iteration convergence of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA breast cancer data sets. COBRA clearly exhibits the fastest per iteration convergence, appearing to exhibit super-linear convergence. The ADMM and Generalized ADMM have similar convergence rates, while the Davis-Yin iterates converge much more slowly.

both set to twice the maximum degree of the row- or column-incidence graph. The

default sparse Gaussian kernel weights of the clustRviz package [240] are used for

each data set.

As can be seen in Figure 5.3.1, the COBRA algorithm clearly has the fastest per iteration

convergence, rapidly converging to the optimal solution. (The slower convergence

near optimality appears to be an artifact of the inexact solution of the sub-problems

rather than an inherent behavior of the DLPA.) The ADMM and Generalized ADMM

exhibit relatively fast convergence, while the Davis-Yin iterates are the slowest to

converge. The standard ADMM appears to have slightly faster convergence than the

Generalized ADMM near initialization, but the two methods appear to eventually

attain the same convergence rate. Acceleration provides a minor, but consistent,

improvement for all methods except COBRA, where the acceleration is applied within each sub-problem.

Figure 5.3.2 : Wall clock speed of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA data sets. Despite its rapid per iteration convergence, COBRA is less efficient than the ADMM and Generalized ADMM due to the complexity of its iterations. Davis-Yin Splitting remains the least efficient algorithm considered for this problem.

The apparent advantage of COBRA disappears, however, when we instead consider total

elapsed time, as shown in Figure 5.3.2, as the COBRA updates require solving a pair

of expensive convex clustering sub-problems at each step. On a wall-clock basis, the

Generalized ADMM performs the best for both problems, followed by the standard

ADMM, COBRA, and Davis-Yin splitting in that order.

Practically, the Generalized ADMM and Davis-Yin updates have essentially the same

cost and are both three to four times faster per iteration than the standard ADMM with caching. Precise numerical results are given in Table 5.4.1. These results suggest

that our Generalized ADMM scheme attains the best of both worlds, achieving both

the high per iteration convergence rate of a standard ADMM and the low per iteration

computational cost of the Davis-Yin / AMA updates, consistent with our theoretical

analysis in the next section.

5.4 Complexity Analysis

Both the ADMM and AMA are known to exhibit O(1/K) convergence in general,

[101, 241, 242], but the ADMM obtains a superior convergence rate of o(1/K) for

strongly convex problems [243]. This rate can be further improved to linear convergence if certain full-rank conditions are fulfilled by the problem-specific Drow and

Dcol matrices [100]. As Deng and Yin [100] show, the Generalized ADMM achieves essentially the same convergence rate as the standard ADMM. Consequently, theory

suggests that the two-block ADMM variants have superior per iteration convergence,

consistent with our experimental results.

The computational cost of the Generalized ADMM and Davis-Yin U-updates are

essentially the same, both being dominated by matrix multiplications giving compu-

tational complexity

O(np max(#rows, #columns)), where #rows and #columns are the number of non-zero row and column fusion weights respectively. For sparse weighting schemes, we typically have #rows = O(n)

and #columns = O(p), yielding an overall per iteration complexity of $O(\max(np^2, n^2p))$ for the U-update. A naive implementation of the ADMM update requires $O((\#\text{rows})^3 +$

$(\#\text{columns})^3)$ to solve the Sylvester equation, but if the initial factorization is cached, the per iteration complexity is again reduced to $O(\max(np^2, n^2p))$ under a sparse weight scheme.

To the best of our knowledge, a convergence rate for the DLPA, on which COBRA is based, has not been established. Experimentally, the DLPA exhibits very rapid convergence, consistent with known rates for Dykstra's alternating projections algorithm.

Despite this, the high per iteration cost makes it impractical on larger problems.

                              Presidential Speeches                          TCGA Breast Cancer
Method                        Iter/sec  Total Time (s)  Total Iterations     Iter/sec  Total Time (s)  Total Iterations
Generalized ADMM              663       3.98            2639                 2.58      1318            3411
  Accelerated                 645       4.69            3025                 2.55      835             2135
Davis-Yin / AMA               657       †15.23          †10000               2.60      †3850           †10000
  Accelerated                 653       †15.31          †10000               2.58      †3879           †10000
ADMM                          189       8.12            1536                 0.84      1521            1277
  Accelerated                 188       7.08            1334                 0.84      1324            1107
COBRA                         5.06      37.20           188 (29581 total)    0.054     5994            325 (16519 total)
  Accelerated                 6.89      27.15           187 (21362 total)    0.065     4953            323 (13549 total)

Table 5.4.1 : Performance of the ADMM, Generalized ADMM, Davis-Yin Splitting, and COBRA on the presidential speech and TCGA data sets. For COBRA, the numbers in parentheses are the total number of sub-problem iterations taken. The † indicates that the method was stopped after failing to converge in 10,000 iterations.

5.5 Extensions to General Co-Clustering

In Section 5.2, we restricted our attention to co-clustering tensors of order two, i.e.,

bi-clustering. In this section, we show how the ADMM, Generalized ADMM, and

Davis-Yin (AMA) algorithms can be extended to the general problem of co-clustering

order-$J$ tensors (5.3). As before, we can re-express the co-clustering problem using

directed difference tensors Dj as

\[
\hat{\mathcal{U}} = \arg\min_{\mathcal{U}} \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_F^2 + \lambda\sum_{j=1}^{J}\|\mathcal{U} \times_j D_j\|_{j,q}
\]

where $\|\mathcal{X}\|_{j,q}$ is the sum of the $\ell_q$-norms of the vectorizations of the slices of $\mathcal{X}$ along the $j$th mode. For all three methods, the $\mathcal{V}$- and $\mathcal{Z}$-updates are straightforward extensions

of the bi-clustering case:

\begin{align*}
\mathcal{V}_j^{(k+1)} &= \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{j,q}}\!\left(\mathcal{U}^{(k+1)} \times_j D_j + \rho^{-1}\mathcal{Z}_j^{(k)}\right)\\
\mathcal{Z}_j^{(k+1)} &= \mathcal{Z}_j^{(k)} + \rho\left(\mathcal{U}^{(k+1)} \times_j D_j - \mathcal{V}_j^{(k+1)}\right).
\end{align*}

As with the bi-clustering case, we have separate V and Z variables for each term in

the fusion penalty (i.e., each mode of the tensor), each of which can be updated in

parallel. For the standard ADMM, the U-update requires solving the following tensor

Sylvester equation [244], where (·)T denotes a transpose, with the direction implied

by context:

\[
\mathcal{X} + \rho\sum_{j=1}^{J}\left(\mathcal{V}_j^{(k)} - \rho^{-1}\mathcal{Z}_j^{(k)}\right) \times_j D_j^T = \mathcal{U} + \rho\sum_{j=1}^{J}\mathcal{U} \times_j D_j \times_j D_j^T.
\]

As above, the U-update can be solved explicitly for our other methods:

\begin{align*}
\mathcal{U}^{(k+1)} &= \frac{\alpha}{1+\alpha}\mathcal{U}^{(k)} + \frac{1}{1+\alpha}\mathcal{X} + \frac{\rho}{1+\alpha}\sum_{j=1}^{J}\left(\mathcal{V}_j^{(k)} - \rho^{-1}\mathcal{Z}_j^{(k)} - \mathcal{U}^{(k)} \times_j D_j\right) \times_j D_j^T && \text{(Generalized ADMM)}\\
\mathcal{U}^{(k+1)} &= \mathcal{X} - \sum_{j=1}^{J}\mathcal{Z}_j^{(k)} \times_j D_j^T. && \text{(Davis-Yin / AMA)}
\end{align*}
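The mode-wise products above are straightforward to implement with tensordot; the sketch below (illustrative names, no fusion weights) shows a mode-$j$ product and the resulting Davis-Yin / AMA primal update for an order-$J$ tensor, assuming each $D_j$ is stored as an $|\mathcal{E}_j| \times n_j$ array.

import numpy as np

def mode_product(T, M, j):
    # Mode-j product T x_j M: multiply the j-th mode of tensor T by matrix M,
    # where M has shape (m, T.shape[j]); the result has m in place of T.shape[j].
    return np.moveaxis(np.tensordot(M, T, axes=(1, j)), 0, j)

def davis_yin_tensor_u_update(X, Z_list, D_list):
    # U^(k+1) = X - sum_j Z_j x_j D_j^T  (Davis-Yin / AMA update)
    U = X.copy()
    for j, (Z_j, D_j) in enumerate(zip(Z_list, D_list)):
        U = U - mode_product(Z_j, D_j.T, j)
    return U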

5.6 Discussion

We have introduced three operator-splitting methods for solving the convex bi-clustering

and co-clustering problems: a standard ADMM, a Generalized ADMM, and a three-

block ADMM based on Davis-Yin splitting. The Davis-Yin three-block ADMM was

found to be equivalent to the AMA which is in turn equivalent to the dual projected

gradient recommended by Chi et al. [98]. The standard ADMM achieves the fastest

per iteration convergence, but requires solving a Sylvester equation at each step. The

Generalized ADMM avoids the expensive Sylvester equation while still maintaining

the rapid convergence of the ADMM. Unlike Chi and Lange [229], we do not find that

the AMA performs well, perhaps due to the more complex structure of the bi- and

co-clustering problems. On the TCGA breast cancer data, the Generalized ADMM was able to find a better solution in three and a half minutes than the AMA could

in over an hour. All three of our methods can be extended to co-cluster tensors of

arbitrary order, with the advantages of the Generalized ADMM becoming even more

pronounced as it avoids a tensor Sylvester equation.

Our ADMM and Generalized ADMM methods provide a unified and efficient com-

putational approach to convex co-clustering of large tensors of arbitrary order, but

several interesting methodological questions remain unanswered. In particular, the

conditions under which the co-clustering solution path is purely agglomerative, en-

abling the construction of dendrogram-based visualizations of the kind proposed for

convex clustering by Weylandt et al. [233], remain unknown. While not discussed

here, our algorithms can naturally be extended to handle missing data and cross-validation, using a straightforward extension of the Majorization-Minimization (MM)

imputation scheme proposed by Chi et al. [99]. Our algorithms make convex bi- and

co-clustering practical for large structured data sets and we anticipate they will en-

courage adoption and additional study of these useful techniques.

Appendix to Chapter 5

5.A Step Size Selection

The Generalized ADMM requires that the quadratic operator $A(U) = \alpha\|U\|_F^2 - \rho\|L_1U\|_{\mathcal{H}_*}^2$ be positive-definite, which holds if $\alpha/\rho$ is greater than the operator norm of $L_1$. Similarly, the Davis-Yin ADMM requires that the step size $\rho^{-1}$ be greater than the operator norm of $L_1$. (See the related discussion in Section 4.2 of Chi and Lange [229].) While this operator norm is somewhat inconvenient to compute, an upper

bound is easily found:

\begin{align*}
\|L_1\|_{\mathcal{H}_1 \to \mathcal{H}_*} &= \sup_{\substack{U \in \mathcal{H}_1\\ \|U\|_{\mathcal{H}_1} \leq 1}} \|L_1U\|_{\mathcal{H}_*}
= \sup_{\substack{U \in \mathcal{H}_1\\ \|U\|_{\mathcal{H}_1} \leq 1}} \sqrt{\|D_{\text{row}}U\|_F^2 + \|UD_{\text{col}}\|_F^2}\\
&\leq \sup_{\substack{U \in \mathcal{H}_1\\ \|U\|_{\mathcal{H}_1} \leq 1}} \|D_{\text{row}}U\|_F + \|UD_{\text{col}}\|_F
\leq \sup_{\substack{U \in \mathcal{H}_1\\ \|U\|_{\mathcal{H}_1} \leq 1}} \|D_{\text{row}}U\|_F + \sup_{\substack{U \in \mathcal{H}_1\\ \|U\|_{\mathcal{H}_1} \leq 1}} \|UD_{\text{col}}\|_F\\
&= \sigma_{\max}(D_{\text{row}}) + \sigma_{\max}(D_{\text{col}})
\end{align*}

where σmax(·) is the maximum singular value of a matrix. To avoid a potentially expensive SVD for large problems, we can instead take advantage of the special structure of Drow and Dcol using known results about the eigenstructure of a graph Laplacian [236] which imply

\[
\sigma_{\max}(D_{\text{row}}) \leq \max_{(i,j) \in \mathcal{E}_{\text{row}}}\left(\operatorname{degree}(i) + \operatorname{degree}(j)\right) \leq 2\max_{i}\operatorname{degree}(i)
\]

and

\[
\sigma_{\max}(D_{\text{col}}) \leq \max_{(i,j) \in \mathcal{E}_{\text{col}}}\left(\operatorname{degree}(i) + \operatorname{degree}(j)\right) \leq 2\max_{i}\operatorname{degree}(i)
\]

where the first maximum is taken over the pairs of connected vertices and the second

is taken over all vertices. This bound can be computed with minimal effort and, in

our experience, is sufficiently tight for most problems.
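A minimal sketch of this bound computed directly from the difference matrices follows. It assumes (as in our notation) that rows of $D_{\text{row}}$ index row-graph edges and its columns index observations, while rows of $D_{\text{col}}$ index column-graph vertices and its columns index edges; the function name is illustrative.

import numpy as np

def operator_norm_bound(D_row, D_col):
    # Bound on sigma_max(D_row) + sigma_max(D_col) via twice the maximum vertex
    # degree of each fusion graph (degree = number of incident edges).
    deg_row = np.count_nonzero(D_row, axis=0)   # D_row: |E_row| x n
    deg_col = np.count_nonzero(D_col, axis=1)   # D_col: p x |E_col|
    return 2 * deg_row.max() + 2 * deg_col.max()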

5.B Detailed Derivations

In this section, we provide detailed derivations of the ADMM, Generalized ADMM,

and Davis-Yin algorithms for convex clustering discussed above. Unless otherwise

stated, our notation is the same as that used in the main body of the paper. Given the

large number of Hilbert spaces, norms, and inner products used in these derivations, we err on the side of explicitness rather than concision in our notation.

5.B.1 ADMM

The ADMM updates are

\begin{align*}
U^{(k+1)} &= \arg\min_{U \in \mathcal{H}_1} f(U) + \langle Z^{(k)}, L_1U + L_2V^{(k)} - b\rangle + \frac{\rho}{2}\|L_1U + L_2V^{(k)} - b\|_{\mathcal{H}_*}^2\\
V^{(k+1)} &= \arg\min_{V \in \mathcal{H}_2} g(V) + \langle Z^{(k)}, L_1U^{(k+1)} + L_2V - b\rangle + \frac{\rho}{2}\|L_1U^{(k+1)} + L_2V - b\|_{\mathcal{H}_*}^2\\
Z^{(k+1)} &= Z^{(k)} + \rho(L_1U^{(k+1)} + L_2V^{(k+1)} - b).
\end{align*}

Rescaling the dual variable by ρ−1 (Z → ρ−1Z), these can be simplified to

\begin{align*}
U^{(k+1)} &= \arg\min_{U \in \mathcal{H}_1} f(U) + \frac{\rho}{2}\|L_1U + L_2V^{(k)} - b + Z^{(k)}\|_{\mathcal{H}_*}^2\\
V^{(k+1)} &= \arg\min_{V \in \mathcal{H}_2} g(V) + \frac{\rho}{2}\|L_1U^{(k+1)} + L_2V - b + Z^{(k)}\|_{\mathcal{H}_*}^2\\
Z^{(k+1)} &= Z^{(k)} + L_1U^{(k+1)} + L_2V^{(k+1)} - b.
\end{align*}

For the convex bi-clustering problem (5.9), we take:

• $\mathcal{H}_1 = \mathbb{R}^{n \times p}$ equipped with the Frobenius norm and inner product

• $\mathcal{H}_{\text{row}} = \mathbb{R}^{|\mathcal{E}_{\text{row}}| \times p}$ and $\mathcal{H}_{\text{col}} = \mathbb{R}^{n \times |\mathcal{E}_{\text{col}}|}$, both equipped with the Frobenius norm and inner product

• $\mathcal{H}_2 = \mathcal{H}_* = \mathcal{H}_{\text{row}} \times \mathcal{H}_{\text{col}}$ equipped with the inner product
  $\langle (A_{\text{row}}, A_{\text{col}}), (B_{\text{row}}, B_{\text{col}})\rangle_{\mathcal{H}_*} = \langle A_{\text{row}}, B_{\text{row}}\rangle_{\mathcal{H}_{\text{row}}} + \langle A_{\text{col}}, B_{\text{col}}\rangle_{\mathcal{H}_{\text{col}}}$
  and the norm
  $\|(A_{\text{row}}, A_{\text{col}})\|_{\mathcal{H}_*} = \sqrt{\|A_{\text{row}}\|_{\mathcal{H}_{\text{row}}}^2 + \|A_{\text{col}}\|_{\mathcal{H}_{\text{col}}}^2}$

• $L_1 : \mathcal{H}_1 \to \mathcal{H}_*$ given by $L_1U = (D_{\text{row}}U, UD_{\text{col}})$

• $L_2 : \mathcal{H}_2 \to \mathcal{H}_*$ given by $L_2(V_{\text{row}}, V_{\text{col}}) = -(V_{\text{row}}, V_{\text{col}})$

• $b = 0_{\mathcal{H}_*} = (0_{\mathcal{H}_{\text{row}}}, 0_{\mathcal{H}_{\text{col}}}) \in \mathcal{H}_*$

• $f(U) = \frac{1}{2}\|X - U\|_F^2$

• $g((V_{\text{row}}, V_{\text{col}})) = \lambda\|V_{\text{row}}\|_{q,1} + \lambda\|V_{\text{col}}\|_{1,q}$

First, we consider the U-update:
\begin{align*}
U^{(k+1)} &= \arg\min_{U \in \mathbb{R}^{n\times p}} \frac{1}{2}\|X - U\|_F^2 + \frac{\rho}{2}\left\|\begin{pmatrix}D_{\text{row}}U\\ UD_{\text{col}}\end{pmatrix} + \begin{pmatrix}-V_{\text{row}}^{(k)}\\ -V_{\text{col}}^{(k)}\end{pmatrix} + \begin{pmatrix}Z_{\text{row}}^{(k)}\\ Z_{\text{col}}^{(k)}\end{pmatrix}\right\|_{\mathcal{H}_*}^2\\
&= \arg\min_{U \in \mathbb{R}^{n\times p}} \frac{1}{2}\|U - X\|_F^2 + \frac{\rho}{2}\|D_{\text{row}}U - V_{\text{row}}^{(k)} + Z_{\text{row}}^{(k)}\|_F^2 + \frac{\rho}{2}\|UD_{\text{col}} - V_{\text{col}}^{(k)} + Z_{\text{col}}^{(k)}\|_F^2
\end{align*}

using the separability of the squared $\mathcal{H}_*$-norm ($\|\cdot\|_{\mathcal{H}_*}^2$). This is fully smooth and so we take the gradient with respect to $U$ to obtain the stationarity conditions:

\[
0_{\mathcal{H}_1} = U - X + \rho D_{\text{row}}^T(D_{\text{row}}U - V_{\text{row}}^{(k)} + Z_{\text{row}}^{(k)}) + \rho(UD_{\text{col}} - V_{\text{col}}^{(k)} + Z_{\text{col}}^{(k)})D_{\text{col}}^T
\]

using the identity*

\[
\frac{\partial}{\partial X}\|AXB + C\|_F^2 = 2A^T[AXB + C]B^T.
\]

*See Equation (119) of the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.

Solving for U, we obtain

\[
X + \rho D_{\text{row}}^T(V_{\text{row}}^{(k)} - Z_{\text{row}}^{(k)}) + \rho(V_{\text{col}}^{(k)} - Z_{\text{col}}^{(k)})D_{\text{col}}^T = U + \rho D_{\text{row}}^TD_{\text{row}}U + \rho UD_{\text{col}}D_{\text{col}}^T
\]

The V-updates are straightforward to derive using the separable structure of $\mathcal{H}_2 = \mathcal{H}_{\text{row}} \times \mathcal{H}_{\text{col}}$:

\begin{align*}
\begin{pmatrix} V_{\text{row}}^{(k+1)} \\ V_{\text{col}}^{(k+1)} \end{pmatrix}
&= \arg\min_{(V_{\text{row}}, V_{\text{col}}) \in \mathcal{H}_{\text{row}} \times \mathcal{H}_{\text{col}}} \lambda\|V_{\text{row}}\|_{\text{row},q} + \lambda\|V_{\text{col}}\|_{\text{col},q} + \frac{\rho}{2}\left\|\begin{pmatrix}D_{\text{row}}U^{(k+1)}\\ U^{(k+1)}D_{\text{col}}\end{pmatrix} + \begin{pmatrix}-V_{\text{row}}\\ -V_{\text{col}}\end{pmatrix} + \begin{pmatrix}Z_{\text{row}}^{(k)}\\ Z_{\text{col}}^{(k)}\end{pmatrix}\right\|_{\mathcal{H}_{\text{row}} \times \mathcal{H}_{\text{col}}}^2\\
&= \begin{pmatrix} \arg\min_{V_{\text{row}} \in \mathcal{H}_{\text{row}}} \lambda\|V_{\text{row}}\|_{\text{row},q} + \frac{\rho}{2}\|D_{\text{row}}U^{(k+1)} - V_{\text{row}} + Z_{\text{row}}^{(k)}\|_F^2 \\ \arg\min_{V_{\text{col}} \in \mathcal{H}_{\text{col}}} \lambda\|V_{\text{col}}\|_{\text{col},q} + \frac{\rho}{2}\|U^{(k+1)}D_{\text{col}} - V_{\text{col}} + Z_{\text{col}}^{(k)}\|_F^2 \end{pmatrix}\\
&= \begin{pmatrix} \arg\min_{V_{\text{row}} \in \mathcal{H}_{\text{row}}} \frac{\lambda}{\rho}\|V_{\text{row}}\|_{\text{row},q} + \frac{1}{2}\|V_{\text{row}} - (D_{\text{row}}U^{(k+1)} + Z_{\text{row}}^{(k)})\|_F^2 \\ \arg\min_{V_{\text{col}} \in \mathcal{H}_{\text{col}}} \frac{\lambda}{\rho}\|V_{\text{col}}\|_{\text{col},q} + \frac{1}{2}\|V_{\text{col}} - (U^{(k+1)}D_{\text{col}} + Z_{\text{col}}^{(k)})\|_F^2 \end{pmatrix}\\
&= \begin{pmatrix} \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{row},q}}(D_{\text{row}}U^{(k+1)} + Z_{\text{row}}^{(k)}) \\ \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{col},q}}(U^{(k+1)}D_{\text{col}} + Z_{\text{col}}^{(k)}) \end{pmatrix}
\end{align*}

If fusion weights are included, then they are reflected in the row- and column-wise

proximal operators. Finally, the Z-updates are standard:

\[
\begin{pmatrix} Z_{\text{row}}^{(k+1)} \\ Z_{\text{col}}^{(k+1)} \end{pmatrix} = \begin{pmatrix} Z_{\text{row}}^{(k)} + D_{\text{row}}U^{(k+1)} - V_{\text{row}}^{(k+1)} \\ Z_{\text{col}}^{(k)} + U^{(k+1)}D_{\text{col}} - V_{\text{col}}^{(k+1)} \end{pmatrix}
\]

If we unscale Z → ρ−1Z, we recover the updates given in the main body of this

note.

5.B.2 Generalized ADMM

To avoid the Sylvester equation, we recommend use of a Generalized ADMM, where

the $U$-update is augmented with the quadratic operator $A(U - U^{(k)}) = \frac{\alpha}{2}\|U - U^{(k)}\|_F^2 - \frac{\rho}{2}\|L_1(U - U^{(k)})\|_F^2$, where $\alpha$ is fixed sufficiently large to ensure positive-

definiteness. This augmentation gives:

\begin{align*}
U^{(k+1)} = \arg\min_{U \in \mathbb{R}^{n\times p}}\; &\frac{1}{2}\|U - X\|_F^2 + \frac{\rho}{2}\|D_{\text{row}}U - V_{\text{row}}^{(k)} + Z_{\text{row}}^{(k)}\|_F^2 + \frac{\rho}{2}\|UD_{\text{col}} - V_{\text{col}}^{(k)} + Z_{\text{col}}^{(k)}\|_F^2\\
&+ \frac{\alpha}{2}\|U - U^{(k)}\|_F^2 - \frac{\rho}{2}\|D_{\text{row}}U - D_{\text{row}}U^{(k)}\|_F^2 - \frac{\rho}{2}\|UD_{\text{col}} - U^{(k)}D_{\text{col}}\|_F^2
\end{align*}

As before, we take gradients with respect to U to obtain the stationarity condition:

\begin{align*}
0_{\mathcal{H}_1} = \;&U - X + \rho D_{\text{row}}^TD_{\text{row}}U - \rho D_{\text{row}}^T(V_{\text{row}}^{(k)} - Z_{\text{row}}^{(k)}) + \rho UD_{\text{col}}D_{\text{col}}^T - \rho(V_{\text{col}}^{(k)} - Z_{\text{col}}^{(k)})D_{\text{col}}^T\\
&+ \alpha(U - U^{(k)}) - \rho D_{\text{row}}^T(D_{\text{row}}U - D_{\text{row}}U^{(k)}) - \rho(UD_{\text{col}} - U^{(k)}D_{\text{col}})D_{\text{col}}^T
\end{align*}

Simplifying, we obtain:

\[
(1+\alpha)U = X + \rho D_{\text{row}}^T(V_{\text{row}}^{(k)} - Z_{\text{row}}^{(k)}) + \rho(V_{\text{col}}^{(k)} - Z_{\text{col}}^{(k)})D_{\text{col}}^T + \alpha U^{(k)} - \rho D_{\text{row}}^TD_{\text{row}}U^{(k)} - \rho U^{(k)}D_{\text{col}}D_{\text{col}}^T
\]

The V and Z-updates are unchanged.

5.B.3 Davis-Yin

The Davis-Yin updates are given by:

\begin{align*}
U^{(k+1)} &= \arg\min_{U \in \mathcal{H}_1} f(U) + \langle Z^{(k)}, L_1U\rangle\\
V^{(k+1)} &= \arg\min_{V \in \mathcal{H}_2} g(V) + \frac{\rho}{2}\|L_1U^{(k+1)} + L_2V + L_3W^{(k)} - b + \rho^{-1}Z^{(k)}\|_{\mathcal{H}_*}^2\\
W^{(k+1)} &= \arg\min_{W \in \mathcal{H}_3} h(W) + \frac{\rho}{2}\|L_1U^{(k+1)} + L_2V^{(k+1)} + L_3W - b + \rho^{-1}Z^{(k)}\|_{\mathcal{H}_*}^2\\
Z^{(k+1)} &= Z^{(k)} + \rho(L_1U^{(k+1)} + L_2V^{(k+1)} + L_3W^{(k+1)} - b)
\end{align*}

For the convex bi-clustering problem (5.9), we take:

• $\mathcal{H}_1 = \mathbb{R}^{n \times p}$ equipped with the Frobenius norm and inner product

• $\mathcal{H}_2 = \mathcal{H}_{\text{row}} = \mathbb{R}^{|\mathcal{E}_{\text{row}}| \times p}$ equipped with the Frobenius norm and inner product

• $\mathcal{H}_3 = \mathcal{H}_{\text{col}} = \mathbb{R}^{n \times |\mathcal{E}_{\text{col}}|}$ equipped with the Frobenius norm and inner product

• $\mathcal{H}_* = \mathcal{H}_2 \times \mathcal{H}_3$ equipped with the inner product
  $\langle (A_1, A_2), (B_1, B_2)\rangle_{\mathcal{H}_*} = \langle A_1, B_1\rangle_{\mathcal{H}_2} + \langle A_2, B_2\rangle_{\mathcal{H}_3}$
  and the norm
  $\|(A_1, A_2)\|_{\mathcal{H}_*} = \sqrt{\|A_1\|_{\mathcal{H}_2}^2 + \|A_2\|_{\mathcal{H}_3}^2}$

• $L_1 : \mathcal{H}_1 \to \mathcal{H}_*$ given by $L_1U = (D_{\text{row}}U, UD_{\text{col}})$

• $L_2 : \mathcal{H}_2 \to \mathcal{H}_*$ given by $L_2V = (-V, 0_{\mathcal{H}_3})$

• $L_3 : \mathcal{H}_3 \to \mathcal{H}_*$ given by $L_3W = (0_{\mathcal{H}_2}, -W)$

• $b = 0_{\mathcal{H}_*} = (0_{\mathcal{H}_2}, 0_{\mathcal{H}_3}) \in \mathcal{H}_*$

• $f_1(U) = \frac{1}{2}\|X - U\|_F^2$

• $f_2(V) = \lambda\|V\|_{\text{row},q}$

• $f_3(W) = \lambda\|W\|_{\text{col},q}$

First, we consider the U-update:

\begin{align*}
&\arg\min_{U \in \mathcal{H}_1} \frac{1}{2}\|X - U\|_F^2 + \langle Z^{(k)}, L_1U\rangle_{\mathcal{H}_*}\\
&= \arg\min_{U \in \mathcal{H}_1} \frac{1}{2}\|X - U\|_F^2 + \langle Z^{(k)}, (D_{\text{row}}U, UD_{\text{col}})\rangle_{\mathcal{H}_*}\\
&= \arg\min_{U \in \mathcal{H}_1} \frac{1}{2}\|X - U\|_F^2 + \langle Z_{\text{row}}^{(k)}, D_{\text{row}}U\rangle_{\mathcal{H}_2} + \langle Z_{\text{col}}^{(k)}, UD_{\text{col}}\rangle_{\mathcal{H}_3}\\
&= \arg\min_{U \in \mathcal{H}_1} \frac{1}{2}\|X - U\|_F^2 + \operatorname{Tr}((Z_{\text{row}}^{(k)})^TD_{\text{row}}U) + \operatorname{Tr}((Z_{\text{col}}^{(k)})^TUD_{\text{col}})
\end{align*}

This is fully smooth, so we take the gradient with respect to $U$, using the identity†

\[
\frac{\partial}{\partial X}\operatorname{Tr}(AXB) = A^TB^T \implies \frac{\partial}{\partial X}\operatorname{Tr}(AX) = A^T,
\]

†See Equation (101) of the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.

to obtain the stationarity condition

\[
0_{\mathcal{H}_1} = -(X - U) + D_{\text{row}}^TZ_{\text{row}}^{(k)} + Z_{\text{col}}^{(k)}D_{\text{col}}^T.
\]

Solving for U, we obtain the Davis-Yin U-update:

\[
U^{(k+1)} = X - D_{\text{row}}^TZ_{\text{row}}^{(k)} - Z_{\text{col}}^{(k)}D_{\text{col}}^T.
\]

The Vrow-update (i.e., the V -update above) is given by

\begin{align*}
&\arg\min_{V \in \mathcal{H}_{\text{row}}} \lambda\|V\|_{\text{row},q} + \frac{\rho}{2}\|L_1U^{(k+1)} + L_2V + L_3W^{(k)} - 0_{\mathcal{H}_*} + \rho^{-1}Z^{(k)}\|_{\mathcal{H}_*}^2\\
&= \arg\min_{V \in \mathcal{H}_{\text{row}}} \frac{\lambda}{\rho}\|V\|_{\text{row},q} + \frac{1}{2}\left\|(D_{\text{row}}U^{(k+1)}, U^{(k+1)}D_{\text{col}}) + (-V, 0_{\mathcal{H}_{\text{col}}}) + (0_{\mathcal{H}_{\text{row}}}, -W^{(k)}) + (\rho^{-1}Z_{\text{row}}^{(k)}, \rho^{-1}Z_{\text{col}}^{(k)})\right\|_{\mathcal{H}_*}^2\\
&= \arg\min_{V \in \mathcal{H}_{\text{row}}} \frac{\lambda}{\rho}\|V\|_{\text{row},q} + \frac{1}{2}\left\|\left(D_{\text{row}}U^{(k+1)} - V + \rho^{-1}Z_{\text{row}}^{(k)},\; U^{(k+1)}D_{\text{col}} - W^{(k)} + \rho^{-1}Z_{\text{col}}^{(k)}\right)\right\|_{\mathcal{H}_*}^2
\end{align*}

Using the separability of the squared H∗-norm, we see that the H3 component of the second term does not depend on V and hence the above problem is equivalent

to

\begin{align*}
V_{\text{row}}^{(k+1)} &= \arg\min_{V \in \mathcal{H}_{\text{row}}} \frac{\lambda}{\rho}\|V\|_{\text{row},q} + \frac{1}{2}\|V - (D_{\text{row}}U^{(k+1)} + \rho^{-1}Z_{\text{row}}^{(k)})\|_{\mathcal{H}_2}^2\\
&= \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{row},q}}\!\left(D_{\text{row}}U^{(k+1)} + \rho^{-1}Z_{\text{row}}^{(k)}\right).
\end{align*}

A similar analysis gives the Vcol-update (i.e., the W -update above):

\begin{align*}
V_{\text{col}}^{(k+1)} &= \arg\min_{W \in \mathcal{H}_{\text{col}}} \frac{\lambda}{\rho}\|W\|_{\text{col},q} + \frac{1}{2}\|W - (U^{(k+1)}D_{\text{col}} + \rho^{-1}Z_{\text{col}}^{(k)})\|_{\mathcal{H}_3}^2\\
&= \operatorname{prox}_{\lambda/\rho\,\|\cdot\|_{\text{col},q}}\!\left(U^{(k+1)}D_{\text{col}} + \rho^{-1}Z_{\text{col}}^{(k)}\right).
\end{align*}

Finally, the Z-update is given by:

\begin{align*}
Z^{(k+1)} &= Z^{(k)} + \rho\left(L_1U^{(k+1)} + L_2V^{(k+1)} + L_3W^{(k+1)} - b\right)\\
\begin{pmatrix} Z_{\text{row}}^{(k+1)} \\ Z_{\text{col}}^{(k+1)} \end{pmatrix} &= \begin{pmatrix} Z_{\text{row}}^{(k)} \\ Z_{\text{col}}^{(k)} \end{pmatrix} + \rho\left[\begin{pmatrix} D_{\text{row}}U^{(k+1)} \\ U^{(k+1)}D_{\text{col}} \end{pmatrix} + \begin{pmatrix} -V_{\text{row}}^{(k+1)} \\ 0_{\mathcal{H}_3} \end{pmatrix} + \begin{pmatrix} 0_{\mathcal{H}_2} \\ -V_{\text{col}}^{(k+1)} \end{pmatrix} - \begin{pmatrix} 0_{\mathcal{H}_2} \\ 0_{\mathcal{H}_3} \end{pmatrix}\right]\\
&= \begin{pmatrix} Z_{\text{row}}^{(k)} + \rho(D_{\text{row}}U^{(k+1)} - V_{\text{row}}^{(k+1)}) \\ Z_{\text{col}}^{(k)} + \rho(U^{(k+1)}D_{\text{col}} - V_{\text{col}}^{(k+1)}) \end{pmatrix}
\end{align*}

Chapter 6

Multi-Rank Sparse and Functional PCA: Manifold Optimization and Iterative Deflation Techniques

The following was originally published as

M. Weylandt. “Multi-Rank Sparse and Functional PCA: Manifold Opti-

mization and Iterative Deflation Techniques.” CAMSAP 2019: Proceed-

ings of the IEEE 8th International Workshop on Computational Advances

in Multi-Sensor Adaptive Processing, pp. 500-504. 2019. doi: 10.1109/CAMSAP45676.2019.9022486. arXiv: 1907.12012

where it was selected for an Oral Presentation. The deflation techniques proposed

in Section 6.3 are implemented in the author's MoMA software available at https://github.com/DataSlingers/MoMA; the manifold optimization routines will be implemented in a future version of the software.

Abstract

We consider the problem of estimating multiple principal components using the re-

cently proposed Sparse and Functional Principal Components Analysis (SFPCA) es-

timator. We first propose an extension of SFPCA which estimates several principal

components simultaneously using manifold optimization techniques to enforce orthog-

onality constraints. While effective, this approach is computationally burdensome so we also consider iterative deflation approaches which take advantage of existing fast

algorithms for rank-one SFPCA. We show that alternative deflation schemes can more

efficiently extract signal from the data, in turn improving estimation of subsequent

components. Finally, we compare the performance of our manifold optimization and

deflation techniques in a scenario where orthogonality does not hold and find that

they still lead to significantly improved performance.

6.1 Introduction

Principal Components Analysis (PCA, [102]) is a widely used approach to finding low-dimensional patterns in complex data, enabling visualization, dimension reduction

(compression), and predictive modeling. While PCA performs well in a wide range of low-dimensional settings, its performance degrades rapidly in high dimensions, necessitating the use of regularized variants. Recently, Allen and Weylandt [28] proposed

Sparse and Functional PCA (SFPCA), a unified regularization scheme that allows

for simultaneous smooth (functional) and sparse estimation of both row and column

principal components (PCs). The rank-one SFPCA estimator is given by

\[
\arg\max_{u \in \mathcal{B}^n_{S_u},\, v \in \mathcal{B}^p_{S_v}} u^TXv - \lambda_uP_u(u) - \lambda_vP_v(v) \tag{6.1}
\]

where $P_u(\cdot)$ is a regularizer inducing sparsity in the row PCs, with strength controlled by $\lambda_u$; $\Omega_u$ is a positive semi-definite penalty matrix, typically a second- or fourth-order difference matrix; $S_u = I + \alpha_u\Omega_u$ is a smoothing matrix for the row PCs, with strength controlled by $\alpha_u$; and $\mathcal{B}^n_{S_u}$ is the unit ellipse of the $S_u$-norm, i.e., $\mathcal{B}^n_{S_u} = \{u \in \mathbb{R}^n : u^TS_uu \leq 1\}$. (Respectively, $P_v(\cdot)$, $\lambda_v$, $\Omega_v$, $\alpha_v$, $S_v$, and $\mathcal{B}^p_{S_v}$ for the column PCs.)

Allen and Weylandt [28] show that SFPCA unifies much of the existing regularized

PCA literature [25–27, 106, 107, 245] into a single framework, avoiding many pathologies of other approaches. Finally, they propose an efficient alternating maximization

gies of other approaches. Finally, they propose an efficient alternating maximization

scheme with guaranteed global convergence to solve the bi-concave SFPCA problem

(6.1). The SFPCA estimator only allows for a single pair of PCs to be estimated for

a given data matrix X. Allen and Weylandt suggest applying SFPCA repeatedly to the deflated data matrix if multiple PCs are desired. While this approach performs acceptably in practice, it loses the interpretable orthogonality properties of standard

PCA. In particular, the estimated PCs are no longer guaranteed to be orthogonal to each other, hindering the common interpretation of PCs as statistically independent sources of variance, or to the deflated data matrix, suggesting that additional signal remains uncaptured.

We extend the work of Allen and Weylandt [28] to address these shortcomings: first, in

Section 6.2, we modify the SFPCA estimator to simultaneously estimate several sparse

PCs subject to orthogonality constraints. The resulting estimator is constrained to

a product of generalized Stiefel manifolds and we propose three efficient algorithms

to solve the resulting manifold optimization problem. Next, in Section 6.3, we pro-

pose improved deflation schemes which provably remove all of the signal from the

data matrix, allowing for more accurate iterative estimation of sparse PCs. Finally, we demonstrate the improved performance of our manifold estimators and deflation

schemes in Section 6.4.

6.2 Manifold Optimization for SFPCA

One of the most attractive properties of PCA is the factors it extracts are orthogonal

($u_t^Tu_s = v_t^Tv_s = 1$ if $t = s$ and $0$ otherwise). Because of this, PCs can be interpreted

as separate sources of variance and, under an additional Gaussianity assumption,

statistically independent. While this follows directly from the properties of eigende-

compositions for standard PCA, it is much more difficult to obtain similar results for

sparse PCA. Some authors have suggested that there exists a fundamental tension

between orthogonality and sparsity, with Journée et al. [246] calling the goal of sparse

and orthogonal estimation “questionable.” Indeed ex post orthogonalization of sparse

PCs, e.g., using a Gram-Schmidt step, destroys any sparsity in the estimated PCs.

To avoid this, it is necessary to impose orthogonality directly in the estimation step, rather than trying to impose it afterwards.

We modify the SFPCA estimator to simultaneously estimate multiple PCs subject to an orthogonality constraint:

\[
\arg\max_{U \in \mathcal{V}^{S_u}_{n \times k},\, V \in \mathcal{V}^{S_v}_{p \times k}} \operatorname{Tr}(U^TXV) - \lambda_UP_U(U) - \lambda_VP_V(V) \tag{6.2}
\]

where $\mathcal{V}^{S_u}_{n \times k}$ is the generalized Stiefel manifold of order $k$ over $\mathbb{R}^n$, i.e.,

\[
U \in \mathcal{V}^{S_u}_{n \times k} \iff U \in \mathbb{R}^{n \times k} \text{ and } U^TS_uU = I_k.
\]

The generalized Stiefel manifold constraint ensures orthogonality of the estimated

PCs, while still allowing us to capture most of the variability in the data.∗ We note that, because we use a generalized Stiefel constraint, the estimated PCs will be

orthogonal with respect to $S_u$, i.e., $u_t^TS_uu_s = 0$ for $t \neq s$, rather than orthogonal in the standard sense. This is commonly observed for functional PCA variants [106, 107,

*When estimating orthogonal factors, it is common to re-express the problem using a (generalized) Grassmannian manifold constraint to avoid identifiability issues. We cannot use the Grassmannian approach here as sparse estimation (implicitly) fixes a single coordinate system.

245] and can be interpreted as orthogonality under the inner product generating the

Su-norm. If no roughness penalty is imposed (αu = 0 or αv = 0), then our method gives orthogonality in the standard sense.

To solve the Manifold SFPCA problem (6.2), we employ an alternating maximization

scheme, first holding V fixed while we update U and vice versa, as described in

Algorithm 4. Even with one parameter held fixed, the resulting sub-problems are

still difficult manifold optimization problems, which require iterative approaches to

obtain a solution [247–254].

Algorithm 4 Manifold SFPCA Algorithm
1. Initialize $\hat{U}$, $\hat{V}$ to the leading $k$ singular vectors of $X$
2. Repeat until convergence:
   (a) $U$-subproblem. Solve using Algorithm 5 or 6:
       $\hat{U} = \arg\min_{U \in \mathcal{V}^{S_u}_{n \times k}} -\operatorname{Tr}(U^TX\hat{V}) + \lambda_UP_U(U)$
   (b) $V$-subproblem. Solve using Algorithm 5 or 6, with $U$ and $V$ reversed:
       $\hat{V} = \arg\min_{V \in \mathcal{V}^{S_v}_{p \times k}} -\operatorname{Tr}(\hat{U}^TXV) + \lambda_VP_V(V)$
3. Return $\hat{U}$ and $\hat{V}$

Allen and Weylandt [28] developed a custom projected + proximal gradient algorithm

to solve the $u$- and $v$-subproblems of the rank-one SFPCA estimator. Assuming $P_U$ and $P_V$ are positive homogeneous (e.g., $P(\cdot) = \|A\cdot\|_p$ for arbitrary $p \geq 1$ and $A$), they establish convergence to a stationary point. In order to extend this idea to the multi-rank (manifold) case, we use the recently proposed Manifold Proximal Gradient

(ManPG) scheme of Chen et al. [251], detailed in Algorithm 5. ManPG proceeds

in two steps: first, we solve for a descent direction $D$ of the objective along the tangent space of the generalized Stiefel manifold, subject to the tangency constraint

of $D^TS_uU^{(k)}$ being skew-symmetric; second, back-tracking line search is used to determine a step size $\alpha$, after which the estimate is projected back onto the generalized

Stiefel manifold using a retraction anchored at the previous $U^{(k)}$. The retraction, which plays the same role as the projection step in the original SFPCA algorithm [28,

Algorithm 1], can be computed using a Cholesky factorization [254, Algorithm 3.1].

Chen et al. [252] showed that a single step of ManPG is sufficient to ensure convergence

despite the bi-concave objective: we refer to their approach as Alternating ManPG

(A-ManPG).

Algorithm 5 Manifold Prox. Gradient (ManPG) for $\hat{U}$-Subproblem
1. Initialize $U^{(k)} = \hat{U}$
2. Repeat until convergence:
   • Solve, subject to $D^TS_uU^{(k)} + (U^{(k)})^TS_uD = 0$:
     $\hat{D} = \arg\min_{D \in \mathbb{R}^{n \times k}} -\langle X\hat{V}, D\rangle_F + \lambda_UP_U(U^{(k)} + D)$
   • Select $\alpha$ by Armijo back-tracking
   • $U^{(k+1)} = \operatorname{Retr}_{U^{(k)}}(\alpha\hat{D})$
3. Return $\hat{U}$
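The retraction in the last step of Algorithm 5 can be computed from a small $k \times k$ Cholesky factorization. A minimal numpy sketch follows (illustrative name), assuming $Y^TS_uY$ is positive definite, which holds whenever $Y = U^{(k)} + \alpha\hat{D}$ has full column rank.

import numpy as np

def generalized_stiefel_retraction(Y, S):
    # Map Y onto the generalized Stiefel manifold {U : U^T S U = I_k} by
    # "whitening" with the Cholesky factor of the small k x k Gram matrix.
    L = np.linalg.cholesky(Y.T @ S @ Y)   # Y^T S Y = L L^T, L lower-triangular
    return np.linalg.solve(L, Y.T).T      # equals Y @ inv(L).T

# Usage within ManPG: U_next = generalized_stiefel_retraction(U_k + alpha * D_hat, S_u)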

Note that in general manifold proximal gradient schemes [251, 252] impose a max-

imum step-size to ensure that linearization of the smooth portion of the objective

actually leads to descent: because the smooth portion of our objective function is

linear in U and V , we can omit this term from Algorithm 5.

While efficient when tuned properly, we have found the performance of ManPG on the U- and V-subproblems quite sensitive to infeasibility in the descent direction. A

more robust scheme can be derived by using the Manifold ADMM (MADMM) scheme

of Kovnatsky et al. [250] to solve the subproblems. Like standard ADMM schemes,

MADMM allows us to split a problem into two parts, each of which can be solved

more easily than the global problem. When applied to the U- and V -subproblems,

the MADMM allows us to separate the manifold constraint from the sparsity inducing

regularizer, thereby side-stepping the orthogonality / sparsity tension at the heart of

this paper. After this splitting, the smooth update can be shown to be equivalent to

the unbalanced Procrustes problem [255, 256] with a closed-form update: $U^{(k+1)} = S_u^{-1/2}AB^T$, where $A\Delta B^T$ is the SVD of $S_u^{-1/2}X\hat{V} + \rho S_u^{1/2}(W^{(k)} - Z^{(k)})$. The sparse update is simply the proximal operator of $P_U(\cdot)$, typically a thresholding step [126, Chapter 6]. To the best of our knowledge, the convergence of MADMM has not yet

been established, but we have not observed significant non-convergence problems in

our experiments.

Algorithm 6 Manifold ADMM (MADMM) for $\hat{U}$-Subproblem
1. Initialize $U^{(k)} = W^{(k)} = \hat{U}$, $Z^{(k)} = 0$, and $k = 1$
2. Repeat until convergence:
   $U^{(k+1)} = \arg\min_{U \in \mathcal{V}^{S_u}_{n \times k}} -\operatorname{Tr}(U^TX\hat{V}) + \frac{\rho}{2}\|U - W^{(k)} + Z^{(k)}\|_F^2$
   $W^{(k+1)} = \arg\min_{W \in \mathbb{R}^{n \times k}} \lambda_UP_U(W) + \frac{\rho}{2}\|U^{(k+1)} - W + Z^{(k)}\|_F^2 = \operatorname{prox}_{\lambda_U/\rho\,P_U(\cdot)}\!\left(U^{(k+1)} + Z^{(k)}\right)$
   $Z^{(k+1)} = Z^{(k)} + U^{(k+1)} - W^{(k+1)}$
3. Return $\hat{U}$
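The orthogonality-constrained $U$-update in Algorithm 6 is the generalized unbalanced Procrustes problem solved in closed form in Appendix 6.A.3. A minimal numpy sketch of that closed form follows (function and variable names are illustrative; the matrix square root is computed directly, which assumes $S_u$ is of moderate size and symmetric positive definite).

import numpy as np
from scipy.linalg import sqrtm

def madmm_u_update(X, V_hat, W, Z, S_u, rho):
    # Closed-form generalized Procrustes solution:
    #   U = S_u^{-1/2} A B^T, where A Delta B^T is the SVD of
    #   S_u^{-1/2} X V_hat + rho * S_u^{1/2} (W - Z).
    S_half = np.real(sqrtm(S_u))          # PD square root of S_u
    S_inv_half = np.linalg.inv(S_half)
    M = S_inv_half @ X @ V_hat + rho * S_half @ (W - Z)
    A, _, Bt = np.linalg.svd(M, full_matrices=False)
    return S_inv_half @ A @ Bt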

6.3 Iterative Deflation for SFPCA

We next consider the use of iterative deflation schemes for multi-rank SFPCA. As dis-

cussed by Mackey [257], the attractive orthogonality properties of standard (Hotelling's) deflation depend critically on the estimated PCs being exact eigenvectors of the covariance matrix. Because the PCs estimated by sparse PCA schemes are almost surely not eigenvectors, Mackey [257] proposes several alternate deflation schemes which retain some of the attractive properties of Hotelling's deflation even when non-

eigenvectors are used. We extend these to the low-rank model and allow for deflation

by several PCs, possibly with non-unit norm, simultaneously, e.g., as produced by

ManSFPCA. To ease exposition, we first work in the vector setting and consider the

general case at the end of this section. The properties of our proposed deflation

schemes are summarized in Table 6.3.1.

The simplest deflation scheme is essentially that used by Hotelling [102], extended to

the low-rank model:

\[
X_t := X_{t-1} - d_tu_tv_t^T \quad \text{where} \quad d_t = u_t^TX_{t-1}v_t. \tag{HD}
\]

For two-way sparse PCA variants [26, 27], this deflation gives a deflated matrix which

is “two-way” orthogonal to the estimated PCs, i.e., $u_t^TX_tv_t = 0$. We may interpret this as Hotelling's deflation (HD) capturing all the signal jointly associated with the

pair (ut, vt).

We may also ask if HD captures all of the signal associated with $u_t$ or only the signal which is also associated with $v_t$. If HD captures all of the signal associated with $u_t$, then we would expect $u_t^TX_t\tilde{v} = 0$ for all $\tilde{v} \in \mathbb{R}^p$, or equivalently, $u_t^TX_t = 0_p$.

Interestingly, HD does not have this left-orthogonality property, suggesting that it

leaves additional ut-signal in the deflated matrix Xt.

HD fails to yield left- and right-orthogonality because it is not based on a projection operator. To address this in the covariance model, Mackey [257] proposed a deflation

scheme which projects the covariance matrix onto the orthogonal complement of the

estimated principal component. We extend this idea to the low-rank model by project-

ing the column- and row-space of the data matrix into the orthogonal complement of

the left- and right-PCs respectively, giving two-way projection deflation (PD):

\[
X_t := (I_n - u_tu_t^T)X_{t-1}(I_p - v_tv_t^T). \tag{PD}
\]

Unlike HD, PD captures all of the linear signal associated with ut and vt individually

($u_t^TX_t\tilde{v} = \tilde{u}^TX_tv_t = 0$ for all $\tilde{u} \in \mathbb{R}^n$, $\tilde{v} \in \mathbb{R}^p$).

If we use PD repeatedly, however, the multiply deflated matrix will not continue

to be orthogonal to the PCs: that is, $u_t^TX_{t+s} \neq 0$ for $s \geq 1$. This suggests that repeated application of PD can reintroduce signal in the direction of the PCs that we previously removed. This occurs because PD works by sequentially projecting

the data matrix, but in general the composition of two orthogonal projections is

not another orthogonal projection without additional assumptions. To address this,

Mackey [257] proposed a Schur complement deflation (SD) technique, which we now

extend to the low-rank (two-way) model:

\[
X_t := X_{t-1} - \frac{X_{t-1}v_tu_t^TX_{t-1}}{u_t^TX_{t-1}v_t}. \tag{SD}
\]

While Mackey motivates this approach using conditional distributions and a Gaussian-

ity assumption on X, it can also be understood as an alternate projection construct which is more robust to scaling and non-orthogonality.
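For concreteness, the three rank-one deflation schemes translate directly into a few lines of numpy. The sketch below uses illustrative function names and assumes u and v are 1-D arrays of the appropriate lengths.

import numpy as np

def hotelling_deflate(X, u, v):
    # HD: remove the signal jointly captured by the pair (u, v)
    d = u @ X @ v
    return X - d * np.outer(u, v)

def projection_deflate(X, u, v):
    # PD: project the column- and row-spaces onto the orthogonal complements of u and v
    n, p = X.shape
    return (np.eye(n) - np.outer(u, u)) @ X @ (np.eye(p) - np.outer(v, v))

def schur_deflate(X, u, v):
    # SD: Schur complement deflation; invariant to rescaling of u and v
    return X - np.outer(X @ v, u @ X) / (u @ X @ v)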

So far, we have only considered the behavior of the proposed deflation schemes for two-way sparse PCA. If we consider SFPCA in generality, however, ut and vt are unit vectors under the Su- and Sv-norms, not under the Euclidean norm. Consequently, the projections used by PD may not be actual projections and a PD deflated matrix may fail to be two- or one-way orthogonal. Normalizing the estimated PCs before deflation addresses this problem and is recommended in practice: conversely, because its deflation term is invariant under rescalings of ut and vt, SD works without renormalization.

                                                               Hotelling's (HD)   Projection (PD)   Schur (SD)
Two-Way Orthogonality ($u_t^TX_tv_t = 0$)                      ✓                  ✓                 ✓
One-Way Orthogonality ($u_t^TX_t, X_tv_t = 0$)                 ✗                  ✓                 ✓
Subsequent Orthogonality ($u_t^TX_{t+s}, X_{t+s}v_t = 0\ \forall s \geq 0$)   ✗    ✗                 ✓
Robust to Scale of $u_t, v_t$                                  ✗                  ✗                 ✓

Table 6.3.1 : Properties of Hotelling's Deflation (HD), Projection Deflation (PD), and Schur Complement Deflation (SD). Only SD captures all of the individual signal of each principal component without re-introducing signal at later iterations. Additionally, only SD allows for the non-unit-norm PCs estimated by SFPCA to be used without rescaling.

The extension of these techniques to the multi-rank case is straightforward. We give

the normalized variants here:

\begin{align*}
X_t &:= X_{t-1} - U_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T && \text{(HD)}\\
X_t &:= (I_n - U_t(U_t^TU_t)^{-1}U_t^T)X_{t-1}(I_p - V_t(V_t^TV_t)^{-1}V_t^T) && \text{(PD)}\\
X_t &:= X_{t-1} - X_{t-1}V_t(U_t^TX_{t-1}V_t)^{-1}U_t^TX_{t-1}. && \text{(SD)}
\end{align*}

As in the covariance model, if ut and vt are true singular vectors, all three deflation schemes are equivalent.
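The normalized multi-rank variants are equally simple to implement; a minimal numpy sketch with an illustrative function name, where U and V are the $n \times k$ and $p \times k$ blocks of estimated PCs:

import numpy as np

def multirank_deflate(X, U, V, scheme="SD"):
    # Normalized multi-rank Hotelling (HD), projection (PD), and
    # Schur complement (SD) deflation for the low-rank model.
    PU = U @ np.linalg.solve(U.T @ U, U.T)   # projection onto col(U)
    PV = V @ np.linalg.solve(V.T @ V, V.T)   # projection onto col(V)
    if scheme == "HD":
        return X - PU @ X @ PV
    if scheme == "PD":
        return (np.eye(X.shape[0]) - PU) @ X @ (np.eye(X.shape[1]) - PV)
    # SD: no normalization of U or V is needed
    return X - X @ V @ np.linalg.solve(U.T @ X @ V, U.T @ X)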

6.4 Simulation Studies

In this section, we compare the performance of Manifold SFPCA and the iterative

rank-one deflation schemes proposed above in illustrative simulation studies. Manifold

SFPCA using Manifold ADMM (Algorithm 6) to solve the subproblems achieves

better solutions than Manifold Proximal Gradient or A-ManPG (Algorithm 5) in less

time. Furthermore, despite the additional flexibility of the iterative rank-one variants,

Manifold SFPCA achieves both better signal recovery and a higher proportion of variance explained, even when the orthogonality assumptions are violated.

We first consider the relative performance of the three algorithms proposed for solving the Manifold SFPCA problem (6.2). We generate data $X^* = U^*D^*V^{*T} \in \mathbb{R}^{250 \times 100}$ with three distinct PCs: the left PCs ($U^*$) are localized sinusoids of varying frequency;

the right PCs (V ∗) are non-overlapping sawtooth waves. (See Figure 6.4.1.) We add

independent standard Gaussian noise (E) to give a signal-to-noise ratio (SNR) of

$\|X^*\|/\|E\| \approx 1.2$. We fix $\lambda_u = \lambda_v = 1$ and $\alpha_u = \alpha_v = 3$, which is near optimal for all three schemes. This is a favorable setting for Manifold SFPCA as the underlying

signals are orthogonal, sparse, smooth, and of comparable magnitude.

[Figure 6.4.1 panels: Scenario 1 ($U^*$ and $V^*$ orthogonal, SNR ≈ 1.2) and Scenario 2 ($U^*$ and $V^*$ not orthogonal, SNR ≈ 1.7); each panel compares the true Signal, the SVD estimate, and the ManSFPCA estimate.]

Figure 6.4.1 : Simulation Scenarios Used in Section 6.4. For both the left singular vectors (top row of each scenario) and the right singular vectors (bottom row of each scenario), ManSFPCA is able to recover the signal far more accurately than unregularized PCA (SVD). ManSFPCA is not identifiable up to change of order or sign.

Table 6.4.1 shows the performance of our three Manifold SFPCA algorithms on sev-

eral metrics, averaged over 100 replicates. Overall, the Manifold ADMM (MADMM

[250]) and Alternating Manifold Proximal Gradient (A-ManPG [252]) variants per-

form best, handily beating the Manifold Proximal Gradient scheme [251] on all mea-

sures. MADMM achieved the best objective value on every replicate.

In terms of signal recovery, MADMM achieves slightly better performance than A-

ManPG on the right singular vectors, while A-ManPG is slightly better on the left

singular vectors. Computationally, MADMM dominates both proximal gradient vari-

ants even though it requires many more matrix decompositions. The descent direction

subproblems of ManPG and A-ManPG are rather expensive to solve repeatedly and

their performance is very sensitive to the solver used. Overall, the MADMM variant

of Manifold SFPCA achieves the best optimization and statistical performance in far

less time than the proximal gradient-based variants.

Next, we compare Manifold SFPCA with the iterative deflation schemes proposed in

Section 6.3 in two different scenarios: the favorable scenario used above and a less-

favorable scenario where the true PCs are shifted and no longer orthogonal. (n = p =

100 and $\|(U^*)^TU^* - I_3\|, \|(V^*)^TV^* - I_3\| \approx 0.37$.) We compare the proportion of variance explained using Manifold SFPCA with iterative rank-one SFPCA using the

normalized Hotelling, Projection, and Schur Complement deflation strategies. As can

be seen in Table 6.4.2, PD and SD consistently dominate HD. Because PD and SD fully remove the signal associated with estimated PCs, the subsequent PCs are able to capture different signals and explain a larger fraction of variance. By ensuring that the signal is never re-introduced, SD does even better than PD as we consider higher ranks. Interestingly, while PD and SD perform about as well when the underlying

signals are orthogonal, SD performs much better in the non-orthogonal scenario. By

estimating all three PCs simultaneously, Manifold SFPCA is able to find a better set

of PCs than any of the greedy deflation methods.

                          MADMM     ManPG     A-ManPG
Time (s)                  19.46     217.40    175.99
Suboptimality             0         4.58      12.23
rSS-Error     U           68.66%    74.03%    64.54%
              V           36.85%    50.69%    43.17%
TPR           U           87.75%    69.72%    89.99%
              V           94.75%    70.30%    89.87%
FPR           U           12.25%    30.28%    10.01%
              V           5.25%     29.70%    10.13%

Table 6.4.1 : Comparison of Manifold-ADMM, Manifold Proximal Gradient, and Alternating Manifold Proximal Gradient approaches for Manifold SFPCA (6.2). MADMM is consistently more efficient and obtains better solutions, but both MADMM and A-ManPG perform well in terms of signal recovery, as measured by relative subspace recovery error (rSS-Error $= \|\hat{U}\hat{U}^T - U^*(U^*)^T\| / \|\hat{U}_{\text{SVD}}\hat{U}_{\text{SVD}}^T - U^*(U^*)^T\|$), true positive rate (TPR), and false positive rate (FPR).

6.5 Discussion

We have introduced two practical extensions to Sparse and Functional PCA: first, we presented a multi-rank scheme which estimates multiple PCs simultaneously and

proposed algorithms to solve the resulting manifold optimization problems. The

resulting estimator inherits many of the attractive properties of rank-one SFPCA and

is, to the best of our knowledge, the first multi-rank PCA scheme for the low-rank

model. ManSFPCA combines both the flexibility and superior statistical performance

of rank-one SFPCA with the superior interpretability of orthogonal (non-regularized)

PCA.

Scenario 1: U* and V* Orthogonal
                  HD        PD        SD        ManSFPCA
CPVE    PC1       15.92%    21.05%    21.87%
        PC2       22.21%    29.42%    30.59%    37.12%
        PC3       26.80%    35.57%    37.09%

Scenario 2: U* and V* Not Orthogonal
CPVE    PC1       8.85%     19.74%    29.80%
        PC2       13.03%    28.30%    39.87%    50.85%
        PC3       16.16%    34.22%    46.48%

Table 6.4.2 : Cumulative Proportion of Variance Explained (CPVE) of Rank-One SFPCA with (normalized) Hotelling, Projection, and Schur Complement Deflation and of (order 3) Manifold SFPCA. SD gives the best CPVE of the iterative approaches and appears to be more robust to violations of orthogonality (Scenario 2). Manifold SFPCA outperforms the iterative methods in both scenarios.

Secondly, we re-considered the use of Hotelling’s deflation, and developed two ad-

ditional deflation schemes which have attractive theoretical properties and empirical

performance. Our schemes extend the results of Mackey [257] in several ways: they

allow for deflation by multiple PCs in a single step, they are applicable to the low-rank model and the covariance model, and they are robust to non-orthogonality and

rank model and the covariance model, and they are robust to non-orthogonality and

non-unit-scaling common to Functional PCA variants. While developed for SFPCA,

these deflation schemes are useful for any regularized PCA model.

We note here that our results can be extended naturally to other multivariate analysis

techniques which can be expressed in a regularized SVD framework (e.g., PLS, CCA, etc.). Our deflation approaches can be also extended to the higher-order / multi-way context and may be particularly useful in the context of regularized CP decomposi- tions [258, 259]. In the tensor setting, ManSFPCA is a sparse and smooth version

of a Tucker decomposition, which suggests several interesting extensions we leave for

future work [260, 261].

Appendix to Chapter 6

6.A Proofs

6.A.1 Deflation Schemes

In this section, we give proofs of the claimed properties for the deflation schemes

discussed in Section 6.3 and summarized in Table 6.3.1.

We first consider Hotelling’s deflation scheme (HD): given estimated left- and right-

singular vectors $u_t$, $v_t$ of an $n$-by-$p$ matrix $X_{t-1}$, two-way orthogonality is immediate:

\begin{align*}
u_t^TX_tv_t &= u_t^T(X_{t-1} - d_tu_tv_t^T)v_t\\
&= u_t^TX_{t-1}v_t - d_tu_t^Tu_tv_t^Tv_t\\
&= d_t - d_t\|u_t\|^2\|v_t\|^2
\end{align*}

If $\|u_t\|^2 = \|v_t\|^2 = 1$, then this gives the desired two-way orthogonality. From here, we

clearly see that Hotelling’s deflation only gives two-way orthogonality if ut and vt are unit-scaled, showing that it is not robust to non-unit scaling. Turning our attention

to one-way orthogonality, we give a counter-example using the data matrix
\[
X = \begin{pmatrix} 2 & -4/3\\ 2 & 2/3\\ 1 & 4/3 \end{pmatrix}
= \underbrace{\begin{pmatrix} 2/3 & -2/3 & 1/3\\ 2/3 & 1/3 & -2/3\\ 1/3 & 2/3 & 2/3 \end{pmatrix}}_{=U}
\underbrace{\begin{pmatrix} 3 & 0 & 0\\ 0 & 2 & 0\\ 0 & 0 & 1 \end{pmatrix}}_{=D}
\underbrace{\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \end{pmatrix}^T}_{=V^T}.
\]
If we take $u_1 = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \end{pmatrix}^T$ as a sparse left singular vector and $v_1 = \begin{pmatrix} 1 & 0 \end{pmatrix}^T$ the true right singular vector, then Hotelling's deflation gives $d_1 = \sqrt{8}$ and
\[
X_1 = \begin{pmatrix} 0 & -4/3\\ 0 & 2/3\\ 1 & 4/3 \end{pmatrix}.
\]
The deflated $X_1$ is not left-orthogonal to $u_1$, however, as $u_1^TX_1 = \begin{pmatrix} 0 & -\sqrt{2}/3 \end{pmatrix}$, showing that Hotelling's deflation is not one-way orthogonal.

Next we consider the projection deflation scheme (PD). Left orthogonality can be

shown explicitly:

\begin{align*}
u_t^TX_t &= u_t^T(I_n - u_tu_t^T)X_{t-1}(I_p - v_tv_t^T)\\
&= (u_t^T - \|u_t\|^2u_t^T)X_{t-1}(I_p - v_tv_t^T)
\end{align*}

which is clearly zero for arbitrary $v_t$ if and only if $\|u_t\| = 1$. Essentially the same

argument shows that projection deflation is right orthogonal if and only if $\|v_t\| = 1$, and two-way orthogonality follows from either left or right orthogonality. Hence

projection deflation is both one- and two-way orthogonal but it is also sensitive to

the unit scaling of the left and right principal components. Turning to subsequent

orthogonality, we now take

  − −  2 3/2 1     8/3 1/6 1/3   X =    0 5/2 1   2/3 7/6 7/3       −  1/2 1/2 1/2 1/2 4 0 0 0  2/3 2/3 1/3              1/2 −1/2 1/2 1/2 0 3 0 0 −2/3 1/3 2/3       =       .  1/2 1/2 −1/2 1/2 0 0 2 0  1/3 −2/3 2/3       1/2 1/2 1/2 −1/2 0 0 0 1 0 0 0 | {z } | {z } | {z } =U =D =V T     T √ √ T We take u1 = 1/2 1/2 1/2 1/2 and v1 = 1/ 2 1/ 2 0 as a pair of sparse PCs to yield the deflated matrix:

  − −  1/8 1/8 1/6      11/8 −11/8 −5/6   X1 =   −9/8 9/8 −1/6   −1/8 1/8 7/6

  T The leading singular pair of X1 are approximately −0.0447 −0.1673 0.8620 0.4765     T T and 0.6731 0.2168 −0.7071 , which we sparsely approximate by u2 = 0 0 4/5 3/5 162   √ √ and v2 = 1/ 2 0 1/ 2 respectively. Another round of PD gives

   0.0417 0.2917 0      2.2083 −0.5417 0 ≈   X2    0.2750 0.9650 0   −0.3667 −1.2867 0

  T ≈ which has u1 X2 1.0792 −0.2858 0 which is clearly non-zero.

Finally, we consider Schur complement deflation (SD). As with projection deflation,

left- and right-orthogonality (and hence two-way orthogonality), can be shown explic-

itly:

\begin{align*}
u_t^TX_t &= u_t^T\left(X_{t-1} - \frac{X_{t-1}v_tu_t^TX_{t-1}}{u_t^TX_{t-1}v_t}\right)\\
&= u_t^TX_{t-1} - \frac{(u_t^TX_{t-1}v_t)\,u_t^TX_{t-1}}{u_t^TX_{t-1}v_t}\\
&= 0
\end{align*}

which holds without any further restrictions on $u_t$ or $v_t$, showing that Schur complement deflation is robust to unit scaling of estimated principal components. Subsequent left orthogonality can be shown by induction using the above result as the base

case: for the inductive step, suppose $u_t^TX_{t+s} = 0$ for some $s$; then
\begin{align*}
u_t^TX_{t+s+1} &= u_t^T\left(X_{t+s} - \frac{X_{t+s}v_{t+s+1}u_{t+s+1}^TX_{t+s}}{u_{t+s+1}^TX_{t+s}v_{t+s+1}}\right)\\
&= u_t^TX_{t+s} - \frac{(u_t^TX_{t+s})v_{t+s+1}\,u_{t+s+1}^TX_{t+s}}{u_{t+s+1}^TX_{t+s}v_{t+s+1}}\\
&= 0 - \frac{0\,v_{t+s+1}\,u_{t+s+1}^TX_{t+s}}{u_{t+s+1}^TX_{t+s}v_{t+s+1}} = 0
\end{align*}
as desired.

6.A.2 Simultaneous Multi-Rank Deflation

In this section, we extend the results of the previous section to allow for simultaneous deflation by several principal components at once. The properties of the multi-rank deflation schemes are essentially the same as those of their rank-one counterparts. The normalized deflation schemes are given by:

\begin{align*}
X_t &:= X_{t-1} - U_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T && \text{(HD)}\\
X_t &:= (I_n - U_t(U_t^TU_t)^{-1}U_t^T)X_{t-1}(I_p - V_t(V_t^TV_t)^{-1}V_t^T) && \text{(PD)}\\
X_t &:= X_{t-1} - X_{t-1}V_t(U_t^TX_{t-1}V_t)^{-1}U_t^TX_{t-1}. && \text{(SD)}
\end{align*}

As before, we see that Hotelling’s deflation gives two-way orthogonality:

\begin{align*}
U_t^TX_t^{\text{HD}}V_t &= U_t^T\left(X_{t-1} - U_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T\right)V_t\\
&= U_t^TX_{t-1}V_t - U_t^TU_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^TV_t\\
&= U_t^TX_{t-1}V_t - U_t^TX_{t-1}V_t\\
&= 0.
\end{align*}

We do not get one-way orthogonality unless $V_t(V_t^TV_t)^{-1}V_t^T = I$:

\begin{align*}
U_t^TX_t^{\text{HD}} &= U_t^T\left(X_{t-1} - U_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T\right)\\
&= U_t^TX_{t-1} - U_t^TU_t(U_t^TU_t)^{-1}U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T\\
&= U_t^TX_{t-1} - U_t^TX_{t-1}V_t(V_t^TV_t)^{-1}V_t^T.
\end{align*}

For projection deflation, we prove (left) one-way orthogonality as before:

  T PD T − T −1 T − T −1 T Ut Xt = Ut (In Ut(Ut Ut) Ut )Xt−1(Ip Vt(Vt Vt) Vt )

T − T T −1 T − T −1 T = (Ut Ut Ut(Ut Ut) Ut )Xt−1(Ip Vt(Vt Vt) Vt )

T − T − T −1 T = (Ut Ut )Xt−1(Ip Vt(Vt Vt) Vt )

= 0.

An essentially identical argument proves right one-way orthogonality. Either form of one-way orthogonality proves two-way orthogonality a fortiori. Considering (left) subsequent orthogonality,

$$U_t^TX_{t+1} = U_t^T\left(I - U_{t+1}(U_{t+1}^TU_{t+1})^{-1}U_{t+1}^T\right)X_t\left(I - V_{t+1}(V_{t+1}^TV_{t+1})^{-1}V_{t+1}^T\right)$$

we see that it cannot hold unless $U_t$ either lies in the (left) range-space of $U_{t+1}$ or commutes with $\left(I - U_{t+1}(U_{t+1}^TU_{t+1})^{-1}U_{t+1}^T\right)$, neither of which holds in general. Interestingly, the matrix form makes the necessity of normalization clear for HD and PD: the unnormalized form essentially assumes $U_t^TU_t = I$, which will not hold unless we are in the non-smoothed case ($\alpha_u = 0$).

Finally, for Schur complement deflation, we get left-orthogonality via the same argu-

ment as the vector case:

$$\begin{aligned}
U_t^TX_t^{SD} &= U_t^T\left(X_{t-1} - X_{t-1}V_t(U_t^TX_{t-1}V_t)^{-1}U_t^TX_{t-1}\right) \\
&= U_t^TX_{t-1} - U_t^TX_{t-1}V_t(U_t^TX_{t-1}V_t)^{-1}U_t^TX_{t-1} \\
&= U_t^TX_{t-1} - I\,U_t^TX_{t-1} \\
&= 0.
\end{aligned}$$

Taking this as the base case for subsequent orthogonality, we inductively assume

$U_t^TX_{t+s} = 0$ and consider $U_t^TX_{t+s+1}$:

$$\begin{aligned}
U_t^TX_{t+s+1} &= U_t^T\left(X_{t+s} - X_{t+s}V_{t+s+1}(U_{t+s+1}^TX_{t+s}V_{t+s+1})^{-1}U_{t+s+1}^TX_{t+s}\right) \\
&= U_t^TX_{t+s} - U_t^TX_{t+s}V_{t+s+1}(U_{t+s+1}^TX_{t+s}V_{t+s+1})^{-1}U_{t+s+1}^TX_{t+s} \\
&= 0 - 0\,V_{t+s+1}(U_{t+s+1}^TX_{t+s}V_{t+s+1})^{-1}U_{t+s+1}^TX_{t+s} \\
&= 0
\end{aligned}$$

showing that SD indeed gives subsequent orthogonality. As with the vector case,

SD is not sensitive to normalization, as indicated by the fact that it has no $(U_t^TU_t)^{-1}$ or $(V_t^TV_t)^{-1}$ terms.

6.A.3 Solution of the Generalized Unbalanced Procrustes Problem

The $U$-update in the Manifold ADMM scheme for the $\hat{U}$-subproblem (Algorithm 6)

requires us to solve the following problem:

$$\hat{U} = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \frac{\rho}{2}\|U - W^{(k)} + Z^{(k)}\|_{S_u}^2$$
which has the closed-form solution:

$$\hat{U} = S_u^{-1/2}AB^T \quad\text{where } A, \Delta, B = \operatorname{SVD}\left(S_u^{-1/2}X\hat{V} + \rho S_u^{1/2}(W^{(k)} - Z^{(k)})\right),$$
as shown by the following theorem.

Theorem 6.1. Suppose Su is a strictly positive definite n × n matrix and A, B are full (column) rank matrices of size n × k for k < n. Then the solution to

$$\hat{X} = \operatorname*{arg\,min}_{X \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(X^TA) + \frac{\rho}{2}\|X - B\|_{S_u}^2$$
is given by

$$X = S_u^{-1/2}UV^T$$
where $UDV^T$ is the (economical) SVD of $S_u^{-1/2}A + \rho S_u^{1/2}B$.

We note that this result is a (slight) generalization of the well-studied Procrustes problem first considered by Schönemann [255] in the orthogonal case and extended

to the unbalanced (Stiefel manifold) case by Eldén and Park [256]. We modify their

result to the generalized Stiefel manifold, though we do not consider the additional

complexities associated with rank-deficient A, B, or Su matrices as they do not apply

to our problem.

Proof. Let $Y = S_u^{1/2}X$ so that $X^TS_uX = I \Leftrightarrow Y^TY = I$. Then we can rewrite the above problem as:

$$\begin{aligned}
\hat{X} &= \operatorname*{arg\,min}_{X \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(X^TA) + \frac{\rho}{2}\|X - B\|_{S_u}^2 \\
&= \operatorname*{arg\,min}_{X \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}\left((S_u^{1/2}X)^T(S_u^{-1/2}A)\right) + \frac{\rho}{2}\|S_u^{1/2}X - S_u^{1/2}B\|_F^2 \\
\hat{Y} &= \operatorname*{arg\,min}_{Y \in \mathcal{V}_{n\times k}} -\operatorname{Tr}\left(Y^T(S_u^{-1/2}A)\right) + \frac{\rho}{2}\|Y - S_u^{1/2}B\|_F^2
\end{aligned}$$

From here, we apply Lemma 6.2 to obtain:

$$\hat{Y} = UV^T \quad\text{where } U, D, V = \operatorname{SVD}(S_u^{-1/2}A + \rho S_u^{1/2}B)$$

and hence

$$\hat{X} = S_u^{-1/2}UV^T \quad\text{where } U, D, V = \operatorname{SVD}(S_u^{-1/2}A + \rho S_u^{1/2}B).$$

Lemma 6.2. Suppose A, B are full (column) rank matrices of size n × k for k < n.

Then the solution to

$$\hat{X} = \operatorname*{arg\,min}_{X \in \mathcal{V}_{n\times k}} -\operatorname{Tr}(X^TA) + \frac{\rho}{2}\|X - B\|_F^2$$
is given by

$\hat{X} = UV^T$ where $UDV^T$ is the (economical) SVD of $A + \rho B$.

Proof. Letting $\langle A, B\rangle_F = \operatorname{Tr}(A^TB)$ be the standard (Frobenius) inner product, the above problem becomes:

$$\begin{aligned}
\hat{X} &= \operatorname*{arg\,min}_{X \in \mathcal{V}_{n\times k}} -\operatorname{Tr}(X^TA) + \frac{\rho}{2}\|X - B\|_F^2 \\
&= \operatorname*{arg\,min}_{X \in \mathcal{V}_{n\times k}} -\langle X, A\rangle_F + \frac{\rho}{2}\langle X - B, X - B\rangle_F \\
&= \operatorname*{arg\,min}_{X \in \mathcal{V}_{n\times k}} -\langle X, A\rangle_F + \frac{\rho}{2}\underbrace{\|X\|_F^2}_{=k} - \rho\langle X, B\rangle_F + \frac{\rho}{2}\underbrace{\|B\|_F^2}_{\text{constant}}
\end{aligned}$$

$$= \operatorname*{arg\,max}_{X \in \mathcal{V}_{n\times k}} \langle X, A + \rho B\rangle_F$$

Let UDV T be the SVD of A + ρB. Then

$$\langle X, A + \rho B\rangle_F = \langle X, UDV^T\rangle_F = \langle U^TXV, D\rangle_F$$

Since $D$ is a diagonal matrix with non-negative entries, this inner product is maximized when $U^TXV$ is an identity matrix, which is obtained by taking $X = UV^T$.
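For concreteness, the closed form of Theorem 6.1 is straightforward to evaluate numerically. The following NumPy/SciPy sketch (a minimal illustration under the theorem's assumptions, not the thesis software) computes $\hat{X} = S_u^{-1/2}UV^T$ and verifies feasibility on the generalized Stiefel manifold.

```python
import numpy as np
from scipy.linalg import sqrtm

def generalized_procrustes(A, B, S, rho):
    """Solve arg min_{X^T S X = I} -Tr(X^T A) + (rho/2)||X - B||_S^2 (cf. Theorem 6.1)."""
    S_half = np.real(sqrtm(S))                 # S^{1/2}; S assumed strictly positive definite
    S_inv_half = np.linalg.inv(S_half)         # S^{-1/2}
    M = S_inv_half @ A + rho * S_half @ B      # matrix whose SVD drives the solution
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return S_inv_half @ U @ Vt                 # X = S^{-1/2} U V^T

# Sanity check: the solution lies on the generalized Stiefel manifold X^T S X = I_k
rng = np.random.default_rng(1)
n, k = 8, 3
A, B = rng.standard_normal((n, k)), rng.standard_normal((n, k))
C = rng.standard_normal((n, n)); S = C @ C.T + n * np.eye(n)
X = generalized_procrustes(A, B, S, rho=2.0)
print(np.allclose(X.T @ S @ X, np.eye(k)))     # True
```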

6.B Algorithmic Details

In this section, we give additional details of the Algorithms used to solve the Manifold

SFPCA (6.2) problem:

$$\operatorname*{arg\,max}_{U \in \mathcal{V}^{S_u}_{n\times k},\ V \in \mathcal{V}^{S_v}_{p\times k}} \operatorname{Tr}(U^TXV) - \lambda_UP_U(U) - \lambda_VP_V(V)$$

6.B.1 Manifold Proximal Gradient

Manifold proximal gradient [251, 252] proceeds in two steps: first, a descent direction within the tangent space is identified; secondly, a step along the descent direc-

tion is taken, with the step size chosen by Armijo-type [262] back-tracking. (The line

search method we use is essentially that of Beck and Teboulle [263]; see also Parikh

and Boyd [77, Section 4.2].) Because the step in the descent direction is taken in the

ambient space rather than along the manifold in question (e.g., an unconstrained step in $\mathbb{R}^{n\times k}$ rather than a geodesic move along $\mathcal{V}_{n\times k}$), a retraction step is used to project back onto the manifold and preserve feasibility.
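One concrete retraction for the generalized Stiefel manifold $\{U : U^TS_uU = I\}$ is the polar-type map $\operatorname{Retr}_U(D) = (U+D)\left((U+D)^TS_u(U+D)\right)^{-1/2}$; the NumPy sketch below illustrates it. (This particular retraction is our illustrative choice, assumed rather than prescribed by the references above.)

```python
import numpy as np
from scipy.linalg import sqrtm

def polar_retraction(U, D, S):
    """Map the ambient step U + D back onto {U : U^T S U = I}."""
    W = U + D                                     # ambient-space step
    G = W.T @ S @ W                               # Gram matrix in the S inner product
    return W @ np.real(np.linalg.inv(sqrtm(G)))   # W (W^T S W)^{-1/2}

rng = np.random.default_rng(2)
n, k = 6, 2
C = rng.standard_normal((n, n)); S = C @ C.T + n * np.eye(n)
U0 = polar_retraction(rng.standard_normal((n, k)), 0.0, S)    # a feasible starting point
U1 = polar_retraction(U0, 0.1 * rng.standard_normal((n, k)), S)
print(np.allclose(U1.T @ S @ U1, np.eye(k)))                  # True: feasibility preserved
```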

Both steps require further discussion: we first consider identifying the descent direc-

tion, which requires solving the following problem

$$\hat{D}_U = \operatorname*{arg\,min}_{D_U \in \mathbb{R}^{n\times k}} -\langle X\hat{V}, D_U\rangle_F + \lambda_UP_U(U^{(k)} + D_U) \quad\text{subject to } D_U^TS_uU^{(k)} + (U^{(k)})^TS_uD_U = 0.$$

The constraint arises from the tangent space of the generalized Stiefel manifold [252,

Appendix A.1], while the objective is essentially that of the overall problem. (Note

that, because the smooth portion of the objective is already linear, we do not need

to linearize it, unlike Chen et al. [251, 252].) The tangency constraint makes this

problem non-trivial to solve, but it can be reformulated as a linearly constrained

quadratic program by splitting DU into positive and negative parts and solved using standard approaches. Chen et al. [251, 252] recommend the use of a semi-smooth

Newton method to solve this problem [264, 265], though we used the generic SDPT3

solver of Toh et al. [266–268] in our timing experiments.

Expanding Algorithm 4 with Algorithm 5 for the subproblems, we obtain Algorithm 7 below. Because Algorithm 7 is initialized at an infeasible pair $(\hat{U}, \hat{V})$ (unless $\alpha_u = \alpha_v = 0$), the first iteration almost always decreases the objective value; after that, however, each sub-problem typically increases the objective value. (If $\lambda_u = \lambda_v = 0$, this is guaranteed because the subproblems have a unique global optimum, but we cannot prove monotonicity in general.)

Chen et al. [252] suggested an interesting variant of this approach which requires

only a single proximal gradient step at each iteration. Applying this approach to our

problem yields Algorithm 8. Somewhat surprisingly, they are able to give convergence

guarantees for this approach, even though it has both non-convex constraints and a

non-convex (bi-convex) objective. As far as we know, this is the only one of our

algorithms to have provable convergence.

6.B.2 Manifold ADMM

If we expand Algorithm 4 using Algorithm 6 (Manifold ADMM [250]) to solve the sub-

problems, we obtain Algorithm 9. Note that the result of Section 6.A.3 is used to solve

the $U^{(k+1)}$ and $V^{(k+1)}$ updates appearing in Steps 2(a)(ii) and 2(b)(ii), respectively. Because Algorithm 9 is initialized at an infeasible pair $(\hat{U}, \hat{V})$ (unless $\alpha_u = \alpha_v = 0$), the first iteration almost always decreases the objective value; after that, however,

each sub-problem typically increases the objective value. (If λu = λv = 0, this is guaranteed because the subproblems have a unique global optimum, but we cannot

prove monotonicity in general.) Unlike Algorithms 7 and 8, Algorithm 9 contains

an explicit thresholding step, which we have found improves convergence to exact

zeros.
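As a minimal illustration of the inner ADMM loop used in Algorithm 9 below, the following sketch alternates the manifold-constrained update (solved in closed form via Theorem 6.1, reusing the `generalized_procrustes` helper sketched in Section 6.A.3), an elementwise soft-thresholding proximal step for an $\ell_1$ penalty, and a scaled dual update. It illustrates the structure only and is not the packaged implementation.

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def madmm_u_subproblem(X, V_hat, S_u, lam, rho=1.0, n_iter=100):
    """ADMM sketch for min_{U^T S_u U = I} -Tr(U^T X V_hat) + lam * ||U||_1.

    The U-step is solved via the closed form of Theorem 6.1 (S_u-weighted quadratic term);
    generalized_procrustes is the helper defined in the Section 6.A.3 sketch.
    """
    n, k = X.shape[0], V_hat.shape[1]
    W = np.zeros((n, k)); Z = np.zeros((n, k))
    for _ in range(n_iter):
        U = generalized_procrustes(X @ V_hat, W - Z, S_u, rho)   # manifold-constrained update
        W = soft_threshold(U + Z, lam / rho)                     # prox step on the penalty
        Z = Z + U - W                                            # scaled dual update
    return U, W
```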

6.B.3 Additional Note on Identifiability

As written, the Manifold SFPCA problem (6.2) suffers from two forms of non-identifiability which may impede convergence, at least in the typical case with $P_U$ and $P_V$ elementwise $\ell_1$-penalties [1]. In particular, if the columns of $U$ and $V$ are simultaneously permuted (i.e., $U \to US_k$ and $V \to VS_k$ for some permutation matrix $S_k$) or if the signs of corresponding columns of $U$ and $V$ are simultaneously flipped, the objective is unchanged.

To address the first ambiguity (permutation invariance), the smooth term may be replaced by $\operatorname{Tr}(U^TXVD)$ where $D$ is a diagonal matrix with elements $(1+\epsilon)^{k-1}, (1+\epsilon)^{k-2}, \ldots, 1$ to ensure that the leading PCs are indeed placed first. Alternatively, both problems may be addressed by post-processing the $(U, V)$ iterates at each step and putting them in a canonical form. A simple canonicalization that we have found works well is to sort the columns of $U$ lexicographically by absolute value and to set the signs so that the first column of $U$ has as many positive elements as possible.
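A minimal sketch of such a canonicalization (our reading of the rule above, not thesis code): sort the column pairs of $(U, V)$ lexicographically by the absolute values of $U$, then flip paired column signs so that each column of $U$ is predominantly non-negative.

```python
import numpy as np

def canonicalize(U, V):
    """Resolve permutation and paired-sign ambiguity in a (U, V) pair of sparse PCs."""
    order = np.lexsort(np.abs(U)[::-1])     # lexicographic on |U|; first row is the primary key
    U, V = U[:, order], V[:, order]
    for j in range(U.shape[1]):
        # Flip the paired signs so each column of U has at least as many positive entries
        if np.sum(U[:, j] > 0) < np.sum(U[:, j] < 0):
            U[:, j] *= -1.0
            V[:, j] *= -1.0
    return U, V
```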

6.C Additional Experimental Results

In this section, we give additional details of the simulations performed in Section

6.4.

The timing comparisons of the first simulation (Table 6.4.1) are rather sensitive to

the specific sub-problem solvers and matrix decomposition subroutines used. For

the problem considered, on average MADMM required 30,659 rank-3 SVDs before

convergence; ManPG required 439 descent direction solves and 2,259 retractions; and

A-ManPG required 366 descent direction solves and 1,952 retractions. Because the

cost of an SVD and a retraction (a QR decomposition) are roughly comparable, it is

clear that the cost of solving the descent direction subproblem dominates the proximal

gradient schemes and that they will be quite sensitive to the solver used. Recently,

Huang and Wei [269] proposed an acceleration scheme for ManPG which could be

used here to improve convergence speeds.

Table 6.C.1 gives an extended version of Table 6.4.2, giving additional details on the

true and false positive rates of each deflation scheme, as well as measuring subspace

recovery performance. For ManSFPCA, the MADMM approach was used to compute

the solution, and tuning parameters were fixed as λu = λv = 1 and αu = αv = 3. For the iterative deflation schemes, all four tuning parameters were chosen to maximize

the BIC, using the adaptive tuning scheme recommended by Allen and Weylandt

[28].

As discussed above, ManSFPCA is able to achieve a greater proportion of variance

explained than any of the iterative deflation methods. In terms of subspace recovery,

ManSFPCA is far more accurate than the iterative schemes or than an unregularized

SVD (indicated by rSS-Error less than 100%). In terms of variable selection, the BIC

scheme used to tune the iterative schemes performs as expected, yielding almost no

false positives, at the expense of many false negatives (low TPR). ManSFPCA, conversely, with fixed tuning parameters, has a few more false positives but recovers much

more of the true signal than any of the deflation schemes, though this is somewhat

sensitive to the choice of tuning parameters. In the orthogonal scenario, the different

deflation schemes achieve essentially the same results: in the non-orthogonal scenario,

the fuller deflation performed by PD and SD gives a better TPR than Hotelling’s de-

flation.

6.D Additional Background

The literature on regularized PCA variants is vast and we refer the reader to Appendix

C of Allen and Weylandt [28] for a review. (Appendix C can be found in the online Supplementary Materials, available at https://arxiv.org/abs/1309.2895.)

                         Scenario 1: U* and V* Orthogonal          Scenario 2: U* and V* Not Orthogonal
                         HD        PD        SD        ManSFPCA    HD        PD        SD        ManSFPCA
CPVE        PC1          15.92%    21.05%    21.87%                8.85%     19.74%    29.80%
            PC2          22.21%    29.42%    30.59%    37.12%      13.03%    28.30%    39.87%    50.85%
            PC3          26.80%    35.57%    37.09%                16.16%    34.22%    46.48%
rSS-Error   U            129.54%   129.55%   128.35%   69.32%      215.73%   206.30%   205.74%   97.77%
            V            143.01%   143.72%   141.15%   36.98%      211.15%   207.77%   204.38%   78.26%
TPR         U            54.57%    54.29%    54.63%    87.72%      11.31%    13.11%    15.28%    88.91%
            V            59.20%    58.84%    59.41%    94.72%      12.89%    15.33%    18.50%    80.08%
FPR         U            0.92%     0.91%     0.92%     12.28%      1.24%     1.05%     1.18%     11.09%
            V            0.61%     0.60%     0.60%     5.28%       2.80%     2.72%     2.79%     19.92%

Table 6.C.1 : Extended version of Table 6.4.2, comparing Cumulative Proportion of Variance Explained (CPVE), relative subspace recovery error (rSS-Error $= \|\hat{U}\hat{U}^T - U^*(U^*)^T\| / \|\hat{U}_{\mathrm{SVD}}\hat{U}_{\mathrm{SVD}}^T - U^*(U^*)^T\|$), true positive rate (TPR), and false positive rate (FPR) of Manifold SFPCA with Rank-One SFPCA using (normalized) Hotelling, Projection, and Schur Complement Deflation. For ManSFPCA, which estimates all three PCs simultaneously, a single overall CPVE is reported.

As they note, the vast majority of regularized PCA methods use an iterative deflation approach, typically that of Hotelling

[102], though the deflations of Mackey [257] can be applied to any method which

uses the covariance model and the results of Section 6.3 can be used for any which

uses the low-rank model. While mainly focusing on the rank-one case, Journée et

al. [246] show how their modified power algorithm can be extended to simultaneously

estimate several orthogonal PCs. Benidis et al. [253] give a clever MM-type algorithm

for finding orthogonal sparse PCs, based on the iteratively reweighted ℓ1-methods first proposed by Candès et al. [37] and extended to orthogonality constraints by Song et

al. [270].

In the Bayesian context, the sparse probabilistic PCA model of Guan and Dy [271]

extends the probabilistic PCA model of Tipping and Bishop [272] and allows for spar-

sity. This method allows for multiple sparse PCs to be estimated simultaneously,

though the authors do not discuss specific difficulties associated with orthogonality.

While Gibbs sampling from Stiefel-constrained posteriors poses relatively little additional difficulty, it is less obvious how to use more performant MCMC samplers on the

Stiefel manifold, especially the popular NUTS variant of Hamiltonian Monte Carlo

(HMC) [273–276]. Girolami and Calderhead [277] and Byrne and Girolami [278] pro-

pose geodesic variants of HMC, which extend the strong geometric foundations of

(Euclidean) HMC [279] to arbitrary smooth manifolds, but they are not easy to apply in practice and robust software implementations have yet to be developed. Working around this, several authors have proposed methods to reparameterize the Stiefel man- ifold into an approximately Euclidean coordinate system and apply standard HMC, though the practicality of these methods is limited by the incompatible structures of the Stiefel manifold and [280, 281]. More recently, Jauch et al.

[282] proposed a promising data augmentation scheme based on the polar decompo-

sition which avoids these topological inconsistencies, at the cost of a slightly larger

parameter space.

The use of manifold optimization techniques for estimating multiple regularized PCs

simultaneously is a recent development, driven by recent developments in non-smooth

manifold optimization, especially the proximal gradient scheme of Chen et al. [251,

252] and the Manifold-constrained ADMM of Kovnatsky et al. [250], as well as refer-

ences therein. (Lai and Osher [249] give an interesting splitting method for Stiefel-

constrained problems based on Bregman iteration techniques, but to the best of our

knowledge, it has not yet been applied to sparse PCA formulations.) Theory for these

methods is still under rapid development and several key results remain unproven: in

particular, convergence of the Manifold ADMM [250] has not been established, though

the analysis of Wang et al. [283] comes somewhat close.

By contrast, smooth manifold optimization techniques are well-established and typically motivated by problems in physics and engineering, where conservation laws are expressed as manifold constraints. These techniques were (re-)popularized by the influential book of Absil et al. [247]. Wen and Yin [248] give a particularly nice algorithm which maintains feasibility by using a clever application of the Cayley transform. Ritchie et al. [284] used smooth manifold optimization to solve a supervised PCA problem. Interestingly, their approach is Grassmannian- rather than Stiefel-constrained because, in a non-sparse context, the specific coordinate system used is irrelevant. This allows them to use the Grassmannian-based techniques of Edelman et al. [285] and of Boumal et al. [286].

Algorithm 7 Alternating Maximization Approach for Manifold SFPCA (6.2) using Manifold Proximal Gradient for Subproblem Solutions

1. Initialize $\hat{U}$ and $\hat{V}$ as the $k$ leading singular vectors of $X$
2. Repeat until convergence:
   (a) Solve the $U$-subproblem using manifold proximal gradient [251]:
       $$\hat{U} = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \lambda_UP_U(U)$$
       i. Initialize $U^{(k)} = \hat{U}$
       ii. Repeat until convergence:
          A. Determine the descent direction:
             $$\hat{D}_U = \operatorname*{arg\,min}_{D_U \in \mathbb{R}^{n\times k}} -\operatorname{Tr}(D_U^TX\hat{V}) + \lambda_U\|U^{(k)} + D_U\|_1 \quad\text{subject to } D_U^TS_uU^{(k)} + (U^{(k)})^TS_uD_U = 0$$
          B. Perform backtracking to determine the step size:
             - Set $\alpha = 1$
             - While $\operatorname{Tr}(\operatorname{Retr}_{U^{(k)}}(\alpha D_U)^TX\hat{V}) - \lambda_U\|\operatorname{Retr}_{U^{(k)}}(\alpha D_U)\|_1 < \operatorname{Tr}((U^{(k)})^TX\hat{V}) - \lambda_U\|U^{(k)}\|_1$: set $\alpha = 0.8\,\alpha$
          C. Set $U^{(k+1)} = \operatorname{Retr}_{U^{(k)}}(\alpha D_U)$
       iii. Set $\hat{U} = U^{(k)}$
   (b) Solve the $V$-subproblem using manifold proximal gradient [251]:
       $$\hat{V} = \operatorname*{arg\,min}_{V \in \mathcal{V}^{S_v}_{p\times k}} -\operatorname{Tr}(\hat{U}^TXV) + \lambda_VP_V(V)$$
       i. Initialize $V^{(k)} = \hat{V}$
       ii. Repeat until convergence:
          A. Determine the descent direction:
             $$\hat{D}_V = \operatorname*{arg\,min}_{D_V \in \mathbb{R}^{p\times k}} -\operatorname{Tr}(\hat{U}^TXD_V) + \lambda_V\|V^{(k)} + D_V\|_1 \quad\text{subject to } D_V^TS_vV^{(k)} + (V^{(k)})^TS_vD_V = 0$$
          B. Perform backtracking to determine the step size:
             - Set $\alpha = 1$
             - While $\operatorname{Tr}(\hat{U}^TX\operatorname{Retr}_{V^{(k)}}(\alpha D_V)) - \lambda_V\|\operatorname{Retr}_{V^{(k)}}(\alpha D_V)\|_1 < \operatorname{Tr}(\hat{U}^TXV^{(k)}) - \lambda_V\|V^{(k)}\|_1$: set $\alpha = 0.8\,\alpha$
          C. Set $V^{(k+1)} = \operatorname{Retr}_{V^{(k)}}(\alpha D_V)$
   (c) Set $\hat{V} = V^{(k)}$
3. Return $\hat{U}$ and $\hat{V}$

Algorithm 8 Alternating Manifold Proximal Gradient Approach [252] for Manifold SFPCA (6.2)

1. Initialize $\hat{U}$ and $\hat{V}$ as the $k$ leading singular vectors of $X$
2. Repeat until convergence:
   (a) $U$-update: one step of manifold proximal gradient [251] on
       $$\hat{U} = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \lambda_UP_U(U)$$
       i. Determine the descent direction:
          $$\hat{D}_U = \operatorname*{arg\,min}_{D_U \in \mathbb{R}^{n\times k}} -\operatorname{Tr}(D_U^TX\hat{V}) + \lambda_U\|\hat{U} + D_U\|_1 \quad\text{subject to } D_U^TS_u\hat{U} + \hat{U}^TS_uD_U = 0$$
       ii. Perform backtracking to determine the step size:
          - Set $\alpha = 1$
          - While $\operatorname{Tr}(\operatorname{Retr}_{\hat{U}}(\alpha D_U)^TX\hat{V}) - \lambda_U\|\operatorname{Retr}_{\hat{U}}(\alpha D_U)\|_1 < \operatorname{Tr}(\hat{U}^TX\hat{V}) - \lambda_U\|\hat{U}\|_1$: set $\alpha = 0.8\,\alpha$
       iii. Set $\hat{U} = \operatorname{Retr}_{\hat{U}}(\alpha D_U)$
   (b) $V$-update: one step of manifold proximal gradient [251] on
       $$\hat{V} = \operatorname*{arg\,min}_{V \in \mathcal{V}^{S_v}_{p\times k}} -\operatorname{Tr}(\hat{U}^TXV) + \lambda_VP_V(V)$$
       i. Determine the descent direction:
          $$\hat{D}_V = \operatorname*{arg\,min}_{D_V \in \mathbb{R}^{p\times k}} -\operatorname{Tr}(\hat{U}^TXD_V) + \lambda_V\|\hat{V} + D_V\|_1 \quad\text{subject to } D_V^TS_v\hat{V} + \hat{V}^TS_vD_V = 0$$
       ii. Perform backtracking to determine the step size:
          - Set $\alpha = 1$
          - While $\operatorname{Tr}(\hat{U}^TX\operatorname{Retr}_{\hat{V}}(\alpha D_V)) - \lambda_V\|\operatorname{Retr}_{\hat{V}}(\alpha D_V)\|_1 < \operatorname{Tr}(\hat{U}^TX\hat{V}) - \lambda_V\|\hat{V}\|_1$: set $\alpha = 0.8\,\alpha$
       iii. Set $\hat{V} = \operatorname{Retr}_{\hat{V}}(\alpha D_V)$
3. Return $\hat{U}$ and $\hat{V}$

Algorithm 9 Alternating Maximization Approach for Manifold SFPCA (6.2) using Manifold ADMM for Subproblem Solutions

1. Initialize $\hat{U}$ and $\hat{V}$ as the $k$ leading singular vectors of $X$
2. Repeat until convergence:
   (a) Solve the $U$-subproblem using Manifold ADMM [250]:
       $$\hat{U} = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \lambda_UP_U(U) = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k},\ W_U \in \mathbb{R}^{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \lambda_UP_U(W_U) \quad\text{subject to } U = W_U$$
       i. Initialize $U^{(k)} = \hat{U}$; restore $W_U^{(k)}, Z_U^{(k)}$ from the previous iteration if available, else set $W_U^{(k)} = U^{(k)}$ and $Z_U^{(k)} = 0_{n\times k}$
       ii. Repeat until convergence:
          $$U^{(k+1)} = \operatorname*{arg\,min}_{U \in \mathcal{V}^{S_u}_{n\times k}} -\operatorname{Tr}(U^TX\hat{V}) + \frac{\rho}{2}\|U - W_U^{(k)} + Z_U^{(k)}\|_F^2 = S_u^{-1/2}AB^T, \ \text{where } A, \Delta, B = \operatorname{SVD}\left(S_u^{-1/2}X\hat{V} + \rho S_u^{1/2}(W_U^{(k)} - Z_U^{(k)})\right)$$
          $$W_U^{(k+1)} = \operatorname*{arg\,min}_{W_U \in \mathbb{R}^{n\times k}} \lambda_UP_U(W_U) + \frac{\rho}{2}\|U^{(k+1)} - W_U + Z_U^{(k)}\|_F^2 = \operatorname{prox}_{\lambda_U/\rho\, P_U(\cdot)}\left(U^{(k+1)} + Z_U^{(k)}\right)$$
          $$Z_U^{(k+1)} = Z_U^{(k)} + U^{(k+1)} - W_U^{(k+1)}$$
       iii. Set $\hat{U} = U^{(k)}$
   (b) Solve the $V$-subproblem using Manifold ADMM [250]:
       $$\hat{V} = \operatorname*{arg\,min}_{V \in \mathcal{V}^{S_v}_{p\times k}} -\operatorname{Tr}(\hat{U}^TXV) + \lambda_VP_V(V) = \operatorname*{arg\,min}_{V \in \mathcal{V}^{S_v}_{p\times k},\ W_V \in \mathbb{R}^{p\times k}} -\operatorname{Tr}(\hat{U}^TXV) + \lambda_VP_V(W_V) \quad\text{subject to } V = W_V$$
       i. Initialize $V^{(k)} = \hat{V}$; restore $W_V^{(k)}, Z_V^{(k)}$ from the previous iteration if available, else set $W_V^{(k)} = V^{(k)}$ and $Z_V^{(k)} = 0_{p\times k}$
       ii. Repeat until convergence:
          $$V^{(k+1)} = \operatorname*{arg\,min}_{V \in \mathcal{V}^{S_v}_{p\times k}} -\operatorname{Tr}(\hat{U}^TXV) + \frac{\rho}{2}\|V - W_V^{(k)} + Z_V^{(k)}\|_F^2 = S_v^{-1/2}AB^T, \ \text{where } A, \Delta, B = \operatorname{SVD}\left(S_v^{-1/2}X^T\hat{U} + \rho S_v^{1/2}(W_V^{(k)} - Z_V^{(k)})\right)$$
          $$W_V^{(k+1)} = \operatorname*{arg\,min}_{W_V \in \mathbb{R}^{p\times k}} \lambda_VP_V(W_V) + \frac{\rho}{2}\|V^{(k+1)} - W_V + Z_V^{(k)}\|_F^2 = \operatorname{prox}_{\lambda_V/\rho\, P_V(\cdot)}\left(V^{(k+1)} + Z_V^{(k)}\right)$$
          $$Z_V^{(k+1)} = Z_V^{(k)} + V^{(k+1)} - W_V^{(k+1)}$$
   (c) Set $\hat{V} = V^{(k)}$
3. Return $\hat{U}$ and $\hat{V}$

Chapter 7

Multivariate Modeling of Natural Gas Spot Trading Hubs Incorporating Futures Market Realized Volatility

The following was originally published as

M. Weylandt, Y. Han, and K. B. Ensor. “Multivariate Modeling of Natu-

ral Gas Spot Trading Hubs Incorporating Futures Market Realized Volatil-

ity.” ArXiv: 1907.10152.

This paper is currently being revised for re-submission for a leading statistics jour-

nal and was awarded a 2020 Student Paper Award by the Business and Economic

Statistics Section of the American Statistical Association.

Abstract

Financial markets for Liquified Natural Gas (LNG) are an important and rapidly

growing segment of commodities markets. Like other commodities markets, there

is an inherent spatial structure to LNG markets, with different price dynamics for

different points of delivery hubs. Certain hubs support highly liquid markets, al-

lowing efficient and robust price discovery, while others are highly illiquid, limiting

the effectiveness of standard risk management techniques. We propose a joint mod-

eling strategy, which uses high-frequency information from thickly traded hubs to

improve volatility estimation and risk management at thinly traded hubs. The result-

ing model has superior in- and out-of-sample predictive performance, particularly for

several commonly used risk management metrics, demonstrating that joint modeling

is indeed possible and useful. To improve estimation, a Bayesian estimation strategy

is employed and data-driven weakly informative priors are suggested. Our model is

robust to sparse data and can be effectively used in any market with similar irregular

patterns of data availability.

7.1 Introduction

Financial markets for Liquefied Natural Gas (LNG) are among the most important

commodities markets in the United States (U.S.), with their importance rising as

natural gas makes up an increasingly large share of U.S. energy consumption. Unlike

other major commodities, the pricing of natural gas is largely driven by transportation

and storage costs, which are reflected in the correlated price dynamics of different spot

prices. In this paper, we develop a joint model for natural gas spot prices which takes

advantage of the observable market structure and pools information across different

time scales to more accurately forecast volatility for thinly traded spot prices. Our

results indicate that joint modeling of disparate natural gas spot prices is able to

predict future volatility more accurately than standard univariate models.

Natural gas is a naturally occurring mixture of hydrocarbons, principally methane

(CH4), which is widely used for both large-scale commercial power generation and do- mestic use. In 2017, natural gas was used to generate approximately 1.273 petawatthours

(PWh) or 1,273 billion kilowatthours (kWh) of electricity in the U.S., comprising ap-

proximately 31.7% of all electricity generated in the U.S. [287]. In 2015, natural

gas surpassed coal as the principal source of electricity generation in the U.S. and is 181

forecast to continue to make up a larger proportion of the U.S.’s electricity genera-

tion mixture in coming years [288, 289]. This increase in usage has been primarily

driven by the development of hydraulic fracturing (“fracking”) technology, which has

driven down the cost of domestic production significantly by allowing natural gas to

be extracted from shale efficiently and at relatively low cost [290].

Natural gas is principally stored and transported in a cooled liquid form and, as such,

so-called Liquefied Natural Gas markets are the main venue for large-scale commercial

trade in natural gas. This trade is centered around a network of many standardized

LNG transit and storage centers throughout the U.S., colloquially known as “hubs.”

(The use of the term “hub” may be somewhat confusing here since it does not imply

that the storage center is centrally located or otherwise important, but we will use

it throughout this paper as it is a standard terminology in these markets.) These

hubs are connected by a nation-wide network of pipelines which connect the hubs to

each other, as well as to important population centers and power generation plants, as

shown in Figure 7.1.1 below. Like any commodity, LNG prices fluctuate unpredictably in response to market forces, sometimes rather significantly. Both upstream producers and downstream consumers, as well as investors in LNG markets from outside the supply chain, have a material interest in measuring and managing their exposure to these fluctuations.

Over-the-counter “spot” markets exist for almost every hub in the U.S., but the liquidity and transparency of these markets vary widely from hub to hub. For a few thickly traded hubs, the spot market is highly liquid and supports a market of standardized futures contracts and other derivatives, not unlike equity (stock) or foreign exchange (FX) markets. The quality of these markets makes them an

attractive venue for speculators and market makers, who absorb risk from producers

and consumers, improving overall market efficiency. For many other hubs, however,

the spot market is thinly traded, with only a few large trades occurring each day,

inhibiting price discovery, limiting the attractiveness of spot markets to third parties,

and making risk management more difficult.

It has been observed by many authors that the use of high(er)-frequency “realized” volatility measures, based on intra-day price movements, significantly improves volatil-

ity estimation and allows for more efficient risk measurement, particularly in settings where a model based solely on daily data would otherwise be slow to react [291–293].

It is therefore natural to ask whether similar techniques can be applied to LNG mar-

kets, particularly at thinly traded hubs where intra-day data may not be available.

This paper answers that question in the affirmative, demonstrating that it is possible

to use intra-day data from LNG futures markets to improve volatility estimation at

thinly traded hubs.

Taking the “Realized Beta GARCH” model of Hansen et al. [294] as our starting

point, we develop a predictive single-factor multivariate volatility model for use in

LNG markets. Our model is fully predictive and fully multivariate in nature, esti-

mating volatility for multiple LNG spot and futures markets simultaneously. We

adopt a Bayesian approach to parameter estimation in order to address two well-

known difficulties of GARCH models: i) by incorporating information from alarge

pre-existing financial literature into our prior specifications, we are able to regular-

ize our parameter estimation; and ii) by working in a fully Bayesian framework, we

are able to coherently propagate uncertainty in our estimated parameters into our

risk-management calculations. 183

The remainder of this paper is organized as follows: Section 7.1.1 provides additional

background on LNG markets, after which Section 7.1.2 reviews related work. Section

7.2 describes the data set used in the following analysis and highlights several stylized

facts that motivate our proposed model. Section 7.3 specifies our proposed model in

detail, with Section 7.3.1 providing additional detail on suggested priors. Section 7.4

demonstrates the usefulness of our model with an application to U.S. LNG markets,

highlighting the improved accuracy on standard risk-management benchmarks. Fi-

nally, Section 7.5 closes with a discussion and a consideration of some directions for

future work. Supplemental materials give additional background, data description,

and results.

7.1.1 U.S. Natural Gas Markets

Conceptually, the simplest mechanism for trading LNG is through so-called “spot”

markets. Trades in this market lead to (essentially) immediate exchange of cash for

delivery of LNG to the purchaser’s account at a pre-determined hub. Historically,

the most important of these hubs is the so-called “Henry Hub,” located immediately

outside the town of Erath in southern Louisiana. While LNG markets have expanded

nationwide, Henry Hub, and the associated distribution network, remains a key transit

nexus and price reference. The transit and storage facilities at Henry Hub are con-

nected to several major interstate LNG pipelines, facilitating easy transit throughout

the U.S., as shown in Figure 7.1.1. As such, spot prices at Henry Hub are generally

understood to serve as a proxy for U.S. LNG market prices more broadly.

As with many commodities, markets for LNG are heavily financialized, with a wide

array of futures, options, and other derivative securities being heavily traded. Be-

cause of the central role Henry Hub plays in spot markets, the vast majority of these 184

Figure 7.1.1 : U.S. LNG Pipelines (as of 2008). There are approximately 3 million miles of LNG pipelines in the United States, connecting LNG storage facilities with customers. The majority of these pipelines are concentrated along the Gulf of Mexico and in western Pennsylvania / northern West Virginia. Note the large number of interstate pipelines originating at or near Henry Hub in southern Louisiana. Figure originally published by the U.S. Energy Information Administration (E.I.A.) [298].

derivatives are directly or indirectly based on Henry Hub spot prices. The New York

Mercantile Exchange introduced futures contracts based on Henry Hub prices in 1990; these contracts are traded at a variety of maturities up to 18 months at an average daily volume of approximately $14 billion USD and comprise the third largest com- modity market in the U.S., exceeded only by crude oil and gold [295]. In recent years, the Intercontinental Exchange has also become a leading venue for LNG derivatives, currently serving as the venue for approximately 40% of total open interest in Henry

Hub-referenced futures [296]. Non-U.S. hubs are of increasing interest, recognizing the evolution and growth of international LNG markets [297], but we restrict our focus to hubs in the continental U.S.

Unlike equity or bond markets, the mechanisms by which LNG is stored and trans- ported have a significant effect on price behaviors. Mohammadi [299] gives a de- 185

tailed survey of the pricing structure from well heads (extraction) to end-consumers.

The relationship between LNG spot and futures prices was investigated by Ghoddusi

[300] who found, inter alia that short-maturity futures are Granger-causal for phys- ical prices, a fact which is consistent with our findings on the usefulness of futures realized volatility in spot price volatility modeling. To the best of our knowledge, the subject of this paper – the relationship between the instantaneous volatilities of different spot prices – has not been previously examined in the literature.

7.1.2 Previous Work

Financial time series, including LNG prices, exhibit complex and well studied pat-

terns, notably significantly non-constant patterns of volatility (heteroscedasticity).

This motivates the use of instantaneous volatility, typically understood as the in-

stantaneous standard deviation of returns, as an explicit and important quantity in

any realistic model. Direct measurement of volatility, however, is a difficult task,

as volatility can change as rapidly or more rapidly than prices can be observed. To

address this limitation, two parallel lines of research have been developed in the

econometrics literature. The first, latent volatility process models, posit dynamics

for the unobserved volatility process and attempt to estimate the parameters of those

dynamics directly from returns data; these models are typically further subdivided

into stochastic volatility [301, 302] and (generalized) autoregressive conditional het- eroscedasticity (GARCH), depending on the form of the assumed dynamics [303, 304].

We focus on GARCH models here as they form the basis of our proposed approach.

These models typically assume a daily volatility $\sigma_t^2$ which is never observed directly, but can be estimated over long periods of time using many realizations of the squared daily return $r_t^2 = (\log P_t/P_{t-1})^2$, essentially using only the estimator $\hat{\sigma}_t^2 \approx (r_t - \bar{r})^2$

and using a model to pool estimates through time. Given the fact that these models

allow the instantaneous volatility to change rapidly and have so little information per

time point, they are often more effective for recovering the parameters of the dynam-

ics than for estimating the instantaneous volatility at a given time point. Despite this,

these models have proven immensely popular in the econometric literature, spawning

an enormous number of variants and refinements; see, e.g., the thirty-page glossary

of Bollerslev [305] or the review of Bauwens et al. [306].

The second line of research, so-called “realized volatility” models, attempts to en-

large the information set used to estimate the integrated (average) volatility over a

fixed period of time. Early work in this direction incorporated additional commonly

cited summary statistics, such as the open (first) price, the intra-day trading range,

or combinations thereof [307]. Yang and Zhang [308] derived an optimal (minimum- variance unbiased) estimator allowing non-zero drift and overnight jumps, which can

be interpreted as a weighted sum of the overnight, intra-day, and Rogers-Satchell

(1991) volatility estimators. When appropriately tuned, the Yang-Zhang estimator

is up to fourteen times more efficient than the standard close-to-close volatility esti-

mate [308], though its assumption of continuous price dynamics leads it to slightly

underestimate volatility in the presence of jumps. Moving beyond “OHLC” (Open,

High, Low, Close) data, more recent work uses even higher-frequency data, correcting

for the idiosyncrasies of market micro-structure, but these adjustments are typically

developed for equities and are quite sensitive to market microstructure, so we do not

pursue them here.
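For reference, the OHLC-based estimators compared later in Figure 7.2.3 are simple functions of daily open, high, low, and close prices. The sketch below (a generic illustration of the standard Garman-Klass and Rogers-Satchell formulas, with equal-length array inputs assumed) computes per-day variance estimates and an annualized volatility over a window.

```python
import numpy as np

def garman_klass(o, h, l, c):
    """Garman-Klass (1980) per-day variance estimate from OHLC prices."""
    return 0.5 * np.log(h / l) ** 2 - (2.0 * np.log(2.0) - 1.0) * np.log(c / o) ** 2

def rogers_satchell(o, h, l, c):
    """Rogers-Satchell (1991) per-day variance estimate; robust to non-zero drift."""
    return np.log(h / c) * np.log(h / o) + np.log(l / c) * np.log(l / o)

def annualized_vol(daily_var, trading_days=252):
    """Annualize a window of daily variance estimates."""
    return np.sqrt(trading_days * np.mean(daily_var))
```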

Given the successes of both of these lines of research, it is natural to combine them into

a single model, in which instantaneous volatility is still assumed to evolve according 187

to some form of GARCH dynamics, but for which we have more accurate estima-

tors than the squared daily return. The Realized GARCH framework introduced by

P.R. Hansen and co-authors [292, 293] is a recent entry in this vein, which allows for

slightly more general specifications than the previously proposed Multiplicative Error

Model (MEM) or High-Frequency Based Volatility (HEAVY) frameworks [309, 310].

In particular, we build upon the “realized beta GARCH” model of Hansen et al. [294] which proposes a model with two series, termed the “market” and the “asset,” where

the market has a realized volatility measure available but the asset does not, and

uses information from the market to improve estimation for the asset. We adapt this

approach for predictive risk management in the LNG market, describing a coherent

Bayesian estimation strategy complete with data-driven priors and an out-of-sample

VaR forecasting scheme. Contino and Gerlach [311] previously considered Bayesian

approaches to estimating realized GARCH models, though they do not consider the

multivariate realized beta GARCH model that we use here.

The Network GARCH model of Zhou et al. [312] is similar to the model we present

in Section 7.3, but they assume that the observations are conditionally independent

and assume the squared return of one asset directly influences the latent volatility of

another asset. As we show below, daily returns of LNG spot prices are highly corre-

lated, so the Network GARCH is clearly misspecified for our application. Interpreted

as a network, our model encodes the marginal volatilities as a star graph, with Henry

Hub at its center and other nodes connected only to it. We leave the problem of

allowing and estimating a more general graph structure to future work. 188

7.2 Empirical Characteristics of LNG Prices

Commercial data vendors recognize over one-hundred and fifty tradeable spot LNG

prices in the continental U.S. and Canada (see, e.g., the BGAS screen on a Bloomberg

terminal). Not surprisingly, a full data history is not available for each of these

price series. We curate a data set of forty spot prices for which we were able to

obtain a full daily price history over the ten-year period ranging from January 2006

to December 2015 (2209 trading days). This period roughly aligns with the third

subsample period identified by Li [313], who describes it as a large, active, and growing

market with lower levels of volatility than observed during the initial emergence of

LNG markets. Additionally, we use OHLC data from Henry Hub futures at one,

two, three, six, nine, and twelve month maturities to calculate the market return and

realized volatility measures. We highlight several key properties of our data set below

and give additional details in Section 7.A of the Supplementary Materials. Section

7.B of the Supplementary Materials describes our data curation and gives replication instructions.

LNG markets exhibit many of the same well-known characteristics as equity markets, including asymmetric heavy-tailed returns, volatility clustering, and long-lasting au- tocorrelation of squared returns. Figure 7.2.1 shows the daily return of the Henry

Hub spot, as well as its absolute value and autocorrelation function. From this, it is clear that Henry Hub exhibits volatility clustering and long-lasting autocorrelation in its second moment. This series exhibits strong evidence of heavy tails (sample excess kurtosis of 11.3), but only limited evidence of skewness (sample skewness 0.72). Other spot prices exhibit similar behaviors.

Spot prices at different trading hubs tend to move together, with shocks at one hub

[Figure: Henry Hub spot daily returns and absolute daily returns (2006-2016), with their autocorrelation functions.]


Figure 7.2.1 : Henry Hub spot returns. Henry Hub clearly exhibits many of the same properties as equity returns: volatility clustering, heavy-tails, and autocorrelation in the second moment.

being quickly reflected in connected hubs and eventually throughout the entire LNG

market. As can be seen in Figure 7.2.2, the daily price returns at different spots

are clearly highly correlated and exhibit similar volatility clustering patterns. For

the 40 spot prices considered in Section 7.4, the first principal component explains

approximately 74% of the total Spearman covariance, suggesting that a single market

factor drives most of the observed variability.

LNG spot prices, particularly those for Henry Hub, are also closely connected to

the associated futures market. As can be seen in the top panel of Figure 7.2.3,

Henry Hub spot and futures are highly correlated, with the correlation highest for

the shortest maturity futures. Similarly, spot and futures volatility estimates are also

highly correlated, as shown in the bottom panel of Figure 7.2.3. Because the futures

market is typically more liquid than the spot market, we can use the additional

information provided by this market to improve estimation of spot volatilities.

Putting these results together, we see that the spot LNG market exhibits equity-like volatility patterns with strong correlations among different spots and between the 190

[Figure: daily return series for eight LNG spot hubs (NGGCT800, NGGCTXEZ, NGGCTXEW, NGUSHHUB, NTGSTXKA, NGTXPERY, NGGCANRS, NGGCCOLG), 2006-2015.]




Figure 7.2.2 : Return series of several LNG spots. (Details of each spot series are given in Section 7.B of the Supplementary Materials.) Returns are clearly highly correlated across series, with the major volatility events (late 2009 and early 2014) being observed at all hubs. This high degree of correlation suggests that a single-factor model will suffice.

spot and futures markets. Spot prices exhibit quite heavy tails, but relatively lit-

tle skewness, suggesting the use of a skewed observation distribution rather than an

asymmetric GARCH formulation. Finally, the dominant leading principal compo-

nent, indicative of market-wide volatility shifts, suggests that a single factor model will perform well. We will incorporate each of these observations as we develop our

proposed model in the following section.

7.3 Model Specification

In this section, we introduce our model piecewise, giving motivation for our modeling

choices as we proceed. The complete specification is given in Equation (7.1) at the

end of this section. Conceptually, our model has three major components: a stan-

dard GARCH specification for the volatility at Henry Hub(σM,t), multiple realized volatility measures (ςj,t), and an augmented GARCH specification for the non-Henry 191

[Figure: top panel, Henry Hub spot returns against futures returns at 1-, 3-, 6-, and 9-month maturities; bottom panel, spot volatility against futures realized volatility for the close-to-close, Garman-Klass (1980), Rogers-Satchell (1991), and Yang-Zhang (2000) estimators.]







Figure 7.2.3 : Comparison of Henry Hub Spot and Futures. As shown in the top panel, spot and futures returns are moderately correlated (25-29%), particularly at shorter maturities (29.2%). As shown in the bottom panel, spot and futures volatilities are highly correlated (74-99%) for all of the realized volatility measures considered, including the close-to-close, Garman-Klass (1980), Rogers-Satchell (1991), and Yang-Zhang (2000) estimators. In both cases, we observe strong correlation between the low frequency spot data and the high frequency futures data.

Hub spots which includes an additional term ($\sigma_{M,t}^2$) to capture market-wide changes in the volatility. Because our eventual aim is risk management, our model is fully

predictive: in particular, volatilities at time $t$ ($\sigma_{M,t}$, $\sigma_{i,t}$) depend only on past values, allowing for out-of-sample forecasting, rather than having $\sigma_{M,t}$ influence $\sigma_{i,t}$ as in the specification of Hansen et al. [294].

We adopt a slight variant of a standard linear GARCH(1, 1) specification for the

instantaneous volatility at Henry Hub:

$$\sigma_{M,t}^2 = \omega_M + \gamma_M\sigma_{M,t-1}^2 + \sum_{j=1}^{J}\zeta_j\varsigma_{j,t-1}^2 + \tau_{M,1}|r_{M,t-1}| + \tau_{M,2}(r_{M,t-1} - \mu_M)^2$$

where rM,t is the daily Henry Hub spot price return at t, µM is the long-run average

return, and $\varsigma_{j,t-1}$ is the value of the $j$th realized volatility measure at time $t-1$. Chan

term has minimal predictive value in LNG markets. As such, we follow Hansen et al.

[292] and use a second-order Hermite polynomial in |r·,t| to allow for flexible modeling of leverage, rather than enforcing leverage directly in our specification. Note that,

since we are using a linear rather than log-linear formulation, we include an absolute

value term to disallow negative volatilities. The realized volatility term ($\zeta_j\varsigma_{j,t-1}^2$) allows

measurements which are not otherwise captured by the close-to-close return.

The individual asset volatilities follow a similar specification, though we introduce an

additional “coupling” term to capture market-wide changes in volatility where Henry

Hub is used as a proxy for the market as a whole:

$$\sigma_{i,t}^2 = \omega_i + \gamma_i\sigma_{i,t-1}^2 + \tau_{i,1}|r_{i,t-1}| + \tau_{i,2}(r_{i,t-1} - \mu_i)^2 + \beta_i\sigma_{M,t-1}^2 \qquad \forall\, i = 1, \ldots, I$$

where ri,t is the return of (non-Henry Hub) spot price i at time t. The coupling

parameter βi measures the influence of Henry Hub volatility on the other hubs. If

βi = 0, then Henry Hub volatility is minimally informative of the volatility at hub

i (conditionally independent given the past observables), while a large value of βi indicates that the secondary hub experiences significant “spill-over” volatility from

Henry Hub.

Conditional on these volatilities, we assume that the daily returns follow a multi- variate skew-normal distribution, though any skewed elliptical distribution, e.g., the 193

multivariate skewed $t$ distribution, could be used as well. In other words,
$$\vec{r}_t = \begin{pmatrix} r_{M,t} \\ \{r_{i,t}\}_{i=1}^{I} \end{pmatrix} \sim \text{Multi-Skew-Normal}(\alpha, \mu, \Sigma_t), \qquad \Sigma_t = \operatorname{diag}\left(\sigma_{M,t}, \{\sigma_{i,t}\}_{i=1}^{I}\right)\,\Omega\,\operatorname{diag}\left(\sigma_{M,t}, \{\sigma_{i,t}\}_{i=1}^{I}\right)$$
where $\alpha$ and $\mu$ are fixed (non-time varying) skewness and mean parameters and $\Omega$

is the fixed (non-time varying) correlation of returns. Because our data exhibits

relatively low degrees of skewness, we do not impose strong skewness through our

model specification, instead allowing for the possibility of skewness by combining the

linear GARCH specification with a skew-normal return distribution [316]. Empirically, we have found this scheme to perform better than the log-linear specification described

by Hansen et al. [292], but, as they note, the success of the realized GARCH framework

does not depend critically on the specification used.

Finally, we include an explicit model for the realized volatility measures:

$$\varsigma_{j,t} \sim \mathcal{N}\left(\xi_j + \phi_j\sigma_{M,t} + \delta_{j,1}|r_{M,t}| + \delta_{j,2}r_{M,t}^2,\ \nu_j^2\right) \qquad \forall\, j = 1, \ldots, J.$$

As before, we incorporate a leverage effect using a second order Hermite polynomial.

We emphasize that the realized volatility measures need not be unbiased estimates of

the true market volatility at time t. The bias (ξj) and scaling (ϕj) factors allow for mis-scaled or mis-aligned volatility measures. In other words, an intra-day volatility

measure does not need to be re-scaled to daily volatility. This is especially important

in our application below, where realized volatility measures from futures markets are

used for spot markets. (See Figure 7.2.3 for evidence of different scalings.) The bias 194

and scaling terms allow these estimates to inform our estimation, even though our

model does not directly account for the particular dynamics of the futures markets

(i.e., cost of carry, interest rate dynamics, etc.; see the discussion by Li [313]). The

use of drift-adaptive volatility measure of Yang and Zhang [308] also helps to ame-

liorate these effects.). As we will see, our model is able to adapt to the mismatch

between the futures and spot markets, demonstrating that, despite this discrepancy,

the futures market realized volatility does indeed constitute a useful addition to spot

models.

Putting these pieces together, we obtain the complete specification:
$$\begin{aligned}
\vec{r}_t &= \begin{pmatrix} r_{M,t} \\ \{r_{i,t}\}_{i=1}^{I} \end{pmatrix} \sim \text{Multi-Skew-Normal}(\alpha, \mu, \Sigma_t) \\
\Sigma_t &= \operatorname{diag}\left(\sigma_{M,t}, \{\sigma_{i,t}\}_{i=1}^{I}\right)\,\Omega\,\operatorname{diag}\left(\sigma_{M,t}, \{\sigma_{i,t}\}_{i=1}^{I}\right) \\
\sigma_{M,t}^2 &= \omega_M + \gamma_M\sigma_{M,t-1}^2 + \sum_{j=1}^{J}\zeta_j\varsigma_{j,t-1}^2 + \tau_{M,1}|r_{M,t-1}| + \tau_{M,2}(r_{M,t-1} - \mu_M)^2 \qquad\qquad (7.1) \\
\sigma_{i,t}^2 &= \omega_i + \gamma_i\sigma_{i,t-1}^2 + \tau_{i,1}|r_{i,t-1}| + \tau_{i,2}(r_{i,t-1} - \mu_i)^2 + \beta_i\sigma_{M,t-1}^2 \qquad \forall\, i = 1, \ldots, I \\
\varsigma_{j,t} &\sim \mathcal{N}\left(\xi_j + \phi_j\sigma_{M,t} + \delta_{j,1}|r_{M,t}| + \delta_{j,2}r_{M,t}^2,\ \nu_j^2\right) \qquad \forall\, j = 1, \ldots, J
\end{aligned}$$
where

rM,t is the market return at time t (observed);

ri,t is the return of (non-Henry Hub) spot price i at time t (observed);

α, µ are fixed (non-time-varying) return parameters (unobserved);

Σt is the conditional variance at time t (unobserved); 195

σM,t is the instantaneous market volatility at time t (unobserved);

σi,t is the instantaneous volatility of spot price i at time t (unobserved);

βi measures the effect of market volatility on the volatility of spot price i (unob- served);

ζj measures the influence of realized volatility measure j on market volatility (unobserved);

ω, γ, τ are fixed (non-time-varying) GARCH parameters (unobserved);

ςj,t is a realized measure of market volatility at time t (observed); and

ξ, ϕ, δ, ν are fixed (non-time-varying) parameters of the realized volatility measure-

ment (unobserved).
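To make the volatility recursions in (7.1) concrete, the sketch below filters the latent market and single-asset variances forward through time for one fixed draw of the parameters. It is a minimal illustration of the recursions only: the skew-normal return distribution and the realized-volatility measurement equation are not simulated, and the parameter dictionary keys are our own naming rather than anything prescribed by the model code.

```python
import numpy as np

def filter_volatilities(r_M, r_i, rv, params):
    """One-pass filter of sigma^2_{M,t} and sigma^2_{i,t} given returns and realized measures.

    r_M, r_i : arrays of market and asset returns; rv : (T x J) realized volatility measures.
    """
    p = params
    T = len(r_M)
    sig2_M = np.empty(T); sig2_i = np.empty(T)
    sig2_M[0] = np.var(r_M); sig2_i[0] = np.var(r_i)      # simple initialization
    for t in range(1, T):
        sig2_M[t] = (p["omega_M"] + p["gamma_M"] * sig2_M[t - 1]
                     + np.dot(p["zeta"], rv[t - 1] ** 2)
                     + p["tau_M1"] * abs(r_M[t - 1])
                     + p["tau_M2"] * (r_M[t - 1] - p["mu_M"]) ** 2)
        sig2_i[t] = (p["omega_i"] + p["gamma_i"] * sig2_i[t - 1]
                     + p["tau_i1"] * abs(r_i[t - 1])
                     + p["tau_i2"] * (r_i[t - 1] - p["mu_i"]) ** 2
                     + p["beta_i"] * sig2_M[t - 1])        # market-to-asset volatility coupling
    return sig2_M, sig2_i
```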

7.3.1 Bayesian Estimation and Prior Selection

To estimate the parameters of our model, we adopt a Bayesian approach. As we will see below, this poses several advantages, perhaps the most obvious of which is

that it allows us to incorporate information from the large GARCH literature into

a prior distribution. A less obvious, but equally important, advantage is that the

Bayesian framework allows for coherent propagation of parameter uncertainty into

subsequent analyses. In the financial context, this typically produces more extreme,

but ultimately more accurate, risk measures [317].

The choice of priors is fundamental to any Bayesian model. Rather than developing

priors based on theoretical considerations, we derive priors from a related but inde-

pendent data set where possible. In particular, we fit a (univariate) realized GARCH

model to the returns of the S&P 500, an index of major U.S. stocks, and use the 196

maximum likelihood estimates to center our priors for LNG. This process is repeated

over 250 randomly chosen year-long periods during our sample period (2006-2015)

and the median estimate is used as the mean of the prior. The prior standard devia-

tion is matched to ten times the median absolute deviation of the MLEs. This yields

the prior specifications shown in Table 7.3.1. For priors which can be well estimated

from the data, e.g., the mean return µ, or the fixed correlation matrix, Ω, we use

a standard weakly informative prior. Throughout, we also constrain our priors, and

hence our posterior, to ensure stationarity of the underlying process.

Parameter(s)   Interpretation                         Calibration Strategy    Prior
µ              Mean Return                            Weakly Informative      N(0, I)
β              Volatility Coupling                    Weakly Informative      N(0, I)
Ω              Return Correlation                     Weakly Informative      Haar Measure (Uniform)
α              Return Skewness                        Weakly Informative      N(0, I)
ω              Long-Run Volatility Baseline           SPY MLE Calibration     N(0.002, 0.025^2)
γ              Volatility Persistence                 SPY MLE Calibration     N(0.8, 0.6^2)
τ1             First-Order GARCH Effect               SPY MLE Calibration     N(0, 0.01^2)
τ2             Second-Order GARCH Effect              SPY MLE Calibration     N(0.1, 0.7^2)
ξ              Realized Volatility Bias               SPY MLE Calibration     N(0.02, 0.6^2)
ϕ              Scaling of Realized Volatility         SPY MLE Calibration     N(15, 60^2)
ν              Noise of Realized Volatility (R.V.)    SPY MLE Calibration     N(0.05, 0.25^2)
δ1             R.V. First-Order Leverage Effect       SPY MLE Calibration     N(1.15, 8^2)
δ2             R.V. Second-Order Leverage Effect      SPY MLE Calibration     N(1.15, 14^2)

Table 7.3.1 : Prior Specifications. Parameters for first-order behavior (mean and correlation of returns) are set using standard weakly informative priors. Priors for GARCH dynamics are calibrated to S&P500 dynamics over the sample period (2006- 2015). The “coupling parameters” β which are the primary focus of this paper are given independent weakly informative N (0, 1) priors.

To ensure the reasonableness of our priors, we generate simulated trajectories with

GARCH parameters drawn from our priors as a “prior predictive check.” On visual

inspection of simulated price series, we observe that our priors imply realistic GARCH

dynamics, though the volatility is higher than what we actually observe, consistent with our use of weakly informative (wide) priors on the GARCH parameters, which 197

allow for higher-than-realistic volatilities.
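As an illustration of this prior predictive check, the sketch below draws GARCH-type parameters from the Table 7.3.1 priors (rejecting draws that violate positivity or a simple stationarity condition) and simulates return paths whose summary statistics can be compared against the observed series. It is a deliberately simplified univariate version with normal innovations rather than the full skew-normal, realized-measure specification.

```python
import numpy as np

def prior_predictive_path(T=250, seed=0):
    """Simulate one return path with GARCH(1,1)-style parameters drawn from the priors."""
    rng = np.random.default_rng(seed)
    while True:   # redraw until the implied process is positive and stationary
        omega = rng.normal(0.002, 0.025)
        gamma = rng.normal(0.8, 0.6)
        tau2 = rng.normal(0.1, 0.7)
        if omega > 0 and gamma >= 0 and tau2 >= 0 and gamma + tau2 < 1:
            break
    sig2 = omega / (1 - gamma - tau2)           # start at the unconditional variance
    r = np.empty(T)
    for t in range(T):
        r[t] = np.sqrt(sig2) * rng.standard_normal()
        sig2 = omega + gamma * sig2 + tau2 * r[t] ** 2
    return r

paths = np.stack([prior_predictive_path(seed=s) for s in range(100)])
print(np.percentile(paths.std(axis=1), [5, 95]))   # compare against observed return volatility
```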

Table 7.3.2 compares the 90% prior predictive intervals of several relevant statistics

against the S&P 500 data used to calibrate our priors and against Henry Hub spot

returns. Our priors imply a kurtosis below the observed kurtosis of the S&P 500 and

Henry Hub. This divergence appears to be driven by the model not accounting for

large exogenous jumps rather than an incorrect GARCH specification. Overall, we

see that our priors are realistic, even though they often imply a higher “volatility of volatility” than actually observed.

Test Statistic                                          90% Interval       S&P 500    Henry Hub Spot
Mean Return                                             [-0.025, 0.022]      0.00       0.00
Mean Absolute Return                                    [0.058, 0.519]       0.01       0.03
Standard Deviation of Returns                           [0.073, 0.761]       0.01       0.04
Skewness of Returns                                     [-0.545, 0.484]     -0.15       0.72
Kurtosis of Returns                                     [2.839, 13.362]     17.78      14.38
Fraction of Positive Returns                            [0.466, 0.535]       0.55       0.47
One-Day Autocorrelation of Returns                      [-0.108, 0.112]     -0.08       0.04
Two-Day Autocorrelation of Returns                      [-0.092, 0.089]     -0.08      -0.14
One-Day Partial Autocorrelation of Returns              [-0.094, 0.08]      -0.09      -0.15
One-Day Autocorrelation of Squared Returns              [0.014, 0.56]        0.18       0.27
Two-Day Autocorrelation of Squared Returns              [-0.005, 0.433]      0.45       0.29
One-Day Partial Autocorrelation of Squared Returns      [-0.060, 0.239]      0.43       0.23

Table 7.3.2 : Prior Predictive Checks for Model (7.1): 90% Prior Predictive Intervals and observed S&P 500 and Henry Hub values. The 90% Prior Predictive Intervals cover the observed values for each of the standard summary statistics except those based on unconditional volatility (standard deviation and kurtosis of returns), where the priors imply a slightly higher volatility than actually observed. As we will see in Section 7.4, this is consistent with the out-of-sample evaluation of our model, which is generally accurate, but slightly conservative.

7.3.2 Computational Concerns

Full Bayesian inference for Model (7.1) requires use of a Markov Chain Monte Carlo

(MCMC) sampler. Because the latent volatilities, σM,T and σi,T , are typically highly 198

correlated across time, Hamiltonian Monte Carlo samplers are particularly well suited

for this problem. We make use of the Stan probabilistic programming language and

its variant of the No-U-Turn Sampler [274, 275]. While Stan is able to sample the pos-

terior relatively efficiently for this problem, simultaneous inference for a large number

of assets, e.g., the 40 spot prices considered in the next section, is still computationally

quite demanding. To avoid this, we take advantage of the factorizable structure of

Model (7.1) and compute partial posteriors for each asset separately and combine our

partial posteriors to obtain an approximate posterior. The parameters of the joint re-

turn distribution can then be estimated efficiently afterwards by treating the returns

as independent samples after standardization by the estimated instantaneous volatili-

ties. Additional details of our Markov Chain Monte Carlo procedure and convergence

diagnostics are included in Section 7.D of the Supplementary Materials.

7.4 Application to Financial Risk Management

We demonstrate the effectiveness of our model with an application to LNG spot prices

at non-Henry hubs. We fit our model, under the priors described in Section 7.3.1, on

a ten-year historical period, refitting every 50 days using a 250-day look-back window.

This periodic refitting strategy implicitly allows for time-varying coefficients, unlike

the approach of Hansen et al. [294], who only allow for time-varying correlations (Ω) with a parametric specification for the temporal dynamics. Unless otherwise stated,

all volatility measures are posterior expectations taken over 4000 posterior samples.

Throughout, we compare the results of Model (7.1), which we denote as RBGARCH for

realized beta GARCH, with a pair of standard skew-normal GARCH(1, 1) models, which we denote as GARCH. To ensure a fair comparison, the same model specification

and priors are used for both GARCH and RBGARCH. While the RBGARCH model is able to 199

convincingly outperform the GARCH model in-sample, the superior out-of-sample risk-

management performance is particularly compelling evidence as to the usefulness of

the realized beta GARCH model in LNG markets. Additional comparisons are given

in Section 7.C of the Supplementary Materials.

7.4.1 In-Sample Model Fit

We first compare the in-sample model fit of the GARCH and RBGARCH models. To assess

improvement in in-sample performance, we use Watanabe’s (2010) Widely Applicable

Information Criterion (WAIC). Essentially a Bayesian analogue of AIC, WAIC is an approximate measure of the leave-one-observation-out predictive performance of a model. The likelihood of the RBGARCH model contains terms for both the daily returns and the realized volatility measure. To ensure a fair comparison with the

GARCH model, we calculate the WAIC of the RBGARCH model on a partial likelihood, using only the likelihood of the daily returns. The partial likelihood is analogous to the complete likelihood of the GARCH model, which does not include a term for the realized volatility measure, and allows for an apples-to-apples WAIC comparison of the two models.
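Concretely, WAIC is computed from the matrix of pointwise log-likelihood values evaluated at each posterior draw; the sketch below is a generic implementation of the standard formula (reported on the elpd scale, where larger values indicate better predictive fit; multiply by -2 for the deviance scale), together with the per-observation instability heuristic of Vehtari et al. referenced below. It is an illustration, not the thesis code.

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC (elpd scale) from an (S draws x N observations) pointwise log-likelihood matrix."""
    S = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))   # log pointwise predictive density
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))        # effective number of parameters
    return lppd - p_waic

def waic_is_unstable(log_lik, threshold=0.4):
    """Flag any observation whose posterior log-likelihood variance exceeds the 0.4 heuristic."""
    return np.any(np.var(log_lik, axis=0, ddof=1) > threshold)
```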

In Figure 7.4.1, we compare the in-sample performance of the GARCH and RBGARCH

models using a signed-rank transform to account for the highly non-normal sampling

distribution of the WAIC differences. The left portion of Figure 7.4.1 shows a small

number of sub-periods where the WAIC comparison appears to strongly favor the

GARCH model. A closer investigation of these sub-periods reveals that these irregu-

larities are caused by short-lived outliers in the underlying data which corrupt the

WAIC estimates. Vehtari et al. [319] suggest that if the posterior variance of the posterior log-predictive density exceeds 0.4 for any single observation, WAIC does not provide a robust estimate of out-of-sample predictive accuracy. Applying this heuristic to our

data, we see that almost all of the sub-periods where GARCH appears to out-perform

RBGARCH have WAIC instabilities and that, ignoring those outliers, the RBGARCH model consistently has superior in-sample performance, even after adjusting for the added complexity of the RBGARCH model.

A paired one-sided Wilcoxon test comparing the WAICs of the GARCH and RBGARCH models was performed, with the alternative hypothesis that the RBGARCH model has a higher median WAIC across our 73 time periods (not omitting the outlier-corrupted periods). We see consistent evidence that RBGARCH outperforms, with 37 out of 40 p-values less than 0.10 and 34 out of 40 less than 0.05. (Note that the non-parametric

Wilcoxon test was used because the sampling distribution of WAIC is highly non-normal in small samples and extremely sensitive to the heteroscedasticity of the underlying data.) Overall, we see consistent evidence that, by jointly modeling the volatility of spot and futures returns, the RBGARCH model is able to consistently capture price dynamics more accurately than the GARCH model.
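The following sketch shows how one such paired one-sided comparison could be carried out with SciPy; the WAIC arrays here are random placeholders rather than the values computed in our study.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
waic_rbgarch = rng.normal(1.0, 5.0, size=73)   # per-period WAICs for one spot (placeholder)
waic_garch = rng.normal(0.0, 5.0, size=73)     # per-period WAICs for one spot (placeholder)

# alternative='greater': the RBGARCH model has the higher median WAIC
p_value = wilcoxon(waic_rbgarch - waic_garch, alternative="greater").pvalue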

7.4.2 In-Sample Risk Management

We consider a multi-asset portfolio containing a mixture of spot LNG and one-month

Henry Hub futures. In-sample VaR estimates are computed by taking quantiles of the

posterior predictive distribution of portfolio returns using the posterior distribution

of the mean return parameters µ, α and the estimated conditional variance Σt. To assess the accuracy of our VaR estimates, we use the (unconditional) test of Kupiec

[320], which compares the number of times that the actual losses are larger than the

estimated quantile (VaR) with the theoretical rate using a binomial proportion test.

The Kupiec test is a misspecification test, where smaller values indicate stronger

Figure 7.4.1 : Signed-rank-transformed WAIC differences of 40 spots over 73 periods. Ignoring periods where our WAIC estimates are corrupted by outliers (purple), the RBGARCH model performs better than the GARCH model in 2333 out of 2444 (95.4%) sub-samples.

evidence that the VaR estimates are not correctly calibrated. Results of this test

are shown in Figure 7.4.2. As this figure shows, the RBGARCH model consistently

provides more accurate in-sample VaR estimates than the GARCH model. We note that, because it models the volatilities jointly and does not assume independence, the

RBGARCH model does particularly well for balanced portfolios where the correlation among assets is particularly important.
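A minimal sketch of the unconditional coverage check just described is given below; it uses an exact binomial test on the exceedance count, which conveys the same information as Kupiec's likelihood-ratio formulation. The function and variable names are illustrative only.

from scipy.stats import binomtest

def kupiec_pvalue(losses, var_estimates, alpha=0.002):
    # alpha is the VaR tail probability, e.g., 0.002 for a 1-in-500 VaR
    exceedances = sum(l > v for l, v in zip(losses, var_estimates))
    # small p-values indicate that the observed exceedance rate is inconsistent
    # with the nominal rate, i.e., that the VaR estimates are miscalibrated
    return binomtest(exceedances, n=len(losses), p=alpha).pvalue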

7.4.3 Out-of-Sample Risk Management

The ultimate test of a financial model is its out-of-sample performance, so we compare

the out-of-sample performance of the GARCH and RBGARCH models, again focusing on

VaR forecasting. Forward-looking (out-of-sample) volatility predictions were obtained by (forward) filtering observed returns for each posterior sample and then taking the quantile of the posterior predictive distribution for each day. In other words, out-of-

Figure 7.4.2 : In-Sample VaR Accuracy of the GARCH and RBGARCH models, as measured by Kupiec's (1995) unconditional exceedances test. Larger p-values indicate better performance. The RBGARCH model performs well for all portfolios considered, while the GARCH model shows clear signs of inaccurate VaR calculation for all portfolios except the 20% Spot/80% Futures portfolio. Displayed p-values are for in-sample VaR estimates for the entire sample period.

sample observations were used to update the conditional forward-looking volatility

estimates, but were not used for parameter re-estimation. While we only consider

one-day ahead prediction here, it is straightforward to extend these results to longer

forecast horizons using standard techniques, e.g., the data-driven aggregation method

proposed by Hamidieh and Ensor [321].
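The filtering step can be sketched as follows. For clarity, the sketch uses a univariate GARCH recursion with normal returns rather than the joint skew-normal specification of Model (7.1), and the posterior draws are placeholders; the structure (forward filtering per draw, then a quantile across draws) is the same.

import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 1.0, size=250)               # observed out-of-sample returns (placeholder)
n_draws = 4000
draws = {"omega": rng.uniform(0.01, 0.05, n_draws),    # posterior parameter draws (placeholders)
         "gamma": rng.uniform(0.70, 0.90, n_draws),
         "tau2": rng.uniform(0.02, 0.08, n_draws),
         "sigma0": rng.uniform(0.80, 1.20, n_draws)}

predictive = np.empty(n_draws)
for s in range(n_draws):
    sigma2 = draws["sigma0"][s] ** 2
    for r in returns:                                   # forward filter: no parameter re-estimation
        sigma2 = draws["omega"][s] + draws["gamma"][s] * sigma2 + draws["tau2"][s] * r ** 2
    predictive[s] = rng.normal(0.0, np.sqrt(sigma2))    # next-day return under this draw

var_1_in_500 = np.quantile(predictive, 0.002)           # lower-tail quantile of the predictive draws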

Results of this test are shown in Figure 7.4.3 for the same two-asset portfolios as

considered in the previous section. The RBGARCH model continues to consistently

outperform the GARCH model, particularly for the portfolios in which the spot as-

set is heavily weighted. In particular, while the performance of the GARCH model is

significantly worse out-of-sample, the RBGARCH model performs almost as well out-of-sample as it does in-sample. A closer examination reveals that, while the GARCH model significantly underestimates volatility, the RBGARCH model produces slightly

over-conservative out-of-sample VaR estimates, a feature which we consider accept-

able, and possibly even desirable, in applications.

Figure 7.4.3 : Rolling Out-Of-Sample VaR Accuracy of the GARCH and RBGARCH models, as measured by Kupiec's (1995) unconditional exceedances test. Larger p-values indicate better performance. The RBGARCH model outperforms the GARCH model for the seven different portfolio weights considered here. Displayed p-values are for out-of-sample filtered VaR estimates for the entire sample period.

7.5 Conclusions

We present a model for joint volatility estimation at several LNG hubs and demon-

strate that, despite the apparent mismatch, information from Henry Hub futures can

be used to improve volatility estimation for non-Henry Hub spot prices. We provide

suggested priors for use in Bayesian estimation and demonstrate the effectiveness of

the Bayesian approach for accurate risk prediction. Our model produces improved volatility estimates, with more accurate unconditional volatility levels, increased responsiveness to external market shocks, and improved VaR estimation. Notably, our joint model performs as well out-of-sample as it does in-sample, making it particularly promising for risk management applications. This impressive out-of-sample performance can be traced back to our use of realized volatility information from the futures market: by taking advantage of an additional highly accurate realized volatility measure, the Realized Beta GARCH is able to adapt to changing market conditions much more rapidly than standard approaches. Joint modeling of different aspects of LNG markets is an effective and flexible strategy for incorporating inconsistently available data and provides useful insights into the degree and dynamics of LNG spot price volatility and inter-asset correlation. The joint modeling framework may be easily extended to include additional realized volatility measures or to allow more complex market dynamics.

Appendix to Chapter 7

7.A Additional Data Description

In this section, we provide some additional characterization of the data used for the

empirical study in Section 7.4 of our paper. As discussed in Sections 7.1.1 and 7.2,

Henry Hub spot prices play an important role in both the spot and futures markets.

Figure 7.B.1 shows the Henry Hub spot price over the sample period, as well as (annualized) rolling return and volatility estimates, sample skewness, and

sample (non-excess) kurtosis. As Li [313] notes, our sample period is characterized

by relatively low volatility, with changes in volatility, skewness, and kurtosis being

driven by periods of elevated volatility in late 2009 and in early 2014. Given this

low level of volatility, we do not include a jump component in our model, contrary to

the findings of Mason and Wilmot [322], who find that the presence of a jump component

significantly improves model fit in an earlier, more volatile, period.

LNG futures prices are strongly correlated across different maturities, with the cor-

relation remaining roughly constant across time as shown in Figure 7.B.3. Futures

prices exhibit less pronounced jump behavior than the underlying spot price, par-

ticularly in the latter portion of our sample, leading to reduced correlation between

futures and spot returns at a daily frequency. At lower frequencies, the effects of these

jumps are reduced and the correlation between spot and futures returns is higher, as

shown in Figure 7.B.4.

Given that the effects of storage and production shocks are felt throughout LNG

markets, we would expect the volatilities of spot prices to be correlated as well. As can be seen in Figure 7.B.5, this is indeed the case – spot price volatilities are typically

highly correlated, though not to quite the same degree as spot price returns. As before,

there is clear evidence of a subgroup of spot prices that are more closely correlated

among each other than to the rest of the market.

While the spot and futures markets generally move together, they are still somewhat

disjoint, with shocks to the futures market not necessarily being reflected in the spot

market and vice versa. Despite this disjoint behavior, there is still a strong correlation between realized volatility in the futures markets and close-to-close volatility in the spot markets. Figure 7.B.6 illustrates this correlation, using several different realized volatility measures and futures maturities. Henry Hub clearly displays a higher level of

correlation with futures volatility than the other spot prices in our sample, consistent with the fact that LNG futures are based on Henry Hub prices. Regardless, futures volatility is indeed a useful predictor of volatility at non-Henry hubs, as demonstrated

in Section 7.4.

The observed correlation does not appear to be sensitive to the choice of realized volatility measure, suggesting our results are robust to the exact form of external

information being supplied. Shorter maturity futures, especially one-month futures

(NG1), consistently show the highest correlation with spot prices, so we use the Yang-

Zhang estimator applied to the generic fixed-maturity NG1 in our application. The

high degree of correlation suggests that single-factor volatility models, including our

proposed model, will be able to capture most of the second-order dynamics. This is

consistent with the findings of Kyj et al.[323] who, like us, combine a single-factor

model with realized volatility measures and find significant improvements in equity

portfolio construction.

7.B Data Set Replication Information

Table 7.B.2 lists the Bloomberg identifiers used to obtain the data for our analysis.

For spot prices, only the last traded price of the day (Bloomberg key PX_LAST) was

used. These spot prices were selected from Bloomberg’s full listing of LNG spot prices

(available through the BGAS screen) as having a relatively complete trading history

(more than 245 trading days per year) for the entire sample period (2006 - 2015). Days when only certain spots traded are omitted from our analysis to ensure an aligned

data set. After screening, the NGCDAECO spot price (Alberta, Canada) was manually removed as it is the same spot as NGCAAECO, but reported in Canadian dollars instead of U.S. dollars. For futures prices, Open, High, Low, and Close prices were obtained using Bloomberg keys PX_OPEN, PX_HIGH, PX_LOW, and PX_CLOSE respectively.

As described in the main text, the reported Open and Close prices are occasionally outside of the daily trading range (the High-Low range). This is not a data error, but an artifact of the Open and Close prices being set by an auction process, rather than the standard market mechanism, with the auction prices not included in the High and Low calculations. The Yang-Zhang volatility estimator, however, is derived assuming a continuous time model for the price series and makes no distinction between traded and non-traded prices. As such, it is not well-defined when the Open or Close prices fall outside of the intraday trading range. This occurs somewhat regularly in our data set, as detailed in Table 7.B.1, particularly at longer maturities.

To address this, we restrict the Open and Close prices to the intraday trading range.

This potentially results in a small loss of estimation efficiency, but empirically appears

to have minimal effect, particularly for our application (Section 7.4), where only the

one-month futures are used.
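The estimator with this truncation can be sketched as follows; the DataFrame column names (Open, High, Low, Close) are assumptions for illustration and do not correspond to the Bloomberg keys listed below.

import numpy as np
import pandas as pd

def yang_zhang_volatility(df):
    # Truncate auction-set Open/Close prices to the intraday High-Low range
    o = df["Open"].clip(lower=df["Low"], upper=df["High"])
    c = df["Close"].clip(lower=df["Low"], upper=df["High"])
    n = len(df)
    log_on = np.log(o / c.shift(1))                    # overnight (close-to-open) log returns
    log_oc = np.log(c / o)                             # open-to-close log returns
    u = np.log(df["High"] / o)                         # intraday high relative to open
    d = np.log(df["Low"] / o)                          # intraday low relative to open
    var_rs = (u * (u - log_oc) + d * (d - log_oc)).mean()   # Rogers-Satchell component
    k = 0.34 / (1.34 + (n + 1) / (n - 1))
    return np.sqrt(log_on.var() + k * log_oc.var() + (1 - k) * var_rs)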

Figure 7.B.1 : Henry Hub spot prices and rolling estimates of the first four moments. Spot prices peaked in mid-2008, declined from late 2008 to early 2009 and have remained at a relatively constant level since. Spot returns exhibit moderate heteroscedasticity, skewness and kurtosis, with large changes in the sample moments being primarily driven by high-volatility periods in Fall 2009 and Spring 2014.

Figure 7.B.2 : Pearson correlation among daily spot price returns for 40 hubs in the sample period. Spot prices are typically highly correlated with each other and with Henry Hub (not shown here). The Cheyenne, Northeast Transco Zone 6, Texas Eastern Transmission, Tennessee Gas Pipeline, Northeast, and Iroquois Transmission System spot prices, however, are more highly correlated among themselves than to other spots during the sample period.

Maturity           Open > High   Close > High   Open < Low   Close < Low   Low > High   Total
1 Month (NG1)            6             0             5            0            0          11
2 Month (NG2)            3             1             2            1            0           7
3 Month (NG3)            9             4             7            0            0          20
6 Month (NG6)            9            38             6           15            0          68
9 Month (NG9)           11           121             9           61            0         202
12 Month (NG12)          8           181             7          125            0         321
Total                   46           345            36          202            0         629

Table 7.B.1 : Apparent Inconsistencies in Futures Data. These typically occur because the auction-based Open and Close prices are not included in the intra-day trading range used to determine the High and Low prices. To calculate the Yang-Zhang (2000) realized volatility, the Open and Close are truncated to the High-Low range.

Figure 7.B.3 : Henry Hub futures move in tandem with spot prices, but generally exhibit fewer large jumps. In particular, the futures markets do not exhibit the large jumps in early 2014 that are a significant driver of the elevated kurtosis observed in spot prices during the same period.

Figure 7.B.4 : Daily returns of Henry Hub futures are highly correlated, but only exhibit moderate correlation with Henry Hub spot returns. Correlations among spot and futures returns are higher at lower frequencies (weekly, monthly, etc.).

Figure 7.B.5 : Pearson correlation among the estimated volatilities of 40 spot prices during the sample period. As we see, there is generally a strong positive correlation among spot volatilities, though it is not as pronounced as the correlation among spot returns. There is some evidence that the structure of volatility is changing over time, as can be seen by the emergence of subgroups in the 2012-2015 period.

Figure 7.B.6 : Pearson correlation among the Henry Hub spot volatility, non-Henry Hub spot volatility, and futures realized volatility during the sample period. The reported correlations with non-Henry spots are the average correlation over the 40 non-Henry spots in our sample. The correlation between Henry Hub volatility and short-maturity (NG1) realized volatility is consistently the highest. The degree of correlation appears robust to the choice of realized volatility measure.

LNG Spot Prices:
    NAGAALLI    Alliance Pipeline Delivered
    NAGAANRL    Mid-Continent / ANR Lavrne (Custer, OK)
    NAGAMICG    Michigan Consolidated Gas (Detroit, MI)
    NAGANGMC    Mid-Continent Natural Gas Spot Price
    NAGANGPL    Mid-Continent / Chicago Citygate (Chicago, IL)
    NAGANGTO    Texas-Oklahoma East (Montgomery County, TX)
    NAGANOND    N. Natural Mainline (Clay County, KS)
    NAGANORB    N. Border Natural Gas Spot Price (Ventura, IA)
    NGCAAECO    AECO C Hub (Alberta, Canada)
    NGCGNYNY    TETCO M3 (New York, NY)
    NGGCANRS    Gulf Coast / ANR Southeast
    NGGCCGLE    Columbia Transmission TCO Pool (Leach, KY)
    NGGCCOLG    Columbia Gulf Onshore Louisiana Pool
    NGGCT800    Tennessee 800 Leg (SE Louisiana)
    NGGCTR30    Transco Station 30 (Wharton County, TX)
    NGGCTRNZ    Transco Station 65 (Beauregard Parish, LA)
    NGGCTXEW    TETCO West Louisiana
    NGGCTXEZ    TETCO East Louisiana
    NGNECNGO    Dominion South Point (Lebanon, OH)
    NGNEIROQ    Iroquois Gas Transmission System (Waddington, NY)
    NGNEIZN2    Iroquois Zone 2 (Wright, NY)
    NGNETNZ6    Tennessee Gas Pipeline Zone 6
    NGNETRNZ    Northeast Transco Zone 6 (Linden, NJ)
    NGRMCHEY    Cheyenne Hub (Cheyenne, WY)
    NGRMDENV    Colorado Interstate Gas Mainline
    NGRMELPS    Non-Bondad San Juan Basin (El Paso, TX / Blanco, NM)
    NGRMEPBD    Bondad San Juan Basin (El Paso, TX)
    NGRMKERN    Rocky Mountains / Kern River (Opal, WY)
    NGRMNWES    Northwest Pipeline (Stanfield, OR)
    NGTXEPP2    Permian Basin (West Texas)
    NGTXOASI    Waha Hub (Waha, TX)
    NGTXPERY    Columbia Mainline (Perryville, LA)
    NGUSHHUB    Henry Hub (Erath, LA)
    NGWCEPEB    El Paso South Mainline (El Paso, TX)
    NGWCPGNE    Pacific Gas and Electric Citygate (N. California)
    NGWCPGSP    Northwest Transmission (Malin, OR)
    NGWCPGTP    Pacific Gas and Electric Topock (Topock, AZ)
    NGWCSCAL    Southern California Border
    NTGSTXKA    Katy Texas Area
    SNNWPIPA    Eastern Oklahoma Panhandle Field Zone (Haven, KS)

LNG Futures Prices:
    NG1     Generic 1st NG Future
    NG2     Generic 2nd NG Future
    NG3     Generic 3rd NG Future
    NG6     Generic 6th NG Future
    NG9     Generic 9th NG Future
    NG12    Generic 12th NG Future

Table 7.B.2 : Bloomberg identifiers of the price series used in our analysis. LNG Futures were used to compute realized volatility measures and served as the market proxy while LNG Spot prices were used for the sparse-data asset.

7.C Additional Results

In this section, we provide additional description of the in- and out-of-sample perfor-

mance of our RBGARCH model, extending the results presented in Section 7.4.

7.C.1 In-Sample Model Fit

We begin by considering the in-sample fit, as measured by the Widely Applicable In-

formation Criterion (WAIC) proposed by Watanabe [318]. As discussed in more detail

by Gelman et al. [324], WAIC is a fully Bayesian analogue of AIC [325], which at-

tempts to estimate out-of-sample expected log-posterior density. Unlike the Deviance

Information Criterion (DIC) of Spiegelhalter et al. [326, 327], WAIC is calculated us-

ing the entire posterior distribution, rather than just the posterior mean, taking fuller

account of posterior uncertainty. Note that the Bayesian WAIC should not be con-

fused with the so-called Bayesian Information Criterion (BIC) of Schwarz [328] or

its actually Bayesian analogue WBIC [329], both of which attempt to estimate the

marginal likelihood of a proposed model [330]. Because we do not assume that either

the GARCH or RBGARCH models are true, we focus more on predictive accuracy than

model recovery.
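For reference, the WAIC computation can be condensed to a few lines, assuming a matrix of pointwise log-likelihood values with one row per posterior draw and one column per observation; following the usage in this chapter, it is reported on the expected log predictive density scale, so larger values are better.

import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    # log_lik: (posterior draws) x (observations) matrix of log-likelihood values
    n_draws = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(n_draws))   # log pointwise predictive density
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))              # effective number of parameters
    return lppd - p_waic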

Figure 7.C.1 shows in-sample WAIC comparisons of the GARCH and RBGARCH models

for each of our 40 spot prices and 73 sub-sample periods. As discussed above, for a small fraction of sub-samples (5-15%, depending on the spot), the WAIC estimates are

corrupted by the presence of extreme outliers (see, e.g., Figure 7.C.2). We identify

these corrupted sub-periods using the 0.4 posterior variance heuristic of Vehtari et al.

[319] and highlight them in purple. This heuristic identifies several periods in which

the WAIC estimates of the RBGARCH model are unstable, but those of the GARCH model

Figure 7.C.1 : Signed-rank-transformed WAIC differences of 40 spots over 73 periods. Positive values indicate that the RBGARCH model outperformed the GARCH model for that sample period. p-values from a paired one-sample Wilcoxon test applied to WAIC differences for each sample are displayed on the left axis. The RBGARCH model outperformed the GARCH model across all spots, but was somewhat more sensitive to outliers in the data.

are not. We conjecture that this is due to the additional structure of the RBGARCH

model: because the RBGARCH model expects volatilities to move together, an outlier

in the spot price which does not have a corresponding jump in the futures market is

even less likely under the RBGARCH model than a simultaneous jump in both markets.

Regardless, for our specific goal of VaR forecasting, the RBGARCH model consistently

outperforms the GARCH model with no outlier filtering necessary, as shown in the following two sections.

Figure 7.C.2 : Price history for Custer, OK LNG trading hub. The large price jump on February 5th, 2014 causes WAIC instabilities for several sample sub-periods, as highlighted in Figure 7.C.1. Because we use a 50-day rolling window with a 250-day history, a single outlier of this form can impact up to 5 sub-periods.

In Figure 7.C.3, we show the fraction of subsample periods for which the RBGARCH

model outperforms the GARCH model by K standard errors, broken down by year. (Be-

cause WAIC calculations include a standard error, it is possible to obtain a standard

error on the difference in WAIC values. For details, see Section 5 of Vehtari et al.

[319].) The RBGARCH model consistently outperforms the GARCH model, typically by

one or more standard errors, for all years in the sample period.

7.C.2 In-Sample Risk Management

In addition to the unconditional test of Kupiec [320], we can also assess VaR accu-

racy using the conditional test of Christoffersen [331]. While the Kupiec test is a

marginal goodness of fit test, evaluating whether the samples appear to be draws

from a Bernoulli distribution with nominal coverage, the Christoffersen test considers

dependence between samples and tests whether the probability of an exceedance on

day T is independent of the probability of an exceedance on day T + 1. Figures

7.C.4 and 7.C.5 display p-values for each of the 40 spots, at 99.8% and 99.5% levels

Figure 7.C.3 : In-Sample WAIC Comparison of GARCH and RBGARCH models, aggregated over all 40 spot prices in our sample. The reported probabilities are the fraction of 250 day sample periods in which the WAIC of the RBGARCH exceeded that of the GARCH model by at least K times the estimated standard error of the difference for K = 0, 1, 2. For all years in the sample period, the RBGARCH model has a consistently higher WAIC than the GARCH model, typically by several standard errors.

respectively. Figure 7.C.6 gives a simplified version of this information, from which we can clearly see the improved performance of the RBGARCH model, as indicated by

the upward sloping lines comparing the p-values for each model.
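A sketch of the conditional (independence) test is given below, applied to a 0/1 sequence of VaR exceedance indicators; the input is hypothetical, and the implementation follows the standard first-order Markov likelihood-ratio form of the test.

import numpy as np
from scipy.stats import chi2

def christoffersen_pvalue(hits):
    hits = [int(h) for h in hits]                      # 1 = exceedance, 0 = no exceedance
    pairs = list(zip(hits[:-1], hits[1:]))
    n00 = pairs.count((0, 0)); n01 = pairs.count((0, 1))
    n10 = pairs.count((1, 0)); n11 = pairs.count((1, 1))
    pi = (n01 + n11) / len(pairs)                      # unconditional exceedance probability
    pi0 = n01 / (n00 + n01) if (n00 + n01) else 0.0    # P(exceedance | no exceedance yesterday)
    pi1 = n11 / (n10 + n11) if (n10 + n11) else 0.0    # P(exceedance | exceedance yesterday)

    def loglik(p, n_hit, n_miss):                      # Bernoulli log-likelihood with 0*log(0) = 0
        return (n_hit * np.log(p) if n_hit else 0.0) + (n_miss * np.log(1.0 - p) if n_miss else 0.0)

    lr = -2.0 * (loglik(pi, n01 + n11, n00 + n10)
                 - loglik(pi0, n01, n00) - loglik(pi1, n11, n10))
    return chi2.sf(lr, df=1)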

From these, we see that the RBGARCH model consistently outperforms the GARCH model, particularly for portfolios with approximately 50% allocation in the spot and futures, taking advantage of its multivariate structure. While it is clear from these plots that the RBGARCH model performs well, it is interesting to examine the limitations of the

RBGARCH model. To this end, Figure 7.C.7 shows the expected and observed number

of exceedances, corresponding to different confidence levels in the VaR calculation, for

the GARCH and RBGARCH models. As can be seen from these figures, neither the GARCH

nor the RBGARCH model is perfectly calibrated (signified by the red line), though the two

models fail differently. The GARCH model typically has more observed exceedances

than expected, indicating a systematic underestimation of VaR, while the RBGARCH

has generally fewer observed than expected, indicating overly conservative estimates.

This is consistent with what we would expect for a well-fitting Bayesian forecast,

as discussed in Ardia et al. [317]. (Note that the VaR estimates provided by both

GARCH and RBGARCH approaches are more conservative than would be obtained from their non-Bayesian counterparts, but the RBGARCH model is especially conservative

as it accounts for parameter uncertainty in many more parameters than the GARCH

model.)

7.C.3 Out-of-Sample Risk Management

Figures 7.C.8, 7.C.9, 7.C.10, 7.C.11 repeat the analysis of the previous subsection

on the out-of-sample VaR forecasts. As described in Section 7.4, these forecasts are

obtained by filtering the volatility for each posterior sample and then calculating an

overall VaR estimate by marginalizing over the posterior samples. Not surprisingly,

our out-of-sample forecasts are generally less accurate than our in-sample estimates.

Despite this, we see essentially the same results as before: the RBGARCH model consis-

tently outperforms the GARCH model, doing particularly well on portfolios with large amounts of both spot and futures. When the RBGARCH is mis-calibrated, it tends to be slightly conservative, but overall it is quite accurate.

The systematic underestimation of standard GARCH models is well-known to practitioners and model-based estimates of VaR are typically multiplied by a "fudge factor" to account for this. In 1996, the Basel Committee on Banking Supervision recommended the use of a fudge factor of 3 [108, Paragraph B.4(j)], cementing the widespread use of this sort of ad hoc adjustment. While this style of adjustment

is sufficient for conservative risk management, if applied uncritically it can greatly

Figure 7.C.4 : In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. In order to accurately evaluate VaR predictions at this extreme quantile, the reported p-values were obtained by taking the concatenated in-sample predictions over the entire 10-year sample period.

Figure 7.C.5 : In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.5% (1 in 200 days) confidence level. Compared to Figure 7.C.4, the in-sample advantage of the RBGARCH model is still present, albeit less pronounced, as both models are able to capture this portion of the distribution well.

Figure 7.C.6 : In-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. Consistent with Figure 7.C.4, the RBGARCH model consistently outperforms the GARCH model, as can be seen from the upward sloping lines connecting corresponding p-values from the two models.

increase the financial burden imposed on market participants due to the increased capital requirements associated with higher VaR levels. The superior performance of the RBGARCH model suggests that a fudge factor may not be necessary if additional information is used to improve out-of-sample predictive performance.

Figure 7.C.7 : Assessment of the In-Sample VaR estimates of the GARCH and RBGARCH models for a range of sample portfolios. The GARCH model typically underestimates the true VaR (lies above the red line) while the RBGARCH model is more accurate, but with a conservative bias (lies below the red line).

Figure 7.C.8 : Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. Somewhat surprisingly, the RBGARCH model appears to perform almost as well out-of-sample as it does in-sample, while the GARCH model is noticeably less accurate.

Figure 7.C.9 : Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.5% (1 in 200 days) confidence level. The Kupiec [320] and Christoffersen [331] tests both indicate that the RBGARCH model consistently outperforms the GARCH model for all 40 spots in our sample, with the advantage being particularly pronounced for approximately equally weighted portfolios. Unlike the in-sample case (Figure 7.C.5), the RBGARCH model here clearly outperforms the GARCH model.

Figure 7.C.10 : Out-of-sample VaR accuracy measures of the GARCH and RBGARCH models at the 99.8% (1 in 500 days) confidence level. Consistent with Figure 7.C.8, the RBGARCH model consistently outperforms the GARCH model, as can be seen from the upward sloping lines connecting corresponding p-values from the two models.

Figure 7.C.11 : Assessment of the Out-of-Sample VaR estimates of the GARCH and RBGARCH models for a range of sample portfolios. The GARCH model typically underestimates the true VaR (lies above the red line) while the RBGARCH model is more accurate, but with a conservative bias (lies below the red line).

7.D Computational Details

In this section, we give additional details of the computational approach used to es-

timate Model (7.1). To perform estimation, we use the No-U-Turn Sampler variant

of Hamiltonian Monte Carlo [273, 274, 276] as implemented in Stan [275]. Hamilto-

nian Monte Carlo is particularly well suited for models such as ours, as it can take

advantage of gradient information to more efficiently explore high-dimensional and

highly correlated posterior distributions. The relevant Stan code is given in Section

7.D.1. Section 7.D.2 describes ex post diagnostics to assess the efficiency of Hamilto- nian Monte Carlo for our analysis. We confirm that Stan can sample effectively from

the posterior using the Simulation-Based Calibration (SBC) approach of Talts et al.

[332], as described in Section 7.D.3. Finally, we examine the finite sample estimation performance of our model in Section 7.D.4.

7.D.1 Stan Code

The probabilistic programming language Stan [275] provides a modeling language

for high-performance Markov Chain Monte Carlo (‘MCMC’), using the No-U-Turn

Sampler variant of Hamiltonian Monte Carlo [274]. Unless otherwise stated, Stan was used for all posterior inference reported in this paper. The Stan manual [333]

provides an exhaustive introduction to the use of Stan. The Stan code presented

below was used to fit Model (7.1). The Stan code for the univariate model used for comparisons in Section 7.4 can be obtained by removing certain sections of this code.

Stan does not provide a built-in multivariate skew-normal density, so we implement it as a user-defined function. For computational reasons, we model the marginal vari-

ances and the correlation matrix separately, so we use a four-parameter formulation

of the multivariate skew-normal density. Our specification is essentially that of Azza-

lini and Capitanio [316], except we use the Cholesky factor of the correlation matrix, which we denote Ω, instead of the full correlation matrix (which they denote Ωz) to more efficiently evaluate the multivariate normal density. Additionally, we denote

the marginal variances by σ, as opposed to ω.

functions{

real multi_skew_normal_lpdf(vector y, vector mu, vector sigma, vector alpha, matrix omega){

real retval = 0;

int K = rows(y);

retval += multi_normal_cholesky_lpdf(y | mu, diag_pre_multiply(sigma, omega));

for (i in 1:K){

retval += normal_lcdf(dot_product(alpha, (y - mu) ./ sigma) | 0, 1);

}

return retval;

}

}

We combine the market (Henry Hub futures) and asset (non-Henry spot) returns into

a T-length array of 2-vectors.

data {

int T;

vector[T] return_market;

vector[T] return_asset;

vector[T] realized_vol_market;

}

transformed data{

vector[2] returns[T];

for(t in 1:T){

returns[t][1] = return_market[t];

returns[t][2] = return_asset[t];

}

}

As discussed above, we use the Cholesky factor of the correlation matrix for computa-

tional efficiency. This allows for more efficient evaluation of the multivariate normal

density (by avoiding an expensive determinant calculation) and allows Stan to use a

positive (semi)-definiteness-enforcing transformation automatically [334].

parameters {

vector[2] mu;

vector[2] alpha;

cholesky_factor_corr[2] L;

We enforce stationarity by constraining the parameters to sections of the parameter

space which yield a stationary process. Note that we do not use variance target-

ing.

real omega_market;

real gamma_market;

real tau_1_market;

real tau_2_market;

We treat the initial volatility as an unknown parameter to be inferred. In practice, this is not particularly important since we use a long (250 day) window to fit our model.

real sigma_market1;

real omega_asset;

real beta_asset;

real gamma_asset;

real tau_1_asset;

real tau_2_asset;

real sigma_asset1;

real xi;

real phi;

real delta_1_rv;

real delta_2_rv;

real rv_sd;

}

Because the instantaneous volatilities in a GARCH model are deterministic, conditional on the returns, model parameters, and initial condition, we calculate them in the transformed parameters block. Note the use of local variables sigma_market and sigma_asset to reduce memory pressure.

transformed parameters {

vector[2] sigma[T];

vector[T] rv_market_mean;

{

vector[T] sigma_market;

vector[T] sigma_asset;

sigma_market[1] = sigma_market1;

sigma_asset[1] = sigma_asset1;

for(t in 2:T){

sigma_market[t] = sqrt(omega_market +

gamma_market * square(sigma_market[t - 1]) +

tau_1_market * fabs(return_market[t-1]) +

tau_2_market * square(return_market[t-1]));

sigma_asset[t] = sqrt(omega_asset +

gamma_asset * square(sigma_asset[t - 1]) +

beta_asset * square(sigma_market[t - 1]) +

tau_1_asset * fabs(return_asset[t-1]) +

tau_2_asset * square(return_asset[t-1]));

}

for(t in 1:T){

sigma[t][1] = sigma_market[t];

sigma[t][2] = sigma_asset[t];

}

rv_market_mean = xi + phi * sigma_market +

delta_1_rv * fabs(return_market) +

delta_2_rv * square(return_market);

}

}

The priors from Table 7.3.1 are used. Note that Stan uses a mean-standard deviation parameterization of the normal distribution as opposed to the more common mean-variance or mean-precision parameterizations.

model {

// Priors

mu ~ normal(0, 1);

alpha ~ normal(0, 1);

L ~ lkj_corr_cholesky(1);

beta_asset ~ normal(0, 1);

// Market (Henry Hub) GARCH Dynamics
// sigma_t^2 = omega + gamma * sigma_{t-1}^2 + tau_1 * |r_{t-1}| + tau_2 * r_{t-1}^2
omega_market ~ normal(0.002, 0.025);

gamma_market ~ normal(0.8, 0.6);

tau_1_market ~ normal(0, 0.1);

tau_2_market ~ normal(0.1, 0.7);

// Asset (non-Henry Hub Spot) GARCH Dynamics
// sigma_t^2 = omega + gamma * sigma_{t-1}^2 + tau_1 * |r_{t-1}| + tau_2 * r_{t-1}^2
omega_asset ~ normal(0.002, 0.025);

gamma_asset ~ normal(0.8, 0.6);

tau_1_asset ~ normal(0, 0.1);

tau_2_asset ~ normal(0.1, 0.7);

// Realized volatility dynamics
// varsigma_t = xi + phi * sigma_t + delta_1 * |r_t| + delta_2 * r_t^2 + N(0, nu^2)
xi ~ normal(0.02, 0.6);

phi ~ normal(15, 60);

delta_1_rv ~ normal(1.15, 8);

delta_2_rv ~ normal(1.15, 14);

rv_sd ~ normal(0.05, 0.25);

A weak data-dependent prior is placed on the initial conditions of the GARCH volatility.

// Initialize initial vol with weak prior

sigma_market1 ~ normal(sd(return_asset), 5 * sd(return_asset));

sigma_asset1 ~ normal(sd(return_asset), 5 * sd(return_asset));

// Likelihood

realized_vol_market ~ normal(rv_market_mean, rv_sd);

for(t in 1:T){

returns[t] ~ multi_skew_normal(mu, sigma[t], alpha, L);

}

}

Σt = diag(σt) L Lᵀ diag(σt) is the covariance matrix of the returns model.

generated quantities{

cov_matrix[2] Sigma[T];

for(t in 1:T){

Sigma[t] = tcrossprod(diag_pre_multiply(sigma[t], L));

}

}
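For completeness, the model can be compiled and sampled from Python as sketched below. The thesis does not prescribe this particular interface; the file name rbgarch.stan, the synthetic data, and the seed are illustrative, while the sampler settings match those described in Section 7.D.2.

import numpy as np
from cmdstanpy import CmdStanModel

rng = np.random.default_rng(1)
T = 250
data = {"T": T,
        "return_market": rng.normal(0.0, 1.0, T).tolist(),           # placeholder returns
        "return_asset": rng.normal(0.0, 1.0, T).tolist(),
        "realized_vol_market": np.abs(rng.normal(1.0, 0.2, T)).tolist()}

model = CmdStanModel(stan_file="rbgarch.stan")    # file containing the program above (name assumed)
fit = model.sample(data=data, chains=4, iter_warmup=1000, iter_sampling=1000,
                   adapt_delta=0.99, max_treedepth=12, seed=42)
print(fit.summary().head())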

7.D.2 MCMC Diagnostics

We propose the use of the Hamiltonian variant of Markov Chain Monte Carlo to

estimate Model (7.1). While Hamiltonian Monte Carlo (HMC), like all commonly used MCMC methods, is guaranteed to asymptotically recover the true posterior, one should always carefully assess its performance in a given simulation before performing inference based on its output. Gelman and Shirley [335] give a succinct review of

standard MCMC diagnostics. We supplement these with various recently developed

HMC-specific diagnostics, such as the Expected Bayesian Fraction of Missing Infor-

mation (E-BFMI) [336] and divergent transitions [276, Section 6.2]. While the use

of Hamiltonian Monte Carlo for Realized GARCH models has not been previously

examined, our results suggest that it is an efficient and robust sampling scheme for

this type of model.

For each run of HMC, we run four separate chains of 2000 iterations.∗ While 2000

iterations may seem insufficient, particularly as compared to the tens of thousands

typically used with other samplers, HMC typically mixes far more efficiently than

other methods, reducing the total number of iterations required. As we will see below,

this is sufficient for our problem. To increase the robustness of our results, we used a

slightly higher target adaptation rate (adapt_delta = 0.99) and maximum tree-depth

(max_treedepth = 12) than Stan's default settings (0.8 and 10 respectively). For the vast majority of our fits, no divergent transitions were encountered, the maximum treedepth was never hit, and the average E-BFMI across chains was above 0.9 (values

below 0.2 are typically considered indicative of a sampling pathology). Taken together,

these results suggest that the sampler was able to efficiently explore the posterior

distribution.

Having confirmed that the sampler was able to efficiently explore the posterior dis-

tribution, we are now in a position to assess whether the sampler output provides

enough precision for subsequent analyses. We compute both the effective sample size

and the split-R̂ diagnostic [337, Sections 11.4-11.5] for each of the parameters of our

model as well as the estimated volatilities σt. Our results are presented in Table 7.D.1. The diagnostics suggest that we are able to explore the posterior efficiently and that

our results are sufficiently precise for downstream analyses.

∗Following Stan's default settings, the first 1000 iterations were used as an adaptation period and discarded, while the second 1000 iterations were stored. No thinning was performed.
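Continuing the sampling sketch above, these diagnostics can be tabulated with ArviZ (one of several packages implementing effective sample size and split-R̂; the thesis does not specify the software used to produce Table 7.D.1).

import arviz as az

idata = az.from_cmdstanpy(posterior=fit)           # 'fit' is the CmdStanPy fit from the earlier sketch
summary = az.summary(idata, var_names=["mu", "alpha", "gamma_market", "gamma_asset"])
print(summary[["ess_bulk", "r_hat"]])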

7.D.3 Simulation-Based Calibration

Talts et al. [332] propose a general scheme for validating Bayesian inference, based on

the idea of the data-averaged posterior. They note that, if the parameters are indeed drawn from the prior, the average of the exact posterior distributions is exactly the prior. As such, any deviation between the data-averaged posterior produced by a sampling scheme and the prior indicates biases in the sampling scheme. Applying this technique to Model (7.1) yields the results in Figure 7.D.1, which show no systematic

deviations from uniformity, suggesting that our inference is unbiased and that our

sampling scheme is well suited for the model.
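The SBC recipe itself is easy to state in code. The self-contained toy example below uses a conjugate normal-mean model (not Model (7.1)) so that the posterior can be sampled exactly; with an exact sampler the rank histogram should be approximately uniform.

import numpy as np

rng = np.random.default_rng(0)
n_sims, n_draws, n_obs = 1000, 100, 20
ranks = np.empty(n_sims, dtype=int)

for s in range(n_sims):
    theta = rng.normal(0.0, 1.0)                       # draw the parameter from its N(0, 1) prior
    y = rng.normal(theta, 1.0, size=n_obs)             # simulate data conditional on the draw
    post_var = 1.0 / (1.0 + n_obs)                     # exact conjugate posterior for the mean
    post_mean = post_var * y.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    ranks[s] = int(np.sum(draws < theta))              # rank of the prior draw among posterior draws

counts, _ = np.histogram(ranks, bins=10)               # should be roughly flat if inference is unbiased
print(counts)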

7.D.4 Additional Simulation Studies

In this section, we provide additional evidence that Model (7.1) is able to effectively

capture the underlying GARCH dynamics, even with relatively small data sizes. Since we are concerned with performance across a wide range of possible parameter values, we use wider priors for these simulation studies than we do in the main body of

this paper. In particular, we use a weakly informative N(0, 1) prior on all GARCH

parameters for each study in this section.

The natural first question to ask of any Bayesian model is whether the posterior

credible intervals are indeed accurately computed. If the parameters are drawn from

the prior distribution, the posterior credible intervals should be perfectly calibrated

and, ceteris paribus, should decrease in length as the sample size increases. Figures

7.D.2 and 7.D.3 confirm that our posterior credible intervals are correctly calculated and well calibrated. Hansen et al. [292] demonstrated that the univariate Realized GARCH framework satisfies the conditions for standard (1/√T) asymptotic

Figure 7.D.1 : Simulation-Based Calibration Analysis of Hamiltonian Monte Carlo applied to Model (7.1) with T = 200. The blue bars indicate the number of SBC samples in each rank bin while the green bars give a 99% marginal confidence band for the number of samples in each bin. While there is significant variability in the histograms, commensurate with a complex model, we do not observe any systematic deviations.

convergence of the Quasi-Maximum Likelihood Estimate. Figure 7.D.3 suggests that Model

(7.1) is similarly well-behaved and that Bernstein-von Mises-type convergence rates

can be obtained.

A close examination of Figure 7.D.3 will reveal that parameters associated with the

mean model – namely the µ and α parameters – have wider posterior credible intervals

than the parameters of the GARCH and realized volatility processes (ω, γ, τ, β, etc.). While this may seem counterintuitive at first, it appears to be a consequence of the relatively high levels of volatility considered, which make precise inference on the mean difficult. Despite this difficulty, our simulations suggest that Model (7.1) is able to

recover the underlying dynamics effectively even with relatively small samples.

Parameter              Effective Sample Size (n_eff)   Potential Scale Reduction Factor (R̂)
L_21 = chol(Ω)_21                2704.386                          0.999
L_22 = chol(Ω)_22                2705.967                          0.999
ω_M                              1491.062                          1.001
γ_M                              2345.653                          1.000
τ_{1,M}                          2223.725                          1.000
τ_{2,M}                          2705.410                          1.001
σ_M                              2051.442                          1.000
ω_i                              1994.433                          1.000
β_i                              2604.040                          1.000
γ_i                              2625.324                          1.000
τ_{1,i}                          2331.136                          1.000
τ_{2,i}                          2372.701                          1.000
σ_i                              2084.207                          1.000
ξ                                1715.700                          1.001
ϕ                                1662.953                          1.001
δ_1                              2395.295                          1.000
δ_2                              2297.632                          1.000
ν                                4000.000                          1.000

Table 7.D.1 : Markov Chain Monte Carlo Diagnostics for Hamiltonian Monte Carlo estimation of Model (7.1). These diagnostics are taken from a representative run of four chains of 2000 iterations each. Reported values for σ_M and σ_i are averages over the entire fitting period.

Figure 7.D.2 : Empirical coverage probabilities of posterior credible intervals associated with Model (7.1) under a weakly informative prior distribution. Because we drew the parameters from the prior, these intervals are correctly calibrated.

Figure 7.D.3 : Mean width of symmetric posterior intervals associated with Model (7.1) under a weakly informative prior distribution. We observe standard n^{-1/2}-type convergence for all parameters as the length of the sample period (T) increases. Somewhat surprisingly, we note that parameters associated with the mean model (µ, α) are typically more uncertain than the parameters of the GARCH dynamics.

7.E Additional Background

In this section, we give some additional background for the reader who may be inter-

ested in learning more about volatility models or the structure of LNG markets.

7.E.1 Volatility Models

The first latent volatility process model was the Autoregressive Conditional Heteroscedasticity (ARCH) model of Engle [303], which modeled the instantaneous volatility σ_t² as a weighted sum of the q previous (standardized) returns. Bollerslev [304] proposed the Generalized ARCH (GARCH) model, which instead uses an autoregressive moving average (ARMA) model for the instantaneous volatility, combining previous volatility levels and (standardized) returns. As demonstrated by Bollerslev [304], the use of previous volatility levels significantly reduces the amount of history required, giving a more accurate and more parsimonious model. For this reason, GARCH-type models have generally supplanted ARCH models in applied work.
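For orientation, the standard textbook forms are σ_t² = ω + Σ_{i=1}^{q} α_i r_{t−i}² for the ARCH(q) model and σ_t² = ω + Σ_{i=1}^{q} α_i r_{t−i}² + Σ_{j=1}^{p} β_j σ_{t−j}² for the GARCH(p, q) model; these differ slightly from the asymmetric absolute-return specification used in Model (7.1).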

GARCH-type models have had an enormous impact on financial econometrics and many variants have been proposed, including the integrated GARCH (I-GARCH) of

Engle and Bollerslev [338]; the exponential GARCH (E-GARCH) of Nelson [339]; the

GJR-GARCH of Glosten et al. [340], which allows for asymmetric positive and nega-

tive effects; and the asymmetric power GARCH of Ding et al. [341], which introduces

a Box-Cox transform [342] into the model specification and unifies several previous

proposals. The family GARCH model of Hentschel [343] gives a very general speci-

fication comprising a wide range of GARCH variants. More recent variants attempt

to decompose latent volatilities into long- and short-term components, yielding the

additive and multiplicative component GARCH models of Engle and Lee [344] and

Engle and Sokalska [345], respectively. The glossary of Bollerslev [305] provides a

useful and comprehensive list of variants.

Multivariate extensions of GARCH models are equally numerous and we refer the

reader to the survey of Bauwens et al. [306] for a more detailed review. The sim-

plest multivariate GARCH model is perhaps that of Bollerslev [346], who assumes a

constant conditional correlation (CCC) matrix among the various assets. Engle [347]

extended the CCC model to allow for slowly varying conditional correlations, yield-

ing the popular dynamic conditional correlation (DCC) specification. Our proposed

model uses the CCC approach, though the possibility of misspecification is mitigated

by our rolling refitting strategy which yields an approximate DCC specification.

A popular alternative to GARCH models is the class of “stochastic volatility” models

first introduced by Kim et al. [301]. These models allow the volatility to evolve according to a (typically independent) stochastic process, hence the name. This is

in contrast to GARCH-type models where the next day’s volatility is deterministic,

conditional on the current return and volatility history. We do not review these

models in detail here, instead referring the reader to the paper of Asai et al. [348]

and the discussion in Vankov et al. [302].

A commonly noted shortcoming of both SV- and GARCH-type models is their slow

responsiveness to rapid changes in the volatility level [291, 349, 350]. This slowness

is a consequence of the “smoothing” nature of both SV and GARCH models, which

balance the information provided by a single day’s return against a long history.

By using several volatility measures, it becomes possible to weight the current time

period more heavily and to develop more responsive models, as shown by, e.g., Visser

[351] and Andersen et al. [352] for GARCH models and Takahashi et al. [353] for SV

models.

Driven by market conventions, the most common additional volatility measures are

those based on OHLC (Open, High, Low, Close) data, such as the high-low range

proposed by Parkinson [354], the open-close difference proposed by Garman and Klass

[314], or high, low, and close data as suggested by Rogers and Satchell [307]. The

estimator of Yang and Zhang [308] is optimal (minimum variance unbiased) among

estimators based solely on OHLC data. If higher frequency data is available, even

more accurate estimators have been proposed, though their efficacy is very sensitive

to the structure of the market being considered [355–358].

While realized volatility measures can be used to improve GARCH estimation by

replacing the standardized squared return with an improved estimate, this naive ap-

proach does not capture the structural relationships among different volatility mea-

sures. Several classes of “complete” models which jointly model prices and real-

ized volatility measures have been proposed, including the Multiplicative Error Model

(MEM) framework of Engle and Gallo [309] and the High-Frequency Based Volatility

(HEAVY) framework of Shephard and Sheppard [310]. In this paper, we consider

the Realized GARCH framework of Hansen et al. [292], Hansen and Huang [293],

and Hansen et al. [294], which is among the most flexible specifications proposed to

date. In addition to standard volatility estimation, the realized GARCH framework

has proven useful for risk management, with Watanabe [359] demonstrating its use-

fulness in conditional quantile estimation and Contino and Gerlach [311] and Wang

et al. [360] showing its usefulness for estimating VaR and Expected Shortfall [361–

363].

7.E.2 Natural Gas Markets

The past 20 years have seen significant developments in the structure and size of US

LNG markets. Much of this growth has been driven by technological progress in shale

gas extraction technologies, in particular hydraulic fracturing (“fracking”), which has

lowered production costs and increased demand for LNG and LNG derivatives [364].

At the same time, increasing awareness of the environmental impacts of various energy

sources has prompted additional investment into LNG: while a fossil fuel and not

renewable, LNG is widely considered to be a cleaner energy source than coal and a

possible bridge to a fully renewable energy sector. Over the same period, investor

interest in LNG and other commodities has increased, spurring the growth of liquid

futures and derivatives markets [365].

Li [313], following Narayan and Liu [366], identifies September 2008 as the “coming of age” of modern LNG markets. Modern LNG markets are characterized by active, but relatively stable, spot trading at a large number of hubs and by informative and liquid futures markets. For historical reasons, Henry Hub plays a particularly prominent role in LNG markets, inter alia serving as the reference price for much of the LNG derivatives market; it is, however, far from the only hub of economic interest. Practitioners recognize over one hundred trading hubs in the continental U.S. and Canada, with

the Chicago Citygate, Algonquin Citygate (serving the Boston area), Opal (Lincoln

County, Wyoming), Southern California, and NOVA (Alberta, Canada) hubs being

particularly closely monitored by market participants. (Strictly speaking, some of

these are commonly reported regional average indices, not physical hubs, but we will

continue to refer to them as hubs.)

Mohammadi [299] gives a detailed survey of the pricing structure from well heads (extraction) to end-consumers, while Hou and Nguyen [367] and Hailemariam and Smyth

[368] attempt to quantify the effect of supply and demand shocks on observed prices.

Unlike equities, LNG spot and futures prices do not move in perfect synchrony due to storage and transport costs. The relationship between LNG spot and futures prices was investigated by Ghoddusi [300], who found, inter alia, that short-maturity futures

are Granger-causal for physical prices, a fact which is consistent with our findings on

the usefulness of futures realized volatility in spot price volatility modeling.

The relationship between the prices of LNG and other commodities has been partic-

ularly well studied, and has been found to be surprisingly complex: Hartley et al.

[369] found important dependencies on foreign exchange rates, spot inventories, and

unpredictable (weather) shocks; Brigida [370] and Batten et al. [371] found that tech-

nological advancements weakened, but did not eliminate, the underlying relationship; and Caporin and Fontini [364] investigated the effect of the shale revolution on the relationship.

Chapter 8

Conclusion and Discussion

Over the past two decades, statistical machine learning has achieved remarkable suc-

cesses in the analysis of large and high-dimensional data. A key driver of this process

has been the incorporation and development of convex analysis, which provides effi-

cient algorithms for applying statistical machine learning methods to large data and

a rich theoretical toolkit for analyzing the resulting approaches. This thesis builds

on this framework, applying advanced and novel techniques of convex analysis in

the development of several new statistical methodologies. In particular, this thesis

presents four methods relating to the statistical analysis of non-standard structured

data, including complex valued data, tensor data, manifold-constrained estimation,

and unaligned time series data. Common to each method discussed is the ability to

flexibly incorporate external information, either in the form of an explicit regular-

izer or the implicit regularization imposed by a Bayesian prior, without sacrificing

computational efficiency or statistical accuracy.

Part I develops novel and foundational theory for the statistical analysis of complex valued data. Chapter 2 develops the fundamental tools of complex convex analysis

and establishes the existence of first-order derivatives and subgradients for convex

Wirtinger functions. While the Wirtinger derivative has appeared infrequently in the

statistical signal processing literature, this appears to be the first rigorous treatment

of its properties. A key aspect of my analysis was the convexity assumption, which was used to prove the existence of (sub-)gradients and thus Wirtinger derivatives. The

convexity assumption is somewhat restrictive and could be relaxed to local convexity

(or concavity) allowing the theory to be applied much more broadly. While modern

machine learning is primarily based on first-order methods and analysis, it would also

be interesting to consider second-order convex analysis. As briefly mentioned in Sec-

tion 2.1.1, it is possible to extend Alexandrov-Mignot theory to Wirtinger functions,

but it is less clear how to calculate the “Wirtinger Hessian.” Such analysis would be

useful in extending ideas of restricted strong convexity to the complex domain and

analyzing general M-estimators [60, 85]. Using the Wirtinger derivative, convergence proofs for several optimization methods are provided.
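For reference, the Wirtinger derivatives used here are the standard ones: for a (not necessarily holomorphic) function $f$ of $z = x + iy$,
\[
\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial \bar z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y}\right),
\]
so that, for example, $f(z) = |z|^2 = z\bar z$ has $\partial f/\partial \bar z = z$. It is this conjugate derivative that plays the role of the gradient in Wirtinger-based first-order methods.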

Chapter 3 considers the complex valued lasso or CLasso problem and establishes

estimation, predictive, and model selection consistency results. Key to this analysis

is a concentration bound for complex sub-Gaussian variates: unlike their real counter-

parts, this bound depends both on the variance and on the correlation between the

real and imaginary parts. Experimental results confirm the accuracy of this theory. It would also be interesting to consider “complex generalized linear models,” though it is

difficult to see how the monotonicity of the link function should be generalized.
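To illustrate how the complex geometry enters the computation, the following is a minimal proximal-gradient (ISTA) sketch for the CLasso objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|$, written in Python with NumPy under the standard convention that the penalty is the sum of moduli; it is an illustration, not the thesis software. The key point is that the proximal operator shrinks the modulus of each coordinate while preserving its phase, so no phase information is discarded.

```python
import numpy as np

def complex_soft_threshold(z, lam):
    """Prox of lam * |z| for complex z: shrink the modulus, keep the phase."""
    mod = np.maximum(np.abs(z), 1e-15)
    return z * np.maximum(0.0, 1.0 - lam / mod)

def classo_ista(X, y, lam, n_iter=500):
    """Minimal ISTA sketch for the complex lasso; X and y may be complex-valued."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = X.conj().T @ (X @ beta - y)      # conjugate (Wirtinger-type) gradient
        beta = complex_soft_threshold(beta - step * grad, step * lam)
    return beta
```

For real-valued data this reduces to the familiar lasso soft-thresholding, so the complex estimator is a strict generalization.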

Finally, Chapter 4 extends these results to the covariance estimation problem, show-

ing model selection consistency for both likelihood-based (CGLasso) and pseudo-

likelihood (neighborhood selection) approaches. Having developed these model selec-

tion consistency results for independent (“iid”) sampling, it would be interesting to

consider extensions of these results to the dependent sampling case. Dahlhaus [372]

first considered extensions of PGMs to time series data. In a remarkable paper, he showed that similar conditional independence relationships can be recovered by ex-

amining the inverse coherence matrix, which plays an analogous role to the precision

matrix in non-temporal data. Building off this insight, several authors [200, 373]

have proposed sparse estimators of the inverse spectral coherence matrix, and by ex-

tension the time series PGM structure, but none provided theoretical guarantees of

performance. Applying the tools of Chapter 4 to this problem, it would be interesting

to develop a spectral coherence graphical lasso which is able to provably recover the

graphical model structure of temporal data:

\[
\operatorname*{arg\,min}_{\substack{\Theta \in \mathcal{F}^{p \times p} \\ \Theta \succeq 0}}
\; -\log\det(\Theta) + \langle \Theta, S \rangle + \lambda\,\|\Theta\|_{1,\text{off-diag}}
\quad \text{with} \quad
\begin{cases}
\mathcal{F} = \mathbb{R},\; S = \text{sample covariance} & \Longrightarrow \text{Graphical Lasso}\\
\mathcal{F} = \mathbb{C},\; S = \text{sample spectrum} & \Longrightarrow \text{Time Series PGM.}
\end{cases}
\]

A remarkable aspect of Dahlhaus’s result, inherited by the proposed estimator, is

that it captures any conditional dependence relationships and does not require fixing

a particular parametric form (e.g., autoregressive with known maximum lag) unlike

existing approaches [33, 34, 374, 375], and is instead able to capture structure which is not known a priori.
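As a sketch of how such an estimator could be computed, the standard ADMM splitting for the graphical lasso (cf. [76]) carries over essentially unchanged to Hermitian matrices; the following illustration (not an algorithm developed in this thesis) accepts either a real sample covariance or a complex Hermitian sample spectrum as S.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Modulus soft-thresholding; reduces to the usual rule for real inputs."""
    mod = np.maximum(np.abs(a), 1e-15)
    return a * np.maximum(0.0, 1.0 - kappa / mod)

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200):
    """ADMM sketch for  min -log det(Theta) + <Theta, S> + lam * ||Theta||_{1,off-diag}
    over Hermitian positive definite Theta."""
    p = S.shape[0]
    Z = np.eye(p, dtype=S.dtype)
    U = np.zeros_like(S)
    for _ in range(n_iter):
        # Theta-update: eigendecomposition of the Hermitian matrix rho*(Z - U) - S
        d, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_eigs = (d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_eigs) @ Q.conj().T
        # Z-update: soft-threshold the off-diagonal entries only
        Z = soft_threshold(Theta + U, lam / rho)
        np.fill_diagonal(Z, np.diag(Theta + U))
        # scaled dual update
        U = U + Theta - Z
    return Z
```

With a real sample covariance this is the usual graphical lasso; with a Hermitian smoothed periodogram at a given frequency it produces a sparse estimate of the inverse coherence matrix.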

The success of these approaches depends strongly on accurate estimation of the spec- tra of the underlying time series. While the univariate and low-dimensional cases are well-understood, there has been relatively less work on concentration inequalities for the high-dimensional setting. Recently, Sun et al. [376] developed a novel con-

centration inequality for the sample spectrum, i.e., the sample periodogram. While

their result is powerful, and certainly applicable to the problems described above,

knowledge of the low-dimensional case suggests stronger results can be obtained. In

particular, we know from the low-dimensional case that the sample periodogram is a

sub-optimal estimator of the true spectrum and that tapering and windowing methods

greatly improve estimation accuracy. Fiecas et al. [377] develop a powerful concentra-

tion inequality for the “realified” smoothed periodogram, which they use to analyze

sparse spectrum estimation using the clime estimator of Cai et al. [378], but the gen-

eral question of concentration inequalities for high-dimensional (regularized) variants

of the multitaper family of methods [379] remains an active area of research.
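As a minimal illustration of the tapering idea for a univariate series (function names are illustrative and scaling constants are omitted), a multitaper estimate simply averages the periodograms of several orthogonally tapered copies of the data:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(x, n_tapers=5, nw=3.0):
    """Sketch of a Thomson multitaper spectrum estimate: average the
    periodograms of the series multiplied by orthogonal DPSS tapers."""
    x = np.asarray(x, dtype=float)
    n = x.size
    tapers = dpss(n, nw, Kmax=n_tapers)                  # shape (n_tapers, n)
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    return np.fft.rfftfreq(n), spectra.mean(axis=0)
```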

The PGM framework admits several extensions useful in the neuroscientific context.

Because the proposed estimator has the sample spectrum S as a sufficient statistic, we can use any estimate of S as input to the structure learning problem. In par-

ticular, we can build upon well-established methods for spectrum estimation from

asynchronous data such as the Lomb-Scargle periodogram [380–382] to deal with

the irregular and missing data common in neural recordings. Additionally, recent re-

search has suggested modifications to the graphical lasso estimator which can address

unobserved covariates [181, 383, 384]; these extensions allow recovery of conditional

independence structures among neural recordings even when only part of the brain is

recorded. Future work will address these questions.
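A minimal sketch of such a plug-in estimate for irregularly sampled recordings is given below (the function name and the frequency grid are illustrative choices, not part of the thesis software); because the Lomb–Scargle periodogram is defined directly on the observed sampling times, gaps and missing observations are handled without interpolation.

```python
import numpy as np
from scipy.signal import lombscargle

def lomb_scargle_spectrum(times, values, n_freqs=256):
    """Lomb-Scargle periodogram sketch for irregularly sampled data."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    values = values - values.mean()                      # remove the mean level
    t_span = times.max() - times.min()
    # heuristic angular-frequency grid up to a pseudo-Nyquist rate
    freqs = np.linspace(2 * np.pi / t_span, np.pi * len(times) / t_span, n_freqs)
    return freqs, lombscargle(times, values, freqs)
```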

Part II suggests a variety of models for dealing with data with non-trivial dependency

structures. Chapter 5 proposes several efficient operator-splitting algorithms for con- vex co-clustering. These algorithms scale efficiently with the ambient dimensionality

of the data and with the rank of the data tensor. A key advantage of these approaches

is that they avoid the need to solve a difficult high-dimensional Sylvester equation and

instead only rely on elementary multiplications and tensor contractions, which are

easy to implement efficiently. While this chapter makes co-clustering practical for

large data, several methodological questions remain. An appealing intuitive property

of convex co-clustering is the “nested” clusterings produced, but no formal proof of

this has been given. Chi and Steinerberger [385] show that, for one-way clustering,

any possible tree structure can be recovered by suitably chosen weights, but this is

a far cry from establishing that a given weight scheme guarantees a tree structure.

More statistically, Radchenko and Mukherjee [228] establish a “tree-wise” consistency

property of convex clustering that could potentially be extended to co-clustering. Fi-

nally, Campbell [386, Chapter 4] gives a set of novel selective inferential methods that could be extended to the co-clustering case as well.
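To see why these splittings reduce to elementary operations, note that the fusion penalty enters only through its proximal operator, a row-wise group soft-thresholding applied to the matrix of pairwise differences; the sketch below (illustrative, not the thesis implementation) is the only non-trivial update such algorithms require.

```python
import numpy as np

def prox_sum_of_norms(V, sigma):
    """Row-wise prox of sigma * sum_l ||V_l||_2, where row l of V holds the
    difference U_i - U_j associated with one fusion weight."""
    norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-15)
    return np.maximum(0.0, 1.0 - sigma / norms) * V
```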

Chapter 6 proposes a new approach for regularized principal components which uti-

lizes a pair of generalized Stiefel manifold constraints to ensure orthogonality of the

estimated principal components. Extending the rank-one Sparse and Functional Prin-

cipal Components Analysis framework to estimating multiple principal components,

this approach is able to incorporate arbitrary combinations of smoothness and sparsity

regularization in a single unified algorithm. Three manifold optimization approaches

are given and their performance is compared. Rather surprisingly for a non-convex

problem, the alternating manifold proximal gradient scheme is shown to converge to

a stationary point, but experiments suggest the Manifold ADMM achieves a better

local optimum. The matrix factorization method at the heart of multi-rank Sparse

and Functional PCA could be fruitfully extended to the tensor case, where it would

form a regularized version of the Tucker decomposition [260] or the Higher-Order SVD of Lathauwer et al. [387]. As a regularized SVD, this approach could be applied throughout multivariate analysis wherever the SVD is used, suggesting regularized variants of partial least squares, canonical correlation analysis, or linear discriminant analysis. The projection deflation technique generalizes naturally to the tensor

domain, but it is less obvious how to generalize Schur deflation to the higher-order

setting. The identifiability concerns raised in Section 6.B.3 also suggest an avenue for

further study. The generalized Stiefel manifold is not the most natural domain for

regularized low-rank approximation, as signs and orderings are not identified by the

data. The manifold formed by taking the quotient of the generalized Stiefel manifold with the signed permutation group would avoid the identifiability pathologies, but

this object has not been previously studied in the literature.
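To make the manifold constraint concrete, the sketch below implements a standard polar (SVD-based) projection onto the Stiefel manifold, together with its extension to the generalized constraint $V^\top M V = I$ via a change of variables; it illustrates one common retraction and is not a claim about which retraction the Chapter 6 algorithms employ.

```python
import numpy as np

def stiefel_polar_projection(A):
    """Nearest matrix with orthonormal columns (Frobenius norm), via the SVD."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def generalized_stiefel_projection(A, M_sqrt, M_sqrt_inv):
    """Sketch for the constraint V^T M V = I: map to W = M^{1/2} V, project W
    onto the ordinary Stiefel manifold, and map back."""
    return M_sqrt_inv @ stiefel_polar_projection(M_sqrt @ A)
```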

Finally, Chapter 7 proposes a multivariate and multiscale “realized beta” GARCH model for natural gas spot markets. This model is able to adapt to irregular data availability in commodities markets, while providing significantly improved tail esti- mation and risk management. Market-calibrated priors were proposed and superior out of sample performance was demonstrated across a range of market conditions.

The flexible framework of this model can be fruitfully applied to a wide range of commodities markets with similar patterns of data irregularity. In this case, the structure of the correlation matrix Ω is considered known and fixed in a single-factor

model, as clearly suggested by Figure 7.2.2. While this assumption is reasonable for

commodities markets, it is less clear for other markets, where a variety of factors

may be present. A challenging, but worthwhile, extension would be to estimate Ω

corresponding to a conditional independence or multi-factor structure (Ω−1 sparse or

low-rank) as would be expected in equity markets.

This thesis demonstrates the richness of computationally grounded modern statistical

methodology in four different domains. Taken together, these methods show a variety

of mathematically well-posed and statistically powerful approaches for understanding

structured data. A recurring theme of this work is the importance of accounting for

dependencies inherent in the data directly in the estimation process and the supe-

riority of these approaches over post hoc alternatives. As discussed above, the work presented herein provides fertile ground upon which to build more sophisticated analytical tools, pushing the tools of computational optimization and convex analysis even further. The proof is in the pudding, however, and applications of these techniques to large structured data are necessary to firmly establish their usefulness. To this end, open-source software implementing these methods is available online and will be maintained into the future.

Works Cited

[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of

the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–

288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x.

[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by

basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–

61, 1998. doi: 10.1137/S1064827596304010.

[3] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex

geometry of linear inverse problems,” Foundations of Computational Mathe-

matics, vol. 12, no. 6, pp. 805–849, 2012. doi: 10.1007/s10208-012-9135-7.

[4] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incom-

plete and inaccurate measurements,” Communications on Pure and Applied

Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006. doi: 10.1002/cpa.20124.

[5] ——, “Robust uncertainty principles: Exact signal reconstruction from highly

incomplete frequency information,” IEEE Transactions on Information The-

ory, vol. 52, no. 2, pp. 489–509, 2006. doi: 10.1109/TIT.2005.862083.

[6] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information The-

ory, vol. 52, no. 4, pp. 1289–1306, 2006. doi: 10.1109/TIT.2006.871582.

[7] E. Greenshtein and Y. Ritov, “Persistence in high-dimensional linear predic-

tor selection and the virtue of overparametrization,” Bernoulli, vol. 10, no. 6,

pp. 971–988, 2004. doi: 10.3150/bj/1106314846. 255

[8] P. Zhao and B. Yu, “On model selection consistency of lasso,” Journal of Ma-

chine Learning Research, vol. 7, pp. 2541–2563, 2006. Available at: http://

www.jmlr.org/papers/volume7/zhao06a/zhao06a.pdf.

[9] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is

much larger than n,” Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.

doi: 10.1214/009053606000001523.

[10] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of lasso

and Dantzig selector,” Annals of Statistics, vol. 37, no. 4, pp. 1705–1732, 2009.

doi: 10.1214/08-AOS620.

[11] J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song, “Dimensionality

reduction via sparse support vector machines,” Journal of Machine Learning

Research, vol. 3, pp. 1229–1243, 2003. Available at: http://jmlr.org/papers/

v3/bi03a.html.

[12] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, “High-dimensional Ising

model selection using ℓ1-regularized logistic regression,” Annals of Statistics, vol. 38, no. 3, pp. 1287–1319, 2010. doi: 10.1214/09-AOS691.

[13] G. I. Allen, “Automatic feature selection via weighted kernels and regular-

ization,” Journal of Computational and Graphical Statistics, vol. 22, no. 3,

pp. 284–299, 2013. doi: 10.1080/10618600.2012.681213.

[14] K. Pelckmans, J. de Brabanter, B. de Moor, and J. Suykens, “Convex clustering

shrinkage,” in PASCAL Workshop on Statistics and Optimization of Clustering,

2005.

[15] F. Lindsten, H. Ohlsson, and L. Ljung, “Clustering using sum-of-norms reg-

ularization: With application to particle filter output computation,” in SSP

2011: Proceedings of the 2011 IEEE Statistical Signal Processing Workshop, 256

P. M. Djuric, Ed., Nice, France: Curran Associates, Inc., 2011, pp. 201–204.

doi: 10.1109/SSP.2011.5967659.

[16] T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert, “Clusterpath: An algorithm

for clustering using convex fusion penalties,” in ICML 2011: Proceedings of the

28th International Conference on Machine Learning, L. Getoor and T. Scheffer,

Eds., Bellevue, Washington, USA: ACM, 2011, pp. 745–752, isbn: 978-1-4503-

0619-5. Available at: http://www.icml-2011.org/papers/419_icmlpaper.

pdf.

[17] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable

selection with the lasso,” Annals of Statistics, vol. 34, no. 3, pp. 1436–1462,

2006. doi: 10.1214/009053606000000281.

[18] M. Yuan and Y. Lin, “Model selection and estimation in the Gaussian graphical

model,” Biometrika, vol. 94, no. 1, pp. 19–35, 2007. doi: 10.1093/biomet/

asm018.

[19] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through

sparse maximum likelihood estimation for multivariate Gaussian or binary

data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008. Avail-

able at: http://jmlr.org/papers/v9/banerjee08a.html.

[20] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estima-

tion with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

doi: 10.1093/biostatistics/kxm045.

[21] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, “High-dimensional

covariance estimation by minimizing ℓ1-penalized log-determinant divergence,” Electronic Journal of Statistics, vol. 5, pp. 935–980, 2011. doi: 10.1214/11-

EJS631. 257

[22] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, “Graphical models via gener-

alized linear models,” in NIPS 2012: Advances in Neural Information Process-

ing Systems 25, (Lake Tahoe, USA), F. Pereira, C. Burges, L. Bottou, and K.

Weinberger, Eds., 2012. Available at: http://papers.nips.cc/paper/4617-

graphical-models-via-generalized-linear-models.

[23] E. Yang, Y. Baker, P. Ravikumar, G. I. Allen, and Z. Liu, “Mixed graphi-

cal models via exponential families,” in AISTATS 2014: Proceedings of the

17th International Conference on Artificial Intelligence and Statistics, (Reyk-

javik, Iceland), S. Kaski and J. Corande, Eds., 2014. Available at: http://

proceedings.mlr.press/v33/yang14a.pdf.

[24] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, “Graphical models via univari-

ate exponential family distributions,” Journal of Machine Learning Research,

vol. 16, pp. 3813–3847, 2015. Available at: http://jmlr.org/papers/v16/

yang15a.html.

[25] H. Shen and J. Z. Huang, “Sparse principal component analysis via regularized

low rank matrix approximation,” Journal of Multivariate Analysis, vol. 99,

no. 6, pp. 1015–1034, 2008. doi: 10.1016/j.jmva.2007.06.007.

[26] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposi-

tion, with applications to sparse principal components and canonical correla-

tion analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009. doi: 10.1093/

biostatistics/kxp008.

[27] G. I. Allen, L. Grosenick, and J. Taylor, “A generalized least-square matrix de-

composition,” Journal of the American Statistical Association, vol. 109, no. 505,

pp. 145–159, 2014. doi: 10.1080/01621459.2013.852978. 258

[28] G. I. Allen and M. Weylandt, “Sparse and functional principal components

analysis,” in DSW 2019: Proceedings of the 2nd IEEE Data Science Work-

shop, G. Karypis, G. Michailidis, and R. Willett, Eds., Minneapolis, Minnesota:

IEEE, 2019, pp. 11–16. doi: 10.1109/DSW.2019.8755778.

[29] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,”

Foundations of Computational Mathematics, vol. 9, pp. 717–772, 2009. doi:

10.1007/s10208-009-9045-5.

[30] E. J. Candès and T. Tao, “The power of convex relaxation: Near-optimal

matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5,

pp. 2053–2080, 2010. doi: 10.1109/TIT.2010.2044061.

[31] B. Recht, “A simpler approach to matrix completion,” Journal of Machine

Learning Research, vol. 12, pp. 3413–3430, 2011. Available at: http://www.

jmlr.org/papers/v12/recht11a.html.

[32] F. Han and H. Liu, “Transition matrix estimation in high dimensional time

series,” in ICML 2013: Proceedings of the 30th International Conference on

Machine Learning, S. Dasgupta and D. McAllester, Eds., vol. 28, Atlanta,

Georgia: PMLR, 2013, pp. 172–180. Available at: http://proceedings.mlr.

press/v28/han13a.html.

[33] S. Basu and G. Michailidis, “Regularized estimation in sparse high-dimensional

time series models,” Annals of Statistics, vol. 43, no. 4, pp. 1535–1567, 2015.

doi: 10.1214/15-AOS1315.

[34] S. Basu, X. Li, and G. Michailidis, “Low rank and structured modeling of high-

dimensional vector autoregressions,” IEEE Transactions on Signal Processing,

vol. 67, no. 5, pp. 1207–1222, 2019. doi: 10.1109/TSP.2018.2887401. 259

[35] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and

its oracle properties,” Journal of the American Statistical Association, vol. 96,

no. 456, pp. 1348–1360, 2001. doi: 10.1198/016214501753382273.

[36] H. Zou and T. Hastie, “Regularization and variable selection via the elastic

net,” Journal of the Royal Statistical Society, Series B: Statistical Methodology,

vol. 67, no. 2, pp. 301–320, 2005. doi: 10.1111/j.1467-9868.2005.00503.x.

[37] E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted

ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877–905, 2008. doi: 10.1007/s00041-008-9045-x.

[38] C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,”

Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010. doi: 10 . 1214 / 09 -

AOS729.

[39] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès, “SLOPE:

Adaptive variable selection via convex optimization,” Annals of Applied Statis-

tics, vol. 9, no. 3, pp. 1103–1140, 2015. doi: 10.1214/15-AOAS842.

[40] M. Neykov, J. S. Liu, and T. Cai, “L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs,” Jour-

nal of Machine Learning Research, vol. 17, no. 87, pp. 1–37, 2016. Available

at: http://jmlr.org/papers/v17/16-006.html.

[41] P. Whittle, “Estimation and information in stationary time series,” Arkiv för

Matematik, vol. 2, no. 5, pp. 423–434, 1953. doi: 10.1007/BF02590998.

[42] A. M. Sykulski, S. C. Olhede, A. P. Guillaumin, J. M. Lilly, and J. J. Early,

“The debiased Whittle likelihood,” Biometrika, vol. 106, no. 2, pp. 251–266,

2019. doi: 10.1093/biomet/asy071. 260

[43] J. F. de Andrade Jr., M. L. R. de Campos, and J. A. Apolinário Jr., “A complex

version of the LASSO algorithm and its applications to beamforming,” in The

7th International Telecommunications Symposium (ITS 2010), 2010.

[44] Z. Yang and C. Zhang, “Sparsity-undersampling tradeoff of compressed sensing

in the complex domain,” in ICASSP 2011: Proceedings of the 2011 IEEE

International Conference on Acoustics, Speech, and Signal Processing, (Prague,

Czech Republic), 2011, pp. 3668–3671. doi: 10.1109/ICASSP.2011.5947146.

[45] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk, “Asymptotic analysis of

complex LASSO via complex approximate message passing (CAMP),” IEEE

Transactions on Information Theory, vol. 59, no. 7, pp. 4290–4308, 2013. doi:

10.1109/TIT.2013.2252232.

[46] C. F. Mechlenbrauker, P. Gerstoft, and E. Zochmann, “c-Lasso and its dual for

sparse signal estimation from array data,” Signal Processing, vol. 130, pp. 204–

215, 2017. doi: 10.1016/j.sigpro.2016.06.029.

[47] C.-H. Yu, R. Prado, H. Ombao, and D. Rowe, “A Bayesian variable selection

approach yields improved detection of brain activation from complex-valued

fMRI,” Journal of the American Statistical Association, vol. 113, no. 524,

pp. 1395–1410, 2018. doi: 10.1080/01621459.2018.1476244.

[48] S. de Iaco, M. Palma, and D. Posa, “Covariance functions and models for

complex-valued random fields,” Stochastic Environmental Research and Risk

Assessment, vol. 17, no. 3, pp. 145–156, 2003. doi: 10.1007/s00477- 003-

0129-5.

[49] D. P. Mandic, S. Javidi, S. L. Goh, A. Kuh, and K. Aihara, “Complex-valued

prediction of wind profile using augmented complex statistics,” Renewable En-

ergy, vol. 34, no. 1, pp. 196–201, 2009. doi: 10.1016/j.renene.2008.03.022. 261

[50] S. de Iaco, “The cgeostat software for analyzing complex-valued random fields,”

Journal of Statistical Software, vol. 79, no. 5, 2017. doi: 10.18637/jss.v079.

i05.

[51] C. W. J. Granger and R. Engle, “Applications of spectral analysis in econo-

metrics,” in Time Series in the Frequency Domain, ser. Handbook of Statis-

tics, D. R. Brillinger and P. Krishnaiah, Eds., vol. 3, 1983, pp. 93–109. doi:

10.1016/S0169-7161(83)03007-2.

[52] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Meth-

ods, Theory and Applications, 1st, ser. Springer Series in Statistics. Springer-

Verlag Berlin Heidelberg, 2011, isbn: 978-3-642-20191-2. doi: 10.1007/978-

3-642-20192-9.

[53] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity:

The Lasso and Generalizations, ser. Monographs on Statistics and Applied

Probability 143. Chapman & Hall / CRC Press, 2015, isbn: 978-1-498-71216-

3. Available at: https://web.stanford.edu/~hastie/StatLearnSparsity/.

[54] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint,

ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge

University Press, 2019, isbn: 978-1-108-62777-1.

[55] M. B. Priestley, Spectral Analysis and Time Series. Academic Press, 1981, isbn:

978-0-125-64922-3.

[56] L. H. Koopmans, The Spectral Analysis of Time Series. Academic Press, 1995,

isbn: 978-0-124-19251-5.

[57] D. B. Percival and A. T. Walden, Wavelet Methods for Time Series Analy-

sis, 1st, ser. Cambridge Series in Statistical and Probabilistic Mathematics.

Cambridge University Press, 2006, isbn: 978-0-521-68508-5. 262

[58] P. J. Huber, “Robust estimation of a location parameter,” Annals of Math-

ematical Statistics, vol. 35, no. 1, pp. 73–101, 1964. doi: 10 . 1214 / aoms /

1177703732.

[59] S. van de Geer, Empirical Processes in M-Estimation, 1st, ser. Cambridge Se-

ries in Statistical and Probabilistic Mathematics. Cambridge University Press,

2000, isbn: 978-0-521-12325-9.

[60] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, “A unified

framework for high-dimensional analysis of M-estimators with decomposable

regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012. doi: 10.

1214/12-STS400.

[61] W. Karush, “Minima of functions of several variables with inequalities as side

conditions,” in Traces and Emergence of Nonlinear Programming, G. Giorgi

and T. H. Kjeldsen, Eds. Birkhäuser, 2014, ch. 10, pp. 217–245, Originally

published in 1939 as W. Karush’s M.Sc. thesis at the University of Chicago,

Department of Mathematics, isbn: 978-3-0348-0438-7. doi: 10.1007/978-3-

0348-0439-4.

[62] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” in Proceedings

of the Second Berkeley Symposium on Mathematical Statistics and Probabil-

ity, University of California Press, 1951, pp. 481–492. doi: euclid.bsmsp/

1200500249.

[63] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970, isbn:

978-0-691-01586-6.

[64] ——, “Lagrange multipliers and optimality,” SIAM Review, vol. 35, no. 2,

pp. 183–238, 1993. doi: 10.1137/1035044. 263

[65] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymp-

totic Theory of Independence, 1st. Oxford University Press, 2016, isbn: 978-0-

198-76765-7.

[66] R. Vershynin, High-Dimensional Probability: An Introduction with Applications

in Data Science, 1st, ser. Cambridge Series in Statistical and Probabilistic

Mathematics. Cambridge University Press, 2018, isbn: 978-1-108-41519-4.

[67] W. Wirtinger, “Zur formalen theorie der functionen von mehr komplexen

veränderlichen,” Mathematische Annalen, vol. 97, pp. 357–375, 1927. doi: 10.

1007/BF01447872.

[68] D. H. Brandwood, “A complex gradient operator and its application in adap-

tive array theory,” IEE Proceedings F: Communications, Radar and Signal

Processing, vol. 130, no. 1, pp. 11–16, 1983. doi: 10.1049/ip-f-1.1983.0003.

[69] A. van den Bos, “Complex gradient and Hessian,” IEE Proceedings - Vision,

Image and Signal Processing, vol. 141, no. 6, pp. 380–382, 1994. doi: 10.1049/

ip-vis:19941555.

[70] ——, “The multivariate complex normal distribution - a generalization,” IEEE

Transactions on Information Theory, vol. 41, no. 2, pp. 537–539, 1995. doi:

10.1109/18.370165.

[71] B. Picinbono, “Second-order complex random vectors and normal distribu-

tions,” IEEE Transactions on Signal Processing, vol. 44, no. 11, 1996. doi:

10.1109/78.539051.

[72] A. van den Bos, “The real-complex normal distribution,” IEEE Transactions

on Information Theory, vol. 44, no. 4, pp. 1670–1672, 1998. doi: 10.1109/18.

681349. 264

[73] K. Kreutz-Delgado, “The complex gradient operator and the CR-calculus,”

ArXiv Pre-Print 0906.4835, 2009. Available at: https://arxiv.org/abs/

0906.4835.

[74] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University

Press, 2004. Available at: https://web.stanford.edu/~boyd/cvxbook/.

[75] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm

for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1,

pp. 183–202, 2009. doi: 10.1137/080716542.

[76] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimiza-

tion and statistical learning via the alternating direction method of multipliers,”

Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

doi: 10.1561/2200000016.

[77] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in

Optimization, vol. 1, no. 3, pp. 127–239, 2013. doi: 10.1561/2400000003.

[78] J. D. Lee, Y. Sun, and M. Saunders, “Proximal Newton-type methods for min-

imizing composite functions,” SIAM Journal on Optimization, vol. 24, no. 3,

pp. 1420–1443, 2014. doi: 10.1137/130921428.

[79] R. H. Byrd, J. Nocedal, and F. Oztoprak, “An inexact successive quadratic

approximation method for l1 regularized optimization,” Mathematical Pro-

gramming, Series B, vol. 157, pp. 375–396, 2016. doi: 10.1007/s10107-015-

0941-y.

[80] L. Sorber, M. V. Barel, and L. D. Lathauwer, “Unconstrained optimization of

real functions in complex variables,” SIAM Journal on Optimization, vol. 22,

no. 3, pp. 879–898, 2012. doi: 10.1137/110832124. 265

[81] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity

recovery using ℓ1-constrained quadratic programming (lasso),” IEEE Trans- actions on Information Theory, vol. 55, no. 5, pp. 2183–2202, 2009. doi: 10.

1109/TIT.2009.2016018.

[82] F. Bach, “Consistency of the group lasso and multiple kernel learning,” Journal

of Machine Learning Research, vol. 9, pp. 1179–1225, 2008. Available at: http:

//www.jmlr.org/papers/v9/bach08b.html.

[83] P.-L. Loh and M. J. Wainwright, “Support recovery without incoherence: A

case for nonconvex regularization,” Annals of Statistics, vol. 45, no. 6, pp. 2455–

2482, 2017. doi: 10.1214/16-AOS1530.

[84] F. Campbell and G. I. Allen, “Within-group variable selection through the

exclusive lasso,” Electronic Journal of Statistics, vol. 11, no. 2, pp. 4220–4257,

2017. doi: 10.1214/17-EJS1317.

[85] J. D. Lee, Y. Sun, and J. E. Taylor, “On model selection consistency of reg-

ularized M-estimators,” Electronic Journal of Statistics, vol. 9, pp. 608–642,

2015. doi: 10.1214/15-EJS1013.

[86] R. T. Canolty, E. Edwards, S. S. Dalal, M. Soltani, S. S. Nagarajan, H. E.

Kirsch, M. S. Berger, N. M. Barbaro, and R. T. Knight, “High gamma power

is phase-locked to theta oscillations in the human neocortex,” Science, vol. 313,

no. 5793, pp. 1626–1628, 2006. doi: 10.1126/science.1128115.

[87] P. Lakatos, G. Karmos, A. D. Mehta, I. Ulbert, and C. E. Schroeder, “Entrain-

ment of neuronal oscillations as a mechanism of attentional selection,” Science,

vol. 320, no. 5872, pp. 110–113, 2008. doi: 10.1126/science.1154735.

[88] X. Lin, X. Meng, P. Karunanayaka, and S. K. Holland, “A spectral graphical

model approach for learning brain connectivity network of children’s narrative 266

comprehension,” Brain Connectivity, vol. 1, no. 5, 2011. doi: 10.1089/brain.

2011.0045.

[89] A. J. Watrous, L. Deuker, J. Fell, and N. Axmacher, “Phase-amplitude cou-

pling supports phase coding in human ECoG,” eLife, vol. 4, no. e07886, 2015.

doi: 10.7554/eLife.07886.

[90] H. Watanabe, K. Takahashi, and T. Isa, “Phase locking of β oscillation in

electrocorticography (ECoG) in the monkey motor cortex at the onset of EMGs

and 3D reaching movements,” in EMBC 2015: Proceedings of the 37th Annual

International Conference of the IEEE Engineering in Medicine and Biology

Society, (Milan, Italy), N. Lovell and L. Mainardi, Eds., 2015. doi: 10.1109/

EMBC.2015.7318299.

[91] D. W. Adrian, R. Maitra, and D. B. Rowe, “Complex-valued time series mod-

eling for improved activation detection in fMRI studies,” Annals of Applied

Statistics, vol. 12, no. 3, pp. 1451–1478, 2018. doi: 10.1214/17-AOAS1117.

[92] N. Klein, J. Orellana, S. L. Brincat, E. K. Miller, and R. E. Kass, “Torus

graphs for multivariate phase coupling analysis,” Annals of Applied Statistics,

vol. 14, no. 2, pp. 635–660, 2020.

[93] M. R. Nuwer, “Electrocorticography and intraoperative electroencephalogra-

phy,” in Intraoperative Neurophysiologic Monitoring, G. M. Galloway, M. R.

Nuwer, J. R. Lopez, and K. M. Zamel, Eds., Cambridge University Press, 2010,

pp. 77–89. doi: 10.1017/CBO9780511777950.009.

[94] Blausen.com Staff, “Medical gallery of Blausen Medical 2014,” WikiJournal of

Medicine, vol. 1, no. 2, 2014. doi: 10.15347/wjm/2014.010.

[95] S. Chandran, A. Mishra, V. Shirhatti, and S. Ray, Journal of Neuroscience,

vol. 36, no. 12, pp. 339–3408, 2016. doi: 10.1523/JNEUROSCI.3633-15.2016. 267

[96] M. Narayan, “Inferential methods to find differences in populations of graph-

ical models with applications to functional connectomics,” PhD thesis, Rice

University, 2015.

[97] L. Wilkinson and M. Friendly, “The history of the cluster heat map,” The

American Statistician, vol. 63, no. 2, pp. 179–184, 2009. doi: 10.1198/tas.

2009.0033.

[98] E. C. Chi, B. R. Gaines, W. W. Sun, H. Zhou, and J. Yang, “Provable con-

vex co-clustering of tensors,” ArXiv Pre-Print 1803.06518, 2018. Available at:

https://arxiv.org/abs/1803.06518.

[99] E. C. Chi, G. I. Allen, and R. G. Baraniuk, “Convex biclustering,” Biometrics,

vol. 73, no. 1, pp. 10–19, 2017. doi: 10.1111/biom.12540.

[100] W. Deng and W. Yin, “On the global and linear convergence of the generalized

alternating direction method of multipliers,” Journal of Scientific Computing,

vol. 66, no. 3, pp. 889–916, 2016. doi: 10.1007/s10915-015-0048-x.

[101] T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk, “Fast alternating

direction optimization methods,” SIAM Journal on Imaging Sciences, vol. 7,

no. 3, pp. 1588–1623, 2014. doi: 10.1137/120896219.

[102] H. Hotelling, “Analysis of a complex of statistical variables into principal com-

ponents,” Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

doi: 10.1037/h0071325.

[103] J. O. Ramsay and B. W. Silverman, Applied Functional Data Analysis: Methods

and Case Studies, 1st, ser. Spring Series in Statistics. Springer-Verlag New York,

2002, isbn: 978-0-387-95414-1. doi: 10.1007/b98886.

[104] ——, Functional Data Analysis, 2nd, ser. Springer Series in Statistics. Springer-

Verlag New York, 2005, isbn: 978-0-387-40080-8. doi: 10.1007/b98888. 268

[105] H. Zou and L. Xue, “A selective overview of sparse principal component anal-

ysis,” Proceedings of the IEEE, vol. 106, no. 8, pp. 1311–1320, 2018. doi:

10.1109/JPROC.2018.2846588.

[106] J. Z. Huang, H. Shen, and A. Buja, “Functional principal components analysis

via penalized rank one approximation,” Electronic Journal of Statistics, vol. 2,

pp. 678–695, 2008. doi: 10.1214/08-EJS218.

[107] ——, “The analysis of two-way functional data using two-way regularized sin-

gular value decompositions,” Journal of the American Statistical Association,

vol. 104, no. 488, pp. 1609–1620, 2009. doi: 10.1198/jasa.2009.tm08024.

[108] Basel Committee on Banking Supervision, Bank for International Settlements,

Amendment to the capital accord to incorporate market risks, Accessed: 2018-

11-29, 1996. Available at: https://www.bis.org/publ/bcbs24.htm.

[109] M. Fréchet, “Généralisation d’un théorème de Weierstrass,” Comptes Rendus

Hebdomadaires des Séances de l’Académie des Sciences, vol. 134, no. 2, pp. 848–

850, Juillet - Décembre 1904.

[110] S. P. Gudder, “A general theory of convexity,” Rendiconti del Seminario

Matematico e Fisico di Milano, vol. 49, pp. 89–96, 1979. doi: 10 . 1007 /

BF02925185.

[111] M. H. Stone, “Postulates for the barycentric calculus,” Annali di Matematica

Pura ed Applicata, vol. 29, pp. 25–30, 1949. doi: 10.1007/BF02413910.

[112] V. Barbu and T. Precupanu, Convexity and Optimization in Banach Spaces,

4th ed., ser. Springer Monographs in Mathematics. Springer, 2012, isbn: 978-

94-007-2246-0. doi: 10.1007/978-94-007-2247-7.

[113] C. Zălinescu, Convex Analysis in General Vector Spaces. World Scientific, 2002,

isbn: 978-9-812-38067-8. doi: 10.1142/5021. 269

[114] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Opti-

mization. Athena Scientific, 2003, isbn: 978-1-886-52945-8.

[115] E. Kreyszig, Introductory Functional Analysis with Applications, ser. Wiley

Classics Library. John Wiley and Sons, 1978, Reprinted in 1989, isbn: 978-04-

715-0459-7.

[116] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator

Theory in Hilbert Spaces, 2nd, ser. CMS Books in Mathematics. Springer, 2017,

isbn: 978-3-319-48310-8. doi: 10.1007/978-3-319-48311-5.

[117] A. D. Alexandrov, “Almost everywhere existence of the second differential of

a convex function and some properties of convex surfaces connected with it,”

Leningrad State University Annals [Uchenye Zapiski], Mathematical Series,

vol. 6, pp. 3–35, 1939, Citation adapted from Rockafellar [119]; the current

author has not been able to locate a copy of this paper.

[118] F. Mignot, “Contrôle dans les inéquations variationelles elliptiques,” Journal

of Functional Analysis, vol. 22, no. 2, pp. 130–185, 1976. doi: 10.1016/0022-

1236(76)90017-3.

[119] R. T. Rockafellar, “Second-order convex analysis,” Journal of Nonlinear and

Convex Analysis, vol. 1, no. 1, pp. 1–16, 1999.

[120] J. M. Borwein and D. Noll, Transactions of the American Mathematical Society,

vol. 342, pp. 43–81, 1994. doi: 10.1090/S0002-9947-1994-1145959-4.

[121] L. Ahlfors, Complex Analysis, 3rd. McGraw-Hill, 1979, isbn: 978-0-070-00657-

7.

[122] G. Yan and H. Fan, “A Newton-like algorithm for complex variables with

applications in blind equalization,” IEEE Transactions on Signal Processing, 270

vol. 48, no. 2, pp. 553–556, 2000. Available at: http://ieeexplore.ieee.

org/document/823982/.

[123] A. Hjørungnes and D. Gesbert, “Complex-valued matrix differentiation: Tech-

niques and key results,” IEEE Transactions on Signal Processing, vol. 55, no. 6,

pp. 2740–2746, 2007. Available at: http://ieeexplore.ieee.org/document/

4203075/.

[124] P. J. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued

Data: The Theory of Improper and Noncircular Signals, 1st. Cambridge Uni-

versity Press, 2010, isbn: 978-0-521-89772-3.

[125] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applica-

tions in Statistics and Econometrics, 3rd, ser. Wiley Series in Probability

and Statistics. John Wiley & Sons Ltd., 2019, isbn: 978-1-119-54120-2. doi:

10.1002/9781119541219.

[126] A. Beck, First-Order Methods in Optimization, ser. MOS-SIAM Series on Op-

timization. 2017. doi: 10.1137/1.9781611974997.

[127] L. Li, X. Wang, and G. Wang, “Alternating direction method of multipliers for

separable convex optimization of real functions in complex variables,” Mathe-

matical Problems in Engineering, vol. Article ID 104531, 2015. doi: 10.1155/

2015/104531.

[128] T. Adali, H. Li, M. Novey, and J.-F. Cardoso, “Complex ICA using nonlinear

functions,” IEEE Transactions on Signal Processing, vol. 56, no. 9, pp. 4536–

4544, 2008. doi: 10.1109/TSP.2008.926104.

[129] P. Bouboulis and S. Theodoridis, “Extension of Wirtinger’s calculus to repro-

ducing kernel Hilbert spaces and the complex kernel LMS,” IEEE Transac- 271

tions on Signal Processing, vol. 59, no. 3, pp. 964–978, 2011. doi: 10.1109/

TSP.2010.2096420.

[130] P. Bouboulis, K. Slavakis, and S. Theodoridis, “Adaptive learning in complex

reproducing kernel Hilbert spaces employing Wirtinger’s subgradients,” IEEE

Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 425–

438, 2012. doi: 10.1109/TNNLS.2011.2179810.

[131] P. Bouboulis, S. Theodoridis, C. Mavroforakis, and L. Evaggelatou-Dalla, “Com-

plex support vector machines for regression and quaternary classification,”

IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6,

pp. 1260–1274, 2015. doi: 10.1109/TNNLS.2014.2336679.

[132] I. Daubechies, M. Defrise, and C. de Mol, “An iterative thresholding algorithm

for linear inverse problems with a sparsity constraint,” Communications on

Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004. doi:

10.1002/cpa.20042.

[133] E. J. Candès, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtinger flow:

Theory and algorithms,” IEEE Transactions on Information Theory, vol. 61,

no. 4, pp. 1985–2007, 2015. doi: 10.1109/TIT.2015.2399924.

[134] I. Waldspurger, A. d’Aspremont, and S. Mallat, “Phase recovery, MaxCut

and complex semidefinite programming,” Mathematical Programming, vol. 149,

pp. 47–81, 2015. doi: 10.1007/s10107-013-0738-9.

[135] S. J. Wright, “Coordinate descent algorithms,” Mathematical Programming,

vol. 151, pp. 3–34, 2015. doi: 10.1007/s10107-015-0892-3.

[136] H.-J. M. Shi, S. Tu, Y. Xu, and W. Yin, “A primer on coordinate descent

algorithms,” ArXiv Pre-Print 1610.00040, 2016. Available at: https://arxiv.

org/abs/1610.00040. 272

[137] M. J. D. Powell, “On search directions for minimization algorithms,” Mathematical Programming, vol. 4,

pp. 193–201, 1973. doi: 10.1007/BF01584660.

[138] P. Tseng, “Convergence of a block coordinate descent method for nondiffer-

entiable minimization,” Journal of Optimization Theory and Applications,

vol. 109, no. 3, pp. 475–494, 2001. doi: 10.1023/A:1017501703105.

[139] R. Glowinski and A. Marroco, “Sur l’approximation, par éléments finis d’ordre

un, et la résolution, par pénalisation-dualité d’une classe de problèmes de

Dirichlet non linéaires,” Revue française d’automatique, informatique, recherche

opérationnelle. Analyse numérique, vol. 9, no. R2, pp. 41–76, 1975.

[140] D. Gabay and B. Mercier, “A dual algorithm for the solution of nonlinear

variational problems via finite element approximation,” Computers & Mathe-

matics with Applications, vol. 2, no. 1, pp. 17–40, 1976. doi: 10.1016/0898-

1221(76)90003-1.

[141] J. Eckstein and D. P. Bertsekas, “On the Douglas-Rachford splitting method

and the proximal point algorithm for maximal monotone operators,” Mathe-

matical Programming, vol. 55, pp. 293–318, 1992. doi: 10.1007/BF01581204.

[142] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,”

SIAM Journal on Control and Optimization, vol. 14, no. 5, pp. 877–898, 1976.

doi: 10.1137/0314056.

[143] H. Ouyang, N. He, L. Q. Tran, and A. Gray, “Stochastic alternating direction

method of multipliers,” in ICML 2013: Proceedings of the 30th International

Conference on Machine Learning, S. Dasgupta and D. McAllester, Eds., vol. 28,

Atlanta, Georgia: PMLR, 2013, pp. 80–88. Available at: http://proceedings.

mlr.press/v28/ouyang13.html. 273

[144] R. Tibshirani, “Regression shrinkage and selection via the lasso: A retrospective

(with discussion),” Journal of the Royal Statistical Society, Series B: Statistical

Methodology, vol. 73, no. 3, pp. 273–282, 2011. doi: 10.1111/j.1467-9868.

2011.00771.x.

[145] A. Juditsky and A. Nemirovski, “Functional aggregation for nonparametric

regression,” Annals of Statistics, vol. 28, no. 3, pp. 681–712, 2000. doi: 10.

1214/aos/1015951994.

[146] R. Tibshirani, “The lasso method for variable selection in the Cox model,”

Statistics in Medicine, vol. 16, no. 4, pp. 385–395, 1997.

[147] S. A. van de Geer, “High-dimensional generalized linear models and the lasso,”

Annals of Statistics, vol. 36, no. 2, pp. 614–645, 2008. doi: 10.1214/009053607000000929.

[148] W. Fu and K. Knight, “Asymptotics for lasso-type estimators,” Annals of

Statistics, vol. 28, no. 5, pp. 1356–1378, 2000. doi: 10.1214/aos/1015957397.

[149] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp, “Aggregation and sparsity

via ℓ1 penalized least squares,” in COLT 2006: Proceedings of the 2006 Inter- national Conference on Computational Learning Theory, G. Lugosi and H. U.

Simon, Eds., ser. Lecture Notes in Computer Science, vol. 4005, Pittsburg,

USA: Springer, 2006, pp. 379–391. doi: 10.1007/11776420_29.

[150] F. Bunea, A. Tsybakov, and M. Wegkamp, “Sparsity oracle inequalities for

the lasso,” Electronic Journal of Statistics, vol. 1, pp. 169–194, 2007. doi:

10.1214/07-EJS008.

[151] C.-H. Zhang and J. Huang, “The sparsity and bias of the lasso selection in high-

dimensional linear regression,” Annals of Statistics, vol. 36, no. 4, pp. 1567–

1594, 2008. doi: 10.1214/07-AOS520. 274

[152] N. Meinshausen and B. Yu, “Lasso-type recovery of sparse representations for

high-dimensional data,” Annals of Statistics, vol. 37, no. 1, pp. 246–270, 2009.

doi: 10.1214/07-AOS582.

[153] T. Zhang, “Some sharp performance bounds for least squares regression with

l1 regularization,” Annals of Statistics, vol. 37, no. 5A, pp. 2109–2144, 2009.

doi: 10.1214/08-AOS659.

[154] M. Hebiri and S. van de Geer, “The smooth-lasso and other ℓ1 + ℓ2-penalized methods,” Electronic Journal of Statistics, vol. 5, pp. 1184–1226, 2011. doi:

10.1214/11-EJS638.

[155] S. A. van de Geer and P. Bühlmann, “On the conditions used to prove oracle

results for the lasso,” Electronic Journal of Statistics, vol. 3, pp. 1360–1392,

2009. doi: 10.1214/09-EJS506.

[156] M. R. Osborne, B. Presnell, and B. A. Turlach, “A new approach to variable se-

lection in least squares problems,” IMA Journal of Numerical Analysis, vol. 20,

no. 3, pp. 389–404, 2000. doi: 10.1093/imanum/20.3.389.

[157] ——, “On the LASSO and its dual,” Journal of Computational and Graphical

Statistics, vol. 9, no. 2, pp. 319–337, 2000. doi: 10.1080/10618600.2000.

10474883.

[158] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regres-

sion,” Annals of Statistics, vol. 32, no. 2, pp. 407–451, 2004. doi: 10.1214/

009053604000000067.

[159] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, “Pathwise coordinate

optimization,” Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.

doi: 10.1214/07-AOAS131. 275

[160] J. H. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for gen-

eralized linear models via coordinate descent,” Journal of Statistical Software,

vol. 33, no. 1, pp. 1–22, 2010. doi: 10.18637/jss.v033.i01.

[161] T. T. Wu and K. Lange, “Coordinate descent algorithms for lasso penalized

regression,” Annals of Applied Statistics, vol. 2, no. 1, pp. 224–244, 2008. doi:

10.1214/07-AOAS147.

[162] P. Breheny and J. Huang, “Coordinate descent algorithms for nonconvex pe-

nalized regression, with applications to biological feature selection,” Annals of

Applied Statistics, vol. 5, no. 1, pp. 232–253, 2011. doi: 10.1214/10-AOAS388.

[163] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J.

Tibshirani, “Strong rules for discarding predictors in lasso-type problems,”

Journal of the Royal Statistical Society, Series B: Statistical Methodology,

vol. 74, no. 2, pp. 245–266, 2012. doi: 10.1111/j.1467-9868.2011.01004.x.

[164] P. Breheny and J. Huang, “Group descent algorithms for nonconvex penalized

linear and logistic regression models with grouped predictors,” Statistics and

Computing, vol. 25, pp. 173–187, 2015. doi: 10.1007/s11222-013-9424-2.

[165] H. Zou, T. Hastie, and R. Tibshirani, “On the ”degrees of freedom” of the

lasso,” Annals of Statistics, vol. 35, no. 5, pp. 2173–2192, 2007. doi: 10.1214/

009053607000000127.

[166] R. J. Tibshirani and J. Taylor, “Degrees of freedom in lasso problems,” Annals

of Statistics, vol. 40, no. 2, pp. 1198–1232, 2012. doi: 10.1214/12-AOS1003.

[167] M. Stone, “Cross-validatory choice and assessment of statistical predictions

(with discussion and rejoinder),” Journal of the Royal Statistical Society: Series

B (Methodological), vol. 36, no. 2, pp. 111–147, 1974. doi: 10.1111/j.2517-

6161.1974.tb00994.x. 276

[168] S. Arlot and A. Celisse, “A survey of cross-validation procedures for model

selection,” Statistics Surveys, vol. 4, pp. 40–79, 2010. doi: 10.1214/09-SS054.

[169] A. Belloni, V. Chernozhukov, and L. Wang, “Square-root lasso: Pivotal re-

covery of sparse signals via conic programming,” Biometrika, vol. 98, no. 4,

pp. 791–806, 2011. doi: 10.1093/biomet/asr043.

[170] T. Sun and C.-H. Zhang, “Scaled sparse linear regression,” Biometrika, vol. 99,

no. 4, pp. 879–898, 2012. doi: 10.1093/biomet/ass043.

[171] M. Chichignoud, J. Lederer, and M. J. Wainwright, “A practical scheme and

fast algorithm to tune the lasso with optimality guarantees,” Journal of Ma-

chine Learning Research, vol. 17, no. 231, pp. 1–20, 2016. Available at: http:

//jmlr.org/papers/v17/15-605.html.

[172] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans-

actions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005. doi:

10.1109/TIT.2005.858979.

[173] R. J. Tibshirani, “The lasso problem and uniqueness,” Electronic Journal of

Statistics, vol. 7, pp. 1456–1490, 2013. doi: 10.1214/13-EJS815.

[174] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “A short note on

concentration inequalities for random vectors with subGaussian norm,” ArXiv

Pre-Print 1902.03736, 2019. Available at: https://arxiv.org/abs/1902.

03736.

[175] A. Dempster, “Covariance selection,” Biometrics, vol. 28, no. 1, pp. 157–175,

1972. Available at: http://jstor.org/stable/2528966.

[176] J. Besag, “Spatial interaction and the statistical analysis of lattice systems

(with discussion),” Journal of the Royal Statistical Society, Series B: Statistical 277

Methodology, vol. 36, no. 2, pp. 192–236, 1974. Available at: http://jstor.

org/stable/2984812.

[177] G. R. Grimmett, “A theorem about random fields,” Bulletin of the London

Mathematical Society, vol. 5, no. 1, pp. 81–84, 1973. doi: 10.1112/blms/5.1.

81.

[178] S. L. Lauritzen, Graphical Models, 1st, ser. Oxford Statistical Science Series

17. Oxford Science Publications, 1996, isbn: 978-0-198-52219-5.

[179] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential fami-

lies, and variational inference,” Foundations and Trends in Machine Learning,

vol. 1, no. 1-2, pp. 1–305, 2008. doi: 10.1561/2200000001.

[180] A. d’Aspremont, O. Banerjee, and L. E. Ghaoui, “First-order methods for

sparse covariance selection,” SIAM Journal on Matrix Analysis and Applica-

tions, vol. 30, no. 1, pp. 56–66, 2008. doi: 10.1137/060670985.

[181] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, “Latent variable graphical

model selection via convex optimization,” Annals of Statistics, vol. 40, no. 4,

pp. 1935–1967, 2012. doi: 10.1214/11-AOS949.

[182] C. Leng and C. Y. Tang, “Sparse matrix graphical models,” Journal of the

American Statistical Association, vol. 107, no. 499, pp. 1187–1200, 2012. doi:

10.1080/01621459.2012.706133.

[183] S. Zhou, “Gemini: Graph estimation with matrix variate normal instances,”

Annals of Statistics, vol. 42, no. 2, pp. 532–562, 2014. doi: 10 . 1214 / 13 -

AOS1187.

[184] S. He, J. Yin, H. Li, and X. Wang, “Graphical model selection and estimation

for high-dimensional tensor data,” Journal of Multivariate Analysis, vol. 128,

pp. 165–185, 2014. doi: 10.1016/j.jmva.2014.03.007. 278

[185] H. Liu, J. Lafferty, and L. Wasserman, “The nonparanormal: Semiparametric

estimation of high dimensional undirected graphs,” Journal of Machine Learn-

ing Research, vol. 10, pp. 2295–2328, 2009. Available at: http://jmlr.org/

papers/v10/liu09a.html.

[186] J. Lafferty, H. Liu, and L. Wasserman, “Sparse nonparametric graphical mod-

els,” Statistical Science, vol. 27, no. 4, pp. 519–537, 2012. doi: 10.1214/12-

STS391.

[187] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman, “High-dimensional

semiparametric Gaussian copula graphical models,” Annals of Statistics, vol. 40,

no. 4, pp. 2293–2326, 2012. doi: 10.1214/12-AOS1037.

[188] R. Foygel and M. Drton, “Extended Bayesian information criteria for Gaussian

graphical models,” in NIPS 2010: Advances in Neural Information Processing

Systems 23, (Vancouver, Canada), J. Lafferty, C. Williams, J. Shawe-Taylor, R.

Zemel, and A. Culotta, Eds., 2010. Available at: https://papers.nips.cc/

paper/4087-extended-bayesian-information-criteria-for-gaussian-

graphical-models.

[189] H. Liu, K. Roder, and L. Wasserman, “Stability approach to regularization

selection (StARS) for high dimensional graphical models,” in NIPS 2010: Ad-

vances in Neural Information Processing Systems 23, (Vancouver, Canada),

J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds.,

2010. Available at: http : / / papers . nips . cc / paper / 3966 - stability -

approach-to-regularization-selection-stars-for-high-dimensional-

graphical-models. 279

[190] S. Chen, D. M. Witten, and A. Shojaie, “Selection and estimation for mixed

graphical models,” Biometrika, vol. 102, no. 1, pp. 47–64, 2015. doi: 10.1093/

biomet/asu051.

[191] J. D. Lee and T. J. Hastie, “Learning the structure of mixed graphical models,”

Journal of Computational and Graphical Statistics, vol. 24, no. 1, pp. 230–253,

2015. doi: 10.1080/10618600.2014.900500.

[192] H. H. Andersen, M. Hojbjerre, D. Sorensen, and P. S. Eriksen, Linear and

Graphical Models for the Multivariate Complex Normal Distribution, ser. Lec-

ture Notes in Statistics. Springer, 1995, isbn: 978-0-387-94521-7. doi: 10 .

1007/978-1-4612-4240-6.

[193] A. Jung, R. Heckel, H. Bölcskei, and F. Hlawatsch, “Compressive nonparamet-

ric graphical model selection for time series,” in ICASSP 2014: Proceedings

of the 2014 IEEE International Conference on Acoustics, Speech, and Signal

Processing, (Florence, Italy), 2014, pp. 769–773. doi: 10.1109/ICASSP.2014.

6853700.

[194] M. Yuan and Y. Lin, “Model selection and estimation in regression with

grouped variables,” Journal of the Royal Statistical Society, Series B: Sta-

tistical Methodology, vol. 68, no. 1, pp. 49–67, 2006. doi: 10.1111/j.1467-

9868.2005.00532.x.

[195] X. Lv, G. Bi, and C. Wan, “The group lasso for stable recovery of block-sparse

signal representations,” IEEE Transactions on Signal Processing, vol. 59, no. 4,

pp. 1371–1382, 2011.

[196] A. Jung, G. Hannak, and N. Goertz, “Graphical LASSO based model selection

for time series,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1781–1785,

2015. doi: 10.1109/LSP.2015.2425434.

[197] J. K. Tugnait, “An edge exclusion test for complex Gaussian graphical model

selection,” in SSP 2018: Proceedings of the 2018 IEEE Statistical Signal Pro-

cessing Workshop, (Freiburg, Germany), 2018. doi: 10.1109/SSP.2018.

8450845.

[198] ——, “Graphical lasso for high-dimensional complex Gaussian graphical model

selection,” in ICASSP 2019: Proceedings of the 2019 IEEE International Con-

ference on Acoustics, Speech, and Signal Processing, (Brighton, United King-

dom), 2019, pp. 2952–2956. doi: 10.1109/ICASSP.2019.8682867.

[199] ——, “On sparse complex Gaussian graphical model selection,” in MLSP 2019:

Proceedings of the IEEE 29th International Workshop on Machine Learning

for Signal Processing, (Pittsburgh, PA, USA), 2019. doi: 10.1109/MLSP.2019.

8918691.

[200] A. Tank, N. J. Foti, and E. B. Fox, “Bayesian structure learning for stationary

time series,” in UAI 2015: Proceedings of the 31st Conference on Uncertainty

in Artificial Intelligence, (Amsterdam, Netherlands), M. Meila and T. Heskes,

Eds., 2015. Available at: http://auai.org/uai2015/proceedings/papers/

227.pdf.

[201] Y. Yu, T. Wang, and R. J. Samworth, “A useful variant of the Davis–Kahan

theorem for statisticians,” Biometrika, vol. 102, no. 2, pp. 315–323, 2015. doi:

10.1093/biomet/asv008.

[202] P. Ravikumar, G. Raskutti, M. J. Wainwright, and B. Yu, “Model selection

in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE,” in NIPS 2008: Advances in Neural Information Processing Systems 21,

D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2008, pp. 1329–1336.

Available at: https://papers.nips.cc/paper/3436-model-selection-

in-gaussian-graphical-models-high-dimensional-consistency-of-

boldmathell_1-regularized-mle.

[203] W. Wang, M. J. Wainwright, and K. Ramchandran, “Information-theoretic

bounds on model selection for Gaussian Markov random fields,” in ISIT 2010:

Proceedings of the 2010 IEEE International Symposium on Information Theory,

M. Gastpar, R. Heath, and K. Narayanan, Eds., Austin, TX USA: IEEE, 2010,

pp. 1373–1377. doi: 10.1109/ISIT.2010.5513573.

[204] N. Meinshausen, “A note on the lasso for Gaussian graphical model selection,”

Statistics and Probability Letters, vol. 78, no. 7, pp. 880–884, 2008. doi: 10.

1016/j.spl.2007.09.014.

[205] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the

high-dimensional and noisy setting,” IEEE Transactions on Information The-

ory, vol. 55, no. 12, pp. 5728–5741, 2009. doi: 10.1109/TIT.2009.2032816.

[206] Y. Nesterov, “Smooth minimization of non-smooth functions,” Mathematical

Programming, vol. 103, no. 1, pp. 127–152, 2005. doi: 10.1007/s10107-004-

0552-5.

[207] D. M. Witten, J. H. Friedman, and N. Simon, “New insights and faster com-

putations for the graphical lasso,” Journal of Computational and Graphical

Statistics, vol. 20, no. 4, pp. 892–900, 2011. doi: 10.1198/jcgs.2011.11051a.

[208] R. Mazumder and T. Hastie, “The graphical lasso: New insights and alter-

natives,” Electronic Journal of Statistics, vol. 6, pp. 2125–2149, 2012. doi:

10.1214/12-EJS740.

[209] ——, “Exact covariance thresholding into connected components for large-

scale graphical lasso,” Journal of Machine Learning Research, vol. 13, pp. 781–

794, 2012. Available at: http://jmlr.org/papers/v13/mazumder12a.html.

[210] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, “QUIC: Quadratic

approximation for sparse inverse covariance estimation,” Journal of Machine

Learning Research, vol. 15, pp. 2911–2947, 2014. Available at: http://jmlr.

org/papers/v15/hsieh14a.html.

[211] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. Ravikumar, and R. A. Poldrack,

“BIG & QUIC: Sparse inverse covariance estimation for a million variables,” in

NIPS 2013: Advances in Neural Information Processing Systems 26, C. J. C.

Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds.,

vol. 26, Lake Tahoe, NV, USA, 2013.

[212] K. Scheinberg, S. Ma, and D. Goldfarb, “Sparse inverse covariance selection via

alternating linearization methods,” in NIPS 2010: Advances in Neural Infor-

mation Processing Systems 23, (Vancouver, Canada), J. Lafferty, C. Williams,

J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010. Available at: https:

//papers.nips.cc/paper/4099-sparse-inverse-covariance-selection-

via-alternating-linearization-methods.

[213] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman, “The huge package

for high-dimensional undirected graph estimation in R,” Journal of Machine

Learning Research, vol. 13, pp. 1059–1062, 2012. Available at: http://jmlr.

org/papers/v13/zhao12a.html.

[214] M. L. Eaton, Multivariate Statistics: A Vector Space Approach, R. A. Vitale,

Ed., ser. Lecture Notes-Monograph Series 53. Beachwood, Ohio, USA: Institute

of Mathematical Statistics, 2007. Available at: https://projecteuclid.org/

euclid.lnms/1196285102.

[215] R. A. Wooding, “The multivariate distribution of complex normal variables,”

Biometrika, vol. 43, no. 1-2, pp. 212–215, 1956. doi: 10.1093/biomet/43.1-

2.212.

[216] N. R. Goodman, “Statistical analysis based on a certain multivariate complex

Gaussian distribution (an introduction),” Annals of Mathematical Statistics,

vol. 34, no. 1, pp. 152–177, 1963. doi: 10.1214/aoms/1177704250.

[217] ——, “The distribution of the determinant of a complex Wishart distributed

matrix,” Annals of Mathematical Statistics, vol. 34, no. 1, pp. 178–180, 1963.

doi: 10.1214/aoms/1177704251.

[218] M. S. Srivastava, “On the complex Wishart distribution,” Annals of Mathemat-

ical Statistics, vol. 36, no. 1, pp. 313–315, 1965. doi: 10.1214/aoms/1177700294.

[219] C. Khatri, “Classical statistical analysis based on a certain multivariate com-

plex Gaussian distribution,” Annals of Mathematical Statistics, vol. 36, no. 1,

pp. 98–114, 1965. doi: 10.1214/aoms/1177700274.

[220] P. Graczyk, G. Letac, and H. Massam, “The complex Wishart distribution

and the symmetric group,” Annals of Statistics, vol. 31, no. 1, pp. 287–309,

2003. doi: 10.1214/aos/1046294466.

[221] T. Adali, P. J. Schreier, and L. L. Scharf, “Complex-valued signal process-

ing: The proper way to deal with impropriety,” IEEE Transactions on Signal

Processing, vol. 59, no. 11, pp. 5101–5125, 2011. doi: 10.1109/TSP.2011.

2162954.

[222] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by

model selection,” Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. doi:

10.1214/aos/1015957395.

[223] H. Uhlig, “On singular Wishart and singular multivariate beta distributions,”

Annals of Statistics, vol. 22, no. 1, pp. 395–405, 1994. doi: 10.1214/aos/

1176325375.

[224] M. S. Srivastava, “Singular Wishart and multivariate beta distributions,” An-

nals of Statistics, vol. 31, no. 5, pp. 1537–1560, 2003. doi: 10.1214/aos/

1065705118.

[225] S. Yu, J. Ryu, and K. Park, “A derivation of anti-Wishart distribution,” Jour-

nal of Multivariate Analysis, vol. 131, pp. 121–125, 2014. doi: 10.1016/j.

jmva.2014.06.019.

[226] C. Zhu, H. Xu, C. Leng, and S. Yan, “Convex optimization procedure for

clustering: Theoretical revisit,” in NIPS 2014: Advances in Neural Information

Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,

and K. Q. Weinberger, Eds., Montréal, Canada: Curran Associates, Inc., 2014,

pp. 1619–1627. Available at: https://papers.nips.cc/paper/5307-convex-

optimization-procedure-for-clustering-theoretical-revisit.

[227] K. M. Tan and D. Witten, “Statistical properties of convex clustering,” Elec-

tronic Journal of Statistics, vol. 9, no. 2, pp. 2324–2347, 2015. doi: 10.1214/

15-EJS1074.

[228] P. Radchenko and G. Mukherjee, “Convex clustering via ℓ1 fusion penalization,” Journal of the Royal Statistical Society, Series B: Statistical Methodol-

ogy, vol. 79, no. 5, pp. 1527–1546, 2017. doi: 10.1111/rssb.12226.

[229] E. C. Chi and K. Lange, “Splitting methods for convex clustering,” Journal

of Computational and Graphical Statistics, vol. 24, no. 4, pp. 994–1013, 2015.

doi: 10.1080/10618600.2014.948181.

[230] R. Glowinski, S. J. Osher, and W. Yin, Eds., Splitting Methods in Communi-

cation and Imaging, Science, and Engineering, Springer, 2016. doi: 10.1007/

978-3-319-41589-5.

[231] D. Davis and W. Yin, “A three-operator splitting scheme and its optimization

applications,” Set-Valued and Variational Analysis, vol. 25, no. 4, pp. 829–858,

2017. doi: 10.1007/s11228-017-0421-z.

[232] P. Tseng, “Applications of a splitting algorithm to decomposition in convex

programming and variational inequalities,” SIAM Journal on Control and Op-

timization, vol. 29, no. 1, pp. 119–138, 1991. doi: 10.1137/0329006.

[233] M. Weylandt, J. Nagorski, and G. I. Allen, “Dynamic visualization and fast

computation for convex clustering via algorithmic regularization,” Journal of

Computational and Graphical Statistics, vol. 29, no. 1, pp. 87–96, 2020. doi:

10.1080/10618600.2019.1629943.

[234] R. Bartels and G. Stewart, “Solution of the matrix equation AX + XB = C,”

Communications of the ACM, vol. 15, no. 9, pp. 820–826, 1972. doi: 10.1145/

361573.361582.

[235] G. Golub, S. Nash, and C. V. Loan, “A Hessenberg-Schur method for the

problem AX + XB = C,” IEEE Transactions on Automatic Control, vol. 24,

no. 6, pp. 909–913, 1979. doi: 10.1109/TAC.1979.1102170.

[236] W. N. Anderson Jr. and T. D. Morley, “Eigenvalues of the Laplacian of a

graph,” Linear and Multilinear Algebra, vol. 18, no. 2, pp. 141–145, 1985. doi:

10.1080/03081088508817681.

[237] J. J. Moreau, “Décomposition orthogonale d’un espace hilbertien selon deux

cônes mutuellement polaires,” Comptes Rendus Hebdomadaires des Séances de

l’Académie des Sciences, vol. 255, no. 1, pp. 238–240, 1962.

[238] The Cancer Genome Atlas Network, “Comprehensive molecular portraits of

human breast tumours,” Nature, vol. 490, pp. 61–70, 2012. doi: 10.1038/

nature11412.

[239] H. H. Bauschke and P. L. Combettes, “A Dykstra-like algorithm for two mono-

tone operators,” Pacific Journal of Optimization, vol. 4, no. 3, pp. 383–391,

2008. Available at: http://www.ybook.co.jp/online2/oppjo/vol4/p383.

html.

[240] M. Weylandt, J. Nagorski, and G. I. Allen, clustRviz (R package), 2019. Avail-

able at: http://github.com/DataSlingers/clustRviz.

[241] B. He and X. Yuan, “On the O(1/n) convergence rate of the Douglas–Rachford

alternating direction method,” SIAM Journal on Numerical Analysis, vol. 50,

no. 2, pp. 700–709, 2012. doi: 10.1137/110836936.

[242] ——, “On non-ergodic convergence rate of Douglas-Rachford alternating direc-

tion method of multipliers,” Numerische Mathematik, vol. 130, no. 3, pp. 567–

577, 2015. doi: 10.1007/s00211-014-0673-6.

[243] D. Davis and W. Yin, “Faster convergence rates of relaxed Peaceman-Rachford

and ADMM under regularity assumptions,” Mathematics of Operations Re-

search, vol. 42, no. 3, pp. 783–805, 2017. doi: 10.1287/moor.2016.0827.

[244] A. H. Bentbib, S. El-Halouy, and E. M. Sadek, “Krylov subspace projection

method for Sylvester tensor equation with low rank right-hand side,” Numer-

ical Algorithms, vol. 84, pp. 1411–1430, 2020. doi: 10.1007/s11075-020-

00874-0.

[245] B. W. Silverman, “Smoothed functional principal components analysis by

choice of norm,” Annals of Statistics, vol. 24, no. 1, pp. 1–24, 1996. doi:

10.1214/aos/1033066196.

[246] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, “Generalized power

method for sparse principal component analysis,” Journal of Machine Learning

Research, vol. 11, pp. 517–553, 2010. Available at: http://www.jmlr.org/

papers/v11/journee10a.html.

[247] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix

Manifolds. Princeton University Press, 2007, isbn: 978-0-691-13298-3.

[248] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality

constraints,” Mathematical Programming, vol. 142, no. 1-2, pp. 397–434, 2013.

doi: 10.1007/s10107-012-0584-1.

[249] R. Lai and S. Osher, “A splitting method for orthogonality constrained prob-

lems,” Journal of Scientific Computing, vol. 58, no. 2, pp. 431–449, 2014. doi:

10.1007/s10915-013-9740-x.

[250] A. Kovnatsky, K. Glashoff, and M. M. Bronstein, “MADMM: A generic algo-

rithm for non-smooth optimization on manifolds,” in ECCV 2016: Proceedings

of the 14th European Conference on Computer Vision, B. Leibe, J. Matas, N.

Sebe, and M. Welling, Eds., ser. Lecture Notes in Computer Science, vol. 9909,

Springer, 2016, pp. 680–696. doi: 10.1007/978-3-319-46454-1_41.

[251] S. Chen, S. Ma, A. M.-C. So, and T. Zhang, “Proximal gradient method for

nonsmooth optimization over the Stiefel manifold,” SIAM Journal on Opti-

mization, vol. 30, no. 1, pp. 210–239, 2020. doi: 10.1137/18M122457X.

[252] S. Chen, S. Ma, L. Xue, and H. Zou, “An alternating manifold proximal gradi-

ent method for sparse principal component analysis and sparse canonical cor-

relation analysis,” INFORMS Journal on Optimization, vol. 2, no. 3, pp. 192–

208, 2020. doi: 10.1287/ijoo.2019.0032.

[253] K. Benidis, Y. Sun, P. Babu, and D. P. Palomar, “Orthogonal sparse PCA

and covariance estimation via Procrustes reformulation,” IEEE Transactions

on Signal Processing, vol. 64, no. 23, pp. 6211–6226, 2016. doi: 10.1109/TSP.

2016.2605073.

[254] H. Sato and K. Aihara, “Cholesky QR-based retraction on the generalized

Stiefel manifold,” Computational Optimization and Applications, vol. 72, no. 2,

pp. 293–308, 2019. doi: 10.1007/s10589-018-0046-7.

[255] P. H. Schönemann, “A generalized solution of the orthogonal Procrustes prob-

lem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966. doi: 10.1007/BF02289451.

[256] L. Eldén and H. Park, “A Procrustes problem on the Stiefel manifold,” Nu-

merische Mathematik, vol. 82, no. 4, pp. 599–619, 1999. doi: 10.1007/s002110050432.

[257] L. Mackey, “Deflation methods for sparse PCA,” in NIPS 2008: Advances

in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y.

Bengio, and L. Bottou, Eds., 2008, pp. 1017–1024. Available at: https://

papers.nips.cc/paper/3575-deflation-methods-for-sparse-pca.

[258] G. I. Allen, “Sparse higher-order principal components analysis,” in AISTATS

2012: Proceedings of the 15th International Conference on Artificial Intelligence

and Statistics, N. D. Lawrence and M. Girolami, Eds., vol. 22, Canary Islands,

Spain: PMLR, 2012, pp. 27–36. Available at: http : / / proceedings . mlr .

press/v22/allen12.html.

[259] ——, “Multi-way functional principal components analysis,” in CAMSAP 2013:

Proceedings of the 5th IEEE International Workshop on Computational Ad-

vances in Multi-Sensor Adaptive Processing, V. Cevher and S. Gazor, Eds.,

St. Martin, France: IEEE, 2013, pp. 220–223. doi: 10.1109/CAMSAP.2013.

6714047.

[260] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psy-

chometrika, vol. 31, no. 3, pp. 279–311, 1966. doi: 10.1007/BF02289464.

[261] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,”

SIAM Review, vol. 51, no. 3, pp. 455–500, 2009. doi: 10.1137/07070111X.

[262] L. Armijo, “Minimization of functions having Lipschitz continuous first partial

derivatives,” Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966. doi: 10.

2140/pjm.1966.16.1. Available at: https://projecteuclid.org/euclid.

pjm/1102995080.

[263] A. Beck and M. Teboulle, “Gradient-based algorithms with applications to

signal-recovery problems,” in Convex Optimization in Signal Processing and

Communications, D. P. Palomar and Y. C. Eldar, Eds., Cambridge University

Press, 2010, pp. 42–88, isbn: 978-0-521-76222-9. doi: 10.1017/CBO9780511804458.

003.

[264] X. Li, D. Sun, and K.-C. Toh, “A highly efficient semismooth Newton aug-

mented Lagrangian method for solving lasso problems,” SIAM Journal on

Optimization, vol. 28, no. 1, pp. 433–458, 2016. doi: 10.1137/16M1097572.

[265] A. Ali, E. Wong, and J. Z. Kolter, “A semismooth Newton method for fast,

generic convex programming,” in ICML 2017: Proceedings of the 34th Inter-

national Conference on Machine Learning, D. Precup and Y. W. Teh, Eds.,

vol. 70, Sydney, Australia: PMLR, 2017, pp. 70–79. Available at: http://

proceedings.mlr.press/v70/ali17a.html.

[266] K.-C. Toh, M. J. Todd, and R. H. Tütüncü, “SDPT3 - a Matlab software

package for semidefinite programming, version 1.3,” Optimization Methods and

Software, vol. 11, no. 1-4, pp. 545–581, 1999. doi: 10.1080/10556789908805762.

[267] R. H. Tütüncü, K.-C. Toh, and M. J. Todd, “Solving semidefinite-quadratic-

linear programs using SDPT3,” Mathematical Programming, vol. 95, no. 2,

pp. 189–217, 2003. doi: 10.1007/s10107-002-0347-5.

[268] K.-C. Toh, M. J. Todd, and R. H. Tütüncü, “On the implementation and usage

of SDPT3 - a Matlab software package for semidefinite-quadratic-linear pro-

gramming, version 4.0,” in Handbook on Semidefinite, Conic and Polynomial

Optimization, ser. International Series in Operations Research & Management

Science (Frederick S. Hillier, Series Ed.), M. F. Anjos and J. B. Lasserre, Eds., 2012, pp. 715–

754, isbn: 978-1-4614-0768-3. doi: 10.1007/978-1-4614-0769-0_25.

[269] W. Huang and K. Wei, “Extending FISTA to Riemannian optimization for

sparse PCA,” ArXiv Pre-Print 1909.05485, 2019. Available at: http://arxiv.

org/abs/1909.05485.

[270] J. Song, P. Babu, and D. P. Palomar, “Sparse generalized eigenvalue problem

via smooth optimization,” IEEE Transactions on Signal Processing, vol. 63,

no. 7, pp. 1627–1642, 2015. doi: 10.1109/TSP.2015.2394443.

[271] Y. Guan and J. Dy, “Sparse probabilistic principal component analysis,” in

AISTATS 2009: Proceedings of the 12th International Conference on Artificial

Intelligence and Statistics, D. van Dyk and M. Welling, Eds., vol. 5, Clear-

water Beach, Florida, USA: PMLR, 2009, pp. 185–192. Available at: http:

//proceedings.mlr.press/v5/guan09a.html.

[272] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analy-

sis,” Journal of the Royal Statistical Society, Series B: Statistical Methodology,

vol. 61, no. 3, pp. 611–622, 1999. doi: 10.1111/1467-9868.00196.

[273] R. M. Neal, “MCMC using Hamiltonian dynamics,” in Handbook of Markov

Chain Monte Carlo, ser. Handbooks of Modern Statistical Methods, S. Brooks,

A. Gelman, G. L. Jones, and X.-L. Meng, Eds., 1st, Chapman & Hall/CRC,

2011, ch. 5, pp. 113–162, isbn: 978-1-420-07941-8. Available at: http://www.

mcmchandbook.net/HandbookChapter5.pdf.

[274] M. D. Hoffman and A. Gelman, “The No-U-Turn sampler: Adaptively setting

path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Re-

search, vol. 15, pp. 1593–1623, 2014. Available at: http://jmlr.org/papers/

v15/hoffman14a.html.

[275] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betan-

court, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A probabilistic

programming language,” Journal of Statistical Software, vol. 76, no. 1, 2017.

doi: 10.18637/jss.v076.i01.

[276] M. Betancourt, “A conceptual introduction to Hamiltonian Monte Carlo,”

ArXiv 1701.02434, 2017. Available at: https://arxiv.org/abs/1701.02434.

[277] M. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamilto-

nian Monte Carlo methods,” Journal of the Royal Statistical Society, Series

B: Statistical Methodology, vol. 73, no. 2, pp. 123–214, 2011. doi: 10.1111/j.

1467-9868.2010.00765.x.

[278] S. Byrne and M. Girolami, “Geodesic Monte Carlo on embedded manifolds,”

Scandinavian Journal of Statistics, vol. 40, no. 4, pp. 825–845, 2013. doi:

10.1111/sjos.12036.

[279] M. Betancourt, S. Byrne, S. Livingstone, and M. Girolami, “The geometric

foundations of Hamiltonian Monte Carlo,” Bernoulli, vol. 23, no. 4A, pp. 2257–

2298, 2017. doi: 10.3150/16-BEJ810.

[280] A. A. Pourzanjani, R. M. Jiang, B. Mitchell, P. J. Atzberger, and L. R. Petzold,

“General Bayesian inference over the Stiefel manifold via the Givens represen-

tation,” ArXiv Pre-Print 1710.09443, 2017. Available at: http://arxiv.org/

abs/1710.09443.

[281] M. Jauch, P. D. Hoff, and D. B. Dunson, “Random orthogonal matrices and

the Cayley transform,” Bernoulli, vol. 26, no. 2, pp. 1560–1586, 2020. doi:

10.3150/19-BEJ1176.

[282] ——, “Monte Carlo simulation on the Stiefel manifold via polar expansion,”

ArXiv Pre-Print 1906.07684, 2019. Available at: http://arxiv.org/abs/

1906.07684.

[283] Y. Wang, W. Yin, and J. Zeng, “Global convergence of ADMM in noncon-

vex nonsmooth optimization,” Journal of Scientific Computing, vol. 78, no. 1,

pp. 29–63, 2019. doi: 10.1007/s10915-018-0757-z.

[284] A. Ritchie, C. Scott, L. Balzano, D. Kessler, and C. S. Sripada, “Supervised

principal component analysis via manifold optimization,” in DSW 2019: Pro-

ceedings of the 2nd IEEE Data Science Workshop, G. Karypis, G. Michailidis,

and R. Willett, Eds., Minneapolis, Minnesota: IEEE, 2019, pp. 6–10. doi:

10.1109/DSW.2019.8755587.

[285] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with

orthogonality constraints,” SIAM Journal on Matrix Analysis and Applications,

vol. 20, no. 2, pp. 303–353, 1998. doi: 10.1137/S0895479895290954.

[286] N. Boumal, P.-A. Absil, and C. Cartis, “Global rates of convergence for noncon-

vex optimization on manifolds,” IMA Journal of Numerical Analysis, vol. 39,

no. 1, pp. 1–33, 2019. doi: 10.1093/imanum/drx080.

[287] U.S. Energy Information Administration (E.I.A.), Frequently asked questions:

What is U.S. electricity generation by energy source? https://www.eia.gov/tools/faqs/faq.php?id=427&t=3,

Accessed: 2018-05-11, 2018.

[288] T. Murphy, “Natural gas surpasses coal as biggest US electricity source,” The

Associated Press, 2015. Available at: https://apnews.com/59a30fadd58e42f08280e4cdd198653c.

[289] U.S. Energy Information Administration (E.I.A.), Short-term energy outlook:

January 2018, https://www.eia.gov/todayinenergy/detail.php?id=34612,

Accessed: 2018-05-11, 2018.

[290] PPI Energy and Chemicals Team, Bureau of Labor Statistics (BLS), “The

effects of shale gas production on natural gas prices,” Beyond the Numbers:

Prices and Spending, vol. 2, no. 13, 2013. Available at: https://www.bls.

gov/opub/btn/volume-2/pdf/the-effects-of-shale-gas-production-

on-natural-gas-prices.pdf.

[291] T. G. Andersen, T. Bollerslev, and N. Meddahi, “Correcting the errors: Volatil-

ity forecast evaluation using high-frequency data and realized volatilities,”

Econometrica, vol. 73, no. 1, pp. 279–296, 2005. doi: 10.1111/j.1468-

0262.2005.00572.x.

[292] P. R. Hansen, Z. Huang, and H. H. Shek, “Realized GARCH: A joint model for

returns and realized measures of volatility,” Journal of Applied Econometrics,

vol. 27, pp. 877–906, 2012. doi: 10.1002/jae.1234.

[293] P. R. Hansen and Z. Huang, “Exponential GARCH modeling with realized

measures of volatility,” Journal of Business & Economic Statistics, vol. 34,

no. 2, pp. 269–287, 2016. doi: 10.1080/07350015.2015.1038543.

[294] P. R. Hansen, A. Lunde, and V. Voev, “Realized beta GARCH: A multivari-

ate GARCH model with realized measures of volatility,” Journal of Applied

Econometrics, vol. 29, pp. 774–799, 2014. doi: 10.1002/jae.2389.

[295] CME Group, Leading products Q1 2018, http://www.cmegroup.com/education/

files/cme-group-leading-products-2018-q1.pdf, Accessed: 2018-05-11,

2018.

[296] Intercontinental Exchange (ICE), Trading natural gas on ICE, https://www.

theice.com/publicdocs/Trading_Natural_Gas_on_ICE.pdf, Accessed:

2018-11-12, 2018.

[297] U.S. Energy Information Administration (E.I.A.), Perspectives on the develop-

ment of LNG market hubs in the Asia Pacific region, https://www.eia.gov/analysis/studies/lng/asia/pdf/lngasia.pdf,

Accessed: 2018-11-12, 2017.

[298] ——, U.S. natural gas pipeline network, https://www.eia.gov/naturalgas/archive/analysis_publications/ngpipeline/ngpipelines_map.html,

Accessed: 2018-05-11, 2009.

[299] H. Mohammadi, “Market integration and price transmission in the U.S. natural

gas market: From the wellhead to end use markets,” Energy Economics, vol. 33,

no. 2, pp. 227–235, 2011. doi: 10.1016/j.eneco.2010.08.011.

[300] H. Ghoddusi, “Integration of physical and futures prices in the US natural

gas market,” Energy Economics, vol. 56, pp. 229–238, 2016. doi: 10.1016/j.

eneco.2016.03.011.

[301] S. Kim, N. Shephard, and S. Chib, “Stochastic volatility: Likelihood inference

and comparison with ARCH models,” The Review of Economic Studies, vol. 65,

no. 3, pp. 361–393, 1998.

[302] E. R. Vankov, M. Guindani, and K. B. Ensor, “Filtering and estimation for

a class of stochastic volatility models with intractable likelihoods,” Bayesian

Analysis, vol. 14, no. 1, pp. 29–52, 2019. doi: 10.1214/18-BA1099.

[303] R. F. Engle, “Autoregressive conditional heteroscedasticity with estimates

of the variance of United Kingdom inflation,” Econometrica, vol. 50, no. 4,

pp. 987–1007, 1982. doi: 10.2307/1912773.

[304] T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” Jour-

nal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986. doi: 10.1016/0304-

4076(86)90063-1.

[305] ——, “Glossary to ARCH,” in Volatility and Time Series Econometrics: Essays

in Honor of Robert Engle, ser. Advanced Texts in Econometrics, T. Bollerslev,

J. Russell, and M. Watson, Eds., Oxford University Press, 2010. doi: 10.1093/

acprof:oso/9780199549498.003.0008.

[306] L. Bauwens, S. Laurent, and J. V. Rombouts, “Multivariate GARCH models:

A survey,” Journal of Applied Econometrics, vol. 21, no. 1, pp. 79–106, 2006.

doi: 10.1002/jae.842.

[307] L. Rogers and S. Satchell, “Estimating variance from high, low, and closing

prices,” Annals of Applied Probability, vol. 1, no. 4, pp. 504–512, 1991. doi:

10.1214/aoap/1177005835.

[308] D. Yang and Q. Zhang, “Drift-independent volatility estimation based on high,

low, open, and close prices,” Journal of Business, vol. 73, no. 3, pp. 477–492,

2000. doi: 10.1086/209650.

[309] R. F. Engle and G. M. Gallo, “A multiple indicators model for volatility using

intra-daily data,” Journal of Econometrics, vol. 131, no. 1-2, pp. 3–27, 2006.

doi: 10.1016/j.jeconom.2005.01.018.

[310] N. Shephard and K. Sheppard, “Realising the future: Forecasting with high-

frequency-based volatility (HEAVY) models,” Journal of Applied Economet-

rics, vol. 25, no. 2, pp. 197–231, 2010. doi: 10.1002/jae.1158.

[311] C. Contino and R. H. Gerlach, “Bayesian tail-risk forecasting using realized

GARCH,” Applied Stochastic Models in Business and Industry, vol. 33, pp. 213–

236, 2017. doi: 10.1002/asmb.2237.

[312] J. Zhou, D. Li, R. Pan, and H. Wang, “The network GARCH model,” Statistica

Sinica, to appear, 2018+. doi: 10.5705/ss.202018.0234.

[313] B. Li, “Pricing dynamics of natural gas futures,” Energy Economics, vol. 78,

pp. 91–108, 2019. doi: 10.1016/j.eneco.2018.10.024.

[314] M. B. Garman and M. J. Klass, “On the estimation of security price volatilities

from historical data,” Journal of Business, vol. 53, no. 1, pp. 67–78, 1980. doi:

10.1086/296072.

[315] J. C. C. Chan and A. L. Grant, “Modeling energy price dynamics: GARCH

versus stochastic volatility,” Energy Economics, vol. 54, pp. 182–189, 2016.

doi: 10.1016/j.eneco.2015.12.003.

[316] A. Azzalini and A. Capitanio, “Statistical applications of the multivariate skew

normal distribution,” Journal of the Royal Statistical Society, Series B: Sta-

tistical Methodology, vol. 61, no. 3, pp. 579–602, 1999. doi: 10.1111/1467-

9868.00194.

[317] D. Ardia, J. Kolly, and D.-A. Trottier, “The impact of parameter and model

uncertainty on market risk predictions from GARCH-type models,” Journal

of Forecasting, vol. 36, no. 7, pp. 808–823, 2017. doi: 10.1002/for.2472.

[318] S. Watanabe, “Asymptotic equivalence of Bayes cross validation and widely

applicable information criterion in singular learning theory,” Journal of Ma-

chine Learning Research, vol. 11, pp. 3571–3594, 2010. Available at: http :

//www.jmlr.org/papers/v11/watanabe10a.html.

[319] A. Vehtari, A. Gelman, and J. Gabry, “Practical Bayesian model evaluation

using leave-one-out cross-validation and WAIC,” Statistics and Computing,

vol. 27, no. 5, pp. 1413–1432, 2017. doi: 10.1007/s11222-016-9696-4.

[320] P. H. Kupiec, “Techniques for verifying the accuracy of risk measurement mod-

els,” Journal of Derivatives, vol. 3, no. 2, pp. 73–84, 1995. doi: 10.3905/jod.

1995.407942.

[321] K. Hamidieh and K. B. Ensor, “A simple method for time scaling value-at-risk:

Let the data speak for themselves,” Journal of Risk Management in Financial

Institutions, vol. 3, no. 4, pp. 380–391, 2010.

[322] C. F. Mason and N. A. Wilmot, “Jump processes in natural gas markets,”

Energy Economics, vol. 46, no. Supplement 1, S69–S79, 2014. doi: 10.1016/

j.eneco.2014.09.015.

[323] L. M. Kyj, B. Ostdiek, and K. B. Ensor, “Covariance estimation in dynamic

portfolio optimization: A realized single factor model,” SSRN Pre-Print 1364642,

2009, AFA 2010 Atlanta Meetings Paper. doi: 10.2139/ssrn.1364642. Avail-

able at: https://ssrn.com/abstract=1364642.

[324] A. Gelman, J. Hwang, and A. Vehtari, “Understanding predictive information

criteria for Bayesian models,” Statistics and Computing, vol. 24, no. 6, pp. 997–

1016, 2014. doi: 10.1007/s11222-013-9416-2.

[325] H. Akaike, “A new look at the statistical model identification,” IEEE Trans-

actions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974. doi: 10.1109/

TAC.1974.1100705.

[326] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde, “Bayesian

measures of model complexity and fit,” Journal of the Royal Statistical Society,

Series B: Statistical Methodology, vol. 64, no. 4, pp. 583–639, 2002. doi: 10.

1111/1467-9868.00353.

[327] ——, “The deviance information criterion: 12 years on,” Journal of the Royal

Statistical Society, Series B: Statistical Methodology, vol. 76, no. 3, pp. 485–

493, 2014. doi: 10.1111/rssb.12062.

[328] G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics, vol. 6,

no. 2, pp. 461–464, 1978. doi: 10.1214/aos/1176344136.

[329] S. Watanabe, “A widely applicable Bayesian information criterion,” Journal

of Machine Learning Research, vol. 14, pp. 867–897, 2013. Available at: http:

//www.jmlr.org/papers/v14/watanabe13a.html.

[330] A. A. Neath and J. E. Cavanaugh, “The Bayesian information criterion: Back-

ground, derivation, and applications,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 4, no. 2, pp. 199–203, 2012. doi:

10.1002/wics.199.

[331] P. F. Christoffersen, “Evaluating interval forecasts,” International Economic

Review, vol. 39, no. 4, pp. 841–862, 1998. doi: 10.2307/2527341.

[332] S. Talts, M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman, “Validat-

ing Bayesian inference algorithms with simulation-based calibration,” ArXiv

1804.06788, 2018. Available at: https://arxiv.org/abs/1804.06788.

[333] Stan Development Team, Stan modeling language users guide and reference

manual, version 2.17.0, 2017. Available at: http://mc-stan.org/users/

documentation/.

[334] J. C. Pinheiro and D. M. Bates, “Unconstrained parameterizations for variance-

covariance matrices,” Statistics and Computing, vol. 6, no. 3, pp. 289–296, 1996.

doi: 10.1007/BF00140873.

[335] A. Gelman and K. Shirley, “Inference from simulations and monitoring conver-

gence,” in Handbook of Markov Chain Monte Carlo, ser. Handbooks of Modern

Statistical Methods, S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Eds.,

1st, Chapman & Hall/CRC, 2011, ch. 6, pp. 163–174, isbn: 978-1-420-07941-8.

Available at: http://www.mcmchandbook.net/HandbookChapter6.pdf.

[336] M. Betancourt, “Diagnosing suboptimal cotangent disintegrations in Hamilto-

nian Monte Carlo,” ArXiv Pre-Print 1604.00695, 2016. Available at: https:

//arxiv.org/abs/1604.00695.

[337] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Ru-

bin, Bayesian Data Analysis, 3rd, ser. Texts in Statistical Science. CRC Press,

2013, isbn: 978-1-439-84095-5. Available at: http://www.stat.columbia.

edu/~gelman/book/BDA3.pdf.

[338] R. F. Engle and T. Bollerslev, “Modelling the persistence of conditional vari-

ances,” Econometric Reviews, vol. 5, no. 1, pp. 1–50, 1986. doi: 10.1080/

07474938608800095.

[339] D. B. Nelson, “Conditional heteroskedasticity in asset returns: A new ap-

proach,” Econometrica, vol. 59, no. 2, pp. 347–370, 1991.

[340] L. R. Glosten, R. Jagannathan, and D. E. Runkle, “On the relation between

the expected value and the volatility of nominal excess returns on stocks,”

Journal of Finance, vol. 48, no. 5, pp. 1779–1801, 1993. doi: 10.1111/j.1540-

6261.1993.tb05128.x.

[341] Z. Ding, C. W. J. Granger, and R. F. Engle, “A long memory property of stock

market returns and a new model,” Journal of Empirical Finance, vol. 1, no. 1,

pp. 83–106, 1993. doi: 10.1016/0927-5398(93)90006-D.

[342] G. E. P. Box and D. R. Cox, “An analysis of transformations,” Journal of the

Royal Statistical Society, Series B: Methodological, vol. 26, no. 2, pp. 211–243,

1964. doi: 10.1111/j.2517-6161.1964.tb00553.x.

[343] L. Hentschel, “All in the family: Nesting symmetric and asymmetric GARCH

models,” Journal of Financial Economics, vol. 39, no. 1, pp. 71–104, 1995. doi:

10.1016/0304-405X(94)00821-H.

[344] R. F. Engle and G. G. J. Lee, “A long-run and short-run component model

of stock return volatility,” in Cointegration, Causality, and Forecasting: A

Festschrift in Honour of Clive W.J. Granger, R. F. Engle and H. White, Eds.,

Oxford University Press, 1999, pp. 475–497, isbn: 978-0-198-29683-6.

[345] R. F. Engle and M. E. Sokalska, “Forecasting intraday volatility in the US eq-

uity market: Multiplicative component GARCH,” Journal of Financial Econo-

metrics, vol. 10, no. 1, pp. 54–83, 2012. doi: 10.1093/jjfinec/nbr005.

[346] T. Bollerslev, “Modelling the coherence in short-run nominal exchange rates: A

multivariate generalized ARCH model,” The Review of Economics and Statis-

tics, vol. 72, no. 3, pp. 498–505, 1990.

[347] R. Engle, “Dynamic conditional correlation,” Journal of Business & Economic

Statistics, vol. 20, no. 3, pp. 339–350, 2002. doi: 10.1198/073500102288618487.

[348] M. Asai, M. McAleer, and J. Yu, “Multivariate stochastic volatility: A review,”

Econometric Reviews, vol. 25, no. 2-3, pp. 145–175, 2006. doi: 10.1080/

07474930600713564.

[349] R. Engle, “New frontiers for ARCH models,” Journal of Applied Econometrics,

vol. 17, no. 5, pp. 425–446, 2002. doi: 10.1002/jae.683.

[350] T. G. Andersen, T. Bollerslev, F. X. Diebold, and P. Labys, “Modeling and

forecasting realized volatility,” Econometrica, vol. 71, no. 2, pp. 579–625, 2003.

doi: 10.1111/1468-0262.00418.

[351] M. P. Visser, “GARCH parameter estimation using high-frequency data,” Jour-

nal of Financial Econometrics, vol. 9, no. 1, pp. 162–197, 2011. doi: 10.1093/

jjfinec/nbq017.

[352] T. G. Andersen, T. Bollerslev, and X. Huang, “A reduced form framework for

modeling volatility of speculative prices based on realized variation measures,”

Journal of Econometrics, vol. 160, no. 1, pp. 176–189, 2011. doi: 10.1016/j.

jeconom.2010.03.029.

[353] M. Takahashi, Y. Omori, and T. Watanabe, “Estimating stochastic volatil-

ity models using daily returns and realized volatility simultaneously,” Com-

putational Statistics and Data Analysis, vol. 53, pp. 2404–2426, 2009. doi:

10.1016/j.csda.2008.07.039.

[354] M. Parkinson, “The extreme value method for estimating variance of the rate

of return,” Journal of Business, vol. 53, no. 1, pp. 61–65, 1980. doi: 10.1086/

296071.

[355] O. E. Barndorff-Nielsen and N. Shephard, “Econometric analysis of realized

volatility and its use in estimating stochastic volatility models,” Journal of

the Royal Statistical Society, Series B: Statistical Methodology, vol. 64, no. 2,

pp. 253–280, 2002. doi: 10.1111/1467-9868.00336.

[356] ——, “Realized power variation and stochastic volatility models,” Bernoulli,

vol. 9, no. 2, pp. 243–265, 2003. doi: 10.3150/bj/1068128977.

[357] L. Zhang, P. A. Mykland, and Y. Aït-Sahalia, “A tale of two time scales,”

Journal of the American Statistical Association, vol. 100, no. 472, pp. 1394–

1411, 2005. doi: 10.1198/016214505000000169.

[358] O. E. Barndorff-Nielsen, P. R. Hansen, A. Lunde, and N. Shephard, “Mul-

tivariate realised kernels: Consistent positive semi-definite estimators of the

covariation of equity prices with noise and non-synchronous trading,” Journal

of Econometrics, vol. 162, no. 2, pp. 149–169, 2011. doi: 10.1016/j.jeconom.

2010.07.009.

[359] T. Watanabe, “Quantile forecasts of financial returns using realized GARCH

models,” Japanese Economic Review, vol. 63, no. 1, pp. 68–80, 2012. doi:

10.1111/j.1468-5876.2011.00548.x.

[360] C. Wang, Q. Chen, and R. Gerlach, “Bayesian realized-GARCH models for

financial tail risk forecasting incorporating two-sided Weibull distribution,”

Quantitative Finance, vol. 19, no. 6, pp. 1017–1042, 2019. doi: 10.1080/

14697688.2018.1540880.

[361] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, “Coherent measures of risk,”

Mathematical Finance, vol. 9, no. 3, pp. 203–228, Jul. 1999. doi: 10.1111/

1467-9965.00068.

[362] C. Acerbi and D. Tasche, “Expected shortfall: A natural coherent alternative

to value-at-risk,” Economic Notes, vol. 31, no. 2, pp. 379–388, 2002. doi: 10.

1111/1468-0300.00091.

[363] P. Jorion, Value at Risk: The New Benchmark for Managing Financial Risk,

3rd. McGraw-Hill, 2006, isbn: 978-0-071-46495-6.

[364] M. Caporin and F. Fontini, “The long-run oil-natural gas price relationship

and the shale gas revolution,” Energy Economics, vol. 64, pp. 511–519, 2017.

doi: 10.1016/j.eneco.2016.07.024.

[365] K. Tang and W. Xiong, “Index investment and the financialization of com-

modities,” Financial Analysts Journal, vol. 68, no. 6, pp. 54–74, 2012. doi:

10.2469/faj.v68.n6.5.

[366] P. K. Narayan and R. Liu, “A unit root model for trending time-series energy

variables,” Energy Economics, vol. 50, pp. 391–402, 2015. doi: 10.1016/j.

eneco.2014.11.021.

[367] C. Hou and B. H. Nguyen, “Understanding the US natural gas market: A

Markov switching VAR approach,” Energy Economics, vol. 75, pp. 42–53, 2018.

doi: 10.1016/j.eneco.2018.08.004.

[368] A. Hailemariam and R. Smyth, “What drives volatility in natural gas prices?”

Energy Economics, vol. 80, pp. 731–742, 2019. doi: 10.1016/j.eneco.2019.

02.011.

[369] P. R. Hartley, K. B. Medlock III, and J. E. Rosthal, “The relationship of

natural gas to oil prices,” The Energy Journal, vol. 29, no. 3, pp. 47–65, 2008.

Available at: https://www.jstor.org/stable/41323169.

[370] M. Brigida, “The switching relationship between natural gas and crude oil

prices,” Energy Economics, vol. 43, pp. 48–55, 2014. doi: 10.1016/j.eneco.

2014.01.014.

[371] J. A. Batten, C. Ciner, and B. M. Lucey, “The dynamic linkages between crude

oil and natural gas markets,” Energy Economics, vol. 62, pp. 155–170, 2017.

doi: 10.1016/j.eneco.2016.10.019.

[372] R. Dahlhaus, “Graphical interaction models for multivariate time series,” Metrika,

vol. 51, no. 2, pp. 157–172, 2000. doi: 10.1007/s001840000055.

[373] F. R. Bach and M. I. Jordan, “Learning graphical models for stationary time

series,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2189–2199,

2004. doi: 10.1109/TSP.2004.831032.

[374] I. Wilms, S. Basu, J. Bien, and D. S. Matteson, “Sparse identification and

estimation of large-scale vector autoregressive moving averages,” ArXiv Pre-

Print 1710.01415, 2017. Available at: https://arxiv.org/abs/1710.01415.

[375] M. Schweinberger, S. Babkin, and K. B. Ensor, “High-dimensional multivariate

time series with additional structure,” Journal of Computational and Graphical

Statistics, vol. 26, no. 3, pp. 610–622, 2017. doi: 10.1080/10618600.2016.

1265528.

[376] Y. Sun, Y. Li, A. Kuceyeski, and S. Basu, “Large spectral density matrix

estimation by thresholding,” ArXiv Pre-Print 1812.00532, 2018. Available at:

https://arxiv.org/abs/1812.00532.

[377] M. Fiecas, C. Leng, W. Liu, and Y. Yu, “Spectral analysis of high-dimensional

time series,” Electronic Journal of Statistics, vol. 13, no. 2, pp. 4079–4101,

2019. doi: 10.1214/19-EJS1621.

[378] T. Cai, W. Liu, and X. Luo, “A constrained ℓ1 minimization approach to sparse precision matrix estimation,” Journal of the American Statistical Association,

vol. 106, no. 494, pp. 594–607, 2011. doi: 10.1198/jasa.2011.tm10155.

[379] D. J. Thomson, “Spectrum estimation and harmonic analysis,” Proceedings of

the IEEE, vol. 70, no. 9, pp. 1055–1096, 1982. doi: 10.1109/PROC.1982.12433.

[380] N. R. Lomb, “Least-squares frequency analysis of unequally spaced data,” As-

trophysics and Space Science, vol. 39, no. 2, pp. 447–462, 1976. doi: 10.1007/

BF00648343.

[381] J. D. Scargle, “Studies in astronomical time series analysis II: Statistical as-

pects of spectral analysis of unevenly spaced data,” Astrophysical Journal,

vol. 263, pp. 835–853, 1982. doi: 10.1086/160554.

[382] M. Zechmeister and M. Kürster, “The generalised Lomb-Scargle periodogram:

A new formalism for the floating-mean and Keplerian periodograms,” Astron-

omy and Astrophysics, vol. 496, no. 2, pp. 577–584, 2009. doi: 10.1051/0004-

6361:200811296.

[383] D. Yatsenko, K. Josić, A. S. Ecker, E. Froudarakis, R. J. Cotton, and A. S. To-

lias, “Improved estimation and interpretation of correlations in neural circuits,”

PLOS Computational Biology, vol. 11, no. 3, 2015. doi: 10.1371/journal.

pcbi.1004083, article e1004083.

[384] A. Chang, T. Yao, and G. I. Allen, “Graphical models and dynamic latent

factors for modeling functional brain connectivity,” in DSW 2019: Proceedings

of the 2nd IEEE Data Science Workshop, G. Karypis, G. Michailidis, and R.

Willett, Eds., Minneapolis, Minnesota: IEEE, 2019, pp. 57–63. doi: 10.1109/DSW.

2019.8755783.

[385] E. C. Chi and S. Steinerberger, “Recovering trees with convex clustering,”

SIAM Journal on Mathematics of Data Science, vol. 1, no. 3, pp. 383–407,

2019. doi: 10.1137/18M121099X.

[386] F. Campbell, “Statistical machine learning methodology and inference for

structured variable selection,” PhD thesis, Rice University, 2018.

[387] L. de Lathauwer, B. de Moor, and J. Vandewalle, “A multilinear singular value

decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 21,

no. 4, pp. 1253–1278, 2000. doi: 10.1137/S0895479896305696.