UNIVERSITY OF CALIFORNIA
Los Angeles

Projection algorithms for large scale optimization and genomic data analysis

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Biomathematics

by

Kevin Lawrence Keys

2016

© Copyright by
Kevin Lawrence Keys
2016

ABSTRACT OF THE DISSERTATION

Projection algorithms for large scale optimization and genomic data analysis

by

Kevin Lawrence Keys
Doctor of Philosophy in Biomathematics
University of California, Los Angeles, 2016
Professor Kenneth L. Lange, Chair

The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton's method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have intensively developed first-order methods that rely only on gradients and subgradients of a cost function.

In this dissertation we focus on projected gradient methods for large-scale constrained optimization. We develop a particular case of a proximal gradient method called the proximal distance algorithm. Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. To optimize the loss function $f(x)$ over a constraint set $C$, the proximal distance principle mandates minimizing the penalized loss $f(x) + \frac{\rho}{2}\operatorname{dist}(x, C)^2$ and following the solution $x_\rho$ to its limit as $\rho \to \infty$. At each iteration the squared Euclidean distance $\operatorname{dist}(x, C)^2$ is majorized by $\|x - \Pi_C(x_k)\|_2^2$, where $\Pi_C(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x) + \frac{\rho}{2}\|x - \Pi_C(x_k)\|_2^2$ is given by the proximal map $\operatorname{prox}_{\rho^{-1} f}[\Pi_C(x_k)]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.

For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding (IHT) for a large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.

The dissertation of Kevin Lawrence Keys is approved.

Lieven Vandenberghe

Marc Adam Suchard

Van Maurice Savage

Kenneth L. Lange, Committee Chair

University of California, Los Angeles

2016

To my parents

TABLE OF CONTENTS

1 Introduction ...... 1

2 Convex Optimization ...... 4

2.1 Convexity ...... 5

2.2 Projections and Proximal Operators ...... 7

2.3 Descent Methods ...... 9

2.3.1 Gradient Methods ...... 10

2.3.2 Proximal Gradient Method ...... 11

2.4 Second-order methods ...... 12

2.4.1 Newton’s method ...... 12

2.4.2 Conjugate gradient method ...... 13

2.5 The MM Principle ...... 14

3 The Proximal Distance Algorithm ...... 17

3.1 An Adaptive Barrier Method ...... 17

3.2 MM for an Exact Penalty Method ...... 21

3.2.1 Exact Penalty Method for Quadratic Programming ...... 23

3.3 Distance Majorization ...... 24

3.4 The Proximal Distance Method ...... 25

3.5 Examples ...... 29

3.5.1 Projection onto an Intersection of Closed Convex Sets ...... 29

3.5.2 Network Optimization ...... 31

3.5.3 Nonnegative Quadratic Programming ...... 33

3.5.4 Linear Regression under an ℓ0 Constraint ...... 36

3.5.5 Matrix Completion ...... 36

3.5.6 Sparse Precision Matrix Estimation ...... 40

3.6 Discussion ...... 43

4 Accelerating the Proximal Distance Algorithm ...... 45

4.1 Derivation ...... 45

4.2 Convergence and Acceleration ...... 48

4.3 Examples ...... 52

4.3.1 Linear Programming ...... 52

4.3.2 Nonnegative Quadratic Programming ...... 55

4.3.3 Closest Kinship Matrix ...... 57

4.3.4 Projection onto a Second-Order Cone Constraint ...... 59

4.3.5 Copositive Matrices ...... 62

4.3.6 Linear Complementarity Problem ...... 64

4.3.7 Sparse Principal Components Analysis ...... 65

4.4 Discussion ...... 70

5 Iterative Hard Thresholding for GWAS Analysis ...... 72

5.1 Introduction ...... 72

5.2 Methods ...... 74

5.2.1 Penalized regression ...... 74

5.2.2 Calculating step sizes ...... 78

5.2.3 Bandwidth optimizations ...... 78

5.2.4 Parallelization ...... 79

5.2.5 Selecting the best model ...... 80

5.3 Results ...... 80

5.3.1 Simulation ...... 81

5.3.2 Speed comparisons ...... 83

5.3.3 Application to lipid phenotypes ...... 84

5.4 Discussion ...... 86

6 Discussion and Future Research ...... 90

6.1 Parameter tuning for proximal distance algorithms ...... 90

6.2 IHT with nonlinear loss functions ...... 91

6.3 Other greedy algorithms for linear regression ...... 92

7 Notation ...... 95

7.1 Sets ...... 95

7.2 Vectors and Matrices ...... 95

7.3 Norms and Distances ...... 97

7.4 Functions and Calculus ...... 97

7.5 Projections and Proximal Operators ...... 98

7.6 Computation ...... 98

References ...... 99

LIST OF FIGURES

1.1 The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore’s law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Il- lumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Il- lumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases...... 2

2.1 A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129]...... 5

4.1 Proportion of variance explained by q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the ac- celerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA...... 68

4.2 Computation times for q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated prox- imal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA...... 69

5.1 A visual representation of model selection with the LASSO. The addition of the `1 penalty encourages representation of y by a subset of the columns of X...... 75

5.2 A graphical representation of penalized (regularized) regression using norm balls. From left to right, the graphs show ℓ2 or Tikhonov regression, ℓ1 or LASSO regression, and ℓ0 or subset regression. The ellipses denote level curves around the unpenalized optimum β. The penalized optimum occurs at the intersection of the level curves with the norm ball. Tikhonov regularization provides some shrinkage, while the shrinkage from LASSO regularization is more dramatic. The ℓ0 norm enforces sparsity without shrinkage. The MCP “norm ball” cannot be easily drawn but sits between the ℓ1 and ℓ0 balls. ...... 76

5.3 A view of sparse regression with thresholding operators. The order from left to right differs from Figure 5.2: the ℓ1 operator or soft thresholding operator, the MCP or firm thresholding operator, and the ℓ0 operator or hard thresholding operator. We clearly see how MCP interpolates the soft and hard thresholding operators. ...... 76

5.4 A visual representation of IHT. The algorithm starts at a point y and steps in the direction −∇f(y) with magnitude µ to an intermediate point y+. IHT then enforces sparsity by projecting onto the sparsity set Sm. The projection for m = 2 is the identity projection in this example, while projection onto S0 merely sends y+ to the origin 0. Projection onto S1 preserves the larger of the two components of y+. ...... 77

5.5 Mean squared error as a function of model size, as averaged over 5 cross-validation slices, for four lipid phenotypes from NFBC 1966...... 87

LIST OF TABLES

3.1 Performance of the adaptive barrier method in linear programming...... 21

3.2 Dykstra’s algorithm versus the proximal distance algorithm...... 31

3.3 CPU times in seconds and iterations until convergence for the network optimiza- tion problem. Asterisks denote computer runs exceeding computer memory limits. Iterations were capped at 200...... 33

3.4 CPU times in seconds and optima for the nonnegative quadratic program. Abbre- viations: n for the problem dimension, MM for the proximal distance algorithm, CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP...... 34

3.5 Numerical experiments comparing MM to MATLAB's lasso. Each row presents averages over 100 independent simulations. Abbreviations: n the number of cases, p the number of predictors, d the number of actual predictors in the generating model, p1 the number of true predictors selected by MM, p2 the number of true predictors selected by lasso, λ the regularization parameter at the LASSO optimal loss, L1 the optimal loss from MM, L1/L2 the ratio of L1 to the optimal LASSO loss, T1 the total computation time in seconds for MM, and T1/T2 the ratio of T1 to the total computation time of lasso. ...... 37

3.6 Comparison of the MM proximal distance algorithm to SoftImpute. Abbreviations: p is the number of rows, q is the number of columns, α is the ratio of observed entries to total entries, r is the rank of the matrix, L1 is the optimal loss under MM, L2 is the optimal loss under SoftImpute, T1 is the total computation time (in seconds) for MM, and T2 is the total computation time for SoftImpute. ...... 39

3.7 Numerical results for precision matrix estimation. Abbreviations: p for matrix dimension, kt for the number of nonzero entries in the true model, k1 for the number of true nonzero entries recovered by the proximal distance algorithm, k2 for the number of true nonzero entries recovered by glasso, ρ the average tuning constant for glasso for a given kt, L1 the average loss from the proximal distance algorithm, L1 − L2 the difference between L1 and the average loss from glasso, T1 the average compute time in seconds for the proximal distance algorithm, and T1/T2 the ratio of T1 to the average compute time for glasso. ...... 43

4.1 CPU times and optima for linear programming. Here m is the number of con- straints, n is the number of variables, PD is the accelerated proximal distance al- gorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized to be sparse with sparsity level 0.01. 54

4.2 CPU times and optima for nonnegative quadratic programming. Here n is the number of variables, PD is the accelerated proximal distance algorithm, IPOPT is the Ipopt solver, and Gurobi is the Gurobi solver. After n = 512, the constraint matrix A is sparse...... 56

4.3 CPU times and optima for the closest kinship matrix problem. Here the kinship matrix is n × n, PD1 is the proximal distance algorithm, PD2 is the accelerated proximal distance, PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss, and Dykstra is Dykstra’s adaptation of alternating projections. All times are in seconds...... 58

4.4 CPU times and optima for the second-order cone projection. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized with sparsity level 0.01...... 61

4.5 CPU times (seconds) and optima for approximating the Horn variational index of a Horn matrix. Here n is the size of Horn matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 63

4.6 CPU times and optima for testing the copositivity of random symmetric matrices. Here n is the size of matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 64

4.7 CPU times (seconds) and optima for the linear complementarity problem with ran- domly generated data. Here n is the size of matrix, PD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 65

5.1 Model selection performance on NFBC1966 chromosome 1 data...... 82

5.2 Computational times in seconds on NFBC1966 chromosome 1 data...... 84

5.3 Dimensions of data used for each phenotype in GWAS experiment. Here n is the number of cases, p is the number of predictors (genetic + covariates), and mbest is the best cross-validated model size. Note that mbest includes nongenetic covariates. ...... 85

5.4 Computational results from the GWAS experiment. Here β is the calculated effect size. Known associations include the relevant citation...... 88

6.1 Model selection performance of IHT and the exchange algorithm on NFBC1966 chromosome 1 data...... 94

ACKNOWLEDGMENTS

The material presented in this dissertation was funded by the UCLA Graduate Opportunity Fel- lowship Program, a National Science Foundation Graduate Research Fellowship (DGE-0707424), a Predoctoral Training Grant from the National Human Genome Research Institute (HG002536), and financial support from the UCLA Department of Biomathematics, the Stanford University Department of Statistics, and the startup funds of Kenneth Lange.

The students in the Biomathematics program unwillingly bore the brunt of the highs and lows of my graduate research career, and the unfortunate souls that occupied my office suffered the wrath of prodigious puns, dark humor, and gratuitous foul language. For their tolerance, I wish to particularly thank Forrest Crawford, Gabriela Cybis, Joshua Chang, Wesley Kerr, Lae Un Kim, Trevor Shaddox, Bhaven Mistry, and Timothy Stutz.

During the course of graduate school I learned that undergraduate mentors are also lifelong mentors. Marc Tischler, Joseph Watkins, and William Yslas Vélez never hesitated to lend advice or a reference letter. Dr. Vélez informed me that he is slated to retire in 2017, a well-deserved reward after bringing thousands of undergraduates through the mathematics program at The University of Arizona. I aspire to someday possess even an ounce of his work ethic. I will always remember fondly my interactions with María Teresa Vélez, the other Dr. Vélez, former associate dean of the Graduate College at The University of Arizona, who always greeted me at conferences with a warm embrace and a motherly concern for my degree progress. She tragically passed away mere weeks before I defended this dissertation. May she rest in power.

The research in this dissertation represents collaborative work with Gary K. Chen and Hua Zhou, both of whom are much better programmers than I could ever hope to be. Their patient, thoughtful, and careful approach to software development is one that I hope to mimic in my career. An early collaboration with Eric C. Chi and Gary bore no fruit but sparked my interest in the sparse regression methods that ultimately constitute the capstone of my dissertation. I wish to thank my advisor Kenneth Lange, who gracefully and patiently introduced me to the world of optimization. He never hesitated to offer financial, intellectual, or emotional support during my graduate school career.

To my family I owe an unpayable debt. To this day, my parents, my brothers, my uncles and aunts, and my cousins do not understand what I did during graduate school or why I did it. Nonetheless, they always tried to offer emotional support and healthy distractions when needed. I must particularly thank my mother, who somehow summoned the patience to find a silver lining every time that my career prospects seemed to recede into a bleak and cloudy future. Perhaps someday I can complete her wish of “finding the gene that causes cancer.” Lastly, I must thank Gabriela Bran Anleu, who concurrently supported my graduate career while traversing her own. Her convictions, her stubbornness, and her optimism for a future filled with renewable energy sources still inspire me to this day.

VITA

2007–2010 Research Assistant with Michael Hammer, Arizona Research Laboratories, The University of Arizona, Tucson, AZ

2009 Visiting Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain

2010 B.S. (Mathematics) and B.A. (Linguistics), The University of Arizona

2010–2011 Fulbright Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain

2011–present Graduate Student, Department of Biomathematics, University of California, Los Angeles, CA

2012 M.S. (Biomathematics), University of California, Los Angeles

2014 Visiting Student Researcher with Tim Conrad, Konrad Zuse Zentrum, Freie Universität Berlin, Berlin, Germany

2014–2015 Visiting Graduate Researcher, Department of Statistics, Stanford University, Stanford, CA

PUBLICATIONS

Keys KL and Lange KL. “An exchange algorithm for least squares regression”. (in preparation)

Keys KL, Chen GK, Lange KL. “Hard Thresholding Pursuit Algorithms for Model Selection in Genome-Wide Association Studies”. (in preparation)

Keys KL, Zhou H, Lange KL. “Proximal Distance Algorithms: Theory and Examples”. (submitted)

Montanucci L, Laayouni H, Dobón B, Keys KL, Bertranpetit J, and Peretó J. “Influence of topology and functional class on the molecular evolution of human metabolic genes.” Molecular Biology and Evolution. (submitted)

Lange KL and Keys KL. “The MM Proximal Distance Algorithm.” Proceedings of the 2014 International Congress of Mathematicians, Seoul, South Korea.

Dall’Olio GM, Marino J, Schubert M, Keys KL, Stefan MI, Gillespie CS, Poulein P, Shameer K, Suger R, Invergo BM, Jensen LJ, Bertranpetit J, Laayouni H. “Ten simple rules for getting help from online scientific communities.” PLOS Computational Biology 7:9 (2011), e1002202.

CHAPTER 1

Introduction

The fields of genetics and genomics have blossomed since the publication of the first sequenced human genome in 2003. Modern genotyping and sequencing technologies have dramatically low- ered the cost of genetic data collection. The National Human Genome Research Institute (NHGRI) of the United States monitors the average cost of sequencing a 3 gigabase human genome. The graph in Figure 1.1 shows the striking decline in sequencing costs from the year 2001 up to the year 2015, the most recent year for which data are available [138]. The sheer scale of data that modern genomic technology can generate vastly outpaces the computational hardware and soft- ware to analyze it. Typical issues under the well-worn “Big Data” label, such as memory limits and scalable algorithms, are crucially important for modern genetic analysis software.

This dissertation addresses one facet of the genomic data boom, the analysis of genome-wide association studies (GWASes). At its core, GWAS analysis is a very large regression problem. The solution proposed here draws from the fields of computer science, mathematical optimization, and statistical genetics to formulate several software packages for linear regression in GWAS. The story behind the development of this software starts firmly in the field of convex optimization, in particular the class of proximal gradient algorithms, and slowly moves into nonconvex algorithms. Along the way it will create computational tools useful in other genomics contexts, such as sparse principal components analysis (SPCA) and sparse precision matrix estimation commonly used in genetic expression analyses. The climax of this story is an implementation of an algorithm called iterative hard thresholding (IHT) that performs efficient model selection in GWAS. The exposition presented in the following chapters assumes little previous biological coursework. However, those who have not studied mathematics will find the topics intimidating. At a minimum, readers should be comfortable with real analysis, linear algebra, multivariate calculus, and linear statistical models

Figure 1.1: The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore's law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Illumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases.

(at the undergraduate level) and some optimization theory and numerical linear algebra (at the graduate level).

At this juncture, it is important to emphasize what this dissertation does not represent. It does not represent a synthesis of theorems and proofs. In many instances, both convergence and recovery guarantees are taken for granted. Where proofs are not given, relevant mathematical references are provided. Readers seeking mathematical rigor are encouraged to look elsewhere. This work is also not complete: the final product of this investigation, a group of software packages coded in the new Julia programming language, is as much a work in progress as the Julia language

itself. The tactics and implementations detailed herein may well become outdated in a few years. The hope is that this software suite will serve as a springboard or benchmark for future software development targeting increasingly powerful hardware with increasingly clever algorithms.

The rest of this dissertation proceeds as follows. Chapter 2 lightly sketches the necessary convex optimization knowledge to understand the algorithms described later. Chapters 3 and 4 describe the development of the class of proximal gradient algorithms that we call proximal distance algorithms. The proximal distance algorithm serves as a springboard for thinking about sparse regression frameworks such as IHT, while Chapter 5 demonstrates the superiority of IHT versus current software for feature selection in GWAS. The discussion in Chapter 6 draws a roadmap for future directions that this project could take. As will be demonstrated, IHT is a promising framework that could eventually dominate the sparse regression world.

CHAPTER 2

Convex Optimization

The field of mathematical optimization or mathematical programming is concerned with finding the optimal points (minima and maxima) of functions f : U → R over an open domain U. The problem of unconstrained optimization seeks the optimal points of a scalar-valued function f over its entire domain. A constrained optimization problem arises when we optimize f over some set C ⊂ dom f.

The field of optimization traces its roots to early developments in calculus [45, 59]. Theoretical insights from Fermat and Legendre used tools from calculus to determine explicit formulæ for determining optimal values of a function. Newton and Gauss developed iterative methods for computing optima, one of which we now know as Newton’s method. The early 20th century saw the birth of linear programming, the simplest case of mathematical programming. Leonid Kantorovich laid the foundational theory of linear programming [82], while George Dantzig coined the term “linear programming” and published the simplex algorithm [43]. The theory of duality, originally developed by John von Neumann for economic game theory, was found to apply to linear programming as well [133].

Since the 1950s, the field of optimization has blossomed and evolved rapidly. For reasons that will become clear later, we will focus on the important subdomain of optimization known as convex optimization. Convex optimization deals with the optimization of convex functions over convex sets. The exposition given here offers a mere glimpse into the vast literature of convex optimization. Several books [20, 26, 27, 72, 93, 94, 120] offer a rigorous mathematical development of convex analysis. Algorithms for convex optimization are described in [15, 111, 112].

4 Convex set Nonconvex set

Figure 2.1: A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129].

2.1 Convexity

Convexity is a fundamental property in mathematical optimization.

Definition 1. A set $S \subset \mathbb{R}^n$ is convex if for all $x, y \in S$ and every $\alpha \in [0, 1]$ the point $z = \alpha x + (1 - \alpha) y$ also belongs to $S$.

An intuitive interpretation of Definition 1 is that a convex set S contains all line segments between any two points in S. Figure 2.1 demonstrates this explicitly by juxtaposing a convex set with a nonconvex one.

Definition 2. A function $f : U \to \mathbb{R}$ with convex domain $U$ is called a convex function if it satisfies
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) \tag{2.1}$$
for all $x, y \in U$ and all $\alpha \in [0, 1]$.

When strict inequality holds in (2.1), f is said to be strictly convex. If a convex function f is differentiable, then we have the following useful result.

Proposition 1. (First Order Condition for Convexity) Consider a function $f : U \to \mathbb{R}$ with open convex domain $U \subset \mathbb{R}^n$. Then $f$ is convex if and only if for all $x, y \in U$ we have
$$f(y) \ge f(x) + \nabla f(x)^T (y - x). \tag{2.2}$$

Proof. The proof flows from Definition 2. See [93] for details.

The first order condition (2.2) states that $f$ lies above the tangent hyperplane determined by $\nabla f(x)$ at a tangent point $x$. If $f$ is twice differentiable, then a stronger result holds.

Proposition 2. (Second Order Condition for Convexity) Consider a twice-differentiable function

$f : U \to \mathbb{R}$ over an open convex domain $U \subset \mathbb{R}^n$. If $\nabla^2 f(x) \succeq 0$ for every $x \in U$, then $f$ is convex.

Proof. Following the exposition in [93], the expansion
$$f(y) = f(x) + \nabla f(x)^T (y - x) + (y - x)^T \left[ \int_0^1 \nabla^2 f\bigl(x + \alpha (y - x)\bigr)(1 - \alpha)\, d\alpha \right] (y - x)$$
for $x, y \in U$ yields the first order condition (2.2), thus demonstrating the convexity of $f$.

A related concept is the notion of strong convexity.

Definition 3. A function $f : U \to \mathbb{R}$ is called strongly convex with parameter $m > 0$ if for all points $x, y \in U$ and any $\alpha \in [0, 1]$ we have
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{m}{2} \alpha (1 - \alpha) \|x - y\|_2^2.$$

If $f$ is differentiable, then a first-order condition for strong convexity is given by
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2.$$

If $f$ is twice differentiable, then $f$ is strongly convex provided that $\nabla^2 f(x) \succeq m I$.

Strong convexity bounds the smallest eigenvalue of $\nabla^2 f$ away from 0. Strongly convex functions are generally easy to optimize. However, the set of strongly convex functions is smaller than the set of strictly convex functions, so their scope is limited.

The first order condition (2.2) can be generalized through a convex relaxation of differentiability known as subdifferentiability.

Definition 4. A subgradient of a convex function $f : U \to \mathbb{R}$ with $U \subset \mathbb{R}^n$ is any vector $g \in \mathbb{R}^n$ satisfying
$$f(y) - f(x) \ge g^T (y - x) \tag{2.3}$$
for all $x, y \in U$.

Definition 5. (Subdifferentiability) A convex function $f : U \to \mathbb{R}$ is subdifferentiable if the subgradient is defined at every point in $U$. The subdifferential $\partial f(x)$ at a point $x$ is the set of all subgradients of $f$ at $x$.

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, so differentiable functions are by definition subdifferentiable. To illustrate a nondifferentiable $f$ that is subdifferentiable, consider the absolute value function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = |x|$. Then
$$\partial f(x) = \begin{cases} \{1\} & x > 0, \\ \{-1\} & x < 0, \\ [-1, 1] & x = 0. \end{cases}$$

For the smooth portions of $f$, there exists only one slope, given by the derivative function $f'$. At the point $x = 0$ where $f$ is nondifferentiable, the subdifferential $\partial f$ contains the slopes of all possible tangent lines at $x$.

An important concept in convex analysis is that of the conjugate function.

Definition 6. The convex conjugate of a function $f$ (alternatively the Fenchel conjugate or the Legendre-Fenchel conjugate of $f$) is defined as
$$f^\star(x) = \sup_y \, \bigl\{ y^T x - f(y) \bigr\}.$$

The conjugate $f^\star$ is always closed and convex regardless of the convexity of $f$.

2.2 Projections and Proximal Operators

Projection operators occupy a useful niche in optimization.

Definition 7. The projection operator $\Pi_S : \mathbb{R}^n \to \mathbb{R}^n$ onto a set $S$ (alternatively the Euclidean projection onto $S$) maps a point $x \in \mathbb{R}^n$ to a possibly nonunique point $y \in S$ that minimizes the Euclidean distance $\operatorname{dist}(x, y)$. In functional terms, we have
$$\Pi_S(x) = \operatorname*{argmin}_{y \in S} \|x - y\|_2 \tag{2.4}$$
$$\operatorname{dist}(x, S) = \inf_{y \in S} \|x - y\|_2. \tag{2.5}$$

If S is closed and convex, then ΠS maps x uniquely to its counterpart y ∈ S, and dist(x, S) is a convex function. Projections onto many particular convex sets are known in closed or computable form [9, 11].

An important generalization of a projection operator is known as a proximal operator.

Definition 8. The proximal operator $\operatorname{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$ (alternatively the proximity operator or the proximal map) for a closed convex function $f$ is the solution to the optimization problem
$$\operatorname{prox}_f(x) = \operatorname*{argmin}_y \left\{ f(y) + \frac{1}{2} \|x - y\|_2^2 \right\}. \tag{2.6}$$

The proximal operator is unique and exists for all $x \in \operatorname{dom} f$ [11, 107]. It is often useful to parametrize the proximal operator by a step size $t$:
$$\operatorname{prox}_{tf}(x) = \operatorname*{argmin}_y \left\{ t f(y) + \frac{1}{2} \|x - y\|_2^2 \right\} = \operatorname*{argmin}_y \left\{ f(y) + \frac{1}{2t} \|x - y\|_2^2 \right\}.$$

The proximal operator is the solution to the Moreau-Yosida regularization of $f$, so evaluating $\operatorname{prox}_{tf}(x)$ is itself an optimization problem. Proximal operators are particularly useful when $f$ is nondifferentiable or otherwise difficult to optimize.

The analytical properties of proximal operators are well understood. We can view them as generalized projections. Intuitively, the proximal operator establishes a compromise between minimizing the distance to $x$ and minimizing the function $f$ itself. Like the projection operators that they generalize, proximal operators of many functions have closed-form or computable solutions [11, 117]. For example, the proximal operator of the indicator function of a set $C$,
$$\delta_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C, \end{cases}$$
is simply the projection $\Pi_C$ onto $C$.

One important property of proximal operators is firm nonexpansiveness. If $f$ is a closed convex function, then $\operatorname{prox}_{tf}$ satisfies
$$\bigl\|\operatorname{prox}_{tf}(x) - \operatorname{prox}_{tf}(y)\bigr\|_2^2 \le \bigl[\operatorname{prox}_{tf}(x) - \operatorname{prox}_{tf}(y)\bigr]^T (x - y)$$
for all $x, y \in \operatorname{dom} f$. Firmly nonexpansive operators $T(x)$ are useful for fixed point algorithms since the iteration scheme
$$x_{k+1} = (1 - \rho) x_k + \rho T(x_k)$$
converges weakly to a fixed point whenever $\rho \in (0, 2)$.

Another important property of proximal operators is known as the Moreau decomposition. A closed convex function $f$ is related to its conjugate $f^\star$ via the relation
$$x = \operatorname{prox}_f(x) + \operatorname{prox}_{f^\star}(x).$$
The Moreau decomposition is similar in spirit to the orthogonal decomposition in linear algebra, in which a vector $x \in \mathbb{R}^n$ is split into the sum of two vectors $y \in C$ and $z \in C^\perp$ for a closed subspace $C \subset \mathbb{R}^n$.
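To make these operators concrete, the following Julia sketch (our own illustration; the helper names are not taken from the dissertation's software) evaluates two closed-form proximal maps and checks the Moreau decomposition numerically. For $f(y) = \|y\|_1$ the conjugate $f^\star$ is the indicator of the unit $\ell_\infty$ ball, so $\operatorname{prox}_f$ is elementwise soft thresholding and $\operatorname{prox}_{f^\star}$ is a projection.

```julia
using LinearAlgebra

# prox of f(y) = ||y||_1 with unit step size: elementwise soft thresholding
soft_threshold(x) = sign.(x) .* max.(abs.(x) .- 1, 0)

# prox of the conjugate f*, the indicator of the unit l-infinity ball: a projection
project_box(x) = clamp.(x, -1, 1)

x = randn(5)
soft_threshold(x) + project_box(x) ≈ x   # Moreau decomposition: prox_f(x) + prox_f*(x) = x
```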

2.3 Descent Methods

Descent methods are iterative schemes for optimizing a function $f$ by producing a minimizing sequence $\{x_k\}$, $k = 1, 2, \ldots$, that satisfies
$$x_{k+1} = x_k + t_k \Delta x_k, \qquad f(x_{k+1}) \le f(x_k)$$
with step direction $\Delta x_k$ and step size $t_k > 0$ for all suboptimal $x_k$. Descent methods come in many flavors, and their domain of application can vary depending on the size and complexity of $f$.

The class of algorithms known as gradient methods or first-order methods optimize a function f using first-order (sub)differentiability of f [27, 35]. Gradient methods are sometimes called steepest descent methods since the search direction ∆x uses the negative gradient −∇f(x), which points in the direction of steepest descent. They follow the simple update scheme

xk+1 := xk − tk∇f(xk) (2.7)

A more complete algorithm appears as Algorithm 1.

Algorithm 1 The method. Require: a starting point x ∈ dom f and a tolerance  > 0 with   1. repeat ∆x := −∇f(x). Choose step size t with appropriate method. Update x := x + t∆x.

until k∆xk2 < 

If f is subdifferentiable but not differentiable then replacing the gradient ∇f with a subgradient g ∈ ∂f at every point x yields a subgradient method. Subgradient methods typically exhibit slower convergence than similar gradient descent methods, but they apply to a much larger class of functions.

Strictly speaking, Algorithm 1 describes an unconstrained minimization scheme. If we wish to optimize a convex differentiable function f over a constraint set C, then we use the projected gradient descent update

xk+1 := ΠC (xk − tk∇f(xk)) (2.8)

For certain conditions on t, the update scheme (2.8) converges stably to the constrained minimum of f [19]. For example, if ∇f is Lipschitz continuous with Lipschitz constant L, then a constant step size t ∈ (0, 2/L)) ensures convergence with (2.8). Exploiting Lipschitz continuity yields the simplest convergence guarantees; more complicated guarantees rely on the Wolfe conditions

10 [140, 141] or the Armijo rule [5]. We will make frequent use of constant step sizes t ∈ (0, 1/L) based on Lipschitz constants for reasons that will become clear later.

2.3.2 Proximal Gradient Method

Suppose that we can split a convex objective function $f : U \to \mathbb{R}$ into the sum $f = g + h$ of two closed proper convex functions $g : U \to \mathbb{R}$ and $h : U \to \mathbb{R}$ where $g$ is differentiable. The proximal gradient method is the iterative scheme given by
$$x_{k+1} := \operatorname{prox}_{t_k h}\bigl(x_k - t_k \nabla g(x_k)\bigr) \tag{2.9}$$
with step size $t_k$ at iteration $k$. If we optimize $f$ via the surrogate function
$$g(x \mid x_k, t) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2t} \|x - x_k\|_2^2, \tag{2.10}$$
then we can compute a suitable $t$ with the line search of Beck and Teboulle [13] described in Algorithm 2. The use of surrogate functions presages our discussion of majorization methods in Section 2.5.

Algorithm 2 Line search for the proximal gradient method.
Require: the current iterate $x_k$, the previous step size $t_{k-1}$, and a parameter $\beta \in (0, 1)$.
Let $t := t_{k-1} / \beta$.
repeat
    $t := \beta t$
    $z := \operatorname{prox}_{th}\bigl(x_k - t \nabla g(x_k)\bigr)$
until $f(z) \le g(z \mid x_k, t)$
return $t_k := t$, $x_{k+1} := z$.

If $h$ is the indicator $\delta_C$ of the constraint set $C$, then (2.9) reduces to the projected gradient descent scheme (2.8). Setting $h = 0$ yields the standard gradient descent scheme (2.7).
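For a concrete instance of (2.9), the sketch below (again illustrative; the penalty level λ and the function names are our own choices, not the dissertation's code) applies the proximal gradient method to the LASSO objective $\frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$: the smooth part supplies the gradient step, and the prox of $t\lambda\|\cdot\|_1$ is soft thresholding.

```julia
using LinearAlgebra

soft_threshold(x, t) = sign.(x) .* max.(abs.(x) .- t, 0)

# Minimal sketch of the proximal gradient update (2.9) for the LASSO problem.
function proximal_gradient_lasso(A, b, λ; maxiter = 500)
    t = 1 / opnorm(A)^2                             # constant step size 1/L
    x = zeros(size(A, 2))
    for k in 1:maxiter
        grad = A' * (A * x - b)                     # ∇g(x)
        x = soft_threshold(x - t * grad, t * λ)     # prox_{t h} of the gradient step
    end
    return x
end

A, b = randn(200, 50), randn(200)
xhat = proximal_gradient_lasso(A, b, 0.5)
count(!iszero, xhat)     # number of nonzero coefficients selected
```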

11 2.4 Second-order methods

Second-order methods for optimizing a convex function $f$ exploit approximate or exact second derivative information about $f$. When $f$ is twice differentiable, its Hessian matrix $\nabla^2 f$ provides curvature information useful for computing search directions with Newton's method. Approximate second-order methods such as the conjugate gradient method extrapolate second-order information from $\nabla f$.

2.4.1 Newton’s method

Suppose that $f : U \to \mathbb{R}$ is closed, convex, and twice differentiable. The Newton step for $f$ at $x$ is defined as
$$\Delta x_{\mathrm{nt}} = -\bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x).$$

The Newton step is motivated by considering the second order Taylor expansion of $f$ at $x$ given by
$$\tilde{f}(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v.$$

Observe that $\tilde{f}$ is a convex quadratic function of $v$. The minimizer with respect to $v$ is
$$v = -\bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x),$$
which coincides with $\Delta x_{\mathrm{nt}}$. When $\nabla^2 f(x) \in \mathbb{S}^n_+$, as is the case for convex functions, the Newton step gives the direction of steepest descent in the quadratic norm
$$\|y\|_{\nabla^2 f(x)} = \sqrt{y^T \nabla^2 f(x)\, y}$$
defined by the Hessian $\nabla^2 f(x)$ at $x$. A more intuitive explanation is that Newton's method warps the direction of steepest descent in accordance with information from $\nabla^2 f$. Newton's method attains quadratic convergence near the minimum [27], but it can easily overshoot the minimum if no safeguards are put in place. Typically Newton directions are damped with a suitable backtracking line search. Monitoring the Newton decrement
$$N(x) = \sqrt{\nabla f(x)^T \bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x)} \tag{2.11}$$

12 yields a simple stopping criterion. Algorithm 3 sketches one version of the damped Newton method.

Algorithm 3 Damped Newton's method.
given a starting point $x \in \operatorname{dom} f$ and a tolerance $\epsilon > 0$ with $\epsilon \ll 1$.
repeat
    Compute the Newton step $\Delta x_{\mathrm{nt}} := -[\nabla^2 f(x)]^{-1} \nabla f(x)$.
    Compute step size $t$ via backtracking line search.
    Update $x := x + t \Delta x_{\mathrm{nt}}$.
    Compute the convergence criterion $\lambda := N^2(x)$.
until $\lambda / 2 < \epsilon$
return $x$.
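As a compact illustration of Algorithm 3, the following Julia sketch applies damped Newton steps to the logistic regression log-likelihood. The test problem, the function names, and the Armijo constant 0.25 are our own choices rather than anything prescribed by the text; the squared Newton decrement (2.11) supplies the stopping rule.

```julia
using LinearAlgebra

function damped_newton(X, y; tol = 1e-8, maxiter = 50)
    f(b) = sum(log1p.(exp.(X * b))) - dot(y, X * b)   # negative log-likelihood
    β = zeros(size(X, 2))
    for iter in 1:maxiter
        p = 1 ./ (1 .+ exp.(-X * β))                  # fitted probabilities
        grad = X' * (p - y)
        H = X' * Diagonal(p .* (1 .- p)) * X          # Hessian
        Δ = -(H \ grad)                               # Newton step
        λsq = dot(grad, -Δ)                           # squared Newton decrement N²(β)
        λsq / 2 < tol && break
        t = 1.0                                       # backtracking line search
        while f(β + t * Δ) > f(β) - 0.25 * t * λsq
            t /= 2
        end
        β += t * Δ
    end
    return β
end

X = [ones(100) randn(100, 3)]
y = Float64.(rand(100) .< 0.5)      # synthetic 0/1 responses
βhat = damped_newton(X, y)
```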

2.4.2 Conjugate gradient method

Suppose that we wish to solve the linear system $Ax = b$ where $A \in \mathbb{S}^n_+$. For dense systems with millions or billions of equations, the burden of computing a Newton step can overwhelm most computational hardware. The conjugate gradient method is well suited to numerically solving large sparse systems of linear equations [70, 122]. In exact arithmetic, the conjugate gradient method converges in no more than $n$ iterations. However, even tiny numerical imprecisions render the conjugate gradient method unstable as a direct method on computers.

Fortunately, the conjugate gradient method works remarkably well as an iterative method. For any two vectors $u, v \in \mathbb{R}^n$ we say that $u$ and $v$ are conjugate with respect to $A$ if $u^T A v = 0$. Because $A \in \mathbb{S}^n_+$, the conjugate relation defines an inner product $\langle A u, v \rangle$. Suppose that we form a set $P$ of $n$ mutually conjugate vectors $p_1, p_2, \ldots, p_n$ under the inner product defined by $A$. Then $P$ forms a basis for $\mathbb{R}^n$. For the aforementioned linear system, we have $P = \{b, Ab, A^2 b, \ldots, A^{n-1} b\}$. We call $\operatorname{span} P$ a Krylov subspace [89]. According to the Cayley-Hamilton theorem, the matrix $A^{-1}$ used to solve $Ax = b$ can be expressed as a linear combination of the powers of $A$. Since $P$ contains images of $b$ under powers of $A$, there exists a matrix $\tilde{A} \in \operatorname{span} P$ such that $\tilde{A} \approx A^{-1}$. This approximation is good so long as $A$ is well conditioned; using the conjugate gradient method

with a poorly conditioned $A$ often requires pre- or post-multiplication of $A$ by a rescaling operator called a preconditioner. Algorithm 4 succinctly describes an unpreconditioned conjugate gradient method as an iterative solver.

Algorithm 4 A typical conjugate gradient algorithm.
given parameters $A$ and $b$, and a tolerance $\epsilon > 0$ with $\epsilon \ll 1$.
initialize a starting point $x_0 \in \mathbb{R}^n$, the residual vector $r_0 := b - A x_0$, the search direction $p_0 := r_0$, and the squared norm $c_0 := \|r_0\|_2^2$.
repeat
    Compute the Krylov vector $z_k := A p_k$.
    Compute the ratio $\alpha_k := c_k / p_k^T z_k$.
    Update the estimated solution $x_{k+1} := x_k + \alpha_k p_k$.
    Update the residual $r_{k+1} := r_k - \alpha_k z_k$.
    Update the squared norm $c_{k+1} := \|r_{k+1}\|_2^2$.
    Update the search direction $p_{k+1} := r_{k+1} + (c_{k+1} / c_k)\, p_k$.
until $\sqrt{c_{k+1}} < \epsilon$
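The listing below is a direct Julia transcription of Algorithm 4; it is a sketch for illustration (the function name and test matrix are our own) rather than a production solver.

```julia
using LinearAlgebra

function conjugate_gradient(A, b; tol = 1e-10, maxiter = length(b))
    x = zeros(length(b))
    r = b - A * x                 # residual
    p = copy(r)                   # search direction
    c = dot(r, r)                 # squared residual norm
    for k in 1:maxiter
        z = A * p                 # Krylov vector
        α = c / dot(p, z)
        x += α * p
        r -= α * z
        cnew = dot(r, r)
        sqrt(cnew) < tol && break
        p = r + (cnew / c) * p
        c = cnew
    end
    return x
end

M = randn(50, 50); A = M' * M + I   # a random symmetric positive definite matrix
b = randn(50)
x = conjugate_gradient(A, b)
norm(A * x - b)                      # should be near zero
```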

2.5 The MM Principle

The MM principle (alternatively optimization transfer or iterative majorization) is a device for constructing optimization algorithms [25, 76, 95, 93, 90]. In essence, it replaces the objective function $f(x)$ by a simpler surrogate function $g(x \mid x_k)$ anchored at the current iterate $x_k$ and majorizing or minorizing $f(x)$. As a byproduct of optimizing $g(x \mid x_k)$ with respect to $x$, the objective function $f(x)$ is sent downhill or uphill, depending on whether the purpose is minimization or maximization. The next iterate $x_{k+1}$ is chosen to optimize the surrogate $g(x \mid x_k)$ subject to any relevant constraints. Majorization combines two conditions: the tangency condition $g(x_k \mid x_k) = f(x_k)$ and the domination condition $g(x \mid x_k) \ge f(x)$ for all $x$. In minimization these conditions and the definition of $x_{k+1}$ lead to the descent property
$$f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k). \tag{2.12}$$

Minorization reverses the domination inequality and produces an ascent algorithm. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a stationary point of the objective function [90]. In particular, the MM principle is ideally suited for optimizing convex objective functions since their surrogates can exploit the machinery of convex optimization. From the perspective of dynamical systems, the objective function serves as a Lyapunov function for the algorithm map.

The MM principle simplifies optimization by: (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a nondifferentiable problem into a smooth problem. Choosing a tractable surrogate function $g(x \mid x_k)$ that hugs the objective function $f(x)$ as tightly as possible requires experience and skill with inequalities. The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. Hence, it is possible to work piecemeal in majorizing complicated objective functions.
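As a toy illustration of the majorize-then-minimize recipe (our own example, not one from the text), the Julia sketch below minimizes $f(x) = \sum_i |x - a_i|$ by majorizing each absolute value at $x_k$ with the quadratic $(x - a_i)^2 / (2|x_k - a_i|) + |x_k - a_i|/2$. Minimizing the surrogate gives a weighted mean, and by the descent property (2.12) the iterates slide downhill toward the sample median.

```julia
# MM iteration for the one-dimensional median; δ guards against division by zero.
function mm_median(a; iters = 100, δ = 1e-10)
    x = sum(a) / length(a)                 # start at the mean
    for k in 1:iters
        w = 1 ./ (abs.(x .- a) .+ δ)       # weights from the quadratic majorizers
        x = sum(w .* a) / sum(w)           # minimizer of the surrogate
    end
    return x
end

a = [1.0, 2.0, 3.0, 4.0, 100.0]
mm_median(a)       # ≈ 3.0, the sample median, unlike the mean (22.0)
```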

The MM principle as formulated here represents the synthesis of a complex history. Specific MM algorithms appeared years before the principle was well understood [67, 105, 125, 137, 145]. Projected gradient and proximal gradient algorithms can be motivated from the MM perspective, but the early emphasis on operators and fixed points obscured this distinction. The celebrated EM (expectation-maximization) principle of computational statistics is a special case of the MM principle [106]. Although Dempster, Laird, and Rubin [48] formally named the EM algorithm, many of their contributions were anticipated by Baum [8] and Sundberg [127]. The MM princi- ple was clearly stated by Ortega and Rheinboldt [114]. de Leeuw [46] is generally credited with recognizing the importance of the principle in practice. The EM algorithm had an immediate and large impact in computational statistics, but the more general MM principle was much slower to take hold. The papers [47, 69, 84] by the Dutch school of psychometricians solidified its posi- tion. The related Dinkelbach [52] maneuver in fractional linear programming also highlighted the importance of the descent property in algorithm construction.

Since the MM principle is not an algorithm per se, it can easily exploit the aforementioned gradient and Newton descent methods. The development of the proximal distance algorithm and iterative hard thresholding in the following chapters will make explicit use of both the MM principle and gradient descent methods.

CHAPTER 3

The Proximal Distance Algorithm

The current exposition emphasizes the role of the MM principle in nonlinear programming. For smooth functions, one can construct an adaptive interior point method based on scaled Bregman barriers. This algorithm does not follow the central path. For convex programming subject to nonsmooth constraints, one can combine an exact penalty method with distance majorization to create versatile algorithms that are effective even in discrete optimization. These proximal distance algorithms are highly modular and reduce to set projections and proximal mappings, both very well-understood techniques in optimization. We illustrate the possibilities in linear programming, binary piecewise-linear programming, nonnegative quadratic programming, ℓ0 regression, matrix completion, and sparse precision matrix estimation.

3.1 An Adaptive Barrier Method

In convex programming it simplifies matters notationally to replace a convex inequality constraint hj(x) ≤ 0 by the concave constraint vj(x) = −hj(x) ≥ 0. Barrier methods operate on the relative interior of the feasible region where all vj(x) > 0. Adding an appropriate barrier term to the objective function f(x) keeps an initially inactive constraint vj(x) inactive throughout an optimization search. If the barrier function is well designed, it should adapt and permit convergence to a feasible point y with one or more inequality constraints active.

We now briefly summarize an adaptive barrier method that does not follow the central path [91]. Because the logarithm of a concave function is concave, the Bregman majorization [29]
$$-\ln v_j(x) + \ln v_j(x_k) + \frac{1}{v_j(x_k)} \nabla v_j(x_k)^T (x - x_k) \ge 0$$
acts as a convex barrier for a smooth constraint $v_j(x) \ge 0$. To make the barrier adaptive, we scale

s s X X T g(x | xk) = f(x) − ρ vj(xk) ln vj(x) + ρ ∇vj(xk) (x − xk) j=1 j=1 for s inequality constraints. Minimizing the surrogate subject to relevant linear equality constraints

Ax = b produces the next iterate xk+1. The constant ρ determines the tradeoff between keeping the constraints inactive and minimizing f(x). One can show that the MM algorithm with exact minimization converges to the constrained minimum of f(x) [90].

In practice one step of Newton’s method is usually adequate to decrease f(x). The first step of

Newton’s method minimizes the surrogate g(x | xk) given by the second-order Taylor expansion of around xk subject to the equality constraints. Given smooth functions, the two derivatives

∇g(xk | xk) = ∇f(xk) s 2 2 X 2 ∇ g(xk | xk) = ∇ f(xk) − ρ ∇ vj(xk) (3.1) j=1 s X 1 + ρ ∇v (x )∇v (x )T v (x ) j k j k j=1 j k are the core ingredients in the quadratic approximation of g(x | xk). Unfortunately, one step of Newton’s method is neither guaranteed to decrease f(x) nor to respect the nonnegativity con- straints.

For instance, the standard form of linear programming requires the minimization of a linear function f(x) = cT x subject to Ax = b and x  0. The quadratic approximation to the surrogate g(x | xk) amounts to

p ρ X 1 cT x + cT (x − x ) + (x − x )2. k k 2 x j kj j=1 kj The minimum of this quadratic subject to the linear equality constraints occurs at the point

−1 −1 T −1 T −1 −1 xk+1 = xk − Dk c + Dk A (ADk A ) (b − Axk + ADk c).

−1 Here Dk is the diagonal matrix with ith diagonal entry ρxk,i . Observe that the increment xk+1 −xk satisfies the linear equality constraint A(xk+1 − xk) = b − Axk. 18 One can overcome the objections to Newton updates by taking a controlled step along the

Newton direction uk = xk+1 − xk. The key is to exploit the theory of self-concordant functions [27, 111]. A thrice differentiable convex function h(t) is said to be self-concordant if it satisfies the inequality

|h000(t)| ≤ 2ch00(t)3/2

for some constant c ≥ 0 and all t in the essential domain of h(t). All convex quadratic functions qualify as self-concordant with c = 0. The function h(t) = − ln(at + b) is self-concordant with constant 1. The class of self-concordant functions is closed under sums and composition with

linear functions. A convex function q(x) with domain Rn is said to be self-concordant if every slice h(t) = q(x + tu) is self-concordant.

Rather than conduct an expensive one-dimensional search along the Newton direction xk +tuk, one can majorize the surrogate function h(t) = g(xk + tuk | xk) along the half-line t ≥ 0. The clever majorization

1 1 h(t) ≤ h(0) + h0(0)t − h00(0)1/2t − ln[1 − cth00(0)1/2] (3.2) c c2

both guarantees a decrease in f(x) and prevents a violation of the inequality constraints [111]. Here c is the self-concordance constant associated with the surrogate. The optimal choice of t reduces to the damped Newton update

h0(0) t = . (3.3) h00(0) − ch0(0)h00(0)1/2

The first two derivatives of h(t) are clearly

0 T h (0) = ∇f(xk) uk s 00 T 2 X T 2 h (0) = uk ∇ f(xk)uk − ρ uk ∇ vj(xk)uk j=1 s X 1 + ρ [∇v (x )u ]2. v (x ) j k k j=1 j k

The first of these derivatives is nonpositive because uk is a descent direction for f(x). The second is generally positive because all of the contributing terms are nonnegative. 19 When f(x) is quadratic and the inequality constraints are affine, detailed calculations show that the surrogate function g(x | xk) is self-concordant with constant

1 c = p . ρ min{v1(xk), . . . , vs(xk)}

Taking the damped Newton’s step with step length (3.3) keeps xk + tkuk in the relative interior of the feasible region while decreasing the surrogate and hence the objective function f(x). When

f(x) is not quadratic but can be majorized by a quadratic surrogate q(x | xk), one can replace

f(x) by q(x | xk) in calculating the adaptive-barrier update. The next iterate xk+1 retains the descent property.

As a toy example consider the linear programming problem of minimizing cT x subject to Ax = b and x  0. Applying the adaptive barrier method to the choices   −1         −1 2 0 0 1 0 0 1         −1 A = 0 2 0 0 1 0 , b = 1 , c =              0  0 0 2 0 0 1 1      0    0

1 and to the feasible initial point x0 = 3 1 produces the results displayed in Table 3.1. Not shown 1 1 1 T is the minimum point ( 2 , 2 , 2 , 0, 0, 0) . Columns two and three of the table record the progress

of the unadorned adaptive barrier method. The quantity k∆kk2 equals the Euclidean norm of the

difference vector ∆k = xk −xk−1. Columns four and five repeat this information for the algorithm

modified by the self-concordant majorization (3.2). The quantity tk in column six represents the

optimal step length (3.3) in going from xk−1 to xk along the Newton direction uk−1. Clearly, there is a price to be paid in implementing a safeguarded Newton step. In practice, this price is well worth paying.

                     No Safeguard               Self-concordant Safeguard
Iteration k     c^T x_k     ||Δ_k||_2       c^T x_k     ||Δ_k||_2       t_k
 1              -1.20000    0.25820         -1.11270    0.14550         0.56351
 2              -1.33333    0.17213         -1.20437    0.11835         0.55578
 3              -1.41176    0.10125         -1.27682    0.09353         0.55026
 4              -1.45455    0.05523         -1.33288    0.07238         0.54630
 5              -1.47692    0.02889         -1.37561    0.05517         0.54345
10              -1.49927    0.00094         -1.47289    0.01264         0.53746
15              -1.49998    0.00003         -1.49426    0.00271         0.53622
20              -1.50000    0.00000         -1.49879    0.00057         0.53597
25              -1.50000    0.00000         -1.49975    0.00012         0.53591
30              -1.50000    0.00000         -1.49995    0.00003         0.53590
35              -1.50000    0.00000         -1.49999    0.00001         0.53590
40              -1.50000    0.00000         -1.50000    0.00000         0.53590

Table 3.1: Performance of the adaptive barrier method in linear programming.
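For readers who wish to reproduce the No Safeguard column of Table 3.1, the Julia sketch below implements the explicit update $x_{k+1} = x_k - D_k^{-1} c + D_k^{-1} A^T (A D_k^{-1} A^T)^{-1} (b - A x_k + A D_k^{-1} c)$ for the toy linear program above. The value $\rho = 1$ is our assumption, since the text does not state the penalty constant used, but with it the first iterates match the table (for example, $c^T x_1 = -1.2$ and $c^T x_2 \approx -1.3333$).

```julia
using LinearAlgebra

# Unsafeguarded adaptive barrier updates for the toy linear program of Table 3.1.
function adaptive_barrier_lp(A, b, c, x0, ρ; iters = 40)
    x = copy(x0)
    for k in 1:iters
        Dinv = Diagonal(x ./ ρ)                     # D_k^{-1}, with D_k = diag(ρ / x_{k,i})
        M = A * Dinv * A'
        r = b - A * x + A * (Dinv * c)
        x = x - Dinv * c + Dinv * (A' * (M \ r))    # Newton step on the surrogate
    end
    return x
end

A = [2.0 0 0 1 0 0;
     0 2.0 0 0 1 0;
     0 0 2.0 0 0 1]
b = ones(3)
c = [-1.0; -1.0; -1.0; 0.0; 0.0; 0.0]
x = adaptive_barrier_lp(A, b, c, fill(1/3, 6), 1.0)   # feasible start x0 = (1/3)1
dot(c, x)      # approaches the optimal value -1.5 reported in Table 3.1
```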

3.2 MM for an Exact Penalty Method

We now turn to exact penalty methods. For a smooth objective function and smooth constraints, the most convenient penalized objective is

G(x) Fρ(x) = f(x) + ρ ,

H(x)+ where f(x) is the objective function, G(x) is the vector of equality constraints, and H(x)+ is the

vector of truncated inequality constraints with components max{0, hj(x)}. Classical optimization theory says that a constrained minimum point of f(x) furnishes an unconstrained minimum point

21 of Fρ(x) provided that ρ is sufficiently large and that the Lagrangian

p q X X L(x, λ, µ) = f(x) + λigi(x) + µjhj(x) i=1 j=1 is suitably well-behaved [121]. Here the Lagrange multiplier vectors λ and µ are chosen so that the multiplier rule ∇L(y, λ, µ) = 0 holds at the constrained minimum y.

The nonsmooth nature of Fρ(x) is a crippling hindrance to its optimization. Fortunately, a simple modification of the penalty leads to a viable minimization algorithm. Let us replace the Euclidean norm in the penalty by

u p = kuk2 + kvk2 +  v  for a small  > 0. This positions us to majorize the penalty via the univariate majorization

√ √ t − tk t ≥ tk + √ (3.4) 2 tk √ of the concave function t on the interval t > 0. The resulting majorization

" p q # ρ X X f(x) + ρq (x) ≤ f(x) + g (x)2 + h (x)2 + c (3.5)  2q (x ) i j + k  k i=1 j=1

is the key to approximate minimization of Fρ(x). Here the irrelevant constant ck depends only on

xk and q(x) is given by

G(x) q(x) =

H(x)+ 

The obvious tactic in generating a better iterate xk+1 is to apply one step of Newton’s method.

If we let wk = ρ/(2q(xk)), then the gradient

p q X X ∇f(xk) + wk gi(xk)∇gi(xk) + wk hj(xk)+∇hj(xk) i=1 j=1

of the surrogate function at xk is straightforward to derive, but the second differential is problem-

2 atic to compute because the functions hj(x)+ are not twice-differentiable. When hj(xk) < 0,

22 2 2 the second derivative satisfies ∇ hj(xk)+ = 0. In the opposite situation hj(xk) > 0, the Gauss- Newton approximation

2 2 T ∇ hj(xk)+ ≈ 2∇hj(xk)∇hj(xk) (3.6) is valuable for several reasons. Most importantly, it avoids second derivatives and preserves posi- tive definiteness of the approximate second differential. Notably, the approximation (3.6) is exact

2 2 2 if hj(x) is affine. Furthermore, the omitted term 2hj(x)+∇ hj(x) of ∇ hj(xk)+ vanishes as the algorithm approaches convergence. In the rare instances when hj(xk) = 0, the literature [77] sug- gests that there is little harm in approximating the second differential by the outer product (3.6). For the same reasons, we recommend the Gauss-Newton approximation for the equality constraints

2 gi(x) as well. Lastly, the Sherman-Morrison formula facilitates matrix inversion when ∇ f(xk) is explicitly invertible and the number of constraints is small.

In practice there is no guarantee that one step of Newton’s method will decrease the surrogate on the right-hand side of majorization (3.5). If f(x) is quadratic or can be majorized by a quadratic function, then another round of majorization avails. When hj(xk) ≥ 0, then the majorization

2 2 hj(x)+ ≤ hj(x) applies. If instead hj(xk) < 0, then we instead apply the alternative majorization 2 2 hj(x)+ ≤ [hj(x) − hj(xk)] . Both of these lead to the Hessian approximation on the right- hand side of (3.6). The gradient of the surrogate changes in an obvious way in each case. The additional round of majorization eliminates the need for step-halving. The price may be slower overall convergence.

3.2.1 Exact Penalty Method for Quadratic Programming

1 T T Minimization of a convex quadratic objective 2 x Ax + b x subject to linear equality constraints Cx = d and linear inequality constraints Ex ≤ f is one of the building blocks of modern optimization algorithms. The case A = 0 corresponds to linear programming. Both equality and inequality constraints can be handled as just suggested. Alternatively, the introduction of slack variables allows one to replace linear inequality constraints by a combination of linear equality

23 constraints and nonnegativity constraints xi ≥ 0. For the relevant components, the majorizations  2 x xk,i ≥ 0 2  i max{xi, 0} ≤  2 (xi − xk,i) xk,i < 0 simplifies the overall algorithm and yields a purely quadratic surrogate that is minimized by one step of Newton’s method. The next iterate xk+1 is guaranteed to send the approximate objective downhill.

3.3 Distance Majorization

On a Euclidean space, the distance to a closed set S is a Lipschitz continuous function dist(x, S) with Lipschitz constant 1. As discussed in Chapter 2, if S is also convex, then dist(x, S) is a convex function. Projection onto S is intimately tied to dist(x, S). Unless S is convex, the

projection operator ΠS (x) is multi-valued for at least one argument x. Fortunately, it is possible

to majorize dist(x, S) at xk by kx − ΠS (xk)k2. This simple observation is the key to the proximal distance algorithm to be discussed later. In the meantime, let us show how to derive two feasibility

algorithms by distance majorization [39]. Let S1,..., Sm be closed sets. The method of averaged

m projections attempts to find a point in their intersection S = ∩j=1Sj. To derive the algorithm, consider the convex combination m X 2 f(x) = αj dist(x, Sj) j=1 of squared distance functions. Obviously, f(x) vanishes on S precisely when all coefficients

αj > 0. The majorization m X g(x | x ) = α kx − Π (x )k2 k j Sj k 2 j=1

of f(x) is easy to minimize. The minimum point of g(x | xk), m X x = α Π (x ), k+1 j Sj k j=1

defines the averaged operator. The MM principle guarantees that xk+1 decreases the objective function. 24 Von Neumann’s method of alternating projections can also be derived from this perspective.

2 For two sets S1 and S2, consider the problem of minimizing f(x) = dist(x, S2) subject to the constraint x ∈ S1. The function

2 g(x | xk) = kx − ΠS2 (xk)k2

majorizes f(x). Indeed, the domination condition g(x | xk) ≥ f(x) holds because ΠS2 (xk) belongs to S2; the tangency condition g(xk | xk) = f(xk) holds because ΠS2 (xk) is the closest point in S2 to xk. The surrogate function g(x | xk) is minimized subject to the constraint by setting

xk+1 = ΠS1 ◦ΠS2 (xk). The MM principle again ensures that xk+1 decreases the objective function. When the two sets intersect, the least distance of 0 is achieved at any point in the intersection. One

2 2 can extend this derivation to three sets by minimizing f(x) = dist(x, S2) + dist(x, S3) subject to x ∈ S1. The surrogate

2 2 g(x | xk) = kx − ΠS2 (xk)k2 + kx − ΠS3 (xk)k2 2 1 = 2 x − [ΠS2 (xk) + ΠS3 (xk)] + ck 2 2 relies on an irrelevant constant ck. The closest point in S1 is 1  x = Π [Π (x ) + Π (x )] . k+1 S1 2 S2 k S3 k

This construction clearly generalizes to more than three sets.

3.4 The Proximal Distance Method

We now turn to an exact penalty method that applies to nonsmooth functions. Clarke’s exact penalty method [40] turns the constrained problem of minimizing a function f(y) over a closed set S into the unconstrained problem of minimizing the penalized function f(y) + ρ dist(y, S) for sufficiently large ρ. Here is a precise statement of a generalization of Clarke’s result [26, 40, 49].

Proposition 3. Suppose that f(y) achieves a local minimum on S at the point x. Let φS (y)

n denote a function that vanishes on S and that satisfies φS (y) ≥ c dist(y, S) for all x ∈ R and some positive constant c. If f(y) is locally Lipschitz continuous around x with constant L, then 25 −1 for every ρ ≥ c L, the function Fρ(y) = f(y)+ρφS (y) achieves a local unconstrained minimum at x.

Classically the choice φS (x) = dist(x, S) was preferred. For affine equality constraints gi(x) = 0 and affine inequality constraints hj(x) ≤ 0, Hoffman’s bound [74]

G(y) dist(y, S) ≤ τ

H(y)+ 2 applies, where τ is some positive constant, S is the feasible set where G(y) = 0, and H(y)+ ≤ 0.

The vector H(y)+ has components hj(x)+ = max{hj(y), 0}. When S is the intersection of several closed sets S1,..., Sm, then the alternative v u m uX 2 φS (y) = t dist(y, Si) (3.7) i=1 is attractive. The next proposition gives sufficient conditions under which the crucial bound

φS (y) ≥ c dist(y, S) is valid for the function (3.7).

n Proposition 4. Suppose that S1,..., Sm are closed convex sets in R where the first j sets are polyhedral. Assume further that the intersection

j  m  S = ∩i=1Si ∩ ∩i=j+1 relint Si is nonempty and bounded. Then there exists a constant τ > 0 such that v m √ u m X uX 2 dist(x, S) ≤ τ dist(x, Si) ≤ τ mt dist(x, Si) i=1 i=1 for all x. The sets S1,..., Sm are said to be linearly regular.

Proof. See the references [10, 51] for all details. A polyhedral set is the nonempty intersection of a finite number of half-spaces. The operator relint K forms the relative interior of the convex set K, namely, the interior of K relative to the affine hull of K. When K is nonempty, its relative interior is nonempty and generates the same affine hull as K itself.

26 In general, we will require that f(x) and φS (x) be continuous functions and that the sum

Fρ(y) = f(y) + ρφS (y) be coercive for some value ρ = ρ0. It then follows that Fρ(y) is coercive

and attains its minimum for all ρ ≥ ρ0. One can prove a partial converse to Clarke’s theorem

[49, 50]. This requires the enlarged set S = {x : φS (x) < } of points lying close to S as

measured by φS (x).

Proposition 5. Suppose that f(y) is Lipschitz continuous on S for some  > 0. Then under the

stated assumptions on f(x) and φS (x), a global minimizer of Fρ(y) is a constrained minimizer of f(y) for all sufficiently large ρ.

When the constraint set S is compact and f(y) has a continuously varying local Lipschitz constant, then the hypotheses of Proposition 5 are fulfilled. This is the case, for instance, when f(y) is continuously differentiable. With this background on the exact penalty method in mind, we now sketch an approximate MM algorithm for convex programming that is motivated by distance majorization. This algorithm is designed to exploit set projections and proximal maps. A huge literature and software base exist for computing projections and proximal maps [11].

Recall from Chapter 2 that the proximal map proxh(y) associated with a convex function h(x) satisfies   1 2 proxh(y) = argmin h(x) + ky − xk2 . x 2

Since the function dist(x, S) is merely continuous, we advocate approximating it by the differen- tiable function

p 2 dist(x, S) = dist(x, S) +  (3.8)

for 0 <   1. The composite function dist(x, S) is convex when S is convex because the func- √ tion t2 +  is increasing and convex on [0, ∞). Instead of minimizing f(x) + ρ dist(x, S), we suggest using an MM algorithm to minimize the differentiable convex function f(x)+ρ dist(x, S). Regardless of whether or not S is convex, the majorization

q 2 dist(x, S) ≤ kx − ΠS (xk)k2 +  (3.9)

27 is available. If S is nonconvex, then there may exist multiple points that minimize the distance from xk to S, and one must choose a representative of the set ΠS (xk). In any event one can invoke the univariate majorization (3.4) and majorize the majorization (3.9) by q 1 kx − Π (x )k2 +  ≤ kx − Π (x )k2 + c S k 2 p 2 S k 2 k 2 kxk − ΠS (xk)k2 + 

for some irrelevant constant ck. The second step of our proposed MM algorithm consists of mini- mizing the surrogate function w g(x | x ) = f(x) + k kx − Π (x )k2 (3.10) k 2 S k 2 ρ w = . k p 2 kxk − ΠS (xk)k2 + 

The corresponding proximal map drives f(x) + ρ dist(x, S) downhill. Under the more general exact penalty (3.7), the surrogate function depends on a sum of spherical quadratic functions rather than a single spherical quadratic function.

It is possible to project onto a variety of closed nonconvex sets. For example, if S is the set

1 of integers, then projection amounts to rounding. An ambiguous point k + 2 can be projected to either k or k + 1. Projection onto a finite set simply tests each point separately. Projection onto a Cartesian product is achieved via the Cartesian product of the projections. One can also project onto many continuous sets of interest. For example, to project onto the closed set of points having at most m nonzero coordinates, one sends to zero all but the m largest coordinates in magnitude. Projection onto the sphere of center z and radius r maps y 6= z to the point z + r (y − z). ky−zk2 All points of the sphere are equidistant from its center.

By definition the update xk+1 = prox −1 [Π (xk)] minimizes g(x | xk). We will refer to wk f S this MM algorithm as the proximal distance algorithm. It enjoys several virtues. First, it allows one to exploit the extensive body of results on proximal maps and projections. Second, it does not demand that the constraint set S be convex. Third, it does not require the objective function f(x) to be convex or smooth. Finally, the optima and optimizers of the functions f(x) + ρ dist(x, S) and f(x) + ρ dist(x, S) are close when  > 0 is small.

In implementing the proximal distance algorithm, the constants L and  must specified. For many norms the Lipschitz constant L is known. For a differentiable function f(x), the mean value 28 inequality suggests taking L equal to the maximal value of k∇f(x)k2 in a neighborhood of the optimal point. In specific problems a priori bounds can be derived. If no such prior bound is known, then one must guess an appropriate ρ and see if it leads to a constrained minimum. If not, then ρ should be systematically increased until a constrained minimum is reached. Even with a justifiable bound, it is prudent to start ρ well below its intended upper bound to emphasize minimization of the loss function in early iterations. Experience shows that gradually decreasing  is also a good tactic; otherwise, one again runs the risk of putting too much early stress on satisfying the constraints. In

k −k practice the sequences ρk = min{α ρ0, ρmax} and k = max{β 0, min} work well for α and β slightly larger than 1, say 1.2, and ρ0 = 0 = 1. On many problems more aggressive choices of α and β are possible. Suitable values of ρmax and min are specific to a problem. In general, taking

ρmax substantially greater than a known Lipschitz constant slows convergence, while taking min too large leads to a poor approximate solution.

3.5 Examples

We now explore some typical applications of the proximal distance algorithm. In all cases we are able to establish local Lipschitz constants. The proximal distance algorithms are coded in MAT- LAB 2013a. All numerical experiments were run on an Early 2013 15” MacBook Pro Retina running OSX 10.9.5 (Darwin 13.4.0) with a quadcore 2.7 GHz Intel Core i7 processor and 16Gb of 1600 MHz DDR3 memory. Comparisons with standard optimization software serve as perfor- mance benchmarks.

3.5.1 Projection onto an Intersection of Closed Convex Sets

Let S1,..., Sm be closed convex sets with simple projections. Dykstra’s algorithm [51, 55] is

m designed to find the projection of an external point y onto S = ∩j=1Sj. The proximal distance algorithm provides an alternative based on the convex function

q 2 f(x) = kx − yk2 + δ

29 for δ > 0, say δ = 1. The choice f(x) is preferable to the obvious choice of the squared norm

2 kx−yk2 because f(x) is Lipschitz continuous with Lipschitz constant 1. In the proximal distance algorithm, we take v u m uX 2 φS (x) = t dist(x, Sj) j=1

and minimize the surrogate function

m wk X g(x | x ) = f(x) + kx − p k2 k 2 k,j 2 j=1 mw = f(x) + k kx − p¯ k2 + c , 2 m 2 k p Π (x ) x S p¯ p c where kj is the projection Sj k of k onto j, m is the average of the projections k,j, k is an irrelevant constant, and ρ wk = . qPm 2 j=1 kxk − pkjk2 + 

After rearrangement, the stationarity condition for optimality reads mw x = (1 − α)y + αp¯ , α = k . m √ 1 2 + mwk kx−yk2+δ

In other words, xk+1 is a convex combination of y and p¯m.

To calculate the optimal coefficient α, we minimize the convex surrogate

h(α) = g[(1 − α)y + αp¯m | xk] kw = pα2 dist(y, p¯ )2 + δ + k (1 − α)2 dist(y, p¯ )2 + c . m 2 m k

Its derivative

2 0 α dist(y, p¯m) 2 h (α) = − mwk(1 − α) dist(y, p¯ ) p 2 2 m α dist(y, p¯m) + δ satisfies h0(0) < 0 and h0(1) > 0 and possesses a unique root on the open interval (0, 1). This root can be easily computed by bisection or Newton’s method.

Table 3.2 compares Dykstra’s algorithm and the proximal distance algorithm on a simple planar

2 example. Here S1 is the closed unit ball in R , and S2 is the closed halfspace with x1 ≥ 0. The 30 Dykstra Proximal Distance

Iteration k xk,1 xk,2 xk,1 xk,2

0 -1.00000 2.00000 -1.00000 2.00000 1 -0.44721 0.89443 -0.44024 1.60145 2 0.00000 0.89443 -0.25794 1.38652 3 -0.26640 0.96386 -0.16711 1.25271 4 0.00000 0.96386 -0.11345 1.16647 5 -0.14175 0.98990 -0.07891 1.11036 10 0.00000 0.99934 -0.01410 1.01576 15 -0.00454 0.99999 -0.00250 1.00257 20 0.00000 1.00000 -0.00044 1.00044 25 -0.00014 1.00000 -0.00008 1.00008 30 0.00000 1.00000 -0.00001 1.00001 35 0.00000 1.00000 0.00000 1.00000

Table 3.2: Dykstra’s algorithm versus the proximal distance algorithm. intersection S reduces to the right half ball centered at the origin. The table records the iterates

T of the two algorithms from the starting point x0 = (−1, 2) until their eventual convergence to

T k the geometrically obvious solution (0, 1) . In the proximal distance method we set ρk = 2 and

−k aggressively set k = 4 . The two algorithms exhibit similar performance but take rather different trajectories.

3.5.2 Network Optimization

The problem of minimizing the piecewise-linear function

X T f(x) = Aij|xi − xj| + b x i

31 n n subject to binary constraints x ∈ {0, 1} and nonnegative weights Aij from a matrix A ∈ S is a typical discrete optimization problem with applications in graph cuts and network optimization. If we invoke the majorization

xk,i + xk,j xk,i + xk,j |xi − xj| ≤ xi − + xj − 2 2

prior to applying the proximal operator, then the proximal distance algorithm separates the param-

eters xi and xj. Parameter separation promotes parallelization and benefits from a fast algorithm for computing proximal maps in one dimension. The one-dimensional algorithm is similar to but faster than bisection [117]. Finally, the objective function is Lipschitz continuous with the explicit constant s X X 2 L = Aij + kbk2. (3.11) i j6=i

This assertion follows from the simple bound

X X T |f(x) − f(y)| ≤ Aij|xj − yj| + |b (x − y)| i j6=i s X X 2 ≤ Aij · kx − yk2 + kbk2 · kx − yk2 i j6=i

under the symmetry convention Aij = Aji.

Table 3.3 displays the numerical results for a few typical examples. For each dimension n we filled b with standard normal deviates and the upper triangle of the weight matrix A with the abso- lute values of such deviates. The lower triangle of A was determined by symmetry. Small values of b often lead to degenerate solutions x with all entries 0 or 1. To preclude this possibility, we multiplied each entry of b by n. In the graph cut context, a degenerate solution corresponds to no

k cuts at all or a completely cut graph. These examples depend on the schedules ρk = min{1.2 ,L}

−k −15 and k = max{1.2 , 10 } for the two tuning constants and the local Lipschitz constant (3.11).

Although the proximal distance algorithm makes good progress towards the minimum in the first 100 iterations or so, it sometimes hovers around its limit without fully converging. This translates into fickle compute times, and for this reason we capped the number of iterations at 200. For small dimensions the proximal distance algorithm can be much slower than CVX. Fortunately, 32 CPU times

n PD CVX Iterations

2 0.038 0.080 9 4 0.052 0.060 18 8 2.007 0.050 200 16 2.416 0.100 200 32 2.251 0.130 200 64 4.134 0.400 200 128 0.212 2.980 32 256 0.868 62.63 200 512 68.27 1534 200 1024 526.6 * 200 2048 127.2 * 200 4096 547.4 * 200

Table 3.3: CPU times in seconds and iterations until convergence for the network optimization problem. Asterisks denote computer runs exceeding computer memory limits. Iterations were capped at 200. the performance of the proximal distance algorithm improves markedly as n increases. In all runs the two algorithms reach the same solution after rounding components to the nearest integer. The proximal distance algorithm also requires much less storage than CVX. Asterisks appear in the table where CVX demanded more memory than our laptop computer could deliver.

3.5.3 Nonnegative Quadratic Programming

The proximal distance algorithm is applicable in minimizing a convex quadratic objective function

1 T T f(x) = 2 x Ax + b x subject to the constraint x  0. In this nonnegative quadratic program, let

33 n yk be the projection of the current iterate xk onto S = R+. If we define the weight ρ wk = , p 2 kxk − ykk2 +  then the next iterate can be expressed as

−1 xk+1 = (A + wkI) (wkyk − b). (3.12)

The multiple matrix inversions implied by (3.12) can be avoided by precomputing and caching the

T −1 spectral decomposition U DU of A. We then reformulate the inverse (A + wkI) as the matrix

T −1 product U (D + wkI) U. The diagonal matrix D + wkI is obviously trivial to invert. The remaining operations in computing xk+1 reduce to matrix-vector multiplications.

CPU times Optima

n MM CV MA YA MM CV MA YA

8 0.97 0.23 0.01 0.13 -0.0172 -0.0172 -0.0172 -0.0172 16 0.50 0.24 0.01 0.11 -1.1295 -1.1295 -1.1295 -1.1295 32 0.50 0.24 0.01 0.14 -1.3811 -1.3811 -1.3811 -1.3811 64 0.57 0.28 0.01 0.13 -0.5641 -0.5641 -0.5641 -0.5641 128 0.79 0.36 0.02 0.14 -0.7018 -0.7018 -0.7018 -0.7018 256 1.66 0.65 0.06 0.22 -0.6890 -0.6890 -0.6890 -0.6890 512 5.61 2.95 0.26 0.73 -0.5971 -0.5968 -0.5970 -0.5970 1024 32.69 21.90 1.32 2.91 -0.4944 -0.4940 -0.4944 -0.4944 2048 156.7 178.8 8.96 15.89 -0.4514 -0.4505 -0.4512 -0.4512 4096 695.1 1551 57.73 91.54 -0.4690 -0.4678 -0.4686 -0.4686

Table 3.4: CPU times in seconds and optima for the nonnegative quadratic program. Abbrevia- tions: n for the problem dimension, MM for the proximal distance algorithm, CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP.

One can estimate an approximate Lipschitz constant for this problem. Note that f(0) = 0 and

34 that

1 f(x) ≥ λ (A)kxk2 − kbk · kxk , 2 min 2 2 2

where λmin(A) is the smallest eigenvalue of A. It follows that x cannot minimize f(x) subject to −1 the nonnegativity constraint whenever kxk2 > 2kbk2 [λmin(A)] On the other hand, the gradient of f(x) satisfies

k∇f(x)k2 ≤ kAk2kxk2 + kbk2 ≤ λmax(A)kxk2 + kbk2,

where λmax(A) is the largest eigenvalue of A. In view of the mean-value inequality, these bounds suggest that   2λmax(A) L = + 1 kbk2 = [2 cond(A) + 1] kbk2 λmin(A) provides an approximate Lipschitz constant for f(x) on the region harboring the minimum point. This bound on ρ is usually too large. One remedy is to multiply the bound by a deflation factor such as 0.1. Another remedy is to replace the covariance matrix A by the corresponding correlation matrix. Thus, one solves the problem for the preconditioned matrix D−1AD−1, where D is the diagonal matrix whose entries are the square roots of the corresponding diagonal entries of A. The transformed parameters y = Dx obey the same nonnegativity constraints as x.

For testing purposes we filled a n × n matrix M with independent standard normal deviates and set A = M T M + I. Addition of the identity matrix avoids ill conditioning. We also filled the vector b with independent standard normal deviates. Our gentle tuning constant schedule

−k −15 k k = max{1.005 , 10 } and ρk = min{1.005 , 0.1 × L} adjusts ρ and  so slowly that their limits are not actually met in practice. In any event L is the a priori bound for the correlation matrix derived from A. Table 3.4 compares the performance of the MM proximal distance algorithm to MATLAB’s quadprog, CVX with the SDPT3 solver, and YALMIP with the MOSEK solver. MATLAB’s quadprog is clearly the fastest of the four tested methods on these problems. The relative speed of the proximal distance algorithm improves as the problem dimension n increases. We will revisit this example in Section 4.3.2.

35 3.5.4 Linear Regression under an `0 Constraint

1 2 In this example the objective function is the sum of squares 2 ky − Xβk2, where y is the response vector, X is the design matrix, and β is the vector of regression coefficients. The constraint set

n Sm consists of those β with at most m nonzero entries. Projection onto the closed but nonconvex n set Sm is achieved by supplanting with 0 all but the m largest coordinates in magnitude. These coordinates will be unique except in the rare circumstance that we encounter a pair (βi, βj) with mth largest absolute value |βi| = |βj|. The proximal distance algorithm for this problem coincides with that of the previous problem if we substitute XT X for A, −XT y for b, β for x, and the projection operator Π n for Π n . Better accuracy can be maintained if the MM update exploits Sm R+ the singular value decomposition of X in forming the spectral decomposition of XT X. Although the proximal distance algorithm carries no absolute guarantee of finding the optimal set of m n  regression coefficients, it is far more efficient than sifting through all m sets of size m. The alternative of LASSO-guided model selection must contend with strong shrinkage and a surplus of false positives.

Table 3.5 compares the MM proximal distance algorithm to MATLAB’s lasso function. In simulating data, we filled X with standard normal deviates, set all components of β to 0 except for βi = 1/i for 1 ≤ i ≤ 10, and added a vector of standard normal deviates to Xβ to determine y. For a given choice of n and p we ran each experiment 100 times and averaged the results. The table demonstrates the superior speed of the LASSO and the superior accuracy of the proximal distance algorithm as measured by optimal loss and model selection. This example motivates the sparse regression framework described in Chapter 5.

3.5.5 Matrix Completion

Let Y = (yij) denote a partially observed p × q matrix and let ∆ denote the set of index pairs (i, j)

indexing the observed yij. Matrix completion [33] imputes the missing entries by approximating

36 n p d p1 p2 λ L1 L1/L2 T1 T1/T2

256 128 10 5.97 3.32 0.143 248.763 0.868 0.603 8.098 128 256 10 3.83 1.91 0.214 106.234 0.744 0.999 10.254 512 256 10 6.51 2.88 0.119 506.570 0.900 0.907 6.262 256 512 10 4.50 1.82 0.172 241.678 0.835 1.743 8.687 1024 512 10 7.80 5.25 0.101 1029.333 0.921 2.597 5.057 512 1024 10 5.54 2.58 0.138 507.451 0.881 8.235 13.532 2048 1024 10 8.98 8.49 0.080 2047.098 0.945 15.460 8.858 1024 2048 10 6.80 2.93 0.110 1044.640 0.916 34.997 18.433 4096 2048 10 9.75 9.90 0.060 4086.886 0.966 89.684 10.956 2048 4096 10 8.36 6.60 0.086 2045.645 0.942 166.386 25.821

Table 3.5: Numerical experiments comparing MM to MATLAB’s lasso. Each row presents av- erages over 100 independent simulations. Abbreviations: n the number of cases, p the number of

predictors, d the number of actual predictors in the generating model, p1 the number of true pre- dictors selected by MM, p2 the number of true predictors selected by lasso, λ the regularization

parameter at the LASSO optimal loss, L1 the optimal loss from MM, L1/L2 the ratio of L1 to the

optimal LASSO loss, T1 the total computation time in seconds for MM, and T1/T2 the ratio of T1 to the total computation time of lasso.

Y with a low rank matrix X. Imputation relies on the singular value decomposition

X = UΣV T r X T = σiuivi , (3.13) i=1

where r is the rank of X, the nonnegative singular values σi are presented in decreasing order,

the left singular vectors ui are orthonormal, and the right singular vectors vi are also orthonormal

[65]. The set Sm of p × q matrices of rank m or less is closed. Projection onto Sm is accomplished

37 by truncating the sum (3.13) to

min{r,m} X T ΠSm (X) = σiuivi . i=1

When r > m and σm+1 = σm, the projection operator is multi-valued.

The MM principle allows one to restore the symmetry lost in the missing entries [104]. Suppose that Xk is the current approximation to X. One simply replaces a missing entry yij of Y for

1 2 (i, j) 6∈ ∆ by the corresponding entry xk,ij of Xk and adds the term 2 (xk,ij − xij) to the least squares criterion

1 X f(X) = (y − x )2. 2 ij ij (i,j)∈∆

Since the added terms majorize 0, they create a legitimate surrogate function. Let us rephrase the

⊥ surrogate in terms of the orthogonal complement operator Π∆(Y ) via the equation

⊥ Y = Π∆(Y ) + Π∆(Y ).

⊥ The matrix Zk = Π∆(Y ) + Π∆(Xk) temporarily completes Y and yields the surrogate function 1 2 2 kZk − XkF . In implementing a slightly modified version of the proximal distance algorithm, one must solve for the minimum of the Moreau envelope

1 2 wk 2 kZk − Xk + X − Π (Xk) . 2 F 2 Sm F

where wk = ρ/ dist(Xk, Sm). The stationarity condition

  0 = X − Zk + wk X − ΠSm (Xk)

yields the trivial solution

1 wk Xk+1 = Zk + ΠSm (Xk). (3.14) 1 + wk 1 + wk

The update (3.14) is guaranteed to decrease the objective function

1 X ρ F (X) = (y − x )2 + dist (X, S ). ρ 2 ij ij 2  m (i,j)∈∆

38 p q α r L1 L1/L2 T1 T1/T2

200 250 0.05 20 1598 0.251 4.66 7 800 1000 0.20 80 571949 0.253 131.02 18.1 1000 1250 0.25 100 1112604 0.24 222.2 15.1 1200 1500 0.15 40 793126 0.361 161.51 3.6 1200 1500 0.30 120 1569105 0.235 367.78 12.3 1400 1750 0.35 140 1642661 0.236 561.76 9 1800 2250 0.45 180 2955533 0.171 1176.22 10.1 2000 2500 0.10 20 822673 0.50 307.89 1.9 2000 2500 0.50 200 1087404 0.192 2342.32 2 5000 5000 0.05 30 7647707 0.664 1827.16 2

Table 3.6: Comparison of the MM proximal distance algorithm to SoftImpute. Abbreviations: p is the number of rows, q is the number of columns, α is the ratio of observed entries to to- tal entries, r is the rank of the matrix, L1 is the optimal loss under MM, L2 is the optimal loss under SoftImpute, T1 is the total computation time (in seconds) for MM, and T2 is the total computation time for SoftImpute.

In the spirit of Section 3.5.3, we can derive a local Lipschitz constant based on the value 1 P 2 f(0) = 2 (i,j)∈∆ yij. The inequality

1 X 1 X 1 X y2 < (y − x )2 = (y2 − 2y x + x2 ) 2 ij 2 ij ij 2 ij ij ij ij (i,j)∈∆ (i,j)∈∆ (i,j)∈∆

is equivalent to the inequality

X X 2 2 yijxij < xij. (i,j)∈∆ (i,j)∈∆

In view of the Cauchy-Schwarz inequality s s X X 2 X 2 yijxij ≤ yij xij , (i,j)∈∆ (i,j)∈∆ (i,j)∈∆

39 no solution X of the constrained problem can satisfy s s X 2 X 2 xij > 2 yij . (i,j)∈∆ (i,j)∈∆

When the opposite inequality holds, s X 2 k∇f(X)kF = (xij − yij) (i,j)∈∆ s s X 2 X 2 ≤ xij + yij (i,j)∈∆ (i,j)∈∆ s X 2 ≤ 3 yij. (i,j)∈∆

Again this tends to be a conservative estimate of the required local bound on ρ.

Table 3.6 compares the performance of the MM proximal distance algorithm and a MATLAB implementation of SoftImpute [104]. Although the proximal distance algorithm is noticeably slower, it substantially lowers the optimal loss and improves in relative speed as problem dimen- sions grow.

3.5.6 Sparse Precision Matrix Estimation

The graphical LASSO has applications in estimating sparse precision matrices [61]. In this context, one minimizes the convex criterion

− ln det Θ + tr(SΘ) + ρkΘk1,

where Θ−1 is a p×p theoretical covariance matrix, S is a corresponding sample covariance matrix, and the graphical LASSO penalty kΘk1 equals the sum of the absolute values of the off-diagonal entries of Θ. The solution exhibits both sparsity and shrinkage. One can avoid shrinkage by minimizing

f(Θ) = − ln det Θ + tr(SΘ)

p subject to Θ having at most 2m nonzero off-diagonal entries. Let Tm be the closed set of p × p p p symmetric matrices possessing this property. Projection of a matrix M ∈ S onto Tm can be 40 achieved by arranging the components of the upper triangle of M in decreasing absolute value and replacing all but the first m of these entries by 0. The lower triangular components are treated similarly.

The proximal distance algorithm for minimizing f(Θ) subject to the set constraints operates through the convex surrogate

wk 2 g(Θ | Θk) = f(Θ) + Θ − Π p (Θk) 2 Tm F ρ wk = . r 2

Θk − ΠT p (Θk) +  m F

A stationary point minimizes the surrogate and satisfies

−1 0 = −Θ + w Θ + S − w Π p (Θ ). (3.15) k k Tm k

The matrix S − wkΠ p (Θk) is constant with respect to Θ. If we denote its spectral decomposition Tm T T by U kDkU k , then multiplying equation (3.15) on the left by U k and on the right by U k gives

T −1 T 0 = −U k Θ U k + wkU k ΘU k + Dk.

T This suggests that we take Ek = U k ΘU k to be diagonal and require its diagonal entries ek,ii to satisfy

1 0 = − + wkek,ii + dk,i. ek,ii

Multiplying this identity by ek,ii and solving for the positive root with the quadratic formula yields

q 2 −dk,i + dk,i + 4wk ek+1,ii = . 2wk

T Given the solution matrix Ek+1, we reconstruct Θk+1 as U kEk+1U k .

Finding a local Lipschitz constant is more challenging in this example. Because the identity matrix is feasible, the minimum cannot exceed

p X − ln det I + tr(SI) = tr(S) = ωi, i=1

41 p where we assume that S ∈ S++ with eigenvalues ωi ordered from largest to smallest. If the

candidate matrix Θ is positive definite with ordered eigenvalues λi, then the von Neumann-Fan inequality [26] implies that

p p X X f(Θ) ≥ − ln λi + λiωp−i+1. (3.16) i=1 i=1

To show that f(Θ) > f(I) whenever any λi falls outside a designated interval, note that the contri- bution − ln λj + λjωp−j+1 to the right side of inequality (3.16) is bounded below by ln ωp−j+1 + 1 −1 when λj = ωp−j+1. Hence, f(Θ) > f(I) whenever

p X X − ln λi + λiωp−i+1 > ωi − (ln ωp−j+1 + 1). (3.17) i=1 j6=i

Given the strict convexity of the function − ln λi + λiωp−i+1, equality holds in inequality (3.17) at

exactly two points λi,min > 0 and λi,max > λi,min. These roots can be readily extracted by bisection

or Newton’s method. The strict inequality f(Θ) > f(I) holds when any λi falls to the left of λi,min

or to the right of λi,max. Within the intersection of the intervals [λi,max, λi,min], the gradient of f(Θ) satisfies

−1 k∇f(Θ)kF ≤ kΘ kF + kSkF v u p uX −2 ≤ t λi + kSkF i=1 v u p uX −2 ≤ t λi min + kSkF . i=1

This bound serves as a local Lipschitz constant near the optimal point.

Table 3.7 compares the performance of the proximal distance algorithm to that of the R glasso package [61]. The sample precision matrix S−1 = LLT + δMM T was generated by filling the diagonal and first three subdiagonals of the banded lower triangular matrix L with standard nor- mal deviates. Filling M with standard normal deviates and choosing δ = 0.01 imposed a small amount of noise obscuring the band nature of LLT . All table statistics represent averages over 10 runs started at Θ = S−1 with m equal to the true number of nonzero entries in LLT . The proximal distance algorithm performs better in minimizing average loss and recovering nonzero entries. 42 p kt k1 k2 ρ L1 L2 − L1 T1 T1/T2

8 18 14.0 14.0 0.00186 −12.35 0.01 0.022 43.458 16 42 30.5 28.7 0.00305 −25.17 0.08 0.026 43.732 32 90 53.5 49.9 0.00330 −50.75 0.17 0.054 31.639 64 186 97.8 89.3 0.00445 −98.72 0.53 0.234 28.542 128 378 191.6 169.9 0.00507 −196.09 1.14 1.060 18.693 256 762 345.0 304.2 0.00662 −369.62 2.55 4.253 9.559 512 1530 636.4 566.8 0.00983 −641.89 6.72 19.324 5.679

Table 3.7: Numerical results for precision matrix estimation. Abbreviations: p for matrix dimen- sion, kt for the number of nonzero entries in the true model, k1 for the number of true nonzero entries recovered by the proximal distance algorithm, k2 for the number of true nonzero entries recovered by glasso, ρ the average tuning constant for glasso for a given kt, L1 the average loss from the proximal distance algorithm, L1 − L2 the difference between L1 and the average loss from glasso, T1 the average compute time in seconds for the proximal distance algorithm, and

T1/T2 the ratio of T1 to the average compute time for glasso.

3.6 Discussion

The MM principle offers a unique and potent perspective on high-dimensional optimization. The current survey emphasizes proximal distance algorithms and their applications in nonlinear pro- gramming. Our construction of this new class of algorithms relies on the exact penalty method of Clarke [40] and majorization of a smooth approximation to the Euclidean distance to the constraint set. Well-studied proximal maps and Euclidean projections constitute the key ingredients of our seven realistic examples. These examples illustrate the versatility of the method in handling both convex and nonconvex constraints, the scalability of the method as problem dimension increases, and the pitfalls in sending the tuning constants ρ and  too quickly to their limits. Certainly, the proximal distance algorithm is not a panacea for optimization problems. For example, the proximal distance algorithm as formulated here exhibits remarkably fickle behavior on linear programming

43 problems. For linear programming, we ensure numerical stability and guard against premature convergence only by great care in parameter tuning and updating. The nonnegative quadratic pro- gramming example in Section 3.5.3 fails to converge both accurately and quickly; either accuracy is preserved at the cost of many thousands of iterations, or accuracy is sacrificed for the sake of speed. We will see in Chapter 4 that a slight reformulation of the proximal distance algorithm remedies these problems.

44 CHAPTER 4

Accelerating the Proximal Distance Algorithm

The previous chapter introduced the proximal distance algorithm for constrained optimization. The solution of constrained optimization problems is part science and part art. Details of imple- mentation can greatly affect performance. Here we modify the proximal distance algorithm to use squared distance penalties and Nesterov acceleration to yield better solutions than our original exact penalties. In the presence of convexity, it is clear that every proximal distance algorithm reduces to a proximal gradient algorithm. Hence, convergence analysis can appeal to a venera- ble body of convex theory. But as hinted in Chapter 3, the proximal distance algorithm is not limited to convex problems. In fact, its most important applications may well be to nonconvex problems. The focus of this chapter is on practical exploration of the proximal distance algorithm in high-dimensional optimization. We do not attempt to extend classical convergence arguments to nonconvex problems and leave that challenge for future work.

4.1 Derivation

Recall from Chapter 3 that the generic problem of minimizing a function f(x) over a closed set C can be attacked by distance majorization. The penalty method seeks the minimum point of a penalized version f(x) + ρq(x) of f(x) and then follows the solution vector xρ as ρ tends to ∞. In the limit one recovers the constrained solution. If the constraint set C equals an intersection m 1 Pm 2 ∩i=1Ci of closed sets, then it is natural to define the penalty q(x) = 2m i=1 dist(x, Ci) . Distance

45 majorization gives the surrogate function

m ρ X g (x | x ) = f(x) + kx − Π (x )k2 ρ k 2m Ci k 2 i=1 m 2 ρ 1 X = f(x) + x − Π (x ) + c 2 m Ci k k i=1 2 c y = 1 Pm Π (x ) for an irrelevant constant k. If we set k m i=1 Ci k , then by definition the minimum of

the surrogate gρ(x | xk) occurs at the proximal point

xk+1 = proxρ−1f (yk).

We call this MM algorithm the proximal distance algorithm. The penalty q(x) is generally smooth because 1 ∇ dist(x, C)2 = x − Π (x) 2 C

at any point x where the projection ΠC(x) is single valued [26, 94]. In contrast to the penalty (3.8), which approximated the distance function with an extra parameter , q(x) uses a squared distance function with no extra parameters.

For the special case of projection of an external point z onto the intersection C of the closed

1 2 sets Ci, one should take f(x) = 2 kz − xk2. The proximal distance iterates then obey the explicit formula

1 x = (z + ρy ). k+1 1 + ρ k

Linear programming with arbitrary convex constraints is another simple case. Here f(x) = vT x, and the update reduces to

1 x = y − v. k+1 k ρ

If the proximal map is impossible to calculate, but ∇f(x) is known to be Lipschitz continuous with constant L, then one can substitute the standard majorization

L f(x) ≤ f(x ) + ∇f(x )T (x − x ) + kx − x k2 k k k 2 k 2

46 for f(x). Minimizing the sum of the loss majorization plus the penalty majorization leads to the MM update 1 x = [−∇f(x ) + Lx + ρy ] k+1 L + ρ k k k 1 = x − [∇f(x ) + ρ∇q(x )]. (4.1) k L + ρ k k This is a gradient descent algorithm without an intervening proximal map.

The proximal distance algorithm can also be applied to unconstrained problems. For example, consider the problem of minimizing a penalized loss `(x) + p(Ax). The presence of the linear transformation Ax in the penalty complicates optimization. The strategy of parameter splitting introduces a new variable y and minimizes `(x) + p(y) subject to the constraint y = Ax. If

ΠM(z) denotes projection onto the manifold M = {z = (x, y): Ax = y}, then the constrained problem can be solved approximately by minimizing the function ρ `(x) + p(y) + dist(z, M)2 2 for large ρ. If ΠM(zk) consists of two subvectors uk and vk corresponding to xk and yk, then the proximal distance updates are

xk+1 = proxρ−1`(uk)

yk+1 = proxρ−1p(vk).

When the matrix A has dimensions r × s, one can attack the projection problem by differenti- ating the Lagrangian 1 1 L(x, y, λ) = kx − uk2 + ky − vk2 + λT (Ax − y). 2 2 2 2 To solve the stationarity equations

0 = x − u + AT λ (4.2)

0 = y − v − λ,

we multiply the first by A, subtract it from the second, and substitute Ax = y. This generates the identity

T 0 = Au − v − (AA + Ir)λ

47 with solution

T −1 λ = (AA + Ir) (Au − v). (4.3)

The values x = u − AT λ and y = v + λ are then immediately available. This approach is preferred when r < s. In the opposite case r > s, it makes more sense to directly minimize the function 1 1 f(x) = kx − uk2 + kAx − vk2. 2 2 2 2 This leads to the solution

T −1 T x = (A A + Is) (A v + u).

T The advantage here is that the matrix A A + Is is now s × s rather than r × r.

4.2 Convergence and Acceleration

In the presence of convexity, the proximal distance algorithm reduces to a proximal gradient algo- rithm. This follows from the representation m 1 X y = Π (x) m Ci i=1 m 1 X = x − x − Π (x) m Ci i=1 = x − ∇q(x) involving the penalty q(x). Thus, the proximal distance algorithm can be expressed as

xk+1 = proxρ−1f [xk − ∇q(xk)].

In this regard, there is the implicit assumption that ∇q(x) is Lipschitz continuous with constant 1. This is indeed the case. According to the Moreau decomposition [11], for a single closed convex set C we have

∇q(x) = x − ΠC(x)

= prox ? (x), δC 48 ? where δC(x) is the Fenchel conjugate of the indicator function  0 x ∈ C δC (x) = ∞ x 6∈ C.

Because proximal operators of closed convex functions are nonexpansive [11], the result follows for a single set. For the general penalty q(x) with m sets, the Lipschitz constants are scaled by m−1 and summed to produce an overall Lipschitz constant of 1.

Proximal gradient algorithms can exhibit painfully slow convergence. This fact suggests that one should slowly send ρ to ∞ and refuse to wait until convergence occurs for any given ρ. It also suggests that Nesterov acceleration may rectify the undesirable convergence behavior. Nesterov acceleration for the general proximal gradient algorithm with loss `(x) and penalty p(x) takes the form

k − 1 z = x + (x − x ) k k k + d − 1 k k−1 −1 xk+1 = proxL−1`[zk − L ∇p(zk)], (4.4)

where L is the Lipschitz constant for ∇p(x) and d is typically chosen to be 3. Nesterov acceleration achieves an O(n−2) convergence rate [126], which is vastly superior to the O(n−1) rate achieved by ordinary gradient descent. The Nesterov update possesses the desirable property of preserving

affine constraints. In other words, if Axk−1 = b and Axk = b, then Azk = b as well. In cases not covered by convex analysis theory, we accelerate our proximal distance algorithms by applying the

algorithm map M(x) to the shifted point zk, yielding the accelerated update xk+1 = M(zk). The recent paper of Ghadimi and Lan [63] extends Nestorov acceleration to this more general setting.

Newton’s method offers another possibility for acceleration. This depends on the differentia- bility of the gradient

m ρ X ∇f(x) + [x − Π (x)]. m Ci i=1 Unfortunately, the second differential

m ρ X ∇2f(x) + [I − ∇Π (x)] (4.5) m Ci i=1 49 may not exist globally. For some sets C, the gradient ∇ΠC(x) is trivial to calculate. For instance,

when ΠC(x) = Mx + b is affine, then the identity ∇ΠC(x) = M holds true. In the case of projection onto the set {x : Ax = b}, the matrix M takes the form M = I − AT (AAT )−1A. If

C is a sparsity set, and ΠC(x) reduces to a single point, then ∇ΠC(x) is a diagonal matrix whose

ith diagonal entry is 1 when |xi| is sufficiently large and 0 otherwise. When C is a nonnegativity

constraint set, and no component of x equals 0, then ∇ΠC(x) is diagonal with ith diagonal com-

ponent 1 for xi ≥ 0 and 0 for xi < 0. To its detriment, Newton’s method requires a great deal of linear algebra per iteration in high-dimensional problems. There is also no guarantee that the second differential (4.5) is positive definite, even for convex problems.

Finally, it is worth proving that the proximal distance algorithm converges in the presence of convexity. Our convergence analysis relies on well-known results from operator theory [11]. Prox- imal operators in general and projection operators in particular are nonexpansive and averaged. By definition an averaged operator

M(x) = αx + (1 − α)N(x)

is a convex combination of a nonexpansive operator N(x) and the identity operator I. The av-

eraged operators on Rn with α ∈ (0, 1) form a convex set closed under functional composition. Furthermore, M(x) and the base operator N(x) share their fixed points. The celebrated theorem of Krasnosel’skii [88] and Mann [103] says that if an averaged operator M(x) = αx + (1 − α)N(x)

possesses one or more fixed points, then the iteration scheme xk+1 = M(xk) converges to a fixed point.

Consider minimization of the penalized loss

m ρ X f(x) + dist(x, C )2. 2m i i=1 By definition the proximal distance iterate is given by

xk+1 = proxρ−1f (yk),

y = 1 Pm Π (x ) where k m i=1 Ci k . The algorithm map is an averaged operator because it is the compo- sition of two averaged operators. Hence, the Krasnosel’skii-Mann theorem guarantees convergence 50 to a fixed point if one or more exist. Now y is a fixed point if and only if

m m ρ X ρ X f(y) + ky − Π (y)k2 ≤ f(x) + kx − Π (y)k2 2m S 2 2m S 2 i=1 i=1 for all x and a constraint set S. In the presence of convexity, this is equivalent to the directional derivative inequality

m ρ X 0 ≤ ∇ f(y) + [y − Π (y)]T v v m Ci i=1 " m # ρ X = ∇ f(y) + dist(y, C )2 v 2m i i=1 for all v, which is in turn equivalent to y minimizing the convex penalized loss. Minimum points exist, and therefore so do fixed points.

Convergence of the overall proximal distance algorithm is tied to the convergence of the clas- sical penalty method [14]. In this context the loss is f(x), and the penalty is

m 1 X q(x) = dist(x, C )2. 2m i i=1

Assuming that the objective f(x) + ρq(x) is coercive, then the theory mandates that the solution path xρ is bounded and any cluster point of the path attains the minimum value of f(x) subject to the constraints. Furthermore, if f(x) is coercive and possesses a unique minimum point in the

constraint set C, then the path xρ converges to that point. Algorithm 5 gives a schematic of the proximal distance algorithm.

51 Algorithm 5 The proximal distance algorithm with Nesterov acceleration. Require: a starting point x ∈ Rn, a tolerance  > 0 with   1, a maximum iteration count K, an initial ρ, a ρ update frequency R, and a ρ update factor a. while k ≤ K do Update iteration count k ← k + 1.

k−1 Compute accelerated Nesterov step yk := xk + k+2 (xk − xk−1).

Compute zk := ΠC(yk).

Update xk+1 := proxρ−1f (zk).

Exit if kxk+1 − xkk2 < (1 + kxkk2). If mod (k, R) = 0 then augment ρ := aρ. end while

4.3 Examples

The following examples highlight the versatility of proximal distance algorithms in a variety of convex and nonconvex settings. Programming details matter in solving these problems. Individual programs are not necessarily long, but care must be exercised in projecting onto constraints, choos- ing tuning schedules, folding constraints into the domain of the loss, implementing acceleration, and declaring convergence. Whenever possible, competing software was run with the Julia opti- mization module MathProgBase [54, 101] to wrap both open-source and commercial solvers. The sparse PCA problem relies on the R package PMA of Witten, Tibshirani, and Hastie [139].

4.3.1 Linear Programming

As suggested in Chapter 3, this deceptively simple problem is harder to solve with a proximal dis- tance algorithm than one might first suspect. Our approach is to roll the standard affine constraints Ax = b into the domain of the loss function vT x. The standard nonnegativity requirement x ≥ 0 is achieved by penalization. Let xk be the current iterate and yk = (xk)+ be its projection onto

52 n R+. Derivation of the proximal distance algorithm relies on the Lagrangian ρ vT x + kx − y k2 + λT (Ax − b). 2 k 2

One can multiply the stationarity equation

T 0 = v + ρ(x − yk) + A λ

by A and solve for the Lagrange multiplier λ in the form

T −1 λ = (AA ) (ρAyk − ρb − Av). (4.6)

Inserting this value into the stationarity equation gives the update

1  1  x = y − v − AT (AAT )−1 Ay − b − Av . (4.7) k+1 k ρ k ρ

Table 4.1 compares the accelerated proximal distance algorithm to the open-source Splitting Cone Solver (SCS) [113] and the interior point method implemented in the commercial Gurobi solver. The first seven rows of the table summarize linear programs with dense data A, b, and v. The bottom six rows rely on random sparse matrices A with sparsity level 0.01. For dense problems, the proximal distance algorithm starts the penalty constant ρ = 1 and doubles it every 100 iterations. Because we precompute and cache the pseudoinverse AT (AAT )−1 of A, the update (4.7) reduces to vector additions and matrix-vector multiplications.

For sparse problems the proximal distance algorithm updates ρ by a factor of 1.5 every 50 iterations. To avoid computing large pseudoinverses, we appeal to the LSQR variant of the conju- gate gradient method [115, 116] to solve the linear system (4.6). The optima of all three methods agree to 4 digits of accuracy. Each algorithm demonstrates merit. Gurobi clearly performs best on low-dimensional problems, but it scales poorly with dimension compared to SCS and the proxi- mal distance algorithm. In large sparse regimes the proximal distance algorithm and SCS perform equally well. If accuracy is not a primary concern, then the proximal distance algorithm is easily accelerated with a more aggressive update schedule for ρ.

53 Dimensions Optima CPU Seconds

m n PD SCS Gurobi PD SCS Gurobi

2 4 0.2629 0.2629 0.2629 0.0018 0.0004 0.0012 4 8 1.0455 1.0456 1.0455 0.0022 0.0012 0.0011 8 16 2.4513 2.4514 2.4513 0.0167 0.0024 0.0013 16 32 3.4226 3.4225 3.4223 0.0472 0.0121 0.0014 32 64 6.2398 6.2397 6.2398 0.0916 0.0165 0.0028 64 128 14.671 14.671 14.671 0.1554 0.0643 0.0079 128 256 27.116 27.116 27.116 0.3247 0.8689 0.0406 256 512 58.501 58.494 58.494 0.6064 2.9001 0.2773 512 1024 135.35 135.34 135.34 1.4651 5.0410 1.9607 1024 2048 254.50 254.47 254.47 4.7953 4.7158 0.9544 2048 4096 533.27 533.23 533.23 12.482 23.495 10.121 4096 8192 991.74 991.67 991.67 52.300 84.860 93.687 8192 16384 2058.7 2058.5 2058.5 456.50 430.86 945.75

Table 4.1: CPU times and optima for linear programming. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized to be sparse with sparsity level 0.01.

54 4.3.2 Nonnegative Quadratic Programming

Let us revisit the nonnegative quadratic programming (NQP) example of Section 3.5.3. In NQP

1 T T one minimizes the objective function f(x) = 2 x Ax + b x subject to x  0 for a positive definite matrix A. As in Section 4.3.1, projection onto the nonnegative cone is accomplished by the max operator (x)+. For a given value of the penalty constant ρ, the proximal distance update is

−1 xk+1 = (ρI + A) [ρ(xk)+ − b] . (4.8)

Na¨ıvely solving the linear system at every iteration leads to suboptimal performance. Two alterna- tives exist, depending on whether A is dense or sparse. If A is dense, then one should precompute and cache the spectral decomposition A = VDV T . The update (4.8) becomes

−1 T xk+1 = V (ρI + D) V [ρ(xk)+ − b] . (4.9)

The diagonal matrix ρI + D is trivial to invert. The remaining operations reduce to matrix-vector multiplications, which are substantially cheaper than repeated matrix inversions. Extraction of the spectral decomposition of A becomes prohibitive as the dimension of A increases. To compute the update (4.8) efficiently for large sparse matrices, we apply LSQR.

Table 4.2 compares the performance of the proximal distance algorithm for NQP to the open source nonlinear interior point solver Ipopt [134, 135] and the interior point method of Gurobi. Test problems were generated by filling an n × n matrix M and an n-vector b with standard normal deviates. We then set A = M T M + 0.001I. For sparse problems we set the sparsity level of

M to be log10(n)/n. Our setup ensures that A has full rank and that the quadratic program has a solution. For dense matrices, we start with ρ = 1 and multiply it by 1.5 every 200 iterations. For sparse problems, we start ρ at 10−4 and multiply it by 1.5 every 100 iterations. Table 4.2 suggests that the proximal distance algorithm and the interior point solvers perform equally well on small dense problems. However, in high-dimensional and low-accuracy environments, the proximal distance algorithm provides better performance.

55 Dimensions Optima CPU Seconds

n PD IPOPT Gurobi PD IPOPT Gurobi

2 -0.0015 -0.0014 -0.0014 0.0042 0.0025 0.0031 4 -0.6070 -0.6070 -0.6070 0.0002 0.0028 0.0017 8 -0.6840 -0.6834 -0.6834 0.0064 0.0036 0.0024 16 -0.6235 -0.6234 -0.6235 0.0872 0.0037 0.0022 32 -0.1936 -0.1934 -0.1936 0.0864 0.0041 0.0030 64 -0.3368 -0.3364 -0.3368 0.1121 0.0054 0.0059 128 -0.5344 -0.5337 -0.5344 0.1698 0.0124 0.0326 256 -0.4969 -0.4956 -0.4969 0.3001 0.0512 0.0760 512 -0.4716 -0.4689 -0.4716 0.8104 0.2617 0.3720 1024 -26271 -26277 -26277 12.7841 0.2575 0.3685 2048 -26000 -26024 -26024 29.6108 2.2635 2.2506 4096 -56138 -56272 -56272 57.9576 23.850 17.452 8192 -52960 -53025 -53025 126.145 242.90 164.90 16384 -108677 -108837 -108837 425.017 2596.3 1500.4

Table 4.2: CPU times and optima for nonnegative quadratic programming. Here n is the number of variables, PD is the accelerated proximal distance algorithm, IPOPT is the Ipopt solver, and Gurobi is the Gurobi solver. After n = 512, the constraint matrix A is sparse.

56 4.3.3 Closest Kinship Matrix

In genetics studies, kinship is measured by the fraction of genes that two individuals share identi- cally by descent. For a given pedigree, the kinship coefficients for all pairs of individuals appear as entries in a symmetric kinship matrix Y . This matrix possesses three crucial properties:

(a) it is positive semidefinite,

(b) its entries are nonnegative,

1 (c) its diagonal entries are 2 unless some pedigree members are inbred.

Inbreeding is the exception rather than the rule. Kinship matrices can be estimated empirically from single nucleotide polymorphism (SNP) data, but there is no guarantee that the three high- lighted properties are satisfied. Hence, it helpful to project Y to the nearest qualifying matrix.

This projection problem is best solved by folding the positive semidefinite constraint into the

1 2 domain of the Frobenius loss function 2 kX − Y kF . As we shall see, the alternative of imposing two penalties rather than one is slower and less accurate. Projection onto the constraints implied

1 by conditions (b) and (c) is trivial. All diagonal entries xii of X are reset to 2 , and all off-diagonal

entries xij are reset to max{xij, 0}. If Π(Xk) denotes the current projection, then the proximal distance algorithm minimizes the surrogate

1 ρ g(X | X ) = kX − Y k2 + kX − Π(X )k2 k 2 F 2 k F 2 1 + ρ 1 ρ = X − Y − Π(Xk) + ck, 2 1 + ρ 1 + ρ F

where ck is an irrelevant constant. The minimum is found by extracting the spectral decomposition T 1 ρ UDU of 1+ρ Y + 1+ρ Π(Xk) and truncating the negative eigenvalues. This gives the update T Xk+1 = UD+U in obvious notation. The most onerous computations are clearly the repeated matrix decompositions.

Table 4.3 compares three versions of the proximal distance algorithm to Dykstra’s algorithm [28]. Higham proposed Dykstra’s algorithm for the related problem of finding the closest correla- tion matrix [71]. In Table 4.3 algorithm PD1 is the basic proximal distance algorithm, PD2 is the 57 Size PD1 PD2 PD3 Dykstra

n Loss Time Loss Time Loss Time Loss Time

2 1.64 0.36 1.64 0.01 1.64 0.01 1.64 0.00 4 2.86 0.10 2.86 0.01 2.86 0.01 2.86 0.00 8 18.77 0.21 18.78 0.03 18.78 0.03 18.78 0.00 16 45.10 0.84 45.12 0.18 45.12 0.12 45.12 0.02 32 169.58 4.36 169.70 0.61 169.70 0.52 169.70 0.37 64 837.85 16.77 838.44 2.90 838.43 2.63 838.42 4.32 128 3276.41 91.94 3279.44 18.00 3279.25 14.83 3279.23 19.73 256 14029.07 403.59 14045.30 89.58 14043.59 64.89 14043.46 72.79

Table 4.3: CPU times and optima for the closest kinship matrix problem. Here the kinship matrix is n × n, PD1 is the proximal distance algorithm, PD2 is the accelerated proximal distance, PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss, and Dykstra is Dykstra’s adaptation of alternating projections. All times are in seconds. accelerated proximal distance algorithm, and PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss. On this demanding problem, these algorithms are comparable to Dykstra’s algorithm in speed but slightly less ac- curate. Acceleration of the proximal distance algorithm is effective in reducing both execution time and error. Folding the positive semidefinite constraint into the domain of the loss function leads to further improvements. The data matrices M in these trials were populated by standard normal deviates and then symmetrized by averaging opposing triangles. In algorithm PD1 we set

k 22 ρk = max{1.2 , 2 }. In the accelerated versions PD2 and PD3 we started with ρ = 1 and multi- plied it by 5 every 100 iterations. At the expense of longer compute times, better accuracy can be achieved by all three proximal distance algorithms with a less aggressive update schedule.

58 4.3.4 Projection onto a Second-Order Cone Constraint

Second-order cone programming is among the most important fields in convex analysis [4, 99]. It

T revolves around conic constraints of the form {u : kAu + bk2 ≤ c u + d}. Projection of a vector x onto such a constraint set is facilitated by parameter splitting. In this setting parameter splitting introduces a vector w, a scalar r, and the two affine constraints w = Au+b and r = cT u+d. The conic constraint then reduces to the Lorentz cone constraint kwk2 ≤ r, for which the projection is known [27]. If we concatenate the parameters into the single vector   u     y = w   r and define L = {y : kwk ≤ r} and M = {y : w = Au + b and r = cT u + d}, then we can

1 2 rephrase the problem as minimizing 2 kx − uk2 subject to y ∈ L ∩ M. This is a fairly typical set projection problem except that the w and r components of y are missing in the loss function.

Taking a cue from Section 4.3.1, we incorporate the affine constraints in the domain of the objective function. If we represent projection onto L by     wk w˜ k ΠL   =   , rk r˜k

then the Lagrangian generated by the proximal distance algorithm amounts to   2 w − w˜ 1 2 ρ k T T kx − uk2 +   + λ (Au + b − w) + θ(c u + d − r). 2 2 r − r˜k 2 This gives rise to a system of three stationarity equations

0 = u − x + AT λ + θc (4.10)

0 = ρ(w − w˜ k) − λ (4.11)

0 = ρ(r − r˜k) − θ. (4.12)

Solving for the multipliers λ and θ in equations (4.11) and (4.12) and substituting their values in

59 equation (4.10) yields

T 0 = u − x + ρA (w − w˜ k) + ρ(r − r˜k)c

T T = u − x + ρA (Au + b − w˜ k) + ρ(c u + d − r˜k)c.

This leads to the update

−1 T T −1 −1 T uk+1 = (ρ I + A A + cc ) [ρ x + A (w˜ k − b) + (˜rk − d)c]. (4.13)

T The updates wk+1 = Auk+1 + b and rk+1 = c uk+1 + d follow from the constraints.

Table 4.4 compares the proximal distance algorithm to SCS and Gurobi. Echoing previous examples, we tailor the update schedule for ρ differently for dense and sparse problems. Dense problems converge quickly and accurately when we set ρ0 = 1 and double ρ every 100 iterations.

Sparse problems require a greater range and faster updates of ρ, so we set ρ0 = 0.01 and then multiply ρ by 2.5 every 10 iterations. For dense problems, it is clearly advantageous to cache the spectral decomposition of AT A + ccT as suggested in Section 4.3.2. In this regime, the proximal distance algorithm is as accurate as Gurobi and nearly as fast. SCS is comparable to Gurobi in speed but notably less accurate.

With a large sparse constraint matrix A, extraction of its spectral decomposition becomes pro-   hibitively expensive. If we let E = ρ−1/2IAT c , then we must solve a linear system of equations defined by the Gram matrix G = EET . There are three reasonable options for solving this system. The first relies on computing and caching a sparse Cholesky decomposition of G. The second computes the QR decomposition of the sparse matrix E. The R part of the QR decompo- sition coincides with the Cholesky factor. Unfortunately, updates to ρ necessitate recomputation of the Cholesky or QR factors. The third option is the conjugate gradient algorithm. In our expe- rience the QR decomposition offers superior stability and accuracy. When E is very sparse, the QR decomposition is often much faster than the Cholesky decomposition because it avoids the formation of the dense matrix AT A. Even when only 5% of the entries of A are nonzero, 90% of the entries of AT A can be nonzero. If exquisite accuracy is not a concern, then the conjugate gradient method provides the fastest update. Table 4 reflects this choice.

Table 4.4: CPU times and optima for the second-order cone projection. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized with sparsity level 0.01.

  Dimensions              Optima                        CPU Seconds
    m      n       PD        SCS       Gurobi       PD        SCS      Gurobi
    2      4     0.10598   0.10607    0.10598     0.0043    0.0103    0.0026
    4      8     0.00000   0.00000    0.00000     0.0003    0.0009    0.0022
    8     16     0.88988   0.88991    0.88988     0.0557    0.0011    0.0027
   16     32     2.16514   2.16520    2.16514     0.0725    0.0012    0.0040
   32     64     3.03855   3.03864    3.03853     0.0952    0.0019    0.0094
   64    128     4.86894   4.86962    4.86895     0.1225    0.0065    0.0403
  128    256     10.5863   10.5843    10.5863     0.1975    0.0810    0.0868
  256    512     31.1039   31.0965    31.1039     0.5463    0.3995    0.3405
  512   1024     27.0483   27.0475    27.0483     3.7667    1.6692    2.0189
 1024   2048     1.45578   1.45569    1.45569     0.5352    0.3691    1.5489
 2048   4096     2.22936   2.22930    2.22921     1.0845    2.4531    5.5521
 4096   8192     1.72306   1.72202    1.72209     3.1404    17.272    15.204
 8192  16384     5.36191   5.36116    5.36144     13.979    133.25    88.024

4.3.5 Copositive Matrices

A symmetric matrix $M$ is copositive if its associated quadratic form $x^TMx$ is nonnegative for all $x \succeq 0$. Copositive matrices find applications in numerous branches of the mathematical sciences [16]. All positive semidefinite matrices and all matrices with nonnegative entries are copositive. The variational index
$$\mu(M) = \min_{\|x\|_2 = 1,\; x \succeq 0} x^TMx$$
is one key to understanding copositive matrices [73]. The constraint set $S$ is the intersection of the unit sphere and the nonnegative cone $\mathbb{R}^n_+$. Projection of an external point $y$ onto $S$ splits into three cases. When all components of $y$ are negative, then $\Pi_S(y) = e_i$, where $y_i$ is the least negative component of $y$, and $e_i$ is the standard unit vector along coordinate direction $i$. The origin $0$ is equidistant from all points of $S$. If any component of $y$ is positive, then the projection is constructed by setting the negative components of $y$ equal to 0 and then standardizing the truncated version of $y$ to have Euclidean norm 1.
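A literal transcription of this projection rule into Julia is given below. The function name is our own, and ties (such as $y = 0$) are resolved arbitrarily, which is harmless in practice since the origin is equidistant from all of $S$.

```julia
using LinearAlgebra

# Project y onto S = {x : ||x||_2 = 1, x >= 0}, following the three cases in the text.
function project_sphere_nonneg(y::Vector{Float64})
    if all(y .<= 0)                   # no positive component: pick the least negative one
        x = zeros(length(y))
        x[argmax(y)] = 1.0
        return x
    end
    x = max.(y, 0.0)                  # truncate the negative components to zero
    return x ./ norm(x)               # rescale to the unit sphere
end
```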

As a test case for the proximal distance algorithm, consider the Horn matrix [66]
$$M = \begin{pmatrix}
 1 & -1 &  1 &  1 & -1 \\
-1 &  1 & -1 &  1 &  1 \\
 1 & -1 &  1 & -1 &  1 \\
 1 &  1 & -1 &  1 & -1 \\
-1 &  1 &  1 & -1 &  1
\end{pmatrix}.$$

The value $\mu(M) = 0$ is attained for the vectors $\frac{1}{\sqrt{2}}(1, 1, 0, 0, 0)^T$, $\frac{1}{\sqrt{6}}(1, 2, 1, 0, 0)^T$, and equivalent vectors with their entries permuted. Matrices in higher dimensions with the same Horn pattern of 1s and $-1$s are copositive as well [79]. A Horn matrix of odd dimension cannot be written as a positive semidefinite matrix, a nonnegative matrix, or a sum of two such matrices.

The proximal distance algorithm minimizes the criterion
$$g(x \mid x_k) = \frac{1}{2}x^TMx + \frac{\rho}{2}\|x - \Pi_S(x_k)\|_2^2$$
and generates the updates
$$x_{k+1} = (M + \rho I)^{-1}\rho\,\Pi_S(x_k).$$

Table 4.5: CPU times (seconds) and optima for approximating the Horn variational index of a Horn matrix. Here n is the size of the Horn matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension             Optima                           CPU Seconds
     n         PD          aPD        Mosek          PD        aPD       Mosek
     4     0.000000    0.000000    feasible        0.5555    0.0124     2.7744
     5     0.000000    0.000000    infeasible      0.0039    0.0086     0.0276
     8     0.000021    0.000000    feasible        0.0059    0.0083     0.0050
     9     0.000045    0.000000    infeasible      0.0055    0.0072     0.0082
    16     0.000377    0.000001    feasible        0.0204    0.0237     0.0185
    17     0.000441    0.000001    infeasible      0.0204    0.0378     0.0175
    32     0.001610    0.000007    feasible        0.0288    0.0288     0.1211
    33     0.002357    0.000009    infeasible      0.0242    0.0346     0.1294
    64     0.054195    0.000026    feasible        0.0415    0.0494     3.6284
    65     0.006985    0.000026    infeasible      0.0431    0.0551     2.7862

It takes a gentle tuning schedule to get decent results. The choice $\rho_k = 1.2^k$ converges in 600 to 700 iterations from random starting points and reliably yields objective values below $10^{-5}$ for Horn matrices. The computational burden per iteration is significantly eased by exploiting the cached spectral decomposition of $M$. Table 4.5 compares the performance of the proximal distance algorithm to the Mosek solver on a range of Horn matrices. Mosek uses semidefinite programming to decide whether $M$ can be decomposed into a sum of a positive semidefinite matrix and a nonnegative matrix. If not, then Mosek declares the problem infeasible. Nesterov acceleration improves the final loss for the proximal distance algorithm, but it does not decrease overall computing time.
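The following Julia sketch strings the pieces together for this example: it caches the spectral decomposition of $M$ once, applies the update $x_{k+1} = (M + \rho I)^{-1}\rho\,\Pi_S(x_k)$, and follows the $\rho_k = 1.2^k$ schedule mentioned above. It reuses the `project_sphere_nonneg` helper sketched earlier; the function name, stopping rule, and random start are illustrative.

```julia
using LinearAlgebra

# Approximate the variational index mu(M) by the proximal distance iteration.
function variational_index(M::AbstractMatrix; maxiter = 700)
    F = eigen(Symmetric(Matrix(M)))        # cache the spectral decomposition once
    Q, d = F.vectors, F.values
    x = project_sphere_nonneg(randn(size(M, 1)))
    for k in 1:maxiter
        rho = 1.2^k                        # gentle tuning schedule
        p = rho .* project_sphere_nonneg(x)
        x = Q * ((Q' * p) ./ (d .+ rho))   # solves (M + rho*I) x = p via the eigenvalues
    end
    x = project_sphere_nonneg(x)           # report a feasible point
    return dot(x, M * x), x
end
```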

Testing for copositivity is challenging because neither the loss function nor the constraint set is convex. The proximal distance algorithm offers a fast screening device for checking whether a matrix is copositive. On random $1000 \times 1000$ symmetric matrices $M$, the method invariably returns a negative index in less than two seconds of computing time. Because the vast majority of symmetric matrices are not copositive, accurate estimation of the minimum is not required. Table 4.6 summarizes a few random trials with lower-dimensional symmetric matrices. In higher dimensions, Mosek becomes non-competitive, and Nesterov acceleration is of dubious value.

Table 4.6: CPU times and optima for testing the copositivity of random symmetric matrices. Here n is the size of the matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension             Optima                            CPU Seconds
     n          PD           aPD        Mosek          PD        aPD       Mosek
     4      -0.391552    -0.391561   infeasible      0.0029    0.0031     0.0024
     8      -0.911140    -2.050316   infeasible      0.0037    0.0044     0.0045
    16      -1.680697    -1.680930   infeasible      0.0199    0.0272     0.0062
    32      -2.334520    -2.510781   infeasible      0.0261    0.0242     0.0441
    64      -3.821927    -3.628060   infeasible      0.0393    0.0437     0.6559
   128      -5.473609    -5.475879   infeasible      0.0792    0.0798    38.3919
   256      -7.956365    -7.551814   infeasible      0.1632    0.1797   456.1500

4.3.6 Linear Complementarity Problem

The linear complementarity problem [108] consists of finding vectors $x$ and $y$ with nonnegative components such that $x^Ty = 0$ and $y = Ax + b$ for a given square matrix $A$ and vector $b$. The natural loss function is $\frac{1}{2}\|y - Ax - b\|_2^2$. To project a vector pair $(u, v)$ onto the nonconvex constraint set, one considers each component pair $(u_i, v_i)$ in turn. If $u_i \ge \max\{v_i, 0\}$, then the nearest pair of vectors $(x, y)$ has components $(x_i, y_i) = (u_i, 0)$. If $v_i \ge \max\{u_i, 0\}$, then the nearest pair of vectors has components $(x_i, y_i) = (0, v_i)$. Otherwise, $(x_i, y_i) = (0, 0)$. At each iteration the proximal distance algorithm minimizes the criterion

$$\frac{1}{2}\|y - Ax - b\|_2^2 + \frac{\rho}{2}\|x - \tilde{x}_k\|_2^2 + \frac{\rho}{2}\|y - \tilde{y}_k\|_2^2,$$
where $(\tilde{x}_k, \tilde{y}_k)$ is the projection of $(x_k, y_k)$ onto the constraint set.

Table 4.7: CPU times (seconds) and optima for the linear complementarity problem with randomly generated data. Here n is the size of the matrix, PD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension          Optima                  CPU Seconds
     n          PD        Mosek           PD        Mosek
     4      0.000000    0.000000       0.0230      0.0266
     8      0.000000    0.000000       0.0062      0.0079
    16      0.000000    0.000000       0.0269      0.0052
    32      0.000000    0.000000       0.0996      0.4303
    64      0.000074    0.000000       2.6846    360.5183

The stationarity equations become

$$\begin{aligned}
0 &= -A^T(y - Ax - b) + \rho(x - \tilde{x}_k) \\
0 &= y - Ax - b + \rho(y - \tilde{y}_k).
\end{aligned}$$
Substituting the value of $y$ from the second equation into the first equation leads to the updates
$$\begin{aligned}
x_{k+1} &= \left[(1 + \rho)I + A^TA\right]^{-1}\left[A^T(\tilde{y}_k - b) + (1 + \rho)\tilde{x}_k\right] \qquad (4.14) \\
y_{k+1} &= \frac{1}{1 + \rho}(Ax_{k+1} + b) + \frac{\rho}{1 + \rho}\tilde{y}_k.
\end{aligned}$$
The linear system (4.14) can be solved in low to moderate dimensions by computing and caching the spectral decomposition of $A^TA$ and in high dimensions by the conjugate gradient method. Table 4.7 compares the performance of the proximal distance algorithm to the Mosek solver on some randomly generated problems.
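A compact Julia sketch of one such iteration appears below. The componentwise projection follows the three cases described above; the dense backslash solve stands in for the cached spectral decomposition or conjugate gradient solves used in the actual experiments, and the function names are our own.

```julia
using LinearAlgebra

# Project the pair (u, v) componentwise onto {(x, y) : x >= 0, y >= 0, x .* y = 0}.
function project_complementarity(u::Vector{Float64}, v::Vector{Float64})
    x, y = zero(u), zero(v)
    for i in eachindex(u)
        if u[i] >= max(v[i], 0.0)
            x[i] = u[i]
        elseif v[i] >= max(u[i], 0.0)
            y[i] = v[i]
        end                      # otherwise both components stay at zero
    end
    return x, y
end

# One proximal distance update, combining (4.14) with the companion y update.
function lcp_update(x, y, A, b, rho)
    xt, yt = project_complementarity(x, y)
    xnew = ((1 + rho) * I + A' * A) \ (A' * (yt - b) + (1 + rho) * xt)
    ynew = (A * xnew + b) ./ (1 + rho) .+ (rho / (1 + rho)) .* yt
    return xnew, ynew
end
```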

4.3.7 Sparse Principal Components Analysis

Let $X$ be an $n \times p$ data matrix gathered on $n$ cases and $p$ predictors. Assume that the columns of $X$ are centered to have mean 0. Principal component analysis (PCA) [75, 118] operates on the sample covariance matrix $S = \frac{1}{n}X^TX$. Here we formulate a proximal distance algorithm for sparse PCA (SPCA), which has attracted substantial interest in the machine learning community [18, 17, 44, 80, 81, 139, 148]. According to a result of Ky Fan [57], the first $q$ principal components (PCs) $u_1, \ldots, u_q$ can be extracted by maximizing the function $\operatorname{tr}(U^TSU)$ subject to the matrix constraint $U^TU = I_q$, where $u_i$ is the $i$th column of the $p \times q$ matrix $U$. This constraint set is called a Stiefel manifold, which we denote by $M_q$. One can impose sparsity by insisting that any given column $u_i$ have at most $r$ nonzero entries. Alternatively, one can require the entire matrix $U$ to have at most $r$ nonzero entries. The latter choice permits nonzero values to be distributed non-uniformly across columns.

Extraction of sparse PCs is difficult for three reasons. First, the Stiefel manifold $M_q$ and both sparsity sets are nonconvex. Second, the objective function is concave rather than convex. Third, there is no simple formula or algorithm for projecting onto the intersection of the two constraint sets. Fortunately, each individual projection is simple. Let $\Pi_{M_q}(U)$ denote the projection of $U$ onto $M_q$. It is well known that $\Pi_{M_q}(U)$ can be calculated by extracting a thin singular value decomposition $U = V\Sigma W^T$ of $U$ and then applying the Procrustes projection $\Pi_{M_q}(U) = VW^T$ [65]. Here $V$ and $W$ are orthogonal matrices of dimension $p \times q$ and $q \times q$, respectively, and $\Sigma$ is a diagonal matrix of dimension $q \times q$. Let $\Pi_{S_r}(U)$ denote the projection of $U$ onto the sparsity set
$$S_r = \{V : v_{ij} \neq 0 \text{ for at most } r \text{ entries of each column } v_i\}.$$
Because $\Pi_{S_r}(U)$ operates column by column, it suffices to project each column vector $u_i$ to sparsity. This entails nothing more than sorting the entries of $u_i$ by magnitude, saving the $r$ largest, and sending the remaining $p - r$ entries to 0. If the entire matrix $U$ must have at most $r$ nonzero entries, then $U$ can be treated as a concatenated vector $\operatorname{vec}(U)$ during projection.

The key to a good algorithm is to incorporate the Stiefel constraints into the domain of the objective function [84, 85] and the sparsity constraints into the distance penalty. Thus, we propose decreasing the criterion
$$f(U) = -\frac{1}{2}\operatorname{tr}(U^TSU) + \frac{\rho}{2}\operatorname{dist}(U, S_r)^2$$
at each iteration subject to the Stiefel constraints. The loss can be majorized via
$$-\frac{1}{2}\operatorname{tr}(U^TSU) = -\frac{1}{2}\operatorname{tr}[(U - U_k)^TS(U - U_k)] - \operatorname{tr}(U^TSU_k) + \frac{1}{2}\operatorname{tr}(U_k^TSU_k) \le -\operatorname{tr}(U^TSU_k) + \frac{1}{2}\operatorname{tr}(U_k^TSU_k)$$
because $S$ is positive semidefinite. The penalty is majorized by
$$\frac{\rho}{2}\operatorname{dist}(U, S_r)^2 \le -\rho\operatorname{tr}[U^T\Pi_{S_r}(U_k)] + c_k$$
up to an irrelevant constant $c_k$ since the squared Frobenius norm satisfies the relation $\|U\|_F^2 = \operatorname{tr}(U^TU) = q$ on the Stiefel manifold. It now follows that $f(U)$ is majorized by
$$\frac{1}{2}\|U - SU_k - \rho\,\Pi_{S_r}(U_k)\|_F^2$$
up to an irrelevant constant. Accordingly, the Stiefel projection
$$U_{k+1} = \Pi_{M_q}[SU_k + \rho\,\Pi_{S_r}(U_k)]$$
provides the next iterate.
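The next sketch implements one such update in Julia for the column-wise sparsity variant. The thin SVD supplies the Procrustes projection and column-wise hard thresholding supplies $\Pi_{S_r}$; the function names are illustrative, and the matrix-wide sparsity variant would simply rank all entries of $U$ at once.

```julia
using LinearAlgebra

# Keep the r largest-magnitude entries of each column of U; zero the rest.
function project_column_sparsity(U::Matrix{Float64}, r::Int)
    V = zeros(size(U))
    for j in 1:size(U, 2)
        idx = sortperm(abs.(U[:, j]), rev = true)[1:r]
        V[idx, j] = U[idx, j]
    end
    return V
end

# Procrustes projection onto the Stiefel manifold via a thin SVD.
function project_stiefel(U::Matrix{Float64})
    F = svd(U)
    return F.U * F.Vt
end

# One proximal distance update: U_{k+1} = Pi_Mq[S*U_k + rho*Pi_Sr(U_k)].
spca_update(U, S, rho, r) = project_stiefel(S * U + rho * project_column_sparsity(U, r))
```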

Figures 4.1 and 4.2 compare the proximal distance algorithm to the SPC function from the R package PMA [139]. The breast cancer data from PMA provide the data matrix $X$. The data consist of $p = 19{,}672$ RNA measurements on $n = 89$ patients. The two figures show computation times and the proportion of variance explained (PVE) by the $p \times q$ loading matrix $U$. For sparse PCA, PVE is defined as $\operatorname{tr}(X_q^TX_q)/\operatorname{tr}(X^TX)$, where $X_q = XU(U^TU)^{-1}U^T$ [124]. When the loading vectors of $U$ are orthogonal, this criterion reduces to the familiar definition $\operatorname{tr}(U^TX^TXU)/\operatorname{tr}(X^TX)$ of PVE for ordinary PCA. The proximal distance algorithm enforces either matrix-wise or column-wise sparsity. In contrast, SPC enforces only column-wise sparsity via the constraint $\|u_i\|_1 \le c$ for each column $u_i$ of $U$. We take $c = 8$. The number of nonzeroes per loading vector output by SPC dictates the sparsity level for the column-wise version of the proximal distance algorithm. Summing these counts across all columns dictates the sparsity level for the matrix version of the proximal distance algorithm.

Figure 4.1: Proportion of variance explained by q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA.

Figure 4.2: Computation times for q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA.

Figures 4.1 and 4.2 demonstrate the superior PVE and computational speed of both proximal distance algorithms versus SPC. The type of projection does not appear to affect the computational performance of the proximal distance algorithm, as both versions scale equally well with q. However, the matrix projection, which permits the algorithm to more freely assign nonzeroes to the loadings, attains better PVE than the more restrictive column-wise projection. For both variants of the proximal distance algorithm, Nesterov acceleration improves both fitting accuracy and computational speed, especially as the number of PCs q increases.

4.4 Discussion

The proximal distance algorithm applies to a host of problems. In addition to the linear and quadratic programming examples considered here, in the previous chapter we derive and test algorithms for network optimization, $\ell_0$ regression, matrix completion [31, 34, 37, 104], and sparse precision matrix estimation [61]. Other potential applications immediately come to mind. An integer linear program in standard form can be expressed as minimizing $c^Tx$ subject to $Ax + s = b$, $s \ge 0$, and $x \in \mathbb{Z}^p$. The latter two constraints can be combined in a single constraint for which projection is trivial. The affine constraints should be folded into the domain of the objective. Integer programming is NP-hard, so the proximal distance algorithm just sketched is merely heuristic. Integer linear programming includes traditional NP-hard problems such as the traveling salesman problem, the vertex cover problem, set packing, and Boolean satisfiability. It remains unclear whether or not the proximal distance principle can meet these challenges. Our experience with the closest lattice point problem [3] and the eight queens problem suggests that the proximal distance algorithm can be too greedy for difficult combinatorial optimization. The nonconvex problems solved in this paper are in a vague sense easy combinatorial problems.

The behavior of a proximal distance algorithm depends critically on a sensible tuning schedule for increasing ρ. Starting ρ too high puts too much stress on satisfying the constraints. Incrementing ρ too quickly causes the algorithm to veer off the solution path guaranteed by the penalty method. Given the chance of roundoff error even with double precision arithmetic, it is unwise to take ρ all the way to $\infty$. Trial and error can help in deciding whether a given class of problems will benefit from an aggressive update schedule and strict or loose convergence criteria. In problems with little curvature such as linear programming, more conservative updates are probably prudent. Both the closest kinship matrix problem and the SPCA problem document the value of folding constraints into the domain of the loss. In the same spirit it is wise to minimize the number of constraints. A single penalty for projecting onto the intersection of two constraint sets is almost always preferable to two penalties for their separate projections. Exceptions to this rule occur when projection onto the intersection is hard. The integer linear programming problem mentioned previously illustrates these ideas.

The version of the proximal distance algorithm derived in Chapter 3 ignored acceleration. In many cases the solutions attained low accuracy and often required many iterations to converge. The realization that convex proximal distance algorithms can be phrased as proximal gradient algorithms opened the possibility of Nesterov acceleration. The version developed here uses Nesterov acceleration routinely on the subproblems with a fixed ρ. This typically forces tighter path following and a reduction in overall computing times. The examples in Section 4.3 generally bear out the contention that Nesterov acceleration is useful in nonconvex problems [63]. However, the value of acceleration often lies in improving the quality of a solution as much as it does in increasing the rate of convergence. Of course, acceleration cannot prevent convergence to an inferior local minimum.

The proximal distance principle offers insight into many existing algorithms and a path to devising new ones. Effective proximal and projection operators usually spell the difference between success and failure. The number and variety of such operators is expanding quickly as the field of optimization relinquishes its fixation on convexity. The current exposition leaves many open questions about tuning schedules, rates of convergence, and acceleration in the face of nonconvexity.

It remains to show that these projection algorithms are useful. In Chapter 5 we describe in detail a particular projection algorithm called iterative hard thresholding. Like the proximal distance algorithm for $\ell_0$ regression, iterative hard thresholding performs sparse linear regression. We will demonstrate its utility in a statistical genetics setting that benefits from its nonconvex nature.

CHAPTER 5

Iterative Hard Thresholding for GWAS Analysis

5.1 Introduction

A genome-wide association study (GWAS) examines the influence of a multitude of genetic variants on a given trait. Over the past decade, GWAS has benefitted from technological advances in dense genotype microarrays, high-throughput sequencing, and more powerful computing resources. Yet researchers still struggle to find the genetic variants that account for the missing heritability of many traits. It is now common for consortia studying a complex trait such as height to pool results across multiple sites and countries. Meta-analyses have discovered hundreds of statistically significant single nucleotide polymorphisms (SNPs), each of which explains a small fraction of the total heritability. A drawback of GWAS meta-analysis is that it typically relies on univariate regression rather than on more informative multivariate regression [144]. Because the number of SNPs (predictors) in a GWAS vastly exceeds the number of study subjects (observations), statistical geneticists have resorted to machine learning techniques such as penalized regression [96] for model selection.

In the statistical setting of $n$ subjects and $p$ predictors with $n \ll p$, penalized regression estimates a sparse statistical model $\beta \in \mathbb{R}^p$ by minimizing a penalized loss $f(\beta) + \lambda p(\beta)$, where $f(\beta)$ is a convex loss, $p(\beta)$ is a suitable penalty, and λ is a tuning constant controlling the sparsity of β. The most popular and mature sparse regression tool is LASSO ($\ell_1$) regression. It is known that LASSO parameter estimates are biased towards zero [68], often severely so, as a consequence of shrinkage. Shrinkage in itself is not terribly harmful, but LASSO regression lets too many false positives enter a model. Since GWAS is often followed by expensive biological validation studies, there is value in reducing false positive rates. In view of the side effects of shrinkage, Zhang [147] recommends the minimax concave penalty (MCP) as an alternative to the $\ell_1$ penalty. Other non-convex penalties exist, but MCP is probably the simplest to implement. MCP also has provably stable convergence guarantees. In contrast to the LASSO, which admits false positives, MCP tends to allow too few predictors to enter a model. Thus, its false negative rate is too high. Our subsequent numerical examples illustrate these tendencies.

Surprisingly few software packages implement efficient penalized optimization algorithms for GWAS. The R packages glmnet and ncvreg are ideal candidates given their ease of use, maturity of development, and wide acceptance. The former implements LASSO-penalized regression [62, 92, 131], while the latter implements both LASSO- and MCP-penalized regression [30, 147]. Both packages provide excellent functionality for moderately sized problems. However, their scalability to GWAS is encumbered by R's poor memory management. In fact, analysis on a typical workstation is limited to at most a handful of chromosomes. Larger problems must appeal to cluster or cloud computing. Neither glmnet nor ncvreg natively support the compressed PLINK binary genotype file (BED file) format typically used to efficiently store and distribute GWAS data [119]. Memory-efficient implementations of LASSO or MCP regression for GWAS appear in the packages Mendel, gpu-lasso, SparSNP, and the beta version of PLINK 1.9 [1, 36, 38, 97, 143]. To our knowledge, only Mendel supports MCP regression with PLINK files.

As an alternative to penalized regression, one can enforce sparsity directly through projection onto sparsity sets [21, 23, 22, 24]. Iterative hard thresholding (IHT) attempts to solve the problem of minimizing the loss $f(\beta)$ subject to the sparsity constraint $\|\beta\|_0 \le m$, where the $\ell_0$ “norm” $\|\beta\|_0$ counts the number of nonzero entries of the parameter vector β. The integer $m$ serves as a tuning constant analogous to λ in LASSO and MCP regression. IHT is one member of a family of sparse regression algorithms. Similar algorithms treated in the signal processing literature include hard thresholding pursuit [7, 58, 146], matching pursuit [102, 110, 132], and subspace pursuit [42]. Some of these algorithms rely on gradient descent and thus avoid computing and inverting large Hessian matrices. The absence of second derivatives is crucial for implementations that scale to large datasets. In addition, our implementation of IHT addresses some of the specific concerns of GWAS. First, it accommodates genotype compression if genotypes are presented in the PLINK compression format. Second, our version of IHT allows the user to choose the sparsity level $m$ of a model. In contrast, LASSO and MCP penalized regression must choose the model size indirectly by adjusting the tuning constant λ to match a given $m$. Third, our implementation of IHT uses prudent memory management, exploits all available CPU cores, and interfaces with massively parallel graphics processing unit (GPU) devices. Finally, our version of IHT performs more parsimonious model selection than either LASSO or MCP penalized regression. All of these advantages can be realized on a modern desktop workstation. Although our current IHT implementation is limited to ordinary linear least squares, the literature suggests that logistic regression is ultimately within reach [7, 146].

Before moving on to the rest of the chapter, let us sketch its main contents. Section 5.2 describes penalized regression and the IHT algorithm. Here we also describe in detail the tactics necessary for parallelization. Section 5.3 records our numerical experiments. The performance of IHT and competing algorithms is evaluated by several metrics: computation time, false positive rates, false negative rates, and prediction error. The sparsity level $m$ for a given dataset is chosen by cross-validation on both real and simulated genetic data. Our discussion in Section 5.4 summarizes results, limitations, and precautions.

5.2 Methods

5.2.1 Penalized regression

Consider a statistical design matrix $X \in \mathbb{R}^{n \times p}$, a noisy $n$-dimensional response $y$, and a sparse vector β of regression coefficients. When $y$ represents a continuous phenotype, then the residual sum of squares
$$f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 \qquad (5.1)$$
is a reasonable choice for the loss $f(\beta)$. LASSO penalized regression imposes the convex $\ell_1$ penalty $p_\lambda(\beta) = \lambda\|\beta\|_1 = \lambda\sum_{i=2}^p |\beta_i|$. In most applications the intercept contribution $|\beta_1|$ is omitted from the penalty. Various approaches exist to minimize the objective $f(\beta) + \lambda\|\beta\|_1$, including least angle regression (LARS) [56], cyclic coordinate descent [60, 142, 143], and the fast iterative shrinkage and thresholding algorithm (FISTA) [12]. The $\ell_1$ norm penalty induces both sparsity and shrinkage. Shrinkage per se is not an issue because selected parameters can be re-estimated with both the non-selected parameters and the penalty removed. However, the severe shrinkage induced by the LASSO inflates false positive rates since spurious predictors enter the model to absorb the unexplained variance left by shrinkage imposed on true predictors.

Figure 5.1: A visual representation of model selection with the LASSO. The addition of the $\ell_1$ penalty encourages representation of y by a subset of the columns of X.

The MCP alternative to LASSO takes $p_{\lambda,\gamma}(\beta) = \sum_{i=2}^p q(|\beta_i|)$ with
$$q(\beta_i) = \begin{cases} \lambda\beta_i - \beta_i^2/(2\gamma) & 0 \le \beta_i \le \gamma\lambda \\ \gamma\lambda^2/2 & \beta_i > \gamma\lambda \end{cases}
\qquad
q'(\beta_i) = \begin{cases} \lambda - \beta_i/\gamma & 0 \le \beta_i < \gamma\lambda \\ 0 & \beta_i > \gamma\lambda \end{cases} \qquad (5.2)$$
for positive tuning constants λ and γ. The MCP penalty (5.2) attenuates penalization for large parameter values. Indeed, if $\beta_i \ge \gamma\lambda$, then MCP does not shrink $\beta_i$ at all. By relaxing the penalization of large entries of β, the MCP ameliorates the bias towards 0 introduced by shrinkage from the LASSO. If one majorizes the MCP function $q(\beta_i)$ by a scaled absolute value function, then cyclic coordinate descent computes MCP-penalized parameter updates that resemble the LASSO updates [78].
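For intuition, here are the scalar thresholding operators that these penalties generate for a standardized predictor; this is a hedged Julia sketch with illustrative names rather than code from our package. The firm (MCP) operator interpolates between the soft (LASSO) and hard ($\ell_0$) operators pictured in Figure 5.3 below.

```julia
# Soft thresholding: the LASSO update for a standardized predictor.
soft_threshold(b, lam) = sign(b) * max(abs(b) - lam, 0.0)

# Firm thresholding: the MCP update with tuning constants lam and gam > 1.
function firm_threshold(b, lam, gam)
    abs(b) > gam * lam && return b                      # no shrinkage for large b
    return gam / (gam - 1) * soft_threshold(b, lam)     # attenuated shrinkage otherwise
end

# Hard thresholding: keep or kill, with no shrinkage at all.
hard_threshold(b, lam) = abs(b) > lam ? b : 0.0
```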

Figure 5.2: A graphical representation of penalized (regularized) regression using norm balls. From left to right, the graphs show $\ell_2$ or Tikhonov regression, $\ell_1$ or LASSO regression, and $\ell_0$ or subset regression. The ellipses denote level curves around the unpenalized optimum β. The penalized optimum occurs at the intersection of the level curves with the norm ball. Tikhonov regularization provides some shrinkage, while the shrinkage from LASSO regularization is more dramatic. The $\ell_0$ norm enforces sparsity without shrinkage. The MCP “norm ball” cannot be easily drawn but sits between the $\ell_1$ and $\ell_0$ balls.

Figure 5.3: A view of sparse regression with thresholding operators. The order from left to right differs from Figure 5.2: the $\ell_1$ operator or soft thresholding operator, the MCP or firm thresholding operator, and the $\ell_0$ operator or hard thresholding operator. We clearly see how MCP interpolates the soft and hard thresholding operators.

Figure 5.4: A visual representation of IHT. The algorithm starts at a point $y$ and steps in the direction $-\nabla f(y)$ with magnitude µ to an intermediate point $y^+$. IHT then enforces sparsity by projecting onto the sparsity set $S_m$. The projection for $m = 2$ is the identity projection in this example, while projection onto $S_0$ merely sends $y^+$ to the origin 0. Projection onto $S_1$ preserves the larger of the two components of $y^+$.

Figures 5.2 and 5.3 offer a visual of sparsity via shrinkage. As mentioned earlier, one can obtain sparsity without shrinkage by minimizing $f(\beta)$ subject to $\|\beta\|_0 \le m$. This subset selection problem is known to be NP-hard [64, 109]. Nonetheless, good heuristic methods exist for its solution. IHT relies on the projected gradient update
$$\beta_{k+1} = \Pi_{S_m}[\beta_k - \mu\nabla f(\beta_k)], \qquad (5.3)$$
where µ denotes the step size of the algorithm, and $\Pi_{S_m}(\beta)$ denotes the projection of β onto the sparsity set $S_m$, where at most $m$ components of $\Pi_{S_m}(\beta)$ are nonzero. Projection is accomplished by setting all but the $m$ largest components of β in magnitude equal to 0. Figure 5.4 gives a schematic of a single iteration of IHT.
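A direct Julia sketch of the update (5.3) follows. The helper names are ours, and a full implementation would additionally handle the step size safeguards of Section 5.2.2 and the PLINK compression described later.

```julia
using LinearAlgebra

# Keep the m largest-magnitude entries of b; zero the rest (projection onto S_m).
function project_sparsity(b::Vector{Float64}, m::Int)
    x = zeros(length(b))
    idx = sortperm(abs.(b), rev = true)[1:m]
    x[idx] = b[idx]
    return x
end

# One IHT update (5.3) for the least squares loss f(b) = 0.5 * ||y - X*b||^2.
function iht_update(b, X, y, m, mu)
    grad = -X' * (y - X * b)
    return project_sparsity(b - mu * grad, m)
end
```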

For sufficiently small µ, the projected gradient update (5.3) is guaranteed to reduce the loss, but it forfeits stronger convergence properties because $S_m$ is nonconvex. Recent work has developed stable convergence and recovery guarantees for projected gradient updates by imposing loose additional restrictions on the local minima of $f(\beta)$. Examples of these restrictions include the restricted isometry property (RIP) [32], restricted strong convexity (RSC) [53], and restricted smoothness [2]. We urge curious readers to consult the references for mathematical details and proofs.

5.2.2 Calculating step sizes

Computing a reasonable step size µ is important for ensuring stable descent in projected gradient schemes. For the case of least squares regression, our implementation of IHT uses the “normalized” update of Blumensath and Davies [24]. At each iteration $k$, we accordingly employ the step size
$$\mu_k = \frac{\|\beta_k\|_2^2}{\|X\beta_k\|_2^2}.$$
Guaranteed convergence requires that $\mu < \omega$, where
$$\omega = (1 - c)\,\frac{\|\beta_{k+1} - \beta_k\|_2^2}{\|X(\beta_{k+1} - \beta_k)\|_2^2} \qquad (5.4)$$
for some constant $0 < c \ll 1$. One can interpret ω as the normed ratio of the difference between successive iterates versus the difference between successive estimated responses.
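In code these two quantities are one-liners. The sketch below assumes the least squares loss and uses illustrative names; the safeguard noted in the final comment (halving µ and retaking the step whenever $\mu \ge \omega$) is a common choice on our part, not a rule spelled out above.

```julia
# Normalized step size and the bound (5.4) used to safeguard an IHT step.
stepsize(beta, X) = sum(abs2, beta) / sum(abs2, X * beta)

omega_bound(betanew, betaold, X; c = 0.01) =
    (1 - c) * sum(abs2, betanew - betaold) / sum(abs2, X * (betanew - betaold))

# A common safeguard: halve mu and recompute the projected gradient step
# whenever mu >= omega_bound(betanew, betaold, X).
```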

5.2.3 Bandwidth optimizations

Analysis of large GWAS datasets requires intelligent memory management and data wrangling. Our software reads datasets in PLINK binary format, which stores each genotype in two bits. In theory, the PLINK compression protocol can attain a 32x compression by reducing 64-bit floats to 2-bit integers. PLINK compression facilitates storage and transport of data but complicates linear algebra operations. We store both a compressed $X$ and a compressed transpose $X^T$. The transpose $X^T$ is used to compute the gradient $\nabla f(\beta) = -X^T(y - X\beta)$, while $X$ is used to compute the estimated response $X\beta$. This counterintuitive tactic roughly doubles the memory required to store the genotype data. However, it facilitates accessing all data in column-major and unit stride order, thereby ensuring that all linear algebra operations maintain full memory caches.

Good statistical practice dictates standardizing all predictors; otherwise, parameters are penalized nonuniformly. Standardizing nongenetic covariates is trivial. However, one cannot store standardized genotypes in PLINK binary format. The remedy is to precompute and cache vectors $u$ and $v$ that contain the mean and precision, respectively, of each of the $p$ SNPs. We then use $u$ and $v$ to standardize genotypes on-the-fly. On-the-fly standardization is a costly operation and must be employed judiciously. For example, to calculate $X\beta$ we exploit the structural sparsity of β by only decompressing and standardizing the submatrix $X_m$ of $X$ corresponding to the $m$ nonzero values in β. We then use $X_m$ for parameter updates until we observe a change in the support of β, at which point we recompute the standardized genotypes for the new active set. Unfortunately, calculation of the gradient $\nabla f(\beta)$ offers no such optimization because it requires a fully decompressed and standardized matrix $X^T$. Since we cannot store all $n \times p$ standardized genotypes in floating point format, the best that we can achieve is standardization on-the-fly every time that we update the gradient.
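The sketch below shows on-the-fly standardization of a single SNP column. The decoding routine `decompress_column` is a hypothetical stand-in for the PLINK BED decoding step, and the cached vectors follow the $u$ (means) and $v$ (precisions) convention above.

```julia
# Decompress and standardize one SNP column on the fly, given cached
# means u and precisions v (v[j] = 1 / standard deviation of SNP j).
# `decompress_column` is a hypothetical placeholder for the BED decoding step.
function standardized_column(X_bed, j::Int, u::Vector{Float64}, v::Vector{Float64})
    g = decompress_column(X_bed, j)          # raw dosages 0, 1, 2 for SNP j
    return (g .- u[j]) .* v[j]
end
```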

5.2.4 Parallelization

Our implementation of IHT for PLINK files relies on two parallel computing schemes. First, we make heavy use of multicore computing with shared memory arrays to distribute computations over all cores in a CPU. For example, suppose that we wish to compute in parallel the column means of X stored in a shared memory array. The mean of each column is independent of the others, so the computations distribute naturally across multiple cores. If a CPU contains four available cores, then we enlist four workers for our computations, one master and three slaves. Each slave can see the entirety of X but only works on a subset of its columns. The slaves compute the column means for the three chunks of X in parallel. Columnwise operations, vector arithmetic, and matrix-vector operations fit within this paradigm.
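As a toy illustration of this pattern, here is a multithreaded column-mean computation in modern Julia. It uses lightweight threads rather than the master/slave worker processes and shared memory arrays of our actual implementation, so treat it as a schematic of the data decomposition only.

```julia
# Multithreaded column means; each thread handles a disjoint set of columns.
function column_means(X::Matrix{Float64})
    n, p = size(X)
    mu = zeros(p)
    Threads.@threads for j in 1:p
        mu[j] = sum(@view X[:, j]) / n       # no race: each j is written once
    end
    return mu
end
```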

The two most expensive operations are the products $X\beta$ and $X^T(y - X\beta)$. We previously discussed intelligent computation of $X\beta$ via $X_m\beta_m$. Dense multithreaded linear algebra libraries such as BLAS facilitate efficient computation of $X_m\beta_m$. Consequently, we obtain $X\beta$ in $O(nm)$ total operations. In contrast, the gradient $\nabla f(\beta) = -X^T(y - X\beta)$ in the update (5.3) requires a completely dense matrix-vector multiplication with a run-time complexity of $O(np)$. We could address the computational burden with cluster computing, but then communication between the different nodes would diminish performance.

A better alternative for acceleration is to calculate the gradient on a GPU. When the computations are distributed intelligently, a GPU can exploit hundreds of compute units to perform calculations in parallel. As a stream processor, a GPU can provide tremendous computational acceleration. The limiting factors are device memory, which the programmer must manage explicitly, and data transfers. In particular, an optimal GPU implementation must minimize memory transactions between the device GPU and the host CPU because the PCI slot that connects them cannot push large amounts of data at once. Our solution is to push the compressed PLINK matrix $X$ and its column means and precisions onto the device at the start of the algorithm. We also cache device buffers for the residuals and the gradient. Whenever we calculate the gradient, we compute the $n$ residuals on the host and then push the residuals onto the device. At this stage, the device executes two kernels. The first kernel initializes many workgroups of threads and distributes a block of $X^T(y - X\beta)$ to each workgroup. Each thread handles the decompression, standardization, and computation of one component of $X$ with the residuals. The second kernel reduces across all thread blocks and returns the $p$-dimensional gradient. Finally, the host pulls the $p$-dimensional gradient from the device. Thus, after the initialization of the data, our GPU implementation only requires the host and device to exchange $p + n$ floating point numbers per iteration.

5.2.5 Selecting the best model

Given a regularization path computed by the IHT, the obvious way to choose the best model along the path is to resort to $q$-fold cross-validation with mean squared error (MSE) as a selection criterion. For a path of user-supplied model sizes $m_1, m_2, \ldots, m_r$, our implementation of IHT fits the entire path on the $q - 1$ training partitions. We then view the $q$th partition as a testing set and compute its MSE. Finally, we determine the model size $m$ with minimum MSE and refit the data subject to $\|\beta\|_0 \le m$.
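A schematic of this selection loop, with the fitting routine abstracted away, might look as follows. The function `fit_iht` is a hypothetical placeholder for our solver, and in practice each fold fits the entire path at once rather than refitting per model size.

```julia
using Random, Statistics

# q-fold cross-validation of the model size; `fit_iht(X, y, m)` is a hypothetical
# stand-in for the IHT solver and is assumed to return a coefficient vector.
function cv_model_size(X, y, sizes; q = 5)
    n = length(y)
    fold = shuffle(repeat(collect(1:q), ceil(Int, n / q))[1:n])   # random fold labels
    mse = zeros(length(sizes))
    for (i, m) in enumerate(sizes), k in 1:q
        train = fold .!= k
        test = fold .== k
        b = fit_iht(X[train, :], y[train], m)
        mse[i] += mean(abs2, y[test] - X[test, :] * b) / q        # average test MSE
    end
    return sizes[argmin(mse)]
end
```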

5.3 Results

We tested our IHT implementation on a subset of the data from the Northern Finland Birth Cohort 1966 (NFBC1966) [123]. These data contain several biometric phenotypes for 5,402 patients genotyped at 370,404 SNPs. We imputed the missing genotypes in $X$ with Mendel [6] and performed quality control with PLINK 1.9 beta [36]. Our numerical experiments include both simulated and measured phenotypes. For our simulated phenotype, we benchmarked the model recovery and predictive performance of our software against glmnet and ncvreg [30, 62]. Sections 5.3.2 and 5.3.3 include as nongenetic covariates the SEXOCPG factor, which we calculated per Sabatti et al., and the first two principal components of $X$, which we calculated with PLINK 1.9. The experiments were run on a single compute node equipped with four 6-core 2.67 GHz Intel Xeon CPUs and two NVIDIA Tesla C2050 GPUs, each with 6 GB of memory. To simulate performance on a workstation, the experiment only used one GPU and one CPU. The compute environment was 64-bit Julia v0.4.0 with the corresponding OpenBLAS library and LLVM v3.3 compiler.

5.3.1 Simulation

The goal of our first numerical experiment was to demonstrate the superior model selection performance of IHT versus LASSO and MCP. Here we used only the matrix $X_{\text{chr1}}$ of 24,663 SNPs from chromosome 1 of the NFBC1966 dataset. This matrix is sufficiently small to render PLINK compression and GPU acceleration unnecessary. $X_{\text{chr1}}$ uses the 5289 cases with observed BMI. Note that this number is larger than what we will use in Section 5.3.3; no exclusion criteria were applied here since the phenotype was simulated. We standardized observed genotype dosages and then set any unobserved dosages to 0. We simulated $\beta_{\text{true}}$ for true model sizes $m_{\text{true}} \in \{100, 200, 300\}$ with effect sizes drawn from the uniform distribution $U(-0.5, 0.5)$. The simulated phenotypes were then formed as $y_{\text{true}} = X\beta_{\text{true}} + \epsilon$, with each $\epsilon_i \sim N(0, 0.01)$. To assess predictive performance, we separated 289 individuals as a validation set and used the remaining 5,000 individuals for 5-fold cross-validation. We generated 10 different models for each $m_{\text{true}}$. For each replicate, we ran regularization paths of 100 model sizes $m_0, m_0 + 2, m_0 + 4, \ldots, m_0 + 200$ straddling $m_{\text{true}}$ and chose the model with minimum MSE.

By directly choosing the model size, IHT naturally facilitates cross-validation of the model size $m_{\text{best}}$. For cross-validation with LASSO and MCP, we used the cross-validation and response prediction routines in glmnet and ncvreg. To ensure roughly comparable lengths of regularization paths and therefore commensurate compute times, we capped the maximum permissible degrees of freedom at $m_{\text{true}} + 100$ for both LASSO and MCP regression routines. The case of MCP regression is peculiar since ncvreg does not cross-validate the γ parameter. We modified the approach of Breheny and Huang [30] to obtain a suitable γ for each model. Their protocol entails cross-validating λ once with the default $\gamma = 3$ and checking if the optimal λ, which we call $\lambda_{\text{best}}$, exceeds the minimum lambda $\lambda_{\text{min}}$ guaranteeing a convex penalty. Whenever $\lambda_{\text{best}} \le \lambda_{\text{min}}$, we incremented γ by 1 and cross-validated λ again. We repeated this process until $\lambda_{\text{best}} > \lambda_{\text{min}}$. The larger final γ then became the default for the next model selection run, thereby amortizing the selection of a proper value of γ across all 10 models for a given $m_{\text{true}}$. This procedure for selecting γ ensured model selection stability while simultaneously avoiding expensive cross-validation over a full grid of γ and λ values. The reported compute times for MCP reflect this procedure.

Table 5.1: Model selection performance on NFBC1966 chromosome 1 data.

Model size   Penalty    True Pos        Total Pos       MSE               Time
                        Mean(SD)        Mean(SD)        Mean(SD)          Mean(SD)
   100       IHT        94.7 (2.9)      96.6 (3.4)      0.006 (0.0001)    135.3 (6.9)
             LASSO      96.0 (2.6)      157.8 (8.3)     0.012 (0.0002)    122.5 (29.5)
             MCP        88.3 (3.2)      88.3 (3.2)      0.012 (0.0008)    442.0 (115.4)
   200       IHT        190.3 (3.0)     191.2 (3.7)     0.006 (0.0003)    146.0 (6.3)
             LASSO      190.9 (2.3)     256.0 (6.4)     0.017 (0.0003)    64.2 (7.7)
             MCP        172.2 (5.0)     172.2 (5.0)     0.018 (0.0025)    255.1 (10.4)
   300       IHT        283.1 (5.4)     284.4 (5.7)     0.007 (0.0001)    135.8 (9.2)
             LASSO      282.8 (4.5)     347.2 (8.5)     0.027 (0.0017)    137.1 (56.9)
             MCP        252.6 (17.0)    252.6 (17.0)    0.032 (0.0226)    655.5 (271.1)

Table 5.1 shows the results of our simulation. The various columns of the table summarize true positive recovery, recovered model size, prediction error, and compute times for each model size $m_{\text{true}}$. In information theoretic terms, LASSO exhibits excellent recall but suboptimal precision. MCP demonstrates suboptimal recall but optimal precision. IHT ameliorates the worst features of both LASSO and MCP. IHT recovers more true positives than MCP and fewer false positives than LASSO. Furthermore, IHT consistently attains the best prediction error on the validation set. Despite these benefits, the IHT pays only a modest price in computational speed versus LASSO; it is fully competitive with MCP on this metric.

5.3.2 Speed comparisons

Our next numerical experiment highlights the sacrifice in computational speed incurred with compressed genotypes. The genotype matrix $X_{\text{chr1}}$ is now limited to patients with both BMI and directly observed weight, a condition imposed by Sabatti et al. [123]. The response vector $y$ is the log body mass index (BMI) from NFBC1966. As mentioned previously, we included the SEXOCPG factor and the first two principal components as nongenetic covariates. We then ran three different configurations on a single compute node. The first used the floating point version of $X_{\text{chr1}}$. We did not explicitly enable any multicore calculations. For the second run, we used a compressed copy of $X_{\text{chr1}}$ with multicore options enabled, but we disabled the GPU acceleration. The third run used the compressed $X_{\text{chr1}}$ data with both multicore and GPU options enabled. We ran each algorithm over a regularization path of model sizes $m = 1, 2, \ldots, 25$ and averaged compute times over 10 runs. For all uncompressed arrays, we used double precision arithmetic.

Table 5.2 shows the compute times. The floating point variant is clearly the fastest, requiring fewer than 10 seconds to compute all models. The analysis using PLINK compression with a multicore CPU suffers a 7.4-fold slowdown, clearly demonstrating the deleterious effects on compute times resulting from repeated decompression and on-the-fly standardization. Enabling GPU acceleration seemingly fails to rectify this lackluster performance and leads to a 7.7x slowdown. The value of the GPU will be more evident in cross-validation: a 5-fold cross-validation on one machine requires either five hexcore CPUs or one hexcore CPU and one GPU. The latter configuration lies within modern workstation capabilities. We note that our insistence on the use of double precision arithmetic dims the luster of GPU acceleration. Indeed, our experience is that using compressed arrays and a GPU with single precision arithmetic is only 1.7x slower than the corresponding floating point compute times. Furthermore, while we limit our computations to one CPU with six physical cores, including additional cores (both physical and virtual) improves compute times for compressed data without a GPU.

Table 5.2: Computational times in seconds on NFBC1966 chromosome 1 data.

Data type                      Mean Time   Standard Deviation
Uncompressed data                   9.4          0.29
Compressed data, no GPU            69.6          0.04
Compressed data with GPU           72.8          2.70

5.3.3 Application to lipid phenotypes

For our final numerical experiment, we embarked on a genome-wide search for associations based on the full $n \times p$ NFBC1966 genotype matrix $X$. In addition to BMI, this analysis considered three additional phenotypes from Sabatti et al. [123]: HDL cholesterol (HDL), LDL cholesterol (LDL), and triglycerides (TG). HDL, LDL, and TG all use SEXOCPG and the first two principal components as nongenetic covariates, but they also use BMI as a covariate. Quality control on SNPs included filters for minor allele frequencies below 0.01 and Hardy-Weinberg $P$-values below $10^{-5}$. Subjects with unobserved or missing traits were excluded from analysis. We applied further exclusion criteria per Sabatti et al. [123]; for analysis with BMI, we excluded subjects without direct weight measurements, and for analysis of TG, HDL, and BMI, we excluded subjects with nonfasting blood samples and subjects on diabetes medication. These filters yield different values of $n$ and $p$ for each trait. Table 5.3 records problem dimensions and trait transforms.

We performed 5-fold cross-validation for the best model size over a path of sparsity levels $m = 1, 2, \ldots, 50$. Refitting the best model size yielded the selected predictors and effect sizes. Table 5.3 records the compute times and best model sizes, while Table 5.4 shows the SNPs recovered by IHT.

One can immediately see that IHT does not collapse causative SNPs in strong linkage disequilibrium. IHT finds an adjacent pair of SNPs rs6917603 and rs9261256 for HDL. For TG, rs11974409 is one SNP separated from rs2286276, while SNP rs676210 is one SNP separated from rs673548. IHT does not flag rs673548, but its association with TG is known. Common sense suggests treating each associated pair as a single predictor.

Table 5.3: Dimensions of data used for each phenotype in the GWAS experiment. Here n is the number of cases, p is the number of predictors (genetic + covariates), and $m_{\text{best}}$ is the best cross-validated model size. Note that $m_{\text{best}}$ includes nongenetic covariates.

Phenotype      n         p        Transform   m_best   Compute Time (Hours)
BMI          5122     333,656     log             2            1.12
HDL          4729     333,774     none            9            1.28
LDL          4715     333,771     none            6            1.32
TG           4728     333,769     log            10            1.49

Our analysis replicates several associations from the literature but finds new ones as well. For example, Sabatti et al. [123] found associations between TG and the SNPs rs1260326 and rs10096633, while rs2286276 was identified elsewhere. The SNPs rs676210, rs7743187, rs6917603, rs2000571, and rs3010965 are new associations with TG. We find that SNP rs6917603 is associated with all four traits; the BMI connection was missed by Sabatti et al. [123]. The association of SNP rs6917603 with BMI may be secondary to its association to lipid levels. The same comment applies to SNP rs676210; it is very close to rs673548 and is strongly associated with oxidative LDL cholesterol [87]. IHT flags an association between SNP rs2304130 and TG. This association was validated in a large meta-analysis of 3,540 cases and 15,657 controls performed after [123] was published. This example suggests that IHT is better at detecting associations with small sample sizes than traditional model selection methods. Some of the effect sizes in Table 5.4 are difficult to interpret. For example, IHT estimates effects for rs10096633 ($\beta = 0.03781$) and rs1260326 ($\beta = -0.04088$) that are both smaller than and opposite in sign to the estimates in [123].

The potential new associations to TG with SNPs rs7743187 and rs3010965 are absent from the literature. Furthermore, our analysis misses borderline significant associations identified by Sabatti et al. [123], such as rs2624265 for TG and rs9891572 for HDL; these SNPs were never flagged in later studies. In this regard it is worth emphasizing that the best model size $m_{\text{best}}$ delivered by cross-validation is a guide rather than definitive truth. Figure 5.5 shows that the difference in MSE between $m_{\text{best}}$ and adjacent model sizes can be quite small. Models of a few SNPs more or less than $m_{\text{best}}$ predict trait values about as well. Thus, TG with $m = 10$ has MSE 0.2310, while TG with $m = 4$ has MSE 0.2315. Indeed, refitting the TG phenotype with $m = 4$ yields only three significant SNPs: rs1260326, rs6917603, and rs10096633. The SNPs rs7743187 and rs3010965 are absent from this model, so we should be cautious in declaring new associations. This example also highlights the great value in computing many model sizes, which univariate regression schemes typically fail to do.

Finally, we comment on compute times. IHT requires about 1.5 hours to cross-validate the best model size over 50 possible models using double precision arithmetic. Obviously, computing fewer models can decrease this compute time substantially. If the phenotype in question is scaled correctly, then analysis with IHT may be feasible with single precision arithmetic, which yields an additional speedup as suggested in Section 5.3.2. Analyses requiring better accuracy will benefit from the addition of double precision registers in newer GPU models. Thus, our analysis still retains room for speedup without sacrificing model selection performance if compute speed is a key concern.

5.4 Discussion

The current paper demonstrates the utility of iterative hard thresholding (IHT) in large-scale GWAS analysis. The IHT algorithm enjoys provable convergence guarantees despite its nonconvex nature. Its model selection performance exceeds that of more popular and mature tools such as LASSO- and MCP-penalized regression. Our software directly and intelligently handles the compression protocol widely used to store GWAS genotypes. Finally, IHT can be substantially accelerated by exploiting both shared-memory and massively parallel processing hardware. As a rule of thumb, computation times with IHT may scale as $O(np)$ or somewhat worse if more predictors with small effect sizes come into play.

Our implementation addressed a gap in current software packages for GWAS analysis. Lack of general support for PLINK binary genotype data, poor memory management, and limited parallel capabilities discourage use of software such as glmnet or ncvreg. Our IHT package enables analysis of models of any sparsity, while gpu-lasso is designed for very sparse models. It also cross-validates for the best model size over a range of possible models, while Mendel currently only cross-validates one model at a time.

Figure 5.5: Mean squared error as a function of model size, as averaged over 5 cross-validation slices, for four lipid phenotypes from NFBC 1966. (Panels show MSE versus model size for BMI, HDL, LDL, and TG.)

Table 5.4: Computational results from the GWAS experiment. Here β is the calculated effect size. Known associations include the relevant citation.

Phenotype   Chromosome   SNP           Position     β          Status
BMI              6       rs6917603     30125050    -0.01995    Unreported
HDL              6       rs6917603     30125050     0.10100    Reported [123]
                 6       rs9261256     30129920    -0.06252    Nearby
                11       rs7120118     47242866    -0.03351    Reported [123]
                15       rs1532085     56470658    -0.04963    Reported [41]
                16       rs3764261     55550825    -0.02808    Reported [123]
                16       rs7499892     55564091     0.02625    Reported [128]
LDL              1       rs646776     109620053     0.09211    Reported [123]
                 2       rs693         21085700    -0.08544    Reported [123]
                 6       rs6917603     30125050    -0.07536    Reported [83]
TG               2       rs676210      21085029     0.03633    Nearby
                 2       rs1260326     27584444    -0.04088    Reported [123]
                 6       rs7743187     25136642     0.03450    Unreported
                 6       rs6917603     30125050    -0.08215    Unreported
                 7       rs2286276     72625290     0.01858    Reported [86]
                 7       rs11974409    72627326     0.01759    Nearby
                 8       rs10096633    19875201     0.03781    Reported [123]
                13       rs3010965     60937883     0.02828    Unreported
                19       rs2304130     19650528     0.03039    Reported [136]

It is worth reminding readers that IHT is hardly a panacea for GWAS. Analysts must still deal with perennial statistical issues such as correlated predictors and sufficient sample sizes. Furthermore, while the estimation properties of hard thresholding algorithms are well understood, IHT lacks a coherent theory of inference for assessing statistical significance. In contrast, considerable progress has been made in understanding postselection inference with LASSO penalties [98, 100, 130].

As formulated here, the scope of application for IHT is limited to linear least squares regression. Researchers have begun to extend IHT to generalized linear models, particularly logistic regression [7, 146]. We anticipate that IHT will eventually overtake LASSO as the standard tool for sparse regression. In our opinion, GWAS analysis clearly stands to benefit from this advance.

CHAPTER 6

Discussion and Future Research

As evident from the results in Chapter 5, nonconvex optimization has much to offer for statistical genetics and genetic epidemiology. We use the iterative hard thresholding algorithm (IHT) to attack issues with model selection in genome-wide association studies. IHT provides superior model recovery performance compared to current feature selection methods. As a projected gradient algorithm, IHT is similar in spirit to the proximal distance algorithm developed in Chapters 3 and 4. The proximal distance algorithm proves its mettle in both convex and nonconvex settings. For large-scale optimization problems, the proximal distance algorithm maintains good accuracy while requiring reasonable compute times. However, many details remain unresolved.

6.1 Parameter tuning for proximal distance algorithms

The results for the proximal distance algorithms relied on update schedules for the tuning parameter ρ (in Chapters 3 and 4) and ε (in Chapter 3). The parameter updates were chosen heuristically. Heuristic choices are suitable in practice, but as a matter of principle a more rigorous or general strategy is desirable. Unfortunately, exact penalty methods lack a coherent theory behind parameter updates. This field is ripe for theoretical insight. Absent any breakthroughs in the near future, we must rely on what works in practice. Selection of ρ for the proximal distance algorithm of Chapter 4 has an intuitive mathematical interpretation and a practical computational interpretation. Mathematically, the proximal distance algorithm must balance the competing objectives of minimizing the loss function $f$ and satisfying the constraints given in $C$. Therefore, starting ρ small encourages minimization of $f$, which is initially preferable to remaining on the feasible set. From a computational standpoint, ρ controls the tradeoff between speed and accuracy. Gentle update schedules prod the algorithm to move towards the constrained minimum in $C$, while faster updates lose accuracy. The rate at which the user increases ρ thus dictates the degree to which the user accepts inaccuracy for the sake of speed. We present update schedules that ensure accuracy of roughly four to six digits. However, in many high-dimensional settings, an approximate solution usually suffices, and the proximal distance algorithm can give an approximate solution in a short time.

6.2 IHT with nonlinear loss functions

IHT is demonstrably better than LASSO or MCP for model selection with the residual sum of squares loss function. However, unlike IHT, both LASSO and MCP can penalize generalized linear models such as logistic regression and Poisson regression. As suggested in Section 5.4, nascent developments [7, 146] aim to generalize IHT for wider use. For exposition purposes, let us sketch theoretical developments for IHT in logistic regression. Given a statistical design matrix

$X \in \mathbb{R}^{n \times p}$, a statistical model $\beta \in \mathbb{R}^p$, and a binary response variable $y \in \{0, 1\}^n$, logistic regression is the task of minimizing the negative logistic loglikelihood function
$$f(\beta \mid X, y) = \sum_{i=1}^n \log\left(1 + e^{x_i^T\beta}\right) - y^TX\beta. \qquad (6.1)$$
However, minimization of equation (6.1) can run aground in the setting $n \ll p$ pertinent to GWAS. In particular, (6.1) may have multiple minimizers. Furthermore, the data in $X$ can exhibit linear separability in which sending certain values of β to infinity can produce arbitrarily small values of $f$. Standard statistical practice is to regularize (6.1) with a squared $\ell_2$ penalty to produce a modified loss
$$f_\lambda(\beta \mid X, y) = \sum_{i=1}^n \log\left(1 + e^{x_i^T\beta}\right) - y^TX\beta + \frac{\lambda}{2}\|\beta\|_2^2 \qquad (6.2)$$
with regularization parameter $\lambda > 0$. Observe that $f_\lambda$ is λ-strongly convex, so in some sense $f_\lambda$ is relatively easy to optimize. We then impose the sparsity constraint with the $\ell_0$ penalty. Unlike an iteration of normal IHT, which optimizes the loss over a support $C$ of the $m$ largest components of $\beta_k$ in magnitude, an iteration of gradient descent with nonlinear IHT must apply the hard thresholding operator against a slightly larger support $S$. The enlarged support $S$ represents the union of $C$ with the largest $2m$ components in magnitude of $\nabla f_\lambda(\beta_k)$, for a total of up to $3m$ components. The added components from $\nabla f_\lambda(\beta_k)$ represent additional search directions to use in minimizing $f_\lambda$. The loss $f_\lambda$ is then minimized over $S$ to produce an intermediate optimum $\beta_k^\star$ with up to $3m$ nonzeroes. Finally, the vector $\beta_k^\star$ is thresholded to its $m$ largest components in magnitude.
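A small Julia sketch of the regularized loss (6.2), its gradient, and the support expansion step follows. The function names are ours, the formulas follow directly from (6.2), and numerical safeguards (such as avoiding overflow in the exponentials) are omitted.

```julia
using LinearAlgebra

# Ridge-regularized negative logistic loglikelihood (6.2) and its gradient.
f_lambda(beta, X, y, lam) =
    sum(log1p.(exp.(X * beta))) - dot(y, X * beta) + lam / 2 * sum(abs2, beta)

grad_f_lambda(beta, X, y, lam) =
    X' * (1 ./ (1 .+ exp.(-(X * beta))) .- y) .+ lam .* beta

# Enlarged support for nonlinear IHT: the current support of beta (at most m entries)
# united with the 2m largest-magnitude components of the gradient.
function expanded_support(beta, grad, m)
    C = findall(!iszero, beta)
    top = sortperm(abs.(grad), rev = true)[1:2m]
    return union(C, top)
end
```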

In theory this nonlinear IHT is guaranteed to converge under moderate restrictions on the Hessian matrix $\nabla^2 f$. However, the method remains largely untested. The literature [7, 146] makes no mention of unresolved computational headaches such as divergence of parameter values and cycling between different minimizers. Furthermore, guidance on the selection of a suitable value of λ is thin. Finally, unlike the simple cross-validation scheme for normal IHT, cross-validation for nonlinear IHT may require a grid search on λ.

6.3 Other greedy algorithms for linear regression

The discussion of projected gradient algorithms appeals to many established convergence results for gradient descent kernels. However, gradient descent algorithms themselves are not the only solution to model selection. We have developed a novel but currently unpublished greedy selection tool, called the exchange algorithm, that can potentially accomplish sparse linear regression better than IHT with an optimization kernel reminiscent of coordinate descent.

As a prelude to our analysis, let us review least squares for the simple case of a single predictor. For a data vector $x$ of $n$ cases, a corresponding response vector $y$, and an effect size β, the minimum of the residual sum of squares $\frac{1}{2}\sum_{i=1}^n (y_i - x_i\beta)^2$ is attained at $\hat{\beta} = \left(\sum_{i=1}^n y_ix_i\right)/\|x\|_2^2$. The minimum value then becomes
$$\frac{1}{2}\sum_{i=1}^n (y_i - x_i\hat{\beta})^2 = \frac{1}{2}\sum_{i=1}^n y_i^2 - \frac{1}{2\|x\|_2^2}\left(\sum_{i=1}^n y_ix_i\right)^2.$$

With p predictors encoded in a design matrix X = (x_{ij}), the residual sum of squares criterion becomes \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2. Let S be the current set of active predictors. By definition the number of active predictors is fixed at |S| = r. The best predictor ℓ ∉ S to swap with k ∈ S maximizes \left(\sum_{i=1}^{n} z_i x_{iℓ}\right)^2 / \|x_ℓ\|_2^2, where z_i = y_i - \sum_{j \in S\setminus\{k\}} x_{ij}\beta_j is the residual for observation i omitting the effect of predictor k. Often predictor k cannot be improved by any swap. In this situation, the natural response is to update β_k so that it is optimal given the remaining active coefficients β_j. The exchange algorithm cycles through the active predictors multiple times until the residual sum of squares stabilizes. By construction the residual sum of squares decreases at every iteration.
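The following Python sketch of one sweep makes the swapping rule concrete. It ignores the caching discussed next, and the function name and set-based bookkeeping are our own illustrative choices rather than the dissertation's implementation.

import numpy as np

def exchange_sweep(X, y, active, beta):
    # One pass over the active set: try to swap each active predictor k for the
    # best candidate predictor l; if no candidate beats k itself, the update
    # reduces to an ordinary coordinate descent step on beta_k.
    for k in list(active):
        others = [j for j in active if j != k]
        z = y - X[:, others] @ beta[others]                 # residual omitting predictor k
        scores = (X.T @ z) ** 2 / np.sum(X ** 2, axis=0)    # (z' x_l)^2 / ||x_l||^2 for all l
        scores[others] = -np.inf                            # already-active predictors are ineligible
        l = int(np.argmax(scores))
        beta[k] = 0.0
        beta[l] = (z @ X[:, l]) / (X[:, l] @ X[:, l])       # optimal coefficient for the winner
        active.remove(k)
        active.add(l)                                       # l == k leaves the active set unchanged
    return active, beta

Because k itself is always among the candidates, no swap can increase the residual sum of squares, which is why the sweep is monotone.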

Striking reductions in computation times are possible if certain entries of the p × p matrix X^T X are computed once and cached. Obviously, storing the entire matrix is impractical for large p. It is also advantageous to compute and store the squared Euclidean norm \|x_i\|_2^2 of each column x_i of X and the inner product r^T x_i of each column against the residual vector r. In replacing predictor k by predictor ℓ, the residual vector r changes to r + β_k x_k - β_ℓ x_ℓ. The inner product r^T x_i changes to r^T x_i + β_k x_k^T x_i - β_ℓ x_ℓ^T x_i. If the inner products x_ℓ^T x_i for the optimal ℓ are unavailable, then at this junction we compute and store them. Thus, as each inactive predictor becomes active, we compute and store a new column of X^T X. The test inner product \left(y - \sum_{j \in S\setminus\{k\}} \beta_j x_j\right)^T x_ℓ is computed as r^T x_ℓ + β_k x_k^T x_ℓ.
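A minimal sketch of this caching scheme appears below. The class name and the lazy dictionary storage of columns of X^T X are assumptions of ours, meant only to illustrate the update r^T x_i ← r^T x_i + β_k x_k^T x_i - β_ℓ x_ℓ^T x_i.

import numpy as np

class InnerProductCache:
    # Lazily caches columns of X'X, the squared column norms of X,
    # and the running inner products X'r with the residual r.
    def __init__(self, X, r):
        self.X = X
        self.xtx_cols = {}                          # j -> X' x_j, filled on demand
        self.sq_norms = np.sum(X ** 2, axis=0)      # ||x_j||^2 for every column
        self.xtr = X.T @ r                          # X' r for the current residual

    def col(self, j):
        # Return (and cache) the j-th column of X'X.
        if j not in self.xtx_cols:
            self.xtx_cols[j] = self.X.T @ self.X[:, j]
        return self.xtx_cols[j]

    def swap(self, k, l, beta_k, beta_l):
        # Update X'r after replacing predictor k (coefficient beta_k) with
        # predictor l (coefficient beta_l): r <- r + beta_k x_k - beta_l x_l.
        self.xtr += beta_k * self.col(k) - beta_l * self.col(l)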

In cross-validation one needs to compute an entire segment of the solution path as the model size r increases. The simplest tactic is to start with r = 1 and work upward, say by incrementing r by 1 at each new stage. The current parameter estimates can then serve as warm starts for the next value of r. The computational cost of model selection and parameter estimation increases with r, so starting with r = 1 is hardly a barrier to good model selection. If bad predictors enter a model, then they usually exit at later stages. Incrementing r more aggressively does not overcome the major computational burden of computing and storing the relevant inner products. An aggressive scheme also leads to more iterations per stage as the minimum loss is sought. When r is incremented by 1, the new best predictor is often identified immediately, and predictor swapping then functions primarily as coordinate descent. Thus, the combination of predictor swapping and coordinate descent strikes a good compromise between global accuracy and low computational cost.
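Under the assumption of a solver routine exchange_fit (hypothetical here) that runs the exchange algorithm to convergence at a fixed sparsity level, the warm-started path computation might look like the following sketch.

import numpy as np

def exchange_path(X, y, r_max, exchange_fit):
    # Compute estimates for model sizes r = 1, ..., r_max with warm starts.
    # exchange_fit(X, y, r, beta0) is assumed to run the exchange algorithm
    # at sparsity level r, starting from the vector beta0.
    p = X.shape[1]
    beta = np.zeros(p)
    path = []
    for r in range(1, r_max + 1):
        beta = exchange_fit(X, y, r, beta)   # warm start from the previous stage
        path.append(beta.copy())
    return path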

Table 6.1 gives a glimpse of the computational potential of the novel exchange algorithm. The IHT results from Table 5.1 are reprinted for comparison. The exchange algorithm consistently maintains a slight edge over IHT in recovery behavior and predictive accuracy. For smaller models, the exchange algorithm is clearly faster. The speed advantage attenuates as the dimension of the true model increases. Unlike IHT, the exchange algorithm is not clearly extensible to nonlinear models. Nonetheless, these results show promise for the particular case of sparse linear regression.

Model size   Algorithm   True Pos       Total Pos      MSE              Compute Time
                         Mean (SD)      Mean (SD)      Mean (SD)        Mean (SD)
100          IHT         93.9 (3.2)     94.8 (4.0)     0.006 (0.0002)   136.6 (31.2)
             Exchange    94.8 (3.19)    96.1 (3.57)    0.001 (0.0000)   99.9 (29.95)
200          IHT         188.9 (2.9)    191.6 (3.2)    0.006 (0.0001)   185.6 (50.6)
             Exchange    190.6 (2.46)   191.3 (2.67)   0.001 (0.0004)   159.0 (5.11)
300          IHT         278.3 (4.1)    285.4 (5.0)    0.007 (0.0002)   220.8 (45.9)
             Exchange    284.0 (4.94)   284.2 (5.05)   0.001 (0.0000)   241.2 (7.08)

Table 6.1: Model selection performance of IHT and the exchange algorithm on NFBC1966 chromosome 1 data.

CHAPTER 7

Notation

The following notation is used throughout the document.

7.1 Sets

• C: a set.

• R: the real number line.

• R^n: the set of n-vectors with real components.

• R^{m×n}: the set of m × n matrices with real components.

• R_+: the set of nonnegative real numbers.

• R_{++}: the set of positive real numbers.

• S^n: the set of symmetric n × n matrices.

• S^n_+: the set of symmetric positive semidefinite n × n matrices.

• S^n_{++}: the set of symmetric positive definite n × n matrices.

7.2 Vectors and Matrices

• x: a column vector with real components.

• x^T: a row vector with real components.

• x_i: the ith component of a vector x.

• 1: a vector of all 1s.

• 0: a vector of all 0s.

• X: a matrix with real components.

• x_j: the jth column of a matrix X.

• x_{ij}: the (i, j)th component of a matrix X.

• X^T: the transpose of a matrix X.

• vec(X): a vector composed of the columns of a matrix X stacked in numerical order.

• I_n: the n × n identity matrix.

• tr X: the trace of a matrix X.

• X^{-1}: the inverse of a square invertible matrix X.

• X^†: the Moore-Penrose pseudoinverse of a matrix X.

• diag x: a diagonal matrix with diagonal components x_1, x_2, ..., x_n.

• diag X: the diagonal of a matrix X, a vector with components x_{11}, x_{22}, ..., x_{nn}.

• ⟨x, y⟩: the inner product of two vectors x and y, denoted x^T y in real Euclidean spaces.

• span{x_1, x_2, ..., x_n}: the linear span of a set of vectors {x_1, x_2, ..., x_n}.

• λ(X): a vector containing the eigenvalues of a matrix X ∈ R^{n×n}.

• λ_min(X), λ_max(X): the minimum (maximum) eigenvalue of X ∈ R^{n×n}.

• σ(X): a vector containing the singular values of a matrix X ∈ R^{m×n}.

• σ_min(X), σ_max(X): the minimum (maximum) singular value of X ∈ R^{m×n}.

7.3 Norms and Distances

• ‖·‖: a general norm.

• ‖x‖_2: the vector ℓ_2 or Euclidean norm.

• ‖x‖_1: the vector ℓ_1 or taxicab norm.

• ‖x‖_∞: the vector ℓ_∞ or Chebyshev norm.

• ‖X‖_F: the matrix Frobenius norm, equivalent to ‖vec(X)‖_2.

• ‖X‖_1: the matrix ℓ_1 norm, equivalent to ‖vec(X)‖_1.

• ‖X‖_2: the spectral matrix norm, equivalent to σ_max(X).

• dist(x, C): the Euclidean distance between a point x and a set C.

• dist(x, y): the Euclidean distance between two points x and y.

7.4 Functions and Calculus

• f(x): a real-valued function f : R → R of a scalar variable x.

• f(x): a real-valued function f : R^n → R of a vector variable x.

• f(X): a real-valued function f : R^{m×n} → R of a matrix variable X.

• dom f: the domain of a function f.

• f′(x): the first derivative of a univariate function f at x.

• f″(x), f‴(x): the second and third derivatives of a univariate function f at x.

• ∇f(x): the gradient of a multivariate function f at x.

• ∇²f(x): the Hessian matrix of a multivariate function f at x.

• f ◦ g(x): the function composition f(g(x)) at x.

7.5 Projections and Proximal Operators

• Π_C: the projection operator onto a set C.

• prox_{tf}: the proximal operator of a function f with scale parameter t.

• δ_C: the 0/∞ indicator function for a set C.

7.6 Computation

• O(n): big-O notation for computational complexity.

• x_k: the kth iterate of a scalar algorithm variable x.

• x_k: the kth iterate of a vector algorithm variable x.

• X_k: the kth iterate of a matrix algorithm variable X.

• x_{k,i}: the ith component of a vector iterate x_k.

• x_{k,ij}: the (i, j)th component of a matrix iterate X_k.

REFERENCES

[1] Gad Abraham, Adam Kowalczyk, Justin Zobel, and Michael Inouye. SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction. BMC Bioinformatics, 13(1):88, 2012.

[2] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.

[3] Erik Agrell, Thomas Eriksson, Alexander Vardy, and Kenneth Zeger. Closest point search in lattices. IEEE Transactions on Information Theory, 48(8):2201–2214, 2002.

[4] Farid Alizadeh and Donald Goldfarb. Second-order cone programming. Mathematical Programming, 95:3–51, 2003.

[5] Larry Armijo. Minimization of functions having Lipschitz continuous first partial deriva- tives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

[6] Kristin Ayers and Kenneth Lange. Penalized estimation of haplotype frequencies. Bioinfor- matics, 24:1596–1602, 2008.

[7] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(3):807–841, 2013.

[8] Leonard E. Baum. An inequality and associated maximization technique in statistical esti- mation for probabilistic functions of markov processes. Inequalities, 3:1–8, 1972.

[9] Heinz H Bauschke. Projection algorithms and monotone operators. PhD thesis, Theses (Dept. of Mathematics and Statistics)/Simon Fraser University, 1996.

[10] Heinz H Bauschke, Jonathan M Borwein, and Wu Li. Strong conical hull intersection prop- erty, bounded linear regularity, jamesons property (g), and error bounds in convex optimiza- tion. Mathematical Programming, 86(1):135–160, 1999.

[11] Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

[12] Amir Beck and Marc Teboulle. A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Sciences, 2(1):183–202, 2009.

[13] Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recov- ery. Convex Optimization in Signal Processing and Communications, pages 42–88, 2009.

[14] Edward J Beltrami. An Algorithmic Approach to Nonlinear Analysis and Optimization. Academic Press, 1970.

[15] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. SIAM, 2001.

[16] Abraham Berman and Robert J Plemmons. Nonnegative Matrices in the Mathematical Sciences. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 1994.

[17] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse prin- cipal component detection. In Conference on Learning Theory, pages 1046–1066, 2013.

[18] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.

[19] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[20] Dimitri P Bertsekas, Angelia Nedić, and Asuman Özdağlar. Convex analysis and optimization. Athena Scientific, Belmont, MA, 2003.

[21] Thomas Blumensath. Accelerated iterative hard thresholding. Signal Processing, 2(1):183– 202, 2012.

[22] Thomas Blumensath and Michael E. Davies. Iterative hard thresholding for sparse approxi- mation. Journal of Fourier Analysis and Applications, 14:629–654, 2008.

[23] Thomas Blumensath and Michael E. Davies. Iterative hard thresholding for compressed sensing. Applications of Computational and Harmonic Analysis, 27:265–274, 2009.

[24] Thomas Blumensath and Michael E. Davies. Normalized iterative hard thresholding: Guar- anteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2):298–309, 2010.

[25] Ingwer Borg and Patrick J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, 2007.

[26] Jonathan M Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.

[27] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2009.

[28] James P Boyle and Richard L Dykstra. A method for finding projections onto the intersec- tion of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, pages 28–47. Springer, 1986.

[29] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[30] Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1):232–253, 2011.

[31] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20:1956–1982, 2010.

[32] Emmanuel Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[33] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[34] Emmanuel J Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2010.

[35] Augustin Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus hebdomadaires des séances de l'Académie des Sciences, 25:536–538, 1847.

[36] Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(7), 2015.

[37] Caihua Chen, Bingsheng He, and Xiaoming Yuan. Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis, 32:227–245, 2012.

[38] Gary K. Chen. A scalable and portable framework for massively parallel variable selection in genetic association studies. Bioinformatics, 28:719–720, 2012.

[39] Eric C Chi, Hua Zhou, Gary K Chen, Diego O Del Vecchyo, and Kenneth Lange. Genotype imputation via matrix completion. Genome Research, 23:509–518, March 2013.

[40] Francis H. Clarke. Optimization and Nonsmooth Analysis. Wiley, New York, 1983.

[41] Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45:1274–1283, 2013.

[42] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.

[43] George B Dantzig. Linear Programming. In Proceedings of Symposium on Modern Calcu- lating Machinery and Numerical Methods, UCLA, July 1948. Applied Mathematics, Series 15, National Bureau of Standards, June 1951, pp. 18–21.

[44] Alexandre D'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert R G Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[45] Charles Jean de la Vallee´ Poussin. Sur la methode de l’approximation minimum. Annales de la Societe de Bruxelles, 35(2):1–16, 1910.

[46] Jan de Leeuw. Applications of convex analysis to multidimensional scaling. In Jean René Barra, F. Brodeau, G. Romier, and Bernard van Cutsem, editors, Recent Developments in Statistics, pages 133–146. North Holland Publishing Company, 1 edition, 1977.

[47] Jan de Leeuw. Multivariate analysis with optimal scaling. In Somesh Das Gupta and Jayanta K. Ghosh, editors, Proceedings of the International Conference on Advances in Multivariate Statistical Analysis, pages 127–160, 1988.

[48] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[49] Vladimir F. Demyanov. Nonsmooth optimization. In Gianni Di Pillo and Fabio Schoen, editors, Nonlinear Optimization. Springer, New York, NY, 2010.

[50] Vladimir F. Demyanov, Gianni Di Pillo, and Francisco Facchinei. Exact penalization via Dini and Hadamard conditional derivatives. Optimization Methods and Software, 9(1-3):19–36, 1998.

[51] Frank R Deutsch. Best approximation in inner product spaces. Springer Science & Business Media, 2012.

[52] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.

[53] Annette J. Dobson and Adrian G. Barnett. An Introduction to Generalized Linear Models, volume 3. Chapman and Hall/CRC Press, 2008.

[54] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. arXiv:1508.01982 [math.OC], 2015.

[55] Richard L Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.

[56] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[57] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35:652–655, 1949.

[58] Simon Foucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.

[59] Jean Baptiste Joseph Fourier. Solution d'une question particulière du calcul des inégalités. In Oeuvres de Fourier, pages 317–319. 1826. Reprinted by Tome II Olms, Hildesheim, 1970.

[60] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

[61] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, July 2008.

[62] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for gener- alized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[63] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2015.

[64] Gene Golub, Virginia Klema, and Gilbert W Stewart. Rank degeneracy and least squares problems. Technical report, Stanford University Department of Computer Science, 1976.

[65] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 4 edition, 2012.

[66] Marshall Hall and Morris Newman. Copositive and completely positive quadratic forms. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 59, pages 329–339. Cambridge Univ Press, 1963.

[67] Herman O. Hartley. Maximum likelihood estimation from incomplete data. Biometrics, 14:174–194, 1958.

[68] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learn- ing. Springer, 2 edition, 2009.

[69] Willem J Heiser. Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent advances in descriptive multivariate analysis, pages 157–189, 1995.

[70] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. NBS, 1952.

[71] Nicholas J Higham. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.

[72] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

[73] Jean-Baptiste Hiriart-Urruty and Alberto Seeger. A variational approach to copositive ma- trices. SIAM Review, 52:593–629, 2010.

[74] Alan J Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4), 1952.

[75] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.

[76] David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. American Statistician, 58:30–37, 2004.

103 [77] Vaithilingam Jeyakumar and Dinh The Luc. Approximate jacobian matrices for nons- mooth continuous maps and c1-optimization. SIAM Journal on Control and Optimization, 36(5):1815–1832, 1998.

[78] Dingfeng Jiang and Jian Huang. Majorization-Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models. Technical Report 412, Department of Statistics and Actuarial Science, The University of Iowa, October 2011.

[79] Charles R Johnson and Robert Reams. Constructing copositive matrices from interior ma- trices. Electronic Journal of Linear Algebra, 17:9–20, 2008.

[80] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal com- ponents analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.

[81] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11:517–553, 2010.

[82] Leonid V. Kantorovich. Mathematical methods of organizing and planning production. Management Science, 6(4):366–422, 1960. Translation of original 1939 manuscript from Publication House of the Leningrad State University, 68 pages.

[83] Johannes Kettunen, Taru Tukiainen, Antti-Pekka Sarin, Alfredo Ortega-Alonso, Emmi Tikkanen, Leo-Pekka Lyytikäinen, Antti J Kangas, Pasi Soininen, Peter Würtz, Kaisa Silander, Danielle M Dick, Richard J Rose, Markku J Savolainen, Jorma Viikari, Mika Kähönen, Terho Lehtimäki, Kirsi H Pietiläinen, Michael Inouye, Mark I McCarthy, Antti Jula, Johan Eriksson, Olli T Raitakari, Veikko Salomaa, Jaakko Kaprio, Marjo-Riitta Järvelin, Leena Peltonen, Markus Perola, Nelson B Freimer, Mika Ala-Korpela, Aarno Palotie, and Samuli Ripatti. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature Genetics, 44:269–276, 2012.

[84] Henk AL Kiers. Majorization as a tool for optimizing a class of matrix functions. Psy- chometrika, 55:417–428, 1990.

[85] Henk AL Kiers and Jos MF ten Berge. Minimization of a class of matrix trace functions by means of refined majorization. Psychometrika, 57:371–382, 1992.

[86] Young Jin Kim, Min Jin Go, Cheng Hu, Chang Bum Hong, Yun Kyoung Kim, Ji Young Lee, Joo-Yeon Hwang, Ji Hee Oh, Dong-Joon Kim, Nam Hee Kim, Soeui Kim, Eun Jung Hong, Ji-Hyun Kim, Haesook Min, Yeonjung Kim, Rong Zhang, Weiping Jia, Yukinori Okada, Atsushi Takahashi, Michiaki Kubo, Toshihiro Tanaka, Naoyuki Kamatani, Koichi Matsuda, MAGIC Consortium, Taesung Park, Bermseok Oh, Kuchan Kimm, Daehee Kang, Chol Shin, Nam H Cho, Hyung-Lae Kim, Bok-Ghee Han, Jong-Young Lee, and Yoon Shin Cho. Large-scale genome-wide association studies in east asians identify new genetic loci influencing metabolic traits. Nature Genetics, 43(10):990–995, 2011.

[87] Mäkelä KM, Seppälä I, Hernesniemi JA, Lyytikäinen LP, Oksala N, Kleber ME, Scharnagl H, Grammer TB, Baumert J, Thorand B, Jula A, Hutri-Kähönen N, Juonala M, Laitinen T, Laaksonen R, Karhunen PJ, Nikus KC, Nieminen T, Laurikka J, Kuukasjärvi P, Tarkka M, Viik J, Klopp N, Illig T, Kettunen J, Ahotupa M, Viikari JS, Kähönen M, Raitakari OT, Karakas M, Koenig W, Boehm BO, Winkelmann BR, März W, and Lehtimäki T. Genome-wide association study pinpoints a new functional apolipoprotein B variant influencing oxidized low-density lipoprotein levels but not cardiovascular events: Atheroremo consortium. Cardiovascular Genetics, 6:73–81, 2013.

[88] Mark Aleksandrovich Krasnosel'skii. Two remarks on the method of successive approximations. Uspekhi Matematicheskikh Nauk, 10(1):123–127, 1955.

[89] Aleksei N Krylov. On the numerical solution of the equation by which the frequency of small oscillations is determined in technical problems. News of the Academy of Sciences of the USSR, 4:491–539, 1931.

[90] Kenneth Lange, Jeanette C Papp, Janet S Sinsheimer, and Eric M Sobel. Next generation statistical genetics: Modeling, penalization, and optimization in high-dimensional data, 2013.

[91] Kenneth Lange. An adaptive barrier method for convex programming. Methods and Applications of Analysis, 1(4):392–402, 1994.

[92] Kenneth Lange. Numerical Analysis for Statisticians. Springer Science & Business Media, 2010.

[93] Kenneth Lange. Optimization. Springer, New York, 2nd edition, 2010.

[94] Kenneth Lange. MM Optimization Algorithms. SIAM, 2016.

[95] Kenneth Lange, David Hunter, and Ilsoon Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9:1–59, 2000.

[96] Kenneth Lange, Jeanette C. Papp, Janet S. Sinsheimer, and Eric M. Sobel. Next generation statistical genetics: Modeling, penalization, and optimization in high-dimensional data. Annual Review of Statistics and Its Application, 1(1):279–300, 2014.

[97] Kenneth Lange, Jeanette C. Papp, Janet S. Sinsheimer, Ram Sripracha, Hua Zhou, and Eric M. Sobel. Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570, 2013.

[98] Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference with the lasso. arXiv preprint arXiv:1311.6238, 2013.

[99] Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193–228, 1998.

[100] Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani. A significance test for the lasso. The Annals of Statistics, 42(2):413–468, 2014.

[101] Miles Lubin and Iain Dunning. Computing in operations research using Julia. INFORMS Journal on Computing, 27(2):238–248, 2015.

[102] Stephane´ Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. SIAM Journal on Computing, 24(2):3397–3415, 1993.

[103] W Robert Mann. Mean value methods in iteration. Proceedings of the American Mathe- matical Society, 4(3):506–510, 1953.

[104] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algo- rithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.

[105] Anderson G. McKendrick. Applications of mathematics to medical problems. Proceedings of the Edinburgh Mathematical Society, 44:1–34, 1926.

[106] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley, Hoboken, NJ, 2 edition, 2008.

[107] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.

[108] Katta G Murty and Feng-Tien Yu. Linear Complementarity, Linear and Nonlinear Pro- gramming. Heldermann Verlag, West Berlin, 1988.

[109] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[110] Deanna Needell and Joel A. Tropp. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[111] Yurii Nesterov, Arkadii Nemirovskii, and Yinyu Ye. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.

[112] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[113] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journel of Optimization Theory and Applications, pages 1–27, 2016.

[114] James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York and London, 1970.

[115] Christopher C Paige and Michael A Saunders. Algorithm 583: LSQR: Sparse linear equa- tions and least squares problems. ACM Transactions on Mathematical Software (TOMS), 8(2):195–209, 1982.

106 [116] Christopher C Paige and Michael A Saunders. LSQR: An algorithm for sparse linear equa- tions and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43–71, 1982.

[117] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimiza- tion, 1(3):123–231, 2013.

[118] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.

[119] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julia Maller, Pamela Sklar, Paul I. W. de Bakker, Mark J. Daly, and Pak C. Sham. PLINK: A tool set for whole-genome association and population-based linkage anal- yses. American Journal of Human Genetics, 81(3):559–575, 2007.

[120] Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, 2015.

[121] Andrzej P Ruszczynski.´ Nonlinear optimization, volume 13. Princeton University Press, 2006.

[122] Yousef Saad. Iterative methods for sparse linear systems. Siam, 2003.

[123] Chiara Sabatti, Susan K Service, Anna-Liisa Hartikainen, Anneli Pouta, Samuli Ripatti, Jae Brodsky, Chris G Jones, Noah A Zaitlen, Teppo Varilo, Marika Kaakinen, et al. Genome- wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics, 41(1):35–46, 2009.

[124] Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015–1034, 2008.

[125] Cedric A. B. Smith. Counting methods in genetical statistics. Annals of Human Genetics, 21:254–276, 1957.

[126] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[127] Rolf Sundberg. An iterative method for solution of the likelihood equations for incomplete data from exponential families. Communications in Statistics, Series B, 5:55–64, 1976.

[128] Ida Surakka, John B. Whitfield, Markus Perola, Peter M. Visscher, Grant W. Montgomery, Mario Falchi, Gonneke Willemsen, Eco J. C. de Geus, Patrik K. E. Magnusson, Kaare Christensen, Thorkild I. A. Sørensen, Kirsi H. Pietiläinen, Taina Rantanen, Kaisa Silander, Elisabeth Widén, Juha Muilu, Iffat Rahman, Ulrika Liljedahl, Ann-Christine Syvänen, Aarno Palotie, Jaakko Kaprio, Kirsten O. Kyvik, Nancy L. Pedersen, Dorret I. Boomsma, Tim Spector, Nicholas G. Martin, Samuli Ripatti, and Leena Peltonen. A genome-wide association study of monozygotic twin-pairs suggests a locus related to variability of serum high-density lipoprotein cholesterol. Twin Research and Human Genetics, 15:691–699, 2012.

[129] Torbjørn Taskjelle. Alignment of TikZ pictures in subfigures, 2016. Figure from answer to TeX StackExchange question 302589.

[130] Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Pro- ceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

[131] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[132] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via or- thogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[133] John von Neumann. A model of general economic equilibrium. Review of Economic Stud- ies, 13(1):1–9, 1946.

[134] Andreas Wächter and Lorenz T Biegler. Line search filter methods for nonlinear programming: Motivation and global convergence. SIAM Journal on Optimization, 16(1):1–31, 2005.

[135] Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.

[136] Dawn M. Waterworth, Sally L. Ricketts, Kijoung Song, Li Chen, Jing Hua Zhao, Samuli Ripatti, Yurii S. Aulchenko, Weihua Zhang, Xin Yuan, Noha Lim, Jian’an Luan, Sofie Ashford, Eleanor Wheeler, Elizabeth H. Young, David Hadley, John R. Thompson, Pe- ter S. Braund, Toby Johnson, Maksim Struchalin, Ida Surakka, Robert Luben, Kay-Tee Khaw, Sheila A. Rodwell, Ruth J.F. Loos, S. Matthijs Boekholdt, Michael Inouye, Panagi- otis Deloukas, Paul Elliott, David Schlessinger, Serena Sanna, Angelo Scuteri, Anne Jack- son, Karen L. Mohlke, Jaako Tuomilehto, Robert Roberts, Alexandre Stewart, Y. Antero Kesaniemi,¨ Robert W. Mahley, Scott M. Grundy, Wellcome Trust Case Control Consor- tium, Wendy McArdle, Lon Cardon, Gerard´ Wæber, Peter Vollenweider, John C. Cham- bers, Michael Boehnke, Gonc¸alo R. Abecasis, Veikko Salomaa, Marjo-Riitta Jarvelin,¨ Aimo Ruokonen, Inesˆ Barroso, Stephen E. Epstein, Hakon H. Hakonarson, Daniel J. Rader, Muredach P. Reilly, Jacqueline C.M. Witteman, Alistair S. Hall, Nilesh J. Samani, David P. Strachan, Philip Barter, Cornelia M. van Duijn, Jaspal S. Kooner, Leena Pelto- nen, Nicholas J. Wareham, Ruth McPherson, Vincent Mooser, and Manjinder S. Sandhu. Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Ar- teriosclerosis, Thrombosis, and Vascular Biology, 30:2264–2276, 2010.

[137] Endre Weiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, 43:355–386, 1937. Translated into English and annotated by Plastria, F. (2009), "On the point for which the sum of the distances to n given points is minimum", in Drezner and Plastria (2009), pp. 7–41.

[138] Kris A. Wetterstrand. DNA sequencing costs: data from the NHGRI Genome Sequencing Program, 2016.

[139] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[140] Philip Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

[141] Philip Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM Review, 13(2):185–188, 1971.

[142] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. Genome- wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714– 721, 2009.

[143] Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.

[144] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7):565–569, 2010.

[145] Frank Yates. The analysis of multiple classifications with unequal numbers in different classes. Journal of the American Statistical Association, 29:51–66, 1934.

[146] Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. CoRR, abs/1311.5750, 2013.

[147] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. An- nals of Statistics, 38(2):894–942, 2010.

[148] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal components analysis. Jour- nal of Computational and Graphical Statistics, 15(2):262–282, 2006.
