UCLA Electronic Theses and Dissertations
Title Projection algorithms for large scale optimization and genomic data analysis
Permalink https://escholarship.org/uc/item/95v3t2nk
Author Keys, Kevin Lawrence
Publication Date 2016
Peer reviewed|Thesis/dissertation
UNIVERSITY OF CALIFORNIA
Los Angeles
Projection algorithms for large scale optimization and genomic data analysis
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Biomathematics
by
Kevin Lawrence Keys
2016

© Copyright by
Kevin Lawrence Keys
2016

ABSTRACT OF THE DISSERTATION
Projection algorithms for large scale optimization and genomic data analysis
by
Kevin Lawrence Keys
Doctor of Philosophy in Biomathematics
University of California, Los Angeles, 2016
Professor Kenneth L. Lange, Chair
The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton’s method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have intensively developed first-order methods that rely only on gradients and subgradients of a cost function.
In this dissertation we focus on projected gradient methods for large-scale constrained optimization. We develop a particular case of a proximal gradient method called the proximal distance algorithm. Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. To optimize the loss function f(x) over a constraint set C,
the proximal distance principle mandates minimizing the penalized loss f(x) + (ρ/2) dist(x, C)² and following the solution x_ρ to its limit as ρ → ∞. At each iteration the squared Euclidean distance dist(x, C)² is majorized by ‖x − Π_C(x_k)‖², where Π_C(x_k) denotes the projection of the current iterate x_k onto C. The minimum of the surrogate function f(x) + (ρ/2)‖x − Π_C(x_k)‖² is given by the proximal map prox_{ρ⁻¹f}[Π_C(x_k)]. The next iterate x_{k+1} automatically decreases the original penalized loss for fixed ρ. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.
For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding (IHT) for a large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.
The dissertation of Kevin Lawrence Keys is approved.
Lieven Vandenberghe
Marc Adam Suchard
Van Maurice Savage
Kenneth L. Lange, Committee Chair
University of California, Los Angeles
2016
To my parents
TABLE OF CONTENTS
1 Introduction ...... 1
2 Convex Optimization ...... 4
2.1 Convexity ...... 5
2.2 Projections and Proximal Operators ...... 7
2.3 Descent Methods ...... 9
2.3.1 Gradient Methods ...... 10
2.3.2 Proximal Gradient Method ...... 11
2.4 Second-order methods ...... 12
2.4.1 Newton’s method ...... 12
2.4.2 Conjugate gradient method ...... 13
2.5 The MM Principle ...... 14
3 The Proximal Distance Algorithm ...... 17
3.1 An Adaptive Barrier Method ...... 17
3.2 MM for an Exact Penalty Method ...... 21
3.2.1 Exact Penalty Method for Quadratic Programming ...... 23
3.3 Distance Majorization ...... 24
3.4 The Proximal Distance Method ...... 25
3.5 Examples ...... 29
3.5.1 Projection onto an Intersection of Closed Convex Sets ...... 29
3.5.2 Network Optimization ...... 31
3.5.3 Nonnegative Quadratic Programming ...... 33
3.5.4 Linear Regression under an ℓ0 Constraint ...... 36
3.5.5 Matrix Completion ...... 36
3.5.6 Sparse Precision Matrix Estimation ...... 40
3.6 Discussion ...... 43
4 Accelerating the Proximal Distance Algorithm ...... 45
4.1 Derivation ...... 45
4.2 Convergence and Acceleration ...... 48
4.3 Examples ...... 52
4.3.1 Linear Programming ...... 52
4.3.2 Nonnegative Quadratic Programming ...... 55
4.3.3 Closest Kinship Matrix ...... 57
4.3.4 Projection onto a Second-Order Cone Constraint ...... 59
4.3.5 Copositive Matrices ...... 62
4.3.6 Linear Complementarity Problem ...... 64
4.3.7 Sparse Principal Components Analysis ...... 65
4.4 Discussion ...... 70
5 Iterative Hard Thresholding for GWAS Analysis ...... 72
5.1 Introduction ...... 72
5.2 Methods ...... 74
5.2.1 Penalized regression ...... 74
5.2.2 Calculating step sizes ...... 78
5.2.3 Bandwidth optimizations ...... 78
5.2.4 Parallelization ...... 79
5.2.5 Selecting the best model ...... 80
5.3 Results ...... 80
vii 5.3.1 Simulation ...... 81
5.3.2 Speed comparisons ...... 83
5.3.3 Application to lipid phenotypes ...... 84
5.4 Discussion ...... 86
6 Discussion and Future Research ...... 90
6.1 Parameter tuning for proximal distance algorithms ...... 90
6.2 IHT with nonlinear loss functions ...... 91
6.3 Other greedy algorithms for linear regression ...... 92
7 Notation ...... 95
7.1 Sets ...... 95
7.2 Vectors and Matrices ...... 95
7.3 Norms and Distances ...... 97
7.4 Functions and Calculus ...... 97
7.5 Projections and Proximal Operators ...... 98
7.6 Computation ...... 98
References ...... 99
LIST OF FIGURES
1.1 The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore’s law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Illumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases. ...... 2
2.1 A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129]...... 5
4.1 Proportion of variance explained by q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA. ...... 68
4.2 Computation times for q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA. ...... 69
5.1 A visual representation of model selection with the LASSO. The addition of the ℓ1 penalty encourages representation of y by a subset of the columns of X. ...... 75
5.2 A graphical representation of penalized (regularized) regression using norm balls. From left to right, the graphs show ℓ2 or Tikhonov regression, ℓ1 or LASSO regression, and ℓ0 or subset regression. The ellipses denote level curves around the unpenalized optimum β. The penalized optimum occurs at the intersection of the level curves with the norm ball. Tikhonov regularization provides some shrinkage, while the shrinkage from LASSO regularization is more dramatic. The ℓ0 norm enforces sparsity without shrinkage. The MCP “norm ball” cannot be easily drawn but sits between the ℓ1 and ℓ0 balls. ...... 76
5.3 A view of sparse regression with thresholding operators. The order from left to right differs from Figure 5.2: the ℓ1 operator or soft thresholding operator, the MCP or firm thresholding operator, and the ℓ0 operator or hard thresholding operator. We clearly see how MCP interpolates the soft and hard thresholding operators. ...... 76
5.4 A visual representation of IHT. The algorithm starts at a point y and steps in the direction −∇f(y) with magnitude µ to an intermediate point y⁺. IHT then enforces sparsity by projecting onto the sparsity set S_m. The projection for m = 2 is the identity projection in this example, while projection onto S_0 merely sends y⁺ to the origin 0. Projection onto S_1 preserves the larger of the two components of y⁺. ...... 77
5.5 Mean squared error as a function of model size, as averaged over 5 cross-validation slices, for four lipid phenotypes from NFBC 1966...... 87
LIST OF TABLES
3.1 Performance of the adaptive barrier method in linear programming...... 21
3.2 Dykstra’s algorithm versus the proximal distance algorithm...... 31
3.3 CPU times in seconds and iterations until convergence for the network optimization problem. Asterisks denote computer runs exceeding computer memory limits. Iterations were capped at 200. ...... 33
3.4 CPU times in seconds and optima for the nonnegative quadratic program. Abbreviations: n for the problem dimension, MM for the proximal distance algorithm, CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP. ...... 34
3.5 Numerical experiments comparing MM to MATLAB’s lasso. Each row presents averages over 100 independent simulations. Abbreviations: n the number of cases, p the number of predictors, d the number of actual predictors in the generating model, p₁ the number of true predictors selected by MM, p₂ the number of true predictors selected by lasso, λ the regularization parameter at the LASSO optimal loss, L₁ the optimal loss from MM, L₁/L₂ the ratio of L₁ to the optimal LASSO loss, T₁ the total computation time in seconds for MM, and T₁/T₂ the ratio of T₁ to the total computation time of lasso. ...... 37
3.6 Comparison of the MM proximal distance algorithm to SoftImpute. Abbreviations: p is the number of rows, q is the number of columns, α is the ratio of observed entries to total entries, r is the rank of the matrix, L₁ is the optimal loss under MM, L₂ is the optimal loss under SoftImpute, T₁ is the total computation time (in seconds) for MM, and T₂ is the total computation time for SoftImpute. ...... 39
3.7 Numerical results for precision matrix estimation. Abbreviations: p for matrix dimension, k_t for the number of nonzero entries in the true model, k₁ for the number of true nonzero entries recovered by the proximal distance algorithm, k₂ for the number of true nonzero entries recovered by glasso, ρ the average tuning constant for glasso for a given k_t, L₁ the average loss from the proximal distance algorithm, L₁ − L₂ the difference between L₁ and the average loss from glasso, T₁ the average compute time in seconds for the proximal distance algorithm, and T₁/T₂ the ratio of T₁ to the average compute time for glasso. ...... 43
4.1 CPU times and optima for linear programming. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized to be sparse with sparsity level 0.01. ...... 54
4.2 CPU times and optima for nonnegative quadratic programming. Here n is the number of variables, PD is the accelerated proximal distance algorithm, IPOPT is the Ipopt solver, and Gurobi is the Gurobi solver. After n = 512, the constraint matrix A is sparse...... 56
4.3 CPU times and optima for the closest kinship matrix problem. Here the kinship matrix is n × n, PD1 is the proximal distance algorithm, PD2 is the accelerated proximal distance, PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss, and Dykstra is Dykstra’s adaptation of alternating projections. All times are in seconds...... 58
4.4 CPU times and optima for the second-order cone projection. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized with sparsity level 0.01...... 61
4.5 CPU times (seconds) and optima for approximating the Horn variational index of a Horn matrix. Here n is the size of Horn matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 63
4.6 CPU times and optima for testing the copositivity of random symmetric matrices. Here n is the size of matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 64
4.7 CPU times (seconds) and optima for the linear complementarity problem with ran- domly generated data. Here n is the size of matrix, PD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 65
5.1 Model selection performance on NFBC1966 chromosome 1 data...... 82
5.2 Computational times in seconds on NFBC1966 chromosome 1 data...... 84
5.3 Dimensions of data used for each phenotype in GWAS experiment. Here n is the number of cases, p is the number of predictors (genetic + covariates), and m_best is the best cross-validated model size. Note that m_best includes nongenetic covariates. ...... 85
5.4 Computational results from the GWAS experiment. Here β is the calculated effect size. Known associations include the relevant citation...... 88
6.1 Model selection performance of IHT and the exchange algorithm on NFBC1966 chromosome 1 data...... 94
ACKNOWLEDGMENTS
The material presented in this dissertation was funded by the UCLA Graduate Opportunity Fellowship Program, a National Science Foundation Graduate Research Fellowship (DGE-0707424), a Predoctoral Training Grant from the National Human Genome Research Institute (HG002536), and financial support from the UCLA Department of Biomathematics, the Stanford University Department of Statistics, and the startup funds of Kenneth Lange.
The students in the Biomathematics program unwillingly bore the brunt of the highs and lows of my graduate research career, and the unfortunate souls that occupied my office suffered the wrath of prodigious puns, dark humor, and gratuitous foul language. For their tolerance, I wish to particularly thank Forrest Crawford, Gabriela Cybis, Joshua Chang, Wesley Kerr, Lae Un Kim, Trevor Shaddox, Bhaven Mistry, and Timothy Stutz.
During the course of graduate school I learned that undergraduate mentors are also lifelong mentors. Marc Tischler, Joseph Watkins, and William Yslas Vélez never hesitated to lend advice or a reference letter. Dr. Vélez informed me that he is slated to retire in 2017, a meritorious reward after bringing thousands of undergraduates through the mathematics program at The University of Arizona. I aspire to someday possess even an ounce of his work ethic. I will always remember fondly my interactions with María Teresa Vélez, the other Dr. Vélez, former associate dean of the Graduate College at The University of Arizona, who always greeted me at conferences with a warm embrace and a motherly concern for my degree progress. She tragically passed away mere weeks before I defended this dissertation. May she rest in power.
The research in this dissertation represents collaborative work with Gary K. Chen and Hua Zhou, both of whom are much better programmers than I could ever hope to be. Their patient, thoughtful, and careful approach to software development is one that I hope to mimic in my career. An early collaboration with Eric C. Chi and Gary bore no fruit but sparked my interest in the sparse regression methods that ultimately constitute the capstone of my dissertation. I wish to thank my advisor Kenneth Lange, who gracefully and patiently introduced me to the world of optimization. He never hesitated to offer financial, intellectual, or emotional support during my graduate school career.
To my family I owe an unpayable debt. To this day, my parents, my brothers, my uncles and aunts, and my cousins do not understand what I did during graduate school or why I did it. Nonetheless, they always tried to offer emotional support and healthy distractions when needed. I must particularly thank my mother, who somehow summoned the patience to find a silver lining every time that my career prospects seemed to recede into a bleak and cloudy future. Perhaps someday I can complete her wish of “finding the gene that causes cancer.” Lastly, I must thank Gabriela Bran Anleu, who concurrently supported my graduate career while traversing her own. Her convictions, her stubbornness, and her optimism for a future filled with renewable energy sources still inspires me to this day.
VITA
2007–2010 Research Assistant with Michael Hammer, Arizona Research Laboratories, The University of Arizona, Tucson, AZ
2009 Visiting Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain
2010 B.S. (Mathematics) and B.A. (Linguistics), The University of Arizona
2010–2011 Fulbright Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain
2011–present Graduate Student, Department of Biomathematics, University of California, Los Angeles, CA
2012 M.S. (Biomathematics), University of California, Los Angeles
2014 Visiting Student Researcher with Tim Conrad, Konrad Zuse Zentrum, Freie Universität Berlin, Berlin, Germany
2014–2015 Visiting Graduate Researcher, Department of Statistics, Stanford University, Stanford, CA
PUBLICATIONS
Keys KL and Lange KL. “An exchange algorithm for least squares regression”. (in preparation)
Keys KL, Chen GK, Lange KL. “Hard Thresholding Pursuit Algorithms for Model Selection in Genome-Wide Association Studies”. (in preparation)
Keys KL, Zhou H, Lange KL. “Proximal Distance Algorithms: Theory and Examples”. (submitted)
Montanucci L, Laayouni H, Dobón B, Keys KL, Bertranpetit J, and Peretó J. “Influence of topology and functional class on the molecular evolution of human metabolic genes.” Molecular Biology and Evolution. (submitted)
Lange KL and Keys KL. “The MM Proximal Distance Algorithm.” Proceedings of the 2014 International Congress of Mathematicians, Seoul, South Korea.
Dall’Olio GM, Marino J, Schubert M, Keys KL, Stefan MI, Gillespie CS, Poulain P, Shameer K, Sugar R, Invergo BM, Jensen LJ, Bertranpetit J, Laayouni H. “Ten simple rules for getting help from online scientific communities.” PLOS Computational Biology 7:9 (2011), e1002202.
CHAPTER 1
Introduction
The fields of genetics and genomics have blossomed since the publication of the first sequenced human genome in 2003. Modern genotyping and sequencing technologies have dramatically lowered the cost of genetic data collection. The National Human Genome Research Institute (NHGRI) of the United States monitors the average cost of sequencing a 3 gigabase human genome. The graph in Figure 1.1 shows the striking decline in sequencing costs from the year 2001 up to the year 2015, the most recent year for which data are available [138]. The sheer scale of data that modern genomic technology can generate vastly outpaces the computational hardware and software to analyze it. Typical issues under the well-worn “Big Data” label, such as memory limits and scalable algorithms, are crucially important for modern genetic analysis software.
This dissertation addresses one facet of the genomic data boom, the analysis of genome-wide association studies (GWASes). At its core, GWAS analysis is a very large regression problem. The solution proposed here draws from the fields of computer science, mathematical optimization, and statistical genetics to formulate several software packages for linear regression in GWAS. The story behind the development of this software starts firmly in the field of convex optimization, in particular the class of proximal gradient algorithms, and slowly moves into nonconvex algorithms. Along the way it will create computational tools useful in other genomics contexts, such as sparse principal components analysis (SPCA) and sparse precision matrix estimation commonly used in genetic expression analyses. The climax of this story is an implementation of an algorithm called iterative hard thresholding (IHT) that performs efficient model selection in GWAS. The exposition presented in the following chapters assumes little previous biological coursework. However, those who have not studied mathematics will find the topics intimidating. At a minimum, readers should be comfortable with real analysis, linear algebra, multivariate calculus, and linear statistical models
Figure 1.1: The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore’s law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Illumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases.
(at the undergraduate level) and some optimization theory and numerical linear algebra (at the graduate level).
At this juncture, it is important to emphasize what this dissertation does not represent. It does not represent a synthesis of theorems and proofs. In many instances, both convergence and recovery guarantees are taken for granted. Where proofs are not given, relevant mathematical references are provided. Readers seeking mathematical rigor are encouraged to look elsewhere. This work is also not complete: the final product of this investigation, a group of software packages coded in the new Julia programming language, is as much a work in progress as the Julia language
itself. The tactics and implementations detailed herein may well become outdated in a few years. The hope is that this software suite will serve as a springboard or benchmark for future software development targeting increasingly powerful hardware with increasingly clever algorithms.
The rest of this dissertation proceeds as follows. Chapter 2 lightly sketches the necessary convex optimization knowledge to understand the algorithms described later. Chapters 3 and 4 describe the development of the class of proximal gradient algorithms that we call proximal distance algorithms. The proximal distance algorithm serves as a springboard for thinking about sparse regression frameworks such as IHT, while Chapter 5 demonstrates the superiority of IHT versus current software for feature selection in GWAS. The discussion in Chapter 6 draws a roadmap for future directions that this project could take. As will be demonstrated, IHT is a promising framework that could eventually dominate the sparse regression world.
CHAPTER 2
Convex Optimization
The field of mathematical optimization or mathematical programming is concerned with finding the optimal points (minima and maxima) of functions f : U → R over an open domain U. The problem of unconstrained optimization seeks the optimal points of a scalar-valued function f over its entire domain. A constrained optimization problem arises when we optimize f over some set C ⊂ dom f.
The field of optimization traces its roots to early developments in calculus [45, 59]. Fermat and Legendre used tools from calculus to derive explicit formulæ for the optimal values of a function. Newton and Gauss developed iterative methods for computing optima, one of which we now know as Newton’s method. The early 20th century saw the birth of linear programming, the simplest case of mathematical programming. Leonid Kantorovich laid the foundational theory of linear programming [82], while George Dantzig coined the term “linear programming” and published the simplex algorithm [43]. The theory of duality, originally developed by John von Neumann for economic game theory, was found to apply to linear programming as well [133].
Since the 1950s, the field of optimization has blossomed and evolved rapidly. For reasons that will become clear later, we will focus on the important subdomain of optimization known as convex optimization. Convex optimization deals with the optimization of convex functions over convex sets. The exposition given here offers a mere glimpse into the vast literature of convex optimization. Several books [20, 26, 27, 72, 93, 94, 120] offer a rigorous mathematical development of convex analysis. Algorithms for convex optimization are described in [15, 111, 112].
[Figure 2.1 panels: “Convex set” (left), “Nonconvex set” (right)]
Figure 2.1: A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129].
2.1 Convexity
Convexity is a fundamental property in mathematical optimization.
Definition 1. A convex set S ⊂ Rⁿ is any set such that for all x, y ∈ S and α ∈ [0, 1] we have αx + (1 − α)y ∈ S.
An intuitive interpretation of Definition 1 is that a convex set S contains all line segments between any two points in S. Figure 2.1 demonstrates this explicitly by juxtaposing a convex set with a nonconvex one.
Definition 2. A function f : U → R with convex domain U is called a convex function if it satisfies
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) (2.1) for all x, y ∈ U and all α ∈ [0, 1].
When strict inequality holds in (2.1) for all x ≠ y and α ∈ (0, 1), f is said to be strictly convex. If a convex function f is differentiable, then we have the following useful result.
Proposition 1. (First Order Condition for Convexity) Consider a function f : U → R with open convex domain U ⊂ Rn. Then f is convex if and only if for all x, y ∈ U we have
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).  (2.2)

Proof. The proof flows from Definition 2. See [93] for details.
The first order condition (2.2) states that f lies above a tangent hyperplane given by ∇f(x) at a tangent point x. If f is twice-differentiable, then a stronger result holds true.
Proposition 2. (Second Order Condition for Convexity) Consider a twice-differentiable function
f : U → R over an open convex domain U ⊂ Rⁿ. If ∇²f(x) ⪰ 0 for every x ∈ U, then f is convex.
Proof. Following the exposition in [93], for x, y ∈ U the expansion

f(y) = f(x) + ∇f(x)ᵀ(y − x) + (y − x)ᵀ [ ∫₀¹ ∇²f(x + α(y − x)) (1 − α) dα ] (y − x)

shows that the quadratic remainder term is nonnegative whenever ∇²f ⪰ 0 on U. The first order condition (2.2) follows, demonstrating the convexity of f.
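The first order condition (2.2) is easy to sanity-check numerically. The sketch below, written in Python purely for illustration, uses the arbitrary test function f(x) = x² + eˣ, which is convex since f″(x) = 2 + eˣ > 0, and verifies the inequality at randomly sampled pairs of points:

```python
import math
import random

# Illustrative check of the first order condition (2.2) for the convex
# function f(x) = x^2 + e^x, whose gradient is f'(x) = 2x + e^x.
def f(x):
    return x * x + math.exp(x)

def grad_f(x):
    return 2 * x + math.exp(x)

def first_order_holds(x, y, tol=1e-12):
    # Convexity demands f(y) >= f(x) + f'(x) * (y - x) at every pair.
    return f(y) >= f(x) + grad_f(x) * (y - x) - tol

random.seed(0)
pairs = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(1000)]
assert all(first_order_holds(x, y) for x, y in pairs)
```

The choice of f here is an assumption made for the example; any twice-differentiable function with nonnegative second derivative would do.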
A related concept is the notion of strong convexity.
Definition 3. A function f : U → R is called strongly convex with parameter m > 0 if for all points x, y ∈ U and any α ∈ [0, 1] we have

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (m/2) α(1 − α) ‖x − y‖₂².
If f is differentiable, then a first-order condition for strong convexity is given by

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2) ‖y − x‖₂².
If f is twice differentiable, then f is strongly convex provided that ∇²f(x) ⪰ mI.
Strong convexity bounds the smallest eigenvalue of ∇²f away from 0. Strongly convex functions are generally easy to optimize. However, the set of strongly convex functions is smaller than the set of strictly convex functions, so their scope is limited.
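As a quick illustration, the Python sketch below checks the first-order strong convexity bound for f(x) = x², which is strongly convex with parameter m = 2 because f″(x) = 2; for this particular f the bound holds with equality:

```python
import random

# Illustrative check of the first-order strong convexity bound for
# f(x) = x^2 with parameter m = 2 (an assumption made for the example).
def f(x):
    return x * x

def grad_f(x):
    return 2 * x

m = 2.0
random.seed(1)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    bound = f(x) + grad_f(x) * (y - x) + 0.5 * m * (y - x) ** 2
    assert f(y) >= bound - 1e-9  # equality holds exactly for this f
```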
The first order condition (2.2) can be generalized via a convex relaxation of differentiability known as subdifferentiability.

Definition 4. A subgradient of a convex function f : U → R with U ⊂ Rⁿ at a point x ∈ U is any vector g ∈ Rⁿ satisfying

f(y) − f(x) ≥ gᵀ(y − x)  (2.3)

for all y ∈ U.
Definition 5. (Subdifferentiability) A convex function f : U → R is subdifferentiable if the sub- gradient is defined at every point in U. The subdifferential ∂f(x) at a point x is the set of all subgradients of f(x).
If f is differentiable at x then ∂f(x) = {∇f(x)}, so differentiable functions are by definition subdifferentiable. To illustrate a nondifferentiable f that is subdifferentiable, consider the absolute value function f : R → R given by f(x) = |x|. Then

∂f(x) = {1} for x > 0,  {−1} for x < 0,  [−1, 1] for x = 0.
For the smooth portions of f, there exists only one slope, given by the derivative function f 0. At the point x = 0 where f is nondifferentiable, the subdifferential ∂f contains the slopes of all possible tangent lines at x.
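The subdifferential of the absolute value can be coded directly as an interval of slopes. The following Python sketch (the helper name `subdiff_abs` is ours, introduced only for illustration) also re-checks the subgradient inequality (2.3) at random points:

```python
import random

# Sketch of the subdifferential of f(x) = |x| as an interval (lo, hi).
def subdiff_abs(x):
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)  # every slope in [-1, 1] supports |x| at 0

# Check the subgradient inequality (2.3): |y| - |x| >= g * (y - x).
random.seed(2)
for _ in range(1000):
    x, y = random.uniform(-2, 2), random.uniform(-2, 2)
    lo, hi = subdiff_abs(x)
    g = random.uniform(lo, hi)
    assert abs(y) - abs(x) >= g * (y - x) - 1e-12
```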
An important concept in convex analysis is that of the conjugate function.
Definition 6. The convex conjugate of a function f (alternatively the Fenchel conjugate or the Legendre-Fenchel conjugate of f) is defined as

f⋆(x) = sup_y { yᵀx − f(y) }.
The conjugate f⋆ of f is always closed and convex regardless of the convexity of f.
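To make Definition 6 concrete, the sketch below approximates the conjugate of f(y) = y²/2 by a brute-force supremum over a grid; the known closed form is f⋆(x) = x²/2, since the maximizer of xy − y²/2 is y = x. (Python, illustrative only; the grid bounds are an assumption that keeps the maximizer inside the grid.)

```python
# Brute-force approximation of the convex conjugate of Definition 6,
# f*(x) = sup_y { x*y - f(y) }, for f(y) = y^2/2 with f*(x) = x^2/2.
def conjugate(f, x, grid):
    return max(x * y - f(y) for y in grid)

f = lambda y: 0.5 * y * y
grid = [i / 1000.0 for i in range(-5000, 5001)]  # y in [-5, 5]
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(conjugate(f, x, grid) - 0.5 * x * x) < 1e-3
```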
2.2 Projections and Proximal Operators
Projection operators occupy a useful niche in optimization.

Definition 7. The projection operator Π_S : Rⁿ → S onto a set S (alternatively the Euclidean projection onto S) maps a point x ∈ Rⁿ to a possibly nonunique point y ∈ S that minimizes the Euclidean distance dist(x, y). In functional terms, we have
Π_S(x) = argmin_{y ∈ S} ‖x − y‖₂  (2.4)
dist(x, S) = inf_{y ∈ S} ‖x − y‖₂  (2.5)
If S is closed and convex, then ΠS maps x uniquely to its counterpart y ∈ S, and dist(x, S) is a convex function. Projections onto many particular convex sets are known in closed or computable form [9, 11].
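Two such closed-form projections, onto a Euclidean ball and onto a box, can be sketched as follows (an illustrative Python fragment; the function names are ours):

```python
import math

# Closed-form Euclidean projections onto two classical convex sets.
def project_ball(x, r=1.0):
    # Projection onto the ball of radius r: rescale if outside the ball.
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= r:
        return list(x)
    return [r * xi / norm for xi in x]

def project_box(x, lo=0.0, hi=1.0):
    # Projection onto the box [lo, hi]^n: clamp each coordinate.
    return [min(max(xi, lo), hi) for xi in x]

assert project_ball([3.0, 4.0], r=1.0) == [0.6, 0.8]
assert project_box([-0.5, 0.3, 2.0]) == [0.0, 0.3, 1.0]
```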
An important generalization of a projection operator is known as a proximal operator.
Definition 8. The proximal operator prox_f : Rⁿ → Rⁿ (alternatively the proximity operator or the proximal map) for a closed convex function f is the solution to the optimization problem

prox_f(x) = argmin_y { f(y) + (1/2)‖x − y‖₂² }.  (2.6)

The proximal operator is unique and exists for all x ∈ dom f [11, 107]. It is often useful to parametrize the proximal operator by a step size t:

prox_{tf}(x) = argmin_y { tf(y) + (1/2)‖x − y‖₂² }
             = argmin_y { f(y) + (1/2t)‖x − y‖₂² }.
The proximal operator is the solution to the Moreau-Yosida regularization of f, so evaluating proxtf (x) is itself an optimization problem. Proximal operators are particularly useful when f is nondifferentiable or otherwise difficult to optimize.
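A standard concrete example: for f(y) = |y|, the proximal map with step size t is soft thresholding, prox_{t|·|}(x) = sign(x) max(|x| − t, 0). The Python sketch below (illustrative only) checks this closed form against a brute-force grid minimization of the objective t|y| + (1/2)(x − y)²:

```python
# Soft thresholding: the closed-form proximal operator of f(y) = |y|.
def soft_threshold(x, t):
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

# Brute-force minimization of t|y| + (1/2)(x - y)^2 over a grid.
def prox_numeric(x, t, grid):
    return min(grid, key=lambda y: t * abs(y) + 0.5 * (x - y) ** 2)

grid = [i / 1000.0 for i in range(-4000, 4001)]  # y in [-4, 4]
for x in (-2.5, -0.3, 0.0, 0.7, 3.0):
    assert abs(soft_threshold(x, 1.0) - prox_numeric(x, 1.0, grid)) < 2e-3
```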
The analytical properties of proximal operators are well-understood. We can view them as generalized projections. Intuitively, the proximal operator establishes a compromise between minimizing the distance to x and minimizing the function f itself. Like the projection operators that they generalize, proximal operators of many functions have closed form or computable solutions [11, 117]. For example, the proximal operator of the indicator function of a set C,

δ_C(x) = 0 if x ∈ C, and δ_C(x) = ∞ if x ∉ C,

is simply the projection Π_C onto C.
One important property of proximal operators is firm nonexpansiveness. If f is a closed convex function, then prox_tf satisfies

‖prox_tf(x) − prox_tf(y)‖₂² ≤ [prox_tf(x) − prox_tf(y)]ᵀ(x − y)

for all x, y ∈ dom f. Firmly nonexpansive operators T(x) are useful for fixed point algorithms since the iteration scheme
xk+1 = (1 − ρ)xk + ρT(xk)

converges weakly to a fixed point whenever ρ ∈ (0, 2).
Another important property of proximal operators is known as the Moreau decomposition. A closed convex function f is related to its conjugate f∗ via the relation
x = proxf (x) + proxf ∗ (x).
The Moreau decomposition is similar in spirit to the orthogonal decomposition in linear algebra, in which a vector x ∈ Rn is split into the sum of two vectors y ∈ C and z ∈ C⊥ for a closed subspace C of Rn.
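The decomposition is easy to verify numerically. In the sketch below (our own illustration, with our function names) we take f(y) = ‖y‖₁, whose conjugate f∗ is the indicator of the ℓ∞ unit ball, so prox_{f∗} is the projection onto [−1, 1]ⁿ; the two prox maps should sum to the identity.

```python
import numpy as np

def prox_l1(x):
    """prox of f(y) = ||y||_1 (step size 1): soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def prox_conj(x):
    """prox of f*(y) = indicator of the l-infinity unit ball,
    i.e. projection onto the box [-1, 1]^n."""
    return np.clip(x, -1.0, 1.0)

x = np.array([3.0, -0.5, 1.2, -4.7])
decomposed = prox_l1(x) + prox_conj(x)  # Moreau: should reproduce x exactly
```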
2.3 Descent Methods
Descent methods are iterative schemes for optimizing a function f by producing a minimizing sequence {xk} with k = 1, 2,... that satisfies
xk+1 = xk + tk∆xk
f(xk+1) ≤ f(xk)

with step direction ∆xk and step size tk > 0 at every non-optimal xk. Descent methods come in many flavors, and their domain of application can vary depending on the size and complexity of f.

2.3.1 Gradient Methods
The class of algorithms known as gradient methods or first-order methods optimize a function f using first-order (sub)differentiability of f [27, 35]. Gradient methods are sometimes called steepest descent methods since the search direction ∆x uses the negative gradient −∇f(x), which points in the direction of steepest descent. They follow the simple update scheme
xk+1 := xk − tk∇f(xk) (2.7)
A more complete algorithm appears as Algorithm 1.
Algorithm 1 The gradient descent method.
Require: a starting point x ∈ dom f and a tolerance ε > 0 with ε ≪ 1.
repeat
    ∆x := −∇f(x).
    Choose step size t with an appropriate method.
    Update x := x + t∆x.
until ‖∆x‖₂ < ε
If f is subdifferentiable but not differentiable then replacing the gradient ∇f with a subgradient g ∈ ∂f at every point x yields a subgradient method. Subgradient methods typically exhibit slower convergence than similar gradient descent methods, but they apply to a much larger class of functions.
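Algorithm 1 can be sketched in a few lines. The following is our own minimal illustration (function names are ours), applied to a simple quadratic whose gradient we know exactly.

```python
import numpy as np

def gradient_descent(grad_f, x0, step, tol=1e-8, max_iter=10_000):
    """Algorithm 1: repeat x := x - t * grad f(x) until ||grad f(x)||_2 < tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = -grad_f(x)                 # descent direction
        if np.linalg.norm(dx) < tol:    # stopping criterion
            break
        x = x + step * dx               # constant step size here
    return x

# minimize f(x) = 0.5 * ||x - c||^2, whose gradient is x - c
c = np.array([1.0, -2.0])
xstar = gradient_descent(lambda x: x - c, np.zeros(2), step=0.5)
```

A subgradient method has the same shape; one simply substitutes any subgradient g ∈ ∂f(x) for ∇f(x).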
Strictly speaking, Algorithm 1 describes an unconstrained minimization scheme. If we wish to optimize a convex differentiable function f over a constraint set C, then we use the projected gradient descent update
xk+1 := ΠC (xk − tk∇f(xk)) (2.8)
For suitable choices of t, the update scheme (2.8) converges stably to the constrained minimum of f [19]. For example, if ∇f is Lipschitz continuous with Lipschitz constant L, then a constant step size t ∈ (0, 2/L) ensures convergence of (2.8). Exploiting Lipschitz continuity yields the simplest convergence guarantees; more complicated guarantees rely on the Wolfe conditions [140, 141] or the Armijo rule [5]. We will make frequent use of constant step sizes t ∈ (0, 1/L) based on Lipschitz constants for reasons that will become clear later.
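Update (2.8) with a constant step is equally short in code. The sketch below (our own illustration, names ours) minimizes a quadratic over the nonnegative orthant; here ∇f has Lipschitz constant L = 1, so any step in (0, 2) converges.

```python
import numpy as np

def projected_gradient(grad_f, project, x0, step, max_iter=500):
    """Update (2.8): x_{k+1} = Pi_C(x_k - t * grad f(x_k)), constant step t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x = project(x - step * grad_f(x))
    return x

# minimize f(x) = 0.5 * ||x - c||^2 over C = nonnegative orthant; L = 1
c = np.array([1.5, -2.0, 0.3])
xhat = projected_gradient(lambda x: x - c,            # gradient of f
                          lambda y: np.maximum(y, 0),  # projection onto C
                          np.zeros(3), step=0.9)       # 0.9 < 2/L
```

The constrained minimum is the projection of c itself onto the orthant, which the iteration recovers.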
2.3.2 Proximal Gradient Method
Suppose that we can split a convex objective function f : U → R into the sum f = g + h of two closed proper convex functions g : U → R and h : U → R where g is differentiable. The proximal gradient method is the iterative scheme given by
xk+1 := prox_{tk h}(xk − tk∇g(xk))   (2.9)
with step size tk at iteration k. If we optimize f via a surrogate function
g(x | xk, t) = f(xk) + ∇f(xk)ᵀ(x − xk) + (1/(2t))‖x − xk‖₂²   (2.10)
then we can compute a suitable t with the line search of Beck and Teboulle [13] described in Algorithm 2. The use of surrogate functions presages our discussion of majorization methods in Section 2.5.
Algorithm 2 Line search for the proximal gradient method.
Require: a starting point xk, the previous step size tk−1, and a parameter β ∈ (0, 1).
Let t := tk−1.
repeat
    z := prox_th(xk − t∇g(xk))
    if f(z) > g(z | xk, t) then update t := βt
until f(z) ≤ g(z | xk, t)
return tk := t, xk+1 := z.
If h is the indicator δC of the constraint set C, then (2.9) reduces to the projected gradient descent scheme in (2.8). Setting h = 0 yields the standard gradient descent scheme (2.7).
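Taking h(x) = λ‖x‖₁ in (2.9) gives iterative soft thresholding for the lasso. The sketch below is our own illustration (problem data and names are ours): it pairs the prox-gradient update with a Beck–Teboulle-style backtracking search on the smooth part g.

```python
import numpy as np

def soft(x, t):
    """Soft thresholding: prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_gradient_lasso(A, b, lam, x0, t0=1.0, beta=0.5, max_iter=200):
    """Proximal gradient (2.9) for g(x) = 0.5||Ax - b||^2, h(x) = lam*||x||_1.
    The step size is shrunk until the quadratic upper model of g holds at z."""
    g = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad = lambda x: A.T @ (A @ x - b)
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(max_iter):
        gx, dgx = g(x), grad(x)
        while True:
            z = soft(x - t * dgx, t * lam)           # prox-gradient trial point
            if g(z) <= gx + dgx @ (z - x) + np.sum((z - x) ** 2) / (2 * t):
                break                                 # sufficient decrease holds
            t *= beta                                 # otherwise shrink the step
        x = z
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
xtrue = np.array([1.0, 0.0, -2.0, 0.0, 0.0])
b = A @ xtrue
xhat = prox_gradient_lasso(A, b, lam=1e-3, x0=np.zeros(5))
```

With a tiny penalty λ the iterates essentially recover the noiseless generating coefficients.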
2.4 Second-order methods
Second-order methods for optimizing a convex function f exploit approximate or exact second derivative information about f. When f is twice differentiable, its Hessian matrix ∇²f provides curvature information useful for computing search directions with Newton's method. Approximate second-order methods such as the conjugate gradient method extrapolate second-order information from ∇f.
2.4.1 Newton’s method
Suppose that f : U → R is closed, convex, and twice differentiable. The Newton step for f at x is defined as

∆x_nt = −[∇²f(x)]⁻¹∇f(x).
The Newton step is motivated by considering the second order Taylor expansion of f at x given by
f̃(x + v) = f(x) + ∇f(x)ᵀv + (1/2)vᵀ∇²f(x)v.
Observe that f˜ is a convex quadratic function of v. The minimizer with respect to v is
v = −[∇²f(x)]⁻¹∇f(x),
which coincides with ∆x_nt. When ∇²f(x) ∈ S₊ⁿ, as is the case for convex functions, the Newton step gives the direction of steepest descent in the quadratic norm

‖y‖_{∇²f(x)} = √(yᵀ∇²f(x)y)
defined by the Hessian ∇²f(x) at x. A more intuitive explanation is that Newton's method warps the direction of steepest descent in accordance with information from ∇²f. Newton's method attains quadratic convergence near the minimum [27], but it can easily overshoot the minimum if no safeguards are put in place. Typically Newton directions are damped with a suitable backtracking line search. Monitoring the Newton decrement

N(x) = √(∇f(x)ᵀ[∇²f(x)]⁻¹∇f(x))   (2.11)

yields a simple stopping criterion. Algorithm 3 sketches one version of the damped Newton method.
Algorithm 3 Damped Newton's method.
Require: a starting point x ∈ dom f and a tolerance ε > 0 with ε ≪ 1.
repeat
    Compute the Newton step ∆x_nt := −[∇²f(x)]⁻¹∇f(x).
    Compute step size t via backtracking line search.
    Update x := x + t∆x_nt.
    Compute the convergence criterion λ := N²(x).
until λ/2 < ε
return x.
2.4.2 Conjugate gradient method
Suppose that we wish to solve the linear system Ax = b where A ∈ S₊ⁿ. For dense systems with millions or billions of equations, the burden of computing a Newton step can overwhelm most computational hardware. The conjugate gradient method is well suited to numerically solving large sparse systems of linear equations [70, 122]. In exact arithmetic, the conjugate gradient method converges in no more than n iterations. However, even tiny numerical imprecisions render the conjugate gradient method unstable as a direct method on computers.
Fortunately, the conjugate gradient method works remarkably well as an iterative method. For any two vectors u, v ∈ Rn we say that u and v are conjugate with respect to A if uᵀAv = 0. Because A ∈ S₊ⁿ, the conjugate relation defines an inner product ⟨Au, v⟩. Suppose that we form a set P of n mutually conjugate vectors p1, p2, . . . , pn under the inner product defined by A. Then P forms a basis for Rn. For the aforementioned linear system, we have P = {b, Ab, A²b, . . . , Aⁿ⁻¹b}. We call span P a Krylov subspace [89]. According to the Cayley-Hamilton theorem, the matrix A⁻¹ used to solve Ax = b can be expressed as a linear combination of powers of A. Since P contains images of b under powers of A, there exists a matrix Ã ∈ span P such that Ã ≈ A⁻¹. This approximation is good so long as A is well conditioned; using the conjugate gradient method
with a poorly conditioned A often requires pre- or post-multiplication of A by a rescaling operator called a preconditioner. Algorithm 4 succinctly describes an unpreconditioned conjugate gradient method as an iterative solver.
Algorithm 4 A typical conjugate gradient algorithm.
Require: parameters A and b, and a tolerance ε > 0 with ε ≪ 1.
Initialize a starting point x0 ∈ Rn, the residual vector r0 := b − Ax0, the direction p0 := r0, and the squared norm c0 := ‖r0‖₂².
repeat
    Compute the Krylov vector zk := Apk.
    Compute the step length αk := ck / (pkᵀzk).
    Update the estimated solution xk+1 := xk + αk pk.
    Update the residual rk+1 := rk − αk zk.
    Update the squared norm ck+1 := ‖rk+1‖₂².
    Update the direction pk+1 := rk+1 + (ck+1/ck)pk.
until √ck+1 < ε
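Algorithm 4 translates line for line into code. The sketch below is our own illustration (names are ours) on a small symmetric positive definite system; in practice A would be large and sparse, and A @ p would be a matrix-free operator.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Algorithm 4: conjugate gradients for Ax = b, A symmetric positive
    definite. Each loop iteration performs one matrix-vector product."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    c = r @ r                          # squared residual norm
    for _ in range(max_iter or 10 * n):
        z = A @ p                      # Krylov vector
        alpha = c / (p @ z)            # step length
        x = x + alpha * p              # update solution estimate
        r = r - alpha * z              # update residual
        c_new = r @ r
        if np.sqrt(c_new) < tol:
            break
        p = r + (c_new / c) * p        # next conjugate direction
        c = c_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
xsol = conjugate_gradient(A, b)
```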
2.5 The MM Principle
The MM principle (alternatively optimization transfer or iterative majorization) is a device for constructing optimization algorithms [25, 76, 95, 93, 90]. In essence, it replaces the objective function
f(x) by a simpler surrogate function g(x | xk) anchored at the current iterate xk and majorizing or
minorizing f(x). As a byproduct of optimizing g(x | xk) with respect to x, the objective function f(x) is sent downhill or uphill, depending on whether the purpose is minimization or maximization. The next iterate xk+1 is chosen to optimize the surrogate g(x | xk) subject to any relevant
constraints. Majorization combines two conditions: the tangency condition g(xk | xk) = f(xk)
and the domination condition g(x | xk) ≥ f(x) for all x. In minimization these conditions and
the definition of xk+1 lead to the descent property
f(xk+1) ≤ g(xk+1 | xk) ≤ g(xk | xk) = f(xk). (2.12)
Minorization reverses the domination inequality and produces an ascent algorithm. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a stationary point of the objective function [90]. In particular, the MM principle is ideally suited for optimizing convex objective functions since their surrogates can exploit the machinery of convex optimization. From the perspective of dynamical systems, the objective function serves as a Lyapunov function for the algorithm map.
The MM principle simplifies optimization by: (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a nondifferentiable problem into a smooth problem. Choosing a tractable surrogate function g(x | xk) that hugs the objective function f(x) as tightly as possible requires experience and skill with inequalities. The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. Hence, it is possible to work piecemeal in majorizing complicated objective functions.
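A tiny worked example of point (f) above: to minimize the nondifferentiable function f(x) = Σᵢ|x − aᵢ|, majorize each absolute value at xk by the quadratic |u| ≤ u²/(2|uk|) + |uk|/2 and minimize the smooth surrogate in closed form. This is our own illustration (names and safeguard are ours), not code from the dissertation.

```python
import numpy as np

def mm_median(a, x0, eps=1e-8, max_iter=500):
    """MM for f(x) = sum_i |x - a_i|. Each |x - a_i| is majorized at x_k by
    (x - a_i)^2 / (2|x_k - a_i|) + |x_k - a_i| / 2, so the surrogate is a
    weighted least squares problem whose minimizer is a weighted average.
    The minimizer of f itself is the sample median. eps guards the weights
    against division by zero."""
    x = float(x0)
    for _ in range(max_iter):
        w = 1.0 / np.maximum(np.abs(x - a), eps)  # majorization weights
        x_new = np.sum(w * a) / np.sum(w)         # exact surrogate minimizer
        if abs(x_new - x) < 1e-12:
            break
        x = x_new
    return x

a = np.array([1.0, 2.0, 3.0, 7.0, 8.0])
xmed = mm_median(a, x0=0.0)  # converges to the median of a
```

The descent property (2.12) holds automatically: every surrogate minimization drags f downhill.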
The MM principle as formulated here represents the synthesis of a complex history. Specific MM algorithms appeared years before the principle was well understood [67, 105, 125, 137, 145]. Projected gradient and proximal gradient algorithms can be motivated from the MM perspective, but the early emphasis on operators and fixed points obscured this distinction. The celebrated EM (expectation-maximization) principle of computational statistics is a special case of the MM principle [106]. Although Dempster, Laird, and Rubin [48] formally named the EM algorithm, many of their contributions were anticipated by Baum [8] and Sundberg [127]. The MM principle was clearly stated by Ortega and Rheinboldt [114]. de Leeuw [46] is generally credited with recognizing the importance of the principle in practice. The EM algorithm had an immediate and large impact in computational statistics, but the more general MM principle was much slower to take hold. The papers [47, 69, 84] by the Dutch school of psychometricians solidified its position. The related Dinkelbach [52] maneuver in fractional linear programming also highlighted the importance of the descent property in algorithm construction.
Since the MM principle is not an algorithm per se, it can easily exploit the aforementioned gradient and Newton descent methods. The development of the proximal distance algorithm and iterative hard thresholding in the subsequent chapters will make explicit use of both the MM principle and gradient descent methods.
CHAPTER 3
The Proximal Distance Algorithm
The current exposition emphasizes the role of the MM principle in nonlinear programming. For smooth functions, one can construct an adaptive interior point method based on scaled Bregman barriers. This algorithm does not follow the central path. For convex programming subject to nonsmooth constraints, one can combine an exact penalty method with distance majorization to create versatile algorithms that are effective even in discrete optimization. These proximal distance algorithms are highly modular and reduce to set projections and proximal mappings, both very well-understood techniques in optimization. We illustrate the possibilities in linear programming, binary piecewise-linear programming, nonnegative quadratic programming, ℓ0 regression, matrix completion, and sparse precision matrix estimation.
3.1 An Adaptive Barrier Method
In convex programming it simplifies matters notationally to replace a convex inequality constraint hj(x) ≤ 0 by the concave constraint vj(x) = −hj(x) ≥ 0. Barrier methods operate on the relative interior of the feasible region where all vj(x) > 0. Adding an appropriate barrier term to the objective function f(x) keeps an initially inactive constraint vj(x) inactive throughout an optimization search. If the barrier function is well designed, it should adapt and permit convergence to a feasible point y with one or more inequality constraints active.
We now briefly summarize an adaptive barrier method that does not follow the central path [91]. Because the logarithm of a concave function is concave, the Bregman majorization [29]
−ln vj(x) + ln vj(xk) + (1/vj(xk)) ∇vj(xk)ᵀ(x − xk) ≥ 0

acts as a convex barrier for a smooth constraint vj(x) ≥ 0. To make the barrier adaptive, we scale it by the current value vj(xk) of the constraint. These considerations suggest an MM algorithm based on the surrogate function
g(x | xk) = f(x) − ρ ∑_{j=1}^s vj(xk) ln vj(x) + ρ ∑_{j=1}^s ∇vj(xk)ᵀ(x − xk)

for s inequality constraints. Minimizing the surrogate subject to relevant linear equality constraints
Ax = b produces the next iterate xk+1. The constant ρ determines the tradeoff between keeping the constraints inactive and minimizing f(x). One can show that the MM algorithm with exact minimization converges to the constrained minimum of f(x) [90].
In practice one step of Newton’s method is usually adequate to decrease f(x). The first step of
Newton’s method minimizes the surrogate g(x | xk) given by the second-order Taylor expansion of around xk subject to the equality constraints. Given smooth functions, the two derivatives
∇g(xk | xk) = ∇f(xk) s 2 2 X 2 ∇ g(xk | xk) = ∇ f(xk) − ρ ∇ vj(xk) (3.1) j=1 s X 1 + ρ ∇v (x )∇v (x )T v (x ) j k j k j=1 j k are the core ingredients in the quadratic approximation of g(x | xk). Unfortunately, one step of Newton’s method is neither guaranteed to decrease f(x) nor to respect the nonnegativity con- straints.
For instance, the standard form of linear programming requires the minimization of a linear function f(x) = cᵀx subject to Ax = b and x ≥ 0. The quadratic approximation to the surrogate g(x | xk) amounts to
cᵀxk + cᵀ(x − xk) + (ρ/2) ∑_{j=1}^p (1/xkj)(xj − xkj)².

The minimum of this quadratic subject to the linear equality constraints occurs at the point
xk+1 = xk − Dk⁻¹c + Dk⁻¹Aᵀ(ADk⁻¹Aᵀ)⁻¹(b − Axk + ADk⁻¹c).
Here Dk is the diagonal matrix with ith diagonal entry ρ/xk,i. Observe that the increment xk+1 − xk satisfies the linear equality constraint A(xk+1 − xk) = b − Axk.

One can overcome the objections to Newton updates by taking a controlled step along the
Newton direction uk = xk+1 − xk. The key is to exploit the theory of self-concordant functions [27, 111]. A thrice differentiable convex function h(t) is said to be self-concordant if it satisfies the inequality
|h‴(t)| ≤ 2c h″(t)^{3/2}
for some constant c ≥ 0 and all t in the essential domain of h(t). All convex quadratic functions qualify as self-concordant with c = 0. The function h(t) = − ln(at + b) is self-concordant with constant 1. The class of self-concordant functions is closed under sums and composition with
linear functions. A convex function q(x) with domain Rn is said to be self-concordant if every slice h(t) = q(x + tu) is self-concordant.
Rather than conduct an expensive one-dimensional search along the Newton direction xk +tuk, one can majorize the surrogate function h(t) = g(xk + tuk | xk) along the half-line t ≥ 0. The clever majorization
h(t) ≤ h(0) + h′(0)t − (1/c)h″(0)^{1/2} t − (1/c²) ln[1 − c t h″(0)^{1/2}]   (3.2)
both guarantees a decrease in f(x) and prevents a violation of the inequality constraints [111]. Here c is the self-concordance constant associated with the surrogate. The optimal choice of t reduces to the damped Newton update
t = −h′(0) / [h″(0) − c h′(0) h″(0)^{1/2}].   (3.3)
The first two derivatives of h(t) are clearly
h′(0) = ∇f(xk)ᵀuk

h″(0) = ukᵀ∇²f(xk)uk − ρ ∑_{j=1}^s ukᵀ∇²vj(xk)uk + ρ ∑_{j=1}^s (1/vj(xk)) [∇vj(xk)ᵀuk]².
The first of these derivatives is nonpositive because uk is a descent direction for f(x). The second is generally positive because all of the contributing terms are nonnegative.

When f(x) is quadratic and the inequality constraints are affine, detailed calculations show that the surrogate function g(x | xk) is self-concordant with constant
c = 1 / √(ρ min{v1(xk), . . . , vs(xk)}).
Taking the damped Newton step with step length (3.3) keeps xk + tkuk in the relative interior of the feasible region while decreasing the surrogate and hence the objective function f(x). When
f(x) is not quadratic but can be majorized by a quadratic surrogate q(x | xk), one can replace
f(x) by q(x | xk) in calculating the adaptive-barrier update. The next iterate xk+1 retains the descent property.
As a toy example, consider the linear programming problem of minimizing cᵀx subject to Ax = b and x ≥ 0. Applying the adaptive barrier method to the choices

A = ( 2 0 0 1 0 0
      0 2 0 0 1 0
      0 0 2 0 0 1 ),   b = (1, 1, 1)ᵀ,   c = (−1, −1, −1, 0, 0, 0)ᵀ,

and to the feasible initial point x0 = (1/3)(1, 1, 1, 1, 1, 1)ᵀ produces the results displayed in Table 3.1. Not shown is the minimum point (1/2, 1/2, 1/2, 0, 0, 0)ᵀ. Columns two and three of the table record the progress
of the unadorned adaptive barrier method. The quantity ‖∆k‖₂ equals the Euclidean norm of the difference vector ∆k = xk − xk−1. Columns four and five repeat this information for the algorithm modified by the self-concordant majorization (3.2). The quantity tk in column six represents the optimal step length (3.3) in going from xk−1 to xk along the Newton direction uk−1. Clearly, there is a price to be paid in implementing a safeguarded Newton step. In practice, this price is well worth paying.
                 No Safeguard             Self-concordant Safeguard
Iteration k    cᵀxk        ‖∆k‖₂        cᵀxk        ‖∆k‖₂       tk
 1            -1.20000    0.25820     -1.11270    0.14550     0.56351
 2            -1.33333    0.17213     -1.20437    0.11835     0.55578
 3            -1.41176    0.10125     -1.27682    0.09353     0.55026
 4            -1.45455    0.05523     -1.33288    0.07238     0.54630
 5            -1.47692    0.02889     -1.37561    0.05517     0.54345
10            -1.49927    0.00094     -1.47289    0.01264     0.53746
15            -1.49998    0.00003     -1.49426    0.00271     0.53622
20            -1.50000    0.00000     -1.49879    0.00057     0.53597
25            -1.50000    0.00000     -1.49975    0.00012     0.53591
30            -1.50000    0.00000     -1.49995    0.00003     0.53590
35            -1.50000    0.00000     -1.49999    0.00001     0.53590
40            -1.50000    0.00000     -1.50000    0.00000     0.53590

Table 3.1: Performance of the adaptive barrier method in linear programming.
3.2 MM for an Exact Penalty Method
We now turn to exact penalty methods. For a smooth objective function and smooth constraints, the most convenient penalized objective is
Fρ(x) = f(x) + ρ ‖(G(x), H(x)+)‖₂,

where f(x) is the objective function, G(x) is the vector of equality constraints, and H(x)+ is the vector of truncated inequality constraints with components max{0, hj(x)}. Classical optimization theory says that a constrained minimum point of f(x) furnishes an unconstrained minimum point
of Fρ(x) provided that ρ is sufficiently large and that the Lagrangian

L(x, λ, µ) = f(x) + ∑_{i=1}^p λi gi(x) + ∑_{j=1}^q µj hj(x)

is suitably well-behaved [121]. Here the Lagrange multiplier vectors λ and µ are chosen so that the multiplier rule ∇L(y, λ, µ) = 0 holds at the constrained minimum y.
The nonsmooth nature of Fρ(x) is a crippling hindrance to its optimization. Fortunately, a simple modification of the penalty leads to a viable minimization algorithm. Let us replace the Euclidean norm in the penalty by
‖(u, v)‖ε = √(‖u‖₂² + ‖v‖₂² + ε)

for a small ε > 0. This positions us to majorize the penalty via the univariate majorization
√t ≤ √tk + (t − tk)/(2√tk)   (3.4)

of the concave function √t on the interval t > 0. The resulting majorization
" p q # ρ X X f(x) + ρq (x) ≤ f(x) + g (x)2 + h (x)2 + c (3.5) 2q (x ) i j + k k i=1 j=1
is the key to approximate minimization of Fρ(x). Here the irrelevant constant ck depends only on
xk and q(x) is given by
G(x) q(x) =
H(x)+
The obvious tactic in generating a better iterate xk+1 is to apply one step of Newton’s method.
If we let wk = ρ/qε(xk), then the gradient

∇f(xk) + wk ∑_{i=1}^p gi(xk)∇gi(xk) + wk ∑_{j=1}^q hj(xk)+∇hj(xk)
of the surrogate function at xk is straightforward to derive, but the second differential is problematic to compute because the functions hj(x)+² are not twice differentiable. When hj(xk) < 0, the second derivative satisfies ∇²hj(xk)+² = 0. In the opposite situation hj(xk) > 0, the Gauss-Newton approximation
∇²hj(xk)+² ≈ 2∇hj(xk)∇hj(xk)ᵀ   (3.6)

is valuable for several reasons. Most importantly, it avoids second derivatives and preserves positive definiteness of the approximate second differential. Notably, the approximation (3.6) is exact if hj(x) is affine. Furthermore, the omitted term 2hj(x)+∇²hj(x) of ∇²hj(xk)+² vanishes as the algorithm approaches convergence. In the rare instances when hj(xk) = 0, the literature [77] suggests that there is little harm in approximating the second differential by the outer product (3.6). For the same reasons, we recommend the Gauss-Newton approximation for the equality constraints
gi(x)² as well. Lastly, the Sherman-Morrison formula facilitates matrix inversion when ∇²f(xk) is explicitly invertible and the number of constraints is small.
In practice there is no guarantee that one step of Newton’s method will decrease the surrogate on the right-hand side of majorization (3.5). If f(x) is quadratic or can be majorized by a quadratic function, then another round of majorization avails. When hj(xk) ≥ 0, then the majorization
hj(x)+² ≤ hj(x)² applies. If instead hj(xk) < 0, then we apply the alternative majorization hj(x)+² ≤ [hj(x) − hj(xk)]². Both of these lead to the Hessian approximation on the right-hand side of (3.6). The gradient of the surrogate changes in an obvious way in each case. The additional round of majorization eliminates the need for step-halving. The price may be slower overall convergence.
3.2.1 Exact Penalty Method for Quadratic Programming
Minimization of a convex quadratic objective (1/2)xᵀAx + bᵀx subject to linear equality constraints Cx = d and linear inequality constraints Ex ≤ f is one of the building blocks of modern optimization algorithms. The case A = 0 corresponds to linear programming. Both equality and inequality constraints can be handled as just suggested. Alternatively, the introduction of slack variables allows one to replace linear inequality constraints by a combination of linear equality constraints and nonnegativity constraints xi ≥ 0. For the relevant components, the majorization

max{xi, 0}² ≤ xi²  when xk,i ≥ 0,   max{xi, 0}² ≤ (xi − xk,i)²  when xk,i < 0,

simplifies the overall algorithm and yields a purely quadratic surrogate that is minimized by one step of Newton's method. The next iterate xk+1 is guaranteed to send the approximate objective downhill.
3.3 Distance Majorization
On a Euclidean space, the distance to a closed set S is a Lipschitz continuous function dist(x, S) with Lipschitz constant 1. As discussed in Chapter 2, if S is also convex, then dist(x, S) is a convex function. Projection onto S is intimately tied to dist(x, S). Unless S is convex, the
projection operator ΠS (x) is multi-valued for at least one argument x. Fortunately, it is possible
to majorize dist(x, S) at xk by ‖x − ΠS(xk)‖₂. This simple observation is the key to the proximal distance algorithm to be discussed later. In the meantime, let us show how to derive two feasibility
algorithms by distance majorization [39]. Let S1,..., Sm be closed sets. The method of averaged
projections attempts to find a point in their intersection S = ∩_{j=1}^m Sj. To derive the algorithm, consider the convex combination

f(x) = ∑_{j=1}^m αj dist(x, Sj)²

of squared distance functions. Obviously, f(x) vanishes on S precisely when all coefficients
αj > 0. The majorization

g(x | xk) = ∑_{j=1}^m αj ‖x − ΠSj(xk)‖₂²

of f(x) is easy to minimize. The minimum point of g(x | xk),

xk+1 = ∑_{j=1}^m αj ΠSj(xk),

defines the averaged operator. The MM principle guarantees that xk+1 decreases the objective function.

Von Neumann's method of alternating projections can also be derived from this perspective.
For two sets S1 and S2, consider the problem of minimizing f(x) = dist(x, S2)² subject to the constraint x ∈ S1. The function

g(x | xk) = ‖x − ΠS2(xk)‖₂²
majorizes f(x). Indeed, the domination condition g(x | xk) ≥ f(x) holds because ΠS2 (xk) belongs to S2; the tangency condition g(xk | xk) = f(xk) holds because ΠS2 (xk) is the closest point in S2 to xk. The surrogate function g(x | xk) is minimized subject to the constraint by setting
xk+1 = ΠS1 ◦ΠS2 (xk). The MM principle again ensures that xk+1 decreases the objective function. When the two sets intersect, the least distance of 0 is achieved at any point in the intersection. One
can extend this derivation to three sets by minimizing f(x) = dist(x, S2)² + dist(x, S3)² subject to x ∈ S1. The surrogate

g(x | xk) = ‖x − ΠS2(xk)‖₂² + ‖x − ΠS3(xk)‖₂²
          = 2 ‖x − (1/2)[ΠS2(xk) + ΠS3(xk)]‖₂² + ck

relies on an irrelevant constant ck. The closest point in S1 is

xk+1 = ΠS1( (1/2)[ΠS2(xk) + ΠS3(xk)] ).
This construction clearly generalizes to more than three sets.
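Alternating projections is only a few lines of code once the individual projections are available. The sketch below is our own illustration (names and example sets are ours): S1 is the nonnegative orthant and S2 is the hyperplane {x : Σᵢ xᵢ = 1}, whose intersection is the probability simplex.

```python
import numpy as np

def alternating_projections(p1, p2, x0, max_iter=1000):
    """Von Neumann alternating projections: x_{k+1} = P1(P2(x_k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x = p1(p2(x))
    return x

# S1 = nonnegative orthant, S2 = hyperplane {x : sum(x) = 1}
p1 = lambda x: np.maximum(x, 0.0)                 # projection onto S1
p2 = lambda x: x + (1.0 - np.sum(x)) / len(x)     # projection onto S2
xfeas = alternating_projections(p1, p2, np.array([5.0, -3.0, 0.0]))
```

Because both sets are closed and convex with nonempty intersection, the iterates converge to a point lying in both sets.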
3.4 The Proximal Distance Method
We now turn to an exact penalty method that applies to nonsmooth functions. Clarke’s exact penalty method [40] turns the constrained problem of minimizing a function f(y) over a closed set S into the unconstrained problem of minimizing the penalized function f(y) + ρ dist(y, S) for sufficiently large ρ. Here is a precise statement of a generalization of Clarke’s result [26, 40, 49].
Proposition 3. Suppose that f(y) achieves a local minimum on S at the point x. Let φS(y) denote a function that vanishes on S and satisfies φS(y) ≥ c dist(y, S) for all y ∈ Rn and some positive constant c. If f(y) is locally Lipschitz continuous around x with constant L, then for every ρ ≥ c⁻¹L, the function Fρ(y) = f(y) + ρφS(y) achieves a local unconstrained minimum at x.
Classically the choice φS (x) = dist(x, S) was preferred. For affine equality constraints gi(x) = 0 and affine inequality constraints hj(x) ≤ 0, Hoffman’s bound [74]
dist(y, S) ≤ τ ‖(G(y), H(y)+)‖₂

applies, where τ is some positive constant and S is the feasible region where G(y) = 0 and H(y) ≤ 0. The vector H(y)+ has components hj(y)+ = max{hj(y), 0}. When S is the intersection of several closed sets S1, . . . , Sm, then the alternative

φS(y) = √( ∑_{i=1}^m dist(y, Si)² )   (3.7)

is attractive. The next proposition gives sufficient conditions under which the crucial bound
φS (y) ≥ c dist(y, S) is valid for the function (3.7).
Proposition 4. Suppose that S1, . . . , Sm are closed convex sets in Rn, where the first j sets are polyhedral. Assume further that the intersection