UNIVERSITY OF CALIFORNIA
Los Angeles

Projection algorithms for large scale optimization and genomic data analysis

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Biomathematics

by

Kevin Lawrence Keys

2016

© Copyright by
Kevin Lawrence Keys
2016

ABSTRACT OF THE DISSERTATION

Projection algorithms for large scale optimization and genomic data analysis

by

Kevin Lawrence Keys
Doctor of Philosophy in Biomathematics
University of California, Los Angeles, 2016
Professor Kenneth L. Lange, Chair

The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton's method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have intensively developed first-order methods that rely only on gradients and subgradients of a cost function.

In this dissertation we focus on projected gradient methods for large-scale constrained optimization. We develop a particular case of a proximal gradient method called the proximal distance algorithm. Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. To optimize the loss function $f(x)$ over a constraint set $C$, the proximal distance principle mandates minimizing the penalized loss $f(x) + \frac{\rho}{2}\operatorname{dist}(x, C)^2$ and following the solution $x_\rho$ to its limit as $\rho \to \infty$. At each iteration the squared Euclidean distance $\operatorname{dist}(x, C)^2$ is majorized by $\|x - \Pi_C(x_k)\|_2^2$, where $\Pi_C(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x) + \frac{\rho}{2}\|x - \Pi_C(x_k)\|_2^2$ is given by the proximal map $\operatorname{prox}_{\rho^{-1} f}[\Pi_C(x_k)]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.

For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding (IHT) for a large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.

The dissertation of Kevin Lawrence Keys is approved.

Lieven Vandenberghe

Marc Adam Suchard

Van Maurice Savage

Kenneth L. Lange, Committee Chair

University of California, Los Angeles

2016

To my parents

TABLE OF CONTENTS

1 Introduction ...... 1

2 Convex Optimization ...... 4

2.1 Convexity ...... 5

2.2 Projections and Proximal Operators ...... 7

2.3 Descent Methods ...... 9

2.3.1 Gradient Methods ...... 10

2.3.2 Proximal Gradient Method ...... 11

2.4 Second-order methods ...... 12

2.4.1 Newton’s method ...... 12

2.4.2 Conjugate gradient method ...... 13

2.5 The MM Principle ...... 14

3 The Proximal Distance Algorithm ...... 17

3.1 An Adaptive Barrier Method ...... 17

3.2 MM for an Exact Penalty Method ...... 21

3.2.1 Exact Penalty Method for Quadratic Programming ...... 23

3.3 Distance Majorization ...... 24

3.4 The Proximal Distance Method ...... 25

3.5 Examples ...... 29

3.5.1 Projection onto an Intersection of Closed Convex Sets ...... 29

3.5.2 Network Optimization ...... 31

3.5.3 Nonnegative Quadratic Programming ...... 33

3.5.4 Linear Regression under an ℓ0 Constraint ...... 36

3.5.5 Matrix Completion ...... 36

3.5.6 Sparse Precision Matrix Estimation ...... 40

3.6 Discussion ...... 43

4 Accelerating the Proximal Distance Algorithm ...... 45

4.1 Derivation ...... 45

4.2 Convergence and Acceleration ...... 48

4.3 Examples ...... 52

4.3.1 Linear Programming ...... 52

4.3.2 Nonnegative Quadratic Programming ...... 55

4.3.3 Closest Kinship Matrix ...... 57

4.3.4 Projection onto a Second-Order Cone Constraint ...... 59

4.3.5 Copositive Matrices ...... 62

4.3.6 Linear Complementarity Problem ...... 64

4.3.7 Sparse Principal Components Analysis ...... 65

4.4 Discussion ...... 70

5 Iterative Hard Thresholding for GWAS Analysis ...... 72

5.1 Introduction ...... 72

5.2 Methods ...... 74

5.2.1 Penalized regression ...... 74

5.2.2 Calculating step sizes ...... 78

5.2.3 Bandwidth optimizations ...... 78

5.2.4 Parallelization ...... 79

5.2.5 Selecting the best model ...... 80

5.3 Results ...... 80

5.3.1 Simulation ...... 81

5.3.2 Speed comparisons ...... 83

5.3.3 Application to lipid phenotypes ...... 84

5.4 Discussion ...... 86

6 Discussion and Future Research ...... 90

6.1 Parameter tuning for proximal distance algorithms ...... 90

6.2 IHT with nonlinear loss functions ...... 91

6.3 Other greedy algorithms for linear regression ...... 92

7 Notation ...... 95

7.1 Sets ...... 95

7.2 Vectors and Matrices ...... 95

7.3 Norms and Distances ...... 97

7.4 Functions and Calculus ...... 97

7.5 Projections and Proximal Operators ...... 98

7.6 Computation ...... 98

References ...... 99

LIST OF FIGURES

1.1 The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore’s law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Il- lumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Il- lumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases...... 2

2.1 A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129]...... 5

4.1 Proportion of variance explained by q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the ac- celerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA...... 68

4.2 Computation times for q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated prox- imal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA...... 69

5.1 A visual representation of model selection with the LASSO. The addition of the `1 penalty encourages representation of y by a subset of the columns of X...... 75

5.2 A graphical representation of penalized (regularized) regression using norm balls. From left to right, the graphs show ℓ2 or Tikhonov regression, ℓ1 or LASSO regression, and ℓ0 or subset regression. The ellipses denote level curves around the unpenalized optimum β. The penalized optimum occurs at the intersection of the level curves with the norm ball. Tikhonov regularization provides some shrinkage, while the shrinkage from LASSO regularization is more dramatic. The ℓ0 norm enforces sparsity without shrinkage. The MCP “norm ball” cannot be easily drawn but sits between the ℓ1 and ℓ0 balls. ...... 76

5.3 A view of sparse regression with thresholding operators. The order from left to right differs from Figure 5.2: the ℓ1 operator or soft thresholding operator, the MCP or firm thresholding operator, and the ℓ0 operator or hard thresholding operator. We clearly see how MCP interpolates the soft and hard thresholding operators. ...... 76

5.4 A visual representation of IHT. The algorithm starts at a point y and steps in the direction −∇f(y) with magnitude µ to an intermediate point y+. IHT then enforces sparsity by projecting onto the sparsity set Sm. The projection for m = 2 is the identity projection in this example, while projection onto S0 merely sends y+ to the origin 0. Projection onto S1 preserves the larger of the two components of y+. ...... 77

5.5 Mean squared error as a function of model size, as averaged over 5 cross-validation slices, for four lipid phenotypes from NFBC 1966...... 87

LIST OF TABLES

3.1 Performance of the adaptive barrier method in linear programming...... 21

3.2 Dykstra’s algorithm versus the proximal distance algorithm...... 31

3.3 CPU times in seconds and iterations until convergence for the network optimiza- tion problem. Asterisks denote computer runs exceeding computer memory limits. Iterations were capped at 200...... 33

3.4 CPU times in seconds and optima for the nonnegative quadratic program. Abbre- viations: n for the problem dimension, MM for the proximal distance algorithm, CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP...... 34

3.5 Numerical experiments comparing MM to MATLAB's lasso. Each row presents averages over 100 independent simulations. Abbreviations: n the number of cases, p the number of predictors, d the number of actual predictors in the generating model, p1 the number of true predictors selected by MM, p2 the number of true predictors selected by lasso, λ the regularization parameter at the LASSO optimal loss, L1 the optimal loss from MM, L1/L2 the ratio of L1 to the optimal LASSO loss, T1 the total computation time in seconds for MM, and T1/T2 the ratio of T1 to the total computation time of lasso. ...... 37

3.6 Comparison of the MM proximal distance algorithm to SoftImpute. Abbreviations: p is the number of rows, q is the number of columns, α is the ratio of observed entries to total entries, r is the rank of the matrix, L1 is the optimal loss under MM, L2 is the optimal loss under SoftImpute, T1 is the total computation time (in seconds) for MM, and T2 is the total computation time for SoftImpute. ...... 39

3.7 Numerical results for precision matrix estimation. Abbreviations: p for matrix dimension, kt for the number of nonzero entries in the true model, k1 for the number of true nonzero entries recovered by the proximal distance algorithm, k2 for the number of true nonzero entries recovered by glasso, ρ the average tuning constant for glasso for a given kt, L1 the average loss from the proximal distance algorithm, L1 − L2 the difference between L1 and the average loss from glasso, T1 the average compute time in seconds for the proximal distance algorithm, and T1/T2 the ratio of T1 to the average compute time for glasso. ...... 43

4.1 CPU times and optima for linear programming. Here m is the number of con- straints, n is the number of variables, PD is the accelerated proximal distance al- gorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized to be sparse with sparsity level 0.01. 54

4.2 CPU times and optima for nonnegative quadratic programming. Here n is the number of variables, PD is the accelerated proximal distance algorithm, IPOPT is the Ipopt solver, and Gurobi is the Gurobi solver. After n = 512, the constraint matrix A is sparse...... 56

4.3 CPU times and optima for the closest kinship matrix problem. Here the kinship matrix is n × n, PD1 is the proximal distance algorithm, PD2 is the accelerated proximal distance, PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss, and Dykstra is Dykstra’s adaptation of alternating projections. All times are in seconds...... 58

4.4 CPU times and optima for the second-order cone projection. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized with sparsity level 0.01...... 61

4.5 CPU times (seconds) and optima for approximating the Horn variational index of a Horn matrix. Here n is the size of Horn matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 63

4.6 CPU times and optima for testing the copositivity of random symmetric matrices. Here n is the size of matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 64

4.7 CPU times (seconds) and optima for the linear complementarity problem with ran- domly generated data. Here n is the size of matrix, PD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver...... 65

5.1 Model selection performance on NFBC1966 chromosome 1 data...... 82

5.2 Computational times in seconds on NFBC1966 chromosome 1 data...... 84

5.3 Dimensions of data used for each phenotype in GWAS experiment. Here n is the number of cases, p is the number of predictors (genetic + covariates), and mbest is the best cross-validated model size. Note that mbest includes nongenetic covariates. ...... 85

5.4 Computational results from the GWAS experiment. Here β is the calculated effect size. Known associations include the relevant citation...... 88

6.1 Model selection performance of IHT and the exchange algorithm on NFBC1966 chromosome 1 data...... 94

ACKNOWLEDGMENTS

The material presented in this dissertation was funded by the UCLA Graduate Opportunity Fel- lowship Program, a National Science Foundation Graduate Research Fellowship (DGE-0707424), a Predoctoral Training Grant from the National Human Genome Research Institute (HG002536), and financial support from the UCLA Department of Biomathematics, the Stanford University Department of Statistics, and the startup funds of Kenneth Lange.

The students in the Biomathematics program unwillingly bore the brunt of the highs and lows of my graduate research career, and the unfortunate souls that occupied my office suffered the wrath of prodigious puns, dark humor, and gratuitous foul language. For their tolerance, I wish to particularly thank Forrest Crawford, Gabriela Cybis, Joshua Chang, Wesley Kerr, Lae Un Kim, Trevor Shaddox, Bhaven Mistry, and Timothy Stutz.

During the course of graduate school I learned that undergraduate mentors are also lifelong mentors. Marc Tischler, Joseph Watkins, and William Yslas Vélez never hesitated to lend advice or a reference letter. Dr. Vélez informed me that he is slated to retire in 2017, a well-deserved reward after bringing thousands of undergraduates through the mathematics program at The University of Arizona. I aspire to someday possess even an ounce of his work ethic. I will always remember fondly my interactions with María Teresa Vélez, the other Dr. Vélez, former associate dean of the Graduate College at The University of Arizona, who always greeted me at conferences with a warm embrace and a motherly concern for my degree progress. She tragically passed away mere weeks before I defended this dissertation. May she rest in power.

The research in this dissertation represents collaborative work with Gary K. Chen and Hua Zhou, both of whom are much better programmers than I could ever hope to be. Their patient, thoughtful, and careful approach to software development is one that I hope to mimic in my career. An early collaboration with Eric C. Chi and Gary bore no fruit but sparked my interest in the sparse regression methods that ultimately constitute the capstone of my dissertation. I wish to thank my advisor Kenneth Lange, who gracefully and patiently introduced me to the world of optimization. He never hesitated to offer financial, intellectual, or emotional support during my graduate school career.

To my family I owe an unpayable debt. To this day, my parents, my brothers, my uncles and aunts, and my cousins do not understand what I did during graduate school or why I did it. Nonetheless, they always tried to offer emotional support and healthy distractions when needed. I must particularly thank my mother, who somehow summoned the patience to find a silver lining every time that my career prospects seemed to recede into a bleak and cloudy future. Perhaps someday I can complete her wish of “finding the gene that causes cancer.” Lastly, I must thank Gabriela Bran Anleu, who concurrently supported my graduate career while traversing her own. Her convictions, her stubbornness, and her optimism for a future filled with renewable energy sources still inspire me to this day.

VITA

2007–2010 Research Assistant with Michael Hammer, Arizona Research Laboratories, The University of Arizona, Tucson, AZ

2009 Visiting Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain

2010 B.S. (Mathematics) and B.A. (Linguistics), The University of Arizona

2010–2011 Fulbright Student Researcher with Jaume Bertranpetit, Institut de Biología Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain

2011–present Graduate Student, Department of Biomathematics, University of California, Los Angeles, CA

2012 M.S. (Biomathematics), University of California, Los Angeles

2014 Visiting Student Researcher with Tim Conrad, Konrad Zuse Zentrum, Freie Universität Berlin, Berlin, Germany

2014–2015 Visiting Graduate Researcher, Department of Statistics, Stanford University, Stanford, CA

PUBLICATIONS

Keys KL and Lange KL. “An exchange algorithm for least squares regression”. (in preparation)

Keys KL, Chen GK, Lange KL. “Hard Thresholding Pursuit Algorithms for Model Selection in Genome-Wide Association Studies”. (in preparation)

Keys KL, Zhou H, Lange KL. “Proximal Distance Algorithms: Theory and Examples”. (submitted)

Montanucci L, Laayouni H, Dobón B, Keys KL, Bertranpetit J, and Peretó J. “Influence of topology and functional class on the molecular evolution of human metabolic genes.” Molecular Biology and Evolution. (submitted)

Lange KL and Keys KL. “The MM Proximal Distance Algorithm.” Proceedings of the 2014 International Congress of Mathematicians, Seoul, South Korea.

Dall’Olio GM, Marino J, Schubert M, Keys KL, Stefan MI, Gillespie CS, Poulein P, Shameer K, Suger R, Invergo BM, Jensen LJ, Bertranpetit J, Laayouni H. “Ten simple rules for getting help from online scientific communities.” PLOS Computational Biology 7:9 (2011), e1002202.

CHAPTER 1

Introduction

The fields of genetics and genomics have blossomed since the publication of the first sequenced human genome in 2003. Modern genotyping and sequencing technologies have dramatically low- ered the cost of genetic data collection. The National Human Genome Research Institute (NHGRI) of the United States monitors the average cost of sequencing a 3 gigabase human genome. The graph in Figure 1.1 shows the striking decline in sequencing costs from the year 2001 up to the year 2015, the most recent year for which data are available [138]. The sheer scale of data that modern genomic technology can generate vastly outpaces the computational hardware and soft- ware to analyze it. Typical issues under the well-worn “Big Data” label, such as memory limits and scalable algorithms, are crucially important for modern genetic analysis software.

This dissertation addresses one facet of the genomic data boom, the analysis of genome-wide association studies (GWASes). At its core, GWAS analysis is a very large regression problem. The solution proposed here draws from the fields of computer science, mathematical optimization, and statistical genetics to formulate several software packages for linear regression in GWAS. The story behind the development of this software starts firmly in the field of convex optimization, in particular the class of proximal gradient algorithms, and slowly moves into nonconvex algorithms. Along the way it will create computational tools useful in other genomics contexts, such as sparse principal components analysis (SPCA) and sparse precision matrix estimation commonly used in genetic expression analyses. The climax of this story is an implementation of an algorithm called iterative hard thresholding (IHT) that performs efficient model selection in GWAS. The exposition presented in the following chapters assumes little previous biological coursework. However, those who have not studied mathematics will find the topics intimidating. At a minimum, readers should be comfortable with real analysis, linear algebra, multivariate calculus, and linear statistical models

Figure 1.1: The cost of sequencing a single human genome, which we assume to be 3,000 megabases, is shown by the green line on a logarithmic scale. Moore's law of computing is drawn in white. The data are current as of April 2015. After January 2008 modern sequencing centers switched from Sanger dideoxy chain termination sequencing to next-generation sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing. For Sanger sequencing, the assumed coverage is 6-fold with average read length of 500-600 bases. 454 sequencing assumes 10-fold coverage with average read length 300-400 bases, while the Illumina/SOLiD sequencers attain 30-fold coverage with an average read length of 75-150 bases.

(at the undergraduate level) and some optimization theory and numerical linear algebra (at the graduate level).

At this juncture, it is important to emphasize what this dissertation does not represent. It does not represent a synthesis of theorems and proofs. In many instances, both convergence and recovery guarantees are taken for granted. Where proofs are not given, relevant mathematical references are provided. Readers seeking mathematical rigor are encouraged to look elsewhere. This work is also not complete: the final product of this investigation, a group of software packages coded in the new Julia programming language, is as much a work in progress as the Julia language

itself. The tactics and implementations detailed herein may well become outdated in a few years. The hope is that this software suite will serve as a springboard or benchmark for future software development targeting increasingly powerful hardware with increasingly clever algorithms.

The rest of this dissertation proceeds as follows. Chapter 2 lightly sketches the necessary convex optimization knowledge to understand the algorithms described later. Chapters 3 and 4 describe the development of the class of proximal gradient algorithms that we call proximal distance algorithms. The proximal distance algorithm serves as a springboard for thinking about sparse regression frameworks such as IHT, while Chapter 5 demonstrates the superiority of IHT versus current software for feature selection in GWAS. The discussion in Chapter 6 draws a roadmap for future directions that this project could take. As will be demonstrated, IHT is a promising framework that could eventually dominate the sparse regression world.

CHAPTER 2

Convex Optimization

The field of mathematical optimization or mathematical programming is concerned with finding the optimal points (minima and maxima) of functions f : U → R over an open domain U. The problem of unconstrained optimization seeks the optimal points of a scalar-valued function f over its entire domain. A constrained optimization problem arises when we optimize f over some set C ⊂ dom f.

The field of optimization traces its roots to early developments in calculus [45, 59]. Theoretical insights from Fermat and Legendre used tools from calculus to determine explicit formulæ for determining optimal values of a function. Newton and Gauss developed iterative methods for computing optima, one of which we now know as Newton’s method. The early 20th century saw the birth of linear programming, the simplest case of mathematical programming. Leonid Kantorovich laid the foundational theory of linear programming [82], while George Dantzig coined the term “linear programming” and published the simplex algorithm [43]. The theory of duality, originally developed by John von Neumann for economic game theory, was found to apply to linear programming as well [133].

Since the 1950s, the field of optimization has blossomed and evolved rapidly. For reasons that will become clear later, we will focus on the important subdomain of optimization known as convex optimization. Convex optimization deals with the optimization of convex functions over convex sets. The exposition given here offers a mere glimpse into the vast literature of convex optimization. Several books [20, 26, 27, 72, 93, 94, 120] offer a rigorous mathematical development of convex analysis. Algorithms for convex optimization are described in [15, 111, 112].

4 Convex set Nonconvex set

Figure 2.1: A graphical representation of a convex set and a nonconvex one. As noted in Definition 1, a convex set contains all line segments between any two points in the set. Image courtesy of Torbjørn Taskjelle from StackExchange [129].

2.1 Convexity

Convexity is a fundamental property in mathematical optimization.

Definition 1. A set $S \subset \mathbb{R}^n$ is convex if for all $x, y \in S$ and every $\alpha \in [0, 1]$ the point $z = \alpha x + (1 - \alpha) y$ also belongs to $S$.

An intuitive interpretation of Definition 1 is that a convex set S contains all line segments between any two points in S. Figure 2.1 demonstrates this explicitly by juxtaposing a convex set with a nonconvex one.

Definition 2. A function $f : U \to \mathbb{R}$ with convex domain $U$ is called a convex function if it satisfies
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) \tag{2.1}$$
for all $x, y \in U$ and all $\alpha \in [0, 1]$.

When strict inequality holds in (2.1), f is said to be strictly convex. If a convex function f is differentiable, then we have the following useful result.

Proposition 1. (First Order Condition for Convexity) Consider a function $f : U \to \mathbb{R}$ with open convex domain $U \subset \mathbb{R}^n$. Then $f$ is convex if and only if for all $x, y \in U$ we have
$$f(y) \ge f(x) + \nabla f(x)^T (y - x). \tag{2.2}$$

Proof. The proof flows from Definition 2. See [93] for details.

The first order condition (2.2) states that $f$ lies above the tangent hyperplane determined by $\nabla f(x)$ at a tangent point $x$. If $f$ is twice differentiable, then a stronger result holds.

Proposition 2. (Second Order Condition for Convexity) Consider a twice-differentiable function

$f : U \to \mathbb{R}$ over an open convex domain $U \subset \mathbb{R}^n$. If $\nabla^2 f(x) \succeq 0$ for every $x \in U$, then $f$ is convex.

Proof. Following the exposition in [93], the expansion
$$f(y) = f(x) + \nabla f(x)^T (y - x) + (y - x)^T \left[ \int_0^1 \nabla^2 f\bigl(x + \alpha (y - x)\bigr)(1 - \alpha)\, d\alpha \right] (y - x)$$
for $x, y \in U$ yields the first order condition (2.2), thus demonstrating the convexity of $f$.

A related concept is the notion of strong convexity.

Definition 3. A function $f : U \to \mathbb{R}$ is called strongly convex with parameter $m > 0$ if for all points $x, y \in U$ and any $\alpha \in [0, 1]$ we have
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{m}{2} \alpha (1 - \alpha) \|x - y\|_2^2.$$

If $f$ is differentiable, then a first-order condition for strong convexity is given by
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2.$$

If $f$ is twice differentiable, then $f$ is strongly convex provided that $\nabla^2 f(x) \succeq m I$.

Strong convexity bounds the smallest eigenvalue of $\nabla^2 f$ away from 0. Strongly convex functions are generally easy to optimize. However, the set of strongly convex functions is smaller than the set of strictly convex functions, so their scope is limited.

The first order condition (2.2) can be generalized through a convex relaxation of differentiability known as subdifferentiability.

Definition 4. A subgradient of a convex function $f : U \to \mathbb{R}$ with $U \subset \mathbb{R}^n$ is any vector $g \in \mathbb{R}^n$ satisfying
$$f(y) - f(x) \ge g^T (y - x) \tag{2.3}$$
for all $x, y \in U$.

Definition 5. (Subdifferentiability) A convex function $f : U \to \mathbb{R}$ is subdifferentiable if the subgradient is defined at every point in $U$. The subdifferential $\partial f(x)$ at a point $x$ is the set of all subgradients of $f$ at $x$.

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, so differentiable functions are by definition subdifferentiable. To illustrate a nondifferentiable $f$ that is subdifferentiable, consider the absolute value function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = |x|$. Then
$$\partial f(x) = \begin{cases} \{1\} & x > 0, \\ \{-1\} & x < 0, \\ [-1, 1] & x = 0. \end{cases}$$

For the smooth portions of $f$, there exists only one slope, given by the derivative function $f'$. At the point $x = 0$ where $f$ is nondifferentiable, the subdifferential $\partial f$ contains the slopes of all possible tangent lines at $x$.

An important concept in convex analysis is that of the conjugate function.

Definition 6. The convex conjugate of a function $f$ (alternatively the Fenchel conjugate or the Legendre-Fenchel conjugate of $f$) is defined as
$$f^\star(x) = \sup_y \, \bigl\{ y^T x - f(y) \bigr\}.$$

The conjugate $f^\star$ is always closed and convex regardless of the convexity of $f$.

2.2 Projections and Proximal Operators

Projection operators occupy a useful niche in optimization.

Definition 7. The projection operator $\Pi_S : \mathbb{R}^n \to \mathbb{R}^n$ onto a set $S$ (alternatively the Euclidean projection onto $S$) maps a point $x \in \mathbb{R}^n$ to a possibly nonunique point $y \in S$ that minimizes the Euclidean distance $\operatorname{dist}(x, y)$. In functional terms, we have
$$\Pi_S(x) = \operatorname*{argmin}_{y \in S} \|x - y\|_2 \tag{2.4}$$
$$\operatorname{dist}(x, S) = \inf_{y \in S} \|x - y\|_2. \tag{2.5}$$

If S is closed and convex, then ΠS maps x uniquely to its counterpart y ∈ S, and dist(x, S) is a convex function. Projections onto many particular convex sets are known in closed or computable form [9, 11].

An important generalization of a projection operator is known as a proximal operator.

Definition 8. The proximal operator $\operatorname{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$ (alternatively the proximity operator or the proximal map) for a closed convex function $f$ is the solution to the optimization problem
$$\operatorname{prox}_f(x) = \operatorname*{argmin}_y \left\{ f(y) + \frac{1}{2} \|x - y\|_2^2 \right\}. \tag{2.6}$$

The proximal operator is unique and exists for all $x \in \operatorname{dom} f$ [11, 107]. It is often useful to parametrize the proximal operator by a step size $t$:
$$\operatorname{prox}_{tf}(x) = \operatorname*{argmin}_y \left\{ t f(y) + \frac{1}{2} \|x - y\|_2^2 \right\} = \operatorname*{argmin}_y \left\{ f(y) + \frac{1}{2t} \|x - y\|_2^2 \right\}.$$

The proximal operator is the solution to the Moreau-Yosida regularization of $f$, so evaluating $\operatorname{prox}_{tf}(x)$ is itself an optimization problem. Proximal operators are particularly useful when $f$ is nondifferentiable or otherwise difficult to optimize.

The analytical properties of proximal operators are well understood. We can view them as generalized projections. Intuitively, the proximal operator establishes a compromise between minimizing the distance to $x$ and minimizing the function $f$ itself. Like the projection operators that they generalize, proximal operators of many functions have closed-form or computable solutions [11, 117]. For example, the proximal operator of the indicator function of a set $C$,
$$\delta_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C, \end{cases}$$
is simply the projection $\Pi_C$ onto $C$.

One important property of proximal operators is firm nonexpansiveness. If $f$ is a closed convex function, then $\operatorname{prox}_{tf}$ satisfies
$$\bigl\|\operatorname{prox}_{tf}(x) - \operatorname{prox}_{tf}(y)\bigr\|_2^2 \le \bigl[\operatorname{prox}_{tf}(x) - \operatorname{prox}_{tf}(y)\bigr]^T (x - y)$$
for all $x, y \in \operatorname{dom} f$. Firmly nonexpansive operators $T(x)$ are useful for fixed point algorithms since the iteration scheme
$$x_{k+1} = (1 - \rho) x_k + \rho T(x_k)$$
converges weakly to a fixed point whenever $\rho \in (0, 2)$.

Another important property of proximal operators is known as the Moreau decomposition. A closed convex function $f$ is related to its conjugate $f^\star$ via the relation
$$x = \operatorname{prox}_f(x) + \operatorname{prox}_{f^\star}(x).$$
The Moreau decomposition is similar in spirit to the orthogonal decomposition in linear algebra, in which a vector $x \in \mathbb{R}^n$ is split into the sum of two vectors $y \in C$ and $z \in C^\perp$ for a closed subspace $C \subset \mathbb{R}^n$.
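To make these operators concrete, the following Julia sketch (our own illustration; the helper names are not taken from the dissertation's software) evaluates two closed-form proximal maps and checks the Moreau decomposition numerically. For $f(y) = \|y\|_1$ the conjugate $f^\star$ is the indicator of the unit $\ell_\infty$ ball, so $\operatorname{prox}_f$ is elementwise soft thresholding and $\operatorname{prox}_{f^\star}$ is a projection.

```julia
using LinearAlgebra

# prox of f(y) = ||y||_1 with unit step size: elementwise soft thresholding
soft_threshold(x) = sign.(x) .* max.(abs.(x) .- 1, 0)

# prox of the conjugate f*, the indicator of the unit l-infinity ball: a projection
project_box(x) = clamp.(x, -1, 1)

x = randn(5)
soft_threshold(x) + project_box(x) ≈ x   # Moreau decomposition: prox_f(x) + prox_f*(x) = x
```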

2.3 Descent Methods

Descent methods are iterative schemes for optimizing a function $f$ by producing a minimizing sequence $\{x_k\}$, $k = 1, 2, \ldots$, that satisfies
$$x_{k+1} = x_k + t_k \Delta x_k, \qquad f(x_{k+1}) \le f(x_k)$$
with step direction $\Delta x_k$ and step size $t_k > 0$ for all suboptimal $x_k$. Descent methods come in many flavors, and their domain of application can vary depending on the size and complexity of $f$.

The class of algorithms known as gradient methods or first-order methods optimize a function f using first-order (sub)differentiability of f [27, 35]. Gradient methods are sometimes called steepest descent methods since the search direction ∆x uses the negative gradient −∇f(x), which points in the direction of steepest descent. They follow the simple update scheme

xk+1 := xk − tk∇f(xk) (2.7)

A more complete algorithm appears as Algorithm 1.

Algorithm 1 The method. Require: a starting point x ∈ dom f and a tolerance  > 0 with   1. repeat ∆x := −∇f(x). Choose step size t with appropriate method. Update x := x + t∆x.

until k∆xk2 < 

If f is subdifferentiable but not differentiable then replacing the gradient ∇f with a subgradient g ∈ ∂f at every point x yields a subgradient method. Subgradient methods typically exhibit slower convergence than similar gradient descent methods, but they apply to a much larger class of functions.

Strictly speaking, Algorithm 1 describes an unconstrained minimization scheme. If we wish to optimize a convex differentiable function f over a constraint set C, then we use the projected gradient descent update

xk+1 := ΠC (xk − tk∇f(xk)) (2.8)

For certain conditions on t, the update scheme (2.8) converges stably to the constrained minimum of f [19]. For example, if ∇f is Lipschitz continuous with Lipschitz constant L, then a constant step size t ∈ (0, 2/L)) ensures convergence with (2.8). Exploiting Lipschitz continuity yields the simplest convergence guarantees; more complicated guarantees rely on the Wolfe conditions

10 [140, 141] or the Armijo rule [5]. We will make frequent use of constant step sizes t ∈ (0, 1/L) based on Lipschitz constants for reasons that will become clear later.

2.3.2 Proximal Gradient Method

Suppose that we can split a convex objective function $f : U \to \mathbb{R}$ into the sum $f = g + h$ of two closed proper convex functions $g : U \to \mathbb{R}$ and $h : U \to \mathbb{R}$ where $g$ is differentiable. The proximal gradient method is the iterative scheme given by
$$x_{k+1} := \operatorname{prox}_{t_k h}\bigl(x_k - t_k \nabla g(x_k)\bigr) \tag{2.9}$$
with step size $t_k$ at iteration $k$. If we optimize $f$ via the surrogate function
$$g(x \mid x_k, t) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2t} \|x - x_k\|_2^2, \tag{2.10}$$
then we can compute a suitable $t$ with the line search of Beck and Teboulle [13] described in Algorithm 2. The use of surrogate functions presages our discussion of majorization methods in Section 2.5.

Algorithm 2 Line search for the proximal gradient method.
Require: the current iterate $x_k$, the previous step size $t_{k-1}$, and a parameter $\beta \in (0, 1)$.
Let $t := t_{k-1} / \beta$.
repeat
    $t := \beta t$
    $z := \operatorname{prox}_{th}\bigl(x_k - t \nabla g(x_k)\bigr)$
until $f(z) \le g(z \mid x_k, t)$
return $t_k := t$, $x_{k+1} := z$.

If $h$ is the indicator $\delta_C$ of the constraint set $C$, then (2.9) reduces to the projected gradient descent scheme (2.8). Setting $h = 0$ yields the standard gradient descent scheme (2.7).
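For a concrete instance of (2.9), the sketch below (again illustrative; the penalty level λ and the function names are our own choices, not the dissertation's code) applies the proximal gradient method to the LASSO objective $\frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$: the smooth part supplies the gradient step, and the prox of $t\lambda\|\cdot\|_1$ is soft thresholding.

```julia
using LinearAlgebra

soft_threshold(x, t) = sign.(x) .* max.(abs.(x) .- t, 0)

# Minimal sketch of the proximal gradient update (2.9) for the LASSO problem.
function proximal_gradient_lasso(A, b, λ; maxiter = 500)
    t = 1 / opnorm(A)^2                             # constant step size 1/L
    x = zeros(size(A, 2))
    for k in 1:maxiter
        grad = A' * (A * x - b)                     # ∇g(x)
        x = soft_threshold(x - t * grad, t * λ)     # prox_{t h} of the gradient step
    end
    return x
end

A, b = randn(200, 50), randn(200)
xhat = proximal_gradient_lasso(A, b, 0.5)
count(!iszero, xhat)     # number of nonzero coefficients selected
```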

11 2.4 Second-order methods

Second-order methods for optimizing a convex function $f$ exploit approximate or exact second derivative information about $f$. When $f$ is twice differentiable, its Hessian matrix $\nabla^2 f$ provides curvature information useful for computing search directions with Newton's method. Approximate second-order methods such as the conjugate gradient method extrapolate second-order information from $\nabla f$.

2.4.1 Newton’s method

Suppose that $f : U \to \mathbb{R}$ is closed, convex, and twice differentiable. The Newton step for $f$ at $x$ is defined as
$$\Delta x_{\mathrm{nt}} = -\bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x).$$

The Newton step is motivated by considering the second order Taylor expansion of $f$ at $x$ given by
$$\tilde{f}(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v.$$

Observe that $\tilde{f}$ is a convex quadratic function of $v$. The minimizer with respect to $v$ is
$$v = -\bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x),$$
which coincides with $\Delta x_{\mathrm{nt}}$. When $\nabla^2 f(x) \in \mathbb{S}^n_+$, as is the case for convex functions, the Newton step gives the direction of steepest descent in the quadratic norm
$$\|y\|_{\nabla^2 f(x)} = \sqrt{y^T \nabla^2 f(x)\, y}$$
defined by the Hessian $\nabla^2 f(x)$ at $x$. A more intuitive explanation is that Newton's method warps the direction of steepest descent in accordance with information from $\nabla^2 f$. Newton's method attains quadratic convergence near the minimum [27], but it can easily overshoot the minimum if no safeguards are put in place. Typically Newton directions are damped with a suitable backtracking line search. Monitoring the Newton decrement
$$N(x) = \sqrt{\nabla f(x)^T \bigl[\nabla^2 f(x)\bigr]^{-1} \nabla f(x)} \tag{2.11}$$

12 yields a simple stopping criterion. Algorithm 3 sketches one version of the damped Newton method.

Algorithm 3 Damped Newton's method.
given a starting point $x \in \operatorname{dom} f$ and a tolerance $\epsilon > 0$ with $\epsilon \ll 1$.
repeat
    Compute the Newton step $\Delta x_{\mathrm{nt}} := -[\nabla^2 f(x)]^{-1} \nabla f(x)$.
    Compute step size $t$ via backtracking line search.
    Update $x := x + t \Delta x_{\mathrm{nt}}$.
    Compute the convergence criterion $\lambda := N^2(x)$.
until $\lambda / 2 < \epsilon$
return $x$.
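As a compact illustration of Algorithm 3, the following Julia sketch applies damped Newton steps to the logistic regression log-likelihood. The test problem, the function names, and the Armijo constant 0.25 are our own choices rather than anything prescribed by the text; the squared Newton decrement (2.11) supplies the stopping rule.

```julia
using LinearAlgebra

function damped_newton(X, y; tol = 1e-8, maxiter = 50)
    f(b) = sum(log1p.(exp.(X * b))) - dot(y, X * b)   # negative log-likelihood
    β = zeros(size(X, 2))
    for iter in 1:maxiter
        p = 1 ./ (1 .+ exp.(-X * β))                  # fitted probabilities
        grad = X' * (p - y)
        H = X' * Diagonal(p .* (1 .- p)) * X          # Hessian
        Δ = -(H \ grad)                               # Newton step
        λsq = dot(grad, -Δ)                           # squared Newton decrement N²(β)
        λsq / 2 < tol && break
        t = 1.0                                       # backtracking line search
        while f(β + t * Δ) > f(β) - 0.25 * t * λsq
            t /= 2
        end
        β += t * Δ
    end
    return β
end

X = [ones(100) randn(100, 3)]
y = Float64.(rand(100) .< 0.5)      # synthetic 0/1 responses
βhat = damped_newton(X, y)
```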

2.4.2 Conjugate gradient method

Suppose that we wish to solve the linear system $Ax = b$ where $A \in \mathbb{S}^n_+$. For dense systems with millions or billions of equations, the burden of computing a Newton step can overwhelm most computational hardware. The conjugate gradient method is well suited to numerically solving large sparse systems of linear equations [70, 122]. In exact arithmetic, the conjugate gradient method converges in no more than $n$ iterations. However, even tiny numerical imprecisions render the conjugate gradient method unstable as a direct method on computers.

Fortunately, the conjugate gradient method works remarkably well as an iterative method. For any two vectors $u, v \in \mathbb{R}^n$ we say that $u$ and $v$ are conjugate with respect to $A$ if $u^T A v = 0$. Because $A \in \mathbb{S}^n_+$, the conjugate relation defines an inner product $\langle A u, v \rangle$. Suppose that we form a set $P$ of $n$ mutually conjugate vectors $p_1, p_2, \ldots, p_n$ under the inner product defined by $A$. Then $P$ forms a basis for $\mathbb{R}^n$. For the aforementioned linear system, we have $P = \{b, Ab, A^2 b, \ldots, A^{n-1} b\}$. We call $\operatorname{span} P$ a Krylov subspace [89]. According to the Cayley-Hamilton theorem, the matrix $A^{-1}$ used to solve $Ax = b$ can be expressed as a linear combination of the powers of $A$. Since $P$ contains images of $b$ under powers of $A$, there exists a matrix $\tilde{A} \in \operatorname{span} P$ such that $\tilde{A} \approx A^{-1}$. This approximation is good so long as $A$ is well conditioned; using the conjugate gradient method

with a poorly conditioned $A$ often requires pre- or post-multiplication of $A$ by a rescaling operator called a preconditioner. Algorithm 4 succinctly describes an unpreconditioned conjugate gradient method as an iterative solver.

Algorithm 4 A typical conjugate gradient algorithm.
given parameters $A$ and $b$, and a tolerance $\epsilon > 0$ with $\epsilon \ll 1$.
initialize a starting point $x_0 \in \mathbb{R}^n$, the residual vector $r_0 := b - A x_0$, the search direction $p_0 := r_0$, and the squared norm $c_0 := \|r_0\|_2^2$.
repeat
    Compute the Krylov vector $z_k := A p_k$.
    Compute the ratio $\alpha_k := c_k / p_k^T z_k$.
    Update the estimated solution $x_{k+1} := x_k + \alpha_k p_k$.
    Update the residual $r_{k+1} := r_k - \alpha_k z_k$.
    Update the squared norm $c_{k+1} := \|r_{k+1}\|_2^2$.
    Update the search direction $p_{k+1} := r_{k+1} + (c_{k+1} / c_k)\, p_k$.
until $\sqrt{c_{k+1}} < \epsilon$
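The listing below is a direct Julia transcription of Algorithm 4; it is a sketch for illustration (the function name and test matrix are our own) rather than a production solver.

```julia
using LinearAlgebra

function conjugate_gradient(A, b; tol = 1e-10, maxiter = length(b))
    x = zeros(length(b))
    r = b - A * x                 # residual
    p = copy(r)                   # search direction
    c = dot(r, r)                 # squared residual norm
    for k in 1:maxiter
        z = A * p                 # Krylov vector
        α = c / dot(p, z)
        x += α * p
        r -= α * z
        cnew = dot(r, r)
        sqrt(cnew) < tol && break
        p = r + (cnew / c) * p
        c = cnew
    end
    return x
end

M = randn(50, 50); A = M' * M + I   # a random symmetric positive definite matrix
b = randn(50)
x = conjugate_gradient(A, b)
norm(A * x - b)                      # should be near zero
```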

2.5 The MM Principle

The MM principle (alternatively optimization transfer or iterative majorization) is a device for constructing optimization algorithms [25, 76, 95, 93, 90]. In essence, it replaces the objective function $f(x)$ by a simpler surrogate function $g(x \mid x_k)$ anchored at the current iterate $x_k$ and majorizing or minorizing $f(x)$. As a byproduct of optimizing $g(x \mid x_k)$ with respect to $x$, the objective function $f(x)$ is sent downhill or uphill, depending on whether the purpose is minimization or maximization. The next iterate $x_{k+1}$ is chosen to optimize the surrogate $g(x \mid x_k)$ subject to any relevant constraints. Majorization combines two conditions: the tangency condition $g(x_k \mid x_k) = f(x_k)$ and the domination condition $g(x \mid x_k) \ge f(x)$ for all $x$. In minimization these conditions and the definition of $x_{k+1}$ lead to the descent property
$$f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k). \tag{2.12}$$

Minorization reverses the domination inequality and produces an ascent algorithm. Under appropriate regularity conditions, an MM algorithm is guaranteed to converge to a stationary point of the objective function [90]. In particular, the MM principle is ideally suited for optimizing convex objective functions since their surrogates can exploit the machinery of convex optimization. From the perspective of dynamical systems, the objective function serves as a Lyapunov function for the algorithm map.

The MM principle simplifies optimization by: (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a nondifferentiable problem into a smooth problem. Choosing a tractable surrogate function $g(x \mid x_k)$ that hugs the objective function $f(x)$ as tightly as possible requires experience and skill with inequalities. The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. Hence, it is possible to work piecemeal in majorizing complicated objective functions.
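As a toy illustration of the majorize-then-minimize recipe (our own example, not one from the text), the Julia sketch below minimizes $f(x) = \sum_i |x - a_i|$ by majorizing each absolute value at $x_k$ with the quadratic $(x - a_i)^2 / (2|x_k - a_i|) + |x_k - a_i|/2$. Minimizing the surrogate gives a weighted mean, and by the descent property (2.12) the iterates slide downhill toward the sample median.

```julia
# MM iteration for the one-dimensional median; δ guards against division by zero.
function mm_median(a; iters = 100, δ = 1e-10)
    x = sum(a) / length(a)                 # start at the mean
    for k in 1:iters
        w = 1 ./ (abs.(x .- a) .+ δ)       # weights from the quadratic majorizers
        x = sum(w .* a) / sum(w)           # minimizer of the surrogate
    end
    return x
end

a = [1.0, 2.0, 3.0, 4.0, 100.0]
mm_median(a)       # ≈ 3.0, the sample median, unlike the mean (22.0)
```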

The MM principle as formulated here represents the synthesis of a complex history. Specific MM algorithms appeared years before the principle was well understood [67, 105, 125, 137, 145]. Projected gradient and proximal gradient algorithms can be motivated from the MM perspective, but the early emphasis on operators and fixed points obscured this distinction. The celebrated EM (expectation-maximization) principle of computational statistics is a special case of the MM principle [106]. Although Dempster, Laird, and Rubin [48] formally named the EM algorithm, many of their contributions were anticipated by Baum [8] and Sundberg [127]. The MM princi- ple was clearly stated by Ortega and Rheinboldt [114]. de Leeuw [46] is generally credited with recognizing the importance of the principle in practice. The EM algorithm had an immediate and large impact in computational statistics, but the more general MM principle was much slower to take hold. The papers [47, 69, 84] by the Dutch school of psychometricians solidified its posi- tion. The related Dinkelbach [52] maneuver in fractional linear programming also highlighted the importance of the descent property in algorithm construction.

Since the MM principle is not an algorithm per se, it can easily exploit the aforementioned gradient and Newton descent methods. The development of the proximal distance algorithm and iterative hard thresholding in the following chapters will make explicit use of both the MM principle and gradient descent methods.

CHAPTER 3

The Proximal Distance Algorithm

The current exposition emphasizes the role of the MM principle in nonlinear programming. For smooth functions, one can construct an adaptive interior point method based on scaled Bregman barriers. This algorithm does not follow the central path. For convex programming subject to nonsmooth constraints, one can combine an exact penalty method with distance majorization to create versatile algorithms that are effective even in discrete optimization. These proximal distance algorithms are highly modular and reduce to set projections and proximal mappings, both very well-understood techniques in optimization. We illustrate the possibilities in linear programming, binary piecewise-linear programming, nonnegative quadratic programming, ℓ0 regression, matrix completion, and sparse precision matrix estimation.

3.1 An Adaptive Barrier Method

In convex programming it simplifies matters notationally to replace a convex inequality constraint hj(x) ≤ 0 by the concave constraint vj(x) = −hj(x) ≥ 0. Barrier methods operate on the relative interior of the feasible region where all vj(x) > 0. Adding an appropriate barrier term to the objective function f(x) keeps an initially inactive constraint vj(x) inactive throughout an optimization search. If the barrier function is well designed, it should adapt and permit convergence to a feasible point y with one or more inequality constraints active.

We now briefly summarize an adaptive barrier method that does not follow the central path [91]. Because the logarithm of a concave function is concave, the Bregman majorization [29]
$$-\ln v_j(x) + \ln v_j(x_k) + \frac{1}{v_j(x_k)} \nabla v_j(x_k)^T (x - x_k) \ge 0$$
acts as a convex barrier for a smooth constraint $v_j(x) \ge 0$. To make the barrier adaptive, we scale

s s X X T g(x | xk) = f(x) − ρ vj(xk) ln vj(x) + ρ ∇vj(xk) (x − xk) j=1 j=1 for s inequality constraints. Minimizing the surrogate subject to relevant linear equality constraints

Ax = b produces the next iterate xk+1. The constant ρ determines the tradeoff between keeping the constraints inactive and minimizing f(x). One can show that the MM algorithm with exact minimization converges to the constrained minimum of f(x) [90].

In practice one step of Newton’s method is usually adequate to decrease f(x). The first step of

Newton’s method minimizes the surrogate g(x | xk) given by the second-order Taylor expansion of around xk subject to the equality constraints. Given smooth functions, the two derivatives

∇g(xk | xk) = ∇f(xk) s 2 2 X 2 ∇ g(xk | xk) = ∇ f(xk) − ρ ∇ vj(xk) (3.1) j=1 s X 1 + ρ ∇v (x )∇v (x )T v (x ) j k j k j=1 j k are the core ingredients in the quadratic approximation of g(x | xk). Unfortunately, one step of Newton’s method is neither guaranteed to decrease f(x) nor to respect the nonnegativity con- straints.

For instance, the standard form of linear programming requires the minimization of a linear function f(x) = cT x subject to Ax = b and x  0. The quadratic approximation to the surrogate g(x | xk) amounts to

p ρ X 1 cT x + cT (x − x ) + (x − x )2. k k 2 x j kj j=1 kj The minimum of this quadratic subject to the linear equality constraints occurs at the point

−1 −1 T −1 T −1 −1 xk+1 = xk − Dk c + Dk A (ADk A ) (b − Axk + ADk c).

−1 Here Dk is the diagonal matrix with ith diagonal entry ρxk,i . Observe that the increment xk+1 −xk satisfies the linear equality constraint A(xk+1 − xk) = b − Axk. 18 One can overcome the objections to Newton updates by taking a controlled step along the

Newton direction uk = xk+1 − xk. The key is to exploit the theory of self-concordant functions [27, 111]. A thrice differentiable convex function h(t) is said to be self-concordant if it satisfies the inequality

|h000(t)| ≤ 2ch00(t)3/2

for some constant c ≥ 0 and all t in the essential domain of h(t). All convex quadratic functions qualify as self-concordant with c = 0. The function h(t) = − ln(at + b) is self-concordant with constant 1. The class of self-concordant functions is closed under sums and composition with

linear functions. A convex function q(x) with domain Rn is said to be self-concordant if every slice h(t) = q(x + tu) is self-concordant.

Rather than conduct an expensive one-dimensional search along the Newton direction xk +tuk, one can majorize the surrogate function h(t) = g(xk + tuk | xk) along the half-line t ≥ 0. The clever majorization

1 1 h(t) ≤ h(0) + h0(0)t − h00(0)1/2t − ln[1 − cth00(0)1/2] (3.2) c c2

both guarantees a decrease in f(x) and prevents a violation of the inequality constraints [111]. Here c is the self-concordance constant associated with the surrogate. The optimal choice of t reduces to the damped Newton update

h0(0) t = . (3.3) h00(0) − ch0(0)h00(0)1/2

The first two derivatives of h(t) are clearly

0 T h (0) = ∇f(xk) uk s 00 T 2 X T 2 h (0) = uk ∇ f(xk)uk − ρ uk ∇ vj(xk)uk j=1 s X 1 + ρ [∇v (x )u ]2. v (x ) j k k j=1 j k

The first of these derivatives is nonpositive because uk is a descent direction for f(x). The second is generally positive because all of the contributing terms are nonnegative. 19 When f(x) is quadratic and the inequality constraints are affine, detailed calculations show that the surrogate function g(x | xk) is self-concordant with constant

1 c = p . ρ min{v1(xk), . . . , vs(xk)}

Taking the damped Newton’s step with step length (3.3) keeps xk + tkuk in the relative interior of the feasible region while decreasing the surrogate and hence the objective function f(x). When

f(x) is not quadratic but can be majorized by a quadratic surrogate q(x | xk), one can replace

f(x) by q(x | xk) in calculating the adaptive-barrier update. The next iterate xk+1 retains the descent property.

As a toy example consider the linear programming problem of minimizing cT x subject to Ax = b and x  0. Applying the adaptive barrier method to the choices   −1         −1 2 0 0 1 0 0 1         −1 A = 0 2 0 0 1 0 , b = 1 , c =              0  0 0 2 0 0 1 1      0    0

1 and to the feasible initial point x0 = 3 1 produces the results displayed in Table 3.1. Not shown 1 1 1 T is the minimum point ( 2 , 2 , 2 , 0, 0, 0) . Columns two and three of the table record the progress

of the unadorned adaptive barrier method. The quantity k∆kk2 equals the Euclidean norm of the

difference vector ∆k = xk −xk−1. Columns four and five repeat this information for the algorithm

modified by the self-concordant majorization (3.2). The quantity tk in column six represents the

optimal step length (3.3) in going from xk−1 to xk along the Newton direction uk−1. Clearly, there is a price to be paid in implementing a safeguarded Newton step. In practice, this price is well worth paying.

                     No Safeguard               Self-concordant Safeguard
Iteration k     c^T x_k     ||Δ_k||_2       c^T x_k     ||Δ_k||_2       t_k
 1              -1.20000    0.25820         -1.11270    0.14550         0.56351
 2              -1.33333    0.17213         -1.20437    0.11835         0.55578
 3              -1.41176    0.10125         -1.27682    0.09353         0.55026
 4              -1.45455    0.05523         -1.33288    0.07238         0.54630
 5              -1.47692    0.02889         -1.37561    0.05517         0.54345
10              -1.49927    0.00094         -1.47289    0.01264         0.53746
15              -1.49998    0.00003         -1.49426    0.00271         0.53622
20              -1.50000    0.00000         -1.49879    0.00057         0.53597
25              -1.50000    0.00000         -1.49975    0.00012         0.53591
30              -1.50000    0.00000         -1.49995    0.00003         0.53590
35              -1.50000    0.00000         -1.49999    0.00001         0.53590
40              -1.50000    0.00000         -1.50000    0.00000         0.53590

Table 3.1: Performance of the adaptive barrier method in linear programming.
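For readers who wish to reproduce the No Safeguard column of Table 3.1, the Julia sketch below implements the explicit update $x_{k+1} = x_k - D_k^{-1} c + D_k^{-1} A^T (A D_k^{-1} A^T)^{-1} (b - A x_k + A D_k^{-1} c)$ for the toy linear program above. The value $\rho = 1$ is our assumption, since the text does not state the penalty constant used, but with it the first iterates match the table (for example, $c^T x_1 = -1.2$ and $c^T x_2 \approx -1.3333$).

```julia
using LinearAlgebra

# Unsafeguarded adaptive barrier updates for the toy linear program of Table 3.1.
function adaptive_barrier_lp(A, b, c, x0, ρ; iters = 40)
    x = copy(x0)
    for k in 1:iters
        Dinv = Diagonal(x ./ ρ)                     # D_k^{-1}, with D_k = diag(ρ / x_{k,i})
        M = A * Dinv * A'
        r = b - A * x + A * (Dinv * c)
        x = x - Dinv * c + Dinv * (A' * (M \ r))    # Newton step on the surrogate
    end
    return x
end

A = [2.0 0 0 1 0 0;
     0 2.0 0 0 1 0;
     0 0 2.0 0 0 1]
b = ones(3)
c = [-1.0; -1.0; -1.0; 0.0; 0.0; 0.0]
x = adaptive_barrier_lp(A, b, c, fill(1/3, 6), 1.0)   # feasible start x0 = (1/3)1
dot(c, x)      # approaches the optimal value -1.5 reported in Table 3.1
```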

3.2 MM for an Exact Penalty Method

We now turn to exact penalty methods. For a smooth objective function and smooth constraints, the most convenient penalized objective is

G(x) Fρ(x) = f(x) + ρ ,

H(x)+ where f(x) is the objective function, G(x) is the vector of equality constraints, and H(x)+ is the

vector of truncated inequality constraints with components max{0, hj(x)}. Classical optimization theory says that a constrained minimum point of f(x) furnishes an unconstrained minimum point

21 of Fρ(x) provided that ρ is sufficiently large and that the Lagrangian

p q X X L(x, λ, µ) = f(x) + λigi(x) + µjhj(x) i=1 j=1 is suitably well-behaved [121]. Here the Lagrange multiplier vectors λ and µ are chosen so that the multiplier rule ∇L(y, λ, µ) = 0 holds at the constrained minimum y.

The nonsmooth nature of Fρ(x) is a crippling hindrance to its optimization. Fortunately, a simple modification of the penalty leads to a viable minimization algorithm. Let us replace the Euclidean norm in the penalty by

u p = kuk2 + kvk2 +  v  for a small  > 0. This positions us to majorize the penalty via the univariate majorization

√ √ t − tk t ≥ tk + √ (3.4) 2 tk √ of the concave function t on the interval t > 0. The resulting majorization

" p q # ρ X X f(x) + ρq (x) ≤ f(x) + g (x)2 + h (x)2 + c (3.5)  2q (x ) i j + k  k i=1 j=1

is the key to approximate minimization of Fρ(x). Here the irrelevant constant ck depends only on

xk and q(x) is given by

G(x) q(x) =

H(x)+ 

The obvious tactic in generating a better iterate xk+1 is to apply one step of Newton’s method.

If we let wk = ρ/(2q(xk)), then the gradient

p q X X ∇f(xk) + wk gi(xk)∇gi(xk) + wk hj(xk)+∇hj(xk) i=1 j=1

of the surrogate function at xk is straightforward to derive, but the second differential is problem-

2 atic to compute because the functions hj(x)+ are not twice-differentiable. When hj(xk) < 0,

22 2 2 the second derivative satisfies ∇ hj(xk)+ = 0. In the opposite situation hj(xk) > 0, the Gauss- Newton approximation

2 2 T ∇ hj(xk)+ ≈ 2∇hj(xk)∇hj(xk) (3.6) is valuable for several reasons. Most importantly, it avoids second derivatives and preserves posi- tive definiteness of the approximate second differential. Notably, the approximation (3.6) is exact

2 2 2 if hj(x) is affine. Furthermore, the omitted term 2hj(x)+∇ hj(x) of ∇ hj(xk)+ vanishes as the algorithm approaches convergence. In the rare instances when hj(xk) = 0, the literature [77] sug- gests that there is little harm in approximating the second differential by the outer product (3.6). For the same reasons, we recommend the Gauss-Newton approximation for the equality constraints

2 gi(x) as well. Lastly, the Sherman-Morrison formula facilitates matrix inversion when ∇ f(xk) is explicitly invertible and the number of constraints is small.

In practice there is no guarantee that one step of Newton’s method will decrease the surrogate on the right-hand side of majorization (3.5). If f(x) is quadratic or can be majorized by a quadratic function, then another round of majorization avails. When hj(xk) ≥ 0, then the majorization

2 2 hj(x)+ ≤ hj(x) applies. If instead hj(xk) < 0, then we instead apply the alternative majorization 2 2 hj(x)+ ≤ [hj(x) − hj(xk)] . Both of these lead to the Hessian approximation on the right- hand side of (3.6). The gradient of the surrogate changes in an obvious way in each case. The additional round of majorization eliminates the need for step-halving. The price may be slower overall convergence.

3.2.1 Exact Penalty Method for Quadratic Programming

1 T T Minimization of a convex quadratic objective 2 x Ax + b x subject to linear equality constraints Cx = d and linear inequality constraints Ex ≤ f is one of the building blocks of modern optimization algorithms. The case A = 0 corresponds to linear programming. Both equality and inequality constraints can be handled as just suggested. Alternatively, the introduction of slack variables allows one to replace linear inequality constraints by a combination of linear equality

23 constraints and nonnegativity constraints xi ≥ 0. For the relevant components, the majorizations  2 x xk,i ≥ 0 2  i max{xi, 0} ≤  2 (xi − xk,i) xk,i < 0 simplifies the overall algorithm and yields a purely quadratic surrogate that is minimized by one step of Newton’s method. The next iterate xk+1 is guaranteed to send the approximate objective downhill.

3.3 Distance Majorization

On a Euclidean space, the distance to a closed set S is a Lipschitz continuous function dist(x, S) with Lipschitz constant 1. As discussed in Chapter 2, if S is also convex, then dist(x, S) is a convex function. Projection onto S is intimately tied to dist(x, S). Unless S is convex, the

projection operator ΠS (x) is multi-valued for at least one argument x. Fortunately, it is possible

to majorize dist(x, S) at xk by kx − ΠS (xk)k2. This simple observation is the key to the proximal distance algorithm to be discussed later. In the meantime, let us show how to derive two feasibility

algorithms by distance majorization [39]. Let S1,..., Sm be closed sets. The method of averaged

m projections attempts to find a point in their intersection S = ∩j=1Sj. To derive the algorithm, consider the convex combination m X 2 f(x) = αj dist(x, Sj) j=1 of squared distance functions. Obviously, f(x) vanishes on S precisely when all coefficients

αj > 0. The majorization m X g(x | x ) = α kx − Π (x )k2 k j Sj k 2 j=1

of f(x) is easy to minimize. The minimum point of g(x | xk), m X x = α Π (x ), k+1 j Sj k j=1

defines the averaged operator. The MM principle guarantees that xk+1 decreases the objective function. 24 Von Neumann’s method of alternating projections can also be derived from this perspective.

2 For two sets S1 and S2, consider the problem of minimizing f(x) = dist(x, S2) subject to the constraint x ∈ S1. The function

2 g(x | xk) = kx − ΠS2 (xk)k2

majorizes f(x). Indeed, the domination condition g(x | xk) ≥ f(x) holds because ΠS2 (xk) belongs to S2; the tangency condition g(xk | xk) = f(xk) holds because ΠS2 (xk) is the closest point in S2 to xk. The surrogate function g(x | xk) is minimized subject to the constraint by setting

xk+1 = ΠS1 ◦ΠS2 (xk). The MM principle again ensures that xk+1 decreases the objective function. When the two sets intersect, the least distance of 0 is achieved at any point in the intersection. One

2 2 can extend this derivation to three sets by minimizing f(x) = dist(x, S2) + dist(x, S3) subject to x ∈ S1. The surrogate

2 2 g(x | xk) = kx − ΠS2 (xk)k2 + kx − ΠS3 (xk)k2 2 1 = 2 x − [ΠS2 (xk) + ΠS3 (xk)] + ck 2 2 relies on an irrelevant constant ck. The closest point in S1 is 1  x = Π [Π (x ) + Π (x )] . k+1 S1 2 S2 k S3 k

This construction clearly generalizes to more than three sets.

3.4 The Proximal Distance Method

We now turn to an exact penalty method that applies to nonsmooth functions. Clarke’s exact penalty method [40] turns the constrained problem of minimizing a function f(y) over a closed set S into the unconstrained problem of minimizing the penalized function f(y) + ρ dist(y, S) for sufficiently large ρ. Here is a precise statement of a generalization of Clarke’s result [26, 40, 49].

Proposition 3. Suppose that f(y) achieves a local minimum on S at the point x. Let φS (y)

n denote a function that vanishes on S and that satisfies φS (y) ≥ c dist(y, S) for all x ∈ R and some positive constant c. If f(y) is locally Lipschitz continuous around x with constant L, then 25 −1 for every ρ ≥ c L, the function Fρ(y) = f(y)+ρφS (y) achieves a local unconstrained minimum at x.

Classically the choice φS (x) = dist(x, S) was preferred. For affine equality constraints gi(x) = 0 and affine inequality constraints hj(x) ≤ 0, Hoffman’s bound [74]

G(y) dist(y, S) ≤ τ

H(y)+ 2 applies, where τ is some positive constant, S is the feasible set where G(y) = 0, and H(y)+ ≤ 0.

The vector H(y)+ has components hj(x)+ = max{hj(y), 0}. When S is the intersection of several closed sets S1,..., Sm, then the alternative v u m uX 2 φS (y) = t dist(y, Si) (3.7) i=1 is attractive. The next proposition gives sufficient conditions under which the crucial bound

φS (y) ≥ c dist(y, S) is valid for the function (3.7).

n Proposition 4. Suppose that S1,..., Sm are closed convex sets in R where the first j sets are polyhedral. Assume further that the intersection

j  m  S = ∩i=1Si ∩ ∩i=j+1 relint Si is nonempty and bounded. Then there exists a constant τ > 0 such that v m √ u m X uX 2 dist(x, S) ≤ τ dist(x, Si) ≤ τ mt dist(x, Si) i=1 i=1 for all x. The sets S1,..., Sm are said to be linearly regular.

Proof. See the references [10, 51] for all details. A polyhedral set is the nonempty intersection of a finite number of half-spaces. The operator relint K forms the relative interior of the convex set K, namely, the interior of K relative to the affine hull of K. When K is nonempty, its relative interior is nonempty and generates the same affine hull as K itself.

26 In general, we will require that f(x) and φS (x) be continuous functions and that the sum

Fρ(y) = f(y) + ρφS (y) be coercive for some value ρ = ρ0. It then follows that Fρ(y) is coercive

and attains its minimum for all ρ ≥ ρ0. One can prove a partial converse to Clarke’s theorem

[49, 50]. This requires the enlarged set S = {x : φS (x) < } of points lying close to S as

measured by φS (x).

Proposition 5. Suppose that f(y) is Lipschitz continuous on S for some  > 0. Then under the

stated assumptions on f(x) and φS (x), a global minimizer of Fρ(y) is a constrained minimizer of f(y) for all sufficiently large ρ.

When the constraint set S is compact and f(y) has a continuously varying local Lipschitz constant, then the hypotheses of Proposition 5 are fulfilled. This is the case, for instance, when f(y) is continuously differentiable. With this background on the exact penalty method in mind, we now sketch an approximate MM algorithm for convex programming that is motivated by distance majorization. This algorithm is designed to exploit set projections and proximal maps. A huge literature and software base exist for computing projections and proximal maps [11].

Recall from Chapter 2 that the proximal map proxh(y) associated with a convex function h(x) satisfies   1 2 proxh(y) = argmin h(x) + ky − xk2 . x 2

Since the function dist(x, S) is merely continuous, we advocate approximating it by the differen- tiable function

p 2 dist(x, S) = dist(x, S) +  (3.8)

for 0 <   1. The composite function dist(x, S) is convex when S is convex because the func- √ tion t2 +  is increasing and convex on [0, ∞). Instead of minimizing f(x) + ρ dist(x, S), we suggest using an MM algorithm to minimize the differentiable convex function f(x)+ρ dist(x, S). Regardless of whether or not S is convex, the majorization

q 2 dist(x, S) ≤ kx − ΠS (xk)k2 +  (3.9)

27 is available. If S is nonconvex, then there may exist multiple points that minimize the distance from xk to S, and one must choose a representative of the set ΠS (xk). In any event one can invoke the univariate majorization (3.4) and majorize the majorization (3.9) by q 1 kx − Π (x )k2 +  ≤ kx − Π (x )k2 + c S k 2 p 2 S k 2 k 2 kxk − ΠS (xk)k2 + 

for some irrelevant constant ck. The second step of our proposed MM algorithm consists of mini- mizing the surrogate function w g(x | x ) = f(x) + k kx − Π (x )k2 (3.10) k 2 S k 2 ρ w = . k p 2 kxk − ΠS (xk)k2 + 

The corresponding proximal map drives f(x) + ρ dist(x, S) downhill. Under the more general exact penalty (3.7), the surrogate function depends on a sum of spherical quadratic functions rather than a single spherical quadratic function.

It is possible to project onto a variety of closed nonconvex sets. For example, if S is the set

1 of integers, then projection amounts to rounding. An ambiguous point k + 2 can be projected to either k or k + 1. Projection onto a finite set simply tests each point separately. Projection onto a Cartesian product is achieved via the Cartesian product of the projections. One can also project onto many continuous sets of interest. For example, to project onto the closed set of points having at most m nonzero coordinates, one sends to zero all but the m largest coordinates in magnitude. Projection onto the sphere of center z and radius r maps y 6= z to the point z + r (y − z). ky−zk2 All points of the sphere are equidistant from its center.

By definition the update xk+1 = prox −1 [Π (xk)] minimizes g(x | xk). We will refer to wk f S this MM algorithm as the proximal distance algorithm. It enjoys several virtues. First, it allows one to exploit the extensive body of results on proximal maps and projections. Second, it does not demand that the constraint set S be convex. Third, it does not require the objective function f(x) to be convex or smooth. Finally, the optima and optimizers of the functions f(x) + ρ dist(x, S) and f(x) + ρ dist(x, S) are close when  > 0 is small.

In implementing the proximal distance algorithm, the constants L and  must specified. For many norms the Lipschitz constant L is known. For a differentiable function f(x), the mean value 28 inequality suggests taking L equal to the maximal value of k∇f(x)k2 in a neighborhood of the optimal point. In specific problems a priori bounds can be derived. If no such prior bound is known, then one must guess an appropriate ρ and see if it leads to a constrained minimum. If not, then ρ should be systematically increased until a constrained minimum is reached. Even with a justifiable bound, it is prudent to start ρ well below its intended upper bound to emphasize minimization of the loss function in early iterations. Experience shows that gradually decreasing  is also a good tactic; otherwise, one again runs the risk of putting too much early stress on satisfying the constraints. In

k −k practice the sequences ρk = min{α ρ0, ρmax} and k = max{β 0, min} work well for α and β slightly larger than 1, say 1.2, and ρ0 = 0 = 1. On many problems more aggressive choices of α and β are possible. Suitable values of ρmax and min are specific to a problem. In general, taking

ρmax substantially greater than a known Lipschitz constant slows convergence, while taking min too large leads to a poor approximate solution.

3.5 Examples

We now explore some typical applications of the proximal distance algorithm. In all cases we are able to establish local Lipschitz constants. The proximal distance algorithms are coded in MAT- LAB 2013a. All numerical experiments were run on an Early 2013 15” MacBook Pro Retina running OSX 10.9.5 (Darwin 13.4.0) with a quadcore 2.7 GHz Intel Core i7 processor and 16Gb of 1600 MHz DDR3 memory. Comparisons with standard optimization software serve as perfor- mance benchmarks.

3.5.1 Projection onto an Intersection of Closed Convex Sets

Let S1,..., Sm be closed convex sets with simple projections. Dykstra’s algorithm [51, 55] is

m designed to find the projection of an external point y onto S = ∩j=1Sj. The proximal distance algorithm provides an alternative based on the convex function

q 2 f(x) = kx − yk2 + δ

29 for δ > 0, say δ = 1. The choice f(x) is preferable to the obvious choice of the squared norm

2 kx−yk2 because f(x) is Lipschitz continuous with Lipschitz constant 1. In the proximal distance algorithm, we take v u m uX 2 φS (x) = t dist(x, Sj) j=1

and minimize the surrogate function

m wk X g(x | x ) = f(x) + kx − p k2 k 2 k,j 2 j=1 mw = f(x) + k kx − p¯ k2 + c , 2 m 2 k p Π (x ) x S p¯ p c where kj is the projection Sj k of k onto j, m is the average of the projections k,j, k is an irrelevant constant, and ρ wk = . qPm 2 j=1 kxk − pkjk2 + 

After rearrangement, the stationarity condition for optimality reads mw x = (1 − α)y + αp¯ , α = k . m √ 1 2 + mwk kx−yk2+δ

In other words, xk+1 is a convex combination of y and p¯m.

To calculate the optimal coefficient α, we minimize the convex surrogate

h(α) = g[(1 − α)y + αp¯m | xk] kw = pα2 dist(y, p¯ )2 + δ + k (1 − α)2 dist(y, p¯ )2 + c . m 2 m k

Its derivative

2 0 α dist(y, p¯m) 2 h (α) = − mwk(1 − α) dist(y, p¯ ) p 2 2 m α dist(y, p¯m) + δ satisfies h0(0) < 0 and h0(1) > 0 and possesses a unique root on the open interval (0, 1). This root can be easily computed by bisection or Newton’s method.

Table 3.2 compares Dykstra’s algorithm and the proximal distance algorithm on a simple planar

2 example. Here S1 is the closed unit ball in R , and S2 is the closed halfspace with x1 ≥ 0. The 30 Dykstra Proximal Distance

Iteration k xk,1 xk,2 xk,1 xk,2

0 -1.00000 2.00000 -1.00000 2.00000 1 -0.44721 0.89443 -0.44024 1.60145 2 0.00000 0.89443 -0.25794 1.38652 3 -0.26640 0.96386 -0.16711 1.25271 4 0.00000 0.96386 -0.11345 1.16647 5 -0.14175 0.98990 -0.07891 1.11036 10 0.00000 0.99934 -0.01410 1.01576 15 -0.00454 0.99999 -0.00250 1.00257 20 0.00000 1.00000 -0.00044 1.00044 25 -0.00014 1.00000 -0.00008 1.00008 30 0.00000 1.00000 -0.00001 1.00001 35 0.00000 1.00000 0.00000 1.00000

Table 3.2: Dykstra’s algorithm versus the proximal distance algorithm. intersection S reduces to the right half ball centered at the origin. The table records the iterates

T of the two algorithms from the starting point x0 = (−1, 2) until their eventual convergence to

T k the geometrically obvious solution (0, 1) . In the proximal distance method we set ρk = 2 and

−k aggressively set k = 4 . The two algorithms exhibit similar performance but take rather different trajectories.

3.5.2 Network Optimization

The problem of minimizing the piecewise-linear function

X T f(x) = Aij|xi − xj| + b x i

31 n n subject to binary constraints x ∈ {0, 1} and nonnegative weights Aij from a matrix A ∈ S is a typical discrete optimization problem with applications in graph cuts and network optimization. If we invoke the majorization

xk,i + xk,j xk,i + xk,j |xi − xj| ≤ xi − + xj − 2 2

prior to applying the proximal operator, then the proximal distance algorithm separates the param-

eters xi and xj. Parameter separation promotes parallelization and benefits from a fast algorithm for computing proximal maps in one dimension. The one-dimensional algorithm is similar to but faster than bisection [117]. Finally, the objective function is Lipschitz continuous with the explicit constant s X X 2 L = Aij + kbk2. (3.11) i j6=i

This assertion follows from the simple bound

X X T |f(x) − f(y)| ≤ Aij|xj − yj| + |b (x − y)| i j6=i s X X 2 ≤ Aij · kx − yk2 + kbk2 · kx − yk2 i j6=i

under the symmetry convention Aij = Aji.

Table 3.3 displays the numerical results for a few typical examples. For each dimension n we filled b with standard normal deviates and the upper triangle of the weight matrix A with the abso- lute values of such deviates. The lower triangle of A was determined by symmetry. Small values of b often lead to degenerate solutions x with all entries 0 or 1. To preclude this possibility, we multiplied each entry of b by n. In the graph cut context, a degenerate solution corresponds to no

k cuts at all or a completely cut graph. These examples depend on the schedules ρk = min{1.2 ,L}

−k −15 and k = max{1.2 , 10 } for the two tuning constants and the local Lipschitz constant (3.11).

Although the proximal distance algorithm makes good progress towards the minimum in the first 100 iterations or so, it sometimes hovers around its limit without fully converging. This translates into fickle compute times, and for this reason we capped the number of iterations at 200. For small dimensions the proximal distance algorithm can be much slower than CVX. Fortunately, 32 CPU times

n PD CVX Iterations

2 0.038 0.080 9 4 0.052 0.060 18 8 2.007 0.050 200 16 2.416 0.100 200 32 2.251 0.130 200 64 4.134 0.400 200 128 0.212 2.980 32 256 0.868 62.63 200 512 68.27 1534 200 1024 526.6 * 200 2048 127.2 * 200 4096 547.4 * 200

Table 3.3: CPU times in seconds and iterations until convergence for the network optimization problem. Asterisks denote computer runs exceeding computer memory limits. Iterations were capped at 200. the performance of the proximal distance algorithm improves markedly as n increases. In all runs the two algorithms reach the same solution after rounding components to the nearest integer. The proximal distance algorithm also requires much less storage than CVX. Asterisks appear in the table where CVX demanded more memory than our laptop computer could deliver.

3.5.3 Nonnegative Quadratic Programming

The proximal distance algorithm is applicable in minimizing a convex quadratic objective function

1 T T f(x) = 2 x Ax + b x subject to the constraint x  0. In this nonnegative quadratic program, let

33 n yk be the projection of the current iterate xk onto S = R+. If we define the weight ρ wk = , p 2 kxk − ykk2 +  then the next iterate can be expressed as

−1 xk+1 = (A + wkI) (wkyk − b). (3.12)

The multiple matrix inversions implied by (3.12) can be avoided by precomputing and caching the

T −1 spectral decomposition U DU of A. We then reformulate the inverse (A + wkI) as the matrix

T −1 product U (D + wkI) U. The diagonal matrix D + wkI is obviously trivial to invert. The remaining operations in computing xk+1 reduce to matrix-vector multiplications.

CPU times Optima

n MM CV MA YA MM CV MA YA

8 0.97 0.23 0.01 0.13 -0.0172 -0.0172 -0.0172 -0.0172 16 0.50 0.24 0.01 0.11 -1.1295 -1.1295 -1.1295 -1.1295 32 0.50 0.24 0.01 0.14 -1.3811 -1.3811 -1.3811 -1.3811 64 0.57 0.28 0.01 0.13 -0.5641 -0.5641 -0.5641 -0.5641 128 0.79 0.36 0.02 0.14 -0.7018 -0.7018 -0.7018 -0.7018 256 1.66 0.65 0.06 0.22 -0.6890 -0.6890 -0.6890 -0.6890 512 5.61 2.95 0.26 0.73 -0.5971 -0.5968 -0.5970 -0.5970 1024 32.69 21.90 1.32 2.91 -0.4944 -0.4940 -0.4944 -0.4944 2048 156.7 178.8 8.96 15.89 -0.4514 -0.4505 -0.4512 -0.4512 4096 695.1 1551 57.73 91.54 -0.4690 -0.4678 -0.4686 -0.4686

Table 3.4: CPU times in seconds and optima for the nonnegative quadratic program. Abbrevia- tions: n for the problem dimension, MM for the proximal distance algorithm, CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP.

One can estimate an approximate Lipschitz constant for this problem. Note that f(0) = 0 and

34 that

1 f(x) ≥ λ (A)kxk2 − kbk · kxk , 2 min 2 2 2

where λmin(A) is the smallest eigenvalue of A. It follows that x cannot minimize f(x) subject to −1 the nonnegativity constraint whenever kxk2 > 2kbk2 [λmin(A)] On the other hand, the gradient of f(x) satisfies

k∇f(x)k2 ≤ kAk2kxk2 + kbk2 ≤ λmax(A)kxk2 + kbk2,

where λmax(A) is the largest eigenvalue of A. In view of the mean-value inequality, these bounds suggest that   2λmax(A) L = + 1 kbk2 = [2 cond(A) + 1] kbk2 λmin(A) provides an approximate Lipschitz constant for f(x) on the region harboring the minimum point. This bound on ρ is usually too large. One remedy is to multiply the bound by a deflation factor such as 0.1. Another remedy is to replace the covariance matrix A by the corresponding correlation matrix. Thus, one solves the problem for the preconditioned matrix D−1AD−1, where D is the diagonal matrix whose entries are the square roots of the corresponding diagonal entries of A. The transformed parameters y = Dx obey the same nonnegativity constraints as x.

For testing purposes we filled a n × n matrix M with independent standard normal deviates and set A = M T M + I. Addition of the identity matrix avoids ill conditioning. We also filled the vector b with independent standard normal deviates. Our gentle tuning constant schedule

−k −15 k k = max{1.005 , 10 } and ρk = min{1.005 , 0.1 × L} adjusts ρ and  so slowly that their limits are not actually met in practice. In any event L is the a priori bound for the correlation matrix derived from A. Table 3.4 compares the performance of the MM proximal distance algorithm to MATLAB’s quadprog, CVX with the SDPT3 solver, and YALMIP with the MOSEK solver. MATLAB’s quadprog is clearly the fastest of the four tested methods on these problems. The relative speed of the proximal distance algorithm improves as the problem dimension n increases. We will revisit this example in Section 4.3.2.

35 3.5.4 Linear Regression under an `0 Constraint

1 2 In this example the objective function is the sum of squares 2 ky − Xβk2, where y is the response vector, X is the design matrix, and β is the vector of regression coefficients. The constraint set

n Sm consists of those β with at most m nonzero entries. Projection onto the closed but nonconvex n set Sm is achieved by supplanting with 0 all but the m largest coordinates in magnitude. These coordinates will be unique except in the rare circumstance that we encounter a pair (βi, βj) with mth largest absolute value |βi| = |βj|. The proximal distance algorithm for this problem coincides with that of the previous problem if we substitute XT X for A, −XT y for b, β for x, and the projection operator Π n for Π n . Better accuracy can be maintained if the MM update exploits Sm R+ the singular value decomposition of X in forming the spectral decomposition of XT X. Although the proximal distance algorithm carries no absolute guarantee of finding the optimal set of m n  regression coefficients, it is far more efficient than sifting through all m sets of size m. The alternative of LASSO-guided model selection must contend with strong shrinkage and a surplus of false positives.

Table 3.5 compares the MM proximal distance algorithm to MATLAB’s lasso function. In simulating data, we filled X with standard normal deviates, set all components of β to 0 except for βi = 1/i for 1 ≤ i ≤ 10, and added a vector of standard normal deviates to Xβ to determine y. For a given choice of n and p we ran each experiment 100 times and averaged the results. The table demonstrates the superior speed of the LASSO and the superior accuracy of the proximal distance algorithm as measured by optimal loss and model selection. This example motivates the sparse regression framework described in Chapter 5.

3.5.5 Matrix Completion

Let Y = (yij) denote a partially observed p × q matrix and let ∆ denote the set of index pairs (i, j)

indexing the observed yij. Matrix completion [33] imputes the missing entries by approximating

36 n p d p1 p2 λ L1 L1/L2 T1 T1/T2

256 128 10 5.97 3.32 0.143 248.763 0.868 0.603 8.098 128 256 10 3.83 1.91 0.214 106.234 0.744 0.999 10.254 512 256 10 6.51 2.88 0.119 506.570 0.900 0.907 6.262 256 512 10 4.50 1.82 0.172 241.678 0.835 1.743 8.687 1024 512 10 7.80 5.25 0.101 1029.333 0.921 2.597 5.057 512 1024 10 5.54 2.58 0.138 507.451 0.881 8.235 13.532 2048 1024 10 8.98 8.49 0.080 2047.098 0.945 15.460 8.858 1024 2048 10 6.80 2.93 0.110 1044.640 0.916 34.997 18.433 4096 2048 10 9.75 9.90 0.060 4086.886 0.966 89.684 10.956 2048 4096 10 8.36 6.60 0.086 2045.645 0.942 166.386 25.821

Table 3.5: Numerical experiments comparing MM to MATLAB’s lasso. Each row presents av- erages over 100 independent simulations. Abbreviations: n the number of cases, p the number of

predictors, d the number of actual predictors in the generating model, p1 the number of true pre- dictors selected by MM, p2 the number of true predictors selected by lasso, λ the regularization

parameter at the LASSO optimal loss, L1 the optimal loss from MM, L1/L2 the ratio of L1 to the

optimal LASSO loss, T1 the total computation time in seconds for MM, and T1/T2 the ratio of T1 to the total computation time of lasso.

Y with a low rank matrix X. Imputation relies on the singular value decomposition

X = UΣV T r X T = σiuivi , (3.13) i=1

where r is the rank of X, the nonnegative singular values σi are presented in decreasing order,

the left singular vectors ui are orthonormal, and the right singular vectors vi are also orthonormal

[65]. The set Sm of p × q matrices of rank m or less is closed. Projection onto Sm is accomplished

37 by truncating the sum (3.13) to

min{r,m} X T ΠSm (X) = σiuivi . i=1

When r > m and σm+1 = σm, the projection operator is multi-valued.

The MM principle allows one to restore the symmetry lost in the missing entries [104]. Suppose that Xk is the current approximation to X. One simply replaces a missing entry yij of Y for

1 2 (i, j) 6∈ ∆ by the corresponding entry xk,ij of Xk and adds the term 2 (xk,ij − xij) to the least squares criterion

1 X f(X) = (y − x )2. 2 ij ij (i,j)∈∆

Since the added terms majorize 0, they create a legitimate surrogate function. Let us rephrase the

⊥ surrogate in terms of the orthogonal complement operator Π∆(Y ) via the equation

⊥ Y = Π∆(Y ) + Π∆(Y ).

⊥ The matrix Zk = Π∆(Y ) + Π∆(Xk) temporarily completes Y and yields the surrogate function 1 2 2 kZk − XkF . In implementing a slightly modified version of the proximal distance algorithm, one must solve for the minimum of the Moreau envelope

1 2 wk 2 kZk − Xk + X − Π (Xk) . 2 F 2 Sm F

where wk = ρ/ dist(Xk, Sm). The stationarity condition

  0 = X − Zk + wk X − ΠSm (Xk)

yields the trivial solution

1 wk Xk+1 = Zk + ΠSm (Xk). (3.14) 1 + wk 1 + wk

The update (3.14) is guaranteed to decrease the objective function

1 X ρ F (X) = (y − x )2 + dist (X, S ). ρ 2 ij ij 2  m (i,j)∈∆

38 p q α r L1 L1/L2 T1 T1/T2

200 250 0.05 20 1598 0.251 4.66 7 800 1000 0.20 80 571949 0.253 131.02 18.1 1000 1250 0.25 100 1112604 0.24 222.2 15.1 1200 1500 0.15 40 793126 0.361 161.51 3.6 1200 1500 0.30 120 1569105 0.235 367.78 12.3 1400 1750 0.35 140 1642661 0.236 561.76 9 1800 2250 0.45 180 2955533 0.171 1176.22 10.1 2000 2500 0.10 20 822673 0.50 307.89 1.9 2000 2500 0.50 200 1087404 0.192 2342.32 2 5000 5000 0.05 30 7647707 0.664 1827.16 2

Table 3.6: Comparison of the MM proximal distance algorithm to SoftImpute. Abbreviations: p is the number of rows, q is the number of columns, α is the ratio of observed entries to to- tal entries, r is the rank of the matrix, L1 is the optimal loss under MM, L2 is the optimal loss under SoftImpute, T1 is the total computation time (in seconds) for MM, and T2 is the total computation time for SoftImpute.

In the spirit of Section 3.5.3, we can derive a local Lipschitz constant based on the value 1 P 2 f(0) = 2 (i,j)∈∆ yij. The inequality

1 X 1 X 1 X y2 < (y − x )2 = (y2 − 2y x + x2 ) 2 ij 2 ij ij 2 ij ij ij ij (i,j)∈∆ (i,j)∈∆ (i,j)∈∆

is equivalent to the inequality

X X 2 2 yijxij < xij. (i,j)∈∆ (i,j)∈∆

In view of the Cauchy-Schwarz inequality s s X X 2 X 2 yijxij ≤ yij xij , (i,j)∈∆ (i,j)∈∆ (i,j)∈∆

39 no solution X of the constrained problem can satisfy s s X 2 X 2 xij > 2 yij . (i,j)∈∆ (i,j)∈∆

When the opposite inequality holds, s X 2 k∇f(X)kF = (xij − yij) (i,j)∈∆ s s X 2 X 2 ≤ xij + yij (i,j)∈∆ (i,j)∈∆ s X 2 ≤ 3 yij. (i,j)∈∆

Again this tends to be a conservative estimate of the required local bound on ρ.

Table 3.6 compares the performance of the MM proximal distance algorithm and a MATLAB implementation of SoftImpute [104]. Although the proximal distance algorithm is noticeably slower, it substantially lowers the optimal loss and improves in relative speed as problem dimen- sions grow.

3.5.6 Sparse Precision Matrix Estimation

The graphical LASSO has applications in estimating sparse precision matrices [61]. In this context, one minimizes the convex criterion

− ln det Θ + tr(SΘ) + ρkΘk1,

where Θ−1 is a p×p theoretical covariance matrix, S is a corresponding sample covariance matrix, and the graphical LASSO penalty kΘk1 equals the sum of the absolute values of the off-diagonal entries of Θ. The solution exhibits both sparsity and shrinkage. One can avoid shrinkage by minimizing

f(Θ) = − ln det Θ + tr(SΘ)

p subject to Θ having at most 2m nonzero off-diagonal entries. Let Tm be the closed set of p × p p p symmetric matrices possessing this property. Projection of a matrix M ∈ S onto Tm can be 40 achieved by arranging the components of the upper triangle of M in decreasing absolute value and replacing all but the first m of these entries by 0. The lower triangular components are treated similarly.

The proximal distance algorithm for minimizing f(Θ) subject to the set constraints operates through the convex surrogate

wk 2 g(Θ | Θk) = f(Θ) + Θ − Π p (Θk) 2 Tm F ρ wk = . r 2

Θk − ΠT p (Θk) +  m F

A stationary point minimizes the surrogate and satisfies

−1 0 = −Θ + w Θ + S − w Π p (Θ ). (3.15) k k Tm k

The matrix S − wkΠ p (Θk) is constant with respect to Θ. If we denote its spectral decomposition Tm T T by U kDkU k , then multiplying equation (3.15) on the left by U k and on the right by U k gives

T −1 T 0 = −U k Θ U k + wkU k ΘU k + Dk.

T This suggests that we take Ek = U k ΘU k to be diagonal and require its diagonal entries ek,ii to satisfy

1 0 = − + wkek,ii + dk,i. ek,ii

Multiplying this identity by ek,ii and solving for the positive root with the quadratic formula yields

q 2 −dk,i + dk,i + 4wk ek+1,ii = . 2wk

T Given the solution matrix Ek+1, we reconstruct Θk+1 as U kEk+1U k .

Finding a local Lipschitz constant is more challenging in this example. Because the identity matrix is feasible, the minimum cannot exceed

p X − ln det I + tr(SI) = tr(S) = ωi, i=1

41 p where we assume that S ∈ S++ with eigenvalues ωi ordered from largest to smallest. If the

candidate matrix Θ is positive definite with ordered eigenvalues λi, then the von Neumann-Fan inequality [26] implies that

p p X X f(Θ) ≥ − ln λi + λiωp−i+1. (3.16) i=1 i=1

To show that f(Θ) > f(I) whenever any λi falls outside a designated interval, note that the contri- bution − ln λj + λjωp−j+1 to the right side of inequality (3.16) is bounded below by ln ωp−j+1 + 1 −1 when λj = ωp−j+1. Hence, f(Θ) > f(I) whenever

p X X − ln λi + λiωp−i+1 > ωi − (ln ωp−j+1 + 1). (3.17) i=1 j6=i

Given the strict convexity of the function − ln λi + λiωp−i+1, equality holds in inequality (3.17) at

exactly two points λi,min > 0 and λi,max > λi,min. These roots can be readily extracted by bisection

or Newton’s method. The strict inequality f(Θ) > f(I) holds when any λi falls to the left of λi,min

or to the right of λi,max. Within the intersection of the intervals [λi,max, λi,min], the gradient of f(Θ) satisfies

−1 k∇f(Θ)kF ≤ kΘ kF + kSkF v u p uX −2 ≤ t λi + kSkF i=1 v u p uX −2 ≤ t λi min + kSkF . i=1

This bound serves as a local Lipschitz constant near the optimal point.

Table 3.7 compares the performance of the proximal distance algorithm to that of the R glasso package [61]. The sample precision matrix S−1 = LLT + δMM T was generated by filling the diagonal and first three subdiagonals of the banded lower triangular matrix L with standard nor- mal deviates. Filling M with standard normal deviates and choosing δ = 0.01 imposed a small amount of noise obscuring the band nature of LLT . All table statistics represent averages over 10 runs started at Θ = S−1 with m equal to the true number of nonzero entries in LLT . The proximal distance algorithm performs better in minimizing average loss and recovering nonzero entries. 42 p kt k1 k2 ρ L1 L2 − L1 T1 T1/T2

8 18 14.0 14.0 0.00186 −12.35 0.01 0.022 43.458 16 42 30.5 28.7 0.00305 −25.17 0.08 0.026 43.732 32 90 53.5 49.9 0.00330 −50.75 0.17 0.054 31.639 64 186 97.8 89.3 0.00445 −98.72 0.53 0.234 28.542 128 378 191.6 169.9 0.00507 −196.09 1.14 1.060 18.693 256 762 345.0 304.2 0.00662 −369.62 2.55 4.253 9.559 512 1530 636.4 566.8 0.00983 −641.89 6.72 19.324 5.679

Table 3.7: Numerical results for precision matrix estimation. Abbreviations: p for matrix dimen- sion, kt for the number of nonzero entries in the true model, k1 for the number of true nonzero entries recovered by the proximal distance algorithm, k2 for the number of true nonzero entries recovered by glasso, ρ the average tuning constant for glasso for a given kt, L1 the average loss from the proximal distance algorithm, L1 − L2 the difference between L1 and the average loss from glasso, T1 the average compute time in seconds for the proximal distance algorithm, and

T1/T2 the ratio of T1 to the average compute time for glasso.

3.6 Discussion

The MM principle offers a unique and potent perspective on high-dimensional optimization. The current survey emphasizes proximal distance algorithms and their applications in nonlinear pro- gramming. Our construction of this new class of algorithms relies on the exact penalty method of Clarke [40] and majorization of a smooth approximation to the Euclidean distance to the constraint set. Well-studied proximal maps and Euclidean projections constitute the key ingredients of our seven realistic examples. These examples illustrate the versatility of the method in handling both convex and nonconvex constraints, the scalability of the method as problem dimension increases, and the pitfalls in sending the tuning constants ρ and  too quickly to their limits. Certainly, the proximal distance algorithm is not a panacea for optimization problems. For example, the proximal distance algorithm as formulated here exhibits remarkably fickle behavior on linear programming

43 problems. For linear programming, we ensure numerical stability and guard against premature convergence only by great care in parameter tuning and updating. The nonnegative quadratic pro- gramming example in Section 3.5.3 fails to converge both accurately and quickly; either accuracy is preserved at the cost of many thousands of iterations, or accuracy is sacrificed for the sake of speed. We will see in Chapter 4 that a slight reformulation of the proximal distance algorithm remedies these problems.

44 CHAPTER 4

Accelerating the Proximal Distance Algorithm

The previous chapter introduced the proximal distance algorithm for constrained optimization. The solution of constrained optimization problems is part science and part art. Details of imple- mentation can greatly affect performance. Here we modify the proximal distance algorithm to use squared distance penalties and Nesterov acceleration to yield better solutions than our original exact penalties. In the presence of convexity, it is clear that every proximal distance algorithm reduces to a proximal gradient algorithm. Hence, convergence analysis can appeal to a venera- ble body of convex theory. But as hinted in Chapter 3, the proximal distance algorithm is not limited to convex problems. In fact, its most important applications may well be to nonconvex problems. The focus of this chapter is on practical exploration of the proximal distance algorithm in high-dimensional optimization. We do not attempt to extend classical convergence arguments to nonconvex problems and leave that challenge for future work.

4.1 Derivation

Recall from Chapter 3 that the generic problem of minimizing a function f(x) over a closed set C can be attacked by distance majorization. The penalty method seeks the minimum point of a penalized version f(x) + ρq(x) of f(x) and then follows the solution vector xρ as ρ tends to ∞. In the limit one recovers the constrained solution. If the constraint set C equals an intersection m 1 Pm 2 ∩i=1Ci of closed sets, then it is natural to define the penalty q(x) = 2m i=1 dist(x, Ci) . Distance

45 majorization gives the surrogate function

m ρ X g (x | x ) = f(x) + kx − Π (x )k2 ρ k 2m Ci k 2 i=1 m 2 ρ 1 X = f(x) + x − Π (x ) + c 2 m Ci k k i=1 2 c y = 1 Pm Π (x ) for an irrelevant constant k. If we set k m i=1 Ci k , then by definition the minimum of

the surrogate gρ(x | xk) occurs at the proximal point

xk+1 = proxρ−1f (yk).

We call this MM algorithm the proximal distance algorithm. The penalty q(x) is generally smooth because 1 ∇ dist(x, C)2 = x − Π (x) 2 C

at any point x where the projection ΠC(x) is single valued [26, 94]. In contrast to the penalty (3.8), which approximated the distance function with an extra parameter , q(x) uses a squared distance function with no extra parameters.

For the special case of projection of an external point z onto the intersection C of the closed

1 2 sets Ci, one should take f(x) = 2 kz − xk2. The proximal distance iterates then obey the explicit formula

1 x = (z + ρy ). k+1 1 + ρ k

Linear programming with arbitrary convex constraints is another simple case. Here f(x) = vT x, and the update reduces to

1 x = y − v. k+1 k ρ

If the proximal map is impossible to calculate, but ∇f(x) is known to be Lipschitz continuous with constant L, then one can substitute the standard majorization

L f(x) ≤ f(x ) + ∇f(x )T (x − x ) + kx − x k2 k k k 2 k 2

46 for f(x). Minimizing the sum of the loss majorization plus the penalty majorization leads to the MM update 1 x = [−∇f(x ) + Lx + ρy ] k+1 L + ρ k k k 1 = x − [∇f(x ) + ρ∇q(x )]. (4.1) k L + ρ k k This is a gradient descent algorithm without an intervening proximal map.

The proximal distance algorithm can also be applied to unconstrained problems. For example, consider the problem of minimizing a penalized loss `(x) + p(Ax). The presence of the linear transformation Ax in the penalty complicates optimization. The strategy of parameter splitting introduces a new variable y and minimizes `(x) + p(y) subject to the constraint y = Ax. If

ΠM(z) denotes projection onto the manifold M = {z = (x, y): Ax = y}, then the constrained problem can be solved approximately by minimizing the function ρ `(x) + p(y) + dist(z, M)2 2 for large ρ. If ΠM(zk) consists of two subvectors uk and vk corresponding to xk and yk, then the proximal distance updates are

xk+1 = proxρ−1`(uk)

yk+1 = proxρ−1p(vk).

When the matrix A has dimensions r × s, one can attack the projection problem by differenti- ating the Lagrangian 1 1 L(x, y, λ) = kx − uk2 + ky − vk2 + λT (Ax − y). 2 2 2 2 To solve the stationarity equations

0 = x − u + AT λ (4.2)

0 = y − v − λ,

we multiply the first by A, subtract it from the second, and substitute Ax = y. This generates the identity

T 0 = Au − v − (AA + Ir)λ

47 with solution

T −1 λ = (AA + Ir) (Au − v). (4.3)

The values x = u − AT λ and y = v + λ are then immediately available. This approach is preferred when r < s. In the opposite case r > s, it makes more sense to directly minimize the function 1 1 f(x) = kx − uk2 + kAx − vk2. 2 2 2 2 This leads to the solution

T −1 T x = (A A + Is) (A v + u).

T The advantage here is that the matrix A A + Is is now s × s rather than r × r.

4.2 Convergence and Acceleration

In the presence of convexity, the proximal distance algorithm reduces to a proximal gradient algo- rithm. This follows from the representation m 1 X y = Π (x) m Ci i=1 m 1 X = x − x − Π (x) m Ci i=1 = x − ∇q(x) involving the penalty q(x). Thus, the proximal distance algorithm can be expressed as

xk+1 = proxρ−1f [xk − ∇q(xk)].

In this regard, there is the implicit assumption that ∇q(x) is Lipschitz continuous with constant 1. This is indeed the case. According to the Moreau decomposition [11], for a single closed convex set C we have

∇q(x) = x − ΠC(x)

= prox ? (x), δC 48 ? where δC(x) is the Fenchel conjugate of the indicator function  0 x ∈ C δC (x) = ∞ x 6∈ C.

Because proximal operators of closed convex functions are nonexpansive [11], the result follows for a single set. For the general penalty q(x) with m sets, the Lipschitz constants are scaled by m−1 and summed to produce an overall Lipschitz constant of 1.

Proximal gradient algorithms can exhibit painfully slow convergence. This fact suggests that one should slowly send ρ to ∞ and refuse to wait until convergence occurs for any given ρ. It also suggests that Nesterov acceleration may rectify the undesirable convergence behavior. Nesterov acceleration for the general proximal gradient algorithm with loss `(x) and penalty p(x) takes the form

k − 1 z = x + (x − x ) k k k + d − 1 k k−1 −1 xk+1 = proxL−1`[zk − L ∇p(zk)], (4.4)

where L is the Lipschitz constant for ∇p(x) and d is typically chosen to be 3. Nesterov acceleration achieves an O(n−2) convergence rate [126], which is vastly superior to the O(n−1) rate achieved by ordinary gradient descent. The Nesterov update possesses the desirable property of preserving

affine constraints. In other words, if Axk−1 = b and Axk = b, then Azk = b as well. In cases not covered by convex analysis theory, we accelerate our proximal distance algorithms by applying the

algorithm map M(x) to the shifted point zk, yielding the accelerated update xk+1 = M(zk). The recent paper of Ghadimi and Lan [63] extends Nestorov acceleration to this more general setting.

Newton’s method offers another possibility for acceleration. This depends on the differentia- bility of the gradient

m ρ X ∇f(x) + [x − Π (x)]. m Ci i=1 Unfortunately, the second differential

m ρ X ∇2f(x) + [I − ∇Π (x)] (4.5) m Ci i=1 49 may not exist globally. For some sets C, the gradient ∇ΠC(x) is trivial to calculate. For instance,

when ΠC(x) = Mx + b is affine, then the identity ∇ΠC(x) = M holds true. In the case of projection onto the set {x : Ax = b}, the matrix M takes the form M = I − AT (AAT )−1A. If

C is a sparsity set, and ΠC(x) reduces to a single point, then ∇ΠC(x) is a diagonal matrix whose

ith diagonal entry is 1 when |xi| is sufficiently large and 0 otherwise. When C is a nonnegativity

constraint set, and no component of x equals 0, then ∇ΠC(x) is diagonal with ith diagonal com-

ponent 1 for xi ≥ 0 and 0 for xi < 0. To its detriment, Newton’s method requires a great deal of linear algebra per iteration in high-dimensional problems. There is also no guarantee that the second differential (4.5) is positive definite, even for convex problems.

Finally, it is worth proving that the proximal distance algorithm converges in the presence of convexity. Our convergence analysis relies on well-known results from operator theory [11]. Prox- imal operators in general and projection operators in particular are nonexpansive and averaged. By definition an averaged operator

M(x) = αx + (1 − α)N(x)

is a convex combination of a nonexpansive operator N(x) and the identity operator I. The av-

eraged operators on Rn with α ∈ (0, 1) form a convex set closed under functional composition. Furthermore, M(x) and the base operator N(x) share their fixed points. The celebrated theorem of Krasnosel’skii [88] and Mann [103] says that if an averaged operator M(x) = αx + (1 − α)N(x)

possesses one or more fixed points, then the iteration scheme xk+1 = M(xk) converges to a fixed point.

Consider minimization of the penalized loss

m ρ X f(x) + dist(x, C )2. 2m i i=1 By definition the proximal distance iterate is given by

xk+1 = proxρ−1f (yk),

y = 1 Pm Π (x ) where k m i=1 Ci k . The algorithm map is an averaged operator because it is the compo- sition of two averaged operators. Hence, the Krasnosel’skii-Mann theorem guarantees convergence 50 to a fixed point if one or more exist. Now y is a fixed point if and only if

m m ρ X ρ X f(y) + ky − Π (y)k2 ≤ f(x) + kx − Π (y)k2 2m S 2 2m S 2 i=1 i=1 for all x and a constraint set S. In the presence of convexity, this is equivalent to the directional derivative inequality

m ρ X 0 ≤ ∇ f(y) + [y − Π (y)]T v v m Ci i=1 " m # ρ X = ∇ f(y) + dist(y, C )2 v 2m i i=1 for all v, which is in turn equivalent to y minimizing the convex penalized loss. Minimum points exist, and therefore so do fixed points.

Convergence of the overall proximal distance algorithm is tied to the convergence of the clas- sical penalty method [14]. In this context the loss is f(x), and the penalty is

m 1 X q(x) = dist(x, C )2. 2m i i=1

Assuming that the objective f(x) + ρq(x) is coercive, then the theory mandates that the solution path xρ is bounded and any cluster point of the path attains the minimum value of f(x) subject to the constraints. Furthermore, if f(x) is coercive and possesses a unique minimum point in the

constraint set C, then the path xρ converges to that point. Algorithm 5 gives a schematic of the proximal distance algorithm.

51 Algorithm 5 The proximal distance algorithm with Nesterov acceleration. Require: a starting point x ∈ Rn, a tolerance  > 0 with   1, a maximum iteration count K, an initial ρ, a ρ update frequency R, and a ρ update factor a. while k ≤ K do Update iteration count k ← k + 1.

k−1 Compute accelerated Nesterov step yk := xk + k+2 (xk − xk−1).

Compute zk := ΠC(yk).

Update xk+1 := proxρ−1f (zk).

Exit if kxk+1 − xkk2 < (1 + kxkk2). If mod (k, R) = 0 then augment ρ := aρ. end while

4.3 Examples

The following examples highlight the versatility of proximal distance algorithms in a variety of convex and nonconvex settings. Programming details matter in solving these problems. Individual programs are not necessarily long, but care must be exercised in projecting onto constraints, choos- ing tuning schedules, folding constraints into the domain of the loss, implementing acceleration, and declaring convergence. Whenever possible, competing software was run with the Julia opti- mization module MathProgBase [54, 101] to wrap both open-source and commercial solvers. The sparse PCA problem relies on the R package PMA of Witten, Tibshirani, and Hastie [139].

4.3.1 Linear Programming

As suggested in Chapter 3, this deceptively simple problem is harder to solve with a proximal dis- tance algorithm than one might first suspect. Our approach is to roll the standard affine constraints Ax = b into the domain of the loss function vT x. The standard nonnegativity requirement x ≥ 0 is achieved by penalization. Let xk be the current iterate and yk = (xk)+ be its projection onto

52 n R+. Derivation of the proximal distance algorithm relies on the Lagrangian ρ vT x + kx − y k2 + λT (Ax − b). 2 k 2

One can multiply the stationarity equation

T 0 = v + ρ(x − yk) + A λ

by A and solve for the Lagrange multiplier λ in the form

T −1 λ = (AA ) (ρAyk − ρb − Av). (4.6)

Inserting this value into the stationarity equation gives the update

1  1  x = y − v − AT (AAT )−1 Ay − b − Av . (4.7) k+1 k ρ k ρ

Table 4.1 compares the accelerated proximal distance algorithm to the open-source Splitting Cone Solver (SCS) [113] and the interior point method implemented in the commercial Gurobi solver. The first seven rows of the table summarize linear programs with dense data A, b, and v. The bottom six rows rely on random sparse matrices A with sparsity level 0.01. For dense problems, the proximal distance algorithm starts the penalty constant ρ = 1 and doubles it every 100 iterations. Because we precompute and cache the pseudoinverse AT (AAT )−1 of A, the update (4.7) reduces to vector additions and matrix-vector multiplications.

For sparse problems the proximal distance algorithm updates ρ by a factor of 1.5 every 50 iterations. To avoid computing large pseudoinverses, we appeal to the LSQR variant of the conju- gate gradient method [115, 116] to solve the linear system (4.6). The optima of all three methods agree to 4 digits of accuracy. Each algorithm demonstrates merit. Gurobi clearly performs best on low-dimensional problems, but it scales poorly with dimension compared to SCS and the proxi- mal distance algorithm. In large sparse regimes the proximal distance algorithm and SCS perform equally well. If accuracy is not a primary concern, then the proximal distance algorithm is easily accelerated with a more aggressive update schedule for ρ.

53 Dimensions Optima CPU Seconds

m n PD SCS Gurobi PD SCS Gurobi

2 4 0.2629 0.2629 0.2629 0.0018 0.0004 0.0012 4 8 1.0455 1.0456 1.0455 0.0022 0.0012 0.0011 8 16 2.4513 2.4514 2.4513 0.0167 0.0024 0.0013 16 32 3.4226 3.4225 3.4223 0.0472 0.0121 0.0014 32 64 6.2398 6.2397 6.2398 0.0916 0.0165 0.0028 64 128 14.671 14.671 14.671 0.1554 0.0643 0.0079 128 256 27.116 27.116 27.116 0.3247 0.8689 0.0406 256 512 58.501 58.494 58.494 0.6064 2.9001 0.2773 512 1024 135.35 135.34 135.34 1.4651 5.0410 1.9607 1024 2048 254.50 254.47 254.47 4.7953 4.7158 0.9544 2048 4096 533.27 533.23 533.23 12.482 23.495 10.121 4096 8192 991.74 991.67 991.67 52.300 84.860 93.687 8192 16384 2058.7 2058.5 2058.5 456.50 430.86 945.75

Table 4.1: CPU times and optima for linear programming. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized to be sparse with sparsity level 0.01.

54 4.3.2 Nonnegative Quadratic Programming

Let us revisit the nonnegative quadratic programming (NQP) example of Section 3.5.3. In NQP

1 T T one minimizes the objective function f(x) = 2 x Ax + b x subject to x  0 for a positive definite matrix A. As in Section 4.3.1, projection onto the nonnegative cone is accomplished by the max operator (x)+. For a given value of the penalty constant ρ, the proximal distance update is

−1 xk+1 = (ρI + A) [ρ(xk)+ − b] . (4.8)

Na¨ıvely solving the linear system at every iteration leads to suboptimal performance. Two alterna- tives exist, depending on whether A is dense or sparse. If A is dense, then one should precompute and cache the spectral decomposition A = VDV T . The update (4.8) becomes

−1 T xk+1 = V (ρI + D) V [ρ(xk)+ − b] . (4.9)

The diagonal matrix ρI + D is trivial to invert. The remaining operations reduce to matrix-vector multiplications, which are substantially cheaper than repeated matrix inversions. Extraction of the spectral decomposition of A becomes prohibitive as the dimension of A increases. To compute the update (4.8) efficiently for large sparse matrices, we apply LSQR.

Table 4.2 compares the performance of the proximal distance algorithm for NQP to the open source nonlinear interior point solver Ipopt [134, 135] and the interior point method of Gurobi. Test problems were generated by filling an n × n matrix M and an n-vector b with standard normal deviates. We then set A = M T M + 0.001I. For sparse problems we set the sparsity level of

M to be log10(n)/n. Our setup ensures that A has full rank and that the quadratic program has a solution. For dense matrices, we start with ρ = 1 and multiply it by 1.5 every 200 iterations. For sparse problems, we start ρ at 10−4 and multiply it by 1.5 every 100 iterations. Table 4.2 suggests that the proximal distance algorithm and the interior point solvers perform equally well on small dense problems. However, in high-dimensional and low-accuracy environments, the proximal distance algorithm provides better performance.

55 Dimensions Optima CPU Seconds

n PD IPOPT Gurobi PD IPOPT Gurobi

2 -0.0015 -0.0014 -0.0014 0.0042 0.0025 0.0031 4 -0.6070 -0.6070 -0.6070 0.0002 0.0028 0.0017 8 -0.6840 -0.6834 -0.6834 0.0064 0.0036 0.0024 16 -0.6235 -0.6234 -0.6235 0.0872 0.0037 0.0022 32 -0.1936 -0.1934 -0.1936 0.0864 0.0041 0.0030 64 -0.3368 -0.3364 -0.3368 0.1121 0.0054 0.0059 128 -0.5344 -0.5337 -0.5344 0.1698 0.0124 0.0326 256 -0.4969 -0.4956 -0.4969 0.3001 0.0512 0.0760 512 -0.4716 -0.4689 -0.4716 0.8104 0.2617 0.3720 1024 -26271 -26277 -26277 12.7841 0.2575 0.3685 2048 -26000 -26024 -26024 29.6108 2.2635 2.2506 4096 -56138 -56272 -56272 57.9576 23.850 17.452 8192 -52960 -53025 -53025 126.145 242.90 164.90 16384 -108677 -108837 -108837 425.017 2596.3 1500.4

Table 4.2: CPU times and optima for nonnegative quadratic programming. Here n is the number of variables, PD is the accelerated proximal distance algorithm, IPOPT is the Ipopt solver, and Gurobi is the Gurobi solver. After n = 512, the constraint matrix A is sparse.

56 4.3.3 Closest Kinship Matrix

In genetics studies, kinship is measured by the fraction of genes that two individuals share identi- cally by descent. For a given pedigree, the kinship coefficients for all pairs of individuals appear as entries in a symmetric kinship matrix Y . This matrix possesses three crucial properties:

(a) it is positive semidefinite,

(b) its entries are nonnegative,

1 (c) its diagonal entries are 2 unless some pedigree members are inbred.

Inbreeding is the exception rather than the rule. Kinship matrices can be estimated empirically from single nucleotide polymorphism (SNP) data, but there is no guarantee that the three high- lighted properties are satisfied. Hence, it helpful to project Y to the nearest qualifying matrix.

This projection problem is best solved by folding the positive semidefinite constraint into the

1 2 domain of the Frobenius loss function 2 kX − Y kF . As we shall see, the alternative of imposing two penalties rather than one is slower and less accurate. Projection onto the constraints implied

1 by conditions (b) and (c) is trivial. All diagonal entries xii of X are reset to 2 , and all off-diagonal

entries xij are reset to max{xij, 0}. If Π(Xk) denotes the current projection, then the proximal distance algorithm minimizes the surrogate

1 ρ g(X | X ) = kX − Y k2 + kX − Π(X )k2 k 2 F 2 k F 2 1 + ρ 1 ρ = X − Y − Π(Xk) + ck, 2 1 + ρ 1 + ρ F

where ck is an irrelevant constant. The minimum is found by extracting the spectral decomposition T 1 ρ UDU of 1+ρ Y + 1+ρ Π(Xk) and truncating the negative eigenvalues. This gives the update T Xk+1 = UD+U in obvious notation. The most onerous computations are clearly the repeated matrix decompositions.

Table 4.3 compares three versions of the proximal distance algorithm to Dykstra’s algorithm [28]. Higham proposed Dykstra’s algorithm for the related problem of finding the closest correla- tion matrix [71]. In Table 4.3 algorithm PD1 is the basic proximal distance algorithm, PD2 is the 57 Size PD1 PD2 PD3 Dykstra

n Loss Time Loss Time Loss Time Loss Time

2 1.64 0.36 1.64 0.01 1.64 0.01 1.64 0.00 4 2.86 0.10 2.86 0.01 2.86 0.01 2.86 0.00 8 18.77 0.21 18.78 0.03 18.78 0.03 18.78 0.00 16 45.10 0.84 45.12 0.18 45.12 0.12 45.12 0.02 32 169.58 4.36 169.70 0.61 169.70 0.52 169.70 0.37 64 837.85 16.77 838.44 2.90 838.43 2.63 838.42 4.32 128 3276.41 91.94 3279.44 18.00 3279.25 14.83 3279.23 19.73 256 14029.07 403.59 14045.30 89.58 14043.59 64.89 14043.46 72.79

Table 4.3: CPU times and optima for the closest kinship matrix problem. Here the kinship matrix is n × n, PD1 is the proximal distance algorithm, PD2 is the accelerated proximal distance, PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss, and Dykstra is Dykstra’s adaptation of alternating projections. All times are in seconds. accelerated proximal distance algorithm, and PD3 is the accelerated proximal distance algorithm with the positive semidefinite constraints folded into the domain of the loss. On this demanding problem, these algorithms are comparable to Dykstra’s algorithm in speed but slightly less ac- curate. Acceleration of the proximal distance algorithm is effective in reducing both execution time and error. Folding the positive semidefinite constraint into the domain of the loss function leads to further improvements. The data matrices M in these trials were populated by standard normal deviates and then symmetrized by averaging opposing triangles. In algorithm PD1 we set

k 22 ρk = max{1.2 , 2 }. In the accelerated versions PD2 and PD3 we started with ρ = 1 and multi- plied it by 5 every 100 iterations. At the expense of longer compute times, better accuracy can be achieved by all three proximal distance algorithms with a less aggressive update schedule.

58 4.3.4 Projection onto a Second-Order Cone Constraint

Second-order cone programming is among the most important fields in convex analysis [4, 99]. It

T revolves around conic constraints of the form {u : kAu + bk2 ≤ c u + d}. Projection of a vector x onto such a constraint set is facilitated by parameter splitting. In this setting parameter splitting introduces a vector w, a scalar r, and the two affine constraints w = Au+b and r = cT u+d. The conic constraint then reduces to the Lorentz cone constraint kwk2 ≤ r, for which the projection is known [27]. If we concatenate the parameters into the single vector   u     y = w   r and define L = {y : kwk ≤ r} and M = {y : w = Au + b and r = cT u + d}, then we can

1 2 rephrase the problem as minimizing 2 kx − uk2 subject to y ∈ L ∩ M. This is a fairly typical set projection problem except that the w and r components of y are missing in the loss function.

Taking a cue from Section 4.3.1, we incorporate the affine constraints in the domain of the objective function. If we represent projection onto L by     wk w˜ k ΠL   =   , rk r˜k

then the Lagrangian generated by the proximal distance algorithm amounts to   2 w − w˜ 1 2 ρ k T T kx − uk2 +   + λ (Au + b − w) + θ(c u + d − r). 2 2 r − r˜k 2 This gives rise to a system of three stationarity equations

0 = u − x + AT λ + θc (4.10)

0 = ρ(w − w˜ k) − λ (4.11)

0 = ρ(r − r˜k) − θ. (4.12)

Solving for the multipliers λ and θ in equations (4.11) and (4.12) and substituting their values in

59 equation (4.10) yields

T 0 = u − x + ρA (w − w˜ k) + ρ(r − r˜k)c

T T = u − x + ρA (Au + b − w˜ k) + ρ(c u + d − r˜k)c.

This leads to the update

−1 T T −1 −1 T uk+1 = (ρ I + A A + cc ) [ρ x + A (w˜ k − b) + (˜rk − d)c]. (4.13)

T The updates wk+1 = Auk+1 + b and rk+1 = c uk+1 + d follow from the constraints.

Table 4.4 compares the proximal distance algorithm to SCS and Gurobi. Echoing previous examples, we tailor the update schedule for ρ differently for dense and sparse problems. Dense problems converge quickly and accurately when we set ρ0 = 1 and double ρ every 100 iterations.

Sparse problems require a greater range and faster updates of ρ, so we set ρ0 = 0.01 and then multiply ρ by 2.5 every 10 iterations. For dense problems, it is clearly advantageous to cache the spectral decomposition of AT A + ccT as suggested in Section 4.3.2. In this regime, the proximal distance algorithm is as accurate as Gurobi and nearly as fast. SCS is comparable to Gurobi in speed but notably less accurate.

With a large sparse constraint matrix A, extraction of its spectral decomposition becomes pro-   hibitively expensive. If we let E = ρ−1/2IAT c , then we must solve a linear system of equations defined by the Gram matrix G = EET . There are three reasonable options for solving this system. The first relies on computing and caching a sparse Cholesky decomposition of G. The second computes the QR decomposition of the sparse matrix E. The R part of the QR decompo- sition coincides with the Cholesky factor. Unfortunately, updates to ρ necessitate recomputation of the Cholesky or QR factors. The third option is the conjugate gradient algorithm. In our expe- rience the QR decomposition offers superior stability and accuracy. When E is very sparse, the QR decomposition is often much faster than the Cholesky decomposition because it avoids the formation of the dense matrix AT A. Even when only 5% of the entries of A are nonzero, 90% of the entries of AT A can be nonzero. If exquisite accuracy is not a concern, then the conjugate gradient method provides the fastest update. Table 4 reflects this choice.

Table 4.4: CPU times and optima for the second-order cone projection. Here m is the number of constraints, n is the number of variables, PD is the accelerated proximal distance algorithm, SCS is the Splitting Cone Solver, and Gurobi is the Gurobi solver. After m = 512 the constraint matrix A is initialized with sparsity level 0.01.

  Dimensions              Optima                        CPU Seconds
    m      n       PD        SCS       Gurobi       PD        SCS      Gurobi
    2      4     0.10598   0.10607    0.10598     0.0043    0.0103    0.0026
    4      8     0.00000   0.00000    0.00000     0.0003    0.0009    0.0022
    8     16     0.88988   0.88991    0.88988     0.0557    0.0011    0.0027
   16     32     2.16514   2.16520    2.16514     0.0725    0.0012    0.0040
   32     64     3.03855   3.03864    3.03853     0.0952    0.0019    0.0094
   64    128     4.86894   4.86962    4.86895     0.1225    0.0065    0.0403
  128    256     10.5863   10.5843    10.5863     0.1975    0.0810    0.0868
  256    512     31.1039   31.0965    31.1039     0.5463    0.3995    0.3405
  512   1024     27.0483   27.0475    27.0483     3.7667    1.6692    2.0189
 1024   2048     1.45578   1.45569    1.45569     0.5352    0.3691    1.5489
 2048   4096     2.22936   2.22930    2.22921     1.0845    2.4531    5.5521
 4096   8192     1.72306   1.72202    1.72209     3.1404    17.272    15.204
 8192  16384     5.36191   5.36116    5.36144     13.979    133.25    88.024

4.3.5 Copositive Matrices

A symmetric matrix $M$ is copositive if its associated quadratic form $x^TMx$ is nonnegative for all $x \succeq 0$. Copositive matrices find applications in numerous branches of the mathematical sciences [16]. All positive semidefinite matrices and all matrices with nonnegative entries are copositive. The variational index
$$\mu(M) = \min_{\|x\|_2 = 1,\; x \succeq 0} x^TMx$$
is one key to understanding copositive matrices [73]. The constraint set $S$ is the intersection of the unit sphere and the nonnegative cone $\mathbb{R}^n_+$. Projection of an external point $y$ onto $S$ splits into three cases. When all components of $y$ are negative, then $\Pi_S(y) = e_i$, where $y_i$ is the least negative component of $y$, and $e_i$ is the standard unit vector along coordinate direction $i$. The origin $0$ is equidistant from all points of $S$. If any component of $y$ is positive, then the projection is constructed by setting the negative components of $y$ equal to 0 and then standardizing the truncated version of $y$ to have Euclidean norm 1.
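A literal transcription of this projection rule into Julia is given below. The function name is our own, and ties (such as $y = 0$) are resolved arbitrarily, which is harmless in practice since the origin is equidistant from all of $S$.

```julia
using LinearAlgebra

# Project y onto S = {x : ||x||_2 = 1, x >= 0}, following the three cases in the text.
function project_sphere_nonneg(y::Vector{Float64})
    if all(y .<= 0)                   # no positive component: pick the least negative one
        x = zeros(length(y))
        x[argmax(y)] = 1.0
        return x
    end
    x = max.(y, 0.0)                  # truncate the negative components to zero
    return x ./ norm(x)               # rescale to the unit sphere
end
```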

As a test case for the proximal distance algorithm, consider the Horn matrix [66]
$$M = \begin{pmatrix}
 1 & -1 &  1 &  1 & -1 \\
-1 &  1 & -1 &  1 &  1 \\
 1 & -1 &  1 & -1 &  1 \\
 1 &  1 & -1 &  1 & -1 \\
-1 &  1 &  1 & -1 &  1
\end{pmatrix}.$$

The value $\mu(M) = 0$ is attained for the vectors $\frac{1}{\sqrt{2}}(1, 1, 0, 0, 0)^T$, $\frac{1}{\sqrt{6}}(1, 2, 1, 0, 0)^T$, and equivalent vectors with their entries permuted. Matrices in higher dimensions with the same Horn pattern of 1s and $-1$s are copositive as well [79]. A Horn matrix of odd dimension cannot be written as a positive semidefinite matrix, a nonnegative matrix, or a sum of two such matrices.

The proximal distance algorithm minimizes the criterion
$$g(x \mid x_k) = \frac{1}{2}x^TMx + \frac{\rho}{2}\|x - \Pi_S(x_k)\|_2^2$$
and generates the updates
$$x_{k+1} = (M + \rho I)^{-1}\rho\,\Pi_S(x_k).$$

Table 4.5: CPU times (seconds) and optima for approximating the Horn variational index of a Horn matrix. Here n is the size of the Horn matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension             Optima                           CPU Seconds
     n         PD          aPD        Mosek          PD        aPD       Mosek
     4     0.000000    0.000000    feasible        0.5555    0.0124     2.7744
     5     0.000000    0.000000    infeasible      0.0039    0.0086     0.0276
     8     0.000021    0.000000    feasible        0.0059    0.0083     0.0050
     9     0.000045    0.000000    infeasible      0.0055    0.0072     0.0082
    16     0.000377    0.000001    feasible        0.0204    0.0237     0.0185
    17     0.000441    0.000001    infeasible      0.0204    0.0378     0.0175
    32     0.001610    0.000007    feasible        0.0288    0.0288     0.1211
    33     0.002357    0.000009    infeasible      0.0242    0.0346     0.1294
    64     0.054195    0.000026    feasible        0.0415    0.0494     3.6284
    65     0.006985    0.000026    infeasible      0.0431    0.0551     2.7862

It takes a gentle tuning schedule to get decent results. The choice $\rho_k = 1.2^k$ converges in 600 to 700 iterations from random starting points and reliably yields objective values below $10^{-5}$ for Horn matrices. The computational burden per iteration is significantly eased by exploiting the cached spectral decomposition of $M$. Table 4.5 compares the performance of the proximal distance algorithm to the Mosek solver on a range of Horn matrices. Mosek uses semidefinite programming to decide whether $M$ can be decomposed into a sum of a positive semidefinite matrix and a nonnegative matrix. If not, then Mosek declares the problem infeasible. Nesterov acceleration improves the final loss for the proximal distance algorithm, but it does not decrease overall computing time.
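The following Julia sketch strings the pieces together for this example: it caches the spectral decomposition of $M$ once, applies the update $x_{k+1} = (M + \rho I)^{-1}\rho\,\Pi_S(x_k)$, and follows the $\rho_k = 1.2^k$ schedule mentioned above. It reuses the `project_sphere_nonneg` helper sketched earlier; the function name, stopping rule, and random start are illustrative.

```julia
using LinearAlgebra

# Approximate the variational index mu(M) by the proximal distance iteration.
function variational_index(M::AbstractMatrix; maxiter = 700)
    F = eigen(Symmetric(Matrix(M)))        # cache the spectral decomposition once
    Q, d = F.vectors, F.values
    x = project_sphere_nonneg(randn(size(M, 1)))
    for k in 1:maxiter
        rho = 1.2^k                        # gentle tuning schedule
        p = rho .* project_sphere_nonneg(x)
        x = Q * ((Q' * p) ./ (d .+ rho))   # solves (M + rho*I) x = p via the eigenvalues
    end
    x = project_sphere_nonneg(x)           # report a feasible point
    return dot(x, M * x), x
end
```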

Testing for copositivity is challenging because neither the loss function nor the constraint set is convex. The proximal distance algorithm offers a fast screening device for checking whether a matrix is copositive. On random $1000 \times 1000$ symmetric matrices $M$, the method invariably returns a negative index in less than two seconds of computing time. Because the vast majority of symmetric matrices are not copositive, accurate estimation of the minimum is not required. Table 4.6 summarizes a few random trials with lower-dimensional symmetric matrices. In higher dimensions, Mosek becomes non-competitive, and Nesterov acceleration is of dubious value.

Table 4.6: CPU times and optima for testing the copositivity of random symmetric matrices. Here n is the size of the matrix, PD is the proximal distance algorithm, aPD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension             Optima                            CPU Seconds
     n          PD           aPD        Mosek          PD        aPD       Mosek
     4      -0.391552    -0.391561   infeasible      0.0029    0.0031     0.0024
     8      -0.911140    -2.050316   infeasible      0.0037    0.0044     0.0045
    16      -1.680697    -1.680930   infeasible      0.0199    0.0272     0.0062
    32      -2.334520    -2.510781   infeasible      0.0261    0.0242     0.0441
    64      -3.821927    -3.628060   infeasible      0.0393    0.0437     0.6559
   128      -5.473609    -5.475879   infeasible      0.0792    0.0798    38.3919
   256      -7.956365    -7.551814   infeasible      0.1632    0.1797   456.1500

4.3.6 Linear Complementarity Problem

The linear complementarity problem [108] consists of finding vectors $x$ and $y$ with nonnegative components such that $x^Ty = 0$ and $y = Ax + b$ for a given square matrix $A$ and vector $b$. The natural loss function is $\frac{1}{2}\|y - Ax - b\|_2^2$. To project a vector pair $(u, v)$ onto the nonconvex constraint set, one considers each component pair $(u_i, v_i)$ in turn. If $u_i \ge \max\{v_i, 0\}$, then the nearest pair of vectors $(x, y)$ has components $(x_i, y_i) = (u_i, 0)$. If $v_i \ge \max\{u_i, 0\}$, then the nearest pair of vectors has components $(x_i, y_i) = (0, v_i)$. Otherwise, $(x_i, y_i) = (0, 0)$. At each iteration the proximal distance algorithm minimizes the criterion

$$\frac{1}{2}\|y - Ax - b\|_2^2 + \frac{\rho}{2}\|x - \tilde{x}_k\|_2^2 + \frac{\rho}{2}\|y - \tilde{y}_k\|_2^2,$$
where $(\tilde{x}_k, \tilde{y}_k)$ is the projection of $(x_k, y_k)$ onto the constraint set.

Table 4.7: CPU times (seconds) and optima for the linear complementarity problem with randomly generated data. Here n is the size of the matrix, PD is the accelerated proximal distance algorithm, and Mosek is the Mosek solver.

 Dimension          Optima                  CPU Seconds
     n          PD        Mosek           PD        Mosek
     4      0.000000    0.000000       0.0230      0.0266
     8      0.000000    0.000000       0.0062      0.0079
    16      0.000000    0.000000       0.0269      0.0052
    32      0.000000    0.000000       0.0996      0.4303
    64      0.000074    0.000000       2.6846    360.5183

The stationarity equations become

$$\begin{aligned}
0 &= -A^T(y - Ax - b) + \rho(x - \tilde{x}_k) \\
0 &= y - Ax - b + \rho(y - \tilde{y}_k).
\end{aligned}$$
Substituting the value of $y$ from the second equation into the first equation leads to the updates
$$\begin{aligned}
x_{k+1} &= \left[(1 + \rho)I + A^TA\right]^{-1}\left[A^T(\tilde{y}_k - b) + (1 + \rho)\tilde{x}_k\right] \qquad (4.14) \\
y_{k+1} &= \frac{1}{1 + \rho}(Ax_{k+1} + b) + \frac{\rho}{1 + \rho}\tilde{y}_k.
\end{aligned}$$
The linear system (4.14) can be solved in low to moderate dimensions by computing and caching the spectral decomposition of $A^TA$ and in high dimensions by the conjugate gradient method. Table 4.7 compares the performance of the proximal distance algorithm to the Mosek solver on some randomly generated problems.
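A compact Julia sketch of one such iteration appears below. The componentwise projection follows the three cases described above; the dense backslash solve stands in for the cached spectral decomposition or conjugate gradient solves used in the actual experiments, and the function names are our own.

```julia
using LinearAlgebra

# Project the pair (u, v) componentwise onto {(x, y) : x >= 0, y >= 0, x .* y = 0}.
function project_complementarity(u::Vector{Float64}, v::Vector{Float64})
    x, y = zero(u), zero(v)
    for i in eachindex(u)
        if u[i] >= max(v[i], 0.0)
            x[i] = u[i]
        elseif v[i] >= max(u[i], 0.0)
            y[i] = v[i]
        end                      # otherwise both components stay at zero
    end
    return x, y
end

# One proximal distance update, combining (4.14) with the companion y update.
function lcp_update(x, y, A, b, rho)
    xt, yt = project_complementarity(x, y)
    xnew = ((1 + rho) * I + A' * A) \ (A' * (yt - b) + (1 + rho) * xt)
    ynew = (A * xnew + b) ./ (1 + rho) .+ (rho / (1 + rho)) .* yt
    return xnew, ynew
end
```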

4.3.7 Sparse Principal Components Analysis

Let $X$ be an $n \times p$ data matrix gathered on $n$ cases and $p$ predictors. Assume that the columns of $X$ are centered to have mean 0. Principal component analysis (PCA) [75, 118] operates on the sample covariance matrix $S = \frac{1}{n}X^TX$. Here we formulate a proximal distance algorithm for sparse PCA (SPCA), which has attracted substantial interest in the machine learning community [18, 17, 44, 80, 81, 139, 148]. According to a result of Ky Fan [57], the first $q$ principal components (PCs) $u_1, \ldots, u_q$ can be extracted by maximizing the function $\operatorname{tr}(U^TSU)$ subject to the matrix constraint $U^TU = I_q$, where $u_i$ is the $i$th column of the $p \times q$ matrix $U$. This constraint set is called a Stiefel manifold, which we denote by $M_q$. One can impose sparsity by insisting that any given column $u_i$ have at most $r$ nonzero entries. Alternatively, one can require the entire matrix $U$ to have at most $r$ nonzero entries. The latter choice permits nonzero values to be distributed non-uniformly across columns.

Extraction of sparse PCs is difficult for three reasons. First, the Stiefel manifold $M_q$ and both sparsity sets are nonconvex. Second, the objective function is concave rather than convex. Third, there is no simple formula or algorithm for projecting onto the intersection of the two constraint sets. Fortunately, each individual projection is simple. Let $\Pi_{M_q}(U)$ denote the projection of $U$ onto $M_q$. It is well known that $\Pi_{M_q}(U)$ can be calculated by extracting a thin singular value decomposition $U = V\Sigma W^T$ of $U$ and then applying the Procrustes projection $\Pi_{M_q}(U) = VW^T$ [65]. Here $V$ and $W$ are orthogonal matrices of dimension $p \times q$ and $q \times q$, respectively, and $\Sigma$ is a diagonal matrix of dimension $q \times q$. Let $\Pi_{S_r}(U)$ denote the projection of $U$ onto the sparsity set
$$S_r = \{V : v_{ij} \neq 0 \text{ for at most } r \text{ entries of each column } v_i\}.$$
Because $\Pi_{S_r}(U)$ operates column by column, it suffices to project each column vector $u_i$ to sparsity. This entails nothing more than sorting the entries of $u_i$ by magnitude, saving the $r$ largest, and sending the remaining $p - r$ entries to 0. If the entire matrix $U$ must have at most $r$ nonzero entries, then $U$ can be treated as a concatenated vector $\operatorname{vec}(U)$ during projection.

The key to a good algorithm is to incorporate the Stiefel constraints into the domain of the objective function [84, 85] and the sparsity constraints into the distance penalty. Thus, we propose decreasing the criterion
$$f(U) = -\frac{1}{2}\operatorname{tr}(U^TSU) + \frac{\rho}{2}\operatorname{dist}(U, S_r)^2$$
at each iteration subject to the Stiefel constraints. The loss can be majorized via
$$-\frac{1}{2}\operatorname{tr}(U^TSU) = -\frac{1}{2}\operatorname{tr}[(U - U_k)^TS(U - U_k)] - \operatorname{tr}(U^TSU_k) + \frac{1}{2}\operatorname{tr}(U_k^TSU_k) \le -\operatorname{tr}(U^TSU_k) + \frac{1}{2}\operatorname{tr}(U_k^TSU_k)$$
because $S$ is positive semidefinite. The penalty is majorized by
$$\frac{\rho}{2}\operatorname{dist}(U, S_r)^2 \le -\rho\operatorname{tr}[U^T\Pi_{S_r}(U_k)] + c_k$$
up to an irrelevant constant $c_k$ since the squared Frobenius norm satisfies the relation $\|U\|_F^2 = \operatorname{tr}(U^TU) = q$ on the Stiefel manifold. It now follows that $f(U)$ is majorized by
$$\frac{1}{2}\|U - SU_k - \rho\,\Pi_{S_r}(U_k)\|_F^2$$
up to an irrelevant constant. Accordingly, the Stiefel projection
$$U_{k+1} = \Pi_{M_q}[SU_k + \rho\,\Pi_{S_r}(U_k)]$$
provides the next iterate.
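The next sketch implements one such update in Julia for the column-wise sparsity variant. The thin SVD supplies the Procrustes projection and column-wise hard thresholding supplies $\Pi_{S_r}$; the function names are illustrative, and the matrix-wide sparsity variant would simply rank all entries of $U$ at once.

```julia
using LinearAlgebra

# Keep the r largest-magnitude entries of each column of U; zero the rest.
function project_column_sparsity(U::Matrix{Float64}, r::Int)
    V = zeros(size(U))
    for j in 1:size(U, 2)
        idx = sortperm(abs.(U[:, j]), rev = true)[1:r]
        V[idx, j] = U[idx, j]
    end
    return V
end

# Procrustes projection onto the Stiefel manifold via a thin SVD.
function project_stiefel(U::Matrix{Float64})
    F = svd(U)
    return F.U * F.Vt
end

# One proximal distance update: U_{k+1} = Pi_Mq[S*U_k + rho*Pi_Sr(U_k)].
spca_update(U, S, rho, r) = project_stiefel(S * U + rho * project_column_sparsity(U, r))
```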

Figures 4.1 and 4.2 compare the proximal distance algorithm to the SPC function from the R package PMA [139]. The breast cancer data from PMA provide the data matrix $X$. The data consist of $p = 19{,}672$ RNA measurements on $n = 89$ patients. The two figures show computation times and the proportion of variance explained (PVE) by the $p \times q$ loading matrix $U$. For sparse PCA, PVE is defined as $\operatorname{tr}(X_q^TX_q)/\operatorname{tr}(X^TX)$, where $X_q = XU(U^TU)^{-1}U^T$ [124]. When the loading vectors of $U$ are orthogonal, this criterion reduces to the familiar definition $\operatorname{tr}(U^TX^TXU)/\operatorname{tr}(X^TX)$ of PVE for ordinary PCA. The proximal distance algorithm enforces either matrix-wise or column-wise sparsity. In contrast, SPC enforces only column-wise sparsity via the constraint $\|u_i\|_1 \le c$ for each column $u_i$ of $U$. We take $c = 8$. The number of nonzeroes per loading vector output by SPC dictates the sparsity level for the column-wise version of the proximal distance algorithm. Summing these counts across all columns dictates the sparsity level for the matrix version of the proximal distance algorithm.

Figure 4.1: Proportion of variance explained by q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA.

Figure 4.2: Computation times for q PCs for each algorithm. Here PD1 is the accelerated proximal distance algorithm enforcing matrix sparsity, PD2 is the accelerated proximal distance algorithm enforcing column-wise sparsity, and SPC is the orthogonal sparse PCA method from PMA.

Figures 4.1 and 4.2 demonstrate the superior PVE and computational speed of both proximal distance algorithms versus SPC. The type of projection does not appear to affect the computational performance of the proximal distance algorithm, as both versions scale equally well with q. However, the matrix projection, which permits the algorithm to more freely assign nonzeroes to the loadings, attains better PVE than the more restrictive column-wise projection. For both variants of the proximal distance algorithm, Nesterov acceleration improves both fitting accuracy and computational speed, especially as the number of PCs q increases.

4.4 Discussion

The proximal distance algorithm applies to a host of problems. In addition to the linear and quadratic programming examples considered here, in the previous chapter we derive and test algorithms for network optimization, $\ell_0$ regression, matrix completion [31, 34, 37, 104], and sparse precision matrix estimation [61]. Other potential applications immediately come to mind. An integer linear program in standard form can be expressed as minimizing $c^Tx$ subject to $Ax + s = b$, $s \ge 0$, and $x \in \mathbb{Z}^p$. The latter two constraints can be combined in a single constraint for which projection is trivial. The affine constraints should be folded into the domain of the objective. Integer programming is NP-hard, so the proximal distance algorithm just sketched is merely heuristic. Integer linear programming includes traditional NP-hard problems such as the traveling salesman problem, the vertex cover problem, set packing, and Boolean satisfiability. It remains unclear whether or not the proximal distance principle can meet these challenges. Our experience with the closest lattice point problem [3] and the eight queens problem suggests that the proximal distance algorithm can be too greedy for difficult combinatorial optimization. The nonconvex problems solved in this paper are in a vague sense easy combinatorial problems.

The behavior of a proximal distance algorithm depends critically on a sensible tuning schedule for increasing ρ. Starting ρ too high puts too much stress on satisfying the constraints. Incrementing ρ too quickly causes the algorithm to veer off the solution path guaranteed by the penalty method. Given the chance of roundoff error even with double precision arithmetic, it is unwise to take ρ all the way to $\infty$. Trial and error can help in deciding whether a given class of problems will benefit from an aggressive update schedule and strict or loose convergence criteria. In problems with little curvature such as linear programming, more conservative updates are probably prudent. Both the closest kinship matrix problem and the SPCA problem document the value of folding constraints into the domain of the loss. In the same spirit it is wise to minimize the number of constraints. A single penalty for projecting onto the intersection of two constraint sets is almost always preferable to two penalties for their separate projections. Exceptions to this rule occur when projection onto the intersection is hard. The integer linear programming problem mentioned previously illustrates these ideas.

The version of the proximal distance algorithm derived in Chapter 3 ignored acceleration. In many cases the solutions attained low accuracy and often required many iterations to converge. The realization that convex proximal distance algorithms can be phrased as proximal gradient algorithms opened the possibility of Nesterov acceleration. The version developed here uses Nesterov acceleration routinely on the subproblems with a fixed ρ. This typically forces tighter path following and a reduction in overall computing times. The examples in Section 4.3 generally bear out the contention that Nesterov acceleration is useful in nonconvex problems [63]. However, the value of acceleration often lies in improving the quality of a solution as much as it does in increasing the rate of convergence. Of course, acceleration cannot prevent convergence to an inferior local minimum.

The proximal distance principle offers insight into many existing algorithms and a path to devising new ones. Effective proximal and projection operators usually spell the difference between success and failure. The number and variety of such operators is expanding quickly as the field of optimization relinquishes its fixation on convexity. The current exposition leaves many open questions about tuning schedules, rates of convergence, and acceleration in the face of nonconvexity.

It remains to show that these projection algorithms are useful. In Chapter 5 we describe in detail a particular projection algorithm called iterative hard thresholding. Like the proximal distance algorithm for $\ell_0$ regression, iterative hard thresholding performs sparse linear regression. We will demonstrate its utility in a statistical genetics setting that benefits from its nonconvex nature.

CHAPTER 5

Iterative Hard Thresholding for GWAS Analysis

5.1 Introduction

A genome-wide association study (GWAS) examines the influence of a multitude of genetic variants on a given trait. Over the past decade, GWAS has benefitted from technological advances in dense genotype microarrays, high-throughput sequencing, and more powerful computing resources. Yet researchers still struggle to find the genetic variants that account for the missing heritability of many traits. It is now common for consortia studying a complex trait such as height to pool results across multiple sites and countries. Meta-analyses have discovered hundreds of statistically significant single nucleotide polymorphisms (SNPs), each of which explains a small fraction of the total heritability. A drawback of GWAS meta-analysis is that it typically relies on univariate regression rather than on more informative multivariate regression [144]. Because the number of SNPs (predictors) in a GWAS vastly exceeds the number of study subjects (observations), statistical geneticists have resorted to machine learning techniques such as penalized regression [96] for model selection.

In the statistical setting of $n$ subjects and $p$ predictors with $n \ll p$, penalized regression estimates a sparse statistical model $\beta \in \mathbb{R}^p$ by minimizing a penalized loss $f(\beta) + \lambda p(\beta)$, where $f(\beta)$ is a convex loss, $p(\beta)$ is a suitable penalty, and λ is a tuning constant controlling the sparsity of β. The most popular and mature sparse regression tool is LASSO ($\ell_1$) regression. It is known that LASSO parameter estimates are biased towards zero [68], often severely so, as a consequence of shrinkage. Shrinkage in itself is not terribly harmful, but LASSO regression lets too many false positives enter a model. Since GWAS is often followed by expensive biological validation studies, there is value in reducing false positive rates. In view of the side effects of shrinkage, Zhang [147] recommends the minimax concave penalty (MCP) as an alternative to the $\ell_1$ penalty. Other non-convex penalties exist, but MCP is probably the simplest to implement. MCP also has provably stable convergence guarantees. In contrast to the LASSO, which admits false positives, MCP tends to allow too few predictors to enter a model. Thus, its false negative rate is too high. Our subsequent numerical examples illustrate these tendencies.

Surprisingly few software packages implement efficient penalized optimization algorithms for GWAS. The R packages glmnet and ncvreg are ideal candidates given their ease of use, maturity of development, and wide acceptance. The former implements LASSO-penalized regression [62, 92, 131], while the latter implements both LASSO- and MCP-penalized regression [30, 147]. Both packages provide excellent functionality for moderately sized problems. However, their scalability to GWAS is encumbered by R's poor memory management. In fact, analysis on a typical workstation is limited to at most a handful of chromosomes. Larger problems must appeal to cluster or cloud computing. Neither glmnet nor ncvreg natively support the compressed PLINK binary genotype file (BED file) format typically used to efficiently store and distribute GWAS data [119]. Memory-efficient implementations of LASSO or MCP regression for GWAS appear in the packages Mendel, gpu-lasso, SparSNP, and the beta version of PLINK 1.9 [1, 36, 38, 97, 143]. To our knowledge, only Mendel supports MCP regression with PLINK files.

As an alternative to penalized regression, one can enforce sparsity directly through projection onto sparsity sets [21, 23, 22, 24]. Iterative hard thresholding (IHT) attempts to solve the problem of minimizing the loss $f(\beta)$ subject to the sparsity constraint $\|\beta\|_0 \le m$, where the $\ell_0$ “norm” $\|\beta\|_0$ counts the number of nonzero entries of the parameter vector β. The integer $m$ serves as a tuning constant analogous to λ in LASSO and MCP regression. IHT is one member of a family of sparse regression algorithms. Similar algorithms treated in the signal processing literature include hard thresholding pursuit [7, 58, 146], matching pursuit [102, 110, 132], and subspace pursuit [42]. Some of these algorithms rely on gradient descent and thus avoid computing and inverting large Hessian matrices. The absence of second derivatives is crucial for implementations that scale to large datasets. In addition, our implementation of IHT addresses some of the specific concerns of GWAS. First, it accommodates genotype compression if genotypes are presented in the PLINK compression format. Second, our version of IHT allows the user to choose the sparsity level $m$ of a model. In contrast, LASSO and MCP penalized regression must choose the model size indirectly by adjusting the tuning constant λ to match a given $m$. Third, our implementation of IHT uses prudent memory management, exploits all available CPU cores, and interfaces with massively parallel graphics processing unit (GPU) devices. Finally, our version of IHT performs more parsimonious model selection than either LASSO or MCP penalized regression. All of these advantages can be realized on a modern desktop workstation. Although our current IHT implementation is limited to ordinary linear least squares, the literature suggests that logistic regression is ultimately within reach [7, 146].

Before moving on to the rest of the chapter, let us sketch its main contents. Section 5.2 describes penalized regression and the IHT algorithm. Here we also describe in detail the tactics necessary for parallelization. Section 5.3 records our numerical experiments. The performance of IHT and competing algorithms is evaluated by several metrics: computation time, false positive rates, false negative rates, and prediction error. The sparsity level $m$ for a given dataset is chosen by cross-validation on both real and simulated genetic data. Our discussion in Section 5.4 summarizes results, limitations, and precautions.

5.2 Methods

5.2.1 Penalized regression

Consider a statistical design matrix $X \in \mathbb{R}^{n \times p}$, a noisy $n$-dimensional response $y$, and a sparse vector β of regression coefficients. When $y$ represents a continuous phenotype, then the residual sum of squares
$$f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 \qquad (5.1)$$
is a reasonable choice for the loss $f(\beta)$. LASSO penalized regression imposes the convex $\ell_1$ penalty $p_\lambda(\beta) = \lambda\|\beta\|_1 = \lambda\sum_{i=2}^p |\beta_i|$. In most applications the intercept contribution $|\beta_1|$ is omitted from the penalty. Various approaches exist to minimize the objective $f(\beta) + \lambda\|\beta\|_1$, including least angle regression (LARS) [56], cyclic coordinate descent [60, 142, 143], and the fast iterative shrinkage and thresholding algorithm (FISTA) [12]. The $\ell_1$ norm penalty induces both sparsity and shrinkage. Shrinkage per se is not an issue because selected parameters can be re-estimated with both the non-selected parameters and the penalty removed. However, the severe shrinkage induced by the LASSO inflates false positive rates since spurious predictors enter the model to absorb the unexplained variance left by shrinkage imposed on true predictors.

Figure 5.1: A visual representation of model selection with the LASSO. The addition of the $\ell_1$ penalty encourages representation of y by a subset of the columns of X.

The MCP alternative to LASSO takes $p_{\lambda,\gamma}(\beta) = \sum_{i=2}^p q(|\beta_i|)$ with
$$q(\beta_i) = \begin{cases} \lambda\beta_i - \beta_i^2/(2\gamma) & 0 \le \beta_i \le \gamma\lambda \\ \gamma\lambda^2/2 & \beta_i > \gamma\lambda \end{cases}
\qquad
q'(\beta_i) = \begin{cases} \lambda - \beta_i/\gamma & 0 \le \beta_i < \gamma\lambda \\ 0 & \beta_i > \gamma\lambda \end{cases} \qquad (5.2)$$
for positive tuning constants λ and γ. The MCP penalty (5.2) attenuates penalization for large parameter values. Indeed, if $\beta_i \ge \gamma\lambda$, then MCP does not shrink $\beta_i$ at all. By relaxing the penalization of large entries of β, the MCP ameliorates the bias towards 0 introduced by shrinkage from the LASSO. If one majorizes the MCP function $q(\beta_i)$ by a scaled absolute value function, then cyclic coordinate descent computes MCP-penalized parameter updates that resemble the LASSO updates [78].
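For intuition, here are the scalar thresholding operators that these penalties generate for a standardized predictor; this is a hedged Julia sketch with illustrative names rather than code from our package. The firm (MCP) operator interpolates between the soft (LASSO) and hard ($\ell_0$) operators pictured in Figure 5.3 below.

```julia
# Soft thresholding: the LASSO update for a standardized predictor.
soft_threshold(b, lam) = sign(b) * max(abs(b) - lam, 0.0)

# Firm thresholding: the MCP update with tuning constants lam and gam > 1.
function firm_threshold(b, lam, gam)
    abs(b) > gam * lam && return b                      # no shrinkage for large b
    return gam / (gam - 1) * soft_threshold(b, lam)     # attenuated shrinkage otherwise
end

# Hard thresholding: keep or kill, with no shrinkage at all.
hard_threshold(b, lam) = abs(b) > lam ? b : 0.0
```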

Figure 5.2: A graphical representation of penalized (regularized) regression using norm balls. From left to right, the graphs show $\ell_2$ or Tikhonov regression, $\ell_1$ or LASSO regression, and $\ell_0$ or subset regression. The ellipses denote level curves around the unpenalized optimum β. The penalized optimum occurs at the intersection of the level curves with the norm ball. Tikhonov regularization provides some shrinkage, while the shrinkage from LASSO regularization is more dramatic. The $\ell_0$ norm enforces sparsity without shrinkage. The MCP “norm ball” cannot be easily drawn but sits between the $\ell_1$ and $\ell_0$ balls.

Figure 5.3: A view of sparse regression with thresholding operators. The order from left to right differs from Figure 5.2: the $\ell_1$ operator or soft thresholding operator, the MCP or firm thresholding operator, and the $\ell_0$ operator or hard thresholding operator. We clearly see how MCP interpolates the soft and hard thresholding operators.

Figure 5.4: A visual representation of IHT. The algorithm starts at a point $y$ and steps in the direction $-\nabla f(y)$ with magnitude µ to an intermediate point $y^+$. IHT then enforces sparsity by projecting onto the sparsity set $S_m$. The projection for $m = 2$ is the identity projection in this example, while projection onto $S_0$ merely sends $y^+$ to the origin 0. Projection onto $S_1$ preserves the larger of the two components of $y^+$.

Figures 5.2 and 5.3 offer a visual of sparsity via shrinkage. As mentioned earlier, one can obtain sparsity without shrinkage by minimizing $f(\beta)$ subject to $\|\beta\|_0 \le m$. This subset selection problem is known to be NP-hard [64, 109]. Nonetheless, good heuristic methods exist for its solution. IHT relies on the projected gradient update
$$\beta_{k+1} = \Pi_{S_m}[\beta_k - \mu\nabla f(\beta_k)], \qquad (5.3)$$
where µ denotes the step size of the algorithm, and $\Pi_{S_m}(\beta)$ denotes the projection of β onto the sparsity set $S_m$, where at most $m$ components of $\Pi_{S_m}(\beta)$ are nonzero. Projection is accomplished by setting all but the $m$ largest components of β in magnitude equal to 0. Figure 5.4 gives a schematic of a single iteration of IHT.
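A direct Julia sketch of the update (5.3) follows. The helper names are ours, and a full implementation would additionally handle the step size safeguards of Section 5.2.2 and the PLINK compression described later.

```julia
using LinearAlgebra

# Keep the m largest-magnitude entries of b; zero the rest (projection onto S_m).
function project_sparsity(b::Vector{Float64}, m::Int)
    x = zeros(length(b))
    idx = sortperm(abs.(b), rev = true)[1:m]
    x[idx] = b[idx]
    return x
end

# One IHT update (5.3) for the least squares loss f(b) = 0.5 * ||y - X*b||^2.
function iht_update(b, X, y, m, mu)
    grad = -X' * (y - X * b)
    return project_sparsity(b - mu * grad, m)
end
```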

For sufficiently small µ, the projected gradient update (5.3) is guaranteed to reduce the loss, but it forfeits stronger convergence properties because $S_m$ is nonconvex. Recent work has developed stable convergence and recovery guarantees for projected gradient updates by imposing loose additional restrictions on the local minima of $f(\beta)$. Examples of these restrictions include the restricted isometry property (RIP) [32], restricted strong convexity (RSC) [53], and restricted smoothness [2]. We urge curious readers to consult the references for mathematical details and proofs.

5.2.2 Calculating step sizes

Computing a reasonable step size µ is important for ensuring stable descent in projected gradient schemes. For the case of least squares regression, our implementation of IHT uses the “normalized” update of Blumensath and Davies [24]. At each iteration $k$, we accordingly employ the step size
$$\mu_k = \frac{\|\beta_k\|_2^2}{\|X\beta_k\|_2^2}.$$
Guaranteed convergence requires that $\mu < \omega$, where
$$\omega = (1 - c)\,\frac{\|\beta_{k+1} - \beta_k\|_2^2}{\|X(\beta_{k+1} - \beta_k)\|_2^2} \qquad (5.4)$$
for some constant $0 < c \ll 1$. One can interpret ω as the normed ratio of the difference between successive iterates versus the difference between successive estimated responses.
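In code these two quantities are one-liners. The sketch below assumes the least squares loss and uses illustrative names; the safeguard noted in the final comment (halving µ and retaking the step whenever $\mu \ge \omega$) is a common choice on our part, not a rule spelled out above.

```julia
# Normalized step size and the bound (5.4) used to safeguard an IHT step.
stepsize(beta, X) = sum(abs2, beta) / sum(abs2, X * beta)

omega_bound(betanew, betaold, X; c = 0.01) =
    (1 - c) * sum(abs2, betanew - betaold) / sum(abs2, X * (betanew - betaold))

# A common safeguard: halve mu and recompute the projected gradient step
# whenever mu >= omega_bound(betanew, betaold, X).
```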

5.2.3 Bandwidth optimizations

Analysis of large GWAS datasets requires intelligent memory management and data wrangling. Our software reads datasets in PLINK binary format, which stores each genotype in two bits. In theory, the PLINK compression protocol can attain a 32x compression by reducing 64-bit floats to 2-bit integers. PLINK compression facilitates storage and transport of data but complicates linear algebra operations. We store both a compressed $X$ and a compressed transpose $X^T$. The transpose $X^T$ is used to compute the gradient $\nabla f(\beta) = -X^T(y - X\beta)$, while $X$ is used to compute the estimated response $X\beta$. This counterintuitive tactic roughly doubles the memory required to store the genotype data. However, it facilitates accessing all data in column-major and unit stride order, thereby ensuring that all linear algebra operations maintain full memory caches.

Good statistical practice dictates standardizing all predictors; otherwise, parameters are penalized nonuniformly. Standardizing nongenetic covariates is trivial. However, one cannot store standardized genotypes in PLINK binary format. The remedy is to precompute and cache vectors $u$ and $v$ that contain the mean and precision, respectively, of each of the $p$ SNPs. We then use $u$ and $v$ to standardize genotypes on-the-fly. On-the-fly standardization is a costly operation and must be employed judiciously. For example, to calculate $X\beta$ we exploit the structural sparsity of β by only decompressing and standardizing the submatrix $X_m$ of $X$ corresponding to the $m$ nonzero values in β. We then use $X_m$ for parameter updates until we observe a change in the support of β, at which point we recompute the standardized genotypes for the new active set. Unfortunately, calculation of the gradient $\nabla f(\beta)$ offers no such optimization because it requires a fully decompressed and standardized matrix $X^T$. Since we cannot store all $n \times p$ standardized genotypes in floating point format, the best that we can achieve is standardization on-the-fly every time that we update the gradient.
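The sketch below shows on-the-fly standardization of a single SNP column. The decoding routine `decompress_column` is a hypothetical stand-in for the PLINK BED decoding step, and the cached vectors follow the $u$ (means) and $v$ (precisions) convention above.

```julia
# Decompress and standardize one SNP column on the fly, given cached
# means u and precisions v (v[j] = 1 / standard deviation of SNP j).
# `decompress_column` is a hypothetical placeholder for the BED decoding step.
function standardized_column(X_bed, j::Int, u::Vector{Float64}, v::Vector{Float64})
    g = decompress_column(X_bed, j)          # raw dosages 0, 1, 2 for SNP j
    return (g .- u[j]) .* v[j]
end
```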

5.2.4 Parallelization

Our implementation of IHT for PLINK files relies on two parallel computing schemes. First, we make heavy use of multicore computing with shared memory arrays to distribute computations over all cores in a CPU. For example, suppose that we wish to compute in parallel the column means of X stored in a shared memory array. The mean of each column is independent of the others, so the computations distribute naturally across multiple cores. If a CPU contains four available cores, then we enlist four workers for our computations, one master and three slaves. Each slave can see the entirety of X but only works on a subset of its columns. The slaves compute the column means for the three chunks of X in parallel. Columnwise operations, vector arithmetic, and matrix-vector operations fit within this paradigm.
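As a toy illustration of this pattern, here is a multithreaded column-mean computation in modern Julia. It uses lightweight threads rather than the master/slave worker processes and shared memory arrays of our actual implementation, so treat it as a schematic of the data decomposition only.

```julia
# Multithreaded column means; each thread handles a disjoint set of columns.
function column_means(X::Matrix{Float64})
    n, p = size(X)
    mu = zeros(p)
    Threads.@threads for j in 1:p
        mu[j] = sum(@view X[:, j]) / n       # no race: each j is written once
    end
    return mu
end
```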

The two most expensive operations are the products $X\beta$ and $X^T(y - X\beta)$. We previously discussed intelligent computation of $X\beta$ via $X_m\beta_m$. Dense multithreaded linear algebra libraries such as BLAS facilitate efficient computation of $X_m\beta_m$. Consequently, we obtain $X\beta$ in $O(nm)$ total operations. In contrast, the gradient $\nabla f(\beta) = -X^T(y - X\beta)$ in the update (5.3) requires a completely dense matrix-vector multiplication with a run-time complexity of $O(np)$. We could address the computational burden with cluster computing, but then communication between the different nodes would diminish performance.

A better alternative for acceleration is to calculate the gradient on a GPU. When the computations are distributed intelligently, a GPU can exploit hundreds of compute units to perform calculations in parallel. As a stream processor, a GPU can provide tremendous computational acceleration. The limiting factors are device memory, which the programmer must manage explicitly, and data transfers. In particular, an optimal GPU implementation must minimize memory transactions between the device GPU and the host CPU because the PCI slot that connects them cannot push large amounts of data at once. Our solution is to push the compressed PLINK matrix $X$ and its column means and precisions onto the device at the start of the algorithm. We also cache device buffers for the residuals and the gradient. Whenever we calculate the gradient, we compute the $n$ residuals on the host and then push the residuals onto the device. At this stage, the device executes two kernels. The first kernel initializes many workgroups of threads and distributes a block of $X^T(y - X\beta)$ to each workgroup. Each thread handles the decompression, standardization, and computation of one component of $X$ with the residuals. The second kernel reduces across all thread blocks and returns the $p$-dimensional gradient. Finally, the host pulls the $p$-dimensional gradient from the device. Thus, after the initialization of the data, our GPU implementation only requires the host and device to exchange $p + n$ floating point numbers per iteration.

5.2.5 Selecting the best model

Given a regularization path computed by the IHT, the obvious way to choose the best model along the path is to resort to $q$-fold cross-validation with mean squared error (MSE) as a selection criterion. For a path of user-supplied model sizes $m_1, m_2, \ldots, m_r$, our implementation of IHT fits the entire path on the $q - 1$ training partitions. We then view the $q$th partition as a testing set and compute its MSE. Finally, we determine the model size $m$ with minimum MSE and refit the data subject to $\|\beta\|_0 \le m$.
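A schematic of this selection loop, with the fitting routine abstracted away, might look as follows. The function `fit_iht` is a hypothetical placeholder for our solver, and in practice each fold fits the entire path at once rather than refitting per model size.

```julia
using Random, Statistics

# q-fold cross-validation of the model size; `fit_iht(X, y, m)` is a hypothetical
# stand-in for the IHT solver and is assumed to return a coefficient vector.
function cv_model_size(X, y, sizes; q = 5)
    n = length(y)
    fold = shuffle(repeat(collect(1:q), ceil(Int, n / q))[1:n])   # random fold labels
    mse = zeros(length(sizes))
    for (i, m) in enumerate(sizes), k in 1:q
        train = fold .!= k
        test = fold .== k
        b = fit_iht(X[train, :], y[train], m)
        mse[i] += mean(abs2, y[test] - X[test, :] * b) / q        # average test MSE
    end
    return sizes[argmin(mse)]
end
```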

5.3 Results

We tested our IHT implementation on a subset of the data from the Northern Finland Birth Cohort 1966 (NFBC1966) [123]. These data contain several biometric phenotypes for 5,402 patients genotyped at 370,404 SNPs. We imputed the missing genotypes in $X$ with Mendel [6] and performed quality control with PLINK 1.9 beta [36]. Our numerical experiments include both simulated and measured phenotypes. For our simulated phenotype, we benchmarked the model recovery and predictive performance of our software against glmnet and ncvreg [30, 62]. Sections 5.3.2 and 5.3.3 include as nongenetic covariates the SEXOCPG factor, which we calculated per Sabatti et al., and the first two principal components of $X$, which we calculated with PLINK 1.9. The experiments were run on a single compute node equipped with four 6-core 2.67 GHz Intel Xeon CPUs and two NVIDIA Tesla C2050 GPUs, each with 6 GB of memory. To simulate performance on a workstation, the experiment only used one GPU and one CPU. The compute environment was 64-bit Julia v0.4.0 with the corresponding OpenBLAS library and LLVM v3.3 compiler.

5.3.1 Simulation

The goal of our first numerical experiment was to demonstrate the superior model selection performance of IHT versus LASSO and MCP. Here we used only the matrix $X_{\text{chr1}}$ of 24,663 SNPs from chromosome 1 of the NFBC1966 dataset. This matrix is sufficiently small to render PLINK compression and GPU acceleration unnecessary. $X_{\text{chr1}}$ uses the 5289 cases with observed BMI. Note that this number is larger than what we will use in Section 5.3.3; no exclusion criteria were applied here since the phenotype was simulated. We standardized observed genotype dosages and then set any unobserved dosages to 0. We simulated $\beta_{\text{true}}$ for true model sizes $m_{\text{true}} \in \{100, 200, 300\}$ with effect sizes drawn from the uniform distribution $U(-0.5, 0.5)$. The simulated phenotypes were then formed as $y_{\text{true}} = X\beta_{\text{true}} + \epsilon$, with each $\epsilon_i \sim N(0, 0.01)$. To assess predictive performance, we separated 289 individuals as a validation set and used the remaining 5,000 individuals for 5-fold cross-validation. We generated 10 different models for each $m_{\text{true}}$. For each replicate, we ran regularization paths of 100 model sizes $m_0, m_0 + 2, m_0 + 4, \ldots, m_0 + 200$ straddling $m_{\text{true}}$ and chose the model with minimum MSE.

By directly choosing the model size, IHT naturally facilitates cross-validation of the model size $m_{\text{best}}$. For cross-validation with LASSO and MCP, we used the cross-validation and response prediction routines in glmnet and ncvreg. To ensure roughly comparable lengths of regularization paths and therefore commensurate compute times, we capped the maximum permissible degrees of freedom at $m_{\text{true}} + 100$ for both LASSO and MCP regression routines. The case of MCP regression is peculiar since ncvreg does not cross-validate the γ parameter. We modified the approach of Breheny and Huang [30] to obtain a suitable γ for each model. Their protocol entails cross-validating λ once with the default $\gamma = 3$ and checking if the optimal λ, which we call $\lambda_{\text{best}}$, exceeds the minimum lambda $\lambda_{\text{min}}$ guaranteeing a convex penalty. Whenever $\lambda_{\text{best}} \le \lambda_{\text{min}}$, we incremented γ by 1 and cross-validated λ again. We repeated this process until $\lambda_{\text{best}} > \lambda_{\text{min}}$. The larger final γ then became the default for the next model selection run, thereby amortizing the selection of a proper value of γ across all 10 models for a given $m_{\text{true}}$. This procedure for selecting γ ensured model selection stability while simultaneously avoiding expensive cross-validation over a full grid of γ and λ values. The reported compute times for MCP reflect this procedure.

Table 5.1: Model selection performance on NFBC1966 chromosome 1 data.

Model size   Penalty    True Pos        Total Pos       MSE               Time
                        Mean(SD)        Mean(SD)        Mean(SD)          Mean(SD)
   100       IHT        94.7 (2.9)      96.6 (3.4)      0.006 (0.0001)    135.3 (6.9)
             LASSO      96.0 (2.6)      157.8 (8.3)     0.012 (0.0002)    122.5 (29.5)
             MCP        88.3 (3.2)      88.3 (3.2)      0.012 (0.0008)    442.0 (115.4)
   200       IHT        190.3 (3.0)     191.2 (3.7)     0.006 (0.0003)    146.0 (6.3)
             LASSO      190.9 (2.3)     256.0 (6.4)     0.017 (0.0003)    64.2 (7.7)
             MCP        172.2 (5.0)     172.2 (5.0)     0.018 (0.0025)    255.1 (10.4)
   300       IHT        283.1 (5.4)     284.4 (5.7)     0.007 (0.0001)    135.8 (9.2)
             LASSO      282.8 (4.5)     347.2 (8.5)     0.027 (0.0017)    137.1 (56.9)
             MCP        252.6 (17.0)    252.6 (17.0)    0.032 (0.0226)    655.5 (271.1)

Table 5.1 shows the results of our simulation. The various columns of the table summarize true positive recovery, recovered model size, prediction error, and compute times for each model size $m_{\text{true}}$. In information theoretic terms, LASSO exhibits excellent recall but suboptimal precision. MCP demonstrates suboptimal recall but optimal precision. IHT ameliorates the worst features of both LASSO and MCP. IHT recovers more true positives than MCP and fewer false positives than LASSO. Furthermore, IHT consistently attains the best prediction error on the validation set. Despite these benefits, the IHT pays only a modest price in computational speed versus LASSO; it is fully competitive with MCP on this metric.

5.3.2 Speed comparisons

Our next numerical experiment highlights the sacrifice in computational speed incurred with compressed genotypes. The genotype matrix $X_{\text{chr1}}$ is now limited to patients with both BMI and directly observed weight, a condition imposed by Sabatti et al. [123]. The response vector $y$ is the log body mass index (BMI) from NFBC1966. As mentioned previously, we included the SEXOCPG factor and the first two principal components as nongenetic covariates. We then ran three different configurations on a single compute node. The first used the floating point version of $X_{\text{chr1}}$. We did not explicitly enable any multicore calculations. For the second run, we used a compressed copy of $X_{\text{chr1}}$ with multicore options enabled, but we disabled the GPU acceleration. The third run used the compressed $X_{\text{chr1}}$ data with both multicore and GPU options enabled. We ran each algorithm over a regularization path of model sizes $m = 1, 2, \ldots, 25$ and averaged compute times over 10 runs. For all uncompressed arrays, we used double precision arithmetic.

Table 5.2 shows the compute times. The floating point variant is clearly the fastest, requiring fewer than 10 seconds to compute all models. The analysis using PLINK compression with a multicore CPU suffers a 7.4-fold slowdown, clearly demonstrating the deleterious effects on compute times resulting from repeated decompression and on-the-fly standardization. Enabling GPU acceleration seemingly fails to rectify this lackluster performance and leads to a 7.7x slowdown. The value of the GPU will be more evident in cross-validation: a 5-fold cross-validation on one machine requires either five hexcore CPUs or one hexcore CPU and one GPU. The latter configuration lies within modern workstation capabilities. We note that our insistence on the use of double precision arithmetic dims the luster of GPU acceleration. Indeed, our experience is that using compressed arrays and a GPU with single precision arithmetic is only 1.7x slower than the corresponding floating point compute times. Furthermore, while we limit our computations to one CPU with six physical cores, including additional cores (both physical and virtual) improves compute times for compressed data without a GPU.

Table 5.2: Computational times in seconds on NFBC1966 chromosome 1 data.

Data type                      Mean Time   Standard Deviation
Uncompressed data                   9.4          0.29
Compressed data, no GPU            69.6          0.04
Compressed data with GPU           72.8          2.70

5.3.3 Application to lipid phenotypes

For our final numerical experiment, we embarked on a genome-wide search for associations based on the full $n \times p$ NFBC1966 genotype matrix $X$. In addition to BMI, this analysis considered three additional phenotypes from Sabatti et al. [123]: HDL cholesterol (HDL), LDL cholesterol (LDL), and triglycerides (TG). HDL, LDL, and TG all use SEXOCPG and the first two principal components as nongenetic covariates, but they also use BMI as a covariate. Quality control on SNPs included filters for minor allele frequencies below 0.01 and Hardy-Weinberg $P$-values below $10^{-5}$. Subjects with unobserved or missing traits were excluded from analysis. We applied further exclusion criteria per Sabatti et al. [123]; for analysis with BMI, we excluded subjects without direct weight measurements, and for analysis of TG, HDL, and BMI, we excluded subjects with nonfasting blood samples and subjects on diabetes medication. These filters yield different values of $n$ and $p$ for each trait. Table 5.3 records problem dimensions and trait transforms.

We performed 5-fold cross-validation for the best model size over a path of sparsity levels $m = 1, 2, \ldots, 50$. Refitting the best model size yielded the selected predictors and effect sizes. Table 5.3 records the compute times and best model sizes, while Table 5.4 shows the SNPs recovered by IHT.

One can immediately see that IHT does not collapse causative SNPs in strong linkage disequilibrium. IHT finds an adjacent pair of SNPs rs6917603 and rs9261256 for HDL. For TG, rs11974409 is one SNP separated from rs2286276, while SNP rs676210 is one SNP separated from rs673548. IHT does not flag rs673548, but its association with TG is known. Common sense suggests treating each associated pair as a single predictor.

Table 5.3: Dimensions of data used for each phenotype in the GWAS experiment. Here n is the number of cases, p is the number of predictors (genetic + covariates), and $m_{\text{best}}$ is the best cross-validated model size. Note that $m_{\text{best}}$ includes nongenetic covariates.

Phenotype      n         p        Transform   m_best   Compute Time (Hours)
BMI          5122     333,656     log             2            1.12
HDL          4729     333,774     none            9            1.28
LDL          4715     333,771     none            6            1.32
TG           4728     333,769     log            10            1.49

Our analysis replicates several associations from the literature but finds new ones as well. For example, Sabatti et al. [123] found associations between TG and the SNPs rs1260326 and rs10096633, while rs2286276 was identified elsewhere. The SNPs rs676210, rs7743187, rs6917603, rs2000571, and rs3010965 are new associations with TG. We find that SNP rs6917603 is associated with all four traits; the BMI connection was missed by Sabatti et al. [123]. The association of SNP rs6917603 with BMI may be secondary to its association to lipid levels. The same comment applies to SNP rs676210; it is very close to rs673548 and is strongly associated with oxidative LDL cholesterol [87]. IHT flags an association between SNP rs2304130 and TG. This association was validated in a large meta-analysis of 3,540 cases and 15,657 controls performed after [123] was published. This example suggests that IHT is better at detecting associations with small sample sizes than traditional model selection methods. Some of the effect sizes in Table 5.4 are difficult to interpret. For example, IHT estimates effects for rs10096633 ($\beta = 0.03781$) and rs1260326 ($\beta = -0.04088$) that are both smaller than and opposite in sign to the estimates in [123].

The potential new associations to TG with SNPs rs7743187 and rs3010965 are absent from the literature. Furthermore, our analysis misses borderline significant associations identified by Sabatti et al. [123], such as rs2624265 for TG and rs9891572 for HDL; these SNPs were never flagged in later studies. In this regard it is worth emphasizing that the best model size $m_{\text{best}}$ delivered by cross-validation is a guide rather than definitive truth. Figure 5.5 shows that the difference in MSE between $m_{\text{best}}$ and adjacent model sizes can be quite small. Models of a few SNPs more or less than $m_{\text{best}}$ predict trait values about as well. Thus, TG with $m = 10$ has MSE 0.2310, while TG with $m = 4$ has MSE 0.2315. Indeed, refitting the TG phenotype with $m = 4$ yields only three significant SNPs: rs1260326, rs6917603, and rs10096633. The SNPs rs7743187 and rs3010965 are absent from this model, so we should be cautious in declaring new associations. This example also highlights the great value in computing many model sizes, which univariate regression schemes typically fail to do.

Finally, we comment on compute times. IHT requires about 1.5 hours to cross-validate the best model size over 50 possible models using double precision arithmetic. Obviously, computing fewer models can decrease this compute time substantially. If the phenotype in question is scaled correctly, then analysis with IHT may be feasible with single precision arithmetic, which yields an additional speedup as suggested in Section 5.3.2. Analyses requiring better accuracy will benefit from the addition of double precision registers in newer GPU models. Thus, our analysis still retains room for speedup without sacrificing model selection performance if compute speed is a key concern.

5.4 Discussion

The current paper demonstrates the utility of iterative hard thresholding (IHT) in large-scale GWAS analysis. The IHT algorithm enjoys provable convergence guarantees despite its nonconvex nature. Its model selection performance exceeds that of more popular and mature tools such as LASSO- and MCP-penalized regression. Our software directly and intelligently handles the compression protocol widely used to store GWAS genotypes. Finally, IHT can be substantially accelerated by exploiting both shared-memory and massively parallel processing hardware. As a rule of thumb, computation times with IHT may scale as $O(np)$ or somewhat worse if more predictors with small effect sizes come into play.

Our implementation addressed a gap in current software packages for GWAS analysis. Lack of general support for PLINK binary genotype data, poor memory management, and limited parallel capabilities discourage use of software such as glmnet or ncvreg. Our IHT package enables analysis of models of any sparsity, while gpu-lasso is designed for very sparse models. It also cross-validates for the best model size over a range of possible models, while Mendel currently only cross-validates one model at a time.

Figure 5.5: Mean squared error as a function of model size, as averaged over 5 cross-validation slices, for four lipid phenotypes from NFBC 1966. (Panels show MSE versus model size for BMI, HDL, LDL, and TG.)

Table 5.4: Computational results from the GWAS experiment. Here β is the calculated effect size. Known associations include the relevant citation.

Phenotype   Chromosome   SNP           Position     β          Status
BMI              6       rs6917603     30125050    -0.01995    Unreported
HDL              6       rs6917603     30125050     0.10100    Reported [123]
                 6       rs9261256     30129920    -0.06252    Nearby
                11       rs7120118     47242866    -0.03351    Reported [123]
                15       rs1532085     56470658    -0.04963    Reported [41]
                16       rs3764261     55550825    -0.02808    Reported [123]
                16       rs7499892     55564091     0.02625    Reported [128]
LDL              1       rs646776     109620053     0.09211    Reported [123]
                 2       rs693         21085700    -0.08544    Reported [123]
                 6       rs6917603     30125050    -0.07536    Reported [83]
TG               2       rs676210      21085029     0.03633    Nearby
                 2       rs1260326     27584444    -0.04088    Reported [123]
                 6       rs7743187     25136642     0.03450    Unreported
                 6       rs6917603     30125050    -0.08215    Unreported
                 7       rs2286276     72625290     0.01858    Reported [86]
                 7       rs11974409    72627326     0.01759    Nearby
                 8       rs10096633    19875201     0.03781    Reported [123]
                13       rs3010965     60937883     0.02828    Unreported
                19       rs2304130     19650528     0.03039    Reported [136]

It is worth reminding readers that IHT is hardly a panacea for GWAS. Analysts must still deal with perennial statistical issues such as correlated predictors and sufficient sample sizes. Furthermore, while the estimation properties of hard thresholding algorithms are well understood, IHT lacks a coherent theory of inference for assessing statistical significance. In contrast, considerable progress has been made in understanding postselection inference with LASSO penalties [98, 100, 130].

As formulated here, the scope of application for IHT is limited to linear least squares regression. Researchers have begun to extend IHT to generalized linear models, particularly logistic regression [7, 146]. We anticipate that IHT will eventually overtake LASSO as the standard tool for sparse regression. In our opinion, GWAS analysis clearly stands to benefit from this advance.

CHAPTER 6

Discussion and Future Research

As evident from the results in Chapter 5, nonconvex optimization has much to offer for statistical genetics and genetic epidemiology. We use the iterative hard thresholding algorithm (IHT) to attack issues with model selection in genome-wide association studies. IHT provides superior model recovery performance compared to current feature selection methods. As a projected gradient algorithm, IHT is similar in spirit to the proximal distance algorithm developed in Chapters 3 and 4. The proximal distance algorithm proves its mettle in both convex and nonconvex settings. For large-scale optimization problems, the proximal distance algorithm maintains good accuracy while requiring reasonable compute times. However, many details remain unresolved.

6.1 Parameter tuning for proximal distance algorithms

The results for the proximal distance algorithms relied on update schedules for the tuning parameter ρ (in Chapters 3 and 4) and ε (in Chapter 3). The parameter updates were chosen heuristically. Heuristic choices are suitable in practice, but as a matter of principle a more rigorous or general strategy is desirable. Unfortunately, exact penalty methods lack a coherent theory behind parameter updates. This field is ripe for theoretical insight. Absent any breakthroughs in the near future, we must rely on what works in practice. Selection of ρ for the proximal distance algorithm of Chapter 4 has an intuitive mathematical interpretation and a practical computational interpretation. Mathematically, the proximal distance algorithm must balance the competing objectives of minimizing the loss function $f$ and satisfying the constraints given in $C$. Therefore, starting ρ small encourages minimization of $f$, which is initially preferable to remaining on the feasible set. From a computational standpoint, ρ controls the tradeoff between speed and accuracy. Gentle update schedules prod the algorithm to move towards the constrained minimum in $C$, while faster updates lose accuracy. The rate at which the user increases ρ thus dictates the degree to which the user accepts inaccuracy for the sake of speed. We present update schedules that ensure accuracy of roughly four to six digits. However, in many high-dimensional settings, an approximate solution usually suffices, and the proximal distance algorithm can give an approximate solution in a short time.

6.2 IHT with nonlinear loss functions

IHT is demonstrably better than LASSO or MCP for model selection with the residual sum of squares loss function. However, unlike IHT, both LASSO and MCP can penalize generalized linear models such as logistic regression and Poisson regression. As suggested in Section 5.4, nascent developments [7, 146] aim to generalize IHT for wider use. For exposition purposes, let us sketch theoretical developments for IHT in logistic regression. Given a statistical design matrix

$X \in \mathbb{R}^{n \times p}$, a statistical model $\beta \in \mathbb{R}^p$, and a binary response variable $y \in \{0, 1\}^n$, logistic regression is the task of minimizing the negative logistic loglikelihood function
$$f(\beta \mid X, y) = \sum_{i=1}^n \log\left(1 + e^{x_i^T\beta}\right) - y^TX\beta. \qquad (6.1)$$
However, minimization of equation (6.1) can run aground in the setting $n \ll p$ pertinent to GWAS. In particular, (6.1) may have multiple minimizers. Furthermore, the data in $X$ can exhibit linear separability in which sending certain values of β to infinity can produce arbitrarily small values of $f$. Standard statistical practice is to regularize (6.1) with a squared $\ell_2$ penalty to produce a modified loss
$$f_\lambda(\beta \mid X, y) = \sum_{i=1}^n \log\left(1 + e^{x_i^T\beta}\right) - y^TX\beta + \frac{\lambda}{2}\|\beta\|_2^2 \qquad (6.2)$$
with regularization parameter $\lambda > 0$. Observe that $f_\lambda$ is λ-strongly convex, so in some sense $f_\lambda$ is relatively easy to optimize. We then impose the sparsity constraint with the $\ell_0$ penalty. Unlike an iteration of normal IHT, which optimizes the loss over a support $C$ of the $m$ largest components of $\beta_k$ in magnitude, an iteration of gradient descent with nonlinear IHT must apply the hard thresholding operator against a slightly larger support $S$. The enlarged support $S$ represents the union of $C$ with the largest $2m$ components in magnitude of $\nabla f_\lambda(\beta_k)$, for a total of up to $3m$ components. The added components from $\nabla f_\lambda(\beta_k)$ represent additional search directions to use in minimizing $f_\lambda$. The loss $f_\lambda$ is then minimized over $S$ to produce an intermediate optimum $\beta_k^\star$ with up to $3m$ nonzeroes. Finally, the vector $\beta_k^\star$ is thresholded to its $m$ largest components in magnitude.
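A small Julia sketch of the regularized loss (6.2), its gradient, and the support expansion step follows. The function names are ours, the formulas follow directly from (6.2), and numerical safeguards (such as avoiding overflow in the exponentials) are omitted.

```julia
using LinearAlgebra

# Ridge-regularized negative logistic loglikelihood (6.2) and its gradient.
f_lambda(beta, X, y, lam) =
    sum(log1p.(exp.(X * beta))) - dot(y, X * beta) + lam / 2 * sum(abs2, beta)

grad_f_lambda(beta, X, y, lam) =
    X' * (1 ./ (1 .+ exp.(-(X * beta))) .- y) .+ lam .* beta

# Enlarged support for nonlinear IHT: the current support of beta (at most m entries)
# united with the 2m largest-magnitude components of the gradient.
function expanded_support(beta, grad, m)
    C = findall(!iszero, beta)
    top = sortperm(abs.(grad), rev = true)[1:2m]
    return union(C, top)
end
```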

In theory this nonlinear IHT is guaranteed to converge under moderate restrictions on the Hessian matrix $\nabla^2 f$. However, the method remains largely untested. The literature [7, 146] makes no mention of unresolved computational headaches such as divergence of parameter values and cycling between different minimizers. Furthermore, guidance on the selection of a suitable value of λ is thin. Finally, unlike the simple cross-validation scheme for normal IHT, cross-validation for nonlinear IHT may require a grid search on λ.

6.3 Other greedy algorithms for linear regression

The discussion of projected gradient algorithms appeals to many established convergence results for gradient descent kernels. However, gradient descent algorithms themselves are not the only solution to model selection. We have developed a novel but currently unpublished greedy selection tool, called the exchange algorithm, that can potentially accomplish sparse linear regression better than IHT with an optimization kernel reminiscent of coordinate descent.

As a prelude to our analysis, let us review least squares for the simple case of a single predictor. For a data vector $x$ of $n$ cases, a corresponding response vector $y$, and an effect size β, the minimum of the residual sum of squares $\frac{1}{2}\sum_{i=1}^n (y_i - x_i\beta)^2$ is attained at $\hat{\beta} = \left(\sum_{i=1}^n y_ix_i\right)/\|x\|_2^2$. The minimum value then becomes
$$\frac{1}{2}\sum_{i=1}^n (y_i - x_i\hat{\beta})^2 = \frac{1}{2}\sum_{i=1}^n y_i^2 - \frac{1}{2\|x\|_2^2}\left(\sum_{i=1}^n y_ix_i\right)^2.$$

With p predictors encoded in a design matrix X = (x_{ij}), the residual sum of squares criterion becomes \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2. Let S be the current set of active predictors. By definition the number of active predictors is fixed at |S| = r. The best predictor ℓ ∉ S to swap with k ∈ S maximizes \left(\sum_{i=1}^{n} z_i x_{iℓ}\right)^2 / \|x_ℓ\|_2^2, where z_i = y_i - \sum_{j \in S\setminus\{k\}} x_{ij}\beta_j is the residual for observation i omitting the effect of predictor k. Often predictor k cannot be improved by any swap. In this situation, the natural response is to update β_k so that it is optimal given the remaining active coefficients β_j. The exchange algorithm cycles through the active predictors multiple times until the residual sum of squares stabilizes. By construction the residual sum of squares decreases at every iteration.
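The following Python sketch of one sweep makes the swapping rule concrete. It ignores the caching discussed next, and the function name and set-based bookkeeping are our own illustrative choices rather than the dissertation's implementation.

import numpy as np

def exchange_sweep(X, y, active, beta):
    # One pass over the active set: try to swap each active predictor k for the
    # best candidate predictor l; if no candidate beats k itself, the update
    # reduces to an ordinary coordinate descent step on beta_k.
    for k in list(active):
        others = [j for j in active if j != k]
        z = y - X[:, others] @ beta[others]                 # residual omitting predictor k
        scores = (X.T @ z) ** 2 / np.sum(X ** 2, axis=0)    # (z' x_l)^2 / ||x_l||^2 for all l
        scores[others] = -np.inf                            # already-active predictors are ineligible
        l = int(np.argmax(scores))
        beta[k] = 0.0
        beta[l] = (z @ X[:, l]) / (X[:, l] @ X[:, l])       # optimal coefficient for the winner
        active.remove(k)
        active.add(l)                                       # l == k leaves the active set unchanged
    return active, beta

Because k itself is always among the candidates, no swap can increase the residual sum of squares, which is why the sweep is monotone.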

Striking reductions in computation times are possible if certain entries of the p × p matrix X^T X are computed once and cached. Obviously, storing the entire matrix is impractical for large p. It is also advantageous to compute and store the squared Euclidean norm \|x_i\|_2^2 of each column x_i of X and the inner product r^T x_i of each column against the residual vector r. In replacing predictor k by predictor ℓ, the residual vector r changes to r + β_k x_k - β_ℓ x_ℓ. The inner product r^T x_i changes to r^T x_i + β_k x_k^T x_i - β_ℓ x_ℓ^T x_i. If the inner products x_ℓ^T x_i for the optimal ℓ are unavailable, then at this junction we compute and store them. Thus, as each inactive predictor becomes active, we compute and store a new column of X^T X. The test inner product \left(y - \sum_{j \in S\setminus\{k\}} \beta_j x_j\right)^T x_ℓ is computed as r^T x_ℓ + β_k x_k^T x_ℓ.
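A minimal sketch of this caching scheme appears below. The class name and the lazy dictionary storage of columns of X^T X are assumptions of ours, meant only to illustrate the update r^T x_i ← r^T x_i + β_k x_k^T x_i - β_ℓ x_ℓ^T x_i.

import numpy as np

class InnerProductCache:
    # Lazily caches columns of X'X, the squared column norms of X,
    # and the running inner products X'r with the residual r.
    def __init__(self, X, r):
        self.X = X
        self.xtx_cols = {}                          # j -> X' x_j, filled on demand
        self.sq_norms = np.sum(X ** 2, axis=0)      # ||x_j||^2 for every column
        self.xtr = X.T @ r                          # X' r for the current residual

    def col(self, j):
        # Return (and cache) the j-th column of X'X.
        if j not in self.xtx_cols:
            self.xtx_cols[j] = self.X.T @ self.X[:, j]
        return self.xtx_cols[j]

    def swap(self, k, l, beta_k, beta_l):
        # Update X'r after replacing predictor k (coefficient beta_k) with
        # predictor l (coefficient beta_l): r <- r + beta_k x_k - beta_l x_l.
        self.xtr += beta_k * self.col(k) - beta_l * self.col(l)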

In cross-validation one needs to compute an entire segment of the solution path as the model size r increases. The simplest tactic is to start with r = 1 and work upward, say by incrementing r by 1 at each new stage. The current parameter estimates can then serve as warm starts for the next value of r. The computational cost of model selection and parameter estimation increases with r, so starting with r = 1 is hardly a barrier to good model selection. If bad predictors enter a model, then they usually exit at later stages. Incrementing r more aggressively does not overcome the major computational burden of computing and storing the relevant inner products. An aggressive scheme also leads to more iterations per stage as the minimum loss is sought. When r is incremented by 1, the new best predictor is often identified immediately, and predictor swapping then functions primarily as coordinate descent. Thus, the combination of predictor swapping and coordinate descent strikes a good compromise between global accuracy and low computational cost.
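Under the assumption of a solver routine exchange_fit (hypothetical here) that runs the exchange algorithm to convergence at a fixed sparsity level, the warm-started path computation might look like the following sketch.

import numpy as np

def exchange_path(X, y, r_max, exchange_fit):
    # Compute estimates for model sizes r = 1, ..., r_max with warm starts.
    # exchange_fit(X, y, r, beta0) is assumed to run the exchange algorithm
    # at sparsity level r, starting from the vector beta0.
    p = X.shape[1]
    beta = np.zeros(p)
    path = []
    for r in range(1, r_max + 1):
        beta = exchange_fit(X, y, r, beta)   # warm start from the previous stage
        path.append(beta.copy())
    return path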

Table 6.1 gives a glimpse of the computational potential of the novel exchange algorithm. The IHT results from Table 5.1 are reprinted for comparison. The exchange algorithm consistently maintains a slight edge over IHT in recovery behavior and predictive accuracy. For smaller models, the exchange algorithm is clearly faster. The speed advantage attenuates as the dimension of the true model increases. Unlike IHT, the exchange algorithm is not clearly extensible to nonlinear models. Nonetheless, these results show promise for the particular case of sparse linear regression.

Model size   Algorithm   True Pos       Total Pos      MSE              Compute Time
                         Mean (SD)      Mean (SD)      Mean (SD)        Mean (SD)
100          IHT         93.9 (3.2)     94.8 (4.0)     0.006 (0.0002)   136.6 (31.2)
             Exchange    94.8 (3.19)    96.1 (3.57)    0.001 (0.0000)   99.9 (29.95)
200          IHT         188.9 (2.9)    191.6 (3.2)    0.006 (0.0001)   185.6 (50.6)
             Exchange    190.6 (2.46)   191.3 (2.67)   0.001 (0.0004)   159.0 (5.11)
300          IHT         278.3 (4.1)    285.4 (5.0)    0.007 (0.0002)   220.8 (45.9)
             Exchange    284.0 (4.94)   284.2 (5.05)   0.001 (0.0000)   241.2 (7.08)

Table 6.1: Model selection performance of IHT and the exchange algorithm on NFBC1966 chromosome 1 data.

CHAPTER 7

Notation

The following notation is used throughout the document.

7.1 Sets

• C: a set.

• R: the real number line.

• R^n: the set of n-vectors with real components.

• R^{m×n}: the set of m × n matrices with real components.

• R_+: the set of nonnegative real numbers.

• R_{++}: the set of positive real numbers.

• S^n: the set of symmetric n × n matrices.

• S^n_+: the set of symmetric positive semidefinite n × n matrices.

• S^n_{++}: the set of symmetric positive definite n × n matrices.

7.2 Vectors and Matrices

• x: a column vector with real components.

• x^T: a row vector with real components.

• x_i: the ith component of a vector x.

• 1: a vector of all 1s.

• 0: a vector of all 0s.

• X: a matrix with real components.

• x_j: the jth column of a matrix X.

• x_{ij}: the (i, j)th component of a matrix X.

• X^T: the transpose of a matrix X.

• vec(X): a vector composed of the columns of a matrix X stacked in numerical order.

• I_n: the n × n identity matrix.

• tr X: the trace of a matrix X.

• X^{-1}: the inverse of a square invertible matrix X.

• X^†: the Moore-Penrose pseudoinverse of a matrix X.

• diag x: a diagonal matrix with diagonal components x_1, x_2, ..., x_n.

• diag X: the diagonal of a matrix X, a vector with components x_{11}, x_{22}, ..., x_{nn}.

• ⟨x, y⟩: the inner product of two vectors x and y, denoted x^T y in real Euclidean spaces.

• span{x_1, x_2, ..., x_n}: the linear span of a set of vectors {x_1, x_2, ..., x_n}.

• λ(X): a vector containing the eigenvalues of a matrix X ∈ R^{n×n}.

• λ_min(X), λ_max(X): the minimum (maximum) eigenvalue of X ∈ R^{n×n}.

• σ(X): a vector containing the singular values of a matrix X ∈ R^{m×n}.

• σ_min(X), σ_max(X): the minimum (maximum) singular value of X ∈ R^{m×n}.

7.3 Norms and Distances

• ‖·‖: a general norm.

• ‖x‖_2: the vector ℓ_2 or Euclidean norm.

• ‖x‖_1: the vector ℓ_1 or taxicab norm.

• ‖x‖_∞: the vector ℓ_∞ or Chebyshev norm.

• ‖X‖_F: the matrix Frobenius norm, equivalent to ‖vec(X)‖_2.

• ‖X‖_1: the matrix ℓ_1 norm, equivalent to ‖vec(X)‖_1.

• ‖X‖_2: the spectral matrix norm, equivalent to σ_max(X).

• dist(x, C): the Euclidean distance between a point x and a set C.

• dist(x, y): the Euclidean distance between two points x and y.

7.4 Functions and Calculus

• f(x): a real-valued function f : R → R of a scalar variable x.

• f(x): a real-valued function f : R^n → R of a vector variable x.

• f(X): a real-valued function f : R^{m×n} → R of a matrix variable X.

• dom f: the domain of a function f.

• f′(x): the first derivative of a univariate function f at x.

• f″(x), f‴(x): the second and third derivatives of a univariate function f at x.

• ∇f(x): the gradient of a multivariate function f at x.

• ∇²f(x): the Hessian matrix of a multivariate function f at x.

• f ◦ g(x): the function composition f(g(x)) at x.

7.5 Projections and Proximal Operators

• Π_C: the projection operator onto a set C.

• prox_{tf}: the proximal operator of a function f with scale parameter t.

• δ_C: the 0/∞ indicator function for a set C.

7.6 Computation

• O(n): big-O notation for computational complexity.

• x_k: the kth iterate of a scalar algorithm variable x.

• x_k: the kth iterate of a vector algorithm variable x.

• X_k: the kth iterate of a matrix algorithm variable X.

• x_{k,i}: the ith component of a vector iterate x_k.

• x_{k,ij}: the (i, j)th component of a matrix iterate X_k.

REFERENCES

[1] Gad Abraham, Adam Kowalczyk, Justin Zobel, and Michael Inouye. SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction. BMC Bioinformatics, 13(1):88, 2012.

[2] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.

[3] Erik Agrell, Thomas Eriksson, Alexander Vardy, and Kenneth Zeger. Closest point search in lattices. IEEE Transactions on Information Theory, 48(8):2201–2214, 2002.

[4] Farid Alizadeh and Donald Goldfarb. Second-order cone programming. Mathematical Programming, 95:3–51, 2003.

[5] Larry Armijo. Minimization of functions having Lipschitz continuous first partial deriva- tives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

[6] Kristin Ayers and Kenneth Lange. Penalized estimation of haplotype frequencies. Bioinfor- matics, 24:1596–1602, 2008.

[7] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(3):807–841, 2013.

[8] Leonard E. Baum. An inequality and associated maximization technique in statistical esti- mation for probabilistic functions of markov processes. Inequalities, 3:1–8, 1972.

[9] Heinz H Bauschke. Projection algorithms and monotone operators. PhD thesis, Theses (Dept. of Mathematics and Statistics)/Simon Fraser University, 1996.

[10] Heinz H Bauschke, Jonathan M Borwein, and Wu Li. Strong conical hull intersection prop- erty, bounded linear regularity, jamesons property (g), and error bounds in convex optimiza- tion. Mathematical Programming, 86(1):135–160, 1999.

[11] Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

[12] Amir Beck and Marc Teboulle. A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Sciences, 2(1):183–202, 2009.

[13] Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recov- ery. Convex Optimization in Signal Processing and Communications, pages 42–88, 2009.

[14] Edward J Beltrami. An Algorithmic Approach to Nonlinear Analysis and Optimization. Academic Press, 1970.

[15] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. SIAM, 2001.

[16] Abraham Berman and Robert J Plemmons. Nonnegative Matrices in the Mathematical Sciences. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 1994.

[17] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse prin- cipal component detection. In Conference on Learning Theory, pages 1046–1066, 2013.

[18] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.

[19] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[20] Dimitri P Bertsekas, Angelia Nedić, and Asuman Özdağlar. Convex analysis and optimization. Athena Scientific, Belmont, MA, 2003.

[21] Thomas Blumensath. Accelerated iterative hard thresholding. Signal Processing, 2(1):183– 202, 2012.

[22] Thomas Blumensath and Michael E. Davies. Iterative hard thresholding for sparse approxi- mation. Journal of Fourier Analysis and Applications, 14:629–654, 2008.

[23] Thomas Blumensath and Michael E. Davies. Iterative hard thresholding for compressed sensing. Applications of Computational and Harmonic Analysis, 27:265–274, 2009.

[24] Thomas Blumensath and Michael E. Davies. Normalized iterative hard thresholding: Guar- anteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2):298–309, 2010.

[25] Ingwer Borg and Patrick J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, 2007.

[26] Jonathan M Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media, 2010.

[27] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2009.

[28] James P Boyle and Richard L Dykstra. A method for finding projections onto the intersec- tion of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, pages 28–47. Springer, 1986.

[29] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[30] Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1):232–253, 2011.

[31] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20:1956–1982, 2010.

[32] Emmanuel Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[33] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[34] Emmanuel J Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2010.

[35] Augustin Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus hebdomadaires des séances de l'Académie des Sciences, 25:536–538, 1847.

[36] Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(7), 2015.

[37] Caihua Chen, Bingsheng He, and Xiaoming Yuan. Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis, 32:227–245, 2012.

[38] Gary K. Chen. A scalable and portable framework for massively parallel variable selection in genetic association studies. Bioinformatics, 28:719–720, 2012.

[39] Eric C Chi, Hua Zhou, Gary K Chen, Diego O Del Vecchyo, and Kenneth Lange. Genotype imputation via matrix completion. Genome Research, 23:509–518, March 2013.

[40] Francis H. Clarke. Optimization and Nonsmooth Analysis. Wiley, New York, 1983.

[41] Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45:1274–1283, 2013.

[42] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.

[43] George B Dantzig. Linear Programming. In Proceedings of Symposium on Modern Calcu- lating Machinery and Numerical Methods, UCLA, July 1948. Applied Mathematics, Series 15, National Bureau of Standards, June 1951, pp. 18–21.

[44] Alexandre D'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert R G Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[45] Charles Jean de la Vallee´ Poussin. Sur la methode de l’approximation minimum. Annales de la Societe de Bruxelles, 35(2):1–16, 1910.

[46] Jan de Leeuw. Applications of convex analysis to multidimensional scaling. In Jean René Barra, F. Brodeau, G. Romier, and Bernard van Cutsem, editors, Recent Developments in Statistics, pages 133–146. North Holland Publishing Company, 1 edition, 1977.

[47] Jan de Leeuw. Multivariate analysis with optimal scaling. In Somesh Das Gupta and Jayanta K. Ghosh, editors, Proceedings of the International Conference on Advances in Multivariate Statistical Analysis, pages 127–160, 1988.

[48] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[49] Vladimir F. Demyanov. Nonsmooth optimization. In Gianni Di Pillo and Fabio Schoen, editors, Nonlinear Optimization. Springer, New York, NY, 2010.

[50] Vladimir F. Demyanov, Gianni Di Pillo, and Francisco Facchinei. Exact penalization via Dini and Hadamard conditional derivatives. Optimization Methods and Software, 9(1-3):19–36, 1998.

[51] Frank R Deutsch. Best approximation in inner product spaces. Springer Science & Business Media, 2012.

[52] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.

[53] Annette J. Dobson and Adrian G. Barnett. An Introduction to Generalized Linear Models, volume 3. Chapman and Hall/CRC Press, 2008.

[54] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. arXiv:1508.01982 [math.OC], 2015.

[55] Richard L Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.

[56] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[57] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35:652–655, 1949.

[58] Simon Foucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.

[59] Jean Baptiste Joseph Fourier. Solution d'une question particulière du calcul des inégalités. In Oeuvres de Fourier, pages 317–319. 1826. Reprinted by Tome II Olms, Hildesheim, 1970.

[60] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

[61] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, July 2008.

[62] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for gener- alized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[63] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2015.

[64] Gene Golub, Virginia Klema, and Gilbert W Stewart. Rank degeneracy and least squares problems. Technical report, Stanford University Department of Computer Science, 1976.

[65] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 4 edition, 2012.

[66] Marshall Hall and Morris Newman. Copositive and completely positive quadratic forms. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 59, pages 329–339. Cambridge Univ Press, 1963.

[67] Herman O. Hartley. Maximum likelihood estimation from incomplete data. Biometrics, 14:174–194, 1958.

[68] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learn- ing. Springer, 2 edition, 2009.

[69] Willem J Heiser. Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent advances in descriptive multivariate analysis, pages 157–189, 1995.

[70] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. NBS, 1952.

[71] Nicholas J Higham. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.

[72] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

[73] Jean-Baptiste Hiriart-Urruty and Alberto Seeger. A variational approach to copositive ma- trices. SIAM Review, 52:593–629, 2010.

[74] Alan J Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4), 1952.

[75] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.

[76] David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. American Statistician, 58:30–37, 2004.

103 [77] Vaithilingam Jeyakumar and Dinh The Luc. Approximate jacobian matrices for nons- mooth continuous maps and c1-optimization. SIAM Journal on Control and Optimization, 36(5):1815–1832, 1998.

[78] Dingfeng Jiang and Jian Huang. Majorization-Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models. Technical Report 412, Department of Statistics and Actuarial Science, The University of Iowa, October 2011.

[79] Charles R Johnson and Robert Reams. Constructing copositive matrices from interior ma- trices. Electronic Journal of Linear Algebra, 17:9–20, 2008.

[80] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal com- ponents analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.

[81] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11:517–553, 2010.

[82] Leonid V. Kantorovich. Mathematical methods of organizing and planning production. Management Science, 6(4):366–422, 1960. Translation of original 1939 manuscript from Publication House of the Leningrad State University, 68 pages.

[83] Johannes Kettunen, Taru Tukiainen, Antti-Pekka Sarin, Alfredo Ortega-Alonso, Emmi Tikkanen, Leo-Pekka Lyytikäinen, Antti J Kangas, Pasi Soininen, Peter Würtz, Kaisa Silander, Danielle M Dick, Richard J Rose, Markku J Savolainen, Jorma Viikari, Mika Kähönen, Terho Lehtimäki, Kirsi H Pietiläinen, Michael Inouye, Mark I McCarthy, Antti Jula, Johan Eriksson, Olli T Raitakari, Veikko Salomaa, Jaakko Kaprio, Marjo-Riitta Järvelin, Leena Peltonen, Markus Perola, Nelson B Freimer, Mika Ala-Korpela, Aarno Palotie, and Samuli Ripatti. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature Genetics, 44:269–276, 2012.

[84] Henk AL Kiers. Majorization as a tool for optimizing a class of matrix functions. Psy- chometrika, 55:417–428, 1990.

[85] Henk AL Kiers and Jos MF ten Berge. Minimization of a class of matrix trace functions by means of refined majorization. Psychometrika, 57:371–382, 1992.

[86] Young Jin Kim, Min Jin Go, Cheng Hu, Chang Bum Hong, Yun Kyoung Kim, Ji Young Lee, Joo-Yeon Hwang, Ji Hee Oh, Dong-Joon Kim, Nam Hee Kim, Soeui Kim, Eun Jung Hong, Ji-Hyun Kim, Haesook Min, Yeonjung Kim, Rong Zhang, Weiping Jia, Yukinori Okada, Atsushi Takahashi, Michiaki Kubo, Toshihiro Tanaka, Naoyuki Kamatani, Koichi Matsuda, MAGIC Consortium, Taesung Park, Bermseok Oh, Kuchan Kimm, Daehee Kang, Chol Shin, Nam H Cho, Hyung-Lae Kim, Bok-Ghee Han, Jong-Young Lee, and Yoon Shin Cho. Large-scale genome-wide association studies in east asians identify new genetic loci influencing metabolic traits. Nature Genetics, 43(10):990–995, 2011.

[87] Mäkelä KM, Seppälä I, Hernesniemi JA, Lyytikäinen LP, Oksala N, Kleber ME, Scharnagl H, Grammer TB, Baumert J, Thorand B, Jula A, Hutri-Kähönen N, Juonala M, Laitinen T, Laaksonen R, Karhunen PJ, Nikus KC, Nieminen T, Laurikka J, Kuukasjärvi P, Tarkka M, Viik J, Klopp N, Illig T, Kettunen J, Ahotupa M, Viikari JS, Kähönen M, Raitakari OT, Karakas M, Koenig W, Boehm BO, Winkelmann BR, März W, and Lehtimäki T. Genome-wide association study pinpoints a new functional apolipoprotein B variant influencing oxidized low-density lipoprotein levels but not cardiovascular events: Atheroremo consortium. Cardiovascular Genetics, 6:73–81, 2013.

[88] Mark Aleksandrovich Krasnosel'skii. Two remarks on the method of successive approximations. Uspekhi Matematicheskikh Nauk, 10(1):123–127, 1955.

[89] Aleksei N Krylov. On the numerical solution of the equation by which the frequency of small oscillations is determined in technical problems. News of the Academy of Sciences of the USSR, 4:491–539, 1931.

[90] Kenneth Lange, Jeanette C Papp, Janet S Sinsheimer, and Eric M Sobel. Next generation statistical genetics: Modeling, penalization, and optimization in high-dimensional data, 2013.

[91] Kenneth Lange. An adaptive barrier method for convex programming. Methods and Applications of Analysis, 1(4):392–402, 1994.

[92] Kenneth Lange. Numerical Analysis for Statisticians. Springer Science & Business Media, 2010.

[93] Kenneth Lange. Optimization. Springer, New York, 2nd edition, 2010.

[94] Kenneth Lange. MM Optimization Algorithms. SIAM, 2016.

[95] Kenneth Lange, David Hunter, and Ilsoon Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9:1–59, 2000.

[96] Kenneth Lange, Jeanette C. Papp, Janet S. Sinsheimer, and Eric M. Sobel. Next generation statistical genetics: Modeling, penalization, and optimization in high-dimensional data. Annual Review of Statistics and Its Application, 1(1):279–300, 2014.

[97] Kenneth Lange, Jeanette C. Papp, Janet S. Sinsheimer, Ram Sripracha, Hua Zhou, and Eric M. Sobel. Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570, 2013.

[98] Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference with the lasso. arXiv preprint arXiv:1311.6238, 2013.

[99] Miguel Sousa Lobo, Lieven Vandenberghe, Stephen Boyd, and Hervé Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193–228, 1998.

[100] Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani. A significance test for the lasso. The Annals of Statistics, 42(2):413–468, 2014.

[101] Miles Lubin and Iain Dunning. Computing in operations research using Julia. INFORMS Journal on Computing, 27(2):238–248, 2015.

[102] Stephane´ Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. SIAM Journal on Computing, 24(2):3397–3415, 1993.

[103] W Robert Mann. Mean value methods in iteration. Proceedings of the American Mathe- matical Society, 4(3):506–510, 1953.

[104] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algo- rithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.

[105] Anderson G. McKendrick. Applications of mathematics to medical problems. Proceedings of the Edinburgh Mathematical Society, 44:1–34, 1926.

[106] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley, Hoboken, NJ, 2 edition, 2008.

[107] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.

[108] Katta G Murty and Feng-Tien Yu. Linear Complementarity, Linear and Nonlinear Pro- gramming. Heldermann Verlag, West Berlin, 1988.

[109] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[110] Deanna Needell and Joel A. Tropp. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[111] Yurii Nesterov, Arkadii Nemirovskii, and Yinyu Ye. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.

[112] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[113] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journel of Optimization Theory and Applications, pages 1–27, 2016.

[114] James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York and London, 1970.

[115] Christopher C Paige and Michael A Saunders. Algorithm 583: LSQR: Sparse linear equa- tions and least squares problems. ACM Transactions on Mathematical Software (TOMS), 8(2):195–209, 1982.

106 [116] Christopher C Paige and Michael A Saunders. LSQR: An algorithm for sparse linear equa- tions and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43–71, 1982.

[117] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimiza- tion, 1(3):123–231, 2013.

[118] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.

[119] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julia Maller, Pamela Sklar, Paul I. W. de Bakker, Mark J. Daly, and Pak C. Sham. PLINK: A tool set for whole-genome association and population-based linkage anal- yses. American Journal of Human Genetics, 81(3):559–575, 2007.

[120] Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, 2015.

[121] Andrzej P Ruszczynski.´ Nonlinear optimization, volume 13. Princeton University Press, 2006.

[122] Yousef Saad. Iterative methods for sparse linear systems. Siam, 2003.

[123] Chiara Sabatti, Susan K Service, Anna-Liisa Hartikainen, Anneli Pouta, Samuli Ripatti, Jae Brodsky, Chris G Jones, Noah A Zaitlen, Teppo Varilo, Marika Kaakinen, et al. Genome- wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics, 41(1):35–46, 2009.

[124] Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015–1034, 2008.

[125] Cedric A. B. Smith. Counting methods in genetical statistics. Annals of Human Genetics, 21:254–276, 1957.

[126] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[127] Rolf Sundberg. An iterative method for solution of the likelihood equations for incomplete data from exponential families. Communications in Statistics, Series B, 5:55–64, 1976.

[128] Ida Surakka, John B. Whitfield, Markus Perola, Peter M. Visscher, Grant W. Montgomery, Mario Falchi, Gonneke Willemsen, Eco J. C. de Geus, Patrik K. E. Magnusson, Kaare Christensen, Thorkild I. A. Sørensen, Kirsi H. Pietiläinen, Taina Rantanen, Kaisa Silander, Elisabeth Widén, Juha Muilu, Iffat Rahman, Ulrika Liljedahl, Ann-Christine Syvänen, Aarno Palotie, Jaakko Kaprio, Kirsten O. Kyvik, Nancy L. Pedersen, Dorret I. Boomsma, Tim Spector, Nicholas G. Martin, Samuli Ripatti, and Leena Peltonen. A genome-wide association study of monozygotic twin-pairs suggests a locus related to variability of serum high-density lipoprotein cholesterol. Twin Research and Human Genetics, 15:691–699, 2012.

[129] Torbjørn Taskjelle. Alignment of TikZ pictures in subfigures, 2016. Figure from answer to TeX StackExchange question 302589.

[130] Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Pro- ceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

[131] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[132] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via or- thogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[133] John von Neumann. A model of general economic equilibrium. Review of Economic Stud- ies, 13(1):1–9, 1946.

[134] Andreas Wächter and Lorenz T Biegler. Line search filter methods for nonlinear programming: Motivation and global convergence. SIAM Journal on Optimization, 16(1):1–31, 2005.

[135] Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.

[136] Dawn M. Waterworth, Sally L. Ricketts, Kijoung Song, Li Chen, Jing Hua Zhao, Samuli Ripatti, Yurii S. Aulchenko, Weihua Zhang, Xin Yuan, Noha Lim, Jian’an Luan, Sofie Ashford, Eleanor Wheeler, Elizabeth H. Young, David Hadley, John R. Thompson, Pe- ter S. Braund, Toby Johnson, Maksim Struchalin, Ida Surakka, Robert Luben, Kay-Tee Khaw, Sheila A. Rodwell, Ruth J.F. Loos, S. Matthijs Boekholdt, Michael Inouye, Panagi- otis Deloukas, Paul Elliott, David Schlessinger, Serena Sanna, Angelo Scuteri, Anne Jack- son, Karen L. Mohlke, Jaako Tuomilehto, Robert Roberts, Alexandre Stewart, Y. Antero Kesaniemi,¨ Robert W. Mahley, Scott M. Grundy, Wellcome Trust Case Control Consor- tium, Wendy McArdle, Lon Cardon, Gerard´ Wæber, Peter Vollenweider, John C. Cham- bers, Michael Boehnke, Gonc¸alo R. Abecasis, Veikko Salomaa, Marjo-Riitta Jarvelin,¨ Aimo Ruokonen, Inesˆ Barroso, Stephen E. Epstein, Hakon H. Hakonarson, Daniel J. Rader, Muredach P. Reilly, Jacqueline C.M. Witteman, Alistair S. Hall, Nilesh J. Samani, David P. Strachan, Philip Barter, Cornelia M. van Duijn, Jaspal S. Kooner, Leena Pelto- nen, Nicholas J. Wareham, Ruth McPherson, Vincent Mooser, and Manjinder S. Sandhu. Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Ar- teriosclerosis, Thrombosis, and Vascular Biology, 30:2264–2276, 2010.

[137] Endre Weiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, 43:355–386, 1937. Translated into English and annotated by Plastria, F. (2009), "On the point for which the sum of the distances to n given points is minimum", in Drezner and Plastria (2009), pp. 7–41.

[138] Kris A. Wetterstrand. DNA sequencing costs: data from the NHGRI Genome Sequencing Program, 2016.

[139] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[140] Philip Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

[141] Philip Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM Review, 13(2):185–188, 1971.

[142] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. Genome- wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714– 721, 2009.

[143] Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.

[144] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7):565–569, 2010.

[145] Frank Yates. The analysis of multiple classifications with unequal numbers in different classes. Journal of the American Statistical Association, 29:51–66, 1934.

[146] Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. CoRR, abs/1311.5750, 2013.

[147] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. An- nals of Statistics, 38(2):894–942, 2010.

[148] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal components analysis. Jour- nal of Computational and Graphical Statistics, 15(2):262–282, 2006.
