Parallel Architectures and Algorithms for Large-Scale Nonlinear Programming

Parallel Architectures and Algorithms for Large-Scale Nonlinear Programming Carl D. Laird Associate Professor, School of Chemical Engineering, Purdue University Faculty Fellow, Mary Kay O’Connor Process Safety Center Laird Research Group: http://allthingsoptimal.com Landscape of Scientific Computing 10$ 1$ 50%/year 20%/year Clock&Rate&(GHz)& 0.1$ 0.01$ 1980$ 1985$ 1990$ 1995$ 2000$ 2005$ 2010$ Year& [Steven Edwards, Columbia University] 2 Landscape of Scientific Computing Clock-rate, the source of past speed improvements have stagnated. Hardware manufacturers are shifting their focus to energy efficiency (mobile, large10" data centers) and parallel architectures. 12" 10" 1" 8" 6" #"of"Cores" Clock"Rate"(GHz)" 0.1" 4" 2" 0.01" 0" 1980" 1985" 1990" 1995" 2000" 2005" 2010" Year" 3 “… over a 15-year span… [problem] calculations improved by a factor of 43 million. [A] factor of roughly 1,000 was attributable to faster processor speeds, … [yet] a factor of 43,000 was due to improvements in the efficiency of software algorithms.” - attributed to Martin Grotschel [Steve Lohr, “Software Progress Beats Moore’s Law”, New York Times, March 7, 2011] Landscape of Computing 5 Landscape of Computing High-Performance Parallel Architectures Multi-core, Grid, HPC clusters, Specialized Architectures (GPU) 6 Landscape of Computing High-Performance Parallel Architectures Multi-core, Grid, HPC clusters, Specialized Architectures (GPU) 7 Landscape of Computing data VM iOS and Android device sales High-Performance have surpassed PC Parallel Architectures iPhone performance Multi-core, Grid, ~2X / year HPC clusters, iPhone 5S within a factor of 2 Specialized of 2.5 GHz Core i5 Architectures (GPU) [Geekbench 3] 8 Landscape of Computing data VM Watch: increased use in High-Performance operations Parallel Architectures Watch: rise of medical Multi-core, Grid, devices and health HPC clusters, Specialized Watch: touch form factor Architectures (GPU) for desktop computing 9 High-Performance Computing Parallel Computing Architectures Measuring Parallel Performance Parallel Algorithms for NLP 10 Parallel Computing Architectures Flynn’s Taxonomy Emerging'Architectures'(SIMD/HybriD)' Single' Mul-ple' Graphics$Processing$Unit$(GPU)$ Data' Data' • Affordable,$1000’s$of$Processors$ Single' SISD$ SIMD$ Instruc-on' • Specialized$Compilers/Languages$ • CUDA,$Brook++,$OpenCL$ Mul-ple' MISD$ MIMD$ Instruc-on' • Several$Complexi5es$&$Limita5ons$ Beowulf'Cluster'(MIMD)' Desktop'Mul-;core'(MIMD)' • Scalable$Distributed$Compu5ng$ • Affordable$Hardware $$ 100s$of$Processors$ Low$Number$of$Processors$ • Standard$Compilers$(MPI)$ • Standard$Compilers$(Threads,$OpenMP)$ • Network$Communica5on$(Ethernet)$ • Fast$Communica5on$(No$Network)$ • Communica5on$BoDleneck$ • Memory$Access/#$CPU$BoDleneck$ 11 Parallel Computing Architectures Flynn’s Taxonomy Emerging'Architectures'(SIMD/HybriD)' Single' Mul-ple' Graphics$Processing$Unit$(GPU)$ Data' Data' • Affordable,$1000’s$of$Processors$ Single' SISD$ SIMD$ Instruc-on' • Specialized$Compilers/Languages$ • CUDA,$Brook++,$OpenCL$ Mul-ple' MISD$ MIMD$ Instruc-on' • Several$Complexi5es$&$Limita5ons$ Beowulf'Cluster'(MIMD)' Desktop'Mul-;core'(MIMD)' • Scalable$Distributed$Compu5ng$ • Affordable$Hardware $$ 100s$of$Processors$ Low$Number$of$Processors$ • Standard$Compilers$(MPI)$ • Standard$Compilers$(Threads,$OpenMP)$ • Network$Communica5on$(Ethernet)$ • Fast$Communica5on$(No$Network)$ • Communica5on$BoDleneck$ • Memory$Access/#$CPU$BoDleneck$ 12 Parallel Computing Architectures Flynn’s Taxonomy Emerging'Architectures'(SIMD/HybriD)' Single' Mul-ple' Graphics$Processing$Unit$(GPU)$ Data' Data' • Affordable,$1000’s$of$Processors$ Single' SISD$ SIMD$ Instruc-on' • Specialized$Compilers/Languages$ • CUDA,$Brook++,$OpenCL$ Mul-ple' MISD$ MIMD$ Instruc-on' • Several$Complexi5es$&$Limita5ons$ Beowulf'Cluster'(MIMD)' Desktop'Mul-;core'(MIMD)' • Scalable$Distributed$Compu5ng$ • Affordable$Hardware $$ 100s$of$Processors$ Low$Number$of$Processors$ • Standard$Compilers$(MPI)$ • Standard$Compilers$(Threads,$OpenMP)$ • Network$Communica5on$(Ethernet)$ • Fast$Communica5on$(No$Network)$ • Communica5on$BoDleneck$ • Memory$Access/#$CPU$BoDleneck$ 13 Writing High-Performance Scientific Software • Implementation and hardware details matter - Cache sizes have a major impact - Memory latency, coalescing, and alignment matter - Network latency and bandwidth matter - Compiler optimization can only do so much - Use high-performance code tailored to architecture (e.g. BLAS & LAPACK for dense linear algebra) • Languages matter - C / C++ / Fortran still the dominant choices - Interpreted languages (E.g. Matlab / Python) might be reasonable, but be careful - Modern compiled languages close to speed of C (with significant reduction in development time) 14 Measuring Parallel Performance Speedup: Efficiency: usually < 1 Communication Overhead Memory/Cache Bottlenecks 15 Measuring Parallel Performance Serial Execution Time = 10 min 16 Measuring Parallel Performance Serial Execution Time = 10 min Can be computed in parallel Must be computed in serial 17 Measuring Parallel Performance Serial Execution Time = 10 min Can be computed in parallel Must be computed in serial 18 Measuring Parallel Performance Serial Execution Time = 10 min 2 Processors 1.8 Times Speedup Parallel Time = 5.5 min ~90% Efficiency Can be computed in parallel Must be computed in serial 19 Measuring Parallel Performance Serial Execution Time = 10 min 2 Processors 1.8 Times Speedup Parallel Time = 5.5 min ~90% Efficiency Infinite Processors Parallel Time = 1.0 min 10 Times Speedup Can be computed in parallel Must be computed in serial 20 Measuring Parallel Performance Serial Execution Time = 10 min Amdahl’s Law, Maximum Speedup: Infinite Processors Parallel Time = 1.0 min 10 Times Speedup Can be computed in parallel Must be computed in serial 21 Measuring Parallel Performance (Scaling Efficiency) Strong Scaling - Problem size fixed as # processors increased - Fixed workload Linear Scaling - Speedup = # Processors Performance compared against linear scaling 22 Measuring Parallel Performance (Scaling Efficiency) Weak Scaling - Problem size grows as # processors increases - Fixed workload per processor Linear Scaling - Constant run-time Performance compared against linear scaling 23 Parallel Algorithms for Nonlinear Optimization Inherently Parallel Exploiting Problem Algorithm Structure Algorithm Parallel on Algorithm Parallel Due to General Problem Problem Structure Often based on less Decomposition effective serial algorithms strategies Easily parallelized High-level parallelism 24 Parallel Algorithms for Nonlinear Optimization Inherently Parallel Exploiting Problem Algorithm Structure Problem Level Internal Decomposition Decomposition Bender’s, Lagrangian, Host Algorithm: parallelize Progressive Hedging all scale-dependent operations Not for SIMD Slow convergence for large NLP Nonlinear interior-point 25 Nonlinear Interior-point Methods Original NLP Barrier NLP - Fiacco & McCormick (1968) 26 Nonlinear Interior-point Methods Original NLP Initialize Original NLP Done Converged? Barrier NLP Barrier NLP Reduce Barrier Converged? Parameter Calculate Derivatives, Residuals, etc. KNITRO (Byrd, Nocedal, Hribar, Waltz) Calculate Step Direction LOQO (Benson, Vanderbei, Shanno) IPOPT (Wachter, Laird, Biegler) Perform Line Search 27 Nonlinear Interior-point Methods Original NLP Initialize Original NLP Done Converged? Barrier NLP Barrier NLP Reduce Barrier Converged? Parameter Calculate Derivatives, Residuals, etc. KNITRO (Byrd, Nocedal, Hribar, Waltz) Calculate Step Direction LOQO (Benson, Vanderbei, Shanno) IPOPT (Wachter, Laird, Biegler) Perform Line Search 28 Parallel Nonlinear Interior-point Methods Structure in the optimization Initialize problem induces structure in the linear algebra Original NLP Done Converged? Parallelize all scale-dependent operations - Vector and matrix operations Barrier NLP Reduce Barrier Converged? Parameter - Model evaluation Calculate Derivatives, Compared with problem-level Residuals, etc. decomposition, implementation is time consuming Calculate Step Direction Retain convergence properties of serial algorithm Perform Line Search 29 Parallel Linear Algebra for NLP General Purpose Parallel Direct Solvers - For general sparse systems - no need to specify structure - [Schenk & Gartner 2004; Scott 2003; Amestoy et al. 2000; …] - Modest scale up Iterative Linear Solvers (e.g. GMRES, PCG) - Capable of handling general sparse systems - Easily parallelized (many implementations) - Appropriate for SIMD architectures - Requires effective preconditioning [Dollar et al. 2007; Forsgren et al. 2008] Tailored Decomposition of Problem Structure - Custom parallel linear algebra exploits a fixed, known structure - [Gondzio, Anitescu (PIPS), Laird, …] - Problem class specific - difficult to implement - Can scale to large parallel architectures (100’s - 1000’s of processors) 30 Exploiting Problem Structure Optimization Under Uncertainty - block structure because of coupled scenarios - common structure of many applications Kang, J., Word, D.P., and Laird, C.D., "An interior-point method for efficient solution of block-structured NLP problems using an implicit Schur-complement decomposition", to appear in Computers and Chemical Engineering, 2014. tf Dynamic Optimization min L(x, y, u) dt u - block structure because of Zt0 finite element discretization s.t. F (˙x, x, y, u)=0 x(t0)=x0 (x, y, u)L (x, y, u) (x, y, u)U Word, D.P., Kang, J., Akesson, J., and Laird, C.D., "Efficient Parallel Solution of Large-Scale Nonlinear Dynamic

Parallel Architectures and Algorithms for Large-Scale Nonlinear Programming

CUDA by Example

Bitfusion Guide to CUDA Installation Bitfusion Guides Bitfusion: Bitfusion Guide to CUDA Installation

(GPU) Computing

Massively Parallel Computing with CUDA

CUDA Flux: a Lightweight Instruction Profiler for CUDA Applications

GPU: Power Vs Performance

An Introduction to Gpus, CUDA and Opencl

Cuda C Programming Guide

CUDA 11 and A100 - WHAT’S NEW? Markus Hrywniak, 23Rd June 2020 TOPICS for TODAY

NVIDIA Ampere GA102 GPU Architecture Whitepaper

Gpgpu Processing in Cuda Architecture

NVIDIA Tesla P100 Pcie Data Sheet