Numerical computation in C++
Introduction
Numerical Fast numerical computation in C++: algorithms, libraries and their Expression Templates and Beyond to performance Interlude: Profiling Lazy Code Generation (LzCG) on Linux Expression template generalities B. Nikolic Lazy code generation – what it is & how it works
Cavendish Laboratory/Kavli Institute LzCG example University of Cambridge Summary
BoostCon 2011 May 2011 Numerical Overview of ideas computation in C++
Introduction
1. ‘Standard’ rules of C++ lead to inefficient numerical Numerical algorithms, code libraries and their 2. New rules (≡ sub-languages) can be implemented performance Interlude: Profiling using expression templates on Linux
2.1 Types are used confer information about expressions Expression template 2.2 Translated to ‘standard’ C++ at compile-time generalities 3. Makes high-performance numerical C++ libraries Lazy code generation – what possible and successful it is & how it works 4. But is it enough? LzCG example 4.1 Most efficient algorithm not obvious at compile-time Summary 4.2 Convenience/flexibility of generating code in C++ 5. Types retain information about expressions in signatures in object code 5.1 Can re-generate expression template implementations post-compilation-time Numerical Outline computation in C++
Introduction
Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux
Expression Interlude: Profiling on Linux template generalities
Lazy code Expression template generalities generation – what it is & how it works
LzCG example
Lazy code generation – what it is & how it works Summary
LzCG example
Summary Numerical About myself: ALMA telescope computation in C++ Largest ground-based astronomy project in the world
Introduction
Numerical algorithms, libraries and their performance
Interlude: Profiling on Linux
Expression template generalities
Lazy code generation – what it is & how it works
LzCG example
Summary
Currently being commissioned at altitude of 5000 m in Chile. Will have 66 telescopes separated by up to 15 kms and observed at wavelength between 7 and 0.35 mm. Numerical About myself: Green Bank Telescope computation in C++ Largest steerable telescope in the world
Introduction
Numerical algorithms, libraries and their performance
Interlude: Profiling on Linux
Expression template generalities
Lazy code generation – what it is & how it works
LzCG example
Summary
Main reflector is 100x110 m in size, total height 160 m. Entire structure is accurate to 0.25 mm. Numerical About myself: Thermal radio emission from computation in C++ Messier 66 Introduction
Numerical algorithms, libraries and their performance
Interlude: Profiling I Colour scale is on Linux emission from dust at Expression template 0.024 mm wavelegnth generalities Lazy code I Contours represent generation – what emission at 3 mm from it is & how it works hot electron gas LzCG example Summary I Both appear to be powered by recent star formation Numerical General Interests computation in C++
Introduction
Numerical algorithms, libraries and their performance
Interlude: Profiling I Model optimisation and statistical inference on Linux
(maximum-likelihood, Markov Chain Monte Carlo, Expression template Nested Sampling techniques) generalities Pricing and risk-management of derivative contracts Lazy code I generation – what it is & how it works I Remote sensing of Earth’s atmosphere LzCG example
I Radiative transfer and other physical simulations Summary ⇒ All very numerically intensive applications... Numerical Aperture synthesis radio-astronomy computation in C++
Introduction Revolutionised the radio view of the universe – Nobel I Numerical prize in 1972 algorithms, libraries and their I Development of the technique closely tied to performance computers: Interlude: Profiling on Linux
I Lots of Fourier Transforms Expression I Large quantities of data to be binned, inspected, template generalities discarded if necessary Lazy code I Instruments inherently unstable so calibration is generation – what critical it is & how it works LzCG example I Atacama Large Millimetre Array: eventually 66 antennas, ∼ 20 Mb/s average output data rate: Summary
I Computational issues inconvenient, reduce scientist productivity
I Square Kilometre Array (SKA): 1000s antennas, wide field of view, ∼ few Gb/s average output data rate:
I Computational issues limiting factor in scientific output Numerical Risk management of ‘derivative’ contracts in computation in C++ finance Requirements in just one product line (e.g., credit derivatives) Introduction Numerical algorithms, libraries and their Typically calculations involve either: solving PDEs using performance Interlude: Profiling finite differences; or computing FFTs; or Monte-Carlo (MC) on Linux simulations. Expression template generalities I 2000 nodes × 1 kW/node + 50% aircon cost = Lazy code 3 MW generation – what it is & how it works
I 3 MW × 10 p/s × 8500 hr/yr = LzCG example 6 ∼ 2.5 × 10 GBP/yr! Summary I Additional costs ∝ number of nodes:
I Installation, maintenance, software licenses (even Excel sometimes!) I Floor-space (in expensive buildings) I Standby backup power generation costs Numerical Numerical performance computation in C++ (Why) does it matter?
Introduction
Numerical algorithms, Easily parallelisable Difficult to parallelise libraries and their performance
Interlude: Profiling I Cost I Feasibility on Linux I Heat, power, floor I Latency Expression template space generalities I User patience Lazy code I Environmental impact generation – what it is & how it works I Time to scale-up LzCG example I Access to capital Summary
Parallelisation is usually the most important aspect of high-performance numerical computing
I Not directly considering it in this talk although much of the material is relevant Numerical Small problems ≡ simple solutions computation in C++ Many practical scientific and industrial problems can be accelerated a simple way Introduction
Numerical algorithms, Listing 1: By-hand coding + SIMD intrinsics libraries and their void add2Vect ( const std :: vector Introduction Numerical algorithms, libraries and their performance I Correctness Interlude: Profiling I Maintainability, readability, portability on Linux Expression I Algorithms need adjustment over time template generalities I Experiment with different implementations of Lazy code generation – what algorithms it is & how it works I Approximations: how much precision, what accuracy LzCG example is necessary? Summary ⇒ These can be difficult to achieve with complex hand-crafted code! Numerical Warning! computation in C++ “Don’t try this at home” – try existing libraries first Introduction Numerical algorithms, libraries and their performance Writing numerical libraries is difficult and error prone – Interlude: Profiling on Linux always carefully consider alternatives! Expression template I Can you use standard existing libraries (“C” or “C++”) generalities Lazy code I Are you writing a general purpose library or an generation – what application? it is & how it works LzCG example I Can you, in advance, identify a subset of algorithm Summary which is likely to consume most time but can present a clean, data-only, interface? Numerical Outline computation in C++ Introduction Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression Interlude: Profiling on Linux template generalities Lazy code Expression template generalities generation – what it is & how it works LzCG example Lazy code generation – what it is & how it works Summary LzCG example Summary Numerical Requirements for good numerical computation in C++ performance Introduction Numerical algorithms, I Maximise parallelism libraries and their performance I Use all of the nodes/processors/cores/execution units Interlude: Profiling I Use Single-Instruction-Multiple-Data (SIMD) on Linux Minimise memory access Expression I template generalities I Keep close data to be processed together I Use algorithms that process small chunks of input Lazy code generation – what data at a time it is & how it works I Avoid temporaries LzCG example I Minimise ‘branching’ Summary I Keep the pipeline and speculative fetches good I But, need enough code at hand to execute I Minimise quantity of transcendental calculations I Includes division in this set I Reducing precision or accuracy makes these faster Numerical Optimisation Challenges I computation in C++ Introduction I Want: to describe the algorithm in simple, readable, Numerical re-usable way algorithms, libraries and their / / This : performance R=A+B+C+D+E; / / Not t h i s : Interlude: Profiling addFiveVect Double Double Double Double Double(A, B, C, D, E, R); on Linux Expression template I Rules for transforming such description to executable generalities code need to be complex to be efficient Lazy code generation – what I Simple application of rules applying to C++ objects: it is & how it works I Arguments are ‘evaluated’ before being passed to LzCG example functions Summary I Operators take two arguments at most I Creation of temporaries I Iteration is interpreted literally as ordered repetition of same segment of code Fundamentally: One step of the algorithm at a time Definitely not suitable for fast code! Numerical Optimisation Challenges II computation in C++ Introduction Numerical algorithms, I Programs may be compiled on one hardware setup libraries and their but run on many different hardware setups performance Interlude: Profiling I Might need (or want) to adjust rules for generation of on Linux implementations after the compilation of the main Expression template program generalities Lazy code I Speed of execution of particular implementation of generation – what algorithm can be difficult to predict it is & how it works LzCG example I Depends on precise model of the processor: clock speed, number of floating point execution cores, Summary hyper-threading, branch-prediction, pipeline designs, microcode implementations of complex instructions I Sizes of the various levels of data and code caches, main memory bus speed Numerical Example: row vs column matrix access computation in C++ Introduction Numerical algorithms, libraries and their performance Listing 2: Sum by iterating through columns first Interlude: Profiling u : : matrix for ( s i z e t j =0; j Introduction Numerical algorithms, libraries and their performance Listing 3: Sum by iterating through rows first Interlude: Profiling u : : matrix for ( s i z e t i =0; i Rows Columns Time for col-first Time for row-first Introduction Numerical (seconds) (seconds) algorithms, libraries and their performance 1000000 10 4.16 4.52 Interlude: Profiling 100000 100 4.15 9.54 on Linux 10000 1000 4.04 5.52 Expression template 1000 10000 4.02 5.41 generalities Lazy code 100 100000 4.00 4.56 generation – what 10 1000000 3.96 4.04 it is & how it works LzCG example Note Summary I Compiled using gcc without optimisation I Run on my laptop I ⇒ Illustration only! Numerical Techniques used for advanced numerical computation in C++ libraries Introduction Numerical algorithms, libraries and their performance Interlude: Profiling I Optimising compilers on Linux Expression I Custom compilers for standard languages template generalities Code generation using custom languages/frameworks I Lazy code generation – what I Run-time selection according to detected hardware it is & how it works I Run-time profiling of multiple/many algorithms LzCG example Summary I Run-time generation of machine code Expression templates and lazy code generation can adapt all of these to standard C++. Numerical A quick case study: FFTW computation in C++ http://www.fftw.org Introduction Numerical algorithms, libraries and their Revolutionary at the time: performance I Building blocks (“codelets”) of algorithms generated at Interlude: Profiling on Linux compile-time: Expression OCAML → C → machine code template generalities I Run-time selection of best combination of Lazy code building-blocks generation – what it is & how it works I Memory-layout, cache sizes, relative speed of LzCG example memory Summary I Code selection can be saved I SIMD+ (p)threads + MPI parallelism I Presents trivial C-language & Fortran-language interfaces Numerical Some subtleties computation in C++ Introduction The order of floating point operations usually matters I Numerical even when operations on real numbers would not: algorithms, libraries and their performance −10 −10 1 + (−1 + 10 ) 6= (1 − 1) + 10 (1) Interlude: Profiling on Linux Expression I “Standard” x86 floating point uses 80-bit internal template precision! (SIMD instructions do not) generalities Lazy code I Non-normal (NaN, Inf, de-normalised) floating point generation – what it is & how it works number badly affect performance: LzCG example Summary ∞ + 1 = ∞ (slowly!) (2) I Transcendentals can be approximated to less than full precision ⇒ Ensuring “bit-equivalent” results is difficult and expensive Numerical Outline computation in C++ Introduction Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression Interlude: Profiling on Linux template generalities Lazy code Expression template generalities generation – what it is & how it works LzCG example Lazy code generation – what it is & how it works Summary LzCG example Summary Numerical How to identify bottlenecks computation in C++ Introduction Numerical algorithms, libraries and their I Run a tick counting profiler: performance http://oprofile.sourceforge.net/ Interlude: Profiling on Linux I Get a stochastic measurement of where the CPU Expression spends most of its time without modifying the code! template generalities I Run a call-graph profiler: http://valgrind.org/ Lazy code I Shows how the CPU-intensive parts of the code fit generation – what into the big picture of the application it is & how it works LzCG example I Compile to assembly only (“gcc -S”) and look at the Summary code! I Allows identifications of in-efficiencies in the produced code and gives hints for optimisations Numerical Outline computation in C++ Introduction Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression Interlude: Profiling on Linux template generalities Lazy code Expression template generalities generation – what it is & how it works LzCG example Lazy code generation – what it is & how it works Summary LzCG example Summary Numerical What is an expression template? computation in C++ Introduction Numerical I Templated algorithms, libraries and their I Type performance Interlude: Profiling I Where the type itself on Linux encodes an operation, Expression template expression or generalities Expression algorithm Lazy code = generation – what template I Passed between it is & how it works functions as instantiated LzCG example objects (usually with Summary data as references) I Implemented through (partial) specialisations Numerical Expression template (minimal) example computation in C++ Introduction Numerical algorithms, libraries and their Listing 4: Adding vectors performance Interlude: Profiling // Some vectors to use in the example on Linux std :: vector Introduction Listing 5: Simple add operator for std::vector Numerical algorithms, template< class T> libraries and their performance std :: vector 1. Considers two vectors at a time 2. Iterates through the whole vector and computes the entire result Numerical Expression template (minimal) example computation in C++ Listing 6: Expression template class Introduction Numerical template { Expression template const E1 & l e f t ; const E2 & r i g h t ; generalities Lazy code generation – what binop ( const E1 & l e f t , const E2 &r i g h t , it is & how it works op opval ) : LzCG example left(left), right(right) {} ; Summary binop ( const E1 &v a l ) ; } ; 1. The sub-expression are referenced in left, right 2. The operation is contained in type of op I valueop, addop are simple tag structs Numerical Expression template (minimal) example computation in C++ Introduction Listing 7: Creation of compound expression Numerical algorithms, template 1. E1, E2 are ‘free’ template parameters – types of sub-expression is encoded in result 2. addop is the type of third template parameter – encodes addition of sub-expressions Numerical Expression template (minimal) example computation in C++ Introduction Numerical algorithms, libraries and their performance Interlude: Profiling Listing 8: Templated evaluation operation on Linux Expression template /// Evaluate the i −th element of generalities /// an expression Lazy code generation – what template double eval ( const binop s i z e t i ) ; Summary Numerical Expression template (minimal) example computation in C++ Partial specialisation for add operation Introduction Numerical algorithms, libraries and their Listing 9: Implementation of sum operation using partial performance specialisation Interlude: Profiling on Linux template s i z e t i ) Lazy code generation – what { it is & how it works return eval(o.left , i)+eval(o.right , i); LzCG example } ; Summary 1. Specialised on addop as third temp-par 2. Recursively evaluate and add using standard operator+ Numerical Expression template (minimal) example computation in C++ Partial specialisation for value operation Introduction Numerical algorithms, libraries and their performance Listing 10: Partial specialisation for value types Interlude: Profiling on Linux template s i z e t i ) Lazy code generation – what { it is & how it works return o . l e f t [ i ] ; LzCG example } ; Summary 1. Specialised on valueop as third template-parameter 2. Simply returns the value of reference vector at i Numerical Expression template (minimal) example computation in C++ Introduction Numerical algorithms, libraries and their performance Listing 11: In use Interlude: Profiling std :: vector Summary 1. Does not create a temporary vector 2. Evaluates only the third element of the result Numerical A look at the types computation in C++ Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression Listing 12: Signatures as seen by nm -C template generalities 000000000040114a W double eval Summary Numerical A look at the types computation in C++ Listing 13: Output of nm -C wrapped properly Introduction double eval Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation – what it is & how it works LzCG example Summary Numerical Lazy evaluation computation in C++ Not the same as lazy code generation... Introduction In summary: Numerical 1. Operations (+function calls) return ‘expressions’ not algorithms, libraries and their results performance 2. The order and implementation of operations in Interlude: Profiling on Linux expression can be modified (at compile-time) Expression template 3. Results are only evaluated at a boundary, e.g., when generalities assigning to plain-old-data Lazy code generation – what 4. If the result is never required, it is never computed it is & how it works Doing this properly is very elegant but things get LzCG example complicated – see the Haskel programming language Summary Things to keep in mind: 1. Side-effects are ill-defined – stick to ‘functional’ programming I Assigning to an already initialised variable is not functional programming! 2. Implemented with expression templates, size of types grows very quickly Numerical Libraries computation in C++ Introduction http://eigen.tuxfamily.org Numerical I Eigen (LGPL) algorithms, libraries and their Array Ops, Basic + Advanced Linear Algebra, performance Geometry Interlude: Profiling on Linux I Armadillo http://arma.sourceforge.net/ (LGPL) Expression Array Ops, Basic + Advanced Linear Algebra, template generalities Geometry Lazy code generation – what I Boost.uBLAS http://www.boost.org (Boost License) it is & how it works Basic Linear Algebra LzCG example 2 Summary I NT http://nt2.sourceforge.net/ (LGPL) I Blitz++ http://www.oonumerics.org/blitz/ (GPL+artistic) This was one of the first libraries to use expression templates Numerical Outline computation in C++ Introduction Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression Interlude: Profiling on Linux template generalities Lazy code Expression template generalities generation – what it is & how it works LzCG example Lazy code generation – what it is & how it works Summary LzCG example Summary Numerical Introduction computation in C++ Introduction ‘Standard’ expression templates Numerical algorithms, libraries and their 1. Types convey information about algorithm that performance functions implement Interlude: Profiling on Linux 2. These types are interpreted at compile-time and Expression template corresponding code generated generalities Lazy code generation – what Lazy code generation it is & how it works LzCG example 1. The types are recorded in object code too, so the Summary algorithm implemented by symbols is retained 2. Generate new implementations, post-compilation-time ⇒ Introduces new flexibility and modularity in code generation process Numerical Example – BLAS Reference card computation in C++ Introduction Level BLAS dim scalar vector vector scalars element array pre xes SUBROUTINE xROTG A B C S Generate plane rotation S D Numerical SUBROUTINE xROTMG D D A B PARAM Generate di ed mo plane rotation S D SUBROUTINE xROT N X INCX Y INCY C S Apply plane rotation S D SUBROUTINE xROTM N X INCX Y INCY PARAM Apply di ed mo plane rotation S D algorithms, SUBROUTINE xSWAP N X INCX Y INCY x y S D C Z SUBROUTINE xSCAL N ALPHA X INCX x x S D C Z CS ZD libraries and their SUBROUTINE xCOPY N X INCX Y INCY y x S D C Z SUBROUTINE xAXPY N ALPHA X INCX Y INCY y x y S D C Z T performance FUNCTION xDOT N X INCX Y INCY dot x y S D DS T FUNCTION xDOTU N X INCX Y INCY dot x y C Z H FUNCTION xDOTC N X INCX Y INCY dot x y C Z T FUNCTION xxDOT N X INCX Y INCY dot x y SDS Interlude: Profiling FUNCTION xNRM N X INCX m nr jj x jj S D SC DZ 2 FUNCTION xASUM N X INCX asum jj e r x jj jj im x jj S D SC DZ 1 1 st on Linux FUNCTION IxAMAX N X INCX amax k e j r x j j im x j S D C Z k k