Numerical computation in C++

Introduction

Numerical Fast numerical computation in C++: algorithms, libraries and their Expression Templates and Beyond to performance Interlude: Profiling Lazy Code Generation (LzCG) on Linux Expression template generalities B. Nikolic Lazy code generation – what it is & how it works

Cavendish Laboratory/Kavli Institute LzCG example University of Cambridge Summary

BoostCon 2011 May 2011 Numerical Overview of ideas computation in C++

Introduction

1. ‘Standard’ rules of C++ lead to inefficient numerical Numerical algorithms, code libraries and their 2. New rules (≡ sub-languages) can be implemented performance Interlude: Profiling using expression templates on Linux

2.1 Types are used confer information about expressions Expression template 2.2 Translated to ‘standard’ C++ at compile-time generalities 3. Makes high-performance numerical C++ libraries Lazy code generation – what possible and successful it is & how it works 4. But is it enough? LzCG example 4.1 Most efficient algorithm not obvious at compile-time Summary 4.2 Convenience/flexibility of generating code in C++ 5. Types retain information about expressions in signatures in object code 5.1 Can re-generate expression template implementations post-compilation-time Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical About myself: ALMA telescope computation in C++ Largest ground-based astronomy project in the world

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression template generalities

Lazy code generation – what it is & how it works

LzCG example

Summary

Currently being commissioned at altitude of 5000 m in Chile. Will have 66 telescopes separated by up to 15 kms and observed at wavelength between 7 and 0.35 mm. Numerical About myself: Green Bank Telescope computation in C++ Largest steerable telescope in the world

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression template generalities

Lazy code generation – what it is & how it works

LzCG example

Summary

Main reflector is 100x110 m in size, total height 160 m. Entire structure is accurate to 0.25 mm. Numerical About myself: Thermal radio emission from computation in C++ Messier 66 Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling I Colour scale is on Linux emission from dust at Expression template 0.024 mm wavelegnth generalities Lazy code I Contours represent generation – what emission at 3 mm from it is & how it works hot electron gas LzCG example Summary I Both appear to be powered by recent star formation Numerical General Interests computation in C++

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling I Model optimisation and statistical inference on Linux

(maximum-likelihood, Markov Chain Monte Carlo, Expression template Nested Sampling techniques) generalities Pricing and risk-management of derivative contracts Lazy code I generation – what it is & how it works I Remote sensing of Earth’s atmosphere LzCG example

I Radiative transfer and other physical simulations Summary ⇒ All very numerically intensive applications... Numerical Aperture synthesis radio-astronomy computation in C++

Introduction Revolutionised the radio view of the universe – Nobel I Numerical prize in 1972 algorithms, libraries and their I Development of the technique closely tied to performance computers: Interlude: Profiling on Linux

I Lots of Fourier Transforms Expression I Large quantities of data to be binned, inspected, template generalities discarded if necessary Lazy code I Instruments inherently unstable so calibration is generation – what critical it is & how it works LzCG example I Atacama Large Millimetre Array: eventually 66 antennas, ∼ 20 Mb/s average output data rate: Summary

I Computational issues inconvenient, reduce scientist productivity

I Square Kilometre Array (SKA): 1000s antennas, wide field of view, ∼ few Gb/s average output data rate:

I Computational issues limiting factor in scientific output Numerical Risk management of ‘derivative’ contracts in computation in C++ finance Requirements in just one product line (e.g., credit derivatives) Introduction Numerical algorithms, libraries and their Typically calculations involve either: solving PDEs using performance Interlude: Profiling finite differences; or computing FFTs; or Monte-Carlo (MC) on Linux simulations. Expression template generalities I 2000 nodes × 1 kW/node + 50% aircon cost = Lazy code 3 MW generation – what it is & how it works

I 3 MW × 10 p/s × 8500 hr/yr = LzCG example 6 ∼ 2.5 × 10 GBP/yr! Summary I Additional costs ∝ number of nodes:

I Installation, maintenance, software licenses (even Excel sometimes!) I Floor-space (in expensive buildings) I Standby backup power generation costs Numerical Numerical performance computation in C++ (Why) does it matter?

Introduction

Numerical algorithms, Easily parallelisable Difficult to parallelise libraries and their performance

Interlude: Profiling I Cost I Feasibility on Linux I Heat, power, floor I Latency Expression template space generalities I User patience Lazy code I Environmental impact generation – what it is & how it works I Time to scale-up LzCG example I Access to capital Summary

Parallelisation is usually the most important aspect of high-performance numerical computing

I Not directly considering it in this talk although much of the material is relevant Numerical Small problems ≡ simple solutions computation in C++ Many practical scientific and industrial problems can be accelerated a simple way Introduction

Numerical algorithms, Listing 1: By-hand coding + SIMD intrinsics libraries and their void add2Vect ( const std :: vector &v1 , performance const std :: vector &v2 , std :: vector &res ) { Interlude: Profiling typedef double v2df a t i b u t e ( ( mode(V2DF ) ) ) ; on Linux ∗ ∗ ∗ v2df dest =( v2df )&( res.begin()); Expression const s i z e t n=v1.size(); template ∗ ∗ const v2df src1 =( const v2df )&v1 [ 0 ] ; generalities const v2df ∗src2 =( const v2df ∗)&v2 [ 0 ] ; i f ( n%2==0) Lazy code { generation – what for ( s i z e t i =0; i

Introduction

Numerical algorithms, libraries and their performance I Correctness Interlude: Profiling I Maintainability, readability, portability on Linux Expression I Algorithms need adjustment over time template generalities

I Experiment with different implementations of Lazy code generation – what algorithms it is & how it works I Approximations: how much precision, what accuracy LzCG example is necessary? Summary ⇒ These can be difficult to achieve with complex hand-crafted code! Numerical Warning! computation in C++ “Don’t try this at home” – try existing libraries first

Introduction

Numerical algorithms, libraries and their performance Writing numerical libraries is difficult and error prone – Interlude: Profiling on Linux always carefully consider alternatives! Expression template I Can you use standard existing libraries (“C” or “C++”) generalities Lazy code I Are you writing a general purpose library or an generation – what application? it is & how it works LzCG example

I Can you, in advance, identify a subset of algorithm Summary which is likely to consume most time but can present a clean, data-only, interface? Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical Requirements for good numerical computation in C++ performance Introduction

Numerical algorithms, I Maximise parallelism libraries and their performance I Use all of the nodes/processors/cores/execution units Interlude: Profiling I Use Single-Instruction-Multiple-Data (SIMD) on Linux Minimise memory access Expression I template generalities I Keep close data to be processed together I Use algorithms that process small chunks of input Lazy code generation – what data at a time it is & how it works I Avoid temporaries LzCG example

I Minimise ‘branching’ Summary

I Keep the pipeline and speculative fetches good I But, need enough code at hand to execute

I Minimise quantity of transcendental calculations

I Includes division in this set I Reducing precision or accuracy makes these faster Numerical Optimisation Challenges I computation in C++

Introduction I Want: to describe the algorithm in simple, readable, Numerical re-usable way algorithms, libraries and their / / This : performance R=A+B+C++E; / / Not t h i s : Interlude: Profiling addFiveVect Double Double Double Double Double(A, B, C, D, E, R); on Linux Expression template I Rules for transforming such description to executable generalities

code need to be complex to be efficient Lazy code generation – what I Simple application of rules applying to C++ objects: it is & how it works

I Arguments are ‘evaluated’ before being passed to LzCG example functions Summary I Operators take two arguments at most I Creation of temporaries I Iteration is interpreted literally as ordered repetition of same segment of code Fundamentally: One step of the algorithm at a time Definitely not suitable for fast code! Numerical Optimisation Challenges II computation in C++

Introduction

Numerical algorithms, I Programs may be compiled on one hardware setup libraries and their but run on many different hardware setups performance Interlude: Profiling I Might need (or want) to adjust rules for generation of on Linux implementations after the compilation of the main Expression template program generalities Lazy code I Speed of execution of particular implementation of generation – what algorithm can be difficult to predict it is & how it works LzCG example I Depends on precise model of the processor: clock speed, number of floating point execution cores, Summary hyper-threading, branch-prediction, pipeline designs, microcode implementations of complex instructions I Sizes of the various levels of data and code caches, main memory bus speed Numerical Example: row vs column matrix access computation in C++

Introduction

Numerical algorithms, libraries and their performance

Listing 2: Sum by iterating through columns first Interlude: Profiling u : : matrix A(nrow, ncol) ; on Linux Expression // initialise... template generalities double res ; Lazy code for ( s i z e t k =0; k

for ( s i z e t j =0; j

Introduction

Numerical algorithms, libraries and their performance

Listing 3: Sum by iterating through rows first Interlude: Profiling u : : matrix A(nrow, ncol); on Linux Expression // initialise... template generalities double res ; Lazy code for ( s i z e t k =0; k

for ( s i z e t i =0; i

Rows Columns Time for col-first Time for row-first Introduction Numerical (seconds) (seconds) algorithms, libraries and their performance 1000000 10 4.16 4.52 Interlude: Profiling 100000 100 4.15 9.54 on Linux 10000 1000 4.04 5.52 Expression template 1000 10000 4.02 5.41 generalities Lazy code 100 100000 4.00 4.56 generation – what 10 1000000 3.96 4.04 it is & how it works LzCG example Note Summary

I Compiled using gcc without optimisation

I Run on my laptop

I ⇒ Illustration only! Numerical Techniques used for advanced numerical computation in C++ libraries Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling I Optimising compilers on Linux Expression I Custom compilers for standard languages template generalities Code generation using custom languages/frameworks I Lazy code generation – what I Run-time selection according to detected hardware it is & how it works

I Run-time profiling of multiple/many algorithms LzCG example Summary I Run-time generation of machine code Expression templates and lazy code generation can adapt all of these to standard C++. Numerical A quick case study: FFTW computation in C++ http://www.fftw.org

Introduction

Numerical algorithms, libraries and their Revolutionary at the time: performance I Building blocks (“codelets”) of algorithms generated at Interlude: Profiling on Linux compile-time: Expression OCAML → C → machine code template generalities

I Run-time selection of best combination of Lazy code building-blocks generation – what it is & how it works I Memory-layout, cache sizes, relative speed of LzCG example memory Summary I Code selection can be saved

I SIMD+ (p)threads + MPI parallelism

I Presents trivial C-language & Fortran-language interfaces Numerical Some subtleties computation in C++

Introduction The order of floating point operations usually matters I Numerical even when operations on real numbers would not: algorithms, libraries and their performance −10 −10 1 + (−1 + 10 ) 6= (1 − 1) + 10 (1) Interlude: Profiling on Linux

Expression I “Standard” x86 floating point uses 80-bit internal template precision! (SIMD instructions do not) generalities Lazy code I Non-normal (NaN, Inf, de-normalised) floating point generation – what it is & how it works

number badly affect performance: LzCG example

Summary ∞ + 1 = ∞ (slowly!) (2)

I Transcendentals can be approximated to less than full precision ⇒ Ensuring “bit-equivalent” results is difficult and expensive Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical How to identify bottlenecks computation in C++

Introduction

Numerical algorithms, libraries and their I Run a tick counting profiler: performance http://oprofile.sourceforge.net/ Interlude: Profiling on Linux I Get a stochastic measurement of where the CPU Expression spends most of its time without modifying the code! template generalities I Run a call-graph profiler: http://valgrind.org/ Lazy code I Shows how the CPU-intensive parts of the code fit generation – what into the big picture of the application it is & how it works LzCG example

I Compile to assembly only (“gcc -S”) and look at the Summary code!

I Allows identifications of in-efficiencies in the produced code and gives hints for optimisations Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical What is an expression template? computation in C++

Introduction

Numerical I Templated algorithms, libraries and their I Type performance Interlude: Profiling I Where the type itself on Linux

encodes an operation, Expression template expression or generalities Expression algorithm Lazy code = generation – what template I Passed between it is & how it works functions as instantiated LzCG example objects (usually with Summary data as references)

I Implemented through (partial) specialisations Numerical Expression template (minimal) example computation in C++

Introduction

Numerical algorithms, libraries and their Listing 4: Adding vectors performance Interlude: Profiling // Some vectors to use in the example on Linux std :: vector a(10, 1.0), b(10, 2.0); Expression template generalities // Addition the old way (see next slide) Lazy code generation – what double res=(a+b+a+b)[3]; it is & how it works LzCG example 1. A temporary is created Summary 2. All members of the result vector are computed 3. The temporary is iterated three times over Numerical Helper function computation in C++

Introduction

Listing 5: Simple add operator for std::vector Numerical algorithms, template< class T> libraries and their performance std :: vector Interlude: Profiling operator+(const std :: vector &a , on Linux < > Expression const std::vector T &b ) template { generalities Lazy code std :: vector res(a.size ()); generation – what f o r ( s i z e t i =0; i

1. Considers two vectors at a time 2. Iterates through the whole vector and computes the entire result Numerical Expression template (minimal) example computation in C++

Listing 6: Expression template class Introduction Numerical template performance Interlude: Profiling struct binop on Linux

{ Expression template const E1 & l e f t ; const E2 & r i g h t ; generalities

Lazy code generation – what binop ( const E1 & l e f t , const E2 &r i g h t , it is & how it works

op opval ) : LzCG example left(left), right(right) {} ; Summary binop ( const E1 &v a l ) ; } ;

1. The sub-expression are referenced in left, right 2. The operation is contained in type of op

I valueop, addop are simple tag structs Numerical Expression template (minimal) example computation in C++

Introduction

Listing 7: Creation of compound expression Numerical algorithms, template libraries and their performance binop Interlude: Profiling operator+ ( const E1 & l e f t , on Linux const E2 & r i g h t ) Expression template { generalities < > Lazy code return binop E1, E2, addop ( l e f t , generation – what r i g h t , it is & how it works addop ( ) ) ; LzCG example } Summary

1. E1, E2 are ‘free’ template parameters – types of sub-expression is encoded in result 2. addop is the type of third template parameter – encodes addition of sub-expressions Numerical Expression template (minimal) example computation in C++

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling Listing 8: Templated evaluation operation on Linux Expression template /// Evaluate the i −th element of generalities

/// an expression Lazy code generation – what template it is & how it works

double eval ( const binop &o , LzCG example

s i z e t i ) ; Summary Numerical Expression template (minimal) example computation in C++ Partial specialisation for add operation

Introduction

Numerical algorithms, libraries and their Listing 9: Implementation of sum operation using partial performance specialisation Interlude: Profiling on Linux

template Expression template double eval ( const binop &o , generalities

s i z e t i ) Lazy code generation – what { it is & how it works return eval(o.left , i)+eval(o.right , i); LzCG example } ; Summary

1. Specialised on addop as third temp-par 2. Recursively evaluate and add using standard operator+ Numerical Expression template (minimal) example computation in C++ Partial specialisation for value operation

Introduction

Numerical algorithms, libraries and their performance

Listing 10: Partial specialisation for value types Interlude: Profiling on Linux template Expression double eval ( const binop &o , template generalities

s i z e t i ) Lazy code generation – what { it is & how it works

return o . l e f t [ i ] ; LzCG example

} ; Summary

1. Specialised on valueop as third template-parameter 2. Simply returns the value of reference vector at i Numerical Expression template (minimal) example computation in C++

Introduction

Numerical algorithms, libraries and their performance

Listing 11: In use Interlude: Profiling std :: vector a (10 , 1 . 0 ) , on Linux Expression b (10 , 2 . 0 ) ; template generalities binop<> ba(a), bb(b); Lazy code generation – what it is & how it works double res=eval(ba+bb+ba+bb, 3); LzCG example

Summary 1. Does not create a temporary vector 2. Evaluates only the third element of the result Numerical A look at the types computation in C++

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression Listing 12: Signatures as seen by nm -C template generalities 000000000040114a W double eval >, std::vector >, valueop >, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop> >(binop >, std::vector >, valueop >, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop >, addop> const&, unsigned long ) 00000000004014a7 W double eval> code, std::vector >, valueop >, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop> >(binop >, std::vector >, valueop >, binop >, std::vector >, valueop >, addop>, binop >, std::vector >, valueop >, addop> const&, unsigned long ) 0000000000401624 W double eval >,generation std::vector – what >, valueop >, binop >, std::vector >, valueop> >(binop >, std::vector >, valueop >, binop >, std::vector >, valueop >, addop> const&, unsigned long ) 0000000000401473 W double eval >, std::vectorit is &< howdouble it works, std::allocator>, valueop> const&, unsigned long ) LzCG example

Summary Numerical A look at the types computation in C++

Listing 13: Output of nm -C wrapped properly Introduction double eval >, Numerical std :: vector >, algorithms, valueop >, libraries and their binop >, performance std :: vector >, valueop >, Interlude: Profiling addop>, on Linux binop >, Expression std :: vector >, template valueop >, generalities addop>, binop >, Lazy code std :: vector >, generation – what valueop> >( it is & how it works binop >, std :: vector >, LzCG example valueop >, binop >, Summary std :: vector >, valueop >, addop>, binop >, std :: vector >, valueop >, addop>, binop >, std :: vector >, valueop >, addop> const&, unsigned long ) Numerical A look at the types computation in C++

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression template generalities

Lazy code generation – what it is & how it works

LzCG example

Summary Numerical Lazy evaluation computation in C++ Not the same as lazy code generation...

Introduction In summary: Numerical 1. Operations (+function calls) return ‘expressions’ not algorithms, libraries and their results performance 2. The order and implementation of operations in Interlude: Profiling on Linux

expression can be modified (at compile-time) Expression template 3. Results are only evaluated at a boundary, e.g., when generalities

assigning to plain-old-data Lazy code generation – what 4. If the result is never required, it is never computed it is & how it works

Doing this properly is very elegant but things get LzCG example complicated – see the Haskel Summary Things to keep in mind: 1. Side-effects are ill-defined – stick to ‘functional’ programming I Assigning to an already initialised variable is not ! 2. Implemented with expression templates, size of types grows very quickly Numerical Libraries computation in C++

Introduction http://eigen.tuxfamily.org Numerical I Eigen (LGPL) algorithms, libraries and their Array Ops, Basic + Advanced Linear Algebra, performance

Geometry Interlude: Profiling on Linux I Armadillo http://arma.sourceforge.net/ (LGPL) Expression Array Ops, Basic + Advanced Linear Algebra, template generalities Geometry Lazy code generation – what I Boost.uBLAS http://www.boost.org (Boost License) it is & how it works Basic Linear Algebra LzCG example

2 Summary I NT http://nt2.sourceforge.net/ (LGPL)

I Blitz++ http://www.oonumerics.org/blitz/ (GPL+artistic) This was one of the first libraries to use expression templates Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical Introduction computation in C++

Introduction ‘Standard’ expression templates Numerical algorithms, libraries and their 1. Types convey information about algorithm that performance functions implement Interlude: Profiling on Linux

2. These types are interpreted at compile-time and Expression template corresponding code generated generalities

Lazy code generation – what Lazy code generation it is & how it works LzCG example 1. The types are recorded in object code too, so the Summary algorithm implemented by symbols is retained 2. Generate new implementations, post-compilation-time ⇒ Introduces new flexibility and modularity in code generation process Numerical Example – BLAS Reference card computation in C++

Introduction

Level BLAS

dim scalar vector vector scalars element array prexes xROTG A B C S Generate plane rotation S D Numerical SUBROUTINE xROTMG D D A B PARAM Generate died mo plane rotation S D

SUBROUTINE xROT N X INCX Y INCY C S Apply plane rotation S D

SUBROUTINE xROTM N X INCX Y INCY PARAM Apply died mo plane rotation S D algorithms,

SUBROUTINE xSWAP N X INCX Y INCY x y S D C Z

SUBROUTINE xSCAL N ALPHA X INCX x x S D C Z CS ZD libraries and their

SUBROUTINE xCOPY N X INCX Y INCY y x S D C Z

SUBROUTINE xAXPY N ALPHA X INCX Y INCY y x y S D C Z

T performance

FUNCTION xDOT N X INCX Y INCY dot x y S D DS

T

FUNCTION xDOTU N X INCX Y INCY dot x y C Z

H

FUNCTION xDOTC N X INCX Y INCY dot x y C Z

T FUNCTION xxDOT N X INCX Y INCY dot x y SDS Interlude: Profiling

FUNCTION xNRM N X INCX m nr jj x jj S D SC DZ

2

FUNCTION xASUM N X INCX asum jj e r x jj jj im x jj S D SC DZ

1 1

st on Linux

FUNCTION IxAMAX N X INCX amax k e j r x j j im x j S D C Z

k k

max e j r x j j im x j

i i

Level BLAS

options dim bwidth scalar matrix vector scalar vector

T H Expression

xGEMV TRANS M N ALPHA A LDA X INCX BETA Y INCY y Ax y y A x y y A x A y m n S D C Z

T H

xGBMV TRANS M N KL KU ALPHA A LDA X INCX BETA Y INCY y Ax y y A x y y A x A y m n S D C Z

xHEMV UPLO N ALPHA A LDA X INCX BETA Y INCY y Ax y C Z template

xHBMV UPLO N K ALPHA A LDA X INCX BETA Y INCY y Ax y C Z

xHPMV UPLO N ALPHA AP X INCX BETA Y INCY y Ax y C Z generalities

xSYMV UPLO N ALPHA A LDA X INCX BETA Y INCY y Ax y S D

xSBMV UPLO N K ALPHA A LDA X INCX BETA Y INCY y Ax y S D

xSPMV UPLO N ALPHA AP X INCX BETA Y INCY y Ax y S D

T H xTRMV UPLO TRANS DIAG N A LDA X INCX x x Ax A x x A x S D C Z Lazy code

T H

xTBMV UPLO TRANS DIAG N K A LDA X INCX x x Ax A x x A x S D C Z

T H xTPMV UPLO TRANS DIAG N AP X INCX x x Ax A x x A x S D C Z generation – what

1 T H

xTRSV UPLO TRANS DIAG N A LDA X INCX x A x x A x x A x S D C Z

1 T H xTBSV UPLO TRANS DIAG N K A LDA X INCX x A x x A x x A x S D C Z it is & how it works

1 T H

xTPSV UPLO TRANS DIAG N AP X INCX x A x x A x x A x S D C Z

options dim scalar vector vector matrix

T

xGER M N ALPHA X INCX Y INCY A LDA A xy A A m n S D

T xGERU M N ALPHA X INCX Y INCY A LDA A xy A A m n C Z LzCG example

H

xGERC M N ALPHA X INCX Y INCY A LDA A xy A A m n C Z

H

xHER UPLO N ALPHA X INCX A LDA A xx A C Z

H

xHPR UPLO N ALPHA X INCX AP A xx A C Z

H H xHER UPLO N ALPHA X INCX Y INCY A LDA A xy y x A C Z Summary

H H

xHPR UPLO N ALPHA X INCX Y INCY AP A xy y x A C Z

T

xSYR UPLO N ALPHA X INCX A LDA A xx A S D

T

xSPR UPLO N ALPHA X INCX AP A xx A S D

T T

xSYR UPLO N ALPHA X INCX Y INCY A LDA A xy x y A S D

T T

xSPR UPLO N ALPHA X INCX Y INCY AP A xy x y A S D

Level BLAS

options dim scalar matrix matrix scalar matrix

T H

xGEMM TRANSA TRANSB M N K ALPHA A LDA B LDB BETA C LDC C op A op B op C X X X X C m n S D C Z

T

xSYMM SIDE UPLO M N ALPHA A LDA B LDB BETA C LDC C AB C C A B C C m A n A S D C Z

H

xHEMM SIDE UPLO M N ALPHA A LDA B LDB BETA C LDC C AB C C A B C C m A n A C Z

T T

xSYRK UPLO TRANS N K ALPHA A LDA BETA C LDC C AA C C A A C C n n S D C Z

H H

xHERK UPLO TRANS N K ALPHA A LDA BETA C LDC C AA C C A A C C n n C Z

T T T T

xSYRK UPLO TRANS N K ALPHA A LDA B LDB BETA C LDC C AB A B C C A B B A C C n n S D C Z

H H H H

xHERK UPLO TRANS N K ALPHA A LDA B LDB BETA C LDC C AB A B C C A B B A C C n n C Z

T H

xTRMM SIDE UPLO TRANSA DIAG M N ALPHA A LDA B LDB B op B A B op B A op A A A A B m n S D C Z

1 1 T H

xTRSM SIDE UPLO TRANSA DIAG M N ALPHA A LDA B LDB B op A B B op B A op A A A A B m n S D C Z

2 Numerical BLAS – structure of function of names computation in C++

Introduction

Numerical Matrix type: algorithms, Operation: libraries and their DOT scalar product GE general performance GB general band Interlude: Profiling AXPY vector sum on Linux

MV matrix-vector product SY symmetric Expression template SV matrix-vector solve SB symmetric band generalities

MM matrix-matrix product SP symmetric packed Lazy code generation – what SM matrix-matrix solve HE hermitian it is & how it works

Numerical type: HB hermitian band LzCG example S single real HP hermitian packed Summary D double real TR triangular C single complex TB triangular band Z double complex TP triangular packed Numerical C++ type is encoded in the function (symbol) computation in C++ name Introduction

Numerical Listing 14: Back to basic expression template example algorithms, double eval >, libraries and their std :: vector >, performance valueop >, Interlude: Profiling binop >, on Linux std :: vector >, valueop >, Expression addop>, template binop >, generalities std :: vector >, valueop >, Lazy code addop>, generation – what binop >, it is & how it works std :: vector >, valueop> >( LzCG example binop >, Summary std :: vector >, valueop >, binop >, std :: vector >, valueop >, addop>, binop >, std :: vector >, valueop >, addop>, binop >, std :: vector >, valueop >, addop> const&, unsigned long ) Numerical C++ type is encoded in the function (symbol) computation in C++ name Introduction

Numerical algorithms, libraries and their performance Listing 15: Back to basic expression template example (raw Interlude: Profiling nm) on Linux 000000000040114a W Z4evalI5binopIS0 IS0 ISt6vectorIdSaIdEES3 \ Expression 7valueopES5 5addopES5 S6 ES5 EdRKS0 IT T0 S6 Em template generalities

Lazy code I The function/symbol name specifies exactly the generation – what algorithm that it should apply to its data it is & how it works LzCG example

I This information is available post-compile-time (as Summary simply as using nm)

I Implementation can be generated post-compilation and used in program simply by linking it (weak symbols make this trivial) Numerical Alternative view computation in C++

Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling What we get: on Linux Expression Expression templates allow export of a subset of the template generalities program parse outside the compiler environment. Lazy code generation – what it is & how it works

What would be the alternative? LzCG example Parsing the C++ source code ≡ writing new compiler Summary Numerical Potential advantages of LzCG computation in C++

General: Introduction Numerical 1. A very simple, clean, efficient mechanism for algorithms, libraries and their separating specification of what needs to be from how performance

its done Interlude: Profiling on Linux 2. A mechanism introducing an embedded language in Expression C++ that can be implemented outside traditional C++ template generalities

compilation scheme Lazy code generation – what Specific: it is & how it works 1. Can try multiple algorithms and select the LzCG example experimentally most efficient Summary 2. Can detect hardware configuration and generate efficient code without access to source code 3. Can use custom and third-party code generators (GPU compilers, cluster compute tools, or Ocaml + C-compiler!) Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical Example introduction – Fast Fourier computation in C++ Transforms Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression I Optimum FFT algorithm depends on array length, template generalities cache sizes, processor architecture, etc., etc Lazy code generation – what I Normally selected by benchmarking at run-time it is & how it works

I Can we do better if we know array size at LzCG example compile-time? Summary Numerical Code computation in C++

Listing 16: Call to compute the forward FFT Introduction Numerical boost :: array inp ; algorithms, libraries and their boost :: array out ; performance Interlude: Profiling on Linux FFTForward(inp , out); Expression template 1. Array sizes are known at compile-time generalities Lazy code 2. Can select optimum algorithm as soon as we know generation – what what machine we run on it is & how it works LzCG example I This may be after compile-time but must be before Summary run-time 3. Selection before run-time: 3.1 Removes potentially lengthy run-time algorithm selection 3.2 Makes the program performance more predictable 3.3 Reduces code size 3.4 Allow selection from wider range of algorithms Numerical Code computation in C++

Introduction

Numerical Listing 17: Call to FFTW the old-fashioned way algorithms, libraries and their f f t w p l a n p= performance Interlude: Profiling f f t w p l a n d f t 1 d (5 , on Linux ( fftw complex ∗)(& i n [ 0 ] ) , Expression template ( fftw complex ∗)(&out[0]) , generalities FFTW FORWARD, Lazy code generation – what FFTW ESTIMATE ) ; it is & how it works f f t w execute(p); LzCG example Summary 1. First call to create the plan can be time consuming (e.g., ∼ 1 second !) 2. Linking the entire library – ‘codelets’ + the algorithm selection code Numerical Code computation in C++

Introduction

Numerical algorithms, libraries and their Listing 18: Declaration of the FFTForward template performance Interlude: Profiling template on Linux void FFTForward( const boost :: array &in , Expression template boost :: array &out ) ; generalities Lazy code generation – what Listing 19: Call to compute the forward FFT it is & how it works boost :: array inp ; LzCG example Summary boost :: array out ;

FFTForward(inp , out); Numerical Code computation in C++

Listing 20: Signature of the call to FFTForward (nm -C) Introduction 00000000004005e3 W void Numerical FFTForward(boost :: array const&, algorithms, boost :: array&) libraries and their performance

This is all the information we need to select an algorithm: Interlude: Profiling 1. Parse signatures to identify all instances of on Linux Expression FFTForward template 2. Machine generate new C++ code with specialisation generalities Lazy code for each FFTForward instance but with optimum generation – what algorithm pre-selected it is & how it works LzCG example 3. Compile and link these with original code Summary Note:

I Do not need access to the application source code – just the object file

I Can do the code generation on a different computer, using different tools, compilers & languages Numerical Implementation computation in C++

Introduction

Numerical algorithms, libraries and their Listing 21: Parse symbol table performance Interlude: Profiling def analyseCal(fnamein ): on Linux mre=re.compile(”.∗ void FFTForward\d+) ul >\\(\ boost :: array const &, \ Expression boost :: array&\\).∗ ”) template stable=subprocess.Popen([ ”nm” , ”−C”, fnamein], generalities stdin=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0] Lazy code for l in stable.split(”\n ” ) : generation – what m=mre.match( l ) it is & how it works i f m: s=mkFunc( int (m.group(”N” ))) LzCG example Summary 1. Simple regular expression on output of the nm -C 2. The array length is extracted from the signature Numerical Implementation computation in C++

Introduction Listing 22: Algorithm selection Numerical algorithms, def templateProg(N): libraries and their return ””” performance # include i n t main ( void ) Interlude: Profiling { on Linux const s i z e t N=%i ; fftw complex ∗in , ∗out ; Expression f f t w p l a n p ; template i n = ( fftw complex ∗) f f t w malloc(sizeof(fftw complex ) ∗ N); generalities out = ( fftw complex ∗) f f t w malloc(sizeof(fftw complex ) ∗ N); p = f f t w p l a n d f t 1 d (N, in , out , FFTW FORWARD, FFTW MEASURE) ; Lazy code f f t w p r i n t p l a n ( p ) ; // This outputs the plan to stdout generation – what } it is & how it works ” ” ” % (N/ 2 ) LzCG example

Summary 1. Trivial C program that calls FFTW with right array size 2. Execute this at lazy-code-generation-time 3. Print the selected best algorithm 4. Hardcode this algorithm selection in the specialised function Numerical Implementation computation in C++

Listing 23: Emit a specialised function for this array size Introduction Numerical def mkFunc (N ) : algorithms, mkProg(templateProg(N)) libraries and their plan=open( ”tmpProgOut” ). read() performance s= ” ” ” # include Interlude: Profiling # include ”fftbind .hpp” on Linux # include ” f f t w 3 . h ” template <> Expression void FFTForward(const boost :: array &in , template boost :: array &out ) generalities { const char ∗mplan=”%s” ; Lazy code f f t w i m p o r t w i s d o m f r o m string(mplan); generation – what f f t w p l a n p= f f t w p l a n d f t 1 d (%i , it is & how it works ( fftw complex ∗) const cast(&i n [ 0 ] ) , LzCG example ( fftw complex ∗)(&out[0]) , FFTW FORWARD, Summary FFTW ESTIMATE ) ; f f t w execute(p); f f t w d e s t r o y p l a n ( p ) ; } ”””% (N, N, N, plan, N/2) return s

1. The plan is stored as string literal within each specialised function Numerical What have we achieved? computation in C++

Introduction

Numerical algorithms, libraries and their 1. Optimum, pre-selected FFTW transform for each performance Interlude: Profiling array of known size at compile-time – efficiency on Linux Expression 2. Could switch to multi-core/GPU/etc without access to template source code – modularity generalities Lazy code generation – what Implementation shortcomings (this is a quick example!): it is & how it works I The entire FFTW is linked-in – not just the specific LzCG example algorithms Summary

I Loading plans at run-time could be significant for small transforms Numerical Outline computation in C++

Introduction

Numerical Introduction algorithms, libraries and their performance Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Expression Interlude: Profiling on Linux template generalities

Lazy code Expression template generalities generation – what it is & how it works

LzCG example

Lazy code generation – what it is & how it works Summary

LzCG example

Summary Numerical Simple C++ code is not numerically efficient computation in C++

Introduction

Numerical algorithms, libraries and their 1. Function arguments evaluated promptly: performance Interlude: Profiling vector myVect(10000000, 3.0); on Linux Expression takeOne(1.0/myVect, 5); template generalities 2. Only binary operators Lazy code generation – what vector myVect(10000000, 3.0); it is & how it works vector myVect2=myVect+myVect+myVect ;LzCG example Summary 3. Optimum algorithm sometimes can not be selected at compile-time Numerical Expression templates resolve many of these computation in C++ issues Introduction

Numerical algorithms, libraries and their performance

Interlude: Profiling on Linux

Expression 1. Expressions create objects with type that specifies the template algorithm to be carried out generalities Lazy code 2. This type is interpreted at compile-time (through generation – what partial specialisation) to generate an efficient it is & how it works LzCG example algorithm implementation of the whole expression Summary ⇒ Blitz++, Boost::UBLAS, Armadillo, Eigen, NT2 Numerical Lazy code generation computation in C++

Introduction

Numerical algorithms, 1. The type of expression templates is available in object libraries and their performance code Interlude: Profiling 00000000004005e3 W void on Linux FFTForward(boost :: array const&, boost :: array&, Expression boost :: array&) template generalities

Lazy code 2. It can be interpreted post-compilation-time to generation – what generate code for the implementation it is & how it works LzCG example I Try different algorithms Summary I Delay algorithm selection for final hardware I Increased modularisation 3. The original implementation can be empty or a fall-back Numerical When to use Lazy code generation computation in C++

Introduction

Numerical algorithms, libraries and their 1. Need to do empirical selection of optimum algorithms performance Interlude: Profiling 2. Want to do code generation using non-C++ compiler on Linux tools: Expression template I OcamL → C+intrinsics → machine code generalities I GPU compiler? Lazy code generation – what 3. Need to re-establish clean separation between it is & how it works application and library LzCG example Summary I Update implementations without access to original source code I Licensing concerns, proprietary libraries Numerical computation in C++

Introduction

Numerical algorithms, libraries and their performance Thanks for listening! Interlude: Profiling on Linux

Expression I See http://www.bnikolic.co.uk/ and template generalities http://www.mrao.cam.ac.uk/˜bn204/ and for Lazy code more information generation – what it is & how it works

LzCG example

Summary