Exascale Computing Without Templates

Karl Rupp, Richard Mills, Barry Smith, Matthew Knepley, Jed Brown

Argonne National Laboratory

DOE COE Performance Portability Meeting, Denver August 22-24, 2017 ProvidingProviding Context Context

Exascale Computing without Threads˚ A White Paper Submitted to the DOE High Performance Computing Operational Review (HPCOR) on Scientific Software Architecture for Portability and Performance August 2015

Matthew G. Knepley1, Jed Brown2, Barry Smith2, Karl Rupp3, and Mark Adams4 1Rice University, 2Argonne National Laboratory, 3TU Wien, 4Lawrence Berkeley National Laboratory [email protected], [jedbrown,bsmith]@mcs.anl.gov, [email protected], [email protected] Abstract We provide technical details on why we feel the use of threads does not offer any fundamental performance advantage over using processes for high-performance computing and hence why we plan to extend PETSc to exascale (on emerging architectures) using node-aware MPI techniques, including neighborhood collectives and portable shared memory within a node, instead of threads.

https://www.orau.gov/hpcor2015/whitepapers/ExascaleIntroduction. With the recent shift toward many cores on a singleComputing node with shared-memorywithout Threads-Barry domains, a hybrid Smith.pdf approach has became in vogue: threads (almost always OpenMP) within each shared-memory domain (or within each processor socket in order to address NUMA issues) and MPI across nodes. We believe that portions of the HPC community have adopted the point of view that somehow threads are “necessary” in order to utilize such PETSc Developerssystems, Care (1) without About fully understanding Recent the Developments alternatives, including MPI 3 functionality, (2) underestimating the difficulty of utilizing threads efficiently, and (3) without appreciating the similarities of threads and processes. This short paper, due to space constraints, focuses exclusively on issue (3) since we feel it has gotten virtually no After careful evaluation:attention. Favor MPI 3.0 (and later) over threads A common misconception is that threads are lightweight, whereas processes are heavyweight, and that interthread Find the best long-termcommunication is solutions fast, whereas inter-process for our communication users is slow (and might even involve a dreaded ). Although this was true at certain times and on certain systems, we claim that in the world relevant to HPC today it is incorrect. To understand whether threads are necessary to HPC, one must understand the exact technical Consider bestimportance solutions of the for differences large-scale (and similarities) applications, between threads and not processes just and toy-apps whether these differences have any bearing on the utility of threads versus processes. 2 To prevent confusion, we will use the term stream of execution to mean the values in the registers, including the program counter, and a stack. Modern systems have multiple cores each of which may have multiple sets of (independent) registers. Multiple cores mean that several streams of execution can run simultaneously (one per core), and multiple sets of registers mean that a single core can switch between two streams of execution very rapidly, for example in one clock cycle, by simply switching which set of registers is used (so-called hardware threads or hyperthreading). This hyperthreading is intended to hide memory latency. If one stream of execution is blocked on a memory load, the core can switch to the other stream, which, if it is not also blocked on a load, can commence computations immediately. Each stream of execution runs in a virtual address space. All addresses used by the stream are translated to physical addresses by the TLB (translation look-aside buffer) automatically by the hardware before the value at that address is loaded from the caches or memory into a register. In the past, some L1 caches used virtual addresses, which meant that switching between virtual address spaces required the entire cache to be invalidated (this is one reason the myth of heavyweight processes developed); but for modern systems the L1 caches are hardware address aware and changing virtual address spaces does not require flushing the cache. Another reason switching processes in the past was expensive was that some systems would flush the TLB when switching virtual address spaces. Modern systems allow multiple virtual address spaces to share different parts of the TLB. With , the of choice for most HPC systems, the only major difference between threads and processes is that threads (of the same process) share the same virtual address space while different processes have different virtual address spaces (though they can share physical memory). Scheduling of threads and processes is handled in the same way, hence the cost of switching between threads or between processes is the same. Thus, the only fundamental difference is that threads share all memory by default, whereas processes share no memory by default. The one true difference between threads and processes is that threads can utilize the entire TLB for virtual address translation, whereas processes each get a portion of the TLB. With modern large TLBs we argue that this

˚Note that this discussion relates to threads on multicore systems and does not address threads on GPU systems.

1 ProvidingProviding Context Context

http://mooseframework.org/static/media/wiki/images/229/b61cdbc1e8be71dae37adc31a688d209/moose-arch.png 3 Aftermath Sequential build times for the ViennaCL test suite Sieve: Replaced by DMPlex (written in ) 4 10 ViennaGrid: Version 3.0 provides C-ABI 1.7.0 1.6.0 ViennaCL: Rewrite in C likely 1.5.0 103 1.4.0

1.3.0 Time (sec) 102 1.1.0 1.2.0

1.0.0

101 2010 2011 2012 2013 2014 2015 2016 Year

ProvidingProviding Context Context

Our Attempts in C++ Development Sieve: Several years of C++ mesh management attempts in PETSc ViennaGrid 2.x: Heavily templated C++ mesh management library ViennaCL: Dense and sparse linear algebra and solvers for multi- and many-core architectures

4 ProvidingProviding Context Context

Our Attempts in C++ Library Development Sieve: Several years of C++ mesh management attempts in PETSc ViennaGrid 2.x: Heavily templated C++ mesh management library ViennaCL: Dense and sparse linear algebra and solvers for multi- and many-core architectures

Aftermath Sequential build times for the ViennaCL test suite Sieve: Replaced by DMPlex (written in C) 4 10 ViennaGrid: Version 3.0 provides C-ABI 1.7.0 1.6.0 ViennaCL: Rewrite in C likely 1.5.0 103 1.4.0

1.3.0 Time (sec) 102 1.1.0 1.2.0

1.0.0

101 2010 2011 2012 2013 2014 2015 2016 4 Year Dealing with Compilation Errors Type names pollute output Replicated across interfaces CRTP may result in type length explosion

Default arguments become visible https://xkcd.com/303/ Type Length std::vector 38 std::vector > 109 std::vector > > 251 std::vector > > > 539

DisadvantagesDisadvantages of of C++ C++ Templates Templates

Static Dispatch Architecture-specific information only available at run time “Change code and recompile” not acceptable advice

5 DisadvantagesDisadvantages of of C++ C++ Templates Templates

Static Dispatch Architecture-specific information only available at run time “Change code and recompile” not acceptable advice

Dealing with Compilation Errors Type names pollute compiler output Replicated across interfaces CRTP may result in type length explosion

Default arguments become visible https://xkcd.com/303/ Type Length std::vector 38 std::vector > 109 std::vector > > 251 std::vector > > > 539

5 DisadvantagesDisadvantages of of C++ C++ Templates Templates

https://tgceec.tumblr.com/

6 Example Consider vector updates in pipelined CG method:

xi ← xi−1 + αpi−1

ri ← ri−1 − αyi

pi ← ri + βpi−1

Reuse of pi−1 and ri−1 easy with for-loops, but hard with expression templates

DisadvantagesDisadvantages of of C++ C++ Templates Templates

Scope Limitations metaprogramming lacks state Optimizations across multiple code lines difficult or impossible

7 DisadvantagesDisadvantages of of C++ C++ Templates Templates

Scope Limitations Template metaprogramming lacks state Optimizations across multiple code lines difficult or impossible

Example Consider vector updates in pipelined CG method:

xi ← xi−1 + αpi−1

ri ← ri−1 − αyi

pi ← ri + βpi−1

Reuse of pi−1 and ri−1 easy with for-loops, but hard with expression templates

7 Lack of a Stable ABI Object files from different generally incompatible Name mangling makes use outside C++ land almost impossible

High Entry Bar Number of potential contributors inversely proportional to code sophistication Domain scientists have limited resources for C++ templates

DisadvantagesDisadvantages of of C++ C++ Templates Templates

Complicates Debugging Stack traces get longer names and deeper Setting good breakpoints may become harder

8 High Entry Bar Number of potential contributors inversely proportional to code sophistication Domain scientists have limited resources for C++ templates

DisadvantagesDisadvantages of of C++ C++ Templates Templates

Complicates Debugging Stack traces get longer names and deeper Setting good breakpoints may become harder

Lack of a Stable ABI Object files from different compilers generally incompatible Name mangling makes use outside C++ land almost impossible

8 DisadvantagesDisadvantages of of C++ C++ Templates Templates

Complicates Debugging Stack traces get longer names and deeper Setting good breakpoints may become harder

Lack of a Stable ABI Object files from different compilers generally incompatible Name mangling makes use outside C++ land almost impossible

High Entry Bar Number of potential contributors inversely proportional to code sophistication Domain scientists have limited resources for C++ templates

8 Development Implications Adopt professional software development practices Develop, maintain, and evolve different datastructures ...... and code paths Use clear and easy-to-understand datastructures Fallacy: “Writing” an application only once in its final form

AA Path Path Forward Forward

0:6 F. Van Zee and T. Smith

5th loop around micro-kernel nC nC

A Manage Complexity Cj += Bj

th Good interface design 4 loop around micro-kernel k C Bp A Refactor code when needed Cj += p

kC ~ Hand-optimize small kernels only (cf. BLIS methodology) Pack Bp → Bp 3rd loop around micro-kernel ~ m m Ci C C Ai Bp

+= ~ Pack Ai → Ai

2nd loop around micro-kernel ~ nR ~ nR Ci Ai Bp

mR +=

kC 1st loop around micro-kernel nR

mR

+= kC

micro-kernel main memory 1 L3 cache L2 cache L1 cache += 1 registers

Fig. 1. An illustration of the algorithm for computing high-performance , as expressed 9 within the BLIS framework [Van[F. Zee Van and van Zee, de Geijn T. 2015]. Smith, ACM TOMS 2017]

ACM Transactions on Mathematical Software, Vol. 0, No. 0, Article 0, Publication date: 2016. AA Path Path Forward Forward

0:6 F. Van Zee and T. Smith

5th loop around micro-kernel nC nC

A Manage Complexity Cj += Bj

th Good interface design 4 loop around micro-kernel k C Bp A Refactor code when needed Cj += p

kC ~ Hand-optimize small kernels only (cf. BLIS methodology) Pack Bp → Bp 3rd loop around micro-kernel ~ m m Ci C C Ai Bp

+= ~ Pack Ai → Ai

2nd loop around micro-kernel ~ Development Implications nR ~ nR Ci Ai Bp

mR Adopt professional software development practices +=

kC 1st loop around micro-kernel Develop, maintain, and evolve different datastructures ... nR

mR k ... and code paths += C

Use clear and easy-to-understand datastructures micro-kernel main memory 1 L3 cache Fallacy: “Writing” an application only once in its final form L2 cache L1 cache += 1 registers

Fig. 1. An illustration of the algorithm for computing high-performance matrix multiplication, as expressed 9 within the BLIS framework [Van[F. Zee Van and van Zee, de Geijn T. 2015]. Smith, ACM TOMS 2017]

ACM Transactions on Mathematical Software, Vol. 0, No. 0, Article 0, Publication date: 2016. Required Incentives Reward contributions to existing projects Pair research funding with software development funding Establish software development career tracks

AA Path Path Forward Forward

Spending Development Resources Reuse existing libraries — reinventing the wheel is not productive! Focus on domain- and application-specific aspects Obtain expertise and resources for continuous code evolution

10 AA Path Path Forward Forward

Spending Development Resources Reuse existing libraries — reinventing the wheel is not productive! Focus on domain- and application-specific aspects Obtain expertise and resources for continuous code evolution

Required Incentives Reward contributions to existing projects Pair research funding with software development funding Establish software development career tracks

10 AA Path Path Forward Forward

Is Performance Portability Just a Software Productivity Aspect?

https://www.nitrd.gov/PUBS/CSESSPWorkshopReport.pdf 11 SummarySummary

Long-Term Problems of Heavy C++ Templates Use Template metaprogramming is a leaky abstraction Excessive type names slow down all stages of Compile-Run-Debug-cycle Templates operate at compile time - architecture ultimately known at run time

A Path Forward Adopt professional software development practices Be prepared to develop different datastructures and code paths Write clear, readable code using simple datastructures Evolve and refactor datastructures, kernels, and interfaces over time (cf. software productivity discussions)

12