Concurrent Collections (CnC)

Kath Knobe Kath.knobe@.com

2012/03/29

Software and Services Group

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory


The Big Idea

• Don’t think about what operations can run in parallel – Difficult and depends on target

• Think about the semantic ordering constraints only – Easier and depends only on application

Exactly Two Sources Of Ordering Requirements

• Producer must execute before consumer

Producer - consumer:

(step1) → [item] → (step2)

• Controller must execute before controllee

Controller - controllee:

(step1) → <tony> :: (step2)
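Both kinds of requirement put instances on the same footing: each is just "X must come before Y". A minimal sketch in plain C++ (illustrative names, not CnC code) checks a schedule against both kinds at once:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A schedule is legal iff every producer precedes its consumer and every
// controller precedes its controllee -- one check covers both kinds.
bool legal(const std::vector<std::string>& schedule,
           const std::vector<std::pair<std::string, std::string>>& must_precede) {
    std::map<std::string, int> pos;            // instance name -> position
    for (int i = 0; i < (int)schedule.size(); ++i) pos[schedule[i]] = i;
    for (const auto& p : must_precede)
        if (pos.at(p.first) >= pos.at(p.second)) return false;
    return true;
}
```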

Cholesky factorization

[Figure, built up over several slides: tiled Cholesky factorization. In each iteration, the Cholesky step factors the diagonal tile, Trisolve solves the tiles below it, and Update modifies the remaining tiles. Slightly simplified.]

4 simple steps to a CnC application


1: The whiteboard level

• Data
• Computations
• Producer/consumer relations among them
• I/O

(Cholesky)   (Trisolve)   (Update)   [Array]


2: Distinguish among the instances

• Distinct instances (chunks of data or computation) may be distinctly scheduled and mapped

• A collection is a statically named set of dynamic instances

(Cholesky: iter)   (Trisolve: row, iter)   (Update: col, row, iter)   [Array: col, row, iter]
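What "a statically named set of dynamic instances" means for an item collection can be sketched in plain C++ (illustrative only, not the Intel CnC API): one static name, many (tag, value) instances, each written at most once.

```cpp
#include <map>
#include <stdexcept>

// One static collection name; dynamic instances indexed by tag.
template <typename Tag, typename Value>
class item_collection {
    std::map<Tag, Value> items_;
public:
    void put(const Tag& tag, const Value& value) {
        if (!items_.emplace(tag, value).second)           // dynamic single
            throw std::logic_error("tag written twice");  // assignment
    }
    const Value& get(const Tag& tag) const { return items_.at(tag); }
};
```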


• Optional: tag functions relating the tags, e.g., Cholesky: iter → Array: iter, iter, iter

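The tag function above can be spelled out as code. A hedged sketch (the tuple type is an illustrative stand-in, not a CnC type):

```cpp
#include <array>

// Cholesky: iter → Array: iter, iter, iter
// The Cholesky step instance tagged `iter` touches the Array item
// instance tagged (iter, iter, iter).
using item_tag = std::array<int, 3>;   // col, row, iter

item_tag cholesky_to_array(int iter) {
    return {iter, iter, iter};
}
```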

Control tag

<Ceasar> :: (foo)

• A tuple – a unique identifier

• Meaning of the control relationship
  – For each row/col/iteration tuple in the collection Ceasar, the corresponding instance of foo will execute sometime.
  – An instance of foo has access to the values of row, col and iteration.

• Control dependence and data dependence are on the same footing
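The control relationship above can be sketched in plain C++ (stand-ins, not the CnC runtime): prescribing the step foo by the tag collection Ceasar means "for each tag instance, one instance of foo executes, with access to the tag's components".

```cpp
#include <set>
#include <tuple>
#include <vector>

using tag_t = std::tuple<int, int, int>;   // row, col, iteration

std::vector<tag_t> run_prescribed(const std::set<tag_t>& Ceasar) {
    std::vector<tag_t> executed;
    for (const tag_t& t : Ceasar)    // every tag in Ceasar ...
        executed.push_back(t);       // ... fires one instance of foo
    return executed;
}
```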


3: What are the control tag collections

<CholeskyTag: iter> :: (Cholesky: iter)
<TrisolveTag: row, iter> :: (Trisolve: row, iter)
<UpdateTag: col, row, iter> :: (Update: col, row, iter)
[Array: col, row, iter]


4: Who produces control

[Same graph as in step 3, now with edges added to show which step produces each control tag collection.]

Actual app: cell tracker

(Cell detector: K)

[Cell candidates: K ]

[Histograms: K] [Input Image: K]

(Cell tracker: K) [Labeled cells initial: K] (Arbitrator initial: K)

[State measurements: K]

(Correction filter: K, cell)

[Motion corrections: K , cell]

[Labeled cells final: K] (Arbitrator final: K)

[Final states: K]

(Prediction filter: K, cell)

[Predicted states: K, cell]

Sample step (cholesky)

// Perform unblocked Cholesky factorization on the input tile.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    const int iter = t;
    const int b = c.b;

    // Get data used by this step
    tile_const_ptr_type A_block;
    c.Array.get( triple(iter, iter, iter), A_block );  // get the input tile

    // Normal computation
    tile_ptr_type L_block = std::make_shared< tile_type >( b );  // allocate output tile
    for( int k_b = 0; k_b < b; k_b++ ) {
        // do all the math on this tile
    }

    // Put data produced by this step
    c.Array.put( triple(iter+1, iter, iter), L_block );  // write the output tile

    return CnC::CNC_Success;
}

• Steps have no side effects
• Steps are atomic
• Items are dynamic single assignment

Shared memory Spike performance (James Brodman, UIUC and Intel)

• Parallel solver for banded linear systems of equations
• Complex algorithm but “easily” maps onto CnC
• Same code for shared and distributed memory

[Chart: runtime in seconds for MKL, MKL+OMP, HTA (TBB), CnC, HTA (MPI), and DistCnC; shared/distributed memory runs on 4 nodes over GigE; problem size 1048576 with 128 super/sub diagonals, 32 partitions.]


Semantics of CnC

• Semantics are defined in terms of attributes of instances.
• The current state is described by the current set of instances and their attributes.

tag: avail
step: controlReady, dataReady, ready, executed
item: avail
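An illustrative encoding of the step attributes above (not the runtime's real types): a step instance is ready once it is both controlReady (its prescribing tag is avail) and dataReady (its input items are avail); only a ready step may become executed.

```cpp
struct step_instance {
    bool controlReady = false;   // prescribing tag is avail
    bool dataReady = false;      // all input items are avail
    bool executed = false;
    bool ready() const { return controlReady && dataReady && !executed; }
};
```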

CnC is a coordination language: paired with a computation language

• Existing:
  – C++ (Intel)
  – Java (Rice)
  – Haskell (Indiana)
• In the works:
  – Fortran (Rice)
  – Chapel (UIUC)
• Next up:
  – Python (Rice)
  – Matlab (Rice)
  – Scala (Rice)
  – APL (Indiana)

Rice = Vivek Sarkar’s group; Indiana = Ryan Newton’s group; UIUC = David Padua’s group

CnC is a high-level model: it allows for a wide variety of runtime approaches

grain     distribution   schedule   implementation
static    static         static     HP TStreams
static    static         dynamic    HP TStreams
static    dynamic        dynamic    Intel CnC
static    dynamic        dynamic    Rice CnC
dynamic   dynamic        dynamic    Georgia Tech CnC

Possible expressions of the model

• Graphical interface

• Textual representation
  – Rice version
  – Previous Intel versions

• API
  – Current Intel release

Key properties

• Think about the meaning of your program
  – Not about parallelism
• Deterministic & serializable
  – Easy to debug
• Separation of concerns simplifies everything
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec is free of tuning and largely independent of architecture
• Unified programming for shared and distributed memory
  – Not like an MPI/OpenMP combo

Intel® Concurrent Collections for C++

• New whatif release
  – Header files, shared libs
  – Samples
  – Documentation: CnC concepts, using CnC, tutorials

• Template library with runtime
  – For Windows (IA-32/Intel-64) and Linux (Intel-64)
  – Shared and distributed memory

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory

Tuning expert

with Rice: Mike Burke, Sanjay Chatterjee, Zoran Budimlic

• Without tuning
  – Very good performance
  – Typically more than enough parallelism
  – Asynchronous execution

• Multiple tuning specs for each domain spec
  – The tuning spec is additional and separate
  – Semantic constraints from the domain spec are still requirements
  – A tuned execution is one of the many legal executions of the domain spec. Goal: guide away from bad ones.

Foundation for tuning: hierarchical affinity groups

• Affinity groups
  – Avoid mappings with poor locality (temporal + spatial)
  – No benefit if too far away in either time or space
  – The foundation doesn’t distinguish space from time
  – On top of this foundation we allow (but don’t require) time-specific and space-specific control
• Hierarchy
  – Same low-level group => tight affinity
  – Same higher-level group => weaker affinity

Cholesky domain spec

<CholeskyTag: iter> :: (Cholesky: iter)
<TrisolveTag: row, iter> :: (Trisolve: row, iter)
<UpdateTag: col, row, iter> :: (Update: col, row, iter)
[Array: col, row, iter]

Hierarchical affinity structure

[Diagram: the three steps drawn inside nested boxes: (Cholesky:) at the outer level, with (Trisolve:) and (Update:) together in an inner box.]

Hierarchical affinity groups have names

AFFINITY GROUP GroupC:
  (Cholesky:)
  AFFINITY GROUP GroupTU:
    (Trisolve:)
    (Update:)

We distinguish among instances of affinity groups

AFFINITY GROUP GroupC: iter
  (Cholesky: iter)
  AFFINITY GROUP GroupTU: row, iter
    (Trisolve: row, iter)
    (Update: col, row, iter)

Specify the instances of each affinity group

AFFINITY GROUP GroupC: iter (instances given by <CholeskyTag: iter>)
  (Cholesky: iter)
  AFFINITY GROUP GroupTU: row, iter (instances given by <TrisolveTag: row, iter>)
    (Trisolve: row, iter)
    (Update: col, row, iter)
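The grouping above can be sketched as plain functions (illustrative, not a tuning-spec syntax): a step instance's group instance is found by dropping the tag components the group does not distinguish.

```cpp
#include <utility>

std::pair<int, int> groupTU_of_trisolve(int row, int iter) {
    return {row, iter};                     // GroupTU: row, iter
}

std::pair<int, int> groupTU_of_update(int col, int row, int iter) {
    (void)col;  // every col of a given (row, iter) lands in the same instance
    return {row, iter};
}

int groupC_of(int iter) { return iter; }    // GroupC: iter (the outer group)
```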

Cholesky: instances of the outer and inner groups

[Figure, built up over several slides: the triangular space of step instances per iteration (one Cholesky, a column of Trisolves, a triangle of Updates). The outer group has one instance per iter; the inner group has one instance per (row, iter).]

Space-specific mappings

(Same affinity-group structure as above.) Space mappings that can be attached to a group:

• Replicate components of this group across sockets
• Distribute components of this group across nodes
• Distribute components of this group across nodes via func()

Time-specific mappings

Relative time among components within a group.

                  ordered          unordered
non-overlapping   serial/barrier   exclusive
overlapping       priority         arbitrary

Space-time combinations

{big_memory_footprint: i}
  – Distribute across address spaces (space)
  – Within each address space: exclusive (time)

{video_processing: frameID}
  – Cyclic distribution among address spaces (space)
  – Ordered & overlapping within address spaces (time)

Related work

• Control of time
  – Various types of parallelism: data parallel, task parallel, streaming, fork-join, …
• Control of space
  – Support for distributions: Hierarchical Place Trees, HPF, …

CnC can control both time and space, in a unified way. We allow (but don’t require) the tuning expert to blur the distinction between them.

Knobe & Burke, “Tuning Language for Concurrent Collections”, Compilers for Parallel Computing (CPC 2012)

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory

Distributed CnC

Intel: Frank Schlimbach

• CnC is a declarative and high-level model
  – Abstracts from the memory model
• Provides a unified programming model for shared and distributed memory
  – Same binary runs on both
  – Easy to switch between shared and distributed memory (no explicit message passing needed)
• CnC comes with a control methodology
  – Allows controlling work distribution
• CnC comes with a data methodology
  – Provides hooks for selective data distribution

Limitations of distributed computing apply

• Usual caveats for distributed memory apply
  – e.g., the ratio between data exchange and computation

• Different algorithms might be needed for distributed or shared memory
  – The programming methodology and framework stay the same in any case
  – Over a wide class of applications the algorithm stays the same

Scalability and efficiency: Spike

[Charts: Spike scalability (time in seconds, 1–32 nodes) and parallel efficiency (2–32 nodes), 24 hardware threads per node; matrix 10M×10M with 257 bands.]

Distributed CnC: without tuning (UTS)

• Unbalanced tree search
  – Tree shape unknown
• CnC code is trivial
  – CnC: 151 loc
  – Shmem: ~1000 loc
  – MPI: ~800 loc
• The performance gap in the middle region is a load-balancing issue
  – Used the scheduler available in our release
  – Solved by a different scheduler

[Chart: UTS [T3XXL] speedup over 1 node for MPI and CnC, 16–256 threads/ranks on 1–16 nodes.]

Untuned performance often bad: tuning

[Charts: CnC time untuned (GigE, 1–4 nodes) and tuned (IB, 2–32 nodes) for the inverse, primes, mandelbrot, and cholesky benchmarks; 24 hardware threads per node.]

CnC tuning: easy and efficient

• The general default distribution of steps
  – Data is sent to where it is needed (requested)
  – Not always efficient

• The tuner can declare functions
  – consumed_by: which steps use this data item (semantics)
  – compute_on: where the step will execute (tuning)
  – To redistribute, modify compute_on: data items and steps are automatically adjusted

• The mapping tag->rank can be computed at
  – Compile time
  – Initialization time
  – Execution time
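An initialization-time tag->rank mapping can be sketched as a table filled once and then looked up; names here are illustrative, not the CnC tuner interface:

```cpp
#include <vector>

struct init_time_mapper {
    std::vector<int> rank_of;   // index = tag, value = rank
    init_time_mapper(int ntags, int nprocs) : rank_of(ntags) {
        for (int t = 0; t < ntags; ++t)
            rank_of[t] = t * nprocs / ntags;   // block distribution, fixed up front
    }
    int compute_on(int tag) const { return rank_of[tag]; }
};
```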

Distribution functions

Cyclic – simple:

int tuner::compute_on( const int & i ) const {
    return i % numProcs();
}

Blocked – minimizes data transfer:

int tuner::compute_on( int x, int y, int z ) const {
    return x * NBLOCKS_X / nx
         + ( y * NBLOCKS_Y / ny ) * nx
         + ( z * NBLOCKS_Z / nz ) * nx * ny;
}

Blocked cyclic – block size/shape is relevant:

int tuner::compute_on( int x, int y ) const {
    return ( ( x + y * m_nx ) / BLOCKSIZE ) % numProcs();
}


Scalability and efficiency: stencil

[Charts: RTM Stencil speedup (3k×1.5k², 32 time-steps; 2–32 nodes, 24 hardware threads each) and weak-scaling parallel efficiency (16 time-steps; 1–32 nodes; matrix sizes from 3.2e9 to 1.0e11 elements, 12–383 GB).]

• Efficiency stays above 97%!
• 3D 25-point stencil; reads from 2 previous time-steps
• Shmem performance similar to cache-oblivious algorithms

Distribution function for tree

At a certain tree depth, stop distributing and stay local. Used for quickSort.

1
2 3
4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
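A hedged sketch of this tree distribution, with heap-style numbering (root = 1, children of n are 2n and 2n+1): below the cutoff depth a node climbs to its ancestor at the cutoff level, so whole subtrees stay local to one rank. Names are illustrative, not the CnC tuner interface.

```cpp
int depth_of(int node) {                  // root is at depth 0
    int d = 0;
    while (node > 1) { node /= 2; ++d; }
    return d;
}

int compute_on_tree(int node, int cutoff_depth, int nprocs) {
    while (depth_of(node) > cutoff_depth)
        node /= 2;                        // stop distributing: stay with parent
    return node % nprocs;
}
```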

Summary

• Separation of concerns
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec
  – High-level: hides all the difficulties of low-level techniques
  – Exposes potential parallelism
  – Easy to debug (deterministic)
  – Efficient even with no tuning
• Tuning spec
  – Isolated
  – Simple
  – Flexible
• Support for distributed memory
  – Distribution is easy because the domain spec already identifies chunks of work and data
  – Same spec runs on shared and/or distributed memory
  – Distribution functions: isolated to support easy experimentation

Status

• General
  – Available
  – Yearly workshops
  – Used in some intro programming classes as a gentle introduction to parallelism
• Domain language
  – Very good performance on shared memory even without the tuning language
    • Bunch of benchmarks
    • Comparable performance to other tools
• Tuning language
  – Ubiquitous High Performance Computing (UHPC)
    • Vivek Sarkar’s group at Rice is currently implementing the tuning language for CnC.
• Distributed CnC
  – Available
  – Next: static analysis to generate distribution functions


New Intel CnC release: http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/

New Intel CnC Yahoo Group: http://tech.groups.yahoo.com/group/intel_concurrent_collections/

Rice CnC: http://habanero.rice.edu/cnc

This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.


Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, , Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Copyright © 2011-2012. Intel Corporation.

http://intel.com/software/products
