Concurrent Collections (CnC)
Kath Knobe Kath.knobe@intel.com
2012/03/29
Software and Services Group

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory
The Big Idea

• Don't think about what operations can run in parallel
  – Difficult and depends on target
• Think about the semantic ordering constraints only
  – Easier and depends only on application
Exactly Two Sources of Ordering Requirements
• Producer must execute before consumer
(diagram, producer/consumer: step1 produces item, which step2 consumes)
• Controller must execute before controllee
(diagram: the producer/consumer relation, step1 → item → step2, beside the controller/controllee relation, a control tag prescribing step1 and step2)
Cholesky factorization

(build slides: the tiled algorithm, slightly simplified, uses three step types – Cholesky, Trisolve, and Update – shown operating on successive tiles of the matrix)
4 simple steps to a CnC application
1: The whiteboard level
• Data
• Computations
• Producer/consumer relations among them
• I/O
(diagram: compute steps Cholesky, Trisolve, and Update; data item Array)
2: Distinguish among the instances
• Distinct instances (chunks of data or computation) may be distinctly scheduled and mapped
• A collection is a statically named set of dynamic instances
(diagram: Cholesky: iter; Trisolve: row, iter; Update: col, row, iter; Array: col, row, iter)
• Optional: tag functions relate the tags, e.g., Cholesky: iter → Array: iter, iter, iter
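A tag function is just a function over tag tuples. As a sketch (plain C++, not the Intel CnC tuner API; the function name is hypothetical), the mapping on the slide reads:

```cpp
#include <array>
#include <cassert>

// Sketch only: the tag function Cholesky: iter -> Array: iter, iter, iter
// written as an ordinary function over the step's tag.
std::array<int, 3> cholesky_input_tag(int iter) {
    return {iter, iter, iter};
}
```

Such functions let the runtime compute, from a step instance's tag alone, which item instances it touches.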
Control tag

(diagram: a control tag collection Ceasar prescribing a compute step foo)

• A tuple is a unique identifier for an instance
• Meaning of the control relationship
  – For each row/col/iteration tuple in the collection Ceasar, the corresponding instance of foo will execute sometime
  – An instance of foo has access to the values of row, col and iteration
• Control dependence and data dependence are on the same footing
3: What are the control tag collections
(diagram: control tag CholeskyTag: iter prescribes Cholesky: iter; TrisolveTag: row, iter prescribes Trisolve: row, iter; UpdateTag: col, row, iter prescribes Update: col, row, iter; data item Array: col, row, iter)
4: Who produces control
(diagram: the same graph, now showing which steps put each control tag collection)
(diagram: cell-tracking example graph; steps in parentheses, items in brackets: (Cell detector: K), (Cell tracker: K), (Arbitrator initial: K), (Arbitrator final: K), [Input Image: K], [Histograms: K], [Cell candidates: K], [Labeled cells initial: K], [State measurements: K], [Predicted states: K, cell], [Motion corrections: K, cell], [Labeled cells final: K], [Final states: K])

Sample step (cholesky)
// Perform unblocked Cholesky factorization on the input block.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    tile_const_ptr_type A_block;
    tile_ptr_type L_block;
    int b = c.b;
    const int iter = t;

    // Get data used by this step
    // (col and row are elided in this simplified excerpt)
    c.Array.get(triple(iter, col, row), A_block);     // get the input tile

    // Normal computation
    L_block = std::make_shared< tile_type >( b );     // allocate the output tile
    for (int k_b = 0; k_b < b; k_b++) {
        // do all the math on this tile
    }

    // Put data produced by this step
    c.Array.put(triple(iter+1, col, row), L_block);   // write the output tile

    return CnC::CNC_Success;
}

Steps have no side-effects. Steps are atomic. Items are dynamic single assignment.
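The dynamic-single-assignment rule can be modeled outside the library. Here is a toy item collection, a sketch only (not the Intel CnC API, class and method names are illustrative), that rejects a second put to the same tag:

```cpp
#include <map>
#include <stdexcept>
#include <cassert>

// Toy model of a CnC item collection. Real Intel CnC collections are
// templates over tag and item types with richer semantics; this sketch
// only illustrates the single-assignment rule mentioned on the slide.
template <typename Tag, typename Item>
class toy_item_collection {
    std::map<Tag, Item> m_items;
public:
    void put(const Tag& t, const Item& v) {
        // Dynamic single assignment: each tag may be written at most once.
        if (!m_items.insert({t, v}).second)
            throw std::logic_error("item already put for this tag");
    }
    bool get(const Tag& t, Item& out) const {
        auto it = m_items.find(t);
        if (it == m_items.end()) return false;   // not yet available
        out = it->second;
        return true;
    }
};

// Returns true if a second put to the same tag is rejected.
inline bool second_put_rejected() {
    toy_item_collection<int, double> c;
    c.put(1, 2.5);
    try { c.put(1, 3.0); } catch (const std::logic_error&) { return true; }
    return false;
}
```

Because each item is written exactly once, a consumer never observes a value mid-update, which is one reason CnC executions are deterministic.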
Shared memory Spike Performance (James Brodman, UIUC and Intel)

• Parallel solver for banded linear systems of equations
• Complex algorithm but "easily" maps onto CnC
• Same code for shared and distributed memory

(chart: runtime in seconds for MKL, MKL+OMP, HTA+TBB, CnC (TBB), HTA+MPI, and DistCnC; problem: 1048576 unknowns with 128 super/sub diagonals, 32 partitions; shared/distributed memory, 4 nodes, GigE)
Semantics of CnC
• Semantics are defined in terms of attributes of instances (e.g., avail)
• The current state is described by the current set of instances and their attributes

(diagram: a step instance becomes controlReady and dataReady; when both hold it is ready, and after running it is executed; an item instance becomes avail)
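The attribute transitions above can be sketched as a tiny state model (illustrative only, not the formal CnC semantics): a step instance is ready exactly when it is both controlReady and dataReady, and only a ready step may become executed.

```cpp
#include <cassert>

// Toy model of the step attributes on the slide.
struct step_state {
    bool controlReady = false;  // its control tag has been put
    bool dataReady    = false;  // every item it gets is available
    bool executed     = false;

    bool ready() const { return controlReady && dataReady && !executed; }

    // Execution is only legal once both prerequisites hold.
    bool try_execute() {
        if (!ready()) return false;
        executed = true;
        return true;
    }
};
```

This is why control and data dependence are "on the same footing": both are just prerequisites for the ready attribute.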
CnC is a coordination language: paired with a computation language
• Existing:
  – C++ (Intel)
  – Java (Rice)
  – Haskell (Indiana)
• In the works:
  – Fortran (Rice)
  – Chapel (UIUC)
• Next up:
  – Python (Rice)
  – Matlab (Rice)
  – Scala (Rice)
  – APL (Indiana)

Rice = Vivek Sarkar's group; Indiana = Ryan Newton's group; UIUC = David Padua's group
CnC is a high-level model: it allows for a wide variety of runtime approaches
grain     distribution  schedule  implementation
static    static        static    HP TStreams
static    static        dynamic   HP TStreams
static    dynamic       dynamic   Intel CnC
static    dynamic       dynamic   Rice CnC
dynamic   dynamic       dynamic   Georgia Tech CnC
Possible expressions of the model
• Graphical interface
• Textual representation
  – Rice version
  – previous Intel versions
• API
  – current Intel release
Key properties
• Think about the meaning of your program
  – Not about parallelism
• Deterministic & serializable
  – Easy to debug
• Separation of concerns simplifies everything
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec is free of tuning and largely independent of architecture
• Unified programming for shared and distributed memory
  – Not like an MPI/OpenMP combo
Intel® Concurrent Collections for C++
• New whatif release
  – Header files, shared libs
  – Samples
  – Documentation: CnC concepts, using CnC, tutorials
• Template library with runtime
  – For Windows (IA-32/Intel-64) and Linux (Intel-64)
  – Shared and distributed memory
Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory
Tuning expert
with Rice: Mike Burke, Sanjay Chatterjee, Zoran Budimlic
• Without tuning
  – Very good performance
  – Typically more than enough parallelism
  – Asynchronous execution
• Multiple tuning specs for each domain spec
  – The tuning spec is in addition and separate
  – Semantic constraints from the domain spec are still requirements
  – A tuned execution is one of the many legal executions of the domain spec. Goal: guide away from the bad ones.
Foundation for tuning: hierarchical affinity groups
• Affinity groups
  – Avoid mappings with poor locality (temporal + spatial)
  – No benefit if too far away in either time or space
  – The foundation doesn't distinguish space from time
  – On top of this foundation we allow (but don't require) time-specific and space-specific control
• Hierarchy
  – Same low-level group => tight affinity
  – Same higher-level group => weaker affinity
Cholesky domain spec
(diagram: the domain spec graph from before – CholeskyTag: iter, TrisolveTag: row, iter, and UpdateTag: col, row, iter prescribing Cholesky: iter, Trisolve: row, iter, and Update: col, row, iter; data item Array: col, row, iter)
Hierarchical affinity structure
(diagram: the three compute steps, with Trisolve and Update drawn in an inner box nested inside an outer box that also holds Cholesky)
Hierarchical affinity groups have names
(diagram: affinity group GroupC contains the Cholesky step and the nested affinity group GroupTU, which contains the Trisolve and Update steps)
We distinguish among instances of affinity groups
(diagram: as before, with instances named – GroupC: iter contains Cholesky: iter and GroupTU: row, iter, which contains Trisolve: row, iter and Update: col, row, iter)
Specify the instances of each affinity group
(diagram: the control tag collections now also specify the group instances – CholeskyTag: iter specifies the instances of GroupC: iter, and TrisolveTag: row, iter the instances of GroupTU: row, iter)
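The group-instance tags are projections of the step tags. As a sketch (plain functions, not tuning-spec syntax; the names are hypothetical), the membership of an Update instance in the two groups is:

```cpp
#include <array>
#include <cassert>

// Sketch: an Update instance (col, row, iter) belongs to the inner group
// GroupTU: (row, iter), which in turn belongs to the outer group GroupC: (iter).
std::array<int, 2> group_tu(int col, int row, int iter) {
    (void)col;              // the column does not identify the inner group
    return {row, iter};
}

int group_c(int row, int iter) {
    (void)row;              // only the iteration identifies the outer group
    return iter;
}
```

Dropping tag components as you move outward is what makes the hierarchy: many step instances share an inner group, and many inner groups share an outer one.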
Cholesky: instances of the outer group

(build slides: the triangle of step instances – one Cholesky, a column of Trisolves, and a triangle of Updates per iteration – with each outer-group instance GroupC: iter circling all the work of one iteration)

Cholesky: instances of the inner group

(build slides: within an iteration, each inner-group instance GroupTU: row, iter circles one Trisolve instance together with the Update instances of that row)
Space-specific mappings
(diagram: the affinity-group structure from the previous slides)

• Replicate components of this group across sockets
• Distribute components of this group across nodes
• Distribute components of this group across nodes via func()
Time-specific mappings
Relative time among components within a group:

                 ordered         unordered
non-overlapping  serial/barrier  exclusive
overlapping      priority        arbitrary
Space-time combinations
{big_memory_footprint: i}
  – Distribute across address spaces (space)
  – Within each address space: exclusive (time)

{video_processing: frameID}
  – Cyclic distribution among address spaces (space)
  – Ordered & overlapping within address spaces (time)
Related work
• Control of time
  – Various types of parallelism: data parallel, task parallel, streaming, fork-join, …
• Control of space
  – Support for distributions: Hierarchical Place Trees, HPF, …
CnC can control both time and space, and in a unified way. We allow (but don't require) the tuning expert to blur the distinction between them.
Knobe & Burke “Tuning Language for Concurrent Collections” Compilers for Parallel Computing (CPC 2012)
Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory
Distributed CnC (Intel: Frank Schlimbach)
• CnC is a declarative and high-level model
  – Abstracts from the memory model
• Provides a unified programming model for shared and distributed memory
  – Same binary runs on both
  – Easy to switch between shared and distributed memory (no explicit message passing needed)
• CnC comes with a control methodology
  – Allows controlling work distribution
• CnC comes with a data methodology
  – Provides hooks for selective data distribution
Limitations of distributed computing apply
• The usual caveats for distributed memory apply
  – E.g. the ratio between data exchange and computation
• Different algorithms might be needed for distributed or shared memory
  – The programming methodology and framework stay the same in any case
  – Over a wide class of applications the algorithm stays the same
Scalability and efficiency: Spike
(charts: Spike scalability, time in seconds vs. 1–32 nodes (24 h-threads each), and parallel efficiency in percent vs. 2–32 nodes; matrix: 10M×10M, 257 bands)
Distributed CnC: without tuning (UTS)

• Unbalanced tree search [T3XXL]
  – Tree shape unknown
• CnC code is trivial
  – CnC: 151 loc
  – Shmem: ~1000 loc
  – MPI: ~800 loc
• The performance gap in the middle region is a load-balancing issue
  – Used the scheduler available in our release
  – Solved by a different scheduler

(chart: speedup over 1 node for MPI and CnC, 1–16 nodes / 16–256 threads/ranks)
Untuned performance often bad: tuning
(charts: CnC time in seconds, untuned [GigE] on 1–4 nodes vs. tuned [IB] on 2–32 nodes (24 h-threads each), for the inverse, primes, mandelbrot, and cholesky benchmarks)
CnC tuning easy and efficient
• The general default distribution of steps
  – Data is sent to where it is needed (requested)
  – Not always efficient
• The tuner can declare functions
  – consumed_by: which steps use this data item (semantics)
  – compute_on: where the step will execute (tuning)
  – To redistribute, modify compute_on: data items and steps are automatically adjusted
• The mapping tag → rank can be computed at
  – compile time
  – initialization time
  – execution time
Distribution functions
Cyclic – simple:

    int tuner::compute_on( const int & i ) const {
        return i % numProcs();
    }

Blocked – minimizes data transfer:

    int tuner::compute_on( const int & x, const int & y, const int & z ) const {
        return x * NBLOCKS_X / nx
             + ( y * NBLOCKS_Y / ny ) * nx
             + ( z * NBLOCKS_Z / nz ) * nx * ny;
    }

Blocked cyclic – block size/shape is relevant:

    int tuner::compute_on( const int & x, const int & y ) const {
        return ((x + y * m_nx) / BLOCKSIZE) % numProcs();
    }
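The snippets above reference tuner state (numProcs(), block counts, m_nx). A self-contained sketch with those pinned as stand-in constants (values hypothetical, chosen only for illustration) behaves like this:

```cpp
#include <cassert>

// Stand-in constants replacing the tuner's runtime state.
constexpr int kNumProcs  = 4;   // ranks in the distributed run
constexpr int kNx        = 8;   // tiles per row of the 2D tag space
constexpr int kBlockSize = 2;   // tiles per block in the cyclic scheme

// Cyclic: consecutive tags land on consecutive ranks, wrapping around.
int cyclic_compute_on(int i) {
    return i % kNumProcs;
}

// Blocked cyclic: linearize the 2D tag, group into blocks, cycle the blocks.
int blocked_cyclic_compute_on(int x, int y) {
    return ((x + y * kNx) / kBlockSize) % kNumProcs;
}
```

The point of keeping these as tiny pure functions of the tag is that redistributing work means editing one function, with no change to the domain spec.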
Scalability and efficiency: stencil
(charts: RTM Stencil speedup (3k×1.5k², 32 time-steps) vs. 2–32 nodes, and weak-scaling parallel efficiency (16 time-steps) vs. 1–32 nodes for matrix sizes from 3.2e9 to 1.0e11 elements (12–383 GB); 24 h-threads per node)

• Efficiency stays above 97%!
• 3D 25-point stencil reads from 2 previous time-steps
• Shmem performance similar to cache-oblivious algorithms
Distribution function for tree

At a certain tree depth, stop distributing and stay local. Used for quicksort.

(diagram: binary tree with nodes numbered level by level, 1 at the root down to 16–30 on the bottom level; the upper levels are distributed, deeper subtrees stay local)
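With nodes numbered level by level as in the diagram (root = 1, children of n are 2n and 2n+1), "stop distributing at a certain depth" can be sketched as follows; the constants are hypothetical, and this is an illustration of the idea rather than the released distribution function:

```cpp
#include <cassert>

constexpr int kRanks       = 4;  // stand-in for numProcs()
constexpr int kCutoffDepth = 2;  // distribute levels 0..2, stay local below

// Map a tree node to a rank: nodes above the cutoff depth get their own
// rank; every deeper node inherits the rank of its ancestor at the cutoff,
// so a whole subtree stays on one rank.
int tree_compute_on(int node) {
    while (node >= (1 << (kCutoffDepth + 1)))
        node /= 2;                // replace the node by its parent
    return node % kRanks;         // cycle over the small distributed top
}
```

Keeping deep subtrees local is what makes this suitable for divide-and-conquer work like quicksort: the expensive top-level splits are spread out, while the fine-grained recursion never crosses nodes.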
Summary
• Separation of concerns
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec
  – High-level: hides all the difficulties of low-level techniques
  – Exposes potential parallelism
  – Easy to debug (deterministic)
  – Efficient even with no tuning
• Tuning spec
  – Isolated
  – Simple
  – Flexible
• Support for distributed memory
  – Distribution is easy because the domain spec already identifies chunks of work and data
  – The same spec runs on shared and/or distributed memory
  – Distribution functions: isolated to support easy experimentation
Status

• General
  – Available
  – Yearly workshops
  – Used in some intro programming classes as a gentle introduction to parallelism
• Domain language
  – Very good performance on shared memory even without the tuning language
  – Bunch of benchmarks
  – Comparable performance to other tools
• Tuning language
  – Ubiquitous High Performance Computing (UHPC)
  – Vivek Sarkar's group at Rice is currently implementing the tuning language for CnC
• Distributed CnC
  – Available
  – Next: static analysis to generate distribution functions
New Intel CnC release: http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/
New Intel CnC Yahoo Group: http://tech.groups.yahoo.com/group/intel_concurrent_collections/
Rice CnC: http://habanero.rice.edu/cnc
This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Copyright © 2011-2012. Intel Corporation.
http://intel.com/software/products