Concurrent Collections (CnC)

Kath Knobe Kath.knobe@.com

2012/03/29

Software and Services Group

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory


The Big Idea

• Don’t think about what operations can run in parallel – Difficult and depends on target

• Think about the semantic ordering constraints only – Easier and depends only on application

Exactly Two Sources Of Ordering Requirements

• Producer must execute before consumer

Producer - consumer:

(step1) → [item] → (step2)

• Controller must execute before controllee

Controller - controllee:

(step1) → <tony> :: (step2)
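Both kinds of requirement put instances on the same footing: each is just "X must come before Y". A minimal sketch in plain C++ (illustrative names, not CnC code) checks a schedule against both kinds at once:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A schedule is legal iff every producer precedes its consumer and every
// controller precedes its controllee -- one check covers both kinds.
bool legal(const std::vector<std::string>& schedule,
           const std::vector<std::pair<std::string, std::string>>& must_precede) {
    std::map<std::string, int> pos;            // instance name -> position
    for (int i = 0; i < (int)schedule.size(); ++i) pos[schedule[i]] = i;
    for (const auto& p : must_precede)
        if (pos.at(p.first) >= pos.at(p.second)) return false;
    return true;
}
```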

Cholesky factorization

[Figure, built up over several slides: tiled Cholesky factorization. In each iteration, the Cholesky step factors the diagonal tile, Trisolve solves the tiles below it, and Update modifies the remaining tiles. Slightly simplified.]

4 simple steps to a CnC application


1: The whiteboard level

• Data
• Computations
• Producer/consumer relations among them
• I/O

(Cholesky)   (Trisolve)   (Update)   [Array]


2: Distinguish among the instances

• Distinct instances (chunks of data or computation) may be distinctly scheduled and mapped

• A collection is a statically named set of dynamic instances

(Cholesky: iter)   (Trisolve: row, iter)   (Update: col, row, iter)   [Array: col, row, iter]
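What "a statically named set of dynamic instances" means for an item collection can be sketched in plain C++ (illustrative only, not the Intel CnC API): one static name, many (tag, value) instances, each written at most once.

```cpp
#include <map>
#include <stdexcept>

// One static collection name; dynamic instances indexed by tag.
template <typename Tag, typename Value>
class item_collection {
    std::map<Tag, Value> items_;
public:
    void put(const Tag& tag, const Value& value) {
        if (!items_.emplace(tag, value).second)           // dynamic single
            throw std::logic_error("tag written twice");  // assignment
    }
    const Value& get(const Tag& tag) const { return items_.at(tag); }
};
```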


• Optional: tag functions relating the tags, e.g., Cholesky: iter → Array: iter, iter, iter

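The tag function above can be spelled out as code. A hedged sketch (the tuple type is an illustrative stand-in, not a CnC type):

```cpp
#include <array>

// Cholesky: iter → Array: iter, iter, iter
// The Cholesky step instance tagged `iter` touches the Array item
// instance tagged (iter, iter, iter).
using item_tag = std::array<int, 3>;   // col, row, iter

item_tag cholesky_to_array(int iter) {
    return {iter, iter, iter};
}
```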

Control tag

<Ceasar> :: (foo)

• A tuple – a unique identifier

• Meaning of the control relationship
  – For each row/col/iteration tuple in the collection Ceasar, the corresponding instance of foo will execute sometime.
  – An instance of foo has access to the values of row, col and iteration.

• Control dependence and data dependence are on the same footing
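The control relationship above can be sketched in plain C++ (stand-ins, not the CnC runtime): prescribing the step foo by the tag collection Ceasar means "for each tag instance, one instance of foo executes, with access to the tag's components".

```cpp
#include <set>
#include <tuple>
#include <vector>

using tag_t = std::tuple<int, int, int>;   // row, col, iteration

std::vector<tag_t> run_prescribed(const std::set<tag_t>& Ceasar) {
    std::vector<tag_t> executed;
    for (const tag_t& t : Ceasar)    // every tag in Ceasar ...
        executed.push_back(t);       // ... fires one instance of foo
    return executed;
}
```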


3: What are the control tag collections

<CholeskyTag: iter> :: (Cholesky: iter)
<TrisolveTag: row, iter> :: (Trisolve: row, iter)
<UpdateTag: col, row, iter> :: (Update: col, row, iter)
[Array: col, row, iter]


4: Who produces control

[Same graph as in step 3, now with edges added to show which step produces each control tag collection.]

Actual app: cell tracker

(Cell detector: K)

[Cell candidates: K ]

[Histograms: K] [Input Image: K]

(Cell tracker: K) [Labeled cells initial: K] (Arbitrator initial: K)

[State measurements: K]

(Correction filter: K, cell)

[Motion corrections: K , cell]

[Labeled cells final: K] (Arbitrator final: K)

[Final states: K]

(Prediction filter: K, cell)

[Predicted states: K, cell]

Sample step (cholesky)

// Perform unblocked Cholesky factorization on the input tile.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    const int iter = t;
    const int b = c.b;

    // Get data used by this step
    tile_const_ptr_type A_block;
    c.Array.get( triple(iter, iter, iter), A_block );  // get the input tile

    // Normal computation
    tile_ptr_type L_block = std::make_shared< tile_type >( b );  // allocate output tile
    for( int k_b = 0; k_b < b; k_b++ ) {
        // do all the math on this tile
    }

    // Put data produced by this step
    c.Array.put( triple(iter+1, iter, iter), L_block );  // write the output tile

    return CnC::CNC_Success;
}

• Steps have no side effects
• Steps are atomic
• Items are dynamic single assignment

Shared memory Spike performance (James Brodman, UIUC and Intel)

• Parallel solver for banded linear systems of equations
• Complex algorithm but “easily” maps onto CnC
• Same code for shared and distributed memory

[Chart: runtime in seconds for MKL, MKL+OMP, HTA (TBB), CnC, HTA (MPI), and DistCnC; shared/distributed memory runs on 4 nodes over GigE; problem size 1048576 with 128 super/sub diagonals, 32 partitions.]


Semantics of CnC

• Semantics are defined in terms of attributes of instances.
• The current state is described by the current set of instances and their attributes.

tag: avail
step: controlReady, dataReady, ready, executed
item: avail
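An illustrative encoding of the step attributes above (not the runtime's real types): a step instance is ready once it is both controlReady (its prescribing tag is avail) and dataReady (its input items are avail); only a ready step may become executed.

```cpp
struct step_instance {
    bool controlReady = false;   // prescribing tag is avail
    bool dataReady = false;      // all input items are avail
    bool executed = false;
    bool ready() const { return controlReady && dataReady && !executed; }
};
```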

CnC is a coordination language: paired with a computation language

• Existing:
  – C++ (Intel)
  – Java (Rice)
  – Haskell (Indiana)
• In the works:
  – Fortran (Rice)
  – Chapel (UIUC)
• Next up:
  – Python (Rice)
  – Matlab (Rice)
  – Scala (Rice)
  – APL (Indiana)

Rice = Vivek Sarkar’s group; Indiana = Ryan Newton’s group; UIUC = David Padua’s group

CnC is a high-level model: it allows for a wide variety of runtime approaches

grain     distribution   schedule   implementation
static    static         static     HP TStreams
static    static         dynamic    HP TStreams
static    dynamic        dynamic    Intel CnC
static    dynamic        dynamic    Rice CnC
dynamic   dynamic        dynamic    Georgia Tech CnC

Possible expressions of the model

• Graphical interface

• Textual representation
  – Rice version
  – Previous Intel versions

• API
  – Current Intel release

Key properties

• Think about the meaning of your program
  – Not about parallelism
• Deterministic & serializable
  – Easy to debug
• Separation of concerns simplifies everything
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec is free of tuning and largely independent of architecture
• Unified programming for shared and distributed memory
  – Not like an MPI/OpenMP combo

Intel® Concurrent Collections for C++

• New whatif release
  – Header files, shared libs
  – Samples
  – Documentation: CnC concepts, using CnC, tutorials

• Template library with runtime
  – For Windows (IA-32/Intel-64) and Linux (Intel-64)
  – Shared and distributed memory

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory

Tuning expert

with Rice: Mike Burke, Sanjay Chatterjee, Zoran Budimlic

• Without tuning
  – Very good performance
  – Typically more than enough parallelism
  – Asynchronous execution

• Multiple tuning specs for each domain spec
  – The tuning spec is additional and separate
  – Semantic constraints from the domain spec are still requirements
  – A tuned execution is one of the many legal executions of the domain spec. Goal: guide away from bad ones.

Foundation for tuning: hierarchical affinity groups

• Affinity groups
  – Avoid mappings with poor locality (temporal + spatial)
  – No benefit if too far away in either time or space
  – The foundation doesn’t distinguish space from time
  – On top of this foundation we allow (but don’t require) time-specific and space-specific control
• Hierarchy
  – Same low-level group => tight affinity
  – Same higher-level group => weaker affinity

Cholesky domain spec

<CholeskyTag: iter> :: (Cholesky: iter)
<TrisolveTag: row, iter> :: (Trisolve: row, iter)
<UpdateTag: col, row, iter> :: (Update: col, row, iter)
[Array: col, row, iter]

Hierarchical affinity structure

[Diagram: the three steps drawn inside nested boxes: (Cholesky:) at the outer level, with (Trisolve:) and (Update:) together in an inner box.]

Hierarchical affinity groups have names

AFFINITY GROUP GroupC:
  (Cholesky:)
  AFFINITY GROUP GroupTU:
    (Trisolve:)
    (Update:)

We distinguish among instances of affinity groups

AFFINITY GROUP GroupC: iter
  (Cholesky: iter)
  AFFINITY GROUP GroupTU: row, iter
    (Trisolve: row, iter)
    (Update: col, row, iter)

Specify the instances of each affinity group

AFFINITY GROUP GroupC: iter (instances given by <CholeskyTag: iter>)
  (Cholesky: iter)
  AFFINITY GROUP GroupTU: row, iter (instances given by <TrisolveTag: row, iter>)
    (Trisolve: row, iter)
    (Update: col, row, iter)
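The grouping above can be sketched as plain functions (illustrative, not a tuning-spec syntax): a step instance's group instance is found by dropping the tag components the group does not distinguish.

```cpp
#include <utility>

std::pair<int, int> groupTU_of_trisolve(int row, int iter) {
    return {row, iter};                     // GroupTU: row, iter
}

std::pair<int, int> groupTU_of_update(int col, int row, int iter) {
    (void)col;  // every col of a given (row, iter) lands in the same instance
    return {row, iter};
}

int groupC_of(int iter) { return iter; }    // GroupC: iter (the outer group)
```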

Cholesky: instances of the outer and inner groups

[Figure, built up over several slides: the triangular space of step instances per iteration (one Cholesky, a column of Trisolves, a triangle of Updates). The outer group has one instance per iter; the inner group has one instance per (row, iter).]

Space-specific mappings

(Same affinity-group structure as above.) Space mappings that can be attached to a group:

• Replicate components of this group across sockets
• Distribute components of this group across nodes
• Distribute components of this group across nodes via func()

Time-specific mappings

Relative time among components within a group.

                  ordered          unordered
non-overlapping   serial/barrier   exclusive
overlapping       priority         arbitrary

Space-time combinations

{big_memory_footprint: i}
  – Distribute across address spaces (space)
  – Within each address space: exclusive (time)

{video_processing: frameID}
  – Cyclic distribution among address spaces (space)
  – Ordered & overlapping within address spaces (time)

Related work

• Control of time
  – Various types of parallelism: data parallel, task parallel, streaming, fork-join, …
• Control of space
  – Support for distributions: Hierarchical Place Trees, HPF, …

CnC can control both time and space, in a unified way. We allow (but don’t require) the tuning expert to blur the distinction between them.

Knobe & Burke, “Tuning Language for Concurrent Collections”, Compilers for Parallel Computing (CPC 2012)

Outline

• Intro to CnC
• Tuning CnC
• CnC for distributed memory

Distributed CnC

Intel: Frank Schlimbach

• CnC is a declarative and high-level model
  – Abstracts from the memory model
• Provides a unified programming model for shared and distributed memory
  – Same binary runs on both
  – Easy to switch between shared and distributed memory (no explicit message passing needed)
• CnC comes with a control methodology
  – Allows controlling work distribution
• CnC comes with a data methodology
  – Provides hooks for selective data distribution

Limitations of distributed computing apply

• Usual caveats for distributed memory apply
  – e.g., the ratio between data exchange and computation

• Different algorithms might be needed for distributed or shared memory
  – The programming methodology and framework stay the same in any case
  – Over a wide class of applications the algorithm stays the same

Scalability and efficiency: Spike

[Charts: Spike scalability (time in seconds, 1–32 nodes) and parallel efficiency (2–32 nodes), 24 hardware threads per node; matrix 10M×10M with 257 bands.]

Distributed CnC: without tuning (UTS)

• Unbalanced tree search
  – Tree shape unknown
• CnC code is trivial
  – CnC: 151 loc
  – Shmem: ~1000 loc
  – MPI: ~800 loc
• The performance gap in the middle region is a load-balancing issue
  – Used the scheduler available in our release
  – Solved by a different scheduler

[Chart: UTS [T3XXL] speedup over 1 node for MPI and CnC, 16–256 threads/ranks on 1–16 nodes.]

Untuned performance often bad: tuning

[Charts: CnC time untuned (GigE, 1–4 nodes) and tuned (IB, 2–32 nodes) for the inverse, primes, mandelbrot, and cholesky benchmarks; 24 hardware threads per node.]

CnC tuning: easy and efficient

• The general default distribution of steps
  – Data is sent to where it is needed (requested)
  – Not always efficient

• The tuner can declare functions
  – consumed_by: which steps use this data item (semantics)
  – compute_on: where the step will execute (tuning)
  – To redistribute, modify compute_on: data items and steps are automatically adjusted

• The mapping tag->rank can be computed at
  – Compile time
  – Initialization time
  – Execution time
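An initialization-time tag->rank mapping can be sketched as a table filled once and then looked up; names here are illustrative, not the CnC tuner interface:

```cpp
#include <vector>

struct init_time_mapper {
    std::vector<int> rank_of;   // index = tag, value = rank
    init_time_mapper(int ntags, int nprocs) : rank_of(ntags) {
        for (int t = 0; t < ntags; ++t)
            rank_of[t] = t * nprocs / ntags;   // block distribution, fixed up front
    }
    int compute_on(int tag) const { return rank_of[tag]; }
};
```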

Distribution functions

Cyclic – simple:

int tuner::compute_on( const int & i ) const {
    return i % numProcs();
}

Blocked – minimizes data transfer:

int tuner::compute_on( int x, int y, int z ) const {
    return x * NBLOCKS_X / nx
         + ( y * NBLOCKS_Y / ny ) * nx
         + ( z * NBLOCKS_Z / nz ) * nx * ny;
}

Blocked cyclic – block size/shape is relevant:

int tuner::compute_on( int x, int y ) const {
    return ( ( x + y * m_nx ) / BLOCKSIZE ) % numProcs();
}


Scalability and efficiency: stencil

[Charts: RTM Stencil speedup (3k×1.5k², 32 time-steps; 2–32 nodes, 24 hardware threads each) and weak-scaling parallel efficiency (16 time-steps; 1–32 nodes; matrix sizes from 3.2e9 to 1.0e11 elements, 12–383 GB).]

• Efficiency stays above 97%!
• 3D 25-point stencil; reads from 2 previous time-steps
• Shmem performance similar to cache-oblivious algorithms

Distribution function for tree

At a certain tree depth, stop distributing and stay local. Used for quickSort.

1
2 3
4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
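A hedged sketch of this tree distribution, with heap-style numbering (root = 1, children of n are 2n and 2n+1): below the cutoff depth a node climbs to its ancestor at the cutoff level, so whole subtrees stay local to one rank. Names are illustrative, not the CnC tuner interface.

```cpp
int depth_of(int node) {                  // root is at depth 0
    int d = 0;
    while (node > 1) { node /= 2; ++d; }
    return d;
}

int compute_on_tree(int node, int cutoff_depth, int nprocs) {
    while (depth_of(node) > cutoff_depth)
        node /= 2;                        // stop distributing: stay with parent
    return node % nprocs;
}
```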

Summary

• Separation of concerns
  – Computation from coordination
  – Domain spec from tuning spec
• Domain spec
  – High-level: hides all the difficulties of low-level techniques
  – Exposes potential parallelism
  – Easy to debug (deterministic)
  – Efficient even with no tuning
• Tuning spec
  – Isolated
  – Simple
  – Flexible
• Support for distributed memory
  – Distribution is easy because the domain spec already identifies chunks of work and data
  – Same spec runs on shared and/or distributed memory
  – Distribution functions: isolated to support easy experimentation

Status

• General
  – Available
  – Yearly workshops
  – Used in some intro programming classes as a gentle introduction to parallelism
• Domain language
  – Very good performance on shared memory even without the tuning language
    • Bunch of benchmarks
    • Comparable performance to other tools
• Tuning language
  – Ubiquitous High Performance Computing (UHPC)
    • Vivek Sarkar’s group at Rice is currently implementing the tuning language for CnC.
• Distributed CnC
  – Available
  – Next: static analysis to generate distribution functions


New Intel CnC release: http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/

New Intel CnC Yahoo Group: http://tech.groups.yahoo.com/group/intel_concurrent_collections/

Rice CnC: http://habanero.rice.edu/cnc

This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.


Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, , Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Copyright © 2011-2012. Intel Corporation.

http://intel.com/software/products
