<<

C-DAC Four Days Technology Workshop

ON

Hybrid – Co-Processors/Accelerators Power-aware Computing – Performance of Applications Kernels

hyPACK-2013 (Mode-1:Multi-Core) Lecture Topic : Multi-Core Processors : & Applications Overview - Part-II

Venue : CMSD, UoHYD ; Date : October 15-18, 2013

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 1 Killer Apps of Tomorrow

Application-Driven Architectures: Analyzing Tera-Scale Workloads Workload convergence  The basic algorithms shared by these high-end workloads Platform implications  How workload analysis guides future architectures Programmer productivity  Optimized architectures will ease the development of software Call to Action  Benchmark suites in critical need for redress

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 2 Emerging “Killer Apps of Tomorrow” Parallel Algorithms Design

 Static and Load Balancing  Mapping for load balancing

 Minimizing Interaction

 Overheads in parallel algorithms design

 Data Sharing Overheads

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 3 Emerging “Killer Apps of Tomorrow” Parallel Algorithms Design (Contd…)

General Techniques for Choosing Right Parallel

 Maximize data locality

 Minimize volume of data

 Minimize frequency of Interactions

 Overlapping computations with interactions.

 Data

 Minimize construction and Hot spots.

 Use highly optimized collective interaction operations. Collective data transfers and computations •  Maximize Concurrency.

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 4 Programming Aspects Examples

Implementation of Streaming Media Player on Multi-Core  One decomposition of work using Multi-threads  It consists of  A Monitoring a network port for arriving data,  A decompressor thread for decompressing packets  Generating frames in a video sequence  A rendering thread that displays frame at programmed intervals

Source : Reference : [4]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 5 Programming Aspects Examples

Implementation of Streaming Media Player on Multi-Core  The thread must communicate via shared buffers – • an in-buffer between the network and decompressor, • an out-buffer between the decompressor and renderer  It consists of  Listen to port ……..Gather data from the network  Thread generates frames with random (Random string of specific bytes)  Render threads pick-up frames & from the out-buffer and calls the display function  Implement using the Thread Condition Variables

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 6 Types of Parallelism

Types of Parallelism

 Combination of Data and Task parallelism

 Stream parallelism

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 7 Evolving towards model-based Computing

Large dataset mining /Grid Mining Streaming Distributed Data Mining Content-based Retrieval

Multimodal event/object Recognition Collaborative Filters Statistical Computing Multidimensional Indexing Dimensionality Reduction Efficient access to large, unstructured, sparse Clustering / Classification datasets Model-based: Bayesian network/Markov Model Photo-real Synthesis Neural network / Probability networks Real-world animation LP/IP/QP/Stochastic Optimization Ray tracing Global Illumination Behavioral Synthesis Physical simulation Kinematics Emotion synthesis Audio synthesis Video/Image synthesis Document synthesis

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 8 Evolving towards model-based Computing

Large dataset mining Semantic Web/Grid Mining Streaming Data Mining Distributed Data Mining Mining Content-based Retrieval

Multimodal event/object Recognition Collaborative Filters Indexing Multidimensional Indexing Statistical Computing Dimensionality Reduction Machine Learning StreamingEfficient access to large, unstructured, sparse Clustering / Classification datasets Stream Processing Model-based: Recognition Bayesian network/Markov Model Photo-real Synthesis Neural network / Probability networks Real-world animation LP/IP/QP/Stochastic Optimization Ray tracing Synthesis Global Illumination Behavioral Synthesis Physical simulation Kinematics Emotion synthesis Source : http://www.intel.com ; Graphics Audio synthesis Reference : [1],[4], [6] Video/Image synthesis Document synthesis

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 9 Killer Apps of Tomorrow

Workload convergence The basic algorithms shared by these high-end workloads

Platform implications How workload analysis guides future architectures

Programmer productivity Optimized architectures will ease the development of software

Call to Action Benchmark suites in critical need for redress

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 10 Multi Cores - Unique Challenges Target For

Bigger OR Smaller Cores

Performance OR

Compute OR I/O Intensive

Cache Friendly OR Memory Intensive

Source : http://www.intel.com ; Reference : [6]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 11 The Memory sub-system : Access time  A lot of time is spent accessing/storing data from/to memory. It is important to keep in mind the relative times for each memory types:

CPU

registers

L2 DISK RAM Dcache Icache Access Time is Important  Approximate access times  CPU-registers: 0 cycles (that’s where the work is done!)

 L1 Cache: 1 cycle (Data and Instruction cache). Repeated access to a cache takes only 1 cycle

 L2 Cache (static RAM): 3-5 cycles?  Memory (DRAM): 10 cycles (Cache miss);  30-60 cycles for Translation Lookaside Buffer (TLB) update  Disk: about 100,000 cycles!  connecting to other nodes - depending on network latency

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 12 Operational Flow of Threads for an Application

Operational Flow of Threads Implementation Source Code T1 . . . T2 Tn ......

Parallel Code Block or a Perform synchronization operations using parallel section needs multithread constructs Bi synchronization . T T1 2 . . . Tn . .

Perform synchronization operations using parallel Parallel Code Block constructs Bj

T1 …. p Source : http://www.intel.com ; Reference : [6]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 13 Multi Core : Programming Issues

 Out of Order Execution

 Preemptive and Co-operative Multitasking

 SMP to the rescue

 Super threading with Multi threaded

 Hyper threading the next step (Implementation)

 Multitasking

 Caching and SMT

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 14 Hyper-threading : Partitioned Resources

 Hyper-threading (HT) technology is a hardware mechanism where multiple independent hardware threads get to execute in a single cycle on a single super- core.

Physical Package Physical Package Architectural Logical Logical State Professor 0 Professor 1 Architectural Architectural Execution State State Engine Local APIC Execution Engine Local APIC Local APIC Interface Bus Interface

System Bus System Bus Single Processor Single Processor with Hyper-Threading Technology Single Processor System without Hyper-Threading Technology and Single Processor System with Hyper-Threading Technology Source : http://www.intel.com ; Reference : [6], [10], [19], [23], [24],[29], [31]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 15 Hyper-threading Technology

 Multi-processor with and without Hyper-threading (HT) technology

MP HT MP Physical Package Physical Package Physical Package Physical Package Logical Logical Logical Logical Professor 0 Professor 1 Architectural Architectural Professor 0 Professor 1 Architectural Architectural Architectural State State State State Architectural State State Execution Engine Execution Engine Execution Engine Execution Engine Local APIC Local APIC Local APIC Local APIC Local APIC Local APIC Bus Interface Bus Interface Bus Interface Bus Interface

System Bus System Bus

Source : http://www.intel.com ; Reference : [6], [10], [19], [23], [24],[29], [31]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 16 Multi-threaded Processing using Hyper-Threading Technology

 Time taken to n threads on a single processor is significantly more than a single processor system with HT technology enabled.

Multithreading Hyper-threading Technology

App 0 App 1 App 2 App 0 App 1 App 2

T0 T1 T2 T3 T4 T5 T0 T1 T2 T3 T4 T5 Thread Pool

CPU CPU

CPU T0 T1 T2 T3 T4 T5 LP0 T0 T2 T4 LP1 Time T1 T3 T5 CPU 2 Threads per Processor Time

Source : http://www.intel.com ; Reference : [6], [29], [31]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 17 Simple Comparison of Single-core, Multi- processor, and multi-Core Architectures

B) Multiprocessor CPU State CPU State CPU State Interrupt Logic Interrupt Logic Interrupt Logic Execution Cache Execution Execution Cache Units Cache Units Units

A) Single Core b) Multi Processor

CPU State CPU State Interrupt Logic Interrupt Logic

Execution Cache Units CPU State CPU State C) Hyper-Threading Technology Interrupt Logic Interrupt Logic

Execution Cache Execution Cache Units Units

D) Multi-core

Source : http://www.intel.com ; Reference : [6], [29], [31]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 18 Simple Comparison of Single-core, Multi- processor, and multi-Core Architectures

CPU State CPU State Interrupt Logic Interrupt Logic

Execution Units Execution Units Cache

E) Multi-core with Shared Cache

CPU State CPU State CPU State CPU State Interrupt Logic Interrupt Logic Interrupt Logic Interrupt Logic

Execution Cache Execution Cache Units Units

F) Multi-core with Hyper-threading Technology

Source : http://www.intel.com ; Reference : [6], [29], [31]

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 19 Emerging “Killer Apps of Tomorrow”

“Where is it?” Search for a similar Mining instance

Demo: Personal image search Source: Intel

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 20 Emerging “Killer Apps of Tomorrow” RMS: Recognition Mining & Synthesis

Recognition Mining Synthesis What is …? Is it …? What if …?

Find a model Create a model Model instance instance

Today Model-less Real-time streaming and Very limited realism transactions on static – structured datasets

Tomorrow Model-based Real-time analytics on Photo-realism and multimodal dynamic, unstructured, physics-based recognition multimodal datasets animation Source : Intel

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 21 Emerging “Killer Apps of Tomorrow” Interactive Recognition Mining & Synthesis

Most RMS apps are about enabling interactive (real-time) RMS Loop or iRMS Recognition Mining Synthesis What is …? Is it …? What if …?

Find an existing Create a new Model model instance model instance

Source : Intel Graphics Rendering + Physical Simulation

Learning & Visual Input Synthesized Modeling Streams Visuals Computer Reality Vision Augmentation Workload convergence: multimodal recognition and synthesis over complex datasets

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 22 Emerging “Killer Apps of Tomorrow” Interactive Recognition Mining & Synthesis

Media Evolution Upcoming Transition Next Transition Modality-specific streaming

Modality-aware transformation Graphics Evolution Multimodal recognition Scene complexity: moderate Local processing dominated Scene complexity: large Global processing dominated Mining Evolution Scene complexity: real-world Dataset: static/structured Physical simulation dominated Response: offline Dataset: dynamic, multimodal Response: interactive Dataset: massive+streaming Source : Intel Response: real-time Workload convergence: multimodal recognition and synthesis over complex datasets

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 23 Emerging “Killer Apps of Tomorrow” Parallel Algorithms Design (Contd…)

 Video : Body Tracking and Ray-Tracing

 Search for Video based imaging & Creation of Videos with animation

 Games / Interactive 3D simulation

•  Real-time realistic 3D of body systems

 Wall Street Financial Analysis; Business: Data Mining – Security: Real Time Video surveillance Astrophysics,  – Pattern Search Algorithms

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 24 Emerging “Killer Apps of Tomorrow” Parallel Algorithms Design (Contd…)

 Speech Recognition & Speech Analysis – Modelling and Identifying using multi-modal data

 Large Scale image searches Audi /Video Control Systems

 Ray Tracing Algorithms – Video & Physical Simulation

 Overlapping computations with interactions.

 Data Mining Algorithms

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 25 Computer Physical Rendering (Financial) Vision Simulation Analytics Data Mining Body Face Face, Rigid Global CFD Portfolio Option Tracking Detection Illumination Cloth Body Mgmt Pricing Media Machine Cluster/ Text Synthesis learning Classify Index

Collision PDE LCP NLP FIMI SVM SVM detection Classification Training IPM K-Means (LP, QP) Level Set Filter/ Particle Fast Marching Index transform Filtering Method Monte Carlo Bench

Source : Intel Basic Iterative Krylov Iterative Direct Solver Solver Non-Convex Solvers (PCG) (Cholesky) (Jacobi, GS, SOR) Method

Basic matrix primitives Basic geometry primitives (dense/sparse, structured/unstructured) (partitioning structures, primitive tests)

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 26 Emerging “Killer Apps of Tomorrow” Simulation

Computational Memory BW (GFLOPs) (GB/s) 10000 10000

1000 1000

100 100

10 10

1 1

0.1 0.1 A B A B

A: CFD, 75x50x50, 10 fps Source : Intel B: CFD, 150x100x100, 30 fps

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 27 Performance Analysis

Graph Mining

5

Non-Numerical algorithm 4 Numerical Computations (selective)

3

2

1 Relative Relative Throughput

0 1 2 4 8 16 32 Number of cores

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 28 Emerging “Killer Apps of Tomorrow”:Summary

 RMS applications require significant increase in compute density  often non-linear extensions of existing usages  RMS apps have a converged common core of computing  Workload convergence implies platform convergence  Programmer productivity can be enhanced by platform  significant opportunity for silicon differentiation  RMS apps will drive most future technology vectors  Programming to processor to memory technology  Existing benchmarks are a poor proxy for these apps  Current popular suites must evolve or go extinct!

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 29 Emerging “Killer Apps of Tomorrow”:References

 More web based info:  Justin Rattner, CTO, Intel, blog on “Cool Codes” http://blogs.zdnet.com/OverTheHorizon/  RMS Intro: http://www.intel.com/technology/magazine/computing/recognition -mining-synthesis-0205.htm  Platform 2015: • http://www.intel.com/technology/architecture/platform2015/

Source : Reference : [6] http://www.intel.com

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 30 Summary

 An overview of Application Requirements for Multi-Core Systems  An Overview of Threading

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 31 References

1. Andrews, Grogory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Programming, Boston, MA : Addison-Wesley

2. Butenhof, David R (1997), Programming with POSIX Threads , Boston, MA : Addison Wesley Professional

3. Culler, David ., Jaswinder Pal Singh (1999), Parallel - A Hardware/Software Approach , San Francsico, CA : Morgan Kaufmann

4. Grama Ananth, Anshul Gupts, George Karypis and Vipin Kumar (2003), Introduction to , Boston, MA : Addison-Wesley

5. Intel Corporation, (2003), Intel Hyper-Threading Technology, Technical User's Guide, Santa Clara CA : Intel Corporation Available at : http://www.intel.com

6. Shameem Akhter, Jason Roberts (April 2006), Multi-Core Programming - Increasing Performance through Software Multi-threading , Intel PRESS, Intel Corporation,

7. Bradford Nichols, Dick Buttlar and Jacqueline Proulx Farrell (1996), Pthread Programming O'Reilly and Associates, Newton, MA 02164,

8. James Reinders, Intel Threading Building Blocks – (2007) , O’REILLY series

9. Laurence T Yang & Minyi Guo (Editors), (2006) High Performance Computing - Paradigm and Infrastructure Wiley Series on Parallel and , Albert Y. Zomaya, Series Editor

10. Intel Threading Methodology ; Principles and Practices Version 2.0 copy right (March 2003), Intel Corporation

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 32 References

11. William Gropp, Ewing Lusk, Rajeev Thakur (1999), Using MPI-2, Advanced Features of the Message-Passing Interface, The MIT Press..

12. Pacheco S. Peter, (1992), Parallel Programming with MPI, , University of Sanfrancisco, Morgan Kaufman Publishers, Inc., Sanfrancisco, California

13. Kai Hwang, Zhiwei Xu, (1998), Scalable Parallel Computing (Technology Architecture Programming), McGraw Hill New York.

14. Michael J. Quinn (2004), Parallel Programming in C with MPI and OpenMP McGraw-Hill International Editions, Series, McGraw-Hill, Inc. Newyork

15. Andrews, Grogory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Progrmaming, Boston, MA : Addison-Wesley

16. SunSoft. Solaris multithreaded programming guide. SunSoft Press, Mountainview, CA, (1996), Zomaya, editor. Parallel and Distributed Computing Handbook. McGraw-Hill,

17. Chandra, Rohit, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, (2001),Parallel Programming in OpenMP San Fracncisco Moraan Kaufmann

18. S.Kieriman, D.Shah, and B.Smaalders (1995), Programming with Threads, SunSoft Press, Mountainview, CA. 1995

19. Mattson Tim, (2002), Nuts and Bolts of multi-threaded Programming Santa Clara, CA : Intel Corporation, Available at : http://www.intel.com

20. I. Foster (1995, Designing and Building Parallel Programs ; Concepts and tools for Parallel , Addison-Wesley (1995)

21. J.Dongarra, I.S. Duff, D. Sorensen, and H..Vorst (1999), Numerical Linear Algebra for High Performance Computers (Software, Environments, Tools) SIAM, 1999

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 33

References

22. OpenMP C and C++ Application Program Interface, Version 1.0". (1998), OpenMP Architecture Review Board. October 1998

23. D . A. Lewine. Posix Programmer's Guide: (1991), Writing Portable Unix Programs with the Posix. 1 Standard . O'Reilly & Associates, 1991

24. Emery D. Berger, Kathryn S McKinley, Robert D Blumofe, Paul R.Wilson, Hoard : A Scalable Memory Allocator for Multi-threaded Applications ; The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November (2000). Web site URL : http://www.hoard.org/

25. Marc Snir, Steve Otto, Steyen Huss-Lederman, David Walker and Jack Dongarra, (1998) MPI-The Complete Reference: Volume 1, The MPI Core, second edition [MCMPI-07].

26. William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir (1998) MPI-The Complete Reference: Volume 2, The MPI-2 Extensions

27. A . Zomaya, editor. Parallel and Distributed Computing Handbook. McGraw-Hill, (1996)

28. OpenMP C and C++ Application Program Interface, Version 2.5 (May 2005)”, From the OpenMP web site, URL : http://www.openmp.org/

29. Stokes, Jon 2002 Introduction to Multithreading, Super-threading and Hyper threading Ars Technica, October (2002)

30. Andrews Gregory R. 2000, Foundations of Multi-threaded, Parallel and Distributed Programming, Boston MA : Addison – Wesley (2000)

31. Deborah T. Marr , Frank Binns, David L. Hill, Glenn Hinton, David A Koufaty, J . Alan Miller, Michael Upton, “Hyperthreading, Technology Architecture and ”, Intel (2000-01)

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 34  Thank You  Any questions ?

C-DAC hyPACK-2013 Multi-Core Processors : Alg.& Appls Overview Part-II 35