Parallelism for the Masses: Opportunities and Challenges

Andrew A. Chien, Vice President of Research, Intel Corporation

Carnegie Mellon University Parallel Thinking Seminar, October 29, 2008

Outline

• Is Parallelism a crisis?
• Opportunities in Parallelism
• Expectations and Challenges
• Moving Parallel Programming Forward
• What’s Going on at Intel
• Questions

Parallelism for the Masses: “Opportunities and Challenges” ©Intel Corporation

Parallelism is here… And Growing!

[Chart: number of cores per processor vs. year, 2006–2015: Core 2 Duo (2, 2006), Core2 Quad (4), Dunnington (6), Nehalem (8+), Larrabee (12–32), Future (100+)]

Q. Is parallelism a crisis?

A. Parallelism is an opportunity.

Parallelism is a key driver for Energy and Performance

[Chart: relative performance and power]
• Over-clocked (+20%): 1.13x performance, 1.73x power
• Design frequency: 1.00x performance, 1.00x power
• Underclocked (-20%): 0.87x performance, 0.51x power
• Dual-core (underclocked -20%): 1.73x performance, 1.02x power

Unified IA and Parallelism Vision – a Foundation Across Platforms

[Chart: performance vs. power across one IA foundation – Canmore and Tolapai (low power) through Intel® Core™ 2 and i7 (visual computing, gaming); example: graphics integration options for notebook, desktop, and server]

Opportunity in Low-power Computing: 10x Lower Power

Opportunity in Highly Parallel Computing: 10x Higher Performance

[Diagram: a highly parallel architecture – IA cores with vector units and coherent caches, fixed-function logic, an interprocessor network, I/O interfaces, and memory. Target workloads: visual computing, gaming/entertainment, financial modeling, biological modeling]

Opportunity #1: Highly Portable, Parallel Software

• All computing systems (servers, desktops, laptops, MIDs, smart phones, embedded…) converging to…
  – A single framework with parallelism and a selection of CPUs and specialized elements
  – Energy efficiency and performance as core drivers
  – Must become “forward scalable”

Parallelism becomes widespread – all software is parallel

Create standard models of parallelism in architecture, expression, and implementation.

“Forward Scalable” Software

[Chart: # of cores vs. time – academic research works at 2N cores (16 → 32 → 256 → 512) while mainstream software runs at N (8 → 16 → 128 → 256); “Today” marks the 8/16-core point]

Software with forward scalability can be moved unchanged from N -> 2N -> 4N cores with continued performance increases.
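A minimal sketch of what forward scalability can look like in code, in Python (one of the productivity languages named later in this deck). The function and its decomposition are hypothetical illustrations: the split is derived from the data size and the core count discovered at run time, so the identical source runs on N or 2N cores. (In CPython, threads only illustrate the structure; a real runtime would map CPU-bound work onto native cores.)

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(data, workers=None):
    """Sum `data` by splitting it across however many cores exist.

    The decomposition depends only on the data and the core count
    discovered at run time, so the same code moves unchanged from
    N to 2N cores -- the "forward scalable" property.
    """
    workers = workers or os.cpu_count() or 1
    step = max(1, len(data) // workers)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

# Same result whether the machine exposes 1, 8, or 256 workers.
data = list(range(1000))
assert chunked_sum(data, workers=1) == chunked_sum(data, workers=8) == sum(data)
```

The key point is that no core count is baked into the source; only the runtime decomposition changes as cores are added.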

Opportunity #2: Major Architectural Support for Programmability

• Single-core growth and aggressive frequency scaling – long the competition for architectural resources – are weakening, opening the door to other types of architecture innovation

Architecture innovations for functionality – programmability, observability, garbage collection, … – are now possible

Don’t ask for small incremental changes, be bold and ask for LARGE changes… that make a LARGE difference

New Golden Age of Architectural Support for Programming?

[Timeline: to mid-60s through mid-80s – language support; mid-80s to mid-2000s – GHz scaling, issue scaling, system integration; Terascale era – programming support and parallelism]

Opportunity #3: High Performance, High Level Programming Approaches

• Single-chip integration enables closer coupling (cores, caches) and innovation in intercore coordination
  – Eases performance concerns
  – Supports irregular, unstructured parallelism

Forward scalable performance with good efficiency may be possible without detailed control.

Functional, declarative, transactional, object-oriented, dynamic, scripting, and many other high level models will thrive.

Manycore != SMP on Die

Parameter        | SMP        | Tera-scale | Improvement
On-die bandwidth | 12 GB/s    | ~1.2 TB/s  | ~100X
On-die latency   | 400 cycles | 20 cycles  | ~20X

• Less locality sensitive; efficient sharing
• Runtime techniques more effective for dynamic, irregular data and programs

• Can we do less tuning? And program at a higher level?

Opportunity #4: Parallelism can Add New Kinds of Capability and Value

• Additional core and computational capability can be available on-chip
  – Single-chip design enables enhancement at low cost
  – Integration enables close coupling
  – Security: taint checking, invariant monitoring, etc.
  – Robustness: race detection, invariant checking, etc.
  – Interface: sensor data processing, self-tuning, activity inference

Deploy parallelism to enable new applications, improve software quality, and enhance the user experience

Expectations and Challenges

Two Students in 2015

[Cartoon: two students. “How many cores does your computer have?” “Mine’s an Intel 1,000-core with 64 out-of-order cores!”]

End Users don’t care about core counts; they care about capability

Chip Real Estate and Performance/Value

• Next-generation Intel processor
  – 4 cores, 2B transistors, 30MB cache
  – 50% cache, 50% logic
  – 1% of chip area = 30M transistors = 1/3 MB
  – 0.1% of chip area = 3M transistors = 1/30 MB

• How much performance benefit?
• What incremental performance benefit should you expect for the last core in a 100-core? A 1000-core?

Incremental performance benefit, or “forward scalability” – not efficiency – should be the goal.
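One way to make the 100-core question concrete is Amdahl’s law – not cited on the slide, but it bounds the incremental benefit of each added core: speedup(n) = 1/((1-p) + p/n) when a fraction p of the work is parallel. A small sketch:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallel, the 100th core adds almost nothing:
gain_2nd = amdahl_speedup(0.95, 2) - amdahl_speedup(0.95, 1)      # ~0.90x extra speedup
gain_100th = amdahl_speedup(0.95, 100) - amdahl_speedup(0.95, 99)  # ~0.03x extra speedup
assert gain_100th < 0.05 < gain_2nd
```

Under these (illustrative) numbers the second core buys roughly 0.9x additional speedup while the 100th buys about 0.03x – which is why the slide argues for forward scalability (growing the parallel fraction with the problem) rather than per-core efficiency.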

Key Software Development Challenges

• New functionality
• Productivity
• Portability
• Performance, performance robustness
• Debugging/test
• Security
• Time to market

⇒ Software development is hard!
⇒ Parallelism is critical for performance, but must be achieved in conjunction with all of these requirements…

HPC: What we have learned about Parallelism

• Large-scale parallelism is possible, and typically comes with scaling of problems and data
• Portable expression of parallelism matters
• High level program analysis is a critical technology
• Working with domain experts is a good idea
• Multi-version programming (algorithms and implementations) is a good idea; autotuning is a good idea

• Locality is hard, modularity is hard, data structures are hard, efficiency is hard…

• Of course, this list is not exhaustive….
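The multi-version/autotuning lesson above can be sketched in a few lines of Python. Everything here is a toy illustration (the function names are mine, not from any autotuning framework): keep several implementations of the same computation, time them on a representative sample, and dispatch to whichever wins.

```python
import timeit

def sum_builtin(data):       # version 1: C-implemented builtin
    return sum(data)

def sum_loop(data):          # version 2: explicit interpreted loop
    total = 0
    for x in data:
        total += x
    return total

def autotune(versions, sample):
    """Toy autotuner: time each version on a sample, return the fastest."""
    return min(versions,
               key=lambda f: timeit.timeit(lambda: f(sample), number=10))

data = list(range(10000))
best = autotune([sum_builtin, sum_loop], data)
assert best(data) == sum(data)   # whichever version wins, the answer is the same
```

Real autotuners (ATLAS, FFTW) search far larger spaces – algorithm variants, blockings, unrollings – but the structure is this: semantically equivalent versions plus empirical selection.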

HPC… lessons not to learn…

• Programmer effort doesn’t matter
• Hardware efficiency matters
• Low-level programming tools are acceptable
• Low-level control is an acceptable path to performance
• Horizontal locality / explicit control of communication is critical

Based on these lessons, move beyond the conventional wisdom that “parallelism is hard.” Parallelism can be easy.

Moving Parallel Programming Forward

Spending Moore’s Dividend (Larus)

• 30-year retrospective, analyzing Microsoft’s experience
• Spent on:
  – New application features
  – Higher level programming
    > Structured programming
    > Object-oriented
    > Managed runtimes
  – Programmer productivity (decreased focus)

Today, most programmers do not focus on performance

[Spectrum: programmer productivity ↔ execution efficiency]

• Productivity: “Quick functionality, adequate performance”
  – Matlab, Mathematica, R, SAS, etc.
  – VisualBasic, Perl, JavaScript, Python, Ruby/Rails

• Mixed: “Productivity and performance”
  – Java, C# (managed languages + rich libraries)

• Performance: “Efficiency is critical” (HPC focus)
  – C++ and STL
  – C and Fortran

• How can we enable productivity programmers to write 100-fold parallel programs? Parallelism must be accessible to productivity programmers.

Challenge #1: Can we introduce parallelism in the Productivity Layer?

• Enable productivity programmers to create large-scale parallelism with modest effort
• Simple models of parallelism mated to productivity languages
  – Data and collection parallelism, determinism, functional/declarative
  – Parallel libraries
• Generate scalable parallelism for many applications
• Exploit with dynamic and compiled approaches

⇒ May be suitable for introductory programming classes
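A hedged sketch of what a productivity-layer parallel primitive might look like, in Python (one of the productivity languages named on the earlier slide). `par_map` is a hypothetical stand-in, not an existing library API: collection parallelism with deterministic, serial-equivalent semantics, so the parallel program means exactly what the sequential one does. (Threads keep the sketch self-contained; a production runtime would map the work onto real cores.)

```python
from concurrent.futures import ThreadPoolExecutor

def par_map(fn, xs, workers=4):
    """Deterministic collection parallelism: same meaning as map(fn, xs).

    Results come back in input order regardless of execution order,
    so the parallel program is observably identical to the serial one.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, xs))   # Executor.map preserves input order

squares = par_map(lambda x: x * x, range(10))
assert squares == [x * x for x in range(10)]
```

Determinism is what makes this teachable: a student can reason about `par_map` exactly as they reason about `map`, which is the point of introducing parallelism in the productivity layer.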

Challenge #2: Can we compose parallel programs safely (and easily)?

• Parallel program composition frameworks that preserve correctness of the composed parts
  – Language, execution model, and/or tools
• Interactions, when necessary, are controllable at the programmer’s semantic level (i.e. objects or modules)
• Composition and interactions supported efficiently by hardware

• Fulfill the vision behind TM, Concurrent Collections, Concurrent Aggregates, Linda, lock-free, race-free, etc.

Challenge #3: Can we invent an easy, modular way to express locality?

• Describe data reuse and spatial locality • Descriptions compose • Enable software exploitation • Enable hardware exploitation

• Computation = Algorithms + Data structures

• Efficient Computation = Algorithms + Data Structures + Locality
  – A + DS + L
  – (A + DS) + L
  – A + (DS + L)
  – (A + L) + DS
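A standard illustration of the “(A + DS) + L” combination – my example, not from the slides – is loop tiling: the algorithm and data structure are unchanged, and a locality layer reorders the traversal into cache-sized blocks. A Python sketch (in pure Python the interpreter overhead hides the cache effect; the structure is the point):

```python
def transpose_naive(m):
    """Algorithm + data structure, no locality layer."""
    n = len(m)
    return [[m[j][i] for j in range(n)] for i in range(n)]

def transpose_blocked(m, block=32):
    """Same algorithm and data structure, plus a locality layer:
    visit the matrix in block x block tiles so each tile of the
    source and destination stays in cache while it is reused."""
    n = len(m)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = m[i][j]
    return out

m = [[i * 7 + j for j in range(64)] for i in range(64)]
assert transpose_blocked(m) == transpose_naive(m)
```

Note that the locality layer composes cleanly here: only the loop nest changed, which is exactly the modularity the challenge asks for.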

Challenge #4: Can we raise the level of efficient parallel programming?

• HPC model = lowest level of abstraction
  – Direct control of resource mapping
  – Explicit control over communication and synchronization
  – Direct management of locality

• Starting from HPC models, how can we raise the level of abstraction?

Parallelism Implications for Computing Education in 2015 (and NOW!)

• Parallelism is in all computing systems, and should be front and center in education
  – An integral part of early introduction and experience with programming (simple models)
  – An integral part of early theory and algorithms (foundation)
  – An integral part of software architecture and engineering

• Observation: no biologist, physicist, chemist, or engineer thinks about the world with a fundamentally sequential foundation.
• The world is parallel! We can design large-scale concurrent systems of staggering complexity.

Are All Applications Parallel?

[Chart: fraction of parallel applications – ??% today vs. 100% in the manycore era; # of applications growing rapidly]

What’s going on at Intel

Universal Parallel Computing Research Centers

Catalyze breakthrough research enabling pervasive use of parallel computing

• Parallel Programming: languages, compilers, runtime, tools
• Parallel Applications: for desktop, gaming, and mobile systems
• Parallel Architecture: support new generation of programming models and languages
• Parallel System S/W: performance scaling, memory utilization, and power consumption

Making Parallel Computing Pervasive

• Joint HW/SW R&D to enable future Intel products
• Intel Tera-scale Research: 3-7+ years out
• Academic research program (UPCRCs) seeking disruptive innovations: 7-10+ years out

• Multi-core software products: wide array of leading multi-core SW development tools & info available today
• Multi-core SW community and experimental tools
  – TBB open sourced
  – STM-enabled compiler on Whatif.intel.com
  – Parallel benchmarks at Princeton’s PARSEC site
• Multi-core Education Program
  – 400+ universities, 25,000+ students; 2008 goal: double this
  – Intel® Academic Community, Threading for Multi-core SW, multi-core books

Ct: A Throughput Programming Language

User writes serial-like, core-independent C++ code:

    TVEC a(src1), b(src2);
    TVEC c = a + b;
    c.copyOut(dest);

[Diagram: two nested vectors added elementwise, the work split across Threads 1-4 on Cores 1-4, each core with a SIMD unit]

• Primary data abstraction is the nested vector – supports dense, sparse, and irregular data
• Ct parallel runtime: auto-scales to increasing cores
• Ct JIT compiler: auto-vectorization for SSE, AVX, Larrabee

Programmer thinks serially; Ct exploits parallelism. See whatif.intel.com
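Ct itself is not shown beyond the snippet above, but its serial-looking style can be mimicked in a few lines of Python. `TVec` below is a toy stand-in of mine, not Ct’s actual API: operators express elementwise, data-parallel work, leaving a runtime free to split each operation across any number of cores.

```python
class TVec:
    """Toy stand-in for Ct's TVEC: operators express elementwise,
    data-parallel work that a runtime could split across cores."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Elementwise add: every position is independent, so a runtime
        # may execute this on one core or a thousand without changing
        # the program's meaning.
        return TVec(x + y for x, y in zip(self.data, other.data))

    def copy_out(self):
        return list(self.data)

a = TVec([1, 1, 0, 1])
b = TVec([0, 1, 0, 0])
c = a + b                        # the programmer thinks serially
assert c.copy_out() == [1, 2, 0, 1]
```

The design choice mirrored here is that parallelism lives entirely behind the operator: the user-visible program has no threads, locks, or core counts.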

Intel® Concurrent Collections for C/C++

Supports serious separation of concerns:

• The domain expert does not need to know about parallelism. Their work is the application problem itself:
  – Semantic correctness
  – Constraints required by the application
• The tuning expert does not need to know about the domain. Their work is the mapping to the target platform:
  – Architecture, actual parallelism, locality, overhead
  – Load balancing, distribution among processors, scheduling within a processor

The two meet at the Concurrent Collections spec.

[Diagram: image → Classifier1 → Classifier2 → Classifier3 pipeline]

Intel Software Products

[Slide shows Intel software products released in 2007 and 2008]

“Polaris” Teraflops Research Processor

• Deliver Tera-scale performance
  – TFLOP @ 62W – desktop power; 16 GF/W
  – Frequency target 5GHz; 80 cores
  – Bi-section B/W of 256GB/s
  – Link bandwidth in hundreds of GB/s
• Prototype two key technologies
  – On-die interconnect fabric
  – 3D stacked memory
• Develop a scalable design methodology
  – Tiled design approach
  – Mesochronous clocking
  – Power-aware capability

[Die plot: 12.64mm x 21.72mm full chip; single tile 1.5mm x 2.0mm; I/O areas, PLL, TAP]
Technology: 65nm, 1 poly, 8 metal (Cu). Transistors: 100 million (full chip), 1.2 million (tile). Die area: 275mm² (full chip), 3mm² (tile). C4 bumps: 8,390.

OpenCirrus Testbed

• Open large-scale, global testbed for academic research
  – Catalyze creation of an open source cloud stack
  – Focus on data center management and services
  – Systems-level and application-level research
• Announced July 29, 2008; deployment expected December 2008
  – 6 testbeds of 1,000 to 4,000 cores each

Sites: Intel Research (Bay Area), HP, Yahoo, CMU, UIUC, KIT (Germany), IDA (Singapore)

Create an academic cloud computing research community and open source stack - www.opencirrus.org

Summary

• Is Parallelism a crisis? No – there are major opportunities to lay new foundations and synergy across layers.
• Expectations and Challenges: efficiency is not king.
• Moving Parallel Programming Forward: productivity and ease, composition, locality, and only then efficiency.
• What’s Going on at Intel: a broad array of explorations.

Questions?

Copyright © Intel Corporation. *Other names and brands may be claimed as the property of others. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, or used for any public purpose under any circumstance, without the express written consent and permission of Intel Corporation.