Essentials of Intel® Xeon Phi™ Coprocessor Programming

MARC Program Status and Essentials to Programming the Intel® Xeon Phi™ Coprocessor (based on Intel® Many Integrated Core Architecture)

Jim Jeffers, Principal Engineer, Technical Computing Group

Intel® Corporation

Intel® Many Integrated Core (Intel® MIC) Architecture

Learn more about this book: lotsofcores.com

It all comes down to PARALLEL PROGRAMMING! (applicable to processors and Intel® Xeon Phi™ coprocessors both)

“This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.”
- Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University

Foreword, Preface
Chapters:
1. Introduction
2. High Performance Closed Track Test Drive!
3. A Friendly Country Road Race
4. Driving Around Town: Optimizing A Real-World Code Example
5. Lots of Data (Vectors)
6. Lots of Tasks (not Threads)
7. Offload
8. Coprocessor Architecture
9. Coprocessor System Software
10. Linux on the Coprocessor
11. Math Library
12. MPI
13. Profiling and Timing
14. Summary
Glossary, Index

Available since mid-February 2013.

Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, (c) 2013, publisher: Morgan Kaufmann

Agenda
• MARC Program: What’s Next…
• Introduction to the Intel® Xeon Phi™ Coprocessor
• Coprocessor HW Architecture Overview
• Coprocessor SW Architecture Overview
• Programming the Intel® Xeon Phi™ Coprocessor – An illustrative example
• ‘Real World’ Code Performance
• Learning more: Resources for you
• Q&A

Presented to: Hayder Al-Khalissi, Andrea Marongiu and Mladen Berekovic for their paper “An approach for Supporting OpenMP on the Intel SCC”

MARC Program Achievements

The Many-Core Application Research Community Program has exceeded Intel’s expectations with broad participation, many contributions and research results
− The world-wide research community enhanced our understanding of many-core architecture and usage
− 100+ institutions, 150+ research projects and 100s of participants
− Software research with SCC included Barrelfish, X10, Bare Metal, SW Managed Coherence, Comm libraries, Message Passing Interfaces, OpenMP and more!
− Numerous events hosted and well over 80 papers published
− More than a dozen institutions created active SCC-based curriculum

MARC Program Transition

• Many-core Intel® Xeon Phi™ products are launched and gaining increasing use
  − #1 supercomputer on the Top500: Milky Way-2 (Tianhe-2), which includes 48,000 Intel Xeon Phi coprocessors!
  − #6: TACC Stampede, which includes 6800+ Intel Xeon Phi coprocessors
• MARC program advanced community knowledge and prepped many engineers and scientists for today’s “many core era”
• What was the future has become reality!
• So, as planned, Intel will end active MARC and SCC support in December 2013

So What’s Next for Intel Manycore Computing?

Technical Computing: Transforming Information & Data Driven Science Into Knowledge

This decade we will create and extend computing technology to connect and enrich the lives of every person on earth

Other brands, names, and images are the property of their respective owners.

Technical Computing Continues Its Rapid Growth: To Compete, You Must Compute

Governments & Research: “My goal is simple. It is complete understanding of the universe, why it is as it is and why it exists at all” (Stephen Hawking)
Commercial/Industrial: Better Products, Faster Time to Market, Reduced R&D
New Users – New Uses: From diagnosis to personalized treatments quickly; Genomics; Clinical Information

Fundamental Discovery to Business Transformation
Gain Fundamental Insights
Big Data Analytics Enabling Data Driven Science

Transforming the world of data & information into insight & knowledge

Source: IDC, Worldwide Technical Computing Server 2013–2017 Forecast

Enabling Capability & Accessibility

Supercomputing Example: Top 500* (1997–2012)
• 1500X Performance
• 100X Reduction in cost per FLOP
• 4X Power Increase

Strong gains, but many applications use only a fraction of the capability, limiting discovery and wasting power

Source: Intel Analysis / Top500

Modernize Your Code Now to Unlock Potential
Imagine What You Could Do with …

~2-3X SMC (Astronomy)

~15-20x - PCIT Parallelization (Biology)

~25-57x Acceleware* RTM (Seismic Processing)

~8-100x American Monte Carlo (Finance)

~40,000x : PCIT Modernization (Biology)

……..Your Current Performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See third party disclaimer in backup.

Announcing Intel® Parallel Computing Centers

Co-investing and collaborating to deliver modern parallel applications that are
• Open
• Standard
• Portable
• Scalable
• Greatest long-term return on investment

Join us in accelerating the next decade of discovery
Open call for Proposals: submit your collaboration proposals through the Intel® Academic Program by December 1st at: http://software.intel.com/academic

Our First Intel® Parallel Computing Centers
Collaborating to accelerate the pace of discovery
For more information visit the Intel® Academic Program at: http://software.intel.com/academic

Intel® Xeon Phi™ Coprocessor Starter Kits

Available from Intel OEM Partners (e.g. HP, more TBA at SC’13)

Very good Intel Xeon Phi Coprocessor info @ http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-top10-list-for-starter-kit-developers

Go parallel today with a fully-configured system starting below $5K*

Groundbreaking: differences

Up to 61 IA cores/1.1 GHz/ 244 Threads

Up to 8GB memory with up to 352 GB/s bandwidth

512-bit SIMD instructions

Linux operating system, IP addressable

Standard programming languages and tools

Leading to Groundbreaking results

Up to 1 TeraFlop/s double precision peak performance [1]

Enjoy up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server. [2]

Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server. [3]

For more information go to http://www.intel.com/performance
Notes 1, 2 & 3: see backup for system configuration details.
© Intel 2013, All Rights Reserved

Vision: span from few cores to many cores with consistent models, languages, tools, and techniques

[Figure: one Source feeds common Compilers, Libraries, and Parallel Models, which target both the Multicore CPU and the Intel® MIC architecture coprocessor]

Intel® MPI Library

Intel® Trace Analyzer and Collector

“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL

R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Intel® Parallel Studio XE
• Intel® C/C++ and Fortran Compilers w/ OpenMP*
• Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP
• Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
+ Intel® MPI Library
+ Intel® Trace Analyzer and Collector

• Instruction Level Parallelism (ILP)
  – Micro-architectural techniques: pipelined execution, super-scalar execution, out-of-order/in-order execution, branch prediction…
• Vector Level Parallelism (VLP)
  – Using SIMD vector processing instructions for SSE, AVX, IMCI
  – SIMD register width: 64-bit (MMX) → 128-bit (SSE) → 256-bit (AVX) for host CPUs; 512-bit (IMCI) for Intel® Xeon Phi™ coprocessors
• Thread-Level Parallelism (TLP)
  – Multi-core architecture w/ & w/o Hyper-Threading (HT)
  – Many-core architecture w/ “smart” RR h/w multithreading
• Node Level Parallelism (NLP) (Distributed/Cluster/Grid Computing)
  – MPI

Rapidly Growing Parallelism Capability

An Inflection Point
1. Multiple cores w/ HT on CPU to many cores on coprocessor w/ “smart” RR h/w multithreading → Thread level parallelism
  – Difference in CPU-core HT vs. coprocessor-core multithreading
  – Over 240 coprocessor threads (61 cores * 4 threads/core = 244 threads)
  – Call to action → thread-parallelize to fully utilize all cores/threads
2. Wider vectors per core → Vector level parallelism
  – SIMD parallelism
  – CPUs w/ AVX support have a vector register width of 256 bits (32 bytes)
  – Coprocessors have a vector register width of 512 bits (64 bytes)
  – Call to action → vectorize to fully utilize the wider vectors

• BOTH should be exploited to maximize performance on coprocessors
• You can start optimization on CPU and then scale it to the coprocessor (or vice-versa!)
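The two counts quoted above (244 threads, and 32-byte versus 64-byte registers) follow from simple arithmetic. A minimal sketch in C, where the function names `hw_threads` and `simd_lanes` are hypothetical helpers introduced only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hardware threads available = cores x threads per core. */
int hw_threads(int cores, int threads_per_core)
{
    assert(cores > 0 && threads_per_core > 0);
    return cores * threads_per_core;
}

/* Number of elements of a given size that fit in one vector register. */
int simd_lanes(int register_bits, size_t element_bytes)
{
    assert(element_bytes > 0);
    return register_bits / (int)(element_bytes * 8);
}
```

For example, `hw_threads(61, 4)` gives the 244 threads on the slide, and `simd_lanes(512, sizeof(double))` gives 8 doubles per coprocessor register against `simd_lanes(256, sizeof(double))`, which is 4, for AVX. A code that is neither threaded nor vectorized uses one lane of one thread out of that total.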

Illustrative example
Fortran code using MPI, single threaded originally. Run on Intel® Xeon Phi™ coprocessor natively (no offload).
Based on an actual (but confidential) customer example. Shown to illustrate a point about common techniques. Your results may vary!

[Chart, built up over four slides: untuned performance on the Intel® Xeon® processor and on the Intel® Xeon Phi™ coprocessor, then TUNED performance on the Intel® Xeon Phi™ coprocessor (callout: “Yeah!”), then TUNED performance on the Intel® Xeon® processor]

Common optimization techniques… “dual benefit”

Picture worth many words

Application Performance Examples

Customer | Application | Performance Increase [1] vs. 2S Xeon*
Los Alamos | Molecular Dynamics | Up to 2.52x
Acceleware | 8th order isotropic variable velocity | Up to 2.05x
Jefferson Labs | Lattice QCD | Up to 2.27x
Financial Services | BlackScholes SP | Up to 7x
Financial Services | Monte Carlo SP | Up to 10.75x
Sinopec | Seismic Imaging | Up to 2.53x [2]
Sandia Labs | miniFE (Finite Element Solver) | Up to 2x [3]
Intel Labs | Ray Tracing (incoherent rays) | Up to 1.88x [4]

KNC = Intel® Xeon Phi™ coprocessor (Knights Corner); E5 = Intel® Xeon® E5 processor

• Intel® Xeon Phi™ coprocessor accelerates highly parallel & vectorizable applications. (Chart)
• Table provides examples of such applications

Configuration Notes:
1. 2S Xeon vs. 1 Xeon Phi (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
2. 2S Xeon vs. 2S Xeon + 2 Xeon Phi (offload)
3. 8 node cluster, each node with 2S Xeon (comparison is cluster performance with and without 1 Xeon Phi per node) (Hetero)
4. Intel Measured Oct. 2012

Source: Customer Measured results as of October 22, 2012. For more information go to http://www.intel.com/performance

Intel® Xeon Phi™ Coprocessors: Full Portfolio
• 3 Family: Outstanding Parallel Computing Solution, Performance/$ leadership. 6GB GDDR5, 240 GB/s, >1 TFlops DP. SKUs: 3120P, 3120A
• 5 Family: Optimized for High Density Environments, Performance/watt leadership. 8GB GDDR5, >300 GB/s, >1 TFlops DP. SKUs: 5110P, 5120D
• 7 Family: Highest Level of Features, Performance leadership. 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo. SKUs: 7120P, 7120X

Designed using Intel’s cutting-edge 14nm transistor technology
Intel leads the industry in transistor technology by about three years. With the coming 14nm process, Knights Landing will deliver more compute density and efficiency than ever before. [1]

Not bound by “offloading” bottlenecks: standalone CPU or PCIe coprocessor
As a host processor directly installed in the motherboard socket, Knights Landing will function as a CPU, eliminate PCIe bottlenecks, and enable the next leap in compute density & performance per watt.

Common instruction set architecture: Intel® Advanced Vector Extensions 512
First implementation of new backward compatible instruction set architecture featuring 512 bit operations; will be supported on future Intel® Xeon® processors to be introduced after Knights Landing.

Leadership compute & memory bandwidth: integrated on-package memory
On-package memory will significantly increase memory bandwidth, allowing workloads to take full advantage of available compute without encountering memory bandwidth bottlenecks seen today.

[1] http://newsroom.intel.com/community/intel_newsroom/blog/2013/09/10/new-intel-ceo-president-outline-product-plans-future-of-computing-vision-to-mobilize-intel-and-developers

• High Performance On-Die Bidirectional Interconnect
  – Fully Coherent L2 Caches
• Memory
  – 8 Memory Controllers
  – 16 GDDR5 Channels
  – Up to 16GB Capacity
  – Clamshell Supported
• PCIe Gen2 x16 (EP)
  – Up to 14 GB/s w/ 256B Packets
  – Support for P2P transactions
• Reliability Features
  – Parity on L1 caches
  – ECC on L2 caches
  – ECC on Memory
  – CRC on Memory IO

Intel® Xeon Phi™ Coprocessor core
Fully functional multi-thread execution unit
• Up to 61 in-order cores
  – Ring based On-Die Interconnect (ODI)
• 64-bit addressing
• Scalar unit based on Intel® Pentium® processor family
  – Two pipelines
  – Dual issue with scalar instructions
  – One-per-clock scalar pipeline throughput
  – 4 clock latency from issue to resolution
• 4 hardware threads per core
  – Each thread issues instructions in turn
  – Round-robin execution hides scalar unit latency

[Core diagram: Instruction Decode; Scalar Unit and Vector Unit; Scalar Registers and Vector Registers; 32K L1 I-cache; 32K L1 D-cache; 512K L2 Cache; ODI]

Intel® Xeon Phi™ Coprocessor core
Fully functional multi-thread execution unit
• Optimized for single and double precision
• All new vector unit
  – 512-bit SIMD Instructions – not Intel® SSE, MMX™, or Intel® AVX
  – 32 512-bit wide vector registers
  – Hold 16 singles or 8 doubles per register
• Cache organization
  – L1 cache
    – L1-D 32KB
    – L1-I 32KB
  – L2 cache
    – 512KB per core
    – inclusive of L1-D & L1-I
    – shared across all cores over ODI
    – if neither code nor data is shared among all cores, L2 = 30.5MB (= 512KB/core x 61 cores)
    – if all code and data is shared among all cores, L2 = 512KB
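The two L2 figures above are the end points of a range. A minimal sketch of the arithmetic, assuming a deliberately simplified model in which a fixed fraction of every core's L2 holds lines that are replicated in all cores (the function name `effective_l2_kb` and the `shared_fraction` parameter are hypothetical, not part of any Intel tool):

```c
#include <assert.h>

/* Unique data (in KB) that fits in the combined L2 caches when a fraction
   of each per-core L2 holds lines replicated in every core.
   shared_fraction = 0.0: nothing shared, capacities add up.
   shared_fraction = 1.0: everything shared, only one L2's worth is unique. */
double effective_l2_kb(int cores, double l2_kb_per_core, double shared_fraction)
{
    assert(cores > 0);
    assert(shared_fraction >= 0.0 && shared_fraction <= 1.0);
    return l2_kb_per_core * (shared_fraction + (1.0 - shared_fraction) * cores);
}
```

With 61 cores and 512 KB per core, `effective_l2_kb(61, 512.0, 0.0)` is 31232 KB (30.5 MB) and `effective_l2_kb(61, 512.0, 1.0)` is 512 KB, matching the two cases on the slide. Real workloads fall somewhere in between.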

VLP / SIMD / Vectorization
Vectorization is the process of transforming a scalar operation that acts on a single data element at a time (Single Instruction Single Data – SISD) into an operation that acts on multiple data elements at once (Single Instruction Multiple Data – SIMD)

• Scalar mode
  – one instruction produces one result
• SIMD processing
  – with SSE or AVX or MIC instructions
  – one instruction can produce multiple results

for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

[Figure: in scalar mode, a[i] + b[i] produces the single result a[i]+b[i]; in SIMD mode, one instruction adds a[i..i+7] and b[i..i+7] to produce c[i..i+7]]
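The loop on the slide can be written as a small function that a vectorizing compiler has a good chance of turning into SIMD code. This is a sketch under assumptions: the function name `vec_add` is hypothetical, the loop bound is an explicit element count `n` rather than the slide's `MAX`, and the `restrict` qualifiers (C99) are added here to tell the compiler the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise add: c[i] = a[i] + b[i] for i in [0, n).
   Each iteration is independent, so a compiler may execute several
   iterations with one SIMD instruction. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler and its options; a vectorization report from the compiler is the usual way to check.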

[Core block diagram: 4 threads (T0–T3 instruction pointers); 32KB code cache with L1 TLB (code cache miss, TLB miss); 16B/cycle fetch (2 IPC); in-order decode, uCode; Pipe 0 and Pipe 1; VPU, x87, and scalar register files; ALU 0, ALU 1, x87, and VPU (512b SIMD); 32KB data cache with L1 TLB (DCache miss, TLB miss); 512KB L2 cache with L2 TLB, L2 control, TLB miss handler, and HWP; connection to the On-Die Interconnect. Core x86-specific logic is < 2% of core+L2 area.]

• 512b vector ISA
  – 16 SP, 8 DP elements
• 32 vector registers
• 8 mask registers
  – per lane predicated execution
• Gather/scatter
  – Prime (hint) instructions
  – GenMux
• Load-Op, 2-3 sources, 1 destination
  – destination same as 1 source
• EMU - SP transcendental instructions
  – exp, log, recip, sqrt
• IEEE 754 2008
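Per-lane predicated execution means a mask register selects which of the 16 single-precision lanes an instruction updates. The following is a scalar model of that behavior in plain C, not the hardware instruction or any intrinsic; the name `masked_add16` is hypothetical, and the model assumes unselected lanes keep their previous destination value.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* 512 bits / 32-bit single-precision elements */

/* Scalar model of a masked vector add: lane i of dst receives a[i] + b[i]
   only when bit i of mask is set; all other lanes are left unchanged. */
void masked_add16(float dst[LANES], const float a[LANES],
                  const float b[LANES], uint16_t mask)
{
    assert(dst != 0 && a != 0 && b != 0);
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] + b[lane];
        }
    }
}
```

With `mask = 0x0005`, only lanes 0 and 2 are written. This is how a vectorized loop can handle an `if` inside the loop body or a remainder of fewer than 16 elements without branching per element.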

• Per Core Caches (shared amongst 4T)
  – 32K Instruction Cache
  – 32K Data Cache
  – 512KB L2 Caches per Core
• 64 bit addressing
• L2 Cache Streaming Prefetcher
• Large TLB Capacity
  – 4K and 2M Page Sizes Supported
  – 64 entry L2 for 2M PTEs, or 4K, 2M PDEs

[Figure: DMA paths. Card-to-system (or system-to-card) DMA and card-to-card DMA over PCIe*; nodes made of Xeon host(s) with memory and one or more Xeon Phi coprocessors, connected by InfiniBand* into a large MPP system]

Xeon = Intel® Xeon® Processor Platform
Xeon Phi = Intel® Xeon Phi™ Coprocessor

Intel® Xeon Phi™ Coprocessor x100 Family Reference Table

Processor brand name: Intel® Xeon Phi™ Coprocessor x100. Codename: Knights Corner.

SKU # | Form factor, thermal | Board TDP (W) | Max # of cores | Clock speed (GHz) | Peak double precision (GFLOP/s) | GDDR5 memory speed (GT/s) | Peak memory BW (GB/s) | Memory capacity (GB) | Total cache (MB) | Turbo enabled | Turbo clock speed (GHz) | Recommended customer pricing (RCP)
7120P | PCIe Card, Passively Cooled | 300 | 61 | 1.238 | 1208 | 5.5 | 352 | 16 | 30.5 | Y | 1.333 | $4129
7120X | PCIe Card, No Thermal Solution | 300 | 61 | 1.238 | 1208 | 5.5 | 352 | 16 | 30.5 | Y | 1.333 | $4129
5120D | PCIe Dense Form Factor, No Thermal Solution | 245 | 60 | 1.053 | 1011 | 5.5 | 352 | 8 | 30 | N | N/A | $2759
3120P | PCIe Card, Passively Cooled | 300 | 57 | 1.1 | 1003 | 5.0 | 240 | 6 | 28.5 | N | N/A | $1695
3120A | PCIe Card, Actively Cooled | 300 | 57 | 1.1 | 1003 | 5.0 | 240 | 6 | 28.5 | N | N/A | $1695

Previously Available
SE10P* | PCIe Card, Passively Cooled | 300 | 61 | 1.1 | 1073.6 | 5.5 | 352 | 8 | 30.5 | B | N | N/A
5110P** | PCIe Card, Passively Cooled | 225 | 60 | 1.053 | 1011 | 5.0 | 320 | 8 | 30 | N | N/A | $2649

*Special Edition availability limited to early ship program customers
**Please refer to our technical documentation for Silicon stepping information
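The peak figures in the table can be reproduced from the core count, clock, and memory speed. The sketch below rests on two assumptions that are not stated in the table: each core retires 16 double-precision flops per cycle (8 lanes in a 512-bit register, times 2 for fused multiply-add), and each GDDR5 channel moves 4 bytes per transfer. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed: 8 double-precision lanes x 2 flops (fused multiply-add)
   per core per cycle. */
#define DP_FLOPS_PER_CYCLE_PER_CORE 16.0

/* Peak double-precision rate in GFLOP/s. */
double peak_dp_gflops(int cores, double clock_ghz)
{
    assert(cores > 0 && clock_ghz > 0.0);
    return cores * clock_ghz * DP_FLOPS_PER_CYCLE_PER_CORE;
}

/* Peak memory bandwidth in GB/s from transfer rate (GT/s), channel
   count, and bytes moved per transfer on each channel. */
double peak_mem_bw_gbs(double gt_per_s, int channels, int bytes_per_transfer)
{
    assert(channels > 0 && bytes_per_transfer > 0);
    return gt_per_s * channels * bytes_per_transfer;
}
```

Under these assumptions `peak_dp_gflops(61, 1.238)` is about 1208.3 (table: 1208), `peak_dp_gflops(61, 1.1)` is 1073.6 (table: 1073.6), and `peak_mem_bw_gbs(5.5, 16, 4)` is 352 (table: 352), using the 16 GDDR5 channels listed earlier in the deck. These are theoretical peaks; sustained rates are lower.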

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2013, Intel Corporation.

Acts as a Linux* SMP Compute Node
• Runs Complete Applications
• IP Addressable
• Open Source Linux* OS
• Common Source Code
• Standard models of clustering
• Builds on / Advances State of the Art in Parallelism
• Intel Developer tools

Intel® Xeon Phi™ Software Architecture Components
• Development Tools
  – Intel Compilers, Libraries, Parallel and Cluster Development Tools
    • 3rd Party tools complement, extend, or compete with Intel Tools
  – Language Extensions for Offload (LEO) and OpenMP 4.0 “Target” Extensions
• Intel® Manycore Platform Software Stack (Intel® MPSS)
  – A Linux* ‘Coprocessor’-OS for the Intel® MIC device
  – Support Standards compliance:
    • Sockets, TCP/UDP IP (over PCIe), PSM, OFED Verbs, MPI*, OpenMP*, OpenCL*
  – Symmetric Communication InterFace driver layer (SCIF)
  – Middleware Interfaces (Intel COI, MYO) for tools developers

Design goals and principles
• Support a variety of programming models
• Standards compliance where possible
• Scalability & Symmetry
Enables a continuum of parallel computing solutions

[Figure: Intel® MPSS software stack, mirrored on the host and the coprocessor. User level: tools & apps, Intel® MPI, debuggers, Intel® VTune™ Amplifier XE, Intel® C/C++ and Fortran Compilers, OpenMP*, Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, COI, MYO, uDAPL, OFED* Verbs, HCA library, OFED*/SCIF library, sockets, user SCIF library; on the host also the board tools, control panel, Ganglia*, IB proxy daemon, and management middleware. Kernel level: TCP/IP and UDP/IP over NetDev, OFED* core SW, HCA driver or HCA proxy, OFED*/SCIF driver, host and coprocessor SCIF drivers, /sys and /proc, host Linux* kernel and coprocessor Linux OS. Hardware: PCI Express*, InfiniBand* HCA, SMC, ME, BMC, SMC update path. Legend: MPSS install, standard SW, modified Linux*, Intel® SW, standard OFED*, Intel® HW, other HW]

Intel® Manycore Platform Software Stack (Intel® MPSS)

[Diagram: Host Platform (Linux*-based OS, Tools, Driver; programming models: Offload, Apps, MPI) connected over PCIe (TCP/IP) to the Coprocessor (Linux*-based OS)]

Tools for Host-Side Initialization & Management of Card(s)
- Service for automatically booting installed Intel® Xeon Phi™ Coprocessor(s) at host start-up
- Configuration and management of coprocessor(s)
- Display information about installed coprocessor(s), such as coprocessor utilization and power

Linux* or Windows* host-side driver
- Open source Linux* driver
- Interface for offload (through COI), communication, & management of Intel® Xeon Phi™ Coprocessor(s)
- Virtual ethernet device for Linux*-based OS with support for bridging to external networks; virtual serial console device

Linux*-based OS
- Open source
- Common UNIX utilities provided through BusyBox
- NFS root for persistent config of users, tools, and apps on each Intel® Xeon Phi™ Coprocessor
- Out-of-box support for TCP/IP, sockets, MPI & OFED™
- Out-of-box support for SSH, NFS, Telnet, FTP

For illustration only, potential future options subject to change without notice.

Advanced Performance: C++ and Fortran Compilers, MKL Libraries & Analysis Tools for Windows*, Linux* developers on IA based multi-core nodes

Distributed Performance: MPI Cluster Tools with C++ and Fortran Compiler, MKL Libraries and Analysis Tools for Windows*, Linux* developers on IA based multi-core and many-core clusters

Parallel coding and tuning investments: “dual benefit” today, scale forward tomorrow

Spectrum of Programming Models

Multi-Core Centric (Xeon) to Many-Core Centric (MIC)
• Multi-Core Hosted: general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many-Core Hosted: highly-parallel codes

[Figure: for each model, where Main( ), Foo( ), and MPI_*( ) run on the multi-core (Xeon) side and the many-core (MIC) side]

Range of models to meet application needs
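In the offload model, `Main( )` stays on the host and a highly-parallel phase such as `Foo( )` runs on the coprocessor. A minimal sketch using the Language Extensions for Offload pragma, assuming an Intel compiler with offload support; the function `scale_array` is a hypothetical stand-in for `Foo( )`. The `__INTEL_OFFLOAD` guard is used so that other compilers simply run the same loop on the host.

```c
#include <assert.h>

/* Multiply every element of data by factor. With an offload-capable
   Intel compiler the block is sent to the coprocessor: data is copied
   in before the block runs and copied back afterwards (inout).
   Elsewhere the pragma is compiled out and the loop runs on the host. */
void scale_array(float *data, int n, float factor)
{
    assert(data != 0 && n >= 0);
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) inout(data : length(n))
#endif
    {
        for (int i = 0; i < n; i++) {
            data[i] *= factor;
        }
    }
}
```

The copy over PCIe is part of the cost, so offload tends to pay off when the offloaded phase does a lot of work per byte transferred.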

Offload
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC
• Homogeneous network of hybrid nodes
[Diagram: four Xeon + MIC nodes on a network; MPI runs on the Xeon side, which exchanges data over the network and offloads work to its MIC]

• MPI ranks on Intel® MIC (only)
• All messages into/out of Intel® MIC
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
• Programmed as homogeneous network of many-core CPUs
[Diagram: four Xeon + MIC nodes on a network; MPI runs on the MIC side, which exchanges data over the network]

• MPI ranks on Intel® MIC and Intel® Xeon® processors
• Messages to/from any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as heterogeneous network of homogeneous nodes
[Diagram: four Xeon + MIC nodes on a network; MPI runs on both the Xeon and the MIC side, and both exchange data over the network]

Multi-Core Centric (Xeon) to Many-Core Centric (MIC)
• Multi-Core Hosted: general serial and parallel computing
• Offload: code with highly-parallel phases
• Symmetric: codes with balanced needs
• Many-Core Hosted: highly-parallel codes

Programming options, ordered from ease of use to fine control

Threading and parallelism:
• Intel® Math Kernel Library
• Intel MPI*
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads*

Vectorization:
• Intel® Math Kernel Library
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, simd)
• Array Notation: Intel® Cilk™ Plus
• C/C++ Vector Classes (F32vec16, F64vec8)
• Intrinsics

Test System Specs
• Processor System Specs
  – 2x8 Cores: 32 Threads
  – 2.7 GHz
  – 64GB DDR3
  – 85.3 GB/s Peak Mem BW
  – OS: RHEL* Linux*
• Coprocessor Specs
  – 61 cores: 244 Threads
  – 1.1 GHz
  – 8GB GDDR5
  – 352 GB/s Peak Mem BW
  – OS: Linux*

It’s Just Linux*!

Illustrative example from the book (lotsofcores.com)
Chapter 3: A Friendly Country Road Race
Featuring: 9 Point 2-D Stencil

2D Image Data Array

Row, Column Format

Mapping the Stencil values to the data

for (i=0; i

Full Source Listings in Book Pgs. 64-65 and http://lotsofcores.com -> Downloads
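For readers without the book at hand, the general shape of a 9-point stencil over a row-major 2D array can be sketched as follows. This is not the book's listing: the function name `stencil9`, the three weight parameters, and the choice to leave border cells untouched are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One sweep of a 9-point stencil over the interior of a row-major
   height x width array. Each output cell is a weighted sum of the input
   cell, its 4 edge neighbors, and its 4 corner neighbors.
   Border cells of out are not written. */
void stencil9(const float *restrict in, float *restrict out,
              int width, int height,
              float w_center, float w_edge, float w_corner)
{
    assert(in != NULL && out != NULL && width > 0 && height > 0);
    for (int y = 1; y < height - 1; y++) {
        const float *up  = in + (size_t)(y - 1) * (size_t)width;
        const float *mid = in + (size_t)y * (size_t)width;
        const float *dn  = in + (size_t)(y + 1) * (size_t)width;
        float *dst = out + (size_t)y * (size_t)width;
        for (int x = 1; x < width - 1; x++) {
            dst[x] = w_center * mid[x]
                   + w_edge   * (mid[x - 1] + mid[x + 1] + up[x] + dn[x])
                   + w_corner * (up[x - 1] + up[x + 1] + dn[x - 1] + dn[x + 1]);
        }
    }
}
```

The inner `x` loop reads three rows with stride 1, which is the access pattern that vectorizes well; the outer `y` loop is the natural candidate for threading.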

• Steps We’ll Go Through
1. Run the Baseline (Single Thread, no parallelism)
2. Add Data Parallelism (Vectorize)
3. Add Task Parallelism (Scale)
4. Compare Processor vs. Coprocessor

Baseline result:
• Processor ~19x faster than Coprocessor!
• Code not vectorized nor scaled (1 thread)

#pragma ivdep - ignore ambiguous dependencies

#pragma simd - Just do it!

#pragma vector {keyword} - overrides heuristics, or gives additional information (e.g. aligned)

Also in Book Pg. 71
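As an illustration of where such a hint goes (this is not the book's listing), the sketch below applies `#pragma ivdep` to a loop whose pointers could overlap as far as the compiler can tell. The function name `saxpy_hinted` is hypothetical, and the pragma is guarded so that only Intel compilers see it.

```c
#include <assert.h>

/* y[i] += alpha * x[i]. The pointers are not declared restrict, so the
   compiler must assume x and y might overlap unless told otherwise. */
void saxpy_hinted(float *y, const float *x, float alpha, int n)
{
    assert(y != 0 && x != 0 && n >= 0);
    /* The hint below tells the compiler to ignore assumed (unproven)
       dependencies between iterations. It is the caller's job to make
       sure x and y really do not overlap. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (int i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```

`#pragma simd` goes in the same position but is stronger: it asks for vectorization regardless of the compiler's own analysis, so a wrong hint can produce wrong results rather than just a missed optimization.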

Results:

Loops are typically the first place to look.

X/Y Loop nest has virtually all the work.

Generally better to pick an “outer loop”

We choose standard OpenMP parallel for

From Book Page 73
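A sketch of the idea (again not the book's listing): the OpenMP `parallel for` goes on the outer loop of the nest, so each thread processes whole rows and the inner loop stays available for vectorization. The function name `scale_grid_rows` is hypothetical; without OpenMP enabled at compile time the pragma is ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of a row-major height x width grid by factor.
   The outer (row) loop is divided among threads; rows are independent,
   so no synchronization is needed inside the loop. */
void scale_grid_rows(float *grid, int width, int height, float factor)
{
    assert(grid != NULL && width >= 0 && height >= 0);
#pragma omp parallel for
    for (int y = 0; y < height; y++) {
        float *row = grid + (size_t)y * (size_t)width;
        for (int x = 0; x < width; x++) {
            row[x] *= factor;
        }
    }
}
```

Parallelizing the outer loop gives each thread a large chunk of work per scheduling decision, which matters when there are up to 244 threads to keep busy.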

Coprocessor now 4.5x faster than Processor!!

Remember This? [Picture worth many words]

Hit memory bw wall

Additional Tuning Considerations
• Data / Memory Alignment
  – Memory -> Cache alignment
  – C/C++
    • __attribute__((aligned(64)))
    • #pragma vector aligned
    • _mm_malloc(n, 64)
  – FORTRAN
    • -align array64byte
    • !dir$ attributes align: 64:: var
    • !DIR$ VECTOR ALIGNED
  – Padding focus
    – Especially code with neighbor calcs
    • E.g. Stencils
    – See Book Chapter 4
• Prefetch Analysis / Tuning
  – #pragma prefetch var: hint: distance
  – -opt-prefetch=n
• Huge (2MB) Pages
  – Data access pattern dependent
  – Can reduce TLB miss rate
  – THP (Transparent Huge Pages) in 2.6.38 kernel
  – mmap(….., MAP_HUGETLB,…)
• AOS -> SOA
  – Random vs Stride 1 access
• Cache Blocking
  – Improve cache reuse with data locality
• Streaming Stores
  – Bypass unneeded Read for Ownership (RFO) behavior
  – #pragma vector nontemporal
  – -opt-streaming-stores (always, auto, never)
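Two of the items above, 64-byte alignment and the AOS -> SOA conversion, can be sketched together in portable C. This is an illustration under assumptions: it uses the C11 `aligned_alloc` function in place of the Intel-specific `_mm_malloc`, and the type and function names (`PointAoS`, `PointsSoA`, `aos_to_soa`, `soa_free`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Array-of-structures: x, y, z of one point sit next to each other, so
   reading all x values walks memory with a stride of 3 floats. */
typedef struct { float x, y, z; } PointAoS;

/* Structure-of-arrays: all x values are contiguous (stride 1). */
typedef struct { float *x, *y, *z; size_t n; } PointsSoA;

/* Allocate room for n floats on a 64-byte boundary. C11 aligned_alloc
   wants a size that is a multiple of the alignment, so round up. */
static float *alloc_floats_64(size_t n)
{
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    if (bytes == 0) {
        bytes = 64;
    }
    return aligned_alloc(64, bytes);
}

/* Copy n points from AoS to freshly allocated SoA storage.
   Returns 0 on success, -1 if an allocation failed. */
int aos_to_soa(const PointAoS *src, size_t n, PointsSoA *dst)
{
    assert(src != NULL && dst != NULL);
    dst->x = alloc_floats_64(n);
    dst->y = alloc_floats_64(n);
    dst->z = alloc_floats_64(n);
    dst->n = n;
    if (!dst->x || !dst->y || !dst->z) {
        free(dst->x); free(dst->y); free(dst->z);
        dst->x = dst->y = dst->z = NULL;
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        dst->x[i] = src[i].x;
        dst->y[i] = src[i].y;
        dst->z[i] = src[i].z;
    }
    return 0;
}

void soa_free(PointsSoA *p)
{
    free(p->x); free(p->y); free(p->z);
    p->x = p->y = p->z = NULL;
    p->n = 0;
}
```

After the conversion, a loop over `x[i]` loads 16 consecutive floats per 512-bit register from an aligned address, where the AoS layout would need a strided or gathered load.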

Book provides discussion and examples!

Overall improvement:
• 2x Intel® Xeon® Processor: ~5.8x
• Intel® Xeon Phi™ Coprocessor: ~303x

Common optimization techniques… “dual benefit”

• Application: open source MPI implementation of the HMMER protein sequence analysis suite
• Execution Model: Symmetric Mode
• Demonstrated Results:
  – No source code changes were required to build and run MPI-HMMER on Intel Xeon Phi coprocessors. Developers are adding #pragma unroll to improve loop performance on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
  – The key function in HMMER is the Viterbi algorithm implemented as a contained double nested loop which gets vectorized on both the Intel® Xeon® processors and Intel® Xeon Phi™ coprocessor

[Chart: Speedup (Higher is Better). 2S Intel® Xeon® processor E5-2670: 1. 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 1.56]

Source: measured by Intel, July 2013

Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)

Speedup (Higher is Better)

1.6 1.4 1.4 • Application: Weather Research and Forecasting (WRF) 1.2 1 1 • Status: WRF V3.5 was released 4/18/13 0.8 • Code Optimization: 0.6 0.4 – Approximately two dozen files with less than 2,000 0.2 lines of code were modified (out of approximately 0 700,000 lines of code in about 800 files, all Fortran standard compliant) – Most modifications improved performance for both the 2S Intel® Xeon® processor E5-2670 with • host and the co-processors eight-node cluster configuration • ® ® Performance Measurements: Pre release of WRF 3.5 • 2S Intel Xeon processor E5-2670 + (V3.5Pre) and NCAR supported CONUS2.5KM Intel® Xeon Phi™ coprocessor (pre-production HW/SW) benchmark (a high resolution weather forecast) with eight-node cluster configuration • Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Performance Proof-Point: Government and Academic Research
ZIB ISING 3D

“We achieved a 3.46x speedup in just 3 days.”
Konrad-Zuse-Zentrum für Informationstechnik Berlin

• Application: ZIB Ising 3D models magnetism and phase transitions
• Status: code ready for internal use
• Demonstrated Results:
– Two days to convert C code to AVX intrinsics, and one day to optimize the code on Intel® Xeon Phi™ coprocessors
– Productivity for Intel Xeon Phi coprocessors was higher for target specific optimization (couple of hours versus 2-3 days implementation in CUDA)

Chart: Speedup (higher is better)
• 2S Intel® Xeon® processor E5-2670: 1
• Intel® Xeon Phi™ Coprocessor (pre-production HW/SW): 3.46

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012

Performance Proof-Point: Financial Services
MONTE CARLO EUROPEAN OPTIONS

• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and the performance of exp2()
• Status: Case Study available
• Highlights: Dramatic performance scaling for both single-precision and double-precision calculations
• Demonstrated Results:
– Intel® Xeon Phi™ coprocessor fast exp2() and FMA instructions deliver high performance, high accuracy for single precision computations
– Compiler based loop unrolling delivers high performance
– Cache blocking further optimizes cache utilization, reduces cache misses, and makes outer loop vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi

Chart: Speedup (higher is better)
• 2S Intel® Xeon® processor E5-2670: 1 (single precision), 1 (double precision)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ Coprocessor (pre-production HW/SW): 10.36 (single precision), 3.34 (double precision)

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Performance Proof-Point: Government and Academic Research
JEFFERSON LAB LATTICE QCD

• Application: Lattice QCD uses a numerical approach to quantum chromodynamics to calculate weak decays of strongly interacting particles, to investigate matter under extreme conditions, and to study the structure and interaction of hadrons
• Demonstrated Results:
– Lattice QCD benefits from the memory bandwidth of the Intel® Xeon Phi™ coprocessor

Chart: Speedup (higher is better)
• 2S Intel® Xeon® processor E5-2680: 1
• 2S Intel Xeon processor E5-2680 + Intel® Xeon Phi™ Coprocessor (pre-production HW/SW): 2.3

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Agenda

© Intel 2013, All Rights Reserved

Learn more about this book: parallelbook.com

SC’13 tutorial, November 2013, in Denver

• Teaches parallel programming using a new pattern-based approach.
• Extensive examples in Cilk Plus and TBB.
• Not about any specific hardware, but relevant to all. It’s about effective parallel programming.
• Great for teaching!
• Available since July 2012.

“This is a really great book… I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations. Finally I have that book.”
- Martin Watt, Principal Engineer, Dreamworks Animation

• http://software.intel.com/mic-developer
– Developer’s Quick Start Guide
– Programming Overview
– User Forum at http://software.intel.com/en-us/forums/intel-many-integrated-core
• http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
• http://software.intel.com/en-us/articles/advanced-optimizations-for-intel-mic-architecture
• Intel® Composer XE 2013 for Linux* User and Reference Guides
• Intel Premier Support https://premier.intel.com

Upcoming Webinars:
• http://software.intel.com/en-us/articles/intel-software-tools-technical-webinar-series

Recordings of Spring Webinars:
• http://software.intel.com/en-us/articles/intel-software-tools-spring-technical-webinar-series

Intel® Xeon Phi™ Coprocessor Wrap-up

• SMP on a chip
• Leverages existing standards, models and tools
– “It’s Just….” [Linux, C/C++, FORTRAN, MPI, OpenMP, etc]
• Future Knights Landing adds Manycore “Processor”
• Parallel coding investments are paid “backward & forward”
• Performance AND familiar programming models

Parallelism is the Key!

Thank You!

Q & A?

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Phi, VTune and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Intel, Cilk, VTune, Xeon, Xeon Phi, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others. Copyright ©2013 Intel Corporation.

Legal Disclaimers

• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
• Estimated Results Benchmark Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
• Simulated Results Benchmark Disclaimer: Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.
• Software Source Code Disclaimer: Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Risk Factors

The above statements and any others in this document that refer to plans and expectations for the third quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. 
Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. 
Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 7/17/13