Intel® Xeon Phi™ Coprocessor

http://tinyurl.com/inteljames twitter @jamesreinders

James Reinders

it’s all about parallel programming

[Diagram: a single source code base, using the same compilers, libraries, and parallel models, targets both multicore CPUs and Intel® MIC architecture coprocessors.]

Game Changer


“Unparalleled productivity… most of this does not run on a GPU” - Robert Harrison, NICS, ORNL

R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Intel® C/C++ and Fortran Compilers w/OpenMP

Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP

Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

Intel® MPI Library

Intel® Trace Analyzer and Collector

Intel® Parallel Studio XE

Software Development Ecosystem for Intel Xeon Phi coprocessors

Compilers, Run environs
  Open Source: gcc (kernel build only, not for applications), Python
  Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* compiler, ScaleMP*

Debugger
  Open Source: gdb
  Commercial: Intel Debugger, Rogue Wave* TotalView*, Allinea* DDT

Libraries
  Open Source: TBB [1], MPICH2, FFTW, NetCDF
  Commercial: NAG*, Intel® MKL, Intel® MPI, OpenMP (in Intel compilers), Cilk™ Plus (in Intel compilers), Rogue Wave IMSL, Intel® OpenCL* SDK

Profiling & Analysis Tools
  Commercial: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Rogue Wave ThreadSpotter*

Workload Scheduler
  Commercial: Altair* PBS Professional, Adaptive Computing Moab

[1] Commercial support of TBB available from Intel.

*Other names and brands may be claimed as the property of others.

Knights Corner Coprocessor

[Diagram: an Intel® Xeon® processor with system memory connects over PCIe x16 to each KNC card. Each card has > 50 cores, several GDDR5 memory channels, and its own OS; the diagram also labels a TCP/IP link to the cards.]

>= 8GB GDDR5 memory

Knights Corner Micro-architecture

[Diagram: cores, each with its own L2 cache, are laid out on the die together with tag directories (TD), GDDR memory controllers (GDDR MC), and PCIe client logic.]

Knights Corner Core

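[Diagram: core pipeline stages PPF, PF, D0, D1, D2, E, WB. Four hardware threads (T0 to T3), each with its own instruction pointer, feed an in-order pipeline that decodes 16B/cycle (2 IPC) into Pipe 0 and Pipe 1. Each core has a 32KB code cache and a 32KB data cache, each with an L1 TLB, a 512KB L2 cache with L2 TLB, L2 control, and a hardware prefetcher (HWP), a TLB miss handler, scalar, x87, and VPU register files, ALU 0 and ALU 1, an x87 unit, and a 512b SIMD VPU. The core connects to the on-die interconnect.]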

X86 specific logic < 2% of core + L2 area

Vector Processing Unit

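[Diagram: the vector pipeline (D2, E, VC1, VC2, V1 to V4, WB) shown against the core pipeline (PPF, PF, D0, D1, D2, E, WB). The VPU contains load (LD) and store (ST) paths, decode (DEC), a register file (RF) with 3 read ports and 1 write port, a mask RF, scatter/gather support, vector ALUs, and an EMU.]

Vector ALUs: 16 wide x 32 bit, 8 wide x 64 bit

Fused Multiply Add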
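The two vector widths follow from the 512-bit SIMD register size: 512/32 = 16 lanes and 512/64 = 8 lanes. The following is a minimal C sketch of that arithmetic, together with the kind of loop a vectorizing compiler could map onto fused multiply-add. The names `VECTOR_BITS`, `lanes_for`, and `fma_loop` are hypothetical and are not part of any Intel API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: 512-bit vector registers, as on the slide. */
enum { VECTOR_BITS = 512 };

/* Number of elements of a given bit width that fit in one vector register. */
int lanes_for(int element_bits)
{
    assert(element_bits > 0 && VECTOR_BITS % element_bits == 0);
    return VECTOR_BITS / element_bits;
}

/* y[i] = a[i] * x[i] + y[i]: one multiply and one add per element,
   the pattern that a fused multiply-add instruction covers. */
void fma_loop(size_t n, const float *a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a[i] * x[i] + y[i];
}
```

Calling `lanes_for(32)` and `lanes_for(64)` gives the 16-wide and 8-wide figures quoted on the slide. Whether `fma_loop` is actually vectorized depends on the compiler and its options.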

Interconnect

[Diagram: cores with their L2 caches and tag directories (TD) attached to three kinds of interconnect paths.]

BL: 64 bytes, data

AD: command and address

AK: coherence and credits

Interleaved Memory Access

[Diagram: cores with L2 caches, tag directories (TD), and GDDR memory controllers (GDDR MC) placed at intervals around the die.]
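One way to picture interleaving is as a rule that sends consecutive blocks of memory to different memory controllers. The slide gives neither the interleave granularity nor the controller count, so this C sketch is a simple round-robin model under assumed values, not a description of the actual hardware mapping. `controller_for` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Map a physical address to a memory controller by rotating through the
   controllers one block at a time (simple round-robin model).
   line_bytes and num_controllers are illustrative parameters only. */
unsigned controller_for(uint64_t addr, unsigned line_bytes,
                        unsigned num_controllers)
{
    assert(line_bytes > 0 && num_controllers > 0);
    return (unsigned)((addr / line_bytes) % num_controllers);
}
```

With an assumed 64-byte block and 8 controllers, addresses 0 to 63 map to controller 0, address 64 maps to controller 1, and address 512 wraps back to controller 0, so a streaming access pattern touches all controllers in turn.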


A picture can be worth a thousand words.

Picture worth many words


SMALL NUMBER OF THREADS IS UNINTERESTING


AT LOW PERFORMANCE LEVELS, MORE THREADS NEEDED FOR SAME PERFORMANCE


THE PAYOFF IS HIGHER ACHIEVABLE RESULTS ON CERTAIN WORKLOADS AND LOWER POWER USAGE

Over 100 threads?

    !$OMP PARALLEL DO PRIVATE(j,k)
    do i=1, M
       ! each thread will work its own part of the problem
       do j=1, N
          do k=1, X
             ! calculations
          end do
       end do
    end do

Fortran do loop transformed to create many threads using an OpenMP directive
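The same pattern can be sketched in C. This is an illustrative equivalent, not code from the presentation: the function name `triple_loop` and the loop body (a sum of `i + j + k`) are placeholders standing in for the elided calculations. Declaring `j` and `k` inside the loops makes them private to each thread, which plays the role of `PRIVATE(j,k)` in the Fortran version.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = sum over j < n and k < x of (i + j + k).
   With OpenMP enabled, iterations of the outer loop are divided among
   threads; without it, the pragma is ignored and the loop runs serially. */
void triple_loop(size_t m, size_t n, size_t x, double *out)
{
    assert(out != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)m; ++i) {
        double sum = 0.0;                 /* private: declared in the loop */
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < x; ++k)
                sum += (double)(i + (long)j + (long)k);
        out[i] = sum;
    }
}
```

Each outer iteration writes only its own `out[i]`, so no synchronization is needed. The number of threads is chosen by the OpenMP runtime (for example through the `OMP_NUM_THREADS` environment variable), not by the code.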

Where does my program run?

1. On the CPU, with work “offloaded” to the coprocessor (the model popular with GPUs)

2. With all the cores (CPU or coprocessor) as just peers in a system (probably connected with MPI)

Your choice. Whatever works best for you.

On CPU and “offload” to coprocessor model popular with GPUs

Supported by:

1. Automatic use by Intel® Math Kernel Library (Intel® MKL)
2. Program control by compiler directives (C, C++, Fortran)
3. APIs available to build additional tools or low-level programs
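As a rough illustration of the second option, the sketch below uses the offload pragma from Intel's Language Extensions for Offload (LEO) as supported by the Intel compilers of this period. The function `scale_offload` and its arguments are hypothetical. The `in`/`out` clauses with `length(n)` tell the compiler how many elements to copy to and from the coprocessor.

```c
#include <assert.h>
#include <stddef.h>

/* b[i] = factor * a[i] for n elements.
   A compiler that understands the offload pragma may run the block on the
   coprocessor; other compilers ignore the pragma and run it on the host. */
void scale_offload(int n, float factor, const float *a, float *b)
{
    assert(n >= 0);
    #pragma offload target(mic) in(a : length(n)) out(b : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            b[i] = factor * a[i];
    }
}
```

Because unrecognized pragmas are ignored, the same source still builds and produces the same result on a host-only system, which is the "single code base" property the offload model aims for.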

Offload Directives and Standard Requirements

Columns: Desired Feature | NVidia’s OpenACC | Intel’s LEO | Standard

Support for C and C++, Fortran: ✔ ✔ ✔
Support single code base of hetero-machine: ✔ ✔ ✔
Overlap communication and computation: ✔ ✔ ✔
Interoperate with MPI: ✔ ✔ ✔
Interoperate with OpenMP*: ✔ ✔
Offload to GPU: ✔ ✔
Offload to MIC Coprocessor: ✔ ✔
Ability to support all accelerators: ✔
Ability to support all GPUs: ✔
Ability to support all co-processors: ✔
Proof of performance portability: ✔
Support for nested parallelism: ✔ ✔
User-managed memory consistency: ✔ ✔ ✔
Multiple vendor support: ✔ ✔
Restrict clause support: ✔
Support for dynamic dispatch: ✔ ✔
Parallel on/off separate from offload: ✔ ✔
PGI*, CAPS* compiler support: 2012 ✔
Cray* compiler support: soon ✔
Intel® compiler support: 2010* ✔
Broad standards body approval: ✔

OpenMP* 4.0 (early 2013) planned

* public product in 2012

Two pre-Standard approaches to directives to control “offload”

nVidia OpenACC
• Data parallelism only
• Optimized for SIMT GPU
• No general purpose threading
• Targets “GPU Computing”

Intel Language Extensions for Offload
• Broad range of parallelism
• Multicore, many-core CPU, GPU
• General purpose threading
• Supports Intel CPU, GPU & coprocessor

closed spec; standards body with broad participation

OpenMP “omp target”: open, standard, supports diverse hardware

Intel will support the OpenMP/TR in our C/C++ and Fortran compilers
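For comparison, here is a minimal sketch of an offloaded loop using `omp target`. It uses the directive syntax as later standardized in OpenMP 4.0, which may differ in detail from the technical report mentioned above. The function name `saxpy_target` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for n elements.
   map(to: ...) copies x to the device; map(tofrom: ...) copies y both ways.
   Without OpenMP target support, the loop simply runs on the host. */
void saxpy_target(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0);
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

The structure mirrors the vendor-specific offload directives: one directive selects where the code runs and what data moves, and an ordinary OpenMP directive expresses the parallelism.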

Intel LEO supports diverse parallel programming models and is an ideal path to OpenMP 4.0.

Other brands and names are the property of their respective owners.

Where does your program RUN? Everywhere

More flexible possibility:

Consider the program to run on cores everywhere.

This opens up many possibilities.

Peer cores, or groups of cores, can be organized in many ways.

Peers? Well, it is an SMP-on-a-chip running Linux.

As peers, a distributed program runs on processors and coprocessors, communicating with each other.

Many ways to think about this.

Starts with MPI.

Intel Xeon Phi coprocessors stand out here – because of how very flexible this model is. Limited only by imagination!
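In the peer model, each MPI rank, whether it runs on a processor or a coprocessor, typically takes a share of the work. The sketch below shows one common way to compute that share, an even block distribution. It contains no MPI calls: in a real program `rank` and `size` would come from `MPI_Comm_rank` and `MPI_Comm_size`. The `range` struct and `block_range` function are hypothetical names, and an even split is only an assumption, since processors and coprocessors may warrant different shares.

```c
#include <assert.h>

/* Half-open range [begin, end) of work items owned by one rank. */
struct range { long begin; long end; };

/* Split n_items as evenly as possible among `size` ranks. */
struct range block_range(long n_items, int rank, int size)
{
    assert(n_items >= 0 && size > 0 && rank >= 0 && rank < size);
    long base = n_items / size;
    long extra = n_items % size;   /* the first `extra` ranks get one more */
    struct range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

For 10 items on 4 ranks this yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10), which cover every item exactly once.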

HotChips presentation (architecture details)

Where to Learn More

http://intel.com/software/mic


This is a really great book…

I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations.

Finally I have that book.

Martin Watt, Principal Engineer, Dreamworks Animation

(c) 2012, publisher: Morgan Kaufmann

Available in early 2013. (Limited partial “proof” version available at SC12 for reviewers.)

Completely focused on

Intel Xeon Phi coprocessors.

Volume 1: essentials ~350 pages of explanation of programming.

It all comes down to PARALLEL PROGRAMMING ! (applicable to processors and Intel® Xeon Phi™ coprocessor)

(c) 2013

http://tinyurl.com/inteljames

my blogs

Summary

Intel® Xeon Phi™ coprocessor provides:

Performance and Performance/Watt for highly parallel HPC workloads with cores, threads, wide-SIMD, caches, memory BW

while maintaining the advantages of the Intel Architecture general purpose programming environment

Advanced power management technology

delivers programmability and performance/watt for highly parallel HPC

parallel programming


Thank you.

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Copyright © 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Information

Today’s presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing for more information on the risk factors that could cause actual results to differ.

If we use any non-GAAP financial measures during the presentations, you will find on our website, intc.com, the required reconciliation to the most directly comparable GAAP financial measure.


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Legal Information: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information. SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.

Intel® Many Integrated Core (Intel MIC) Architecture

Targeted at highly parallel HPC workloads
• Physics, Chemistry, Biology, Financial Services

Power efficient cores, support for parallelism
• Cores: less speculation, threads, wider SIMD
• Scalability: high BW on die interconnect and memory

General Purpose Programming Environment
• Runs Linux (full service, open source OS)
• Runs applications written in Fortran, C, C++, …
• Supports X86 memory model, IEEE 754
• x86 collateral (libraries, compilers, Intel® VTune™, etc)