Intel® Xeon Phi™ Coprocessor

James Reinders
http://tinyurl.com/inteljames twitter @jamesreinders

it’s all about parallel programming
[Figure: block diagram. Labels: Source; Compilers, Libraries, Parallel Models; Multicore CPU; Intel® MIC architecture coprocessor.]

Game Changer
“Unparalleled productivity… most of this software does not run on a GPU”
- Robert Harrison, NICS, ORNL
Source: R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MPI Library
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
Intel® Trace Analyzer and Collector
Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® Parallel Studio XE

Software Development Ecosystem for Intel Xeon Phi coprocessors

Compilers, Run environs
  Open Source: gcc (kernel build only, not for applications), Python
  Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* compiler, ScaleMP*
Debugger
  Open Source: gdb
  Commercial: Intel Debugger, Rogue Wave* TotalView*, Allinea* DDT
Libraries
  Open Source: TBB1, MPICH2, FFTW, NetCDF
  Commercial: NAG*, Intel® MKL, Intel® MPI, OpenMP (in Intel compilers), Cilk™ Plus (in Intel compilers), Rogue Wave IMSL, Intel® OpenCL* SDK
Profiling & Analysis Tools
  Commercial: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Rogue Wave ThreadSpotter*
Workload Scheduler
  Commercial: Altair* PBS Professional, Adaptive Computing Moab

1 Commercial support of TBB available from Intel.
*Other names and brands may be claimed as the property of others.

Knights Corner Coprocessor
[Figure: block diagram. Labels: Intel® Xeon® Processor, System Memory, PCIe x16, TCP/IP, KNC Card (several), GDDR5 Channel (several per card), > 50 Cores, Linux OS, >= 8GB GDDR5 memory.]

Knights Corner Micro-architecture
[Figure: block diagram. Labels: Core, L2, TD, GDDR MC, PCIe Client Logic.]

Knights Corner Core
[Figure: core block diagram. Labels: pipeline stages PPF, PF, D0, D1, D2, E, WB; 4 Threads (T0 IP to T3 IP), In-Order; L1 TLB and 32KB Code Cache; Code Cache Miss; TLB Miss; Decode, 16B/Cycle (2 IPC); uCode; Pipe 0, Pipe 1; VPU RF, X87 RF, Scalar RF; VPU (512b SIMD), X87, ALU 0, ALU 1; L1 TLB and 32KB Data Cache; DCache Miss; TLB Miss Handler; L2 TLB; L2 Ctl; HWP; 512KB L2 Cache; To On-Die Interconnect.]
Core X86 specific logic < 2% of core + L2 area

Vector Processing Unit
[Figure: pipeline and VPU block diagram. Labels: core pipeline PPF, PF, D0, D1, D2, E, WB; vector pipeline D2, E, VC1, VC2, V1-V4, WB; VPU blocks LD, DEC, RF (3R, 1W), Mask RF, Vector ALUs (16 Wide x 32 bit, 8 Wide x 64 bit, Fused Multiply Add), EMU, ST, Scatter Gather.]

Interconnect
[Figure: block diagram. Labels: Core, L2, TD; BL (64 Bytes, Data); AD (Command and Address); AK (Coherence and Credits).]
Interleaved Memory Access
[Figure: block diagram. Labels: Core, L2, TD, GDDR MC.]

A picture can be worth a thousand words.

Picture worth many words
[Figure: chart built up over several slides; only the callouts below survive.]
SMALL NUMBER OF THREADS IS UNINTERESTING
AT LOW PERFORMANCE LEVELS, MORE THREADS NEEDED FOR SAME PERFORMANCE
THE PAYOFF IS HIGHER ACHIEVABLE RESULTS ON CERTAIN WORKLOADS AND LOWER POWER USAGE

Over 100 threads?

    !$OMP PARALLEL DO PRIVATE(j,k)
    do i = 1, M
       ! each thread will work its own part of the problem
       do j = 1, N
          do k = 1, X
             ! calculations
          end do
       end do
    end do

Fortran do loop transformed to create many threads using an OpenMP directive

Where does my program run?
1. On CPU and “offload” to coprocessor: model popular with GPUs
2. All the cores (CPU or coprocessor) are just peers in a system (probably connected with MPI)
Your choice. Whatever works best for you.

On CPU and “offload” to coprocessor: model popular with GPUs
Supported by:
1. Automatic use by Intel® Math Kernel Library (MKL)
2. Program controls by compiler directives (C, C++, Fortran)
3. APIs available to build additional tools or low level programs

Offload Directives and Standard Requirements
Columns: Desired Feature | NVidia’s OpenACC | Intel’s LEO | Standard
[For rows with fewer than three marks, the column each mark belongs to is not recoverable from the extraction.]
Support for C and C++, Fortran ✔ ✔ ✔
Support single code base of hetero-machine ✔ ✔ ✔
Overlap communication and computation ✔ ✔ ✔
Interoperate with MPI ✔ ✔ ✔
Interoperate with OpenMP* ✔ ✔
Offload to GPU ✔ ✔
Offload to MIC Coprocessor ✔ ✔
Ability to support all accelerators ✔
Ability to support all GPUs ✔
Ability to support all co-processors ✔
Proof of performance portability ✔
Support for nested parallelism ✔ ✔
User-managed memory consistency ✔ ✔ ✔
Multiple vendor support ✔ ✔
Restrict clause support ✔
Support for dynamic dispatch ✔ ✔
Parallel on/off separate from offload ✔ ✔
PGI*, CAPS* compiler support 2012 ✔
Cray* compiler support soon ✔
Intel® compiler support 2010* ✔
Broad standards body approval ✔
OpenMP* 4.0 (early 2013) planned
* public product in 2012

two pre-Standard approaches to directives to control “offload”
nVidia OpenACC | Intel Language Extensions for Offload
Data Parallelism Only | Broad range of Parallelism
Optimized for SIMT GPU | Multicore, Many-core CPU, GPU
No General Purpose Threading | General Purpose Threading
Targets “GPU Computing” | Supports Intel CPU, GPU & coprocessor
closed spec

standards body with broad participation
OpenMP “omp target”
Open, Standard, Supports Diverse Hardware

Intel will support the OpenMP/TR in our C/C++ and Fortran compilers
Intel LEO supports diverse parallel programming models and is an ideal path to OpenMP 4.0
Other brands and names are the property of their respective owners.

Where does your program RUN? Everywhere
More flexible possibility: Consider the program to run on cores everywhere. This opens up many possibilities.
Peers: cores or groups of cores can be organized in many ways.
Peers? Well, it is an SMP-on-a-chip running Linux.
As peers, a distributed program runs on processors and coprocessors, communicating with each other.
Many ways to think about this.
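Whether the workers are OpenMP threads on one device or peer processes spread across processors and coprocessors, each one ends up working its own part of the problem. A minimal sketch of one way to split an iteration space into contiguous, near-equal blocks is given here. It is written in Python purely for brevity; the function name block_partition and the sizes used (1000 iterations, 120 workers) are illustrative assumptions that do not come from the slides, and real OpenMP or MPI runtimes may divide work differently.

```python
def block_partition(n_iters, n_workers):
    """Split range(n_iters) into n_workers contiguous, near-equal blocks.

    Hypothetical helper for illustration only; it is not an OpenMP or MPI API.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    # Every worker gets `base` iterations; the first `extra` workers get one more.
    base, extra = divmod(n_iters, n_workers)
    blocks = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    # Assumed sizes: 1000 outer-loop iterations shared by 120 workers.
    blocks = block_partition(1000, 120)
    for worker in (0, 1, 119):
        b = blocks[worker]
        print(f"worker {worker}: iterations {b.start}..{b.stop - 1} ({len(b)} total)")
```

With more than 100 workers, the outer loop needs at least that many iterations, or some workers receive an empty block and sit idle.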
Starts with MPI.
Intel Xeon Phi coprocessors stand out here because of how very flexible this model is.
Limited only by imagination!

HotChips presentation (architecture details)

Where to Learn More
http://intel.com/software/mic
http://tinyurl.com/inteljames
twitter @jamesreinders

“This is a really great book… I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations. Finally I have that book.”
- Martin Watt, Principal Engineer, Dreamworks Animation
(c) 2012, publisher: Morgan Kaufmann

Available in early 2013. (limited partial “proof” version available at SC12 for reviewers)
Completely focused on Intel Xeon Phi coprocessors.
Volume 1: essentials
~350 pages of explanation of programming.
It all comes down to PARALLEL PROGRAMMING! (applicable to processors and Intel® Xeon Phi™ coprocessor)
(c) 2013

my blogs: http://tinyurl.com/inteljames

Summary
Intel® Xeon Phi™ coprocessor provides:
  Performance and performance/watt for highly parallel HPC workloads, with cores, threads, wide-SIMD, caches, memory BW
  while maintaining the advantages of Intel Architecture: general purpose programming environment, advanced power management technology
  delivers programmability and performance/watt for highly parallel HPC
parallel programming

Thank you.

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference
