MARC Program Status and Essentials to Programming the Intel® Xeon® Phi™ Coprocessor (based on Intel® Many Integrated Core Architecture) Jim Jeffers Principal Engineer Technical Computing Group
Intel® Corporation
Intel® Many Integrated Core (Intel® MIC) Architecture
Learn more about this book: lotsofcores.com
It all comes down to PARALLEL PROGRAMMING! (applicable to processors and Intel® Xeon Phi™ coprocessors both)
Foreword, Preface
Chapters:
1. Introduction
2. High Performance Closed Track Test Drive!
3. A Friendly Country Road Race
4. Driving Around Town: Optimizing A Real-World Code Example
5. Lots of Data (Vectors)
6. Lots of Tasks (not Threads)
7. Offload
8. Coprocessor Architecture
9. Coprocessor System Software
10. Linux on the Coprocessor
11. Math Library
12. MPI
13. Profiling and Timing
14. Summary
Glossary, Index
“This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.”
Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University
Available since mid-February 2013.
Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, (c) 2013, publisher: Morgan Kaufmann
© 2013, James Reinders & Jim Jeffers, book image used with permission

Agenda
• MARC Program: What’s Next…
• Introduction to the Intel® Xeon Phi™ Coprocessor
• Coprocessor HW Architecture Overview
• Coprocessor SW Architecture Overview
• Programming the Intel® Xeon Phi™ Coprocessor – An illustrative example
• ‘Real World’ Code Performance
• Learning more: Resources for you
• Q&A
© Intel 2013, All Rights Reserved
MARC Symposium - Paper Awards - SPLASH
MARC Symposium Best Paper Award
Presented to: Hayder Al-Khalissi, Andrea Marongiu and Mladen Berekovic for their paper “An approach for Supporting OpenMP on the Intel SCC”
MARC Program Achievements
The Many-Core Application Research Community Program has exceeded Intel’s expectations with broad participation, many contributions and research results
− The world-wide research community enhanced our understanding of many-core architecture and usage
− 100+ institutions, 150+ research projects and 100s of participants
− Software research with SCC included Barrelfish, X10, Bare Metal, SW Managed Coherence, Comm libraries, Message Passing Interfaces, OpenMP and more!
− Numerous events hosted and well over 80 papers published
− More than a dozen institutions created active SCC-based curriculum
MARC Program Transition
• Many-core Intel® Xeon Phi™ products are launched and gaining increasing use
− #1 supercomputer on the Top500: Milky Way-2 (Tianhe-2), which includes 48,000 Intel Xeon Phi coprocessors!
− #6: TACC Stampede, which includes 6800+ Intel Xeon Phi coprocessors
• The MARC program advanced community knowledge and prepped many engineers and scientists for today’s “many core era”
• What was the future has become reality!
• So, as planned, Intel will end active MARC and SCC support in December 2013
So What’s Next for Intel Manycore Computing?
10 Technical Computing: Transforming Information & Data Driven Science Into Knowledge
This decade we will create and extend computing technology to connect and enrich the lives of every person on earth
Other brands, names, and images are the property of their respective owners.
Technical Computing Continues Its Rapid Growth
To Compete, You Must Compute
• Governments & Research: “My goal is simple. It is complete understanding of the universe, why it is as it is and why it exists at all” (Stephen Hawking)
• Commercial/Industrial: Better Products, Faster Time to Market, Reduced R&D
• New Users – New Uses: From Diagnosis to personalized treatments quickly; Genomics, Clinical Information
Fundamental Discovery to Business Transformation; Big Data Analytics; Gain Fundamental Insights; Enabling Data Driven Science
Transforming the world of data & information into insight & knowledge
Source: IDC: Worldwide Technical Computing Server 2013–2017 Forecast; Other brands, names, and images are the property of their respective owners. Enabling Capability & Accessibility
Supercomputing Example: Top 500* (1997 – 2012)
• 1500X Performance
• 100X Reduction in cost per FLOP
• 4X Power Increase
Strong gains, but many applications use a fraction of the capability – limiting discovery & wasting power
Source: Intel Analysis / Top500 Modernize Your Code Now to Unlock Potential Imagine What You Could Do with …
~2-3X SMC (Astronomy)
~15-20x - PCIT Parallelization (Biology)
~25-57x Acceleware* RTM (Seismic Processing)
~8-100x American Monte Carlo (Finance)
~40,000x : PCIT Modernization (Biology)
……..Your Current Performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See third party disclaimer in backup. Announcing Intel® Parallel Computing Centers
Co-investing and collaborating to deliver modern parallel applications that are Open Standard Portable Scalable Greatest long-term return on investment
Join us in accelerating the next decade of discovery
Open call for proposals: submit your collaboration proposals through the Intel® Academic Program by December 1st at: http://software.intel.com/academic
Our First Intel® Parallel Computing Centers
Collaborating to accelerate the pace of discovery For More information visit the Intel® Academic Program at: http://software.intel.com/academic
Intel® Xeon Phi™ Coprocessor Starter Kits
Available from Intel OEM Partners (e.g. HP, more TBA at SC’13)
Very good Intel Xeon Phi Coprocessor info @ http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-top10-list-for-starter-kit-developers
Go parallel today with a fully-configured system starting below $5K*
18 © Intel 2013, All Rights Reserved Intel® Xeon Phi™ Coprocessors Highly-parallel Processing for Unparalleled Discovery
Groundbreaking: differences
Up to 61 IA cores/1.1 GHz/ 244 Threads
Up to 8GB memory with up to 352 GB/s bandwidth
512-bit SIMD instructions
Linux operating system, IP addressable
Standard programming languages and tools
Leading to Groundbreaking results
Up to 1 TeraFlop/s double precision peak performance1
Enjoy up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server.2
Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server. 3
For more information go to http://www.intel.com/performance
Notes 1, 2 & 3: see backup for system configuration details.
Vision: span from few cores to many cores with consistent models, languages, tools, and techniques
[Diagram: one Source feeds Compilers, Libraries, and Parallel Models, which target both the Multicore CPU and the Intel® MIC architecture coprocessor]
Game Changer
Intel® MPI Library
Intel® Trace Analyzer and Collector
“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL
R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011
Intel® Parallel Studio XE:
• Intel® C/C++ and Fortran Compilers w/OpenMP*
• Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP
• Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Plus:
• Intel® MPI Library
• Intel® Trace Analyzer and Collector
Types of parallelism in Intel processors / coprocessors / platforms
• Instruction Level Parallelism (ILP)
– Micro-architectural techniques: pipelined execution, super-scalar execution, out-of-order/in-order execution, branch prediction…
• Vector Level Parallelism (VLP)
– Using SIMD vector processing instructions for SSE, AVX, IMCI
– SIMD register width: 64-bit (MMX), 128-bit (SSE), 256-bit (AVX) for host CPUs; 512-bit (IMCI) for Intel® Xeon Phi™ coprocessors
• Thread-Level Parallelism (TLP)
– Multi-core architecture w/ & w/o Hyper-Threading (HT)
– Many-core architecture w/ “smart” round-robin (RR) h/w multithreading
• Node Level Parallelism (NLP) (Distributed/Cluster/Grid Computing)
– MPI
Rapidly Growing Parallelism Capability: An Inflection Point
1. Multiple cores w/ HT on CPU to many cores on coprocessor w/ “smart” RR h/w multithreading: thread level parallelism
– Difference in CPU-core HT vs. coprocessor-core multithreading
– Over 240 coprocessor threads (61 cores * 4 threads/core = 244 threads)
– Call to action: thread-parallelize to fully utilize all cores/threads
2. Wider vectors per core: vector level parallelism (SIMD parallelism)
– CPUs w/ AVX support have a vector register width of 256 bits (32 bytes)
– Coprocessors have a vector register width of 512 bits (64 bytes)
– Call to action: vectorize to fully utilize the wider vectors
• BOTH should be exploited to maximize performance on coprocessors
• You can start optimization on CPU and then scale it to the coprocessor (or vice-versa!)
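As a rough sketch of the two calls to action, the loop below is both thread-parallelized and vectorized with a single OpenMP 4.0 directive. The names `scaled_add` and `lanes_per_register` are illustrative assumptions and are not taken from the deck; a compiler without OpenMP enabled ignores the pragma and runs the loop serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = alpha * x[i] + y[i].
 * "parallel for" spreads iterations across cores/threads (TLP);
 * "simd" asks the compiler to use vector instructions in each thread (VLP).
 * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
void scaled_add(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Number of elements that fit in one vector register. */
unsigned lanes_per_register(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits > 0);
    return register_bits / element_bits;
}
```

For 512-bit registers, `lanes_per_register(512, 32)` is 16 and `lanes_per_register(512, 64)` is 8, while a 256-bit AVX register holds half as many of each.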
Illustrative example
Fortran code using MPI, single threaded originally. Run on Intel® Xeon Phi™ coprocessor natively (no offload).
[Bar chart, built up over four slides: Untuned Performance on Intel® Xeon® processor; Untuned Performance on Intel® Xeon Phi™ coprocessor; TUNED Performance on Intel® Xeon® processor; TUNED Performance on Intel® Xeon Phi™ coprocessor (“Yeah!”)]
Common optimization techniques… “dual benefit”
Based on an actual (but confidential) customer example. Shown to illustrate a point about common techniques. Your results may vary!
Picture worth many words
© 2013, James Reinders & Jim Jeffers, diagram used with permission
Intel® Xeon Phi™ Coprocessor: Increases Application Performance up to 10x
Application Performance Examples

| Customer | Application | Performance Increase (note 1) vs. 2S Xeon* |
| Los Alamos | Molecular Dynamics | Up to 2.52x |
| Acceleware | 8th order isotropic variable velocity | Up to 2.05x |
| Jefferson Labs | Lattice QCD | Up to 2.27x |
| Financial Services | BlackScholes SP | Up to 7x |
| Financial Services | Monte Carlo SP | Up to 10.75x |
| Sinopec | Seismic Imaging | Up to 2.53x (note 2) |
| Sandia Labs | miniFE (Finite Element Solver) | Up to 2x (note 3) |
| Intel Labs | Ray Tracing (incoherent rays) | Up to 1.88x (note 4) |

KNC = Intel® Xeon Phi™ coprocessor (Knights Corner); E5 = Intel® Xeon® E5 processor
• Intel® Xeon Phi™ coprocessor accelerates highly parallel & vectorizable applications. (Chart)
• Table provides examples of such applications
Configuration Notes:
1. 2S Xeon vs. 1 Xeon Phi (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
2. 2S Xeon vs. 2S Xeon + 2 Xeon Phi (offload)
3. 8 node cluster, each node with 2S Xeon (comparison is cluster performance with and without 1 Xeon Phi per node) (Hetero)
4. Intel measured Oct. 2012
Source: Customer measured results as of October 22, 2012. For more information go to http://www.intel.com/performance

Intel® Xeon Phi™ Coprocessors: Full Portfolio
• 3 Family: Outstanding Parallel Computing Solution; Performance/$ leadership; 6GB GDDR5, 240 GB/s, >1 TFlops DP; SKUs 3120P, 3120A
• 5 Family: Optimized for High Density Environments; Performance/watt leadership; 8GB GDDR5, >300 GB/s, >1 TFlops DP; SKUs 5110P, 5120D
• 7 Family: Highest Level of Features; Performance leadership; 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo; SKUs 7120P, 7120X
Knights Landing: Next Generation Intel® Xeon Phi™
• Designed using Intel’s cutting-edge 14nm transistor technology: Intel leads the industry in transistor technology by about three years. With the coming 14nm process, Knights Landing will deliver more compute density and efficiency than ever before. (1)
• Not bound by “offloading” bottlenecks (standalone CPU or PCIe coprocessor): As a host processor directly installed in the motherboard socket, Knights Landing will function as a CPU, eliminate PCIe bottlenecks, and enable the next leap in compute density & performance per watt.
• Common instruction set architecture (Intel® Advanced Vector Extensions 512): First implementation of a new backward compatible instruction set architecture featuring 512 bit operations; will be supported on future Intel® Xeon® processors to be introduced after Knights Landing.
• Leadership compute & memory bandwidth (integrated on-package memory): On-package memory will significantly increase memory bandwidth, allowing workloads to take full advantage of available compute without encountering the memory bandwidth bottlenecks seen today.
(1) http://newsroom.intel.com/community/intel_newsroom/blog/2013/09/10/new-intel-ceo-president-outline-product-plans-future-of-computing-vision-to-mobilize-intel-and-developers
Knights Corner* Block Diagram
*code name = Intel Xeon Phi™ Coprocessor card and chip
Knights Corner Chip Architecture
• Up to 61 Cores
• High Performance On-Die Bidirectional Interconnect
– Fully Coherent L2 Caches
• Memory
– 8 Memory Controllers
– 16 GDDR5 Channels
– Up to 16GB Capacity
– Clamshell Supported
• PCIe Gen2 x16 (EP)
– Up to 14 GB/s w/ 256B Packets
– Support for P2P transactions
• Reliability Features
– Parity on L1 caches
– ECC on L2 caches
– ECC on Memory
– CRC on Memory IO
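The 352 GB/s peak memory bandwidth quoted in this deck is consistent with a simple product of channel count, transfer rate, and channel width. Here is a sketch of that arithmetic. It assumes each GDDR5 channel is 32 bits (4 bytes) wide, which the slides do not state, and uses the 5.5 GT/s memory speed listed in the SKU reference table; the function name is hypothetical.

```c
#include <assert.h>

/* Peak bandwidth in GB/s = channels * transfer rate (GT/s) * bytes per transfer.
 * bytes_per_transfer = 4 is an assumption (32-bit GDDR5 channel),
 * not a figure taken from the slides. */
double peak_mem_bw_gbs(int channels, double gt_per_s, int bytes_per_transfer)
{
    assert(channels > 0 && bytes_per_transfer > 0);
    return channels * gt_per_s * bytes_per_transfer;
}
```

Under that assumption, `peak_mem_bw_gbs(16, 5.5, 4)` evaluates to 352, and 16 channels at 5.0 GT/s give 320, the figure listed for the 5110P.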
© Intel 2013. All rights reserved
Intel® Xeon Phi™ Coprocessor core
Fully functional multi-thread execution unit
[Core diagram: Instruction Decode; Scalar Unit and Vector Unit; Scalar Registers and Vector Registers; 32K L1 I-cache and 32K L1 D-cache; 512K L2 Cache; ODI]
• Up to 61 in-order cores
– Ring based On-Die Interconnect (ODI)
• 64-bit addressing
• Scalar unit based on Intel® Pentium® processor family
– Two pipelines
– Dual issue with scalar instructions
– One-per-clock scalar pipeline throughput
– 4 clock latency from issue to resolution
• 4 hardware threads per core
– Each thread issues instructions in turn
– Round-robin execution hides scalar unit latency
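To see why issuing from four threads in turn can hide a 4-clock latency, here is a toy model. It is an illustration under stated assumptions, not a description of the real pipeline: every instruction of a thread is assumed to depend on that thread's previous instruction, and at most one instruction issues per cycle. The function name and parameters are hypothetical.

```c
#include <assert.h>

#define MAX_HW_THREADS 4

/* Toy model: after a thread issues, it cannot issue again for `latency`
 * cycles. Each cycle, the first ready thread in round-robin order issues.
 * Returns the number of instructions issued in `cycles` cycles. */
long simulate_round_robin(int threads, int latency, long cycles)
{
    assert(threads >= 1 && threads <= MAX_HW_THREADS && latency >= 1);
    long ready_at[MAX_HW_THREADS] = {0};
    long issued = 0;
    int next = 0; /* thread to consider first in the current cycle */

    for (long cycle = 0; cycle < cycles; cycle++) {
        for (int k = 0; k < threads; k++) {
            int t = (next + k) % threads;
            if (ready_at[t] <= cycle) {
                ready_at[t] = cycle + latency;
                issued++;
                next = (t + 1) % threads;
                break;
            }
        }
    }
    return issued;
}
```

In this model one thread with a latency of 4 fills a quarter of the issue slots, two threads fill half, and four threads fill all of them.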
Intel® Xeon Phi™ Coprocessor core
Fully functional multi-thread execution unit
• Optimized for single and double precision
• All new vector unit
– 512-bit SIMD Instructions – not Intel® SSE, MMX™, or Intel® AVX
– 32 512-bit wide vector registers
– Hold 16 singles or 8 doubles per register
• Cache organization
– L1 cache: L1-D 32KB, L1-I 32KB
– L2 cache: 512KB per core, inclusive of L1-D & L1-I, shared across all cores over ODI
– if neither code nor data is shared among all cores, L2 = 30.5MB (= 512KB/core x 61 cores)
– if all code and data is shared among all cores, L2 = 512KB
VLP / SIMD / Vectorization
Vectorization is the process of transforming a scalar operation that acts on a single data element at a time (Single Instruction Single Data – SISD) into an operation that acts on multiple data elements at once (Single Instruction Multiple Data – SIMD).
• Scalar mode
– one instruction produces one result
• SIMD processing
– with SSE or AVX or MIC instructions
– one instruction can produce multiple results
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
[Diagram: in scalar mode, a[i] + b[i] gives a single a[i]+b[i]; in SIMD mode, eight elements a[i]…a[i+7] are added to b[i]…b[i+7] at once to give c[i]…c[i+7]]
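The difference between the two modes can be modelled in plain C. This is a conceptual sketch, not compiler output or intrinsics: the outer loop of `add_simd_model` stands for one vector instruction per iteration and the inner loop for its lanes. The function names and `VLEN` are illustrative assumptions.

```c
#include <assert.h>

#define VLEN 8 /* lanes per step, as in an 8-element SIMD picture */

/* Scalar mode: one addition per loop iteration. */
void add_scalar(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Conceptual SIMD mode: each outer iteration stands for ONE vector
 * instruction producing VLEN results; the inner loop models the lanes.
 * A scalar remainder loop handles elements that do not fill a vector. */
void add_simd_model(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    int i = 0;
    for (; i + VLEN <= n; i += VLEN)
        for (int lane = 0; lane < VLEN; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; a vectorizing compiler performs essentially this strip-mining transformation on the scalar loop when it can prove the iterations are independent.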
Knights Corner Core
[Block diagram: pipeline stages PPF, PF, D0, D1, D2, E, WB; instruction pointers for threads T0–T3; 32KB code cache with L1 TLB; 4 threads, in-order decode, 16B/cycle (2 IPC); uCode; Pipe 0 and Pipe 1; VPU, x87 and scalar register files; ALU 0, ALU 1, x87 and 512b SIMD VPU; 32KB data cache with L1 TLB; 512KB L2 cache with L2 control, L2 TLB, hardware prefetcher (HWP) and TLB miss handler; connection to the On-Die Interconnect]
Core X86 specific logic < 2% of core+L2 area
Knights Corner Vector Processing Unit Architecture
• 512b vector ISA
– 16 SP, 8 DP elements
• 32 vector registers
• 8 mask registers
– per lane predicated execution
• Gather/scatter
– Prime (hint) instructions
– GenMux
• Load-Op, 2-3 sources, 1 destination
– destination same as 1 source
• EMU - SP transcendental instructions
– exp, log, recip, sqrt
• IEEE 754 2008
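Per-lane predication and gather can be illustrated with a scalar emulation. The sketch below is not the actual instruction set or its intrinsics; `masked_add` and `gather` are hypothetical helpers, and it assumes that lanes whose mask bit is clear keep their previous destination value.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* 16 single-precision elements in a 512-bit register */

/* Emulated write-masked vector add: lane i of dst is updated only when
 * bit i of mask is set; all other lanes keep their old value. */
void masked_add(float dst[LANES], const float a[LANES], const float b[LANES],
                uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
}

/* Emulated gather: each lane loads base[index[lane]], so one "vector load"
 * can collect elements from non-contiguous addresses. */
void gather(float dst[LANES], const float *base, const int index[LANES])
{
    assert(base != 0);
    for (int lane = 0; lane < LANES; lane++)
        dst[lane] = base[index[lane]];
}
```

Masks are what let a vectorized loop handle an `if` inside its body: both the comparison and the guarded update operate on all lanes, with the mask selecting which results are written.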
Knights Corner Core Memory Architecture
• Per Core Caches (shared amongst 4 threads)
– 32K Instruction Cache
– 32K Data Cache
– 512KB L2 Cache per Core
• 64 bit addressing
• L2 Cache Streaming Prefetcher
• Large TLB Capacity
– 4K and 2M Page Sizes Supported
– 64 entry L2 for 2M PTEs, or 4K, 2M PDEs
Direct Memory Access (DMA): full peer-to-peer DMA
• Card-to-system DMA (or system-to-card)
• Card-to-card DMA
KNC System Topologies
[Diagram: three topologies, each with host Xeon(s) and memory attached over PCIe* to one or more Xeon Phi coprocessors with their own memory; one topology adds InfiniBand* to form a Large MPP System]
Xeon = Intel® Xeon® Processor Platform
Xeon Phi = Intel® Xeon Phi™ Coprocessor
SKU and Product Definitions
Intel® Xeon Phi™ Coprocessor x100 Family Reference Table
Processor brand name: Intel® Xeon Phi™ Coprocessor x100. Codename: Knights Corner.

| SKU # | Form Factor, Thermal | Board TDP (Watts) | Max # of Cores | Clock Speed (GHz) | Peak Double Precision (GFLOP) | GDDR5 Memory Speeds (GT/s) | Peak Memory BW (GB/s) | Memory Capacity (GB) | Total Cache (MB) | Enabled Turbo | Turbo Clock Speed (GHz) | Recommended Customer Pricing (RCP) |
| 7120P | PCIe Card, Passively Cooled | 300 | 61 | 1.238 | 1208 | 5.5 | 352 | 16 | 30.5 | Y | 1.333 | $4129 |
| 7120X | PCIe Card, No Thermal Solution | 300 | 61 | 1.238 | 1208 | 5.5 | 352 | 16 | 30.5 | Y | 1.333 | $4129 |
| 5120D | PCIe Dense Form Factor, No Thermal Solution | 245 | 60 | 1.053 | 1011 | 5.5 | 352 | 8 | 30 | N | N/A | $2759 |
| 3120P | PCIe Card, Passively Cooled | 300 | 57 | 1.1 | 1003 | 5.0 | 240 | 6 | 28.5 | N | N/A | $1695 |
| 3120A | PCIe Card, Actively Cooled | 300 | 57 | 1.1 | 1003 | 5.0 | 240 | 6 | 28.5 | N | N/A | $1695 |

Previously Available

| SE10P* | PCIe Card, Passively Cooled | 300 | 61 | 1.1 | 1073.6 | 5.5 | 352 | 8 | 30.5 B | N | N/A | |
| 5110P** | PCIe Card, Passively Cooled | 225 | 60 | 1.053 | 1011 | 5.0 | 320 | 8 | 30 | N | N/A | $2649 |

*Special Edition availability limited to early ship program customers
**Please refer to our technical documentation for Silicon stepping information
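The peak double-precision and total cache figures in the reference table can be reproduced, after rounding, from the core count and clock speed. A sketch of that arithmetic follows. The factor of 2 flops per lane per cycle assumes one fused multiply-add per lane per cycle, which the table does not state, and both function names are hypothetical.

```c
#include <assert.h>

/* Peak DP GFLOP/s = cores * GHz * DP lanes * flops per lane per cycle.
 * dp_lanes = 8 follows from a 512-bit register holding 8 doubles.
 * flops_per_lane = 2 assumes one fused multiply-add per lane per cycle;
 * that factor is an assumption, not stated in the table. */
double peak_dp_gflops(int cores, double ghz)
{
    const int dp_lanes = 8;
    const int flops_per_lane = 2;
    assert(cores > 0 && ghz > 0.0);
    return cores * ghz * dp_lanes * flops_per_lane;
}

/* Total L2 in MB when nothing is shared: 512KB (0.5 MB) per core. */
double total_l2_mb(int cores)
{
    assert(cores > 0);
    return cores * 0.5;
}
```

With these assumptions, 61 cores at 1.1 GHz give 1073.6 (SE10P), 61 cores at 1.238 GHz give about 1208 (7120P/7120X), 60 cores at 1.053 GHz give about 1011, and 57 cores at 1.1 GHz give about 1003; `total_l2_mb` returns 30.5, 30 and 28.5 for 61, 60 and 57 cores.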
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2013, Intel Corporation.
Knights Corner Software Architecture
Optimized, highly parallel coprocessor (pairs with Intel® Xeon® processor host via PCIe)
• Runs Complete Applications: IP addressable, open source Linux* OS
• Common Source Code: standard models of clustering
• Builds on / Advances State of the Art in Parallelism: Intel Developer tools
Acts as a Linux* SMP Compute Node

Intel® Xeon Phi™ Software Architecture Components
• Development Tools
– Intel Compilers, Libraries, Parallel and Cluster Development Tools
– 3rd Party tools complement, extend, or compete with Intel Tools
– Language Extensions for Offload (LEO) and OpenMP 4.0 “Target” Extensions
• Intel® Manycore Platform Software Stack (Intel® MPSS)
– A Linux* ‘Coprocessor’-OS for the Intel® MIC device
– Support Standards compliance: Sockets, TCP/UDP IP (over PCIe), PSM, OFED Verbs, MPI*, OpenMP*, OpenCL*
– Symmetric Communication InterFace driver layer (SCIF)
– Middleware Interfaces (Intel COI, MYO) for tools developers
Design goals and principles
• Support a variety of programming models
• Standards compliance where possible
• Scalability & Symmetry
Enables a continuum of parallel computing solutions
[Software stack diagram: the Host and the Intel® Xeon Phi™ Coprocessor sides mirror each other. User level on both sides: Intel® TBB, Intel® Cilk™ Plus, Intel® MKL, OpenMP*, Intel® C/C++ and Fortran Compilers, Intel® VTune™ Amplifier XE, Intel® MPI, debuggers, tools & apps, COI, MYO, uDAPL, OFED* Verbs, sockets, user SCIF library, HCA library and OFED*/SCIF library. Host only: MPSS install, board tools, control panel, Ganglia*. Kernel level: TCP/IP and UDP/IP, NetDev, OFED* core SW, HCA driver (host) or HCA proxy (coprocessor), OFED*/SCIF driver, host/SCIF driver, IB proxy daemon, management middleware, /sys and /proc; Linux* kernel on the host and coprocessor Linux OS on the card. Hardware: PCI Express*, InfiniBand* HCA, SMC, BMC, ME, SMC update path. Legend: MPSS install, standard SW, modified Linux*, Intel® SW, standard OFED*, Intel® HW, other HW]
Intel® Manycore Platform Software Stack (Intel® MPSS)
[Diagram: Host Platform and Coprocessor connected over PCIe; components shown: Tools, Programming Models, Offload Apps, MPI, TCP/IP, Driver, Linux*-based OS]

Tools For Host-Side Initialization & Management of Card(s)
- Service for automatically booting installed Intel® Xeon Phi™ Coprocessor(s) at host start-up
- Configuration and management of coprocessor(s)
- Display information about installed coprocessor(s), such as coprocessor utilization and power

Linux* or Windows* host-side driver
- Open source Linux* driver
- Interface for offload (through COI), communication, & management of Intel® Xeon Phi™ Coprocessor(s)
- Virtual ethernet device for Linux*-based OS with support for bridging to external networks; virtual serial console device

Linux*-based OS
- Open source
- Common UNIX utilities provided through BusyBox
- NFS root for persistent config of users, tools, and apps on each Intel® Xeon Phi™ Coprocessor
- Out-of-box support for TCP/IP, sockets, MPI & OFED™
- Out-of-box support for SSH, NFS, Telnet, FTP

For illustration only, potential future options subject to change without notice.
Intel Development Tools extend to Intel® Xeon Phi™ Coprocessor
Leading developer tools for performance on nodes and clusters
• Advanced Performance: C++ and Fortran Compilers, MKL Libraries & Analysis Tools for Windows*, Linux* developers on IA based multi-core and many-core nodes
• Distributed Performance: MPI Cluster Tools with C++ and Fortran Compiler, MKL Libraries and Analysis Tools for Windows*, Linux* developers on IA based clusters
Parallel coding and tuning investments: “dual benefit” today, scale forward tomorrow
Spectrum of Programming Models
From Multi-Core Centric (Xeon) to Many-Core Centric (MIC):
• Multi-Core Hosted: general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many Core Hosted: highly-parallel codes
[Diagram: for each model, whether Main( ), Foo( ) and MPI_*( ) run on the multi-core (Xeon) side, the many-core (MIC) side, or both]
Range of models to meet application needs
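For the offload model, where Main( ) runs on the host and a highly-parallel Foo( ) runs on the coprocessor, a minimal sketch using the OpenMP 4.0 target extensions could look like the following. The function `square_offload` is a hypothetical example, not code from the deck. When no device or no OpenMP support is available, the same loop simply runs on the host.

```c
#include <assert.h>

/* Hypothetical highly-parallel phase: square every element.
 * With an OpenMP 4.0 compiler and an available device, the target region
 * runs on the coprocessor; the map() clauses describe the data copied
 * to the device before the region and back from it afterwards.
 * Without a device (or without OpenMP) the loop runs on the host. */
void square_offload(int n, const double *in, double *out)
{
    assert(n >= 0);
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
```

The rest of the program stays ordinary host code, which is what makes offload a natural fit for codes where only some phases are highly parallel.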
Programming Intel® MIC-based Systems: MPI+Offload
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of processors
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC
• Homogeneous network of hybrid nodes
[Diagram: four Xeon + MIC nodes on a network; MPI data moves between the Xeon processors, and each Xeon offloads to its MIC]
Programming Intel® MIC-based Systems: Many-core Hosted
• MPI ranks on Intel® MIC (only)
• All messages into/out of Intel® MIC
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
• Programmed as homogeneous network of many-core CPUs
[Diagram: four Xeon + MIC nodes on a network; MPI data moves between the MIC coprocessors]
Programming Intel® MIC-based Systems: Symmetric
• MPI ranks on Intel® MIC and Intel® Xeon® processors
• Messages to/from any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
• Programmed as heterogeneous network of homogeneous nodes
[Diagram: four Xeon + MIC nodes on a network; MPI data moves to and from both the Xeon processors and the MIC coprocessors]
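Because symmetric mode mixes ranks of different speeds, the work usually has to be divided unevenly. The sketch below shows one simple way to do that, a static split proportional to a per-rank weight. It is a plain C illustration without MPI calls; `partition_work` and the weights are hypothetical, and in practice the weights would have to be measured for the application.

```c
#include <assert.h>

/* Split `total` work items among `nranks` ranks in proportion to a
 * per-rank relative speed (weight). Weights are hypothetical tuning
 * inputs, e.g. measured throughput of a Xeon rank vs. a MIC rank.
 * Items left over after rounding down are handed out one per rank. */
void partition_work(long total, int nranks, const double *weight, long *count)
{
    assert(total >= 0 && nranks > 0);
    double sum = 0.0;
    for (int r = 0; r < nranks; r++) {
        assert(weight[r] > 0.0);
        sum += weight[r];
    }
    long assigned = 0;
    for (int r = 0; r < nranks; r++) {
        count[r] = (long)(total * (weight[r] / sum));
        assigned += count[r];
    }
    for (int r = 0; assigned < total; r = (r + 1) % nranks) {
        count[r]++;
        assigned++;
    }
}
```

For example, with weights 1, 1 and 2 a total of 1000 items is split 250, 250 and 500, so the rank assumed to be twice as fast receives twice the work.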
IA Benefit: Wide Range of Development Options
From Multi-Core Centric (Xeon) to Many-Core Centric (MIC):
• Multi-Core Hosted: general serial and parallel computing
• Offload: code with highly-parallel phases
• Symmetric: codes with balanced needs
• Many Core Hosted: highly-parallel codes
[Two columns of options, each ordered from ease of use (top) to fine control (bottom)]
• First column: Intel® Math Kernel Library, Intel MPI*, OpenMP*, Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenCL*, Pthreads*
• Second column: Intel® Math Kernel Library, Auto vectorization, Semi-auto vectorization: #pragma (vector, ivdep, simd), Array Notation: Intel® Cilk™ Plus, C/C++ Vector Classes (F32vec16, F64vec8), Intrinsics
Breadth, depth, familiar models meet varied application needs 61 © Intel 2013. All rights reserved Intel® Many Integrated Core (Intel® MIC) Architecture Agenda
• MARC Program: What’s Next….
• Introduction to the Intel® Xeon Phi™ Coprocessor
• Coprocessor HW Architecture Overview
• Coprocessor SW Architecture Overview
• Programming the Intel® Xeon Phi™ Coprocessor – An illustrative example
• ‘Real World’ Code Performance
• Learning more: Resources for you
• Q&A
Test System Specs
• Processor System Specs
  – 2x8 Cores: 32 Threads
  – 2.7 GHz
  – 64GB DDR3
  – 85.3 GB/s Peak Mem BW
  – OS: RHEL* Linux*
• Coprocessor Specs
  – 61 cores: 244 Threads
  – 1.1 GHz
  – 8GB GDDR5
  – 352 GB/s Peak Mem BW
  – OS: Linux*
It’s Just Linux*!
Working with the Intel® Xeon Phi™ Coprocessor
It’s Just Linux!

Chapter 3: A Friendly Country Road Race
Featuring: 9 Point 2-D Stencil
Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, (c) 2013, publisher: Morgan Kaufmann
© 2013, James Reinders & Jim Jeffers, book image and drawing used with permission
9 Point 2D Stencil - Image Blurring
2D Image Data Array
Row, Column Format
Mapping the Stencil values to the data
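A minimal C sketch of how the nine stencil values map onto a row-major image. It assumes three hypothetical weights (center, edge-adjacent, diagonal) and leaves border pixels untouched. The type and function names are illustrative only and are not taken from the book's listing.

```c
#include <stddef.h>

/* Hypothetical weights: center pixel, edge-adjacent neighbors, diagonal neighbors. */
typedef struct {
    float center;
    float adjacent;
    float diagonal;
} stencil_weights;

/* Apply a 9-point stencil to the interior of a row-major image.
 * fin and fout are separate width*height arrays; border pixels are not written. */
void stencil9_baseline(const float *fin, float *fout,
                       int width, int height, stencil_weights w)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            size_t c = (size_t)y * width + x;  /* linearized index of (row y, column x) */
            size_t n = c - width;              /* same column, row above */
            size_t s = c + width;              /* same column, row below */
            fout[c] = w.diagonal * fin[n - 1] + w.adjacent * fin[n] + w.diagonal * fin[n + 1]
                    + w.adjacent * fin[c - 1] + w.center   * fin[c] + w.adjacent * fin[c + 1]
                    + w.diagonal * fin[s - 1] + w.adjacent * fin[s] + w.diagonal * fin[s + 1];
        }
    }
}
```

Each output pixel is a weighted sum of the input pixel and its eight neighbors. With weights that sum to 1, a uniform image stays unchanged and a non-uniform one is blurred.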
C ‘Pseudo’ Code Baseline
for (i=0; i
[listing truncated in the source slide text]

Implemented Function (Linearized)
[Code listing: full source listings in the book, pp. 64-65, and at http://lotsofcores.com -> Downloads]

Parallelizing Code for the Intel® Xeon Phi™ Coprocessor!
• Steps we’ll go through:
1. Run the baseline (single thread, no parallelism)
2. Add data parallelism (vectorize)
3. Add task parallelism (scale)
4. Compare processor vs. coprocessor

Step 1. Baseline Runs
• Processor ~19x faster than coprocessor!
• Code neither vectorized nor scaled (1 thread)

Step 2. Add/Check Vectorization
• Use -vec-report=3 (or more)
• Not “auto-vectorized”: potential dependence between fin and fout
Also in book p. 69

Let’s “Help” the Compiler with a Hint
• #pragma ivdep: ignore ambiguous dependencies
• #pragma simd: just do it!
• #pragma vector {keyword}: overrides heuristics; additional information (e.g. aligned)
Also in book p. 71

Now with #pragma ivdep
Results:
• Processor -> 1.27x improvement
• Coprocessor -> 4.90x improvement!
Processor still 4.9x faster!

Step 3. Add Scaling to Code
• Loops are typically the first place to look.
• The X/Y loop nest has virtually all the work.
• It is generally better to pick an “outer loop”.
• We choose the standard OpenMP parallel for.
From book p. 73

Results With Both Scaling and Vectorization
• Processor -> 2.6x improvement
• Coprocessor -> 61x improvement!!!
Coprocessor now 4.8x faster!

2.7x increase on processor : 61x increase on coprocessor!!
Coprocessor now 4.5x faster than processor!!
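Steps 2 and 3 can be sketched together as follows. This is an illustration under stated assumptions, not the book's code: the function name and the three weight parameters are hypothetical, `#pragma ivdep` is the Intel compiler hint named above (other compilers may ignore it), and the OpenMP pragma only takes effect when OpenMP is enabled at compile time.

```c
#include <stddef.h>

/* Hypothetical vectorized and threaded 9-point stencil over a row-major image.
 * wc, wa, wd weight the center, edge-adjacent, and diagonal pixels.
 * The caller must guarantee that fin and fout do not overlap. */
void stencil9_parallel(const float *fin, float *fout,
                       int width, int height,
                       float wc, float wa, float wd)
{
    /* Step 3 (scale): output rows are independent, so threads split the outer loop. */
    #pragma omp parallel for
    for (int y = 1; y < height - 1; y++) {
        const float *north = fin + (size_t)(y - 1) * width;
        const float *row   = fin + (size_t)y * width;
        const float *south = fin + (size_t)(y + 1) * width;
        float *out = fout + (size_t)y * width;

        /* Step 2 (vectorize): ask the compiler to ignore the assumed fin/fout dependence. */
        #pragma ivdep
        for (int x = 1; x < width - 1; x++) {
            out[x] = wd * north[x - 1] + wa * north[x] + wd * north[x + 1]
                   + wa * row[x - 1]   + wc * row[x]   + wa * row[x + 1]
                   + wd * south[x - 1] + wa * south[x] + wd * south[x + 1];
        }
    }
}
```

Parallelizing the outer (row) loop gives each thread long, contiguous inner loops to vectorize, which matches the "pick an outer loop" advice. Without OpenMP enabled the function still runs correctly on a single thread.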
[Figure: the “Remember This?” diagram, a picture worth many words; © 2013, James Reinders & Jim Jeffers, used with permission]

Real Plot for 9 Point Stencil Workload
[Plot annotation: hit memory bandwidth wall]

Additional Tuning Considerations
• Data / Memory Alignment
  – Memory -> cache alignment
  – C/C++
    • __attribute__((aligned(64)))
    • #pragma vector aligned
    • _mm_malloc(n, 64)
  – FORTRAN
    • -align array64byte
    • !dir$ attributes align: 64:: var
    • !DIR$ VECTOR ALIGNED
  – Padding focus
  – Especially code with neighbor calcs
    • E.g. stencils
  – See book Chapter 4
• Prefetch Analysis / Tuning
  – #pragma prefetch var: hint: distance
  – -opt-prefetch=n
• Huge (2MB) Pages
  – Data access pattern dependent
  – Can reduce TLB miss rate
  – THP (Transparent Huge Pages) in 2.6.38 kernel
  – mmap(….., MAP_HUGETLB,…)
• AOS -> SOA
  – Random vs. stride-1 access
• Cache Blocking
  – Improve cache reuse with data locality
• Streaming Stores
  – Bypass unneeded Read for Ownership (RFO) behavior
  – #pragma vector nontemporal
  – -opt-streaming-stores (always, auto, never)
Book provides discussion and examples!

Results from ‘simply’ adding Data and Task Parallelism to one source base…
Overall improvement:
• 2x Intel® Xeon® Processor: ~5.8x
• Intel® Xeon Phi™ Coprocessor: ~303x
Common optimization techniques… “dual benefit”

Agenda
• MARC Program: What’s Next….
• Introduction to the Intel® Xeon Phi™ Coprocessor
• Coprocessor HW Architecture Overview
• Coprocessor SW Architecture Overview
• Programming the Intel® Xeon Phi™ Coprocessor – An illustrative example
• ‘Real World’ Code Performance
• Learning more: Resources for you
• Q&A

Performance Proof-Point: Government and Academic Research
MPI-HMMER
• Application: open source MPI implementation of the HMMER protein sequence analysis suite
• Execution Model: Symmetric Mode
• Demonstrated Results:
  – No source code changes were required to build and run MPI-HMMER on Intel Xeon Phi coprocessors. Developers are adding #pragma unroll to improve loop performance on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
  – The key function in HMMER is the Viterbi algorithm, implemented as a contained double nested loop which gets vectorized on both the Intel® Xeon® processors and Intel® Xeon Phi™ coprocessor
[Chart: Speedup (higher is better)]
• 2S Intel® Xeon® processor E5-2670: 1
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 1.56
SOURCE: MEASURED BY INTEL JULY 2013 INTEL CONFIDENTIAL

Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)
• Application: Weather Research and Forecasting (WRF)
• Status: WRF V3.5 was released 4/18/13
• Code Optimization:
  – Approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)
  – Most modifications improved performance for both the host and the co-processors
• Performance Measurements: Pre release of WRF 3.5 (V3.5Pre) and NCAR supported CONUS2.5KM benchmark (a high resolution weather forecast)
• Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies
[Chart: Speedup (higher is better)]
• 2S Intel® Xeon® processor E5-2670 with eight-node cluster configuration: 1
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) with eight-node cluster configuration: 1.4
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Performance Proof-Point: Government and Academic Research
ZIB ISING 3D
“We achieved a 3.46x speedup in just 3 days.” Konrad-Zuse-Zentrum für Informationstechnik Berlin
• Application: ZIB Ising 3D models magnetism and phase transitions
• Status: code ready for internal use
• Demonstrated Results:
  – Two days to convert C code to AVX intrinsics, and one day to optimize the code on Intel® Xeon Phi™ coprocessors
  – Productivity for Intel Xeon Phi coprocessors was higher for target specific optimization (couple of hours versus 2-3 days implementation in CUDA)
[Chart: Speedup (higher is better)]
• 2S Intel® Xeon® processor E5-2670: 1
• Intel® Xeon Phi™ Coprocessor (pre-production HW/SW): 3.46
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012

Performance Proof-Point: Financial Services
MONTE CARLO EUROPEAN OPTIONS
• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments.
Performance depends on raw computational power and the performance of exp2()
• Status: Case Study available
• Highlights: Dramatic performance scaling for both single-precision and double-precision calculations
• Demonstrated Results:
  – Intel® Xeon Phi™ coprocessor fast exp2() and FMA instructions deliver high performance, high accuracy for single precision computations
  – Compiler based loop unrolling delivers high performance
  – Cache blocking further optimizes cache utilization, reduces cache misses, and makes outer loop vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
[Chart: Speedup (higher is better) for single precision and double precision; bar values 1, 10.36, 1, 3.34]
• 2S Intel® Xeon® processor E5-2670
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ Coprocessor (pre-production HW/SW)
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Performance Proof-Point: Government and Academic Research
JEFFERSON LAB LATTICE QCD
• Application: Lattice QCD uses a numerical approach to quantum chromodynamics to calculate weak decays of strongly interacting particles, to investigate matter under extreme conditions, and to study the structure and interaction of hadrons
• Demonstrated Results:
  – Lattice QCD benefits from the memory bandwidth of the Intel® Xeon Phi™ coprocessor
[Chart: Speedup (higher is better)]
• 2S Intel® Xeon® processor E5-2680: 1
• 2S Intel® Xeon® processor E5-2680 + Intel® Xeon Phi™ Coprocessor (pre-production HW/SW): 2.3
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013

Agenda
• MARC Program: What’s Next….
• Introduction to the Intel® Xeon Phi™ Coprocessor
• Coprocessor HW Architecture Overview
• Coprocessor SW Architecture Overview
• Programming the Intel® Xeon Phi™ Coprocessor – An illustrative example
• Learning more: Resources for you
• Q&A

Learn more about this book: parallelbook.com
SC’13 tutorial, November 2013 in Denver
Teaches parallel programming using a new pattern-based approach. Extensive examples in Cilk Plus and TBB. Not about any specific hardware, but relevant to all. It’s about effective parallel programming. Great for teaching!
“This is a really great book… I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations. Finally I have that book.”
- Martin Watt, Principal Engineer, Dreamworks Animation
Available since July 2012.
Structured Parallel Programming, Michael McCool, Arch Robison, James Reinders (c) 2012, publisher: Morgan Kaufmann
© 2012, Michael McCool, Arch Robison, James Reinders, book image used with permission

Online Resources
• http://software.intel.com/mic-developer
  – Developer’s Quick Start Guide
  – Programming Overview
  – User Forum at http://software.intel.com/en-us/forums/intel-many-integrated-core
• http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
• http://software.intel.com/en-us/articles/advanced-optimizations-for-intel-mic-architecture
• Intel® Composer XE 2013 for Linux* User and Reference Guides
• Intel Premier Support https://premier.intel.com

Webinars
Upcoming Webinars:
• http://software.intel.com/en-us/articles/intel-software-tools-technical-webinar-series
Recordings of Spring Webinars:
• http://software.intel.com/en-us/articles/intel-software-tools-spring-technical-webinar-series

Intel® Xeon Phi™ Coprocessor Wrap-up
• SMP on a chip
• Leverages existing standards, models and tools
  – “It’s Just….” [Linux, C/C++, FORTRAN, MPI, OpenMP, etc]
• Future Knights Landing adds Manycore “Processor”
• Parallel coding investments are paid “backward & forward”
• Performance AND familiar programming models
Parallelism is the Key!

Thank You!
Q & A?

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Phi, VTune and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death.
SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. 
Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Intel, Cilk, VTune, Xeon, Xeon Phi, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright ©2013 Intel Corporation.

Legal Disclaimers
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
• Estimated Results Benchmark Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. • Simulated Results Benchmark Disclaimer: Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance. • Software Source Code Disclaimer: Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Risk Factors The above statements and any others in this document that refer to plans and expectations for the third quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. 
Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. 
The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. 
A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release. Rev. 7/17/13