An Introduction to the Intel® Xeon Phi™ Coprocessor

An Introduction to the Intel ® Xeon Phi™ Coprocessor

INFIERI2013 July 2013

Leo Borges ([email protected]) Intel Software & Services Group Introduction

Highlevel overview of the Intel ® Xeon Phi™ platform: Hardware and Software

Intel Xeon Phi Coprocessor programming considerations: Native or Offload

Performance and Thread Parallelism

MPI Programming Models

Tracing: Intel ® Trace Analyzer and Collector

Profiling: Intel ® Trace Analyzer and Collector

Conclusions

2 Introduction

Highlevel overview of the Intel ® Xeon Phi™ platform: Hardware and Software

Intel Xeon Phi Coprocessor programming considerations: Native or Offload

Performance and Thread Parallelism

MPI Programming Models

Tracing: Intel ® Trace Analyzer and Collector

Profiling: Intel ® Trace Analyzer and Collector

Conclusions

3 Intel in High-Performance Computing

Dedicated, Large Scale Clusters Tera- Renowned Scale Applications for Test & Optimization Research Expertise Exa-Scale Labs

Defined Broad Software Platform HPC Tools Building Application Portfolio Blocks Platform

Many Manufacturing Leading Integrated Process Performance, Core Technologies Energy Efficient Architecture

A long term commitment to the HPC market segment

MultiCore ManyCore

Xeon® General Purpose Architecture Leadership Per Core Performance FP/core CAGR via AVX MultiCore CAGR Intel® Xeon Phi™ Coprocessor EN EP EP 4S Xeon EX General Max perf/watt Additional Additional Trades a “big” IA core for purpose w/ Higher compute sockets & multiple lower performance perf/watt Memory BW / density big memory IA cores resulting in higher freq and QPI performance for a subset of ideal for HPC highly parallel applications

Common Intel Environment Portable code, common tools

Energy &oil Digital content Climate modeling & Financial Medical imaging Computer Aided exploration creation weather prediction analyses, trading and biophysics Design & Manufacturing Parallel Application Types Hardware Options for Parallel Application Types Fine Grain

Coarse Grain

C/C++ OR Fortran Embarrassingly Parallel

Communication Compute Process

Highly Parallel Compute Kernels Intel® MIC Architecture 6 BlackBlack----ScholesScholes Sparse/Dense Matrix Mult … FFTs Vector Math LU Factorization 6

Highlevel overview of the Intel ® Xeon Phi™ platform: Hardware and Software

Intel Xeon Phi Coprocessor programming considerations: Native or Offload

Performance and Thread Parallelism

MPI Programming Models

Tracing: Intel ® Trace Analyzer and Collector

Profiling: Intel ® Trace Analyzer and Collector

Conclusions

7 Each Intel ® Xeon Phi™ Coprocessor core is a fully functional multi-thread execution unit >50 inorder cores Instruction Decode • Ring interconnect

64bit addressing Scalar Vector Unit Unit Scalar unit based on Intel ® Pentium ® processor family Scalar Vector • Two pipelines Registers Registers – Dual issue with scalar instructions

32K L1 II-cache-cache • Oneperclock scalar pipeline throughput 32K L1 DD-cache-cache – 4 clock latency from issue to resolution

512K L2 Cache 4 hardware threads per core • Each thread issues instructions in turn • Roundrobin execution hides scalar unit latency Ring

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Each Intel ® Xeon Phi™ Coprocessor core is a fully functional multi-thread execution unit >50 inorder cores Instruction Decode • Ring interconnect