
Xeon+FPGA: Better Together
An Overview of Architecture and Practices

Elijah Charles, Gaurav Kaul
Intel Corporation

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. © 2015 Intel Corporation.

Risk Factors

The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers.
Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation. 
Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 4/14/15

Agenda

• Accelerators: Motivation and Use Cases

• Using Field Programmable Gate Array (FPGA) as an Accelerator

• Intel® Xeon® Processor + FPGA Accelerator Platform

• Hardware and Software Programming Interfaces

• Example Applications

Digital Services Economy…

• 50¹ Billion DEVICES
• Build out of the CLOUD: $120B³
• New SERVICES: $450B²

1: Sources: AMS Research, Gartner, IDC, McKinsey Global Institute, and various other industry analysts and commentators
2: Source: IDC, 2013. 2016 calculated based on reported CAGR ‘13-’17
3: Source: iDATA/Digiworld, 2013

…Fueling Cloud Growth

Cloud Economics

Workload Performance Metrics
• VMs per System
• Web Transactions / Sec
• Storage Capacity
• Hadoop Queries

Amazon’s TCO Analysis¹

Performance / TCO is the key metric

1: Source: James Hamilton, Amazon* http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/

Diverse Data Center Demands

Accelerators can increase Performance at lower TCO for targeted workloads

Intel estimates; bubble size is relative CPU intensity

Accelerator Architecture Landscape

[Figure: CPU, Reconfigurable Accelerator, and Fixed Function Accelerator positioned by Ease of Programming/Development and Application Flexibility.]

Benefits of Reconfigurable Accelerators: Savings in Area/Power

• Can be configured to implement different functions efficiently
  - Meeting performance goals for segment
  - Saving area and power compared to multiple Fixed Functions

[Figure: Cost versus Performance for Software, Programmable Accelerator, and Fixed Functions.]

Benefits of Reconfigurable Accelerators: Meeting Customer Needs for Differentiation

Driving the Digital Service Economy

• Workload Optimized Silicon
• Dynamic Resource Pooling
• Intelligent Resource Orchestration
• Pervasive Analytics & Insights

What is a Field Programmable Gate Array (FPGA)?

FPGAs (Field Programmable Gate Arrays) are semiconductor devices that can be programmed
• Desired functionality of the FPGA can be (re-)programmed by downloading a configuration into the device

FPGAs offer several advantages over potential alternatives:
• Lower one-time development cost and faster time to market compared to custom-designed chips (ASICs)
• Ability to implement customer-specific functionality beyond what is available from standard products (ASSPs)
• Customizable and reprogrammable after the device has been deployed to the field, unlike both ASICs and ASSPs

[Figure: FPGA fabric showing Logic Blocks, Interconnect Resources, and I/O Cells.]

A Complete Solutions Portfolio

POWERING YOUR INNOVATION

• CPLDs: Lowest Cost, Lowest Power
• FPGAs: Cost/Power Balance, SoC & Transceivers
• FPGAs: Mid-range FPGAs, SoC & Transceivers
• FPGAs: Optimized for High Bandwidth
• PowerSoCs: High-efficiency Power Management

RESOURCES

• Embedded Soft and Hard Processors
• Design Software
• Development Kits
• Intellectual Property (IP): Industrial, Computing, Enterprise

Efficiency via Specialization

[Figure: relative efficiency of GPUs, FPGAs, and ASICs. Source: Bob Brodersen, Berkeley Wireless group]

OpenCL and FPGAs Address These Challenges

• Power-efficient acceleration
  – Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
• FPGA lifecycle of over 15 years
  – GPU lifespan is short; GPUs require re-optimization and testing between generations
  – FPGA OpenCL code can be retargeted to future devices without modification
• Our OpenCL flow abstracts away the FPGA hardware flow
  – Puts the FPGA into software engineers' hands
• Our OpenCL SDK allows for streaming I/O channels and kernel channels
  – Data movement without host involvement
  – Low-latency data transmission to the accelerator
• Shared virtual memory
  – IBM CAPI and Intel QPI

More SW Engineering Resources than HW?

1000:1 software engineers to FPGA designers
Software engineers are not used to long compile times
OpenCL Solves This!

• Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
• Software developers write, optimize, and debug in their familiar software environment
• Quartus is run behind the scenes
• Emulator and profiler are software development tools
• Pushing long compile times to the end
• OpenCL optimization doesn’t require a board
• Allowing SW to drive board requirements (.xml file)

Application Development Paradigm

OpenCL expands the number of application developers

[Figure: developer pyramid, from fewest to most: ASIC, FPGA Programmers, Parallel Programmers, Standard CPU Programmers.]


Intel® Xeon® E5 + Field Programmable Gate Array Software Development Platform (SDP) Shipping Today

Software Development for Accelerating Workloads using Intel® Xeon® processors and coherently attached FPGA in-socket

Processor: Intel Xeon Processor E5 Product Family
FPGA Module: Altera* Stratix* V
QPI Speed: 6.4 GT/s full width (target 8.0 GT/s at full width)
Memory to FPGA Module: 2 channels of DDR3 (up to 64GB)
Expansion connector to FPGA Module: PCI Express® (PCIe) 3.0 x8 lanes - may be used for direct I/O, e.g. Ethernet
Features: Configuration Agent, Caching Agent, (optional) Memory Controller
Software: Accelerator Abstraction Layer (AAL) runtime, drivers, sample applications

[Figure: Intel Xeon Processor E5 with DDR3 channels, DMI2, and PCIe 3.0 x8 links, connected over Intel QPI to the FPGA module with its own DDR3 channels.]

Intel® QuickPath Interconnect (Intel® QPI)
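To put the link speeds listed for this platform in perspective, here is a rough back-of-the-envelope conversion from transfer rate to peak raw bandwidth. It assumes that a full-width QPI link carries 2 data bytes per transfer in each direction and that PCIe 3.0 signals at 8 GT/s per lane with 128b/130b encoding. Both are general properties of those interconnects rather than figures from this deck, and the function names are illustrative. Protocol overhead is ignored, so achievable throughput will be lower.

```python
# Peak raw link bandwidth per direction, in GB/s (1 GB = 1e9 bytes).
# Assumptions (not from the slides): a full-width QPI link moves 2 data
# bytes per transfer; PCIe 3.0 runs at 8 GT/s per lane with 128b/130b
# line encoding. Headers, CRC, and flow control are ignored.

QPI_BYTES_PER_TRANSFER = 2
PCIE3_GT_PER_S = 8.0
PCIE3_ENCODING_EFFICIENCY = 128 / 130


def qpi_bandwidth_gbs(gt_per_s: float) -> float:
    """Raw QPI bandwidth in one direction for a full-width link."""
    return gt_per_s * QPI_BYTES_PER_TRANSFER


def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Raw PCIe 3.0 bandwidth in one direction; each lane is 1 bit wide."""
    return PCIE3_GT_PER_S * PCIE3_ENCODING_EFFICIENCY * lanes / 8


if __name__ == "__main__":
    print(f"QPI @ 6.4 GT/s : {qpi_bandwidth_gbs(6.4):.1f} GB/s per direction")
    print(f"QPI @ 8.0 GT/s : {qpi_bandwidth_gbs(8.0):.1f} GB/s per direction")
    print(f"PCIe 3.0 x8    : {pcie3_bandwidth_gbs(8):.2f} GB/s per direction")
```

Under these assumptions the shipping 6.4 GT/s link works out to 12.8 GB/s per direction and the 8.0 GT/s target to 16.0 GB/s, compared with a little under 8 GB/s for the PCIe 3.0 x8 expansion connector.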

System Logical View

[Figure: Processor (Cores, LLC, QPI, DDR DRAM) connected over QPI to the FPGA (Intel QPI IP with Cache, CCI, AFUs), with the Multi-processor Coherence Domain and the Cache Access Domain marked.]

• AFUs can access coherent cache on FPGA
• AFUs can “not” implement a second level cache
• Intel® QuickPath Interconnect (Intel® QPI) IP participates in cache coherency with processors
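Since the Intel QPI IP on the FPGA participates in cache coherency with the processors, it may help to recall what a table-driven coherency protocol looks like. The sketch below encodes the textbook MESI state machine for a single cache line as a lookup table. It is an illustration only: the event names use generic bus-snooping terminology (`PrRd`, `PrWr`, `BusRd`, `BusRdX`), not QPI message types, and it does not model the FPGA IP's actual coherency table or any internal states.

```python
# Textbook MESI transitions for one cache line, as a lookup table.
# States: M(odified), E(xclusive), S(hared), I(nvalid).
# Events: PrRd/PrWr   = read/write by the local agent
#         BusRd/BusRdX = read / read-for-ownership seen from another agent
# Illustrative only; this is not the QPI IP's real coherency table.

TABLE = {
    ("M", "PrRd"): "M", ("M", "PrWr"): "M", ("M", "BusRd"): "S", ("M", "BusRdX"): "I",
    ("E", "PrRd"): "E", ("E", "PrWr"): "M", ("E", "BusRd"): "S", ("E", "BusRdX"): "I",
    ("S", "PrRd"): "S", ("S", "PrWr"): "M", ("S", "BusRd"): "S", ("S", "BusRdX"): "I",
    ("I", "PrWr"): "M", ("I", "BusRd"): "I", ("I", "BusRdX"): "I",
}


def next_state(state: str, event: str, others_have_copy: bool = False) -> str:
    """Return the next MESI state for a line given an event."""
    if (state, event) == ("I", "PrRd"):
        # A read miss fills as Shared if another cache holds the line,
        # otherwise as Exclusive.
        return "S" if others_have_copy else "E"
    return TABLE[(state, event)]


def run(events, state: str = "I") -> str:
    """Apply a sequence of events, assuming no other sharers on read misses."""
    for event in events:
        state = next_state(state, event)
    return state


if __name__ == "__main__":
    # Local read miss, local write, remote read, remote read-for-ownership.
    print(run(["PrRd", "PrWr", "BusRd", "BusRdX"]))
```

Keeping the rules in a table rather than in hard-wired logic is what makes such a protocol engine reprogrammable: changing a rule means changing an entry, not the surrounding control flow.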

Intel® Xeon® + Field Programmable Gate Array SDP: Intel® QuickPath Interconnect 1.1 RTL Microarchitecture

• PHY – Implements the Intel QPI PHY 1.1 (Analog/Digital)
• Intel QPI Link layer – Provides flow control and reliable communication
• Intel QPI Protocol – Implements Intel QPI Cache Agent + Configuration Agent
• Cache Controller – Cache hit/miss determination and generates Intel QPI protocol requests
• Cache Tag – Tracks state of cacheline (MESI + internal states for tracking outstanding requests)
• Coherency Table – Programmable table that implements coherency protocol rules
• System Protocol Layer (SPL2) – Implements address translation functionality. Can provide up to 2GB device virtual address space to AFU. SPL2 cannot handle page faults.
• AFU – User designed Accelerator Function Unit

[Figure: block diagram, top to bottom: User Accelerator Function Unit (AFU); CCI-E (Rx/Tx); SPL2 address translation; CCI-S (Rx/Tx); Intel QPI FPGA IP containing Cache Data, Cache Table, Cache Controller, Cache Tag, Rx Control, Tx Control, QPI Link/Protocol Control, Rx Align and Tx Align (640 bits each), and QPI PHY.]
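As a rough mental model of the SPL2 bullet, the sketch below maps a device virtual address window of up to 2GB onto physical pages through a table that software fills in ahead of time, and fails on an unmapped page because SPL2 cannot handle page faults. The page size, the class name, and the method names are assumptions made for illustration; the deck does not describe SPL2's actual table format or programming interface.

```python
# Toy model of device-virtual to physical address translation.
# From the slide: up to 2GB of device virtual address space, no page faults.
# Assumed for illustration: 2 MiB pages and a flat, pre-populated page table.

PAGE_SIZE = 2 * 1024 * 1024   # assumed page size, not from the slides
WINDOW_SIZE = 2 * 1024 ** 3   # 2GB device virtual address space


class DeviceAddressTranslator:
    """Hypothetical page table that must be filled before the AFU runs."""

    def __init__(self) -> None:
        self._pages = {}  # virtual page number -> physical page base address

    def map_page(self, virt_addr: int, phys_base: int) -> None:
        if not 0 <= virt_addr < WINDOW_SIZE:
            raise ValueError("virtual address outside the 2GB window")
        if virt_addr % PAGE_SIZE or phys_base % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        self._pages[virt_addr // PAGE_SIZE] = phys_base

    def translate(self, virt_addr: int) -> int:
        if not 0 <= virt_addr < WINDOW_SIZE:
            raise ValueError("virtual address outside the 2GB window")
        page_number, offset = divmod(virt_addr, PAGE_SIZE)
        if page_number not in self._pages:
            # No page-fault path: an unmapped access is a hard error here,
            # so every page the AFU touches must be mapped up front.
            raise LookupError(f"unmapped device virtual page {page_number}")
        return self._pages[page_number] + offset


if __name__ == "__main__":
    translator = DeviceAddressTranslator()
    translator.map_page(0, 0x4000_0000)
    print(hex(translator.translate(0x10)))
```

The practical consequence of the "no page faults" restriction is visible in `translate`: there is no fallback that could fetch a missing mapping on demand, so the working set has to be mapped before the accelerator starts.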


Intel® Xeon® Processor + Field Programmable Gate Array Tool Flow

HDL Programming
• C → SW Compiler → exe → Intel® Xeon® (AAL)
• HDL → Syn., PAR → bitstream → FPGA (Shell)

OpenCL™ Programming
• Host → SW Compiler → exe → Intel Xeon (AAL)
• Kernels → OpenCL Compiler → bitstream → FPGA (Shell)

Accelerator Abstraction Layer: Field Programmable Gate Array (FPGA) Programming Interfaces

[Figure: on the CPU, the Host Application sits on the Accelerator Abstraction Layer (Service API, Virtual Memory API, Physical Memory API). On the Field Programmable Gate Array, the Accelerator Function Units (AFU) sit on CCI¹ extended, Addr Translation, CCI¹ standard, and the Intel QPI/KTI Link, Protocol, & PHY Interfaces. The two sides are connected by Intel QPI.]

• Standard Programming Interfaces: AAL and CCI
• Programming interfaces will be forward compatible from SDP² to future MCP³ solutions
• Simulation Environment available for development of SW and RTL⁴

1. Coherent Cache Interface
2. Software Development Platform
3. Multi-chip package
4. Register Transfer Level

Intel® QuickPath Interconnect (Intel® QPI)

Programming Interfaces: OpenCL™

[Figure: on the CPU, the OpenCL™ Application (OpenCL Host Code) runs on the OpenCL RunTime and the Accelerator Abstraction Layer (Service API, Virtual Memory API, Physical Memory API). On the Field Programmable Gate Array, the OpenCL Kernel Code is built into OpenCL Kernels on CCI Extended, VirtMem, and CCI Standard. The two sides share System Memory over Intel QPI/PCI Express®.]

• Unified application code abstracted from the hardware environment
• Portable across generations and families of CPUs and FPGAs


Example Usage: Deep Learning Framework for Visual Understanding

[Figure: CNN accelerator block diagram: CCI Interface, SRAM Controller, Read/Write DMA, IP Registers with register access, Control State Machine, and Processing Tiles 0 to ‘n’, each containing PEs, with inputs, weights, and outputs.]

CNN (Convolutional Neural Network) function accelerated on FPGA:
Power-performance of CNN classification boosted up to 2.2X†
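The footnote to this slide states that sustaining ~2400 img/s needs ~500 MB/s of I/O bandwidth, which a 10GigE link can support. The short calculation below checks that claim and shows the per-image payload those two numbers imply. It uses only the approximate figures from the footnote plus the 10 Gbit/s line rate of 10GigE; Ethernet and software-stack overhead are ignored, and the function names are illustrative.

```python
# Sanity check of the CNN I/O bandwidth statement in the slide footnote.
# Inputs are the approximate figures quoted there; overheads are ignored.

IMAGES_PER_S = 2400      # ~2400 img/s (from the footnote)
IO_MB_PER_S = 500        # ~500 MB/s (from the footnote), 1 MB = 1e6 bytes
LINK_GBIT_PER_S = 10     # 10GigE line rate


def implied_bytes_per_image() -> float:
    """Average payload per image implied by the two footnote figures."""
    return IO_MB_PER_S * 1e6 / IMAGES_PER_S


def link_capacity_mb_per_s() -> float:
    """Raw 10GigE capacity in MB/s (8 bits per byte)."""
    return LINK_GBIT_PER_S * 1e9 / 8 / 1e6


def link_utilization() -> float:
    """Fraction of the raw link rate consumed by the image stream."""
    return IO_MB_PER_S / link_capacity_mb_per_s()


if __name__ == "__main__":
    print(f"Implied size per image: {implied_bytes_per_image() / 1e3:.1f} kB")
    print(f"10GigE raw capacity   : {link_capacity_mb_per_s():.0f} MB/s")
    print(f"Link utilization      : {link_utilization():.0%}")
```

The footnote figures imply roughly 208 kB per image and about 40% of the raw 1250 MB/s that 10GigE provides, which leaves headroom for protocol overhead.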

†Source: Intel Measured (Intel® Xeon® processor E5-2699v3 results); Altera Estimated (4x Arria-10 results). 2S Intel® Xeon® E5-2699v3 + 4x GX1150 PCI Express® cards. Most computations executed on Arria-10 FPGAs; 2S Intel Xeon E5-2699v3 host assumed to be near idle, doing misc. networking/housekeeping functions. Arria-10 results estimated by Altera with Altera custom classification network. 2x Intel Xeon E5-2699v3 power estimated @ 139W while doing "housekeeping" for GX1150 cards based on Intel measured microbenchmark. In order to sustain ~2400 img/s we need an I/O bandwidth of ~500 MB/s, which can be supported by a 10GigE link and software stack.

Example Usage: Genomics Analysis Toolkit

• BWA mem (Smith-Waterman)
• HaplotypeCaller (PairHMM)

PairHMM function accelerated on FPGA: Power-performance of pHMM boosted up to 3.8X†
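The inputs behind this estimate are given in the footnote (†) for this slide, and the projection can be roughly reproduced from them. The sketch below is one reading of that footnote, not Intel's actual method: it treats "power-performance" as throughput divided by TDP, and it interprets "90% frequency scaling on Stratix-V A7 400 MHz" as scaling the 200 MHz measurement by 0.9 × (400 / 200). Function and parameter names are illustrative.

```python
# Rough reconstruction of the pHMM performance-per-watt projection.
# All numeric inputs come from the slide footnote; how they combine
# (especially the frequency-scaling term) is an assumption.


def cpu_mcups(per_core: float = 2101.1, cores: int = 10) -> float:
    """CPU throughput assuming linear scaling across cores."""
    return per_core * cores


def fpga_mcups(per_pe: float = 408.9, pes: int = 32,
               measured_mhz: float = 200.0, target_mhz: float = 400.0,
               freq_scaling: float = 0.9) -> float:
    """FPGA throughput assuming linear PE scaling and 90% clock scaling."""
    return per_pe * pes * (target_mhz / measured_mhz) * freq_scaling


def perf_per_watt_ratio(cpu_tdp_w: float = 115.0,
                        fpga_tdp_w: float = 35.0) -> float:
    """FPGA performance per watt relative to CPU performance per watt."""
    return (fpga_mcups() / fpga_tdp_w) / (cpu_mcups() / cpu_tdp_w)


if __name__ == "__main__":
    print(f"CPU : {cpu_mcups():.1f} MCUP/s")
    print(f"FPGA: {fpga_mcups():.1f} MCUP/s")
    print(f"Performance-per-watt ratio: {perf_per_watt_ratio():.2f}X")
```

Under this reading the ratio comes out at roughly 3.7X, close to but not exactly the "up to 3.8X" on the slide, so the original estimate presumably combined the inputs slightly differently.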

†pHMM algorithm performance is measured in millions of cell updates per second (MCUP/s). Performance projections: CPU performance: 1 core of an Intel® Xeon® processor E5-2680v2 @ 2.8 GHz delivers 2101.1 MCUP/s measured; estimated value assumes linear scaling to 10 cores on Xeon E5-2680v2 @ 2.8 GHz & 115W TDP. FPGA performance: 1 FPGA PE (Processing Engine) delivers 408.9 MCUP/s @ 200 MHz measured; estimated value assumes linear scaling to 32 PEs and 90% frequency scaling on Stratix-V A7 400 MHz based on RTL synthesis results (35W TDP). Intel estimated based on 1S Xeon E5-2680v2 + 1 Stratix-V A7 with QPI 1.1 @ 6.4 GT/s full width using Intel® QuickAssist FPGA System Release 3.3, ICC (CPU is essentially idle when workload is offloaded to the FPGA).

Intel® Xeon® + FPGA¹ in the Cloud
Vision

IP Library
• End User Developed IP
• FPGA Vendor Developed IP
• Intel Developed IP
• 3rd party Developed IP

[Figure: Cloud Users launch a workload; Orchestration Software in the Software Defined Infrastructure places the workload on a Resource Pool of Storage, Network, and Compute (Intel® Xeon® + FPGA); workload accelerators from the IP Library are loaded through static/dynamic FPGA programming.]

1: Field Programmable Gate Array (FPGA)

“Programmer Friendly” Acceleration

Software Programmers
• Need Logic and Data Management
  – By writing lines of code

OpenCL™ Compiler Benefits
• Ease of use
• Scalable
• Heterogeneous
• Leverage existing libraries
• Vendor choice w/ open standards
• Foundation for OpenMP (80% reuse)

Channels/Pipe Extension
• Kernel → Kernel
• External I/O
• Mix ‘n Match HDL & Kernels

[Figure: host flow on the CPU (Context, Compile code, Create data & arguments, Execute) and, on the FPGA, kernels connected to each other and to I/O, with a DDRx Global Memory Buffer.]

© 2015 Altera Corporation – Public

Spectrum of Workload Acceleration

• Software Library. Example: Data Plane Development Kit
• Discrete Accelerator. Example: Field Programmable Gate Array
• Integrated Accelerator. Example: Intel® Iris™ Pro Graphics
• Processor Instruction. Example: Intel® Advanced Vector Extensions

Quick Assist Technology

Workload Acceleration Beyond CPU

• 3D XPoint™ Technology
• Intel® Omni-Path Fabric
• Intel® Silicon Photonics

Intel Architecture Vision for Software: Code Once – Run Anywhere

Consistent programming model for all accelerators

• Software Library
• Discrete Accelerator
• Integrated Accelerator
• Processor Instruction

Additional Sources of Information

• A PDF of this presentation is available from our Technical Session Catalog: www.intel.com/idfsessionsSF.

• Intel® Xeon Phi™ coprocessor resources: software.intel.com/mic-developer

• Network Compression resources: intel.com/quickassist

• Media Transcoding resources: software.intel.com/intel-media-server-studio

• Storage Cryptography resources: software.intel.com/storage

• FPGA: Please see demo in Altera* booth in the demo showcase
