LegUp: Accelerating Memcached on Cloud FPGAs

Xilinx Developer Forum December 10, 2018

Andrew Canis & Ruolong Lian, LegUp Computing Inc.

COMPUTE IS BECOMING SPECIALIZED

• GPU: Nvidia graphics cards are being used for floating-point computations
• TPU: Google's tensor processing unit, used for machine learning
• FPGA: Reconfigurable hardware; FPGAs excel at real-time data processing

LEGUP HLS PLATFORM

A Unified Hardware Acceleration Platform
• Software test/debug flow
• Vendor agnostic

[Diagram: software is tested and debugged on a CPU, then compiled into a hardware system on an FPGA]

The Era of FPGA Cloud Computing is Here

• June 2014: Microsoft accelerates Bing Search with FPGAs
• Oct 2016: Microsoft rolls out FPGAs in every new datacenter
• Nov 2016: Amazon and Nimbix deploy FPGAs in their cloud
• Jan 2017: Alibaba and Tencent deploy FPGAs in their cloud
• Jul 2017: Baidu deploys FPGAs in their cloud
• Sept 2017: Huawei deploys FPGAs in their cloud
• Aug 2018: SKT deploys FPGAs for AI acceleration

CLOUD PLATFORM

• Network processing engines on cloud FPGAs and on-premises FPGA acceleration cards

Input your C/C++ code → hardware compilation:

    void accelerator(FIFO *input, FIFO *output) {
        int in = fifo_read(input);
    loop:
        for (int i = 0; i < NUM; i++) {
            ...

[Diagram: a real-time data stream arrives from the network, is analyzed on a cloud or on-prem FPGA server, and the results are output back to the network]

What is Memcached?

• Memcached is a distributed in-memory key-value store
• Used as a cache by many large web services, e.g. YouTube
• Facebook's Memcached cluster handles billions of requests per second
• Memcached commands:
  • set key value
  • get key
• Typical deployments:
  • Amazon ElastiCache
  • Google Cloud App Engine
  • Self-hosted
• Easy horizontal scaling: a cluster of Memcached servers handles the load
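The set/get commands above use Memcached's text protocol. As a rough illustration (the framing below, including the flags and exptime fields, follows the standard protocol rather than anything specific to this talk), requests can be formatted like this:

```cpp
#include <sstream>
#include <string>

// Format a text-protocol "set" request:
//   set <key> <flags> <exptime> <bytes>\r\n<value>\r\n
std::string make_set(const std::string& key, const std::string& value) {
    std::ostringstream req;
    req << "set " << key << " 0 0 " << value.size() << "\r\n"
        << value << "\r\n";
    return req.str();
}

// Format a text-protocol "get" request: get <key>\r\n
std::string make_get(const std::string& key) {
    return "get " + key + "\r\n";
}
```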

Easy to Deploy | Lower TCO | 10Gbps Network

Introducing: World's Fastest Cloud-Hosted Memcached

Powered by AWS FPGAs with LegUp's Platform:
• 9X higher requests/sec
• 9X lower latency
• 10X lower TCO

Memcached vs. AWS ElastiCache

• Benchmarked our Memcached against AWS ElastiCache
• AWS provides a fully-managed CPU Memcached service
• Different instance types based on RAM size, network bandwidth, and hourly cost
• Chose the ElastiCache instance with the closest specs to F1

AWS Instance             vCPUs   RAM      Network Speed    Cost
cache.r4.4xlarge (CPU)   16      101 GB   Up to 10 Gbps    $1.82/hour
f1.2xlarge (FPGA)        8       122 GB   Up to 10 Gbps    $1.65/hour

Experimental Setup

• memtier_benchmark: open-source Memcached benchmarking tool
• 100-byte data size, pipelining (batching) of 16
• Varied the number of connections to Memcached

Throughput Results
• Up to 9X better ops/sec vs. ElastiCache

Latency Results
• Up to 9X lower latency vs. ElastiCache

Total Cost of Ownership Results
• Up to 10X better throughput/$ vs. AWS ElastiCache

Where is the Speedup Coming From?
1. Both the TCP/IP network stack and Memcached are accelerated entirely in the FPGA
2. Fully pipelined FPGA hardware: a new input every clock cycle
3. Multiple Memcached commands in flight, processed in a streaming fashion
4. At high packets/sec, the software network stack can become a bottleneck

Memcached Demonstration on AWS F1

• Live demo from our website: http://www.legupcomputing.com/main/memcached_demo
• Spins up an AWS F1 instance and another client instance


AWS Cloud-Deployed FPGAs (F1)

• On F1, the FPGA is not directly connected to the network
• The CPU is connected to the network; the FPGA is connected to the CPU over PCIe

Memcached System Architecture

[Diagram: an AWS F1 instance. On the CPU, a Virtual Network to FPGA layer (S/W) bridges the 10Gbps network to PCIe. On the FPGA, a Virtual Network to FPGA layer (H/W) feeds the TCP/IP Offload Engine and the Memcached Accelerator, backed by 64GB of DDR4]

Virtual Network to FPGA (VN2F)

• VN2F S/W: bypasses the kernel, sends/receives raw network packets, DMAs data from/to the FPGA
• VN2F H/W: splits/combines DMA data to/from individual network packets
• Each direction takes 20~50us; transfers are overlapped
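The H/W split/combine step can be pictured as packing several variable-length packets into one DMA buffer and unpacking them on the other side. A toy software model (the 2-byte length framing is an assumption for illustration, not the actual VN2F layout):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::string;

// Combine packets into one DMA buffer, each prefixed by a
// little-endian 2-byte length (hypothetical framing).
Bytes combine(const std::vector<Bytes>& pkts) {
    Bytes buf;
    for (const Bytes& p : pkts) {
        uint16_t len = static_cast<uint16_t>(p.size());
        buf.push_back(static_cast<char>(len & 0xff));
        buf.push_back(static_cast<char>(len >> 8));
        buf += p;
    }
    return buf;
}

// Split a DMA buffer back into the individual packets.
std::vector<Bytes> split(const Bytes& buf) {
    std::vector<Bytes> pkts;
    size_t pos = 0;
    while (pos + 2 <= buf.size()) {
        uint16_t len = static_cast<uint8_t>(buf[pos]) |
            (static_cast<uint16_t>(static_cast<uint8_t>(buf[pos + 1])) << 8);
        pkts.push_back(buf.substr(pos + 2, len));
        pos += 2 + len;
    }
    return pkts;
}
```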

[Diagram: the VN2F S/W stack uses kernel bypass (DPDK and the F1 drivers) to move packets between the network stack and the F1 shell over PCIe, where the VN2F H/W splits and combines packets]

Network Offload: TCP/IP & UDP

• Supports TCP/UDP/IP network protocols

• 10Gbps Ethernet support

• 1000s of TCP connections

• Implemented in C++, synthesized by LegUp

• Can be used by other applications

• Interface with application via AXI-S
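From the application's point of view, the AXI-S interface behaves like a pair of FIFOs with ready/valid handshaking. A minimal software model (the `AxisFifo` type is hypothetical; LegUp's actual streaming API may differ):

```cpp
#include <cstdint>
#include <queue>

// Software model of an AXI-Stream port: in hardware, read/write map to
// ready/valid handshakes; here a std::queue stands in for simulation.
template <typename T>
struct AxisFifo {
    std::queue<T> q;
    void write(const T& v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// Toy application attached between the RX and TX streams of the
// network offload engine: echoes each word back, incremented.
void app(AxisFifo<uint32_t>& rx, AxisFifo<uint32_t>& tx) {
    while (!rx.empty())
        tx.write(rx.read() + 1);
}
```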

* D. Sidler et al., "Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware," in FCCM'15

Memcached Core

• The Memcached core is fully pipelined with an initiation interval (II) of 1
• The Request Decoder decodes incoming requests and partitions them into key and value pairs
• Hash Lookup hashes keys to hash values and looks up the corresponding addresses
• Values are stored to/retrieved from memory by the Value Store
• The Response Encoder creates Memcached responses to return to the clients

[Diagram: Memcached core pipeline — requests enter the Request Decoder; keys flow to Hash Lookup, which produces addresses for the Value Store; values flow to the Response Encoder, which emits responses]
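Functionally, the four stages implement an ordinary key-value path. A plain software model of the behavior (illustrative only; the hardware core processes one request per clock cycle, which this sequential sketch does not capture):

```cpp
#include <string>
#include <unordered_map>

struct Request { std::string cmd, key, value; };

// Stand-in for the DDR-backed value store.
static std::unordered_map<std::string, std::string> value_store;

// Request Decoder -> Hash Lookup -> Value Store -> Response Encoder
std::string handle(const Request& r) {
    if (r.cmd == "set") {                  // Value Store: write path
        value_store[r.key] = r.value;
        return "STORED\r\n";               // Response Encoder
    }
    auto it = value_store.find(r.key);     // Hash Lookup + read path
    if (it == value_store.end())
        return "END\r\n";                  // miss
    return "VALUE " + r.key + " 0 " + std::to_string(it->second.size()) +
           "\r\n" + it->second + "\r\nEND\r\n";
}
```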

* M. Blott et al., "Achieving 10Gbps Line-rate Key-value Stores with FPGAs," in HotCloud, 2013

Network Bandwidth on AWS F1

• The f1.2xlarge instance has an "Up to 10 Gbps" network
• Bottlenecks: bandwidth and packets per second (PPS)

[Charts: bandwidth and packets per second vs. packet size — the bandwidth limit is 10 Gbps, but small packets cannot saturate it; the maximum PPS is around 700K]

Memcached Request Batching

• Batching in Memcached permits packing multiple requests into a single network packet
• Reduces packet-processing overhead
• An important feature for performance, especially when PPS is the bottleneck (batching is not needed for on-premise FPGAs)

Without pipelining: PKT1[REQ1] PKT2[REQ2] PKT3[REQ3] PKT4[REQ4] …
With pipelining:    PKT1[REQ1 REQ2 REQ3] PKT2[REQ4 REQ5 REQ6] …
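On the FPGA side, a batched packet has to be re-split into individual requests before entering the pipeline. A toy model of that splitting step (assuming \r\n-terminated single-line requests; real requests such as set span multiple lines):

```cpp
#include <string>
#include <vector>

// Split a packet of \r\n-terminated requests into individual requests.
std::vector<std::string> split_requests(const std::string& pkt) {
    std::vector<std::string> reqs;
    size_t pos = 0, end;
    while ((end = pkt.find("\r\n", pos)) != std::string::npos) {
        reqs.push_back(pkt.substr(pos, end - pos));
        pos = end + 2;
    }
    return reqs;
}
```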

• The Batching Adapter splits aggregated requests into individual requests
• Each request is sent to the Memcached core in a pipelined fashion

Streaming b/w Host & Kernel in SDAccel

[Diagram, built up over several slides: host-kernel streaming in SDAccel. On the write channel, the host writes over PCIe (AXI-4, PCIS interface) into DDR memory; an MM2S engine converts the memory-mapped data to AXI-S and fills the kernel's RX FIFO. On the read channel, the kernel's TX FIFO drains through an S2MM engine back to DDR for the host to read. The data count and transfer size are exposed over AXI-L (BAR1, read & write), and the AXI-S pipes use ready/valid handshaking]

Batch Accumulator: send the transfer size to the AXI-S pipe when
• Enough data has been accumulated, or
• No new incoming data has arrived for X cycles

Come talk to us about:
• Memcached acceleration
• FPGA network stack
• SDAccel streaming handler
• LegUp high-level synthesis tool
• Any other FPGA acceleration needs
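The Batch Accumulator's flush condition can be sketched as a per-cycle state update. A software model (the threshold and idle count are assumed values; the slide leaves X unspecified):

```cpp
#include <cstdint>

// Software model of the batch accumulator's flush decision: flush when
// enough bytes are buffered, or when data is pending but the stream has
// been idle for IDLE_CYCLES cycles. Thresholds are illustrative.
struct BatchAccumulator {
    uint32_t buffered = 0;  // bytes accumulated so far
    uint32_t idle = 0;      // cycles since the last incoming word
    static constexpr uint32_t THRESHOLD = 4096;  // flush size (assumed)
    static constexpr uint32_t IDLE_CYCLES = 64;  // "X cycles" (assumed)

    // Called once per clock cycle; returns the transfer size to send
    // to the AXI-S pipe, or 0 if no flush happens this cycle.
    uint32_t tick(uint32_t incoming_bytes) {
        if (incoming_bytes) { buffered += incoming_bytes; idle = 0; }
        else if (buffered)  { ++idle; }
        if (buffered >= THRESHOLD || (buffered && idle >= IDLE_CYCLES)) {
            uint32_t n = buffered;
            buffered = idle = 0;
            return n;
        }
        return 0;
    }
};
```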

Andrew Canis & Ruolong Lian
www.LegUpComputing.com
[email protected] | 647-834-6654 | Toronto, Canada