LegUp: Accelerating Memcached on Cloud FPGAs
Xilinx Developer Forum December 10, 2018
Andrew Canis & Ruolong Lian, LegUp Computing Inc.
COMPUTE IS BECOMING SPECIALIZED

• GPU: Nvidia graphics cards are being used for floating-point computations
• TPU: Google tensor processing unit, used for machine learning
• FPGA: Reconfigurable hardware. FPGAs excel at real-time data processing

LEGUP HLS PLATFORM
A Unified Hardware Acceleration Platform

• Vendor agnostic
• Flow: Software → Software Test/Debug → Hardware → Hardware System (CPU + FPGA)

The Era of FPGA Cloud Computing is Here

• June 2014: Microsoft accelerates Bing Search with FPGAs
• Oct 2016: Microsoft rolls out FPGAs in every new datacenter
• Nov 2016: Amazon and Nimbix deploy FPGAs in their cloud
• Jan 2017: Baidu, Huawei deploy FPGAs in their cloud
• Jul 2017: Alibaba and Tencent deploy FPGAs in their cloud
• Aug 2018: SKT deploys FPGAs for AI acceleration

CLOUD PLATFORM
• Network processing engines on cloud FPGAs and on-premises FPGA acceleration cards
Input your C/C++ Code → Hardware Compilation

    void accelerator(FIFO *input, FIFO *output) {
        int in = fifo_read(input);
        loop: for (int i = 0; i < NUM; i++) {
            ...
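The excerpt above is truncated on the slide. As a hedged illustration of the same streaming pattern, a plain-software model (the `FIFO` type, the loop body, and the value of `NUM` here are assumptions for demonstration, not LegUp's actual API) could look like:

```cpp
#include <queue>

// Software stand-in for the hardware FIFO interface shown on the slide.
using FIFO = std::queue<int>;

static int fifo_read(FIFO *f) {
    int v = f->front();
    f->pop();
    return v;
}

static void fifo_write(FIFO *f, int v) { f->push(v); }

// Hypothetical accelerator body: reads one word per iteration,
// applies a placeholder transform, and streams the result out.
const int NUM = 4;

void accelerator(FIFO *input, FIFO *output) {
    loop: for (int i = 0; i < NUM; i++) {
        int in = fifo_read(input);
        fifo_write(output, in * 2);  // placeholder computation
    }
}
```

In the LegUp flow, the labeled loop would be pipelined in hardware so a new FIFO element can be accepted every clock cycle.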
Real-Time Data Stream from Network → Cloud or On-prem FPGA Server → Real-Time Analysis Output to Network

What is Memcached?

• Memcached is a distributed in-memory key-value store
• Used as a cache by Facebook, Twitter, Reddit, YouTube, etc.
• Facebook's Memcached cluster handles billions of requests per second
• Memcached commands:
  • set key value
  • get key
• Typical deployments:
  • Amazon ElastiCache
  • Google Cloud App Engine
  • Self-hosted
• Easy horizontal scaling: a cluster of Memcached servers handles the load
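The set/get commands above travel over Memcached's ASCII protocol. As an illustrative sketch (framing follows the standard Memcached text protocol; flags and expiry are fixed to 0 for simplicity), the request bytes can be built like this:

```cpp
#include <string>

// Build Memcached ASCII-protocol requests.
// Text protocol framing:
//   set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
//   get <key>\r\n
std::string make_set(const std::string &key, const std::string &value) {
    return "set " + key + " 0 0 " + std::to_string(value.size()) +
           "\r\n" + value + "\r\n";
}

std::string make_get(const std::string &key) {
    return "get " + key + "\r\n";
}
```

These are exactly the kinds of requests the FPGA pipeline described later must decode at line rate.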
Introducing: World's Fastest Cloud-Hosted Memcached
Powered by AWS FPGAs with LegUp's Platform

• Easy to deploy
• Lower TCO
• 10 Gbps network
• 9X HIGHER REQUESTS/SEC, 9X LOWER LATENCY, 10X LOWER TCO

Memcached vs. AWS ElastiCache

• Benchmarked our Memcached against AWS ElastiCache
• AWS provides a fully-managed CPU Memcached service
• Different instance types based on RAM size, network bandwidth, and hourly cost
• Chose an ElastiCache instance with the closest specs to F1
AWS Instance             vCPUs   RAM      Network Speed   Cost
cache.r4.4xlarge (CPU)   16      101 GB   Up to 10 Gbps   $1.82/hour
f1.2xlarge (FPGA)        8       122 GB   Up to 10 Gbps   $1.65/hour
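As a sanity check on the TCO claim, the up-to-9X throughput gain can be combined with the hourly prices in the table (assuming throughput per dollar scales linearly with both):

```cpp
// Relative throughput-per-dollar of the FPGA instance vs. the CPU
// instance, given a measured throughput speedup and hourly prices.
double throughput_per_dollar_gain(double speedup,
                                  double cpu_cost_per_hr,
                                  double fpga_cost_per_hr) {
    return speedup * (cpu_cost_per_hr / fpga_cost_per_hr);
}
```

With speedup = 9 and the table's $1.82 vs. $1.65 hourly rates, this gives about 9.9X, consistent with the roughly 10X throughput/$ figure reported below.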
Experimental Setup

• memtier_benchmark: open-source Memcached benchmarking tool
• 100-byte data size, pipelining (batching) of 16
• Varied the number of connections to Memcached

Throughput Results

• Up to 9X better ops/sec vs. ElastiCache

Latency Results

• Up to 9X lower latency vs. ElastiCache

Total Cost of Ownership Results

• Up to 10X better throughput/$ vs. AWS ElastiCache

Where is the speedup coming from?

1. We accelerated both the TCP/IP network stack and Memcached completely in the FPGA
2. Fully pipelined FPGA hardware: a new input every clock cycle
3. Multiple Memcached commands in flight, processed in streaming fashion
4. At high packets/sec, a software network stack can become a bottleneck

Memcached Demonstration on AWS F1
• Live demo from our website: http://www.legupcomputing.com/main/memcached_demo
• Spins up an AWS F1 instance and another client instance
AWS Cloud-Deployed FPGAs (F1)
• On F1, the FPGA is not directly connected to the network
• The CPU is connected to the network; the FPGA is connected to the CPU over PCIe

Memcached System Architecture
AWS F1 Instance:

  CPU:  Virtual Network to FPGA (S/W), attached to the 10 Gbps network
  FPGA: Virtual Network to PCIe → TCP/IP Offload Engine → Memcached Accelerator (H/W), backed by 64 GB DDR4
Virtual Network to FPGA (VN2F)

• VN2F S/W: bypasses the Linux kernel, sends/receives raw network packets, DMAs from/to the FPGA
• VN2F H/W: splits/combines DMA data to/from individual network packets
• Each direction takes 20~50 µs; transfers are overlapped
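Because the two DMA directions overlap, the 20~50 µs per-direction latency is largely hidden at steady state. A rough two-stage pipeline model makes the effect concrete (the slide gives only the latencies; this model is an assumption):

```cpp
// Time to move n batches when the host->FPGA direction takes t_in
// microseconds and the FPGA->host direction takes t_out microseconds.

// Serial transfers: every batch pays both directions back to back.
double serial_us(int n, double t_in, double t_out) {
    return n * (t_in + t_out);
}

// Overlapped transfers behave like a two-stage pipeline: after the
// first batch fills the pipe, throughput is set by the slower stage.
double overlapped_us(int n, double t_in, double t_out) {
    double slower = (t_in > t_out) ? t_in : t_out;
    return t_in + t_out + (n - 1) * slower;
}
```

With t_in = t_out = 50 µs, 100 batches take 10,000 µs serially but about 5,050 µs overlapped, i.e. close to a 2X improvement in sustained throughput.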
Network → VN2F S/W (kernel-bypass stack: DPDK, F1 drivers) → PCIe → VN2F H/W (F1 Shell: split/combine packets) → Application

Network Offload: TCP/IP & UDP
• Supports TCP/UDP/IP network protocols
• 10 Gbps Ethernet support
• 1000s of TCP connections
• Implemented in C++, synthesized by LegUp
• Can be used by other applications
• Interface with application via AXI-S
* D. Sidler et al., "Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware," in FCCM'15

Memcached Core
• The Memcached core is fully pipelined with an initiation interval (II) of 1
• The Request Decoder decodes requests and partitions them into keys and values
• The Hash Lookup hashes keys and looks up the corresponding addresses
• Values are stored to / retrieved from memory by the Value Store
• The Response Encoder creates Memcached responses to return to the clients
Memcached Core pipeline:

Requests → Request Decoder → (Keys → Hash Lookup → Addresses, plus Values) → Value Store → Values → Response Encoder → Responses
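A behavioral software model of the Hash Lookup stage may help clarify the pipeline (the FNV-1a hash, table size, and overwrite-on-collision policy here are illustrative assumptions, not the actual LegUp design):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative hash-lookup stage: map a key to a fixed-size table slot
// that stores the key's address in the value memory. A hardware version
// computes the hash in a pipelined fashion so a new key can enter every
// clock cycle (initiation interval of 1).
struct Slot {
    bool        valid = false;
    std::string key;          // stored to detect collisions
    uint32_t    address = 0;  // location of the value in value memory
};

const std::size_t TABLE_SIZE = 1024;  // power of two: cheap index masking

uint32_t hash_key(const std::string &key) {
    uint32_t h = 2166136261u;          // FNV-1a, a common simple hash
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

bool lookup(const std::vector<Slot> &table, const std::string &key,
            uint32_t *address) {
    const Slot &s = table[hash_key(key) & (TABLE_SIZE - 1)];
    if (s.valid && s.key == key) { *address = s.address; return true; }
    return false;
}

void insert_key(std::vector<Slot> &table, const std::string &key,
                uint32_t address) {
    Slot &s = table[hash_key(key) & (TABLE_SIZE - 1)];
    s.valid = true; s.key = key; s.address = address;  // collision: overwrite
}
```

The Value Store stage would then use the returned address to read or write the value memory.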
* M. Blott et al., "Achieving 10Gbps Line-rate Key-value Stores with FPGAs," in HotCloud, 2013

Network Bandwidth on AWS F1
• The f1.2xlarge instance has an "Up to 10 Gbps" network
• Bottleneck: both bandwidth and packets per second (PPS)

[Charts: Bandwidth and Packets per Second vs. packet size, with the bandwidth limit marked]

• Small packets can't saturate the 10 Gbps network
• Max PPS is around 700K

Memcached Request Batching

• Batching in Memcached permits packing multiple requests into a single network packet
• Reduces packet-processing overhead
• Important feature for performance, especially when PPS is the bottleneck
• Batching is not needed for on-premise FPGAs
Without Pipelining:  PKT1 [REQ1]  PKT2 [REQ2]  PKT3 [REQ3]  PKT4 [REQ4] …
With Pipelining:     PKT1 [REQ1 REQ2 REQ3]  PKT2 [REQ4 REQ5 REQ6] …
• The Batching Adapter splits aggregated requests into individual requests
• Sends each request to the Memcached core in a pipelined fashion
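Behaviorally, the Batching Adapter's split step can be modeled as cutting a packet payload of concatenated CRLF-terminated requests into individual requests (simplified to get-style one-line requests; a software sketch, not the hardware implementation):

```cpp
#include <string>
#include <vector>

// Split one packet payload carrying several CRLF-terminated requests
// into individual requests. In hardware, each resulting request is then
// streamed to the Memcached core in a pipelined fashion.
std::vector<std::string> split_requests(const std::string &payload) {
    std::vector<std::string> requests;
    std::size_t start = 0;
    while (start < payload.size()) {
        std::size_t end = payload.find("\r\n", start);
        if (end == std::string::npos) break;  // incomplete trailing request
        requests.push_back(payload.substr(start, end - start));
        start = end + 2;
    }
    return requests;
}
```

A real adapter must also handle set requests, whose data line follows the command line, and requests that straddle packet boundaries.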
Streaming b/w Host & Kernel in SDAccel

Host ↔ PCIe ↔ F1 HDK Shell ↔ Custom Logic on FPGA (kernel interface):

• Write path: Host → PCIe → PCIS AXI-4 → DDR memory → MM2S (AXI-4 read channel → AXI-S) → RX FIFO → Streaming Kernel
• Read path: Streaming Kernel → TX FIFO (exposing a data count) → S2MM (AXI-S → AXI-4 write channel) → DDR memory → PCIS AXI-4 → PCIe → Host
• Control: BAR1 AXI-L (read & write), used to pass the transfer size and to read the FIFO data count
• AXI-S handshaking uses ready/valid signals
Batch Accumulator: sends the transfer size to the AXI-S pipe when
• enough data has been accumulated, or
• there has been no new incoming data for X cycles

Come talk to us about:
• Memcached acceleration
• FPGA network stack
• SDAccel streaming handler
• LegUp high-level synthesis tool
• Any other FPGA acceleration needs
Andrew Canis & Ruolong Lian | www.LegUpComputing.com | [email protected] | 647-834-6654 | Toronto, Canada