Latest Trends in Computing and Communication

Gil Bloch

Moore’s Law

Moore’s Law: What is it?

Moore’s Law: Where is it going?

▪ April 2005: Gordon Moore stated in an interview that the projection cannot be sustained indefinitely: "It can't continue forever. …"
▪ In 2016 the International Technology Roadmap for Semiconductors produced its final roadmap; it no longer centered its research and development plan on Moore's law.

Moore’s Law

▪ GPU Accelerated Computing ▪ TPU ▪ Cloud

Exponential Data Growth

Exponential Data Growth Everywhere

▪ Cloud ▪ HPC ▪ Big Data ▪ Security ▪ Internet of Things ▪ Enterprise ▪ Storage ▪ Machine Learning ▪ Business Intelligence

Riding The Data Wave: It is not a wave, it is a tsunami

Did you know that 90% of the world’s data has been created in the last two years alone?

From Oil and Banking to Data
Top 10 Companies in the World (Market Cap)

▪ 1998: …, …, Exxon Mobil, Royal Dutch Shell, Merck, Pfizer, …, Coca Cola, Walmart, IBM
▪ 2009: Exxon Mobil, PetroChina, Walmart, ICBC, China Mobile, Microsoft, AT&T, Johnson & Johnson, Royal Dutch Shell, Procter & Gamble
▪ 2019: Apple, Amazon, Google (Alphabet), Microsoft, Facebook, Tencent, Alibaba, Berkshire Hathaway, JPMorgan Chase, Exxon Mobil

▪ Oil and Gas ▪ Pharmaceutical / Medical devices ▪ Data-driven revenues


The Hyper-Scalers (Whales)

▪ How many servers does Google have?
▪ We do not know; they never expose the numbers
▪ There are guesstimates:
▪ In 2011: 900,000 servers
▪ In 2018: 2,500,000 servers (source: Gartner)
▪ “As of 2018, Google has invested over $10.5 billion equipping its US data centers to deliver state-of-the-art services.” (source: Google)
[Photos: Google data center in Mayes County, Oklahoma (source: Google); Google data centers in The Dalles, Oregon (photo by Craig Mitchelldyer/Getty Images)]

Artificial Intelligence

Neural Network Complexity Growth

▪ Image recognition: ~350X complexity growth across AlexNet, GoogleNet, Inception-V2, Inception-V4, and ResNet (2012–2016)
▪ Speech recognition: ~30X complexity growth across DeepSpeech, DeepSpeech-2, and DeepSpeech-3 (2014–2017)
▪ Complexity = GOPS × Bandwidth

Enabling World-Leading Artificial Intelligence Solutions
Mellanox Unleashes the Power of Artificial Intelligence

▪ More Data ▪ Better Models ▪ Faster Interconnect (GPUs, CPUs, ASICs, FPGAs, Storage)

The Need for Intelligent and Faster Interconnect
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

CPU-Centric (Onload) vs. Data-Centric (Offload)
▪ Onload network: the CPU must wait for the data, which creates performance bottlenecks
▪ In-Network Computing: analyze data as it moves, for higher performance and scale

An Application Example: Pizza Processing (CPU-Centric / Onload)
CPU 1: Pizza Generation, CPU 2: Pizza Consumption
▪ Order Pizza: call (or use a Pizza application)
▪ CPU 1: prepare the Pizza (tomato sauce, cheese, pepperoni…)
▪ CPU 1: put it in the oven
▪ And now we wait…
▪ CPU 1: pack and send
▪ Network: Pizza delivery
▪ Must wait for the data, which creates performance bottlenecks

What if…

OK, So What Should I Look For?

The Need for Speed

Mellanox Accelerates TensorFlow 1.5

▪ High bandwidth is a must for faster training of large-scale models
▪ Up to 6.5X speedup with higher bandwidth
[Chart: training speedups of 2.5X and 6.5X with higher-bandwidth interconnect]

PeerDirect / GPUDirect

Just Before We Start: This is what a (GPU) server looks like

10X Higher Performance with GPUDirect™ RDMA

GPUDirect™ RDMA
▪ Accelerates HPC and Deep Learning performance

▪ Lowest communication latency for GPUs
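To make the GPUDirect path concrete, here is a minimal sketch (not Mellanox reference code) of registering a CUDA device buffer directly with the RDMA NIC so it can DMA to and from GPU memory without staging through the host. It assumes an already-opened verbs protection domain (pd) and a system where a GPUDirect RDMA kernel module (e.g. nv_peer_mem) is loaded.

```c
/* GPUDirect RDMA sketch: register GPU memory with the NIC.
 * Assumes an existing ibv_pd and a loaded GPUDirect RDMA kernel module
 * (e.g. nv_peer_mem); not Mellanox reference code. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len, void **gpu_buf)
{
    /* Allocate the buffer in GPU memory instead of host memory. */
    if (cudaMalloc(gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return NULL;
    }

    /* Register the GPU pointer with the HCA. With GPUDirect RDMA the NIC
     * can then read/write this memory directly, with no host-side copy. */
    struct ibv_mr *mr = ibv_reg_mr(pd, *gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed "
                        "(is the GPUDirect RDMA module loaded?)\n");
        cudaFree(*gpu_buf);
        return NULL;
    }
    return mr;  /* mr->lkey / mr->rkey are used in work requests as usual */
}
```

Once registered, the memory region is used in work requests exactly like a host-memory region, which is what removes the extra copy from the GPU communication path.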

Remote Direct Memory Access (RDMA)

Remote Direct Memory Access

▪ Remote ▪ Data transfer between nodes connected by an interconnect

▪ Direct ▪ No operating system kernel involvement in the transfer ▪ All transfer operations offloaded to the network card

▪ Memory ▪ Transfer between user-space application virtual memory ▪ No extra copying or buffering

▪ Access ▪ Send / Receive ▪ Read / Write ▪ Atomic
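As an illustration of the one-sided access operations, the hedged sketch below posts an RDMA Write with the verbs API. It assumes an already-connected queue pair (qp), its completion queue (cq), a registered local buffer (mr), and a remote virtual address and rkey exchanged out of band; error handling is trimmed for brevity.

```c
/* One-sided RDMA Write sketch: the remote CPU is not involved in the transfer.
 * Assumes qp/cq are set up and remote_addr/rkey were exchanged beforehand. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
               struct ibv_mr *mr, void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* registered local memory */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* could also be RDMA_READ / SEND / atomics */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Asynchronous: posting returns immediately, the HCA performs the transfer. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* The completion appears in the completion queue when the transfer is done. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                /* busy-poll for brevity */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```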

RDMA vs. TCP

RDMA access model
▪ Message based: preserves the user’s message boundaries
▪ Asynchronous: no blocking during the transfer; a transfer starts when a work request is added to the work queue and finishes when its status is available in the completion queue
▪ Supports paired (two-sided) and unpaired (one-sided) transfers
▪ No data copying into system buffers; the memory involved in a transfer must not be touched between the start and the completion of the transfer

TCP/IP socket access model
▪ Byte stream: the application must recover message boundaries
▪ Synchronous: blocks until data is sent / received
▪ send() / recv() are paired; both sides must participate in the transfer
▪ Requires data copies through system buffers; user memory is accessible immediately after send() / recv() operations
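The contrast above can be seen directly in code. The sketch below (placeholder function names, no error handling) puts the synchronous socket pattern next to the asynchronous verbs pattern: send() may block and copies data into kernel buffers, while the verbs side only enqueues a work request and later reaps its completion from the completion queue, during which time the buffer must not be reused.

```c
/* Contrast sketch: blocking TCP send vs. asynchronous RDMA work/completion queues. */
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdint.h>
#include <string.h>

/* TCP/IP model: synchronous, byte stream, data copied into kernel buffers. */
void tcp_side(int sockfd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        /* send() may block; the kernel copies from buf into socket buffers,
         * so buf is reusable as soon as the call returns. */
        ssize_t n = send(sockfd, buf + sent, len - sent, 0);
        if (n <= 0)
            break;
        sent += (size_t)n;
    }
}

/* RDMA model: asynchronous, message based, zero copy.
 * The buffer must stay untouched until its completion is reaped. */
void rdma_side(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
               void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;        /* two-sided: remote must have posted a recv */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (ibv_post_send(qp, &wr, &bad))   /* returns immediately, no blocking */
        return;

    /* ... the application keeps computing here ... */

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                               /* only now may buf be reused */
}
```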

Unbeatable Performance with RDMA

▪ Main features
▪ Remote memory read/write semantics in addition to send/receive
▪ Kernel bypass / direct user-space access
▪ Full hardware offload of the network stack
▪ Secure, channel-based IO

▪ Application advantages
▪ Lowest latency
▪ Highest bandwidth
▪ Lowest CPU consumption
▪ Direct memory access, no unnecessary data copies

▪ RoCE: RDMA over Converged Ethernet
▪ Available for all Ethernet speeds, 10–400G
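Because RoCE carries RDMA over ordinary Ethernet, connections are typically established with the librdmacm connection manager using IP addresses, after which the verbs calls shown earlier apply unchanged. Below is a minimal client-side sketch under that assumption; the address and port are placeholders, and QP creation and error handling are elided.

```c
/* RoCE connection setup sketch using librdmacm (client side); the IP address
 * and port are placeholders, and QP creation / error handling are elided. */
#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int connect_roce(struct rdma_cm_id **out_id)
{
    struct rdma_cm_id *id;
    /* With a NULL event channel, librdmacm runs the id in synchronous mode:
     * each call below blocks until the underlying CM event arrives. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(7471);                     /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder address */

    /* Map the IP address to a RoCE-capable RDMA device, then resolve the route. */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000 /* ms */) ||
        rdma_resolve_route(id, 2000))
        return -1;

    /* ... create PD/CQ and call rdma_create_qp(id, pd, &qp_init_attr) here ... */

    struct rdma_conn_param param;
    memset(&param, 0, sizeof(param));
    if (rdma_connect(id, &param))
        return -1;

    *out_id = id;   /* id->verbs and id->qp are then used with the verbs API */
    return 0;
}
```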

RDMA Accelerates TensorFlow

▪ Unmatched linear scalability ▪ 50% better performance at no additional cost

Thank You
