DATA CENTERS ACCELERATE AI PROCESSING
Vendors Large and Small Optimize Their Designs for AI
By Bob Wheeler (December 24, 2018)


In a year dominated by AI, you'd be excused for thinking traditional server CPUs were dying. Yet Intel's data-center revenue (mainly server processors) grew 26% in the first nine months of 2018, reaching a record $16.9 billion. By contrast, Nvidia's data-center revenue (mainly accelerators) rose an astounding 70% from a much smaller base, reaching $2.3 billion in that period. Although AI-accelerator growth slowed in 2018, both established vendors and startups continue to invest heavily in this technology. Companies are pitching everything from simple ISA extensions to purpose-built accelerator chips and cards.

Although cloud operators such as Google have deployed AI-specific hardware, merchant vendors have mostly repurposed existing designs for AI. Recently, the horde of startups developing AI-specific architectures began to deliver production devices. Graphcore, Habana, and Wave are three such companies targeting data centers. Meanwhile, Intel plans to ship its second purpose-built AI silicon for data centers in 2019 after its first effort flopped. These vendors are all chasing Nvidia, which continues to optimize its Tesla GPU cards for AI, where they primarily accelerate deep-neural-network (DNN) training.

Intel and Nvidia continue to benefit from the growth in cloud data centers, but in-house ASICs can displace their products. Examples include Amazon's processors, Google's TPUs, and Facebook's AI chips. Merchant processors and accelerators must demonstrate superior efficiency if they hope to succeed at hyperscalers. In the longer term, a big shakeout in AI-chip vendors is coming, but next year should see more new entrants than failures.

Training for Dollars

In June, two distinguished architects, John Hennessy and David Patterson, raised the profile of domain-specific architectures through their ACM Turing Award lecture. They argued that domain-specific architectures (DSAs) combined with domain-specific languages (DSLs) are the path to higher efficiency as Moore's Law slows (we posited the same approach back in MPR 8/26/13, "What Comes After the End").

A canonical DSA/DSL example in data centers is Nvidia's Tesla programmed using Cuda to accelerate HPC. Cuda DNN (cuDNN) adapts this model for DNN training, supporting frameworks that include Caffe, TensorFlow, and others. In 2017, Nvidia optimized its Tesla accelerators for convolutional neural networks by adding matrix-multiplication units ("tensor cores") to its 12nm Volta generation. For ResNet-50 training on FP16 data, the Volta-based Tesla V100 card delivers 1,350 images per second (IPS), 4x the performance of the Pascal generation. Volta has dominated the market for training neural networks, offering at least 10x better performance than mainstream Skylake-SP processors.

Hoping to take share from Nvidia, AMD recently introduced Radeon Instinct cards using its 7nm Vega20 GPU. Although Vega20 handles FP16 data, it lacks Volta's matrix-multiply hardware. In ResNet-50 training, AMD's MI60 card delivers only 499 IPS, falling short of Volta at the same power (300W). Nvidia's technology leadership allowed it to optimize its hardware and software for superior efficiency, more than offsetting AMD's process advantage.
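Volta's tensor cores engage only when a framework feeds them FP16 data through cuDNN. As a rough sketch of what that looks like in practice, here is a minimal single-step ResNet-50 training loop in PyTorch (a cuDNN-backed framework; the article names Caffe and TensorFlow, but the principle is the same). This is naive FP16; production mixed-precision recipes also add loss scaling to preserve small gradients.

```python
import torch
import torchvision

# Assumes a Volta-class GPU; for FP16 convolutions, cuDNN selects
# tensor-core (HMMA) kernels automatically.
model = torchvision.models.resnet50().cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# Dummy batch standing in for ImageNet data (224x224 RGB images).
images = torch.randn(64, 3, 224, 224, device="cuda", dtype=torch.float16)
labels = torch.randint(0, 1000, (64,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(images), labels)  # forward pass in FP16
loss.backward()                          # backward pass also uses tensor cores
optimizer.step()
```

Throughput figures such as the V100's 1,350 IPS come from timing loops around steps like this one.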




One of the first AI startups to target training is Graphcore, which is shipping a purpose-built processor for DNNs. Its 16nm GC2 approaches the FP16 performance of Volta but consumes half the power. The company can then pack two of its processors into a 300W card. In simulations, the card delivers about 1.5x the ResNet-50 throughput of the V100, but Graphcore has yet to publish benchmark data. Still, the GC2 demonstrates the advantage of a clean-sheet design for DNN acceleration: it offers 4x the performance per watt of AMD's GPU while using process technology that trails by a full node.

Another startup, Wave Computing, is implementing its training technology in an appliance that competes with Nvidia's DGX systems. Wave is shipping evaluation systems that integrate four of its DPU ASICs. That system could deliver performance in the same range as Nvidia's $400,000 DGX-2, which integrates 16 V100 modules. The startup, however, has thus far withheld benchmarks.

Intel is conspicuous by its absence, having decided to use its first-generation Nervana product as only a development platform after it fell far behind the V100 in performance. It plans to ship the NNP L-1000, code-named Spring Crest, in late 2019. On the basis of the company's performance targets, that chip is unlikely to significantly outperform the V100, much less Nvidia's next-generation Ampere design.

Habana Smokes Inference Incumbents

Whereas Nvidia dominates DNN training, data-center operators employ multiple architectures for inference tasks. Intel's Xeon processors are the most popular, but operators also employ custom DNN accelerators, and Nvidia is making inroads with new products. Inference offers more room for optimization owing to the use of lower-precision fixed-point (INT8) math. Last quarter, Nvidia introduced new Tesla T4 cards based on its Turing architecture, its first to include tensor cores that handle integer data. The result is a 70W card that delivers 27 TMAC/s for INT8 data. Using the Caffe framework with its TensorRT libraries, the company measured ResNet-50 inference throughput of 3,920 IPS, or 56 IPS per watt, as Figure 2 shows.

                      Throughput   Efficiency
Habana Goya           15,012 IPS   150.1 IPS/W
Tesla T4               3,920 IPS    56.0 IPS/W
Xilinx Alveo U250*     3,700 IPS    33.2 IPS/W
Cascade Lake-SP 2S*    2,500 IPS     5.7 IPS/W
Tesla P4               1,400 IPS    18.7 IPS/W
Skylake-SP 2S          1,110 IPS     2.6 IPS/W

Figure 2. ResNet-50 inference throughput and efficiency (log scale). Habana's Goya sets a new benchmark, demonstrating the advantage of a clean-sheet DNN design. For Intel, we show the top-end Xeon Platinum 8180 model of both Skylake and Cascade Lake in dual-socket (2S) configurations; other products are single-chip designs. (Data source: vendors, except *The Linley Group estimate)
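The efficiency column in Figure 2 is simply throughput divided by power. A few lines of Python reproduce it from the numbers cited in the text; note that the Alveo U250's typical power isn't published, so the value below is back-derived from the roughly 33 IPS/W estimate rather than taken from Xilinx.

```python
# (throughput in IPS, power in watts) as cited in the article; starred
# products use The Linley Group's estimated throughput.
cards = {
    "Habana Goya":         (15_012, 100),  # 100W typical
    "Tesla T4":            (3_920,   70),  # 70W TDP
    "Xilinx Alveo U250*":  (3_700,  112),  # typical power back-derived, not published
    "Cascade Lake-SP 2S*": (2_500,  435),  # two CPUs plus a south bridge (TDP)
}

for name, (ips, watts) in cards.items():
    print(f"{name:21s} {ips / watts:6.1f} IPS/W")
```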
Intel's response is to improve Xeon's performance. Under the Vector Neural Network Instruction (VNNI) umbrella, it added to Cascade Lake new fused multiply-accumulate instructions for INT8 and INT16 data types (the sketch at the end of this section illustrates the INT8 arithmetic these instructions accelerate). Its simulations show Cascade Lake doubles ResNet-50 inference throughput when using INT8, which should yield about 2,500 IPS for the top-end Xeon (i.e., Platinum 8180 "v2") 2S platform. Two processors plus a south bridge consume about 435W TDP, yielding around 5.7 IPS per watt, or 10% of the Tesla T4's efficiency. Carrying a list price of more than $20,000, the dual-8180 configuration is hardly mainstream.

A well-funded Israeli startup, Habana is attacking high-throughput inference with its Goya HL-1000 card, which sampled in 3Q18. It measured ResNet-50 throughput of 15,012 IPS at 100W (typical), yielding 150 IPS per watt. Goya uses C-programmable tensor cores that support integer data types as well as FP32. In 2Q19, the company plans to sample Gaudi, a training processor designed to scale to large arrays using a 2Tbps interconnect. Habana has raised $120 million with Intel Capital as the lead investor.

Another approach to inference acceleration is to employ FPGAs, as Microsoft has done in its Project Brainwave with Intel Stratix 10. For customers needing off-the-shelf hardware, Intel unveiled accelerator cards that sport Arria 10 FPGAs and are available through server OEMs. The one based on the Arria 10 GX, for example, is a PCIe Gen3 x8 card with an FPGA, 8GB of DDR4 SDRAM, and a 40G Ethernet port; it has a 70W TDP rating. Although Intel provides an SDK and libraries for Arria 10, it doesn't directly support DNN frameworks such as TensorFlow and Caffe.

Xilinx is more aggressive, offering FPGA cards directly to end users and adding its xDNN software to handle frameworks. In October, it introduced the Alveo line of PCIe accelerator cards, which employ 16nm UltraScale+ FPGAs. It rates the top-end Alveo U250 at 33.3 TOPS for INT8 data at 225W (TDP). Xilinx and a partner recently demonstrated the U250 running ResNet-50 inference workloads at 3,700 IPS. We estimate ResNet-50 efficiency at 33 IPS per watt (typical), falling short of the Tesla T4. The company's 7nm generation (Versal) adds programmable AI engines optimized for integer data, an improvement that should greatly boost inference performance. Samples of the first Versal models are due in 2H19. Customers can use the first-generation Alveo cards in the interim, particularly for software development.
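The common thread among the T4's integer tensor cores, Goya, and Cascade Lake's VNNI is the same trick: quantize FP32 tensors to INT8, multiply in 8 bits, and accumulate in 32 bits. The sketch below is a NumPy emulation of that arithmetic, an illustration only, not any vendor's kernel or intrinsic.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Map FP32 values onto signed INT8 using a per-tensor scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)  # 127 for INT8
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
activations = rng.standard_normal(1024).astype(np.float32)
weights = rng.standard_normal(1024).astype(np.float32)

qa, sa = quantize_symmetric(activations)
qw, sw = quantize_symmetric(weights)

# INT8 products accumulated into a 32-bit integer, as VNNI's fused
# multiply-accumulate (and integer tensor cores) do in hardware.
acc = np.dot(qa.astype(np.int32), qw.astype(np.int32))

# Rescale the INT32 accumulator back to the FP32 domain.
approx = acc * sa * sw
exact = float(np.dot(activations, weights))
print(f"exact={exact:.2f}  int8-approx={approx:.2f}")
```

Because each operand shrinks from 32 bits to 8, a chip can pack roughly 4x the multipliers into the same area and memory bandwidth, which is why inference leaves more room for optimization than training.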


Competing With Cloud Customers

A rising tide should lift all boats, but hyperscale cloud operators and OEMs are creating dangerous currents for merchant vendors. Google is the prime example with its TPUs, now entering their third generation. Whereas the TPUv1 (circa 2015) was an inference coprocessor, the TPUv2 (deployed in 2017) is a standalone processor designed for training. Its matrix unit handles floating point, and it's designed to work in clusters. At the end of 2018, TPUv2 clusters (or pods) were in alpha testing, while the TPUv3 had entered cloud beta testing. Google is ramping TPUv2 deployment for training, and Nvidia's data-center growth has slowed to less than 10% per quarter.

Among the server OEMs, Huawei is the most aggressive in developing in-house silicon. Its chip arm, HiSilicon, has created ASICs and SoCs for many markets. Now, the company has developed the Ascend 910 chip for data-center training and inference. It's also investing in its own cloud-services business, which is likely to consume internally developed chips where possible. Intel appears most exposed in this case.

In 2019, Nvidia will finally see competition from third-party AI accelerators as multiple startups begin shipping their first designs. These startups have all promised massive performance and efficiency improvements, but once their products reach the market, we'll see which ones are delivering on those promises. Some will, but the ones that don't will begin to fall by the wayside as early as 2020. This consolidation phase will be similar to the boom-and-bust cycle of GPU startups in the 1990s and NPU startups in the 2000s. Ultimately, only a few AI-chip startups will succeed in the data center. Intel and Nvidia will push their current efforts, but they may end up acquiring successful startups. One way or another, these big vendors will continue to profit from AI's rapid growth.

To subscribe to Microprocessor Report, access www.linleygroup.com/mpr or phone us at 408-270-3772.

© The Linley Group • Microprocessor Report December 2018