THE IMPORTANCE OF HARDWARE RAISING ALL BOATS

Paul Brant Principal Internet Sales Consultant [email protected]

Knowledge Sharing Article © 2018 Dell Inc. or its subsidiaries.

Table of Contents

Abstract
Introduction
Why is Hardware becoming Cool or Hot again
The Cambrian Explosion
Traditional Hardware losing its punch
The need for Specialized Hardware
The Trifecta
Hardware Accelerators
Digital Transformation Hardware - Raising all Boats
The four pillars of Digital Transformation
Cloud
Why Hardware Microservices in the Cloud
Hardware Microservices
What is an ASIC
ASIC Cloud Use Case - Bitcoin mining Hardware
ASIC Cloud Use Case – Google, we have a problem
What is an FPGA
FPGA Cloud Use Case – Microsoft Azure Networking
FPGA Cloud Use Case – AWS
What is a GPU
GPU Cloud Use Case – AWS
The Hardware Accelerator integration Cloud Framework
AI
Moravec's paradox
The Von Neumann Architecture and AI
Memory Bandwidth is important
Utilization is important
The AI Platform Landscape
Mobile and IoT
Why Do We Need Mobile and IoT AI Chips
The Move to a New Way of Thinking
Neuromorphic chips
What is different
How it Works
Hardware for DNN Processing
The Importance of CNN's
A Deeper Look – IBM's True North
The Supporting Software
The Why and What of Frameworks
Bringing Hardware Acceleration to the Enterprise
Summary
References

Table of Figures

Figure 1 - Cambrian Explosion Architecture Stack
Figure 2 - Moore's Law
Figure 3 - Dennard Scaling
Figure 4 - Coprocessor Architecture
Figure 5 - History of CPU core architectures
Figure 6 - Compute System Trends
Figure 7 - The Four Pillars of Transformation
Figure 8 - Categorization of Cloud Workloads
Figure 9 - Hardware Accelerators
Figure 10 - Application Specific IC
Figure 11 - Bitcoin ASIC Architecture
Figure 12 - Bitcoin Mining Cloud
Figure 13 - Google AI TPU ASIC
Figure 14 - TPU Performance comparison
Figure 15 - Field Programmable Gate Array
Figure 16 - CPU vs. FPGA
Figure 17 - Microsoft Azure Network Hardware Acceleration
Figure 18 - AWS FPGA in AMI Instances
Figure 19 - GPU Hardware Acceleration
Figure 20 - AWS Elastic GPU's
Figure 21 - Example of a Hardware Accelerator Framework in the Cloud
Figure 22 - AI Application Landscape
Figure 23 - Von Neumann Architecture
Figure 24 - Eye Photoreceptors
Figure 25 - Hardware AI Platform Landscape
Figure 26 - Power Density of CPU's vs. Human Brain
Figure 27 - Comparing Human and Silicon approaches
Figure 28 - Parallel computation approaches
Figure 29 - CNN Process flow
Figure 30 - IBM True North Chip Layout
Figure 31 - Dell Technologies Integration
Figure 32 - Dell Technologies Bundling

Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect Dell EMC’s views, processes or methodologies.


Abstract

Yes, software is eating the world. Everything is becoming software defined. But if it were not for hardware, software would starve. Let's face it, all software, at some point, has to run on something. And that something is hardware.

As businesses try to cross the chasm to digitally transform, creating software, including new cloud native applications, it is very important to not underestimate the importance of hardware to achieve that goal. The infrastructure, including the hardware it’s built on, has not maintained the perceived stature of importance that software has recently attained.

That is now changing. The tried and true CPU that has been around since the mainframe, and the associated hardware infrastructure that supports it, are no longer keeping up with the needs of new applications. Workloads such as AI and deep learning are running on empty using the common CPU.

This is where new hardware technologies, architectures and processes are providing new approaches to satisfy the ravenous appetite of these new applications. This Knowledge Sharing article explores the opportunities and approaches that new hardware solutions provide in sustaining the appetite of software as it continues to try and eat the world.


Introduction

It seems that all of the discussions I have heard about the future of our digital transformational world revolve around how software will re-make it. In 2011, Marc Andreessen, the visionary co-creator of the first widely used Internet browser and co-founder of Netscape, stated that "Software is eating the worldi". It has also been said that "Software Is Eating the World, but AI Is Going to Eat Softwareii".

There is one aspect that seems to be totally missing or at least underestimated in discussions on what’s next, who is eating what and what is really cool. That is hardware! Why is that? Is hardware un-important? Is hardware boring? Is it a commodity? Does it really matter anymore?

Here is an example that might answer these questions. This might come as a shock to some, but Bitcoin monetization and profitability is being driven by highly specialized hardware. Really? For those who don't know, Bitcoin is the latest phenomenon taking the financial world by storm. You make money by "mining" the Bitcoin. This process guzzles a lot of energy and produces a lot of heat!iii To put things into perspective, "Bitcoin consumes more electricity a year than Irelandiv", and the reason is the amount and type of hardware being used! For Bitcoin, hardware is the "hard" part! So, the answer to the last question is yes, hardware matters. It is becoming very important, not only for Bitcoin but for other societal transformational workloads such as AI!

In this world of digital transformation, as long as application workloads need to execute, there will always be hardware. Hardware is the fundamental building block of all things digital. This paper will argue that not only is hardware important, with all its variants and customizations, it is the primary enabler for software to do its job. As such, it is one of the major differentiators driving business value. Again, with the advent of next generation workloads such as AI and deep learning extending beyond HPC (high performance computing) and academia to the enterprise, hardware is becoming cool again and will transform why, how and what we do in this digital world.


Why is Hardware becoming Cool or Hot again

If software is eating the world, why are most, if not all, of the major public cloud providers going into the hardware business? The answer is simple; hardware is becoming a major differentiator for their products and services.

For example, Google Cloud and Amazon Web Services are making their own chipsv. Not only are they in the semiconductor chip business, but they have been making their own compute, network and storage devices with special hardware secret sauce. According to James Hamilton, an AWS VP of infrastructure, "doing things in dedicated or customized hardware is a major differentiator for us. Anything that is done often and requires low latency should be put in hardware." He also added that "It's all about physics. . . . I tell software people, the things you're measuring are in milliseconds. In hardware, they measure nanoseconds and microseconds. So this is the right place for us to go." Vi

According to Steve Scott, who did stints as CTO at Cray, Google and NVIDIA, "As in most areas of architecture and system design, the architecture is driven by technologyvii". What that means is that how a computer system is designed, both software and hardware, is based on the current state of the underlying technology. To put it another way, the goal always remains the same: faster, cheaper and better. The approach, however, is always changing. As such, computer architectures are always in flux based on what technologies are available. Today, we are reaching limits in what traditional (von Neumann) hardware architectures can do, which is leading to the Cambrian Compute Explosion. This explosion is what is making hardware cool again!


The Cambrian Explosion

With the advent of new and complex workloads, it is becoming more apparent that no single hardware or software architecture floats all boats (workloads). Thus, there is a "Cambrian explosion" of alternative styles or architectures as shown in Figure 1. This is jumping ahead a little, but it shows the wide diversity of products, solutions, and hardware and software approaches available, often specific to particular workloads. Even Michael Dell Tweetedviii that "with AI (Artificial Intelligence) being the rocket ship and the fuel, there is a Cambrian explosion of opportunity with IoT, 5G and Edge computing to name a few".

Figure 1 - Cambrian Explosion Architecture Stack

There are some challenges however. Over the last fifty years or so, the changes usually meant increasing the number of transistors, reducing the silicon real estate and reducing latencies with no limit in sight. This is changing. We are reaching the limits of current semiconductor technologies.

Traditional Hardware losing its punch

Hardware is hitting some roadblocks. Moore's Law, Dennard scaling and dark silicon are slowing us down. The bottlenecks to performance are now density, power and cooling. Let's understand why.


Moore's Lawix states that the number of transistors on a computer chip doubles roughly every 18 to 24 months, with feature sizes shrinking accordingly, as shown in Figure 2. While this has held true for fifty years, based on a recent report from the International Technology Roadmap for Semiconductors (ITRS)x, by 2021 the ability to further reduce geometries with CMOS manufacturing processes will reach an end.
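Stated as a back-of-the-envelope formula (a restatement of the law above, not taken from the original article), transistor count grows as:

\[ N(t) = N_0 \cdot 2^{t/T}, \qquad T \approx 18\text{--}24 \text{ months} \]

For example, a die holding one billion transistors would, after ten years of doubling every two years, hold roughly 10^9 x 2^5 = 32 billion transistors; it is this exponential that CMOS scaling can no longer sustain.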

Figure 2 - Moore's Law

Dennard scalingxi is associated with Moore's Law as shown in Figure 3. It held that as transistors shrink, their power density stays roughly constant. That scaling has broken down, creating a power wall: as geometries get smaller, leakage current and higher clock frequencies drive power consumption up rather than down.

Figure 3 - Dennard Scaling


Dark siliconxii is an approach addressing the limitations of Dennard scaling. With power density going up and the potential of thermal run-away, an approach to limit the power is to keep portions of the chip static thus reducing the power consumed. Dark silicon is a method of “turning” off areas of a computer chip to save power.

What this means is these three constraints are forcing alternative approaches to providing more computational power to applications. This is the driving factor in creating new hardware architectures that can overcome these obstacles.

The need for Specialized Hardware

It used to be that, with Moore's Law, the steadily increasing number of gates kept up with application performance demand, making the time and effort spent creating specialized hardware impractical. By the time a new custom chip was released, demands had moved on before anyone could reap its benefits. Now that Moore's Law is slowing down, throwing more gates and standardized von Neumann cores at the problem is no longer the answer. This is driving an increased need for specialized hardware, in many cases predicated on specific applications and workloads.

Figure 4 - Coprocessor Architecture

Specialized hardware is nothing new. The original IBM PC, shown in Figure 4, had an arithmetic coprocessor to handle specific tasks. The big difference today is that there is an explosion of "killer apps" that are moving from serial tasks to parallel tasks. Hardware needs to drive these new workloads efficiently, and a different approach is needed. Let's look at the approaches.

The Trifecta

With Moore's Law losing steam, there is a perfect storm driving a change in approach to system architectures. The new methodologies are "Parallelism", "Heterogeneous" and "Distributed".


Let's first take a look at Parallelism. Consider the history of CPU core architectures as shown in Figure 5. From 1975 through 2005, most processors were single core, with designs growing from simple to increasingly complex over time. Then multi-core processors came on the scene, given that Moore's Law still supplied an excess number of gates. This made things more difficult for programmers, who now had to manage multiple threads and protect shared state with locks and semaphores in code (as sketched below), and many applications never leveraged the available parallelism.
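As a small illustration of the burden multi-core placed on programmers, the hypothetical Python sketch below uses an explicit lock to keep a shared counter correct across threads; without it, concurrent updates silently collide. The code is illustrative only and is not drawn from the original article.

import threading

counter = 0
lock = threading.Lock()  # the programmer must add and manage this synchronization primitive

def worker(iterations: int) -> None:
    """Increment the shared counter; the lock serializes access to shared state."""
    global counter
    for _ in range(iterations):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, but only because every update was protected by the lock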

Now, let's look at the Heterogeneous aspect. Around 2011, hardware architects realized that if you combined CPUs that are big and fast (large gate counts and high clock rates) with others that are small and slow (RISC instruction sets and low clock rates to reduce power consumption), each optimized for specific workloads, overall workload efficiency and performance would be significantly improved. During the same period, private, public and hybrid cloud architectures started their ascent.

Figure 5 - History of CPU core architectures

The cloud offered workloads an almost unlimited number of distributed CPUs at the applications' disposal. These unique architectural paradigms are driving new ways and methods to utilize hardware more efficiently, as shown in Figure 6.

There is more to this story. A CPU needs data, and getting that data to the CPU as quickly as possible is an overarching goal. As shown in Figure 6, simple and complex single cores are tied directly to memory through caches, DRAM, etc. Every pointer to data is equally visible and local, with limited fragmentation. Non-Uniform Memory Access (NUMA) cache and RAM architectures were developed to address distributed but fragmented linked cores and help implement SMP (Symmetric Multi-Processing).

Distributed Incoherent Memory Architecture (DIMA) was another SMP implementation, in which each CPU has its own view of shared data and it is up to the application to manage data synchronization. ARM and PowerPC processors are examples of this approach. Disjoint tight memory is the first architecture in which each core in a multiprocessor sees its own memory.


Figure 6 - Compute System Trends

Data is then copied and distributed by specialized hardware on high speed buses. NVidia’s NV Link xiii is one of the most recent examples of this architecture. And finally, there is disjoint loose distributed architecture. This is when thousands of processors are connected in the cloud and share data across physical servers, data centers and regions of the world. This brings up unique challenges that include reliability, availability and latency. The point is as you move from unified to disjoint memory architectures, complexity and latency increases. The good news is, as one might expect, capacity increases.

In addition, as shown in Figure 6, the trends are that cloud scale CPU’s are moving from complex CPU architectures to a mix of more dedicated hardware while maintaining a disjoint memory model. Multiple Core CPU’s are trending to disjoint while dedicated and specific hardware architectures are moving to faster and more efficient NUMA cache and RAM. Addressing the limitations of physics is always a challenge and bespoke hardware is an answer.

Hardware Accelerators

This has pushed an explosion of specialized hardware accelerators such as GPUs (Graphics Processing Units), FPGAs (Field Programmable Gate Arrays), ASICs (Application Specific Integrated Circuits), SoCs (Systems on Chip) and others, and it has just recently made sense to use them in servers.

The number and types of specialized hardware accelerator architectures are still in flux. It's still about economics and cost-benefit analysis: there must be a set of dominant workloads that can leverage mass production. As a case in point, AI is an example of a workload that truly needs specialized hardware. As a result, many of the leading cloud providers are finding business cases for specialized hardware.


Digital Transformation Hardware - Raising all Boats

Many reading this paper have the power to influence technology everywhere. Technology is reshaping humanity's involvement with the world. As such, it seems that we should have a goal to provide the right platform allowing all people to thrive regardless of culture, profession or location on this planet. I truly believe that the way for all boats to rise in this technology tide is to innovate every day through a Cambrian explosion of solutions. I am optimistic that this can be done. Most recently, four powerful capabilities have emerged that will allow us to raise all boats: Mobile, Cloud, AI and IoT (Internet of Things), as shown in Figure 7.

The four pillars of Digital Transformation

Figure 7 - The Four Pillars of Transformation

The reason to be optimistic now is that, for the first time, we are seeing the beginning of a period in which these capabilities are being democratized throughout the world.

Mobile provides ubiquitous scope. Cloud offers unprecedented scale. Artificial Intelligence spurs the ability to gather and analyze massive amounts of information and data and makes it useful. Last, but not least, is IoT which combines the physical and digital worlds together offering connections in all dimensions for the benefit of mankind. These synergistic capabilities have not aligned in history until today.

The bedrock of these pillars has always been software that runs on hardware. Many believe that today software is the only place where the magic happens, and that hardware, while necessary, is becoming commoditized and, from an innovation standpoint, unimportant.

Let's see how this assumption is inaccurate and how hardware diversification is becoming highly important to the four pillars.

Cloud

Cloud application workloads vary and each has its own requirements, examples of which are shown in Figure 8. At a high level, apps can be categorizedxiv as batch and streaming processes. For example, batch data analytics jobs are optimized for completion time (CT) and require substantial CPU and disk resources.

Figure 8 - Categorization of Cloud Workloads

Alternatively, streaming media apps require throughput and low latency and substantial disk and network bandwidth. As such, hardware Accelerators or hardware Microservices can help apps perform better based on these varying workload requirements in the cloud.

Why Hardware Microservices in the Cloud

There are three reasons why dedicated hardware is required in the cloud: clouds need to be faster, more efficient and more intelligent. With the data explosion expected to exceed 44ZB by 2020, big data analytics, for example, is forcing clouds to scale.

There is a need for low latency in applications that support autonomous decision making through real-time insights from connected devices and immersive user experiences. In addition, there is a need for throughput. For example, with cloud scale services such as storage, networking and compute being provisioned across the globe in multiple availability zones, big data pipes with high bandwidth are imperative.

Hardware Microservices

As a result, there is a major shift in how cloud providers are architecting and provisioning their infrastructures. The shift is simple and straightforward: Google, Microsoft, IBM, AWS, Dell, HP and others are providing hardware acceleration. It's a major differentiator. The hardware accelerators include enhanced FPGAs, GPUs and ASICs that can execute without CPU intervention.

Figure 9 - Hardware Accelerators

As shown in Figure 9, hardware accelerators basically come in three flavors: GPUs, FPGAs and ASICs. On the spectrum shown, CPUs are the most flexible and ASICs are the most efficient. CPUs are highly programmable and general purpose; however, they are not as efficient in power consumption and they serialize the workload. ASICs, at the other extreme, provide hard-wired algorithms that are very efficient in terms of latency, throughput and power consumption.

What is an ASIC

An ASIC is a semiconductor chip that is cast in stone or, one should say, in silicon. As such, it is designed for a specific purpose. Figure 10 shows an example block diagram of a PCIe network interface ASIC. The important thing to understand about this approach is that the time it takes to design, tape out (manufacture), test and debug a chip can be extensive. Its niche is high volume production for a broad base of customers that want the same function.

Figure 10 - Application Specific IC

ASIC Cloud Use Case - Bitcoin mining Hardware

An example of the power of an ASIC is its use in an application at the forefront of emerging financial businesses: Bitcoinxv. First, a little background. People profit from Bitcoin by mining itxvi. You mine by doing "work", solving cryptographic problems, and the problems get more difficult the faster the network solves them. Since the inception of Bitcoin, there has always been a need for the fastest possible mining process.

This "work" validates transactions, which are then included in the blockchain. Today, the best way of doing this is a brute force approach: incrementing an index (a nonce) until the block data, hashed with SHA-2 (SHA-256), matches certain properties, as shown in Figure 11. The more hashes you compute in the least amount of time, the more Bitcoin you mine, and thus the need for speed. This "hashing power" is disseminated throughout the distributed network and measured in megahashes per second. A Bitcoin mining hardware accelerator musters this processing power by evaluating many millions of hashes per second.
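The brute force loop can be sketched in a few lines of Python. This is a toy illustration only: the header layout and the "leading zero bytes" difficulty rule below are simplified placeholders, not real Bitcoin consensus rules; only the double SHA-256 hashing mirrors the actual scheme.

import hashlib

def double_sha256(data: bytes) -> bytes:
    # Bitcoin hashes the block header twice with SHA-256 (a SHA-2 family function)
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, zero_bytes: int = 2) -> int:
    """Increment a nonce until the double hash starts with the required number of zero bytes."""
    nonce = 0
    while True:
        digest = double_sha256(header + nonce.to_bytes(8, "little"))
        if digest[:zero_bytes] == b"\x00" * zero_bytes:
            return nonce
        nonce += 1

print(mine(b"example block header"))  # mining ASICs run this inner loop billions of times per second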


Figure 11 - Bitcoin ASIC Architecture

First came generic CPU implementations, then GPU and FPGA’s. With the two major drivers continuing to be compute efficiency and power density, ASIC’s are now in the forefront of Bitcoin processing.

Figure 12 - Bitcoin Mining Cloud

Bitcoin mining is now moving to the cloudxvii as shown in Figure 12. The usual reasoning for moving to a cloud environment still holds true for this particular workload. Scalability, low cost and OpEx vs. CapEx considerations, among others, are the primary drivers toward a Bitcoin cloud mining model. In this instance, ASICs provide the best fit, with the fastest performance possible at the lowest cost and power. Since an ASIC's algorithms are static, as long as the hashing algorithms and the fundamental blockchain architecture do not change too much, ASICs will continue to be the best solution for Bitcoin mining.

ASIC Cloud Use Case – Google, we have a problem

Google has a problem. A number of years ago they realized that many of their services needed AI, deep learning and neural networks. Google calculated that if users spent just a few minutes a day doing voice-enabled searches and/or voice dictation on their phones, they would be forced to double their data center capacity just to handle these deep learning functionsxviii. This was not sustainable or even financially possible given the power and cooling needed. When Google's architects looked at the problem, it was obvious that some form of hardware acceleration was needed to do AI inference and learning at scale, alongside the existing CPU and accelerator resources.

Google decided to build their own ASIC with a twist: it is also programmable. Called a TPU (Tensor Processing Unit)xix, it allows some of the flexibility of a CPU with the performance, power and cooling efficiencies of an ASIC. The architecture is shown in Figure 13.

Figure 13 - Google AI TPU ASIC

It is programmable, but it is not a traditional CPU built on scalar or vector primitives. It uses a matrix approach to instruction execution, allowing massive parallel processing. Its primitives are not so specialized that they force a limited subset of neural networks, which keeps it somewhat flexible and helps meet a minimal obsolescence objective. The architecture looks more like a math co-processor with dedicated capacity for matrix multiply operations. The TPU has two memories: external DRAM used to load instructions, and internal SRAM.


Figure 14 - TPU Performance comparisonxx

The importance of the TPU lies in the performance comparison against the leading AI hardware platforms of the time, the Intel Haswell and NVIDIA solutions. Note that the important numbers are TOPS (tera operations per second). The TPU scored 92 while the closest competitor (NVIDIA) reached 2.8, roughly a 32x improvement. Also note the power consumption numbers: the TPU consumes about 40W while busy, and the competitors consume more than double that. Notice as well that the TPU uses 8-bit integer operations while the others use wider numeric formats. It is beyond the scope of this paper, but Google has argued that inference calculations do not need the higher precision, making the wider formats of the other solutions a moot point.
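The 8-bit argument is easy to demonstrate. The NumPy sketch below (an illustration, not Google's implementation) quantizes a matrix-vector product, the core TPU operation, to 8-bit integers and compares it with the full-precision result; for inference-style workloads the error is typically small.

import numpy as np

w = np.random.randn(256, 256).astype(np.float32)  # trained weights
x = np.random.randn(256).astype(np.float32)       # input activations

def quantize(t):
    """Symmetric 8-bit quantization: scale the tensor into the int8 range."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

w_q, w_scale = quantize(w)
x_q, x_scale = quantize(x)

# Integer multiply-accumulate (accumulated in int32), then rescaled back to float
y_int8 = (w_q.astype(np.int32) @ x_q.astype(np.int32)) * (w_scale * x_scale)
y_fp32 = w @ x

print("max absolute error:", np.abs(y_int8 - y_fp32).max())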

What is an FPGA

An FPGA (Field Programmable Gate Array) allows algorithms to be downloaded to hardware. New logic can be downloaded multiple times, so the functionality of that hardware can change, though not in real time. Intel predicts that by 2020, more than 30 percent of cloud service provider nodes will be accelerated by FPGAsxxi.

FPGAs excel at inference, where broad scale deployment requires the most compute efficiency in terms of performance-per-watt. They also are reconfigurable, enabling leverage across a wide range of workloads and new evolving algorithms and neural networks. This re-configurability is one of Microsoft’s stated reasons for its decision to deploy FPGAs widely in its Azure and Bing properties. The state-of-the-art of inference is changing rapidly. Compression, pruning, and variable / limited precision (8-bit to 1-bit layers in the same network) techniques are examples where the algorithms are being refined and researched and may be broadly deployed. The major drawback to FPGAs has been their difficulty to program. New software-based development environments and development stacks are intended to address these shortcomings.

An FPGA is, in effect, programmable hardware. The device has large quantities of programmable units and can also include specific hardware functionality such as DSPs, RAM, and network and DRAM controllers. Each unit has specialized circuits that communicate directly with other blocks as well as external hardware. Given all this functionality, FPGA chips can also be considered "System on Chip" devices.

Figure 15 – Field Programmable Gate Array

The way you configure an FPGA is to "flash" it, that is, to download new logic to the device. As previously mentioned, this can be done multiple times to change the functionality as needed. It is not a real-time event, so flashing is infrequent and downtime is often required; the download itself takes only a few seconds and then it's load and go. The FPGA provides its new logic through what are known as slices. Slices can be memory (SRAM), random combinatorial logic, LUTs (Look Up Tables) that can implement any function, DSPs and ALUs. Switching blocks are integrated between these slices, allowing programmable connectivity that creates data and process flows. Larger FPGAs can also have on-board CPUs, most typically ARM cores, that integrate with the slices and the external world using standard interfaces such as PCIe, I2C, DDR, Ethernet and others. FPGAs have typically been used for prototyping and low volume production, and as such, once a design is locked in, a migration to an ASIC often occurs.


Figure 16 - CPU vs. FPGA

One of the fundamental differences between a CPU and an FPGA is how information is processed, as shown in Figure 16. Traditional CPUs process algorithms serially, one instruction stream at a time, while FPGAs can be configured to process information, functions and algorithms in parallel. FPGAs also generally consume less power. There are some disadvantages, however. Once an FPGA product is selected, one is locked into the resources it provides; if more features are needed, they may not be supported.

FPGA Cloud Use Case – Microsoft Azure Networking

For years now, Microsoft Azure data centers have noticed a trendxxii. A few years ago, Ethernet data rates were in the 1 to 10 Gb/s range. Today, networks run at 40 Gb/s and 50 Gb/s is not far away. The problem is that cloud servers cannot keep up with the network processing needed to transfer and route data. In addition, software defined networking added another layer of abstraction that made things worse by adding latency. With network pipes of 40 Gb/s, the best achievable throughput was a fraction of that; servers spent more CPU cycles routing data than acting on it. The Azure data center team had used FPGAs for years with limited traction. Today, given these constraints, it's a different story.


Figure 17 - Microsoft Azure Network Hardware Acceleration

Microsoft has installed hundreds of thousands of servers in over 14 countries with an FPGA-accelerated NIC, as shown in Figure 17. Not only does this NIC offer substantial bandwidth and reduced latency, it frees up the CPU for actual application workloads. It also offloads SDN tasks such as creating tenants, assigning ACLs, and maintaining programmable rule/flow tables. Microsoft also plans to add crypto, QoS and storage acceleration to these hardware-accelerated NICs.

FPGA Cloud Use Case – AWS

As discussed in the earlier section "Why is Hardware becoming Cool or Hot again", AWS is big on using dedicated chips. In addition, they now offer customers the same kind of accelerators on EC2 instancesxxiii using FPGAs. As shown in Figure 18, AWS offers a hardware development kit in which logic is written with traditional hardware description language tools such as Verilog or VHDL to create RTL output (Custom Logic) that is built into an Amazon FPGA Image. This image can then be published to the AWS Marketplace for re-use, or integrated into an F1 instance for a production environment.


Figure 18 - AWS FPGA in AMI Instances

Traditionally, HDL development has been difficult; it takes a lot of hardware design experience. AWS is trying to make it simpler by offering customers a library of common hardware blocks such as DRAM and PCIe interfaces, along with setup, build and constraint scripts as well as testing best practices. In many ways, hardware customization is slowly being democratized.

What is a GPU

This form of hardware accelerator offloads certain tasks from the general purpose CPU. For example, GPUs excel at training algorithms in a cloud-based environment, where fine-grained floating-point accuracy is required to learn the weights in a neural network from massive batches of (tagged) sample training data. Beyond being well suited to training, advantages include ease of use, a well-established ecosystem, and an abundance of standardized libraries, frameworks, and support. GPUs can also be well suited to large-scale inference workloads and are available on many public clouds, including Google, IBM Softlayer, and Amazon AWS. (With the exception of IBM, however, these services do not yet offer the latest Pascal-based hardware, preferring to stick with the older Maxwell-based chips for the time being.) Disadvantages include high power consumption and the inability to accommodate hardware changes as neural networks and algorithms evolve, particularly when compared to FPGAs.

For edge applications, typical SoCs and embedded GPUs (with processor cores) such as NXP i.MX, Qualcomm Snapdragon, and NVidia X1 can be relatively easy to program and offer low cost and reasonable throughput. However, because they must continuously access external memory, they are typically limited in the latency they can support. They also typically have limited sensor and connectivity interfaces and have fixed hardware, which limits their ability to be upgraded as algorithms, networks, sensors, and interface requirements change.

This form of hardware acceleration uses a graphics processing unit, or GPU, to achieve its performance goals. The architecture was originally developed by NVIDIA for video gaming acceleration but is now used for many other workloads such as AI deep learning, inference, robots, autonomous cars and other unmanned vehicle processing. As shown in Figure 19, it accelerates performance by offloading compute-intensive tasks from the CPU. Many advanced AI functions are individually simple with limited variation, but each must be executed an enormous number of times in a very short time frame.

Figure 19 - GPU Hardware Acceleration

GPUs parallelize the workload, allowing CPUs to do what they do best: general programmability and management of the workflow processes. GPUs are very good at training on large tagged data sets in a cloud-based environment, and all of the major cloud providers such as AWS, Google, Softlayer (IBM) and Microsoft have very strong ecosystems of deep learning frameworks and libraries. The negatives are increased power consumption and the inability to evolve with the ever changing algorithms and demands of neural networks.


GPU Cloud Use Case – AWS

Not only does AWS support FPGAs, EC2 instances also support Amazon Elastic GPUsxxiv, as shown in Figure 20, for applications such as 3D rendering, AI learning and other parallel-intensive tasks. An application in an EC2 instance makes an OpenGL API call, which is intercepted by the Elastic GPU driver and forwarded over the network to the Elastic GPU. The GPU processes the call and returns the result (in this case the rendered image) back over the network to the GPU driver and on to the application.

Figure 20 - AWS Elastic GPU'sxxv

An Elastic Network Interface (GPU ENI) resides between the EC2 instance and the GPU and is always in the same availability zone as the EC2 instance for performance reasons such as throughput and latency.


The Hardware Accelerator integration Cloud Framework

Whether a cloud instantiation uses an ASIC, FPGA or GPU, there are some common architectural goals that need to be considered. According to the research paper "Enabling FPGAs in the cloud"xxvi, four fundamental goals need to be met for a cloud provider to offer a stable and scalable environment for customers: abstraction, sharing, compatibility and security.

1. Abstraction - In keeping with the tenets of a public cloud, hardware accelerators need to be exposed in the same way as other abstracted cloud resources such as compute, network and storage. Being able to request the resource easily, for example from a self-service catalog, allows hardware microservices to be managed much like other as-a-service offerings.
2. Sharing - A hardware accelerator also needs the ability to share, and at the same time isolate, cloud resources. This has generally been difficult given that FPGA, GPU and ASIC designs are housed in the same silicon or adapter, but with advances in products such as NVIDIA's Kepler linexxvii of GPUs, this functionality is being added.
3. Compatibility - Cloud resources need to be compatible with industry standards to avoid lock-in. Most, if not all, hardware accelerators have libraries and supporting applications, but many are stovepiped into their own ecosystems. Often there are no standards for how an FPGA is flashed; each vendor has its own method.
4. Security - Lastly, security is a primary factor. Since hardware accelerators run at the embedded level of the infrastructure stack, they are difficult to isolate. FPGAs, for example, often have no built-in functionality in this regard.


Figure 21 - Example of a Hardware Accelerator Framework in the Cloudxxviii

As shown in Figure 21, the component blocks highlighted in grey are the elements that provide hardware acceleration. This stack takes many of its levels from an OpenStack architecture, with the underlying hardware, the hypervisor, middleware and libraries, and the supported virtual machines. In this approach, virtual accelerator module control applications can be written to service and augment the physical accelerated hardware. For example, the tenant accesses the OpenStack cloud as normal, with the accelerated hardware abstracted the same way traditional compute and network services are abstracted today in the software defined data center.

AI

There is an international AI chip race, and AI and chip development competitiveness is escalating. For example, China has announced that the government will be spending 2.1 billion USD on building an AI development parkxxix. China's goal is to be dominant in AI chip development in three yearsxxx. As stated previously, dedicated hardware is becoming a given for competitiveness, based on the performance and energy efficiency needs of AI-specific workloads.

AI workloads continue to expand and diversify daily. As shown in Figure 22, machine and deep learning workloads are touching pretty much every industry vertical and being provisioned and managed across public, private, hybrid clouds as well as the edge.


Figure 22 - AI Application Landscape

Artificial intelligence and deep learning continue to be an evolving practice in this digital transformation world. Why is hardware so important to this burgeoning technology? The answer is that you need to "think" differently.

Moravec's paradox

When I say think differently, I mean that traditional computer architectures may not be the best fit for AI. The reason is based on Moravec's paradoxxxxi. The Austrian-born roboticist Hans Moravec observed that tasks that seem repetitive and arduous to us, such as arithmetic computations, are easy for a computer, whereas listening to a friend with a different accent in a loud bar, with rock music in the background, is easy for humans. He theorized that the evolution of our brain has something to do with it. Higher level cognitive ability, such as logical reasoning, is a relatively recent development of the last hundred thousand years, but pattern matching and visual analysis have been hard coded into our lower level brain functions; we don't have to think about them, we just do them. AI running on existing computers has a difficult time doing what humans do easily, and how current compute processing is done could be the reason.


The Von Neumann Architecture and AI

Current data processing methods are often equated to a human brain. However, the biological version is much different. Most computing systems, like a laptop or PC, still use the von Neumann architecture shown in Figure 23.

Figure 23 - Von Neumann Architecture

In the von Neumann architecture, all of the number crunching happens in the CPU. It consists of the control unit, which decides what is done next, and the Arithmetic Logic Unit, which does mathematical calculations such as add, subtract, multiply and divide. The memory unit is the location of the data. There is typically one input and one output, and the sequence of events is executed one step at a time, serially. Some computers have a number of these blocks or cores, but the challenge for programmers is coordinating them so that parallel processing can be done efficiently. In other words, this architecture is not optimized for parallel operations and does not scale.
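The Python sketch below (illustrative only, not from the original article) makes the contrast concrete: the same per-element operation is applied first one element at a time, as a single von Neumann core would, and then as one data-parallel NumPy operation over the whole array.

import numpy as np
import time

image = np.random.rand(1000, 1000)  # a one-megapixel "image"

# Serial, von Neumann style: fetch an element, compute, write it back, one at a time
start = time.time()
out_serial = np.empty_like(image)
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        out_serial[i, j] = image[i, j] * 0.5 + 0.1
serial_seconds = time.time() - start

# Data parallel: one vectorized operation over every element at once
start = time.time()
out_vector = image * 0.5 + 0.1
vector_seconds = time.time() - start

print(np.allclose(out_serial, out_vector))
print(f"serial: {serial_seconds:.3f}s  vectorized: {vector_seconds:.5f}s")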

In contrast, the human brain is highly optimized for parallel processing. Data is evaluated simultaneously through billions of parallel tasks or processes using neurons. It’s interesting that both the brain and computers use electrical pulses but that is where the similarity ends.

The difference is many of our brains’ neurons are pre-programmed or “hard” wired. This is done through genetic pattern matching or learned relationsxxxii.

For example, let’s take a look at how we see. Humans see the world through millions of Photoreceptors as shown in Figure 24.


Figure 24 - Eye Photoreceptors

The brain takes this array of dots and creates a representation or perception of that image quickly, in real time, by processing each dot in parallel. In a von Neumann architecture, the processor needs to gather the data from memory, move it to the CPU for processing, generate a result and move it back to memory. This is done millions of times, serially, if there is only one core. Imagine the speed and power required to process the image in anything approaching real time. Over twenty years ago, Carver Mead estimated that modern day computers would need ten million times more energy than a brain uses for a similar processxxxiii. How then can we achieve better performance, or better performance density? Let's take a look.

Memory Bandwidth is important

To increase performance density, it is important to remember that the von Neumann architecture moves data from memory to the ALU (arithmetic logic unit) through a pipe. This pipe is often the bottleneck, from both a throughput and a latency perspective, and it also increases power and thermal load. The challenge has always been that when you intersperse memory blocks with the random logic typically found in an ALU, gate and power density degrade into a non-optimized solution.

Typical data sets in the AI world are large in scale and as such, there is almost always a need to utilize off die memory such as HBM2, HMC or DDR4 thus forcing an external memory bus and associated bottleneck.

Utilization is important

In this context, utilization is the percentage of on-die logic and memory that is actually used to perform the specific tasks. AI and deep learning use a limited number of processing primitives, and these usually account for only a fraction of the total duration of an operation. Typical operations include matrix multiplication, transposition and multiply-accumulate. Actual operation utilization can therefore be derived as shown below.
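The formula image from the original article did not survive extraction. A common formulation consistent with the surrounding text (a reconstruction, not necessarily the author's exact expression) is:

\[ \text{utilization} = \frac{\text{operations actually executed per second}}{\text{peak operations the hardware can execute per second}} \]

For the memory-bound case described next, the achievable numerator is further capped by the well-known roofline bound, \( \text{achievable ops/s} \le \min(\text{peak ops/s},\ \text{arithmetic intensity (ops/byte)} \times \text{memory bandwidth (bytes/s)}) \).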

Therefore, if these operations are slowed by the memory bandwidth or latency of moving data to and from memory, actual utilization will be severely limited. This form of limitation is called being memory bound. A typical mitigation is to use cache, but as previously stated, with the large data sets typical of AI workloads this strategy often yields limited results. Another strategy is to lower the bit precision and exploit sparsity in AI workloads, an area that is being actively researchedxxxiv.

Various metrics have been considered to determine the true performance of compute platforms. Often, teraFLOPS or teraOPS (tera operations per second) are used. In the case of AI workloads, one metric has been defined that may evaluate a system more faithfully.xxxv

This metric, defined by Naveen Rao, Vice President & General Manager of Intel's AI Products Group, might be a good tradeoff approach. Called Computational Capacity (CC), it combines the bit width of the numeric representation, memory bandwidth, and operations per second. Let b = the number of bits of representation, m = memory bandwidth in gigabits per second, and o = tera operations per second. These three factors provide a good approximation of the actual utilization achieved.
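The original article presents this metric as a formula image that did not survive extraction. Assuming the simplest combination of the three stated factors, a plain product (this combining form is an assumption, not Rao's published definition), the metric would read:

\[ CC = b \times m \times o \]

with b in bits, m in Gb/s and o in TOPS, so that narrowing the numeric representation, starving the memory bus, or idling the compute units all reduce CC.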

The AI Platform Landscape

Let's understand how the major players are executing in the AI hardware acceleration vertical. As with the other pillars of digital transformation, the usual suspects, plus a few players not previously mentioned, make up the AI hardware ecosystem, as shown in Figure 25.

Figure 25 - Hardware AI Platform Landscape

As shown, Intel, whose recent acquisition of Altera added FPGA technology to its portfolio, is a major player in the data center, at the edge, or in a combination of the two. AI inference and training are often addressed by separate, purpose-specific hardware solutions. NVIDIA and Google often excel in performance and functionality for AI workloads.

Mobile and IoT

The Internet of Things (IoT) promises the ability to connect everything in a way never before imagined. The vast number of applications range from wearables to industrial automation. For individual consumers, the need is for low cost solutions that connect to everything with mobility in mind, making our lives better. For businesses, the goal is to provide higher levels of asset tracking, shipping, energy conservation and security. Mobile devices go hand in hand with similar goals. Mobile solutions such as smart phones are just one form of edge device in the IoT ecosystem.

Application workload processing demands are increasing exponentially, especially for cognitive, spatial and deep neural network workloads such as visual and speech recognition. Over time, edge devices will rely less on the cloud's resources, driven by the need to process real-time data and reduce latency in time-critical applications.

In addition, IoT has similar needs. Meeting these goals at the scale of billions of devices and applications raises implementation challenges, including power density and the need to move beyond the von Neumann architecture.

Why Do We Need Mobile and IoT AI Chips

Mobile devices are using AI everywhere, and the major manufacturers see this as well. For example, Apple's newest iPhones have their own deep learning enginexxxvi, and Huawei has a device with the Kirin 970's NPUxxxvii built in. In addition, pretty much all of the major semiconductor manufacturers, such as ARMxxxviii, Qualcommxxxix and NVIDIA, are ramping up to support and supply deep neural network silicon and frameworks. The reasons for having dedicated hardware are simple:

1. Power consumption – Mobile devices are limited in the amount of power available. Traditional architectures that move data to the CPU and back via a serial process, as discussed previously, use a lot of power.
2. Simple calculations quickly – DNNs require a large number of simple calculations from a limited set, done in parallel.
3. Privacy and security – Many AI services on mobile devices are performed in the cloud today. Doing more calculations locally reduces the possibility of users' data being hacked or corruptedxl.
4. Efficient data usage – If more AI calculations are done locally, the amount of data that needs to be pushed to the cloud is reduced substantially. In addition, with reduced latency, overall performance will increase. Everyone wins when enhanced hardware carries the extra workload.

Ecosystems are being developed to address the need for mobile DNN functions. For example, Google is supporting a lite version of TensorFlowxli to support mobile.

The Move to a New Way of Thinking

As we discuss the four pillars of digital transformation, there is a common theme: power density. Whether addressing Cloud, AI, Mobile or IoT, each pillar has a fundamental need to reduce power and become more efficient in order to be sustainable.

If power density is key, then one must ask if biological systems like humans can teach us anything. The answer is yes. If we take a look at the power consumed by traditional processors as shown in Figure 26, compared to the human brain, there is a huge difference.

For example, a human brain consumes about 20 watts of power. Super computers consume megawatts.

A few years ago Google developed an algorithm to detect cats in YouTube videos using thousands of processor cores and all the power that comes with them. Now the goal is for our smartphones to detect family members in our photos, infer our emotions, and suggest ways to change our moods in real time. To implement these workloads, one must go beyond algorithms and think differently.

Yes, supercomputers and large arrays of CPUs do have enormous computational power for some workloads. But for today's demanding AI, DNN and CNN (Convolutional Neural Network) workloads, a different approach needs to be considered. This is where neuromorphic computing comes in.

Figure 26 - Power Density of CPU's vs. Human Brain

Neuromorphic chips

Neuromorphic processing, or what is known as "brain inspired" computing, has been a goal for many years. As discussed, von Neumann architecture-based systems process information differently from our brains. Table 1 provides a high level comparison of general purpose and neuromorphic processors. The important aspect to notice is that, as discussed previously, general purpose CPUs are sequential, centralized and power hungry, while neuromorphic processors are parallel, distributed and low power, in some cases very low power.

Table 1 - Comparison of GP and Neuromorphic Processors

General Purpose Processor: Multicore CPU, GPCPU, MCU | Neuromorphic Processor: ASIC, FPGA

Feature | Multicore CPU | GPCPU | MCU | ASIC | FPGA
Compute Process | Sequential | Sequential | Sequential | Parallel | Parallel
Compute Structure | Centralized | Centralized | Centralized | Distributed | Distributed
Energy Efficiency | Low | Low | Low | Best | Better
Development Style | Short | Short | Short | Longest | Longer
Cost | High | High | Low | Highest | Moderate
Examples | Intel, AMD | Intel, AMD | Atmel, Cypress | IBM TrueNorth, Qualcomm | Intel, Applied Sciences

What is different

What makes this new architecture different is that it is motivated by the human brain, which has had millions of years of development. The major difference is power consumption, as discussed previously. It achieves this byxlii:

1. Asynchronous architecture – No single clock or central control. This provides the ability to slow down or even stop logic elements, saving power. Traditional architectures have a common clock, and each gate burns power whether it is used or not.
2. Event driven – This approach adds to power conservation in that logic is only exercised when a specific notification or event occurs, minimizing overall power and cooling.
3. Adaptive and fault tolerant – Similar to the brain, signals can be rerouted to other compute blocks.
4. Flexibility – Logic blocks can be reprogrammed to provide varied behaviors, making efficient use of limited resources.
5. Co-localized memory and computation – This is one of the major differences! By placing data right next to the processing unit, major power reductions, speed and latency benefits can be achieved.

For example, as shown in Figure 27a, neurons and synapses are the building blocks of the brain that provide both memory and thinking (computation). Memory is local, reinforced by the number of connections between neuron clusters, which in turn form a mesh with other clusters. As shown in Figure 27b, the rough silicon analogy is that CPU registers and cache play the role of the neurons, main memory plays the role of the synapses, and the connections between clusters provide the global sensory synthesis.

Figure 27 - Comparing Human and Silicon approaches

How it Works

The first question is how do we get our current technology to think like humans? Our brain looks at patterns. It has about 86 billion neurons, each connecting to thousands of other neurons. When the stimulation level of a neuron hits a certain point, it "fires" and passes the signal downstream. Our brain learns by modifying the strength and number of connections based on stimuli; the more we see something, the more the level (weight) and number of connections are reinforced. Algorithms have been developed to simulate this process, called artificial or deep neural networks (neural nets).
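A hypothetical, minimal Python sketch of this idea is shown below: a simulated neuron computes a weighted sum of its inputs and "fires" when a threshold is crossed, and a simple update rule strengthens the weights of connections that contributed to the desired output. The learning rate and update rule here are illustrative choices, not taken from the article.

import numpy as np

def neuron_fires(inputs, weights, bias=0.0, threshold=0.0):
    """Weighted sum of stimuli; the neuron fires (returns 1) when the sum passes the threshold."""
    return 1 if np.dot(inputs, weights) + bias > threshold else 0

def reinforce(weights, inputs, error, learning_rate=0.1):
    """Strengthen (or weaken) connection weights in proportion to the error seen."""
    return weights + learning_rate * error * np.asarray(inputs)

stimuli = np.array([0.9, 0.1, 0.4])    # incoming signals
weights = np.array([0.2, -0.5, 0.3])   # connection strengths (synaptic weights)

output = neuron_fires(stimuli, weights)
weights = reinforce(weights, stimuli, error=1 - output)  # nudge weights toward the desired firing
print(output, weights)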

Hardware for DNN Processing

DNNs are the first attempt at emulating a brain-inspired approach to computational excellence. Even though power is still an issue, parallelism is key to their architecture. Various manufacturers are entering the DNN space. They include:

1. Intel Knights Mill CPUxliii – Vector instructions for deep learning.
2. NVIDIA Pascal GP100 GPU – Sixteen-bit floating-point (FP16) arithmetic support. Performs two FP16 operations on a single-precision core for faster deep learning computation.
3. NVIDIA DGX-1 – Removes the burden of continually optimizing your deep learning software and delivers a ready-to-use, optimized software stack.
4. Facebook's Big Basin custom DNN serverxliv – The new GPU server will allow Facebook to train machine learning models that are 30 percent larger compared to Big Sur.
5. System-on-Chips (SoC) – NVIDIA Tegraxlv – the NVIDIA Tegra 4 processor uses ARM CPU cores with battery-saver core technology.
6. Samsung Exynosxlvi – Field-programmable gate array (FPGA).

To achieve high performance, highly-parallel compute paradigms are commonly used, including both temporal and spatial architectures as shown in Figure 28.

Figure 28 - Parallel computation approaches

Temporal architectures achieve parallelism through a series of SIMD vectors and/or SIMT threads, using a centralized control process with a common or central memory or data repository accessed by the ALUs. This implies that the ALUs cannot communicate or transfer data between each other. With a spatial architecture approach to parallelism, each ALU can process and transfer data to its neighbors. In many cases each ALU also has its own control logic and local memory, and as such is more of an autonomous processing engine than a simple ALU. DNNs commonly use spatial architectures, realized in ASICs and FPGAs.

The importance here is that a spatial approach more closely mimics the three-dimensional organization of neurons, which are autonomous in nature.

The Importance of CNN’s

DNNs are great, but with recent algorithmic advancements, CNNs are becoming the predominant approach to AI, particularly where low power consumption on neuromorphic chips is required. CNNs, as shown in Figure 29, are a type of neural network that has gained stature recently.

Figure 29 - CNN Process flow

What makes CNNs different is that they use simulated neurons organized into feature maps, each tuned to a certain learned pattern, like the image of a human mouth, and, through sub-sampling, they are able to see a wider view of the image and find patterns faster and more accurately while learning.
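A minimal sketch of this structure, written with the Keras API discussed in the frameworks section below, is shown here; the layer counts, filter sizes and input shape are illustrative choices, not taken from the article.

import tensorflow as tf
from tensorflow.keras import layers, models

# Convolution layers learn feature maps; pooling layers perform the sub-sampling
# that widens the effective field of view described above.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),  # 16 feature maps
    layers.MaxPooling2D((2, 2)),                                            # sub-sampling
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # e.g. ten pattern classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()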

CNNs have traditionally used GPUs or ASICs to get the job done. But even so, parallelizing tasks at the scale of the number of simulated neurons needed is still beyond current technology.

A Deeper Look – IBM's True North

With the advances in AI, a number of development efforts are on the move. Stanford announced the Neurogrid devicexlvii and Qualcomm unveiled the Zerothxlviii processor.

Figure 30 - IBM True North Chip Layout

IBM's TrueNorth chip, shown in Figure 30, is based on years of DARPA-funded research and is at the forefront of neuromorphic design. Its feature set includes:

I. SIZE – One million programmable neurons with 256 million individually programmable synapses on chip, 5.4B transistors, and 4,096 parallel and distributed cores interconnected in an on-chip mesh network, with approximately 100 Kb per core for synapse and neuron storage.
II. ARCHITECTURE – Hierarchical communication with a message spike-routing network and a local fan-out crossbar that minimizes data traffic, reducing power consumption. A two-dimensional network of cores (4,096 cores) that supports advances in 3D packaging.
III. EFFICIENCY – 70mW total power, orders of magnitude lower than conventional GPUs and ASICs. 26pJ per synaptic event, the lowest recorded energy for any large-scale neuromorphic system and five orders of magnitude lower than von Neumann computers. 20mW/cm^2 power density.
IV. DESIGN – Fully event-driven digital mixed synchronous-asynchronous neuromorphic architecture. This reduces neuron switching by 99% on average for a typical network, with asynchronous routing providing low latency spike delivery at 0.3fJ per bit per um.

As outlined, a lot of effort is going into the development of neuromorphic processing. It’s time to think different.

The Supporting Software

Now that we have discussed the network architectures (DNN’s, CNN’s), the hardware instantiations that accelerate them (FPGA’s, GPU’s, ASIC’s) and the typical application workloads (AI) that benefit from them, we must not forget the SDK’s (Software Development Kits) and software frameworks that allow this infrastructure to be leveraged.

The Why and What of Frameworks

Hardware acceleration frameworks fast-track the adoption, use and efficiency of acceleration hardware. They provide extensible building blocks that foster active application development. Most, if not all, of these frameworks are open source, which leads to an open environment of diversity, collaboration and innovation. There are numerous frameworks for other application workloads, but to limit the scope of this discussion and provide examples, this paper will restrict itself to frameworks specific to deep learning.

Table 2 - Hardware Acceleration Frameworks

Table 2 lists the most common and popular deep learning frameworks. As with any well-supported framework, support for a wide range of programming languages is a plus; documentation, modeling capabilities, breadth of supported hardware platforms and speed of development are also important. Kerasxlix support makes implementation simple by raising the level of abstraction, yet only a few of the frameworks enable this API. Let’s take a look at each in more detail; a minimal Keras sketch follows the list below.

1. Theano – Python is the programming language of choice. It is a compiler for mathematical expressions over multi-dimensional arrays. It is also part of the NVidia SDK and as such supports cuDNNl version 5, which enables GPU acceleration.
2. TensorFlow – This framework allows DNN’s to be described as graphs of connected operations that can be distributed across multiple GPU’s. It supports a range of hardware platforms, from mobile devices and desktops to high-end servers with the appropriate hardware-accelerated components.
3. Torch – This framework supports a wide spectrum of AI and DNN architectures. Its advantage is a simple and efficient scripting language (LuaJITli) that gives easy access to the underlying resources. Its underlying implementation is structured around C++ and CUDAlii.
4. Caffe – Developed by Berkeley AI Research (BAIR), formerly the Berkeley Vision and Learning Center (BVLC), and its ecosystem of contributors on GitHub, Caffe offers a modular, fast and expressive approach to AI and DNN’s. It has numerous variants, including multi-node CPU implementations optimized for Intel Xeon processors as well as Caffe OpenCL for AMD or Intel devices.
5. MXNet – This framework utilizes the Gluon API, which provides a high-level interface to MXNetliii. Its secret sauce is the ability to combine efficiency and flexibility through dynamic define-by-run data graphs with just-in-time compilation.
6. Neon – This framework is supported by Intel and as such is optimized for its various core processors. It uses the Python language and supports an extensible, optimized environment for AI and other DNN’s such as GoogLeNet, AlexNet and VGG (Visual Geometry Group).
7. CNTK – This framework, also known as the Microsoft Cognitive Toolkit, supports unified deep learning. It allows combining multiple GPU and ASIC types and server architectures. Speech, image and text data can be analyzed for efficient RNN and CNN training and inference. It also supports cuDNN v5.1 for GPU acceleration.
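To show how little code these frameworks require once Keras-style abstraction is layered on top, here is a minimal sketch of a small CNN (assuming the standalone Keras API with a TensorFlow backend; the layer sizes are illustrative, not tuned for any real dataset):

```python
# A minimal Keras CNN definition; the backend decides how it maps to hardware.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # learned feature maps
    MaxPooling2D((2, 2)),                                            # sub-sampling widens the view
    Flatten(),
    Dense(10, activation='softmax'),                                 # class scores
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```

The same model definition can be run on a CPU or a GPU-accelerated backend without changing the code, which is precisely the portability argument made above.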

Software is still eating the world, but hardware acceleration SDK’s and frameworks are the straws that apps can drink from.

Bringing Hardware Acceleration to the Enterprise

Hardware acceleration has been part of the HPC world for decades and is now moving into the mainstream enterprise. What the enterprise wants is a complete solution that can be easily adapted to business requirements. Enterprises are looking for three things: integration, bundling and environmental managementliv. For example, Dell EMC is taking AI seriously by integrating NVidia’s Tesla V100 with NVLink, as shown in Figure 31.

Figure 31 - Dell Technologies Integration

In addition, as shown in Figure 32, Dell EMC is also bundling Server, Network and Storage with Libraries, Software and Consulting services.

Figure 32 - Dell Technologies Bundling

As discussed at length, power and cooling are very important to these new workloads, and Dell EMC is partnering with CoolIT Systems to increase power density through direct-contact liquid cooling. This solution has reduced cooling costs by over 50 percentlv. Enterprise vendors are serious about these new workloads and the new hardware acceleration solutions now required to support them.

Summary

Digital transformation is happening, and the amazing possibilities of Cloud, AI, Mobile and IoT are the driving forces. Software workloads are eating the world, but unique and bespoke hardware is feeding them, especially in areas such as deep learning, convolutional networks and other emerging applications. Hardware acceleration is becoming a major differentiator for cloud and service providers. With the Cambrian explosion of alternative architectural styles, the concern that hardware vendors will lose market share to commoditized hardware seems to be less of an issue. The opportunities that new hardware solutions provide in sustaining the appetite of software, as it continues to try to eat the world, are expanding. Long live Hardware!

References

i https://a16z.com/2016/08/20/why-software-is-eating-the-world/
ii https://www.technologyreview.com/s/607831/nvidia-ceo-software-is-eating-the-world-but-ai-is-going-to-eat-software/
iii https://finance.yahoo.com/news/bitcoin-mining-sucks-much-electricity-140843956.html
iv https://www.theguardian.com/technology/2017/nov/27/bitcoin-mining-consumes-electricity-ireland
v https://www.wired.com/2016/05/googles-making-chips-now-time-intel-freak/
vi https://www.geekwire.com/2017/amazon-web-services-secret-weapon-custom-made-hardware-network/
vii https://www.nextplatform.com/2017/06/19/cray-cto-cambrian-compute-explosion/
viii https://twitter.com/MichaelDell/status/868515452116500485
ix http://www.lithoguru.com/scientist/CHE323/Lecture2.pdf
x http://www.wired.co.uk/article/moores-law-wont-last-forever
xi https://en.wikipedia.org/wiki/Dennard_scaling
xii https://en.wikipedia.org/wiki/Dark_silicon
xiii http://www.nvidia.com/object/nvlink.html
xiv http://ieeexplore.ieee.org/document/7577381/
xv https://www.bitcoinmining.com/bitcoin-mining-hardware/
xvi https://www.bitcoinmining.com/
xvii https://www.bitcoinmining.com/best-bitcoin-cloud-mining-contract-reviews/
xviii https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
xix https://cloud.google.com/tpu/
xx https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
xxi https://datacenterfrontier.com/machine-learning-changing-the-data-center/
xxii https://www.computerworld.com/article/3124928/cloud-computing/microsoft-azure-networking-is-speeding-up-thanks-to-custom-hardware.html#tk.drr_mlt
xxiii https://aws.amazon.com/ec2/instance-types/f1/; https://www.nextplatform.com/2016/12/02/takes-build-true-fpga-service/; https://forums.xilinx.com/t5/Xcell-Daily-Blog/You-have-until-Jan-27-to-download-this-free-book-with-a/ba-p/743757; https://live.awsevents.com/previous-recordings.html
xxiv https://aws.amazon.com/ec2/elastic-gpus/
xxv https://www.slideshare.net/AmazonWebServices/remote-application-streaming-onawsnabbooth
xxvi https://pdfs.semanticscholar.org/ab67/49dce553fa95685ae106ef28719b3b7ccb7e.pdf; https://dl.acm.org/citation.cfm?id=2597929&dl=ACM&coll=DL&CFID=847682292&CFTOKEN=17840415
xxvii http://www.nvidia.com/object/grid-boards.html
xxviii https://pdfs.semanticscholar.org/ab67/49dce553fa95685ae106ef28719b3b7ccb7e.pdf; https://dl.acm.org/citation.cfm?id=2597929&dl=ACM&coll=DL&CFID=847682292&CFTOKEN=17840415
xxix https://futurism.com/china-building-2-1-billion-industrial-park-ai-research/
xxx https://www.wsj.com/articles/google-and-intel-beware-china-is-gunning-for-dominance-in-ai-chips-1515060224
xxxi https://en.wikipedia.org/wiki/Moravec%27s_paradox
xxxii http://onlinelibrary.wiley.com/doi/10.1002/1097-0193(200007)10:3%3C120::AID-HBM30%3E3.0.CO;2-8/full
xxxiii https://www.technologyreview.com/s/521501/three-questions-for-computing-pioneer-carver-mead/
xxxiv https://arxiv.org/abs/1610.00324
xxxv https://www.intelnervana.com/comparing-dense-compute-platforms-ai/
xxxvi https://www.theverge.com/2017/9/13/16300464/apple-iphone-x-ai-neural-engine
xxxvii https://www.theverge.com/circuitbreaker/2017/10/16/16481242/huawei-mate-10-pro-announcement-specs-price-ai-features
xxxviii https://www.theverge.com/2017/7/25/16024540/ai-mobile-chips-qualcomm-neural-processing-engine-sdk
xxxix https://www.theverge.com/2017/7/25/16024540/ai-mobile-chips-qualcomm-neural-processing-engine-sdk
xl https://www.theverge.com/2017/4/10/15241492/google-ai-user-data-federated-learning
xli https://www.tensorflow.org/mobile/tflite/
xlii S.-C. Liu, T. Delbruck, G. Indiveri, A. Whatley, and R. Douglas, Event-Based Neuromorphic Systems. Wiley, 2014.
xliii Intel Architecture Instruction Set Extensions, Programming Reference, Intel, Santa Clara, CA, USA, Apr. 2017.
xliv S. Condon, “Facebook unveils Big Basin, new server geared for deep learning,” ZDNet, Mar. 2017.
xlv http://www.nvidia.com/object/tegra-4-processor.html
xlvi http://www.samsung.com/semiconductor/minisite/exynos/
xlvii https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
xlviii https://www.qualcomm.com/news/onq/2013/10/10/introducing-qualcomm-zeroth-processors-brain-inspired-computing
xlix http://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
l https://developer.nvidia.com/cudnn
li http://luajit.org/ext_ffi_tutorial.html
lii https://devblogs.nvidia.com/parallelforall/even-easier-introduction-/
liii http://gluon.mxnet.io/
liv https://www.nextplatform.com/2017/11/13/dell-emc-wants-take-ai-mainstream/
lv http://www.zdnet.com/article/dell-emc-and-nvidia-sign-gpu-deal-aimed-at-hpc-data-analytics-and-ai/

Dell EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
