Vector Processing Rises to the Challenges of AI and Machine Learning

In the complex world of computing, a major new technological development in a particular area tends to gain acceptance and then reign supreme, sometimes for years or decades, until something more advanced comes along. It's rare indeed for any technology to fade away only to reappear, stronger, and make another major contribution.

But that's precisely what is happening with vector processing, which once ruled the highest echelons of performance and is now solving some of the most critical problems facing the industry. And rather than remaining in the rarefied air of supercomputers, it's moving into the mainstream computer hierarchy, all the way down to board level. To understand why this is happening now and how it is being achieved, a brief review of processor history is in order.

The Evolution of Vector and Scalar Processing

For more than three decades, vector-processor-based supercomputers were the most powerful in the world, enabling research in nuclear weapons and cryptography, weather forecasting, oil exploration, industrial design and bioinformatics. Led in the U.S. by Seymour Cray and by researchers at Fujitsu, Hitachi and NEC in Japan (1), the performance of these machines was simply untouchable by any other processor architecture. But by the 1990s, the immense market potential of making computing accessible to more people (especially those without very deep pockets) made billions of dollars in research funding available in the commercial market, nearly all of which was spent on scalar rather than vector processors.

The result has been a continuous stream of scalar processors, like Intel's x86 family, that could sometimes exceed the performance of vector supercomputers on some tasks at a much lower cost, while being more flexible in their capabilities than vector processors. They have continued to increase performance, more or less adhering to Gordon Moore's "law" of doubling performance every 18 months, while offloading specialized tasks to supplementary coprocessors. When they evolved to enable massively parallel processing, the supercomputing industry, or most of it, abandoned vector processing, seemingly for good.

In fact, Eugene D. Brooks, a scientist at Lawrence Livermore National Laboratory, predicted in a 1990 New York Times article (2) headlined "A David Challenging Computer Goliaths" that vector-processor-based computers would soon be "dead," at which time their scalar counterparts would exceed the abilities of Cray's vector-processor-based supercomputers at a fraction of the cost.

Less noticed was work by Krste Asanovic, who in 1998, as a doctoral candidate at the University of California, Berkeley, posited in his thesis (3) that the success of scalar (or, by this time, superscalar) microprocessors was mostly the result of their ability to be fabricated using standard CMOS technology. He argued that if the same process could be used to create vector microprocessors, they could be the fastest, cheapest and most energy-efficient processors for future applications.

His argument was that, in the future, compute-intensive tasks would require data parallelism, and that the vector architecture was the best way to achieve it. In the following 250 pages of his thesis, he described how and why vector microprocessors could be applied to more tasks than scientific and engineering supercomputing, eventually handling multimedia and "human-machine interface" processing. So, who was the most prophetic?

As it turned out, both. Asanovic's vision proved to be correct, although it took two decades to be realized. Asanovic went on to become a professor in the Computer Science Division of the Electrical Engineering and Computer Sciences Department at Berkeley, chairing the board of the RISC-V Foundation and co-founding SiFive, a fabless semiconductor company that produces computer chips based on established reduced instruction set computer (RISC) principles.

Brooks' predictions have also proven to be correct, with one very significant exception: Vector processors never vanished, because they could still perform some functions far faster, at less cost, and with less hardware than scalar processors. As vector and scalar technologies advanced over the years, scalar processors became ubiquitous, while vector processors were used by fewer and fewer companies until only one company, NEC in Japan, continued to develop computers around them.

So, far from being "dead," processing using vector instructions continued to be used in supercomputers. In 2002, for example, NEC's vector system called the Earth Simulator became operational at the Earth Simulator Center in Japan (4). It was the world's fastest at the time, with 5,120 central processing units (CPUs) in 640 nodes, each consisting of eight vector processors operating at 8 GFLOPS (one GFLOPS is one billion floating-point operations per second), for an overall system performance of 40 TFLOPS (40 trillion floating-point operations per second). The Earth Simulator created a "virtual planet earth" by processing huge amounts of data from satellites, buoys and other sensors. The current version uses the NEC SX-ACE supercomputer, which delivers 1.3 PFLOPS (one PFLOPS is one quadrillion floating-point operations per second).

Vector-Supported Scalar Processing

Interestingly enough, vector processing has long served as an adjunct to scalar processing in the form of accelerators. For example, when Intel added the MMX instruction set to its x86 Pentium processors in the late 1990s (5), it was the beginning of a continuous succession of vector-based coprocessing that has run through the Streaming SIMD Extensions (SSE) in the Pentium III to the current vector instruction standard, AVX-512. Today, nearly all scalar processors rely on vector instructions for specific tasks, reserving the scalar units for traditional host processing functions.
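This progression from MMX to AVX-512 is easy to see in code. The short C++ sketch below multiplies two float arrays, first with a plain scalar loop and then with SSE intrinsics of the kind introduced with the Pentium III. It is a minimal illustration of vector instructions serving as an adjunct to scalar processing, not NEC or Intel reference code, and the function names are ours.

    #include <xmmintrin.h>  // SSE intrinsics, available since the Pentium III

    // Scalar version: one multiply per instruction, n instructions' worth of work.
    void multiply_scalar(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

    // SSE version: each _mm_mul_ps instruction multiplies four floats at once.
    void multiply_sse(const float* a, const float* b, float* c, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);           // load four floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));  // four products, one instruction
        }
        for (; i < n; ++i)                             // scalar tail
            c[i] = a[i] * b[i];
    }

With AVX-512, the same pattern widens to 16 single-precision elements per instruction; the scalar core still orchestrates the loop while the vector unit does the arithmetic.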

Other types of accelerators, including graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs) and other more esoteric devices, also perform specialized tasks and are even transitioning to become host processors themselves. GPUs, for example, which originated as the computational power behind graphics engines, are now often used as "general-purpose" GPUs (GPGPUs), and FPGAs are making a similar transition. Now the venerable accelerator technology called vector processing, which once powered only the world's most powerful computers, is being applied in platforms lower in the computer hierarchy.

Future Challenges for Scalar Processors

Current supercomputers and high-performance computing systems based on superscalar, massively parallel CPUs are reaching an important stage in their history. Simply stated, the huge amount of data inherent in artificial intelligence (AI) and machine learning (ML) applications is pushing the envelope of what they can achieve. These processors require more and more overhead to handle the massive parallelism of the computers they serve, which creates a gap between what they can theoretically achieve and what is experienced in practice. The result is that it is becoming difficult to fully exploit the potential of currently deployed systems in which only superscalar processors are employed.

The performance of scalar processor cores is also not increasing at the rate it once did, so improvements are obtained by increasing the number of cores on a homogeneous silicon substrate. However, it is difficult to extract more parallelism from cores with differing performance characteristics. In addition, the cost of fabricating the transistors themselves is rising, because fabrication processes must move to ever-smaller process nodes to fit more transistors in a reasonably sized piece of silicon.

The SX-Aurora TSUBASA Rises to the Challenge

The result of this work is the SX-Aurora TSUBASA, a complete vector computing system implemented on a PCIe card with one of the world's highest performance levels per core (Figure 1). It is the first time a vector processor has been made the primary compute engine rather than an accelerator. All functions are implemented on a single chip, including the cores with integrated cache as well as the memory, network and I/O controllers. This reduces the space required for all these functions to a fraction of what its predecessors needed, while also reducing power consumption. Implementing each processor on a card cuts the space required to a fifth of anything the company has produced before.

Figure 1: NEC SX-Aurora TSUBASA - Vector Engine.

The New Computers with Vector Processors

Ideally, a vector computer would combine the benefits of extremely high-performance, general-purpose scalar processors with the unique capabilities of vector processors to produce results that neither one alone could achieve, effectively becoming a vector-parallel computer. This is precisely what NEC's latest achievement is designed to do.

NEC has been designing and fabricating vector processors, and nearly all the other major components of its supercomputers, for more than three decades, accumulating deep knowledge of this technology and how it is applied. So, about five years ago, NEC determined that it was possible to bring some of the performance of a supercomputer to a much smaller and less expensive form factor, where it would be very well suited to the new applications on the horizon, primarily AI and its subsets, including ML.

The challenges were significant, because these applications do not have the massive power sources available to supercomputers and data centers. Vector processing has never been particularly frugal with power, so the new constraints had to be addressed. It was also necessary to ensure that the inherently wide memory bandwidth achievable with vector processors could be retained in such small form factors.

Vector instructions have significant benefits because they perform multiple math operations at once, where scalar instructions perform only one. If the job is to multiply four numbers by four other numbers, vectors can do it with a single command, in a process called single instruction, multiple data (SIMD), in which the command applies the same single instruction to many different pieces of data.

The computer thus needs to fetch and decode far fewer instructions, reducing the control-unit overhead and the memory bandwidth required to execute operations. In addition, the instructions provide the processor cores with continuous streams of data: when a vector instruction is initiated, the system knows it needs to fetch sets of operands arranged in regular patterns in memory.

This allows the processor cores to ask the memory to start sending those operand sets, which then arrive at a rate of one per cycle and can be routed directly to a processor. Without an interleaved memory, or some other way of supplying operands at such a rate, the advantages of processing an entire vector with a single instruction would be greatly reduced; in modern vector processors, however, this is not the case.
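The same idea can be expressed at the language level. In the C++ sketch below, std::valarray applies one whole-array operation to unit-stride operands, mirroring how a single vector instruction covers an entire operand set; the four-element values are ours, chosen to match the multiply-four-numbers example above.

    #include <cstdio>
    #include <valarray>

    int main() {
        // Four multiplications expressed as one whole-array operation,
        // the software analogue of a single SIMD instruction.
        std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
        std::valarray<double> b = {4.0, 4.0, 4.0, 4.0};
        std::valarray<double> c = a * b;   // element-wise, unit-stride operands

        for (double x : c)
            std::printf("%g ", x);         // prints: 4 8 12 16
        return 0;
    }

Because the operands sit at regular, unit-stride addresses, a vector machine can stream them from memory exactly as the paragraph above describes.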

The vector processors themselves use a chip-on-wafer-on-substrate (CoWoS) architecture to connect six high-bandwidth memory (HBM2) modules that deliver 150 GBytes/s of memory bandwidth and 307 GFLOPS per core, about five times faster per core than the company's previous generation of vector computers. Each processor has eight cores, for 2.45 TFLOPS of performance and an aggregate memory bandwidth of 1.2 TBytes/s.

The architecture of the system combines an Intel Xeon Gold 6100-based host server, a "vector host" running Linux, and the vector subsystem in a hybrid configuration. The host server acts as the "front end" for the vector engine, performing functions not directly related to mathematical calculations, such as I/O. This is the familiar CPU-offload scenario, except in this case the core computational component is the vector engine rather than the superscalar Intel processor. And unlike GPU-based solutions, the system executes complete applications on the vector engine, which significantly reduces the communication overhead between the vector host and the vector engine.
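For readers who think in code, the C++ sketch below illustrates this division of labor. It is purely conceptual and uses assumed names: VectorEngine, read_input and run_complete_app are hypothetical stand-ins invented for illustration, not NEC's actual offloading API.

    #include <string>
    #include <vector>

    // Hypothetical handle for one vector-engine card (not a real NEC type).
    struct VectorEngine { int card_id; };

    // Host-side I/O: the Xeon "vector host" handles work that is not
    // numeric, such as reading input data.
    std::vector<float> read_input(const std::string& path) {
        return std::vector<float>(1024, 1.0f);  // stub: pretend we loaded a file
    }

    // The complete numeric application runs on the vector engine, so
    // host-to-engine traffic is limited to inputs and results.
    std::vector<float> run_complete_app(VectorEngine& ve, std::vector<float> data) {
        for (float& x : data) x *= 2.0f;        // stub for the real vectorized workload
        return data;
    }

    int main() {
        std::vector<float> input = read_input("sensor_data.bin");
        VectorEngine ve{0};                     // attach to card 0 (hypothetical)
        std::vector<float> result = run_complete_app(ve, input);
        return result.empty();                  // 0 on success
    }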

To ensure that the SX-Aurora TSUBASA could accommodate multiple configurations, from workstations to server towers and larger systems, the product line uses the PCIe form factor as its core platform. From there it can be configured into servers with 1, 2, 4 or 8 vector engines, and into racks of up to 64 vector engines that deliver an aggregate performance of 156 TFLOPS and a maximum memory bandwidth of 76.8 TBytes/s. A nearly unlimited number of these systems can be combined, creating what amounts to a "mini data center" level system. A top-end system can be configured with 32, 48 or 64 vector engines in up to eight enclosures.

Although current users of NEC's products know how to optimize their performance using the company's tools, the new users NEC hopes to cultivate will not, so the company has spent more than six years developing middleware called Frovedis to shorten the learning curve toward optimal system performance. Measurements made by NEC show that the Frovedis machine learning library running on the SX-Aurora TSUBASA achieves 42 to 113 times the performance of Spark MLlib running on an x86 platform, and a 10- to 47-fold improvement on data frame operations.

Frovedis is a set of C++ programs comprising a math matrix library, an additional (and expanding) ML algorithm library that adheres to the Apache Spark MLlib interface, and preprocessing for the data frame format used by Python, R, Java and Scala. The matrix library provides basic matrix operations and linear algebra such as matrix multiply, sparse matrix-vector multiply, solve and transpose.
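Because sparse matrix-vector multiply is one of the named building blocks, the following generic C++ sketch of that operation in compressed sparse row (CSR) form shows the kind of regular, streamable inner loop that suits a vector engine. It is a textbook formulation, not Frovedis source code, and the small example matrix is ours.

    #include <cstdio>
    #include <vector>

    // Generic sparse matrix-vector multiply (y = A * x) in compressed
    // sparse row (CSR) form; a textbook kernel, not Frovedis code.
    struct CsrMatrix {
        int rows;
        std::vector<int>    row_ptr;  // size rows + 1
        std::vector<int>    col_idx;  // column of each nonzero
        std::vector<double> values;   // nonzero values
    };

    std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
        std::vector<double> y(A.rows, 0.0);
        for (int i = 0; i < A.rows; ++i) {
            double sum = 0.0;
            // The inner loop streams values and col_idx contiguously: the
            // regular access pattern a vector processor can fetch one per cycle.
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                sum += A.values[k] * x[A.col_idx[k]];
            y[i] = sum;
        }
        return y;
    }

    int main() {
        // 2x2 example: [[2, 0], [1, 3]] * [1, 2] = [2, 7]
        CsrMatrix A{2, {0, 1, 3}, {0, 0, 1}, {2.0, 1.0, 3.0}};
        for (double v : spmv(A, {1.0, 2.0})) std::printf("%g ", v);
        return 0;
    }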

The ML library currently consists of 18 algorithms, with more to come, and the data frame library supports basic data analytics operations to accelerate end-to-end data analytics processes.

As both sparse and dense data are key elements of ML workloads, the libraries support both. Frovedis can also be used from Python by integrating code with the middleware.

The SX-Aurora TSUBASA is also a good example of in-memory computing, in which data is stored in RAM tightly coupled to each processor. By storing data in RAM and processing it in parallel, CPU-to-memory transfers can be made up to 1,000 times faster than when memory is farther away and "attached" through a standard memory interface.

In-memory computing is just now being adopted in the financial sector and in other applications in which latency must be reduced to the vanishing point, such as real-time ad platforms, online games, geospatial/GIS processing, medical imaging, natural language processing and, of course, ML. With Aurora's HBM2 memory bandwidth of 1.2 TBytes/s per processor and 48 GBytes of memory capacity, faster execution times can be achieved.

A single vector processor in the SX-Aurora TSUBASA system running at 1.6 GHz can execute 192 floating-point operations per core per cycle: each core has three fused multiply-add (FMA) units, for a double-precision peak performance of 307.2 GFLOPS per core. The complete eight-core processor has a peak performance of up to 2.45 TFLOPS while consuming about 300 W of power. Although the platform is currently configured with Xeon devices, the SX-Aurora vector processor is host-agnostic, so it is compatible with any server host processor architecture, such as the ARM Cortex family. Key specifications are shown in Table 1.
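These figures are internally consistent, as a quick check of the arithmetic shows:

    192 FLOP/cycle × 1.6 GHz = 307.2 GFLOPS per core (double precision)
    8 cores × 307.2 GFLOPS = 2,457.6 GFLOPS ≈ 2.45 TFLOPS per processor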

Metric                                                  Performance

Vector computing system
  Vector engines                                        Up to 8
  Form factor                                           Tower, 1U rack mount, 4U rack mount
  Performance (TFLOPS, single/double precision)         34.4 / 17.2
  Memory bandwidth (TB/s)                               Up to 9.83
  Memory capacity (GB)                                  Up to 384
  Host processors                                       Xeon Gold 6100 (1 or 2)
  Maximum memory per host (GB)                          192
  Operating system                                      Red Hat Enterprise Linux 7.5

Vector engine
  Clock speed (GHz)                                     1.4
  Peak performance per core (GFLOPS, single/double)     537.6 / 268.8
  Average memory bandwidth per core (GB/s)              153
  Number of processor cores                             8

Table 1: Key specifications of the SX-Aurora TSUBASA system.

Accessing the SX-Aurora TSUBASA AI Platform

NEC's SX-Aurora TSUBASA AI Platform originated in the company's global research centers. Until recently, it was not available in North America; now it is offered as the flagship product of NEC X in the Americas.

What This All Means

AI and ML, driving the future of computing, will continue to permeate more applications and services, and many will be implemented in smaller and smaller footprints. Needless to say, this requires revisiting the entire spectrum of technologies for AI. Vector processing is one that proved itself long ago and is finding new demand, meeting the challenges of AI and ML both today and in the future.

NEC's SX-Aurora TSUBASA PCIe card represents not just vector computing but also tight integration with scalar processing, optimized memory utilization and optimized data transport, both internal and external. As such, it has the potential to deliver performance typically found in supercomputers and high-performance computers in a much smaller space than before. Equally important is the system's ability to scale from a single board-level computing platform to a much larger system by adding more self-contained vector-processing subsystems as PCIe cards. And this is just the beginning, because NEC plans to further enhance both the Aurora hardware and the Frovedis software.

References

1. "History of supercomputing," Wikipedia, October 27, 2019. https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=History_of_supercomputing&id=923242193

2. "A David Challenging Computer Goliaths," John Markoff, New York Times, May 1, 1990. https://www.nytimes.com/1990/05/01/business/a-david-challenging-computer-goliaths.html

3. "Vector Microprocessors," Krste Asanovic, Ph.D. dissertation, Graduate Division, University of California, Berkeley, 1998.

4. "NEC Earth Simulator System," The History of Computing Project, 2002. https://www.thocp.net/hardware/nec_ess.htm

5. "MMX Technology Overview," Intel Corp., March 1996. https://www.ee.ryerson.ca/~courses/ele818/mmx.pdf

About NEC X

NEC X, Inc. accelerates the development of innovative products and services built on the strengths of NEC Laboratories' technologies. The organization was launched by NEC Corp. in 2018 to fast-track technologies and business ideas selected from inside and outside NEC. For companies launched through its Corporate Accelerator Program, NEC X supports business development activities to help achieve revenue growth. NEC X provides options for entrepreneurs, startups and existing companies in the Americas to use NEC's emerging technologies. The company is centrally located in Silicon Valley for access to its entrepreneurial ecosystem and strong high-technology market. Learn more at http://nec-x.com or by emailing [email protected].

© 2019 NEC Corporation of America. NEC is a registered trademark of NEC Corporation. All Rights Reserved. Other product or service marks mentioned are the trademarks of their respective owners.