Vector Processing Rises to the Challenges of AI and Machine Learning

In the complex world of computing, a major new technological development in a particular area tends to gain acceptance and then reign supreme, sometimes for years or decades, until something more advanced comes along. It's rare indeed for any technology to fade away only to reappear, stronger, and make another major contribution.

But that's precisely what is happening with vector processing, which once ruled the highest echelons of performance and is now solving some of the most critical problems facing the industry. And rather than remaining in the rarefied air of supercomputers, it's moving into the mainstream computer hierarchy, all the way down to board level. To understand why this is happening now and how it is being achieved, a brief review of processor history is in order.

The Evolution of Vector and Scalar Processing

For more than three decades, vector-processor-based supercomputers were the most powerful in the world, enabling research in nuclear weapons and cryptography, weather forecasting, oil exploration, industrial design and bioinformatics. Led in the U.S. by Seymour Cray and by researchers at Fujitsu, Hitachi and NEC in Japan (1), the performance of these machines was simply untouchable by any other processor architecture. But by the 1990s, the immense market potential of making computing accessible to more people (especially those without very deep pockets) made billions of dollars in research funding available in the commercial market, nearly all of which was spent on scalar rather than vector processors.

The result has been a continuous stream of scalar processors, like Intel's x86 family, that could sometimes exceed the performance of vector supercomputers on some tasks at a much lower cost, while being more flexible in their capabilities than vector processors. They have continued to increase performance, more or less adhering to Gordon Moore's "law" of doubling performance every 18 months, while offloading specialized tasks to supplementary coprocessors. When they evolved to enable massively parallel processing, the supercomputing industry, or most of it, abandoned vector processing, seemingly for good.

In fact, Eugene D. Brooks, a scientist at Lawrence Livermore National Laboratory, predicted in a 1990 New York Times article (2) headlined "A David Challenging Computer Goliaths" that vector-processor-based computers would soon be "dead," at which time their scalar counterparts would exceed the abilities of Cray's vector-processor-based supercomputers at a fraction of the cost.

Less noticed was work by Krste Asanovic, who in 1998, as a doctoral candidate at the University of California, Berkeley, posited in his thesis (3) that the success of scalar (or, by this time, superscalar) microprocessors was mostly the result of their ability to be fabricated using standard CMOS technology. He argued that if the same process could be used to create vector microprocessors, they could be the fastest, cheapest and most energy-efficient processors for future applications.

His argument was that, in the future, compute-intensive tasks would require data parallelism, and that the vector architecture was the best way to achieve it. In the following 250 pages of his thesis, he described how and why vector microprocessors could be applied to more tasks than scientific and engineering supercomputing, eventually handling multimedia and "human-machine interface" processing. So, who was the most prophetic?

As it turned out, both. Asanovic's vision proved to be correct, although it took two decades to be realized. Asanovic went on to become a professor in the Computer Science Division of the Electrical Engineering and Computer Sciences Department at Berkeley, chairing the board of the RISC-V Foundation and co-founding SiFive, a fabless semiconductor company that produces computer chips based on established reduced instruction set computer (RISC) principles.

Brooks' predictions have also proven to be correct, with one very significant exception: Vector processors never vanished, because they could still perform some functions far faster, at less cost, and with less hardware than scalar processors. As vector and scalar technologies advanced over the years, scalar processors became ubiquitous, while vector processors were used by fewer and fewer companies until only one company, NEC in Japan, continued to develop computers around them.

So, far from being "dead," processing using vector instructions continued to be used in supercomputers. In 2002, for example, NEC's vector system called the Earth Simulator became operational at the Earth Simulator Center in Japan (4). It was the world's fastest at the time, with 5,120 central processing units (CPUs) in 640 nodes, each consisting of eight vector processors operating at 8 GFLOPS (one GFLOPS is one billion floating-point operations per second), for an overall system performance of 40 TFLOPS (40 trillion floating-point operations per second). The Earth Simulator created a "virtual planet earth" by processing huge amounts of data from satellites, buoys and other sensors. The current version uses the NEC SX-ACE supercomputer, which delivers 1.3 PFLOPS (one PFLOPS is one quadrillion floating-point operations per second).

Vector-Supported Scalar Processing

Interestingly enough, vector processing has long served as an adjunct to scalar processing in the form of accelerators. For example, when Intel added the MMX instruction set to its x86 Pentium processors in the late 1990s (5), it was the beginning of a continuous succession of vector-based coprocessing that has run through the Streaming SIMD Extensions (SSE) in the Pentium III to the current vector instruction standard, AVX-512. Today, nearly all scalar processors rely on vector instructions for specific tasks, reserving the scalar units for traditional host processing functions.
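This progression from MMX to AVX-512 is easy to see in code. The short C++ sketch below multiplies two float arrays, first with a plain scalar loop and then with SSE intrinsics of the kind introduced with the Pentium III. It is a minimal illustration of vector instructions serving as an adjunct to scalar processing, not NEC or Intel reference code, and the function names are ours.

    #include <xmmintrin.h>  // SSE intrinsics, available since the Pentium III

    // Scalar version: one multiply per instruction, n instructions' worth of work.
    void multiply_scalar(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

    // SSE version: each _mm_mul_ps instruction multiplies four floats at once.
    void multiply_sse(const float* a, const float* b, float* c, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);           // load four floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));  // four products, one instruction
        }
        for (; i < n; ++i)                             // scalar tail
            c[i] = a[i] * b[i];
    }

With AVX-512, the same pattern widens to 16 single-precision elements per instruction; the scalar core still orchestrates the loop while the vector unit does the arithmetic.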

Other types of accelerators, including graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs) and other more esoteric devices, also perform specialized tasks and are even transitioning to become host processors themselves. GPUs, for example, which originated as the computational power behind graphics engines, are now often used as "general-purpose" GPUs (GPGPUs), and FPGAs are making a similar transition. Now the venerable accelerator technology called vector processing, which once powered only the world's most powerful computers, is being applied in platforms lower in the computer hierarchy.

Future Challenges for Scalar Processors

Current supercomputers and high-performance computing systems based on superscalar, massively parallel CPUs are reaching an important stage in their history. Simply stated, the huge amount of data inherent in artificial intelligence (AI) and machine learning (ML) applications is pushing the envelope of what they can achieve. These processors require more and more overhead to handle the massive parallelism of the computers they serve, which creates a gap between what they can theoretically achieve and what is experienced in practice. The result is that it is becoming difficult to fully exploit the potential of currently deployed systems in which only superscalar processors are employed.

The performance of scalar processor cores is also not increasing at the rate it once did, so improvements are obtained by increasing the number of cores on a homogeneous silicon substrate. However, it is difficult to extract more parallelism from cores with differing performance characteristics. In addition, the cost of fabricating the transistors themselves is rising, because fabrication processes must move to ever-smaller process nodes to fit more transistors in a reasonably sized piece of silicon.

The SX-Aurora TSUBASA Rises to the Challenge

The result of this work is the SX-Aurora TSUBASA, a complete vector computing system implemented on a PCIe card with one of the world's highest performance levels per core (Figure 1). It is the first time a vector processor has been made the primary compute engine rather than an accelerator. All functions are implemented on a single chip, including the cores with integrated cache as well as the memory, network and I/O controllers. This reduces the space required for all these functions to a fraction of what its predecessors needed, while also reducing power consumption. Implementing each processor on a card cuts the space required to a fifth of anything the company has produced before.

Figure 1: NEC SX-Aurora TSUBASA - Vector Engine.

The New Computers with Vector Processors

Ideally, a vector computer would combine the benefits of extremely high-performance, general-purpose scalar processors with the unique capabilities of vector processors to produce results that neither one alone could achieve, effectively becoming a vector-parallel computer. This is precisely what NEC's latest achievement is designed to do.

NEC has been designing and fabricating vector processors, and nearly all the other major components of its supercomputers, for more than three decades, accumulating deep knowledge of this technology and how it is applied. So, about five years ago, NEC determined that it was possible to bring some of the performance of a supercomputer to a much smaller and less expensive form factor, where it would be very well suited to the new applications on the horizon, primarily AI and its subsets, including ML.

The challenges were significant, because these applications do not have the massive power sources available to supercomputers and data centers. Vector processing has never been particularly frugal with power, so the new constraints had to be addressed. It was also necessary to ensure that the inherently wide memory bandwidth achievable with vector processors could be retained in such small form factors.

Vector instructions have significant benefits because they perform multiple math operations at once, where scalar instructions perform only one. If the job is to multiply four numbers by four other numbers, vectors can do it with a single command, in a process called single instruction, multiple data (SIMD), in which the command applies the same single instruction to many different pieces of data.

The computer thus needs to fetch and decode far fewer instructions, reducing the control-unit overhead and the memory bandwidth required to execute operations. In addition, the instructions provide the processor cores with continuous streams of data: when a vector instruction is initiated, the system knows it needs to fetch sets of operands arranged in regular patterns in memory.

This allows the processor cores to ask the memory to start sending those operand sets, which then arrive at a rate of one per cycle and can be routed directly to a processor. Without an interleaved memory, or some other way of supplying operands at such a rate, the advantages of processing an entire vector with a single instruction would be greatly reduced; in modern vector processors, however, this is not the case.
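The same idea can be expressed at the language level. In the C++ sketch below, std::valarray applies one whole-array operation to unit-stride operands, mirroring how a single vector instruction covers an entire operand set; the four-element values are ours, chosen to match the multiply-four-numbers example above.

    #include <cstdio>
    #include <valarray>

    int main() {
        // Four multiplications expressed as one whole-array operation,
        // the software analogue of a single SIMD instruction.
        std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
        std::valarray<double> b = {4.0, 4.0, 4.0, 4.0};
        std::valarray<double> c = a * b;   // element-wise, unit-stride operands

        for (double x : c)
            std::printf("%g ", x);         // prints: 4 8 12 16
        return 0;
    }

Because the operands sit at regular, unit-stride addresses, a vector machine can stream them from memory exactly as the paragraph above describes.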

The vector processors themselves use a chip-on-wafer-on-substrate (CoWoS) architecture to connect six high-bandwidth memory (HBM2) modules that deliver 150 GBytes/s of memory bandwidth and 307 GFLOPS per core, about five times faster per core than the company's previous generation of vector computers. Each processor has eight cores, for 2.45 TFLOPS of performance and an aggregate memory bandwidth of 1.2 TBytes/s.

The architecture of the system combines an Intel Xeon Gold 6100-based host server, a "vector host" running Linux, and the vector subsystem in a hybrid configuration. The host server acts as the "front end" for the vector engine, performing functions not directly related to mathematical calculations, such as I/O. This is the familiar CPU-offload scenario, except in this case the core computational component is the vector engine rather than the superscalar Intel processor. And unlike GPU-based solutions, the system executes complete applications on the vector engine, which significantly reduces the communication overhead between the vector host and the vector engine.
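For readers who think in code, the C++ sketch below illustrates this division of labor. It is purely conceptual and uses assumed names: VectorEngine, read_input and run_complete_app are hypothetical stand-ins invented for illustration, not NEC's actual offloading API.

    #include <string>
    #include <vector>

    // Hypothetical handle for one vector-engine card (not a real NEC type).
    struct VectorEngine { int card_id; };

    // Host-side I/O: the Xeon "vector host" handles work that is not
    // numeric, such as reading input data.
    std::vector<float> read_input(const std::string& path) {
        return std::vector<float>(1024, 1.0f);  // stub: pretend we loaded a file
    }

    // The complete numeric application runs on the vector engine, so
    // host-to-engine traffic is limited to inputs and results.
    std::vector<float> run_complete_app(VectorEngine& ve, std::vector<float> data) {
        for (float& x : data) x *= 2.0f;        // stub for the real vectorized workload
        return data;
    }

    int main() {
        std::vector<float> input = read_input("sensor_data.bin");
        VectorEngine ve{0};                     // attach to card 0 (hypothetical)
        std::vector<float> result = run_complete_app(ve, input);
        return result.empty();                  // 0 on success
    }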

To ensure that the SX-Aurora TSUBASA could accommodate multiple configurations, from workstations to server towers and larger systems, the product line uses the PCIe form factor as its core platform. From there it can be configured into servers with 1, 2, 4 or 8 vector engines, and into racks of up to 64 vector engines that deliver an aggregate performance of 156 TFLOPS and a maximum memory bandwidth of 76.8 TBytes/s. A nearly unlimited number of these systems can be combined, creating what amounts to a "mini data center" level system. A top-end system can be configured with 32, 48 or 64 vector engines in up to eight enclosures.

Although current users of NEC's products know how to optimize their performance using the company's tools, the new users NEC hopes to cultivate will not, so the company has spent more than six years developing middleware called Frovedis to shorten the learning curve toward optimal system performance. Measurements made by NEC show that the Frovedis machine learning library running on the SX-Aurora TSUBASA achieves 42 to 113 times the performance of Spark MLlib running on an x86 platform, and a 10- to 47-fold improvement on data frame operations.

Frovedis is a set of C++ programs comprising a math matrix library, an additional (and expanding) ML algorithm library that adheres to the Apache Spark MLlib interface, and preprocessing for the data frame format used by Python, R, Java and Scala. The matrix library provides basic matrix operations and linear algebra such as matrix multiply, sparse matrix-vector multiply, solve and transpose.
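Because sparse matrix-vector multiply is one of the named building blocks, the following generic C++ sketch of that operation in compressed sparse row (CSR) form shows the kind of regular, streamable inner loop that suits a vector engine. It is a textbook formulation, not Frovedis source code, and the small example matrix is ours.

    #include <cstdio>
    #include <vector>

    // Generic sparse matrix-vector multiply (y = A * x) in compressed
    // sparse row (CSR) form; a textbook kernel, not Frovedis code.
    struct CsrMatrix {
        int rows;
        std::vector<int>    row_ptr;  // size rows + 1
        std::vector<int>    col_idx;  // column of each nonzero
        std::vector<double> values;   // nonzero values
    };

    std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
        std::vector<double> y(A.rows, 0.0);
        for (int i = 0; i < A.rows; ++i) {
            double sum = 0.0;
            // The inner loop streams values and col_idx contiguously: the
            // regular access pattern a vector processor can fetch one per cycle.
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                sum += A.values[k] * x[A.col_idx[k]];
            y[i] = sum;
        }
        return y;
    }

    int main() {
        // 2x2 example: [[2, 0], [1, 3]] * [1, 2] = [2, 7]
        CsrMatrix A{2, {0, 1, 3}, {0, 0, 1}, {2.0, 1.0, 3.0}};
        for (double v : spmv(A, {1.0, 2.0})) std::printf("%g ", v);
        return 0;
    }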

The ML library currently consists of 18 algorithms, with more to come, and the data frame library supports basic data analytics operations to accelerate end-to-end data analytics processes.

As both sparse and dense data are key elements of ML workloads, the libraries support both. Frovedis can also be used from Python by integrating code with the middleware.

The SX-Aurora TSUBASA is also a good example of in-memory computing, in which data is stored in RAM tightly coupled to each processor. By storing data in RAM and processing it in parallel, CPU-to-memory transfers can be made up to 1,000 times faster than when memory is farther away and "attached" through a standard memory interface.

In-memory computing is just now being adopted in the financial sector and in other applications in which latency must be reduced to the vanishing point, such as real-time ad platforms, online games, geospatial/GIS processing, medical imaging, natural language processing and, of course, ML. With Aurora's HBM2 memory bandwidth of 1.2 TBytes/s per processor and 48 GBytes of memory capacity, faster execution times can be achieved.

A single vector processor in the SX-Aurora TSUBASA system running at 1.6 GHz can execute 192 floating-point operations per core per cycle: each core has three fused multiply-add (FMA) units, for a double-precision peak performance of 307.2 GFLOPS per core. The complete eight-core processor has a peak performance of up to 2.45 TFLOPS while consuming about 300 W of power. Although the platform is currently configured with Xeon devices, the SX-Aurora vector processor is host-agnostic, so it is compatible with any server host processor architecture, such as the ARM Cortex family. Key specifications are shown in Table 1.
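These figures are internally consistent, as a quick check of the arithmetic shows:

    192 FLOP/cycle × 1.6 GHz = 307.2 GFLOPS per core (double precision)
    8 cores × 307.2 GFLOPS = 2,457.6 GFLOPS ≈ 2.45 TFLOPS per processor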

Metric                                                  Performance

Vector computing system
  Vector engines                                        Up to 8
  Form factor                                           Tower, 1U rack mount, 4U rack mount
  Performance (TFLOPS, single/double precision)         34.4 / 17.2
  Memory bandwidth (TB/s)                               Up to 9.83
  Memory capacity (GB)                                  Up to 384
  Host processors                                       Xeon Gold 6100 (1 or 2)
  Maximum memory per host (GB)                          192
  Operating system                                      Red Hat Enterprise Linux 7.5

Vector engine
  Clock speed (GHz)                                     1.4
  Peak performance per core (GFLOPS, single/double)     537.6 / 268.8
  Average memory bandwidth per core (GB/s)              153
  Number of processor cores                             8

Table 1: Key specifications of the SX-Aurora TSUBASA system.

Accessing the SX-Aurora TSUBASA AI Platform

NEC's SX-Aurora TSUBASA AI Platform originated in the company's global research centers. Until recently, it was not available in North America; now it is offered as the flagship product of NEC X in the Americas.

What This All Means

AI and ML, driving the future of computing, will continue to permeate more applications and services, and many will be implemented in smaller and smaller footprints. Needless to say, this requires revisiting the entire spectrum of technologies for AI. Vector processing is one that proved itself long ago and is finding new demand, meeting the challenges of AI and ML both today and in the future.

NEC's SX-Aurora TSUBASA PCIe card represents not just vector computing but also tight integration with scalar processing, optimized memory utilization and optimized data transport, both internal and external. As such, it has the potential to deliver performance typically found in supercomputers and high-performance computers in a much smaller space than before. Equally important is the system's ability to scale from a single board-level computing platform to a much larger system by adding more self-contained vector-processing subsystems as PCIe cards. And this is just the beginning, because NEC plans to further enhance both the Aurora hardware and the Frovedis software.

References

1. "History of supercomputing," Wikipedia, October 27, 2019. https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=History_of_supercomputing&id=923242193

2. "A David Challenging Computer Goliaths," John Markoff, New York Times, May 1, 1990. https://www.nytimes.com/1990/05/01/business/a-david-challenging-computer-goliaths.html

3. "Vector Microprocessors," Krste Asanovic, Ph.D. dissertation, Graduate Division, University of California, Berkeley, 1998.

4. "NEC Earth Simulator System," The History of Computing Project, 2002. https://www.thocp.net/hardware/nec_ess.htm

5. "MMX Technology Overview," Intel Corp., March 1996. https://www.ee.ryerson.ca/~courses/ele818/mmx.pdf

About NEC X

NEC X, Inc. accelerates the development of innovative products and services built on the strengths of NEC Laboratories' technologies. The organization was launched by NEC Corp. in 2018 to fast-track technologies and business ideas selected from inside and outside NEC. For companies launched through its Corporate Accelerator Program, NEC X supports business development activities to help achieve revenue growth. NEC X provides options for entrepreneurs, startups and existing companies in the Americas to use NEC's emerging technologies. The company is centrally located in Silicon Valley for access to its entrepreneurial ecosystem and strong high-technology market. Learn more at http://nec-x.com or by emailing [email protected].

© 2019 NEC Corporation of America. NEC is a registered trademark of NEC Corporation. All Rights Reserved. Other product or service marks mentioned are the trademarks of their respective owners.