The Role of Middleware in Optimizing Vector Processing

Data generated globally is growing at a massive rate that will increase exponentially as the predicted billions of IoT devices add their weight to the stack, reaching what analysts at IDC believe will be 79 trillion GBytes in 2025(1). This presents enormous challenges for even the most formidable computers and their primary constituent components — central processing units (CPUs), hardware accelerators, storage, and internal and external communications. It will also require new processing architectures for managing and analyzing this data tsunami.

This white paper will delve into this world of unstructured data and describe some of the technologies, especially vector processors and their optimization software, that will play key roles in solving these problems in the future.

The Unstructured World of Big Data

There is no defined amount of data that qualifies as “big.” To a small company, the amount of data it generates may seem enormous, but to an e-commerce site, financial institution or streaming service, the same data volume would be considerably less in comparison with the data storm they create. However, these applications do have one very important common challenge: handling unstructured data.

Unstructured data is information that is not organized in any particular way. It can include text such as dates, numbers, facts and many other elements, and the ambiguities among them make unstructured data difficult to use in databases. It also requires proficient hardware and software to transition the unstructured data from a “raw” state to information that can quickly and securely be democratized and made useful for functions like analysis, prediction, creation of personal preference profiles, and dozens more.

Unstructured, raw data is basically the opposite of what is stored in a compliant relational database, where information is compartmentalized so that records can be accessed easily. Analyzing unstructured data must be approached differently and doesn’t lend itself well to older models of processing, storage or analysis.

The combination of exponentially increasing data and the difficulty in working with it is exceeding the capabilities of data center compute platforms, especially for applications in which analysis must be performed in near real time. Servers and computers generally address these challenges by increasing CPU clock rates, adding more cores to the CPU and increasing cluster sizes. Unfortunately, this is no longer enough, even as silicon vendors resort to exotic fabrication techniques to pack more performance into a given device footprint. Clearly, new approaches are required to ensure that the ability to process, store and analyze data remains robust. Vector processing will be one of the primary solutions to these problems.

It’s almost certain that the challenges posed by this avalanche of data will be effectively addressed in the coming years: Failure isn’t an option when there is so much to be gained from the insights that data can provide. For proof, look no further than the recommendation engines used by every major e-commerce site, search engine, streaming service and financial institution.

Without artificial intelligence (AI) and machine learning (ML), a Google search would produce mediocre results; financial transactions would take too long; and streaming services like Netflix couldn’t determine what type of entertainment each viewer likes and refine their offerings to users’ personal preferences over time. In short, “big data” equals big rewards, but harnessing it is not simple, because AI and ML are technological and scientific domains that only data scientists can truly understand. These technologies push the envelope of what conventional computing architectures can achieve.

Vectors Victorious

Vector processing, the de facto technology powering supercomputers since Seymour Cray first used it in the Cray-1 in 1976, has unique advantages over scalar processors when operating on certain types of instructions(2). In fact, a vector processor can be more than 100 times faster than a scalar processor, especially when operating on the large amounts of data typical of an ML application. Even greater performance increases can be achieved by combining vector and scalar processors in a single device, as will be described later.

Both scalar and vector processors use a technique called “pipelining.” It allows the address decoder in the device to be in constant use, so that even before the CPU has finished executing one instruction, it can begin decoding the next one. This allows the CPU to process an entire batch of instructions faster than if it had to process them one at a time.

A vector processor pipelines not just the instructions but the data as well, which reduces the number of “fetch-then-decode” steps and in turn reduces decoding time. To illustrate this, consider the simple operation below, in which two groups of 10 numbers are added together. Using a typical programming language, this is performed by writing a loop that sequentially takes each pair of numbers and adds them together (Figure 1a).

The same task performed by a vector processor requires only two translations, and fetching and decoding is performed only once, rather than 10 times (Figure 1b). As the code is smaller, memory is used more effectively as well. Modern vector processors also allow different types of operations to be performed simultaneously, further increasing efficiency.

a. Scalar Processor

• Execute this loop 10 times
  - Read the next instruction and decode it
  - Fetch this number
  - Fetch that number
  - Add them
  - Put the result there
• End loop

b. Vector Processor

• Read instruction and decode it
• Fetch these 10 numbers
• Fetch those 10 numbers
• Add them
• Put the result there

Figure 1: Executing the task defined above, the scalar processor (a) must perform more steps than the vector processor (b).
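To make the contrast concrete, below is a minimal sketch in Python, using a plain loop for the scalar style of Figure 1a and NumPy’s whole-array arithmetic as a software stand-in for the single vector instruction of Figure 1b. NumPy is used here purely for illustration; it is not part of the NEC toolchain discussed in this paper.

    import numpy as np

    a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

    # Figure 1a: scalar style -- fetch and add one pair per loop iteration.
    result_scalar = []
    for i in range(10):
        result_scalar.append(a[i] + b[i])

    # Figure 1b: vector style -- one operation applied to all 10 numbers at once.
    result_vector = np.array(a) + np.array(b)

    assert result_scalar == list(result_vector)  # both produce ten 11s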

Moving into the Mainstream

CPU vendors have used vector processing as a hardware accelerator for more than two decades, first externally as a “math coprocessor,” and later integrated into the CPU itself. One of the few exceptions is NEC, which has been developing and fabricating vector computing engines for more than 30 years as the driving force in its supercomputers.

These machines continue to demonstrate that while scalar processors currently rule throughout computing in general, vector processors are essential for tackling the challenges ahead. And when scalar processors are used as the “secondary” processing element, with the vector processor taking the lead, the resulting heterogeneous processing environment can deliver truly exceptional performance.

To bring its vector processing capabilities into applications less esoteric than scientific computing, NEC has combined vector processors with scalar CPUs to produce a “vector-parallel” computer called the SX-Aurora TSUBASA. It’s a complete vector computing system on a PCIe card that has one of the world’s highest average peak performance levels per core. This marks the first time a vector processor has been made the primary compute engine rather than serving secondarily as an accelerator.

A single vector processor in the SX-Aurora TSUBASA system, running at 1.6 GHz, can execute 192 floating-point operations per cycle. The system comprises a scalar host processor, or vector host (VH), running Linux, and one or more vector processor accelerator cards (vector engines), creating a heterogeneous compute server ideal for large AI and ML workloads and data analytics applications. The primary computational components are the vector engines, with self-contained memory subsystems, rather than the host processor, which acts as the “front end” to the vector engines, performing functions other than mathematical calculations.

A major reason vector supercomputers are incredibly fast is the extensive bandwidth enabled through banked memory, which allows several memory requests to be processed simultaneously. Because this architecture doesn’t use cache techniques as scalar processors do, vector processors can outperform cache-reliant architectures in applications that make “random” memory accesses.

The SX-Aurora TSUBASA has 48 GBytes of internal memory, so it can ingest a full program and execute it almost exclusively on the vector engine. The effect is to avoid bottlenecks, dramatically reducing the time spent transferring data between processor and memory. This differs from GPU-based solutions, for example, which require extensive communication between the vector host and the vector engine to perform the same functions.

All system functions are performed within a single chip designed and fabricated by NEC, including eight cores per processor with up to 2.45 TFLOPS of performance and an aggregate memory bandwidth of 1.2 TBytes/s. The device also integrates cache and a memory subsystem, minimizing data roundtrips to and from the external host processor and reducing power consumption. The vector processors connect six high-bandwidth memory (HBM2) modules with a maximum data rate of 150 GBytes/s per core and a total of 307 GFLOPS, about five times faster per core than the company’s previous vector processor.

The SX-Aurora TSUBASA can scale from workstations to racks of servers and larger systems based on the vector engine PCIe cards, from which it is possible to create servers with one, two, four or eight vector engines, and racks of up to 64 vector engines that can deliver aggregate performance of 156 TFLOPS and a maximum memory bandwidth of 76.8 TBytes/s. A nearly unlimited number of these systems can be combined, creating what amounts to a “mini data center.”

Optimization with Frovedis

Any vector-based computer can perform only as well as the software employed to optimize it. For the SX-Aurora TSUBASA, this is achieved using open-source middleware called Frovedis (FRamework for VEctorized and DIStributed data analytics), the result of five years of development by NEC. Frovedis was crafted from the start to help designers pick up the vector processor’s tools and begin implementing and deploying applications quickly.

To that end, Frovedis is based on industry-standard programming languages and frameworks such as C++ for vector optimization, and Apache Spark and Scikit-Learn for database management and data analytics. This allows for seamless migration of legacy code to the vector processor accelerator card.

Frovedis also uses the message passing interface (MPI) to implement distributed processing efficiently, in a way that is transparent to the user. The developers of Frovedis have made it possible to launch it from within Spark and Python environments, in the same format as each environment’s own ML library.

In essence, Frovedis is a set of C++ programs including a math matrix library, an ML algorithm library and preprocessing for the data frame. Most of the matrix libraries that come with Frovedis support both sparse and dense data. The matrix library provides basic matrix operations and linear algebra such as matrix multiply, sparse matrix-vector multiply, solve and transpose.

The ML library, implemented with Frovedis middleware and its matrix library, currently consists of 18 algorithms including linear and logistic regression, ALS, SVM, K-means and Word2vec, with more in development. In addition, graph algorithms exploit the benefits of vector processing for the rapid analysis of large, complex graphs. The data frame library supports basic data analytics operations (select, filter, sort, join, group by and aggregate) to accelerate end-to-end data analytics processes.
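As an illustration of the ML library and the “seamless migration” path described above, here is a minimal sketch based on Frovedis’s published scikit-learn-compatible Python interface. The module paths, the FROVEDIS_SERVER environment variable and the MPI launch string are assumptions drawn from the Frovedis documentation and may differ by version and installation.

    import os
    import numpy as np

    # A legacy script would import from sklearn.linear_model instead;
    # the Frovedis drop-in keeps the same fit/predict interface.
    from frovedis.exrpc.server import FrovedisServer
    from frovedis.mllib.linear_model import LogisticRegression

    # Launch the Frovedis server on the vector engine through MPI
    # (assumed: FROVEDIS_SERVER points at the frovedis_server binary).
    FrovedisServer.initialize("mpirun -np 4 " + os.environ["FROVEDIS_SERVER"])

    X = np.array([[0.0, 0.1], [0.2, 0.9], [0.8, 0.2], [0.9, 0.8]])
    y = np.array([0, 1, 0, 1])

    clf = LogisticRegression().fit(X, y)  # trains on the vector engine
    print(clf.predict(X))

    FrovedisServer.shut_down()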

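On the data frame side, the operations just listed are the familiar Spark ones. The PySpark sketch below spells them out on a small table invented for illustration; Frovedis’s data frame library accelerates the same kinds of operations on the vector engine, though its own calls are not shown here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-ops").getOrCreate()

    # Hypothetical tables, invented for illustration.
    sales = spark.createDataFrame(
        [(1, "east", 100.0), (2, "west", 250.0), (3, "east", 75.0)],
        ["id", "region", "amount"])
    hubs = spark.createDataFrame(
        [("east", "Boston"), ("west", "Denver")], ["region", "hub"])

    result = (sales
              .select("region", "amount")           # select
              .filter(F.col("amount") > 80.0)       # filter
              .join(hubs, on="region")              # join
              .groupBy("region")                    # group by
              .agg(F.sum("amount").alias("total"))  # aggregate
              .orderBy("total", ascending=False))   # sort

    result.show()
    spark.stop()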
ML inherently uses both sparse and dense matrix data, the former consisting largely of null and zero values that inhibit efficient use of compute cycles. Frovedis accelerates the processing of sparse-matrix data in ML algorithms and matrix operations by processing only the non-null and non-zero values in the sparse matrix, thus increasing efficiency and speed.
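A compressed sparse row (CSR) matrix illustrates the principle. The SciPy sketch below stores and multiplies only the non-zero entries, so the work scales with the number of non-zeros rather than with the full matrix size; SciPy stands in here for Frovedis’s own vectorized sparse routines.

    import numpy as np
    from scipy.sparse import csr_matrix

    # A mostly-zero matrix: operating on the dense form wastes cycles on zeros.
    dense = np.array([[0, 0, 3, 0],
                      [22, 0, 0, 0],
                      [0, 7, 0, 0]])

    sparse = csr_matrix(dense)
    print(sparse.nnz)    # 3 -- only three values are stored
    print(sparse.data)   # [ 3 22  7]

    # Sparse matrix-vector multiply touches only the non-zero entries.
    v = np.array([1, 2, 3, 4])
    print(sparse.dot(v))  # [ 9 22 14]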

Proof of Performance

The performance advantage on large ML and data analytics jobs that can be realized with Frovedis and the SX-Aurora TSUBASA accelerator card is significant compared with the same tasks run on a scalar processor. Measurements made by NEC show that the Frovedis ML library, run on the SX-Aurora TSUBASA accelerator card, can deliver 42 to 113 times the performance of Spark MLlib operations running on an x86 processor, and 10 to 47 times the performance on data frame operations.

To view performance in the context of ML, NEC compared Spark executed on a cluster of multiple connected servers with Frovedis executed by the SX-Aurora TSUBASA. Execution of the latter was more than 50 times faster than the former.

Figure 2a shows the ML performance with sparse data from running logistic regression (LR), K-means and singular value decomposition (SVD) functions on two computers. The first, shown in blue, uses an Intel Xeon Gold 6126 scalar processor with Spark’s MLlib ML library. The second, shown in red, is a single SX-Aurora TSUBASA vector processor optimized using Frovedis.

[Figure 2a chart: speedup relative to Spark (= 1) of 113.2x for LR, 56.8x for K-means and 42.8x for SVD]

Figure 2a: A comparison between a Spark cluster (blue) and SX-Aurora TSUBASA optimized with Frovedis (red), each with 64 cores, shows the latter’s enormous performance advantage when executing logistic regression (LR), singular value decomposition (SVD) and K-means algorithms.

A second test (Figure 2b), which evaluated the performance of the same two computers when processing filter, sort, join and group-by functions with Spark DataFrame, showed equally impressive results.

[Figure 2b chart: speedup of Frovedis/Aurora (VE) relative to Spark (= 1) on filter, sort, join and group-by operations, ranging from 9.0x to 24.4x]

Figure 2b: The same processors when executing specific Spark DataFrame functions.

These tests verify that SX-Aurora TSUBASA, when optimized with the Frovedis middleware, delivers substantial performance increases when performing functions such as logistic regression that are inherent in processing large quantities of unstructured data within ML. NEC has conducted many other tests as well, covering a wide range of ML algorithms with similar results.

Summary

Big data brings big benefits but also big challenges for existing volume server compute architectures, because the data is less structured and difficult to process using traditional approaches. Efficiently revealing the hidden benefits and insights within the unstructured data requires new software tools, processors and memory configurations, without having to learn new coding languages. It just works. NEC’s SX-Aurora TSUBASA vector computer architecture and Frovedis middleware address these challenges while scaling from a single PCIe card to an entire server at far less cost and with higher performance than an approach solely using scalar processors. In short, vector processing with SX-Aurora TSUBASA will play a key role in changing the way big data is handled while stripping away the barriers to achieving even higher performance in the future.

Accessing the SX-Aurora TSUBASA AI Platform

NEC’s SX-Aurora TSUBASA AI Platform originated from the company’s global research centers. Until recently, it was not available in North America. Now, it is offered as the flagship product for NEC X in the Americas.

References

1. “The Growth in Connected IoT Devices Is Expected to Generate 79.4ZB of Data in 2025, According to a New IDC Forecast,” IDC, June 18, 2019. https://www.idc.com/getdoc.jsp?containerId=prUS45213219

2. “Seymour Cray,” Wikipedia. https://en.wikipedia.org/wiki/Seymour_Cray

About NEC X

NEC X, Inc. accelerates the development of innovative products and services through the strengths of NEC Laboratories’ technologies. The organization was launched by NEC Corp. in 2018 to fast-track technologies and business ideas selected from inside and outside NEC. For companies launched by its Corporate Accelerator Program, NEC X supports business development activities to help achieve revenue growth. NEC X provides options for entrepreneurs, startups and existing companies in the Americas to use NEC’s emerging technologies. The company is centrally located in Silicon Valley for access to its entrepreneurial ecosystem and strong high-technology market. Learn more at http://nec-x.com or by emailing [email protected].

© 2020 NEC Corporation of America. NEC is a registered trademark of NEC Corporation. All Rights Reserved. Other product or service marks mentioned are the trademarks of their respective owners.