The Role of Middleware in Optimizing Vector Processing

Data generated globally is growing at a massive rate that will increase exponentially as the predicted billions of IoT devices add their weight to the stack, reaching what analysts at IDC believe will be 79 trillion GBytes in 2025(1). This presents enormous challenges for even the most formidable computers and their primary constituent components — central processing units (CPUs), hardware accelerators, storage, and internal and external communications. It will also require new processing architectures for managing and analyzing this data tsunami.

This white paper will delve into this world of unstructured data and describe some of the technologies, especially vector processors and their optimization software, that will play key roles in solving these problems in the future.

The Unstructured World of Big Data

There is no defined amount of data that qualifies as “big.” To a small company, the amount of data it generates may seem enormous, but to an e-commerce site, financial institution or streaming service, the same data volume would be considerably less in comparison with the data storm they create. However, these applications do have one very important common challenge: handling unstructured data.

Unstructured data is information that is not organized in any particular way. It can include text such as dates, numbers, facts and many other elements, and the ambiguities among them make unstructured data difficult to use in databases. It also requires proficient hardware and software to transition the unstructured data from a “raw” state to information that can quickly and securely be democratized and made useful for functions like analysis, prediction, creation of personal preference profiles, and dozens more.

Unstructured, raw data is basically the opposite of what is stored in a compliant relational database, where information is compartmentalized so that records can be accessed easily. Analyzing unstructured data must be approached differently and doesn’t lend itself well to older models of processing, storage or analysis.

The combination of exponentially increasing data and the difficulty in working with it is exceeding the capabilities of data center compute platforms, especially for applications in which analysis must be performed in near real time. Servers and computers generally address these challenges by increasing CPU clock rates, adding more cores to the CPU and increasing cluster sizes. Unfortunately, this is no longer enough, even as silicon vendors resort to exotic fabrication techniques to pack more performance into a given device footprint. Clearly, new approaches are required to ensure that the ability to process, store and analyze data remains robust. Vector processing will be one of the primary solutions to these problems.

It’s almost certain that the challenges posed by this avalanche of data will be effectively addressed in the coming years: Failure isn’t an option when there is so much to be gained from the insights that data can provide. For proof, look no further than the recommendation engines used by every major e-commerce site, search engine, streaming service and financial institution.

Without artificial intelligence (AI) and machine learning (ML), a Google search would produce mediocre results; financial transactions would take too long; and streaming services like Netflix couldn’t determine what type of entertainment each viewer likes and refine their offerings to users’ personal preferences over time. In short, “big data” equals big rewards, but harnessing it is not simple, because AI and ML are technological and scientific domains that only data scientists can truly understand. These technologies push the envelope of what conventional computing architectures can achieve.

Vectors Victorious

Vector processing, the de facto technology powering supercomputers since Seymour Cray first used it in the Cray-1 in 1976, has unique advantages over scalar processors when operating on certain types of instructions(2). In fact, a vector processor can be more than 100 times faster than a scalar processor, especially when operating on the large amounts of data typical of an ML application. Even greater performance increases can be achieved by combining vector and scalar processors in a single device, as will be described later.

Both scalar and vector processors use a technique called “pipelining.” It allows the address decoder in the device to be in constant use, so that even before the CPU has finished executing one instruction, it can begin decoding the next one. This allows the CPU to process an entire batch of instructions faster than if it had to process them one at a time.

A vector processor pipelines not just the instructions but the data as well, which reduces the number of “fetch-then-decode” steps and in turn reduces decoding time. To illustrate this, consider the simple operation below, in which two groups of 10 numbers are added together. Using a typical programming language, this is performed by writing a loop that sequentially takes each pair of numbers and adds them together (Figure 1a).

The same task performed by a vector processor requires only two translations, and fetching and decoding is performed only once, rather than 10 times (Figure 1b). As the code is smaller, memory is used more effectively as well. Modern vector processors also allow different types of operations to be performed simultaneously, further increasing efficiency.

a. Scalar Processor

• Execute this loop 10 times
  - Read the next instruction and decode it
  - Fetch this number
  - Fetch that number
  - Add them
  - Put the result there
• End loop

b. Vector Processor

• Read instruction and decode it
• Fetch these 10 numbers
• Fetch those 10 numbers
• Add them
• Put the result there

Figure 1: Executing the task defined above, the scalar processor (a) must perform more steps than the vector processor (b).
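To make the contrast concrete, below is a minimal sketch in Python, using a plain loop for the scalar style of Figure 1a and NumPy’s whole-array arithmetic as a software stand-in for the single vector instruction of Figure 1b. NumPy is used here purely for illustration; it is not part of the NEC toolchain discussed in this paper.

    import numpy as np

    a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

    # Figure 1a: scalar style -- fetch and add one pair per loop iteration.
    result_scalar = []
    for i in range(10):
        result_scalar.append(a[i] + b[i])

    # Figure 1b: vector style -- one operation applied to all 10 numbers at once.
    result_vector = np.array(a) + np.array(b)

    assert result_scalar == list(result_vector)  # both produce ten 11s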

Moving into the Mainstream

CPU vendors have used vector processing as a hardware accelerator for more than two decades, first externally as a “math coprocessor,” and later integrated into the CPU itself. One of the few exceptions is NEC, which has been developing and fabricating vector computing engines for more than 30 years as the driving force in its supercomputers.

These machines continue to demonstrate that while scalar processors currently rule throughout computing in general, vector processors are essential for tackling the challenges ahead. And when scalar processors are used as the “secondary” processing element, with the vector processor taking the lead, the resulting heterogeneous processing environment can deliver truly exceptional performance.

To bring its vector processing capabilities into applications less esoteric than scientific computing, NEC has combined vector processors with scalar CPUs to produce a “vector-parallel” computer called the SX-Aurora TSUBASA. It’s a complete vector computing system on a PCIe card that has one of the world’s highest average peak performance levels per core. This marks the first time a vector processor has been made the primary compute engine rather than serving secondarily as an accelerator.

A single vector processor in the SX-Aurora TSUBASA system, running at 1.6 GHz, can execute 192 floating-point operations per cycle. The system comprises a scalar host processor, or vector host (VH), running Linux, and one or more vector processor accelerator cards (vector engines), creating a heterogeneous compute server ideal for large AI and ML workloads and data analytics applications. The primary computational components are the vector engines, with self-contained memory subsystems, rather than the host processor, which acts as the “front end” to the vector engines, performing functions other than mathematical calculations.

A major reason vector supercomputers are incredibly fast is the extensive bandwidth enabled through banked memory, which allows several memory requests to be processed simultaneously. Because this architecture doesn’t use cache techniques as scalar processors do, vector processors can outperform cache-reliant architectures in applications that make “random” memory accesses.

The SX-Aurora TSUBASA has 48 GBytes of internal memory, so it can ingest a full program and execute it almost exclusively on the vector engine. The effect is to avoid bottlenecks, dramatically reducing the time spent transferring data between processor and memory. This differs from GPU-based solutions, for example, which require extensive communication between the vector host and the vector engine to perform the same functions.

All system functions are performed within a single chip designed and fabricated by NEC, including eight cores per processor with up to 2.45 TFLOPS of performance and an aggregate memory bandwidth of 1.2 TBytes/s. The device also integrates cache and a memory subsystem, minimizing data roundtrips to and from the external host processor and reducing power consumption. The vector processors connect six high-bandwidth memory (HBM2) modules with a maximum data rate of 150 GBytes/s per core and a total of 307 GFLOPS, about five times faster per core than the company’s previous vector processor.

The SX-Aurora TSUBASA can scale from workstations to racks of servers and larger systems based on the vector engine PCIe cards, from which it is possible to create servers with one, two, four or eight vector engines, and racks of up to 64 vector engines that can deliver aggregate performance of 156 TFLOPS and a maximum memory bandwidth of 76.8 TBytes/s. A nearly unlimited number of these systems can be combined, creating what amounts to a “mini data center.”

Optimization with Frovedis

Any vector-based computer can perform only as well as the software employed to optimize it. For the SX-Aurora TSUBASA, this is achieved using open-source middleware called Frovedis (FRamework for VEctorized and DIStributed data analytics), the result of five years of development by NEC. Frovedis was crafted from the start to help designers pick up the vector processor’s tools and begin implementing and deploying applications quickly.

To that end, Frovedis is based on industry-standard programming languages and frameworks such as C++ for vector optimization, and Apache Spark and Scikit-Learn for database management and data analytics. This allows for seamless migration of legacy code to the vector processor accelerator card.

Frovedis also uses the message passing interface (MPI) to implement distributed processing efficiently, in a way that is transparent to the user. The developers of Frovedis have made it possible to launch it from within Spark and Python environments, in the same format as each environment’s own ML library.

In essence, Frovedis is a set of C++ programs including a math matrix library, an ML algorithm library and preprocessing for the data frame. Most of the matrix libraries that come with Frovedis support both sparse and dense data. The matrix library provides basic matrix operations and linear algebra such as matrix multiply, sparse matrix-vector multiply, solve and transpose.

The ML library, implemented with Frovedis middleware and its matrix library, currently consists of 18 algorithms including linear and logistic regression, ALS, SVM, K-means and Word2vec, with more in development. In addition, graph algorithms exploit the benefits of vector processing for the rapid analysis of large, complex graphs. The data frame library supports basic data analytics operations (select, filter, sort, join, group by and aggregate) to accelerate end-to-end data analytics processes.
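As an illustration of the ML library and the “seamless migration” path described above, here is a minimal sketch based on Frovedis’s published scikit-learn-compatible Python interface. The module paths, the FROVEDIS_SERVER environment variable and the MPI launch string are assumptions drawn from the Frovedis documentation and may differ by version and installation.

    import os
    import numpy as np

    # A legacy script would import from sklearn.linear_model instead;
    # the Frovedis drop-in keeps the same fit/predict interface.
    from frovedis.exrpc.server import FrovedisServer
    from frovedis.mllib.linear_model import LogisticRegression

    # Launch the Frovedis server on the vector engine through MPI
    # (assumed: FROVEDIS_SERVER points at the frovedis_server binary).
    FrovedisServer.initialize("mpirun -np 4 " + os.environ["FROVEDIS_SERVER"])

    X = np.array([[0.0, 0.1], [0.2, 0.9], [0.8, 0.2], [0.9, 0.8]])
    y = np.array([0, 1, 0, 1])

    clf = LogisticRegression().fit(X, y)  # trains on the vector engine
    print(clf.predict(X))

    FrovedisServer.shut_down()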

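On the data frame side, the operations just listed are the familiar Spark ones. The PySpark sketch below spells them out on a small table invented for illustration; Frovedis’s data frame library accelerates the same kinds of operations on the vector engine, though its own calls are not shown here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-ops").getOrCreate()

    # Hypothetical tables, invented for illustration.
    sales = spark.createDataFrame(
        [(1, "east", 100.0), (2, "west", 250.0), (3, "east", 75.0)],
        ["id", "region", "amount"])
    hubs = spark.createDataFrame(
        [("east", "Boston"), ("west", "Denver")], ["region", "hub"])

    result = (sales
              .select("region", "amount")           # select
              .filter(F.col("amount") > 80.0)       # filter
              .join(hubs, on="region")              # join
              .groupBy("region")                    # group by
              .agg(F.sum("amount").alias("total"))  # aggregate
              .orderBy("total", ascending=False))   # sort

    result.show()
    spark.stop()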
ML inherently uses both sparse and dense matrix data, the former consisting largely of null and zero values that inhibit efficient use of compute cycles. Frovedis accelerates the processing of sparse-matrix data in ML algorithms and matrix operations by processing only the non-null and non-zero values in the sparse matrix, thus increasing efficiency and speed.
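A compressed sparse row (CSR) matrix illustrates the principle. The SciPy sketch below stores and multiplies only the non-zero entries, so the work scales with the number of non-zeros rather than with the full matrix size; SciPy stands in here for Frovedis’s own vectorized sparse routines.

    import numpy as np
    from scipy.sparse import csr_matrix

    # A mostly-zero matrix: operating on the dense form wastes cycles on zeros.
    dense = np.array([[0, 0, 3, 0],
                      [22, 0, 0, 0],
                      [0, 7, 0, 0]])

    sparse = csr_matrix(dense)
    print(sparse.nnz)    # 3 -- only three values are stored
    print(sparse.data)   # [ 3 22  7]

    # Sparse matrix-vector multiply touches only the non-zero entries.
    v = np.array([1, 2, 3, 4])
    print(sparse.dot(v))  # [ 9 22 14]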

Proof of Performance

The performance advantage on large ML and data analytics jobs that can be realized with Frovedis and the SX-Aurora TSUBASA accelerator card is significant compared with the same tasks run on a scalar processor. Measurements made by NEC show that the Frovedis ML library, run on the SX-Aurora TSUBASA accelerator card, can deliver 42 to 113 times the performance of Spark MLlib operations running on an x86 processor, and 10 to 47 times the performance on data frame operations.

To view performance in the context of ML, NEC compared Spark executed on a cluster of multiple connected servers with Frovedis executed by the SX-Aurora TSUBASA. Execution of the latter was more than 50 times faster than the former.

Figure 2a shows the ML performance with sparse data from running logistic regression (LR), K-means and singular value decomposition (SVD) functions on two computers. The first, shown in blue, uses an Intel Xeon Gold 6126 scalar processor with Spark’s MLlib ML library. The second, shown in red, is a single SX-Aurora TSUBASA vector processor optimized using Frovedis.

[Figure 2a chart: speedup relative to Spark (= 1) of 113.2x for LR, 56.8x for K-means and 42.8x for SVD]

Figure 2a: A comparison between a Spark cluster (blue) and SX-Aurora TSUBASA optimized with Frovedis (red), each with 64 cores, shows the latter’s enormous performance advantage when executing logistic regression (LR), singular value decomposition (SVD) and K-means algorithms.

A second test (Figure 2b), which evaluated the performance of the same two computers when processing filter, sort, join and group-by functions with Spark DataFrame, showed equally impressive results.

[Figure 2b chart: speedup of Frovedis/Aurora (VE) relative to Spark (= 1) on filter, sort, join and group-by operations, ranging from 9.0x to 24.4x]

Figure 2b: The same processors when executing specific Spark DataFrame functions.

These tests verify that SX-Aurora TSUBASA, when optimized with the Frovedis middleware, delivers substantial performance increases when performing functions such as logistic regression that are inherent in processing large quantities of unstructured data within ML. NEC has conducted many other tests as well, covering a wide range of ML algorithms with similar results.

Summary

Big data brings big benefits but also big challenges for existing volume server compute architectures, because the data is less structured and difficult to process using traditional approaches. Efficiently revealing the hidden benefits and insights within the unstructured data requires new software tools, processors and memory configurations, without having to learn new coding languages. It just works. NEC’s SX-Aurora TSUBASA vector computer architecture and Frovedis middleware address these challenges while scaling from a single PCIe card to an entire server at far less cost and with higher performance than an approach solely using scalar processors. In short, vector processing with SX-Aurora TSUBASA will play a key role in changing the way big data is handled while stripping away the barriers to achieving even higher performance in the future.

Accessing the SX-Aurora TSUBASA AI Platform

NEC’s SX-Aurora TSUBASA AI Platform originated from the company’s global research centers. Until recently, it was not available in North America. Now, it is offered as the flagship product for NEC X in the Americas.

References

1. “The Growth in Connected IoT Devices Is Expected to Generate 79.4ZB of Data in 2025, According to a New IDC Forecast,” IDC, June 18, 2019. https://www.idc.com/getdoc.jsp?containerId=prUS45213219

2. “Seymour Cray,” Wikipedia. https://en.wikipedia.org/wiki/Seymour_Cray

About NEC X

NEC X, Inc. accelerates the development of innovative products and services through the strengths of NEC Laboratories’ technologies. The organization was launched by NEC Corp. in 2018 to fast-track technologies and business ideas selected from inside and outside NEC. For companies launched by its Corporate Accelerator Program, NEC X supports business development activities to help achieve revenue growth. NEC X provides options for entrepreneurs, startups and existing companies in the Americas to use NEC’s emerging technologies. The company is centrally located in Silicon Valley for access to its entrepreneurial ecosystem and strong high-technology market. Learn more at http://nec-x.com or by emailing [email protected].

© 2020 NEC Corporation of America. NEC is a registered trademark of NEC Corporation. All Rights Reserved. Other product or service marks mentioned are the trademarks of their respective owners.