A Near-Memory Processor for Vector, Streaming and Bit Manipulation Workloads∗

Mingliang Wei+, Marc Snir+, Josep Torrellas+ and R. Brett Tremaine‡
+Department of Computer Science, University of Illinois at Urbana-Champaign, Thomas M. Siebel Center for Computer Science, 201 N. Goodwin, Urbana, IL 61801-2302, USA. {mwei1, snir, torrellas}@cs.uiuc.edu
‡IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598, USA. [email protected]
∗This work is supported by DARPA Contracts NBCHC-02-0056 and NBCH30390004, as part of the PERCS project.

Abstract

Many important applications exhibit poor temporal and spatial locality and perform poorly on current commodity processors, due to high miss rates. In addition, they sometimes need to perform expensive bit manipulation operations that are not efficiently supported by commodity instruction sets.

To address this problem, this paper proposes the use of a heterogeneous architecture that couples on one chip a commodity microprocessor together with a coprocessor that is designed to run well applications that have poor locality or that require bit manipulations. The coprocessor supports vector, streaming, and bit-manipulation computation. It is a blocked-multithreaded, narrow in-order core. It has no caches but has exposed, explicitly addressed fast storage. A common set of primitives supports the use of this storage both for stream buffers and for vector registers.

We simulated this coprocessor using a set of 10 benchmarks and kernels that are representative of the applications we expect it to be used for. These codes run much faster, with speedups of up to 18 over a commodity microprocessor, and with a geometric mean of 5.8.

1. Introduction

Many applications, including several key ones from the defense domain, are not supported efficiently by current commodity processors. These applications often exhibit access patterns that, rather than reusing data, stream over large data structures. As a result, they make poor use of caches and place high-bandwidth demands on the main memory system, which is one of the most expensive components of high-end systems.

In addition, these applications often perform sophisticated bit manipulation operations. For example, bit permutations are used in cryptographic applications [23]. Since commodity processors do not have direct support for these operations, they are performed in software through libraries, which are typically slow.

Chip densities continue to increase, while our ability to use more gates to improve the performance of a single thread seems to have reached its limits; instead, vendors are moving to multicore chips.

While current designs use symmetric cores, as the number of cores per chip continues to increase it is reasonable to explore heterogeneous systems with distinct cores that are optimized for different applications. (A recent example of such a design is the CELL processor [10]; due to the limited public information on CELL, we could not compare our design to it.)

The advantage of a heterogeneous design is that one need not modify most of the software, as application and system code can continue running on the commodity core; code with limited parallelism can continue running on a conventional, heavily pipelined core, while code with significant data or stream parallelism can run on the new core. Each of the cores is also simpler to design: the design of the new core is not constrained by compatibility requirements, and good performance can be achieved with less aggressive pipelining; the design of the commodity core is not burdened by the need to handle wide vectors or other forms of parallelism. Thus, a heterogeneous system may be preferable even if, theoretically, one could design an architecture that combines both.

Three main mechanisms have been used to handle computations with poor locality: vector processing, multithreading and streaming. We show in this paper that these three mechanisms are not interchangeable: all three are needed to achieve good performance. Therefore, we study an architecture that combines all three.

Both streaming and vector processing require a large amount of exposed fast storage – explicitly addressed stream buffers and vector registers, respectively. The two approaches, however, manage exposed storage differently. We develop an architecture that provides one unified mechanism to manage exposed storage that can be used both for storing vectors and for providing stream buffers.

Streaming and vector processing provide a model where compilers are responsible for the scheduling of arithmetic units and the management of concurrency. While vector compilation is mature, efficient compilation for streaming architectures is still a research topic; streaming architectures cannot handle well variability in the execution time of code kernels, due to data dependent execution paths or to variability of communication time in large systems. The problem can be alleviated by using multithreading, where computational resources are scheduled "on demand" by the hardware. We show in this paper how to combine blocked multithreading with streaming and vector processing with low hardware overhead, and show that a modest amount of multithreading can be effective to achieve high performance. The NMP also enables a simpler underlying streaming compiler.

Our coprocessor is a blocked-multithreaded, narrow in-order core with hardware support for vectors, streams, and bit manipulation. It is closely coupled with the on-chip memory controller. It has no caches, and a high bandwidth to main memory. For this reason, rather than for its actual physical location, we call it the Near-Memory Processor (NMP). A key feature of the NMP is the Scratchpad, a large local memory directly managed by the NMP.

To assess the potential of the NMP, we simulate a state-of-the-art high-end machine with an NMP in its memory controller. We use a set of 10 benchmark and kernel codes that are representative of applications we expect to use the NMP for. The focus in this initial evaluation is on multimedia streaming applications, encryption and bit processing. We find that these codes run much faster on the NMP than on an aggressive conventional processor. Specifically, the speedups obtained reach 18, with a geometric mean of 5.8.

The main contribution of this paper is in detailing an architecture that integrates vector, streaming and blocked multithreading with common mechanisms that manage exposed on-chip storage to support both vectors and stream buffers. The architecture provides dynamic scheduling of stream kernels via hardware supported fine-grain synchronization and multithreading, which eases a streaming compiler's job. To the best of our knowledge, the design is novel. The evaluation focuses on important benchmarks and kernels, and shows that all the mechanisms that are integrated in the NMP are necessary to achieve high performance.

This paper is organized as follows: Section 2 briefs the background on the architectural techniques considered; Section 3 presents the design of the NMP; Section 4 introduces its programming environment; Section 5 evaluates the design; and Section 6 surveys related work. The results in this paper are preliminary; additional work is needed to fully validate the design.
2. Background

High memory latency is a major performance impediment for many applications on current architectures. In order to hide this latency, one needs to support a large number of concurrent memory accesses, and to reuse data as much as possible once brought from memory.

Vector processing is a traditional mechanism used for latency hiding. Vector loads and stores effect a large number of concurrent memory accesses, possibly bypassing the cache. With scatter/gather, the locations accessed can be at random locations in memory. Vector registers provide the large amount of buffering needed for these many concurrent memory accesses. In addition, vector operations can efficiently use a large number of arithmetic units, while requiring only a small number of instruction issues, a simpler resource allocator, less dependency tracking and a simpler communication pattern from registers to arithmetic units. The vector programming paradigm is well understood and well supported by compilers. It works well in applications with a regular control flow that fits the data parallel model [22].

A more general method to hide memory latency is to use multithreading, supporting the execution of multiple threads in the same processor core, so that when one thread stalls waiting for memory, another one can make progress [24]. One very simple implementation is blocked multithreading, which involves running a single thread at a time, and only preempting the thread when it encounters a long-latency operation, such as an L2 cache miss or a busy lock. This approach was implemented in the Alewife [3] and the IBM RS64IV [11]. It has been shown that blocked multithreading can run efficiently with only a few threads or contexts [25].

When multithreading is used, it is very desirable to provide efficient inter-thread communication and synchronization mechanisms between the threads. Producer-consumer primitives are particularly powerful. With these, one can very efficiently support a streaming programming model [14, 13, 9]. A stream program consists of a set of computation kernels that communicate with each other, producing and consuming elements from streams of data. This model suits data intensive applications with regular communication patterns, like many of the applications considered in this paper.

When the stream model is used, one obtains additional locality by ensuring that data produced by one kernel and consumed by another is not stored back to memory. Stream architectures such as the Merrimac [14] do so by having on-chip addressable stream buffers, and by managing the allocation of space in these buffers and the scheduling of producers and consumers in software. The compiler needs to interleave the execution of the various kernels, a task that is not done efficiently by present compilers [15]. Alternatively, one can use blocked multithreading and suitable hardware supported synchronization to ensure that the producer is automatically descheduled and the consumer is scheduled when data has been produced and is ready to be consumed. This leads to a simpler target model for compilers, as they compile sequential threads and synchronization operations between threads. This design also handles tasks with nondeterministic execution time better.
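To make the producer-consumer mechanism concrete, the following C sketch (our own illustration, not code from the NMP toolchain) models a stream as a bounded circular queue shared by a producer kernel and a consumer kernel. On the NMP, the head and tail bookkeeping and the blocking on a full or empty queue are performed in hardware, using the full/empty bits described later; here they are simulated in software.

    #include <stdio.h>

    #define QSIZE 8

    /* A software model of a stream: a bounded circular queue. */
    typedef struct {
        int data[QSIZE];
        int head, tail, count;
    } stream_t;

    static int stream_full(const stream_t *s)  { return s->count == QSIZE; }
    static int stream_empty(const stream_t *s) { return s->count == 0; }

    static void enqueue(stream_t *s, int v) {      /* producer side */
        s->data[s->tail] = v;
        s->tail = (s->tail + 1) % QSIZE;           /* wrap around   */
        s->count++;
    }

    static int dequeue(stream_t *s) {              /* consumer side */
        int v = s->data[s->head];
        s->head = (s->head + 1) % QSIZE;
        s->count--;
        return v;
    }

    int main(void) {
        stream_t q = { {0}, 0, 0, 0 };
        int produced = 0, sum = 0;

        /* The two kernels are interleaved by hand here; blocked
           multithreading instead switches threads automatically
           whenever a kernel blocks on a full or empty queue.     */
        while (produced < 100 || !stream_empty(&q)) {
            while (produced < 100 && !stream_full(&q))  /* producer */
                enqueue(&q, produced++);
            while (!stream_empty(&q))                   /* consumer */
                sum += dequeue(&q);
        }
        printf("sum = %d\n", sum);   /* 0 + 1 + ... + 99 = 4950 */
        return 0;
    }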
3. Proposed Architecture

3.1. Rationale

As discussed in the previous section, vector processing is a well understood, easy to implement mechanism for hiding memory latency, with good software support. Streaming architectures provide a more general latency hiding mechanism, at the expense of a more complex programming model, more complex hardware and the requirement for more advanced compiler technology. However, the streaming model fits well streaming applications where a sequence of kernels is pipelined. Support for the streaming model can be simplified if one uses multithreading, since software does not need to handle the interleaved execution of multiple kernels. Multithreading also handles kernels with variable execution time better.

It turns out that a set of common mechanisms can be used to exploit on-chip storage both for vector registers and for stream buffers. With such addressable common storage, it is possible to keep a relatively small amount of state for each executing thread, so that context switching is not expensive; blocked multithreading can be implemented at a relatively low cost and can be used efficiently. Thus, we choose to implement the NMP as an engine that combines vector processing, streaming and blocked multithreading. Finally, we added bit manipulation logic to support bit-oriented applications. As it turns out, all these mechanisms are needed to achieve performance on the applications we consider.

The combination of the vector/streaming models and blocked multithreading is attractive, as modest levels of multithreading are sufficient to address the limitations of these models. Specifically, for vector workloads, processor stalls caused by short vectors or by highly-variable memory access latencies can often be hidden by preempting the current thread and running another one. Similarly, for streaming workloads, processor stalls caused by an imbalance of computation or memory bandwidth between a producer and a consumer stream kernel can usually be hidden by preempting the fast kernel and running the slow one.

To be able to exploit data locality, the NMP has a large, high-bandwidth, multi-bank local memory area that it directly manages. We call it the Scratchpad. To support streaming efficiently, the NMP supports very low overhead producer-consumer synchronization between concurrent threads, using full/empty bits in the scratchpad. The design is similar to the one used by the HEP [6] and MTA [4] machines.

Since the scratchpad is large, it is impractical to save and restore it upon context switch. Thus, the scratchpad is not part of a thread context — the thread context includes only a small number of scalar and control registers. Since threads running on the same NMP can belong to distinct processes, we need to provide address protection in the scratchpad. We do so by using virtual addressing. Although using virtual addresses slightly increases scratchpad access time, the overhead is modest if data is processed using long vectors, as address translation is performed only once per vector access in the scratchpad. Such virtualization has the added benefit that scratchpad storage associated with threads that are inactive for a long period of time can be lazily paged out (into main memory) and brought back on demand when accessed.
The NMP also includes instructions for bit processing like those of the Cray machines [22, 1]. In particular, it has a 64 × 64 bit matrix register that is used to compute a 64 bit boolean vector-matrix product. Such a product can be used, in particular, to compute an arbitrary 64 bit vector permutation, transpose a 64 × 64 bit matrix, etc. The bit matrix register is also virtualized, so that it does not have to be saved and restored on context switch.
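The following C sketch illustrates the boolean vector-matrix product that the bit matrix register supports. It is a software model under one assumed bit-numbering and row-layout convention; the hardware convention may differ. Loading a permutation matrix turns the product into an arbitrary bit permutation, here a bit reversal.

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of the BMR product: bit j of the result is the
       XOR over i of v[i] AND m[i][j], with the matrix stored as 64
       row words (an assumed convention for this sketch).           */
    uint64_t bmm(uint64_t v, const uint64_t m[64]) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            if ((v >> i) & 1)   /* row i contributes iff bit i is set */
                r ^= m[i];
        return r;
    }

    int main(void) {
        uint64_t m[64];
        /* A permutation matrix with a single 1 per row: row i maps
           bit i to bit 63 - i, so the product reverses the bits.   */
        for (int i = 0; i < 64; i++)
            m[i] = 1ULL << (63 - i);

        uint64_t x = 0x0123456789ABCDEFULL;
        printf("%016llx -> %016llx\n", (unsigned long long)x,
               (unsigned long long)bmm(x, m));
        return 0;
    }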
Overall, the resulting architecture is fairly general and can speed up many classes of applications. Our initial evaluation is focused on vector, streaming and bit manipulation applications, as these are the most challenging for a conventional processor.

In the following sections, we overview the design (Section 3.2), describe the scratchpad (Section 3.3) and give some details on the instruction set (Section 3.4).

3.2. Overview of the Design

Figure 1 shows the NMP in a system like the IBM Power 5. Each memory controller is associated with an NMP.

Figure 1: NMPs in a system like the IBM Power 5.

Figure 2: Overall organization of the NMP.

Figure 3: NMP core organization.

Figure 2 shows the organization of the NMP. The dashed box encloses the NMP. In the figure, the NMP Interface provides an interface for the NMP to communicate with the rest of the system. The main processor(s) communicate with the NMP via the Invocation Register Sets (Section 3.5.1). As soon as a request from a main processor arrives, the Thread Management Unit creates a thread and inserts it into the NMP's job queue.

Figure 3 shows the organization of the NMP core. For simplicity, the core is a low-issue in-order processor. It does not have data caches but, as indicated before, it uses the explicitly-managed fast Scratchpad memory (Section 3.3). It includes scalar functional units, vector functional units, a set of general-purpose and control registers, and a single Bit Matrix Register (BMR) to permute the bits within a word [1].

To save space, there is only a single BMR. The BMR is tagged with the owner thread ID and is not saved upon context switch. If the hardware detects that a thread is going to access the BMR with an inconsistent thread ID tag, an exception occurs. The operating system then saves the BMR contents and loads the BMR for the current thread.

All threads running on the NMP share the scratchpad. In addition, they can access any main memory location in the machine. To access memory, an NMP thread uses the same 64 bit virtual addresses as if it ran in the main processor. Accesses of the NMP to the main memory are handled the same way as accesses by the main processor: they are broadcast on the coherence fabric and snooped by the caches in the system. The NMP has a TLB to cache address translation entries; these entries are kept coherent with other TLBs in the system.
3.3. The Scratchpad

The scratchpad is an explicitly-managed storage area for frequently-accessed scalars, vectors and stream buffers. The vectors and streams in the scratchpad are stored sequentially. Data can be moved between memory and scratchpad using vector load and store instructions, including strided access and scatter/gather. The vector units process vectors that are contiguous in the scratchpad. One can use masks to selectively perform operations over elements in a vector. (This is similar to the model provided by a vector register.) Thus, one can implement the scratchpad using a multi-banked memory with separate lanes from the banks to the vector units, and an alignment network to align vectors.

The NMP supports fine grain, cross-thread synchronization. Each addressable location (byte) is associated with three one-bit flags: a full/empty bit [6], an error flag bit, and a mask bit. The first one is for fine-grain synchronization: a synchronized read that consumes the data stalls until the bit is on, while a synchronized write stalls until the bit is off. The error flag bit is used for recording which locations suffered exceptions during the execution of a vector operation. Finally, the mask bit is used to mark the elements of a vector that need to be masked off in a vector operation. (Vector architectures store the mask bits in separate registers, so that the same data can be controlled by different mask vectors; we have not found the need for this extra flexibility in the kernels that we have studied so far.)

For reasons explained in Section 3.1, the scratchpad is accessed using virtual addresses: the storage is divided into pages (these need be of the same size as main storage pages). Threads running on the NMP address the local scratchpad using a short virtual address, currently set at 20 bits (8 bit page number and 12 bit displacement). The NMP has a TLB that holds entries for all the pages present in the scratchpad. A TLB miss causes an exception that blocks the thread and is handled by a main processor.

Threads also access the main memory using regular (64 bit) addresses. TLB entries are also required for main memory addresses. We can use a common TLB or two separate TLBs.

Accesses to the main memory are snooped by the caches of the regular processors, hence are coherent. No snooping occurs when the NMP accesses the local scratchpad. The main processors can access the scratchpad data (including the additional bit flags), but these accesses are not coherent, and the mapping (e.g., of the extra bits) is not straightforward, so these accesses normally occur only in supervisory mode (e.g., for paging a scratchpad page to memory). The normal mode of operation is that the NMP pulls data from memory (or caches) to the scratchpad and pushes it back.
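As an illustration of how the per-location flags interact with vector operations, here is a small C model (our own sketch; the NMP keeps the three flags per byte and updates them in the functional units). A masked vector add skips masked elements and records faulting elements in the error flags:

    #include <stdint.h>
    #include <limits.h>
    #include <stdio.h>

    /* Software model of a scratchpad element: a value plus the three
       flags described above (kept per element here for simplicity;
       the NMP keeps them per byte).                                  */
    typedef struct {
        int32_t  val;
        unsigned full  : 1;   /* full/empty bit, for synchronization */
        unsigned error : 1;   /* set if this element faulted         */
        unsigned mask  : 1;   /* set if this element is masked off   */
    } elem_t;

    /* Masked vector add: c[i] = a[i] + b[i] for unmasked elements;
       signed overflow is recorded in the error flag of the result.  */
    void vadd(elem_t *c, const elem_t *a, const elem_t *b, int n) {
        for (int i = 0; i < n; i++) {
            if (a[i].mask || b[i].mask) { c[i].mask = 1; continue; }
            int64_t s = (int64_t)a[i].val + (int64_t)b[i].val;
            c[i].error = (s > INT32_MAX || s < INT32_MIN);
            c[i].val  = (int32_t)s;
            c[i].full = 1;    /* result is now available to readers */
        }
    }

    int main(void) {
        elem_t a[3] = { {1, 1, 0, 0}, {INT32_MAX, 1, 0, 0}, {3, 1, 0, 1} };
        elem_t b[3] = { {10, 1, 0, 0}, {1, 1, 0, 0}, {30, 1, 0, 0} };
        elem_t c[3] = { {0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0} };
        vadd(c, a, b, 3);
        for (int i = 0; i < 3; i++)
            printf("c[%d] = %d (mask=%u error=%u)\n",
                   i, c[i].val, c[i].mask, c[i].error);
        return 0;
    }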
3.4. Instruction Set

In this section, we give some details on the NMP instruction set. The full description can be found in [26]. NMP instructions are 32 bits. For our simulations, we use an augmented MIPS instruction set [21]. New addressing modes are added to handle streams and vectors.

Storage in the scratchpad can be interpreted to hold scalars, vectors, or stream buffers, i.e., circular buffers holding queues. The interpretation results from the semantics of the instructions used to access the scratchpad and from information stored in registers.

Instructions specify an opcode, the operand size (byte, half-word, word, etc.), the addressing mode (Figure 4), and up to three registers. When direct addressing is used, the register contains a scalar operand. When indirect addressing is used, the register contains a specifier for an operand in the scratchpad. Specifically, it can hold a scalar specifier, a vector specifier, or a stream buffer specifier. Bits are included in the register to distinguish the different specifiers.

Thus, an instruction ADD size mode R1 R2 R3 will add the two operands specified by R1 and R2 and will store the result in the location specified by R3. size specifies whether the additions are performed on bytes, half-words, words or double-words; mode specifies whether each operand is a scalar contained in the specified register (direct addressing) or a scalar, vector or stream buffer stored in the scratchpad (indirect addressing); not all possible combinations are supported.

A few extra bits are needed in the opcodes to encode the operand size (currently 5 choices) and the mode (2 choices, direct or indirect). The extra bits required for these fields are obtained by sacrificing some bits from existing fields, e.g., the shift amount and immediate fields.

A scalar specifier is a scratchpad address. The specifier may also indicate that the access is conditional on the full/empty bit value in the scratchpad (see below). This can be used for thread synchronization.

A vector specifier consists of a vector start address and a vector length (number of operands).

A stream buffer specifier consists of a buffer start address, the buffer length, the number of elements to operate on in one operation, and a pointer to the head of the buffer (for input operands) or to the tail (for output operands). An input operand is dequeued from the head of the buffer and the head pointer in the register is updated; the thread blocks if the queue is empty. An output operand is enqueued at the tail of the buffer and the tail pointer in the register is updated; the thread blocks if the queue is full. The pointers wrap around at the boundaries of the buffer.

All the specifiers fit in a 64 bit register (remember that scratchpad addresses have 20 bits).
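As a concrete illustration, a stream buffer specifier could be packed into a 64-bit register along the following lines. The field widths below are our own assumptions for this sketch; the design only requires a 20-bit start address and that everything fit in 64 bits.

    #include <stdint.h>
    #include <stdio.h>

    /* One possible packing of a stream buffer specifier into a 64-bit
       register. The field widths are illustrative assumptions.       */
    typedef struct {
        uint64_t start : 20;  /* scratchpad address of the buffer      */
        uint64_t len   : 14;  /* buffer length, in elements            */
        uint64_t nelem : 8;   /* elements accessed per operation       */
        uint64_t ptr   : 20;  /* head (input) or tail (output) offset  */
        uint64_t kind  : 2;   /* distinguishes the kinds of specifiers */
    } stream_spec_t;

    int main(void) {
        stream_spec_t s = { .start = 0x1000, .len = 256,
                            .nelem = 16, .ptr = 248, .kind = 2 };
        /* After an access, the head/tail pointer advances by nelem
           and wraps around at the buffer boundary.                  */
        s.ptr = (s.ptr + s.nelem) % s.len;
        printf("pointer is now at element %u\n", (unsigned)s.ptr);
        return 0;
    }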

The three operands of an instruction can all be scalars (from a register, or from the scratchpad accessed via a scalar specifier); they can all be vectors of the same length (accessed via a vector or stream buffer specifier). One can also mix scalars and vectors as input operands, in which case the scalar is expanded to the vector length.

If the operands are vectors, then the operation uses the mask bits associated with its input operands in the scratchpad storage, and may set the error bits and mask bits associated with its output operand in the scratchpad.

The full/empty bits in the scratchpad are used to avoid underflow and overflow in stream buffers: a consumer marks an element as empty and a producer marks an element as full. Note that the head (resp. tail) of the queue is stored in a register of the consumer (resp. producer), and is not shared; the full/empty bits are in the scratchpad and are shared (a stream buffer is stored in a page that is accessible both by producer and consumer). The current design does not directly support multiple producers or multiple consumers; an additional multiplexing thread is needed to support them. This limitation has not proven a problem with the kernels considered so far.

The scratchpad supports six access types: load, load-if-full, load-if-full-and-mark-empty, store, store-if-empty and store-if-empty-and-mark-full. These access types can be specified explicitly by a scalar specifier; buffer specifiers implicitly require the use of load-if-full-and-mark-empty (for inputs) or store-if-empty-and-mark-full (for outputs). The logic to update the head or the tail of a stream buffer, use mask bits, or set error bits is in the functional units.

Figure 4: Addressing modes for NMP instructions: (1) direct mode, (2) scalar indirect mode, (3) vector indirect mode and (4) stream indirect mode.

New instructions are added to move data between memory and scratchpad using strided or indirect vector loads and stores (scatter/gather), as these require more than one register argument. If the destination is a stream buffer, the load becomes a stream load. Also, new instructions are added for bit manipulation; these are summarized in Table 1. The Sshift instruction can be used to shift vectors.

Table 1: Bit manipulation instructions.
Leadz: Count the leading zeros of a scalar.
Popcnt: Count the number of ones in a scalar.
Bmm load: Load the 64×64-bit matrix from the scratchpad into the BMR. It is a special vector load instruction. (A regular vector load instruction transfers data between the scratchpad and the main memory.)
Bmm: Bit-multiply the source operand with the matrix in the BMR.
Sshift: Logically left- or right-shift a block of data, e.g., 128 bytes. The shift can be rotational or non-rotational; in the latter case, zeros are shifted into the block.
Mix: Bit-interleave the higher (lower) half of two words.
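For reference, the following portable C equivalents (our own sketches, not NMP code) show what Popcnt, Leadz and Mix compute; on the NMP each is a single instruction, and Mix is shown under one common interleaving convention.

    #include <stdint.h>
    #include <stdio.h>

    int popcnt64(uint64_t x) {            /* Popcnt */
        int n = 0;
        while (x) { x &= x - 1; n++; }    /* clear the lowest set bit */
        return n;
    }

    int leadz64(uint64_t x) {             /* Leadz */
        if (x == 0) return 64;
        int n = 0;
        while (!(x & (1ULL << 63))) { x <<= 1; n++; }
        return n;
    }

    /* Mix: interleave the bits of the (here, lower) halves of two
       words, a in the even positions and b in the odd positions.   */
    uint64_t mix32(uint32_t a, uint32_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++) {
            r |= (uint64_t)((a >> i) & 1) << (2 * i);
            r |= (uint64_t)((b >> i) & 1) << (2 * i + 1);
        }
        return r;
    }

    int main(void) {
        printf("popcnt = %d, leadz = %d, mix = %016llx\n",
               popcnt64(0xF0F0), leadz64(1),
               (unsigned long long)mix32(0xFFFFFFFFu, 0));
        return 0;
    }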

Figure 5: Architecture modeled. The box with the thick boundary is the processor chip.

3.5. Other Issues

3.5.1. Coprocessor Interface

The NMP works coupled with the main processor, in a mode where the main processor is the master and the NMP is the slave. The main processor triggers an NMP computation by storing an Invocation Packet into one of the Invocation Register Sets of Figure 2, which are memory mapped in the main processor's address space. The mapping into user space is done in supervisor mode, while the store of the Invocation Packet is done in user mode, without a system call. The packet is moved immediately into a queue, clearing the register for a new invocation from the same process (an exception occurs if the queue is full). The invocation packet includes a pointer to the function to invoke and a pointer to its arguments (including a pointer to a completion flag initially set to zero). The main processor can then regularly poll the completion flag. The NMP signals completion by setting the completion flag. We expect this interface to have very low overhead.

3.5.2. Protection and Virtualization

An NMP may be executing threads on behalf of more than one process running on the main processor(s). These threads need to be protected from each other. Some NMP threads may even belong to processes that are not running in the main processor(s) but are still alive. In order to use NMP resources efficiently, such threads need to be descheduled.

To do so, we manage NMP contexts as memory, piggybacking on the virtual memory management infrastructure. Specifically, scratchpad space is allocated in swappable pages; each NMP thread is associated with some "low core" scratchpad space that is used to save the thread context. The scratchpad pages are paged to main memory by an external pager when physical scratchpad space needs to be allocated to a newly-invoked thread. The paging mechanism ensures that a thread cannot overwrite scratchpad space or memory used by other threads. However, partial sharing of the scratchpad space, via stream buffers, is also possible.

3.5.3. Exceptions and Context Switching

A thread may get blocked in the middle of executing a vector computation. The NMP is designed to be able to continue a vector operation from the point where it was stopped. This is the same approach as used in [17]. The same logic is used to handle virtual memory exceptions that happen during the execution of vector loads/stores that move data between memory and the scratchpad.

The handling of vector arithmetic exceptions is postponed until the completion of the vector instruction [5]. The faulting elements are marked in the error flag bits of the destination vector.

4. Programming Model

4.1. Processor-NMP Communication

Threads are created by processes running on a main processor using a system call that returns a handle – effectively a pointer to an Invocation Register Set; the call fails if no Invocation Register Set is available. Another system call can be used to kill the thread and free the handle.

The communication model between processor and coprocessor is that of an asynchronous procedure call: code running on the main processor can invoke a function on the coprocessor; the invocation specifies the thread to run this function, a pointer to the function, and arguments. Normally, one of the arguments will be the location of a flag to be set by the invoked function upon completion. Thus, the main processor can poll for invocation completion or block until completion.
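A minimal sketch of this asynchronous-call protocol from the main processor's side is shown below in C. The register-set layout and the names are our own assumptions; the design specifies only a function pointer, an argument pointer and a completion flag.

    #include <stdint.h>

    /* Hypothetical layout of a memory-mapped Invocation Register Set;
       writing it is an ordinary user-mode store once the set has been
       mapped into the process's address space.                        */
    typedef struct {
        volatile uint64_t func;   /* pointer to the function to invoke */
        volatile uint64_t args;   /* pointer to its argument block     */
    } invocation_regs_t;

    /* Main-processor side: fire an invocation, then poll the
       completion flag that the invoked NMP function will set.        */
    void invoke_and_wait(invocation_regs_t *regs,
                         uint64_t func, uint64_t args,
                         volatile int *completion_flag)
    {
        *completion_flag = 0;         /* flag starts cleared           */
        regs->args = args;
        regs->func = func;            /* no system call is needed      */
        while (*completion_flag == 0) /* poll until the NMP sets it    */
            ;
    }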
4.2. Thread Scheduling

A thread executes only one function at a time, picking from its queue a new invocation to execute when the previous one has completed. The NMP hardware schedules runnable threads round-robin. A running thread executes until it exits, blocks on a synchronization, or idles on a high latency memory access, at which point it is descheduled.

The hardware does not prevent livelock or deadlock; this is the programmer's responsibility. The hardware, however, maintains sufficient state to allow livelock or deadlock detection by the system, e.g., the time of the last execution by a thread and the last instruction executed.

4.3. Compilation and Run-Time

Our current programming model uses library calls for thread creation, thread termination and thread synchronization, and a compiler to generate thread code. The run-time supports the allocation and deallocation of thread structures and scratchpad space, while thread synchronization is directly supported by hardware. We have not yet developed a compiler for the new NMP instructions and addressing modes; currently, we insert the additional instructions and vector code manually in the source code. The compilation of thread code from a high level language requires added support for vectorization and for the generation of bit manipulation instructions; this does not require new compiler technology, as such capability has been available in commercial compilers for a long time.

5. Evaluation

5.1. Evaluation Methodology

To evaluate the NMP concept, we use an execution-driven simulator [2] with a detailed model of the main processor, the coprocessor and the memory system. We model the architecture shown in Figure 5, which contains a single main processor and a single NMP. The main processor is a 4-issue out-of-order superscalar with two levels of caches, while the NMP is a 2-issue in-order blocked-multithreaded processor. Main processor, memory controller, and NMP share the same processor chip. The parameters of the architecture modeled are shown in Table 2. Note that the main processor has an aggressive 16-stream hardware stride prefetcher. The prefetcher is similar to the one in [20], with support for 16 streams and non-unit strides. The prefetcher brings data into a buffer that sits between the L2 and main memory.

For the evaluation, we select a number of small applications that we list in Table 3. On average, the applications have 730 lines of C code. The table shows whether the applications can be vectorized, use streams, or use bit manipulation instructions. The table also shows the number of concurrent NMP threads used for each application.

The simulated NMP has a MIPS-like instruction set, augmented with vector, streaming, and bit manipulation instructions. Since we do not have a compiler that generates vector or stream code, we hand-coded the vector, streaming, and bit manipulation instructions. These new instructions are captured and simulated by the simulator. All programs are compiled using GCC version 3.2.1. Details of the applications can be found in [26].

5.2. Main Results

Figure 6 shows the speedups of the applications running on the NMP over the same applications running on the main processor. Recall that the main processor has an aggressive hardware prefetcher (Section 5.1). In the figure, the Copy, Scale, Add and Triad bars correspond to the four components of the Stream application [18]. The rightmost set of bars is the geometric mean of all the applications.

For each application, we show five different bars, to see the impact of the different architectural supports in the NMP. The nmp bars correspond to the full-fledged NMP architecture. novec is the NMP without the vector hardware support. nobit is the NMP without the bit-manipulation hardware support. nomt is the NMP running with a single thread. Finally, none is the NMP without vector, bit manipulation, and streaming support. We did not try to run without streaming support and with all other features on: with no streaming support, threads would need to communicate and synchronize through memory, resulting in very poor performance.

Focusing first on the nmp bars, we see that these applications typically run much faster on the NMP than on an aggressive conventional processor with a hardware prefetcher. Specifically, the speedups obtained reach 18, with a geometric mean of 5.8 for the 10 bars.
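For reference, the geometric mean used throughout the evaluation is the n-th root of the product of the n per-benchmark speedups; a small C check (with illustrative numbers, not the measured data):

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n speedups: the exponential of the mean of
       the logarithms, which avoids overflow in the product.         */
    double geomean(const double *s, int n) {
        double logsum = 0.0;
        for (int i = 0; i < n; i++)
            logsum += log(s[i]);
        return exp(logsum / n);
    }

    int main(void) {
        double s[4] = { 18.0, 2.0, 4.0, 6.0 };  /* illustrative only */
        printf("geometric mean = %.2f\n", geomean(s, 4));
        return 0;
    }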
Since the NMP is approximately at the same distance from memory as the main processor (Table 2), the speedups of the NMP do not come from shorter memory latencies. Instead, they come from a better ability to hide the memory latency (and, therefore, reduce stall time) and from architectural support for several operations common in these applications.

To better understand this effect, Figure 7 breaks down the execution time of the applications into time that the processor is busy executing instructions (Busy) and time that it is stalled, mostly waiting on the memory system (Idle). The figure shows two bars for each application; the leftmost one is for the execution on the main processor, while the rightmost one is for the execution on the full-fledged NMP. For each application, the bars are normalized to the execution time on the main processor.

From the figure, we see that most of the execution time reduction of the NMP bars comes from a large reduction in the application's stall time. This is largely due to the better architectural support in the NMP to hide memory latency. The support includes both vector instructions with long vectors and blocked multithreading. This is consistent with the work of Espasa and Valero [7], which has shown that multithreading is necessary (in addition to decoupling) to improve the resource utilization of vector processors.
Table 2: Parameters of the architecture modeled.
Main processor: 4GHz, out-of-order; fetch width 8; issue width 4; retire width 8; ROB size 152; I-window size 80; 3 Int FUs; 3 FP FUs; 3 Mem FUs; 16 pending loads and 16 pending stores; branch predictor like the Alpha 21464; branch penalty 14 cycles; 16-stream stride hardware prefetcher with a 16KB prefetch buffer and an 8-cycle prefetch buffer hit delay.
NMP: 4GHz, in-order; issue width 2; scalar FUs: 1 Int, 1 FP; vector FUs: 1 Int, 1 FP; 16 lanes; 128 pending memory loads and 128 pending stores; 4 contexts; context switch time 4 cycles; context switch policy: switch after 20 idle cycles.
Memory: L1, L2 and Scratchpad sizes 32KB, 1MB and 64KB; L1, L2 associativity 2-way, 4-way; L1, L2 line size 64B, 64B; main processor round-trip latency to L1, L2 and memory 2, 10 and 500 cycles; NMP latency to Scratchpad and memory 6 and 470 cycles; bandwidth between the vector units and the Scratchpad 256GB/s, and between the Scratchpad and memory 32GB/s.

Table 3: Applications evaluated.
Rgb2yuv (vector; 4 threads): convert the RGB representation to YUV.
ConvEnc (vector, stream, bit manipulation; 3 threads): convolutional encoder.
BMT (bit manipulation; 4 threads): bit matrix transposition.
BSM (stream, bit manipulation; 3 threads): bit stream manipulation.
3DES (vector, stream; 4 threads): 3DES encryption.
PartRadio (vector, stream; 3 threads): partial radio station.
Stream (stream; 4 threads): simple vector operations.

In addition, the busy time also typically goes down in the NMP. This is despite the fact that the NMP is a narrower-issue processor, and should take longer than the main processor to execute the same number of instructions. In reality, the reason why the busy time goes down for the NMP is the better support in the NMP for some of the operations required by these applications. One interesting exception is 3DES, where the busy time goes up; this application does not need the bit manipulation instructions introduced by the NMP.

Going back to Figure 6, we now focus on the novec bars. They show that vector support is critical to several of these applications. In particular, Rgb2yuv, 3DES, and PartRadio require the vector support in the NMP to deliver any speedup at all.

The nobit bars show the importance of the support for bit manipulation. We can see that BMT and BSM heavily rely on this support. In ConvEnc, both vector and bit manipulation support are necessary to obtain good speedups — if either one is eliminated, the speedup drops substantially.

The nomt bars show that the four components of Stream (Copy, Scale, Add and Triad) need the streaming and multithreading support. If such support is eliminated, performance drops due to the limited number of in-flight memory operations (short load/store queue).

Overall, each of the three supports present in our proposed NMP is important to speed up at least some of the applications considered. Finally, if we eliminate all three supports (none bars), the NMP is much slower than the main processor for all the applications. This is because the NMP has a low-issue in-order core.
Figure 6: Speedup of the applications running on the NMP over running on the main processor with an aggressive hardware prefetcher. Copy, Scale, Add and Triad are the four components of the Stream application. The rightmost set of bars is the geometric mean of all the applications.

Figure 7: Breakdown of the execution time of the applications on the main processor (leftmost bars) and on the full-fledged NMP (rightmost bars). For each application, the bars are normalized to the execution time on the main processor.

6. Related Work

We briefly consider three related areas, namely processing in memory, stream architectures, and multithreaded vector architectures.

6.1. Processing in Memory

Processing in Memory (PIM) or Intelligent Memory architectures integrate logic and DRAM in the same chip. Some of the PIM approaches [12, 16, 19] suggest replacing main memory by PIM chips. Since the in-memory processor directly connects to the memory banks, it has a high bandwidth and low latency to main memory. Results show a significant improvement for a variety of applications. However, PIM chips require merged DRAM logic that has a high production cost.

Our NMP is different from PIM in that it does not require modifications to the DRAM chips. The NMP can be placed on the processor chip or closer to main memory.

6.2. Stream Architectures

A stream program is organized as streams of data processed by computation kernels. A stream processor is optimized to exploit the locality and concurrency of stream programs. Imagine [9] and Merrimac [14] are two examples of the stream architecture.

Impulse [27] expands the traditional memory controller by adding address translation hardware to it. Data items whose physical DRAM addresses are not contiguous can be mapped to contiguous shadow addresses, which are the unused physical addresses. The memory controller can compact sparse data into dense cache lines and feed the processor with a stream of data.

The NMP architecture has some of the support of a streaming architecture, but it can be argued that it enables a simpler streaming compiler. The use of blocked multithreading, in particular, avoids the need for explicitly scheduling and multiplexing the kernels on the same processor, facilitates resource (processor and register) allocation, and helps better overcome variance in the execution time of different tasks. The NMP can also perform functions similar to the ones implemented in Impulse.

6.3. Multithreaded Vector Architecture

Espasa and Valero [7, 8] showed that multithreading can be applied to a vector processor to greatly improve resource utilization. In their design, vector registers are part of the context of a thread. Consequently, they are saved and restored when the thread is preempted and rescheduled.

In the NMP, we have explored a different design, where the vector storage is not part of a thread's saved context. Vectors are stored in the scratchpad, which is an area shared by all threads. Not saving the registers on a context switch reduces the overhead.

7. Conclusion

We proposed in this paper a design for an engine that can support efficiently both vector and streaming applications, while providing a simpler interface than a streaming engine where all instruction scheduling is under software control. We believe this combination to be novel. We showed that this engine supports efficiently vector benchmarks, streaming benchmarks and applications requiring bit manipulations. While not demonstrated explicitly in the paper, it is also the case that streaming compilers for the NMP would be simpler: there was no need for a sophisticated compiler to fuse the kernels.

We expect that increases in chip density will lead to the development of heterogeneous architectures, where functions now provided by external engines, such as GPUs, will be integrated on chip. CELL is an early example of this trend. Our work shows the potential performance advantage of such an approach in an important domain. The initial evaluation presented in this paper indicates that a chip that contains an NMP in addition to a regular processor can perform significantly better than a regular processor on its own. Of course, future chips could contain multiple NMPs and multiple commodity processors. While we did not compare explicitly to a commodity processor augmented with a SIMD unit, we believe that the comparison would not be very different, since the main performance bottleneck is the exposed memory latency, not the ALU speed.

The initial evaluation used simple kernels because of the lack of a compiler and the need to hand code vector and streaming operations. We hope to follow up in the future with evaluations of more complete applications, including applications and kernels commonly used in high performance scientific computing. This would require a more developed software environment. Also, we hope to study additional variations in the NMP design, including systems with a smaller number of vector lanes.

References
[1] Cray assembly language (CAL) for Cray X1 system reference manual.
[2] http://sesc.sourceforge.net/.
[3] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife machine: Architecture and performance. In Proc. of the 22nd Annual Int'l Symp. on Computer Architecture (ISCA'95), pages 2–13, 1995.
[4] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing, pages 1–6. ACM Press, 1990.
[5] K. Asanovic. Vector Processors. Ph.D. thesis, Computer Science Division, University of California at Berkeley, 1998.
[6] B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. In Society of Photo-optical Instrumentation Engineers, volume 298, pages 241–248, 1981.
[7] Roger Espasa and Mateo Valero. Multithreaded vector architectures. In HPCA '97: Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, pages 237–249. IEEE Computer Society, 1997.
[8] Roger Espasa and Mateo Valero. Simultaneous multithreaded vector architecture: Merging ILP and DLP for high performance. In HiPC '97: Proceedings of the Fourth International Conference on High-Performance Computing, pages 350–357. IEEE Computer Society, 1997.
[9] B. Khailany et al. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, March 2001.
[10] D. Pham et al. The design and implementation of a first-generation CELL processor. In ISSCC 2005: Proceedings of the IEEE International Solid-State Circuits Conference, 2005.
[11] J. M. Borkenhagen et al. A multithreaded PowerPC processor for commercial servers. IBM J. Research and Development, 44:885–894, 2000.
[12] Mary Hall et al. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, page 57. ACM Press, 1999.
[13] Ujval J. Kapasi et al. Programmable stream processors. IEEE Computer, pages 54–62, August 2003.
[14] William J. Dally et al. Merrimac: Supercomputing with streams. In SC'03, November 2003.
[15] Pat Hanrahan. Private communication.
[16] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas. FlexRAM: Toward an advanced intelligent memory system. In International Conference on Computer Design, pages 192–201, October 1999.
[17] Christoforos Kozyrakis. A media-enhanced vector architecture for embedded memory systems. Technical Report UCB CSD-99-1059, University of California at Berkeley, 1999.
[18] John McCalpin. http://www.cs.virginia.edu/stream.
[19] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 192–203. IEEE Computer Society, 1998.
[20] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 24–33, April 1994.
[21] Charles Price. MIPS IV Instruction Set. 1995.
[22] Richard M. Russell. The CRAY-1 computer system. Commun. ACM, 21(1):63–72, 1978.
[23] Bruce Schneier. Applied Cryptography (2nd ed.): Protocols, Algorithms, and Source Code in C. John Wiley & Sons, Inc., 1995.
[24] Theo Ungerer, Borut Robič, and Jurij Šilc. A survey of processors with explicit multithreading. ACM Comput. Surv., 35(1):29–63, 2003.
[25] Wolf-Dietrich Weber and Anoop Gupta. Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: Preliminary results. In ISCA, pages 273–280, 1989.
[26] Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine. A brief description of the NMP ISA and benchmarks. Technical Report UIUC DCS-R-2005-2633, 2005.
[27] L. Zhang, Z. Fang, M. Parker, B. Mathew, L. Schaelicke, J. Carter, W. Hsieh, and S. McKee. The Impulse memory controller, 2001.