A New Look at Exploiting Data Parallelism in Embedded Systems

Hillery C. Hunter
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
[email protected]

Jaime H. Moreno
IBM Research Division
T.J. Watson Research Center
Yorktown Heights, NY
[email protected]

ABSTRACT

This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single Instruction Multiple Packed Data (SIMpD) paradigms; the less-common Single Instruction Multiple Disjoint Data (SIMdD) architecture is described and evaluated. A taxonomy is defined for data-level parallel architectures, and patterns of data access for parallel computation are studied, with measurements presented for over 40 essential telecommunication and media kernels. While some algorithms exhibit data-level parallelism suited to packed vector computation, it is shown that other kernels are most efficiently scheduled with more flexible vector models. This motivates exploration of non-traditional architectures for the embedded domain.

Categories and Subject Descriptors

C.1.1 [Processor Architectures]: Single Data Stream Architectures; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures; C.1.3 [Processor Architectures]: Other Architecture Styles; C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.4 [Performance of Systems]: Design studies

General Terms

Design

Keywords

Data-Level Parallelism, DLP, ILP, SIMD, Sub-word Parallelism, VLIW, Embedded, Processor, DSP, Telecommunications, Media, Architecture

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES'03, Oct. 30–Nov. 1, 2003, San Jose, California, USA.
Copyright 2003 ACM 1-58113-676-5/03/0010 ...$5.00.

1. INTRODUCTION

Demand for personal device functionality and performance has made the ability to perform multiple computations per cycle essential in the digital signal processing (DSP) and embedded domains. These computations can be gleaned from multiple forms of program parallelism. Instruction-level parallelism (ILP) occurs at the operation level, when two or more operations are data-independent from one another and may be executed concurrently. Data-level parallelism (DLP) occurs when the same operation is executed on each member of a set of data. When the elements of data parallel computations are narrower than a standard data width (e.g. 8 instead of 32 bits), sub-word parallelism (SWP) is present. Superscalar machines detect instruction parallelism in hardware, but other approaches require ILP, DLP, and SWP to be explicitly exposed by a compiler or programmer.

Broad application parallelism categories exist (e.g. numerical applications are highly data parallel and control applications have little ILP), but most complete applications contain both instruction-level and data parallelism. In the embedded domain, repetition of computations across symbols and pixels results in particularly high amounts of data parallelism. This paper is motivated by the breadth of DSP and embedded architectures currently available for 2.5G and Third-Generation (3G) wireless systems, so examples and analysis will focus on telecommunication and media kernels.

This paper qualitatively and quantitatively analyzes methods for performing data parallel computation. Three primary architecture styles are evaluated: Very Long Instruction Word (VLIW), Single Instruction Multiple Packed Data (SIMpD), and Single Instruction Multiple Disjoint Data (SIMdD). Section 3 presents a taxonomy of architectures, and Section 4 describes implementations of these architectures. Comparisons in Section 5 include ease of implementation and programming; performance; and the ability to match inherent algorithmic patterns. In Section 6, kernels are grouped according to data access patterns, and the amount of irregular patterns is quantified. The results of this analysis motivate definition and exploration of alternative forms of data access.

2. RELATED WORK

In the uni-processor domain, previous research on data parallel computation deals primarily with architectures using packed vector data, meaning that multiple data elements are joined together into a single register. In the literature,


this is commonly referred to simply as SIMD – Single Instruction Multiple Data. For each commercial SIMD extension, there are many papers which evaluate speedups realized for particular domains. A thorough survey of these performance studies on the Sun VIS, Intel MMX and SSE, and IBM PowerPC AltiVec extensions is given in [1]. Most studies focus on general-purpose video and matrix kernels and applications. In addition, [2] compares TI C62x VLIW performance to signal processing and multimedia computation on a Pentium II with and without MMX support; and [3] compares execution of EEMBC telecommunication and consumer benchmarks on a proposed vector IRAM processor (VIRAM) to commercial VLIW and superscalar implementations.

Aside from performance studies, there are several bodies of work related to overcoming bottlenecks within the packed vector format. Proposals to increase multimedia performance include: the MediaBreeze architecture, which improves address generation, loop handling, and data re-ordering [4]; the MOM (Matrix Oriented Multimedia) ISA extension for performing matrix operations on two-dimensional data; and use of in-order processors with long vectors instead of superscalar architectures [1].

Data alignment requirements of the packed vector data paradigm have also received some attention, in the form of alternate permutation (data re-arrangement) networks and instruction set modifications. Yang et al. discuss the need for flexible permutation instructions in packed vector architectures, and propose instructions which use a butterfly permutation network to provide rearrangement of subword data elements [1].

Little work has considered the extent to which media algorithms are truly or only partially suited to packed vector computation. Bronson's work [5] was evaluated in a multi-processor context, but its premise is that there are algorithms which do not map fully to a SIMD computational model, and which benefit from a mixed SIMD/MIMD (Multiple Instruction Multiple Data) system. Faraboschi et al. [6] advocate a mixed VLIW/SIMD system in which SIMD components are programmed by hand, and ILP is scheduled on VLIW units by the compiler. They discuss tradeoffs between VLIW and packed vector architectures in terms of implementation and code generation.

Previous work assumes a packed vector approach is the only vector method for handling DLP on an embedded system. Algorithmic focus has been on video kernels, which are generally assumed to fit well into a packed vector model. This work provides a new taxonomy which includes the SIMdD architecture style (Section 3); extends previous analyses to cover telecommunication kernels; and quantifies suitability to various data-level parallel architecture styles (Section 6).

3. TAXONOMY

In a load-store architecture, computation operations use only register values as sources and destinations, and load and store operations specify transfers of data between registers and memory. This paradigm, however, leaves room for architectural design of register usage and the memory interface. The following taxonomy of instruction composition and register specification presents a unique perspective on data access.

If instructions are viewed as the controllers of data usage and production, the most general architectural model will allow for simultaneous execution of an arbitrary combination of instructions and a similarly flexible use of data registers. However, to produce useful work, instructions will inherently have some degree of dependence on one another, and must therefore be sequenced or combined as dictated by program control flow. In contrast, data access is not generally restricted by program flow, but by hardware simplifications made at the time of architecture design.

In [7], Stokes illustrates a Single Instruction Single Data (SISD) architecture as consisting of a single sequential instruction stream which operates on a single data (register) stream and produces a single output stream. An adaptation of Stokes' one-wide, in-order processor representation is shown in Figure 1(a).

If a statically-scheduled architecture allows simultaneous and different processing of multiple independent data (register) streams, the result is a Very Long Instruction Word (VLIW) architecture. The instruction stream in Figure 1(b) is a combination of multiple operations, grouped into an instruction word which specifies multiple computations. Individual operations may use any registers, with the restriction that their destinations must generally be unique, so as to avoid write-back ambiguity within a single cycle of execution. This is the most general of the architecture types discussed in this paper.

If the VLIW instruction stream is modified to consist of single, sequential operations, the result is a Single Instruction Multiple Data (SIMD) architecture. Here single operations are specified in the instruction stream, but each specified operation is performed on multiple data elements. For the general SIMD case, these data elements may come from, and be written back to, disjoint locations. This separation of input and output data streams will be indicated by the notation SIMdD (Single Instruction Multiple Disjoint Data), and is pictured in Figure 1(c).

Current SIMD implementations, however, do not allow this level of data flexibility. Instead, multiple data elements for SIMD operations are packed into a single register, often called a SIMD vector register. Each instruction causes an operation to be performed on all elements in its source registers, and there is only one input data stream. This is depicted in Figure 1(d), and will be discussed in further detail in Section 4.2 as SIMpD (Single Instruction Multiple Packed Data).

When implemented as pictured in Figure 1(c), there are many similarities between the SIMdD and VLIW architectures. An alternate implementation, indirect-SIMdD, pictured in Figure 2, provides access to disjoint data values via vector pointers. Rather than explicitly specifying vector elements in the SIMdD instruction word, indirect-SIMdD instructions have vector pointer source and destination fields, wherein each vector pointer specifies multiple indices. Data elements are accessed indirectly through statically specified pointers, and physical vectors are composed at execution-time based upon dynamic values in vector pointer registers.

Table 1 enumerates the architectural design space spanned by the variables Operations per Instruction, Data Count, and Data Flexibility. These variables are binary: instruction words may contain one or multiple operations, which operate on one or multiple data inputs, either independent of one another (disjoint) or joined in a register or other architecture construct.


Table 1: Taxonomy of access to data values for statically-scheduled architectures.

  #  Operations per Instruction  Data Count  Data Flexibility  Architectural Style
  1  One                         One         Disjoint          SISD (Figure 1(a))
  2  One                         One         Joined            n/a
  3  One                         Multiple    Disjoint          Flexible SIMD (e.g. SIMdD) (Figure 1(c))
  4  One                         Multiple    Joined            SIMpD (Figure 1(d))
  5  Multiple                    One         Disjoint          Compound operations (CISC)
  6  Multiple                    One         Joined            n/a
  7  Multiple                    Multiple    Disjoint          VLIW (Figure 1(b))
  8  Multiple                    Multiple    Joined            SIMpD across multiple processors/clusters
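Because the three variables of Table 1 are binary, the eight styles can be generated mechanically. The following Python sketch (style labels transcribed from the table; the encoding itself is merely illustrative) reproduces the table's rows in order:

```python
from itertools import product

# Binary axes of Table 1, iterated in the table's row order.
styles = {}
for num, axes in enumerate(
        product(("One", "Multiple"),          # operations per instruction
                ("One", "Multiple"),          # data count
                ("Disjoint", "Joined")),      # data flexibility
        start=1):
    styles[num] = axes

# Architectural style labels, transcribed from Table 1.
labels = {1: "SISD", 2: "n/a", 3: "Flexible SIMD (e.g. SIMdD)",
          4: "SIMpD", 5: "Compound operations (CISC)", 6: "n/a",
          7: "VLIW", 8: "SIMpD across multiple processors/clusters"}
```

Rows 2 and 6 still fall out of the enumeration even though, as the taxonomy discussion notes, they are not architecturally meaningful.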

[Figure 1: Taxonomy of architectures. Each panel shows an instruction stream, register fields, and data in/out streams: (a) Single Instruction, Single Data (SISD); (b) Very Long Instruction Word (VLIW); (c) Single Instruction, Multiple Disjoint Data (SIMdD), annotated with seamless vector reuse; (d) Single Instruction, Multiple Packed Data (SIMpD), annotated with vector register reuse and permutation.]
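As a behavioral illustration of panels (c) and (d), the same 4-wide addition can be expressed in both styles; the only difference is where the operands may live. The Python model below is invented for this discussion (register names and file sizes are assumptions, not any real ISA):

```python
# Behavioral models of one 4-wide SIMD add.

def simpd_add(vregs, dst, src1, src2):
    """SIMpD (Figure 1(d)): each operand names one packed vector
    register, so all four elements are implicitly adjacent."""
    vregs[dst] = [a + b for a, b in zip(vregs[src1], vregs[src2])]

def simdd_add(regs, dst_ids, src1_ids, src2_ids):
    """General SIMdD (Figure 1(c)): one opcode, but every lane names
    arbitrary (disjoint) registers for its sources and destination."""
    for d, s1, s2 in zip(dst_ids, src1_ids, src2_ids):
        regs[d] = regs[s1] + regs[s2]

# SIMpD: the operands must already be packed together.
vregs = {"v0": [1, 2, 3, 4], "v1": [10, 20, 30, 40], "v2": [0] * 4}
simpd_add(vregs, "v2", "v0", "v1")

# SIMdD: the same computation over scattered scalar registers.
regs = {i: 0 for i in range(16)}
regs.update({0: 1, 5: 2, 2: 3, 7: 4, 8: 10, 9: 20, 10: 30, 11: 40})
simdd_add(regs, [12, 13, 14, 15], [0, 5, 2, 7], [8, 9, 10, 11])
```

Note the encoding cost this flexibility implies: the SIMdD call must name twelve register identifiers, versus three for SIMpD.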

Styles 1, 3, 4, and 7 have been described above; styles 2 and 6 are not meaningful, since a "single" datum cannot be "joined" to others. The closest technique to style 5 is the compound operation, created to perform multiple operations on a single data stream in Complex Instruction Set Computers (CISC). Architecture 8, SIMpD across multiple processors, is indicative of the fact that at higher dimensions the possibilities for combining architectural techniques are numerous: homogeneous or heterogeneous combinations of VLIW, SIMpD, and SIMdD may be formed across clusters or processors.

The focus of this paper will be VLIW (style 7), general and indirect-SIMdD (style 3), and SIMpD (style 4) architectures. Characteristics and commercial implementations of each of these styles are discussed in the following section. Analysis of kernel access patterns and their correspondence to these architectures is presented in Section 6, motivating exploration of new architecture design space. Vector (or array) processors, in which vector elements are processed sequentially through a pipeline, typically require a different programming model, so they are not included in this discussion.

[Figure 2: Indirect-SIMdD implementation using vector pointers. Instructions contain an opcode (OP) and vector pointer (VP) fields; each vector pointer selects the vector elements used for computation, and a pointer manipulation unit updates the pointers.]

4. IMPLEMENTATIONS OF DATA PARALLELISM

4.1 VLIW architectures

A statically-scheduled architecture with issue width greater than one operation per cycle is the simplest extension to traditional processors, which issue one instruction per cycle. The power and performance advantages of this technique have motivated the adoption of VLIW techniques for signal processing applications. Available products include those from Philips [8], Infineon [9], Texas Instruments (TI C6x), and StarCore [10].

VLIW instruction words specify multiple operations on different source data, and thus inherently represent MIMD (Multiple Instruction Multiple Data) computation. This is illustrated at the top of Figure 1(b), where the sample instruction contains four opcodes (OP) and three registers


for each operation (one destination and two sources). Data parallelism is accomplished in the VLIW paradigm by specifying separate execution of the same operation on multiple data items, so the same opcode must be duplicated for each operation.

Since implementations may differ significantly, for the purposes of comparison with other architectures, the term VLIW will be used in the following sections to refer to a statically-scheduled 32-bit load-store architecture with an instruction issue width of four. Function unit mix is assumed to be uniform across the issue slots, including availability of four memory ports.

4.2 SIMpD architectures

The first SIMD machine was the ILLIAC IV [11], with sixty-four 64-bit processors operating in parallel. Following the ILLIAC IV were commercial SIMD machines produced by ICL (Distributed Array Processor, DAP) and Goodyear (Massively Parallel Processor, MPP) [12]. These machines were used for scientific, military, and air traffic control problems which required high computational throughput. SIMD entered the desktop processing domain in the 1990s in order to accelerate graphics and gaming processes. Today, SIMD is finding new applications as current embedded performance falls short of emerging 3G wireless and consumer media requirements.

Support for single instruction, multiple data computation is most commonly provided in the form of special instructions which operate on wide, multi-element vector registers (shown in Figure 1(d)). For the generic case, the SIMpD vector file consists of r registers of n elements, where each element is of size b bits. Instructions require log2(r) bits to specify each source and destination register. When an instruction is issued, the operation is performed on all n elements in the SIMpD vector register. This imposes strict placement requirements for the elements in the registers, since every element needed for a data-parallel computation must reside in the same vector register.

For purposes of comparison with other architectures, this paper will assume a SIMpD approach in which n = 4 and b = 32. This eliminates the need for discussion of special-purpose instructions used for computation on 8- and 16-bit quantities.

General-purpose SIMpD extensions
As mentioned, most general-purpose architectures currently incorporate some form of SIMD to support high-speed graphics. These extensions generally have their own vector computation units and vector register file, which sit beside the standard integer and/or floating point processing paths. The registers themselves are commonly 64 to 256 bits in length, and hold subword quantities of 8, 16, or 32 bits. The SIMpD unit as a whole generally serves as an on-chip co-processor. Computation kernels which are data parallel are vectorized, and their processing occurs in the processor's SIMpD unit.

The AltiVec architecture is the most flexible of the general-purpose SIMpD implementations. Its instructions exploit data parallelism at a granularity of 8, 16, or 32 bits; data-reordering, combining, and masking capabilities are also provided. Specifically, AltiVec's vector permute instruction provides alignment and element merging capabilities which are absent from MMX and VIS. The flexibility of AltiVec makes its use in domains other than graphics and gaming plausible, so it is the primary SIMpD ISA referenced in this work.

SIMpD Embedded/DSP processors
SIMpD processing has become popular for DSPs intended particularly for consumer media processing, i.e. video, teleconferencing, image processing, and broadband applications. Among designs which include SIMpD instructions are Analog's ADSP-21161, Equator's MAP-CA Broadband Signal Processor (BSP), 3DSP's UniPHY, Tensilica's Xtensa Vectra extension [10], and Philips' TriMedia CPU64 [8]. In addition, special ISA support is provided on general embedded processors: Motorola's Signal Processing Engine (SPE) [13] is a SIMpD processing unit with special support for traditional DSP operations; Wireless MMX extensions to Intel's XScale architecture provide 64-bit versions of MMX, SSE, and other multimedia instructions for the embedded market [14]; and MIPS-3D instructions are 2-way SIMpD operations intended for three-dimensional graphics processing on a MIPS64 architecture [15].

4.3 SIMdD architectures

A Single Instruction Multiple Disjoint Data (SIMdD) architecture implements SIMD computation without the alignment and placement restrictions of SIMpD approaches. As pictured in Figure 1(c), each instruction specifies a single computation to be performed on more than one datum, but the source registers of the data elements are arbitrary. The instruction shown indicates that it is possible to accomplish this flexibility using very long instructions which contain fields for each source and destination data element. For four-wide SIMD, this requires eight source and four destination register fields, making the instructions longer as compared to SIMpD. The authors are not aware of any general SIMdD implementations.

An alternative method of data specification uses vector pointers to specify the elements used for computation. Figure 2 shows the level of indirection this adds to a general SIMdD architecture. Each vector pointer has n indices which point to vector elements to be used for computation, and instructions specify two source vector pointers and a destination vector pointer. Pointers may be set explicitly or loaded from memory, as described in [16]. An example of an indirect-SIMdD architecture implemented using vector pointers is IBM's eLiteDSP [17].

For the remainder of this paper, an indirect-SIMdD architecture will be assumed to have vector pointers which specify access to four elements at a time, with a data element size of 32 bits. These four elements are obtained and processed simultaneously.

5. ARCHITECTURE TRADEOFFS

The following sections discuss hardware, code generation, and performance tradeoffs associated with each architecture type. Specific implementations provide features to accelerate particular computations, overcome the memory bottleneck, or re-order data for particular applications. To the extent possible, this analysis abstracts away from these differences, which are not fundamental to the VLIW, SIMpD, or SIMdD computation models. In some cases, however, the eLiteDSP processor will be discussed separately from general SIMdD characteristics, since it is the only known SIMdD implementation, and some of its indirect-SIMdD features are relevant to the comparisons made. In each tradeoff


subsection, when an architecture has no particular benefits/drawbacks associated with it, it will not be mentioned.

5.1 Impact on hardware implementation

Data and instruction memory typically consume 40% of an embedded processor's real estate, and are predicted to dominate by 2005 [18]. Sizing of both memory types can be significantly affected by an architecture's data access and storage methods.

Data memory size
The most commonly-cited drawback of programming for current SIMpD architectures is the requirement that data be aligned to vector-length boundaries in memory. This occurs because vector memory operations are specified using a single address, from which n sequential vector data elements are loaded into a single vector register. Since there is usually no unaligned addressing support, the programmer must ensure that all data is aligned to the architected vector length, or else multiple vector loads and registers may be needed to compose a single vector from memory. This induces memory waste if a programmer/compiler is forced to pad data in order to meet alignment requirements and minimize on-chip re-arrangement of loaded/stored data.

Instruction memory size
With regard to instruction storage, if a single loop iteration does not contain enough parallelism to fully utilize a processor's execution resources, multiple loop iterations must be co-scheduled. Loop unrolling is the most straightforward technique to accomplish iteration co-execution, but incurs significant code size expansion. Modulo scheduling has become a common embedded processing technique to increase the throughput of statically scheduled loops while keeping code size manageable. "Generic" modulo scheduling requires modulo variable expansion (MVE), which uses extra registers and instructions to keep unique variable copies. Code expansion incurred for loop unrolling and MVE increases the required instruction memory size.

If an architecture provides rotating registers, code size may be reduced because the burden of register renaming is moved from the compiler to hardware. While some general-purpose VLIW-like implementations include rotating register support, this is not yet a feature in embedded VLIW processors, and current SIMpD implementations do not include the ability to rotate vector registers (though this might be beneficial to many algorithms).

The eLiteDSP SIMdD implementation includes an auto-update mechanism for vector pointers, which allows a form of register rotation more flexible than VLIW implementations [17]. Register rotation and pointer update mechanisms significantly reduce loop code sizes, and thus processor memory requirements.

Program code size is impacted not only by operation counts, but also by the size of individual operation encodings. As evident from the instruction fields in Figures 1 and 2, VLIW and general SIMdD implementations have the largest instruction sizes. SIMpD and indirect-SIMdD operations specify multiple computations much more compactly. Additionally, since each SIMpD register holds n elements, for a given element count, e, there will be r = e/n vector registers for a SIMpD implementation, as opposed to the e registers needed for a VLIW. This reduces the number of bits needed to specify source and destination registers to log2(e/n) in SIMpD operations, from log2(e) for VLIW/general SIMdD.

Inclusion of pointer auto-update in an indirect-SIMdD implementation does not necessarily require additional opcode bits, if an update stride is associated with each vector pointer (see Section 6); several additional opcode bits may be necessary to support more complex auto-update mechanisms.

Data memory ports
As previously described, elements of SIMpD and indirect-SIMdD registers are stored in sequential memory locations. While restrictive from a data ordering perspective, this means that if one vector load/store is allowed per cycle, only one memory port and address computation unit are needed to obtain n data elements. In contrast, a VLIW or general SIMdD architecture would need n address computations and n memory ports to obtain the same number of data elements, since elements are not necessarily sequential.

Register files
A register file's size and power consumption are impacted by its number of ports. VLIW and SIMdD register files will have the most ports: three times the processor issue width, and three times the vector width, respectively. SIMpD ports will be the widest (since multiple data elements are obtained through a single register), but there need only be three ports. All architectures will have a data element register file, but indirect-SIMdD requires an additional file for vector pointers. This will likely be small and fairly simple, but if pointers and data must be accessed in the same pipeline stage, this may necessitate extension of the indirect-SIMdD processor cycle. More likely, however, pointers will be accessed in a pipeline stage separate from data, so general and indirect-SIMdD architectures should have register access times similar to those for a VLIW. A SIMpD register access should be faster, since the file is smaller.

Functional units
For a VLIW architecture, DLP computations are performed using the same resources which produce ILP results. In contrast, SIMpD capability is commonly added to a processor as a separate DLP unit which operates beside units for standard integer code and control. Even if integer or floating point registers and function units are used to achieve SWP computation, SIMpD architectures will generally need a unit to re-arrange data elements. Permute and merge operations are logically crossbars, so this additional unit will generally require additional hardware. An indirect-SIMdD implementation also requires an additional functional unit: a vector pointer manipulation unit capable of pointer loads, stores, and simple arithmetic.

While the VLIW paradigm is most flexible due to the ease with which arbitrary data are accessed, using 32- or 64-bit integer or floating point registers and computation units is wasteful for exploiting the small-data-type DLP commonly available in telecommunication and media applications (often 8, 13, or 16 bit granularity).

Architecture extensibility
Whereas a SIMpD architecture requires log2(r) bits to specify each register, operand specification in an indirect-SIMdD architecture is independent of the number of vector registers, and depends only on the number of architected vector pointers. This allows expansion of the number of data registers without sacrificing code compatibility. This is unique, because VLIW and SIMpD register counts must be architected, and are not extensible across processor implementations within the same architecture family.
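The alignment cost described under "Data memory size" can be sketched generically in Python (a behavioral model; no particular ISA's load semantics are implied). When data starts at an unaligned address, two aligned vector loads plus a merge must be scheduled to compose one vector:

```python
def aligned_load(mem, addr, n=4):
    """SIMpD-style vector load: addr must be a multiple of n."""
    assert addr % n == 0, "vector loads require alignment"
    return mem[addr:addr + n]

def load_unaligned(mem, addr, n=4):
    """Compose an unaligned n-element vector from two aligned loads
    and a merge step, mimicking the extra work (and extra register)
    a SIMpD programmer must schedule."""
    base = (addr // n) * n
    lo = aligned_load(mem, base, n)        # first aligned chunk
    hi = aligned_load(mem, base + n, n)    # second aligned chunk
    off = addr - base
    return (lo + hi)[off:off + n]          # the permute/merge step

mem = list(range(16))
```

Padding data to alignment boundaries avoids the second load and the merge, which is exactly the memory-waste tradeoff noted above.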

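The vector pointer indirection of Figure 2, together with the stride-based auto-update discussed above, can be modeled in a few lines of Python (the names, file sizes, and choice of a multiply-accumulate are illustrative assumptions, not eLiteDSP syntax):

```python
# Behavioral model of an indirect-SIMdD vector multiply-accumulate.
# Each vector pointer (VP) holds four indices into a flat file of
# vector element registers; after issue, both source pointers
# auto-update by an associated stride.

def indirect_mac(elems, vps, acc, src1, src2, stride=0):
    a = [elems[i] for i in vps[src1]]      # gather via pointer 1
    b = [elems[i] for i in vps[src2]]      # gather via pointer 2
    acc = [s + x * y for s, x, y in zip(acc, a, b)]
    for name in (src1, src2):              # pointer auto-update
        vps[name] = [i + stride for i in vps[name]]
    return acc

elems = list(range(100))                   # vector element file
vps = {"vp0": [3, 1, 4, 1], "vp1": [5, 9, 2, 6]}
acc = indirect_mac(elems, vps, [0] * 4, "vp0", "vp1", stride=1)
```

Because the instruction names only vp0 and vp1, the operand encoding is independent of the size of the element file, which is the extensibility property noted above.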

5.2 Impact on code generation

Compiled code
Code for architectures with vector data types must first be vectorized into DLP computations which match the architected vector length. Auto-vectorization is the process of automatically generating code which expresses data parallelism. This process has been well-explored for numerical Fortran code, which contains clearly parallelizable do loops. Signal processing code, however, when not written in assembly language, is almost exclusively written in C. This presents several difficulties to a vectorizer. First, pointer use gives rise to aliasing. If compiler alias analysis is unable to prove independence of two arrays, some loops which use these arrays may not be vectorized. Second, use of function calls or break statements in C for loops also causes difficulties for auto-vectorizers. Additionally, arrays are often used in C with little attention paid to data alignment. Most auto-vectorizers target the SIMpD paradigm, which requires memory alignment for vector load and store operations, so the compiler must understand correlations between array indices and alignment. If a loop's array accesses cannot be determined to be aligned, the loop will not generally be vectorized, because the overhead of manual vector alignment is prohibitive. Loops in telecommunication and media kernels generally contain high amounts of DLP, but the language characteristics described above present many potential pitfalls to compilers auto-vectorizing code for a DLP processor.

The premise of VLIW architecture is to move complexity from a processor's hardware to its software, so VLIW architectures are generally quite regular, and thus good compiler targets. Unlike SIMpD or SIMdD architectures, they do not require vectorization. If a compiler is unable to fully resolve array dependences, or a loop contains a cross-iteration dependence, available ILP may still be exploited. This can be a distinct performance advantage for automated compilation.

Programmer effort
Because vectorization is difficult to achieve automatically, but essential to SIMpD performance, the burden of SIMpD code generation is often placed on the code developer, even when using the C language. For the AltiVec ISA extensions, Motorola has defined pragmas and intrinsics which programmers use to describe vectorization in AltiVec pseudo-assembly [19]. All vector assignments and computations are stated explicitly in the C code, and their translation to AltiVec assembly instructions is nearly one-to-one.

For algorithms with irregular data patterns (see Section 6), it is sometimes possible to manipulate their computations in such a way as to regularize DLP patterns. This is generally tedious (it extends software development time) and may require programmer understanding of the mathematical underpinnings of an algorithm. While expensive at development time, algorithmic modifications generally also cost run-time cycles and register resources. Given VLIW or SIMdD data access flexibility, algorithmic modification is generally unnecessary.

While auto-vectorization is difficult for a compiler, assembly programmers generally find VLIW code hard to schedule, because it is difficult to track many in-flight instructions related in irregular ways. Paradoxically, vectorized code generation may be easier for assembly programmers, once an algorithm is converted to regular data access patterns.

5.3 Execution impact

Telecommunication and media applications can generally be considered loop-dominated, and can be viewed as sets of performance-critical loops connected by control code. Given the same number of function units and equivalent operation types, actual data computation should proceed in any architecture at roughly the same rate. However, the manner in which data is obtained and arranged for computation may have a significant impact on performance, since it may not always be possible to supply function units with a constant stream of productive operations. In this section, the impact of the three described architectural styles on a program's execution time will be evaluated for the dominant application phases of loop prologues/epilogues and loop computation kernels.

Prologue/epilogue overhead
As previously described, SIMpD implementations generally require data to be aligned to vector-length memory bounds. In addition to this restriction, SIMpD input data must appear in memory in an order close to that used for calculation, since the data must reside in a single vector register when an operation is issued. When a loop needs data in an order different from that stored in memory, or when a kernel produces data in a pattern different from that needed by a subsequent kernel (such as FFT bit reversal), cycles are spent re-arranging data in the prologue or epilogue.

While SIMdD loads also require data alignment in memory, sequentially fetched elements are not necessarily written into adjacent registers, since arbitrary, disjoint elements may be specified either explicitly (general SIMdD) or by means of a vector pointer (indirect-SIMdD). There is therefore no need to spend cycles re-arranging elements before or after access to memory, and it is less important that data be aligned to particular memory boundaries. Also, register pressure may be eased because the need for permutations is eliminated, reducing register spills and thus aiding execution time.

While the cost of SIMdD memory access may improve prologue/epilogue execution relative to SIMpD architectures, indirect specification of source and destination operands for an indirect-SIMdD implementation requires prologue pointer initialization instructions. For arbitrary initialization, the number of bits required to indicate any four separate data elements will likely necessitate several operations (or a single long operation with multiple immediate fields). For example, kernels vectorized for the eLiteDSP architecture were found to use an average of 5.1 (hand assembly) and 5.9 (compiled) vector pointers. Initializing these pointers extends kernel prologues beyond those for a VLIW or SIMpD architecture. Figure 3 shows an example of prologue vector pointer set-up and subsequent loop execution for a standard FIR filter on an indirect-SIMdD architecture. Two vector pointers are initialized: one whose members all point to a coefficient value, and a second pointer to indicate input data. As filter processing progresses, the vector pointer values will be updated by the indicated stride (1), so that sequential coefficients and input values are used as inputs to the vector multiply-accumulate operations.

Computation kernel overhead
VLIW, SIMpD, and SIMdD differ in the ease with which recently computed data may be reused for subsequent computations. For a VLIW architecture, register data reuse is trivial: the destination of a previous computation is simply
While expensive at devel- vector pointers are initialized: one whose members all point opment time, algorithmic modifications generally also cost to a coefficient value, and a second pointer to indicate input run-time cycles and register resources. Given VLIW or data. As filter processing progresses, the vector values will SIM dD data access flexibility, algorithmic modification is be updated by the indicated stride (1), so that sequential co- generally unnecessary. efficients and input values are used as inputs to the vector While auto-vectorization is difficult for a compiler, assem- multiply-accumulate operations. bly programmers generally find VLIW code hard to schedule Computation kernel overhead because it is difficult to track many in-flight instructions re- VLIW, SIM pD,andSIM dD differ in the ease with which lated in irregular ways. Paradoxically, vectorized code gen- recently computed data may be reused for subsequent com- eration may be easier for assembly programmers, once an putations. For a VLIW architecture, register data reuse is algorithm is converted to regular data access patterns. trivial: the destination of a previous computation is simply
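The prologue/epilogue cost described above can be sketched in executable form. The following Python model contrasts packing a stride-2 operand (as in 2:1 decimation) on a SIMpD machine, which has only aligned contiguous vector loads, against a SIMdD gather through disjoint element indices. The function names, the 4-wide vectors, and the operation counts are illustrative assumptions, not any shipping ISA.

```python
# A rough sketch (not a real ISA) of the prologue cost claim above:
# gathering a strided operand into one 4-wide vector register.

VL = 4  # vector length, an assumption for illustration

def simpd_load_stride2(mem, base):
    """SIMpD style: only aligned, contiguous vector loads exist, so two
    loads plus a permute are spent packing the even-indexed elements."""
    ops = []
    lo = [mem[base + i] for i in range(VL)]       # aligned vector load 1
    hi = [mem[base + VL + i] for i in range(VL)]  # aligned vector load 2
    ops += ["vload", "vload"]
    packed = (lo + hi)[0::2]                      # permute even lanes together
    ops.append("vperm")
    return packed, ops

def simdd_load_stride2(mem, base):
    """SIMdD style: a vector pointer names disjoint elements directly,
    so the strided gather is a single load with no rearrangement."""
    packed = [mem[base + 2 * i] for i in range(VL)]
    return packed, ["vload.vp"]

mem = list(range(100, 116))
a, ops_p = simpd_load_stride2(mem, 0)
b, ops_d = simdd_load_stride2(mem, 0)
assert a == b == [100, 102, 104, 106]
assert len(ops_p) > len(ops_d)  # SIMpD spends extra prologue operations
```

The extra `vperm` per vector is exactly the kind of prologue re-arrangement cycle the text refers to; on a SIMdD machine the same gather is folded into the load itself.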


(Figure 3 here: in the prologue, vector pointer vp0 is initialized to (0,0,0,0) with stride 1 and locates the coefficient array h[]; vp1 is initialized to (64,65,66,67) with stride 1 and locates the input data array x[]. The vector registers then hold h0,h1,h2,h3 and xn,xn-1,xn-2,xn-3, and the kernel issues vector multiply-accumulate (MAC) operations.)

Figure 3: Indirect-SIMdD vector pointer set-up: FIR filter.
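The vector-pointer mechanism of Figure 3 can be sketched in Python. The `VectorPointer` class, the flat memory layout, and the 4-element vector width below are modeling assumptions for illustration, not eLiteDSP details.

```python
# A minimal Python model of the indirect-SIMdD access pattern in Figure 3.

class VectorPointer:
    """Four element indices plus an auto-update stride."""
    def __init__(self, elements, stride):
        self.elements = list(elements)
        self.stride = stride

    def read(self, memory):
        """Gather the addressed values, then auto-update every index."""
        values = [memory[i] for i in self.elements]
        self.elements = [i + self.stride for i in self.elements]
        return values

def fir_indirect_simdd(mem, h_base, x_base, ntaps):
    """Compute 4 FIR outputs with two vector pointers, as in Figure 3:
    vp0 splats successive coefficients, vp1 walks the input data."""
    vp0 = VectorPointer([h_base] * 4, 1)                     # (0,0,0,0), stride 1
    vp1 = VectorPointer([x_base + i for i in range(4)], 1)   # (64,65,66,67), stride 1
    acc = [0, 0, 0, 0]
    for _ in range(ntaps):
        h = vp0.read(mem)   # same coefficient in all lanes
        x = vp1.read(mem)   # four consecutive input samples
        acc = [a + hi * xi for a, hi, xi in zip(acc, h, x)]  # vector MAC
    return acc

# 4-tap filter over a flat memory: h[] at 0..3, x[] starting at 64.
mem = {i: c for i, c in enumerate([2, -1, 3, 1])}
mem.update({64 + i: i + 1 for i in range(12)})
y = fir_indirect_simdd(mem, 0, 64, 4)
# Reference: y[i] = sum over j of h[j] * x[64 + i + j]
ref = [sum(mem[j] * mem[64 + i + j] for j in range(4)) for i in range(4)]
assert y == ref
```

No permute or shift operations appear in the kernel loop; the pointer auto-update supplies each MAC with the next coefficient and input group directly.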

Table 2: Qualitative comparisons.

Data memory alignment:
  VLIW: not needed; values loaded individually
  SIMpD: strict alignment to vector bounds
  SIMdD: alignment to vector bounds, but flexible register placement
  indirect-SIMdD: alignment to vector bounds, but flexible register placement
Program size:
  VLIW: large; better if rotating registers implemented
  SIMpD: moderate
  SIMdD: large
  indirect-SIMdD: small if pointer auto-update implemented, else moderate
Instruction size:
  VLIW: large; SIMpD: small; SIMdD: large; indirect-SIMdD: small
Memory ports:
  VLIW: large; SIMpD: small; SIMdD: large; indirect-SIMdD: small
Data register ports:
  VLIW: 3 * processor width; SIMpD: 3 wide ports; SIMdD: 3 * vector width; indirect-SIMdD: 3 * vector width
Function units:
  VLIW: control logic, arithmetic
  SIMpD: control logic, arithmetic, vector permutation
  SIMdD: control logic, arithmetic
  indirect-SIMdD: control logic, arithmetic, pointer arithmetic
Extensibility:
  VLIW: vector size determined by VLIW size; instruction length dependent on register count
  SIMpD: vector size can be changed
  SIMdD: vector size can be changed; number of vector elements changes instruction length
  indirect-SIMdD: vector size can be changed; instruction length independent of number of vector elements
Code generation:
  VLIW: compilation straightforward; assembly difficult
  SIMpD: vectorization needed; assembly straightforward if algorithm packs easily
  SIMdD: vectorization needed
  indirect-SIMdD: vectorization needed
Data (re)use:
  VLIW: arbitrary specification; no data movement required
  SIMpD: contiguous data; permutations required
  SIMdD: arbitrary specification; no data movement required
  indirect-SIMdD: arbitrary specification; no data movement required
Prologue overhead:
  VLIW: none; SIMpD: data align/permute; SIMdD: none; indirect-SIMdD: pointer set-up
Kernel overhead:
  VLIW: none; SIMpD: permutations possible; SIMdD: none; indirect-SIMdD: pointer arithmetic (but may be automated)

In this way, no additional instructions or cycles are required for data reuse. There is no need for data re-organization, regardless of how dispersed data are in the register file. The same is true for general SIMdD architectures.
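The register-reuse contrast just described can be illustrated with a small model. The register names, the rotate-based permute, and the single-operation movement cost below are assumptions made for the sketch, not measurements of any machine.

```python
# An illustrative model (no real ISA) of the reuse contrast: a VLIW result
# register can feed any later operation directly, while a SIMpD result must
# occupy the right *lane* of a packed register, or be permuted first.

def vliw_reuse(regs):
    """Scalar registers: r2 = r0 * r1, then r3 = r2 + r0."""
    regs["r2"] = regs["r0"] * regs["r1"]
    regs["r3"] = regs["r2"] + regs["r0"]  # reuse by simply naming the register
    return 0                              # extra data-movement operations needed

def simpd_reuse(v0, v1, lane):
    """Packed registers: to add lane `lane` of the product vector to lane 0
    of v0, the product must first be permuted so the lanes line up."""
    prod = [a * b for a, b in zip(v0, v1)]
    moves = 0
    if lane != 0:
        prod = prod[lane:] + prod[:lane]  # vperm rotating the wanted lane in
        moves += 1
    return prod[0] + v0[0], moves

regs = {"r0": 3, "r1": 5}
assert vliw_reuse(regs) == 0 and regs["r3"] == 18

res, moves = simpd_reuse([3, 4, 5, 6], [5, 5, 5, 5], lane=2)
assert res == 5 * 5 + 3 and moves == 1    # one permute spent on reuse
```

When the wanted result happens to sit in lane 0 (`lane=0`), the SIMpD path also costs zero movement operations; the penalty is paid only for misaligned reuse, which is the case the text singles out.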
In contrast to the relative ease with which VLIW and SIMdD register data may be reused, SIMpD presents obstacles to non-sequential data reuse. If computed data values do not reside in vector register locations which are exactly aligned to those of a second vector source register, then data must be re-organized. Not all current SIMpD architectures provide meaningful permute instructions, so this characteristic of subword-parallel computation can cause performance problems. Flexible permute and merge operations allow arbitrary data reordering within a single register and between multiple registers, meaning that they provide for combination of vector elements from multiple registers into a single SIMD vector register. These can reduce the performance penalty of data reorganization to a single cycle. However, since the elements in each source register of a SIMpD operation must be aligned to one another, it is commonly necessary to insert SIMpD data permutations within loop bodies. In tight kernels, even one additional cycle can be costly. Some SIMpD implementations provide instructions to re-arrange data within a single register, but few allow data values to be combined from multiple registers. As previously mentioned, AltiVec includes vector permute, vector shift, and vector select instructions which provide complete element re-ordering capabilities. These instructions, however, only operate on two SIMD vector registers at a time, meaning that if a new register must be composed of elements from more than two vector registers, longer sequences of re-order operations may be required.

Indirect-SIMdD computation requires somewhat more reuse overhead than VLIW or general SIMdD, but is still quite flexible. The programmer or compiler must keep track of the locations to which data are written (indicated by a destination pointer), and set a source pointer to access the newly produced data. If data are used in the same order as they are produced, it may be possible to simply copy the destination pointer to a separate source pointer in the loop prologue, thus avoiding a pointer set-up instruction sequence. In the ideal case, a modulus will have been established for the destination pointer which causes it to wrap around to the beginning of a data region once all data has been processed. This pointer can then become the source of subsequent instructions which iterate over the new results.

As an example of the kernel execution penalty of SIMpD, Figure 4 shows excerpts of Motorola's sample FIR program [20]. For this code, vec_perm is vector permute, vec_sld is vector shift left double, and vec_mergel is vector merge low. Data re-arrangement operations (vec_sld, vec_perm, vec_mergel) are necessary within both the inner and outer loops. In contrast, for the eLiteDSP FIR filter in Figure 3, both the coefficient (vp0) and input (vp1) vectors are initialized with an auto-update stride of 1. This eliminates the need for vector shifting within the kernel.

  int real_32_hz(int ntaps, signed short *c, int ndat,
                 vector signed short *x, vector signed short *y) {
    vector signed short x0, x1, coef;
    vector signed int tmp1, tmp2, tmp3, tmp4;
    vector signed int tmp5, tmp6, ac0, ac1, ac2;
    vector signed int ac3, ac4, ac5, ac6, ac7;
    int i, j, op, ndatavec, ncoefvec;
    op = 0;
    ndatavec = ndat >> 3;
    ncoefvec = ntaps >> 3;
    for (i = 0; i < ndatavec; i++) {
      x0 = x[i];            /* load next 8 input samples */
      j = 0;                /* clear accumulators (accs) */
      do {                  /* The j loop computes 8 outputs */
        coef = c[j];
        x1 = x[i+j+1];
        ac0 = vec_msums(coef, vec_sld(x0, x1, 2), ac0);
        /* Similar operations on ac1...ac7 */
        x0 = x1;            /* x1 contains x0 val for next iter */
        j += 1;
      } while (j < ncoefvec);
      /* Round & sum across acc for each output */
      ac0 = vec_sums(ac0, ROUND);
      /* Same operation on ac1...ac7 */
      /* Copy high 16 bits of last elem. of each acc
         to single output vector (containing 8 outputs) */
      tmp1 = vec_mergel(ac0, ac1);
      /* Merge ac2...ac7 */
      y[op] = vec_perm((vector signed short) tmp5,
                       (vector signed short) tmp6, PR);
      op += 1;
    } /* for i */
    return op*8;
  }

Figure 4: AltiVec FIR filter in C with Motorola extensions.

5.4 Tradeoff summary

Table 2 provides a summary of the qualitative comparisons made among the three primary architecture types in the previous sections. There are costs associated with each architecture type, but as will be shown in the next section, the flexibility afforded by some styles may more closely match fundamental algorithmic computation patterns, and thus warrant the additional hardware or programming cost for particular application domains.

6. DLP COMPUTATION PATTERNS

This section provides a quantitative description of DLP access patterns found in kernels essential to current and future telecommunication and media standards. These include filters of various lengths and types, vector and matrix operations, bit packing, an inverse discrete cosine transform (IDCT), and pitch estimation. While this study only examined kernels (as opposed to entire applications), they represent the major portion of many applications. Some kernels were written in assembly language by experienced programmers, based upon a C language specification (Hand kernels in Table 3); separate results are based upon code generated by the vectorizing compiler described in [17] (Compiled kernels). Execution traces were collected using the simulator described in [17].

While measurements were made on a particular architecture (the eLiteDSP), this analysis is not bound to that specific architecture type or implementation, and similar results should be observed if derived from code for other architectures. Since these kernels were written and compiled under the assumption of full indirect-SIMdD architecture flexibility, it can fairly safely be assumed that register usage which follows SIMpD conventions arises from the underlying algorithms, and corresponds to the manner in which code would have been vectorized for a SIMpD architecture. The eLiteDSP ISA gives preference to SIMpD-like pointer initializations, so where possible, SIMpD patterns will generally be used.

Two metrics were used in deriving these results: ∆ and ψ. As shown in Figure 5(a), ∆ is the distance between elements located by a single indirect-SIMdD vector pointer. The eLiteDSP ISA provides an automatic pointer update option, in which all elements of a vector pointer are incremented by a constant stride, ψ, each time the pointer is used (Figure 5(b)). Figure 6 demonstrates how use of R0 and then R2 in a SIMpD context ((∆,ψ)=(1,8)) correlates to a vector pointer change in the indirect-SIMdD model.

For these experiments, if access occurs to a vector in which each element location is separated by a constant difference ∆, the Vector Stride Register value ψ is also recorded. These values are then assigned a DLP pattern name, according to the categories listed in Table 4. In general, preference is given to SIMpD architectures, meaning that if a given (∆,ψ) pair can be handled efficiently by multiple architectures, it is attributed to the SIMpD pattern. For this reason, the case (∆=0, ψ=0) is assigned to pattern SIMpD, since it is easily handled in a SIMpD architecture via a splat instruction which places a single value into all elements of a given vector register. The ROT pattern indicates accesses which would be easily handled with some form of automated register shifting, such as rotating registers. Any (∆,ψ) pair not listed in Table 4 is assumed to require SIMdD or VLIW flexibility and is categorized as a FLEX pattern.

Figures 7 and 8 show the measured distribution of DLP read access patterns for hand and compiled codes. Each set of kernels is grouped into categories: those requiring high, moderate, and low amounts of flexibility in their data ordering.


(Figure 5 here: (a) the vector element distance ∆ separating the elements addressed by one vector pointer; (b) the vector stride ψ added to all pointer elements on each update. Figure 6 here: a computation that specifies use of R0 and then R2 in SIMpD corresponds, in the indirect-SIMdD model, to a pointer over elements [0,1,2,3] updated to [8,9,10,11].)

Figure 5: Meaning of ∆ and ψ.

Figure 6: SIMpD data access pattern and SIMdD corollary: (∆,ψ)=(1,8).
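The two metrics can be made concrete with a minimal model. The function below enumerates the index groups touched by an auto-updated vector pointer; its name and the 4-element pointer width are assumptions for illustration.

```python
# A small model of the two metrics: delta is the spacing between the four
# element indices named by one vector pointer, psi is the constant added to
# every index after each use.

def accesses(start, delta, psi, uses):
    """Return the index group touched by each use of an auto-updated pointer."""
    ptr = [start + i * delta for i in range(4)]
    out = []
    for _ in range(uses):
        out.append(list(ptr))
        ptr = [i + psi for i in ptr]  # auto-update every element by psi
    return out

# (delta, psi) = (1, 8): packed groups 8 apart, the SIMpD pattern of
# Figure 6, i.e. "use register R0, then R2".
assert accesses(0, 1, 8, 2) == [[0, 1, 2, 3], [8, 9, 10, 11]]

# (delta, psi) = (2, 1): every other element, as in 2:1 decimation.
assert accesses(0, 2, 1, 2) == [[0, 2, 4, 6], [1, 3, 5, 7]]
```

The first case maps directly onto packed SIMpD registers; the second has no packed-register equivalent without permutation, which is why it lands in the FLEX category below.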

(Figures 7 and 8 here: stacked bars giving, for each benchmark kernel, the fraction of vector reads classified as SIMpD, ROT, or FLEX.)

Figure 7: Hand code: vector element distance and stride (∆,ψ) pairs.

Figure 8: Compiled code: vector element distance and stride (∆,ψ) pairs.

(Figure 9 here: VLIW/Superscalar (ILP) machines and SIMpD machines are currently in production; the SIMdD region between them, containing the eLite design point, is a neglected design space.)

Figure 9: Partial architecture design space.


Table 3: Benchmark kernels.

Hand-generated code:
  bkfir: Real block FIR filter; T=16 taps, N=40 (block size)
  corr_mat_1: Autocorrelation matrix
  cxfir: Complex block FIR filter
  decimator_1: Decimator, 2:1 (M=2)
  divide: Integer divide
  dotp: Vector dot product
  fft: 256-pt, rad-4 FFT
  iir1: IIR filter
  interpolator_1: Interpolator, 1:2 (L=2)
  lin_com_1: Linear comb. of 2 vectors
  lms: Least-Mean-Square filter
  max_fraction: Find max of vector of rational fractions, from GSM-EFR codebook search
  pitch_fr: GSM-FR pitch est. (vect. convolution)
  ssfir: Single sample FIR
  syndromes: Syndrome computation for (255,239) Reed-Solomon
  vecmax: Vector maximum
  vecsum_vef: Vector sum, VEF preloaded

Compiler-generated code:
  addsvec: Add two vectors, one un-aligned; N=40
  bkfir1: Real block FIR filter; T=16, N=40
  bkfir2: Alternate coding style: real block FIR filter; T=16, N=40, no delay line
  bkfir2_max: Real block FIR filter; T=16, N=40
  Bprob: B-probability implemented as squared Euclidean distance; N=40
  corr3: 3-vector correlation; N=64
  cxfir1: Complex block FIR filter; N=40, T=16
  cxfir2_sat: Alternate coding style: complex block FIR filter; N=40, T=16
  cxfir2_str: Alternate vectorization used: complex block FIR filter; N=40, T=16
  decimator: Decimator, 2:1; N=h=16, x=96
  dot_product: Vector dot product; N=40
  e4_prld: Synthetic test case
  e4_ver1: Synthetic test; different compilation method
  eudistance: Euclidean distance between two vectors; N=48
  fir_32: Dbl-precision FIR filter; N=40, T=16
  idct: 8x8 block 2D-IDCT
  interpolator: Interpolator, 1:2; N=h=16, x=64
  lms: Least-Mean-Square filter; block size 40, T=32
  lin_comb: Linear combination of 2 vectors; N=40
  mad_byBlk: Minimum absolute distance (8x8 block); for motion estimation
  mad_byRow: Minimum absolute distance (8x8 block); for motion estimation
  matrix: Multiply 4x4 matrix by a vector
  quant: H.263 quantization of 8x8 coefficient block
  quant2: Quantize 15 bit (+sign) samples to 5 bit (+sign)
  sad_no_abs: Sum of absolute distance calculations for H.263; N=16x16
  ssfir1: Single sample FIR filter; N=1, T=16
  ssfir2: Single sample FIR filter; T=16
  SoftBitPack1: Pack 6 soft bits into 32-bit vectors; 1 soft bit per vector
  vecmax: Find max in an array and return index of first occurrence; N=40
  viterbi: Viterbi algorithm to decode convolutional codes in GSM full rate standard; N=189

Table 4: DLP patterns.

  Delta (∆)   Stride (ψ)     Description                    Pattern Name
  0           0              Constant                       SIMpD
  0           > 0            SIMpD                          SIMpD
  1           0              Constant                       SIMpD
  1           1              Shift/Rot.                     ROT
  1           2              Shift/Rot. by non-unit stride  ROT
  1           3              Shift/Rot. by non-unit stride  ROT
  1           [4,8,12,...]   SIMpD                          SIMpD
  2           1              Alternating even/odd           FLEX
  4           1              4x4 matrix col. traversal      FLEX
  8           1              8x8 matrix col. traversal      FLEX
  16          1              16x16 matrix col. traversal    FLEX
  Non-const.  any            Random element access          FLEX

Kernels with the most rigid DLP patterns (right-hand side of the graphs; SWP = 90-100%) would be efficiently computed on a SIMpD architecture with packed vector registers. Those with large percentages of FLEX patterns (e.g. interpolation or vector maximum search) are algorithms with inherently irregular computation patterns and would incur execution penalties on an architecture with strict data alignment requirements.

The vast majority of the FLEX accesses in Figures 7 and 8 occur with constant stride (ψ) values but a non-unit element separation (∆ > 1) (see the categories in Table 4). For example, it was advantageous in the 2:1 decimation calculation (the decimator kernel) to use every other data element as the input to vector multiplies, i.e. elements (1, 3, 5, 7), then (2, 4, 6, 8), etc. Assuming availability of an automatic pointer update mechanism (increment pointer values by a constant after use), these FLEX patterns represent data re-arrangement which could occur with little or no penalty on the SIMdD architecture, but would require permutations or data regularization on a SIMpD architecture.

FLEX usage is also found in manipulation of complex numbers. If numbers C0, C1, C2, C3 are stored as sequential (real,imag) pairs in locations (0...7), then accessing only real or only imaginary elements occurs with ∆ = 2, i.e. elements (0, 2, 4, 6) are real, while (1, 3, 5, 7) are imaginary. Accessing complex numbers in reverse order of their storage sequence requires a non-constant element separation. If four elements may be accessed at once, C3 and C2 will be used first, then C1 and C0. This corresponds to access of locations (6, 7, 4, 5) and then (2, 3, 0, 1), which are irregular, or FLEX, access patterns.

Some FLEX usage results from pointer arithmetic operations available on the eLiteDSP. These include the capability to bit-reverse vector pointer values (used for the FFT) and to left-shift pointer values. Such instructions can be assumed to cost 1 cycle on an indirect-SIMdD architecture, but would be performed on a different unit than computations on vector elements, so they could be performed in parallel with vector calculations. They are thus low-cost, but allow easy access to data values in patterns inherent to important kernels.
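The classification of Table 4 can be sketched as a small function. The name `dlp_pattern` and the encoding of a non-constant ∆ as `None` are assumptions made for illustration; the pattern names and category boundaries follow the table.

```python
# A sketch of the classification in Table 4, mapping a measured
# (delta, psi) pair to a DLP pattern name.

def dlp_pattern(delta, psi):
    if delta is None:              # non-constant element separation
        return "FLEX"              # e.g. random or reversed-pair access
    if delta == 0:
        return "SIMpD"             # a splat handles a constant element
    if delta == 1:
        if psi in (1, 2, 3):
            return "ROT"           # shift/rotate, unit or small stride
        if psi == 0 or psi % 4 == 0:
            return "SIMpD"         # packed groups at vector-length strides
    return "FLEX"                  # matrix columns, even/odd lanes, ...

assert dlp_pattern(0, 0) == "SIMpD"     # constant (splat)
assert dlp_pattern(1, 8) == "SIMpD"     # R0-then-R2 packed access
assert dlp_pattern(1, 1) == "ROT"       # rotating registers suffice
assert dlp_pattern(2, 1) == "FLEX"      # 2:1 decimation, even/odd lanes
assert dlp_pattern(4, 1) == "FLEX"      # 4x4 matrix column traversal
assert dlp_pattern(None, 0) == "FLEX"   # reversed complex pairs
```

Under this scheme, the complex-number cases above classify as claimed: real-only or imaginary-only access is (∆=2, ψ constant), hence FLEX, and reverse-order pair access has no constant ∆ at all, hence FLEX.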


As an example, the SIMdD eLiteDSP vector pointer shift left instruction is used for the vec_max and max_fraction benchmarks; its equivalent on SIMpD AltiVec would be to issue a load vector for shift [right,left] operation, followed by a vector permute.

It is apparent from these results that some computation kernels contain DLP patterns conducive to efficient programming on an architecture with packed data vectors. However, this is not a consistent property of all telecommunication and media kernels, and depending on a target application's combination of kernels, it appears beneficial to consider use of an alternate architecture style, such as indirect-SIMdD, which would allow irregular data patterns to be exploited using the same vector mechanism capable of regular DLP operations.

7. CONCLUSIONS

Figure 9 depicts a subset of the uniprocessor design space available to embedded processor architects. On the left, labeled as VLIW/Superscalar, are architectures which realize data parallel computation directly through instruction-level parallelism. The circle on the right represents SIMpD architectures, which pack quantities into vector registers and use these aligned groups of data as inputs to computation. In the embedded and DSP domains, commercially available and published processor designs currently fall into the ILP or SIMpD categories, or contain computation units for each of these styles. The SIMdD class of architectures can be viewed as a bridge between these two design forms: SIMdD provides flexible access to data elements (a likeness to VLIW), but also has separate, compact instructions for performing data parallel computations. The eLiteDSP's indirect vector access mechanism is a particular implementation of the general SIMdD class which bridges the ILP→SIMpD design space.

This study indicates that many telecommunication and media kernels contain opportunities for data parallel computation which do not map well to the SIMpD paradigm. These computational patterns motivate exploration of embedded and DSP architectures whose design points lie outside the space currently implemented in the VLIW/Superscalar and SIMpD domains.

8. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the eLiteDSP team for their contributions, in particular Amir Geva for his work on the simulator. Hillery Hunter was partially supported by a fellowship from the National Science Foundation.

9. REFERENCES

[1] V. Lappalainen, T. Hamalainen, and P. Liuha, "Overview of research efforts on media ISA extensions and their usage in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 660-670, Aug. 2002.
[2] D. Talla, L. John, V. Lapinskii, and B. Evans, "Evaluating signal processing and multimedia applications on SIMD, VLIW and superscalar architectures," in Proceedings of the 2000 Int'l Conf. on Computer Design (ICCD '00), pp. 163-172, 2000.
[3] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," in Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO-36), pp. 283-293, Nov. 2002.
[4] D. Talla, L. K. John, and D. Burger, "Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements," IEEE Transactions on Computers, 2002.
[5] E. Bronson, T. Casavant, and L. Jamieson, "Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system," IEEE Transactions on Parallel and Distributed Systems, vol. 1, pp. 195-205, Apr. 1990.
[6] P. Faraboschi, G. Desoli, and J. Fisher, "The latest word in digital and media processing," IEEE Signal Processing Magazine, pp. 59-85, Mar. 1998.
[7] J. Stokes, "3 1/2 SIMD architectures." http://www.arstechnica.com/cpu/1q00/simd/simd-1.html, Mar. 2000.
[8] J. van Eijndhoven, et al., "TriMedia CPU64 architecture," in Proc. Int'l Conf. on Computer Design (ICCD '99), pp. 586-592, 1999.
[9] Infineon Technologies Corp., CARMEL DSP Core Technical Overview Handbook, 1.0 ed., 2000.
[10] R. Cravotta, "2003 DSP directory," EDN Magazine, Apr. 2003.
[11] G. Barnes, R. Brown, M. Kato, D. Kuck, D. Slotnick, and R. Stokes, "The ILLIAC IV computer," IEEE Transactions on Computers, vol. C-17, pp. 746-757, Aug. 1968.
[12] G. Fox, R. Williams, and P. Messina, Parallel Computing Works! San Francisco, CA: Morgan Kaufmann Publishers, 1994.
[13] K. Gala, "Signal processing engine architecture, instruction set, and programming model." Motorola Smart Networks Developer Forum (SNDF), July 2002.
[14] Intel Corporation, Intel Wireless MMX Technology Product Brief, Publication 251669-001, 2002.
[15] MIPS Technologies, Inc., MIPS-3D Graphics Extension Product Brief, Publication 0600/2k/rev0, 2000.
[16] J. Derby and J. Moreno, "A high-performance embedded DSP core with novel SIMD features," in Proceedings of the 2003 International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Apr. 2003.
[17] J. Moreno, et al., "An innovative low-power high-performance programmable signal processor for digital communications," IBM Journal of Research and Development, vol. 47, pp. 299-326, March/May 2003.
[18] Semiconductor Industry Association, "International technology roadmap for semiconductors (ITRS)," 2001.
[19] Motorola Corporation, AltiVec Technology Programming Interface Manual, June 1999.
[20] Motorola Corporation, "AltiVec code samples." http://www.motorola.com/SPS/PowerPC/AltiVec.
