Yao, K. "Algorithms and Architectures for Multimedia and Beamforming Communications"

The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000.

© 2000 by CRC PRESS LLC

79 Algorithms and Architectures for Multimedia and Beamforming in Communications

Flavio Lorenzelli
ST Microelectronics, Inc.

Kung Yao
University of California at Los Angeles

79.1 Introduction
79.2 Multimedia Support for General Purpose Computers
    Extended Instruction Set and Multimedia Support for General Purpose Processors • Multimedia Processors and Accelerators • Architectures and Concepts for Multimedia Processors
79.3 Beamforming Array Processing and Architecture
    Interference Rejecting Beamforming Antenna Arrays • Smart Antenna Beamforming Arrays

79.1 Introduction

In recent years, we have experienced explosive growth in various technologies for information processing, transmission, distribution, and storage. In this chapter, we address two distinct but equally important sets of algorithms and architectures in communication systems: multimedia processing and beamforming array processing. The first problem is motivated by the need for efficient, real-time presentation of still images, live-action video, and audio information, commonly called "multimedia." To most users, a multimedia presentation is easier and more natural to comprehend than the traditional textual form of presentation on paper. The second problem is motivated by the desire to transfer tremendous amounts of information from one location to another under limited frequency-polarization-space-time channel constraints. Beamforming array processing technologies are used to coherently transmit or receive information under these channel constraints and can significantly improve the performance of a single transmit or receive antenna system. Since both of these problems are of fundamental interest in VLSI and practical implementations of modern communication systems, we consider in detail their basic signal processing algorithmic and architectural limitations. In Section 79.2, we first consider the intense computational requirements for displaying images and video on a general-purpose PC. The pros and cons of using a separate processor or accelerator to enhance the main CPU are also introduced. Then, "Extended Instruction Set and Multimedia Support for General Purpose Processors" discusses issues related to the extended instruction set for media support


for a general-purpose PC, while "Multimedia Processors and Accelerators" considers media processors and accelerators in some detail. "Architectures and Concepts for Multimedia Processors" deals with the architectures and concepts for multimedia processors. The very long instruction word (VLIW) architecture and SIMD with its associated subword parallelism are presented and compared. Four tables are used in Section 79.2 to summarize some of the basic operations involved. In Section 79.3, we discuss the use of the beamforming operation, originally motivated by various military and aerospace applications but more recently found in many civilian applications. In "Interference Rejecting Beamforming Antenna Arrays", we consider some early, simple sidelobe-canceller beamforming arrays based on the LMS adaptive criterion; then we discuss in some detail various aspects of a recursive least-squares criterion, QR decomposition-based beamforming array. In "Smart Antenna Beamforming Arrays", we consider the motivation and evolution of the smart antenna system and its implications.

79.2 Multimedia Support for General Purpose Computers

In any computer store, it is fairly common nowadays to see rows of computer screens running video and audio clips. Voices, sounds, and moving pictures are attracting large numbers of users who now expect features such as voice mail and videoconferencing as standard parts of their software applications. Advertisements for home computers virtually never miss the buzzwords "multimedia," "interactive," and "Internet." Over time, audio and video, as well as network connectivity, have become integral parts of many user applications. This change in users' expectations has been the cause of a major shift in the operations required of a general-purpose computer. The focus of computer designers and vendors has moved from plain word processing and spreadsheet applications to highly demanding tasks, such as real-time compression and decompression of audio and video streams. CPUs are now required to process large amounts of data fast and to allow different processes to run simultaneously. For instance, one might have several windows open at the same time, one running a streaming video, another playing an audio file, another showing an open Internet connection, etc. In addition, PCs are now expected to provide high-quality graphics, 3-D moving pictures, good-quality sound, and full-motion support (e.g., MPEG-2 decoders). Most computer vendors and processor designers are therefore making huge efforts to alter or enrich their products so as to be able to provide these new multimedia services. In order to better envision the complexity required by some multimedia tasks, consider what is involved in operations such as 3-D image rotation and zooming. Each 3-D object consists of hundreds or even thousands of polygons, usually triangles. The smaller the polygons (and consequently the more numerous), the more detailed the object.
When an object moves on the screen (rotates or moves forward/backward), the program must recalculate every vertex of every polygon by performing a number of matrix computations. A typical 3-D object might have 1000 polygons, which means that each time it moves, at least 84,000 multiplies and adds are executed. The new MPEG-4 audio/video standard (release date November 1998) addressed, among other things, the need for highly interactive functionality, i.e., interactive access to and manipulation of audio-visual data, and content-based interactivity. The standard provides an object-layered bit-stream; each object can be separately decoded and manipulated (scaled, rotated, translated). Even 300-MHz Pentium IIs cannot supply coordinates for more than a million polygons per second. Accelerator cards or graphics chips are often required for high-end, graphics-intensive applications to render polygons faster than the CPU alone can. One of the major drivers of multimedia PCs has been Microsoft's APIs, particularly the recent DirectX interfaces. With DirectX, it is easier to replace audio and video hardware while retaining software compatibility. However, as long as older applications remain alive, multimedia PCs will have to provide compatibility with existing register-level standards, e.g., SuperVGA and SoundBlaster. Register-level interfaces were not designed to support more advanced audio/video features, e.g., 3-D graphics. Most game programs avoid old APIs because of their poor performance; they run directly under DOS and access audio and video chips directly. Microsoft created DirectX, a new set of APIs, in order to avoid having to modify existing programs to support new graphics devices as they appear on the market.
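To make the arithmetic above concrete, here is a minimal C sketch (ours, not the chapter's) of the per-vertex work: each vertex is multiplied by a 4×4 homogeneous transform matrix, which costs 16 multiplies and 12 adds, so a 1000-triangle object (3000 vertices) needs 3000 × (16 + 12) = 84,000 multiplies and adds each time it moves.

```c
/* Sketch of per-frame vertex recalculation: one 4x4 homogeneous
   matrix-vector product per vertex costs 16 multiplies + 12 adds. */

typedef struct { float x, y, z, w; } Vec4;

static Vec4 transform(float m[4][4], Vec4 v) {
    Vec4 r;
    r.x = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + m[0][3]*v.w;
    r.y = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + m[1][3]*v.w;
    r.z = m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + m[2][3]*v.w;
    r.w = m[3][0]*v.x + m[3][1]*v.y + m[3][2]*v.z + m[3][3]*v.w;
    return r;
}

/* multiplies and adds needed to move an n-triangle object once:
   3 vertices per triangle, (16 + 12) operations per vertex */
static long ops_per_move(long n_triangles) {
    return n_triangles * 3 * (16 + 12);
}
```

For 1000 triangles, `ops_per_move` reproduces the chapter's figure of 84,000 operations per object move.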


Multimedia is likely to reach the consumer electronics market also in ways that are still unclear, for instance, in what has been dubbed home entertainment. Vendors envision a gamut of devices which combine the functions of a PC with traditional TV sets. PCs with dual-purpose monitors and set-top boxes will soon offer functionalities ranging from DVD to videophone and interactive 3-D gaming. The PC might become the basis for a living-room entertainment center which includes a TV tuner, DVD, Dolby audio, 3-D graphics, etc. Different companies have chosen different approaches to multimedia and made widely different decisions. Some companies, among them Intel and Sun, have added circuitry to their new-generation processors and enhanced their instruction sets in order to handle multimedia data. Other companies, e.g., Philips and Chromatic, have preferred to build whole new processors from the ground up, solely in charge of audio, video, and network connectivity, which can work either independently or in alliance with another processor. There are also those who have been thinking of thoroughly new architectures and concepts, which could open entirely different perspectives and computation paradigms. In the following, we provide a (necessarily incomplete and temporary) snapshot of the situation as it appears at the end of the 1990s, when the authors are writing. The solutions pursued by the various companies display tradeoffs. Which approach will be the most successful is open to debate and will probably be understood only in hindsight. Nonetheless, some considerations can be made now. The main drawback of external multimedia accelerators is undoubtedly the cost.15 The home PC market has been very sensitive to cost, and most vendors are unwilling to add expensive features which could deter possible customers and therefore reduce their market penetration.
On the other hand, software developers are unlikely to generate applications tailored for multimedia processors and accelerators unless there is a sufficiently large installed base. The alternative to a separate processor or accelerator is to enhance the main CPU to allow it to process multimedia data.23 As of today, few CPUs have enough "horsepower" to simultaneously handle the operating system, all standard applications, as well as audio decoding and mixing, video decoding, 2-D and 3-D graphics, and modem functions. An unenhanced general-purpose CPU such as a Pentium is simply unable to keep up with digitized audio samples and pixel data. Moreover, the multimedia additions translate into larger chips with lower yield. On the other hand, if the CPU is enhanced for multimedia, no optional hardware is required to perform the multimedia tasks, and the installed base can grow naturally, so to speak. Software developers will be more willing to write applications for established platforms (e.g., Pentium), whereas software vendors can target these systems with higher sales expectations. The additional hardware required for multimedia enhancement is usually only a small fraction of chip space, with minimal added cost to the overall system. As processors become faster and smaller, this is the solution that most foresee eventually prevailing, at least in low- to mid-range systems. Already, existing multimedia-enhanced systems show great promise, especially when compared with the cost and performance of add-in chips and boards. What can be said in favor of separate graphics subsystems is that the same money spent on a better graphics accelerator often buys a much larger effective performance improvement (on multimedia applications) than one would get by upgrading the CPU. Users may be more willing to spend money on a system with greater multimedia performance than on a faster number-crunching CPU.
In fact, graphics subsystems have accounted for an increasing percentage of the total cost of multimedia PCs, often at the expense of the central processor. Faster CPUs are still unavoidable where FP geometry calculations and rendering performance are crucial, but geometry calculations are being included in graphics chips as well.

Extended Instruction Set and Multimedia Support for General Purpose Processors

The approach of processing the multimedia data on the CPU itself has been named native signal processing (NSP). This is the approach favored by Intel, Sun, Motorola, Hewlett-Packard, and the makers of the PowerPC line. The rationale is that processors will soon be sufficiently powerful to perform tasks


such as real-time video compression and decompression, video teleconferencing, etc., without the aid of any add-in chips or cards. Tomorrow's host CPUs will be able to execute in software most, if not all, of the tasks that today are handled by media processors and accelerators. Certainly, media processors' performance will increase over time just as much as that of regular CPUs. The question, though, is what is required of a media processor? Multimedia is by definition directly related to human perception. Once a multimedia algorithm reaches the limit of human perception, there is little reason to make further improvements. Of course, users will always be asking for more features, more simultaneously running real-time applications, etc., but it seems reasonable that eventually the drive for performance increases will level off. Both audio (including 3-D sound, up- and down-sampling, 32-channel mixing, Dolby AC-3) and 2-D graphics (GUI acceleration) have already reached the limits of human perception on any high-end Pentium system. Algorithms that still need a good deal of improvement are 3-D graphics and video compression (although for the latter, bandwidth is the key, not human perception). CPUs may soon catch up where multimedia algorithms have not yet reached their limits. Intel's 233-MHz Klamath is able to execute full DVD decoding, including Dolby AC-3 audio and MPEG-2 video, and future processors will only do better. In the past, markets for hardware accelerators have been destroyed by software decoders that run on any fast CPU, e.g., MPEG-1 decoders. NSP requires the expansion of the instruction set architecture (ISA), as well as possible modification of the existing CPU. The additional circuitry for multimedia data processing takes advantage of the fact that most multimedia applications require calculations in lower precision (16- or 32-bit) than provided by the ALUs used for standard integer and floating point (FP) data.
This allows for the generation of two to eight results per clock cycle, thereby offering sufficient data rates for intensive multimedia applications. Fast computation may require additional optimized hardware: faster and possibly split buses, parallel computational units or superpipelines, larger L1 caches, small misprediction penalties, faster CPUs at lower supply voltages, etc. As an example, Cyrix has offered a chip, the MediaGX,16 that delivers Pentium-class performance and software compatibility while adding an integrated graphics accelerator and a PCI interface. The MediaGX improves performance by taking advantage of the tight coupling between CPU and system logic. The effects of high bandwidth demands are minimized by use of advanced compression techniques, which reduce the size of the frame buffer when stored in memory. With this technique, up to 95% of the frame-buffer reads from memory can be eliminated. The MediaGX also provides full compatibility with VGA graphics and SoundBlaster audio. Later generations of this chip also include MMX support and acceleration features for MPEG-1 video, such as pixel scaling and color space conversion.

Intel's MMX

Intel improved its architecture's multimedia performance by extending the instruction set.13 The 57 new instructions, see Table 79.1, are referred to as MMX and are devised to accelerate calculations common in audio, 2-D and 3-D graphics, video, speech synthesis and recognition, and data communication. The overall performance improvement is estimated to be 50 to 100% for most multimedia applications (e.g., MPEG-1 decoding, pixel manipulation). Other companies, such as AMD and Cyrix, have incorporated Intel's MMX. MMX has no impact on the operating system. Applications can take advantage of MMX in either of two ways: by calling MMX-enabled drivers, or by adding MMX instructions to critical routines.
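The chapter does not spell out the MediaGX's actual compression scheme, so as a hedged stand-in for the general idea, the sketch below run-length encodes a frame-buffer scanline: on the largely uniform screens typical of GUI work, long runs of identical pixels collapse into a few (value, count) pairs, so most per-pixel memory reads can be skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative run-length encoder for one 8-bit scanline.
   (Our example only; the MediaGX's real scheme is unspecified here.) */

typedef struct { uint8_t value; size_t count; } Run;

/* Encode n pixels into (value, count) runs; returns number of runs.
   out must have room for n runs in the worst case. */
static size_t rle_encode(const uint8_t *px, size_t n, Run *out) {
    size_t nruns = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && px[j] == px[i]) j++;  /* extend the run */
        out[nruns].value = px[i];
        out[nruns].count = j - i;
        nruns++;
        i = j;
    }
    return nruns;
}
```

A mostly blank scanline of thousands of pixels reduces to a handful of runs, which is the kind of traffic reduction the text describes.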
One advantage of relying on drivers is that an application can automatically take advantage of a hardware accelerator for 3-D graphics, sound, or MPEG decoding if one is installed. MMX instructions can be used in any processor mode and at any privilege level. They generate no new interrupts or exceptions. The eight new MMX registers are mapped onto the existing floating-point registers, so that programs cannot use MMX and FP instructions simultaneously in the same routines. For 3-D graphics, MMX is typically used for 3-D rendering routines, while geometry calculations remain in FP. The new instructions use a SIMD model, i.e., the same instruction operates on many values (eight bytes, four words, or two double words). On a two-way superscalar machine such as the P55C,19 MMX


instructions can be paired with integer instructions or other MMX instructions as long as they do not use the same functional units. One drawback of MMX is the lack of a multiply or multiply-add for 32-bit operands; this choice is probably due to real-estate considerations. This drawback prevents routines such as 3-D geometry calculations and wavetable sound from taking advantage of MMX. The problem of managing separate versions of each application for MMX and non-MMX systems can be solved by checking bit 23 of the CPUID feature flags to determine whether a processor implements MMX. The check can be performed in software. While MMX can boost the multimedia performance of low-end systems with no accelerators, it will probably eliminate media processors and accelerator chips from low- and mid-range systems as processor speeds increase. Intel has announced another extension, KNI, which will consist of 70 new instructions that emphasize parallel FP computations. In the meantime, AMD, Cyrix, and IDT have formally introduced a new x86 instruction set extension for speeding up 3-D graphics applications. Their new set is dubbed 3DNow!4 and consists of instructions that pack two 32-bit FP values into a single MMX register and compute two results in parallel. 3DNow! provides basic add, subtract, divide, multiply, and square-root operations. The 3DNow! single-precision FP format is IEEE-754 compatible, but results computed by 3DNow! do not comply with the IEEE standard; in particular, not all four rounding modes dictated by the IEEE standard are supported. When KNI becomes available in Intel's chips, 3DNow! is likely to be soon forgotten, since no software developer can justify supporting 3DNow! instead of KNI.

Other IS Extensions

Companies other than Intel have proposed instruction set extensions of their own. Sun proposed VIS for its UltraSparc. Similar to MMX, VIS includes 8-, 16-, and 32-bit parallel operations, uses FP registers, and performs saturating and unsaturating arithmetic.
VIS adds some specialized instructions, which accelerate MPEG motion estimation and video compression algorithms (MPEG-1). Some instructions are included to accelerate the discrete cosine transform (DCT), pixel masking, and 3-D rendering. VIS instructions are encoded in three-operand form and operate on 32 registers. The PowerPC ISA is extended by the AltiVec instructions, which in many respects go far beyond their 3DNow! counterparts. AltiVec offers more specialized instructions, such as 2-to-the-power and base-2 logarithm estimation, data permutation, and various multiply-accumulate operations. Overflow and other events are handled more faithfully to the IEEE standards than in 3DNow!, a feature that helps meet the requirements of advanced audio applications. The number of dedicated registers is 32.
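The subword SIMD model shared by MMX, VIS, and AltiVec can be illustrated by emulating an MMX-style packed saturating add (in the spirit of PADDSW) in plain C: four 16-bit words packed in one 64-bit quantity are added lane by lane, each lane saturating independently. This is an illustrative emulation, not vendor code.

```c
#include <stdint.h>

/* Clamp a 32-bit sum into the signed 16-bit range (signed saturation). */
static int16_t sat16(int32_t v) {
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Emulate a packed saturating add: four independent 16-bit lanes
   inside a 64-bit "register", as in MMX's SIMD subword model. */
static uint64_t paddsw_emulated(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        int16_t wa = (int16_t)(a >> (16 * i));
        int16_t wb = (int16_t)(b >> (16 * i));
        r |= (uint64_t)(uint16_t)sat16((int32_t)wa + (int32_t)wb) << (16 * i);
    }
    return r;
}
```

On real hardware all four lane additions happen in one instruction, which is how the extensions deliver two to eight results per clock on low-precision media data.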

Multimedia Processors and Accelerators

Media processors are whole new processors that can handle multimedia data efficiently.9,12,14 By definition, they are programmable processors which can simultaneously accelerate the processing of multiple data types, including digital video, digital audio, computer animation, text, and graphics. They are required to offer high performance on these tasks, while the CPU is allowed to concentrate, uninterrupted, on other applications, be they spreadsheets or word processors. Media processors combine the advantages of hardwired solutions, such as low cost and high performance, with the flexibility of software implementations. They also eliminate the need for a number of different audio/video subsystems (3-D acceleration, MPEG playback, wavetable audio, etc.). There is no unique way to build a media processor.26 The first ones to reach the market (NVidia NV1 and Philips TriMedia TM-1) combined powerful DSPs for 2-D and 3-D acceleration, audio (channels of CD-quality sound), possibly MPEG decoding, and broadband demodulation. The Philips TM-1 also has units for video preprocessing (variable-length decoding, image scaling, and color space conversion). Other makers, notably Chromatic with its MPact Media Engine, preferred a single powerful processor that performs audio, video, and telephony tasks.20 The MPact2 also adds DVD support,


TABLE 79.1 The MMX Opcodes

Group                        Mnemonic              Description

Data Transfer, Pack, Unpack  MOV[D,Q]              Move [double/quad] to/from MMX reg
                             PACKUSWB              Pack W → B with unsigned sat
                             PACKSS[WB,DW]         Pack W → B, D → W w/signed sat
                             PUNPCKH[BW,DW,DQ]     Unpack high-order [B,W,D] from MMX reg
                             PUNPCKL[BW,DW,DQ]     Unpack low-order [B,W,D] from MMX reg
Arithmetic                   PADD[B,W,D]           Packed add on [B,W,D]
                             PADDS[B,W]            Saturating add on [B,W]
                             PADDUS[B,W]           Unsigned saturating add on [B,W]
                             PSUB[B,W,D]           Packed subtraction on [B,W,D]
                             PSUBS[B,W]            Saturating sub. on [B,W]
                             PSUBUS[B,W]           Unsigned sat. sub. on [B,W]
                             PMULHW                Multiply packed words, get high bits
                             PMULLW                Multiply packed words, get low bits
                             PMADDWD               Multiply packed words, add pairs of products
Shift                        PSLL[W,D,Q]           Packed shift, left, logical
                             PSRL[W,D,Q]           Packed shift, right, logical
                             PSRA[W,D]             Packed shift, right, arithmetical
Logical                      PAND                  Bitwise AND
                             PANDN                 Bitwise AND NOT
                             POR                   Bitwise OR
                             PXOR                  Bitwise XOR
Compare                      PCMPEQ[B,W,D]         Packed compare if equal
                             PCMPGT[B,W,D]         Packed compare if greater than
Misc                         EMMS                  Empty MMX state

Brackets indicate options. B = byte. W = word. D = double word. Q = quad word.

videophone operation, as well as MMX support to reduce host-processor overhead.25 The specifics regarding clock speed, core performance, etc., are given in Tables 79.2 and 79.3. Most media processors use VLIW and SIMD techniques to achieve high performance. These techniques are summarized in a later section. Among the issues that designers have to face are the following:

• The load balance between the media processor and the host CPU. Some media processors require significant preprocessing on the host CPU, which can cause a noticeable slowing of the system. For instance, MPEG-1 decoding with a high-resolution display can consume as much as half the processing power of a 100-MHz Pentium when Chromatic's MPact1 is used.
• Compatibility with legacy code, such as SoundBlaster and VGA emulation. Some makers guarantee some kind of compatibility, while others (e.g., Philips) do not.
• Software flexibility. Some designs are reprogrammable and thus have the advantage of being upgradable as audio, video, and graphics standards change and improve.
• Connections to outside devices. Media processors should also provide connections to PCI, an external DAC, and an audio codec.

Software development for the different media processors usually depends on both makers and outside vendors (with the notable exception of Chromatic, which did not make it possible for anyone else to develop software for MPact). Makers usually offer compilers and code libraries for the most common multimedia algorithms, while outside partners and OEMs develop custom software. All vendors of media processors are more or less at Microsoft's mercy for the APIs needed to run their processors under Windows.


Not all media accelerators are so ambitious as to target the complete multimedia market. Indeed, there has recently been a trend for media processors to specialize, and many companies have narrowed their scope to specific tasks. To name a few: C-Cube's Video RISC processor performs variable-length coding, data compression, and motion estimation for use in MPEG-1 and MPEG-2 encoding by digital satellite TV broadcasters; Mitsubishi's D30V includes audio and video circuitry and a variable-length decoder; Fujitsu's multimedia assist (MMA) integrates a graphics controller and audio interfaces and can be applied to DVD; Rendition's Vérité V1000 is a 32-bit RISC processor acting as a programmable 3-D accelerator, used for antialiasing algorithms and special effects; NVidia's NV3, succeeding the unsuccessful NV1, gives up audio support and programmability in favor of 3-D performance. Many makers have included special DVD playback support (Chromatic's MPact2, Digital's SA-1500, as well as Mitsubishi's D30V). Philips has focused on videoconferencing systems, which include echo cancellation, voice-tracking camera control, a Web server, and GUI programs.

TABLE 79.2 Comparison of Some Media Processors

                          NVidia     Chromatic       Chromatic
                          NV1        MPact R/3000    MPact 2/6000

Clockspeed (MHz)          50         62.5            125
Peak Int. Perf. (GOPS)    0.35       3.0             6.0
Peak FP Perf. (GFLOPS)    n/a        n/a             0.5
Memory Bandwidth (MB/s)   100-200    500             1200
3-D Geometry (Mpolys/s)   n/a        n/a             1.0
3-D Setup (Mpolys/s)      n/a        n/a             1.2
3-D Fill Rate (Mpel/s)    n/a        5               42
2-D Acceleration          Yes        Yes             Yes
MPEG-1 decode             No         Yes (S/W)       Yes (S/W)
MPEG-1 encode             No         Yes (H/W)       Yes (H/W)
MPEG-2 decode             No         Yes (S/W)       Yes (S/W)
Videoconferencing         No         Yes             Yes
Telephony                 No         Yes             Yes

                          Philips    Philips    Samsung
                          TM-1       TM-PC      SMP-1G

Clockspeed (MHz)          100        100        160
Peak Int. Perf. (GOPS)    3.8        3.8        10.2
Peak FP Perf. (GFLOPS)    0.5        0.5        1.6
Memory Bandwidth (MB/s)   400        400        1280
3-D Geometry (Mpolys/s)   0.75       0.75       0.75
3-D Setup (Mpolys/s)      1.0        1.0        0.75
3-D Fill Rate (Mpel/s)    n/a        n/a        n/a
2-D Acceleration          No         Yes        Yes
MPEG-1 decode             Yes (S/W)  Yes (S/W)  Yes (S/W)
MPEG-1 encode             Yes (S/W)  Yes (S/W)  Yes (S/W)
MPEG-2 decode             Yes (S/W)  Yes (S/W)  Yes (S/W)
Videoconferencing         Yes (S/W)  Yes (S/W)  Yes (S/W)
Telephony                 Yes        Yes        Yes

Recently, a number of companies have announced 3-D chips, among others 3DLabs, ATI, Number Nine, NVidia, S3.10,11 3-D chips have to face two main issues to provide the required performance levels: computational resources and memory bandwidth. The calculations necessary for 3-D consist of scene management (preparation of a 3-D object database along with information on light source and virtual camera), geometry calculations (transforms and setup), and rendering (shading and texturing applied to the pixels) (see Table 79.4). The host processor is usually in charge of scene definition, whereas rendering is handled by 3-D graphics hardware. As for geometry calculations, they are usually split equally


TABLE 79.3 Comparison of Some Media Processors (continued)

                          Fujitsu    Mitsubishi   Rendition
                          MMA        D30V         Vérité

Clockspeed (MHz)          180        250          50
Peak Int. Perf. (GOPS)    1.08       1.0          0.1
Peak FP Perf. (GFLOPS)    n/a        n/a          n/a
Memory Bandwidth (MB/s)   720        n/a          400
3-D Geometry (Mpolys/s)   n/a        n/a          n/a
3-D Setup (Mpolys/s)      n/a        n/a          0.15
3-D Fill Rate (Mpel/s)    n/a        n/a          25
2-D Acceleration          n/a        n/a          Yes (H/W)
MPEG-1 decode             Yes (S/W)  Yes (S/W)    No
MPEG-1 encode             No         No           No
MPEG-2 decode             Yes (S/W)  Yes (S/W)    No
Videoconf./Teleph.        n/a        n/a          n/a

between the host processor (transforms) and the graphics chips (setup). Thanks to the evolution of 3-D chips, the host processor is often relieved of many of these tasks. The result is higher polygon throughput and better visual quality. The performance goal that all makers have tried to reach, and exceed, is a million polygons per second. The increased bandwidth demands are met by on-chip texture caches and faster, wider local memory arrays in the graphics subsystems. Memory bandwidth demands can be reduced by applying texture compression, usually based on vector quantization algorithms for still images. The bandwidth problem will also require larger internal SRAM arrays for texture caches and embedded DRAM for on-chip frame buffers. Off-chip memory will still be necessary for larger frame buffers. An on-chip setup engine and motion compensation logic may be added, as in ATI's RagePro design, to accelerate MPEG-2 decoding and DVD playback.
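The vector-quantization texture compression mentioned above can be sketched as follows (our example; actual codecs are vendor-specific): each small block of texels is replaced by the index of its nearest codebook vector, so a 4-byte block of texels shrinks to a 1-byte index.

```c
#include <stddef.h>

/* Illustrative vector-quantization encoder for texture blocks.
   BLOCK texels form one vector; the encoder emits only the index
   of the nearest codebook vector (squared Euclidean distance). */

#define BLOCK 4  /* texels per vector (hypothetical block size) */

static int nearest_codeword(const unsigned char block[BLOCK],
                            unsigned char (*codebook)[BLOCK],
                            size_t n_codewords) {
    size_t best = 0;
    long best_d = -1;
    for (size_t i = 0; i < n_codewords; i++) {
        long d = 0;
        for (int j = 0; j < BLOCK; j++) {
            long diff = (long)block[j] - codebook[i][j];
            d += diff * diff;  /* accumulate squared distance */
        }
        if (best_d < 0 || d < best_d) { best_d = d; best = i; }
    }
    return (int)best;
}
```

Decompression is just a table lookup into the codebook, which is why VQ suits hardware texture units: the expensive search happens once, offline, when the still image is compressed.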

TABLE 79.4 Basic 3-D Pipeline: Scene → Geometry → Rendering

Scene                       Geometry           Rendering

Database of 3-D Objects     Projection         Shading
Location of Light Source    Clipping           Texturing
Position of Virtual Camera  Slope Calculation  Remove Invisible Pixels
                                               Rasterizing

Architectures and Concepts for Multimedia Processors

Many, if not all, multimedia applications must perform very demanding computations. Custom architectures that support digital signal processing can satisfy the computational needs of these applications, but they usually lack the flexibility required in an environment where standards and algorithms change continuously. Programmable devices, such as traditional DSPs, in turn lack the computing power and bandwidth necessary to perform more than one multimedia task at a time. Media processors narrow the gap between DSPs and general-purpose processors by extending the instruction-level parallelism (ILP) of traditional DSPs to a general, very long instruction word (VLIW) architecture.24 Whereas superscalar processors rely on hardware dispatch units to dynamically schedule operations and evaluate data dependences, VLIW architectures rely on the compiler to statically schedule instructions at compile time. Most media processors also take advantage of SIMD concepts in order to improve computational throughput. Both VLIW and SIMD concepts are reviewed in the following.17

The VLIW Architecture

Very long instruction word (VLIW) architectures3,5,7,8 are one of two categories of multiple-issue processors, the other being referred to as superscalar architecture. Superscalar architectures issue more


than one instruction per clock; instructions can be either statically scheduled at compile time or dynamically scheduled using various scoreboarding techniques or Tomasulo's algorithm. VLIWs, in contrast, are statically scheduled by the compiler, and fixed-size instruction packets are issued at each clock cycle. In a superscalar processor,18 the functional units are structured so that a number of instructions, typically one to eight, can be scheduled simultaneously per clock tick. An example is an architecture where one integer and one floating-point instruction can be issued together. A sufficient number of read and write ports has to be provided to avoid structural hazards. Of course, simultaneously issued instructions cannot be interdependent. Moreover, no more than one memory reference can be issued per clock. Other restrictions come from latencies inherent in memory accesses or branches; e.g., it may be required that the result of a load instruction not be used before a number of clock cycles have elapsed. When dynamic scheduling is employed, a superscalar processor requires dedicated hardware to provide for hazard detection, reservation tables, queues for loads and stores, and possibly an out-of-order execution scheme. Whenever branch behavior is unknown or unpredictable, the architecture has to allow for conditional or predicated instructions, which can eliminate the need for certain branches altogether, and for the use of speculation, whereby an instruction can be issued before it is known whether the instruction should really execute. A number of additional registers is typically required to support these mechanisms. Additionally, bits may have to be added to registers and instructions to indicate whether an instruction is speculative. In summary, a superscalar architecture may require a significant amount of additional hardware with respect to a single-issue processor, with the consequent cost of a larger die area and complex circuitry.
The advantage is a low CPI (clocks per instruction), obtained with good efficiency (for instance, hardware-based branch prediction is usually superior to software-based prediction done at compile time). Another great advantage of superscalar architectures is compatibility with legacy code: unscheduled programs, or those compiled for single-issue processors, can be run on a superscalar processor. VLIWs use multiple independent functional units. Instructions are packed into a very long instruction word. The burden of ordering the instructions and packing them together falls on the compiler, and no hardware is needed to make dynamic, run-time scheduling decisions. In order to keep all functional units as busy as possible, there must be enough parallelism in the application. Not all inherent parallelism is immediately available; some has to be extracted by various software techniques, including loop unrolling and software pipelining.1,2,21 Instructions may have to be moved even across branch boundaries, always maintaining the correct dependences among data. Some additional hardware is also required of a VLIW processor due to the increase in memory bandwidth and register-file bandwidth. As the number of data ports increases, so does the complexity of the memory system. The VLIW compiler is required to perform very aggressive transformations on the original code, and it generates output code that is usually much larger than that of traditional processors, e.g., due to loop unrolling. Memory size can be reduced by the use of compression techniques in main memory and/or in cache. Object code compatibility is obviously the main drawback. For many types of applications, one cannot guarantee that all functional units will be used efficiently, depending on the available instruction-level parallelism (ILP) of the application itself. Moreover, branch mispredictions can cause significant performance loss.27 Multimedia applications have characteristics which make them more suitable for VLIW processors.
DSP applications, as well as multimedia applications, display abundant amounts of available ILP, thanks to the highly regular structure of the algorithms implemented, i.e., matrix-vector manipulations or discrete cosine transforms. Not surprisingly, DSP architectures were among the first programmable devices to rely on long instruction words (LIWs) to improve parallelism.22 Signal processing applications are characterized by

• frequently recurring combinations of instructions, e.g., the instruction pair multiply-and-accumulate (very commonly used in filtering and linear transform algorithms)
• low-overhead counted loops, which can be supported by simple decrement-and-branch instructions
• multiple accesses to memory in a single cycle; memory which supports multiple accesses is usually partitioned or interleaved.
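The multiply-accumulate pattern at the heart of these workloads is easy to see in a direct-form FIR filter. The following pure-Python sketch uses made-up taps and samples; on a DSP, each inner-loop iteration would map to a single fused multiply-accumulate, and the counted loops to decrement-and-branch instructions.

```python
# Direct-form FIR filter: each inner-loop step is one multiply-accumulate
# (MAC), the instruction pair that DSPs execute in a single cycle.
def fir(h, x):
    """Convolve input samples x with filter taps h (zero initial state)."""
    y = []
    for n in range(len(x)):               # counted outer loop
        acc = 0.0                         # accumulator register
        for k in range(len(h)):           # counted inner loop
            if n - k >= 0:
                acc += h[k] * x[n - k]    # multiply-accumulate
        y.append(acc)
    return y

# 2-tap averaging filter over a short ramp
print(fir([0.5, 0.5], [2.0, 4.0, 6.0]))   # → [1.0, 3.0, 5.0]
```

A compiler targeting a VLIW or DSP would unroll the inner loop and schedule several of these MACs per cycle across the available functional units.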

© 2000 by CRC Press LLC

Typical DSP and multimedia applications include parallel and independent computations, e.g., computations of different pixel values, which allow one to extract considerable ILP. Among the most common algorithms encountered in these applications are filtering (FIR, IIR), frequency transforms (Fourier, DCT), and pixel computations. DSP hardware has traditionally been hard to use, programmed through libraries and hand-tailored code. The principle, for DSPs then as for VLIWs now, is that the processor provides some parallelism and the programmer (or compiler) must provide code that matches the CPU's ILP as closely as possible. The first VLIWs were built with the intention of bridging the gap between general-purpose processors and DSPs, keeping in mind that compiler technology has to be as efficient as possible in uncovering the application's ILP. VLIWs now combine the cost/performance ratio of DSP processors with the reprogrammability of general-purpose processors. The hardware can provide parallelism in any of the following ways:

• by allowing operations to execute simultaneously on all functional units; sufficient memory and register-file bandwidth are required to reduce potential structural hazards
• by replicating functional units, e.g., by having two integer units
• by pipelining longer operations, e.g., FP operations, so that one can be initiated at every clock cycle

The application's ILP can be uncovered by using any or all of the following techniques:

• Trace scheduling. The major barrier to exposing ILP was long represented by the so-called "basic blocks," i.e., the single straight-line segments of code between branches or jumps.6 Compilers used to schedule operations within basic blocks and to stop any scheduling upon encountering a jump or branch. The most ILP that can be gained by basic block scheduling is approximately a factor of two. More ILP can be exposed if instructions are allowed to move across basic block boundaries.
Trace scheduling is based on loop-free sequences of basic blocks, i.e., paths through the program that could conceivably be taken by some set of input data. Traces are selected and scheduled according to their frequency of execution, i.e., according to how likely they are to be taken. In this scenario, the compiler can freely move operations between basic blocks, irrespective of branches. Scheduling proceeds from the most to the least frequently executed traces, until the whole program is scheduled. Instructions can be moved only if all data dependences are guaranteed to remain unchanged. In general, it may not be possible to determine all data dependences at compile time (it is actually an NP-complete problem), so the compiler, with no other indications from the programmer, has to be conservative in its decisions. Trace scheduling can be successfully applied in conjunction with software pipelining and loop unrolling.1
• Loop unrolling. This is done by simply replicating the loop body a given number of times, erasing useless intermediate checks, and adjusting the loop termination code. Instructions from different loop iterations can now be scheduled together.
• Software pipelining. This technique reorganizes loops so that each iteration in the software-pipelined code is chosen from different iterations of the original loop. In a sense, instructions from different iterations are interleaved with one another. The loop thus runs at maximum overlap, or parallelism, among instructions, except for start-up and wind-down overheads. Often it is necessary to combine software pipelining and loop unrolling, depending on the specific application.

SIMD and Subword Parallelism

Most media processors combine the VLIW architecture with another form of parallelism, usually referred to as single-instruction, multiple-data (SIMD), or subword, parallelism. SIMD instructions affect multiple pieces of data, compacted in a longer word.
A single SIMD instruction has the same effect as many identical instructions on as many data items. ILP can be exploited by packing many lower-precision data items into a large container. The ALU is set up to perform identical operations on every subword of the input. The additional hardware required to implement SIMD operations is

• more control logic for the ALU to perform parallel operations
• more opcodes and the corresponding decoding logic
• packing, unpacking, and alignment circuitry

SIMD uses most of the existing ALU architecture to execute multiple instructions in parallel. Moreover, it uses the same register file port to read and write several distinct values. On the other hand, packing and unpacking penalties have to be paid when single items need to be accessed individually. In addition, SIMD instructions require rewriting of the application to identify the ILP and exploit it with appropriate library functions that use SIMD instructions. Many media processors try to take advantage of both approaches by implementing a VLIW architecture with some functional units supporting packed (SIMD) arithmetic.

Architecture Issues

A VLIW/SIMD processor has specific architectural characteristics, different from traditional processors.
• A larger number of registers helps when large amounts of ILP are exposed. Many registers reduce memory traffic. Moreover, the following techniques take advantage of a higher number of registers:
– Speculative execution: speculative results must be kept until committed or discarded.
– Loop unrolling and software pipelining: N iterations of the same loop, unrolled or interleaved, require more registers.
– Predication: values computed along both paths of a branch must be kept until the branch is resolved.
These effects are mitigated by SIMD, where a single register contains many smaller-precision values.
• VLIW requires a higher number of data ports and a higher bandwidth. Data ports are costly additions because
– the area of the register file grows quadratically with the number of ports, and
– the read access time of a register file grows linearly with the number of ports.
Again, a SIMD approach reduces the need for ports, since a single register may contain several data items.
• The register file has to be appropriately partitioned into separate register banks. Hardware must provide for moving data from bank to bank, when needed.
• SIMD operations must be supported in the following ways:
– The hardware must support addressing individual subwords of a register: parts of a register may be read or written without affecting the rest.
– The hardware must support executing lower-precision operations on subwords.
– Parts of a register may have to be shifted and realigned.
• The memory architecture has to be organized in such a way as to reduce latencies to a minimum. General-purpose processors make large use of data caches, which are very effective for spatially or temporally localized data whose access pattern and size are unpredictable. Data memory accesses fall into two categories:
– Memory accesses with spatial locality, but little temporal locality, e.g., regular accesses to data structures, matrices, and vectors, which are usually traversed only once, sequentially.
– Memory accesses with temporal locality, but little spatial locality, e.g., look-up tables, interpolation tables, etc., which are repeatedly accessed in a data-dependent, nonsequential fashion.
Caches are usually characterized by miss penalties at least one order of magnitude higher than hit times. DSP applications display widely different data locality characteristics. Many data streams, such as audio and video, have a high degree of spatial locality, whereas other data accesses have temporal locality. Typically, large data structures with only spatial locality tend to evict the data items that could benefit from temporal locality. For DSP applications, the large mismatch between miss and hit penalty makes the worst-case performance unacceptable for most applications, especially if
there are real-time requirements. For these reasons, VLIW processors, analogously to DSPs, usually make use of local memories. Local memories are mapped to an address space that is separate from main memory. The use of local memories requires explicit compiler support or programmer intervention in order to guarantee known and constant access times. The programmer may help the compiler manage memory accesses by use of either of the following techniques:
– platform-specific language extensions, not supported by the native language, or
– compiler annotations, i.e., annotations that can be safely ignored by the compiler but, if used, help identify which data can be accessed in local memory. These annotations can be used in high-level languages and do not affect code compatibility.
Local memories are of little use when large data structures are used a few times at most and are traversed sequentially. For these cases, it is best to use
– fast-access, off-chip memory, either expensive SRAM or slower DRAM; or
– a pre-fetch buffer, which works very well when the access pattern is highly predictable.
• Most DSP algorithms spend a large amount of time in simple loops, which are usually repeated a constant and predictable number of times and do not require complex control logic. Traditionally, DSPs implement very efficient and simple ways of handling the control flow of simple loops.
• VLIW code can be quite large because of the following:
– Compilation techniques make use of loop unrolling, software pipelining, etc., which cause instructions to be replicated to allow code motion across basic block boundaries.
– Not all VLIW instructions can be filled with useful operations, because of data dependences and because not all functional units can be continuously busy. This means that many of the VLIW instruction fields are bound to be no-operations (NOPs).
For these reasons, it is common to compress the VLIW code in order to reduce the storage space.
Due to the sparse nature of VLIW instructions, there usually is room for compression. Code is kept compressed in main memory and is decompressed before execution. There are two main approaches to compression:
– The instruction cache contains uncompressed code. In this case, decompression is not on the critical path and matters only during cache misses. The cache is less effective, since it must contain uncompressed instructions. Moreover, the cache cannot have instruction addresses synchronized with main memory, because main memory stores variable-length instructions while the I-cache stores fixed-size instructions. The decoding scheme must therefore be more complex.
– Main memory and the I-cache are both compressed. The cache utilization is optimal, and the address translation mechanism from main memory is simple. The disadvantage is having decompression on the critical path, which affects the hardware cost.
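The subword (SIMD) arithmetic described earlier in this section can be emulated in software, which makes the hardware mechanism explicit. The sketch below adds four 8-bit lanes held in one 32-bit word in a single pass, using masking to keep carries from leaking across lane boundaries — the same isolation a packed-add ALU provides directly. All names are illustrative, and Python integers stand in for 32-bit registers.

```python
# Subword (SWAR) addition: four 8-bit lanes packed in one 32-bit word,
# added without any carry crossing a lane boundary.
MASK_HI = 0x80808080          # the top bit of every 8-bit lane
MASK_LO = 0x7F7F7F7F          # the low 7 bits of every lane

def paddb(a, b):
    """Lane-wise 8-bit add (wraparound); a, b are packed 32-bit words."""
    low = (a & MASK_LO) + (b & MASK_LO)   # low 7 bits; carries stop at bit 7
    return low ^ ((a ^ b) & MASK_HI)      # fold the top bits back in, mod 2

def pack(bytes4):
    """Pack four byte values into one 32-bit word, big-endian."""
    w = 0
    for v in bytes4:
        w = (w << 8) | (v & 0xFF)
    return w

def unpack(w):
    """Split a 32-bit word back into its four byte lanes."""
    return [(w >> s) & 0xFF for s in (24, 16, 8, 0)]

print(unpack(paddb(pack([10, 200, 255, 1]), pack([5, 100, 2, 3]))))
# → [15, 44, 1, 4]   (note 255 + 2 wraps around to 1 within its lane)
```

A hardware packed-add instruction performs exactly this lane-isolated addition in one ALU pass, with no masking overhead.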

79.3 Beamforming Array Processing and Architecture

Beamforming array concepts were originally formulated over 40 years ago to tackle demanding defense-oriented aerospace and underwater sonar, radar, and communication system problems. These problems combine signal processing with antenna technologies. They involve the use of phase-coherent transmission or reception, utilizing array sensors selectively to enhance the gain in some space-polarization-time-frequency domain and to reject intentional or unintentional jamming. In recent years, there has been an explosion in civilian mobile communication usage, mainly in the form of a large number of cellular telephone users competing over a limited number of frequency bands. Proper use of beamforming arrays can increase the number of users as well as locate a user's cell phone under an emergency 911 condition. The various proposed array processing systems depend on the availability of modern digital signal processors, which range from high-end programmable DSPs and FPGAs to custom single-chip and wafer-scale-integration VLSI chips. As the cost of these processors decreases rapidly while their capability increases dramatically, “smart antenna” beamforming array techniques are under serious consideration for practical communication system implementations.

Interference Rejecting Beamforming Antenna Arrays

In order to understand the rationale for beamforming antenna arrays, we briefly consider the various possible types of antennas. An omnidirectional (also called isotropic) antenna has equal gain in all directions. A directional antenna has more gain in certain directions relative to others; thus, it must be moved mechanically to point toward the direction of interest for either transmission or reception. A phased antenna array combines the signals from simple (and possibly omnidirectional) antennas appropriately to achieve a high gain in some desired direction. The direction of maximum gain is adjustable by controlling the phases among the antennas so that the voltages add coherently, in phase. An adaptive antenna array is a phased antenna array in which the gain and the phase of each antenna may be changed as a function of time, depending on external conditions. As an example, a receiving adaptive antenna array not only directs a high gain toward a desired transmitter, but may also adaptively place spatial nulls (or low gains) toward moving co-channel interfering sources operating in the same frequency band. An antenna array is said to be optimal if it adjusts the gains and phases of the antenna elements to optimize the array performance. Typical performance measures of interest are the signal-to-noise ratio (SNR) and the signal-to-interference-and-noise ratio (SINR) of the array. A plot of the gain vs. the angle of an antenna array is called the beam pattern. The process in which signals from different antennas in an array are added coherently is called beamforming. The steering of the direction of maximum gain of an array can be achieved by mechanical means or by changing the gains and phases of the antennas, to achieve electronic beam steering.
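The beam pattern and electronic steering just described can be sketched numerically. The example below assumes an 8-element, half-wavelength-spaced uniform linear array (the function name, element count, and angles are illustrative choices): phase-only weights align the element responses toward a chosen steering angle, and the normalized gain is evaluated at a trial arrival angle.

```python
# Beam pattern of an N-element uniform linear array with half-wavelength
# spacing, electronically steered by phase weights alone (narrow band).
import cmath
import math

def array_gain(theta_deg, steer_deg, n=8):
    """Normalized gain |sum of phase-aligned element responses| / n."""
    u = math.sin(math.radians(theta_deg))    # arrival direction
    u0 = math.sin(math.radians(steer_deg))   # steered direction
    # element k sees phase k*pi*u; the weight conjugates the steered phase
    s = sum(cmath.exp(1j * math.pi * k * (u - u0)) for k in range(n))
    return abs(s) / n

print(round(array_gain(30.0, 30.0), 3))   # on the steered beam: gain 1.0
print(round(array_gain(-10.0, 30.0), 3))  # off the beam: much smaller
```

Changing only `steer_deg` moves the direction of maximum gain with no mechanical motion, which is exactly the electronic beam steering described above.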
Beam steering for a narrow band array needs only phase shifters, but for a wide band array, both the gain and the phase of each antenna element (typically implemented in the form of an FIR filter) need to be controlled. For a narrow band array, coaxial cables of different lengths can be used for phase shifting; by switching among different cable lengths, beam switching can be realized. The earliest adaptive antenna array system is the sidelobe canceller (SLC), motivated by practical radar-jamming problems. This system consists of a high-gain mainbeam dish antenna surrounded by a few auxiliary low-gain omnidirectional antennas. A strong jammer appears on all the auxiliary and mainbeam antennas. By using appropriate complex weights on all the antennas, the jamming signal at the output of the auxiliary antennas is coherently subtracted from the sidelobe of the main antenna, thus permitting the weak desired signal in the mainbeam to be detected. This scheme, often called the Howells-Applebaum algorithm,28 was first proposed in the 1960s and was implemented using analog components. It turns out that the Widrow-Hoff algorithm29 for adaptive filtering, conceived independently at about the same time, is essentially equivalent but has broader applications. It is called the least-mean-square (LMS) algorithm and is an approximate steepest-descent gradient search algorithm. As such, its convergence may be slow if the system eigenvalue spread is large. However, due to its relative simplicity, most simple adaptive antenna systems are still based on variations of the LMS algorithm. A more complex but faster-converging adaptive antenna array approach is to use the least-squares (LS) method to solve for the weights W of the linear system of equations XW ≈ d. Here, X is an M × N matrix whose columns hold the inputs of the N antenna elements over M time samples, d is an M × 1 known reference (desired-response) vector, and W is the N × 1 vector of unknown weights needed in the adaptive system.
Various QR decomposition (QRD) methods, such as the Gram-Schmidt, Modified Gram-Schmidt, Givens, and Householder algorithms, can all be used to solve the LS problem. In practical complex antenna systems, N can be in the low hundreds, and M can be in the thousands. Clearly, a direct brute-force block LS solution is not feasible. A simple form of parallel processing called systolic processing, using only a small number of processing elements (PEs) with nearest-neighbor connections and operating in a synchronous manner, has been proposed to solve the QRD approach to the recursive least-squares (RLS) problem. This approach was originated by McWhirter30 using the Givens algorithm, and variations have been proposed by Ling-Proakis,31 Kalson-Yao,32 Liu-Yao,33 and others. Issues related to the complexity of the algorithms and the practical VLSI implementation of such arrays have been considered by Rader,34 Bolstad-Neeld,35 and Lightbody et al.36

Smart Antenna Beamforming Arrays

The explosion of cellular telephony and wireless communication services has motivated the consideration of various technologies to increase their efficiency. These systems, which must support a large number of users with increasing data-rate demands while operating in difficult propagation conditions, have found “smart antennas” to be highly useful. The term “smart antenna” commonly denotes the use of an RF antenna array system with advanced signal processing techniques to perform space-time operations. While smart antennas bear some similarity to the advanced adaptive beamforming antenna systems considered earlier for radar applications, there are also many differences. Basic issues of beamforming for signal enhancement, noise reduction, and interference rejection are still of interest. While the earlier radar systems operate in the higher X frequency band, the wireless systems operate in the lower 800-MHz and 2.4-GHz bands, with significantly more multipath and fading problems. Perhaps most important is the fact that we have to deal with both uplink and downlink communications. Various advanced modulation and coding operations not needed in radar systems are present now. With code-division multiple-access (CDMA) communication, the bandwidths encountered are wider than in previously encountered systems. Furthermore, hand-held communication devices have limited power, volume, and dimensions for the placement of several antennas. All these harsh conditions impose demanding requirements on the smart antenna. A smart antenna may possess full adaptive beamforming capability, or it may only be able to switch among a small number of fixed beams. The switched-beam system offers some of the advantages of a more elaborate smart antenna with reduced complexity and cost. However, for greater flexibility and performance, a smart antenna must utilize adaptivity.
In an adaptive system, the weight vector is adjusted adaptively to minimize some criterion. In the minimum mean-square error (MMSE) criterion, statistical averaging of the observed quantities is used, while in the LS criterion, the observed time samples are used directly. Variations of the stochastic gradient method are used to solve the MMSE problem. For both of the above approaches, in order to know the desired output of the spatial filter, training sequences known to both the transmitter and the receiver are often used. An alternative approach is to use decision-directed adaptation, where the desired signal sample is estimated from the (generally correctly) decided symbols. Still another approach uses a blind adaptive algorithm, which does not require training sequences; this approach exploits various properties of the transmitted signals. Two other approaches have been used for an adaptive antenna array system. The maximum-SNR criterion seeks to maximize the ratio of the power of the desired signal to the noise power at the output of the array. The linearly constrained minimum variance (LCMV) criterion minimizes the variance of the output of the array subject to linear constraints, such as constraints in the directions of the interferences. Both of these methods must know the direction-of-arrival (DOA) of the signal of interest, and both result in the solution of a linear system of equations commonly termed the normal equations. Direct solution as well as various LS approximation methods and RLS methods are possible. One important application of a smart antenna is finding the DOA, and possibly the location, of a source. The DOA problem has been an ongoing problem of interest in aerospace and avionics for many years. However, due to the recent FCC mandate requiring a cellular telephone system to locate each user to 125-meter accuracy, there is increasing interest in this area.
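The training-sequence adaptation described above, driven by a stochastic-gradient (LMS) update, can be sketched as follows. This is a toy real-valued two-element example with made-up snapshots: the training sequence is taken to be the element-0 signal, so the weights should converge toward passing element 0 and nulling element 1.

```python
# LMS weight adaptation against a known training sequence: at each
# snapshot, form the array output, compare with the training symbol,
# and take a gradient step on the weights.
def lms(snapshots, training, mu=0.2):
    """Adapt weights w so that w . x[t] tracks the training symbol d[t]."""
    w = [0.0] * len(snapshots[0])
    for x, d in zip(snapshots, training):
        y = sum(wi * xi for wi, xi in zip(w, x))         # array output
        e = d - y                                        # error vs training
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]   # gradient step
    return w

# two-element array; train toward the element-0 signal
xs = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]] * 15
ds = [x[0] for x in xs]
w = lms(xs, ds)
print([round(wi, 3) for wi in w])   # converges toward [1.0, 0.0]
```

In a real system the snapshots and weights are complex-valued, and the step size mu trades convergence speed against misadjustment, with the eigenvalue spread of the input covariance limiting how fast the slowest mode converges.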
Many techniques have been proposed to tackle the DOA and localization problem. Conventional methods use the beamforming and nulling capabilities of an array to determine the DOA of a source in the presence of noise and interference. While these techniques have been used in many practical systems, there are fundamental resolution limitations, which correspond closely to the frequency resolution limitations of the classical DFT/FFT method in spectral analysis. Just as there are various modern parametric techniques in spectral analysis,
so too are there many parametric techniques for DOA and localization. Most of the modern techniques are based on signal subspace methods. All subspace methods exploit some signal/noise structure in the modeling of the problem and use the eigenvalue decomposition (EVD) or singular value decomposition (SVD) as computational tools. The earliest and most well-known subspace method is the MUltiple SIgnal Classification (MUSIC) technique. The MUSIC method exploits the narrow band assumption of the input sources and uses an EVD of the input covariance matrix to determine the DOAs of the sources. Variations and extensions of the MUSIC method have been proposed. There are also many maximum-likelihood (ML) methods that have been considered to address the DOA and localization problem; these methods are all computationally costly. Most recently, various joint space-time array processing techniques based on advanced linear algebraic methods have been proposed. The goals of these techniques are to perform blind deconvolution, system identification, source separation, and equalization of the communication channels. While many technically interesting results have been discovered, there is still much work to be done before these methods can be used in practical smart antenna systems. Relevant references in these areas can be found in Refs. 37–42. We also note that many of the techniques used for smart antennas are also relevant to acoustic and seismic sensor DOA and localization problems in multimedia, industrial, and military applications.43
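For contrast with the subspace methods, the conventional beam-scan DOA estimate mentioned above can be sketched directly: steer a beam across candidate angles and pick the angle of maximum output power. The array size, source angle, and noise-free snapshot below are illustrative assumptions (a half-wavelength-spaced uniform linear array with a single narrow-band source).

```python
# Conventional (beam-scan) DOA estimation: the classical approach whose
# resolution the subspace techniques (e.g., MUSIC) improve upon.
import cmath
import math

N = 8                                    # array elements
TRUE_DOA = 20.0                          # degrees (simulated source)

def snapshot(theta_deg):
    """Noise-free array response to a unit source arriving from theta."""
    u = math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * math.pi * k * u) for k in range(N)]

def scan(x, angles):
    """Return the candidate angle whose steered beam captures most power."""
    best, best_p = None, -1.0
    for a in angles:
        w = snapshot(a)                  # steering vector toward angle a
        y = sum(wk.conjugate() * xk for wk, xk in zip(w, x))
        if abs(y) > best_p:
            best, best_p = a, abs(y)
    return best

x = snapshot(TRUE_DOA)
print(scan(x, [a * 0.5 for a in range(-180, 181)]))   # → 20.0
```

The achievable resolution here is set by the array's beamwidth, which is the DFT-like limitation noted above; subspace methods such as MUSIC resolve sources well inside one beamwidth by exploiting the eigenstructure of the covariance matrix instead.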

Acknowledgment

This work is partially supported by NASA under Grant NCC 2-374.

References

1. Allan, V. H., Jones, R., Lee, R., and Allan, S. J., Software pipelining, ACM Comput. Surv., Sept. 1995.
2. Callahan, D., Kennedy, K., and Porterfield, A., Software prefetching, Proc. 4th Int. Conf. Arch. Supp. Prog. Lang. and O.S.'s, Apr. 1991.
3. Capitanio, A., Dutt, N., and Nicolau, A., Partitioned register files for VLIWs: a preliminary analysis of tradeoffs, Proc. 25th Annu. Int. Symp. Microarch., Dec. 1992.
4. Case, B., 3DNow boosts non-Intel 3D performance, Rep., June 1998.
5. Faraboschi, P., Desoli, G., and Fisher, J. A., The latest word in digital and media processing, IEEE Signal Processing Mag., March 1998.
6. Fisher, J. A., Trace scheduling: a technique for global compaction, IEEE Trans. Comput., 30(7), 478, 1981.
7. Fisher, J. A., Ellis, J. R., Ruttenberg, J. C., and Nicolau, A., Parallel processing: a smart compiler and a dumb processor, Proc. SIGPLAN Conf. Compiler Construct., June 1984.
8. Fisher, J. A. and Rau, B. R., J. Supercomput., Jan. 1993.
9. Glaskowsky, P., First media processors reach the market, Microprocessor Rep., Jan. 1997.
10. Glaskowsky, P., 3D chips break megatriangle barrier, Microprocessor Rep., June 1997.
11. Glaskowsky, P., 3D chips take large bite of PC budget, Microprocessor Rep., July 1997.
12. Gwennap, L., Multimedia boom affects CPU design, Microprocessor Rep., Dec. 1994.
13. Gwennap, L., Intel’s MMX speeds multimedia, Microprocessor Rep., March 1996.
14. Gwennap, L., Media processors may have short reign, Microprocessor Rep., Oct. 1996.
15. Gwennap, L., New multimedia chips to enter the fray, Microprocessor Rep., Oct. 1996.
16. Gwennap, L., MediaGX targets low-cost PCs, Microprocessor Rep., March 1997.
17. Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1996.
18. Johnson, M., Superscalar Microprocessor Design, Prentice-Hall, Englewood Cliffs, NJ, 1991.
19. Kagan, M., The P55C: the first implementation of MMX technology, Hot-Chips 8, Aug. 1996.
20. Kalapathy, P., Hardware/software interaction on the MPact media processor, Hot-Chips 8, Aug. 1996.
21. Lam, M., Software pipelining: an effective scheduling technique for VLIW machines, Proc. SIGPLAN Conf. Prog. Lang. Design Impl., June 1988.
22. Lapsley, P., Bier, J., Shoham, A., and Lee, E. A., DSP Design Fundamentals — Architectures and Features, Berkeley Design Technology, Inc., 1996.
23. Lee, R., Accelerating multimedia with enhanced microprocessors, IEEE Micro, 15(2), May 1993.
24. Rau, B. R. and Fisher, J. A., Instruction-level parallelism, J. Supercomput., May 1993.
25. Song, P., Media processors begin to specialize, Microprocessor Rep., Jan. 1998.
26. Turley, J., Multimedia chips complicate choices, Microprocessor Rep., Feb. 1996.
27. Wall, D. W., Limits of instruction-level parallelism, Proc. 4th Conf. Arch. Supp. Prog. Lang. O.S.'s, Apr. 1991.
28. Howells, P. W., Intermediate frequency sidelobe canceller, U.S. patent 3,202,990, Aug. 1965.
29. Widrow, B., Adaptive filters I: fundamentals, Rept. SEL-66-126, Stanford Electronics Laboratory, Dec. 1966.
30. McWhirter, J. G., Recursive least-squares minimization using a systolic array, Proc. SPIE, 431, 1983.
31. Ling, F. and Proakis, J. G., A generalized multichannel least-squares lattice with sequential processing stages, IEEE Trans. Acoustics, Speech, Signal Proc., 32, Apr. 1984.
32. Kalson, S. Z. and Yao, K., A class of least-squares filtering and identification algorithms with systolic array architecture, IEEE Trans. Inf. Theory, Jan. 1991.
33. Liu, K. J. R., Hsieh, S. F., and Yao, K., Systolic block Householder transformation for RLS algorithm with two-level pipelined implementation, IEEE Trans. Signal Proc., Apr. 1992.
34. Rader, C. M., VLSI systolic arrays for adaptive nulling, IEEE Signal Proc. Mag., 13, 29, 1996.
35. Bolstad, G. D. and Neeld, K. B., CORDIC-based digital signal processing (DSP) element for adaptive signal processing, Digital Signal Processing Technology, Papamichalis, P. and Kerwin, R., Eds., SPIE Optical Engineering Press, 291, 1995.
36. Lightbody, G., Woods, R., Walke, R., Hu, Y., Trainor, D., and McCanny, J., Rapid design of a single chip adaptive beamformer, Proc. 1998 Workshop Signal Proc., Manolakos, E. S. et al., Eds., 285, 1998.
37. Rappaport, T. S., Smart Antennas — Adaptive Arrays, Algorithms, & Wireless Position Location, IEEE, 1998.
38. Liberti, J. C. and Rappaport, T. S., Smart Antennas for Wireless Communications, Prentice-Hall, Englewood Cliffs, NJ, 1999.
39. Godara, L. C., Applications of antenna arrays to mobile communications, part I: performance improvement, feasibility, and system considerations, Proc. IEEE, 85, 1301, 1997.
40. Godara, L. C., Applications of antenna arrays to mobile communications, part II: beam-forming and direction-of-arrival considerations, Proc. IEEE, 85, 1195, 1997.
41. Paulraj, A. J. and Papadias, C. B., Space-time processing for wireless communications, IEEE Personal Commun., 14, 49, 1997.
42. Moulines, E., Duhamel, P., Cardoso, J., and Mayrargue, S., Subspace methods for the blind identification of multichannel FIR filters, IEEE Trans. Signal Proc., 43, 516, 1995.
43. Yao, K., Hudson, R. E., Reed, C. W., Chen, D., and Lorenzelli, F., Blind beamforming on a randomly distributed sensor array system, IEEE J. Selected Areas Commun., 16, 1555, 1998.
