IMPLEMENTATION OF A PROGRAMMABLE BASEBAND PROCESSOR

Eric Tell, Anders Nilsson, Dake Liu

Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
{erite, andni, dake}@isy.liu.se

ABSTRACT

A fully programmable baseband processor architecture has been developed. The architecture is based on an application specific DSP processor and a number of flexible hardware accelerators, connected via a configurable network. A large degree of hardware reuse and careful selection of accelerators, together with low memory cost, allow a very area and power efficient implementation. A demonstrator chip for 802.11a/b/g physical layer baseband processing was implemented in 0.18 µm CMOS on a 5.0 mm2 die with a core area of 2.9 mm2, including all memories.

1 INTRODUCTION

The large number of emerging radio standards and the convergence of products lead to an increased interest in Software Defined Radio (SDR) and to increased flexibility requirements for baseband processors [1]. At the same time, with new demanding applications such as wireless LAN, 3G/4G mobile telephony and digital video broadcasting, a high degree of parallelism is needed. Power consumption continues to be very important.

The proposed architecture, which is based on a specialized DSP processor core connected to a number of flexible hardware accelerators via a configurable network, allows a good tradeoff between flexibility and performance and a very area efficient implementation with low control overhead. Figure 1 gives an overview of the architecture.

The use of accelerators improves the efficiency of the architecture through an increased degree of parallelism. Efficiency is further improved since the processor core can focus on tasks more suitable for DSP software implementation, e.g. multiply-accumulate based operations.

Other programmable solutions, such as [2], [3] and [4], are typically based on highly complex VLIW and/or multiple processor cores. The approach described here leads to significantly less control overhead and memory size, resulting in reduced area and power consumption. The described approach also leads to a much higher degree of hardware reuse and better utilization of hardware components than a corresponding fixed function hardware implementation such as [5].

2 THE DSP CORE

The core processor has a 16 bit ALU and a specialized complex MAC unit. The instruction set can be divided into three classes of instructions:

- Ordinary RISC-style instructions operating on 16-bit values, or on 16+16-bit complex values.
- Vector instructions, operating on vectors of complex data.
- Instructions for network and accelerator configuration and control.

All instructions are 16 bits long, providing very efficient use of program memory.

The main feature of the core is the complex multiply-accumulate unit (CMAC) and the associated vector instructions. A very significant part of the computations in a baseband processor are operations on vectors of complex numbers (I/Q pairs), such as auto- and cross-correlation, FIR filtering, FFT, vector multiplication and complex absolute maximum search. The CMAC is optimized for these types of operations. It can execute e.g. two complex multiply-accumulate operations, a radix-2 FFT butterfly, or two complex absolute value calculations plus max search, each clock cycle.

The vector instructions allow a complete vector operation (e.g. a scalar product or one layer of butterflies in an FFT) to be executed using only one instruction. The vector size is given explicitly in the instruction word. A vector instruction takes several clock cycles to complete (depending on vector size).

  Normal instruction (4 different formats):  0  | subtype (2-5) | instr. (2-6) | arguments (4-11)
  Vector instruction:                        10 | instruction (4) | ports (3) | vector size (7)
  Accelerator instruction:                   11 | accelerator (4) | accelerator instruction (10)

Figure 2: Encoding of the basic instruction types in the BBP (number of bits in parentheses)
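To make the encoding in Figure 2 concrete, the sketch below decodes a 16-bit instruction word into the three classes. It is an illustrative model only: the field widths are taken from Figure 2, but the assumption that the class bits sit in the two most significant positions, as well as all type and function names, are ours and not part of the original design.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decode of the 16-bit BBP instruction word (fields per Figure 2).
     * Bit positions assume the class bits occupy the most significant end of the word. */
    typedef enum { INSTR_NORMAL, INSTR_VECTOR, INSTR_ACCEL } instr_class_t;

    static instr_class_t decode(uint16_t word)
    {
        if ((word & 0x8000u) == 0) {
            /* Normal RISC-style instruction: 0 | subtype | instr. | arguments
             * (four different formats, so the field widths vary: 2-5, 2-6, 4-11 bits). */
            return INSTR_NORMAL;
        }
        if ((word & 0x4000u) == 0) {
            /* Vector instruction: 10 | instruction (4) | ports (3) | vector size (7). */
            unsigned instr = (word >> 10) & 0xF;
            unsigned ports = (word >> 7) & 0x7;
            unsigned vsize = word & 0x7F;          /* vector length given explicitly */
            printf("vector instr=%u ports=%u size=%u\n", instr, ports, vsize);
            return INSTR_VECTOR;
        }
        /* Accelerator instruction: 11 | accelerator (4) | accelerator instruction (10). */
        unsigned accel = (word >> 10) & 0xF;
        unsigned acmd  = word & 0x3FF;
        printf("accelerator=%u command=0x%03X\n", accel, acmd);
        return INSTR_ACCEL;
    }

    int main(void)
    {
        /* Example: a hypothetical vector instruction with vector size 32. */
        decode(0x8000u | (0x3u << 10) | (0x2u << 7) | 32u);
        return 0;
    }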

[Figure 1 block diagram: the DSP core (16-bit ALSU, complex 16-bit MAC with 40+40-bit accumulator registers AR0-AR3, a 16x16-bit register file and control registers), the memories DM0-DM3 (256x32), CM (2048x32) and IM (256x16), and the accelerators (FIR, packet detector, rotor, demapper, interleaver, convolutional encoder/Viterbi, Walsh, scrambler, CRC, MAC-layer interface) with the ADC/DAC interface, all connected via the network.]

Figure 1: Overview of the baseband processor. DMx, CM and IM are data memories.

However, other instructions, e.g. ALU instructions or accelerator configuration, can execute in parallel with a vector instruction, as illustrated by Figure 3. This often allows the control code of an algorithm to be "hidden" behind multi-cycle vector instructions. Table 1 shows some benchmarks for the core.

  MAC.32 port1,port2   ; 32 element vector dot product
  ADD R0,R1            ; R0=R0+R1
  NWC interl,viterbi   ; setup network connection
  ACL viterbi,0xF2     ; accelerator control instruction
  IDLE mac             ; wait for MAC.32 to finish

Figure 3: Assembly code example. The four instructions following the vector instruction add no cycle cost.
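As a complement to the assembly example in Figure 3, the sketch below models in C what a vector dot product such as MAC.32 computes. The function name, data layout and fixed-point types are illustrative assumptions, not the actual hardware interface; the paper does not say whether one operand is conjugated, and the rounding and guard-bit behaviour of the 40-bit accumulators is omitted.

    #include <stdint.h>

    /* Illustrative C model of the complex vector dot product issued by MAC.32. */
    typedef struct { int16_t re, im; } cplx16_t;
    typedef struct { int64_t re, im; } cplx_acc_t;   /* stands in for the wide accumulator */

    cplx_acc_t cmac_dot(const cplx16_t *a, const cplx16_t *b, unsigned n)
    {
        cplx_acc_t acc = {0, 0};
        for (unsigned i = 0; i < n; i++) {
            /* One complex multiply-accumulate; the CMAC retires two of these
             * per clock cycle, so a 32-element dot product takes roughly
             * 16 cycles plus setup overhead (compare Table 1). */
            acc.re += (int32_t)a[i].re * b[i].re - (int32_t)a[i].im * b[i].im;
            acc.im += (int32_t)a[i].re * b[i].im + (int32_t)a[i].im * b[i].re;
        }
        return acc;
    }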

Table 1: DSP core benchmark examples. Cycle costs include memory addressing setup and other overhead. No acceleration was used.

  Function                                     Clock cycles
  64-point FFT                                 205
  40 element vector add                        24
  40 el. scalar product                        24
  40 el. vector elementwise multiplication     24
  40 sample, 16 tap FIR filter                 404
  40 element complex absolute maximum search   22

The processor control path is similar to that of a rather simple micro-controller-like processor with some added DSP and other special features. It does not suffer from the control and communication overhead found in VLIW, superscalar and other enhanced processor types. The vector controller, which is responsible for execution of the parallel multi-cycle vector instructions, adds only little extra overhead.

3 THE ACCELERATOR NETWORK

Memories, accelerators and external interfaces are connected to the core via the interconnect network. The network behaves like a crossbar and is configured by the core using dedicated assembly instructions. This eliminates the need for an arbiter and addressing logic, thus reducing the complexity of the network and the accelerator interfaces, while still allowing many concurrent communications.

Each accelerator has one read port and one write port to the network. A connection is set up by connecting one read port to one write port. The reading unit requests one unit of data by asserting a ReadRequest signal during one clock cycle, and the transmitting unit uses a DataAvailable signal to indicate that new data is available. The requesting unit may have up to two outstanding read requests, but must then halt if no DataAvailable signal is received. This protocol allows a new data item to be communicated every clock cycle while still providing sufficient flow control.

A chain of accelerators connected to each other via the network will automatically synchronize and communicate without any interaction by the processor. This allows truly concurrent operation of the core and any number of accelerators, with zero synchronization overhead in the core. It also minimizes the number of memory accesses, since no intermediate storage is needed when sending data between accelerators.

Accelerators can be configured via special accelerator instructions or via a control register space.

Table 2: Firmware implementation results.

  Task     Req. freq.   Prog. size   Data mem.
  11a Tx   155 MHz      1020 bytes   3456 bytes
  11a Rx   160 MHz      1658 bytes   2340 bytes
  11b Tx   120 MHz      476 bytes    484 bytes
  11b Rx   110 MHz      1090 bytes   304 bytes
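The ReadRequest/DataAvailable handshake described in Section 3 can be sketched as a simple cycle-based model. The sketch below is our own illustration, not RTL from the paper: the names and structure are invented, the writer is assumed to answer in the same cycle, and only the flow-control rule (at most two outstanding read requests, one data item per cycle) is modelled.

    #include <stdbool.h>

    /* Per-cycle model of one network connection (reading unit <- transmitting unit). */
    typedef struct {
        int  outstanding;     /* read requests not yet answered (0..2) */
        bool read_request;    /* asserted by the reading unit for one cycle */
        bool data_available;  /* asserted by the transmitting unit with each item */
    } link_t;

    /* Advance one clock cycle; returns true if a data item was transferred. */
    bool link_cycle(link_t *l, bool reader_wants_data, bool writer_has_data)
    {
        /* Reader side: issue a new request only if fewer than two are outstanding,
         * otherwise the reader must halt and wait for DataAvailable. */
        l->read_request = reader_wants_data && l->outstanding < 2;
        if (l->read_request)
            l->outstanding++;

        /* Writer side: answer one pending request per cycle when data exists. */
        l->data_available = writer_has_data && l->outstanding > 0;
        if (l->data_available)
            l->outstanding--;

        return l->data_available;   /* full rate: one item per clock cycle */
    }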

4 MEMORY ARCHITECTURE

Using a number of small data memories gives enough memory bandwidth to keep the core/CMAC and the accelerators fully occupied. The network always gives a unit (core or accelerator) exclusive memory access, thereby eliminating stall cycles due to access conflicts. After finishing a task, the entire memory containing the output can be "handed over" to an accelerator or interface by reconfiguring the network. This eliminates data moves between memories.

Each memory has its own address generator, and addresses and addressing modes are configured using the same interface as accelerator configuration. No addressing information needs to be sent over the network.

Reducing memory sizes and memory accesses was a major focus in the design, since a large part of the power consumption in a programmable architecture takes place in the memories. The small, and thereby fast, memories and the moderate clock frequency eliminate the need for caches. Thereby a lot of control overhead is avoided and, more importantly, execution time is completely predictable, which is a major advantage in hard real-time systems.

5 ACCELERATORS

A key issue is the choice of accelerators, which has previously been discussed in [6]. The main factors to consider are: 1) the relation between the area of the accelerator and the cycle cost of a pure software implementation of the function, and 2) to which extent the accelerator can be reused between standards. The reuse factor can often be improved by adding configurability to the accelerator. The right choice of accelerators allows the processor to run at a relatively low clock frequency, which saves power. Even more power can be saved since a lower frequency may allow a lower supply voltage.

6 IMPLEMENTATION FOR WLAN

Figure 1 shows an implementation of the architecture for a converged 802.11a/b/g baseband processor.

The program memory size is 4096x16 bits. Four identical 256x32 bit data memories for complex data are connected to the network. Each of these memories consists of two interleaved memory banks, allowing two consecutive addresses (vector elements) to be accessed in parallel. These memories also have FFT addressing support. A 2048x32 bit coefficient memory connected directly to the core is intended for FFT and filter coefficients, look-up tables, and other data not processed by accelerators. Using dual memory banks instead of dual port memories saves power.

The ADC/DAC interface accelerator contains a configurable decimation filter, a rotor for carrier frequency offset compensation, and a configurable packet detector based on autocorrelation. The packet detector will wake the core from idle mode when an incoming frame preamble is detected. These functions can be reused between many standards. They also have to run continuously a large part of the time, and the decimation in particular is quite demanding. Other accelerators reused between the 11a and 11b standards are the scrambler and the MAC-layer interface.
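To illustrate the dual-bank organisation of the 256x32 data memories, the sketch below shows how splitting a memory into even/odd banks lets two consecutive vector elements be read in the same cycle. The bank-select-by-address-LSB scheme and all names are illustrative assumptions; the paper only states that each memory consists of two interleaved banks.

    #include <stdint.h>

    /* Illustrative model of one 256x32 data memory built from two interleaved
     * 128x32 banks: bank 0 holds even addresses, bank 1 holds odd addresses. */
    #define BANK_WORDS 128

    typedef struct {
        uint32_t bank[2][BANK_WORDS];
    } dual_bank_mem_t;

    /* Read the elements at addr and addr+1 in one access: consecutive addresses
     * fall in different banks, so both banks can be read in the same cycle. */
    void read_pair(const dual_bank_mem_t *m, unsigned addr,
                   uint32_t *elem0, uint32_t *elem1)
    {
        unsigned bank0 = addr & 1u;
        *elem0 = m->bank[bank0][addr >> 1];
        *elem1 = m->bank[bank0 ^ 1u][(addr + 1) >> 1];
    }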
7 RESULTS

Firmware was implemented for 802.11a and 11b transceivers; results in terms of memory usage and required clock frequency for the different modules can be found in Table 2. The instruction set has proven to be very efficient: only about half of the available program memory is required to store the entire 11a and 11b transceiver firmware on chip. Data memory requirements are also about half of the available data memory. (Parts of the firmware, as well as most of the data memory, are shared by the Rx and Tx modules, so the actual requirements are less than the sum of the numbers in the table.)

At the highest data rate, the 11b receiver requires a lower clock frequency (or has more idle cycles at a fixed frequency) than the 11b transmitter, thanks to the acceleration of the modified Walsh transform, which is the most complex operation in the receiver at that rate.

An 802.11a/b/g baseband processor demonstrator chip, with accelerators for the ADC/DAC interface and front-end processing, demapping, interleaving, scrambling, CRC, Walsh transform and the MAC-layer interface, was implemented and manufactured using a 0.18 µm CMOS standard cell library. The chip features and measured performance can be found in Table 3. Figure 4 shows a die photo.

The chip functions correctly at least up to 220 MHz, implying that significant power can be saved by reducing the supply voltage in a converged 802.11a/b/g transceiver running at 160 MHz (the required frequency for 54 Mbit/s reception in 802.11a/g).

Table 3: BBP chip feature summary

  Feature           Value
  Technology        0.18 µm CMOS
  Chip area         5 mm2
  Core area         2.9 mm2
  Memory area       1.0 mm2
  Logic area        1.9 mm2
  Max frequency     220 MHz
  Package           144 pin fpBGA
  Power @ 160 MHz:
    Idle            44 mW
    11a Rx burst    126 mW

Figure 4: Die photo

8 CONCLUSIONS

A programmable architecture for radio baseband processing has been presented. The architecture enables very area efficient implementations of baseband functions for multi-standard radio systems. The accelerator architecture, together with an efficient instruction set including vector instructions, minimizes program and data memory requirements. The accelerator chaining feature further reduces memory accesses and provides a high degree of parallelism and low control overhead. The programmable DSP core can support both OFDM systems, e.g. 802.11a, and spread spectrum systems such as 802.11b.

A demonstrator chip for wireless LAN applications has been fabricated. A clock frequency of 160 MHz is required to support 802.11a reception at the highest data rate. The core area of 2.9 mm2 is smaller than that of existing programmable and non-programmable solutions. The power consumption is low considering that all logic was synthesized from VHDL and that no dedicated low power design techniques were used.

The processor is flexible enough to also support standards with lower data rates, such as GSM/GPRS and Bluetooth, and firmware for these standards is currently being developed.

REFERENCES

[1] http://www.sdrforum.org
[2] J. Kneip et al., "Single Chip Programmable Baseband ASSP for 5 GHz Wireless LAN Applications," IEICE Trans. Electron., vol. E85-C, no. 2, February 2002.
[3] J. Glossner et al., "A Software-Defined Communications Baseband Chip," IEEE Communications Magazine, January 2003.
[4] S. Rajagopal et al., "A Programmable Baseband Processor Design for Software Defined Radios," Proc. Midwest Symposium on Circuits and Systems (MWSCAS 2002), pp. III-413 - III-416, 2002.
[5] T. Fujisawa et al., "A Single-Chip 802.11a MAC/PHY with a 32b RISC Processor," ISSCC Dig. Tech. Papers, pp. 144-145, Feb. 2003.
[6] A. Nilsson, E. Tell, D. Liu, "An Accelerator Structure for Multi-Standard Programmable Baseband Processors," Proc. of IASTED Intl. Multi-Conf. on Wireless and Optical Communications, pp. 644-649, July 2004.