IMPLEMENTATION OF A PROGRAMMABLE BASEBAND PROCESSOR

Eric Tell, Anders Nilsson, Dake Liu

Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
{erite, andni, dake}@isy.liu.se

ABSTRACT

A fully programmable baseband processor architecture has been developed. The architecture is based on an application specific DSP processor and a number of flexible hardware accelerators, connected via a configurable network. A large degree of hardware reuse and careful selection of accelerators, together with low memory cost, allow a very area and power efficient implementation. A demonstrator chip for 802.11a/b/g physical layer baseband processing was implemented in 0.18 µm CMOS on a 5.0 mm2 die with a core area of 2.9 mm2, including all memories.

1 INTRODUCTION

The large number of emerging radio standards and the convergence of products lead to an increased interest in Software Defined Radio (SDR) and to increased flexibility requirements for baseband processors [1]. At the same time, with new demanding applications such as wireless LAN, 3G/4G mobile telephony and digital video broadcasting, a high degree of parallelism is needed. Power consumption continues to be very important.

The proposed architecture, which is based on a specialized DSP processor core connected to a number of flexible hardware accelerators via a configurable network, allows a good tradeoff between flexibility and performance and a very area efficient implementation with low control overhead. Figure 1 gives an overview of the architecture.

The use of accelerators improves the efficiency of the architecture through an increased degree of parallelism. Efficiency is further improved since the processor core can focus on tasks more suitable for DSP software implementation, e.g. multiply-accumulate based operations.

Other programmable solutions, such as [2], [3] and [4], are typically based on highly complex VLIW and/or multiple processor cores. The approach described here leads to significantly less control overhead and memory size, resulting in reduced area and power consumption. The described approach also leads to a much higher degree of hardware reuse and better utilization of hardware components than a corresponding fixed function hardware implementation such as [5].

2 THE DSP CORE

The core processor has a 16 bit ALU and a specialized complex MAC unit. The instruction set can be divided into three classes of instructions:

- Ordinary RISC-style instructions operating on 16-bit values, or on 16+16-bit complex values.
- Vector instructions, operating on vectors of complex data.
- Instructions for network and accelerator configuration and control.

All instructions are 16 bits long, providing very efficient use of program memory.

The main feature of the core is the complex multiply-accumulate unit (CMAC) and the associated vector instructions. A very significant part of the computations in a baseband processor are operations on vectors of complex numbers (I/Q pairs), such as auto- and cross-correlation, FIR filtering, FFT, vector multiplication and complex absolute maximum search. The CMAC is optimized for these types of operations. It can execute e.g. two complex multiply-accumulate operations, a radix-2 FFT butterfly, or two complex absolute value calculations plus max search, each clock cycle.

The vector instructions allow a complete vector operation (e.g. a scalar product or one layer of butterflies in an FFT) to be executed using only one instruction. The vector size is given explicitly in the instruction word. A vector instruction takes several clock cycles to complete (depending on vector size).

  Normal instruction (4 different formats):  0  | subtype (2-5) | instr. (2-6) | arguments (4-11)
  Vector instruction:                        10 | instruction (4) | ports (3) | vector size (7)
  Accelerator instruction:                   11 | accelerator (4) | accelerator instruction (10)

Figure 2: Encoding of the basic instruction types in the BBP (number of bits in parentheses)
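To make the encoding in Figure 2 concrete, the sketch below decodes a 16-bit instruction word into the three classes. It is an illustrative model only: the field widths are taken from Figure 2, but the assumption that the class bits sit in the two most significant positions, as well as all type and function names, are ours and not part of the original design.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decode of the 16-bit BBP instruction word (fields per Figure 2).
     * Bit positions assume the class bits occupy the most significant end of the word. */
    typedef enum { INSTR_NORMAL, INSTR_VECTOR, INSTR_ACCEL } instr_class_t;

    static instr_class_t decode(uint16_t word)
    {
        if ((word & 0x8000u) == 0) {
            /* Normal RISC-style instruction: 0 | subtype | instr. | arguments
             * (four different formats, so the field widths vary: 2-5, 2-6, 4-11 bits). */
            return INSTR_NORMAL;
        }
        if ((word & 0x4000u) == 0) {
            /* Vector instruction: 10 | instruction (4) | ports (3) | vector size (7). */
            unsigned instr = (word >> 10) & 0xF;
            unsigned ports = (word >> 7) & 0x7;
            unsigned vsize = word & 0x7F;          /* vector length given explicitly */
            printf("vector instr=%u ports=%u size=%u\n", instr, ports, vsize);
            return INSTR_VECTOR;
        }
        /* Accelerator instruction: 11 | accelerator (4) | accelerator instruction (10). */
        unsigned accel = (word >> 10) & 0xF;
        unsigned acmd  = word & 0x3FF;
        printf("accelerator=%u command=0x%03X\n", accel, acmd);
        return INSTR_ACCEL;
    }

    int main(void)
    {
        /* Example: a hypothetical vector instruction with vector size 32. */
        decode(0x8000u | (0x3u << 10) | (0x2u << 7) | 32u);
        return 0;
    }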

[Figure 1 block diagram: the DSP core (16-bit ALSU, complex 16-bit MAC with 40+40-bit accumulator registers AR0-AR3, a 16x16-bit register file and control registers), the memories DM0-DM3 (256x32), CM (2048x32) and IM (256x16), and the accelerators (FIR, packet detector, rotor, demapper, interleaver, convolutional encoder/Viterbi, Walsh, scrambler, CRC, MAC-layer interface) with the ADC/DAC interface, all connected via the network.]

Figure 1: Overview of the baseband processor. DMx, CM and IM are data memories.

However, other instructions, e.g. ALU instructions or accelerator configuration, can execute in parallel with a vector instruction, as illustrated by Figure 3. This often allows the control code of an algorithm to be "hidden" behind multi-cycle vector instructions. Table 1 shows some benchmarks for the core.

  MAC.32 port1,port2   ; 32 element vector dot product
  ADD R0,R1            ; R0=R0+R1
  NWC interl,viterbi   ; setup network connection
  ACL viterbi,0xF2     ; accelerator control instruction
  IDLE mac             ; wait for MAC.32 to finish

Figure 3: Assembly code example. The four instructions following the vector instruction add no cycle cost.
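As a complement to the assembly example in Figure 3, the sketch below models in C what a vector dot product such as MAC.32 computes. The function name, data layout and fixed-point types are illustrative assumptions, not the actual hardware interface; the paper does not say whether one operand is conjugated, and the rounding and guard-bit behaviour of the 40-bit accumulators is omitted.

    #include <stdint.h>

    /* Illustrative C model of the complex vector dot product issued by MAC.32. */
    typedef struct { int16_t re, im; } cplx16_t;
    typedef struct { int64_t re, im; } cplx_acc_t;   /* stands in for the wide accumulator */

    cplx_acc_t cmac_dot(const cplx16_t *a, const cplx16_t *b, unsigned n)
    {
        cplx_acc_t acc = {0, 0};
        for (unsigned i = 0; i < n; i++) {
            /* One complex multiply-accumulate; the CMAC retires two of these
             * per clock cycle, so a 32-element dot product takes roughly
             * 16 cycles plus setup overhead (compare Table 1). */
            acc.re += (int32_t)a[i].re * b[i].re - (int32_t)a[i].im * b[i].im;
            acc.im += (int32_t)a[i].re * b[i].im + (int32_t)a[i].im * b[i].re;
        }
        return acc;
    }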

Table 1: DSP core benchmark examples. Cycle costs include memory addressing setup and other overhead. No acceleration was used.

  Function                                     Clock cycles
  64-point FFT                                 205
  40 element vector add                        24
  40 el. scalar product                        24
  40 el. vector elementwise multiplication     24
  40 sample, 16 tap FIR filter                 404
  40 element complex absolute maximum search   22

The processor control path is similar to that of a rather simple micro-controller-like processor with some added DSP and other special features. It does not suffer from the control and communication overhead found in VLIW, superscalar and other enhanced processor types. The vector controller, which is responsible for execution of the parallel multi-cycle vector instructions, adds only little extra overhead.

3 THE ACCELERATOR NETWORK

Memories, accelerators and external interfaces are connected to the core via the interconnect network. The network behaves like a crossbar and is configured by the core using dedicated assembly instructions. This eliminates the need for an arbiter and addressing logic, thus reducing the complexity of the network and the accelerator interfaces, while still allowing many concurrent communications.

Each accelerator has one read port and one write port to the network. A connection is set up by connecting one read port to one write port. The reading unit requests one unit of data by asserting a ReadRequest signal during one clock cycle, and the transmitting unit uses a DataAvailable signal to indicate that new data is available. The requesting unit may have up to two outstanding read requests, but must then halt if no DataAvailable signal is received. This protocol allows a new data item to be communicated every clock cycle while still providing sufficient flow control.

A chain of accelerators connected to each other via the network will automatically synchronize and communicate without any interaction by the processor. This allows truly concurrent operation of the core and any number of accelerators, with zero synchronization overhead in the core. It also minimizes the number of memory accesses, since no intermediate storage is needed when sending data between accelerators.

Accelerators can be configured via special accelerator instructions or via a control register space.

Table 2: Firmware implementation results.

  Task     Req. freq.   Prog. size   Data mem.
  11a Tx   155 MHz      1020 bytes   3456 bytes
  11a Rx   160 MHz      1658 bytes   2340 bytes
  11b Tx   120 MHz      476 bytes    484 bytes
  11b Rx   110 MHz      1090 bytes   304 bytes
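The ReadRequest/DataAvailable handshake described in Section 3 can be sketched as a simple cycle-based model. The sketch below is our own illustration, not RTL from the paper: the names and structure are invented, the writer is assumed to answer in the same cycle, and only the flow-control rule (at most two outstanding read requests, one data item per cycle) is modelled.

    #include <stdbool.h>

    /* Per-cycle model of one network connection (reading unit <- transmitting unit). */
    typedef struct {
        int  outstanding;     /* read requests not yet answered (0..2) */
        bool read_request;    /* asserted by the reading unit for one cycle */
        bool data_available;  /* asserted by the transmitting unit with each item */
    } link_t;

    /* Advance one clock cycle; returns true if a data item was transferred. */
    bool link_cycle(link_t *l, bool reader_wants_data, bool writer_has_data)
    {
        /* Reader side: issue a new request only if fewer than two are outstanding,
         * otherwise the reader must halt and wait for DataAvailable. */
        l->read_request = reader_wants_data && l->outstanding < 2;
        if (l->read_request)
            l->outstanding++;

        /* Writer side: answer one pending request per cycle when data exists. */
        l->data_available = writer_has_data && l->outstanding > 0;
        if (l->data_available)
            l->outstanding--;

        return l->data_available;   /* full rate: one item per clock cycle */
    }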

4 MEMORY ARCHITECTURE

Using a number of small data memories gives enough memory bandwidth to keep the core/CMAC and the accelerators fully occupied. The network always gives a unit (core or accelerator) exclusive memory access, thereby eliminating stall cycles due to access conflicts. After finishing a task, the entire memory containing the output can be "handed over" to an accelerator or interface by reconfiguring the network. This eliminates data moves between memories.

Each memory has its own address generator, and addresses and addressing modes are configured using the same interface as accelerator configuration. No addressing information needs to be sent over the network.

Reducing memory sizes and memory accesses was a major focus in the design, since a large part of the power consumption in a programmable architecture takes place in the memories. The small, and thereby fast, memories and the moderate clock frequency eliminate the need for caches. Thereby a lot of control overhead is avoided and, more importantly, execution time is completely predictable, which is a major advantage in hard real-time systems.

5 ACCELERATORS

A key issue is the choice of accelerators, which has previously been discussed in [6]. The main factors to consider are: 1) the relation between the area of the accelerator and the cycle cost of a pure software implementation of the function, and 2) to which extent the accelerator can be reused between standards. The reuse factor can often be improved by adding configurability to the accelerator. The right choice of accelerators allows the processor to run at a relatively low clock frequency, which saves power. Even more power can be saved since a lower frequency may allow a lower supply voltage.

6 IMPLEMENTATION FOR WLAN

Figure 1 shows an implementation of the architecture for a converged 802.11a/b/g baseband processor.

The program memory size is 4096x16 bits. Four identical 256x32 bit data memories for complex data are connected to the network. Each of these memories consists of two interleaved memory banks, allowing two consecutive addresses (vector elements) to be accessed in parallel. These memories also have FFT addressing support. A 2048x32 bit coefficient memory connected directly to the core is intended for FFT and filter coefficients, look-up tables, and other data not processed by accelerators. Using dual memory banks instead of dual port memories saves power.

The ADC/DAC interface accelerator contains a configurable decimation filter, a rotor for carrier frequency offset compensation, and a configurable packet detector based on autocorrelation. The packet detector will wake the core from idle mode when an incoming frame preamble is detected. These functions can be reused between many standards. They also have to run continuously a large part of the time, and the decimation in particular is quite demanding. Other accelerators reused between the 11a and 11b standards are the scrambler and the MAC-layer interface.
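To illustrate the dual-bank organisation of the 256x32 data memories, the sketch below shows how splitting a memory into even/odd banks lets two consecutive vector elements be read in the same cycle. The bank-select-by-address-LSB scheme and all names are illustrative assumptions; the paper only states that each memory consists of two interleaved banks.

    #include <stdint.h>

    /* Illustrative model of one 256x32 data memory built from two interleaved
     * 128x32 banks: bank 0 holds even addresses, bank 1 holds odd addresses. */
    #define BANK_WORDS 128

    typedef struct {
        uint32_t bank[2][BANK_WORDS];
    } dual_bank_mem_t;

    /* Read the elements at addr and addr+1 in one access: consecutive addresses
     * fall in different banks, so both banks can be read in the same cycle. */
    void read_pair(const dual_bank_mem_t *m, unsigned addr,
                   uint32_t *elem0, uint32_t *elem1)
    {
        unsigned bank0 = addr & 1u;
        *elem0 = m->bank[bank0][addr >> 1];
        *elem1 = m->bank[bank0 ^ 1u][(addr + 1) >> 1];
    }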
7 RESULTS

Firmware was implemented for 802.11a and 11b transceivers; results in terms of memory usage and required clock frequency for the different modules can be found in Table 2. The instruction set has proven to be very efficient: only about half of the available program memory is required to store the entire 11a and 11b transceiver firmware on chip. Data memory requirements are also about half of the available data memory. (Parts of the firmware, as well as most of the data memory, are shared by the Rx and Tx modules, so the actual requirements are less than the sum of the numbers in the table.)

At the highest data rate, the 11b receiver requires a lower clock frequency (or has more idle cycles at a fixed frequency) than the 11b transmitter, thanks to the acceleration of the modified Walsh transform, which is the most complex operation in the receiver at that rate.

An 802.11a/b/g baseband processor demonstrator chip, with accelerators for the ADC/DAC interface and front-end processing, demapping, interleaving, scrambling, CRC, Walsh transform and the MAC-layer interface, was implemented and manufactured using a 0.18 µm CMOS standard cell library. The chip features and measured performance can be found in Table 3. Figure 4 shows a die photo.

The chip functions correctly at least up to 220 MHz, implying that significant power can be saved by reducing the supply voltage in a converged 802.11a/b/g transceiver running at 160 MHz (the required frequency for 54 Mbit/s reception in 802.11a/g).

Table 3: BBP chip feature summary

  Feature           Value
  Technology        0.18 µm CMOS
  Chip area         5 mm2
  Core area         2.9 mm2
  Memory area       1.0 mm2
  Logic area        1.9 mm2
  Max frequency     220 MHz
  Package           144 pin fpBGA
  Power @ 160 MHz:
    Idle            44 mW
    11a Rx burst    126 mW

Figure 4: Die photo

8 CONCLUSIONS

A programmable architecture for radio baseband processing has been presented. The architecture enables very area efficient implementations of baseband functions for multi-standard radio systems. The accelerator architecture, together with an efficient instruction set including vector instructions, minimizes program and data memory requirements. The accelerator chaining feature further reduces memory accesses and provides a high degree of parallelism and low control overhead. The programmable DSP core can support both OFDM systems, e.g. 802.11a, and spread spectrum systems such as 802.11b.

A demonstrator chip for wireless LAN applications has been fabricated. A clock frequency of 160 MHz is required to support 802.11a reception at the highest data rate. The core area of 2.9 mm2 is smaller than that of existing programmable and non-programmable solutions. The power consumption is low considering that all logic was synthesized from VHDL and that no dedicated low power design techniques were used.

The processor is flexible enough to also support standards with lower data rates, such as GSM/GPRS and Bluetooth, and firmware for these standards is currently being developed.

REFERENCES

[1] http://www.sdrforum.org
[2] J. Kneip et al., "Single Chip Programmable Baseband ASSP for 5 GHz Wireless LAN Applications," IEICE Trans. Electron., vol. E85-C, no. 2, February 2002.
[3] J. Glossner et al., "A Software-Defined Communications Baseband Chip," IEEE Communications Magazine, January 2003.
[4] S. Rajagopal et al., "A Programmable Baseband Processor Design for Software Defined Radios," Proc. Midwest Symposium on Circuits and Systems (MWSCAS 2002), pp. III-413 - III-416, 2002.
[5] T. Fujisawa et al., "A Single-Chip 802.11a MAC/PHY with a 32b RISC Processor," ISSCC Dig. Tech. Papers, pp. 144-145, Feb. 2003.
[6] A. Nilsson, E. Tell, D. Liu, "An Accelerator Structure for Multi-Standard Programmable Baseband Processors," Proc. of IASTED Intl. Multi-Conf. on Wireless and Optical Communications, pp. 644-649, July 2004.