A Fully Programmable 40 GOPS SDR Single Chip Baseband for LTE/WiMAX Terminals

Torsten Limberg, Markus Winter, Marcel Bimberg, Reimund Klemm, Emil Matúš, Marcos B.S. Tavares, Gerhard Fettweis, Hendrik Ahlendorf, Pablo Robelly
Technische Universität Dresden, Chair Mobile Communications Systems, 01062 Dresden, Germany
Email: [email protected]

Abstract: The increasing number of radio protocols, along with the need for multimedia support in mobile communication devices, calls for heterogeneous, programmable multi-core processors. In this paper, we present a fully programmable, heterogeneous single chip SDR platform with multimedia support, fabricated in a 0.13 µm CMOS process. Running at 175 MHz, it delivers a peak performance of 40 GOPS while dissipating 1.5 W. The typical MPSoC programmability problem is solved with a dedicated hardware unit which performs dynamic spatial and temporal mapping of tasks onto processing elements.

I. INTRODUCTION

Emerging next generation cellular standards like 3GPP LTE and WiMAX require a vast amount of modem signal processing. Both standards represent high data rate, low latency, packet optimized technologies, incorporating OFDMA/MIMO, adaptive modulation and coding techniques. In such systems, the dynamic variability of configurations due to user resource allocation, in conjunction with high computational demand and low latency requirements, calls for programmable distributed baseband architectures. On the other hand, broadband media applications (e.g. H.264) will be running in handsets as well. Their data dependent control flow does not allow effective scheduling at compile time. Thus, a run-time solution of this problem is also required. Multi-core architectures, e.g. from Coresonic, PicoChip, Infineon's MuSIC [1] or Sandbridge's SB3011 platform [2], are acknowledged to be power efficient [3], [4] in such scenarios. However, the MPSoC programmability/scheduling problem is still one of the main obstacles to be overcome.

In this paper, we present the Tomahawk MPSoC, which has a dedicated run-time scheduler hardware unit for solving the programmability/scheduling problem. We call this dedicated scheduling unit CoreManager. Its purpose is to reduce the context switching overhead in the control code processor, which traditionally places major efficiency penalties on MPSoCs [5]. The Tomahawk is a low-power, C-programmable software defined radio platform with multimedia support, based on embedded software written as modular tasks according to the synchronous data flow model [6].

II. MPSOC ARCHITECTURE

The Tomahawk MPSoC exploits instruction, data and task level parallelism in order to meet stringent performance requirements with low energy consumption. Figure 1 shows a schematic of the Tomahawk. Below, we briefly discuss the components building this architecture.

The platform for the operating system and control code execution consists of two Tensilica DC212GP RISC processors. The signal processing block of the Tomahawk is composed of six fixed-point vector DSPs (VDSP), two scalar floating-point DSPs (SDSP), an LDPC decoder ASIP, a deblocking filter ASIP and an entropy decoder ASIC. Additionally, the CoreManager performs the scheduling of data transfers and signal processing tasks issued from the control code onto the VDSPs and SDSPs.

The Tomahawk contains local and global memories. The local memories are part of the signal processing elements, which do not have direct access to the global memories. The global memories, on the other hand, are accessible from all NoC master components (see Fig. 1) and consist of external DDR-SDRAMs and I2C as well as an internal 256 KByte SRAM, which is used as scratchpad memory. Moreover, three independently (and parallel) accessible DDR RAM controllers provide large memory bandwidth in order to supply the processing elements (PE) and the control processors (CP) with data.

The peripheral part of the Tomahawk consists of the following components: an FPGA bridge that enables additional functionalities by mapping off-chip components into the address space of the Tomahawk, a single lane PCI Express interface that realizes communication links of 2 GBit/s to a host computer, a VGA/Streaming interface that allows interfacing AD or DA converters, a freely programmable DMA controller, a general purpose I/O and a UART interface.

All components in the chip are connected by two low latency, high bandwidth, crossbar-like networks-on-chip (NoC) [7]. The FPGA bridge supports the same protocol and bandwidth as the NoC, and it can be seen by the NoC as a master or a slave depending on the function of the component attached to it.

Fig. 1. Tomahawk Schematic (legend: NoC = Network-on-Chip; M, S = NoC master / slave port; DDR Ctrl = DDR SDRAM controller; PCIe = PCI Express endpoint; GPIO = general purpose I/O; VDSP = vector DSP; SDSP = scalar DSP; LDPC = low-density parity-check decoder; DMA = direct memory access controller)

In order to achieve low power consumption, the data locality principle is used at multiple levels. For instance, within the STA (synchronous transfer architecture) processors [8] explicit register file bypassing is used. In contrast to the traditional approach, the STA functional units hold and exchange data

directly with each other. This significantly reduces the I/O bandwidth, size and power consumption of the register file. The VDSPs, SDSPs, LDPC decoder and the deblocking filter are all based on the STA principle.

III. PROGRAMMING MODEL

The C-based programming model of the Tomahawk, which is similar to the CellSS [9] programming model for the Cell processor [10], hides scheduling details from the programmer completely. However, in contrast to the software based scheduling of CellSS, the CoreManager [11] computes the schedule of tasks issued from the control code in dedicated hardware, and thus achieves significantly better performance and energy efficiency. The programmer is merely required to identify all C functions which shall be executed as tasks on one of the processing elements controlled by the CoreManager. The calls to these tasks are converted to so-called task descriptions [11] at compile time. At run-time, these task descriptions are sent to the CoreManager instead of calling the tasks explicitly. The spatial and temporal mapping of the tasks onto the PEs is then done automatically under consideration of data dependencies. Simultaneously, the control processor can continue execution and send further task descriptions to the CoreManager as long as a queue length of 16 tasks is not exceeded.

Concerning the data transfers, the CoreManager tries to maximize the local reuse of program memories, thus reducing the need for reloading (Fig. 2). This reduces the required local and global memories as well as the NoC bandwidth. Besides linear memory regions, the CoreManager and DMA controllers support two-dimensional memories. This allows for a more effective implementation of multimedia and MIMO algorithms. Figure 3 shows how 2-D dependency checking on the memories is performed.

Fig. 2. Task level data locality principle

Fig. 3. Dependency checking between two 2-D sub-blocks of a 2-D memory which is stored line after line in memory (numbers are line start addresses)

IV. IMPLEMENTATION AND RESULTS

The Tomahawk chip was designed using a UMC 130 nm, 8 metal layer CMOS standard cell design flow. The 57M transistor chip occupies 10 × 10 mm² (including all 480 I/O cells) and runs at 175 MHz. The typical case core supply voltage is 1.2 V, the I/O voltage is 3.3 V, and 2.5 V for the high speed SSTL2 I/Os.

Figure 4 shows the setup of the measurement station. All presented results have been measured with this setup. For the core power measurements, the PCB provides an independent power supply for the Tomahawk core. The core supply voltage can be adjusted from 0.9 to 1.35 V and has been set to 1.3 V for all measurements. Since the Tomahawk has only a single power domain for all components, obtaining exact power numbers for single components is impossible. Therefore, we determined approximate power numbers by ensuring that only the component under observation is running during the measurement. Static power for single components was neglected, which is acceptable for a 0.13 µm process. We observed that all measurement results were in the same range as power simulations on back annotated place and route netlists. Table I summarizes the power and area results of the core components.

Fig. 4. Tomahawk Reference Board at Measuring Station

TABLE I
PROGRAMMABLE FUNCTIONAL UNITS AND COREMANAGER OVERVIEW

Unit          Power/mW   Area/mm²   Memory portion (area)
SDSP              27       3.33       91.1 %
VDSP              85       3.80       79.5 %
CoreManager      282       5.95       24.3 %
DC212GP           30       2.50       15.8 %
LDPC             354       7.89       64.0 %
Deblocker         86       4.54       86.0 %

A. Scalar and Vector Processors

The VDSPs are 16 bit fixed-point, 4 way SIMD VLIW processors, issuing up to 5 instructions per cycle. To support H.264 video decoding, the VDSPs can be switched from fractional to integer arithmetic. The achieved performance of all VDSPs is 120 MOPS/MHz, resulting in 21 GOPS at a 175 MHz clock. The average power consumption is 85 mW for an FFT computation. Each VDSP occupies 3.8 mm² and has 32 KByte instruction and 16 KByte data memory. The use of dual port memories enables concurrent computation and data pre-fetching for pending tasks.

Each SDSP occupies 3.33 mm², including 32 KByte instruction and 32 KByte data memory. Each SDSP has a dual cycle single precision floating point unit, required for algorithms with large dynamic range like matrix inversion for MIMO processing. Additionally, all instructions can be executed conditionally, thus enabling low overhead control structures for scalar operations like bit-stream processing. Each SDSP can issue up to 3 instructions per cycle; thus, a compute power of 0.7 GOPS is achieved for both SDSPs at 175 MHz clock frequency. For floating point FIR filter computations, each SDSP consumes 27 mW.

B. CoreManager

The CoreManager occupies 5.95 mm² (including 1.7 mm² for the 3 DMA controllers and 1 mm² for a debugging unit) and consumes 282 mW when fully loaded. The average time to schedule one task is about 60 clock cycles. At a clock rate of 175 MHz, this results in about 100 nJ energy dissipation per scheduled task. Compared to about 500 nJ/task that would be required if the scheduling algorithm ran for 3000 cycles on a standard RISC processor like the Tensilica DC212GP core, this is a significant improvement. In order to save power, the CoreManager explicitly switches off the clock for PEs which are not in use. This is done in addition to the clock gates which are available for all registers in the PEs. The CoreManager itself is not clock gated. This leaves room for significant power reduction in future designs. The presented CoreManager power numbers are simulated on a back annotated place and route netlist. Measuring the real power consumption is practically impossible, because the CoreManager is not able to run without running at least one DC212GP and the DSPs. Furthermore, both NoCs are under load.

C. LDPC Decoder

The LDPC decoder ASIP [12] is dedicated to the decoding of low-density parity-check convolutional codes (LDPCCC), achieving 12 GOPS at 175 MHz. The decoder is able to decode variable block lengths at throughputs of several hundred MBit/s. It is a 64 way parallel, fixed-point SIMD-VLIW architecture with an area of 7.89 mm² and an average power consumption of 354 mW. 611 pJ are consumed per decoded bit when running 10 decoding iterations of a (128,5,13)-LDPCCC [12].

D. Deblocking Filter

One ASIP with 4.54 mm² has been added for acceleration of deblocking filtering. It comprises a 44 KByte instruction memory and a 16 KByte dual-port data memory. The average power consumption of the deblocking filter for decoding a 1080i H.264 encoded baseline video is 86 mW.

E. Network on Chip

Both networks-on-chip are master-slave point-to-point networks with a 32 bit bus width. Burst transfers of up to 63 data words and static priority arbitration per slave are supported. For latency improvement, the NoCs operate on the negative clock transition while all other modules operate on the positive clock edge. Nevertheless, the NoCs work at the full system clock frequency. A sustained throughput of 5.47 GBit/s is achieved for each master-slave connection. The crossbar-like architecture allows each master to communicate with one slave in parallel. The chip area consumed by the NoCs is 0.4 mm².

F. Overall MPSoC

Considering that both DC212GP, all vector and scalar DSPs, the CoreManager and the LDPC decoder are fully loaded and running simultaneously, the overall dynamic power consumption of these components is 1260 mW. However, full utilization is not very likely to occur. If we consider a more realistic utilization of 80% for each component, a dynamic power of about 950 mW would be observed. If we now add the static power consumption of 130 mW and the clock tree power

consumption of 445 mW (for all inactive components), we end up with 1525 mW core power consumption for a realistic application scenario. However, this has yet to be confirmed with a real application, which has not been tested on the Tomahawk so far. The large clock tree power of 445 mW is due to missing clock gating at all peripheral components.

In order to compare the performance values of the complete Tomahawk MPSoC with [13], we scaled the dynamic power consumption of the fully loaded chip by 0.69 · 0.5, where the first scaling factor comes from the voltage difference of 1.3 V and 1.0 V (i.e. (1.0 V)² / (1.3 V)² = 0.69) and the second factor is due to the process gain when going from 130 nm to 90 nm (frequency remains the same) [14]. From Fig. 5 it can be observed that the Tomahawk outperforms existing designs by an order of magnitude, nearly achieving the ASIC results of [13].

Fig. 5. Tomahawk added to graph from [13]: Area efficiency versus energy efficiency normalized to a 1 V 90 nm process and 12 bit adder equivalents

Fig. 6. Chip Micrograph (labelled blocks: three PLLs, 256 KB scratchpad memory, 6 x VDSP, CoreManager, two DC212GP, DDR Ctrl, LDPC-CC decoder, SDSP 0, SDSP 1, deblocking filter)

V. CONCLUSION

We presented Tomahawk, a low power, heterogeneous MPSoC for signal processing applications. A major component of the Tomahawk is a dedicated scheduling unit called CoreManager. This unit is able to automatically schedule signal processing tasks onto the cores, taking data and control dependencies into account. Therefore, the Tomahawk represents a new architectural approach for MPSoCs in which the software developer is released from the complicated task scheduling and synchronization burdens. Furthermore, the Tomahawk chip offers compute power for LTE/WiMAX baseband processing while approaching the power efficiency of ASIC solutions.

Our further research efforts comprise the enhancement of the CoreManager towards real-time task scheduling as well as dynamic power scaling of the computing cores.

ACKNOWLEDGMENTS

We would like to acknowledge Prof. René Schüffny and his team consisting of Holger Eisenreich, Georg Ellguth and Jens-Uwe Schlüssler from the Parallel VLSI-Systems and Neural Circuits Chair at our University for doing a great job at the backend. Furthermore, we would like to thank Frank Siebler, Markus Ullmann, Johannes Lange, Arne Lehmann, Boris Boesler and Patrick Herhold as well as the ZMD AG Dresden and the Institute of Semiconductors and Microsystems of our University. Finally, we would like to thank Synopsys, Tensilica and Altera for sponsoring software, IP and hardware. The major part of this work has been done within the scope of the WIGWAM project, funded by the German Federal Ministry of Education and Research. A minor part was funded by the European Union within the scope of the E2R project.

REFERENCES

[1] U. Ramacher, "Software-defined radio prospects for multistandard mobile phones," Computer, vol. 40, no. 10, pp. 62–69, 2007.
[2] J. Glossner, D. Iancu, M. Moudgill, G. Nacer, S. Jinturkar, S. Stanley, and M. Schulte, "The Sandbridge SB3011 platform," EURASIP J. Embedded Syst., vol. 2007, no. 1, pp. 16–16, 2007.
[3] M. Horowitz and W. Dally, "How scaling will change processor architecture," in Proceedings of the IEEE Solid-State Circuits Conference, 2004, Digest of Technical Papers ISSCC, February 2004, pp. 132–133.
[4] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick, "The landscape of parallel computing research: A view from Berkeley," Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[5] O. Silven and K. Jyrkkä, "Observations on power-efficiency trends in mobile communication devices," EURASIP J. Embedded Syst., vol. 2007, no. 1, pp. 17–17, 2007.
[6] E.A. Lee and D.G. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[7] M. Winter and G. Fettweis, "Interconnection generation for system-on-chip design," in Proceedings of International Symposium on System-on-Chip 2006, Tampere, Finland, November 2006, pp. 91–94.
[8] G. Cichon, P. Robelly, H. Seidel, E. Matus, M. Bronzel, and G. Fettweis, "Synchronous transfer architecture (STA)," in Proceedings of the 4th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS'04), Samos, Greece, July 2004, pp. 126–130.
[9] P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta, "CellSS: a programming model for the Cell BE architecture," in Proceedings of the ACM/IEEE Supercomputing 2006 Conference, November 2006.
[10] D. Pham, S. Asano, M. Bolliger, M.N. Day, H.P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, "The design and implementation of a first-generation Cell processor," in Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, February 2005, vol. 1, pp. 184–592.
[11] H. Seidel, A Task-level Programmable Processor, WiKu, Duisburg, October 2006.
[12] M. Bimberg, M.B.S. Tavares, E. Matus, and G. Fettweis, "A high-throughput programmable decoder for LDPC convolutional codes," in Proceedings of the 18th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'07), Montreal, Canada, July 2007.
[13] D. Markovic, B. Nikolic, and R.W. Brodersen, "Power and area minimization for multidimensional signal processing," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 922–934, April 2007.
[14] K. Flautner, "The wall ahead is made of rubber," in 4th HiPEAC Industrial Workshop on Compilers and Architectures, Cambridge, UK, November 2007.
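
As a cross-check of the scheduling energy figures quoted for the CoreManager (about 100 nJ per task in hardware versus about 500 nJ per task in software), the following minimal Python sketch multiplies scheduling time by power. The function name is hypothetical, and using the 30 mW DC212GP power from Table I for the software case is an assumption; the paper does not state how its 500 nJ estimate was obtained.

```python
def energy_per_task_nj(cycles: float, clock_hz: float, power_mw: float) -> float:
    """Return the energy in nanojoules spent while scheduling one task."""
    seconds = cycles / clock_hz
    # mW * s = mJ, and 1 mJ = 1e6 nJ
    return power_mw * seconds * 1e6


CLOCK_HZ = 175e6  # system clock of the chip

# Hardware scheduler: about 60 cycles per task at 282 mW.
core_manager_nj = energy_per_task_nj(60, CLOCK_HZ, 282)

# Software scheduler: 3000 cycles per task. The 30 mW value is the DC212GP
# entry of Table I and is an assumption about the paper's estimate.
risc_nj = energy_per_task_nj(3000, CLOCK_HZ, 30)

print(f"CoreManager: {core_manager_nj:.1f} nJ/task")
print(f"DC212GP:     {risc_nj:.1f} nJ/task")
print(f"ratio:       {risc_nj / core_manager_nj:.1f}x")
```

Under these assumptions the sketch yields roughly 97 nJ and 514 nJ per task, which is consistent with the rounded values of about 100 nJ and 500 nJ given in the text.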
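
The core power budget of the overall MPSoC can be retraced from the per-unit numbers in Table I and the component counts given in the architecture description. The sketch below is only a bookkeeping aid; the dictionary and function names are hypothetical. The 950 mW dynamic power for the realistic scenario is taken from the text as stated rather than recomputed.

```python
# Per-unit dynamic power in mW (Table I) and number of instances on the chip.
POWER_MW = {"DC212GP": 30, "VDSP": 85, "SDSP": 27, "CoreManager": 282, "LDPC": 354}
COUNT = {"DC212GP": 2, "VDSP": 6, "SDSP": 2, "CoreManager": 1, "LDPC": 1}


def full_load_dynamic_mw() -> int:
    """Sum the dynamic power of all listed units running fully loaded."""
    return sum(POWER_MW[unit] * COUNT[unit] for unit in POWER_MW)


def core_power_mw(dynamic_mw: int, static_mw: int = 130, clock_tree_mw: int = 445) -> int:
    """Add static and clock tree power (values from the text) to a dynamic power."""
    return dynamic_mw + static_mw + clock_tree_mw


print(full_load_dynamic_mw())  # fully loaded dynamic power in mW
print(core_power_mw(950))      # realistic scenario, using the paper's 950 mW
```

The two printed values correspond to the 1260 mW fully loaded dynamic power and the 1525 mW core power quoted in the overall MPSoC discussion.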
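
The 2-D dependency check illustrated in Fig. 3 can be pictured as a rectangle intersection test on a memory that is stored line after line. The paper does not describe the CoreManager's actual algorithm, so the following is a generic sketch under stated assumptions: the `SubBlock` type, the `overlaps` function and the `pitch` parameter (distance between line start addresses, 8 in Fig. 3) are hypothetical, and sub-blocks are assumed not to wrap around the end of a line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubBlock:
    base: int    # address of the top-left element
    width: int   # elements per line covered by the sub-block
    height: int  # number of lines covered by the sub-block


def overlaps(a: SubBlock, b: SubBlock, pitch: int) -> bool:
    """Return True if two 2-D sub-blocks share at least one address.

    Assumes a row-major memory with `pitch` addresses per line and
    sub-blocks that do not wrap around a line end.
    """
    a_row, a_col = divmod(a.base, pitch)
    b_row, b_col = divmod(b.base, pitch)
    rows_meet = a_row < b_row + b.height and b_row < a_row + a.height
    cols_meet = a_col < b_col + b.width and b_col < a_col + a.width
    return rows_meet and cols_meet


PITCH = 8  # line start addresses 0, 8, 16, 24 as in Fig. 3
block_a = SubBlock(base=1, width=4, height=2)   # lines 0-1, columns 1-4
block_b = SubBlock(base=11, width=4, height=2)  # lines 1-2, columns 3-6
block_d = SubBlock(base=5, width=2, height=1)   # line 0, columns 5-6

print(overlaps(block_a, block_b, PITCH))
print(overlaps(block_a, block_d, PITCH))
```

Here `block_a` and `block_b` share addresses 11 and 12, so a dependency exists. `block_d` lies inside the linear address span of `block_a` (1 to 12) but shares no address with it, which is why a check on linear address ranges alone would report a false dependency for 2-D data.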