A Fully Programmable 40 GOPS SDR Single Chip Baseband for LTE/WiMAX Terminals
Torsten Limberg, Markus Winter, Marcel Bimberg, Reimund Klemm, Emil Matúš, Marcos B.S. Tavares, Gerhard Fettweis, Hendrik Ahlendorf, Pablo Robelly
Technische Universität Dresden
Vodafone Chair Mobile Communications Systems
01062 Dresden, Germany
Email: [email protected]

Abstract— The increasing number of radio protocols, along with the need for multimedia support in mobile communication devices, calls for heterogeneous, programmable multi-core processors. In this paper, we present a fully programmable, heterogeneous single chip SDR platform with multimedia support, fabricated in a 0.13 µm CMOS process. Running at 175 MHz, a peak performance of 40 GOPS is delivered while dissipating 1.5 W. The typical MPSoC programmability problem is solved with a dedicated hardware unit which performs dynamic spatial and temporal mapping of tasks onto processing elements.

I. INTRODUCTION

Emerging next generation cellular standards like 3GPP LTE and WiMAX require a vast amount of modem signal processing. Both standards represent high data rate, low latency, packet optimized technologies, incorporating OFDMA/MIMO, adaptive modulation and coding techniques. In such systems, the dynamic variability of configurations due to user resource allocation, in conjunction with high computational demand as well as low latency requirements, calls for programmable distributed baseband architectures. On the other hand, broadband media applications (e.g. H.264) will be running in handsets as well. Their data dependent control flow does not allow effective scheduling at compile time. Thus, a run-time solution of this problem is also required. Multi-core architectures, e.g. from Icera, Coresonic, PicoChip, Infineon's MuSIC [1] or Sandbridge's SB3011 platform [2], are acknowledged to be power efficient [3], [4] in such scenarios. However, the MPSoC programmability/scheduling problem is still one of the main obstacles to be overcome.

In this paper, we present the Tomahawk MPSoC, which has a dedicated run-time scheduler hardware unit for solving the programmability/scheduling problem. We call this dedicated scheduling unit CoreManager. Its purpose is to reduce the context switching overhead in the control code processor, which traditionally places major efficiency penalties on MPSoCs [5]. The Tomahawk is a low-power, C-programmable software defined radio platform with multimedia support, based on embedded software written as modular tasks according to the synchronous data flow model [6].

II. MPSOC ARCHITECTURE

The Tomahawk MPSoC exploits instruction, data and task level parallelism in order to meet stringent performance requirements with low energy consumption. Figure 1 shows a schematic of the Tomahawk. Below, we briefly discuss the components building this architecture.

The platform for the operating system and control code execution consists of two Tensilica DC212GP RISC processors. The signal processing block of the Tomahawk is composed of six fixed-point vector DSPs (VDSP), two scalar floating-point DSPs (SDSP), an LDPC decoder ASIP, a deblocking filter ASIP and an entropy decoder ASIC. Additionally, the CoreManager performs the scheduling of data transfers and signal processing tasks issued from the control code onto the VDSPs and SDSPs.

In the Tomahawk, local and global memories can be found. The local memories are part of the signal processing elements, which do not have direct access to the global memories. On the other hand, the global memories are accessible from all NoC master components (see Fig. 1) and consist of external DDR-SDRAMs and I2C as well as an internal 256 KByte SRAM, which is used as scratchpad memory. Moreover, three independently (and parallel) accessible DDR RAM controllers provide large memory bandwidth in order to supply the processing elements (PE) and the control processors (CP) with data.

The peripheral part of the Tomahawk consists of the following components: an FPGA bridge that enables additional functionalities by the mapping of off-chip components into the address space of the Tomahawk, a single lane PCI Express interface that realizes communication links of 2 GBit/s to a host computer, a VGA/Streaming interface that allows interfacing AD or DA converters, a freely programmable DMA controller, a general purpose I/O and a UART interface.

All components in the chip are connected by two low latency, high bandwidth, crossbar-like networks-on-chip (NoC) [7]. The FPGA bridge supports the same protocol and bandwidth as the NoC, and it can be seen by the NoC as a master or a slave depending on the function of the component attached to it.

In order to achieve low power consumption, the data locality principle is used at multiple levels. For instance, within the STA (synchronous transfer architecture) processors [8] explicit register file bypassing is used. In contrast to the traditional approach, the STA functional units hold and exchange data directly with each other. This significantly reduces the I/O bandwidth, size and power consumption of the register file. The VDSPs, SDSPs, LDPC decoder and the deblocking filter are all based on the STA principle.

Fig. 1. Tomahawk schematic. Abbreviations: NoC = network-on-chip; M, S = NoC master/slave port; DDR Ctrl = DDR SDRAM controller; PCIe = PCI Express endpoint; GPIO = general purpose I/O; VDSP = vector DSP; SDSP = scalar DSP; LDPC = low-density parity-check decoder; DMA = direct memory access controller.

III. PROGRAMMING MODEL

The C-based programming model of the Tomahawk, which is similar to the CellSS [9] programming model for the Cell processor [10], hides scheduling details from the programmer completely. However, in contrast to the software based scheduling of CellSS, the CoreManager [11] computes the schedule of tasks issued from the control code with dedicated hardware, and thus achieves significantly better performance and energy efficiency. The programmer is merely required to identify all C-functions which shall be executed as tasks on one of the processing elements controlled by the CoreManager. The calls to these tasks are converted to so-called task descriptions [11] at compile time. At run-time, these task descriptions are sent to the CoreManager instead of calling the tasks explicitly. The spatial and temporal mapping of the tasks onto the PEs is then done automatically under consideration of data dependencies. Simultaneously, the control processor can continue execution and send further task descriptions to the CoreManager as long as a queue length of 16 tasks is not exceeded.

Concerning the data transfers, the CoreManager tries to maximize the local reuse of program memories, thus reducing the need for reloading (Fig. 2). This makes it possible to reduce the required local and global memories as well as the NoC bandwidth.

Fig. 2. Task level data locality principle.

Besides linear memory regions, the CoreManager and DMA controllers support two-dimensional memories. This allows for more effective implementation of multimedia and MIMO algorithms. Figure 3 shows how 2-D dependency checking on the memories is performed.

Fig. 3. Dependency checking between two 2-D sub-blocks of a 2-D memory which is stored line after line in memory (numbers are line start addresses).

IV. IMPLEMENTATION AND RESULTS

The Tomahawk chip was designed using a UMC 130 nm, 8 metal layer CMOS standard cell design flow. The 57M transistor chip occupies 10 x 10 mm² (including all 480 I/O cells) and runs at 175 MHz. The typical case core supply voltage is 1.2 V, the I/O voltage is 3.3 V, and 2.5 V is used for the high speed SSTL2 I/Os.

Figure 4 shows the setup of the measurement station. All presented results were measured with this setup. For the core power measurements, the PCB provides an independent power supply for the Tomahawk core. The core supply voltage can be adjusted from 0.9 to 1.35 V and has been set to 1.3 V for all measurements. Since the Tomahawk has only a single power domain for all components, obtaining exact power numbers for single components is impossible. Therefore, we determined approximate power numbers by ensuring that only the component under observation is running during the measurement. Static power for single components was neglected, which is acceptable for a 0.13 µm process. We observed that all measurement results were in the same range as power simulations on back annotated place and route netlists. Table I summarizes the power and area results of the core components.

[...] of 175 MHz, this results in about 100 nJ energy dissipation per scheduled task. Compared to about 500 nJ/task that would be required if the scheduling algorithm ran for 3000 cycles on a standard RISC processor like the Tensilica DC212GP core, this is a significant improvement.

In order to save power, the CoreManager explicitly switches off the clock for PEs which are not in use. This is done in addition to the clock gates which are available for all registers in the PEs. The CoreManager itself is not clock gated. This leaves room for significant power reduction in future designs.
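
The per-task scheduling energy figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that only uses the numbers stated in the text (175 MHz, 3000 cycles, about 500 nJ and about 100 nJ per task). It assumes that the control processor runs at the chip clock of 175 MHz; the implied average power is a derived value based on rounded figures, not a measurement reported here.

```python
# Back-of-the-envelope check of the scheduling energy figures from the text.
CLOCK_HZ = 175e6                 # chip clock frequency stated in the text
RISC_CYCLES_PER_TASK = 3000      # software scheduling cost stated in the text
RISC_ENERGY_PER_TASK_J = 500e-9  # about 500 nJ per task (software scheduling)
CM_ENERGY_PER_TASK_J = 100e-9    # about 100 nJ per task (CoreManager)


def scheduling_time_s(cycles, clock_hz):
    """Time spent scheduling one task, in seconds."""
    return cycles / clock_hz


def implied_power_w(energy_j, time_s):
    """Average power implied by spending energy_j over time_s."""
    return energy_j / time_s


# Assumption: the RISC control processor is clocked at CLOCK_HZ.
risc_time_s = scheduling_time_s(RISC_CYCLES_PER_TASK, CLOCK_HZ)
risc_power_w = implied_power_w(RISC_ENERGY_PER_TASK_J, risc_time_s)
energy_ratio = RISC_ENERGY_PER_TASK_J / CM_ENERGY_PER_TASK_J

print(f"software scheduling time per task: {risc_time_s * 1e6:.1f} us")
print(f"implied control processor power:   {risc_power_w * 1e3:.1f} mW")
print(f"energy ratio software/CoreManager: {energy_ratio:.1f}x")
```

Under these assumptions, software scheduling would occupy the control processor for roughly 17 µs per task, which corresponds to an average power of roughly 29 mW, and the dedicated scheduler needs about one fifth of the energy per task.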
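
The programming model of Section III (task descriptions sent to a bounded queue, followed by dependency-aware mapping onto free PEs) can be illustrated with a small software model. This is only a sketch under stated assumptions and not the CoreManager algorithm, which is implemented in hardware and described in [11]. The class `CoreManagerModel`, the `TaskDescription` fields `reads` and `writes`, and the buffer names are hypothetical; only the queue length of 16 comes from the text.

```python
from collections import deque
from dataclasses import dataclass

QUEUE_DEPTH = 16  # from the text: at most 16 queued task descriptions


@dataclass(frozen=True)
class TaskDescription:
    name: str
    reads: frozenset   # hypothetical: names of buffers the task reads
    writes: frozenset  # hypothetical: names of buffers the task writes


def conflicts(task, earlier):
    """True if task must wait for earlier (RAW, WAW or WAR hazard)."""
    return bool(task.reads & earlier.writes
                or task.writes & earlier.writes
                or task.writes & earlier.reads)


class CoreManagerModel:
    """Toy software model of run-time task mapping (illustration only)."""

    def __init__(self, num_pes):
        self.pending = deque()
        self.free_pes = list(range(num_pes))
        self.running = {}  # PE index -> task currently executing

    def submit(self, task):
        """Queue a task description; False means the control code must wait."""
        if len(self.pending) >= QUEUE_DEPTH:
            return False
        self.pending.append(task)
        return True

    def dispatch(self):
        """Start each pending task that is independent of all earlier ones."""
        started = []
        blockers = list(self.running.values())
        still_pending = deque()
        for task in self.pending:
            if self.free_pes and not any(conflicts(task, e) for e in blockers):
                pe = self.free_pes.pop(0)
                self.running[pe] = task
                started.append((task.name, pe))
            else:
                still_pending.append(task)
            blockers.append(task)  # later tasks must respect program order
        self.pending = still_pending
        return started

    def finish(self, pe):
        """Mark the task on a PE as completed and free the PE."""
        del self.running[pe]
        self.free_pes.append(pe)


cm = CoreManagerModel(num_pes=2)
cm.submit(TaskDescription("fft", frozenset({"rx"}), frozenset({"X"})))
cm.submit(TaskDescription("equalize", frozenset({"X"}), frozenset({"Y"})))
cm.submit(TaskDescription("fft2", frozenset({"rx2"}), frozenset({"X2"})))
first_wave = cm.dispatch()   # "equalize" waits for "fft"; "fft2" can start
cm.finish(0)                 # "fft" completes on PE 0
second_wave = cm.dispatch()  # "equalize" is now free to run
```

In this model the control code never calls a task directly. It only submits descriptions and continues, which mirrors the decoupling described in the text. The choice of the lowest-numbered free PE is arbitrary and stands in for whatever spatial mapping policy the hardware applies.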
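
The 2-D dependency check referred to in connection with Fig. 3 can be sketched as a rectangle intersection test on a memory stored line after line. This is an illustrative assumption about how such a check can be done, not the documented CoreManager logic. The `SubBlock` fields and function names are hypothetical, and the line pitch of 8 is inferred from the line start addresses 0, 8, 16, 24 shown in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubBlock:
    start: int   # address of the top-left element
    width: int   # elements per line covered by the sub-block
    height: int  # number of lines covered by the sub-block


def overlaps_2d(a, b, pitch):
    """True if two sub-blocks of a line-after-line 2-D memory share elements."""
    a_row, a_col = divmod(a.start, pitch)
    b_row, b_col = divmod(b.start, pitch)
    # Assumption: a sub-block never wraps past the end of a memory line.
    if a_col + a.width > pitch or b_col + b.width > pitch:
        raise ValueError("sub-block exceeds the line pitch")
    rows_intersect = a_row < b_row + b.height and b_row < a_row + a.height
    cols_intersect = a_col < b_col + b.width and b_col < a_col + a.width
    return rows_intersect and cols_intersect


def linear_ranges_overlap(a, b, pitch):
    """Conservative 1-D check on the enclosing address ranges, for comparison."""
    a_last = a.start + (a.height - 1) * pitch + a.width - 1
    b_last = b.start + (b.height - 1) * pitch + b.width - 1
    return a.start <= b_last and b.start <= a_last


PITCH = 8  # inferred from the line start addresses 0, 8, 16, 24
block_a = SubBlock(start=1, width=4, height=2)   # lines 0-1, columns 1-4
block_b = SubBlock(start=11, width=4, height=2)  # lines 1-2, columns 3-6
block_d = SubBlock(start=6, width=2, height=2)   # lines 0-1, columns 6-7

a_b_dependent = overlaps_2d(block_a, block_b, PITCH)
a_d_dependent = overlaps_2d(block_a, block_d, PITCH)
a_d_linear = linear_ranges_overlap(block_a, block_d, PITCH)
```

Here `block_a` and `block_d` are independent in two dimensions although their enclosing linear address ranges overlap. A purely linear check would serialize the two tasks unnecessarily, which is one plausible reason why 2-D regions help with multimedia and MIMO algorithms.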