The What, Why, and How of Customizable Processors Meeting Performance, Cost, and Power Objectives While Reducing ASIC Design Risk and Increasing Design Flexibility
Total Page:16
File Type:pdf, Size:1020Kb
The What, Why, and How of Customizable Processors Meeting performance, cost, and power objectives while reducing ASIC design risk and increasing design flexibility Customizable processors that perform intensive data processing are designed to provide programmability in the performance-intensive dataplane of the system-on-chip (SoC) design. Not only do they combine the capabilities of a DSP and a CPU, but they can be customized to maximize efficiency for your target application. Introduction Contents While processors are often used for the control functions in system-on-chip Introduction ......................................1 (SoC) designs, designers turn to RTL blocks for many data-intensive functions Getting More Performance the Old that control processors can’t handle. However, RTL blocks take a long time to Way #1—Higher Clock Speed ...........2 design and even longer to verify, and they are not programmable to handle Getting More Performance the Old multiple standards or designs. Way #2—RTL Acceleration ................2 What is a Customizable Processor? ....3 The most common embedded microprocessor architectures—such as the ARM®, MIPS, and PowerPC processors—were developed in the 1980s for Processor Design Cycle ......................4 stand-alone microprocessor chips. These general-purpose processor architec- Achieving Lower Energy Consumption tures, or CPUs, are good at executing a wide range of algorithms with a focus with the Xtensa Processor .................5 on control code, but SoC designers often need more performance in critical Benefits of Using a Custom datapath portions of their designs than these microprocessor architectures can Processor Design ...............................5 deliver. Customizable Processors for Digital Signal Processing ...............................5 To bridge this performance gap, the two most-used approaches are to run the Customizable Processors as RTL general-purpose processor at a higher clock rate (thus extracting more perfor- Alternatives .......................................6 mance from the same processor architecture), or to hand-design acceleration Advantages of Using Processors hardware that offloads some of the processing burden from the processor. Instead of RTL ....................................6 Running a general-purpose processor core at a high clock rate incurs a power Fit the Processor to the Algorithm .....7 and area penalty, and designing acceleration hardware takes additional devel- Conclusion ......................................12 opment time, not just for the design but for verification of the new acceler- Additional Information ....................12 ation hardware. In fact, verification can consume as much as 70% to 80% of the total design time. This white paper discusses how Cadence® Tensilica® Xtensa® customizable processors can achieve high performance and lower energy consumption, save time, and provide design flexibility versus hand-coded RTL hardware or a general-purpose processor. The What, Why, and How of Customizable Processors Getting More Performance the Old Way #1—Higher Clock Speed When designing a new chip, some ASIC design teams wait until they know exactly how much processing “horse- power” they need, then they select the processor. This design approach locks up the chip’s system architecture by basing it around the relatively narrow performance range occupied by all of the general-purpose processor architectures. For example, all 32-bit RISC architectures provide essentially the same performance at the same clock speed. If the ASIC design team is wrong about the system’s performance needs and additional processor performance is required, the only alternative available when using general-purpose processors that have fixed instruction set archi- tectures (ISAs) is to use a processor from the same family that runs at a higher clock rate. This is a tried-and-true approach to system engineering dating back to the earliest days of microprocessor usage. However, this decades-old approach has clearly reached the end of its useful life. High clock rates, now measured in GHz, result in excessive power and energy consumption. Chips overheat, fans are needed, and power supply costs rise as a result of this additional energy consumption. All of these consequences increase system costs. In addition, a general-purpose processor can just as easily be employed by one of your competitors in a similar design, making your software and design ideas that much easier to copy. What is the alternative? Custom processors, such as the Xtensa processor, can be optimized for a specific appli- cation. Optimization adds a variety of useful capabilities that keep clock rates and energy consumption low: • Special registers sized to the natural data types of the tasks to be performed • Specialized execution units that efficiently perform task-specific algorithms, often in one or two clock cycles • Customized, system specific I/Os that can connect directly to neighboring blocks of dedicated hardware Further, no other company can replicate your application-specific processor. You can hide portions of your design or give others full access to it using the development tools. These advantages help keep the design pirates away. Getting More Performance the Old Way #2—RTL Acceleration Because many applications simply cannot run fast enough on standard embedded microprocessors, even with an auxiliary DSP, engineering teams often use a hardware description language (HDL) such as Verilog or VHDL to create hand-coded acceleration hardware for the parts of their design with aggressive performance goals that are clearly out of any microprocessor’s reach. However, custom RTL hardware takes a long time to design and even longer to verify. In addition, RTL hardware blocks can’t be easily changed once they’re designed because of the huge verification time. Yet changes are often needed to fix bugs or to accommodate new standards or product features. Figure 1 shows the internal structure of a generic RTL accelerator block. The block’s datapath appears on the left and its state machine appears on the right. www.cadence.com 2 The What, Why, and How of Customizable Processors Data RAM x Finite State + – + – Machine + Figure 1: Hardwired RTL = Datapath Plus State Machine In most RTL accelerators, the datapath consumes the vast majority of the gates. A typical datapath may be as narrow as 16 or 20 bits or range up to hundreds of bits wide, depending on the target task or application. RTL datapaths typically contain many data registers and often include significant blocks of RAM or interfaces to blocks of memory that are shared with other RTL blocks. By contrast, the RTL logic block’s finite state machine (FSM) contains nothing but the control logic. All the nuances of sequencing data through the datapath, all the exception and error conditions, and all the handshakes with other blocks are captured in the FSM. While the FSM is often small in area compared to the datapath, it most often embodies most or all of the design and verification risk due to its logical complexity. A late design change made to an RTL acceleration block is more likely to affect the FSM than the datapath because the FSM contains most of the design complexity. An alternative to dedicated RTL blocks is using a custom processor for embedding the datapath logic and gates, and embedding the FSM control logic into the processor’s firmware to decouple the datapath and the FSM control logic. This separation minimizes hardware design changes and gives flexibility to keep pace with updates that can easily be made in firmware. Custom processors support the RTL block’s ability to have wide, non-integer-sized datapaths—which can be guaranteed correct-by-construction by the processor vendor—while reducing the risks associated with FSM design because processor-based FSMs are firmware programmable. What is a Customizable Processor? Xtensa customizable processors, based on a single processor architecture, use a common development flow with tools that scale from tiny microcontrollers and digital signal controllers to high-performance, real-time controllers and DSPs. Xtensa processors allow you to make the processor run your application more efficiently. You can tailor the Xtensa processor in various ways: • Select from standard configuration options, such as bus widths, interfaces, memories, and preconfigured execution units (floating-point units, DSPs, etc.) www.cadence.com 3 The What, Why, and How of Customizable Processors • Add new registers, register files, and custom task-specific instructions that support specialized data types and operations such as 48-bit data types (stereo audio sample pairs), 56-bit data and operations for security processing (encryption and decryption), or 256-bit data types and operations for packet processing • Add interfaces for more input/output (I/O) bandwidth to move data into and out of the processor such as: Ports (general-purpose I/O [GPIO]), Queues (FIFO), and memory Lookup interfaces • Add instructions on top of the base ISA from pre-defined options or create your own using the Tensilica Instruction Extension (TIE) language Xtensa processors can implement wide, parallel, and complex datapath operations that closely match those used in custom RTL hardware. The equivalent datapaths are implemented by augmenting the base processor’s integer pipeline with additional execution units, registers, and other functions developed by the chip architect for a target application You get this customization with