Evaluating ASIC, DSP, and RISC Architectures for Embedded Applications

Evaluating ASIC, DSP, and RISC Architectures for Embedded Applications Marc E. Campbell Northrop Grumman Corporation Abstract = = ⋅ 2 ⋅ ⋅ ⋅ Mathematical analysis and empirical evaluation of the solid PowerCPU P C V f N %N = = ⋅ 2 ⋅ ⋅ ⋅ (1-1) state equation PowerCMOS P C V f N %N is presented in this paper which identifies a measurable metric for evaluating where, relative advantages of ASIC, DSP, and RISC architectures for C = Capacitance embedded applications. Relationships are examined which can V = CPU Core Voltage help predict relative future architecture performance as new f = CPU Core Clock Frequency generations of CMOS solid state technology become N = Number of Gates available. In particular, Performance/Watt is shown to be an %N = Percentage of Gates which change state on a given Architecture-Technology Metric which can be used to • calibrate ASIC, DSP, & RISC performance density potential clock cycle. relative to a solid state technology generations, Some primary architectural and technoloyg components in • measure & evaluate architectural changes, and the CMOS power consumption equation 1-1 can be noted. • project a architecture performance density roadmap. 1.2. Architecture 1. Architecture Metric Basis First, the architecture is a function of the number of gates Intuitively, more efficient architectures will use less energy N and how gates toggle to perform a task, %N . Physical to complete the same task on the same generation CMOS CMOS gate real estate is quantified by N . How the software solid state technology. Beyond this intuition, interacts with the CMOS gates is quantified by how the gates (1) How can the relative advantages between ASIC, DSP, toggle, %N . Thus software architecture is also reflected in and RISC architectures be quantified? %N for software programmable devices. The system (2) How can the relative advantages between different architecture of combined hardware and software is largely ASIC, DSP, and RISC options be projected into future represented by N%N . embedded system design plans? The key concept is the power consumption behavior of a Architecture = N ⋅%N (1-2) CMOS gate. More complex architectures require more gates. More 1.1. CMOS device power consumption specialized architectures have fewer gate toggles per unit task. Thus, an Architecture Specialization metric can be defined Power consumption is an externally measurable physical based on how many clock cycles and gates are required to property of CMOS ASIC, DSP, and RISC devices which are complete a given functional task. The fewer gates and cycles used to build larger computational systems. The power used to get the task done the more specialized, efficient, or consumption for a CMOS device is shown in equation 1-1. “tuned” the processor is for the required task. 1.3. CMOS technology Digital Equipment Corporation (DEC) [2]. Figure 2-1 does not include the power consumption required by support logic. Second, the advances in solid state process technologies can 1 be observed as a function of gate geometry, voltage, and 0.8 frequency. Newer technologies systems are design with smaller gate geometry, lower voltages, and higher frequencies. 0.6 Changes in gate geometry affect changes in capacitance. 0.4 0.2 1 1 SPECfp95 / Watt SolidStateTechnology(), 2 , f (1-3) C V 0 350 400 450 500 2. Architecture Metric Observations Alpha 21164 Clock (MHz) Figure 2-1. Alpha Performance/Watt vs. Clock At this point, several observations based on sustained performance per watt can be made: 2.2. Voltage 2.1. Clock frequency Observation 2. A voltage (V 2 ) decrease significantly improves performance per watt for the same architecture. Observation 1. Higher performance by only increasing the For the same architecture with corresponding clock rate f has no net performance per watt benefit. N%N Oc Let, task definition, the relative performance per watt improvement task operations can be shown by equation 2-3, Performance = = ()Performance unit_ time second Power VoltageChangeBenefit = CPU new operations cycles ()Performance = PowerCPU old cycle second Oc 2 = (2-1) C ⋅V ⋅ N ⋅%N Oc f = new Oc where, ⋅ 2 ⋅ ⋅ C Vold N %N Oc = sustained operation per cycle, and f = clock frequency. CV 2 = old (2-3) CV 2 Then it follows that, new For the cases where C is near constant or the change in C ()operations is small with respect to the change in V 2 the performance per Performance = second watt benefit is approximated as, PowerCPU watt O V 2 = c VoltageChangeBenefit ≈ old (2-4) 2 (2-2) 2 C ⋅V ⋅ N ⋅%N Vnew The observation that clock frequency f does not affect The approximate performance per watt benefit as an performance per watt for the same CPU architecture N is also architecture voltage changes from 3.5 volts toward 2.0 volts is illustrated in Figure 1-1 based on information available from calculated in equation 2-5. V 2 3.32 3.3 = = 2.7 (2-5) 40 2 2 Alpha V2.0 2.0 MIPS 30 PA-RISC Note also that, ∂ PowerPC Pnew ∂ 2 P 2 ∂f V 20 ∝ V ⇒ = new (2-6) ∂ ∂P 2 f old Vold ∂f 10 SPECfp92 / Peak Watt Thus, the 2x improvement in ∂P ∂f slope improvement 0 shown in figure 2-2, and the associated 2x improvement in 91 92 93 94 95 96 performance / watt, for the DEC Alpha 21164 [2] can be 2 mostly attributed to the change in V rather than by a change Figure 2-3. SPECfp92 Performance/Watt vs. in architecture. The data in figure 2-2 does not include power Release Year consumed by support logic. Alpha 21164, Using a 64-bit processor, where only 32-bits is required, 60 Vdd= 3.3V will potentially consume on the order of 2x more energy on 55 Alpha 21164, large scale systems. Likewise, Field Programmable Gate 50 Vdd=2.5V, Array (FPGA) processors would be an energy efficient choice 45 core=2.0V 5 watts / 34 MHz 40 for bit level template matching algorithms. 35 Watts / CPU 30 5 watts / 66 MHz 2.4. Architecture specialization 25 20 Observation 4. For given operational task O and a given 200 300 400 500 c technology generation, Performance / Watt provides a metric Alpha 21164 Clock (MHz) of Architecture Specialization. Figure 2-2. DEC Alpha Watt vs. Clock The Architecture Specialization metric is concisely stated in equation 2-7. Performance 2.3. Architecture width ()(2-7) ArchitectureSpecializationDSP = Watt DSP RISC ()Performance Watt RISC Observation 3. Smaller architecture widths will have lower The Architecture Specialization metric has the nice feature power consumption per completed task. of being directly measurable. In contrast, an N%N count Since 64-bit floating point register uses 2x more gates N determination would require a gate level simulation which is than a 32-bit floating point register, the 64-bit architecture can not as widely available an actual production device. be expected to consume 2x more power for the same solid The Architecture Specialization metric of a 32-bit SHARC state CMOS generation. DSP with respect to a 64-bit 21064A and 21164 Alpha RISC The difference between 32-bit and 64-bit performance / watt for sustained FFT throughput / watt is shown in figure 2-4. can be observed in figure 2-3. The 32-bit MIPS and 32-bit The 18x Architecture Specialization measurement is based PowerPC show a 2x SPECfp92 performance per watt on measured sustained throughput / watt for 1D complex compared with the 64-bit DEC Alpha and 64-bit Hewlett FFTs averaged over vector lengths ranging from 64 to 32K for Packard PA-RISC processors in the 1995 and 1996 release the same voltage system. The SHARC 21060 is designed for years as the respective architectures matured in the market. efficient FFT performance. The Architecture Specialization metric will vary significantly for other tasks. RISC 64-bit @ 5 volts being given the value of 1. CMOS 70 technology generation axis shows the corresponding 60 Ave.Ops Sec Watt 58 DSP32 = = 18 minimum and/or specified device voltage. Notice that the 50 Ave.Ops Sec 3.14 Watt RISC64 DSP-RISC architecture offset is greater than the precision 40 offset within a RISC Architecture Family. 30 Ave.Ops Sec Watt 10.9 NEW = = 2.3 Ave.Ops Sec Figure 2-6 illustrates how a lag of 2 CMOS generations 20 4.8 FFT MFLOPS / Watt Watt OLD can diminish the specialization advantage that an architecture 10 might have. 0 2 Generations ASIC 16-bit 64 128 256 512 1024 2048 4096 8192 16384 32768 1000 Complex FFT Length (Points) Alpha 21064A, 275 MHz, Alpha 21164, 333 MHz, SHARC 21060, 40 MHz, (A) DSP 32-bit 275 MFLOP Peak, 3.3 666 MFLOP Peak, 2.2 120 MFLOP Peak, 3.3 Volts, 33+8 Watts Volt, 25.4 + 6 Watts Volt, 1.75 Watts 100 (B) Figure 2-4 FFT MFLOPS / Watt RISC 32-bit RISC 64-bit Notice that the 2.6x Alpha performance per watt increase (C) 2 10 can be largely attributed to a V core voltage change and not (D) to a fundamental shift to a more DSP-specialized architecture. Architecture Metric (Performance/Watt) An aside observation is that more specialized architectures 1 5 3.3 2.2 1.5 1 inherently require more specialized programming techniques Solid State Technology (Voltage) and tools. FPGAs use a Hardware Description Language Figure 2-6 Two Generation Lag (HDL). Floating point DSP chipsets primarily have The performance/watt offset between FPGA, Integer DSP, assembly and C support. General purpose RISC processors Floating Point DSP, and RISC will be persistent, as each have a wide variety of programming languages including niche architecture takes advantage of the same advancements in solid languages such as Fortran, Prolog, and SmallTalk. state technology.

Load more