A BDTI Analysis of the ADSP-BF5xx

by the staff of Berkeley Design Technology, Inc.

© 2004 BDTI (www.BDTI.com). All rights reserved.

Contents of this summary include:
• Introduction
• Architecture
• Memory System
• Pipeline
• Addressing
• Instruction Set
• Peripherals
• BDTI Benchmark™ Performance:
  • Sample Execution Time Results
  • Sample Cost-Performance Results
  • Sample Energy Efficiency Results
  • Sample Memory Use Results
• Conclusion

About BDTI
BDTI provides analysis and advice to help companies develop, market, and use technology.
BDTI is a trusted industry resource for:
• Independent benchmarking and competitive analysis
• Guidance for confident technology and business decisions
• Expert product development advice
• Industry and technology seminars and reports
• Advice and analysis that enable credible, compelling marketing

Introduction
The ADSP-BF5xx (Blackfin) is a family of 16-bit fixed-point dual-MAC processors from Analog Devices. The ADSP-BF5xx combines features typical of low-power DSPs with features traditionally associated with general-purpose processors, such as privilege modes and . The ADSP-BF5xx targets power-sensitive applications, such as phones; applications that require the functionality of both a DSP and a general-purpose processor, such as automotive applications; and computationally intensive applications, such as consumer video equipment.

The ADSP-BF5xx uses a mixed-width 16-/32-bit instruction set. These instructions can be combined to form 64-bit VLIW-style “multi-issue” instructions. Multi-issue instructions can include up to one 32-bit arithmetic instruction and up to two 16-bit move instructions. The ADSP-BF5xx provides single-instruction, multiple-data (SIMD) instructions, including a dual 16 × 16 multiply-accumulate and a variety of video-oriented, eight-bit ALU operations.

The ADSP-BF5xx is based on the Micro Signal Architecture (MSA) instruction set architecture jointly developed by Analog Devices and . There are two generations of processors, which have slightly different instruction sets and microarchitectures. Due to these architectural differences, the two generations are only partly assembly-code compatible.

As of late 2004, the ADSP-BF5xx family included five family members. The sole first-generation Blackfin processor, the ADSP-BF535, achieves a clock speed of 350 MHz at 1.6 volts. Second-generation Blackfin processors include the ADSP-BF531, ADSP-BF532, ADSP-BF533, and ADSP-BF561. The second-generation parts operate at up to 750 MHz at 1.45 volts.

ADSP-BF5xx family members are designed to operate over a range of clock speeds and operating voltages and are able to switch dynamically between speeds via software. For example, a single ADSP-BF533 chip can operate at speeds and voltages ranging from 750 MHz at 1.45 volts to 100 MHz at 0.8 volts.

For the purposes of this analysis, the Blackfin processor family is referred to as the ADSP-BF5xx throughout this text. However, because the ADSP-BF561 contains two processor cores, some of the following analysis does not apply to this dual-core family member.

As of late 2004, prices for ADSP-BF5xx family members range from $5 to $40 in 10,000-unit quantities. All five ADSP-BF5xx family members are currently in full production.

Architecture
The ADSP-BF5xx contains two fixed-point data paths, two address generation units, and a program sequencer. The ADSP-BF5xx also includes a data register file of eight 32-bit registers, as well as two 40-bit accumulators and two address register files.

The ADSP-BF5xx uses a load/store architecture: the data paths generally take inputs from and return results to the data register file or the accumulators.

Collectively, the two ADSP-BF5xx data paths include two multipliers (MAC0 and MAC1), two ALUs (ALU0 and ALU1), and a single shifter. One data path includes MAC0, ALU0, and the shifter; the other data path contains MAC1 and ALU1. The ADSP-BF5xx can issue one SIMD instruction that uses the two ALUs or the two MAC units in parallel, but it does not support instructions that use two dissimilar execution units (for example, one ALU and one MAC unit).

In addition to instructions that use both ALUs or both MAC units in parallel, the ADSP-BF5xx supports SIMD operations within each ALU and within the shifter (but not within the MAC units). These SIMD operations allow an execution unit to perform two operations per cycle. Thus, it is possible to perform four operations per clock cycle using both ALUs or two operations per clock cycle using the shifter. However, the SIMD operations within the ALUs are supported only as part of a SIMD operation across the data paths, so that both ALUs must perform the same type of SIMD computation (e.g., a 16-bit add/subtract).

The ADSP-BF5xx generally processes data as 16-bit values. Many instructions support 32-bit data, however, and some support 40-bit data. For example, a five-cycle 32 × 32 → 32 multiply is supported. The ADSP-BF5xx also supports a set of video-oriented operations that operate exclusively on 8-bit data.

Memory System
All ADSP-BF5xx memory is organized into a single unified 32-bit address space. However, the ADSP-BF5xx organizes its level-one (L1) on-chip memory into separate instruction and data banks.

The organization of the L1 memory is a primary differentiator among ADSP-BF5xx family members. Depending on the family member, the L1 memory consists of up to five separate banks of memory organized as a modified Harvard architecture. Up to three of the memory banks are data SRAM banks, some of which can be optionally configured as cache. The remaining two banks contain instruction ROM and instruction SRAM/cache.

In addition to level-one memory, the ADSP-BF535 includes an on-chip level-two (L2) memory system. The L2 memory of the ADSP-BF535 consists of 256 Kbytes of unified program and data RAM. The dual-core ADSP-BF561 provides four banks of private L1 memory for each core as well as 128 Kbytes of shared L2 memory.

The ADSP-BF5xx core can perform one instruction read and two data transfers in each cycle. Each data transfer can be 8, 16, or 32 bits wide. Both data transfers can access the same data bank if they use different sub-banks, but only one of the transfers can be a store. If data is arranged as 16-bit pairs in memory, the ADSP-BF5xx can transfer four 16-bit values each cycle. Thus, the maximum sustainable on-chip data bandwidth is 3 billion 16-bit words per second at 750 MHz for reads, or 1.5 billion 16-bit words per second for writes.

The ADSP-BF5xx address space is byte addressable. However, instructions must be aligned on 16-bit boundaries, and loads and stores must maintain the appropriate alignment for the transfer: 8-, 16-, and 32-bit transfers must be aligned on 8-, 16-, and 32-bit boundaries, respectively.

Addressing
The ADSP-BF5xx has two address generation units that can each generate an independent address in each cycle. The address generation units access two 32-bit register files that contain general-purpose address registers, stack pointers, and special registers for modulo addressing.

The ADSP-BF5xx supports a variety of addressing modes, including: register-indirect, register-indirect with post-increment or post-decrement, register-indirect indexed addressing with a short or long immediate offset, register-indirect addressing with pre-decrement for stack pushes, and register-indirect addressing with post-increment for stack pops. The register-indirect mode supports bit-reversed addressing.

The address generation units can also perform some addition, subtraction, and shifting operations on the address registers.

Pipeline
The first-generation ADSP-BF535 pipeline has eight stages, while the second-generation ADSP-BF5xx pipeline has ten stages. Both generations feature fully interlocked pipelines, so that data hazards do not cause unexpected results. However, some data hazards are resolved by stalling the processor. Data forwarding has been improved in the second-generation microarchitecture; some stalls that result from data hazards on the ADSP-BF535 have been removed by the improved forwarding mechanisms.

The ADSP-BF5xx uses static branch prediction for all conditional branches. The programmer (or ) must specify the prediction for each branch.

Instructions generally execute in one cycle in the absence of memory access-related delays and pipeline stalls. The most significant exceptions are the jump, call, and most return instructions, which require four or more cycles.

Instruction Set
The ADSP-BF5xx assembly language syntax is algebraic. The ADSP-BF5xx assembly language is not compatible with that of earlier Analog Devices processors.

The ADSP-BF5xx uses both 16- and 32-bit instructions. Arithmetic instructions are generally 32 bits wide, but some have 16-bit variants; 16-bit-wide arithmetic instructions offer less flexibility than their 32-bit counterparts. Load, store, and branch instructions are typically 16 bits wide, but most have 32-bit variants that support long immediate operands.

The ADSP-BF5xx instruction set is highly orthogonal. Most arithmetic instructions can take operands from the data registers, the pointer registers, the accumulators, or immediate operands. Most arithmetic instructions support both 16-bit and 32-bit data. Most arithmetic instructions also support SIMD operations.

Peripherals
ADSP-BF5xx family members offer a variety of peripherals, including DMA controllers, general-purpose I/O pins, timers with pulse width modulation (PWM) and pulse measurement capability, real-time clocks, watchdog timers, serial ports, UART ports, and Serial Peripheral Interface (SPI) ports. The ADSP-BF531, ADSP-BF532, ADSP-BF533, and ADSP-BF561 each have an internal voltage regulator and one or more parallel ports that support ITU-R 656 video modes. The ADSP-BF535 has PCI and USB interfaces.

Benchmark Performance
The BDTI Benchmarks™ are a set of digital signal processing functions that
BDTI has independently designed to provide an objective basis for comparing processor performance characteristics such as speed and memory use for signal processing applications. Implementations of the BDTI Benchmark functions are carefully optimized for each processor to allow a realistic assessment of signal processing performance. The resulting software is then verified for functional correctness, optimality, and adherence to the BDTI Benchmark specifications. Benchmark performance results are obtained either through manual analysis and careful, detailed simulation, or by measurement on sample devices.

BDTI’s reports such as Buyer’s Guide to DSP Processors and the Inside series of reports include extensive BDTI Benchmark results used to evaluate the signal processing performance of a set of processors. For each benchmark, BDTI typically reports cycle counts, execution times, a cost-performance metric, an energy-efficiency metric, and memory use.

In this section, we present sample execution time, energy consumption, and memory use results taken from BDTI’s library of benchmark results for the ADSP-BF5xx and two other fixed-point DSPs: the TMS320C55x and TMS320C64x.

Execution Time
Execution time results in this report were obtained assuming instructions and data are preloaded in caches where applicable. Processor speeds are for the fastest available chips as of late 2004.

Sample Execution Time Results
The execution time results for BDTI’s LMS Adaptive FIR Filter benchmark are shown in Figure 1. The LMS Adaptive FIR Filter benchmark consists of an FIR filter, an error calculation, and a filter coefficient update. As shown in Figure 1, the ADSP-BF533 is much faster than the TMS320C5501 on this benchmark. Part of this difference in speed is due to a difference in cycle efficiency: the TMS320C5501 requires about 25% more cycles than the ADSP-BF533 to complete this benchmark. Although the ADSP-BF533 and the TMS320C5501 are both dual-MAC DSPs, only the ADSP-BF533 uses its dual-MAC capability on this benchmark. In the TMS320C5501, the two MAC units share an input, a restriction not imposed by most architectures. The TMS320C5501 is therefore limited to single-MAC operation on the filtering section of this benchmark. In contrast, the ADSP-BF533 uses both of its MAC units on this benchmark. As a result, the ADSP-BF533 is significantly more efficient than the TMS320C5501 on this benchmark.

[Figure 1. LMS Adaptive FIR Execution Times (lower is better). Vertical axis: execution time (nanoseconds), 0 to 200. Bars: ADI ADSP-BF533 (750 MHz); TI TMS320C5501 (300 MHz); TI TMS320C6414T (1000 MHz).]

The TMS320C6414T is a high-performance 8-issue VLIW DSP, but it does not make full use of its parallelism on this benchmark. The LMS Adaptive FIR benchmark is a short benchmark in which setup and housekeeping tasks are an important factor in overall performance. These tasks present little opportunity for performing operations in parallel. Thus, the TMS320C6414T is only about 20% more cycle-efficient than the ADSP-BF533 on this benchmark.

Differences in clock rates also play a large role on this benchmark. The 750 MHz ADSP-BF533 executes instructions at a much faster rate than the 300 MHz TMS320C5501, but at a much slower rate than the 1000 MHz TMS320C6414T. Due to its higher cycle efficiency and higher clock rate, the ADSP-BF533 is over three times faster than the TMS320C55x on this benchmark. Similarly, the combination of high cycle efficiency and high clock rate makes the TMS320C6414T roughly 1.5 times faster than the ADSP-BF533 on this benchmark.

Energy Efficiency
[Figure 2. Energy Consumption for LMS Adaptive FIR (lower is better). Vertical axis: energy (watt-nanoseconds), 0 to 50. Bars: ADI ADSP-BF533 (24 mW, 100 MHz, 0.8 V); TI TMS320C5509 (62 mW, 108 MHz, 1.2 V); TI TMS320C6414T (673 mW, 600 MHz, 1.1 V).]

Figure 2 shows energy efficiency results. To estimate the energy required for a processor to complete a given benchmark, the benchmark execution time is multiplied by the estimated typical power consumption for that processor. For each processor family we select
the family member with the best energy efficiency; the chosen family member and speed grade may differ from the one used in the execution time comparison.

Based on processor power consumption and the LMS Adaptive FIR Filter benchmark execution times, the ADSP-BF533 is roughly three times more energy-efficient than the TMS320C5509 and nearly four times more efficient than the TMS320C6414T.

Cost-Performance
To create a cost-performance metric, the execution time is multiplied by the cost of the processor family member with the best speed-to-price ratio. This is not necessarily the same processor used in the execution time or energy efficiency metrics. The cost-performance results are shown in Figure 3.

[Figure 3. Cost-Execution Time Product for LMS Adaptive FIR (lower is better). Vertical axis: cost-execution time product (nanosecond-$), 0 to 2000. Bars: ADI ADSP-BF531 ($4.95, 400 MHz); TI TMS320C5501 ($4.75, 300 MHz); TI TMS320C6410 ($17.95, 400 MHz).]

Based on this metric, the ADSP-BF531 is the most cost-effective processor considered here, over 1.5 times more cost effective than the TMS320C5501 and nearly 3 times more cost effective than the TMS320C6410.

It should be noted that included on-chip memory and peripherals can have a significant impact on overall system cost. These factors are not considered in the cost-performance metric used here.

Memory Use
Execution speed is often the primary metric used to compare processors. However, a processor’s memory use is also important. For example, the memory requirements of an application can have a significant impact on overall system cost. In addition, processors may experience significant performance degradation when instructions and data do not fit in on-chip memory. Because of these and other factors, memory efficiency is an important metric in processor selection. For each of the BDTI Benchmarks™, BDTI measures each processor’s program, constant data, non-constant data, and total memory use.

Control Benchmark
The BDTI Benchmarks™ include the Control benchmark, a benchmark specifically designed to evaluate memory use for control-oriented software. Control-oriented tasks usually constitute the bulk of an application’s program memory requirements, but only a fraction of the application processing time. Thus, in control-oriented tasks, minimizing memory use is usually a more serious concern than maximizing execution speed.

While most of the BDTI Benchmarks™ are optimized primarily for maximum speed, BDTI’s Control benchmark is optimized for minimum memory use. This optimization hierarchy mirrors the approach generally followed by application programmers. Note that memory use results on the Control benchmark are not necessarily indicative of processor memory use in signal-processing-intensive code.

Sample Control Benchmark Results
The memory use results for BDTI’s Control benchmark are shown in Figure 4. Large differences in Control benchmark memory use are typically due to differences in instruction widths. ADSP-BF5xx instruction widths range from 16 to 64 bits, but the Control benchmark implementation for the ADSP-BF5xx uses only 16-bit instructions. Similarly, TMS320C55x instruction widths range from 8 to 48 bits, but its Control benchmark implementation uses mostly 16-bit instructions. Thus, the ADSP-BF5xx and TMS320C55x have similar Control benchmark memory use. The TMS320C64x, on the other hand, uses 32-bit instruction words, increasing its memory use.

[Figure 4. Control Benchmark Total Memory Use (lower is better). Vertical axis: memory (bytes), 0 to 300. Bars: ADI ADSP-BF5xx; TI TMS320C55x; TI TMS320C64x.]

Conclusions
The ADSP-BF5xx competes with an unusually large number of processors, ranging from general-purpose embedded processors to high-performance DSPs. It is unusual to find a DSP processor family that offers variants ranging from a low-end 400 MHz part for less than $5 to a high-performance 750 MHz dual-core part.

Although the ADSP-BF5xx family spans a wide range of price and performance points, the family contains relatively few members: as of late 2004, only five family members were available. However, Analog Devices recently announced six new ADSP-BF5xx family members that are expected to be released in the near future.

Blackfin processors are not the fastest, nor the least expensive, nor the most energy-efficient DSPs on the market. However, the ADSP-BF5xx does offer an excellent balance of these three factors.

The ADSP-BF5xx is particularly notable for its ability to dynamically switch between a wide range of voltages and clock speeds. This feature is no longer as rare as it once was, but it is still unusual, and the ADSP-BF5xx offers more flexibility than most other DSPs in this regard. For example, some of the latest TMS320C55x family members offer dynamic voltage and frequency adjustments, but these parts support only a narrow range of operating voltages.

The ADSP-BF5xx is quite straightforward to program in assembly language due to its orthogonal instruction set, algebraic assembly language, relatively benign pipeline, and flexible addressing modes. The orthogonal instruction set also makes the ADSP-BF5xx an excellent compiler target.

The ADSP-BF5xx has excellent assembly language tools. The tools make it very easy to detect memory conflicts and other stalls, which greatly aids code optimization. The ADSP-BF5xx also has good third-party software support. It should be noted, however, that software tools from Analog Devices are not as comprehensive and feature-rich as those from Texas Instruments.

The ADSP-BF5xx combines features typical of low-power DSPs with features traditionally associated with general-purpose processors. More and more DSPs are adding general-purpose features, but the ADSP-BF5xx includes a more complete set of features than some of its competitors. For example, the ADSP-BF5xx includes privileged execution modes that are useful for operating systems and multitasking software. There is also a version of the operating system available for the ADSP-BF5xx.

Just as Analog Devices has incorporated many general-purpose features into the Blackfin processors, many of the latest general-purpose processors now include signal-processing-oriented features. The motive for combining DSP and general-purpose processor features is clear: many signal processing applications currently employ both a DSP and a general-purpose processor. If these two processors could be replaced with a single processor, it could lead to significant cost savings. Hence, manufacturers are keenly interested in processors that combine the capabilities of a DSP and a general-purpose processor. Although DSPs such as the ADSP-BF5xx still enjoy some performance advantages over general-purpose processors, they cannot yet match the abundant application software, tools, and support available for popular general-purpose processors. Thus, Analog Devices faces a tough battle as the line between DSPs and general-purpose processors continues to blur.

The ADSP-BF5xx faces a difficult market that is filled with strong competitors. However, Analog Devices is poised to strengthen the ADSP-BF5xx family with the introduction of several new family members in 2005. This processor family is a compelling choice overall for its excellent balance of price, features, and performance.
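
As a rough cross-check of the sample results quoted in the Benchmark Performance section, the three metrics (execution time, energy, and cost-execution time product) can be sketched in a few lines of Python. This is an illustrative sketch, not BDTI's methodology: absolute cycle counts are not given in this summary, so cycles are normalized to the ADSP-BF5xx (1.0); "about 20% more cycle-efficient" is read as needing 1/1.2 of the cycles; the TMS320C5509 is assumed to need the same 25% extra cycles as the TMS320C5501; and the ADSP-BF531 is taken to be the 400 MHz part mentioned in the Conclusions. The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One processor at one speed grade, using values quoted in the text."""
    name: str
    rel_cycles: float       # benchmark cycles relative to the ADSP-BF5xx (assumed)
    clock_mhz: float
    power_mw: float = 0.0   # estimated typical power, where quoted
    price_usd: float = 0.0  # unit price, where quoted

    def time(self) -> float:
        """Relative execution time: cycles divided by clock rate."""
        return self.rel_cycles / self.clock_mhz

    def energy(self) -> float:
        """Relative energy: execution time multiplied by power."""
        return self.time() * self.power_mw

    def cost_time(self) -> float:
        """Relative cost-execution time product: time multiplied by price."""
        return self.time() * self.price_usd


C55X_CYCLES = 1.25      # "about 25% more cycles" than the ADSP-BF533
C64X_CYCLES = 1 / 1.2   # "about 20% more cycle-efficient" (interpretation assumed)

# Execution time comparison: fastest speed grades.
bf533_fast = OperatingPoint("ADSP-BF533", 1.0, 750)
c5501_fast = OperatingPoint("TMS320C5501", C55X_CYCLES, 300)
c6414t_fast = OperatingPoint("TMS320C6414T", C64X_CYCLES, 1000)

# Energy comparison: most energy-efficient family members.
bf533_low = OperatingPoint("ADSP-BF533", 1.0, 100, power_mw=24)
c5509_low = OperatingPoint("TMS320C5509", C55X_CYCLES, 108, power_mw=62)
c6414t_low = OperatingPoint("TMS320C6414T", C64X_CYCLES, 600, power_mw=673)

# Cost comparison: best speed-to-price family members (BF531 clock assumed).
bf531 = OperatingPoint("ADSP-BF531", 1.0, 400, price_usd=4.95)
c5501 = OperatingPoint("TMS320C5501", C55X_CYCLES, 300, price_usd=4.75)
c6410 = OperatingPoint("TMS320C6410", C64X_CYCLES, 400, price_usd=17.95)

RATIOS = {
    "time: C5501 vs BF533": c5501_fast.time() / bf533_fast.time(),
    "time: BF533 vs C6414T": bf533_fast.time() / c6414t_fast.time(),
    "energy: C5509 vs BF533": c5509_low.energy() / bf533_low.energy(),
    "energy: C6414T vs BF533": c6414t_low.energy() / bf533_low.energy(),
    "cost-time: C5501 vs BF531": c5501.cost_time() / bf531.cost_time(),
    "cost-time: C6410 vs BF531": c6410.cost_time() / bf531.cost_time(),
}

if __name__ == "__main__":
    for label, value in RATIOS.items():
        print(f"{label}: {value:.1f}x")
```

Run as a script, this prints ratios of roughly 3.1 and 1.6 for execution time, 3.0 and 3.9 for energy, and 1.6 and 3.0 for the cost-execution time product. Under the stated assumptions these are in line with the "over three times", "roughly 1.5 times", "roughly three times", "nearly four times", "over 1.5 times", and "nearly 3 times" comparisons quoted in the text; small differences are expected because the cycle-count percentages are themselves approximate.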

For detailed analysis and insight on Analog Devices’ Blackfin processors, you should have:

Buyer’s Guide to DSP Processors

BDTI’s Buyer’s Guide includes:
• In-depth analysis of architectural strengths and weaknesses
• Performance analysis based on the BDTI Benchmarks™
• Comparison of major commercial DSP processor families from Analog Devices, Freescale, and Texas Instruments

Buyer’s Guide is the most comprehensive technical analysis of major commercial DSP processors. A unique resource for processor and systems designers alike, BDTI’s Buyer’s Guide provides the only independent benchmark analysis and comparisons of today’s DSP processors. For more information, visit http://www.BDTI.com/bg04 or contact BDTI at [email protected].
