
Microcontrollers and DSPs
Simone Buso

Contents
• Definition of microcontroller (mC)
• Definition of Digital Signal Processor (DSP)
• mCs and DSPs performance
• Advanced DSP architectures
• Examples

Some references
1. D. A. Patterson, J. L. Hennessy, "Computer Organization and Design", Morgan Kaufmann, ch. 5, pp. 338-416.
2. A. Clements, "The Principles of Computer Hardware", Oxford, 2000, ch. 5, pp. 231-244.
3. P. Lapsley, J. Bier, A. Shoham, E. A. Lee, "DSP Processor Fundamentals - Architectures and Features", IEEE Press, New York, 1997, ch. 1.
4. K. Hintz, D. Tabak, "Microcontrollers - Architecture, Implementation and Programming", McGraw-Hill, 1992, par. 1.1.4, pp. 16-26.


Microcontrollers (mCs)

A microcontroller is a processor specifically designed and optimized to perform control, timing and supervising tasks on various target devices. It is characterized by the availability of relatively large amounts of "on chip" memory (ROM, EEPROM, Flash ...) and of several peripheral units for different functions (I/O, A/D conversion, timers, counters, PWM, …). It is normally characterized by reduced complexity and low cost.

Microcontrollers (mCs)

Peripheral units in mCs:
• A/D converters (number of bits, conversion speed and linearity vary a lot among different devices)
• Timers and counters
• PWM modulators
• External memories (ROM, EEPROM, FLASH)
• Communication ports (serial, I2C, field bus, e.g. CAN)
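In practice, these peripherals are programmed through memory-mapped registers. The fragment below is a purely hypothetical sketch: the register names, addresses and bit layout are invented for illustration and do not refer to any real device; it only shows the typical shape of such code on a generic mC.

```c
#include <stdint.h>

/* Hypothetical memory-mapped PWM registers (addresses invented for illustration) */
#define PWM_PERIOD  (*(volatile uint16_t *)0x40001000u)  /* counter period (ticks)   */
#define PWM_DUTY    (*(volatile uint16_t *)0x40001002u)  /* compare (duty) value     */
#define PWM_CTRL    (*(volatile uint16_t *)0x40001004u)  /* control register         */
#define PWM_ENABLE  0x0001u                               /* hypothetical enable bit  */

void pwm_init(uint16_t period, uint16_t duty)
{
    PWM_PERIOD = period;      /* switching period in timer ticks */
    PWM_DUTY   = duty;        /* on-time in timer ticks          */
    PWM_CTRL  |= PWM_ENABLE;  /* start the modulator             */
}
```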


Microcontrollers (mCs)

The use of mCs is very common for the implementation of:
• portable measurement instruments;
• PC peripherals;
• fax machines/photocopiers;
• home appliances;
• cell phones;
• industrial applications, in particular in the automotive and electrical drives fields.

Digital Signal Processors (DSPs)

DSPs are specifically designed and optimized to efficiently perform real time tasks. They are characterized by high computational power and relatively low cost (if compared to general purpose processors). Particular care is taken in minimizing the power consumption (e.g. in embedded portable applications).


Digital Signal Processors (DSPs)

Several different DSP families are available on the market. They all exhibit some common features:
• availability of a built-in multiplier circuit (MAC instruction);
• capability to perform multiple memory accesses in a single clock cycle;
• specific addressing modes for circular buffers and stacks;
• sophisticated program flow control instructions;
• availability of DMA circuitry (in top level models).

Digital Signal Processors (DSPs)

The major application areas for DSPs are related to:
• coding/decoding of speech, hi-fi audio signals, video signal processing;
• compression/decompression of data;
• encryption/decryption of data;
• mixing of audio and video signals;
• sound synthesis.
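These features map directly onto the inner loops of the algorithms listed above (filtering, coding, compression). As an illustrative sketch in plain C (no DSP-specific intrinsics or library calls are assumed), the inner loop of an N-tap FIR filter is one multiply-accumulate per tap: on a DSP with a single-cycle MAC and dual data-memory access each iteration can complete in roughly one clock cycle, and the circular addressing modes handle the sample-buffer wrap-around without extra instructions.

```c
#include <stddef.h>

/* Illustrative FIR inner loop (generic C, not tied to any specific DSP).
   Each iteration is a multiply-accumulate: the operation a hardware MAC
   unit executes in a single cycle, fetching coef[i] and x[i] in parallel
   when two memory accesses per cycle are available. */
long fir_tap_sum(const int *coef, const int *x, size_t n_taps)
{
    long acc = 0;                        /* accumulator */
    for (size_t i = 0; i < n_taps; i++)
        acc += (long)coef[i] * x[i];     /* MAC: multiply and accumulate */
    return acc;
}
```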

DSPs vs mCs

Traditionally, mCs were used in the implementation of control functions, thanks to the wide range of peripheral units available on-chip. The computational power was limited (CPUs had 8 bits or less, no hardware multiplier). DSPs were used, instead, almost only for signal-processing applications, where the key parameter is computational power. Currently, the differences in the application fields of mCs and DSPs are a lot fuzzier.

DSPs vs mCs

More recent DSPs include peripheral units traditionally typical of mCs. On the other hand, mCs offer, at least in top level models, hardware organizations and computational powers closer and closer to those typical of DSPs. Costs and performance may be very close and, for particular applications, the choice of the device may be quite difficult. We definitely need criteria to compare different devices.


DSPs vs mCs

The fundamental parameters for the comparison are, of course, cost and performance. To minimize the cost, for given specifications, it is normally necessary to take into account not only the device cost, but also the estimated development time, the so called time to market.

DSPs vs mCs

The cost of the device is largely dependent on the expected production volume. The time and resources required by the development of the application are a function of several factors, like:
• availability of high quality and high reliability development tools;
• effective technical support from the device manufacturer.


DSPs vs mCs

The application specifications determine the performance level required for the selected device in terms of:
• required peripheral units and their basic parameters (e.g. A/D converter with 8, 10 or 12 bits);
• operating conditions (e.g. maximum allowable power consumption, temperature range);
• required computational power (real time control, signal processing …).

Performance measurement

The performance level of any processor can be measured only in terms of the time required to execute a given program. In the case of mCs or DSPs this is the time the processor effectively spends on the program instructions (computation time), unless an operating system coordinating several tasks in time sharing is running on the device.


Estimation of computation time

The computation time of a program is a key parameter in real time applications (both in control and in signal processing). It can be estimated based on three parameters:
• the processor clock period;
• the number of clock cycles required by the instructions in the program;
• the number of instructions required by the program.

Estimation of computation time

The clock period and the number of clock cycles required by the various program instructions can be read in the processor datasheet/user manual. The number of instructions required by a given algorithm is a function of the processor architecture. By architecture we mean the set of resources that are available to the programmer for the implementation of the algorithm (instruction set).

Estimation of computation time

Any given architecture can be implemented in different ways at the hardware level. We must therefore distinguish between processor organization and architecture: the former is the particular hardware implementation of the latter. The architecture has a direct effect on the number of instructions required by a given program. The organization determines the clock period and the number of clock cycles required by any instruction.

Estimation of computation time

The computation time of a program can be estimated by using the following formula:

Tcal = Tclk · Σ(i=1..Ncl) Ni · NCi    (1)

where Tclk is the processor clock period, Ni is the number of class i instructions in the program, NCi is the average number of clock cycles required by class i instructions, and Ncl is the number of considered instruction classes.
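A minimal sketch of relation (1) in C, assuming the per-class instruction counts and average cycle counts have already been extracted from the program and from the datasheet; the function name and the figures in main() are illustrative assumptions, not data for any real device.

```c
#include <stdio.h>

/* Relation (1): Tcal = Tclk * sum_i (Ni * NCi)
   n[i]  = number of class-i instructions in the program
   nc[i] = average clock cycles taken by class-i instructions */
static double computation_time(double t_clk_s, const long n[], const double nc[], int n_classes)
{
    double cycles = 0.0;
    for (int i = 0; i < n_classes; i++)
        cycles += (double)n[i] * nc[i];
    return t_clk_s * cycles;
}

int main(void)
{
    /* Illustrative instruction mix (hypothetical figures) */
    long   n[]  = { 240, 120, 440, 200 };   /* load, store, ALU, jump counts */
    double nc[] = { 4.0, 4.0, 3.0, 2.0 };   /* average cycles per class      */
    double t    = computation_time(1.0 / 40e6, n, nc, 4);   /* 40 MHz clock */
    printf("Tcal = %.2f us\n", t * 1e6);    /* prints 79.00 us for this mix */
    return 0;
}
```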

Estimation of computation time

Relation (1) assumes that the program is never interrupted by other processes and neglects the delays due to memory accesses.

To increase the speed of a processor, we therefore need to:
• reduce the clock cycle (Tclk);
• reduce the number of cycles required by the more commonly used instructions (NC);
• reduce the number of instructions required by a given algorithm.

Speed limits!

Reducing the clock cycle duration always implies an increase of the power consumption of the processor. This can be limited by reducing also the supply voltage, which explains why there is a strong need for lower and lower power supply voltages (<1 V) in computer applications. The limitations here are basically technological (we need new processes/materials).
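The trade-off can be made explicit with the usual first-order expression for the dynamic power of CMOS logic (a standard approximation, not part of the original slide): Pdyn ≈ Csw · VDD² · fclk, where Csw is the switched capacitance. Raising fclk increases the power linearly, while lowering VDD reduces it quadratically, which is why sub-1 V supplies are pursued; the price is slower gates and smaller noise margins, hence the technological limits mentioned above.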


Speed limits!

The number of instructions required by a given algorithm is a function of the processor architecture, i.e. of its instruction set, as seen by the programmer/compiler. The reduction of this parameter leads to complex instruction set computers (CISC), instead of reduced (and simple) instruction set computers (RISC). This again affects the processor organization and its cost. That's why RISC processors are a lot more used than CISC processors.

Speed limits!

The reduction of the number of clock cycles required by an instruction calls for a more sophisticated hardware organization of the processor, e.g. wired control instead of micro-programmed control, a higher degree of parallelism (achievable in several different ways: VLIW, SIMD, etc.) or the use of pipelines. This trend leads to complex processors, with high cost. The limitation in this case is basically "economical".


Processor control

The execution of any instruction is normally divided into several different phases. For instance, the instruction
Add R1, num;   (that is, [R1] = [R1] + M[num])
could require the following sequence of actions:
Fetch → Decode → Operand read → Execute → Result write
Each of these phases requires a certain amount of time, depending on the speed of response of the processor functional units.

Processor control

In a single cycle organization, the processor clock period has to be long enough to allow the execution of all phases, so it will be equal, at least, to the sum of the response times of all the functional units. In a multicycle organization, each phase is executed in, at least, one clock period; the instruction in the example will then require, at least, 5 clock periods. The execution time of some instructions can, in some cases, be longer in a multicycle organization with respect to a single cycle one.

Single vs multi cycle organization

Unless we are considering very simple processors with extremely small instruction sets, as some RISC processors, the single cycle organization tends to be inefficient. The most complex instruction, i.e. the one requiring the longest time, determines the minimum possible clock period. This strongly penalizes the execution of faster instructions, which only require a fraction of that period; for the remaining time, the processor is not doing anything.

Comparison single cycle/multicycle

Let's consider a processor with the following response times:
• memory: 2 ns
• ALU: 2 ns
• registers: 1 ns
Let's also consider a typical program made up of instructions like the following:
• memory reads (load): 24%
• memory writes (store): 12%
• ALU operations on registers: 44%
• jumps (or branches): 20%

Comparison single cycle/multicycle

Any single cycle implementation of a processor must be designed to allow the execution of the slowest instruction. Usually, this is the memory read (like "load R1, num"), which requires:
a. fetch phase: 2 ns;
b. possible ALU operation (offset): 2 ns;
c. register read: 1 ns;
d. data memory access: 2 ns;
e. register write: 1 ns.

Comparison single cycle/multicycle

A multicycle implementation of the control unit is, instead, limited only by the response time of the slowest CPU functional unit, that is the memory access unit (in our example). Supposing that we can subdivide the clock period into 4 segments (depending on the processor organization), each of these will have to be 2 ns long. In the two implementations, the load instruction is going to have the same duration (8 ns). Simpler instructions (e.g. jumps) will be faster in the multicycle case.
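Putting the figures together: in the single cycle case the clock period must cover the whole load, Tclk ≥ 2 + 2 + 1 + 2 + 1 = 8 ns, and every instruction, even one that only needs a few of those phases, takes 8 ns. In the multicycle case Tclk = 2 ns (the memory response time) and each instruction uses only as many segments as it needs, e.g. a jump completes in 2 segments, i.e. 4 ns (figures as on the next slide).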


Comparison single cycle/multicycle

Supposing that the processor instructions have the following durations (in terms of clock period segments and in absolute value):
• load: 4 segments, 8 ns;
• store: 4 segments, 8 ns;
• ALU operations: 3 segments, 6 ns;
• jumps: 2 segments, 4 ns;
the typical program will have an average number of cycles NC equal to:
4·(0.24 + 0.12) + 3·0.44 + 2·0.2 = 3.16
so well below 4.

Both in single and in multicycle organizations, during the execution of any instruction the different processor units operate only for a fraction of the total execution time, which is highly inefficient. The pipeline organization tends to remove this inefficiency: the different functional units are operated simultaneously, but on parts of different instructions, as in any industrial pipeline. The only problem with this organization is the resolution of data/structural conflicts.
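Before moving on to the pipeline, the multicycle figure above can be checked numerically; the short C sketch below simply reproduces the weighted average from the instruction mix and segment counts given on these slides.

```c
#include <stdio.h>

int main(void)
{
    /* Instruction mix of the "typical program" and multicycle segment counts */
    double freq[]   = { 0.24, 0.12, 0.44, 0.20 };  /* load, store, ALU, jump   */
    double cycles[] = { 4.0,  4.0,  3.0,  2.0  };  /* segments per instruction */
    double t_clk_ns = 2.0;                         /* one segment = 2 ns       */

    double nc = 0.0;
    for (int i = 0; i < 4; i++)
        nc += freq[i] * cycles[i];

    printf("average NC = %.2f, average time = %.2f ns\n", nc, nc * t_clk_ns);
    /* prints: average NC = 3.16, average time = 6.32 ns,
       versus a fixed 8 ns per instruction in the single cycle case */
    return 0;
}
```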


Comparison pipeline/multicycle

Let's consider again the same processor with the following response times:
• memory: 2 ns
• ALU: 2 ns
• registers: 1 ns
Let's also consider again the typical program made up of instructions like the following:
• memory reads (load): 24%
• memory writes (store): 12%
• ALU operations on registers: 44%
• jumps (or branches): 20%

Comparison pipeline/multicycle

Supposing our pipeline organization allows us to:
• execute without interlocking half of the load operations, with a 1 clock cycle penalty for the other half;
• execute without stalling half of the jumps, with a 2 clock cycle penalty for the other half;
it follows that the number of clock cycles required by load instructions is on average equal to 1.5, while it is equal to 0.5·1 + 0.5·(1+2) = 2 for jump instructions.

Comparison pipeline/multicycle

Our typical program will present an average NC value equal to:
1.5·0.24 + 1·(0.12 + 0.44) + 2·0.2 = 1.32
We saw that, in the multicycle organization, with 4 segments and no pipeline, the average number of clock cycles NC was equal to 3.16. The use of a pipeline with 4 stages therefore allows us to reduce the execution time of the program by almost 60%; the acceleration allowed by the pipeline is then equal to 2.4.

Comparison pipeline/multicycle

[Figure: diagram relating the increase in performance (acceleration) to the number of stages n, also known as the "depth", of the pipeline, for n = 1, 2, 4, 8, 16.]
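The pipeline figure can be reproduced in the same way; the sketch below recomputes the per-class cycle counts from the stall assumptions of the previous slide and the resulting speed-up over the multicycle organization (plain C, illustrative only).

```c
#include <stdio.h>

int main(void)
{
    /* Pipelined cycle counts: half the loads pay 1 extra cycle,
       half the jumps pay 2 extra cycles (assumptions from the slide) */
    double nc_load  = 0.5 * 1.0 + 0.5 * (1.0 + 1.0);   /* = 1.5 */
    double nc_jump  = 0.5 * 1.0 + 0.5 * (1.0 + 2.0);   /* = 2.0 */
    double nc_store = 1.0, nc_alu = 1.0;

    double nc_pipe = 0.24 * nc_load + 0.12 * nc_store
                   + 0.44 * nc_alu  + 0.20 * nc_jump;   /* = 1.32 */

    printf("NC (pipeline) = %.2f\n", nc_pipe);
    printf("speed-up vs multicycle = %.2f\n", 3.16 / nc_pipe);  /* ~2.39 */
    return 0;
}
```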


Comparison pipeline/multicycle

The factors limiting the acceleration any pipeline can guarantee are:
• data conflicts: the higher the number of stages, the higher the penalty any stall condition determines;
• decrease in the execution speed of jumps: the higher the number of stages, the bigger the penalty;
• increase in the CPU control complexity, which requires a clock frequency reduction: this is due to the auxiliary and control circuitry needed for the pipeline (registers, logic, …).

Superscalar architectures

In last generation DSPs, we are now seeing even more complex organizations, typically taken from the world of general purpose processors (GPPs). Among these, it is quite common to find the so called superscalar architectures. These are based on the replication of functional units within the CPU, so as to allow the execution of several instructions in parallel. It is typical to find a replication factor between 2 and 4.


Superscalar architectures

A dynamic pipeline is capable of organizing the execution of instructions in an out of order manner: this way it is possible to reduce the penalties due to stall conditions to a minimum. The typical structure of a dynamic pipeline is made up of three main units:
• fetch and decode unit (FDU);
• execution unit (EXU);
• write-back unit (WBU).
The first and the last unit operate in order, the second does not.

Superscalar architectures

The control strategy for this type of processors usually shows very high complexity:
• branch condition prediction;
• advanced memory organization (dynamic RAM, multi level caches, etc.);
• dynamic pipeline.
The instruction set is usually of the RISC type, but SIMD architectures are also possible, to further increase the level of parallelism.

Superscalar architectures

• Fetch and decode: in order, sometimes on several instructions simultaneously.
• Execution: in parallel. The FDU distributes the instructions to execute among the functional units of the EXU, using prediction techniques for conditional branches.
• Results write back: the WBU takes care of data conflicts and only writes the "safe" ones.
[Figure: block diagram of a dynamic pipeline, with the FDU feeding several register banks and ALUs in parallel (the EXU), followed by the WBU.]

Superscalar architectures

The use of these complex organizations, with functional unit parallelism, dynamic pipelines and branch prediction, normally implies very high implementation costs. The DSPs that adopt them are designed for high performance, reduced scale applications, where the computational power is the key issue. These devices are characterized by a pretty high power consumption and are not suited to embedded control applications for large volume productions.


Superscalar architectures

The DSPs with superscalar architectures are also direct competitors of general purpose processors (GPPs), which often offer signal processing capabilities (in particular for audio and video applications) that are quite close to the DSPs' ones. DSPs are still preferable because of lower cost and lower power consumption. Besides, being slightly less complex, it is normally easier to estimate the computation time of a program (fundamental for possible real-time applications).

Superscalar architectures vs VLIW

Some DSPs adopt the VLIW (Very Long Instruction Word) strategy to increase the internal level of parallelism. This technique combines a large number (e.g. 8) of simple instructions in a single instruction memory word, which is fetched in a single clock cycle. The decoder decomposes the long instruction into its basic components and, exploiting a given degree of hardware parallelism, distributes each component to a different execution unit.

Superscalar architectures vs VLIW

The VLIW approach is not equivalent to the superscalar one because:
• only some particular instruction sequences can be combined in long instruction words that completely exploit the CPU (e.g. that of a FIR filter tap);
• the adopted level of hardware parallelism is normally not too high (two execution units at most);
• the pipeline is static;
• the instruction bus has a lot of bits (e.g. 256).

Maximizing performance

The processor performance is a function both of its architecture and of its organization at the hardware level. The maximization of performance calls for a co-ordinated design of hardware and software. The problem is further complicated by several design constraints, such as:
• cost;
• electric power consumption.

Example: Data-sheet

The manufacturer claims a maximum operating speed of 30 MIPS (Fclk = 40 MHz).

Example: Data-sheet

The manufacturer claims 8000 MIPS! But here Fclk = 1000 MHz!


Example: Data-sheet

The manufacturer claims a maximum speed of 40 MIPS (Fclk = 80 MHz).
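Such claims become comparable only when divided by the clock frequency, i.e. reduced to instructions per clock cycle (the ratios below are simply computed from the figures quoted above): 30 MIPS at Fclk = 40 MHz means 30/40 = 0.75 instructions per cycle (about 1.3 cycles per instruction on average); 40 MIPS at 80 MHz means 0.5 instructions per cycle (2 cycles per instruction); 8000 MIPS at 1000 MHz means 8 instructions per cycle, a peak figure that presupposes a highly parallel organization (e.g. an 8-instruction VLIW word, as discussed above) and is reached only on ideally packed instruction sequences.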

