<<

On The Softer Side Configurable Digital Signal Processing Micro-Sequencer Approach Speeds Reconfiguration

Not all schemes were created equal. A micro-sequencer architecture offers faster performance and greater flexibility.

Roozbeh Jafari, Graduate Student consumption and silicon area are cute their task. A non-flexible hardware Henry Fan, Graduate Student reduced by 72% and 77% respectively, by realization for such applications has to fit Majid Sarrafzadeh, Professor using a customized 8-bit data versus all required algorithm variations on the UCLA Computer Science Department a 64-bit data bus, while the speed is die. This, if possible, makes the design improved by 157%. and fabrication processes more compli- ey military applications such as cated and expensive. image processing, sonar/radar and FPGAs Make it Possible K SIGINT can enjoy significant perfor- Before getting into the details of the Dealing with Delays mance gains by using reconfigurable com- micro-sequencer scheme, it’s helpful to A major drawback of using runtime puting. Therefore, it’s perhaps surprising examine the underlying hardware that reconfiguration is the significant delay of that the technology has been so slow to makes reconfigurable computing possi- reprogramming the hardware. The total gain broad acceptance. Among the barriers ble. Programmable system capability runtime of an application includes the is a lack of clear understanding about its forms the heart of reconfigurable com- actual execution delay of each task on the real performance differences compared to puting. Programmable devices including hardware along with the total time spent traditional compute architectures. FPGAs contain an array of programma- for hardware reconfiguration between Group RTC The / ©2002 Call (949)226-2000 Work done at ULCA’s Computer ble computational units that can be pro- computations. The latter might domi- Science department paints a clearer pic- grammed through the configuration bits. nate the total runtime, especially for ture and proposes a method to enhance This gives us the flexibility of having ded- classes of applications with a small the performance of reconfigurable com- icated hardware to perform specific com- amount of computation between two puting. Researchers there devised the putational units combined with a paral- consecutive reconfigurations. Hardware concept of a micro-sequencer-based lelism capability. reconfiguration often takes hundreds of Reprint Orders reconfigurable system, and introduced Reconfigurable systems provide the milliseconds or longer based on the size For For with it the concept of Quick Recon- flexibility and reuse of hardware for mul- of the application. To reduce the recon- figuration. The approach exploits simi- tiple applications. Reconfigurable hard- figuration overhead, some previous larities among various applications to ware can be used to execute designs that works have used different approaches. save on reconfiguration time. are larger than the available hardware In many applications, only a small Using the micro-sequencer architec- resources. In such cases, a part of a large portion of the design changes at a time ture, the group demonstrated that a sig- application is executed on the hardware. and the entire hardware does not have to nificant increase in speed is gained by By reusing the reconfigurable hardware, be reconfigured. This has led the industry reconfiguring a system on a set of image- the remaining tasks of the application can to add the capability of Partial processing benchmarks. The benchmarks be loaded and executed on the hardware Reconfiguration to some of their recent took merely hundreds of microseconds at runtime. This is known as runtime products. FPGAs are examples of such versus the hours needed in traditional reconfiguration. Another issue that neces- reconfigurable hardware, and some of FPGA reconfigurations. Using this archi- sitates the integration of reconfiguration the recent FPGA devices have the capa- tecture also makes the parameters of a in a hardware platform is that some appli- bility of partial runtime reconfiguration. system, for instance the size of data bus, cations require reconfiguration in differ- Another method used to gain speed easy to modify in order to customize the ent abstraction levels of the system. For is called configuration prefetching. This system for a specific application. The example, some applications require dif- method tries to overlap the computation results, detailed later, show that power ferent variations of an algorithm to exe- with the reconfiguration of the hardware. June 2003 COTS Journal [ 49 ] The Softer Side

Therefore, to maximize this overlap, it not vary substantially from one task to seeks a way to minimize the chance that another. The motivation of this research Op-codes for an Algorithm reconfiguration is prefetched falsely. is to enhance the reconfiguration PLA (computes the start address of the micro-instruction) Configuration compression is anoth- for image-processing algorithms. It’s er approach to minimize the reconfigura- important, therefore, to verify if image- MUX tion time. In this method, the configura- processing algorithms are altered substan- µPC tion delay is reduced by compressing the tially during the reconfiguration process. data transferred from the host computer Most image-processing algorithms to the programmable system. A major are computationally intensive and should µInstruction Word portion of delay is due to the distance be executed on hardware resources to between the host computer and the pro- allow real-time processing. Moreover, Input Data Data Path Output Data grammable device. Reconfiguration can these algorithms change in nature and be accelerated by using a fast memory parameters based on information avail- Figure 1 (configuration ) near the reconfig- able from targets. For instance, in a fea- The block diagram depicts a typical urable array. This method is called config- ture-tracking algorithm, the number of design of a microcoded . The uration caching. targets, their position and their distance to the camera can change the algorithm microprogram contains the Speedy Micro-Sequencing (or its parameters) to increase the effi- address of the next microinstruction to be In Micro-Sequencer-based Quick ciency of tracking motions. As a result, fetched from the control store, a fast local Reconfiguration (MSQR), the reconfigura- image-processing algorithms are proper memory that contains the control words. tion is performed by altering the algorithm candidates for mapping onto reconfig- The control word is copied into the loaded onto the memory of the proposed urable resources. This not only provides microinstruction registers. architecture. Also new instructions can be fast running time, but also allows dynam- generated by modifying the microcodes ic modification of the algorithm through to be fetched from the control store, a fast stored in the control unit. Using microcod- runtime reconfiguration. Both cannot be local memory that contains the control ed architecture in the control unit of the achieved by mapping this type of algo- words. The control word is copied into micro-sequencer facilitates the parallelism rithm onto traditional fixed software or the microinstruction registers. The con- and adds new computational units to the hardware platforms. trol store consists of microinstructions datapath. In this case, the control unit The micro-sequencer architecture that control the datapath directly. needs the minimum modification since it makes use of architecture for Because image-processing algo- is highly regular compared to sequential- the design of the control unit. In this rithms tend to have similar computation- machine-based controllers. approach, the relation between inputs al behavior, it can be inferred that the Group RTC The / ©2002 Call (949)226-2000 Efficient reconfiguration of an FPGA and outputs is treated as a memory sys- capabilities of the datapath satisfies the is a critical issue because of the time over- tem. Control signals are stored as words new algorithm and it does not have to be head involved. Sometimes in order to in a microcoded memory. At each clock modified. However, since the algorithm reconfigure a system from one algorithm tick during instruction execution, the changes, the opcode part of the micro- to another, the processes of synthesis, appropriate (micro) control word is sequencer, which contains the instruc- placement and routing have to be per- fetched from microprogram memory to tions for a specific algorithm, has to be Reprint Orders formed, which are highly expensive in supply the control signals. updated. This process can be simply done For For terms of CPU time and may take hours to The microcode control unit itself is a by writing the new algorithm (the new complete. In the micro-sequencer small stored program computer. It has a instructions) onto the memory. This approach, when a system is to be recon- micro , a microprogram approach is significantly faster than tra- figured, only the new algorithm has to be memory and a microinstruction word, ditional physical reconfiguration. loaded onto the memory micro- which contains the control signals and Sometimes, due to the constraints of sequencer. When new instructions are sequencing information. The action of new algorithms, the order of utilization required for the new algorithm, the the microcode control unit is exactly like of components in the datapath may have memory structure is updated. This is the that of a general-purpose computer: fetch to be altered, or further, instantiation of a Micro-Sequencer-based Quick a microinstruction, execute it by applying new computational unit may be Reconfiguration method. Because this the control signals in the control word to inevitable. In this case, the microcodes in method does not require physical recon- the computer’s datapath, determine the the control unit can be easily modified to figuration, a significant speed increase is address of the next microinstruction, and satisfy the new requirements. Meanwhile, gained in the process. fetch the next instruction. Figure 1 shows a new computational unit is added to the a block diagram of a typical design of a datapath. This achieves higher perfor- MSQR in Image Processing microcoded control unit. mance because of utilizing parallelism in MSQR is applicable to those applica- The microprogram counter contains computationally intensive algorithms. tions where the types of computations do the address of the next microinstruction MSQR is therefore effective both in terms June 2003 COTS Journal [ 53 ] The Softer Side

Horizontal Microcode Format

Micro- Signal LD, ST, LA, ADDI, ANDI ORI Op Ra Rb c2 instruction Name LDR, STR, LAR register Index Op Ra c1 NEG, NOT Op Ra Rc 00 ENDbit Cond 01 PCena BR Op Rb Rc Cout 02 BRL Op Ra Rb Rc Cond 03 MDout Op 04 Rout ADD, SUB, AND, OR Ra Rb Rc MAin 05 SHR, SHRA, SHL Op Ra Rb Count 06 MDin SHC Op Ra Rb Rc 07 Cin

08 PCin NOP, STOP Op 09 IRin 10 Ain Figure 2 11 Rin Similar to a RISC instruction set, the instruction set format of the MSQR design has a fixed 12 INC4 32-bit instruction width. The first field is the operation code (op-code) by which the type of 13 RD operation is determined. The second, third and forth fields are the register indices. Some 14 WR instruction, for example, ADD, SUB and BRL have three indices, but some only have one. 15 ADD GRa 16 defined so that the maximum flexibility is the quick reconfiguration, the micro- 17 GRb gained for reconfiguration. There are two sequencer machine needs to support the 18 GRc types of microinstructions: horizontal and new instruction set whenever there is a vertical. In the horizontal microcode, each new hardware implementation. Figure 3 Table 1 bit represents a control signal for a com- shows the fields of the new instruction Shown here is the microinstruction format ponent in the datapath. In the vertical format. The UCLA team also implement- Call (949)226-2000 / ©2002 The RTC Group RTC The / ©2002 Call (949)226-2000 for the MSQR architecture. A few more microcode, however, the control of similar ed an assembler for its proposed instruc- bits are reserved for newly inserted com- components in the datapath is compacted tion set. The syntax was chosen such that ponents in the datapath. in a group of bits in microinstructions. parsing is minimized. This requires a local decoder to gen- of providing flexibility and performance. erate all control signals usable for com- Experimental Results ponents in the datapath. The advantage The UCLA team implemented the

Reprint Orders Microinstruction Format is Key of the horizontal microinstructions is its micro-sequencer in VHDL. The code was Microinstructions are an important lower access time to reach the designated written structured and generic so that the For For part of the controller and have to be component in the datapath. However, as size of the data bus/address bus can be the size of the datapath grows, the num- easily modified. The reconfiguration Op Ra Rb Rc Rd ber of control bits increases resulting in times of two image-processing algo- larger microinstructions. Table 1 shows rithms were measured: once using the the microinstruction format for the traditional physical reconfiguration and MSQR architecture. A few more bits are once using the novel method of MSQR

Microprocessor Operand 2 Operand 4 reserved for newly inserted components on a Wildstar/PCI board. The results are Opcode in the datapath. The instruction set for- shown in Tables 2a, 2b and 2c.

Operand 1 Operand 3 mat of the MSQR design is very similar to To show the flexibility of the pro- a RISC instruction set. It has a fixed 32- posed architecture, a background subtrac- Figure 3 bit instruction width. The details of the tion algorithm, which requires an 8-bit Shown here are the fields of the MSQR instruction format are shown in Figure 2. data bus, was implemented on various instruction format. The first field is the op- In order to support the template micro-sequencers with 8, 16, 32, 64 and code of the instruction. The second to the generation, a better format is one with 128-bit data buses. The goal was to mea- fifth fields are the indices of the operand four register index operands. The op- sure power, area and longest path delay for registers. code of this instruction is specified by the all above variations. This experiment was template generator. In order to support done using Synopsys’ Power Compiler. [ 54 ] COTS Journal June 2003 The Softer Side

Traditional Reconfiguration Time Breakdown Power Consumption (µW) 80 Algorithm Feature Background Process Selection Subtraction 70 60 Synthesis 176 54 50 (sec) 40

Placement and 1514 270 library cells 30 Routing (sec) 20 Programming 621 220 10 FPGA (msec) 0 8 bit 16 bit 32 bit 64 bit 128 bit Table 2a

Quick Reconfiguration Time Area 350000 Algorithm Feature Background Selection Subtraction 300000 250000 # of 387 131 200000 Instructions

library cells 150000 MSQR Time 42 30 (µsec) 100000 50000 Table 2b 0 Reconfiguration Time vs. 8 bit 16 bit 32 bit 64 bit 128 bit Quick Reconfiguration Time Longest Path Delay (ns) Algorithm Feature Background 250 Selection Subtraction 200 Traditional 1690.621 322.220 Call (949)226-2000 / ©2002 The RTC Group RTC The / ©2002 Call (949)226-2000 Reconfiguration 150 Time (sec) 100 MSQR Time 42 30 library cells µ ( sec) 50 Table 2c 0 Reprint Orders Table 2 8 bit 16 bit 32 bit 64 bit 128 bit For For The UCLA team measured reconfiguration Figure 4 times of two image-processing algorithms: In order to illustrate the flexibility of the MSQR architecture, the UCLA team ran a background once using the traditional physical reconfig- subtraction algorithm, which requires 8-bit data bus, implemented on various micro- uration and once using the MSQR method sequencers with 8, 16, 32, 64 and 128-bit data buses. They measured power, area and on a Wildstar/PCI board. The tables show longest path delay for all above variations. clear speed advantages for MSQR.

The measured power, area and data arrival sequencer-based system provides flexibil- feature. Moreover, MSQR provides an effi- times are shown in Figures 4a, 4b and 4c. ity for data bus/ address bus design. cient way for reconfiguration with a con- A micro-sequencer-based architec- The UCLA team demonstrated this siderable increase in speed in the reconfig- ture enhances the efficiency of reconfigu- point by modifying the size of the data bus uration process versus the traditional ration on an FPGA. The experimental for a specific application. Such features physical reconfiguration. results prove that the time needed to can be exploited when the algorithms are reconfigure the control unit of a system is unknown at the stage of micro-sequencer UCLA Computer Science Dept. far less than the time it takes to physical- design. A significant improvement in Los Angeles, CA. ly reconfigure the whole system. Besides power consumption, speed and silicon (310) 825-3886. the reconfiguration time, the micro- area is gained by providing this flexible [www.cs.ucla.edu]. June 2003 COTS Journal [ 55 ]