Compression for TIPI

Nadathur R. Satish and Pierre-Yves Droz
EECS Department, University of California, Berkeley

ABSTRACT

Many applications need dedicated hardware to perform their computations fast. While building the datapath is usually simple, designing and debugging the associated control logic can be a long process. The Mescal group is working on a development environment which automatically generates this logic. Traditionally, a register-transfer level vertical instruction set is decoded into pipelined control bits with forwarding and hazard detection logic. However, for novel, low-power, parallel, deeply pipelined, heterogeneous embedded architectures, such an ISA may not be easy to formulate, presents serious difficulties for compilation, and requires excessive dynamic control logic for forwarding and hazards. An alternate approach, which can be viewed as an extension of the VLIW style of processors, is to use a control scheme that retains the full flexibility of the datapath. The resulting processor accepts a trace of horizontal microcode instructions that can be generated automatically. This eliminates the need to create an ISA (manually or otherwise), but requires a large number of bits for each 'instruction' in the microcode trace. Therefore, the microcode needs to be compressed after compilation and decompressed on the fly. A high compression rate can be achieved using a trace cache to exploit the redundancy in the microcode. The decoder, which is itself a processor, operates on the trace cache following the instructions given by the encoder. In this paper, we compare the resulting memory size and bandwidth requirements with those of a traditional ASIP design flow.

1. Introduction

The background to our work is the Tipi micro-architecture design tool [1] being developed at the University of California, Berkeley. Its goal is to provide an architecture development system to specify, evaluate, and explore fully-programmable (micro-)architectures using a correct-by-construction method to ease the definition of customized instruction sets.

Traditional design flows define the instruction set architecture (ISA) first and then implement a corresponding micro-architecture. The problem that arises in this situation is that the actual architecture must be checked against the original ISA specification to verify that it implements the ISA. The design flow in Tipi instead encourages the designer to think of the data-path micro-architecture first. He or she lays out the elements of the data path without connecting the control ports. For unconnected control ports, the operations supported by the data path are extracted automatically. These operations form the basis for more complex, multi-cycle instructions composed of several primitive operations. The construction of such complex instructions is supported by defining temporal and spatial constraints: the operations extracted from the data path can be combined in parallel and in sequence as long as the resources for these operations do not conflict with each other, which is checked by Tipi.

The Tipi framework extracts the operations and produces a file with the horizontal microcode for each of these operations. It thus generates a horizontal microcode code-generator, which can take in an assembly file and produce its microcode equivalent.

The main problem with this approach is that the microcode produced is very large but highly redundant, i.e., its entropy is very low. This leads to problems with the memory size and the memory bandwidth required to read these microcode sequences from memory, which in turn leads to the idea of compressing the microcode in order to reduce these requirements.

It is of interest that the compressed stream need not be visible to the architecture designer, since the designer will use the framework of the encoder-decoder pair, which acts as a layer that abstracts the generated microcode. We are thus free to choose the encoding/decoding scheme.

This paper focuses on multimedia and signal processing benchmarks, whose kernels consist of long pieces of linear code with few branches. The proposed scheme is not efficient on code that has a lot of branches.

The rest of the paper is organized as follows: In Section two, we present the encoding and decoding scheme that we use. In Section three, we describe the decoder architecture. In Section four, we describe the encoder architecture. In Section five, we present the results of our work. In Section six, we present our conclusions.

2. Coding/decoding scheme

The scheme we choose is constrained by the consideration that the decoder has to be implemented in hardware in order to avoid being on the critical path of processor execution. Thus, software-based compression schemes like gzip cannot be used. Compression schemes like Huffman coding exploit coding redundancy in the microcode but do not give sufficient performance to make storing the microcode feasible.

We propose instead to use an L0 cache to exploit the redundancy in the microcode. The cache will contain microcode traces. Since the horizontal microcode inherently controls a number of pipeline registers, multiplexers, and register enables, it is reasonable to expect that parts of the microcode will have good locality, enabling us to obtain better compression for portions of the microcode. We therefore support the existence of multiple trace caches, each of which contains a portion of the microcode word.

The trace caches are controlled by the decoder, which in our case is a processor specialized in trace cache operations; the motivation for this is described in Section three. The operations on the trace caches are of two main kinds: filling a trace cache with microcode, and reading the contents of the trace caches in the right order. The decoder performs these two functions when it receives WRITE and SEQUENCE instructions, respectively. The encoder is then a compiler that produces WRITE and SEQUENCE instructions. The WRITE instruction directs the decoder to fill in a particular trace cache line, and the SEQUENCE instruction provides the order in which the written values are to be read out.

The reading of the cache has to be done at the micro-architecture speed in order to avoid being on the critical path. This, however, would imply that the decoder clock speed has to be much higher than the micro-architecture speed, because the decoder has to write the cache in addition to reading it. In order to keep the decoder clock speed low, we use a sequence manager for each cache. The sequence manager is autonomous from the cache writing mechanism and is in charge of reading the cache in the correct order to regenerate the microcode trace.

Since each sequence has to be stored in a buffer of finite size, we can only write a limited number of microcode sequences before the reading begins. In order to further overlap the reads and writes, it is essential to provide a mechanism that keeps storing the indices for the next sequence while the current sequence is being executed. If this is not done, the sequence buffer becomes a resource shared between cache reads and writes, eliminating the possibility of doing them in parallel. For this reason, each sequence manager has two sequence buffers which store the order in which the cache indices are to be read. One buffer stores the sequence the manager is currently reading out; the other stores the indices for the next sequence. Whenever a sequence is complete, a START instruction is issued to the decoder in order to start up the sequence managers. At this point the next buffer is copied into the current buffer and execution starts.
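A minimal sketch of this double buffering follows, with hypothetical names and types (the paper gives no implementation; the index width and buffer representation are assumptions):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of a double-buffered sequence manager. 'next' collects the
    // indices delivered by a SEQUENCE instruction while 'current' is read
    // out, one index per micro-architecture cycle, to drive one cache.
    class SequenceManager {
    public:
        // The decoder executed SEQUENCE: store the upcoming sequence.
        void loadNext(const std::vector<std::uint16_t>& indices) { next = indices; }

        // The decoder executed START: the pending sequence becomes active.
        void start() { current = next; next.clear(); pos = 0; }

        // Called once per micro-architecture cycle: the cache line whose
        // contents form this cycle's slice of the microcode word.
        std::uint16_t currentIndex() { return current[pos++]; }
        bool finished() const { return pos >= current.size(); }

    private:
        std::vector<std::uint16_t> current, next;
        std::size_t pos = 0;
    };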

Another point of interest is that our scheme needs to be applicable to various architectures, so we require it to be flexible. We thus have a number of parameters, such as the number of caches, the line size of each cache, the total cache size, the size of the data bus, and other parameters that arise from the fact that the decoder can issue multiple instructions on each cycle.

The next section describes the architecture of the decoder and its implementation details.

3. Decoder

The decoder is a piece of hardware responsible for decoding the compressed microcode located in the program memory into the horizontal microcode that drives the architecture.

3.1 A processor and its ISA

The decoding scheme presented earlier requires various operations during decoding: writing a line of microcode into a trace cache, preparing a sequence of indices that will be used to access the cache, outputting the microcode, and starting to use a previously prepared sequence. In addition, to control the execution of the code we need a jump instruction. We decided to implement the decoder as a processor specialized in these operations. It executes a binary code stored in the memory and complying with its ISA. The instructions of this ISA are listed in Table 3.1.

Table 3.1: The decoder instruction set

NOP                No operation
WRITE C,I,D        Write data D into cache C at index I
SEQUENCE C,S       Prepare sequence S for cache C
START              Start executing all the ready sequences
JUMP A,O           Fetch the next instructions from address A and offset O
SEQLENGTH SL       Use sequence length SL when executing the next sequences
STOP               Freeze the decoder in its current state
COPY C,DI,SI,BC    Read the data at index SI in cache C, flip the bits BC, and write the result at index DI in the same cache

The number of parameters of these instructions varies widely; therefore their sizes are very different. Using a fixed instruction format would force us to pad the shortest instructions with unnecessary zeros, which would lower the compression performance of the encoding scheme. Using variable-length instructions solves this issue, but at the cost of additional complexity in the fetch unit. A rapid evaluation of the gain in terms of compression ratio shows that the compression performance would be divided by up to 4 with fixed-size instructions: using variable-length instructions is unavoidable.
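To make the size disparity concrete, the sketch below assigns each instruction of Table 3.1 a bit count. The field widths are invented for illustration (the real widths are fixed per architecture by the framework), but they show why a fixed format is wasteful:

    #include <cstdint>

    // Decoder opcodes from Table 3.1.
    enum class Opcode : std::uint8_t { NOP, WRITE, SEQUENCE, START, JUMP,
                                       SEQLENGTH, STOP, COPY };

    // Illustrative field widths in bits; all of these are assumptions.
    constexpr int OPCODE_BITS = 3;
    constexpr int CACHE_BITS  = 2;   // e.g. up to 4 trace caches
    constexpr int INDEX_BITS  = 6;   // e.g. 64-line caches
    constexpr int DATA_BITS   = 32;  // one cache-line slice

    // The size of an instruction is only known once its opcode (and, for
    // SEQUENCE, the current SEQLENGTH) has been decoded.
    int instructionBits(Opcode op, int seqLength = 7) {
        switch (op) {
        case Opcode::NOP:
        case Opcode::START:
        case Opcode::STOP:      return OPCODE_BITS;
        case Opcode::SEQLENGTH: return OPCODE_BITS + 4;
        case Opcode::JUMP:      return OPCODE_BITS + 16;      // address + offset
        case Opcode::WRITE:     return OPCODE_BITS + CACHE_BITS
                                       + INDEX_BITS + DATA_BITS;
        case Opcode::COPY:      return OPCODE_BITS + CACHE_BITS
                                       + 2 * INDEX_BITS + 8;  // bits to flip
        case Opcode::SEQUENCE:  return OPCODE_BITS + CACHE_BITS
                                       + seqLength * INDEX_BITS;
        }
        return OPCODE_BITS;
    }

With these example widths, a NOP occupies 3 bits while a WRITE occupies 43; padding every instruction to the widest format would inflate the stream by roughly the factor of 4 quoted above.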

3.2 Implementation

In order to avoid introducing stalls in the main architecture, the decoder must have finished preparing the next sequence before the current one is finished. In most cases, this requires an IPC greater than one. The flow of instructions entering the decoder has two properties that greatly help us achieve this goal: there are no dependencies between instructions, and branch instructions are rare. Under these assumptions, an architecture composed of several independent Fetch/Decode/Execute pipelines seems to be the most efficient way to achieve a high IPC. The architecture we have chosen is presented in Figure 3.1.

[Figure 3.1: Decoder architecture. The memory feeds several independent Fetch/Decode pipelines (instruction data and size requests); the pipelines issue to the trace caches, each of which is read out by its own sequence manager.]

The decode unit of each pipeline has to be able to communicate to all the caches the operations they have to perform. Thus, each pipeline has one bus on which its decode unit can put the data to be written and the index at which it has to be written; it activates the cache that has to write the data through an enable signal.

The selection of the cache line used to generate the microcode at each cycle is done by a unit independent of the processor and attached to each cache: the sequence manager. When a SEQUENCE instruction is executed, the sequence manager receives a sequence of indices from the decode stage and stores it in a temporary buffer. When a START instruction is executed, the content of this temporary buffer is copied to a working buffer, and the sequence manager starts to output the cache lines indicated in the sequence. The execution of the processor and of the sequence managers is synchronized by the START instructions: if a START is issued when the sequence managers have not finished executing the previous sequence, the processor stalls its fetch stage and waits until the sequence is finished. If a sequence manager has finished executing its previous sequence but no START is pending, it stalls the main architecture and waits for a START to be issued.

Fetching an instruction on each cycle when the instruction size is only known a posteriori presents a performance issue: the size is actually calculated in the decode stage, one cycle after the instruction has been fetched. The fetcher cannot compute the address at which the next instruction is located in memory until it knows this size. This prevents a pipeline from fetching more than one instruction every two cycles. Several methods have been devised to allow fast fetching of variable-length instructions in normal processors, such as [5] or [6], but these solutions are not really suited to our case. As we do not have any dependencies between instructions, a more efficient method exists: interleaving two instruction streams in the same pipeline. On even cycles, the fetch unit takes instructions from stream 1, located at one address in memory; on odd cycles, it takes instructions from stream 2, located at a different address. With this interleaved scheme, an instruction is issued every cycle. The task of splitting the instruction stream into two different streams is given to the encoder, which has to place the instructions at the right positions in memory.
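A sketch of the interleaving (illustrative only; the real fetch unit operates on the buffered memory datapath described next):

    #include <cstdint>

    // Two program counters, one per instruction stream, served on
    // alternating cycles. By the time a stream is fetched again, two
    // cycles later, the decode stage has already computed the size of its
    // previous variable-length instruction, so its PC can be advanced
    // without inserting a bubble.
    struct InterleavedFetch {
        std::uint64_t cycle = 0;
        std::uint32_t pc[2] = {0, 0};  // set by the initial JUMP instructions

        // 'lastSize' is the size, reported by decode, of the previous
        // instruction of the stream selected this cycle.
        std::uint32_t fetchAddress(std::uint32_t lastSize) {
            int s = static_cast<int>(cycle++ & 1); // even: stream 0, odd: stream 1
            std::uint32_t addr = pc[s];
            pc[s] += lastSize;
            return addr;
        }
    };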

The datapath from the memory has a fixed size. This means that the fetch unit has to buffer the data coming from the memory in order to select the right bits that form a valid instruction. When a jump is decoded, its address is forwarded to the fetch unit, which flushes its buffer, changes its program counter to the new value, and starts making requests to the memory to refill its buffer.

An issue specific to the decoder forces the encoder to care about the cycle at which an instruction is executed: if the decoder executes two WRITEs and a START in the same cycle, one WRITE can belong to the sequence located before the START and the other to the sequence located after it. This violates the correctness of the decoding: the START will be executed too early, and the first WRITE will be treated as a write in the second sequence. A piece of hardware could be added to deal with this case, but it would drastically increase the complexity of the decoding. We preferred to transfer this complexity to the encoder: it detects this special case and adds extra NOPs between the first WRITE and the START to be sure that they will not be decoded in the same cycle.

4. Encoder

The encoder is a compiler that generates instructions for the decoder. The code generated by the encoder forms the compressed microcode stream that is stored in memory. Figure 4.1 shows the organization of the encoder. The encoder operates in two stages. The first stage involves generating a linear instruction sequence and mapping the bits of the microcode that are "don't cares" in such a way as to minimize the code size. The second stage involves packing the linear instruction stream as per the requirements of the decoder.

[Figure 4.1: Organization of the encoder. Stage 1 (the sequencer) takes the architecture description, the input assembly file, the original microcode, and a parameter file, and produces a linear assembly file with the help of a cache simulation. Stage 2 (the packer) turns the linear assembly file into the packed assembly file.]

4.1 Generation of instructions

The encoder performs the first stage by keeping a simulation of the trace caches. It first performs a preprocessing step: it takes the input assembly code and the XML file giving the microcode representation of each instruction, and substitutes the latter to obtain the input microcode. This microcode trace is then examined linearly. A single line of microcode is first broken up as per the length and number of trace caches (the number and length of the trace caches are among the parameters of the scheme that can be varied). After the microcode is split up, each trace cache is checked for a cache hit. If there is a cache miss, a WRITE instruction is generated. This implies either that one of the lines of the trace cache is empty or that one of the lines has to be replaced. During replacement, care has to be taken that no trace cache line involved in the sequence currently being executed by the sequence manager is overwritten. The WRITE instruction takes as operands the line so chosen and the microcode value to be written. In case of a cache hit, no WRITE instruction is generated. In either case, the cache index at which the value is either present or written is added to a list which is issued to the sequence manager in charge of the respective cache; the list is passed to the manager in the form of a SEQUENCE instruction to the decoder. The next line of the microcode is then considered. As the number of bits that a SEQUENCE instruction can occupy is limited by the size of the sequence buffer, only a certain number of indices can be added. Once this limit is crossed, the encoder tells the sequence managers to start their sequencing by issuing a START instruction, and the sequences in the simulation are emptied.
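The following sketch shows this generation loop for one cache (hypothetical data structures; the actual encoder is a Java application, and the replacement policy shown here is deliberately naive):

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    // The portion of one microcode line that falls into one trace cache.
    using Slice = std::vector<std::uint8_t>;

    struct TraceCacheSim {
        std::vector<std::optional<Slice>> lines;   // simulated contents

        explicit TraceCacheSim(std::size_t n) : lines(n) {}

        // A hit is a line that already holds exactly this slice.
        std::optional<std::size_t> find(const Slice& s) const {
            for (std::size_t i = 0; i < lines.size(); ++i)
                if (lines[i] && *lines[i] == s) return i;
            return std::nullopt;
        }

        // Pick a victim line. A real implementation must skip the lines
        // used by the sequence currently running in the sequence manager.
        std::size_t allocate() const {
            for (std::size_t i = 0; i < lines.size(); ++i)
                if (!lines[i]) return i;           // prefer an empty line
            return 0;                              // naive replacement
        }
    };

    // One step of stage 1 for one cache: emit a WRITE on a miss, and in
    // either case append the index to this cache's pending sequence.
    void encodeSlice(int cacheId, const Slice& slice, TraceCacheSim& cache,
                     std::vector<std::size_t>& pendingSequence,
                     std::vector<std::string>& linearCode) {
        std::size_t index;
        if (auto hit = cache.find(slice)) {
            index = *hit;                          // hit: no WRITE needed
        } else {
            index = cache.allocate();
            cache.lines[index] = slice;
            linearCode.push_back("WRITE " + std::to_string(cacheId) + ","
                                 + std::to_string(index) + ",<data>");
        }
        pendingSequence.push_back(index);          // later flushed as SEQUENCE
    }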

hit, then we set all don’t cares to zero before The encoder performs the first stage by keeping issuing the WRITE instruction. This is motivated a simulation of the trace caches. The encoder first by the fact that most bits in the microcode are performs a preprocessing step by taking in the zeros and we want the trace cache to also be input assembly code and the XML file giving the similar, in order to have a better hit rate. microcode representations of each instruction and substitutes them to obtain the input microcode. 4.3 Insertion of NOPS and packing of This microcode trace is then examined linearly. A instructions single line of microcode is first broken up as per the length and number of trace caches. The Once the linear code is generated, it has to be number and length of trace caches are part of the split into a number of instruction streams. The parameters of the scheme that can be varied. After

4.3 Insertion of NOPs and packing of instructions

Once the linear code is generated, it has to be split into a number of instruction streams. The motivation and scheme for interleaving instruction streams are described in Section three; here we confine ourselves to how the encoder generates these streams. The special case of a START instruction being issued in the same cycle as a WRITE instruction for the previous sequence is handled by padding the instruction stream with the requisite number of NOP instructions, ensuring that the START instruction is always issued first.

After the NOPs are inserted, the entire linear code is split into blocks of size ISSUEWIDTH. Each block is further subdivided into two interleaved streams: alternate instructions are issued to the two streams until all instructions in the block have been issued; then the next block is taken up and the process is repeated. This scheme is shown in Figure 4.2, and the sequence of instructions it generates is the packed sequence. Since the decoder initially does not know at which point in memory each stream starts, the first few instructions are JUMP instructions to the start of each stream.

[Figure 4.2: Packing of the linear stream of instructions. The linear instruction stream is dealt into interleaved memory streams (issue width of 3 in the example), preceded by the initial JUMPs.]
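A sketch of the packing step (hypothetical container types; ISSUEWIDTH = 4 is an arbitrary choice):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Split the NOP-padded linear code into blocks of ISSUEWIDTH
    // instructions, then deal alternate instructions of each block to the
    // two streams, matching the decoder's even/odd fetch policy.
    constexpr std::size_t ISSUEWIDTH = 4;

    void pack(const std::vector<std::string>& linear,
              std::vector<std::string>& stream0,
              std::vector<std::string>& stream1) {
        for (std::size_t base = 0; base < linear.size(); base += ISSUEWIDTH) {
            std::size_t end = std::min(base + ISSUEWIDTH, linear.size());
            for (std::size_t i = base; i < end; ++i)
                ((i - base) % 2 == 0 ? stream0 : stream1).push_back(linear[i]);
        }
    }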

4.4 The COPY instruction

The WRITE instructions that are issued form the major part of the total code size. An attempt made to reduce the number of WRITE instructions was to introduce a form of delta coding through a COPY instruction. The motivation is that, in many cases, a new microcode line differs from an existing cache line in only a few bits. By using a COPY instruction that encodes only the original index and the bits to be flipped, we hoped to reduce the total code size. In order to optimize the use of COPY instructions, they are issued only when the corresponding WRITE instruction would occupy more bits than the COPY instruction. The decision whether to use COPY or WRITE is made by the encoder; since the encoder runs in software, it can typically afford expensive operations.
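The decision reduces to comparing bit costs; a sketch reusing the illustrative field widths assumed in the Section 3.1 example (all constants are assumptions, not values from the paper):

    #include <bit>        // std::popcount, C++20
    #include <cstdint>

    constexpr int WRITE_BITS     = 43;  // opcode + cache + index + 32-bit data
    constexpr int COPY_BASE_BITS = 17;  // opcode + cache + two indices
    constexpr int BITS_PER_FLIP  = 5;   // one flipped-bit position

    // Emit COPY instead of WRITE only when it is actually smaller, i.e.
    // when the new line differs from a cached line in only a few bits.
    bool preferCopy(std::uint32_t newLine, std::uint32_t cachedLine) {
        int flips = std::popcount(newLine ^ cachedLine);
        return COPY_BASE_BITS + flips * BITS_PER_FLIP < WRITE_BITS;
    }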

5. Experimental results

5.1 Evaluation methodology

We built a simulation environment to evaluate our scheme: the encoder has been implemented as a Java application, and the decoder as a cycle-accurate C++ simulation. Three test architectures have been used: RSA is a hardware implementation of the RSA encryption/decryption algorithm, CC is a convolution coder processor, and DLX is a processor compliant with the DLX ISA. For the DLX example, we use six test applications that cover a wide range of multimedia applications, all built from C code and taken from [2]: des_branch is an implementation of the DES cryptographic algorithm; des_unrolled is the same implementation, but with the main loop (which contains 8 iterations) unrolled; fft is an implementation of the FFT algorithm; gsm_encode and gsm_decode are the two main kernels of the voice codec for the GSM standard; finally, idct is an implementation of the inverse discrete cosine transform used in the JPEG 2000 standard. All these programs have been compiled to assembly using GCC 2.7.2, cross-compiled for PISA. The Tipi microcode generator has been used to produce the microcode consumed by the encoder.

5.2 Optimization of the codec parameters

Before using our codec, the user has to decide the parameters that will be used to generate the codec, such as the size of the caches, their number, the number of pipelines in the decoder, and the length of the sequence used to index the caches. In the following paragraphs, we study the effect of varying these parameters on the global performance of the codec.

[Figure 5.1: Compression ratio versus cache size, for des_branch, des_unrolled, fft, gsm_decode, gsm_encode, idct, and their median.]

One could expect that increasing the cache size would allow a better compression rate by increasing the hit rate. This is true, but another effect has to be taken into account: each time the cache size crosses a power of two, one extra bit is needed to write an index into the cache, which drastically increases the size of the SEQUENCE instruction. Figure 5.1 shows the measured compression ratio as the cache size varies; the gaps at size 32 and size 64 are clearly visible. For the DLX example, increasing the cache size beyond 64 does not increase the hit rate. The idct example has the worst compression rate, a fact we will encounter again in the other measurements; this is mainly due to its small size (541 instructions), which does not allow it to take full advantage of the cache.
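The jump at each power of two can be reproduced with a few lines of arithmetic (a sketch; the 7-index sequence matches the sequence length used in the next measurement):

    #include <cstdio>

    // Bits needed to address one line in a cache of 'size' lines.
    int indexBits(int size) {
        int bits = 0;
        while ((1 << bits) < size) ++bits;
        return bits;
    }

    int main() {
        // Every index stored in a SEQUENCE instruction costs indexBits()
        // bits, so the instruction grows abruptly at each power of two.
        for (int size : {32, 33, 64, 65})
            std::printf("%2d lines -> %d bits per index\n", size, indexBits(size));
        // Prints 5, 6, 6, 7 bits: a 7-index SEQUENCE grows by 7 bits at
        // each threshold, producing the gaps visible in Figure 5.1.
    }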

[Figure 5.2: Compression ratio versus sequence length, for the same benchmarks.]

As shown by Figure 5.2, increasing the sequence length always has a positive effect on the compression ratio: with longer sequences, the total number of sequences in the packed file decreases. As each sequence write is associated with a START, we decrease the number of these instructions, and the bits associated with them are not written in the packed file.

To avoid stalling the main architecture, the decoder has to finish writing the data into the cache and the sequence into the sequence manager before the START is issued. For a short sequence length, this means that we have to guarantee a high IPC. Figure 5.3 shows the effect of a variation in the issue width (equal, for us, to the IPC) on the number of stalls at the output, for a sequence length of 7. An IPC of one is definitely too small for this sequence length, and up to 7% stalls are introduced. Above IPC=2, however, the decoder is fast enough to decode the whole stream without any stall. The remaining 1% of stalls is introduced during initialization, when the decoder prepares the first sequence.

[Figure 5.3: Stalls in the microcode versus issue width.]

5.3 Comparison with other schemes

Figure 5.4 presents a comparison, in terms of compression ratio, between our codec (MT-codec, for Microcode Trace Coder-Decoder) and other reference schemes. Huffman is a standard coding method widely used to compress ordinary code; we use it to compress the microcode and measure the size of the compressed file it produces (the source code for this application was taken from [4]). DLX is the classic instruction decoder for a DLX architecture; the compression ratio for DLX is defined as the size of the binary object file divided by the microcode size obtained using the SimpleScalar [3] simulator. We also give the compression ratio that gzip achieves on the microcode; we consider it the practical limit that a compression scheme can reach (although gzip is not implementable in hardware).

Figure 5.4: Comparison with other schemes (compression ratio in percent of the original microcode size; lower is better; the DLX decoder applies only to the DLX examples)

          RSA     CC     idct   des_branch  fft    gsm_decode  gsm_encode  des_unrolled  Avg
Huffman   165.33  30.14  39.21  34.50       32.18  34.23       34.44       32.71         34.55
DLX       -       -      11.05  10.29       10.04  10.17       10.13       10.35         10.34
MT-codec  58.22   8.17   19.85  8.47        8.38   8.80        8.24        6.47          10.03
gzip      85.33   10.68  17.53  9.69        7.78   6.35        6.30        4.01          8.61

MT-codec does better than Huffman on every example. It is also nearly equivalent to the DLX instruction decoder, even though the DLX ISA has been carefully designed to save room in memory, whereas MT-codec is an automatically generated coding scheme. MT-codec is not far from the gzip limit, and even does better on the des_branch example.

Figure 5.5 presents the data rate generated by the instructions on the memory bus during the execution of the des_branch program. As a reference we used a DLX processor simulation (SimpleScalar tool suite) with an L1 instruction cache of 64 lines of 128 bits each, 8 Kbit in total; the replacement policy is LRU. On average, our scheme generates less traffic on the memory bus (7.2 bits/cycle) than the DLX processor (8.3 bits/cycle). The main core of des_branch is a loop of relatively small size that fits entirely in the instruction cache of the DLX processor, which explains why a large part of the execution generates no traffic at all. In comparison, our scheme permanently needs some SEQUENCE instructions, which keeps a residual traffic on the bus. The peak traffic is much lower with our scheme (20.7 bits/cycle versus 38.8 bits/cycle), which can be explained by the fact that when the caches are totally empty, or do not contain any useful data, DLX has to load full instructions, whereas our decoder uses delta coding (the COPY instruction) to modify the invalid lines already present in the cache.

[Figure 5.5: Memory data rate (bits/cycle) over the execution of des_branch, DLX versus MT-codec.]

5.4 Efficiency of the scheme

To see how we could still improve our scheme, we measured the contribution, in size, of each instruction type to the packed file. On average, SEQUENCE represents 66% of the size, COPY 20%, and WRITE 12%. This shows that, for the kind of scheme we have chosen (trace cache operations), we are close to the optimum, which would be a packed file composed uniquely of SEQUENCE instructions, in which writing data accounts for only a small part of the file size. These figures also show that the COPY instruction is an efficient way to compress the WRITE instruction: COPY accounts for 34% of the instructions but only 20% of the size, while WRITE accounts for 5% of the instructions but 12% of the size. Table 5.1 shows a direct measure of this effect: we forced the encoder not to use any COPY instructions and measured the compression ratio for all the DLX examples. The improvement in size brought by the COPY instruction ranges from 9% to 59% and averages 37%. This confirms that delta coding is used efficiently by the coding/decoding scheme.

Table 5.1: Compression ratio with and without the COPY instruction

              COPY     NO COPY   Improvement
des_branch     8.70%    11.86%    26.64%
des_unrolled   6.47%     7.12%     9.13%
fft            8.38%    12.29%    31.81%
gsm_decode     8.80%    21.39%    58.86%
gsm_encode     8.24%    14.37%    42.66%
idct          19.85%    41.17%    51.79%
Average       10.07%    18.03%    36.82%

6. Conclusions

In this paper, we presented a scheme for the compression of microcode traces. In our scheme, the encoded stream is a set of instructions for a processor that operates on a set of trace caches and reads the values out in the right order. We presented the details of the decoder and the encoder, and compared the performance of our scheme with the hand-coded ISA of the DLX processor and with a Huffman encoding of the original microcode.

The results show clearly that our scheme is competitive with the hand-made encoding for the DLX; on average it performs slightly better. Our scheme does better in all cases than the Huffman encoding, which fully exploits neither the redundancy of the don't cares nor delta coding.

Our scheme also gives the architecture designer a way to explore various parameters and choose those best suited to the architecture.

Since in many cases the ISA encoding does not matter to the architecture designer, and since hand-coding involves significant effort and verification time, we believe an automatically generated encoding will be of great value. Our scheme provides an automatic way to produce an ISA, namely the encoded stream.

Future Work

Our scheme does not provide an efficient branch mechanism. The JUMP mechanism is a slow mechanism used to initialize the state of the processor and to jump from one big linear block of code to another. The decoder architecture could be extended to provide a truly efficient fine-grained branch mechanism: on a branch, the processor would start filling the caches with the microcode needed for both possible outcomes of the branch. The sequence managers would receive feedback from the main architecture and decide which sequence to execute. This would allow a fast branch with no extra cycle introduced by the decoding.

The COPY instruction provides an efficient way to compress the writes to the caches. After adding COPY, the SEQUENCE instructions represent the biggest part of the compressed file. A similar instruction could be devised as a compressed form of the SEQUENCE instruction.

For each architecture, the user still has to decide all the parameters of the encoding/decoding scheme. We provided results that can be used as a guideline to choose these parameters, but a more efficient approach would be to generate them automatically, using an analysis of the topology of the main architecture.

References

[1] Matthias Gries, Scott Weber, and Christopher Brooks: The Mescal Architecture Development System (Tipi) Tutorial. Electronics Research Lab, University of California at Berkeley, UCB/ERL M03/40, October 2003.

[2] OpenMash.org: source code for vat-gsm, des

[3] SimpleScalar.org: PISA instruction set compiler and simulator

[4] DataCompression.info: Huffman source code

[5] Shai Rotem, Ken Stevens, et al.: RAPPID: An Asynchronous Instruction Length Decoder.

[6] Heidi Pan, Krste Asanović: Heads and Tails: A Variable Length Instruction Format Supporting Parallel Fetch and Decode.

Acknowledgments

Thanks to Scott Weber and Matthew Moskewicz for their help in this project, and the time they have spent with us.

Thanks to Bongo Burger for the energy they provided us during many lunch breaks.
