Bit-Level Partial Evaluation of Synchronous Circuits
Total Page:16
File Type:pdf, Size:1020Kb
Bit-Level Partial Evaluation of Synchronous Circuits Sarah Thompson Alan Mycroft Computer Laboratory Computer Laboratory University of Cambridge University of Cambridge [email protected] [email protected] ABSTRACT Table 1: Rewrite Rules for Combinational PE Partial evaluation has been known for some time to be very effective when applied to software; in this paper we demonstrate that it can also be usefully applied to hard- a ∧ a → a a ∨ a → a ware. We present a bit-level algorithm that supports the a ∧ false → false a ∧ true → a partial evaluation of synchronous digital circuits. Full PE a ∨ false → a a ∨ true → true of combinational logic is noted to be equivalent to Boolean minimisation. A loop unrolling technique, supporting both ¬false → true ¬true → false partial and full unrolling, is described. Experimental results are given, showing that partial evaluation of a simple micro- processor against a ROM image is equivalent to compiling Section 3 this is extended to encompass loop unrolling of the ROM program directly into low level hardware. synchronous circuits. Section 4 introduces HarPE, the hard- ware description language and partial evaluator used to sup- 1. INTRODUCTION port the experimental work leading to the results described in Section 4. Partial evaluation [12, 14, 13] is a long-established tech- nique that, when applied to software, is known to be very powerful; apart from its usefulness in automatically creating 2. PE OF COMBINATIONIAL CIRCUITS (usually faster) specialised versions of generic programs, its The simplest form of hardware partial evaluation is al- ability to transform interpreters into compilers is particu- ready well known to hardware engineers, though under a larly noteworthy. different name: Boolean optimisation. For example, the In this paper, we present a partial evaluation framework combinational circuit represented by the expression for synchronous digital circuits that, whilst supporting spe- a ∧ (b ∨ c) cialisation, also supports the first Futamura projection [10]: where a, b and c are inputs, may be specialised for the case PE[[interpreter, program]] = compiler[[program]]. where c is known to be true as follows: In hardware terms, this is equivalent to taking the circuit a ∧ (b ∨ true) = a ∧ true = a for a processor and a program ROM image, then compiling this into hardware that represents the program only – as Boolean optimisation is well-studied, with many approaches with software partial evaluation, the processor itself is op- documented in the literature. Some techniques are sub- timised away, leaving only the functionality of the program optimal but can be applied to any circuit, whereas others expressed directly in hardware. (e.g. OBDDs, flattening to CNF or DNF) can yield optimal 1 Note that in this paper, we consider only the partial evalu- results but suffer exponential size blowup when confronted ation of purely synchronous circuits, i.e. circuits consisting with some circuits, particularly multipliers. Any of these only of acyclic networks of gates, with feedback occurring techniques are potentially capable of PE of combinational only via D-type latches whose clock inputs are all driven by circuits, though for clarity and generality, we have opted for a single global clock net. Generalisation to the asynchronous a simple approach based upon term rewriting. Table 1 shows case is discussed briefly in Section 7.1. a simple set of rewrite rules that are sufficient to implement In Section 2 we discuss PE of combinational circuits; in (sub-optimal) combinational PE in time and space that is linear with respect to the original circuit. The software analogue of a combinational circuit, from the Permission to make digital or hard copies of all or part of this work for point of view of PE, would be a program consisting only of personal or classroom use is granted without fee provided that copies are assignment statements, if-then and if-then-else constructs, not made or distributed for profit or commercial advantage and that copies but strictly no loops. bear this notice and the full citation on the first page. To copy otherwise, to 1 republish, to post on servers or to redistribute to lists, requires prior specific In hardware terms, ‘optimal’ is not easily defined. In some permission and/or a fee. cases, circuits are optimised for minimum gate count, though ACM SIGPLAN Workshop on Partial Evaluation and Program Manipula- more commonly they are optimised for speed or power con- tion, Charleston, South Carolina, January 9-10 2006 sumption – only rarely will a circuit be optimal with respect Copyright 2005 ACM X-XXXXX-XX-X/XX/XX ...$5.00. to more than one of these considerations. Figure 1: General Form of Synchronous Circuits Figure 2: Synchronous Circuit After One Unrolling input input1 CLK g output1 State f g output CLK State f f g output2 3. PE OF SYNCHRONOUS CIRCUITS input2 A slight variation2 on the general form of any synchronous circuit, generally referred to as a Mealy machine [17], is shown in Fig. 1. In software terms, such circuits resemble a corresponding to the circuit shown in Fig. 2. After unrolling, program of the form since statek+1 is not externally observable, it need not be while true explicitly computed. In one clock cycle, this new circuit statek+1 := f(statek, input) performs the same computations that the original circuit output := g(statek+1, input) performed in two cycles, though possibly with a slower max- endwhile imum clock rate due to longer worst-case paths. In the general case, since the input i and output j may Each iteration of the loop represents exactly one clock cy- change at every cycle, they may need to be separately ac- cle. The internal state of the circuit, represented by statek, cessible in the unrolled circuit. Often, though, input may may change only once per clock cycle and is determined be known to remain unchanged for many iterations, or for only by the result of the combinational Boolean function all time. It is common, also, for outputs other than those f : (B×···×B)×(B×···×B) → (B×···×B); the subscript resulting from the final state to be unimportant; in com- k has no effect on execution, and is purely a naming con- bination, this allows loop unrolling to generate much more vention that is convenient when describing unrolling. The efficient hardware. input is assumed to change synchronously with the clock, In the rest of this paper, we make the assumption that and the output is determined by the combinational Boolean input may change only synchronously with the the clock of function g : (B ×···× B) × (B ×···× B) → (B ×···× B). the generated hardware, and that output reflects the final Note that the internal state of the circuit, represented by state of the unrolled loop body at the end of each clock state, is observable only through g. cycle. Partial evaluation of this kind of circuit typically requires specialisation of f and g, but may also involve either partly 3.1 Multiple Unrollings or completely unrolling the while loop. State minimisation Loops may be unrolled an arbitrary number of times by 3 is not performed . A single unrolling yields the program the following method: while true while true statek+1 := f(statek, input 1) statek+1 := f(statek, input) output 1 := g(statek+1, input 1) statek+2 := f(statek+1, input) +2 f +1, statek := (statek input 2) . output 2 := g(statek+2, input 2) state + := f(state + −1, input) endwhile k n k n output := g(statek+n, input) endwhile which can be equivalently expressed as while true or, equivalently: statek+2 := f(f(statek, input 1), input 2) while true output 1 := g(f(statek, input 1), input 1) statek+n := f(f(. f(statek, input),... input) output 2 := g(f(f(statek, input 1), input 2), input 2) endwhile output := g(statek+n, input) endwhile 2Conventionally, f would be drawn on the left of the state flip-flops. This (equivalent) form allows unrolling to be de- Since the repetition of input gives potential for common scribed more conveniently. subexpression elimination, and g needs to be evaluated ex- 3 In synchronous circuits, state minimisation is often unde- actly once regardless of the number of unrollings (see Fig. 3), sirable – though it reduces flip flop count, the extra state de- the gate count of any resulting circuit is typically (and often coding logic required often adversely affects maximum clock n f g f rates. As an extreme example, the performance advantages substantially) less than × | | + | |, where | | is the gate of a carefully designed one-hot encoded state machine might count of the original f, |g| is the gate count of the original g, be lost entirely if state minimisation was to be na¨ıvely at- and n is the number of unrollings. In some cases, the final tempted. gate count may even be less than |f| + |g| (see Section 4). Figure 3: Synchronous Circuit After n Unrollings Figure 4: Synchronous Circuit After Full Unrolling input CLK input f . f g output State initial f f . f g output state 3.2 Reset Logic a larger, more sophisticated general purpose programming Most synchronous circuits require some form of reset ca- language, in this case C++. pability, corresponding to the following program: The HarPE language currently exists as an ISO C++ tem- plate library, taking advantage of template metaprogram- state1 := initialstate ming techniques [26, 27]. Compiling and then executing a while true C++ source file incorporating HarPE code causes a hard- statek+1 := f(statek, input) ware netlist to be generated. The current compiler generates output := g(statek+1, input) gate-level Verilog for further processing by a conventional endwhile tool chain. In pure-synchronous hardware terms, since state may only 4.1 Semantics change at a clock edge, it is not possible to have code exe- cute outside the loop, so practical implementations usually HarPE source code has a standard, imperative semantics, resemble the following: with the characteristic that a whole program defines exactly one machine cycle.