230 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007 Optimality Study of Logic Synthesis for LUT-Based FPGAs Jason Cong and Kirill Minkovich

Abstract—Field-programmable gate-array (FPGA) logic syn- thesis and technology mapping have been studied extensively over the past 15 years. However, progress within the last few years has slowed considerably (with some notable exceptions). It seems natural to then question whether the current logic-synthesis and technology-mapping algorithms for FPGA designs are produc- ing near-optimal solutions. Although there are many empirical studies that compare different FPGA synthesis/mapping algo- rithms, little is known about how far these algorithms are from the optimal (recall that both logic-optimization and technology- mapping problems are NP-hard, if we consider area optimization in addition to delay/depth optimization). In this paper, we present a novel method for constructing arbitrarily large circuits that have known optimal solutions after technology mapping. Using these circuits and their derivatives (called Logic synthesis Exam- ples with Known Optimal (LEKO) and Logic synthesis Exam- ples with Known Upper bounds (LEKU), respectively), we show that although leading FPGA technology-mapping algorithms can produce close to optimal solutions, the results from the entire logic-synthesis flow (logic optimization + mapping) are far from optimal. The LEKU circuits were constructed to show where the logic synthesis flow can be improved, while the LEKO circuits specifically deal with the performance of the technology map- ping. The best industrial and academic FPGA synthesis flows Fig. 1. Possible area-minimal mapping solutions. (a) Original circuit. (b) Mapping solution without logic optimization. (c) Mapping solution with are around 70 times larger in terms of area on average and, in logic optimization. some cases, as much as 500 times larger on LEKU examples. These results clearly indicate that there is much room for further research and improvement in FPGA synthesis. capable of implementing any K-input one-output Boolean function. Index Terms—Circuit optimization, circuit synthesis, design automation, field-programmable gate arrays (FPGAs), optimiza- Given a register transfer level (RTL) design, the typical tion methods. FPGA synthesis process consists of RTL elaboration, logic syn- thesis, and the physical design (layout synthesis) [11]. In this paper, we will focus on logic synthesis, which can be broken I. INTRODUCTION down into two main steps: logic optimization and technology IELD-PROGRAMMABLE gate arrays (FPGAs) have mapping. Logic optimization transforms the current gate-level F been gaining momentum as an alternative to application- network into an equivalent gate-level network more suitable for specific integrated circuits (ASICs). FPGAs consist of program- technology mapping. Technology mapping transforms the gate- mable logic, input-output (I/O), and routing elements, which level network into a network of programmable cells (in our can be programmed and reprogrammed in the field to customize case, these cells are LUTs) by covering the network with these an FPGA, enabling it to implement a given application in a cells. Several algorithms perform logic optimization during matter of seconds or milliseconds. The most common type technology mapping. As an example, Fig. 1 shows the differ- of programmable-logic element used in an FPGA is called a ence between mapping algorithms that use logic optimization K-LUT, which is a K-input one-output lookup table (LUT), and those that do not. By examining the logic function of f,we can see that it just takes the logical AND of all its inputs; thus, by manipulating the circuit, we can reduce the mapping solution Manuscript received March 16, 2006; revised July 17, 2006 and by one 4-LUT. Since the size of the circuit will be directly September 13, 2006. This work was supported in part by the National Science proportional to the price of an FPGA that can implement Foundation under Grant CCF-0306682 and Grant CCF 0430077 and in part by it, the logic-synthesis step will play an integral role in the Altera Corporation, Magma Design Automation Inc., and Inc. under the California MICRO Program. This paper was recommended by Associate Editor design flow. K. Bazargan. As FPGA technology gained popularity throughout the The authors are with the Computer Science Department, University of 1990s, a large amount of work was published that dealt with California, Los Angeles, CA 90095 USA (e-mail: [email protected]; cory_m@ cs.ucla.edu). logic synthesis and/or technology mapping of FPGAs, includ- Digital Object Identifier 10.1109/TCAD.2006.887922 ing Chortle-crf [20], XMap [24], TechMap [31], DAG-map [9],

0278-0070/$25.00 © 2007 IEEE CONG AND MINKOVICH: OPTIMALITY STUDY OF LOGIC SYNTHESIS FOR LUT-BASED FPGAs 231

FlowMap [13], Zmap [16], Cutmap [12], BoolMap [27], and this approach (SAT-based optimal logic optimization is applied many others. More comprehensive surveys of FPGA synthe- to logic cones of up to ten inputs). sis and mapping algorithms are available from [8] and [11]. In this paper, we present a novel method for constructing These mapping algorithms employ many different techniques arbitrarily large circuits that have known optimal solutions after to achieve their solutions, including dynamic programming, bin technology mapping, or known upper bound solutions after packing, BDD-based logic simplification, and cut enumeration, logic optimization and technology mapping for LUT-based just to name a few. Some of these algorithms focused on delay FPGAs. Using these circuits [called Logic synthesis Examples minimization [14], [20], [24], [29], [31], [35], while others with Known Optimal (LEKO) and Logic synthesis Examples focused on area minimization possible under delay or depth with Known Upper bounds (LEKU)], we show that although constraints [9], [10], [12], [13], [16], [27], [31], [36]. All of leading FPGA technology-mapping algorithms can produce these algorithms were developed over a ten-year period in the close to optimal solutions, the results from the entire logic- 1990s but, after this influx, the amount of new published work synthesis flow (logic optimization + mapping) are far from op- began to decrease steadily with only a few novel algorithms timal. The best industrial and academic FPGA synthesis flows emerging in the past few years—such as IMAP [3], Hermes are around 70 times larger in terms of area on average and, in [19], DAOmap [7], and the ABC mapper [1], [38]. To many some cases, as much as 500 times larger on LEKU examples. people, this signaled that FPGA synthesis algorithms had prob- These results clearly indicate that there is much room for further ably hit a plateau. It is natural to then question whether the research and improvement in FPGA synthesis. current logic-synthesis and technology-mapping algorithms for FPGA designs are producing near-optimal solutions. Although II. CONSTRUCTION OF BENCHMARKS there are many empirical studies that compare different FPGA synthesis/mapping algorithms, little is known about how far A. Construction of LEKO Examples these algorithms are from the optimal (recall that both logic- We present an algorithm for constructing a network Gn (with optimization and technology-mapping problems are NP-hard n inputs and outputs) of an arbitrarily large size that has a if we consider area optimization in addition to delay/depth known optimal technology mapping solution. Gn is constructed optimization). in a special way by replicating a small circuit with a known In fact, a similar question was raised a few years ago when optimal mapping solution into a circuit of any size that also has placement research slowed down. However, using a set of a known optimal mapping solution. These circuits are called cleverly constructed examples, called placement examples with LEKO examples. The building block of our construction is a known optimal (PEKO), the study in [6] showed surprising re- “core graph” named Cn, with the following properties. sults: Wirelengths produced by state-of-the-art placement tools n n at that time were 1.66–2.53 times the optimal solutions in the 1) It has inputs and outputs. n worst cases. These results generated a renewed interest in place- 2) Every output is a function of all inputs. C ment research; within two to three years, a large body of papers 3) Each internal node of n has exactly two inputs. was published on placement optimality studies (e.g., [15], [17], 4) There exists an optimal (in terms of area/depth) mapping C M [18], and [23]), as well as novel placement algorithms (e.g., of n into a 4-LUT mapping solution, denoted as n, M [4], [5], [22], [30], and [34]). Within three years, the optimality such that n only has 4-LUTs (no 3-LUTs or 2-LUTs). C M gap on the PEKO examples was reduced to roughly 20% on For the 5 showninFig.2, 5 has exactly seven 4-LUTs. average [4]. The actual improvement on the IBM (ISPD04) This general method can be used to create core graphs benchmarks was 24% by the mPL placer [33]. This leads one of arbitrary size, although in this paper, we will only present to believe that the improvement on the artificially constructed how C5 and C6 are constructed. These core graphs target the PEKO examples correlates to some extent the improvement 4-LUT architecture, because it is the simplest of those currently on the “real” (more realistic) examples. This might be due to in use. But, the ideas behind our construction can be extended to the fact that a small suboptimality of an algorithm on the real Altera’s Adaptive Logic Module or any sized LUT architecture benchmarks often gets magnified into a significant optimality by only adapting the fourth property to reflect an optimal gap on the PEKO benchmarks. mapping in that architecture. Unfortunately, there is no simple way to extend the ideas The specific C5 we used to construct our LEKO and LEKU of testing placement optimality to logic synthesis because of benchmarks is shown in Fig. 2, and the optimality of its tech- the inherent differences in the two problems. Therefore, little nology mapping solution is stated in Theorem 1a and verified progress has been made on testing the optimality of logic- in its proof. The core graph C6 was made similarly to C5 synthesis algorithms. The research in [28] presented a method by following the four properties and slowly modifying the that could only create very small structureless test cases, and structure until it was hard for the structural-based mappers to they were used to test very primitive mappers. Another method, map [ABC and DAOmap]; then, the logic was modified to make described in [2], used a SAT solver as an exact logic-synthesis it difficult for the logic-synthesis tools, like ISE and Quartus, tool for LUT-based FPGAs to determine how much more the to map. The optimality of C6’s depth is extracted from the circuit area could be reduced by postprocessing the mapping DAOmap result, and the optimality of the ten 4-LUTs needed solutions produced by existing mappers. But, the results sug- to map it is proven in Theorem 1b. gested that current mappers could not be easily improved. This Theorem 1a: C5 has an area-optimal technology mapping is largely due to the highly localized search algorithm used in solution of seven 4-LUTs. 232 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007

Fig. 2. Example of C5.

Proof: This is proved using the binate-cover technique, Using this newly created Cn, a LEKO circuit G is created which is able to compute the minimum-area technology map- by stacking up Cns in such a way that from the outputs of G, ping solution. In particular, we use the binate-covering solver in there is only one way to traverse Cn to get to the inputs. The SIS [32] using the command “xl_cover - h 0.” The binate solver exact algorithm is presented in Fig. 3, where createLEKO (L) L−1 in our case returned a seven 4-LUT solution. The reason that creates a LEKO example with L · n Cns in L layers. In the this method cannot be used to prove the optimality of the larger algorithm, the (union) operator does not disturb the order of LEKO circuits is because this tool is computationally infeasible the inputs or outputs. For example, when looking at the A B, for returning an optimal binate-covering solution for any graph we can think of the inputs and outputs as an array of nodes. with more than 100 logic gates.  Then, the index of every input node from A appears before any Theorem 1b: C6 has an area-optimal technology mapping input node from B in A B; likewise, the property holds for the solution of ten 4-LUTs. output nodes. The Copy operator creates a copy of the network Proof: This can also be proved using the binate cover renaming all the nodes, createEdge(x, y) just creates an edge provided in SIS. But for testing any core graphs with more from x to y, and Goutput[i] is the ith output of G (Fig. 3). than six inputs, the binate-cover algorithm would have to be Logically, the createLEKO algorithm works as follows. It implemented using a current SAT solver.  builds up the graph using layer upon layer of Cns in order to CONG AND MINKOVICH: OPTIMALITY STUDY OF LOGIC SYNTHESIS FOR LUT-BASED FPGAs 233

Fig. 5. LEKO(G25).

Fig. 3. createLEKO algorithm.

Fig. 6. LEKO(G36).

C6 instead of C5. We similarly constructed G125 (Fig. 7) with three layers, and we constructed G625 with four layers. Theorem 2: The optimal mapping solution of an arbitrarily sized LEKO circuit without logic optimization is achieved when every Cn in the circuit is mapped optimally without overlapping any other Cn. Proof: Now that we have the ability to construct arbitrar- ily sized LEKO circuits, we can show that this construction creates a circuit G with a known optimal binate cover, which we proved in Theorem 1. Assuming we have an arbitrary LEKO circuit G with L layers, we prove Theorem 2 by induction over the layers of G. Claim 1 will be used in almost all of the other claims as it proves that there are no reconverging paths of Cns. Claims 2 and 3 will help prove the base case, while Claim 4, working with Claims 2 and 3, helps prove the induc- tive step.  Claim 1: Treelike structure of Cns (no reconverging paths Cns Fig. 4. LUT spanning two layers. of ). GivenanarbitraryCn, x, on the top layer and a LEKO G, starting at any Cn at the bottom layer, there is only one way to get a LEKO example of L layers. It first creates the bottom traverse the Cns to get to x. L−1 layer of n Cns, then for each additional layer, it makes Proof: Assume we start at an arbitrary Cn, call it x, on the L−1 n copies of Cn and proceeds to connect the outputs of graph top layer. From the construction it should be obvious that a path G to the inputs of the newly created layer. It spreads out the exists from any Cn at the bottom layer to x (i.e., x is connected connections in such a way that for an arbitrary Cn at the top to every Cn on the bottom layer). Now, let us consider the layer, there exists a path to it from every Cn at the bottom layer maximum number of Cns we are connected to after one layer, (i.e., every Cn at the top level is connected to every Cn at the which is n (since x has n inputs). Similarly, after two layers 2 bottom level). Thus, using this algorithm and any number m, the maximum number of Cns that are connected to x is n one can create a LEKO circuit having more than m nodes and a (since x has n inputs and the nCns that feed x’s inputs also known optimal technology mapping solution whose optimality have n inputs), and the maximum number of Cns that can reach is proved in Theorem 2. By using this method, we were able to x at layer L (after L-1 layers) is nL−1. Now, if there were any construct G25 (Fig. 5) by calling createLEKO with two layers reconverging paths connecting x to the rest of the Cns, there L−1 and a C5. G36 (Fig. 6) was constructed the same way but used would be strictly less than n Cns at the bottom layer that 234 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007

Fig. 7. Partial view of LEKO(G125). can reach x. By construction, it should be obvious that every However, this is impossible. Due to the tree structure of G, Cn at the bottom layer can reach x, therefore, there are no every input into every Cn at layer i will never meet again; this reconverging paths.  was proven in Claim 1.  Claim 2: Mapping upward (layer i). Proof of Theorem 2: Let G be an arbitrary LEKO circuit Mapping the nodes in Cn at layer i so that the resulting LUT with L layers constructed using the previous construction. takes nodes from layer i and i+1(i.e., mapping upward across Let us define a property that we will use in the proof. a layer) requires one more LUT than mapping within the layers. Property P(n): Let P(n) mean that the optimal mapping Proof: Assuming that the inputs to Cn on layer i are for all nodes up to layer n is the optimal mapping of each Cn already LUTs, we know that the optimal mapping of each separately. Cn has everything packed as tightly as possible (it only uses It is then enough to show that P(1) is true, and if P(m − 1) ⇒ 4-LUTs), so in order to extend into layer i+1, one of the P(m), where 2 ≤ m ≤ L (which will show by induction that LUTs (in the optimal mapping of a Cn) has to split into two the optimal mapping of our arbitrary G is just the optimal separate LUTs, thereby creating one additional LUT. This is mapping of each Cn separately).  because every output of a particular Cn in layer i will never be Base Case: P(1) is true. combined with another output from that Cn in any layer above i Proof: Before we begin the proof, this is what “all nodes (Claim 1).  up to layer 1” looks like: Claim 3: Mapping upward (layer i+1). Any extensions of LUTs from layer i into layer i+1 will not result in layer i+1being mapped with fewer LUTs than mapping within the layers. Now, that we have an understanding of what this looks like, Proof: Assume that the number of LUTs to map a Cn optimally is N, and the LUT that spans layer i and layer i+1is let us consider all the possible ways to map all nodes up to called x. Since LUT x is partially in layer i and partially in layer layer 1. i+1, the LUT has at most three inputs in layer i+1. Since this Case 1) Mapping exactly all the nodes of layer 1 and not LUT in layer i+1has only three inputs to choose from, and mapping any nodes of layer 2. we know the optimal mapping for Cn in layer i+1is strictly Since there is no overlap between the Cns (thus made up of LUTs with exactly four inputs. This will result, in trying to pack nodes from different Cns into one the best case, in a mapping for the Cn in layer i+1 with N LUT cannot possibly reduce the area) and we know LUTs plus one spanning the two layers. Another way to look at the optimal mapping of Cn, the optimal mapping of this is to consider the question: Can you map the Cn on layer this layer will result in mapping each Cn separately. i+1using N − 1 4-LUTs and one 3-LUT? Consider Fig. 4 for Case 2) Mapping exactly all the nodes of layer 1 and possi- a pictorial representation of the question proposed. The answer bly mapping some nodes of layer 2. to this question is clearly no because of the optimality of the N Now, we have to consider the case where the LUTs needed to map Cn.  optimally mapped 4-LUT solution for layer 1 maps Claim 4: Mapping downward (layer i). some nodes in layer 2. But from Claim 2 (its as- Any extensions of LUTs from layer i into layer i − 1 will not sumption holds since we are at the lowest layer result in layer i being mapped with fewer LUTs. and all the inputs are primary inputs) and Claim 3, Proof: Assume that the inputs to Cn at layer i are already we see that it will not help mapping if LUTs span LUTs. We know that in the optimal mapping of each Cn across layer 1 and into layer 2; thus, Case 2) cannot everything is already packed as tightly as possible (it only uses happen and Case 1) must happen. From Case 1), we 4-LUTs), so extending into layer i − 1 will not be possible can see that P(1) must hold; thus, the base case is unless there are reconverging paths at some layer below i. proved.  CONG AND MINKOVICH: OPTIMALITY STUDY OF LOGIC SYNTHESIS FOR LUT-BASED FPGAs 235

Inductive Step: P(m − 1) ⇒ P(m). adders, multipliers, and shifters. To construct a core graph Proof: Recall that P(m) is saying that the optimal mapping from a pre-existing benchmark, all one has to do is extract a for all nodes up to layer m is the optimal mapping of each piece of logic that has an equal number of input and output Cn separately. Since P(m − 1) is assumed, we know that the and is hopefully hard to map. The problem is that it can be optimal mapping for all nodes up to layer m − 1 is the optimal quite difficult to trick the tools into performing the wrong thing mapping of each Cn separately. Now, all we need to know to with a simple circuit. For example, every mapper managed to prove P(m) is that any LUTs spanning two separate layers will optimally map, in terms of depth and area, an 8-bit adder and not result in a better mapping solution (Claim 4). With Claim 2, an 8 × 8 multiplier. which uses the inductive hypothesis P(m − 1) to uphold the Since these circuits were constructed by hand, it is inter- assumption that all the inputs to layer m are already LUTs, and esting to see how similar they are to existing benchmarks. Claim 3, we know that the mapping of nodes up to layer m will We will show this by comparing the LEKO and LEKU to not intrude on layer m+1. Moreover with Claim 4, we know Microelectronics Center of North Carolina (MCNC) bench- that the mapping will not create LUTs that intrude into any layer marks in terms of their Rent’s exponent, I/O to node count, below m. Thus, the optimal mapping for layer m is to map each and maximum fanout free cone (MFFC) size. The MCNC Cn separately, and the inductive step is proven. Therefore, by benchmarks were processed through standard simplification induction, the optimal mapping of G is that which maps every scripts including decomposition into two input gates (for a fair Cn optimally and separately.  comparison to the LEKO and LEKU circuits) but remained unmapped. Rent’s rule [26] shows the relationship between the B. Construction of LEKU Examples number of external I/O connections to a and the number of logic gates in the logic block. It is estimated to be A LEKU example LEKU(G) is derived from the LEKO ex- the slope of the regression line comparing the size versus ample G after collapsing and gate decomposition of G. Clearly, number of I/O on the log–log scale. Higher Rent’s exponent the optimal optimization + technology mapping solution of values correspond to a higher topological complexity, with a G provides an upper bound on the area of the corresponding Rent’s exponent value of zero corresponding to a simple logic LEKU(G) example due to the functional equivalence of G and chain and a value of one corresponding to a clique. In this paper, LEKU(G). In this paper, we focus on constructing LEKU(G25), we estimate Rent’s exponent using a top–down partitioning which may already result in over 1 million gates after collapsing with hMetis [25] similar to the method described in [37]. The and decomposition. It is not reasonable to require the existing Rent’s exponents of the MCNC benchmarks in Table I(b) range FPGA synthesis tools to handle larger examples beyond such from 0.51 to 0.87 with an average of 0.74. When comparing sizes. our LEKO and LEKU circuits to the MCNC benchmarks, we In fact, after collapsing of G25, we tried different decom- can see that the range of the Rent’s exponent falls perfectly in position algorithms. LEKU-CD(G25) was constructed by first line with the MCNC benchmarks with an average exponent of collapsing LEKO(G25) into a two-level network, then decom- 0.7. From this view, there is little structural difference between posing the result into an equivalent two-bounded simple-gate the two sets of benchmarks. network (using SIS [32] commands collapse and tech_decomp); Another way to examine these circuits is by analyzing their LEKU-CB(G) was constructed by first collapsing the network, MFFC sizes. Using the SIS platform, we were able to calculate then balancing was done using the collapse and balance com- the average MFFC size [Table I(a)] of the MCNC circuits to mands of the ABC system [38]. From the circuit-size profile be six. Once again, the LEKO and most of the LEKU circuits shown in Table II, one can see that ABC’s internal canonical match the MCNC benchmarks very well in terms of the average AND–INV representation leads to the removal of a large number MFFC sizes. The LEKU-CD(G25) circuit is the exception; it of functionally equivalent gates. has an extremely high average MFFC size which makes it Since Xilinx’s mapper could not accept a circuit as large as quite different from the MCNC benchmarks but not entirely LEKU-CD(G25), we broke LEKU-CD(G25) up into a collec- unrealistic (a similar MFFC size shows up when examining tion of nonoverlapping circuits; one circuit for each primary an error checking circuit with only one output). But instead of output. The resulting collection of circuits is clearly equivalent trying to reflect a real circuit, it can be used as a tool to evaluate to LEKU-CD(G25) and denoted as LEKU-CD(G25). how well your algorithm finds duplication. C. LEKO and LEKU Properties III. RESULTS The LEKO circuit can exhibit a multitude of properties based on which core graph is used to construct it. In this paper, we The results of this paper will be presented in two parts. We constructed our core graph C5 to be as hard as possible for the first discuss the LEKO circuits created by createLEKO (G25, industrial tools (Quartus and ISE) to map, while C6 was made G125, and G625 were created by using C5, and G36, G216, with special properties that made it difficult for the academic and G1296 were created using C6), and we then discuss the tools (ABC and DAOMap) to map. Both of these core graphs LEKU examples which are functionally equivalent circuits of were constructed by hand to show that the heuristics used in all LEKU(G25)—LEKU-CD(G25), LEKU-CD(G25), and LEKU- of these tools can be easily subverted. CB(G25). The details of these circuits are shown in Table II. It is interesting to note that core graphs can be constructed Using these examples, we present the results of running state- from pre-existing benchmarks or complex logic blocks like of-the-art academic FPGA mappers DAOmap [7], ABC [38], 236 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007

TABLE I TABLE II (a) RENT’S EXPONENT AND MFFC ANALYSIS OF THE LEKO AND LEKU EXAMPLES USED FOR OPTIMALITY STUDY MCNC BENCHMARKS.(b)RENT’S EXPONENT AND MFFC ANALYSIS OF THE LEKO/LEKU EXAMPLES

TABLE III MAPPING RESULTS ON LEKO EXAMPLES

device EP1S80F1508I7 and the option for area optimization. Xilinx’s logic-synthesis tool was run from Xilinx ISE 7.1i [40] using Virtex device xcv3200e and also the option for area optimization. All the tools were run with the default settings unless otherwise stated. For this paper, we only performed the logic-synthesis steps of these tools and did not go through final placement and routing. The depths of the mapped LEKO circuits are not reported here for two reasons: Xilinx and Altera and the leading-edge FPGA synthesis systems from Altera [39] optimize for delay instead of depth, and the final logic element and Xilinx [40] on each of the circuits. DAOmap from the in the Xilinx device is not a 4-LUT but a slice that combines SIS [32] and RASP [16] environments was used with op- two 4-LUTs. tions allowing only the use of LUTs with four or less in- puts (DAOmap—k 4). Berkeley’s ABC mapper [38] was also A. Mapping Results on LEKO Examples used for mapping into 4-LUTs (using the “FPGA” command which targets 4-LUTs). Note that DAOmap produces a depth- This section will illustrate how well mappers perform in optimal mapping solution as FlowMap [13] but uses 29% less achieving the optimal mapping solution, if they do not have to LUTs on average as calculated from [7]. ABC mapper also carry out logic optimization. We tested this by running each one produces depth-optimal mapping solutions, but uses 7% less of the LEKO circuits on each one of the mappers. As the results LUTs than DAOmap on average, as reported in [1]. Altera’s in Table III show, each mapper and logic-synthesis tool does a logic-synthesis tool was run from Quartus 5.0 [39] using fairly good job mapping the benchmarks. The average gap from CONG AND MINKOVICH: OPTIMALITY STUDY OF LOGIC SYNTHESIS FOR LUT-BASED FPGAs 237

TABLE IV structure of C6 allowed for more logic optimizations. One of LOGIC-SYNTHESIS RESULTS ON LEKU EXAMPLES the main reasons that every one of these algorithms performed so poorly is because they were not able to reconstruct the original structure of the circuit. The fact that the same logic- synthesis flows perform so much worse on the LEKU examples than the equivalent LEKO examples suggests that the existing logic-optimization algorithms are not capable of reproducing the initial circuit structure of the LEKO examples. This suggests that there may be significant opportunity for improvement in the existing logic-synthesis algorithms. For example, we believe that in order for logic-synthesis algorithms to perform well on the LEKU examples, they must have a more global view of resynthesis—including duplication removal, logic identifica- tion, and many other heuristics that examine the circuit globally. Without such global heuristics, algorithms do not perform well on LEKU examples and may produce poor results on large real- world circuits as well.

C. Applications of Results optimal varies from 3% (by Quartus) to 22% (by DAOMap), Regarding the LEKO benchmarks, we feel that the ones we with an average of 15%. This shows that the current LUT- presented here not only have the known optimal solutions but based FPGA mappers or synthesis tools perform quite well also have many structural similarities to the widely used MCNC on circuits, where logic optimization is not needed to obtain examples, based on the MFFC and Rent’s exponent analysis. the optimal solutions. Note that Quartus and ISE perform both The problem with the MCNC benchmarks is that almost every logic optimization and technology mapping, while DAOMap logic-synthesis tool is specifically tuned to perform well on performs technology mapping only and ABC performs some these benchmarks. By presenting a set of alternative-design logic optimization during mapping which allows the removal of examples, we hope to better understand where and how these a large amount of logic. It is interesting to see that by using dif- algorithms differ. Another advantage of the LEKO construction ferent core graphs, we can show that DAOmap can outperform is that it enables a designer to combine multiple “hard to map ABC. At first, it would seem that the smaller core graph C5 circuit cores” into one design with a known optimal. Knowing produced results further away from the optimal, but this is only the optimal solution, the designer can see exactly where the the case for the tools that performed logic optimization. C6 has algorithm made the incorrect choice and why. a difficult structure for mapping but some of its functionality The LEKU designs, on the other hand, can be used to test can be simplified, since it is very hard to create a C6 that is how different logic-synthesis algorithms handle the design with difficult for the whole logic-synthesis process. inefficiency or redundancy. For example, one can take a LEKO design and start “introducing inefficiency or redundancy,” which may include duplicate logic or signals that can eventually B. Synthesis Results on LEKU Examples be factored away. Using these “flawed” designs, we can test This section will illustrate how poorly most of the best various logic-synthesis tools to see the degree to which they available FPGA logic-synthesis flows perform when logic re- handle such design flaws. This type of test is very important, structuring and/or optimizing is needed to achieve the opti- since popularity of hardware languages enables people with lit- mal mapping solution. The academic mappers presented in tle or no hardware background to design hardware. Sometimes this section are allowed to use standard preprocessing tools people end up writing or very-high-speed integrated- (script.algebraic for DAOmap and resyn2 for ABC mapper) for circuit hardware-description language (VHDL) in the same way technology-independent logic optimization, since the LEKU they write C, which results in designs with a high degree of examples require logic restructuring/optimization to achieve the inefficiency or redundancy. The LEKU circuits are meant to test optimal mapping solutions. From the results in Table IV, when how the existing logic-synthesis algorithms perform on such examining the largest test case, we see that all four synthesis designs and how much room is left for improvement when flows perform poorly and produce synthesis results with area handling each type of inefficiency and/or redundancy in the ranging from 72X to 504X larger than the known upper bounds design. (the mapping results of the equivalent LEKO examples), aver- A good mapper should not only employ a single heuristic aging 256X larger. We believe Quartus produced a better solu- but should be able to combine multiple heuristics so that it can tion on LEKU-CD than LEKU-CD, because it could perform perform well on different types of benchmarks. We presented more optimizations on each one of the circuits of LEKU-CD a platform to create new benchmarks that can test every part due to their smaller size. It is also interesting to note that in of a logic-synthesis tool. For example, to check that all the the correlation between the performance of commercial tools heuristics of a logic-synthesis tool are working in harmony, one on LEKU-CB(G36) and LEKU-CB(G25), the simpler logical only has to create a set of Cns representing difficult circuits and 238 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007 use them to create a LEKO circuit. Also, to test the capabilities as did the PEKO examples to the physical-design community. and strengths of various logic optimization phases of the tools, Needless to say, the potential of large-scale-area reduction is all one has to do is introduce certain types of inefficiency of great interest to the IC and EDA industries. If realized, it or redundancy (called targeted perturbation) in the underlying leads to significant improvement in density and cost of future LEKO circuit (thus creating a LEKU circuit) and check to see integrated circuits. It is not clear if and how often these artificial if the tools can correct them. examples constructed by our algorithm appear in real-life cir- The advantage of the LEKU circuit is that it establishes a cuits. However, these examples will help to identify deficiencies clear bound on how much minimization one can do. Since there in the current logic-synthesis algorithms and improve their is no known lower bound of the MCNC benchmarks, there is no quality. It is our hope that these benchmarks are not only clear way to tell if an improvement on a previous algorithm is used to determine that logic-synthesis tools catch commonly significant. For example, an improvement of some algorithm made mistakes (redundant logic and unused logic) that tool X on another algorithm Y by 10% might be insignificant if designers expect, but also create an online community that leads both algorithms are over 3x away from optimal. On the other to the sharing of LEKO circuits that will test every possible hand, when an algorithm does really well on certain LEKU design flaw. circuits, it means that the algorithm was able to find and Although our optimality study is done for LUT-based correct all the targeted perturbations used to create it. Such a FPGAs, we think that the same technique can be easily ex- controlled experiment is very useful when one has a collection tended to cell-based IC designs, where one needs to map to a of LEKU circuits to use for testing how the synthesis tools may library of different logic cells. In this case, we need to modify perform on common inefficiencies or redundancies that exist the construction of Cn so that it remains a “hard core” basis of in some HDL designs. For example, one can inject targeted constructing larger hard examples. perturbations such as do-not-care signals and duplicated logic The LEKO and LEKU examples are available online [41]. and see how well an algorithm can detect and correct them. Since these circuits cannot be minimized past a certain point, one knows that the reason the logic synthesis under test did ACKNOWLEDGMENT well is precisely because it found the targeted perturbations that The authors would like thank Dr. D. Chen for provid- one injects and not because of anything else. Without a tight ing the DAOmap implementation, Prof. R. Brayton and bound on the area, one cannot simply tell why or how the tool Dr. A. Mishchenko for providing the ABC mapper, and J. Lin produced a better solution. for his helpful feedback on this paper.

IV. CONCLUSION REFERENCES In this paper, we presented an algorithm for creating syn- [1] R. Brayton, S. Chatterjee, M. Ciesielski, and A. Mishchenko, “An inte- grated technology mapping environment,” in Proc. Int. Workshop Logic thetic benchmarks with known optimal technology mapping and Synthesis, 2005, pp. 383–390. solutions for LUT-based FPGA designs. Using these, LEKO [2] S. Brown, A. Ling, and D. Singh, “FPGA technology mapping: A study and LEKU benchmarks of sizes ranging from a few hundred of optimality,” in Proc. Des. Autom. Conf., 2005, pp. 427–432. [3] S. Brown, V. Manohararajah, and Z. Vranesic, “Heuristics for area mini- nodes to over one million nodes, we experimented on four mization in LUT based FPGA technology mapping,” in Proc. Int. Work- state-of-the-art FPGA logic-synthesis flows. We showed that shop Logic and Synthesis, 2004, pp. 14–21. although leading FPGA technology mapping algorithms can [4] T. Chan, J. Cong, and K. Sze, “Multilevel generalized force-directed method for circuit placement,” in Proc. Int. Symp. Phys. Des.,San produce close to optimal solutions with an average gap of Francisco, CA, Apr. 2005, pp. 185–192. 15% on the LEKO examples, the results from the entire logic- [5] T. Chan, J. Cong, T. Kong, J. R. Shinnerl, and K. Sze, “An enhanced synthesis flows (logic optimization + mapping) are far from multilevel algorithm for circuit placement,” in Proc. Int. Conf. Comput.- Aided Des., 2003, pp. 299–306. optimal. The best industrial and academic FPGA synthesis [6] C. Chang, J. Cong, and M. Xie, “Optimality and scalability study of flows are around 70 times larger in terms of area on average and, existing placement algorithms,” in Proc. ASP-DAC, 2003, pp. 621–627. in some cases, as much as 500 times larger on LEKU examples. [7] D. Chen and J. Cong, “DAOmap: A depth-optimal area optimization map- ping algorithm for FPGA designs,” in Proc. IEEE/ACM ICCAD, 2004, It is important to understand that just because an algorithm pp. 752–759. performs poorly on a set of artificial benchmarks, it does not [8] D. Chen, J. Cong, and P. Pan, “A decade of advances in FPGA design mean the algorithm will perform badly on real world circuits. automation,” Foundations and Trends Electron. Des. Autom., vol. 1, no. 2, 2006, to be published. Since logic synthesis is NP-hard and all exiting synthesis [9] K. Chen et al., “DAG-map: Graph-based FPGA technology mapping for algorithms are heuristics in nature, one would naturally expect delay optimization,” IEEE Des. Test Comput., vol. 9, no. 3, pp. 7–20, there are examples where the heuristics perform poorly. Nev- Sep. 1992. [10] J. Cong and Y. Ding, “Beyond the combinatorial limit in depth minimiza- ertheless, it is important to have a quantitative measurement of tion for LUT-based FPGA designs,” in Proc. Int. Conf. Comput.-Aided the optimality gap. We would also like to emphasize that the Des., Nov. 1993, pp. 110–114. performance of different heuristics on LEKO and LEKU may [11] ——, “Combinational logic synthesis for LUT based field programmable gate arrays,” ACM Trans. Des. Automat. Electron. Syst., vol. 1, no. 2, not reflect their performance on other examples. We refer the pp. 145–204, Apr. 1996. reader to [21, Ch. 6] on a discussion about the worst case versus [12] J. Cong and Y. Hwang, “Simultaneous depth and area minimization in average case performance of heuristics. LUT-based FPGA mapping,” in Proc. Int. Symp. Field-Programmable Gate Arrays, Feb. 1995, pp. 68–74. We hope that the rather surprising results on LEKO and [13] J. Cong and Y. Ding, “FlowMap: An optimal technology mapping LEKU examples will stimulate the logic-synthesis community algorithm for delay optimization in lookup-table based FPGA designs,” CONG AND MINKOVICH: OPTIMALITY STUDY OF LOGIC SYNTHESIS FOR LUT-BASED FPGAs 239

IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 13, no. 1, Jason Cong received the B.S. degree from Peking pp. 1–12, Jan. 1994. University, Beijing, China, in 1985 and the M.S. [14] J. Cong, C. Wu, and E. Ding, “Cut ranking and pruning: Enabling a and Ph.D. degrees from the University of Illinois, general and efficient FPGA mapping solution,” in Proc. Int. Symp. Field- Urbana–Champaign (UIUC), in 1987 and 1990, re- Programmable Gate Arrays, Feb. 1999, pp. 29–35. spectively, all in computer science. [15] J. Cong, G. Nataneli, M. Romesis, and J. Shinnerl, “An area-optimality He is currently a Professor and Chairman of study of floorplanning,” in Proc. Int. Symp. Phys. Des., 2004, pp. 78–83. the Computer Science Department of University of [16] J. Cong, J. Peck, and Y. Ding, “RASP: A general logic synthesis system California, Los Angeles (UCLA). He is also a Codi- for SRAM-based FPGAs,” in Proc. Int. Symp. Field-Programmable Gate rector of the VLSI CAD Laboratory. He was the Arrays, 1996, pp. 137–143. Founder and President of Aplus Design Technolo- [17] J. Cong, M. Romesis, and M. Xie, “Optimality and stability study of gies, Inc., until it was acquired by Magma Design timing-driven placement algorithms,” in Proc. Int. Conf. Comput.-Aided Automation, in 2003. He currently serves as the Chief Technology Advisor Des., 2003, pp. 472–478. of Magma and AutoESL Design Technologies, Inc. Additionally, he has been [18] ——, “Optimality, scalability and stability study of partitioning and place- a Guest Professor at Peking University, since 2000. His research interests ment algorithms,” in Proc. Int. Symp. Phys. Des., 2003, pp. 88–94. include computer-aided design of very large-scale integration (VSLI) circuits [19] E. Dubrova and M. Teslenko, “Hermes: LUT FPGA technology mapping and systems, design and synthesis of system-on-a-chip, programmable systems, algorithm for area minimization with optimum depth,” in Proc. Int. Conf. novel computer architectures, nanosystems, and highly scalable algorithms. He Comput.-Aided Des., 2004, pp. 748–751. has published over 230 research papers. He has graduated 17 Ph.D. students. [20] R. Francis, J. Rose, and Z. Vranesic, “Chortle-crf: Fast technology map- Dr. Cong is the recipient of a number of awards and recognition, in- ping for lookup table-based FPGAs,” in Proc. Des. Autom. Conf., 1991, cluding the Best Graduate Award from Peking University, in 1985, and the pp. 227–233. Ross J. Martin Award for Excellence in Research from the University of Illinois, [21] M. Garey and D. Johnson, Computers and Intractability: A Guide to the in 1989, the NSF Young Investigator Award, in 1993, the Northrop Outstanding Theory of NP-Completeness. San Francisco, CA: Freeman, 1979. Junior Faculty Research Award from UCLA, in 1993, the ACM/SIGDA Mer- [22] A. Kahng and Q. Wang, “Implementation and extensibility of an analytic itorious Service Award, in 1998, and the SRC Technical Excellence Award, placer,” in Proc. Int. Symp. Phys. Des., 2004, pp. 18–25. in 2000. He is also the recipient of the three best paper awards—the 1995 [23] A. Kahng and S. Reda, “Evaluation of placer suboptimality via zero- IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN Best Paper Award, change netlist transformations,” in Proc. Int. Symp. Phys. Des., 2005, the 2005 International Symposium on Physical Design Best Paper Award, and pp. 208–215. the 2005 ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC [24] K. Karplus, “Xmap: A technology mapper for table-lookup field- SYSTEMS Best Paper Award. He served on the technical program committees programmable gate arrays,” in Proc. Des. Autom. Conf., 1991, and executive committees of many conferences, such as Asia and South Pa- pp. 240–243. cific Design Automation Conference (ASPDAC), Design Automation Confer- [25] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel hyper- ence (DAC), ACM/SIGDA International Symposium on Field-Programmable graph partitioning: Applications in VLSI domain,” in Proc. Des. Autom. Gate Arrays (FPGA), International Conference on Computer-Aided Design Conf., 1997, pp. 526–529. (ICCAD), International Symposium on Circuits And Systems (ISCAS), Inter- [26] B. Landman and R. Russo, “On a pin versus block relationship for parti- national Symposium on Physical Design (ISPD), and International Symposium tions of logic graphs,” IEEE Trans. Comput., vol. C-20, no. 12, pp. 1469– on Low Power Electronics and Design (ISLEPD), and several editorial boards, 1479, 1971. including the IEEE TRANSACTIONS ON VERY LARGE-SCALE INTEGRATION [27] C. Legl, B. Wurth, and K. Eckl, “A Boolean approach to performance- SYSTEMS and the ACM TRANSACTIONS ON DESIGN AUTOMATION OF directed technology mapping for LUT-based FPGA designs,” in Proc. ELECTRONIC SYSTEMS. He served on the ACM SIGDA Advisory Board, Des. Autom. Conf., Jun. 1996, pp. 730–733. the Board of Governors of the IEEE Circuits and Systems Society, and the [28] I. Markov and J. A. Roy, “On sub-optimality and scalability of logic Technical Advisory Board of a number of EDA and silicon IP companies, synthesis tools,” in Proc. Int. Workshop Logic and Synthesis, May 2003. including Atrenta, eASIC, Get2Chip, Magma Design Automation, and Ultima [29] R. Murgai et al., “Improved logic synthesis algorithms for table look Interconnect Technologies. up architectures,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 1991, pp. 564–567. [30] J. Roy, D. Papa, S. Adya, H. Chan, A. Ng, J. Lu, and I. Markov, “Capo: Robust and scalable open-source min-cut floorplacer,” in Proc. Int. Symp. Phys. Des., 2005, pp. 224–226. [31] P. Sawkar and D. Thomas, “Technology mapping for table-look-up based field programmable gate arrays,” in Proc. ACM/SIGDA Workshop Field Programmable Gate Arrays, Feb. 1992, pp. 83–88. [32] E. Sentovitch, K. Singh et al., “SIS: A System for sequential circuit synthesis,” Univ. California, Berkeley, Tech. Rep., UCB/ERL M92/41, May 1992. [33] K. Sze, “Multilevel optimization for VLSI circuit placement,” Ph.D. dissertation, Univ. California, Los Angeles, 2006. [34] N. Viswanathan and C. Chu, “FastPlace: Efficient analytical place- ment using cell shifting, iterative local refinement and a hybrid net model,” in Proc. Int. Symp. Phys. Des., 2004, pp. 26–33. [35] N. Woo, “A heuristic method for FPGA technology mapping based on the edge visibility,” in Proc. Des. Autom. Conf., 1991, pp. 248–251. [36] H. Yang and D. Wong, “Edge-map: Optimal performance driven technol- ogy mapping for iterative LUT based FPGA designs,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 1994, pp. 150–155. [37] X. Yang, E. Bozorgzadeh, and M. Sarrafzadeh, “Wirelength estimation Kirill Minkovich based on rent exponents of partitioning and placement,” in Proc. Int. received the B.S. degree from the Workshop Syst.-Level Interconnect Prediction, 2001, pp. 25–32. University of California, Berkeley, in 2003, and the [38] Berkeley Logic Synthesis and Verification Group, ABC: A System for M.S. degree from the University of California, Los Sequential Synthesis and Verification. [Online]. Available: http://www. Angeles, in 2006, where he is currently working eecs.berkeley.edu/~alanmi/abc/ toward a Ph.D. degree, all in computer science. [39] Altera Inc., Quartus 5.0. [Online]. Available: http://www.altera.com/ His current research interests include logic synthe- [40] Xilinx Corporation, ISE Logic Design Tools 7.1i. [Online]. Available: sis for LUT-based FPGAs, evaluation of synthesis al- http://www.xilinx.com/ gorithms, and synthesis for nanoscale architectures. [41] UCLA Optimality Study Project. [Online]. Available: http://cadlab.cs. ucla.edu/~pubbench/