Bit-Level Transformation and Optimization for Hardware

Bit-Level Transformation and Optimization for Hardware Synthesis of Algorithmic Descriptions Jiyu Zhang*† , Zhiru Zhang+, Sheng Zhou+, Mingxing Tan*, Xianhua Liu*, Xu Cheng*, Jason Cong† *MicroProcessor Research and Development Center, Peking University, Beijing, PRC † Computer Science Department, University Of California, Los Angeles, CA 90095, USA +AutoESL Design Technologies, Los Angeles, CA 90064, USA {zhangjiyu, tanmingxing, liuxianhua, chengxu}@mprc.pku.edu.cn {zhiruz, zhousheng}@autoesl.com, [email protected] ABSTRACT and more popularity [3-6]. However, high-quality As the complexity of integrated circuit systems implementations are difficult to achieve automatically, increases, automated hardware design from higher- especially when the description of the functionality is level abstraction is becoming more and more important. written in a high-level software programming language. However, for many high-level programming languages, For bitwise computation-intensive applications, one of such as C/C++, the description of bitwise access and the main difficulties is the lack of bit-accurate computation is not as direct as hardware description descriptions in high-level software programming languages, and hardware synthesis of algorithmic languages. The wide use of bitwise operations in descriptions may generate sub-optimal implement- certain application domains calls for specific bit-level tations for bitwise computation-intensive applications. transformation and optimization to assist hardware In this paper we introduce a bit-level transformation synthesis of algorithmic descriptions. and optimization approach to assisting hardware Figure 1 shows a motivational example with a bit synthesis of algorithmic descriptions. We introduce a reversing function. The C description of this algorithm bit-flow graph to capture bit-value information. is shown in Figure 1(a), where the bit_reverse function Analysis and optimizing transformations can be takes a 32-bit integer as input and yields an output in performed on this representation, and the optimized the reverse bit order. The data-flow graph for the results are transformed back to the standard data-flow unrolled function is shown in Figure 1(b), while the graphs extended with a few instructions representing optimal implementation is shown in Figure 1(c). We bitwise access. This allows high-level synthesis tools to can clearly see that the direct implementation based on automatically generate circuits with higher quality. the data-flow graph would use many more logical Experiments show that our algorithm can reduce slice components and also have a longer latency compared usage by 29.8% on average for a set of real-life to the optimal one, which only uses 32 wires to link the bits directly in the reverse order. benchmarks on Xilinx FPGAs. In the meantime, the clock period is reduced by 13.6% on average, with an int bit_reverse(int input) 11.4% latency reduction. { int i, output = 0; 1. INTRODUCTION for (i = 0; i < 32; i++ ) Bitwise operations are used extensively in many output |= (((input>>i)&1) << (31- application domains, such as cryptography and i)); telecommunications, etc. However, for applications return output; } written in high-level programming languages and (a) C code for the bit_reverse function. executed on general-purpose processors, accessing and computing bit-values are relatively expensive, and bit- input 0 input 31 level parallelism is not well exploited. This is mainly >> due to the lack of support in target machines, as well as 1 >> 1 . high-level programming languages, such as C/C++. AND AND Most general-purpose processor architectures and high- 31 0 level programming languages do not support bitwise << << memory access and require a series of 0 load/shift/mask/store instructions to implement simple OR . OR output bitwise operations, such as bit accessing and bit setting. Customized hardware accelerators provide a (b) Data-flow graph for unrolled bit_reverse function. promising approach to assisting general-purpose input processors in exploiting performance of bitwise 31 0 computation-intensive applications. Today, we can put 100 . 0 1. 0 1 1 more than one billion transistors in a single chip [1], and modern FPGAs allow users to exploit parallelism 31 0 in applications by hundreds of thousands of logic cells 11001. 0 . 0 and prefabricated IPs [2]. As RTL coding time is 1 increasingly recognized as a significant component of output the overall effort to solution, automated design (c) Optimal implementation for the bit_reverse function. processes and tools which compile higher-level Figure 1: The bit reversing function. abstraction into optimized hardware are gaining more We can see from the example that efficient bit-level bits instructions are such extensions existing on some transformation and optimization for operations in general-purpose processors. Hilewitz et al. [9] algorithmic descriptions will lead to a much more conjectured that the most powerful primitive bit-level direct and compact description. This will help the high- operation might be the bit matrix multiply (BMM) level synthesis to generate better RTL designs for instruction, which currently is found only in bitwise computation-intensive designs, and thus supercomputers like Cray[10]. They also proposed new achieve better final implementations. Otherwise, in the instructions that implement simpler BMM primitive absence of such optimization, the synthesis process can operations. However, the current code-generation be misled by inaccurate area and timing estimation and techniques for these instructions mainly seek for thus generate suboptimal microarchitecture. It is often special patterns, and efficiently taking use of these too late for the downstream RTL or logic synthesis and instructions still much relies on hand-coded assemble optimization techniques to make up for the QoR loss codes. caused by the mistakes during compiler In the logic synthesis field, much research has been transformations. conducted to simplify logical expressions [11]. In this paper we propose a novel compiler However, when using high-level programming optimization approach to automatically generate languages, the bit-value accessing, computing and bitwise operations for hardware synthesis from the storing are indirectly represented and often require a algorithmic descriptions in high-level programming series of load/shift/ mask/store instructions. If the languages. Specifically, we extend the data-flow graph bitwise computation is not well analyzed and optimized with two operations (instructions) representing bitwise during the high-level synthesis step, the resulting RTLs access to greatly facilitate the hardware synthesis to can be suboptimal. This would impose difficulty for synthesize algorithmic description into efficient the downstream RTL/logic synthesis and optimization hardware. We propose an intermediate representation to make up the QoR degradation. called the bit-flow graph (BFG) to analyze and Some hardware modeling languages extend high- optimize bitwise operations, and the optimized BFG is level software programming languages, and most of transformed to the extended data-flow graph for them support bit-accurate description. For example, hardware synthesis. To our knowledge, this is the first SystemC [12], which is a popular modeling language work to systematically analyze and optimize bitwise based on C++, introduces bit-accurate data types to operations to assist hardware synthesis of algorithmic support description for bit-level access and description. Experiments show that our approach can computation. Some related works also extend a base achieve a 29.8% area reduction, 13.6% clock period sequential language with direct bit- manipulation for reduction and 11.4% latency reduction on average for a both software and hardware. For example, [13] set of real-world applications. introduces a new object-oriented language called Lime, The remainder of the paper is organized as follows: which can be compiled for JVM or into a synthesizable In Section 2 we review the related work. Section 3 hardware description language. It provides explicit bit- presents the problem statement. In Section 4 we numeration to describe bitwise operations. describe our bit-level transformation and optimization Nevertheless, most software algorithms and a large approach. Section 5 presents experimental results. amount of legacy code are still written in high-level Section 6 concludes the discussion of current work and software programming language. proposes future directions. In contrast to the previous work, our approach aims 2. RELATED WORK at providing bit-level transformation and optimization to assist hardware synthesis of algorithmic descriptions. In this section we discuss previous work on Since hardware directly supports bit-value accessing optimization for bitwise computation-intensive and storing, while a large amount of software legacies applications. still use load/ shift/mask/store instructions to represent Modern optimizing compilers can perform a series bitwise operation, there is a gap between function of transformation passes (typically in the form of description in high-level programming languages and peephole optimizations) to simplify logical operations hardware synthesis. To deal with this problem, we [7, 8]. For example, algebraic simplifications and propose a new intermediate representation for bitwise reassociation can be applied to Boolean and structure operations. It will facilitate bit-value analysis

Bit-Level Transformation and Optimization for Hardware

Bit Shifts Bit Operations, Logical Shifts, Arithmetic Shifts, Rotate Shifts

Technical Report TR-2020-10

ADSP-21065L SHARC User's Manual; Chapter 2, Computation

Most Computer Instructions Can Be Classified Into Three Categories

X86 Assembly Language Syllabus for Subject: Assembly (Machine) Language

Bitwise Operators in C Examples

The Central Processor Unit

Using LLVM for Optimized Lightweight Binary Re-Writing at Runtime

Relational Representation of the LLVM Intermediate Language

Fractal Surfaces from Simple Arithmetic Operations

Problem Solving with Programming Course Material

X86 Intrinsics Cheat Sheet Jan Finis [email protected]