
ABSTRACT OF THESIS

A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION

Two-dimensional discrete convolution is an essential operation in digital image processing. The ability to simultaneously convolve an (i×j)-pixel input image plane with more than one Filter Coefficient Plane (FC) in a scalable manner is a targeted performance goal. Assuming k FCs, each of size (n×n), an additional goal is that the system have the ability to output k convolved pixels each clock cycle. To achieve these performance goals, an architecture that utilizes a new systolic array arrangement is developed, and the final architecture design is captured using the VHDL hardware description language. The architecture is shown to be scalable when convolving multiple FCs with the same input image plane. The functionality and performance of the architecture are validated through post-synthesis and post-implementation (functional and timing) VHDL simulation testing. In addition, the design was implemented on a Field Programmable Gate Array (FPGA) based experimental hardware prototype for further functional and performance testing and evaluation.

KEYWORDS: Systolic Array Processor, Discrete Convolution, Hardware Prototyping, Scalable Architecture, Parallel Architecture.


A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION

By

Albert Tung-Hoe Wong

______Director of Thesis

______Director of Graduate Studies


RULES FOR THE USE OF THESIS

Unpublished theses submitted for the Master’s degree and deposited in the University of Kentucky Library are as a rule open for inspection, but are to be used only with due regard to the rights of the authors. Bibliographical references may be noted, but quotations or summaries of parts may be published only with permission of the author, and with the usual scholarly acknowledgements.

Extensive copying or publication of the thesis in whole or in part also requires the consent of the Dean of the Graduate School of the University of Kentucky.

A library that borrows this thesis for use by its patrons is expected to secure the signature of each user.


THESIS

Albert Tung-Hoe Wong

The Graduate School

University of Kentucky

2003

A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR DISCRETE CONVOLUTION

THESIS

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering in the College of Engineering at the University of Kentucky

By

Albert Tung-Hoe Wong

Lexington, Kentucky

Director: Dr. J. Robert Heath, Associate Professor of Electrical and Computer Engineering

Lexington, Kentucky 2003

MASTER’S THESIS RELEASE

I authorize the University of Kentucky Libraries to reproduce this thesis in whole or in part for purposes of research.

Signed: ______

Date: ______

ACKNOWLEDGEMENTS

The following thesis, while an individual work, benefited from the insights and direction of several people. First, my Thesis Chair, Dr. J. Robert Heath, exemplifies the high-quality scholarship to which I aspire. In addition, Dr. Heath provided timely and instructive comments and evaluation at every stage of the thesis, allowing me to complete this project. Next, I wish to thank the complete Thesis Committee: Dr. J. Robert Heath, Dr. Hank Dietz, and Dr. William R. Dieter. Each individual provided insights that guided and challenged my thinking, substantially improving the finished product. I would also like to thank Dr. Michael Lhamon from Lexmark Inc. for his technical insights, guidance, and comments.

In addition to the technical and instrumental assistance above, I received equally important assistance from family and friends. My wife, Sze Ying Ng, provided ongoing support throughout the thesis process, allowing me to finish this work.

CONTENTS

Acknowledgements ...... iii
List of Tables ...... vi
List of Figures ...... vii
Chapter 1 Introduction ...... 1
Chapter 2 Background and Convolution Architecture Requirements ...... 2
Chapter 3 Version 1 Convolution Architecture ...... 7
  3.1. Arithmetic Unit (AU) ...... 7
  3.2. Coefficient Shifters (CSs) ...... 10
  3.3. Input Data Shifters (IDSs) ...... 11
    3.3.1. Register Bank (RB) ...... 12
    3.3.2. Pattern Generator Pointers (PGPs) ...... 12
    3.3.3. Delay Units (DU) ...... 15
  3.4. Systolic Flow of Version 1 Convolution Architecture ...... 15
  3.5. Data Memory Interface (DM I/F) ...... 17
  3.6. Output Correction Unit ...... 19
  3.7. Controller ...... 19
Chapter 4 Revised Architectural Requirements and Resulting Version 2 Convolution Architecture ...... 21
  4.1. Version 2 Convolution Architecture for (k = 1) ...... 21
  4.2. Arithmetic Unit (AU) ...... 22
    4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU) ...... 26
    4.2.2. Delay Units (DU) ...... 31
  4.3. Input Data Shifters (IDS) ...... 31
  4.4. Data Memory Interface (DM I/F) ...... 32
  4.5. Memory Pointers Unit (MPU) ...... 32
  4.6. Systolic Flow of Version 2 Convolution Architecture ...... 34
  4.7. Controller ...... 35
  4.8. Multiple Filter Coefficient Sets when (k > 1) ...... 43
Chapter 5 VHDL Description of Version 2 Convolution Architecture ...... 45

Chapter 6 Version 2 Convolution Architecture Validation via Virtual Prototyping (Post-Synthesis and Post-Implementation Simulation Experimentation) ...... 47
  6.1. Post-Synthesis Simulation ...... 48
    6.1.1. Adders ...... 48
    6.1.2. Multiplication Unit ...... 51
    6.1.3. Version 2 Convolution Architecture (with k = 1) ...... 52
  6.2. Post-Implementation Simulation ...... 61
    6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture (with k = 1) ...... 61
    6.2.2. Version 2 Convolution Architecture (with k = 1) ...... 62
    6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k = 3) ...... 65
    6.2.4. Validation of Version 2 Convolution Architecture (with k = 3) ...... 66
Chapter 7 Hardware Prototype Development and Testing ...... 72
  7.1. Board Utilization Modules and Prototype Setup ...... 73
  7.2. Hardware Prototyping Flow ...... 76
  7.3. Test Cases ...... 80
Chapter 8 Conclusion ...... 84
Appendix A VHDL Code for Version 2 Discrete Convolution Architecture ...... 86
Appendix B VHDL Codes, C++ Source Codes and Script File for Post-Synthesis Simulation ...... 133
Appendix C C++ Source Codes for Programs Used During Post-Implementation Simulation ...... 140
Appendix D C++ Source Codes for Programs Used During Hardware Prototype Implementation ...... 143
Appendix E VHDL Files for Modules External to the Convolution Architecture ...... 149
References ...... 157
Vita ...... 159

LIST OF TABLES

Table 3.1. Filter coefficient array ...... 11
Table 3.2. 5×5 Filter size (with one output pointer) ...... 13
Table 3.3. 5×5 Filter size (Convolution with two output pointers) ...... 14
Table 4.1. Gate count comparison between CSA and CLA ...... 25
Table 4.2. A summary of the multiplication ...... 26
Table 4.3. Partial Product Selection Table ...... 28
Table 4.4. Comparison between method I and method II ...... 30
Table 6.1. Details of FPGA on the XESS protoboard ...... 62
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1) ...... 62
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3) ...... 66

LIST OF FIGURES

Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and Output Image Plane (OI) ...... 3
Figure 2.2. Example showing how two consecutive output pixels are generated. This example is shown with a 3×3 size FC ...... 4
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels need to be stored. In this example, 2 previous rows (shaded rows in addition to IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are needed for a 3×3 filter size ...... 4
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed to be 8 in this example) ...... 8
Figure 3.2. A MAU and included functional units ...... 8
Figure 3.3. Systolic array structure of the MAUs, where IDSs are outputs from Input Data Shifters and CSs are the outputs from Coefficient Shifters ...... 9
Figure 3.4. Functional units within CSs ...... 10
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters ...... 11
Figure 3.6. Functional units within IDSs ...... 11
Figure 3.7. Generalized RB for n×n filter size (d denotes number of bits for the input pixels) ...... 12
Figure 3.8. Additional hardware and modification for convolution of x output pixels in parallel for (x ≤ n) (functional units shaded in gray are additional hardware required for processing two convolutions in parallel) ...... 14
Figure 3.9. Organization of flip-flops within the Delay Unit (DU). R within the figure denotes one flip-flop ...... 15
Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel ...... 16
Figure 3.11. Basic functional units within Data Memory I/F ...... 17
Figure 3.12. A more detailed look at DM I/F ...... 18
Figure 3.13. Time line for activities, where W denotes Write and R denotes Read (from external memory device) of the registers indicated in boxes directly below ...... 19
Figure 3.14. Output pattern for two convolutions in parallel ...... 19

Figure 4.1. A top level view of Version 2 of the convolution architecture for one distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication and Add Array and AT denotes Adder Tree) ...... 22
Figure 4.2. Functional units within the Multiplication and Add Array (MAA) ...... 23
Figure 4.3. A MAU and its functional units ...... 23
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC ...... 24
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the MAUs ...... 25
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row of the partial products denotes sign extension of that particular row of partial product) ...... 27
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products ...... 28
Figure 4.8. Multiplier based on Modified Booth’s and Wallace Tree Structure ...... 29
Figure 4.9. Illustration of multiplication technique based on Modified Booth’s Algorithm ...... 29
Figure 4.10. Partial Product’s sign extension reduced for hardware saving ...... 30
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline stages within each MAU (PL denotes a pipeline stage composed of flip-flop registers) ...... 31
Figure 4.12. Structural view of the IDS with n = 5 and d = 8 ...... 32
Figure 4.13. External memory devices organization for n = 5 and d = 8 ...... 33
Figure 4.14. Functional units within the Memory Pointers Unit (MPU) ...... 34
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td denotes the time delay between each MAU) ...... 35
Figure 4.16. Top level view of the Controller Unit (CU) ...... 36
Figure 4.17. Functional units that receive control signals from the CU ...... 37
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller Unit (CU) ...... 39
Figure 4.19. Modified Version 2 system flow chart ...... 41

Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be any number) ...... 43
Figure 5.1. Version 2 Convolution Architecture organization ...... 46
Figure 6.1. Testing model for lower level functional components ...... 49
Figure 6.2. Post-Synthesis simulation for 14-bit CLA ...... 49
Figure 6.3. A close up view of one segment of Figure 6.2 ...... 50
Figure 6.4. Post-Synthesis simulation for 15-bit CLA ...... 50
Figure 6.5. Post-Synthesis simulation for 16-bit CLA ...... 50
Figure 6.6. Post-Synthesis simulation for 17-bit CLA ...... 50
Figure 6.7. Post-Synthesis simulation for 19-bit CLA ...... 51
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication Unit (MU) ...... 51
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above ...... 52
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven columns of both IP and OI are shown due to report page width limit) ...... 53
Figure 6.11. The source code for the C++ program that generates test vectors to program the filter coefficients into MAUs ...... 54
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit ...... 55
Figure 6.13. First phase of operation; programming of FCs into MAUs ...... 56
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown in the figure is the beginning of the second row of the input pixels) ...... 56
Figure 6.15. Second phase of operation; output pixels generated ...... 57
Figure 6.16. Second phase of operation; output pixels of the second row of OI (superimposed) ...... 58
Figure 6.17. Third phase of operation; output pixels of the last row of OI (superimposed) ...... 58
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns) ...... 59
Figure 6.19. First phase of operation for test case 2 ...... 60
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the first six of row one of OI (superimposed) ...... 60
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the first six of the last row for OI (superimposed) ...... 61
Figure 6.22. Second phase of operation for test case 1 (post-implementation simulation); output pixels of the second row of OI (superimposed) ...... 63
Figure 6.23. Third phase of operation for test case 1 (post-implementation simulation); output pixels of the last row of OI (superimposed) ...... 64
Figure 6.24. Second phase of operation for test case 2 (post-implementation simulation); output pixels shown are the first six of row one of OI (superimposed) ...... 64
Figure 6.25. Third phase of operation for test case 2 (post-implementation simulation); output pixels shown are the first six of the last row for OI (superimposed) ...... 65
Figure 6.26. Test case 1: FC planes, IP plane and the predicted OI planes ...... 67
Figure 6.27. Superimposed output image pixels (starting from the 3rd pixel) for the first row of the OIs for test case 1 ...... 68
Figure 6.28. Superimposed output image pixels (from the 3rd pixel onward) of the second row of the OIs for test case 1 ...... 68
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes ...... 69
Figure 6.30. Superimposed output image pixels (starting from the 3rd pixel) for the third row of the OIs for test case 2 ...... 70
Figure 6.31. Superimposed output image pixels (from the 3rd pixel onward) of the fourth row of the OIs for test case 2 ...... 70
Figure 6.32. A plot of equivalent system gates versus number of FC planes ...... 71
Figure 7.1. Convolution Architecture hardware implementation ...... 72
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture obtained from XESS Co. website, http://www.xess.com) ...... 73
Figure 7.3. Top level view of the prototyping hardware ...... 74
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the program) ...... 75
Figure 7.5. FPGA configuration and bit stream download program, gxsload from XESS Co. ...... 77
Figure 7.6. Execution of the FCs configuration program ...... 78
Figure 7.7. Upload of SRAM content using the gxsload utility; the high address indicates the upper bound of the SRAM address space whereas the low address indicates the lower bound of the SRAM address space ...... 79
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are two segments because the program wrote the right bank of the SRAM (16-bit) first and the left bank of the SRAM next (16 MSB bits) ...... 79
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1 ...... 81
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1 ...... 81
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1 ...... 81
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2 ...... 82
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2 ...... 82
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2 ...... 83


Chapter 1

Introduction

Performance and cost are both important parameters and criteria in today’s computing system components, whether the components are an entire computer or computer accessories and peripherals such as printers. The ever-increasing desire for higher performance from consumers has driven printer manufacturers to develop and incorporate performance enhancements into their products at the lowest possible price. Cost is, most of the time, directly proportional to performance, yet manufacturers constantly pursue higher performance for less cost. The ability to scan and print exceedingly clear images at a maximum page-per-minute rate and at the lowest cost is a performance target printer manufacturers aim for. In order to produce highly enhanced, clear images, the “discrete convolutional-filtering algorithm” must be implemented within the scanner or printer. General-purpose signal processors from various vendors are widely used to implement the convolutional-filtering algorithm. Often, not all of the functionality offered by general-purpose signal processors is needed or required by the manufacturers, so the unused functionality becomes a cost overhead. Also, commercially available general-purpose processors often cannot meet the desired performance/cost requirement of having the highest performance at the lowest cost. Thus, a special-purpose signal/image processor architecture is desired to implement the discrete convolutional-filtering algorithm. The subject of this thesis is the development of an efficient, high-performance, special-purpose signal/image processor architecture which may be used to implement the discrete convolutional-filtering algorithm at the lowest cost.


Chapter 2

Background and Convolution Architecture Requirements

Convolution is one of the essential operations in digital image processing required for image enhancement [15,16]. It is used in linear filtering operations such as smoothing, denoising, edge detection, and so on [15,16]. In general, image processing is carried out in a two-dimensional space/array [16]. A digital image can be represented by an array of numbers in a two-dimensional space. Each number (or pixel) has an associated row and column that indicate its coordinates (position) in the two-dimensional space, and the number’s value represents the gray level at that coordinate [15]. The gray levels are usually represented with a byte, or 8-bit unsigned binary number, ranging from 0 to 255 in decimal. Equation 1 shows the two-dimensional discrete convolution algorithm, where IP is the Input Image Plane, FC is the Filter Coefficient Plane, and OI is the Output Image Plane [16].

OI[x, y] = FC[x, y] * IP[x, y] = Σ_{I=0}^{n-1} Σ_{J=0}^{n-1} FC[I, J] · IP[x - I + (n-1)/2, y - J + (n-1)/2]    (1)

Figure 2.1 below shows the basic definitions for the Input Image Plane (IP), Filter Coefficient Plane (FC), and Output Image Plane (OI). Assuming that the IP has a size of i×j pixels and the FC has a size of n×n pixels, the OI will also have a size of i×j pixels. In most cases, n is much smaller than i and j.

Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and Output Image Plane (OI).

Digital convolution can be thought of as a moving window of operations [16]. As shown in Equation 1, one output pixel OI[x,y] can be obtained by rotating the FC 180 degrees around the center point (denoted c within the FC in Figure 2.1) and placing it over the IP with the center point on top of IP[x,y]. All the overlapping IP pixels are multiplied by the corresponding filter coefficients of the FC, and then all the products are summed to generate the single pixel OI[x,y]. The next output pixel can be obtained by sliding the FC plane one pixel to the right and then repeating the process described above. Figure 2.2 illustrates the idea of the moving window of operations. The FC is first centered at IP[3,4] to compute OI[3,4] and then moves to IP[3,5] for OI[3,5]. From Figure 2.2 below, one can deduce that when an output pixel is computed, access to entire previous rows, or portions of previously input rows, of input pixels is needed. Hence, previous input image pixels must be stored for this purpose. However, not all of the previous rows of input image pixels are necessarily needed. Instead, only (n-1) rows plus n input image pixels are required, as shown by example in Figure 2.3 below. Another important observation that can be made from Figure 2.2 is that for consecutive convolutions, only n input image pixels become obsolete and require updating. This is an important observation that influences the design of the convolution architecture. Figure 2.3 shows an example for a 3×3 filter size where the shaded areas of the IP and the area under the FC plane are the input image pixels that need to be stored. Hence, these pixels can be stored in a memory device, whether on chip or off chip.

Current output: OI[3,4] = FC00·IP45 + FC01·IP44 + FC02·IP43 + FC10·IP35 + FC11·IP34 + FC12·IP33 + FC20·IP25 + FC21·IP24 + FC22·IP23

Next output: OI[3,5] = FC00·IP46 + FC01·IP45 + FC02·IP44 + FC10·IP36 + FC11·IP35 + FC12·IP34 + FC20·IP26 + FC21·IP25 + FC22·IP24

Figure 2.2. Example showing how two consecutive output pixels are generated. This example is shown with a 3×3 size FC.


Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels need to be stored. In this example, 2 previous rows (shaded rows in addition to IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are needed for a 3×3 filter size.
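Before turning to the hardware, the computation of Equation 1 can be stated compactly in software. The following C++ reference model is a minimal sketch (the container types and function name are choices made here, not part of the thesis), assuming that pixels referenced outside the input plane take the value zero:

#include <cstdint>
#include <vector>

// Direct software model of Equation 1: convolve an i x j input plane IP with
// an n x n coefficient plane FC (n odd).  Pixels referenced outside the input
// plane are treated as zero, matching the zero-padding rule described later
// for the Data Memory Interface.
std::vector<std::vector<int>> convolve(const std::vector<std::vector<uint8_t>>& IP,
                                       const std::vector<std::vector<int>>& FC)
{
    const int rows = static_cast<int>(IP.size());
    const int cols = static_cast<int>(IP[0].size());
    const int n = static_cast<int>(FC.size());
    const int h = (n - 1) / 2;
    std::vector<std::vector<int>> OI(rows, std::vector<int>(cols, 0));

    for (int x = 0; x < rows; ++x)
        for (int y = 0; y < cols; ++y) {
            int acc = 0;                              // one output pixel
            for (int I = 0; I < n; ++I)
                for (int J = 0; J < n; ++J) {
                    int r = x - I + h, c = y - J + h; // 180-degree rotation of the FC
                    if (r >= 0 && r < rows && c >= 0 && c < cols)
                        acc += FC[I][J] * IP[r][c];
                }
            OI[x][y] = acc;
        }
    return OI;
}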

Convolution is a vital part of image processing and it can be done through both software and hardware [1]. Much effort has been directed towards speeding up the convolution process through hardware implementation [1,2,3,9,14]. This is because convolution is a computation-intensive algorithm, as shown in Equation 1. For example, with a 5×5 filter size, each output pixel requires 25 Multiplication and Addition Operations (MAOPS). Thus, as the total number of pixels in an image increases, the MAOPS required increase substantially. Bosi and Bois in [1] propose the use of FPGAs programmed with a 2D convolver as a coprocessor to an existing Digital Signal Processor (DSP) to accelerate the convolution process. In [2,3,9,14], special-purpose convolution architectures are designed to meet real-time image processing requirements. Hsieh and Kim in [2] proposed a highly pipelined VLSI convolution architecture. Parallel one-dimensional convolutions and a circular processing module were the approaches used in that architecture for performance gain, and the architecture required n×n processing elements, each being a multiplier and adder. Both [3] and [14] propose convolution architectures based on systolic arrays which operate on real-time images with a size of 512×512 pixels; both of these architectures perform bit-serial arithmetic. The architecture of [14] requires on-chip memory to store the necessary input pixels. The focus of this thesis is the development of a high-performance, real-time, special-purpose convolution architecture intended for scanning and printing applications. The requirement for the final “production” version (implemented with ASIC technology) of this architecture is the capability to perform convolution with a 5×5 FC size on input images of size 8½″×11″ at a rate of 60 Pages Per Minute (PPM) at 600 dpi (dots per inch). A total of 33.66M pixels are generated when a standard paper size of 8½″×11″ is scanned at a resolution of 600 dpi. Adding the requirement to process 60 scanned pages per minute results in 1.69G MAOPS per second. The multiplication operands are each 8 bits in width, generating 16-bit products which must be summed. The method proposed in [1] is not feasible from a cost standpoint, and some of the functionality of the DSP may not be required. The architectures presented in [2] and [9] are special-purpose architectures for convolution, and both require n×n processing elements, which could potentially occupy a large chip area. The architectures of [3] and [14] are both systolic array architectures employing bit-serial arithmetic operations and hence may not be able to meet the performance requirements mentioned above. In [7] the authors point out the well-known fact that for most applications bit-parallel arithmetic has a performance edge over bit-serial arithmetic; however, the processing architecture of [7] is based on bit-serial arithmetic since it is sufficient for their requirements and has a lower gate count.
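(For orientation, these figures are mutually consistent: 8½ in × 600 dpi = 5100 pixel columns and 11 in × 600 dpi = 6600 pixel rows give 5100 × 6600 = 33.66M pixels per page, or 33.66M pixels per second at 60 PPM; the 1.69G-per-second figure then corresponds to counting the roughly 50 multiply and add operations of a 5×5 FC per output pixel, since 33.66M × 50 ≈ 1.68G.)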

Two hardware architectures, Version 1 and Version 2, are proposed for the implementation of the two-dimensional discrete convolution algorithm shown as Equation 1. Version 1 of the architecture, proposed in the next chapter, is based on a linear systolic array structure. Version 2 of the architecture, based on an extension of Version 1, will be shown in a later chapter. Version 2 of the architecture was developed to meet functional and performance requirements different from those of Version 1 of the architecture. Version 1 is a special-purpose convolution architecture. Unlike [2] and [9], the architecture will not use n×n processing elements, and it will be scalable in order to meet variable performance requirements. A scalable architecture, when implemented in programmable Field Programmable Gate Array (FPGA) technology, allows users to implement the architecture to meet their specific performance needs. Parallel arithmetic operations will be utilized in the Version 1 architecture for performance gain.


Chapter 3

Version 1 Convolution Architecture

Specially designed hardware to implement convolution can offer a performance gain over a general-purpose Digital Signal Processor (DSP). Since the convolution algorithm requires a large number of multiplications and additions, a multiprocessor architecture is desired. Multiprocessor architectures offer the benefit of processing multiple operations at a given instant. A specific type of multiprocessor architecture will be utilized for the implementation of convolution in hardware; it is referred to as a systolic array structure [6]. The advantages of this type of structure include its modularity, its regularity of structure, and the ease of pipelining. In addition, the systolic structure has the ability to fully and simultaneously utilize all computational units within the architecture. A major challenge in developing systolic array architectures is providing data simultaneously to the multiple computational units in the correct order. Figure 3.1 shows the top-level view of Version 1 of the developed hardware convolution architecture with the basic functional units indicated. An external memory device (external to an FPGA or ASIC chip) will be utilized to hold a portion of the scanned input image plane during the convolution process. A more detailed description of the interface between the main system and the external memory will be presented later. The functionality of all basic functional units of the architecture of Figure 3.1 will now be described.

3.1. Arithmetic Unit (AU)

The Arithmetic Unit (AU) is the core of Version 1 of the convolution architecture. As shown in Figure 3.1 above, the AU consists of an Accumulator plus Multiplication and Add Units (MAUs). As the name implies, the basic building blocks within the MAUs are multiplication units and adders. Each MAU consists of one multiplication unit and one adder, as shown in Figure 3.2 below.


Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed to be 8 in this example).


Figure 3.2. A MAU and included functional units.

As depicted in Figure 3.2, the multiplication unit multiplies two 8-bit binary numbers, and the adder then adds the product to the input from the previous MAU. The output from the adder is used as the input to the next MAU. In order to achieve high performance it is important to utilize high-speed adders and multiplication units within the architecture. It is also of interest to adopt a multiplication technique that is suitable for pipelining for performance enhancement. For example, the Wallace Tree multiplication or array multiplication techniques can easily be pipelined into multiple stages. The implementation platform influences performance as well. For instance, it is now common to find high-performance built-in core adders and multiplication units within FPGA technology chips.

Figure 3.3. Systolic array structure of the MAUs, where IDSs are the outputs from the Input Data Shifters and CSs are the outputs from the Coefficient Shifters.

The MAUs are arranged in such a way that they create the systolic array structure. Figure 3.3 shows the systolic array structure of the MAUs. The total number of MAUs used is determined by the size of the coefficient filter: for an n×n filter size, n MAUs are utilized. Use of the systolic array structure requires the outputs from the Input Data Shifter (IDS) and Coefficient Shifter (CS) functional units to be skewed. This ensures that the products are accumulated in the correct sequence. An accumulator is needed at the end of the structure to add the partial results generated by the MAUs to form an output pixel. For example, for an n×n filter size, the accumulator must accumulate n partial results generated by the MAUs in the generation of one output pixel. This requires n clock cycles if each MAU takes one clock cycle to complete its operation. The registers to the left of each MAU serve as pipeline stage registers. If necessary, additional pipeline stages can be implemented within each MAU to increase performance [8].

3.2. Coefficient Shifters (CSs)

The Coefficient Shifters are a group of parallel register shifters that can be programmed to retain the values of the filter coefficients. The CSs are responsible for generating a skewed output of filter coefficients to the MAU inputs. Figure 3.4 shows a structural view of the CSs. The number of shifters within the CSs is also dependent on the coefficient filter size: for an n×n filter size, there will be n coefficient shifters, and each CS stores n coefficients, as seen in Figure 3.4. Once programmed with the filter coefficients, the CSs retain the filter coefficients throughout the convolution process. In order to provide the MAUs with the skewed input from the CSs, as the convolution process starts, CS0 shifts after the first clock cycle, on the next clock cycle CS1 and CS0 shift, and on the following clock cycle CS2, CS1, and CS0 shift. The process continues until all the CSs are shifting every clock cycle. This ensures that the MAUs receive the required skewed input. Figure 3.5 shows the arrangement of the filter coefficients within the CSs for the convolution algorithm, corresponding to the filter coefficients shown in Table 3.1.


Figure 3.4. Functional units within CSs.

Table 3.1. Filter coefficient array.

FC(0, 0)      FC(0, 1)      ...   FC(0, n-2)      FC(0, n-1)
FC(1, 0)      ...                                 FC(1, n-1)
...
FC(n-1, 0)    FC(n-1, 1)    ...   FC(n-1, n-2)    FC(n-1, n-1)


Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters.

3.3. Input Data Shifters (IDSs)

The main function of the Input Data Shifters (IDSs) is to generate a proper sequence of input image pixels for the MAUs. Figure 3.6 shows the basic functional units within the IDSs of Figure 3.1.


Figure 3.6. Functional units within IDSs.

3.3.1. Register Bank (RB)

Due to the structure of the convolution algorithm, each successive output pixel requires access to the previous ((n×(n-1))+n-1) input image pixels. Hence, the RB of Figure 3.6 is used to provide the correct input image pixels for successive convolutions. Figure 3.7 shows the detail of the RB. The RB consists of n registers, and each register has a length of n input image pixels, or (n×d) bits, assuming each input image pixel is d bits in length. Thus, the RB has the capacity to hold the n² input image pixels that are needed for each convolution. This functional unit and its structure also improve the scalability of the architecture.


Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the input pixels).
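As a software analogy of this structure (not the VHDL used for the actual design), the RB can be viewed as a circular buffer of n registers in which only the oldest register is overwritten between consecutive convolutions; the C++ below is illustrative, with assumed names and layout.

#include <array>
#include <cstdint>

// Illustrative model of the Register Bank for n = 5 and d = 8: five
// registers, each holding five 8-bit pixels.  Between consecutive
// convolutions only one register (the oldest) is overwritten with the pixels
// arriving from the DM I/F; the write index wraps from n-1 back to 0, which
// is the top-to-bottom update sequence described in the next subsection.
constexpr int N = 5;

struct RegisterBank {
    std::array<std::array<uint8_t, N>, N> reg{};   // n registers of n pixels each
    int writeIndex = 0;                            // register to update next

    void update(const std::array<uint8_t, N>& newPixels) {
        reg[writeIndex] = newPixels;               // replace the oldest register
        writeIndex = (writeIndex + 1) % N;         // 0, 1, ..., n-1, 0, ...
    }
};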

3.3.2. Pattern Generator Pointers (PGPs)

In order to provide the MAUs with the correct sequence of input image pixels for each convolution, the Pattern Generator Pointers (PGPs) of Figure 3.6 are utilized. Once the RB has filled up, only one register needs to be updated with new input image pixels for each convolution. Thus, the update sequence for the RB (input image pixels coming from the Data Memory Interface) repeatedly cycles from top to bottom (from zero to (n-1)). For the output sequence from the RB, each convolution requires the contents of all registers to be fetched to the Delay Units. Hence, all functional units except the Data Memory Interface (DM I/F) run at a frequency n times faster (for an n×n filter size); the DM I/F operates at the same frequency as the input image pixel rate. Table 3.2 below shows the output sequence for one output pointer. The output sequence is 0, 1, 2, 3, 4 for n = 5. The example in Table 3.2 is based on a 5×5 filter size; the output pattern repeats itself every five convolutions.

Table 3.2. 5×5 Filter size (with one output pointer).

                         Convolutions
                    1st   2nd   3rd   4th   5th
Reading order        0     1     2     3     4
from the RB          1     2     3     4     0
                     2     3     4     0     1
                     3     4     0     1     2
                     4     0     1     2     3

The architecture can be scaled up to process up to x convolutions in parallel, where (x ≤ n). This is made possible by adding (x-1) additional output pointer(s), Delay Unit(s), and AU(s) to the existing architecture. As in the example above, the output sequence for each pointer can be predetermined, and the sequences repeat after every five convolutions. Table 3.3 below shows an example for a 5×5 filter size with two output pointers. Figure 3.8 shows the additional hardware required if two convolutions are to be done in parallel, and the figure indicates the additional hardware required to convolve x = n points in parallel. In addition, an Output Correction Unit (OCU) will be needed for convolution of two or more output pixels in parallel (see the OCU in Figure 3.1). The function of the OCU will be explained in a following section. As the architecture is scaled up to process more than one convolution in parallel, all functional units within the architecture except the DM I/F (which runs at the same frequency as the input image pixel rate) can operate at a lower frequency. If the architecture is scaled up to process n convolutions in parallel, then the whole architecture operates at the same clock rate as the input image pixel rate. Thus, on average, one convolution can be achieved every clock cycle. However, the RB needs modification in order to process n convolutions in parallel. To process n convolutions in parallel, on every clock cycle all n registers within the RB will be read at once. Thus, the current input from the DM I/F, which updates one of the registers within the RB, must also be fetched to the MAUs at the same instant. For this case, n pointers are utilized, and each pointer will only have one sequence instead of five, as shown in Table 3.2 and Table 3.3.

Table 3.3. 5×5 Filter size (Convolution with two output pointers).

                         Convolutions
                    1st   2nd   3rd   4th   5th
Reading order        0     2     4     1     3
from the RB          1     3     0     2     4
(pointer 0)          2     4     1     3     0
                     3     0     2     4     1
                     4     1     3     0     2

                         Convolutions
                    1st   2nd   3rd   4th   5th
Reading order        1     3     0     2     4
from the RB          2     4     1     3     0
(pointer 1)          3     0     2     4     1
                     4     1     3     0     2
                     0     2     4     1     3

Figure 3.8. Additional hardware and modification for convolution of x output pixels in parallel for (x ≤ n) (functional units shaded in gray are the additional hardware required for processing two convolutions in parallel).


The PGPs can be synthesized using a finite state machine model. Another modeling possibility is to store the predetermined sequences in RAM and read them out sequentially as needed.
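The following C++ sketch generates the predetermined read sequences that such a RAM or state machine would hold; the closed-form start index and the parameter names are assumptions inferred from Tables 3.2 and 3.3, not part of the thesis hardware.

#include <cstdio>

// Read-address pattern of the PGPs for an n x n filter with x output
// pointers working in parallel (cf. Tables 3.2 and 3.3).  For the c-th
// convolution, pointer p starts its sweep of the Register Bank at register
// (x*c + p) mod n and then visits all n registers in order; the pattern
// repeats every n convolutions.  Each printed line is the read order for one
// convolution, i.e. one column of the corresponding table.
int main() {
    const int n = 5;      // filter size (n x n)
    const int x = 2;      // number of parallel convolutions / output pointers
    for (int p = 0; p < x; ++p) {
        std::printf("pointer %d:\n", p);
        for (int c = 0; c < n; ++c) {
            int start = (x * c + p) % n;
            for (int i = 0; i < n; ++i)
                std::printf("%d ", (start + i) % n);
            std::printf("\n");
        }
    }
    return 0;
}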

3.3.3. Delay Units (DU)

Output from the Register Bank (RB of Figure 3.7 and Figure 3.8) passes through the Delay Units (DU) of Figure 3.8 before being fetched into the MAUs within the AUs of Figure 3.8. The Delay Units consist of a series of flip-flops placed in a manner that generates a skewed input to the MAUs. This is necessary for the AU to generate the correct outputs. Figure 3.9 below shows the internal structure of a DU.

Figure 3.9. Organization of flip-flops within the Delay Unit (DU). R within the figure denotes one flip-flop.

3.4. Systolic Flow of Version 1 Convolution Architecture

Figure 3.10 may be used to further demonstrate how the data flow within the MAUs occurs. As the input image pixels pass through the DU from the RB, skewed input image pixels are generated and fed to the MAUs of the AU. At the same time, skewed filter coefficients are input into the MAUs by the CSs. Figure 3.10 below shows an example of how an output pixel is obtained as it flows through the MAUs with a 5×5 filter coefficient size.


Output pixel OI[3,4], with the FC centered at IP[3,4]:

OI[3,4] = FC00·IP56 + FC01·IP55 + FC02·IP54 + FC03·IP53 + FC04·IP52
        + FC10·IP46 + FC11·IP45 + FC12·IP44 + FC13·IP43 + FC14·IP42
        + FC20·IP36 + FC21·IP35 + FC22·IP34 + FC23·IP33 + FC24·IP32
        + FC30·IP26 + FC31·IP25 + FC32·IP24 + FC33·IP23 + FC34·IP22
        + FC40·IP16 + FC41·IP15 + FC42·IP14 + FC43·IP13 + FC44·IP12

Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel.

As shown in Figure 3.10 above, the convolution starts at clock cycle (cc) t0, when the first input image pixel (IP12) is multiplied by filter coefficient FC44 in MAU0. During the next clock cycle, (t0 + 1), the previous product from MAU0 is added to the product of the IP22 and FC34 multiplication in MAU1, while a new product is generated in MAU0 (FC43 and IP13). The sum of the two products in MAU1 is propagated into MAU2 on the next clock cycle, (t0 + 2), and is then summed with the product generated within MAU2. The process continues as shown in Figure 3.10 above. An output pixel is generated during the (t0 + 9) cc, when the partial results from MAU4 at the (t0 + 4), (t0 + 5), (t0 + 6), (t0 + 7), and (t0 + 8) cc's are summed by the accumulator. Once the first output pixel is generated on the 9th cc, from then on a new output pixel is generated every five cc's.
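Abstracting away the clock-level skew, the arithmetic performed by this flow can be sketched in C++ as follows; the window ordering and the function name are illustrative assumptions, not the thesis implementation.

#include <cstdint>

// Arithmetic view of the systolic flow in Figure 3.10 for a single output
// pixel with a 5 x 5 FC.  MAU m is programmed (via CS m) with filter row
// (4 - m); each diagonal wavefront j forms one column sum as it travels
// MAU0 -> MAU4 (one MAU per clock), and the accumulator adds the five column
// sums that leave MAU4 on consecutive clock cycles.  'window' is the 5 x 5
// input-image neighbourhood of the output pixel in natural row/column order;
// the 180-degree rotation required by Equation 1 appears as the
// (4 - m, 4 - j) indexing of FC.
int systolicPixel(const int FC[5][5], const uint8_t window[5][5])
{
    int pixel = 0;
    for (int j = 0; j < 5; ++j) {             // one wavefront per FC column
        int columnSum = 0;
        for (int m = 0; m < 5; ++m)           // chain position = MAU index
            columnSum += FC[4 - m][4 - j] * window[m][j];
        pixel += columnSum;                   // accumulator: n partial results
    }
    return pixel;                             // available after the pipeline fills
}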

3.5. Data Memory Interface (DM I/F)

It is anticipated that external memory devices will be utilized for IP pixel storage, since the cost of having on-chip memory within the single-chip convolution architecture of Figure 3.1 is high for any implementation platform. (The following assessment is made on the assumption that a 5×5 filter size is desired.) The bus width for data transfer between the external memory device and the DM I/F will be 40 bits. This ensures that each access to the external memory device can retrieve five input image pixels. However, since memory devices such as the SRAM devices on the market only come in sizes of 8-bit, 16-bit, and 32-bit, two memory devices will be used: one 8-bit and one 32-bit. Due to the fact that for consecutive convolutions only five input image pixels need to be updated, only one access to an external memory device will be required. Figure 3.11 below shows the basic functional units within the DM I/F.


Figure 3.11. Basic functional units within Data Memory I/F.

A cache unit is utilized to reduce the penalty of accessing the external memory devices. Figure 3.12 shows a more detailed view of the DM I/F. Register Files A and B each consist of four shift registers, namely Registers b, c, d, and e. Each register is 40 bits in size and holds five input image pixels. In order to prevent data starvation, Register Files A and B are used alternately: as Register File A is providing input image pixels to the IDSs, Register File B is being filled with input image data from the external memory devices, and vice versa. As either one of the Register Files outputs input image pixels to the IDSs, each internal register shifts an input image pixel out by shifting right.


Figure 3.12. A more detailed look at DM I/F.

Observing Equation 1, there are instances where references are made outside the range of the input image. For these accesses, zero pixel values will be used. To address these boundary conditions, zero padding hardware is incorporated into the DM I/F: whenever the end of a row of input image pixels is reached, (n-1)/2 zero pixels are attached. Thus, a column counter is needed within the main controller of the architecture. Register a of Figure 3.12 is a register that can hold up to five input image pixels. As Register a is filled, its contents are stored into the External Memory Devices. There are also five address pointers needed for addressing the External Memory Devices for storage of input image pixels. Each addressable location of the two External Memory Devices can hold up to five input image pixels (40 bits). Figure 3.13 shows the time line for activities within the DM I/F. For every four reads from the External Memory Devices (read from each pointer once) and one write to store the input image pixels held in Register a, five output pixels (OI) of Figure 3.1 will be produced.


Figure 3.13. Time line of activities, where W denotes a Write and R denotes a Read (from the external memory device) of the registers indicated in the boxes directly below.
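A rough software sketch of this write path is given below; the packing order, the memory model, and the names are illustrative assumptions rather than the exact hardware interface.

#include <cstdint>
#include <vector>

// Sketch of the write half of the DM I/F for n = 5 and d = 8: incoming
// pixels are packed five at a time into Register a and written to the
// external memories as one 40-bit word; when the end of an image row is
// reached, (n-1)/2 = 2 zero pixels are appended, which is the zero-padding
// rule used for the convolution boundary.
struct DataMemoryWriter {
    static constexpr int N = 5;                    // filter size (n)
    std::vector<uint64_t> memory;                  // each entry models one 40-bit location
    uint64_t registerA = 0;                        // Register a: packs five 8-bit pixels
    int pixelsInA = 0;

    void pushPixel(uint8_t p) {
        registerA = (registerA << 8) | p;          // shift the new pixel in
        if (++pixelsInA == N) {                    // Register a full: one external write
            memory.push_back(registerA);
            registerA = 0;
            pixelsInA = 0;
        }
    }

    void endOfRow() {                              // zero padding at a row boundary
        for (int i = 0; i < (N - 1) / 2; ++i)
            pushPixel(0);
    }
};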

3.6. Output Correction Unit

This unit is responsible for correcting the output sequence when the architecture is scaled to process two or more convolutions in parallel. Figure 3.14 below shows an example of the output pixel sequence when two convolutions are processed in parallel. Instead of one output on each output clock cycle, two output pixels are generated on a single clock cycle every two output clock cycles. Thus, the Output Correction Unit may be needed to correct the output sequence back to one output pixel per output clock cycle. Whether the OCU is needed can be addressed at a later time.

Figure 3.14. Output pattern for two convolutions in parallel.

3.7. Controller

This is the functional unit of Figure 3.1 that coordinates all the other functional units within the architecture. The main controller of the architecture will be implemented in finite state machine form. Within the main controller there will be a row counter and a column counter to keep track of the row and column counts, so that the controller knows when the end of a row is encountered. There can be two separate controlling units within the main controller: one controller responsible for the DM I/F (controller DM) and the other responsible for the rest of the architecture (controller R). The DM I/F will run at the same rate as the input image pixels, while the rest of the architecture will run at least n times faster, assuming convolution of a single point. However, as the architecture is scaled up to handle x convolutions at the same time (see Figure 3.8), controller R can be run at a correspondingly lower frequency. Basically, the controller can be divided into three main stages. The first stage is mainly devoted to storing the first few rows of input image pixels and waiting until there are enough input image pixels for convolution to start. The second stage is responsible for filling up the pipeline and making sure that the convolution starts in the correct manner. The last stage deals with shutting down the system.


Chapter 4

Revised Architectural Requirements and Resulting Version 2 Convolution Architecture

The convolution architecture proposed in the previous chapter is scalable and suitable for applications that require scalable performance and hardware. In this chapter a more stringent performance requirement is addressed, for which a convolved OI pixel is expected on each clock cycle of 7.3 ns (for a final “production” model based on ASIC technology). In addition, k distinct n×n FCs are required to be simultaneously convolved with each Input Image Plane (IP), resulting in a performance requirement of k OI pixels on each 7.3 ns clock cycle. The performance requirement of k convolved OI pixels on each 7.3 ns clock cycle can only be expected from final high-speed production technologies. In Version 1 of the architecture, filter coefficients (FCs) were assumed to be 8 bits in length; filter coefficients will now be 6 bits in length. Even though the convolution architecture proposed in the previous chapter could be scaled up, and the pipeline stages within the MAUs could also be increased, to meet all the above requirements, a specially tailored architecture can save hardware and reduce the architecture's controller complexity. For example, as shown in Figure 3.1, within each AU there is an accumulator in front of the MAUs. As the architecture is scaled up to process n convolutions in parallel, n accumulators will be required within the architecture, which can be costly from a hardware standpoint. Furthermore, a simplified controller for the IDSs can also contribute to hardware savings. Hence, in this chapter a modified and specially tailored convolution architecture is presented; it is referred to as Version 2 of the convolution architecture.
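(For orientation, a 7.3 ns clock period corresponds to roughly 137 MHz, so one OI pixel per clock per FC plane amounts to about 137M output pixels per second, comfortably above the 33.66M pixels per second implied by the 60 PPM, 600 dpi requirement of Chapter 2.)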

4.1. Version 2 Convolution Architecture for (k = 1)

Since the desired output rate is the convolution of one OI pixel per clock cycle, for an n×n FC size a total of n² MAUs are needed for one distinct filter coefficient set. Figure 4.1 below shows a top-level view of Version 2 of the architecture, where n and d (the width of the input image pixels) are assumed to be 5 and 8, respectively. Buses shown in Figure 4.1 with a width of 40 bits result from (n×d). Each functional unit in this architecture implements the required functionality, as addressed below.


Figure 4.1. A top level view of Version 2 of the convolution architecture for one distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication and Add Array and AT denotes Adder Tree).

4.2. Arithmetic Unit (AU)

As shown in Figure 4.1 above, the Arithmetic Unit (AU) consists of n Multiplication and Add Arrays (MAAs) plus an Adder Tree structure (AT) at the end of the MAAs. Within each MAA there are n Multiplication and Add Units (MAUs) arranged in a systolic array structure. Figure 4.2 shows the arrangement of the n MAUs within each MAA. The basic functional units within each MAU remain the same as in the previous chapter. In Version 2 of the architecture, the filter coefficients will be held within the MAUs; therefore, an additional register is needed to hold the filter coefficient value assigned to a specific MAU. Since Version 2 of the modified convolution architecture will feature n² MAUs, the Coefficient Shifters (CSs) shown in Version 1 of the architecture (see Figure 3.1, Figure 3.4 and Figure 3.5) of the previous chapter can be eliminated. Hence, each of the n² filter coefficients will be assigned to a specific MAU. Figure 4.3 shows the functional units within each MAU.


Figure 4.2. Functional units within the Multiplication and Add Array (MAA).

Figure 4.3. A MAU and its functional units.

In order to achieve the desired performance, it will be necessary to pipeline all MAUs beyond the minimum pipeline stages shown in Figure 4.2 (the register to the right of each MAU represents a pipeline stage). Thus it is important to employ multiplication techniques that can easily be pipelined into multiple stages. It is possible to combine the multiplication unit and the adder shown in Figure 4.3 into one unit. For the most part, a multiplication unit consists of an adder tree that adds all the generated partial products. As shown in Figure 4.3, an adder is required to sum the previous MAU output with the product generated by the multiplier. It is possible to use a Carry Save Adder and generate the output as two separate outputs (a sum output and a carry output); this would eliminate the need for another high-speed adder at the end of each MAU. The Adder Tree (AT) within the AU is responsible for adding all n partial results from the MAAs to form the output image pixel. The AT can be constructed with Carry Save Adders (CSAs) and a Carry Lookahead Adder (CLA). In addition, the AT will be pipelined into multiple stages as well for performance. Figure 4.4 shows a possible arrangement of CSAs and a CLA within the AT. This example is based on a 5×5 FC size.


Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC.

Other basic functional units within the AU are the Delay Units (DUs), which are responsible for generating the skewed input image pixels for the MAUs. The DUs will need to be pipelined as well, with the same number of pipeline stages as the MAA. Upon further investigation, even though the replacement of a high-speed adder with a CSA within a MAU can save a small amount of hardware, the replacement is not as beneficial when the architecture is viewed at the highest level. Table 4.1 shows a direct comparison of the number of gates required for a 14-bit CSA and a 14-bit CLA (an EX-OR gate is counted as five gates). The amount of hardware saved is not as significant as the increase in hardware for the AT. Figure 4.5 shows a possible arrangement of the AT if a CLA is utilized within the MAUs.

Table 4.1. Gate count comparison between CSA and CLA.

              CSA    CLA
Gate Count    182    210


Figure 4.5. One possible arrangement of the AT when CLA is utilized within the MAUs.

First and foremost, a comparison between Figure 4.4 and Figure 4.5 shows that a number of CSAs are saved, and a number of pipeline stages are saved as well. This results in a large amount of hardware savings. If a CSA is used within each MAU, the number of bits (or bus lines) running from one MAU to another is doubled. Hence, when implemented, the CSA will require more real estate within the chip (especially when implemented as an ASIC) than the CLA, thus reinforcing the need to reduce the number of CSA units. Another important hardware reduction is that the number of flip-flops required for the pipeline is halved, since only one bus (one output from each MAA) is required. In conclusion, the adder within each MAU of Figure 4.3 will be a CLA type, and the Adder Tree (AT) of Figure 4.1 will be implemented as shown in Figure 4.5 for the case of n = 5.

4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU)

The Multiplication Unit (MU) of Figure 4.3 is one of the most important arithmetic components within the proposed convolution architecture. Thus, it is important that a high-speed and area-efficient multiplication technique be derived and implemented, since the architecture requires 25 MUs for one convolution set. For each MU, an 8-bit unsigned binary number (IP) is to be multiplied by a 6-bit signed binary number (FC), and a 14-bit signed binary output (OI) is generated. Table 4.2 below shows a summary of all the elements involved in the multiplication. All signed binary numbers will be represented as 2's complement numbers.

Table 4.2. A summary of the multiplication.

Description     Representation                                  Range (Decimal)
Multiplicand    8-bit unsigned binary number                    0 to 255
Multiplier      6-bit signed binary number (2's complement)     -32 to 31
Product         14-bit signed binary number (2's complement)    -8192 to 8191
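(As a quick check on Table 4.2, the extreme products are 255 × (-32) = -8160 and 255 × 31 = 7905, both of which fall within the 14-bit 2's complement range of -8192 to 8191.)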

Multiplication in binary can be done using the same technique as with the commonly used paper and pencil method. Partial products are generated based on each bit of the multiplier and then all the partial products are summed to generate the product. The number of partial products required is dependent on the number of bits of the multiplier. Hence, as shown in Table 4.2, a 6-bit signed binary number is used as the multiplier instead of the 8-bit unsigned binary number; this is due to the fact that using a reduced number of bits for the multiplier results in fewer partial products. However, since the multiplier in this case is a signed binary number, for the regular paper and pencil method to work when the multiplier is in negative range, both the multiplicand and multiplier need to be complemented before the multiplication. This is due to the fact that all the partial products are positive and hence the result generated will be positive as well, which is not correct since a negative result should be obtained as the multiplier is of negative value. Hence, by complementing both the multiplicand and the multiplier, the

signs are switched between the two operands, but the product is unchanged and correctly negative. In addition, all the partial products need to be sign extended for the multiplication to be correct. Figure 4.6 illustrates the multiplication concept mentioned above. A copy of the multiplicand is placed into the partial product, with sign extension(s), if the respective multiplier bit is one; otherwise all zeros are placed.

                        B7 B6 B5 B4 B3 B2 B1 B0    multiplicand
                  ×              A5 A4 A3 A2 A1 A0  multiplier
      s  s  s  s  s  x  x  x  x  x  x  x  x         partial product based on A0
      s  s  s  s  x  x  x  x  x  x  x  x            partial product based on A1
      s  s  s  x  x  x  x  x  x  x  x               partial product based on A2
      s  s  x  x  x  x  x  x  x  x                  partial product based on A3
      s  x  x  x  x  x  x  x  x                     partial product based on A4
  +   x  x  x  x  x  x  x  x                        partial product based on A5
  P13 P12 P11 P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0

Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row of the partial products denotes sign extension of that particular row of partial product).
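The complement-both-operands trick can be checked quickly in software. The following C++ sketch is illustrative only (the function and variable names are assumptions, and ordinary integer arithmetic stands in for the sign-extended partial-product rows of Figure 4.6); it confirms that the method reproduces the direct product for every 8-bit unsigned multiplicand and 6-bit signed multiplier.

#include <cstdint>
#include <cassert>

// Paper-and-pencil multiply: 8-bit unsigned multiplicand x 6-bit
// two's-complement multiplier -> 14-bit two's-complement product.
// If the multiplier is negative, both operands are complemented first so
// that all partial products come from a positive multiplier.
int16_t pencil_multiply(uint8_t ip, int8_t fc /* -32..31 */) {
    int32_t multiplicand = ip;
    int32_t multiplier   = fc;
    if (multiplier < 0) {                  // swap the signs of both operands
        multiplicand = -multiplicand;
        multiplier   = -multiplier;
    }
    int32_t product = 0;
    for (int i = 0; i < 6; ++i)            // one partial product per multiplier bit
        if ((multiplier >> i) & 1)
            product += multiplicand * (1 << i);
    product &= 0x3FFF;                     // keep the 14-bit result ...
    if (product & 0x2000) product -= 0x4000;   // ... and sign-extend it
    return static_cast<int16_t>(product);
}

int main() {
    for (int ip = 0; ip <= 255; ++ip)
        for (int fc = -32; fc <= 31; ++fc)
            assert(pencil_multiply(static_cast<uint8_t>(ip),
                                   static_cast<int8_t>(fc)) == ip * fc);
    return 0;
}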

From Figure 4.6 above, it can be seen that the most expensive operation is summing all the partial products into the final result. It is difficult to design an adder that adds six operands at the same time, and such an adder would likely not be speed efficient. However, a method such as the Multilevel Carry Save Adder (CSA) Tree [10] can be employed to add all the partial products into a final result. The Multilevel CSA Tree uses multiple stages of CSAs to reduce the operands to two operands, and a final adder stage sums those two operands to generate the final result. Depending on the speed requirement, the Multilevel CSA Tree can be easily pipelined into multiple stages to increase throughput. In addition, the final stage adder can be replaced with a fast adder such as a Carry Lookahead Adder (CLA) to reduce the latency. Figure 4.7 below shows a possible arrangement of the Multilevel CSA Tree for adding six operands; a five stage pipeline can be implemented with this configuration. It is possible to reduce the hardware count if the number of partial products can be reduced. This can be done through use of the Modified Booth's Algorithm (MBA) [13]. The MBA inspects three multiplier bits at a time and generates the respective partial product

selections. In contrast, the original Booth's Algorithm (BA) [4] inspects two bits of the multiplier at a time, so the number of partial products generated remains proportional to the number of multiplier bits. The MBA reduces the number of partial products required to (x/2 + 1), where x is the number of bits of the multiplier. Thus, for a 6-bit multiplier, the number of partial products is reduced from six to four. However, the Partial Products Generator's (PPG) complexity is increased due to the different possible outputs for each partial product. Table 4.3 below gives a summary of the possible outputs for a partial product based on the three multiplier bits examined.

[Figure 4.7 structure: the Partial Products Generator forms PP0-PP5 from the multiplicand and multiplier; two CSAs in a first stage and two more CSAs in a second stage reduce the six partial products to two operands, and a CLA in the third stage produces the product.]

Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products.

Table 4.3. Partial Product Selection Table.

Multiplier Bits    Selection
000                0
001                + Multiplicand
010                + Multiplicand
011                + 2×Multiplicand
100                - 2×Multiplicand
101                - Multiplicand
110                - Multiplicand
111                0

As shown in Table 4.3, each partial product generated can have a different output and thus the hardware complexity of the PPG is increased. However, each partial product output is easily obtained: a left shift of the multiplicand by one position produces the 2× multiples, and complement-plus-one produces the negative values required. Figure 4.8 below illustrates the changes to the Wallace Tree when MBA is employed. Compared to Figure 4.7, two CSAs can be saved.
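The selection rule of Table 4.3 and the summation of the recoded partial products can be captured in a short software model. The C++ sketch below is illustrative only: the function names are assumptions, it uses the minimum three recoded groups needed for a 6-bit two's-complement multiplier (whereas the hardware of Figure 4.8 carries four partial products), and ordinary addition stands in for the CSA/CLA tree. It checks that the Booth-recoded sum reproduces the direct product for every multiplicand/multiplier pair used by the MU.

#include <cstdint>
#include <cassert>

// Radix-4 (Modified Booth) recoding of a 6-bit two's-complement multiplier:
// overlapping 3-bit groups select 0, +/-M or +/-2M per Table 4.3, and each
// recoded partial product is weighted by 4^g before being summed.
int16_t mba_multiply(uint8_t ip, int8_t fc /* -32..31 */) {
    unsigned m = (static_cast<unsigned>(fc) & 0x3F) << 1;  // A5..A0 with a 0 appended below A0
    int32_t product = 0;
    for (int g = 0; g < 3; ++g) {
        unsigned bits = (m >> (2 * g)) & 0x7;              // overlapping 3-bit group
        int32_t pp;
        switch (bits) {
            case 0: case 7: pp = 0;           break;       // 000, 111 ->  0
            case 1: case 2: pp = ip;          break;       // 001, 010 -> +M
            case 3:         pp = 2 * ip;      break;       // 011      -> +2M
            case 4:         pp = -2 * ip;     break;       // 100      -> -2M
            default:        pp = -ip;         break;       // 101, 110 -> -M
        }
        product += pp * (1 << (2 * g));                    // weight by 4^g
    }
    return static_cast<int16_t>(product);                  // fits in 14 bits here
}

int main() {
    for (int ip = 0; ip <= 255; ++ip)
        for (int fc = -32; fc <= 31; ++fc)
            assert(mba_multiply(static_cast<uint8_t>(ip),
                                static_cast<int8_t>(fc)) == ip * fc);
    return 0;
}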

[Figure 4.8 structure: the Partial Products Generator (first stage) forms PP0-PP3 from the multiplicand and multiplier; two CSAs (second stage) reduce them to two operands, and a CLA (third stage) produces the 14-bit product.]

Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree Structure.

Figure 4.9 below depicts the detailed multiplication technique when MBA is employed to reduce the number of partial products required. Some hardware saving can be achieved wherever a Full Adder (FA) can be replaced with a Half Adder (HA) within the MU. Figure 4.10 shows the reduced sign extension within the partial products, which in turn contributes to hardware savings [12].

                           B7 B6 B5 B4 B3 B2 B1 B0    multiplicand
                     ×              A5 A4 A3 A2 A1 A0  multiplier
   s1 s1 s1 s1 s1  x  x  x  x  x  x  x  x  x           partial product based on 0, A0, A1
   s2 s2 s2  x  x  x  x  x  x  x  x  x                 partial product based on A1, A2, A3
   s3  x  x  x  x  x  x  x  x  x                       partial product based on A3, A4, A5
 +                             s3    s2    s1          sign corrections for all partial products
   P13 P12 P11 P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0

Figure 4.9. Illustration of multiplication technique based on Modified Booth’s Algorithm.


[Figure 4.10: the same three partial products as Figure 4.9, but with each row's string of sign-extension bits replaced by a shorter fixed pattern, so fewer columns must be summed.]

Figure 4.10. Partial Product's sign extension reduced for hardware saving.

Table 4.4 below shows the hardware comparison between the multiplication techniques shown in Figure 4.7 and Figure 4.8. For simplicity, the multiplication technique shown in Figure 4.7 is denoted as method I and method II denotes the multiplication technique shown in Figure 4.8. Also, Exclusive-OR gates (EX-OR) within both methods are counted as five gates. FF in Table 4.4 denotes Flip-Flops.

Table 4.4. Comparison between method I and method II.

            # FA   # HA   # FF   # CLA   Gate Count (excluding FF)
Method I      37      5     78       1     691
Method II      9     15     72       1     641

From Table 4.4, the total gate counts, excluding FFs, are nearly identical, but multiple pipeline stages are included in order to meet the speed requirement. The maximum gate delay for both methods is identical and is set by the CLA (10 gate delays) [11]. Even though the hardware saving between the two methods looks modest at a glance, the number of replications of the unit can make it a significant factor. Another important note: for method I to handle two's complement, both the multiplicand and the multiplier must be complemented (complement each bit and add one to the result) whenever the multiplier is negative, so extra overhead is required for method I to function correctly. A workaround that avoids the complementing hardware is to swap the roles of the multiplicand and the multiplier. This ensures the multiplier is always positive, but in return more partial products must be generated, since the multiplier would then be an 8-bit unsigned number.

The multiplication technique shown in Figure 4.8 will be employed when the version 2 architecture is implemented, because it requires less hardware than the technique shown in Figure 4.7.

4.2.2. Delay Units (DU)

The Delay Units (DU) are responsible for generating skewed input image pixels for the MAA. The number of stages within a DU is directly proportional to n and the number of pipeline stages employed within the MAA. Figure 4.11 below shows the organization of the flip-flops within the DU. The number of stages shown in Figure 4.11 assumes n = 5 and two pipeline stages within each MAU (one pipeline stage after the Multiplication Unit and another after the Adder), excluding the first MAU, which only has a Multiplication Unit and hence one pipeline stage, as shown in Figure 4.1.

[Figure 4.11: each DUi takes one 8-bit slice of the 40-bit IDS output (DU0: bits 7-0, DU1: 15-8, DU2: 23-16, DU3: 31-24, DU4: 39-32); successive pipeline register stages PL1 through PL7 carry progressively fewer pixels (32, 24, 24, 16, 16, 8 and 8 bits) toward the output.]

Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline stages within each MAU (PL denotes a pipeline stage composed of flip-flop registers).
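As a software illustration of the skewing performed by the DUs, the following minimal C++ sketch models one Delay Unit as a register chain of configurable depth; the class name and the parameterized depth are assumptions of the sketch, and the actual depth of each DU follows from n and the MAU pipeline stages as shown in Figure 4.11.

#include <cstddef>
#include <cstdint>
#include <deque>

// One Delay Unit modeled as a chain of 8-bit registers. Each clock, the chain
// shifts by one position, so a pixel written in now emerges 'depth' cycles
// later, giving the skew needed by the corresponding MAA.
class DelayUnit {
public:
    explicit DelayUnit(std::size_t depth) : regs_(depth, 0) {}

    uint8_t clock(uint8_t pixel_in) {
        if (regs_.empty()) return pixel_in;   // depth 0: pass straight through
        uint8_t pixel_out = regs_.back();
        regs_.pop_back();
        regs_.push_front(pixel_in);
        return pixel_out;
    }

private:
    std::deque<uint8_t> regs_;   // the flip-flop stages of this DU
};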

4.3. Input Data Shifters (IDS)

This functional unit (see Figure 4.1) is responsible for providing the AU with the correct input image pixel sequence. Figure 4.12 below shows the structural view of the Input Data Shifters (IDS). There are n shift registers (S Registers) within the IDS with each register capable of holding n input image pixels and with parallel load capability. Each input image pixel is d bits wide. All shift registers also need to be able to shift all n input image pixels in parallel.


[Figure 4.12: the 40-bit output of the DM I/F is loaded in parallel into S Register0 (IDS0); each S Register passes its 40 bits in parallel to the next register (IDS1 through IDSn-1), and all registers drive 40-bit outputs to the AU.]

Figure 4.12. Structural view of the IDS with n = 5 and d = 8.

As shown in Figure 4.12 above, all data within the structure are shifted in parallel (in this example, it is 40 bits) from one shift register to another shift register. For example, S Register0 is loaded with 40 bits in parallel from the output of DM I/F and shifts 40 bits in parallel into S Register1.
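A compact C++ model of this parallel movement is sketched below for n = 5 and d = 8; the struct and member names are illustrative and are not taken from the VHDL.

#include <array>
#include <cstdint>

// IDS model: five S Registers, each holding one 40-bit word (five 8-bit
// pixels). On each shift, every register hands its whole word to the next
// register in parallel, and S Register0 is parallel-loaded from the DM I/F.
struct InputDataShifters {
    std::array<uint64_t, 5> s_reg{};          // only the low 40 bits are used

    void shift_in(uint64_t word_from_dm_if) {
        for (int i = 4; i > 0; --i)           // S Reg4 <- S Reg3 <- ... <- S Reg0
            s_reg[i] = s_reg[i - 1];
        s_reg[0] = word_from_dm_if & 0xFFFFFFFFFFULL;   // 40-bit mask
    }
    // The five 40-bit words are presented to the AU in parallel each cycle.
};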

4.4. Data Memory Interface (DM I/F)

The Data Memory Interface (DM I/F) of Figure 4.1 will remain unchanged from section 3.5. See Figure 3.11 and Figure 3.12.

4.5. Memory Pointers Unit (MPU)

The external memory devices (see Figure 4.1) that are required by the architecture are read and written through several memory pointers within the Memory Pointers Unit (MPU). In order to achieve a minimum number of writes to the external memory devices, the MPU receives and stores n (five) input image pixels (for a 5×5 filter size) before it writes all five input image pixels to the memory location pointed to by one of the memory pointers. Thus, the bus width for the interconnection between the memory devices and the architecture is 40 bits (n×d). If the memory accesses cannot keep up with the main system clock, then the memory bandwidth can be increased to reduce the number of accesses to the external memory devices.


[Figure 4.13: the external memory address space is divided into five segments of 1024 locations, starting at addresses 0, 1024, 2048, 3072 and 4096 and assigned to ptr_a (ptr_0), ptr_b (ptr_1), ptr_c, ptr_d and ptr_e (ptr_n-1) respectively; each 40-bit word is split across external memory devices a and b.]

Figure 4.13. External memory devices organization for n = 5 and d = 8.

Figure 4.13 above shows the memory organization of the external memory devices for the case of n = 5 and d = 8, with a different segment of the memory designated for each memory pointer, while Figure 4.14 below shows the functional units within the MPU, again for the case of n = 5 and d = 8. Each of the n memory segments is capable of storing one row of input image pixels. For example, for an n×d (40) bit memory bus width and a paper width of 5100 pixels, each segment of the memory should have at least 1020 locations. As shown in Figure 4.13 above, 1024 locations are allocated to each memory pointer. By allocating 1024 locations to each memory segment, the three most significant bits of each memory pointer can be used to differentiate the memory segments. Also, for every five output pixels generated, every memory pointer will have a common ten least significant bits, except for the memory pointer that is used to write the current pixels into the external memory devices. This is because the other four memory pointers need to pre-fetch all necessary input image pixels for the next convolution iteration into the cache memory. Thus, two 10-bit counters (col_cntr #1 and col_cntr #2) are needed to generate memory addresses, as shown in Figure 4.14.


[Figure 4.14: five 3-bit registers hold the most significant bits of ptr_b, ptr_c, ptr_d, ptr_e and ptr_a; a reg_sel signal selects one of them, and the selected 3 bits are concatenated with the 10-bit output of one of the two column counters (col_cntr #1 and col_cntr #2) to form the 13-bit address add_out.]

Figure 4.14. Functional units within the Memory Pointers Unit (MPU).

As shown in Figure 4.14 above, the three most significant bits of each memory pointer are stored in registers. One important note is that, as the architecture is initialized, each register is initialized with a different value: Memory Pointer b (ptr_b) is initialized with 001, Memory Pointer c (ptr_c) with 010, Memory Pointer d (ptr_d) with 011, Memory Pointer e (ptr_e) with 100, and Memory Pointer a (ptr_a) with 000. This ensures that each memory pointer is designated to a specific memory segment. In addition, the memory pointers are rotated as indicated in Figure 4.14 above whenever a row of the input image pixels is completed, because the least recent row of input image pixels stored within the external memory devices is no longer needed and can be overwritten with the current input image pixels.
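The address formation and pointer rotation described above can be summarized in a few lines of C++. This is a sketch under stated assumptions (the struct layout and names are illustrative, and the rotation direction is assumed, since only the fact of rotation is specified here); it is not the VHDL implementation.

#include <array>
#include <cstdint>

// MPU sketch: each pointer register keeps only its 3 segment bits; a 13-bit
// address is the segment concatenated with a 10-bit column count (1024
// locations per segment).
struct MemoryPointersUnit {
    // Index 0..4 = ptr_a..ptr_e, initialized to segments 000..100.
    std::array<uint8_t, 5> seg{0b000, 0b001, 0b010, 0b011, 0b100};
    uint16_t col_cntr1 = 0;   // write-column counter (current row)
    uint16_t col_cntr2 = 0;   // pre-fetch column counter (runs one count ahead)

    uint16_t address(int ptr, bool prefetch) const {
        uint16_t col = (prefetch ? col_cntr2 : col_cntr1) & 0x3FF;  // 10 bits
        return static_cast<uint16_t>((seg[ptr] << 10) | col);       // 3 + 10 bits
    }

    // Called when a row of input image pixels is completed: the assignment of
    // segments to pointers rotates so the oldest row's segment is reused next.
    void rotate() {
        uint8_t last = seg[4];
        for (int i = 4; i > 0; --i) seg[i] = seg[i - 1];
        seg[0] = last;
    }
};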

4.6. Systolic Flow of Version 2 Convolution Architecture

This section shows how the input image data flows through the AU of Figure 4.1 for the case of (n = 5) and how each output pixel (OI) is generated in Version 2 of the Convolution Architecture. Figure 4.15 below shows the systolic flow of the data for Version 2 of the Convolution Architecture. In order to simplify the figure, all pipeline stages within the MAAs are ignored, and the figure corresponds to Figure 3.10 with the same convolution point and input image pixels. As can be seen in Figure 4.15 below, at time t0 every MAA will multiply the input image pixel received with the filter coefficient

that is stored within the first MAU (within the MAA). At the next time instant, t0 + 1td, where td denotes the pipeline delay between two MAUs within each MAA, the previous product from MAU0 (in each MAA) is summed with the product of MAU1. This process continues until time instant t0 + 4td, when all input image pixels (for one convolution point) have flowed through MAU4 within each MAA, and they will be summed by the AT on the next cycle to generate one output pixel.
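The accumulation pattern of Figure 4.15 can be mimicked in software as below. This C++ sketch is structural only: each "MAA" accumulates one column of the 5×5 window, one FC×IP product per td, and the AT then sums the five column totals; the exact pixel subscripts of the figure (the skew applied by the DUs) are not reproduced here, and all names are illustrative.

#include <array>
#include <cstdint>

// One convolution point, computed the way the systolic array does it:
// five column accumulators (the MAAs) each add one product per time step,
// starting from the FC4x row, and the Adder Tree sums the five totals.
int32_t convolve_point(const std::array<std::array<int32_t, 5>, 5>& fc,
                       const std::array<std::array<int32_t, 5>, 5>& window) {
    std::array<int32_t, 5> maa_sum{};              // running sums inside MAA0..MAA4
    for (int step = 0; step < 5; ++step) {         // t0, t0+1td, ..., t0+4td
        int row = 4 - step;                        // first product uses the FC4x row
        for (int col = 0; col < 5; ++col)          // MAA0..MAA4 work in parallel
            maa_sum[col] += fc[row][col] * window[row][col];
    }
    int32_t oi = 0;
    for (int col = 0; col < 5; ++col) oi += maa_sum[col];   // Adder Tree
    return oi;                                     // one output image pixel
}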

4.7. Controller

The controller for version 2 of the architecture, shown in Figure 4.1, is only responsible for controlling the DM I/F of the architecture and the described Memory Pointers Unit (MPU). The AU and the IDS need no controller to regulate their activities: both are clocked with the main clock, and all the necessary input image pixels propagate through the pipeline stages as required, hence no controller is necessary for either unit.

[Figure 4.15 data: for MAA0 the running sum grows from FC40·IP16 at t0, to FC30·IP26 + FC40·IP16 at t0 + 1td, and so on up to FC00·IP56 + FC10·IP46 + FC20·IP36 + FC30·IP26 + FC40·IP16 at t0 + 4td; MAA1 through MAA4 follow the same pattern with their own filter coefficient columns and input pixels (IP15/IP25/IP35/IP45/IP55 for MAA1 down to IP12/IP22/IP32/IP42/IP52 for MAA4).]

Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td denotes the time delay between each MAU).

Figure 4.16 below shows the top level view of the Controller Unit (CU) with the input and output control signals shown. The CU is responsible for generating control signals to functional units within the DM I/F and the MPU.


[Figure 4.16 signals — CU inputs: reset (rst), row greater than (rgt), shut down signal (sds), end of column (eoc), beginning of a row (bor), clock (clk); CU outputs: f_sel (cache banks select lines), z_pad (zero padding), reg_sel (registers select), en_w (write enable for cache), en_sf (shift enable for cache), z_input (zero input), c_inc (column counter increment), rot (rotate memory pointers), r_inc (row counter increment), r_w (memory read/write line), sd_inc (shut down counter increment).]

Figure 4.16. Top level view of the Controller Unit (CU).

Figure 4.17 below shows the functional units for which the CU generates control signals, for the case of n = 5 and d = 8. The functional units labeled C_BANK1 and REG_A are contained within the DM I/F, whereas the functional unit labeled MEMPTRS is the MPU referred to above. The MPU is the functional unit responsible for generating memory addresses for the external memory devices, which store all the necessary input image pixels (IP) for each convolution. C_BANK1 is the functional unit that supplies input image pixels to the Input Data Shifters (IDS) and pre-fetches the necessary input image pixels from the external memory devices for the next iteration of convolutions; in other words, it serves as a cache memory for the convolution system. The functional unit labeled REG_A is responsible for storing the most recent input image pixels received from the external scanning device and later writing them to the external memory device when its register is full. In addition, this functional unit also supplies the most current input image pixels to the IDS.


Figure 4.17. Functional Units that receive control signals from the CU.

The convolution system is pipelined into multiple stages requiring synchronized operation. Thus, the CU is modeled as a finite state machine. Figure 4.18 below shows the system flow chart for the CU. The system flow chart describes micro-operations of the system on a clock-cycle by clock-cycle basis and also indicates the values that must be assigned to the appropriate control signals of the architecture on each clock cycle of operation. Operation of the system flow chart shown below can be divided into three segments. The first segment starts at the beginning of the flow chart and runs until the row counter (row_cntr) reaches a count greater than one. This segment is operational as the input image pixels of a scanned page begin to arrive; the convolution process is started only after the first two rows of input image pixels have been received. In this first segment, the first two rows of input image pixels received are stored in the external memory device. The purpose of the tog signal within the flow chart is to alternate writing to the two Regfiles within C_BANK1 to avoid data starvation from the external memory device. There are two column counters (col_cntr #1 and col_cntr #2) within the MEMPTRS functional unit: the first column counter serves the ptr_a address generator for writing to the external memory device, while the second column

counter runs one count ahead of the first to pre-fetch the rest of the rows (ptr_b, ptr_c, ptr_d, and ptr_e) required from the external memory device. After the first two rows of the input image pixels have been stored, the convolution process can start as the third row of input image pixels is received. The second operational segment of the flow chart, which starts at the decision box for row greater than one (row_cntr > 1) and ends at connector A in the figure, is operational while the convolution process runs. This segment of the flow chart continues until all input image pixels for the entire scanned page have been received. The last segment of the flow chart starts at connector A and continues until the end of the flow chart. This segment is mainly responsible for supplying the system with zeros as input until the last two rows of the output pixels are completely generated. As can be seen from the flow chart, a special counter (sd_cntr) is designated to count to two to indicate the end of the convolution process. Control signals such as tog, en_w and en_sf are expected to retain their latest value as the system transitions from one micro-operation to another, so some memory elements such as latches are required. If this compromises the CU's speed and performance, then a modification to the system flow chart such as the one shown in Figure 4.19 below would be desired. The system flow chart shown in Figure 4.19 contains extra states added to eliminate the shared states (after each decision branch) shown in Figure 4.18. This modification removes the memory elements for signals tog, en_w and en_sf, which would otherwise need to be toggled after each branch following the decision-making states, and thus reduces the control signal generation delay.


Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller Unit (CU).


Figure 4.18. (Continued) System flow chart for Version 2 convolution architecture’s Controller Unit (CU).


Figure 4.19. Modified Version 2 system flow chart.


Figure 4.19. (Continued) Modified Version 2 system flow chart.

4.8. Multiple Filter Coefficient Sets when (k > 1)

To address the need to simultaneously convolute k different sets of Filter Coefficients (FC) with a single Input Image Plane (IP), such as when scanning and printing color images, the version 2 architecture will require some hardware to be replicated. Figure 4.20 below shows a high level view of the arrangement of the additional required replicated hardware. For each additional FC set, one additional AU will need to be added. However, not all the functional units within the AU need to be replicated. A common DU (within the MAAs) can be shared among all the AUs for additional FC sets (see Figure 4.2 for detail within a MAA). For example, all the MAA0s within all the AUs can share a common DU rather than each MAA0 having its own DU.

[Figure 4.20 structure: a single Controller Unit (CU), Data Memory Interface (DM I/F) and set of Input Data Shifters (IDS0 through IDSn-1) feed k Arithmetic Units (AU0 through AUk-1); each AU contains MAA0 through MAAn-1 and an AT, is programmed with its own filter coefficient set (FC1 ... FCk), and produces its own output (OI1 ... OIk) from the shared Input Image Plane (IP), clock (CLK) and Data Memory (external or internal).]

Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be any number).

The CU, DM I/F, and IDS functional units of Figure 4.20 are functionally and operationally identical to the same units of Figure 4.1 for a given n and d and only one instantiation of these units is required when k filter coefficient sets are used. This enhances the scalability of the convolution architecture when expanded to handle

multiple FC planes. Likewise, the CU does not have to control any of the AUs of Figure 4.20; it only has to control the DM I/F and IDS units. From a functional and performance standpoint, the version 2 convolution architecture of Figure 4.20 can now simultaneously convolute a single IP with k (n×n) FCs, resulting in k convoluted OI pixels (OI1, OI2, ..., OIk) on each system clock cycle. This functionality and performance of the version 2 architecture will first be validated via HDL post-synthesis and post-implementation simulation in a later chapter of the thesis. The functionality and performance will finally be validated in a later chapter via development and experimental testing of an FPGA based hardware prototype.
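In software terms, scaling to k filter sets multiplies only the arithmetic, not the data movement: the same input window is handed to every AU. The C++ sketch below illustrates this sharing; the type and function names are illustrative, and the per-AU work is written as a plain dot product rather than the pipelined MAA/AT structure.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// k parallel convolutions of one shared 5x5 window: one result per FC plane,
// mirroring Figure 4.20 where only the AUs are replicated.
using Window = std::array<std::array<int32_t, 5>, 5>;

std::vector<int32_t> convolve_k(const Window& window,
                                const std::vector<Window>& fc_planes) {
    std::vector<int32_t> oi(fc_planes.size(), 0);           // OI1 .. OIk
    for (std::size_t k = 0; k < fc_planes.size(); ++k)      // one AU per FC plane
        for (int r = 0; r < 5; ++r)
            for (int c = 0; c < 5; ++c)
                oi[k] += fc_planes[k][r][c] * window[r][c]; // shared window
    return oi;
}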


Chapter 5

VHDL Description of Version 2 Convolution Architecture

This chapter describes the VHDL coding style and approach used to capture the Version 2 convolution architecture. After the design is captured through VHDL, it is synthesized and implemented to a targeted FPGA. Before a hardware prototype is built, functional and performance level simulation is done to validate proper functionality and determine performance. Modular and bottom-up hierarchical design approaches were employed during the VHDL design capture process. The modular design approach partitions the entire system into smaller modules or functional units that can be independently designed and described in VHDL. In addition, identical modules (with the same functionality) can share the same VHDL code, or a previously designed module can be reused. The bottom-up hierarchical design approach allows a multiple level view of the entire system for design ease. Hence, by employing these approaches, the smaller modules or functional units can be tested and validated before they are combined into the entire system. For prototype purposes the Version 2 convolution architecture is captured with three AUs instantiated (k = 3), no pipeline stages are built into the multiplication units, and the architecture is tailored to an input image plane of size 5×60 pixels. The VHDL described system has a total of 13 pipeline stages within each AU. In addition, the external memory device shown in Figure 4.1 is described in VHDL and incorporated into the overall system; for the experimental hardware prototype it is thus implemented within the FPGA chip containing the other functional units of the convolution architecture. Figure 5.1 below shows the organization of functional units within the convolution architecture. For simplicity of the chart, only the main functional units are shown; sub-modules within the main functional units are omitted. Both behavioral and structural level coding styles were used during the VHDL coding process. Behavioral level style

coding has the advantage that only the behaviors of the modules are described in the code and the CAD software must infer the internal logic blocks. However, this may introduce inconsistency, since different CAD software may infer different logic blocks for the same code. For this thesis, behavioral level coding was employed for most of the functional units; however, all the various sized adders and multiplication units were coded at the structural level. This was to validate the correctness of the multiply and addition techniques proposed in the previous chapter.

[Figure 5.1 hierarchy: the Convolution Architecture comprises the External Memory Device, Data Memory Interface, Controller Unit, Input Data Shifters and Arithmetic Unit; the Data Memory Interface contains the Memory Pointers Unit and Cache Unit; the Arithmetic Unit contains the Multiplication and Add Array and the Adder Tree; each Multiplication and Add Unit contains a Multiplication Unit and various sized adders.]

Figure 5.1. Version 2 Convolution Architecture organization.

After the system is captured through VHDL, post-synthesis and post-implementation HDL software simulation can determine if the system is functioning and performing as it should. The next chapter presents post-synthesis and post-implementation simulations of the convolution architecture. All VHDL code for Version 2 of the Discrete Convolution Architecture with three AUs (k = 3, see Figure 4.20) is included in Appendix A. The code is appropriately commented such that one should be able to identify the VHDL code describing all functional units of the convolution architecture system.


Chapter 6

Version 2 Convolution Architecture Validation via Virtual Prototyping (Post-Synthesis and Post-Implementation Simulation Experimentation)

Hardware Description Language (HDL) simulation of an architecture design, sometimes known as virtual prototyping, is an important step in the design flow for fine tuning and detecting potential problem areas before the design is implemented or manufactured. In this section, Post-Synthesis and Post-Implementation simulation results for version 2 of the convolution architecture are presented. Both Post-Synthesis simulation and Post-Implementation simulation are utilities contained within the Xilinx Foundation 4.1i CAD software package utilized during this project [18]. During the process of validating version 2 of the convolution architecture, the computer system used to run the software had the following configuration: Intel Pentium III 450 MHz processor, 128 MB of memory and the Windows 98 Second Edition operating system. After a design has been captured, either through schematic capture or via HDLs such as VHDL or Verilog, software HDL simulation of the design is the next step in the design flow for functional and timing validation. Software HDL simulation has the advantage of identifying potential problem areas before a design is implemented (for FPGA) or manufactured (for ASIC), so that corrections or modifications can be made. Both Post-Synthesis and Post-Implementation simulation are used within the FPGA prototyping design flow because Post-Synthesis simulation is utilized for functional validation of the design, whereas Post-Implementation simulation is utilized for both functional and timing (performance) validation. In order to obtain a better understanding of the characteristics of a particular design, both utilities can be important tools. The testing methodology employed in this project uses the bottom-up approach, which means lower level functional components such as the various types of adders and

multipliers were tested before these components were combined to form higher level functional units. Using the bottom-up approach in testing is desirable because, when the lower level components are combined into higher level functional units, one can be more assured that the lower level components are not at fault if errors are detected.

6.1. Post-Synthesis Simulation

In order to be assured that version 2 of the convolution architecture functionally performs as intended in the previous sections, Post-Synthesis simulation was utilized for functional level validation. Post-Synthesis HDL simulation of a system is simulation of the system as synthesized to netlist (gate-level) form, with zero propagation delay assumed through gates. To determine the correctness of the functional unit under test, all possible input vectors should be applied and checked against known correct or expected outputs from the functional unit under test. Thus, testbenches are required and need to be developed for this purpose. However, if the number of inputs for the functional unit under test is large, fully testing all the possible inputs or stimulus can be quite complex. Hence, automated generation of the testbenches is preferred. In order to achieve this objective, C++ was used to write a program capable of generating testbenches in the required format. For ease of re-running the simulation process, the script file editor, a feature of the Xilinx Foundation Simulator, was used to eliminate the process of inputting test vectors after each simulation run. Figure 6.1 below shows the testing model that was used for verifying the functionality of lower level functional components. The testing model shown in Figure 6.1 below was captured through VHDL as an entity, with the functional unit under test instantiated and its output compared to the expected or theoretically correct result from the testbench.

6.1.1. Adders Different types of Carry Lookahead Adders (CLA) were employed within the convolution system. The main difference between all of them lies in the length of the operands that they operate on and depending on the length they are referred to as 14-bit,

15-bit, 16-bit, 17-bit and 19-bit CLAs. To check that these lower level functional components operate as intended, Post-Synthesis simulation was used. One of the most utilized CLAs is the 14-bit CLA, which is duplicated within each Multiplication Unit (MU) contained in the convolution system.

[Figure 6.1: the stimulus/test vectors from the testbench drive the Functional Unit Under Test, and its output is compared with the expected (theoretically correct) result from the testbench; the comparator output err is zero when the two results are identical and one otherwise.]

Figure 6.1. Testing model for lower level functional components.

The testing methodology described in Figure 6.1 was used. The VHDL file that contains the testbench entity and C++ program source code that generates the theoretically correct outputs (in a file format that is acceptable to the script editor for software simulation) can be found in Appendix B. Figure 6.2 shows the Post-Synthesis simulation output of the testbench for the 14-bit CLA. However, due to the length of the simulation, selective test vectors were used instead of an exhaustive (all possible inputs) set. As can be seen from Figure 6.2, the signal err remains low throughout the simulation and indicates that the outputs from the unit under test agree with the theoretically correct outputs generated from the C++ program.
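As an illustration of the automated vector generation described above, the following C++ sketch writes operand pairs and expected 14-bit sums for the CLA testbench. It is a sketch only: the simple text output format, the file name and the fixed pseudo-random seed are assumptions, whereas the actual Appendix B program emits vectors in the Xilinx script editor format.

#include <cstdio>
#include <cstdlib>

// Generate selective (not exhaustive) test vectors for a 14-bit adder:
// two 14-bit operands and the expected 14-bit wrap-around sum per line.
int main() {
    FILE* f = std::fopen("cla14_vectors.txt", "w");
    if (!f) return 1;
    std::srand(1);
    for (int i = 0; i < 1000; ++i) {
        unsigned a = std::rand() & 0x3FFF;        // 14-bit operand
        unsigned b = std::rand() & 0x3FFF;        // 14-bit operand
        unsigned expected = (a + b) & 0x3FFF;     // 14-bit sum (carry-out dropped)
        std::fprintf(f, "%04X %04X %04X\n", a, b, expected);
    }
    std::fclose(f);
    return 0;
}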

Figure 6.2. Post-Synthesis simulation for 14-bit CLA.

Figure 6.3 below shows a close up view of one segment of the testbench simulation shown in Figure 6.2 above. Buses vec_a and vec_b are the two input operands, while ans_ut is the output from the unit under test (14-bit CLA) and ans is the

theoretically correct output. For instance, at the left-most of bus vec_a one sees a hexadecimal value of 0008 (8 in decimal) and vec_b shows a hexadecimal value of 2000 (-8192 in decimal), thus the sum should be 2008 (-8184 in decimal), which is the same value shown on both buses ans and ans_ut.

Figure 6.3. A close up view of one segment of Figure 6.2.

The procedure for testing all the other CLAs with different operand lengths is the same as shown above. Figure 6.4, Figure 6.5, Figure 6.6, and Figure 6.7 show Post-Synthesis simulation results of the testbenches for each CLA. As can be seen from the figures, the err signal stays low throughout, indicating that the outputs from the units under test agree with the predicted correct results generated by the C++ program.

Figure 6.4. Post-Synthesis simulation for 15-bit CLA.

Figure 6.5. Post-Synthesis simulation for 16-bit CLA.

Figure 6.6. Post-Synthesis simulation for 17-bit CLA.


Figure 6.7. Post-Synthesis simulation for 19-bit CLA.

6.1.2. Multiplication Unit

The Multiplication Unit (MU) is a lower level component that is replicated 25 times within each AU of version 2 of the convolution architecture for the case of n = 5. Hence, it is important to determine that the MU is functioning correctly. In order to test the MU with all possible inputs, a C++ program was written to generate the required testbench; the program can be found in Appendix B. In addition, an entity was created in a VHDL file with the MU instantiated, together with a comparator that compares the output generated by the MU with the theoretically correct output from the program (given as an input to the entity). This VHDL file can also be found in Appendix B. Figure 6.8 below shows the complete run of the testbench. Bus coef is a 6-bit wide signed filter coefficient bus, bus mag is an unsigned 8-bit magnitude input, bus product is the output generated by the unit under test (the MU in this case) and bus t_ans is the theoretically correct output. Signal err is the output from the comparator, and it will be one if the two outputs, t_ans and product, do not match. As shown in Figure 6.8 below, all the buses are packed closely and cannot be distinguished due to the length of the simulation; however, the err signal remains low for the entire simulation. Thus, the buses t_ans and product are identical throughout the simulation.

Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication Unit (MU).

Figure 6.9 below shows a close up view of one segment of the simulation. For instance, in the first part of Figure 6.9, bus coef has a value of 21 (-31 in decimal) and bus mag has a value of FB (251 in decimal), thus the product of the multiplication should have a value of 219B (14-bit signed value) in hexadecimal (-7781 in decimal). Both buses product and t_ans have the same value, and thus the err signal has a value of zero, indicating that both values agree with one another.

Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above.

6.1.3. Version 2 Convolution Architecture (with k = 1)

In the process of testing Version 2 of the convolution architecture as a whole for the case of k = 1 (see Figure 4.20), a few minor modifications were made to the system so that the Post-Synthesis simulation could be completed within a reasonable time frame. These modifications do not affect the system's intended characteristics. For instance, the intended Input Image Plane (IP) has a size of 5100×6600 pixels; for simulation purposes the IP size was reduced to 5×60 pixels. This reduction in no way hampers the system's functional characteristics. The Filter Coefficient Plane (FC) remains the same as in the previous sections: a 5×5 size. Figure 6.10 below shows a test case used to verify the functional correctness of Version 2 of the convolution architecture. The IP has a size of 5×60, but due to the page width limit of this thesis only the first seven columns of the IP can be clearly shown; the same applies to the Output Image Plane (OI) in Figure 6.10. A C++ program was written to generate the test vectors required to program all MAUs with the correct filter coefficients. This program reads a text file containing the filter coefficients and then generates waveform vectors for the script editor to use. Figure 6.11 below shows the source file for the program.


Input Image Plane (Decimal)

  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241

Filter Coefficient Plane (Decimal)

-1  2  3  0  1
 1 -1  2  1  0
 0  0  1  1  1
 1  1  0  1  1
 2 -2  1  0  1

Output Image Plane (Hexadecimal)

00259 0029C 0031D 00328 00333 0033E 00349
00400 004BD 005B9 005C8 005D7 005E6 005F5
0061F 00790 0093D 0094A 00956 00962 0096E
003C5 005E7 00756 0075F 00768 00771 0077A
002D5 0047D 0069E 006A5 006AB 006B1 006B7

Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven columns of both IP and OI are shown due to report page width limit).

The operation of the convolution architecture can be divided into three phases. The first phase starts when the system commences operation and lasts until the first two rows of the IP have been stored in the external memory devices; no OI is generated during this phase. The second phase starts when the system has enough of the IP to commence generation of the OI and ends when the IP has been completely received. Finally, the last phase starts with the system being supplied with zeros as input and continues until all of the OI has been generated.


#include <fstream>
#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    ifstream in_file1;
    ofstream out_file1, out_file2;
    in_file1.open("coef.txt");       // 5x5 filter coefficient plane
    out_file1.open("v_coef.dat");    // coefficient value vectors
    out_file2.open("v_c_reg.dat");   // coefficient register select vectors

    int array[5][5];
    int time, count, temp, a, b;
    time = 40;
    count = 1;

    // Read the FC plane, rotating it into MAU programming order.
    for (a = 4; a >= 0; a--) {
        for (b = 0; b < 5; b++) {
            in_file1 >> temp;
            cout << temp << endl;
            array[b][a] = temp;
        }
    }

    out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
    out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

    // One vector every 20 ns: the coefficient value and its register number.
    for (a = 0; a < 5; a++) {
        for (b = 0; b < 5; b++) {
            out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << array[a][b] << "\\H +" << endl;
            out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << count << "\\H +" << endl;
            time += 20;
            count++;
        }
    }

    out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
    out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;

    in_file1.close(); out_file1.close(); out_file2.close();
    return 0;
}

Figure 6.11. The source code for the C++ program that generates test vectors to program the filter coefficients into the MAUs.

Figure 6.12 below shows the arrangement of the Filter Coefficients (FCs) within each MAU contained in the Arithmetic Unit (AU). The FCs appear rotated by 90 degrees (clockwise) because the input image pixels have been rotated by 90 degrees (counter clockwise) before flowing through the AU.


MAU   1     2     3     4     5     6     7     8     9     10    11    12    13
FC    FC40  FC30  FC20  FC10  FC00  FC41  FC31  FC21  FC11  FC01  FC42  FC32  FC22

MAU   14    15    16    17    18    19    20    21    22    23    24    25
FC    FC12  FC02  FC43  FC33  FC23  FC13  FC03  FC44  FC34  FC24  FC14  FC04

Within the FC plane itself, positions 1 through 25 correspond row by row to FC00-FC04, FC10-FC14, FC20-FC24, FC30-FC34 and FC40-FC44.

Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit.

Figure 6.13 and Figure 6.14 below show the Post-Synthesis simulation output of the first phase of operation for the version 2 convolution architecture based on test case 1 shown in Figure 6.10. Figure 6.13 shows the programming of the FCs into respective MAUs. As shown in the figure, coef_regs bus acts as a write enable for each MAU within

the Arithmetic Unit (AU), and the coef bus carries the FC value given as input to each MAU.

Figure 6.13. First phase of operation; programming of FCs into MAUs.

Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown in figure above is the beginning of the second row of the input pixels).

In Figure 6.14, the input image pixel values are shown in hexadecimal rather than the decimal values of test case 1 in Figure 6.10. As can be seen in Figure 6.14, no output pixels are generated in this phase of operation, since the system has to wait until the first two rows of the IP have been received. Also shown in Figure 6.14, the system receives five input image pixels and then writes them to one memory location. The second phase of the system operation is shown in the following figures. Figure 6.15 below shows the system generating the first three output pixels of the second row of the OI, matching test case 1 in Figure 6.10. Under normal operation, the output pixels are generated after the multiple pipeline delay stages contained within the AU, as shown in Figure 6.15. Figure 6.16 shows superimposed (timing delay not included) output pixels with their corresponding input pixels for ease of comparison. The output pixels shown in Figure 6.16 are the first six output pixels of the second row of the OI. Buses o1, o2, o3, o4 and o5 are the output buses from the IDS functional unit to the AU (all bus values are shown in hexadecimal) and each bus is 40 bits wide (five input image pixels). In Figure 6.16 below, the 25 input image pixels that correspond to each output pixel are highlighted. As can be seen from Figure 6.16, all the output pixels are as predicted in Figure 6.10.

Figure 6.15. Second phase of operation; output pixels generated.


Figure 6.16. Second phase of operation; output pixels of the second row of OI (superimposed).

The third phase of system operation is shown in Figure 6.17 below, in which the system generates the first six output pixels of the last row of the OI. In this phase of operation, the input image pixels have been completely received and zeros are inserted into the system. The six output pixels shown in Figure 6.17 were compared with, and agree with, the output pixels predicted in test case 1 shown in Figure 6.10.

Figure 6.17. Third phase of operation; output pixels of the last row of OI (superimposed).

A second test (test case 2) was also done to further investigate the correctness of operation of the system. Figure 6.18 shows the FCs, the IP (first seven columns) and the OI (first seven columns) for test case 2. As in the previous test case (Figure 6.10), due to the page width limit only the first seven of the 60 columns of the IP and OI are displayed.

Input Image Plane (Decimal)

  0   1   2   3   4   5   6
 60  61  62  63  64  65  66
120 121 122 123 124 125 126
180 181 182 183 184 185 186
240 241 241 241 241 241 241

Filter Coefficient Plane (Decimal)

-1  2  1  0  1
 1 -2  1  1  0
 0  0  1 -1  2
 2  1  0  1 -2
 2 -2  1  0  1

Output Image Plane (Hexadecimal)

000F0 0012F 001AA 001B0 001B6 001BC 001C2
001A9 001EB 0031E 00326 0032E 00336 0033E
00314 00392 00500 00508 0050F 00516 0051D
0025E 00318 003D2 003D8 003DE 003E4 003EA
0038B 00354 00448 0044E 00452 00456 0045A

Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns).

Figure 6.19, Figure 6.20 and Figure 6.21 show the post-synthesis simulation results of all three phases of operation for Version 2 of the convolution architecture with IP and FCs as intended in Figure 6.18. All the results from Figure 6.19 and Figure 6.20 agree with the expected results shown in Figure 6.18 above.


Figure 6.19. First phase of operation for test case 2.

Figure 6.20. Second phase of operation for test case 2; output pixels shown are the first six of row one of OI (superimposed).


Figure 6.21. Third phase of operation for test case 2; output pixels shown are the first six of the last row for OI (superimposed).

6.2. Post-Implementation Simulation

After version 2 of the convolution architecture, for the case of (k = 1), was functionally validated via post-synthesis HDL simulation, its functional and timing characteristics were studied and validated through post-implementation simulation. The following sections will describe and depict synthesis and implementation of the system to a particular Field Programmable Gate Array (FPGA) chip and the post-implementation simulations that have been done to validate the version 2 convolution architecture.

6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture (with k = 1)

In general, when a system described in an HDL is synthesized to a specific FPGA chip, the CAD package (the Xilinx Foundation Series in this case) invokes a process that translates the system described in HDL to a specific gate level netlist. The gate level netlist may consist of gate level elements or functional units that are specific to a certain family of FPGA, hence a targeted (specific) FPGA must be specified before the process begins. Following the synthesis process is the implementation process, which targets a specific FPGA chip. This process includes mapping, placing and routing of the netlist within the specific FPGA chip. Within each FPGA chip there are a

certain number of Configurable Logic Blocks (CLBs), and within each of these CLBs there are a number of Lookup Tables (LUTs) and memory elements such as Flip-Flops (FFs). The mapping process implements the gate level netlist on the FPGA chip using the available resources. Then the place and route process determines the best placement and routing of all the resources used by the mapped system such that all the components (resources) are connected according to the netlist. For this project, a prototyping board (XSV800) manufactured by XESS Co. was used. This protoboard features the Virtex family FPGA chip (XCV800) from Xilinx. Table 6.1 below shows a summary of the resources available within the FPGA chip on the protoboard. There are 4704 CLBs in this specific FPGA chip, and within each CLB there are four 4-Input LUTs and four FFs. Table 6.2 below shows the resource utilization on the XCV800 chip when version 2 of the convolution architecture (with k = 1) is implemented.

Table 6.1. Details of FPGA on the XESS protoboard.

FPGA            XCV800 (Virtex FPGA family)
System Gates    888,439
CLB Array       56×84
FF              18,816
4-Input LUT     18,816

Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1).

CLBs                      1,878
FF                        2,620
4-Input LUT               5,955
Equivalent System Gates   96,210

6.2.2. Version 2 Convolution Architecture (with k = 1) The post-implementation simulations of version 2 of the convolution architecture with k = 1 were conducted with the same test cases run in the post-synthesis simulations in the previous section. The script file and C++ programs used in the post-synthesis simulations were reused in the post-implementation simulation testing and validation processes described here. Figure 6.22 and Figure 6.23 below show the results of the second phase and third phase of operation for post-implementation simulation of test case 1 (see Figure 6.10). As can be seen from both of the figures, the highlighted output image pixels were as

predicted in Figure 6.10. Figure 6.24 and Figure 6.25 show the second and third phases of operation of the post-implementation simulation for test case 2 (see Figure 6.18). All the output image pixels highlighted within these figures are in agreement with the predicted output image pixels shown in Figure 6.18. In both Figure 6.22 and Figure 6.24, the second phase of operation is shown: after the first two rows of the IP have been stored, the convolution architecture starts the convolution process. Meanwhile, in Figure 6.23 and Figure 6.25 the third phase of operation is shown: with all of the IP received, zeros are inserted as input for the convolution system to process the last two rows of the OI. A clock frequency (clk in all figures) of 12.5 MHz has been used in all the post-implementation simulations (Figure 6.22, Figure 6.23, Figure 6.24 and Figure 6.25) conducted thus far. The main objective of the simulation testing described in this section was to validate the system functionality and performance with respect to being able to generate one OI pixel on each system clock cycle with a 5×5 FC. For the case of k = 1, the convolution architecture met these functional and performance goals.

Figure 6.22. Second phase of operation for test case 1 (post-implementation simulation); output pixels of the second row of OI (superimposed).


Figure 6.23. Third phase of operation for test case 1 (post-implementation simulation); output pixels of the last row of OI (superimposed).

Figure 6.24. Second phase of operation for test case 2 (post-implementation simulation); output pixels shown are the first six of row one of OI (superimposed).


Figure 6.25. Third phase of operation for test case 2 (post-implementation simulation); output pixels shown are the first six of the last row for OI (superimposed).

6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k = 3)

As shown in Figure 4.20, the architecture can be scaled up to perform k convolutions in parallel. To validate the scalability of the version 2 convolution architecture, the version 2 convolution architecture with three AUs instantiated was synthesized and implemented to the XCV800 FPGA chip. Table 6.3 below shows the XCV800 chip resource utilization when this convolution architecture is implemented. As the system is scaled up to process three convolutions in parallel, the total number of system gates does not increase proportionally. As can be seen from Table 6.3, the equivalent system gate count for k = 3 is 173,170 gates compared to 96,210 gates for k = 1 (Table 6.2), an increase of 80 percent, which is far less than a factor of three. This is because when the system is scaled up, only the AU needs to be replicated, and even the AU does not need to be replicated in its entirety, as discussed earlier. Comparison of the total number of CLBs utilized between the two implementations does not yield a good measurement, since not all the elements within each CLB are utilized.


Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3).

CLBs                      4,613
FF                        5,226
4-Input LUT               15,307
Equivalent System Gates   173,170

6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)

In order to validate that version 2 of the convolution architecture can be scaled up to include more than one AU (k > 1 in Figure 4.20) and continue to operate correctly from a functional and performance standpoint, this section presents post-implementation simulation results for the version 2 convolution architecture operating with three instantiated AUs. All VHDL code for version 2 of the convolution architecture with three AUs instantiated can be found in Appendix A. To validate the output image planes (OI) generated by the version 2 convolution architecture for k = 3, a C++ program was written and used that can generate different input image planes of size 5×60 pixels depending on the seed number given. The program uses the rand function to generate random numbers based on the given seed, and the generated numbers are limited to the range 0 to 255. In addition, the program also generates the three expected output image planes based on the three filter coefficient planes that it reads in. The source code of this program can be found in Appendix C. Another program, which generates all the test vectors necessary to program the individual MAUs with the filter coefficients, was also written and used. This program reads in three FC planes contained in a text file and then generates the test vectors according to the script editor format (the source code for this program can also be found in Appendix C). Two test cases were post-implementation simulated, and each test case was run with a single, different IP (generated by giving a different seed number) and three distinct FC sets. This was done to further validate correct operation and performance of the version 2 convolution architecture with k = 3. Figure 6.26 below shows test case number 1 with the inputs and expected outputs (generated by the C++ program mentioned above). However, again due to the page width limitation, the figure only shows part of the IP and the predicted OIs.
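For illustration, a minimal version of such a seeded image generator is sketched below in C++. It reproduces only the IP-generation part described above (rand-based pixels in 0 to 255 from a command-line seed); the expected-OI computation of the real Appendix C program, and its exact output format, are not reproduced here, so the names and printing format are assumptions.

#include <cstdio>
#include <cstdlib>

// Generate a reproducible 5x60 input image plane from a user-supplied seed,
// with every pixel limited to the range 0..255.
int main(int argc, char* argv[]) {
    unsigned seed = (argc > 1) ? static_cast<unsigned>(std::atoi(argv[1])) : 1;
    std::srand(seed);

    int ip[5][60];
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 60; ++c)
            ip[r][c] = std::rand() % 256;      // pseudo-random pixel value

    for (int r = 0; r < 5; ++r) {              // print the plane row by row
        for (int c = 0; c < 60; ++c)
            std::printf("%4d", ip[r][c]);
        std::printf("\n");
    }
    return 0;
}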


Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes.

Figure 6.27 and Figure 6.28 below show the results of the post-implementation HDL simulation with the inputs of test case 1 (see Figure 6.26). Figure 6.27 shows the output image pixels for the first row of the three OIs starting from the third output image pixel (signals out_pxl1, out_pxl2, and out_pxl3 are the output image pixels for the first, second and third OI respectively). Figure 6.28 shows the second row of output image pixels (starting from the third pixel) for all three OIs. All output pixels generated by post-implementation simulation of the version 2 convolution architecture agreed with the expected results shown in Figure 6.26. As can be seen from Figure 6.27 and Figure 6.28, the input image pixels highlighted within each rectangle correspond to the 25 input image pixels required for all three convolutions (one output image pixel per FC plane). Again, the clock frequency utilized in the post-implementation simulation runs of test case 1 in Figure 6.27 and Figure 6.28 is 12.5 MHz.


Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first row of the OIs for test case 1.

Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the second row of the OIs for test case 1.

For test case number 2, the IP generator program was given a seed number of 2 and hence a different IP plane was produced as shown in Figure 6.29. Figure 6.30 (OIs result for third row) and Figure 6.31 (OIs result for fourth row) show the post-

implementation simulation results for test case 2. The output results in both figures agreed with the predicted results shown in Figure 6.29. The clock frequency for test case 2 is the same as in test case 1. The highlighted input image pixels within each of the rectangles correspond to the 25 input image pixels required for each output pixel generated. Each individual OI pixel is generated within a single system clock cycle.

Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes.

Validation of version 2 of the convolution architecture has been accomplished through post-synthesis and post-implementation HDL simulation utilizing the Xilinx Foundation CAD software packages. All the simulations are done with the system implemented to a Xilinx Virtex FPGA (XCV800). As the system is scaled up to process k convolutions in parallel, the hardware increment is directly proportional to k since only AUs are replicated. A graph showing the equivalent system gates count compared to the number of FC planes is plotted and shown in Figure 6.32 below. Since all the simulation

results are as desired and correct, the version 2 convolution architecture is functionally and performance validated in that it can correctly generate three OI pixels (OI1, OI2, and

OI3) within one system clock cycle (with k = 3).

Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third row of the OIs for test case 2.

Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth row of the OIs for test case 2.

[Figure 6.32 plot: equivalent system gates (vertical axis, 0 to 200,000) versus number of FC planes (horizontal axis, 0 to 3.5).]

Figure 6.32. A plot of equivalent system gates versus number of FC planes.


Chapter 7

Hardware Prototype Development and Testing

Hardware prototype development and testing were done to experimentally validate the functionality and performance of the convolution architecture. Ideally, the convolution architecture will be implemented in ASIC technology with external SRAM (Data Memory) as shown in Figure 7.1 below. In the figure below, b and l denote the bus width for the address bus of the external SRAM and output image pixel respectively. CE, OE, and WE are chip enable, output enable and write enable control signals for the external SRAM. For example, to implement the convolution architecture with three 5×5 FC planes, a total of 113 IO (Input Output) pins are needed on the FPGA or ASIC.

Figure 7.1. Convolution Architecture hardware implementation. The FPGA or ASIC implementing the convolution architecture connects to the OI SRAMs through a b-bit address bus, a data bus, the CE, OE, and WE control signals, and k output image buses (OI1 through OIk), each l bits wide.

To further validate the functional and performance correctness of the convolution architecture, hardware emulation of version 2 of the convolution architecture was done through the development and testing of an FPGA-based prototype. A hardware prototyping board manufactured by the XESS Corporation [17], which features Xilinx Virtex FPGA (XCV800) technology, was available and utilized. Figure 7.2 below shows a picture of the XSV-800 prototype board. Even though the XCV800 FPGA has enough IO pins to handle the convolution architecture configuration shown in Figure 7.1 above, the SRAM on the prototyping board has a lower data bandwidth than desired. Because of this, the Data Memory was inferred or emulated within the FPGA.

Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture obtained from XESS Co. website, http://www.xess.com).

The hardware emulation is carried out by programming the FPGA with the convolution architecture through the parallel port of a computer. The Xilinx Foundation series CAD software package [18] was utilized to generate the bit stream file (FPGA configuration bit stream) necessary to program the FPGA with the desired convolution architecture hardware description. A software utility package, XSTOOLs, is provided by the XESS Corporation for use with the prototyping board. The package includes programs for bit stream download (FPGA programming), clock frequency setting, and on-board SRAM content retrieval or initialization.

7.1. Board Utilization Modules and Prototype Setup

As shown in Figure 7.2, the prototyping board contains many auxiliary parts such as push buttons, LEDs, SRAMs, and a parallel port. To utilize these parts, the FPGA must be programmed with the appropriate driver or module; these drivers or modules are implemented within the FPGA since all of these parts are connected to it. Thus, for the purpose of hardware emulation of the convolution architecture system, a means of supplying the input image plane (IP) and storing the output image plane (OI) is necessary and must be developed. An internal Block RAM within the FPGA is implemented to provide the convolution system with the input image pixels. The internal RAM is initialized with the input image pixels when the system is synthesized and implemented. Figure 7.3 below shows a pictorial view of the prototyping hardware. All functional blocks or modules within the FPGA were implemented with VHDL.

Figure 7.3. Top level view of the prototyping hardware. Within the FPGA, the FC Programming module (fed by the 8-bit parallel port) supplies the 6-bit filter coefficients to the Convolution System, a Block RAM (Data Memory) supplies the input image pixels, and an SRAM driver writes the 19-bit output pixels to the external SRAM over its data lines, 21 address lines, and 6 control signals; an external clock drives the system.

In Figure 7.3, there is an FC Programming module. This module, as the name implies, is responsible for initializing all the filter coefficients within the convolution system. The filter coefficients are supplied from the computer through the parallel port. A C++ program (found in Appendix D) was written to read the filter coefficients for each FC plane from a file and send them through the parallel port to the module. The FC Programming module receives two bytes of data from the parallel port to program one filter coefficient register. In addition, the FC Programming module sends the output image plane to the external SRAM for later storage and analysis. Due to the limited width of the external SRAM data bus, only one OI plane can be retained from each run.

Also shown in the figure is the Block RAM module within the FPGA chip. This module is responsible for providing the convolution system with the input image pixels and is implemented in VHDL; it uses the internal RAM within the FPGA to store the input image pixels. To generate the VHDL file for this module easily, a C++ program was written to produce the file from a given random-number seed (a simplified sketch of such a generator follows Figure 7.4 below). By using the same seed numbers as those used in the previous chapter, the same input image pixels can be generated. An example of the generated VHDL file is shown in Figure 7.4 below. The C++ program can be found in Appendix D.

library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all;

entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM;

architecture STRUCT of IN_RAM is

component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8;

attribute INIT_00: string; attribute INIT_00 of IRM: label is "9fb6add75e5a290b8713f55f888c72ce37e06d31362b091dd779a254c5b24a30"; attribute INIT_01: string; attribute INIT_01 of IRM: label is "80b20000c7212be922035da80f3273676826daf8fd91c35f0b99e093f209c61c"; attribute INIT_02: string; attribute INIT_02 of IRM: label is "2f2d2f198cec76303797682ed5553d18eb5345050260bccdc4eed36fc92e4910"; attribute INIT_03: string; attribute INIT_03 of IRM: label is "e4392f650000e90330e84c421110aa1db0d9d34e544884c1b5a7ce5aeaff060d"; attribute INIT_04: string; attribute INIT_04 of IRM: label is "29af02af2cf91bfc241bd2ada5d75262f228d437b5d0fc6e8f18bb82b5216f9e"; attribute INIT_05: string; attribute INIT_05 of IRM: label is "8f552a661c42000095af0ea4bb1cfdb88c34cdf122f8ae8904447e84657c6a00"; attribute INIT_06: string; attribute INIT_06 of IRM: label is "e1ab005cf16e41b0d174d275a95fa85c177eb6d6ec1b2ef67cd891e33f88cfb7"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "679780f0f91cba3e000007b707bc25aba634015c6ab3c55053130fd44d7f9ee8"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "53bb6d2e0a67fd7071d852926a66ce6a617485308dc35ad9177c391f8f32a8b3"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "00000000000000000000000066896a4dc61c8c7b1434242dcc20e505cb7aa7a9";

signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0);

Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the program).

signal adr : std_logic_vector(8 downto 0); signal en : std_logic; signal we : std_logic;

begin

L1: din <= (others=>'0'); L2: en <= '1'; L3: we <= '0'; L4: adr <= std_logic_vector(addr);

P1: process(clk, rst, req) is begin if (rst = '1') then addr <= (others=>'0'); elsif (clk'event and clk = '1') then if (req = '1') then addr <= addr + 1; end if; end if; end process P1;

IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, ADDR=>adr, DO=>dout); end architecture STRUCT;

Figure 7.4 (Continued). Example of a VHDL file for creating an internal Block RAM containing input image pixels for the convolution system (seed number of 1 is provided to the program).
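A simplified sketch of how such a generator can emit the INIT_xx attribute strings is shown below. It assumes ten 32-byte INIT strings, as in the Figure 7.4 example, and assumes that the byte at the lowest Block RAM address appears rightmost in each string; the seeding and pixel distribution are likewise only illustrative, and the program actually used is listed in Appendix D.

// ram_gen.cpp (simplified sketch of the Block RAM VHDL generator; the real
// program is in Appendix D)
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[])
{
    // Ten 32-byte INIT strings, as in the Figure 7.4 example; the byte at the
    // lowest Block RAM address is assumed to appear rightmost in each string.
    const int strings = 10;
    const int bytes_per_string = 32;
    unsigned seed = (argc > 1) ? static_cast<unsigned>(std::atoi(argv[1])) : 1u;
    std::srand(seed);

    for (int s = 0; s < strings; ++s) {
        unsigned char buf[bytes_per_string];
        for (int b = 0; b < bytes_per_string; ++b)
            buf[b] = static_cast<unsigned char>(std::rand() % 256);   // 8-bit input pixel

        std::printf("attribute INIT_%02X of IRM: label is \"", s);
        for (int b = bytes_per_string - 1; b >= 0; --b)               // lowest address rightmost
            std::printf("%02x", buf[b]);
        std::printf("\";\n");
    }
    return 0;
}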

In addition to the two modules mentioned above, there is another module that is responsible for controlling the external SRAM (a part external to the FPGA). This module generates progressive addresses so that the output image pixels are stored in ascending order, along with the other signals, such as cen (chip enable) and wen (write enable), needed for proper operation of the SRAM. All modules mentioned in this section were developed and implemented via VHDL descriptions; the VHDL files for these modules can be found in Appendix E.

7.2. Hardware Prototyping Flow

After a design is synthesized and implemented through use of the CAD packages, a bit stream file (FPGA configuration bit file) for a specific FPGA chip is generated. In this case, the bit stream file contains the configuration bits for the convolution architecture as well as the auxiliary modules, generated for a Xilinx XCV800 FPGA chip. Next, the bit stream file is programmed into the FPGA through the parallel port of a computer. For this particular prototyping board by the XESS Co., an FPGA configuration or download program, gxsload, is provided. Figure 7.5 shows the graphical interface of the gxsload program once it is executed.

Figure 7.5. FPGA configuration and bit stream download program, gxsload from XESS Co.

After the FPGA chip has been configured with the convolution architecture, it is ready for experimental hardware testing and validation of the convolution architecture. Since the input image pixels are stored within the Block RAM module inside the FPGA, the only time the system requires input external to the FPGA is for filter coefficient programming. This is done through hardware (the FC Programming module) and software. The software program (found in Appendix D) was written in C++ to read the filter coefficients from a text file, coef.txt (the same file shown in Figure 6.26), and send all the data through the parallel port to the convolution architecture. This file also specifies which OI plane the external SRAM stores during each experimental run. Figure 7.6 below shows a segment of the verbose output from the execution of the FC configuration program. The program enters the filter coefficients in the order shown in Figure 6.12. For each filter coefficient, two bytes of data are sent through the parallel port: the first byte indicates the position of the filter coefficient within the AU, and the following byte is the filter coefficient itself.


Figure 7.6. Execution of the FC configuration program.
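The two-byte framing can be summarized with the following C++ sketch. The packing of the AU select bits into the position byte and the stub for the actual parallel-port write are assumptions made for illustration; the real configuration program is the one listed in Appendix D.

// fc_config.cpp (sketch of the two-byte filter coefficient framing; the actual
// configuration program is listed in Appendix D)
#include <cstdint>

// Platform-specific parallel-port write (for example, outb() on port 0x378 under
// Linux after ioperm(), or _outp() with older Windows compilers); left as a stub
// because the mechanism is not detailed in the text.
void write_parallel_byte(uint8_t value);

// For every filter coefficient, two bytes are sent: the first gives the
// coefficient's position within the AU (1 through 25), the second the 6-bit
// coefficient value. Packing the AU select bits into the upper bits of the
// position byte is a hypothetical encoding used only for illustration.
void program_fc_plane(const uint8_t coef[25], uint8_t au_select)
{
    for (int pos = 0; pos < 25; ++pos) {
        uint8_t position_byte =
            static_cast<uint8_t>((au_select << 5) | (pos + 1));   // hypothetical packing
        write_parallel_byte(position_byte);      // byte 1: where the coefficient goes
        write_parallel_byte(coef[pos] & 0x3F);   // byte 2: the coefficient itself
    }
}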

Next, the convolution process is commenced when the start push button on the prototyping board is pressed; one of the push buttons on the prototyping board is mapped as the start signal for the convolution system. Since the execution of the convolution architecture is transparent to the user, an LED on the prototyping board is mapped to the inverse of the SRAM's write enable signal (the inverse of the wen signal in the highest level of the VHDL description). Consequently, once the convolution architecture finishes its execution, the SRAM's write enable line is pulled low and the LED lights. Then, the output image pixels stored in the external SRAM are retrieved using the gxsload program, the same program used to download the FPGA configuration file. Figure 7.7 shows the graphical interface of the gxsload program when used to upload SRAM contents to a file. The uploaded SRAM content is stored in Intel hex file format; Figure 7.8 below shows the uploaded SRAM contents in a file. There are two banks of SRAM on the prototyping board, a left bank and a right bank. Each bank of the SRAM has a 16-bit data bus and a 19-bit address bus. Since the output image pixels are 19 bits wide, both banks of the SRAM are utilized. As is evident from Figure 7.8 below, it is tedious to trace and compare the uploaded SRAM contents to the expected output results. As mentioned in the previous section, a program was written to generate the theoretically correct output image pixels (shown in Figure 6.26). In order to compare the uploaded results with the known correct results efficiently, a C++ program was written to parse the Intel hex file and check it against the theoretically correct output for similarity. The source code for this program can be found in Appendix D.


Figure 7.7. Uploading SRAM content using the gxsload utility; the high address indicates the upper bound of the SRAM address space, whereas the low address indicates the lower bound of the SRAM address space.

Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are two segments because the program wrote the right bank of the SRAM (16 bits) first and then the left bank (the 16 MSB bits).
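A minimal version of the hex-parsing step is sketched below. It collects only the data bytes from type-00 records and ignores checksums; how the two 16-bit bank segments are recombined into 19-bit OI pixels (per the layout of Figure 7.8) is left to the caller, and the actual comparison program is the one in Appendix D.

// hex_check.cpp (minimal Intel hex data extraction; the full comparison program
// is in Appendix D)
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Collect the raw data bytes from the type-00 (data) records of a gxsload
// upload file. Checksums and address gaps are ignored for brevity.
std::vector<uint8_t> read_hex_data(const std::string& filename)
{
    std::vector<uint8_t> bytes;
    std::ifstream in(filename);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] != ':') continue;           // not a hex record
        int count = std::stoi(line.substr(1, 2), nullptr, 16);  // data byte count
        int type  = std::stoi(line.substr(7, 2), nullptr, 16);  // record type
        if (type != 0) continue;                                // keep data records only
        for (int i = 0; i < count; ++i)
            bytes.push_back(static_cast<uint8_t>(
                std::stoi(line.substr(9 + 2 * i, 2), nullptr, 16)));
    }
    return bytes;
}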

7.3. Test Cases

To validate correct functional and performance operation of the FPGA-based hardware prototype of the convolution architecture, two test cases were run. The convolution architecture was run at a 2 kHz clock frequency for these test cases; maximum clock rate hardware prototype performance was not a goal of these two tests. The performance metric of interest is whether the prototype can simultaneously convolute one IP with k (n×n) FCs and generate k OI pixels on each system clock cycle. These two test cases are shown in Figure 6.26 and Figure 6.29 of the previous section. Since each test case has three different Filter Coefficient (FC) planes, three experimental runs must be carried out. Three runs are needed, even though all three OI planes are generated in each run, because of the SRAM data bus bandwidth limitation: each OI pixel requires 19 bits and the SRAM data bus is only 32 bits wide, so only one OI plane can be captured per run (two planes would require 38 bits). Figure 7.9 (first OI plane), Figure 7.10 (second OI plane), and Figure 7.11 (third OI plane) show the results obtained from the SRAM after each experimental run. The grayed areas of the figures are the Intel hex file header and checksum fields, and the highlighted boxes with arrows projected to the bottom of each figure mark the first OI pixel of the respective OI plane. To obtain the second OI pixel, slide the window to the next column as marked. Comparison of all three obtained OI planes with the results shown in Figure 6.26 reveals that they match. The comparison program executed on each of the experimental runs showed that the obtained results are identical to the expected results.


Figure 7.9. SRAM contents retrieved for first OI plane for test case 1.

Figure 7.10. SRAM contents retrieved for second OI plane for test case 1.

Figure 7.11. SRAM contents retrieved for third OI plane for test case 1.

Figure 7.12 (first OI plane), Figure 7.13 (second OI plane), and Figure 7.14 (third OI plane) below show the experimental runs for test case 2. Again, the OI planes retrieved from each experimental run were compared, using the comparison program, to the expected results shown in Figure 6.29 of the previous section. All three OI planes matched the expected results, again validating the correctness of the convolution architecture. From the results obtained through testing of the hardware prototype, the functional and performance correctness of version 2 of the convolution architecture is thus further validated.

Figure 7.12. SRAM contents retrieved for first OI plane for test case 2.

Figure 7.13. SRAM contents retrieved for second OI plane for test case 2.


Figure 7.14. SRAM contents retrieved for third OI plane for test case 2.


Chapter 8

Conclusions

In summary, the main objective of this thesis research project was to develop the architecture for, and to design, validate, and build a hardware prototype of, a convolution architecture capable of processing an input image plane such that an output image pixel is obtained every clock cycle, assuming convolution with one FC plane. In addition, the convolution architecture needed to be scalable in both the filter coefficient plane size (kernel size) and the number of filter coefficient planes that can be simultaneously processed. The motivations behind the scalability were, first, that the convolution architecture can be tailored to any kernel size and still produce one output image pixel per clock cycle, and second, that the architecture can host k kernels of any size and have the functional and performance capability to output k output image pixels on each system clock cycle. The developed convolution architecture was captured with the VHDL hardware description language. Xilinx Foundation series CAD software packages were used to synthesize and implement the architecture to an FPGA chip. Before the architecture was prototyped onto the prototyping board for experimental testing, it was functionally and performance validated through post-synthesis and post-implementation HDL software simulation. Experimental testing of the architecture was done on a prototyping board featuring a Virtex family FPGA. Post-synthesis and post-implementation HDL software simulation and experimental testing of the hardware prototype showed that the implemented prototype of the convolution architecture was indeed functionally correct as intended. It is felt that if the convolution architecture were implemented in high speed ASIC “production” technology with a high speed external SRAM, the goal of convoluting a single IP pixel with k (n×n) FCs and generating k OI pixels within a clock cycle of 7.3 ns could be achieved. If required, and as indicated earlier, pipelining of the multiply unit within the MAUs of the AUs would further increase overall system performance.

As a side note, a convolution program with three 5×5 FC planes and a 5100×6600 IP plane was run on a general purpose processor (a 650 MHz AMD Athlon) for a “loose” comparison to the performance of the new convolution architecture. The processor used, on average, 0.4 seconds of system time to convolute the one IP plane with three FC planes, which indicates that a processor running at around 260 MHz would be able to meet the “production” requirements for the new convolution architecture system. However, the cost/performance ratio of the general purpose processor will be higher than that of the version 2 convolution architecture implemented in ASIC technology, considering the die sizes of the two architectures (the convolution architecture has less than roughly ten percent of the general purpose processor's transistor count). In conclusion, the best cost/performance ratio can be obtained by implementing the new convolution architecture in “production” ASIC technology, which should allow the system clock of the convolution architecture to reach the desired cycle time of 7.3 ns or less. Thus, the primary factors that determine the performance of the new convolution architecture are the speed of the implementation technology, optimization of the layout/placement of the implementation to reduce longest-path delays, and the degree of pipelining one chooses to design into the system.

Appendix A

VHDL Code for Version 2 Discrete Convolution Architecture (Figure 4.20 for k = 3)

1. Version 2 Convolution Architecture -- sys.vhd (Top Level System of Version 2 Convolution Architecture) library IEEE; use IEEE.std_logic_1164.all; entity SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config pin from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end entity SYS; architecture STRUCT of SYS is component CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end component CTR;

component RCNT is port( clk, rst, r_inc, sd_inc: in std_logic; eoc, sds, rgt: out std_logic ); end component RCNT;

component REG_A is generic( n: integer := 8; -- denotes the data width d: integer := 5 );-- denotes the number of registers port( clk, rst, z_pad, z_input: in std_logic; reg_sel: in std_logic_vector(2 downto 0); d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector((n*d)-1 downto 0); ids_out: out std_logic_vector(n-1 downto 0) ); end component REG_A;

component C_BANK1 is port( clk, rst, f_sel, z_pad: in std_logic; en_sf, en_w: in std_logic_vector(1 downto 0); ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end component C_BANK1;

component RAM is port( wclk, r_w: in std_logic; d_in: in std_logic_vector(39 downto 0); addr: in std_logic_vector(6 downto 0);

d_out: out std_logic_vector(39 downto 0) ); end component RAM;

component MEMPTR is port( clk, rst, rot, s_inc: in std_logic; inc: in std_logic_vector(1 downto 0); reg_sel: in std_logic_vector(2 downto 0); bor: out std_logic; -- begining of a new row bseq: out std_logic_vector(3 downto 0); add_out: out std_logic_vector(6 downto 0) ); end component MEMPTR;

component IDS is port( clk, rst, ans: in std_logic; ids_in: in std_logic_vector(39 downto 0); o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0)); end component IDS;

component DU is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); du_out: out std_logic_vector(39 downto 0)); end component DU;

component AU is port( clk, rst: in std_logic; du0, du1, du2, du3, du4: in std_logic_vector(39 downto 0); p_en: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); out_pxl: out std_logic_vector(18 downto 0); ovf: out std_logic); end component AU;

-- Internal signals to connect components signal r_inc, sd_inc, eoc, sds, rgt: std_logic; --(Row counter <-> Controller) -- signal req : std_logic; signal z_pad, z_input : std_logic; --(Controller -> REG_A ) signal rot, s_inc, bor : std_logic; --(Controller -> MEMPTR ) signal r_w : std_logic; --(Controller -> RAM ) signal f_sel : std_logic; --(Controller -> BANK_1 ) signal ans : std_logic; --(Controller -> IDS ) signal a1, a2, a3 : std_logic; --(Controller->a1->a2->a3->ans) signal ovf1, ovf2, ovf3 : std_logic; --(Overflow from AUs) --(Controller -> MEMPTR) signal c_inc : std_logic_vector(1 downto 0); --(Controller -> BANK_1) signal en_w, en_sf : std_logic_vector(1 downto 0); --(Controller -> REG_A, MEMPTR) signal reg_sel : std_logic_vector(2 downto 0); --(REG_A -> RAM) signal rega_ram : std_logic_vector(39 downto 0); --(REG_A -> IDS) signal rega_ids : std_logic_vector(7 downto 0); --(MEMPTR -> C_BANK1) signal bseq : std_logic_Vector(3 downto 0); --(C_BANK1 -> IDS) signal cbank_ids : std_logic_vector(31 downto 0); --(RAM -> C_BANK1) signal ram_cbank : std_logic_vector(39 downto 0); --(MEMPTR -> RAM) Ram address signal memptr_ram : std_logic_vector(6 downto 0); --(IDS -> DUs) signal o1, o2, o3, o4, o5 : std_logic_vector(39 downto 0); --(Combined output from REG_A and C_BANK1 into ids_in) signal ids_in : std_logic_vector(39 downto 0); --(DUs -> AUs) signal du_au1, du_au2, du_au3 : std_logic_vector(39 downto 0); signal du_au4, du_au5 : std_logic_vector(39 downto 0); --(AUs -> Output Pixels)

signal out_pxl1 : std_logic_vector(18 downto 0); signal out_pxl2 : std_logic_vector(18 downto 0); signal out_pxl3 : std_logic_vector(18 downto 0); --(AU's select line for programming) signal a_sel : std_logic_vector(2 downto 0); --(Output select register for holding output selection from parallel port) signal op_sel_reg : std_logic_vector(1 downto 0); --(ans delays signals) signal ds : std_logic_vector(13 downto 0); begin

-- Main Controller of Version 2 Convolution Architecture U0: CTR port map(clk=>clk, rst=>rst, str=>str, bor=>bor, eoc=>eoc, sds=>sds, rgt=>rgt, f_sel=>f_sel, z_pad=>z_pad, reg_sel=>reg_sel, en_w=>en_w, en_sf=>en_sf, z_input=>z_input, c_inc=>c_inc, rot=>rot, r_inc=>r_inc, req=>req, r_w=>r_w, s_inc=>s_inc, sd_inc=>sd_inc, ans=>a1);

-- Row counter for the main controller U1: RCNT port map(clk=>clk, rst=>rst, r_inc=>r_inc, sd_inc=>sd_inc, eoc=>eoc, sds=>sds, rgt=>rgt);

-- Register A of the DM_IF U2: REG_A port map(clk=>clk, rst=>rst, z_pad=>z_pad, z_input=>z_input, reg_sel=>reg_sel, d_in=>d_in, d_out=>rega_ram, ids_out=>rega_ids);

-- C_BANK1 of the DM_IF U3: C_BANK1 port map(clk=>clk, rst=>rst, f_sel=>f_sel, z_pad=>z_pad, en_sf=>en_sf, en_w=>en_w, ld_reg=>reg_sel, bseq=>bseq, d_in=>ram_cbank, d_out=>cbank_ids);

-- RAM U4: RAM port map(wclk=>clk, r_w=>r_w, d_in=>rega_ram, addr=>memptr_ram, d_out=>ram_cbank);

-- MEMPTR (memory pointer) U5: MEMPTR port map(clk=>clk, rst=>rst, rot=>rot, s_inc=>s_inc, inc=>c_inc, reg_sel=>reg_sel, bor=>bor, bseq=>bseq, add_out=>memptr_ram);

-- IDS L1: ids_in <= rega_ids & cbank_ids; -- Combine signals output from REG_A and CBANK u6: IDS port map(clk=>clk, rst=>rst, ans=>ans, ids_in=>ids_in, o1=>o1, o2=>o2, o3=>o3, o4=>o4, o5=>o5); -- This process is to create delays such that the output from IDS will output at -- the same time. It also takes care of the boundary outputs from line to line. D2: process (clk, rst, a1, a2) is begin

if (rst = '1') then a2 <= '0'; a3 <= '0'; ans <= '0'; elsif (clk'event and clk = '1') then a2 <= a1; a3 <= a2; ans <= a3; end if; end process D2; -- This process is to propagate the ans signal throughout all the pipeline stages to -- the SRAM writer interface so it can start "recording". D3: process (clk, rst, ans, ds) is begin

if (rst = '1') then ds <= (others => '0'); elsif (clk'event and clk = '1') then ds(0) <= ans; ds(1) <= ds(0);

ds(2) <= ds(1); ds(3) <= ds(2); ds(4) <= ds(3); ds(5) <= ds(4); ds(6) <= ds(5); ds(7) <= ds(6); ds(8) <= ds(7);

ds(9) <= ds(8); ds(10) <= ds(9); ds(11) <= ds(10); ds(12) <= ds(11); ds(13) <= ds(12); end if; end process D3; L2: sram_w <= ds(13) or ds(12) or ds(11) or ds(10);

-- DUs u7: DU port map(clk=>clk, rst=>rst, ids_in=>o1, du_out=>du_au1); u8: DU port map(clk=>clk, rst=>rst, ids_in=>o2, du_out=>du_au2); u9: DU port map(clk=>clk, rst=>rst, ids_in=>o3, du_out=>du_au3); u10: DU port map(clk=>clk, rst=>rst, ids_in=>o4, du_out=>du_au4); u11: DU port map(clk=>clk, rst=>rst, ids_in=>o5, du_out=>du_au5);

-- AUs u12: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(0), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl1, ovf=>ovf1); u13: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(1), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl2, ovf=>ovf2); u14: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3, du3=>du_au4, du4=>du_au5, p_en=>a_sel(2), ld_reg=>ld_reg, coef=>coef, out_pxl=>out_pxl3, ovf=>ovf3);

AUP_SEL: process (au_sel) is begin

case (au_sel) is when "01" => a_sel <= "001"; when "10" => a_sel <= "010"; when "11" => a_sel <= "100"; when others => a_sel <= "000"; end case; end process AUP_SEL;

-- Testing Purposes -- Output selection logic OP_SEL: process (clk, rst, o_sel) is begin

if (rst = '1') then op_sel_reg <= (others => '0'); elsif (clk'event and clk = '1') then if (o_sel = '1') then op_sel_reg <= coef(1 downto 0); else op_sel_reg <= op_sel_reg; end if; end if; end process OP_SEL;

-- Output selection mux OP_D: process (op_sel_reg, out_pxl1, out_pxl2, out_pxl3) is begin

case (op_sel_reg) is when "00" => d_out <= out_pxl1; when "01" => d_out <= out_pxl2;

89 when "10" => d_out <= out_pxl3; when others => d_out <= out_pxl1; end case; end process OP_D; end architecture STRUCT;

2. Controller Unit (CU) -- ctr.vhd (Controller) library IEEE; use IEEE.std_logic_1164.all; entity CTR is port( clk, rst, str, bor, eoc, sds, rgt: in std_logic; f_sel: out std_logic; z_pad: out std_logic; reg_sel: out std_logic_vector(2 downto 0); en_w: out std_logic_vector(1 downto 0); en_sf: out std_logic_vector(1 downto 0); z_input: out std_logic; c_inc: out std_logic_vector(1 downto 0); rot: out std_logic; r_inc: out std_logic; req: out std_logic; r_w: out std_logic; s_inc: out std_logic; sd_inc: out std_logic; ans: out std_logic ); end entity CTR; architecture BEHAVIORAL of CTR is

type statetype is (st0, st1, st2, st3, st4, st5, st6, st7, st8, st9, st10, st11, st12, st13, st14, st15, st16, st17, st18, st19, st20, st21, st22, st23, st24, st25, st26, st27, st28, st29, st30, st31, st32, st33, st34, st35, st36, st37, st38, st39, st40, st41, st42, st43, st44); signal c_st, n_st: statetype; signal tog : std_logic;

begin

NXTSTPROC: process (c_st, str, bor, eoc, rgt, tog, sds) is begin

case c_st is when st0 => if (str = '1') then n_st <= st1; else n_st <= st0; end if;

when st1 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if;

when st2 => n_st <= st3;

when st3 => n_st <= st4;

when st4 => n_st <= st5;

when st5 => n_st <= st6;

90 when st6 => if (bor = '1') then n_st <= st7; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st7 => n_st <= st8; when st8 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when st9 => n_st <= st10; when st10 => n_st <= st11; when st11 => n_st <= st12; when st12 => n_st <= st13; when st13 => if (bor = '1') then n_st <= st14; else if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st14 => n_st <= st15; when st15 => if (rgt = '0' and tog = '0') then n_st <= st9; elsif (rgt = '0' and tog = '1') then n_st <= st2; elsif (rgt = '1' and tog = '0') then n_st <= st23; else n_st <= st16; end if; when ST16 => n_st <= st17; when st17 => n_st <= st18; when st18 => n_st <= st19; when st19 => n_st <= st20; when st20 => if (bor = '1') then

91 n_st <= st21; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st21 => n_st <= st22; when st22 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when ST23 => n_st <= st24; when st24 => n_st <= st25; when st25 => n_st <= st26; when st26 => n_st <= st27; when st27 => if (bor = '1') then n_st <= st28; else if (tog = '0') then n_st <= st23; else n_st <= st16; end if; end if; when st28 => n_st <= st29; when st29 => if (eoc = '1') then if (tog = '0') then n_st <= st37; else n_st <= st30; end if; else if (tog = '0') then n_st <= St23; else n_st <= st16; end if; end if; when st30 => n_st <= st31; when st31 => n_st <= st32; when st32 => n_st <= st33; when st33 => n_st <= st34; when st34 => if (bor = '1') then n_st <= st35; else

n_st <= st37; end if;

when st35 => n_st <= st36;

when st36 => if (sds = '1') then n_st <= st44; else n_st <= st37;

end if;

when st37 => n_st <= st38;

when st38 => n_st <= st39;

when st39 => n_st <= st40;

when st40 => n_st <= st41;

when st41 => if (bor = '1') then n_st <= st42; else n_st <= st30; end if;

when st42 => n_st <= st43;

when st43 => if (sds = '1') then n_st <= st44; else n_st <= st30; end if;

when st44 => n_st <= st44;

when others => null; end case; end process NXTSTPROC;

CURSTPROC: process (clk, rst) is begin

if (rst = '1') then c_st <= st0; elsif (clk'event and clk = '0') then c_st <= n_st; end if; end process CURSTPROC;

OUTCONPROC: process (c_st) is begin

case c_st is when st0 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0';

93 ans <= '0'; when st1 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '0'; ans <= '0'; when st2 => reg_sel <= "001"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st3 => reg_sel <= "010"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st4 => reg_sel <= "011"; f_sel <= '0'; en_w <= "01"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st5 => reg_sel <= "100"; f_sel <= '0'; en_w <= "01"; en_sf <= "00";

94 z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st6 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st7 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st8 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st9 => reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1';

95 tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st10 => reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st11 => reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st12 => reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st13 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';

96 when st14 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st15 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st16 => reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st17 => reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st18 => reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0';

97 c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st19 => reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st20 => reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st21 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st22 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '0'; r_w <= '0';

sd_inc <= '0'; s_inc <= '1';

ans <= '0'; when st23 => reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st24 => reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st25 => reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01";

z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st26 => reg_sel <= "100"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1';

99 when st27 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '0'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st28 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '1'; r_inc <= '1'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st29 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '1'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st30 => reg_sel <= "001"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st31 => reg_sel <= "010"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1';

100 c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st32 => reg_sel <= "011"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st33 => reg_sel <= "100"; f_sel <= '1'; en_w <= "01"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st34 => reg_sel <= "101"; f_sel <= '1'; en_w <= "00"; en_sf <= "10"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st35 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0';

101 sd_inc <= '1'; s_inc <= '1'; ans <= '0'; when st36 => reg_sel <= "000"; f_sel <= '1'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st37 => reg_sel <= "001"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "01"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st38 => reg_sel <= "010"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st39 => reg_sel <= "011"; f_sel <= '0'; en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st40 => reg_sel <= "100"; f_sel <= '0';

102 en_w <= "10"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st41 => reg_sel <= "101"; f_sel <= '0'; en_w <= "00"; en_sf <= "01"; z_pad <= '0'; z_input <= '1'; c_inc <= "10"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '1'; sd_inc <= '0'; s_inc <= '1'; ans <= '1'; when st42 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '1'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '1'; s_inc <= '1'; ans <= '0'; when st43 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '1'; c_inc <= "00"; rot <= '0'; r_inc <= '0'; req <= '0'; tog <= '1'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0'; when st44 => reg_sel <= "000"; f_sel <= '0'; en_w <= "00"; en_sf <= "00"; z_pad <= '1'; z_input <= '0'; c_inc <= "00"; rot <= '0';

r_inc <= '0'; req <= '0'; tog <= '0'; r_w <= '0'; sd_inc <= '0'; s_inc <= '1'; ans <= '0';

when others => null; end case;

end process OUTCONPROC; end architecture BEHAVIORAL;

3. Memory Pointers Unit (MPU) -- mem_ptr.vhd (Memory Pointers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity MEMPTR is port( clk, rst, rot: in std_logic; inc: in std_logic_vector(1 downto 0); reg_sel: in std_logic_vector(2 downto 0); bor: out std_logic; -- begining of a new row bseq: out std_logic_vector(3 downto 0); add_out: out std_logic_vector(12 downto 0) ); end entity MEMPTR; architecture BEHAVIORAL of MEMPTR is

signal count1, count2 : unsigned(9 downto 0); signal ptr_a, ptr_b, ptr_c, ptr_d, ptr_e: unsigned(2 downto 0); signal b_seq : unsigned(3 downto 0); signal eor : std_logic;

begin

CNTR1: process (clk, rst, inc) is begin

if (rst = '1') then count1 <= to_unsigned(0, 10); elsif (clk'event and clk = '1') then if (inc(0) = '1') then if (count1 = to_unsigned(1020,10)) then count1 <= to_unsigned(0, 10); else count1 <= count1 + 1; end if; end if; end if;

end process CNTR1;

CNTR2: process (clk, rst, inc) is begin

if (rst = '1') then count2 <= to_unsigned(1, 10); elsif (clk'event and clk = '1') then if (inc(1) = '1') then if (count2 = to_unsigned(1020,10)) then count2 <= to_unsigned(0, 10); else count2 <= count2 + 1; end if;

end if; end if; end process CNTR2;

CODOUT: process (count1, count2) is begin

if (count2 = to_unsigned(0, 10)) then eor <= '1'; else eor <= '0'; end if;

if (count1 = to_unsigned(0, 10)) then bor <= '1'; else bor <= '0'; end if; end process CODOUT;

BLK: process (clk, rst, rot) is begin

if (rst = '1') then b_seq <= to_unsigned(0, 4); elsif (clk'event and clk = '1') then if (rot = '1') then b_seq(0) <= '1'; b_seq(1) <= b_seq(0); b_seq(2) <= b_seq(1); b_seq(3) <= b_seq(2); end if; end if; end process BLK;

L1: bseq <= std_logic_vector(b_seq);

PTRS: process (clk, rst, rot) is begin

if (rst = '1') then ptr_a <= to_unsigned(0, 3); ptr_b <= to_unsigned(1, 3); ptr_c <= to_unsigned(2, 3); ptr_d <= to_unsigned(3, 3); ptr_e <= to_unsigned(4, 3); elsif (clk'event and clk = '1') then if (rot = '1') then ptr_b <= ptr_a; ptr_c <= ptr_b; ptr_d <= ptr_c; ptr_e <= ptr_d; ptr_a <= ptr_e; end if; end if; end process PTRS;

MUX: process (reg_sel, count1, count2, ptr_a, ptr_b, ptr_c, ptr_d, ptr_e, eor) is begin

case reg_sel is when "001" => if (eor = '1') then add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2); else add_out <= std_logic_vector(ptr_e) & std_logic_vector(count2); end if;

105 when "010" => if (eor = '1') then add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2); else add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2); end if;

when "011" => if (eor = '1') then add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2); else add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2); end if;

when "100" => if (eor = '1') then add_out <= std_logic_vector(ptr_a) & std_logic_vector(count2); else add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2); end if;

when "101" => add_out <= std_logic_vector(ptr_a) & std_logic_vector(count1);

when others => add_out <= std_logic_vector(to_unsigned(0, 13)); end case;

end process MUX; end architecture BEHAVIORAL;

4. Data Memory Interface (DM I/F) -- dm_if.vhd (Data Memory Interface) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity C_BANK1 is port( clk, rst, f_sel, z_pad: in std_logic; en_sf, en_w: in std_logic_vector(1 downto 0); ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end entity C_BANK1; architecture STRUCTURAL of C_BANK1 is component REGFILE is generic( n: integer := 8; -- n denotes the data width d: integer := 5); -- d denotes number of registers port( clk, rst, en_sf, en_w: in std_logic; ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end component REGFILE;

signal f_a, f_b, f_mux: std_logic_vector(31 downto 0);

begin

RF1: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(0), en_w=>en_w(0), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_a); RF2: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(1), en_w=>en_w(1), bseq=>bseq, ld_reg=>ld_reg, d_in=>d_in, d_out=>f_b);

MUX1: f_mux <= f_a when f_sel = '0' else f_b;

Z_P: d_out <= f_mux when z_pad = '0' else std_logic_vector(to_unsigned(0, 32)); end architecture STRUCTURAL;

106 -- -- regfile.vhd library IEEE; use IEEE.std_logic_1164.all; entity REGFILE is generic( n: integer := 8; -- n denotes the data width d: integer := 5); -- d denotes number of registers port( clk, rst, en_sf, en_w: in std_logic; ld_reg: in std_logic_vector(2 downto 0); bseq: in std_logic_vector(3 downto 0); d_in: in std_logic_vector(39 downto 0); d_out: out std_logic_vector(31 downto 0)); end entity REGFILE; architecture STRUCTURAL of REGFILE is component PLS_REG is generic( n: integer := 8; -- n denotes the data width d: integer := 5); -- d denotes number of registers port( clk, rst, en_ld, en_sf: in std_logic; d_in: in std_logic_vector((n*d)-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component PLS_REG;

signal reg_sel: std_logic_vector(3 downto 0);

begin

LF: for f in 1 to 4 generate PR_F: PLS_REG generic map(n=>n, d=>d) port map(clk=>clk, rst=>rst, en_ld=>reg_sel(f-1), en_sf=>en_sf, d_in=>d_in, d_out=>d_out((f*n)-1 downto ((f-1)*n))); end generate LF;

SEL: process (ld_reg, en_w, bseq) is begin

if (en_w = '0') then reg_sel <= "0000"; else case ld_reg is when "001" => if (bseq(0) = '1') then reg_sel <= "0001"; else reg_sel <="0000"; end if;

when "010" => if (bseq(1) = '1') then reg_sel <= "0010"; else reg_sel <= "0000"; end if;

when "011" => if (bseq(2) = '1') then reg_sel <= "0100"; else reg_sel <= "0000"; end if;

when "100" => if (bseq(3) = '1') then reg_sel <= "1000"; else reg_sel <= "0000"; end if;

when others => reg_sel <= "0000"; end case; end if;

end process SEL; end architecture STRUCTURAL;

-- pls_reg.vhd library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity PLS_REG is generic( n: integer := 8; -- n denotes the data width d: integer := 5); -- d denotes number of registers port( clk, rst, en_ld, en_sf: in std_logic; d_in: in std_logic_vector((n*d)-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity PLS_REG; architecture STRUCTURAL of PLS_REG is signal s: std_logic_vector((n*(d+1))-1 downto 0);

begin

L1: s(n-1 downto 0) <= std_logic_vector(to_unsigned(0, n));

LK: for k in 1 to d generate REGK: S_REG generic map(n=>n) port map(clk=>clk, rst=>rst, en_ld=>en_ld, en_sf=>en_sf, p_in=>d_in((n*k)-1 downto n*(k-1)), d_in=>s((n*k)-1 downto n*(k-1)), d_out=>s((n*(k+1))-1 downto (n*k))); end generate LK;

L2: d_out <= s((n*(d+1))-1 downto n*d); end architecture STRUCTURAL;

-- reg_a.vhd library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity REG_A is generic( n: integer := 8; -- denotes the data width d: integer := 5 );-- denotes the number of registers port( clk, rst, z_pad: in std_logic; reg_sel: in std_logic_vector(2 downto 0); d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector((n*d)-1 downto 0); ids_out: out std_logic_vector(n-1 downto 0) ); end entity REG_A; architecture BEHAVIORAL of REG_A is

signal reg1, reg2, reg3, reg4, reg5, regt: unsigned(n-1 downto 0);

begin

-- Register write with conditions REGSEL: process (clk, rst, reg_sel) is begin

if (rst = '1') then reg1 <= to_unsigned(0, n); reg2 <= to_unsigned(0, n); reg3 <= to_unsigned(0, n); reg4 <= to_unsigned(0, n); reg5 <= to_unsigned(0, n); regt <= to_unsigned(0, n); elsif (clk'event and clk = '1') then

case reg_sel is when "001" => reg1 <= unsigned(d_in); regt <= unsigned(d_in);

when "010" => reg2 <= unsigned(d_in); regt <= unsigned(d_in);

when "011" => reg3 <= unsigned(d_in); regt <= unsigned(d_in);

when "100" => reg4 <= unsigned(d_in); regt <= unsigned(d_in);

when "101" => reg5 <= unsigned(d_in); regt <= unsigned(d_in);

when others => null; end case; end if; end process REGSEL;

-- Output Logic L1: d_out <= std_logic_vector(reg5) & std_logic_vector(reg4) & std_logic_vector(reg3) & std_logic_vector(reg2) & std_logic_vector(reg1);

L2: ids_out <= std_logic_vector(regt) when z_pad = '0' else std_logic_vector(to_unsigned(0, n)); end architecture BEHAVIORAL;

5. Input Data Shifters (IDS) -- ids.vhd (Input Data Shifters) -- This is the functional unit that is responsible for providing inputs to the five MAAs. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity IDS is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0)); end entity IDS; architecture STRUCTURAL of IDS is signal s1, s2, s3, s4: std_logic_vector(39 downto 0);

begin

R1: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>ids_in, d_out=>s1); L1: o1 <= s1;

R2: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s1, d_out=>s2); L2: o2 <= s2;

R3: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s2, d_out=>s3); L3: o3 <= s3;

R4: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s3, d_out=>s4); L4: o4 <= s4;

R5: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s4, d_out=>o5); end architecture STRUCTURAL;

6. Arithmetic Unit (AU) -- au.vhd (Arithmetic Unit) -- This is the combination of all the arithmetic units, which includes all the

109 -- MAUs (25 of them). library IEEE; use IEEE.std_logic_1164.all; entity AU is port( clk, rst: in std_logic; ids0, ids1, ids2, ids3, ids4: in std_logic_vector(39 downto 0); ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); out_pxl: out std_logic_vector(18 downto 0); ovf: out std_logic); end entity AU; architecture STRUCTURAL of AU is component MAA is port( clk, rst: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); img_pxl: in std_logic_vector(39 downto 0); p_rst: out std_logic_vector(16 downto 0)); end component MAA;

component AT is port( clk, rst: in std_logic; maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0); ovf: out std_logic; out_pxl: out std_logic_vector(18 downto 0)); end component AT;

signal maa0, maa1, maa2, maa3, maa4: std_logic_vector(16 downto 0); signal ld_coef : std_logic_vector(24 downto 0);

begin

MAA_0: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(4 downto 0), coef=>coef, img_pxl=>ids0, p_rst=>maa0);

MAA_1: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(9 downto 5), coef=>coef, img_pxl=>ids1, p_rst=>maa1);

MAA_2: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(14 downto 10), coef=>coef, img_pxl=>ids2, p_rst=>maa2);

MAA_3: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(19 downto 15), coef=>coef, img_pxl=>ids3, p_rst=>maa3);

MAA_4: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(24 downto 20), coef=>coef, img_pxl=>ids4, p_rst=>maa4);

U1: AT port map(clk=>clk, rst=>rst, maa0=>maa0, maa1=>maa1, maa2=>maa2, maa3=>maa3, maa4=>maa4, ovf=>ovf, out_pxl=>out_pxl);

MUX: process (ld_reg, coef) is begin

case (ld_reg) is when "00001" => ld_coef <= "0000000000000000000000001"; -- 1 when "00010" => ld_coef <= "0000000000000000000000010"; -- 2 when "00011" => ld_coef <= "0000000000000000000000100"; -- 3 when "00100" => ld_coef <= "0000000000000000000001000"; -- 4 when "00101" => ld_coef <= "0000000000000000000010000"; -- 5 when "00110" => ld_coef <= "0000000000000000000100000"; -- 6 when "00111" => ld_coef <= "0000000000000000001000000"; -- 7 when "01000" => ld_coef <= "0000000000000000010000000"; -- 8 when "01001" => ld_coef <= "0000000000000000100000000"; -- 9 when "01010" => ld_coef <= "0000000000000001000000000"; -- 10 when "01011" => ld_coef <= "0000000000000010000000000"; -- 11 when "01100" => ld_coef <= "0000000000000100000000000"; -- 12 when "01101" => ld_coef <= "0000000000001000000000000"; -- 13 when "01110" => ld_coef <= "0000000000010000000000000"; -- 14

110 when "01111" => ld_coef <= "0000000000100000000000000"; -- 15 when "10000" => ld_coef <= "0000000001000000000000000"; -- 16 when "10001" => ld_coef <= "0000000010000000000000000"; -- 17 when "10010" => ld_coef <= "0000000100000000000000000"; -- 18 when "10011" => ld_coef <= "0000001000000000000000000"; -- 19 when "10100" => ld_coef <= "0000010000000000000000000"; -- 20 when "10101" => ld_coef <= "0000100000000000000000000"; -- 21 when "10110" => ld_coef <= "0001000000000000000000000"; -- 22 when "10111" => ld_coef <= "0010000000000000000000000"; -- 23 when "11000" => ld_coef <= "0100000000000000000000000"; -- 24 when "11001" => ld_coef <= "1000000000000000000000000"; -- 25 when others => ld_coef <= "0000000000000000000000000";

end case; end process MUX; end architecture STRUCTURAL;

-- at.vhd (Adding Tree) -- This is the Adding Tree that is responsible for adding five 17-bit word from -- five different MAAs. The structure will include four level of pipeline stages. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity AT is port( clk, rst: in std_logic; maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0); ovf: out std_logic; out_pxl: out std_logic_vector(18 downto 0)); end entity AT; architecture STRUCTURAL of AT is

signal low, ovf_r : std_logic; signal sum1 : std_logic_vector(16 downto 0); signal sum2, sum3, carry1, carry2: std_logic_vector(17 downto 0); signal carry3 : std_logic_vector(18 downto 0); signal sum4 : std_logic_vector(19 downto 0); signal pl1_r1 : std_logic_vector(17 downto 0); signal pl1_r2, pl1_r3, pl1_r4 : std_logic_vector(16 downto 0); signal pl2_r5, pl2_r6 : std_logic_vector(17 downto 0); signal pl2_r7 : std_logic_vector(16 downto 0); signal pl3_r8 : std_logic_vector(18 downto 0); signal pl3_r9 : std_logic_vector(17 downto 0); signal pl4_r10 : std_logic_vector(19 downto 0);

begin

L1: low <= '0';

U1: CSA generic map(n=>17) port map(a=>maa0, b=>maa1, c=>maa2, sum=>sum1, carry=>carry1); R1: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry1, d_out=>pl1_r1); R2: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum1, d_out=>pl1_r2); R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa3, d_out=>pl1_r3); R4: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa4, d_out=>pl1_r4);

U2: CSA generic map(n=>17) port map(a=>pl1_r1(16 downto 0), b=>pl1_r2, c=>pl1_r3, sum=>sum2(16 downto 0), carry=>carry2); L2: sum2(17) <= pl1_r1(17); -- This is the most significant bit from carry1 above R5: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry2, d_out=>pl2_r5); R6: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum2, d_out=>pl2_r6); R7: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>pl1_r4, d_out=>pl2_r7);

U3: CSA generic map(n=>17) port map(a=>pl2_r5(16 downto 0), b=>pl2_r6(16 downto 0), c=>pl2_r7, sum=>sum3(16 downto 0), carry=>carry3(17 downto 0)); L3: HA port map(a=>pl2_r5(17), b=>pl2_r6(17), s=>sum3(17), cout=>carry3(18)); R8: REG_P generic map(n=>19) port map(clk=>clk, rst=>rst, d_in=>carry3, d_out=>pl3_r8); R9: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum3, d_out=>pl3_r9);


U4: CLA_19 port map(a(17 downto 0)=>pl3_r9, a(18)=>low, b=>pl3_r8, s=>sum4(18 downto 0), ovf=>sum4(19)); R10: REG_P generic map(n=>20) port map(clk=>clk, rst=>rst, d_in=>sum4, d_out=>pl4_r10); L4: out_pxl <= pl4_r10(18 downto 0); L5: ovf <= pl4_r10(19); end architecture STRUCTURAL;

-- maa.vhd (This is the systolic array of five MAUs with a DU) library IEEE; use IEEE.std_logic_1164.all; entity MAA is port( clk, rst: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); img_pxl: in std_logic_vector(39 downto 0); p_rst: out std_logic_vector(16 downto 0)); end entity MAA; architecture STRUCTURAL of MAA is component DU is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); du_out: out std_logic_vector(39 downto 0)); end component DU;

component MAUS is port( clk, rst: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); img_pxl: in std_logic_vector(39 downto 0); p_rst: out std_logic_vector(16 downto 0)); end component MAUS;

signal s: std_logic_vector(39 downto 0);

begin

U1: DU port map(clk=>clk, rst=>rst, ids_in=>img_pxl, du_out=>s); U2: MAUS port map(clk=>clk, rst=>rst, ld_reg=>ld_reg, coef=>coef, img_pxl=>s, p_rst=>p_rst); end architecture STRUCTURAL;

-- du.vhd -- This is the Delay Unit for the propagation of the image data library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity DU is port( clk, rst: in std_logic; ids_in: in std_logic_vector(39 downto 0); du_out: out std_logic_vector(39 downto 0)); end entity DU; architecture STRUCTURAL of DU is

signal p1: std_logic_vector(31 downto 0); signal p2, p3: std_logic_vector(23 downto 0); signal p4, p5: std_logic_vector(15 downto 0); signal p6, p7: std_logic_vector(7 downto 0);

begin
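-- Delay structure: output byte 0 is passed through combinationally, while
-- bytes 1 through 4 are delayed by 1, 3, 5 and 7 clock cycles respectively,
-- so that each MAU's locally formed product arrives in step with the partial
-- result accumulated from the preceding MAUs.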

L1: du_out(7 downto 0) <= ids_in(7 downto 0); PL1: REG_P generic map(n=>32) port map(clk=>clk, rst=>rst, d_in=>ids_in(39 downto 8), d_out=>p1);


L2: du_out(15 downto 8) <= p1(7 downto 0); PL2: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p1(31 downto 8), d_out=>p2); PL3: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p2, d_out=>p3);

L3: du_out(23 downto 16) <= p3(7 downto 0); PL4: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p3(23 downto 8), d_out=>p4); PL5: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p4, d_out=>p5);

L4: du_out(31 downto 24) <= p5(7 downto 0); PL6: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p5(15 downto 8), d_out=>p6); PL7: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p6, d_out=>p7);

L5: du_out(39 downto 32) <= p7; end architecture STRUCTURAL;

-- maus.vhd library IEEE; use IEEE.std_logic_1164.all; entity MAUS is port( clk, rst: in std_logic; ld_reg: in std_logic_vector(4 downto 0); coef: in std_logic_vector(5 downto 0); img_pxl: in std_logic_vector(39 downto 0); p_rst: out std_logic_vector(16 downto 0)); end entity MAUS; architecture STRUCTURAL of MAUS is component MAU_0 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); img_pxl: in std_logic_vector(7 downto 0); p_res: out std_logic_vector(13 downto 0)); end component MAU_0;

component MAU_1 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(13 downto 0); -- previous MAU output p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU end component MAU_1;

component MAU_2 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(14 downto 0); -- previous MAU output p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU end component MAU_2;

component MAU_3 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(15 downto 0); -- previous MAU output p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU end component MAU_3;

component MAU_4 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(15 downto 0); -- previous MAU output p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU end component MAU_4;


signal p_res1: std_logic_vector(13 downto 0); signal p_res2: std_logic_vector(14 downto 0); signal p_res3, p_res4: std_logic_vector(15 downto 0);

begin

U0: MAU_0 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(0), coef=>coef, img_pxl=>img_pxl(7 downto 0), p_res=>p_res1);

U1: MAU_1 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(1), coef=>coef, img_pxl=>img_pxl(15 downto 8), p_mau=>p_res1, p_res=>p_res2);

U2: MAU_2 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(2), coef=>coef, img_pxl=>img_pxl(23 downto 16), p_mau=>p_res2, p_res=>p_res3);

U3: MAU_3 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(3), coef=>coef, img_pxl=>img_pxl(31 downto 24), p_mau=>p_res3, p_res=>p_res4);

U4: MAU_4 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(4), coef=>coef, img_pxl=>img_pxl(39 downto 32), p_mau=>p_res4, p_res=>p_rst); end architecture STRUCTURAL;

-- mau_0.vhd -- This is the first MAU of the MAUs (for one systolic array). -- This MAU only contains a Multiplication Unit and no Adder, since there is no previous -- MAU output that needs to be accumulated. The range of the multiplication is within -- -8160 to 7905 (decimal), hence the output (p_res) is a 14-bit word. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity MAU_0 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- Filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- Image pixels p_res: out std_logic_vector(13 downto 0)); -- Partial result to next MAU end entity MAU_0; architecture BEHAVIORAL of MAU_0 is

signal coef_reg: std_logic_vector(5 downto 0); signal product: std_logic_vector(13 downto 0);

begin

U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product); R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>p_res);

STORE: process (clk, rst, ld_reg) is begin

if (rst = '1') then coef_reg <= "000000"; elsif (rising_edge(clk)) then if (ld_reg = '1') then coef_reg <= coef; else coef_reg <= coef_reg; end if; end if;

end process STORE; end architecture BEHAVIORAL;

-- mau_1.vhd -- This is the second MAU of the MAUs. The range of this MAU is between -- -16320 and 15810 (decimal), hence a 15-bit word is used for the partial result. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity MAU_1 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(13 downto 0); -- previous MAU output p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU end entity MAU_1; architecture BEHAVIORAL of MAU_1 is

signal coef_reg: std_logic_vector(5 downto 0); signal pl1, pl2, product: std_logic_vector(13 downto 0); signal sum: std_logic_vector(14 downto 0);

begin

R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1); U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product); R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2); U2: CLA_15 port map(a=>pl1, b=>pl2, s=>sum); R3: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

STORE: process (clk, rst, ld_reg) is begin

if (rst = '1') then coef_reg <= "000000"; elsif (rising_edge(clk)) then if (ld_reg = '1') then coef_reg <= coef; else coef_reg <= coef_reg; end if; end if;

end process STORE; end architecture BEHAVIORAL;

-- mau_2.vhd -- This is the third MAU of the MAUs systolic array. -- The range of the MAU is between -24480 and 23715 (decimal), thus -- a 16-bit word is used for the partial result bus. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity MAU_2 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(14 downto 0); -- previous MAU output p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU end entity MAU_2; architecture BEHAVIORAL of MAU_2 is

signal coef_reg: std_logic_vector(5 downto 0); signal product: std_logic_vector(13 downto 0); signal pl1, pl2, sum: std_logic_vector(15 downto 0);

begin


L1: pl2(15 downto 14) <= "00"; L2: pl1(15) <= '0';

R1: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1(14 downto 0)); U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product); R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0)); U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum); R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

STORE: process (clk, rst, ld_reg) is begin

if (rst = '1') then coef_reg <= "000000"; elsif (rising_edge(clk)) then if (ld_reg = '1') then coef_reg <= coef; else coef_reg <= coef_reg; end if; end if;

end process STORE; end architecture BEHAVIORAL;

-- mau_3.vhd -- This is the fourth MAU within the MAUs systolic array. The range of the MAU is between -- -32640 and 31620 (decimal), thus a 16-bit word bus is used for the partial -- result coming out of this MAU. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity MAU_3 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(15 downto 0); -- previous MAU output p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU end entity MAU_3; architecture BEHAVIORAL of MAU_3 is

signal coef_reg: std_logic_vector(5 downto 0); signal product: std_logic_vector(13 downto 0); signal pl1, pl2, sum: std_logic_vector(15 downto 0);

begin

L1: pl2(15 downto 14) <= "00";

R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1); U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product); R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0)); U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum); R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

STORE: process (clk, rst, ld_reg) is begin

if (rst = '1') then coef_reg <= "000000"; elsif (rising_edge(clk)) then if (ld_reg = '1') then coef_reg <= coef; else coef_reg <= coef_reg;

end if; end if;

end process STORE;

end architecture BEHAVIORAL;

-- mau_4.vhd -- This is the last MAU within the MAUs systolic array. The range for this MAU -- is between -40800 and 39525 (decimal), thus a 17-bit word bus is used for the -- partial result coming out of this MAU. library IEEE, BFULIB; use IEEE.std_logic_1164.all; use BFULIB.bfu_pckg.all; entity MAU_4 is port( clk, rst, ld_reg: in std_logic; coef: in std_logic_vector(5 downto 0); -- filter coefficient img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU p_mau: in std_logic_vector(15 downto 0); -- previous MAU output p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU end entity MAU_4; architecture BEHAVIORAL of MAU_4 is

signal coef_reg: std_logic_vector(5 downto 0); signal product: std_logic_vector(13 downto 0); signal pl1, pl2: std_logic_vector(15 downto 0); signal sum: std_logic_vector(16 downto 0);

begin

L1: pl2(15 downto 14) <= "00";

R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1); U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product); R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0)); U2: CLA_17 port map(a=>pl1, b=>pl2, s=>sum); R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);

STORE: process (clk, rst, ld_reg) is begin

if (rst = '1') then coef_reg <= "000000"; elsif (rising_edge(clk)) then if (ld_reg = '1') then coef_reg <= coef; else coef_reg <= coef_reg; end if; end if;

end process STORE; end architecture BEHAVIORAL;
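The partial-result bus widths quoted in the MAU comments above follow directly from the 6-bit two's-complement coefficient range (-32 to 31) and the 8-bit unsigned pixel range (0 to 255): one product lies in -8160..7905, and MAU_k accumulates k+1 such products. The short stand-alone C++ sketch below (an added check, not part of the thesis tool set) re-derives these ranges and the minimum two's-complement width for each stage.

// range_check.cpp -- stand-alone sketch: re-derives the MAU partial-result widths
#include <iostream>

int main()
{
    const int min_prod = -32 * 255;   // most negative coefficient * largest pixel = -8160
    const int max_prod =  31 * 255;   // most positive coefficient * largest pixel =  7905

    // MAU_k accumulates (k + 1) coefficient-pixel products.
    for (int k = 0; k < 5; k++) {
        int lo = (k + 1) * min_prod;
        int hi = (k + 1) * max_prod;
        int bits = 2;                 // smallest two's-complement width covering [lo, hi]
        while (lo < -(1 << (bits - 1)) || hi > (1 << (bits - 1)) - 1) {
            bits++;
        }
        std::cout << "MAU_" << k << ": " << lo << " .. " << hi
                  << " -> " << bits << "-bit partial result" << std::endl;
    }
    return 0;
}

Running the sketch reproduces the widths used above: 14, 15, 16, 16 and 17 bits for MAU_0 through MAU_4.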

7. Multiplication and Adder Units (These functional units have been defined as a library package) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all;

package bfu_pckg is


component CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15;

component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16;

component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17;

component CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0);

ovf: out std_logic); end component CLA_19;

component MULT is port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end component MULT;

component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA;

component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA;

component CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component CSA;

component REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_P;

component REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end component REG_N; end package bfu_pckg;

-- mult.vhd (Multiplication Unit) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity MULT is

port(a: in std_logic_vector(7 downto 0); b: in std_logic_vector(5 downto 0); p: out std_logic_vector(13 downto 0)); end entity MULT; architecture STRUCT of MULT is component PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end component PPG;

component R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end component R3_2C;

component S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end component S3_2C;

component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14;

signal sp1, sp2, sp3: std_logic;

signal ls: std_logic_vector(2 downto 0); signal pp1, pp2, pp3: std_logic_vector(8 downto 0); signal pp4, sum1, sum2, carry1: std_logic_vector(13 downto 0); signal carry2: std_logic_vector(14 downto 0);

begin

L1: ls <= b(1 downto 0) & '0';

U1: PPG port map(a=>a, mult=>ls, pp=>pp1, spp=>sp1); U2: PPG port map(a=>a, mult=>b(3 downto 1), pp=>pp2, spp=>sp2); U3: PPG port map(a=>a, mult=>b(5 downto 3), pp=>pp3, spp=>sp3);

U4: S3_2C port map(pp1=>pp1, pp2=>pp2, pp3=>pp3, sp1=>sp1, sp2=>sp2, sp3=>sp3, sum=>sum1, carry=>carry1); L2: pp4 <= "000000000" & sp3 & '0' & sp2 & '0' & sp1; U5: R3_2C port map(a=>sum1, b=>carry1, c=>pp4, sum=>sum2, carry=>carry2);

U6: CLA_14 port map(a=>sum2, b=>carry2(13 downto 0), s=>p); end architecture STRUCT;

-- ppg.vhd (Partial Product Generator) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity PPG is generic(n: integer := 8); port( a: in std_logic_vector(n-1 downto 0); mult: in std_logic_vector(2 downto 0); pp: out std_logic_vector(n downto 0); spp: out std_logic); end entity PPG;

architecture BEHAVIORAL of PPG is begin

PP_PROC: process(mult, a) begin

case mult is when "000" => for k in n downto 0 loop pp(k) <= '0'; end loop; spp <= '0'; when "001" => pp <= '0' & a; spp <= '0'; when "010" => pp <= '0' & a; spp <= '0'; when "011" => pp <= a & '0'; spp <= '0'; when "100" => pp <= not(a & '0'); spp <= '1'; when "101" => pp <= not('0' & a); spp <= '1'; when "110" => pp <= not('0' & a); spp <= '1'; when "111" => for l in n downto 0 loop pp(l) <= '0'; end loop; spp <= '0'; when others => null;

end case; end process PP_PROC; end architecture BEHAVIORAL;
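The case statement in PP_PROC above is a radix-4 (modified Booth) recoding of the 6-bit coefficient: each overlapping 3-bit group selects 0, +a, +2a, -a or -2a, with spp flagging the negated (one's-complemented) partial products so that the missing +1 is supplied later through the pp4 word in the multiplier. The stand-alone C++ sketch below (an illustration, not part of the thesis code) applies the same digit mapping in software and confirms that the three recoded digits reproduce the full product a*b for every coefficient/pixel combination.

// booth_check.cpp -- software model of the radix-4 recoding used by the PPGs
#include <iostream>

int main()
{
    for (int b = -32; b < 32; b++) {          // 6-bit signed coefficient
        for (int a = 0; a < 256; a++) {       // 8-bit unsigned pixel
            int bits = (b & 0x3F) << 1;       // two's-complement image of b with implicit 0 appended
            int recoded = 0;
            for (int g = 0; g < 3; g++) {     // three overlapping 3-bit groups
                int grp = (bits >> (2 * g)) & 0x7;
                int digit;
                switch (grp) {                // same mapping as the PPG case statement
                    case 0: case 7: digit =  0; break;   // "000", "111" -> 0
                    case 1: case 2: digit =  1; break;   // "001", "010" -> +a
                    case 3:         digit =  2; break;   // "011"        -> +2a
                    case 4:         digit = -2; break;   // "100"        -> -2a
                    default:        digit = -1; break;   // "101", "110" -> -a
                }
                recoded += digit * a * (1 << (2 * g));
            }
            if (recoded != a * b) {
                std::cout << "mismatch at a=" << a << ", b=" << b << std::endl;
                return 1;
            }
        }
    }
    std::cout << "radix-4 recoding reproduces a*b for all inputs" << std::endl;
    return 0;
}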

-- r3_2c.vhd (Second level 3 to 2 Counter for Multiplier) library IEEE; use IEEE.std_logic_1164.all; entity R3_2C is generic(n: integer := 14); port(a, b, c: in std_logic_vector(n-1 downto 0); -- a->sum, b->carry c->pp4 sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity R3_2C; architecture STRUCTURAL of R3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA;

component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA;

begin

L1: carry(0) <= '0';

LK: for k in 2 downto 0 generate HAK: HA port map(a=>a(k), b=>c(k), s=>sum(k), cout=>carry(k+1)); end generate LK;

L2: HA port map(a=>a(3), b=>b(3), s=>sum(3), cout=>carry(4)); L3: FA port map(a=>a(4), b=>b(4), cin=>c(4), s=>sum(4), cout=>carry(5));

LF: for f in n-1 downto 5 generate

HAF: HA port map(a=>a(f), b=>b(f), s=>sum(f), cout=>carry(f+1)); end generate LF; end architecture STRUCTURAL;

-- s3_2c.vhd (Special 3 to 2 Counter) library IEEE; use IEEE.std_logic_1164.all; entity S3_2C is port(pp1, pp2, pp3: in std_logic_vector(8 downto 0); sp1, sp2, sp3: in std_logic; sum, carry: out std_logic_vector(13 downto 0)); end entity S3_2C; architecture STRUCTURAL of S3_2C is component HA is port( a, b: in std_logic; s, cout: out std_logic); end component HA;

component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA;

signal high, n_sp1, n_sp2, n_sp3: std_logic;

begin

G0: high <= '1'; G1: n_sp1 <= not sp1; G2: n_sp2 <= not sp2; G3: n_sp3 <= not sp3;

L1: sum(1 downto 0) <= pp1(1 downto 0); L2: carry(2 downto 0) <= "000"; L3: sum(13) <= n_sp3;

LK: for k in 1 downto 0 generate HAA: HA port map(a=>pp1(k+2), b=>pp2(k), s=>sum(k+2), cout=>carry(k+3)); end generate LK;

LG: for g in 4 downto 0 generate FAG: FA port map(a=>pp1(g+4), b=>pp2(g+2), cin=>pp3(g), s=>sum(g+4), cout=>carry(g+5)); end generate LG;

LF: for f in 6 downto 5 generate FAF: FA port map(a=>sp1, b=>pp2(f+2), cin=>pp3(f), s=>sum(f+4), cout=>carry(f+5)); end generate LF;

FA7: FA port map(a=>n_sp1, b=>n_sp2, cin=>pp3(7), s=>sum(11), cout=>carry(12)); HA2: HA port map(a=>high, b=>pp3(8), s=>sum(12), cout=>carry(13)); end architecture STRUCTURAL;

-- cla_16.vhd (Carry Lookahead Adder ~ 16 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end entity CLA_16; architecture STRUCTURAL of CLA_16 is

component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1;

component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4;

component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L;

signal p, g: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 1);

begin

U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2L port map(p=>p(2 downto 0), g=>g(2 downto 0), cout=>c); end architecture STRUCTURAL;

-- cla_19.vhd (19-bit Carry Lookahead adder) library IEEE; use IEEE.std_logic_1164.all; entity CLA_19 is port(a, b: in std_logic_vector(18 downto 0); s: out std_logic_vector(18 downto 0); ovf: out std_logic); end entity CLA_19; architecture STRUCTURAL of CLA_19 is component CLA_3S is port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end component CLA_3S;

component CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17;

signal cout, high, low, ovf0, ovf1: std_logic; signal s1, s2: std_logic_vector(2 downto 0);

begin

L1: high <= '1'; L2: low <= '0';

U1: CLA_17 port map(a=>a(15 downto 0), b=>b(15 downto 0), s(15 downto 0)=>s(15 downto 0), s(16)=>cout); U2: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>high, s=>s1, ovf=>ovf1);

U3: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>low, s=>s2, ovf=>ovf0);

L3: s(18 downto 16) <= s1 when cout = '1' else s2; L4: ovf <= ovf1 when cout='1' else ovf0; end architecture STRUCTURAL;

-- cla_17.vhd (Carry Lookahead Adder ~ 17 Bits) -- This Carry Lookahead adder adds two 16-bit numbers and generates a 17-bit sum. library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_17 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(16 downto 0)); end entity CLA_17; architecture STRUCTURAL of CLA_17 is component CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end component CLA_16_P;

signal p_4, g_4: std_logic;

begin U1: CLA_16_P port map(a=>a, b=>b, s=>s(15 downto 0), p_out=>p_4, g_out=>g_4);

L3: s(16) <= g_4; end architecture STRUCTURAL;

-- cla_16_p.vhd (Carry Lookahead Adder ~ 16 Bits) with p and g library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_16_P is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0); p_out, g_out: out std_logic); end entity CLA_16_P; architecture STRUCTURAL of CLA_16_P is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1;

component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4;

component CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end component CLL_2;

signal p, g: std_logic_vector(3 downto 0);

signal c: std_logic_vector(3 downto 1);

begin

U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12), p_out=>p(3), g_out=>g(3)); U5: CLL_2 port map(p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out); end architecture STRUCTURAL;

-- -- cll_2.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) with P, G output library IEEE; use IEEE.std_logic_1164.all; entity CLL_2 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 1); p_out, g_out: out std_logic); end entity CLL_2; architecture BEHAVIORAL of CLL_2 is begin

L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL;

-- cla_15.vhd (Carry Lookahead Adder ~ 15 Bits) -- This is an adder that adds two 14-bit numbers; the sum is a 15-bit word library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_15 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(14 downto 0)); end entity CLA_15; architecture STRUCTURAL of CLA_15 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1;

component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4;

component CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic;

s: out std_logic_vector(2 downto 0)); end component CLA_3;

component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L;

signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1);

begin

U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_3 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(14 downto 12));

U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL;

-- -- cla_3.vhd (Carry Lookahead Adder ~ 3 Bits (Last 3 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0)); end entity CLA_3; architecture STRUCTURAL of CLA_3 is component SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end component SCLL_3;

component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;

signal p, g: std_logic_vector(1 downto 0); signal c: std_logic_vector(2 downto 0);

begin U0: SCLL_3 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1));

L1: s(2) <= c(2); end architecture STRUCTURAL; -- -- scll_3.vhd (Carry Lookahead Logic for Bit position 13 & 14) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_3 is port( cin: in std_logic; p, g: in std_logic_vector(1 downto 0); cout: out std_logic_vector(2 downto 0)); end entity SCLL_3;

architecture BEHAVIORAL of SCLL_3 is begin

L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); end architecture BEHAVIORAL;

-- cla_14.vhd (Carry Lookahead Adder ~ 14 Bits) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; use IEEE.std_logic_unsigned.all; entity CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end entity CLA_14; architecture STRUCTURAL of CLA_14 is component CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4_1;

component CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLA_4;

component CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end component CLA_2;

component CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end component CLL_2L;

signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 1);

begin

U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0), p_out=>p(0), g_out=>g(0)); U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4), p_out=>p(1), g_out=>g(1)); U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8), p_out=>p(2), g_out=>g(2)); U4: CLA_2 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(13 downto 12));

U5: CLL_2L port map(p=>p, g=>g, cout=>c); end architecture STRUCTURAL;

-- cla_3s.vhd (Carry Lookahead Adder ~ 3 Bits (For CLA_19)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_3S is

port(a, b: in std_logic_vector(2 downto 0); cin: in std_logic; s: out std_logic_vector(2 downto 0); ovf: out std_logic); end entity CLA_3S; architecture STRUCTURAL of CLA_3S is component SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end component SCLL_4;

component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;

signal p, g: std_logic_vector(2 downto 0); signal c: std_logic_vector(3 downto 0);

begin U0: SCLL_4 port map(cin=>cin, p=>p, g=>g, cout=>c); U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0)); U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1)); U3: PFA port map(a=>a(2), b=>b(2), c=>c(2), s=>s(2), g=>g(2), p=>p(2));

L1: ovf <= c(2) xor c(3); end architecture STRUCTURAL;

-- scll_4.vhd (Carry Lookahead Logic for Bit position 17, 18, & 19) library IEEE; use IEEE.std_logic_1164.all; entity SCLL_4 is port( cin: in std_logic; p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 0)); end entity SCLL_4; architecture BEHAVIORAL of SCLL_4 is begin

L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); end architecture BEHAVIORAL;

-- cla_4.vhd (Carry Lookahead Adder ~ 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLA_4 is port( a, b: in std_logic_vector(3 downto 0); cin: in std_logic; s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4; architecture STRUCTURAL of CLA_4 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;


component CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end component CLL;

signal c, g, p: std_logic_vector(3 downto 0);

begin

L1: CLL port map(cin=>cin, p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out);

LK: for k in 3 downto 0 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK;

end architecture STRUCTURAL;

-- cla_4_1.vhd (Carry Lookahead Adder ~ 4 bits / for the first CLA_4) library IEEE; use IEEE.std_logic_1164.all;

entity CLA_4_1 is port( a, b: in std_logic_vector(3 downto 0); s: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLA_4_1;

architecture STRUCTURAL of CLA_4_1 is component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;

component CLL_1 is port( p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end component CLL_1;

signal g, p: std_logic_vector(3 downto 0); signal c: std_logic_vector(3 downto 0);

begin

L0: g(0) <= a(0) and b(0); L1: p(0) <= a(0) xor b(0); L2: s(0) <= p(0);

L3: CLL_1 port map(p=>p, g=>g, cout=>c(3 downto 1), p_out=>p_out, g_out=>g_out);

LK: for k in 3 downto 1 generate PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k)); end generate LK; end architecture STRUCTURAL;

-- -- cll_1.vhd (Carry Lookahead Logic - for first 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_1 is port( p, g: in std_logic_vector(3 downto 0);

cout: out std_logic_vector(2 downto 0); p_out, g_out: out std_logic); end entity CLL_1; architecture BEHAVIORAL of CLL_1 is begin

L1: cout(0) <= g(0); L2: cout(1) <= g(1) or (p(1) and g(0)); L3: cout(2) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0));

L4: p_out <= p(3) and p(2) and p(1) and p(0); L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL;

-- cll.vhd (Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL is port( cin: in std_logic; p, g: in std_logic_vector(3 downto 0); cout: out std_logic_vector(3 downto 0); p_out, g_out: out std_logic); end entity CLL; architecture BEHAVIORAL of CLL is begin

L1: cout(0) <= cin; L2: cout(1) <= g(0) or (p(0) and cin); L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin); L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)) or (p(2) and p(1) and p(0) and cin); L5: p_out <= p(3) and p(2) and p(1) and p(0); L6: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1)) or (p(3) and p(2) and p(1) and g(0)); end architecture BEHAVIORAL;

-- cll_2l.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) library IEEE; use IEEE.std_logic_1164.all; entity CLL_2L is port(p, g: in std_logic_vector(2 downto 0); cout: out std_logic_vector(3 downto 1)); end entity CLL_2L; architecture BEHAVIORAL of CLL_2L is begin

L1: cout(1) <= g(0); L2: cout(2) <= g(1) or (p(1) and g(0)); L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0)); end architecture BEHAVIORAL;

-- csa.vhd (Carry Save Adder) library IEEE; use IEEE.std_logic_1164.all; entity CSA is generic(n: positive := 5); port( a, b, c: in std_logic_vector(n-1 downto 0);

sum: out std_logic_vector(n-1 downto 0); carry: out std_logic_vector(n downto 0)); end entity CSA; architecture STRUCTURAL of CSA is component FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end component FA;

begin

L1: carry(0) <= '0';

KL: for k in n-1 downto 0 generate FAK: FA port map(a=>a(k), b=>b(k), cin=>c(k), s=>sum(k), cout=>carry(k+1)); end generate KL; end architecture STRUCTURAL;

-- -- cla_2.vhd (Carry Lookahead Adder ~ 2 Bits (Last 2 Bits)) library IEEE; use IEEE.std_logic_1164.all; entity CLA_2 is port(a, b: in std_logic_vector(1 downto 0); cin: in std_logic; s: out std_logic_vector(1 downto 0)); end entity CLA_2; architecture STRUCTURAL of CLA_2 is component SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end component SCLL;

component PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end component PFA;

signal c, p, g: std_logic;

begin U1: PFA port map(a=>a(0), b=>b(0), c=>cin, s=>s(0), g=>g, p=>p); U2: SCLL port map(cin=>cin, p=>p, g=>g, cout=>c);

L1: s(1) <= a(1) xor b(1) xor c; end architecture STRUCTURAL;

-- -- scll.vhd (Carry Lookahead Logic for Bit position 13) library IEEE; use IEEE.std_logic_1164.all; entity SCLL is port(cin, p, g: in std_logic; cout: out std_logic); end entity SCLL; architecture BEHAVIORAL of SCLL is begin

L1: cout <= g or (p and cin);

end architecture BEHAVIORAL;

-- ha.vhd (Half Adder) library IEEE; use IEEE.std_logic_1164.all; entity HA is port( a, b: in std_logic; s, cout: out std_logic); end entity HA; architecture BEHAVIORAL of HA is begin

L1: s <= a xor b; L2: cout <= a and b; end architecture BEHAVIORAL;

-- fa.vhd (full adder) library IEEE; use IEEE.std_logic_1164.all; entity FA is port(a, b, cin: in std_logic; s, cout: out std_logic); end entity FA; architecture BEHAVIORAL of FA is begin

L1: s <= a xor b xor cin; L2: cout <= (a and b) or (a and cin) or (b and cin); end architecture BEHAVIORAL;

-- pfa.vhd (Partial Full Adder) library IEEE; use IEEE.std_logic_1164.all; entity PFA is port(a, b, c: in std_logic; s, g, p: out std_logic); end entity PFA; architecture BEHAVIORAL of PFA is begin

L1: s <= a xor b xor c; L2: g <= a and b; L3: p <= a xor b; end architecture BEHAVIORAL;

-- reg_p.vhd (Positive edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_P is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_P;

architecture BEHAVIORAL of REG_P is

signal d_reg: signed(n-1 downto 0);

begin

STORE: process (clk, rst, d_in) is begin

if (rst = '1') then d_reg <= conv_signed('0', n); elsif (rising_edge(clk)) then d_reg <= signed(d_in); end if;

end process STORE;

L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL;

-- reg_n.vhd (Negative edge clocked registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.std_logic_arith.all; entity REG_N is generic(n: positive := 5); port( clk, rst: in std_logic; d_in: in std_logic_vector(n-1 downto 0); d_out: out std_logic_vector(n-1 downto 0)); end entity REG_N; architecture BEHAVIORAL of REG_N is

signal d_reg: signed(n-1 downto 0);

begin

STORE: process (clk, rst, d_in) is begin

if (rst = '1') then d_reg <= conv_signed('0', n); elsif (falling_edge(clk)) then d_reg <= signed(d_in); end if;

end process STORE;

L1: d_out <= std_logic_vector(d_reg); end architecture BEHAVIORAL;

Appendix B

VHDL codes, C++ source codes and Script file for Post-Synthesis simulation

Adders

C++ source code
// This program generates all possible inputs to the Adders
// with the ability to adjust the increment (step) between test values
#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>

int main() { ofstream out_file1, out_file2, out_file3;

out_file1.open("v_a.dat"); out_file2.open("v_b.dat"); out_file3.open("v_ans.dat");

int time, delay, a, b, choice; int lo, hi, step;

time = 20; delay = 0;

cout << "Please enter the selection by number:" << endl; cout << "------" << endl; cout << "(1) CLA 14" << endl; cout << "(2) CLA 15" << endl; cout << "(3) CLA 16" << endl; cout << "(4) CLA 17" << endl; cout << "(5) CLA 19" << endl; cin >> choice; cout << endl << "Please enter the step: "; cin >> step;

switch (choice) { case 1: lo = -8192; hi = 8192; break;

case 2: lo = -16384; hi = 16384; break;

case 3: lo = -32768; hi = 32768; break;

case 4: lo = -65536; hi = 65536; break;

case 5: lo = -262144; hi = 262144; break;

default: lo = -8192;

hi = 8192; break; }

out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

// (loop body reconstructed -- the original listing is truncated here; by analogy
// with the Multiplication Unit generator below, it writes each operand pair and
// its expected sum in the same waveform-stimulus format)
for (a=lo; a<hi; a+=step) { for (b=lo; b<hi; b+=step) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << a << "\\H +" << endl; out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << b << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << (time + delay) << "ns=" << hex << (a+b) << "\\H +" << endl; time = time + 20; } }

out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << (time + delay) << "ns=" << 0 << "\\H" << endl;

out_file1.close(); out_file2.close(); out_file3.close();

return 0; }

VHDL file for 14-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA14 is port( a, b: in std_logic_vector(13 downto 0); ans: in std_logic_vector(13 downto 0); t: inout std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_CLA14; architecture BEHAV of TB_CLA14 is component CLA_14 is port( a, b: in std_logic_vector(13 downto 0); s: out std_logic_vector(13 downto 0)); end component CLA_14;

begin

U1: CLA_14 port map(a=>a, b=>b, s=>t);

COMP: process(t, ans) is begin

if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;

VHDL file for 15-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA15 is port( a, b: in std_logic_vector(14 downto 0); ans: in std_logic_vector(14 downto 0); t: inout std_logic_vector(14 downto 0); err: out std_logic ); end entity TB_CLA15; architecture BEHAV of TB_CLA15 is component CLA_15 is port( a, b: in std_logic_vector(14 downto 0); s: out std_logic_vector(14 downto 0)); end component CLA_15;

begin

U1: CLA_15 port map(a=>a, b=>b, s=>t);

COMP: process(t, ans) is begin

if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;

VHDL file for 16-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA16 is port( a, b: in std_logic_vector(15 downto 0); ans: in std_logic_vector(15 downto 0); t: inout std_logic_vector(15 downto 0); err: out std_logic ); end entity TB_CLA16; architecture BEHAV of TB_CLA16 is component CLA_16 is port( a, b: in std_logic_vector(15 downto 0); s: out std_logic_vector(15 downto 0)); end component CLA_16;

begin

U1: CLA_16 port map(a=>a, b=>b, s=>t);

COMP: process(t, ans) is begin

if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;

VHDL file for 17-bit CLA Testbench library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity TB_CLA17 is port( a, b: in std_logic_vector(16 downto 0); ans: in std_logic_vector(16 downto 0); t: inout std_logic_vector(16 downto 0); err: out std_logic ); end entity TB_CLA17; architecture BEHAV of TB_CLA17 is component CLA_17 is port( a, b: in std_logic_vector(16 downto 0); s: out std_logic_vector(16 downto 0)); end component CLA_17;

begin

U1: CLA_17 port map(a=>a, b=>b, s=>t);

COMP: process(t, ans) is begin

if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;

VHDL file for 19-bit CLA Testbench -- tb_cla_19.vhd library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all; entity TB_CLA19 is port( a, b: in std_logic_vector(18 downto 0); ans: in std_logic_vector(18 downto 0); ovf: out std_logic; t: inout std_logic_vector(18 downto 0); err: out std_logic ); end entity TB_CLA19; architecture BEHAV of TB_CLA19 is begin

U1: CLA_19 port map(a=>a, b=>b, s=>t, ovf=>ovf);

COMP: process(t, ans) is begin

if (t = ans) then err <= '0'; else err <= '1'; end if; end process COMP; end architecture BEHAV;

Multiplication Unit

C++ source code
// This program generates all possible inputs to the Multiplication
// unit and the correct output results corresponding to all the inputs.
#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>

int main() { ofstream out_file1, out_file2, out_file3, out_file4, out_file5;

out_file1.open("coef.dat"); out_file2.open("mag.dat"); out_file3.open("x_ans.dat");

int delay, time, m, n;

time = 0; delay = 40;

out_file1 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << time << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << time << "ns=" << 0 << "\\H +" << endl;

for (m=-32; m<32; m++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << m << "\\H +" << endl; for (n=0; n<256; n++) { out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << n << "\\H +" << endl; out_file3 << setiosflags(ios::uppercase) << "@" << dec << time + delay << "ns=" << hex << (m*n) << "\\H +" << endl; time = time + 20; } } out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time + delay << "ns=" << 0 << "\\H" << endl;

out_file1.close(); out_file2.close(); out_file3.close();

return 0; }

VHDL code for Multiplication Unit testbench -- testbench for multiplier (tb_mult.vhd) library IEEE, BFULIB; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BFULIB.bfu_pckg.all;

entity TB_MULT is port( clk, rst: in std_logic; coef: in std_logic_vector(5 downto 0); mag: in std_logic_vector(7 downto 0); pro: in std_logic_vector(13 downto 0); p: out std_logic_vector(13 downto 0); err: out std_logic ); end entity TB_MULT;

architecture STRUCT of TB_MULT is signal product: std_logic_vector(13 downto 0);


begin

M1: MULT port map(clk=>clk, rst=>rst, a=>mag, b=>coef, p=>product); L1: p <= product;

COMP: process(clk, rst) is begin

if (rst = '1') then err <= '0'; elsif (clk'event and clk = '1') then if (product = pro) then err <= '0'; else err <= '1'; end if; end if;

end process COMP; end architecture STRUCT;

Script File | The file has been automatically generated by | the Script Editor File Wizard version 2.0.1.89 | | Copyright © 1998 Aldec, Inc.

| Initial settings delete_signals set_mode functional restart stepsize 10 ns

| Vector Definitions | | Add your vector definition commands here vector coef coef5 coef4 coef3 coef2 coef1 coef0 radix hex coef vector mag mag7 mag6 mag5 mag4 mag3 mag2 mag1 mag0 radix hex mag vector product p[13:0] radix hex product vector t_ans pro[13:0] radix hex t_ans

| Watched Signals and Vectors | | Define your signal and vector watch list here watch coef mag product err

| Stimulators Assignment | | Select and/or define your own stimulators | and assign them to the selected signals wfm coef < coef.dat wfm mag < mag.dat wfm t_ans < x_ans.dat

| Set Breakpoint Conditions | | Define breakpoint conditions and | breakpoint actions for selected signals here break err 1-0 do (print err.out)

| Perform Simulation

| | Run simulation for a selected number of | clock cycles or a time range sim

Appendix C

C++ Source Codes for Programs Used During Post-Implementation Simulation

Program 1 (Input Image Plane and Output Image Planes generator)

#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>
int main() { ifstream in_file1; ofstream out_file1, out_file2;

in_file1.open("coef.txt"); out_file1.open("v_input.dat"); out_file2.open("input_mag.txt");

int row, col, nfc; int a, b, k, m, n, mag; int i_mag[5][60]; int fc[3][5][5]; unsigned int seed;

cout << "Seed number: "; cin >> seed; in_file1 >> nfc;

row = 5; col = 60; k = 0;

// Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close();

// Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ")" << endl; out_file2 << "------" << endl;

for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) { mag = rand(); } i_mag[a][b] = mag; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; }

out_file2 << endl; }

// This segment of the code generates the input image plane for simulation for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { out_file1 << setiosflags(ios::uppercase) << "assign input " << hex << i_mag[a][b] << "\\h" << endl; out_file1 << "cycle 1" << endl; } if (a < 2) { out_file1 << "cycle 1" << endl; } else { out_file1 << "cycle 2" << endl; } }
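// The block below computes the 5x5 discrete convolution
//     out[a][b] = sum over m = 0..4, n = 0..4 of i_mag[a-m+2][b-n+2] * fc[k][m][n]
// for each FC plane k, treating pixels outside the (row x col) image as zero
// and truncating each result to 19 bits (mag & 0x7FFFF) to match the
// architecture's 19-bit output pixel bus.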

// Generate Expected Output for (k = 0; k < nfc; k++) { out_file2 << endl << endl << "Output Image Plane " << k+1 << endl << "------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 0; for (m = 0; m < 5; m++) { for (n = 0; n < 5; n++) { if (!(((a-m+2) < 0) | ((a-m+2) >= row)| ((b-n+2)<0) | ((b-n+2) >= col))) mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]); } } mag = mag & 0x7FFFF; out_file2 << setw(5) << hex << setiosflags(ios::uppercase) << mag << " "; } out_file2 << endl; } }

out_file1.close(); out_file2.close();

return 0; }

Program 2 (To generate test vectors that will program the FCs into each MAU)

#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
int main() { ifstream in_file1; ofstream out_file1, out_file2, out_file3;

in_file1.open("coef.txt"); //Filter Coefficient file out_file1.open("v_coef.dat"); out_file2.open("v_c_reg.dat"); out_file3.open("v_au_sel.dat");

int array[5][5]; int time, count, temp, a, b, k, nfc;

time = 40; count = 1; k = 1; in_file1 >> nfc;

out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl; out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

while (k-1 < nfc) { out_file3 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << k << "\\H +" << endl; for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file1 >> temp; cout << temp << endl; array[b][a] = temp; } }

for (a=0; a<5; a++) { for (b=0; b<5; b++) { out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << array[a][b] << "\\H +" << endl; out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns=" << hex << count << "\\H +" << endl; time += 80; count++; } } count = 1; k++; }

out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl; out_file3 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;

in_file1.close(); out_file1.close(); out_file2.close(); out_file3.close();

return 0; }

Appendix D

C++ Source Codes for Programs Used During Hardware Prototype Implementation

Program 1 (This is the program that is responsible for sending the FC values)

//This is the driver for the system without a FIFO
#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <conio.h>
#include <stdlib.h>

#define DATA 0x0378 #define STATUS DATA+1 #define CONTROL DATA+2 void delay(int); void sentdata(int &); main() { int reg_coef[3][25]; int reg_cfg[3][25]; int fc[3][5][5]; int k, nfc, a, b, count, wait, d, o_sel, o_cfg;

/* Reading the Filter Coefficients from the coef.txt file */ ifstream in_file; in_file.open("coef.txt"); // Open the coef.txt file in_file >> nfc >> o_sel; // Read in the number of FC planes and the output selection o_cfg = 0x80;

/* Make Sure Parallel Port is in forward mode and set strobe */ _outp(CONTROL, _inp(CONTROL) & 0xDE); /* Make Sure the write enable (ppc(3)) is at low */ _outp(CONTROL, _inp(CONTROL) | 0x08);

k = 1; count = 1; while (k-1 < nfc) // Repeat for the number of plane indicated { for (a=4; a>=0; a--) { for (b=0; b<5; b++) { in_file >> fc[k-1][b][a]; // Reading in the filter coefficients } // in the arrangement of FC in the AU }

for (a=0; a<5; a++) { for (b=0; b<5; b++) { reg_coef[k-1][count-1] = fc[k-1][a][b] & 0x3F; reg_cfg[k-1][count-1] = ((k & 0x03) << 5); reg_cfg[k-1][count-1] = reg_cfg[k-1][count-1] | (count & 0x1F); cout << hex << reg_cfg[k-1][count-1] << " " << reg_coef[k-1][count-1] << endl; count++; } } count = 1; k++; }
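// The configuration bytes built above pack the AU (FC plane) number into bits 6..5
// and the MAU load-register select into bits 4..0.  As an illustration, AU k = 2
// with MAU count = 13 would give (2 << 5) | 13 = 0x4D.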

k = 1; while (k-1 < nfc) { a = 1;

while(a<26) { while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(reg_coef[k-1][a-1]); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(reg_cfg[k-1][a-1]); a++; while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) } } k++; } //Configuration of the MAUs is done _outp(CONTROL, _inp(CONTROL) | 0x08); //Deassert ppc(3)

k = 0; while (k<1) { //Program output selection according to o_sel read in while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { _outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3) while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5) sentdata(o_sel); while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6) sentdata(o_cfg); while((_inp(STATUS) & 0x40) == 64){} //Detect high on pps(6) k++; } } _outp(CONTROL, _inp(CONTROL) | 0x08); //Deassert ppc(3) cout << "Configuration done!" << endl; in_file.close(); //Close coef.txt // Programming of the Filter Coefficients is done // exit(1);

// This section starts sending input data to the system ifstream in_file1; in_file1.open("input.txt"); //Open the input data file wait = 1; while (wait==1) { if ((_inp(STATUS) & 0x08) == 0) //Detect low on pps(3) { //Run the following segment of code if pps(3)==0 while(!((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3) { //Run the following segment of code if pps(4)==1 if ((_inp(STATUS) & 0x10) == 16) //Detect high on pps(4) { in_file1 >> d; //Read in the data from file sentdata(d); while((_inp(STATUS) & 0x08) == 0) //Detect high on pps(3) {} } } } } return 0; }; //end of main void sentdata(int &c) { cout << c << " sent..." << endl; _outp(DATA, c^0x03); /* sending the data with the two LSBs toggled */ _outp(CONTROL, _inp(CONTROL) & 0xFB); /* set strobe ~ one to zero */ delay(1000); _outp(CONTROL, _inp(CONTROL) | 0x04); /* reset strobe ~ zero to one */

delay(1000); };

/* A function to create delay */ void delay(int k) { int i; for (i=0; i<=k; i++){} };

Program 2 (This is the program that generates the VHDL file for the internal RAM holding the input image pixels)

#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>
int main() { int a, b, mag, row, col, i, j, k; int i_mag[5][62]; int temp[32]; unsigned int seed; ofstream outfile("input_ram.vhd", ios::out);

row = 5; col = 62;

cout << "Seed number: "; cin >> seed;

srand(seed); for (a = 0; a < row; a++) { for (b = 0; b < (col-2); b++) { mag = 256; while (mag > 255) { mag = rand(); } i_mag[a][b] = mag; } i_mag[a][col-1] = 0; i_mag[a][col-2] = 0; }

outfile << "library IEEE;\n" << "use IEEE.std_logic_1164.all;\n" << "use IEEE.numeric_std.all;\n" << "\nentity IN_RAM is\n" << " port( clk: in std_logic;\n" << " rst: in std_logic;\n" << " req: in std_logic;\n" << " dout: out std_logic_vector(7 downto 0) );\n" << "end entity IN_RAM;\n\n";

outfile << "architecture STRUCT of IN_RAM is\n\n" << " component RAMB4_S8 is\n" << " port( DI: in std_logic_vector(7 downto 0);\n" << " EN: in std_logic;\n" << " WE: in std_logic;\n" << " RST: in std_logic;\n" << " CLK: in std_logic;\n" << " ADDR: in std_logic_vector(8 downto 0);\n" << " DO: out std_logic_vector(7 downto 0) );\n"

<< " end component RAMB4_S8;\n\n";

a = b = j = k = 0; while (k<(row*col)) { i = 0; while (i<32) { if (a < 5) { temp[31-i] = i_mag[a][b]; } else { temp[31-i] = 0; } if (b == (col-1)) { b = 0; a++; k++; i++; } else { k++; b++; i++; } } outfile << " attribute INIT_0" << setw(1) << hex << j << ": string;\n" << " attribute INIT_0" << setw(1) << hex << j << " of IRM: label is \""; for (i=0; i<32; i++) { if (temp[i] < 16) { outfile << "0" << setw(1) << hex << temp[i]; } else { outfile << setw(2) << hex << temp[i]; } } outfile << "\";\n"; j++; }
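// Each INIT_0<j> attribute written above holds one 32-byte block of the image
// RAM contents; because the block is buffered into temp[31 - i] and then printed
// from temp[0] upward, the byte at the highest address of the block appears
// leftmost in the string (the string reads right-to-left in increasing address
// order).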

outfile << "\n signal din : std_logic_vector(7 downto 0);\n" << " signal addr: unsigned(8 downto 0);\n" << " signal adr : std_logic_vector(8 downto 0);\n" << " signal en : std_logic;\n" << " signal we : std_logic;\n";

outfile << "\n begin\n\n" << " L1: din <= (others=>'0');\n" << " L2: en <= '1';\n" << " L3: we <= '0';\n" << " L4: adr <= std_logic_vector(addr);\n\n";

outfile << " P1: process(clk, rst, req) is\n" << " begin\n" << " if (rst = '1') then\n" << " addr <= (others=>'0');\n" << " elsif (clk'event and clk = '1') then\n" << " if (req = '1') then\n" << " addr <= addr + 1;\n" << " end if;\n" << " end if;\n" << " end process P1;\n\n"; outfile << " IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, \n"

<< " ADDR=>adr, DO=>dout);\n";

outfile << "\nend architecture STRUCT;\n";

outfile.close();

return 0; }

Program 3 (Test Program: this program is responsible for comparing the uploaded outputs with the theoretically correct outputs)
#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>

void hex2file(ofstream &, int);

int main() { ifstream in_file1; ofstream out_file1, out_file2;

in_file1.open("coef.txt"); out_file1.open("input.txt"); out_file2.open("exp_res.txt");

int row, col, nfc; int a, b, k, m, n, mag, o_sel; int i_mag[5][60]; int fc[3][5][5]; unsigned int seed;

cout << "Seed number: "; cin >> seed; in_file1 >> nfc >> o_sel;

row = 5; col = 60; k = 0;

// Reading in the FC planes in coef.txt file while (k < nfc) { for (a = 0; a < 5; a++) { for (b = 0; b < 5; b++) { in_file1 >> mag; fc[k][a][b] = mag; } } k++; } in_file1.close();

// Generate Randomized Input Image plane with rand function srand(seed); out_file2 << "Input Image Plane (" << row << "x" << col << ") Seed: " << seed << endl; out_file2 << "------" << endl;

for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 256; while (mag > 255) {

mag = rand(); } i_mag[a][b] = mag; out_file1 << dec << setw(3) << mag << " "; out_file2 << setiosflags(ios::uppercase) << setw(3) << hex << mag << " "; } out_file1 << endl; out_file2 << endl; }

// Generate Expected Output for (k = 0; k < nfc; k++) { out_file2 << endl << endl << "Output Image Plane " << k+1 << endl << "------" << endl; for (a = 0; a < row; a++) { for (b = 0; b < col; b++) { mag = 0; for (m = 0; m < 5; m++) { for (n = 0; n < 5; n++) { if (!(((a-m+2) < 0) | ((a-m+2) >= row) | ((b-n+2) < 0) | ((b-n+2) >= col))) mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]); } } mag = mag & 0x7FFFF; hex2file(out_file2, mag); } out_file2 << endl; } }

out_file1.close(); out_file2.close();

return 0; } void hex2file(ofstream &outfile, int mag) { int i, temp;

for (i=0; i<5; i++) {

temp = mag & 0xF0000; temp = temp >> 16; outfile << hex << setiosflags(ios::uppercase) << temp; mag = mag << 4; } outfile << " "; }

Appendix E

VHDL Files for Modules External to the Convolution Architecture

1. Top Level Description of the whole system (the convolution architecture is included)

-- par2brd.vhd library IEEE, BRDMOD; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; use BRDMOD.brd_util.all;

entity PAR2BRD is port( -- inputs from board sw1: in std_logic; -- start signal sw2: in std_logic; -- global reset sw3: in std_logic; -- mux select for internal clock clk: in std_logic; -- external clock

-- from parallel port ppd: in std_logic_vector(7 downto 0); ppc: in std_logic_vector(3 downto 2); pps: out std_logic_vector(6 downto 3);

-- output to external SRAM (right bank) cen_r : out std_logic; wen_r : out std_logic; oen_r : out std_logic; addr_r: out std_logic_vector(18 downto 0); data_r: out std_logic_vector(15 downto 0);

-- output to external SRAM (left bank) cen_l : out std_logic; wen_l : out std_logic; oen_l : out std_logic; addr_l: out std_logic_vector(18 downto 0); data_l: out std_logic_vector(15 downto 0);

-- output from the interface clk_led: out std_logic; -- sl: out std_logic_vector(6 downto 0); -- sr: out std_logic_vector(6 downto 0) );

done: out std_logic ); end entity PAR2BRD;

architecture STRUCT of PAR2BRD is component SYS is port( clk, rst, str: in std_logic; d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF) coef: in std_logic_vector(5 downto 0); --(FCs from parallel port) ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp) au_sel: in std_logic_vector(1 downto 0); --(AU select from pp) o_sel: in std_logic; --(Output config from pp) req: out std_logic; --(Controller -> FIFO) sram_w: out std_logic; --(SYS -> SRAM) d_out: out std_logic_vector(18 downto 0) ); end component SYS;

component IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end component IN_RAM;

component IBUF is port( i: in std_logic; o: out std_logic ); end component IBUF;

component BUFG is port( i: in std_logic; o: out std_logic ); end component BUFG;

-- These signals are from parallel port interface signal d_clk, strobe, strobe_b: std_logic; signal nsw1, nsw2, nsw3 : std_logic; -- Internal connection signals signal req: std_logic; --(SYS -> IN_RAM) signal v_d : std_logic_vector(7 downto 0); --(PINTFC -> REG_A) signal v_t : std_logic_vector(18 downto 0); --(SYS -> SRAM) signal v_in : std_logic_vector(7 downto 0); --(IN_RAM -> SYS) -- signal v_led : std_logic_vector(7 downto 0); --(MUX -> SVNSEG) signal c_out : std_logic_vector(5 downto 0); --(coefficient output from FC_MOD) signal cf_out: std_logic_vector(7 downto 0); --(MAUs configuration output from FC_MOD) signal sram_w: std_logic; --(SYS -> OUT_RAM) signal cen : std_logic; signal wen : std_logic; signal oen : std_logic; signal data : std_logic_vector(18 downto 0); signal addr : std_logic_vector(18 downto 0); -- Clock selection signal signal c_sel : std_logic; signal p_clk : std_logic; -- filter coefficients programming clk

begin

-- External strobe buffering and padding B1: IBUF port map(i=>ppc(2), o=>strobe_b); B2: BUFG port map(i=>strobe_b, o=>strobe);

-- Inverting the logic level of the push buttons. S1: nsw1 <= not sw1; S2: nsw2 <= not sw2; S3: nsw3 <= not sw3;

-- Clock counter to reduce the clock frequency of the external clock C1: C_CNTR generic map(n=>12500) port map(clk=>clk, rst=>nsw2, co=>p_clk);

-- First In First Out queue after the parallel port -- F1: FIFO port map(rst=>nsw2, r_clk=>d_clk, r_en=>req, w_clk=>strobe, w_en=>ppc(3), -- din=>ppd, dout=>v_d, empty=>pps(3));

-- This parallel port interface is aimed to replace the FIFO queue P1: PINTFC port map(clk=>strobe, rst=>nsw2, ppd=>ppd, d_out=>v_d);

-- Drivers to the two seven segments LEDs -- SV1: SVNSG port map(ldg=>v_led(7 downto 4), rdg=>v_led(3 downto 0), sl=>sl, sr=>sr);

-- Filter Coefficient Programming Module FC1: FC_MOD port map(clk=>d_clk, rst=>nsw2, ppc=>ppc(3), ppd=>v_d, pps1=>pps(5), pps2=>pps(6), coef_out=>c_out, cfg_out=>cf_out);

-- SRAM Interface module (Responsible for writing output pixels to the external SRAM) SRM: OUT_RAM port map(clk=>d_clk, rst=>nsw2, w=>sram_w, d_in=>v_t, cen=>cen, wen=>wen, oen=>oen, addr=>addr, data=>data); L1 : cen_l <= cen; L2 : cen_r <= cen; L3 : wen_l <= wen; L4 : wen_r <= wen; L5 : oen_l <= oen; L6 : oen_r <= oen; L7 : addr_l <= addr; L8 : addr_r <= addr; L9 : data_l <= "0000000000000" & data(18 downto 16); L10: data_r <= data(15 downto 0);


-- Clock LED on the bar(6) LED L11: pps(3) <= d_clk; L12: done <= sram_w; -- MUX Select for SVNSEG LEDs display -- MUX: v_led <= v_t(7 downto 0) when nsw3 = '0' else v_t(15 downto 8);

-- Convolution System (req is replaced by pps(4) to the parallel port pin) -- (o_sel is the output config pin to select which output plane to output to svnseg) U0: SYS port map(clk=>d_clk, rst=>nsw2, str=>nsw1, coef=>c_out, ld_reg=>cf_out(4 downto 0), au_sel=>cf_out(6 downto 5), o_sel=>cf_out(7), req=>req, d_in=>v_in, sram_w=>sram_w, d_out=>v_t); -- Input RAM to provide input image pixels to the convolution system U1: IN_RAM port map(clk=>d_clk, rst=>nsw2, req=>req, dout=>v_in); -- MUX select for the internal operation clock by SW3 -- SELP: process (nsw3, nsw2) is -- begin

-- if (nsw2 = '1') then -- c_sel <= '0'; -- elsif (nsw3 = '1') then -- c_sel <= '1'; -- end if;

-- end process SELP; -- assign the clock as indicated by the c_sel signal L13: d_clk <= p_clk when nsw3 = '0' else clk; L14: clk_led <= c_sel; end architecture STRUCT;
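One consequence of the L9/L10 assignments above that matters when comparing uploaded SRAM contents against exp_res.txt: each 19-bit output word is split across the two SRAM banks, with bits 18..16 zero-extended into the left bank and bits 15..0 written to the right bank. A minimal C++ sketch of the reassembly (hypothetical helper, not part of the thesis software):

#include <cstdint>

// Rebuild one 19-bit convolution result from the corresponding word of each
// SRAM bank, mirroring L9 (data_l <= "0000000000000" & data(18 downto 16))
// and L10 (data_r <= data(15 downto 0)) in par2brd.vhd.
uint32_t rebuild_pixel(uint16_t left_word, uint16_t right_word) {
    uint32_t hi = left_word & 0x7u;   // only the low 3 bits of the left bank carry data
    return (hi << 16) | right_word;   // bits 18..16 from the left bank, bits 15..0 from the right
}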

2. Block RAM module (initialized with input image plane) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity IN_RAM is port( clk: in std_logic; rst: in std_logic; req: in std_logic; dout: out std_logic_vector(7 downto 0) ); end entity IN_RAM; architecture STRUCT of IN_RAM is

component RAMB4_S8 is port( DI: in std_logic_vector(7 downto 0); EN: in std_logic; WE: in std_logic; RST: in std_logic; CLK: in std_logic; ADDR: in std_logic_vector(8 downto 0); DO: out std_logic_vector(7 downto 0) ); end component RAMB4_S8;

attribute INIT_00: string; attribute INIT_00 of IRM: label is "b127037aa60fa2fcd6d2ecfba0b29b472f795754f8c707915e72910a387ba32d"; attribute INIT_01: string; attribute INIT_01 of IRM: label is "54c7000034647cf42036263aea806ef629ca8017f15167ae072fa406aa3c5f8e"; attribute INIT_02: string; attribute INIT_02 of IRM: label is "29e3340c8d55d63104709180640dde6598abfa87834f3f8b862521afb27b05da"; attribute INIT_03: string; attribute INIT_03 of IRM: label is "8c059a5e0000d10f29ef58e2343094fdf00c4875d7132b775a9d90eb0a2af23a"; attribute INIT_04: string; attribute INIT_04 of IRM: label is "f4ffbba026cb9a95e76073e78cf8bb71ec7adb544571aba14108426ce0b65b48"; attribute INIT_05: string; attribute INIT_05 of IRM: label is "562f3c0012a50000c449b579657f3430b517d329250b06e193764decbb8d2fae";

attribute INIT_06: string; attribute INIT_06 of IRM: label is "0e1445292bf6efc7ca1a168a0b28ac6c95e4c5e1eafa97d4a739e68847e36db2"; attribute INIT_07: string; attribute INIT_07 of IRM: label is "766d9cd4a6f5a327000025adece722128cec7c0d9ec1ecf1cb50f064f402ac29"; attribute INIT_08: string; attribute INIT_08 of IRM: label is "48acd73b6335198a049f8709c6e40466c6c633d89e44e024272e09b37771f914"; attribute INIT_09: string; attribute INIT_09 of IRM: label is "0000000000000000000000003f1b75a632e7d585d0d36448ec00ce97f55e0ad8";

signal din : std_logic_vector(7 downto 0); signal addr: unsigned(8 downto 0); signal adr : std_logic_vector(8 downto 0); signal en : std_logic; signal we : std_logic;

begin

L1: din <= (others=>'0'); L2: en <= '1'; L3: we <= '0'; L4: adr <= std_logic_vector(addr);

P1: process(clk, rst, req) is begin if (rst = '1') then addr <= (others=>'0'); elsif (clk'event and clk = '1') then if (req = '1') then addr <= addr + 1; end if; end if; end process P1;

IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, ADDR=>adr, DO=>dout); end architecture STRUCT;

3. FC Programming module -- fc_pg_mod.vhd (Filter Coefficient Programming Module) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all;

entity FC_MOD is port( clk, rst: in std_logic;

ppc: in std_logic; -- Control Pin from parallel port ppd: in std_logic_vector(7 downto 0); -- Input data from parallel port pps1: out std_logic; -- Status pin for request data pps2: out std_logic; -- Status pin for request cfg coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_MOD;

architecture STRUCTURAL of FC_MOD is component FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end component FC_REG;

component FC_FSM is port( clk, rst: in std_logic; ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end component FC_FSM;

signal rec_0, rec_1, prog: std_logic;

begin

FSM: FC_FSM port map(clk=>clk, rst=>rst, ctr_pin=>ppc, rec_0=>rec_0, rec_1=>rec_1, prog=>prog);

FCG: FC_REG port map(clk=>clk, rst=>rst, rec_0=>rec_0, rec_1=>rec_1, prog=>prog, d_in=>ppd, coef_out=>coef_out, cfg_out=>cfg_out);

-- Status pins to the parallel port O1: pps1 <= rec_0; O2: pps2 <= rec_1; end architecture STRUCTURAL;

-- fc_pg_fsm.vhd (Filter Coefficient Programming Finite State Machine) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_FSM is port( clk, rst: in std_logic; ctr_pin: in std_logic; rec_0: out std_logic; -- receive_data state enable pin rec_1: out std_logic; -- receive_config state enable pin prog: out std_logic ); -- program state enable pin end entity FC_FSM; architecture BEHAVIORAL of FC_FSM is type states is (idle, receive_data, receive_config, program);

signal c_state: states; -- Current State signal n_state: states; -- Next State

begin

NST_PROC: process(c_state, ctr_pin) is begin

case c_state is when idle => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if;

when receive_data => n_state <= receive_config;

when receive_config => n_state <= program;

when program => if (ctr_pin = '0') then n_state <= idle; else n_state <= receive_data; end if; end case; end process NST_PROC;

CST_PROC: process(clk, rst, n_state) is begin

if(rst='1') then c_state <= idle; elsif (clk'event and clk='0') then c_state <= n_state; end if; end process CST_PROC;

OUT_PROC: process(c_state) is begin

case c_state is when idle => rec_0 <= '0'; rec_1 <= '0'; prog <= '0';

when receive_data => rec_0 <= '1'; rec_1 <= '0'; prog <= '0';

when receive_config => rec_0 <= '0'; rec_1 <= '1'; prog <= '0';

when program => rec_0 <= '0'; rec_1 <= '0'; prog <= '1'; end case; end process OUT_PROC; end architecture BEHAVIORAL;

-- fc_pg_reg.vhd (Filter Coefficient Programming Registers) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity FC_REG is port( clk, rst: in std_logic; rec_0: in std_logic; rec_1: in std_logic; prog: in std_logic; d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals end entity FC_REG; architecture BEHAVIORAL of FC_REG is

signal d_reg: std_logic_vector(5 downto 0); signal c_reg: std_logic_vector(7 downto 0);

begin

-- Registers for storing Filter Coefficients REC_D: process(clk, rst, rec_0) is begin

if (rst = '1') then d_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_0='1') then d_reg <= d_in(5 downto 0); end if; end if;

end process REC_D;

-- MAUs configuration signals REC_C: process(clk, rst, rec_1) is begin

if (rst = '1') then c_reg <= (others=>'0'); elsif (clk'event and clk='1') then if (rec_1='1') then c_reg <= d_in(7 downto 0); end if; end if;

end process REC_C;

-- Enable Output from the registers O1: coef_out <= d_reg when prog = '1' else (others=>'0'); O2: cfg_out <= c_reg when prog = '1' else (others=>'0'); end architecture BEHAVIORAL;

4. SRAM Driver -- out_ram.vhd (Output Ram for storing output pixels from the architecture) library IEEE; use IEEE.std_logic_1164.all; use IEEE.numeric_std.all; entity OUT_RAM is port( clk, rst: in std_logic; w: in std_logic; -- read or write to SRAM d_in: in std_logic_vector(18 downto 0); -- Input data from architecture cen, wen: out std_logic; -- cen=chip enable, wen=write enable (both active low) oen: out std_logic; -- oen=out enable (active low) addr: out std_logic_vector(18 downto 0); -- SRAM address bus data: out std_logic_vector(18 downto 0) ); -- SRAM Data bus end entity OUT_RAM; architecture BEHAV of OUT_RAM is

signal w_address: unsigned(18 downto 0); -- signal i_data : unsigned(7 downto 0);

begin

-- Asynchronous Reset and positive edge trigger events P1: process (clk, rst, w) is begin

if (rst = '1') then wen <= '1'; oen <= '1'; addr <= (others => '0'); -- initial address during reset elsif (clk'event and clk = '1') then if (w = '1') then wen <= '0'; oen <= '1'; addr <= std_logic_vector(w_address); end if; end if;

end process P1;

-- Address counter P2: process(clk, rst, w) is begin

if (rst = '1') then w_address <= (others => '0'); elsif (clk'event and clk = '1') then if (w = '1') then w_address <= w_address + 1; end if;

end if;


end process P2;

-- Chip enable signal L1: cen <= clk;

-- Data Bus L2: data <= d_in; --std_logic_vector(i_data);

-- L3: i_data <= unsigned(w_address(7 downto 0)) + 4; end architecture BEHAV;

5. Parallel Port Interface Module library IEEE; use IEEE.std_logic_1164.all; entity PINTFC is port( clk, rst: in std_logic; ppd: in std_logic_vector(7 downto 0); d_out: out std_logic_vector(7 downto 0) ); end entity PINTFC; architecture BEHAVIORAL of PINTFC is signal data: std_logic_vector(7 downto 0);

begin

REC: process(clk, rst) is begin

if (rst = '1') then data <= "00000000"; elsif (clk'event and clk = '1') then data <= ppd; end if; end process REC;

L1: d_out <= data; end architecture BEHAVIORAL;

References

[1] Bernard Bosi and Guy Bois, “Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 3, pp. 299-308, Sep. 1999.

[2] Cheng-The Hsieh and Seung P. Kim, “A Highly-Modular Pipelined VLSI Architecture for 2-D FIR Digital Filter,” Proceedings of the 1996 IEEE 39th Midwest Symposium on Circuits and Systems, Part 1, pp. 137-140, Aug. 1996.

[3] D. D. Haule and A. S. Malowany, “High-speed 2-D Hardware Convolution Architecture Based on VLSI Systolic Arrays”, IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 52-55, Jun. 1989.

[4] D. Patterson and J. Hennessy, Computer Organization & Design: The Hardware / Software Interface, Morgan Kaufmann, 1994.

[5] GSI Technology, Product Datasheet, http://www.gsitechnology.com.

[6] H. T. Kung, “Why Systolic Architectures?”, IEEE Computer, Vol. 15, pp. 37-46, Jan. 1982.

[7] Hyun Man Chang and Myung H. Sunwoo, “An Efficient Programmable 2-D Convolver Chip”, Proceedings of the 1998 IEEE International Symposium on Circuits and Systems, ISCAS, Part 2, pp. 429-432, May 1998.

[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann, 1996.

[9] K. Hsu, L. J. D’Luna, H. Yeh, W. A. Cook and G. W. Brown, “A Pipelined ASIC for Color Matrixing and Convolution”, Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, Sep. 1990.

[10] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.

[11] M. Morris Mano and Charles R. Kime, Logic and Computer Design Fundamentals, Prentice Hall, 1997.

[12] Michael J. Flynn and Stuart F. Oberman, Advanced Computer Arithmetic Design, Wiley-Interscience, 2001.

[13] O. L. MacSorley, “High-Speed Arithmetic in Binary Computers”, Proceedings of the IRE, Vol. 49, pp. 67-91, Jan. 1961.

[14] V. Hecht, K. Rönner and P. Pirsch, “An Advanced Programmable 2D-Convolution Chip for Real Time Image Processing”, Proceedings of IEEE International Symposium on Circuits and Systems, pp. 1897-1900, 1991.

[15] Vijay K. Madisetti and Douglas B. Williams, The Digital Signal Processing Handbook, CRC Press and IEEE Press, 1998.

[16] Wayne Niblack, An Introduction to Digital Image Processing, Prentice/Hall International, 1986.

[17] Xess Co., XSV Board User Manual, http://www.xess.com/manuals/xsv-manual-v1_1.pdf

[18] Xilinx Co., Foundation 4.1i Software Manual, http://toolbox.xilinx.com/docsan/xilinx4/pdf/manuals.pdf

Vita

Albert Wong was born on January 1, 1975 in Sibu, Sarawak, Malaysia. He attended SMB Methodist secondary school in Sibu and graduated in 1993. He obtained his Bachelor of Science in Electrical Engineering degree in May of 1998 from the University of Kentucky, Lexington, Kentucky. He enrolled in the University of Kentucky Graduate School in the fall semester of 1999.
