<<

Single Rail Ternary Null Convention Logic Architecture for Digital Signal Processing Applications

A thesis submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy

Sameh Andrawes Master of Engineering (Electronics and Telecommunications) Victoria University, Melbourne, Australia

School of Engineering College of Science, Engineering and Health RMIT University February 2020

Declaration

I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and, ethics procedures and guidelines have been followed.

I acknowledge the support I have received for my research through the provision of an Australian Government Research Training Program Scholarship.

Sameh Andrawes February 2020

Abstract

Synchronous design techniques have been used for decades to design and implement digital systems mainly due to their simplicity and the ready availability of sophisticated tools. While these techniques offer many advantages there are also some disadvantages. Issues such as clock skew and the need for high power clock drivers to generate the required global clock may result in large area and high dynamic power consumption. In contrast, asynchronous techniques eliminate the need for a global clock along with its associated drivers and offer a promising path to overcome many of the problems with synchronous design.

While there are many different techniques and architectures that can be used to create asynchronous digital systems, Null Convention Logic is considered one of the more effective methods due to its straightforward “structural” approach to design, being based on a predefined library of majority logic (threshold) gates. Just as for any other asynchronous digital design technique, Null Convention Logic eliminates the need for a high frequency global clock, replacing it with localised handshaking signals. This may lead to lower power consumption in some circumstances and the gate library approach does not require the sort of complicated timing analysis required by other methodologies, including synchronous. A Null Convention Logic design can therefore be “correct by construction”, with its overall performance adjusting automatically to suit changes in operating conditions due to process variability, supply voltage and/or temperature.

On the other hand, there are various drawbacks. Null Convention Logic relies on multi-rail signals, where each individual logic signal is represented by a dual-rail or quad-rail group of two or four wires respectively. While multi-rail logic in NCL may simplify path logic in some cases, it invariably results in greater area and also permits the generation of illegal states. For instance, in both dual-rail and quad-rail systems, the rails must be mutually exclusive, such that they can never be asserted simultaneously. This type of illegal state may occur within high speed, low power system on chip (SOC) implementations due to system noise or delays caused by variable interconnection lengths or unbalanced fanouts.

This research proposes and analyses the concept of using single-rail, ternary logic to represent the three levels required to define Null Convention Logic, i.e., Data-One, Null and Data-Zero. The architecture is implemented using two different CMOS technology processes and voltage level mappings to demonstrate that it is largely technology independent. Two different versions are presented in this work: register-controlled and register-less ternary. The more conventional register-controlled ternary NCL system relies on the use of ternary delay insensitive registers to control the flow of the data in the pipeline. The “register-free” implementation eliminates the additional pipeline registers, which helps to reduce both the total design area and the power consumption. Both designs propagate data only and do not need to propagate Null, thereby enhancing the performance of the design by eliminating the Null cycle propagation.

A representative example of a Short Word Length Digital Finite Impulse Response Low Pass Filter is used to demonstrate that the Single-Rail Ternary Logic approach can be used to design and implement a sophisticated NCL system. The system was designed and implemented at the CMOS transistor level and the three alternative architectures, dual-rail Binary Logic NCL, register-controlled Single-Rail Ternary Logic NCL and register-less Single-Rail Ternary Logic NCL, were compared in terms of total design area, power consumption and design performance. The register-less ternary NCL example system exhibited an approximately 30% reduction in both area and propagation delay, at the cost of a significant increase in power compared to dual rail binary NCL. The register-less ternary system reduced power consumption and design area by almost 2.5% and 8% respectively compared to the register-controlled ternary version. These comparisons indicate that both register-controlled and register-less Single-Rail Ternary Logic NCL can reduce the design area, as each signal is represented by only one wire, and at the same time can enhance system performance. However, as the conventional dual-rail Binary Logic NCL still exhibits much lower overall power consumption, the main advantages of the two proposed Single-Rail Ternary Logic NCL architectures are their slightly higher performance and the elimination of the illegal states that can otherwise occur. They also offer a means to save area where this is a key design parameter.

Acknowledgements

I would like to express my sincere gratitude to my supervisor Professor Paul Beckett for his continuous support and being always available. His door was always open and without his guidance I couldn’t have finished this thesis.

As a Part-Time PhD student, I experienced many unexpected hurdles while doing my PhD and sometimes it was difficult to keep my spirits high, but my supervisor’s patience and enthusiasm motivated me to complete and finish this research. One of my biggest challenges was changing my study load from a full-time research student to a part-time one due to family circumstances and therefore giving up my full-time scholarship. Part-time research students always have to juggle a research degree with a professional career, and I was no exception. Whilst my career requires lots of continuous self-development and self-learning due to the rapid rate of technology change, it was still a big challenge to be able to allocate the necessary time every week to focus on my research degree.

Accessing the simulation tools remotely over the internet was always a challenge, especially when I had to relocate interstate and the simulator performance was slow and therefore getting the required results was taking longer than expected. However, even while I was interstate, my supervisor always made himself available and was able to attend our fortnightly meeting either on Skype or on the phone.

My sincere thanks also go to RMIT University for offering me the scholarship including all the academic and administrative staff for all their great work and support.

Last, but by no means least, I would like to sincerely thank my wife Lucy and my daughter Scarlett for their love, support, immense encouragement and for ultimately being very understanding about what I was going through and for putting up with my absence for many nights and weekends.

Contents

Declaration
Abstract
Acknowledgements
List of Figures
List of Tables

CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Research Motivation and Purpose
1.3 Research Questions
1.4 Publication arising from this work
1.5 Introduction to the key concepts used in this work
    1.5.1 Delay Insensitive Implementation using NCL
    1.5.2 Multi Threshold CMOS
1.6 Novel contributions and outcomes
1.7 Thesis Organisation

CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
2.2 Asynchronous Logic
2.3 Null Convention Logic
2.4 Multi-rail NCL Transistor Level Design
    2.4.1 Static Null Convention Logic Technique (S-NCL)
    2.4.2 Semi-Static CMOS Null Convention Logic (SS-NCL)
    2.4.3 Differential Null Convention Logic NCL (D-NCL)
    2.4.4 Multi Threshold Null Convention Logic (MTNCL)
2.5 Ternary Logic Circuits
2.6 Summary

CHAPTER 3: REGISTER-CONTROLLED TERNARY NCL SYSTEM
3.1 Introduction
3.2 Multi Threshold Single Rail Ternary Logic Null Convention logic System (MT-SR-TNCL)
3.3 CMOS Multi Threshold Single Rail Ternary Logic NCL (CMOS MT-ST-TNCL) – Design # 1
    3.3.1 Ternary NCL Register
    3.3.2 Multi Threshold Ternary Logic NCL Gates
    3.3.3 Summary of Operation: Multi Threshold Ternary NCL, Design #1
3.4 Multi Threshold Single Rail Ternary Logic NCL (MT-ST-TNCL) – Design # 2
    3.4.1 Ternary Detector Design Model #2
3.5 NCL Delay Insensitive Register Comparison
3.6 Summary

CHAPTER 4: A REGISTER-LESS TERNARY NCL SYSTEM
4.1 Introduction
4.2 Register-less Ternary NCL Architecture
    4.2.1 Ternary Detector Gate
    4.2.2 Register-Less Multi Threshold Ternary NCL Gates
    4.2.3 Hold Gate
4.3 Implementation Example: 8-Bit Full Adder
    4.3.1 Simulation Results
4.4 Summary

CHAPTER 5: SWL FIR FILTER CASE STUDY
5.1 Case Study Background
    5.1.1 Digital Filter
    5.1.2 Sigma–Delta Modulation
    5.1.3 SWL Finite Impulse Response Low Pass Filter
5.2 Methodology
    Step – 1: MATLAB Filter Design
    Step – 2: NCL Pipeline Hardware Design
    Step – 3: System Verilog Code Design
    Step – 4: FIR Filter Hardware Design
    Step – 5: Hardware Design Verification using MATLAB Code
5.3 Simulation Results and discussion
    5.3.1 Serial to Parallel Shift Register Comparison
    5.3.2 Multiply-Accumulate Comparison
    5.3.3 Digital Low Pass Filter Comparison
5.4 Summary

CHAPTER 6: CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future work

REFERENCES

APPENDIX – A: REGISTER-CONTROLLED TERNARY NCL VERILOG CODE
APPENDIX – B: REGISTER-LESS TERNARY NCL VERILOG CODE
APPENDIX – C: CONVENTIONAL BINARY NCL VERILOG CODE
APPENDIX – D: FIR MATLAB CODE

List of Figures

Figure 1-1: Multi Rail Binary Logic NCL System
Figure 1-2: Dual Rail NCL Delay Insensitive (DI) Register
Figure 1-3: Hysteresis state-holding behaviour of NCL gate
Figure 1-4: C-element Gate
Figure 2-1: Static CMOS Multi-Rail NCL Threshold Logic Gate Implementation
Figure 2-2: Transistor Level Design of the TH23 Static NCL Threshold Logic Gate
Figure 2-3: TH23 No-Hold CMOS Multi-Rail NCL Threshold Gate
Figure 2-4: Semi-Static Null Convention Logic Implementation
Figure 2-5: TH23 Semi-Static NCL
Figure 2-6: Semi-Static Diode-Connected CMOS Multi-rail NCL Logic Gates implementation
Figure 2-7: Pseudo Semi-Static NCL
Figure 2-8: Differential NCL
Figure 2-9: Semi Differential NCL
Figure 2-10: Ternary inverter using resistance
Figure 2-11: Ternary Buffer
Figure 2-12: Ternary Logic Signalling System - TLSS
Figure 2-13: Ternary AND Logic Gate
Figure 2-14: Ternary OR Logic Gate
Figure 2-15: Ternary encoding scheme
Figure 2-16: Schematic of STI, NTI and PTI inverters with different W/L ratio
Figure 2-17: Ternary to Binary Decoder
Figure 2-18: IS_Data Ternary Logic Block utilizing Reverse Body Bias to achieve the required threshold voltage
Figure 3-1: Dual Rail Binary Null Convention Logic System
Figure 3-2: Single Bit Dual Rail NCL Register with built in Completion Detection Gate
Figure 3-3: N-bit Completion Detection Gate
Figure 3-4: The First Version of the CMOS Multi Threshold Single Rail Ternary Logic NCL System Pipeline
Figure 3-5: Minimum structure of the Single Rail Ternary Logic NCL Pipeline System
Figure 3-6: The components of the Ternary NCL Register
Figure 3-7: Ternary Detector Gate – First Version
Figure 3-8: Switching Threshold Voltage for IS Zero and IS Null Gates
Figure 3-9: Ternary NCL Detector Circuit Waveform
Figure 3-10: Current versus Ternary Input Signal
Figure 3-11: Hold Circuit
Figure 3-12: Hold Null Gate Waveform
Figure 3-13: Hold One Gate Waveform
Figure 3-14: Hold Zero Gate Waveform
Figure 3-15: Multi Threshold Ternary NCL Gate Architecture
Figure 3-16: Ternary Detector Circuit Version # 2
Figure 3-17: Multi Threshold Binary Logic NCL Threshold Gate Version # 1
Figure 3-18: Multi Threshold Binary Logic NCL Threshold Gate Version # 2
Figure 3-19: Glitches in Binary NCL when optimized for area
Figure 4-1: Multistage Register-Less Multi Threshold Single Rail Ternary Logic NCL Architecture
Figure 4-2: Single Stage Register-Less Multi Threshold Single Rail Ternary Logic NCL Architecture
Figure 4-3: Active Control Signal Generator
Figure 4-4: Multi Threshold NCL AND Gate for Register-Less NCL Architecture
Figure 4-5: Register-Less Multi Threshold Single Rail Ternary Logic NCL One Bit Full Adder
Figure 4-6: Schematic Diagram of Register-Less Multi Threshold Single Rail Ternary Logic NCL One Bit Full Adder
Figure 4-7: Multi-Threshold Single-Rail Ternary Logic NCL One Bit Full Adder Waveform
Figure 5-1: Simplified Structure of the Direct Form FIR Filter
Figure 5-2: Sigma-Delta Analog-to-Digital Converter
Figure 5-3: Noise Shaping to enhance the Signal-to-Noise Ratio
Figure 5-4: Short Word Length Digital Low Pass Filter
Figure 5-5: Finite Impulse Response Low Pass Filter Magnitude Response
Figure 5-6: Finite Impulse Response Low Pass Filter Impulse Response
Figure 5-7: Impulse Response of the FIR LPF, with OSR = 8
Figure 5-8: Binary NCL FIR Architecture – Partial View
Figure 5-9: Binary NCL FIR Waveforms Partial View
Figure 5-10: Register-controlled Ternary NCL FIR Architecture – Partial View
Figure 5-11: Register-Less Ternary NCL FIR Architecture – Partial View
Figure 5-12: Register-Less Ternary NCL FIR Waveforms Partial View
Figure 5-13: Register-Controlled Ternary Pipeline
Figure 5-14: Binary Pipeline
Figure 5-15: Architecture of 1-Bit Ternary Serial to Parallel Shift Register
Figure 5-16: Ternary Serial-to-Parallel Shift Register Waveform
Figure 5-17: Architecture of 1-Bit Binary Serial-to-Parallel Shift Register
Figure 5-18: Binary Serial-to-Parallel Shift Register Waveform
Figure 5-19: Multiply and Accumulate
Figure 5-20: Transistor Level Design of Register-Less Multi Threshold Ternary NCL One Bit Half Adder (dimensions in nm)
Figure 5-21: Ternary Full Adder Logic Gate
Figure 5-22: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL AND Logic Gate (dimensions in nm)
Figure 5-23: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL One Bit Half Adder (dimensions in nm)
Figure 5-24: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL One Bit Full Adder
Figure 5-25: Binary Null Convention Logic Multiplier Logic Gate
Figure 5-26: Binary Null Convention Logic Half Adder Architecture
Figure 5-27: Binary Null Convention Logic Full Adder Architecture
Figure 5-28: Output from Cadence Hardware Ternary LPF
Figure 5-29: Output from Ternary LPF MATLAB Code

List of Tables

Table 1-1: Multi Threshold Single Rail Ternary
Table 1-2: Dual-Rail NCL Truth Table
Table 1-3: NCL Threshold Logic Gates Function
Table 3-1: Ternary signal level mappings for single rail ternary NCL System
Table 3-2: Truth Table of Ternary NCL Detector Circuit – Version # 1
Table 3-3: Threshold Voltages of the 45nm SOI process technology
Table 3-4: Truth Table of the Single Rail Ternary Logic NCL System – Design Model # 2
Table 3-5: Truth Table of Ternary Detector Circuit Version # 2
Table 3-6: NCL Value Mapping with both of Ternary and Binary NCL Systems
Table 3-7: The evaluation of the NCL Delay Insensitive Register Techniques
Table 4-1: 8-Bit Full Adder Analysis and Comparison
Table 5-1: Quantised Signal Mapped to Binary and Ternary NCL
Table 5-2: Ternary versus Binary Null Convention Logic Implementation of the 32-Bit Serial to Parallel Shift Register
Table 5-3: Ternary versus Binary Null Convention Logic Implementation of the Multiply and Accumulate Stage
Table 5-4: Ternary versus Binary Null Convention Logic Implementation of the Digital Low Pass Filter

Chapter 1: Introduction

1.1 INTRODUCTION

For many years the semiconductor industry has been able to exploit ever smaller transistor sizes to improve computational performance and electrical efficiency [1]. However, this is unlikely to continue. The International Technology Roadmap for Semiconductors (ITRS) [2] identifies a range of technical challenges that represent critical “road-blocks” to be solved if the industry is to continue to meet the demands of its customers for cheaper, faster and more capable embedded electronic systems into the future. Key amongst these challenges are the interrelated issues of power and performance in conventional clocked processor systems.

Sequential digital circuits can be divided into two general categories, synchronous and asynchronous [3], [4], which use vastly different approaches to execute and process the digital input data. Synchronous circuits employ a global synchronisation signal (the clock) that dictates the pace of their data processing. This concept has been used for decades due to its simplicity. However, in recent years, it has faced numerous problems arising from the fabrication processes for nano-scale technologies as well as from the methodology itself.

For instance, in order to maintain data integrity, the fastest block has to wait for the slowest block to finish the execution of the data. Clock skew has become a major problem in synchronous systems as clock rates have significantly increased while feature sizes have decreased [5]. To achieve acceptable skew figures, high performance chips must dedicate increasingly larger portions of their area to high speed clock drivers, causing these to dissipate increasingly higher power, thereby increasing the overall power consumption of the chip. In a synchronous digital system, the activity of the clock signal is a major energy consumer, as it can be responsible for something like 15% to 45% of the total consumed energy [6]. Reducing clock activity may result in not only a reduction in dynamic switching energy, but also a reduction in clock skew problems, and offers additional benefits such as a reduction in electromagnetic radiation.

Power and performance are therefore key amongst the challenges facing conventional clocked processor systems. Techniques such as power gating that switch off unused functionality are becoming common in synchronous designs [7]. However, these various techniques are, at best, ad-hoc solutions that not only have an impact on circuit performance but can also greatly increase both hardware and software complexity. Hence, there may be an advantage in removing the clock generators and drivers to make the digital design clock free.

In asynchronous systems there is no need for a global clock or a clock generator as there are no requirements for a global synchronisation signal. The term “asynchronous” is used here to indicate that, instead of a global clock, the design relies on autonomous distributed control, i.e., handshake signals. Therefore, this approach removes the problems associated with clock distribution, replacing these with the issues of completion detection and localised handshaking. Furthermore, unlike in the case of synchronous systems, each block within the asynchronous circuit can operate at its optimal speed, determined by the complex interactions between fabrication process variability, power supply voltage stability, ambient temperature (PVT conditions) and other issues. Under specific circumstances, asynchronous circuits can exhibit superior characteristics to their synchronous counterparts [8], [9]. For instance, they can offer lower power consumption thanks to their data-driven operation, lower electromagnetic interference (EMI) due to uncorrelated switching, the possibility of dynamic voltage scaling and the ability to withstand harsh environments [10].

However, it must be remembered that asynchronous circuits require interfaces to an external environment. This work focusses on fully asynchronous systems, for example, event driven signal processing, where the data values are already in the correct format. In this case, there is no requirement to convert between binary and ternary at the input and output stages.

1.2 RESEARCH MOTIVATION AND PURPOSE

There are many techniques and architectures that can be used to represent and implement asynchronous systems. This research has focussed on Null Convention Logic (NCL) as an implementation technology due to its straightforward “structural” approach to design, being based on a predefined library of majority logic (threshold) gates. At the physical level, traditional NCL tends to be based on some form of multi rail logic to represent the data in the digital system. Each bit, whether it is an input, internal or output data signal, can be a member of either a dual-rail or quad-rail signal. This increases the number of wires required to represent each logic bit, which increases the design complexity and design area and may even lead to higher power consumption. Null Convention Logic circuits typically use a pre-defined library of 27 threshold gates, each with between two and four inputs, to implement the required design. NCL systems can suffer from illegal states where, for example, both rails are high in a dual rail system. This may occur due to noise and can lead to unknown or indeterminate behaviour in the system. NCL circuits are not designed to handle or process illegal states arising in this way, and there is no guarantee that such an illegal state will not propagate forward through the network.

In this thesis, a novel architecture is proposed to represent and implement a multi threshold Null Convention Logic system using single rail ternary logic instead of multi rail binary logic. In the proposed Multi Threshold Single Rail Ternary Null Convention Logic (MT-SR-TNCL) architecture, each bit is represented by only a single rail (single wire) instead of multiple wires. The first outcome of this is that it eliminates the possibility of illegal states, guaranteeing that they cannot happen under any condition. Secondly, the proposed ternary null convention logic architecture may potentially lead to reduced design area and complexity as it uses single rail instead of multi rail signalling. Reducing the required number of wires will reduce the routing complexity and may lead to shorter paths between gates, leading to an increase in system performance.

Table 1-1: Multi Threshold Single Rail Ternary Null Convention Logic Architecture Truth Table

Ternary Voltage Level    Null Convention Logic Mapping
VDD                      Data One
VDD/2                    Null
Zero (GND)               Data Zero

In the multi threshold single rail ternary Null Convention Logic architecture, the three Null Convention Logic symbols, DATA (1), NULL (Data not Valid) and DATA (0), are mapped to “VDD”, “VDD/2” and “Zero” respectively, as shown in Table 1-1. The NULL value (indicating that data is not yet available) is represented by “VDD/2”, so each successive transition on the input data rail or the output data rail has a swing of “VDD/2”; this includes the internal signals between the different ternary gates. Whilst this minimises switching power, it comes at the expense of a reduction in static noise margin. In the NCL digital system, legal transitions can only occur between DATA and NULL or vice versa. Hence, no transitions may take place directly between logic values, i.e., there is no direct transition from Data One to Data Zero or vice versa. This is a characteristic of the proposed multi threshold ternary NCL architecture.
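The mapping in Table 1-1 and the DATA/NULL transition rule can be illustrated with a short behavioural sketch. This is only an illustrative model, not the transistor-level design presented in this thesis; the names `Ternary`, `is_legal_transition`, `swing` and `check_sequence` are hypothetical, and the levels are normalised to a supply of 1.0 by assumption.

```python
from enum import Enum


class Ternary(Enum):
    """Single-rail ternary symbols, stored as a fraction of VDD (assumed normalisation)."""
    DATA0 = 0.0  # Zero (GND)
    NULL = 0.5   # VDD/2
    DATA1 = 1.0  # VDD


def is_legal_transition(prev, nxt):
    """A transition is legal only between DATA and NULL (in either direction).

    Holding the same symbol is not counted as a transition here.
    """
    return (prev is Ternary.NULL) != (nxt is Ternary.NULL)


def swing(prev, nxt, vdd=1.0):
    """Voltage swing of a transition for a given supply voltage."""
    return abs(nxt.value - prev.value) * vdd


def check_sequence(seq):
    """True if every consecutive pair of symbols is a legal transition."""
    return all(is_legal_transition(a, b) for a, b in zip(seq, seq[1:]))


if __name__ == "__main__":
    wave = [Ternary.NULL, Ternary.DATA1, Ternary.NULL, Ternary.DATA0, Ternary.NULL]
    print(check_sequence(wave))                     # every step is DATA <-> NULL
    print(swing(Ternary.NULL, Ternary.DATA1))       # each legal step swings VDD/2
```

In this model every legal transition has a swing of half the supply, while a direct Data One to Data Zero step (a full-supply swing) is rejected.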

1.3 RESEARCH QUESTIONS

To address the sorts of issues with NCL outlined in the previous section, this research has addressed a number of specific questions:

1) Is there an alternative to Multi Rail Binary Logic to represent the Null Convention Logic System? For instance, can Single Rail Ternary Logic be used as an alternative to Multi Rail Binary Logic to design and implement the Null Convention Logic System?

2) Is there a method to design and implement the Null Convention Logic system using a Single Rail instead of Multi Rail without degrading the NCL system characteristics in terms of observability, completeness and hysteresis?

3) How can the new system be built and structured? What would the architecture of the new system design be, including the transistor level design?

4) Is the proposed architecture dependent on a specific CMOS process technology or specific voltage mapping?

5) By using Single Rail Ternary Logic as an alternative to Multi Rail Binary Logic, will the performance of the NCL System be enhanced in terms of the overall system speed/performance, required number of transistors, total power consumption and finally design complexity?

6) Can the proposed Single Rail Ternary NCL system be used to efficiently implement complex Digital Signal Processing Systems?

1.4 PUBLICATION ARISING FROM THIS WORK

The following publications are a result of the work that has been done in this thesis:

1. S. Andrawes and P. Beckett, “Null Convention Logic Circuits using Balanced Ternary on SOI,” The 2011 2nd International Congress on Computer Applications and Computational Science, CACS 2011.

2. S. Andrawes and P. Beckett, “Ternary Circuits for Null Convention Logic,” The Seventh IEEE International Conference on Computer Engineering & Systems, ICCES 2011.

3. S. Andrawes and P. Beckett, abstract of “Balanced Ternary Null Convention Logic (BTNCL) Circuits,” The 6th IEEE International Conference on Broadband Communications and Biomedical Applications, IB2COM 2011.

4. S. Andrawes and P. Beckett, “Null Convention Logic Circuits Using Balanced Ternary on SOI,” Proceedings of the 2011 2nd International Congress on Computer Applications and Computational Science, Springer, 2012.

5. S. Andrawes and P. Beckett, “Balanced Ternary Null Convention Logic Circuits (BTNCL) on SOI,” The 17th Asia and South Pacific Design Automation Conference, ASP-DAC 2012.

1.5 INTRODUCTION TO THE KEY CONCEPTS USED IN THIS WORK

The two dominant techniques currently used to implement asynchronous digital systems are the Bounded Delay (BD) and Delay Insensitive (DI) methodologies [11]. Of these, the Delay Insensitive (DI) technique is the more widely used [12]. In Delay Insensitive digital design, the flow of data from one logic gate or block of gates to another is controlled by local handshake signals. In this way, the arrival of data at the input of the gate (or gates) triggers the start of a computation, and the next computation can be initiated immediately after the result of the first is completed.

The DI model is the most robust of all asynchronous circuit delay models [13], as it makes the fewest assumptions about the delay of the wires or the gates. In this model all transitions on the gates or the wires must be acknowledged before transitioning again to the following gates. In Delay Insensitive circuits any transition on an input to a gate must be fully complete at the output of the gate before a subsequent transition on that input is allowed to happen. This forces some input states or sequences to become illegal. For example, an OR gate must never go into the state where both inputs are one, as the entry to and exit from this state will not be seen on the output of the gate [14].
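The OR gate example can be checked by enumeration. The following sketch, using the hypothetical helper `unobservable_transitions`, lists every single-input change on a two-input Boolean OR gate that leaves the output unchanged; it is a plain Boolean model intended only to illustrate the observability argument, not an NCL gate.

```python
from itertools import product


def or_gate(a, b):
    """Plain two-input Boolean OR."""
    return a | b


def unobservable_transitions(gate, n_inputs=2):
    """Return (before, after) input states where exactly one input toggles
    but the gate output does not change, so the transition cannot be
    acknowledged by observing the output."""
    hidden = []
    for state in product((0, 1), repeat=n_inputs):
        for i in range(n_inputs):
            nxt = list(state)
            nxt[i] ^= 1  # toggle a single input
            nxt = tuple(nxt)
            if gate(*state) == gate(*nxt):
                hidden.append((state, nxt))
    return hidden


if __name__ == "__main__":
    for before, after in unobservable_transitions(or_gate):
        print(before, "->", after)
```

Under this model, every hidden transition either enters or leaves the state where both inputs are one, which matches the restriction described above.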

1.5.1 Delay Insensitive Implementation using NCL

One member of the Delay Insensitive asynchronous logic class is Null Convention Logic (NCL) [15], [16]. NCL is termed Quasi Delay Insensitive as it imposes a small number of timing assumptions on the circuit. In particular, it assumes that all digital signalling transitions are monotonic (i.e., are a single, unidirectional transition) and that wire forks in the logic network are isochronic in that every transition on the fork must be acknowledged by at least one branch of that fork. These assumptions impose two key limitations on NCL that are relatively easy to meet. Firstly, signal transition times must be bounded (i.e., they must eventually occur), and secondly all signal transitions in the circuit must be observable. This latter restriction requires the elimination of so-called orphan nets, which are parts of the circuit that receive a signal transition sequence but do not acknowledge that transition or are not included in the so-called completion network for that signal. However, the presence of instability or noise in the circuit can cause it to violate these simple timing assumptions, forcing it into illegal states and resulting in incorrect or unstable results, deadlock, and even circuit damage.

By obeying these timing restrictions, NCL [17] becomes a symbolically complete asynchronous logic system that implicitly and completely expresses logic processes and therefore represents a convenient way to describe Delay Insensitive asynchronous digital logic. In this sense, Clocked Boolean Logic (CBL) is not complete as it requires the inclusion of an independent time variable (i.e. a clock) that must be very carefully coordinated with the logic part of the expression to completely and effectively express an operation.

In common with other asynchronous styles and models, NCL offers a number of advantages, including but not limited to:

• an intrinsic lack of global clocks and their associated routing problems;

• high tolerance to variability in both manufacturing (i.e. dopant levels, line roughness, mask alignment errors etc.) and environment (i.e. voltage, temperature);

• significantly lower electromagnetic interference (EMI) due to its uncorrelated switching noise [18].

While the problem of timing closure disappears, it is replaced by a need to determine efficient techniques to automatically adjust the distribution of buffer cycles within the asynchronous pipeline for optimum throughput rate. While some of these advantages are shared by asynchronous techniques in general, NCL systems can be said to exhibit specific behaviour which is self-determined, locally autonomous, self-synchronising and delay insensitive. NCL adds the control value NULL (i.e. Data not valid) to the Boolean set to create a symbolically complete and delay insensitive three-value logic system [19], [20]. A gate will then only assert its output data when a complete set of (valid) data values is present at its input, thereby enforcing a “completeness of input” criterion.

Table 1-2: Dual-Rail NCL Truth Table

Value    Data1    Null    Data0
X1       Data     Null    Null
X0       Null     Null    Data

Most implementations to date have been based on a “one-hot” dual rail signalling approach to support the required three symbols (0, Null, 1). In this case, each logic rail asserts a single DATA (D) value representing either a logic 0 or 1. When both lines are de-asserted (e.g., zero or low voltage), the NULL (N) value indicates “no valid data” or “data is not yet available”. A one-hot code like this requires the wire assertions to be mutually exclusive for each binary bit, as illustrated in Table 1-2. For example, a binary dual rail signal X will have two wires, X1 and X0, which indicate the values DATA One and DATA Zero respectively when asserted. It is illegal to assert those two wires at the same time (since they are mutually exclusive). However, as mentioned above, this can still happen because of system noise. In general, for NCL dual rail encoded signals, all valid code-words are referred to as DATA (with the value added as a suffix where a distinction is needed, for example DATA1 and DATA0 as above), while the NULL state, in which all rails are de-asserted (both X1 and X0 are zero), represents the absence of valid data.
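As an illustration only, the one-hot encoding of Table 1-2 can be sketched in a few lines of Python. The names Symbol, encode and decode are hypothetical and do not belong to any NCL tool flow; the sketch simply shows how the three symbols map onto the (X1, X0) rail pair and why the state with both rails asserted has to be rejected.

```python
from enum import Enum


class Symbol(Enum):
    NULL = "NULL"
    DATA0 = "DATA0"
    DATA1 = "DATA1"


def encode(symbol):
    """Return the (X1, X0) rail pair for a symbol, following Table 1-2."""
    return {
        Symbol.NULL: (0, 0),   # both rails de-asserted
        Symbol.DATA0: (0, 1),  # only X0 asserted
        Symbol.DATA1: (1, 0),  # only X1 asserted
    }[symbol]


def decode(x1, x0):
    """Map a rail pair back to a symbol; both rails asserted is illegal."""
    if x1 and x0:
        raise ValueError("illegal dual-rail state: X1 and X0 both asserted")
    if x1:
        return Symbol.DATA1
    if x0:
        return Symbol.DATA0
    return Symbol.NULL
```

For example, decode(*encode(Symbol.DATA1)) returns Symbol.DATA1, while decode(1, 1) raises an error for the illegal state that noise could produce.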

In this way, NCL can be set up to be essentially correct-by-construction and it is fairly straightforward to achieve a working system using pre-built IP blocks. Existing approaches to the development of NCL gates have tended to be based on the 27 fundamental gates [21] shown in Table 1-3, which limit the number of inputs to four or fewer. These gates are designed with feedback, which results in hysteresis [22], [23]. The basic transistor implementation of each gate comprises four networks: Go to Data, Go to Null, Hold Data and Hold Null [24], [25].

Figure 1-1: Multi Rail Binary Logic NCL System

As mentioned above, computation in an NCL system is controlled by local handshaking signals and completion detection, as shown in Figure 1-1. The global clock is replaced by local handshaking signals such as the Request for Output (RFO) and Request for Input (RFI) control signals. These data flow control signals, in combination with standard NCL gates, form quasi delay insensitive (QDI) register blocks that control the data flow through the NCL system [26].

[Schematic: inputs X0 and X1, two th22 gates with outputs Z0 and Z1, a th12 gate, and control signals RFI, RFO and Reset]

Figure 1-2: Dual Rail NCL Delay Insensitive (DI) Register

Figure 1-2 shows an example of a conventional dual rail NCL register. In this example, the register is implemented using two TH22 NCL Threshold Logic Gates and one NOR gate. The naming convention for NCL gates indicates the number of inputs (n) and the threshold value (m) as a THmn label. So the TH22, for example, will “fire” and produce a valid data output only when both of its two inputs are asserted. In this way, the TH22 is equivalent to a Boolean AND gate but with hysteresis, so that a return to NULL will occur only when both its inputs are NULL. It is therefore identical to a Muller C-element. Thus, the NCL register will pass the input DATA value only when the Request for Input (RFI) control signal is in its Request for Data (RFD) state and the input DATA is present (asserted). It will pass a NULL value only when the RFI control signal is Request for Null (RFN) and the input is NULL. The TH12 gate is equivalent to a simple Boolean OR gate and is followed by an inverter, so that together they form the NOR gate and output completion is detected when either of the output signals is asserted. Thus, the NOR NCL gate acts as a control signal generator (RFO) for the previous logic gate in the NCL pipeline structure, forming a Completion Detection network for the logic block.
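The behaviour of this register bit can be sketched as a small behavioural model. This is a simplification under stated assumptions rather than the circuit itself: RFD is assumed to be encoded as logic 1 and RFN as logic 0, the Reset input is omitted, and the class names TH22 and DualRailRegisterBit are hypothetical.

```python
class TH22:
    """Two-input threshold gate with hysteresis (a Muller C-element)."""

    def __init__(self):
        self.out = 0  # assumed to start de-asserted (NULL), as after a reset

    def update(self, a, b):
        if a and b:
            self.out = 1      # both inputs asserted: assert the output
        elif not a and not b:
            self.out = 0      # both inputs de-asserted: return to NULL
        return self.out       # otherwise hold the previous value


class DualRailRegisterBit:
    """One register bit: two TH22 gates plus an inverted TH12 (NOR)."""

    def __init__(self):
        self.g1, self.g0 = TH22(), TH22()

    def update(self, x1, x0, rfi):
        z1 = self.g1.update(x1, rfi)
        z0 = self.g0.update(x0, rfi)
        rfo = int(not (z1 or z0))  # assumption: 1 means RFD, 0 means RFN
        return z1, z0, rfo
```

With this model, presenting DATA1 while RFI requests data gives (1, 0, 0); the output is then held while only the input returns to NULL, and it is released to (0, 0, 1) once RFI also requests NULL.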

[Symbol: threshold gate with threshold m, inputs Input 1 to Input n and a single Output]

Figure 1-3: Hysteresis state-holding behaviour of NCL gate

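[Symbols: C-element (C) and equivalent th22 gate, each with inputs A and B and output Y]

A    B    Y
0    0    0
0    1    Y (previous value held)
1    0    Y (previous value held)
1    1    1

Figure 1-4: C-element Gate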

An N-bit dual-rail NCL register comprises N single-bit dual-rail NCL registers and N Completion Detection (CD) NCL logic gates that are used to control the previous NCL gates. In larger designs with more than one register, the outputs of the completion detection circuits are AND-ed together to generate a single Request for Output (RFO) signal for the previous register block. That control signal will take on the value Request for NULL (RFN) when all of the outputs of the current register are asserted to DATA and will be Request for Data (RFD) when the outputs of the current register are all NULL. It is this DATA-NULL-DATA cycle that controls the flow of information in the NCL circuit.
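The combined completion signal can be modelled as follows. This is a minimal sketch: the class name CompletionDetector is hypothetical, and it is assumed that the combining stage holds its previous value while the register outputs are only partly DATA, as a gate with hysteresis would.

```python
class CompletionDetector:
    """Combine per-bit completion signals into one RFO value."""

    def __init__(self):
        self.all_data = 0  # assumed to start with every output at NULL

    def update(self, word):
        """word is a list of (Z1, Z0) rail pairs; returns 'RFN' or 'RFD'."""
        bit_done = [z1 or z0 for z1, z0 in word]  # per-bit detection
        if all(bit_done):
            self.all_data = 1   # every bit holds DATA: request NULL next
        elif not any(bit_done):
            self.all_data = 0   # every bit is NULL: request DATA next
        # a partly complete word leaves the previous request unchanged
        return "RFN" if self.all_data else "RFD"
```

Stepping a two-bit word through NULL, partial DATA, complete DATA and back to NULL therefore produces the requests RFD, RFD, RFN and finally RFD again.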

Table 1-3: NCL Threshold Logic Gate Functions

NCL Threshold Gate    Equivalent Boolean Function
TH12                  A + B
TH22                  AB
TH13                  A + B + C
TH23                  AB + AC + BC
TH33                  ABC
TH23w2                A + BC
TH33w2                AB + AC
TH14                  A + B + C + D
TH24                  AB + AC + AD + BC + BD + CD
TH34                  ABC + ABD + ACD + BCD
TH44                  ABCD
TH24w2                A + BC + BD + CD
TH34w2                AB + AC + AD + BCD
TH44w2                ABC + ABD + ACD
TH34w3                A + BCD
TH44w3                AB + AC + AD
TH24w22               A + B + CD
TH34w22               AB + AC + AD + BC + BD
TH44w22               AB + ACD + BCD
TH54w22               ABC + ABD
TH34w32               A + BC + BD
TH54w32               AB + ACD
TH44w322              AB + AC + AD + BC
TH54w322              AB + AC + BCD
THxor0                AB + CD
THand0                AB + BC + AD
TH24comp              AC + BC + AD + BD

Within the library of 27 fundamental threshold logic gates with hysteresis state-holding behaviour [27], a generalised THmn gate (Figure 1-3) will have “n” inputs and a threshold level of “m”, meaning that the output will only become asserted when at least m of its n inputs are asserted, and de-asserted when all inputs are de-asserted. Otherwise, it will hold its previous state with no change.

The library of NCL gates also contains a small number of so-called “weighted” gates where specific inputs contribute more than one unit towards the threshold. For example, a TH34w2 gate has four inputs and a threshold of three, with the first input carrying a weight of two, as if it were connected to two inputs of the gate. In general, in a THmnWw1w2…wk gate (1 < wi ≤ m, k < n), the first k inputs are weighted as w1, w2,…, wk respectively.
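A weighted threshold gate with hysteresis can be modelled behaviourally as shown below. The class name ThresholdGate and its interface are hypothetical, and the initial de-asserted state is an assumption; unweighted inputs are simply given a weight of one.

```python
class ThresholdGate:
    """THmn gate with optional input weights and hysteresis."""

    def __init__(self, m, weights):
        self.m = m
        self.weights = list(weights)
        self.out = 0  # assumed initial state: de-asserted

    def update(self, inputs):
        # add up the weights of the inputs that are currently asserted
        total = sum(w for w, x in zip(self.weights, inputs) if x)
        if total >= self.m:
            self.out = 1          # threshold reached: assert the output
        elif not any(inputs):
            self.out = 0          # all inputs de-asserted: de-assert
        return self.out           # otherwise hold the previous state


# TH34w2: four inputs, threshold three, first input weighted two
th34w2 = ThresholdGate(m=3, weights=[2, 1, 1, 1])
```

Asserting inputs A and B of th34w2 reaches the threshold (2 + 1 = 3), whereas B and C alone do not, so in the second case the gate keeps whatever value it already had.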

With its library of fundamental threshold logic gates, NCL provides many optimisation opportunities at the circuit level compared to most alternative Quasi Delay Insensitive approaches, which must use separate Muller C-elements and Boolean gates to develop asynchronous behaviour [19]. In the NCL set of primitive gates, TH1n gates are simply n-input OR gates, while THnn gates extend the C-element concept and are equivalent to n-input C-element gates, as illustrated in Figure 1-4. A generic m-of-n threshold gate (THmn) can be considered to be a combination of Boolean OR and C-element functions. Therefore, the behaviour of an NCL gate can be expressed as a Boolean function (Table 1-3) with the convention that AND operators have hysteresis (state-holding) behaviour.
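The Boolean set functions listed in Table 1-3 follow directly from the threshold and the input weights. The helper below, whose name set_terms is hypothetical, enumerates the minimal input combinations that reach the threshold; it is offered as a cross-check on the threshold-based gates only and does not cover THxor0, THand0 or TH24comp, which are not defined by a threshold.

```python
from itertools import combinations


def set_terms(m, weights, names="ABCD"):
    """Minimal input combinations whose weights reach the threshold m."""
    n = len(weights)
    terms = []
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            if sum(weights[i] for i in combo) < m:
                continue
            # skip combinations that contain an already-found smaller term
            if any(set(t) <= set(combo) for t in terms):
                continue
            terms.append(combo)
    return " + ".join("".join(names[i] for i in t) for t in terms)
```

For instance, set_terms(3, [2, 1, 1, 1]) yields the TH34w2 entry AB + AC + AD + BCD, and set_terms(2, [1, 1, 1]) yields the TH23 entry AB + AC + BC.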

1.5.2 Multi Threshold CMOS

As we move into the era of universally connected devices, low energy design is required to extend battery life [28], especially for portable and embedded systems [29]. A decrease in the size of the battery will directly impact the portability of devices such as laptops, tablets and cellular phones. Reducing the supply voltage is one of the most effective ways to minimise the total power consumption. However, this requires the logic gates to operate at lower supply voltages, e.g., in the range 0.3 to 0.5 V in modern deep sub-micron (DSM) CMOS. To maintain adequate performance, the gate over-drive voltage (i.e., VGS-VTH) must be kept relatively constant in the face of reducing gate voltage (VGS ≈ VDD). Thus, the threshold voltage VTH of the transistor, defined as the gate voltage at which an inversion layer forms at the interface between the insulating layer (oxide) and the substrate (body) of the transistor, must be scaled down as well. For example, VTH may need to be scaled to the range 0.1 to 0.2 V in DSM processes to maintain reasonable performance. The alternative is to allow operation in the sub-threshold region, where VGS ≤ VTH. In this region, process variability becomes a primary limitation to stable operation and makes conventional synchronous timing closure very difficult. As a result, threshold voltage scaling becomes a critical parameter influencing system performance and will not be straightforward to achieve. For example, although it will increase performance, reducing the transistor threshold will greatly increase leakage current, to a point where it becomes the dominant part of the power consumption.

Multi Threshold CMOS (MT-CMOS) is an effective approach to implement low power digital design and most DSM fabrication processes offer at least two and often three values of threshold voltage. MT-CMOS enables digital circuit designers to mix low threshold transistors, which are faster but have increased leakage current, with high threshold transistors in order to optimise the overall power consumption of the CMOS chip. There are three fundamental equations that govern the performance of CMOS and therefore can give some insight into the relevant trade-offs between power and performance. Equations (1) and (2) relate the supply (VDD) and transistor threshold voltages (VTH) to dynamic and static power respectively, while (3) illustrates how these parameters affect the circuit performance:

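$$P_{DYN} = \tfrac{1}{2}\, C\, F_{CLK}\, V_{DD}^{2} \qquad (1)$$

$$I_{LKG} \propto \exp\!\left(\frac{V_{GS} - V_{TH}}{V_{T}}\right) \qquad (2)$$

$$T_{PD} \propto \frac{V_{DD}}{V_{DD} - V_{TH}} \qquad (3)$$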

It can be seen from (1) that scaling supply voltage is the most effective way to reduce circuit power due to its square-law relationship. However, while simply reducing supply voltage VDD also reduces leakage (in (2), VGS ≈ VDD, and VT is the thermal potential) and therefore total power (the sum of these two), it is indicated by (3) that this can severely impact performance, especially as the supply approaches and falls below the threshold voltage, into the sub-threshold region.
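The trade-off expressed by (1) and (3) can be made concrete with a short numerical sketch. The function names and the example values (a 1.0 V supply halved to 0.5 V with a 0.3 V threshold) are illustrative assumptions only and are not taken from any particular process.

```python
def dynamic_power(c, f_clk, v_dd):
    """Equation (1): P_DYN = 1/2 * C * F_CLK * V_DD^2."""
    return 0.5 * c * f_clk * v_dd ** 2


def relative_delay(v_dd, v_th):
    """Equation (3), up to a constant factor: T_PD ~ V_DD / (V_DD - V_TH)."""
    if v_dd <= v_th:
        raise ValueError("equation (3) only applies above threshold")
    return v_dd / (v_dd - v_th)


# Illustrative comparison: halve the supply at a fixed threshold voltage
power_ratio = dynamic_power(1e-12, 1e9, 0.5) / dynamic_power(1e-12, 1e9, 1.0)
delay_ratio = relative_delay(0.5, 0.3) / relative_delay(1.0, 0.3)
```

Under these assumed values the dynamic power falls to one quarter while the propagation delay grows by a factor of about 1.75, which is the behaviour described in the paragraph above.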

As just mentioned, in order to allow designers the option to trade off these parameters, CMOS process technologies typically provide two or more sets of different threshold values. Some provide only low and high threshold voltage transistors, while others provide three: low, normal and high values. By combining CMOS transistors with different thresholds within the same CMOS logic system, or even within the same gate, a suitable balance between static power, dynamic power and performance can be achieved.

In addition to supporting this trade-off between power and performance, multiple threshold transistors can be used to set the overall switching threshold of CMOS gates. The simplified equation for the switching threshold (VM) of an inverter, defined as the point at which VOUT = VIN, is given by (4) below:

$$V_{M} = \frac{r\,(V_{DD} - |V_{THP}|) + V_{THN}}{1 + r} \qquad (4)$$

where $r = \sqrt{k_p / k_n}$, VTHN/P are the thresholds for the n-type and p-type transistors respectively and kp/kn is the process gain, which in turn is a function of the W/L ratios of the respective transistors. While in practice the switching threshold of a logic gate will be a more complex function of multiple VTH values, (4) illustrates the basic idea that the switching point of a gate can be varied by selecting the appropriate threshold values for its component transistors. This has been the fundamental idea used in the development of the multi-level switching circuits in this thesis, as will be described in subsequent chapters.
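The effect of the transistor thresholds on the switching point can be illustrated numerically. The sketch assumes the usual long-channel expression for the inverter switching threshold, VM = (VTHN + r(VDD − |VTHP|)) / (1 + r) with r = √(kp/kn); the function name and the voltage values are illustrative assumptions rather than parameters of a specific process.

```python
import math


def switching_threshold(v_dd, v_thn, v_thp, kp_over_kn=1.0):
    """Inverter switching threshold under the assumed long-channel model."""
    r = math.sqrt(kp_over_kn)
    return (v_thn + r * (v_dd - abs(v_thp))) / (1.0 + r)


# Symmetric thresholds place the switching point at mid-supply
vm_balanced = switching_threshold(1.0, 0.3, -0.3)
# A higher n-type threshold moves the switching point up
vm_high_n = switching_threshold(1.0, 0.5, -0.3)
# A higher p-type threshold (in magnitude) moves it down
vm_high_p = switching_threshold(1.0, 0.3, -0.5)
```

With a 1.0 V supply and matched gains, these three cases give switching points of about 0.5 V, 0.6 V and 0.4 V respectively, showing how a choice of threshold flavours can position distinct switching levels.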

1.6 NOVEL CONTRIBUTIONS AND OUTCOMES

The work reported in this thesis has resulted in the following novel contributions and outcomes:

1. Two novel architectures have been proposed and implemented that use single-rail ternary instead of multi-rail binary. Both have been analysed and compared to identify the advantages and disadvantages of each.

2. A number of CMOS process technologies and voltage mappings have been used to design the proposed architecture and map the Null Convention Logic voltage levels, to show that the proposed architecture is independent of both CMOS process technology and threshold voltage mapping.

3. A Digital Low Pass Filter has been designed and implemented employing the proposed Single Rail Ternary Logic Architecture, which has been analysed and compared against the Multi Rail Binary Logic Architecture.

1.7 THESIS ORGANISATION

In this chapter, the research motivation and purpose, including the research questions, as well as an introduction to the Null Convention Logic System Structure have been presented. The remainder of this thesis is organised as follows.

In Chapter 2, a literature review of the main techniques and topics that are included in this thesis is presented. Chapter 3 introduces the proposed architecture of the Single Rail Ternary Logic to design and implement Null Convention Logic systems. This includes two versions of the proposed architecture and an analysis of the Delay Insensitive Register, a core element of the ternary NCL system which controls the data flow within the system pipelines.

In Chapter 4, a revised and improved architecture is described that leads to improved performance and reduced area by removing the requirement for the Delay Insensitive Register.

To prove the functionality of the proposed architecture, in Chapter 5 a digital signal processing application is presented, and its design implemented using both Multi Rail Binary Logic Technique and Single Rail Ternary Logic. The results have been analysed and compared in terms of power consumption, total design area and performance.

Finally, the work concludes in Chapter 6 and some suggestions for future work are made.

Chapter 2: Literature Review

2.1 INTRODUCTION

Binary logic has been dominant in digital system design for decades, mainly due to its simplicity and ease of use. For instance, in CMOS binary digital system design, each CMOS transistor might represent a binary logic 1 or 0 when it is switched on or off. In contrast, as a type of multi-valued logic, ternary logic exhibits some advantages compared to its binary equivalent. Ternary is a “denser” representation than binary and therefore may result in fewer active devices and interconnections, leading to smaller area and higher data rates [30]. Of particular interest here is the mapping efficiency that can be achieved by a three-level ternary representation onto the set of three signals {0, N, 1} required by NCL. For instance, in binary dual-rail NCL, 2K wires are required to represent K bits, while only K wires are needed in ternary single-rail NCL. On the face of it, this promises to reduce both design area and routing complexity, something that is explored in the following chapters.
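The difference in wire count can be illustrated with a small encoding sketch. The function names are hypothetical, and the three ternary levels are represented abstractly as 0 (low, data Zero), 1 (middle, NULL) and 2 (high, data One); this level assignment is an assumption made for the example.

```python
def dual_rail_word(bits):
    """Encode bits as (X1, X0) rail pairs: two wires per bit."""
    wires = []
    for b in bits:
        wires.extend((1, 0) if b else (0, 1))
    return wires


def ternary_word(bits):
    """Encode bits on one three-level wire each: 0 = low, 2 = high."""
    return [2 if b else 0 for b in bits]


def null_word(k, ternary):
    """NULL wavefront for K bits: mid level per wire, or all rails low."""
    return [1] * k if ternary else [0] * (2 * k)
```

A four-bit word such as [1, 0, 1, 1] therefore occupies eight wires in the dual-rail form and four wires in the ternary form, with the NULL wavefront using the same number of wires in each case.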

2.2 ASYNCHRONOUS LOGIC

Digital system applications can be created using either synchronous or asynchronous techniques. Synchronous digital circuits are mainly driven by a global clock. The global clock is required to synchronise between the different logic blocks in the overall design. This makes the synchronous circuits stable and relatively straightforward to design. However, in high performance synchronous applications where high clock rate is required, a number of clock-related issues have arisen. For example, issues such as clock skew and jitter are complicating the task of achieving system-wide timing closure, while the increase in global clock tree power is especially affecting low-power and embedded systems. One possible solution is to eliminate the global clock itself and move to asynchronous techniques. Asynchronous circuits are controlled by request and acknowledge control signals [31] which make them self-timed circuits.

The Asynchronous pipeline can be implemented using two main protocols: Two Phase [32] or Four Phase Handshaking [33]. In the two-phase protocol, a new data set will flow across the pipeline at either the positive or negative edge of the request control signal. For instance, if the request control signal transitions from logic zero to logic one or vice versa, a new set of data will be transmitted to the receiver logic gate, after which the receiver logic gate will acknowledge this event with either a rising edge (logic zero to one) or falling edge (logic one to zero).

By contrast, in Four Phase Handshaking, only the rising edge carries information. The sender signals a new set of data by raising the request control signal from logic zero to logic one, and the receiver acknowledges it by raising the acknowledge control signal from logic zero to logic one. Both signals must then return to zero before the next transfer can begin. As a result, the four-phase protocol requires twice as many control signal transitions as the two-phase case to send the same amount of data. The ternary NCL proposed in this work is based on two-phase encoding in which the high and low level data are encoded as data “One” and data “Zero” respectively while the middle value is “Null”.
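The difference between the two protocols can be seen by listing the control signal events needed to transfer a number of data tokens. The function handshake_trace is a hypothetical illustration of the two signalling conventions, not a model of any particular pipeline stage.

```python
def handshake_trace(n_items, phases):
    """List the (signal, level) events needed to transfer n_items tokens."""
    if phases not in (2, 4):
        raise ValueError("phases must be 2 or 4")
    req = ack = 0
    events = []
    for _ in range(n_items):
        # two-phase: one toggle of each wire per token;
        # four-phase: raise both wires, then return both to zero
        for _ in range(phases // 2):
            req ^= 1
            events.append(("req", req))
            ack ^= 1
            events.append(("ack", ack))
    return events
```

Transferring three tokens produces six events under the two-phase convention and twelve under the four-phase convention, with the four-phase signals always returning to zero between tokens.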

Asynchronous circuits can offer many potential advantages compared to equivalent synchronous circuits, including higher speed and lower power consumption, less routing complexity and smaller design area [34]. However, it is clear that not all of these advantages have been realised in previous asynchronous designs. These advantages cannot be achieved simply by replacing the global clock with local control signals. In itself, this will reduce neither routing complexity nor power consumption, as the local control signals still need to be routed between the different logic blocks in the digital circuits, and as the size and complexity of a design increase, so do the complexity and area of the handshake network(s). These control signals will expend switching power, which will increase the overall dissipated power at the system level.

Some of the advantages of asynchronous systems can be achieved by making the control signals an intrinsic part of the data itself. The quasi delay insensitive NCL circuits achieve this by introducing the third, “NULL”, logic value. In this way, NCL is a self-timed logic paradigm in which the control timing is inherent in the data. However, the null convention by definition complicates the system behaviour and can result in increased area in the logic paths. This trade-off between logic and communication will be explored below and in later chapters.

2.3 NULL CONVENTION LOGIC

NCL is one of a small group of techniques commonly used to implement asynchronous circuits. Compared to its synchronous counterparts, it has been shown that NCL is able to reduce power consumption and generated noise without compromising performance [35]. NCL has also been demonstrated to be correct-by-construction in that delay insensitive techniques like this do not require sophisticated timing analysis, which can reduce the required effort to ensure correct operation under all timing conditions. While many asynchronous circuits have been implemented in other delay models, for example, bounded delay, these have been extensively studied elsewhere. In contrast, there has been virtually no work done on ternary NCL, which has been the focus of this thesis.

Most implementations of NCL threshold logic gates to date have been based on multi-rail binary logic, either dual-rail or quad-rail in which each bit is represented by two or four wires respectively. While one of the advantages of asynchronous systems is eliminating the problem of routing global clock signal, NCL introduces instead the challenge of routing multiple wires, which can increase design area and complexity compared to the single-rail NCL design.

2.4 MULTI-RAIL NCL TRANSISTOR LEVEL DESIGN

As outlined in Chapter 1, most recent techniques to implement NCL are based primarily on variations of 27 fundamental gates [21] which limit the number of the inputs to each gate to four or fewer. These gates are designed with hysteresis (state-holding) capability such that, after the output is asserted, all inputs must be de-asserted (i.e., return to NULL) before the output will be similarly de-asserted. Hysteresis ensures a complete transition of all inputs back to NULL before the output associated with the next wavefront of input data [22], [23] becomes active.

The 27 NCL threshold logic gates are designed and implemented using CMOS transistor technology and there are many techniques to design and implement the gates at the transistor level. For example, Static NCL is the most robust against noise, while Semi-Static is a simplification that reduces area at the cost of noise margin. In the following section, the most recent techniques that have been used to design and implement the NCL threshold logic gates will be discussed, including the advantages and disadvantages of each approach. As there is no perfect approach or technique that can be used in every NCL design, the best approach or technique will depend largely on the requirements of the particular design in terms of its power consumption, area, speed and the complexity of the design.

2.4.1 Static Null Convention Logic Technique (S-NCL)

The Static Null Convention Logic technique was one of the first that was used to design and implement multi-rail NCL threshold logic gates [36]. All subsequent techniques are based on modifications of this initial architecture, in order to enhance one or more of its characteristics, predominantly speed, power or total area (e.g., as defined by the required number of transistors in the gate).

In the Static NCL technique, each gate is implemented using four basic networks [27]: the Go to Data, Go to Null, Hold Data and finally the Hold Null, as shown in Figure 2-1. These four blocks each perform a specific function as follows:

• The Set Network or Go to Data is used to perform the logical function of the threshold gate in response to the values of the logical inputs at the input gate [37]. This network asserts the output of the NCL threshold gate, i.e., “output is asserted and go to data”. The Go to Data Network acts as a Pull-Down Network (PDN), and is therefore composed of NMOS transistors that are connected between ground and the output node. The function of the Go to Data Network is different for each NCL Threshold Logic Gate (e.g. AND, OR, etc.) and it is the complement of the function of the Hold Null (Hold Zero) network.

[Block diagram: Go to Null and Hold Null networks on the VDD side, Go to Data and Hold Data networks on the GND side, with nodes Y and Z]

Figure 2-1: Static CMOS Multi-Rail NCL Threshold Logic Gate Implementation

• The Reset Network or Go to Null is used to de-assert the output only when all inputs are low (input is NULL) [38]. It acts as a pull-up network (PUN) and is the complement of the function of the Hold Data Network. This block comprises a number of PMOS transistors connected between the supply and the output node. As a long stack of PMOS transistors (i.e. a row of more than four PMOS transistors in the pull-up network) can severely impact the performance of the gate, the maximum number of inputs of the gate tends to be limited to four or fewer.

• Hold networks are used to hold the output value while neither the Go to Data nor Go to Null functions are true, which happens during the transition from Data to Null or vice versa.

1. Hold Null (or Hold Zero) comprises PMOS transistors that are used to maintain the required hysteresis behaviour of the design by holding the NULL value at the output node. The function of the Hold Null Network is the complement of the Go to Data Network.

2. Hold Data (or Hold One) is a network of NMOS transistors that are used to maintain the required hysteresis behaviour of a gate by holding the data value at its output node. Additionally, the Hold Data Network is the complement of Go to Null Network and it keeps the output asserted until all the input signals are completely de-asserted.
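The relationship between the four networks can be checked with a small Boolean model of a TH23 gate. This is a behavioural sketch only: the function names are hypothetical, and the update rule (the output is asserted by Go to Data, or kept asserted by Hold Data) is the usual set/hold formulation, assumed here rather than taken from a specific reference.

```python
from itertools import product


def go_to_data(a, b, c):
    return bool((a and b) or (a and c) or (b and c))  # TH23 set function


def go_to_null(a, b, c):
    return not (a or b or c)                # all inputs de-asserted


def hold_null(a, b, c):
    return not go_to_data(a, b, c)          # complement of Go to Data


def hold_data(a, b, c):
    return not go_to_null(a, b, c)          # complement of Go to Null


def next_output(z, a, b, c):
    """New output Z given the previous output z and the inputs."""
    return int(go_to_data(a, b, c) or (bool(z) and hold_data(a, b, c)))


# For every input pattern, exactly one of Go to Data / Hold Null is active,
# and exactly one of Go to Null / Hold Data is active.
complementary = all(
    go_to_data(*p) != hold_null(*p) and go_to_null(*p) != hold_data(*p)
    for p in product((0, 1), repeat=3)
)
```

With a single input asserted, the model keeps the previous output in either state, which is the hysteresis behaviour the hold networks are there to provide.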

[Transistor-level schematic: inputs A, B and C, output Z, supply rails VDD and GND]

Figure 2-2: Transistor Level Design of the TH23 Static NCL Threshold Logic Gate

While the static implementation of NCL gates exhibits good noise margin and is robust in the face of supply and process variability, it still has a number of drawbacks. As can be seen from the example TH23 gate illustrated in Figure 2-2, it requires a large number of transistors even for simple functions. In fact, it is the largest compared to alternative CMOS implementations, which impacts the total design area and power. It is also the most complex to design as it requires four different networks to be built for each gate, as demonstrated in [36]. There is also no straightforward design methodology such as in CMOS Boolean design. Nonetheless, the Static NCL implementation still offers an appropriate trade-off between power, speed and design area [39], so it is considered one of the preferred design techniques for these NCL gates [40], especially if there are no specific requirements or need for a specific optimisation, e.g., high speed, low power consumption, small design area or lower design complexity.

No-Hold Null Convention Logic (NH-NCL)

In [41], an alternative approach to static Null Convention Logic was proposed, where the hold networks (both the Hold Data and Hold Null networks) are removed and integrated into the Go to Data and Go to Null networks respectively, as shown in Figure 2-3. The reason for this is that the hold networks are only used to maintain the output, whether it is Data or Null, but do not take part in the switching of the output from Data to Null or vice versa. Hence, they can be integrated in their corresponding Go to Data/Null networks to enhance their speed.

[Transistor-level schematic: inputs A, B and C, output Z fed back into both networks, supply rails Vdd and GND]

Figure 2-3: TH23 No-Hold CMOS Multi-Rail NCL Threshold Gate

The technique results in better overall power consumption and delay performance compared to the static case but at the cost of additional transistors and therefore a larger gate area. As one example, the TH34w2 NCL gate requires five more transistors than the static technique, potentially impacting overall area.

2.4.2 Semi-Static CMOS Null Convention Logic (SS-NCL)

The semi-static technique [36], [42] reduces the required number of transistors to implement the gates as it eliminates the need for the Hold networks. Both the Hold Data and Hold Null networks are replaced by a weak inverter as feedback from the output stage of the gate. The inverter maintains the hysteresis of the gate, as shown in Figure 2-4. Both the Go to Data logic network and Go to Null logic network are still required and must be strong enough to overcome the feedback signal that is coming back from the weak feedback inverter [43].

There are some benefits gained from implementing the NCL gates using this technique compared to the static technique. The approach reduces the total power consumption [44] and also decreases the required number of transistors to implement the CMOS multi-rail gates which, in turn, is reflected in the total area of the design [43]. While the Semi-Static Null Convention Logic technique offers some advantages, there are also some disadvantages. For example, the static technique is still faster than the semi-static [25], mainly due to the additional delay incurred in overcoming the weak feedback inverter at the output stage.

On the other hand, while the semi-static technique appears to offer a simpler design process compared to the static case, one drawback of using the weak feedback inverter is its sizing. This is not an easy process as the weak feedback inverter needs to be carefully sized such that it provides sufficient current to compensate for leakage but, at the same time, can be easily overcome by the pull-down and pull-up networks. For example, if the PMOS transistors in the Go to Null Network (the Pull-Up Network) are smaller than those of the weak feedback inverter, the Go to Null network will not be able to provide enough current to reset the gate. This will require an increase in the size of these PMOS transistors, which may lead to higher power consumption and increase the total design area. Conversely, if the weak feedback inverter is made too weak, it will be very sensitive to noise on the internal node, thereby affecting its noise margin. The probability that switching noise will affect the functional behaviour of the gate, and thus the behaviour of the overall design, will increase. Additionally, due to the sizing of the Go to Null network (Pull-Up Network), the Go to Data network (Pull-Down Network) and the weak inverter, this technique does not support low voltage implementation [42]. As for the static technique, the semi-static technique can generate both the output and its inverted value by default with no additional inverter gates or wiring.

[Block diagram: RESET network on the VDD side, SET network on the GND side, with nodes Z and Y]

Figure 2-4: Semi-Static Null Convention Logic Implementation

[Transistor-level schematic: inputs A, B and C, output Z, supply rails Vdd and GND]

Figure 2-5: TH23 Semi-Static NCL

In [43] and [44], multiple logic gates and functions, e.g. AND, NAND, Half Adder and Full Adder, have been designed and simulated in both static and semi-static logic. Here, the results show that semi-static logic requires fewer transistors to implement the same logic gate or function, as a result of using the weak feedback inverter, which helps to reduce the design area and power consumption. For instance, in Figure 2-2, the TH23 static NCL gate implementation requires 18 transistors while in Figure 2-5, only 12 are required to implement the same TH23 gate using the semi-static technique. However, as the logic design becomes more complicated, and includes functions such as a Full Adder, the speed of the static NCL implementation becomes faster than that of the semi-static case. Thus, ultimately, the additional benefits of the static implementation tend to outweigh the area savings offered by the semi-static technique.

Semi-Static Diode-Connected Null Convention Logic (SSDC-NCL)

The speed and performance of the semi-static technique can be enhanced by using diode-connected transistors, which are added in series with the feedback inverter as shown in Figure 2-6. In [16], this enhancement was proposed and called semi-static diode-connected NCL. The two additional diode-connected transistors are minimum-sized transistors and connected to the feedback inverter in series. The additional group of transistors in the diode-connected circuit behave like high value resistors between the feedback inverter and both supply voltage and ground which limit its source and sink current capability. As a result, the feedback inverter becomes weaker and is easier to overcome, so that the gate operates faster than the original semi-static version. This technique has been shown to consume less power than the conventional semi-static design and, although two additional transistors are required, these can be minimum geometry so will have minimal impact except for very large scale designs [45].

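[Block diagram: RESET and SET networks between VDD and GND, output Z, with the feedback inverter connected to its own VDD and GND through diode-connected transistors]

Figure 2-6: Semi-Static Diode-Connected CMOS Multi-rail NCL Logic Gates implementation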

Semi-static diode connected NCL gates can operate at lower voltage supply and higher speed compared to the semi-static case, but at the cost of noise margin [16]. Despite the fact that the semi-static diode-connected gate consumes less power compared to semi-static, its power consumption is still higher than the static case, although the gate design requires fewer transistors. For instance, with a supply voltage of 1.2V, the diode-connected semi-static implementation of a 4x4 NCL multiplier consumes more power compared to the static implementation by almost 25% [16], while its area is almost 34% less.

Pseudo Semi-Static Null Convention Logic (PSS-NCL)

In [20], Pseudo Semi-Static Null Convention Logic was proposed as an alternative architecture to enhance the speed of the semi-static technique. In this case, the PMOS transistor stack in the Go to Null (pull-up) network is removed and replaced by a combination of NMOS transistors as a second Pull-Down Network and a single pseudo-PMOS transistor. In addition to that, the weak feedback inverter is replaced by a normal feedback inverter and the output inverter is replaced by a NOR logic gate as shown in Figure 2-7.

These modifications to the semi-static technique enhance the speed of the NCL gates, for two main reasons. Firstly, the Go to Null Network is built using NMOS transistors instead of PMOS transistors, and NMOS transistors tend to be faster due to their higher carrier mobility. Secondly, the weak inverter is removed and a normal inverter is used for the feedback signal. However, the power consumption of this architecture is higher than that of both the static and semi-static techniques. While the authors suggest that the power consumption can be reduced by controlling the pseudo PMOS transistor, this adds an additional handshake signal (EN), which means additional wires need to be added and routed.

In summary, while the pseudo semi-static technique offers a few advantages, there are a number of disadvantages, mainly that its total power consumption can be higher than that of the semi-static NCL gates if the additional handshake (EN) signal is not controlled properly.

Figure 2-7: Pseudo Semi-Static NCL

2.4.3 Differential Null Convention Logic (D-NCL)

This concept is based on Differential Cascode Voltage Switch Logic (DCVSL) [46], where the pull-up network (Go to Null Network) is removed and replaced with another pull-down network to perform the Go to Null function [42]. Thus, the DCVSL NCL gate design has no pull-up networks; instead it has two pull-down networks plus the cross-coupled inverters [47] that maintain the gate’s hysteresis behaviour. The cross-coupled inverters act as an S-R latch and hold both the output and its inverted value as long as neither the Go to Data nor the Go to Null condition is true.
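The set, hold and reset behaviour that the latch is meant to provide can be illustrated with a small behavioural model. This is only a sketch of a generic m-of-n threshold gate with hysteresis, not a transistor-level model of the D-NCL gate; the class name `ThresholdGate` and its `step` method are illustrative and do not come from the cited works.

```python
class ThresholdGate:
    """Behavioural model of an NCL THmn gate (m-of-n with hysteresis)."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("require 1 <= m <= n")
        self.m = m
        self.n = n
        self.z = 0  # output starts de-asserted (NULL)

    def step(self, inputs):
        """Apply one set of inputs and return the resulting output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(1 for x in inputs if x)
        if asserted >= self.m:
            self.z = 1          # Go to Data (set) condition is true
        elif asserted == 0:
            self.z = 0          # Go to Null (reset) condition is true
        # Otherwise neither condition is true, so the output is held.
        return self.z


gate = ThresholdGate(2, 3)  # a TH23 gate
trace = [gate.step(v) for v in [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]]
print(trace)
```

The third input pattern is the interesting one: only one input remains asserted, so neither condition holds and the gate keeps its previous output until all inputs return to zero.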

The Differential multi-rail NCL gates are very easy to design and implement, as the NMOS transistors do not require any sizing but can be kept at the minimum size for the technology. The pull-down nets are composed of only NMOS transistors. The use of minimum size transistors will help to reduce the total design area and the total power consumption as well, while removing the stack of the PMOS transistors forming the pull-up network will help to increase the speed of the gate as well as allowing them to operate at low supply voltage.

However, as shown in Figure 2-8, these differential gates require both the inputs and their inverted values to be able to perform the required function. This might not be a problem as the differential gates generate both the required signals but routing these additional lines might be a problem under some circumstances. Cross coupled inverters do not have to be weak, which enables the DNCL gate to operate at lower supply voltages than can be achieved by the semi-static approach.

Figure 2-8: Differential NCL

On the other hand, this implementation is not completely delay insensitive as it is a dynamic circuit [41]. Further, the cross-coupled inverter creates a path between the voltage supply and the ground, so the power consumption is increased by the resulting leakage current. Finally, the cross-coupled inverters do not maintain the NCL hysteresis characteristic as claimed by the authors. This is because these inverters will maintain the output value only for a bounded time [41]; to maintain it for an unbounded time, the output has to rely on the leakage current [41].

Static Differential Null Convention Logic (SD-NCL)

SD-NCL strives to achieve the benefits of the Differential NCL technique while exhibiting the required hysteresis characteristics. Here, the cross-coupled inverters are removed and replaced by two new Pull-Up Networks: Complement of Set (COS) and Complement of Reset (COR), as shown in Figure 2-9. The Complement of Set pull-up forces the output to logic one while the Complement of Reset pull-up forces the output to logic Zero. The pull-down networks (Go to Data/Set Network and Go to Null/Reset Network) are still controlled by both input data and the complement of the input data, which again might require additional inverters to invert the inputs of the data.

The technique, called Static Differential CMOS multi-rail NCL in [12], was shown to have better performance than the Differential CMOS multi-rail NCL threshold technique, in terms of lower power consumption and smaller delay. Its performance is enhanced because of the addition of the new pull-up nets (Complement of Set and Complement of Reset), although this increases its area, particularly compared to the semi-static technique. While the addition of these pull-ups increases the total number of transistors needed to implement the gate, it is claimed that these additional transistors do not increase the logic gate design area, which is not a valid assumption even were the PMOS transistors to be kept at minimum size. To demonstrate via a simple example, the TH23 NCL gate, which is considered one of the simplest and smallest NCL gates, requires 20 transistors using the static-differential technique, compared to 18 using the static technique and 12 using the semi-static technique. It was not shown how, with the increased number of transistors (especially compared with semi-static NCL), the total area of the design will not increase. A further disadvantage is the need for both the inputs and their complements, which requires additional inverters; in a large design these extra inverters will increase the design area and the total power consumption.

In summary, these authors compared their design against the differential CMOS multi-rail NCL threshold gate technique, which cannot be used to implement or deploy a CMOS multi-rail NCL system. As a result, all the claimed advantages need to be verified against a valid implementation of CMOS multi-rail NCL threshold gates, such as the static or semi-static techniques.

Figure 2-9: Semi Differential NCL

Finally, a logical question would be: which technique can be recommended for most NCL designs? The answer, as shown above, is that each technique has its advantages and disadvantages and there is no one specific technique that will meet all the design requirements. However, one of these proposed techniques, the differential CMOS multi rail NCL technique, cannot be used as it does not meet the main requirements of NCL. Because it is a dynamic implementation and not a delay insensitive design, it does not achieve the required hysteresis behaviour of NCL.

The designer can select a specific technique based on the design requirements. The main criteria for choosing a technique are power consumption, design area, design complexity and speed. While the static NCL technique offers a suitable balance between all of these requirements, the semi-static diode-connected CMOS multi-rail technique also has an advantage, as it can operate at a very low supply voltage compared to the other techniques. The main reason for this is that the stack of PMOS transistors in the pull-up network (Go to Null) can be smaller than the one in the semi-static technique, because the diode-connected transistors make the weak inverter weaker. In addition, this technique has the smallest design area, even compared to the semi-static technique. While both of them remove the hold networks (i.e., the Hold One and Hold Zero networks), in the semi-static case using a weak inverter requires the PMOS transistors in the pull-up network (Go to Null Network) to be larger. As mentioned earlier, the semi-static diode-connected CMOS multi rail gate is able to operate at the lowest supply voltage compared to other CMOS Multi Rail Null Convention Logic techniques and as a result it potentially has the lowest power consumption.

While the semi-static gate has no hold networks (neither Hold One nor Hold Zero), these being replaced by a weak inverter, it still has the largest design area compared to the others, including the static implementation. Secondly, the weak inverter makes this implementation the slowest one in the group. On the other hand, the No-Hold CMOS multi rail gate has the best performance in the group in terms of speed. The reason for this is not only the integration of both hold networks into the set and reset networks, but also the reduced number of transistors connected to the output node, which reduces the propagation delay.

Finally, the pseudo semi-static NCL technique is not recommended for a low power design, as the additional handshake signal (Enable) that is used to control the additional PMOS transistor might create a direct path between the voltage source and the ground, increasing the leakage current and hence the leakage power.

2.4.4 Multi Threshold Null Convention Logic (MTNCL)

Logic circuits can consume a lot of power while in Sleep Mode, i.e., when the device is in its idle state, and this can have a significant effect on applications that operate on standby but are not totally switched off. In an NCL system, it is expected that the gates will spend the majority of the time in their NULL state, waiting for a Data cycle. Thus, the leakage current in this state will have a great effect on the standby power of the overall circuit. Leakage current can be reduced if there is no direct path, or only a very high resistance path, between the voltage source and the ground. This is the reason that the pseudo CMOS multi rail NCL technique has the highest power consumption compared to the other alternatives. It is clear from (2) in Chapter 1 that the leakage current depends exponentially on the threshold voltage, so the challenge with deep sub-micron technology is that the gates require low threshold voltage transistors to be able to operate with a low supply (e.g., to allow the supply to be scaled to under 0.5 V), while reducing the threshold voltage of the transistor increases the leakage current [48], which leads to higher power consumption.
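The exponential dependence of leakage on threshold voltage can be made concrete with a short numerical sketch. Equation (2) of Chapter 1 is not reproduced in this section, so the code below assumes the textbook subthreshold model, in which the off-state current scales as exp(-VTH / (n kT/q)). The prefactor `i0` and slope factor `n` are illustrative values and are not taken from the thesis or from any particular process.

```python
import math

BOLTZMANN = 1.380649e-23        # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def off_current(v_th, i0=1e-6, n=1.5, temp_k=300.0):
    """Subthreshold current at VGS = 0 for threshold voltage v_th (volts).

    i0 (amperes) and n are assumed, illustrative parameters.
    """
    return i0 * math.exp(-v_th / (n * thermal_voltage(temp_k)))


# Compare a low-threshold and a high-threshold device under the same assumptions.
ratio = off_current(0.2) / off_current(0.4)
print("leakage ratio (VTH = 0.2 V vs 0.4 V): %.1f" % ratio)
```

Under these assumed parameters, lowering the threshold by 200 mV raises the off-state current by roughly two orders of magnitude, which is the trade-off that motivates mixing low and high threshold devices.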

In a digital system that spends most of its time in sleep or standby mode, waiting on an input to become available, a practical approach is to use multi threshold CMOS technology, where low threshold transistors are used in the critical paths while high threshold ones are used in the non-critical paths to reduce the leakage current. This will lead to lower overall power consumption, as reducing the leakage current reduces the static power consumption, which represents the dominant part of the total power consumption in ultra-low power deep sub-micron CMOS technology.

The implementation of NCL generally requires a large number of transistors, which can consume a lot of power while the circuit is idle (in sleep mode), either in its NULL state (holding Null at the output stage) or in its Data state (holding Data at the output stage), waiting for the next cycle of Data or Null respectively. For example, during the NULL cycle, the multi-rail gate does not perform any function and just waits until the RFD (Request for Data) handshake signal goes high to request DATA. At this stage the gate will become active and, as a result, start processing the input signals to generate the required output based on the function of the gate.

Multi Threshold CMOS (MT-CMOS) reduces the leakage current during standby mode (sleep mode/idle mode) by using high threshold transistors as isolation switches between the supply lines of the gate (the virtual supply) and those of the system (the primary supply, VDD and GND) [49], [50]. The technique has been proposed in [51], [52] to address this idle mode leakage current issue. In the NCL pipeline structure, after each DATA cycle, instead of propagating a NULL cycle, the gates enter a sleep or standby mode with their output nodes connected to ground. The same completion signal that indicates whether the input register is ready to process DATA or waiting for NULL can be used as a sleep signal for the multi threshold gate in this case. As a result, there is no need to add any additional control or handshake signals, as was the case in the pseudo semi-static gate, for example.
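The effect of the sleep signal on a gate output can be sketched behaviourally. This is a simplified illustration only: the function name `mtncl_gate` is hypothetical, the awake gate is modelled as a plain m-of-n threshold function, and whether a particular MTNCL gate retains any hold circuitry is not modelled here.

```python
def mtncl_gate(inputs, m, sleep):
    """Simplified multi threshold NCL gate output.

    While sleep is asserted the output is tied to ground (NULL),
    regardless of the inputs; otherwise the output is asserted when
    at least m inputs are asserted.
    """
    if sleep:
        return 0
    return 1 if sum(1 for x in inputs if x) >= m else 0


# DATA cycle (awake), then sleep replaces the NULL wavefront, then awake again.
outputs = [
    mtncl_gate((1, 1), 2, sleep=False),
    mtncl_gate((1, 1), 2, sleep=True),
    mtncl_gate((1, 0), 2, sleep=False),
]
print(outputs)
```

In this sketch the sleep input plays the role that a propagated NULL wavefront plays in conventional NCL, which is why no separate NULL cycle has to travel through the combinational gates.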

It has been shown in [52], [53], [54] that multi threshold NCL consumes less power and exhibits higher performance than the corresponding static implementations. The architecture employs low threshold transistors to speed up the operation of the gate while high threshold voltage transistors are used to reduce the power consumption. The authors claim that the proposed multi threshold, multi-rail NCL gates are smaller than their static equivalent, which is not entirely valid as, by their own analysis, the additional number of transistors varies from gate to gate, which makes it hard to claim an overall advantage. For instance, the TH12 implementation requires eleven transistors using the multi threshold technique compared to the static case that requires only six transistors. In cases where the TH12 is the main gate that is used to design and implement NCL registers, and each register requires two TH12 NCL threshold gates, this leads to an extra 12 transistors in every register, which in larger designs will result in greater design area compared to the static case. Notwithstanding, the main purpose of employing the multi threshold technique is to reduce power consumption rather than to decrease the design area, per se.

2.5 TERNARY LOGIC CIRCUITS

Although binary has been, so far, entirely dominant in digital logic systems, higher order (multi-valued) logic systems do offer some useful advantages. Three-valued logic or ternary is one example of a multi-valued representation and has many physical level implementations for its three logic values. For instance, in Balanced Ternary, the three logic values can be represented using the number set {-1, 0, +1}, whereas in the standard (unbalanced) ternary system, each logic value is a ternary digit (a “trit”) having a value of 0, 1 or 2.
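The difference between the two representations can be shown with a short conversion sketch. The function names below are illustrative; the code simply encodes integers as lists of trits (most significant first) in each system and converts them back.

```python
def to_ternary(value):
    """Non-negative integer -> trits in {0, 1, 2}, most significant first."""
    if value < 0:
        raise ValueError("standard ternary sketch handles non-negative values only")
    if value == 0:
        return [0]
    trits = []
    while value:
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    return trits[::-1]


def to_balanced_ternary(value):
    """Integer (any sign) -> trits in {-1, 0, +1}, most significant first."""
    if value == 0:
        return [0]
    trits = []
    while value != 0:
        remainder = value % 3      # always 0, 1 or 2 in Python
        if remainder == 2:
            remainder = -1         # a digit of 2 becomes -1 with a carry
        trits.append(remainder)
        value = (value - remainder) // 3
    return trits[::-1]


def from_trits(trits):
    """Evaluate a trit list (either system) back to an integer."""
    total = 0
    for trit in trits:
        total = total * 3 + trit
    return total


print(to_ternary(5), to_balanced_ternary(5), to_balanced_ternary(-5))
```

One consequence visible in the sketch is that balanced ternary represents negative numbers without a separate sign digit, since negating a value only flips the sign of every trit.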

Although ternary logic has tended to be overshadowed by the simplicity of binary with its mature EDA tools support, it has been an active area of research interest for arithmetic circuits [55] [56] over a great number of years and has been widely applied across digital system design in general. Some of the main advantages of using ternary logic include its higher information density, so that more information can be sent over a single wire compared to binary, plus the potential for higher arithmetic and logical performance due to the reduction in both the interconnect complexity and the overall chip area.
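The information density argument can be quantified with a small sketch. The helper `digits_needed` is illustrative; it counts how many wires (digits) of a given radix are needed to distinguish a given number of values. Note that this is the generic multi-valued logic argument: in the delay-insensitive schemes discussed later in this chapter, the third level is used to encode NULL rather than an extra data value.

```python
import math

# Information carried by one ternary digit, in bits.
BITS_PER_TRIT = math.log2(3)


def digits_needed(num_values, radix):
    """Smallest number of radix-`radix` digits that can encode num_values distinct values."""
    if num_values < 1 or radix < 2:
        raise ValueError("need num_values >= 1 and radix >= 2")
    digits = 0
    capacity = 1
    while capacity < num_values:
        capacity *= radix
        digits += 1
    return digits


print("bits per trit: %.3f" % BITS_PER_TRIT)
print("wires for 256 values: binary=%d, ternary=%d"
      % (digits_needed(256, 2), digits_needed(256, 3)))
```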

Several design techniques have been applied to perform the basic ternary operations, although most of these techniques are unsuited for use in NCL. Ternary voltage levels can be derived using a number of techniques, as follows:

• Floating Gate: various non-binary arithmetic circuits have been proposed based on floating gate mechanisms of the type originally described in [57]. For instance, [58] describes a voltage-mode balanced ternary adder circuit based on dynamic semi floating gate devices that operates with a 1 GHz clock at VDD = 1 V. Yet, all floating-gate circuits require some form of charge initialisation that complicates their design, while a further disadvantage of this particular approach is the need to supply a high-speed clock signal related to the data throughput rate.

• Resistive voltage divider: using a resistive element that acts as a voltage divider. For example, in [59], transistor level CMOS logic circuits such as inverter, AND, OR, NAND and NOR gates were presented based on the concept of using resistive dividers to achieve the required voltage levels, as shown in Figure 2-10. Using resistors as voltage dividers increases power consumption and design area due to the bulky size of the resistors implemented in a CMOS process.

Figure 2-10: Ternary inverter using resistance

• CMOS Multi-Threshold transistors: this can be implemented either at the transistor level by adjusting the W/L ratio of the transistors or during fabrication by adjusting the gate oxide thickness or dopant concentration in the channel region beneath the gate oxide. For example, in [60] [61], the authors claimed that their proposed ternary completion detection circuit minimises the area penalty associated with the wiring overhead compared to its binary dual-rail NCL counterpart. However, the proposed completion detection circuit is based on a buffer that requires 10 transistors as shown in Figure 2-11, while the corresponding dual-rail NCL completion detection circuit is composed of only four. This more than doubles the area of the completion detection circuit, which will be reflected in the overall system design area and power dissipation.
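As a rough illustration of the cost of the resistive voltage divider approach listed above, the following sketch computes the intermediate level and the static power of an unloaded two-resistor divider. The supply and resistor values are arbitrary examples and are not taken from [59].

```python
def resistive_divider(vdd, r_top, r_bottom):
    """Return (mid-point voltage, static power) of an unloaded resistive divider.

    vdd is in volts and the resistances are in ohms. The divider conducts
    continuously, so the static power is dissipated even when the logic is idle.
    """
    total = r_top + r_bottom
    v_mid = vdd * r_bottom / total
    p_static = vdd * vdd / total
    return v_mid, p_static


# Example values only: equal resistors give the VDD/2 intermediate level.
v_mid, p_static = resistive_divider(1.8, 100e3, 100e3)
print("mid level = %.2f V, static power = %.1f uW" % (v_mid, p_static * 1e6))
```

Raising the resistance lowers the static power but requires physically larger resistors, which is the area and power trade-off noted in the list.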

Figure 2-11: Ternary Buffer

Figure 2-12: Ternary Logic Signalling System - TLSS (a) TLSS Input Stage; (b) TLSS Output Stage

The ternary logic signalling system introduced in [62] was shown to consume less power than its binary counterpart, but its input and output data is still in a dual-rail form and not single-rail as shown in Figure 2-12. Thus, the proposed solution cannot be used in an end-to-end ternary system as the logic processing still uses dual-rail logic. The main application of this solution seems to be as a ternary to binary encoder/decoder, to transform from binary to ternary and back.

An energy-efficient ternary interconnection link for asynchronous systems was introduced in [63] that consumes up to 56.4% less power than a dual-rail full-swing signalling system for long global interconnects. Nevertheless, this solution is still not an end-to-end ternary system and, in a similar manner to [62], is more suited as a simple converter from binary to ternary and vice versa. Further, the authors did not demonstrate the performance of the proposed circuit in the case of an illegal state, where both inputs are data1.

The Double Pass-transistor Logic (DPL) based ternary logic proposed in [64] is intended to result in a high-speed ternary system with reduced power and area. The area is indeed small, with the gates being composed of seven and five transistors respectively, as shown in Figure 2-13 and Figure 2-14. However, with the voltage supply set to 1.8 V, the power consumption of the simple ternary AND and OR gates is more than 60 µW and 80 µW, respectively. Thus, the circuits hardly qualify as low power, and the authors did not demonstrate that the circuits are capable of operating at lower voltages, although this is one of the main techniques for achieving lower power. It would be interesting to determine the performance of the proposed solution when operated at sub-1V supply ranges.

Figure 2-13: Ternary AND Logic Gate

Figure 2-14: Ternary OR Logic Gate

The ternary encoding scheme introduced in [65] halves the number of wires in comparison with the conventional 2 phase DI protocols. However, it does not meet the main requirements of NCL systems. Firstly, there is a direct transition between data_1 and data_0, as shown in Figure 2-15. Further, there is an assumption that data will always be available before the control signal (RFD), which is not a valid assumption in an NCL system. It is therefore unsuitable as a basis for NCL systems.

Figure 2-15: Ternary encoding scheme

A number of delay-insensitive asynchronous circuits using ternary logic were introduced in [66] and [67] and it was shown that, while their completion detection circuits consume less power compared to [68] [69], they have the largest area. As the completion detection circuit exists in each single stage of the NCL pipeline, this will have a disproportionate effect on the overall circuit area.

Figure 2-16: Schematic of STI, NTI and PTI inverters with different W/L ratio (a) STI, W/L = 2; (b) NTI, W/L = 1.5; (c) PTI, W/L = 2.5

As identified above, rather than using bulky resistive dividers, a more efficient method to implement multi-level systems involves changing the W/L ratios of the transistors and/or adapting the concept of multi-threshold transistors to adjust the effective impedances of the CMOS transistors. This methodology is widely used to implement a range of digital system devices. In [70], the digital blocks were designed using a single voltage source, which eliminates the need for multiple voltage sources and all the associated design complexity. In this case, the W/L ratio for each transistor was adjusted to change the transistor resistance and set the switching threshold, so the schematics of the simple ternary inverter (STI), positive ternary inverter (PTI) and negative ternary inverter (NTI) are the same, as shown in Figure 2-16.

Thus, for the CMOS Ternary inverter, when the input logic is ‘1’ (its “middle” value) there will be a direct path between the voltage source and the ground, which greatly increases its power dissipation. For this reason, the power consumption of a simple Ternary AND gate in [70] is in the range of mW. A further observation from (4) in Chapter 1 is that, in general terms, the switching threshold (VSW) of a gate is more sensitive to VTH than to W/L, as VSW is a function of the square root of W/L but a linear function of VTH.
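The square-root versus linear dependence can be explored numerically. Equation (4) of Chapter 1 is not reproduced in this section, so the sketch below assumes the textbook long-channel (square-law) expression for the switching threshold of a CMOS inverter, VSW = (VTN + r (VDD - |VTP|)) / (1 + r) with r = sqrt(beta_p / beta_n), where each beta is proportional to the W/L of the corresponding transistor. The numerical values are illustrative only.

```python
import math


def switching_threshold(vdd, vtn, vtp_abs, beta_n, beta_p):
    """Long-channel CMOS inverter switching threshold (assumed textbook form).

    vtn and vtp_abs are the NMOS threshold and the magnitude of the PMOS
    threshold in volts; beta_n and beta_p are proportional to W/L.
    """
    r = math.sqrt(beta_p / beta_n)
    return (vtn + r * (vdd - vtp_abs)) / (1.0 + r)


base = switching_threshold(1.2, 0.3, 0.3, 1.0, 1.0)
# Doubling the NMOS W/L only enters through a square root.
wider_nmos = switching_threshold(1.2, 0.3, 0.3, 2.0, 1.0)
# Raising the NMOS threshold shifts VSW linearly.
higher_vtn = switching_threshold(1.2, 0.4, 0.3, 1.0, 1.0)
print("base=%.3f V, 2x NMOS W/L=%.3f V, VTN+0.1 V=%.3f V"
      % (base, wider_nmos, higher_vtn))
```

Under this assumed model, moving VSW by a given amount through sizing requires a multiplicative change in W/L, whereas the same shift is obtained from a fixed additive change in threshold voltage.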

Figure 2-17: Ternary to Binary Decoder

One key advantage of using multi-threshold techniques to generate the ternary logic is that it is not necessarily tied to a specific technology. For instance, ternary logic using multi-threshold can be achieved at the transistor level using CMOS technology [71, 72] or Carbon Nanotube FETs (CNTFET) [73, 74]. As described above, multi-threshold operation in CMOS can be achieved using a combination of process (e.g., gate oxide thickness or channel dopant concentration), W/L adjustment, or substrate biasing. On the other hand, CNTFET transistor threshold can be controlled via tube diameter and/or wall thickness. For example, ternary logic was achieved in [75] [76] by using different tube diameters of CNT to generate different threshold voltages. As shown in Figure 2-17, a ternary to binary decoder can be implemented using this technique.

A resistive load CNTFET-based ternary logic was introduced in [77], while a complementary CNTFET network was described in [78]. The ternary adder circuit proposed and analysed in [76] and [78] uses a binary decoder block, so the system is not fully ternary. Recently, resistive load CNTFET-based ternary logic [79] and complementary CNTFET networks [77] [78] have been proposed.

Whilst CNTFET technology sounds like a promising design technology to implement ternary logic at the transistor level, the nano-scale placement and alignment of CNTs with different diameters to generate the required multi threshold voltage levels might not turn out to be practically feasible. There is currently no mass production technology that will allow CNT devices to be placed and their geometry “tuned” to create complex multi threshold circuits. To overcome the fabrication challenges, other design processes have been developed such as direct growth, solution dropping, and various transfer printing techniques [80]. Further, carbon nanotubes have shown reliability issues when operated under high electric field or temperature gradients [81], something which might further limit their eventual application.

All binary NCL implementations published to date have been based on multi-rail signalling, where each logic value is represented using either dual- or quad-rail. This includes all of the approaches that were discussed above. However, multi-rail increases the required number of wires, the routing density between gates and finally the total area of the design. Each of the above techniques has tried to enhance one or multiple design parameters but there appears to have been no work to date aimed at eliminating the need for multi-rail signalling and replacing it with a single rail.

There was a single attempt to address this issue in [82], [83], where the authors proposed a new method called Delay-Insensitive Ternary Logic (DITL). It was argued that DITL combines the design aspects of NCL, PCHB, and Boolean logic to form a delay-insensitive paradigm that uses only a single wire to represent a single bit of data. The single wire has three distinct voltage levels corresponding to the three DI values of DATA0, DATA1, and NULL. Although it was claimed that the proposed ternary design yields maximum noise margin with minimum switching power, it was not made clear whether this maximum noise margin is still lower than the noise margin of conventional multi-rail binary NCL. The main reason for this is the reduction in the voltage swing from Vdd to ½ Vdd. As shown in Figure 2-18, the proposed design is based on an IS_Data logic gate that is composed of 14 additional transistors and uses four different voltage sources to control the threshold of the transistors using a reverse body bias technique. Although reverse body bias is an effective method to achieve multi threshold operation for single rail ternary logic, as shown in Figure 2-18, it requires two additional voltage sources, which add overhead to the overall design. Such overhead includes the design area and power consumption of the voltage divider circuits.

Figure 2-18: IS_Data Ternary Logic Block utilizing Reverse Body Bias to achieve the required threshold voltage

A simple NAND2 gate was analysed and it was shown that the single rail ternary gate design has a higher static power consumption than its multi-rail counterpart. High static power limits the applications of the proposed design, as asynchronous systems such as NCL tend to spend most of their time in stand-by mode where the power consumption is determined by its static power. Thus, the longer the idle state is, the more static power the gate will consume, and so the single rail ternary gate proposed in [83] will consume much more static power compared to the multi-rail NCL counterpart. Furthermore, the speed of the proposed ternary design is almost half that of the equivalent binary gate, which leads to a ternary system that has higher static power and lower performance compared to conventional binary NCL. As a result, this proposed single rail technique is not a suitable alternative to standard multi-rail NCL.

2.6 SUMMARY

This chapter has presented some prior work on multi-rail NCL techniques and implementations and has discussed and demonstrated the benefits and drawbacks of each technique. Early implementations were based on bulky resistors, whilst most recent implementations that use multi-threshold techniques do not meet the NCL characteristics. Although some (or all) of these approaches might ultimately support low power DI asynchronous systems, they are unsuited to implementing NCL systems. This is due either to the high power dissipation during the idle state or because the proposed solution does not fulfil a necessary characteristic of the NCL system such as hysteresis. Whilst all these techniques are based on multi rail logic, an alternative way to implement Delay-Insensitive systems has been discussed which is based on single rail ternary logic. Although it has been presented as an alternative to the multi-rail NCL system, analysis shows that while it has some advantages compared to the multi-rail NCL system, it has some drawbacks as well, such as low performance and high static power consumption, which is the main source of power dissipation while the design is in its idle state.

Consequently, another approach is required to take advantage of the single rail ternary logic and the NCL system at the same time, by representing the NCL system itself using Single Rail Ternary logic instead of Multi Rail Binary logic. The focus of this thesis is to propose a new NCL System based on a single rail ternary logic and evaluate it to determine if it can achieve lower power consumption, smaller design area, less design complexity and ultimately higher speed. The following chapter will demonstrate and discuss the main components of the single rail ternary logic NCL followed by the development of a component library for the proposed system.

Chapter 3: Register – Controlled Ternary NCL System

3.1 INTRODUCTION

In this chapter, a new architecture called the Multi Threshold Single Rail Ternary Logic system is proposed and analysed. The architecture combines two main concepts: the multi threshold technique and ternary logic. Ternary logic is used as an alternative to binary so only a single rail is required to implement each signal in the gate, in contrast to the conventional dual or quad rail systems that require multiple physical wires.

There are a number of advantages in using ternary logic compared to binary in NCL. Firstly, the single rail removes the possibility of the illegal state where both of the rails could be one (X1X0 = 11), which might occur with multi rail binary logic due to noise. In addition, it leads to less routing complexity, as the number of wires required is reduced by 50% compared to a dual rail binary system and by up to 75% compared to a quad rail system. This is offset slightly by the need for the routing system to handle multi-level signals, but the overall effect is still positive. Finally, the proposed ternary logic NCL system is not necessarily limited to the existing set of 27 NCL threshold gates, which can increase its design flexibility.
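The first two advantages can be illustrated with a short encoding sketch. It assumes the usual dual rail convention in which X1X0 = 00 is NULL, 01 is DATA0 and 10 is DATA1, and it represents the single rail ternary signal abstractly as a low, middle or high level. The names used are illustrative only.

```python
# Dual rail encoding (assumed convention): (X1, X0) -> symbol.
DUAL_RAIL = {(0, 0): "NULL", (0, 1): "DATA0", (1, 0): "DATA1"}

# Single rail ternary: every one of the three levels is a legal symbol.
SINGLE_RAIL = {"low": "DATA0", "mid": "NULL", "high": "DATA1"}


def decode_dual_rail(x1, x0):
    """Decode a dual rail pair; the pair (1, 1) has no valid meaning."""
    return DUAL_RAIL.get((x1, x0), "ILLEGAL")


def wires_required(num_signals, rails_per_signal):
    """Number of physical wires for num_signals logic signals."""
    return num_signals * rails_per_signal


print(decode_dual_rail(1, 1))                        # the illegal state
print(sorted(SINGLE_RAIL.values()))                  # no illegal level exists
print(wires_required(8, 2), wires_required(8, 1))    # dual rail vs single rail
```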

The proposed ternary NCL technique utilises the principle of multi threshold CMOS, as its transistor level design is based on using a combination of different threshold voltages—low, regular and high. This arrangement of multi threshold transistors avoids the problem with previous solutions where transistors in both the pull-up and pull-down networks were in their active region during the Null state. Thus, it effectively reduces the leakage power consumption, which is then reflected in the total power consumption.

In the following sections, the main modules of the ternary NCL system are discussed and covered in more detail to demonstrate: 1) the overall simplicity of the design and its main blocks and gates, 2) the required number of transistors to design and implement both the ternary gates and registers, which is then used to compare their area against the multi rail binary case, 3) the speed of the new ternary gates and 4) the total power consumption of the new architecture.

3.2 MULTI THRESHOLD SINGLE RAIL TERNARY LOGIC NULL CONVENTION LOGIC SYSTEM (MT-SR-TNCL)

The proposed architecture in this chapter is based on the idea of the pipelined asynchronous architecture where handshaking control signals are used to control the data flow between adjacent gates. The first version of this work is based on using a minimum of two single rail ternary NCL registers, where one is positioned at the input stage as an input control register and the other one is at the output stage as an output control register. These are used to hold the input and output values, which could be either a Data value (either Data One or Data Zero) or a Null value, to maintain the hysteresis characteristics of the NCL system. At the same time, the ternary registers have to generate the necessary control signals i.e., the Request for Input control signal, which is then inverted to form the Active signal (Figure 3-4) and used to control the successor gate.

The single rail ternary NCL register will generate a single rail control signal (Request for Input, RFI) to request a new set of inputs, which could be Data or Null, i.e., RFI is either Request for Data (RFD) or Request for Null (RFN), depending on the current cycle of the data flow between the two adjacent ternary logic NCL registers. If the current output of the ternary logic NCL register is Data, it will request Null and vice versa. The other single rail control signal (Active) is used to control the subsequent gates and switch these gates into their Active mode (if the data is available) or switch them into Sleep mode if the input is Null (i.e., the data is not yet available).
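The alternation between the two requests can be sketched behaviourally. The sketch assumes that RFD is represented by a logic 1 and RFN by a logic 0, and that Active is the logical inverse of RFI as described above; the signal polarity and the function names are assumptions made for illustration rather than details of the actual register design.

```python
DATA0, DATA1, NULL = "DATA0", "DATA1", "NULL"


def request_for_input(register_output):
    """RFI for a register currently holding register_output.

    Assumed polarity: 1 means Request for Data, 0 means Request for Null.
    """
    return 1 if register_output == NULL else 0


def active(register_output):
    """Active is the inverse of RFI: 1 wakes the successor gates, 0 sleeps them."""
    return 1 - request_for_input(register_output)


# A register output alternating between NULL and DATA wavefronts.
sequence = [NULL, DATA1, NULL, DATA0]
rfi_trace = [request_for_input(value) for value in sequence]
active_trace = [active(value) for value in sequence]
print(rfi_trace, active_trace)
```

With this polarity, the successor gates are awake exactly while the register holds Data and asleep while it holds Null, which matches the behaviour described in the text.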

Two different versions of the proposed single rail ternary NCL system have been designed and implemented in this chapter. These are identical from an architectural perspective but use different CMOS process technologies. The first design, described in Section 3.3 below, is based on a 250nm Silicon-on-Sapphire process while the second uses a 45nm bulk CMOS process technology. Additionally, the two designs are mapped to different voltage levels. Whereas in the first design Data One, Null and Data Zero are mapped to the voltage levels VDD, VDD/2 and 0V respectively, in the second design Data One, Null and Data Zero are mapped to the voltage levels +VDD, 0V and –VDD respectively. The objective of this is to illustrate that the proposed architecture is independent of both CMOS process technology and the mapped voltage levels, as will be discussed in more detail in the following sections.

3.3 CMOS MULTI THRESHOLD SINGLE RAIL TERNARY LOGIC NCL (CMOS MT-ST-TNCL) – DESIGN # 1

This is the first design of the proposed architecture that was implemented, and it answers the first three questions in this research:

1) Is there an alternative to the Multi Rail Binary Logic to represent the Null Convention Logic System? For instance, can the Single Rail Ternary Logic be used as an alternative to the Multi Rail Binary Logic to design and implement the Null Convention Logic System?

2) Is there a method to design and implement the Null Convention Logic system using a Single Rail instead of Multi Rail without degrading the NCL system characteristics, i.e. observability, completeness and hysteresis?

3) How can the new system be built and structured? What would the architecture of the new system design be, including the transistor level design?

In this section, the main question of whether single rail ternary can be used as an alternative option to the multi rail binary for NCL is addressed. The first design of the proposed architecture has the following characteristics that differentiate it from the subsequent design. Firstly, it is based on a 250nm Silicon on Sapphire (SoS) CMOS process technology, which is a member of the Silicon on Insulator family of technologies. This SoS process technology has an insulating layer over the substrate and, as a result, its parasitic capacitances are smaller than those of a roughly equivalent bulk process. The technology offers a number of unique advantages, including support for high-speed operation and the availability of multiple simultaneous transistor threshold voltages that support the various components of the multi threshold single rail ternary NCL system. As the process supports three different threshold values for each of the P and N type transistors, it is relatively straightforward to achieve the range of switching threshold voltages that are required for the proposed ternary logic circuits without requiring large variations in width/length ratios. Although this is an older technology with feature sizes that are large by comparison to the current state of the art, the insulating substrate removes the need for deep well structures and guard regions. As a result, the devices can be packed more densely than in a corresponding bulk CMOS process. For the same reason, the process is intrinsically fast and has low power consumption, as the small parasitic capacitances at the source, drain and interconnect lines result in reduced intrinsic delay, which is linearly proportional to the capacitance, and reduced dynamic power consumption, which is proportional to $C V_{DD}^2 f$.

Table 3-1: Ternary signal level mappings for single rail ternary NCL System

| Ternary NCL System voltage level – Design # 1 | NCL Value Mapping |
|---|---|
| VDD = 0.6V | Data One |
| VDD/2 = 0.3V | Null |
| 0V | Data Zero |

Secondly, as illustrated in Table 3-1, the three NCL symbols, DATA One, NULL and DATA Zero, are mapped to VDD, VDD/2 and 0V, respectively. The NULL value is represented by VDD/2 and, as a result, each successive transition at the output rail is VDD/2, which minimises switching power at the expense of a reduction in noise margin. The proposed architecture still maintains the main characteristics of NCL, in that legal transitions can only occur between DATA (either Data One or Data Zero) and NULL or vice versa, and no transition can take place directly between the two logic values.
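The transition rule described above can be illustrated with a small Python sketch. It uses the Design #1 voltage levels from Table 3-1; the symbol names `DATA0`, `NULL` and `DATA1` and both helper functions are hypothetical names introduced only for this illustration.

```python
# Voltage mapping from Table 3-1 (Design #1, VDD = 0.6 V).
VDD = 0.6
LEVELS = {"DATA0": 0.0, "NULL": VDD / 2, "DATA1": VDD}


def is_legal_transition(prev: str, nxt: str) -> bool:
    """Legal NCL transitions go from DATA to NULL or from NULL to DATA."""
    if prev == nxt:
        return True  # holding the same value is not a transition
    return "NULL" in (prev, nxt)


def swing(prev: str, nxt: str) -> float:
    """Voltage step seen on the single output rail for a transition."""
    return abs(LEVELS[nxt] - LEVELS[prev])


if __name__ == "__main__":
    for a, b in [("NULL", "DATA1"), ("DATA0", "NULL"), ("DATA0", "DATA1")]:
        print(a, "->", b, "legal:", is_legal_transition(a, b))
```

Under this mapping every legal transition has a swing of VDD/2, whereas the only full-swing transition (Data Zero to Data One directly) is the one that the NCL protocol forbids.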

In general terms, NCL architectures comprise adjacent NCL registers surrounding operational gates sourced from the threshold gate library that perform the required logical function of the NCL pipeline. The accompanying completion detection gate is shown in Figure 3-1. The complete single bit dual rail NCL register and its completion detection circuit are illustrated in Figure 3-2, where the single bit dual rail NCL register is composed of two TH22 NCL gates [84] and these are controlled by two binary logic signals: the input signal (I1 or I0) itself that is coming from the previous stage (Stage_1), and the single rail binary logic control signal (Request for Input) that is coming from the following stage (Stage_2).

Figure 3-1: Dual rail Binary Null Convention Logic System (registers R1, R2 and R3 separated by the logic gate blocks Gate 1 and Gate 2, with completion detection (CD) blocks generating the RFI/RFO handshake signals)

Figure 3-2: Single Bit dual rail NCL Register with built in Completion Detection Gate

At the same time, the single bit dual rail NCL register produces a dual rail binary logic output signal (either data or null) to the following stage (Stage_3) along with the single bit control signal (Request for Input) to the previous stage (Stage_1). For the register to be considered as a DI register and to achieve the main characteristics of the NCL system, the register will only pass Data if both of the following two conditions are true:

1. The dual rail binary logic input signal that is coming from the previous stage is data (either Data One or Data Zero, I1 I0 = 10 or 01 respectively).

2. The single rail binary logic control signal “Request for Input” that is coming from the following stage is requesting data (RFI is RFD and it is logic one).

On the other hand, the single bit dual rail NCL register will only pass Null value if both of the following two conditions are met at the same time:

1. The dual rail binary logic input signal that is coming from the previous stage is Null (I1 I0 = 00).

2. The single rail binary logic control signal “Request for Input” that is coming from the following stage is requesting Null (RFI is RFN and it is logic zero).

If neither of these two sets of conditions is met, the register will hold its output value (Data or Null). In this way, the primary function of the register is to control the flow of the Data or Null between the stages in the asynchronous pipeline architecture.
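The pass and hold behaviour of the single bit dual rail register can be sketched behaviourally in Python. The sketch models each TH22 gate as a two-input element with hysteresis, as described above. The class names and the request-out expression (an inverted OR of the two output rails) are assumptions made for illustration and are not taken from Figure 3-2.

```python
class TH22:
    """Two-input threshold gate with hysteresis: set when both inputs are 1,
    cleared when both are 0, otherwise the previous output is held."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, a: int, b: int) -> int:
        if a == 1 and b == 1:
            self.out = 1
        elif a == 0 and b == 0:
            self.out = 0
        return self.out  # mixed inputs: hold


class DualRailRegister:
    """Single bit dual rail NCL register built from two TH22 gates."""

    def __init__(self) -> None:
        self.g1 = TH22()  # rail carrying Data One
        self.g0 = TH22()  # rail carrying Data Zero

    def update(self, i1: int, i0: int, rfi: int):
        """rfi = 1 means the following stage requests Data, 0 requests Null."""
        o1 = self.g1.update(i1, rfi)
        o0 = self.g0.update(i0, rfi)
        # Assumed completion detection: request Data (1) when the output is
        # Null, request Null (0) when the output holds Data.
        rfo = 0 if (o1 or o0) else 1
        return o1, o0, rfo


if __name__ == "__main__":
    reg = DualRailRegister()
    print(reg.update(1, 0, 1))  # Data One arrives while Data is requested
    print(reg.update(0, 0, 1))  # Null arrives too early: output is held
    print(reg.update(0, 0, 0))  # Null requested and present: Null passes
```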

The completion detection circuit in an NCL system detects the output completeness of each register by monitoring its output rails and signalling when all outputs have completely transitioned from Data to Null or from Null to Data depending on the particular cycle. As a result, it generates a single rail binary logic handshake control signal and sends it to the register in the previous stage. This control signal is called a Request for Input. Each set of registers contained within a completion domain requires one or more completion detection gate(s) as shown in Figure 3-2. Therefore, for an N-bit register, N completion detection gates are required and the outputs of all of these circuits are AND-ed together (Figure 3-3) to generate a single “Request for Input” control signal that will be sent to the registers in the previous stage of the pipeline. Of course, as the maximum number of inputs to a standard NCL library gate is four, a hierarchical structure will be required to generate the required single rail binary logic control signal. The completion detection is implemented using TH1n threshold gates (equivalent to OR gates) as shown in Figure 3-2.
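To give a feel for the cost of the hierarchical structure, the following Python sketch estimates how many combining gates and how many levels are needed to merge N per-bit completion signals when each gate has at most four inputs. The function name is hypothetical, and the count is a simple estimate (one gate per group of up to four signals) rather than a minimal network; the per-bit detection gates themselves are not counted.

```python
import math


def completion_tree(n_bits: int, max_fan_in: int = 4):
    """Return (gate_count, levels) for combining n_bits completion signals
    into one Request for Input signal with gates of limited fan-in."""
    gates = 0
    levels = 0
    signals = n_bits
    while signals > 1:
        # Each level groups the remaining signals into gates of up to
        # max_fan_in inputs; each gate produces one signal for the next level.
        signals = math.ceil(signals / max_fan_in)
        gates += signals
        levels += 1
    return gates, levels


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        print(n, completion_tree(n))
```

The number of levels grows with the logarithm (base four) of the register width, which is the source of the additional handshake delay in wide multi rail registers.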

In [26], a modified completion detection circuit was proposed to speed up the NCL pipeline system by changing the input of the completion detection circuit. Instead of detecting the dual rail signal at the output of the NCL register, the completion detection circuit detects the dual rail signal at the input of the delay insensitive NCL register. In this case, to avoid data/null being overwritten, an additional control signal coming from the following stage (Stage + 1) has to be fed into the completion detection circuit, increasing its complexity and possibly eliminating the claimed increase in throughput. While there have been some attempts to optimise or enhance the completion detection network in NCL (e.g., [85]), little work has been aimed at the architecture level of the NCL system itself, as all the proposed design optimisations still use multi rail binary logic and the standard gate library.

Figure 3-3: N-bit Completion Detection Gate (a tree of th14 gates combining RFO(N), RFO(N-1), RFO(N-2), ..., RFO(N-M) into a single RFO signal)

The ternary architecture proposed here eliminates the need for a separate completion detection circuit for each NCL register (Figure 3-4) as the proposed ternary register is designed with the completion detection function integrated into it. The high-level design of the NCL pipeline in this architecture is composed of similar stages apart from the final output stage. In the example shown in Figure 3-4, Stage_1 and Stage_2 are the same and are composed of a ternary register followed by multi threshold ternary logic gates, while the last stage (Stage_3) comprises only the single rail ternary register, which acts to control the flow of data and null cycles through the rest of the system in a similar way to conventional multi rail binary architectures.

As shown in Figure 3-5, the proposed architecture comprises at least two N-bit ternary registers, where one acts as an input control register and the other acts as an output control register. In between these are the multi threshold single rail ternary NCL gates that implement the required logic function, which could be a simple logic function or a more sophisticated operator such as an 8x8 multiplier.

Figure 3-4: The First Version of the CMOS Multi Threshold Single Rail Ternary Logic NCL System Pipeline (N-bit ternary registers R1, R2 and R3 separated by the multi threshold logic gate blocks Gate 1 and Gate 2, with RFI signals passed back between the registers and Active signals driving the gate blocks)

Figure 3-5: Minimum structure of the Single Rail Ternary Logic NCL Pipeline System (two N-bit ternary registers on either side of one multi threshold logic gate block, linked by the RFI and Active signals)

In summary, the single rail ternary logic registers act as input and output control registers with built in control signals, i.e., the Request for Input (RFI) and Active control signals. The Active control signal is generated from the Request for Input control signal using just an inverter. This architecture does not require any additional circuits or gates to detect the completion of the output signal, or to generate any additional control signals, in order to control the flow of the Data and Null cycles across the entire NCL pipeline. In the following sections, the main components of the proposed single rail ternary logic NCL system are discussed in more detail. These include the ternary delay insensitive register and the multi threshold ternary logic NCL gates.

3.3.1 Ternary NCL Register

The ternary NCL register circuit generates a single rail control signal (Request for Input), which has two main functions:

1. It controls the ternary logic gates that are in the current stage of the NCL pipeline by putting the gates in either Active or Sleep mode.

2. At the same time, it controls the ternary registers that are in the previous stage of the pipeline (see Figure 3-5), to send either Data or Null at an appropriate time in the cycle.

The ternary NCL register operates as follows: if the output of the register is Null, its single rail “Request for Input” control signal is asserted high (logic one), which means that the register is requesting Data to be sent from the register in the previous stage, through the gates in that stage. At the same time, the “Request for Input” control signal is inverted to generate the “Active” control signal that is used to enable the gates in the same stage as that register. Only a single inverter is required for an N-bit register, so generating the Active signal has very little effect on the overall design area. This approach guarantees that the ternary logic gates will enter their active mode and start processing Data only if the preceding register has Data available at its output; otherwise, the gates will stay in Sleep mode with their output rails connected to VDD/2 (the Null value). The register will pass Data if and only if all its inputs are Data and the “Request for Input” signal from the following register is true, indicating that the following register is requesting Data. The opposite occurs when the register output is Data: the “Request for Input” signal becomes logic zero, i.e., requesting Null from the previous stage, while the inverted Active signal disables the gates in the current stage. These disabled gates will enter their Sleep mode, with their outputs directly connected to VDD/2, the Null value.

As a result, in this architecture, the Null value does not need to propagate between two adjacent ternary registers through the ternary gates, as the gates in the second stage are disabled during a Null cycle. In a conventional multi rail binary system, the Null value has to propagate across the entire NCL pipeline, something which is considered to be a drawback of NCL systems in general. This requirement to propagate the Null value tends to increase the cycle time of the circuit, which will be reflected in the overall performance of the NCL system. The advantage of the proposed approach is that the down-stream register (R2 in Figure 3-4) will have the required Null value immediately, with no need to wait for the value to propagate across the gates.

It can be seen that the sole purpose of the Active signal is to control the gates within the same stage as the register that generated the signal. If the Request for Input signal coming from the following register is requesting Data and the input signal of register (R1 in Figure 3-4) is Data as well (i.e., all inputs are asserted), the Active control signal will be high and as a result will activate the gates in that stage to allow the input data to be processed and the required output data delivered to the following register (R2 in this example). The design aims to reduce the power consumption, especially during the Null cycle, which is considered more dominant than the active cycle, as the gates might spend most of their time in the Null state waiting for valid input data.

Overall, the Single Rail Ternary Logic NCL Delay Insensitive Register will only generate or pass Data through and activate the following CMOS Multi Threshold Ternary logic NCL Gates if the following two conditions are true:

1. All its current input signals that are coming from the previous gates are Data (either Data One or Data Zero).

2. The “Request for Input” control signal that is coming from the following Single Rail Ternary Logic NCL Delay Insensitive Register is requesting Data.

On the other hand, the register will only switch off the following gates in the same stage and connect their output nodes directly to VDD/2 voltage source (Null Value), if all of the following conditions are true:

1. All its current input signals that are coming from the preceding gates are Null.

2. The “Request for Input” control signal coming from the following register is requesting Null.

If neither of the above two sets of conditions is true, the register will hold its output value, thereby achieving the hysteresis characteristic of the NCL system. The hold is entirely static and behaves as a delay insensitive register.
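The two sets of conditions and the hold behaviour can be summarised in a short behavioural Python model of an N-bit ternary register. This is a sketch only: the class name, the use of the string `"N"` for Null and the integers 0 and 1 for Data, and the zero-delay update are all assumptions made for illustration and say nothing about the transistor level design.

```python
NULL = "N"  # hypothetical symbol for the Null value (VDD/2)


class TernaryRegister:
    """Behavioural model of an N-bit single rail ternary NCL register."""

    def __init__(self, width: int) -> None:
        self.out = [NULL] * width

    def update(self, inputs, rfi_next: int):
        """rfi_next = 1: following register requests Data; 0: requests Null."""
        all_data = all(v in (0, 1) for v in inputs)
        all_null = all(v == NULL for v in inputs)
        if all_data and rfi_next == 1:
            self.out = list(inputs)            # pass Data
        elif all_null and rfi_next == 0:
            self.out = [NULL] * len(inputs)    # pass Null
        return self.out                        # otherwise hold (hysteresis)

    @property
    def rfi(self) -> int:
        """Request for Input sent to the previous stage: 1 requests Data."""
        return 1 if all(v == NULL for v in self.out) else 0

    @property
    def active(self) -> int:
        """Active is the inverted Request for Input signal."""
        return 1 - self.rfi


if __name__ == "__main__":
    r = TernaryRegister(2)
    r.update([1, 0], rfi_next=1)
    print(r.out, "RFI =", r.rfi, "Active =", r.active)
```

Note that a partially arrived input word (some bits Data, some Null) satisfies neither condition, so the register holds its previous output until the whole word is complete.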

The proposed Single Rail Ternary Logic NCL Delay Insensitive Register is composed of four main gates, as shown in Figure 3-6: the Ternary Detector Gate, the Hold Null Gate, the Hold One Gate and the Hold Zero Gate.

Figure 3-6: The components of the Ternary NCL Register (the Ternary Detector Gate produces the IS Null, IS ONE and IS ZERO signals from the Ternary Input; together with RFI these drive the Hold Null, Hold One and Hold Zero Gates, which connect the Ternary Output to VDD/2, VDD and GND respectively)

Ternary Detector Gate

The Ternary Detector Gate is the main component of the ternary register, as it detects whether the ternary input signal is Data One, Data Zero or Null and, based on this, generates the single rail “Request for Input” control signal. The advantage of this design is that it removes the need for a separate completion detection gate.

The CMOS transistor level design of the Ternary Detector Gate is shown in Figure 3-7 and its truth table is given in Table 3-2. The circuit divides the input ternary signal into three regions, represented by three single rail signals, “IS ZERO”, “IS NULL” and “IS ONE”, with three corresponding gates to produce these signals. As shown in Figure 3-7, the three gates are built using a combination of low and regular (general) threshold transistors, indicated by VthL and VthG, respectively.

As briefly mentioned in Chapter 1, the operation of any logic system relies on establishing one or more known switching threshold voltage values at the input of the receiving gate. The general form of the switching threshold value is shown in (5), below [86] (repeated here for clarity from Chapter 1), where VTHP and VTHN are the threshold voltages of the P type and N type transistors respectively, and KP and KN represent the "process gain" of the relevant transistors. Although this equation is valid only for a single CMOS inverter, it clearly shows the sensitivity of these circuits to both transistor geometry (width/length) and threshold voltage. It is fairly typical in a digital logic layout to fix the transistor length at the minimum value that is allowed by the technology, in which case the gain ratio term KP / KN reduces to approximately (µP/µN) (WP/WN), i.e., this term depends on the relative values of mobility (µ) and width (W).

Figure 3-7: Ternary Detector Gate – First Version (transistor level schematic in which VthL and VthG devices produce the IS Zero, IS Null and IS One signals from the Ternary Input)

As already mentioned, the detector circuit employs transistors with two different threshold voltage values (low and regular) at approximately -0.1V/0.1V (P/N) and -0.4V/0.4V respectively to establish two switching thresholds in the sub-circuits for the input ternary logic signal as given by (5). The first threshold is at around 0.15V, which detects the presence of logic Zero, while a second, at around 0.46V, detects a logic One. Between these points, the input ternary logic signal is considered to be NULL.

$$V_{SW} = \frac{V_{THN} + \left(V_{DD} + V_{THP}\right)\sqrt{K_P/K_N}}{1 + \sqrt{K_P/K_N}} \qquad (5)$$
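As a quick numerical check of the two switching thresholds quoted above, the following Python sketch evaluates the standard CMOS inverter switching threshold expression, V_SW = (VTHN + (VDD + VTHP)·sqrt(KP/KN)) / (1 + sqrt(KP/KN)), with VTHP taken as a negative value. The function name is hypothetical, and the gain ratio KP/KN = 1 is an assumption made only for this illustration; the actual transistor sizing of the design is not given in this section.

```python
import math


def switching_threshold(vdd: float, vthn: float, vthp: float,
                        kp_over_kn: float = 1.0) -> float:
    """Switching threshold of a single CMOS inverter.

    vthp is negative for a PMOS device; kp_over_kn is the gain ratio KP/KN
    (assumed to be 1 here, which is not a value taken from the design).
    """
    r = math.sqrt(kp_over_kn)
    return (vthn + r * (vdd + vthp)) / (1.0 + r)


if __name__ == "__main__":
    vdd = 0.6
    # Regular threshold PMOS (-0.4 V) with low threshold NMOS (0.1 V).
    print(switching_threshold(vdd, vthn=0.1, vthp=-0.4))
    # Low threshold PMOS (-0.1 V) with regular threshold NMOS (0.4 V).
    print(switching_threshold(vdd, vthn=0.4, vthp=-0.1))
```

With equal gains the two combinations give approximately 0.15V and 0.45V, in line with the thresholds of around 0.15V and 0.46V reported above; the small difference in the upper value would be consistent with a gain ratio slightly different from one.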

Table 3-2: Truth Table of Ternary NCL Detector Circuit – Version # 1

| Ternary Input Signal | IS Zero (logical) | IS Null (logical) | IS One (logical) | IS Zero (voltage) | IS Null (voltage) | IS One (voltage) |
|---|---|---|---|---|---|---|
| 0 (0.0V) | 1 | 0 | 0 | 0.6V | 0.0V | 0.0V |
| NULL (0.3V) | 0 | 1 | 0 | 0.0V | 0.6V | 0.0V |
| 1 (0.6V) | 0 | 0 | 1 | 0.0V | 0.0V | 0.6V |

Figure 3-8: Switching Threshold Voltage for IS Zero and IS Null Gates

The switching results for two of the gates that are used to implement the ternary detector are illustrated in Figure 3-8. The curve on the left-hand side of the diagram is the output of the inverter made up of a regular threshold voltage PMOS transistor and a low threshold voltage NMOS transistor, and is used to generate the “IS Zero” signal. The other curve (blue) is from the NOR gate that generates the “IS Null” signal. This NOR gate is composed of one low threshold voltage PMOS transistor, while the rest of its transistors are regular threshold voltage transistors, shifting the switching voltage curve to the desired voltage level of around 0.46V in order to detect logic one. The threshold voltage of the NOR gate that is used to generate the “IS Null” signal is set at VDD/2, resulting in the logical behaviour shown in Table 3-2 and in Figure 3-9, where the supply voltage has been set to 0.6V. The switching thresholds seen at both inputs of the lower NOR gate (the gate that generates the “IS One” signal) are set at the same value of 0.15V. In this case, the gate inputs are binary, and the low switching threshold simply serves to enhance the critical rise time performance.

Figure 3-9: Ternary NCL Detector Circuit Waveform

As shown in Figure 3-9, the “IS Zero” signal will be high (logic one i.e., 0.6V) only if the ternary input signal is logic zero (0V); otherwise, it is low (logic zero i.e. 0V). Similarly, the “IS One” signal will be high (logic one i.e., 0.6V) only if the ternary input signal is logic one (0.6V); otherwise, it is low (logic zero i.e., 0V). Finally, the “IS Null” signal is high (logic one i.e. 0.6V) only if the ternary input signal is the Null value (0.3V); otherwise, it is low (logic zero i.e. 0V).
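The logical behaviour of the detector can be reproduced with an idealised Python model that compares the input voltage against the two switching thresholds quoted earlier (around 0.15V and 0.46V). The function name, the ideal comparator behaviour and the derivation of IS Null as the NOR of the other two outputs are simplifying assumptions for illustration; they are not a description of the transistor level circuit in Figure 3-7.

```python
V_LOW = 0.15   # approximate threshold below which the input is Data Zero
V_HIGH = 0.46  # approximate threshold above which the input is Data One


def detect(v_in: float):
    """Return (IS Zero, IS Null, IS One) as logic values for an input voltage."""
    is_zero = 1 if v_in < V_LOW else 0
    is_one = 1 if v_in > V_HIGH else 0
    # Assumed simplification: Null is whatever is neither Zero nor One.
    is_null = 1 - (is_zero | is_one)
    return is_zero, is_null, is_one


if __name__ == "__main__":
    for v in (0.0, 0.3, 0.6):
        print(v, detect(v))
```

For the three nominal input levels the model produces exactly one asserted output, matching the rows of Table 3-2.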

Figure 3-10 illustrates the supply current over the range of the ternary logic input signal (from 0V to 0.6V). As expected, the detector circuit has two current peaks corresponding to the switching thresholds of the sub-circuits, which are approximately symmetrical around the midpoint (0.3V) that represents the Null value. Thus, static power consumption is minimised during the NULL cycle. This contrasts with the circuits of [51], which exhibit high peak current during NULL cycles. The proposed circuit shows lower power consumption in a NULL state compared to a DATA cycle [51], substantially reducing the average static power consumed by these circuits during their idle state.

Figure 3-10: Current versus Ternary Input Signal

Hold Gates

The other main part that is used to form the Single Rail Ternary Logic NCL Delay Insensitive Register is the hold circuits that follow the ternary detector gate in the architecture as shown in Figure 3-6. In addition to a ternary detector gate, the proposed register architecture uses three hold gates operating to hold Null, Data One and Data Zero and therefore achieve NCL hysteresis behaviour. The three hold gates are all set up as conventional C-element gates followed by a latch to maintain static operation as shown in Figure 3-11, where the multi threshold technique is used to set the switching points of the C-element gates.

The single rail “Request for Input” control signal derived from the following register determines which hold gate is active during any given cycle, while the other hold gates are in sleep mode (i.e., not active and disconnected from the power supply lines). Each hold gate has two single rail input signals, where the “Request for Input” control signal is a common input signal to all three hold gates and, as a result, only one of them is active at a time. As shown in Figure 3-11, the single rail Request for Input control signal is connected to high threshold PMOS and NMOS transistors, while the input signal is connected to low threshold PMOS and NMOS transistors. The input signal here is “IS Null” in the Hold Null Gate, “IS One” in the Hold One Gate or “IS Zero” in the Hold Zero Gate. Thus, if the Request for Input is requesting Null, only the Hold Null gate will be activated and the other two hold gates will be in sleep mode, which helps to reduce the power consumption of the register during the Null cycle.

Figure 3-11: Hold Circuit (a transistor stack with high threshold devices driven by RFI and low threshold devices driven by IN, producing OUT)

As illustrated in Figure 3-6, the Hold NULL gate is connected to both the inverted “IS NULL” signal and the single rail “Request for Input” control signal that is coming from the following register. Thus, when the input ternary logic signal is NULL and the Request for Input control signal is requesting Null (i.e., it is logic zero, 0V), the Hold Null gate will be activated and will generate logic zero (i.e. 0V) at its output node, which is connected to a PMOS transistor. Once the PMOS transistor is switched on, it will connect the output node of the register to the 0.3V voltage source (i.e. VDD/2, which represents the Null value). This Null value holds until the register receives both (1) an input ternary Data signal (either Data One or Data Zero) and (2) a single rail “Request for Input” control signal requesting Data (i.e. Request for Input is RFD), as shown in Figure 3-12. An input signal of 0.3V (Null value) is detected as a logic zero and hence the Hold Null Gate will be active and create a Null value at the output node of the register if both of the following conditions are true:

1. The single rail “Request for Input” control signal that is coming from the following Single Rail Ternary Logic NCL Delay Insensitive Register is requesting Null.

2. The input ternary logic signal of the register that is in the current stage is Null.

If the above two conditions are not true, the Hold Null Gate will not be active and will sit in a low static power state during the Null cycle.

Figure 3-12: Hold Null Gate Waveform

Additionally, the Hold ONE gate is connected to both the “IS ONE” signal and the same single rail Request for Input control signal that is coming from the following register. The output of this gate is only low when both (1) the input ternary signal is logic one (i.e. 0.6V) and (2) the Request for Input signal is logic one (i.e. 0.6V). This will activate the Hold One gate and will generate logic zero (0V), which will switch on the PMOS transistor. Once the PMOS transistor is switched on, it will connect the output node of the register to the 0.6V voltage source (i.e. VDD, which is Data One). This Data One value holds until the register receives both an input ternary Null value and a single rail “Request for Input” control signal requesting Null, as shown in Figure 3-13. The Hold One gate will be active and hold its output Data One value if both of the following conditions are true:

1. The single rail “Request for Input” control signal that is coming from the following register is requesting Data.

2. The input ternary logic signal of the register that is in the current stage is Data One value i.e. 0.6V.

If the above two conditions are not true, the Hold Data One gate will not be active.

Figure 3-13: Hold One Gate Waveform

Finally, the Hold Zero gate is connected to both the “IS Zero” signal and the same single rail “Request for Input” control signal that is coming from the following register stage. The output of this gate is only high (i.e. Logic One = 0.6V) when the input ternary signal is logic zero and the Request for Input control signal is requesting Data. This will activate the Hold Zero gate and will generate a logic one (0.6V), which will switch on the NMOS transistor. Once the NMOS transistor is switched on, it will connect the output node of the register to the ground rail. This Data Zero value holds until the register receives both an input ternary Null value and a single rail “Request for Input” control signal requesting Null, as shown in Figure 3-14. The Hold Zero gate will be active and hold its output Data Zero value if both of the following conditions are true:

1. The single rail “Request for Input” control signal that is coming from the following register is requesting Data (logic one i.e., 0.6V).

2. The input ternary logic signal of the register that is in the current stage is Data Zero value i.e. 0V.

If the above two conditions are not true, the Hold Data Zero Gate will not be active.

Figure 3-14: Hold Zero Gate Waveform

This implementation guarantees that the proposed ternary NCL register behaves as a hazard free asynchronous circuit that exhibits monotonic output behaviour, which is important to the correct operation of the NCL circuits.
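The interaction of the three hold gates can be sketched structurally in Python by treating each hold gate as a C-element whose output changes only when its two inputs agree. The class names and the use of an "active" flag in place of the real output polarities (active low for the PMOS drivers, active high for the NMOS driver) are assumptions made to keep the illustration short; only legal DATA/NULL sequences are considered.

```python
class CElement:
    """Output follows the inputs when they agree, otherwise holds."""

    def __init__(self, out: int = 0) -> None:
        self.out = out

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class HoldStage:
    """Three hold gates sharing RFI; reports which output driver is on."""

    def __init__(self) -> None:
        self.hold_null = CElement(0)  # 0 means the VDD/2 driver is on
        self.hold_one = CElement(0)   # 1 means the VDD driver is on
        self.hold_zero = CElement(0)  # 1 means the GND driver is on

    def update(self, is_zero: int, is_null: int, is_one: int, rfi: int):
        n = self.hold_null.update(1 - is_null, rfi)  # inverted IS NULL
        o = self.hold_one.update(is_one, rfi)
        z = self.hold_zero.update(is_zero, rfi)
        drivers = []
        if n == 0:
            drivers.append("NULL")
        if o == 1:
            drivers.append("ONE")
        if z == 1:
            drivers.append("ZERO")
        return drivers


if __name__ == "__main__":
    s = HoldStage()
    print(s.update(0, 0, 1, rfi=1))  # Data One input, Data requested
    print(s.update(0, 1, 0, rfi=1))  # Null input, Data still requested
    print(s.update(0, 1, 0, rfi=0))  # Null input, Null requested
```

In this simplified model exactly one driver is on after every step of a legal sequence, and the driver changes only when the detected input and the Request for Input signal agree, which is the hysteresis behaviour described above.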

3.3.2 Multi Threshold Ternary Logic NCL Gates

The proposed architecture does not have to rely on the same limited number of threshold gates as the multi rail binary system, which is restricted to only 27 NCL threshold gates. At the same time, it can use the same EDA tools as are used to design, size and build the conventional binary CMOS gates. This leads to a simpler design process, as it removes the need for any additional special tools or software to design and construct the gates at the transistor level.

Although CMOS multi threshold techniques have been used previously in conjunction with multi rail binary logic NCL systems (e.g., [53]), all of the applications to date still rely on the use of the multi rail technique, i.e., multi threshold dual rail or multi threshold quad rail. In [87], a proposed multi threshold semi static NCL system was shown to reduce the static power. While this offers some advantages, that analysis showed that there is a large overhead in the area of the design compared to conventional semi static multi-rail NCL. For example, the area of the Half Adder demonstrated in [87] was more than 29% larger than an equivalent semi static multi-rail NCL implementation.

In [88], [53] the multi threshold technique was used to build only the combinational gates located between adjacent NCL registers. However, as the multi threshold technique is not used to create the register as well, and the register will spend significant time in its inactive mode during the Null cycle, this may lead to higher power consumption, especially in a large-scale and complex system in which a large number of registers will be required to implement the end to end pipeline. In [18], an additional control signal and extra gates are required to implement the multi rail binary NCL system using the multi threshold technique.

In the single rail ternary NCL Architecture proposed in this work, the multi threshold technique is used in the following way:

1. The entire NCL system is designed and built using only single rail signals.

2. Both the gates and the register blocks are designed and implemented using the multi threshold technique. The registers do not have to be active all the time, as some of their sub-gates and components can be switched off and be in sleep mode when the incoming control signal, “Request for Input”, requests Data.

3. No additional control signals or extra circuits are required to control the gates to be either in Active or Sleep mode. The Active signal can be generated either by inverting the Request for Input control signal that comes from the following stage, or directly (with no need for an additional inverter) from the Request for Input control signal that comes from the stage after the following one (i.e., two stages away from this one).

4. No restrictions or limitations on the number of the gates that can be built or designed using the new architecture.

Figure 3-15: Multi Threshold Ternary NCL Gate Architecture (a low threshold logic gate between the Ternary Input Signal and the Ternary Output Signal, isolated from VDD and ground by high threshold PMOS and NMOS transistors controlled by Active, with a low threshold PMOS transistor connecting the output to VDD / 2)

As illustrated in Figure 3-15, the multi threshold ternary gate is built using transistors with two different sets of threshold voltage, high and low. The high threshold PMOS and NMOS transistors are used to create a virtual power line pair so that the gate, which is built using low threshold devices, can be isolated from the real power lines. The high-threshold devices greatly reduce the leakage current in the gates. Only a single control signal is required to control the entire multi threshold gate.

During a Null cycle, while the gate is in sleep mode and disconnected from the real power lines, the Null value is still required to be maintained at the output. This is achieved via a (low threshold) PMOS transistor which connects the output to the VDD/2 voltage source, as shown in Figure 3-15. This use of a low-threshold PMOS transistor makes entering and exiting the sleep mode very slightly faster, without affecting the power consumption during the Null cycle.

3.3.3 Summary of Operation – Multi Threshold Ternary NCL, Design #1

The operation of the ternary NCL system can be summarised as follows. The NCL system comprises a pipeline of blocks, each formed from ternary gates positioned between registers. Data flow in the pipeline is controlled using a binary encoded “Request for Input” line between adjacent blocks. For example, referring back to Figure 3-4, if a successor register (e.g., R2 in that figure) is requesting Data and, at the same time, the input to the first register (R1) is Data, the binary “Request for Input” signal from this block is driven low (logic zero, i.e. 0V). This signal is inverted to produce the “Active” control signal (logic one, i.e. VDD), so that the ternary gates in that block are switched into their active mode and the low threshold PMOS transistor is switched off, disconnecting the output node from the VDD/2 source. The gates are then able to process the data and generate the required output that will flow to the second register (R2) as per its control signal (Request for Input is in its Request for Data state).

Conversely, if register R2 is requesting a Null value (VDD/2), its single rail “Request for Input” control signal is sent low. If, at the same time, the input to R1 is Null, the “Request for Input” of R1 will be requesting Data (i.e., high, 0.6V) and the Active signal will return to logic zero, thereby disabling the gates in this block and connecting the output node to VDD/2.
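The control relationship summarised above can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the thesis circuit: the function name and the returned field names are hypothetical, and VDD = 0.6 V is taken from the Design #1 levels quoted in the text.

```python
VDD = 0.6  # volts; Design #1 supply level quoted in the text


def block_control(request_for_input: int) -> dict:
    """Behavioural sketch of one Design #1 block (hypothetical helper).

    request_for_input is the binary "Request for Input" driven by this
    block's input register (0 = low, 1 = high). Active is its complement.
    """
    if request_for_input not in (0, 1):
        raise ValueError("request_for_input must be 0 or 1")
    active = 1 - request_for_input  # Active = NOT(Request for Input)
    return {
        "active": active,
        # Gates are connected to the real supply rails only when Active is high.
        "gates_powered": active == 1,
        # In sleep mode the PMOS keeper ties the output node to VDD/2 (Null).
        "null_keeper_on": active == 0,
        "sleep_output_v": VDD / 2 if active == 0 else None,
    }
```

For instance, `block_control(0)` models a block whose gates are active and processing Data, while `block_control(1)` models a sleeping block whose output is held at the Null level.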

3.4 MULTI THRESHOLD SINGLE RAIL TERNARY LOGIC NCL (MT-ST-TNCL) – DESIGN # 2

In this section, a second design model of the proposed CMOS Multi Threshold Single Rail Ternary Logic pipeline is demonstrated to show that the proposed architecture is largely independent of the voltage supplies and/or the CMOS process technology. The second design model of the proposed architecture will address the following research question:

1) Is the proposed architecture dependent on a specific CMOS process technology or specific voltage mapping?

It has the following characteristics that differentiate it from the first that was discussed above:

1. It is based on a 45nm SOI process technology [79] that offers three simultaneous transistor threshold voltages [89] termed low (VTHL), general (VTHG) and high (VTHH) as demonstrated in Table 3-3. Typically, each transistor type targets a different design objective. For example, the low threshold voltage transistor can be used to achieve high performance and high-power operation. The general threshold voltage CMOS transistor targets lower dynamic power consumption while the high threshold voltage transistor can be used to achieve lower performance with low static power consumption.

Table 3-3: Threshold Voltages of the 45nm SOI process technology

        NMOS      PMOS
VTHH    + 0.60    - 0.50
VTHG    + 0.40    - 0.38
VTHL    + 0.30    - 0.30

2. As demonstrated in Table 3-4, the three NCL symbols, Data One, Null and Data Zero, are mapped to +VDD, 0V and -VDD respectively, where in this version +VDD and -VDD are represented by +0.3V and -0.3V respectively, while the Null value is 0V. Transitions at the output rail are therefore still only 0.3V, which minimises the switching power at the expense of a smaller noise margin. Transitions can only occur between Data (either Data One or Data Zero) and the Null value or vice versa.

Table 3-4: Truth Table of the Single Rail Ternary Logic NCL System – Design Model # 2

Ternary NCL System Voltage Level – Version # 2    NCL Value Mapping
VDD = 0.3V                                         Data One
0V                                                 Null
-VDD = -0.3V                                       Data Zero
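The Design Model # 2 mapping and its transition rule can be written out as a short sketch. The enum and function names are hypothetical, and treating "no change" as a legal move is an assumption made only so the helper is total; the voltage levels are those of Table 3-4.

```python
from enum import Enum


class Symbol(Enum):
    """NCL symbols with their Design Model # 2 voltage levels (Table 3-4)."""
    DATA_ZERO = -0.3  # -VDD
    NULL = 0.0
    DATA_ONE = 0.3    # +VDD


def is_legal_transition(before: Symbol, after: Symbol) -> bool:
    """Only Data -> Null and Null -> Data moves are allowed.

    Staying at the same symbol is accepted here (assumption of this sketch).
    """
    if before is after:
        return True
    return Symbol.NULL in (before, after)


def swing_volts(before: Symbol, after: Symbol) -> float:
    """Voltage swing on the single rail for a given transition."""
    return abs(after.value - before.value)
```

Under this rule every legal transition swings the rail by 0.3V, whereas a direct Data One to Data Zero move (which the scheme forbids) would need a 0.6V swing.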

The overall system architecture and gate structure of version # 2 is the same as the previous version. The modification was limited to the ternary detector gate, as will be discussed in the following section.

3.4.1 Ternary Detector Design Model #2

A second design model of the single rail ternary NCL delay insensitive register is proposed in this section. The ternary detector circuit has been modified as shown in Figure 3-16, with its truth table given in Table 3-5, to reduce the required number of transistors. This reduces the overall design area of the NCL system, given that the detector circuit is used in every register. The proposed ternary detector circuit in this section requires only seven transistors compared to ten in model #1, a reduction of around 30%. As the other sections of the register remain the same, the area of the register has been reduced by almost 8%. However, this modification does result in an increase in static power during the Null cycle, although both the propagation delay and dynamic power consumption of the two versions are virtually the same.

As the pull-up employs a variation of the Gate Diffusion Input technique, this design version of the ternary detector exhibits a further limitation. It can be seen that the ternary input controls the pull-up network as it is connected to the source of the pull-up PMOS transistors (VthL and VthH). Hence this design version requires the input drive to be strong enough to drive the IS_NULL signal at full fanout.

Figure 3-16: Ternary Detector Circuit Version # 2

Table 3-5: Truth Table of Ternary Detector Circuit Version # 2

Ternary Input    IS Zero     IS Null     IS One
One (+0.3V)      (-0.3V)     (+0.3V)     (0.0V)
Null (0.0V)      (-0.3V)     (-0.3V)     (+0.3V)
Zero (-0.3V)     (0.0V)      (+0.3V)     (+0.3V)

3.5 NCL DELAY INSENSITIVE REGISTER COMPARISON

This section demonstrates the performance of the different design methodologies and approaches including the proposed techniques to design and implement the NCL delay insensitive register. The comparison will include the following four techniques:

1. Dual rail binary NCL Register (NCL-R).

2. Multi threshold dual rail binary NCL Early Completion Register (MTNCL-EC-R).

3. Multi Threshold Single Rail Ternary Logic NCL Delay Insensitive Register Design Model # 1 (TNCL-R-1).

4. Multi Threshold Single Rail Ternary Logic NCL Delay Insensitive Register Design Model # 2 (TNCL-R-2).

Although MTNCL-EC-R [26] uses a multi threshold approach, the register and its detection gate are both still implemented using a single threshold value (for each of the P and N types). The main reason for this is that both have to be continuously active in order to control the data flow across the different stages of the NCL pipeline without affecting its functionality. As a result, the register will still consume high static power while in its idle mode.

In the multi threshold implementation of multi rail binary NCL, only the combinational logic gates between the adjacent registers are implemented using transistors with multiple threshold voltages. These gates move into sleep mode when all inputs are Null and the request for input is requesting Null as well [54], without affecting the functionality of the NCL pipeline, as the Sleep Mode is used during the Null cycle only. The multi threshold technique can be directly implemented at the transistor level for the standard library of 27 NCL threshold gates as shown in Figure 3-17, where the “Hold0 High Vth” block is the pull up network that holds the logic Zero and is implemented using high threshold PMOS transistors, and the “Set Low Vth” block is the pull down network that sets the output to logic one and is implemented using low threshold NMOS transistors. In this implementation of the MT-MR-BNCL gates (design model # 1) both the Reset and Hold One blocks have been removed and only the Hold Zero and Set blocks are maintained, which reduces their overall area to the same as that of the dual rail binary NCL Early Completion Register circuit. However, this reduction in area comes at the cost of the performance of the NCL system.

It is clear that this implementation cannot be used directly in a conventional multi rail binary NCL system, as the final NCL system would become delay sensitive and hence could not be used to implement the NCL pipeline. To enable the use of the technique within an NCL pipeline without compromising its delay insensitivity characteristic, an early completion technique was introduced in [26] as an alternative to the traditional completion detection technique. However, the resulting multi threshold dual rail binary early completion NCL system suffers from glitches that can propagate through the pipeline. This problem has been addressed in [24] by adding the “Hold1 High Vth” block, which is a pull down network that holds the output when it is logic one and is implemented using high threshold NMOS transistors, as shown in Figure 3-18. This technique is referred to in this thesis as the multi threshold multi rail binary NCL threshold gate Version #2 (MT-MR-BNCL Version #2). In this circuit, the Hold One block is added back to the gates, increasing the area of the gates to be almost equivalent to the static NCL implementation.

Figure 3-17: Multi Threshold Binary Logic NCL Threshold Gate Version # 1

In summary, the Early Completion technique can be combined only with the second version of the multi threshold multi rail binary NCL threshold gates to build a glitch free multi threshold dual rail binary NCL system that does not compromise the delay insensitivity of the system. For the register and its completion detection circuit to achieve the desired early completion behaviour, the completion detection gate has to detect the dual rail binary logic at the input rails of the register instead of the output rails. Currently, both are implemented using transistors with a single threshold voltage.

Figure 3-18: Multi Threshold Binary Logic NCL Threshold Gate Version # 2

The performance of the two proposed single-rail ternary NCL register models has been analysed and compared against the conventional multi rail register including its multi threshold implementation using early completion. The four different registers were designed and implemented in Cadence Virtuoso® using a 45nm Silicon on Insulator CMOS Process technology. All the measurements were taken while each NCL Delay Insensitive Register was loaded by another register of the same type, a typical configuration in a real circuit. The three NCL system symbols (Data One, Null and Data Zero) are mapped to +VDD, 0V and -VDD respectively, where +VDD and -VDD are set to +0.3V and -0.3V respectively, while the Null value is 0V as shown in Table 3-6. The binary signals use +/- 0.3V levels only.

The four different implementations of the NCL register have been assessed, evaluated and compared against each other. The comparison includes the overall area of the register, the worst-case propagation delay, the dynamic power consumption over five sequential Data/Null cycles (i.e. Null → Data One → Null → Data Zero → Null) and finally the static power consumption during the Null cycle.

No physical layouts were done in this work, so all the delay and power figures are “pre-extraction” values i.e., before layout parasitic components are added back to the Spice (Spectre®) netlist. Similarly, the total area of each implementation is an estimate calculated by measuring the approximate area of each transistor (Width x Length) and then multiplying by the number of transistors that are required to build the register. Although this does not account for variations due to different layout styles, it is slightly better than a simple transistor count as it takes the size differences between individual transistors into account. The method was considered to be sufficiently accurate for this study, especially given that it is the ratio between the design areas of the different implementations that is important here.

For example, implementing the dual rail binary NCL register and its completion detection gate takes 32 transistors, and the average area of each transistor is approximately 27 µm2, so its approximate overall area is 864 µm2. However, it was observed that when we tried to optimise the area of the dual rail binary NCL register plus its multi threshold implementation, the output dual rail of the register exhibited glitches as shown in Figure 3-19. Eliminating these glitches required the relative sizes of the transistors to be adjusted, affecting the overall area. This is reflected in the figures shown below.
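The area estimate described above amounts to simple arithmetic, sketched here for illustration. Both helper names are hypothetical; the only figures taken from the text are the 32 transistors and the approximate 27 µm2 average area of the dual rail binary NCL register.

```python
def area_from_sizes(sizes_um):
    """Sum Width x Length over a list of (width_um, length_um) pairs."""
    return sum(width * length for width, length in sizes_um)


def area_from_average(transistor_count, avg_area_um2):
    """Shortcut used in the worked example: count x average transistor area."""
    return transistor_count * avg_area_um2


# Worked example from the text: 32 transistors at roughly 27 um^2 each.
ncl_r_area_um2 = area_from_average(32, 27.0)  # 864.0
```

Because the study compares implementations by the ratio of their areas, a consistent estimate of this kind is sufficient even though layout effects are ignored.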

Table 3-6: NCL Value Mapping for both Ternary and Binary NCL Systems

NCL Value Mapping   Ternary Logic NCL System   Binary Logic NCL System
                                               I1        I0
Data One            VDD = 0.3V                 0.3V      -0.3V
Null                0V                         -0.3V     -0.3V
Data Zero           -VDD = -0.3V               -0.3V     0.3V
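As an illustration of how the two encodings in Table 3-6 correspond, the following sketch converts a dual rail pair (I1, I0) into the equivalent single rail ternary level. The dictionary and function names are hypothetical; the voltage values are copied from the table.

```python
# Single rail ternary level for each NCL value (volts).
TERNARY_LEVELS = {"Data One": 0.3, "Null": 0.0, "Data Zero": -0.3}

# Dual rail binary levels as (I1, I0) pairs (volts).
DUAL_RAIL_LEVELS = {
    "Data One": (0.3, -0.3),
    "Null": (-0.3, -0.3),
    "Data Zero": (-0.3, 0.3),
}


def dual_rail_to_ternary(i1, i0):
    """Return the ternary level that carries the same NCL value as (I1, I0)."""
    for symbol, rails in DUAL_RAIL_LEVELS.items():
        if rails == (i1, i0):
            return TERNARY_LEVELS[symbol]
    # Any other pair, e.g. both rails high, is not listed in Table 3-6.
    raise ValueError("rail pair is not a value listed in Table 3-6")
```

The sketch makes the wiring saving explicit: each NCL value needs two binary rails in the dual rail system but only one ternary rail in the proposed system.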

Figure 3-19: Glitches in Binary NCL when optimized for area

The evaluation of the four NCL register designs is presented in Table 3-7. It can be seen that the NCL-R and MTNCL-EC-R use the same number of transistors and exhibit the same area. As mentioned above, the only difference between them is that the completion detection gate monitors the input rails of the register rather than its output rails. The transistor-level design of the register and its completion detection gate is still the same. Moving the detection point from the output rails to the input rails has reduced the propagation delay by almost 99%, as the completion gate in the early completion design does not have to wait for the Data/Null value to propagate through the register to decide whether it is a Data or Null value. In this way, the Data/Null value propagates via the completion detection gate only. For the same reason, the dynamic power consumption is the same, although there has been a small improvement to the static power figure during the Null cycle.

The analysis shows that TNCL-R-1 is considerably better than TNCL-R-2 in terms of static power during the Null cycle, while TNCL-R-2 has a slightly smaller design area and the propagation delay and dynamic power consumption of the two are similar. Hence, on balance, it appears that the TNCL-R-1 is a better choice to implement a multi threshold single rail ternary NCL system.

As demonstrated in Table 3-7, the proposed multi threshold single rail ternary NCL register design model #1 has better performance than the NCL-R but worse than the MTNCL-EC-R, which has the shortest propagation delay of the entire group of registers. The TNCL-R-1 exhibits higher dynamic power consumption but its static power consumption during the Null cycle has been significantly reduced. Although the TNCL-R-2 implementation requires 36 transistors, it still occupies the smallest design area in the entire set of the registers.

Table 3-7: The evaluation of the NCL Delay Insensitive Register Techniques

NCL Register      # Transistors   Approximate    Propagation     Power Consumption (µW)
Implementation                    Area (µm2)     Delay (nSec)    Dynamic    Idle
NCL-R             32              864            80              0.7        5.1
MTNCL-EC-R        32              864            0.25            0.7        4.3
TNCL-R-1          39              158            2.36            1.1        0.011
TNCL-R-2          36              145            2.4             1.1        0.3

In summary, it is demonstrated that the proposed single rail ternary logic technique can result in smaller area and reduced static power consumption compared to the dual rail binary NCL implementations of the NCL register, but at the cost of an increase in dynamic power consumption. On the other hand, its performance is better than the conventional binary NCL system but worse than the multi threshold binary NCL approach.
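The comparisons drawn from Table 3-7 can be reproduced with a few lines of arithmetic. The sketch below simply re-enters the table values; the dictionary layout and the helper names are hypothetical.

```python
# Values copied from Table 3-7.
TABLE_3_7 = {
    "NCL-R":      {"transistors": 32, "area_um2": 864, "delay_ns": 80.0, "dynamic_uw": 0.7, "idle_uw": 5.1},
    "MTNCL-EC-R": {"transistors": 32, "area_um2": 864, "delay_ns": 0.25, "dynamic_uw": 0.7, "idle_uw": 4.3},
    "TNCL-R-1":   {"transistors": 39, "area_um2": 158, "delay_ns": 2.36, "dynamic_uw": 1.1, "idle_uw": 0.011},
    "TNCL-R-2":   {"transistors": 36, "area_um2": 145, "delay_ns": 2.4,  "dynamic_uw": 1.1, "idle_uw": 0.3},
}


def relative_change(new, old):
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


def best(metric):
    """Name of the register with the smallest value of the given metric."""
    return min(TABLE_3_7, key=lambda name: TABLE_3_7[name][metric])


# Early completion versus conventional register: change in propagation delay.
ec_delay_change = relative_change(
    TABLE_3_7["MTNCL-EC-R"]["delay_ns"], TABLE_3_7["NCL-R"]["delay_ns"]
)
```

With these values, `best("area_um2")` selects TNCL-R-2, `best("idle_uw")` selects TNCL-R-1 and `best("delay_ns")` selects MTNCL-EC-R, matching the discussion above.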

3.6 SUMMARY

To summarise this chapter, the NCL system can be implemented using a single rail, using ternary logic as an alternative to binary to achieve this objective. The proposed architecture in this chapter is still based on a minimum of two adjacent registers to control the flow of data across the pipeline of the NCL system, where one acts as an input control register and the other acts as an output control register for the pipeline. It has been demonstrated that the proposed single rail ternary logic register is delay insensitive as required by the NCL system. The circuit generates a single rail control signal with no additional gates to detect the completion of the output. This control signal has two main roles to play: firstly, it controls the multi threshold ternary gates that are in the same stage of the pipeline, and secondly it sends requests for input to the ternary register in the previous stage. Two design models of the proposed register have been compared against a conventional NCL register implementation and its multi-threshold approach using early completion. Although the proposed register consumes almost 40% more dynamic power, its area and static power are the smallest of the entire group, which makes this proposed new model an attractive one for implementing smaller area NCL designs with low idle state static power dissipation.

In the next chapter, a modified version of the ternary NCL system is proposed that further reduces the area by removing the input and output registers without degrading the functionality of the system.

Chapter 4: A Register-Less Ternary NCL System

4.1 INTRODUCTION

Conventional multi rail binary NCL systems, including their multi threshold implementations, typically make extensive use of pipeline registers to separate neighbouring logic blocks and control the Data/Null flow across the entire NCL pipeline. These registers can account for a large percentage of the overall area of the circuit, which will in turn be reflected in the total power consumption of the NCL system as well as its design complexity.

This chapter proposes an enhancement to the design and implementation of the multi threshold ternary NCL architecture that was discussed in the previous chapter. This enhancement eliminates the need for the ternary registers and converts the overall NCL system into a Register-Less network. As a result, the total design area and the complexity are reduced compared to both the registered multi threshold ternary implementation as well as the conventional multi rail binary case. In the proposed architecture, the NCL registers have been removed without degrading the functionality of the NCL pipeline as will be discussed in more detail in the following sections.

Recent NCL pipeline designs are still based on the use of registers, which are required in both conventional NCL systems and their multi threshold implementations to control the Data/Null cycles in the pipeline [90]. In NCL systems, the delay and latency across the different stages of the pipeline will not be consistent, so these register components are generally required to re-synchronise the data flow at particular points in the circuit by enclosing related data (e.g., the bits of an N-bit variable) within the same completion network [91].

The concept of a register-less NCL pipeline was presented in [92], in which the completion detection gate was removed and replaced by a simple OR gate used to detect the completion of the output (Data/Null) binary logic bit that is on the critical path of each stage in the NCL pipeline. Detection here is applied only to the critical path. While this might appear to be an advantage, as it reduces the size and the number of the required NCL completion detection logic gates, there are a number of serious disadvantages. For example, complex timing analysis is required to identify the correct critical path in each stage in order to guarantee the correct operation of this system. This alone disqualifies the system from use in a quasi-delay insensitive NCL system, as its correct operation is no longer independent of the delay assumptions in the operators [93]. Incorrect operation can arise from the design stage, with an error in the delay estimates or a failure to identify the correct critical path, or during operation when process, voltage and temperature variability issues cause the critical paths to change. The result will be data corruption or deadlocks. Furthermore, for the structure of this register-less NCL organisation to work properly, additional NCL gates are required between the output and input of logic gates on some paths, as direct connection is not possible or even permitted. All of this adds unnecessary gates and buffers to the final design to maintain correct operation. Thus, this register-less approach will not be useful for the implementation of an NCL system as it does not meet the requirements of a quasi-delay insensitive asynchronous system [94], [95].

In this chapter, the enhanced version of the CMOS multi threshold single rail ternary NCL System will be discussed and will be referred to as a Register-Less Multi Threshold Single Rail Ternary Logic NCL System as it removes the need for the NCL registers.

4.2 REGISTER-LESS TERNARY NCL ARCHITECTURE

The pipeline architecture of the proposed Register-Less ternary NCL system, shown in Figure 4-1, removes the need for a minimum of two adjacent NCL registers and replaces them with a combination of Ternary Detector Gates (IS Data) and the required ternary NCL gates in a single stage as shown in Figure 4-2. Thus, the enhanced multi threshold ternary NCL system combines the following functions:

1. A detector circuit at the input of each ternary logic gate to detect whether the input signal is DATA (i.e., in the set {0, 1}) or NULL, which will generate logic One if the input is Data or logic Zero if the input is Null.

2. Fine grained power gating and Register-less operation in order to significantly reduce the power consumption, particularly the static power consumption during the Null cycles.

Figure 4-1: Multistage Register-Less Multi Threshold Single Rail Ternary Logic NCL Architecture

Figure 4-2: Single Stage Register-Less Multi Threshold Single Rail Ternary Logic NCL Architecture

4.2.1 Ternary Detector Gate

As mentioned above, conventional NCL techniques typically place pipeline registers between neighbouring logic modules to prevent a DATA/NULL token from over-riding its preceding NULL/DATA token due to latency differences between pipeline stages. However, in some cases pipeline registers can account for up to 35% of overall power dissipation of the NCL circuit [92]. Two different scenarios will be discussed in the following sections. The first one is the single stage as shown in Figure 4-2 and the second one is the multi stage (pipeline) as illustrated in Figure 4-1.

Single Stage

In conventional multi rail binary NCL systems, two registers are typically required to surround each block of gates. In the proposed Register-less approach, on the other hand, a single stage is built directly from the ternary logic gates and the ternary detectors used to control them. The ternary gates are driven only by the input Data/Null values, as shown in Figure 4-2. Each individual input is connected to a separate ternary detector gate (IS Data), and the outputs of all of these detectors are combined by a Hold Gate to generate a single control signal (Active) that drives the ternary gates (Figure 4-3) to achieve the required logic function. Note that the Hold gate is based on the circuit of Figure 3-11, which has back to back inverters at the output to maintain its state. The gate will stay in its current state until all inputs transition to a different state.

Figure 4-3: Active Control Signal Generator

The ternary detector gate determines whether the input to the ternary logic gate is Data or Null. If it is Data, the ternary detector gate will generate logic one (0.6V) at its output single rail. Conversely, a Null input will result in a logic zero (0V) at its output. The proposed single stage Register-less ternary NCL system operates as follows:

1. If and only if all the single rail ternary logic inputs are data, the combined single rail Active control signal is set high (logic one) which activates the ternary logic gates to process the input data and generate the required output as per its function.

2. Conversely, if and only if all the single rail ternary logic inputs are Null, the Active signal becomes low (logic zero) and de-activates the ternary gates, placing them in their sleep or idle mode. The gates will be disconnected from the power lines and their output rails connected to VDD/2 (0.3V) to generate the required Null value. There is no need to propagate that Null value across the gates, which leads to a reduction in the propagation time of the Null value. During the Null cycle the path from VDD to ground is through high threshold transistors, which leads to a significant reduction in static power consumption during this time.

If neither of the above two conditions is true, the hold gate will hold its current value (logic one or zero) and the gates will hold their current output (Data One, Data Zero or Null) waiting for the next cycle. Again, only transitions from Data to Null or vice versa are possible in this scheme. The gates will be active only if all the inputs are Data and will be in sleep mode, with their single rail outputs connected to the Null value, if all the inputs are Null. Taken together, these characteristics make the system suitable to be deployed as part of an NCL system.
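The single stage behaviour described above can be illustrated with a small behavioural sketch of the IS_Data detector and the hold gate. It is not a model of the transistor circuits: the class and function names, the detection tolerance, and the assumption that Data Zero sits at 0V (with Null at VDD/2 = 0.3V and Data One at VDD = 0.6V) are all illustrative choices.

```python
VDD = 0.6          # volts; supply level quoted in the text
NULL_V = VDD / 2   # Null level, 0.3 V


def is_data(voltage, tolerance=0.1):
    """IS_Data detector sketch: 0 for Null (near VDD/2), 1 for Data.

    The tolerance is a hypothetical value chosen only for this example.
    """
    return 0 if abs(voltage - NULL_V) <= tolerance else 1


class HoldGate:
    """Output goes high only when all inputs are 1, low only when all are 0."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        inputs = list(inputs)
        if not inputs:
            raise ValueError("hold gate needs at least one input")
        if all(inputs):
            self.output = 1      # every input is Data: activate the gates
        elif not any(inputs):
            self.output = 0      # every input is Null: put the gates to sleep
        # Mixed inputs: keep the previous value (hysteresis).
        return self.output
```

Feeding the gate a sequence such as all Null, one input Data, all Data, one input Null, all Null produces an Active signal of 0, 0, 1, 1, 0, which is the hysteresis behaviour the text relies on.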

Multi-Stage

This organisation corresponds to the multi-stage multi rail binary NCL system, where the NCL pipeline is made up of more than one stage and a minimum of two binary registers are required to control the Data/Null flow. Comparing Figure 4-1 and Figure 4-2, it can be seen that the only difference between the single and multi-stage organisation is that in the latter the gates within the current stage (Stage_i in Figure 4-1) are driven by both the input Data/Null values of this Stage (Stage_i) and the control signal from a “down-stream” stage in the pipeline (Stage_i+2), to correctly control the Data/Null flow. The multi-stage Register-Less ternary NCL system operates as follows:

1. If and only if all the single rail ternary logic inputs at Stage_i are Data and at the same time the single rail Active control signal that is coming from Stage_i+2 is high (requesting Data), the Active control signal at Stage_i is sent high to turn on the gate(s) in that stage.

2. Conversely, if and only if all the ternary inputs at Stage_i are Null and at the same time the Active control signal from Stage_i+2 is low (requesting Null), the Active signal at Stage_i is low, which de-activates the Stage_i gates by placing them in their sleep mode and connecting their outputs to Null (VDD/2). This eliminates the need for the Null value to propagate through the ternary gates.

If neither of these two conditions is true, the hold gate will hold its value and the gates will stay in their current state waiting for the next cycle to initiate. This achieves the hysteresis characteristic of the NCL system and guarantees full control of the Data/Null cycle. The ternary gates will be active and process their input data only if all the inputs are Data and the Active signal is requesting Data, and will be in sleep mode with their outputs connected to the Null value if all the inputs are Null and the single rail Active signal is requesting Null.
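The multi-stage rule differs from the single stage case only in that the Active signal from Stage_i+2 joins the detector outputs at the hold gate. A minimal sketch of that update rule follows; the function name and argument names are hypothetical, and all signals are modelled as 0 or 1.

```python
def next_active(inputs_are_data, downstream_active, current_active):
    """Next value of the Stage_i Active signal.

    inputs_are_data:   IS_Data outputs for every Stage_i input (1 = Data).
    downstream_active: Active signal from Stage_i+2 (1 = requesting Data).
    current_active:    present Active value of Stage_i, kept when held.
    """
    signals = list(inputs_are_data) + [downstream_active]
    if all(signals):
        return 1               # all inputs Data and Data requested
    if not any(signals):
        return 0               # all inputs Null and Null requested
    return current_active      # any other combination: hold


# Example walk through one Null -> Data -> Null cycle of a two-input stage.
active = 0
active = next_active([1, 1], 0, active)  # Data arrived, Null still requested
active = next_active([1, 1], 1, active)  # Data requested: stage wakes up
active = next_active([0, 0], 1, active)  # inputs Null, Data still requested
active = next_active([0, 0], 0, active)  # Null requested: stage sleeps
```

In this walk-through the stage changes state only on the second and fourth steps, when the inputs and the downstream request agree.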

As shown in Figure 4-1, the Active signal controlling the gates in Stage_i is also fed forward to control Stage_i+2. Using the same control signal in this way eliminates a large number of control lines and therefore the hardware required to generate them. The architecture therefore becomes an alternating set of gates holding Data and Null values in succession. The Active signals between pairs of gate stages control the Data/Null flow in the NCL pipeline by managing a Data/Null “wave front” that flows between the alternating blocks in much the same way it would between registers in a multi bit binary NCL system.

4.2.2 Register-Less Multi Threshold Ternary NCL Gates

The multi threshold ternary NCL gates in this chapter are slightly different to the ones proposed and discussed in the previous chapter, as removing the NCL registers from the pipeline requires some modification to the gates so that they hold their output while waiting for the next Data/Null cycle. The two important cases that require addressing are when:

1. some or all of the inputs are still Data and the request for input control signal is requesting Null;

2. some or all of the inputs are still Null and the request for input control is requesting Data.

It can be seen that these are the transitional cases when a new Data or Null cycle is just commencing. In the first case, the output of the ternary gate will be Data and as the request for input changes from requesting Data to requesting Null (i.e., the Active signal from the following stages has changed from logic one to logic zero), the gate still has to hold its output data until all the inputs have transitioned from Data to Null as there is no output register here to hold the data. In the second case, the output of the gate is Null and the request for input is changing from requesting Null to requesting Data. The gate still has to hold its output Null until all the inputs transition from Null to Data. Hardware modifications were necessary to allow for these cases.

The output from the Data detector (the “IS_Data” signal) is used together with the Request for Input “RFI” from the following stage to drive the combinational logic of the current stage as shown previously in Figure 4-1. If the input signal is Data, the output of the detector circuit is high, which will activate the following combinational logic. If the input signal is NULL, the output of the detector circuit is low, and the following combinational circuit will be placed in its Sleep mode with its outputs at NULL. Each input signal to a multi-input gate is connected to its own detector circuit and their outputs are grouped together using a hold gate to form the Active signal shown in Figure 4-3 to drive the combinational logic gates of the current stage and generate the RFI signal for the previous stage.

Figure 4-4: Multi Threshold NCL AND Gate for Register-Less NCL Architecture

As an example of these ternary gates, Figure 4-4 shows the transistor level design of a ternary AND Gate for use in the Register-Less architecture. In this circuit, transistors P0, P1 and N1 are high threshold CMOS transistors, while P2, P3, P4, P5, N2, N3, N4 and N5 are low threshold transistors. P1 and N1 are used as an isolation switch between the real and virtual supply lines. If all of the single rail ternary logic inputs are Data and the Active control signal that is coming from Stage_i+2 is high (requesting Data) then the Active signal in Stage_i will be high as well, so that the Sleep signal, which is the complement of the Active control signal, will be low. In this state both N1 and P1 will be on while P0 is off so that the AND gate will be in its Active mode and will process the input Data and generate an output. When all the single rail ternary logic inputs are NULL and the Active control signal that is coming from Stage_i+2 is low (requesting Null) then the Active signal in Stage_i will be low as well and the Sleep Signal will be high. Transistors N1 and P1 will be off and P0 on. The AND gate will be in its sleep mode with the output rail connected to VDD/2 (Null) via P0.

This design guarantees that the ternary gate will be active only if all the inputs are valid input data and the request for input is requesting Data. Similarly, the gate will be in its low power, sleep mode with its output at NULL only if all the inputs are NULL and the request for input is requesting Null. Otherwise the output will hold its current state via the back to back inverters, thereby exhibiting the necessary hysteresis behaviour of NCL.
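The active and sleep states of the register-less AND gate can be summarised in a behavioural sketch. This is an illustration only: the function names are hypothetical, Data Zero is assumed to sit at 0V, and the switch states simply restate the on/off conditions given in the text for P0, P1 and N1 rather than modelling the transistors.

```python
VDD = 0.6  # volts; Null is VDD / 2


def to_bit(voltage):
    """Interpret a Data level as a bit (assumes Data Zero = 0 V, Data One = VDD)."""
    return 1 if voltage > VDD / 2 else 0


def register_less_and(a_v, b_v, active):
    """Return (output voltage, isolation/keeper switch states) for one AND gate.

    active is the stage's Active signal (1 = process Data, 0 = sleep).
    """
    sleep = 1 - active  # Sleep is the complement of Active
    switches = {
        "P1_on": sleep == 0,   # header switch to the real VDD
        "N1_on": active == 1,  # footer switch to the real ground
        "P0_on": active == 0,  # keeper tying the output to VDD/2
    }
    if active == 0:
        output_v = VDD / 2     # sleeping: output held at Null via P0
    else:
        output_v = VDD if (to_bit(a_v) and to_bit(b_v)) else 0.0
    return output_v, switches
```

For example, `register_less_and(0.6, 0.0, 1)` yields a Data Zero output with P1 and N1 on, while any call with `active=0` returns the Null level with only P0 on. The hold behaviour itself lives in the Active signal, which changes only when the hold gate allows it.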

4.2.3 Hold Gate

The main purpose of the Hold Gate here is to maintain the NCL input completeness condition so that the outputs of the ternary gates transition from NULL to DATA only after all inputs transition from NULL to DATA, and vice versa. The hold gate groups the outputs of the detector gates of the current stage (Stage_i) with the output of the hold gate of Stage_i+2. This guarantees that the combinational logic gates of the current stage (Stage_i) will be active only if all the inputs are Data and the request for input is requesting Data, and will be in sleep mode only if all the inputs are NULL and the request for input is requesting Null. Otherwise they will hold their current state, either Data or NULL.

These three components, i.e. the Ternary Detector Gate, the CMOS Multi Threshold Single Rail Ternary Logic Gates and the Hold Gate, combine to form the Register-Less ternary NCL architecture and eliminate the need for the ternary NCL registers, which consume power as well as increasing the design complexity and area. In the next section, these gates are used to build a small example NCL system, an 8-bit Full Adder, as a proof of concept.

4.3 IMPLEMENTATION EXAMPLE: 8-BIT FULL ADDER

As a proof of concept, an 8-Bit Full Adder circuit has been designed and implemented in both dual rail binary and Register-less ternary NCL architectures, using a 45nm CMOS process technology with a supply voltage of 0.6V. A conventional one-bit full adder, formed from two TH34w2 and two TH23 gates, has been used as the basis of the binary NCL circuits, which is then replicated to build the 8-bit adder block. The Register-less ternary NCL 1-bit Adder module shown in Figure 4-5 consists of three detectors, a four-bit hold circuit plus the register-less ternary NCL Full Adder block. In turn, the register-less Full Adder block comprises two register-less ternary half adders and a register-less ternary OR gate, and the ternary register-less HA is made up of a register-less ternary XOR and a register-less AND gate. The full circuit diagram is shown in Figure 4-6 and its waveform is demonstrated in Figure 4-7. All of these gates have been designed using the same approach illustrated above for the simple multi-threshold register-less ternary AND gate (Figure 4-4).

Figure 4-5: Register-Less Multi Threshold Single Rail Ternary Logic NCL One Bit Full Adder

The 8-bit dual rail binary NCL pipeline uses registers at its input and output stages, which increases its area and power consumption. In contrast, the proposed ternary approach eliminates the need for those registers and uses detector circuits that activate the full adder only when all the inputs are Data and the Active control signal (Active_i+2) is high (requesting Data).

Figure 4-6: Schematic Diagram of Register-Less Multi Threshold Single Rail Ternary Logic NCL One Bit Full Adder

As shown in Figure 4-7, at 3.4 µs, when all inputs are de-asserted (transition to Null, which is 0.3 V) and the RFI is a request for Null (0 V), both outputs (Sum and Carry) are de-asserted as well (transition to Null, 0.3 V). This maintains the completeness characteristic of the NCL system. On the other hand, at 3.7 µs, when inputs B and Cin are Data and the RFI is a request for Data but input A is still Null, both outputs (Sum and Carry) hold their current (Null) state until A transitions from Null to Data. Until then, the following ternary logic gate (the Multi-threshold Ternary Full Adder) remains in Sleep mode, which helps to reduce power consumption because not all the inputs are Data yet, and also maintains the hysteresis (state-holding) capability of the NCL system.

Figure 4-7: Multi-Threshold Single-Rail Ternary Logic NCL One Bit Full Adder Waveform

4.3.1 Simulation Results

In this work, the circuits were simulated using Cadence® tools to measure the area, power consumption and worst-case delay of both the Sum and Carry signals for the various adders. The dual rail binary NCL circuits were built using transistors with a single threshold value (the “general” value VGTH). Both the multi threshold dual rail binary and the register-less ternary NCL circuits were built using multiple threshold values (low and high threshold voltage transistors), where the high threshold transistors were used as isolation switches to create virtual supply lines that allow the circuits to be disconnected during the Null cycle, thereby reducing power.

In this case, a simple transistor count has been used as a proxy for area, which although inaccurate in absolute terms does allow a comparison to be made between the design styles. The measured power covers the full set of input vectors [000...111], where each data cycle is followed by a NULL cycle. Data and NULL were simulated at a vector presentation rate of 1 GHz and the results are listed in Table 4-1.

Table 4-1 shows that, compared to dual rail binary NCL, the proposed register-less ternary circuit achieves area savings of around 53%, while the power consumption of the 8-bit adder was reduced by 15%. Compared to multi-threshold dual-rail NCL, the ternary approach achieves area savings of around 46% and the power consumption was reduced by 7%. The proposed Register-less multi threshold ternary NCL system is faster than the corresponding multi threshold binary circuit, while both are slower than dual rail binary NCL due to the use of the high threshold transistor as a virtual switch.

Table 4-1: 8-Bit Full Adder Analysis and Comparison

Design Technique                                        Area (# Transistors)   Power (nW)   Worst Case Delay (nSec)   Worst Case PDP
Dual rail Binary NCL                                    2040                   375          0.6                       225
Multi Threshold dual rail Binary NCL                    1768                   357          1.921                     686
Register-Less Multi Threshold single rail Ternary NCL   960                    319          1.8                       574

The area and power savings in the proposed register-less ternary NCL architecture are not only due to the use of the multi-threshold technique, as the register-less ternary circuit still has smaller area and better PDP compared to multi threshold dual rail binary. The main reasons for the area and power savings are, firstly, the elimination of the registers at the input and output and the use of the detector circuits at the input stage to drive the combinational logic gates and finally the use of the proposed ternary NCL gates which can be built from a smaller number of transistors than both the binary and multi-threshold NCL gates.
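As a quick cross-check of the tabulated figures, the power-delay product and the transistor-count savings can be recomputed from the values in Table 4-1. The sketch below assumes that the PDP column is simply power (nW) multiplied by worst-case delay (ns), rounded to the nearest integer; the dictionary keys and function names are illustrative only.

```python
# Values copied from Table 4-1: (transistor count, power in nW, worst-case delay in ns)
TABLE_4_1 = {
    "dual_rail_binary": (2040, 375, 0.6),
    "mt_dual_rail_binary": (1768, 357, 1.921),
    "rl_mt_ternary": (960, 319, 1.8),
}


def pdp(power_nw, delay_ns):
    """Power-delay product; nW x ns corresponds to 1e-18 J."""
    return power_nw * delay_ns


def saving_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


pdps = {name: round(pdp(p, d)) for name, (_, p, d) in TABLE_4_1.items()}
ternary_area = TABLE_4_1["rl_mt_ternary"][0]
area_saving_vs_dual = saving_percent(TABLE_4_1["dual_rail_binary"][0], ternary_area)
area_saving_vs_mt = saving_percent(TABLE_4_1["mt_dual_rail_binary"][0], ternary_area)

print(pdps)
print(round(area_saving_vs_dual), round(area_saving_vs_mt))
```

Under that assumption the recomputed PDP values match the table, and the area savings round to the 53% and 46% quoted in the text.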

4.4 SUMMARY

In this chapter, a novel register-less ternary NCL architecture was presented and analysed, in which hardware modifications were made to remove the NCL registers without degrading the functionality of the NCL pipeline. It was demonstrated that NCL systems can be built using fewer transistors compared to both the conventional dual rail binary NCL architecture and its multi threshold binary implementation. The modified Ternary NCL pipeline still achieves the hysteresis characteristic of the NCL system, guarantees full control of the Data/Null cycle and maintains the NCL input completeness condition. In turn, the outputs of the modified ternary NCL gates transition from NULL to DATA only after all inputs transition from NULL to DATA, and vice versa. The proposed register-less ternary NCL architecture is composed of three main components: the ternary detector, register-less logic and hold gates. An 8-Bit Full adder was designed and implemented to compare the proposed register-less ternary NCL architecture with both complementary Binary NCL and multi-threshold binary NCL. The analysis shows that the proposed approach achieves area savings of around 53% and 46% compared to dual rail binary NCL and the multi-threshold dual rail NCL respectively. The proposed approach also reduces the power consumption by 15% and 7% compared to the dual rail binary NCL and the multi-threshold dual rail NCL respectively. However, the dual rail binary NCL displays better propagation delay performance than the proposed ternary approach, mainly due to the use of the high threshold transistors as a virtual switch. As a result, the power-delay product of the proposed approach is higher than the dual rail binary NCL but lower than the equivalent multi-threshold dual rail NCL circuit.

Chapter 5: SWL FIR Filter Case Study

The purpose of this chapter is to demonstrate how the proposed ternary architectures can be used to design and implement a complex NCL system with suitable performance and area characteristics. In the previous chapters, two different versions of the proposed ternary NCL were discussed: Register-controlled and Register-less Ternary NCL. In this chapter, we demonstrate the functionality and performance of these two versions compared to the conventional binary NCL. To support this analysis, an asynchronous Short Word Length Finite Impulse Response Low Pass Filter (SWL-FIR-LPF) has been designed and implemented. Three filters were designed, analysed and then compared against each other in terms of their power consumption, area and speed. Each of the three filters uses one of the following architectures:

1. Conventional multi-rail binary NCL (B-NCL)

2. Register-Controlled Ternary NCL (RC-TNCL) (as discussed in chapter 3).

3. Register-Less Ternary NCL (RL-TNCL) (as discussed in chapter 4).

This chapter comprises three main sections. In the first section (section 5.1), we briefly introduce the FIR filter and the Sigma Delta Modulation (ΣΔM) technique that forms the basis of the short-word length FIR filter. This technique employs the concept of oversampling to move the noise out of the signal band. The short-word length LPF employs a direct form structure that can be implemented using a Serial-to-Parallel Shift Register followed by a Multiply and Accumulate logic block. In section 5.2, the design methodology used to design and implement the asynchronous SWL-FIR-LPF is discussed in more detail. In this section, the three pipeline structures used to implement the asynchronous SWL-FIR-LPF are discussed: the conventional multi-rail binary NCL, the register-controlled ternary NCL and finally the register-less ternary NCL. In section 5.3, the three different pipelines are compared in terms of their performance, power and speed.

5.1 CASE STUDY BACKGROUND

5.1.1 Digital Filter

The digital filter was selected for this demonstration as it is considered one of the main building blocks in any digital signal processing system. Digital filters can be classified into two main types: FIR (Finite Impulse Response) and IIR (Infinite Impulse Response) [96]. The impulse response of the FIR filter has a finite duration and can be represented in discrete time using (6), where Y(m) is the filter output at discrete time instance m, K_i represents the filter coefficients, x(m-i) is the filter input delayed by i samples, and N is the filter order.

$$Y(m) = \sum_{i=0}^{N} K_i \, x(m-i) \qquad (6)$$

The FIR filter offers some advantages compared to the IIR type. It is typically implemented with non-recursive structures and is therefore inherently stable. Further, many applications such as music and video processing require a linear phase, and the FIR filter can be easily designed to have an exact linear phase [97] by arranging the filter taps to be symmetrical around the centre tap position. However, in comparison to the IIR filter, the FIR requires a higher-order filter in order to obtain the same response. In turn, this higher filter order also implies higher computational complexity and greater memory requirements for storing coefficients. Figure 5-1 shows the canonical structure of a direct form FIR filter. This is more-or-less a direct implementation of (6), so the filter contains many multiplication and addition operations. As shown in Figure 5-1, the filter is mainly composed of a shift register followed by a multiply and accumulate (“MAC”) logic block. MAC represents the operation of multiplying a coefficient by the corresponding delayed data sample and accumulating the result.
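The shift register plus MAC structure of the direct form filter can be illustrated with a short software sketch of (6). The function name `fir_direct_form` and the zero initial state of the delay line are assumptions made for illustration; the hardware described later in this chapter is of course not implemented this way.

```python
def fir_direct_form(coefficients, samples):
    """Direct-form FIR: each output is the sum of coefficient * delayed input.

    The delay line is assumed to start at zero (inputs before time 0 are zero).
    """
    taps = [0] * len(coefficients)   # shift register (delay line)
    outputs = []
    for x in samples:
        taps = [x] + taps[:-1]       # shift the newest sample in
        # multiply-accumulate over all taps
        outputs.append(sum(k * d for k, d in zip(coefficients, taps)))
    return outputs


# Feeding a unit impulse returns the coefficients and then zeros,
# which is the finite-duration impulse response referred to in the text.
impulse_response = fir_direct_form([1, 2, 3], [1, 0, 0, 0])
```

For a filter with many taps, the number of multiplications per output sample grows with the number of coefficients, which is the cost the following paragraphs set out to reduce.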

The frequency response, ripple and band-pass characteristics of a digital filter are directly affected by the order, resolution and sampling rate of the filter. A high-resolution FIR filter requires many taps (delay lines) and more coefficients (multipliers and accumulators) which will be reflected in the size of the registers, buses, adders and multipliers in its hardware implementation. An increasing number of coefficients will impact the design size and its power consumption correspondingly [98].

On the other hand, reducing the number of taps or reducing the input word size will affect the performance of the digital filter. This challenge can be addressed by using a Sigma Delta Modulation system which is based on an oversampling technique rather than using Nyquist rate sampling.

Figure 5-1 Simplified Structure of the Direct Form FIR Filter

5.1.2 Sigma–Delta Modulation

The Sigma–Delta Modulation (ΣΔM) architecture is widely used for converting analog signals to digital and vice versa. The Sigma–Delta ADC employs two main techniques to enhance the resolution and improve the signal-to-noise ratio of the converter while constraining its area, design complexity and total power consumption. These techniques are (1) oversampling and (2) noise shaping [99]. As shown in Figure 5-2, the ΣΔ ADC comprises a ΣΔ modulator followed by a digital low pass filter and then a decimator.

Any analog signal in the frequency domain has a bandwidth that is determined by the highest frequency component of interest in the analog signal (Fm). The Nyquist theorem [100] mandates that the necessary sampling rate is at least twice that maximum frequency, i.e. Fs = 2Fm [101], where Fs is called the Nyquist rate. The oversampling technique is based on the idea of sampling the input analog signal at a rate that is significantly higher than the Nyquist value [102], called the Oversampling Rate (OSR = N*Fs, where N ≥ 4). There are many sources of noise in an ADC process, including thermal and shot noise, variation in voltage supply, variation in voltage reference, phase noise due to sampling clock jitter and noise due to the quantisation error.

Figure 5-2: Sigma-Delta Analog-to-Digital Converter

Figure 5-3: Noise Shaping to enhance the Signal-to-Noise Ratio (a) Sampling at the Nyquist Rate. (b) Oversampling at 16 times the Nyquist Rate. (c) Shifting the noise to a higher frequency outside of the signal band.

Quantisation noise will always exist and impact the resolution and accuracy of any ADC [103] [104]. The oversampling technique can increase the ADC resolution and enhance its signal-to-noise ratio without the need to increase the number of bits, which might lead to a less complicated design, smaller design area and lower power consumption. In a ΣΔ ADC, the oversampling process has the effect of shaping the noise and shifting it out of the signal band. A subsequent low pass filtering step is then needed to suppress the noise, but the requirements of this filter are greatly relaxed. For example, Figure 5-3 illustrates the case for an OSR of 16. Using a sampling rate equal to the Nyquist rate, the quantisation noise power will spread over the Nyquist bandwidth (between DC and the Nyquist frequency) as shown in Figure 5-3 (a). By applying the OSR of 16, the quantisation noise power will spread over a wider bandwidth, in this case between DC and 16 times the Nyquist frequency (Figure 5-3 (b)), resulting in lower quantisation noise power in the band of interest. By applying a low pass filter to the quantised signal, the noise that falls outside the band of interest can be removed (Figure 5-3 (c)), leading to a better overall signal-to-noise ratio.
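The effect of spreading the quantisation noise over a wider bandwidth can be quantified with a short calculation. The sketch below assumes that the quantisation noise is white, i.e. spread uniformly across the sampled bandwidth, and it covers oversampling only; any additional gain from noise shaping is not modelled. The function names are illustrative.

```python
import math


def in_band_noise_fraction(osr):
    """Fraction of the total quantisation noise power left in the signal band,
    assuming the noise is spread uniformly over the oversampled bandwidth."""
    return 1.0 / osr


def snr_gain_db(osr):
    """SNR improvement after ideal low pass filtering, in dB (oversampling only)."""
    return 10.0 * math.log10(osr)


fraction_16 = in_band_noise_fraction(16)   # 1/16 of the noise power stays in band
gain_16 = snr_gain_db(16)                  # roughly 12 dB for an OSR of 16
```

Under this assumption, every fourfold increase in the oversampling ratio improves the in-band signal-to-noise ratio by about 6 dB.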

5.1.3 SWL Finite Impulse Response Low Pass Filter

Short Word Length (e.g. single bit) filters have many advantages compared to their multi bit counterparts. Short word length techniques not only reduce the complexity of the design of the hardware but may also improve its performance [105] [106]. The Short Word Length Digital Finite Impulse Response low pass filter that is discussed in this chapter is composed of two main logic blocks (Figure 5-4). The first is the serial to parallel shift register and the second is the multiply-accumulate (MAC) logic block. Equation (7) restates the mathematical representation of the filter, where X is the digital input signal and Ci are the filter coefficients.

$$Y(m) = \sum_{i=0}^{N} C_i \, X(m-i), \qquad C_i, X \in \{0, 1\} \qquad (7)$$

It can be seen that Figure 5-4 arises directly from (7) and, in this case, both the digital input, X, and the coefficient Ci are single bit. The final result from the adder stages will be multi-bit and this is sent to a subsequent LPF stage for filtering, decimating and re-quantising to single bit for further processing (see Figure 5-2 above).

In summary, a Sigma Delta Modulator can be used to design a digital filter with higher resolution using less circuitry and power [107] compared to Nyquist rate sampling. It is also more reliable than a Nyquist rate counterpart as its bit stream can tolerate higher bit error rate during data communication [108] [109]. Reducing the word length increases the quantisation error which is then compensated by increasing the oversampling rate (OSR) and subsequently filtering and decimating the final result. The overall impact of using a single-bit technique is to reduce the multiplication to a simple AND function and simplify the addition. Thus, although its processing rates are significantly higher, the filter will occupy a smaller area and its power consumption can be lower compared to its Nyquist rate counterparts.

Figure 5-4: Short Word Length Digital Low Pass Filter
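The reduction of the multiplication to an AND function can be illustrated with a minimal sketch of one output sample of the single-bit filter. It assumes that both the shift register contents and the coefficients are coded as 0 or 1; the function name `swl_fir_output` and the example bit patterns are hypothetical.

```python
def swl_fir_output(register_bits, coefficient_bits):
    """One output sample of a single-bit FIR filter.

    Each 1-bit multiplication is a logical AND, and the accumulation is a
    count of the products that are 1, which gives a multi-bit result.
    """
    if len(register_bits) != len(coefficient_bits):
        raise ValueError("one coefficient is needed per shift register stage")
    products = [d & c for d, c in zip(register_bits, coefficient_bits)]
    return sum(products)


# Hypothetical 4-stage example: only stages 0 and 3 have both bits set.
example_output = swl_fir_output([1, 0, 1, 1], [1, 1, 0, 1])
```

Because each product is a single bit, the adder tree only has to count ones, which is why the MAC block can be built from half and full adders alone.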

5.2 METHODOLOGY

Three different pipeline implementations of the SWL-FIR-LPF were designed and compared using the design methodology discussed in this section. The same process was followed to design the three different implementations. The design process has five main steps as follows:

1. MATLAB® was first used to design the Digital Finite Impulse Response Low Pass Filter (SWL-FIR-LPF) itself. In this step, the digital filter coefficients (Ci) and the discrete short word length digital input signal (X) were derived for use as inputs in the subsequent steps.

2. The second step in the process was the actual hardware design of the digital low pass filter at the transistor level using Cadence® Virtuoso. All the necessary logic blocks and gates were designed and implemented at this stage, i.e. Shift Register, Multiplier, Half Adder, Full Adder and the IS_Data Gate. Three different reference libraries were created for the three implementations of the filter (multi-bit binary, register-controlled ternary and register-less ternary).

3. In the third step, three System Verilog descriptions of the filter were built for the binary, register-controlled ternary and register-less ternary NCL cases. The descriptions were then imported into the Cadence tool targeting the corresponding libraries created in the previous step, thereby generating the required three filter designs at the gate level.

4. When the circuits of the binary NCL (B-NCL), register-controlled ternary (RC-TNCL) and register-less ternary (RL-TNCL) were ready, they were verified and analysed by applying the discrete input signal and coefficients that were generated previously using MATLAB. The three designs were analysed in terms of their power consumption, total number of transistors and performance.

5. The final step was to verify the functionality of the hardware designs for the three different filters by checking the generated multi-bit outputs against a MATLAB reference version. The multi bit output signal of this version was compared and validated against the output from the three hardware designs, as will be discussed in more detail in the following sections.

Step – 1: MATLAB Filter Design

As just mentioned, the first step in the design methodology was to use MATLAB to generate the discrete digital input samples and the coefficients that were to be used to test the hardware design of the filter. A 4-tap digital finite impulse response low pass filter (FIR-LPF) was designed with the following characteristics:

• Sampling Frequency = 8000 Hz

• Normalized Passband Frequency = 800 Hz

• Stopband Frequency = 3200 Hz

• Stopband ripple = 50 dB

The filter magnitude response is shown in Figure 5-5 and its impulse response, which includes only four coefficients at sample indices 0, 1, 2 and 3, is shown in Figure 5-6.

Because the ΣΔ ADC uses oversampling [110-112], the MATLAB filter needs to be interpolated by the required OSR to generate the coefficients. In this example, the OSR has been set at 8, and this OSR is used to generate the digital coefficients and the discrete input signal (the input samples) for the LPF. Thus, the filter response in Figure 5-5 was interpolated by the same value (8), resulting in the 32 coefficients (4 taps × 8) shown in Figure 5-7.
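The relationship between the 4-tap prototype and the 32 interpolated coefficients can be illustrated with a small sketch. The interpolation method used in MATLAB is not stated here, so the example below assumes a simple sample-and-hold (each coefficient repeated OSR times) purely to show how the coefficient count arises; the prototype values in `base` are hypothetical and are not the coefficients of the actual filter.

```python
def interpolate_hold(coefficients, osr):
    """Upsample an impulse response by repeating each coefficient `osr` times.

    Sample-and-hold is an assumption made for illustration; other
    interpolation methods give different values but the same length.
    """
    return [c for c in coefficients for _ in range(osr)]


base = [0.1, 0.4, 0.4, 0.1]              # hypothetical 4-tap prototype
interpolated = interpolate_hold(base, 8)  # OSR = 8
coefficient_count = len(interpolated)     # 4 taps x 8 = 32 coefficients
```

Whatever interpolation method is chosen, the length of the delay line in the hardware follows directly from this count.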

Figure 5-5: Finite Impulse Response Low Pass Filter Magnitude Response

Figure 5-6: Finite Impulse Response Low Pass Filter Impulse Response


Figure 5-7: Impulse Response of the FIR LPF, with OSR = 8

Step – 2: NCL Pipeline Hardware Design

The second step of the design methodology is to build the required libraries for both the binary and ternary NCL, including both its register-controlled and register-less versions. As shown above, the binary NCL and register-controlled ternary NCL libraries have the following three main logic blocks:

1. Serial to parallel shift register, which also acts as an input delay insensitive register.

2. Multiply and Accumulate logic block.

3. Output delay insensitive register.

As the register-less ternary NCL library has eliminated the need for the delay insensitive registers at both input and output stages, it is composed of only the first two blocks.

Binary NCL FIR Architecture

A partial view of the Binary NCL FIR architecture is illustrated in Figure 5-8, which is composed of a binary MAC logic block that is sandwiched between two delay insensitive binary NCL registers.

Figure 5-8: Binary NCL FIR Architecture – Partial View

The MAC logic block is composed of dual rail binary NCL Multipliers, Half Adders and Full Adders. The architecture of the input delay insensitive binary register is modified to act as a shift register also. Both the input and output delay insensitive registers are used to control the flow of the data in the pipeline, so the MAC logic unit is just a slave in this case and will process its input data without the ability to control the flow in the pipeline. Hence, the MAC logic block does not require any control signals, where the delay insensitive registers at the input and output generate and consume the required handshaking control signals i.e. pipeReqin, RFI. This architecture requires a minimum of two external control signals.


Figure 5-9: Binary NCL FIR Waveforms Partial View

The waveforms of the partial view of the Binary NCL FIR are shown in Figure 5-9, where (D20 & D21) and (D30 & D31) represent the shifted dual-rail input signal and (C20 & C21) and (C30 & C31) are the dual-rail filter coefficients (Coefficients 2 and 3 respectively). (M20 & M21) and (M30 & M31) are the dual-rail outputs of Multipliers 2 and 3 respectively (Figure 5-8). As demonstrated in Figure 5-9, M2 will be high (M21 = 1 & M20 = 0) only when both D2 and C2 are high, and will be low (M21 = 0 & M20 = 1) only when one or both of them are low; otherwise M2 will hold Null (M21 = 0 & M20 = 0). The outputs of Multipliers 2 and 3 are the inputs of half adder HA1 (Figure 5-8), whose outputs are (HA10-Sum10 & HA11-Sum11) and (HA10-Carry10 & HA11-Carry11), which represent the dual-rail sum and carry of HA1.
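The dual-rail behaviour of the multiplier described above can be captured in a simplified, stateless sketch. Each signal is written as a pair (rail 1, rail 0), matching the (M21, M20) notation of the text. The hysteresis of the underlying threshold gates during the DATA to NULL transition is not modelled, and the names `DATA0`, `DATA1` and `dual_rail_multiply` are illustrative only.

```python
NULL = (0, 0)    # (rail 1, rail 0): no data present
DATA0 = (0, 1)   # logic low
DATA1 = (1, 0)   # logic high


def dual_rail_multiply(d, c):
    """Single-bit dual-rail multiplier (AND), without gate hysteresis."""
    for signal in (d, c):
        if signal not in (NULL, DATA0, DATA1):
            raise ValueError("both rails asserted is not a valid dual-rail code")
    if d == NULL or c == NULL:
        return NULL                    # output stays at Null until both inputs are Data
    if d == DATA1 and c == DATA1:
        return DATA1                   # M21 = 1, M20 = 0
    return DATA0                       # M21 = 0, M20 = 1


m2_high = dual_rail_multiply(DATA1, DATA1)
m2_low = dual_rail_multiply(DATA1, DATA0)
m2_null = dual_rail_multiply(NULL, DATA1)
```

The two rails per bit are what the single-rail ternary encoding in the following sections replaces with one three-level signal.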

Register-Controlled Ternary NCL FIR Architecture

A partial view of the Register-Controlled Ternary NCL FIR architecture is demonstrated in Figure 5-10, which is composed of a register-controlled ternary MAC logic block that is sandwiched between two delay insensitive ternary NCL registers.

Figure 5-10: Register-controlled Ternary NCL FIR Architecture – Partial View

The MAC logic block is made up of register-controlled ternary NCL Multipliers, Half Adders and Full Adders. The architecture of the input delay insensitive ternary register is modified to act as a shift register also. Both the input and output delay insensitive registers act to control the flow of the data in the pipeline, while the MAC logic unit processes the input data without the ability to control its flow in the pipeline. Hence the MAC logic block does not require any handshaking signals, as the delay insensitive registers at the input and output generate and consume the required handshaking control signals, i.e. RFI and RFI+1. As this architecture uses the multi-threshold technique, the Active control signal generated by the input register is required to control when the MAC logic will be in its active mode and when it will be in sleep mode. The Active control signal will be high only when all the outputs of the shift register are data and the RFI is RFD, and low only when all the outputs of the shift register are Null and the RFI is RFN. Thus, when the Active control signal is high, the MAC logic block will be activated and process its input data. On the other hand, when the Active control signal is low, the MAC logic block will be in sleep mode and all its outputs will be connected to Vdd/2 (i.e., Null). Otherwise, the output register will hold its outputs.

Register-Less Ternary NCL FIR Architecture

A partial view of the Register-Less Ternary NCL FIR architecture is demonstrated in Figure 5-11. In the same way as the previous two cases, the structure comprises an input shift register, and a register-less ternary MAC logic block formed from multipliers plus half and full adders. The input register acts as a shift register that can process the input data and control its flow through the pipeline. Hence, it could be considered as another register-less ternary logic block in the pipeline which can process the data and control its flow at the same time. Thus there is no need for either input or output control registers. The MAC logic unit can process the input data and control its flow in the pipeline.

As discussed in Chapter 4, the register-less architecture employs a completion detection circuit at the input of each logic block. The completion detection is composed of IS_Data ternary logic blocks (one logic block for each input) followed by a Hold gate (see Figure 4-5). As shown in Figure 5-11, Completion_Detection-1 will generate RFD only if the input of the shift register is data and the MAC logic block requires data and will generate RFN only if the input of the shift register is null and the MAC logic block requires null. Otherwise, the output register will hold its output value whether it is data or null. At the same time, Completion_Detection-2 will generate RFD only if all the inputs of the MAC logic block are data and RFI+2 (which is coming from the following logic block) is RFD and will generate RFN only if all the inputs of the MAC logic block are null and RFI+2 is RFN. Otherwise, the MAC logic block will hold its outputs whether they are data or null with no need for additional output registers to hold the outputs.

If RFI+1 is RFD, MAC will be active and will process the input data and if RFI+1 is RFN, the MAC will be in sleep mode and its outputs will be connected directly to Vdd/2 which is its Null value and there is no need for the Null to propagate through the pipeline. This reduces the null cycle propagation and enhances the performance of the pipeline. If RFI+2 is RFN/RFD and all the inputs of the MAC logic block are data/Null respectively, the MAC logic block will hold its current output. The completion detection is placed at the input of each logic block to achieve early completion detection which reduces the propagation delay of the pipeline.
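The request generation rules for the completion detection blocks can be summarised in a short behavioural sketch. The function names, the string values "RFD" and "RFN", and the use of `None` for the ternary NULL level are assumptions made for illustration; the sketch models only the logical conditions given above, not the IS_Data and Hold gate circuits.

```python
NULL = None  # ternary NULL; data values are 0 and 1


def completion_detection(inputs, request_next, previous):
    """Return "RFD", "RFN", or the previous request (hold).

    inputs       -- ternary values at the inputs of the logic block
    request_next -- request arriving from the following block ("RFD" or "RFN")
    previous     -- request currently driven by this detector
    """
    if request_next == "RFD" and all(x is not NULL for x in inputs):
        return "RFD"
    if request_next == "RFN" and all(x is NULL for x in inputs):
        return "RFN"
    return previous


def mac_mode(request):
    """Operating mode of the MAC block for the request that drives it."""
    return "active" if request == "RFD" else "sleep"


rfi1 = "RFN"                                                   # MAC initially asleep
rfi1 = completion_detection([1, 0, NULL], "RFD", rfi1)         # incomplete Data: hold
held_mode = mac_mode(rfi1)
rfi1 = completion_detection([1, 0, 1], "RFD", rfi1)            # complete Data: RFD
active_mode = mac_mode(rfi1)
rfi1 = completion_detection([NULL, NULL, NULL], "RFD", rfi1)   # Null not yet requested: hold
rfi1 = completion_detection([NULL, NULL, NULL], "RFN", rfi1)   # all Null and RFN: sleep
final_mode = mac_mode(rfi1)
```

In sleep mode the outputs of the block are tied to Vdd/2, so a Null wavefront does not have to ripple through the logic.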

Figure 5-11: Register-Less Ternary NCL FIR Architecture – Partial View

The waveforms of the partial view of the Register-Less Ternary NCL FIR are shown in Figure 5-12, where D2 and D3 represent the shifted input and C2 and C3 are the filter coefficients (Coefficients 2 and 3 respectively). M2 and M3 are the outputs of Multipliers 2 and 3 respectively (Figure 5-11). As demonstrated in Figure 5-12, M2 will be high (0.6 V) only when both D2 and C2 are high, and will be low (0 V) only when one or both of them are low; otherwise M2 will hold Null (0.3 V). The outputs of Multipliers 2 and 3 are the inputs of half adder HA1 (Figure 5-11), whose outputs HA1-Sum1 and HA1-Carry1 represent the sum and carry of HA1.

Figure 5-12: Register-Less Ternary NCL FIR Waveforms Partial View

Serial to Parallel Shift Register

The input stage of the digital finite impulse response low pass filter (FIR-LPF) is the shift register (SR), which is common across the three different architectures. However, it has different functions according to each architecture. For instance, in the register-less architecture, the shift register has two functions only: it acts as a delay unit, shifting the input signal from left to right, and it delivers the shifted bits in parallel to the multiply-accumulate unit. In this case, the shift register can be considered as a separate logic block in the pipeline. This separate logic block can process input data (shift input data) and control the flow of the data to the next logic block in the pipeline (i.e., the MAC logic unit). Hence, in the case of other (non-FIR) applications of the MAC, where data shifting is not required at the input stage, the shift register can be eliminated. In this case, the MAC logic unit will be able to process the input data and control the flow of the data in the pipeline with no need for additional registers.

[Figure 5-13 blocks, input to output: Ternary Input DI Register (Shift Register); Register-Controlled MAC Logic Block; Ternary Output DI Register]

Figure 5-13: Register-Controlled Ternary Pipeline

[Figure 5-14 blocks, input to output: Binary Input DI Register (Shift Register); Binary MAC Logic Block; Binary Output DI Register]

Figure 5-14: Binary Pipeline

On the other hand, in the binary and register-controlled NCL architectures, the shift register has three functions: it acts as a delay unit, shifting the input signal from left to right; it delivers the shifted bits in parallel to the multiply-accumulate unit; and it acts as a delay insensitive input register. In these two cases, the shift register can be considered an input delay insensitive register with a built-in serial-to-parallel shift function. As shown in Figure 5-13 and Figure 5-14, the MAC logic unit is sandwiched between two delay insensitive registers that control the flow of data in the pipeline, and the input delay insensitive register is modified to act as a shift register as well. Hence, if data shifting is not required at the input stage, the shift register has to be replaced by another delay insensitive register similar to the one at the output stage, because the MAC logic unit cannot control the flow of data in the pipeline and requires delay insensitive control registers at its input and output stages.

Ternary NCL Shift Register Implementation

The filter has 32 coefficients, so the delay line must be able to shift 32 bits. The 32-bit ternary serial-to-parallel shift register (T-SP-SR) architecture comprises a linear array of 1-bit ternary registers (Figure 5-15), built from ternary detectors and transmission gates plus IS_Data logic gates. The Request for Input controls the transmission gates and determines whether data will be transferred forward or will be held at its current value. The IS_Data logic gate between two adjacent delay lines controls the preceding delay line (i.e., the previous 1-bit shift register).

[Figure 5-15 labels: Ternary Input, RFI, Vcc and Out; blocks: Ternary Detectors and TX Gates]

Figure 5-15: Architecture of 1-Bit Ternary Serial to Parallel Shift Register

The T-SP-SR implementations for register-controlled and register-less ternary NCL differ in two main respects. The first is the source of the Request for Input signal for the last 1-bit register. In the register-controlled ternary system, this control signal is generated by the following output register (shown previously in Figure 5-10), which acts as an output delay insensitive register. In the register-less ternary system there is no output register, so the signal is generated by the following logic block (Figure 5-11). The second difference is that the Active signal is replaced in the register-less architecture by the RFI+1 control signal, which determines whether the register-less MAC is in active or sleep mode.

The flow of the Request for Input signal guarantees that the serial-to-parallel shift register will not shift and process any data unless the following stage is requesting data. In the register-less version, the need for an output register stage is eliminated. This is unlike the multi rail binary and register-controlled ternary NCL implementations, which still require an output control register to maintain the flow of data in the pipeline. To demonstrate the functionality of the serial-to-parallel shift register, the first two parallel output waveforms of the T-SP-SR are shown in Figure 5-16; they are the same for the register-controlled and register-less T-SP-SR implementations.

Figure 5-16: Ternary Serial-to-Parallel Shift Register Waveform
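The request-gated shifting described above can be illustrated with a short behavioural model. This is a sketch under stated assumptions, not the thesis's circuit: the class name `SerialToParallelShiftRegister` and its `step` method are hypothetical, the Null wavefront between successive Data values is abstracted away, and a stage that has not yet received data is simply marked "N".

```python
class SerialToParallelShiftRegister:
    """Behavioural model: shifts one position per request, otherwise holds."""

    def __init__(self, length=32, fill="N"):
        # Index 0 is the newest sample; "N" marks a stage still holding Null.
        self.taps = [fill] * length

    def step(self, serial_in, rfi):
        """Shift in `serial_in` only when `rfi` (Request for Input) is asserted.

        Returns a copy of the parallel outputs after the step.
        """
        if rfi:
            self.taps = [serial_in] + self.taps[:-1]
        return list(self.taps)


if __name__ == "__main__":
    sr = SerialToParallelShiftRegister(length=4)
    print(sr.step(1, rfi=True))    # data accepted and shifted in
    print(sr.step(0, rfi=False))   # no request: contents are held
    print(sr.step(0, rfi=True))    # next request: contents shift again
```

The default length of 32 matches the 32 coefficients of the filter; a shorter register is used in the demonstration only to keep the printout readable.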

Binary NCL Shift Register Implementation

In the binary NCL shift register implementation, register handshaking is used to control the data flow as the register shifts the input data and converts it from serial to parallel. As in the ternary case, the delay line must be able to shift 32 bits and hence requires a 32-bit serial-to-parallel shift register (32 delay lines). Each delay line is made up of two registers holding Data and Null, as shown in Figure 5-17, with its waveforms shown in Figure 5-18. The binary NCL register requires an additional control signal, “pipeReq”, as shown in Figure 5-8 [113] [114] [115].

The 32-bit binary register is built from 32 delay lines (1-bit registers) and 32 TH12 logic gates. The Request for Input (RFI) signal controls the flow of data between the 1-bit binary registers. Each of the first 31 TH12 logic gates is placed between two delay lines to control the previous delay line. Their inputs are the Request for Input coming from the following 1-bit register and the Request for Input from the following binary DI output register. The final TH12 logic gate (in the final 1-bit shift register) is controlled by an additional pipeReq signal as well as the Request for Input signal from the output binary DI register.

[Figure 5-17 labels: Null Register and Data Register; inputs Data 0, Data 1 and RFI; gates TH22N, TH22D and TH12]

Figure 5-17: Architecture of 1-Bit Binary Serial-to-Parallel Shift Register

Figure 5-18: Binary Serial-to-Parallel Shift Register Waveform (traces D1 and D0)

Multiply—Accumulate

As shown in Figure 5-19, the multiply-accumulate block comprises two sub-blocks: multiply and summation (accumulate). In this single bit filter, the multiplication reduces to a simple AND function combining the data bits (D0, D1, D2, …) with the coefficient bits (C0, C1, C2, …). In the summation stage, the outputs of the multiplication are added together in a tree structure composed of stages of half and full adders, which are discussed in the following subsections.

[Figure 5-19 labels: Multiply Stage with inputs D0 to D31 and C0 to C31; Accumulate Stage producing Output]

Figure 5-19: Multiply and Accumulate
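Functionally, the single bit multiply-accumulate described above is a bitwise AND followed by a count of the ones. The following sketch shows that arithmetic only; the function name `mac_1bit` is hypothetical and the model ignores Null states, handshaking and the adder-tree structure.

```python
def mac_1bit(data_bits, coeff_bits):
    """Multiply-accumulate for a single bit filter.

    Each product is the AND of a data bit and a coefficient bit;
    the result is the sum of all products.
    """
    if len(data_bits) != len(coeff_bits):
        raise ValueError("data and coefficient vectors must have equal length")
    products = [d & c for d, c in zip(data_bits, coeff_bits)]
    return sum(products)


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    coeffs = [1, 1, 0, 1]
    print(mac_1bit(data, coeffs))
    # With 32 taps the largest possible sum is 32, which needs 6 bits.
    print(mac_1bit([1] * 32, [1] * 32).bit_length())
```

The 6-bit width of the largest 32-tap result is consistent with the 6-bit output register mentioned later in the chapter.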

Ternary NCL MAC Implementation

The architecture of the register-controlled ternary MAC differs from that of the register-less ternary version. The register-controlled MAC gates are designed to process the input data only, while the delay insensitive ternary registers control the flow of data in the pipeline. The register-less MAC gates, on the other hand, are designed both to process the input data and to control the data flow, using the same Request for Input signal that, as discussed in chapter 4, eliminates the need for additional registers to hold the data. The two ternary implementations of the multiply-accumulate gates, based on the multi threshold technique described in the previous chapters, are shown in Figure 3-15.

Register-less Ternary MAC

In the register-less ternary AND gate shown previously in Figure 4-4, the Active signal is replaced by the Request for Input signal. One input is the ternary output signal of the corresponding delay line (D0, D1, etc.) and the second input is the coefficient (C0, C1, C2, etc.). The register-less Half Adder transistor design (Figure 5-20) is made up of a ternary multi threshold register-less AND gate that generates the Carry and a register-less ternary XOR gate that generates the Sum. Both gates are controlled by the Request for Input signal. The register-less Full Adder is made up of two register-less ternary half adder logic gates followed by a register-less ternary OR logic gate (Figure 5-21); its transistor level design is shown in Figure 4-6. The register-less logic gates are controlled by the Request for Input signal and do not require delay insensitive registers to control the flow of data, as they are designed to process the input data and also control its flow.

[Figure 5-20 labels: inputs A, B and RFI; outputs Sum and Carry; supply rails VDD, VDD/2 and GND; transistor widths Wp = 90 to 400 and Wn = 90 to 270]

Figure 5-20: Transistor Level Design of Register-Less Multi Threshold Ternary NCL One Bit Half Adder (dimensions in nm)

[Figure 5-21 labels: inputs A, B, Carry_IN and RFI; two Ternary Half Adder blocks and a T-OR gate; outputs SUM and Carry_Out]

Figure 5-21: Ternary Full Adder Logic Gate

Register-Controlled Ternary MAC

The register-controlled logic gates still require registers to control the data flow. The multiplication is implemented using the register-controlled AND gate shown in Figure 5-22. The Active signal is high when the following gates in the pipeline are requesting data and the inputs from the corresponding delay line are Data. One input is the ternary output signal of the corresponding delay line (D0, D1, etc.) and the second input is the coefficient (C0, C1, C2, etc.). The register-controlled Half Adder transistor design (Figure 5-23) is made up of a ternary multi threshold register-controlled AND gate that generates the Carry and a register-controlled ternary XOR gate that generates the Sum. Both gates are controlled by the Active signal. The register-controlled Full Adder is made up of two register-controlled ternary half adder logic gates followed by a register-controlled ternary OR logic gate, arranged as shown in Figure 5-21; its transistor design is shown in Figure 5-24. The register-controlled logic gates are in active mode and process the input data when the Active signal is high. Conversely, when the control signal is low, the logic gates are in sleep mode and the output is held by the output delay insensitive register. In this architecture, the register-controlled ternary logic gates process the data only: the Active control signal determines whether a gate is in active or sleep mode but does not control the flow of data in the pipeline. Hence, delay insensitive registers are required to control the flow of data in the pipeline.

Although the size of the register-controlled Half Adder is smaller than its register-less counterpart (24 transistors compared to 30 transistors respectively), the overall design area of the register-less ternary design is smaller as the latter approach eliminates the need for the delay insensitive registers, as will be discussed in the following sections.

[Figure 5-22 labels: inputs A, B and Active; Output; supply rails VDD, VDD/2 and GND; transistor widths Wp = 90 to 400 and Wn = 90 to 200]

Figure 5-22: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL AND Logic Gate (dimensions in nm)

[Figure 5-23 labels: inputs A, B and Active; outputs Sum and Carry; supply rails VDD, VDD/2 and GND; transistor widths Wp = 90 to 400 and Wn = 90 to 200]

Figure 5-23: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL One Bit Half Adder (dimensions in nm)

[Figure 5-24 labels: inputs A, B, Cin and Active; outputs Sum and Cout; rails Vdd, Vdd/2 and G]

Figure 5-24: Schematic Diagram of Register-Controlled Multi Threshold Ternary NCL One Bit Full Adder

Binary NCL MAC Implementation

In the same way as the ternary case, the binary multiply-accumulate block is formed from AND, Half Adder and Full Adder gates, but here their implementation is based on the 27 NCL gates listed in Table 1-3 in the introduction. The multiplication function is implemented using TH22 and THand0 gates as shown in Figure 5-25. The TH22 logic gate produces the Z-1 output and the THand0 gate generates Z-0. A-0 and A-1 are the dual rail inputs that represent the output of the delay lines (D0, D1, D2, etc.), and B-0 and B-1 represent the dual rail low pass filter coefficient (C0, C1, C2, etc.).

The binary NCL Half Adder is built from four NCL threshold gates of three types (Figure 5-26): a TH22, a THand0 and two THxor0 gates. The TH22 and THand0 gates generate the dual rail carry output and the two THxor0 gates generate the dual rail sum output.

[Figure 5-25 labels: inputs A_1, B_1, A_0 and B_0; gates th22 and thand0; outputs Z1 and Z0]

Figure 5-25: Binary Null Convention Logic Multiplier Logic Gate

[Figure 5-26 labels: inputs A_1, B_1, A_0 and B_0; gates th22, thand0 and two thxor0; outputs Cout_1, Cout_0, Sum_1 and Sum_0]

Figure 5-26: Binary Null Convention Logic Half Adder Architecture
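The dual rail multiplier and half adder described above can be checked with a small logical model. The sketch below is illustrative only: it implements just the set conditions of the threshold gates (TH22 as AB, THand0 as AB + BC + AD, THxor0 as AB + CD) and omits their hysteresis, and the input ordering passed to each gate, the (rail 1, rail 0) tuple convention and all function names are assumptions rather than details taken from the thesis's code.

```python
NULL = (0, 0)  # both rails de-asserted


def encode(bit):
    """Dual rail encoding as (rail_1, rail_0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)


def th22(a, b):
    return int(a and b)


def thand0(a, b, c, d):
    # Set condition of THand0: AB + BC + AD
    return int((a and b) or (b and c) or (a and d))


def thxor0(a, b, c, d):
    # Set condition of THxor0: AB + CD
    return int((a and b) or (c and d))


def ncl_and(a, b):
    """Dual rail AND (the single bit multiplier)."""
    a1, a0 = a
    b1, b0 = b
    z1 = th22(a1, b1)               # asserted only when A = 1 and B = 1
    z0 = thand0(b0, a0, b1, a1)     # asserted when A = 0 or B = 0
    return (z1, z0)


def ncl_half_adder(a, b):
    """Return (sum, carry), each as a dual rail pair."""
    a1, a0 = a
    b1, b0 = b
    carry = ncl_and(a, b)
    sum1 = thxor0(a1, b0, a0, b1)   # inputs differ
    sum0 = thxor0(a0, b0, a1, b1)   # inputs equal
    return (sum1, sum0), carry


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, ncl_and(encode(x), encode(y)),
                  ncl_half_adder(encode(x), encode(y)))
```

With both inputs at Null every gate output stays de-asserted, so the outputs are Null as well, which is the behaviour expected of an NCL combinational block.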

As shown in Figure 5-27, the full adder circuit is formed from two TH23 and two TH34w2 gates. The TH23 gates generate the dual rail carry output and the TH34w2 gates produce the dual rail sum output. In contrast to the ternary case, the binary implementation of the MAC stage does not require a Request for Input control signal as the binary MAC gates are situated between two registers that control the data flow between the input and output stages, as will be discussed in the following subsections.

In summary, at this stage both the ternary and binary reference libraries have been created and are ready to be used to build the FIR low-pass filter. Each library comprises a serial-to-parallel shift register, multiplier logic, and half and full adder circuits.

[Figure 5-27 labels: inputs A_0, A_1, B_0, B_1, Cin_0 and Cin_1; two th23 and two th34w2 gates; outputs Cout_0, Cout_1, Sum_0 and Sum_1]

Figure 5-27: Binary Null Convention Logic Full Adder Architecture
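The binary full adder built from two TH23 and two TH34w2 gates can be modelled in the same way. This is a sketch under assumptions: only the set conditions of the gates are modelled (no hysteresis), the wiring shown (each sum rail taking the opposite carry rail as its weight-2 input) is the commonly used arrangement for this gate combination and is not copied from Figure 5-27, and the function names are hypothetical.

```python
NULL = (0, 0)  # both rails de-asserted


def encode(bit):
    """Dual rail encoding as (rail_1, rail_0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)


def th23(a, b, c):
    # Asserted when at least 2 of the 3 inputs are asserted.
    return int(a + b + c >= 2)


def th34w2(a, b, c, d):
    # Threshold 3 with 4 inputs; the first input carries weight 2.
    return int(2 * a + b + c + d >= 3)


def ncl_full_adder(a, b, cin):
    """Return (sum, carry_out), each as a dual rail pair."""
    a1, a0 = a
    b1, b0 = b
    c1, c0 = cin
    cout1 = th23(a1, b1, c1)
    cout0 = th23(a0, b0, c0)
    # Assumed wiring: each sum rail uses the opposite carry rail (weight 2).
    sum1 = th34w2(cout0, a1, b1, c1)
    sum0 = th34w2(cout1, a0, b0, c0)
    return (sum1, sum0), (cout1, cout0)


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            for z in (0, 1):
                print(x, y, z, ncl_full_adder(encode(x), encode(y), encode(z)))
```

Checking all eight Data input combinations against ordinary binary addition is a quick way to confirm that a chosen wiring is consistent.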

Step – 3: System Verilog Code Design

In this stage in the flow, the System-Verilog code for both the ternary and binary implementations of the FIR-LPF is generated using Quartus-II®.

Ternary Null Convention Logic System Verilog Code

The full System-Verilog code for the register-controlled and register-less ternary cases is listed in Appendix-A and Appendix-B, respectively. The register-controlled code comprises the same three major blocks as its binary counterpart, as shown in Figure 5-10: the 32-bit ternary NCL shift register, followed by the register-controlled multiply-accumulate stage and then the output 6-bit ternary control register. The register-less code, on the other hand, is composed of two main blocks: the 32-bit ternary shift register followed by the register-less ternary multiply-accumulate stage (Figure 5-11). In both versions, the ternary MAC is made up of six stages. The first stage comprises the 32 multiplier gates and the second comprises 16 ternary half adder gates. The third stage has 8 ternary NCL half adder gates and 8 ternary full adders. The fourth stage has four blocks, each composed of one ternary half adder and two full adders. Stage five has two blocks, each made up of one half adder followed by three full adders. Finally, the last stage has one half adder followed by four full adder gates. In the register-controlled code, the output of the last stage needs an additional output register to control the flow of data in the pipeline. In the register-less code, the output of the last stage needs no additional register.
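The stage-by-stage adder counts listed above follow from adding the products pairwise, with each pair of w-bit words combined by one half adder and w - 1 full adders. The sketch below reproduces those counts; the function name `adder_tree_stages` is hypothetical, and it assumes a ripple-carry arrangement within each block and a power-of-two number of products.

```python
def adder_tree_stages(num_products=32):
    """Return (stage, blocks, half_adders, full_adders) for each adder stage.

    Stage 1 is the multiplier stage, so the adder stages start at 2.
    Assumes num_products is a power of two.
    """
    stages = []
    operands, width, stage = num_products, 1, 2
    while operands > 1:
        blocks = operands // 2
        # Adding two `width`-bit words: 1 half adder + (width - 1) full adders.
        stages.append((stage, blocks, blocks, blocks * (width - 1)))
        operands, width, stage = blocks, width + 1, stage + 1
    return stages


if __name__ == "__main__":
    for stage, blocks, has, fas in adder_tree_stages(32):
        print(f"stage {stage}: {blocks} blocks, {has} half adders, {fas} full adders")
```

For 32 products this gives five adder stages after the multiplier stage, ending with a single block of one half adder and four full adders that produces the 6-bit result.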

Binary Null Convention Logic System Verilog Code

The code for the binary version is listed in Appendix-C and comprises three main blocks: the 32-bit binary NCL shift register, followed by the multiply-accumulate stage and then the output 6-bit binary control register, as shown in Figure 5-16. The input register acts as a serial-to-parallel shift register and a control register at the same time. The output register controls the data flow: its built-in completion detection circuit generates the control signal “RFO”, as shown in Figure 5-17.

The binary MAC unit comprises the same six stages as its ternary counterpart. Because this architecture is not register-less, the output of the final stage needs an additional register, as in the register-controlled ternary case. The completion detection gate is part of the output register and is used to control the data flow from the previous shift register stage. The need for an output control register increases the required number of control signals by two, i.e., the “RFI” and “pipeReq” signals shown in Figure 5-8.

Step – 4: FIR Filter Hardware Design

At this stage, the three libraries are ready to be employed to build the respective low-pass filter designs, i.e., the register-controlled ternary, register-less ternary and binary filters. These LPF System-Verilog implementations were imported into the Cadence Design Tool targeting their respective reference libraries. The Cadence System Verilog for these designs can be found in Appendix-A, B and C respectively, while the block diagrams of the register-controlled ternary, register-less ternary and binary filters are shown in Figure 5-10, Figure 5-11 and Figure 5-8 respectively.

Table 5-1: Quantised Signal Mapped to Binary and Ternary NCL

# of Samples   Quantised digital binary Sine Wave   Dual Rail Binary (Data, Null)   Single Rail Ternary (Data, Null)
0              1                                    10, 00                          1, N
1              1                                    10, 00                          1, N
2              0                                    01, 00                          0, N
3              0                                    01, 00                          0, N
4              0                                    01, 00                          0, N
5              1                                    10, 00                          1, N
6              1                                    10, 00                          1, N
7              0                                    01, 00                          0, N
8              0                                    01, 00                          0, N
9              1                                    10, 00                          1, N
10             0                                    01, 00                          0, N
11             1                                    10, 00                          1, N
12             0                                    01, 00                          0, N
13             0                                    01, 00                          0, N
14             1                                    10, 00                          1, N
15             0                                    01, 00                          0, N
16             0                                    01, 00                          0, N
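The mapping in Table 5-1 can be expressed as a short encoder in which every Data value is followed by a Null spacer. This is a sketch only: the function names are hypothetical, and reading the two-character dual rail codes as rail 1 followed by rail 0 is an assumption about the table's column order.

```python
# Quantised single bit samples of the test sine wave, as listed in Table 5-1.
SAMPLES = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]


def to_dual_rail(bits):
    """Encode bits as dual rail codes, inserting a Null (00) after each Data."""
    sequence = []
    for bit in bits:
        sequence.append("10" if bit else "01")  # Data wavefront
        sequence.append("00")                   # Null spacer
    return sequence


def to_ternary(bits):
    """Encode bits as single rail ternary symbols, with Null (N) after each Data."""
    sequence = []
    for bit in bits:
        sequence.append(str(bit))  # Data: 0 or 1 on a single wire
        sequence.append("N")       # Null spacer
    return sequence


if __name__ == "__main__":
    print(to_dual_rail(SAMPLES[:3]))
    print(to_ternary(SAMPLES[:3]))
```

Both encodings carry the same Data/Null sequence; the ternary form uses one wire per bit where the dual rail form uses two.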

Step – 5: Hardware Design Verification using MATLAB Code

To verify the transistor level design of both the ternary and binary implementations of the filter, the digital low pass filter (LPF) was designed using MATLAB and its output compared against that of the transistor level implementations for both the ternary and binary filters. The output of the ternary filter is a 6-bit value in ternary format, which was converted to binary for comparison against the MATLAB output. Similarly, the output of the binary NCL filter, a 6-bit value in dual rail format, was converted into conventional binary format for comparison against the MATLAB output. The MATLAB code for the 4-tap digital low pass filter with an oversampling ratio of 8 is shown in Appendix-D.
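The conversion of the 6-bit filter outputs into ordinary integers for comparison could look like the following. This is a hedged sketch, not the procedure used in the thesis: the function names are hypothetical, the most-significant-bit-first ordering is an assumption, and any word that still contains a Null (or an illegal dual rail pair) is rejected rather than decoded.

```python
def ternary_word_to_int(word):
    """Decode a ternary Data word (digits 0/1, MSB first) to an integer."""
    value = 0
    for digit in word:
        if digit not in (0, 1):
            raise ValueError("word contains Null; output is not complete Data")
        value = (value << 1) | digit
    return value


def dual_rail_word_to_int(word):
    """Decode a dual rail word given as (rail_1, rail_0) pairs, MSB first."""
    value = 0
    for rails in word:
        if rails == (1, 0):
            bit = 1
        elif rails == (0, 1):
            bit = 0
        else:
            raise ValueError("Null or illegal dual rail pair")
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    print(ternary_word_to_int([0, 1, 0, 0, 1, 1]))
    print(dual_rail_word_to_int([(0, 1), (1, 0), (0, 1), (0, 1), (1, 0), (1, 0)]))
```

Once both hardware outputs are decoded in this way they can be compared sample by sample with the reference model's output.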

To reduce the time required for the simulation, a 4-tap filter with an OSR of 4 was designed and simulated in both MATLAB and Cadence. The test input signal was an 8 kHz sinusoidal wave that was quantised as described in the previous steps and represented by 17 samples, as shown in Table 5-1. The quantised binary signal was converted into dual rail binary and single rail ternary (Table 5-1).

The quantised signal was then used as a test input vector for both the MATLAB and Cadence filter designs. The output of the transistor level design for one period of the sine wave is shown in Figure 5-28, while the output of the MATLAB version is shown in Figure 5-29. Comparison of the two outputs shows that the hardware design functions as expected.

Figure 5-28: Output from Cadence Hardware Ternary LPF (traces C4 to C0; x axis: number of samples for one period sinewave, 0 to 17; y axis: magnitude, 0 to 1)

Figure 5-29: Output from Ternary LPF MATLAB Code (traces M4 to M0; x axis: number of samples for one period sinewave, 0 to 17; y axis: magnitude, 0 to 1)

5.3 SIMULATION RESULTS AND DISCUSSION

In this section, the ternary and binary null convention logic implementations of the digital finite impulse response low pass filter (FIR-LPF) are analysed and discussed in terms of their power consumption, total design area and performance. The three architectures were implemented using a 45 nm bulk CMOS technology and tested at a supply of 0.6 V. The binary NCL system has a full-scale swing between 0 V and 0.6 V, whereas both ternary systems swing by half of that, as the Null value is represented by 0.3 V.

5.3.1 Serial to Parallel Shift Register Comparison

As mentioned previously in Step-2 of the design methodology, the design of the shift register is more or less the same for the register-less and register-controlled ternary architectures, as both need a register at the input stage to act as a serial to parallel shift register. In both ternary architectures, the shift register is designed to process the input data and also to control the flow of data. However, the register-controlled architecture requires an output delay insensitive register to generate the handshaking signals that control the flow of data in the pipeline. This is not required in the register-less architecture, as the handshaking signal is generated by the following register-less logic block.

As discussed previously, in the register-less architecture the shift register can be considered as a separate logic block in the pipeline that can process the input data and control the flow of the data to the following register-less logic block in the pipeline. On the other hand, in the register-controlled ternary architecture, the register is designed as an input delay insensitive register with built-in shift register. Thus, it cannot be considered as a separate logic block in the pipeline as it requires an output delay insensitive register to control the flow of data.

The 32-bit ternary and binary NCL shift registers are compared in Table 5-2, which shows that the total power consumption (static plus dynamic) of the ternary implementation of the 32-bit serial-to-parallel shift register is ten times that of the Binary Null Convention Logic (BNCL) implementation (400 nW versus 40 nW). This is mainly due to the complicated design of the 1-bit Ternary Null Convention Logic Serial-to-Parallel Shift Register (TNCL-SP-SR) shown in Figure 5-15, which requires a much larger number of transistors: the ternary 32-bit register uses 2240 transistors, approximately 84% more than the 1216 of the BNCL implementation. The BNCL implementation is therefore the better technique for designing and implementing a low power shift register based on the null convention logic concept.

Table 5-2: Ternary versus Binary Null Convention Logic Implementation of the 32-Bit Serial to Parallel Shift Register

                                  Ternary NCL Implementation   Binary NCL Implementation
Power Consumption (nW)            400                          40
Total Number of Transistors       2240                         1216
Speed (nS)                        140                          200
Power Delay Product (x 10^-21)    56                           8

On the other hand, the Ternary Null Convention Logic (TNCL) implementation of the 32-bit serial-to-parallel shift register performs better in terms of speed than the Binary Null Convention Logic (BNCL) implementation. It is 30% faster, which makes the TNCL implementation suitable for high speed shift registers at the expense of power consumption. Nonetheless, the power delay product of the binary 32-bit serial-to-parallel shift register is still much lower than that of the ternary one.

5.3.2 Multiply-Accumulate Comparison

This section compares the designs of the multiply-accumulate stage using the binary technique and both versions of the ternary technique. Although the individual register-less ternary logic gates are larger than their register-controlled counterparts, as shown in the previous sections, the total design area of the register-less MAC is smaller than that of the register-controlled one because the output delay insensitive register is no longer needed. As Table 5-3 shows, the register-less MAC design reduces the total area by almost 15% and the power consumption by almost 17%, and improves the speed by almost 26%, compared to the register-controlled version.

Table 5-3: Ternary versus Binary Null Convention Logic Implementation of the Multiply and Accumulate Stage

                                  Register-Controlled Ternary NCL     Register-Less Ternary   Binary NCL Implementation
                                  (including output register)         NCL Implementation      (including output register)
Power Consumption (nW)            72                                  60                      24
Total Number of Transistors       2790                                2385                    5226
Speed (nS)                        230                                 170                     250
Power Delay Product (x 10^-21)    16.56                               10.2                    6

However, the total power consumption of both ternary implementations of the Multiply and Accumulate stage is still higher than that of the Binary Null Convention Logic (BNCL) implementation. The main reason for the power saving of the BNCL technique is its reduced signal switching probability. On the other hand, the number of transistors required to implement the Multiply and Accumulate stage using either of the Ternary Null Convention Logic (TNCL) techniques is much lower than for the BNCL technique: the register-controlled and register-less ternary designs reduce the design area by almost 47% and 54% respectively and improve the speed by almost 8% and 32% respectively. This makes them the better choice for applications that require low design area and/or high speed with no power consumption restrictions, whereas the binary implementation is more appropriate for implementing a low power Multiply and Accumulate stage using NCL.

5.3.3 Digital Low Pass Filter Comparison

As demonstrated in Table 5-4, both of the 4-Tap Digital Low Pass Filter (LPF) ternary implementations exhibit higher power consumption than the binary case. It has to be remembered that in the binary implementation, although each signal is represented by two wires, only one wire and its associated logic path is active at a time. Thus at a given time, almost half the gates will be in their idle state (i.e., not switching), reducing the global switching probability and therefore lowering the power consumption.

Table 5-4: Ternary versus Binary Null Convention Logic Implementation of the Digital Low Pass Filter

                                  Register-Controlled Ternary   Register-Less Ternary   Binary Null Convention
                                  NCL Implementation            NCL Implementation      Logic Implementation
Power Consumption (nW)            472                           460                     64
Total Number of Transistors       5030                          4625                    6442
Speed (nS)                        370                           310                     450
Power Delay Product (x 10^-21)    42.5                          29.5                    11.02

On the other hand, the register-controlled and register-less ternary approaches are both smaller than the binary implementation, by almost 22% and 28% respectively. Further, the register-controlled and register-less ternary filters are about 18% and 31% respectively faster than the binary implementation. However, the lower power of the binary circuit still dominates the power-delay product (energy) figure, giving the binary case superior PDP to both ternary circuits.
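The area and speed percentages quoted above can be reproduced from the tabulated figures. The sketch below adds the shift register figures (Table 5-2) to the MAC figures (Table 5-3) and computes the reductions relative to the binary filter. It rests on two assumptions: that both ternary filters use the single ternary shift register column of Table 5-2, and that stage figures add directly; the sums obtained this way match the power, transistor and speed rows of Table 5-4. The function names are hypothetical.

```python
# Figures copied from Tables 5-2 and 5-3: (power in nW, transistors, delay in ns).
SHIFT_REGISTER = {
    "ternary": (400, 2240, 140),
    "binary": (40, 1216, 200),
}
MAC = {
    "rc_ternary": (72, 2790, 230),   # register-controlled ternary
    "rl_ternary": (60, 2385, 170),   # register-less ternary
    "binary": (24, 5226, 250),
}


def filter_totals(design):
    """Add the shift register and MAC figures for one filter design."""
    sr_key = "binary" if design == "binary" else "ternary"
    return tuple(s + m for s, m in zip(SHIFT_REGISTER[sr_key], MAC[design]))


def reduction_percent(value, reference):
    """Percentage by which `value` is below `reference`."""
    return 100.0 * (reference - value) / reference


if __name__ == "__main__":
    _, binary_area, binary_delay = filter_totals("binary")
    for design in ("rc_ternary", "rl_ternary"):
        _, area, delay = filter_totals(design)
        print(design,
              f"area -{reduction_percent(area, binary_area):.1f}%",
              f"delay -{reduction_percent(delay, binary_delay):.1f}%")
```

The same `reduction_percent` helper applied to the two ternary columns gives the register-less versus register-controlled comparison.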

5.4 SUMMARY

In this chapter, the design methodology for the Finite Impulse Response Low Pass Filter (FIR-LPF) was demonstrated and the steps to design and implement it using both ternary and binary Null Convention Logic techniques were discussed in detail. All implementations were compared in terms of power consumption, total design area and performance. From the analysis, it is clear that the binary (BNCL) implementation exhibits significantly better power and energy figures than the ternary (TNCL) cases. However, in comparison to the multi rail binary NCL implementation of the filter, the register-less ternary implementation is almost 31% faster and nearly 28% smaller. Compared to register-controlled ternary NCL, the register-less ternary design reduces the design area and power consumption by 8% and 2.5% respectively and improves the performance of the design by almost 16%.

The main reason for these reductions and performance enhancement is the elimination of the need for the registers at the input and output stages. As well as eliminating the registers, the ternary logic gates have been modified to be able to simultaneously process the input data and control the flow of the data in the pipeline. Although this modification replaces the pipeline registers that would be otherwise required, it somewhat limits the scope for both power consumption improvements and total design area minimisation due to the additional transistors in each logic gate required to replace these registers.

In summary, the multi rail binary NCL architecture is the better power saving implementation and is suited to applications where low power consumption is the main requirement. On the other hand, ternary NCL offers better performance and smaller design area, and can be used in applications that have strong requirements for a small design area and/or higher speed. Register-less ternary has been shown to be a better option than register-controlled ternary in terms of power, area and performance.

Chapter 6: Conclusions and Future Work

6.1 CONCLUSIONS

Null Convention Logic (NCL) is one of the most common techniques for designing and implementing asynchronous digital systems. Whilst it is considered a robust and power saving architecture, it also has disadvantages worth considering, as it relies on Multi Rail signalling to represent each single bit.

In turn, this leads to a bulky, complicated design that might have routing issues. The advantage of using Multi Rail is that it decreases the switching probability, which helps to reduce the total power consumption. However, the mutually exclusive dual rail encoding also introduces an illegal state in which both rails are asserted. This illegal state might occur in high frequency applications due to noise. It limits the applications of Null Convention Logic designs using Multi Rail Binary Logic, as they are not designed to handle it.

In this work, Single Rail Ternary Logic has been introduced as an alternative approach to multi rail binary to design and implement Null Convention Logic systems. A novel architecture has been proposed which is largely independent of the CMOS process technology or the voltage level mapping. The proposed architecture has been designed at the transistor level using Cadence Virtuoso and then modified and improved from an architecture perspective to reduce the design complexity and its area in turn.

A register-less Null Convention Logic architecture has been introduced that uses this single rail ternary logic. This arrangement eliminates the need for delay insensitive registers to control the flow of the data within the NCL Pipeline. It has been demonstrated that the proposed architecture exhibits the primary characteristics of NCL and hence can be used to implement any NCL system.

The proposed register-less architecture has been used to design a small digital signal processing application, an 8-bit Full Adder, and the results have been analysed and compared against the dual-rail Binary Logic implementation, including its multi-threshold implementation. It has been shown that the proposed architecture has a smaller design area and better idle power consumption during the null cycle, but at the cost of performance. Its dynamic power consumption is also higher than for binary NCL.

A short word-length low pass filter, which is widely used in digital communication systems, was used as a case study to compare between the binary and ternary implementation styles. This application was designed and implemented in both multi rail binary and single rail ternary NCL. Both of the ternary architectures have a smaller design area but consume more power compared to the binary case. The main reason for this is the complicated design of the serial-to-parallel shift register in ternary mode.

It has been demonstrated that Single Rail Ternary Logic can be utilised to implement a sophisticated and complicated Null Convention Logic system while reducing the design area and complexity and enhancing the design speed. Nonetheless, Multi Rail Binary Logic is still considered the better power saving option. On the other hand, the proposed novel architecture eliminates the illegal state that might occur in the Multi Rail Binary Logic implementation and hence does not limit the proposed architecture to low frequency applications, as it is free of the illegal case.

6.2 FUTURE WORK

In this thesis, whilst every effort has been made to cover the relevant topics as thoroughly as possible, inevitable time constraints and circumstances have prevented potentially interesting investigations into various optimization techniques, structures and improved designs. This section presents a brief discussion of some topics from this thesis that would be worth investigating further.

1. Further optimization of the Single Rail Ternary Logic Serial-to-Parallel Shift Register is required, which might help to reduce the complexity of the design and the number of transistors required to implement it. This may in turn reduce its power consumption to a level similar to, or even below, that of the Multi Rail Binary Logic implementation.

2. Development of a Single Rail Ternary Logic Null Convention Logic (SR-TNCL) Encoder to encode the output of the Digital Low Pass Filter (LPF) to generate the single word at the output stage.

3. Development of a higher resolution Digital Low Pass Filter (LPF) by increasing the Over Sampling Rate (OSR) to 32 or more, utilizing the Single Rail Ternary Logic Null Convention Logic (SR-TNCL).

4. Development of the end-to-end design of the Delta-Sigma Analog-to-Digital Converter utilizing the Single Rail Ternary Logic (SR-T) Architecture, which can be integrated with most digital signal processing applications.


Appendix – A: Register-Controlled Ternary NCL Verilog Code

Ternary NCL Digital LPF System Verilog Code

module Ternry_Filter_Register (
    input wire IN,
    input wire [31:0] C,
    output wire [5:0] filterout,
    output wire RFO,
    input wire RFD);

wire [31:0] d;
wire [31:0] RFI;
wire [31:0] M;
wire [15:0] S_Stage_0;
wire [15:0] C_Stage_0;
wire [15:0] S_Stage_1;
wire [15:0] C_Stage_1;
wire [11:0] S_Stage_2;
wire [11:0] C_Stage_2;
wire [7:0] S_Stage_3;
wire [7:0] C_Stage_3;
wire [4:0] S_Stage_4;
wire [4:0] C_Stage_4;
wire [5:0] filter_out;

assign filterout = filter_out;
assign RFI[30] = RFO;

//32-Bit Ternary Shift Register including Ternary Completion Detection
SR_Reg SR00 (d[0], RFI[0], IN);
SR_Reg SR01 (d[1], RFI[1], d[0]);
SR_Reg SR02 (d[2], RFI[2], d[1]);
SR_Reg SR03 (d[3], RFI[3], d[2]);
SR_Reg SR04 (d[4], RFI[4], d[3]);
SR_Reg SR05 (d[5], RFI[5], d[4]);
SR_Reg SR06 (d[6], RFI[6], d[5]);
SR_Reg SR07 (d[7], RFI[7], d[6]);
SR_Reg SR08 (d[8], RFI[8], d[7]);
SR_Reg SR09 (d[9], RFI[9], d[8]);
SR_Reg SR10 (d[10], RFI[10], d[9]);
SR_Reg SR11 (d[11], RFI[11], d[10]);
SR_Reg SR12 (d[12], RFI[12], d[11]);
SR_Reg SR13 (d[13], RFI[13], d[12]);
SR_Reg SR14 (d[14], RFI[14], d[13]);
SR_Reg SR15 (d[15], RFI[15], d[14]);
SR_Reg SR16 (d[16], RFI[16], d[15]);
SR_Reg SR17 (d[17], RFI[17], d[16]);
SR_Reg SR18 (d[18], RFI[18], d[17]);
SR_Reg SR19 (d[19], RFI[19], d[18]);
SR_Reg SR20 (d[20], RFI[20], d[19]);
SR_Reg SR21 (d[21], RFI[21], d[20]);
SR_Reg SR22 (d[22], RFI[22], d[21]);
SR_Reg SR23 (d[23], RFI[23], d[22]);
SR_Reg SR24 (d[24], RFI[24], d[23]);
SR_Reg SR25 (d[25], RFI[25], d[24]);
SR_Reg SR26 (d[26], RFI[26], d[25]);
SR_Reg SR27 (d[27], RFI[27], d[26]);
SR_Reg SR28 (d[28], RFI[28], d[27]);
SR_Reg SR29 (d[29], RFI[29], d[28]);
SR_Reg SR30 (d[30], RFI[30], d[29]);
SR_Reg SR31 (d[31], RFI[31], d[30]);

cdsModule_2 CD00 (RFI[0], d[1]);
cdsModule_2 CD01 (RFI[1], d[2]);
cdsModule_2 CD02 (RFI[2], d[3]);
cdsModule_2 CD03 (RFI[3], d[4]);
cdsModule_2 CD04 (RFI[4], d[5]);
cdsModule_2 CD05 (RFI[5], d[6]);
cdsModule_2 CD06 (RFI[6], d[7]);
cdsModule_2 CD07 (RFI[7], d[8]);
cdsModule_2 CD08 (RFI[8], d[9]);
cdsModule_2 CD09 (RFI[9], d[10]);
cdsModule_2 CD10 (RFI[10], d[11]);
cdsModule_2 CD11 (RFI[11], d[12]);
cdsModule_2 CD12 (RFI[12], d[13]);
cdsModule_2 CD13 (RFI[13], d[14]);
cdsModule_2 CD14 (RFI[14], d[15]);
cdsModule_2 CD15 (RFI[15], d[16]);
cdsModule_2 CD16 (RFI[16], d[17]);
cdsModule_2 CD17 (RFI[17], d[18]);
cdsModule_2 CD18 (RFI[18], d[19]);
cdsModule_2 CD19 (RFI[19], d[20]);
cdsModule_2 CD20 (RFI[20], d[21]);
cdsModule_2 CD21 (RFI[21], d[22]);
cdsModule_2 CD22 (RFI[22], d[23]);
cdsModule_2 CD23 (RFI[23], d[24]);
cdsModule_2 CD24 (RFI[24], d[25]);
cdsModule_2 CD25 (RFI[25], d[26]);
cdsModule_2 CD26 (RFI[26], d[27]);
cdsModule_2 CD27 (RFI[27], d[28]);
cdsModule_2 CD28 (RFI[28], d[29]);
cdsModule_2 CD29 (RFI[29], d[30]);
cdsModule_2 CD30 (RFI[30], d[31]);

inverter inv0 (RFN, RFD);

//Multiply and Accumulate Stages

//32-Bit Ternary Multiplier

AND_MTH AND_0 (.Z(M[0]), .A(d[0]), .B(C[0]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_1 (.Z(M[1]), .A(d[1]), .B(C[1]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_2 (.Z(M[2]), .A(d[2]), .B(C[2]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_3 (.Z(M[3]), .A(d[3]), .B(C[3]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_4 (.Z(M[4]), .A(d[4]), .B(C[4]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_5 (.Z(M[5]), .A(d[5]), .B(C[5]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_6 (.Z(M[6]), .A(d[6]), .B(C[6]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_7 (.Z(M[7]), .A(d[7]), .B(C[7]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_8 (.Z(M[8]), .A(d[8]), .B(C[8]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_9 (.Z(M[9]), .A(d[9]), .B(C[9]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_10 (.Z(M[10]), .A(d[10]), .B(C[10]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_11 (.Z(M[11]), .A(d[11]), .B(C[11]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_12 (.Z(M[12]), .A(d[12]), .B(C[12]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_13 (.Z(M[13]), .A(d[13]), .B(C[13]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_14 (.Z(M[14]), .A(d[14]), .B(C[14]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_15 (.Z(M[15]), .A(d[15]), .B(C[15]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_16 (.Z(M[16]), .A(d[16]), .B(C[16]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_17 (.Z(M[17]), .A(d[17]), .B(C[17]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_18 (.Z(M[18]), .A(d[18]), .B(C[18]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_19 (.Z(M[19]), .A(d[19]), .B(C[19]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_20 (.Z(M[20]), .A(d[20]), .B(C[20]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_21 (.Z(M[21]), .A(d[21]), .B(C[21]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_22 (.Z(M[22]), .A(d[22]), .B(C[22]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_23 (.Z(M[23]), .A(d[23]), .B(C[23]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_24 (.Z(M[24]), .A(d[24]), .B(C[24]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_25 (.Z(M[25]), .A(d[25]), .B(C[25]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_26 (.Z(M[26]), .A(d[26]), .B(C[26]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_27 (.Z(M[27]), .A(d[27]), .B(C[27]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_28 (.Z(M[28]), .A(d[28]), .B(C[28]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_29 (.Z(M[29]), .A(d[29]), .B(C[29]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_30 (.Z(M[30]), .A(d[30]), .B(C[30]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_31 (.Z(M[31]), .A(d[31]), .B(C[31]), .Active(RFD), .Sleep(RFN));

///Stage_1

HA HA0_0 (.S(S_Stage_0[0]), .C(C_Stage_0[0]), .A(M[0]), .B(M[1]), .Active(RFD), .Sleep(RFN));
HA HA0_1 (.S(S_Stage_0[1]), .C(C_Stage_0[1]), .A(M[1]), .B(M[2]), .Active(RFD), .Sleep(RFN));
HA HA0_2 (.S(S_Stage_0[2]), .C(C_Stage_0[2]), .A(M[2]), .B(M[3]), .Active(RFD), .Sleep(RFN));
HA HA0_3 (.S(S_Stage_0[3]), .C(C_Stage_0[3]), .A(M[3]), .B(M[4]), .Active(RFD), .Sleep(RFN));
HA HA0_4 (.S(S_Stage_0[4]), .C(C_Stage_0[4]), .A(M[4]), .B(M[5]), .Active(RFD), .Sleep(RFN));
HA HA0_5 (.S(S_Stage_0[5]), .C(C_Stage_0[5]), .A(M[5]), .B(M[6]), .Active(RFD), .Sleep(RFN));
HA HA0_6 (.S(S_Stage_0[6]), .C(C_Stage_0[6]), .A(M[6]), .B(M[7]), .Active(RFD), .Sleep(RFN));
HA HA0_7 (.S(S_Stage_0[7]), .C(C_Stage_0[7]), .A(M[7]), .B(M[8]), .Active(RFD), .Sleep(RFN));
HA HA0_8 (.S(S_Stage_0[8]), .C(C_Stage_0[8]), .A(M[8]), .B(M[9]), .Active(RFD), .Sleep(RFN));
HA HA0_9 (.S(S_Stage_0[9]), .C(C_Stage_0[9]), .A(M[9]), .B(M[10]), .Active(RFD), .Sleep(RFN));
HA HA0_10 (.S(S_Stage_0[10]), .C(C_Stage_0[10]), .A(M[10]), .B(M[11]), .Active(RFD), .Sleep(RFN));
HA HA0_11 (.S(S_Stage_0[11]), .C(C_Stage_0[11]), .A(M[11]), .B(M[12]), .Active(RFD), .Sleep(RFN));
HA HA0_12 (.S(S_Stage_0[12]), .C(C_Stage_0[12]), .A(M[12]), .B(M[13]), .Active(RFD), .Sleep(RFN));
HA HA0_13 (.S(S_Stage_0[13]), .C(C_Stage_0[13]), .A(M[13]), .B(M[14]), .Active(RFD), .Sleep(RFN));
HA HA0_14 (.S(S_Stage_0[14]), .C(C_Stage_0[14]), .A(M[14]), .B(M[15]), .Active(RFD), .Sleep(RFN));
HA HA0_15 (.S(S_Stage_0[15]), .C(C_Stage_0[15]), .A(M[15]), .B(M[16]), .Active(RFD), .Sleep(RFN));

///Stage_2

HA HA1_0 (.S(S_Stage_1[0]), .C(C_Stage_1[0]), .A(S_Stage_0[0]), .B(S_Stage_0[1]), .Active(RFD), .Sleep(RFN));
HA HA1_2 (.S(S_Stage_1[2]), .C(C_Stage_1[2]), .A(S_Stage_0[2]), .B(S_Stage_0[3]), .Active(RFD), .Sleep(RFN));
HA HA1_4 (.S(S_Stage_1[4]), .C(C_Stage_1[4]), .A(S_Stage_0[4]), .B(S_Stage_0[5]), .Active(RFD), .Sleep(RFN));
HA HA1_6 (.S(S_Stage_1[6]), .C(C_Stage_1[6]), .A(S_Stage_0[6]), .B(S_Stage_0[7]), .Active(RFD), .Sleep(RFN));
HA HA1_8 (.S(S_Stage_1[8]), .C(C_Stage_1[8]), .A(S_Stage_0[8]), .B(S_Stage_0[9]), .Active(RFD), .Sleep(RFN));
HA HA1_10 (.S(S_Stage_1[10]), .C(C_Stage_1[10]), .A(S_Stage_0[10]), .B(S_Stage_0[11]), .Active(RFD), .Sleep(RFN));
HA HA1_12 (.S(S_Stage_1[12]), .C(C_Stage_1[12]), .A(S_Stage_0[12]), .B(S_Stage_0[13]), .Active(RFD), .Sleep(RFN));
HA HA1_14 (.S(S_Stage_1[14]), .C(C_Stage_1[14]), .A(S_Stage_0[14]), .B(S_Stage_0[15]), .Active(RFD), .Sleep(RFN));

FA FA1_1 (.S(S_Stage_1[1]), .Co(C_Stage_1[1]), .A(C_Stage_0[0]), .B(C_Stage_0[1]), .Ci(C_Stage_1[0]), .Active(RFD), .Sleep(RFN));
FA FA1_3 (.S(S_Stage_1[3]), .Co(C_Stage_1[3]), .A(C_Stage_0[2]), .B(C_Stage_0[3]), .Ci(C_Stage_1[2]), .Active(RFD), .Sleep(RFN));
FA FA1_5 (.S(S_Stage_1[5]), .Co(C_Stage_1[5]), .A(C_Stage_0[4]), .B(C_Stage_0[5]), .Ci(C_Stage_1[4]), .Active(RFD), .Sleep(RFN));
FA FA1_7 (.S(S_Stage_1[7]), .Co(C_Stage_1[7]), .A(C_Stage_0[6]), .B(C_Stage_0[7]), .Ci(C_Stage_1[6]), .Active(RFD), .Sleep(RFN));
FA FA1_9 (.S(S_Stage_1[9]), .Co(C_Stage_1[9]), .A(C_Stage_0[8]), .B(C_Stage_0[9]), .Ci(C_Stage_1[8]), .Active(RFD), .Sleep(RFN));
FA FA1_11 (.S(S_Stage_1[11]), .Co(C_Stage_1[11]), .A(C_Stage_0[10]), .B(C_Stage_0[11]), .Ci(C_Stage_1[10]), .Active(RFD), .Sleep(RFN));
FA FA1_13 (.S(S_Stage_1[13]), .Co(C_Stage_1[13]), .A(C_Stage_0[12]), .B(C_Stage_0[13]), .Ci(C_Stage_1[12]), .Active(RFD), .Sleep(RFN));
FA FA1_15 (.S(S_Stage_1[15]), .Co(C_Stage_1[15]), .A(C_Stage_0[14]), .B(C_Stage_0[15]), .Ci(C_Stage_1[14]), .Active(RFD), .Sleep(RFN));

///Stage_3

HA HA2_0 (.S(S_Stage_2[0]), .C(C_Stage_2[0]), .A(S_Stage_1[0]), .B(S_Stage_1[2]), .Active(RFD), .Sleep(RFN));
FA FA2_1 (.S(S_Stage_2[1]), .Co(C_Stage_2[1]), .A(S_Stage_1[1]), .B(S_Stage_1[3]), .Ci(C_Stage_2[0]), .Active(RFD), .Sleep(RFN));
FA FA2_2 (.S(S_Stage_2[2]), .Co(C_Stage_2[2]), .A(C_Stage_1[1]), .B(C_Stage_1[3]), .Ci(C_Stage_2[1]), .Active(RFD), .Sleep(RFN));
HA HA2_3 (.S(S_Stage_2[3]), .C(C_Stage_2[3]), .A(S_Stage_1[4]), .B(S_Stage_1[6]), .Active(RFD), .Sleep(RFN));
FA FA2_4 (.S(S_Stage_2[4]), .Co(C_Stage_2[4]), .A(S_Stage_1[5]), .B(S_Stage_1[7]), .Ci(C_Stage_2[3]), .Active(RFD), .Sleep(RFN));
FA FA2_5 (.S(S_Stage_2[5]), .Co(C_Stage_2[5]), .A(C_Stage_1[5]), .B(C_Stage_1[7]), .Ci(C_Stage_2[4]), .Active(RFD), .Sleep(RFN));
HA HA2_6 (.S(S_Stage_2[6]), .C(C_Stage_2[6]), .A(S_Stage_1[8]), .B(S_Stage_1[10]), .Active(RFD), .Sleep(RFN));
FA FA2_7 (.S(S_Stage_2[7]), .Co(C_Stage_2[7]), .A(S_Stage_1[9]), .B(S_Stage_1[11]), .Ci(C_Stage_2[6]), .Active(RFD), .Sleep(RFN));
FA FA2_8 (.S(S_Stage_2[8]), .Co(C_Stage_2[8]), .A(C_Stage_1[9]), .B(C_Stage_1[11]), .Ci(C_Stage_2[7]), .Active(RFD), .Sleep(RFN));
HA HA2_9 (.S(S_Stage_2[9]), .C(C_Stage_2[9]), .A(S_Stage_1[12]), .B(S_Stage_1[14]), .Active(RFD), .Sleep(RFN));
FA FA2_10 (.S(S_Stage_2[10]), .Co(C_Stage_2[10]), .A(S_Stage_1[13]), .B(S_Stage_1[15]), .Ci(C_Stage_2[9]), .Active(RFD), .Sleep(RFN));
FA FA2_11 (.S(S_Stage_2[11]), .Co(C_Stage_2[11]), .A(C_Stage_1[13]), .B(C_Stage_1[15]), .Ci(C_Stage_2[10]), .Active(RFD), .Sleep(RFN));

///Stage_4

HA HA3_0 ( .S(S_Stage_3[0]), .C(C_Stage_3[0]), .A(S_Stage_2[0]), .B(S_Stage_2[3]), .Active(RFD), .Sleep(RFN));
FA FA3_1 ( .S(S_Stage_3[1]), .Co(C_Stage_3[1]), .A(S_Stage_2[1]), .B(S_Stage_2[4]), .Ci(C_Stage_3[0]), .Active(RFD), .Sleep(RFN));
FA FA3_2 ( .S(S_Stage_3[2]), .Co(C_Stage_3[2]), .A(S_Stage_2[2]), .B(S_Stage_2[5]), .Ci(C_Stage_3[1]), .Active(RFD), .Sleep(RFN));
FA FA3_3 ( .S(S_Stage_3[3]), .Co(C_Stage_3[3]), .A(C_Stage_2[2]), .B(C_Stage_2[5]), .Ci(C_Stage_3[2]), .Active(RFD), .Sleep(RFN));
HA HA3_4 ( .S(S_Stage_3[4]), .C(C_Stage_3[4]), .A(S_Stage_2[6]), .B(S_Stage_2[9]), .Active(RFD), .Sleep(RFN));
FA FA3_5 ( .S(S_Stage_3[5]), .Co(C_Stage_3[5]), .A(S_Stage_2[7]), .B(S_Stage_2[10]), .Ci(C_Stage_3[4]), .Active(RFD), .Sleep(RFN));
FA FA3_6 ( .S(S_Stage_3[6]), .Co(C_Stage_3[6]), .A(S_Stage_2[8]), .B(S_Stage_2[11]), .Ci(C_Stage_3[5]), .Active(RFD), .Sleep(RFN));
FA FA3_7 ( .S(S_Stage_3[7]), .Co(C_Stage_3[7]), .A(C_Stage_2[8]), .B(C_Stage_2[11]), .Ci(C_Stage_3[6]), .Active(RFD), .Sleep(RFN));

///Stage_5

HA HA4_0 ( .S(S_Stage_4[0]), .C(C_Stage_4[0]), .A(S_Stage_3[0]), .B(S_Stage_3[4]), .Active(RFD), .Sleep(RFN));
FA FA4_1 ( .S(S_Stage_4[1]), .Co(C_Stage_4[1]), .A(S_Stage_3[1]), .B(S_Stage_3[5]), .Ci(C_Stage_4[0]), .Active(RFD), .Sleep(RFN));
FA FA4_2 ( .S(S_Stage_4[2]), .Co(C_Stage_4[2]), .A(S_Stage_3[1]), .B(S_Stage_3[5]), .Ci(C_Stage_4[1]), .Active(RFD), .Sleep(RFN));
FA FA4_3 ( .S(S_Stage_4[3]), .Co(C_Stage_4[3]), .A(S_Stage_3[3]), .B(S_Stage_3[7]), .Ci(C_Stage_4[2]), .Active(RFD), .Sleep(RFN));
FA FA4_4 ( .S(S_Stage_4[4]), .Co(C_Stage_4[4]), .A(C_Stage_3[3]), .B(C_Stage_3[7]), .Ci(C_Stage_4[3]), .Active(RFD), .Sleep(RFN));

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
outputReg Reg12 (.O_5(filter_out[5]), .O_4(filter_out[4]), .O_3(filter_out[3]), .O_2(filter_out[2]), .O_1(filter_out[1]), .O_0(filter_out[0]), .RFD(RFO), .In_0(S_Stage_4[0]), .In_1(S_Stage_4[1]), .In_2(S_Stage_4[2]), .In_3(S_Stage_4[3]), .In_4(S_Stage_4[4]), .In_5(C_Stage_4[4]), .RFI_Following(RFD));
endmodule

Appendix – B: Register-Less Ternary NCL Verilog Code

Register-Less Ternary NCL Digital LPF System Verilog Code

module Ternry_Filter_RegisterLess (
    input wire IN,
    input wire [31:0] C,
    output wire [4:0] filterout);
wire [31:0] d;
wire [31:0] RFI;
wire [31:0] M;
wire [15:0] S_Stage_0;
wire [15:0] C_Stage_0;
wire [15:0] S_Stage_1;
wire [15:0] C_Stage_1;
wire [11:0] S_Stage_2;
wire [11:0] C_Stage_2;
wire [7:0] S_Stage_3;
wire [7:0] C_Stage_3;
wire [4:0] S_Stage_4;
wire [4:0] C_Stage_4;
wire [4:0] filter_out;
assign filterout = filter_out;
assign RFI[30] = RFD;

//32-Bit Ternary Shift Register including Ternary Completion Detection
SR_Reg SR00 ( d[0], RFI[0], IN);
SR_Reg SR01 ( d[1], RFI[1], d[0]);
SR_Reg SR02 ( d[2], RFI[2], d[1]);
SR_Reg SR03 ( d[3], RFI[3], d[2]);
SR_Reg SR04 ( d[4], RFI[4], d[3]);
SR_Reg SR05 ( d[5], RFI[5], d[4]);
SR_Reg SR06 ( d[6], RFI[6], d[5]);
SR_Reg SR07 ( d[7], RFI[7], d[6]);
SR_Reg SR08 ( d[8], RFI[8], d[7]);
SR_Reg SR09 ( d[9], RFI[9], d[8]);
SR_Reg SR10 ( d[10], RFI[10], d[9]);
SR_Reg SR11 ( d[11], RFI[11], d[10]);
SR_Reg SR12 ( d[12], RFI[12], d[11]);
SR_Reg SR13 ( d[13], RFI[13], d[12]);
SR_Reg SR14 ( d[14], RFI[14], d[13]);
SR_Reg SR15 ( d[15], RFI[15], d[14]);
SR_Reg SR16 ( d[16], RFI[16], d[15]);
SR_Reg SR17 ( d[17], RFI[17], d[16]);
SR_Reg SR18 ( d[18], RFI[18], d[17]);
SR_Reg SR19 ( d[19], RFI[19], d[18]);
SR_Reg SR20 ( d[20], RFI[20], d[19]);
SR_Reg SR21 ( d[21], RFI[21], d[20]);
SR_Reg SR22 ( d[22], RFI[22], d[21]);
SR_Reg SR23 ( d[23], RFI[23], d[22]);
SR_Reg SR24 ( d[24], RFI[24], d[23]);
SR_Reg SR25 ( d[25], RFI[25], d[24]);
SR_Reg SR26 ( d[26], RFI[26], d[25]);
SR_Reg SR27 ( d[27], RFI[27], d[26]);
SR_Reg SR28 ( d[28], RFI[28], d[27]);
SR_Reg SR29 ( d[29], RFI[29], d[28]);
SR_Reg SR30 ( d[30], RFI[30], d[29]);
SR_Reg SR31 ( d[31], RFI[31], d[30]);
cdsModule_2 CD00 ( RFI[0], d[1]);
cdsModule_2 CD01 ( RFI[1], d[2]);
cdsModule_2 CD02 ( RFI[2], d[3]);
cdsModule_2 CD03 ( RFI[3], d[4]);
cdsModule_2 CD04 ( RFI[4], d[5]);
cdsModule_2 CD05 ( RFI[5], d[6]);
cdsModule_2 CD06 ( RFI[6], d[7]);
cdsModule_2 CD07 ( RFI[7], d[8]);
cdsModule_2 CD08 ( RFI[8], d[9]);
cdsModule_2 CD09 ( RFI[9], d[10]);
cdsModule_2 CD10 ( RFI[10], d[11]);
cdsModule_2 CD11 ( RFI[11], d[12]);
cdsModule_2 CD12 ( RFI[12], d[13]);
cdsModule_2 CD13 ( RFI[13], d[14]);
cdsModule_2 CD14 ( RFI[14], d[15]);
cdsModule_2 CD15 ( RFI[15], d[16]);
cdsModule_2 CD16 ( RFI[16], d[17]);
cdsModule_2 CD17 ( RFI[17], d[18]);
cdsModule_2 CD18 ( RFI[18], d[19]);
cdsModule_2 CD19 ( RFI[19], d[20]);
cdsModule_2 CD20 ( RFI[20], d[21]);
cdsModule_2 CD21 ( RFI[21], d[22]);
cdsModule_2 CD22 ( RFI[22], d[23]);
cdsModule_2 CD23 ( RFI[23], d[24]);
cdsModule_2 CD24 ( RFI[24], d[25]);
cdsModule_2 CD25 ( RFI[25], d[26]);
cdsModule_2 CD26 ( RFI[26], d[27]);
cdsModule_2 CD27 ( RFI[27], d[28]);
cdsModule_2 CD28 ( RFI[28], d[29]);
cdsModule_2 CD29 ( RFI[29], d[30]);
cdsModule_2 CD30 ( RFI[30], d[31]);
inverter inv0 (RFN, RFD);

//Multiply and Accumulate Stages

//32-Bit Ternary Multiplier

AND_MTH AND_0 ( .Z(M[0]), .A(d[0]), .B(C[0]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_1 ( .Z(M[1]), .A(d[1]), .B(C[1]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_2 ( .Z(M[2]), .A(d[2]), .B(C[2]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_3 ( .Z(M[3]), .A(d[3]), .B(C[3]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_4 ( .Z(M[4]), .A(d[4]), .B(C[4]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_5 ( .Z(M[5]), .A(d[5]), .B(C[5]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_6 ( .Z(M[6]), .A(d[6]), .B(C[6]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_7 ( .Z(M[7]), .A(d[7]), .B(C[7]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_8 ( .Z(M[8]), .A(d[8]), .B(C[8]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_9 ( .Z(M[9]), .A(d[9]), .B(C[9]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_10 ( .Z(M[10]), .A(d[10]), .B(C[10]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_11 ( .Z(M[11]), .A(d[11]), .B(C[11]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_12 ( .Z(M[12]), .A(d[12]), .B(C[12]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_13 ( .Z(M[13]), .A(d[13]), .B(C[13]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_14 ( .Z(M[14]), .A(d[14]), .B(C[14]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_15 ( .Z(M[15]), .A(d[15]), .B(C[15]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_16 ( .Z(M[16]), .A(d[16]), .B(C[16]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_17 ( .Z(M[17]), .A(d[17]), .B(C[17]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_18 ( .Z(M[18]), .A(d[18]), .B(C[18]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_19 ( .Z(M[19]), .A(d[19]), .B(C[19]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_20 ( .Z(M[20]), .A(d[20]), .B(C[20]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_21 ( .Z(M[21]), .A(d[21]), .B(C[21]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_22 ( .Z(M[22]), .A(d[22]), .B(C[22]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_23 ( .Z(M[23]), .A(d[23]), .B(C[23]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_24 ( .Z(M[24]), .A(d[24]), .B(C[24]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_25 ( .Z(M[25]), .A(d[25]), .B(C[25]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_26 ( .Z(M[26]), .A(d[26]), .B(C[26]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_27 ( .Z(M[27]), .A(d[27]), .B(C[27]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_28 ( .Z(M[28]), .A(d[28]), .B(C[28]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_29 ( .Z(M[29]), .A(d[29]), .B(C[29]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_30 ( .Z(M[30]), .A(d[30]), .B(C[30]), .Active(RFD), .Sleep(RFN));
AND_MTH AND_31 ( .Z(M[31]), .A(d[31]), .B(C[31]), .Active(RFD), .Sleep(RFN));

///Stage_1

HA HA0_0 ( .S(S_Stage_0[0]), .C(C_Stage_0[0]), .A(M[0]), .B(M[1]), .Active(RFD), .Sleep(RFN));
HA HA0_1 ( .S(S_Stage_0[1]), .C(C_Stage_0[1]), .A(M[1]), .B(M[2]), .Active(RFD), .Sleep(RFN));
HA HA0_2 ( .S(S_Stage_0[2]), .C(C_Stage_0[2]), .A(M[2]), .B(M[3]), .Active(RFD), .Sleep(RFN));
HA HA0_3 ( .S(S_Stage_0[3]), .C(C_Stage_0[3]), .A(M[3]), .B(M[4]), .Active(RFD), .Sleep(RFN));
HA HA0_4 ( .S(S_Stage_0[4]), .C(C_Stage_0[4]), .A(M[4]), .B(M[5]), .Active(RFD), .Sleep(RFN));
HA HA0_5 ( .S(S_Stage_0[5]), .C(C_Stage_0[5]), .A(M[5]), .B(M[6]), .Active(RFD), .Sleep(RFN));
HA HA0_6 ( .S(S_Stage_0[6]), .C(C_Stage_0[6]), .A(M[6]), .B(M[7]), .Active(RFD), .Sleep(RFN));
HA HA0_7 ( .S(S_Stage_0[7]), .C(C_Stage_0[7]), .A(M[7]), .B(M[8]), .Active(RFD), .Sleep(RFN));
HA HA0_8 ( .S(S_Stage_0[8]), .C(C_Stage_0[8]), .A(M[8]), .B(M[9]), .Active(RFD), .Sleep(RFN));
HA HA0_9 ( .S(S_Stage_0[9]), .C(C_Stage_0[9]), .A(M[9]), .B(M[10]), .Active(RFD), .Sleep(RFN));
HA HA0_10 ( .S(S_Stage_0[10]), .C(C_Stage_0[10]), .A(M[10]), .B(M[11]), .Active(RFD), .Sleep(RFN));
HA HA0_11 ( .S(S_Stage_0[11]), .C(C_Stage_0[11]), .A(M[11]), .B(M[12]), .Active(RFD), .Sleep(RFN));
HA HA0_12 ( .S(S_Stage_0[12]), .C(C_Stage_0[12]), .A(M[12]), .B(M[13]), .Active(RFD), .Sleep(RFN));
HA HA0_13 ( .S(S_Stage_0[13]), .C(C_Stage_0[13]), .A(M[13]), .B(M[14]), .Active(RFD), .Sleep(RFN));
HA HA0_14 ( .S(S_Stage_0[14]), .C(C_Stage_0[14]), .A(M[14]), .B(M[15]), .Active(RFD), .Sleep(RFN));
HA HA0_15 ( .S(S_Stage_0[15]), .C(C_Stage_0[15]), .A(M[15]), .B(M[16]), .Active(RFD), .Sleep(RFN));

///Stage_2

HA HA1_0 ( .S(S_Stage_1[0]), .C(C_Stage_1[0]), .A(S_Stage_0[0]), .B(S_Stage_0[1]), .Active(RFD), .Sleep(RFN));
HA HA1_2 ( .S(S_Stage_1[2]), .C(C_Stage_1[2]), .A(S_Stage_0[2]), .B(S_Stage_0[3]), .Active(RFD), .Sleep(RFN));
HA HA1_4 ( .S(S_Stage_1[4]), .C(C_Stage_1[4]), .A(S_Stage_0[4]), .B(S_Stage_0[5]), .Active(RFD), .Sleep(RFN));
HA HA1_6 ( .S(S_Stage_1[6]), .C(C_Stage_1[6]), .A(S_Stage_0[6]), .B(S_Stage_0[7]), .Active(RFD), .Sleep(RFN));
HA HA1_8 ( .S(S_Stage_1[8]), .C(C_Stage_1[8]), .A(S_Stage_0[8]), .B(S_Stage_0[9]), .Active(RFD), .Sleep(RFN));
HA HA1_10 ( .S(S_Stage_1[10]), .C(C_Stage_1[10]), .A(S_Stage_0[10]), .B(S_Stage_0[11]), .Active(RFD), .Sleep(RFN));
HA HA1_12 ( .S(S_Stage_1[12]), .C(C_Stage_1[12]), .A(S_Stage_0[12]), .B(S_Stage_0[13]), .Active(RFD), .Sleep(RFN));
HA HA1_14 ( .S(S_Stage_1[14]), .C(C_Stage_1[14]), .A(S_Stage_0[14]), .B(S_Stage_0[15]), .Active(RFD), .Sleep(RFN));

FA FA1_1 ( .S(S_Stage_1[1]), .Co(C_Stage_1[1]), .A(C_Stage_0[0]), .B(C_Stage_0[1]), .Ci(C_Stage_1[0]), .Active(RFD), .Sleep(RFN));
FA FA1_3 ( .S(S_Stage_1[3]), .Co(C_Stage_1[3]), .A(C_Stage_0[2]), .B(C_Stage_0[3]), .Ci(C_Stage_1[2]), .Active(RFD), .Sleep(RFN));
FA FA1_5 ( .S(S_Stage_1[5]), .Co(C_Stage_1[5]), .A(C_Stage_0[4]), .B(C_Stage_0[5]), .Ci(C_Stage_1[4]), .Active(RFD), .Sleep(RFN));
FA FA1_7 ( .S(S_Stage_1[7]), .Co(C_Stage_1[7]), .A(C_Stage_0[6]), .B(C_Stage_0[7]), .Ci(C_Stage_1[6]), .Active(RFD), .Sleep(RFN));
FA FA1_9 ( .S(S_Stage_1[9]), .Co(C_Stage_1[9]), .A(C_Stage_0[8]), .B(C_Stage_0[9]), .Ci(C_Stage_1[8]), .Active(RFD), .Sleep(RFN));
FA FA1_11 ( .S(S_Stage_1[11]), .Co(C_Stage_1[11]), .A(C_Stage_0[10]), .B(C_Stage_0[11]), .Ci(C_Stage_1[10]), .Active(RFD), .Sleep(RFN));
FA FA1_13 ( .S(S_Stage_1[13]), .Co(C_Stage_1[13]), .A(C_Stage_0[12]), .B(C_Stage_0[13]), .Ci(C_Stage_1[12]), .Active(RFD), .Sleep(RFN));
FA FA1_15 ( .S(S_Stage_1[15]), .Co(C_Stage_1[15]), .A(C_Stage_0[14]), .B(C_Stage_0[15]), .Ci(C_Stage_1[14]), .Active(RFD), .Sleep(RFN));

///Stage_3

HA HA2_0 ( .S(S_Stage_2[0]), .C(C_Stage_2[0]), .A(S_Stage_1[0]), .B(S_Stage_1[2]), .Active(RFD), .Sleep(RFN));
FA FA2_1 ( .S(S_Stage_2[1]), .Co(C_Stage_2[1]), .A(S_Stage_1[1]), .B(S_Stage_1[3]), .Ci(C_Stage_2[0]), .Active(RFD), .Sleep(RFN));
FA FA2_2 ( .S(S_Stage_2[2]), .Co(C_Stage_2[2]), .A(C_Stage_1[1]), .B(C_Stage_1[3]), .Ci(C_Stage_2[1]), .Active(RFD), .Sleep(RFN));

HA HA2_3 ( .S(S_Stage_2[3]), .C(C_Stage_2[3]), .A(S_Stage_1[4]), .B(S_Stage_1[6]), .Active(RFD), .Sleep(RFN));
FA FA2_4 ( .S(S_Stage_2[4]), .Co(C_Stage_2[4]), .A(S_Stage_1[5]), .B(S_Stage_1[7]), .Ci(C_Stage_2[3]), .Active(RFD), .Sleep(RFN));
FA FA2_5 ( .S(S_Stage_2[5]), .Co(C_Stage_2[5]), .A(C_Stage_1[5]), .B(C_Stage_1[7]), .Ci(C_Stage_2[4]), .Active(RFD), .Sleep(RFN));

HA HA2_6 ( .S(S_Stage_2[6]), .C(C_Stage_2[6]), .A(S_Stage_1[8]), .B(S_Stage_1[10]), .Active(RFD), .Sleep(RFN));
FA FA2_7 ( .S(S_Stage_2[7]), .Co(C_Stage_2[7]), .A(S_Stage_1[9]), .B(S_Stage_1[11]), .Ci(C_Stage_2[6]), .Active(RFD), .Sleep(RFN));
FA FA2_8 ( .S(S_Stage_2[8]), .Co(C_Stage_2[8]), .A(C_Stage_1[9]), .B(C_Stage_1[11]), .Ci(C_Stage_2[7]), .Active(RFD), .Sleep(RFN));

HA HA2_9 ( .S(S_Stage_2[9]), .C(C_Stage_2[9]), .A(S_Stage_1[12]), .B(S_Stage_1[14]), .Active(RFD), .Sleep(RFN));
FA FA2_10 ( .S(S_Stage_2[10]), .Co(C_Stage_2[10]), .A(S_Stage_1[13]), .B(S_Stage_1[15]), .Ci(C_Stage_2[9]), .Active(RFD), .Sleep(RFN));
FA FA2_11 ( .S(S_Stage_2[11]), .Co(C_Stage_2[11]), .A(C_Stage_1[13]), .B(C_Stage_1[15]), .Ci(C_Stage_2[10]), .Active(RFD), .Sleep(RFN));

///Stage_4

HA HA3_0 ( .S(S_Stage_3[0]), .C(C_Stage_3[0]), .A(S_Stage_2[0]), .B(S_Stage_2[3]), .Active(RFD), .Sleep(RFN));
FA FA3_1 ( .S(S_Stage_3[1]), .Co(C_Stage_3[1]), .A(S_Stage_2[1]), .B(S_Stage_2[4]), .Ci(C_Stage_3[0]), .Active(RFD), .Sleep(RFN));
FA FA3_2 ( .S(S_Stage_3[2]), .Co(C_Stage_3[2]), .A(S_Stage_2[2]), .B(S_Stage_2[5]), .Ci(C_Stage_3[1]), .Active(RFD), .Sleep(RFN));
FA FA3_3 ( .S(S_Stage_3[3]), .Co(C_Stage_3[3]), .A(C_Stage_2[2]), .B(C_Stage_2[5]), .Ci(C_Stage_3[2]), .Active(RFD), .Sleep(RFN));

HA HA3_4 ( .S(S_Stage_3[4]), .C(C_Stage_3[4]), .A(S_Stage_2[6]), .B(S_Stage_2[9]), .Active(RFD), .Sleep(RFN));
FA FA3_5 ( .S(S_Stage_3[5]), .Co(C_Stage_3[5]), .A(S_Stage_2[7]), .B(S_Stage_2[10]), .Ci(C_Stage_3[4]), .Active(RFD), .Sleep(RFN));
FA FA3_6 ( .S(S_Stage_3[6]), .Co(C_Stage_3[6]), .A(S_Stage_2[8]), .B(S_Stage_2[11]), .Ci(C_Stage_3[5]), .Active(RFD), .Sleep(RFN));
FA FA3_7 ( .S(S_Stage_3[7]), .Co(C_Stage_3[7]), .A(C_Stage_2[8]), .B(C_Stage_2[11]), .Ci(C_Stage_3[6]), .Active(RFD), .Sleep(RFN));

///Stage_5

HA HA4_0 ( .S(S_Stage_4[0]), .C(C_Stage_4[0]), .A(S_Stage_3[0]), .B(S_Stage_3[4]), .Active(RFD), .Sleep(RFN));
FA FA4_1 ( .S(S_Stage_4[1]), .Co(C_Stage_4[1]), .A(S_Stage_3[1]), .B(S_Stage_3[5]), .Ci(C_Stage_4[0]), .Active(RFD), .Sleep(RFN));
FA FA4_2 ( .S(S_Stage_4[2]), .Co(C_Stage_4[2]), .A(S_Stage_3[1]), .B(S_Stage_3[5]), .Ci(C_Stage_4[1]), .Active(RFD), .Sleep(RFN));
FA FA4_3 ( .S(S_Stage_4[3]), .Co(C_Stage_4[3]), .A(S_Stage_3[3]), .B(S_Stage_3[7]), .Ci(C_Stage_4[2]), .Active(RFD), .Sleep(RFN));
FA FA4_4 ( .S(S_Stage_4[4]), .Co(C_Stage_4[4]), .A(C_Stage_3[3]), .B(C_Stage_3[7]), .Ci(C_Stage_4[3]), .Active(RFD), .Sleep(RFN));

Completion_Detection CD0 (RFD, S_Stage_4[0], S_Stage_4[1], S_Stage_4[2], S_Stage_4[3], S_Stage_4[4], C_Stage_4[4]);

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
assign f[0] = S_Stage_4[0] ;
assign f[1] = S_Stage_4[1] ;
assign f[2] = S_Stage_4[2] ;
assign f[3] = S_Stage_4[3] ;
assign f[4] = S_Stage_4[4] ;
assign f[5] = C_Stage_4[4] ;
endmodule

Appendix – C: Conventional Binary NCL Verilog Code

Binary Null Convention Logic Digital Low Pass Filter System Verilog Code

module Binary_Filter (
    input IN_0,
    input IN_1,
    input reset,
    input [31:0] coeff_0,
    input [31:0] coeff_1,
    output [11:0] filterOut_0,
    output [11:0] filterOut_1);
wire [63:0] d_0;
wire [63:0] d_1;
wire [32:0] RFD;
wire [31:0] RFN;
wire [32:0] RFI;
wire [31:0] multiplierOut_0;
wire [31:0] multiplierOut_1;
wire [15:0] addStage1Sum_0;
wire [15:0] addStage1Sum_1;
wire [15:0] addStage1Cout_0;
wire [15:0] addStage1Cout_1;
wire [15:0] addStage2Sum_0;
wire [15:0] addStage2Sum_1;
wire [15:0] addStage2Cout_0;
wire [15:0] addStage2Cout_1;
wire [11:0] addStage3Sum_0;
wire [11:0] addStage3Sum_1;
wire [11:0] addStage3Cout_0;
wire [11:0] addStage3Cout_1;
wire [7:0] addStage4Sum_0;
wire [7:0] addStage4Sum_1;
wire [7:0] addStage4Cout_0;
wire [7:0] addStage4Cout_1;
wire [4:0] addStage5Sum_0;
wire [4:0] addStage5Sum_1;
wire [4:0] addStage5Cout_0;
wire [4:0] addStage5Cout_1;

//32-Bit Binary Null Convention Logic Register including Completion Detection

reg1b_initNULL reg0 (d_1[0],d_0[0],RFD[0],IN_1,IN_0,RFN[0], reset);
reg1b_initData reg0 (d_1[1],d_0[1],RFN[0],d_1[0],d_0[0],RFI[0], reset);
TH22 TH00 (RFI[0], RFD[1], RFO);
reg1b_initNULL reg1 (d_1[2],d_0[2],RFD[1],d_1[1],d_0[1],RFN[1], reset);
reg1b_initData reg1 (d_1[3],d_0[3],RFN[1],d_1[2],d_0[2],RFI[1], reset);
TH22 TH01 (RFI[1], RFD[2], RFO);
reg1b_initNULL reg2 (d_1[4],d_0[4],RFD[2],d_1[3],d_0[3],RFN[2], reset);
reg1b_initData reg2 (d_1[5],d_0[5],RFN[2],d_1[4],d_0[4],RFI[2], reset);
TH22 TH02 (RFI[2], RFD[3], RFO);
reg1b_initNULL reg3 (d_1[6],d_0[6],RFD[3],d_1[5],d_0[5],RFN[3], reset);
reg1b_initData reg3 (d_1[7],d_0[7],RFN[3],d_1[6],d_0[6],RFI[3], reset);
TH22 TH03 (RFI[3], RFD[4], RFO);
reg1b_initNULL reg4 (d_1[8],d_0[8],RFD[4],d_1[7],d_0[7],RFN[4], reset);
reg1b_initData reg4 (d_1[9],d_0[9],RFN[4],d_1[8],d_0[8],RFI[4], reset);
TH22 TH04 (RFI[4], RFD[5], RFO);
reg1b_initNULL reg5 (d_1[10],d_0[10],RFD[5],d_1[9],d_0[9],RFN[5], reset);
reg1b_initData reg5 (d_1[11],d_0[11],RFN[5],d_1[10],d_0[10],RFI[5], reset);
TH22 TH05 (RFI[6], RFD[6], RFO);
reg1b_initNULL reg6 (d_1[12],d_0[12],RFD[6],d_1[11],d_0[11],RFN[6], reset);
reg1b_initData reg6 (d_1[13],d_0[13],RFN[6],d_1[12],d_0[12],RFI[6], reset);
TH22 TH06 (RFI[7], RFD[7], RFO);
reg1b_initNULL reg7 (d_1[14],d_0[14],RFD[7],d_1[13],d_0[13],RFN[7], reset);
reg1b_initData reg7 (d_1[15],d_0[15],RFN[7],d_1[14],d_0[14],RFI[7], reset);
TH22 TH07 (RFI[8], RFD[8], RFO);
reg1b_initNULL reg8 (d_1[16],d_0[16],RFD[8],d_1[15],d_0[15],RFN[8], reset);
reg1b_initData reg8 (d_1[17],d_0[17],RFN[8],d_1[16],d_0[16],RFI[8], reset);
TH22 TH08 (RFI[9], RFD[9], RFO);
reg1b_initNULL reg9 (d_1[18],d_0[18],RFD[9],d_1[17],d_0[17],RFN[9], reset);
reg1b_initData reg9 (d_1[19],d_0[19],RFN[9],d_1[18],d_0[18],RFI[9], reset);
TH22 TH09 (RFI[10], RFD[10], RFO);
reg1b_initNULL reg10 (d_1[20],d_0[20],RFD[10],d_1[19],d_0[19],RFN[10], reset);
reg1b_initData reg10 (d_1[21],d_0[21],RFN[10],d_1[20],d_0[20],RFI[10], reset);
TH22 TH10 (RFI[11], RFD[11], RFO);
reg1b_initNULL reg11 (d_1[22],d_0[22],RFD[11],d_1[21],d_0[21],RFN[11], reset);
reg1b_initData reg11 (d_1[23],d_0[23],RFN[11],d_1[22],d_0[22],RFI[11], reset);
TH22 TH11 (RFI[12], RFD[12], RFO);
reg1b_initNULL reg12 (d_1[24],d_0[24],RFD[12],d_1[23],d_0[23],RFN[12], reset);
reg1b_initData reg12 (d_1[25],d_0[25],RFN[12],d_1[24],d_0[24],RFI[12], reset);
TH22 TH12 (RFI[13], RFD[13], RFO);
reg1b_initNULL reg13 (d_1[26],d_0[26],RFD[13],d_1[25],d_0[25],RFN[13], reset);
reg1b_initData reg13 (d_1[27],d_0[27],RFN[13],d_1[26],d_0[26],RFI[13], reset);
TH22 TH13 (RFI[14], RFD[14], RFO);
reg1b_initNULL reg14 (d_1[28],d_0[28],RFD[14],d_1[27],d_0[27],RFN[14], reset);
reg1b_initData reg14 (d_1[29],d_0[29],RFN[14],d_1[28],d_0[28],RFI[14], reset);
TH22 TH14 (RFI[15], RFD[15], RFO);
reg1b_initNULL reg15 (d_1[30],d_0[30],RFD[15],d_1[29],d_0[29],RFN[15], reset);
reg1b_initData reg15 (d_1[31],d_0[31],RFN[15],d_1[30],d_0[30],RFI[15], reset);
TH22 TH15 (RFI[16], RFD[16], RFO);
reg1b_initNULL reg16 (d_1[32],d_0[32],RFD[16],d_1[31],d_0[31],RFN[16], reset);
reg1b_initData reg16 (d_1[33],d_0[33],RFN[16],d_1[32],d_0[32],RFI[16], reset);
TH22 TH16 (RFI[17], RFD[17], RFO);
reg1b_initNULL reg17 (d_1[34],d_0[34],RFD[17],d_1[33],d_0[33],RFN[17], reset);
reg1b_initData reg17 (d_1[35],d_0[35],RFN[17],d_1[34],d_0[34],RFI[17], reset);
TH22 TH17 (RFI[18], RFD[18], RFO);
reg1b_initNULL reg18 (d_1[36],d_0[36],RFD[18],d_1[35],d_0[35],RFN[18], reset);
reg1b_initData reg18 (d_1[37],d_0[37],RFN[18],d_1[36],d_0[36],RFI[18], reset);
TH22 TH18 (RFI[19], RFD[19], RFO);
reg1b_initNULL reg19 (d_1[38],d_0[38],RFD[19],d_1[37],d_0[37],RFN[19], reset);
reg1b_initData reg19 (d_1[39],d_0[39],RFN[19],d_1[38],d_0[38],RFI[19], reset);
TH22 TH19 (RFI[20], RFD[20], RFO);
reg1b_initNULL reg20 (d_1[40],d_0[40],RFD[20],d_1[39],d_0[39],RFN[20], reset);
reg1b_initData reg20 (d_1[41],d_0[41],RFN[20],d_1[40],d_0[40],RFI[20], reset);
TH22 TH20 (RFI[21], RFD[21], RFO);
reg1b_initNULL reg21 (d_1[42],d_0[42],RFD[21],d_1[41],d_0[41],RFN[21], reset);
reg1b_initData reg21 (d_1[43],d_0[43],RFN[21],d_1[42],d_0[42],RFI[21], reset);
TH22 TH21 (RFI[22], RFD[22], RFO);
reg1b_initNULL reg22 (d_1[44],d_0[44],RFD[22],d_1[43],d_0[43],RFN[22], reset);
reg1b_initData reg22 (d_1[45],d_0[45],RFN[22],d_1[44],d_0[44],RFI[22], reset);
TH22 TH22 (RFI[23], RFD[23], RFO);
reg1b_initNULL reg23 (d_1[46],d_0[46],RFD[23],d_1[45],d_0[45],RFN[23], reset);
reg1b_initData reg23 (d_1[47],d_0[47],RFN[23],d_1[46],d_0[46],RFI[23], reset);
TH22 TH23 (RFI[24], RFD[24], RFO);
reg1b_initNULL reg24 (d_1[48],d_0[48],RFD[24],d_1[47],d_0[47],RFN[24], reset);
reg1b_initData reg24 (d_1[49],d_0[49],RFN[24],d_1[48],d_0[48],RFI[24], reset);
TH22 TH24 (RFI[25], RFD[25], RFO);
reg1b_initNULL reg25 (d_1[50],d_0[50],RFD[25],d_1[49],d_0[49],RFN[25], reset);
reg1b_initData reg25 (d_1[51],d_0[51],RFN[25],d_1[50],d_0[50],RFI[25], reset);
TH22 TH25 (RFI[26], RFD[26], RFO);
reg1b_initNULL reg26 (d_1[52],d_0[52],RFD[26],d_1[51],d_0[51],RFN[26], reset);
reg1b_initData reg26 (d_1[53],d_0[53],RFN[26],d_1[52],d_0[52],RFI[26], reset);
TH22 TH26 (RFI[27], RFD[27], RFO);
reg1b_initNULL reg27 (d_1[54],d_0[54],RFD[27],d_1[53],d_0[53],RFN[27], reset);
reg1b_initData reg27 (d_1[55],d_0[55],RFN[27],d_1[54],d_0[54],RFI[27], reset);
TH22 TH27 (RFI[28], RFD[28], RFO);
reg1b_initNULL reg28 (d_1[56],d_0[56],RFD[28],d_1[55],d_0[55],RFN[28], reset);
reg1b_initData reg28 (d_1[57],d_0[57],RFN[28],d_1[56],d_0[56],RFI[28], reset);
TH22 TH28 (RFI[29], RFD[29], RFO);
reg1b_initNULL reg29 (d_1[58],d_0[58],RFD[29],d_1[57],d_0[57],RFN[29], reset);
reg1b_initData reg29 (d_1[59],d_0[59],RFN[29],d_1[58],d_0[58],RFI[29], reset);
TH22 TH29 (RFI[30], RFD[30], RFO);
reg1b_initNULL reg30 (d_1[60],d_0[60],RFD[30],d_1[59],d_0[59],RFN[30], reset);
reg1b_initData reg30 (d_1[61],d_0[61],RFN[30],d_1[60],d_0[60],RFI[30], reset);
TH22 TH30 (RFI[31], RFD[31], RFO);
reg1b_initNULL reg31 (d_1[62],d_0[62],RFD[31],d_1[61],d_0[61],RFN[31], reset);
reg1b_initData reg31 (d_1[63],d_0[63],RFN[31],d_1[62],d_0[62],RFI[31], reset);
TH22 TH31 (RFI[32], RFD[32], RFO);

//32-Bit Binary Null Convention Logic Multiplier
booleanAND booleanAND_0 ( .Z_0 (multiplierOut_0[0]), .Z_1 (multiplierOut_1[0]), .X_0 (coeff_0[0]), .X_1 (coeff_1[0]), .W_0 (d_0[1]), .W_1 (d_1[1]));
booleanAND booleanAND_1 ( .Z_0 (multiplierOut_0[1]), .Z_1 (multiplierOut_1[1]), .X_0 (coeff_0[1]), .X_1 (coeff_1[1]), .W_0 (d_0[3]), .W_1 (d_1[3]));
booleanAND booleanAND_2 ( .Z_0 (multiplierOut_0[2]), .Z_1 (multiplierOut_1[2]), .X_0 (coeff_0[2]), .X_1 (coeff_1[2]), .W_0 (d_0[5]), .W_1 (d_1[5]));
booleanAND booleanAND_3 ( .Z_0 (multiplierOut_0[3]), .Z_1 (multiplierOut_1[3]), .X_0 (coeff_0[3]), .X_1 (coeff_1[3]), .W_0 (d_0[7]), .W_1 (d_1[7]));
booleanAND booleanAND_4 ( .Z_0 (multiplierOut_0[4]), .Z_1 (multiplierOut_1[4]), .X_0 (coeff_0[4]), .X_1 (coeff_1[4]), .W_0 (d_0[9]), .W_1 (d_1[9]));
booleanAND booleanAND_5 ( .Z_0 (multiplierOut_0[5]), .Z_1 (multiplierOut_1[5]), .X_0 (coeff_0[5]), .X_1 (coeff_1[5]), .W_0 (d_0[11]), .W_1 (d_1[11]));
booleanAND booleanAND_6 ( .Z_0 (multiplierOut_0[6]), .Z_1 (multiplierOut_1[6]), .X_0 (coeff_0[6]), .X_1 (coeff_1[6]), .W_0 (d_0[13]), .W_1 (d_1[13]));
booleanAND booleanAND_7 ( .Z_0 (multiplierOut_0[7]), .Z_1 (multiplierOut_1[7]), .X_0 (coeff_0[7]), .X_1 (coeff_1[7]), .W_0 (d_0[15]), .W_1 (d_1[15]));
booleanAND booleanAND_8 ( .Z_0 (multiplierOut_0[8]), .Z_1 (multiplierOut_1[8]), .X_0 (coeff_0[8]), .X_1 (coeff_1[8]), .W_0 (d_0[17]), .W_1 (d_1[17]));
booleanAND booleanAND_9 ( .Z_0 (multiplierOut_0[9]), .Z_1 (multiplierOut_1[9]), .X_0 (coeff_0[9]), .X_1 (coeff_1[9]), .W_0 (d_0[19]), .W_1 (d_1[19]));
booleanAND booleanAND_10 ( .Z_0 (multiplierOut_0[10]), .Z_1 (multiplierOut_1[10]), .X_0 (coeff_0[10]), .X_1 (coeff_1[10]), .W_0 (d_0[21]), .W_1 (d_1[21]));
booleanAND booleanAND_11 ( .Z_0 (multiplierOut_0[11]), .Z_1 (multiplierOut_1[11]), .X_0 (coeff_0[11]), .X_1 (coeff_1[11]), .W_0 (d_0[23]), .W_1 (d_1[23]));
booleanAND booleanAND_12 ( .Z_0 (multiplierOut_0[12]), .Z_1 (multiplierOut_1[12]), .X_0 (coeff_0[12]), .X_1 (coeff_1[12]), .W_0 (d_0[25]), .W_1 (d_1[25]));
booleanAND booleanAND_13 ( .Z_0 (multiplierOut_0[13]), .Z_1 (multiplierOut_1[13]), .X_0 (coeff_0[13]), .X_1 (coeff_1[13]), .W_0 (d_0[27]), .W_1 (d_1[27]));
booleanAND booleanAND_14 ( .Z_0 (multiplierOut_0[14]), .Z_1 (multiplierOut_1[14]), .X_0 (coeff_0[14]), .X_1 (coeff_1[14]), .W_0 (d_0[29]), .W_1 (d_1[29]));
booleanAND booleanAND_15 ( .Z_0 (multiplierOut_0[15]), .Z_1 (multiplierOut_1[15]), .X_0 (coeff_0[15]), .X_1 (coeff_1[15]), .W_0 (d_0[31]), .W_1 (d_1[31]));
booleanAND booleanAND_16 ( .Z_0 (multiplierOut_0[16]), .Z_1 (multiplierOut_1[16]), .X_0 (coeff_0[16]), .X_1 (coeff_1[16]), .W_0 (d_0[33]), .W_1 (d_1[33]));
booleanAND booleanAND_17 ( .Z_0 (multiplierOut_0[17]), .Z_1 (multiplierOut_1[17]), .X_0 (coeff_0[17]), .X_1 (coeff_1[17]), .W_0 (d_0[35]), .W_1 (d_1[35]));
booleanAND booleanAND_18 ( .Z_0 (multiplierOut_0[18]), .Z_1 (multiplierOut_1[18]), .X_0 (coeff_0[18]), .X_1 (coeff_1[18]), .W_0 (d_0[37]), .W_1 (d_1[37]));
booleanAND booleanAND_19 ( .Z_0 (multiplierOut_0[19]), .Z_1 (multiplierOut_1[19]), .X_0 (coeff_0[19]), .X_1 (coeff_1[19]), .W_0 (d_0[39]), .W_1 (d_1[39]));
booleanAND booleanAND_20 ( .Z_0 (multiplierOut_0[20]), .Z_1 (multiplierOut_1[20]), .X_0 (coeff_0[20]), .X_1 (coeff_1[20]), .W_0 (d_0[41]), .W_1 (d_1[41]));
booleanAND booleanAND_21 ( .Z_0 (multiplierOut_0[21]), .Z_1 (multiplierOut_1[21]), .X_0 (coeff_0[21]), .X_1 (coeff_1[21]), .W_0 (d_0[43]), .W_1 (d_1[43]));
booleanAND booleanAND_22 ( .Z_0 (multiplierOut_0[22]), .Z_1 (multiplierOut_1[22]), .X_0 (coeff_0[22]), .X_1 (coeff_1[22]), .W_0 (d_0[45]), .W_1 (d_1[45]));
booleanAND booleanAND_23 ( .Z_0 (multiplierOut_0[23]), .Z_1 (multiplierOut_1[23]), .X_0 (coeff_0[23]), .X_1 (coeff_1[23]), .W_0 (d_0[47]), .W_1 (d_1[47]));
booleanAND booleanAND_24 ( .Z_0 (multiplierOut_0[24]), .Z_1 (multiplierOut_1[24]), .X_0 (coeff_0[24]), .X_1 (coeff_1[24]), .W_0 (d_0[49]), .W_1 (d_1[49]));
booleanAND booleanAND_25 ( .Z_0 (multiplierOut_0[25]), .Z_1 (multiplierOut_1[25]), .X_0 (coeff_0[25]), .X_1 (coeff_1[25]), .W_0 (d_0[51]), .W_1 (d_1[51]));
booleanAND booleanAND_26 ( .Z_0 (multiplierOut_0[26]), .Z_1 (multiplierOut_1[26]), .X_0 (coeff_0[26]), .X_1 (coeff_1[26]), .W_0 (d_0[53]), .W_1 (d_1[53]));
booleanAND booleanAND_27 ( .Z_0 (multiplierOut_0[27]), .Z_1 (multiplierOut_1[27]), .X_0 (coeff_0[27]), .X_1 (coeff_1[27]), .W_0 (d_0[55]), .W_1 (d_1[55]));
booleanAND booleanAND_28 ( .Z_0 (multiplierOut_0[28]), .Z_1 (multiplierOut_1[28]), .X_0 (coeff_0[28]), .X_1 (coeff_1[28]), .W_0 (d_0[57]), .W_1 (d_1[57]));
booleanAND booleanAND_29 ( .Z_0 (multiplierOut_0[29]), .Z_1 (multiplierOut_1[29]), .X_0 (coeff_0[29]), .X_1 (coeff_1[29]), .W_0 (d_0[59]), .W_1 (d_1[59]));
booleanAND booleanAND_30 ( .Z_0 (multiplierOut_0[30]), .Z_1 (multiplierOut_1[30]), .X_0 (coeff_0[30]), .X_1 (coeff_1[30]), .W_0 (d_0[61]), .W_1 (d_1[61]));
booleanAND booleanAND_31 ( .Z_0 (multiplierOut_0[31]), .Z_1 (multiplierOut_1[31]), .X_0 (coeff_0[31]), .X_1 (coeff_1[31]), .W_0 (d_0[63]), .W_1 (d_1[63]));

//stage1
halfAdder halfAdder_1_0 ( .Sum_0 (addStage1Sum_0[0]), .Sum_1 (addStage1Sum_1[0]), .Cout_0 (addStage1Cout_0[0]), .Cout_1 (addStage1Cout_1[0]), .A_0 (multiplierOut_0[0]), .A_1 (multiplierOut_1[0]), .B_0 (multiplierOut_0[1]), .B_1 (multiplierOut_1[1]));
halfAdder halfAdder_1_1 ( .Sum_0 (addStage1Sum_0[1]), .Sum_1 (addStage1Sum_1[1]), .Cout_0 (addStage1Cout_0[1]), .Cout_1 (addStage1Cout_1[1]), .A_0 (multiplierOut_0[2]), .A_1 (multiplierOut_1[2]), .B_0 (multiplierOut_0[3]), .B_1 (multiplierOut_1[3]));
halfAdder halfAdder_1_2 ( .Sum_0 (addStage1Sum_0[2]), .Sum_1 (addStage1Sum_1[2]), .Cout_0 (addStage1Cout_0[2]), .Cout_1 (addStage1Cout_1[2]), .A_0 (multiplierOut_0[4]), .A_1 (multiplierOut_1[4]), .B_0 (multiplierOut_0[5]), .B_1 (multiplierOut_1[5]));
halfAdder halfAdder_1_3 ( .Sum_0 (addStage1Sum_0[3]), .Sum_1 (addStage1Sum_1[3]), .Cout_0 (addStage1Cout_0[3]), .Cout_1 (addStage1Cout_1[3]), .A_0 (multiplierOut_0[6]), .A_1 (multiplierOut_1[6]), .B_0 (multiplierOut_0[7]), .B_1 (multiplierOut_1[7]));
halfAdder halfAdder_1_4 ( .Sum_0 (addStage1Sum_0[4]), .Sum_1 (addStage1Sum_1[4]), .Cout_0 (addStage1Cout_0[4]), .Cout_1 (addStage1Cout_1[4]), .A_0 (multiplierOut_0[8]), .A_1 (multiplierOut_1[8]), .B_0 (multiplierOut_0[9]), .B_1 (multiplierOut_1[9]));
halfAdder halfAdder_1_5 ( .Sum_0 (addStage1Sum_0[5]), .Sum_1 (addStage1Sum_1[5]), .Cout_0 (addStage1Cout_0[5]), .Cout_1 (addStage1Cout_1[5]), .A_0 (multiplierOut_0[10]), .A_1 (multiplierOut_1[10]), .B_0 (multiplierOut_0[11]), .B_1 (multiplierOut_1[11]));
halfAdder halfAdder_1_6 ( .Sum_0 (addStage1Sum_0[6]), .Sum_1 (addStage1Sum_1[6]), .Cout_0 (addStage1Cout_0[6]), .Cout_1 (addStage1Cout_1[6]), .A_0 (multiplierOut_0[12]), .A_1 (multiplierOut_1[12]), .B_0 (multiplierOut_0[13]), .B_1 (multiplierOut_1[13]));
halfAdder halfAdder_1_7 ( .Sum_0 (addStage1Sum_0[7]), .Sum_1 (addStage1Sum_1[7]), .Cout_0 (addStage1Cout_0[7]), .Cout_1 (addStage1Cout_1[7]), .A_0 (multiplierOut_0[14]), .A_1 (multiplierOut_1[14]), .B_0 (multiplierOut_0[15]), .B_1 (multiplierOut_1[15]));
halfAdder halfAdder_1_8 ( .Sum_0 (addStage1Sum_0[8]), .Sum_1 (addStage1Sum_1[8]), .Cout_0 (addStage1Cout_0[8]), .Cout_1 (addStage1Cout_1[8]), .A_0 (multiplierOut_0[16]), .A_1 (multiplierOut_1[16]), .B_0 (multiplierOut_0[17]), .B_1 (multiplierOut_1[17]));
halfAdder halfAdder_1_9 ( .Sum_0 (addStage1Sum_0[9]), .Sum_1 (addStage1Sum_1[9]), .Cout_0 (addStage1Cout_0[9]), .Cout_1 (addStage1Cout_1[9]), .A_0 (multiplierOut_0[18]), .A_1 (multiplierOut_1[18]), .B_0 (multiplierOut_0[19]), .B_1 (multiplierOut_1[19]));
halfAdder halfAdder_1_10 ( .Sum_0 (addStage1Sum_0[10]), .Sum_1 (addStage1Sum_1[10]), .Cout_0 (addStage1Cout_0[10]), .Cout_1 (addStage1Cout_1[10]), .A_0 (multiplierOut_0[20]), .A_1 (multiplierOut_1[20]), .B_0 (multiplierOut_0[21]), .B_1 (multiplierOut_1[21]));
halfAdder halfAdder_1_11 ( .Sum_0 (addStage1Sum_0[11]), .Sum_1 (addStage1Sum_1[11]), .Cout_0 (addStage1Cout_0[11]), .Cout_1 (addStage1Cout_1[11]), .A_0 (multiplierOut_0[22]), .A_1 (multiplierOut_1[22]), .B_0 (multiplierOut_0[23]), .B_1 (multiplierOut_1[23]));
halfAdder halfAdder_1_12 ( .Sum_0 (addStage1Sum_0[12]), .Sum_1 (addStage1Sum_1[12]), .Cout_0 (addStage1Cout_0[12]), .Cout_1 (addStage1Cout_1[12]), .A_0 (multiplierOut_0[24]), .A_1 (multiplierOut_1[24]), .B_0 (multiplierOut_0[25]), .B_1 (multiplierOut_1[25]));
halfAdder halfAdder_1_13 ( .Sum_0 (addStage1Sum_0[13]), .Sum_1 (addStage1Sum_1[13]), .Cout_0 (addStage1Cout_0[13]), .Cout_1 (addStage1Cout_1[13]), .A_0 (multiplierOut_0[26]), .A_1 (multiplierOut_1[26]), .B_0 (multiplierOut_0[27]), .B_1 (multiplierOut_1[27]));
halfAdder halfAdder_1_14 ( .Sum_0 (addStage1Sum_0[14]), .Sum_1 (addStage1Sum_1[14]), .Cout_0 (addStage1Cout_0[14]), .Cout_1 (addStage1Cout_1[14]), .A_0 (multiplierOut_0[28]), .A_1 (multiplierOut_1[28]), .B_0 (multiplierOut_0[29]), .B_1 (multiplierOut_1[29]));
halfAdder halfAdder_1_15 ( .Sum_0 (addStage1Sum_0[15]), .Sum_1 (addStage1Sum_1[15]), .Cout_0 (addStage1Cout_0[15]), .Cout_1 (addStage1Cout_1[15]), .A_0 (multiplierOut_0[30]), .A_1 (multiplierOut_1[30]), .B_0 (multiplierOut_0[31]), .B_1 (multiplierOut_1[31]));


//Stage 2 halfAdder halfAdder_2_0( .Sum_0 (addStage2Sum_0[116]), .Sum_1 (addStage2Sum_1[116]), .Cout_0 (addStage2Cout_0[116]), .Cout_1 (addStage2Cout_1[116]), .A_0 (addStage1Sum_0[116]), .A_1 (addStage1Sum_1[116]), .B_0 (addStage1Sum_0[1]), .B_1 (addStage1Sum_1[1])); halfAdder halfAdder_2_1( .Sum_0 (addStage2Sum_0[116]), .Sum_1 (addStage2Sum_1[116]), .Cout_0 (addStage2Cout_0[116]), .Cout_1 (addStage2Cout_1[116]), .A_0 (addStage1Sum_0[116]), .A_1 (addStage1Sum_1[116]), .B_0 (addStage1Sum_0[3]), .B_1 (addStage1Sum_1[3])); halfAdder halfAdder_2_2( .Sum_0 (addStage2Sum_0[4]), .Sum_1 (addStage2Sum_1[4]), .Cout_0 (addStage2Cout_0[4]), .Cout_1 (addStage2Cout_1[4]), .A_0 (addStage1Sum_0[4]), .A_1 (addStage1Sum_1[4]), .B_0 (addStage1Sum_0[5]), .B_1 (addStage1Sum_1[5])); halfAdder halfAdder_2_3( .Sum_0 (addStage2Sum_0[6]), .Sum_1 (addStage2Sum_1[6]), .Cout_0 (addStage2Cout_0[6]), .Cout_1 (addStage2Cout_1[6]), .A_0 (addStage1Sum_0[6]), .A_1 (addStage1Sum_1[6]), .B_0 (addStage1Sum_0[7]), .B_1 (addStage1Sum_1[7])); halfAdder halfAdder_2_4( .Sum_0 (addStage2Sum_0[8]), .Sum_1 (addStage2Sum_1[8]), .Cout_0 (addStage2Cout_0[8]), .Cout_1 (addStage2Cout_1[8]), .A_0 (addStage1Sum_0[8]), .A_1 (addStage1Sum_1[8]),

.B_0 (addStage1Sum_0[116]), .B_1 (addStage1Sum_1[116]));
halfAdder halfAdder_2_5( .Sum_0 (addStage2Sum_0[10]), .Sum_1 (addStage2Sum_1[10]), .Cout_0 (addStage2Cout_0[10]), .Cout_1 (addStage2Cout_1[10]), .A_0 (addStage1Sum_0[10]), .A_1 (addStage1Sum_1[10]), .B_0 (addStage1Sum_0[11]), .B_1 (addStage1Sum_1[11]));
halfAdder halfAdder_2_6( .Sum_0 (addStage2Sum_0[12]), .Sum_1 (addStage2Sum_1[12]), .Cout_0 (addStage2Cout_0[12]), .Cout_1 (addStage2Cout_1[12]), .A_0 (addStage1Sum_0[12]), .A_1 (addStage1Sum_1[12]), .B_0 (addStage1Sum_0[13]), .B_1 (addStage1Sum_1[13]));
halfAdder halfAdder_2_7( .Sum_0 (addStage2Sum_0[14]), .Sum_1 (addStage2Sum_1[14]), .Cout_0 (addStage2Cout_0[14]), .Cout_1 (addStage2Cout_1[14]), .A_0 (addStage1Sum_0[14]), .A_1 (addStage1Sum_1[14]), .B_0 (addStage1Sum_0[15]), .B_1 (addStage1Sum_1[15]));
fullAdder fullAdder_2_0( .Sum_0(addStage2Sum_0[1]), .Sum_1(addStage2Sum_1[1]), .Cout_0(addStage2Cout_0[1]), .Cout_1(addStage2Cout_1[1]), .A_0(addStage1Cout_0[116]), .A_1(addStage1Cout_1[116]), .B_0(addStage1Cout_0[1]), .B_1(addStage1Cout_1[1]), .Cin_0(addStage2Cout_0[116]), .Cin_1(addStage2Cout_1[116]));
fullAdder fullAdder_2_1( .Sum_0(addStage2Sum_0[3]), .Sum_1(addStage2Sum_1[3]), .Cout_0(addStage2Cout_0[3]), .Cout_1(addStage2Cout_1[3]),

.A_0(addStage1Cout_0[116]), .A_1(addStage1Cout_1[3]), .B_0(addStage1Cout_0[3]), .B_1(addStage1Cout_1[1]), .Cin_0(addStage2Cout_0[1]), .Cin_1(addStage2Cout_1[]));
fullAdder fullAdder_2_2( .Sum_0(addStage2Sum_0[5]), .Sum_1(addStage2Sum_1[5]), .Cout_0(addStage2Cout_0[5]), .Cout_1(addStage2Cout_1[5]), .A_0(addStage1Cout_0[4]), .A_1(addStage1Cout_1[4]), .B_0(addStage1Cout_0[5]), .B_1(addStage1Cout_1[5]), .Cin_0(addStage2Cout_0[116]), .Cin_1(addStage2Cout_1[116]));
fullAdder fullAdder_2_3( .Sum_0(addStage2Sum_0[7]), .Sum_1(addStage2Sum_1[7]), .Cout_0(addStage2Cout_0[7]), .Cout_1(addStage2Cout_1[7]), .A_0(addStage1Cout_0[6]), .A_1(addStage1Cout_1[6]), .B_0(addStage1Cout_0[7]), .B_1(addStage1Cout_1[7]), .Cin_0(addStage2Cout_0[3]), .Cin_1(addStage2Cout_1[3]));
fullAdder fullAdder_2_4( .Sum_0(addStage2Sum_0[116]), .Sum_1(addStage2Sum_1[116]), .Cout_0(addStage2Cout_0[116]), .Cout_1(addStage2Cout_1[116]), .A_0(addStage1Cout_0[8]), .A_1(addStage1Cout_1[8]), .B_0(addStage1Cout_0[116]), .B_1(addStage1Cout_1[116]), .Cin_0(addStage2Cout_0[4]), .Cin_1(addStage2Cout_1[4]));
fullAdder fullAdder_2_5( .Sum_0(addStage2Sum_0[11]), .Sum_1(addStage2Sum_1[11]), .Cout_0(addStage2Cout_0[11]), .Cout_1(addStage2Cout_1[11]), .A_0(addStage1Cout_0[10]), .A_1(addStage1Cout_1[10]),

.B_0(addStage1Cout_0[11]), .B_1(addStage1Cout_1[11]), .Cin_0(addStage2Cout_0[5]), .Cin_1(addStage2Cout_1[5]));
fullAdder fullAdder_2_6( .Sum_0(addStage2Sum_0[13]), .Sum_1(addStage2Sum_1[13]), .Cout_0(addStage2Cout_0[13]), .Cout_1(addStage2Cout_1[13]), .A_0(addStage1Cout_0[12]), .A_1(addStage1Cout_1[12]), .B_0(addStage1Cout_0[13]), .B_1(addStage1Cout_1[13]), .Cin_0(addStage2Cout_0[6]), .Cin_1(addStage2Cout_1[6]));
fullAdder fullAdder_2_7( .Sum_0(addStage2Sum_0[15]), .Sum_1(addStage2Sum_1[15]), .Cout_0(addStage2Cout_0[15]), .Cout_1(addStage2Cout_1[15]), .A_0(addStage1Cout_0[14]), .A_1(addStage1Cout_1[14]), .B_0(addStage1Cout_0[15]), .B_1(addStage1Cout_1[15]), .Cin_0(addStage2Cout_0[7]), .Cin_1(addStage2Cout_1[7]));

//Stage3
halfAdder halfAdder_3_0( .Sum_0 (addStage3Sum_0[116]), .Sum_1 (addStage3Sum_1[116]), .Cout_0 (addStage3Cout_0[116]), .Cout_1 (addStage3Cout_1[116]), .A_0 (addStage2Sum_0[116]), .A_1 (addStage2Sum_1[116]), .B_0 (addStage2Sum_0[116]), .B_1 (addStage2Sum_1[116]));
fullAdder fullAdder_3_0_0 ( .Sum_0(addStage3Sum_0[1]), .Sum_1(addStage3Sum_1[1]), .Cout_0(addStage3Cout_0[1]), .Cout_1(addStage3Cout_1[1]), .A_0(addStage2Sum_0[1]), .A_1(addStage2Sum_1[1]), .B_0(addStage2Sum_0[3]), .B_1(addStage2Sum_1[3]),

.Cin_0(addStage3Cout_0[116]), .Cin_1(addStage3Cout_1[116]));
fullAdder fullAdder_3_0_1 ( .Sum_0(addStage3Sum_0[116]), .Sum_1(addStage3Sum_1[116]), .Cout_0(addStage3Cout_0[116]), .Cout_1(addStage3Cout_1[116]), .A_0(addStage2Cout_0[1]), .A_1(addStage2Cout_1[1]), .B_0(addStage2Cout_0[3]), .B_1(addStage2Cout_1[3]), .Cin_0(addStage3Cout_0[1]), .Cin_1(addStage3Cout_1[1]));
halfAdder halfAdder_3_1( .Sum_0 (addStage3Sum_0[3]), .Sum_1 (addStage3Sum_1[3]), .Cout_0 (addStage3Cout_0[3]), .Cout_1 (addStage3Cout_1[3]), .A_0 (addStage2Sum_0[4]), .A_1 (addStage2Sum_1[4]), .B_0 (addStage2Sum_0[6]), .B_1 (addStage2Sum_1[6]));
fullAdder fullAdder_3_0_0 ( .Sum_0(addStage3Sum_0[4]), .Sum_1(addStage3Sum_1[4]), .Cout_0(addStage3Cout_0[4]), .Cout_1(addStage3Cout_1[4]), .A_0(addStage2Sum_0[5]), .A_1(addStage2Sum_1[5]), .B_0(addStage2Sum_0[7]), .B_1(addStage2Sum_1[7]), .Cin_0(addStage3Cout_0[3]), .Cin_1(addStage3Cout_1[3]));
fullAdder fullAdder_3_0_1 ( .Sum_0(addStage3Sum_0[5]), .Sum_1(addStage3Sum_1[5]), .Cout_0(addStage3Cout_0[5]), .Cout_1(addStage3Cout_1[5]), .A_0(addStage2Cout_0[5]), .A_1(addStage2Cout_1[5]), .B_0(addStage2Cout_0[7]), .B_1(addStage2Cout_1[7]), .Cin_0(addStage3Cout_0[4]), .Cin_1(addStage3Cout_1[4]));
halfAdder halfAdder_3_2(

.Sum_0 (addStage3Sum_0[6]), .Sum_1 (addStage3Sum_1[6]), .Cout_0 (addStage3Cout_0[6]), .Cout_1 (addStage3Cout_1[6]), .A_0 (addStage2Sum_0[8]), .A_1 (addStage2Sum_1[8]), .B_0 (addStage2Sum_0[10]), .B_1 (addStage2Sum_1[10]));
fullAdder fullAdder_3_0_0 ( .Sum_0(addStage3Sum_0[7]), .Sum_1(addStage3Sum_1[7]), .Cout_0(addStage3Cout_0[7]), .Cout_1(addStage3Cout_1[7]), .A_0(addStage2Sum_0[116]), .A_1(addStage2Sum_1[116]), .B_0(addStage2Sum_0[11]), .B_1(addStage2Sum_1[11]), .Cin_0(addStage3Cout_0[6]), .Cin_1(addStage3Cout_1[6]));
fullAdder fullAdder_3_0_1 ( .Sum_0(addStage3Sum_0[8]), .Sum_1(addStage3Sum_1[8]), .Cout_0(addStage3Cout_0[8]), .Cout_1(addStage3Cout_1[8]), .A_0(addStage2Cout_0[116]), .A_1(addStage2Cout_1[116]), .B_0(addStage2Cout_0[11]), .B_1(addStage2Cout_1[11]), .Cin_0(addStage3Cout_0[7]), .Cin_1(addStage3Cout_1[7]));
halfAdder halfAdder_3_3( .Sum_0 (addStage3Sum_0[116]), .Sum_1 (addStage3Sum_1[116]), .Cout_0 (addStage3Cout_0[116]), .Cout_1 (addStage3Cout_1[116]), .A_0 (addStage2Sum_0[12]), .A_1 (addStage2Sum_1[12]), .B_0 (addStage2Sum_0[14]), .B_1 (addStage2Sum_1[14]));
fullAdder fullAdder_3_0_0 ( .Sum_0(addStage3Sum_0[10]), .Sum_1(addStage3Sum_1[10]), .Cout_0(addStage3Cout_0[10]), .Cout_1(addStage3Cout_1[10]), .A_0(addStage2Sum_0[13]), .A_1(addStage2Sum_1[13]),

.B_0(addStage2Sum_0[15]), .B_1(addStage2Sum_1[15]), .Cin_0(addStage3Cout_0[116]), .Cin_1(addStage3Cout_1[116]));
fullAdder fullAdder_3_0_1 ( .Sum_0(addStage3Sum_0[11]), .Sum_1(addStage3Sum_1[11]), .Cout_0(addStage3Cout_0[11]), .Cout_1(addStage3Cout_1[11]), .A_0(addStage2Cout_0[13]), .A_1(addStage2Cout_1[13]), .B_0(addStage2Cout_0[15]), .B_1(addStage2Cout_1[15]), .Cin_0(addStage3Cout_0[10]), .Cin_1(addStage3Cout_1[10]));

//Stage4
halfAdder halfAdder_4_0( .Sum_0 (addStage4Sum_0[116]), .Sum_1 (addStage4Sum_1[116]), .Cout_0 (addStage4Cout_0[116]), .Cout_1 (addStage4Cout_1[116]), .A_0 (addStage3Sum_0[116]), .A_1 (addStage3Sum_1[116]), .B_0 (addStage3Sum_0[3]), .B_1 (addStage3Sum_1[3]));
fullAdder fullAdder_4_0_0 ( .Sum_0(addStage4Sum_0[1]), .Sum_1(addStage4Sum_1[1]), .Cout_0(addStage4Cout_0[1]), .Cout_1(addStage4Cout_1[1]), .A_0(addStage3Sum_0[1]), .A_1(addStage3Sum_1[1]), .B_0(addStage3Sum_0[4]), .B_1(addStage3Sum_1[4]), .Cin_0(addStage4Cout_0[116]), .Cin_1(addStage4Cout_1[116]));
fullAdder fullAdder_4_0_1 ( .Sum_0(addStage4Sum_0[116]), .Sum_1(addStage4Sum_1[116]), .Cout_0(addStage4Cout_0[116]), .Cout_1(addStage4Cout_1[116]), .A_0(addStage3Sum_0[116]), .A_1(addStage3Sum_1[116]), .B_0(addStage3Sum_0[5]), .B_1(addStage3Sum_1[5]),

.Cin_0(addStage4Cout_0[1]), .Cin_1(addStage4Cout_1[1]));
fullAdder fullAdder_4_0_2 ( .Sum_0(addStage4Sum_0[3]), .Sum_1(addStage4Sum_1[3]), .Cout_0(addStage4Cout_0[3]), .Cout_1(addStage4Cout_1[3]), .A_0(addStage3Cout_0[116]), .A_1(addStage3Cout_1[116]), .B_0(addStage3Cout_0[5]), .B_1(addStage3Cout_1[5]), .Cin_0(addStage4Cout_0[116]), .Cin_1(addStage4Cout_1[116]));
halfAdder halfAdder_4_0( .Sum_0 (addStage4Sum_0[4]), .Sum_1 (addStage4Sum_1[4]), .Cout_0 (addStage4Cout_0[4]), .Cout_1 (addStage4Cout_1[4]), .A_0 (addStage3Sum_0[6]), .A_1 (addStage3Sum_1[6]), .B_0 (addStage3Sum_0[116]), .B_1 (addStage3Sum_1[116]));
fullAdder fullAdder_4_0_0 ( .Sum_0(addStage4Sum_0[5]), .Sum_1(addStage4Sum_1[5]), .Cout_0(addStage4Cout_0[5]), .Cout_1(addStage4Cout_1[5]), .A_0(addStage3Sum_0[7]), .A_1(addStage3Sum_1[7]), .B_0(addStage3Sum_0[10]), .B_1(addStage3Sum_1[10]), .Cin_0(addStage4Cout_0[4]), .Cin_1(addStage4Cout_1[4]));
fullAdder fullAdder_4_0_1 ( .Sum_0(addStage4Sum_0[6]), .Sum_1(addStage4Sum_1[6]), .Cout_0(addStage4Cout_0[6]), .Cout_1(addStage4Cout_1[6]), .A_0(addStage3Sum_0[8]), .A_1(addStage3Sum_1[8]), .B_0(addStage3Sum_0[11]), .B_1(addStage3Sum_1[11]), .Cin_0(addStage4Cout_0[5]), .Cin_1(addStage4Cout_1[5]));
fullAdder fullAdder_4_0_2 (

.Sum_0(addStage4Sum_0[7]), .Sum_1(addStage4Sum_1[7]), .Cout_0(addStage4Cout_0[7]), .Cout_1(addStage4Cout_1[7]), .A_0(addStage3Cout_0[8]), .A_1(addStage3Cout_1[8]), .B_0(addStage3Cout_0[11]), .B_1(addStage3Cout_1[11]), .Cin_0(addStage4Cout_0[6]), .Cin_1(addStage4Cout_1[6]));

//Stage5
halfAdder halfAdder_4_0( .Sum_0 (addStage5Sum_0[116]), .Sum_1 (addStage5Sum_1[116]), .Cout_0 (addStage5Cout_0[116]), .Cout_1 (addStage5Cout_1[116]), .A_0 (addStage4Sum_0[116]), .A_1 (addStage4Sum_1[116]), .B_0 (addStage4Sum_0[4]), .B_1 (addStage4Sum_1[4]));
fullAdder fullAdder_4_0_0 ( .Sum_0(addStage5Sum_0[1]), .Sum_1(addStage5Sum_1[1]), .Cout_0(addStage5Cout_0[1]), .Cout_1(addStage5Cout_1[1]), .A_0(addStage4Sum_0[1]), .A_1(addStage4Sum_1[1]), .B_0(addStage4Sum_0[5]), .B_1(addStage4Sum_1[5]), .Cin_0(addStage5Cout_0[116]), .Cin_1(addStage5Cout_1[116]));
fullAdder fullAdder_4_0_1 ( .Sum_0(addStage5Sum_0[116]), .Sum_1(addStage5Sum_1[116]), .Cout_0(addStage5Cout_0[116]), .Cout_1(addStage5Cout_1[116]), .A_0(addStage4Sum_0[116]), .A_1(addStage4Sum_1[116]), .B_0(addStage4Sum_0[6]), .B_1(addStage4Sum_1[6]), .Cin_0(addStage5Cout_0[1]), .Cin_1(addStage5Cout_1[1]));
fullAdder fullAdder_4_0_2 ( .Sum_0(addStage5Sum_0[3]), .Sum_1(addStage5Sum_1[3]),

.Cout_0(addStage5Cout_0[3]), .Cout_1(addStage5Cout_1[3]), .A_0(addStage4Sum_0[3]), .A_1(addStage4Sum_1[3]), .B_0(addStage4Sum_0[7]), .B_1(addStage4Sum_1[7]), .Cin_0(addStage5Cout_0[116]), .Cin_1(addStage5Cout_1[116]));
fullAdder fullAdder_4_0_2 ( .Sum_0(addStage5Sum_0[4]), .Sum_1(addStage5Sum_1[4]), .Cout_0(addStage5Cout_0[4]), .Cout_1(addStage5Cout_1[4]), .A_0(addStage4Cout_0[3]), .A_1(addStage4Cout_1[3]), .B_0(addStage4Cout_0[7]), .B_1(addStage4Cout_1[7]), .Cin_0(addStage5Cout_0[3]), .Cin_1(addStage5Cout_1[3]));

// output register
reg11b_initNULL reg11b_initNULL_0 (
.din_0(addStage5Sum_0[116]), .din_1(addStage5Sum_1[116]), .din_2(addStage5Sum_0[1]), .din_3(addStage5Sum_1[1]), .din_4(addStage5Sum_0[116]), .din_5(addStage5Sum_1[116]),
.din_6(addStage5Sum_0[3]), .din_7(addStage5Sum_1[3]), .din_8(addStage5Sum_0[4]), .din_9(addStage5Sum_1[4]), .din_10(addStage5Cout_0[4]), .din_11(addStage5Cout_1[4]),
.ki(reqin), .reset(reset),
.dout_0(filterOut_0[116]), .dout_1(filterOut_0[1]), .dout_2(filterOut_0[116]), .dout_3(filterOut_1[3]),
.dout_4(filterOut_0[4]), .dout_5(filterOut_1[5]), .dout_6(filterOut_0[6]), .dout_7(filterOut_1[7]),
.dout_8(filterOut_0[8]), .dout_9(filterOut_1[116]), .dout_10(filterOut_0[10]), .dout_11(filterOut_1[11]),
.ko(RFO));
endmodule

Appendix – D: FIR MATLAB Code

MATLAB code for a Finite Impulse Response digital low-pass filter, composed of a 32-bit serial-to-parallel shift register followed by the multiply and accumulate stages.

%input quantized signal
d=[obtained from MATLAB Filter Design in Step 5.2];
i=32;
C = cell(1,i);
for (m=1:i)

%Shift Register
g= circshift(d,-m);

d_1(m)=g(1); d_2(m)=g(2); d_3(m)=g(3); d_4(m)=g(4); d_5(m)=g(5); d_6(m)=g(6); d_7(m)=g(7); d_8(m)=g(8); d_9(m)=g(9); d_10(m)=g(10); d_11(m)=g(11); d_12(m)=g(12); d_13(m)=g(13); d_14(m)=g(14); d_15(m)=g(15); d_16(m)=g(16); d_17(m)=g(17); d_18(m)=g(18); d_19(m)=g(19); d_20(m)=g(20); d_21(m)=g(21); d_22(m)=g(22); d_23(m)=g(23); d_24(m)=g(24); d_25(m)=g(25); d_26(m)=g(26); d_27(m)=g(27); d_28(m)=g(28); d_29(m)=g(29); d_30(m)=g(30);

d_31(m)=g(31); d_32(m)=g(32);

%Coefficients
c_0=0; c_1=0; c_2=1; c_3=0; c_4=0; c_5=0; c_6=0; c_7=0; c_8=0; c_9=0; c_10=0; c_11=0; c_12=1; c_13=0; c_14=1; c_15=0; c_16=0; c_17=0; c_18=0; c_19=1; c_20=0; c_21=0; c_22=0; c_23=0; c_24=1; c_25=0; c_26=0; c_27=0; c_28=0; c_29=0; c_30=0; c_31=0;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Multiplication
%MATLAB indices start at 1, so the current sample d_k(m) is used
m_0= and ( c_0,d_1(m)); m_1= and ( c_1,d_2(m)); m_2= and ( c_2,d_3(m)); m_3= and ( c_3,d_4(m)); m_4= and ( c_4,d_5(m)); m_5= and ( c_5,d_6(m)); m_6= and ( c_6,d_7(m)); m_7= and ( c_7,d_8(m)); m_8= and ( c_8,d_9(m)); m_9= and ( c_9,d_10(m));

m_10= and ( c_10,d_11(m)); m_11= and ( c_11,d_12(m)); m_12= and ( c_12,d_13(m)); m_13= and ( c_13,d_14(m)); m_14= and ( c_14,d_15(m)); m_15= and ( c_15,d_16(m)); m_16= and ( c_16,d_17(m)); m_17= and ( c_17,d_18(m)); m_18= and ( c_18,d_19(m)); m_19= and ( c_19,d_20(m)); m_20= and ( c_20,d_21(m)); m_21= and ( c_21,d_22(m)); m_22= and ( c_22,d_23(m)); m_23= and ( c_23,d_24(m)); m_24= and ( c_24,d_25(m)); m_25= and ( c_25,d_26(m)); m_26= and ( c_26,d_27(m)); m_27= and ( c_27,d_28(m)); m_28= and ( c_28,d_29(m)); m_29= and ( c_29,d_30(m)); m_30= and ( c_30,d_31(m)); m_31= and ( c_31,d_32(m));

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%Addition Stage %%Stage_0

%%Half Adders

S_00 = xor (m_0,m_1); c_00 = and (m_0, m_1);

S_01 = xor (m_2,m_3); c_01 = and (m_2, m_3);

S_02 = xor (m_4,m_5); c_02 = and (m_4, m_5);

S_03 = xor (m_6,m_7); c_03 = and (m_6, m_7);

S_04 = xor (m_8,m_9); c_04 = and (m_8, m_9);

S_05 = xor (m_10,m_11); c_05 = and (m_10, m_11);

S_06 = xor (m_12,m_13); c_06 = and (m_12, m_13);


S_07 = xor (m_14,m_15); c_07 = and (m_14, m_15);

S_08 = xor (m_16,m_17); c_08 = and (m_16, m_17);

S_09 = xor (m_18,m_19); c_09 = and (m_18, m_19);

S_010 = xor (m_20,m_21); c_010 = and (m_20, m_21);

S_011 = xor (m_22,m_23); c_011 = and (m_22, m_23);

S_012 = xor (m_24,m_25); c_012 = and (m_24, m_25);

S_013 = xor (m_26,m_27); c_013 = and (m_26, m_27);

S_014 = xor (m_28,m_29); c_014 = and (m_28, m_29);

S_015 = xor (m_30,m_31); c_015 = and (m_30, m_31);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1 %%Stage_1_0

%%HA_10

HA_S00_00 = xor (S_00,S_01); HA_c00_00 = and (S_00,S_01);

%%FA_10
FA_s00_01 = xor (c_00,c_01); FA_c00_01 = and (c_00,c_01);

FA_s00_02 = xor (FA_s00_01,HA_c00_00); FA_c00_02 = and (FA_s00_01,HA_c00_00);

FA_c00_03 = or (FA_c00_01,FA_c00_02); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_1

%%HA_11

HA_S01_00 = xor (S_02,S_03); HA_c01_00 = and (S_02,S_03);

%%FA_11
FA_s01_01 = xor (c_02,c_03); FA_c01_01 = and (c_02,c_03);

FA_s01_02 = xor (FA_s01_01,HA_c01_00); FA_c01_02 = and (FA_s01_01,HA_c01_00);

FA_c11_03 = or (FA_c01_01,FA_c01_02); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_2

%%HA_12

HA_S02_00 = xor (S_04,S_05); HA_c02_00 = and (S_04,S_05);

%%FA_12
FA_s02_01 = xor (c_04,c_05); FA_c02_01 = and (c_04,c_05);

FA_s02_02 = xor (FA_s02_01,HA_c02_00); FA_c02_02 = and (FA_s02_01,HA_c02_00);

FA_c02_03 = or (FA_c02_01,FA_c02_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_3

%%HA_13

HA_S03_00 = xor (S_06,S_07); HA_c03_00 = and (S_06,S_07);

%%FA_13
FA_s03_01 = xor (c_06,c_07); FA_c03_01 = and (c_06,c_07);

FA_s03_02 = xor (FA_s03_01,HA_c03_00); FA_c03_02 = and (FA_s03_01,HA_c03_00);

FA_c03_03 = or (FA_c03_01,FA_c03_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_4

%%HA_14

HA_S04_00 = xor (S_08,S_09); HA_c04_00 = and (S_08,S_09);

%%FA_14
FA_s04_01 = xor (c_08,c_09); FA_c04_01 = and (c_08,c_09);

FA_s04_02 = xor (FA_s04_01,HA_c04_00); FA_c04_02 = and (FA_s04_01,HA_c04_00);

FA_c04_03 = or (FA_c04_01,FA_c04_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%Stage_1_5 %%HA_15

HA_S05_00 = xor (S_010,S_011); HA_c05_00 = and (S_010,S_011);

%%FA_15
FA_s05_01 = xor (c_010,c_011); FA_c05_01 = and (c_010,c_011);

FA_s05_02 = xor (FA_s05_01,HA_c05_00); FA_c05_02 = and (FA_s05_01,HA_c05_00);

FA_c05_03 = or (FA_c05_01,FA_c05_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_6

%%HA_16

HA_S06_00 = xor (S_012,S_013); HA_c06_00 = and (S_012,S_013);

%%FA_16
FA_s06_01 = xor (c_012,c_013); FA_c06_01 = and (c_012,c_013);

FA_s06_02 = xor (FA_s06_01,HA_c06_00); FA_c06_02 = and (FA_s06_01,HA_c06_00);

FA_c06_03 = or (FA_c06_01,FA_c06_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_1_7

%%HA_17

HA_S07_00 = xor (S_014,S_015); HA_c07_00 = and (S_014,S_015);

%%FA_17
FA_s07_01 = xor (c_014,c_015); FA_c07_01 = and (c_014,c_015);

FA_s07_02 = xor (FA_s07_01,HA_c07_00); FA_c07_02 = and (FA_s07_01,HA_c07_00);

FA_c07_03 = or (FA_c07_01,FA_c07_02);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_2

%%Stage_2_0 %%HA0_0

HA_S20_00 = xor (HA_S00_00,HA_S01_00); HA_c20_00 = and (HA_S00_00,HA_S01_00);

%%FA0_1
FA_s20_00 = xor (FA_s00_02 ,FA_s01_02 ); FA_c20_00 = and (FA_s00_02 ,FA_s01_02 );

FA_s20_01 = xor (FA_s20_00,HA_c20_00); FA_c20_01 = and (FA_s20_00,HA_c20_00);

FA_c20_02 = or (FA_c20_00,FA_c20_01);

%%FA0_2
FA_s20_03 = xor (FA_c00_03,FA_c11_03); FA_c20_03 = and (FA_c00_03,FA_c11_03);

FA_s20_04 = xor (FA_s20_03,FA_c20_02); FA_c20_04 = and (FA_s20_03,FA_c20_02);

FA_c20_05 = or (FA_c20_03,FA_c20_04);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%Stage_2_1 %%HA1_1

HA_S21_00 = xor (HA_S02_00 ,HA_S03_00 ); HA_c21_00 = and (HA_S02_00 ,HA_S03_00 );

%%FA1_1
FA_s21_00 = xor (FA_s02_02 ,FA_s03_02 ); FA_c21_00 = and (FA_s02_02 ,FA_s03_02 );

FA_s21_01 = xor (FA_s21_00,HA_c21_00); FA_c21_01 = and (FA_s21_00,HA_c21_00);

FA_c21_02 = or (FA_c21_00,FA_c21_01);

%%FA1_2
FA_s21_03 = xor (FA_c02_03,FA_c03_03); FA_c21_03 = and (FA_c02_03,FA_c03_03);

FA_s21_04 = xor (FA_s21_03,FA_c21_02); FA_c21_04 = and (FA_s21_03,FA_c21_02);

FA_c21_05 = or (FA_c21_03,FA_c21_04);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%Stage_2_2 %%HA2_2

HA_S22_00 = xor (HA_S04_00,HA_S05_00); HA_c22_00 = and (HA_S04_00,HA_S05_00);

%%FA2_1
FA_s22_00 = xor (FA_s04_02 ,FA_s05_02 ); FA_c22_00 = and (FA_s04_02 ,FA_s05_02 );

FA_s22_01 = xor (FA_s22_00,HA_c22_00); FA_c22_01 = and (FA_s22_00,HA_c22_00);

FA_c22_02 = or (FA_c22_00,FA_c22_01);

%%FA2_2
FA_s22_03 = xor (FA_c04_03,FA_c05_03); FA_c22_03 = and (FA_c04_03,FA_c05_03);

FA_s22_04 = xor (FA_s22_03,FA_c22_02); FA_c22_04 = and (FA_s22_03,FA_c22_02);

FA_c22_05 = or (FA_c22_03,FA_c22_04);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%Stage_2_3 %%HA3_3

HA_S23_00 = xor (HA_S06_00 ,HA_S07_00 ); HA_c23_00 = and (HA_S06_00 ,HA_S07_00 );

%%FA3_1
FA_s23_00 = xor (FA_s06_02 ,FA_s07_02 ); FA_c23_00 = and (FA_s06_02 ,FA_s07_02 );

FA_s23_01 = xor (FA_s23_00,HA_c23_00); FA_c23_01 = and (FA_s23_00,HA_c23_00);

FA_c23_02 = or (FA_c23_00,FA_c23_01);

%%FA3_2
FA_s23_03 = xor (FA_c06_03,FA_c07_03); FA_c23_03 = and (FA_c06_03,FA_c07_03);

FA_s23_04 = xor (FA_s23_03,FA_c23_02); FA_c23_04 = and (FA_s23_03,FA_c23_02);

FA_c23_05 = or (FA_c23_03,FA_c23_04);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_3 %%Stage_3_0

%%HA0_0
HA_S30_00 = xor (HA_S20_00,HA_S21_00); HA_c30_00 = and (HA_S20_00,HA_S21_00);

%%FA0_1
FA_s31_00 = xor (FA_s20_01 ,FA_s21_01); FA_c31_00 = and (FA_s20_01 ,FA_s21_01);

FA_s31_01 = xor (FA_s31_00,HA_c30_00); FA_c31_01 = and (FA_s31_00,HA_c30_00);

FA_c31_02 = or (FA_c31_00,FA_c31_01);

%%FA0_2
FA_s32_03 = xor (FA_s20_04 ,FA_s21_04); FA_c32_03 = and (FA_s20_04 ,FA_s21_04);

FA_s32_04 = xor (FA_s32_03,FA_c31_02); FA_c32_04 = and (FA_s32_03,FA_c31_02);

FA_c32_05 = or (FA_c32_03,FA_c32_04);

%%FA0_3
FA_s33_06 = xor (FA_c20_05 ,FA_c21_05 ); FA_c33_06 = and (FA_c20_05 ,FA_c21_05 );

FA_s33_07 = xor (FA_s33_06,FA_c32_05); FA_c33_07 = and (FA_s33_06,FA_c32_05);


FA_c33_08 = or (FA_c33_06,FA_c33_07);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_3_1

%%HA0_0
HA_S34_00 = xor (HA_S22_00,HA_S23_00); HA_c34_00 = and (HA_S22_00,HA_S23_00);

%%FA0_1
FA_s35_00 = xor (FA_s22_01 ,FA_s23_01); FA_c35_00 = and (FA_s22_01 ,FA_s23_01);

FA_s35_01 = xor (FA_s35_00,HA_c34_00); FA_c35_01 = and (FA_s35_00,HA_c34_00);

FA_c35_02 = or (FA_c35_00,FA_c35_01);

%%FA0_2
FA_s36_03 = xor (FA_s22_04 ,FA_s23_04); FA_c36_03 = and (FA_s22_04 ,FA_s23_04);

FA_s36_04 = xor (FA_s36_03,FA_c35_02); FA_c36_04 = and (FA_s36_03,FA_c35_02);

FA_c36_05 = or (FA_c36_03,FA_c36_04);

%%FA0_3
FA_s37_06 = xor (FA_c22_05 ,FA_c23_05 ); FA_c37_06 = and (FA_c22_05 ,FA_c23_05 );

FA_s37_07 = xor (FA_s37_06,FA_c36_05); FA_c37_07 = and (FA_s37_06,FA_c36_05);

FA_c37_08 = or (FA_c37_06,FA_c37_07);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%Stage_4 %%Stage_4_0

%%HA0_0
HA_S44_00 = xor (HA_S30_00 ,HA_S34_00); HA_c44_00 = and (HA_S30_00 ,HA_S34_00);

%%FA0_1
FA_s45_00 = xor (FA_s31_01 ,FA_s35_01); FA_c45_00 = and (FA_s31_01 ,FA_s35_01);

FA_s45_01 = xor (FA_s45_00,HA_c44_00);

FA_c45_01 = and (FA_s45_00,HA_c44_00);

FA_c45_02 = or (FA_c45_00,FA_c45_01);

%%FA0_2
FA_s46_03 = xor (FA_s32_04 ,FA_s36_04); FA_c46_03 = and (FA_s32_04 ,FA_s36_04);

FA_s46_04 = xor (FA_s46_03,FA_c45_02); FA_c46_04 = and (FA_s46_03,FA_c45_02);

FA_c46_05 = or (FA_c46_03,FA_c46_04);

%%FA0_3
FA_s47_06 = xor (FA_s33_07 ,FA_s37_07); FA_c47_06 = and (FA_s33_07 ,FA_s37_07);

FA_s47_07 = xor (FA_s47_06,FA_c46_05); FA_c47_07 = and (FA_s47_06,FA_c46_05);

FA_c47_08 = or (FA_c47_06,FA_c47_07);

%%FA0_4
FA_s48_06 = xor (FA_c33_08 ,FA_c37_08); FA_c48_06 = and (FA_c33_08 ,FA_c37_08);

FA_s48_07 = xor (FA_s48_06,FA_c47_08); FA_c48_07 = and (FA_s48_06,FA_c47_08);

FA_c48_08 = or (FA_c48_06,FA_c48_07);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
f_0 (m) = HA_S44_00;
f_1 (m) = FA_s45_01;
f_2 (m) = FA_s46_04;
f_3 (m) = FA_s47_07;
f_4 (m) = FA_s48_07;
f_5 (m) = FA_c48_08;
end
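
The gate-level listing above can be cross-checked against a compact behavioural sketch. The version below is an illustration only, not the code used to produce the thesis results. It assumes a hypothetical 32-sample one-bit test vector d in place of the quantized signal from Step 5.2, holds the coefficients c_0 to c_31 in a single vector c, and uses the names y and f, which do not appear in the original. It forms the same AND products and counts them with sum, so each row of f should correspond to the bits f_5 down to f_0 produced by the adder tree for the same window.

```matlab
% Behavioural sketch of the multiply and accumulate stages.
% ASSUMPTION: d is a made-up 1-bit test vector, not the thesis input signal.
d = [1 0 1 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 1 0 1 1 1 0 0 1 0];

% Coefficients c_0 ... c_31 from the listing; c(k+1) holds c_k.
c = zeros(1, 32);
c([3 13 15 20 25]) = 1;            % c_2, c_12, c_14, c_19, c_24 are 1

n = 32;
y = zeros(1, n);                   % accumulated output per shift position
for m = 1:n
    g = circshift(d, -m, 2);       % same window as the shift register, along columns
    y(m) = sum(and(c, g));         % 1-bit multiply (AND), then accumulate
end

% Six-bit binary form of each output; columns are [f_5 f_4 f_3 f_2 f_1 f_0].
f = dec2bin(y, 6) - '0';
```

Because only five coefficients are non-zero, y never exceeds 5, so the six output bits of the listing are more than enough to hold the result.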
