
Chao, Hung-Hsiang Jonathan

PARALLEL/PIPELINE VLSI COMPUTING STRUCTURES FOR ROBOTICS APPLICATIONS

The Ohio State University Ph.D. 1985


Copyright 1985 by Chao, Hung-Hsiang Jonathan. All Rights Reserved.

Parallel/Pipeline VLSI Computing Structures for Robotics Applications

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

by

Hung-Hsiang Jonathan Chao, B.S., M.S.

The Ohio State University 1985

Reading Committee:

Karl Olson
David Orin
Fusun Ozguner

Approved by

Advisor
Department of Electrical Engineering

© 1985
HUNG-HSIANG JONATHAN CHAO
All Rights Reserved

To my parents and my wife

ACKNOWLEDGEMENTS

I would like to thank my advisor, Professor Karl W. Olson, for the constant support and constructive advice which he provided during my studies at The Ohio State University. Many of the results of this research were developed under his guidance. His contributions to the research and patience in reviewing this dissertation are also greatly appreciated.

I would like to thank Professor David E. Orin, who gave me many helpful suggestions from time to time and also reviewed this dissertation. I also would like to thank Professor Fusun Ozguner, one of the reading committee, for reviewing this dissertation. I am very much indebted to Mrs. Barbara S. Elberfeld for her careful, patient, and efficient proofreading of my manuscript. I am grateful to Ms. Debi Britton for her excellent work in preparing this manuscript.

Finally, I would like to thank my parents for their support, and most importantly, I wish to thank my wife, Yeichu, and daughter, Jessica, for the endurance, encouragement and love which they provided throughout my studies.

This research was supported by the National Science Foundation, Computer Engineering Grant No. DMC-8312677.

Hung-Hsiang Jonathan Chao

VITA

December 10, 1955 ...... Born -- Taipei, Taiwan, R.O.C.

June, 1977 ...... B.S., Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

June, 1980 ...... M.S., Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

1977-1981 ...... Design Engineer, Taiwan Telecommunication Laboratories, Switching System Group, Chungli, Taiwan, R.O.C.

1982-1985 ...... Graduate Research Associate Digital Systems Laboratory The Ohio State University Columbus, Ohio

1982-1985 ...... Graduate Teaching Associate Department of Electrical Engineering The Ohio State University Columbus, Ohio

VITA -- Continued

FIELDS OF STUDY

Major Field: Electrical Engineering

Studies in Computer Engineering: Professors K.W. Olson, D.E. Orin, F. Ozguner, K.J. Breeding, R.B. McGhee

Studies in Control Engineering: Professors R.E. Fenton, U. Ozguner

Studies in Communications: Professors D.T. Davis, R.T. Compton

Studies in Computer and Information Science: Professors M.T. Liu, V. Ashok, B.W. Weide

PUBLICATIONS

"The Design of Reliable Common Channel Signaling System in Time Division Digital Switching System," M.S. Thesis, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 1980.

"The Design of Control System in Time Division Digital Switching System-II," Journal of Taiwan Telecommunication Laboratories, Chungli, Taiwan, R.O.C., April 1981.

TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS ...... iii

VITA ...... iv

LIST OF FIGURES ...... xiii

LIST OF TABLES ...... xviii

Chapter

1 INTRODUCTION ...... 1

1.1 Project Background ...... 1

1.2 Previous Work ...... 3

1.3 Organization ...... 5

2 VLSI COMPUTING STRUCTURE ON ROBOTICS APPLICATIONS ...... 7

2.1 Introduction ...... 7

2.2 Inverse Plant Plus Jacobian Control ...... 7

2.3 Computer Architectures for Robotics ...... 9

2.4 VLSI Technology to Computer Architectures ...... 13

2.5 VLSI Technology ...... 15

2.6 Summary ...... 17

3 ARCHITECTURE OF THE ROBOTICS PROCESSOR ...... 18

3.1 Introduction ...... 18

3.2 Block Diagram of the RP ...... 18

3.3 Evolution of the Architectural Design of the RP Data Paths ...... 24

TABLE OF CONTENTS -- Continued

Chapter Page

3.3.1 Single-Bus Configuration ...... 24

3.3.2 Two-Bus Configuration ...... 26

3.3.3 Three-Bus Configuration ...... 26

3.3.4 Cross-Bar Network ...... 27

3.4 Summary ...... 29

4 APPLICATIONS OF THE ROBOTICS PROCESSOR ...... 33

4.1 Introduction ...... 33

4.2 Jacobian ...... 37

4.2.1 Complexity of Vector Operations ...... 39

4.2.2 Task Graph ...... 40

4.2.3 Architectures of the Jacobian ...... 43

4.2.3.1 1-Processor Architecture ...... 43

4.2.3.2 2-Processor Architecture ...... 46

4.2.3.3 N-Processor Architecture ...... 49

4.2.3.4 Cube Interconnection Network ...... 55

4.2.3.5 Comparison ...... 66

4.3 Inverse Jacobian ...... 67

4.3.1 Methods for Solving Linear Equations ...... 68

4.3.2 Architectures of the Inverse Jacobian ...... 70

4.3.2.1 1-Processor Architecture ...... 70

4.3.2.2 6-Processor Architecture ...... 72

4.3.2.3 12-Processor Architecture ...... 72

4.3.2.4 24-Processor Architecture ...... 75

TABLE OF CONTENTS -- Continued

Chapter Page

4.3.2.5 Comparison ...... 75

4.4 Inverse Dynamics ...... 78

4.4.1 Task Graph ...... 80

4.4.2 Architectures of the Inverse Dynamics ...... 80

4.4.2.1 1-Processor Architecture ...... 80

4.4.2.2 2-Processor Architecture ...... 80

4.4.2.3 N-Processor Architecture ...... 87

4.4.2.4 2N-Processor Architecture ...... 91

4.4.2.5 Comparison ...... 95

4.5 Summary ...... 95

5 CIRCUIT DESIGNS OF THE ROBOTICS PROCESSOR CHIP ...... 97

5.1 Introduction ...... 97

5.2 Clock Generator ...... 97

5.3 Bootstrap Unit and Format Converters ...... 100

5.4 Testability in the Chip ...... 103

5.4.1 Structured Design for Testability ...... 104

5.4.2 Level Sensitive Scan Design (LSSD) ...... 107

5.5 Floating Point Adder/Subtractor (FPA) ...... 110

5.5.1 Floating Point Format ...... 110

5.5.2 Algorithm and Block Diagram ...... 112

5.5.3 N-bit Adder/Subtractor ...... 116

5.6 Floating Point Multiplier (FPM) ...... 120

5.6.1 Algorithm and Block Diagram ...... 122

5.6.2 24-bit Fixed Point Multiplier ...... 125

TABLE OF CONTENTS -- Continued

Chapter Page

5.6.2.1 Sequential Add-Shift Multiplication ...... 126

5.6.2.2 Array Multiplier ...... 126

5.6.2.3 Nonadditive Multiply Modules (NMM) with Wallace Trees ...... 127

5.6.2.4 Additive Multiply Modules (AMM) ... 129

5.6.2.5 Recursive Parallel Multiplier ...... 129

5.6.2.6 Modified Booth Algorithm (Radix=4) with Carry-Save Adders ...... 131

5.6.2.7 Pipelined Recursive Multiplier with Modified Booth Algorithm ...... 135

5.7 Summary ...... 139

6 COMPUTER AIDED DESIGN FOR VLSI ...... 140

6.1 Introduction ...... 140

6.2 Overview of VLSI Design Tools ...... 141

6.3 Logic Circuit Description ...... 146

6.4 Logic Level Simulation ...... 152

6.5 Circuit Level Simulation ...... 158

6.6 Summary ...... 160

7 Summary and Conclusions...... 163

7.1 Summary ...... 163

7.2 Research Extensions ...... 166

REFERENCES ...... 171

APPENDIX A.1 Reservation Tables for Vector Operations ...... 176

APPENDIX A.2 Microprogram for Jacobian (one RP per Link) ...... 190

APPENDIX A.3 Calculation of the Measurement Parameters for Jacobian with P = 1 ...... 194

TABLE OF CONTENTS -- Continued

Chapter Page

APPENDIX A.4 Calculation of the Measurement Parameters for Jacobian with P = 2 ...... 195

APPENDIX A.5 Calculation of the Measurement Parameters for Jacobian with P = N ...... 197

APPENDIX A.6 To Find Brl ...... 198

APPENDIX A.7 Computation Complexity and Register Required for Vector Inner Products ...... 210

APPENDIX A.8 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 1 ...... 211

APPENDIX A.9 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 2 ...... 214

APPENDIX A.10 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 12 ...... 217

APPENDIX A.11 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 24 ...... 222

APPENDIX A.12 Microprogram for Forward Recursion of Inverse Dynamics (one RP per Link) ...... 228

APPENDIX A.13 Microprogram for Backward Recursion of Inverse Dynamics (one RP per Link) ...... 234

APPENDIX A.14 Calculation of the Measurement Parameters for Inverse Dynamics with P = 1 ...... 238

APPENDIX A.15 Calculation of the Measurement Parameters for Inverse Dynamics with P = 2 ...... 239

APPENDIX A.16 Calculation of the Measurement Parameters for Inverse Dynamics with P = N ...... 240

APPENDIX A.17 Calculation of the Measurement Parameters for Inverse Dynamics with P = 2N ...... 241

TABLE OF CONTENTS -- Continued

Chapter Page

Appendix B.1 Detailed Circuit Descriptions for Two-Phase Generators (TPG) and Two Johnson Counters, JCNTR and JCNTF ...... 242

Appendix B.2 Detailed Procedures for Loading Microprogram and Circuit Designs for the Synchronization Controller and Bootstrap Controller (SC+BTC) ...... 245

Appendix B.3 Detailed Circuit Designs for the Four Format Converters, FCE, FCS, FCW, and FCN ...... 250

Appendix B.4 Data Flow in the Data Path for Normal Arithmetic Operations ...... 253

Appendix B.5 Circuit Design of the Zero Checking Unit ...... 258

Appendix B.6 Circuit Design of the Sign Unit ...... 260

Appendix B.7 Circuit Design of the Alignment Control Unit ...... 262

Appendix B.8 Circuit Design of the 24-bit Shifter ...... 265

Appendix B.9 Detailed Explanation for the Postnormalization ...... 270

Appendix B.10 Circuit Design of the Leading Zero Detector ...... 273

Appendix B.11 Circuit Design of the Overflow/Underflow Unit ...... 274

Appendix B.12 Detailed Circuit Design of the Zero Checking Unit ...... 278

Appendix B.13 Detailed Circuit Design of the Over/Underflow Unit ...... 279

Appendix B.14 Logic Equations of U1, U2, U3, U4, and U5 in the 8-bit Multiplier with Modified Booth Algorithm ...... 284

Appendix B.15 Detailed Explanation for the Rounding Scheme Used in the 8-bit Multiplier with Modified Booth Algorithm ...... 290

Appendix B.16 Detailed Circuit Designs for the Register Storing the Multiplier B and Recursive Carry-Save Adder ...... 294

Appendix C.1 Detailed Network Description for a 2-bit Adder ...... 297

Appendix C.2 Command File for RNL Simulation for the 2-bit Adder Described in Appendix C.1 ...... 305

TABLE OF CONTENTS -- Continued

Chapter Page

Appendix C.3 Input and Output Signals Specified for the 2-bit Adder in the SPICE Simulation ...... 308

Appendix C.4 Model Parameters of the Simulated Devices in the SPICE Simulation ...... 309

LIST OF FIGURES

Figure Page

3.1 Block Diagram of the Robotics Processor ...... 19

3.2 Microinstruction Format ...... 21

3.3 Single-Bus Configuration ...... 25

3.4 Two-Bus Configuration ...... 25

3.5 Cross-Bar Network Configuration ...... 28

4.1 Block Diagram of Inverse Plant Plus Jacobian Control ...... 34

4.2 Major Data Acquisition, Computation, and Control Modules for Inverse Plant Plus Jacobian Control ...... 35

4.3 Architectural Concept for Implementation of Advanced Real-Time Control Algorithms for Robots ...... 36

4.4 Task Graph for Jacobian with P = 1 ...... 41

4.5 Architecture for Jacobian with P = 1 ...... 44

4.6 Timing Chart for Jacobian with P = 1 ...... 45

4.7 Task Graph for Jacobian with P = 2...... 47

4.8 Architecture for Jacobian with P = 2 ...... 48

4.9 Timing Chart for Jacobian with P = 2 ...... 50

4.10 Task Graph for Jacobian with P = N ...... 51

4.11 Architecture for Jacobian with P = N ...... 53

4.12 Timing Chart for Jacobian with P = N ...... 54

4.13 Architecture for Implementing Jacobian in Parallel (8 degrees-of-freedom) ...... 57

4.14(a) 3 Cube Interconnection Network ...... 60

4.14(b) 4 Cube Interconnection Network ...... 60

LIST OF FIGURES -- Continued

Figure Page

4.15 Communication of the 8 PEs in Different Time Slots ...... 61

4.16 3 Cube Interconnection Network for Implementing Jacobian in Parallel (8 degrees-of-freedom) ...... 63

4.17 Communication Between PEs in Different Time Slots ...... 64

4.18 4 Cube Interconnection Network for Implementing Jacobian in Parallel (16 degrees-of-freedom) ...... 65

4.19 Architecture for Inverse Jacobian with P = 1 ...... 71

4.20 Architecture for Inverse Jacobian with P = 6 ...... 73

4.21 Architecture for Inverse Jacobian with P = 12 ...... 74

4.22 Architecture for Inverse Jacobian with P = 24 ...... 76

4.23 Task Graph for the Forward Recursion of Inverse Dynamics 81

4.24 Task Graph for the Backward Recursion of Inverse Dynamics ...... 82

4.25 Architecture for Inverse Dynamics with P = 1 ...... 83

4.26 Timing Chart for Inverse Dynamics with P = 1 ...... 84

4.27 Architecture for Inverse Dynamics with P = 2 ...... 85

4.28 Timing Chart for Inverse Dynamics with P = 2 ...... 85

4.29 Timing Chart for Forward Recursion of Inverse Dynamics with One RP per Link ...... 88

4.30 Timing Chart for Backward Recursion of Inverse Dynamics with One RP per Link ...... 89

4.31 Architecture for Inverse Dynamics with P = N ...... 90

4.32 Timing Chart for Inverse Dynamics with P = N (N = 3 for example) ...... 92

4.33 Architecture for Inverse Dynamics with P = 2N ...... 93

4.34 Timing Chart for Inverse Dynamics with P = 2N (N = 3 for example) ...... 94

LIST OF FIGURES -- Continued

Figure Page

5.1 Block Diagram of the Robotics Processor ...... 98

5.2 Clock Generator (CG) ...... 99

5.3 Clock Signals Generated from the Clock Generator (CG) ...... 101

5.4 Block Diagram of the BU, FCB, and CRAM Indicating the Paths Used for Microprogram Loading ...... 102

5.5 Logic Circuit Diagram of BILBO Registers ...... 106

5.6 LSSD Used in the Pipelined Stages of the FPA and FPM . . . 108

5.7 Interconnection of the LSSD SRL's ...... 109

5.8 Block Diagram of the Floating Point Adder/Subtractor . . . 114

5.9 Circuit Diagram of a 2-bit Adder with Manchester-type Carry Chain ...... 121

5.10 Block Diagram of the Floating Point Multiplier ...... 123

5.11 An 8x8 Multiplier with Modified Booth Algorithm (Radix = 4) ...... 134

5.12 The Structure of a Pipelined Recursive Multiplier (mul_24.ca) ...... 136

5.13 Timing for the Pipelined Recursive Multiplier ...... 137

5.14 State Diagram for Generating the MLO Signal ...... 137

6.1 Functional Chart of VLSI CAD Tools ...... 142

6.2 NMOS INVERTER ...... 148

6.3 NMOS NAND ...... 148

6.4 NMOS NOR ...... 151

6.5 NMOS AND-OR-INVERTER ...... 151

B.1 Two Phase Clock Generator (TPG) ...... 243

B.2 State Diagrams for JCNTR and JCNTF ...... 244

B.3 Timing for Loading Microprogram ...... 246

LIST OF FIGURES -- Continued

Figure Page

B.4 Synchronization Controller for HWR ...... 248

B.5 State Diagram of the Bootstrap Controller (BTC) ...... 249

B.6 Circuit Diagram of the FCE, FCW, FCS and FCN ...... 251

B.7 Timing for Data Passed between RP's ...... 252

B.8 Three Pipelined Stages in the FPM and FPA ...... 254

B.9 Timing for the Data Through the Pipelined Stages ...... 255

B.10 AA, AB and WR for the Register File ...... 257

B.11 Block Diagram of the Alignment Control Unit ...... 263

B.12 4-bit Barrel Right-Shifter ...... 266

B.13 Block Diagram of the 24-bit Barrel Right-Shifter ...... 268

B.14 Block Diagram of the 24-bit Right (Left) Shifter ...... 269

B.15 Block Diagram of the Over/Underflow Unit ...... 275

B.16 All Possible Cases for Overflow and Underflow ...... 280

B.17 Block Diagram of the Over/Underflow Unit (fm_ovf_udf.ca) 282

B.18 Example to Explain How the Sign Extension Bits SIGN_EX(i) and SIGN_EX(i+1) Function ...... 286

B.19 Consider All Possible Cases to Obtain SIGN_EX(i) and SIGN_EX(i+1) ...... 288

B.20 Achieve Rounding Scheme by Adding a Full Adder ...... 291

B.21 Circuit of the Register Storing Multiplier B ...... 295

B.22 Schematic of Data Flow in the Pipelined Carry-Save Adders 296

C.1 Exclusive-OR ...... 298

C.2 Exclusive-NOR ...... 298

C.3 Circuit Diagram of add1_e.mac ...... 299

LIST OF FIGURES -- Continued

Figure Page

C.4 Circuit Diagram of add1_o.mac ...... 300

C.5 Circuit Diagram of add2.mac ...... 302

LIST OF TABLES

Table Page

4.1 Computation Times for Necessary Vector and Matrix Operations ...... 39

4.2 Comparison of Three Architectures for Jacobian ...... 67

4.3 Comparison of Four Architectures for Inverse Jacobian . .. 77

4.4 Comparison of Four Architectures for Inverse Dynamics . . . 95

5.1 Possible Values of the IEEE Single Precision Floating Point ...... 111

5.2 Truth Table of a One-bit Full Adder ...... 119

5.3 Size and Number of the Wallace Trees for a 24-bit Multiplier ...... 128

5.4 Comparison of the 4M, 3M, and 2M Versions of the Multiplication in [38] ...... 130

5.5 Encoding Table for the Modified Booth Algorithm ...... 132

B.1 Truth Table for Generating Effective Operation Bits (EOP0, EOP1) and SUB ...... 260

B.2 Truth Table for Generating the Final Sign Bit of the Result ...... 261

B.3 Truth Table of the Leading Zero Detector ...... 273

CHAPTER 1

INTRODUCTION

1.1 Project Background

Several different and sophisticated control schemes have been proposed for robotic mechanisms in the past few years, but few of them have been widely used because they usually involve many complex computations which are difficult to implement in real time. For example, control of the end effector in cartesian coordinates or dynamic control may require a dozen trigonometric functions and several hundreds (maybe thousands) of floating point multiplications and additions/subtractions. The latest 16-bit microprocessors, equipped with single chip numeric co-processors (e.g. Intel 8087), are still not adequate for most computationally intensive real-time control tasks.

The combination of the Inverse Plant for feedforward control and Jacobian Control for feedback has been shown to have excellent potential for fast and accurate control. But since these operations are very time consuming, some parallel/pipeline VLSI computing structures need to be designed to tackle the bottlenecks of robotic system control. Because rapid advances have been made in very-large-scale-integrated (VLSI) technology, the computing structure implemented with VLSI has the characteristics of simplicity, regularity, and communication locality. In addition, some parallel computing schemes, such as arithmetic pipelining, processor pipelining, and multiprocessor systems, are employed to improve system throughput.

Special purpose dedicated attached processors, based on the Robotics Processor chip (RP) being developed with state-of-the-art VLSI technology, will be attached to a host microcomputer. The RPs are connected in a mesh network to achieve parallel and pipeline structures, where the parallelism is about 80%. The system throughput is expected to be an improvement over a high speed attached processor only doing simple vector and matrix operations.

The Robotics Processor chip is designed primarily for solving the Jacobian, Inverse Jacobian, and Inverse Dynamics. The RP is able to perform the necessary vector and matrix operations in Inverse Plant plus Jacobian control. It contains a floating point adder/subtractor and a floating point multiplier. Both of them have three pipeline stages and can execute simultaneously. Because the RP is designed for more than one application, it must be programmable. Based on the parallel/pipeline computing structure, the Jacobian, Inverse Jacobian, and Inverse Dynamics can each be completed in one millisecond.

The Computer-Aided Design (CAD) tools used to design the RP were released from the UW/NW VLSI Consortium on October 1 in both 1983 and 1984. The VLSI CAD tools are capable of doing (1) interactive layout (CAESAR), (2) logic simulation (RNL or ESIM), and (3) circuit simulation (SPICE). The tools support designs on the NMOS and CMOS fabrication processes available through MOSIS, the Department of Defense's MOS Implementation Service run by the Information Sciences Institute of the University of Southern California.

1.2 Previous Work

The Jacobian relates the rate of change (velocity) of each of the six components of end effector position and orientation to the rate of change of each of the joint angles. This approach is more efficient than reverse kinematics because it involves less complex equations and is more easily applied to general N degree-of-freedom robotic mechanisms. A number of algorithms to compute the Jacobian are available and show a linear increase in computation with an increase in the number of degrees-of-freedom [16] [23]. Olson and Ribble's algorithm [16] for computing the Jacobian is considered for pipelining.
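Stated compactly (the symbols here are generic illustration, not notation taken from [16] or [23]), the relation the Jacobian captures is

\[
\dot{x} \;=\; J(\theta)\,\dot{\theta}, \qquad J(\theta)\in\mathbb{R}^{6\times N},
\]

where \(\dot{x}\) is the 6-vector of end effector translational and angular rates, \(\theta\) is the vector of the N joint angles, and \(\dot{\theta}\) is the vector of joint rates. The Inverse Jacobian problem discussed later is the solution of this linear system for \(\dot{\theta}\) given a desired \(\dot{x}\).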

An Inverse Dynamics analysis determines the joint torques of a manipulator given the relative positions, rates, and accelerations of the joints as well as the forces and moments to be applied at the end effector. It is proposed for control to drive the manipulator in the desired trajectory. Several methods based on Newton-Euler have been proposed to solve this problem. Orin and Olson suggested [20] that the most natural approach to pipelining Inverse Dynamics is to assign two processors to each link, one for the forward recursion and one for the backward recursion. Lathrop [24] investigates the high degree of parallelism inherent in the computations of Inverse Dynamics and presents two formulations suited to high-speed, highly parallel implementations using VLSI devices. The first presented is a parallel version of the recent linear Newton-Euler recursive algorithm; its time cost is linear in the number of joints. The second formulation reports a new parallel algorithm in which the time required to perform the calculations increases only as the logarithm (base 2) of the number of joints.

A loosely coupled multiprocessor system has provided an excellent basis for the design of an onboard computer system for a new hexapod vehicle currently under development at The Ohio State University. It consists of fifteen 16-bit microcomputers based on Intel's 8086/8087 pair. The Multibus allows high-speed common memory communication between the microcomputers.

Also constructed at Ohio State is the Skeletal Motion Processor, a highly parallel/pipelined array processor for generation of human and animal skeletal motion [16]. Its architecture includes four pipelined floating point adders and four pipelined floating point multipliers as well as nine independent data memories, a sine/cosine unit, a reciprocal unit, a microprogrammed control unit, and I/O buffers which are a part of a PDP-11 interface. The processor is capable of computing the 12 x 14 Jacobian in approximately 70 microseconds and solving the system of 12 equations using Gaussian elimination in approximately 5.5 microseconds.

At the University of Michigan, VLSI implementation is being considered for a numerical processor for robotics [57]. The processor is being designed to match the VLSI capabilities of the mid 80's and is intended for the computationally intensive tasks involved in real-time control of a robot arm. The numerical processor includes a pipelined 32-bit floating point adder unit, a pipelined 32-bit floating point multiplier unit, a 256 x 32 register file, and 32 x 32 input and output buffers to facilitate high-speed communication between processors. The device count for the chip is approximately 150K.

1.3 Organization

In chapter 2, control schemes, computer architectures, and the impact of VLSI technology are reviewed. Some parallel computing schemes (arithmetic pipelining, processor pipelining, and multiprocessor systems) are employed to solve the intensive computations in the Inverse Plant plus Jacobian control.

Chapter 3 describes the block diagram of the Robotics Processor (RP). The Robotics Processor can perform the necessary vector and matrix operations in Inverse Plant plus Jacobian Control. Four possible bus configurations of the RP's data path are proposed and compared.

In chapter 4, a task graph is used to help schedule processes to the Robotics Processors. Several possible architectures for each particular control problem (Jacobian, Inverse Jacobian, or Inverse Dynamics) are proposed and compared. The comparisons are based upon some important parameters, such as total execution time, initiation rate, CPU utilization, and total memory size needed in the RP.

Chapter 5 gives a general description of most major functional blocks of the Robotics Processor, while the detailed circuit designs are described in Appendix B. The two major circuit designs are the floating point adder/subtractor and the floating point multiplier. Part of the chip, heavily dependent on manufacturing capability and VLSI design tools (e.g. memory), has not yet been designed.

In chapter 6, the computer-aided design tools used at The Ohio State University are introduced. The methodology for designing a chip with the VLSI tools is presented. A 2-bit adder is used as an example to explain how to do logic and circuit level simulation.

Chapter 7 gives a summary of the conclusions drawn from this dissertation. It also points out some problems inherent in designing the Robotics Processor chip and makes suggestions for future study in this area.

CHAPTER 2

VLSI COMPUTING STRUCTURE ON ROBOTICS APPLICATIONS

2.1 Introduction

In this chapter the control scheme, the computer architecture, and the impact of very-large-scale-integrated (VLSI) technology are reviewed. The first section points out what kind of control scheme is to be considered for our robotic mechanisms and introduces the concepts of Inverse Plant plus Jacobian control. The second section discusses various computer architectures for robotics applications. The third section describes the impacts of VLSI on computing structures. The last section depicts the current VLSI technology and future trends and limitations.

2.2 Inverse Plant Plus Jacobian Control

Although many different and complicated control schemes have been proposed for robotic mechanisms, few of them have been used successfully because they apply only linear feedback at the joints and don't consider the nonlinearities in the robotic mechanism. Some control schemes based on the kinematic and dynamic properties have the potential to improve the whole control system, but, since the equations for the kinematics and Inverse Dynamics are rather complex, especially as the number of degrees-of-freedom increases, it is difficult to implement these equations on a digital computer for real-time control.

When servoing and motion-planning are accomplished in cartesian, workspace coordinates, all that needs to be done is to transform the desired angular and translational rates of the gripper or "end effector" to obtain the required rates of the joints. This kind of transformation is called the Inverse Jacobian. The Inverse Jacobian approach is better than reverse kinematics because it involves less complex equations and is more easily applied to the general N degree-of-freedom robotic mechanisms.

Jacobian control is based upon the kinematic properties of the mechanism and therefore does not account for the dynamic properties of the mechanism. Thus, the Inverse Plant feedforward controller has been proposed, in which it is assumed that the desired position, rate, and acceleration of a mechanism are given and that the joint actuator torques are to be determined. Recently, several approaches based on the Newton-Euler method have been shown to be efficient enough to be implemented in real time. These involve a forward recursion from the base to the end effector to compute link accelerations, and then a backward recursion to compute the joint torques.
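The forward/backward structure of these Newton-Euler approaches can be sketched in C. This is only a structural outline under simplified assumptions: the types and propagation rules below are hypothetical stand-ins (the real recursions involve link rotation matrices, centers of mass, and inertia tensors), not equations from this dissertation or from [20].

```c
#include <string.h>

#define N_LINKS 6

typedef struct { double w[3], dw[3], a[3]; } Motion;   /* rates and accelerations */
typedef struct { double f[3], n[3]; } Wrench;          /* force and moment        */

/* Hypothetical stand-ins for the Newton-Euler propagation equations. */
static Motion propagate_motion(Motion prev, double q, double qd, double qdd)
{
    Motion m = prev;
    (void)q;           /* a real rule would use the joint angle geometry */
    m.w[2]  += qd;     /* joint rate adds about the joint axis           */
    m.dw[2] += qdd;    /* joint acceleration adds about the joint axis   */
    return m;
}

static Wrench propagate_force(Wrench next, Motion m)
{
    Wrench w = next;
    w.n[2] += m.dw[2]; /* stand-in for the link's inertial moment        */
    return w;
}

/* Forward recursion base-to-tip, then backward recursion tip-to-base. */
void inverse_dynamics(const double q[], const double qd[], const double qdd[],
                      double torque[])
{
    Motion m[N_LINKS + 1];
    Wrench w[N_LINKS + 2];
    memset(m, 0, sizeof m);   /* link 0 (the base) is at rest            */
    memset(w, 0, sizeof w);   /* no external load at the end effector    */

    for (int i = 1; i <= N_LINKS; i++)   /* forward: velocities, accels  */
        m[i] = propagate_motion(m[i - 1], q[i - 1], qd[i - 1], qdd[i - 1]);

    for (int i = N_LINKS; i >= 1; i--) { /* backward: forces, torques    */
        w[i] = propagate_force(w[i + 1], m[i]);
        torque[i - 1] = w[i].n[2];       /* moment about the joint axis  */
    }
}
```

The point of the sketch is the data dependence: each forward step needs only link i-1, and each backward step only link i+1, which is what makes the one-processor-per-link pipelining suggested in [20] natural.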

The combination of the Inverse Plant for feedforward control and Jacobian control for feedback has excellent potential for fast and accurate control. However, Inverse Plant, which is basically Inverse Dynamics, with the addition of Jacobian control is computationally intensive and therefore very time consuming when using conventional computing architectures. For example, the execution time for the Inverse Dynamics plus Jacobian control may exceed half of the total execution time [21]. Thus, some kinds of computing architectures are explored and designed to tackle the bottleneck of robotic system control.

2.3 Computer Architectures for Robotics

There are two approaches for speeding up the computations required by Inverse Dynamics plus Jacobian control schemes. One is to attach a very fast numeric processor to the host computer with the objective that this attached processor would perform all vector and matrix operations required by the robotic control algorithms. Thus, the attached processor can potentially relieve the host computer from performing large numbers of computations. In practice, however, the quantity of data required to be transferred between the attached processor and the host computer is often so great that an "I/O bottleneck" is created at the interface between the two, with the result that the potential of the substantial speed increase cannot be realized. The decision to shift such a number-crunching job to an attached processor depends largely on whether the shifted computations can be done in sufficiently large blocks to compensate for the "interface overhead", the relatively long time spent in transferring data from the host computer to the attached processor and back again. For example, one commercial attached processor, the FPS 120/164, can multiply two vectors faster than such conventional minicomputers as the PDP-10 only when the vectors have at least 60 components [15]. Most matrices required for robotic computations are rather small, e.g., 3x3 or 4x4. Thus it is entirely possible that when adding two matrices, the time required to transfer the two source matrices to the attached processor and the result matrix back to the host may exceed the computation time which would have been required if the operation had been performed by the host without the attached processor.
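A toy model makes the break-even argument concrete. All constants below are invented for illustration (they are not measured FPS or PDP-10 figures): a slow host operation, a fast attached-processor operation, and a per-word interface cost. For an element-wise operation like matrix addition the transfer volume grows as fast as the work, so the attached processor never pays off, while for matrix multiplication the O(n^3) work eventually swamps the O(n^2) transfer.

```c
#include <stdio.h>

/* Invented illustrative constants (microseconds), not measured figures:
 * host op 10, attached-processor op 1, interface transfer 50 per word.
 * Each job moves three n*n blocks: two operands in, one result back.   */
#define T_HOST  10.0
#define T_AP     1.0
#define T_XFER  50.0

int main(void)
{
    for (int n = 2; n <= 64; n *= 2) {
        double words = (double)n * n;
        double xfer  = 3.0 * words * T_XFER;

        /* Element-wise add: O(n^2) work, so transfer grows as fast as
         * compute and the attached processor never breaks even.        */
        double add_host = words * T_HOST;
        double add_ap   = xfer + words * T_AP;

        /* Multiply: ~2n^3 operations, so compute eventually dominates
         * the fixed O(n^2) transfer and the attached processor wins.   */
        double ops      = 2.0 * words * n;
        double mul_host = ops * T_HOST;
        double mul_ap   = xfer + ops * T_AP;

        printf("n=%2d  add -> %-13s  multiply -> %s\n", n,
               add_ap < add_host ? "attached wins" : "host wins",
               mul_ap < mul_host ? "attached wins" : "host wins");
    }
    return 0;
}
```

With these made-up constants the crossover for the multiply falls between n = 8 and n = 16, echoing the 60-component vector threshold quoted above: small robotic matrices sit well below it.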

Another approach is to design a special purpose computer, with a parallel computing structure based on suitable algorithms, to solve some particular problems. Some parallel computer structures are introduced and employed to solve the Inverse Dynamics plus Jacobian control. When designing an appropriate computing structure, simplicity, regularity, and communication locality should always be kept in mind [12].

A parallel computer can be divided into three architecture configurations [12]:

A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, database, etc.).

Three pipeline schemes used in a pipelined computer are:

(1) Arithmetic pipelining, where the arithmetic logic units are segmentized for pipeline operations in various data formats. For example, four-stage pipelines are used in the Star-100, and three-stage pipelines are designed in both the Weitek WTL 1033 floating point adder and the WTL 1032 floating point multiplier [56]. Because of the independence of the elements in vectors or in matrices, a pipeline structure is very suitable for vector or matrix operations. Therefore, this arithmetic pipelining scheme is used in the arithmetic units of the Robotics Processor being designed and developed at The Ohio State University.

(2) Instruction pipelining, where the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. The Intel 8086 is one example.

(3) Processor pipelining, where the same data stream is processed by a cascade of processors. This processor pipelining scheme is employed in solving the Inverse Dynamics problem with Robotics Processors (explained in section 4.4).

An array processor handles a single instruction and multiple data (SIMD) stream. The original motivation for developing SIMD array processors was to perform parallel computations on vector or matrix types of data. The array processor has been used widely and efficiently in many different fields, such as fast Fourier transforms, matrix inversion, parallel sorting, and solving partial differential equations. Various interconnection networks have been suggested for array processors, such as the mesh network, the n-cube, the barrel shifter, and the shuffle-exchange network. One typical example of an array processor is the Illiac IV, which was connected in a mesh network and primarily designed for matrix manipulation and solving partial differential equations.

As mentioned above, the array processor structure is very useful as an attached processor performing operations on large matrices. However, because of the inherently small size of the vectors and matrices encountered in robotic systems, it is not constructive to apply an array processor to the task of performing vector or matrix operations for robotics applications. However, the concept of the array processor can be used to develop parallel/pipeline algorithms for Inverse Dynamics and Jacobian computations. The Robotics Processors (RP) are to be connected in a mesh network to achieve a parallel and pipeline structure. For example, in the Inverse Dynamics application, the RPs are pipelined, but the overlap (or parallelism) between the RPs achieves 80% (explained in section 4.4). The system throughput is expected to be much improved as compared to using a high speed attached processor only doing simple vector and matrix operations.

A multiprocessor system is controlled by one operating system which provides interaction between processors and their programs at the process level [12]. There are two architectural models for a multiprocessor system. One is a tightly coupled multiprocessor, where all processors communicate through a shared main memory. Another is a loosely coupled multiprocessor, where each processor has its own input-output devices and a large local memory, storing most of the instructions and data. It is usually efficient when the interactions are minimal. However, tightly coupled systems can tolerate a higher degree of interaction between processors without significant deterioration in performance. Three different interconnection networks have been commonly used: time-shared common bus, crossbar switch network, and multiport memories.

Many multiprocessor systems have been designed and constructed during the last two decades. One example is the C.mmp, which consists of 16 computer modules connected to 16 global, shared memory banks via a central crossbar switch. Another example is a multiprocessor system with five units of PDP-11/03, which was designed and implemented for the Hexapod Vehicle developed at The Ohio State University [18]. Currently, a multiprocessor system, consisting of 15 Intel iSBC's communicating through the MULTIBUS and shared memories, is being used for an Adaptive Suspension Vehicle (ASV) being developed at The Ohio State University. The multiprocessor system concept can be found in the overall computer system of the entire control system in section 4.1.

2.4 VLSI Technology to Computer Architectures

Because of the rapid advent of VLSI technology, several new architectures implementing parallel algorithms directly in hardware are available. For example, the systolic array offers substantial performance gains via massive parallelism and regular local communication [25]. Another example is the wavefront array processor (WAP), which provides a powerful tool for the high speed execution of a large class of matrix operations and related algorithms which have widespread applications [26]. The major difference between the systolic array and the WAP is that the systolic array requires global synchronization while the WAP doesn't.

Basically, there are two different architectural directions for VLSI based computers [2]. The first involves putting more and more functions on a chip and making it run faster and faster. For example, within a single-chip computer are integrated CPU, memory, and input/output circuitry.

The other is taking a fresh look at new technology and many recently emerged computer applications. It considers the interconnection of VLSI chips to form highly parallel computing systems. Such computing systems have structural properties that are suitable for VLSI implementation, such as the systolic array and WAP.

The key attributes of VLSI computing structures are described below [12]:

1. Simplicity and regularity: If a structure can be partitioned into a few types of building blocks which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs where a single chip comprises hundreds of thousands of components.

2. Concurrency and communication: Massive parallelism can be achieved if the algorithm is designed to introduce high degrees of pipelining and multiprocessing. When a large number of processing elements work simultaneously, communication becomes significant, especially with VLSI technology where routing costs dominate the power, time, and area. The locality of interprocessor communications is a desired feature to have in any processor array.

3. Computation intensiveness: VLSI processing structures are suitable for implementing compute-bound algorithms rather than I/O-bound computations because VLSI packaging must be constrained to a limited number of I/O pins. A VLSI device must balance its computation with the I/O.

2.5 VLSI Technology

Since the first silicon transistor was made in the mid-1950's, IC technology has been improving rapidly. Chip complexity doubled every year after 1959. In 1973, complexity reached nearly 8,000 components per chip. Since then, complexity has doubled every 1.5 to two years. The complexities (transistor counts) of the five generations of ICs, small, medium, large, very large, and ultra large scale, or SSI, MSI, LSI, VLSI, ULSI, are [3]:

SSI            2 - 64
MSI           64 - 2,000
LSI        2,000 - 64,000
VLSI      64,000 - 2,000,000
ULSI   2,000,000 - 64,000,000

Today's technology is in the VLSI range. For example, HP's FOCUS CPU and Bell's BELLMAC-32 microprocessor both contain 450,000 transistors.

Most ICs are made of bipolar, N-channel metal-oxide-semiconductor (NMOS), complementary MOS (CMOS), or GaAs devices. Depending on the application, different devices are used. For example, TTL or ECL (both made of bipolar transistors) are used when high speed is required. If higher speed is required, then GaAs is probably the candidate; its gate delay is less than 120 picoseconds, four to six times faster than silicon devices [4]. NMOS has the highest density and is suitable for making memories. CMOS has low power dissipation characteristics. As more and more components are put on one chip, heat removal becomes a serious problem. Thus for a memory chip containing over one megabit, CMOS will replace NMOS [5].

Now CMOS and bipolar transistors can be put on the same silicon wafer [6]. Logic cells that combine CMOS FETs with bipolar transistors operate at subnanosecond ECL speeds, but dissipate only the fractional milliwatt power levels of CMOS circuits. Each cell includes a standard CMOS logic gate buffered by a totem-pole output driver of the type popular in TTL.

Wafer-scale integration (WSI) was first tried in the 1960's [7]. One wafer, from 2 to 8 inches, can contain 25 to 100 Intel 8086's. WSI has the advantage of eliminating interchip connections. As a result, it has a low possibility of generating noise along the interchip connection wirings, faster throughput, and higher reliability, and it needs less power since no driving power is required. But WSI technology is not yet mature because of the problems of heat removal (1000 watts/wafer) and very low yield.

Very-high-speed IC (VHSIC) is a seven-year project supported by the Department of Defense. During Phase I (May 1981 to Apr. 1984), microcircuits were designed and produced with a minimum feature size of 1.25 micrometers. In Phase II (May 1984 to Dec. 1986), the goal was to design and produce microcircuits with a minimum feature size of 0.5 micrometer, clock rates of 100 MHz, and circuit complexities in excess of 100,000 logic gates per chip [8][9].

Although IC technology will continue to advance rapidly, there are several factors that constrain the integration level of future silicon IC technology. These factors can be categorized into physical, technological, and complexity limits [2]. Physical limits include the velocity of light, the principle of uncertainty, entropy (irreversibility), and thermal energy. It is proposed that the final limit of the device size will be 0.3 micrometer [1]. Technological limits are concerned with fabrication techniques, materials constants, and electrical parameters. Complexity limits relate to the human inability to design a circuit involving a very large number of components.

2.6 Summary

In this chapter, the Inverse Plant plus Jacobian Control is introduced. Then different computer architectures for robotics applications are depicted. The trends and the impacts of VLSI technology on computers are further discussed. Finally, the current VLSI technology, future prospects, and the final limitations of the feature size are described.

CHAPTER 3

ARCHITECTURE OF THE ROBOTICS PROCESSOR

3.1 Introduction

The RP is primarily designed for solving problems involving Jacobian, Inverse Jacobian, and Inverse Dynamics. It will be shown in this chapter that the RP has great potential for use in other applications involving vector or matrix operations. The RP is capable of performing most vector and matrix operations, such as vector addition, vector multiplication with a scalar constant, vector inner product, vector cross product, matrix multiplication with a vector, and matrix multiplication with a matrix.

The block diagram of the final version of the RP is first introduced. Next, four of the design alternatives which were considered during the evolution of the final architectural design are given and discussed. Finally, the characteristics of the RP are summarized.

3.2 Block Diagram of the RP

Figure 3.1 shows the block diagram of the RP, consisting of the Clock Generator (CG), Bootstrap Unit (BU), Format Converter for the BU (FCB), Control RAM (CRAM), Sequencer (SEQ), Microcode Register (MCR), Register File (RF), Floating Point Adder/subtractor (FPA), Floating Point Multiplier (FPM), Format Converter East (FCE), Format Converter West (FCW), Format Converter North (FCN), and Format Converter South (FCS).

[Figure 3.1 Block Diagram of the Robotics Processor]

The CG generates all clock signals needed in the RP. The System Clock (SYS_CLK) is tentatively selected to be 16 MHz. The Pipeline Clock (P_CLK) is used to clock the pipeline registers in the FPA and FPM. The frequency of P_CLK is defined to be one sixteenth of the SYS_CLK, i.e., 1 MHz. The detailed circuit for the CG is described in section 5.2.

During initialization, the host computer loads the appropriate application microprograms and constants to the RPs. The loading process is initiated by the bootstrap signal (BT), which is asserted by the host computer. When the RPs begin to execute their microprograms, the host computer intermittently sends the necessary parameters, such as the desired position, rate, and acceleration of the particular mechanism, to the RPs and then receives results, such as joint actuator torques, from the RPs.

As shown in figure 3.2, the microcode format consists of 40 bits. The most significant bit, the opcode bit, specifies the interpretation of the remaining bits of the microinstruction. When this bit is 0, an arithmetic operation can be specified, and when the opcode bit is 1, branching and I/O operations are indicated. There are six address fields for normal arithmetic operations and four address fields for I/O. Each address field has 6 bits because the RF has 64 words. The OP bit specifies either addition or subtraction; OP = 0 produces addition and OP = 1, subtraction. Control bits WM (Write Multiplier's result) and WA (Write Adder's result) determine whether the result of the FPM or FPA is to be written into the RF. Control bits EE (Enable East output) and ES (Enable South output) are used to buffer the data going out of the RP on the east or south side. Control bits EW (Enable West input) and EN (Enable North input) are used to buffer the outside data coming into the RP from the west or north side. These four control bits are also used to gate the four address fields to the address buses, AA and AB.

I/O | OP | WM | WA | ADDR1 | ADDR2 | ADDR3 | ADDR4 | ADDR5 | ADDR6

I/O : I/O = 0                        ADDR1: Address for multiplier operand A
OP  : OP = 0 for addition            ADDR2: Address for multiplier operand B
WM  : Write Multiplier's result      ADDR3: Address for multiplier result R
WA  : Write Adder's result           ADDR4: Address for adder operand A
                                     ADDR5: Address for adder operand B
                                     ADDR6: Address for adder result R

I/O | BR | EE | ES | EW | EN | -- | -- | -- | ADDR3 | ADDR4 | ADDR5 | ADDR6

I/O : I/O = 1                        ADDR3: Address for East output
BR  : BRanch = 0                     ADDR4: Address for South output
EE  : Enable East output             ADDR5: Address for West input
ES  : Enable South output            ADDR6: Address for North input
EW  : Enable West input
EN  : Enable North input

Figure 3.2 Microinstruction Format
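As an illustration, the arithmetic form of this microword could be packed as follows. The field order and exact bit positions are assumptions made for the sketch; figure 3.2 fixes the fields and their widths but not a binary layout.

```c
#include <stdint.h>

/* Pack the arithmetic form of the 40-bit microinstruction into the low
 * 40 bits of a 64-bit word.  The layout below is assumed:
 *   bit 39: I/O (0 = arithmetic)   bit 38: OP (0 = add, 1 = subtract)
 *   bit 37: WM                     bit 36: WA
 *   bits 35..0: ADDR1..ADDR6, 6 bits each (the RF has 64 words)        */
typedef uint64_t Microword;

static Microword pack_arith(int op, int wm, int wa, const int addr[6])
{
    Microword w = 0;                        /* I/O bit left 0            */
    w |= (Microword)(op & 1) << 38;
    w |= (Microword)(wm & 1) << 37;
    w |= (Microword)(wa & 1) << 36;
    for (int i = 0; i < 6; i++)             /* ADDR1 highest, ADDR6 lowest */
        w |= (Microword)(addr[i] & 0x3F) << (30 - 6 * i);
    return w;
}
```

For example, pack_arith(0, 1, 1, (int[]){1, 2, 3, 4, 5, 6}) would encode a word that multiplies RF[1] by RF[2] into RF[3] while adding RF[4] and RF[5] into RF[6], with both WM and WA set so that both results are written back.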

When a microprogram is being executed, the SEQ generates the next address to the CRAM. The MCR is a microinstruction latch clocked by P_CLK phase-1 (Pφ1). The BU generates the necessary control signals and address to the CRAM during the initialization stage. Section 5.3 has more detailed descriptions of the BU.

There are two output ports, one on the east and the other on the south. Also, there are two input ports, one on the west and the other on the north. To limit the number of pins, the external bus on each side is 16 bits wide while the three internal buses, Bus A (BA), Bus B (BB), and Bus C (BC), are all 32 bits wide. Therefore, four format converters are required to change the 32-bit data to two 16-bit words or vice versa. Detailed circuits for the format converters are described in section 5.3.

As mentioned in section 2.3, the RPs are connected in a mesh network to develop parallel/pipeline algorithms for the Jacobian and Inverse Dynamics analyses. 32-bit data must be transferred between adjacent RPs. Since the data paths external to the RPs are 16 bits wide, two P_CLK cycles are required to complete one 32-bit data transfer from one RP to another. This is necessitated partly because of the required format conversions from 32 to 16 bits on the transmit side, and 16 to 32 bits on the receive side. Another factor requiring a double clock period is that the output pad drivers designed into the chip incorporate multiple amplifier stages to supply the current sourcing and sinking capability to drive the relatively large electrical capacitance of the external interconnections. Since 32-bit data can be sent from both east and south sides simultaneously, and since each requires two P_CLK cycles, the net effect is that the maximum transfer rate is one 32-bit transfer per P_CLK.

Here an assumption is made that the intercommunication between the adjacent RPs is controlled by precisely synchronized microprograms. Specifically, when one is transmitting, the other must be receiving. If one consumes the received data faster than the other produces it, the consumer must wait idly until the producer is ready to send the next data. On the other hand, if the producer produces the data to be sent faster than the consumer consumes the data, the producer must wait idly until the consumer is ready to receive. Since the time to transmit or receive data is well known, it is not necessary to have handshaking signals between the RPs.
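Functionally, each transmit/receive format converter pair implements nothing more than the split and join below. This is a behavioral model of the word format only (the half-word ordering is an arbitrary choice here; the circuit details are in Appendix B.3).

```c
#include <stdint.h>

/* Transmit side: one 32-bit word becomes two 16-bit bus cycles.
 * Sending the high half first is an arbitrary choice for this model.   */
static void fc_split(uint32_t word, uint16_t bus[2])
{
    bus[0] = (uint16_t)(word >> 16);
    bus[1] = (uint16_t)(word & 0xFFFFu);
}

/* Receive side: the adjacent RP reassembles the word two cycles later. */
static uint32_t fc_join(const uint16_t bus[2])
{
    return ((uint32_t)bus[0] << 16) | bus[1];
}
```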

The RF is, in fact, a three-port RAM. Two operands are read onto the BA and BB at the same time, but only one result can be stored in the RF at one time. Two address buses, Address bus A (AA) and Address bus B (AB), and the Write (WR) signal are not shown in the block diagram. In order to access the contents of two different RF locations at the same time, two address decoders are needed in the RF. The capacity of the RF is tentatively assigned to be 64 words, and each word is 32 bits wide since the standard IEEE single-precision floating point format is used.
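Behaviorally, the three-port RF does only what the short model below does each cycle: two simultaneous reads and at most one write. The function signature and naming are modeling choices for the sketch, not circuit-level details.

```c
#include <stdint.h>

#define RF_WORDS 64

/* Behavioral model of the three-port register file: two reads (driving
 * buses BA and BB) and at most one write per cycle, as described above. */
typedef struct { uint32_t word[RF_WORDS]; } RegFile;

static void rf_cycle(RegFile *rf,
                     unsigned aa, unsigned ab,        /* address buses AA, AB */
                     uint32_t *ba, uint32_t *bb,      /* data onto BA, BB     */
                     int wr, unsigned waddr, uint32_t wdata)
{
    *ba = rf->word[aa & (RF_WORDS - 1)];  /* both reads during Pphi1   */
    *bb = rf->word[ab & (RF_WORDS - 1)];
    if (wr)                               /* single write during Pphi2 */
        rf->word[waddr & (RF_WORDS - 1)] = wdata;
}
```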

Both the FPM and FPA have three pipeline stages. Each stage is clocked by P_CLK phase-1 and phase-2. During the P_CLK phase-1 (Pφ1), four operands are read from the RF and latched in the first pipeline register of the FPM and FPA by a time multiplexing scheme. Since there are three pipeline stages in the FPM and FPA, it takes three P_CLK cycles to propagate from input operands to output results. During the P_CLK phase-2 (Pφ2), the results of the two arithmetic units are stored in the RF by the time multiplexing scheme. Thus, a total of four P_CLK cycles are required to complete a floating point addition/subtraction and multiplication. Detailed circuit designs for the FPA and FPM are described in sections 5.5 and 5.6 respectively. Since floating point addition and multiplication are executed simultaneously, a throughput of 2 million floating point operations per second (2 MFLOPS) can be obtained once the three pipeline stages in the FPM and FPA are filled up.
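Spelling out the arithmetic behind these figures, using only the numbers stated above:

\[
P\_CLK = \frac{SYS\_CLK}{16} = \frac{16\ \mathrm{MHz}}{16} = 1\ \mathrm{MHz}
\quad (1\ \mu\mathrm{s\ per\ cycle}),
\]
\[
\text{operation latency} = 4\ \text{cycles} = 4\ \mu\mathrm{s}, \qquad
\text{steady-state throughput} = \frac{1\ \text{add} + 1\ \text{multiply}}{1\ \mu\mathrm{s}} = 2\ \text{MFLOPS}.
\]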

3.3 Evolution of the Architectural Design of the RP Data Paths

In this section, four possible bus configurations are proposed and compared. They are the single-bus, two-bus, and three-bus configurations, and the cross-bar network.

3.3.1 Single-Bus Configuration

Figure 3.3 shows the single-bus configuration. The input bus and output bus of the FPA and FPM are connected. To obtain benefits from pipelining, accessing the operands from and storing the results in the RF should be completed in one P_CLK cycle. For example, during Pφ1, the operands are read onto the bus and latched at the first stage of the FPA and FPM. During Pφ2, results from the FPA and FPM are stored into the RF.

[Figure 3.3 Single-Bus Configuration]

[Figure 3.4 Two-Bus Configuration]

The advantages of the single-bus configuration are that it is simple and that chip area is saved since only one bus is used. There are, however, some disadvantages to the configuration. First, no precharging on the internal bus causes the transition from low to high voltage to take more time than from high to low. This results in asymmetry of the rising and falling edges and is usually not acceptable to design engineers. Also, the transfer rate on the internal bus becomes slower. A second disadvantage of this configuration is that a very high speed RF with an access time less than one quarter of the Pφ1 is required, because during Pφ1, four operands need to be read out of the RF to the FPA and FPM. For example, the Pφ1 duration is 500 ns, so the access time of the RF should be less than 125 ns.

3.3.2 Two-Bus Configuration

Figure 3.4 shows the two-bus configuration. One more bus is added to allow the buses to be precharged before they are activated. This eliminates the slowing down of the data transfer on the internal buses. However, the access time of the RF must still be less than one quarter of the Pφ1, 125 ns.

3.3.3 Three-Bus Configuration

The three-bus configuration is shown in figure 3.1. If both BA and BB are precharged during Pφ2, then BC is precharged during Pφ1. Since only two operands, instead of four as in the previous two cases, are read from the RF during Pφ1, the allowable access time of the RF is doubled to 250 ns.

There are two disadvantages with the three-bus configuration. First, it requires one more bus than the two-bus configuration and thus occupies more chip area. Also, the RF becomes slightly more complicated and occupies more chip area since it is a three-port memory.

3.3.4 Cross-Bar Network

Data path configurations range from a single bus, where only one data word can be transferred at a time, to a full crossbar switch, where all possible connections can be made simultaneously. Figure 3.5 shows the cross-bar network configuration, similar to that of the FPS/120B [14]. To read four operands at the same time, two three-port register files, RF1 and RF2, are required. The crossbar network is implemented with six dedicated buses, four to supply operands to the arithmetic units and two to carry results away from the FPA and FPM and to swap the data in the RF1 and RF2. All the cross points are closed by control signals. Operands in the RF1 and RF2 can be sent to the FPA and FPM by closing the proper crosspoints in the network.

There are two advantages of the cross-bar network configuration.

First, it has the most flexible configuration. Second, all the buses are precharged at the same time, say during Pφ2. The access time of the RF can be almost as large as the Pφ1 duration, 500 ns, since four operands are simultaneously read from the RF1 and RF2 to the FPA and FPM during Pφ1.

There are, however, two disadvantages. First, the network occupies too much chip area since every bus is 32 bits wide. Second, it is rather complicated to control the cross points since there are too many possible combinations of data flow paths. Moreover, errors are likely to be made since programmers must keep the network in mind while writing the microprogram.

[Figure 3.5 Cross-Bar Network Configuration]

The above comparisons show that the cross-bar network configuration is overly complicated and occupies too much chip area, whereas the single-bus configuration does not allow for the precharging of the bus; therefore, these two possibilities were eliminated. In the case of the two-bus configuration, some uncertainty exists as to whether the RF can be designed with an access time less than one quarter of the Pφ1, i.e., 125 ns, possibly necessitating a reduction of the system clock frequency. Even though a greater chip area will be required for the three-bus configuration, it will quite likely allow the access time to be one half of the Pφ1, 250 ns; thus the three-bus configuration is tentatively chosen for the data path.
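The register file access-time budgets quoted in sections 3.3.1 through 3.3.4 all follow from one relation: the Pφ1 window (500 ns) divided by the number of operand reads that must occur sequentially within it. A small sketch of that calculation; the window width and read counts simply restate the text, nothing else is assumed:

```c
#include <stdio.h>

/* Required RF access time = Pphi1 window / operand reads that must be
 * performed sequentially within it (counts from sections 3.3.1-3.3.4). */
int main(void)
{
    const double p_phi1_ns = 500.0;
    const struct { const char *config; int seq_reads; } c[] = {
        { "single-bus (4 reads share one bus)",               4 },
        { "two-bus    (4 reads; 2nd bus only for precharge)", 4 },
        { "three-bus  (4 reads as 2 pairs on BA/BB)",         2 },
        { "cross-bar  (4 reads on 4 dedicated buses)",        1 },
    };
    for (int i = 0; i < 4; i++)
        printf("%-46s access time <= %3.0f ns\n",
               c[i].config, p_phi1_ns / c[i].seq_reads);
    return 0;
}
```

Running it reproduces the 125 ns, 125 ns, 250 ns, and 500 ns budgets cited above.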

3.4 Summary

In this chapter, the architecture of the RP is described. In the second section, several data path design alternatives were given, and in the third section, four possible data path configurations were proposed and compared. The three-bus configuration was found to be the best choice. Some of the specifications are based upon the knowledge of the current VLSI technology. For example, the choice of 16 MHz for the RP System Clock is comparable with the 20 MHz clock used in the VLSI version of Digital Equipment Corporation's VAX, which is fabricated using 3 micron NMOS technology.

29 The characteristics for the RP are summarized as follows :

1. The RP contains a Floating Point Adder/subtractor (FPA) and a

Floating Point M ultiplier (FPM) to execute floating point

addition/subtraction and multiplication. No divider is included

since division is not required in the vector or matrix operations

mentioned in the firs t section.

2. IEEE single-precision floating point format is used.

3. The FPA and the FPM can operate simultaneously to improve system

throughput.

. A pipeline scheme is used to increase the speed of vector and

matrix operations because it exploits the global parallelism found

between computations on separate elements in a vector or matrix

C14].

5. There are three pipeline stages in the FPA and FPM. The

rationale for the choice of three pipeline stages are

(a) the shorter the pipeline, the better the performance with

short vectors [15], Some vector processors, such as

Star-100, have much longer pipelines and perform relatively

poorly on short vectors [14].

(b) I t is easy to partition the FPA and FPM into 3 stages, where

each has almost the same delay time.

6. There are four unidirectional I/O ports, two for inputs and two

for outputs.

7. Internal buses are 32 bits wide, while the four external buses are

16 bits wide, in order to lim it the pin number.

8. A format converter is needed on every I/O port to convert a 32-bit

30 data to two 16-bit words or vise versa.

9. No handshaking signals between Robotics Processors are needed

since the timing to transfer parmeters from one RP to another RP

is known. Thus data transferring can be handled by

microprogramming.

10. To speed up the access of the contents in the RF, the internal

buses, BA, RB, and BC are precharged before they are activated.

11. To simplify hardware in the RP, RP1s microinstruction is not

pipelined as are operands in the FPA and FPM, although this makes

microprogram coding slightly more complicated. As a result,

programmers should keep 1n mind that the results of

addition/subtraction and multiplication are not available until 4

P_CLK cycles later.

12. The system clock is tentatively set to 16 MHz.

13. Because format conversion is necessary on each I/O port, two

P_CLK cycles are required to transfer a 32-bit data to the

adjacent RP.

14. The RP must be programmable for different applications. The

application programs are all written in microcode. The reasons

for this are

(a) the application algorithm is usually fixed, thus also

fixing as the application program.

(b) Usually the application microprogram is relatively short.

For example, the most complicated application

microprogram, Inverse Dynamics, consists of about inn

microinstructions. Therefore, there is no need to

31 develope a microcode compiler.

(d) programmers are permitted direct access to each of the

arithmetic units thereby permitting maximum u tilization of

the potential parallelism.

15. The parallei/pipeline computing structure allows the execution of

any one of the Jacobian, Inverse Jacobian, and Inverse Dynamics in

one mi 111 second.

RP's characteristecs are summarized in the 15 preceding items.

Some specifications are made according to the current VLSI technology.

However, designing the RP chip will probably take one or two more years, and by that time VLSI technology shall be improved and the channel

length will likely be decreased to one or two microns. Thus, some specifications of the RP may need to be modified. For example, the

SYS_CLK may be increased. Also, i f an RF with faster speed can be fabricated, the number of internal buses can be reduced from three to two or even one. But some of the basic boxes of the RP w ill probably not change, such as the FPA and FPM.

32 CHAPTER 4

APPLICATIONS OF THE ROBOTICS PROCESSOR

4.1 Introduction

As was mentioned in section 2.2, the combination of the Inverse

Plant for feedforward control and Jacobian Control for feedback has excellent potential for fast and accurate control. A control block diagram with feedforward Inverse Plant plus feedback Jacobian Control is shown in figure 4.1. The major data acquisition, computation, and control modules required to implement the Inverse Plant plus Jacobian

Control are shown in figure 4.2. Detailed explanations of control schemes and the meaning of their control parameters can be found in

[21].

In this chapter, several special purpose dedicated attached processors for the Inverse Plant plus Jacobian are developed. These attached processors are based on the Robotics Processor, which is being developed with state-of-the-art VLSI technology at The Ohio State

University. These special purpose dedicated processors will be attached to a host microcomputer, and multiprocessor system concepts, described in section 2.3, will be used to interconnect these multiple processors for real-time control. The overall computer system for the entire control system is shown in figure 4.3.

In this chapter, several possible architectures for each particular control problem, eg. Jacobian, Inverse Jacobian, and Inverse

33 INVERSE PLANT

ROBOTIC o* I SYSTEM

DIRECT KINEMATICS

Figure 4.1 Block Diagram of Inverse Plant Plus Jacobian Control [ 2 1 ] OUTER MOTION DIRECT FORCE LOOP PLANNING KINEMATICS TRANSFORMATION CONTROL

DATA INVERSE / k. JACOBIAN ACQUISITION JACOBIAN

DESIRED INNER 10 INVERSE OUTPUT JOINT LOOP ACCELERATION DYNAMICS CONTROL CONTROL

Figure 4.2 Major Data Acquistion, Computation, and Control Modules for Inverse Plant Plus Jacobian Control [2 1] RP., rpin I 1 I

RPSl RP?? RP2N

Jacobian Processor Inverse Dynamics Processor

r

Microcomputer Microcomputer

Local ~i Shared Local ~I Shared Memory J Hemory Memory [ Memory

< Common Memory Bus

Local | Shared Local ! Shared -Memory. _1 _ Memory — - - tfeaojcy. L ~ Memory- _

Microcomputer Microcomputer • • •

Trajectory Data Acquisition Generation Servo Control

Figure 4.3 Architectural Concept for Implementation of Advanced Real-Time Control Algorithms for Robots Dynamics, are proposed and compared based upon some important parameters such as total execution time, initiation rate, CPU u l t i 1ization, and the total required memory size in the RP.

4.2 Jacobian

The Jacobian relates the six components of the velocity of the end effector, including both linear and rotational, to the angular velocities of each of the joint angles. The Jacobian approach has the advantage of producing simpler equations than Reverse kinematics. The equation relating the joint angle rates (js) to the angular velocity (u) and translational velocity (v) of the end effector is given as follows:

(l) = J(9) * 9 ( 2 . 1)

The J matrix is of dimensions 6xN and is given as:

N+l N+l N+l *1 ' x2» • * * ^ (2.2) N+l N+l N+l V B ^ * * * * i *

N+l N+l y. and $. (i= 1,2,...N) are derived as follows: — i — i

N+l Vi ■' (2.3)

N+l N+l i-1 T u = u u i = N+l, N, . . . 2, 1 (2.4) i-1 i i

37 0 N+l N+l Y. = U. , 0 i = N, (N-l), . . . 2, 1 (2.5) —1 1-1 1

N+l (2. 6 ) ^ + 1

N+l N+l N+l,, .1 * r. i = r. - u .* P. i = N+l, N...... 2, 1 (2.7) -i-l —i i -i

N+1e. - N+1y X ( - N+1r ) i = N, (N-l), . . . 2, 1 (2.8) —i —1 -1 -1

The 3 x 3 rotational transformation matrix, ^ and the 3 x 1 vector, i * P., are defined as:

COS 0 ■sin e.- cos a. sin 0.- sin a. i i i i i i-1 sin 0 cos 0.- COS a. -cos i • sin (2.9) u, ■ i i i i sin a. cos a. i

d,sin a. (2 . 10) i i

d^cos

The detailed definitions for the four parameters, 0-j, d}, a i , and c*i, can be found in [22] [23], Here only the revolute joint is considered.

The equations for the sliding joint is less complicated, but similar results will be obtained.

38 4,2,1 Complexity of Vector Operations

The independence of the FPA and FPM, and the pipeline structure

in each arithmetic unit speed up vector operations significantly. The

exact RP computation times for the necessary vector and matrix operations can be determined and are shown in table 4.1, where V is a

3x1 vector and M is a 3x3 matrix.

Table 4.1

Computation Times for Necessary Vector and Matrix Operations

| Vector | No, of P_CLK Computation time | Operations | cycles (microsecond) Complexity j

| V + V i 6 6 1 |

| V x c 1 6 6 1 |

| V * V 1 13 13 2 j

| V x V 1 *3 13 2 1

j MV 1 17 17 3 1

| MM | 35 35 6 |

j Transfer V I & 6 1 )

The time required for computation or transferring data between processors can be represented as complexity. Using complexity instead of time expressed in microseconds allows the results to he independent

39 of the system clock. It is desired that the complexity of the simplest operation, e.g. vector addition, be normalized to the value 1. The complexity value of all other operations will be computed relative to vector addition. Since vector addition requires 6 microseconds, the normalizing factor by which all other times are to be divided is 6, In short, complexity = computation time / 6. The computation time of two

3x3 matrix multiplications is 35 microseconds, divided by 6 resulting in the complexity of 6. A reservation table is used to show how successive pipeline stages are utilized (or reserved) for a specific function evaluation in successive pipeline cycles. The reservation tables for obtaining the computation times of these operations can he found in

Appendix A .1.

4.2.2 Task Graph

Task graphs are used to aid the scheduling of processes to more than one processor. Construction of optimal schedules is NP-complete in many cases. A detailed definition of NP-complete can be found in [62 p.

501 -558], The term NP-complete implies that an optimal solution may be very d iffic u lt to compute in the worst possible input case. However, construction of suitable schedules, that is, computing a reasonable answer for a typical input case, is not NP-complete [12 p. 598]. The task graph for the Jacobian with one RP (P = 1) is shown in figure 4.4.

This is obtained by calculating the complexity of each Jacobian equation. The complexity of each vector or matrix operation can be found in table 4,1. The circles in figure 4,4 represent the equations

(2.4), (2.7), (2.8), and (2.9). The arrows connecting circles indicate not only the sequence of application of equations, but also that data

40 sine N+l N+l cose

N-l

N+l N+l N-l

N+l N + l N+l N-l

N+l N+l sine cose

N+l N+l

N+l N+l N+l

Figure 4.4 Task Graph for Jacobian with P * 1

41 resulting from one equation is to be operated upon by the next indicated equation. The number in the circle represents the complexity for that particular equation, and the number adjacent the arrow represents the complexity of the I/O transmission. For example, the circle representing equation (2 .4 ), which computes the U matrix, indicates that the computation complexity of this operation is 6. The "3" on the arrow represents the I/O complexity for transferring a 3x3 matrix, i.e . three

3x1 vectors.

There are many alternatives to scheduling the entire task into one or more than one processor. In the following sections several architectures, obtained according to different partitions of the task graph, are proposed and compared. A number of measures have been developed to evaluate the architectures. They are listed below :

ET : Total execution time of the whole system

= total computation time + total I/O transfer time

+ processor idle time

IR : Initiation rate

= average number of initiations per clock unit

UP : U tilization of each processor

= RP fractional busy time

= (total computation time + total I/O time) / ET(P=n)

SP : Speed up

= the ratio of the total execution time for one RP to

the total execution time for n RPs

= ET(P=1) / ET(P=n)

42 CBR : CPU bound ratio

= the ratio of the computation time for one RP

to the total execution time for n RPs

= computation time(P=l) / ET(P=n).

RN : Register number of the RF

SCRAM : Size of control RAM

Total memory : includes the RF and CRAM

4,2.3 Architectures of the Jacobian

Four possible architectures for implementing the Jacobian,

1-Processor, 2-Processor, N-Processor, and cube interconnection network,

are proposed and compared in the following sections,

4.2.3.1 1-Processor Architecture

I f the task in figure 4.4 is executed by only one processor, its

architecture for implementing the Jacobian with one processor is straightforward and shown in figure 4.5. Here only one Robotics

Processor is used to calculate all Jacobian equations to obtain the

Jacobian matrix. For a robot with N degrees-of-freedom, 2N values of

sine and cosine of the N joints are received from the host computer through one of the input ports of the RP. As soon as the Jacobian matrix is calculated by the RP, 6N components of the Jacobian matrix are passed to the host computer through one of the output ports.

The corresponding timing chart, shown in figure 4.6, shows the

sequence of subtasks and the time required for each one. It is used to

help determine the measurement parameters, such as computation time and

CPU u tiliza tio n . The numbers in the timing chart, obtained from the

43 Host Computer

sinG, cose.

sine. cose.

N+l N+l0 N+l N+l V X ,. B,

Figure 4.5 Architecture for Jacobian with P ,4,9 , 35 P CLK cycles , 21 , 13 , 12 , 4 , 9 , 35 P_CLK cycles U l b"

a : input sine, and cose^ from Host Computer

b : compute N+l c ; compute N+l d : compute r. j N+l e compute

f : move to M+1IL location

Figure 4.6 Timing Chart for Jacobian with P = 1 microprogram for the Jacobian in Appendix A.2, are computation time, I/O transferring time, or idle time respectively. The unit for each of these numbers is one P_CLK period, or 1 microsecond. Subtasks in this timing chart are labeled from a to f. For example, to find one column of the Jacobian matrix requires all 6 subtasks. Thus, the total time required to obtain the Jacobian matrix is the time required to complete the 6 subtasks multiplied by N, where N is the number of degrees-of-freedom, i.e . (4 + 9 + 35 + 21 + 13 + 12) x N = 94N microseconds. The other measurement parameters for evaluating the architecture are computed in Appendix A.3.

4.2.3.2 2-Processor Architecture

For the 2-Processor architecture, the task graph for the Jacobian in figure 4.4 can be arbitrarily partitioned into two parts, left and right. The partitioning indicated by the dotted line as shown in figure

4.7 was selected to cause the complixity of each part to be approximately equal, 7.5N for the right part vs. 7N for the le ft.

Therefore, i f the task in each part is assigned to one processor, then the load sharing between the two processors should be almost equal. The architecture of the Jacobian with two processors is shown in figure 4,8.

Since some intermediate data must be transferred from one RP to the other, I/O time is increased. In figure 4.8, it can be seen that N 3x3

U matrices and N 3x1 garma vectors, or 4N 3x1 vectors in to ta l, are transferred between the two RPs. Recall from table 4.1 that the complexity for transferring a 3x1 vector is one. Thus, the complexity for transferring the necessary data between the two RPs is 4N. If this increase of I/O complexity is greater than the computation complexity

46 sine N+l N+l cose

N-l

N+l N+l N-l

N+l N+l N+l N+l N-l

RP 2

N+l N+l sine cose

N+l N+l

\ t N+l N+l N+l

Figure 4.7 Task Graph for Jacobian with P = 2

47 Host Computer

sine.,, cose sine,, cose

N+lN+l

RP1

N+l N+l N+l N+l

RP2

N+l N+l

Figure 4.8 Architecture for Jacobian with P = 2 reduction when two processors are used, the 2-Processor architecture has

no obvious advantage.

The timing chart corresponding to this case is shown in figure

4.9. This timing chart shows not only the sequence and amount of time

required by subtasks, but also the timing for transferring data between

the two RPs. RP1 performs subtasks a to f , while RP2 performs subtasks

g to m. Both are doing different subtasks, but they are synchronized at

points where they start to transfer data. For example, RP1 will not

transfer the 3x3 U matrix to RP2 until subtasks a and b are completed.

When it occurs, this transfer requires 18 P_CLK cycles, i.e . 18 microseconds. Since there are no handshaking signals and no buffers

between the RPs, one RP has to be receiving the data when the other is

transmitting. The transferring timing is synchronized not only by using

the same system clock but also by precise microprogramming. The measurement parameters are computed in Appendix A,4. The total execution time, 601 microseconds found in the Appendix A.4, is not

reduced much as compared the 658 microseconds of the 1-Processor

architecture since the increased I/O time takes up much of the

computation time saved by using two processors.

4.2.3.3 N-Processor Architecture

The task graph for the Jacobian in figure 4.4 can also be

partitioned into N parts for an N degrees-of-freedom robot, separated by

horizontal dotted lines as shown in figure 4.10. Note that only the Nth and 1st parts are exp licitly shown. Since each part is identical, each has the same computation complexity of 14.5. The task in each part is

assigned to one processor, thus N processors are required in to ta l. The

49 4 9 18 35 P CLK cycles 18 35 P_CLK cycles RP1 - f — f e f

N+l N+l j Nt\ Li V i 1 i

/ 18 21 13 RP2 14 18 P—CLK cycles 1 m

N+l I i i

a input sine^ and cosG^ from Host Computer 9 input AU.

b compute * ~*U.j h compute *^+V j_ N+l c output U. i idle N+l d compute U^_j j input N+*Xi

e output k compute N+l N+l N+l f move U^_j to 11. location 1 output 8^

m idle

Figure 4.9 Timing Chart for Jacobian with P = 2 sine N+l N+l cose

N-l

N+l N+l

RP N

N+l N+l v N+l N+l N-l

N + l N+l sine cose

N+l N+l ^0

N+l N+l v N+l N+l

Figure 4.10 Task Graph for Jacobian with P = N

51 architecture of the Jacobian with N processors is shown in figure 4.11.

As with the 2-Processor architecture, some Intermediate data needs to be transferred between Robotics Processors, and the I/O time is again increased. Figure 4.11 shows that one tJ matrix (3x3) and one £ vector

(3x1) are transferred between any two RPs.

The corresponding timing chart for the architecture in figure 4.11 is shown in figure 4.12. Every RP performs the same subtasks, a to 1.

I t can be seen from figure 4.12 that the R P (i-l) idles 59 P_CLK cycles before it starts subtasks a to 1. The purpose for these idles is to synchronize the transferring of data at certain points. Every RP repeats the suhtasks a to 1 again and again until the RP is rebooted by the host computer. The measurement parameters are computed in Appendix

A.5.

The I/O time for the N-processor architecture, 64 P_CLK cycles

(refer to Appendix A.5), is compatible with the computation time, 78

P_CLK cycles (refer to Appendix A.5). If the task is further partitioned, for example P=2N, the I/O time w ill further increase while the computation time decreases. Thus, for the N-Processor architecture, it is possible that the I/O time w ill exceed the computation time, consequently, the system throughput will not be improved. Furthermore, since no handshaking signals and buffers between Robotics Processors are used (synchronization scheme), it becomes more complicated to handle data transferring between the RPs as more processors are used.

Therefore, P=2N is not to be considered.

52 Host Computer

sin e sin e sin 0

COS 6 cos e

RP 1 N-l RP N

Figure 4.11 Architecture for Jacobian with P = N N+l N+l U i "i i i 4 9 18 35 P CLK cycles 18 21 13 RPi I1 l6h6|6H j k 1

N+l N+l Ui-1 -i-1 1 1

RP(i- 18 35 PCLK cycles 18 u l+ 59 P_CLK cycles tn -P» 1 a input sine.j, cos6i from Host Computer g compute N+l..

b compute 1_1U. h : compute

c input N+1u^ i : idle N+l N+l d compute U._j j : output r._j

e input N+1ri k : output N+*l-j N+l f Output U- j 1 : output N+^-

Figure 4.12 Timing Chart for Jacobian with P = N 4.2.3.4 Cube Interconnection Network

The total execution times for implementing Jacobian with the above three architectures increase linearly with the number of degrees-of-freedom. In this section, a parallel algorithm for implementing the Jacobian is considered using the cube interconnection network [12 p. 342] to realize the algorithm. The time required to perform the calculations increases as the log of the number of degrees-of-freedom. Conceptually, the rotational and translational transformations from tip to base are accomplished by grouping adjacent links in the fir s t step to form (N/2) groups of two links each. Then on each succeeding step, adjacent pairs of groups are grouped together until after [log(N)] steps there is one group encompassing all links.

This is analogous to multiplying N numbers in [log(N)] steps by multiplying first adjacent pairs, then adjacent pairs of pairs, and so forth.

Equations for the Jacobian (2.3) to (2.8) are rewritten as follows for a parallel algorithm.

N+1 U, I ( 2. 11) N+l

N+l 0 (2 . 12)

N+l N+l (N+l-21 ) U. U. (2.13) V + l- 2 1) X

N+l N+l (N+l-21) N+l t. (2.14) U(N+l-2l) X -i + -(N+l-21)

55 0 N+l N+l 0 (2.15) 2i = Ui-1 X 1

N+V = N+1t. . x 1“V +. (2.16) 1 —l - l —N + 1 where i = N, ( N - l) ...... 2, 1

1 = 0, 1, 2, . . . ; (N+l-21) > 1.

Note that N+*t is the firs t three elements of the forth column of the —i homogeneous transformation ^ *T, ; *ir?, , is the jth column of the 3 x 3 i —N+1 orientation part of 1 ^ n + j * t ^ie 3th component of the vector

N+l 3.. —i

The architecture for implementing the Jacobian in parallel form is shown in figure 4.13 (assuming 8 degrees-of-freedom). The rotational transformation U, a 3x3 matrix, and the translational transformation t, a 3x1 vector, are shown in each box. The computation time for finding the Jacobian matrix is proportional to the level number -log(8), or 3.

I f each box represents one Robotics Processor (RP), 20 RPs are needed in to ta l. The input data must be the sine and the cosine of theta instead

of theta since the RP cannot perform trigonometric functions. The total

execution time can be found from the following calculations :

56 u-,

L J, 1 J l i: l! J, 'X T u Tu T u V X 6,5 5 4 !“o l 7 li l5 i «t3*_1 ~ z _ii

5u.6.4 482 280 % 2 1 l0

TU8,5 5!. 5 %

8U V 8,3 8,28U i| 8,0 *3 l 2

\> Y y T Y V 3 8„ 8 8 i 8* £ 8 8a 8 8 * 8 8. 8 80 8 8fl 8 i r - 7 V —6 1 5’ - 5 I 4* i 4 ly £3 I?’ ^2 I

Figure 4.13 Architecture for Implementing Jacobian in Parallel (8 degrees-of-freedom) computation time

= T(finding U and t from the rotation angle theta) +

1og(8) x Tffinding U and t in the level 0, 1, 2) +

T(finding beta in the Jacobian matrix)

T(finding U and t from the rotation angle theta)

= 9 microseconds (refer to Appendix A.2)

T(finding U and t in the level 0, 1, 2)

= Tfmultiplication of two 3x3 matrices) +

T(multiplication of a 3x3 matrix with a 3x1 vector) +

T(addition of two 3x1 vectors)

= (35-8) + (17-8) + 6 (refer to Appendix A.l)

= 42 microseconds

Tffinding beta in the Jacobian matrix)

= T(3 cross products of two 3x1 vectors)

= (6M+7) with M=3 (refer to Appendix A .l)

= 25 microseconds so, computation time =9+3x42+25= 160 microseconds

I/O time

= T(input sine and consine theta) +

log(8) x T(transferring matrix U and vector t)

= 4 + 3 x T(transferring 4 3x1 vectors)

= 4 + 3x4x6 (refer to table 4.1)

= 76 microseconds

58 so, total execution time

= computation time + I/O time

= 160 + 76

= 236 microseconds (for 5, 6, 7, and 8 degrees-of-freedom)

Although the total execution time is reduced to about half as

compared to the 497 microseconds of the N-Processor architecture with 7

degrees-of-freedom in the last section, the architecture in figure 4.13

is not regular and the communications between RPs are no longer limited

to adjacent RPs. However, if a cube interconnection network is applied,

a regular architecture can be obtained allowing each processor to

communicate only with the other n processors, where n = log(N). This

network corresponds to an n cube.

A three-dimensional cube is illustrated in figure 4.14(a). The

processor element (PE) located at each vertex of the cube is directly connected to 3 neighbors. The addresses of neighboring PEs d iffer in exactly one bit position. Vertical lines connect vertices (PEs) whose addresses d iffer in the most significant bit position. Figure 4.14(b) shows that a 4 cube network can be considered as two 3 cube networks

linked together by 8 extra edges.

Figure 4.15 shows the required communications among the 8 PEs in different time slots. For example, in time slot to to t l , possible

communication pairs are (PEO, PEI), (PE2, PE3), (PE4, PES), and (PE6,

PE7) ; in time slot t l to t2, possible communication pairs are (PEO PE2),

(PEI PE3), (PE4 PE6), and (PE5 PE7); in time slot t2 to t3, possible communication pairs are (PEO PE4), (PEI PE5), (PE2 PE6), and (PE3 PE7),

59 ( 101) ( 111)

Figure 4.14(a) 3 Cube Interconnection Network

( 0 0 0 0 ) ( 1000) ( 0010) ( 1010)

(0001), ( 0011 ) 1001 (1011 '(1100) ( 1110) 110 ) ( 0100)

( 0101 ) (0111) ( 1101) ( 1111)

Figure 4.14(b) 4 Cube Interconnection Network

60 PE ( 0 0 0 )

( 001)

( 010 )

(Oil)

( 100)

( 101)

( 110)

(111)

time

Figure 4.15 Comnunication of the 8 PEs in Different Time Slots

61 Figure 4.16 shows a 3 cube interconnection network for the parallel implementation of the Jacobian with 8 degrees-of-freedom. Each row represents a processor element (PE), which is numbered from 0 to 7.

The rotational transformation U required to be calculated at each time, such as tO, is shown in the box. The translation transformation t_ is not exp licitly shown but is understood. A blank box means that the PE is idle. The links between the PEs indicate that the matrix U and the vector t are transferred from one PE to the other PE. Some of the links between the PEs are missing, which means that there is no communication needed in that time slot. Figure 4.16 also shows that PEs are doing different calculations at different times. For example, at the time to, each PE calculates the matrix U from each rotation angle, theta, while at the time t l , PEI is idle.

Figure 4.17 shows the communication between PEs in different time slots. In time slot tO to t l , communication pairs are (PEO PEI), (PE2

PE3), (PE4 PE5) and (PE6 PE7); from t l to t2, communication pairs are

(PEO PE2), (PEI PE3), (PE4 PE6) and (PES PE7) ; t2 to t3, communication pairs (PEO PE4), (PEI, PES), (PE2, PE6) and (PE3 PE7). Except during the last time slot, all communications are bidirectional. I f there is only one channel between two adjacent PEs or the communication is half duplex, the transferring time will be about double compared with the transferring time in figure 4.13, where communication is unidirectional.

Figure 4.18 shows a 4 cube interconnection network for implementing the

Jacobian with 16 degrees-of-freedom. At the time t 4 , PE8, PE11, PE14 and PE15 are idle. PE9 calculates the matrix U by multiplying three matrices instead of two as at the time t l , t2 and t3. The calculation is:

62 PE

( 000)

( 001)

( 0 10)

(Oil)

( 100)

( 101)

( 110)

------i _ [ ( 111) : 8 n 7 \ j i ' j L

time

Figure 4.16 3 Cube Interconnection Network for Implementing Jacobian in Parallel (8 degrees-of-freedom)

63 (000) (010) a) t n-t I

( 001) O il) ( 110) ( 100)

( 101) (111)

( 000) ( 010) b) t . - t I

( 001) ( 110) ( 110) ( 100)

( 101) (111)

( 00 0 ) ( 010) c) t 2- t 3

( 001) (o il) ( 110) ( 100) Ji

( 101) (111)

Figure 4.17 Communication Between PEs in Different Time Slots

64 PE 16, (0000 '0 '

"16, 16. (0001

(0010

416, 16, (0011

(0100 16,,

(0101

(0110

(0111 l6L

(1000 160

161 (1001

(1010 16l

16, (1011 J11

(1100 J\2

(1101

16, (1110 14

16, (1111 J15

time

Figure 4.18 4 Cube Interconnection Network for Implementing Jacobian in Parallel (16 degrees-of-freedom)

65 (N/2)+l In general case, the calculation of at the last step is

(N/2)+1 (N/2J+1 (N/2)+2 (N/2)+{N/4) Uw(2.32) N = (N/2)+2 X U(M/2)+4X " X Pi

The computation time is the order of log(N) + log{N/2) - 2, or log(N).

4.2.3,5 Comparison

The cube interconnection network described in the last section is suitable for the parallel implementation of the Jacobian. The processor element in the network must have bidirectional transmission capability on the I/O port. Since the Robotics Processor has only unidirectional transmission capability, it cannot be used to realize the network.

Therefore, the cube interconnection network is not to be compared with the other three architectures described above. The measurement parameters of the firs t three architectures are listed in table 4.2. N is assumed equal to 7 with one redundant degree-of-freedom. Since the total execution time for the 1-Processor architecture is acceptable and the total memory size Is moderate, the architecture with only processor is the best choice. If the N is large enough, the cube interconnection network might be the better choice. Table 4.2 Comparison of Three Architectures for Jacobian

| N=7 P»1 | P*2 | P-N |

| ET (microsecond) 658 t 601 1 *97 1

| IR (1/microsecond) 1/658 | 1/588 | 1/143 |

I UP (%) 100 | 88 1 100 1

| SP — i 1.1 1 1*32 1

| CBR (%) 96 | 54 1 63 |

| RN (32-bit) 39 t 30 1 39 |

) SCRAM (bit) 3.7K | 2.4K | 4 .5K |

| Total memory (b it) 5K | 3.4K | 5.8K |

4.3 Inverse Jacobian

Given the six components of the velocity of the end effector, including both linear and rotational, Inverse Jacobian solves the angular velocities of each of the joint angles. The equations of

Inverse Jacobian are listed as follows. For the case N (number of degrees-of-freedom) not equal to 6, a pseudoinverse method is employed.

1) N > 6

-flxllu 1 s Nx6 * £ 6xN Jc « Nx6 J« c -^ l " x 1L l , (2.17)

67 Define _, R_ and C as follows: 6x6 6xn —6x1

A r = [Ac 1 Ac2 Ac3 Ac4 Ac5 Ac6] = J „ r ( 2 .IB) 6x6 — — — — — — ' 6xN Nx6

R_ , = [Rrl Br2 Rr3 Rr4 Rr5 Rr6lT = d A- * f2.19) 6x6 — — — — — — 6x6

where d is the determinant of A„ - 6x6

£««1-d^xsis,! {7-?n)

2 ) N = 6

—* = 6x6 —6x1L i (2*21)

3) N < 6

T -IT* —Nxl-5ai i = Nx6 c 6xN Nx6 a —6x1 2Le i (2.2?)

These equations are solved as is any system of linear equations.

4.3.1 Methods for Solving Linear Equations

There are many ways to solve linear equations, for example

Gaussian Elimination, LU-decomposition, Faddeev Algorithm, and Inverse

Matrix with Determinant. These methods are to he explored to determine whether or not they are feasible for VLSI implementation.

1) Gaussian Elimination

The complexity of solving N linear equations with N unknowns using

Gaussian Elimination method is the order of NxNxN, including 1/2 x

N(N+1) divisions, N( 1/3 NxN + 1/2 N -5/6) multiplications and N{1/3 NxN

+ 1/2 N -5/6) additions, while the inverse matrix with determinant is

6 8 the order of N! [19, p, 208], If N=6, then the complexity is 216 vs,

720. Even though Gaussian Elimination has less complexity, it has some disadvantages :

(a) It is not regular. Specifically, arrangements must be made to

avoid picking a pivot which would result in the division by zero.

Furthermore, because the choice of the pivot cannot he predicted

in advance, it does not seem to be feasible to implement it using

VLSI chips.

(h) To be accurate, the pivot must be wisely chosen, Whether

this algorithm chosen uses a "partial pivot" or “complete pivot"

[19, p. 187], the pivot selection process destroys the regularity

of the layout which is absolutely required in VLSI.

(c) Gaussian Elimination requires 1/2 N(N+1) divisions. Since the RPs

cannot perform division, this operation would have to be performed

by the host computer.

2) LU-decomposition

Although LH-decomposition implemented with VLSI chips has been widely used to solve linear equations [25] [26] [27], there are s till some disadvantages :

(a) The characteristic matrix must be a symmetric positive-definite or

an irreducible, diagonally dominant matrix,

(b) Once again, the division operation is required.

3) Faddeev Algorithm

By applying Faddeev Algorithm and using an (N+l) x (N+l) array of processors, the entire calculation for solving the linear equations can be performed in the order of N time steps [28], However, the

6 9 disadvantage is that four types of processors are required, one of which is a d iv id e r.

4) Inverse Matrix with Determinant

The complexity of finding an inverse matrix can be much reduced hy finding the determinants of some matrices with i x i dimension and sharing these determinants (see Appendix A . 6 ). These determinants can then be obtained by finding the determinants of the reduced ( i - 1) x

(i-1) matrices. In the following sections, several possible architectures based on this method are explored and compared. The

Robotics Processors can be used to achieve the matrix inverse elegantly in some architectures.

4.3.2 Architectures of the Inverse Jacobian

Four possible architectures, 1-Processor, 6-Processor,

12-Processor, and 24-Processor are proposed and compared based on the measurement parameters described in section 4.2.2. To calculate the parameters, computation time and the number of required temporary registers for each vector inner product must be known. For example, it takes (3M+10) P_CLK cycles to do M inner products with vector size 3x1.

They are listed in Appendix A.7 and their reservation tables can be found in Appendix A .l. For the follow ing sections, N > 6 is assumed, where N is the number of degrees-of-freedom,

A.3.2.1 1-Processor Architecture

The architecture for finding the Inverse Jacobian using only one

RP is shown in figure 4.19. I t shows the necessary data flow between the host computer and the RP. The procedures to solve the Inverse

7 0 Host Computer

6xN

Figure 4.19 Architecture for Inverse Jacobian with P = 1 Jacobian, i.e. finding the derivative of theta, and the detailed calculations for the measurement parameters can he found in Appendix

A.8. Since the RP does not have division capability, the reciprocal of the matrix determinant, 1/d , must he computed by the host computer. It is assumed that the host computer can complete the reciprocal in the time required in steps 7 and 8 in Appendix A . 8 , i.e . 6M + 14 = 6 x (6+7)

+ 14 = 92 microseconds. Most commercial microprocessors with a numeric coprocessor can complete division in this amount of time. For example,

Intel 8087 (5 MHz clock) can complete it in 39 microseconds.

4 .3 .2 .2 6-Processor Architecture

The architecture for finding the Inverse Jacobian using six RPs is shown in figure 4,20. It shows the necessary data flow between the host computer and the RPs. It can be seen that there is no communication between RPs. The procedures to solve the Inverse Jacobian according to this architecture and the detailed calculations for the measurement parameters can be found in Appendix A .9. I t is assumed that the host computer can complete the reciprocal in the time required in steps 8 to 10 in Appendix A,9, i.e. (6x1 + 14) + 6x2 + (6x1 + 14) = 5? mi croseconds.

4.3.2.3 12-Processor Architecture

The architecture for finding the Inverse Jacobian using twelve RPs is shown in figure 4.21. I t shows the necessary data flow between the host computer and the RPs, and the data flow between the RPs. It can he seen that 20 determinants are passed from RPi to R P i 1 (i= 1 to 6).

Also, some intermediate data must to be broadcast from one RP to the

72 Host Computer

6xN6xN 6xN ^6 x 1 6xN ^6 x 1 6xN 5C. 5Ci 1/d 1/d 1/d 1/d

RPI RP2 RP3 RP4 RPS RP6

Figure 4.20 Architecture for Inverse Jacobian with P = 6 Host Computer

6xN 6xN — 1 6xN

Ac.,

RPI RP2 RPC

20 d(3x3) 20 d(3x3) 20 d(3x3)

RPI RP2 RP6 1

^6 x 1 d 5Cj Cj

}'* ____ 3^1 1/d

Figure 4.21 Architecture for Inverse Jacobian with P = 12 other RPs. This makes the I/O transmission procedures more complicated

and microprograming more difficult. Furthermore, I/O time is apparently

increased because more interm ediate data are transferred between RPs.

The procedures to solve the Inverse Jacobian according this architecture

and the detailed calculations for the measurement parameters can he

found in Appendix A .10. I t is assumed that the host computer can

complete the reciprocal in the time required in steps 10 to 12 in

Appendix A .10, i.e . (6x1 + 14) + 6x2 + {6x1 + 14) = 52 microseconds.

4.3.2.4 24-Processor Architecture

The architecture for finding the Inverse Jacobian using twenty

four RPs is shown in figure 4.22, It can be seen that 20 determinants

are passed from RPib to RPic (i= 1 to 6 ) and 6 determinants from RPic to

RPid. As with the 12-Processor architecture, some intermediate data must be required to be broadcast from one RP to the rest of RPs. The I/O

time is increased greatly over the 12-Processor architecture since more

data is required to be transferred between RPs, The procedures to solve

the Inverse Jacobian according to this architecture and the detailed

calculations for the measurement parameters can be found in Appendix

A. 11. I t is assumed that the host computer can complete the reciprocal

in the time required in steps 11 to 13 in Appendix A .11, i.e . (6x1 + 14}

+ 6x2 + (6x1 + 14) = 52 microseconds.

4 .3 .2 .5 Comparison

The measurement parameters of the above four architectures are

calculated in Appendix A . 8 to A .11 and summarized in tab le 4 .3 , N is

assumed equal to 7 with one redundant degree-of-freedom.

7 5 Host Computer

$ Ac^, Ac^

20 d(3x3) 20 d(3x3) "■g CT) ... ■ ^ 4

— 3 RPic — If — — — ^ 5 ------> RP6c

6 d(5x5) 6 d(5x5)

RP Id — Jf w RP6d Jcc — 1 ■■■ ACj 5ci c6 1/d d 1/d ^6 x 1 r 61 *£xl 1 C

Figure 4.22 Architecture for Inverse Jacobian with P = 24 Table 4.3

Comparison of Four Architectures for Inverse Jacobian

N=7 P=1 P=6 1 P=12 1 P=24

ET (microsecond) 1838 542 582 594 |

IR ( l/m1crosecond) 1/1838 1/542 1/355 1/238 j

UP (%) 100 100 82 62 |

SP 3.4 3.16 3.1 | —

CBR (%) 94 53 40 30 |

RN (32-bit) 119 63 56 53 |

SCRAM (bit) 82K 17.7K 10. 5K 6.08K |

Total memory (b it) 8 6 K 20K 12.3K 7.78K |

It can be seen that 86 K bits of memory are required for the I-Processor architecture. It is impractical to put such a large memory into the PP.

The execution times for the architectures of 6 -Processor, 12-Processor and 24-Processor are in the same range. Even though the 6-Processor architecture has the least execution time, the required memory, 20K b it, is s till considered too large with current VLSI technology. Both the

6-Processor and the 12-Processor architectures are very regular. But, the 24-Processor architecture is not too regular although it has the smallest memory size. Therefore, because of the regularity of the architecture and the moderate memory size, the 12-Processor architecture is the best choice for the Inverse Jacobian.

77 4.4 Inverse Dynamics

The Inverse Dynamics problem is: given the desired acceleration, find the necessary forces and torques. The equations of Inverse

Dynamics are listed as follows :

Forward Recursion:

0 i 1-1 T 1-1 0 (2.23) * u, 1 i i , . i + 0 .

0 0 i. 1-1 T 1-1- i -1 to. = U. { (*).,+ 0 0 (2.24) —1 1 —1-1 + ^ - 1 X 0 . 0 .

u* i-l,.T i-I k . i. iD* , i ,i i*. P. = U. P. , + (D. x P. + w, x ( w. x P.) (2.25) —i i —i- l —i —i —i - i —i ' i •* i i i i * i ■■ S. = 5i. x S. + (*). x ( id. x S.) + P. (2.26) —i —i —i —l —i —i —i

F. = m.S . (2.27) —i i—i

1 U 1 1 1 * 1 / 1 , 1 ^ N, = J . u), +

Backward Recursion:

<-y = ’-y {y. * y ,, (2.29) —i i —i —i+l i - l i - l i ★ -i * JL “ V Hi+ 1 ♦ V {W X W ^i + l } (2‘ 30) i = N, (N-l), . . . 2, 1

78 Here, only revolute joints are considered. The notations used above are summarized as follows:

1—*3x 1 * the an 9 u*ar velocity of link i

* 3 x 1 : the angular acceleration of link i

i-l,, : rotational transformation matrix 3x3

■ 0^ 9^ : the joint generalized variable for joint i

i *■ Pi- , : the acceleration of the origin of coordinate i -'3x 1 3

i * —* 3x 1 ' 3 vec*;or **° *'*1e ori 9 '*n coordinate i from the origin

of coordinate i - l

i *• —* 3 xl * acce*erat*on the center of gravity of link i

* —* 3xl " 3 vec^or *:o *'*1e cef|Ter gravity of link i from the

origin of coordinate i i : the total force (excluding gravity) on line i 3x1 mi : the mass of link i

: the total torque on link i

^*3x3 * t ^ie *ner* * a Tensor (with respect to its center of mass)

of link i

i - l _filu1: constraint force (unknown) exerted on link i by link (i-l) 3x1

i-l : constraint torque (unknown) exerted on link i by link ( 1- 1) -*3 x l

79 4.4.1 Task Graph

The task graph for Inverse Dynamics is shown in figure 4.23 and

4.24. It is clearly much more complicated than the task graph for the

Jacobian in figure 4.4. The task graph is obtained hy calculating the complexity of each of Inverse Dynamics equations (2.17) to (2.24). Using this task graph, four possible architectures are proposed and compared in the following sections,

4.4.2 Architectures of the Inverse Dynamics

Four possible architectures for implementing Inverse Dynamics are

1-Processor, 2-Processor, N-Processor and 2N-Processor.

4.4.2.1 1-Processor Architecture

The architecture for Inverse Dynamics with only one RP is shown in figure 4.25, The necessary data transferred between the host computer and the RP is also shown in the figure. The corresponding timimg chart, shown in figure 4.26, shows the sequence of subtasks and the time required for each subtask. Subtasks in this timing chart are labeled a, b l, b2, . . , bN, c l, c2, cN, and d. The exact amount of time for each subtask is obtained from the microprograms for forward and backward recursion of Inverse Dynamics in Appendix A .12 and A .13, The measurement parameters evaluating the architecture are computed in Appendix A .14.

4.4.2.2 2-Processor Architecture

The architecture for the Inverse Dynamics with two RPs is shown in figure 4.27, along with the necessary data transferred between the host computer and the RP, and transferred between the two RPs. The corresponding timimg chart, shown in figure 4.28, shows not only the

80 sin 9 -, cos

-i

—i

- i -1

-l

—i

F

Figure 4.23 Task Graph for the Forward Recursion of Inverse Dynamics

81 sine., cose.

—i " i+1

- i +1

—i

n. —l

Figure 4.24 Task Graph for the Backward Recursion of Inverse Dynamics

82 Host Computer

sin 6j, cos 0^

sin 6.., cos eu N N eN* 6N 00 w

/

Figure 4.25 Architecture for Inverse Dynamics with P = I 106 . . 106 . 50 50 | 50 | m 4N 106 1 I \— 1---- 1 ------|------1------1------1------bl bN C1 c2 CN d a bj

a input sinQj, cos6j, sine^, cose^ b. : compute Forward Recursion of Inverse Dynamics c. : compute Backward Recursion of Inverse Dynamics d output t , - - t 1 N

Figure 4.26 Timing Chart for Inverse Dynamics with P = 1 Host Computer

sine,, cose

sine.,, cose

RP1

sine., cose sine.,, cose

RP2

Figure 4.27 Architecture for Inverse Dynamics with P = 2 a : input sine^, cose^ - - sine^, cose^

: compute Forward Recursion of Inverse Dynamics c : output sine,, cose,, F,, N ,, - - sine.., cose,., F.. Nu 1 I “1 —I fi (1 —H, —N d input sinQj, cosej, Fj, Nj, ------, sine^, cose^, F^, e. : compute Backward Recursion of Inverse Dynamics f : output Tj - - - Tj^

Figure 4.28 Timing Chart for Inverse Dynamics with P = 2 sequence of subtasks and the time required for these, but also the timing of transferring data between the two RPs. RP1 executes subtasks a, b l, b2, and bN; RP2 executes subtasks c, d, e l, e?, eN, and f. RP1 will not start to transfer data to RP2 until it completes subtasks a, b l, b2, and bN. The I/O time created by transferring the data from RP1 to RP2 is 16N microseconds. The exact amount of time for each subtask is obtained from the microprograms for forward and backward recursion of Inverse Dynamics in Appendix A .12 and A .13. The measurement parameters evaluating the architecture are computed in

Appendix A .15.

4.4.2.3 N-Processor Architecture

The timing chart for the forward recursion of Inverse Dynamics with one RP per link, obtained from Appendix A ,12, is shown in figure

4.29. The subtasks in the figure are labeled from a to p. The timing chart for the backward recursion of Inverse Dynamics with one RP per lin k, obtained from Appendix A .13, is shown in figure 4.30. Its subtasks are labeled from a to k. From the two timing charts, it can be seen that the data transfer initiation times can be easily aligned.

The architecture for the Inverse Dynamics with N RPs is shown in figure 4.31. The N RPs are connected in a one dimension array. The figure shows the necessary data transferred between the host computer and the RPs, and between any two adjacent RPs. It can be seen that the data transferring between any two adjacent RPs is not unidirectional but bidirectional. Therefore, if the RP is to be employed in the architecture, two of the I/O ports of the RP must be bidirectional.

Based on the two timing charts above, a corresponding timing chart can

87 -j_l I I “i_l

43 25 6.6 14 19 I 6 i6 30 RPi fi-( 9 { 6 4 6 1 30 . l l l i M ± ± * J L b e d 1 f 1 q 4 f 1 k 1 1 1 * 4 r 4 o p 'a 'b c d ' f 1 a 1

Hi | | ^ i i!u 14

30 43 RP(i+1) ( UJ 6|___26___|4 |6 ( 6 | 4 |9 | 6 | 6 |

^i+1 - i +1 U*1 00 00 u 1

a input sine, and cose^ from Host Computer l input

b compute i_ 1U. j complete the computation of

c input k idle - i -1 d input 1 ouput P.

e compute and ^ m compute F_. and

f output ^ n output sine^ and cose^

g output o output

h compute part of P. P output Nj

Figure 4.29 Timing Chart for Forward Recursion of Inverse Dynamics with One RP per Link input sine^ and cose^ from Host Computer f : compute _f.

compute 1_1U. g idle

input h output f •

input i : input ni+1

input f1+l j : compute n_.

k output

Figure 4.30 Timing Chart for Backward Recursion of Inverse Dynamics with One RP per Link Host Computer

\

sin e sin e sin e

cos 0 cos 8 cos e, -o

-0

—0 —1 ’“ 1 RP1 RP2 RPN * z l 2 * N~1 N-l

Figure 4.31 Architecture for Inverse Dynamics with P * N be developed in figure 4.32. The figure shows the timing of transferring data between the RPs, 3 RPs in this example. To align the

I/O transmission, some RPs must idle until the other RPs are ready to

receive or transmit. The CPU u tilization is thus decreased. The measurement parameters evaluating the architecture are computed in

Appendix A .16.

4.4.2.4 2N-Processor Architecture

The architecture for Inverse Dynamics with 2N RPs is shown in

figure 4.33. The 2N RPs are connected in a two dimensional array. It can be seen that unlike the N-Processor architecture the data transferred between any two adjacent RPs is unidirectional. This allows four I/O ports in the RP to be unidirectional. The RPs of the upper row perform the forward recursion, while the RPs of the lower row perform backward recursion. It is assumed that RPi‘ {i =1 to N) always uses the most updated data transferred from the RPi. Therefore, there is no buffer required between any two RPi and RPi1. The corresponding timing chart is shown in figure 4.34. The figure shows the timing of data transferred between the RPs, 6 RPs in this example. It can be seen that

RPI1 to RP3* are idle sometime to wait for the data transferred from the other RPs. Also some idle time is created because of the uneven load sharing between the RPi and RPi 1 (forward recursion is more time consuming than backward recursion). The measurement parameters evaluating the architecture are computed in Appendix A .17.

91 RPI | - i 0- . [ ___^ _____ , | 34 | 46 | 40 | ™ ------1 i t I

m |—22— |------! « — | | 34 | 46 | | 40 | m 1

1 t 1

RP3 ( J 2 _ |_____W , | 34 | 46 t I_21_|_144 ------, <£> ro

Forward _ Backward J Recursion 1 RecursionDflrnrcinn *

Figure 4.32 Timing Chart for Inverse Dynamics with P = N (N = 3 for example) Host Computer

sin 8 sir e sir 6

COS 8

RP N

U>

sir e sir e sir e.

COS 6

RP N' r-

Figure 4.33 Architecture for Inverse Dynamics with P = 2N RP 1 | 40 | 160 | 4Q | 160 | 40- ( - A 60_____ | 40 | 160___| ______4 4 4 4 , 40 , 160 , 40 , 160 , 40 , 160 , 40 , 160 , RP 2 I 1------1------1------1------1------1------1------1 ------4 4 4 4 | 40 | 160 i 40 i 160 | 40 | 160 j 40 | 160 | RP 3 4 4 4 lO | 34 | 71 | | 34 j 71 | ,3 4 ,7 1 ■t* RP i' 4 4 4 RP 2' I 34 i 71 i i 34 i 71 i i 34 i 71 i 4 4 4

RP 3' I 34 I 71 | | 34 I 71 I ^ 4

Figure 4.34 Timing Chart for Inverse Dynamics with P = 2N (N = 3 for example) 4.4.2.5 Comparison

The measurement parameters of the ahove four architectures are calculated in Appendix A .14 to A .17 and summarized in table 4.4. N is assumed to be equal to 7 implying one redundant degree-of-freedom.

Table 4.4

Comparison of Four Architectures for Inverse Dynamics

N=7 P=1 P=2 P=N | P=2N |

ET (microsecond) 1134 1246 708 1 749 |

IR (1/microsecond) 1/1134 1/882 1/708 j 1/200 |

UP (%) 100 71 37 1 76 |

SP .91 1.6 1 1.5 | —

C8 R {%) 96 62 22 1 39 |

RN (32-bit) 371 329 65 1 59 |

SCRAM (b it) 9 . IK 6 . 8 K 10.7K | 6.9K |

Total memory (b it) 2 IK 17.3K 12.8K ( 8 . 8 K |

Bidirectional I/O no no yes 1 n0 t

It can be seen that even though the N-Processor architecture has the least execution time, its bidirectional I/O bus makes the interface circuit more complicated, thus, it w ill not be considered. The memory

95 sizes for both 1-Processor and 2-Processor architectures are too large to be put into the RP chip with current VLSI technology. The memory size for the 2N-Processor architecture is less than 10K, which can likely be put into the RP. Its execution time is 750 microsecond, which is acceptable. Also, the connections between RPs are very regular.

Thus, the 2N-Processor architecture is the best choice for Inverse

Dynami cs.

4.5 Summary

In this chapter, several possible architectures for the different applications, Jacobian, Inverse Jacobian, and Inverse Dynamics, are explored and compared. Because of essential circuits required in the

RP, such as the Floating Point Adder/subtractor and Multiplier, there is limited area le ft for the Register File and Control RAM. Therefore, the total memory size becomes a very important factor in determining which architecture is the best choice. Also, in the control system, the in itia tio n rate is not as important as the total execution time since the total execution time affects the s tab ility of the control system.

As more and more RPs are used to solve the problems, such as

Jacobian, Inverse Jacobian, and Inverse Dynamics , the communications between them increase and thus the CPU Bound Ratio (CBR) tends to decrease. In some cases, using more RPs to solve the same problem w ill even result in having more total exection time than when fewer are used.

The function partion between the RPs greatly affects the whole system throughput. Task graph concepts are introduced to achieve the best function partition, i.e . minimize the data transferring between RPs and the total exection time.

96 CHAPTER 5

CIRCUIT DESIGN OF THE ROBOTICS PROCESSOR

5.1 Introduction

The general description of most major functional blocks of the

Robotics Processor (RP) w ill be discussed in this chapter, while the detailed circuit designs are explained in Appendix R, The functional blocks of the RP, shown in figure 5.1, are Clock Generator (CG),

Bootstrap Unit (BU), Format Converter for BU (FCB), Control RAM (CRAM),

Sequencer (SEQ), Microcode Register (MCR), Register File (RF), Floating

Point Adder/subtractor (FPA), Floating Point Multplier (FPM), Format

Converter East (FCE), Format Converter West (FCW), Format Converter

North (FCN), and Format Converter South (FCS). In addition, the Level

Sensitive Scan Design (LSSD) technique will be discussed. The detailed designs of Register F ile (RF), Sequencer (SEO), and Control RAM (CRAM) have not been designed yet and so will not be covered in this dissertation.

5.2 Clock Generator

The Clock Generator (CG) generates all clock signals needed in the

RP. Figure 5.2 shows the block diagram of the CG. It consists of three

Two-Phase Generators (TPG), two Counters (CNT), and two Johnson

Counters, JCNTR to generate Jra and Jrb signals and JCNTF to generate

97 BN BT.HUR.LC

FCB

FCN

CRAM SEQ

MCR

FCU

Addresses and Control signals

FPM FPA

SYS CLK FCS

BS

Figure 5.1 Rlock Diagram of the Robotics Processor SYS CL K( 16MHZ)

IPG

to BU and mol 24.ca

CNT SYS CLK/2 CNT P CLK TPG TPG :^a)

* 1/2

TPG: Two-Phase Generator P4j: Pi pel ine ; i Pi2 ■ Pi pel ine 4,2 JCNTR JCNTF JCNTR: Johnson CouNler of Rising trigger JCNTF: Johnson CouNTer of Fal1ing trigger

Jra Jrb Jfa Jfb

Figure 5.2 Clock Generator (CG) the Jfa and Jfb signals. The timing of all signals generated by the CG is shown in figure 5.3. Phase-1 and phase-2 of the 16 MHz System Clock

(SYS_CLK), and are used in the BU and FPM. Phase-1 and phase-2 of the Pipeline Clock (P_CLK, 1 MHz', P<{^ and P>^, are used to latch the microinstruction at the Microcode Register (MCR) and to latch the temporary data at the pipeline registers in the FPA and FPM. Jra, Jrb,

Jfa, and Jfb are used for time multiplexing data and addresses onto the buses at the proper times. These signals are identical but are shifted with respect to each other by one SYS_CLK period as is apparent from figure 5.3. The detailed circuits for the TPG, JCNTR, and JCNTF are described in Appendix B .l.

5.3 Bootstrap Unit and Format Converters

The RP is designed to be used for more than one application and must therefore be programmable. During in itia liz a tio n , the host computer loads appropriate application microprograms and constants to the RPs. This loading process is accomplished by the Bootstrap Unit

(BU) and the Format Converter for Bootstrap (FCB). While a microprogram is being loaded, the BU provides the necessary control signals and loading addresses to the Control RAM (CRAM). During microprogram execution, the SEQ generates the control signals and supplies the next microinstruction address to the CRAM.

Figure 5.4 shows the block diagram of the BU, FCB, and CRAM indicating the paths used for microprogram loading. The RU consists of a Synchronization Controller plus Bootstrap Controller (SC+BTC) and a counter (CNT). The SC+BTC generates all control signals to the FCB and

1 0 0 -in_njTJi_rLr^u^r^r^Rrurn_nj'

SYS_CLK/2 1

♦ l/2 1

*2/2 1

PjCLK J I

_J Pf; “I 1

Jra

Jrb

Jfa

Jfb J

Figure 5.3 Clock Signals Generated from the Clock Generator (CG) 8E BT HWR LC

LOU* — I f LDl * — j SC + BTC

LD0 LDl LD2

INC WEN

CRAM CNT CLR

L _ Address from SEQ

Addresses and control signals

Figure 5.4 Block Diagram of the BU, FCB, and CRAM Indicating the Paths Used for Microprogram Loading CRAM. The counter generates the loading address for the CRAM. The FCB concatenates three 16-bit words sent from host computer to form a 40-bit microcode, shown in figure 3.2, (8 unused bits are discarded) and then stores the result in the CRAM. The detailed procedures for loading microprogram and circuit designs for the SC+BTC are described in

Appendix B.2.

From figure 5.1, it can be seen that the external bus on each side is 16 bits wide while the three internal buses, Rus A (RA), Rus R (RB), and Bus C (BC) are all 32 bits wide. Therefore, four format converters are required to change the 32-bit data to two 16-bit words or visa versa. The detailed circuit designs for the four format converters are explained in Appendix B.3.

5.4 Testability of the Chip

As advances in the VLSI technology increase, more and more components can be put into a chip, resulting in improved performance. As a result of this increase in complexity the testing problem becomes much more d iffic u lt. In the case of a highly complex sequential circu it, a complete testing of every aspect of the circuit may be impossible unless some provision for te s ta b ility of the circuitry is included in the chip design. Thus, the system architect and the logic circuit designer must consider the testability of the chip when they begin to design it . In this section, several methods for providing for testability of design are discussed. For reasons which w ill be described late r, Level

Sensitive Scan Oesign (LSSD) was selected as the testing method for the

Robotics Processor.

103 5.4.1 Structured Design for Testability

Three major techniques constitute design for testabilty: ad hoc approaches, structured approaches, and self-test approaches T601C6H. Ad hoc approaches are sometimes appropriate for specific designs but are not generally applicable.

Both structured approaches and self-test approaches are applicable to chip and board level designs. The structured approach to designing for testablity involves the structuring of registers, internal to the design, in such a way that the data contained in these registers can be controlled and observed. Two specific methods for accomplishing this are Level Sensitive Scan Design, LSSD, developed at IBM, and Scan Path developed at NEC.

A recently developed self-testing technique integrates the Scan

Path, LSSD and Signature Analysis concepts. The system registers are used both to generate the random patterns and to compress test results.

This integrated technique is called "built-in logic block observation", or BILBO. Recently, BILBO has been given considerable attention because of certain advantages over other approaches. Specifically, comparing the

LSSD and BILBO approaches, the ratio of the time needed to generate and apply LSSD patterns versus the time needed to apply pseudorandom patterns for the RILRO is (L x K) [61j, where L is the maximum length of the shift register latch (SRL) and K is the ratio of the speed at which the shift register could be shifted in RILRO versus the speed at which the test patterns could be generated for LSSD. K is usually in the range from 100 to 1000. The derivation for this relation is as follows.

Assume P patterns are to be applied in one design using LSSD and another

104 using RILBO, the time needed for LSSD will be

T(LSSD) = P x L x (1/TPGS)

where TPGS is test pattern generation speed in patterns/sec. The time

needed for BILBO w ill be

T(BILBO) = P x ( 1/SRLS)

where SRLS is SRL shifting speed. Thus, the ratio of the time needed

for LSSD versus the time needed for BILRO is

T(LSSD) / T(BILBO) = (P x L / TPGLS) / (P / SRLS)

= L X (SRLS / TPGS)

= L x K.

It can be seen that the testing time for the LSSD approach applied to a

given system is significantly greater than that for BILBO applied to the

same system. However, it is not sufficient to restrict the

consideration to only the ratio of testing times. The ratio of

propagation times for these two approaches must also be considered when each is operating in the non-testing, or normal mode. Both testing

approaches make use of registers to control and observe parallel data.

Both approaches require that the registers operate in a parallel and a

serial shift mode. The addition of the serial shifting capability to a

VLSI register does not increase the register setup or propagationtimes.

However the signature analysis mode inherent in BILBO requiresthe

insertion of an Exclusive-Or function plus an And function in each

register input, as shown in figure 5.5. The result is that the

propagation delay of a BILRO design, in the non-testing mode, is two to

three times greater than the corresponding LSSfl design.

105 Scan-Cut SRL Out Out N SRL Out Out 0 Figure 5.5 Logic Circuit Diagram of BILBO Registers nux Scan-In— 0

106 Another consideration is that the requirement to add the And

function and Exclusive-Or function at each bit position of the BILBO

shift register will increase the required chip area occupied by the

register configuration by as much as two to three times.

In conclusion, if testing time is the primary consideration in the

decision for the selection of the LSSD or the BILBO method for testing,

the la tte r would obviously be selected. However, if real-time execution

speed is the primary factor and i f testing can be done o ff-lin e , the

LSSD approach seems preferable.

5.4.2 Level Sensitive Scan Design (LSSD)

Because the Robotics Processor (RP) 1s designed for real-time

control systems, the execution speed is, by definition, a primary

concern. Also, the RP is a highly complex chip demanding that chip area be conserved as much as possible. For these reasons LSSD is selected as the testab ility approach in the RP.

LSSD is a testab ility technique wherein all bits of the internal state of the chip are linked into a shift register and read out for examination. This scan path greatly increases the observability and controllability of the chip by providing access to signals that would otherwise be invisible to the outside world without extensive multiplexing schemes or large numbers of extra pins.

Figure 5.6 shows the LSSD used in the pipelined stages of the FPA and FPM. Note that all SRLs are connected serially. Test patterns are fed to Scan_In and after one P_CLK cycle both the input test patterns

and the test results are extracted from the Scan Out. Figure 5.7 shows

107 Normal Input

Scan-In SRLS

Combinational Network 1

Dynamic Registers

SRLS

Combinational Network 2

Dynamic Registers

SRLS » Scan-Out

Normal Output

Figure 5,6 LSSH Used in the Pipelined Stages of the FPA and FPM

108 In 0 In N

1—

Scan-In

Combinational Network

o \0

Scan-Out n .

Out 0 Out N

Figure 5.7 Interconnection of the LSSO SRL * s the interconnection of the SRLs and the detailed circuit of the SRL.

One of the sets of clocks, i and d> , is active during the normal mode 1 2 and the other set, and ^ , is active during the testing mode.

5.5 Floating Point Adder/Subtractor (FPA)

The data path of the Robotics Processor consists of Register File

(RF), Floating Point Adder/Subtractor (FPA), and Floating Point

Multiplier. The RF is just a three-port RAM, consisting of 64 32-bit words. Each of the two arithmetic units has three pipeline stages. The

data flow in the data path for normal arithmetic operations is described

in Appendix B.4. The design work for the FPA is explained in this

section, while the FPM is in section 5.6.

F irs t, the floating point format is defined and explained and then

the algorithm to perform floating point addition/subtraction is

described. Then, the block diagram of the FPA and its building function

blocks are described. They are the Zero Checking Unit, Sign Unit, N-bit

Adder/subtractor, Alignment Control Unit, Barrel Shifter, Leading Zero

Dector, Postnormalization, and Over/underflow Unit.

5.5.1 Floating Point Format

The firs t version of the IEEE floating point standard format was drafted in April 1978 by Harold Stone and the final version was

published in March of 1981 in [53]. The main goal of the

standardization efforts was to establish a standard which would allow

communication between systems at the data level without the need for conversion. The standard defines four floating point formats in two

groups, basic and extended, each having two widths -single and double.

110 Here only the basic single precision format is considered. Its format, made up of sign bit, biased exponent, and mantissa, is given as

31 30 23 22 0

where S is sign bit (1 b it), E is biased exponent (8 b its ), and F is significand (23 bits). The possible values for the IEEE single precision floating point are shown in the table 5.1.

Table 5.1

Possible Values of the IEEE Single Precision Floating Point.

Name Value E F

Not a Number Not applicable 255 Not all zeros

S Infinity (-1) ( Infinity) 255 All zeros

S ------1-12? Normali zed (-1) ( l.F ) x 2 1 - 254 Any

$ - m Denormali zed (-1) (O.F) x 2 0 Not all zeros

S Zero (-1) 0.0 0 0

I l l Since most commercial floating point processors, e.g. Intel 8087,

Weitek WTL 1032 and 1033, NS 16081-6, support the IEEE standard format, the floating point adder/subtractor and multiplier designed in this project are to follow the IEEE standard format.

5,5.2 Algorithm and Block Diagram

Addition and subtraction are described together since the same hardware is used for both operations. Subtraction 1s performed by addition using a 2 's-complemented subtrahend. The algorithm for the floating point addition/subtraction can be divided into five consecutive steps :

1. Check for zero operands.

2. Prenormalize

Align the two operands by comparing their exponents. The

mantissa with the smaller exponent is right shifted by the

amount of the difference of the two exponents.

3. Add/subtract the two mantissas.

4. Postnormalize

When the two mantissas have the same sign, a mantissa overflow

may occur. If this happens, right shift one place to put

i t back into the range, and increment the exponent. I f the

operands have different signs, there may be cancellation of

leading significant bits, yielding an unnormalized result.

Then shift left the result into the normalization range and

subtract the shift amount from the larger exponent. The shift

amount is equal to the number of leading zero.

112 5. Check for the exponent overflow or underflow.

If either occurs, the appropriate constants are loaded for the

exponent and the mantissa, and the status bit is set.

Figure 5.8 shows the block diagram of the above algorithm. It is assumed that the floating point adder/subtractor operates on the two operands, A and B, and the result of A+B or A-B is delivered as the operand R. SA denotes the sign bit of the operand A, EA the exponent of

A, MA the mantissa of A. OP bit specifies addition or subtraction.

The floating point adder/subtractor consists of a three stage pipeline with the firs t stage performing sign bit determination, zero checking and prenormalization, the second stage performing mantissa addition/subtraction, and the final stage performing postnormalization and overflow/underflow checking. The pipeline registers (lssd_n.ca) contain LSSD (level sensitive scan design), to ensure testability, where n is the width of the pipeline register.

The firs t stage consists of two zero checking units

( fa_exp_ne.ca), sign unit ( fa_sign.ca), mantissa comparator

(add_sub_24.ca), alignment control unit (fa ali_con.ca), and right-shifter (fa_sh_r.ca).

Each zero checking unit examines the exponent of the operand. If the exponent is zero, a zero is attached to be the most significant bit

(MSB) of the 24-bit mantissa. Otherwise a one is attached as the implicit leading b it. The circuit design of the zero checking unit is described in Appendix B.5.

The sign unit determines the sign bit of the final result and the effective operation for the mantissa addition or subtraction. If its

113 MB

I s s d - 8 .c a

> 2 3 23

fa -e x p - fa -e x p - ne .ca

SUB

MB-GT-MAq add-sub-24.ca tmz. f a - a l i - (fl-B) co n .ca

AMOUNT-R FZ-R

IN* INI | MUX-8.ca T 8

SUB

A 8 a d d -s u b -2 4 .c a i -1

Is s d -8 , Ci Issd-24.ca

5L1E. — RSH

f a -z e ro - d e .c a

s 8 .a- o v f - a d d -su b -8 .fa f a -s h - ■23 u d f,c a OUT

ZOUT RSH IN* INI IUX-8. ca

OVF MR

Figure 5.8 Block Diagram of the Floating Point Adder/Subtractor

114 output signal SUB is equal to 1, then subtraction is executed, otherwise

addition is executed. The circuit design of the sign unit is described

in Appendix B.6.

The mantissa comparator is used to determine whether the mantissa

of A or B is larger. Since it always does A minus B, the carry out (low

asserted) of the 24-bit subtractor means that the mantissa of B is

greater than A.

The alignment control unit determines the difference of the two

operand exponents. This difference is equal to the required number of

mantissa shifts. The alignment control unit sends the signal R_GT_A, R

greater than A, to the sign unit to determine the sign bit of the final

result. The signal B_GT_A 1s also used to route the mantissa of the

smaller number to the rig ht-shifter. The circuit design of the

alignment control unit is described in Appendix B.7.

The right-shifter contains a 24-bit barrel shifter which can cause

a shift a 24-bit word by a number of bit positions ranging from 0 to 23 within about two gate propagation delays. I f the rig ht-shift amount is

greater than or equal to 24, the output of the rig ht-shifter is forced

to zero since the input has, at most, 24 bits. The circuit design of

the 24-bit shifter is described in Appendix B.8.

The second stage contains only a mantissa adder/subtractor,

add_sub_24.ca. The subtract operation is selected by the SUB signal.

The RSH (right shift) signal is generated when there is a carry out and

addition is executed. I f it is asserted, the result of the mantissa

adder/subtractor, add_sub_24.ca, is to be shifted right by one bit in

postnormalization and the common exponent, the exponent of the larger

115 operand, is incremented by one. The circuit design of the 24-bit adder/subtractor is described in the next section.

The third stage consists of leading zero dectector

(fa_zero_de,ca), exponent update unit (add_sub_8.ca), left-shifter

(fa_sh_l.ca) and over/underflow unit (fa_ovf_udf,ca). Postnormalization is performed in this stage. Detailed explanation for the postnormalization can be found in Appendix B.9.

The leading zero dectector is used in postnormalization to provide the left-shifter with shift amount needed to normalize the ouput of the mantissa adder/subtractor, add_suh_24.ca. The circuit design of the leading zero dectector is described in Appendix B.10.

The exponent update unit updates the common exponent by incrementing one when RSH is asserted, or subtracting the shift amount sent from the leading zero dectector when RSH is not asserted.

The over/underflow unit detects the occurrence of overflow or underflow. If one of these should happen, the corresponding status bit is set. It also sends out ZOUT (zero output) signal to force the mantissa and exponent to zero when underflow occurs or the le ft shift amount is 24. The circuit design of the overflow/underfolw unit is described in Appendix B .1I.

5.5.3 N-bit Adder/Subtractor

The method for doing addition/subtraction can he classified into ripple carry generation and parallel carry generation. A ripple carry adder/subtractor that adds/subtracts two N-bit operands consists of a cascade of N ful1-adder stages. A full adder is a logic network with

116 two inputs : a (i) and b ( i) , and a carry_in c ( i - l ) , and two outputs : a sum s(i) and a carry-out c(i). It performs the following logic functions :

s(i) = a(i) xor b(i) xor c(i-l)

c(i) = ( a (i) A b(i) ) j ( a(i) 4 c(1-l) ) | ( b(i) A c(i-l) )

for i = 0,1, ...... , (N-n

For high-speed applications, carry lookahead adders are usually implemented. However, Sakurai and Muroga pointed out in [35] that

Carry-lookahead adders require larger chip area, and

furthermore, if low-power devices such as are used,

the speed of the carry-lookahead adders is greatly slowed down

due to parasitic capacitance caused by large fan-outs of some

gates, (If we try to reduce large fan-outs of some gates by

using extra gates, the chip area increases further.) Thus,

when the chip size area limited, adders which occupy a small

chip area are often used for high speed (e.g. a ripple adder,

instead of a carry-lookahead adder, is used in In te l's

microprocessor chip 8080 for higher speed because of chip size

1 imitation.)

Also, Mead and Conway pointed in [1, pl50] that simulation of several look-ahead carry circuits indicated that they would add a great deal of complexity to the system without much gain in performance. Therefore, a ripple carry adder is designed for the N-bit adder/subtractor in the

Robotics Processor.

The compactness of the adder is very important since a more compact network often increases speed by virtue of its smaller parastic

117 capacitance. In actual design, the chip area occupied by a circuit cannot be known until the actual layout is completed. However, Muroga and Lai pointed out [33] that minimizing the number of gates as the primary objective and the number of connections as the secondary objective usually yields the most compact circuits, at least for functions using a small number of variahles. So an adder with as few inverters as possible was designed and pass transistors were used as often as possible, since they are formed by simply crossing polysilicon over diffusion and occupy l it t l e chip space.

A static Manchester-type carry chain adder was designed in this project. The Manchester-type carry chain adder has N basic cells cascaded vertically. In each c e ll, the propagation delay of the carry signal in one-bit full adder includes one logic inverter and one pass transistor. Thus, the carry propagation delay time for the N-bit adder is N times of the delay of one inverter plus one pass transistor. Since the resistance of a pass transistor is about one fourth of that of a pull-up transistor, the propagation delay of a pass transistor is about one fourth of that of an inverter. So the total carry propagation delay time for the N-bit adder is about (1+1/4)N inverter delays. This delay time is shorter than that of the adder in [34], which was claimed to have the smallest propagation delay, (4/3)N inverter delays. A dynamic

Manchester-type adder can be found in [1, p. 150], which is precharged by one of the clock phases. Since the carry chain in this adder is normally a series of pass transistors, the chain must be periodically buffered to minimize propagation delay. The carry-in signal is usually restored by a pair of inverters every four adder cells. Although the

118 dynamic Manchester-type adder has fewer components than the static one and thus has smaller chip area, it requires a precharge clock causing the contolling to be more complicated. Therefore, the static

Manchester-type adder was designed instead.

The truth table of a one-bit full adder are shown in table 5.2, where the p(i) is the carry-propagate for cell i, and g(i) is the carry-generate.

Table 5.2

Truth Table of a One-bit Full Adder

a{i) b ( i ) c ( i- l) c ( i ) s { i ) p ( i ) gd)

0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 0 1

The logic equtations for the c(i), s(i), p(i), and p(i) are derived as follows :

p ( i) = a ( i ) xor b{i )

g(i) = a{i ) & b ( i )

c(i) = (p(i) & c(i-l)) | g(i)

s(i) = p(i) xor c(i-l)

119 but

g(i) = a(i) A b(i)

= (a(i ) & a(1) A b(1)) | (a(1) A !a(i) A !b(1))

= a(1) A (a(1) A b(i) | !a{i) A !b(1))

= a (i) A I (a(1) xor b(i))

= a{ i ) A !p (i) so

c(i) = (p(i) A c(i-l)) | (Ip(i) A a(i))

!c(i) = (p(i) A !c(i-1)) | (!p(i) A !a(i))

When logic equations for c (i) and its complement, !c (i), are derived as above, pass transistors can be used to obtain the carry-out. The circuit diagram for a 2-bit adder is shown in figure 5.9. It consists of two different cells, say even and odd. The even cell has c (i)'s complement output while the odd cell has c(i) output. Dividing the

2-adder into two different cells allows each cell have the same carry propagation delay with one inverter plus one pass transistor. An N-bit subtractor can be easily obtained by doing Exclusive-Or function of the b(i ) input and the SUB_0P (subtraction operation) signal, i.e. b(i) is replaced by [b (i) xor SUB_0P],

5.6 Floating Point M ultiplier (FPM)

In this section, the algorithm to perform floating point multiplication is proposed, and then the block diagram of the FPM and its functional building blocks are described. They are the Zero

Checking Unit, Exponent Computation, 24-bit Fixed Point M ultiplier,

Postnormalization, and Over/underflow Unit. For the 24-bit Fixed Point

12(1 i+I

addl e.ca addl O.ca i + 1

i+1 u1+l

i + I

J I

i+1

Figure 5.9 Circuit Diagram of a 2-bit Adder with Manchester-type Carry Chain

121 M u lt ip lie r , several methods are proposed and compared : Sequential

Add-shift M ultiplication, Array M ultiplier, Nonadditive Multiply Modules

( NMM) with Wallace Trees, Additive Multiply Modules (AMM), Recursive

Parallel M ultiplier, Modified Booth Algorithm (Radix-4) with Carry-save

Adders, and Pipelined Recursive M ultiplier with Modified Booth

A1gori thm.

5.6.1 Algorithm and Block Diargram

The algorithm for the floating point multiplication can be divided into five consecutive steps :

1. Check for zero operands.

2. Determine the product sign, add exponents and correct for

the bias.

3. Perform fixed point multiplication of the mantissas of the two

operands.

4. Postnormalization

Normalize the result of the mantissa m ultiplier. This may

require one right shift and then incrementing the exponent,

5. Check for overflow/underflow.

Figure 5.10 shows the block diagram of the above algorithm. It is assumed that the floating point m ultiplier operates on two operands, A and B, and that the result of A x B is delivered as operand R. It can be seen that the floating point multiplier is a three stage pipeline.

The firs t stage performs zero operand checking, sign bit determination and exponents addition. The detailed circuit design of the zero checking unit is described in Appendix B.12. The final sign

122 SA MA

Iss d 8 .c a

Pi2

fm exp] e q j . c a ! EXP I EQ0

EXP I EQ0 mul 2 4 .ca

43

Issd

add 9 .c a

EXP EQ0

mux 2 3 .ca

B9-ER

C8-ER B7-ER

mux 2 3 .ca

IN0 INI mux O.ca ZOUT

OVF MR

Iss d S .ca

Figure 5.10 Block Diagram of the Floating Point M ultiplier

123 bit, SR, is obtained by doing Exclusive-Or function of the two sign bits, SA and SB. Exponents addition is performed by the 8-bit adder, add_8.ca, in this stage.

The 24-bit fixed point m ultiplier occupies stages one and two. In stage one, partial products are generated and carry-save addition is performed; in stage two, carry propagation addition is performed to produce the unnormalized 48-bit product. The detailed design of the

24-bit fixed point multiplier is discussed in section 5.6.2.

The third stage performs exponent correction, postnormalization, and overflow/underflow checking. Since in the IEEE floating point format the exponents are represented by biased code, a correction is necessary. The exponent correction is performed by a 2-bit adder

(add_2.ca) in stage three and is explained as follows : In floating point multiplication R = A x B, the exponents of the two operands are added together to generate the exponent of the result.

The real exponent of A is : REAL_EA = EA - 127.

The real exponent of B is : REAL_EB = EB - 127.

The real exponent of R is : REAL_ER = REAL_EA + REAL_EB

= (EA - 127) + (EB - 127)

= (EA + ER - 127) - 127

ER - 127.

So the biased exponent of R is :

ER = EA + ER -127

= (EA + EB) + 1 - 128

= (EA + EB) + 1 + (2's complement of 128)

= (EA + EB) + 1 + ( 110000000 )

124 The addition of the value "1" is achieved by connecting the carry-in of the exponent adder (add_8.ca) to high, while the addition of the value

"110000000“ is achieved by adding "11" to the right most two bits of the result of (EA + EB + 1), since the other 7 le ft most bits are zeros.

Since, by definition, the values of the mantissa of each of the two operands are between 1 and 2, their product is between 1 and 4, and thus postnormalization may be required. If the product mantissa is greater than 2, one right shift is required along with an incrementation of the biased exponent by one. The right shift is implemented by a multiplexor (mux_23.ca, upper one) and is controlled by the MSB of the mantissa product, B23_MC. If B23_MC is high, the right most 23 bits of

MC (=MA x MB) are selected. Otherwise, the right most 22 bits plus a

LSB with zero is selected. The B23_MC is also used as the carry-in of the exponent updater (add_9.ca). Overflow/underflow checking is accomplished by the over/underflow unit. Tts detailed circuit design is described in Appendix R.13.

5.6.2 24-bit Fixed Point M ultiplier

The multiplication of two fixed-point binary numbers can be achieved by sequential add-shift or parallel schemes. Schemes for parallel multiplication can be further roughly divided into two classes

C393 [ 40]. The firs t one consists of an array of cells connected in an iterative way to form a product of any size. The second one, introduced by Wallace, consists of the generation of all partial product terms and subsequent reduction of all partial products to two terms by using carry-save addition. In the following sections, several multiplication

125 schemes are explored and compared on their area and time complexity.

Finally, the scheme of the pipelined recursive m ultiplier with modified booth algorithm is used in this project.

5.6.2.1 Sequential Add-Shift Multiplication

The hardware organization of the sequential add-shift multiplication can be found in most digital computer architecture books, for example [31, p. 145]. The circuit performs multiplication by using a single adder n times to implement the addition for the n rows of partial product. It is inexpensive to implement but too slow to meet our requirement. According to our assumption, with 3 pipeline stages for a floating point multiplier, a 24-bit fixed point multiplication must be finished in 2 stage clocks, i.e . 2 x 1 microsecond = 2 microseconds so the 24-bit addition must be accomplished in 83 ns (= 2 microsecond / 24). This is s till not feasible for present NMOS technology. To achieve faster m ultiplication, various high-speed parallel multipliers must be used. The following sections show parallel multiplication ■schemes.

5.6.2.2 Array M ultiplier

The schematic circuit diagram of a 4-by-4 array m ultiplier can be found in [31, p. 197] [30, p. 48], A typical commercial chip is TI

7415274. A multiplier of an N-by-N size requires N(N-l) ful1-adders and

N x N AND gates. Presumably, the delay time of an AND is neglected.

The total delay time of the m ultiplier is about [(N -l) + (N -l)] x

T( ful l_adder), i.e. (2N-2) x T( ful l_adder), where T(full_adder) *is the delay time of a ful1-adder. For N = 24, 24(24-1) = 552 full-adders are

126 needed and the delay time is (2 x 24 -2) - 46 x T(ful l_adder), This kind of multiplication scheme is usually implemented when N is not too large, say less than or equal to 16, as in [43] [44] [45],

5.6,2,3 Nonadditive Multiply Modules {NMM) with Wallace Trees

A K-input Wallace tree is a b it-slic e summing circu it, which produces the sum of K b1t-slice inputs. For example, the Wallace tree with 7 inputs can be found in [31, p. 196] [30, p. 166] [TI 74LS275], A

NMM is just a M-bit-by-M-bit array multiplier with a smaller dimension; for example a typical NMM with M=4 is TI 74LS274. A m ultiplier of any size can be obtained by properly arranging the NMMs and Wallace trees.

In [31, p. 203], the modular arrangement for an array multiplication network ranging in size from 4x4 to 32x32 is shown. Each rectangle represents an eight-bit partial product divided into high and low 4-bit slices. All slices are added in a columned fashion by Wallace trees of odd number input. For example, a 24-bit m ultiplier needs 36 4-bit NMMs which are arranged following the Wallace trees. The size and the number of the required Wallace trees is shown in table 5.3. The number of the carry-save-adder levels required in a Wallace tree 2 is log (size of the

Wallace tree / 2) [29, p. 139], where the base for the log is 3/2.

127 Table 5 .3

Table 5.3 Size and Number of the Wallace

Trees for a 24-bit Multiplier

size of the number of the number of full-adders number of full Wallace tree Wallace trees in each Wallace tree adder levels

3 8 1 1

5 8 3 3

7 8 4 3

9 8 7 4

11 8 8 5

A 4-bit NMM needs 4(4-1) = 12 full-adders and has the delay time

(2x4-2) = 6 x T(full_adder). The number of the ful1-adders needed

for the 24-bit-by-24-bit multiplier is

(36 x 12) + 8 x (1 + 3 + 4 + 7 + 8) = 616 / I / I the full- the full-adders in adders in all Wallace trees 36 NMMs

plus the full-adders needed in a 48-bit adder. The total delay time for the m ultiplier is the sum of the delay time of a NMM, the delay time of

the longest Wallace tree and the delaytime ofa 48-bit adder. Ifa carry-propagate adder is used for the 48-bit adder, then the number of

the full-adders is 616 + 48 = 664. The total delay time will be 6 + 5 +

48 = 59 x T(full_adder). It can be seen that both the number of

full-adders and the delay time are greater than those of a array m ultiplier depicted above.

128 Due to the irregular interconnection hetween NMM and Wallace

trees, this scheme does not give feasible layouts for VLSI

implementation. Furthermore, it is not modular in nature, so can hardly

be extended to form regular large multiplier arrays.

5.6.2.4 Additive Multiply Module (AMM)

This scheme does not require b it-s lic e summing trees such as

Wallace trees. An AMM with 4-by-2 can be found in [30, p. 49], In

general, a 4m-by-4m multiplication network can be constructed by 2 x m x m 4-by-2 AMMs with delay time equal to (3m -1) x T(AMM). T(AMM)

represents the delay time of an AMM. For a 24-bit m ultiplication, 2 x 6

x 6 = 72 AMMs are required. An AMM consists of 8 full-adders and has delay time 5 T(ful l_adder). So, in to ta l, the 24-bit m ultiplier needs

(8 x 72) = 576 full-adders with the delay time (3x6 -1) x 5 = 85

T(full_adder). Roth the number of full-adders and the delay time are

greater than those of a array multiplier depicted before.

5.6.2.5 Recursive Parallel M ultiplier

Luk proposed in [37] a complete VLSI layout for a fast recursive

parallel multiplier having time complexity, the time required for

completing an operation, in the order of logN x logN, or T = 0(logN x

log N) and area complexity, the area of the layout, in the order of N x

N x logN x logN, or A = 0 (N x N x logN x logN). The recursive parallel m ultiplier divides large sized multiplication into a number of smaller

sized multiplications followed by additions combining the intermediate

results. Thus, multiplication is recursively reduced to a sequence of

129 additions until arriving at a reasonably small size, say 2-bit multiplication or 4-bit multiplication.

The above time complexity is obtained by using a the Brent-Kung adder [36], which is a parallel carry look-ahead adder having time

O(logN). I f a carry propagation adder is used, the time complexity will be T = 0(N), greater than 0(logN x logN). However, if carry-save adders are used, the time complexity can be improved to an optimal T = 0(1ogN)

[38], In [38], three versions of multiplications are explored. They are 4M, 3M and 2M versions. The 3M version gives a smaller layout with

A = 0(N x N), but is not as regular as the other two versions. The area and time complexities for the three versions are listed in table 5.4,

Table 5.4

Comparison of the 4M, 3M, and 2M Versions

of the Mu11iplication in [38]

version area time

4M N x N x logN x logN logN

3M N x N logN

2M N x N x logN 1 ogN

Even though the recursive parallel m ultiplier gives time complexity near optimal or even optimal T = O(logN) ( i f carry-save adders are used), it occupies much more chip area than that of the array multiplier. For example, in [38] a 8x16 recursive multiplier (2M

130 version), made by 3-micron NMOS process, takes about 65 ns of operation time and has a chip size of 49 micron square. And in [43] a 16x16

parallel array m ultiplier, made using the same 3 micron NMOS process

techonology, has longer operation time, 120ns, but has a smaller chip

size of 5 micron square.

In fact, when the word length of two operands, N, is not large, a

judgement cannot be made between the recursive m ultiplier and the other

parallel multipliers based just on the value of the order. I f N is not

large, both the coefficient of the complexity and the value of the order

must be taken into account. For example, m ultiplier A may have time

complexity T = 2N while the m ultiplier B may have time complexity T = 3

logN x logN. It is likely to say that multiplier B is faster than multiplier A, for 0{logN x logN) is less than D(N). But multiplier R is

not faster than m ultiplier A for N less than or equal to 32. Thus, if

they are compared by considering only their values of the orders, an erroneous conclusion is likely.

Therefore, if N is not large, more careful consideration must be

given when two multipliers are compared. One practical and reasonable way to compare layout areas is to compare their numbers of full-adders.

For operation times, levels of full-adders for the longest path can be

used for comparison. Consideration of area and time complexity in terms

of the number or the levels of full adders is reasonable because they

are the very basic constructing units in m ultipliers,

5.6.2.6 Modified Booth Algorithm (Radix=4) with Carry-Save Adders

For the Booth algorithm, the m ultiplier is encoded to a series of

+1, 0 or -1 as the m ultiplier is scanned from right to le ft.

131 M ultiplication speed can be increased when the m ultiplier has blocks of l's . Since the speed depends on the bit configuration of the m ultiplier, the efficiency of the Booth algorithm is obviously data-dependent.

Using the modified Booth algorithm(radix =4), an N-bit-by-N-bit m ultiplier generates only N/2 partial products. Thus, it can speed up the multiplication by a factor of almost two and cut the number of the full-adders to about half with a small amount of encoding circuit and multiplexing logic required. Every two bits in a multiplier are encoded from right to le ft to +2, +1, 0, -1, or -2. The bit pair encoding is shown in table 5.5.

Table 5.5

Encoding Table for the Modified Booth Algorithm

original multiplier encoded m ultiplier acti on B(i +1) B (i) B (i — 1) B (i)'

0 0 0 0 add 0 0 0 1 + 1 add A 0 1 0 + 1 add A 0 1 1 +2 add 2A 1 0 0 -2 subtract 2A 1 0 1 -1 subtract A 1 1 a -1 subtract A 1 1 l 0 add 0

An N-bit-by-N-bit multiplier needs (N/2+1) x (N+l) full-adders,

(N/2) x (N+l) l-out-of-5 multiplexors and a 2N-bit adder. Its delay

132 time is the sum of (N/2+1) x T(full_adder), (N/2) x [T(encoding) +

T(m ultiplexor)], and the delay time of a 2N-bit adder.

An 8-bit-by-8-bit m ultiplier with modified Booth algorithm

(radix=4) is shown in figure 5.11, simplified and modified from [55, p.

902], There are 5 rows of carry-save adders combined with multiplexors and a 16-bit carry propagation adder (e.g. Manchester-type adder). IJ1 is a multiplexor used to select one of +2A(i), +A(i), 0, -A(i) or

-2A(i), which means the multiplicand being mulitplied by +2, +1, 0, -1, or -2. IJ2 is used to generate the carry-in at the LSR position when the action is doing subtraction, i.e. when -A(i) or -2A(i) is selected. 113 is also a multiplexor used to select +2A(N-1) or —2A(N—1). U4s combined with the 5th row of carry-save adders are used to add the multiplicand to the partial product when the most significant bit (MSB) of the m ultiplier B, B(N-l), is one. Since, according to the modified Booth encoding algorithm, if the MSR of the the m ultiplier B is one, then the action of adding the multiplicand A must be performed at the last step.

For example, the m ultiplier B = 9F or 5F in hexdecimal is encoded below.

It can be seen that at the last encoding step, the encoded m ultiplier

B( i ) 1 = "+1M for B = 9F is generated because the MSB of the m ultiplier B is 1.

origianl multiplier (9F) It) 01 11 11

encoded m ultiplier + 1 -2 +2 0 -1

origianl multiplier (5F) 01 01 11 11

encoded m ultiplier 0 + 1 +2 0 -1

133 MINUS 0 x(+2),x{+l),x(0 ,x -l),x(-2)

SIGH-EXQ

SIGN-EX1

MINUS I x(+21.»( + !).x{0),xt-|Ulj>] ,5 SIGN-EX2

SIGH-EX3

MINUS 2

US i SIGN-EX5 u> -p* MJNUS 3

IS u a 12 ' II 10 9 a 7 ; 5 4 3 2 1 0 CPA F Ris-«0

Figure 5.11 A 8x8 Multiplier with Modified Booth Algorithm (Radix = 4) U5 is a circuit used to encode the m ultiplier and generate two sign extension hits, SIGN EX{i) and SIGN_EX(i+1). The logic equations of each unit are shown and explained in Appendix B.14. The full-adder at the (N -1 )th position of the last row is used for rounding. Detailed explanation for the rounding scheme is described in Appendix B.14.

5.6.2.7 Pipelined Recursive M ultiplier with Modified Booth Algorithm

Using the modified Booth algorithm the number of full adders of an

N-bit-by-N-bit multiplier can be cut to about half, but it still occupies tremendous chip space when N is reasonably large, say 24. So instead of using an iterative array of the carry-save adders as in the figure 5.11, a pipelined recursive m ultiplier with modified Booth algorithm is used. The concepts are following the above and those in

[41]. The {N/2 + 1) rows of carry-save adders are replaced by one row of clocked carry-save adders. The structure of the pipelined recursive multiplier is shown in figure 5.12.

Figure 5.13 shows the timing for the 24-bit pipelined recursive multiplier. * and a are both 16 MHz; Pa and P* are both 1 MHz, The 1 2 1 2 MLD signal, generated in every P$ with pulse width equal to 2 periods, is used for loading the B multiplier and some initial values at the begining of the fixed point multiplication. For example, X(+2)

X{ + 1) X(0) X{-1) X{-2) are set to "0 0 1 0 0"; all carry and sums are set to 0; MINUS{0) is set to 0; SIGN EX(1) is set to 0; and SIGN_EX(0) is set to 0 by seting MINUS(0)=0 and X(-1)=0. This causes the value to the carry-save adders being in itia lize d to zero after the pulse 1.

The signal can be obtained with a PLA and its state diagram is shown

135 "0" "0" B "0" A

ML a 00100

e n c o d e r.c a

MINUS

MINUS i l l HL

FA‘ s

CARRY SET SUM SET

MLD

L __

48 to CPA

Figure 5.12 The Structure of a Pipelined Recursive M ultiplier (mul 24.ca)

136 _ruxnj^LAm \j^j\.^Li\nAA/u14LT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~l t i^-LTLJirLnrLn^LJ^nrLr^inrL

P $2 — * i ______r

p»i

MLD r

Figure 5.13 Timing for the Pipelined Recursive Multiplier

Reset

P* I/P P*2

0/P MLD

Figure 5.14 State Diagram for Generating the MLD Signal in figure 5.14. To explain in converence, the pulse train of both ^ and d> is marked from 1 to 14. The 4 * in figure 5.12 is obtained by 2 2 doing the And function of and !MLD (! means logic NOT). This will prevent the conflicts that occur when both $ and MLD are high.

During the pulse 1, the least significant bits of the multiplier B, b(1-l), b(i) and b(i+l), are shifted out to an encoder and are encoded into X(+2), X(+l), X(0), X (-l) and X{-2) which are used in the Ul, U2, U3 and sign extension unit, sign_ex.ca. During the $ pulse

1, an appropriate partial product of +2A, +1A, 0, -A, or -2A is generated from Ul, U2 and U3 are to be fed into the carry-save adders.

During the <>^ pulse 2, the addition is performed. The carry set and sum set of the carry-save adders are latched at the $ pulse 2. The carries and sums are then feedback to the carry-save adders at next pulse of the

♦ . After the addition is repeated 13 times, from pulse 2 to pluse 14, the final carry set and sum set (each has 2N bits) are generated and latched at the falling edge of the P<{^. The 2N-bit addition is then performed by a carry propagation adder (e.g. Manchester-type adder) in the second stage of the floating point multipier. The detailed circuit designs for the register storing the m ultiplier B and recursive carry-save adder are described in Appendix B.16,

The maximum operating speed of the m ultiplier is determined by the slowest stage in the recursive pipeline. The duration of the or the

2 must be greater than the delay time of the slowest stage plus clock overhead (such as set-up time and holding time). The possible slowest stages are encoder.ca, sign-ex.ca, U1/U2/U3, and one bit full adder.

Since X(+2), X( + l) , X(0), X(— 1) and X(-2) are required at each

138 carry-save adder, the registers corresponding to these five signals must

have enough driving capability. So, the signal paths should be metal

layer throughout the entire row of the full adders,

5.7 Summary

The detailed designs of Register File (RF), Sequencer (SEQ), and

Control RAM (CRAM) are not covered in this chapter. All the other

building blocks -Generator (CG), Bootstrap Unit (RU), Format Converter,

Level Sensitive Scan Design (LSSD), Floating Point Adder/subtractor

(FPA), and Floating Point M ultiplier (PFM), are explained in great detai1.

139 CHAPTER 6

COMPUTER AIDED DESIGN FOR VLSI

6.1 Introduction

Because of the diversity of tasks and concerns in VLSI designs, a systematic method is especially important in designing a special purpose chip. Typically the chip is decomposed into several small cells geometrically, functionally, and hierarchically. The design of the functional block of the cell can then be largely independent of the others. During the design phase, some software tools w ill provide the integrated circuit designer with step-by-step design assistance. This approach is called Computer-Aided Design (CAD).

CAD tools support a hierarchical design sequence to assist the designer in specifying a system from in itia l concept to detailed implementation. They also support both functional and physical designs.

Functional design aids include synthesis, simulation, and verification, at architectural, system, logic, and circuit levels. Physical design aids support partitioning and layout. The CAD tools must be technology independent so that the designs in each phase do not need to change with improvement of the integrated circuit processing technology.

In the second section of this chapter, the CAD tools used at The

Ohio State University are introduced. Section three proposes a network description language, which is a LISP-like language used to describe the

140 designed circuit. The described circuit can then be simulated to verify its logic and timing before the circuit layout is attempted. Section four explains how the circuit can be simulated at the logic level.

Finally, simulation at the circuit level is explained in section five.

6.2 Overview of VLSI Design Tools

Figure 6.1 shows the functional chart of essential VLSI design tools and several logical sequences of their applications. These tools are part of the VLSI design tools released from UW/NW VLSI Consortium on

October 1 of both 19R3 and 1984, The following is a brief overview of the VLSI tools being used at The Ohio State University. a) Functional design tools for translating a high-level design

description into layout tool input are :

PEG : Translates a language description of a finite state

machine into logic equations compatible with EONTOTT.

EQNTOTT : Converts logic equations into a truth table format to be

used as input to TPLA. b) Layout tools used to design the actual artwork for the circuit

are :

CAESAR : An interactive display editor for manhattan VLSI designs.

TPLA : Automatic PLA layout generator. c) A display tool used to display circuit designs is :

PENPLOT : Penplotting programs for HP7221 and HP7580 pen plotters. d) A design rule checker used for geometric rule checking is :

LYRA : Performs hierarchical design rule check on a CAESAR

formatted design using a corner based algorithm.

141 System definition I I

Logic circuit | Logic equations I I FSM description description, ckt.net I I c k t .eqn I I ckt.f an .1 I. .1 I. I I NETLIST | PEC I ___ MIT .sis file I EQNTOTT State equations ckt lis I < I I ckt.eqn I I CAESAR I PRESIM + EQNTOTT AED512 I___ RNL binary input + Truth table file, ckt.rnl TABLET ckt.tbl I I j I I TPLA RNL I I _I__ _ l ___ l _ Logic simulation I .ca file Design rule result and timing I ckt.ca I--LYRA > violations analysis, ckt.ult I______I c k t .ly I I I :CIF I______I loyout I .cif file I 1 hard SIMF LTER I ckt.clf I--PENPLOT > | copy I I I MEXTRA

. 1. Berkeley .slm I ■> I file, ckt.sim I through CSNET and ARPANET to MOSIS I I I I PSPICE ESIM CRYSTAL I I__ . 1. I___ I_____ I SPICE input I Logic level Timing I I.e. chip I I ckt.spcin j simulation analysis t J I ______I result result I ckt.res ckt.cry SPICE

. 1. Tested with Circuit level HP 64000 simulaiton result c kt.out

Figure 6.1 Functional Chart of VLSI CAO Tools

142 e) Circuit extractor used to extract simulation database from layout

database is :

MEXTRA : Extracts a Berkeley format ‘ .sim* f ile from a Caltech

Intermediate Form (CIF) input f ile . f) Simulation tools used for logic and timing simulation are :

RNL : Event driven "timing" simulator. It uses logic levels

and a simplified circuit model to estimate timing delays

through digital circuits. It also has a mode that allows

it to be used as swiatch level simulator.

PSPICE : Preprocessor for the SPICE simulator.

SPICE : Device level circuit simulator,

ESIM : Switch level simulator. It uses logic levels and models

transistors as perfect switches.

CRYSTAL : Static timing verifier. It uses a simplified circuit

model to estimate the worst case delay through a circuit. g) Filters used to convert from one database to another.

:CIF : Converts .ca form to Caltech Intermediate Form.

SIMFILTER : This is a f ilt e r that converts Berkeley style .sim

files such as produced by MEXTRA to MIT style .sim

files used as inputs by programs such as PRESIM. It

also can be used to produce Rerkeley style .sim files

from the MIT format allowing circuits described with

NETLIST to be analyzed by SPICE or run through the

timing verifier CRYSTAL.

NETLIST : This is a program for generating circuit description.

143 PRESIM : This program converts MIT style .sim files into the

binary format required by RNL.

As mentioned in the last section, the circuit of an entire chip is usually partioned into several small cells. The design of each cell can then be largely independent of the others. I f the cell function is a sequential function, it can be described as a fin ite state machine using a state diagram. I f the cell is a non-sequential function, it can be described using Boolean equations. Either the sequential or non-sequential function can be implemented using a PLA. If a cell is not frequently used in the chip and does not occupy much chip space, it is usually implemented with a PLA, since the layout of the PLA can be easily obtained by using the TPLA, PLA generator. Specifically, EQNTOTT is firs t used to convert the set of logic equations which describe the cell into a corresponding truth table, then TPLA generates a PLA layout based on this truth table. Rut, a basic, frequently used cell (e.g. fu ll adder) is usually implemented with manual layout, because manual layout usually results in smaller chip space compared to using a PLA.

Before the cell is laid out, its logic circuit design should be tested. The path through NETLIST, PRESIM, and RNL provides circuit description and verification. The detailed explanation of this verification path will be described in sections 6.3 and 6.4. Once proven to be correct at the logic level and timing analyses, the circuit can be laid out by using CAESAR with the AED512 graphic terminal and data tablet. The completed layout should be verified by LYRA to check for violations of the numerous layout rules. The resulting .ca file is

144 then converted to a .c if file using :CIF package. I f desired, the

layout can then be plotted by PENPLOT.

If the cell under consideration is a finite state machine

implemented with this PLA, the register required to store the state

variables is obtained using two inverters and two series pass devices

for each state variable. The following is the specification of the

fin ite state machine for the Bootstrap Controller, whose state diagram is shown in figure B.5.

— Finite state machine of the Bootstrap Controller (BTC)

- - The state diagram of the BTC is shown in figure B.5.

— File name : btc.fsm

— BT is the RESET signal, which is a keyword in PER.

— When the RESET is present in the INPUTS, conditional

— branches to the firs t state are automatically added to

— the next state expressions for each state.

INPUTS RESET WR LC;

OUTPUTS CLR LD LOO L01 LD2 WEN INC;

SO — This is the reset state.

ASSERT CLR LO;

IF NOT WR THEN LOOP;

SI ASSERT LO LDO;

S? ASSERT LO;

IF NOT WR THEN LOOP;

53 ASSERT LO L01;

54 ASSERT LO;

IF NOT WR THEN LOOP:

145 S5 ASSERT LD LD2

S6 ASSERT LD WEN

S7 ASSERT LD INC

S8 ASSERT LD;

CASE (WR LC)

0 0 => SB;

1 0 => SI

? 1 => S9;

ENDCASE => SB;

S9 GOTO S9 ;

After the fin ite state machine is written in the file "btc.fsm",

it can be implemented immediately with a PLA by entering the command :

% peg btc.fsm ) eqntott -f -R | tpla -s Bcis -I -0 -o btc.ca where "btc.ca" is the output of the PLA in CAESAR layout format. PEG

translates a high level language description of a finite state machine

into the logic equations, which are then read by EQNTOTT. EQNTOTT

generates the truth table, which is used by TPLA to generate a PLA. The

switch - f allows outputs to be defined in terms of their previous values

in a synchronous system. The switch -R forces EQNTOTT to produce a

truth table with no redundant minterms. The switch -s Bcis forces TPLA

to generate a PLA with buried contacts, inputs and outputs on same side

of the PLA. And the switch - I clocks the inputs to the PLA; the switch

-0 clocks the outputs to the PLA.

6.3 Logic Circuit Description

Given the definition of a functional block, its logic circuit can

be designed. For effective design it is important to ensure that the

146 design w ill work before layout is attempted. The circuit can be described with a network description language and then simulated at the logic level. The simulation result can show the logical level (0 or 1) and the timing of desired signals.

NETLIST is a macro-based language for describing networks of different size transistors. The program NETLIST allows the user to describe the circuit with a symbolic language. RNL is a timing logic simulator for digital MOS circuits. It is an event driven simulator that uses a simple RC (resistance capacitance) model of the circuit to estimate transition times and the effects of charging sharing.

The network can be specified in a logic network description file using a LISP-1 ike command systax as follows. The circuit of a simple

NMOS inverter, shown in figure 6.2, is used as the fir s t example. Its logic network description is written into the file "inverter.net", listed as below :

; (1) network description for a NMOS inverter

; File name : inverter.net

; (2) declaration of the nodes in the network

(node in out)

; (3) depletion mode transistor (pull-up)

(dtrans out out Vdd 2 4)

; (4) ehancement mode transistor (pull-down)

(etrans in GND out 4 2 )

; (5) specify an interconnect capacitance for the output node

(capacitance out 0.03)

147 V d d I out

in i

7 GND

Figure 6,2 NMOS INVERTER

Vdd

I out ini in2

v GND

Figure 5.3 NMOS NAND

148 The number enclosed in the parenthesis will be used to indicate each part of the description file .

(1) A semi cl on causes the rest of the line to be treated as a comment.

Blank lines are also ignored.

(2) Any node named for subsquent reference must be declared. Nodes are

declared with the command

(node nl n2 n3 n4 . . . . )

where nl, n2, n3, n4, . . . are the names of the nodes to be referred

to in the netwoek.

(3) The declaration of the nodes has provided the "skeleton" for the

network. Then components, such as transistors or capacitors, must

be fille d in to construct the whole circu it. A transistor is

written in the form :

(transistor-type gate source drain width length)

Transistor-type represents a mnemonic for various types of

transistors, such as

dtrans for n-channel depletion-mode transistor

etrams for n-channel enhancement-mode transistor

Specification of the width and length of the transisor's gate area

in units of lambda is optional. I f omitted, the default width

(fir s t number) and length (second number) for depletion-mode

transistor are ? and 8, while the defaults for enhancemnet-mode

transistors are 2 and 2. The width and the length of the gate area

determine the resistance of the transistor, i.e . influence the

ratio of the pull-up to pull-down.

149 (4) The pull-down is specified, analogously to (3), as an n-channel

enhancement-mode transistor with a gate width 4 and a gate length

of 2.

(5) The final element to be specified in the inverter is the

interconnect capacitance. The command

(capacitance out 0.03)

te lls NETLIST that a capacitance of 0,03 pF is to be connected

between the nodes "out" and GND,

For those frequently used, basic circuits, it will be convenient to define macros, which can be stored in a library file and easily loaded

into future network description files without having to redefine them.

In NETLIST, some basic functions, such as inverter, Nand, Nor, and

And-Or-Inverter, are already defined. They are shown in figure 6.2,

6.3, 6.4, and 6.5 respectively and specified in the following manner :

(invert (out width-o length-o) (in width-i length-i))

(nand (out width-o length-o) (ini width-1 length-1)

(in2 width-2 length-2) ..,)

(nor (out width-o length-o) (in i width-1 length-1)

(in2 width-2 length-2) ...)

(and-or-invert (out width-o length-o)

((inll width-11 length-11) (inl2 width-1? length-12) ...)

((in21 width-21 length-21) (in22 width-22 length-22) ...)

)

The gate size for each transistor is specified with width and

length together with the node to which the gate is connected. For

1 5 0 V d d

out

ini i n2 1P

Figure 6.4 NMOS NOR

Vdd

out

in ll in21 i n 12

Figure 6.5 NMOS AND-OR-INVERTER

151 example, (invert (out 3 9) (in 4 2)) creates an NMOS inverter whose depletion-mode pull-up has a gate area of 4 by 2 lambda, and whose enhancement-mode pull-down has a gate area of 3 by 9 lambda. In order to have better symmetry on the rising and fallin g edges of the output waveform, the number of the enhancement-mode pull-down transistors in an

Nand gate or on the branch of an And-Or-Inverter is recommanded to be less than three.

A 2-bit adder is used as a second example to show how the network is described in the macro-based language. The detailed network description of the 2-bit adder is shown Appendix C .l.

6.4 Logic Level Simulation

After the logic network description has been written to the file

" file .n e t11, it is processed with the NETLIST and PRESIM programs. For example, the command

% netlist inverter.net inverter.lis w ill cause NETLIST to process the network description file inverter.net, writing its output to the file "in verte r.lis". % is the prompt of the

UNIX system. The next step is to process "in verter.1is" with PRESIM.

PRESIM transforms the transistors in the " file .lis " file into resistors of equivalent sizes. This is done because RNL uses resistor models for the transistors and estimates transistion time delays from the equivalent network formed by the resistors and the circuit capacitances. The command

% presim inverter.lis inverter.rnl will cause PRESIM to process "in verter.lis", putting the output into the

152 binary file "inverter.rnl11. The "inverter.rnl" file can now be used as the binary network description for RNL simulation.

All the necessary preparations to run a simulation of the inverter.net are now complete. To issue RNL simulation, enter the command :

% rnl

RNL prompts its version number

Version 4,2

Before the simulation is satrted, two file s , "uwstd.l" and "uwsirn.l", containing function definitions for RNL, must be loaded.

(load "”cad/lib/rnl/uwstd.l")

(load "”cad/lib/rnl/uwsim.l")

Next, load the binary network description f ile "in verter.rn l".

( read-network "in verter,rn l")

RNL w ill prompt with information about the network :

; 8 nodes; transistors : enh=0 intrinsic=0 p-chan=0 dep=0

low-power=0 pullup=0 resistor=0

There is a simple command ("s11) to run a simulation step for an amount of time defaulted to 100 ns. To change this, a variable "incr" can be set. "incr=l" means the simulation time interval is 0.1 ns. The command to assign a value to a symbolic value is "(setq symbol value)".

For example, the simulation time interval is changed to one nanosecond,

(setq incr 10)

Frequently, it is convenient to refer to a group of nodes, rather than to one individual node. A symbol name for a lis t of node names can be

153 denoted with "setq" command. For example, "inv-nodes" is given to the lis t of the two nodes "in" and "out".

(setq inv-nodes '(in out))

The final step is to specify details for the reports on the simulation step. There are two standard report forms available in RNL. The firs t type lists the states of nodes whenever these states change.

Consequently, a timing analysis can be obtained. To set the change-flags of nodes "in" and "out" to "true", the command is

(chflag inv-nodes)

The second type of report lists the state of nodes at the end of a simulation step. To obtain such a report on the nodes "in" and "out", use the "def-report" command :

(def-report ’ ("STATE AT THE END OF SIMULATION STEP :" in out))

Now try a simulation. Setting the input of the inverter to high voltage is simply done by entering

h i n

Run a simulation step by entering

s

According to the report specifications in the "chflag" and

"def-report", RNL replies :

Step begins 0 0 ns,

in = I P O

out = 0 P 0.1

STATE AT THE FND OF SIMULATION STEP :

Current time = 1

in = 1 out = 0

154 The reports on changes in the states of the nodes "in" and "out" show that "in" was set to High at the time zero, and "out" changed to Low at

0,1 ns. The time delay in the change of the output is caused by the

time needed to discharge the gate capacitance of the inverter and the

output node capacitance of 0,03 pF. Now consider another state of the

inverter. Set the input to Low

1 in

and then do a simulation step :

s

RNL replies :

Step begins 0 1 ns,

in = 1 0 0

out = 0 0 0.6

STATE AT THE END OF SIMULATION STEP :

Current time = 2

in = 0 out = 1

To exit RNL, enter

exit then back to UNIX,

There is one other mode of RNL operation called batch mode. All the commands to RNL can be written into a f ile , "file.cmd". This f ile is then treated as a parameter when RNL is executed. For example, all the above commands related to the inverter are written into the file

"inverter.cmd" :

155 (load ""cad/lib/rnl/uwstd.l")

(load ""cad/lib/rnl/uwsim.l")

( read-network "in verter.rnl“)

(setq incr 10)

(setq inv-nodes '(in out))

(chflag inv-nodes)

(def-report '("state at the end of simulation step in out))

(h '(in ))

(s '( ) )

(1 '(in ))

(s '() )

(exit)

Now run RNL again, with command file "inverter.cmd" as a parameter :

% rnl inverter.cmd

The same results as before will be obtained.

The simulation of the 2-bit adder is put into the batch mode. The command file "add2.cmd" for the 2-bit adder described in Appendix C.l is created for RNL simulation and shown in Appendix C.2. Then enter the

following three commands to get the simulation result stored in the file

"add2.ult".

% n etlist add2.net add2.1is -tnmos -u200

% presim add2.1is add2.rnl

% rnl add2.cmd > add2.ult

The switch -t tells the technology is used. Here it is NMOS technology.

The swich -u sets the number of centi-microns per lambda (default is

250). The logic simulation result, stored in the file "add2.ult", is checked. If incorrect, the circuit network should be modified and the above three commands are executed until a correct simulation result is obtained. A 2-bit adder is then laid out according to the described circuit add2.net. To test layout correctness, the layout is extracted and simulated in logic level using RNL or ESIM, and in circuit level using SPICE. Since ESIM does not support timing analysis, RNL is usually preferred for use in logic simulation. How to do circuit simulation with SPICE will be explained in the next section.

After the 2-BIT ADDER has been laid out according to the add2.net

and stored in the file add2.ca, the next step is converting the CAESAR

file "file.ca" into CIF file "file.cif" by entering the following commands :

% caesar -n add2

: c if -p

: quit where : is the prompt of the. CAESAR program. The -n switch causes

CAESAR to run in non-interactive mode. The -p switch is necessary to

obtain CIF files for circuit extraction and/or simulation. Enter the

following four commands to check the layout in logic level.

% mext ra add2.ci f

% sim filter add2.sim add2.tem

% presim add2.tem add2.rnl

% rnl add2.cmd > add2.ult

The simulation result is stored in the f ile "add2.ult", and should be

the same as that obtained from the circuit description except the

timings, because the capacitances of gates and interconnections in the

157 layouts are extracted and are heavily dependent on the layout geometry.

The SIMFILTER reformats the simulation file format of either Berkeley or

MIT into the other simulation format. Now, Rerkeley format, "addP.sim",

is reformated into MIT format, "add2.tem", which allows layouts

extracted by MEXTRA to be simulated using RNL. Note that the add2.cmd

is described in Appendix C.2. Another alternative is using ESIM for

logic level simulation and then CRYSTAL for timing analysis.

6.5 Circuit Level Simulation

After doing logic simulation and timing analysis for the circuit

network described in the LISP-like language or for the layouts

individually, a detailed circuit simulation, such as SPICE, is

necessary. SPICE is a general-purpose circuit simulation program for

nonlinear dc, nonlinear transient, and linear ac analysis. Circuits may contain resistors, capacitors, inductors, mutual inductors, independent

voltage and current sources, four types of dependent sources,

transmission lines, and the four most common semiconductor devices :

diodes, BJT's, JFET's, and MOSFET's. SPICE has b u ilt-in models for the

semiconductor devices, and the user needs to specify

only the pertinent model parameter values. Three MOSFET models are

implemented; M0S1 is described by a square-law I-V characteristic, M0S2

is an analytical model and M0S3 is a semi-empirical. The SPICE from

Berkeley only works for M0S1 model.

The 2-bit adder, "add2.net" described in Apppendix C .l, is used as

an example to explain how to run SPICE. Enter the following commands

and then the circuit simulation result can he obtained and stored in the

f ile "add2.out".

158 % n etlist add2.net add2.1is -tnmos -u200

% sim filter -n add2.1is add2.sim

% pspice -d defs -m model -e add2.io add2

% spice add2,spcin add2.out

The switch -n te lls SIMFILTER to generate output in Berkeley format,

"add2.sim", with NMOS technology. Besides the add2.net, some file s , defs, model, and add2,io, must be prepared before the above commands can be executed. The add2.io contains all specified input signals, some output signals which are desired to check, and some control commands for

SPICE. It is shown in Appendix C.3. Model parameters of the simulated devices, stored in the file "model" and given by MOSIS, are shown in

Appendix C.4. The "defs" sets up the equivalences between node names in the simulation f ile and SPICE node names. The GND node is always set to node 0 in SPICE, while the VDD node is set to node 1. To avoid the low cases of the GND and VDO being set to different nodes in SPICE, they are set to node 0 and I in the "defs" f ile , which is shown below,

set gnd 0 nmos

set vdd 1 nmos

PSPICE is a shell script for preparing SPICE input from several sources.

PSPICE runs SIM2SPICE to convert from a "file.sim " format circuit description to a SPICE compatible description. For example, SIM2SPICF reads "add2.sim", "add2.nodes" and "add2.al" files and creates

"add2.spice" and "add2.names" file s . PSPICE then runs SPCPP to translate a "pseudo-spice" formatted f ile that contains symbolic node labels to a SPICE acceptable f ile . For example, SPCPP reads

"add2.names" and "add2.io" file s and creates "add2.spcx" f ile . Finally,

159 PSPICE concatenates the circuit description f ile , the translation table, a file of untranslated SPICE input, and the translated SPICE input into a single file, ,,add2,spice", To simulate a circuit layout, e.g.

"add2.ca", enter the following commands and the circuit simulation result will be stored in the f ile "add2.out".

% caesar -n add2

: cif -p

: quit

% mextra add2,cif

% pspice -d defs -m model -e add2.io add2

% spice add2.spcin add2.out

6,6 Summary

The procedures for using VLSI CAD tools to design an integrated circuit chip are introduced in this chapter. How to describe a designed circuit with NETLIST, a LISP-like language, and simulate it in logic level with RNL or ESIM and then in circuit level with SPICE, is explained in detail with an example of a 2-bit adder.

More information can be found in "VLSI Design Tools Reference Manual" released from the UW/NW VLSI Consortium.

The VLSI design tools currently used of The Ohio State University are able to support design with the NMOS, and CMOS fabrication processes available through MOSIS, the Department of Defence's MOS Implementation

Service run by the Information Sciences Institute of the University of

Southern California. MOSIS now supports NMOS, CMOS/Bulk, CMOS/SOS, and

Printed Circuit Board technologies. MOSIS usually aggregates several

160 small projects submitted by the same organization into

Multi-Project-Chps (MPCs), and the various chips of the same technology into Multi-Chip-Wafer (MCWs). It is very common for MOSIS to have wafer with over 100 individual projects, and wafers with about 50 different die types of several sizes.

Some CAD tools, released from UW/NW VLSI Consortium are s till primitive. For example, when laying out the circuit with CAESAR, an interactive circuit editor, the user must keep in mind the design rules, which depend on IC processing technology. Different processing technologies give different layout design rules. So, once the process line is changed, the user must remember another set of design rules. A kind of symbolic layout, called VIVID [63], released recently by the

Microelectronics Center of North Carolina (MCNC) can eliminate this drawback. Users just need to draw the symbols to represent different layers without consideration of the design rules.

The CAD tools are s t ill being developed toward a so-called silicon compiler which can translate a high-level functional description or behavior description of a chip down to the actual layout of the device

[10]. The silicon compiler is divided into two stages. The firs t stage is the translation of a brief functional or behavioral description into a more precise intermediate that is s t ill implementation independent.

The second stage is the automatic generation of a chip layout from the intermediate description.

Software simulation is able to verify the chip before it is fabricated and thus aids in fast turnaround time and saves the expense of silicon foundries. Rut the simulation time for a large circuit takes

161 from hours to days or even months. Therefore* some simulations are implemented and run on special-purpose processors, such as Zycad's LE

1002. For a 61K gates circuit, LE 1002 can simulate four hundred times faster than the DECSIM, Digital's internal simulator [11],

162 CHAPTER 7

SUMMARY AND CONCLUSIONS

7.1 Summary

In this research, three objectives have been achieved: (1)

Architectures based on the Robotics Processor chip have been been shown to be applicable to the solution of the general robotics problem involving the Jacobian, Inverse Jacobian, and Inverse Dynamics. (2) The architecture and the major parts of the RP chip have been designed, and

(3) The VLSI design tools released from UW/NW VLSI Consortium have been used for the fir s t time at The Ohio State University to fabricate a chip.

Several current VLSI computing structures, such as systolic array and wavefront array processor (WAP), were surveyed in an attempt to solve the intensive computations required in the Inverse Plant plus

Jacobian. Since the effectiveness of these approaches is contingent upon large dimension arrays and since the dimensions of the vectors and matrices in robotic systems are rather small, 3x1 (or 4x1) and 3x3 (or

4x4), neither the systolic array nor the WAP architectures could be successfully applied to the current application. Instead, several special purpose dedicated attached processors for the Inverse Plant plus

Jacobian were developed. These attached processors are based on the

Robotics Processor being developed with state-of-the-art VLSI technology

163 at The Ohio State University. These special purpose dedicated processors will be attached to a host microcomputer, and multiprocessor system concepts will be used to interconnect these multiple processors for real time control.

Based on the current processing capability supported by MOSIS, the achitecture of the RP has been tentatively designed. The data path contains a Register File with 64 words (32 bits per word), a floating point adder/subtractor, and a floating point multiplier. Both arithmetic units have three pipeline stages and can execute at the same time.

From the architecture of the RP's data path, the computation times for all vector and matrix operations can be exactly obtained and normalized in units of complexity. Using complexity instead of time expressed in microseconds allows the results to be independent of the system clock. A task graph was employed to schedule processes for more than one processor for the Jacobian and Inverse Dynamics applications,

Roth computation complexities and I/O transferring complexities can be shown in the task graph. In addition, the total execution time of each task, e.g. Inverse Dynamics, can be estimated for each architecture.

Once the total execution time, initiation rate, processor utilization ratio, and sizes of the Register and Control memory have been calculated based on the microprogram, the most desirable architecture can be determi ned.

The major parts of the RP chip, such as the FPA, FPM, Bootstrap

Unit, and Format Converters, have been designed to logic gate level or function equations, which can in turn be implemented with a PLA by using the TPLA package. Some basic and often used cells, such as a full

164 adder, are designed and laid out in a compact form. A chip containing a

4-bit adder, a two-phase clock generator, and a PLA controller have been designed and fabricated by MOSIS with 4 micron NMOS technology. The chip was received three months after the corresponding CIF was sent to

MOSIS through CIS net and APART net.

A 2-bit adder was used as an example to show how the circuit network is described in LISP-like language. The logic function and timing of the described network can then be tested and verified before they are laid out. Once the circuit is laid out, its layout is extracted and then simulated to verify its logic level and timings. Some of tiie CAD tools used at The Ohio State University are somewhat prim itive. More powerful tools are being installed on the VAX 780. For example, the VIVID [63] system dramatically shortens custom integrated circuit design time. It translates symbolic layout, automatically, into the geometric representation necessary for mask generation. This approach offers the designer two advantages over the traditional mask editing approach to custom layout : (1) Technology independence : the symbolic layout does not need to be redesigned for different design rules. (2) Correctness-by-construction : the compactor generates the physical layout; therefore, the designer does not have to be aware of the design rules in order to create an error-free layout. It also supports the Interactive Circuit Editor (ICE) package, allowing the designer to select graphic representations of circuit elements to describe the circuit network without using LISP-like language. Another

VLSI tool installed is TEGAS-5, donated by the GE Calma Company. The

TEGAS-5 system consists of several subsystems that include logic and

165 design verification, testability analysis, fault simulation, and test generation. It supports the concept of design-for-testabi1ity by providing the designer with a method of measuring the te s ta b ility of the design during the design process,

7.2 Research Extensions

There are many research recommendations that could be an extension of the research reported in this dissertation. The following is a brief description of future research that could be done,

1, The architecture of the Robotics Processor is heavily dependent on VLSI technology. For example, i f the access time of the

Register File can be made to be smaller than one fourth of P_CLK (i.e . smaller than 250 ns), the two-bus, rather than the three-bus, configuration of the data path could conceivably be used. Furthermore, i f the access time could be reduced even more, the one-bus configuration could be used instead.

2. The control memory used to store microprograms, could be implemented with Static RAM (SRAM), Dynamic RAM (DRAM), or RDM.

Fabricating the memory on the chip would require about 40 fewer pins as compared with using an off-chip memory which is an important concern from the standpoint of current pin lim itations. I f the application demands a very large CRAM then DRAM probably is suitable since its bit cell needs only one or three transistors, while SRAM needs 6. Rut DRAM requires a refreshing circuit and thus is more complicated. I f the microprogram is fixed and will not be further changed, ROM is the best choice since it has highest density and re lia b ilty . Also, the Rootstrap

1 6 6 Unit and Format Converter for Bootstrap are no longer required. Since the Register File size is not large, six transistors for each bit cell could be used. Two address decoders are required because two operands in different locations are to be accessed at one time,

3. The current design of the next-microinstruction address unit includes neither the capability for jumping-to-subroutine nor conditional branching. Because the application microprograms for

Jacobain, Inverse Jacobi an, and Inverse Dynamics are fa irly short, they are written in a sequential, straightline manner,

4. Another possible architecture is to take the two arithmetic units out of the Robotics Processor. In this case they can be implemented using commercial floating processors, such as Weitek's WTL

1032 (or 1064) m ultiplier, WTL 1033 (or 1065) adder [56] [58], and AMD's

Am29325 [59]. The problem here is the lim itation on the number of pins.

The four I/O ports alone require 64 pins. Presently, MOSIS supports a pin number maximum of 84. In. the future, MOSIS may support a package having more pins, say up to 144. Then using the commercial floating point processors will be feasible. However, the comparison table in chapter 4 shows that any one of the applications of the Jacobian,

Inverse Jacobian, and Inverse Dynamics can be executed in one millisecond based upon the parallei/pipeline computing structure with RP as basic building block. Therefore, although there are several faster commercial floating point processors available, the two arithmetic units in the RP are fast enough to meet our requirements. For example, the AMD

Am29325 can complete a floating point addition or multiplication in 100 ns, and is about 40 times faster than the arithmetic units in the RP.

167 Another alternative is using a 32-bit I/O bus on each I/O port but keeping the two arithmetic units in the Robotics Processor. The comparison table shows that the I/O communication increases as more RPs are used. As I/O communication between the RPs is a dominant factor, using a 32-bit I/O bus instead of 16—bTt would be very desirable and would double the data transferring rate.

5. In the current design no handshaking signals are used between the RPs, thus data transfer timing must be known and the RPs must be globally synchronized. In addition, clock skew must be carefully avoided. Also, since it is assumed that the data accessed by each processor is always the freshest available, no other signals are used to coordinate data exchanges except for load mode handshaking signals between the RPs and a host computer. Since all RPs are synchronized and the timing for transferring data is predefined, no status bits of the arithmetic units are ever tested. It is conceivable that a more general philosophy should be adopted. Specifically, it may be found that it is more realistic to use handshaking signals between the RPs, and between

RPs and host computer and to have asynchronous communication.

6. I f the pin number of the Robotics Processor chip is large enough, the clock generator can be placed outside of the Robotics

Processor chip. This leads to more r e lia b ility , easier handling of the

RP chip, and the generating of the Scan Path clock, Psi 1 and Psi2, for

LSSD testabi1ity .

7. In order to permit maximum accuracy to be retained in the result, it is important to extend the current internal 32-bit word length with guard bits. To date no simulations have been done to

l f i 8 determine how many guard bits would be needed in the Robotics Processor chip implementing an application program. Also, no rounding methods except truncation have been considerd in the FPA and FPM, Many other more accurate rounding methods, such as adder-based rounding, Von

Neumann rounding, and ROM rounding [31, p. 427-431] could be considered in the future research.

P. The design of the RP architecture is based on NMOS technology.

There are several advantages in using CMOS technology (MOSIS has been supporting 3 micron P-well CMOS), F irst, since the rising and falling edges of the output waveform are symmetrical, no precharging is necessary as in NMOS technology. Second, the voltage passing through a pass transistor pair doesn't drop by the threshold voltage as it does through a N-channel pass transistor. Third, the ratio of the pull-up transistor vs. pull-down transistor is of no concern except that the width of the P-channel is almost always 2.8 times the N-channel because the N-type's mobility is faster than the P-type's. Fourth, CMOS has better noise immunity and uses less power. Designing the circuit with

CMOS technology w ill not be more sophisticated than with NMOS, since the ratio is inconsequential. The only drawback is that the VLSI design tools used at the Ohio State University do not have any package capable of generating a PLA with CMOS technology.

9. Because there are three pipeline stages in the FPA and FPM, a reservation table is used to help write the application microprogram.

This will likely result in errors when the microprogram is being coded; however, if microcode complier (or assembler) is developed, it will not only reduce the errors, but also speed up coding.

169 10. Investigation of the three tables in chapter 4 shows that the

RP contains CRAM -10.5K bits and RF -64 x 32 bits. If the CRAM bit cell

is made of three-transistor ORAM, and the RF bit cell is made of

six-transistor SRAM, the total transistor number for memory is 10.5 x 3

+ 64 x 32 x 6 = 44K plus decoders, refreshing c ircu it, and drivers, and

is about 50K. Each full adder consists of 25 transistors. The FPA contains 72 full adders and the FPM contains 89 fu ll adders, in total

161. From the block diagram of the FPA and FPM, it can be seen that the

number of components required for the adders is more than half of the 4K transistors used in the FPA and FPM. Including the format converters, clock generator and sequencer, the total number of transistors on the RP

chip is roughly 60K. It is possible to fabricate this number of transistors on a moderate die size with 3 micron NMOS techology. One existing example is Berkeley RISC II CPU with the same technology, containing about 40K transistors, and having a die size of 171 mil x 304 mil (4.34 mm x 7.72 mm) [17]. For this example, the die area of the RP can be estimated using the fact that the ratio of die areas is

approximately 1.5 times the ratio of complexities. The 1.5 factor

arises from the expected increase in interconnection area.

Specifically, the die size of the RP containing 60K transistors will be about 8.2 mm x 8.2 mm, about twice of that of the RISC CPU. This would

seem to present no problem since the maximum die size which MOSIS

supports is 9.5 mm x 10.5 mm with an 84-pin package.

170 REFERENCES

[1] Mead, C. and Conway L ., Intoduction to VLSI systems, Reading, Mass. : Addison-Wesly, 1980.

[2] Rideout, V.L., "Limits to Improvement of Silicon Integrated Circuits," Proceedings of COMPCON, 1980.

[3] Burger, R.M., Cavin, R.K., Holton, W.C. and Sumney L.W., "The Impact of ICs on Computer Technology," IEEE Computer, Oct. 1984.

[4] Taylor, S., "Gearing up for GaAs," VLSI Design, April 1984.

[5] Cole, B.C., "CMOS Memories Replacing NMOS in Megabit Storage Chips," ElectronicsWeek, Nov, 26, 1984.

[6] Cohen C.L., "Cells Combine CMOS, Bipolar Transisters," ElectronicsWeek, Nov. 5, 1984.

[7] McDonald, J .F ., Roger, J .F ., Rose, K. and Steekl, A., "The Trials of Wafer-Sacle Integration,11 IEEE Spectrum, October 1984.

[8] Beresford, R., "VHSIC : Redefining the Mission," VLSI Design, Nov. 1983.

[9] Berney, K., Wesley, R., Lineback, J.R. and Waller, L ., "Chip Makes Ready for VHSIC Phase I I , " ElectronicsWeek, Nov. 12, 1984.

[10] Werner, J ., "Progress Toward the 'Ideal' Silicon Complier," VLSI Design, Sep. and Oct. 1983.

[11] Rezac, R.R. and Smith L .T ., " A Simulation Engine in the Design Environment," VLSI Design, Nov. 1984.

[12] Hwang, K. and Briggs, F.A., Computer Architecture and Parallel Processing, Mcgraw-Hill, Inc., New York, New York, 1984. ~

[13] Kogge, P.M., The Architecture of Pipelined Computer, Mcgraw-Hill, Inc., New york, New York, 1981.

[14] Charlesworth, A.E., "An Approach to Scientific Array Processing : The Architecture Design of the AP-120B/FPS-164 Family" IEEE Computer, Sep. 1981,

171 [15] Bernhard, R., "Giants in Small Packages," IEEE Spectrum, Feb. 1982.

[16] Ribble, E.A., Synthesis of Human Skeletal Motion and the Design a Special-Purpose Processor for Real-Time Animation of Human and Animal Figure Motion, M.S. Thesis, The Ohio State University" Tune"'!!)!??. ------

[17] Scherburne, R.W., Katevenis, M.H., Patterson, D.A., and Sequin, C.H., "32-bit NMOS Microprocessor with a Large Register File ," IEEE Journal of Solid-State Circuit, Vol. SC-19, No. 5, October m ------

[18] Wahawisan, W., A Multiprocessor System woth Applications to Hexapod Vehi cl e~~Cbntrol, Ph.D. dissertation, Tne Ohio State University, Sep. 1981,

[19] Nobel, B. and Nani el, J.W., Applied Linear Algebra, Prentice-Hal1, Inc., 1977.

[20] Orin, D.E. and Olson K.W., Special Purpose Computer Architectures for Control of Robotic Mechanisms, Department of Electrical Engineering, The Ohio State University, March 1983.

[21] Orin, D.E., "Pipelined Approach to Inverse Plant Plus Jacobian Control of Robot Manipulators," IEEE International Conference on Robotics, Atlanta, Georgia, March 1984.

[22] Paul, R.P., Robot Manipulators : Mathematics, Programming, and Control, The MIT Press, Cambridge, Mass., 1981.

[23] Schrader, W.W., Efficient Jacobian Computation for Robot Manipulators on Serial and Pipelined Processors, M.S. Thesis at The Ohio $tate University, 1983.

[24] Lathrop, R.H., Parallelism in Manipulator Dynamics, M.S. Thesis at Massachusetts Institute o f Technology, 1983.

[25] Kung, H.T. and Leiserson, C.E., "Systolic Arrays (for VLSI)," Proc. Symp. Sparse Matrix Computations, Applications, Nov. 2-3, 1978, pp256-282.

[26] Kung, S.Y., Gal-Ezer R.J. and Arun, K.S.."Wavefront Array Processor : Architecture, Language and Applications," Proc. of the Conference on Advanced Research in VLSI, M .I.T ., January ~mr.

[27] Liu, P.S. and Young, T.Y. "VLSI Array Design Under Constraint of Limited I/O Randwidth," IEEE Trans. Comp., Vol. C-32, No. 12, December 1983, pp. 1160-1170.

172 [28] Nash, J.G., Hansen, S. and Nudd, G.R., "VLSI Processor Arrays for Matrix Manipulations," VLSI System and Computation, edited by Kong, H.T., Sproull, B, and Steele, G,, Computer’"Science Press, 1981.

[29] Waser, S. and Flynn, M.J., Introduction to Arithmetic for Digital Systems Designers, New York : Holt, Rinehart and Winston ; m College f»ub., 1982.

[30] Hwang K., Computer Arithmetic : Principle, Arichitecture, and Design. New York : John Wiely A Sons, 1979.

[31] Cavanagh, J .J .F ., Digital Computer Arithmetic : Design and Imp!ementation, New York : McGraw-Hill, 1984.

[32] Kung, H.T., Sproull, B, and Steele, G., VLSI Systems and Computations, Rockvill, Maryland, Computer Science Press Inc., m 1

[33] Muroga, S. and Lai, H.C., "Minimization of Logic Networks under a Grneralized Cost Function," IEEE Trans. Comp. , Vol. C-25, September 1976, pp.893-907.

[34] Lai, H.C. and Muroga, S., "Mininum Parallel Rinary Adder with NOR (NAND) Gates," IEEE Trans. Comp., Vol. C-28, No. 9, September 1979, pp. 648-659.

[35] Sakurai, A. and Muroga, S., "Parallel Binary with a Minimum Number of Connections," IEEE Trans. Comp., Vol. C-32, No. 10, October 1983, pp. 969-976.

[36] Brent, R.P. and Kung H.T., A Regular Layout for Parrel Adders, Technical Report, Dept, of Computer Science, Carnegie_Mellon University, CMU-CS-79-131, June 1979.

[37] Luk, U.K., "A Regular Layout for Parallel M ultiplier of 0(LogN X LogN) Time," VLSI System and Computation, 1981.

[38] Luk, W.K. and Vulliemin, J. E., "Recursive Implementation of Optimal Time VLSI Interger M ultiplier," VLSI'83, Anceau, F ., and Aas, E.J. (eds.), Elsevier Science Publishers B.V. (North-Holland), 1983.

[39] Stenzel, W.J., Kubitz, W.J. and Garcia, G.H., "A Compact High-Speed Parallel Multiplication Scheme," IEEE Trans. Comp., Vol. C-26, No. 10, October 1977.

[40] Bandeira N., Vaccaro, K. and Howard, A., "A Two's Complement Array M ultiplier Using True Values of the Operands." IEEE Trans. Comp., Vol. C-32, No. 8, August 1983.

173 [41] Ercegovac, M.D., and Nash. J.G ., A VLSI Design of A Radix-4 Carry Save M u ltip lie r, UCLA Computer '£cien ce Depart me ntT, Los Angels, April I9TT4.

[42] Reusens, P., Ku, W.H., amd Mao, Y.H., "Fixed-Point High-Speed Parallel M ultipliers in VLSI," VLSI System and Computatuion, m i .

[43] Lerouge, C.P., Girard P. and Colardelle, J.S., "A Fast 16 Bit NMOS Parallel Multiplier," IEEE Journal of Solid-State Circuit, Vol. SC-19, No. 3, June 1 9 8 ^

[44] Hartring, C.O., Rosario, B.A. and Picket, J.M., "High-Speed Low-Power Silicon MESFET Parallel M ultipliers," IEEE Journal of Solid-State C irc u it, Vol. SC-17, No. 1, Feb. 1982.

[45] Lee, F.S., "A High-Speed LSI GaAs 8x8 Bit Parallel M u ltip lier," ~mr.IEEE ------Journal of Solid-State Circuit, Vol. SC-17, No. 4, October

[46] Pareparta, F.P., "A Mesh-Connected Area-Time Optimal VLSI Integer M u ltiplier," VLSI Systems and Computation. 1981.

[47] Chen, I.N. and Willoner, R., "A 0(n) Parallel Multiplexer with Bit-Sequential Input and Output," IEEE Trans. Comp., Vol. C-28, No. 10, October 1979.

[48] Lyon, R.F., "Two's Complement Pipeline M ultipliers," IEEE Trans. Communi ati ons, April 1976.

[49] Strader, N.R. and Rhyne, V .T., "A Cannonical Bit-Sequential M u ltiplier," IEEE Trars. Comp., Vol. C-31, No. 8, August 1979.

[50] AM29516 and AM29517, 16x16 Parallel M u ltiplier, Bipolar Microprocessor Logic and Interface Oata Rook, Advanced Micro Devices, Sunnyvale, California, 1984,

[51] Zurawski, J.H.P. and Gosling, J.B ., "Design of High Speed Digital Divider Units," IEEE Trans. Comp., Vol. C-30, No. 9. September 1981.

[52] TRW LSI Multipliers Applications Notes, TRW LSI Products, Redondo, California.

[53] Stevenson, D., "A Proposed Standard for Binary Floating Point Arithmetic," IEEE Computer, March 1981.

[54] Kuck, D. J. et a l., "Analysis of Rouding Methods in Floating Ponit Arithmetic," IEEE Trans. Comp.. Vol. C-26. No. 7. July 1977, pp. 643-650.

174 [55] Ware, F.A. and McAllister W.H., "64 Bit Monolithic Floating Point Processors," IEEE Journal of Solid-State C irc u it, Vol. SC-17, No. 5, October 1982.

[56] Woo, B., Lin, L. and Owen, R.E., "ALU, M ultiplier Chips zip Through IEEE Floating-Point Operations," Electronics, May 19, 1983.

[57] Turney, J.L. and Mudge, T.N., "VLSI Implementation of a Numerical Processor for Robitlcs," Proc. of the 27th International Instrumentation Symposium, pp.169-1?5, Indianapolis, Indiana, April 1981.

[58] Ware, F ., Lin, L, Wong R., and Woo, B ., "Fast 64-bit Chip Set Gangs up for Double-Precision Floating-Point Work," Electronics, July 12, 1984.

[59] Chu, P. and New, B.J. "Microprogrammable Chips Rlend Top Performance with 32-bit Structures," Electronic Design, Nov. 15, 1984.

[60] Williams, T.W. and Parker K.P., "Design for Testability - a Survey," IEEE Trans. Comp., Vol. C—31, No. 1, January 1982, pp. 2-13.

[61] Williams, T.W., "VLSI Testing," IEEE Computer, October, 1984.

[62] Horowitz, E. and Sahni S., Fundamentals of Computer A1gorithms, Computer Science Press, Inc. 1978,

[63] Roger, C.D., Daniel S.W., and Rosenberg, J.B ., An Overview of VIVID, MCNC's Vertically Integrated Symbolic Design, MCNC Technical Report, Microelectronic Center of North Carolina, Research Triangle Park, North Caralina, 1985.

175 Appendix A.1 : Reservation Tables for Vector Operations.

1. Reservation table for the addition of two 3x1 vectors

xl yi zl zl - xl + yl

x2 + y2 = z2 z2 = x2 + y2

x3 y3 z3 z3 = x3 + y3

P CLK CYCLE

1 2 3 4 5 6 7 8 9 10111213 14 1$1617 18 1920

+ 1 zl z2z3

+2 zl z2z3

+3 zl z2z3

ST zl z2z3

The +1 to +3 represent the three stages of the floating point adder.

"ST" means storing the result to the RF. It takes six P_CLK cycles to complete the operation.

176 2. Reservation table for a 3x1 vector multiplied with a scalar

constant.

x l * z l z l = x l * c

x2 X c * z2 z2 = x2 * c

x3 z3 z3 = x3 * c

P CLK CYCLE

1 2 3 4 5 & 7 8 8 1(1 U 12 13 14 15 15 17 16 19 26

*1 z l z2 z3

*2 z l z2 z3

*3 z l z2 z3

ST z l z2 z3

The *1 to *3 represent the three stages of the floating point m ultiplier. It also takes six P_CLK cycles to complete the operation.

177 3. Reservation tahle for the inner product of two 3x1 vectors.

y i A xl * yl B x2 * y2 [ xl x2 x3 ] . y2 C x3 * y3 D A + B y3 Z C + D

P CLK CYCLE

It takes (3M + 10) P_CLK cycles to complete M inner products of vectors with 3x1.

178 4. Reservation table for the cross product of two 3x1 vectors.

- -- A = x2 * y3 xl yi zl R = y2 * x3 C = yl * x3 + x2 y2 = z2 [1 = xl * y3 E = xl * y2 x3 y3 z3 F = yl * x2 - - - zl = A - R z2 - C - D z3 « E - F

P CLK CYCLE

1 2 3 4 5 <5 7 ft ft 10 11 12 13 14 15 16 17 1ft 1ft 20

*1 AB C DE F

*2 A B CD EF

*3 A BC DE F

ST A B C 0 E F

+ 1 zl z2 z3

+2 Zl z2 z3

+3 zl z2 z3

ST zl z2 z3

It also takes 13 P_CLK cycles to complete the operation.

179 5. Reservation table for the multiplication of a 3x3 matrix with a 3x1

vector.

xll xl2 xl3 1 yi Zl

x21 x22 x23 * y2 = z2

x31 x32 x33 | y3 z3

A = xll * yl R = xl2 * y2 C = x21 * y l 0 = x22 * y2 E = x31 * yl F = x32 * y2 G - xl3 * y3 H = x23 * b3 I = x33 * y3 J = A + B K = C ^ D L = E + F zl = G + J z2 = H + K z3 = I + L

P CLK CYCLE

1 2 3 4 5 6 1 A 5 If) 11 12 13 14 15 16 11 16 14 20

*1 A B CDE FGH I

*2 A R CDE F G H I

*3 A R C DEF GHI

ST A B CDE F G H I

+ 1 J K L zl z2 z3

+2 J K L zl z2 z3

+ 3 J K L zl z2 z3

ST J K L zl z2 z3

It takes 17 P_CLK cycles to finish it .

180 6. Reservation table for the multiplication of two 3x3 matrices.

x ll x!2 x 13 yll yl2 yl3 Zll zl2 zl3

x21 x22 x23 + y21 y22 y23 = z21 z22 z23

x31 x32 x33 y31 y32 y33 z31 z32 z33

A1 = x ll * y 11 B1 = xl2 * y21 Cl = x21 * y l l D1 = x22 * y21 El = x31 * y ll FI = x32 * y21 G1 = xl3 * y31 HI = x23 * b31 11 = x33 * y31 J1 = A1 Ht B 1 K1 = Cl + Dl LI = El * FI z ll = G1 + J1 z21 = HI + K1 z31 = 11 + LI

A2 = xll * y12 B2 = xl2 * y22 C2 = x21 * y 12 D2 = x22 * y22 E2 = x31 * y 12 F2 = x32 * y2? G2 = x 13 * y32 H2 = x23 * b32 12 = x33 * y32 J2 = A2 + B2 K2 = C2 + D2 L2 = E2 + F2 zl2 = G2 + J2 z22 = H2 + K2 z32 = 12 + L2

A3 = xll * y13 B3 = xl2 * y23 C3 = x21 * y 13 D3 = x22 * y23 E3 = x31 * y 13 F3 = x32 * y23 G3 = x!3 * y33 H3 = x23 * b33 13 = x33 * y33 J3 = A3 + B3 K3 = C3 + 03 L3 = E3 + F3 zl3 = G3 + J3 z23 = H3 + K3 z33 = 13 + L3

P CLK CYCLE

1 2 3 4 5 6 7 ft 9 10 11 12 13 14 15 16 17 1ft 19 20

*1 A1 B1 Cl Dl El FI G1 HI 11 A2 B2 C2 D2 E2 F2 G2 H2 12 A3 B3

*2 A1 B1 Cl 01 El FI G1 HI 11 A2 B2 C2 02 E2 F2 G2 H2 12 A3

*3 A1 B1 Cl Dl El FI G1 HI 11 A2 B2 C2 02 E2 F2 G2 H2 12

ST A1 B1 Cl Dl El FI G1 HI 11 A2 R2 C2 02 E2 F? G2 H2

+ 1 J1 K 1 LI 11 21 31 J2 K2 L2 12

+2 J1 K1 LI 11 21 31 J2 K2 L2

+ 3 J1 K1 LI 11 21 31 J2 K2

1

ST 1 C- K1 LI 11 21 31 J2 K2

181 P CLK CYCLE

21 22 23 24 25 26 27 28 29 38 31 32 33 34 35 36 37 39 4'fl 3 *1 C3 D3 £3 F3 G3 H3 13

*2 C3 03 E3 F3 G3 H3 13

*3 C3 D3 E3 F3 G3 H3 13

ST C3 D3 E3 F3 G3 H3 13

+ 1 22 32 J3 K3 L3 13 23 33

+2 12 22 32 J3 K3 13 13 23 33

+3 L2 12 22 32 J3 K3 L3 13 23 33

ST L2 12 22 32 J3 K3 L3 13 23 33

It takes 8 + 9 * 3 = 35 P_CLK cycles to complete the multiplication of two 3x3 matrices.

1 8 2 7, Reservation table for the inner product of two 4x1 vectors.

T [ xl x2 x3 x4 ] . [ y l y2 y3 y4 ] - 1

A ■ x 1 * y 1 B = x2 * y2 C = x3 * y3 D = x3 * y3

P CLK CYCLE

1 2 3 4 5 6 7 g $ lo 11 12 13 14 15 16 1? 16 19 20

*1 AB C 0 A B c 0

* 2 A B CD AB c D

* 3 AB C DA B C D

ST A B C D AB C 0

+ 1 E FE FZ Z

+2 E FE F Z z

+3 E F E F Z Z

ST E F E F ZZ

It takes (4M + 12) P_CLK cycles to complete M inner products of vectors with 4x1.

183 8, Reservation table for the inner product of two 5x1 vectors.

A = xl * yl F = A + B 3 = x2 * y2 G = C + D C = x3 * y3 H = E + F D = x4 * y4 Z = G + H E = x5 * y5

P CLK CYCLE

1 2 3 4 5 6 7 fl 9 i n 11 12 13114 15 16 17 IS 19

*1 A B C D E A R C D E

*2 A B C D E A B C f) E

*3 A B C D E A R CD E

ST A B C D E A B C D E

+ 1 F G H F G Z H z

+2 F G H F G Z H Z

+3 F G H F GZ H

ST F G H F G Z H

It takes (5M + 12) P_CLK cycles to complete M inner products of vectors with 5x1.

184 9. Reservation table for the inner product of two 6x1 vectors.

T [ xl x2 x3 x4 x5 x6 ] . [ y l y2 y3 y4 y5 y6 ] = Z

A = xl * yl G = A + B B = x2 * y2 H = C + D C = x3 * y3 I = E + F D = x4 * y4 J = G + H E = x5 * y5 Z = I + J F = x6 * y6

P CLK CYCLE

1 2 3 4 5 6 7 8 9 10 11 12 13 14 ■18 16 17 18 19 20

*1 A BC D E FA BC DE F

*2 AB CD EF A B CD EF

*3 A BC D E FA B C D E F

ST A BC 0 E F A B C 0 E F

+ 1 G H IG J H I Z J

+2 G H I G J H I Z J

+3 G H I GJ H I Z

ST G H I G J H IZ

I t takes (6M + 14) P_CLK cycles to complete M inner products of vectors with 6x1.

185 10. Reservation table for the inner product of two 7x1 vectors.

J [ xl x2 x3 x4 x5 x6 x7 ] . [ y l y2 y3 y4 y5 yfi y7 ] = Z

A = xl * y l H = A + R R = x2 * y2 I = C + D C = x3 * y3 J = E + F 0 = x4 * y4 K = H + I E » x5 * y5 L = G + J F = x6 * y6 Z = K + L G = x7 * y7

P CLK CYCLE

1 2 3 4 5 6 7 s 0 10 11 12 13 14 16 16 17 1ft 10 20

*1 A B C 0 E F G AB C DEF G

*2 A B c 0 E F G A B C n E F G

*3 A B C DEF G AB c DEF G

ST AB C DEF G AB C D E F G

+ 1 H I J K H LI J ZK

+2 H I JK H LI J 1 K

+3 H IJ K H L IJ Z

ST H I JKH L I J

It takes ( 7M + 14) P_CLK cycles to complete M inner products of vectors with 7x1.

186 11. Reservation table for the determinant of a 2x2 matrix

all al2 A = a ll x a22 = Z B = al2 x a21 a21 a22 Z = A - B

P CLK CYCLE

1 2 3 4 5 6 1 8 9 io n 12 13 14 15 16 17 is 19 26

*1 AB AB

*2 A B AB

*3 ABAB

ST ABA B

+ 1 ZZ

+2 ZZ

+ 3 Z z

ST ZZ

It takes (2M + 7) P_CLK cycles to complete M determinants of 2x2 matrices.

107 12. Reservation table for the determinant of a 3x3 matrix.

all al2 al3

a21 a22 a23 - Z

a31 a32 a33

A T~ a ll X a22 B = al2 X a23 C = al3 X a21

D = a ll X a23 E = al2 X a21 F = a 13 X a22

G = A X a33 H = B X a31 I = C X a32

J = n X a32 K = E X a23 L = F X a31

0 = G + H P = I - J Q = K + L

R 0 + P Z = R _ 0

P CLK CYCLE

1 2 3 4 5 6 1 8 0 10 11 12 13 14 15 16 17 1ft 10 20

*1 A B c D EFG H IJ K L AB C D E FGH

*2 A B C DEF GHI J K L A R C D EF G

*3 A B C 0 E FGH I J K L AB C DEF

ST A B CD EFG H I J K L A R CD E

+ 1 0 P 0 R

+2 0 P 0 R

+3 0 P 0 R

ST n P 0

188 P CLK CYCLE

21 22 23 24 25 26 27 23 24 30 31 32 33 34 35 36 3? 3ft 34 40

*1 I j K L *2 i J K L

*3 I JKL

ST I JK L

+ 1 z 0 P Q R Z

+2 Z 0 P Q R Z

+3 Z 0 P 0 R z

ST z 0 P 0 R Z

It takes ( 12M + 13) P_CLK cycles to complete M determinants of 3x3 matri ces.

189 Appendix A.2 : Microprogram for Jacobian (one RP per Link)

i -1 N+l N+l Qi = U, Ui = I) Ui-1 = U 1-1

i * N+l N+l Pi = P Ri = r Bi beta i

RF RF

RO Qi L1,U R9 'lhTT.1T.....

Rl Oi [1,2] RIO U i[l,2 ]

R2 Qi [1,3] Rll Ui[1,3]

R3 Q i[2 ,l] R12 Oi[2,1]

R4 Qi[2,2] R13 Ui[2,2]

R5 Qi [2,3] R14 Ni[2,3]

R6 Qi[3,1] = 0 R15 Ui[3,1]

R7 Qi[3,2] Rlfi Ui[3,2]

R8 Qi[3,3] R17 U i[3 ,3]

RF RF RF

R 18 Ui-1L1.1J R27 ML1J R36 temporary

R 19 U1-1[1,2] R28 Pi [2] R37 temporary

R20 Hi—1[1,3] R29 Pi [3] R38 tempora ry

R21 Ui-1[2,1] R30 Ri-lC 1] ,Ri C1]

R22 U1-l[2,2] R31 Ri-1[2],Ri[2]

R23 Ui- 1[2,3] R32 R i —1[3],R i[3]

R24 U i-1 [3 ,1] R33 Bi r i]

R25 Ui-1[3,2] R34 Bi [2]

R26 Ui-1[3,3] R35 Bi [3]

190 RF ~ > m " R'F"~> FAM SUB FPM --> FAM — >

1 Qi 2,1] , Qi[3,3] - - - - -

2 Qi 1.1] , 0 i[3 ,2 ] - - - - -

3 Qi 2,1] , Qi[3,2] - -- - -

4 Qi 1,1] , Qi[3,3] - -- 0 i[ l,2 ] -

5 Ui 1.13 , Qi[1,1] Qi[3,1] Q i[l,2 ] 1 Qi [2,3] -

6 Ui 1.2] , Qi[1,2] Qi[3,1] Qi[2,3] 1 Qi[1,3] -

7 Ui 2,1] , Qi[1,1] -- - Qi[2,2] -

8 Ui 2,2] , Qi[1,2] - - - R36 Q i[ 1,2]

9 Ui 3,1] , Qi[1,1] - - - R37 Qi[2,3]

10 Ui 3,2] , Q i[l,2 ] R36 R37 0 R36 -

11 Ui 1,3] , Qi[ 1,3] - - - R37 -

12 Ui 2,3] , Qi[1,3] R36 R37 0 R36 -

13 Ui 3,3] , Qi[1,3] - - - R37 R38

14 Ui 1.1] , Qi[2,1] R36 R37 0 R36 -

15 Ui 1.2] , Qi[2,2] R36 R38 0 R37 R38

16 Ui 2,1] , Qi[2,1] R37 R38 0 R36 -

17 Ui 2,2] , Qi[2,2] - - - R37 R38

18 Ui 3,1] , Qi[2,1] R36 R38 0 R36 U i - l [ l , l l

19 Ui 3,2] , Qi[2,2] R36 R37 0 R37 U i- l[ 2 ,l]

20 Ui 1,3] , Qi[2,3] - - - R36 -

21 Ui 2,3] , Qi[2,3] R36 R37 0 R37 Ui - 1[3,1]

22 Ui 3,3] , Qi[2,3] -- - R36 R38

23 Ui 1.1] , Q i[3 ,l] R36 R37 0 R37 -

24 Ui 1,2] , Qi[3,2] R37 R38 0 R36 R38

191 25 ' u TIT, IT "O il T.TT R36 R38 0 R37 -

26 Ui[2,2] 0i[3 ,2 ] - - - R36 R38

27 U i[3 ,l] 0 i[ 3 ,l] R37 R38 n R37 Ui- 1[1,2]

28 Ui[3,2] Q i[3,2] R36 R37 0 R36 U i-1 [2 ,2]

29 U i[l,3 ] Qi[3,3] - - - R37 -

30 Ui[2,3] Q i[3,3] R36 R37 0 R36 U i —1[3,2]

31 Ui[3,3] Qi[3,3] - - - R37 R38

32 Ui[1,1] Pi [1] R36 R37 0 R36 -

33 Ui[1,2] Pi [2] R36 R38 0 R37 R38

34 Ui[2,1] P i[l] R37 R38 0 R36 -

35 Ui[2,2] Pi[2] - -- R37 R38

36 U i[3 ,l] P i[l] R36 R38 0 R36 U i-1 [1,3]

37 Ui[3,21 Pi[2] R36 R37 0 R37 Ui-1[?,3]

38 Ui[1,3] Pi [3] - -- R36 -

39 Ui[2,3] Pi[3] R36 R37 0 R37 Ui-1[3,3]

40 Ui[3,3] Pi[3] - - - R36 R38

41 - - R36 R37 0 R3fi -

42 - - R37 R38 0 R37 R38

43 - - R36 R38 0 R36 -

44 ------R38

45 - - R37 R38 0 - R36

46 ------R37

47 - - R i[l] R36 1 - -

48 - - Ri [2] R37 1 - R38

49 - - Ri [3] R38 1 - -

192 I So ---- R f-itiT

-- - -- Ri-1[2] 51 52 -- - - Ri-1[3]

53 Ui-1[2,3] R1-1C3] -, - - - -

54 Ui-1[3,3] Ri- 1[2] ---

55 Ui-1[3,3] Ri-1C 1] - - - -

56 Ui -1[ 1,3] Ri -1[3] -, - - R36 -

57 Ui-1[1,3] R1-l[2] - - R37 -

58 Ui-1[2,3] Ri -1[ 1] R37 , R36 1 R36 -

59 - --- R37 -

60 - - R37 , R36 1 R36 -

61 - - -, - - R37 B iili

62 -- R37 , R36 1 --

63 - - - _ Bi [2]

64 - - - - -

65 - - - - Bi [31

193 Appendix A.3 : Calculation of the Measurement Parameters

for Jacobian with P = 1.

Tc = ( 9 + 35 + 21 + 13 + 12 ) x N = 90N P_CLK cycles

Tio = 4N P_CLK cycles

ET = Tc + Tio = 94N P_CLK cycles

IR = 1/ET = 1/(94N microsecond)

UP = 100%

CBR = 90N/94N = 96%

RN = 39 (see Appendix A.2)

MC = (Tc + Ti0/2) / N = 92

SCRAM = MC x ( 4 + 6 x "log2 (RN)“ | )

= 92 x 40 = 3.7K bits

Total Memory = 39 x 32 + 3.7K = 5K bits

194 Appendix A.4 : Calculation of the Measurement Parameters

for Jacobian with P = 2.

Tel = { 9 + 35 + 12 ) x N = 56N P_CLK cycles

Tiol = ( 4 + 18 + 6 ) x N = 28N P_CLK cycles

Tc2 = ( 21 + 13 ) x N = 34N P_CLK cycles

Tio2 = ( 18+6+6 )xN= 30N P_CLK cycles

Tid2 (idle) = ( 14 + 6 ) x N = 20N P_CLK cycles

ET1 = Tel + Tiol = 84N P_CLK cycles

ET2 = Tc2 + Tio2 + Tid2 = 84N P_CLK cycles

ET = (13 + 84N) P__CLK cycles

IR = l/max(ETl, ET2) = 1/(84N microsecond)

UP1 = 100%

UP2 = (84N - 20N) / 84N = 76%

UP - (DPI + UP2) / 2 = 88%

SP = ET( P=1) / ET(P=2) = 94N / (13 + 84N) = 94N / 84N = 1.1

CBR = Tc(P=l)/2 x IR = 90N/2 x 1/84N = 54%

RN1 (see Appendix A.2):

Qi[1,1] to Qi[3,3] > 9

Ui[1,1] to Ui[3,3] > 9

Ui-1[ 1,1] to Ui-1[3,3] > 9

temporary register > 3

RN1 = 9 + 9 + 9 + 3 = 30

RN2 (see Appendix A.2) :

Qi[1,1] to Oi[3,3] ...... — > 9

Pi [1] to Pi [31 — > 3

195 R i [ 1] to R i [3] > 3

R i- l[ l] to R1-1C2] > 3

Bi [1] to Bi [3] > 3

temporary register > 3

RN1 =9+3+3+3+3+3=24

RN = max(RNl, RN2) = 30

MCI = (Tel + Tiol/2) / N = 70

MC2 = (Tc2 + Tio2/2 + Tid2) / N = 69

MC = ma x{MC1, MC2) = 70

SCRAM = MC x ( 4 + 6 x 'log^(RN)’ )

= 70 x 34 = 2.4K bits

Total Memory = 30 x 32 + 2.4K = 3.4K bits

196 Appendix A.6 : Calculation of the Measurement Parameters

for Jacobian with P = N.

Tc = 9 + 35 + 21 + 13 = 78 P_CLK cycles

Tio=4+ 18+6+ 18 +6+6+6= 64 P_CLK cycles

Tidl (idle) = 1 P_CLK cycles

ET = 59 x (N-l) + (Tc + Tio + Tid) = (59N + 84) P_CLK cycles

IR = l/(Tc + Tio + Tid) = 1/(143 microsecond)

UP = (ET - Tid) / ET » 142/143 = 100%

SP = ET(P=1) / ET(P=N) = 94N / (59N + 84) = 1.32 for N=7

CBR - Tc(P=l)/N x IR = 90N/N x 1/143 = 63%

RN = 39

MC = (Tc + Ti 0/2) + Tid = 78 + 64/2 + 1 = 111

SCRAM = MC x ( 4 + 6 x p o g ^ R N )- )

= 111 x 40 = 4.5K bits

Total Memory = 39 x 32 + 4.5K = 5.8K bits

197 Appendix A.6 : To find Brl - 1x6

1. Find the 6 determinents of the 6 reduced 5x5 matrices.

a22 a23 a24 a25 a26 a32 a33 a34 a35 a36 B r l[l] = + det { a42 a43 a44 a45 a46 ) a52 a53 a54 a55 a56 a62 a63 a64 a65 a66

a33 a34 a35 a36 a23 a24 a25 a26 a43 a44 a45 a46 ) - a32 x det( a43 aM a45 a46 a53 a 54 a55 a 56 a53 a 54 a55 a56 a63 a64 a65 a66 a63 a 64 a65 a66

a23 a24 a25 a26 a23 a24 a25 a26 a33 a34 a35 a36 ) - a52 x det( a33 a 34 a35 a36 a53 a54 a55 a56 a43 a44 a45 a46 a63 a64 a65 a66 a63 a64 a65 a 66

a23 a 24 a25 a26 = a62 x det( a33 a 34 a35 a36 a43 a44 a45 a46 a53 a54 a55 a56

= + a22 x d(3,4 ,5,6) - a32 x d(2,4,5,6) + a42 x d(2,3,5,6)

- a52 x d(2,3,4,6) + a62 x d(2,3,4,5)

al2 al3 al4 al5 a 16 a32 a33 a34 a35 a36 8 rl[2 ] = - det ( a42 a43 a44 a45 a46 a52 a53 a54 a55 a56 a62 afi3 a64 a65 a66

= - al2 x d(3,4,5,6) + a32 x d(l,4,5,6) - a42 x d(1,3,5,6)

+ a52 x d(1,3,4,6) - a62 x d(l,3,4,5)

198 al2 al3 a 14 al5 al6 a22 a23 a24 a25 a2 6 B rl[3] = + det ( a42 a43 a44 a45 a46 a52 a53 a54 a55 a56 a62 a63 a64 a65 a66

+ * = + al2 x d(2,4,5,6) - a22 x d(l,4,5,6) + a42 j d ( l , 2 , 5 , 6 )

- a52 x d(l,2,4,6) + a62 x d(l,2,4,5)

a 12 al3 al4 al5 alfi a22 a23 a24 a25 a26 Brl[4] = - det ( a32 a33 a34 a35 a36 a52 a53 a54 a55 a56 a62 a63 afi4 a65 afifi

* - al2 x d(2,3,5,6) + a22 x d(l,3,5,6) - a32 ) d( 1,2,5,6)

+ a52 x d(l,2,3,6) - a62 x d{l,2,3,5)

al2 a 13 a 14 a 15 al6 a22 a23 a24 a25 a26 Brl[5] = + det ( a32 a33 a34 a35 a36 a42 a43 a44 a45 a46 a62 a63 a64 a65 a 66

* + ★ + a 12 x d(2,3,4,6) - a22 x d(l,3,4,6) + a32 ) d (1,2,4 ,6)

- a42 x d(l,2,3,6) + a62 x d(1.2.3.4)

al2 a 13 a 14 a 15 a 16 a22 a23 a24 a25 a26 Brl[6] = - det ( a32 a33 a34 a35 a36 a42 a43 a44 a45 a46 a52 a53 a 54 a55 a 56

199 * * * = - a 12 x d(2,3,4,5) + a22 x d(l,3,4,5) - a32 x d(l,2,4,5) * * + a42 x d{1,2,3,5) - a52 x d(l,2,3,4)

* means redundant. The complexity of finding the Brl is equal to that of finding 6 vector inner products with 5x1 plus 15 determinants of 15

4x4 matrices.

2. Find the 15 determinants of the 15 reduced 4x4 matrices :

a33 a34 a35 a36 d(3,4,5,6) = det( a43 a44 a45 a46 ) a53 a54 a55 a56 a63 a64 a65 a66

= + a33 x d(4,5,6) - a43 x d(3,5,6)

+ a53 x d(3,4,6) - a63 x d(3,4,5)

a23 a24 a25 a26 d(2,4,5,6) = det{ a43 a44 a45 a46 ) a53 a54 a55 a56 a63 afi4 a65 a66

* = + a23 x d(4,5,6) - a43 x d(2,5,6)

+ a53 x d(2,4,6) - a63 x d(2,4,5)

a23 a24 a25 a26 d(2,3,5,6) = det( a33 a34 a35 a36 ) a53 a54 a55 a56 a63 a64 a65 a66

= + a23 x d(3,5,6) - a33 x d(2,5,6)

+ a53 x d(2,3,6) - a63 x d(2,3,5)

2 no a23 a24 a25 a26 d(2,3,4, 6) = det( a33 a34 a35 a36 a43 a44 a45 a46 a63 afi4 a65 a66

~ + a23 x d(3,4,6) - a33 > d(2,4,6)

+ a43 x d{2,3,6) a63 ) d(2,3 ,4)

a23 a24 a25 a26 d{2,3,4, 5) - det( a33 a34 a35 a36 a43 a44 a45 a46 a53 a54 a55 a56

= + a23 x d(3,4,6) - a33 ) d (2,4,6)

+ a43 x d(2,3,5) - a53 > d(2,3,4)

al3 a 14 al5 al6 6) = det( a43 a44 a45 a46 ) a53 a54 a55 a56 a63 a64 a65 a66

* = + al3 x d(4,5,6) - a43 d(1,5,6)

+ a53 x d( 1,4,6) - a63 d (1,4.5)

al3 a14 al5 al6 d{ 1,3,5, 6) = det( a33 a34 a35 a36 ) a53 a54 a55 a56 a63 a64 a65 a66

* - + al3 x d(3,5,6) - a33 d(1,5,6)

+ a53 x d( 1,3,6) - a63 d{1,3,5)

2 0 1 al3 a 14 al5 al6 d( 1,3,4,6) = det{ a33 a34 a35 a36 a43 a44 a45 a46 a63 a64 a65 a66

★ * + al3 x d(3*4.6) - a33 x d(l,4,6) * + a43 x d(1,3,6) - a63 x d(l,3,4)

al3 a 14 al5 al6 d(1,3,4,5) = det( a33 a34 a35 a36 a43 a44 a45 a46 a53 a54 a55 a56

* * = + al3 x d(3,4,5) - a33 x d(l,4,5)

■k i t + a43 x d( 1,3,5) - a53 x d (l,3 ,4 )

al3 al4 al5 al6 d( 1,2 ,5,6) « det( a23 a24 a25 a26 a53 a54 a55 a56 a63 a64 a65 a66

= + al3 x d(2,5,6) - a23 x d(l,5,6)

+ a53 x d(1,2,6) - a63 x d{l,2,5)

al3 a 14 al5 al6 d( 1,2,4,6) = det ( a23 a24 a25 a26 a43 a44 a45 a46 a63 a64 a65 afi6

= + al3 x d(2,4,6) - a23 x d{l,4,6) ★ + a43 x d(1,2,6) - a63 x d(l,2,4)

2 0 2 al3 a 14 a 15 alfi d( 1,2,4,5) = det( a23 a24 a25 a26 a43 a44 a45 a46 a53 a54 a55 a56

= + a!3 x d(2,4,5) - a23 x d(l,4,5)

+ a43 x d( 1,2,5) - a53 x d (l,2 ,4 )

al3 a 14 al5 al6 d(1,2,3,6) = det( a23 a24 a25 a26 a33 a34 a35 a36 a63 a64 a65 a66

* * = + al3 x d(2,3,6) - a23 x d(l,3,6) * + a33 x d( 1,2,6) - a63 x d( 1,2,3)

al3 al4 al5 al6 d( 1,2,3,5) = det( a23 a24 a25 a26 a33 a34 a35 a36 a53 a54 a55 a56

★ * = + al3 x d(2,3,6) - a23 x d(l,3,5) * * + a33 x d(1,2,5) - a53 x d(l,2,3)

al3 al4 al5 al6 d( 1,2,3,4) = det( a23 a24 a25 a26 a33 a34 a35 a36 a43 a44 a45 a46

* * = + al3 x d(2,3,4) - a23 x d{l,3,4) * * + a33 x d( 1,2,4) - a43 x d (l,2 ,3 )

The complexity of finding the 15 determinants is equal to that of finding 15 vector inner products with 4x1 plus 20 determinants of 20

3x3 matrices,

203 3. Find the 20 determinants of the 20 reduced 3x3 matrices :

a44 a45 a46 d(4,5,6) = det{ a54 a55 a56 a64 a65 a66

= + a44 x d(5,6) - a54 x d(4,6) + a64 x d{4,5)

a34 a35 a36 d(3,5,6) = det( a54 a55 a56 a64 a65 a66

= + a34 x d(5,6) - a54 x d(3,6) + a64 x d (3,5)

a34 a35 a36 d(3,4,6) = det( a44 a45 a46 a64 a65 a66

= + a34 x d(4,6) a44 x d(3,6) + a64 x d(3,4)

a34 a35 a36 d(3,4,5) = det( a44 a45 a46 a54 a55 a56

* = + a34 x d(4,5) a44 x d{3,5) + a54 x d(3,4)

a24 a25 a26 d(2,5,6) = det( a54 a55 a56 a64 a65 a66

= + a24 x d{5,6) - a54 x d(2,6) + a64 x d(2,5)

204 a24 a25 a26 d(2,4,6) = det( a44 a45 a46 a64 a65 a66

= + a24 x d(4,6) a44 x d(2,6) + a64 x d(2,4)

a24 a25 a26 d( 2,4,5) = det( a44 a45 a46 a54 a55 a56

* * = + a24 x d(4,5) a44 x d(2,5) + a54 x d{2,4)

a24 a25 a26 d(2,3,6) = det( a34 a35 a36 a64 a65 a66

= + a24 x d(3,6) a34 x d(2,6) + a64 x d(2,3)

I a24 a25 a26 d{2,3,5) = det( a34 a35 a36 ) [ a54 a55 a56

* * * = + a24 x d(3,5) - a34 x d{2,5) + a54 x d(2,3)

a24 a25 a26 d(2,3,4) = det( a34 a35 a36 a44 a45 a46

* * + a24 x d(3,4) - a34 x d(2,4) + a44 x d(2,3)

a 14 a 15 a 16 d( 1,5,6) = det( a54 a55 a56 a64 a65 a66

205 ★ = + a 14 x d(5,6) - a54 x d (l,6 ) + a64 x d (l,5)

a 14 al5 al6 d( 1,4,6) = det( a44 a45 a46 a64 a65 a66

* * + al4 x d(4,6) - a44 x d(l,6) + a64 x d(1*4)

al4 al5 al6 d(1,4.5) = det{ a44 a45 a46 a54 a55 a56

* * * + a 14 x d(4,5) - a44 x d{ 1,5) + a54 x d (1,4)

a 14 a 15 al6 d(1,3,6) = det( a34 a35 a36 a64 ab5 a66

* * + a 14 x d(3,6) - a34 x d(1,6) + a64 x d (l,3)

a14 al5 al6 d (l,3 ,5 ) = det( j a34 a35 a36 a54 a55 a56

* * * = + al4 x d(3,5) - a3a x df 1,5) + a54 x d(l,3)

a 14 a 15 a 16 d(1,3,4) = det( a34 a35 a36 a44 a45 a46

= + al4 x d(3,4) - a34 x d(l,4) + a44 x d{l,3)

206 al4 al5 al6 d(1,2,6) = det( a24 a25 a26 afi4 a65 a66

* * + a14 x d(2,6) - a24 x d(l,6) + afi4 x d(1.2)

a14 al5 al6 d( 1,2,5) = det( a24 a25 a26 a54 a55 a56

* + * = + al4 x d(2,5) - a24 x d( 1,5) + a54 x d (l,2 )

al4 al5 al6 d(1,2,4) = det( a24 a25 a26 a44 a45 a46

★ i f + + al4 x d(2,4) - a24 x d{1,4) + a44 x d(1,2)

al4 a 15 al6 d( 1,2,3) = det( a24 a25 a26 a34 a35 a36

= + al4 x d(2,3) - a24 x d(l,3) + a34 x d(l,2)

The complexity of finding the 20 determinants is equal to that of finding 20 vector inner products with 3x1 plus 15 determinants of 15

2x2 matrices.

207 4. The complexity of finding Brl to Br6 can be considered as below, where T means the computation time in terms of PjCLK cycles.

For example, T[ d{5x5) ] represents the computation time needed for computing the determinent of a 5x5 matrix.

Number of P_CLK cycles of finding Rrl to Br6

6 T[ d(5x5) ]

6 T[ V . V ] + 15 T[ d(4x4) 1 -1x5 -5x1

6 Tr V . V ] + 15 T[ V . V ] + 2(1 T[ d(3x3) 1 -1x5 -5x1 -1x4 -4x1

6 T[ V . V ) + 15 T[ V . V ] -1x5 -5x1 -1x4 -4x1

+ 20 T[ V . V ] + 15 T[ d(3x2) ] -1x3 -3x1

[ 5M + 12 ] + [ 4M + 12 1 M=6 ' M=15

+ [ 3M + 10 ] + [ 2M + 7 ] M=20 M=15

221 P C1K cycles

208 Those vector inner product complexities can be found in Appendix A .I.

I f the determinant of a 3x3 matrix is regarded as the basic operations, instead of the determinent of a 2x2 matrix, then

Number of P__CLK cycles of finding Brl to Rr6

6 T[ d(5x5) ]

6 T[ V . V ] + 15 T[ d(4x4) ] -1x5 -5x1

6 T[ V . V ] + 15 TC V . V ] + 20 T[ d(3x3) 1 -1x5 -5x1 -1x4 -4x1

[ 5M + 12 ] + [ 4M + 12 ] + f 12M + 13 ] M=6 M=15 M=20

367 P CLK cycles

209 Appendix A. f : Computation Complexity and Register Required

for Vector Inner Products.

M T(V . V , = 3M + in P_CLK cycels -1x3 -3x1

M T(V . V = 4M + 12 P_CLK cycels -1x4 -4x1

M T(V . V = 5M + 12 PjCLK cycels -1x5 -5x1

M T(V . V = 6M + 14 P_CLK cycels -1x6

M T(V . V = 7M + 10 P CLK cycels -1x7 -7x1

N M RN(V . V ) = 2 + lxN Nxl 2

210 Appendix A.8 : Procedures to Solve the Derivative of Theta and

Calculations of the Measurement Parameters for

Inverse Jacobian with P = 1.

(1) Host sends J to the RP. 6xN T (2) RP computes A = J J 6x6 6xN Nx6

(3) RP computes 36 d(5x5) = B 6x6

(4) RP computes d = d(6x6) = Rrl . Acl - 1x6 -6x1

(5) RP sends d to the host.

(6) Host sends X to the RP. - 6x1

(7) RP computes C = B X - 6x1 6x6 - 6x1

T (8) RP computes theta' = J C -N xl Nx6 - 6x1

(9) Host sends 1/d to the RP.

(10) RP computes theta = th eta1 x 1/d - Nxl - Nxl » (11) RP sends theta to the host. - Nxl

Tio : J — -> 6N x 2 = 12N P CLK cycles 6xN

X > 6 x 2 = 12 P CLK cycles - 6x1

d, 1/d > 2 x 2 = 4 P CLK cycles

211 theta ------> N x 2 = 2N P CLK cycles - Nxl “

So Tio = 12N + 12 + 4 + 2N = 14N + 16 P_CLK cycles

= 114 P_CLK cycles for N=7.

Tc : j (1) T(J J ) = 36 T(V . V ) = [7M + in] 6xN Nx6 - lxN - Nxl M=3fi

= 262 P_CLK cycles for N=7.

(2) T(36 d(5x5) ) = 6 T(6 d(5x5) ) = 6 x 221

= 1326 P_CLK cycles

(3) T(Brl . Acl ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M=1 —

(4) T(B . X ) = 6 T(V . V ) 6x6 - 6x1 - 1x6 - 6x1

= [6M + 14] = 50 P CLK cycles M=6

(5) T(JT . C ) = N T(V . V ) = [6M + 14] Nx6 - 6x1 - 1x6 - 6x1 M=N

= 6N + 14 = 56 P_CLK cycles for N=7,

(6) T(theta x 1/d) - N + 3 = 10 P CLK cycles for N=7. Nxl —

So Tc = 262 + 1326 + 20 + 50 + 56 + 10 = 1724 P_CLK cycles.

ET (Execution Time) :

ET = Tio + Tc = 114 +■ 1724 = 1838 P_CLK cycles

= 1838 microsecond for N=7.

IR ( Ini tiation Rate) :

IR = 1/ET = 1/(1838 microsecond) for N=7.

UP ( U lti1ization of the Processor) :

UP = 100%

2 1 2 CBR (CPU Bound Ratio) :

CBR = Tc/ET = 1724/1B38 = 94%

RN (Register Number) :

J -...... > 6N 6xN

A and B ■> 36 6x6 6x6

■> 15d(2x2), d(4x4), X , d and 1/d ■> 15d(2x2), - 6x1

d(3x3)t d(5x5), C , theta' and theta ------> 20 - 6x1 - Nxl - Nxl

temporary •> 2 + = 6

RN * 6N + 36 + 15 + 20 + 6 - 6N + 77 = 119 for N = 7,

MC (Microcode number) :

MC = T i o/2 + Tc =114/2 + 1724 = 1781 words

SCRAM (Size of COntrol RAM) :

SCRAM = MC x ( 4 + 6 x log^(RN) | )

= 1781 x ( 4 + 6 x 7 ) = B2K bits.

Total Memory = 119 x 32 + 82K = 86K bits for N=7.

213 Appendix A.9 : Procedures to Solve the Derivative of Theta and

Calculations of the Measurement Parameters for

Inverse Jacobi an with P = 6,

(1) Host broadcasts J to RPi, i = 1, ...... 2 6. 6xN

(2) RPi computes Aci = J [J ]ci , where [J ]ci is - 6x1 6xN — Nxl

the ith column of the transpose of J matrix.

(3) RPi broadcasts Aci to RPj, j < > i , i , j = 1, 2, . . . , 6. - 6x1

(4) RPi computes 6 d(5x5) = Bri - 1x6

(5) RPi computes d - d(6x6) = Bri . Aci - 1x6 - 6x1

(6) RPI sends d to the host.

(7) Host broadcasts X to RPi, i = 1, 2, . . . , 6. - 6x1

(8) RPi computes Ci = Bri . X - 1x6 - 6x1

(9) RPi broadcasts Ci to RPj, i < > j , i , j = 1, 2, . . . , 6,

(10) RPi computes [theta*]i = [J ]ri . C 1x6 - 6x1

For i > 6, [theta*Ji are computed in RPI to RP(N-6).

(11) Host broadcasts 1/d to RPi, i = 1, 2, . . . , 6.

(12) RPi computes [theta]i = [theta*]i x 1/d.

For i > 6, [theta]i are computed in RPI to RP(N-6).

214 (13) RPi sends [theta]i to the host.

Tio : ■ > 6N x 2 = 12 N P CLK cycles 6xN

Aci -> 36 x 2 = 72 P CLK cycles - 6x1

■>6x2=1? P CLK cycles 6x1

Ci ■> 6 x 2 - 12 P CLK cycles

d, 1/d ■> 2 x 2 = 4 P CLK cycles

theta ■> N x 2 = 2N P CLK cycles - Nxl

So Tio = 12N + 72 + 12 + 12 + 4 + 2N = 14N + 88 P_CLK cycles = 198 P CLK cycles for N=7.

Tc : y (1) T(J [J ]ci ,) = 6 T(V . V ) = [7M + 1(1] 6xN — Nxl IxN - Nxl M=6

= 52 P_CLK cycles for N=7.

(2) T(6 d(5x5) ) = 221 P_CLK cycles

(3) T(Bri .Aci ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M= 1 —

(4) T(Bri . X ) = [6M + 14] = 2(1 P CLK cycles - 1x6 - 6x1 M=1 —

N (5) T([J ]ri . C ) = [6M + 14] N 6 1x6 - 6x1 6

= 26 P CLK cycles for N=7,

N ( 6 ) T(theta'i x i/d) = [M + 3] 6 M=

5 P CLK cycles for N=7.

215 So Tc = 52 + 221 + 20 + 20 + 26 + 5 = 344 P_CLK cycles.

ET (Execution Time) :

ET = Tio + Tc = 198 + 344 = 542 P_CLK cycles = 542 microsecond for N=7.

IR (Initiation Rate) :

IR = 1/ET = 1/(542 microsecond) for N=7.

UP (U ltiliza tio n of the Processor) :

UP = 100%

SP (SPeed-up) :

SP = ET(P-l) / ET(P=6) = 1838/542 = 3.4

CBR (CPU Bound Ratio) :

CBR = Tc(P=l)/6 x IR = 1724/6 x 1/542 = 53%

RN (Register Number) :

J , A , Bri , d(3x3) and d(5x5) > 6N 6xN 6x6 - 1x6

d(2x2), d(4x4), X , C , theta1 and theta —> 15 - 6x1 - 6x1 - Nxl - Nxl

7 temporary > 2 + = 6 registers 2

RN = 6N + 15 + 6 = 6N + 21 = 63 for N=7.

MC (Microcode number) :

MC = Tio/2 + Tc = 198 / 2 + 344 = 44 3 words

SCRAM (Size of Control RAM) :

SCRAM = MC x { 4 + 6 x log^(RN) I )

= 443 x (4+6x6) = 17.7K bits.

Total Memory = 63 x 32 + 17.7K = 20K bits.

216 Appendix A .10 : Procedures to Solve the Derivative of Theta and

Calculations of the Measurement Parameters for

Inverse Jacobian with P = 12.

(1) Host broadcasts J to RPi and RPi1, i = 1, 2, . . . , 6 6xN T T (2) RPi computes Aci = J [J ]c1 , where [J ]ci is - 6x1 6xN -- Nxl

the ith column of the transpose of J matrix.

(3) RPi broadcasts Aci to the other RPi and RPi’ .

(4) RPi computes 20 d(3x3).

(5) RPi sends 20 d(3x3) to corresponding RPi'.

(6)RPi' computes 6 d(5x5) = Bri - 1x6

(7) RPi1 computes d = d(6x6) = Bri . Aci - 1x6 - 6x1

(8) RPI’ sends d to the host.

(9) Host broadcasts X to RPi', i = 1, 2, . . . , 6. - 6x1 • (10) RPi' computes Ci = Bri . X - 1x6 - 6x1

(11) RPi1 broadcasts Ci to R P j', i < > j, i , j = 1, 2, . . .

T (12) RPi' computes [th e ta1li = [0 ]ri . C 1x6 - 6x1

For i > 6, [th eta1]i are computed in RPI' to RP(N— 6)*.

(13) Host broadcasts 1/d to RPi', i = 1, 2, . . . , 6. (14) RPi' computes [theta]i = [th eta1]i x 1/d.

For i > 6, [theta]i are computed in RPI' to RP(N-6)

(15) RPi1 sends [theta]i to the host.

Tio J ...... > 6N x 2 = 12N P CLK cycles 6xN

Aci ...... > 36 x 2 = 72 P CLK cycles - 6x1 -

20 d(3x3) — ...... > 20 x 2 = 40 P CLK cycles

X > 6 x 2 = 12 P CLK cycles - 6x1 “

Ci > 6 x 2 = 12 P_CLK cycles

d, 1/d > 2 x 2 = 4 P_CLK cycles

theta > N x 2 = 2N P CLK cycles - Nxl “

So Tiou = 12N + 72 + 40 = 12N + 112 = 196 P_CLK cycles.

Tiod = 72 + 40 + 12 + 12 + 4 + 2N = 154 P_CLK cycles.

Tio = Tiou + Tiod - 72 - 40 = 238 P_CLK cycles.

Tcu (Computation time for RPi) :

T (1) T(J [J ]ci ) = 6 T(V . V ) = [7M + 10] 6xN -- Nxl - lxN - Nxl M=6

= 52 P_CLK cycles for N=7.

(2) 20 T( d(3x3) ) = 20 T(V . V ) + 15 T( d(2x2) ) -1x3 -3x1

= [3M + 10] + [2M + 7] M=20 M-15 = 107 P_CLK cycles

218 So Tcu = 52 + 1C7 = 159 P_CLK cycles.

Ted (Computation time for RPi') :

(1) T(6 d{5x5) ) = 6 T(V . Vc ) + 15 T(V . V ) -1x5 -5x1 -1x4 -4x1

+ 2C T( d(3x3) )

= [5H + 121 + [4M + 12] " M=6 M=15

= 114 P__CLK cycles

(2) T(Bri .Aci ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M=1

(3) T(Bri . X ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M= 1 “

N (4) T(CJ ]ri . C ) = [6M + 14] N 6 1x6 -6x1 m= 6

= 26 P_CLK cycles for N=7,

N (5) T(theta'i x i/d) = [M + 3] N 6 M=

= 5 P_CLK cycles for N=7.

So Ted = 114 + 20 + 20 + 26 + 5 = 185 P_CLK cycles.

Tc = Tcu + Tcd= 344 P__CLK cycles.

ET (Execution Time) :

ETu = Tiou + Tcu = 196 + 159 = 355 P_CLK cycles.

ETd = Tiod + Ted = 154 + 185 = 339 P_CLK cycles.

ET = Tio + Tc = 238 + 344 = 582 P_CLK cycles

= 582 microsecond for N=7.

219 IR (In itia tio n Rate) :

IR = l/max(ETu, ETd) = 1/(355 microsecond) for N=7,

UP ( U lt i1ization of the Processor) :

UP = ET/2 x IR = 582/2 x 1/355 = 82%

SP (SPeed-up) :

SP = ET(P=1) / ET(P=12) = 1838/582 = 3.16

CBR (CPU Bound Ratio) :

CBR = Tc(P=1)/12 x IR = 1724/12 x 1/355 = 40%

RN (Register Number) :

J , 3Aci , d(2x2) and d(3x3) ■> 6N 6xN - 6x1

7 temporary •> 2 + regi sters 2

RNu = 6N + 6 = 48 for N=7.

Jci , 3Aci , Bri , X and C -> 30 - 6x1 - 6x1 - 1x6 - 6x1 - 6x1

d(3x3), d(4x4), d(5x5), d , 1/d , theta1 and theta —> 20 - Nxl - Nxl

7 temporary -> 2 + regi sters 2

RNd = 30 + 20 + 6 = 56 for N=7.

So RN = max( RNu, RNd) = 56.

220 MC (Mi crocode number) :

MCu = Tiou/2 + Tcu = 196/2 + 159 = 257

MCd = Tiod/2 + Ted = 154/2 + 185 = 262

MC = max( MCu, MCd) = 262 words

SCRAM (Size of Control RAM) :

SCRAM = MC x ( 4 + 6 x 'log?(RN)' )

= 262 x ( 4 + 6 x 6 ) = 10.5K bits,

Total Memory = 56 x 32 + 10.5K = 12.3K bits.

221 Appendix A .11 : Procedures to Solve the Derivative of Theta and

Calculations of the Measurement Parameters for

Inverse Jacobian with P = 24.

(1) Host broadcasts J to RPia and RPid, i = 1, 2, . . . , 6. 6xN T T (2) RPia computes Aci = J [J ]ci , where [J lei is - 6x1 6xN — Nxl — Nxl

the ith column of the transpose of J matrix.

(3) RPia broadcasts Aci to the other RPs.

(4) RPib computes 20 d(3x3).

(5) RPib sends 20 d(3x3) to corresponding RPic.

(6) RPic computes 6 d(5x5) = Bri - 1x6

(7) RPic sends 6 d(5x5) = Bri to RPid. - 1x6

(8) RPid computes d = d(6x6) = Bri . Aci - 1x6 - 6x1

(9) RPid sends d to the host.

(10) Host broadcasts X to RPid. i = 1, 2, . . . , 6. - 6x1

(11) RPid computes Ci = Bri . X - 1x6 - 6x1

(12) RPid broadcasts Ci to RPjd, i < > j , i, j = 1, 2...... 6.

„ . T (13) RPid computes [th e ta 'li = [J ]ri . C — 1x6 - 6x1

For 1 > 6. [th e ta 'li are computed in RPid to RP(N-6)d.

222 (14) Host broadcasts 1/d to RPid, i = 1, 2, . . . , 6.

(15) RPid computes [theta]i = [th e ta']i x 1/d,

For i > 6, [theta]i are computed in RPid to RP(N-fi)d.

(16) RPid sends [theta]i to the host,

Tio : J ...... > 6N x 2 = 12N P CLK cycles 6xN

Aci------> 36 x 2 = 72 P CLK cycles - 6x1 -

20 d(3x3) > 20 x 2 = 40 P_CLK cycles

6 d(5x5) > 6 x 2 = 12 P_CLK cycles

X > 6 x 2 = 12 P CLK cycles - 6x1 -

Ci ------> 6 x 2 = 12 P_CLK cycles

d, 1/d — ...... > 2 x 2 = 4 P_CLK cycles

theta > N x 2 = 2N P CLK cycles - Nxl

So Tioa = 12N +12 Tida = 60

Tiob = 36 + 40 = 76 Tidb = 36

Tioc = 24 + 40 + 12 = 76 Tide = 48

Tiod =12+12+12 Tidd = 60

+ 4 + 2N + 12 = 66

Tio = 12N + 72 + 40 + 12 + 12 + 12 + 4 + 2N = 14N + 152

= 250 P CLK cycles for N=7.

223 Tea (Computation time for RPia) :

T(J [JT]ci ) = 6 T(V . V ) = [7M + 10] 6xN — Nxl - lxN - Nxl M=6

= 52 P_CLK cycles for N=7.

Tcb (Computation timefor RPib) :

20 T( d(3x3) ) = 20 T(V . V ) + 15 T( d(2x2) ) -1x3 -3x1

= [3M + 10] + [2M + 7] M=20 ' M=15

= 107 P_CLK cycles

Tcc (Computation time for RPic) ;

T(6 d(5x5) ) = 5 T(V . V ) + 15 T(V . V ) -1x5 -5x1 -1x4 -4x1

+ 20 T( d(3x3) )

= [5M + 12] + [4M + 121 M=6 ' M=15

= 114 P_CLK cycles

Ted (Computation time for RPid) :

(1) T(Rri .Aci ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M= 1 —

(2) T(Bri . X ) = [6M + 14] = 20 P CLK cycles - 1x6 - 6x1 M= 1 ”

N (3) T([J ] r i , . C ) - [6M + 14] N 6 1x6 - 6x1 6

26 P CLK cycles for N=7,

224 N (4) T(theta'i x i/d ) = [M + 3] N 6 M= — 6

= 5 P_CLK cycles for N=7.

Ted = 20+20+26 + 5 = 71 P_CLK cycles.

Tc = Tea + Tcb + Tee + Ted = 344 P_CLK cycles.

ET (Execution Time) :

ETa = Tioa + Tida + Tea = (12N + 12) +60+52

= 208 P_CLK cycles.

ETb = Tiob + Tidb + Tcb = 76 + 36 + 107

=219 P_CLK cycles.

ETC = Tioc + Tide + Tee = 76 + 48 + 114

= 238 P_CLK cycles.

ETd = Tiod + Tidd + Ted = 66 + 60 + 71

= 197 P_CLK cycles.

ET = Tio + Tc = 250 + 344 = 594 P_LCK cycles for N=7.

IR (Initiation Rate) :

IR = l/max(ETa, ETb, ETc, ETd) = 1/ETc

= 1 /(238mlcrosecond)

UP (U ltiliza tio n of the Processor) :

UP = ET/4 x IR = 594/4 x 1/238 = 62%

SP (SPeed-up) :

SP = ET(P=1) / ET(P=24) = 1838/594 = 3.1

CBR (CPU Bound Ratio) :

CBR = Tc(P=l)/24 x IR = 1724/24 x 1/238 = 30%

225 RN (Register Number) :

RNa :

j > 6N 6xN

temporary ------> 2 + regi sters

RNa = 6N + 6 = 6 x 7 + 6 = 4R

RNb :

3 Aci > IR - 6x1

d (3x3) > 20

temporary ------> 2 + registers

RNb = 18 + 20 + 6 = 44

RNc :

2Aci > 12 - 6x1

d(3x3), d(5x5) > 20

d(4x4) -> 15

temporary > 2 + registers

RNc = 12 + 20 + 15 + 6 = 63

226 RNd :

Jcl , Aci , X and C ■ > 24 - 6x1 - 6x1 - 6x1 - 6x1

d , 1/d , [theta1]i and [theta]i ------> 4

'7 temporary -> 2 + = 6 regi sters 2

RNc = 24 + 4 + 6 = 34

So RN = max{ RNa, RNb, RNc, RNd) = RNc = 53,

MC (Microcode number) :

MCa = Tioa/2 + Tea = {12N + 12)/2 + 52 = 100 for N=7

MCb = Tiob/2 + Tcb = 76/2 + 107 = 145

MCc = Tioc/2 + Tcc = 76/2 + 114 = 152

MCd = Tiod/2 + Ted - 66/2 + 71 = 104

MC = max( MCa, MCb, MCc, MCd) = 152 words

SCRAM (Size of Control RAM) :

SCRAM = MC X ( 4 + 6 x 'log^(RN) I )

= 152 x ( 4 + 6 x 6 ) = 6.08K bits,

Total Memory = 53 x 32 + 6.08K = 7.78K bits.

227 Appendix A .12 : Microprogram for Forward Recursion of

Inverse Dynamics (one RP per Link)

i-1 i Q1 - . U Ji = J Wi = w i i i

i . i * i * Oi = w Pi = P Si = S i i i

i f* Xi = P Fi s F i i i

Ni = N T'1 = theta T"i = theta i

RF RF RF

RO 0U1.1J R9 JUl.lJ R18 Wi-111J,Wi L1J

R1 Qi [1,2] RIO J i[ l,2 ] R19 Wi - 1[2] ,Wi [2]

R2 Qi[1,3] R11 J i[ l,3 ] R20 Wi -1[3],Wi[3]

R3 Q i[2t l] R12 J i[ 2 ,l] R21 Oi-1[1],Oi[1]

R4 Qi[2,2] R13 J i [2,2] R22 Oi-1[2] ,Oi[2]

R5 Qi[2,3] R14 J i[2 ,3 ] R23 Oi-1[3],Oi[3]

R6 Qi[3,l] = 0 R15 J i [3,1] R24 picn

R 7 Qi[3,2] R16 J i[3 ,2 ] R25 Pi [2]

R8 Qi[3,3] R17 J i [3,3] R26 Pi [3]

228 RF RF RF

R27 SUIT R3F> FiLlj R45 temporary

R28 Si [2] R37 Fi [23 R46 temporary

R29 Si [3] R38 Fi [33 R47 temporary

R30 xi [13 R39 Ni [13 R48 temporary

R31 Xi[2] R40 Ni [2] R49 temporary

R32 Xi[3] R41 Ni [33 R50 temporary

R33 YiCl] R42 T' i R51 temporary

R34 Yi [23 R43 T"i RS2 temporary

R35 Y i [33 R44 mi R53 temporary

R54 temporary

R56 temporary

R56 temporary

R57 temporary

RSfl temporary

RF —> FPM RF — > FAM SUB FPM — > FAM — > (AA) (AB) (AA) (AB) RF RF

1 Qi[2 ,1] , QU3.3] Wi - 1[3], T'i 0 - -

2 01C1.11 , 01[3,23 Oi- 1[33 * T"i 0 --

3 01[2,13 , Qi[3,21 - - --

4 Ql[l, 13 , 01[3,33 -, - - Oi[l ,23 Wi- 1[3]

5 Wi-1[2] , T'i 01[3,13 , Oi[1,23 1 0 i[2,33 Oi - 1[3]

6 Wi-1[ 1] , T"i Oi[3,l3 , Oi[2,33 1 ni[i,33 -

7 -, - -, - - 0i[2,23 -

8 - R45 n i[1,23

229 9 Q iL U J , UMLiJ Oi-iLIJ H4S 0 R46 O ld .3 ]

10 Q1C2.ll , W1-1C2] Oi —1[2] R46 1 - -

11 01[1,2] . W i-l[ 1] -- -- -

12 Oi[2,2] , Wi-1[2] -- - R45 o i-ic il

13 Q1C1.3] , W1-1C1] - - - R46 01-1C2]

14 Qi[2,3] , W1-1C2] R45 R46 0 R45 -

15 Q1C3.11 , Wi- 1[3] --- R46 -

16 01[3,2] , Wi - 1[ 3] R45 R46 0 R45 -

17 Q1C3.3] , Wi-1[3] - - - R46 R47

18 Qi[1,1] , O i- l[ l] R45 R46 0 R45 -

19 Qi[2,1] , Oi —1[2] R45 R47 0 R46 R47

20 QIC 1,21 , 01-1[1] R46 R47 0 R45 -

21 Qi[2,2] , 01-1C2] -- - R46 R47

22 QIC 1,3] , Oi-1C 1] R45 R47 0 R45 Wi [ 1]

23 Qi[2,3] , 01-1[2] R45 R46 0 R46 Wi [2]

24 0 i[3 ,l] , Oi —1[ 3] - - - R45 -

25 Qi[3.2] , Oi —1[3] R45 R46 0 R46 Wi [3]

26 Qi[3.3] , 01 —1[3] -- - R45 R47

27 Wi [2] , P i[3] R45 R46 0 R46 -

28 W1C3] , Pi[2] R46 R47 0 R45 R47

29 Wi [3] , P i[l] R45 R47 0 R46 -

30 Wi [1] , Pi[3] - - - R45 R47

31 Wi [ 1] , P1[2] R46 R47 0 R46 Oi [1]

32 Wi [2] , P i[l] R45 R46 1 R45 Oi [2]

33 Wi [2] , Si[3] -- - R46 -

230 u W L3J Si [2] R45 R46 1 R45 Oi C3]

35 W [3] siCi] - - R4fi R47

36 w Cl] SiC3] R45 R46 1 R45 -

37 w [1] SiC2] -- R46 R48

38 w [2] Si C1] R45 R46 1 R45 -

39 0 C2] Pi[3] - - R46 R49

40 0 [3] PiC2] R45 R46 1 R45 -

41 0 [3] PiCl] - - R46 R50

42 0 [1] PiC3] R45 R46 1 R45 -

43 0 ci] PiC2] - - R46 R51

44 0 C2] PiCl] R45 R46 1 R45 -

45 0 C2] SiC3] - - R46 R52

46 0 C3] Si C2] R45 R46 1 R45 -

47 0 C3] Si C1] -- R46 R53

48 0 CI] SiC3] R45 R46 1 R45 -

49 0 Cl] SiC2] - - R46 R54

50 0 C2] Si Ci] R45 R46 1 R45

51 w C2] R49 - - R46 R55

52 w C3] R48 R45 R46 1 R45 -

53 w C 3] R47 - - R46 R56

54 w [1] R49 R45 R46 1 R45 -

55 w Cl] R48 - - R46 R57

56 w C2] R47 R45 R46 1 R45 - 1 1 1 i ID1 1 r-- I 1

w C2] R52 -- R46 R58

58 VI C3] R51 R45 R46 1 R45 -

231 59 tfTDT1' ft50 - -- R46 R47

60 W i[l] R52 R45 R46 1 R45 -

61 Wi [ 1] R51 R53 R47 0 R46 R48

62 Wi[2] R50 R45 R46 1 R45 -

63 Q i[ l, l] xi-iCi] R54 R48 0 R46 R49

64 Q i[2 ,l] Xi-1[2] R45 R46 1 R45 R47

65 Qi[1*2] Xi-l[l] R55 R49 0 R46 R50

66 Qi[2*2] Xi-1[2] R45 R46 1 R45 R4R

67 Qi[l*3] Xi-l[l] R56 R50 0 R46 R51

68 Oi[2,3] Xi-l[2] R45 R46 0 R45 R49

69 Q i[3 ,l] Xi-1[3] R57 R51 0 R46 R52

70 Oi[3,2] Xi- 1[3] R45 R46 0 R45 R50

71 Qi[3,3] Xi-1[3] R58 R52 0 R46 R58

72 J i [ l , l ] Wi [ 1] R45 R46 0 R45 R51

73 Ji[l,2] Wi [2] R45 R58 0 R46 R58

74 J i[ 2 ,l] Wi [ 1] R46 R58 0 R45 R52

75 J1C2.2] Wi [2] - -- R46 R58

76 J i [3,1] Wi [ 1] R45 R58 0 R45 R53

77 J i[3 ,2 ] Wi [2] R45 R46 0 R46 R54

78 O i [ l , 3] Wi [3] R63 R47 0 R45 -

79 J i[2 ,3 ] Wi [3] R45 R46 0 R46 R55

80 J i[3 ,3 ] Wi [3] R54 R48 0 R45 R58

81 j i [ i , n O i[l] R45 R46 0 R46 X i[l]

82 J i [1,2] Oi [2] R46 R58 0 R45 R58

83 Ji [2,1] Oi [ 1] R45 R58 0 R46 Xi [2]

2 3 2 84 JH2.2J 0U2J R49 0 R45 R58

85 Ji [3,1] Oi [13 R46 R5R 0 R46 R47

86 J1C3.2] Oi [23 R45 R46 0 R45 R48

87 J i [1.3] Oi[33 R50 xi [13 0 R46 Xi [33

88 J i [2,3] Oi [3] R45 R46 0 R45 R49

89 J i [3,3] Oi [3] R51 X i [23 0 R46 R58

90 Wi[2] R49 R45 R46 0 R45 YiCll

91 Wi[3] R48 R45 R58 0 R46 R58

92 Wi [32 R47 R46 R58 0 R45 Yi[2]

93 Wi [ 1] R49 R52 Xi[3] 0 R46 R58

94 Wi [1] R48 R45 R58 0 R45 R53

95 Wi [2] R47 R46 R45 1 R46 R54

96 mi Y i[l] -- - R45 Yi [33

97 mi Yi[2] R46 R45 1 R46 R55

98 mi Yi [3] - - - R45 R47

99 - - R46 R45 1 F i[l] -

100 -- -- - FI [2] R48

101 - - R53 R47 0 Fi[3] -

102 - - R54 R4R 0 - R49

103 - - R55 R49 0 - -

104 ------Ni [ 13

105 ------Ni [23

106 ------Ni [33

233 Appendix A.13 : Microprogram for Backward Recursion of

Inverse Dynamics (one RP per Link)

i -1 Qi = . U Fi Ni = N i

Pi = P Si fi +1 = f 1 + 1

i-1 i i -1 fi = f ni+1 = n ni = n i i+1 1

RF RF

RO Q iL U J R9 FU1J

R1 Q1C1.Z] RIO F1[2]

R2 Q i[l,3 ] R11 Fi[3

R3 o ic z .n R 12 Ni C1]

R4 Qi[2,2] R 13 Ni[2]

R5 Qi[2,3] R 14 Ni [3]

R6 Qi[3,1] = 0 R 15 PiCl]

R7 Qi[3,2] R16 Pi [2]

R8 Qi[3,3] R17 Pi [3]

234 RF RF

R18 SU1J R27 temporary

R19 Si [2] R28 temporary

R20 Si[33 R29 temporary

R21 fi + l[l],fi[l] R30 temporary

R22 fi+l[2],fi[23 R31 temporary

R23 fi+l[3],fi[3] R32 temporary

R24 ni+l[l],ni [1] R33 temporary

R25 ni + l[2] ,ni [2] R34 temporary

R26 ni + l[3] ,n i[3] R35 temporary

RF ~ > Ff>M RF — > FAM StJfe FPM — > FAM --> (AA) (AB) (AA) (AB) RF RF

1 Q1C2.1] , Qi[3,3] Fi [13 , fi + lCl] 0 --

2 QiCl.l] , Qi[3,2] Fi[23 , ft+ l[2 ] 0 - -

3 Q1C2.1] , Qi[3,2] Fi[3] , fi+l[3] 0 - -

4 OiCl.l] , Oi[3,33 Pi[l] , Si[l3 0 Oi[1,23 R27

5 P1[2] , f1 + l[33 0i[3,l] , 0i[l,?3 1 Oi[2,33 R28

6 Pi [3] , fi + l[2] Oi[3,13 , Qi[2,33 1 Oi[1.33 R29

7 Pi[33 , fi + l[ 13 Pi[2] , Si[2] 0 Oi[2,23 R33

8 Pi[ 13 , fi+ l[3 ] Pi[33 , Si[33 0 R30 Oi[l,2]

9 Pi[ 1] , fi + l[2] " » " 0 R31 Qi[2,33

ID Pi [23 , fi + l[ 1] R30 , R31 1 R30 R34

11 Qi[l*l3 » R27 " * * - R31 R35

12 Ql[l,2] , R28 R30 , R31 1 R30 -

13 Qi[2,l3 , R27 - R31 fi+ l[ 13 |

235 H , m WO' R31 1 Wo -

15 Q U3.1] R27 N i[l] fi + lCl] 0 R31 fi+lC2]

16 Q1C3.2] R28 R30 R31 0 R30 -

17 Oi[1,3] R29 Ni [2] fi + lC2] 0 R31 fi + lC3]

18 Ql[2,3] R29 R30 R31 0 R30 NiCl]

19 Qi[3,3] R29 Ni [3] Fi+lC3] 0 R31 R32

20 R34 Fi [31 R30 R31 0 R30 Ni [21

21 R35 Fi[2] R30 R32 0 R31 R32

22 R35 F i[l] R31 R32 0 R30 Ni [31

23 R33 Fi [3] ni + l[ l ] NiCl] 0 R31 R32

24 R33 Fi [2] R30 R32 0 R30 f i c n

25 R34 Fi [1] R31 R30 1 R31 fiC2]

26 - - ni + l[2] NiC2] 0 R30 NiCl]

27 -- R31 R30 1 R31 Fi C31

28 - - ni + l[3] Ni [3] 0 R30 Fi Cl]

29 -- R31 R30 1 - Ni [2]

30 -- N i[l] FiCi] 0 - Fi[2]

31 -- Ni [21 FiC2] 0 - Ni [3]

32 ------FiC3]

33 - - NiC3] Fi [3] 0 - NiCl]

34 Q iC l.l] Ni [ 11 -- - - Ni [21

35 01[1,2] Ni [2] - - -- -

36 Qi[2,1] NiCl] - - -- Ni [3]

37 01[2,2] Ni [2] - - - R30 -

38 Qi[3,1] NiCl] - - - R31 -

236 39 Ql[3,2] Ni[2] R30 R31 R30

40 ] Ni [3] R31Qi[l,3

41 Q1C2.3] Ni [3] R30 R31 R30

42 Qi[3,3] Ni [3] R31

43 R30 R31 0 R30

44 R30 R32 0 R31

45 R32 R30R31

46

47 R30 R32

48

49

50

237 Appendix A.14 : Calculation of the Measurement Parameters

for Inverse Dynamics with P = 1.

Tcf(Forward) = 106N P_CLK cycles

Tcb(Backward) = 50N P_CLK cycles

Tc = Tcf + Tch = 156N P_CLK cycles

Tio = 4N + 2N = 6N PJ1LK cycles

ET = Tc + Tio = 162N PJXK cycles

IR = 1/ET = 1/(162N microsecond)

u p = ino%

CBR = 150N/156N = 96%

RN = 51N + 14 (see Appendix A .12 and A .13) = 371

MC = (Tc + Tio/2) / N = 156

SCRAM = MC x ( 4 + 6 x log^(RN) I )

= 156 x ( 4 + 6 x 9 ) = 9 .IK bits

Total Memory = 371 x 32 + 9 .IK = 21K bits

238 Appendix A .15 : Calculation of the Measurement Parameters

for Inverse Dynamics with P = 2.

Tcl(Forward) = 106N P_CLK cycles

Tc2(8ackward) = 5QN P_CLK cycles

Tc = Tel + Tc2 = 156N P_CLK cycles

Tiol = 4N + 16N = 20N P_CLK cycles

Tio2 = 16N + 2N = 18N P_CLK cycles

ET1 = Tc1 + Tiol = 126 P_CLK cycles

ET2 = Tc2 + Tio2 = 68N P_CLK cycles

ET = ET1 + ET2 - 16N = 178N P_CLK cycles

IR = 1/ET = l/max(ETl, ET2) = 1/(126N microsecond)

UP = ET/2 x IR = 178N/2 x 1/126N = 71%

SP = ET(P=1) / ET(P=2) = 162N / 1/8N = .91

CBR = Tc(P=l)/2 x IR = 156N/2 x 1/126N = 62%

RN1 = 45N + 14 (see Appendix A .12)

RN2 = 27N + 9 (see Appendix A .13)

RN = max(RNl, RN2) = 45N + 14 = 329 for N=7

MCI = (Tel + Tiol/2) / N = 116

MC2 = (Tc2 + Tio2/2) / N = 59

MC = max(MC1, MC2) = 116

SCRAM = MC x ( 4 + 6 x 'log (RN)' )

= 116 x 58 = 6.8K bits

Total Memory = 329 x 32 + 6.8K = 17.3K bits

239 Appendix A ,16 : Calculation of the Measurement Parameters

for Inverse Dynamics with P = N.

For one RP :

Tcf(Forward) = 133 P_CLK cycles

Tiof = 5 6 -4-6-6= 40 P_CLK cycles

Tidf = 11 PJTLK cycles

Tcb(Backward) = 60 -9=51 P_CLK cycles

Tiob = 40 - 4 -6-6 = 24 P__CLK cycles

Tidb = 5 P_CLK cycles

For the whole system :

FT = (40N + 144) + ( 34N + 46) = 74N + 190 P_CLK cycles

=708 P_CLK cycles

IR = 1/ET = 1/(708 microsecond) for N=7

UP = (40 + 144 + 34 + 46) / ET = 37%

SP = ET(P=l) / ET(P=N) = 162N / (74N +■ 190) = 1, 6 for N=7

CBR = Tc(P=l)/N x IR = 156N/N x 1/708 = 22%

RN = 51 + 14 = 65 (see Appendix A, 12 and A. 13)

MCf = (Tcf + Tiof/2 + Tidf) = 164

MCb = (Tcb + Tiob/2 + Tidb) = 68

MC = MCf + MCb = 232

SCRAM = MC x ( 4 + 6 x 'log^RN) | )

= 232 x 46 = 10.7K bits

Total Memory = 65 x 32 + 10.7K = 12,8K bits

240 Appendix A .17 : Calculation of the Measurement Parameters

for Inverse Dynamics with P = 2N.

For one RP :

Tcf(Forward) = 133 P_CLK cycles

Tiof = 56 P_CLK cycles

Tidf = 11 P_CLK cycles

Tch(Backward) = 60 P_CLK cycles

Tiob = 40 P_CLK cycles

Tidb = 5 P_CLK cycles

For the whole system :

ET = (40N + 160) + (34N + 71) = 74N + 231 P_CLK cycles

= 749 P_CLK cycles

IR = 1/(40 + 160) = 1/(200 microsecond) for N=7

UP = (40 + 160 + 34 + 71)/2 x IR = 76%

SP = ET( P=1) / ET(P=2N) = 162N / (74N + 231) = 1.5 for N=7

CBR = Tc(P=l)/2N x IR = 156N/2N x 1/200 = 39%

RNi = 45 + 14 = 59 (see Appendix A .12)

RNi' = 27 + 9 = 36 (see Appendix A .13)

RN = max( RNi. RNi' ) = 59

MCf = (Tcf + Tiof/2 + Tidf) = 172

MCb = (Tcb + Tiob/2 + Tidb) = 85

MC = max(MCf, MCb) = 172

SCRAM = MC x ( 4 + 6 x "log^(RN) I )

= 172 x 40 = 6.9K bits

Total Memory = 59 x 32 + 6.9K = 8.8K bits

241 Appendix B.l : Detailed Circuit Descriptions for Two-Phase

Generators (TPG) and Two Johnson Counters, JCNTR

and JCNTF,

The logic diagram for the TPG is shown in figure B .l. The

Johnson counters, JCNTR and JCNTF, shown in figure B.2, are

implemented using PLAs. Both have the same state diagram but they are

driven by different clock phases.

242 2

t> Superbuffer Two Phase Clock Clock Generator Phase (TPG) Two Fi gure Fi L < o CLK

243 PLA PLA

r2/2 2/2 1/2

Jra Jfa Jfb

State/Ja,Jb

BT

Figure B.2 State Diagrams for JCNTR and JCNTF

244 Appendix B.2 : Detailed Procedures for Loading Microprogram and

Circuit Designs for the Synchronization Controller and

Bootstrap Controller (SC+BTC).

Figure B.3 shows the timing of microprogram loading. The loading procedures are started by the BT signal sent from the host. The conuter

(CNT) in the BU is reset to zero by the CLR signal at the beginning and the signal LD is asserted until the loading is complete, causing the output of the CNT to be selected as the address of the CRAM. The Host

Write signal (HWR) is synchronized by the Synchronization Controller

(SC), which generates the WR signal with a pulse width equal to one

SYS_CLK period. The WR signal is used as an input signal to the BTC.

Whenever a HWR is sent from the host, a WR is generated by the SC.

Consequently, three loading signals, LDO, LD1 and LD2, are generated following the WR signals. These loading signals are used to latch the

16-bit data sent from the host. The latch, made of dynamic registers needs to be refreshed by To avoid the conflict between the loading signals and the <^, the LDi* (i = 0, 1, 2) signals obtained by logically

ANDing LDi and $ are used as real loading signals to the latch. As soon as the three 16-bit words are latched, the Write Enable signal

(WEN) is generated. This allows the microinstruction to be written into the CRAM. After that, the CNT is incremented by one with the Increase signal (INC). When microprogram loading is complete, the host computer acknowledges the BTC by issuing the Load Complete signal (LC). The LD signal is then unasserted. Consequently, the address of the CRAM is produced by the Sequencer (SEQ). When the LD signal is asserted to

245 Figure Figure R.3 Timing for Microprogram Loading

246 high, the counter in the SEQ is reset to zero and the SFQ is disabled; when the LO is asserted to low, the SEQ is enabled.

The Sychronization Controller, shown in figure R.4, is implemented

using a PLA. The BTC is also made using a PLA and its state diagram is

shown in figure R.5. The RTC's input signals are gated by *1 and its

output signals by *2. The HWR signal can not be used as a direct input

signal to the RTC because the BTC is a level-trigger PLA and because the

HWR signal may stay high for more than three SYS_CLK periods causing the

BTC regard it as anotht?' write signal from the host computer. For

example, if the HWR stays high for three SYS_CLK periods, the RTC will

probably change from state SO to S3 and write the data into two latches,

which results in an error.

247 PLA

HWR

WR

State/WR BT

HWR HWR

HWR

HWR

Figure B.4 Synchronization Controller for HWR

248 0100100 OlOOOlO WR 0100000 0100001 0101000 0100000 WR State/CLR, State/CLR, LD, LD1, LDO, INC LD2, WEN, 0100000 0000000 Figure Figure B.5 State of Diagram the Bootstrap Controller (BTC) 0110000 1100000

249 Appendix B.3 : Detailed Circuit Designs for the Four Format

Converters, FCE, FCS, FCW, and FCN.

Figure B.6 shows the detailed circuits for each of the four format

converters. The data sent out from the east and south sides of one RP

flows into the west and north sides of the adjacent RPs. There are

drivers at the output ports to increase the driving capability and so

decrease the transmission time between the RPs. The four I/O operation enable bits in the microinstruction for EE, EW, ES, and EN are four

enable bits in the microinstruction for I/O operations. The data to be transmitted can flow out at the east and south sides at the same time,

and the data received from the west and north sides can be stored in the

RF.

During the P<^, the data to be transferred to the adjacent RP is

read and stored in the format converter. Due to the high capacitance

outside the chip and the conversion from 32-bit data to two 16-bit words, two P_CLK cycles are required to transfer the high and low part

of the 32-bit data to the other RPs. The received data collected by the

format converter is stored in the RF in the P

be seen that the permissible time for data transferred between RPs is

about a half of the P_CLK, i.e . about 500 ns, where "H" means the high

part of a 32-bit data and "L" means the low part.

250 II IO *E W * 11 * J r a 2 b' ih

H > ° -

IO*Jrbl I0*Jfa2 —*BC i IO*ES*J fdl I— l_ ;|— 10*EN 11 *Jrb2 f '

BB I I — L J [ } 0 - T T 10*Jrb2 IO *Jfa1

ECS FCN

Figure R.6 Circuit Diagram of the FCE, FCW, FCS and FCN iCZDC or g(L) ) l ( c ______------, ______! or g(H)

Time Time for Data Transferred RPs Between ______i i I — c(H) a a or e b(H) b(H) or f(H) X btL) or f tL) / > Permissible ( Between RPs RPs Between Figure Figure B.7 Timing for RP's Data between Passed — Time for Data ' ' Transmission I I j Transmission i 1 1 [ Permissible l( I I Transferred . . I I <( b"(H) or f'(H ) ^ b '(l) or f'CL) | I I I i i i I < I I I I I I r i ______Jfb Jfa Jfa Jrb Jrb Jra _Jra J p*j —

252 Appendix B.4 : Data Flow in the Data Path for Normal Arithmetic

Operati ons.

Figure B.8 shows the data path for normal arithmetic operations.

There are three pipeline stages in each Floating Point Adder (FPA) and

Floating Point M ultiplier (FPM). During the firs t half cycle of the

the firs t two operands are read onto the Bus A (BA) and Bus B (RB), and are latched at the falling edge of the Jfal. The second set of operands are read onto the BA and BB at the second half cycle of the Pd>^ and latched at the falling edge of the Jfb l. Roth BA and BB are precharged during the P ^ .

The results are stored into the RF through the BC during P ^ . The result of the FPM is put onto the Rus C (BC) at the fir s t half cycle of the P (Jra2), while the result of the FPA is put onto the RC at the second half cycle of the P$ (Jrb2), The detailed data flow timings on the nodes of a, b, c, . . . 1, m, are shown in figure B.9.

There are two pipeline registers for each stage. This provides

LSSD testab ility and longer permissible computation time for each stage.

The timing diagram in figure B.9 shows that the permissible time for each stage is about one P_CLK period. If only one register is used and the clock is alternated by P ^ and P ^ * then the permissible computation time for each stage is about one half of the P_CLK period.

When operands are being read from the RF, their addresses are put onto Address A (AA) and Address B (AB) prior to the operands available on the BA and BB, because it takes a certain amount of time (access time) to decode the address. Therefore, the addresses are latched onto

253 \/ \f 1 Jfal I I----- Jfb l j l H ] P*2

+1

P41! ]------

♦5 P$0

+2

AA : Address bus A AB : Address bus B Pf P(t>, : Pipeline

T0*Jra2 I0*Jrb2

Figure B.8 Three Pipelined Stages in the FPM and FPA p$. “L

J f a l Jra2- JL

Jfbl Jrb2 • JL CD 1 i ' C >

Permissible 1 b— Computa------tion time for Stage 1 J LXD

• C

L Perm issible1/ r Computa- *l\ tion time for Stage 2 CD C >

Permissible \/~ K omputa- n\. t i o n tim e for Stage 3 C c

Figure B.9 Timing for the Data Through the Pipelined Stages

255 the AA or AB at the falling edge of the Jra or Jrb, while the operands are latched in the input registers of two arithmetic units at the fallin g edge of the Jfa or Jfb. Also, the Write signal (WR) tothe RF w ill not be asserted until the address has been decoded and the data is stable on the BC, Figure B.10 shows how the the addresses are gated to the AA and AB and how the WR is generated.

256 A0DR3 ADDR5 A0DR6 A0DR3 ADDR6 ADDR1 ADDR4

IO*EW | r IO*EN IO*EE j r I0*Jra2 —|F I0*Jrb2 IO*Jral —|P IO*Jrbl *Jral "1L *Jra2 “ 1 L *J rb2 H t

JL \ t_ AA

rs> AB uT

IO*ES*Jral HI IO*Jral IO*JrblHL

ADDR4 ADDR2 ADDR5

WR = 10 (WM * Jfa2 * + VJA * Jfb2 *

+ 10 (EW * Jfa2 * + ^ * ^^2 * $\f2)

Figure R. 10 AA, AB and WR for the Register File Appendix B.5 : Circuit Design of the Zero Checking Unit.

This unit is made by a programable logic array (PLA) which can be generated by TPLA, one of the VLSI CAO tools. TPLA accepts the truth table and then generates the PLA layout automatically. The truth table can be generated by EONTOTT package by supplying it with logic equations. The logic equation of the zero checking unit is contained in the f ile fa_zero_ch.eqn, listed in the following, where the means logic OR. Input Bi_EX (i = 7, 6, 5, .. , 0) means the ith bit of the exponent; output EXP_NE0 means "exponent not equal to zero". EX can be the exponent of operand A or operand B.

INORDER = B7_EX B6_EX B5_EX B4_EX R3_EX B2_EX R1_EX R0_EX;

OUTORDER = EXP_NE0;

EXP_NE0 = B7_EX ) B6_EX | B5_EX | B4_EX | B3_EX [ B2_EX |

B 1_EX j BO E X ;

After the file is created, the PLA layout can be generated immediately by issuing the command :

% eqntott -R fa_zero_ch.eqn | tpla -s Btrans -o fa_zero_ch.ca where the -R option forces eqntott to produce a truth table with no redundant minterms. Following the option of -s is the argument specifying the style of PLA, Btrans argument makes a PLA with buried contacts, NMOS and trans version (inputs and outputs on opposite side of the PLA). Following the option of -o is the name of the output f ile . And the "|" is a pipe in the UNIX, which causes the output of the command eqntott to be sent as input to the command tpla. Since the minterm in the above logic equation has only one variable, the

258 input variables can be feed directly into the DR plane of the PLA.

So, the input inverters and the AND plane of the PLA can be eliminated and more than half of the chip area can be saved.

259 Appendix B.6 : Circuit Design of the Sign Unit.

For the operation b it, OP, and the sign bits of the two operands, there are 8 combinations but only 4 possible effective operations, shown in table B .l, SUB = 0 means mantissa addition;

SUB = 1 means mantissa subtraction.

Table B.l

Truth Table for Generating Effective

Operation Bits {EOPO E0P1) and SUB

OP SA SB Effective operation EOPO E0P1 SUB

0 0 0 + A + R 0 0 0 0 0 1 + A - B 0 1 1 0 1 0 - A + B 1 0 1 0 1 1 - A - B 1 1 0 1 0 0 + A - B 0 1 1 1 0 1 + A + B 0 0 0 1 1 0 - A - B 1 1 0 1 1 1 - A + B 1 0 1

The final sign b it, SR, is determined by the effective operation, which is represented by EOPO and E0P1, and the B_GT_A

signal. The truth table is shown in table B.2.

260 Table B.2

Truth Table for Generating the

Final Sign Bit of the Result

Effective operation j EOPO E0P1 j B_GT_A | SR

+ A + B 0 0 0 0 + A + R 0 0 1 0 + A - B 0 1 0 0 + A - R 0 1 1 1 - A + B 1 0 0 1 - A + B 1 0 1 0 - A - B 1 1 0 1 - A- B 1 1 1 1

The logic equation of SR is contained in the f ile fa__sign.eqn, which is listed in the following.

#define xor(a, b) ( (a & !b) | (la A b) )

^define EOPO (SA)

#define E0P1 ( xor(OP, SB) )

INORDER = OP SA SB BJ5T_A;

OUTORDER = SR SUB;

SUB = xor(EOPO, E0P1) ;

SR = (EOPO & !R GT A) | (E0P1 A R_GT_A); where is logic NOT and " V is logic AND. As with the above, the

PLA can be easily obtained by issuing EONTOTT and TPLA commands.

261 Appendix B.7 : Circuit Design of the Alignment Control Unit.

Figure B .ll shows the block diagram of the alignment control unit. It contains two 8-bit adder/subtractors (add_sub_8.ca). Since their SUB_OP and carry-in are tied to high, both are doing substraction. Rut since only the positive result is meaningful, one of the outputs from the two subtractors is selected. The carry-out from the le ft subtractor, EB_GT_EA, is asserted to low, when the exponent of B is greater than the exponent of A and so the result of

EB substracting EA is selected by the multiplexor (mux_8.ca).

Because there are, at most, 24 bits to the right-shifter, if the shift amount is greater than or equal to 24, the output of the right-shifter is zero. The block, fa_ge__24 ,ca, will assert the signal

FZ_R ( force to zero for right_shifter). Its logic equation is stored in the f ile fa_ge_24.eqn as follows :

INORDER = B7_EE B6_EE B5_EE R4_EE B3_EE;

OUTORDER = FZ_R;

FZ_R = B7_EE | B6_EE | B5_EE | (R4_EE S B3_EE) ;

EE is the exponent difference of the operand A and operand R. Notice that only 5 bits of the exponent difference need to be sent to the rig h t-sh ifter. According to the input MR_GT_MA (mantissa of A greater than mantissa of B, low asserted) and the two carry-outs from the two exponent subtractors, the alignment control unit also generates the signal B_GT_A. The logic equation of generating the B_GT_A is stored in the f ile fa comp.ca as follows :

262 SUB-OP SJB-OP

add sub S.ca add sub S.ca Tea-Fb ) Teb-F a )

EA G (MB GT MA)

INI I NO

mux 8 .c a fa_com p.ca

OUT

EB GT EA

AMOUNT R

fa_ge_24.ca

■> FZ_R

Figure R .ll Block Diagram of the Alignment Control Unit

263 INORDER = EB_GT_EA EA_GT_EB MB__MA;

OUTORDER = B__GT_A;

B_GT_A = !EB_GT_EA | (EA_GT_EB & !MB_GT_MA) ;

Because all three input signals are negative active, while the PLA inputs are regarded as positive active, the varibles shown in the above equation are complemented.

264 Appendix B.8 : Circuit Design of the 24-bit Shifter,

The 24-bit shifter is made of a barrel shifter, A barrel shifter is basically made of a number of multiplexors. For example, a

4-bit barrel right shifter is shown in figure B.12. From those equations, it can be seen that each output is operationally equivalent to a four-input multiplexer with the inputs connected so that the select signal generates successive one-bit shifts of the input data word. It is known that a multiplexor can be implemented easily with pass transistors in NMOS technology. One example of a 4-h it barrel shifter is shown in [1, pl59] and its layout is shown in PLATE 13 of

[1], We can see that for a 4-bit barrel shifter, 4 x 4 = 16 pass transistors are needed.

There are many possible schemes to make a 24-bit barrel shifter.

Scheme 1) : one level. Then the number of pass transistors needed

is 24 x 24 = 576.

Scheme 2) : two levels. The first level has 12 2-bit barrel shifters

and the second level has 2 12-bit barrel shifters. Then

the number of pass transistors is (2 x 2) x 12 + (12 x 12)

x 2 = 336.

Scheme 3) : two levels. The first level has 8 3-bit barrel shifters

and the second level has 3 R-bit barrel shifters. Then

the number of pass transistors is (3 x 3) x R + (R x 8) x 3

= 264.

265 BO B1 B2 AO A1 A2 A3

barrel 4 r.ca o I I o (SHO & AO) | (SHI (Sh'2 & Bl) (SH3 & BO) o I I 1 — ( (SHO & Al) | (SHI (SH2 & B2) (SH3 & Bl) O I I M C (SHO & A2) | (SHI (SH2 & AO) (SH3 & B2) II o (SHO & A3) | (SHI (SH2 & Al) (SH3 & AO) U)

Figure B.12 4-bit Barrel Right-Shifter

266 Scheme 4) : two levels. The firs t level has 6 4-bit barrel shifters

and the second level has 4 6-bit barrel shifters. Then

the number of pass transistors is (4 x 4) x 6 + (6 x 6) x 4

= 240.

Obviously, scheme 4 has the minimum number of pass transistors and so it is used in the following right/left shifter. The connections between the 4-bit barrel shifters and 6-bit barrel shifters are shown in figure B.13, In the 6-bit barrel shifter, the 5 le ft most bits of the 11 inputs are connected to ground for right shifting while the 5 right most bits are connected to ground for left shifting. One inverter is added to each output of the 4-bit and 6-bit barrel shifter to reduce the large parasitic capacitance introduced by the long connecting wire between the firs t level and the second level barrel shifters. Figure B.13 shows a 24-bit barrel right-shifter whose le ft most 3 bits are connected to ground. Roth the 4-bit and

6-bit barrel shifter are shifting to the right. A 24-bit barrel le ft-s h ifte r can be easily obtained by just flipping the 24-bit barrel right-shifter and connecting the 3 right most bits to ground, while keeping the order of the input, output signals, and shifting control signals unchanged.

Figure R.14 shows the 24-bit rig h t/le ft shifter used in the pre/post normalization. The signal FZ_R/FZ_L forces the ?4-bit barrel shifter to have zero output. As FZ_R/FZ_L is asserted, the shift amount equal to 24 is selected through the multiplexor {mux_5.cal and that results in the 24-bit shifter having an zero output.

267 I20"I23 ^16'-I19 I12',I15 !8 ■' !8 ■' Jll " J7 U barrel-6-r.ca U •• I- SH5 SHt SH3 Figure Figure B.13 of Block Diagram the 24-bit Barrel Right-Shifter d ec-2- S1 4,ca St 52 S3 S4

26ft 69Z (AMOUNT L) AMOUNTJ F L){FZ FZ_R iue .4 lc Darm f h 2-i Rgt Lf) Shifter (Left) Right 24-bit the Block Diagram of B.14 Figure

"11000” INI I NO mux G.ca s S<4:0> arl 4 l).ca r 24 barrel I <23:0> 0<23:0- Appendix B.9 : Detailed Explanation for the Postnormalization.

I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form :

RSH bO b (-l) M -2 ) b{-3) bf-22) b(-23)

1 X * X XXXX where X can be 1 or 0, it is shifted right by one bit so the form

becomes :

rO j r( -T) r(-20 r(-3) ...... j r(-22) r( -23)

1 . 1 bO b (-l) b(-2) ...... 1 b(-21) b(-21)

Because the left-shifter cannot do right shifting, a multiplexor

(mux_23.ca) is used to select the appropriate bits of the output from

the mantissa adder/subtractor, add_sub_24.ca, to achieve the right

shifting purpose. The implicit leading one is not to be sent to the

output register; only r(-l) to r(-23), or b(0) to b(-22), are passed

through the multiplexor and stored in the ouput register. At the same

time, because of the one-bit right shifting, the common exponent, the

exponent of the larger operand, should be incremented by one by

inverting the RSH to the SUB_0P of the exponent update unit,

add_sub_8.ca, tying the carry-in, C -l, to 5V, and setting one of the

add_sub_8.ca inputs to zero, i.e . setting the output from the leading

zero detector to zero. Since RSH = 1, SUB__0P becomes 0 and so

addition is executed. The ouput from the leading zero detector is

the lower five bits of the B operand in the exponent update unit. To

270 have zero output from the leading zero detector, the RSH signal needs to be sent to the leading zero detector. When the RSH is asserted, the shift amount is zero.

I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form :

RSH bO b(-l) b(-2) TIT • • • • • b(-22) "R-2'3")

0 1 • XXX XX no shifting is needed, so the shift amount is zero. The 23 bits, from b (-l) to b(-23), are passed through the le ft-s h ifte r without changing and latched at the output register. Since RSH = 0, SUB_0P becomes 1 and so subtraction is executed. The exponent update unit subtracts zero from the common exponent by tying carry-in to 5V and setting the subtrahend to zero, and resulting in the common exponent remaining unchanged. Since the RSH is 0, the output from the le ft-s h ifte r, without being shifted, is selected in the multiplexor (mux_23.ca).

I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form :

RSH bO b( -1) b(-?) b(-3) » • I » • ~bT-T2) b(-23)

0 0 * 0 1 X X X it is shifted le ft by two bits, and the output form becomes :

rO rr-1) r(-2J rf-T )'' ■Vl-HT) r(-23)

1 • b(-3) b<-4) b(-5) 0 0

2 7 1 At the same time, the common exponent is subtracted by two by having

SUB_0P set to 1 (so subtraction is executed), carry_in set to 1, and R operand set to two.

I f the output of the mantissa adder/subtractor, add_suh_24.ca,

has the form :

RSH b t) b(-trb'f-27 b<-3) b ( - » ) b(-M )

0 0 • 0 0 0 0 0

the result is regarded as zero. The shift amount equal to 24 will

force both le ft-sh ifter output and common exponent to zero.

272 Appendix B.10 : Circuit Design of the Leading Zero Detector.

The leading zero detector is made of a PLA. Its truth table is shown in table R.3 and stored in the file fa_zero_de.tbl. Its layout is stored in the f ile fa_zero_de.ca. Notice that the truth table is very similar to that of a priority encoder, where RSH and bO have the highest priority and b it(-23) has the lowest. means don't care.

Table R.3

Truth Table of the Leading Zero Detector

RSH b O b (- 1) b(. 23) A MO I N T__L

1 • 0 0 0 0 0 0 1 0 0 0 D 0 0 0 1 0 0 n 0 1 0 0 0 1 0 n 0 1 0 0 0 D 0 1 0 0 0 1 1 0 0 0 0 0 I 0 0 1 0 n 0 0 0 0 0 0 1 0 0 1 n 1 0 0 0 n 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 I n 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 n 0 1 0 0 0 0 0 0 G 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 n 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 I 1 0 0 0 0 0 0 0 0 0 0 00000001 1D000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 I I 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 D 0 0 0 0 0 0 0 0 0 0 1 --- - 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 n 0 1 * -- 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -- 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 n 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 n 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 n 0 0 0 0 0 0 n 0 0 0 0 0 0 0 0 0 0 0 n 1 1 0 n 0

273 Appendix B .ll : Circuit Design of the Overflow/Underflow Unit.

Figure B.15 shows the block diagram of the over/underflow unit.

I t contains overflow block (fa_ovf.ca), underflow block (fa_udf.ca) and fa_eq_24.ca, which 1s used to test whether or not the le ft shift amount is equal to 24. I f the mantissa of the final result is zero, the signal MC_EQ_0 is asserted, which causes 20UT to be asserted. And this w ill force the final mantissa to zero. The block fa_eq_24.ca is made of a PLA and its logic function is stored in the f ile fa_eq_24.eqn, listed as below :

INORDER = B4_AM B3_AM B2_AM B1_AM B0_AM;

OUTORDER = MC_EQ_0;

MC_EQ_0 = B4_AM & B3_AM & !B2_AM A !B1_AM & !B0_AM;

Bi_AM {i=4, 3, 2, 1, 0) is the 1th bit of the AM0UNT_L, output of the leading zero detector. Overflow can occur only if the common exponent has the largest possible biased value (254) and the exponent update unit is incremented by one due to mantissa overflow. So the overflow flag, OVF, is asserted when the final exponent is equal to

255 and mantissa adder/subtractor is doing addition. Notice that the

final exponent can have the value 255 too for subtraction. For example, the common exponent before updating is 2 and the le ft shift

amout is 3. Then the output of the exponent update unit is 255, but

it is underflow, not overflow. The overflow block, fa_ovf.ca, also

tests whether or not the final exponent is zero. I f it is, the signal

ED_EQ_0 is asserted and causes the ZOUT to be asserted. The overflow

block is made of a PLA; its logic equations is stored in the file

274 ttC EQ 0 EQ ttC AMOU fa fa udf.ca fa_eq_24.ca B7 B7 ED EDJQ EDJQ 0 RSH OVF OVF UDF ZOUT fa fa ovf.ca Figure Figure B.15 of Diagram Block the Over/Underflow Unit SUB B7 B7 EC

275 fa_ovf.eqn, listed as below :

INORDER = SUB B7_ED B6_ED B5_ED B4_ED B3_ED B2_ED B1_ED BD_ED;

OUTORDER = OVE ED_EQ_0;

OVF = '.SUB j B7_ED | B6_ED ) B5_ED | B4_ED | R3_ED 1 R2_ED

| BI_ED | BO_ED;

EO_EQ_° = !B7_ED & 1B6 ED & !B5_ED A !B4_ED A !B3_ED A !B2_ED

A !B1_ED A !B0_ED;

ED is the common exponent after being updated. The overflow can be handled in two ways. The common way is simply to stop the computation immediately and interrupt the host computer. The second way would be to reset the result to the largest representable value, and allow the computation to continue without interrupting the host computer.

Underflow happens only when the effective operation is subtraction, and when the final biased exponent value is less than zero, i.e . when the number of leading zeros is greater than common exponent value. In tu itiv e ly , this happens when the two operands are very small and very close. Another case, when the final exponent is zero but the final mantissa is not equal to zero, is regarded as underflow too. The underflow block is made of a PLA too; its logic equations is stored in the file fajudf.eqn, listed as follows :

#define udf ( (!B7_EC A B7_ED A SIJB) | (ED_EQ_0 8 !MC_E0_D) )

INORDER = B7_EC B7_ED SUB ED_EQ_0 MC_E0_0 RSH;

OUTORDER = UDF ZOUT;

UDF = udf;

ZOUT = udf j (MC_EQ_0& !RSH);

R7_EC is the 7th bit of the EC, the common exponent before being

276 updated. The firs t minterm of the UDF shows the occurrence of negative final exponent. It happens when the effective operation is subtraction (SUB = 1), and the MSB of the the common exponent is changed from 0 to 1 in the final exponent. The second mlnterm represents the case when the final exponent is equal to zero but the final mantissa is not. ZOUT is asserted when underflow occurs or the final mantissa is equal to zero. To avoid the final mantissa as

10.000...000 being regarded as zero mantissa, the signal RSH is used.

Whenever ZOUT is asserted, the final exponent and mantissa are both forced to zero.

277 Appendix B.12 : Detailed Circuit Design of the Zero Checking Unit.

The zero checking unit (fm_exp_eq0.xa) examines the exponent of the operand. If the exponent is zero, the signal EXPI_EQ0 is asserted. As a result, EXP_EQG is asserted and then ZOUT too, which w ill force the value of the final product to zero. The unit is made of a PLA. Its logic function is stored in the f ile fm_exp_eq0.eqn, listed below :

INORDER = B7_EX B6_EX R5_EX R4_EX B3_EX B2_EX R1_EX R0_EX;

OUTORDER * EXPI_EQ0 ;

EXPI_E00 = IB7 EX & !B6_EX A IB5_EX A !B4_EX A !R3_EX A !R2_EX

A ! B1_E X A !BO E X;

EX can be the exponent of the operand A or B. Notice that whether or not the exponent is zero, the MSB of the 24-bit mantissa is always one. The reason for this is that i f one of the exponents is zero, the result is always forced to zero by ZOUT regardless the mantissa.

278 Appendix B.13 : Detailed Circuit Design of the Over/Underflow Unit.

Overflow may occur if two large operands are multiplied together, while underflow may occur if two small operands are multiplied together. From the possible multiplication cases in figure

B.16, the condition for overflow and underflow can be determined.

Bi ER (i = 9, 8) is the ith bit of the biased exponent of the result.

The block diagram of the over/underflow unit is shown in figure

B.17. The zero checking unit ( fm_exp_eqO,ca) is the same as that in stage one. The ER = 255 is detected by fm_exp_255.ca, which is made of a PLA. Its logic equation is stored in the f ile fm_exp_255.eqn, listed below :

INORDER = B7_ER B6_ER B5_ER B4_ER B3_ER B2_ER B1_ER B0_ER;

OUTORDER = ER_255 ;

ER _255 = B7_ER 4 B6_ER 4 B5_ER 4 R4_ER 4 B3_ER 4 B2_ER

4 B1_ER 4 BO_ER;

Overflow, underflow and force-to-zero are implemented in the

fm_oy_ud.ca, which is made of a PLA too. It logic functions are

stored in the f ile fm_ov_ud.eqn, listed as the follows :

^define udf ( (!B9_ER 4 B8_ER) | (B9 ER 4 !B8_ER 4 ER_EQ_0) )

INORDER = EXP_EQ0 R9_ER B8_ER ER_255 ER_EQ_0;

OUTORDER = OVF UDF ZOUT;

OVF = (B9_ER 4 B8_ER) | (B9 _ER 4 !R8_ER 4 ER_255);

UDF = udf;

ZOUT = EXP_EQ0 | udf;

The logic equation for the OVF is obtained according to case 2 and 4,

279 1) REAL_EA = 127, EA = 254 REAL EB = 0 , ER = 127 REAL ER = 127, ER = EA + EB - 127 = 254 (O.K.) 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 + 1

1 0 1 1 1 1 1 1 0 + 1 1

1 0 1 1 1 1 1 1 1 0 / I B9_ER R3_ER The biased exponent of the result is between 1 and 254. No overflow nor underflow occurs when R9_ER B8_ER = 1 0 and the ER value is not equal to 0 nor 255.

2) REAL_EA = 127, EA = 254 REAL_EB = 2 , ER = 129 REAL ER = 129, ER = EA + ER - 127 = 256 (overflow) 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 + 1

1 1 0 0 0 0 0 0 0 + 1 1

1 1 0 0 0 0 0 0 0 0 / 1 B9_ER B8_ER The biased exponent of- the result is 256 and greater than 254. Overflow occurs when B9_ER B8_ER =11.

3) REAL_EA = -125, EA = 2 REALJEB = -126, EB = 1 REAL_ER = -251, ER = EA + EB - 127 = -124 (underflow) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 + 1

0 0 0 0 0 0 1 0 0 + 1 1

0 1 1 0 0 0 0 1 0 0 / I B9_ER B8_ER The biased exponent of the result is -124 and less than 0. Underflow occurs when B9 ER B8 ER = 0 1.

Figure B.16 All Possible Cases for Overflow and Underflow

2 8 0 4) REAL_EA = 127, EA = 254 REAL EB = 1 , EB = 128 REAL ER * 128, ER = EA + ER - 127 = 255 (overflow)

1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 + 1

1 0 1 1 1 1 1 1 1 + 1 1

1 0 1 1 1 1 1 1 1 1 / I B9 ER B8 ER

The biased exponent of the result is 265 and regarded as in fin ity . Overflow occurs when R9_ER R8_ER = 1 0 and ER value equal to 255,

5) REAL_EA = -1, EA = 126 REAL EB = -126, EB = 1 REAL ER = -127, ER = EA + EB - 127 = 0 (underflow)

0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 + 1

1 0 0 0 0 0 0 0 0 + 1 1

1 0 0 0 0 0 0 0 0 0 / I B9_ER R8_ER

The biased exponent of the result is 0. Underflow occurs when B9_ER B8_ER = 1 0 and ER value equql to zero.

Figure R.16 All Possible Oases for Overflow and Underflow (continued)

281 E R < 7 :0 >

EXP EQO B9 ER B8 ER

fm_exp eqO.ca

ER 255

ER EQ 0

fm ov ud.ca

UDF ZOUT

Figure B.17 Block Diagram of the Over/Underflow Unit (fm ovf udf.ca)

282 while UDF is obtained according to case 3 and 5. The ZOUT is asserted when one of the exponent is zero or underflow happens.

283 Appendix B.14 : Logic Equations of Ul, U2, 1)3, 1)4, and U5 in the 8-bit

M ultiplier with Modified Booth Algorithm.

Ul :

IJ1JDUT = X( +2) A A (i-l) | X( + l) A A(i) ) X(0) 4 0

j X(— 1) & !A(i) | X(-2) & !A(i -1)

U2 :

U2 out = x(-i) + x(- 2 )

U3 :

U3_0UT = X(+2) & A(N-l) | X(-2) 4 IA{N-1)

U4 :

U4 OUT - A(i) A B(N-l)

U5 :

X(+2) = !B(i+l) 4 B( i ) A B (i-l)

X( + l) = !B(i+l) 4 !B{i) A B(i-l) | !B(i+l) A B(i) A !B(i-l)

X(0) = !B(i+l) 4 !B(i) 4 !B(i-1) | B(i+1) 4B(i) 4 R(i-l)

X(-l) = B(i + 1) 4 1B(i ) A B(i-l) | B(i+1) 4 R(i) A !B(i-l)

X(-2) = B(1+l) 4 [B(i) 4 !B{i — 1)

SIGN__EX(i) = MINUS(i) xor X(-1)

SIGN__EX( i +1) = !MINIJSC1) A X(-l) | MINUS(i) xor X(-2)

MINUS(t + 1) = MINUS(I) | X(— 1) | X(-2)

X{+2), X{+1), X(O), X (-l), and X(-2), obtained from table 5.5,

mean the mulitplicand being multiplied by +2, +1, 1, -1, and -2. The

sign extension of the partial product can be simplified by using three

bits - SIGN_EX(i) and SIGN_EX(i + l) and MlNUS(i). MINUS(I) is used to

indicate whether or not there is a previous subtraction, and so

284 affects the SIGN_EX(i) and SIGN_EX(i+l). The following example, shown in figure B.18, explains how the sign extension bits SIGN EX( i ) and

SIGN_EX(i+l) are used.

The sign extension bits, SIGN EX( i ) and SIGN EX( i +1) can be obtained by considering the following cases, shown in figure R.19. The logic equations for the SIGN_EX(i) and SIGN_EX(i+l) are :

SIGN_EX(i) = 1MINUS(1) A X (-l)

| MINUS(i) A ( X(+2) | X(+l) | X(0) | X(-2) )

but

! X(-1) = X(+2) | X{ + 1) | X(0) | X(-2)

so

SIGN_EX(1) = !MINUS(i) A X (-l) | MlNDS(i) A !X(-1)

= MINUS(i) xor X(-l)

SIGN_EX(i +1) = !M1NUS{i) A ( X(-1) | X{-2) )

| MINlJS(i) A ( X(+2) | X( + l) | X(0) | X(-1) )

but

!X(-2) = X(+2) | X( + l) | X(0) | X(-1)

so SIGN_EX(i + l) = !MINUS(i) A X(-1) | !MINl!S(i) A X(-2)

| MINUS(i) A ! X(-2)

= !MINUS(i ) A X (-l) | MINIIS(i ) xor X(-2)

285 Example : 0 1 1 0 0 0 1 1 ( 63)H = 99 x 1 0 0 11111 (9F)H = 159

final product P = 15741

1) Encoding (9F)H

0 1 1 0 0 0 1 1 x +1 -2 +2 0 -1

MINUS(O) = 0

2) PI (partial product)

0 1 1 0 0 0 1 1 x -1

SIGN EXO 1 0 0 1 1 1 0 0

~ I SIGN EX 1 —>11 1 < generated by U2

PI = 1 1 1 0 0 1 1 1 0 1 MINUS(l) = 1

3) P2 (partial product)

0 1 1 0 0 0 1 1 x 0

SIGN EX2 00000000

" I SIGN_EX3 —>11 0 <— - generated by U2

P2 = 1 100000000 Ml NUS( 2) * 1

Figure B.1R Example to Explain How the Sign Extension Bits SIGN EX(i) and SIGN EX(i+l) Function

286 4) P3 (partial product)

0 1 1 0 0 0 1 1 X +2

S IGN_EX4 0 1 1 0 0 0 1 1 n 1 1 S IGN_EX5 — > 1 1 0 < — generated by U2

P3 = 1 1 1 1 0 0 0 1 1 0 MI NUS(3) ;= 1

5) P4 (partial product)

0 1 1 0 0 0 1 1 X -2

S IGN_EX5 1 0 0 1 1 1 0 0 1 1 1 SIG N_E X6 — > 0 1 1 < - - - generated by 02

P4 = 1 0 0 0 1 1 1 0 1 0 M1NUS(4) = 1

6) P5 (partial product)

0 1 1 0 0 0 1 •%1 X 1

P5 = 0 1 1 0 0 0 1 1

8 4 2 1 n 7) Final product P=P5x2 +P4x2 +P3x2 + P2 x 2 + PI x 2

PI = 1110 0 1110 1 P2 = 1 1 0 0 0 0 0 0 0 0 P3 = l i i i n o o i i o P4 =100011 1010 + P5 = o l 1 n o o i i

p = 0 0 1 1 110101111101= 1 5 7 4 1

Figure B.18 Example to Explain How the Sign Extension Bits SIGN EX(i) and SIGN EX(i+l) Function (contiuned)

287 1) HINUS(i) = 0, and one of the X(+2), X( + l) and X(0) is asserted.

sign extension of the sum sum of the previous of previous partial products partial products / i / i P(i) 0000------o n o o o f i p p p ------ppp current + P(i + 1) 0 0 0 0...... 0 0 0 0 p p p - - - p p p <— partial ------product 0 0 0 ...... OOOOOppp...... ppp 1 /II MINUS(i + l) = 0 SIGN EX(i +1) SIGN EX(i) = 0 0

2) MINUS(i) = 0, X{-1) = 1,

sign extension of the sum sum of the previous of previous partial products partial products / 1/1 P (i) 0 0 0 0 0 0 0 0 0 0 p p p - ■ - p p p current + P(i+1) llll-----llllppp---ppp <— partial ------product 1 111--- - lllllppp-----ppp I / I I MINUS(1+1) = 1 SIGN EX(i+l) SIGN EX(i) » 1 1

3) MINUS(i) = 0, X(-2) = 1.

sign extension of tho sum sum of the previous of previous partial products partial products

P(i) 0 1 0 0 0 0 0 0 0 0 0 1 p 1 p p - • ■ p p p 1 current + P(i+1) 1111----- 1 1 1 p p p p - - - p p p <— partial ------product llll----llllpppp-----ppp 1 , 1 1 * MINIJS( i +1) = 1 SIGN EX(i+l) SIGN EX(i) = 1 D

Figure B.19 Consider All Possible Cases to Obtain SIGN EX(i) and SiGN EX(i+l).

288 4) MINUS(i) = 1, and one of the X(+2), X(+1) and X(D) is asserted.

sign extension of the sum sum of the previous of previous partial products partial products / I / I P(i) llll-----llllllppp---ppp current + P(i+1) 0 0 0 0 0 0 0 0 p p p • - - p p p <— partial ------product 1 1 1 1 ----- 1 1 1 1 p p p ...... ppp I / I I MINUS(i+l) = 1 SIGN EX(1 + 1) SIGN EX(1) = 1 1

5) MINUS(i) = 1, X (-l) « 1.

sign extension of the sum sum of the previous of previous partial products partial products / I / I P(i) llll-----llllllppp---ppp current + P(i+1) llll-----llllppp---ppp <— partial ------product 1 111---- - 1 1 1 0 p p p ...... ppp I / I I MINl)S( i +1) = 1 SIGN EX(1+1) SIGN EX(i) = 1 0

6) MINUS(i) = 1, X(-2) = 1.

sign extension of the sum sum of the previous of previous partial products partial products / I / I P (i) 1111----- llllllppp---ppp current + P(i + 1) 1 1 1 1 1 1 1 p p p p - - ■ p p p <— partial ------product 1 111---- - 1 1 0 p p p p ...... ppp i / II Ml NUS{ i +1) = 1 SIGN EX(i+l) SIGN EX(i) = 0 1

Figure B.19 Consider All Possible Cases to Obtain SIGN EX{i ) and SIGN_EX{i+l) (continued)

2 m Appendix B.15 : Detailed Explanation for the Rounding Scheme Used in

the 8-bit M ultiplier with Modified Booth Algorithm.

For a floating point m ultiplier, the input multiplicand and m ultiplier are between 1 and 2. Thus, the final product of a

N-bit-by-N-bit multiplier is between 1 and 4. From the following possible final products, shown in figure B.20, can he seen the rounding scheme can be obtained by adding the full adder. It is assumed that the word length is 4 and so the final product has 8-bit long.

A truncation error occurs only in case 5. If it is assumed the value of the final product is uniform distribution in the range from 1 to 4, the probability of the final product between 2 to 4 is two thirds. And because the probability of when the Nth and (N-l)th bits are "1 0" is one fourth, the probability of truncation error occurring is 2/3 x 1/4 = 1/6, So a satisfactory round-off result can he obtained by using one more adder without affecting the operation time.

2 9 0 1) 1 -< P (final product) < 2 art J assumed P = 0 1 , 010001

a) after round-off, P = 0 1 . 0 1 considered b) add 1 to P at the (N - 1)th bit position to affect then truncate the last N-bit, rounding

P = 0 1 . 0 1 0 0 0 1 + 1

P = 0 1 . 0 1 0 0 0 1

truncate the last 4 b its , P s 0 1 . 0 1

2) 1 =< P (final product) < 2 and assumed P * 0 1 , 011001 I a) after round-off, P = 0 1 . 1 0 considered b) add 1 to P at the (N - l)th bit position to affect then truncate the last N-bit, rounding

P = 0 1 . 0 1 1 0 0 1 + 1

P = 0 1 . 1 0 0 0 0 1

truncate the last 4 bits, P = 0 1 , 1 0

3) 2 =< P (final product) < 4 and assumed P = 11,000001 I I a) after one right shift and considerd to round-off, P = 0 1 . 1 0 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit,

P = 1 1 . 0 0 0 0 0 1 + 1

P = 1 1 . 0 0 1 0 0 1

after one right shift, P = 0 1 . 100100 then truncate the last 4 bits, P = 0 1 . 1 0

Figure B.20 Achieve Rounding Scheme by Adding a Full Adder

291 4) 2 =< P (final product) < 4 and assumed P = 1 1 . 0 0 1 0 0 1 I 1 a) after one right shift and eonsiderd to round-off, P = 0 1 , 10 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit,

P = 1 1 . 0 0 1 0 0 1 + 1

P = 1 1 . 0 1 0 0 0 1

after one right shift, P = 0 1 . 101000 then truncate the last 4 bits, P = 0 1 . 1 0

5) 2 =< P (final product) < 4 and assumed P = 1 1 . 0 1 0 0 0 1 i I a) after one right shift and eonsiderd to round-off, P * 0 1 , 1 1 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit.

P = 1 1 . 0 1 0 0 0 1 + 1

P = 1 1 . 0 1 1 0 0 1

after one right shift, P = 0 1 . 101100 truncation then truncate the last 4 bits, P = 0 1 . 1 0 <— error

6) 2 =< P (final product) < 4 and assumed P = 1 1 . 0 1 1 0 0 1 1 1 a) after one right shift and eonsiderd to round-off, P = 0 1 . 1 1 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit,

P = 1 1 . 0 1 1 0 0 1 + 1

p = 1 1 . 1 0 0 0 0 1

after one right shift, P * 0 1 . 1 10000 then truncate the last 4 bits, P = 0 1 . 1 1

Figure B.20 Achieve Rounding Scheme by Adding a Full Adder (continued)

292 4) 2 =< P (final product) < 4 and assumed P = 1 1 . 0 0 1 0 0 1 I I a) after one right shift and eonsiderd to round-off, P = 0 1 . 10 affect rounding b) add 1 to P at the (N - 1)th bit position then shift and truncate the last N-bit.

P = 1 1 . 0 0 1 0 0 1 + 1

P = 1 1 . 0 1 0 0 0 1

after one right shift, P=01. 10100 then truncate the last 4 bits, P = 0 1 . 1

5) 2 =< P (final product) < 4 and assumed P = 1 1 . 0 1 0 0 0 1 I I a) after one right shift and eonsiderd to round-off, P = 0 1 . 1 1 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit.

P = 1 1 . 0 1 0 0 0 1 + 1

P = 1 1 . 0 1 1 0 0 1

after one right shift, P = 0 1 . 10 1 1 0 truncation then truncate the last 4 bits, P = 0 1 . 1 <— error

6) 2 =< P (final product) < 4 and assumed P = 1 1 .011001 i I a) after one right shift and eonsiderd to round-off, P = 0 1 , 1 1 affect rounding b) add 1 to P at the (N - l)th bit position then shift and truncate the last N-bit.

P = 1 1 . 0 1 1 0 0 1 + 1

P = 1 1 . 1 0 0 0 0 1

after one right shift, P = 01. 1 1000 then truncate the last 4 bits, P = 0 1 . 1

Figure B.20 Achieve Rounding Scheme by Adding a Full Adder (continued)

293 Appendix B.16 : Detailed Circuit Designs for the Register Storing

the M ultiplier B and Recursive Carry-Save Adder.

Figure B.21 shows the circuit of the register storing B. Each bit of the register contains two shift registers, so that B(i+1), B(i) and B(i-1) are available at every clock phase. When B is loaded into the register, a "0" at the LSB position and two "0"s at the most significant bit positions are loaded together. This guarantees that the first encoded pair is "B(1) B(0) 0" and the last encoded pair is "0 0 B(N-1)". The value stored in the register is encoded (N/2 + 1) times, and therefore (N/2 + 1) partial products are generated. At the last encoding cycle, "0 0 B(N-1)" is always encoded, which causes X(+1) = B(N-1), X(0) = !B(N-1), and X(+2) = X(-1) = X(-2) = 0. This performs exactly the same function as the last row of the carry-save adders in figure 5.11, except that no round-off is used.

Figure B.22 shows the schematic of the data flow in the pipelined recursive carry-save adders. Since two multiplier bits are examined each time, the sum outputs of each carry-save adder must be shifted to the right by two bits, while the carry is shifted to the right by one bit. The functions of U1, U2 and U3 are exactly the same as those in figure 5.9. After pulse 14, the low N bits of the carry set and the low N bits of the sum set of the final product are stored in the right N/2 slices of shift registers; there are four shift registers in each slice. The high N bits of the carry set and the high N bits of the sum set are stored in the right N carry-save adders. The outputs of the leftmost carry-save adder are not sent to the carry-propagate adder in the next pipeline stage.
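
The recoding that this register supports is easy to model in software. The following Python sketch is mine, not the dissertation's (it treats B as unsigned for simplicity, and the names are hypothetical): B is padded with one "0" below the LSB and two "0"s above the MSB, and the (N/2 + 1) overlapping triplets (B(i+1), B(i), B(i-1)) are recoded into digits in {-2, -1, 0, +1, +2}, each of which selects one partial product.

    # A minimal sketch of the modified Booth recoding fed by the register
    # (assumption: B unsigned; the dissertation's signed handling is not modeled).
    def booth_digits(b_bits):                # b_bits[0] is the LSB, B(0)
        n = len(b_bits)
        padded = [0] + b_bits + [0, 0]       # "0" at LSB, two "0"s above the MSB
        digits = []
        for i in range(0, n + 1, 2):         # N/2 + 1 overlapping triplets
            low, mid, high = padded[i], padded[i + 1], padded[i + 2]
            digits.append(-2 * high + mid + low)
        return digits

    b = [1, 0, 1, 1, 0, 1]                   # B = 101101 in binary = 45, LSB first
    digits = booth_digits(b)                 # first triplet is "B(1) B(0) 0",
                                             # last triplet is "0 0 B(N-1)"
    assert sum(d * 4 ** k for k, d in enumerate(digits)) == 45
    print(digits)                            # [1, -1, -1, 1]

The assert checks that the digits, weighted by powers of four, reconstruct B, which is what guarantees that the (N/2 + 1) partial products sum to the correct product; the last digit equals B(N-1), matching the X(+1) = B(N-1) selection described above.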

Figure B.21 Circuit of the Register Storing Multiplier B

[Figure B.22 here: data-flow schematic showing the Booth select lines x(+2), x(+1), x(0), x(-1), x(-2), the blocks U1 and U3, the sign-extension cells SIGN-EX.i and SIGN-EX.i+1, full adders (FA), clock phases ϕ1 and ϕ2, and the sum/carry outputs S, C through S(2N-1), C(2N-1)]

Figure B.22 Schematic of Data Flow in the Pipelined Carry-Save Adders

Appendix C.1 : Detailed Network Description for a 2-bit Adder.

First of all, an exclusive-OR and an exclusive-NOR are built as macros. Their circuits are shown in figures C.1 and C.2, respectively, and their macro definitions are listed below:

; Macro definition for an Exclusive-OR
; File name : xor.mac
; Default channel-width/channel-length for depletion transistors = 2/8
; Default W/L for enhancement transistors = 2/2
(macro xor (y a a- b b-)
    (and-or-invert y ((a 4 2) (b 4 2)) ((a- 4 2) (b- 4 2)))
) ; end xor

; Macro definition for an Exclusive-NOR
; File name : xnor.mac
(macro xnor (y a a- b b-)
    (and-or-invert y ((a 4 2) (b- 4 2)) ((a- 4 2) (b 4 2)))
) ; end xnor
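
As a quick sanity check (a sketch of my own, not part of the dissertation's tool flow), the and-or-invert wiring of the two macros can be verified exhaustively in Python: with complemented inputs available, y = NOT((a AND b) OR (a- AND b-)) is a XOR b, and swapping the b inputs between the two AND terms gives XNOR.

    # and-or-invert: y = NOT((p1 AND p2) OR (p3 AND p4)), as in xor.mac / xnor.mac.
    def aoi(p1, p2, p3, p4):
        return 1 - ((p1 & p2) | (p3 & p4))

    for a in (0, 1):
        for b in (0, 1):
            a_n, b_n = 1 - a, 1 - b                    # complemented inputs a-, b-
            assert aoi(a, b, a_n, b_n) == a ^ b        # xor.mac input wiring
            assert aoi(a, b_n, a_n, b) == 1 - (a ^ b)  # xnor.mac input wiring
    print("xor.mac / xnor.mac wiring verified")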

A macro definition has the general format

(macro name (param1 param2 param3 ...)
    body of the macro
)

The name is followed by a list of parameters - param1, param2, param3, ... - which represent the values to be used when the macro is called later.
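
For example, with hypothetical node names of my own, a call such as (xor s n1 n1- n2 n2-) would instantiate the and-or-invert gate of xor.mac with y replaced by s, a by n1, a- by n1-, b by n2 and b- by n2-; the calls (xor p a a- b b-) inside the adder macros below work in exactly this way.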

A one-bit adder having a positive-logic carry-in and a negative-logic carry-out is shown in figure C.3, while a one-bit adder having a negative-logic carry-in and a positive-logic carry-out is shown in figure C.4. Their macro definitions are listed below:

[Figure C.1 here: NMOS gate with a depletion pull-up to Vdd and 4/2 enhancement transistors realizing y = a XOR b]

Figure C.1 Exclusive-OR

[Figure C.2 here: NMOS gate with a depletion pull-up to Vdd and 4/2 enhancement transistors realizing y = a XNOR b]

Figure C.2 Exclusive-NOR

[Figure C.3 here: block symbol and transistor-level diagram of addl_e.mac, with inputs a, b and cin and output cout-]

Figure C.3 Circuit Diagram of addl_e.mac

[Figure C.4 here: block symbol and transistor-level diagram of addl_o.mac, with inputs a, b and cin- and output cout]

Figure C.4 Circuit Diagram of addl_o.mac

; Macro definition for a 1-bit adder at even position with positive
; logic carry-in and negative logic carry-out
; File name : addl_e.mac
(macro addl_e (cout- r a b cin)
    ; Declaration of the nodes local to addl_e.mac.
    ; These nodes are only of local importance to addl_e.mac
    ; and will not be referred to when addl_e.mac is used later.
    (local a- a-1 b- cin- p p-)
    ; Load the macros xor.mac and xnor.mac, which has the effect of
    ; inserting the macro definitions before the description of
    ; addl_e.mac.
    (load "xor.mac")
    (load "xnor.mac")
    (invert (a- 2 4) (a 4 2))
    (invert (a-1 2 4) (a 4 2))
    (invert (b- 2 4) (b 4 2))
    (xor p a a- b b-)
    (xnor p- a a- b b-)
    (etrans p- a-1 cout-)
    (invert cin- (cin 4 2))
    (etrans p cin- cout-)
    (xor r cin cin- p p-)
) ; end addl_e

; Macro definition for a 1-bit adder at odd position with negative
; logic carry-in and positive logic carry-out
; File name : addl_o.mac
(macro addl_o (cout r a b cin-)
    (local a- al b- cin p p-)
    (load "xor.mac")
    (load "xnor.mac")
    (invert (a- 2 4) (a 4 2))
    (invert (al 2 4) (a- 4 2))
    (invert (b- 2 4) (b 4 2))
    (xor p a a- b b-)
    (xnor p- a a- b b-)
    (etrans p- al cout)
    (invert cin (cin- 4 2))
    (etrans p cin cout)
    (xor r cin cin- p p-)
) ; end addl_o

A 2-bit adder macro composed of addl_e.mac and addl_o.mac is shown in figure C.5, and its macro definition is listed below:

[Figure C.5 here: addl_e.mac stage (inputs a0, b0, carry-in c-1) cascaded into addl_o.mac stage (inputs a1, b1) through the internal carry c0, producing sum outputs r0, r1 and carry-out c1; the transistor-level diagram appears below the block diagram]

Figure C.5 Circuit Diagram of add2.mac

; Macro definition for a 2-bit adder
; File name : add2.mac
(macro add2 (c1 r1 r0 a1 b1 a0 b0 c-1)
    (local c0)
    (load "addl_e.mac")
    (load "addl_o.mac")
    (addl_e c0 r0 a0 b0 c-1)
    (addl_o c1 r1 a1 b1 c0)
) ; end add2

By using the above 2-bit adder macro, add2.mac, an adder with any even number of bits can be constructed easily. The macro for an adder with any even number of bits is shown first, and then an example of a 2-bit adder with each output having a capacitance load of 0.03 pF is described.

; Macro definition for a 2n-bit adder
; File name : adder.mac
(macro adder (n cout r a b cin)
    (local c)
    (load "add2.mac")
    ; Repeat macro add2.mac from i=0 to i=(n-1).
    (repeat i 0 (1- n)
        (add2 c.(1+ (* 2 i)) r.(1+ (* 2 i)) r.(* 2 i)
              a.(1+ (* 2 i)) b.(1+ (* 2 i)) a.(* 2 i) b.(* 2 i)
              c.(1- (* 2 i)))
    )
    (connect cin c.-1)
    (connect cout c.(1- (* 2 n)))
) ; end adder

; Circuit description for a 2-bit adder
; File name : add2.net
(node n cout r a b cin)
; Set n equal to 1 for a 2-bit adder.
(setq n 1)
(load "adder.mac")
(adder n cout r a b cin)
; Each output node has a capacitance load of 0.03 pF.
(capacitance cout 0.03)
(repeat i 0 (1- n)
    (capacitance r.(1+ (* 2 i)) 0.03)
    (capacitance r.(* 2 i) 0.03)
)
; end add2
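
Before running the simulations, the alternating carry polarity of the cascade can be sanity-checked at the logic level. The following Python sketch is my own model of addl_e.mac and addl_o.mac, not the dissertation's code: the even stage produces a negative-logic carry-out, the odd stage re-inverts it internally (as its (invert cin (cin- 4 2)) line does), and the exhaustive loop checks the composed add2 behavior.

    from itertools import product

    def addl_e(a, b, cin):
        # Even stage: positive-logic carry-in, negative-logic carry-out.
        carry = cin if a ^ b else a          # propagate cin, or generate from a
        return a ^ b ^ cin, 1 - carry        # (sum, cout-)

    def addl_o(a, b, cin_n):
        # Odd stage: negative-logic carry-in, positive-logic carry-out.
        cin = 1 - cin_n                      # internal inverter restores the carry
        carry = cin if a ^ b else a
        return a ^ b ^ cin, carry            # (sum, cout)

    # Exhaustive check of the add2 cascade: r1 r0 plus carry equals A + B + cin.
    for a1, a0, b1, b0, c in product((0, 1), repeat=5):
        r0, c0_n = addl_e(a0, b0, c)         # even slice, carry out as c0-
        r1, c1 = addl_o(a1, b1, c0_n)        # odd slice consumes c0- directly
        assert r0 + 2 * r1 + 4 * c1 == (a0 + 2 * a1) + (b0 + 2 * b1) + c
    print("add2 cascade verified")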

Appendix C.2 : Command File for RNL Simulation for the 2-bit Adder Described in Appendix C.1.

; Command file "add2.cmd" for RNL simulation
(load "~cad/lib/rnl/uwstd.l")
(load "~cad/lib/rnl/uwsim.l")
(read-network "add2.rnl")
(setq incr 100)
(setq all_nodes '(cin cout r.1 r.0 a.1 b.1 a.0 b.0))
(chflag all_nodes)
(defvec '(bin A a.1 a.0))
(defvec '(bin B b.1 b.0))
(defvec '(bin add2out cout r.1 r.0))
(def-report '("CURRENT STATE" (vec A) (vec B) cin newline (vec add2out)))

; With (a.1 b.1) = (0 0), step (a.0 b.0 cin) through all eight combinations.
(l '(a.1)) (l '(b.1)) (l '(a.0 b.0 cin)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())
(l '(cin b.0)) (h '(a.0)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())

; With (a.1 b.1) = (0 1), repeat the walk.
(l '(a.1)) (h '(b.1)) (l '(a.0 b.0 cin)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())
(l '(cin b.0)) (h '(a.0)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())

; With (a.1 b.1) = (1 0), repeat the walk.
(h '(a.1)) (l '(b.1)) (l '(a.0 b.0 cin)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())
(l '(cin b.0)) (h '(a.0)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())

; With (a.1 b.1) = (1 1), repeat the walk.
(h '(a.1)) (h '(b.1)) (l '(a.0 b.0 cin)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())
(l '(cin b.0)) (h '(a.0)) (s '())
(h '(cin)) (s '())
(l '(cin)) (h '(b.0)) (s '())
(h '(cin)) (s '())

(exit)

Appendix C.3 : Input and Output Signals Specified for the 2-bit Adder in the SPICE Simulation.

*********** input signals *************************

VDD DC 5
VEM DC 0
VDM DC 0
Vcin PULSE(0 5 20NS 0NS 0NS 20NS 40NS)
Vb.0 PULSE(0 5 40NS 0NS 0NS 40NS 80NS)
Va.0 PULSE(0 5 80NS 0NS 0NS 80NS 160NS)
Vb.1 PULSE(0 5 160NS 0NS 0NS 160NS 320NS)
Va.1 PULSE(0 5 320NS 0NS 0NS 320NS 640NS)

*********** output signals ************************

.PLOT TRAN V() V() V() V() (0,5)
.TRAN 2NS 640NS
.OPTIONS LIMPTS=400
.WIDTH OUT=72
.END
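
Note that each pulse source's delay and width equal half its period, and all three double from one input to the next (cin: 20/40 ns, b.0: 40/80 ns, a.0: 80/160 ns, b.1: 160/320 ns, a.1: 320/640 ns). The five inputs therefore behave as a 5-bit binary counter with cin as the least significant bit, so the 640 ns transient sweeps all 32 input combinations of the 2-bit adder.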

Appendix C.4 : Model Parameters of the Simulated Devices in the SPICE Simulation.

.MODEL ENMOS NMOS LEVEL=1 LD=0.211698U TOX=635.000E-10
+NSUB=3.779887E+15 VTO=1.13877 KP=4.145038E-05
+GAMMA=0.494661 PHI=0.600000 UO=300.000
+VMAX=100000. XJ=5.27683U LAMBDA=2.385822E-02
+NFS=2.356687E+12 NSS=0.000000E+00 TPG=1.00000
+RSH=25.4 CGSO=1.6E-10 CGDO=1.6E-10 CGBO=1.7E-10
+CJ=1.1E-4 MJ=0.5 CJSW=5E-10 MJSW=0.33

.MODEL DNMOS NMOS LEVEL=1 LD=0.348540U TOX=635.000E-10
+NSUB=1.000000E+16 VTO=-3.83489 KP=3.639582E-05
+GAMMA=0.314330 PHI=0.600000 UO=900.000
+VMAX=477999. XJ=0.439338U LAMBDA=1.00000E-06
+NFS=4.310000E+12 NSS=0.000000E+00 TPG=1.00000
+RSH=25.4 CGSO=1.6E-10 CGDO=1.6E-10 CGBO=1.7E-10
+CJ=1.1E-4 MJ=0.5 CJSW=5E-10 MJSW=0.33