Chao, Hung-Hsiang Jonathan

PARALLEL/PIPELINE VLSI COMPUTING STRUCTURES FOR ROBOTICS APPLICATIONS

The Ohio State University, Ph.D., 1985

University Microfilms International, 300 N. Zeeb Road, Ann Arbor, MI 48106

Copyright 1985 by Chao, Hung-Hsiang Jonathan. All Rights Reserved.

Parallel/Pipeline VLSI Computing Structures for Robotics Applications
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

by

Hung-Hsiang Jonathan Chao, B.S., M.S.

The Ohio State University 1985

Reading Committee:
Karl Olson
David Orin
Fusun Ozguner

Approved by: Advisor, Department of Electrical Engineering

© 1985 HUNG-HSIANG JONATHAN CHAO
All Rights Reserved

To my parents and my wife
ACKNOWLEDGEMENTS
I would like to thank my advisor, Professor Karl W. Olson, for the constant support and constructive advice which he provided during my studies at The Ohio State University. Many of the results of this research were developed under his guidance. His contributions to the research and patience in reviewing this dissertation are also greatly appreciated.
I would like to thank Professor David E. Orin, who gave me many helpful suggestions from time to time and also reviewed this dissertation. I also would like to thank Professor Fusun Ozguner, a member of the reading committee, for reviewing this dissertation. I am very much indebted to Mrs. Barbara S. Elberfeld for her careful, patient, and efficient proofreading of my manuscript. I am grateful to Ms. Debi Britton for her excellent work in preparing this manuscript.

Finally, I would like to thank my parents for their support, and most importantly, I wish to thank my wife, Yeichu, and daughter, Jessica, for the endurance, encouragement and love which they provided throughout my studies.
This research was supported by the National Science Foundation, Computer Engineering Grant No. DMC-8312677.
Hung-Hsiang Jonathan Chao
VITA
December 10, 1955 ...... Born - Taipei, Taiwan, R.O.C.
June, 1977 ...... B.S., Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
June, 1980 ...... M.S., Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
1977-1981 ...... Design Engineer, Telecommunication Laboratories, Switching System Group, Chungli, Taiwan, R.O.C.
1982-1985 ...... Graduate Research Associate, Digital Systems Laboratory, The Ohio State University, Columbus, Ohio
1982-1985 ...... Graduate Teaching Associate, Department of Electrical Engineering, The Ohio State University, Columbus, Ohio
FIELDS OF STUDY
Major Field: Electrical Engineering
Studies in Computer Engineering: Professors K.W. Olson, D.E. Orin, F. Ozguner, K.J. Breeding, R.B. McGhee
Studies in Control Engineering: Professors R.E. Fenton, U. Ozguner
Studies in Communications: Professor D.T. Davis, R.T. Compton
Studies in Computer and Information Science: Professor M.T. Liu, V. Ashok, B.W. Weide
PUBLICATIONS
"The Design of Reliable Common Channel Signaling System in Time Division Digital Switching System," M.S. Thesis, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 1980.
"The Design of Control System in Time Division Digital Switching System-II," Journal of Taiwan Telecommunication Laboratories, Chungli, Taiwan, R.O.C., April 1981.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS ...... iii
VITA ...... iv
LIST OF FIGURES ...... xiii
LIST OF TABLES ...... xviii
Chapter
1 INTRODUCTION ...... 1
1.1 Project Background ...... 1
1.2 Previous Work ...... 3
1.3 Organization ...... 5
2 VLSI COMPUTING STRUCTURE ON ROBOTICS APPLICATIONS ...... 7
2.1 Introduction ...... 7
2.2 Inverse Plant Plus Jacobian Control ...... 7
2.3 Computer Architectures for Robotics ...... 9
2.4 VLSI Technology to Computer Architectures ...... 13
2.5 VLSI Technology ...... 15
2.6 Summary ...... 17
3 ARCHITECTURE OF THE ROBOTICS PROCESSOR ...... 18
3.1 Introduction ...... 18
3.2 Block Diagram of the RP ...... 18
3.3 Evolution of the Architectural Design of the RP Data Paths ...... 24
3.3.1 Single-Bus Configuration ...... 24
3.3.2 Two-Bus Configuration ...... 26
3.3.3 Three-Bus Configuration ...... 26
3.3.4 Cross-Bar Network ...... 27
3.4 Summary ...... 29
4 APPLICATIONS OF THE ROBOTICS PROCESSOR ...... 33
4.1 Introduction ...... 33
4.2 Jacobian ...... 37
4.2.1 Complexity of Vector Operations ...... 39
4.2.2 Task Graph ...... 40
4.2.3 Architectures of the Jacobian ...... 43
4.2.3.1 1-Processor Architecture ...... 43
4.2.3.2 2-Processor Architecture ...... 46
4.2.3.3 N-Processor Architecture ...... 49
4.2.3.4 Cube Interconnection Network ...... 55
4.2.3.5 Comparison ...... 66
4.3 Inverse Jacobian ...... 67
4.3.1 Methods for Solving Linear Equations ...... 68
4.3.2 Architectures of the Inverse Jacobian ...... 70
4.3.2.1 1-Processor Architecture ...... 70
4.3.2.2 6-Processor Architecture ...... 72
4.3.2.3 12-Processor Architecture ...... 72
4.3.2.4 24-Processor Architecture ...... 75
4.3.2.5 Comparison ...... 75
4.4 Inverse Dynamics ...... 78
4.4.1 Task Graph ...... 80
4.4.2 Architectures of the Inverse Dynamics ...... 80
4.4.2.1 1-Processor Architecture ...... 80
4.4.2.2 2-Processor Architecture ...... 80
4.4.2.3 N-Processor Architecture ...... 87
4.4.2.4 2N-Processor Architecture ...... 91
4.4.2.5 Comparison...... 95
4.5 Summary ...... 95
5 CIRCUIT DESIGNS OF THE ROBOTICS PROCESSOR CHIP ...... 97
5.1 Introduction ...... 97
5.2 Clock Generator ...... 97
5.3 Bootstrap Unit and Format Converters ...... 100
5.4 Testability in the Chip ...... 103
5.4.1 Structured Design for Testability ...... 104
5.4.2 Level Sensitive Scan Design (LSSD) ...... 107
5.5 Floating Point Adder/Subtractor (FPA) ...... 110
5.5.1 Floating Point Format ...... 110
5.5.2 Algorithm and Block Diagram ...... 112
5.5.3 N-bit Adder/Subtractor ...... 116
5.6 Floating Point Multiplier (FPM) ...... 120
5.6.1 Algorithm and Block Diagram ...... 122
5.6.2 24-bit Fixed Point Multiplier ...... 125
5.6.2.1 Sequential Add-Shift Multiplication 126
5.6.2.2 Array Multiplier ...... 126
5.6.2.3 Nonadditive Multiply Modules (NMM) with Wallace Trees ...... 127
5.6.2.4 Additive Multiply Modules (AMM) ... 129
5.6.2.5 Recursive Parallel Multiplier ...... 129
5.6.2.6 Modified Booth Algorithm (Radix=4) with Carry-Save Adders ...... 131
5.6.2.7 Pipelined Recursive Multiplier with Modified Booth Algorithm ...... 135
5.7 Summary ...... 139
6 COMPUTER AIDED DESIGN FOR VLSI ...... 140
6.1 Introduction ...... 140
6.2 Overview of VLSI Design Tools ...... 141
6.3 Logic Circuit Description ...... 146
6.4 Logic Level Simulation ...... 152
6.5 Circuit Level Simulation ...... 158
6.6 Summary ...... 160
7 SUMMARY AND CONCLUSIONS ...... 163
7.1 Summary ...... 163
7.2 Research Extensions ...... 166
REFERENCES ...... 171
APPENDIX A.1 Reservation Tables for Vector Operations ...... 176
APPENDIX A.2 Microprogram for Jacobian (one RP per Link) ...... 190
APPENDIX A.3 Calculation of the Measurement Parameters for Jacobian with P = 1 ...... 194
APPENDIX A.4 Calculation of the Measurement Parameters for Jacobian with P = 2 ...... 195
APPENDIX A.5 Calculation of the Measurement Parameters for Jacobian with P = N ...... 197
APPENDIX A.6 To Find Brl ...... 198
APPENDIX A.7 Computation Complexity and Register Required for Vector Inner Products ...... 210
APPENDIX A.8 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 1 ...... 211
APPENDIX A.9 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 2 ...... 214
APPENDIX A.10 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 12 ...... 217
APPENDIX A.11 Procedures to Solve the Derivative of Theta and Calculation of the Measurement Parameters for Inverse Jacobian with P = 24...... 222
APPENDIX A.12 Microprogram for Forward Recursion of Inverse Dynamics (one RP per Link) ...... 228
APPENDIX A.13 Microprogram for Backward Recursion of Inverse Dynamics (one RP per Link) ...... 234
APPENDIX A.14 Calculation of the Measurement Parameters for Inverse Dynamics with P = 1 ...... 238
APPENDIX A .15 Calculation of the Measurement Parameters for Inverse Dynamics with P = 2 ...... 239
APPENDIX A.16 Calculation of the Measurement Parameters for Inverse Dynamics with P = N ...... 240
APPENDIX A.17 Calculation of the Measurement Parameters for Inverse Dynamics with P = 2N ...... 241
Appendix B.1 Detailed Circuit Descriptions for Two-Phase Generators (TPG) and Two Johnson Counters, JCNTR and JCNTF ...... 242
Appendix B.2 Detailed Procedures for Loading Microprogram and Circuit Designs for the Synchronization Controller and Bootstrap Controller (SC+BTC) ...... 245
Appendix B.3 Detailed Circuit Designs for the Four Format Converters, FCE, FCS, FCW, and FCN ...... 250
Appendix B.4 Data Flow in the Data Path for Normal Arithmetic Operations ...... 253
Appendix B.5 Circuit Design of the Zero Checking Unit ...... 258
Appendix B.6 Circuit Design of the Sign Unit ...... 260
Appendix B.7 Circuit Design of the Alignment Control Unit ...... 262
Appendix B.8 Circuit Design of the 24-bit Shifter ...... 265
Appendix B.9 Detailed Explanation for the Postnormalization ...... 270
Appendix B.10 Circuit Design of the Leading Zero Detector ...... 273
Appendix B.11 Circuit Design of the Overflow/Underflow Unit ...... 274
Appendix B.12 Detailed Circuit Design of the Zero Checking Unit ...... 278
Appendix B.13 Detailed Circuit Design of the Over/Underflow Unit ...... 279
Appendix B.14 Logic Equations of U1, U2, U3, U4, and U5 in the 8-bit Multiplier with Modified Booth Algorithm ...... 284
Appendix B.15 Detailed Explanation for the Rounding Scheme Used in the 8-bit Multiplier with Modified Booth Algorithm ...... 290
Appendix B.16 Detailed Circuit Designs for the Register Storing the Multiplier B and Recursive Carry-Save Adder ...... 294
Appendix C.1 Detailed Network Description for a 2-bit Adder ...... 297
Appendix C.2 Command File for RNL Simulation for the 2-bit Adder Described in Appendix C.1 ...... 305
Appendix C.3 Input and Output Signals Specified for the 2-bit Adder in the SPICE Simulation ...... 308
Appendix C.4 Model Parameters of the Simulated Devices in the SPICE Simulation ...... 309
LIST OF FIGURES
Figure Page
3.1 Block Diagram of the Robotics Processor ...... 19
3.2 Microinstruction Format ...... 21
3.3 Single-Bus Configuration ...... 25
3.4 Two-Bus Configuration ...... 25
3.5 Cross-Bar Network Configuration ...... 28
4.1 Block Diagram of Inverse Plant Plus Jacobian Control ...... 34
4.2 Major Data Acquisition, Computation, and Control Modules for Inverse Plant Plus Jacobian Control ...... 35
4.3 Architectural Concept for Implementation of Advanced Real-Time Control Algorithms for Robots ...... 36
4.4 Task Graph for Jacobian with P = 1 ...... 41
4.5 Architecture for Jacobian with P = 1 ...... 44
4.6 Timing Chart for Jacobian with P = 1 ...... 45
4.7 Task Graph for Jacobian with P = 2...... 47
4.8 Architecture for Jacobian with P = 2 ...... 48
4.9 Timing Chart for Jacobian with P = 2 ...... 50
4.10 Task Graph for Jacobian with P = N ...... 51
4.11 Architecture for Jacobian with P = N ...... 53
4.12 Timing Chart for Jacobian with P = N ...... 54
4.13 Architecture for Implementing Jacobian in Parallel (8 degrees-of-freedom) ...... 57
4.14(a) 3 Cube Interconnection Network ...... 60
4.14(b) 4 Cube Interconnection Network ...... 60
4.15 Communication of the 8 PEs in Different Time Slots ...... 61
4.16 3 Cube Interconnection Network for Implementing Jacobian in Parallel (8 degrees-of-freedom) ...... 63
4.17 Communication Between PEs in Different Time Slots ...... 64
4.18 4 Cube Interconnection Network for Implementing Jacobian in Parallel (16 degrees-of-freedom) ...... 65
4.19 Architecture for Inverse Jacobian with P = 1 ...... 71
4.20 Architecture for Inverse Jacobian with P = 6 ...... 73
4.21 Architecture for Inverse Jacobian with P = 12 ...... 74
4.22 Architecture for Inverse Jacobian with P = 24 ...... 76
4.23 Task Graph for the Forward Recursion of Inverse Dynamics 81
4.24 Task Graph for the Backward Recursion of Inverse Dynamics ...... 82
4.25 Architecture for Inverse Dynamics with P = 1 ...... 83
4.26 Timing Chart for Inverse Dynamics with P = 1 ...... 84
4.27 Architecture for Inverse Dynamics with P = 2 ...... 85
4.28 Timing Chart for Inverse Dynamics with P = 2 ...... 85
4.29 Timing Chart for Forward Recursion of Inverse Dynamics with One RP per Link ...... 88
4.30 Timing Chart for Backward Recursion of Inverse Dynamics with One RP per Link ...... 89
4.31 Architecture for Inverse Dynamics with P = N ...... 90
4.32 Timing Chart for Inverse Dynamics with P = N (N = 3 for example) ...... 92
4.33 Architecture for Inverse Dynamics with P = 2N ...... 93
4.34 Timing Chart for Inverse Dynamics with P = 2N (N = 3 for example) ...... 94
5.1 Block Diagram of the Robotics Processor ...... 98
5.2 Clock Generator (CG) ...... 99
5.3 Clock Signals Generated from the Clock Generator (CG) .. 101
5.4 Block Diagram of the BU, FCB, and CRAM Indicating the Paths Used for Microprogram Loading...... 102
5.5 Logic Circuit Diagram of BILBO Registers ...... 106
5.6 LSSD Used in the Pipelined Stages of the FPA and FPM . . . 108
5.7 Interconnection of the LSSD SRL's ...... 109
5.8 Block Diagram of the Floating Point Adder/Subtractor . . . 114
5.9 Circuit Diagram of a 2-bit Adder with Manchester-type Carry Chain ...... 121
5.10 Block Diagram of the Floating Point Multiplier ...... 123
5.11 An 8x8 Multiplier with Modified Booth Algorithm (Radix = 4) ...... 134
5.12 The Structure of a Pipelined Recursive Multiplier (mul_24.ca) ...... 136
5.13 Timing for the Pipelined Recursive M ultiplier ...... 137
5.14 State Diagram for Generating the MLO Signal ...... 137
6.1 Functional Chart of VLSI CAD Tools ...... 142
6.2 NMOS INVERTER ...... 148
6.3 NMOS NAND ...... 148
6.4 NMOS NOR ...... 151
6.5 NMOS AND-OR-INVERTER ...... 151
B.1 Two Phase Clock Generator (TPG) ...... 243
B.2 State Diagrams for JCNTR and JCNTF ...... 244
B.3 Timing for Loading Microprogram ...... 246
B.4 Synchronization Controller for HWR ...... 248
B.5 State Diagram of the Bootstrap Controller (BTC) ...... 249
B.6 Circuit Diagram of the FCE, FCW, FCS and FCN ...... 251
B.7 Timing for Data Passed between RP's ...... 252
B.8 Three Pipelined Stages in the FPM and FPA ...... 254
B.9 Timing for the Data Through the Pipelined Stages ...... 255
B.10 AA, AB and WR for the Register File ...... 257
B.11 Block Diagram of the Alignment Control Unit ...... 263
B.12 4-bit Barrel Right-Shifter ...... 266
B.13 Block Diagram of the 24-bit Barrel Right-Shifter ...... 268
B.14 Block Diagram of the 24-bit Right (Left) Shifter ...... 269
B.15 Block Diagram of the Over/Underflow Unit ...... 275
B.16 All Possible Cases for Overflow and Underflow ...... 280
B.17 Block Diagram of the Over/Underflow Unit (fm_ovf_udf.ca) 282
B.18 Example to Explain How the Sign Extension Bits SIGN_EX(i) and SIGN_EX(i+1) Function ...... 286
B.19 Consider All Possible Cases to Obtain SIGN_EX(i) and SIGN_EX(i+1) ...... 288
B.20 Achieve Rounding Scheme by Adding a Full Adder ...... 291
B.21 Circuit of the Register Storing Multiplier B ...... 295
B.22 Schematic of Data Flow in the Pipelined Carry-Save Adders 296
C.1 Exclusive-OR ...... 298
C.2 Exclusive-NOR ...... 298
C.3 Circuit Diagram of add1_e.mac ...... 299
C.4 Circuit Diagram of add1_o.mac ...... 300
C.5 Circuit Diagram of add2.mac ...... 302
LIST OF TABLES
Table Page
4.1 Computation Times for Necessary Vector and Matrix Operations ...... 39
4.2 Comparison of Three Architectures for Jacobian ...... 67
4.3 Comparison of Four Architectures for Inverse Jacobian . .. 77
4.4 Comparison of Four Architectures for Inverse Dynamics . . . 95
5.1 Possible Values of the IEEE Single Precision Floating Point ...... 111
5.2 Truth Table of a One-bit Full Adder ...... 119
5.3 Size and Number of the Wallace Trees for a 24-bit Multiplier ...... 128
5.4 Comparison of the 4M, 3M, and 2M Versions of the Multiplication in [38] ...... 130
5.5 Encoding Table for the Modified Booth Algorithm ...... 132
B.1 Truth Table for Generating Effective Operation Bits (EOP0, EOP1) and SUB ...... 260
B.2 Truth Table for Generating the Final Sign Bit of the Result ...... 261
B.3 Truth Table of the Leading Zero Detector ...... 273
CHAPTER 1
INTRODUCTION
1.1 Project Background
Several different and sophisticated control schemes have been proposed for robotic mechanisms in the past few years, but few of them have been widely used because they usually involve many complex computations which are difficult to implement in real time. For example, control of the end effector in cartesian coordinates or dynamic control may require a dozen trigonometric functions and several hundred (perhaps thousands of) floating point multiplications and additions/subtractions. The latest 16-bit microprocessors, equipped with single chip numeric co-processors (e.g. the Intel 8087), are still not adequate for most computationally intensive real-time control tasks.
The combination of the Inverse Plant for feedforward control and Jacobian control for feedback has been shown to have excellent potential for fast and accurate control. But since these operations are very time consuming, parallel/pipeline VLSI computing structures need to be designed to tackle the bottlenecks of robotic system control. Rapid advances have been made in very-large-scale-integrated (VLSI) semiconductor technology, and a computing structure implemented in VLSI should have the characteristics of simplicity, regularity, and communication locality. In addition, parallel computing schemes such as arithmetic pipelining, processor pipelining, and multiprocessor systems are employed to improve system throughput.
Special purpose dedicated attached processors, based on the Robotics Processor chip (RP) being developed with state-of-the-art VLSI technology, will be attached to a host microcomputer. The RPs are connected in a mesh network to achieve parallel and pipeline structures, where the parallelism is about 80%. The system throughput is expected to be an improvement over a high speed attached processor that only performs simple vector and matrix operations.
The Robotics Processor chip is designed primarily for solving the Jacobian, Inverse Jacobian, and Inverse Dynamics. The RP is able to perform the necessary vector and matrix operations in Inverse Plant plus Jacobian control. It contains a floating point adder/subtractor and a floating point multiplier; both have three pipeline stages and can execute simultaneously. Because the RP is designed for more than one application, it must be programmable. Based on the parallel/pipeline computing structure, the Jacobian, Inverse Jacobian, and Inverse Dynamics can each be completed in one millisecond.
The Computer-Aided Design (CAD) tools used to design the RP were released by the UW/NW VLSI Consortium on October 1 of both 1983 and 1984. The VLSI CAD tools are capable of (1) interactive layout (CAESAR), (2) logic simulation (RNL or ESIM), and (3) circuit simulation (SPICE). The tools support designs in the NMOS and CMOS fabrication processes available through MOSIS, the Department of Defense's MOS Implementation Service run by the Information Sciences Institute of the University of Southern California.
1.2 Previous Work
The Jacobian relates the rate of change (velocity) of each of the six components of end effector position and orientation to the rate of change of each of the joint angles. This approach is more efficient than reverse kinematics because it involves less complex equations and is more easily applied to general N degree-of-freedom robotic mechanisms. A number of algorithms to compute the Jacobian are available and show a linear increase in computation with an increase in the number of degrees-of-freedom [16] [23]. The algorithm of [16] for computing the Jacobian is considered for pipelining.
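The velocity relation that the Jacobian expresses can be made concrete with a small numerical sketch. The 2-link planar arm below is a hypothetical illustration (it is not the manipulator or algorithm treated in this dissertation, and the link lengths are made up): the Jacobian maps joint rates to end-effector rates.

```python
import math

def jacobian_2link(theta1, theta2, l1=1.0, l2=1.0):
    # Analytic Jacobian of a planar 2-link arm: maps joint rates
    # (dtheta1, dtheta2) to end-effector velocity (dx, dy).
    s1, c1 = math.sin(theta1), math.cos(theta1)
    s12, c12 = math.sin(theta1 + theta2), math.cos(theta1 + theta2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [ l1 * c1 + l2 * c12,  l2 * c12]]

def end_effector_velocity(J, dtheta):
    # x_dot = J * theta_dot (2x2 matrix times 2-vector)
    return [sum(J[i][j] * dtheta[j] for j in range(2)) for i in range(2)]

# arm bent 90 degrees at the elbow, base joint rotating at 1 rad/s
J = jacobian_2link(0.0, math.pi / 2)
v = end_effector_velocity(J, [1.0, 0.0])
```

Note that each entry of J costs only a few multiplies and trigonometric evaluations, but for a 6 degree-of-freedom arm at a 1 kHz servo rate the totals add up quickly, which is the motivation for the pipelined structures above.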
An Inverse Dynamics analysis determines the joint torques of a manipulator given the relative positions, rates, and accelerations of the joints as well as the forces and moments to be applied at the end effector. It is proposed for control to drive the manipulator along the desired trajectory. Several methods based on the Newton-Euler formulation have been proposed to solve this problem. Orin and Olson suggested [20] that the most natural approach to pipelining Inverse Dynamics is to assign two processors to each link, one for the forward recursion and one for the backward recursion. Lathrop [24] investigates the high degree of parallelism inherent in the computations of Inverse Dynamics and presents two formulations suited to high-speed, highly parallel implementations using VLSI devices. The first is a parallel version of the recent linear Newton-Euler recursive algorithm; its time cost is linear in the number of joints. The second is a new parallel algorithm for which the time required to perform the calculations increases only as the logarithm (base 2) of the number of joints.
A loosely coupled multiprocessor system has provided an excellent basis for the design of an onboard computer system for a new hexapod vehicle currently under development at The Ohio State University. It consists of fifteen 16-bit microcomputers based on Intel's 8086/8087 microprocessor pair. The Multibus allows high-speed common memory communication between the microcomputers.
Also constructed at Ohio State is the Skeletal Motion Processor, a highly parallel/pipelined array processor for generation of human and animal skeletal motion [16]. Its architecture includes four pipelined floating point adders and four pipelined floating point multipliers as well as nine independent data memories, a sine/cosine unit, a reciprocal unit, a microprogrammed control unit, and I/O buffers which are part of a PDP-11 interface. The processor is capable of computing the 12 x 14 Jacobian in approximately 70 microseconds and of solving the system of 12 equations using Gaussian elimination in approximately 5.5 microseconds.
At the University of Michigan, VLSI implementation is being considered for a numerical processor for robotics [57]. The processor is being designed to match the VLSI capabilities of the mid 80's and is intended for the computationally intensive tasks involved in real-time control of a robot arm. The numerical processor includes a pipelined 32-bit floating point adder unit, a pipelined 32-bit floating point multiplier unit, a 256 x 32 register file, and 32 x 32 input and output buffers to facilitate high-speed communication between processors. The device count for the chip is approximately 150K.

1.3 Organization
In chapter 2, control schemes, computer architectures, and the impact of VLSI technology are reviewed. Several parallel computing schemes, namely arithmetic pipelining, processor pipelining, and multiprocessor systems, are employed to solve the intensive computations in the Inverse Plant plus Jacobian control.
Chapter 3 describes the block diagram of the Robotics Processor
(RP). The Robotics Processor can perform the necessary vector and
matrix operations in Inverse Plant plus Jacobian Control. Four possible
bus configurations of the RP's data path are proposed and compared.
In chapter 4, a task graph is used to help schedule processes to the Robotics Processors. Several possible architectures for each particular control problem (Jacobian, Inverse Jacobian, or Inverse Dynamics) are proposed and compared. The comparisons are based upon important parameters such as total execution time, initiation rate, CPU utilization, and the total memory size needed in the RP.
Chapter 5 gives a general description of the major functional blocks of the Robotics Processor, while the detailed circuit designs are described in Appendix B. The two major circuit designs are the floating point adder/subtractor and the floating point multiplier. Parts of the chip that depend heavily on manufacturing capability and VLSI design tools (e.g. memory) have not yet been designed.
In chapter 6, the computer-aided design tools used at The Ohio State University are introduced. The methodology for designing a chip with the VLSI tools is presented. A 2-bit adder is used as an example to explain how to perform logic and circuit level simulation.
Chapter 7 gives a summary of the conclusions drawn from this dissertation. It also points out some problems inherent in designing the Robotics Processor chip and makes suggestions for future study in this area.
CHAPTER 2
VLSI COMPUTING STRUCTURE ON ROBOTICS APPLICATIONS
2.1 Introduction
In this chapter the control scheme, the computer architecture, and the impact of very-large-scale integrated circuit (VLSI) technology are reviewed. The first section points out what kind of control scheme is to be considered for our robotic mechanisms and introduces the concepts of Inverse Plant plus Jacobian control. The second section discusses various computer architectures for robotics applications. The third section describes the impact of VLSI on computing structures. The last section depicts current VLSI technology and future trends and limitations.
2.2 Inverse Plant Plus Jacobian Control
Although many different and complicated control schemes have been proposed for robotic mechanisms, few of them have been used successfully because they apply only linear feedback at the joints and do not consider the nonlinearities in the robotic mechanism. Some control schemes based on the kinematic and dynamic properties have the potential to improve the whole control system, but, since the equations for the kinematics and inverse dynamics are rather complex, especially as the number of degrees-of-freedom increases, it is difficult to implement these equations on a digital computer for real-time control.

When servoing and motion-planning are accomplished in cartesian, workspace coordinates, all that needs to be done is to transform the desired angular and translational rates of the gripper or "end effector" to obtain the required joint rates. This transformation is called the Inverse Jacobian. The Inverse Jacobian approach is better than reverse kinematics because it involves less complex equations and is more easily applied to general N degree-of-freedom robotic mechanisms.
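In practice the Inverse Jacobian step is usually carried out by solving the linear system J * theta_dot = x_dot rather than by forming the inverse of J explicitly. The sketch below is a hypothetical illustration (a generic Gaussian elimination with partial pivoting on a made-up 2x2 system, not the solution procedure of this dissertation, which is developed in Chapter 4).

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting: returns x with A x = b.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# joint rates from a desired end-effector rate: J theta_dot = x_dot
J = [[-1.0, -1.0], [1.0, 0.0]]
theta_dot = solve(J, [-1.0, 1.0])
```

For a 6 degree-of-freedom arm the system is 6x6, and the elimination and back substitution dominate the cost, which is why Chapter 4 compares several processor counts for this task.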
Jacobian control is based upon the kinematic properties of the mechanism and therefore does not account for the dynamic properties of the mechanism. Thus, the Inverse Plant feedforward controller has been proposed, in which it is assumed that the desired position, rate, and acceleration of a mechanism are given and that the joint actuator torques are to be determined. Recently, several approaches based on the
Newton-Euler method have been shown to be efficient enough to be implemented in real time. These involve a forward recursion from the base to the end effector to compute link accelerations, and then a backward recursion to compute the joint torques.
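The forward/backward structure of that recursion can be seen in a deliberately simplified model: a chain of point masses translating along a single axis, with gravity, rotation, and link geometry all ignored. This is only a sketch of the recursion pattern, not the spatial Newton-Euler algorithm itself, and the masses and accelerations are made-up numbers.

```python
def inverse_dynamics_1d(masses, qdd):
    # Simplified sketch of the Newton-Euler recursion structure for a
    # serial chain of point masses moving along one axis.
    n = len(masses)
    # forward recursion (base -> tip): link i's acceleration is the sum
    # of the joint accelerations from the base out to link i
    a = [0.0] * n
    acc = 0.0
    for i in range(n):
        acc += qdd[i]
        a[i] = acc
    # backward recursion (tip -> base): the force at joint i must
    # accelerate link i and every link outboard of it
    f = [0.0] * n
    force = 0.0
    for i in range(n - 1, -1, -1):
        force += masses[i] * a[i]
        f[i] = force
    return f

# two links of mass 2 kg and 1 kg, both joints accelerating at 1 m/s^2
tau = inverse_dynamics_1d([2.0, 1.0], [1.0, 1.0])
```

The two loops run in opposite directions over the links, which is exactly why assigning one processor per link to each recursion, as suggested in [20], pipelines so naturally.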
The combination of the Inverse Plant for feedforward control and Jacobian control for feedback has excellent potential for fast and accurate control. However, Inverse Plant, which is basically Inverse Dynamics, with the addition of Jacobian control is computationally intensive and therefore very time consuming when using conventional computing architectures. For example, the execution time for the Inverse Dynamics plus Jacobian control may exceed half of the total execution time [21]. Thus, special computing architectures are explored and designed to tackle this bottleneck of robotic system control.
2.3 Computer Architectures for Robotics
There are two approaches for speeding up the computations required by Inverse Dynamics plus Jacobian control schemes. One is to attach a very fast numeric processor to the host computer with the objective that this attached processor would perform all vector and matrix operations required by the robotic control algorithms. Thus, the attached processor can potentially relieve the host computer from performing large numbers of computations. In practice, however, the quantity of data required to be transferred between the attached processor and the host computer is often so great that an "I/O bottleneck" is created at the interface between the two, with the result that the potential substantial speed increase cannot be realized. The decision to shift such a number-crunching job to an attached processor depends largely on whether the shifted computations can be done in sufficiently large blocks to compensate for the "interface overhead", the relatively long time spent in transferring data from the host computer to the attached processor and back again. For example, one commercial attached processor, the FPS 120/164, can multiply two vectors faster than such conventional minicomputers as the PDP-10 only when the vectors have at least 60 components [15]. Most matrices required for robotic computations are rather small, e.g. 3x3 or 4x4. Thus it is entirely possible that when adding two matrices, the time required to transfer the two source matrices to the attached processor and the result matrix back to the host may exceed the computation time which would have been required if the operation had been performed by the host without the attached processor.
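This break-even argument can be phrased as a back-of-the-envelope model. All of the timing numbers below are invented for illustration only (they are not measurements of the FPS machine or any host): offloading pays only when the compute saving exceeds the transfer cost.

```python
def offload_pays_off(n_ops, t_host_op, t_attached_op, t_word, n_words):
    # Crude model of the "interface overhead" trade-off: the attached
    # processor wins only if its compute time plus transfer time beats
    # the host doing the work locally.
    host_time = n_ops * t_host_op
    attached_time = n_ops * t_attached_op + n_words * t_word
    return attached_time < host_time

# 3x3 matrix add: 9 operations, 27 words moved (two sources + result);
# with transfer as slow as a host operation, transfer dominates
small = offload_pays_off(9, 1.0, 0.1, 1.0, 27)

# a large block of work amortizes the same transfer cost easily
large = offload_pays_off(3600, 1.0, 0.1, 1.0, 120)
```

The model makes the conclusion of the paragraph quantitative: for tiny matrices the transfer term swamps the 10x compute advantage, so the RP approach of keeping the data resident in a mesh of processors avoids the bottleneck altogether.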
Another approach is to design a special purpose computer, with a parallel computing structure based on suitable algorithms, to solve particular problems. Some parallel computing structures are introduced and employed to solve the Inverse Dynamics plus Jacobian control. When designing an appropriate computing structure, simplicity, regularity, and communication locality should always be kept in mind [12].
A parallel computer can be divided into three architectural configurations [12]:
A pipeline computer performs overlapped computations to
exploit temporal parallelism. An array processor uses
multiple synchronized arithmetic logic units to achieve
spatial parallelism. A multiprocessor system achieves
asynchronous parallelism through a set of interactive
processors with shared resources (memories, database, etc.).
Three pipeline schemes used in a pipelined computer are:
(1) Arithmetic pipelining, where the arithmetic logic units are segmented for pipeline operations in various data formats. For example, four-stage pipelines are used in the Star-100, and three-stage pipelines are designed in both the Weitek WTL 1033 floating point adder and the WTL 1032 floating point multiplier [56]. Because of the independence of the elements in vectors or in matrices, a pipeline structure is very suitable for vector or matrix operations. Therefore, this arithmetic pipelining scheme is used in the arithmetic units of the Robotics Processor being designed and developed at The Ohio State University.
(2) Instruction pipelining, where the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. The Intel 8086 is one example.
(3) Processor pipelining, where the same data stream is processed by a cascade of processors. This processor pipelining scheme is employed in solving the Inverse Dynamics problem with Robotics Processors (explained in section 4.4).
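All three schemes trade latency for throughput in the same first-order way: an n-item stream through a k-stage pipeline clocked at one cycle per stage completes in k + n - 1 cycles instead of k*n. A minimal sketch of that accounting (the stage and item counts below are arbitrary, and hazards and stalls are ignored):

```python
def pipeline_time(n_items, n_stages, cycle_time=1):
    """Cycles for n_items to flow through an n_stages pipeline:
    n_stages cycles to fill, then one result per cycle."""
    return (n_stages + n_items - 1) * cycle_time

def sequential_time(n_items, n_stages, cycle_time=1):
    """Cycles with no overlap: every item pays the full latency."""
    return n_items * n_stages * cycle_time

# 100 items through 3 stages: the speedup approaches the stage count.
print(pipeline_time(100, 3))     # 102
print(sequential_time(100, 3))   # 300
```

As n grows, the ratio sequential/pipelined approaches the number of stages, which is why pipelining pays off on long streams of independent vector elements.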
An array processor handles a single instruction, multiple data (SIMD) stream. The original motivation for developing SIMD array processors was to perform parallel computations on vector or matrix types of data. The array processor has been used widely and efficiently in many different fields, such as the fast Fourier transform, matrix inversion, parallel sorting, and the solution of partial differential equations.
Various interconnection networks have been suggested for array processors, such as the mesh network, the n-cube, the barrel shifter, and the shuffle-exchange network. One typical example of an array processor is the Illiac IV, which was connected in a mesh network and primarily designed for matrix manipulation and solving partial differential equations.
As mentioned above, the array processor structure is very useful as an attached processor performing operations on large matrices. However, because of the inherently small size of the vectors and matrices encountered in robotic systems, it is not productive to apply an array processor to the task of performing vector or matrix operations for robotics applications. However, the concept of the array processor can be used to develop parallel/pipeline algorithms for Inverse Dynamics and Jacobian computations. The Robotics Processors (RP) are to be connected in a mesh network to achieve a parallel and pipeline structure. For example, in the Inverse Dynamics application, the RPs are pipelined, but the overlap (or parallelism) between the RPs reaches 80% (explained in section 4.4). The system throughput is expected to be much improved as compared to using a high speed attached processor that only performs simple vector and matrix operations.
A multiprocessor system is controlled by one operating system which provides interaction between processors and their programs at the process level [12]. There are two architectural models for a multiprocessor system. One is a tightly coupled multiprocessor, where all processors communicate through a shared main memory. The other is a loosely coupled multiprocessor, where each processor has its own input-output devices and a large local memory storing most of the instructions and data. It is usually efficient when the interactions are minimal. Tightly coupled systems, however, can tolerate a higher degree of interaction between processors without significant deterioration in performance. Three different interconnection networks have been commonly used: the time-shared common bus, the crossbar switch network, and multiport memories.
Many multiprocessor systems have been designed and constructed during the last two decades. One example is the C.mmp, which consists of 16 computer modules connected to 16 global, shared memory banks via a central crossbar switch. Another example is a multiprocessor system with five units of the PDP-11/03, which was designed and implemented for the Hexapod Vehicle developed at The Ohio State University [18]. Currently, a multiprocessor system consisting of 15 Intel iSBC's, communicating through the MULTIBUS and shared memories, is being used for an Adaptive Suspension Vehicle (ASV) being developed at The Ohio State University. The multiprocessor system concept can be found in the overall computer system of the entire control system in section 4.1.
2.4 Impact of VLSI Technology on Computer Architectures
Because of the rapid advent of VLSI technology, several new architectures implementing parallel algorithms directly in hardware are available. For example, the systolic array offers substantial performance gains via massive parallelism and regular local communication [25].
Another example is the wavefront array processor (WAP), which provides a powerful tool for the high speed execution of a large class of matrix operations and related algorithms with widespread applications [26]. The major difference between the systolic array and the WAP is that the systolic array requires global synchronization while the WAP doesn't.
Basically, there are two different architectural directions for VLSI based computers [2]. The first involves putting more and more functions on a chip and making it run faster and faster. For example, a single-chip computer integrates the CPU, memory, and input/output circuitry.
The other takes a fresh look at new technology and many recently emerged computer applications. It considers the interconnection of VLSI chips to form highly parallel computing systems. Such computing systems have structural properties that are suitable for VLSI implementation; the systolic array and the WAP are examples.
The key attributes of VLSI computing structures are described below [12]:
1. Simplicity and regularity: If a structure can be partitioned into a
   few types of building blocks which are used repetitively with
   simple interfaces, great savings can be achieved. This is
   especially true for VLSI designs, where a single chip comprises
   hundreds of thousands of components.
2. Concurrency and communication: Massive parallelism can be achieved
   if the algorithm is designed to introduce high degrees of
   pipelining and multiprocessing. When a large number of processing
   elements work simultaneously, communication becomes significant,
   especially with VLSI technology, where routing costs dominate the
   power, time, and area. Locality of interprocessor communication is
   a desired feature in any processor array.
3. Computation intensiveness: VLSI processing structures are suitable
   for implementing compute-bound algorithms rather than I/O-bound
   computations because VLSI packaging is constrained to a limited
   number of I/O pins. A VLSI device must balance its computation
   with its I/O bandwidth.
2.5 VLSI Technology
Since the first silicon transistor was made in the mid-1950's, IC technology has been improving rapidly. Chip complexity doubled every year after 1959. In 1973, complexity reached nearly 8,000 components per chip. Since then, complexity has doubled every 1.5 to two years. The complexities (transistor counts) of the five generations of ICs (small, medium, large, very large, and ultra large scale integration, or SSI, MSI, LSI, VLSI, and ULSI) are [3]:

    SSI            2 - 64
    MSI           64 - 2,000
    LSI        2,000 - 64,000
    VLSI      64,000 - 2,000,000
    ULSI   2,000,000 - 64,000,000
Today's technology is in the VLSI range. For example, HP's FOCUS CPU and Bell's BELLMAC-32 microprocessor both contain 450,000 transistors.
Most ICs are made of bipolar, N-channel metal-oxide-semiconductor (NMOS), complementary MOS (CMOS), or GaAs devices. Different devices are used for different applications. For example, TTL or ECL (both made of bipolar transistors) is used when high speed is required. If still higher speed is required, then GaAs is probably the candidate; its gate delay is less than 120 picoseconds, four to six times faster than silicon devices [4]. NMOS has the highest density and is suitable for making memories. CMOS has low power dissipation characteristics. As more and more components are put on one chip, heat removal becomes a serious problem. Thus for a memory chip containing over one megabit, CMOS will replace NMOS [5].
Now CMOS and bipolar transistors can be put on the same silicon wafer [6]. Logic cells that combine CMOS FETs with bipolar transistors operate at subnanosecond ECL speeds, but dissipate only the fractional milliwatt power levels of CMOS circuits. Each cell includes a standard CMOS logic gate buffered by a totem-pole output driver of the type popular in TTL.
Wafer-scale integration (WSI) was first tried in the 1960's [7]. One wafer, from 2 to 8 inches in diameter, can contain 25 to 100 Intel 8086's. WSI has the advantage of eliminating interchip connections. As a result, it has a low possibility of generating noise along interchip connection wirings, faster throughput, and higher reliability, and it needs less power since no driving power is required. But WSI technology is not yet mature because of the problems of heat removal (1000 watts/wafer) and very low yield.
Very-high-speed IC (VHSIC) is a seven-year project supported by the Department of Defense. During Phase I (May 1981 to Apr. 1984), microcircuits were designed and produced with a minimum feature size of 1.25 micrometers. In Phase II (May 1984 to Dec. 1986), the goal was to design and produce microcircuits with a minimum feature size of 0.5 micrometer, clock rates of 100 MHz, and circuit complexities in excess of 100,000 logic gates per chip [8][9].
Although IC technology will continue to advance rapidly, there are several factors that constrain the integration level of future silicon IC technology. These factors can be categorized into physical, technological, and complexity limits [2]. Physical limits include the velocity of light, the uncertainty principle, entropy (irreversibility), and thermal energy. It has been proposed that the final limit on device size will be 0.3 micrometer [1]. Technological limits concern fabrication techniques, material constants, and electrical parameters. Complexity limits relate to the human inability to design a circuit involving a very large number of components.
2.6 Summary
In this chapter, the Inverse Plant plus Jacobian Control scheme is introduced. Then different computer architectures for robotics applications are depicted. The trends and the impact of VLSI technology on computers are further discussed. Finally, the current VLSI technology, future prospects, and the final limitations of the feature size are described.
CHAPTER 3
ARCHITECTURE OF THE ROBOTICS PROCESSOR
3.1 Introduction
The RP is primarily designed for solving problems involving
Jacobian, Inverse Jacobian, and Inverse Dynamics. It will be shown in this chapter that the RP has great potential for use in other applications involving vector or matrix operations. The RP is capable of performing most vector and matrix operations, such as vector addition, vector multiplication with a scalar constant, vector inner product, vector cross product, matrix multiplication with a vector, and matrix multiplication with a matrix.
The block diagram of the final version of the RP is first introduced. Next, four of the design alternatives which were considered during the evolution of the final architectural design are given and discussed. Finally, the characteristics of the RP are summarized.
3.2 Block Diagram of the RP
Figure 3.1 shows the block diagram of the RP, consisting of the
Clock Generator (CG), Bootstrap Unit (BU), Format Converter for the BU
(FCB), Control RAM (CRAM), Sequencer (SEQ), Microcode Register (MCR),
Register File (RF), Floating Point Adder/subtractor (FPA), Floating
Point Multiplier (FPM), Format Converter East (FCE), Format Converter
West (FCW), Format Converter North (FCN), and Format Converter South (FCS).

Figure 3.1 Block Diagram of the Robotics Processor

The CG generates all clock signals needed in the RP. The System Clock (SYS_CLK) is tentatively selected to be 16 MHz. The Pipeline Clock (P_CLK) is used to clock the pipeline registers in the FPA and FPM. The frequency of P_CLK is defined to be one sixteenth of the SYS_CLK, i.e., 1 MHz. The detailed circuit for the CG is described in section 5.2.
During initialization, the host computer loads the appropriate application microprograms and constants into the RPs. The loading process is initiated by the bootstrap signal (BT), which is asserted by the host computer. When the RPs begin to execute their microprograms, the host computer intermittently sends the necessary parameters, such as the desired position, rate, and acceleration of the particular mechanism, to the RPs and then receives results, such as joint actuator torques, from the RPs.
As shown in figure 3.2, the microcode format consists of 40 bits.
The most significant bit, the opcode bit, specifies the interpretation of the remaining bits of the microinstruction. When this bit is 0, an arithmetic operation can be specified, and when the opcode bit is 1, branching and I/O operations are indicated. There are six address fields for normal arithmetic operations and four address fields for I/O.
Each address field has 6 bits because the RF has 64 words. The OP bit specifies either addition or subtraction; OP = 0 produces addition and
OP = 1, subtraction. Control bits, WM (Write Multiplier's result) and
WA (Write Adder's result), determine whether the result of the FPM or
FPA is to be written into the RF. Control bits, EE (Enable East output) and ES (Enable South output), are used to buffer the data going out of the RP from the east or south side. Control bits, EW (Enable West input) and EN (Enable North input), are used to buffer the outside data coming into the RP from the west or north side. These four control bits are also used to gate the four address fields to the address buses, AA and AB.

Figure 3.2 Microinstruction Format

    Arithmetic format (I/O = 0):

    I/O  OP  WM  WA  ADDR1  ADDR2  ADDR3  ADDR4  ADDR5  ADDR6

    I/O  : I/O = 0
    OP   : OP = 0 for addition, OP = 1 for subtraction
    WM   : Write Multiplier's result
    WA   : Write Adder's result
    ADDR1: Address for multiplier operand A
    ADDR2: Address for multiplier operand B
    ADDR3: Address for multiplier result R
    ADDR4: Address for adder operand A
    ADDR5: Address for adder operand B
    ADDR6: Address for adder result R

    Branch and I/O format (I/O = 1):

    I/O  BR  EE  ES  EW  EN  ---  ADDR3  ADDR4  ADDR5  ADDR6

    I/O  : I/O = 1
    BR   : BRanch = 0
    EE   : Enable East output
    ES   : Enable South output
    EW   : Enable West input
    EN   : Enable North input
    ADDR3: Address for East output
    ADDR4: Address for South output
    ADDR5: Address for West input
    ADDR6: Address for North input
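The arithmetic form of figure 3.2 can be modeled as a packed 40-bit word: one bit each for I/O, OP, WM, and WA, followed by six 6-bit register-file addresses. The packing below follows the field order of the figure, but the chip's actual bit assignment is an assumption:

```python
def encode_arith(op, wm, wa, addrs):
    """Pack an arithmetic microinstruction (I/O bit = 0) into a 40-bit
    word: 1+1+1+1 control bits followed by six 6-bit RF addresses."""
    assert len(addrs) == 6 and all(0 <= a < 64 for a in addrs)
    word = 0                      # I/O = 0 in the most significant bit
    word = (word << 1) | op       # OP: 0 = add, 1 = subtract
    word = (word << 1) | wm       # WM: write multiplier result
    word = (word << 1) | wa       # WA: write adder result
    for a in addrs:               # ADDR1 .. ADDR6, 6 bits each
        word = (word << 6) | a
    return word

def decode_addrs(word):
    """Recover the six 6-bit address fields from the low 36 bits."""
    return [(word >> shift) & 0x3F for shift in range(30, -1, -6)]

# R3 <- R1 * R2 on the FPM and R6 <- R4 + R5 on the FPA, both written.
w = encode_arith(op=0, wm=1, wa=1, addrs=[1, 2, 3, 4, 5, 6])
print(w.bit_length() <= 40)   # True: the word fits in 40 bits
print(decode_addrs(w))        # [1, 2, 3, 4, 5, 6]
```

The six-bit address width follows directly from the 64-word register file mentioned in the text.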
When a microprogram is being executed, the SEQ generates the next address to the CRAM. The MCR is a microinstruction latch clocked by P_CLK phase-1 (Pφ1). The BU generates the necessary control signals and addresses to the CRAM during the initialization stage. Section 5.3 has more detailed descriptions of the BU.
There are two output ports, one on the east and the other on the south. Also, there are two input ports, one on the west and the other on the north. To limit the number of pins, the external bus on each side is 16 bits wide, while the three internal buses, Bus A (BA), Bus B (BB), and Bus C (BC), are all 32 bits wide. Therefore, four format converters are required to change 32-bit data into two 16-bit words or vice versa. Detailed circuits for the format converters are described in section 5.3.
As mentioned in section 2.3, the RPs are connected in a mesh network to develop parallel/pipeline algorithms for the Jacobian and Inverse
Dynamics analyses. 32-bit data must be transferred between adjacent
RPs. Since the data paths external to the RPs are 16 bits wide, two
P_CLK cycles are required to complete one 32-bit data transfer from one
RP to another. This is necessitated partly because of the required format conversions from 32 to 16 bits on the transmit side, and 16 to 32
bits on the receive side. Another factor requiring a double clock period is that the output pad drivers designed into the chip incorporate multiple amplifier stages to supply the current sourcing and sinking capability to drive the relatively large electrical capacitance of the external interconnections. Since 32-bit data can be sent from both the east and south sides simultaneously, and since each requires two P_CLK cycles, the net effect is that the maximum transfer rate is one 32-bit transfer per P_CLK. Here an assumption is made that the intercommunication between adjacent RPs is controlled by precisely synchronized microprograms. Specifically, when one is transmitting, the other must be receiving. If one consumes the received data faster than the other produces it, the consumer must wait idly until the producer is ready to send the next data. On the other hand, if the producer produces the data faster than the consumer consumes it, the producer must wait idly until the consumer is ready to receive.
Since the time to transmit or receive data is well known, it is not necessary to have handshaking signals between the RPs.
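The 32-to-16-bit format conversion on each port amounts to splitting a word into two halfwords, one per P_CLK cycle, and reassembling them on the receiving side. A minimal sketch (the high-half-first ordering is an assumption, not a documented property of the RP bus):

```python
def to_halfwords(x):
    """Split a 32-bit word into two 16-bit words for transmission,
    high half first (ordering assumed)."""
    x &= 0xFFFFFFFF
    return (x >> 16) & 0xFFFF, x & 0xFFFF

def from_halfwords(hi, lo):
    """Reassemble the 32-bit word on the receiving side."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

word = 0x40490FDB             # IEEE-754 single-precision pi
hi, lo = to_halfwords(word)   # two P_CLK cycles, one halfword each
print(from_halfwords(hi, lo) == word)   # True: round trip is lossless
```

Because the split and reassembly are fixed two-cycle operations, both sides can be scheduled purely by microprogram timing, which is exactly why no handshaking is needed.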
The RF is, in fact, a three-port RAM. Two operands are read onto BA and BB at the same time, but only one result can be stored in the RF at a time. Two address buses, Address bus A (AA) and Address bus B (AB), and the Write (WR) signal are not shown in the block diagram. In order to access the contents of two different RF locations at the same time, two address decoders are needed in the RF. The capacity of the RF is tentatively set at 64 words, and each word is 32 bits wide, since the standard IEEE single-precision floating point format is used.
Both the FPM and FPA have three pipeline stages. Each stage is clocked by P_CLK phase-1 and phase-2. During P_CLK phase-1 (Pφ1), four operands are read from the RF and latched in the first pipeline register of the FPM and FPA by a time multiplexing scheme. Since there are three pipeline stages in the FPM and FPA, it takes three P_CLK cycles to propagate from input operands to output results. During P_CLK phase-2 (Pφ2), the results of the two arithmetic units are stored in the RF by the time multiplexing scheme. Thus, a total of four P_CLK cycles are required to complete a floating point addition/subtraction and multiplication. Detailed circuit designs for the FPA and FPM are described in sections 5.5 and 5.6, respectively. Since floating point addition and multiplication are executed simultaneously, a throughput of 2 million floating point operations per second (FLOPS) can be obtained once the three pipeline stages in the FPM and FPA are filled.
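The cycle accounting above can be restated directly; this is arithmetic on the figures in the text, not a simulation of the chip:

```python
P_CLK_HZ = 1_000_000   # 1 MHz pipeline clock (one sixteenth of 16 MHz)
STAGES = 3             # pipeline stages in each of the FPA and FPM
UNITS = 2              # adder and multiplier operate simultaneously

# Latency: one cycle to fetch operands plus three stages = 4 P_CLK cycles.
latency_cycles = 1 + STAGES
print(latency_cycles)          # 4

# Throughput once full: each unit retires one result per P_CLK cycle.
flops = UNITS * P_CLK_HZ
print(flops)                   # 2000000, i.e. 2 MFLOPS
```

The four-cycle latency is what later forces the microprogrammer to wait before consuming a result, while the 2 MFLOPS figure holds only when both pipelines stay full.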
3.3 Evolution of the Architectural Design of the RP Data Paths
In this section, four possible bus configurations are proposed and compared. They are the single-bus, two-bus, and three-bus configurations, and the cross-bar network.
3.3.1 Single-Bus Configuration
Figure 3.3 shows the single-bus configuration. The input bus and output bus of the FPA and FPM are connected. To obtain benefits from pipelining, accessing the operands from and storing the results in the
RF should be completed in one P_CLK cycle. For example, during Pφ1, the operands are read onto the bus and latched at the first stage of the FPA and FPM. During Pφ2, the result is stored back into the RF.

Figure 3.3 Single-Bus Configuration

Figure 3.4 Two-Bus Configuration
The advantages of the single-bus configuration are that it is simple and that chip area is saved since only one bus is used. There are, however, some disadvantages. First, without precharging on the internal bus, the transition from low to high voltage takes more time than from high to low. This results in asymmetry between the rising and falling edges and is usually not acceptable to design engineers. Also, the transfer rate on the internal bus becomes slower. A second disadvantage of this configuration is that a very high speed RF with an access time of less than one quarter of the Pφ1 is required, because during Pφ1 four operands need to be read out of the RF to the FPA and FPM. For example, if the Pφ1 duration is 500 ns, the access time of the RF should be less than 125 ns.
3.3.2 Two-Bus Configuration
Figure 3.4 shows the two-bus configuration. One more bus is added to allow the buses to be precharged before they are activated. This eliminates the slowing down of data transfers on the internal buses. However, the access time of the RF must still be less than one quarter of the Pφ1, i.e., 125 ns.
3.3.3 Three-Bus Configuration
The three-bus configuration is shown in figure 3.1. Both BA and BB are precharged during Pφ2, while BC is precharged during Pφ1. Since only two operands, instead of four as in the previous two cases, are read from the RF during Pφ1, the allowable access time of the RF is doubled to 250 ns.
There are two disadvantages with the three-bus configuration.
First, it requires one more bus than the two-bus configuration and thus occupies more chip area. Also, the RF becomes slightly more complicated and occupies more chip area since it is a three-port memory.
3.3.4 Cross-Bar Network
Data path configurations range from a single bus, where only one data word can be transferred at a time, to a full crossbar switch, where all possible connections can be made simultaneously. Figure 3.5 shows the cross-bar network configuration, similar to that of the FPS AP-120B [14]. To read four operands at the same time, two three-port register files, RF1 and RF2, are required. The crossbar network is implemented with six dedicated buses, four to supply operands to the arithmetic units and two to carry results away from the FPA and FPM and to swap data between RF1 and RF2. All cross points are closed by control signals. Operands in RF1 and RF2 can be sent to the FPA and FPM by closing the proper cross points in the network.
There are two advantages of the cross-bar network configuration.
First, it has the most flexible configuration. Second, all the buses are precharged at the same time, say during Pφ2. The access time of the RF can be almost as large as the Pφ1 duration, 500 ns, since four operands are simultaneously read from RF1 and RF2 to the FPA and FPM during Pφ1. There are, however, two disadvantages. First, the network occupies too much chip area since every bus is 32 bits wide.
Figure 3.5 Cross-Bar Network Configuration

Second, it is rather complicated to control the cross points since there are too many possible combinations of data flow paths. Moreover, errors are likely to be made since programmers must keep the network in mind while writing the microprogram.
The above comparisons show that the cross-bar network configuration is overly complicated and occupies too much chip area, whereas the single-bus configuration does not allow for precharging of the bus; therefore, these two possibilities were eliminated. In the case of the two-bus configuration, some uncertainty exists as to whether the RF can be designed with an access time of less than one quarter of the Pφ1, i.e., 125 ns, possibly necessitating a reduction of the system clock frequency. Even though a greater chip area will be required for the three-bus configuration, it will quite likely allow the access time to be one half of the Pφ1, 250 ns; thus the three-bus configuration is tentatively chosen for the data path.
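The register-file access-time budget that drives this comparison can be tabulated directly from the number of operand reads that must share the bus(es) during Pφ1 (500 ns at the 1 MHz P_CLK assumed earlier):

```python
P_PHI1_NS = 500  # assumed duration of P_CLK phase-1

# Time available for one RF access = phase duration / operand reads
# serialized onto the bus(es) during that phase.
configs = {
    "single-bus": P_PHI1_NS / 4,   # four operands on one bus
    "two-bus":    P_PHI1_NS / 4,   # still four operand reads per phase
    "three-bus":  P_PHI1_NS / 2,   # only two operands come from the RF
    "cross-bar":  P_PHI1_NS / 1,   # all four operands read in parallel
}
for name, budget in configs.items():
    print(f"{name:10s} {budget:.0f} ns")
```

The 250 ns budget of the three-bus configuration is what makes it the workable compromise: twice the margin of the two-bus case without the area and control cost of the crossbar.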
3.4 Summary
In this chapter, the architecture of the RP was described. In the second section, the block diagram of the RP was given, and in the third section, four possible data path configurations were proposed and compared. The three-bus configuration was found to be the best choice. Some of the specifications are based upon knowledge of current VLSI technology. For example, the choice of 16 MHz for the RP System Clock is comparable with the 20 MHz clock used in the VLSI version of Digital Equipment Corporation's VAX, which is fabricated using 3 micron NMOS technology.
The characteristics of the RP are summarized as follows:
1. The RP contains a Floating Point Adder/subtractor (FPA) and a
   Floating Point Multiplier (FPM) to execute floating point
   addition/subtraction and multiplication. No divider is included
   since division is not required in the vector or matrix operations
   mentioned in the first section.
2. IEEE single-precision floating point format is used.
3. The FPA and the FPM can operate simultaneously to improve system
throughput.
4. A pipeline scheme is used to increase the speed of vector and
   matrix operations because it exploits the parallelism found
   between computations on separate elements in a vector or matrix
   [14].
5. There are three pipeline stages in the FPA and FPM. The
   rationale for the choice of three pipeline stages is
   (a) the shorter the pipeline, the better the performance with
       short vectors [15]. Some vector processors, such as the
       Star-100, have much longer pipelines and perform relatively
       poorly on short vectors [14].
   (b) It is easy to partition the FPA and FPM into 3 stages, where
       each has almost the same delay time.
6. There are four unidirectional I/O ports, two for inputs and two
for outputs.
7. Internal buses are 32 bits wide, while the four external buses are
   16 bits wide, in order to limit the number of pins.
8. A format converter is needed on every I/O port to convert 32-bit
   data to two 16-bit words or vice versa.
9. No handshaking signals between Robotics Processors are needed
   since the timing to transfer parameters from one RP to another is
   known. Thus data transfers can be handled by microprogramming.
10. To speed up access to the contents of the RF, the internal
    buses, BA, BB, and BC, are precharged before they are activated.
11. To simplify hardware in the RP, the RP's microinstructions are not
    pipelined as are the operands in the FPA and FPM, although this
    makes microprogram coding slightly more complicated. As a result,
    programmers should keep in mind that the results of
    addition/subtraction and multiplication are not available until 4
    P_CLK cycles later.
12. The system clock is tentatively set to 16 MHz.
13. Because format conversion is necessary on each I/O port, two
    P_CLK cycles are required to transfer 32-bit data to an
    adjacent RP.
14. The RP must be programmable for different applications. The
    application programs are all written in microcode. The reasons
    for this are
    (a) the application algorithm is usually fixed, thus also
        fixing the application program.
    (b) Usually the application microprogram is relatively short.
        For example, the most complicated application
        microprogram, Inverse Dynamics, consists of about 100
        microinstructions. Therefore, there is no need to
        develop a microcode compiler.
    (c) Programmers are permitted direct access to each of the
        arithmetic units, thereby permitting maximum utilization of
        the potential parallelism.
15. The parallel/pipeline computing structure allows the execution of
    any one of the Jacobian, Inverse Jacobian, and Inverse Dynamics
    computations in one millisecond.
The RP's characteristics are summarized in the 15 preceding items. Some specifications are made according to current VLSI technology. However, designing the RP chip will probably take one or two more years, and by that time VLSI technology will have improved and the channel length will likely have decreased to one or two microns. Thus, some specifications of the RP may need to be modified. For example, the SYS_CLK may be increased. Also, if an RF with faster speed can be fabricated, the number of internal buses can be reduced from three to two or even one. But some of the basic blocks of the RP, such as the FPA and FPM, will probably not change.
CHAPTER 4
APPLICATIONS OF THE ROBOTICS PROCESSOR
4.1 Introduction
As was mentioned in section 2.2, the combination of the Inverse
Plant for feedforward control and Jacobian Control for feedback has excellent potential for fast and accurate control. A control block diagram with feedforward Inverse Plant plus feedback Jacobian Control is shown in figure 4.1. The major data acquisition, computation, and control modules required to implement the Inverse Plant plus Jacobian
Control are shown in figure 4.2. Detailed explanations of control schemes and the meaning of their control parameters can be found in
[21].
In this chapter, several special purpose dedicated attached processors for the Inverse Plant plus Jacobian are developed. These attached processors are based on the Robotics Processor, which is being developed with state-of-the-art VLSI technology at The Ohio State
University. These special purpose dedicated processors will be attached to a host microcomputer, and multiprocessor system concepts, described in section 2.3, will be used to interconnect these multiple processors for real-time control. The overall computer system for the entire control system is shown in figure 4.3.
In this chapter, several possible architectures for each particular control problem, e.g., Jacobian, Inverse Jacobian, and Inverse Dynamics, are proposed and compared based upon some important parameters such as total execution time, initiation rate, CPU utilization, and the total required memory size in the RP.

Figure 4.1 Block Diagram of Inverse Plant Plus Jacobian Control [21]

Figure 4.2 Major Data Acquisition, Computation, and Control Modules for Inverse Plant Plus Jacobian Control [21]

Figure 4.3 Architectural Concept for Implementation of Advanced Real-Time Control Algorithms for Robots
4.2 Jacobian
The Jacobian relates the six components of the velocity of the end effector, including both linear and rotational, to the rates of each of the joint angles. The Jacobian approach has the advantage of producing simpler equations than inverse kinematics. The equation relating the joint angle rates (θ̇) to the angular velocity (ω) and translational velocity (v) of the end effector is given as follows:

    [ ω ]
    [   ] = J(θ) θ̇                                              (2.1)
    [ v ]

The J matrix is of dimensions 6xN and is given as:

        [ N+1γ_1   N+1γ_2   . . .   N+1γ_N ]
    J = [                                  ]                     (2.2)
        [ N+1β_1   N+1β_2   . . .   N+1β_N ]

where the leading superscript denotes the reference coordinate frame. The columns N+1γ_i and N+1β_i (i = 1, 2, ..., N) are derived as follows:

    N+1U_{N+1} = I                                               (2.3)

    N+1U_{i-1} = N+1U_i (i-1U_i)^T,     i = N+1, N, ..., 2, 1    (2.4)

    N+1γ_i = N+1U_{i-1} [0 0 1]^T,      i = N, N-1, ..., 2, 1    (2.5)

    N+1r_{N+1} = 0                                               (2.6)

    N+1r_{i-1} = N+1r_i - N+1U_i (iP_i),  i = N+1, N, ..., 2, 1  (2.7)

    N+1β_i = N+1γ_i x (-N+1r_{i-1}),    i = N, N-1, ..., 2, 1    (2.8)

The 3x3 rotational transformation matrix, i-1U_i, and the 3x1 vector, iP_i, are defined as:

             [ cos θ_i   -sin θ_i cos α_i    sin θ_i sin α_i ]
    i-1U_i = [ sin θ_i    cos θ_i cos α_i   -cos θ_i sin α_i ]   (2.9)
             [    0           sin α_i             cos α_i    ]

    iP_i = [ a_i   d_i sin α_i   d_i cos α_i ]^T                 (2.10)

The detailed definitions of the four parameters, θ_i, d_i, a_i, and α_i, can be found in [22][23]. Here only the revolute joint is considered. The equations for a sliding joint are less complicated, but similar results are obtained.
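The recursions (2.3)-(2.8) can be sketched in a few lines. This is an illustrative reading of the equations as reconstructed above, not the microcoded RP implementation; it assumes the end-effector frame coincides with the last link frame (no tool transform), so the i = N+1 steps reduce to the identity and zero seeds:

```python
from math import cos, sin

def rot(theta, alpha):
    """(i-1)U_i of eq. (2.9) for one revolute joint."""
    ct, st, ca, sa = cos(theta), sin(theta), cos(alpha), sin(alpha)
    return [[ct, -st * ca,  st * sa],
            [st,  ct * ca, -ct * sa],
            [0.0, sa,       ca]]

def link(d, a, alpha):
    """iP_i of eq. (2.10)."""
    return [a, d * sin(alpha), d * cos(alpha)]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def matvec(A, v):
    return [sum(A[r][k] * v[k] for k in range(3)) for r in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def jacobian(dh):
    """Columns [gamma_i; beta_i] of eq. (2.2), built with the
    recursions (2.4)-(2.8).  dh holds one (theta, d, a, alpha) tuple
    per revolute joint."""
    N = len(dh)
    U = [[float(r == c) for c in range(3)] for r in range(3)]  # seed of (2.3)
    r_vec = [0.0, 0.0, 0.0]                                    # seed of (2.6)
    cols = [None] * N
    for i in range(N, 0, -1):
        theta, d, a, alpha = dh[i - 1]
        Ui = rot(theta, alpha)
        p = matvec(U, link(d, a, alpha))               # U here is U_i
        r_vec = [r_vec[k] - p[k] for k in range(3)]    # (2.7)
        U = matmul(U, [[Ui[c][r] for c in range(3)]    # (2.4): U_i (i-1U_i)^T
                       for r in range(3)])
        gamma = [U[0][2], U[1][2], U[2][2]]            # (2.5): U_{i-1}[0 0 1]^T
        beta = cross(gamma, [-x for x in r_vec])       # (2.8)
        cols[i - 1] = gamma + beta
    return cols

# One revolute joint with link length 2: rotating it moves the end
# effector along the +y axis of its own frame at radius 2.
col = jacobian([(0.7, 0.0, 2.0, 0.0)])[0]
print([round(x, 6) + 0.0 for x in col])   # [0.0, 0.0, 1.0, 0.0, 2.0, 0.0]
```

Note that the loop runs from joint N down to joint 1, mirroring the backward recursions of (2.4) and (2.7), and that each column needs only local quantities from the previous step, which is what makes the computation amenable to a pipeline of RPs.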
4.2.1 Complexity of Vector Operations
The independence of the FPA and FPM, and the pipeline structure
in each arithmetic unit speed up vector operations significantly. The
exact RP computation times for the necessary vector and matrix operations can be determined and are shown in table 4.1, where V is a
3x1 vector and M is a 3x3 matrix.
Table 4.1
Computation Times for Necessary Vector and Matrix Operations

| Vector/Matrix Operation | No. of P_CLK cycles | Computation time (microseconds) | Complexity |
| V + V                   | 6                   | 6                               | 1          |
| V x c                   | 6                   | 6                               | 1          |
| V . V                   | 13                  | 13                              | 2          |
| V x V                   | 13                  | 13                              | 2          |
| M V                     | 17                  | 17                              | 3          |
| M M                     | 35                  | 35                              | 6          |
| Transfer V              | 6                   | 6                               | 1          |
The time required for computation or for transferring data between processors can be represented as a complexity. Using complexity instead of time expressed in microseconds allows the results to be independent of the system clock. The complexity of the simplest operation, vector addition, is normalized to the value 1, and the complexity of every other operation is computed relative to it. Since vector addition requires 6 microseconds, the normalizing factor by which all other times are divided is 6. In short, complexity = computation time / 6. The computation time for multiplying two 3x3 matrices is 35 microseconds, which divided by 6 gives a complexity of 6. A reservation table is used to show how successive pipeline stages are utilized (or reserved) for a specific function evaluation in successive pipeline cycles. The reservation tables from which the computation times of these operations are obtained can be found in Appendix A.1.
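The normalization can be checked directly. The cycle counts below are those of table 4.1; dividing by the 6-cycle vector addition and rounding to the nearest integer reproduces the complexity column used throughout this chapter.

```python
# P_CLK cycle counts from table 4.1
CYCLES = {
    "V + V": 6,       # vector add
    "V x c": 6,       # scalar multiply
    "V . V": 13,      # inner product
    "V x V": 13,      # cross product
    "M V": 17,        # matrix-vector multiply
    "M M": 35,        # matrix-matrix multiply
    "transfer V": 6,  # move a 3x1 vector between processors
}

def complexity(cycles, unit=6):
    # complexity = computation time / time of the simplest operation
    return round(cycles / unit)

table = {op: complexity(c) for op, c in CYCLES.items()}
```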
4.2.2 Task Graph
Task graphs are used to aid the scheduling of processes onto more than one processor. Construction of optimal schedules is NP-complete in many cases. A detailed definition of NP-completeness can be found in [62 p. 501-558]. The term NP-complete implies that an optimal solution may be very difficult to compute in the worst possible input case. However, construction of suitable schedules, that is, computing a reasonable answer for a typical input case, is not NP-complete [12 p. 598]. The task graph for the Jacobian with one RP (P = 1) is shown in figure 4.4. It is obtained by calculating the complexity of each Jacobian equation; the complexity of each vector or matrix operation can be found in table 4.1.

[Figure 4.4 Task Graph for Jacobian with P = 1]

The circles in figure 4.4 represent the equations (2.4), (2.7), (2.8), and (2.9). The arrows connecting circles indicate not only the sequence of application of the equations, but also that data resulting from one equation is to be operated upon by the next indicated equation. The number in a circle represents the complexity of that particular equation, and the number adjacent to an arrow represents the complexity of the I/O transmission. For example, the circle representing equation (2.4), which computes the U matrix, indicates that the computation complexity of this operation is 6. The "3" on the arrow represents the I/O complexity for transferring a 3x3 matrix, i.e. three 3x1 vectors.
There are many alternatives for scheduling the entire task onto one or more processors. In the following sections several architectures, obtained according to different partitions of the task graph, are proposed and compared. A number of measures have been developed to evaluate the architectures. They are listed below:
ET : Total execution time of the whole system
     = total computation time + total I/O transfer time + processor idle time

IR : Initiation rate
     = average number of initiations per clock unit

UP : Utilization of each processor
     = RP fractional busy time
     = (total computation time + total I/O time) / ET(P=n)

SP : Speedup
     = the ratio of the total execution time for one RP to the total execution time for n RPs
     = ET(P=1) / ET(P=n)

CBR : CPU bound ratio
     = the ratio of the computation time for one RP to the total execution time for n RPs
     = computation time(P=1) / ET(P=n)

RN : Register number of the RF (register file)

SCRAM : Size of the control RAM (CRAM)

Total memory : includes the RF and CRAM
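As a sketch, these measures can be collected in a small Python helper. The class name ArchMeasures and its field names are illustrative, not part of the original analysis; the numbers in the usage test below are likewise made up for demonstration.

```python
from dataclasses import dataclass

@dataclass
class ArchMeasures:
    compute_us: float   # total computation time, microseconds
    io_us: float        # total I/O transfer time, microseconds
    idle_us: float      # processor idle time, microseconds

    @property
    def et(self):
        # ET: total execution time
        return self.compute_us + self.io_us + self.idle_us

    def up(self):
        # UP: utilization, the RP fractional busy time
        return (self.compute_us + self.io_us) / self.et

    def sp(self, et_p1):
        # SP: speedup over the 1-processor architecture
        return et_p1 / self.et

    def cbr(self, compute_p1):
        # CBR: CPU bound ratio
        return compute_p1 / self.et
```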
4.2.3 Architectures of the Jacobian
Four possible architectures for implementing the Jacobian, 1-Processor, 2-Processor, N-Processor, and a cube interconnection network, are proposed and compared in the following sections.
4.2.3.1 1-Processor Architecture
If the task in figure 4.4 is executed by only one processor, the architecture for implementing the Jacobian is straightforward, as shown in figure 4.5. Here only one Robotics Processor is used to calculate all the Jacobian equations to obtain the Jacobian matrix. For a robot with N degrees-of-freedom, 2N values, the sines and cosines of the N joint angles, are received from the host computer through one of the input ports of the RP. As soon as the Jacobian matrix is calculated by the RP, its 6N components are passed to the host computer through one of the output ports.
The corresponding timing chart, shown in figure 4.6, shows the sequence of subtasks and the time required for each one. It is used to help determine the measurement parameters, such as computation time and CPU utilization.

[Figure 4.5 Architecture for Jacobian with P = 1]

[Figure 4.6 Timing Chart for Jacobian with P = 1]

The numbers in the timing chart, obtained from the microprogram for the Jacobian in Appendix A.2, are computation time, I/O transfer time, or idle time respectively. The unit for each of these numbers is one P_CLK period, or 1 microsecond. Subtasks in this timing chart are labeled from a to f: subtask a inputs sin θ_i and cos θ_i from the host computer, subtasks b through e compute equations (2.9), (2.4), (2.7), and (2.8), and subtask f moves the updated U matrix into place for the next iteration. To find one column of the Jacobian matrix requires all 6 subtasks. Thus, the total time required to obtain the Jacobian matrix is the time required to complete the 6 subtasks multiplied by N, where N is the number of degrees-of-freedom, i.e. (4 + 9 + 35 + 21 + 13 + 12) x N = 94N microseconds. The other measurement parameters for evaluating the architecture are computed in Appendix A.3.
4.2.3.2 2-Processor Architecture
For the 2-Processor architecture, the task graph for the Jacobian in figure 4.4 can be arbitrarily partitioned into two parts, left and right. The partitioning indicated by the dotted line in figure 4.7 was selected to make the complexity of each part approximately equal: 7.5N for the right part vs. 7N for the left. Therefore, if the task in each part is assigned to one processor, the load sharing between the two processors should be almost equal. The architecture of the Jacobian with two processors is shown in figure 4.8.

[Figure 4.7 Task Graph for Jacobian with P = 2]

[Figure 4.8 Architecture for Jacobian with P = 2]

Since some intermediate data must be transferred from one RP to the other, I/O time is increased. In figure 4.8, it can be seen that N 3x3 U matrices and N 3x1 gamma vectors, or 4N 3x1 vectors in total, are transferred between the two RPs. Recall from table 4.1 that the complexity for transferring a 3x1 vector is one. Thus, the complexity for transferring the necessary data between the two RPs is 4N. If this increase of I/O complexity is greater than the computation complexity reduction when two processors are used, the 2-Processor architecture has no obvious advantage.
The timing chart corresponding to this case is shown in figure 4.9. This timing chart shows not only the sequence and amount of time required by subtasks, but also the timing for transferring data between the two RPs. RP1 performs subtasks a to f, while RP2 performs subtasks g to m. Both are doing different subtasks, but they are synchronized at the points where they start to transfer data. For example, RP1 will not transfer the 3x3 U matrix to RP2 until subtasks a and b are completed. When the transfer occurs, it requires 18 P_CLK cycles, i.e. 18 microseconds. Since there are no handshaking signals and no buffers between the RPs, one RP has to be receiving the data when the other is transmitting. The transfer timing is synchronized not only by using the same system clock but also by precise microprogramming. The measurement parameters are computed in Appendix A.4. The total execution time, 601 microseconds as found in Appendix A.4, is not much reduced compared with the 658 microseconds of the 1-Processor architecture, since the increased I/O time takes up much of the computation time saved by using two processors.
4.2.3.3 N-Processor Architecture
The task graph for the Jacobian in figure 4.4 can also be partitioned into N parts for an N degrees-of-freedom robot, separated by horizontal dotted lines as shown in figure 4.10. Note that only the Nth and 1st parts are explicitly shown. Since each part is identical, each has the same computation complexity of 14.5. The task in each part is assigned to one processor; thus N processors are required in total.

[Figure 4.9 Timing Chart for Jacobian with P = 2]

[Figure 4.10 Task Graph for Jacobian with P = N]

The architecture of the Jacobian with N processors is shown in figure 4.11.
As with the 2-Processor architecture, some intermediate data needs to be transferred between Robotics Processors, and the I/O time is again increased. Figure 4.11 shows that one U matrix (3x3) and one γ vector (3x1) are transferred between any two RPs.
The corresponding timing chart for the architecture in figure 4.11 is shown in figure 4.12. Every RP performs the same subtasks, a to l. It can be seen from figure 4.12 that RP(i-1) idles 59 P_CLK cycles before it starts subtasks a to l. The purpose of these idles is to synchronize the transfer of data at certain points. Every RP repeats subtasks a to l again and again until the RP is rebooted by the host computer. The measurement parameters are computed in Appendix A.5.
The I/O time for the N-Processor architecture, 64 P_CLK cycles (refer to Appendix A.5), is comparable with the computation time, 78 P_CLK cycles (refer to Appendix A.5). If the task is further partitioned, for example P=2N, the I/O time will further increase while the computation time decreases. Thus, for the N-Processor architecture, it is possible that the I/O time will exceed the computation time; consequently, the system throughput will not be improved. Furthermore, since the synchronization scheme uses no handshaking signals or buffers between Robotics Processors, it becomes more complicated to handle data transfer between the RPs as more processors are used. Therefore, P=2N is not considered.
[Figure 4.11 Architecture for Jacobian with P = N]

[Figure 4.12 Timing Chart for Jacobian with P = N]

4.2.3.4 Cube Interconnection Network
The total execution times for implementing the Jacobian with the above three architectures increase linearly with the number of degrees-of-freedom. In this section, a parallel algorithm for implementing the Jacobian is considered, using the cube interconnection network [12 p. 342] to realize the algorithm. The time required to perform the calculations increases as the log of the number of degrees-of-freedom. Conceptually, the rotational and translational transformations from tip to base are accomplished by grouping adjacent links in the first step to form N/2 groups of two links each. Then on each succeeding step, adjacent pairs of groups are grouped together until after log(N) steps there is one group encompassing all links. This is analogous to multiplying N numbers in log(N) steps by multiplying first adjacent pairs, then adjacent pairs of pairs, and so forth.
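The grouping scheme can be sketched as a pairwise reduction over (U, t) transform pairs. The helper names are illustrative, and the step counter stands in for one parallel level of the real architecture: all compositions inside one round are independent and could execute simultaneously.

```python
import numpy as np

def compose(a, b):
    # Compose two (U, t) link transforms: U_ab = U_a U_b, t_ab = U_a t_b + t_a
    Ua, ta = a
    Ub, tb = b
    return (Ua @ Ub, Ua @ tb + ta)

def reduce_pairwise(transforms):
    """Combine N transforms in ceil(log2 N) rounds of pairwise composition."""
    steps = 0
    while len(transforms) > 1:
        nxt = [compose(transforms[j], transforms[j + 1])
               for j in range(0, len(transforms) - 1, 2)]
        if len(transforms) % 2:        # odd leftover passes through unchanged
            nxt.append(transforms[-1])
        transforms = nxt
        steps += 1                     # one parallel level per round
    return transforms[0], steps
```

Because transform composition is associative, the pairwise result equals the sequential left-to-right product, and 8 links are combined in exactly 3 rounds.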
Equations for the Jacobian (2.3) to (2.8) are rewritten as follows for a parallel algorithm.
$${}^{N+1}U_{N+1} = I \qquad (2.11)$$

$${}^{N+1}\mathbf{t}_{N+1} = \mathbf{0} \qquad (2.12)$$

$${}^{N+1}U_i = {}^{N+1}U_{(N+1-2^l)}\;{}^{(N+1-2^l)}U_i \qquad (2.13)$$

$${}^{N+1}\mathbf{t}_i = {}^{N+1}U_{(N+1-2^l)}\;{}^{(N+1-2^l)}\mathbf{t}_i + {}^{N+1}\mathbf{t}_{(N+1-2^l)} \qquad (2.14)$$

$${}^{N+1}\boldsymbol{\gamma}_i = {}^{N+1}U_{i-1}\begin{bmatrix}0\\0\\1\end{bmatrix} \qquad (2.15)$$

$${}^{N+1}\boldsymbol{\beta}_i = {}^{N+1}\mathbf{t}_{i-1} \times {}^{N+1}\boldsymbol{\gamma}_i \qquad (2.16)$$

where i = N, (N-1), ..., 2, 1 and l = 0, 1, 2, ..., with (N+1-2^l) >= 1.

Note that ${}^{N+1}\mathbf{t}_i$ is the first three elements of the fourth column of the homogeneous transformation ${}^{N+1}T_i$.
The architecture for implementing the Jacobian in parallel form is shown in figure 4.13 (assuming 8 degrees-of-freedom). The rotational transformation U, a 3x3 matrix, and the translational transformation t, a 3x1 vector, are shown in each box. The computation time for finding the Jacobian matrix is proportional to the number of levels, log(8), or 3. If each box represents one Robotics Processor (RP), 20 RPs are needed in total. The input data must be the sine and the cosine of theta instead of theta, since the RP cannot perform trigonometric functions. The total execution time can be found from the following calculations:
[Figure 4.13 Architecture for Implementing Jacobian in Parallel (8 degrees-of-freedom)]

computation time
= T(finding U and t from the rotation angle theta) +
  log(8) x T(finding U and t in levels 0, 1, 2) +
  T(finding beta in the Jacobian matrix)

T(finding U and t from the rotation angle theta)
= 9 microseconds (refer to Appendix A.2)

T(finding U and t in levels 0, 1, 2)
= T(multiplication of two 3x3 matrices) +
  T(multiplication of a 3x3 matrix with a 3x1 vector) +
  T(addition of two 3x1 vectors)
= (35-8) + (17-8) + 6 (refer to Appendix A.1)
= 42 microseconds

T(finding beta in the Jacobian matrix)
= T(3 cross products of two 3x1 vectors)
= (6M+7) with M=3 (refer to Appendix A.1)
= 25 microseconds

so, computation time = 9 + 3 x 42 + 25 = 160 microseconds

I/O time
= T(input sine and cosine of theta) +
  log(8) x T(transferring matrix U and vector t)
= 4 + 3 x T(transferring four 3x1 vectors)
= 4 + 3 x 4 x 6 (refer to table 4.1)
= 76 microseconds

so, total execution time
= computation time + I/O time
= 160 + 76
= 236 microseconds (for 5, 6, 7, and 8 degrees-of-freedom)
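The calculation above can be restated as a small timing model so the arithmetic is easy to check. The constants are the microsecond figures quoted in the text (Appendices A.1 and A.2); the function name is illustrative.

```python
import math

def cube_jacobian_time_us(n=8):
    """Total execution time of the parallel Jacobian for n degrees-of-freedom."""
    levels = math.ceil(math.log2(n))      # pairwise-grouping levels
    t_u_from_theta = 9                    # U, t from sine/cosine of theta
    t_level = (35 - 8) + (17 - 8) + 6     # MM + MV + V+V with pipeline overlap
    t_beta = 6 * 3 + 7                    # three cross products, (6M+7) with M=3
    compute = t_u_from_theta + levels * t_level + t_beta
    io = 4 + levels * 4 * 6               # input angles + four 3x1 vectors/level
    return compute + io
```

Any number of degrees-of-freedom from 5 to 8 rounds up to 3 levels, so all give the same 236 microseconds.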
Although the total execution time is reduced to about half compared to the 497 microseconds of the N-Processor architecture with 7 degrees-of-freedom in the last section, the architecture in figure 4.13 is not regular, and the communications between RPs are no longer limited to adjacent RPs. However, if a cube interconnection network is applied, a regular architecture can be obtained, allowing each processor to communicate with only n other processors, where n = log(N). This network corresponds to an n-cube.
A three-dimensional cube is illustrated in figure 4.14(a). The processor element (PE) located at each vertex of the cube is directly connected to 3 neighbors. The addresses of neighboring PEs differ in exactly one bit position. Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit position. Figure 4.14(b) shows that a 4-cube network can be considered as two 3-cube networks linked together by 8 extra edges.
Figure 4.15 shows the required communications among the 8 PEs in different time slots. For example, in time slot t0 to t1, the possible communication pairs are (PE0, PE1), (PE2, PE3), (PE4, PE5), and (PE6, PE7); in time slot t1 to t2, the possible pairs are (PE0, PE2), (PE1, PE3), (PE4, PE6), and (PE5, PE7); and in time slot t2 to t3, the possible pairs are (PE0, PE4), (PE1, PE5), (PE2, PE6), and (PE3, PE7).
[Figure 4.14(a) 3 Cube Interconnection Network]

[Figure 4.14(b) 4 Cube Interconnection Network]
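The pairings in the time slots above follow directly from the cube addressing rule: the partner of a PE in time slot l is found by complementing bit l of its address. A minimal sketch (function names illustrative):

```python
def partner(pe: int, slot: int) -> int:
    # Complement bit `slot` of the PE address to find its communication partner
    return pe ^ (1 << slot)

def pairs(n_dims: int, slot: int):
    """All disjoint communication pairs of a 2^n_dims-PE cube in one time slot."""
    seen, out = set(), []
    for pe in range(1 << n_dims):
        q = partner(pe, slot)
        if pe not in seen and q not in seen:
            out.append((pe, q))
            seen.update((pe, q))
    return out
```

For a 3-cube this reproduces the three slot schedules listed in the text.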
[Figure 4.15 Communication of the 8 PEs in Different Time Slots]
Figure 4.16 shows a 3-cube interconnection network for the parallel implementation of the Jacobian with 8 degrees-of-freedom. Each row represents a processor element (PE), numbered from 0 to 7. The rotational transformation U required to be calculated at each time, such as t0, is shown in the box. The translational transformation t is not explicitly shown but is understood. A blank box means that the PE is idle. The links between the PEs indicate that the matrix U and the vector t are transferred from one PE to the other. Some of the links between the PEs are missing, which means that no communication is needed in that time slot. Figure 4.16 also shows that PEs are doing different calculations at different times. For example, at time t0, each PE calculates the matrix U from each rotation angle theta, while at time t1, PE1 is idle.
Figure 4.17 shows the communication between PEs in different time slots. In time slot t0 to t1, the communication pairs are (PE0, PE1), (PE2, PE3), (PE4, PE5) and (PE6, PE7); from t1 to t2, (PE0, PE2), (PE1, PE3), (PE4, PE6) and (PE5, PE7); and from t2 to t3, (PE0, PE4), (PE1, PE5), (PE2, PE6) and (PE3, PE7). Except during the last time slot, all communications are bidirectional. If there is only one channel between two adjacent PEs, or the communication is half duplex, the transfer time will be about double the transfer time in figure 4.13, where communication is unidirectional.
Figure 4.18 shows a 4-cube interconnection network for implementing the Jacobian with 16 degrees-of-freedom. At time t4, PE8, PE11, PE14 and PE15 are idle. PE9 calculates its matrix U by multiplying three matrices instead of two, as at times t1, t2 and t3.
[Figure 4.16 3 Cube Interconnection Network for Implementing Jacobian in Parallel (8 degrees-of-freedom)]
[Figure 4.17 Communication Between PEs in Different Time Slots]
[Figure 4.18 4 Cube Interconnection Network for Implementing Jacobian in Parallel (16 degrees-of-freedom)]
In the general case, the calculation of ${}^{(N/2)+1}U_N$ at the last step is

$${}^{(N/2)+1}U_N = {}^{(N/2)+1}U_{(N/2)+2}\;{}^{(N/2)+2}U_{(N/2)+4}\cdots{}^{(N/2)+(N/4)}U_N \qquad (2.32)$$

The computation time is of the order of log(N) + log(N/2) - 2, or O(log N).
4.2.3.5 Comparison
The cube interconnection network described in the last section is suitable for the parallel implementation of the Jacobian. The processor elements in the network must have bidirectional transmission capability on the I/O ports. Since the Robotics Processor has only unidirectional transmission capability, it cannot be used to realize the network.
Therefore, the cube interconnection network is not compared with the other three architectures described above. The measurement parameters of the first three architectures are listed in table 4.2. N is assumed equal to 7, with one redundant degree-of-freedom. Since the total execution time for the 1-Processor architecture is acceptable and the total memory size is moderate, the architecture with only one processor is the best choice. If N is large enough, the cube interconnection network might be the better choice.

Table 4.2
Comparison of Three Architectures for Jacobian

| N=7                | P=1   | P=2   | P=N   |
| ET (microsecond)   | 658   | 601   | 497   |
| IR (1/microsecond) | 1/658 | 1/588 | 1/143 |
| UP (%)             | 100   | 88    | 100   |
| SP                 | --    | 1.1   | 1.32  |
| CBR (%)            | 96    | 54    | 63    |
| RN (32-bit)        | 39    | 30    | 39    |
| SCRAM (bit)        | 3.7K  | 2.4K  | 4.5K  |
| Total memory (bit) | 5K    | 3.4K  | 5.8K  |
4.3 Inverse Jacobian
Given the six velocity components of the end effector, both linear and rotational, the Inverse Jacobian solves for the rates of the joint angles. The equations of the Inverse Jacobian are listed as follows. For the case where N (the number of degrees-of-freedom) is not equal to 6, a pseudoinverse method is employed.
1) N > 6

$$\dot{\boldsymbol{\theta}}_{N\times1} = J^{T}_{N\times6}\,\bigl(J_{6\times N}\,J^{T}_{N\times6}\bigr)^{-1}\dot{\mathbf{x}}_{6\times1} \qquad (2.17)$$

Define $A_{6\times6}$, $R_{6\times6}$, and $\mathbf{C}_{6\times1}$ as follows:

$$A_{6\times6} = [\mathbf{A}_{c1}\;\mathbf{A}_{c2}\;\mathbf{A}_{c3}\;\mathbf{A}_{c4}\;\mathbf{A}_{c5}\;\mathbf{A}_{c6}] = J_{6\times N}\,J^{T}_{N\times6} \qquad (2.18)$$

$$R_{6\times6} = [\mathbf{R}_{r1}\;\mathbf{R}_{r2}\;\mathbf{R}_{r3}\;\mathbf{R}_{r4}\;\mathbf{R}_{r5}\;\mathbf{R}_{r6}]^{T} = d\,A^{-1}_{6\times6} \qquad (2.19)$$

where $d$ is the determinant of $A_{6\times6}$.

$$\dot{\boldsymbol{\theta}}_{N\times1} = \frac{1}{d}\,J^{T}_{N\times6}\,\mathbf{C}_{6\times1}, \qquad \mathbf{C}_{6\times1} = R_{6\times6}\,\dot{\mathbf{x}}_{6\times1} \qquad (2.20)$$

2) N = 6

$$\dot{\boldsymbol{\theta}}_{6\times1} = J^{-1}_{6\times6}\,\dot{\mathbf{x}}_{6\times1} \qquad (2.21)$$

3) N < 6

$$\dot{\boldsymbol{\theta}}_{N\times1} = \bigl(J^{T}_{N\times6}\,J_{6\times N}\bigr)^{-1}J^{T}_{N\times6}\,\dot{\mathbf{x}}_{6\times1} \qquad (2.22)$$
These equations are solved as is any system of linear equations.
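For reference, equations (2.17), (2.21), and (2.22) can be written down directly with NumPy. This is only a numerical sketch of what the architectures compute; the RP-based design avoids explicit division and inversion, as discussed in the next section.

```python
import numpy as np

def joint_rates(J, xdot):
    """Solve for theta_dot given a 6 x N Jacobian J and end-effector rates."""
    cols = J.shape[1]
    if cols > 6:
        # (2.17): theta_dot = J^T (J J^T)^-1 x_dot   (redundant arm)
        return J.T @ np.linalg.solve(J @ J.T, xdot)
    if cols == 6:
        # (2.21): theta_dot = J^-1 x_dot
        return np.linalg.solve(J, xdot)
    # (2.22): theta_dot = (J^T J)^-1 J^T x_dot       (deficient arm)
    return np.linalg.solve(J.T @ J, J.T @ xdot)
```

In the redundant case the solution satisfies J theta_dot = x_dot exactly whenever J has full row rank.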
4.3.1 Methods for Solving Linear Equations
There are many ways to solve linear equations, for example Gaussian Elimination, LU-decomposition, the Faddeev Algorithm, and the Inverse Matrix with Determinant. These methods are explored below to determine whether or not they are feasible for VLSI implementation.
1) Gaussian Elimination
The complexity of solving N linear equations with N unknowns using the Gaussian Elimination method is of the order of N^3, including (1/2)N(N+1) divisions, N((1/3)N^2 + (1/2)N - 5/6) multiplications, and N((1/3)N^2 + (1/2)N - 5/6) additions, while the inverse matrix with determinant is of the order of N! [19, p. 208]. If N=6, then the complexity is 216 vs. 720. Even though Gaussian Elimination has less complexity, it has some disadvantages:

(a) It is not regular. Specifically, arrangements must be made to avoid picking a pivot which would result in division by zero. Furthermore, because the choice of the pivot cannot be predicted in advance, it does not seem feasible to implement using VLSI chips.

(b) To be accurate, the pivot must be wisely chosen. Whether the algorithm uses a "partial pivot" or "complete pivot" [19, p. 187], the pivot selection process destroys the regularity of the layout which is absolutely required in VLSI.

(c) Gaussian Elimination requires (1/2)N(N+1) divisions. Since the RPs cannot perform division, this operation would have to be performed by the host computer.
2) LU-decomposition
Although LU-decomposition implemented with VLSI chips has been widely used to solve linear equations [25] [26] [27], there are still some disadvantages:

(a) The characteristic matrix must be a symmetric positive-definite or an irreducible, diagonally dominant matrix.

(b) Once again, the division operation is required.
3) Faddeev Algorithm
By applying the Faddeev Algorithm and using an (N+1) x (N+1) array of processors, the entire calculation for solving the linear equations can be performed in the order of N time steps [28]. However, the disadvantage is that four types of processors are required, one of which is a divider.
4) Inverse Matrix with Determinant
The complexity of finding an inverse matrix can be much reduced by finding the determinants of some i x i matrices and sharing these determinants (see Appendix A.6). These determinants can in turn be obtained from the determinants of the reduced (i-1) x (i-1) matrices. In the following sections, several possible architectures based on this method are explored and compared. The Robotics Processors can be used to achieve the matrix inverse elegantly in some architectures.
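The sharing idea can be sketched with a memoized cofactor expansion: caching each minor by its tuple of remaining columns means every i x i determinant is formed from already-computed (i-1) x (i-1) determinants. The code below illustrates the principle only; it is not the RP procedure of Appendix A.6.

```python
from functools import lru_cache

def det_shared(A):
    """Determinant of a square matrix A (list of rows) by cofactor expansion,
    sharing sub-determinants across expansions via memoization."""
    n = len(A)

    @lru_cache(maxsize=None)
    def minor_det(row, cols):
        # determinant of A restricted to rows row..n-1 and columns `cols`
        if len(cols) == 1:
            return A[row][cols[0]]
        total, sign = 0.0, 1.0
        for k, c in enumerate(cols):
            rest = cols[:k] + cols[k + 1:]
            total += sign * A[row][c] * minor_det(row + 1, rest)
            sign = -sign
        return total

    return minor_det(0, tuple(range(n)))
```

Without the cache this expansion costs O(N!); with shared minors, each subset of columns is evaluated once.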
4.3.2 Architectures of the Inverse Jacobian
Four possible architectures, 1-Processor, 6-Processor, 12-Processor, and 24-Processor, are proposed and compared based on the measurement parameters described in section 4.2.2. To calculate the parameters, the computation time and the number of required temporary registers for each vector inner product must be known. For example, it takes (3M+10) P_CLK cycles to do M inner products with vector size 3x1. They are listed in Appendix A.7 and their reservation tables can be found in Appendix A.1. For the following sections, N > 6 is assumed, where N is the number of degrees-of-freedom.
4.3.2.1 1-Processor Architecture
The architecture for finding the Inverse Jacobian using only one RP is shown in figure 4.19. It shows the necessary data flow between the host computer and the RP.

[Figure 4.19 Architecture for Inverse Jacobian with P = 1]

The procedures to solve the Inverse Jacobian, i.e. finding the derivative of theta, and the detailed calculations for the measurement parameters can be found in Appendix A.8. Since the RP does not have division capability, the reciprocal of the matrix determinant, 1/d, must be computed by the host computer. It is assumed that the host computer can complete the reciprocal in the time required by steps 7 and 8 in Appendix A.8, i.e. 6M + 14 = 6 x (6+7) + 14 = 92 microseconds. Most commercial microprocessors with a numeric coprocessor can complete division in this amount of time. For example, the Intel 8087 (5 MHz clock) can complete it in 39 microseconds.
4.3.2.2 6-Processor Architecture
The architecture for finding the Inverse Jacobian using six RPs is shown in figure 4.20. It shows the necessary data flow between the host computer and the RPs. It can be seen that there is no communication between RPs. The procedures to solve the Inverse Jacobian according to this architecture and the detailed calculations for the measurement parameters can be found in Appendix A.9. It is assumed that the host computer can complete the reciprocal in the time required by steps 8 to 10 in Appendix A.9, i.e. (6x1 + 14) + 6x2 + (6x1 + 14) = 52 microseconds.
4.3.2.3 12-Processor Architecture
The architecture for finding the Inverse Jacobian using twelve RPs is shown in figure 4.21. It shows the necessary data flow between the host computer and the RPs, and the data flow between the RPs. It can be seen that 20 determinants are passed from RPi to RPi' (i = 1 to 6).
Also, some intermediate data must be broadcast from one RP to the other RPs. This makes the I/O transmission procedures more complicated and the microprogramming more difficult. Furthermore, I/O time is increased because more intermediate data are transferred between RPs. The procedures to solve the Inverse Jacobian according to this architecture and the detailed calculations for the measurement parameters can be found in Appendix A.10. It is assumed that the host computer can complete the reciprocal in the time required by steps 10 to 12 in Appendix A.10, i.e. (6x1 + 14) + 6x2 + (6x1 + 14) = 52 microseconds.

[Figure 4.20 Architecture for Inverse Jacobian with P = 6]

[Figure 4.21 Architecture for Inverse Jacobian with P = 12]
4.3.2.4 24-Processor Architecture
The architecture for finding the Inverse Jacobian using twenty-four RPs is shown in figure 4.22. It can be seen that 20 determinants are passed from RPib to RPic (i = 1 to 6) and 6 determinants from RPic to RPid. As with the 12-Processor architecture, some intermediate data must be broadcast from one RP to the rest of the RPs. The I/O time is increased greatly over the 12-Processor architecture since more data is required to be transferred between RPs. The procedures to solve the Inverse Jacobian according to this architecture and the detailed calculations for the measurement parameters can be found in Appendix A.11. It is assumed that the host computer can complete the reciprocal in the time required by steps 11 to 13 in Appendix A.11, i.e. (6x1 + 14) + 6x2 + (6x1 + 14) = 52 microseconds.
4.3.2.5 Comparison
The measurement parameters of the above four architectures are calculated in Appendix A.8 to A.11 and summarized in table 4.3. N is assumed equal to 7, with one redundant degree-of-freedom.
[Figure 4.22 Architecture for Inverse Jacobian with P = 24]

Table 4.3
Comparison of Four Architectures for Inverse Jacobian

| N=7                | P=1    | P=6   | P=12  | P=24  |
| ET (microsecond)   | 1838   | 542   | 582   | 594   |
| IR (1/microsecond) | 1/1838 | 1/542 | 1/355 | 1/238 |
| UP (%)             | 100    | 100   | 82    | 62    |
| SP                 | --     | 3.4   | 3.16  | 3.1   |
| CBR (%)            | 94     | 53    | 40    | 30    |
| RN (32-bit)        | 119    | 63    | 56    | 53    |
| SCRAM (bit)        | 82K    | 17.7K | 10.5K | 6.08K |
| Total memory (bit) | 86K    | 20K   | 12.3K | 7.78K |
It can be seen that 86K bits of memory are required for the 1-Processor architecture. It is impractical to put such a large memory into the RP. The execution times for the 6-Processor, 12-Processor, and 24-Processor architectures are in the same range. Even though the 6-Processor architecture has the least execution time, its required memory, 20K bits, is still considered too large for current VLSI technology. Both the 6-Processor and the 12-Processor architectures are very regular, while the 24-Processor architecture is not, although it has the smallest memory size. Therefore, because of the regularity of the architecture and the moderate memory size, the 12-Processor architecture is the best choice for the Inverse Jacobian.
4.4 Inverse Dynamics
The Inverse Dynamics problem is: given the desired accelerations, find the necessary forces and torques. The equations of Inverse Dynamics are listed as follows:
Forward Recursion:
$$\boldsymbol{\omega}_i = {}^{i-1}U_i^{\,T}\left(\boldsymbol{\omega}_{i-1} + \begin{bmatrix}0\\0\\\dot{\theta}_i\end{bmatrix}\right) \qquad (2.23)$$

$$\dot{\boldsymbol{\omega}}_i = {}^{i-1}U_i^{\,T}\left(\dot{\boldsymbol{\omega}}_{i-1} + \begin{bmatrix}0\\0\\\ddot{\theta}_i\end{bmatrix} + \boldsymbol{\omega}_{i-1}\times\begin{bmatrix}0\\0\\\dot{\theta}_i\end{bmatrix}\right) \qquad (2.24)$$

$$\ddot{\mathbf{p}}_i = {}^{i-1}U_i^{\,T}\,\ddot{\mathbf{p}}_{i-1} + \dot{\boldsymbol{\omega}}_i\times\mathbf{p}_i^{*} + \boldsymbol{\omega}_i\times\bigl(\boldsymbol{\omega}_i\times\mathbf{p}_i^{*}\bigr) \qquad (2.25)$$

$$\ddot{\mathbf{s}}_i = \dot{\boldsymbol{\omega}}_i\times\mathbf{s}_i + \boldsymbol{\omega}_i\times\bigl(\boldsymbol{\omega}_i\times\mathbf{s}_i\bigr) + \ddot{\mathbf{p}}_i \qquad (2.26)$$

$$\mathbf{F}_i = m_i\,\ddot{\mathbf{s}}_i \qquad (2.27)$$
1 U 1 1 1 * 1 / 1 , 1 ^ N, = J . u), + Backward Recursion: <-y = ’-y {y. * y ,, (2.29) —i i —i —i+l i - l i - l i ★ -i * JL “ V Hi+ 1 ♦ V {W X W ^i + l } (2‘ 30) i = N, (N-l), . . . 2, 1 78 Here, only revolute joints are considered. The notations used above are summarized as follows: 1—*3x 1 * the an 9 u*ar velocity of link i * 3 x 1 : the angular acceleration of link i i-l,, : rotational transformation matrix 3x3 ■ 0^ 9^ : the joint generalized variable for joint i i *■ Pi- , : the acceleration of the origin of coordinate i -'3x 1 3 i * —* 3x 1 ' 3 vec*;or **° *'*1e ori 9 '*n coordinate i from the origin of coordinate i - l i *• —* 3 xl * acce*erat*on the center of gravity of link i * —* 3xl " 3 vec^or *:o *'*1e cef|Ter gravity of link i from the origin of coordinate i i : the total force (excluding gravity) on line i 3x1 mi : the mass of link i : the total torque on link i ^*3x3 * t ^ie *ner* * a Tensor (with respect to its center of mass) of link i i - l _filu1: constraint force (unknown) exerted on link i by link (i-l) 3x1 i-l : constraint torque (unknown) exerted on link i by link ( 1- 1) -*3 x l 79 4.4.1 Task Graph The task graph for Inverse Dynamics is shown in figure 4.23 and 4.24. It is clearly much more complicated than the task graph for the Jacobian in figure 4.4. The task graph is obtained hy calculating the complexity of each of Inverse Dynamics equations (2.17) to (2.24). Using this task graph, four possible architectures are proposed and compared in the following sections, 4.4.2 Architectures of the Inverse Dynamics Four possible architectures for implementing Inverse Dynamics are 1-Processor, 2-Processor, N-Processor and 2N-Processor. 4.4.2.1 1-Processor Architecture The architecture for Inverse Dynamics with only one RP is shown in figure 4.25, The necessary data transferred between the host computer and the RP is also shown in the figure. 
The corresponding timing chart, shown in figure 4.26, shows the sequence of subtasks and the time required for each subtask. Subtasks in this timing chart are labeled a, b1, b2, ..., bN, c1, c2, ..., cN, and d. The exact amount of time for each subtask is obtained from the microprograms for the forward and backward recursion of Inverse Dynamics in Appendices A.12 and A.13. The measurement parameters evaluating the architecture are computed in Appendix A.14.

4.4.2.2 2-Processor Architecture

The architecture for the Inverse Dynamics with two RPs is shown in figure 4.27, along with the necessary data transferred between the host computer and the RPs, and between the two RPs. The corresponding timing chart, shown in figure 4.28, shows not only the

Figure 4.23 Task Graph for the Forward Recursion of Inverse Dynamics

Figure 4.24 Task Graph for the Backward Recursion of Inverse Dynamics

Figure 4.25 Architecture for Inverse Dynamics with P = 1

Figure 4.26 Timing Chart for Inverse Dynamics with P = 1
    a : input sinθ_1, cosθ_1, ..., sinθ_N, cosθ_N
    b_i : compute Forward Recursion of Inverse Dynamics
    c_i : compute Backward Recursion of Inverse Dynamics
    d : output τ_1, ..., τ_N

Figure 4.27 Architecture for Inverse Dynamics with P = 2

    a : input sinθ_1, cosθ_1, ..., sinθ_N, cosθ_N
    b_i : compute Forward Recursion of Inverse Dynamics
    c : output sinθ_1, cosθ_1, F_1, N_1, ..., sinθ_N, cosθ_N, F_N, N_N
    d : input sinθ_1, cosθ_1, F_1, N_1, ..., sinθ_N, cosθ_N, F_N, N_N
    e_i
: compute Backward Recursion of Inverse Dynamics
    f : output τ_1, ..., τ_N

Figure 4.28 Timing Chart for Inverse Dynamics with P = 2

sequence of subtasks and the time required for these, but also the timing of transferring data between the two RPs. RP1 executes subtasks a, b1, b2, ..., and bN; RP2 executes subtasks c, d, e1, e2, ..., eN, and f. RP1 will not start to transfer data to RP2 until it completes subtasks a, b1, b2, ..., and bN. The I/O time created by transferring the data from RP1 to RP2 is 16N microseconds. The exact amount of time for each subtask is obtained from the microprograms for the forward and backward recursion of Inverse Dynamics in Appendices A.12 and A.13. The measurement parameters evaluating the architecture are computed in Appendix A.15.

4.4.2.3 N-Processor Architecture

The timing chart for the forward recursion of Inverse Dynamics with one RP per link, obtained from Appendix A.12, is shown in figure 4.29. The subtasks in the figure are labeled from a to p. The timing chart for the backward recursion of Inverse Dynamics with one RP per link, obtained from Appendix A.13, is shown in figure 4.30. Its subtasks are labeled from a to k. From the two timing charts, it can be seen that the data transfer initiation times can be easily aligned. The architecture for the Inverse Dynamics with N RPs is shown in figure 4.31. The N RPs are connected in a one-dimensional array. The figure shows the necessary data transferred between the host computer and the RPs, and between any two adjacent RPs. It can be seen that the data transfer between any two adjacent RPs is not unidirectional but bidirectional. Therefore, if the RP is to be employed in this architecture, two of the I/O ports of the RP must be bidirectional. Based on the two timing charts above, a corresponding timing chart can
Figure 4.29 Timing Chart for Forward Recursion of Inverse Dynamics with One RP per Link (subtasks a through p: inputting sinθ_i and cosθ_i from the host computer, computing ^{i-1}U_i, computing part of and then completing P̈_i, computing F_i and N_i, idling, and outputting sinθ_i, cosθ_i, P̈_i, F_i, and N_i)

Figure 4.30 Timing Chart for Backward Recursion of Inverse Dynamics with One RP per Link (subtasks a through k: inputting sinθ_i and cosθ_i from the host computer, inputting f_{i+1} and n_{i+1}, computing ^{i-1}U_i, f_i, and n_i, idling, and outputting f_i and n_i)

Figure 4.31 Architecture for Inverse Dynamics with P = N

be developed; it is shown in figure 4.32. The figure shows the timing of transferring data between the RPs, with 3 RPs in this example. To align the I/O transmission, some RPs must idle until the other RPs are ready to receive or transmit. The CPU utilization is thus decreased. The measurement parameters evaluating the architecture are computed in Appendix A.16.

4.4.2.4 2N-Processor Architecture

The architecture for Inverse Dynamics with 2N RPs is shown in figure 4.33. The 2N RPs are connected in a two-dimensional array. It can be seen that, unlike the N-Processor architecture, the data transferred between any two adjacent RPs is unidirectional. This allows four I/O ports of the RP to be unidirectional. The RPs of the upper row perform the forward recursion, while the RPs of the lower row perform the backward recursion. It is assumed that RPi' (i = 1 to N) always uses the most recently updated data transferred from RPi. Therefore, no buffer is required between any pair RPi and RPi'. The corresponding timing chart is shown in figure 4.34.
The figure shows the timing of data transferred between the RPs, with 6 RPs in this example. It can be seen that RP1' to RP3' are sometimes idle, waiting for data transferred from the other RPs. Some idle time is also created by the uneven load sharing between RPi and RPi' (the forward recursion is more time consuming than the backward recursion). The measurement parameters evaluating the architecture are computed in Appendix A.17.

Figure 4.32 Timing Chart for Inverse Dynamics with P = N (N = 3 for example)

Figure 4.33 Architecture for Inverse Dynamics with P = 2N

Figure 4.34 Timing Chart for Inverse Dynamics with P = 2N (N = 3 for example)

4.4.2.5 Comparison

The measurement parameters of the above four architectures are calculated in Appendices A.14 to A.17 and summarized in table 4.4. N is assumed to be equal to 7, implying one redundant degree-of-freedom.

Table 4.4 Comparison of Four Architectures for Inverse Dynamics (N = 7)

                         P=1       P=2       P=N       P=2N
  ET (microsecond)       1134      1246      708       749
  IR (1/microsecond)     1/1134    1/882     1/708     1/200
  UP (%)                 100       71        37        76
  SP                     --        .91       1.6       1.5
  CBR (%)                96        62        22        39
  RN (32-bit)            371       329       65        59
  CRAM (bit)             9.1K      6.8K      10.7K     6.9K
  Total memory (bit)     21K       17.3K     12.8K     8.8K
  Bidirectional I/O      no        no        yes       no

It can be seen that even though the N-Processor architecture has the least execution time, its bidirectional I/O bus makes the interface circuit more complicated; thus, it will not be considered further. The memory sizes for both the 1-Processor and 2-Processor architectures are too large to be put into the RP chip with current VLSI technology. The memory size for the 2N-Processor architecture is less than 10K bits, which can likely be put into the RP. Its execution time is about 750 microseconds, which is acceptable. Also, the connections between the RPs are very regular. Thus, the 2N-Processor architecture is the best choice for Inverse Dynamics.

4.5 Summary

In this chapter, several possible architectures for the different applications (Jacobian, Inverse Jacobian, and Inverse Dynamics) are explored and compared. Because of the essential circuits required in the RP, such as the Floating Point Adder/Subtractor and Multiplier, there is limited area left for the Register File and Control RAM. Therefore, the total memory size becomes a very important factor in determining which architecture is the best choice. Also, in the control system, the initiation rate is not as important as the total execution time, since the total execution time affects the stability of the control system. As more and more RPs are used to solve problems such as the Jacobian, Inverse Jacobian, and Inverse Dynamics, the communication between them increases and thus the CPU Bound Ratio (CBR) tends to decrease. In some cases, using more RPs to solve the same problem will even result in more total execution time than when fewer are used. The function partition between the RPs greatly affects the whole system throughput. Task graph concepts are introduced to achieve the best function partition, i.e., to minimize the data transfer between RPs and the total execution time.
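The speedup entries of Table 4.4 follow directly from the execution-time entries, with SP(P) = ET(P=1) / ET(P). A quick numeric check, using the table's ET values:

```python
# Recomputing the SP (speedup) row of Table 4.4 from the ET row.
# ET values are the table's measured execution times in microseconds, N = 7.
ET = {"P=1": 1134, "P=2": 1246, "P=N": 708, "P=2N": 749}

SP = {arch: ET["P=1"] / et for arch, et in ET.items()}
# SP["P=2"] rounds to 0.91: two RPs are actually slower than one here,
# because the 16N-microsecond inter-RP transfer outweighs the parallelism.
```

This makes the comparison in section 4.4.2.5 concrete: only the N- and 2N-Processor architectures achieve a speedup greater than one.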
CHAPTER 5

CIRCUIT DESIGN OF THE ROBOTICS PROCESSOR

5.1 Introduction

A general description of the major functional blocks of the Robotics Processor (RP) is given in this chapter, while the detailed circuit designs are explained in Appendix B. The functional blocks of the RP, shown in figure 5.1, are the Clock Generator (CG), Bootstrap Unit (BU), Format Converter for the BU (FCB), Control RAM (CRAM), Sequencer (SEQ), Microcode Register (MCR), Register File (RF), Floating Point Adder/Subtractor (FPA), Floating Point Multiplier (FPM), Format Converter East (FCE), Format Converter West (FCW), Format Converter North (FCN), and Format Converter South (FCS). In addition, the Level Sensitive Scan Design (LSSD) technique will be discussed. The detailed designs of the Register File (RF), Sequencer (SEQ), and Control RAM (CRAM) have not been completed and so will not be covered in this dissertation.

5.2 Clock Generator

The Clock Generator (CG) generates all clock signals needed in the RP. Figure 5.2 shows the block diagram of the CG.

Figure 5.1 Block Diagram of the Robotics Processor

Figure 5.2 Clock Generator (CG). TPG: Two-Phase Generator; Pφ1: pipeline φ1; Pφ2: pipeline φ2; JCNTR: Johnson counter with rising-edge trigger; JCNTF: Johnson counter with falling-edge trigger.

It consists of three Two-Phase Generators (TPG), two Counters (CNT), and two Johnson Counters: JCNTR to generate the Jra and Jrb signals, and JCNTF to generate
Phase-1 and phase-2 of the Pipeline Clock (P_CLK, 1 MHz', P<{^ and P>^, are used to latch the microinstruction at the Microcode Register (MCR) and to latch the temporary data at the pipeline registers in the FPA and FPM. Jra, Jrb, Jfa, and Jfb are used for time multiplexing data and addresses onto the buses at the proper times. These signals are identical but are shifted with respect to each other by one SYS_CLK period as is apparent from figure 5.3. The detailed circuits for the TPG, JCNTR, and JCNTF are described in Appendix B .l. 5.3 Bootstrap Unit and Format Converters The RP is designed to be used for more than one application and must therefore be programmable. During in itia liz a tio n , the host computer loads appropriate application microprograms and constants to the RPs. This loading process is accomplished by the Bootstrap Unit (BU) and the Format Converter for Bootstrap (FCB). While a microprogram is being loaded, the BU provides the necessary control signals and loading addresses to the Control RAM (CRAM). During microprogram execution, the SEQ generates the control signals and supplies the next microinstruction address to the CRAM. Figure 5.4 shows the block diagram of the BU, FCB, and CRAM indicating the paths used for microprogram loading. The RU consists of a Synchronization Controller plus Bootstrap Controller (SC+BTC) and a counter (CNT). The SC+BTC generates all control signals to the FCB and 1 0 0 -in_njTJi_rLr^u^r^r^Rrurn_nj' SYS_CLK/2 1 ♦ l/2 1 *2/2 1 PjCLK J I _J Pf; “I 1 Jra Jrb Jfa Jfb J Figure 5.3 Clock Signals Generated from the Clock Generator (CG) 8E BT HWR LC LOU* — I f LDl * — j SC + BTC LD0 LDl LD2 INC WEN CRAM CNT CLR L _ Address from SEQ Addresses and control signals Figure 5.4 Block Diagram of the BU, FCB, and CRAM Indicating the Paths Used for Microprogram Loading CRAM. The counter generates the loading address for the CRAM. 
The FCB concatenates three 16-bit words sent from the host computer to form a 40-bit microcode word, shown in figure 3.2 (the 8 unused bits are discarded), and then stores the result in the CRAM. The detailed procedures for loading a microprogram, and the circuit designs for the SC+BTC, are described in Appendix B.2.

From figure 5.1, it can be seen that the external bus on each side is 16 bits wide, while the three internal buses, Bus A (BA), Bus B (BB), and Bus C (BC), are all 32 bits wide. Therefore, four format converters are required to change 32-bit data into two 16-bit words or vice versa. The detailed circuit designs for the four format converters are explained in Appendix B.3.

5.4 Testability of the Chip

As advances in VLSI technology allow more and more components to be put into a chip, performance improves; but as a result of this increase in complexity, the testing problem becomes much more difficult. In the case of a highly complex sequential circuit, complete testing of every aspect of the circuit may be impossible unless some provision for testability of the circuitry is included in the chip design. Thus, the system architect and the logic circuit designer must consider the testability of the chip when they begin to design it. In this section, several methods for providing testability by design are discussed. For reasons which will be described later, Level Sensitive Scan Design (LSSD) was selected as the testing method for the Robotics Processor.
The structured approach to designing for testablity involves the structuring of registers, internal to the design, in such a way that the data contained in these registers can be controlled and observed. Two specific methods for accomplishing this are Level Sensitive Scan Design, LSSD, developed at IBM, and Scan Path developed at NEC. A recently developed self-testing technique integrates the Scan Path, LSSD and Signature Analysis concepts. The system registers are used both to generate the random patterns and to compress test results. This integrated technique is called "built-in logic block observation", or BILBO. Recently, BILBO has been given considerable attention because of certain advantages over other approaches. Specifically, comparing the LSSD and BILBO approaches, the ratio of the time needed to generate and apply LSSD patterns versus the time needed to apply pseudorandom patterns for the RILRO is (L x K) [61j, where L is the maximum length of the shift register latch (SRL) and K is the ratio of the speed at which the shift register could be shifted in RILRO versus the speed at which the test patterns could be generated for LSSD. K is usually in the range from 100 to 1000. The derivation for this relation is as follows. Assume P patterns are to be applied in one design using LSSD and another 104 using RILBO, the time needed for LSSD will be T(LSSD) = P x L x (1/TPGS) where TPGS is test pattern generation speed in patterns/sec. The time needed for BILBO w ill be T(BILBO) = P x ( 1/SRLS) where SRLS is SRL shifting speed. Thus, the ratio of the time needed for LSSD versus the time needed for BILRO is T(LSSD) / T(BILBO) = (P x L / TPGLS) / (P / SRLS) = L X (SRLS / TPGS) = L x K. It can be seen that the testing time for the LSSD approach applied to a given system is significantly greater than that for BILBO applied to the same system. However, it is not sufficient to restrict the consideration to only the ratio of testing times. 
The ratio of propagation times for these two approaches must also be considered when each is operating in the non-testing, or normal mode. Both testing approaches make use of registers to control and observe parallel data. Both approaches require that the registers operate in a parallel and a serial shift mode. The addition of the serial shifting capability to a VLSI register does not increase the register setup or propagationtimes. However the signature analysis mode inherent in BILBO requiresthe insertion of an Exclusive-Or function plus an And function in each register input, as shown in figure 5.5. The result is that the propagation delay of a BILRO design, in the non-testing mode, is two to three times greater than the corresponding LSSfl design. 105 Scan-Cut SRL Out Out N SRL Out Out 0 Figure 5.5 Logic Circuit Diagram of BILBO Registers nux Scan-In— 0 106 Another consideration is that the requirement to add the And function and Exclusive-Or function at each bit position of the BILBO shift register will increase the required chip area occupied by the register configuration by as much as two to three times. In conclusion, if testing time is the primary consideration in the decision for the selection of the LSSD or the BILBO method for testing, the la tte r would obviously be selected. However, if real-time execution speed is the primary factor and i f testing can be done o ff-lin e , the LSSD approach seems preferable. 5.4.2 Level Sensitive Scan Design (LSSD) Because the Robotics Processor (RP) 1s designed for real-time control systems, the execution speed is, by definition, a primary concern. Also, the RP is a highly complex chip demanding that chip area be conserved as much as possible. For these reasons LSSD is selected as the testab ility approach in the RP. LSSD is a testab ility technique wherein all bits of the internal state of the chip are linked into a shift register and read out for examination. 
This scan path greatly increases the observability and controllability of the chip by providing access to signals that would otherwise be invisible to the outside world without extensive multiplexing schemes or large numbers of extra pins. Figure 5.6 shows the LSSD used in the pipelined stages of the FPA and FPM. Note that all SRLs are connected serially. Test patterns are fed to Scan_In, and after one P_CLK cycle both the input test patterns and the test results are extracted from Scan_Out. Figure 5.7 shows the interconnection of the SRLs and the detailed circuit of the SRL. One set of clocks, φ1 and φ2, is active during the normal mode, and the other set is active during the testing mode.

Figure 5.6 LSSD Used in the Pipelined Stages of the FPA and FPM

Figure 5.7 Interconnection of the LSSD SRLs

5.5 Floating Point Adder/Subtractor (FPA)

The data path of the Robotics Processor consists of the Register File (RF), the Floating Point Adder/Subtractor (FPA), and the Floating Point Multiplier (FPM). The RF is simply a three-port RAM consisting of 64 32-bit words. Each of the two arithmetic units has three pipeline stages. The data flow in the data path for normal arithmetic operations is described in Appendix B.4. The design work for the FPA is explained in this section, and that for the FPM in section 5.6. First, the floating point format is defined and explained, and then the algorithm to perform floating point addition/subtraction is described. Finally, the block diagram of the FPA and its building blocks are described: the Zero Checking Unit, Sign Unit, N-bit Adder/Subtractor, Alignment Control Unit, Barrel Shifter, Leading Zero Detector, Postnormalization Unit, and Over/underflow Unit.
5.5.1 Floating Point Format

The first version of the IEEE floating point standard format was drafted in April 1978 by Harold Stone, and the final version was published in March of 1981 [53]. The main goal of the standardization effort was to establish a standard which would allow communication between systems at the data level without the need for conversion. The standard defines four floating point formats in two groups, basic and extended, each having two widths, single and double. Here only the basic single precision format is considered. Its format, made up of a sign bit, a biased exponent, and a significand, is

    bit 31 : S, the sign bit (1 bit)
    bits 30-23 : E, the biased exponent (8 bits)
    bits 22-0 : F, the significand (23 bits)

The possible values for IEEE single precision floating point numbers are shown in table 5.1.

Table 5.1 Possible Values of the IEEE Single Precision Floating Point Format

  Name           Value                       E         F
  Not a Number   Not applicable              255       Not all zeros
  Infinity       (-1)^S (Infinity)           255       All zeros
  Normalized     (-1)^S (1.F) x 2^(E-127)    1 - 254   Any
  Denormalized   (-1)^S (0.F) x 2^(-126)     0         Not all zeros
  Zero           (-1)^S 0.0                  0         All zeros

Since most commercial floating point processors, e.g. the Intel 8087, Weitek WTL 1032 and 1033, and NS 16081-6, support the IEEE standard format, the floating point adder/subtractor and multiplier designed in this project follow the IEEE standard format.

5.5.2 Algorithm and Block Diagram

Addition and subtraction are described together since the same hardware is used for both operations. Subtraction is performed by addition using a 2's-complemented subtrahend. The algorithm for floating point addition/subtraction can be divided into five consecutive steps:

1. Check for zero operands.

2. Prenormalize: Align the two operands by comparing their exponents. The mantissa with the smaller exponent is right-shifted by the amount of the difference of the two exponents.

3. Add/subtract the two mantissas.

4.
Postnormalize: When the two mantissas have the same sign, a mantissa overflow may occur. If this happens, right-shift one place to put the result back into range, and increment the exponent. If the operands have different signs, there may be cancellation of leading significant bits, yielding an unnormalized result. In that case, shift the result left into the normalization range and subtract the shift amount from the larger exponent. The shift amount is equal to the number of leading zeros.

5. Check for exponent overflow or underflow. If either occurs, the appropriate constants are loaded for the exponent and the mantissa, and the status bit is set.

Figure 5.8 shows the block diagram of the above algorithm. It is assumed that the floating point adder/subtractor operates on two operands, A and B, and that the result of A+B or A-B is delivered as operand R. SA denotes the sign bit of operand A, EA the exponent of A, and MA the mantissa of A. The OP bit specifies addition or subtraction. The floating point adder/subtractor consists of a three-stage pipeline, with the first stage performing sign bit determination, zero checking, and prenormalization, the second stage performing mantissa addition/subtraction, and the final stage performing postnormalization and overflow/underflow checking. The pipeline registers (lssd_n.ca) contain LSSD (level sensitive scan design) circuitry to ensure testability, where n is the width of the pipeline register.

The first stage consists of two zero checking units (fa_exp_ne.ca), the sign unit (fa_sign.ca), the mantissa comparator (add_sub_24.ca), the alignment control unit (fa_ali_con.ca), and the right-shifter (fa_sh_r.ca). Each zero checking unit examines the exponent of its operand. If the exponent is zero, a zero is attached as the most significant bit (MSB) of the 24-bit mantissa. Otherwise a one is attached as the implicit leading bit. The circuit design of the zero checking unit is described in Appendix B.5. The sign unit determines the sign bit of the final result and the effective operation for the mantissa addition or subtraction. If its
The sign unit determines the sign bit of the final result and the effective operation for the mantissa addition or subtraction. If its 113 MB I s s d - 8 .c a > 2 3 23 fa -e x p - fa -e x p - ne .ca SUB MB-GT-MAq add-sub-24.ca tmz. f a - a l i - (fl-B) co n .ca AMOUNT-R FZ-R IN* INI | MUX-8.ca T 8 SUB A 8 a d d -s u b -2 4 .c a i -1 Is s d -8 , Ci Issd-24.ca 5L1E. — RSH f a -z e ro - d e .c a s 8 .a- o v f - a d d -su b -8 .fa f a -s h - ■23 u d f,c a OUT ZOUT RSH IN* INI IUX-8. ca OVF MR Figure 5.8 Block Diagram of the Floating Point Adder/Subtractor 114 output signal SUB is equal to 1, then subtraction is executed, otherwise addition is executed. The circuit design of the sign unit is described in Appendix B.6. The mantissa comparator is used to determine whether the mantissa of A or B is larger. Since it always does A minus B, the carry out (low asserted) of the 24-bit subtractor means that the mantissa of B is greater than A. The alignment control unit determines the difference of the two operand exponents. This difference is equal to the required number of mantissa shifts. The alignment control unit sends the signal R_GT_A, R greater than A, to the sign unit to determine the sign bit of the final result. The signal B_GT_A 1s also used to route the mantissa of the smaller number to the rig ht-shifter. The circuit design of the alignment control unit is described in Appendix B.7. The right-shifter contains a 24-bit barrel shifter which can cause a shift a 24-bit word by a number of bit positions ranging from 0 to 23 within about two gate propagation delays. I f the rig ht-shift amount is greater than or equal to 24, the output of the rig ht-shifter is forced to zero since the input has, at most, 24 bits. The circuit design of the 24-bit shifter is described in Appendix B.8. The second stage contains only a mantissa adder/subtractor, add_sub_24.ca. The subtract operation is selected by the SUB signal. 
The RSH (right shift) signal is generated when there is a carry out and addition is executed. If it is asserted, the result of the mantissa adder/subtractor, add_sub_24.ca, is shifted right by one bit in postnormalization, and the common exponent, the exponent of the larger operand, is incremented by one. The circuit design of the 24-bit adder/subtractor is described in the next section.

The third stage consists of the leading zero detector (fa_zero_de.ca), the exponent update unit (add_sub_8.ca), the left-shifter (fa_sh_l.ca), and the over/underflow unit (fa_ovf_udf.ca). Postnormalization is performed in this stage. A detailed explanation of the postnormalization can be found in Appendix B.9. The leading zero detector is used in postnormalization to provide the left-shifter with the shift amount needed to normalize the output of the mantissa adder/subtractor, add_sub_24.ca. The circuit design of the leading zero detector is described in Appendix B.10. The exponent update unit updates the common exponent, incrementing it by one when RSH is asserted, or subtracting the shift amount sent from the leading zero detector when RSH is not asserted. The over/underflow unit detects the occurrence of overflow or underflow. If either should happen, the corresponding status bit is set. It also sends out the ZOUT (zero output) signal to force the mantissa and exponent to zero when underflow occurs or the left shift amount is 24. The circuit design of the overflow/underflow unit is described in Appendix B.11.

5.5.3 N-bit Adder/Subtractor

The methods for doing addition/subtraction can be classified into ripple carry generation and parallel carry generation. A ripple carry adder/subtractor that adds/subtracts two N-bit operands consists of a cascade of N full-adder stages. A full adder is a logic network with
It performs the following logic functions : s(i) = a(i) xor b(i) xor c(i-l) c(i) = ( a (i) A b(i) ) j ( a(i) 4 c(1-l) ) | ( b(i) A c(i-l) ) for i = 0,1, ...... , (N-n For high-speed applications, carry lookahead adders are usually implemented. However, Sakurai and Muroga pointed out in [35] that Carry-lookahead adders require larger chip area, and furthermore, if low-power devices such as MOSFETs are used, the speed of the carry-lookahead adders is greatly slowed down due to parasitic capacitance caused by large fan-outs of some gates, (If we try to reduce large fan-outs of some gates by using extra gates, the chip area increases further.) Thus, when the chip size area limited, adders which occupy a small chip area are often used for high speed (e.g. a ripple adder, instead of a carry-lookahead adder, is used in In te l's microprocessor chip 8080 for higher speed because of chip size 1 imitation.) Also, Mead and Conway pointed in [1, pl50] that simulation of several look-ahead carry circuits indicated that they would add a great deal of complexity to the system without much gain in performance. Therefore, a ripple carry adder is designed for the N-bit adder/subtractor in the Robotics Processor. The compactness of the adder is very important since a more compact network often increases speed by virtue of its smaller parastic 117 capacitance. In actual design, the chip area occupied by a circuit cannot be known until the actual layout is completed. However, Muroga and Lai pointed out [33] that minimizing the number of gates as the primary objective and the number of connections as the secondary objective usually yields the most compact circuits, at least for functions using a small number of variahles. So an adder with as few inverters as possible was designed and pass transistors were used as often as possible, since they are formed by simply crossing polysilicon over diffusion and occupy l it t l e chip space. 
A static Manchester-type carry chain adder was designed in this project. The Manchester-type carry chain adder has N basic cells cascaded vertically. In each c e ll, the propagation delay of the carry signal in one-bit full adder includes one logic inverter and one pass transistor. Thus, the carry propagation delay time for the N-bit adder is N times of the delay of one inverter plus one pass transistor. Since the resistance of a pass transistor is about one fourth of that of a pull-up transistor, the propagation delay of a pass transistor is about one fourth of that of an inverter. So the total carry propagation delay time for the N-bit adder is about (1+1/4)N inverter delays. This delay time is shorter than that of the adder in [34], which was claimed to have the smallest propagation delay, (4/3)N inverter delays. A dynamic Manchester-type adder can be found in [1, p. 150], which is precharged by one of the clock phases. Since the carry chain in this adder is normally a series of pass transistors, the chain must be periodically buffered to minimize propagation delay. The carry-in signal is usually restored by a pair of inverters every four adder cells. Although the 118 dynamic Manchester-type adder has fewer components than the static one and thus has smaller chip area, it requires a precharge clock causing the contolling to be more complicated. Therefore, the static Manchester-type adder was designed instead. The truth table of a one-bit full adder are shown in table 5.2, where the p(i) is the carry-propagate for cell i, and g(i) is the carry-generate. 
Table 5.2 Truth Table of a One-bit Full Adder

a(i)  b(i)  c(i-1)    c(i)  s(i)  p(i)  g(i)
 0     0     0         0     0     0     0
 0     0     1         0     1     0     0
 0     1     0         0     1     1     0
 0     1     1         1     0     1     0
 1     0     0         0     1     1     0
 1     0     1         1     0     1     0
 1     1     0         1     0     0     1
 1     1     1         1     1     0     1

The logic equations for c(i), s(i), p(i), and g(i) are derived as follows:

p(i) = a(i) xor b(i)
g(i) = a(i) & b(i)
c(i) = (p(i) & c(i-1)) | g(i)
s(i) = p(i) xor c(i-1)

but

g(i) = a(i) & b(i)
     = (a(i) & a(i) & b(i)) | (a(i) & !a(i) & !b(i))
     = a(i) & ((a(i) & b(i)) | (!a(i) & !b(i)))
     = a(i) & !(a(i) xor b(i))
     = a(i) & !p(i)

so

c(i)  = (p(i) & c(i-1)) | (!p(i) & a(i))
!c(i) = (p(i) & !c(i-1)) | (!p(i) & !a(i))

When the logic equations for c(i) and its complement, !c(i), are derived as above, pass transistors can be used to obtain the carry-out. The circuit diagram for a 2-bit adder is shown in figure 5.9. It consists of two different cells, say even and odd. The even cell has the complement output !c(i), while the odd cell has the c(i) output. Dividing the 2-bit adder into two different cells allows each cell to have the same carry propagation delay of one inverter plus one pass transistor. An N-bit subtractor can be easily obtained by taking the Exclusive-Or of the b(i) input and the SUB_OP (subtraction operation) signal, i.e. b(i) is replaced by [b(i) xor SUB_OP].

5.6 Floating Point Multiplier (FPM)

In this section, the algorithm to perform floating point multiplication is proposed, and then the block diagram of the FPM and its functional building blocks are described. They are the Zero Checking Unit, Exponent Computation, 24-bit Fixed Point Multiplier, Postnormalization, and Over/underflow Unit.
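The pass-transistor form of the carry can be verified exhaustively against the majority-function carry of table 5.2 (a small Python check, illustrative only):

```python
from itertools import product

# Check that the pass-transistor form of the carry,
#   c(i)  = p & c_in | !p & a      (pass c_in when p = 1, else pass a)
#   !c(i) = p & !c_in | !p & !a
# matches the majority-function carry of a full adder for all 8 inputs.
for a, b, cin in product((0, 1), repeat=3):
    p = a ^ b                      # carry-propagate
    g = a & b                      # carry-generate
    c_majority = (a & b) | (a & cin) | (b & cin)
    c_pass = (p & cin) | ((1 - p) & a)
    not_c_pass = (p & (1 - cin)) | ((1 - p) & (1 - a))
    assert c_pass == c_majority
    assert not_c_pass == 1 - c_majority
    assert c_majority == (p & cin) | g   # c(i) = p & c(i-1) | g
```

The equivalence is what allows the even/odd cells to compute c(i) and !c(i) with one pass transistor each.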
Figure 5.9 Circuit Diagram of a 2-bit Adder with Manchester-type Carry Chain

For the 24-bit Fixed Point Multiplier, several methods are proposed and compared: Sequential Add-shift Multiplication, Array Multiplier, Nonadditive Multiply Modules (NMM) with Wallace Trees, Additive Multiply Modules (AMM), Recursive Parallel Multiplier, Modified Booth Algorithm (Radix-4) with Carry-save Adders, and Pipelined Recursive Multiplier with Modified Booth Algorithm.

5.6.1 Algorithm and Block Diagram

The algorithm for the floating point multiplication can be divided into five consecutive steps:

1. Check for zero operands.
2. Determine the product sign, add exponents and correct for the bias.
3. Perform fixed point multiplication of the mantissas of the two operands.
4. Postnormalization: normalize the result of the mantissa multiplier. This may require one right shift and then incrementing the exponent.
5. Check for overflow/underflow.

Figure 5.10 shows the block diagram of the above algorithm. It is assumed that the floating point multiplier operates on two operands, A and B, and that the result of A x B is delivered as operand R. It can be seen that the floating point multiplier is a three stage pipeline. The first stage performs zero operand checking, sign bit determination and exponent addition. The detailed circuit design of the zero checking unit is described in Appendix B.12.

Figure 5.10 Block Diagram of the Floating Point Multiplier

The final sign bit, SR, is obtained by taking the Exclusive-Or of the two sign bits, SA and SB. Exponent addition is performed by the 8-bit adder, add_8.ca, in this stage. The 24-bit fixed point multiplier occupies stages one and two.
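The five steps above can be sketched on IEEE single-precision fields. The model below is a simplified illustration (it truncates the 48-bit product rather than rounding, and ignores denormals; the function and field names are ours, not the design's):

```python
def fp_mul(sa, ea, ma, sb, eb, mb):
    """Multiply two floating point operands given as
    (sign, biased exponent, 24-bit mantissa with hidden leading 1).
    Simplified model: truncating, no denormals."""
    # Step 1: zero operands (biased exponent 0 treated as zero here)
    if ea == 0 or eb == 0:
        return (0, 0, 0)
    # Step 2: sign and biased exponent (the bias 127 must be removed once)
    sr = sa ^ sb
    er = ea + eb - 127
    # Step 3: 24x24-bit fixed point multiply -> 48-bit product
    mc = ma * mb
    # Step 4: postnormalization; a product of [1,2) x [1,2) lies in [1,4)
    if mc & (1 << 47):        # product >= 2: shift right, bump exponent
        mr = mc >> 24
        er += 1
    else:
        mr = mc >> 23
    # Step 5: overflow/underflow on the biased exponent
    if er >= 255 or er <= 0:
        raise OverflowError("exponent out of range")
    return (sr, er, mr)

# 1.5 x 2.5 = 3.75: product 1.875 < 2, so no postnormalization shift
assert fp_mul(0, 127, 3 << 22, 0, 128, 5 << 21) == (0, 128, 15 << 20)
# 1.5 x 1.5 = 2.25: product 2.25 >= 2, shift right and increment exponent
assert fp_mul(0, 127, 3 << 22, 0, 127, 3 << 22) == (0, 128, 9 << 20)
```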
In stage one, partial products are generated and carry-save addition is performed; in stage two, carry propagation addition is performed to produce the unnormalized 48-bit product. The detailed design of the 24-bit fixed point multiplier is discussed in section 5.6.2. The third stage performs exponent correction, postnormalization, and overflow/underflow checking. Since in the IEEE floating point format the exponents are represented by a biased code, a correction is necessary. The exponent correction is performed by a 2-bit adder (add_2.ca) in stage three and is explained as follows. In the floating point multiplication R = A x B, the exponents of the two operands are added together to generate the exponent of the result. The real exponent of A is REAL_EA = EA - 127. The real exponent of B is REAL_EB = EB - 127. The real exponent of R is:

REAL_ER = REAL_EA + REAL_EB
        = (EA - 127) + (EB - 127)
        = (EA + EB - 127) - 127
        = ER - 127.

So the biased exponent of R is:

ER = EA + EB - 127
   = (EA + EB) + 1 - 128
   = (EA + EB) + 1 + (2's complement of 128)
   = (EA + EB) + 1 + (110000000)

The addition of the value "1" is achieved by connecting the carry-in of the exponent adder (add_8.ca) to high, while the addition of the value "110000000" is achieved by adding "11" to the two leftmost bits of the result of (EA + EB + 1), since the other 7 bits of the constant are zeros. Since, by definition, the values of the mantissas of the two operands are between 1 and 2, their product is between 1 and 4, and thus postnormalization may be required. If the product mantissa is greater than or equal to 2, one right shift is required along with incrementing the biased exponent by one. The right shift is implemented by a multiplexor (mux_23.ca, the upper one) and is controlled by the MSB of the mantissa product, B23_MC. If B23_MC is high, the rightmost 23 bits of MC (= MA x MB) are selected. Otherwise, the rightmost 22 bits plus an LSB of zero are selected.
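The bias-correction identity used by the hardware can be checked numerically over all valid biased exponents (a small illustrative check; the 9-bit width matches the 9-bit constant 110000000):

```python
# The exponent-correction identity: ER = EA + EB - 127 is computed in
# hardware as (EA + EB) + 1 + 0b110000000, keeping a 9-bit result
# (0b110000000 = 384 is the 9-bit two's complement of 128).
for ea in range(1, 255):
    for eb in range(1, 255):
        direct = (ea + eb - 127) & 0x1FF
        hardware = (ea + eb + 1 + 0b110000000) & 0x1FF
        assert hardware == direct
```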
The B23_MC is also used as the carry-in of the exponent updater (add_9.ca). Overflow/underflow checking is accomplished by the over/underflow unit. Its detailed circuit design is described in Appendix B.13.

5.6.2 24-bit Fixed Point Multiplier

The multiplication of two fixed-point binary numbers can be achieved by sequential add-shift or by parallel schemes. Schemes for parallel multiplication can be roughly divided into two classes [39] [40]. The first one consists of an array of cells connected in an iterative way to form a product of any size. The second one, introduced by Wallace, consists of the generation of all partial product terms and the subsequent reduction of all partial products to two terms by using carry-save addition. In the following sections, several multiplication schemes are explored and compared on their area and time complexity. Finally, the scheme of the pipelined recursive multiplier with modified Booth algorithm is used in this project.

5.6.2.1 Sequential Add-Shift Multiplication

The hardware organization of sequential add-shift multiplication can be found in most digital computer architecture books, for example [31, p. 145]. The circuit performs multiplication by using a single adder n times to implement the addition of the n rows of partial products. It is inexpensive to implement but too slow to meet our requirement. According to our assumption, with 3 pipeline stages for the floating point multiplier, a 24-bit fixed point multiplication must be finished in 2 stage clocks, i.e. 2 x 1 microsecond = 2 microseconds, so each 24-bit addition must be accomplished in 83 ns (= 2 microseconds / 24). This is still not feasible for present NMOS technology. To achieve faster multiplication, various high-speed parallel multipliers must be used. The following sections show parallel multiplication schemes.

5.6.2.2 Array Multiplier

The schematic circuit diagram of a 4-by-4 array multiplier can be found in [31, p. 197] [30, p.
48]. A typical commercial chip is the TI 74LS274. A multiplier of N-by-N size requires N(N-1) full-adders and N x N AND gates. Here the delay time of an AND gate is assumed negligible. The total delay time of the multiplier is about [(N-1) + (N-1)] x T(full_adder), i.e. (2N-2) x T(full_adder), where T(full_adder) is the delay time of a full-adder. For N = 24, 24(24-1) = 552 full-adders are needed and the delay time is (2 x 24 - 2) = 46 x T(full_adder). This kind of multiplication scheme is usually implemented when N is not too large, say less than or equal to 16, as in [43] [44] [45].

5.6.2.3 Nonadditive Multiply Modules (NMM) with Wallace Trees

A K-input Wallace tree is a bit-slice summing circuit, which produces the sum of K bit-slice inputs. For example, a Wallace tree with 7 inputs can be found in [31, p. 196] [30, p. 166] [TI 74LS275]. An NMM is just an M-bit-by-M-bit array multiplier of smaller dimension; for example, a typical NMM with M = 4 is the TI 74LS274. A multiplier of any size can be obtained by properly arranging NMMs and Wallace trees. In [31, p. 203], the modular arrangement for array multiplication networks ranging in size from 4x4 to 32x32 is shown. Each rectangle represents an eight-bit partial product divided into high and low 4-bit slices. All slices are added in a columned fashion by Wallace trees with odd numbers of inputs. For example, a 24-bit multiplier needs 36 4-bit NMMs arranged to feed the Wallace trees. The size and the number of the required Wallace trees are shown in table 5.3. The number of carry-save-adder levels required in a Wallace tree is log (size of the Wallace tree / 2) [29, p. 139], where the base of the log is 3/2.
Table 5.3 Size and Number of the Wallace Trees for a 24-bit Multiplier

size of the     number of the    number of full-adders    number of full-
Wallace tree    Wallace trees    in each Wallace tree     adder levels
     3                8                   1                     1
     5                8                   3                     3
     7                8                   4                     3
     9                8                   7                     4
    11                8                   8                     5

A 4-bit NMM needs 4(4-1) = 12 full-adders and has delay time (2x4-2) = 6 x T(full_adder). The number of full-adders needed for the 24-bit-by-24-bit multiplier is

(36 x 12) + 8 x (1 + 3 + 4 + 7 + 8) = 616

where the first term counts the full-adders in the 36 NMMs and the second term the full-adders in all the Wallace trees, plus the full-adders needed in a 48-bit adder. The total delay time for the multiplier is the sum of the delay time of an NMM, the delay time of the longest Wallace tree, and the delay time of a 48-bit adder. If a carry-propagate adder is used for the 48-bit adder, then the number of full-adders is 616 + 48 = 664. The total delay time will be (6 + 5 + 48) = 59 x T(full_adder). It can be seen that both the number of full-adders and the delay time are greater than those of the array multiplier described above. Due to the irregular interconnection between NMMs and Wallace trees, this scheme does not give feasible layouts for VLSI implementation. Furthermore, it is not modular in nature, so it can hardly be extended to form regular large multiplier arrays.

5.6.2.4 Additive Multiply Module (AMM)

This scheme does not require bit-slice summing trees such as Wallace trees. A 4-by-2 AMM can be found in [30, p. 49]. In general, a 4m-by-4m multiplication network can be constructed from 2 x m x m 4-by-2 AMMs with delay time equal to (3m-1) x T(AMM), where T(AMM) represents the delay time of an AMM. For a 24-bit multiplication, 2 x 6 x 6 = 72 AMMs are required. An AMM consists of 8 full-adders and has delay time 5 x T(full_adder). So, in total, the 24-bit multiplier needs (8 x 72) = 576 full-adders with delay time (3x6-1) x 5 = 85 x T(full_adder).
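The full-adder counts and delays derived so far can be tabulated in a few lines (illustrative arithmetic only, reproducing the figures in the text):

```python
# Full-adder counts and delays (in units of T(full_adder)) for the
# three combinational schemes at N = 24, as derived in the text.
N = 24

# Array multiplier: N(N-1) full-adders, (2N-2) full-adder delays
array_fa, array_delay = N * (N - 1), 2 * N - 2

# NMM + Wallace trees: 36 4-bit NMMs of 12 FAs each, the Wallace trees,
# plus a 48-bit carry-propagate adder; delay = NMM + longest tree + CPA
nmm_fa = 36 * 12 + 8 * (1 + 3 + 4 + 7 + 8) + 48
nmm_delay = 6 + 5 + 48

# AMM: 72 4-by-2 modules of 8 FAs each; delay (3m-1) x T(AMM), T(AMM) = 5
amm_fa, amm_delay = 72 * 8, (3 * 6 - 1) * 5

assert (array_fa, array_delay) == (552, 46)
assert (nmm_fa, nmm_delay) == (664, 59)
assert (amm_fa, amm_delay) == (576, 85)
```

The comparison confirms that, at N = 24, the plain array multiplier dominates both module-based schemes in both measures.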
Both the number of full-adders and the delay time are greater than those of the array multiplier described before.

5.6.2.5 Recursive Parallel Multiplier

Luk proposed in [37] a complete VLSI layout for a fast recursive parallel multiplier having time complexity (the time required to complete an operation) in the order of logN x logN, or T = O(logN x logN), and area complexity (the area of the layout) in the order of N x N x logN x logN, or A = O(N x N x logN x logN). The recursive parallel multiplier divides a large sized multiplication into a number of smaller sized multiplications followed by additions combining the intermediate results. Thus, multiplication is recursively reduced to a sequence of additions until arriving at a reasonably small size, say 2-bit or 4-bit multiplication. The above time complexity is obtained by using the Brent-Kung adder [36], which is a parallel carry look-ahead adder having time O(logN). If a carry propagation adder is used, the time complexity will be T = O(N), greater than O(logN x logN). However, if carry-save adders are used, the time complexity can be improved to an optimal T = O(logN) [38]. In [38], three versions of the multiplication are explored: the 4M, 3M and 2M versions. The 3M version gives a smaller layout with A = O(N x N), but is not as regular as the other two versions. The area and time complexities for the three versions are listed in table 5.4.

Table 5.4 Comparison of the 4M, 3M, and 2M Versions of the Multiplication in [38]

version    area                     time
4M         N x N x logN x logN      logN
3M         N x N                    logN
2M         N x N x logN             logN

Even though the recursive parallel multiplier gives time complexity near optimal, or even optimal T = O(logN) if carry-save adders are used, it occupies much more chip area than the array multiplier. For example, in [38] an 8x16 recursive multiplier (2M version), made by a 3-micron NMOS process, takes about 65 ns of operation time and has a chip size of 49 micron square.
And in [43] a 16x16 parallel array multiplier, made using the same 3-micron NMOS process technology, has a longer operation time, 120 ns, but a smaller chip size of 5 micron square. In fact, when the word length of the two operands, N, is not large, a judgement cannot be made between the recursive multiplier and the other parallel multipliers based just on the value of the order. If N is not large, both the coefficient of the complexity and the value of the order must be taken into account. For example, multiplier A may have time complexity T = 2N while multiplier B has time complexity T = 3 logN x logN. One might conclude that multiplier B is faster than multiplier A, since O(logN x logN) is less than O(N). But multiplier B is not faster than multiplier A for N less than or equal to 32. Thus, if they are compared by considering only the values of the orders, an erroneous conclusion is likely. Therefore, if N is not large, more careful consideration must be given when two multipliers are compared. One practical and reasonable way to compare layout areas is to compare their numbers of full-adders. For operation times, the number of full-adder levels along the longest path can be used for comparison. Considering area and time complexity in terms of the number or the levels of full-adders is reasonable because they are the very basic building units of multipliers.

5.6.2.6 Modified Booth Algorithm (Radix = 4) with Carry-Save Adders

In the Booth algorithm, the multiplier is encoded into a series of +1, 0 or -1 digits as the multiplier is scanned from right to left. Multiplication speed is increased when the multiplier has blocks of 1's. Since the speed depends on the bit configuration of the multiplier, the efficiency of the Booth algorithm is obviously data-dependent. Using the modified Booth algorithm (radix = 4), an N-bit-by-N-bit multiplier generates only N/2 partial products.
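The crossover argument for the example above can be checked directly (illustrative arithmetic; the coefficients 2 and 3 are the hypothetical ones from the text):

```python
import math

# With T_A = 2N and T_B = 3 * (log2 N)^2, the asymptotically slower
# multiplier A is actually the faster one for every power of two up to 32.
for n in (4, 8, 16, 32):
    t_a = 2 * n
    t_b = 3 * math.log2(n) ** 2
    assert t_a <= t_b, (n, t_a, t_b)
```

At N = 32 the two are 64 versus 75 units, so the O-notation ordering has not yet taken hold.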
Thus, it can speed up the multiplication by a factor of almost two and cut the number of full-adders to about half, with only a small amount of encoding circuitry and multiplexing logic required. Every two bits of the multiplier are encoded, from right to left, into +2, +1, 0, -1, or -2. The bit pair encoding is shown in table 5.5.

Table 5.5 Encoding Table for the Modified Booth Algorithm

original multiplier         encoded multiplier    action
B(i+1)  B(i)  B(i-1)             B(i)'
  0      0      0                  0               add 0
  0      0      1                 +1               add A
  0      1      0                 +1               add A
  0      1      1                 +2               add 2A
  1      0      0                 -2               subtract 2A
  1      0      1                 -1               subtract A
  1      1      0                 -1               subtract A
  1      1      1                  0               add 0

An N-bit-by-N-bit multiplier needs (N/2+1) x (N+1) full-adders, (N/2) x (N+1) 1-out-of-5 multiplexors and a 2N-bit adder. Its delay time is the sum of (N/2+1) x T(full_adder), (N/2) x [T(encoding) + T(multiplexor)], and the delay time of a 2N-bit adder. An 8-bit-by-8-bit multiplier with the modified Booth algorithm (radix = 4) is shown in figure 5.11, simplified and modified from [55, p. 902]. There are 5 rows of carry-save adders combined with multiplexors and a 16-bit carry propagation adder (e.g. a Manchester-type adder). U1 is a multiplexor used to select one of +2A(i), +A(i), 0, -A(i) or -2A(i), i.e. the multiplicand multiplied by +2, +1, 0, -1, or -2. U2 is used to generate the carry-in at the LSB position when the action is a subtraction, i.e. when -A(i) or -2A(i) is selected. U3 is also a multiplexor, used to select +2A(N-1) or -2A(N-1). The U4s, combined with the 5th row of carry-save adders, are used to add the multiplicand to the partial product when the most significant bit (MSB) of the multiplier B, B(N-1), is one, since according to the modified Booth encoding, if the MSB of the multiplier B is one, the action of adding the multiplicand A must be performed at the last step. For example, the multiplier B = 9F or 5F in hexadecimal is encoded below.
It can be seen that at the last encoding step, the encoded multiplier digit B(i)' = "+1" for B = 9F is generated because the MSB of the multiplier B is 1.

original multiplier (9F)    10 01 11 11
encoded multiplier          +1 -2 +2  0 -1

original multiplier (5F)    01 01 11 11
encoded multiplier           0 +1 +2  0 -1

Figure 5.11 An 8x8 Multiplier with Modified Booth Algorithm (Radix = 4)

U5 is a circuit used to encode the multiplier and generate two sign extension bits, SIGN_EX(i) and SIGN_EX(i+1). The logic equations of each unit are shown and explained in Appendix B.14. The full-adder at the (N-1)th position of the last row is used for rounding. A detailed explanation of the rounding scheme is given in Appendix B.14.

5.6.2.7 Pipelined Recursive Multiplier with Modified Booth Algorithm

Using the modified Booth algorithm, the number of full-adders of an N-bit-by-N-bit multiplier can be cut to about half, but it still occupies tremendous chip space when N is reasonably large, say 24. So instead of using an iterative array of carry-save adders as in figure 5.11, a pipelined recursive multiplier with the modified Booth algorithm is used. The concepts follow the above and those in [41]. The (N/2 + 1) rows of carry-save adders are replaced by one row of clocked carry-save adders. The structure of the pipelined recursive multiplier is shown in figure 5.12. Figure 5.13 shows the timing for the 24-bit pipelined recursive multiplier. The clock phases phi1 and phi2 are both 16 MHz; Pphi1 and Pphi2 are both 1 MHz. The MLD signal, generated once every Pphi cycle with pulse width equal to two phi periods, is used for loading the B multiplier and some initial values at the beginning of the fixed point multiplication.
For example, X(+2) X(+1) X(0) X(-1) X(-2) are set to "0 0 1 0 0"; all carries and sums are set to 0; MINUS(0) is set to 0; SIGN_EX(1) is set to 0; and SIGN_EX(0) is set to 0 by setting MINUS(0) = 0 and X(-1) = 0. This causes the inputs to the carry-save adders to be initialized to zero after pulse 1. The MLD signal can be obtained with a PLA, whose state diagram is shown in figure 5.14.

Figure 5.12 The Structure of a Pipelined Recursive Multiplier (mul_24.ca)

Figure 5.13 Timing for the Pipelined Recursive Multiplier

Figure 5.14 State Diagram for Generating the MLD Signal

For convenience of explanation, the pulse trains of both phi1 and phi2 are marked from 1 to 14. The phi2* in figure 5.12 is obtained by ANDing phi2 and !MLD (! means logic NOT). This prevents the conflicts that would occur when both phi2 and MLD are high. During phi1 pulse 1, the least significant bits of the multiplier B, b(i-1), b(i) and b(i+1), are shifted out to an encoder and are encoded into X(+2), X(+1), X(0), X(-1) and X(-2), which are used in U1, U2, U3 and the sign extension unit, sign_ex.ca. During phi2 pulse 1, the appropriate partial product of +2A, +A, 0, -A, or -2A generated from U1, U2 and U3 is fed into the carry-save adders. During phi1 pulse 2, the addition is performed. The carry set and sum set of the carry-save adders are latched at phi2 pulse 2. The carries and sums are then fed back to the carry-save adders on the next phi1 pulse. After the addition is repeated 13 times, from pulse 2 to pulse 14, the final carry set and sum set (each of 2N bits) are generated and latched at the falling edge of Pphi1. The 2N-bit addition is then performed by a carry propagation adder (e.g.
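The recursion above, one row of carry-save adders reused once per Booth digit, can be modeled behaviorally. This is an illustrative Python sketch of the arithmetic only (unsigned operands, no carry-save detail); it also reproduces the 9F/5F encodings shown earlier:

```python
def booth_encode(b, n):
    """Radix-4 modified Booth encoding of an n-bit unsigned multiplier.
    Returns digits in {-2,-1,0,+1,+2}, least significant digit first,
    plus a final correction digit equal to the MSB (the U4 'add A' step)."""
    bits = [(b >> i) & 1 for i in range(n)] + [0, 0]
    table = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
             (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}
    digits, prev = [], 0
    for i in range(0, n, 2):
        digits.append(table[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    digits.append(bits[n - 1])   # last step: add A when MSB = 1, else add 0
    return digits

def recursive_multiply(a, b, n):
    """One adder row reused: accumulate d * a * 4^i for each Booth digit d."""
    acc = 0
    for i, d in enumerate(booth_encode(b, n)):
        acc += d * a << (2 * i)
    return acc

# The 9F / 5F examples from the text (the text lists digits MSB first)
assert booth_encode(0x9F, 8)[::-1] == [1, -2, 2, 0, -1]
assert booth_encode(0x5F, 8)[::-1] == [0, 1, 2, 0, -1]
assert recursive_multiply(0x9F, 0x5F, 8) == 0x9F * 0x5F
```

Each loop iteration corresponds to one phi1/phi2 pulse pair in the hardware; the final carry-propagate addition is implicit in Python's integer `+`.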
a Manchester-type adder) in the second stage of the floating point multiplier. The detailed circuit designs for the register storing the multiplier B and the recursive carry-save adder are described in Appendix B.16. The maximum operating speed of the multiplier is determined by the slowest stage in the recursive pipeline. Since X(+2), X(+1), X(0), X(-1) and X(-2) are required at each carry-save adder, the registers corresponding to these five signals must have enough driving capability. So, the signal paths should be in the metal layer throughout the entire row of full adders.

5.7 Summary

The detailed designs of the Register File (RF), Sequencer (SEQ), and Control RAM (CRAM) are not covered in this chapter. All the other building blocks, the Clock Generator (CG), Bootstrap Unit (BU), Format Converter, Level Sensitive Scan Design (LSSD), Floating Point Adder/subtractor (FPA), and Floating Point Multiplier (FPM), are explained in great detail.

CHAPTER 6 COMPUTER AIDED DESIGN FOR VLSI

6.1 Introduction

Because of the diversity of tasks and concerns in VLSI design, a systematic method is especially important in designing a special purpose chip. Typically the chip is decomposed into several small cells geometrically, functionally, and hierarchically. The design of each functional block or cell can then be largely independent of the others. During the design phase, software tools provide the integrated circuit designer with step-by-step design assistance. This approach is called Computer-Aided Design (CAD). CAD tools support a hierarchical design sequence to assist the designer in specifying a system from initial concept to detailed implementation. They also support both functional and physical design. Functional design aids include synthesis, simulation, and verification at the architectural, system, logic, and circuit levels. Physical design aids support partitioning and layout.
The CAD tools must be technology independent so that the designs in each phase do not need to change with improvements in integrated circuit processing technology. In the second section of this chapter, the CAD tools used at The Ohio State University are introduced. Section three presents a network description language, a LISP-like language used to describe the designed circuit. The described circuit can then be simulated to verify its logic and timing before the circuit layout is attempted. Section four explains how the circuit can be simulated at the logic level. Finally, simulation at the circuit level is explained in section five.

6.2 Overview of VLSI Design Tools

Figure 6.1 shows the functional chart of essential VLSI design tools and several logical sequences of their applications. These tools are part of the VLSI design tools released by the UW/NW VLSI Consortium on October 1 of both 1983 and 1984. The following is a brief overview of the VLSI tools being used at The Ohio State University.

a) Functional design tools for translating a high-level design description into layout tool input are:

PEG : Translates a language description of a finite state machine into logic equations compatible with EQNTOTT.
EQNTOTT : Converts logic equations into a truth table format to be used as input to TPLA.

b) Layout tools used to design the actual artwork for the circuit are:

CAESAR : An interactive display editor for manhattan VLSI designs.
TPLA : Automatic PLA layout generator.

c) A display tool used to display circuit designs is:

PENPLOT : Penplotting programs for HP7221 and HP7580 pen plotters.

d) A design rule checker used for geometric rule checking is:

LYRA : Performs a hierarchical design rule check on a CAESAR formatted design using a corner-based algorithm.
Figure 6.1 Functional Chart of VLSI CAD Tools

e) A circuit extractor used to extract a simulation database from the layout database is:

MEXTRA : Extracts a Berkeley format '.sim' file from a Caltech Intermediate Form (CIF) input file.

f) Simulation tools used for logic and timing simulation are:

RNL : Event driven "timing" simulator. It uses logic levels and a simplified circuit model to estimate timing delays through digital circuits. It also has a mode that allows it to be used as a switch level simulator.
PSPICE : Preprocessor for the SPICE simulator.
SPICE : Device level circuit simulator.
ESIM : Switch level simulator. It uses logic levels and models transistors as perfect switches.
CRYSTAL : Static timing verifier. It uses a simplified circuit model to estimate the worst case delay through a circuit.

g) Filters used to convert from one database to another are:

:CIF : Converts .ca form to Caltech Intermediate Form.
SIMFILTER : A filter that converts Berkeley style .sim files, such as those produced by MEXTRA, to MIT style .sim files used as inputs by programs such as PRESIM.
It can also be used to produce Berkeley style .sim files from the MIT format, allowing circuits described with NETLIST to be analyzed by SPICE or run through the timing verifier CRYSTAL.

NETLIST : A program for generating circuit descriptions.
PRESIM : Converts MIT style .sim files into the binary format required by RNL.

As mentioned in the last section, the circuit of an entire chip is usually partitioned into several small cells. The design of each cell can then be largely independent of the others. If the cell implements a sequential function, it can be described as a finite state machine using a state diagram. If the cell implements a non-sequential function, it can be described using Boolean equations. Either kind of function can be implemented using a PLA. If a cell is not frequently used in the chip and does not occupy much chip space, it is usually implemented with a PLA, since the layout of the PLA can be easily obtained by using TPLA, the PLA generator. Specifically, EQNTOTT is first used to convert the set of logic equations which describe the cell into a corresponding truth table, then TPLA generates a PLA layout based on this truth table. But a basic, frequently used cell (e.g. a full adder) is usually implemented with a manual layout, because manual layout usually results in smaller chip space compared to using a PLA. Before the cell is laid out, its logic circuit design should be tested. The path through NETLIST, PRESIM, and RNL provides circuit description and verification. A detailed explanation of this verification path is given in sections 6.3 and 6.4. Once proven to be correct by the logic level and timing analyses, the circuit can be laid out using CAESAR with the AED512 graphic terminal and data tablet. The completed layout should be verified by LYRA to check for violations of the numerous layout rules. The resulting .ca file is
I f desired, the layout can then be plotted by PENPLOT. If the cell under consideration is a finite state machine implemented with this PLA, the register required to store the state variables is obtained using two inverters and two series pass devices for each state variable. The following is the specification of the fin ite state machine for the Bootstrap Controller, whose state diagram is shown in figure B.5. — Finite state machine of the Bootstrap Controller (BTC) - - The state diagram of the BTC is shown in figure B.5. — File name : btc.fsm — BT is the RESET signal, which is a keyword in PER. — When the RESET is present in the INPUTS, conditional — branches to the firs t state are automatically added to — the next state expressions for each state. INPUTS RESET WR LC; OUTPUTS CLR LD LOO L01 LD2 WEN INC; SO — This is the reset state. ASSERT CLR LO; IF NOT WR THEN LOOP; SI ASSERT LO LDO; S? ASSERT LO; IF NOT WR THEN LOOP; 53 ASSERT LO L01; 54 ASSERT LO; IF NOT WR THEN LOOP: 145 S5 ASSERT LD LD2 S6 ASSERT LD WEN S7 ASSERT LD INC S8 ASSERT LD; CASE (WR LC) 0 0 => SB; 1 0 => SI ? 1 => S9; ENDCASE => SB; S9 GOTO S9 ; After the fin ite state machine is written in the file "btc.fsm", it can be implemented immediately with a PLA by entering the command : % peg btc.fsm ) eqntott -f -R | tpla -s Bcis -I -0 -o btc.ca where "btc.ca" is the output of the PLA in CAESAR layout format. PEG translates a high level language description of a finite state machine into the logic equations, which are then read by EQNTOTT. EQNTOTT generates the truth table, which is used by TPLA to generate a PLA. The switch - f allows outputs to be defined in terms of their previous values in a synchronous system. The switch -R forces EQNTOTT to produce a truth table with no redundant minterms. The switch -s Bcis forces TPLA to generate a PLA with buried contacts, inputs and outputs on same side of the PLA. And the switch - I clocks the inputs to the PLA; the switch -0 clocks the outputs to the PLA. 
6.3 Logic Circuit Description Given the definition of a functional block, its logic circuit can be designed. For effective design it is important to ensure that the 146 design w ill work before layout is attempted. The circuit can be described with a network description language and then simulated at the logic level. The simulation result can show the logical level (0 or 1) and the timing of desired signals. NETLIST is a macro-based language for describing networks of different size transistors. The program NETLIST allows the user to describe the circuit with a symbolic language. RNL is a timing logic simulator for digital MOS circuits. It is an event driven simulator that uses a simple RC (resistance capacitance) model of the circuit to estimate node transition times and the effects of charging sharing. The network can be specified in a logic network description file using a LISP-1 ike command systax as follows. The circuit of a simple NMOS inverter, shown in figure 6.2, is used as the fir s t example. Its logic network description is written into the file "inverter.net", listed as below : ; (1) network description for a NMOS inverter ; File name : inverter.net ; (2) declaration of the nodes in the network (node in out) ; (3) depletion mode transistor (pull-up) (dtrans out out Vdd 2 4) ; (4) ehancement mode transistor (pull-down) (etrans in GND out 4 2 ) ; (5) specify an interconnect capacitance for the output node (capacitance out 0.03) 147 V d d I out in i 7 GND Figure 6,2 NMOS INVERTER Vdd I out ini in2 v GND Figure 5.3 NMOS NAND 148 The number enclosed in the parenthesis will be used to indicate each part of the description file . (1) A semi cl on causes the rest of the line to be treated as a comment. Blank lines are also ignored. (2) Any node named for subsquent reference must be declared. Nodes are declared with the command (node nl n2 n3 n4 . . . . ) where nl, n2, n3, n4, . . . are the names of the nodes to be referred to in the netwoek. 
(3) The declaration of the nodes provides the "skeleton" for the network. Components, such as transistors or capacitors, must then be filled in to construct the whole circuit. A transistor is written in the form :

(transistor-type gate source drain width length)

Transistor-type is a mnemonic for the type of transistor, such as

dtrans for n-channel depletion-mode transistor
etrans for n-channel enhancement-mode transistor

Specification of the width and length of the transistor's gate area in units of lambda is optional. If omitted, the default width (first number) and length (second number) for a depletion-mode transistor are 2 and 8, while the defaults for an enhancement-mode transistor are 2 and 2. The width and the length of the gate area determine the resistance of the transistor, i.e., they influence the ratio of the pull-up to the pull-down. (4) The pull-down is specified, analogously to (3), as an n-channel enhancement-mode transistor with a gate width of 4 and a gate length of 2. (5) The final element to be specified in the inverter is the interconnect capacitance. The command

(capacitance out 0.03)

tells NETLIST that a capacitance of 0.03 pF is to be connected between the node "out" and GND. For frequently used basic circuits, it is convenient to define macros, which can be stored in a library file and easily loaded into future network description files without having to be redefined. In NETLIST, some basic functions, such as Inverter, Nand, Nor, and And-Or-Invert, are already defined. They are shown in figures 6.2, 6.3, 6.4, and 6.5 respectively and specified in the following manner :

(invert (out width-o length-o) (in width-i length-i))
(nand (out width-o length-o) (in1 width-1 length-1)
      (in2 width-2 length-2) ...)
(nor (out width-o length-o) (in1 width-1 length-1)
     (in2 width-2 length-2) ...)
(and-or-invert (out width-o length-o)
    ((in11 width-11 length-11) (in12 width-12 length-12) ...)
    ((in21 width-21 length-21) (in22 width-22 length-22) ...))

The gate size for each transistor is specified with a width and length together with the node to which the gate is connected. For example,

(invert (out 3 9) (in 4 2))

creates an NMOS inverter whose depletion-mode pull-up has a gate area of 4 by 2 lambda, and whose enhancement-mode pull-down has a gate area of 3 by 9 lambda. In order to have better symmetry of the rising and falling edges of the output waveform, the number of enhancement-mode pull-down transistors in a Nand gate or in a branch of an And-Or-Invert is recommended to be less than three.

[Figure 6.4 NMOS NOR]
[Figure 6.5 NMOS AND-OR-INVERTER]

A 2-bit adder is used as a second example to show how a network is described in the macro-based language. The detailed network description of the 2-bit adder is shown in Appendix C.1.

6.4 Logic Level Simulation

After the logic network description has been written to the file "file.net", it is processed with the NETLIST and PRESIM programs. For example, the command

% netlist inverter.net inverter.lis

will cause NETLIST to process the network description file "inverter.net", writing its output to the file "inverter.lis". % is the prompt of the UNIX system. The next step is to process "inverter.lis" with PRESIM. PRESIM transforms the transistors in the "file.lis" file into resistors of equivalent sizes. This is done because RNL uses resistor models for the transistors and estimates transition time delays from the equivalent network formed by the resistors and the circuit capacitances. The command

% presim inverter.lis inverter.rnl

will cause PRESIM to process "inverter.lis", putting the output into the binary file "inverter.rnl". The "inverter.rnl" file can now be used as the binary network description for RNL simulation. All the necessary preparations to run a simulation of the inverter are now complete.
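The transistor-to-resistor conversion rests on the fact that channel resistance scales with length/width. The Python sketch below is not part of the tool set; it only illustrates how the inverter's pull-up/pull-down ratio and a first-order RC delay fall out of the gate dimensions in "inverter.net". The sheet-resistance value is an assumed illustrative number, not a MOSIS parameter.

```python
# Channel resistance is proportional to length/width, so both the NMOS
# "ratio" and a crude RC delay estimate follow from gate dimensions.
# R_SQUARE is an assumed illustrative value, not a MOSIS parameter.

R_SQUARE = 25e3    # ohms per square of channel (assumed)
C_LOAD = 0.03e-12  # farads: the interconnect capacitance in inverter.net

def resistance(width, length):
    """Equivalent resistance of a transistor with gate W/L in lambda."""
    return R_SQUARE * (length / width)

def rc_delay(width, length, c_load=C_LOAD):
    """First-order transition-time estimate, t = R * C, in seconds."""
    return resistance(width, length) * c_load

r_up = resistance(2, 4)    # pull-up:   (dtrans out out Vdd 2 4)
r_down = resistance(4, 2)  # pull-down: (etrans in GND out 4 2)
print(r_up / r_down)       # 4.0, the usual NMOS ratio for restoring logic

# The more resistive pull-up makes the rising edge slower than the
# falling edge, the same ordering RNL reports for this inverter.
print(rc_delay(2, 4) > rc_delay(4, 2))   # True
```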
To start an RNL simulation, enter the command :

% rnl

RNL prompts with its version number

Version 4.2

Before the simulation is started, two files, "uwstd.l" and "uwsim.l", containing function definitions for RNL, must be loaded.

(load "~cad/lib/rnl/uwstd.l")
(load "~cad/lib/rnl/uwsim.l")

Next, load the binary network description file "inverter.rnl".

(read-network "inverter.rnl")

RNL will respond with information about the network :

; 8 nodes; transistors : enh=0 intrinsic=0 p-chan=0 dep=0 low-power=0 pullup=0 resistor=0

There is a simple command ("s") to run a simulation step for an amount of time defaulted to 100 ns. To change this, a variable "incr" can be set. "incr=1" means the simulation time interval is 0.1 ns. The command to assign a value to a symbol is "(setq symbol value)". For example, the following changes the simulation time interval to one nanosecond.

(setq incr 10)

Frequently, it is convenient to refer to a group of nodes, rather than to one individual node. A symbolic name for a list of node names can be defined with the "setq" command. For example, the name "inv-nodes" is given to the list of the two nodes "in" and "out".

(setq inv-nodes '(in out))

The final step is to specify details for the reports on the simulation step. There are two standard report forms available in RNL. The first type lists the states of nodes whenever these states change. Consequently, a timing analysis can be obtained. To set the change-flags of nodes "in" and "out" to "true", the command is

(chflag inv-nodes)

The second type of report lists the states of nodes at the end of a simulation step. To obtain such a report on the nodes "in" and "out", use the "def-report" command :

(def-report '("STATE AT THE END OF SIMULATION STEP :" in out))

Now try a simulation.
Setting the input of the inverter to high voltage is done simply by entering

h in

Run a simulation step by entering

s

According to the report specifications in the "chflag" and "def-report" commands, RNL replies :

Step begins @ 0 ns,
in = 1 @ 0
out = 0 @ 0.1
STATE AT THE END OF SIMULATION STEP :
Current time = 1
in = 1
out = 0

The reports on changes in the states of the nodes "in" and "out" show that "in" was set to High at time zero, and "out" changed to Low at 0.1 ns. The time delay in the change of the output is caused by the time needed to discharge the gate capacitance of the inverter and the output node capacitance of 0.03 pF. Now consider the other state of the inverter. Set the input to Low

l in

and then do a simulation step :

s

RNL replies :

Step begins @ 1 ns,
in = 0 @ 1
out = 1 @ 1.6
STATE AT THE END OF SIMULATION STEP :
Current time = 2
in = 0
out = 1

To exit RNL, enter

exit

to return to UNIX. There is one other mode of RNL operation, called batch mode. All the commands to RNL can be written into a file, "file.cmd". This file is then given as a parameter when RNL is executed. For example, all the above commands related to the inverter are written into the file "inverter.cmd" :

(load "~cad/lib/rnl/uwstd.l")
(load "~cad/lib/rnl/uwsim.l")
(read-network "inverter.rnl")
(setq incr 10)
(setq inv-nodes '(in out))
(chflag inv-nodes)
(def-report '("STATE AT THE END OF SIMULATION STEP :" in out))
(h '(in))
(s '())
(l '(in))
(s '())
(exit)

Now run RNL again, with the command file "inverter.cmd" as a parameter :

% rnl inverter.cmd

The same results as before will be obtained. The simulation of the 2-bit adder is done in batch mode. The command file "add2.cmd" for the 2-bit adder described in Appendix C.1 is created for RNL simulation and shown in Appendix C.2. Then enter the following three commands to get the simulation result stored in the file "add2.ult".
% netlist add2.net add2.lis -tnmos -u200
% presim add2.lis add2.rnl
% rnl add2.cmd > add2.ult

The switch -t specifies the technology to be used; here it is NMOS technology. The switch -u sets the number of centi-microns per lambda (the default is 250). The logic simulation result, stored in the file "add2.ult", is then checked. If it is incorrect, the circuit network should be modified and the above three commands executed again until a correct simulation result is obtained. The 2-bit adder is then laid out according to the described circuit add2.net. To test layout correctness, the layout is extracted and simulated at the logic level using RNL or ESIM, and at the circuit level using SPICE. Since ESIM does not support timing analysis, RNL is usually preferred for logic simulation. How to do circuit simulation with SPICE will be explained in the next section. After the 2-bit adder has been laid out according to add2.net and stored in the file add2.ca, the next step is converting the CAESAR file "file.ca" into the CIF file "file.cif" by entering the following commands :

% caesar -n add2
: cif -p
: quit

where : is the prompt of the CAESAR program. The -n switch causes CAESAR to run in non-interactive mode. The -p switch is necessary to obtain CIF files for circuit extraction and/or simulation. Enter the following four commands to check the layout at the logic level.

% mextra add2.cif
% simfilter add2.sim add2.tem
% presim add2.tem add2.rnl
% rnl add2.cmd > add2.ult

The simulation result is stored in the file "add2.ult", and should be the same as that obtained from the circuit description except for the timings, because the capacitances of gates and interconnections in the layout are extracted and are heavily dependent on the layout geometry. SIMFILTER reformats a simulation file from either the Berkeley or the MIT format into the other.
Here, the Berkeley format file "add2.sim" is reformatted into the MIT format file "add2.tem", which allows layouts extracted by MEXTRA to be simulated using RNL. Note that add2.cmd is described in Appendix C.2. Another alternative is using ESIM for logic level simulation and then CRYSTAL for timing analysis.

6.5 Circuit Level Simulation

After doing logic simulation and timing analysis for the circuit network described in the LISP-like language or for the layout individually, a detailed circuit simulation, such as SPICE, is necessary. SPICE is a general-purpose circuit simulation program for nonlinear dc, nonlinear transient, and linear ac analysis. Circuits may contain resistors, capacitors, inductors, mutual inductors, independent voltage and current sources, four types of dependent sources, transmission lines, and the four most common semiconductor devices : diodes, BJT's, JFET's, and MOSFET's. SPICE has built-in models for the semiconductor devices, and the user needs to specify only the pertinent model parameter values. Three MOSFET models are implemented; MOS1 is described by a square-law I-V characteristic, MOS2 is an analytical model, and MOS3 is a semi-empirical model. The SPICE from Berkeley works only with the MOS1 model. The 2-bit adder, "add2.net", described in Appendix C.1, is used as an example to explain how to run SPICE. Enter the following commands and the circuit simulation result will be obtained and stored in the file "add2.out".

% netlist add2.net add2.lis -tnmos -u200
% simfilter -n add2.lis add2.sim
% pspice -d defs -m model -e add2.io add2
% spice add2.spcin add2.out

The switch -n tells SIMFILTER to generate output in Berkeley format, "add2.sim", with NMOS technology. Besides add2.net, some files, defs, model, and add2.io, must be prepared before the above commands can be executed. The file add2.io contains all specified input signals, the output signals to be checked, and some control commands for SPICE. It is shown in Appendix C.3.
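Of the three MOSFET models mentioned above, MOS1 is the classic square-law characteristic. A minimal Python sketch of its drain-current equations is given below for orientation; channel-length modulation and the body effect are omitted, and the KP and VT values are illustrative assumptions, not MOSIS model parameters.

```python
# Simplified level-1 (MOS1) square-law NMOS drain current.
# KP and VT below are assumed illustrative values, not MOSIS data.

KP = 25e-6   # transconductance parameter, A/V^2 (assumed)
VT = 1.0     # threshold voltage, V (assumed)

def mos1_id(vgs, vds, w_over_l):
    """Level-1 NMOS drain current in amperes (no lambda, no body effect)."""
    vov = vgs - VT                 # overdrive voltage
    if vov <= 0:
        return 0.0                                   # cutoff
    if vds < vov:                                    # triode region
        return KP * w_over_l * (vov - vds / 2) * vds
    return 0.5 * KP * w_over_l * vov * vov           # saturation

print(mos1_id(0.5, 5.0, 2.0))   # 0.0, the device is cut off
```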
Model parameters of the simulated devices, stored in the file "model" and given by MOSIS, are shown in Appendix C.4. The "defs" file sets up the equivalences between node names in the simulation file and SPICE node names. The GND node is always set to node 0 in SPICE, while the VDD node is set to node 1. To avoid the lowercase gnd and vdd being assigned to different nodes in SPICE, they are set to nodes 0 and 1 in the "defs" file, which is shown below.

set gnd 0 nmos
set vdd 1 nmos

PSPICE is a shell script for preparing SPICE input from several sources. PSPICE runs SIM2SPICE to convert a "file.sim" format circuit description into a SPICE-compatible description. For example, SIM2SPICE reads the "add2.sim", "add2.nodes" and "add2.al" files and creates the "add2.spice" and "add2.names" files. PSPICE then runs SPCPP to translate a "pseudo-spice" formatted file that contains symbolic node labels into a file acceptable to SPICE. For example, SPCPP reads the "add2.names" and "add2.io" files and creates the "add2.spcx" file. Finally, PSPICE concatenates the circuit description file, the translation table, a file of untranslated SPICE input, and the translated SPICE input into the single file "add2.spcin". To simulate a circuit layout, e.g. "add2.ca", enter the following commands and the circuit simulation result will be stored in the file "add2.out".

% caesar -n add2
: cif -p
: quit
% mextra add2.cif
% pspice -d defs -m model -e add2.io add2
% spice add2.spcin add2.out

6.6 Summary

The procedures for using VLSI CAD tools to design an integrated circuit chip have been introduced in this chapter. How to describe a designed circuit with NETLIST, a LISP-like language, and simulate it at the logic level with RNL or ESIM and then at the circuit level with SPICE, has been explained in detail with the example of a 2-bit adder. More information can be found in the "VLSI Design Tools Reference Manual" released by the UW/NW VLSI Consortium.
The VLSI design tools currently used at The Ohio State University are able to support design with the NMOS and CMOS fabrication processes available through MOSIS, the Department of Defense's MOS Implementation Service run by the Information Sciences Institute of the University of Southern California. MOSIS now supports NMOS, CMOS/Bulk, CMOS/SOS, and Printed Circuit Board technologies. MOSIS usually aggregates several small projects submitted by the same organization into Multi-Project Chips (MPCs), and the various chips of the same technology into Multi-Chip Wafers (MCWs). It is very common for MOSIS to have wafers with over 100 individual projects, and wafers with about 50 different die types of several sizes. Some CAD tools released by the UW/NW VLSI Consortium are still primitive. For example, when laying out a circuit with CAESAR, an interactive circuit editor, the user must keep in mind the design rules, which depend on the IC processing technology. Different processing technologies give different layout design rules. So, once the process line is changed, the user must remember another set of design rules. A kind of symbolic layout, called VIVID [63], released recently by the Microelectronics Center of North Carolina (MCNC), can eliminate this drawback. Users just need to draw the symbols representing the different layers without consideration of the design rules. The CAD tools are still being developed toward a so-called silicon compiler, which can translate a high-level functional or behavioral description of a chip down to the actual layout of the device [10]. The silicon compiler is divided into two stages. The first stage is the translation of a brief functional or behavioral description into a more precise intermediate form that is still implementation independent. The second stage is the automatic generation of a chip layout from the intermediate description.
Software simulation is able to verify the chip before it is fabricated, and thus aids in fast turnaround time and saves the expense of silicon foundries. But the simulation time for a large circuit runs from hours to days or even months. Therefore, some simulations are implemented and run on special-purpose processors, such as Zycad's LE 1002. For a 61K-gate circuit, the LE 1002 can simulate four hundred times faster than DECSIM, Digital's internal simulator [11].

CHAPTER 7
SUMMARY AND CONCLUSIONS

7.1 Summary

In this research, three objectives have been achieved: (1) Architectures based on the Robotics Processor chip have been shown to be applicable to the solution of the general robotics problem involving the Jacobian, Inverse Jacobian, and Inverse Dynamics. (2) The architecture and the major parts of the RP chip have been designed, and (3) The VLSI design tools released by the UW/NW VLSI Consortium have been used for the first time at The Ohio State University to fabricate a chip. Several current VLSI computing structures, such as the systolic array and the wavefront array processor (WAP), were surveyed in an attempt to solve the intensive computations required in the Inverse Plant plus Jacobian. Since the effectiveness of these approaches is contingent upon large dimension arrays, and since the dimensions of the vectors and matrices in robotic systems are rather small, 3x1 (or 4x1) and 3x3 (or 4x4), neither the systolic array nor the WAP architectures could be successfully applied to the current application. Instead, several special purpose dedicated attached processors for the Inverse Plant plus Jacobian were developed. These attached processors are based on the Robotics Processor being developed with state-of-the-art VLSI technology at The Ohio State University.
These special purpose dedicated processors will be attached to a host microcomputer, and multiprocessor system concepts will be used to interconnect these multiple processors for real time control. Based on the current processing capability supported by MOSIS, the architecture of the RP has been tentatively designed. The data path contains a Register File with 64 words (32 bits per word), a floating point adder/subtractor, and a floating point multiplier. Both arithmetic units have three pipeline stages and can execute at the same time. From the architecture of the RP's data path, the computation times for all vector and matrix operations can be exactly obtained and normalized in units of complexity. Using complexity instead of time expressed in microseconds allows the results to be independent of the system clock. A task graph was employed to schedule processes for more than one processor for the Jacobian and Inverse Dynamics applications. Both computation complexities and I/O transfer complexities can be shown in the task graph. In addition, the total execution time of each task, e.g. Inverse Dynamics, can be estimated for each architecture. Once the total execution time, initiation rate, processor utilization ratio, and sizes of the Register and Control memory have been calculated based on the microprogram, the most desirable architecture can be determined. The major parts of the RP chip, such as the FPA, FPM, Bootstrap Unit, and Format Converters, have been designed to the logic gate level or function equations, which can in turn be implemented with a PLA by using the TPLA package. Some basic and often used cells, such as a full adder, have been designed and laid out in a compact form. A chip containing a 4-bit adder, a two-phase clock generator, and a PLA controller has been designed and fabricated by MOSIS with 4 micron NMOS technology. The chip was received three months after the corresponding CIF was sent to MOSIS through CSNET and ARPANET.
A 2-bit adder was used as an example to show how a circuit network is described in the LISP-like language. The logic function and timing of the described network can then be tested and verified before it is laid out. Once the circuit is laid out, its layout is extracted and then simulated to verify its logic levels and timings. Some of the CAD tools used at The Ohio State University are somewhat primitive. More powerful tools are being installed on the VAX 780. For example, the VIVID [63] system dramatically shortens custom integrated circuit design time. It translates symbolic layout, automatically, into the geometric representation necessary for mask generation. This approach offers the designer two advantages over the traditional mask editing approach to custom layout : (1) Technology independence : the symbolic layout does not need to be redesigned for different design rules. (2) Correctness-by-construction : the compactor generates the physical layout; therefore, the designer does not have to be aware of the design rules in order to create an error-free layout. It also supports the Interactive Circuit Editor (ICE) package, allowing the designer to select graphic representations of circuit elements to describe the circuit network without using the LISP-like language. Another VLSI tool installed is TEGAS-5, donated by the GE Calma Company. The TEGAS-5 system consists of several subsystems that include logic and design verification, testability analysis, fault simulation, and test generation. It supports the concept of design-for-testability by providing the designer with a method of measuring the testability of the design during the design process.

7.2 Research Extensions

There are many research recommendations that could be an extension of the research reported in this dissertation. The following is a brief description of future research that could be done.

1. The architecture of the Robotics Processor is heavily dependent on VLSI technology.
For example, if the access time of the Register File can be made smaller than one fourth of P_CLK (i.e., smaller than 250 ns), the two-bus, rather than the three-bus, configuration of the data path could conceivably be used. Furthermore, if the access time could be reduced even more, the one-bus configuration could be used instead.

2. The control memory used to store microprograms could be implemented with Static RAM (SRAM), Dynamic RAM (DRAM), or ROM. Fabricating the memory on the chip would require about 40 fewer pins as compared with using an off-chip memory, which is an important concern from the standpoint of current pin limitations. If the application demands a very large CRAM, then DRAM is probably suitable since its bit cell needs only one or three transistors, while SRAM needs six. But DRAM requires a refreshing circuit and thus is more complicated. If the microprogram is fixed and will not be further changed, ROM is the best choice since it has the highest density and reliability. Also, the Bootstrap Unit and Format Converter for Bootstrap are then no longer required. Since the Register File size is not large, six transistors for each bit cell could be used. Two address decoders are required because two operands in different locations are to be accessed at one time.

3. The current design of the next-microinstruction address unit includes neither the capability for jumping to subroutines nor conditional branching. Because the application microprograms for the Jacobian, Inverse Jacobian, and Inverse Dynamics are fairly short, they are written in a sequential, straight-line manner.

4. Another possible architecture is to take the two arithmetic units out of the Robotics Processor. In this case they can be implemented using commercial floating point processors, such as Weitek's WTL 1032 (or 1064) multiplier and WTL 1033 (or 1065) adder [56] [58], and AMD's Am29325 [59]. The problem here is the limitation on the number of pins.
The four I/O ports alone require 64 pins. Presently, MOSIS supports a maximum of 84 pins. In the future, MOSIS may support a package having more pins, say up to 144. Then using the commercial floating point processors will be feasible. However, the comparison table in chapter 4 shows that any one of the applications of the Jacobian, Inverse Jacobian, and Inverse Dynamics can be executed in one millisecond based upon the parallel/pipeline computing structure with the RP as the basic building block. Therefore, although there are several faster commercial floating point processors available, the two arithmetic units in the RP are fast enough to meet our requirements. For example, the AMD Am29325 can complete a floating point addition or multiplication in 100 ns, and is about 40 times faster than the arithmetic units in the RP. Another alternative is using a 32-bit I/O bus on each I/O port but keeping the two arithmetic units in the Robotics Processor. The comparison table shows that the I/O communication increases as more RPs are used. As I/O communication between the RPs is a dominant factor, using a 32-bit I/O bus instead of a 16-bit one would be very desirable and would double the data transfer rate.

5. In the current design no handshaking signals are used between the RPs, thus data transfer timing must be known and the RPs must be globally synchronized. In addition, clock skew must be carefully avoided. Also, since it is assumed that the data accessed by each processor is always the freshest available, no other signals are used to coordinate data exchanges except for load mode handshaking signals between the RPs and a host computer. Since all RPs are synchronized and the timing for transferring data is predefined, no status bits of the arithmetic units are ever tested. It is conceivable that a more general philosophy should be adopted.
Specifically, it may be found that it is more realistic to use handshaking signals between the RPs, and between the RPs and the host computer, and to have asynchronous communication.

6. If the pin number of the Robotics Processor chip is large enough, the clock generator can be placed outside of the Robotics Processor chip. This leads to more reliability, easier handling of the RP chip, and the generation of the Scan Path clocks, Psi1 and Psi2, for LSSD testability.

7. In order to permit maximum accuracy to be retained in the result, it is important to extend the current internal 32-bit word length with guard bits. To date no simulations have been done to determine how many guard bits would be needed in the Robotics Processor chip implementing an application program. Also, no rounding methods except truncation have been considered in the FPA and FPM. Many other more accurate rounding methods, such as adder-based rounding, Von Neumann rounding, and ROM rounding [31, pp. 427-431], could be considered in future research.

8. The design of the RP architecture is based on NMOS technology. There are several advantages in using CMOS technology (MOSIS has been supporting 3 micron P-well CMOS). First, since the rising and falling edges of the output waveform are symmetrical, no precharging is necessary as in NMOS technology. Second, the voltage passing through a pass transistor pair doesn't drop by the threshold voltage as it does through an N-channel pass transistor. Third, the ratio of the pull-up transistor vs. the pull-down transistor is of no concern, except that the width of the P-channel is almost always 2.8 times that of the N-channel because the N-type's mobility is higher than the P-type's. Fourth, CMOS has better noise immunity and uses less power. Designing the circuit with CMOS technology will not be more sophisticated than with NMOS, since the ratio is inconsequential.
The only drawback is that the VLSI design tools used at The Ohio State University do not have any package capable of generating a PLA with CMOS technology.

9. Because there are three pipeline stages in the FPA and FPM, a reservation table is used to help write the application microprogram. This will likely result in errors when the microprogram is being coded; however, if a microcode compiler (or assembler) is developed, it will not only reduce the errors, but also speed up coding.

10. Investigation of the three tables in chapter 4 shows that the RP contains a CRAM of ~10.5K bits and an RF of 64 x 32 bits. If the CRAM bit cell is made of three-transistor DRAM, and the RF bit cell is made of six-transistor SRAM, the total transistor number for memory is 10.5K x 3 + 64 x 32 x 6 = 44K; adding decoders, the refreshing circuit, and drivers gives about 50K. Each full adder consists of 25 transistors. The FPA contains 72 full adders and the FPM contains 89 full adders, 161 in total. From the block diagrams of the FPA and FPM, it can be seen that the number of components required for the adders is more than half of the 4K transistors used in the FPA and FPM. Including the format converters, clock generator, and sequencer, the total number of transistors on the RP chip is roughly 60K. It is possible to fabricate this number of transistors on a moderate die size with 3 micron NMOS technology. One existing example is the Berkeley RISC II CPU in the same technology, containing about 40K transistors and having a die size of 171 mil x 304 mil (4.34 mm x 7.72 mm) [17]. From this example, the die area of the RP can be estimated using the fact that the ratio of die areas is approximately 1.5 times the ratio of complexities. The 1.5 factor arises from the expected increase in interconnection area. Specifically, the die size of the RP containing 60K transistors will be about 8.2 mm x 8.2 mm, about twice that of the RISC CPU.
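The transistor bookkeeping behind these estimates is simple enough to check directly. The sketch below merely reproduces the arithmetic with the figures quoted in the text (cell counts and adder totals come from the dissertation itself).

```python
# Reproducing the rough RP transistor-count bookkeeping from the text.
# All figures (bit counts, cell sizes, adder totals) are from the text.

cram_bits = 10_500            # control RAM, ~10.5K bits
cram_xtors = cram_bits * 3    # three-transistor DRAM cell
rf_xtors = 64 * 32 * 6        # 64-word x 32-bit register file, 6T SRAM cell
memory_core = cram_xtors + rf_xtors
print(memory_core)            # 43788, i.e. ~44K before decoders and drivers

full_adders = 72 + 89         # full adders in the FPA plus the FPM
print(full_adders * 25)       # 4025 transistors, i.e. ~4K, as in the text
```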
This would seem to present no problem since the maximum die size which MOSIS supports is 9.5 mm x 10.5 mm with an 84-pin package.

REFERENCES

[1] Mead, C. and Conway, L., Introduction to VLSI Systems, Reading, Mass. : Addison-Wesley, 1980.
[2] Rideout, V.L., "Limits to Improvement of Silicon Integrated Circuits," Proceedings of COMPCON, 1980.
[3] Burger, R.M., Cavin, R.K., Holton, W.C. and Sumney, L.W., "The Impact of ICs on Computer Technology," IEEE Computer, Oct. 1984.
[4] Taylor, S., "Gearing up for GaAs," VLSI Design, April 1984.
[5] Cole, B.C., "CMOS Memories Replacing NMOS in Megabit Storage Chips," ElectronicsWeek, Nov. 26, 1984.
[6] Cohen, C.L., "Cells Combine CMOS, Bipolar Transistors," ElectronicsWeek, Nov. 5, 1984.
[7] McDonald, J.F., Rogers, J.F., Rose, K. and Steckl, A., "The Trials of Wafer-Scale Integration," IEEE Spectrum, October 1984.
[8] Beresford, R., "VHSIC : Redefining the Mission," VLSI Design, Nov. 1983.
[9] Berney, K., Wesley, R., Lineback, J.R. and Waller, L., "Chip Makers Ready for VHSIC Phase II," ElectronicsWeek, Nov. 12, 1984.
[10] Werner, J., "Progress Toward the 'Ideal' Silicon Compiler," VLSI Design, Sep. and Oct. 1983.
[11] Rezac, R.R. and Smith, L.T., "A Simulation Engine in the Design Environment," VLSI Design, Nov. 1984.
[12] Hwang, K. and Briggs, F.A., Computer Architecture and Parallel Processing, McGraw-Hill, Inc., New York, New York, 1984.
[13] Kogge, P.M., The Architecture of Pipelined Computers, McGraw-Hill, Inc., New York, New York, 1981.
[14] Charlesworth, A.E., "An Approach to Scientific Array Processing : The Architecture Design of the AP-120B/FPS-164 Family," IEEE Computer, Sep. 1981.
[15] Bernhard, R., "Giants in Small Packages," IEEE Spectrum, Feb. 1982.
[16] Ribble, E.A., Synthesis of Human Skeletal Motion and the Design of a Special-Purpose Processor for Real-Time Animation of Human and Animal Figure Motion, M.S. Thesis, The Ohio State University.
[17] Sherburne, R.W., Katevenis, M.G.H., Patterson, D.A. and Sequin, C.H., "32-bit NMOS Microprocessor with a Large Register File," IEEE Journal of Solid-State Circuits, Vol. SC-19, No. 5, October 1984.
[18] Wahawisan, W., A Multiprocessor System with Applications to Hexapod Vehicle Control, Ph.D. dissertation, The Ohio State University, Sep. 1981.
[19] Noble, B. and Daniel, J.W., Applied Linear Algebra, Prentice-Hall, Inc., 1977.
[20] Orin, D.E. and Olson, K.W., Special Purpose Computer Architectures for Control of Robotic Mechanisms, Department of Electrical Engineering, The Ohio State University, March 1983.
[21] Orin, D.E., "Pipelined Approach to Inverse Plant Plus Jacobian Control of Robot Manipulators," IEEE International Conference on Robotics, Atlanta, Georgia, March 1984.
[22] Paul, R.P., Robot Manipulators : Mathematics, Programming, and Control, The MIT Press, Cambridge, Mass., 1981.
[23] Schrader, W.W., Efficient Jacobian Computation for Robot Manipulators on Serial and Pipelined Processors, M.S. Thesis, The Ohio State University, 1983.
[24] Lathrop, R.H., Parallelism in Manipulator Dynamics, M.S. Thesis, Massachusetts Institute of Technology, 1983.
[25] Kung, H.T. and Leiserson, C.E., "Systolic Arrays (for VLSI)," Proc. Symp. Sparse Matrix Computations and Applications, Nov. 2-3, 1978, pp. 256-282.
[26] Kung, S.Y., Gal-Ezer, R.J. and Arun, K.S., "Wavefront Array Processor : Architecture, Language and Applications," Proc. of the Conference on Advanced Research in VLSI, M.I.T., January 1982.
[27] Liu, P.S. and Young, T.Y., "VLSI Array Design Under Constraint of Limited I/O Bandwidth," IEEE Trans. Comp., Vol. C-32, No. 12, December 1983, pp. 1160-1170.
[28] Nash, J.G., Hansen, S. and Nudd, G.R., "VLSI Processor Arrays for Matrix Manipulations," VLSI Systems and Computations, edited by Kung, H.T., Sproull, R. and Steele, G., Computer Science Press, 1981.
[29] Waser, S.
and Flynn, M.J., Introduction to Arithmetic for Digital Systems Designers, New York : Holt, Rinehart and Winston ; m College f»ub., 1982. [30] Hwang K., Computer Arithmetic : Principle, Arichitecture, and Design. New York : John Wiely A Sons, 1979. [31] Cavanagh, J .J .F ., Digital Computer Arithmetic : Design and Imp!ementation, New York : McGraw-Hill, 1984. [32] Kung, H.T., Sproull, B, and Steele, G., VLSI Systems and Computations, Rockvill, Maryland, Computer Science Press Inc., m 1 [33] Muroga, S. and Lai, H.C., "Minimization of Logic Networks under a Grneralized Cost Function," IEEE Trans. Comp. , Vol. C-25, September 1976, pp.893-907. [34] Lai, H.C. and Muroga, S., "Mininum Parallel Rinary Adder with NOR (NAND) Gates," IEEE Trans. Comp., Vol. C-28, No. 9, September 1979, pp. 648-659. [35] Sakurai, A. and Muroga, S., "Parallel Binary with a Minimum Number of Connections," IEEE Trans. Comp., Vol. C-32, No. 10, October 1983, pp. 969-976. [36] Brent, R.P. and Kung H.T., A Regular Layout for Parrel Adders, Technical Report, Dept, of Computer Science, Carnegie_Mellon University, CMU-CS-79-131, June 1979. [37] Luk, U.K., "A Regular Layout for Parallel M ultiplier of 0(LogN X LogN) Time," VLSI System and Computation, 1981. [38] Luk, W.K. and Vulliemin, J. E., "Recursive Implementation of Optimal Time VLSI Interger M ultiplier," VLSI'83, Anceau, F ., and Aas, E.J. (eds.), Elsevier Science Publishers B.V. (North-Holland), 1983. [39] Stenzel, W.J., Kubitz, W.J. and Garcia, G.H., "A Compact High-Speed Parallel Multiplication Scheme," IEEE Trans. Comp., Vol. C-26, No. 10, October 1977. [40] Bandeira N., Vaccaro, K. and Howard, A., "A Two's Complement Array M ultiplier Using True Values of the Operands." IEEE Trans. Comp., Vol. C-32, No. 8, August 1983. 173 [41] Ercegovac, M.D., and Nash. J.G ., A VLSI Design of A Radix-4 Carry Save M u ltip lie r, UCLA Computer '£cien ce Depart me ntT, Los Angels, April I9TT4. 
[42] Reusens, P., Ku, W.H., amd Mao, Y.H., "Fixed-Point High-Speed Parallel M ultipliers in VLSI," VLSI System and Computatuion, m i . [43] Lerouge, C.P., Girard P. and Colardelle, J.S., "A Fast 16 Bit NMOS Parallel Multiplier," IEEE Journal of Solid-State Circuit, Vol. SC-19, No. 3, June 1 9 8 ^ [44] Hartring, C.O., Rosario, B.A. and Picket, J.M., "High-Speed Low-Power Silicon MESFET Parallel M ultipliers," IEEE Journal of Solid-State C irc u it, Vol. SC-17, No. 1, Feb. 1982. [45] Lee, F.S., "A High-Speed LSI GaAs 8x8 Bit Parallel M u ltip lier," ~mr.IEEE ------Journal of Solid-State Circuit, Vol. SC-17, No. 4, October [46] Pareparta, F.P., "A Mesh-Connected Area-Time Optimal VLSI Integer M u ltiplier," VLSI Systems and Computation. 1981. [47] Chen, I.N. and Willoner, R., "A 0(n) Parallel Multiplexer with Bit-Sequential Input and Output," IEEE Trans. Comp., Vol. C-28, No. 10, October 1979. [48] Lyon, R.F., "Two's Complement Pipeline M ultipliers," IEEE Trans. Communi ati ons, April 1976. [49] Strader, N.R. and Rhyne, V .T., "A Cannonical Bit-Sequential M u ltiplier," IEEE Trars. Comp., Vol. C-31, No. 8, August 1979. [50] AM29516 and AM29517, 16x16 Parallel M u ltiplier, Bipolar Microprocessor Logic and Interface Oata Rook, Advanced Micro Devices, Sunnyvale, California, 1984, [51] Zurawski, J.H.P. and Gosling, J.B ., "Design of High Speed Digital Divider Units," IEEE Trans. Comp., Vol. C-30, No. 9. September 1981. [52] TRW LSI Multipliers Applications Notes, TRW LSI Products, Redondo, California. [53] Stevenson, D., "A Proposed Standard for Binary Floating Point Arithmetic," IEEE Computer, March 1981. [54] Kuck, D. J. et a l., "Analysis of Rouding Methods in Floating Ponit Arithmetic," IEEE Trans. Comp.. Vol. C-26. No. 7. July 1977, pp. 643-650. 174 [55] Ware, F.A. and McAllister W.H., "64 Bit Monolithic Floating Point Processors," IEEE Journal of Solid-State C irc u it, Vol. SC-17, No. 5, October 1982. [56] Woo, B., Lin, L. 
and Owen, R.E., "ALU, M ultiplier Chips zip Through IEEE Floating-Point Operations," Electronics, May 19, 1983. [57] Turney, J.L. and Mudge, T.N., "VLSI Implementation of a Numerical Processor for Robitlcs," Proc. of the 27th International Instrumentation Symposium, pp.169-1?5, Indianapolis, Indiana, April 1981. [58] Ware, F ., Lin, L, Wong R., and Woo, B ., "Fast 64-bit Chip Set Gangs up for Double-Precision Floating-Point Work," Electronics, July 12, 1984. [59] Chu, P. and New, B.J. "Microprogrammable Chips Rlend Top Performance with 32-bit Structures," Electronic Design, Nov. 15, 1984. [60] Williams, T.W. and Parker K.P., "Design for Testability - a Survey," IEEE Trans. Comp., Vol. C—31, No. 1, January 1982, pp. 2-13. [61] Williams, T.W., "VLSI Testing," IEEE Computer, October, 1984. [62] Horowitz, E. and Sahni S., Fundamentals of Computer A1gorithms, Computer Science Press, Inc. 1978, [63] Roger, C.D., Daniel S.W., and Rosenberg, J.B ., An Overview of VIVID, MCNC's Vertically Integrated Symbolic Design, MCNC Technical Report, Microelectronic Center of North Carolina, Research Triangle Park, North Caralina, 1985. 175 Appendix A.1 : Reservation Tables for Vector Operations. 1. Reservation table for the addition of two 3x1 vectors xl yi zl zl - xl + yl x2 + y2 = z2 z2 = x2 + y2 x3 y3 z3 z3 = x3 + y3 P CLK CYCLE 1 2 3 4 5 6 7 8 9 10111213 14 1$1617 18 1920 + 1 zl z2z3 +2 zl z2z3 +3 zl z2z3 ST zl z2z3 The +1 to +3 represent the three stages of the floating point adder. "ST" means storing the result to the RF. It takes six P_CLK cycles to complete the operation. 176 2. Reservation table for a 3x1 vector multiplied with a scalar constant. x l * z l z l = x l * c x2 X c * z2 z2 = x2 * c x3 z3 z3 = x3 * c P CLK CYCLE 1 2 3 4 5 & 7 8 8 1(1 U 12 13 14 15 15 17 16 19 26 *1 z l z2 z3 *2 z l z2 z3 *3 z l z2 z3 ST z l z2 z3 The *1 to *3 represent the three stages of the floating point m ultiplier. It also takes six P_CLK cycles to complete the operation. 177 3. 
Reservation tahle for the inner product of two 3x1 vectors. y i A xl * yl B x2 * y2 [ xl x2 x3 ] . y2 C x3 * y3 D A + B y3 Z C + D P CLK CYCLE It takes (3M + 10) P_CLK cycles to complete M inner products of vectors with 3x1. 178 4. Reservation table for the cross product of two 3x1 vectors. - -- A = x2 * y3 xl yi zl R = y2 * x3 C = yl * x3 + x2 y2 = z2 [1 = xl * y3 E = xl * y2 x3 y3 z3 F = yl * x2 - - - zl = A - R z2 - C - D z3 « E - F P CLK CYCLE 1 2 3 4 5 <5 7 ft ft 10 11 12 13 14 15 16 17 1ft 1ft 20 *1 AB C DE F *2 A B CD EF *3 A BC DE F ST A B C 0 E F + 1 zl z2 z3 +2 Zl z2 z3 +3 zl z2 z3 ST zl z2 z3 It also takes 13 P_CLK cycles to complete the operation. 179 5. Reservation table for the multiplication of a 3x3 matrix with a 3x1 vector. xll xl2 xl3 1 yi Zl x21 x22 x23 * y2 = z2 x31 x32 x33 | y3 z3 A = xll * yl R = xl2 * y2 C = x21 * y l 0 = x22 * y2 E = x31 * yl F = x32 * y2 G - xl3 * y3 H = x23 * b3 I = x33 * y3 J = A + B K = C ^ D L = E + F zl = G + J z2 = H + K z3 = I + L P CLK CYCLE 1 2 3 4 5 6 1 A 5 If) 11 12 13 14 15 16 11 16 14 20 *1 A B CDE FGH I *2 A R CDE F G H I *3 A R C DEF GHI ST A B CDE F G H I + 1 J K L zl z2 z3 +2 J K L zl z2 z3 + 3 J K L zl z2 z3 ST J K L zl z2 z3 It takes 17 P_CLK cycles to finish it . 180 6. Reservation table for the multiplication of two 3x3 matrices. x ll x!2 x 13 yll yl2 yl3 Zll zl2 zl3 x21 x22 x23 + y21 y22 y23 = z21 z22 z23 x31 x32 x33 y31 y32 y33 z31 z32 z33 A1 = x ll * y 11 B1 = xl2 * y21 Cl = x21 * y l l D1 = x22 * y21 El = x31 * y ll FI = x32 * y21 G1 = xl3 * y31 HI = x23 * b31 11 = x33 * y31 J1 = A1 Ht B 1 K1 = Cl + Dl LI = El * FI z ll = G1 + J1 z21 = HI + K1 z31 = 11 + LI A2 = xll * y12 B2 = xl2 * y22 C2 = x21 * y 12 D2 = x22 * y22 E2 = x31 * y 12 F2 = x32 * y2? 
G2 = x 13 * y32 H2 = x23 * b32 12 = x33 * y32 J2 = A2 + B2 K2 = C2 + D2 L2 = E2 + F2 zl2 = G2 + J2 z22 = H2 + K2 z32 = 12 + L2 A3 = xll * y13 B3 = xl2 * y23 C3 = x21 * y 13 D3 = x22 * y23 E3 = x31 * y 13 F3 = x32 * y23 G3 = x!3 * y33 H3 = x23 * b33 13 = x33 * y33 J3 = A3 + B3 K3 = C3 + 03 L3 = E3 + F3 zl3 = G3 + J3 z23 = H3 + K3 z33 = 13 + L3 P CLK CYCLE 1 2 3 4 5 6 7 ft 9 10 11 12 13 14 15 16 17 1ft 19 20 *1 A1 B1 Cl Dl El FI G1 HI 11 A2 B2 C2 D2 E2 F2 G2 H2 12 A3 B3 *2 A1 B1 Cl 01 El FI G1 HI 11 A2 B2 C2 02 E2 F2 G2 H2 12 A3 *3 A1 B1 Cl Dl El FI G1 HI 11 A2 B2 C2 02 E2 F2 G2 H2 12 ST A1 B1 Cl Dl El FI G1 HI 11 A2 R2 C2 02 E2 F? G2 H2 + 1 J1 K 1 LI 11 21 31 J2 K2 L2 12 +2 J1 K1 LI 11 21 31 J2 K2 L2 + 3 J1 K1 LI 11 21 31 J2 K2 1 ST 1 C- K1 LI 11 21 31 J2 K2 181 P CLK CYCLE 21 22 23 24 25 26 27 28 29 38 31 32 33 34 35 36 37 39 4'fl 3 *1 C3 D3 £3 F3 G3 H3 13 *2 C3 03 E3 F3 G3 H3 13 *3 C3 D3 E3 F3 G3 H3 13 ST C3 D3 E3 F3 G3 H3 13 + 1 22 32 J3 K3 L3 13 23 33 +2 12 22 32 J3 K3 13 13 23 33 +3 L2 12 22 32 J3 K3 L3 13 23 33 ST L2 12 22 32 J3 K3 L3 13 23 33 It takes 8 + 9 * 3 = 35 P_CLK cycles to complete the multiplication of two 3x3 matrices. 1 8 2 7, Reservation table for the inner product of two 4x1 vectors. T [ xl x2 x3 x4 ] . [ y l y2 y3 y4 ] - 1 A ■ x 1 * y 1 B = x2 * y2 C = x3 * y3 D = x3 * y3 P CLK CYCLE 1 2 3 4 5 6 7 g $ lo 11 12 13 14 15 16 1? 16 19 20 *1 AB C 0 A B c 0 * 2 A B CD AB c D * 3 AB C DA B C D ST A B C D AB C 0 + 1 E FE FZ Z +2 E FE F Z z +3 E F E F Z Z ST E F E F ZZ It takes (4M + 12) P_CLK cycles to complete M inner products of vectors with 4x1. 183 8, Reservation table for the inner product of two 5x1 vectors. 
A = xl * yl F = A + B 3 = x2 * y2 G = C + D C = x3 * y3 H = E + F D = x4 * y4 Z = G + H E = x5 * y5 P CLK CYCLE 1 2 3 4 5 6 7 fl 9 i n 11 12 13114 15 16 17 IS 19 *1 A B C D E A R C D E *2 A B C D E A B C f) E *3 A B C D E A R CD E ST A B C D E A B C D E + 1 F G H F G Z H z +2 F G H F G Z H Z +3 F G H F GZ H ST F G H F G Z H It takes (5M + 12) P_CLK cycles to complete M inner products of vectors with 5x1. 184 9. Reservation table for the inner product of two 6x1 vectors. T [ xl x2 x3 x4 x5 x6 ] . [ y l y2 y3 y4 y5 y6 ] = Z A = xl * yl G = A + B B = x2 * y2 H = C + D C = x3 * y3 I = E + F D = x4 * y4 J = G + H E = x5 * y5 Z = I + J F = x6 * y6 P CLK CYCLE 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ■18 16 17 18 19 20 *1 A BC D E FA BC DE F *2 AB CD EF A B CD EF *3 A BC D E FA B C D E F ST A BC 0 E F A B C 0 E F + 1 G H IG J H I Z J +2 G H I G J H I Z J +3 G H I GJ H I Z ST G H I G J H IZ I t takes (6M + 14) P_CLK cycles to complete M inner products of vectors with 6x1. 185 10. Reservation table for the inner product of two 7x1 vectors. J [ xl x2 x3 x4 x5 x6 x7 ] . [ y l y2 y3 y4 y5 yfi y7 ] = Z A = xl * y l H = A + R R = x2 * y2 I = C + D C = x3 * y3 J = E + F 0 = x4 * y4 K = H + I E » x5 * y5 L = G + J F = x6 * y6 Z = K + L G = x7 * y7 P CLK CYCLE 1 2 3 4 5 6 7 s 0 10 11 12 13 14 16 16 17 1ft 10 20 *1 A B C 0 E F G AB C DEF G *2 A B c 0 E F G A B C n E F G *3 A B C DEF G AB c DEF G ST AB C DEF G AB C D E F G + 1 H I J K H LI J ZK +2 H I JK H LI J 1 K +3 H IJ K H L IJ Z ST H I JKH L I J It takes ( 7M + 14) P_CLK cycles to complete M inner products of vectors with 7x1. 186 11. Reservation table for the determinant of a 2x2 matrix all al2 A = a ll x a22 = Z B = al2 x a21 a21 a22 Z = A - B P CLK CYCLE 1 2 3 4 5 6 1 8 9 io n 12 13 14 15 16 17 is 19 26 *1 AB AB *2 A B AB *3 ABAB ST ABA B + 1 ZZ +2 ZZ + 3 Z z ST ZZ It takes (2M + 7) P_CLK cycles to complete M determinants of 2x2 matrices. 107 12. Reservation table for the determinant of a 3x3 matrix. 
      a11 a12 a13
      a21 a22 a23   = Z
      a31 a32 a33

      A = a11 x a22     G = A x a33     O = G + H
      B = a12 x a23     H = B x a31     P = I - J
      C = a13 x a21     I = C x a32     Q = K + L
      D = a11 x a23     J = D x a32     R = O + P
      E = a12 x a21     K = E x a33     Z = R - Q
      F = a13 x a22     L = F x a31

   (The reservation table issues A to L through the multiplier stages and O, P, Q, R, Z through the adder stages, with successive determinants interleaved.)

   It takes (12M + 13) P_CLK cycles to complete M determinants of 3x3 matrices.

Appendix A.2 : Microprogram for Jacobian (one RP per Link)

   Qi, Ui and Ui-1 are 3x3 matrices and Pi, Ri (Ri-1) and Bi are 3x1 vectors; they are assigned to the register file (RF) as follows:

      R0  Qi[1,1]        R9  Ui[1,1]     R18 Ui-1[1,1]
      R1  Qi[1,2]        R10 Ui[1,2]     R19 Ui-1[1,2]
      R2  Qi[1,3]        R11 Ui[1,3]     R20 Ui-1[1,3]
      R3  Qi[2,1]        R12 Ui[2,1]     R21 Ui-1[2,1]
      R4  Qi[2,2]        R13 Ui[2,2]     R22 Ui-1[2,2]
      R5  Qi[2,3]        R14 Ui[2,3]     R23 Ui-1[2,3]
      R6  Qi[3,1] = 0    R15 Ui[3,1]     R24 Ui-1[3,1]
      R7  Qi[3,2]        R16 Ui[3,2]     R25 Ui-1[3,2]
      R8  Qi[3,3]        R17 Ui[3,3]     R26 Ui-1[3,3]

      R27 Pi[1]    R30 Ri-1[1], Ri[1]    R33 Bi[1]    R36 temporary
      R28 Pi[2]    R31 Ri-1[2], Ri[2]    R34 Bi[2]    R37 temporary
      R29 Pi[3]    R32 Ri-1[3], Ri[3]    R35 Bi[3]    R38 temporary

   The microprogram consists of 65 microcode steps. Each step specifies the register-file operands routed to the floating point multiplier (RF --> FPM) and to the floating point adder (RF --> FAM), the SUB control bit of the adder, and the destination registers of the multiplier and adder results (FPM --> and FAM -->). Steps 1 to 49 form Qi, update the U matrix and store Ri; steps 50 to 65 combine Ui-1 and Ri-1 to form Bi.

Appendix A.3 : Calculation of the Measurement Parameters for Jacobian with P = 1.
   Tc = ( 9 + 35 + 21 + 13 + 12 ) x N = 90N P_CLK cycles
   Tio = 4N P_CLK cycles
   ET = Tc + Tio = 94N P_CLK cycles
   IR = 1/ET = 1/(94N microsecond)
   UP = 100%
   CBR = 90N/94N = 96%
   RN = 39 (see Appendix A.2)
   MC = (Tc + Tio/2) / N = 92
   SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 92 x 40 = 3.7K bits
   Total Memory = 39 x 32 + 3.7K = 5K bits

Appendix A.4 : Calculation of the Measurement Parameters for Jacobian with P = 2.

   Tc1 = ( 9 + 35 + 12 ) x N = 56N P_CLK cycles
   Tio1 = ( 4 + 18 + 6 ) x N = 28N P_CLK cycles
   Tc2 = ( 21 + 13 ) x N = 34N P_CLK cycles
   Tio2 = ( 18 + 6 + 6 ) x N = 30N P_CLK cycles
   Tid2 (idle) = ( 14 + 6 ) x N = 20N P_CLK cycles
   ET1 = Tc1 + Tio1 = 84N P_CLK cycles
   ET2 = Tc2 + Tio2 + Tid2 = 84N P_CLK cycles
   ET = (13 + 84N) P_CLK cycles
   IR = 1/max(ET1, ET2) = 1/(84N microsecond)
   UP1 = 100%
   UP2 = (84N - 20N) / 84N = 76%
   UP = (UP1 + UP2) / 2 = 88%
   SP = ET(P=1) / ET(P=2) = 94N / (13 + 84N) = 94N / 84N = 1.1
   CBR = Tc(P=1)/2 x IR = 90N/2 x 1/84N = 54%

   RN1 (see Appendix A.2) :
      Qi[1,1] to Qi[3,3]         --> 9
      Ui[1,1] to Ui[3,3]         --> 9
      Ui-1[1,1] to Ui-1[3,3]     --> 9
      temporary registers        --> 3
      RN1 = 9 + 9 + 9 + 3 = 30

   RN2 (see Appendix A.2) :
      Qi[1,1] to Qi[3,3]     --> 9
      Pi[1] to Pi[3]         --> 3
      Ri[1] to Ri[3]         --> 3
      Ri-1[1] to Ri-1[3]     --> 3
      Bi[1] to Bi[3]         --> 3
      temporary registers    --> 3
      RN2 = 9 + 3 + 3 + 3 + 3 + 3 = 24

   RN = max(RN1, RN2) = 30
   MC1 = (Tc1 + Tio1/2) / N = 70
   MC2 = (Tc2 + Tio2/2 + Tid2) / N = 69
   MC = max(MC1, MC2) = 70
   SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 70 x 34 = 2.4K bits
   Total Memory = 30 x 32 + 2.4K = 3.4K bits

Appendix A.5 : Calculation of the Measurement Parameters for Jacobian with P = N.
   Tc = 9 + 35 + 21 + 13 = 78 P_CLK cycles
   Tio = 4 + 18 + 6 + 18 + 6 + 6 + 6 = 64 P_CLK cycles
   Tid (idle) = 1 P_CLK cycle
   ET = 59 x (N-1) + (Tc + Tio + Tid) = (59N + 84) P_CLK cycles
   IR = 1/(Tc + Tio + Tid) = 1/(143 microsecond)
   UP = (ET - Tid) / ET = 142/143 = 100%
   SP = ET(P=1) / ET(P=N) = 94N / (59N + 84) = 1.32 for N=7
   CBR = Tc(P=1)/N x IR = 90N/N x 1/143 = 63%
   RN = 39
   MC = (Tc + Tio/2) + Tid = 78 + 64/2 + 1 = 111
   SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 111 x 40 = 4.5K bits
   Total Memory = 39 x 32 + 4.5K = 5.8K bits

Appendix A.6 : To Find Br1 (1x6).

1. Find the 6 determinants of the 6 reduced 5x5 matrices. Br1[k] is (-1)^(k+1) times the determinant of the 5x5 submatrix obtained by deleting row k and column 1 of the 6x6 matrix. Expanding each along its first column, with d(i,j,k,l) denoting the determinant of the 4x4 submatrix formed from rows i, j, k, l and columns 3 to 6:

   Br1[1] = + a22 x d(3,4,5,6) - a32 x d(2,4,5,6) + a42 x d(2,3,5,6) - a52 x d(2,3,4,6) + a62 x d(2,3,4,5)
   Br1[2] = - a12 x d(3,4,5,6) + a32 x d(1,4,5,6) - a42 x d(1,3,5,6) + a52 x d(1,3,4,6) - a62 x d(1,3,4,5)
   Br1[3] = + a12 x d(2,4,5,6) - a22 x d(1,4,5,6) + a42 x d(1,2,5,6) - a52 x d(1,2,4,6) + a62 x d(1,2,4,5)
   Br1[4] = - a12 x d(2,3,5,6) + a22 x d(1,3,5,6) - a32 x d(1,2,5,6) + a52 x d(1,2,3,6) - a62 x d(1,2,3,5)
   Br1[5] = + a12 x d(2,3,4,6) - a22 x d(1,3,4,6) + a32 x d(1,2,4,6) - a42 x d(1,2,3,6) + a62 x d(1,2,3,4)
   Br1[6] = - a12 x d(2,3,4,5) + a22 x d(1,3,4,5) - a32 x d(1,2,4,5) + a42 x d(1,2,3,5) - a52 x d(1,2,3,4)

   Many terms recur across the six expansions, so only 15 distinct 4x4 determinants are needed. The complexity of finding Br1 is equal to that of finding 6 vector inner products of 5x1 vectors plus 15 determinants of 4x4 matrices.

2. Find the 15 determinants of the 15 reduced 4x4 matrices. With d(i,j,k) denoting the determinant of the 3x3 submatrix formed from rows i, j, k and columns 4 to 6, each 4x4 determinant expands along its first column:

   d(3,4,5,6) = + a33 x d(4,5,6) - a43 x d(3,5,6) + a53 x d(3,4,6) - a63 x d(3,4,5)
   d(2,4,5,6) = + a23 x d(4,5,6) - a43 x d(2,5,6) + a53 x d(2,4,6) - a63 x d(2,4,5)
   d(2,3,5,6) = + a23 x d(3,5,6) - a33 x d(2,5,6) + a53 x d(2,3,6) - a63 x d(2,3,5)
   d(2,3,4,6) = + a23 x d(3,4,6) - a33 x d(2,4,6) + a43 x d(2,3,6) - a63 x d(2,3,4)
   d(2,3,4,5) = + a23 x d(3,4,5) - a33 x d(2,4,5) + a43 x d(2,3,5) - a53 x d(2,3,4)
   d(1,4,5,6) = + a13 x d(4,5,6) - a43 x d(1,5,6) + a53 x d(1,4,6) - a63 x d(1,4,5)
   d(1,3,5,6) = + a13 x d(3,5,6) - a33 x d(1,5,6) + a53 x d(1,3,6) - a63 x d(1,3,5)
   d(1,3,4,6) = + a13 x d(3,4,6) - a33 x d(1,4,6) + a43 x d(1,3,6) - a63 x d(1,3,4)
   d(1,3,4,5) = + a13 x d(3,4,5) - a33 x d(1,4,5) + a43 x d(1,3,5) - a53 x d(1,3,4)
   d(1,2,5,6) = + a13 x d(2,5,6) - a23 x d(1,5,6) + a53 x d(1,2,6) - a63 x d(1,2,5)
   d(1,2,4,6) = + a13 x d(2,4,6) - a23 x d(1,4,6) + a43 x d(1,2,6) - a63 x d(1,2,4)
   d(1,2,4,5) = + a13 x d(2,4,5) - a23 x d(1,4,5) + a43 x d(1,2,5) - a53 x d(1,2,4)
   d(1,2,3,6) = + a13 x d(2,3,6) - a23 x d(1,3,6) + a33 x d(1,2,6) - a63 x d(1,2,3)
   d(1,2,3,5) = + a13 x d(2,3,5) - a23 x d(1,3,5) + a33 x d(1,2,5) - a53 x d(1,2,3)
   d(1,2,3,4) = + a13 x d(2,3,4) - a23 x d(1,3,4) + a33 x d(1,2,4) - a43 x d(1,2,3)

   Again many terms recur, and only 20 distinct 3x3 determinants are needed. The complexity of finding the 15 determinants is equal to that of finding 15 vector inner products of 4x1 vectors plus 20 determinants of 3x3 matrices.

3.
Find the 20 determinants of the 20 reduced 3x3 matrices. With d(i,j) denoting the determinant of the 2x2 submatrix formed from rows i, j and columns 5, 6, each 3x3 determinant expands along its first column:

   d(4,5,6) = + a44 x d(5,6) - a54 x d(4,6) + a64 x d(4,5)
   d(3,5,6) = + a34 x d(5,6) - a54 x d(3,6) + a64 x d(3,5)
   d(3,4,6) = + a34 x d(4,6) - a44 x d(3,6) + a64 x d(3,4)
   d(3,4,5) = + a34 x d(4,5) - a44 x d(3,5) + a54 x d(3,4)
   d(2,5,6) = + a24 x d(5,6) - a54 x d(2,6) + a64 x d(2,5)
   d(2,4,6) = + a24 x d(4,6) - a44 x d(2,6) + a64 x d(2,4)
   d(2,4,5) = + a24 x d(4,5) - a44 x d(2,5) + a54 x d(2,4)
   d(2,3,6) = + a24 x d(3,6) - a34 x d(2,6) + a64 x d(2,3)
   d(2,3,5) = + a24 x d(3,5) - a34 x d(2,5) + a54 x d(2,3)
   d(2,3,4) = + a24 x d(3,4) - a34 x d(2,4) + a44 x d(2,3)
   d(1,5,6) = + a14 x d(5,6) - a54 x d(1,6) + a64 x d(1,5)
   d(1,4,6) = + a14 x d(4,6) - a44 x d(1,6) + a64 x d(1,4)
   d(1,4,5) = + a14 x d(4,5) - a44 x d(1,5) + a54 x d(1,4)
   d(1,3,6) = + a14 x d(3,6) - a34 x d(1,6) + a64 x d(1,3)
   d(1,3,5) = + a14 x d(3,5) - a34 x d(1,5) + a54 x d(1,3)
   d(1,3,4) = + a14 x d(3,4) - a34 x d(1,4) + a44 x d(1,3)
   d(1,2,6) = + a14 x d(2,6) - a24 x d(1,6) + a64 x d(1,2)
   d(1,2,5) = + a14 x d(2,5) - a24 x d(1,5) + a54 x d(1,2)
   d(1,2,4) = + a14 x d(2,4) - a24 x d(1,4) + a44 x d(1,2)
   d(1,2,3) = + a14 x d(2,3) - a24 x d(1,3) + a34 x d(1,2)

   The complexity of finding the 20 determinants is equal to that of finding 20 vector inner products of 3x1 vectors plus 15 determinants of 2x2 matrices.

4. The complexity of finding Br1 to Br6 can be considered as below, where T[ ] means the computation time in terms of P_CLK cycles. For example, T[ d(5x5) ] represents the computation time needed for computing the determinant of a 5x5 matrix.

   Number of P_CLK cycles for finding Br1 to Br6
      = 6 T[ d(5x5) ]
      = 6 T[ V1x5 . V5x1 ] + 15 T[ d(4x4) ]
      = 6 T[ V1x5 . V5x1 ] + 15 T[ V1x4 . V4x1 ] + 20 T[ d(3x3) ]
      = 6 T[ V1x5 . V5x1 ] + 15 T[ V1x4 . V4x1 ] + 20 T[ V1x3 . V3x1 ] + 15 T[ d(2x2) ]
      = [ 5M + 12 ] (M=6) + [ 4M + 12 ] (M=15) + [ 3M + 10 ] (M=20) + [ 2M + 7 ] (M=15)
      = 221 P_CLK cycles

   These vector inner product complexities can be found in Appendix A.1. If the determinant of a 3x3 matrix is regarded as the basic operation, instead of the determinant of a 2x2 matrix, then

   Number of P_CLK cycles for finding Br1 to Br6
      = 6 T[ d(5x5) ]
      = 6 T[ V1x5 . V5x1 ] + 15 T[ d(4x4) ]
      = 6 T[ V1x5 . V5x1 ] + 15 T[ V1x4 . V4x1 ] + 20 T[ d(3x3) ]
      = [ 5M + 12 ] (M=6) + [ 4M + 12 ] (M=15) + [ 12M + 13 ] (M=20)
      = 367 P_CLK cycles

Appendix A.7 : Computation Complexity and Registers Required for Vector Inner Products.

   T( M products V1x3 . V3x1 ) = 3M + 10 P_CLK cycles
   T( M products V1x4 . V4x1 ) = 4M + 12 P_CLK cycles
   T( M products V1x5 . V5x1 ) = 5M + 12 P_CLK cycles
   T( M products V1x6 . V6x1 ) = 6M + 14 P_CLK cycles
   T( M products V1x7 . V7x1 ) = 7M + 10 P_CLK cycles

   RN( M products V1xN . VNx1 ) = 2N + 2 + ceil(M/2) registers

Appendix A.8 : Procedures to Solve the Derivative of Theta and Calculations of the Measurement Parameters for Inverse Jacobian with P = 1.

(1) Host sends J (6xN) to the RP.
(2) RP computes A (6x6) = J (6xN) JT (Nx6).
(3) RP computes the 36 d(5x5) = B (6x6).
(4) RP computes d = d(6x6) = Br1 (1x6) . Ac1 (6x1).
(5) RP sends d to the host.
(6) Host sends X (6x1) to the RP.
(7) RP computes C (6x1) = B (6x6) X (6x1).
(8) RP computes theta' (Nx1) = JT (Nx6) C (6x1).
(9) Host sends 1/d to the RP.
(10) RP computes theta (Nx1) = theta' x 1/d.
(11) RP sends theta (Nx1) to the host.

Tio :
   J (6xN)       --> 6N x 2 = 12N P_CLK cycles
   X (6x1)       --> 6 x 2 = 12 P_CLK cycles
   d, 1/d        --> 2 x 2 = 4 P_CLK cycles
   theta (Nx1)   --> N x 2 = 2N P_CLK cycles

   So Tio = 12N + 12 + 4 + 2N = 14N + 16 P_CLK cycles = 114 P_CLK cycles for N=7.

Tc :
   (1) T(J JT) = 36 T(V1xN . VNx1) = [7M + 10] (M=36) = 262 P_CLK cycles for N=7.
   (2) T(36 d(5x5)) = 6 T(6 d(5x5)) = 6 x 221 = 1326 P_CLK cycles
   (3) T(Br1 . Ac1) = [6M + 14] (M=1) = 20 P_CLK cycles
   (4) T(B . X) = 6 T(V1x6 . V6x1) = [6M + 14] (M=6) = 50 P_CLK cycles
   (5) T(JT . C) = N T(V1x6 . V6x1) = [6M + 14] (M=N) = 6N + 14 = 56 P_CLK cycles for N=7.
   (6) T(theta' x 1/d) = N + 3 = 10 P_CLK cycles for N=7.

   So Tc = 262 + 1326 + 20 + 50 + 56 + 10 = 1724 P_CLK cycles.

ET (Execution Time) : ET = Tio + Tc = 114 + 1724 = 1838 P_CLK cycles = 1838 microseconds for N=7.
IR (Initiation Rate) : IR = 1/ET = 1/(1838 microsecond) for N=7.
UP (Utilization of the Processor) : UP = 100%
CBR (CPU Bound Ratio) : CBR = Tc/ET = 1724/1838 = 94%
RN (Register Number) :
   J                                      --> 6N
   A and B                                --> 36
   d(2x2), d(4x4), X, d and 1/d           --> 15
   d(3x3), d(5x5), C, theta' and theta    --> 20
   temporary                              --> 2 + ceil(7/2) = 6
   RN = 6N + 36 + 15 + 20 + 6 = 6N + 77 = 119 for N=7.
MC (Microcode Number) : MC = Tio/2 + Tc = 114/2 + 1724 = 1781 words
SCRAM (Size of Control RAM) : SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 1781 x ( 4 + 6 x 7 ) = 82K bits.
Total Memory = 119 x 32 + 82K = 86K bits for N=7.

Appendix A.9 : Procedures to Solve the Derivative of Theta and Calculations of the Measurement Parameters for Inverse Jacobian with P = 6.

(1) Host broadcasts J (6xN) to RPi, i = 1, 2, ..., 6.
(2) RPi computes Aci (6x1) = J [JT]ci, where [JT]ci is the ith column of the transpose of the J matrix.
(3) RPi broadcasts Aci to RPj, j <> i, i, j = 1, 2, ..., 6.
(4) RPi computes 6 d(5x5) = Bri (1x6).
(5) RPi computes d = d(6x6) = Bri (1x6) . Aci (6x1).
(6) RP1 sends d to the host.
(7) Host broadcasts X (6x1) to RPi, i = 1, 2, ..., 6.
(8) RPi computes Ci = Bri (1x6) . X (6x1).
(9) RPi broadcasts Ci to RPj, i <> j, i, j = 1, 2, ..., 6.
(10) RPi computes [theta']i = [JT]ri (1x6) . C (6x1). For i > 6, the [theta']i are computed in RP1 to RP(N-6).
(11) Host broadcasts 1/d to RPi, i = 1, 2, ..., 6.
(12) RPi computes [theta]i = [theta']i x 1/d. For i > 6, the [theta]i are computed in RP1 to RP(N-6).
(13) RPi sends [theta]i to the host.

Tio :
   J (6xN)   --> 6N x 2 = 12N P_CLK cycles
   Aci       --> 36 x 2 = 72 P_CLK cycles
   X         --> 6 x 2 = 12 P_CLK cycles
   Ci        --> 6 x 2 = 12 P_CLK cycles
   d, 1/d    --> 2 x 2 = 4 P_CLK cycles
   theta     --> N x 2 = 2N P_CLK cycles

   So Tio = 12N + 72 + 12 + 12 + 4 + 2N = 14N + 100 P_CLK cycles = 198 P_CLK cycles for N=7.

Tc :
   (1) T(J [JT]ci) = 6 T(V1xN . VNx1) = [7M + 10] (M=6) = 52 P_CLK cycles for N=7.
   (2) T(6 d(5x5)) = 221 P_CLK cycles
   (3) T(Bri . Aci) = [6M + 14] (M=1) = 20 P_CLK cycles
   (4) T(Bri . X) = [6M + 14] (M=1) = 20 P_CLK cycles
   (5) T([JT]ri . C) = [6M + 14] (M = ceil(N/6) = 2) = 26 P_CLK cycles for N=7.
   (6) T([theta']i x 1/d) = [M + 3] (M = ceil(N/6) = 2) = 5 P_CLK cycles for N=7.

   So Tc = 52 + 221 + 20 + 20 + 26 + 5 = 344 P_CLK cycles.

ET (Execution Time) : ET = Tio + Tc = 198 + 344 = 542 P_CLK cycles = 542 microseconds for N=7.
IR (Initiation Rate) : IR = 1/ET = 1/(542 microsecond) for N=7.
UP (Utilization of the Processor) : UP = 100%
SP (Speed-up) : SP = ET(P=1) / ET(P=6) = 1838/542 = 3.4
CBR (CPU Bound Ratio) : CBR = Tc(P=1)/6 x IR = 1724/6 x 1/542 = 53%
RN (Register Number) :
   J, A, Bri, d(3x3) and d(5x5)              --> 6N
   d(2x2), d(4x4), X, C, theta' and theta    --> 15
   temporary                                 --> 2 + ceil(7/2) = 6 registers
   RN = 6N + 15 + 6 = 6N + 21 = 63 for N=7.
MC (Microcode Number) : MC = Tio/2 + Tc = 198/2 + 344 = 443 words
SCRAM (Size of Control RAM) : SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 443 x ( 4 + 6 x 6 ) = 17.7K bits.
Total Memory = 63 x 32 + 17.7K = 20K bits.

Appendix A.10 : Procedures to Solve the Derivative of Theta and Calculations of the Measurement Parameters for Inverse Jacobian with P = 12.

(1) Host broadcasts J (6xN) to RPi and RPi', i = 1, 2, ..., 6.
(2) RPi computes Aci (6x1) = J [JT]ci, where [JT]ci is the ith column of the transpose of the J matrix.
(3) RPi broadcasts Aci to the other RPi and RPi'.
(4) RPi computes the 20 d(3x3).
(5) RPi sends the 20 d(3x3) to the corresponding RPi'.
(6) RPi' computes 6 d(5x5) = Bri (1x6).
(7) RPi' computes d = d(6x6) = Bri (1x6) . Aci (6x1).
(8) RP1' sends d to the host.
(9) Host broadcasts X (6x1) to RPi', i = 1, 2, ..., 6.
(10) RPi' computes Ci = Bri (1x6) . X (6x1).
(11) RPi' broadcasts Ci to RPj', i <> j, i, j = 1, 2, ..., 6.
(12) RPi' computes [theta']i = [JT]ri (1x6) . C (6x1). For i > 6, the [theta']i are computed in RP1' to RP(N-6)'.
(13) Host broadcasts 1/d to RPi', i = 1, 2, ..., 6.
(14) RPi' computes [theta]i = [theta']i x 1/d. For i > 6, the [theta]i are computed in RP1' to RP(N-6)'.
(15) RPi' sends [theta]i to the host.

Tio :
   J          --> 6N x 2 = 12N P_CLK cycles
   Aci        --> 36 x 2 = 72 P_CLK cycles
   20 d(3x3)  --> 20 x 2 = 40 P_CLK cycles
   X          --> 6 x 2 = 12 P_CLK cycles
   Ci         --> 6 x 2 = 12 P_CLK cycles
   d, 1/d     --> 2 x 2 = 4 P_CLK cycles
   theta      --> N x 2 = 2N P_CLK cycles

   So Tiou = 12N + 72 + 40 = 12N + 112 = 196 P_CLK cycles.
   Tiod = 72 + 40 + 12 + 12 + 4 + 2N = 154 P_CLK cycles.
   Tio = Tiou + Tiod - 72 - 40 = 238 P_CLK cycles.

Tcu (computation time for RPi) :
   (1) T(J [JT]ci) = 6 T(V1xN . VNx1) = [7M + 10] (M=6) = 52 P_CLK cycles for N=7.
   (2) T(20 d(3x3)) = 20 T(V1x3 . V3x1) + 15 T(d(2x2)) = [3M + 10] (M=20) + [2M + 7] (M=15) = 107 P_CLK cycles

   So Tcu = 52 + 107 = 159 P_CLK cycles.

Tcd (computation time for RPi') :
   (1) T(6 d(5x5)) = 6 T(V1x5 . V5x1) + 15 T(V1x4 . V4x1) = [5M + 12] (M=6) + [4M + 12] (M=15) = 114 P_CLK cycles (the 20 d(3x3) are received from RPi)
   (2) T(Bri . Aci) = [6M + 14] (M=1) = 20 P_CLK cycles
   (3) T(Bri . X) = [6M + 14] (M=1) = 20 P_CLK cycles
   (4) T([JT]ri . C) = [6M + 14] (M = ceil(N/6) = 2) = 26 P_CLK cycles for N=7.
   (5) T([theta']i x 1/d) = [M + 3] (M = ceil(N/6) = 2) = 5 P_CLK cycles for N=7.

   So Tcd = 114 + 20 + 20 + 26 + 5 = 185 P_CLK cycles.
   Tc = Tcu + Tcd = 344 P_CLK cycles.

ET (Execution Time) :
   ETu = Tiou + Tcu = 196 + 159 = 355 P_CLK cycles.
   ETd = Tiod + Tcd = 154 + 185 = 339 P_CLK cycles.
   ET = Tio + Tc = 238 + 344 = 582 P_CLK cycles = 582 microseconds for N=7.
MC (Microcode number) :
MCu = Tiou/2 + Tcu = 196/2 + 159 = 257
MCd = Tiod/2 + Tcd = 154/2 + 185 = 262
MC = max(MCu, MCd) = 262 words
SCRAM (Size of Control RAM) : SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 262 x (4 + 6 x 6) = 10.5K bits.
Total Memory = 56 x 32 + 10.5K = 12.3K bits.

Appendix A.11 : Procedures to Solve the Derivative of Theta and Calculations of the Measurement Parameters for Inverse Jacobian with P = 24.

(1) Host broadcasts J (6xN) to RPia and RPid, i = 1, 2, ..., 6.
(2) RPia computes Aci (6x1) = J (6xN) [J^T]ci (Nx1), where [J^T]ci is the ith column of the transpose of the J matrix.
(3) RPia broadcasts Aci to the other RPs.
(4) RPib computes 20 d(3x3).
(5) RPib sends 20 d(3x3) to the corresponding RPic.
(6) RPic computes 6 d(5x5) = Bri (1x6).
(7) RPic sends 6 d(5x5) = Bri (1x6) to RPid.
(8) RPid computes d = d(6x6) = Bri (1x6) . Aci (6x1).
(9) RPid sends d to the host.
(10) Host broadcasts X (6x1) to RPid, i = 1, 2, ..., 6.
(11) RPid computes Ci = Bri (1x6) . X (6x1).
(12) RPid broadcasts Ci to RPjd, i <> j, i, j = 1, 2, ..., 6.
(13) RPid computes [theta']i = [J^T]ri (1x6) . C (6x1). For i > 6, [theta']i are computed in RP1d to RP(N-6)d.
(14) Host broadcasts 1/d to RPid, i = 1, 2, ..., 6.
(15) RPid computes [theta]i = [theta']i x 1/d. For i > 6, [theta]i are computed in RP1d to RP(N-6)d.
(16) RPid sends [theta]i to the host.

Tio :
  J (6xN) -----> 6N x 2 = 12N P_CLK cycles
  Aci (6x1) -----> 36 x 2 = 72 P_CLK cycles
  20 d(3x3) -----> 20 x 2 = 40 P_CLK cycles
  6 d(5x5) -----> 6 x 2 = 12 P_CLK cycles
  X (6x1) -----> 6 x 2 = 12 P_CLK cycles
  Ci -----> 6 x 2 = 12 P_CLK cycles
  d, 1/d -----> 2 x 2 = 4 P_CLK cycles
  theta (Nx1) -----> N x 2 = 2N P_CLK cycles
So Tioa = 12N + 12 = 96    Tida = 60
   Tiob = 36 + 40 = 76     Tidb = 36
   Tioc = 24 + 40 + 12 = 76    Tidc = 48
   Tiod = 12 + 12 + 12 + 4 + 2N + 12 = 66    Tidd = 60
Tio = 12N + 72 + 40 + 12 + 12 + 12 + 4 + 2N = 14N + 152 = 250 P_CLK cycles for N=7.

Tca (Computation time for RPia) : T(J (6xN) [J^T]ci (Nx1)) = 6 T(V .
V ) = [7M + 10] with M=6 = 52 P_CLK cycles for N=7.
Tcb (Computation time for RPib) :
20 T(d(3x3)) = 20 T(V (1x3) . V (3x1)) + 15 T(d(2x2))
  = [3M + 10] with M=20 + [2M + 7] with M=15
  = 107 P_CLK cycles.
Tcc (Computation time for RPic) :
T(6 d(5x5)) = 6 T(V (1x5) . V (5x1)) + 15 T(V (1x4) . V (4x1)) + 20 T(d(3x3))
  = [5M + 12] with M=6 + [4M + 12] with M=15
  = 114 P_CLK cycles.
Tcd (Computation time for RPid) :
(1) T(Bri (1x6) . Aci (6x1)) = [6M + 14] with M=1 = 20 P_CLK cycles.
(2) T(Bri (1x6) . X (6x1)) = [6M + 14] with M=1 = 20 P_CLK cycles.
(3) T([J^T]ri (1x6) . C (6x1)) = [6M + 14] with M = ceil(N/6) = 26 P_CLK cycles for N=7.
(4) T(theta'i x 1/d) = [M + 3] with M = ceil(N/6) = 5 P_CLK cycles for N=7.
Tcd = 20 + 20 + 26 + 5 = 71 P_CLK cycles.
Tc = Tca + Tcb + Tcc + Tcd = 344 P_CLK cycles.

ET (Execution Time) :
ETa = Tioa + Tida + Tca = (12N + 12) + 60 + 52 = 208 P_CLK cycles.
ETb = Tiob + Tidb + Tcb = 76 + 36 + 107 = 219 P_CLK cycles.
ETc = Tioc + Tidc + Tcc = 76 + 48 + 114 = 238 P_CLK cycles.
ETd = Tiod + Tidd + Tcd = 66 + 60 + 71 = 197 P_CLK cycles.
ET = Tio + Tc = 250 + 344 = 594 P_CLK cycles for N=7.
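Each of the four RP groups combines I/O, idle, and compute cycles into its own execution time. A minimal sketch tabulating the P = 24, N = 7 case (the cycle counts are taken from the derivation above; the dictionary layout is my own):

```python
# Per-RP-group execution times for P = 24, N = 7 (P_CLK cycles).
N = 7
Tio = {"a": 12 * N + 12, "b": 76, "c": 76, "d": 66}  # I/O time per group
Tid = {"a": 60, "b": 36, "c": 48, "d": 60}           # idle (waiting) time
Tc  = {"a": 52, "b": 107, "c": 114, "d": 71}         # computation time

ET = {g: Tio[g] + Tid[g] + Tc[g] for g in "abcd"}
IR = 1 / max(ET.values())  # initiation rate is set by the slowest group (RPic)
print(ET)  # {'a': 208, 'b': 219, 'c': 238, 'd': 197}
```

The slowest group, RPic at 238 cycles, fixes the initiation rate quoted below.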
IR (Initiation Rate) : IR = 1/max(ETa, ETb, ETc, ETd) = 1/ETc = 1/(238 microseconds)
UP (Utilization of the Processor) : UP = ET/4 x IR = 594/4 x 1/238 = 62%
SP (SPeed-up) : SP = ET(P=1) / ET(P=24) = 1838/594 = 3.1
CBR (CPU Bound Ratio) : CBR = Tc(P=1)/24 x IR = 1724/24 x 1/238 = 30%

RN (Register Number) :
RNa : J (6xN) -----> 6N
      temporary -----> 6 registers
      RNa = 6N + 6 = 6 x 7 + 6 = 48
RNb : 3 Aci (6x1) -----> 18
      d(3x3) -----> 20
      temporary -----> 6 registers
      RNb = 18 + 20 + 6 = 44
RNc : 2 Aci (6x1) -----> 12
      d(3x3), d(5x5) -----> 20
      d(4x4) -----> 15
      temporary -----> 6 registers
      RNc = 12 + 20 + 15 + 6 = 53
RNd : Jci (6x1), Aci (6x1), X (6x1) and C (6x1) -----> 24
      d, 1/d, [theta']i and [theta]i -----> 4
      temporary -----> 6 registers
      RNd = 24 + 4 + 6 = 34
So RN = max(RNa, RNb, RNc, RNd) = RNc = 53.

MC (Microcode number) :
MCa = Tioa/2 + Tca = (12N + 12)/2 + 52 = 100 for N=7
MCb = Tiob/2 + Tcb = 76/2 + 107 = 145
MCc = Tioc/2 + Tcc = 76/2 + 114 = 152
MCd = Tiod/2 + Tcd = 66/2 + 71 = 104
MC = max(MCa, MCb, MCc, MCd) = 152 words
SCRAM (Size of Control RAM) : SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 152 x (4 + 6 x 6) = 6.08K bits.
Total Memory = 53 x 32 + 6.08K = 7.78K bits.

Appendix A.12 : Microprogram for Forward Recursion of Inverse Dynamics (one RP per Link)

Qi = (i-1)U(i)    Ji = (i)J(i)    Wi = (i)w(i)
i * i * Oi = w Pi = P Si = S i i i i f* Xi = P Fi s F i i i Ni = N T'1 = theta T"i = theta i RF RF RF RO 0U1.1J R9 JUl.lJ R18 Wi-111J,Wi L1J R1 Qi [1,2] RIO J i[ l,2 ] R19 Wi - 1[2] ,Wi [2] R2 Qi[1,3] R11 J i[ l,3 ] R20 Wi -1[3],Wi[3] R3 Q i[2t l] R12 J i[ 2 ,l] R21 Oi-1[1],Oi[1] R4 Qi[2,2] R13 J i [2,2] R22 Oi-1[2] ,Oi[2] R5 Qi[2,3] R14 J i[2 ,3 ] R23 Oi-1[3],Oi[3] R6 Qi[3,l] = 0 R15 J i [3,1] R24 picn R 7 Qi[3,2] R16 J i[3 ,2 ] R25 Pi [2] R8 Qi[3,3] R17 J i [3,3] R26 Pi [3] 228 RF RF RF R27 SUIT R3F> FiLlj R45 temporary R28 Si [2] R37 Fi [23 R46 temporary R29 Si [3] R38 Fi [33 R47 temporary R30 xi [13 R39 Ni [13 R48 temporary R31 Xi[2] R40 Ni [2] R49 temporary R32 Xi[3] R41 Ni [33 R50 temporary R33 YiCl] R42 T' i R51 temporary R34 Yi [23 R43 T"i RS2 temporary R35 Y i [33 R44 mi R53 temporary R54 temporary R56 temporary R56 temporary R57 temporary RSfl temporary RF —> FPM RF — > FAM SUB FPM — > FAM — > (AA) (AB) (AA) (AB) RF RF 1 Qi[2 ,1] , QU3.3] Wi - 1[3], T'i 0 - - 2 01C1.11 , 01[3,23 Oi- 1[33 * T"i 0 -- 3 01[2,13 , Qi[3,21 - - -- 4 Ql[l, 13 , 01[3,33 -, - - Oi[l ,23 Wi- 1[3] 5 Wi-1[2] , T'i 01[3,13 , Oi[1,23 1 0 i[2,33 Oi - 1[3] 6 Wi-1[ 1] , T"i Oi[3,l3 , Oi[2,33 1 ni[i,33 - 7 -, - -, - - 0i[2,23 - 8 - R45 n i[1,23 229 9 Q iL U J , UMLiJ Oi-iLIJ H4S 0 R46 O ld .3 ] 10 Q1C2.ll , W1-1C2] Oi —1[2] R46 1 - - 11 01[1,2] . 
W i-l[ 1] -- -- - 12 Oi[2,2] , Wi-1[2] -- - R45 o i-ic il 13 Q1C1.3] , W1-1C1] - - - R46 01-1C2] 14 Qi[2,3] , W1-1C2] R45 R46 0 R45 - 15 Q1C3.11 , Wi- 1[3] --- R46 - 16 01[3,2] , Wi - 1[ 3] R45 R46 0 R45 - 17 Q1C3.3] , Wi-1[3] - - - R46 R47 18 Qi[1,1] , O i- l[ l] R45 R46 0 R45 - 19 Qi[2,1] , Oi —1[2] R45 R47 0 R46 R47 20 QIC 1,21 , 01-1[1] R46 R47 0 R45 - 21 Qi[2,2] , 01-1C2] -- - R46 R47 22 QIC 1,3] , Oi-1C 1] R45 R47 0 R45 Wi [ 1] 23 Qi[2,3] , 01-1[2] R45 R46 0 R46 Wi [2] 24 0 i[3 ,l] , Oi —1[ 3] - - - R45 - 25 Qi[3.2] , Oi —1[3] R45 R46 0 R46 Wi [3] 26 Qi[3.3] , 01 —1[3] -- - R45 R47 27 Wi [2] , P i[3] R45 R46 0 R46 - 28 W1C3] , Pi[2] R46 R47 0 R45 R47 29 Wi [3] , P i[l] R45 R47 0 R46 - 30 Wi [1] , Pi[3] - - - R45 R47 31 Wi [ 1] , P1[2] R46 R47 0 R46 Oi [1] 32 Wi [2] , P i[l] R45 R46 1 R45 Oi [2] 33 Wi [2] , Si[3] -- - R46 - 230 u W L3J Si [2] R45 R46 1 R45 Oi C3] 35 W [3] siCi] - - R4fi R47 36 w Cl] SiC3] R45 R46 1 R45 - 37 w [1] SiC2] -- R46 R48 38 w [2] Si C1] R45 R46 1 R45 - 39 0 C2] Pi[3] - - R46 R49 40 0 [3] PiC2] R45 R46 1 R45 - 41 0 [3] PiCl] - - R46 R50 42 0 [1] PiC3] R45 R46 1 R45 - 43 0 ci] PiC2] - - R46 R51 44 0 C2] PiCl] R45 R46 1 R45 - 45 0 C2] SiC3] - - R46 R52 46 0 C3] Si C2] R45 R46 1 R45 - 47 0 C3] Si C1] -- R46 R53 48 0 CI] SiC3] R45 R46 1 R45 - 49 0 Cl] SiC2] - - R46 R54 50 0 C2] Si Ci] R45 R46 1 R45 51 w C2] R49 - - R46 R55 52 w C3] R48 R45 R46 1 R45 - 53 w C 3] R47 - - R46 R56 54 w [1] R49 R45 R46 1 R45 - 55 w Cl] R48 - - R46 R57 56 w C2] R47 R45 R46 1 R45 - 1 1 1 i ID1 1 r-- I 1 w C2] R52 -- R46 R58 58 VI C3] R51 R45 R46 1 R45 - 231 59 tfTDT1' ft50 - -- R46 R47 60 W i[l] R52 R45 R46 1 R45 - 61 Wi [ 1] R51 R53 R47 0 R46 R48 62 Wi[2] R50 R45 R46 1 R45 - 63 Q i[ l, l] xi-iCi] R54 R48 0 R46 R49 64 Q i[2 ,l] Xi-1[2] R45 R46 1 R45 R47 65 Qi[1*2] Xi-l[l] R55 R49 0 R46 R50 66 Qi[2*2] Xi-1[2] R45 R46 1 R45 R4R 67 Qi[l*3] Xi-l[l] R56 R50 0 R46 R51 68 Oi[2,3] Xi-l[2] R45 R46 0 R45 R49 69 Q i[3 ,l] Xi-1[3] R57 R51 0 R46 R52 70 Oi[3,2] Xi- 1[3] R45 
R46 0 R45 R50 71 Qi[3,3] Xi-1[3] R58 R52 0 R46 R58 72 J i [ l , l ] Wi [ 1] R45 R46 0 R45 R51 73 Ji[l,2] Wi [2] R45 R58 0 R46 R58 74 J i[ 2 ,l] Wi [ 1] R46 R58 0 R45 R52 75 J1C2.2] Wi [2] - -- R46 R58 76 J i [3,1] Wi [ 1] R45 R58 0 R45 R53 77 J i[3 ,2 ] Wi [2] R45 R46 0 R46 R54 78 O i [ l , 3] Wi [3] R63 R47 0 R45 - 79 J i[2 ,3 ] Wi [3] R45 R46 0 R46 R55 80 J i[3 ,3 ] Wi [3] R54 R48 0 R45 R58 81 j i [ i , n O i[l] R45 R46 0 R46 X i[l] 82 J i [1,2] Oi [2] R46 R58 0 R45 R58 83 Ji [2,1] Oi [ 1] R45 R58 0 R46 Xi [2] 2 3 2 84 JH2.2J 0U2J R49 0 R45 R58 85 Ji [3,1] Oi [13 R46 R5R 0 R46 R47 86 J1C3.2] Oi [23 R45 R46 0 R45 R48 87 J i [1.3] Oi[33 R50 xi [13 0 R46 Xi [33 88 J i [2,3] Oi [3] R45 R46 0 R45 R49 89 J i [3,3] Oi [3] R51 X i [23 0 R46 R58 90 Wi[2] R49 R45 R46 0 R45 YiCll 91 Wi[3] R48 R45 R58 0 R46 R58 92 Wi [32 R47 R46 R58 0 R45 Yi[2] 93 Wi [ 1] R49 R52 Xi[3] 0 R46 R58 94 Wi [1] R48 R45 R58 0 R45 R53 95 Wi [2] R47 R46 R45 1 R46 R54 96 mi Y i[l] -- - R45 Yi [33 97 mi Yi[2] R46 R45 1 R46 R55 98 mi Yi [3] - - - R45 R47 99 - - R46 R45 1 F i[l] - 100 -- -- - FI [2] R48 101 - - R53 R47 0 Fi[3] - 102 - - R54 R4R 0 - R49 103 - - R55 R49 0 - - 104 ------Ni [ 13 105 ------Ni [23 106 ------Ni [33 233 Appendix A.13 : Microprogram for Backward Recursion of Inverse Dynamics (one RP per Link) i -1 Qi = . 
U Fi Ni = N i Pi = P Si fi +1 = f 1 + 1 i-1 i i -1 fi = f ni+1 = n ni = n i i+1 1 RF RF RO Q iL U J R9 FU1J R1 Q1C1.Z] RIO F1[2] R2 Q i[l,3 ] R11 Fi[3 R3 o ic z .n R 12 Ni C1] R4 Qi[2,2] R 13 Ni[2] R5 Qi[2,3] R 14 Ni [3] R6 Qi[3,1] = 0 R 15 PiCl] R7 Qi[3,2] R16 Pi [2] R8 Qi[3,3] R17 Pi [3] 234 RF RF R18 SU1J R27 temporary R19 Si [2] R28 temporary R20 Si[33 R29 temporary R21 fi + l[l],fi[l] R30 temporary R22 fi+l[2],fi[23 R31 temporary R23 fi+l[3],fi[3] R32 temporary R24 ni+l[l],ni [1] R33 temporary R25 ni + l[2] ,ni [2] R34 temporary R26 ni + l[3] ,n i[3] R35 temporary RF ~ > Ff>M RF — > FAM StJfe FPM — > FAM --> (AA) (AB) (AA) (AB) RF RF 1 Q1C2.1] , Qi[3,3] Fi [13 , fi + lCl] 0 -- 2 QiCl.l] , Qi[3,2] Fi[23 , ft+ l[2 ] 0 - - 3 Q1C2.1] , Qi[3,2] Fi[3] , fi+l[3] 0 - - 4 OiCl.l] , Oi[3,33 Pi[l] , Si[l3 0 Oi[1,23 R27 5 P1[2] , f1 + l[33 0i[3,l] , 0i[l,?3 1 Oi[2,33 R28 6 Pi [3] , fi + l[2] Oi[3,13 , Qi[2,33 1 Oi[1.33 R29 7 Pi[33 , fi + l[ 13 Pi[2] , Si[2] 0 Oi[2,23 R33 8 Pi[ 13 , fi+ l[3 ] Pi[33 , Si[33 0 R30 Oi[l,2] 9 Pi[ 1] , fi + l[2] " » " 0 R31 Qi[2,33 ID Pi [23 , fi + l[ 1] R30 , R31 1 R30 R34 11 Qi[l*l3 » R27 " * * - R31 R35 12 Ql[l,2] , R28 R30 , R31 1 R30 - 13 Qi[2,l3 , R27 - R31 fi+ l[ 13 | 235 H , m WO' R31 1 Wo - 15 Q U3.1] R27 N i[l] fi + lCl] 0 R31 fi+lC2] 16 Q1C3.2] R28 R30 R31 0 R30 - 17 Oi[1,3] R29 Ni [2] fi + lC2] 0 R31 fi + lC3] 18 Ql[2,3] R29 R30 R31 0 R30 NiCl] 19 Qi[3,3] R29 Ni [3] Fi+lC3] 0 R31 R32 20 R34 Fi [31 R30 R31 0 R30 Ni [21 21 R35 Fi[2] R30 R32 0 R31 R32 22 R35 F i[l] R31 R32 0 R30 Ni [31 23 R33 Fi [3] ni + l[ l ] NiCl] 0 R31 R32 24 R33 Fi [2] R30 R32 0 R30 f i c n 25 R34 Fi [1] R31 R30 1 R31 fiC2] 26 - - ni + l[2] NiC2] 0 R30 NiCl] 27 -- R31 R30 1 R31 Fi C31 28 - - ni + l[3] Ni [3] 0 R30 Fi Cl] 29 -- R31 R30 1 - Ni [2] 30 -- N i[l] FiCi] 0 - Fi[2] 31 -- Ni [21 FiC2] 0 - Ni [3] 32 ------FiC3] 33 - - NiC3] Fi [3] 0 - NiCl] 34 Q iC l.l] Ni [ 11 -- - - Ni [21 35 01[1,2] Ni [2] - - -- - 36 Qi[2,1] NiCl] - - -- Ni [3] 37 01[2,2] Ni [2] - - - 
R30 - 38 Qi[3,1] Ni[1] - - - R31 -
39 Qi[3,2] Ni[2]   R30 R31   R30
40 Qi[1,3] Ni[3]   R31
41 Qi[2,3] Ni[3]   R30 R31   R30
42 Qi[3,3] Ni[3]   R31
43   R30 R31 0 R30
44   R30 R32 0 R31
45   R32 R30 R31
46
47   R30 R32
48
49
50

Appendix A.14 : Calculation of the Measurement Parameters for Inverse Dynamics with P = 1.

Tcf(Forward) = 106N P_CLK cycles
Tcb(Backward) = 50N P_CLK cycles
Tc = Tcf + Tcb = 156N P_CLK cycles
Tio = 4N + 2N = 6N P_CLK cycles
ET = Tc + Tio = 162N P_CLK cycles
IR = 1/ET = 1/(162N microseconds)
UP = 100%
CBR = 156N/162N = 96%
RN = 51N + 14 (see Appendices A.12 and A.13) = 371
MC = (Tc + Tio/2) / N = 156
SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 156 x (4 + 6 x 9) = 9.1K bits
Total Memory = 371 x 32 + 9.1K = 21K bits

Appendix A.15 : Calculation of the Measurement Parameters for Inverse Dynamics with P = 2.

Tc1(Forward) = 106N P_CLK cycles
Tc2(Backward) = 50N P_CLK cycles
Tc = Tc1 + Tc2 = 156N P_CLK cycles
Tio1 = 4N + 16N = 20N P_CLK cycles
Tio2 = 16N + 2N = 18N P_CLK cycles
ET1 = Tc1 + Tio1 = 126N P_CLK cycles
ET2 = Tc2 + Tio2 = 68N P_CLK cycles
ET = ET1 + ET2 - 16N = 178N P_CLK cycles
IR = 1/max(ET1, ET2) = 1/(126N microseconds)
UP = ET/2 x IR = 178N/2 x 1/126N = 71%
SP = ET(P=1) / ET(P=2) = 162N / 178N = 0.91
CBR = Tc(P=1)/2 x IR = 156N/2 x 1/126N = 62%
RN1 = 45N + 14 (see Appendix A.12)
RN2 = 27N + 9 (see Appendix A.13)
RN = max(RN1, RN2) = 45N + 14 = 329 for N=7
MC1 = (Tc1 + Tio1/2) / N = 116
MC2 = (Tc2 + Tio2/2) / N = 59
MC = max(MC1, MC2) = 116
SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 116 x 58 = 6.8K bits
Total Memory = 329 x 32 + 6.8K = 17.3K bits

Appendix A.16 : Calculation of the Measurement Parameters for Inverse Dynamics with P = N.
For one RP :
Tcf(Forward) = 133 P_CLK cycles
Tiof = 56 - 4 - 6 - 6 = 40 P_CLK cycles
Tidf = 11 P_CLK cycles
Tcb(Backward) = 60 - 9 = 51 P_CLK cycles
Tiob = 40 - 4 - 6 - 6 = 24 P_CLK cycles
Tidb = 5 P_CLK cycles

For the whole system :
ET = (40N + 144) + (34N + 46) = 74N + 190 P_CLK cycles = 708 P_CLK cycles
IR = 1/ET = 1/(708 microseconds) for N=7
UP = (40 + 144 + 34 + 46) / ET = 37%
SP = ET(P=1) / ET(P=N) = 162N / (74N + 190) = 1.6 for N=7
CBR = Tc(P=1)/N x IR = 156N/N x 1/708 = 22%
RN = 51 + 14 = 65 (see Appendices A.12 and A.13)
MCf = Tcf + Tiof/2 + Tidf = 164
MCb = Tcb + Tiob/2 + Tidb = 68
MC = MCf + MCb = 232
SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 232 x 46 = 10.7K bits
Total Memory = 65 x 32 + 10.7K = 12.8K bits

Appendix A.17 : Calculation of the Measurement Parameters for Inverse Dynamics with P = 2N.

For one RP :
Tcf(Forward) = 133 P_CLK cycles
Tiof = 56 P_CLK cycles
Tidf = 11 P_CLK cycles
Tcb(Backward) = 60 P_CLK cycles
Tiob = 40 P_CLK cycles
Tidb = 5 P_CLK cycles

For the whole system :
ET = (40N + 160) + (34N + 71) = 74N + 231 P_CLK cycles = 749 P_CLK cycles
IR = 1/(40 + 160) = 1/(200 microseconds) for N=7
UP = (40 + 160 + 34 + 71)/2 x IR = 76%
SP = ET(P=1) / ET(P=2N) = 162N / (74N + 231) = 1.5 for N=7
CBR = Tc(P=1)/2N x IR = 156N/2N x 1/200 = 39%
RNi = 45 + 14 = 59 (see Appendix A.12)
RNi' = 27 + 9 = 36 (see Appendix A.13)
RN = max(RNi, RNi') = 59
MCf = Tcf + Tiof/2 + Tidf = 172
MCb = Tcb + Tiob/2 + Tidb = 85
MC = max(MCf, MCb) = 172
SCRAM = MC x ( 4 + 6 x ceil(log2(RN)) ) = 172 x 40 = 6.9K bits
Total Memory = 59 x 32 + 6.9K = 8.8K bits

Appendix B.1 : Detailed Circuit Descriptions for the Two-Phase Generator (TPG) and Two Johnson Counters, JCNTR and JCNTF.

The logic diagram for the TPG is shown in figure B.1. The Johnson counters, JCNTR and JCNTF, shown in figure B.2, are implemented using PLAs. Both have the same state diagram but they are driven by different clock phases.
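The two counters step through the same twisted-ring (Johnson) sequence. A minimal behavioural sketch, assuming a 2-bit counter width for illustration (the actual width is given by the state diagram in figure B.2, which does not reproduce here):

```python
def johnson_states(width, steps):
    """Yield successive states of a twisted-ring (Johnson) counter:
    shift right by one, feeding the inverted last bit back into the first."""
    state = [0] * width
    for _ in range(steps):
        yield tuple(state)
        state = [1 - state[-1]] + state[:-1]

# A 2-bit Johnson counter cycles through 2 x 2 = 4 states:
# 00 -> 10 -> 11 -> 01 -> 00 -> ...
print(list(johnson_states(2, 5)))
```

An n-bit Johnson counter visits 2n states per cycle, which is why it suits non-overlapping multi-phase sequencing: adjacent states differ in a single bit.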
Figure B.1 Two-Phase Clock Generator (TPG)

Figure B.2 State Diagrams for JCNTR and JCNTF

Appendix B.2 : Detailed Procedures for Loading Microprogram and Circuit Designs for the Synchronization Controller and Bootstrap Controller (SC+BTC).

Figure B.3 shows the timing of microprogram loading. The loading procedure is started by the BT signal sent from the host. The counter (CNT) in the BU is reset to zero by the CLR signal at the beginning, and the signal LD is asserted until the loading is complete, causing the output of the CNT to be selected as the address of the CRAM. The Host Write signal (HWR) is synchronized by the Synchronization Controller (SC), which generates the WR signal with a pulse width equal to one SYS_CLK period. The WR signal is used as an input signal to the BTC. Whenever a HWR is sent from the host, a WR is generated by the SC. Consequently, three loading signals, LD0, LD1 and LD2, are generated following the WR signals. These loading signals are used to latch the 16-bit data sent from the host. The latch, made of dynamic registers, needs to be refreshed by a clock phase. To avoid conflict between the loading signals and that clock phase, the LDi* (i = 0, 1, 2) signals, obtained by logically ANDing LDi and the phase, are used as the real loading signals to the latch. As soon as the three 16-bit words are latched, the Write Enable signal (WEN) is generated. This allows the microinstruction to be written into the CRAM. After that, the CNT is incremented by one with the Increase signal (INC). When microprogram loading is complete, the host computer acknowledges the BTC by issuing the Load Complete signal (LC). The LD signal is then unasserted. Consequently, the address of the CRAM is produced by the Sequencer (SEQ).

Figure B.3 Timing for Microprogram Loading

When the LD signal is asserted to
When the LD signal is asserted to 245 Figure Figure R.3 Timing for Microprogram Loading 246 high, the counter in the SEQ is reset to zero and the SFQ is disabled; when the LO is asserted to low, the SEQ is enabled. The Sychronization Controller, shown in figure R.4, is implemented using a PLA. The BTC is also made using a PLA and its state diagram is shown in figure R.5. The RTC's input signals are gated by *1 and its output signals by *2. The HWR signal can not be used as a direct input signal to the RTC because the BTC is a level-trigger PLA and because the HWR signal may stay high for more than three SYS_CLK periods causing the BTC regard it as anotht?' write signal from the host computer. For example, if the HWR stays high for three SYS_CLK periods, the RTC will probably change from state SO to S3 and write the data into two latches, which results in an error. 247 PLA HWR WR State/WR BT HWR HWR HWR HWR Figure B.4 Synchronization Controller for HWR 248 0100100 OlOOOlO WR 0100000 0100001 0101000 0100000 WR State/CLR, State/CLR, LD, LD1, LDO, INC LD2, WEN, 0100000 0000000 Figure Figure B.5 State of Diagram the Bootstrap Controller (BTC) 0110000 1100000 249 Appendix B.3 : Detailed Circuit Designs for the Four Format Converters, FCE, FCS, FCW, and FCN. Figure B.6 shows the detailed circuits for each of the four format converters. The data sent out from the east and south sides of one RP flows into the west and north sides of the adjacent RPs. There are drivers at the output ports to increase the driving capability and so decrease the transmission time between the RPs. The four I/O operation enable bits in the microinstruction for EE, EW, ES, and EN are four enable bits in the microinstruction for I/O operations. The data to be transmitted can flow out at the east and south sides at the same time, and the data received from the west and north sides can be stored in the RF. 
During the P<^, the data to be transferred to the adjacent RP is read and stored in the format converter. Due to the high capacitance outside the chip and the conversion from 32-bit data to two 16-bit words, two P_CLK cycles are required to transfer the high and low part of the 32-bit data to the other RPs. The received data collected by the format converter is stored in the RF in the P be seen that the permissible time for data transferred between RPs is about a half of the P_CLK, i.e . about 500 ns, where "H" means the high part of a 32-bit data and "L" means the low part. 250 II IO *E W * 11 * J r a 2 b' ih H > ° - IO*Jrbl I0*Jfa2 —*BC i IO*ES*J fdl I— l_ ;|— 10*EN 11 *Jrb2 f ' BB I I — L J [ } 0 - T T 10*Jrb2 IO *Jfa1 ECS FCN Figure R.6 Circuit Diagram of the FCE, FCW, FCS and FCN iCZDC or g(L) ) l ( c ______------, ______! or g(H) Time Time for Data Transferred RPs Between ______i i I — c(H) a a or e b(H) b(H) or f(H) X btL) or f tL) / > Permissible ( Between RPs RPs Between Figure Figure B.7 Timing for RP's Data between Passed — Time for Data ' ' Transmission I I j Transmission i 1 1 [ Permissible l( I I Transferred . . I I <( b"(H) or f'(H ) ^ b '(l) or f'CL) | I I I i i i I < I I I I I I r i ______Jfb Jfa Jfa Jrb Jrb Jra _Jra J p*j — 252 Appendix B.4 : Data Flow in the Data Path for Normal Arithmetic Operati ons. Figure B.8 shows the data path for normal arithmetic operations. There are three pipeline stages in each Floating Point Adder (FPA) and Floating Point M ultiplier (FPM). During the firs t half cycle of the the firs t two operands are read onto the Bus A (BA) and Bus B (RB), and are latched at the falling edge of the Jfal. The second set of operands are read onto the BA and BB at the second half cycle of the Pd>^ and latched at the falling edge of the Jfb l. Roth BA and BB are precharged during the P ^ . The results are stored into the RF through the BC during P ^ . 
The result of the FPM is put onto the Rus C (BC) at the fir s t half cycle of the P There are two pipeline registers for each stage. This provides LSSD testab ility and longer permissible computation time for each stage. The timing diagram in figure B.9 shows that the permissible time for each stage is about one P_CLK period. If only one register is used and the clock is alternated by P ^ and P ^ * then the permissible computation time for each stage is about one half of the P_CLK period. When operands are being read from the RF, their addresses are put onto Address A (AA) and Address B (AB) prior to the operands available on the BA and BB, because it takes a certain amount of time (access time) to decode the address. Therefore, the addresses are latched onto 253 \/ \f 1 Jfal I I----- Jfb l j l H ] P*2 +1 P41! ]------ ♦5 P$0 +2 AA : Address bus A AB : Address bus B Pf P(t>, : Pipeline T0*Jra2 I0*Jrb2 Figure B.8 Three Pipelined Stages in the FPM and FPA p$. “L J f a l Jra2- JL Jfbl Jrb2 • JL CD 1 i ' C > Permissible 1 b— Computa------tion time for Stage 1 J LXD • C L Perm issible1/ r Computa- *l\ tion time for Stage 2 CD C > Permissible \/~ K omputa- n\. t i o n tim e for Stage 3 C c Figure B.9 Timing for the Data Through the Pipelined Stages 255 the AA or AB at the falling edge of the Jra or Jrb, while the operands are latched in the input registers of two arithmetic units at the fallin g edge of the Jfa or Jfb. Also, the Write signal (WR) tothe RF w ill not be asserted until the address has been decoded and the data is stable on the BC, Figure B.10 shows how the the addresses are gated to the AA and AB and how the WR is generated. 256 A0DR3 ADDR5 A0DR6 A0DR3 ADDR6 ADDR1 ADDR4 IO*EW | r IO*EN IO*EE j r I0*Jra2 —|F I0*Jrb2 IO*Jral —|P IO*Jrbl *Jral "1L *Jra2 “ 1 L *J rb2 H t JL \ t_ AA rs> AB uT IO*ES*Jral HI IO*Jral IO*JrblHL ADDR4 ADDR2 ADDR5 WR = 10 (WM * Jfa2 * + VJA * Jfb2 * + 10 (EW * Jfa2 * + ^ * ^^2 * $\f2) Figure R. 
10 AA, AB and WR for the Register File Appendix B.5 : Circuit Design of the Zero Checking Unit. This unit is made by a programable logic array (PLA) which can be generated by TPLA, one of the VLSI CAO tools. TPLA accepts the truth table and then generates the PLA layout automatically. The truth table can be generated by EONTOTT package by supplying it with logic equations. The logic equation of the zero checking unit is contained in the f ile fa_zero_ch.eqn, listed in the following, where the means logic OR. Input Bi_EX (i = 7, 6, 5, .. , 0) means the ith bit of the exponent; output EXP_NE0 means "exponent not equal to zero". EX can be the exponent of operand A or operand B. INORDER = B7_EX B6_EX B5_EX B4_EX R3_EX B2_EX R1_EX R0_EX; OUTORDER = EXP_NE0; EXP_NE0 = B7_EX ) B6_EX | B5_EX | B4_EX | B3_EX [ B2_EX | B 1_EX j BO E X ; After the file is created, the PLA layout can be generated immediately by issuing the command : % eqntott -R fa_zero_ch.eqn | tpla -s Btrans -o fa_zero_ch.ca where the -R option forces eqntott to produce a truth table with no redundant minterms. Following the option of -s is the argument specifying the style of PLA, Btrans argument makes a PLA with buried contacts, NMOS and trans version (inputs and outputs on opposite side of the PLA). Following the option of -o is the name of the output f ile . And the "|" is a pipe in the UNIX, which causes the output of the command eqntott to be sent as input to the command tpla. Since the minterm in the above logic equation has only one variable, the 258 input variables can be feed directly into the DR plane of the PLA. So, the input inverters and the AND plane of the PLA can be eliminated and more than half of the chip area can be saved. 259 Appendix B.6 : Circuit Design of the Sign Unit. 
For the operation b it, OP, and the sign bits of the two operands, there are 8 combinations but only 4 possible effective operations, shown in table B .l, SUB = 0 means mantissa addition; SUB = 1 means mantissa subtraction. Table B.l Truth Table for Generating Effective Operation Bits {EOPO E0P1) and SUB OP SA SB Effective operation EOPO E0P1 SUB 0 0 0 + A + R 0 0 0 0 0 1 + A - B 0 1 1 0 1 0 - A + B 1 0 1 0 1 1 - A - B 1 1 0 1 0 0 + A - B 0 1 1 1 0 1 + A + B 0 0 0 1 1 0 - A - B 1 1 0 1 1 1 - A + B 1 0 1 The final sign b it, SR, is determined by the effective operation, which is represented by EOPO and E0P1, and the B_GT_A signal. The truth table is shown in table B.2. 260 Table B.2 Truth Table for Generating the Final Sign Bit of the Result Effective operation j EOPO E0P1 j B_GT_A | SR + A + B 0 0 0 0 + A + R 0 0 1 0 + A - B 0 1 0 0 + A - R 0 1 1 1 - A + B 1 0 0 1 - A + B 1 0 1 0 - A - B 1 1 0 1 - A- B 1 1 1 1 The logic equation of SR is contained in the f ile fa__sign.eqn, which is listed in the following. #define xor(a, b) ( (a & !b) | (la A b) ) ^define EOPO (SA) #define E0P1 ( xor(OP, SB) ) INORDER = OP SA SB BJ5T_A; OUTORDER = SR SUB; SUB = xor(EOPO, E0P1) ; SR = (EOPO & !R GT A) | (E0P1 A R_GT_A); where is logic NOT and " V is logic AND. As with the above, the PLA can be easily obtained by issuing EONTOTT and TPLA commands. 261 Appendix B.7 : Circuit Design of the Alignment Control Unit. Figure B .ll shows the block diagram of the alignment control unit. It contains two 8-bit adder/subtractors (add_sub_8.ca). Since their SUB_OP and carry-in are tied to high, both are doing substraction. Rut since only the positive result is meaningful, one of the outputs from the two subtractors is selected. The carry-out from the le ft subtractor, EB_GT_EA, is asserted to low, when the exponent of B is greater than the exponent of A and so the result of EB substracting EA is selected by the multiplexor (mux_8.ca). 
Because there are, at most, 24 bits to the right-shifter, if the shift amount is greater than or equal to 24, the output of the right-shifter is zero. The block, fa_ge__24 ,ca, will assert the signal FZ_R ( force to zero for right_shifter). Its logic equation is stored in the f ile fa_ge_24.eqn as follows : INORDER = B7_EE B6_EE B5_EE R4_EE B3_EE; OUTORDER = FZ_R; FZ_R = B7_EE | B6_EE | B5_EE | (R4_EE S B3_EE) ; EE is the exponent difference of the operand A and operand R. Notice that only 5 bits of the exponent difference need to be sent to the rig h t-sh ifter. According to the input MR_GT_MA (mantissa of A greater than mantissa of B, low asserted) and the two carry-outs from the two exponent subtractors, the alignment control unit also generates the signal B_GT_A. The logic equation of generating the B_GT_A is stored in the f ile fa comp.ca as follows : 262 SUB-OP SJB-OP add sub S.ca add sub S.ca Tea-Fb ) Teb-F a ) EA G (MB GT MA) INI I NO mux 8 .c a fa_com p.ca OUT EB GT EA AMOUNT R fa_ge_24.ca ■> FZ_R Figure R .ll Block Diagram of the Alignment Control Unit 263 INORDER = EB_GT_EA EA_GT_EB MB__MA; OUTORDER = B__GT_A; B_GT_A = !EB_GT_EA | (EA_GT_EB & !MB_GT_MA) ; Because all three input signals are negative active, while the PLA inputs are regarded as positive active, the varibles shown in the above equation are complemented. 264 Appendix B.8 : Circuit Design of the 24-bit Shifter, The 24-bit shifter is made of a barrel shifter, A barrel shifter is basically made of a number of multiplexors. For example, a 4-bit barrel right shifter is shown in figure B.12. From those equations, it can be seen that each output is operationally equivalent to a four-input multiplexer with the inputs connected so that the select signal generates successive one-bit shifts of the input data word. It is known that a multiplexor can be implemented easily with pass transistors in NMOS technology. 
One example of a 4-h it barrel shifter is shown in [1, pl59] and its layout is shown in PLATE 13 of [1], We can see that for a 4-bit barrel shifter, 4 x 4 = 16 pass transistors are needed. There are many possible schemes to make a 24-bit barrel shifter. Scheme 1) : one level. Then the number of pass transistors needed is 24 x 24 = 576. Scheme 2) : two levels. The first level has 12 2-bit barrel shifters and the second level has 2 12-bit barrel shifters. Then the number of pass transistors is (2 x 2) x 12 + (12 x 12) x 2 = 336. Scheme 3) : two levels. The first level has 8 3-bit barrel shifters and the second level has 3 R-bit barrel shifters. Then the number of pass transistors is (3 x 3) x R + (R x 8) x 3 = 264. 265 BO B1 B2 AO A1 A2 A3 barrel 4 r.ca o I I o (SHO & AO) | (SHI (Sh'2 & Bl) (SH3 & BO) o I I 1 — ( (SHO & Al) | (SHI (SH2 & B2) (SH3 & Bl) O I I M C (SHO & A2) | (SHI (SH2 & AO) (SH3 & B2) II o (SHO & A3) | (SHI (SH2 & Al) (SH3 & AO) U) Figure B.12 4-bit Barrel Right-Shifter 266 Scheme 4) : two levels. The firs t level has 6 4-bit barrel shifters and the second level has 4 6-bit barrel shifters. Then the number of pass transistors is (4 x 4) x 6 + (6 x 6) x 4 = 240. Obviously, scheme 4 has the minimum number of pass transistors and so it is used in the following right/left shifter. The connections between the 4-bit barrel shifters and 6-bit barrel shifters are shown in figure B.13, In the 6-bit barrel shifter, the 5 le ft most bits of the 11 inputs are connected to ground for right shifting while the 5 right most bits are connected to ground for left shifting. One inverter is added to each output of the 4-bit and 6-bit barrel shifter to reduce the large parasitic capacitance introduced by the long connecting wire between the firs t level and the second level barrel shifters. Figure B.13 shows a 24-bit barrel right-shifter whose le ft most 3 bits are connected to ground. Roth the 4-bit and 6-bit barrel shifter are shifting to the right. 
A 24-bit barrel le ft-s h ifte r can be easily obtained by just flipping the 24-bit barrel right-shifter and connecting the 3 right most bits to ground, while keeping the order of the input, output signals, and shifting control signals unchanged. Figure R.14 shows the 24-bit rig h t/le ft shifter used in the pre/post normalization. The signal FZ_R/FZ_L forces the ?4-bit barrel shifter to have zero output. As FZ_R/FZ_L is asserted, the shift amount equal to 24 is selected through the multiplexor {mux_5.cal and that results in the 24-bit shifter having an zero output. 267 I20"I23 ^16'-I19 I12',I15 !8 ■' !8 ■' Jll " J7 U barrel-6-r.ca U •• I- SH5 SHt SH3 Figure Figure B.13 of Block Diagram the 24-bit Barrel Right-Shifter d ec-2- S1 4,ca St 52 S3 S4 26ft 69Z (AMOUNT L) AMOUNTJ F L){FZ FZ_R iue .4 lc Darm f h 2-i Rgt Lf) Shifter (Left) Right 24-bit the Block Diagram of B.14 Figure "11000” INI I NO mux G.ca s S<4:0> arl 4 l).ca r 24 barrel I <23:0> 0<23:0- Appendix B.9 : Detailed Explanation for the Postnormalization. I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form : RSH bO b (-l) M -2 ) b{-3) bf-22) b(-23) 1 X * X XXXX where X can be 1 or 0, it is shifted right by one bit so the form becomes : rO j r( -T) r(-20 r(-3) ...... j r(-22) r( -23) 1 . 1 bO b (-l) b(-2) ...... 1 b(-21) b(-21) Because the left-shifter cannot do right shifting, a multiplexor (mux_23.ca) is used to select the appropriate bits of the output from the mantissa adder/subtractor, add_sub_24.ca, to achieve the right shifting purpose. The implicit leading one is not to be sent to the output register; only r(-l) to r(-23), or b(0) to b(-22), are passed through the multiplexor and stored in the ouput register. 
At the same time, because of the one-bit right shifting, the common exponent, the exponent of the larger operand, should be incremented by one by inverting the RSH to the SUB_0P of the exponent update unit, add_sub_8.ca, tying the carry-in, C -l, to 5V, and setting one of the add_sub_8.ca inputs to zero, i.e . setting the output from the leading zero detector to zero. Since RSH = 1, SUB__0P becomes 0 and so addition is executed. The ouput from the leading zero detector is the lower five bits of the B operand in the exponent update unit. To 270 have zero output from the leading zero detector, the RSH signal needs to be sent to the leading zero detector. When the RSH is asserted, the shift amount is zero. I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form : RSH bO b(-l) b(-2) TIT • • • • • b(-22) "R-2'3") 0 1 • XXX XX no shifting is needed, so the shift amount is zero. The 23 bits, from b (-l) to b(-23), are passed through the le ft-s h ifte r without changing and latched at the output register. Since RSH = 0, SUB_0P becomes 1 and so subtraction is executed. The exponent update unit subtracts zero from the common exponent by tying carry-in to 5V and setting the subtrahend to zero, and resulting in the common exponent remaining unchanged. Since the RSH is 0, the output from the le ft-s h ifte r, without being shifted, is selected in the multiplexor (mux_23.ca). I f the output of the mantissa adder/subtractor, add_sub_24.ca, has the form : RSH bO b( -1) b(-?) b(-3) » • I » • ~bT-T2) b(-23) 0 0 * 0 1 X X X it is shifted le ft by two bits, and the output form becomes : rO rr-1) r(-2J rf-T )'' ■Vl-HT) r(-23) 1 • b(-3) b<-4) b(-5) 0 0 2 7 1 At the same time, the common exponent is subtracted by two by having SUB_0P set to 1 (so subtraction is executed), carry_in set to 1, and R operand set to two. 
If the output of the mantissa adder/subtractor, add_sub_24.ca, has the form:

    RSH   b0 . b(-1) b(-2) b(-3) ... b(-22) b(-23)
     0     0    0     0     0    ...   0      0

the result is regarded as zero. A shift amount equal to 24 will force both the left-shifter output and the common exponent to zero.

Appendix B.10 : Circuit Design of the Leading Zero Detector.

The leading zero detector is made of a PLA. Its truth table is shown in table B.3 and stored in the file fa_zero_de.tbl. Its layout is stored in the file fa_zero_de.ca. Notice that the truth table is very similar to that of a priority encoder, where RSH and b0 have the highest priority and bit(-23) has the lowest. A dash (-) means don't care.

Table B.3 Truth Table of the Leading Zero Detector

    RSH  b0  b(-1) b(-2) ... b(-22) b(-23) |  AMOUNT_L
     1   -    -     -    ...   -      -    |  0 0 0 0 0
     0   1    -     -    ...   -      -    |  0 0 0 0 0
     0   0    1     -    ...   -      -    |  0 0 0 0 1
     0   0    0     1    ...   -      -    |  0 0 0 1 0
     .   .    .     .    ...   .      .    |      .
     0   0    0     0    ...   1      -    |  1 0 1 1 0
     0   0    0     0    ...   0      1    |  1 0 1 1 1
     0   0    0     0    ...   0      0    |  1 1 0 0 0

Appendix B.11 : Circuit Design of the Overflow/Underflow Unit.

Figure B.15 shows the block diagram of the over/underflow unit.
It contains the overflow block (fa_ovf.ca), the underflow block (fa_udf.ca), and fa_eq_24.ca, which is used to test whether or not the left shift amount is equal to 24. If the mantissa of the final result is zero, the signal MC_EQ_0 is asserted, which causes ZOUT to be asserted, and this will force the final mantissa to zero. The block fa_eq_24.ca is made of a PLA and its logic function is stored in the file fa_eq_24.eqn, listed below:

    INORDER = B4_AM B3_AM B2_AM B1_AM B0_AM;
    OUTORDER = MC_EQ_0;
    MC_EQ_0 = B4_AM & B3_AM & !B2_AM & !B1_AM & !B0_AM;

Bi_AM (i = 4, 3, 2, 1, 0) is the ith bit of AMOUNT_L, the output of the leading zero detector.

Overflow can occur only if the common exponent has the largest possible biased value (254) and the exponent update unit increments it by one due to mantissa overflow. So the overflow flag, OVF, is asserted when the final exponent is equal to 255 and the mantissa adder/subtractor is doing addition. Notice that the final exponent can have the value 255 for subtraction too. For example, if the common exponent before updating is 2 and the left shift amount is 3, then the output of the exponent update unit is 255, but this is underflow, not overflow. The overflow block, fa_ovf.ca, also tests whether or not the final exponent is zero. If it is, the signal ED_EQ_0 is asserted and causes ZOUT to be asserted. The overflow block is made of a PLA; its logic equations are stored in the file fa_ovf.eqn, listed below:

    INORDER = SUB B7_ED B6_ED B5_ED B4_ED B3_ED B2_ED B1_ED B0_ED;
    OUTORDER = OVF ED_EQ_0;
    OVF = !SUB & B7_ED & B6_ED & B5_ED & B4_ED & B3_ED & B2_ED & B1_ED & B0_ED;
    ED_EQ_0 = !B7_ED & !B6_ED & !B5_ED & !B4_ED & !B3_ED & !B2_ED & !B1_ED & !B0_ED;

ED is the common exponent after being updated.

Figure B.15 Block Diagram of the Over/Underflow Unit

The overflow can be handled in two ways.
The common way is simply to stop the computation immediately and interrupt the host computer. The second way is to reset the result to the largest representable value and allow the computation to continue without interrupting the host computer.

Underflow happens only when the effective operation is subtraction and the final biased exponent value is less than zero, i.e. when the number of leading zeros is greater than the common exponent value. Intuitively, this happens when the two operands are very small and very close. Another case, when the final exponent is zero but the final mantissa is not equal to zero, is regarded as underflow too. The underflow block is made of a PLA too; its logic equations are stored in the file fa_udf.eqn, listed as follows:

    #define udf ( (!B7_EC & B7_ED & SUB) | (ED_EQ_0 & !MC_EQ_0) )
    INORDER = B7_EC B7_ED SUB ED_EQ_0 MC_EQ_0 RSH;
    OUTORDER = UDF ZOUT;
    UDF = udf;
    ZOUT = udf | (MC_EQ_0 & !RSH);

B7_EC is the 7th bit of EC, the common exponent before being updated. The first minterm of UDF shows the occurrence of a negative final exponent. It happens when the effective operation is subtraction (SUB = 1) and the MSB of the common exponent changes from 0 to 1 in the final exponent. The second minterm represents the case when the final exponent is equal to zero but the final mantissa is not. ZOUT is asserted when underflow occurs or the final mantissa is equal to zero. To avoid the final mantissa 10.000...000 being regarded as a zero mantissa, the signal RSH is used. Whenever ZOUT is asserted, the final exponent and mantissa are both forced to zero.

Appendix B.12 : Detailed Circuit Design of the Zero Checking Unit.

The zero checking unit (fm_exp_eq0.ca) examines the exponent of the operand. If the exponent is zero, the signal EXPI_EQ0 is asserted. As a result, EXP_EQ0 is asserted and then ZOUT too, which will force the value of the final product to zero. The unit is made of a PLA.
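The flag logic of fa_ovf.eqn and fa_udf.eqn from Appendix B.11 can be transcribed into a small executable sketch. The function and argument names are mine; ec and ed stand for the 8-bit common exponent before and after updating.

```python
def adder_flags(sub, ec, ed, mc_eq_0, rsh):
    """Behavioral transcription of the fa_ovf.eqn / fa_udf.eqn logic."""
    ed_eq_0 = ed == 0
    # OVF: final exponent 255 while the effective operation is addition
    ovf = (not sub) and ed == 255
    # UDF: exponent MSB went 0 -> 1 during subtraction (negative result),
    # or final exponent zero with a nonzero mantissa
    udf = bool(((ec >> 7 & 1) == 0 and (ed >> 7 & 1) == 1 and sub)
               or (ed_eq_0 and not mc_eq_0))
    # ZOUT: force result to zero; RSH guards the 10.00...0 mantissa case
    zout = udf or (mc_eq_0 and not rsh)
    return bool(ovf), udf, bool(zout)
```

The underflow example from the text (common exponent 2, left shift amount 3, so the updated exponent wraps to 255 during subtraction) raises UDF rather than OVF here.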
Its logic function is stored in the file fm_exp_eq0.eqn, listed below:

    INORDER = B7_EX B6_EX B5_EX B4_EX B3_EX B2_EX B1_EX B0_EX;
    OUTORDER = EXPI_EQ0;
    EXPI_EQ0 = !B7_EX & !B6_EX & !B5_EX & !B4_EX & !B3_EX & !B2_EX & !B1_EX & !B0_EX;

EX can be the exponent of operand A or B. Notice that whether or not the exponent is zero, the MSB of the 24-bit mantissa is always one. The reason is that if one of the exponents is zero, the result is always forced to zero by ZOUT regardless of the mantissa.

Appendix B.13 : Detailed Circuit Design of the Over/Underflow Unit.

Overflow may occur if two large operands are multiplied together, while underflow may occur if two small operands are multiplied together. From the possible multiplication cases in figure B.16, the conditions for overflow and underflow can be determined. Bi_ER (i = 9, 8) is the ith bit of the biased exponent of the result. The block diagram of the over/underflow unit is shown in figure B.17. The zero checking unit (fm_exp_eq0.ca) is the same as that in stage one. The condition ER = 255 is detected by fm_exp_255.ca, which is made of a PLA. Its logic equation is stored in the file fm_exp_255.eqn, listed below:

    INORDER = B7_ER B6_ER B5_ER B4_ER B3_ER B2_ER B1_ER B0_ER;
    OUTORDER = ER_255;
    ER_255 = B7_ER & B6_ER & B5_ER & B4_ER & B3_ER & B2_ER & B1_ER & B0_ER;

Overflow, underflow and force-to-zero are implemented in fm_ov_ud.ca, which is made of a PLA too. Its logic functions are stored in the file fm_ov_ud.eqn, listed as follows:

    #define udf ( (!B9_ER & B8_ER) | (B9_ER & !B8_ER & ER_EQ_0) )
    INORDER = EXP_EQ0 B9_ER B8_ER ER_255 ER_EQ_0;
    OUTORDER = OVF UDF ZOUT;
    OVF = (B9_ER & B8_ER) | (B9_ER & !B8_ER & ER_255);
    UDF = udf;
    ZOUT = EXP_EQ0 | udf;

The logic equation for OVF is obtained according to cases 2 and 4,

1) REAL_EA = 127, EA = 254
   REAL_EB = 0,   EB = 127
   REAL_ER = 127, ER = EA + EB - 127 = 254 (O.K.)
       1111 1110        (EA = 254)
     + 0111 1111        (EB = 127)
   + 1 1000 0001        (-127, two's complement)
   -------------
   1 0 1111 1110        B9_ER B8_ER = 1 0

The biased exponent of the result is between 1 and 254. No overflow or underflow occurs when B9_ER B8_ER = 1 0 and the ER value is equal to neither 0 nor 255.

2) REAL_EA = 127, EA = 254
   REAL_EB = 2,   EB = 129
   REAL_ER = 129, ER = EA + EB - 127 = 256 (overflow)

       1111 1110        (EA = 254)
     + 1000 0001        (EB = 129)
   + 1 1000 0001        (-127, two's complement)
   -------------
   1 1 0000 0000        B9_ER B8_ER = 1 1

The biased exponent of the result is 256, greater than 254. Overflow occurs when B9_ER B8_ER = 1 1.

3) REAL_EA = -125, EA = 2
   REAL_EB = -126, EB = 1
   REAL_ER = -251, ER = EA + EB - 127 = -124 (underflow)

       0000 0010        (EA = 2)
     + 0000 0001        (EB = 1)
   + 1 1000 0001        (-127, two's complement)
   -------------
   0 1 1000 0100        B9_ER B8_ER = 0 1

The biased exponent of the result is -124, less than 0. Underflow occurs when B9_ER B8_ER = 0 1.

Figure B.16 All Possible Cases for Overflow and Underflow

4) REAL_EA = 127, EA = 254
   REAL_EB = 1,   EB = 128
   REAL_ER = 128, ER = EA + EB - 127 = 255 (overflow)

       1111 1110        (EA = 254)
     + 1000 0000        (EB = 128)
   + 1 1000 0001        (-127, two's complement)
   -------------
   1 0 1111 1111        B9_ER B8_ER = 1 0

The biased exponent of the result is 255 and regarded as infinity. Overflow occurs when B9_ER B8_ER = 1 0 and the ER value is equal to 255.

5) REAL_EA = -1,   EA = 126
   REAL_EB = -126, EB = 1
   REAL_ER = -127, ER = EA + EB - 127 = 0 (underflow)

       0111 1110        (EA = 126)
     + 0000 0001        (EB = 1)
   + 1 1000 0001        (-127, two's complement)
   -------------
   1 0 0000 0000        B9_ER B8_ER = 1 0

The biased exponent of the result is 0. Underflow occurs when B9_ER B8_ER = 1 0 and the ER value is equal to zero.

Figure B.16 All Possible Cases for Overflow and Underflow (continued)

Figure B.17 Block Diagram of the Over/Underflow Unit (fm_ovf_udf.ca)

while UDF is obtained according to cases 3 and 5. ZOUT is asserted when one of the exponents is zero or underflow happens.
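The five cases of figure B.16 can be checked with a small sketch. This is my reading of the additions above (names are mine): the -127 bias enters as the 9-bit two's-complement constant 1 1000 0001 (decimal 385), the sum is kept in a 10-bit field, and B9_ER B8_ER together with the 255/0 detectors classify the result.

```python
def mult_exp_flags(ea, eb):
    """Behavioral sketch of the figure B.16 exponent classification."""
    er = (ea + eb + 0b110000001) & 0x3FF   # ER = EA + EB - 127 in 10 bits
    b9, b8 = er >> 9 & 1, er >> 8 & 1
    er_low = er & 0xFF                     # the 8 bits kept as the result
    ovf = bool((b9 and b8) or (b9 and not b8 and er_low == 255))
    udf = bool((not b9 and b8) or (b9 and not b8 and er_low == 0))
    return er_low, ovf, udf
```

Case 1 (EA = 254, EB = 127) lands in the safe 1 0 band, case 2 (EA = 254, EB = 129) sets B9 B8 = 1 1 and overflows, and case 3 (EA = 2, EB = 1) sets B9 B8 = 0 1 and underflows, as in the figure.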
Appendix B.14 : Logic Equations of U1, U2, U3, U4, and U5 in the 8-bit Multiplier with Modified Booth Algorithm.

U1 : U1_OUT = X(+2) & A(i-1) | X(+1) & A(i) | X(0) & 0
            | X(-1) & !A(i)  | X(-2) & !A(i-1)

U2 : U2_OUT = X(-1) | X(-2)

U3 : U3_OUT = X(+2) & A(N-1) | X(-2) & !A(N-1)

U4 : U4_OUT = A(i) & B(N-1)

U5 : X(+2) = !B(i+1) & B(i) & B(i-1)
     X(+1) = !B(i+1) & !B(i) & B(i-1) | !B(i+1) & B(i) & !B(i-1)
     X(0)  = !B(i+1) & !B(i) & !B(i-1) | B(i+1) & B(i) & B(i-1)
     X(-1) = B(i+1) & !B(i) & B(i-1) | B(i+1) & B(i) & !B(i-1)
     X(-2) = B(i+1) & !B(i) & !B(i-1)

     SIGN_EX(i)   = MINUS(i) xor X(-1)
     SIGN_EX(i+1) = !MINUS(i) & X(-1) | MINUS(i) xor X(-2)
     MINUS(i+1)   = MINUS(i) | X(-1) | X(-2)

X(+2), X(+1), X(0), X(-1), and X(-2), obtained from table 5.5, mean the multiplicand being multiplied by +2, +1, 0, -1, and -2. The sign extension of the partial product can be simplified by using three bits - SIGN_EX(i), SIGN_EX(i+1) and MINUS(i). MINUS(i) is used to indicate whether or not there is a previous subtraction, and so affects SIGN_EX(i) and SIGN_EX(i+1). The example shown in figure B.18 explains how the sign extension bits SIGN_EX(i) and SIGN_EX(i+1) are used. The sign extension bits SIGN_EX(i) and SIGN_EX(i+1) can be obtained by considering the cases shown in figure B.19. The logic equations for SIGN_EX(i) and SIGN_EX(i+1) are:

SIGN_EX(i) = !MINUS(i) & X(-1) | MINUS(i) & ( X(+2) | X(+1) | X(0) | X(-2) )

but  !X(-1) = X(+2) | X(+1) | X(0) | X(-2)

so   SIGN_EX(i) = !MINUS(i) & X(-1) | MINUS(i) & !X(-1)
                = MINUS(i) xor X(-1)

SIGN_EX(i+1) = !MINUS(i) & ( X(-1) | X(-2) )
             | MINUS(i) & ( X(+2) | X(+1) | X(0) | X(-1) )

but  !X(-2) = X(+2) | X(+1) | X(0) | X(-1)

so   SIGN_EX(i+1) = !MINUS(i) & X(-1) | !MINUS(i) & X(-2) | MINUS(i) & !X(-2)
                  = !MINUS(i) & X(-1) | MINUS(i) xor X(-2)

Example :     0110 0011   (63)H =  99
            x 1001 1111   (9F)H = 159
            final product P = 15741

1) Encoding (9F)H : the encoded digit string is +1 -2 +2 0 -1 (most
   significant digit first), and MINUS(0) = 0.

2) P1 (partial product) : 0110 0011 x -1
   with sign extension bits SIGN_EX1 SIGN_EX0 = 1 1 and a 1 generated by U2,
   P1 = 1110011101        MINUS(1) = 1

3) P2 (partial product) : 0110 0011 x 0
   with SIGN_EX3 SIGN_EX2 = 1 1 and a 0 generated by U2,
   P2 = 1100000000        MINUS(2) = 1

4) P3 (partial product) : 0110 0011 x +2
   with SIGN_EX5 SIGN_EX4 = 1 1 and a 0 generated by U2,
   P3 = 1111000110        MINUS(3) = 1

5) P4 (partial product) : 0110 0011 x -2
   with sign extension bits and a 1 generated by U2,
   P4 = 1000111010        MINUS(4) = 1

6) P5 (partial product) : 0110 0011 x +1
   P5 = 01100011

7) Final product  P = P5 x 2^8 + P4 x 2^6 + P3 x 2^4 + P2 x 2^2 + P1 x 2^0
   (carries beyond 16 bits are discarded)

   P1 =       1110011101
   P2 =     1100000000
   P3 =   1111000110
   P4 = 1000111010
 + P5 = 01100011
   ----------------------
   P  = 0011110101111101 = 15741

Figure B.18 Example to Explain How the Sign Extension Bits SIGN_EX(i) and SIGN_EX(i+1) Function

The six cases of figure B.19 add the current partial product P(i+1) to the sum of the previous partial products, whose sign extension depends on MINUS(i):

   case  MINUS(i)  encoded digit           SIGN_EX(i+1) SIGN_EX(i)  MINUS(i+1)
    1       0      X(+2), X(+1) or X(0)         0 0                     0
    2       0      X(-1)                        1 1                     1
    3       0      X(-2)                        1 0                     1
    4       1      X(+2), X(+1) or X(0)         1 1                     1
    5       1      X(-1)                        1 0                     1
    6       1      X(-2)                        0 1                     1

Figure B.19 Consider All Possible Cases to Obtain SIGN_EX(i) and SIGN_EX(i+1)

Appendix B.15 : Detailed Explanation for the Rounding Scheme Used in the 8-bit Multiplier with Modified Booth Algorithm.

For a floating point multiplier, the input multiplicand and multiplier are between 1 and 2. Thus, the final product of an N-bit-by-N-bit multiplier is between 1 and 4. From the possible final products shown in figure B.20, it can be seen that the rounding scheme can be obtained by adding a full adder. It is assumed that the word length is 4, so the final product is 8 bits long. A truncation error occurs only in case 5. If it is assumed that the value of the final product is uniformly distributed in the range from 1 to 4, the probability of the final product lying between 2 and 4 is two thirds. And because the probability that the Nth and (N-1)th bits are "1 0" is one fourth, the probability of a truncation error occurring is 2/3 x 1/4 = 1/6. So a satisfactory round-off result can be obtained by using one more adder without affecting the operation time.

1) 1 =< P (final product) < 2 and assumed P = 01.010001
   a) after round-off, P = 01.01
   b) add 1 to P at the (N-1)th bit position, then truncate the last N bits:
      P = 01.010001 + 00.001000 = 01.011001
      truncate the last 4 bits, P = 01.01

2) 1 =< P (final product) < 2 and assumed P = 01.011001
   a) after round-off, P = 01.10
   b) add 1 to P at the (N-1)th bit position, then truncate the last N bits:
      P = 01.011001 + 00.001000 = 01.100001
      truncate the last 4 bits, P = 01.10

3) 2 =< P (final product) < 4 and assumed P = 11.000001
   a) after one right shift and round-off, P = 01.10
   b) add 1 to P at the (N-1)th bit position, then shift and truncate the
      last N bits:
      P = 11.000001 + 00.001000 = 11.
001001
      after one right shift, P = 01.100100
      then truncate the last 4 bits, P = 01.10

4) 2 =< P (final product) < 4 and assumed P = 11.001001
   a) after one right shift and round-off, P = 01.10
   b) add 1 to P at the (N-1)th bit position, then shift and truncate the
      last N bits:
      P = 11.001001 + 00.001000 = 11.010001
      after one right shift, P = 01.101000
      then truncate the last 4 bits, P = 01.10

5) 2 =< P (final product) < 4 and assumed P = 11.010001
   a) after one right shift and round-off, P = 01.11
   b) add 1 to P at the (N-1)th bit position, then shift and truncate the
      last N bits:
      P = 11.010001 + 00.001000 = 11.011001
      after one right shift, P = 01.101100
      then truncate the last 4 bits, P = 01.10   <-- truncation error

6) 2 =< P (final product) < 4 and assumed P = 11.011001
   a) after one right shift and round-off, P = 01.11
   b) add 1 to P at the (N-1)th bit position, then shift and truncate the
      last N bits:
      P = 11.011001 + 00.001000 = 11.100001
      after one right shift, P = 01.110000
      then truncate the last 4 bits, P = 01.11

Figure B.20 Achieve Rounding Scheme by Adding a Full Adder
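The six cases of figure B.20 condense into a few lines of Python. This is a behavioral sketch with names of my own choosing: the 2N-bit product gets 1 added at bit position N-1, is shifted right once if it reached [2, 4), and then loses its low N bits.

```python
def round_product(p, n=4):
    """Sketch of the Appendix B.15 rounding for a 2n-bit product p."""
    p += 1 << (n - 1)          # the extra full adder
    if p >> (2 * n - 1):       # product in [2,4): one right shift
        p >>= 1                # (the exponent would be incremented)
    return p >> n              # truncate the low n bits

# case 2 of figure B.20: 01.011001 rounds up to 01.10
assert round_product(0b01011001) == 0b0110
# case 5: 11.010001 gives 01.10 instead of 01.11 -- the 1/6 error case
assert round_product(0b11010001) == 0b0110
```

Because the increment happens before the normalizing shift, only the pattern of case 5 loses the round-up, which is the 1/6 probability worked out above.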
Appendix B.16 : Detailed Circuit Designs for the Register Storing the Multiplier B and the Recursive Carry-Save Adder.

Figure B.21 shows the circuit of the register storing B. Each bit of the register contains two shift registers. B(i+1), B(i) and B(i-1) are available at every clock phase. When B is loaded into the register, a "0" at the LSB position and two "0"s at the most significant bit positions are loaded together. This guarantees that the first encoded pair is "B(1) B(0) 0" and the last encoded pair is "0 0 B(N-1)". The value stored in the register is encoded (N/2 + 1) times and therefore (N/2 + 1) partial products are generated. At the last encoding cycle, "0 0 B(N-1)" is always encoded, which causes X(+1) = B(N-1), X(0) = !B(N-1) and X(+2) = X(-1) = X(-2) = 0. This performs exactly the same function as the last row of the carry-save adders in figure 5.11 except that no round-off is used.

Figure B.22 shows the schematic of the data flow in the pipelined recursive carry-save adders. Since two multiplier bits are examined each time, the sum outputs of each carry-save adder must be shifted to the right by two bits, while the carry is shifted to the right by one bit. The functions of U1, U2 and U3 are exactly the same as those in figure 5.9. After pulse 14, the low N bits of the carry set and the low N bits of the sum set of the final product are stored in the right N/2 slices of shift registers.
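The encoding behavior just described — a zero appended below the LSB, two zeros above the MSB, and N/2 + 1 encoding cycles over overlapping triples — can be sketched in Python. This is a behavioral model, not the thesis circuits, and the function names are mine; it reproduces the +1 -2 +2 0 -1 string and the 15741 product of figure B.18.

```python
def booth_digits(b, n=8):
    """Modified-Booth recoding of an n-bit unsigned multiplier b.
    A 0 is appended below the LSB and two 0s above the MSB, giving
    n//2 + 1 digits from the triples B(i+1) B(i) B(i-1)."""
    padded = b << 1                      # the appended LSB zero
    table = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    # digits come out least significant first; digit k has weight 4**k
    return [table[(padded >> i) & 0b111] for i in range(0, n + 2, 2)]

def booth_multiply(a, b, n=8):
    """Accumulate the recoded partial products of a * b."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_digits(b, n)))

assert booth_digits(0x9F) == [-1, 0, 2, -2, 1]   # the +1 -2 +2 0 -1 string
assert booth_multiply(99, 159) == 15741          # (63)H x (9F)H
```

The hardware instead handles the negative digits with SIGN_EX/MINUS sign-extension bits and U2 carries, but the accumulated value is the same.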
There are four shift registers in each slice. The high N bits of the carry set and the high N bits of the sum set are stored in the right N carry-save adders. The outputs of the leftmost carry-save adder are not sent to the carry propagate adder in the next pipeline stage.

Figure B.21 Circuit of the Register Storing Multiplier B

Figure B.22 Schematic of Data Flow in the Pipelined Carry-Save Adders

Appendix C.1 : Detailed Network Description for a 2-bit Adder.

First of all, an exclusive-or and an exclusive-nor are built as macros. Their circuits are shown in figures C.1 and C.2 respectively, and their macro definitions are listed below:

; Macro definition for an Exclusive-OR
; File name : xor.mac
; Default channel-width/channel-length for depletion transistors = 2/8
; Default W/L for enhancement transistors = 2/2
(macro xor (y a a- b b-)
  (and-or-invert y ((a 4 2) (b 4 2)) ((a- 4 2) (b- 4 2)))
) ; end xor

; Macro definition for an Exclusive-NOR
; File name : xnor.mac
(macro xnor (y a a- b b-)
  (and-or-invert y ((a 4 2) (b- 4 2)) ((a- 4 2) (b 4 2)))
) ; end xnor

A macro definition has the general format

(macro name (param1 param2 param3 ...)
  body of the macro
)

The name is followed by a list of parameters - param1, param2, param3, ... - which represent the values to be used when the macro is called later.

A one-bit adder having positive logic carry-in and negative logic carry-out is shown in figure C.3, while a one-bit adder having negative logic carry-in and positive logic carry-out is shown in figure C.4.
Figure C.1 Exclusive-OR

Figure C.2 Exclusive-NOR

Figure C.3 Circuit Diagram of addl_e.mac

Figure C.4 Circuit Diagram of addl_o.mac

Their macro definitions are listed below:

; Macro definition for a 1-bit adder at even position with positive
; logic carry-in and negative logic carry-out
; File name : addl_e.mac
(macro addl_e (cout- r a b cin)
  ; Declaration of the nodes local to the addl_e.mac.
  ; The nodes are only of local importance to the addl_e.mac
  ; and will not be referred to when the addl_e.mac is used later.
  (local a- a-1 b- cin- p p-)
  ; Load the macros xor.mac and xnor.mac, which has the effect of
  ; inserting the macro definitions before the description of the
  ; addl_e.mac.
  (load "xor.mac")
  (load "xnor.mac")
  (invert (a- 2 4) (a 4 2))
  (invert (a-1 2 4) (a 4 2))
  (invert (b- 2 4) (b 4 2))
  (xor p a a- b b-)
  (xnor p- a a- b b-)
  (etrans p- a-1 cout-)
  (invert cin- (cin 4 2))
  (etrans p cin- cout-)
  (xor r cin cin- p p-)
) ; end addl_e

; Macro definition for a 1-bit adder at odd position with negative
; logic carry-in
; File name : addl_o.mac
(macro addl_o (cout r a b cin-)
  (local a- a1 b- cin p p-)
  (load "xor.mac")
  (load "xnor.mac")
  (invert (a- 2 4) (a 4 2))
  (invert (a1 2 4) (a- 4 2))
  (invert (b- 2 4) (b 4 2))
  (xor p a a- b b-)
  (xnor p- a a- b b-)
  (etrans p- a1 cout)
  (invert cin (cin- 4 2))
  (etrans p cin cout)
  (xor r cin cin- p p-)
) ; end addl_o

A 2-bit adder macro composed of the addl_e.mac and addl_o.mac is shown in figure C.5 and its macro definition is listed below:

Figure C.5 Circuit Diagram of add2.mac
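The alternating-polarity carry chain of the two macros can be checked with a small behavioral model in Python (my own names; a boolean abstraction of the transistor netlists, not the netlists themselves). The even slice emits its carry inverted, and the odd slice expects an inverted carry, so no inverter is needed between slices.

```python
def addl_e(a, b, cin):
    """Model of addl_e.mac: positive-logic carry in, negative-logic out."""
    p = a ^ b                       # propagate
    cout = (a & b) | (p & cin)      # true carry
    return p ^ cin, cout ^ 1        # sum, complemented carry (cout-)

def addl_o(a, b, cin_n):
    """Model of addl_o.mac: negative-logic carry in, positive-logic out."""
    cin = cin_n ^ 1
    p = a ^ b
    return p ^ cin, (a & b) | (p & cin)

def add2(a1, a0, b1, b0, c_in):
    """add2.mac: the even slice's cout- feeds the odd slice directly."""
    r0, c0_n = addl_e(a0, b0, c_in)
    r1, c1 = addl_o(a1, b1, c0_n)
    return c1, r1, r0

# exhaustive check against ordinary addition
for a in range(4):
    for b in range(4):
        for c in range(2):
            c1, r1, r0 = add2(a >> 1, a & 1, b >> 1, b & 1, c)
            assert (c1 << 2) | (r1 << 1) | r0 == a + b + c
```

The exhaustive loop plays the same role as the RNL stimulus file of Appendix C.2, which steps through all input combinations.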
& 4 HE 4 > o 4E 4 HE 4 H ) o - H 5 H ) ° " 4>> c - i cO cO cl >— { * > HI J L rO r l Figure C.5 Circuit Diagram of add2.mac 302 ; Macro definition for a 2-bit adder ; File name : add2.mac (macro add2 ( cl rl rO al bl aO bO c-1) (local cO) (load Maddl_e.mac") (load "addljo.mac") (addl e cO rO aO bO c-1) (addl o cl rl al bl cO) ) ; end add2 By using the above 2-bit adder macro, add2.mac, an adder with any even number of bits can be constructed easily. The macro of an adder with any even number of bits is shown firs t and then an example of a 2-b1t adder with each output having capacitance load 0.03 pF is descri bed. ; Macro definition for a 2n-b1t adder ; File name : adder.mac (macro adder (n cout r a b cin) (local c) (load "add2.mac") ; Repeat macro add2.mac from 1=0 to 1=(n-l). (repeat 1 0 (1- n) {add2 c.(1+ (* 2 i)) r.(l+ (* 2 i )) r.(* 2 i) a.(1+ (* 2 i)) b.( 1+ (* 2 i) ) a.(* 2 i) b.(* 2 i) c.(l- (* 2 i)) ) ) (connect cin c .-l) (connect cout c .(l- (*2 n))) ) ; end adder ; Circuit description for a 2-bit adder ; File name : add2.net (node n cout r a b cin) ; Set n equal to 1 for a 2-bit adder (setq n 1) (load "adder.mac") (adder n cout r a b cin) ; Each output node has capacitance load 0.03 pF. (capacitance cout 0.03) 303 (repeat i 0 (1- n) (capacitance r,{1+ (+2 i)) 0.03) ^ (capacitance r.(*2 i) 0.03) ; end add2 304 Appendix C.2 : Command File for RNL Simulation for the 2-bit Adder Described in Appendix C .l. ; Command f ile "add2.cmd" for RNL simulation (load ""cad/1ib/rnl/uwstd.l'') (load ""cad/lib/rnl/uwsim.l") ( read-network "add2.rnl") (setq incr 100) (setq all nodes '(cin cout r, 1 r.O a.l b.l a.O b.O)) (chflag aTl nodes) (defvec '(bTn A a .l a.O)) (defvec ' (bln B b .1 b.O)) (defvec '(bln add2out cout r . 
1 r.O)) (def-report '("CURRENT STATE (vec A) (vec B) cin newline (vec add2out))) (1 ' a .l)) (1 ' b .l)) (1 ' a.O b.O c in )) (s ' )) (h ' cin)) ( s ' )) (1 ' cin)) (h ' b.O)) (s ' )) ( h ' cin)) (s ' )) (1 ' cin b.O)) (h ' a.O)) ( s ' )) (h ' cin)) (s ' )) (1 1 cin)) (h ' b.O)) (s ' )) (h ' cin)) (s ' )) (1 ' a .D ) (h ' b .l)) (1 ' a.O b.O cin)) (s ' )) (b ' cin)) ( s ' )) (1 ' c1 n)) (b 1 b.O)) (s ' )) (h ' cin)) (s ' )) 305 (1 (cin b.O)) (h (a.O ) (s 0 ) (h (ci n] ) (s 0 ) (1 (c in ’ ) (h (b.O ) (s 0 ) ( h (cin ) (s 0 ) (b (a .l ) (1 (b .l ) (1 (a.O b.O cin)) (s 0 ) ( h (cin ) (s 0 ) (1 (ci n ) (b (b.O ) (s 0 ) (h (cin ) (s 0 ) (1 (cin b.O)) (h (a.O ) (s 0 ) (h (cin ) ( s 0 ) (1 (ci n’ ) (b (b.O ) (s 0 ) (b ( c 1 n ) (s 0 ) * i MM M M M M M M (b (a .l ) (h (b .l ) (1 (a.O b.O cin)) {s 0 ) (h (cin ) (s 0 ) (1 (cin] ) (b (b.O ) ( s 0 ) (b (cin' ) (s 0 ) (1 (cin b.O)) (b (a.O ) (s 0 ) (b (cin ) (s 0 ) (1 (cin ) (b (b.O ) 3 0 6 (s '{)) (exi t ) Appendix C.3 : Input and Output Signals Specified for the 2-bit Adder in the SPICE Simulation. *********** input signals ************************* VDD VEM VDM Vein Vb.O Va.O Vb.l Va.l * * * * * * * * * * * output signals * * * * * * * * * * * * * * * * * * * * * * * * * .PLOT TRANS V( .TRANS 2NS 640NS .OPTIONS LIMITS*400 .WIDTH OUT=72 .END 308 Appendix C.4 : Model Parameters of the Simulated Devices in the SPICE Simulation. .MODEL ENMOS NMOS LEVEL-1 LD-0.211698U TOX-635.000E-10 +NSUB-3.779887E+15 VTO-1.13877 KP-4.145038E-05 +GAMMA-0.494661 PHI-0.600000 1)0-300.000 +VMAX-100000. XJ-5.27683U LAMBDA-2.385822E-02 +NFS-2.356687E+12 NSS-O.OOOOOOE+OO TPG-1.00000 +RSH-25.4 CGS0-1.6E-10 CGD0-1.6E-10 CGR0-1.7E-10 +CJ- 1. IE-4 MJ-0.5 CJSW-5E-10 MJSW-0.33 .MODEL DNMOS NMOS LEVEL-1 LD-0.348540U TOX-635.000E-10 +NSUB-1.000000E+16 VT0--3.83489 KP-3.639582E-05 +GAMMA-0.314330 PHI-0.600000 UO-900.000 +VMAX-477999. 
XJ-0.439338U LAMBDA-1.OOOOOE-06 +NFS-4.3 10000E+12 NSS-O.OOOOOOE+OO TPG-1.00000 +RSH-25.4 CGS0-1.6E-10 CGD0-1.6E-10 CGB0-1.7E-10 +CJ-1. IE-4 MJ-0.5 CJSW-5E-10 MJSW-0.33