<<

UNIVERSITY OF CALIFORNIA

Los Angeles

Variable Long-Precision Arithmetic (VLPA)

for Reconfigurable Coprocessor Architectures

A dissertation submitted in partial satisfaction of the

requirements for the degree Doctor of Philosophy

in Computer Science

by

Alexandre Ferreira Tenca

© Copyright by

Alexandre Ferreira Tenca

The dissertation of Alexandre Ferreira Tenca is approved

Prof Dr Willian Newman

Prof Dr David Rennels

Prof Dr Jason Cong

Prof Dr Milos D Ercegovac Committee Chair

University of California Los Angeles

ABSTRACT OF THE DISSERTATION

Variable Long-Precision Arithmetic (VLPA)

for Reconfigurable Coprocessor Architectures

by

Alexandre Ferreira Tenca

Doctor of Philosophy in Computer Science

University of California, Los Angeles

Professor Milos D. Ercegovac, Chair

This is the abstract

Contents

Introduction
The need for VLPA
Alternative Arithmetic Systems
Languages and Libraries for VLPA
Existing Coprocessors for Long Precision Computations
Chow's VP Processor
CADAC Controlled Precision Arithmetic Unit
Coprocessor for Pascal-XSC
Interval Arithmetic Coprocessor
JANUS
VLP Coprocessor for the TM
VLP Computation and RCArs
Research Objectives
Dissertation Outline
Reconfigurable Coprocessor Architecture
Reconfigurable Coprocessor Model
FPGA Architecture
FPGA Array
Software and Hardware
Variable Long-precision Arithmetic (VLPA) Algorithms
VLP Algorithms used in Software
Notation and Conventions
Software Algorithms for VLP
Software Algorithms for VLP Multiplication
Software Algorithms for VLP Division
Software Algorithms for VLP Square-Root
Hardware Implementation of VLP Algorithms
Multiplication in VLP Coprocessors
Division in VLP Coprocessors
Square-root in VLP Coprocessors
On-line Algorithms for VLP Computations
General Concepts and Scheduling Strategies
Summary

VLP Multiplier
The VLP Multiplication Algorithm
Path for VLP Multiplication
Data Arrangement for Serial Computation
Serial Computation of the Residual
Pipelined Data Path
VLP Multiplication with Precision less than m
Truncation Point to Satisfy Output Precision
VLP Multiplication Algorithm for Truncated Results
Gain in Performance
Operands with Different Precision
Execution Time of the VLP Multiplier
VLP Divider
VLP Division Algorithm
Selection Function
Scaling Factor M
Computation of the Scaling Factor at the Host
Prescaling of Operands
On-line Prescaler
Selection Circuit
Reducing the Number of Cycles
Pipelined Operation
Execution Time
VLP Square Root
VLP Square-Root Algorithm
Convergence Conditions for Output Selection
Selection Function with Compensated Residual
Selection Circuit
Performance Evaluation
Optimization of the Number of Cycles
Execution Time

Implementation Aspects and Host Tasks for VLP Operation
Digit Code Conversion
CS to BS Converter
BS to NR Converter
Tasks Performed at the Host
VLP Number Format
On-the-Fly Conversion
Digit Expansion and Compression
VLP Floating Point Operations
VLP Circuit Design for FPGAs
Important Design Aspects
FPGA Time Parameters
Pipeline Degree
Digit Representation
Design of Arithmetic Operators for FPGAs
Addition
Multiplication
Summary of Results
Recoder Circuit for the VLP Multiplier
VLP Data Path Area Estimates
Delay of Selection Functions
Delay of VLP Division Selection
Delay of VLP Square Root Selection
Performance Evaluation
Coprocessor Reconfiguration
Coprocessor Model
Model Parameters
Measurements
Performance Estimate
Coprocessor Area-Time
Model Simulation
Conclusion and Future Research
Research Contributions
Future Research
A Timing Characteristics of the XC FPGAs
B Test Program for LP Operations using GMP version
B Test Program for LP Integer Multiplication
B Test Program for Floating Point Operations
C Digit radix transformation using BS Code

List of Tables

Performance measurement of interval arithmetic on VPI with and without the arithmetic coprocessor VPIAC [SSJa]
Cycle counts for various operations in the VPIAC [SSJa]
Recurrence equations for on-line arithmetic
Number of stages in the digit by vector multiplier
Example of high-radix on-line division using prescaled operands
Example of VLP Square Root in radix
Truth table for the FS function
FPGA Timing Parameters (adapted from Xilinx data book)
Area-delay of digit-parallel adders in the XC
Array Multiplier with Booth Recoding (CS output)
Extra area for the Linear-Array Multiplier
LSA Multiplication radix
Area and time estimates for addition and multiplication operators using input LUT FPGAs
Area of the VLP division selection function (digits in radix)
Area of the VLP square root selection function (digits in radix)
Data Path area for digits in radix (pipelined)
Area (CLBs) of the VLP data path for some values of n
Maximum number of digits read simultaneously in each iteration
Maximum area required to implement the VLP algorithms in FPGAs
Number of cycles for long-precision operations in GMP (C_prog)
Significand and exponent manipulation time in GMP
Other tasks performed by the host during VLP operations
Coprocessor parameters
Parameters for the system
Host-coprocessor operation: Integer Multiplication
Host-coprocessor operation: VLP FP multiplication
Host-coprocessor operation: VLP FP division, prescaling at the host
Host-coprocessor operation: VLP FP division, prescaling at the coprocessor
Host-coprocessor operation: VLP FP square root
Variation in the Speedup with the Host speed

List of Figures

Solutions given to Very Long Precision Computation
Coprocessor model
Coprocessor Organization for VLP operation
Model for a Lookup Table based configurable cell
FPGA structure
Structure of a CLB in the Xilinx XC series
Linear Array of FPGAs
Software-Hardware interface for LP addition
Flowchart of Newton-Raphson algorithm for division
Flowchart of square-root computation using Newton-Raphson method
Multiplication method in the VPIAC
Spatial representation of on-line recurrence equation computation
Multiple-precision computation of the recurrence equation
Digit-slices for VLP multiplication
On-line computation of the recurrence equation for multiplication
Data path for VLP multiplication
Digit by vector multiplier in on-line mode
One layer of the reduction structure
Data path delays
Data vectors for VLP multiplication
Pipelined Data Path
Truncated multiplication result
Variable output precision operation of the VLP multiplier
Speedup of VLP multiplier with variable output precision over full precision multiplication
Digit vector for VLP Division
Data path for VLP on-line division
Divisor bounds
On-line prescaler
Prescaling using the data path of the VLP division
Selection circuit for VLP division
Data path for VLP on-line square-root operation
Digit vector for VLP Square Root
Data path delays
Circuit used for output selection in VLP Square Root
Timing of selection function and data path
CS-BS converter
BS-NR converter
Formats of VLP: (a) multiple digit and (b) multiple term
Conventional radix serial Adder
Basic on-line adder structure
Radix on-line adder
Array Multiplier using CSAs
Array Multiplier using Booth recoding and CSAs
LSA multiplier radix
Radix LSA module
Radix On-Line Recoder
More detailed coprocessor model
Block diagram of the circuits inside the FPGA
Number of cycles for Long-precision Operations
Model for host-coprocessor operation
Speedup obtained with Host-coprocessor over Host alone
Qualitative behavior of the speedup
Proportion of time used by the coprocessor

ACKNOWLEDGEMENTS

Thanks to

VITA

B.S., Electrical Engineering

University of São Paulo, São Paulo, Brazil

M.S., Electrical Engineering

University of São Paulo, São Paulo, Brazil

M.S., Computer Science

University of California, Los Angeles

PUBLICATIONS

Alexandre F. Tenca and Milos D. Ercegovac, "A High-Radix Multiplier Design for Variable Long-Precision Computations," Asilomar Conference on Signals, Systems and Computers, Nov.

Alexandre F. Tenca and Milos D. Ercegovac, "Synchronous Up/Down Binary Counter for LUT FPGAs with Counting Frequency Independent of Counter Size," FPGA, ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Feb., Monterey.

Alexandre F. Tenca and Milos D. Ercegovac, "High-radix Digit-slices for On-line Computations," Proceedings of SPIE Conference on High Speed Computing, Digital Signal Processing, and Filtering using Reconfigurable Logic.

Chapter

Introduction

Programmable logic has provided a new hardware design space that allows
the exploration of new techniques to obtain more performance. The most
important programmable device is the Field-Programmable Gate Array (FPGA),
which combines the regularity of gate arrays with the programmability of
random access memories. FPGAs have evolved in different directions, dictated
by the final application of the devices. One of these applications is
reconfigurable computing. The name suggests that a task, or part of a task,
performed in a general-purpose computer can be downloaded into the
programmable devices and executed at hardware speed, which may be many times
faster than the execution of the corresponding software in the computer.

A Reconfigurable Coprocessor Architecture (RCAr) is a digital system that
combines FPGAs, fixed logic, and memory. It is attached to the main processor
and is able to perform special tasks faster than the processor.

The organization of the coprocessor may also be tailored to the particular
application. In this thesis we study and propose an RCAr that supports the
efficient implementation of circuits for variable long-precision arithmetic
(VLPA). VLPA is used to improve the accuracy of computer calculations, which
is vital in many areas such as computational geometry, the N-body problem,
and cryptography.

A few research works have produced hardware solutions for the VLPA problem.
To date, only one work has proposed a hardware structure for VLPA in a
reconfigurable architecture. In this dissertation we present research on the
extension of on-line arithmetic concepts to the design of hardware algorithms
for VLPA that are also suitable for implementation using RCArs.

PhD Dissertation Chapter DRAFT February

The need for VLPA

Throughout this work the hardware precision, sometimes called the width
of the hardware data path, is represented by n. The precision required in an
operation is represented by m.

VLPA encompasses solutions for long-precision computation that can go beyond
the limits of the available hardware, that is, m > n. The objective of the
research is to propose arithmetic algorithms that allow scalability by
reusing the available hardware resources until the desired precision is
attained.

IEEE Std 754 [Coo], for example, has floating-point (FP) formats with few
precision alternatives for the exponent and significand (mantissa). Computer
systems that conform to this standard often have data paths only a fixed
number of bits wide. This number of bits is not needed in many applications,
and yet it is insufficient for other applications, in which the computation
can produce completely inaccurate results without warning. Examples of this
problem are presented in the literature [Lyn, Sch]. Let us reproduce one as
an illustration. Consider the dot-product computation of two vectors A and
B^T using IEEE double-precision arithmetic and exact arithmetic:

A:     (superscripts 18, 27, 25, 5)
B^T:   (superscripts 38, 29, 22, 42)
IEEE:  A x B^T
exact: A x B^T

The result of the computation using IEEE arithmetic has a large error
compared to the exact result.
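
The same kind of failure can be reproduced with a much smaller case. The sketch below is only an illustration: the vectors are chosen here and are not the ones of the example above, the function names dot_float and dot_exact are hypothetical, and Python's standard fractions module serves as the exact reference.

```python
from fractions import Fraction

def dot_float(a, b):
    """Dot product accumulated left to right in IEEE double precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y          # each product and each sum is rounded
    return acc

def dot_exact(a, b):
    """The same dot product evaluated in exact rational arithmetic."""
    total = Fraction(0)
    for x, y in zip(a, b):
        total += Fraction(x) * Fraction(y)   # no rounding anywhere
    return total

# Illustrative vectors (an assumption of this sketch).
A = [1e20, 1.0, -1e20]
B = [1.0, 1.0, 1.0]

ieee_result = dot_float(A, B)    # the term 1.0 is absorbed by 1e20 and lost
exact_result = dot_exact(A, B)   # the exact dot product is 1
```

The double-precision accumulation loses the middle term completely, and no exception or warning is raised.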

VLP Arithmetic for Reconfigurable Coprocessor Architectures

Real numbers cannot always be represented precisely by floating-point
numbers; the exact value usually lies between two consecutive FP numbers. The
mapping from a real number to a machine-representable floating-point number
is done by rounding. This process creates errors that propagate through the
computation and can make the results of some calculations meaningless. The
use of FP numbers leads to two main problems, catastrophic cancellation and
roundoff errors, both caused by the discrete nature of the FP representation.
Roundoff error occurs because the significand of an FP number has a limited
number of bits. Thus a real number must be rounded in order to be represented
with the limited-precision significand. Catastrophic cancellation results
from the subtraction of FP numbers with close values, which yields a value
with fewer significant bits than the initial operands.
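
A minimal sketch of both effects follows, assuming IEEE double precision as provided by Python floats. The operand 1.0000001 is an arbitrary choice made for this illustration, and exact rationals from the standard fractions module are used to measure the errors.

```python
from fractions import Fraction

# Roundoff: the real number 1.0000001 has no exact binary representation,
# so the stored double differs slightly from it.
exact_a = Fraction(10000001, 10000000)
a = 1.0000001
rel_err_a = abs(Fraction(a) - exact_a) / exact_a

# Cancellation: subtracting the nearby value 1.0 is itself exact in FP,
# but it removes the leading digits and exposes the rounding error of a.
diff = a - 1.0
exact_diff = exact_a - 1
rel_err_diff = abs(Fraction(diff) - exact_diff) / exact_diff

# Factor by which the relative error grew through the subtraction.
amplification = rel_err_diff / rel_err_a
```

The absolute error is unchanged by the subtraction, but relative to the much smaller result it is about ten million times larger, which is what the loss of significant bits means in practice.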

One of the goals of numerical analysis is to determine the accuracy of
numerical methods and to evaluate analytically whether the result of a
computational algorithm can be trusted. This is the so-called problem of
computer credibility [Knu, Lyn]. Accuracy cannot be obtained in some cases if
there is not enough precision in the arithmetic operations and operands. A
statement found in many papers in the area is that if more accuracy is
needed, more bits must be employed.

Clearly, computer systems should be able to work with variable precision in
order to have efficient implementations. The term is used in the sense that
the user or software should be able to adjust the precision of an arithmetic
operation, ideally constrained only by the available memory in the system.

To solve the problem of computer accuracy, researchers have proposed
alternative number systems and arithmetic concepts [AH, Vui, MM, Neu, Moo]
(see the section on alternative arithmetic systems). These approaches are
implemented in the digital computer in the form of software libraries,
programming languages, or coprocessors. Various implementations already
available to deal with long-precision computations are presented in the
sections on software languages and libraries and on existing coprocessors.
The classification of the available solutions for this problem is presented
in the figure below. VLP applications are built using high-level languages
and libraries that provide data types and operations that support
variable-precision calculations. Coprocessors are used by these software
modules to speed up the most time-consuming operations.

[Figure: classification tree rooted at "VLP Application", with the labels
High-level Languages (Pascal-XSC, C++ XSC, Acrith-XSC), Libraries (GMP,
BigNum), Software, Coprocessors, Fixed, Programmable, ASIC, ASIC + FPGA,
Programmable Devices, and Variable Structures of FPGA]

Figure: Solutions given to Very Long Precision Computation

Exact arithmetic based on arithmetic libraries is slow. These libraries
define long-precision floating-point numbers in two different ways: as a
multiple-digit format or as a multiple-term format. In the multiple-digit
format, the number is represented as a sequence of digits that form the
significand, together with a signed exponent. The multiple-term format
considers the long-precision number as a collection of ordinary
floating-point numbers, each one with its own significand and exponent. There
are advantages and disadvantages in each scheme, but the multiple-digit
format is the one that can represent most numbers more compactly, since only
one exponent is stored.
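
The two formats can be sketched as data structures. The class names MultipleDigit and MultipleTerm, the field layout, and the radix 2^16 are assumptions made for this illustration only; they are not the formats of any particular library.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List

@dataclass
class MultipleDigit:
    """One significand made of high-radix digits plus a single exponent."""
    sign: int                 # +1 or -1
    digits: List[int]         # most significant digit first
    exponent: int             # radix exponent of the first digit
    radix: int = 2 ** 16      # hypothetical digit radix

    def value(self) -> Fraction:
        total = Fraction(0)
        for i, d in enumerate(self.digits):
            total += d * Fraction(self.radix) ** (self.exponent - i)
        return self.sign * total

@dataclass
class MultipleTerm:
    """A collection of ordinary FP numbers, each with its own exponent."""
    terms: List[float]

    def value(self) -> Fraction:
        total = Fraction(0)
        for t in self.terms:
            total += Fraction(t)   # exact value of each FP term
        return total

# The same number, 1 + 2**-80, in both formats.
md = MultipleDigit(sign=1, digits=[1, 0, 0, 0, 0, 1], exponent=0)
mt = MultipleTerm(terms=[1.0, 2.0 ** -80])
```

In the multiple-digit form every digit position is stored and a single exponent is shared, while the multiple-term form stores one exponent per term.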

The next sections present commonly used solutions for the accuracy problem
in computer systems, focusing on the long-precision support that each one
requires.

Alternative Arithmetic Systems

Some of the arithmetic systems proposed to solve the accuracy problem are:

Interval arithmetic: considers each number as a pair of FP numbers, the upper
and lower bounds for the exact value, called an interval. This concept,
combined with control over the precision, allows the user to keep the
calculations inside reasonable error bounds. Interval arithmetic is discussed
in [AH, KM, KM, Ral, Neu, Moo]. All operations are defined over intervals,
and the results are guaranteed to be inside the resulting interval. The
advantage of this system is that it allows the automatic verification of
computed results. A disadvantage is the need to perform the same operation
many times in order to obtain the two FP values of the result interval.
Interval division, for example, requires several FP divisions and some
comparisons to obtain the result interval. By testing the sign of the
interval end points, it is possible to obtain the interval with fewer FP
operations in most cases. Another disadvantage is that interval arithmetic by
itself does not provide mechanisms for variable-precision computation;
rather, it is a tool that allows the user to verify at run time whether the
present precision is acceptable. If the precision is not acceptable, the user
can increase it using variable-precision arithmetic and redo the
calculations. There is also the problem that intervals can grow too fast, so
that the results are too pessimistic.

Continued fractions: real numbers are exactly represented as fractions of
integer values, i.e., as rational numbers. The precision of the real number
depends on the precision of the integers used in the representation. The
problem with this concept is the high complexity of the basic arithmetic
algorithms [Vui]. The advantage is the ability to represent real numbers
without requiring rounding.

Staggered arithmetic: considers long-precision (LP) numbers as a set of
non-overlapping FP numbers. The resulting number has at least as many
significant bits as all significand bits of the FP terms put together. The
operations are executed over the set of FP numbers that represent the LP
number and generate another LP number. This system takes advantage of the
high efficiency of FP arithmetic units [Pri]. Numbers that have large groups
of zeroes are represented by few FP numbers, covering only the nonzero
significant digits that are far apart, therefore reducing the storage
requirement and computation time. The disadvantage of this approach is the
complexity of the algorithms for even the basic arithmetic operations. An
example can be found in [Pri] for the addition of two digits represented as
FP numbers a and b. The complete addition algorithm involves operations over
multiple digits. The digit addition is performed by the following algorithm,
which satisfies a + b = c + d, where d = 0 or (c, d) is a valid expansion;
c and d represent the sum values without overlapping. The word fl represents
a floating-point operation. For each digit, seven floating-point operations
are executed.

procedure sum_term(a, b)
begin
    if |a| < |b| then
        swap(a, b)
    c := fl(a + b); e := fl(c - a)
    g := fl(c - e); h := fl(g - a); f := fl(b - h)
    d := fl(f - e)
    if fl(d + e) ≠ f then
        c := a; d := b
    return (c, d)
end procedure

A numeric example follows for digits of precision p base a
and b The sequence of operations gives the following values c
e g h f and d Thus the floating-point value a + b is represented in
precision p by the pair
Note that fl(a + b) = c has an error or
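
The digit addition procedure can be tried out directly. The following is a sketch only: it assumes IEEE double precision with round-to-nearest (Python floats), and the operators inside each fl(...) are an interpretation of the procedure above. Exact rationals are used to confirm that no information is lost.

```python
from fractions import Fraction

def sum_term(a: float, b: float):
    """Return (c, d) with c = fl(a + b) and c + d = a + b exactly."""
    if abs(a) < abs(b):
        a, b = b, a               # make a the operand of larger magnitude
    c = a + b                     # rounded sum
    e = c - a                     # part of b that was absorbed into c
    g = c - e                     # part of a that was absorbed into c
    h = g - a                     # zero when a was absorbed exactly
    f = b - h
    d = f - e                     # rounding error of the sum
    if d + e != f:                # safeguard for non-ideal arithmetic
        c, d = a, b
    return c, d

def is_exact(a: float, b: float) -> bool:
    """Check c + d == a + b in exact rational arithmetic."""
    c, d = sum_term(a, b)
    return Fraction(c) + Fraction(d) == Fraction(a) + Fraction(b)

c, d = sum_term(0.1, 0.2)         # d holds what the rounded sum c lost
```

Each call performs the seven floating-point operations mentioned in the text: six to form c and d, and one for the final consistency test.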


Software Languages and Libraries for VLPA

Variable-precision libraries and languages provide an abstraction level that
allows the user to use data types and operations for long- and
variable-precision numbers. They extend the power of the existing hardware,
creating an environment for larger precision that makes use of the fixed
precision available in the processor. Some examples of software language
extensions and libraries for long-precision computation are:

C-XSC (C++ language) [Wie].

Numerical Recipes book [P+]: provides several routines for variable-precision
computation on large-precision integer operands.

GMP (GNU Multiple Precision Library) [Tor]: a multiprecision package
available as public-domain software. A manual of the available routines can
be found with the software.

Maple, Mathematica, and Matlab: these are general computer algebra systems
that provide an interactive and easy-to-program environment. These packages
allow the manipulation of numbers with variable precision.

BigNum [Zim]: a portable LeLisp package for arbitrary-precision integer
arithmetic.

Long-precision numbers are represented as arrays of machine words. A word
value is considered as a high-radix digit. Therefore, operations performed by
the processor become digit operations. Algorithms that deal with digit
operations are used to manipulate the long-precision numbers.
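
As an illustration of this representation, the sketch below adds two long-precision integers stored as little-endian arrays of words. The 32-bit word size and the helper names to_words, from_words, and add_words are assumptions of this example and are not taken from any of the libraries above.

```python
WORD_BITS = 32                    # assumed machine word size
RADIX = 1 << WORD_BITS            # each word is one digit in radix 2**32

def to_words(value: int, n_words: int):
    """Split a non-negative integer into n_words digits, least significant first."""
    return [(value >> (WORD_BITS * i)) & (RADIX - 1) for i in range(n_words)]

def from_words(words, carry: int = 0) -> int:
    """Rebuild the integer from its digits and an optional carry-out."""
    total = carry << (WORD_BITS * len(words))
    for i, w in enumerate(words):
        total += w << (WORD_BITS * i)
    return total

def add_words(x, y):
    """Digit-serial addition: one word-sized addition per digit position."""
    result, carry = [], 0
    for xd, yd in zip(x, y):
        s = xd + yd + carry        # a single digit operation
        result.append(s & (RADIX - 1))
        carry = s >> WORD_BITS     # carry into the next digit
    return result, carry

a = 2 ** 100 + 12345
b = 2 ** 96 - 1
sum_words, carry_out = add_words(to_words(a, 4), to_words(b, 4))
```

The loop body is the digit operation executed by the processor; the number of iterations, and not the word size, determines the precision.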

The flexibility required in a good software implementation, together with
the portability needs, leads to a large time overhead that could be avoided
in a hardware approach. Thus it is usually the case that an algorithm
implemented in hardware will perform better than the same algorithm
implemented in software.

The performance is strongly dependent on the selection of the algorithm for
the desired precision. It may happen that an algorithm is very efficient for
high-precision numbers (hundreds of digits) and performs poorly for
low-precision cases. Because of that, libraries select the algorithm to be
used based on the operands' precision.
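
A minimal sketch of such a selection is shown below for multiplication of non-negative integers. It is not the code of any particular library: the schoolbook and Karatsuba methods stand in for a low-precision and a high-precision algorithm, and the 8-word cutoff (KARATSUBA_THRESHOLD_WORDS) is a hypothetical value that a real library would tune for each processor.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1
KARATSUBA_THRESHOLD_WORDS = 8     # hypothetical cutoff

def word_count(x: int) -> int:
    """Number of 32-bit words needed to hold x."""
    return max(1, (x.bit_length() + WORD_BITS - 1) // WORD_BITS)

def schoolbook_mul(x: int, y: int) -> int:
    """Quadratic method: one digit-by-vector product per word of y."""
    result, shift = 0, 0
    while y:
        result += (x * (y & MASK)) << shift
        y >>= WORD_BITS
        shift += WORD_BITS
    return result

def karatsuba_mul(x: int, y: int) -> int:
    """Three half-size products instead of four; pays off for long operands."""
    n = max(word_count(x), word_count(y))
    if n < KARATSUBA_THRESHOLD_WORDS:
        return schoolbook_mul(x, y)
    half = (n // 2) * WORD_BITS
    low_mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & low_mask
    y_hi, y_lo = y >> half, y & low_mask
    hi = karatsuba_mul(x_hi, y_hi)
    lo = karatsuba_mul(x_lo, y_lo)
    mid = karatsuba_mul(x_hi + x_lo, y_hi + y_lo) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

def multiply(x: int, y: int) -> int:
    """Select the algorithm from the operand precision."""
    if max(word_count(x), word_count(y)) < KARATSUBA_THRESHOLD_WORDS:
        return schoolbook_mul(x, y)
    return karatsuba_mul(x, y)
```

For short operands the bookkeeping of the recursive method outweighs its savings, which is why the selection is made on the operand length.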

In general, long-precision libraries have routines that allow the user to
work on different number types (integers, floating point, rational numbers,
interval arithmetic numbers) with variable precision. Operations involving
long-precision integers are the fastest and the most optimized ones. For
efficiency, the inner loop of LP operations may be written in assembly
language for each different type of processor. Other data types are
manipulated using the efficient long-precision integer routines.

Existing Coprocessors for Long Precision Computations

There are few coprocessors presented in the literature for long- and
variable-precision computations. Specific machines also exist to deal with
long-precision numbers for particular applications, such as cryptosystems
and dot-product computation. Many FPUs use extended precision internally. We
are more interested in coprocessors that were designed for more general
application in the VLP arena. Some of the coprocessors primarily used to
increase the performance of arithmetic algorithms are discussed in this
section.

Chow's VP Processor

In her thesis work, Catherine Y. Chow [Cho, Cho] presented the design of a
general-purpose variable-precision processor for floating-point operations.
The main aspects of her design are:

The architecture is generalized for any digit set, radix, and precision,
allowing the number of digit slices to be adjusted to the available hardware.
This approach gives a high degree of flexibility, and the units (VP modules)
can be combined to operate in different modes, from totally parallel to
totally serial, depending on the number of modules used. The operations are
executed most-significant digit first.

The design uses digit-recurrence algorithms, assuming that one of the
operands is available in parallel form.

Instructions are defined at the VP module to support shifting, rotation,
initialization, communication between VP modules, etc. Each VP module
contains a fixed number of digit slices. Many VP modules can be combined to
increase the hardware parallelism. The available hardware can be used
repeatedly to obtain the desired precision.

An implementation of this design generated a coprocessor named Cascade [Car].
This coprocessor is built from digit slices of a fixed radix. There is not
much data on the performance of the processor.

CADAC Controlled Precision Decimal Arithmetic Unit

This project was developed at the University of Toronto [CHH, HCH]. CADAC
was designed to work with the decimal number system (BCD representation) in
floating-point format. The arithmetic unit works on decimal digits. The
approach avoids the I/O errors that occur when converting from decimal to
binary. For example, a decimal fraction with a finite expansion may not be
exactly representable in binary. The use of decimal notation is also more
convenient to the user; however, the arithmetic circuits for decimal numbers
are not as efficient as circuits for binary.

The design uses pipelined stages. The multiplication of digit numbers is
done in sec by the execution of steps with a cycle time of ns. The
coprocessor can execute multiplication, addition, subtraction, and division.


Coprocessor for Pascal-XSC

Pascal-XSC is an extended-precision Pascal language. A chip was implemented
by Baumhof [Bau] to reduce the accumulation of errors during the dot-product
computation that the language provides as an instruction. It accepts the IEEE
floating-point format, and the most important feature of the processor is a
module called Long Accumulator (LA) [MRR, Kno]. This accumulator is used to
store partial results without rounding. A single, final rounding is necessary
before the result is delivered to the user.

The performance improvement using this dedicated hardware is significant.
However, this hardware solution is limited to the particular case of
dot-product computation. Other types of operations cannot take advantage of
the coprocessor. Besides that, it uses long precision only internally.
Conventional FP number representation is used to exchange data at the user
interface.

Interval Arithmetic Coprocessor

A work presented in [SSJb, SSJa, Sch] describes algorithms, hardware
organization, and performance measurements of a coprocessor designed to speed
up the execution of a Variable Precision Interval Arithmetic package called
VPI [Ely]. Interval arithmetic was already described in the section on
alternative arithmetic systems. The coprocessor is called VPI Arithmetic
Coprocessor (VPIAC).

The architecture of the VPIAC is based on a conventional floating-point unit
with a more sophisticated controller, scheduling algorithms, and a Long
Accumulator to avoid the propagation of rounding errors in intermediate
floating-point operations.

The performance figures shown in the first table below, obtained from [SSJa],
compare the execution of the VPI package with and without the coprocessor.
The cycle count assumes that the operands are already in the register file.

Another important piece of information is the number of cycles necessary to
perform a single point operation or an interval operation in the coprocessor
[SSJa]. This information is given in the second table below. The variable n
denotes the number of words in the significand of the input operands. The
number of cycles includes instruction fetch, data read from internal
registers, the operation itself, rounding the result, and storing the result
in the data registers. The time for data transfer between the coprocessor and
the main memory is not included.

[Table: execution times (sec) of interval addition, interval multiplication,
and interval division; columns: Precision (bits), VPIAC, VPI, Speedup]

Table: Performance measurement of interval arithmetic on VPI with and
without the arithmetic coprocessor VPIAC [SSJa]

[Table: columns: Operation, Point, Interval; rows: Add/subtract, Multiply,
Square, Divide, Square Root; the entries are expressions in n]

Table: Cycle counts for various operations in the VPIAC [SSJa]

No special algorithms for long-precision computation are used. The
algorithms presented in the work deal with the steps necessary to extend the
basic arithmetic operations to intervals. At the low level, the algorithms to
add, multiply, and divide long-precision numbers are the classical ones. More
details on the algorithms used for VLP operations in this coprocessor are
given in the chapter on VLPA algorithms.

The coprocessor has a multiplier, an adder, a Long Accumulator organized in
segments, and two shifters.
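
To make the interval operations concrete, the sketch below shows interval addition and division on pairs of Python floats. It is not the VPI or VPIAC algorithm: the function names are hypothetical, the end-point formulas are the standard textbook ones, and outward rounding is approximated by stepping one FP number outward with math.nextafter (available from Python 3.9).

```python
import math
from fractions import Fraction

def _down(x: float) -> float:
    return math.nextafter(x, -math.inf)   # next FP number below x

def _up(x: float) -> float:
    return math.nextafter(x, math.inf)    # next FP number above x

def interval_add(x, y):
    """[x0, x1] + [y0, y1], widened so the exact result is enclosed."""
    return (_down(x[0] + y[0]), _up(x[1] + y[1]))

def interval_div(x, y):
    """General case: four FP divisions plus comparisons."""
    if y[0] <= 0.0 <= y[1]:
        raise ZeroDivisionError("divisor interval contains zero")
    q = [x[0] / y[0], x[0] / y[1], x[1] / y[0], x[1] / y[1]]
    return (_down(min(q)), _up(max(q)))

def interval_div_positive(x, y):
    """Known signs (both intervals positive): two FP divisions suffice."""
    if x[0] <= 0.0 or y[0] <= 0.0:
        raise ValueError("both intervals must be strictly positive")
    return (_down(x[0] / y[1]), _up(x[1] / y[0]))

X, Y = (1.0, 2.0), (3.0, 4.0)
general = interval_div(X, Y)
by_sign = interval_div_positive(X, Y)
```

Testing the signs of the end points selects in advance which quotients give the bounds, so the other divisions are not needed.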

JANUS

Another variable-precision coprocessor was proposed in [GHM]. This
coprocessor uses on-line arithmetic [Erc, ET] to obtain a simple and regular
structure that can be used for all basic arithmetic operations. The design
considered serial links to interconnect the coprocessor to transputers and
an ASIC design style, with a fixed maximum precision expressed in decimal
digits. There is no provision in the chip to allow an extension of precision,
which makes operations beyond that maximum precision very inefficient. Only
an SBD (radix-2 signed binary digit) multiplier has been designed. The
resulting clock speed was MHz.

VLP Coprocessor for the TM

A VLPA coprocessor designed for the Transmogrifier (TM) [GKC+] was presented
in [Hsu]. This coprocessor also uses standard hardware algorithms, with
modifications for long-precision computations. The work concentrates on the
arithmetic structures and coprocessor organizations that would be adequate
for the structure of the TM. The numbers of cycles presented in [Hsu] are
reproduced in the next table.

Operation         Number of Cycles
multiplication    quadratic in n
division          quadratic in n

The design uses bit digits and each cycle takes ns

VLP Computation and RCArs

It is a fact that Central Processing Units or Arithmetic Coprocessors for
general applications are designed for the best performance in the most
frequent operations. It is not feasible to have a processor that has the best
performance for all possible applications. That is the main reason why
Reconfigurable Architectures (RArs) are becoming popular. These machines are
flexible enough to implement custom solutions tailored to the particular user
problem or application [AH, VdB, Wau, CB], and for this reason they are also
called Custom Computing Machines (CCMs). It is very common to see large
speedups obtained in RArs although slow devices are used (the price paid for
reconfigurability and flexibility). The speedup results from the exploitation
of the high level of parallelism that can be attained in hardware, reduced
software system overhead, and the specialization of the circuits to the task
at hand. Some examples of RArs are Anyboard [VdB], Splash [AH, Wau], and
Ganglion [CB]. A list of many programmable boards and systems can be found in
[Gui].


One of the first systems using FPGAs and designed for general use was
developed at the DEC Paris Research Laboratory [VBD+]. The DEC PeRLe-0 and
PeRLe-1 are FPGA boards that are attached to the system I/O of a host
workstation. A mesh topology (nearest neighbour) is used to interconnect the
FPGA chips. Typical applications in these PAM (Programmable Active Memory)
systems are computationally complex but require low communication bandwidth.
Some examples are Long Integer Multiplication, Discrete Cosine Transform
(DCT), and an RSA Decrypter.

Splash is another FPGA-based system, designed at the Supercomputing Research
Center (SRC), that obtained significant speedups in gene pattern matching
[AACE, Hoa], text searching [PTS], and heat transfer problem solving based on
the finite difference method [PA].

Recent studies consider the integration of a reconfigurable coprocessor with
the general processor [RV]. The work presented in [RV] concentrates on the
properties of the processor-FPGA interface. The study shows that the speedup
obtained with coprocessors is less sensitive to communication delays when a
DMA type of interface is used and most of the task is done by the
reconfigurable architecture without CPU intervention. The CPU manipulates the
data used by the coprocessor in the next iteration, or manipulates the data
generated by the coprocessor.

Other, more theoretical studies analyze the application of FPGAs to speed up
computations in general terms. The applicability of programmable
coprocessors is discussed in [ACC]. It considers the coprocessor connected
directly to the main processor. The reconfigurable coprocessor is compared to
a Very Long Instruction Word (VLIW) coprocessor architecture. The research
results show that the implementation of addition and multiplication for
integer and floating-point numbers on FPGA-based coprocessors is not the best
choice in most cases. The applications that show significant speedup are the
ones with a limited number of multipliers and with no floating-point
operations. The work does not consider VLP operations. It concludes that the
range of applications that can take advantage of FPGA-based coprocessors
increases with the chip logical resources available. This trend is expected
to continue, since the feature size keeps being reduced and the size of the
chip is also increasing. However, it is important to pick applications that
have a reasonable amount of parallelism.

The potential of FPGA-based systems is studied in [DeH]. The work analyzes the capacity of FPGA organizations, and the results show again that FPGAs will perform much better than general-purpose computers in tasks that have a significant degree of parallelism.

VLP computation is one of the tasks that fits in the category of applications that are well suited for FPGAs. The same digit operations are executed over and over again until the final result is obtained. The exact number of digits that becomes the threshold between hardware and software execution depends on the particular implementation. It is the type of application that has the potential to run much faster in specific hardware than on a general-purpose processor running a software solution.

Research Objectives

The main research objectives of this work are:

• Investigation of alternatives for the design of hardware algorithms for VLPA computations, focusing on Reconfigurable Architectures (RAs).

• Development of VLP arithmetic algorithms using online arithmetic.

• Development of area-time estimates for the proposed VLP algorithms for RAs. The model is used as the basis for the evaluation of possible alternatives in the implementation of the algorithms, and also as a first-order approximation for the performance evaluation suggested in this thesis.

• Proposal of a reconfigurable coprocessor architecture that considers the characteristics of the VLP algorithms and a particular communication mechanism

PhD Dissertation Chapter DRAFT February

  with the host computer.

• Performance evaluation of the host/coprocessor pair, compared with the execution time of the long-precision computation by the host running a software package.

• Definition of a mechanism to integrate the host and coprocessor in the execution of FP operations. Tasks related to FP operations, such as normalization and exponent manipulation, may be done by the host processor for performance reasons.

• Study of the parameters that are necessary to design a coprocessor that allows adjustment of the operation precision. The maximum precision is determined by available memory resources only [CHH, HCH].

Dissertation Outline

The remainder of this dissertation is organized as follows:

• Chapter gives a high-level model for the coprocessor architecture, presents the FPGA architecture considered in the thesis, and provides an initial discussion of the software interface that allows the use of the coprocessor operations.

• Chapter presents long-precision algorithms used in most of the software packages, discusses the hardware implementation of VLP operations in available coprocessors, and justifies the use of online arithmetic for VLP operations.

• Chapter and describe the algorithms used for VLP multiplication, division, and square root. Convergence conditions and implementation issues are discussed.

• Chapter contains details of the coprocessor implementation, such as digit recoding, number conversion, and floating-point operations using the host and coprocessor.

• Chapter gives the area-time estimates of the components used in the VLP algorithms for the particular case of FPGA technology.

• Chapter describes the performance model for the VLP algorithms and provides an evaluation of the system composed of host and coprocessor.

• Chapter presents the conclusions of this work and future research.

Chapter

Reconfigurable Coprocessor Architecture

This chapter presents the architectural issues of the reconfigurable coprocessor proposed for VLP computations. The coprocessor architecture illustrates the target system for the arithmetic algorithms presented in this thesis and also provides the basis for a first-order performance estimate of the proposed approach. The coprocessor is modeled in terms of the required logic, memory, and interconnect resources and of its I/O capabilities. The reconfigurable coprocessor may have its datapath and control modified as needed in order to implement a particular algorithm directly.

We first present the coprocessor model architecture, then give a general view of the target programmable devices, and finally describe the interface between the coprocessor and the user software.

Reconfigurable Coprocessor Model

The coprocessor model presented in this section is not designed for general use, but for the arithmetic operations proposed in the thesis. Its architecture reflects the characteristics of the algorithms for VLP arithmetic.

The model, shown in the figure "Coprocessor model", considers the following functions:

• Arithmetic operations: basic arithmetic operators and other operations required in the arithmetic algorithms, such as number conversion. This function is implemented in the reconfigurable hardware of the coprocessor, the Field Programmable Gate Arrays (FPGAs).

• Local storage: memory modules necessary to store intermediate results in VLP operations. Memory resources may be available in the FPGA chip or may be included as extra chips in the coprocessor. The first solution provides a better access time, since the logic resources and memory cells are on the same chip. However, the use of internal memory resources reduces the area available for the logic circuits used in VLP operations. Also, the internal memory elements have a small addressing space. The second solution has more delay in data transfer due to crossing chip boundaries. However, for long sequences of intermediate results this is the only practical solution. The transfer of more than one digit per memory access may be used to reduce the transfer delay. Memory elements that have two ports for simultaneous accesses are assumed for two reasons: (1) the need for data buffers between host and coprocessor, to store data that is generated faster than it is consumed, and (2) the capability to read and write temporary data in the same cycle, in order to maximize data throughput. Data channels between the memory elements and the host or coprocessor have different bandwidths, represented as $B_H$ and $B_{LM}$.

• Data interface (I/O): this function is related to the transfer of operands and results to the system main memory or local memory. The coprocessor has communication channels to the host and to the local memory. The circuits that implement the communication protocol and DMA are in this category.

• Memory interface: provides the functionality to read/write information to/from the local memory.

• Control: responsible for the synchronization of tasks performed by all other functions.

• Configuration: responsible for downloading configuration files into the FPGAs. These files define the circuits that implement the coprocessor functions, fully or in part. The function also includes storage for configuration files. The files are transferred by the host in advance and, when needed, they are written to the FPGA devices in the coprocessor. If the FPGA chips permit partial reconfiguration, the bitstreams required to modify the configuration from one operation to the other are also stored in the configuration memory. Partial reconfiguration allows the modification of parts of the logic, or the modification of interconnections only. This process is faster than the static reconfiguration used in many FPGAs, in which the whole circuit must be downloaded. Reconfiguration time is of major importance, since the application program may require successive and different operations in the VLP coprocessor.

The reconfigurable part of the system is composed of the FPGA chips and can accommodate all of these functions. For performance reasons, some of the functions, or parts of them, can be implemented in dedicated hardware. Dedicated circuits should be used to implement functions that are not modified over time. The use of dedicated hardware is not discussed in this thesis.

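[Figure: block diagram of the reconfigurable coprocessor. Blocks: HOST, Main Memory, Data Interface, Local Memory, Memory Interface (Mem Int), Arithmetic Operators (FPGAs), Control, and Configuration. Channel bandwidths are labeled $B_H$ and $B_{LM}$.]

Figure: Coprocessor model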

Once configured for a particular VLP operation, the coprocessor organization for VLP arithmetic is defined as shown in the figure "Coprocessor Organization for VLP operation". In the diagram we abstract the complexity of the interface between the host and the coprocessor and concentrate on the architectural and design issues of the arithmetic coprocessor. The host is able to access the local memory to transfer input operands and read the result. Details are omitted to increase legibility, and only the main concepts are presented.

[Figure: the HOST connects through an INTERFACE to the coprocessor, which contains the memories X, Y, Z, and W, the datapath, and the control.]

Figure: Coprocessor Organization for VLP operation

The memory components named X, Y, Z, and W are dual-port RAMs for concurrent memory access. Specific locations of these memories store information on the last digits received from the host. Thus, the coprocessor operates based on the availability of input data: if the data is not available, the coprocessor stalls. The same is true for the host, which reads the result digits independently of the coprocessor operation. The memory elements have arbitration to avoid conflicts during memory accesses.

The coprocessor local memory is mapped into the host memory space. The host sees the addresses of the memory elements as a sequence of addresses. Internally, the coprocessor is able to access multiple words of the local memory in order to have sufficient data throughput to match the circuit operation.

A word transferred on the bus holds several digits to be manipulated in the coprocessor. The host processor program must pack digits into the bus transfer word to reduce the communication overhead.
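The packing of digits into bus words can be sketched as follows. This is only an illustration under stated assumptions: the function names, the digit and word widths, and the choice of placing the first digit in the low-order bits are hypothetical and are not taken from the coprocessor design.

```python
def pack_digits(digits, digit_bits, word_bits):
    """Pack fixed-width digits into bus words (first digit in the low-order bits)."""
    per_word = word_bits // digit_bits          # digits that fit in one bus word
    words = []
    for start in range(0, len(digits), per_word):
        word = 0
        for k, d in enumerate(digits[start:start + per_word]):
            if not 0 <= d < (1 << digit_bits):
                raise ValueError("digit does not fit in digit_bits")
            word |= d << (k * digit_bits)
        words.append(word)
    return words


def unpack_digits(words, digit_bits, word_bits, count):
    """Inverse of pack_digits; count is the number of digits originally packed."""
    per_word = word_bits // digit_bits
    mask = (1 << digit_bits) - 1
    digits = []
    for word in words:
        for k in range(per_word):
            digits.append((word >> (k * digit_bits)) & mask)
    return digits[:count]
```

With 4-bit digits and a 16-bit bus word, five digits travel in two bus transfers instead of five.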

FPGA Architecture

FPGAs are modeled as an array of cells, or Configurable Logic Units (CLUs). A CLU is composed of a Look-Up Table (LUT) and a flip-flop. The LUT is configured to implement a particular logic function. In the Xilinx family of programmable devices the LUT is called a function generator (FG). The number of inputs of the table changes from one manufacturer to another, or from one family of devices to another. An n-input LUT can implement any logic function of n binary variables. Some FPGAs contain more than one type of LUT. For example, in the Xilinx XC devices it is possible to implement the following function types in the CLU:

• $f_1(x_1, x_2, x_3, x_4)$, with $x_i \in \{0, 1\}$

• $f_2(x_1, x_2, x_3, x_4, x_5)$, with $x_i \in \{0, 1\}$

• $f_3(f_1(x_1, x_2, x_3, x_4),\, g_1(y_1, y_2, y_3, y_4),\, z)$, with $x_i$, $y_i$, and $z \in \{0, 1\}$
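A LUT can be pictured as a truth table addressed by its inputs. The sketch below illustrates that idea and the composition used by the third function type; the helper names and the particular functions chosen for F, G, and H are illustrative assumptions, not part of any Xilinx tool or API.

```python
def make_lut(func, n):
    """Build the 2**n-entry truth table that an n-input LUT would store."""
    table = []
    for address in range(2 ** n):
        bits = [(address >> k) & 1 for k in range(n)]   # bits[0] is the first input
        table.append(func(*bits) & 1)
    return table


def lut_eval(table, *inputs):
    """Evaluate a LUT: the input bits form the address of the stored output bit."""
    address = sum(bit << k for k, bit in enumerate(inputs))
    return table[address]


# Two 4-input tables feeding a 3-input table, as in the third function type.
F = make_lut(lambda x1, x2, x3, x4: x1 ^ x2 ^ x3 ^ x4, 4)   # example: parity
G = make_lut(lambda y1, y2, y3, y4: y1 & y2 & y3 & y4, 4)   # example: 4-input AND
H = make_lut(lambda f, g, z: f if z else g, 3)              # example: z selects f or g


def f3(x, y, z):
    """Nine-input function built from F, G, and H."""
    return lut_eval(H, lut_eval(F, *x), lut_eval(G, *y), z)
```

Any other 4-input or 3-input function could be stored in the same tables; only the table contents change, which is what configuration does.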

The flip-flop (FF) is independent of the LUT and can be used by the local LUT or by logic in another cell. The interconnections are defined at configuration time. The figure "Model for a Look-up Table based configurable cell" shows the model of one of these cells. One or more CLUs can be grouped into a Configurable Logic Block (CLB). In the Xilinx XC series of devices, the number of CLUs in one CLB depends on the function being implemented. The multiplexer (drawn in gray in the figure) is programmed at configuration time and allows the FF to receive data from the LUT or from another input, called DIN. Both combinational (X) and registered (XQ) outputs are provided.

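[Figure: an n-input LUT (inputs 0, 1, 2, ..., n) drives the combinational output X. A multiplexer selects between the LUT output and DIN as the input of the flip-flop FF, which is driven by the clock and produces the registered output XQ.]

Figure: Model for a Look-up Table based configurable cell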

CLUs are connected to each other by a programmable interconnect, as shown in the figure "FPGA structure". The flexible interconnect system allows a high degree of freedom for placement and routing of circuits, but also causes more delay than the interconnects used in ASIC technology. As in ASICs, the delay of a communication line is sensitive to its load and to its length.

[Figure: Configurable Logic Blocks, Switch Boxes, and I/O Blocks.]

Figure: FPGA structure

Regarding support for arithmetic operations, some FPGAs have special circuitry to reduce the time to propagate carries/borrows in adders and subtractors. This fast carry logic (FCL) makes possible the implementation of fast and area-efficient adders for a small number of bits. It expands the functionality of the FG using an extra input and output, called CIN and COUT. These input/output pairs are connected by a dedicated network that is faster than the one that interconnects CLUs. For this reason the FCL reduces the total addition time significantly.

The structure of a Configurable Logic Block (CLB) for the Xilinx XC series is presented in the figure "Structure of a CLB in the Xilinx XC series".

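[Figure: CLB with function generator F (inputs F1 to F4), function generator G (inputs G1 to G4), a third function generator H, inputs Din and CK, combinational outputs X and Y, and two flip-flops with registered outputs XQ and YQ.]

Figure: Structure of a CLB in the Xilinx XC series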

FPGA Array

If many FPGA chips are used, the interconnection between them will significantly impact the overall performance of the VLP algorithms. Flexible interconnects between the chips imply lower bandwidth and larger delay. For the VLP algorithms proposed in this thesis it is important to have high data bandwidth, so a dedicated interconnect is assumed for data in high demand, and bus connections are considered for data in lower demand. These options will become clear after the VLP algorithms are explained.

The most adequate interconnection between FPGAs is the linear array, as shown in the figure "Linear Array of FPGAs". The local interconnections have small delays and allow pipelining between circuits working in neighboring chips.

[Figure: three FPGA chips connected in a linear array, with 2n data digits as input, shared control signals, and a result bus.]

Figure: Linear Array of FPGAs

The communication across chips may take longer than the operation cycle. A strategy to balance the chip I/O bandwidth and the circuit speed was discussed in [Lou] and may also be used here. This work concentrates on single-chip implementation, but also provides a preliminary discussion of the implementation of the VLP algorithms over multiple chips.

Software and Hardware Interface

We propose to make the new VLP arithmetic algorithms available to the VLP application or library in the form of a hardware object. The object hides the hardware implementation details from the user. The VLP application or library makes use of the coprocessor operation through a procedure call. A technology named Hardware Object Technology (HOT), which uses this concept, is already available [Corb].

The VLP algorithm implemented in the coprocessor requires the host to execute some tasks that do not benefit from hardware implementation, for example digit adjustments, number conversion, and exponent manipulation. Thus, only the kernels of the VLP operations are implemented in the coprocessor.

In this work we selected the GNU Multiprecision Library (GMP) as the software package to exemplify the application of the coprocessor and also to compare performance. The application software uses data types and VLP operations defined in the library. When using the coprocessor, the VLP operation is replaced by a procedure with the same interface that is able to manipulate the reconfigurable hardware.

An example of VLP addition using the GMP library, with and without the coprocessor, is shown in the figure "Software/Hardware interface for LP addition". The procedure named mpz_add receives three pointers: one for the result (sum) and two for the operands (int1 and int2). The path on top of the figure shows the application program activating the procedure, which is executed by the software library and returns control to the application. When the coprocessor is used, the activation of the procedure passes control to a software routine that interfaces with the coprocessor hardware, performing basic operations such as transferring data to the coprocessor's local memory, starting the operation in the coprocessor, and reading the results. With this procedure, the application program is not aware of whether the operation was performed by the original GMP routines or by the VLP coprocessor.

[Figure: two paths for the same call, mpz_add (&sum,&int1,&int2).
VLP addition using the GMP library: the call is served by the GMP LP addition routine.
VLP addition using the coprocessor: the call is served by the VLP Hardware Object, which checks if the coprocessor is correctly configured, transfers data to the coprocessor, triggers the operation, and reads the result from the reconfigurable coprocessor.]

Figure: Software/Hardware interface for LP addition
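The idea of keeping one procedure interface for both execution paths can be sketched as below. Everything here is a hypothetical stand-in: `FakeCoprocessor`, its methods, and `vlp_add` are invented for illustration and do not correspond to GMP, to HOT, or to any real driver API. A Python integer addition replaces the hardware kernel.

```python
class FakeCoprocessor:
    """Hypothetical stand-in for the reconfigurable hardware (not a real API)."""

    def __init__(self):
        self.configuration = None
        self.local_memory = {}
        self.log = []

    def configure(self, operation):
        self.configuration = operation
        self.log.append("configure " + operation)

    def write(self, name, value):
        self.local_memory[name] = value

    def trigger(self):
        # A Python addition stands in for the hardware kernel.
        self.local_memory["Z"] = self.local_memory["X"] + self.local_memory["Y"]
        self.log.append("trigger")

    def read(self, name):
        return self.local_memory[name]


def vlp_add(a, b, coprocessor=None):
    """Add two long integers; the caller sees the same interface on both paths."""
    if coprocessor is None:
        return a + b                         # software library path
    if coprocessor.configuration != "add":   # check the current configuration
        coprocessor.configure("add")
    coprocessor.write("X", a)                # transfer the operands
    coprocessor.write("Y", b)
    coprocessor.trigger()                    # start the operation
    return coprocessor.read("Z")             # read the result
```

The second call on an already configured device skips the configuration step, which is where the reconfiguration time discussed earlier would be saved.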

Chapter

Variable Long-precision Arithmetic (VLPA) Algorithms

In this chapter we present common algorithms used in software solutions for long-precision computation, review the hardware algorithms used in present coprocessors for long-precision calculations, and introduce the concept of online arithmetic as a solution for VLP operations. The presentation of software algorithms provides the main concepts involved in long-precision calculations. We also justify the main reasons that make some of the algorithms preferable for hardware implementation.

By definition, one cannot have a direct, full hardware implementation of VLP arithmetic operations. The objective in any case is to find an optimal partitioning and allocation strategy for the hardware resources available. The problem is involved, since the hardware resources, although limited, are reconfigurable. The circuit design for FPGAs depends on the problem at hand and on the resources available.

VLP operations are done serially, digit by digit. The number of bits in each digit is adjusted according to system characteristics. When a software library performs this task on a general-purpose processor, the digit corresponds to the machine word. Using RAs, one can adjust the digit size in order to get the most performance in the available area. Area utilization is not a serious problem for serial addition, since serial addition units use a small chip area and the area depends only on the digit size, not on the precision of the input operands or the result. The problem with limited-precision units is exposed in other operations, for which the digits already received by the unit are important for the computation of the next output digits. Thus, these serial operators for long-precision computation need hardware resources proportional to the precision of the result. This is really a problem of the internal state. Some operations, like addition, have a constant-size state, while others, like multiplication, have a state size proportional to the operands' precision.

Solutions to this problem are usually inspired by algorithms for long-precision operations implemented in software.

VLP Algorithms used in Software

In this section we present algorithms used in software packages for variable long-precision computations. Sometimes these packages or libraries use the term multiple precision to indicate that the precision of the operations varies in multiples of a basic precision allowed by the processor.

By analyzing the GMP [Tor], Numerical Recipes [P+], and MPFUN [Bai] software packages, we could determine the algorithms used for the four basic operations. The main aspects of most of the algorithms are also presented in detail in [Knu, P+]. They are used, with some modifications, in the software packages mentioned in this work.

Notation and Conventions

Consider that each word in the machine holds one digit in a very high radix $r$. A long-precision number is represented as a sequence of words (digits). Usually the value of the radix is a function of the machine word length in bits, $n$, i.e., $r = 2^n$, such that each word represents a digit $u_i$.

Only positive numbers are considered, based on the fact that most libraries use sign-and-magnitude representation, where the sign of the number can be treated separately. The magnitude represented by a vector of digits $(u_1, u_2, u_3, \ldots, u_m)$ is obtained in the conventional positional notation as

$$u = \sum_{i=1}^{m} u_i\, r^{m-i}, \quad \text{with } 0 \le u_i < r.$$


The number of digits $m$ in the vector is variable, and it represents the precision of the number in radix $r$.
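The positional notation above can be checked with a short sketch. The function names are illustrative assumptions; the digit list is stored with $u_1$ (the most significant digit) first, following the convention of this section.

```python
def digits_to_int(digits, r):
    """Value of (u_1, ..., u_m) as the sum of u_i * r**(m - i), u_1 first."""
    value = 0
    for u in digits:
        if not 0 <= u < r:
            raise ValueError("digit out of range for radix r")
        value = value * r + u        # Horner evaluation of the positional sum
    return value


def int_to_digits(value, r):
    """Inverse conversion for a non-negative integer, most significant digit first."""
    digits = []
    while value > 0:
        value, u = divmod(value, r)
        digits.append(u)
    return digits[::-1] or [0]
```

For example, with $r = 2^{16}$ the value $2^{40}$ is the three-digit vector (256, 0, 0), so its precision is m = 3 digits.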

The long-precision operations are mapped into primitive operations (instructions) that are available in the computer:

• addition/subtraction of one-digit integers (radix $r$), giving a one-digit integer and a carry as a result;

• multiplication of a one-digit integer by a one-digit integer, giving a two-digit integer as a result.

Although for some operations the discussion is focused on integers, the application of the VLP algorithms to floating-point numbers is straightforward.

Software Algorithms for VLP Addition

The basic algorithm used for addition of extended-precision numbers is serial addition. Each step is performed on a machine word that stores a digit in base $r = 2^n$.

Addition and subtraction in these classical algorithms always start from the least-significant digit. The high-precision numbers are stored in memory in an array of memory words.

The software routine monitors the carry out of each digit addition and uses it as the carry in for the next digit addition. The process stops as soon as the precision of one of the operands is exhausted and there is no carry out to be assimilated in the next steps.
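A minimal sketch of this serial addition follows. The function name and the list representation (most significant digit first, as in the notation section) are assumptions made for illustration; a real library works in place on arrays of machine words.

```python
def vlp_add_digits(u, v, r):
    """Serial addition of two digit vectors in radix r, most significant digit first."""
    if len(u) < len(v):
        u, v = v, u                      # make u the longer operand
    result = list(u)
    carry = 0
    for k in range(1, len(u) + 1):       # k-th digit counted from the least significant end
        if k > len(v) and carry == 0:
            break                        # shorter operand exhausted and no carry left
        s = u[-k] + (v[-k] if k <= len(v) else 0) + carry
        carry, result[-k] = divmod(s, r)
    if carry:
        result.insert(0, carry)          # the sum needs one extra digit
    return result
```

The early exit is the stopping rule described above: the remaining high-order digits of the longer operand are copied unchanged.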

Software Algorithms for VLP Multiplication

VLP multiplication is performed by different algorithms depending on the precision of the operands. Simple algorithms are faster than more elaborate ones when the precision is around tens of digits (a few hundred to a thousand bits), which we call low precision in this section. Beyond a limit that is determined experimentally for each software package implementation, the precision is considered high, and a more elaborate algorithm with better asymptotic time is used.

The threshold b etween low and high precision was evaluated in Com

to b e around words approximately bits If the elab orate algorithm is

applied for op erands with less than words the overhead involved in the data

manipulation makes it worse than a simple algorithm In the GMP library the

threshold has a similar value

VLP Multiplication of Low Precision Numbers

For the multiplication of low-precision numbers, the simplest multiplication algorithm, equivalent to the classical paper-and-pencil algorithm, is more efficient than other, more sophisticated, methods.

To multiply two operands $u = (u_1, u_2, \ldots, u_p)$ and $v = (v_1, v_2, \ldots, v_m)$ without loss of precision, the result $w = u \times v = (w_1, w_2, \ldots, w_{m+p})$ must have $m + p$ digits. The values of the operands and result are

$$u = \sum_{i=1}^{p} u_i\, r^{p-i}$$

$$v = \sum_{i=1}^{m} v_i\, r^{m-i}$$

$$w = \sum_{i=1}^{m+p} w_i\, r^{m+p-i}$$

The classical algorithm was adapted from [Knu]. It executes $mp$ digit multiplications and $mp$ double precision additions. The number of single precision digit additions depends on the generation of carries in the addition of the first (least significant) words. If a carry is not generated, there is no need to continue with the second word. Some savings result from this simple test. The inclusion of a test for a zero condition of $v_j$ is not worthwhile, since the probability of having $v_j = 0$ with a high value of $r$ is very small. The asymptotic time of this algorithm is $O(n^2)$.


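Algorithm: low-precision multiplication (classical)

S1: set $w_m, w_{m+1}, \ldots, w_{m+p} \leftarrow 0$. Initialize the outer loop counter $j \leftarrow m$.

S2: initialize the inner loop counter $i \leftarrow p$ and set variable $carry \leftarrow 0$.

S3: $w_{i+j} \leftarrow (u_i v_j + w_{i+j} + carry) \bmod r$ and $carry \leftarrow \lfloor (u_i v_j + w_{i+j} + carry) / r \rfloor$, where $0 \le carry < r$ is the carry-out digit of this position and both expressions use the value of $w_{i+j}$ held before the update. Notice that $u_i v_j$ generates a double precision number, and thus all additions are double precision.

S4: $i \leftarrow i - 1$. If $i > 0$ go to S3. Otherwise $w_j \leftarrow carry$ and continue to S5.

S5: $j \leftarrow j - 1$. If $j > 0$ go to S2; otherwise stop.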
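The classical algorithm can be sketched in a few lines. The function name is an assumption; the arrays are padded with an unused element at index 0 so that the indices match the 1-based notation of the text ($u_1$ and $v_1$ are the most significant digits).

```python
def classical_multiply(u, v, r):
    """Classical O(p*m) multiplication of digit vectors, most significant digit first."""
    p, m = len(u), len(v)
    U = [0] + list(u)                 # U[1..p], index 0 unused
    V = [0] + list(v)                 # V[1..m], index 0 unused
    w = [0] * (m + p + 1)             # w[1..m+p], all cleared
    for j in range(m, 0, -1):         # outer loop over the multiplier digits
        carry = 0
        for i in range(p, 0, -1):     # inner loop over the multiplicand digits
            t = U[i] * V[j] + w[i + j] + carry   # double precision intermediate
            w[i + j] = t % r
            carry = t // r
        w[j] = carry                  # most significant digit of this row
    return w[1:]
```

The intermediate value t never exceeds $r^2 - 1$, which is why a double precision accumulator is sufficient for every step.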


Multiplication of High-Precision Numbers

One can calculate the product of two operands by breaking them in half and combining the results of the multiplication of these halves. An algorithm called Ofman-Karatsuba [Knu] reduces the number of digit multiplications used in the classical algorithm. This scheme is used by the GNU MP library when the number of digits is larger than a certain threshold value. The algorithm uses the following equation to obtain the product of $U$ and $V$:

$$UV = (r^{m} + r^{m/2})\, U_1 V_1 + r^{m/2}\, (U_1 - U_0)(V_0 - V_1) + (r^{m/2} + 1)\, U_0 V_0$$

Using this transformation, only three multiplications with half of the precision are needed. Besides these multiplications, subtractions in precision m and additions in precision m are needed, as well as some shift operations.

Assuming that both operands have the same precision and that it is a power of 2, the algorithm to multiply high-precision numbers is defined recursively, as shown in the Ofman-Karatsuba algorithm. Operands and result are $U = (u_1, u_2, \ldots, u_m)$, $V = (v_1, v_2, \ldots, v_m)$, and $P = (p_1, p_2, \ldots, p_{2m})$, with $m = 2^k$. The asymptotic time of this algorithm is $O(n^{1.6})$.
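The recursion can be sketched on Python integers, which keeps the example short. This is an illustration under assumptions: the function name and the `threshold` parameter are hypothetical, the built-in product stands in for the classical algorithm below the threshold, and the sign of the middle product is handled separately because $(U_1 - U_0)$ and $(V_0 - V_1)$ may be negative.

```python
def karatsuba(U, V, m, r, threshold=4):
    """Multiply non-negative integers U, V < r**m, with m a power of 2."""
    if m <= threshold:
        return U * V                     # stands in for the classical algorithm
    c = m // 2
    U1, U0 = divmod(U, r ** c)           # most and least significant halves
    V1, V0 = divmod(V, r ** c)
    pp1 = karatsuba(U1, V1, c, r, threshold)
    pp2 = karatsuba(U0, V0, c, r, threshold)
    du, dv = U1 - U0, V0 - V1
    sign = -1 if (du < 0) != (dv < 0) else 1
    pp3 = sign * karatsuba(abs(du), abs(dv), c, r, threshold)
    # (r^2c + r^c) U1V1 + r^c (U1 - U0)(V0 - V1) + (r^c + 1) U0V0
    return (r ** (2 * c) + r ** c) * pp1 + r ** c * pp3 + (r ** c + 1) * pp2
```

Each level replaces one full-size product by three half-size products, which is where the exponent $\log_2 3 \approx 1.585$ of the asymptotic time comes from.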

FFT-based VLP Algorithm

Another way to multiply high-precision numbers in software uses the FFT process, or Schönhage and Strassen algorithm [Knu, Bai]. The algorithm has a better asymptotic time of $O(n \log n \log\log n)$.

Examining the multiplication procedure, we can see that the product of two vectors of digits is the convolution of these vectors. That is, the multiplication can be represented by the expression

$$p_n = \sum_{k=0}^{n} x_{n-k}\, y_k = x_n * y_n$$

where $p_n$ is an element of the product vector and the symbol $*$ is the convolution operation. For example, consider the multiplication of two vectors X and Y, which gives a vector P


Algorithm: high-precision multiplication (Ofman-Karatsuba)

S1: if $size \le threshold$, use the classical multiplication algorithm and stop. Otherwise continue. (The stop in this step may represent the return from a recursive call.)

S2: split the operands in two halves, $V_1$, $V_0$, $U_1$, and $U_0$, where the vectors with index 1 are the most-significant halves.

S3: recursively call this algorithm to multiply
    $pp_1 = U_1 V_1$,
    $pp_2 = U_0 V_0$, and
    $pp_3 = (U_1 - U_0)(V_0 - V_1)$.

S4: combine the partial products obtained in S3 according to the expression
    $p = (r^{2c} + r^{c})\, pp_1 + r^{c}\, pp_3 + (r^{c} + 1)\, pp_2$,
    where $c = \lceil size/2 \rceil$; stop.

that, by reduction to digits in the radix, results in the product vector.

The convolution theorem says that the Fourier transform of the convolution of two functions is equal to the product of their individual Fourier transforms. This theorem is true for the continuous and discrete (sampled) cases. Thus, by antitransforming the product of the Fourier transforms of the two operand vectors, it is possible to get the convolution of these vectors. If the operands are $g$ and $h$, with $m$ digits, and their Fourier transforms are $G(f)$ and $H(f)$ respectively, the theorem says that

$$g * h \Longleftrightarrow G(f)\, H(f)$$

where the symbol $\Longleftrightarrow$ denotes the transformation or antitransformation between the two systems. Double precision numbers are necessary to store the accumulation of digits in the convolution, to avoid overflow. Notice that this is a limitation on the number of digits that can be used in the multiplication.

To transform a vector with $m$ digits, $m \log_2 m$ steps are necessary.

The multiplication of the transformed vectors (complex numbers) requires $4m$ real multiplications and $2m$ additions. A better solution would be to use only $3m$ multiplications, as follows:

$$(a + bi)(c + di) = ac + (ad + bc)\,i - bd = (ac - bd) + i\,[(a + b)(c + d) - ac - bd]$$
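The identity can be verified with a short sketch that uses exactly three real multiplications per complex product. The function name is an illustrative assumption.

```python
def complex_mul3(a, b, c, d):
    """(a + bi)(c + di) with three real multiplications; returns (real, imaginary)."""
    ac = a * c
    bd = b * d
    cross = (a + b) * (c + d)          # equals ac + ad + bc + bd
    return ac - bd, cross - ac - bd    # real part, imaginary part (ad + bc)
```

The saving of one multiplication is paid for with three extra additions or subtractions, which is a good trade when multiplications dominate the cost.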

Notice that the multiplication of two discrete functions $G(f)$ and $H(f)$ is obtained by multiplying the values of the functions for each possible value of the input.

After getting the transform of the convolution, the antitransform process needs $m \log_2 m$ steps to obtain the value of the result. The values in the obtained vector do not fit the range of a digit, so a final pass is necessary to adjust the digit values and propagate the carries.
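The whole process (transform, pointwise product, antitransform, and final carry pass) can be sketched as below. This is a floating-point illustration under assumptions, not the Schönhage and Strassen algorithm itself and not the MPFUN implementation: the function names are hypothetical, a simple recursive radix-2 FFT is used, and rounding limits the sketch to small operands.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2. No scaling is applied."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)   # twiddle factor (sine/cosine values)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def fft_multiply(u, v, r):
    """Multiply digit vectors (most significant digit first) through a convolution."""
    size = 1
    while size < len(u) + len(v):
        size *= 2
    # Least significant digit first for the convolution, zero padded to size.
    x = list(reversed(u)) + [0] * (size - len(u))
    y = list(reversed(v)) + [0] * (size - len(v))
    X, Y = fft(x), fft(y)
    conv = fft([p * q for p, q in zip(X, Y)], invert=True)
    coeffs = [round(c.real / size) for c in conv]     # convolution sums, larger than a digit
    digits, carry = [], 0
    for value in coeffs:                               # final pass: adjust digits, propagate carries
        carry, d = divmod(value + carry, r)
        digits.append(d)
    return digits[:len(u) + len(v)][::-1]
```

The convolution sums in `coeffs` exceed the digit range before the last loop, which is the reason for the final adjustment pass described above.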

Some problems for hardware implementation are:

• the process uses values of the sine function. These values are usually rounded and stored in tables in the hardware;

• even though the number of steps is reduced to $m \log_2 m$, the procedure is time consuming, since each step is complex. A high level of parallelism must be used to make the procedure attractive.

This algorithm is used by MPFUN, a FORTRAN multiprecision package (a software library) implemented on CRAY machines. The convolution process is adequate for vector computations. Values for the execution time of this algorithm on CRAY machines are presented in [Bai].

Software Algorithms for VLP Division

The most used algorithm for long-precision division is the Newton-Raphson algorithm [P+]. It makes use of one VLP multiplication algorithm. To obtain the quotient $X/Y$, compute the reciprocal of the divisor by iterating Newton's rule

$$U_{i+1} = U_i\,(2 - Y U_i)$$

where $U_0$ is an approximation of $1/Y$, and the algorithm has quadratic convergence to the correct value of $1/Y$. The flowchart of the complete algorithm is shown in the figure "Flowchart of Newton-Raphson algorithm for division". The asymptotic execution time is $O(M \log\log N)$, where $M$ is the asymptotic time of the long-precision multiplication algorithm and $\log\log N$ is the factor related to the convergence of Newton's rule.

An efficient implementation of VLP multiplication is critical for the overall performance of this division method. Many variable long-precision multiplications are executed, and at least two of them use the full precision of the operands.

[Figure: flowchart. Use the conventional FP system to obtain U ← 1/Y; convert U from standard FP to LP format; multiply U and Y and compute 2 − UY; compute U = U(2 − UY); test whether the iteration has converged to 1 (if not, repeat); on convergence, multiply Q ← X·U and compute R ← X − QY.]

Figure: Flowchart of Newton-Raphson algorithm for division
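The quadratic convergence of Newton's rule can be observed with exact rational arithmetic. The sketch below is an illustration only: the function name is hypothetical, and `fractions.Fraction` replaces the long-precision format so that the error $1 - Y U_i$ can be inspected exactly. Since $1 - Y U_{i+1} = (1 - Y U_i)^2$, the error is squared at every step.

```python
from fractions import Fraction


def reciprocal_newton(Y, U0, iterations):
    """Iterate U <- U(2 - YU); returns the last U and the error 1 - YU after each step."""
    Y = Fraction(Y)
    U = Fraction(U0)
    errors = []
    for _ in range(iterations):
        U = U * (2 - Y * U)           # Newton's rule for the reciprocal
        errors.append(1 - Y * U)      # exact residual of the approximation
    return U, errors


# Quotient X/Y obtained as X times the reciprocal of Y.
U, errors = reciprocal_newton(3, Fraction(3, 10), 3)
quotient = 7 * U
```

Starting from an approximation with one correct decimal digit, three iterations give about eight, which is why the intermediate precision can be doubled from one step to the next.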

Software Algorithms for VLP Square Root

The Newton-Raphson square root method first computes the reciprocal of the square root and then multiplies the result by the initial operand. The reciprocal of the square root of a long-precision number $V$ is obtained iteratively using the recurrence equation

$$U_{i+1} = \frac{U_i\,(3 - V U_i^{2})}{2}$$

where $U_0$ is an approximation of $1/\sqrt{V}$, and $U$ converges quadratically to the full-precision reciprocal value. Intermediate steps are done in a precision that doubles in each iteration of the algorithm. A final full-precision multiplication by $V$ is done to obtain the correct result. The initial approximation of $1/\sqrt{V}$ is obtained by table lookup or by using an available square root instruction in the processor. The asymptotic time for the algorithm is the same as that of division using the same method. The flowchart of the square root computation using the Newton-Raphson method is shown in the figure "Flowchart of square root computation using Newton-Raphson method".

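[Figure: flowchart. Use the conventional FP system to obtain U ← 1/sqrt(V); convert U from standard FP to LP format; compute T ← U², T ← TV, T ← (3 − T)/2, U ← TU; test whether the iteration has converged to 1 (if not, repeat); on convergence, compute the result Y ← VU.]

Figure: Flowchart of square root computation using Newton-Raphson method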
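The steps of the flowchart can be followed with exact rational arithmetic, as in the sketch below. The function name is hypothetical and `fractions.Fraction` stands in for the long-precision format; a real implementation would truncate each step to a precision that doubles per iteration.

```python
from fractions import Fraction


def sqrt_newton(V, U0, iterations):
    """Approximate sqrt(V) as V * U, where U converges to 1/sqrt(V)."""
    V = Fraction(V)
    U = Fraction(U0)
    for _ in range(iterations):
        T = U * U             # T <- U^2
        T = T * V             # T <- T V, which approaches 1
        T = (3 - T) / 2       # T <- (3 - T) / 2
        U = T * U             # U <- T U
    return V * U              # final full-precision multiplication by V


root = sqrt_newton(2, Fraction(7, 10), 4)
```

With $e_i = 1 - V U_i^2$, one step gives $e_{i+1} = (3 e_i^2 + e_i^3)/4$, so the number of correct digits roughly doubles per iteration.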

Hardware Implementation of VLP Algorithms

We discuss in this section the algorithms used in available coprocessors.

Multiplication in VLP Coprocessors

VLP arithmetic coprocessors use the classical algorithm for multiplication of long-precision numbers. Although the classical algorithm has a time complexity of $O(n^2)$, which is the worst of all the software algorithms, the choice is justified by simpler control and data transformation circuits. The justification for using this type of algorithm, and not another one with a better asymptotic time, is given in [Zur]. The approach was used in the design of the arithmetic unit of the VPIAC [SSJb, SSJa] and of the VLPA coprocessor designed for the Transmogrifier (TM) [Hsu]. More details on the implementation of the arithmetic circuits of these machines are given next.

The VLP multiplication is done in two different ways: computing and accumulating long-precision partial products by rows [SSJb, Sch], or computing a particular digit of the result by columns [Hsu].

In the VPIAC [SSJb, SSJa], the multiplication algorithm is a variation of the classical algorithm used in software. An $n \times n$ bit multiplier (digit of $n$ bits) generates the digit multiples, which are shifted and added to previously accumulated products using an internal long accumulator. The figure "Multiplication method in the VPIAC" illustrates this algorithm. The words of the multiplicand are accessed from the least to the most significant digit. Each digit of B is multiplied by one digit of A. If the number of digits in the multiplicand is even, as shown in the figure, the digits of A with even indexes are multiplied first, i.e., $A_4 B_4$ and $A_2 B_4$. If the number of digits in the multiplicand is odd, odd indexes are used first. The use of a long accumulator increases performance and reduces data transfers at the cost of more area; however, this feature also limits the maximum precision to the length of the accumulator.

2

The longprecision multiplication in this pro cessor takes m m cycles where

m is the number of op erand digits The duration of each cycle dep ends on the

precision of the digit size The cycle times for a bit bit and bit units are

ns ns and ns resp ectively

The design presented in [Hsu] uses a different approach. It keeps inside the chip the partial accumulation of the digits in each column of the partial-product matrix. The precision of the required adder is $\lceil \log_2 m \rceil + 2n$ bits, representing two

nbit digits and some extra space for carries The partial pro ducts of column i are

generated by multiplying the words x and y for k from to i and i m

ik k

[Figure: Multiplication method in the VPIAC. The multiplicand digits A1 to A4 are multiplied by the multiplier digits B4, B3, B2, and B1, one row per multiplier digit; in each row the products with even-indexed digits of A (A2, A4) are formed before those with odd-indexed digits (A1, A3), and the rows are accumulated into the result digits M1 to M8.]

When m i m then k varies from i m to m Dierent words

of the op erands are read in every step opp osite to what is done in the VPIAC

The p erformance of this algorithm is given as a function of the memory accesses

2

to memory m m cycles with a cycle time of ns

Division in VLP Coprocessors

Hardware long-precision division is usually performed by modifications of the Newton-Raphson method [SSJb, SSJa] or by a digit-recurrence division algorithm [Hsu]. The digit-recurrence division algorithm [EL] generates each digit of the quotient based on the scaled residual

\[ w[j+1] = r\,w[j] - q_{j+1}\,d \]

where $x$ is the dividend, $d$ is the divisor, $w$ is the residual, and $q$ is the quotient digit. Initially, $w[0] = x$. The quotient digit is obtained by a selection function, which is more complex for higher radices. Prescaling of the operands simplifies the selection function but adds extra steps to obtain the multiplication constant and scale the operands.
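The residual recurrence can be traced with exact rational arithmetic. The sketch below is a simplification: it assumes a non-redundant digit set {0, ..., r-1} with selection by exact comparison, whereas hardware units use redundant digit sets and estimates of the residual; the function name is hypothetical.

```python
from fractions import Fraction

def digit_recurrence_divide(x, d, r, n):
    """Produce n radix-r quotient digits of x/d, assuming 0 <= x < d.

    Each step computes w[j+1] = r*w[j] - q_{j+1}*d with w[0] = x.
    Returns the digit list and the final residual.
    """
    w = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(n):
        shifted = r * w
        q = shifted // d          # selection function (exact, non-redundant)
        w = shifted - q * d       # residual update
        digits.append(int(q))
    return digits, w
```

After n steps the identity x = d * (sum of q_j r^-j) + w[n] r^-n holds exactly, which is what makes the residual sufficient state for the next digit.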

Square Root in VLP Coprocessors

The coprocessor proposed in [SSJb, SSJa] computes the square root using an algorithm similar to the Newton-Raphson method described above. The square-root operation is not considered in [Hsu].

On-line Algorithms for VLP Computations

At the conceptual level, all VLP algorithms manipulate a processor word as a high-radix digit. The operations are carried out digit by digit, serially. On-line algorithms [Erc, TE] are the only ones that allow serial computation of all arithmetic operations.

The main reasons for using on-line algorithms for VLP arithmetic are:

• Inputs and outputs are handled serially, most significant digit first, allowing variable-precision operation and overlap between successive operations.

• On-line algorithms work with fixed-point numbers. When the algorithm is used to compute the significand part of a floating-point product, only the most significant digits of the product are computed, without wasting cycles on the least significant digits.

• Circuits for on-line operations have a regular structure, which can be easily extended, and a simple algorithm step.

We first review the concepts of on-line arithmetic, and later we present and discuss the hardware organization and VLP algorithms based on on-line arithmetic.

General Concepts and Scheduling Strategies

The basic ideas and algorithms of on-line arithmetic are presented in [Erc, TE], and a design methodology in [EL]. As mentioned above, the result digits are produced serially, most significant digit first, after a few cycles (the on-line delay $\delta$). On-line multiplication ($Z = XY$), division ($Z = X/Y$), and square root ($Z = \sqrt{X}$) are defined by the recurrence equations shown in the table of recurrence equations, with the following convention for the digit vectors:

\[ X[j] = \sum_{i=0}^{j+\delta-1} x_i\, r^{-i} = X[j-1] + x_{j+\delta-1}\, r^{-(j+\delta-1)} \]

\[ Y[j] = \sum_{i=0}^{j+\delta-1} y_i\, r^{-i} = Y[j-1] + y_{j+\delta-1}\, r^{-(j+\delta-1)} \]

\[ W[j] = \sum_{i=0}^{j} w_i\, r^{-i} \quad \text{and} \quad Z[j] = \sum_{i=0}^{j} z_i\, r^{-i} \]

where $W$ is the scaled residual and any digit is in the set $\{-a, \ldots, a\}$, with $r/2 \le a \le r-1$.

The on-line delays in each expression are slightly different and depend on the radix used. The scaled residual is kept inside bounds by subtracting the output digit $z_j$, which is chosen by a selection function that considers the value of the

present residual and other parameters The input op erands are usually in the range

1

but in some sp ecial cases the op erands are scaled to b e in a dierent range

r

to simplify the algorithm implementation These cases are presented in Chapters

and VLP division and square ro ot

The theory of on-line operations was developed for any radix. However, the implementation of on-line operators is usually proposed for small radices [SG, TE, TE, TE]. The evaluation of an on-line multiply-add operation using high-radix digits is done in [TE]. The work proposes structures that can be easily adjusted to different precisions and shows the best cost-performance


Online multiplication

+1

W j r W j z r x Y j y X j

j 1 j + 1 j + 1

W X Y

Online division

+1 +1

W j r W j z Y j x r y Z j r

j 1 j + 1 j + 1

W X

Online square ro ot

+1

W j r W j z z j z j x r

j 1 j + 1

W X

Table Recurrence equations for online arithmetic

relation for the design of the on-line multiply-add operation. The use of high-radix structures always leads to better performance than small radix. For this reason we concentrate the study on high-radix on-line algorithms and circuits.

Unlike the software algorithms, on-line operators generate results in redundant form, which must be converted to conventional representation. This may be considered a drawback of the method at first. However, the concurrent execution of an on-the-fly conversion algorithm, similar to the algorithm presented in [EL], can minimize the performance impact of the conversion step. The conversion is discussed later.
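The idea of on-the-fly conversion can be sketched as follows. This is one common formulation, given as an illustration rather than the converter used in this work; it assumes radix-r signed digits arriving most significant first, a non-negative overall value, and a hypothetical function name. Two candidate results are kept, Q and QM = Q minus one unit in the last place, so that a negative digit never triggers a carry propagation.

```python
def on_the_fly_convert(signed_digits, r):
    """Convert a most-significant-digit-first signed-digit fraction into
    conventional radix-r digits without carry propagation.

    Assumes the represented value is non-negative.
    """
    q_vec, qm_vec = [], []            # Q and QM = Q - (one unit in last place)
    for d in signed_digits:
        # Append to Q if the digit is non-negative, otherwise borrow via QM.
        new_q = q_vec + [d] if d >= 0 else qm_vec + [r + d]
        # QM follows Q when d > 0, otherwise it keeps extending QM.
        new_qm = q_vec + [d - 1] if d > 0 else qm_vec + [r - 1 + d]
        q_vec, qm_vec = new_q, new_qm
    return q_vec
```

Each step only appends one digit to a copy of Q or QM, so the conversion can run concurrently with the digit generation.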

The recurrence equations in the table indicate the use of two digit-by-vector multiplications and long-precision additions. The vectors obtained by the digit-by-vector multiplications, such as $x_{j+\delta}\,Y[j]$ and $y_{j+\delta}\,X[j]$, must be properly aligned by shifting the digits by a fixed amount before the long-precision addition takes place. Variables such as $Y[j]$ and $X[j]$ represent vectors that increase in length as new digits are received. Conventional on-line operators use Append Registers to store these vectors. We propose later another method to handle incoming digits in VLP operations.

The discussion that follows presents two alternatives to implement on-line VLP operations, which are applicable to all on-line operations. The first technique considers the use of digit slices to obtain the long-precision result. The second technique consists of computing the on-line recurrence equation using serial modules.

The use of digit-slices

The figure below shows the space complexity of the recurrence equation computation. The parallelograms represent the digit-by-vector computations and the rectangles represent the long-precision additions.

When there is not enough hardware to implement the full precision of the operands, only a section of the total space is implemented in hardware ($d$ radix-$2^n$ digit slices). The input operands are received serially, digit by digit. While the input precision is less than $d$, the section is used once; for input precision in the range from $d$ to $2d$, the section is used twice, and so on. Registers are placed in the circuit in order to store the intermediate bits (carries) that are generated from one activation of the section to the next.

[Figure: Spatial representation of on-line recurrence equation computation. The section implemented in hardware covers part of the two digit-by-vector multiplications and of the additions.]

To illustrate the approach consider the case of n radix digits as shown


in Figure Vector A represents the pro duct y X and B represents x Y W

j +3 j +3

represents W j and W represents the W j C is the carry out digit of each

i

section A most signicant digit rst MSDF accumulation of the residual is done

in this case

[Figure: Multiple-precision computation of the recurrence equation. The residual digits W, the five digits A1 to A5 of vector A, and the six digits B1 to B6 of vector B are added in three-digit sections, from right to left; each section produces three digits of the new residual and a carry-out digit C that is passed to the next section, and the output digit z_j is applied in the most significant section.]

A general structure of the on-line multiplication slice is shown in the figure below. Thicker lines represent paths more than one digit wide. A group of $d$ digits of X, Y, or W is read at once and applied to the group of digit-slices (section) before one computation cycle begins. Registers (dark boxes) are used to store the carries from one iteration to the next. Sections of multipliers and adders are represented as a box with a marked corner. These components are not complete functional operators; they are part of larger operators.

[Figure: Digit-slices for VLP multiplication. Groups of d digits of X, Y, and W, together with one digit of X and Y, feed a slice of the operator that produces the digits of the new W.]

Serial Computation of Recurrence Equation

This technique considers digit operators in high radix to compute the on-line recurrence equation serially, instead of using digit slices. Either conventional or on-line serial modules can be used, and the residual can be computed in LSDF or MSDF mode.

The output digit $z_{j-1}$ used in the recurrence equation is selected from the most significant digits of the residual. The LSDF mode of operation will force the output digit selection to the end of the serial computation of the residual. However, the iteration to compute the next residual cannot begin until the output digit is selected based on the present residual, and this dependency will slow down the computation.

The MSDF mode of operation uses on-line units. The most significant digits of the residual are computed first, leaving more time to select the output digit before the next iteration is about to begin.

By construction, all recurrence equations created for on-line computations use digit-by-word multiplication, addition/subtraction, and shifting. The digit-by-word multiplier has a fixed area, which depends only on the digit size. The time to obtain the result depends on the precision of the vector containing the previously received digits, which can be as large as needed. The complexity of the circuit is adjusted based on the radix of the digit being used. Addition and subtraction also have an area dependent only on the radix of the digits being added.

The network of serial modules for the particular case of the recurrence equation used in multiplication is presented in the figure below. The multiplication nodes are serial digit-by-vector multiplier modules. As shown in the figure, the multiplication nodes are composed of a digit-by-digit multiplier, a delay block (shaded box), and a serial adder (in the figure, an on-line adder). One of the inputs is kept at a fixed value during one recurrence equation iteration (digits $x_{j+\delta-1}$ and $y_{j+\delta-1}$). The digits stored in the digit vectors $X[j]$ and $Y[j]$ (each digit represented in the figure as $X_i[j]$ and $Y_i[j]$) are fed serially into the multipliers. The same notation is used for the scaled residual. The output digit $z_j$ is used in only one cycle to correct the residual. More details of this circuit are given in the chapter on the VLP multiplier.

This strategy has the following advantages over the previous one:

• The network that implements the recurrence equation is composed of standard modules (serial-parallel multipliers, adders/subtractors, and delay blocks) and not of slices of operators.

• The basic blocks that constitute the data path to compute the recurrence equation can be easily made high-radix. This increases the speed of this type of circuit to the maximum allowed by the communication bandwidth between the data path and the memory elements. For digit slices, the implementation of high radix can be very complex.

• When using on-line modules in the data path, the selection of the output digit (selection function) is going to be in the critical path only at the beginning of the operation, when the precision of the residual is small. Later there will be enough time to compute the output digit while the network is computing other digits of the recurrence equation.

[Figure: On-line computation of the recurrence equation for multiplication, showing the serial digit-by-vector multipliers (digit multiplier, on-line adder, and short append register) with inputs $X_i[j-1]$, $Y_i[j]$, $y_{j+\delta-1}$, and $x_{j+\delta-1}$, the residual digits $W_i[j-1]$ and $W_i[j]$, the output digit $z_j$, and the selection block.]

The on-line algorithm using these scheduling strategies will have an asymptotic time complexity of $O(m^2)$ steps, where $m$ is the number of operand digits.

Summary

In this chapter we discussed the software and hardware algorithms commonly used for long-precision computations. There are software algorithms that have a better asymptotic time than the algorithms used in hardware for long precision. The hardware algorithms for VLP arithmetic have been based on conventional and simple software algorithms to avoid complexity in the implementation. We have also discussed the choice of on-line arithmetic for our VLP algorithms.

The numbers of cycles for the available hardware implementations are summarized in the following table. The equations show that the asymptotic time of the VLP operations is $O(n^2)$, the same asymptotic time as the on-line algorithms.

Op eration VPIAC SSJa TM Hsu

2 2

Multiplication n n n n

2 2

Division n n n n

2

Square Ro ot n n not implemented

However, the on-line algorithms have important features that allow the implementation of very efficient designs for long-precision computation. We explore the use of serial modules to compute the on-line recurrence equations, based on the discussion presented in the previous section. VLP operation using on-line arithmetic will allow the overlap of operations between the host and the coprocessor, increasing the performance of the host-coprocessor architecture. When more than one coprocessor is used, overlap of VLP operations executed in different processors is also possible. The algorithms for VLP multiplication, division, and square root are described in the next chapters.

Addition and subtraction are not considered for implementation in the coprocessor for the following reasons:

• The time complexity of addition and subtraction is linear. The gain in using the hardware for these operations over the software implementation at the host level is minimal. The algorithm is not complex, and the host is very efficient at performing addition of integers.

• The on-line approach for addition has no advantage over the traditional right-to-left addition with carry propagation. The on-line adder generates the digits in redundant format, which requires a carry-propagate addition at the end to convert to the conventional number system.

Chapter

VLP Multiplier

This chapter presents the design of a VLP multiplier that uses the on-line arithmetic approach [Erc, TE]. The use of the on-line arithmetic method to implement VLP operations was already justified in the previous chapter.

We first present the VLP multiplication algorithm considering the operation over long-precision integers, where the product has twice the precision of the operands. Then we discuss the case of the multiplication of significands of floating-point numbers, where the output and operand precisions are the same. The main issues in the implementation of this particular algorithm are presented and discussed.

The VLP Multiplication Algorithm

On-line multiplication ($Z = XY$) is defined by the following recurrence equation [EL]:

\[ W[j] = r\,W[j-1] - z_{j-1} + r^{-\delta+1}\left( x_{j+\delta-1}\,Y[j] + y_{j+\delta-1}\,X[j-1] \right) \]

with

\[ Y[j] = \sum_{i=0}^{j+\delta-1} y_i\, r^{-i} \quad \text{and} \quad X[j] = \sum_{i=0}^{j+\delta-1} x_i\, r^{-i} \]

where $x_i$, $y_i$, and $z_i \in \{-a, \ldots, a\}$, with $r/2 \le a \le r-1$.

The product $Z[j] = \sum_{i=1}^{j} z_i\, r^{-i}$ is obtained in redundant form. The variable $W$ is called the partial residual. The initial condition for the algorithm is $W[0] = X[0]\,Y[0]$. Partial products are added in each iteration to the scaled residual $rW$, which is kept inside bounds by subtracting the output digit $z_j$. The on-line operands $Y[j]$ and $X[j]$ increase in precision as new digits are received.
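The behaviour of this kind of recurrence can be checked with a short exact simulation. The sketch below is illustrative only and makes several assumptions that are not taken from the text: an on-line delay of 3 (so that the initial residual uses two digits of each operand), selection of the output digit by rounding the scaled residual to the nearest integer instead of the truncation and recoding scheme used in this chapter, and an indexing in which output digit z_i has weight r^-(i+1).

```python
from fractions import Fraction

def online_multiply(x, y, r, delta=3):
    """Simulate an on-line multiplication recurrence with exact rationals.

    x, y: equal-length lists of signed digits x_1..x_m (value = sum x_i r^-i).
    Returns (z, W): the 2m output digits and the final residual.
    """
    m = len(x)
    lead = delta - 1                       # operand digits consumed up front

    def digit(v, i):                       # 1-based access, zero past the end
        return v[i - 1] if 1 <= i <= len(v) else 0

    X = sum(Fraction(digit(x, i), r**i) for i in range(1, lead + 1))
    Y = sum(Fraction(digit(y, i), r**i) for i in range(1, lead + 1))
    W = X * Y                              # initial residual
    z = []
    for j in range(1, 2 * m + 1):
        z_prev = round(r * W)              # selection by rounding (assumption)
        z.append(z_prev)
        x_new, y_new = digit(x, j + lead), digit(y, j + lead)
        Y_next = Y + Fraction(y_new, r ** (j + lead))
        # Scaled residual minus output digit, plus the two aligned
        # digit-by-vector products.
        W = r * W - z_prev + Fraction(x_new * Y_next + y_new * X, r**lead)
        X = X + Fraction(x_new, r ** (j + lead))
        Y = Y_next
    return z, W
```

With exact arithmetic the invariant X[j] Y[j] = sum of z_i r^-(i+1) + r^-j W[j] holds at every step, so after 2m steps the output digits carry the full double-precision product and the residual is zero.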


Algorithm corresp onds to the VLP multiplication algorithm The op erands

precision in digits is represented by m Each digit d is in the digit set D

i

n

f g with r and r The online delay assumed in the algorithm

is cycles We use output digit selection by truncation TE This selection

function generates overredundant output digits that must b e reco ded in order to

generate output digits in the set D The algorithm generates the pro duct of m

digits doubleprecision

The RE C O D E function converts the output formed by digits z

j

fr r g into a stream formed by z f g The analysis of

j

the digits that form z shows that W j f g and W j f g The

0 1

j

sequence of overredundant output digits z can b e seen as two separate sequences

j

W W W m

0 0 0

and

W W W m

1 1 1

An online adder reduces these two sequences to one that corresp onds to the vector

of reco ded values z The reco der circuit is describ ed in Chapter

j

The leading digits of the product are obtained during the recurrence iterations. The remaining least significant digits are in the residual vector, such that the final result is represented as

\[ Z = z_1 \ldots z_{m-2}\; W_1[m] \ldots W_{m+2}[m] \]

A variation of the on-the-fly conversion method [EL] is used to transform the output digit set from redundant to non-redundant (NR) representation. Further details are discussed later.

Data Path for VLP Multiplication

As discussed in the previous chapter, the data path for VLP algorithms is implemented as a network of serial modules capable of performing digit-by-vector multiplication,

Algorithm VLP Multiplication Algorithm

Initialization

a input digits x x y and y and compute the initial residual

1 2 1 2

P P

2 2

(i+j )

W x y r

i j

i=1 j =1

b initialize the product

z r W W

0 1

0

z

1

where W and W represent the most signicant digits of the residual one

0 1

integer and one fractional digit

c compute the scaled residual P

P r W z

0

Iteration for j to m

a compute serial ly the digits of the next residual as

2

W j P j MA MB r

where MA and MB correspond to the vectors x Y j and y X j

j +2 j +2

respectively

b during the serial computation execute

i Selection of output digit

z r W j W j

0 1

j

ii Computation of scaled residual P

P j r W j z

j

iii Reco ding z RE C O D E z z z

j 2

j j 1 j 2

Obtain the last two reco ded digits z and z assuming z z

m2 m3

m1 m


addition, or subtraction. It is more adequate to implement the network using on-line operators: the sooner the most significant digit of the residual comes out, the sooner the selection of the output digit can be done. This feature also allows the utilization of multiple networks, each one working on a different iteration.

The data path for VLP multiplication is shown in the figure below. The on-line adder considered in this work is described in [DMV, TE] and in a separate section. The digit-by-vector multiplier (DV multiplier), combined with on-line adders, constitutes the main data transformation network to implement the serial computation of the recurrence equation for VLP multiplication. Other modules shown in the figure will be discussed later.

[Figure: Data path for VLP multiplication. Registers a, b, c, d, and e feed two serial digit-by-vector multipliers with inputs $X_i[j-1]$, $Y_i[j]$, $y_{j+\delta}$, and $x_{j+\delta}$; on-line adders combine their outputs with $W_i[j]$ to produce $W_i[j+1]$, and the output passes through the recoder, a multiplexer, and a BS-to-NR converter to deliver $z_j$.]

The DV multiplier circuit, shown in the figure on the digit-by-vector multiplier, works in on-line mode. One $n$-bit digit of each operand is inserted into the multiple generation and reduction block. The inputs of the multiplier are assumed to be in non-redundant two's complement format (NR), since the operands are signed-digit vectors. This format uses the least number of bits, making it more adequate for storage and transmission. During the operation of this component one of the inputs is kept fixed while the other changes in each clock cycle (vector digits). The product of a digit $x$ and a vector $Y$ is computed serially as follows:

\[ xY = x\,(y_1\, y_2\, y_3 \ldots y_k) \]

\[ xY = (xy_1\; xy_3\; xy_5 \ldots) + r^{-1}\,(xy_2\; xy_4 \ldots) \]

where each $xy_i$ corresponds to two digits. The addition is done serially, and since both digits of the product $xy_i$ are generated at the same time, it is necessary to delay the least significant digit before it is applied to the serial adder. This operation is performed by the digit alignment section.
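The role of the digit alignment can be illustrated in software. In this sketch, which assumes plain integer digits rather than the BS-coded digits of the actual circuit and uses hypothetical function names, every two-digit product is split into a high and a low digit; the low-digit stream has to be shifted by one position relative to the high-digit stream before the two are added.

```python
from fractions import Fraction

def dv_multiply_streams(x, y_digits, r):
    """Split each two-digit product x*y_i into high and low digit streams."""
    hi, lo = [], []
    for y in y_digits:
        h, l = divmod(x * y, r)       # x*y_i = h*r + l
        hi.append(h)
        lo.append(l)
    return hi, lo

def streams_value(hi, lo, r):
    """Value of xY, with Y = sum y_i r^-i (i starting at 1): the high digit
    of product i has weight r^-(i-1), the low digit has weight r^-i."""
    total = Fraction(0)
    for i, (h, l) in enumerate(zip(hi, lo), start=1):
        total += Fraction(h, r ** (i - 1)) + Fraction(l, r**i)
    return total
```

For x = 7 and Y = 0.394 in radix 10, the products 21, 63, and 28 give the streams (2, 6, 2) and (1, 3, 8); adding them with the one-position offset gives 2.758 = 7 x 0.394.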

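[Figure: Digit-by-vector multiplier in on-line mode. Two n-bit NR inputs, X and Y, enter the multiple generation and reduction block (using CSAs); its carry-save (CS) output is converted from CS to BS, the most significant and least significant BS digits are separated by the digit alignment section, and an on-line adder delivers the BS result.]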

The multiple generation and reduction block is basically a linear array multiplier that is implemented using radix-4 Booth recoding and carry-save adders (CSA) to accumulate the partial products. Either a linear array or a tree of adders can be used for partial product reduction. The linear array is not as fast as the tree structure; however, it is regular and easy to implement for different operand sizes. A slice of the multiple generation and reduction structure is shown in the figure below. Using Booth recoding, the bit vector of one operand is recoded into signed digits in radix 4. These recoded digits are used to generate the proper multiple of the other operand. The multiple is then added to the partial product that was accumulated in the layers above.
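Radix-4 Booth recoding and multiple generation can be sketched as follows. This is the standard textbook recoding written as ordinary integer code, with hypothetical function names; it assumes an even operand width n and two's complement operands, and it adds the multiples directly instead of using carry-save adders.

```python
def booth_radix4(y, n):
    """Recode an n-bit two's complement integer y (n even) into n/2
    signed digits in {-2, ..., 2}, least significant digit first."""
    bits = [(y >> i) & 1 for i in range(n)]
    prev = 0                              # bit to the right of the current pair
    digits = []
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits

def booth_multiply(x, y, n):
    """Accumulate the multiples of x selected by the recoded digits of y."""
    return sum((d * x) << (2 * k) for k, d in enumerate(booth_radix4(y, n)))
```

Each recoded digit selects one of the multiples 0, x, 2x, -x, or -2x, which corresponds to the (zero, shift, neg) controls of the partial-product generators, and halves the number of rows to be reduced.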

[Figure: One layer of the reduction structure. A radix-4 Booth recoder driven by the bits of y produces the (zero, shift, neg) controls for the partial-product generators (PG), which select the multiple of the bits $x_{n-1} \ldots x_0$; a row of CSAs adds this multiple to the accumulated partial product coming from the layers above.]

In a regular multiplier the CS value obtained from the reduction block would be converted to NR representation before being used by the next module. However, the module that uses the output of the reduction block is an on-line adder, which is designed for inputs in BS code. Thus the reduction block output is converted from CS to BS code directly. The CS to BS converter is discussed in a separate section. It is a circuit that does not have carry propagation, and thus has a delay that does not depend on the digit size.

Since the output of the reduction block is used by the on-line adder, which uses BS code, signed-digit adders (SDA) would be a natural choice to avoid conversion of the output. However, the area of the multiplier would increase with the use of

these adders Each SDA uses full adders FA while each CSA uses FA p er

bit Using Bo oth reco ding an bit multiplier would need FAs

using CSAs and FAs using SDAs Bo oth reco ding and multiple

generation would consume an area of FAs On the other hand

with CSAs it is necessary to have the CSBS converter which adds CLBs to the

multiplier area Thus the area of the multipier using SDAs is

larger than the area of the multiplier using CSAs

The on-line adder works with single digits, and the output of the reduction block has two digits of precision. So the two-digit product representation must be transformed into two separate digit representations before it is used by the on-line adder. Using BS code, the product can be more easily split into two separate

digit representations Let x and y b e two signed digits in

10 10

radix The pro duct xy

10 2

where the second representation corresp onds to twos complement of the pro duct

and the last representation corresp onds to BS co de representation of the same

number It can b e seen that splitting the BS representation into two halves results

in and that corresp onds to the value

Besides the main components for data transformation, other components used in the data path are the recoder circuit, the multiplexer, and the converter. The recoder circuit is a simplified on-line adder, described in its own section. The multiplexer is used at the end of the VLP operation, when the digits in the residual vector are transferred to the result vector. The conversion from BS code to NR representation is done before the digits are stored into the result memory element. The BS to NR converter is also discussed in its own section.

The minimum delays of the data path are shown in the figure below. The minimum delay is caused by the use of serial modules in the data path network. The serial digit-by-vector multiplier has a delay of 3 cycles (2 cycles in the on-line adder and 1 cycle in the digit alignment section between the CS-BS and on-line adder modules).

[Figure: Data path delays. Each serial digit-by-vector multiplier is marked with 3 cycles and each on-line adder with 2 cycles; the path from the X and Y inputs is marked with 7 cycles and the path from the W input with 4 cycles.]

Data Arrangement for Serial Computation

The vectors used in VLP multiplication are:

• X and Y are the operands, composed of $m$ fractional digits. The vector position $i$ contains $x_i$ and $y_i$, respectively. A pointer named opt indicates the digit being considered at step $j$.

• The Z vector holds the output digits generated by the RECODER. In order to avoid using another pointer just for this vector, opt may also be used to reference the vector position of the last written output digit.

• W is the vector that holds the residual digits. At some point in time the vector will hold some digits of $W[j]$ and some digits of $W[j+1]$. The pointer p is used to read digits of the residual $W[j]$. The same pointer, less an offset, is used to write digits of $W[j+1]$ into the vector.

The digit vectors are shown in the figure below.

[Figure: Data vectors for VLP multiplication. The vectors X, Y, Z, and W are drawn against digit positions 0 to 4 relative to the fractional point, with the pointers p and opt; Z is offset by the on-line computation delay plus the RECODER delay, and W by the on-line and network delay.]

In order to simplify the manipulation of the long-precision digit vectors, the alignment of values is performed by shifting the digit vectors relative to each other. The amount of shifting depends on the recurrence equation and on the delay of the arithmetic modules used in the VLP multiplication data path. The case shown in the figure considers a non-pipelined data path; the use of pipelining is discussed later. Considering the recurrence equation,

b oth X and Y should b e aligned and the displacement b etween W and X

or Y is given by However as the paths involving X Y and W have a

dierence of cycles W must b e inserted cycles after the digits in X Y b egin

to b e inserted in order to get a delay of cycles b etween them as shown in

Figure One integer digit is kept for vectors X and Y b ecause a redundant representation can b e used


Serial Computation of the Residual

The following algorithm gives the steps required to manipulate the data path and data vectors in order to execute the serial computation of the residual, which is the main part of the VLP multiplication operation. The algorithm is described using pseudocode. In the pseudocode we refer to the data vectors as done in C code: W[i], for example, is the vector position with index i. We assume that the registers a and d contain the value of the operand digits $x_{j+\delta-1}$ and $y_{j+\delta-1}$. This condition must be true after the initialization phase and after each iteration.

Pip elined Data Path

All comp onents in the data path use redundant number representation that

allows the implementation of op erators with xed delay indep endent of the digit

radix that is b eing used The delay of these comp onents can b e as low as input

LUT delay plus interconnect and FF delay

Table shows the maximum number of pip eline stages that can b e added to

each comp onent of the data path We use the term added b ecause the online

adder and the digit by vector multiplier DV already have registers in the non

pip eline implementation In this evaluation we assume that input LUT FPGAs

are used The use of k input FPGAs with k would incur in more logic levels

and more stages could b e created

The data path already has a delay of cycles in the nonpip elined implementa

tion The total delay of the pip elined version is p cycles The maximum degree

of pip elining is computed as

n

p

max

n

The data path has a critical path delay of b c cycles In the case of signed

2

16

digits in radix bits are used and the delay is cycles with p

max When using a deeply pip elined structure it is p ossible to start inserting data of


Algorithm Serial computation of the residual for VLP Multiplication

p

loop while p opt comment is the delay of the nonpipelined data path

network

if p opt read X p and store into register b otherwise clear register b

if p opt read Y p and store into register c otherwise clear register c

if p opt store Xp into register d and Yp into register a store the

input digits for next iteration corresponding to x and y

j + 1 j + 1

read W p and store into register e

if p store output of the data path into the RECODER circuit most

signicant digit of z comment RECODER circuit is dened in section

j

if p store output of the data path into the RECODER circuit second

digit of z

j

if p write the output of the data path to W p comment the

number comes from the delay of the path between W j and W j

i i

plus eect of scaling W by r in each iteration that corresponds to one

more cycle

increment p

adjust pointers to vectors opt opt


Comp onent Stages

DV Multiplier BR

DV Multiplier PP

n

c for nbit digits DV Multiplier reduction b

2

DV Multiplier CSBS converter

Online adder

Table Number of stages in the digit by vector multiplier

the next iteration some cycles after the data for the present iteration is applied

Dep endencies in the path force the insertiong of a bubble of cycles There is a

cycle dep endency on the DV multiplier digit alignment and cycle dep endency

in each online adder So after the digits of iteration j are inserted zeros are

applied for cycles at the DV multiplier inputs b efore the digits of iteration j

b egin to b e used as inputs The longest pip elined path in the network used for

VLP multiplication is shown in the gure

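[Figure: Pipelined Data Path. The longest path runs from the X and Y inputs through the stages of the DV multiplier (Booth recoding, product generation, reduction, CS-to-BS conversion, and digit alignment) and then through three on-line adders, with the number of pipeline stages marked under each block.]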

VLP Multiplication with Precision less than 2m

The multiplication of two integer operands with $m$ digits generates a full-precision result with $2m$ digits. When working with fractions it is sometimes desirable to obtain a product that has fewer than $2m$ digits, usually with the same precision as the operands ($m$ digits). In this case, the computation of the full-precision result followed by rounding can imply twice the work that would in fact be required.

The discussion on long-precision computation in Knu states that it is possible to reduce the amount of computation by discarding multiples that are not going to affect the result significantly. Assume that

\[ x = \sum_{i=1}^{m} x_i r^{-i}, \qquad y = \sum_{i=1}^{m} y_i r^{-i}, \]

with $x_i, y_i \in \{0, 1, \ldots, r-1\}$ for radix $r$. Multiples that belong to the same column $p$ in the multiplication matrix are generated by the multiplication of digits $x_i$ and $y_j$ such that $i + j = p$. Consider the situation when the multiples in columns $p > m + k$, with $k < m$, are removed from the final product. It is possible to prove that the maximum error $\epsilon_k$ in this truncation process satisfies

\[ \epsilon_k < (m-k)\, r^{-(m+k-1)} \]

The multiples that are discarded for p in a x multiplication are shown in Figure

Figure: Truncated multiplication result (multiplication matrix with operand digit positions $m$ and output columns $p$)


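Proof: in the worst case every discarded multiple has the maximum value $(r-1)^2$, and column $p$ (for $p > m$) holds $2m - p + 1$ multiples of weight $r^{-p}$. Writing $i = 2m - p + 1$, the removal of the columns $p > m + k$ generates the following maximum error:

\[ \epsilon_k = \sum_{i=1}^{2m-(m+k)} i\,(r-1)^2\, r^{i-1}\, r^{-2m} \]

\[ = \frac{(r-1)^2}{r}\, r^{-2m} \sum_{i=1}^{m-k} i\, r^{i} \]

\[ = \frac{r\,\{1 - (m-k+1)\, r^{m-k} + (m-k)\, r^{m-k+1}\}}{(r-1)^2} \cdot \frac{(r-1)^2}{r}\, r^{-2m} \]

\[ = r^{-2m}\,\bigl(1 - (m-k+1)\, r^{m-k} + (m-k)\, r^{m-k+1}\bigr) \]

Based on the fact that

\[ r^{-2m}\,\bigl(1 - (m-k+1)\, r^{m-k} + (m-k)\, r^{m-k+1}\bigr) < (m-k)\, r^{-m-k+1} \]

for $k < m$, the following upper bound on the maximum error is obtained:

\[ \epsilon_k < (m-k)\, r^{-m-k+1} \]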
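The bound above can be checked numerically. The following Python sketch is only an illustration, not part of the original work: the function names are hypothetical, and it assumes non-negative digits with maximum value $r-1$ and fraction digits indexed from 1 to $m$. It enumerates the discarded multiples in the worst case and compares the exact sum with the closed form and with the upper bound.

```python
from fractions import Fraction

def max_truncation_error(m, k, r):
    """Worst-case error when all multiples x_i*y_j with i + j > m + k are dropped.

    Every digit is assumed to take its maximum value r - 1 (an assumption of
    this sketch), so each dropped multiple contributes (r-1)^2 * r^-(i+j).
    """
    return sum(Fraction((r - 1) ** 2, r ** (i + j))
               for i in range(1, m + 1)
               for j in range(1, m + 1)
               if i + j > m + k)

def closed_form(m, k, r):
    """r^(-2m) * (1 - (m-k+1) r^(m-k) + (m-k) r^(m-k+1)), the last line of the proof."""
    n = m - k
    return Fraction(1 - (n + 1) * r ** n + n * r ** (n + 1), r ** (2 * m))

def upper_bound(m, k, r):
    """(m-k) * r^-(m+k-1), the bound stated for the truncation error."""
    return Fraction(m - k, r ** (m + k - 1))
```

Exact rational arithmetic is used so that the comparison is not affected by floating-point rounding.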

Notice that $k$ is related to the output digit position, not the input digit. It would be better to associate the error with the input digit position. Assuming that the multiplication algorithm starts to disregard multiples of digits of $X$ and $Y$ after input digit position $t$, the algorithm should discard the multiples in the columns to the right of column $2t$, so the first discarded column is $p = 2t + 1$. Since, from the previous expressions, $p = m + k + 1$, we have the following relation:

\[ p = m + k + 1 \;\Rightarrow\; k = p - m - 1 \]

\[ \epsilon_k < (m-k)\, r^{-(m+k-1)} = (2m - p + 1)\, r^{-(p-2)} \]

As $p = 2t + 1$, we get the error based on the input position $t$ as

\[ \epsilon_t < 2(m-t)\, r^{-(2t-1)} \]

Based on this maximum error, we may now obtain the minimum number of operand digits that are required to obtain a requested output precision.


Truncation Point to Satisfy Output Precision

For VLP computation, rounding results is meaningless. If the error of a truncated result is not acceptable for $m$ fractional digits (it is $r^{-m}$), a greater precision is used and the error can be reduced as much as desired. For this reason, we do not consider rounding problems in this thesis.

So, when the requested output precision is $m'$ ($m \le m' \le 2m$), we want to determine the number of input digits such that the elimination of multiples does not cause an error greater than the truncation error of the result with $m'$ fractional digits. The following relation must be satisfied:

\[ \epsilon_t < 2(m-t)\, r^{-(2t-1)} \le r^{-m'} \]

Since $t \ge m/2$, we obtain

\[ 2(m-t) \le m \]

and thus

\[ \frac{r^{-m'}}{r^{-(2t-1)}} \ge m \]

that implies

\[ 2t - 1 - m' \ge \log_r m \]

\[ 2t \ge m' + 1 + \log_r m \]

\[ t \ge \frac{\log_r m + m' + 1}{2} \]

As the relation l og m is true for large r and m the error b ound is

r

satised for

m

e t d

For the values of r considered in this thesis ab ove the value of the maximum m

that satises the condition is reasonably large m digits that corresp onds

max

to bits of output precision

So, for all practical purposes, we use the truncation point given in the equation above as the input position for truncation when the output precision is $m'$.
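The relation between the truncation point and the requested precision can be explored with a small Python sketch. It is an illustration under stated assumptions: the error bound is taken as $2(m-t)\,r^{-(2t-1)}$, the function names are hypothetical, and any radix used with it (for example $r = 2^7$) is just an example value. The first function searches for the smallest $t$ that satisfies the exact inequality; the second evaluates the sufficient condition $t \ge (\log_r m + m' + 1)/2$.

```python
import math

def min_truncation_point(m, m_out, r):
    """Smallest input position t with 2(m-t) * r^-(2t-1) <= r^-m_out.

    The inequality is multiplied through by r^(2t-1) * r^m_out so that only
    integers are compared (no floating-point error).
    """
    for t in range(1, m + 1):
        if 2 * (m - t) * r ** m_out <= r ** (2 * t - 1):
            return t
    return m  # not reached: t = m always satisfies the relation

def sufficient_truncation_point(m, m_out, r):
    """Truncation point from the sufficient condition t >= (log_r m + m' + 1) / 2."""
    return math.ceil((math.log(m, r) + m_out + 1) / 2)
```

The sufficient condition is never smaller than the exact minimum, which is why the simpler closed form can be used in practice.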

VLP Multiplication Algorithm for Truncated Results

The effect of disregarding multiples is obtained in the VLP multiplier by making a slight modification in the original algorithm. The idea is to reduce the precision of the working vectors one digit at a time. Assume that the algorithm starts to truncate the multiples after input digit position $t$. At step $j$ of the VLP multiplication, the algorithm computes the multiples $x_i y_j$ with $i + j \le 2t$. The input precision is $m$ digits. The difference between this algorithm and the one for full precision resides in the computation of the vectors $MA = x_{j+\delta-1} \cdot Y$ and $MB = y_{j+\delta-1} \cdot X$. Instead of using a vector $Y[j]$ whose precision grows with $j$ for all iterations, at step $j$ $MA$ is computed as

\[ MA = \begin{cases} x_{j+2} \cdot Y[j] & \text{if } j \le t \\ x_{j+2} \cdot Y[2t-j] & \text{otherwise} \end{cases} \]

and $MB$ is computed as

\[ MB = \begin{cases} y_{j+2} \cdot X[j] & \text{if } j \le t \\ y_{j+2} \cdot X[2t-j] & \text{otherwise} \end{cases} \]

where $X[j]$ and $Y[j]$ are defined as in the full-precision algorithm.

The description shows that, after input digit $t$, the precision of the stored operands is reduced by one digit per iteration.

The implementation of this method is reflected in the serial computation of the recurrence equation only by the use of a new pointer for the main loop instead of opt. The pointer works like opt while the precision increases, and starts to be decremented once step $j$ passes $t$.

The simulation of the algorithm execution is shown in the following figure. The shaded area shows the gain over full-precision computation.


Figure: Variable output precision operation of the VLP multiplier ($r = 2^7$). The simulation trace lists the operands $X$ and $Y$, the truncation point, the residual vector $W$ at each step, and the output digits $Z_j$.

Gain in Performance

The number of multiples generated in a full-precision computation is $m^2$, assuming operands of the same size ($m$ digits).

Considering m m m for m m m all multiples are

computed the number of input digits required is

m

e t d

that corresp onds to a truncation of multiples starting at output fractional p osition

m t The number of multiples that are computed in this case as a function of

m is

2

m m m m

2

M m

The gain of this truncation method, which obtains only the digits required for rounding, over the full-precision computation followed by rounding is

\[ S = \frac{m^2}{M} \]
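A short Python sketch can make the speedup concrete. It is only an illustration: it assumes that truncating after input position $t$ keeps exactly the multiples $x_i y_j$ with $i + j \le 2t$, and the function names are hypothetical. The count $M$ is obtained by direct enumeration and can be compared with $m^2$ minus the triangular number of discarded multiples.

```python
def multiples_computed(m, t):
    """Count the multiples x_i*y_j (1 <= i, j <= m) kept when i + j <= 2t."""
    return sum(1 for i in range(1, m + 1)
                 for j in range(1, m + 1)
                 if i + j <= 2 * t)

def multiples_closed_form(m, t):
    """m^2 minus the discarded triangle; valid for m <= 2t <= 2m."""
    n = 2 * m - 2 * t          # number of discarded columns
    return m * m - n * (n + 1) // 2

def speedup(m, t):
    """S = m^2 / M for operand precision m and truncation point t."""
    return m * m / multiples_computed(m, t)
```

For short outputs almost half of the multiplication matrix is skipped, so $S$ approaches 2; as $t$ approaches $m$ the speedup falls toward 1.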

A graph of the speedup obtained for operand precision $m = 250$ and varying output precision $m'$ is shown in the following figure.

Figure: Speedup $S$ of the VLP multiplier with variable output precision over full-precision multiplication ($m = 250$; horizontal axis $m'$, vertical axis $S$ between 1 and 2)

Operands with Different Precision

In long-precision computations it may happen that the two operands have different precisions; let us call them $m_x$ and $m_y$, and assume that $m_x > m_y$. While $j \le m_y$, we use the algorithm already described. After this point, the digits of $Y$ are all zeroes, and the VLP multiplier can be modified to make use of its two digit-by-vector multipliers, so that two output digits are computed in each iteration.

We mo dify the recurrence equation as follows assuming that Y do es not change

from one iteration to the next and the term y of the original equation is zero

j +

W j r W j z r x Y

j j +


W j r W j z r x Y

j +1 j + +1

replacing equation into and doing prop er manipulation we get

2 1 +1

W j r W j z z r r x Y r x Y

j j +1 j + j + +1

the output is comp osed by digits z and z

j j +1

The data path is slightly modified to perform this operation, such that the second multiplier begins to receive data from the input register of the first. Consecutive digits of $X$ are used as the other inputs for the digit multipliers.

This modification makes the circuit operate two times faster than the original one after the last digit of the shortest operand is received.

Execution time of the VLP Multiplier

In this estimate we consider m digit op erands and a result precision m that is

p

in the range m m The op erand precision dictates the number of iterations to

b e executed and the output precision denes the number of digits used in each

iteration

Using a pip elined data path with nonoverlapped op eration the number of

cycles to execute VLP multiplication for an output precision m m m is given

in equation The op erands have integer digit as shown in gure The

initialization of the residual is done with fractional digits The precision increases

until fractional digit t as shown in section when it starts to decrease one

digit p er iteration Once the input digits were consumed digits are obtained

from the RECODER and the remaining m m digits are transferred from

the residual memory to the output using the data path. As in this case the BS-to-NR converter may have a delay that is larger than the coprocessor cycle time, we assume that the conversion of each digit in the residual vector takes $T_{conv}$ coprocessor cycles.

m t

X X

t i p m m T i p T

conv V LP mul

i=t+1 i=2

For our purp oses as explained b efore the value of t is

m

t d e

for m m m When t m the second summation is not computed The

value of t dep ends on the precision and the digit radix used thus care must b e

taken to guarantee that the conditions for truncation are satised

In an overlapped op eration the equation for the number of cycles is

m t

X X

t i p m m T i T

conv V LP mul

i=t+1 i=2

using a bubble of cycles. The term p corresponds to the data path latency for the last iteration minus the bubble already considered in the summation term for the last iteration.

The time for an iteration when the precision of vector Y is i fractional digits

and successive iterations are not overlapped is given as

i p i p if i t

T i

iter

t i p otherwise

Chapter

VLP Divider

This chapter presents an algorithm for VLP division and discusses the main implementation issues. The online recurrence equation is obtained from Tu for the online division operation $Q = N/D$:

\[ W[j] = r\,\bigl(W[j-1] - q_{j-1} D[j-1]\bigr) + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\, Q[j-1]\, r^{-\delta+1} \]

where

\[ N[j] = \sum_{i=0}^{j+\delta-1} n_i r^{-i} \]

\[ D[j] = \sum_{i=0}^{j+\delta-1} d_i r^{-i} \]

\[ Q[j] = \sum_{i=0}^{j} q_i r^{-i} \]

\[ W[j] = \sum_{i=0}^{j} w_i r^{-i} \]

and with the initial condition $W[0] = N[0]$. The online delay for high radix is $\delta = 3$, based on Tu.

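The recurrence equation implies that the value of the quotient digit $q_{j-1}$ used in the iteration to obtain $W[j]$ must already be part of the variable $Q$. It is more convenient for the implementation of the VLP algorithm presented in this chapter that the insertion of the new quotient digit be done at the end of the iteration. This is obtained by the following manipulation of the recurrence equation, using $Q[j-1] = Q[j-2] + q_{j-1} r^{-j+1}$ and $D[j] = D[j-1] + d_{j+\delta-1} r^{-j-\delta+1}$:

\[ W[j] = r\,\bigl(W[j-1] - q_{j-1} D[j-1]\bigr) + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\,\bigl(q_{j-1} r^{-j+1} + Q[j-2]\bigr)\, r^{-\delta+1} \]

\[ W[j] = r\,\bigl(W[j-1] - q_{j-1} D[j-1]\bigr) + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\, r^{-\delta+1}\, q_{j-1} r^{-j+1} - d_{j+\delta-1}\, Q[j-2]\, r^{-\delta+1} \]

\[ W[j] = r\,\bigl(W[j-1] - q_{j-1} (D[j-1] + d_{j+\delta-1} r^{-j-\delta+1})\bigr) + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\, Q[j-2]\, r^{-\delta+1} \]

that results in

\[ W[j] = r\,W[j-1] + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\, Q[j-2]\, r^{-\delta+1} - r\, q_{j-1} D[j] \]

with the initial condition $W[0] = N[0]$. We base our discussion of VLP division on this last equation. It is easier to obtain $D[j]$ than $Q[j]$, and the complexity of the recurrence equation is exactly the same.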

The output digit $q_j$ is selected based on a function of the residual value and the divisor. The selection function is represented as $SEL(W[j], d)$, and for implementation it should consider only a short-precision value of both parameters. Also, when using high-radix digits, the selection function becomes very complex. One solution to simplify the function is to scale the divisor to a value close to 1, as done in EL for digit-recurrence algorithms. When the divisor is scaled, the selection is done based only on the short-precision residual $\hat{W}[j]$, as shown in the equation below. The conditions for this selection function are presented in the section on the selection function.

\[ q_j = SEL(\hat{W}[j]) \]

In the next sections we present the algorithm for high-radix VLP online division, the data path organization, digit selection and timing characteristics.

VLP Division Algorithm

Based on the general model of the reconfigurable coprocessor, the operands, result and residual data are stored in memory elements that are referenced in this section as digit vectors. High-radix signed digits are considered (radix $r = 2^n$). The operands and result have $m$ digits of precision. The vectors used in VLP division are:

- $N$ is the dividend vector and contains one integer digit and $m$ fractional digits. The value of $N$ corresponds to $\sum_{i=0}^{m} n_i r^{-i}$, and the vector position $i$ contains $n_i$, starting from position $i = 0$.

- $Q[j]$ contains all quotient digits generated until and including iteration $j$.

- $D$ is the divisor vector that holds one integer digit and $m$ fractional digits. All digits of the divisor may be present in the memory; however, during the algorithm only a partial view of $D$ is used, referenced as $D[j]$. $D[j]$ represents the most significant digits of the divisor, from the first digit to the one that is being used at iteration $j$, or $d_{j+\delta-1}$.

- $W[j]$ is the residual vector at step $j$.

These data vectors are shown in the figure below. In order to simplify the manipulation of the long-precision digit vectors, the alignment of values is performed by shifting the digit vectors relative to each other. The same pointer p can be used to read all digit vectors. The amount of shifting depends on the recurrence equation and on the delay of the arithmetic modules used in the VLP division data path. The case shown in the figure considers a non-pipelined data path. The use of pipelining is discussed in the section on pipelined operation.

The pointer opt indicates the input operand digit being considered at step $j$. The same pointer is used to insert the quotient digit into $Q$.

The algorithm below shows the steps to perform VLP division. We assume that $N < D$ to avoid overflow.

The $APPEND(Q[j-1], q_j)$ function consists of concatenating the digit $q_j$ to the vector $Q[j-1] = (q_0, q_1, q_2, \ldots, q_{j-1})$, such that the resulting vector is $Q[j] = (q_0, q_1, q_2, \ldots, q_{j-1}, q_j)$. The value of $q_j$ is written to $Q[opt]$.


Algorithm VLP division

Initialization

a W n i

i i

W i

i

b q SEL W

0

c Q

d opt

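Iteration: for $j = 1$ to $m$

a. compute serially the digits of the next residual as

\[ W[j] = r\,W[j-1] + n_{j+\delta-1}\, r^{-\delta+1} - d_{j+\delta-1}\, Q[j-2]\, r^{-\delta+1} - r\, q_{j-1} D[j] \]

b. update quotient vector: $Q[j-1] \leftarrow APPEND(Q[j-2], q_{j-1})$

c. selection of quotient digit: $q_j \leftarrow SEL(\hat{W}[j])$

Final step: update quotient vector: $Q[m] \leftarrow APPEND(Q[m-1], q_m)$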
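The iteration above can be traced in software. The following Python sketch is a behavioral model only, not the serial hardware algorithm. It makes these assumptions, none of which come from the original text as-is: an online delay of $\delta = 3$, a divisor already prescaled to be very close to 1, selection by rounding the full-precision residual instead of a short estimate, and exact rational arithmetic. The name `online_divide` and its parameters are hypothetical.

```python
from fractions import Fraction

def online_divide(n_digits, d_digits, r, steps, delta=3):
    """Model of W[j] = r W[j-1] + n r^(1-delta) - d Q[j-2] r^(1-delta) - r q_{j-1} D[j].

    n_digits[i] and d_digits[i] are radix-r digits of weight r**-i
    (index 0 is the integer digit). Returns the signed quotient digits
    q_0..q_steps and the last residual W[steps].
    """
    def digit(vec, i):
        return vec[i] if 0 <= i < len(vec) else 0

    scale = Fraction(1, r ** (delta - 1))                  # r^(-delta+1)
    # W[0] = N[0] and D[0]: the first delta digits of each operand.
    W = sum(Fraction(digit(n_digits, i), r ** i) for i in range(delta))
    D = sum(Fraction(digit(d_digits, i), r ** i) for i in range(delta))
    Q_old = Fraction(0)            # Q[j-2], initially empty
    q_prev = round(W)              # q_0 = SEL(W[0]), selection by rounding
    quotient = [q_prev]
    for j in range(1, steps + 1):
        k = j + delta - 1          # index of the operand digits entering now
        n_k, d_k = digit(n_digits, k), digit(d_digits, k)
        D += Fraction(d_k, r ** k)                         # D[j]
        W = r * W + n_k * scale - d_k * Q_old * scale - r * q_prev * D
        Q_old += Fraction(q_prev, r ** (j - 1))            # Q[j-1] = APPEND(Q[j-2], q_{j-1})
        q_prev = round(W)                                  # q_j = SEL(W[j])
        quotient.append(q_prev)
    return quotient, W
```

In this model the residual satisfies $W[j] = r^j (N[j] - Q[j-1] D[j])$ at every step, which is a convenient invariant for checking an implementation of the recurrence.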


Figure: Digit vectors for VLP division (vectors $N$, $D$, $Q$ and $W$ with the pointers p and opt; $W$ is displaced relative to the fractional point by the online delay and the network delay)

The VLP division data path shown in the figure below uses basically the same arithmetic operators used in the VLP multiplier data path presented in the previous chapter. Compared to the VLP multiplier, this data path uses a serial subtractor to generate $W_i[j]$, the inputs of the modules come from different memory elements, and the selection function is different.

The relative p osition of digits in each vector is explained in terms of the network

delays and the recurrence equation As digits from vectors D and Q are inserted at

the same height of the network tree the dierence b etween them is only the online

delay based on the recurrence equation In a nonpip elined network the digit

by vector multipliers have a delay of cycles The online adders or subtractor

have a delay of cycles Considering the recurrence equation the residual W must

b e aligned with vector D However the network delay in the path from input D

to the level where W is inserted corresp onds to cycles Thus W must wait for

cycles b efore it is applied to the network That means the digits of W must b e

Figure: Data path for VLP online division (inputs $D_i[j]$, $Q_i[j-2]$, $d_{j+\delta-1}$, $q_{j-1}$, $W_i[j-1]$ and $n_{j+\delta-1}$; serial digit-by-vector multipliers, online adders and subtractor, BS to NR conversion, registers a to f; outputs $W_i[j]$ and, through the selection block, $q_j$ to the $Q$ vector)

displaced p ositions to the right of D In a pip elined implementation the distance

increases by the number of stages inserted in the DV multiplier

Proper control over the data path registers (marked in the data path with small letters) allows the division control circuit to generate zeroes as inputs in the following cases:

- Input $n_{j+\delta-1}$ must hold the digit of the dividend at step $j$ for only one cycle. For the rest of the time this input has a zero value.

- Although all or most digits of the divisor are already in the coprocessor memory, digits that are not included in $D[j]$ should not be loaded into the input register of $D_i[j]$.

Other vectors can be read continuously, since the other digit positions contain only zeroes.

A pseudocode that details the serial computation of the residual in the VLP division algorithm is shown in the algorithm for the serial computation of the residual. We assume that the value in register c is already correct (it contains the value of the present divisor digit).

Selection Function

The traditional selection of the quotient digit is done based on the values of the residual and the divisor. For small radices, a quotient digit is selected by comparing the most significant digits of the residual and divisor with constants. Depending on the range in which they fall, the proper output digit is selected.

When high-radix digits are used, the implementation of the selection function using table lookup is very expensive: there are many constants to compare and the size of the required table is prohibitive. Solutions for the problem were given in EL. The one that we investigate in this thesis is based on scaling the divisor to a range that allows the selection of the quotient digit based only on a rounded value of the residual. This operation is easy to implement.

In order to have convergence conditions satised in the recurrence equation the

op erands must b e prescaled to a predened range The dividend and divisor are

scaled by a constant M such that the scaled divisor d is close to in the range

d

9 18

This metho d was used in digitrecurrence algorithm for highradix and

for small radix online division TE In this section we derive the convergence

b ounds for this approach for VLP division using highradix The error analysis

considers the selection function by rounding and maximal redundancy of the digit

set For the VLP division implementation we use signeddigits in BS co de with

maximal redundancy redundancy factor K for the digit set D f g

The redundancy factor is dened as K

r 1

The quotient digit is obtained as

q S W j W j

j

where W j is the estimate of the residual


Algorithm serial computation of the residual VLP division

initialization

p

store q into register a

j

loop while p opt comment is the delay of the nonpipelined data path

network

if p opt clear register b otherwise read D p and store into register b

if p opt read D p and store into register c

read Qp and W p from memory and store into registers d and e

if p store N opt into register f otherwise clear register f

issue control signals to the selection function to store the rst most sig

nicant digits of the residual comment the selection function circuit is

dened in section

if p write the output of the data path to W p comment the

number comes from the delay of the path between W j and W j

i i

plus an extra cycle to scale the residual

increment p

adjust pointers to vectors opt opt


Adopting the same terminology used in EL we compute the remanent as

W j q

j

Assuming that the residual is represented with signeddigits and the estimate

th

of the residual is obtained based on truncation of the residual value at the t

fractional bit the b ounds for the remanent are

t t

The residual value is obtained as a function of the remanent by manipulation

of equation as follows

+1 +1

W j r W j q D j n r d Qj r

j j + j +

+1 +1

W j r W j q q q D j n r d Qj r

j j j j + j +

Inserting equation into equation

+1 +1

W j r r q D j n r d Qj r

j j + j +

One condition for convergence comes from the recurrence equation of online

division considering d and n q with the value of the

j + 1 j + 1 j 1

residual represented as l

+1

l r l d r

+1

l r K d K r

that after substitution of the variable l by W j results in the upp er b ound

+1

jW j j K r d r

It is also necessary to avoid that the selection function generates a digit that is

greater or equal to r The following relation applies

t

jW j j r


Combining equations and for the case K the b ound for the

residual b ecomes

t +1 t +1

maxr r d r W j minr r d r

As the b ounds are symmetrical we analyze only the upp er b ound based on the

conditions imp osed by equations and

t +1 t

r r j dj r r minr d r r r

or

t

t 1

r j dj r r mind r

r r

The solutions for this relation is calculated dividing the range of values in three

regions as follows

d

t

t 1

r d r r

r r

that results in

t

r

1

d r

r r r

t

1 2

1

d r

2r r

t

t 1

r d r r

r r

that imp oses the b ound

t

r

1

r d

r r r

t

1 2

1

for d r the expression results in contradiction

2r r


Combining the p ossible b ounds from expressions and we obtain

the range for the prescaled divisor as

t t

r r

1 1

r d r

r r r r r r

For it is easy to show that t would b e enough to obtain consistent

upp er and lower limits for the scaled divisor When working with highradix digits

the number of bits in one fractional digit is dl og r e that is more than enough to

2

obtain a reasonable range for the prescaled divisor in order to allow this selection

function So assuming that t l og r we get the following b ounds on the scaled

2

divisor

r r

1 1

r d r

2 2

r r r r r r

that could b e rewritten as

d

where

r

and

r

1

r

2

r r

and only one fractional digit is used for selection

An example of the online division using prescaled op erands is shown in Table

Scaling Factor M

From the bounds of the scaled divisor we can determine the scaling factor $M$ that is used to scale the divisor and dividend. Basically, the value of $M$ is obtained as the reciprocal of the divisor, computed using some of its most significant digits. This short-precision reciprocal value could be obtained by the host processor, using the FP arithmetic unit, or by a dedicated circuit in the coprocessor. Since the reciprocal is computed in short precision, the best solution would be the use of the host FP unit. We explore this option in the next section. A dedicated circuit for this purpose would consume space in the reconfigurable hardware, if the circuit is kept during the operation, or force a reconfiguration of the coprocessor during the operation. The circuit would be designed for small radix, which would be slower than the FP unit in the processor. In both cases the performance would suffer.

Table: Example of high-radix online division using prescaled operands. For each step, the rows list the residual $W$ and the selected digit $q_j$, followed by the terms $rW$, $q_j D r$, $n_{j+3} r^{-2}$ and $d_{j+3} Q$ used to form the next residual; the last row gives the quotient $Q$.

Computation of the Scaling Factor at the Host

Using the maximum value of we compute a more restrictive b ounds for the

scaled divisor d with

2

r r

based on the observation that

r

4

and r

max

2 2

r r r r

for r The equation already includes the online delay of The same result

would b e valid for

The bit patterns for the upp er and lower limits of the interval are shown in

Figure The vertical bars separate bits from dierent radixr digits

As shown in the gure the scaled divisor must have the rst fractional digit

r 1

equals to or r The second fractional digit is d for the case

2

2

r

or d for the case

2

2

Consider the original divisor to b e y The host has an FP unit that is capable of

computing a recipro cal approximation of y with a precision of k copro cessor digits

If k the recipro cal of y M has or more digits and the multiplication of

k

the scaling factor by y will generate a value close to with an error caused by the

limited precision of y

k

We now show that the truncation error introduced to obtain $y_k$ does not cause the final scaled divisor to fall outside the expected bounds.

Figure: Divisor bounds (bit patterns of the upper and lower limits of the scaled divisor interval)

k

The approximation y of the divisor y has an error of r Assuming that y is

k

a normalized number y the maximum dierence b etween the obtained

scaled divisor and is

k

y M y r M

Since the scaling factor comes from a limited precision computation k digits the

k

pro duct y M also has an error and corresp onds to r thus

k k k k

y M y M M r r M r r

Assuming the minimum value of k this error will aect the least signicant

bits of the second fractional digit of the scaled divisor Therefore the scaled divisor

will b e in the desired range for digit radices ab ove

In conclusion using at least digits of the divisor y we obtain the scaling factor

M in the host FP unit by calculating the recipro cal of the truncated divisor y

The value $d$ obtained from the multiplication of $M$ and $y$ is guaranteed to satisfy the bounds on $d$ for proper selection of the quotient digit by rounding.

Observe that the number of bits used in the coprocessor digit must be less than the number of bits in a host processor word, such that at least two coprocessor digits can fit in one host processor word.

Prescaling of Operands

During the prescaling phase, both dividend and divisor must be multiplied by the scaling factor $M$. This task can be accomplished by an online prescaler (hardware) or by software at the host level. In the next section we present the online prescaler.

Online Prescaler

The online prescaler is shown in Figure Since the original divisor is

normalized the scaling factor is in the range M For M the host can

p erform the prescaling op eration that consists of one bit left shift of each op erand

Other values of M are going to have an integer bit b and two fractional digits d

1

and d The online circuit to compute this op eration contains two digitbyvector

2

multipliers and two adders One of the registers shown in the gure is controlled

by the most signicant bit of the scaling factor b If b is one the input digits are

delayed and passed to the last adder otherwise the register controlled by b is kept

with a zero value The delay dep ends on the degree of pip elining in the prescaler

network

The structure is similar to the data path circuit shown in previous sections. The basic differences are some of the connections to the memory components and the delay block of $3 + p'$ cycles. These differences force the reconfiguration of the FPGA for each prescaling phase.

Another alternative is to use the same data path used for VLP division plus

some extra circuitry This solution is shown in Figure The extra circuits are

Figure: Online prescaler ($M = b.d_1 d_2$; the operand enters through delays of 1 and $3 + p'$ cycles, where $p'$ is the number of pipeline stages in path A; the output is converted from BS to NR)

shown inside dashed b oxes The prescaling is done in phases assuming that

b oth op erands are placed into the same vector D

The control circuit passes the digits through the data path with r eg a

registers c d e and f cleared A copy of the op erand is this way stored into

the residual vector

Load reg a with $d_1$ and reg c with $d_2$. The operand is inserted again into the data path. If $b = 1$ ($b$ is the most significant bit of the scaling factor), the copy of the operand in the residual vector is also used as input, with adequate delay. The network output is transferred to the proper output vector ($D$ or $N$). During this transfer the digits are converted from BS to NR representation.

Figure: Prescaling using the data path of the VLP division

With an ecient implementation the circuit will take p m cycles p er

op erand where p is the level of pip elining and m is the precision of the op erands

Selection Circuit

The most significant digits of the residual that come out of the data path are stored inside the selection circuit in a small append register. The stored value is converted from BS code to non-redundant two's complement form before the rounding process takes place. A block diagram of the selection circuit is shown in the figure below.

For proper selection, only one fractional digit is required. The digit conversion stage has capacity for the high-radix digits of the residual that are needed (one integer and one fractional digit, plus one bit) and is responsible for converting the redundant digit output of the data path (BS code) into a non-redundant representation. The rounding takes place after the conversion, considering only the integer digits of the non-redundant residual and one fractional bit. Since the non-redundant representation is two's complement, the addition of $1/2$ (a one in the fractional bit position) followed by truncation produces the desired quotient digit in non-redundant format.

Figure: Selection circuit for VLP division (residual digits in BS code enter the digit conversion stage, built from complementers and CPAs; a final CPA adds the rounding constant 000...01 and delivers the quotient digit in two's complement)

The most significant digit of the residual (leftmost CPA) must be zero; thus only one bit is used to keep the sign of the non-redundant residual, i.e., $n + 1$ bits of the output are used as the quotient digit.
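The rounding step can be illustrated with integer arithmetic. The Python sketch below is an assumption-laden illustration rather than the circuit itself: the residual is modeled as a two's complement fixed-point integer with `frac_bits` fractional bits, and the function name is hypothetical. It keeps one fractional bit, adds a one in that position (that is, $1/2$), and truncates.

```python
def select_quotient_digit(residual, frac_bits):
    """Round a two's complement fixed-point residual to an integer digit.

    residual is an integer holding value * 2**frac_bits (frac_bits >= 1).
    Python's >> on negative integers is an arithmetic shift, which matches
    truncation in two's complement.
    """
    halves = residual >> (frac_bits - 1)   # truncate, keeping one fractional bit
    return (halves + 1) >> 1               # add 1/2, then drop the fractional bit
```

For example, with 4 fractional bits the value 2.75 is stored as 44 and the function returns 3, while -2.75 is stored as -44 and the function returns -3.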

The selection time may force the data path to stall when the precision of the

op erands is small b eginning of the op eration Since the three digits are stored

into the unit the time to compute the quotient digit corresp onds to

T n T n

CPA CPA

T

sel

T

cp

where $n = \log_2 r$ and $T_{cp}$ is the coprocessor cycle time.

The complementation block (COMP) will be absorbed by the CPA modules in an FPGA synthesis process. A more elaborate estimate of the selection function time is given in the chapter that discusses the design of arithmetic operators for FPGAs. The influence of the selection function on the VLP division time is discussed in the section on pipelined operation.


Reducing the Number of Cycles

Based on the same idea considered for the VLP multiplication, the computation of the recurrence equation does not need to be done in the full precision of the residual all the time. The algorithm may be modified to reduce the precision of the data vectors by one digit per step after a certain step $j$. Assume that the first truncation of the data vectors occurs at position $t$ of the input vector $D$. We consider in this analysis only the vector $D$ because it is the one that has more weight in the recurrence equation. The same truncation in vector $Q$ would result in a smaller error, since this vector is multiplied by $r^{-\delta+1}$.

Disregarding digits from D after the digit at p osition t results in an error in

the scaled residual that is b ounded as

t

jr r r j

One more iteration will result in an error that is comp osed of the residual error

plus the divisor error that now has one less digit

t t+1 2 t

r r r r r scal ed r r r

and for q iterations the equation for the error is

q t

jq r r r j

q

For correct selection of the output digit the error inserted by the limited preci

1

sion computation of the recurrence equation must b e less than r since only one

fractional digit is used in the truncated residual for output selection Thus

q t 1

q r r r r

In VLP division the number of iterations executed after digit t is pro cessed

for a nal precision of m digits is q m t

The value q is b ounded as

m

q


Using equation in we obtain

m

m

t 1

2

r r r r

m m

t +2

2

r r

m m

t l og d e

r

m

is true for a large range of values as explained for As the term l og

r

2

VLP multiplication we assume

m

t d e

Four digits after the middle of the output vector the precision of the recur

rence equation can start to decrease without compromising the nal result of the

computation

Pipelined Operation

Pipelining in the data path for VLP division impacts the performance in two ways. The increase in the pipeline level reduces the cycle time, which for a long sequence of digits results in less total time. However, the increase in the number of pipeline stages also increases the latency. As the selection function is executed only after the first residual digits are generated, delaying the generation of these digits will also delay the generation of the next quotient digit. While the precision of the residual is small, the generation of the quotient digit will suspend the beginning of the next iteration.

The delay of the data path is p where p is the degree of pip elining in the

implementation The minimum number of cycles in the data path is Considering

the overlap of iterations and lo oking at the input sequences the data path receives

k input digits and a bubble of cycles minimum b efore the next iteration If

the quotient digit is not ready after k cycles the new input digits cannot b e

applied The quotient digit is generated in cycle p T related to the sel


cycle when the iteration b egins A new iteration cannot b egin b efore the quotient

digit of the present iteration is generated that results in k p T The

sel

number of cycles p er iteration as a function of the number of digits in the input

vectors k is given as

p T if k p T

sel sel

cy cl esk

k otherwise

The value of the T is discussed in section

sel

In a nonoverlapped mo de of op eration the constrain is k p p T

sel

that implies the equation

p T if k T

sel sel

cy cl esnk

k p otherwise

The extra cycles imposed by the selection function may be seen as an overhead over the number of cycles of the VLP division without selection function. This number of extra cycles is not easily obtained, since VLP division starts to reduce the precision of intermediate calculations after t operand digits. Thus the selection function is going to affect the beginning and the end of the VLP division operation. An approximation of the overhead caused by the selection may be obtained as twice the extra cycles when the working precision increases

X

T p T k

ov h sel

k =

where min p T t with t representing the input truncation p oint

sel

makes the summation items always p ositive or zero The same overhead for the

case of nonoverlapped op eration is given as

X

T k T

sel ov hn

k =

where min T t

sel

These equations show that the nonoverlapped mode of operation reduces the impact of the selection function in the total time of the operation, because it takes longer to complete each iteration.


Execution Time

This section presents the execution time of the VLP division for two operands with m digits of precision, generating a quotient of m digits of precision. The degree of pipelining in the data path is represented by p.

Due to prescaling, the dividend and divisor may have one integer digit. We consider this worst-case situation in the following equations. Prescaling itself will take

T m p p m cycles

pr e

for each input op erand and p is the data path delay in cycles as explained in

Chapter

The overhead imp osed by the selection function in a pip elined data path is

given in equation as T orT Each iteration with k fractional digits

ov h ov hn

will take k cycles one extra integer digit included in a pip elined and overlapped

m

op eration until digit d d e After fractional digit d the precision of the

2

data vectors used in the serial computation decrease one digit p er iteration as

explained in section and the number of cycles for each iteration involving the

new fractional digit k d is d k cycles comp osed by the number

cycles needed for k fractional digits in the iteration one integer digit and a bubble

of cycles

Putting it all together we obtain the expression for the number of cycles for

VLP division as

m+2 d

X X

j d j p T T m p T

V LP div pr e ov hn

j =3

j =d+1

Without overlap of iterations, it is necessary to insert the operand digits and wait for the data to go through the data path before the next iteration begins. The number of cycles for VLP division in this case can be approximated by the expression

m+2 d

X X

j p d j p T T m p T

V LP div pr e ov h

j =3

j =d+1


The basic difference between the VLP division time and the VLP multiplication time is the prescaling of operands and the extra cycles when the data path does not have the output of the selection function available during the first iterations. The cycle time to execute one iteration with input operands in precision i digits, after the point when the selection function is affecting the beginning of the next iteration, is given as

T i i p cycles iter

Chapter

VLP Square Root

This chapter presents the design aspects of the VLP square root operation. The recurrence equation for online square-root computation $Y = \sqrt{X}$ is

+1

W j r W j y Y j Y j x r

j 1 j + 1

or

j +1 +1

W j r W j y Y j y r x r

j 1 j 1 j + 1

where

j + 1

X

i

X j n r

i

i=0

j

X

i

q r Y j

i

i=0

j

X

i

W j w r

i

i=0

and with the initial condition of W X The online delay for highradix

square ro ot computation is cycles based on Tu

Let us compare the pros and cons of using one or the other equation. If the main selection factor is to have a data path similar to the VLP multiplication and division, the first recurrence equation is more adequate. The implementation of this recurrence equation will have both digit-by-vector multipliers doing the same task most of the time: multiplication of the digit $y_{j-1}$ by the vector Y. The other equation leads to an implementation with a data path where one of the digit-by-vector (DV) multipliers is used as a digit-by-digit (DD) multiplier only. The difference in area between the two implementations is small, only an online adder


and some registers, since the DD multiplier consumes most of the area in the DV multiplier. The control circuit would be more complex for the second approach, since the correct time to insert the value $y_{j-1}^{2}$ in the network would change from

one iteration to the other and the insertion would need to b e done for only or

cycles digit pro duct in redundant representation The rest of the time the

circuit is not used at all The shift left op eration Y j would also force a

small increase in the circuit area of the DV multiplier in the implementation of the

second equation

For these reasons we are going to use the first recurrence equation as the basic equation for our VLP square root implementation.

VLP Square-Root Algorithm

In this section we describe the algorithm to compute the VLP square root $Y = \sqrt{X}$ of an LP number X with m digits. The serial computation of the recurrence equation is performed by the data path shown in the figure "Data path for VLP online square-root operation".

Figure: Data path for VLP online square-root operation


The same type of modules used in VLP division are used in this data path. The difference between this circuit and the other is the interconnection of the DV multipliers and the memory elements, the selection function, and a multiplexer for register a.

The data vectors used are:

X is the input operand digit vector and contains one integer digit and m fractional digits. The value of X corresponds to $\sum_{i=0}^{m} x_i r^{-i}$, and the vector position i contains digit $x_i$, starting from position i = 0.

Y[j] contains all output digits generated by the algorithm, including the digit selected in step j.

W is the scaled residual digit vector at step j.

The VLP square root algorithm to compute a number in precision m using the online recurrence equation is described in the algorithm "VLP square root".

The selection function Sel is described in the next section. Other values used in the algorithm, such as the number of fractional digits in the output estimate $\hat{Y}$, will become clear later in this chapter.

The data vectors manipulated by the algorithm are shown in the figure "Digit vector for VLP Square Root". The big dots represent the fractional points in the digit vectors. The pointer xp references the input operand digit being considered at step j. Pointer opt shows the position of the last output digit. After each iteration a new digit is generated and inserted in Y, followed by an increment of opt. The distance between xp and opt corresponds to the online delay.

In order to simplify the manipulation of the long-precision digit vectors, the alignment of values is performed by shifting the digit vectors relative to each other. The amount of the shift depends on the network and the recurrence equation. The same pointer is used to read digit vectors W and Y.

The delays of the data path are shown in the figure "Data path delays". Both Y and W, based on the recurrence equation, should be aligned. But as the paths involving Y and W


Algorithm VLP square root

Initialization

transfer the rst most signicant digits of X to W making the proper

alignment in the digit vector

W x i

i i

W i

i

initialize pointers to data vectors xp and opt

copy to vector Y the estimate of the output Y y y y that was

0 1 2

provided by the host this step may be avoided if the host is able to access

the memory element that stores Y directly forcing Y Y

generate the initial residual based on the output estimate Y y y y

0 1 2

for j to

compute serial ly the recurrence equation

+1

W j r W j x r r y Y j r y Y j

j + 1 j 1 j 1

where t is the number of fractional digits in the output estimate Y

select the output digit for the rst time

y S el W Y

3

update Y vector Y AP P E N D Y y

3

Iteration for j to m

a compute serial ly the digits of the next residual as

+1

W j r W j x r r y Y j r y Y j

j + 1 j 1 j 1

b Select the output digit

y S el W j Y

j

c update quotiend vector Y j AP P E N D Y j y j


Figure: Digit vector for VLP Square Root (vectors X, Y and W with pointers xp, p and opt; the network delay and the fractional point are marked)

have a dierence of cycles W must b e shifted digits related to Y as shown in

Figure

The pseudocode for the serial computation of the recurrence equation is shown in the algorithm "serial computation of the residual".

We now analyze the conditions to have a selection function using rounding. Contrary to the VLP division, the square root imposes more constraints on the application of this method.

Convergence conditions for Output Selection

This section shows the derivation of the bounds for the operand to allow rounding as the selection function. In contrast with the division recurrence equation, the square-root equation corrects the residual value by multiplying the partial result of the operation by the new output digit. Moreover, as the operation output is used in the recurrence equation, the leading digits must be computed by another circuit or obtained from table lookup. The option of table lookup is not considered in this evaluation, since the table for high radix is too large. We consider


Algorithm serial computation of the residual

initialization

p

store Y opt into register a

set the limit for the next loop It is necessary when the precision of input

Y is less than the precision of the residual W

if opt l imit else l imit opt

comment is the delay of the nonpipelined data path network

loop while p l imit do

if p opt clear register b otherwise read Y p and store into register b

if p opt clear register c otherwise read Y p and store into register c

read W p from memory and store into register d

if p store X xp into register e otherwise clear register e

issue control signals to the selection function to store the rst most sig

nicant digits of the residual comment the selection function circuit is

dened in section

if p write the output of the data path to W p comment the

number comes from the delay of the path between W j and W j

i i

and the scaling of the residual

increment p

update pointers to the vectors xp xp and opt opt


Figure: Data path delays (path delays of 3, 7 and 4 cycles are annotated on the branches for Y, X and W)

the case when the host FP arithmetic unit is used to compute a short-precision estimate of the output and provides the digits obtained in this process to start up the computation of the VLP square root.

One way to attack the problem is to consider an scaling factor the same way

as we did for division In this case the output Y may b e close to or more

sp ecically in the range Y in order to allow the rounding

function for selection of the output digit. This solution requires correction of the result. With X as the input and a scaling factor of M, the result of the square root operation is $\sqrt{MX} = \sqrt{M}\,\sqrt{X}$, and it must be multiplied by $\frac{1}{\sqrt{M}}$. This correction factor is a long-precision number, which forces the application of other long-precision computations, including a long-precision square root. For this reason this approach is not further analyzed in this thesis.

Another idea is to use a modified selection function that is not based on the residual value only, but on a combination of the residual and the output. In this case a larger interval of input and output values may be considered, and it is not necessary to scale the input operand or correct the output. This approach is discussed in the next section.


Selection function with Compensated Residual

The prescaling of the input operand causes many problems. The solution proposed and studied in this section uses a more elaborate selection function, based on the residual value and an estimate of the output, as follows

W j

y S el

j 1

Y

where $\hat{W}$ is a limited-precision value of the residual and $\hat{Y}$ is an approximation of the value $\sqrt{X}$ computed by the host, based on the most significant digits of the input operand X. In Mat, a high-radix square-root unit was proposed that would use the same selection method, but with truncation instead of rounding. This work was cited in LM.

The conditions assumed are

X that results in Y

the host is capable of handling k coprocessor digits

the host generates a rounded limited-precision estimate of the output, based on limited-precision input values up to t fractional digits

First observe that using the prop osed selection function the output digit set

may not b e maximally redundant sometimes It dep ends on the value of the output

Y If we consider jW j r for example the selection function reduces to

r

y S el

j 1

Y

Based on this equation the digit set for y changes dep ending on Y For example

j

if Y and r the p ossible values of y are in the set f g

j

For Y close to the maximally redundant digit set is p ossible We assume

that the host generates an approximation of the output with t fractional digits

rounded thus k t digits The utilization of tables would require a large space


in memory On the other hand this assumption limits the width of the copro cessor

word to a fraction of the host pro cessor word

The error in the truncated lowprecision value X is given by

t

r

jX X j

The estimate of the output is computed based on X We need to determine

the error in the estimate denes as jY Y j The relative error of X when the

th

truncation is done at the t fractional digit is sp ecied as

t

r

x

X

As X the error has the following upp er b ound

t

r

x

and from this equation we obtain

X X

x

and thus

q q

q q

Y Y X X

x x

p

p

Using the prop erty that for we obtain the equation

q

p

t2

t

r r

x

thus

t2

Y Y r

and the relative error is computed as

t2

r

y

as Y the maximum absolute error is

t2

jY Y j r


We dene the variable as

W j

y

j 1

Y

Based on the selection function equation that considers a truncted value

of W and the error incurred in the use of Y instead of the real Y we obtain the

b ounds on as

t t2

r r j j

t t2

where the value r corresp onds to the residual truncation error and r is the

error in Y

To obtain the value of W as a function of we combine equations and

using the maximum values of the p ossible terms in the equations The result

is

+1

w r Y r r

where w is the value of the residual and Y is the output value

For convergence of the recurrence equation it is necessary that

+1

r w Y r r w

where is the maximum value of digit y such that y f g The

j 1 j 1

minus sign in the equation was used b ecause the output Y is always p ositive and

the output digit has the same sign of the residual We also considered that the

maximum input digit value is r From equation

+1

r Y r r

w

r

To have the output in a nonredundant digit set we must imp ose

W

t

r r

Y

From this equation combined with equation we obtain

t2 t

jW j Y r r r


Equations and give the upp er b ound on the residual value as

+1

r Y r r

t t2

r W minY r r

r

Combining equation and we get

+1

r Y r r

+1 t2 t

r Y r r minY r r r

r

As these functions are continuous and monotonic in the interval of interest

lets analyze the condition for the extreme values only and test the condition for

the values and t The value t is used b ecause the VLP division

algorithm already limited a minimum of copro cessor digits p er host word

Y in this case r and the condition reduces to

2 1 2 1 2 2

r r r r r minr r r r r r

r

Y in this case

2

r

2

r r r

2 1 2 1 2

2

r r r r r minr r r

r

These conditions are satised for values of r

An example of the VLP square ro ot computation using the prop osed selection

function is shown in Table The output digits generated at steps and are

obtained from the estimate of the output value provided by the host The selection

function is applied in all other steps to obtain y

j
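To make the role of the host estimate concrete, the sketch below runs a signed-digit square root in Python with exact rational arithmetic. It is only an illustration under stated assumptions, not the data path of this chapter: it uses the conventional (non-online) digit recurrence $w[j] = r\,w[j-1] - 2\,Y[j-1]\,y_j - y_j^{2}\,r^{-j}$ with $w[j] = r^{j}(X - Y[j]^{2})$, it assumes that the selection rounds $r\,w[j-1] / (2\hat{Y})$, and it uses the full residual instead of a truncated one. The function name `sqrt_digits` and the test values are hypothetical.

```python
from fractions import Fraction
from math import isqrt


def sqrt_digits(x: Fraction, r: int, m: int, t: int = 2):
    """Signed-digit square root; digits are selected by rounding r*w / (2*y_hat)."""
    # Host estimate (assumption): sqrt(x) rounded to t fractional radix-r digits.
    n = isqrt(int(4 * x * r ** (2 * t)))   # floor(2 * sqrt(x) * r^t)
    y_hat = Fraction((n + 1) // 2, r**t)   # round to nearest
    y = y_hat                              # partial result Y[t]
    w = r**t * (x - y * y)                 # residual w[t] = r^t * (X - Y[t]^2)
    digits = []
    for j in range(t + 1, m + 1):
        d = round(r * w / (2 * y_hat))     # selection uses the fixed estimate only
        w = r * w - 2 * y * d - Fraction(d * d, r**j)
        y += Fraction(d, r**j)
        digits.append(d)
    return y_hat, digits, y


if __name__ == "__main__":
    y_hat, digits, y = sqrt_digits(Fraction(7, 10), r=16, m=12)
    print(y_hat, digits, float(y))
```

In this sketch the selected digits may be negative, and the residual stays bounded because each rounding step removes all but about $\hat{Y}$ of the scaled residual. The small mismatch between $\hat{Y}$ and the true partial result is what the convergence conditions above have to absorb.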

Selection Circuit

The circuit used by the selection function of the VLP square root is shown in the figure "Circuit used for output selection in VLP Square Root". The selection function requires the multiplication of a truncated residual by the reciprocal of the output estimate $\hat{Y}$. The reciprocal $\hat{Y}^{-1}$ is a value

in the range and has fractional digits a total of n bits The value


Table: Example of VLP Square Root (header rows: X; $\hat{X}$, the truncated value of X with two digits; $\hat{Y} = \sqrt{\hat{X}}$, the approximation of the result Y; and the online delay. Per-step columns: j, x, W[j], $\hat{W}$, $\hat{W} / 2\hat{Y}$ and $y_j$. Final rows: Y and the expected value.)

of $\hat{Y}^{-1}$ is represented in the conventional number system. The multiplier and CPA modules perform the scaling of the residual value and the rounding. The division by 2 is obtained by proper interconnection between the multiplier and the CPA, represented in the figure by a right arrow crossing the multiplier output. The rounding circuit could be incorporated into the CPA stage of the multiplier internally, reducing the area and total delay, but for our estimates we consider them as separate modules.

Since online modules are used in the serial computation of the recurrence equation, the most significant digits of the residual are obtained first, and while the remaining digits of the residual are computed, the selection function works on the generation of the next output digit. The timing diagram in the figure "Timing of selection function and data path" shows the

time used by an hypothetical selection function that consumes cycles and the

serial computation of the recurrence equation Observe that the selection func

tion stalls the op eration of the copro cessor when the precision of the op erands is

low It will not b e in the critical path after the rst most signicant digits of


Figure: Circuit used for output selection in VLP Square Root (modules: Append, Register, BS to NR converter, Multiplier with the 0.5 scaling, and CPA; inputs W[j] and $\hat{Y}^{-1}$, output $y_j$)

the output are generated since the latency to generate each digit increases one

cycle p er iteration In the Figure after iteration the selection function do es not

limit the speed of the data path. This feature allows the utilization of cheaper and slower components in the implementation of the selection function circuit, such as a parallel-serial multiplication module and Carry Propagate Adders.

The same problems with the level of pipelining and initial operation of the unit that happened in VLP division are going to be more pronounced in VLP square root, since the selection function for VLP square root is more complex. However


Figure: Timing of selection function and data path (phases: Initialization, Iter. 1, Iter. 2, Iter. 3, Iter. 4; rows: Data path and Selection)

the equation used to compute the overhead imposed by the selection function in a pipelined data path is also valid for VLP square root.

The time for selection depends on the selection of components and technology. An estimate of the circuit delay for FPGAs is shown in section

Performance Evaluation

Optimization of the Number of Cycles

For an output of precision m digits, it is not necessary to compute the recurrence equation in full precision all the time. If we apply the same idea presented for the VLP multiplier, we may work with a precision for vectors Y and W that is reduced by one digit in each step after a certain step n. Observe that the reduction of the precision of Y is only for the residual calculation. The precision of Y continues to increase in each new iteration, but not all digits are read in each iteration. Assume that the first truncation of vector Y is done at digit position k. This action inserts an error in the scaled residual that is bounded as

$$|r\,(r-1)\,r^{-k}|$$

One more iteration will result in an error that combines the error in the short-precision Y value (one less digit) and the error in the previous residual:

$$\left[\,r\,(r-1)\,r^{-k} + (r-1)\,r^{-k+1}\right]_{scaled} = 2\,r^{2}\,(r-1)\,r^{-k}$$

and for q iterations the equation for the error is

$$|q\,r^{q}\,(r-1)\,r^{-k}|$$


For correct selection of the output digit, the error inserted by the limited-precision computation of the recurrence equation must be less than $r^{-t}$, where t is the number of fractional digits in the truncated residual used in the selection function. Thus

$$q\,r^{q}\,(r-1)\,r^{-k} < r^{-t}$$

In VLP square root calculation, the number of iterations executed after input digit k was processed, for a final precision of m, is $q = m - k$. However, a simpler upper bound for q is

$$q \le \frac{m}{2}$$

thus

$$\frac{m}{2}\,r^{\frac{m}{2}+t}\,(r-1) < r^{k}$$

and using the condition that t = 2 digits we obtain

$$m\,r^{\frac{m}{2}+2}\,r = m\,r^{\frac{m}{2}+3} \le r^{k}$$

$$\frac{m}{2} + 3 + \log_r m \le k$$

As $\log_r m \le 2$ is true for a large range of input precision values m, based on the same discussion presented for VLP multiplication, we obtain

$$k = \frac{m}{2} + 5$$

Five digits after the middle of the output vector, the precision of the recurrence equation can start to decrease without compromising the selection function of output digits.

Execution time

During initialization some digits of X are copied to vector W (see the algorithm "VLP square root"). This task can be done by the host before the data is inserted in the coprocessor


without consuming copro cessor cycles Also during initialization phase t iter

ations are p erformed to generate the initial residual During this phase digits of

the residual are necessary one integer and fractional digits Each iteration

takes cycles bubble of cycles in the path of the residual with a setup time

to transfer digits from Y of t t where t is the total delay in the data path

dp w dp

and t is the delay in the branch used by the scaled residual W The selection

w

function is activated only in the iteration that issues the last digit of the estimate

iteration that generates y Total number of cycles during initialization is

t+1

T t t t N T cycles

init dp w sel

where N considers the number of cycles required to obtain the t most signicant

digits of the residual at iteration t after all digits of the previous residual and

a bubble of cycles were inserted Since we are considering the input b ehavior

and the selection function dep ends on the output the value of N is inserted to

comp ensate for this change in the p oint of reference Thus N t

w

where the value corresp onds to the number of digits used in the selection function

and the rest of the equation corresp onds to the time to get the rst residual digit

at the output

For other iterations to compute output digit y with t i m the

i

pip eline structure and the time of the selection function circuit will increase the

total number of cycles in T as presented in equation related to the zero

ov h

delay selection function case

The total number of cycles necessary in VLP square ro ot considering the trun

m

cation p oint in digit d d e is approximately given by the equation with

2

m+2 d

X X

j d j p T T T

V LP sq r t init ov h

j =3

j =d+1

considering one integer digit for the op erand a pip elined data path with delay

p and an overlapped op eration b etween iterations In the equation one integer


digit is considered for the operand and result. This assumption is necessary given the redundant representation that can be used for X and Y.

For a nonoverlapped operation, the equation is modified to

m+2 d

X X

d j p j p T T T

V LP sq r t init ov hn

j =3

j =d+1

Comparing the number of cycles of VLP square root and the previous VLP operations, the VLP square root operation behaves like VLP division without the need for scaling.

Chapter

Implementation Aspects and Host Tasks for VLP Operation

This chapter describes the implementation aspects of the VLP algorithms in the RAC and also presents operations that should be implemented at the host level. Regarding the VLP algorithms, we discuss in detail the arithmetic modules that were used in the previous chapters, more specifically the number conversion modules. The aspects of the implementation associated with host tasks are the floating-point (FP) number format, digit conversion, and FP algorithms using the coprocessor VLP operations.

Digit Code Conversion

All digits used in VLP algorithms are signed digits (SD) in the maximally redundant set $\{-(r-1), \ldots, r-1\}$. The SDs are represented in this work using BS code, nonredundant two's complement (NR), or CS form. A redundant number system is considered in all internal operations of the network of modules that computes the online recurrence equation. The utilization of different codes in different stages of the computation is the key for efficient implementations. We present in this section the converters used in the VLP data path networks presented in the previous chapters.

CS to BS Converter

The output of multipliers is usually in CS code. CS adders are the ones that give the best time and area relation. It is important to have a parallel conversion method from CS to BS code, avoiding the need to assimilate the bits in CS code before the equivalent BS code is obtained. In this section we present the circuit


and proof for a parallel conversion method.

The n-bit input number in CS code is represented by x = (c, s), where $c = (c_{n-1}, \ldots, c_1, c_0)$ and $s = (s_{n-1}, \ldots, s_1, s_0)$, with $c_i, s_i \in \{0, 1\}$. Both vectors are in two's complement form. Thus $c = -c_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} c_i\,2^{i}$ and $s = -s_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} s_i\,2^{i}$. We assume that no overflow is allowed.

Following the procedure described in EL, we generate another digit vector $v_i$, adding each digit of the CS representation in parallel:

      s_{n-1}  s_{n-2}  ...  s_1  s_0
    + c_{n-1}  c_{n-2}  ...  c_1  c_0
      v_{n-1}  v_{n-2}  ...  v_1  v_0

such that $v_i \in \{0, 1, 2\}$ for $i < n-1$ and $v_{n-1} \in \{-2, -1, 0\}$.

Recoding $v_i = 2p_{i+1} - m_i$ and $b_i = p_i - m_i$, we get the following table for values of $v_i$ and $b_i$:

    v_i       0  1  2          b_i   -1  0  0  1
    p_{i+1}   0  1  1          p_i    0  0  1  1
    m_i       0  1  0          m_i    1  0  1  0

From the table it is easy to see that $p_{i+1} = c_i \;\mathrm{or}\; s_i$ and $m_i = c_i \;\mathrm{xor}\; s_i$. We

also know that

n2 n2

X X

i i n1

p m s c p u

i i i i n1

i=0 i=0

when combined with v the following options are p ossible assuming that no

n1

overow o ccurs

v p p

n1 n1 n1

(b)

(a)

b

n1

n1

In case a if v then u which implies p ie p

n1 n1 n1

n1

When v then u so either p or p and the next

n1 n1 n1

nonzero b is negative

i

The circuit of the CSBS converter is shown in Figure The circuit op erates

without carry propagation Only CLB p er signed bit
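The recoding can be exercised with a short Python sketch. For simplicity it assumes unsigned n-bit vectors c and s, so the case analysis needed for the two's complement sign position is not modeled; the function name `cs_to_bs` is illustrative.

```python
def cs_to_bs(c: int, s: int, n: int):
    """Convert an unsigned n-bit carry-save pair (c, s) into borrow-save digits.

    Returns digits b[0..n], least significant first, each in {-1, 0, 1},
    such that sum(b[i] * 2**i) == c + s. No carry propagates between positions.
    """
    p = [0] * (n + 1)  # p[i + 1] = c_i OR s_i, weight 2^(i + 1); p[0] = 0
    m = [0] * (n + 1)  # m[i] = c_i XOR s_i, weight 2^i; m[n] = 0
    for i in range(n):
        ci = (c >> i) & 1
        si = (s >> i) & 1
        p[i + 1] = ci | si
        m[i] = ci ^ si
    return [p[i] - m[i] for i in range(n + 1)]


if __name__ == "__main__":
    digits = cs_to_bs(0b0011, 0b0001, 4)
    print(digits, sum(d * 2**i for i, d in enumerate(digits)))
```

Each output digit depends only on bit positions i and i - 1 of the inputs, which is why the conversion can be done in parallel in constant time.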


Figure: CSBS converter (each cell A receives $c_i$ and $s_i$ and produces a+b and a xor b, giving $p_{i+1}$ and $m_i$; the outputs are $b_{n-1}, b_{n-2}, \ldots, b_1, b_0$)

BS to NR Converter

The circuit that performs BS to NR conversion is shown in the figure "BS NR converter". The conversion of a redundant representation to NR always implies the use of a carry or borrow propagation. In this case the fastest and most area-efficient circuit in FPGAs should use the Fast Carry Logic (FCL). The circuit shown in the figure is composed of a chain of Full Subtractors (FS). Each FS, with inputs a, b and $c_{in}$ and outputs $c_{out}$ and s, computes the following expression: $a - b - c_{in} = -2c_{out} + s$. The circuit uses the FCL in the FPGA.

Figure: BS NR converter (chain of FS cells with inputs $x^{+}$ and $x^{-}$; each cell computes $a - b - c_{in} = -2c_{out} + s$)

The truth table for the FS function is shown in the table below. From the table we obtain the following logical expressions for $c_{out}$ and s:

$$c_{out} = c_{in}\,\bar{a} + c_{in}\,b + \bar{a}\,b$$

$$s = a \oplus b \oplus c_{in}$$

    a  b  c_in | c_out  s
    0  0  0    |   0    0
    0  0  1    |   1    1
    0  1  0    |   1    1
    0  1  1    |   1    0
    1  0  0    |   0    1
    1  0  1    |   0    0
    1  1  0    |   0    0
    1  1  1    |   1    1

Table: Truth table for the FS function
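The borrow chain can be modeled in a few lines of Python. This is a behavioral sketch, not a description of the FPGA mapping; the names `full_subtractor`, `bs_to_nr` and `twos_complement_value` are illustrative, and the BS digit at position i is assumed to be encoded as plus[i] - minus[i].

```python
def full_subtractor(a: int, b: int, c_in: int):
    """One FS cell: a - b - c_in == -2 * c_out + s."""
    not_a = 1 - a
    s = a ^ b ^ c_in
    c_out = (c_in & not_a) | (c_in & b) | (not_a & b)
    return c_out, s


def bs_to_nr(plus, minus):
    """Convert BS digits (bit lists, least significant first) to two's complement.

    Returns n + 1 bits, least significant first; the last bit is the sign bit.
    """
    borrow = 0
    bits = []
    for a, b in zip(plus, minus):
        borrow, s = full_subtractor(a, b, borrow)
        bits.append(s)
    bits.append(borrow)  # the final borrow has weight -2^n
    return bits


def twos_complement_value(bits):
    """Value of a two's complement bit list given least significant bit first."""
    *low, sign = bits
    return sum(bit << i for i, bit in enumerate(low)) - (sign << len(low))


if __name__ == "__main__":
    # Digits (least significant first) 1, -1, 0, 1 represent 1 - 2 + 8 = 7.
    print(bs_to_nr([1, 0, 0, 1], [0, 1, 0, 0]))
```

The loop is the software counterpart of the borrow propagation: unlike the CS to BS recoding, each output bit may depend on all less significant digits.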

Tasks Performed at the Host

The details of the hardware implementation should be hidden from the user. When the software activates one of the VLP operations, it does so through a procedure call. This procedure is responsible for checking the present configuration of the coprocessor and taking any required action; performing tasks that are not efficiently done in the coprocessor (manipulation of significands and exponents in FP operations); transforming digits in radix b (host digit size) to radix r (coprocessor digits) and vice versa; and performing on-the-fly conversion on the result generated by the coprocessor.

This section presents the data organization used in some of the VLP software and hardware systems discussed earlier, and also describes host tasks in more detail, preparing the reader for the performance evaluation of the system composed of coprocessor and host.


VLP Number Format

Long-precision integers are stored in the host as vectors of integer variables. LP floating-point numbers are usually represented in multiple digit format or multiple term format. In the multiple digit format, the FP number is composed of a single exponent and a sequence of high-radix digits that form the significand. The multiple term format considers that the long-precision number is expressed as a collection of ordinary floating-point numbers, each one with its own significand and exponent. See the figure "Formats of VLP numbers".

Figure: Formats of VLP numbers: (a) multiple digit, with fields E, S, L and the significand digits D[1], D[2], ..., D[L]; (b) multiple term, with field L and terms FP1, FP2, ..., FPL

The multiple digit format has the fields: exponent (E), significand's sign (S), significand's length (L), and vector of machine words or high-radix digits (D[i]). The significand of the long-precision number is represented in sign-and-magnitude form.

The multiple term format is composed of a vector of FP numbers and a field that indicates the vector's length.
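As an illustration of the multiple digit format, the following Python sketch stores the fields E, S and D[i] and derives L from the digit vector. The class name `MultipleDigitNumber`, the default radix, and the interpretation of the significand as a fraction scaled by a power of the digit radix are assumptions made for the example, not details given in the text.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List


@dataclass
class MultipleDigitNumber:
    exponent: int        # field E
    sign: int            # field S: 0 for positive, 1 for negative
    digits: List[int]    # D[1..L], most significant digit first
    radix: int = 2**16   # assumed host digit radix b

    @property
    def length(self) -> int:
        """Field L: number of high-radix digits in the significand."""
        return len(self.digits)

    def value(self) -> Fraction:
        """Exact value, reading the significand as 0.D[1]D[2]...D[L] in the radix."""
        significand = sum(
            Fraction(d, self.radix ** (i + 1)) for i, d in enumerate(self.digits)
        )
        magnitude = significand * Fraction(self.radix) ** self.exponent
        return -magnitude if self.sign else magnitude


if __name__ == "__main__":
    x = MultipleDigitNumber(exponent=1, sign=1, digits=[1, 5], radix=10)
    print(x.length, x.value())
```

Only one exponent is stored for the whole digit vector, which is the compactness argument made for this format.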

There are advantages and disadvantages in each scheme, but in particular the multiple digit format is the one that can more compactly represent most numbers, since only one exponent is stored. The advantage of using multiple term is the


representation of only the significand digits that are different from zero. This feature would allow skipping over zeroes during the computation, which would imply a faster implementation or an even more compact number representation in some cases. However, the probability of having a long sequence of zeroes in the LP number is very small, which makes this advantage not significant. In this thesis we assume the multiple digit format.

On-the-Fly Conversion

On-the-fly conversion (EL) of the result from signed digits to conventional representation (sign and magnitude) is not efficiently done in the coprocessor, because it would be necessary to hold the digits until, in the limit, all of the result digits were defined. This condition would harm the digit transfer time to the host. Besides that, the result vector would have to be manipulated in the coprocessor local memory, thus requiring special hardware in the FPGA to convert the result digits serially, similar to the online recurrence equation computation.

For these reasons, the OFC operation is better executed by the host, concurrently with the VLP operation performed by the coprocessor. The host converts the result vector, composed of signed digits (SDs) in non-redundant (NR) representation (two's complement), into a digit vector in sign-and-magnitude form. Digits in NR representation are composed of two fields: S (the sign, in the most significant bit) and F (the other bits of the representation). Field F is considered as a positive binary representation.

The host manipulates the vectors RESULT, where the final representation is stored, and Q and QM, where temporary values of the digits are stored. The algorithm is shown in Algorithm. The overline symbol means bit complementation, and SD represents the signed digit received from the coprocessor.

In particular, when the VLP multiplication algorithm generates an output with precision larger than m, some of the output digits are available in the residual vector after the operand digits are received. These digits can be transferred to the host very fast (see Section). One option to speed up the OFC consists in

Algorithm: OFC algorithm executed by the host processor

(receive the first digit different from zero)
(SD is a signed digit read from the coprocessor)
while SD = 0
    write zero to RESULT
if SD > 0    (result is positive)
    store F in Q and F in QM
    while the last digit is not received
        case the digit is
            positive:
                move data in Q to RESULT
                empty Q and QM
                store F in Q and F in QM
            negative:
                move data in QM to RESULT
                empty Q and QM
                store F in Q and F in QM
            zero:
                append 0 to Q and r - 1 to QM
if SD < 0    (result is negative)
    store F in QM and F in Q
    while the last digit is not received
        case the digit is
            positive:
                move data in QM to RESULT
                empty Q and QM
                store F in Q and F in QM
            negative:
                move data in Q to RESULT
                empty Q and QM
                store F in Q and F in QM
            zero:
                append 0 to Q and r - 1 to QM
move all information in Q to RESULT; STOP
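As a point of comparison, the standard Q/QM update rules of on-the-fly conversion can be sketched in a few lines. This is a simplified illustration, not the host algorithm listed above: it works on plain integer digits rather than on the S and F fields, it keeps whole digit lists instead of flushing them to RESULT, and the function name and interface are assumptions made here.

```python
from typing import List, Tuple


def ofc_sign_magnitude(sd_digits: List[int], r: int) -> Tuple[int, List[int]]:
    """Convert signed digits (most significant first, each in -(r-1)..r-1)
    into (sign, magnitude digits in radix r), one digit at a time."""
    first = next((d for d in sd_digits if d != 0), 0)
    sign = -1 if first < 0 else 1
    q: List[int] = []    # digits of the magnitude received so far
    qm: List[int] = []   # digits of (that magnitude - 1 unit in the last place)
    started = False
    for d in sd_digits:
        d *= sign                          # convert the magnitude; sign kept aside
        if not started:
            if d == 0:                     # leading zero: nothing to borrow from yet
                q.append(0)
                qm.append(0)
                continue
            started = True
            q, qm = q + [d], q + [d - 1]   # first nonzero digit is positive here
        elif d > 0:
            q, qm = q + [d], q + [d - 1]
        elif d < 0:
            q, qm = qm + [r + d], qm + [r + d - 1]
        else:
            q, qm = q + [0], qm + [r - 1]
    return sign, q
```

Each step only appends one digit to a copy of Q or QM, so no carry or borrow ever propagates through digits already converted. For instance, the radix-4 digit vector 1, -2, 0, 3 has the value 35 and converts to the digits 0, 2, 0, 3.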

performing the conversion in two phases: (1) the multiplier is generating $z_j$, and (2) the multiplier is sending the digits stored in the last residual. In phase 1, the host executes the OFC algorithm, generating two vectors Q and QM that correspond to two possible representations of the received digit vector. In phase 2, the multiplier sends the residual digits from least significant to most significant and performs the conversion as the digits are transferred. This operation is done by a serial BS-to-NR converter (a serial subtractor). The subtraction of two k-bit vectors results in another k-bit vector G and a borrow bit b. The host obtains the non-redundant result by selecting Q if b = 0, or QM if b = 1, and concatenating the selected vector with vector G.

For example assume that the pro duct is represented in redundant form r

as where x x Assume also that the host receives the rst digits

during phase The host creates the following two vectors based on the OFC

algorithm and When the multiplier transmits the other digits it

executes the serial op eration that results in the digit stream

and from least to mostsignicant digit also including the b orrow bit As

b the vector is selected and the nal converted pro duct is

Digit Expansion and Compression

Before the LP number is transferred to the coprocessor, the host must adjust the digit radix. The host stores the significand in sign-and-magnitude format, which consists of a sign bit and a long string of digits D in radix $b = 2^w$. The coprocessor, on the other hand, works with signed digits in radix $r = 2^n$ that require $n + 1$ bits each. During digit conversion from radix $b$ to radix $r$, a sign bit is introduced for each coprocessor digit. For high utilization of communication and storage resources, an integer number of coprocessor digits should fit in one host word. That implies

$$k\,(n + 1) = w \quad\Longrightarrow\quad n = \frac{w}{k} - 1$$

where $k$ is the number of coprocessor digits per host word. VLP division and square root define a minimum value for $k$. The value of $n$ is defined as a function of the hardware resources and the VLP algorithm.

For a host working with bit digits in SM and k we obtain n. Taking $L$ as the length of the LP number in the host, after this transformation of digits the LP number has $m = \lceil L w / n \rceil$ coprocessor digits. The host executes the data transformation described in Procedure to convert digits in radix $b$ to radix $r$ for the particular case of w, k, and n given there. For other values of $k$, $w$, and $n$, other masks and shift operations must be used, which are easily deduced from the given procedure.

A similar operation is performed by the host to compress the digits already received from the coprocessor and converted to SM format (OFC). The procedure shows the digit compression task for the same parameters used in the conversion from host to coprocessor.
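The regrouping of bits behind digit expansion and compression can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's unbounded integers in place of the mask, shift, and spill-bit bookkeeping of the host procedures, the function names are hypothetical, and the parameter values w = 32, k = 2, n = 15 used in the check are assumed for illustration only (they satisfy k(n + 1) = w).

```python
from typing import List


def expand_digits(host_words: List[int], w: int, n: int) -> List[int]:
    """Regroup L host digits of w bits (most significant first) into
    m = ceil(L*w/n) coprocessor digit magnitudes of n bits each."""
    total_bits = len(host_words) * w
    value = 0
    for word in host_words:
        value = (value << w) | word
    m = -(-total_bits // n)                 # ceiling division
    value <<= m * n - total_bits            # zero-pad the least significant end
    mask = (1 << n) - 1
    return [(value >> (n * (m - 1 - i))) & mask for i in range(m)]


def pack_words(digits: List[int], k: int, n: int) -> List[int]:
    """Pack k digits per host word, each in an (n+1)-bit field with sign bit 0."""
    padded = digits + [0] * (-len(digits) % k)
    words = []
    for i in range(0, len(padded), k):
        word = 0
        for d in padded[i:i + k]:
            word = (word << (n + 1)) | d
        words.append(word)
    return words


def compress_digits(digits: List[int], w: int, n: int, length: int) -> List[int]:
    """Inverse of expand_digits: rebuild `length` host digits of w bits."""
    value = 0
    for d in digits:
        value = (value << n) | d
    value >>= len(digits) * n - length * w  # drop the padding bits
    mask = (1 << w) - 1
    return [(value >> (w * (length - 1 - i))) & mask for i in range(length)]
```

With these assumed parameters, two host words expand into five 15-bit digits, which occupy three packed words; compressing the digits returns the original two words.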

VLP Floating-Point Operations

The algorithms developed for multiplication, division, and square root over fixed-point numbers are the main components for the development of VLP floating-point operations, which are executed by the host and coprocessor together. It is not worthwhile to implement in hardware the portions of the VLP computation that are executed only once. Some of these tasks are manipulation of exponents, adjustment of significands to fit the assumptions used in the VLP algorithms, post-correction, and normalization of results.

In the next sections we describe the operations executed at the host in order to implement VLP FP operations making use of the coprocessor VLP operations.

Procedure: host executes base conversion (host to coprocessor)

example of digit expansion using k w n

mask xfff

mask xfff

sb number of spill bits

spill

for iiLi D is the most significant host digit

insert the previous spill bits into the new processor word

word Di sb spill wsb

spill Di wsbk wsbk

temp word mask

temp word mask

temp temp temp two digits are mounted

spill bits in the spill word

send temp to coprocessor

sb sb k

if sbkn too many bits in the spillbit register

word spill wsb

temp word mask

temp word mask

temp temp temp

send temp to coprocessor

sb

spill

insert the remainder spill bits into a blank word

word spill wsb

temp word mask

temp word mask

temp temp temp

send temp to coprocessor


Procedure: procedure executed at the host for base conversion (coprocessor to host)

vector R stores the received and converted

coprocessor digits

mask xfffe

mask xfffc

sb

w

j

for iimi use m for integer multiplication

compress the word

temp Ri mask Ri mask

insert the value into register D

if sb

DjDj temp wsb

Dj temp sb

sb sb w empty bit places in Dj

else

Dj temp

sb

if sb j


Notation

To make a clear distinction between the numbers manipulated at the host level and at the coprocessor level, we introduce the following notation. The host works with long-precision floating-point numbers with the format presented in section. The format includes the long-precision significand and an exponent that is unbiased. The long-precision FP number is represented as

$$X_{FP} = x_{mfp} \times r_h^{x_{efpr}}$$

where $x_{mfp}$ is the significand of the FP number, in the range $1/r_h \le x_{mfp} < 1$, and $x_{efpr}$ is the exponent in radix $r_h$. We assume that the exponent is not biased, which means that a positive or negative sign is explicitly assigned to it. The radix $r_h$ is defined as a function of the number of bits used in each digit of the host processor, $w$, as

$$r_h = 2^w$$

The coprocessor works in a different radix $r = 2^n$ that depends on the available resources in the reconfigurable hardware.

The host manipulates numbers composed of digits in radix $r_h$. The digits are converted from radix $r_h$ to radix $r$ by the host and transferred to the coprocessor, which manipulates digits in radix $r$. Operands sent to the coprocessor are considered fixed-point numbers. The range of these numbers depends on the VLP algorithm.

It is convenient to have the FP number expressed as

$$X_{FP} = x_{mfp} \times 2^{w\,x_{efpr}} = x_{mfp} \times 2^{x_{efp}}$$

This transformation provides a better format for the representation of bit shifts.

VLP FP Multiplication

The VLP FP multiplication $Z_{fp} = X_{fp} \times Y_{fp}$ is obtained as

$$Z_{fp} = z_{mfp} \times r_h^{z_{efpr}}$$

$$z_{mfp} = x_{mfp} \times y_{mfp}$$

$$z_{efpr} = x_{efpr} + y_{efpr}$$

The VLP multiplier will be better utilized if the operands are in the range $[1/r, 1)$. However, $x_{mfp}$ and $y_{mfp}$ are in the range $[1/r_h, 1)$. As $r_h > r$, there is a possibility that the most significant radix-$r$ digits of the operands $x_{mfp}$ and $y_{mfp}$ are zeroes.

There are two options:

1. The significands are not scaled, and the multiplier wastes some cycles with the leading zeroes.

2. Normalize the significands to be in the range $[1/r, 1)$. This process requires bit shifts to scale the operands and bit shifts to correct the result. For example,

consider the numbers x and y with r

100 100 h

and r The most signicant digit of x represented in radix r is and

h

the most signicant digit in radix r is zero Performing a left shift of

digit in radix r we get The pro duct is but the correct

10 10

result is thus a right shift in radix r is required

100

For simplicity, we assume the first case and do not consider the cycles that the host would spend in this process.

A post-correction is needed if the most significant digit of the result $z_{mfp}$ is zero. In this case the significand is shifted one host digit to the left, reducing the significand length by one digit, and the exponent is decremented by one:

$$z_{mfp} \leftarrow z_{mfp} \times r_h$$

$$z_{efpr} \leftarrow z_{efpr} - 1$$
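The host-side bookkeeping for FP multiplication (multiply the significands, add the exponents, post-correct if the leading host digit is zero) can be sketched as below. This is a minimal sketch under assumptions: significands are modelled as exact fractions in $[1/r_h, 1)$ rather than digit vectors, the significand product stands in for the coprocessor operation, and the function name is hypothetical.

```python
from fractions import Fraction
from typing import Tuple


def vlp_fp_multiply(x_m: Fraction, x_e: int,
                    y_m: Fraction, y_e: int,
                    r_h: int) -> Tuple[Fraction, int]:
    """Return (z_m, z_e) with z_m * r_h**z_e == (x_m * r_h**x_e) * (y_m * r_h**y_e)."""
    z_m = x_m * y_m      # significand product (done by the coprocessor in the text)
    z_e = x_e + y_e      # exponent handling stays on the host
    # Post-correction: a zero leading host digit means z_m < 1/r_h.
    if z_m != 0 and z_m < Fraction(1, r_h):
        z_m *= r_h       # shift the significand one host digit to the left
        z_e -= 1         # and decrement the exponent by one
    return z_m, z_e
```

Since both significands are at least $1/r_h$, their product is at least $1/r_h^2$, so a single one-digit correction is enough to bring the result back into range.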

In GMP there is no rounding of the significand after the multiplication: GMP computes $z_{mfp}$ with twice the precision of $x_{mfp}$ or $y_{mfp}$, whichever has the largest precision, that is, $2\max(\|x_{mfp}\|, \|y_{mfp}\|)$. As presented in Chapter, the VLP multiplication algorithm avoids unnecessary computation when the precision of the output is the same as that of the operands, and a significant speedup in the use of the coprocessor is expected in this case.

The host may also provide to the coprocessor only the precision necessary for the requested computation. If the requested output precision is $m$ and the precisions of the operands are $m_1$ and $m_2$, the host may send to the coprocessor the precision $m$ if $m < m_1, m_2$, or the precisions $m_1$, $m_2$ otherwise.

VLP FP Division

Consider the FP division $Q_{fp} = N_{fp} / D_{fp}$. One condition for the VLP division algorithm is to have the divisor $d$, a fixed-point number, in the range $1/2 \le d < 1$. However, the FP number $D_{fp}$ has its significand in the range $1/r_h \le d_{mfp} < 1$. Thus, the following operations are performed at the host:

1. Shift $d_{mfp}$ to the left until the first fractional bit is one. The normalized number is called $d_n$. The number of bit positions shifted is $S$.

2. Compute the short-precision scaling factor $M = 1/\hat{d}_n$ using the FP ALU at the host. This operation is performed at the host considering a short-precision reciprocal of the divisor's significand, as explained in Section. Here $\hat{d}_n$ denotes the truncated value of $d_n$.

3. Provide the value $M$ to the coprocessor for online prescaling, or perform prescaling at the host level to generate $M\,d_n$ and $M\,n_{mfp}$.

4. Pass the operand digits (scaled or not, depending on the previous step) to the coprocessor.

5. Read the result digits, performing the OFC algorithm described in Section. The converted result is $q$.

6. Compute the exponent of the result, $q_{efp} = n_{efp} - d_{efp} + S$. The value $S$ is added to correct the resulting significand by the positions shifted to adjust the range of the divisor (step 1).

7. Correct the exponent and significand depending on the most significant digit of $q$.

VLP FP Square Root

The condition for VLP square root computation was shown in Chapter. The fixed-point radicand $x$ manipulated in the VLP operation is restricted to the range stated there. The floating-point number $X_{fp}$ must be manipulated by the host as follows in order to compute $Z_{fp} = \sqrt{X_{fp}}$:

1. Correct the significand. Since $1/r_h \le x_{mfp} < 1$, $x_{mfp}$ can be below the lower bound of that range, and it is then necessary to shift the significand to the left. A left shift of $p$ bits on the significand is equivalent to multiplying $x_{mfp}$ by $2^p$. Another reason to correct the significand is an odd value of $x_{efp}$. The new number is $x_{mfp}\,2^{p} \times 2^{x_{efp} - p}$, such that $x_{efp} - p$ is even.

2. Compute the short-precision value $Y$ based on a short-precision computation of $1/\sqrt{x_{mfp}\,2^{p}}$, together with the value derived from $Y$ that the algorithm requires. Send both values to the coprocessor after digit conversion.

3. Submit the corrected significand to the coprocessor.

4. Read the result and perform OFC to obtain $z_{mfp}$.

5. Compute the exponent

$$z_{efp} = \frac{x_{efp} - p}{2}$$

6. If $z_{efp}$ is not divisible by $w$ (recall that $r_h = 2^w$), correct $z_{mfp}$ by performing $k$ right bit shifts until $z_{efp} + k$ is divisible by $w$. Make $z_{efpr} = (z_{efp} + k)/w$.

3

Consider the following example in radix r X

f p 8

q

9 2

And X Note that x already

8 f p 8 mf p

however

q q

45

9

8 8


creates a noninteger p ower of which is not easy to compute Thus the sig

nicand is shifted to the left one bit p osition to obtain the new FP number

8

The signicand of this new number is applied to the copro cessor

8

that returns the result The result exp onent z will b e rst computed

8 ef p

as Since is not divisible by w two right shifts of the nal result will result

6 2

in the signicand z and z

8 ef p ef pr
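The exponent handling for the square root can be sketched in isolation. The sketch below covers only the parity adjustment (the shift that makes the binary exponent even) and the final alignment to a multiple of w; it omits the range normalization of the radicand and the prescaling values, and the function names and the representation of the significand as an exact fraction are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Tuple


def prepare_sqrt_operand(x_m: Fraction, x_e: int) -> Tuple[Fraction, int, int]:
    """For X = x_m * 2**x_e, return (x_m * 2**p, p, z_e) with x_e - p even
    and z_e = (x_e - p) / 2, so that sqrt(X) = sqrt(x_m * 2**p) * 2**z_e."""
    p = x_e % 2                      # one left shift only when the exponent is odd
    return x_m * 2 ** p, p, (x_e - p) // 2


def align_exponent(z_m: Fraction, z_e: int, w: int) -> Tuple[Fraction, int]:
    """Shift z_m right by k bits until z_e + k is a multiple of w;
    return the shifted significand and the exponent in radix 2**w."""
    k = -z_e % w
    return z_m / 2 ** k, (z_e + k) // w
```

Both steps leave the represented value unchanged: the first moves a factor of two from the exponent into the significand, and the second moves k factors of two back out of the result significand.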

Chapter

VLP Circuit Design for FPGAs

This chapter shows the estimates of area consumption for the data path, selection function, and other modules used in the VLP circuits. We are not concerned with the control complexity in this study. The given estimates are upper bounds on the number of CLBs in the designs. Optimizations at the synthesis level are expected to reduce the actual number of CLBs.

VLPA operations are constructed based on the presented building blocks. Since we are considering reconfigurable architectures, the precision of these basic operators can be adjusted in order to satisfy the performance requirements of the higher-level arithmetic algorithms for VLPA.

The first section discusses important aspects of the VLP algorithms in FPGAs. The next sections show the estimates of area and time of the arithmetic operators used in the VLP data path and selection functions.

Important Design Aspects

FPGA Time Parameters

In all estimates we consider LUT-based FPGAs. The main FPGA parameters considered in these estimates are shown in Table. Specific values for a particular device are shown in Appendix A. The delay of a circuit always includes the delay of the interconnect to deliver the output value.

Pipeline Degree

Different degrees of pipelining can be used to shorten the cycle time and improve the overall performance. As discussed in [TE], a long-precision online

CLB Switching Characteristics

Description                                                   Symbol
Combinational delays
  F/G inputs to X/Y outputs                                   T_ILO
  F/G inputs via H to X/Y outputs                             T_IHO
  C inputs via DIN through H to X/Y outputs                   T_HH2O
CLB fast carry logic
  Operand inputs to COUT                                      T_OPCY
  Add/Subtract input to COUT                                  T_ASCY
  Initialization inputs to COUT                               T_INCY
  CIN through FGs (1) to X/Y outputs                          T_SUM
  CIN to COUT, bypass FGs                                     T_BYP
  Carry network delay, COUT to CIN                            T_NET
Sequential delays
  Clock K to outputs Q                                        T_CKO
  Setup time before clock K
    F/G inputs                                                T_ICK
    F/G inputs via H                                          T_IHCK
    C inputs via DIN                                          T_DICK
Average interconnect delay                                    t_interc

Input/Output Timing Characteristics

Description                                                   Symbol
Global low-skew clock to output using output FF               T_ICKOF
Input setup time using global low-skew clock and input FF     T_SPD
Input hold time using global low-skew clock and IFF           T_PHD

(1) Function Generators

Table: FPGA timing parameters (adapted from the Xilinx data book)

multiply-add module has the best performance for the maximum degree of pipelining. A pipelined structure has more latency than a non-pipelined one. The impact of a large latency is not so significant for the arithmetic operations when hundreds or thousands of bits are considered. On the other hand, the pipelined structure has a shorter cycle time that affects all the digits computed by the unit (higher throughput), which significantly reduces the total operation time. Based on this observation, we assume that a maximum degree of pipelining is used whenever possible.

Based on the cost-performance relation discussed in [Kwa], let us consider $t$ as the cycle time (the time required by a non-pipelined circuit). To execute the same task on a $k$-stage pipeline with an equal flow-through delay $t$, one needs a clock period of

$$t_p = \frac{t}{k} + d = \frac{t + k\,d}{k}$$

where $d$ is the delay of each latch. This corresponds to a maximum throughput of

$$f = \frac{1}{t_p} = \frac{k}{t + k\,d}$$

The total pipeline cost is estimated by $c + k\,h$, where $c$ is the cost of all logic stages and $h$ is the cost of each latch. The parameter $h$ for FPGAs is practically zero. Defining the performance/cost ratio as in [Kwa],

$$PCR = \frac{f}{c + k\,h} = \frac{k}{(t + k\,d)(c + k\,h)}$$

which has a maximum for

$$k_0 = \sqrt{\frac{t\,c}{d\,h}}$$

For a non-pipelined data path circuit implemented in an XC-series device, we obtained values for $c$ (in CLBs) and for $t$ and $d$ (in ns). With maximum pipelining, the increase in area is only a few CLBs, which gives the cost $h$ of each latch. These numbers result in a $k_0$ much above the maximum number of stages allowed in the architecture.
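The pipelining relations above are easy to evaluate numerically. The sketch below encodes the clock period, throughput, performance/cost ratio, and optimal stage count; the function names are hypothetical, and the numeric values in the check (t = 100, d = 4, c = 200, h = 2) are made up for illustration and are not the measured values of the text.

```python
import math


def clock_period(t: float, d: float, k: int) -> float:
    """Clock period of a k-stage pipeline: t/k + d."""
    return t / k + d


def throughput(t: float, d: float, k: int) -> float:
    """Maximum throughput f = k / (t + k*d)."""
    return k / (t + k * d)


def performance_cost_ratio(t: float, d: float, c: float, h: float, k: int) -> float:
    """PCR = f / (c + k*h)."""
    return throughput(t, d, k) / (c + k * h)


def optimal_stages(t: float, d: float, c: float, h: float) -> float:
    """Stage count k0 = sqrt(t*c / (d*h)) that maximizes the PCR."""
    return math.sqrt((t * c) / (d * h))
```

As $h$ tends to zero, $k_0$ grows without bound, which matches the observation that on FPGAs, where latches are practically free, the optimum lies beyond the number of stages the architecture allows.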

Another justification for the use of the largest level of pipelining is the total execution time. The stage delay grows in steps of the CLB delay plus interconnect time, say $T_{FG}$. With maximum pipelining of $k$ stages, the cycle time is given by the minimum step $T_{FG}$. With fewer stages, the next balanced pipelining structure has a cycle time $2T_{FG}$ and $k/2$ stages. The total task time for $C$ inputs is

$$T_k = (k + C - 1)\,T_{FG}$$

with the maximum pipelined structure, and

$$T_{k/2} = \left(\frac{k}{2} + C - 1\right) 2T_{FG} = (k + 2C - 2)\,T_{FG}$$

for the other case. When $C \gg k$, which is the case for VLP computation, the time to execute the task with half the maximum pipelining degree is basically twice the minimum time to execute the task in the architecture.
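The comparison between the two pipelining degrees can be checked with a one-line model. The sketch assumes the usual fill-plus-stream count, (stages + inputs - 1) cycles, for a pipeline processing a stream of inputs; the function name and the sample values are illustrative only.

```python
def task_time(stages: int, cycle_time: float, inputs: int) -> float:
    """Total time to push `inputs` items through a pipeline with `stages`
    stages, assuming one item enters per cycle."""
    return (stages + inputs - 1) * cycle_time


# Illustrative values: k = 8 stages at cycle time T_FG = 1,
# versus k/2 = 4 stages at cycle time 2*T_FG, for C = 1000 inputs.
full = task_time(8, 1.0, 1000)
half = task_time(4, 2.0, 1000)
ratio = half / full
```

For these values the ratio is just under 2, and it approaches 2 as the number of inputs grows relative to the number of stages.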

Digit Representation

Digits manipulated by arithmetic operators may be in conventional or redundant form.

For positive digits, the representation in conventional form uses $k = \log_2 r$ bits per radix-$r$ digit, and in redundant carry-save (CS) form it uses $2k$ bits.

A radix-$r$ signed digit in a maximally redundant set may use a conventional form (two's complement or one's complement) or a redundant representation such as carry-save or borrow-save code. The conventional digit representation uses $k + 1$ bits per radix-$r$ digit, and the redundant representation uses $2k$ bits per digit.

Borrow-save code has been in use for a long time, for example in the Illiac III [Atk]. Using this code, the borrow-save representation of each signed digit is a vector of signed bits $b$, each represented by two binary variables $(b^+, b^-)$, such that the signed-bit value is evaluated as $b = b^+ - b^-$.
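A borrow-save digit can be evaluated directly from its two bit vectors. The sketch below is an illustration with a hypothetical function name; it takes the positive and negative bit vectors most significant bit first.

```python
from typing import Sequence


def borrow_save_value(plus: Sequence[int], minus: Sequence[int]) -> int:
    """Value of a borrow-save digit: each signed bit is b = b_plus - b_minus,
    weighted by its binary position (most significant signed bit first)."""
    value = 0
    for b_plus, b_minus in zip(plus, minus):
        value = 2 * value + (b_plus - b_minus)
    return value
```

The code is redundant: for example, the pairs ([1, 0, 1], [0, 1, 1]) and ([0, 1, 0], [0, 0, 0]) both evaluate to 2.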

We make use of different digit representations in order to obtain the most efficient arithmetic structures for VLP computation.

Design of Arithmetic Operators for FPGAs

The main operators used in the design of VLP operations are conventional or online. In terms of the data interface, the conventional arithmetic operations can be designed for digit-parallel or digit-serial modes of computation. Digit-serial mode in conventional arithmetic is done Least Significant Digit First (LSDF). Online algorithms use a digit-serial interface and work in the Most Significant Digit First (MSDF) mode of computation.

The algorithm types are related to the way the operands and results are manipulated. The types are sequential, serial-parallel, unfolded (fully parallel, non-pipelined or pipelined), and serial algorithms. The fully parallel type is the most area-consuming, but it usually has the highest speed. Serial and sequential types are implemented by circuits that use less area than the parallel type and have a lower I/O requirement. On the other hand, the number of cycles to complete an operation is larger for serial and sequential types than for parallel.

We consider in the next sections the features of the basic arithmetic operations in the organizations that are of interest for this work.

Addition

This section presents circuits for parallel and serial addition. The circuits are grouped as digit-parallel and serial (conventional or online) addition.

Digit-parallel Addition

Experimental results for conventional addition in the XC series of devices [YX, Xil] indicate that Ripple Carry Addition (RCA) is the fastest approach for operands below a certain number of bits. For longer operands, a good choice is the Carry Select Adder. Both of these adders are also called Carry Propagate Adders (CPA).

Parallel addition using redundant adders, like Signed-Digit Adders (SDA), would enable addition to be done in a fixed time, independent of the operand precision. The inconvenience of using SDAs is the generation of an output in redundant form, so it is sometimes necessary to convert the output from redundant to conventional representation. A CPA is used for this task. Redundant adders are justified in the addition of many operands, or in the cases when the output of the adder can be used without conversion by another arithmetic structure.

The use of the dedicated carry logic available in many FPGA devices is possible only if SD numbers are represented in two's complement form. In this case the area of a maximally redundant radix-$r$ SD adder is the same as that of an RCA sized for one radix-$r$ digit (on the order of $\log_2 r$ bits). The delay is proportional to the number of bits.

When signed digits are represented in Borrow Save (BS) code, the fast carry logic cannot be used efficiently, since the chain of carries is very short. The area of the SDA in this case is a constant multiple of the area used by a conventional RCA and grows linearly with $n$. A design using two FAs would have a delay of a fixed number of function generators plus interconnect, and it is independent of the precision $n$.

Another redundant adder that can be used is the Carry Save Adder (CSA). This adder uses half the area used in the SDA to combine one operand in CS form and another in conventional form, with roughly half of the delay. To combine two operands in CS form, the CSA has the same area and delay as the SDA.

Table shows the area-delay relation for these adders without registers (only combinational delay). The types of operands are also shown as NR (non-redundant), CSA (carry-save form), and BS (borrow-save form).

Digit-Serial Addition

Conventional

Serial addition in the conventional number system receives operands least significant digit first. The basic organization is a short-length CPA and a flip-flop. The carry out of one iteration is stored in the flip-flop and used as carry in of the

Adder type Type of Op Area CLBs Delay T ns

P AD D

n

RCA CSA or NR d T T e

OPCY NET

2

n3

c T T b

BYP NET

2

T t

SUM inter c

SDA BS co de SDs n T t

IHO inter c

CSA CSA NR n T t

I LO inter c

CSA n T t

I LO inter c

Table: Area-delay of digit-parallel adders in the XC

next iteration. We define the precision of the short-length adder as $n$ bits and the total precision of the operands as $m$ bits. In each cycle, $n$ bits are received and processed in parallel (see Figure). A group of $n$ bits represents a digit in radix $2^n$. The number of digits to be added serially is $d(m, n) = \lceil m/n \rceil$.

[Figure: Conventional radix-$2^n$ serial adder. An n-bit CPA with two n-bit inputs and an n-bit output.]

Using RCAs, the two smallest digit sizes can be implemented without the dedicated carry logic, with delays $T_{ILO} + T_{CKO} + t_{interc}$ and $T_{ILO} + T_{HH2O} + t_{interc} + T_{CKO}$, respectively. For larger $n$, the use of the fast carry logic produces the best designs in terms of area and delay. Carry select adders should be considered only for large $n$.

The cycle time of a radix-$2^n$ serial adder is

$$t_{cycle}(n) = T_{PADD}(n) + T_{CKO}$$

where $T_{PADD}(n)$ is the delay of a conventional parallel adder of length $n$. The total time for serial addition of $m$ bits using an adder of $n$ bits is computed as

$$T_{SADD}(m, n) = d(m, n)\; t_{cycle}(n)$$

The area of a serial adder depends only on the digit radix. The inclusion of the FF for state storage does not increase the area of this adder beyond the area already computed for the parallel adder of the same digit size.
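The behaviour of the conventional digit-serial adder (an n-bit CPA plus a carry flip-flop, fed least significant digit first) can be simulated as follows. This is a behavioural sketch with a hypothetical function name; it models the cycle count $\lceil m/n \rceil$ but not the timing parameters.

```python
from typing import Tuple


def serial_add(x: int, y: int, m: int, n: int) -> Tuple[int, int]:
    """Add two m-bit unsigned operands n bits per cycle, least significant
    digit first. Returns (sum, number of cycles)."""
    mask = (1 << n) - 1
    carry = 0                           # state held in the carry flip-flop
    result = 0
    cycles = -(-m // n)                 # ceil(m / n) radix-2**n digits
    for j in range(cycles):
        a = (x >> (j * n)) & mask       # digit j of each operand
        b = (y >> (j * n)) & mask
        s = a + b + carry               # the n-bit CPA
        result |= (s & mask) << (j * n)
        carry = s >> n                  # carry out, reused in the next cycle
    return result | (carry << (cycles * n)), cycles
```

For example, adding 0xFFFF and 1 as 16-bit operands with a 4-bit adder takes four cycles, with the carry rippling through the flip-flop on every cycle.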

Online Addition

Online operations rely on the use of a redundant number system to represent the output. The cost and delay of online addition depend on the encoding of the redundant operands.

The general structure of an online adder is presented in Figure. It is composed of a redundant adder, a selection function for the output digits, and registers. The circuit implements the following recurrence equation:

$$W[j+1] = r\,\bigl(W[j] - z_j\bigr) + r^{-\delta}\,\bigl(x_{j+\delta} + y_{j+\delta}\bigr)$$

where $W[j]$ is the scaled residual, $z_j$ is the output digit, and $x_{j+\delta}$, $y_{j+\delta}$ are input digits in the set $\{-a, \ldots, a\}$ with $r/2 \le a \le r - 1$. The online delay $\delta$ can be as low as one clock cycle, depending on $r$. The cycle time is a function of the selection function, which itself depends on the radix $r$. The complexity of the selection function increases as the radix increases.

The design of online adders using BS code is presented in [DMV]. A radix $r = 2^n$ signed-digit online adder is obtained as an extension of the signed-bit adder: the radix-$r$ digit is formed by concatenating $n$ signed bits. The delay of the implementation is independent of the radix. The design of a radix online adder is shown in Figure and uses full adders as the building block.

[Figure: Basic online adder structure. Inputs $x_{j+\delta}$ and $y_{j+\delta}$ feed the adder, which produces $W[j+1]$ from $W[j]$; the selection function (Sel.) produces the output digit $z_j$.]

The delay of the implementation is

$$T_{OLA} = 2\,T_{FA} + T_{CKO}$$

where $T_{FA} = T_{ILO} + t_{interc}$.

The area of a radix-$r$ online adder grows linearly with $n$ (in CLBs). If we include a layer of registers between the layers of FAs, the delay increases by one cycle, but the cycle time is reduced to

$$T_{OLAp} = T_{FA} + T_{CKO}$$

The total time to perform serial radix-$2^n$ online addition of operands with $m$ bits is $T_{OLA} \times d(m, n)$. The increase in area due to pipelining corresponds to a fixed number of CLBs, independent of the radix.

The circuit output is a number in the redundant number system. The use of on-the-fly conversion (OFC) [EL] reduces this disadvantage: the conversion is done as the output digits are generated, such that the output is already converted by the time the last digit is received.

Multiplication

In this section we present the area-time models of multiplier operators in fully parallel and serial-parallel organizations. Only the most convenient components for FPGAs are shown.

[Figure: Radix online adder. Two layers of full adders (FA) take the signed-bit inputs a2+, b2+, a2-, b2-, ..., a0+, b0+, a0-, b0- and produce the output digit z2+, z2-, z1+, z1-, z0+, z0-.]

A multiplier usually consists of multiple generators, a reduction structure that produces the product in a redundant form, and a CPA to obtain the product in a conventional form.

Parallel multiplier

Parallel $n \times n$ multipliers are potentially the fastest but use the most area. The reduction structure can be a tree of FAs, a group of RCAs, a network of column compressors, etc. We describe in the following sections the organizations used in the VLP algorithms. A general description of multiplier structures is given in [Kor].

In this thesis we use the multiplier operator in the implementation of the digit-by-vector multiplier (DV), already defined in previous chapters. The DV multiplier uses a linear-array structure that is described next.

Linear-Array multiplier

The linear-array multiplier uses CSAs to reduce the multiples and includes a CPA to obtain the final conventional product (see Figure). In this case the area is

n n

n n d e CLBs and the delay is nT t T n t

I LO inter c RC A inter c

2 2 This implementation consumes more area but it has lesser delay than the rst one

[Figure: Array multiplier using CSAs. A 4 x 4 array of F cells with column inputs x3..x0 and row inputs y0..y3, followed by a CPA; the product bits are m7..m0.]

Pipelining is used to increase the throughput and reduce the cycle time. A pipelined version of the array multiplier has a cycle time limited by the last CPA. In our designs we are able to use a parallel multiplier that generates a redundant representation and, as a consequence, the CSA reduction without the CPA stage can be used. Without the CPA stage, the pipeline cycle time is as low as one CSA delay plus flip-flop and interconnect delay. The area is affected by the registers required between stages to store the value X and the bits of Y used by the following stages.

Many methods can be used to improve the performance of the array multiplier, among them Booth recoding and other alternatives for multiple reduction. Booth recoding is discussed next. Other alternatives for reduction were considered in this work but were discarded because of the difficulty of having a single description that could be instantiated for various operand sizes. We looked for a generic description of a parallel multiplier that could be easily adjusted to different operand sizes. A tree multiplier has a structure that depends on the size of the operand. For this reason we do not describe or analyze the tree reduction structures or column compression schemes in this thesis. A faster implementation could certainly be obtained if these other reduction schemes were applied to the particular implementation of the VLP algorithms described in this work.

Booth recoding [Kor] is used in multiplication to transform the binary vector that represents the operand Y into another vector composed of radix-4 digits in the set $\{-2, -1, 0, 1, 2\}$. The multiplication of a radix-4 digit in this digit set by X is obtained by simple operations like complementation and shifting. The output of the Booth recoding circuit uses a 3-bit code (z, s, c) that indicates the conditions zero, shift, and complement (negative multiple). Booth recoding can be used for signed or unsigned operands. When considering unsigned numbers, an extra most significant bit of 0 must be considered to compute the number of rows. The structure of the multiplier using this technique is shown in Figure.

Since the output of the recoder consists of 3 bits, 3/2 CLBs are necessary for each radix-4 digit. When applied to the array multiplier, the number of rows is basically reduced by half. For an operand of $n$ bits, only $\lceil (n+1)/2 \rceil$ addition stages are required. An estimate of the array multiplier area using Booth recoding and CSA reduction of multiples is presented in Table. As shown in Figure, the multiple generator is built from LUTs, and its CLB cost is counted per bit. If a multiplier with conventional representation is required, we need to add the CPA area shown in Table. The same table also presents the extra area required for maximum pipelining.
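The recoding itself can be sketched in software. The example below assumes the standard radix-4 Booth rule (each digit is $y_{2i} + y_{2i-1} - 2y_{2i+1}$) applied to an unsigned n-bit operand, and it maps each digit to a (z, s, c) triple in the sense described above; the function names and the exact bit encoding of the triple are assumptions for illustration.

```python
from typing import List, Tuple


def booth_radix4(y: int, n: int) -> List[int]:
    """Recode an unsigned n-bit operand into ceil((n+1)/2) radix-4 digits
    in {-2, ..., 2}, least significant digit first."""
    rows = (n + 2) // 2                 # ceil((n+1)/2): one extra 0 bit on top
    digits = []
    prev = 0                            # y_{-1} = 0
    for i in range(rows):
        b0 = (y >> (2 * i)) & 1
        b1 = (y >> (2 * i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def control_code(digit: int) -> Tuple[int, int, int]:
    """(z, s, c): zero multiple, shifted multiple (|digit| == 2), complement."""
    return int(digit == 0), int(abs(digit) == 2), int(digit < 0)
```

For the 4-bit operand 14 (binary 1110), the recoding gives the digits -2, 0, 1 from least to most significant, and indeed -2 + 0*4 + 1*16 = 14. Three rows are needed instead of four partial products.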

The time to compute the multiplication is divided into parts: Booth recoding time, multiple generation, CSA addition stages, and the CPA addition stage, which add up to

$$T_{mult}(n) = T_{BR} + T_{MG} + \left\lceil \frac{n}{2} \right\rceil T_{CSA} + T_{RCA}(n)$$

Area of array multiplier for nbit op erands

Comp onent CLBs p er radix digit quantity total

n+1 3 n+1

e d e Bo oth reco der d

2 2 2

multiple

n+1 n+1

e e generator n d n d

2 2

n+1 n+1

CSAs n d n d e e

2 2

7 n+1

Total n d e n

2 2

Table Array Multiplier with Bo oth Reco ding CS output

n+1 n

d e CPA

2 2

2

1 n n n

pip eline add

2 2 2 4

Table Extra area for the LinearArray Multiplier

where

$$T_{BR} = T_{ILO} + t_{interc}$$

and

$$T_{MG} = T_{IHO} + t_{interc}$$

For a multiplier that generates the output in CS form, the CPA time $T_{RCA}$ should be removed from the previous equation.

The cycle time of a pipelined multiplier is dominated by the multiple generator delay. The clock cycle time reduces to $T_{IHO} + T_{CKO}$, and the total number of stages in the multiplier for an operator of $n$ bits is $\lceil \frac{n+1}{2} \rceil$.

Serial-parallel multiplier

When the speed of the component is not a vital factor, serial-parallel operators can be used. There are several possibilities for serial-parallel multipliers; however, for FPGA technology the design proposed in [Lou] is the most efficient one, besides having other features that are adequate for FPGAs.

Parallel-serial multiplication is obtained by combining the results of digit-by-word multipliers. Multiplier designs like the ones shown in [Pet, HP] broadcast the digit value to several arithmetic modules at the same time, such that they can compute the digit-by-word product in parallel. This type of solution implies a large fanout for the digit communication line (broadcast of the signal).

Figure: Array Multiplier using Booth recoding and CSAs (MG: multiple generator, BR: Booth recoder, CSA: carry-save adder; the product XY is produced in redundant form)

The broadcast of signals is particularly expensive on FPGAs. As in ASICs, the delay of signals depends on the load and length of the interconnects. However, because of the flexibility of the interconnect network in FPGAs, the communication delays grow faster. The Linear Sequential Array (LSA) [Lou] is a generalization of the pipelining structure suggested in [Erc], and it can be used effectively to reduce the broadcast of signals and to pipeline arithmetic data paths.

Figure shows a radix-4 LSA multiplier with 2 bits per LSA module. Booth recoding is used to transform the radix-2 input into the radix-4 digit set. The input $Y'$ is a subvector of $Y$ containing only the bits necessary to perform the Booth recoding algorithm in each step. Variable $z_j$ represents a radix-4 digit in the set $\{-2, -1, 0, 1, 2\}$. The particular case of the computation done in one LSA module is also shown in Figure.

Figure: LSA multiplier, radix 4 (modules LSA3 to LSA0 fed by overlapping bit groups of X, a Booth recoder that produces the digits z from Y', and a (4,2) CPA; the detail shows the sum and carry bits handled by one LSA module)

The addition of values is done in CS form. The carry out generated during addition in one clock cycle ($c_j$) is kept to be used in the next clock cycle ($c_{j-1}$). Because of that, the bits transferred from the LSA do not include $c_6$. The values of $m_i$ correspond to the bits of the product $x z_j$ that are computed in the digit slice. The values obtained in the LSA that produces $m_3 m_2$, for example, are

z_j    m_3        m_2
 2     x_2        x_1
 1     x_3        x_2
 0     0          0
-1     not x_3    not x_2
-2     not x_2    not x_1
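The idea of keeping the running result in carry-save (CS) form can be sketched in Python as follows. This is a minimal illustration of a 3:2 carry-save adder applied to whole bit vectors, assuming non-negative integer operands; it does not model the per-module timing or the bit transfers of the LSA, and the function names are chosen only for this example.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder on non-negative integers used as bit vectors.

    Returns (sum_vector, carry_vector) such that
    a + b + c == sum_vector + carry_vector.
    """
    sum_vector = a ^ b ^ c
    carry_vector = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vector, carry_vector


def accumulate_cs(values: list[int]) -> tuple[int, int]:
    """Accumulate values while keeping the running result in CS form."""
    s, c = 0, 0
    for v in values:
        s, c = csa(s, c, v)    # no carry propagation inside the loop
    return s, c
```

A single carry-propagate addition, `s + c`, converts the final pair to conventional form, which corresponds to the role of the CPA at the output of the multiplier.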

The scheme of a radix-4 LSA is shown in Figure. It is composed of FAs and multiple generators. In the figure, these components were already mapped to CLBs. A general description of the operation of LSA multipliers is found in [Lou].

PhD Dissertation Chapter DRAFT February

Figure: Radix-4 LSA module (inputs x_i, x_{i-1}, x_{i-2} and the recoded radix-4 digit z from the Booth recoder; multiple generation and FA-based addition mapped to CLBs; transfers from LSA k+2 and to LSA k-2)

A complete example of the LSA multiplication algorithm for particular operands $X$ and $Y$ is shown in Table. The recoded value of $Y$ is $Z$. The values inside boxes are carries that are generated in one cycle and used in the next cycle in the same LSA module. Variable $A$ is set every time the value at the LSA is negative and cleared otherwise. This variable, combined with the complementation of digits performed in each LSA module, is used to generate the negative values of $X$ or $2X$ in two's complement form. The most significant LSA differs from the others in that it must control the sign extensions.

The area and delay of this design for r using two radix Bo oth reco ders

is Lou

n

et T n d

cy cle lsa

n

A n d eC LB s

lsa

where t T t T from Lou t ns for the XC

cy cle IHO inter c CKO cy cle

and the estimate would generate t ns Each LSA mo dule uses CLBs

cy cle

Four bits of the result are generated in each cycle

Table: LSA Multiplication. For each step and each LSA module, the rows list X, the recoded digit z, the product xz, the transfer bits, the CS form, and the variable A.

Summary of Results

A summary of the results discussed in this section is presented in Table, with the following conventions:

n precision of op erators input precision

dm n dmne number of input digits

Time values are given as a function of the FPGA parameters. For serial or pipelined implementations, the delay is given in terms of the number of cycles and the cycle time. Only the operators of interest for the circuits proposed in this thesis are listed in the Table.

Recoder Circuit for the VLP Multiplier

The digit recoder used in the VLP multiplier is implemented as a simplified radix-r online adder with one operand restricted to and values. The most significant signed bits (s-bits) of the digit are zeroes, which allows the removal of most of the full adder (FA) modules in the online adder, with some modification in the interconnections.

The upper row of modified FA modules used in the online adder produces binary outputs $t$ and $u$ from binary variables $x$, $y$, and $z$ such that $x + y - z = 2t - u$. The switching expressions are $t = xy + x\bar{z} + y\bar{z}$ and $u = x \oplus y \oplus z$. In the recoder case, since $x = z = 0$ for many of the inputs, we have that $t = u = y$. A radix-8 online recoder is shown in Figure. It is a simplification of the radix online adder shown in section. The two least significant s-bits are delayed to synchronize the values between different cycles. Another register is used for $z_2$ to align the s-bits in the output digit.
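The behavior of such a modified FA module can be checked exhaustively with a small Python sketch. The sketch assumes the signed-digit relation x + y - z = 2t - u with t = xy + x(not z) + y(not z) and u = x xor y xor z; the signs and complements are assumptions of this example, chosen so that the simplification t = u = y holds when x = z = 0.

```python
from itertools import product


def modified_fa(x: int, y: int, z: int) -> tuple[int, int]:
    """Modified full adder (assumed form): x + y - z == 2*t - u."""
    nz = 1 - z                           # complement of z
    t = (x & y) | (x & nz) | (y & nz)
    u = x ^ y ^ z
    return t, u


def check_modified_fa() -> bool:
    """Verify the arithmetic relation for all eight input combinations."""
    return all(
        x + y - z == 2 * t - u
        for x, y, z in product((0, 1), repeat=3)
        for t, u in [modified_fa(x, y, z)]
    )
```

With x = z = 0 the module degenerates to a pair of wires carrying y, which is the property the recoder exploits to drop most of the FA modules.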

The implementation of this module with FPGAs (XC) uses $k$ CLBs for a radix $r = 2^k$ recoder. The delay is equivalent to two F-function generators plus flip-flop and interconnection delay.

Comp onents Area CLBs Time

Basic blo cks

n+1 3

BR d e T T t

BR I LO inter c

2 2

MG n T T t

MG IHO inter c

CSA T T t

CSA I LO inter c

Parallel Adder

n

RCA d T n T T e

RC A OPCY NET

2

n3

c T T b

BYP NET

2

T t

SUM inter c

Serial Adder

OLA n T t T

I LO inter c CKO

dm n cycles

OLA pip elined n T t T

I LO inter c CKO

dm n cycles

n

e Conventional using RCA d T n t T

RC A inter c CKO

2

dm n cycles

1

Parallel Multiplier

9 n+1 n+1 n

Array multiplier n T T d d e e T

BR MG CSA

2 2 2 2

T n

RC A

7 n+1 n+1

Array multiplier no CPA n T T d d e n e T

BR MG CSA

2 2 2

1 n n

d e n d e Pip elined multiplier add T t T

IHO inter c CKO

2 2 2

n n+1 n+1

ed e e cycles d d

2 2 2

Serialparallel Multiplier

n

e LSA multiplier r d T t TCKO

IHO inter c

4

n

e cycles d

2

n includes the sign bit; thus it corresponds to the total number of bits in a digit NR representation.

Table: Area and time estimates for addition and multiplication of n-bit operators using 4-input LUT FPGAs

Figure: Radix-8 On-Line Recoder (inputs a0, b0, b1, b2 as s-bit pairs; FA modules; radix-8 output digit z2, z1, z0)

Comp onent Area CLBs

Input Registers n

n

e CPA BS to NR conversion of digits d

2

n+2

e CPA rounding d

2

n

TOTAL n d e

2

n

Table Area of the VLP division selection function digits in radix

VLP Data Path Area Estimates

The basic data path used in the VLP algorithms presented in the previous chapters was designed using the components listed in the previous sections. Other components not listed are digit recoders and basic logic components such as multiplexers and shifters. The area used by the recoder of the VLP multiplier was presented in section. Other recoders were discussed in the previous chapter.

The selection function for VLP division has an area that includes the components shown in Table.

The area of the selection circuit for the VLP square root is shown in Table. This selection function is more complex than the one for division. Based on the operation of the VLP square-root algorithm, we select arithmetic operators that are inexpensive in terms of area. The truncated residual $W[j]$ has fractional

Comp onent Area CLBs

1

Y Register n

BS to NR conversion

3n+1

e digits bit d

2

mult input shifter n

3n+2

multiplier LS A d e

16

4

n

mult output shifter

2

n+2

CPA rounding d e

2

n+1

e Mux for register a d

2

3n+2 n+2 3n+1

e d e d e TOTAL n d

2 4 2

n

Table Area of the VLP square ro ot selection function digits in radix

digits and one integer digit in BS code; each digit has $n$ bits ($n = \log_2 r$). The approximation of the output has n fractional bits plus integer bits. A serial-parallel multiplier is used to scale the residual (LSA multiplier in radix); for this reason a shifter is used to generate one of the inputs (the reciprocal of the input estimate, $Y^{-1}$). The other input (the truncated residual) is loaded in parallel. Another shifter collects the serial output; only n bits are stored.

The area estimate of the data path with the extra selection circuit options is shown in Table.

Table shows the area used by data paths for some values of $n$.

Delay of Selection Functions

Based on the circuits presented in previous chapters and the time estimates developed, we are able to determine the delay of the selection functions. As the delays used in the previous chapters are normalized in terms of coprocessor cycles with cycle time $T_{cp}$, the values obtained in the previous sections must be divided by $T_{cp}$ to obtain $T_{sel}$.

Comp onent Area Quantity

p

CLBs

digit by vector multiplier

11 n+2

digit by digit multiplier n d e n

2 2

1 n+1 n+1

multiplier pip elining d e n d e

2 2 2

n+2 n+1

d ed e

2 2

CS to BS converter n

OL Adder n

n+1 n+2 11

d ed e subtotal n

2 2 2

1 n+1 n+1

d e n d e

2 2 2

n

OL Adder n

n n

Reco derBSNR n

2 2

Mux VLP Mult

Selection VLP division from Table

Selection VLP sqrt from Table

n

Table Data Path area for digits in radix pip elined

n VLP Multiplication VLP Division VLP Sqrt

Table Area CLBs of the VLP data path for some values of n

Delay of VLP Division Selection

The selection circuit for VLP division does not have any sequential circuit, only two CPAs. The total time is

t T n T n

div sel RC A RC A

n n

t T T T t T T b c b c

div sel OPCY NET SUM inter c BYP NET

Delay of VLP Square Ro ot Selection

The selection circuit for VLP square ro ot is computed as

t T f n T f n T T n

sq r tsel BSNR M U LT MUX CPA

where f n n The NSBR converter was describ ed in section and

corresp onds to an RCA circuit in terms of area and time T is going to b e a

M U LT

sequential multiplier LSA consumes less area and is more adequate for FPGAs

The delay of a input vector multiplexer is T T t Equation

MUX I LO inter c

reduces to

n f n f n

c b c d eT t T T T b

cp sq r tsel f clt BYP NET

and

T T T T t

f clt OPCY NET SUM inter c

Chapter

Performance Evaluation

Given the VLP algorithms and the area-time estimates of the circuit implementation in 4-input LUT FPGAs, we analyze the impact of the reconfigurable arithmetic coprocessor on the overall performance of long-precision arithmetic. Several types of machines could take advantage of the coprocessor. We chose a high-performance workstation as the host and compare the performance of the host alone with that of the host-coprocessor pair.

As explained before, the host executes some tasks that are not worth doing at the coprocessor. This concurrent execution of host and coprocessor is difficult to describe by equations. For this reason we propose a model for the system and simulate the model based on the behavior of the VLP algorithms, the tasks performed by the host, and measurements on a real computer. Based on the simulation results, we investigate how architecture parameters, such as bus communication and processor speed, affect the performance.

Coprocessor Reconfiguration

The reconfiguration time of the coprocessor can be significant. This time is incurred when it is necessary to switch between different parts of the design or when a different arithmetic function is required in the coprocessor. The reconfiguration time $T_{reconf}$ in the Xilinx devices, for example, is several milliseconds in the XC and in the range of s to ns for the XC series. The larger the FPGA, the longer the reconfiguration time. Reconfiguration during the VLP operation with these large delays is unacceptable. The impact of the reconfiguration can be minimized by some of the following techniques:

prediction or preview of the next VLP operation: the host processor, knowing that the coprocessor must be reconfigured, can trigger this operation in advance while other tasks are executed.

host takes over: knowing that the reconfiguration time is going to degrade the performance, the host goes ahead and executes the operation. During this time the coprocessor can be prepared for the next operation.

partial reconfiguration: future 4-input LUT FPGAs may have a partial reconfiguration feature. FPGAs such as the Xilinx XC allow partial reconfiguration. This characteristic would reduce the reconfiguration time significantly, especially for the proposed VLP algorithms. The VLP operation data paths are alike, and the change from one to another is done by proper rewiring and reconfiguration of the selection function circuits only.

For our estimates we assume that the VLP operation is already configured in the coprocessor. The performance impact of reconfigurations can be part of future investigations.
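The "host takes over" technique amounts to a comparison of estimated times. The following Python sketch shows one hypothetical policy; the function name, the parameters, and the decision rule itself are assumptions made for illustration and are not part of the model evaluated in this chapter, which assumes the operation is already configured.

```python
def choose_executor(t_host_op: float, t_cp_op: float,
                    t_reconf: float, configured: bool) -> str:
    """Pick where to run the next VLP operation (hypothetical policy).

    All times must use the same unit. If the required operation is not
    loaded in the FPGA, the reconfiguration time is charged to the
    coprocessor alternative.
    """
    cp_total = t_cp_op + (0.0 if configured else t_reconf)
    return "coprocessor" if cp_total < t_host_op else "host"
```

For example, `choose_executor(500.0, 100.0, 2000.0, configured=False)` selects the host, while the same call with `configured=True` selects the coprocessor.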

Coprocessor Model

A high-level model of the coprocessor was already given in Chapter. A more detailed block diagram of the coprocessor is shown in Figure. The main components are:

Memory Access Control Block: this module controls the access to the coprocessor local memory. It is also able to perform DMA operations to read operands from main memory or from another coprocessor's memory. This feature is useful when cascading VLP operations over multiple coprocessors. It also holds accesses to memory locations that are being accessed by the coprocessor at the same time, based on the status signals from the memory (busy).

Figure: More detailed coprocessor model (bus interface, Memory Access Control Block, dual-port memories OP1, OP2, Result and Residual, VLP Algorithm Control, FPGA, Reconfiguration Control, and Reconfiguration Memory holding the configuration files)

Dual-port Memory elements (Local Memory): the algorithms proposed for VLP computation will have the best performance if the operands and result are stored in memory components that allow simultaneous access to memory locations. For the operands, dual-port memory is important to decouple the speed of the host processor digit transfers from the coprocessor digit consumption rate. The same is valid for the result dual-port RAM. The residual dual-port RAM is used only by the coprocessor, but in each cycle it is necessary to read and write data to the residual memory; thus, in this case, the use of this type of memory simplifies the design. Each block is capable of access arbitration (signal busy) when accesses occur to the same memory address. These memories are also used to store information that is transferred to the coprocessor to execute the required computation, like the scaling factor for VLP division and the precision of the operation. The memory space is viewed differently by the host and the coprocessor. The host sees the accessible memory of the coprocessor as a continuous space. The coprocessor sees each memory element as a separate space; the elements are accessed in parallel and have the same base address.

FPGA: for this model we assume one FPGA chip. The case of multiple chips, as shown in section, may be considered for future investigation.

VLP Algorithm Control: this block represents the control circuit of the VLP algorithm, following the descriptions given in the previous chapters. It is responsible for issuing control signals to the FPGA chip in order to load values in the proper registers at the proper time and to access the memory elements, reading operand digits and writing result or residual digits. The complexity of the control is slightly affected by the memory organization, which in the model will hold $d$ digits.

Reconfiguration Control: the reconfiguration control is the block responsible for downloading the correct reconfiguration files into the FPGA. The controller is activated by the host to force a reconfiguration, or it can be activated by the VLP Algorithm Control block when the operation that was requested is not the one loaded into the FPGA.

Reconfiguration memory: stores the configuration files of all VLP operations. It is loaded by the host computer with the possible files and accessed by the Reconfiguration Control block during reconfiguration time.

The model shows only the most important interconnections between the coprocessor components and the system/processor bus.

Model Parameters

A VLP operation can be performed completely by the host in a time

$$T_{LP} = T_{host} \, C_{prog}$$

where $T_{host}$ is the host clock cycle time and $C_{prog}$ is the average number of cycles needed to execute a program that performs the LP operation.

The bus bandwidth $B_H$ is defined as

$$B_H = \frac{d}{T_b} \quad \text{digits/cycle}$$

where $d$ is defined based on the coprocessor hardware resources and the word size. The bus transfer unit is assumed to be the same size as the host word. $T_b$ represents the number of processor cycles used for each bus transfer.

The coprocessor dual-port memory enables concurrent access to the same memory block when the memory addresses are not the same. We assume that the coprocessor is able to read/write $d$ digits from/to each memory bank. So, based on the block diagram of the coprocessor and the VLP algorithms described, the bandwidth with the local memory is

$$B_{LM} = \frac{d \, (d_{NR} + d_{BS})}{T_{LM}}$$

where

$d_{NR}$ is the number of digits in non-redundant form that are read ($d_{NRR}$) or written ($d_{NRW}$) from/to the local memory during the VLP operation in one clock cycle. We consider $d_{NR} = d_{NRR} + d_{NRW}$.

$d_{BS}$ is the number of digits in redundant form consumed ($d_{BSR}$) or generated ($d_{BSW}$) by the coprocessor in each cycle (residual value). The relation $d_{BS} = d_{BSR} + d_{BSW}$ holds.

$T_{LM}$ is the local memory access time (memory cycle) combined with the FPGA delays.

The value of $T_{LM}$ is defined as a function of the memory parameters and the FPGA I/O characteristics:

T maxT T T T T

LM ICKOF LM setup SPD LM hold LM r ead

where $T_{LMsetup}$ is the memory setup time during a write cycle, $T_{LMhold}$ is the hold time of data read from the memory after the address change, and $T_{LMread}$ is the read access time. The other parameters are related to the FPGA I/O characteristics.

Another constraint relates the memory bandwidth to the data demanded by the data path of the VLP operators:

$$B_{LM} \geq \frac{d_{NR} + d_{BS}}{T_{cp}}$$

where $T_{cp}$ is the coprocessor cycle time, based on the maximum delay of the data path. This minimum coprocessor cycle is obtained based on the FPGA device used and the logical design of the arithmetic operators. Using the structures presented in this thesis with the maximum degree of pipelining, we obtain a cycle time of

$$T_{cpmin} = T_{IHO} + t_{interc} + T_{CKO} \leq T_{cp}$$

The combination of Equations and implies that

$$T_{LM} \leq d \, T_{cp}$$

which gives a lower bound on the number of digits that must be simultaneously read from or written to the memory modules in one memory cycle.
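The lower bound on d can be illustrated numerically with a short Python sketch. It assumes the bandwidth relations B_LM = d (d_NR + d_BS) / T_LM and B_LM >= (d_NR + d_BS) / T_cp, which together give T_LM <= d T_cp; the function names and the sample values in the usage note are hypothetical.

```python
import math


def min_digits_per_access(t_lm: float, t_cp: float) -> int:
    """Smallest integer d that satisfies T_LM <= d * T_cp."""
    if t_lm <= 0 or t_cp <= 0:
        raise ValueError("times must be positive")
    return max(1, math.ceil(t_lm / t_cp))


def local_memory_bandwidth(d: int, d_nr: int, d_bs: int, t_lm: float) -> float:
    """Assumed form B_LM = d * (d_NR + d_BS) / T_LM, in digits per time unit."""
    return d * (d_nr + d_bs) / t_lm
```

With a hypothetical memory cycle of 25 ns and a coprocessor cycle of 10 ns, `min_digits_per_access(25, 10)` returns 3, and the resulting bandwidth for d_NR = d_BS = 2 is 12/25 digits per ns, which exceeds the demand of 4/10 digits per ns.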

An upper bound on the value of $d$ is obtained from the maximum number of user I/O pins in the chip, $N_p$, as

$$\left[ (n+1) \, d_{NR} + 2n \, d_{BS} \right] d + N_c \leq N_p$$

which results in

$$d \leq \frac{N_p - N_c}{(n+1) \, d_{NR} + 2n \, d_{BS}}$$

where $N_c$ is the number of pins used for control signals. The NR representation of a signed digit in radix $r$ ($n = \log_2 r$) takes $n+1$ bits, and the redundant representation of the same digit (BS) takes $2n$ bits.
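The pin-count constraint can be sketched in Python as follows. To avoid fixing the digit widths, the number of bits per NR digit and per BS digit are passed as parameters; the function name and the numbers in the usage note are hypothetical and serve only to show how the bound is evaluated.

```python
def max_digits_by_pins(n_pins: int, n_ctrl: int,
                       bits_nr: int, bits_bs: int,
                       d_nr: int, d_bs: int) -> int:
    """Largest integer d with (bits_nr*d_NR + bits_bs*d_BS)*d + N_c <= N_p."""
    pins_per_d = bits_nr * d_nr + bits_bs * d_bs
    if pins_per_d <= 0:
        raise ValueError("at least one digit must be transferred")
    return max(0, (n_pins - n_ctrl) // pins_per_d)
```

For instance, with 256 user pins, 20 control pins, 16-bit NR digits, 30-bit BS digits, d_NR = 3 and d_BS = 2, each unit of d needs 108 pins and the bound evaluates to d <= 2.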

In our model we assume that $T_{CP} = K_{CP} \, T_{host}$.

The number of cycles of each algorithm, $C_{cp}$, was given in the previous chapters and is different for each VLP operation. It is computed in terms of the number of coprocessor digits that are required as output precision or the number of digits in the input operands ($m$).

The total operation time is composed of

$$T = T_{cp} \, C_{cp} + T_{waitCP} + T_{opt} + T_{res} + T_{FPO}$$

where

$T_{opt}$ represents the operand transfer time. It includes the time that the host needs to adjust the digit format and write to the coprocessor's local memory. This operation is overlapped with the VLP operation in the coprocessor. This parameter considers only the time when the coprocessor is waiting for data from the host before the operation starts.

$T_{res}$ is the result transfer time. It includes the time required by the host to read the coprocessor result, perform OFC, and convert from the expanded digit format to the compact SM format used in the software. Again, this operation can be overlapped with the coprocessor operation, so this parameter considers only the time that the host takes after the last result digit is generated by the coprocessor.

$T_{FPO}$ is the time for other tasks required for proper FP operation.

$T_{waitCP}$ includes the waiting time after the coprocessor starts its operation.
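The composition of the total operation time can be written as a small Python sketch. It assumes the total is the plain sum of the coprocessor compute time and the four overhead terms, and that the speedup is the ratio of the host-only time to this total; the class and field names are chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class OperationTime:
    """Components of one VLP operation, all in the same time unit."""
    c_cp: int          # coprocessor cycles of the algorithm
    t_cp: float        # coprocessor cycle time
    t_wait_cp: float   # waiting time after the coprocessor starts
    t_opt: float       # non-overlapped operand transfer time
    t_res: float       # non-overlapped result transfer time
    t_fpo: float       # other floating-point tasks at the host

    def total(self) -> float:
        return (self.t_cp * self.c_cp + self.t_wait_cp
                + self.t_opt + self.t_res + self.t_fpo)


def speedup(t_lp_host: float, t_total: float) -> float:
    """Host-only time divided by host-coprocessor time."""
    return t_lp_host / t_total
```

For example, `OperationTime(100, 2.0, 10.0, 20.0, 30.0, 40.0).total()` gives 300.0, so a host-only time of 900.0 corresponds to a speedup of 3.0 under these made-up numbers.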

The coprocessor area depends on the VLP operation. We assume the control circuit is implemented in another space. The main components of the circuit synthesized into the FPGA are shown in Figure.

Figure: Block diagram of the circuits inside the FPGA (input buffer fed from the Op1, Op2, Result and Residual memories, forwarding MUX, VLP operation data path, and output buffer writing to the Result and Residual memories)

The blocks in the figure are:

input buffer: stores the digits received from the local memory until they are used by the data path. It is a dual register structure. While one register stores the information being read from the memory, another register is used to shift the digits into the data path. The number of input buffers depends on the number of digits read in each iteration ($d_{NRR}$ and $d_{BSR}$).

output buffer: stores the digits that are generated by the data path until they form a group of digits that can be stored back into the memory. The output of the shift register is stored into another register that keeps the value during a writing cycle. The number of registers required depends on the number of digits to be written into memory ($d_{NRW}$ and $d_{BSW}$).

forwarding multiplexer: this component is required to forward the information from the output buffer to the data path inputs when digits already in the

Op eration d d d d

N RR N RW BSR BSW

VLP Multiplication

VLP Division

VLP Square Ro ot

Table Maximum number of digits read simultaneously in each iteration

Comp onent Area CLBs

Input buer n n

Output buer n n

n+1

Forwarding Multiplexer n

2

Data path of VLP SQRT Table

Table Maximum area required to implement the VLP algorithms in FPGAs

output buffer were not yet written to the local memory. Two-input vector multiplexers are used, and their size depends on the number of output digits that should feed back to the data path inputs.

Data path: already discussed in previous chapters, it constitutes the main data transformation element for the VLP algorithms.
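The dual register structure of the input buffer behaves like a two-stage (ping-pong) buffer. The Python sketch below is a behavioral illustration only, assuming that one register is loaded with d digits from memory while the other is shifted into the data path one digit per call; the class and method names are hypothetical and do not describe the actual register-level design.

```python
from collections import deque


class DualRegisterBuffer:
    """Behavioral model of an input buffer with a load and a shift register."""

    def __init__(self, d: int):
        self.d = d
        self.load_reg = []           # filled from the local memory
        self.shift_reg = deque()     # shifted into the data path

    def load(self, digits):
        """Store one memory word of d digits in the load register."""
        if len(digits) != self.d:
            raise ValueError("expected exactly d digits")
        if self.load_reg:
            raise RuntimeError("load register is still full")
        self.load_reg = list(digits)

    def shift_out(self):
        """Return the next digit, or None if the data path must wait."""
        if not self.shift_reg:
            if not self.load_reg:
                return None
            self.shift_reg = deque(self.load_reg)   # swap register roles
            self.load_reg = []
        return self.shift_reg.popleft()
```

A new memory word can be loaded as soon as the previous one has been moved to the shift register, so memory reads overlap with digit consumption.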

Table shows the maximum number of digits read or written in a single iteration for each VLP operation, and Table shows the maximum area required by the circuit components. The VLP square root was considered in the estimate, since it is the operation that consumes the largest area among the VLP operations. From the table we determine the digit radix, if the amount of reconfigurable resources is fixed, or the FPGA device that is most adequate for the coprocessor.

Measurements

The total time to execute a long-precision computation on a general processor running a software library was obtained using Quantify [Pur], a tool to measure the program run time in terms of processor cycles. We selected an UltraSparc (see footnote 1) machine as the host computer. The host is a RISC processor that runs at MHz. The test programs use the GMP library version [Tor], which is claimed to be one of the fastest libraries for LP computations. These programs are listed in Appendix A. The number of cycles to execute the main routines for each LP algorithm in GMP is shown in Table. The same data is also plotted in Figure, converting limbs to digits in radix $r = 2^{15}$. The use of this radix for the coprocessor is justified later.

Table: Number of cycles ($C_{prog}$) for long-precision operations in GMP. Columns: limbs (bits), Multiplication, FP Multiplication, FP Division, FP Square Root.

Based on the description of the FP tasks done by the host (section) and the analysis of the GMP library routines, we concluded that the manipulation of the significand and exponents is basically the same for GMP and the coprocessor-host system. The time related to this task was also measured, as shown in Table. The value of $T_{FPO}$ depends on the operation and the number of coprocessor digits in the significand ($m$).

Measurements of other tasks performed by the host during VLP operation are shown in Table. These measures use the fact that limbs have bits and the coprocessor digit has bits. The reason for bit digits will become clear later. The numbers in the table give the average number of cycles per digit for compression ($T_{compress}$, in cycles/dig), for OFC ($T_{OFC}$, in cycles/dig), and for the expansion time ($T_{gop}$, in cycles/dig).

Footnote 1: UltraSparc is a trademark of SUN Microsystems.

Figure: Number of cycles for long-precision operations (cycles versus digits; series: Multiplication, FP Multiplication, FP Division, FP Square Root)

Table: Significand and exponent manipulation time in GMP ($T_{FPO}$, in cycles, given as a function of $m$ for FP Multiplication, Division, and Square root)

The case of prescaling at the host (VLP division) was also implemented and measured on the host computer. An online prescaler was implemented in software for digits in the same radix as the coprocessor digits. The prescaling has a setup time $T_{prescalsu}$ cycles and, after that, takes $T_{prescal}$ cycles per coprocessor digit. The time to compute the scaling factor is given as $T_{factor}$ cycles.

Limbs Digits Digit Compression Digit Expansion OFC

Table Other tasks p erformed by the host during VLP op erations

Performance Estimate

This performance estimate is done by first determining the area and cycle time for the VLP operations at the coprocessor level. This estimate considers:

one FPGA chip of a given technology, in this case Xilinx. Based on the model parameters presented in Section, we select the most adequate chip to fit the VLPA circuits. It is also possible to determine the number of coprocessor digits that must be accessed per memory cycle.

dual-port memory components with small access time. Dual-port RAMs inside the chip would have a small access time, but these memories would store a small number of digits, limiting the precision of the VLP operators.

The second phase of this estimate is done considering measurements of the other tasks done by the host: digit preparation, OFC, and digit compression. These measures are simulated using the operation model shown in Figure. The VLP computation is performed by three different tasks:

the host translates the LP format to the coprocessor digit format and sends the information to the coprocessor. The bus transfer time is modeled as an extra delay for each word transferred to the coprocessor. Each word may carry or more coprocessor digits. The exact number is defined later.

the coprocessor consumes the operand digits and generates result digits. If the operand digits are not available, the coprocessor computation waits until the necessary digits arrive. Result digits are immediately available to the host.

the host reads the result digits. This task starts only after task is complete. For each result digit read, an extra time corresponding to the bus transfer delay is added. We assume the worst-case situation, in which only one digit can be read in each cycle. It may be the case that more than one digit can be read at once, which is a better case than the one considered.

Figure: Model for host-coprocessor operation (the host feeds an operand queue through a bus delay; the coprocessor feeds a result queue that the host reads through a bus delay)
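The three-task model can be approximated by a very small event-time calculation in Python. This is a sketch under strong simplifying assumptions that are not taken from the simulation program used in this work: one operand digit per bus word, one result digit produced per operand digit consumed, and result reading that starts only after the host has finished sending operands. All parameter names are hypothetical and all times are in host cycles.

```python
def simulate(m: int, t_prep: int, t_bus: int, t_gop: int, t_read: int) -> int:
    """Return the host cycle at which the last result digit has been read."""
    if m < 1:
        raise ValueError("at least one digit is required")
    # Task 1: the host prepares and sends the operand digits one by one.
    arrival = [(i + 1) * (t_prep + t_bus) for i in range(m)]
    # Task 2: the coprocessor waits for each digit, then produces a result digit.
    done = []
    t = 0
    for i in range(m):
        t = max(t, arrival[i]) + t_gop
        done.append(t)
    # Task 3: the host reads one result digit per bus transfer.
    t_host = arrival[-1]
    for i in range(m):
        t_host = max(t_host, done[i]) + t_bus + t_read
    return t_host
```

When the host is the bottleneck, for example `simulate(3, 2, 3, 1, 1)`, the total of 27 cycles is dominated by sending and reading; when the coprocessor is slow, as in `simulate(2, 1, 1, 10, 1)`, the host waits for each result digit and the total is 24 cycles.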

Coprocessor Area-Time

For this estimate we consider the components shown in Table. Based on the VLP algorithms shown in the previous chapters, these are the parameters for the system:

the host is able to read/write bit words to the coprocessor memory. Based on the discussion in section, the minimum number of digits per word is and the largest coprocessor digit radix is $2^{15}$.

based on the choice of $n$, the area needed by the VLP square root case (the most area-consuming operation) is CLBs (from Table). For this area, the chip XC can be used. We chose a fast component in this family, the XCXL, which has the parameters shown in Appendix A.

Parameter Value

n

d

T ns

cp

Table Copro cessor parameters

from the data sheet and equation we obtain T ns

cpmin

using a dualp ort RAM as describ ed in CY the memory parameters are

T ns T ns and T ns that lead to the deni

LM r ead LM hold LM setup

tion of

T max ns

LM

based on equation we obtain d and T ns The value of d

cp

also satises equation for the values N d and d

p NR BS

leaving N pins for control signals

c

The copro cessor parameters are summarized in Table

All parameters listed in Table are obtained from the VLP algorithms or the area-time estimates presented above. Other considerations about the evaluation performed are:

The BS/NR converter used by the VLP multiplier to transmit the second half of the double-precision product will have a delay of ns, which is less than $T_{cp}$. Thus, during the transmission of the residual to the host, a rate of one digit per cycle can be attained.

$T_{init}$ corresponds to the initialization time of the algorithms before the main iteration steps take place. This time is used only for the VLP square root. It corresponds to the preparation of the residual based on the estimate of the output. The initialization time is coprocessor cycles, based on equation provided in Chapter and the FPGA time characteristics ($K_{cp}$ host cycles).

The simulation considers pipelined data paths and non-overlapped iterations. The number of cycles to complete one iteration when the precision of the input operands is $i$ digits is given by the equation, which is the same for all VLP operations:

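$$
T_{rg}(i) =
\begin{cases}
K_{cp}\,\cdots\, i \,\cdots & \text{if } i \,\cdots\, d \\
K_{cp}\,\cdots\, d \,\cdots\, i \,\cdots & \text{otherwise}
\end{cases}
$$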

The delays of the selection functions for division and square root are estimated using the equations and as coprocessor cycles and coprocessor cycles, respectively.

We consider that the bus that interconnects the coprocessor and host has a delay of $T_b$ cycles. This parameter reflects the Sbus transfer time. Each individual transfer (not in burst mode) takes cycles [Lyl]. The Sbus uses a MHz clock frequency, which results in ns per -bit word transfer. This corresponds to coprocessor cycles or $K_{cp}$ host cycles.
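The conversion from a bus transfer time to coprocessor and host cycles described in the last item can be sketched as follows. This is only a minimal illustration: the function names are hypothetical, the inputs are left to the caller because the actual values are not reproduced here, and rounding a transfer up to a whole number of coprocessor cycles is an assumption of the sketch.

```c
#include <assert.h>

/* Duration of one individual bus transfer, in picoseconds.
   bus_cycles: bus clock cycles per transfer (caller-supplied)
   bus_mhz:    bus clock frequency in MHz (one cycle lasts 10^6 / f ps) */
static long transfer_time_ps(long bus_cycles, long bus_mhz)
{
    assert(bus_cycles > 0 && bus_mhz > 0);
    return bus_cycles * 1000000L / bus_mhz;
}

/* Whole coprocessor cycles needed to cover time_ps, rounded up
   (assumption: a transfer occupies an integral number of cycles). */
static long to_cp_cycles(long time_ps, long t_cp_ps)
{
    assert(time_ps >= 0 && t_cp_ps > 0);
    return (time_ps + t_cp_ps - 1) / t_cp_ps;
}

/* The same interval in host cycles, with k_cp host cycles
   per coprocessor cycle. */
static long to_host_cycles(long cp_cycles, long k_cp)
{
    assert(cp_cycles >= 0 && k_cp > 0);
    return cp_cycles * k_cp;
}
```

For instance, with made-up inputs of 4 bus cycles at 20 MHz, a 60 ns coprocessor cycle and a ratio of 3, the transfer lasts 200 ns, which is covered by 4 coprocessor cycles or 12 host cycles.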

Model Simulation

A simulation program was implemented in C based on the model described and the coprocessor parameters listed in Table . Many parameters are based on measurements in the host computer, as described in Section . The time is computed in terms of host cycles.

The simulation results for some values of precision are shown in Tables and . The VLP integer multiplication generates an output that is twice the precision of the operands. Floating-point operations generate results with precision equal to the precision of the operands. The first case of VLP FP division considers the scaling of operands at the host. The second case is the situation when the prescaling is done by the coprocessor.

The speedup obtained with the host-coprocessor pair is shown in Figure . It is computed with respect to the execution of a similar computation at the host

Parameter          Description                                            Value (cycles)
$T_{gop}$          interval for coprocessor digit generation
$N_{dig}$          number of digits transferred in each bus cycle
$T_b$              number of host cycles for each bus cycle
$K_{cp}$           ratio between the host and coprocessor cycle time
$m$                number of digits in one operand or result              varies
$T_{ofc}$          average number of cycles per digit of OFC
$T_{compress}$     average time to compress one coprocessor digit
$T_{init}$         initialization phase of the VLP algorithm              varies
$T_{sel}$          selection function delay                               varies
$T_{rg}(i)$        time between the generation of two result digits       Eq.
$T_{FPO}$          time for significand and exponent manipulation         Table
$T_{prescalsu}$    set-up time for prescaling of operand (division)
$T_{prescal}$      time for prescaling per digit
$T_{factor}$       computation of division scaling factor

Table: Parameters for the system

digits ($r = 2^{15}$)    Total time    CP time (useful)    Waiting time (Host)    Waiting time (CP)

Table: Host-coprocessor operation: Integer Multiplication

digits ($r = 2^{15}$)    Total time    CP time (useful)    Waiting time (Host)    Waiting time (CP)

Table: Host-coprocessor operation: VLP FP multiplication

digits ($r = 2^{15}$)    Total time    CP time (useful)    Waiting time (Host)    Waiting time (CP)

Table: Host-coprocessor operation: VLP FP division (prescaling at the host)

digits ($r = 2^{15}$)    Total time    CP time (useful)    Waiting time (Host)    Waiting time (CP)

Table: Host-coprocessor operation: VLP FP division (prescaling at the coprocessor)

digits ($r = 2^{15}$)    Total time    CP time (useful)    Waiting time (Host)    Waiting time (CP)

Table: Host-coprocessor operation: VLP FP square root

using GMP. Observe that the speedup for square root is the best among all VLP operations. This result is expected, since the square root operation in GMP is the one that takes the most time, while for the coprocessor the VLP square root is just slightly slower than the others (because of its more complex selection function).

[Figure: speedup versus number of digits; curves: IntMult, FPMult, FPDiv (two cases), FPSQRT]

Figure: Speedup obtained with Host-coprocessor over Host alone

The second best speedup is obtained with VLP division when the coprocessor executes the prescaling locally. VLP division with prescaling done by the host will show some benefit only for more than digits of precision. For low values of precision, the time used by the host to manipulate the coprocessor digits and transfer them cancels the advantage of using the coprocessor.

Observe that the simulation was done for $K_{CP} =$ , which corresponds to a host computer running at MHz (and not MHz).

As the precision of the operands increases, more work is left for the coprocessor and the speedup increases. After a certain point the speedup starts to drop. A qualitative view of the speedup behavior is shown in Figure . The figure illustrates the fact that there is a range of application for the VLP algorithms. Software algorithms have better asymptotic time but more overhead than the VLP algorithms for a large range of precision. The speedup increases as the number of digits manipulated by the coprocessor increases. After a certain number of digits the speedup starts to decrease as a result of the asymptotic time of the VLP hardware algorithm. Thus, given a host-coprocessor pair, the software responsible for the manipulation of the coprocessor should decide if the coprocessor is going to be used.

[Figure: two plots. Time versus precision for the software algorithm and for host+coprocessor; speedup versus precision, with the level speedup = 1 marked]

Figure: Qualitative behavior of the speedup
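The decision left to the software layer can be sketched as a comparison of two time estimates. This is a minimal sketch, not the mechanism used in this work: the type `cost_fn`, the function `use_coprocessor` and both example models are hypothetical, and the coefficients are invented only so that the two curves cross twice. In particular, the software model is linear here just to keep the sketch short; it stands for any model with lower-order growth than the quadratic hardware term.

```c
#include <assert.h>

/* Estimated time, in host cycles, of an operation on m digits. */
typedef double (*cost_fn)(long m);

/* Returns 1 when the host+coprocessor estimate beats the software one. */
static int use_coprocessor(long m, cost_fn software, cost_fn host_cp)
{
    assert(m > 0);
    return host_cp(m) < software(m);
}

/* Illustrative models only (made-up coefficients, not measurements). */

/* Software: per-digit cost with lower-order growth. */
static double sw_example(long m)
{
    return 5000.0 + 300.0 * (double)m;
}

/* Host+coprocessor: fixed transfer and set-up cost, a linear host part
   and a quadratic coprocessor part. */
static double cp_example(long m)
{
    return 8000.0 + 20.0 * (double)m + (double)m * (double)m;
}
```

With these invented models the coprocessor is selected for 100 digits but not for 8 digits (the fixed cost dominates) nor for 300 digits (the quadratic term dominates), which mirrors the range of application described above.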

Another reason for this behavior is the proportion of total time used by the coprocessor. For a small number of digits the host computation time dominates the total time. In this case the total time has a linear behavior. As the precision increases, the coprocessor is responsible for most of the total time and the total computation delay shows an $O(n^2)$ behavior. The proportion of the total time used by the coprocessor, together with its waiting time, is shown in Figure for the case of VLP multiplication.

[Figure: proportion of the total time versus number of digits; series: CPtime, CPWait, other]

Figure: Proportion of time used by the coprocessor

Simulations performed with other values for the parameter $T_b$ showed that the speedup is only marginally affected by variations in the bus transfer time. The variation of the parameter $K_{CP}$ reflects the case when a faster host computer is used. Table shows the system behavior for different values of $K_{CP}$, for $m =$ digits and a bus with $T_b =$ cycles of transfer time. The VLP FP multiplication is considered. The coprocessor is kept at the same clock frequency. The data show that even for very fast host computers the use of the coprocessor will provide a reasonable speedup. Observe that division and square root have better performance than the VLP multiplication considered in this analysis.

$K_{CP}$    Total Time    Speedup    Host clock frequency (MHz)

Table: Variation in the Speedup with the Host speed
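The trend in this table can be illustrated with a first-order model in which every coprocessor cycle costs $K_{CP}$ host cycles. The sketch below is an assumption-laden simplification and not the simulation program: it ignores any overlap between host and coprocessor work, and the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Total time in host cycles when the host part and the coprocessor
   part do not overlap (simplifying assumption of this sketch).
   host_cycles: work done by the host, in host cycles
   cp_cycles:   work done by the coprocessor, in coprocessor cycles
   k_cp:        host cycles per coprocessor cycle */
static double total_host_cycles(double host_cycles, double cp_cycles,
                                double k_cp)
{
    assert(host_cycles >= 0.0 && cp_cycles >= 0.0 && k_cp > 0.0);
    return host_cycles + k_cp * cp_cycles;
}

/* Speedup over a software-only run that takes sw_cycles host cycles. */
static double speedup(double sw_cycles, double host_cycles,
                      double cp_cycles, double k_cp)
{
    return sw_cycles / total_host_cycles(host_cycles, cp_cycles, k_cp);
}
```

In this model, keeping the coprocessor clock fixed while the host becomes faster raises `k_cp`, so the coprocessor share of the total grows and the speedup shrinks gradually rather than collapsing.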

Chapter

Conclusion and Future Research

Throughout this thesis we have shown the advantages of using on-line arithmetic in the implementation of hardware algorithms for VLP computations. For the same digit size, the number of cycles required in VLP operations is less than the number of cycles of other algorithms used for hardware implementation.

The designs presented in this dissertation have been developed to be easily applied to any digit radix. Estimates of area and time provide the required tools to evaluate the applicability of the technique to a particular design space. Based on the estimates, the designer can select the proper chip to be used or select the digit radix that best fits the hardware resources available.

A radix VLP multiplier was implemented in the EVC board [Cora]. The design was specified in VHDL and synthesized using Powerview (1) and XACT (2) FPGA development software. The implementation was integrated into GMP version using the HOT technology [Corb]. The EVC board was connected to a Sun Sparcstation . This experiment was used as a proof of concept and provided the required background for the preparation of the area-time estimates given in Chapter .

All algorithms were implemented and tested in C. The VLP data paths were described as a network of on-line modules, the same way as in the hardware implementation. High-radix on-line operands were also developed. Using the simulation tools it was possible to verify most of the parameters for the VLP algorithms included in this thesis. Most tests were performed in radix or to simplify verification of the circuit operation.

(1) CAD tool from Viewlogic.
(2) CAD tool for FPGA synthesis from Xilinx.

The evolution of the technology leads to an increase in area and speed of FPGA devices. This tendency has been shown over the years. The extra space in the FPGA may be used in two different ways:

store the residual value in the VLP computations, or even the operands and result;

extract more parallelism.

As discussed before, the implementation of dual-port RAM in FPGAs is practically limited to a small addressing space. More memory locations can be implemented at the cost of poor performance. However, an internal memory element would have a better access time and less communication delay. Thus, this option must be reevaluated for future technologies.

To extract more parallelism from the VLP algorithms, the designer can use the extra hardware to increase the radix of the digits used in the coprocessor or have more than one iteration being executed at the same time. The strategy of unfolding iterations is possible for all VLP operations: the result digits produced by one data path are fed into another data path. Especially for VLP multiplication, which has a selection function that does not consume time from the iteration, this approach would be very efficient. The linear organization of FPGA chips proposed in Section could be used to pipeline the various instances of the VLP multiplication data path. VLP division and square root would take less advantage of this strategy, given that their selection functions are more complex than the one used for VLP multiplication and would severely affect the time to trigger consecutive unfolded iterations.

Even though this study concentrated on FPGA technology, the use of ASIC technology would be possible.


Research Contributions

This work provides the following main contributions:

provides algorithms for VLP operations based on high-radix on-line arithmetic;

proposes solutions for the problem of the selection function in high-radix on-line operations. As a result of this study, the dissertation provides the required conditions for VLP computations;

generalizes the estimation of area and time of the proposed algorithm implementations. Such estimates are used as the basis for the performance evaluation of the coprocessor speedup when working cooperatively with the host computer;

shows that the cooperative execution of tasks between the host and coprocessor results in significant overall speedup over the execution of the same task by software in the host alone, for a large range of precision;

proposes new logical designs for arithmetic structures such as the parallel CS to BS converter, the serial digit-by-vector multiplier and the on-line recoder for the VLP multiplier.

Future Research

During this research we observed some directions that could be followed in future research work:

Software implementation of the on-line algorithm for VLP operations. Based on the investigation in the area of software for long-precision computation, it became clear that there is room for the application of the on-line technique for some ranges of precision. The implementation should be as efficient as the ones already available for other algorithms. This requirement would generate an implementation that could be fairly compared with the others. The implementation of the on-line operations in assembly code for the target system would be important.

Analyze the situation when more than one VLP data path is available for the computation, or when the design must be partitioned into multiple chips. The VLP on-line operations have a variable latency to generate and consume digits. This variable timing results in synchronization problems that should be studied.

Investigate the case when more than one coprocessor is available. For this case the variable latency to generate output digits will also play a role. Coprocessors working in an overlapped mode would not have a good match of speed. Besides that, some tasks executed by the host would also affect the performance of this concurrent operation.

Investigate the impact of partial reconfiguration on the system operation. An analysis of the implementation of the same algorithms using another FPGA, such as the XC , would be interesting in order to obtain data for the evaluation of the time overhead of switching between VLP operations.

Researchers claim that FPGAs will include special circuitry for multipliers embedded in the matrix of CLBs. Multiplication is the most area-consuming circuit in the VLP data paths. It would be important to investigate the impact of these dedicated multipliers on the design and performance of the VLP operations.

APPENDIX A

Timing Characteristics of the XC FPGAs (XCXL)

CLB Switching Characteristics

Description                                        Symbol         time (ns)
Combinational Delays
  F/G inputs to X/Y outputs                        $T_{ILO}$
  F/G inputs via H to X/Y outputs                  $T_{IHO}$
  C inputs via DIN through H to X/Y outputs        $T_{HH2O}$
CLB Fast Carry Logic
  Operand inputs to COUT                           $T_{OPCY}$
  Add/Subtract input to COUT                       $T_{ASCY}$
  Initialization inputs to COUT                    $T_{INCY}$
  CIN through F/Gs (1) to X/Y outputs              $T_{SUM}$
  CIN to COUT, bypass F/Gs                         $T_{BYP}$
  Carry network delay, COUT to CIN                 $T_{NET}$
Sequential Delays
  Clock K to outputs Q                             $T_{CKO}$
  Setup time before Clock K
    F/G inputs                                     $T_{ICK}$
    F/G inputs via H                               $T_{IHCK}$
    C inputs via DIN                               $T_{DICK}$
Average interconnect delay                         $t_{interc}$

Input/Output Timing Characteristics (XCXL) (2)

Description                                                    Symbol         time (ns)
Global Low Skew clock to Output using Output FF (3)            $T_{ICKOF}$
Input Setup time, using Global Low Skew clock and IFF (4)      $T_{SPD}$
Input Hold time, using Global Low Skew clock and IFF           $T_{PHD}$

(1) Function Generators.
(2) XC has CLBs and user I/O pins.
(3) Load capacitance of pF.
(4) IFF: Input Flip-flop.

APPENDIX B

Test Program for LP Operations using GMP version

GMP uses limbs of bits. A limb is a high-radix digit that is manipulated by the arithmetic algorithms.

We have made four types of measurements: integer multiplication, floating-point multiplication, division and square root. Using Quantify we measure the number of cycles for the kernel of each operation.

B Test Program for LP Integer Multiplication

This program computes the product of two integers with limbs of precision ( bits). As the software routines are optimized to skip over zeroes, many runs are done to obtain an average number of cycles for the program execution. The kernel of this operation is the procedure mpn_mul, called by mpz_mul.

B Test Program for Floating Point Operations

This program computes the product of two floating-point numbers with limbs in the significand field ( bits). The randomization routines generate patterns with long sequences of zeroes; for this reason the program uses the results of previous operations, which have shorter sequences of zeroes, as operands for the next ones. The difference between the FP multiplication, division and square root test programs is the set of instructions inside the for loop. The test program used for FP multiplication is shown next and was used to measure the number of cycles in the mpf_mul routine.

To measure the number of cycles of the FP division routine mpf_div, a similar program was used, replacing the instructions {mpf_mul(x,u,v); mpf_mul(v,x,u);


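#include <stdio.h>
#include <gmp.h>
#include <math.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "gmp-impl.h"

/* NUM_LIMBS and NUM_RUNS must be supplied at build time
   (-DNUM_LIMBS=... -DNUM_RUNS=...); the constants assigned in the
   printed listing are not legible. */

int main(void)
{
  int numlimbsprog;          /* operand size in limbs */
  int i, j, numruns;
  MP_INT integ1, integ2;
  MP_INT prod;

  numlimbsprog = NUM_LIMBS;
  numruns = NUM_RUNS;

  mpz_init(&integ1);
  mpz_init(&integ2);
  mpz_init(&prod);

  for (i = 0; i < numruns; i++) {
    /* initialize the random values of integ1 and integ2 */
    mpz_random(&integ1, numlimbsprog);
    mpz_random(&integ2, numlimbsprog);
    mpz_mul(&prod, &integ1, &integ2);
  }
  return 0;
}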


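#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>
#include <math.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "gmp-impl.h"
#include "urandom.h"

/* SIZE (exponent range), NUM_LIMBS and NUM_RUNS must be supplied at
   build time; the constants in the printed listing are not legible. */
#ifndef SIZE
#error "define SIZE, e.g. with -DSIZE=..."
#endif

int main(int argc, char **argv)
{
  mp_size_t size = NUM_LIMBS;   /* size in limbs */
  mp_exp_t exp;
  int i;
  mpf_t u, v, x;
  int numruns;

  /* precision in bits: the library uses limbs of mp_bits_per_limb bits */
  mpf_set_default_prec(size * mp_bits_per_limb);
  mpf_init(u);
  mpf_init(v);
  mpf_init(x);
  numruns = NUM_RUNS;

  /* initialize operand u */
  exp = urandom() % SIZE;
  mpf_random2(u, size, exp);

  /* initialize operand v */
  exp = urandom() % SIZE;
  mpf_random2(v, size, exp);

  for (i = 0; i < numruns; i++) {
    /* each result is reused as an operand of the next operation */
    mpf_mul(x, u, v);
    mpf_mul(v, x, u);
    mpf_mul(u, x, v);
  }
  exit(0);
}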

mpf_mul(u,x,v);} with {mpf_div(x,u,v); mpf_div(v,x,u);}.

For LP square root we collected the number of cycles of the routine mpf_sqrt. The test program is similar to the one for FP multiplication, replacing the three multiplication instructions with the instructions {mpf_sqrt(x,u); mpf_sqrt(u,x);}.

APPENDIX C

Digit radix transformation using BS Code

Theorem: Given a value $x$ represented by a signed-bit vector $(x_{n-1}, \ldots, x_1, x_0)$ with $x_i \in \{-1, 0, 1\}$, it is always possible to transform the representation of $x$ to another radix $r = 2^k$ ($k$ a positive integer) just by splitting the signed bits into groups of $k$ bits, from right to left:

$$(x_{n-1} \cdots x_1 x_0)_2 = \big(\cdots (x_{2k-1} \cdots x_k)\,(x_{k-1} x_{k-2} \cdots x_1 x_0)\big)_{2^k}$$

where each group $(x_{k(i+1)-1} \cdots x_{ki})$, $0 \le i < \lceil n/k \rceil$, represents a radix-$r$ signed digit in the maximally redundant set $\{-(r-1), \ldots, r-1\}$.

Proof: When the signed-bit vector is split into groups of $k$ signed bits, each group of bits will have a value

$$y_i = \sum_{j=0}^{k-1} y_{ij}\, 2^j \qquad \text{(C.1)}$$

with $y_{ij} \in \{-1, 0, 1\}$ and $y_i \in \{-(r-1), \ldots, r-1\}$, since $r = 2^k$.

Let us show that the value of the vector formed by the digits $y_i$ represents the value of $x$. The value of the vector $(y_d, \ldots, y_1, y_0)$ in radix $r$ is computed as

$$y = \sum_{i=0}^{d} y_i\, r^i = \sum_{i=0}^{d} y_i\, 2^{ki} \qquad \text{(C.2)}$$

where $d = \lceil n/k \rceil - 1$.

Substituting the value of $y_i$ in equation (C.2) by the value given in equation (C.1), we obtain

$$y = \sum_{j=0}^{k-1} \sum_{i=0}^{d} y_{ij}\, 2^{j}\, 2^{ki} \qquad \text{(C.3)}$$

From the relation between $x$ and $y$ we know that $x_{ki+j} = y_{ij}$, and based on the signed-digit representation we have $x_i = 0$ for $i \ge n$. Thus

$$x = \sum_{i=0}^{n-1} x_i\, 2^i = \sum_{j=0}^{k-1} \sum_{i=0}^{d} x_{ki+j}\, 2^{ki+j} = y \qquad \text{(C.4)}$$
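The grouping used in the theorem can be checked numerically with a short program. The sketch below is only an illustration of the statement, not part of the coprocessor design; the function names are hypothetical, the vector is stored with $x_0$ at index 0, and missing high-order positions are padded with zeros as in the proof.

```c
#include <assert.h>

/* Value of a signed-bit vector x[0..n-1], x[i] in {-1, 0, 1},
   with x[0] the least significant position. */
static long sb_value(const int *x, int n)
{
    long v = 0;
    for (int i = n - 1; i >= 0; i--) {
        assert(x[i] >= -1 && x[i] <= 1);
        v = 2 * v + x[i];
    }
    return v;
}

/* Groups the signed bits k at a time, from right to left, into
   radix-2^k signed digits y[0..count-1]. The array y must have room
   for ceil(n/k) digits. Returns the number of digits produced. */
static int sb_to_radix(const int *x, int n, int k, long *y)
{
    assert(n > 0 && k > 0 && k < 31);
    int count = (n + k - 1) / k;
    for (int i = 0; i < count; i++) {
        long digit = 0;
        for (int j = k - 1; j >= 0; j--) {
            /* positions at or beyond n are zero */
            int bit = (k * i + j < n) ? x[k * i + j] : 0;
            digit = 2 * digit + bit;
        }
        y[i] = digit;
    }
    return count;
}

/* Value of the digit vector y[0..count-1] in radix 2^k. */
static long radix_value(const long *y, int count, int k)
{
    long r = 1L << k;
    long v = 0;
    for (int i = count - 1; i >= 0; i--)
        v = v * r + y[i];
    return v;
}
```

For example, the vector with $x_4 \ldots x_0 = (-1, 1, 0, -1, 1)$ has value $-9$; with $k = 2$ it is grouped into the radix-4 digits $(-1, 2, -1)$, whose value is again $-9$, and every digit lies in $\{-3, \ldots, 3\}$.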

Bibliography

ANSI/IEEE Std. Standard for Binary Floating-Point Arithmetic. In SIGPLAN Notices, volume, pages.

[AACE] A. L. Abbott, P. M. Athanas, L. Chen, and R. L. Elliott. Finding Lines and Building Pyramids with Splash. In IEEE Workshop on FPGAs for Custom Computing Machines, pages, Napa Valley, California, April.

[ACC] O. T. Albaharna, P. Y. K. Cheung, and T. J. Clarke. On the Viability of FPGA-Based Integrated Coprocessors. In IEEE Workshop on FPGAs for Custom Computing Machines, pages.

[AH] G. Alefeld and J. Herzberger. Introduction to Interval Computations. Computer Science and Applied Mathematics. Academic Press.

[AH] P. M. Athanas and F. S. Harvey. Processor Reconfiguration through Instruction-Set Metamorphism. IEEE Computer, pages, March.

[Atk] D. E. Atkins. Design of the arithmetic units of ILLIAC III: use of redundancy and higher radix methods. IEEE Transactions on Computers.

[Bai] D. H. Bailey. A Portable High Performance Multiprecision Package. Technical Report RNR.

[Bau] C. Baumhof. A New VLSI Vector Arithmetic Coprocessor for the PC. In IEEE th Symposium on Computer Arithmetic, pages.

[Car] T. M. Carter. CASCADE: Hardware for High/Variable Precision Arithmetic. In IEEE th Symposium on Computer Arithmetic, pages.

[CB] C. E. Cox and W. E. Blanz. Ganglion: A fast field-programmable Gate Array implementation of a Connectionist Classifier. IEEE Journal of Solid-State Circuits, Mar.

[CHH] M. Cohen, T. Hull, and V. Hamacher. CADAC: A Controlled-Precision Decimal Arithmetic Unit. IEEE Transactions on Computers, C.

[Cho] C. Chow. A Variable Precision Processor Module. PhD thesis, University of Illinois at Urbana-Champaign. UIUCDCS-R.

[Cho] C. Chow. A Variable Precision Module. In Int. Conference on Computer Design: VLSI in Computers, pages.

[Com] P. G. Comba. Exponentiation Cryptosystems on the IBM PC. IBM Systems Journal.

[Coo] J. T. Coonen. An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic. IEEE Computer, pages, January.

[Cora] Virtual Computer Corporation. Engineer's Virtual Computer EVCs User's Guide, first edition.

[Corb] Virtual Computer Corporation. HOT: Hardware Object Technology Programming Guide, first edition.

[CY] Cypress Semiconductor Corporation, San Jose, CA. Data Sheet: Kx Dual-Port Static RAM.

[DeH] DeHon, Andre. Reconfigurable Architectures for General-Purpose Computing. Technical Report, MIT Artificial Intelligence Laboratory, Technology Sq., Cambridge, MA, October.

[DMV] M. Daumas, J. M. Muller, and J. Vuillemin. Implementing On-Line Arithmetic on PAM. In th Int. Workshop on Field Programmable Logic and Applications, pages.

[EL] M. D. Ercegovac and T. Lang. On-the-fly Conversion of Redundant into Conventional Representations. IEEE Transactions on Computers, C.

[EL] M. D. Ercegovac and T. Lang. On-line Arithmetic: A Design Methodology and Applications in Digital Signal Processing. In IEEE Workshop on VLSI Signal Processing. IEEE.

[EL] M. D. Ercegovac and T. Lang. Division and square root: digit-recurrence algorithms and implementations. Kluwer Academic Publishers.

[EL] M. D. Ercegovac and T. Lang. On Recoding in Arithmetic Algorithms. Journal of VLSI Signal Processing.

[Ely] J. E. Ely. The VPI Software Package for Variable Precision Interval Arithmetic. In Interval Computations, pages.

[Erc] M. D. Ercegovac. On-line Arithmetic: An Overview. In Real Time Signal Processing VII, pages. SPIE.

[ET] M. D. Ercegovac and K. Trivedi. On-line Operations. IEEE Transactions on Computers, C.

[GHM] A. Guyot, Y. Herreros, and J. M. Muller. JANUS, an On-line Multiplier/divider for manipulating large numbers. IEEE th Symposium on Computer Arithmetic, pages.

[GKC+] D. Galloway, D. Karchmer, P. Chow, D. Lewis, and J. Rose. The Transmogrifier: The University of Toronto Field-Programmable System. Technical Report CSRI, University of Toronto, Canada, June. ftp://ftp.csri.toronto.edu/csri-technical-reports

[Gui] S. Guicionne. List of FPGA-based computing machines. Hypertext link: httputsccutexasedu guiccioneHW listhtml

[HCH] T. E. Hull, M. S. Cohen, and C. B. Hall. Specification for a Variable-Precision Arithmetic Coprocessor. In IEEE th Symposium on Computer Arithmetic, pages.

[Hoa] D. T. Hoang. Searching Genetic Databases on Splash. In IEEE Workshop on FPGAs for Custom Computing Machines, pages, April.

[HP] R. Hartley and K. K. Parhi. Digit-Serial Computation. Kluwer Academic Publishers.

[Hsu] C.-Y. Hsu. Variable Precision Arithmetic Processor in FPGAs. Master's thesis, University of Toronto.

[KM] U. W. Kulisch and W. L. Miranker. Computer Arithmetic in Theory and Practice. Computer Science and Applied Mathematics. Academic Press.

[Kno] A. Knofel. Fast Hardware Units for the Computation of Accurate Dot Products. In IEEE th Symposium on Computer Arithmetic, pages.

[Knu] D. E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume. Addison-Wesley Publishing Co.

[Kor] I. Koren. Computer Arithmetic Algorithms. Prentice Hall.

[Kwa] K. Kwang. Advanced with Parallel Programming. McGraw-Hill Inc., preliminary edition.

[LM] Tomas Lang and Paolo Montuschi. Higher Radix Square Root with Prescaling. IEEE Transactions on Computers.

[Lou] M. E. Louie. Variable Precision Arithmetic with Lookup Table Based Field Programmable Gate Arrays. PhD thesis, UCLA.

[Lyl] James D. Lyle. Sbus: Information, Applications and Experience. Springer-Verlag.

[Lyn] T. Lynch. High Radix On-line Arithmetic for Credible and Accurate General Purpose Computing. In Real Numbers and Computers: Les Nombres Reels et L'Ordinateur, pages, Ecole des Mines de Saint-Etienne, France.

[Mat] D. W. Matula. Design of a Highly Parallel IEEE Floating Point arithmetic unit. In Proc. Symp. Combinatorial Optimization Sci. and Technology.

[MM] V. Menissier-Morain. Arithmetic Exacte: Conception, Algorithmique et Performances d'une Implementation Informatique en Precision Arbitraire. PhD thesis, L'Universite Paris VII, France, December.

[Moo] R. E. Moore. Methods and Applications of Interval Analysis. In Studies in Applied Mathematics. Philadelphia: SIAM.

[MRR] M. Muller, C. Rub, and W. Rulling. Exact Accumulation of Floating-Point Numbers. In IEEE th Symposium on Computer Arithmetic, pages.

[Neu] A. Neumaier. Interval Methods for Systems of Equations. Cambridge University Press.

[P+] W. H. Press et al. Numerical Recipes in C: the art of Scientific Computing. Cambridge University Press, nd edition.

[PA] K. J. Paar and P. M. Athanas. Implementation of a Finite Difference Method on a Custom Computing Platform. In High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, pages, Boston, November. SPIE.

[Pet] Petersen, Russell J. An Assessment of the Suitability of Reconfigurable Systems for Digital Signal Processing. Master's thesis, UCLA Electrical Engineering.

[Pri] D. M. Priest. Algorithms for Arbitrary Precision Floating Point Arithmetic. In IEEE th Int. Symposium on Computer Arithmetic, pages.

[PTS] D. Pryor, M. Thistle, and N. Shirazi. Text Searching on Splash. In IEEE Workshop on FPGAs for Custom Computing Machines.

[Pur] Pure Atria. Quantify User's Guide.

[Ral] L. B. Rall. Tools for Mathematical Computation. In Computer Aided Proofs in Analysis, volume, IMA Volumes in Mathematics and Its Applications, pages. Springer-Verlag.

[RV] S. Rajamani and P. Viswanath. A Quantitative Analysis of Processor-Programmable Logic Interface. In IEEE Workshop on FPGAs for Custom Computing Machines, pages.

[Sch] M. Schulte. A Variable Precision Interval Arithmetic Processor. PhD thesis, University of Texas, Austin.

[SG] A. Skaf and A. Guyot. VLSI Design of On-line Add/Multiply Algorithms. IEEE Int. Conference on Computer Design, pages.

[SSJa] M. J. Schulte and E. E. Swartzlander Jr. A Coprocessor for Accurate and Reliable Numerical Computations. In IEEE Int. Conf. on Computer Design, pages.

[SSJb] M. J. Schulte and E. E. Swartzlander Jr. Hardware Design and Arithmetic Algorithms for a Variable-Precision Interval Arithmetic Coprocessor. In IEEE th Symposium on Computer Arithmetic, pages.

[TE] K. S. Trivedi and M. D. Ercegovac. On-line Algorithms for Division and Multiplication. IEEE Trans. on Computers, C.

[TE] D. M. Tullsen and M. D. Ercegovac. Design and VLSI Implementation of an On-line Algorithm. In Real Time Signal Processing IX, pages. SPIE.

[TE] P. K.-G. Tu and M. D. Ercegovac. A Radix On-line Division Algorithm. In IEEE th Symposium on Computer Arithmetic, pages.

[TE] P. K.-G. Tu and M. D. Ercegovac. Design of On-line Division Unit. In IEEE th Symposium on Computer Arithmetic, pages.

[TE] A. F. Tenca and M. D. Ercegovac. High-radix digit-slices for on-line computations. In Proceedings of SPIE Conference on High Speed Computing, Digital Signal Processing, and Filtering using reconfigurable logic, volume, pages.

[Tor] G. Torbjon. User Guide to GNU MP Multiple Precision Library, edition. Available with the software.

[Tu] P. K.-G. Tu. On-line Arithmetic Algorithms for Efficient Implementation. PhD thesis, University of California, Los Angeles, Sept.

[VBD+] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard. Programmable active memories: Reconfigurable systems come of age. IEEE Transactions on VLSI Systems, March.

[VdB] D. Van den Bout et al. Anyboard: An FPGA-Based Reconfigurable System. IEEE Design and Test of Computers, pages, September.

[Vui] J. E. Vuillemin. Exact Real Computer Arithmetic with Continued Fractions. IEEE Trans. on Computers.

[Wau] T. C. Waugh. Field Programmable Gate Array Key to Reconfigurable Array Outperforming Supercomputers. In IEEE Custom Integrated Circuits Conference.

[Wie] A. Wiethoff. A C++ Class Library for Extended Scientific Computing.

[Xil] Xilinx. The Programmable Logic Data Book.

[YX] W. W. H. Yu and S. Xing. Performance Evaluation of FPGA implementation of High-Speed Addition algorithms. In Proceedings of SPIE Conference on High Speed Computing, Digital Signal Processing, and Filtering using reconfigurable logic, volume, pages.

[Zim] P. Zimmerman. Comparison of Three Public-Domain Multiprecision Libraries: Bignum, GMP and Paris. Obtained through e-mail (zimmerman@inria.fr).

[Zur] D. Zuras. On Squaring and Multiplying Large Integers. In Proceedings of the th Symposium on Computer Arithmetic, pages.