Floating Point Representation

Michael L. Overton

Copyright 1996

1 Computer Representation of Numbers

Computers which work with real arithmetic use a system called floating point. Suppose a real number x has the binary expansion

    x = m × 2^E,  where 1 ≤ m < 2,

and

    m = (b_0.b_1 b_2 b_3 ...)_2.

To store a number in floating point representation, a computer word is divided into 3 fields, representing the sign, the exponent E, and the significand m respectively. A 32-bit word could be divided into fields as follows: 1 bit for the sign, 8 bits for the exponent and 23 bits for the significand. Since the exponent field is 8 bits, it can be used to represent exponents between −128 and 127. The significand field can store the first 23 bits of the binary representation of m, namely

    b_0.b_1 ... b_22.

If b_23, b_24, ... are not all zero, this floating point representation of x is not exact but approximate. A number is called a floating point number if it can be stored exactly on the computer using the given floating point representation scheme, i.e., in this case, b_23, b_24, ... are all zero. For example, the number

    11/2 = (1.011)_2 × 2^2

would be represented by

    0 | E = 2 | 1.0110000000000000000000,

and the number

    71 = (1.000111)_2 × 2^6

would be represented by

    0 | E = 6 | 1.0001110000000000000000.

To avoid confusion, the exponent E, which is actually stored in a binary representation, is shown in decimal for the moment.
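The decomposition x = m × 2^E with 1 ≤ m < 2 can be checked in a few lines of Python. This sketch is my addition, not part of the original text; it relies on the standard library's math.frexp, which returns a significand in [0.5, 1), so one rescaling step is needed to match the convention used here:

```python
import math

def decompose(x):
    """Return (m, E) with x = m * 2**E and 1 <= m < 2, for x > 0."""
    m, e = math.frexp(x)   # frexp gives x = m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1    # rescale to the convention 1 <= m < 2

print(decompose(5.5))    # 11/2 = 1.375 * 2**2, since (1.011)_2 = 1.375
print(decompose(71.0))   # 71 = 1.109375 * 2**6, since (1.000111)_2 = 1.109375
```

Both results are exact, because 11/2 and 71 are floating point numbers in the sense defined above.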

The floating point representation of a nonzero number is unique as long as we require that 1 ≤ m < 2. If it were not for this requirement, the number 11/2 could also be written

    (0.01011)_2 × 2^4

and could therefore be represented by

    0 | E = 4 | 0.0101100000000000000000.

However, this is not allowed since b_0 = 0 and so m < 1. A more interesting example is

    1/10 = (0.0001100110011...)_2.

Since this binary expansion is infinite, we must truncate the expansion somewhere. (An alternative, namely rounding, is discussed later.) The simplest way to truncate the expansion to 23 bits would give the representation

    0 | E = 0 | 0.0001100110011001100110,

but this means m < 1 since b_0 = 0. An even worse choice of representation would be the following: since

    1/10 = (0.00000001100110011...)_2 × 2^4,

the number could be represented by

    0 | E = 4 | 0.0000000110011001100110.

This is clearly a bad choice since less of the binary expansion of 1/10 is stored, due to the space wasted by the leading zeros in the significand field. This is the reason why m < 1, i.e. b_0 = 0, is not allowed. The only allowable

representation for 1/10 uses the fact that

    1/10 = (1.100110011...)_2 × 2^{-4},

giving the representation

    0 | E = −4 | 1.1001100110011001100110.

This representation includes more of the binary expansion of 1/10 than the others, and is said to be normalized, since b_0 = 1, i.e. m ≥ 1. Thus none of the available bits is wasted by storing leading zeros.
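This inexactness is easy to observe in any language with binary floating point arithmetic. The following check is mine, not the author's, and uses Python floats, which are IEEE double precision rather than the 32-bit scheme described here; the phenomenon is the same. Python's fractions module recovers the exact rational value of the stored number:

```python
from fractions import Fraction

# Fraction(0.1) gives the exact value of the binary number actually stored,
# which has a power-of-two denominator and is only close to 1/10.
stored = Fraction(0.1)
print(stored)                     # 3602879701896397/36028797018963968
print(stored == Fraction(1, 10))  # False: 1/10 is not a floating point number
```

The stored value differs from 1/10 because the infinite expansion (0.000110011...)_2 had to be cut off at a finite number of bits.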

We can see from this example why the name floating point is used: the binary point of the number 1/10 can be floated to any position in the bitstring we like by choosing the appropriate exponent; the normalized representation, with b_0 = 1, is the one which should always be used when possible. It is clear that an irrational number such as π is also represented most accurately by a normalized representation: significand bits should not be wasted by storing leading zeros. However, the number zero is special. It cannot be normalized, since all the bits in its representation are zero. The exponent E is irrelevant and can be set to zero. Thus, zero could be represented as

    0 | E = 0 | 0.0000000000000000000000.

The gap between the number 1 and the next largest floating point number is called the precision of the floating point system,[1] or, often, the machine precision, and we shall denote this by ε. In the system just described, the next floating point number bigger than 1 is

    1.0000000000000000000001,

with the last bit b_22 = 1. Therefore, the precision is ε = 2^{-22}.

Exercise 1  What is the smallest possible positive normalized floating point number using the system just described?

Exercise 2  Could nonzero numbers instead be normalized so that 1/2 ≤ m < 1? Would this be just as good?

It is quite instructive to suppose that the computer word size is much smaller than 32 bits and work out in detail what all the possible floating point numbers are in such a case. Suppose that the significand field has room only to store b_0.b_1 b_2, and that the only possible values for the exponent E are −1, 0 and 1. We shall call this system our toy floating point system. The set of toy floating point numbers is shown in Figure 1.

[1] Actually, the usual definition of precision is one half of this quantity, for reasons that will become apparent in the next section. We prefer to omit the factor of one half in the definition.

Figure 1: The Toy Floating Point Numbers (the toy numbers marked on the real line between 0 and 3.5)

The largest number is (1.11)_2 × 2^1 = (3.5)_10, and the smallest positive normalized number is (1.00)_2 × 2^{-1} = (0.5)_10. All of the numbers shown are normalized except zero. Since the next floating point number bigger than 1 is 1.25, the precision of the toy system is ε = 0.25. Note that the gap between floating point numbers becomes smaller as the magnitude of the numbers themselves gets smaller, and bigger as the numbers get bigger. Note also that the gap between zero and the smallest positive number is much bigger than the gap between the smallest positive number and the next positive number. We shall show in the next section how this gap can be "filled in" with the introduction of "subnormal numbers".
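The nonnegative part of the toy system is small enough to enumerate by brute force. The sketch below is my illustration, not part of the text; it uses Python floats, which represent every toy value exactly, and reproduces the numbers plotted in Figure 1:

```python
# Normalized toy numbers: significand (1.b1 b2)_2, exponent E in {-1, 0, 1},
# plus the special number zero.
toys = sorted({(1 + b1 / 2 + b2 / 4) * 2.0 ** e
               for b1 in (0, 1) for b2 in (0, 1) for e in (-1, 0, 1)} | {0.0})

print(toys)
# The largest is 3.5, the smallest positive normalized number is 0.5, and
# the next number after 1 is 1.25, so the toy precision is 0.25.
print(max(toys), min(t for t in toys if t > 0), toys[toys.index(1.0) + 1])
```

Printing the sorted list also makes the varying gap sizes visible: 0.125 between consecutive numbers near 0.5, growing to 0.5 between consecutive numbers near 3.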

2 IEEE Floating Point Representation

In the 1960's and 1970's, each computer manufacturer developed its own floating point system, leading to a lot of inconsistency as to how the same program behaved on different machines. For example, although most machines used binary floating point systems, the IBM 360/370 series, which dominated computing during this period, used a hexadecimal base, i.e. numbers were represented as m × 16^E. Other machines, such as HP calculators, used a decimal floating point system. Through the efforts of several computer scientists, particularly W. Kahan, a binary floating point standard was developed in the early 1980's and, most importantly, followed very carefully by the principal manufacturers of floating point chips for personal computers, namely Intel and Motorola. This standard has become known as the IEEE floating point standard since it was developed and endorsed by a working committee of the Institute for Electrical and Electronics Engineers.[2] (There is also a decimal version of the standard but we shall not discuss this.)

The IEEE standard has three very important requirements:

[2] ANSI/IEEE Std 754-1985. Thanks to Jim Demmel for introducing the author to the standard.

• consistent representation of floating point numbers across all machines adopting the standard

• correctly rounded arithmetic (to be explained in the next section)

• consistent and sensible treatment of exceptional situations such as division by zero (to be discussed in the following section).

We will not describe the standard in detail, but we will cover the main points.

We start with the following observation. In the last section, we chose to normalize a nonzero number x so that x = m × 2^E, where 1 ≤ m < 2, i.e.

    m = (b_0.b_1 b_2 b_3 ...)_2,

with b_0 = 1. In the simple floating point model discussed in the previous section, we stored the leading nonzero bit b_0 in the first position of the field provided for m. Note, however, that since we know this bit has the value one, it is not necessary to store it. Consequently, we can use the 23 bits of the significand field to store b_1, b_2, ..., b_23 instead of b_0, b_1, ..., b_22, changing the machine precision from ε = 2^{-22} to ε = 2^{-23}. Since the bitstring stored in the significand field is now actually the fractional part of the significand, we shall refer henceforth to the field as the fraction field. Given a string of bits in the fraction field, it is necessary to imagine that the symbols "1." appear in front of the string, even though these symbols are not stored. This technique is called hidden bit normalization and was used by Digital for the Vax machine in the late 1970's.

Exercise 3  Show that the hidden bit technique does not result in a more accurate representation of 1/10. Would this still be true if we had started with a field width of 24 bits before applying the hidden bit technique?

Note an important point: since zero cannot be normalized to have a leading nonzero bit, hidden bit representation requires a special technique for storing zero. We shall see what this is shortly. A pattern of all zeros in the fraction field of a normalized number represents the significand 1.0, not 0.0.

Zero is not the only special number for which the IEEE standard has a special representation. Another special number, not used on older machines but very useful, is the number ∞. This allows the possibility of dividing a nonzero number by 0 and storing a sensible mathematical result, namely ∞, instead of terminating with an overflow message. This turns out to be very useful, as we shall see later, although one must be careful about what is meant by such a result. One question which then arises is: what about −∞? It turns out to be convenient to have representations for −∞ as well as ∞, and for −0 as well as 0. We will give more details later, but note for now that −0 and 0 are two different representations for the same value zero, while −∞ and ∞ represent two very different numbers. Another special number is NaN, which stands for "Not a Number" and is consequently not really a number at all, but an error pattern. This too will be discussed further later. All of these special numbers, as well as some other special numbers called subnormal numbers, are represented through the use of a special bit pattern in the exponent field. This slightly reduces the exponent range, but this is quite acceptable since the range is so large.

There are three standard types in IEEE floating point arithmetic: single precision, double precision and extended precision. Single precision numbers require a 32-bit word and their representations are summarized in Table 1.

Table 1: IEEE Single Precision

    ± | a_1 a_2 a_3 ... a_8 | b_1 b_2 b_3 ... b_23

    If exponent bitstring a_1...a_8 is    Then numerical value represented is
    (00000000)_2 = (0)_10                 (0.b_1 b_2 b_3 ... b_23)_2 × 2^{-126}
    (00000001)_2 = (1)_10                 (1.b_1 b_2 b_3 ... b_23)_2 × 2^{-126}
    (00000010)_2 = (2)_10                 (1.b_1 b_2 b_3 ... b_23)_2 × 2^{-125}
    (00000011)_2 = (3)_10                 (1.b_1 b_2 b_3 ... b_23)_2 × 2^{-124}
        ...                                   ...
    (01111111)_2 = (127)_10               (1.b_1 b_2 b_3 ... b_23)_2 × 2^{0}
    (10000000)_2 = (128)_10               (1.b_1 b_2 b_3 ... b_23)_2 × 2^{1}
        ...                                   ...
    (11111100)_2 = (252)_10               (1.b_1 b_2 b_3 ... b_23)_2 × 2^{125}
    (11111101)_2 = (253)_10               (1.b_1 b_2 b_3 ... b_23)_2 × 2^{126}
    (11111110)_2 = (254)_10               (1.b_1 b_2 b_3 ... b_23)_2 × 2^{127}
    (11111111)_2 = (255)_10               ±∞ if b_1 = ... = b_23 = 0, NaN otherwise

Let us discuss Table 1 in some detail. The ± refers to the sign of the number, a zero bit being used to represent a positive sign. The first line shows that the representation for zero requires a special zero bitstring for the exponent field as well as a zero bitstring for the fraction, i.e.

    0 | 00000000 | 00000000000000000000000.

No other line in the table can be used to represent the number zero, for all lines except the first and the last represent normalized numbers, with an initial bit equal to one; this is the one that is not stored. In the case of the first line of the table, the initial unstored bit is zero, not one. The 2^{-126} in the first line is confusing at first sight, but let us ignore that for the moment since (0.000...0)_2 × 2^{-126} is certainly one way to write the number zero. In the case when the exponent field has a zero bitstring but the fraction field has a nonzero bitstring, the number represented is said to be subnormal.[3] Let us postpone the discussion of subnormal numbers for the moment and go on to the other lines of the table.

All the lines of Table 1 except the first and the last refer to the normalized numbers, i.e. all the floating point numbers which are not special in some way. Note especially the relationship between the exponent bitstring a_1 a_2 a_3 ... a_8 and the actual exponent E, i.e. the power of 2 which the bitstring is intended to represent. We see that the exponent representation does not use any of sign-and-modulus, 2's complement or 1's complement, but rather something called biased representation: the bitstring which is stored is simply the binary representation of E + 127. In this case, the number 127 which is added to the desired exponent E is called the exponent bias. For example, the number 1 = (1.000...0)_2 × 2^0 is stored as

    0 | 01111111 | 00000000000000000000000.

Here the exponent bitstring is the binary representation for 0 + 127 and the fraction bitstring is the binary representation for 0 (the fractional part of 1.0). The number 11/2 = (1.011)_2 × 2^2 is stored as

    0 | 10000001 | 01100000000000000000000,

and the number 1/10 = (1.100110011...)_2 × 2^{-4} is stored as

    0 | 01111011 | 10011001100110011001100.
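These stored bit patterns can be verified with Python's struct module, which exposes the IEEE single precision encoding. This check is my addition; note that for 1/10 the hardware rounds to nearest (anticipating the next section), so the stored fraction ends in ...1101 rather than the truncated ...1100 shown above:

```python
import struct

def single_bits(x):
    """IEEE single precision bit pattern of x, as a 32-character string."""
    (n,) = struct.unpack('>I', struct.pack('>f', x))  # float -> 32-bit integer
    return format(n, '032b')

print(single_bits(1.0))   # sign 0, exponent 01111111, fraction all zeros
print(single_bits(5.5))   # sign 0, exponent 10000001, fraction 01100...0
print(single_bits(0.1))   # fraction 10011001100110011001101 (rounded up)
```

Reading each string as sign bit, 8 exponent bits, 23 fraction bits reproduces the stored representations given in the text.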

We see that the range of exponent field bitstrings for normalized numbers is 00000001 to 11111110 (the decimal numbers 1 through 254), representing actual exponents from E_min = −126 to E_max = 127. The smallest normalized number which can be stored is represented by

    0 | 00000001 | 00000000000000000000000,

meaning (1.000...0)_2 × 2^{-126}, i.e. 2^{-126}, which is approximately 1.2 × 10^{-38}, while the largest normalized number is represented by

    0 | 11111110 | 11111111111111111111111,

meaning (1.111...1)_2 × 2^{127}, i.e. (2 − 2^{-23}) × 2^{127}, which is approximately 3.4 × 10^{38}.

[3] These numbers were called denormalized in early versions of the standard.

The last line of Table 1 shows that an exponent bitstring consisting of all ones is a special pattern used for representing ±∞ and NaN, depending on the value of the fraction bitstring. We will discuss the meaning of these later.

Finally, let us return to the first line of the table. The idea here is as follows: although 2^{-126} is the smallest normalized number which can be represented, we can use the combination of the special zero exponent bitstring and a nonzero fraction bitstring to represent smaller numbers called subnormal numbers. For example, 2^{-127}, which is the same as (0.1)_2 × 2^{-126}, is represented as

    0 | 00000000 | 10000000000000000000000,

while 2^{-149} = (0.0000...01)_2 × 2^{-126} (with 22 zero bits after the binary point) is stored as

    0 | 00000000 | 00000000000000000000001.

This last number is the smallest nonzero number which can be stored. Now we see the reason for the 2^{-126} in the first line. It allows us to represent numbers in the range immediately below the smallest normalized number. Subnormal numbers cannot be normalized, since that would result in an exponent which does not fit in the field.
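We can confirm the value of this smallest positive single precision number by decoding its bit pattern directly. This is my check, not part of the text; the comparison is exact because every single precision number is also representable in Python's double precision floats:

```python
import struct

# The 32-bit pattern  0 | 00000000 | 00000000000000000000001  is the
# integer 1; decode it as an IEEE single precision number.
(smallest,) = struct.unpack('>f', (1).to_bytes(4, 'big'))
print(smallest == 2.0 ** -149)   # True: the smallest positive subnormal
print(smallest)                  # about 1.4e-45
```

Decoding the pattern with a 1 in the leading fraction bit instead gives 2^{-127}, matching the other subnormal example above.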

Let us return to our example of a machine with a tiny word size, illustrated in Figure 1, and see how the addition of subnormal numbers changes it. We get three extra numbers: (0.11)_2 × 2^{-1} = 3/8, (0.10)_2 × 2^{-1} = 1/4 and (0.01)_2 × 2^{-1} = 1/8; these are shown in Figure 2.

Figure 2: The Toy System including Subnormal Numbers

Note that the gap between zero and the smallest positive normalized number is nicely filled in by the subnormal numbers, using the same spacing as that between the normalized numbers with exponent −1.

Subnormal numbers are less accurate, i.e. they have less room for nonzero bits in the fraction field, than normalized numbers. Indeed, the accuracy drops as the size of the subnormal number decreases. Thus (1/10) × 2^{-123} = (0.11001100...)_2 × 2^{-126} is stored as

    0 | 00000000 | 11001100110011001100110,

while (1/10) × 2^{-133} = (0.11001100...)_2 × 2^{-136} is stored as

    0 | 00000000 | 00000000001100110011001.

Exercise 4  Determine the IEEE single precision floating point representation of the following numbers: 2, 1000, 23/4, (23/4) × 2^{100}, (23/4) × 2^{-100}, (23/4) × 2^{-135}, 1/5, 1024/5, (1/10) × 2^{-140}.

Exercise 5  Write down an algorithm that tests whether a floating point number x is less than, equal to or greater than another floating point number y, by simply comparing their floating point representations bitwise from left to right, stopping as soon as the first differing bit is encountered. The fact that this can be done easily is the main motivation for biased exponent notation.

Exercise 6  Suppose x and y are single precision floating point numbers. Is it true that round(x − y) = 0 only when x = y?[4] Illustrate your answer with some examples. Do you get the same answer if subnormal numbers are not allowed, i.e. subnormal results are rounded to zero? Again, illustrate with an example.

[4] The expression round(x) is defined in the next section.

Table 2: IEEE Double Precision

    ± | a_1 a_2 a_3 ... a_11 | b_1 b_2 b_3 ... b_52

    If exponent bitstring a_1...a_11 is    Then numerical value represented is
    (00000000000)_2 = (0)_10               (0.b_1 b_2 b_3 ... b_52)_2 × 2^{-1022}
    (00000000001)_2 = (1)_10               (1.b_1 b_2 b_3 ... b_52)_2 × 2^{-1022}
    (00000000010)_2 = (2)_10               (1.b_1 b_2 b_3 ... b_52)_2 × 2^{-1021}
    (00000000011)_2 = (3)_10               (1.b_1 b_2 b_3 ... b_52)_2 × 2^{-1020}
        ...                                    ...
    (01111111111)_2 = (1023)_10            (1.b_1 b_2 b_3 ... b_52)_2 × 2^{0}
    (10000000000)_2 = (1024)_10            (1.b_1 b_2 b_3 ... b_52)_2 × 2^{1}
        ...                                    ...
    (11111111100)_2 = (2044)_10            (1.b_1 b_2 b_3 ... b_52)_2 × 2^{1021}
    (11111111101)_2 = (2045)_10            (1.b_1 b_2 b_3 ... b_52)_2 × 2^{1022}
    (11111111110)_2 = (2046)_10            (1.b_1 b_2 b_3 ... b_52)_2 × 2^{1023}
    (11111111111)_2 = (2047)_10            ±∞ if b_1 = ... = b_52 = 0, NaN otherwise

For many applications, single precision numbers are quite adequate. However, double precision is a commonly used alternative. In this case each floating point number is stored in a 64-bit double word. Details are shown in Table 2. The ideas are all the same; only the field widths and exponent bias are different. Clearly, a number like 1/10 with an infinite binary expansion is stored more accurately in double precision than in single, since b_1, ..., b_52 can be stored instead of just b_1, ..., b_23.

There is a third IEEE floating point format called extended precision. Although the standard does not require a particular format for this, the standard implementation used on PC's is an 80-bit word, with 1 bit used for the sign, 15 bits for the exponent and 64 bits for the significand. The leading bit of a normalized number is not generally hidden as it is in single and double precision, but is explicitly stored. Otherwise, the format is much the same as single and double precision.

We see that the first single precision number larger than 1 is 1 + 2^{-23}, while the first double precision number larger than 1 is 1 + 2^{-52}. The extended precision case is a little more tricky: since there is no hidden bit, 1 + 2^{-64} cannot be stored exactly, so the first number larger than 1 is 1 + 2^{-63}. Thus the exact machine precision, together with its approximate decimal equivalent, is shown in Table 3 for each of the IEEE single, double and extended formats.

Table 3: What is that Precision?

    IEEE Single    ε = 2^{-23} ≈ 1.2 × 10^{-7}
    IEEE Double    ε = 2^{-52} ≈ 2.2 × 10^{-16}
    IEEE Extended  ε = 2^{-63} ≈ 1.1 × 10^{-19}

The fraction of a single precision normalized number has exactly 23 bits of accuracy, i.e. the significand has 24 bits of accuracy counting the hidden bit. This corresponds to approximately 7 decimal digits of accuracy. In double precision, the fraction has exactly 52 bits of accuracy, i.e. the significand has 53 bits of accuracy counting the hidden bit. This corresponds to approximately 16 decimal digits of accuracy. In extended precision, the significand has exactly 64 bits of accuracy, and this corresponds to approximately 19 decimal digits of accuracy. So, for example, the single precision representation for the number π is approximately 3.141592, while the double precision representation is approximately 3.141592653589793. To see exactly what the single and double precision representations of π are would require writing out the binary representations and converting these to decimal.
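Python floats are IEEE double precision numbers, so the double precision entry of Table 3 can be observed directly. This check is mine; math.nextafter requires Python 3.9 or later:

```python
import math
import sys

# The gap between 1.0 and the next larger double is the machine precision.
eps = math.nextafter(1.0, 2.0) - 1.0
print(eps == 2.0 ** -52)              # True
print(eps == sys.float_info.epsilon)  # True: the library exposes the same value
print(eps)                            # approximately 2.2e-16
```

The same experiment in a language with a 32-bit float type would give 2^{-23} instead.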

Exercise 7  What is the gap between 2 and the first IEEE single precision number larger than 2? What is the gap between 1024 and the first IEEE single precision number larger than 1024? What is the gap between 2 and the first IEEE double precision number larger than 2?

Exercise 8  Let x = m × 2^E be a normalized single precision number, with 1 ≤ m < 2. Show that the gap between x and the next largest single precision number is

    ε × 2^E.

(It may be helpful to recall the discussion following Figure 1.)

3 Rounding and Correctly Rounded Arithmetic

We use the terminology "floating point numbers" to mean all acceptable numbers in a given IEEE floating point arithmetic format. This set consists of ±0, subnormal and normalized numbers, and ±∞, but not NaN values, and is a finite subset of the reals. We have seen that most real numbers, such as 1/10 and π, cannot be represented exactly as floating point numbers. For ease of expression we will say a general real number is "normalized" if its modulus lies between the smallest and largest positive normalized floating point numbers, with a corresponding use of the word "subnormal". In both cases the representations we give for these numbers will parallel the floating point number representations in that b_0 = 1 for normalized numbers, and b_0 = 0 with E = −126 for subnormal numbers.

For any number x which is not a floating point number, there are two obvious choices for the floating point approximation to x: the closest floating point number less than x, and the closest floating point number greater than x. Let us denote these x_− and x_+ respectively. For example, consider the toy floating point number system illustrated in Figures 1 and 2. If x = 1.7, for example, then we have x_− = 1.5 and x_+ = 1.75, as shown in Figure 3.

Figure 3: Rounding in the Toy System

Now let us assume that the floating point system we are using is IEEE single precision. Then if our general real number x is positive (and normalized or subnormal), with

    x = (b_0.b_1 b_2 ... b_23 b_24 b_25 ...)_2 × 2^E,

we have

    x_− = (b_0.b_1 b_2 ... b_23)_2 × 2^E.

Thus, x_− is obtained simply by truncating the binary expansion of m at the 23rd bit and discarding b_24, b_25, etc. This is clearly the closest floating point number which is less than x. Writing a formula for x_+ is more complicated since, if b_23 = 1, finding the closest floating point number bigger than x will involve some bit "carries" and possibly, in rare cases, a change in E. If x is negative, the situation is reversed: it is x_+ which is obtained by dropping bits b_24, b_25, etc., since discarding bits of a negative number makes the number closer to zero, and therefore larger (further to the right on the real line).

The IEEE standard defines the correctly rounded value of x, which we shall denote round(x), as follows. If x happens to be a floating point number, then round(x) = x. Otherwise, the correctly rounded value depends on which of the following four rounding modes is in effect:

• Round down. round(x) = x_−.

• Round up. round(x) = x_+.

• Round towards zero. round(x) is either x_− or x_+, whichever is between zero and x.

• Round to nearest. round(x) is either x_− or x_+, whichever is nearer to x. In the case of a tie, the one with its least significant bit equal to zero is chosen.
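The tie rule of round to nearest can be seen with Python doubles, where ε = 2^{-52}. This demonstration is mine, not from the text: 1 + 2^{-53} lies exactly halfway between the neighboring doubles 1 and 1 + 2^{-52}, so it rounds to the one whose least significant bit is zero, namely 1; nudging the value just past the halfway point forces rounding up.

```python
# 1 + 2**-53 is exactly halfway between the doubles 1 and 1 + 2**-52.
tie = 1.0 + 2.0 ** -53                       # tie: rounds to 1.0 (even last bit)
print(tie == 1.0)                            # True

past_tie = 1.0 + (2.0 ** -53 + 2.0 ** -60)   # just past halfway: rounds up
print(past_tie == 1.0 + 2.0 ** -52)          # True
```

Note that 2^{-53} + 2^{-60} is itself exactly representable, so the only rounding happens in the final addition.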

If x is positive, then x_− is between zero and x, so round down and round towards zero have the same effect. If x is negative, then x_+ is between zero and x, so it is round up and round towards zero which have the same effect. In either case, round towards zero simply requires truncating the binary expansion, i.e. discarding bits.

The most useful rounding mode, and the one which is almost always used, is round to nearest, since this produces the floating point number which is closest to x. In the case of "toy" precision, with x = 1.7, it is clear that round to nearest gives a rounded value of x equal to 1.75. When the word "round" is used without any qualification, it almost always means "round to nearest". In the more familiar decimal context, if we "round" the number π = 3.14159... to four decimal digits, we obtain the result 3.142, which is closer to π than the truncated result 3.141.

Exercise 9  What is the rounded value of 1/10 for each of the four rounding modes? Give the answer in terms of the binary representation of the number, not the decimal equivalent.

The (absolute value of the) difference between round(x) and x is called the absolute rounding error associated with x, and its value depends on the rounding mode in effect. In toy precision, when round down or round towards zero is in effect, the absolute rounding error for x = 1.7 is 0.2 (since round(x) = 1.5), but if round up or round to nearest is in effect, the absolute rounding error for x = 1.7 is 0.05 (since round(x) = 1.75). For all rounding modes, it is clear that the absolute rounding error associated with x is less than the gap between x_− and x_+, while in the case of round to nearest, the absolute rounding error can be no more than half the gap between x_− and x_+.

Now let x be a normalized IEEE single precision number, and suppose that x > 0, so that

    x = (b_0.b_1 b_2 ... b_23 b_24 b_25 ...)_2 × 2^E,

with b_0 = 1. Clearly,

    x_− = (b_0.b_1 b_2 ... b_23)_2 × 2^E.

Thus we have, for any rounding mode, that

    |round(x) − x| < 2^{-23} × 2^E,

while for round to nearest

    |round(x) − x| ≤ 2^{-24} × 2^E.

Similar results hold for double and extended precision, replacing 2^{-23} by 2^{-52} and 2^{-63} respectively, so that in general we have

    |round(x) − x| < ε × 2^E,    (1)

for any rounding mode, and

    |round(x) − x| ≤ (1/2) ε × 2^E

for round to nearest.

Exercise 10  For round towards zero, could the absolute rounding error be exactly equal to ε × 2^E? For round to nearest, could the absolute rounding error be exactly equal to (1/2) ε × 2^E?

Exercise 11  Does (1) hold if x is subnormal, i.e. E = −126 and b_0 = 0?

The presence of the factor 2^E is inconvenient, so let us consider the relative rounding error associated with x, defined to be

    δ = (round(x) − x) / x = round(x)/x − 1.

Since for normalized numbers

    x = m × 2^E,  where m ≥ 1

(because b_0 = 1), we have, for all rounding modes,

    |δ| < (ε × 2^E) / 2^E = ε.    (2)

In the case of round to nearest, we have

    |δ| ≤ ((1/2) ε × 2^E) / 2^E = (1/2) ε.

Exercise 12  Does (2) hold if x is subnormal, i.e. E = −126 and b_0 = 0? If not, how big could δ be?

Now another way to write the definition of δ is

    round(x) = x(1 + δ),

so we have the following result: the rounded value of a normalized number x is, when not exactly equal to x, equal to x(1 + δ), where, regardless of the rounding mode,

    |δ| < ε.

Here, as before, ε is the machine precision. In the case of round to nearest, we have

    |δ| ≤ (1/2) ε.

This result is very important, because it shows that, no matter how x is displayed, for example either in binary format or in a converted decimal format, you can think of the value shown as not exact, but as exact within a factor of 1 + ε. Using Table 3 we see, for example, that IEEE single precision numbers are good to a factor of about 1 + 10^{-7}, which means that they have about 7 accurate decimal digits.
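This bound is easy to test numerically. In the check below, which is my own, a double is rounded to single precision and back using Python's struct module (a conversion that rounds to nearest), and the relative error is compared against (1/2) ε for single precision:

```python
import struct

def to_single(x):
    """Round the double x to IEEE single precision and back (round to nearest)."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

x = 3.141592653589793                     # double precision approximation of pi
delta = (to_single(x) - x) / x            # relative rounding error
print(abs(delta) <= 0.5 * 2.0 ** -23)     # True: |delta| <= eps/2 for single
```

Running this with other values of x gives the same result: the relative error of the single precision representation never exceeds half the single precision ε.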

Numbers are normally input to the computer using some kind of high-level programming language, to be processed by a compiler or an interpreter. There are two different ways that a number such as 1/10 might be input. One way would be to input the decimal string 0.1 directly, either in the program itself or in the input to the program. The compiler or interpreter then calls a standard input-handling procedure which generates machine instructions to convert the decimal string to its binary representation and store the correctly rounded result in memory or a register. Alternatively, the integers 1 and 10 might be input to the program and the ratio 1/10 generated by a division operation. In this case too, the input-handling program must be called to read the integer strings 1 and 10 and convert them to binary representation. Either integer or floating point format might be used for storing these values in memory, depending on the type of the variables used in the program, but these values must be converted to floating point format before the division operation computes the ratio 1/10 and stores the final floating point result.
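Since both routes must deliver the correctly rounded value, they agree bit for bit. A quick Python illustration (mine, not the author's) of the two input paths just described:

```python
# Route 1: decimal-string input, converted to the correctly rounded binary value.
a = float("0.1")

# Route 2: input the integers 1 and 10, then perform a correctly rounded division.
b = 1.0 / 10.0

print(a == b)   # True: both routes produce the same floating point number
```

Neither value equals the real number 1/10, but they are the same floating point approximation of it.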

From the point of view of the underlying hardware, there are relatively few operations which can be done on floating point numbers. These include the standard arithmetic operations (add, subtract, multiply, divide) as well as a few others such as square root. When the computer performs such a floating point operation, the operands must be available in the processor registers or in memory. The operands are therefore, by definition, floating point numbers, even if they are only approximations to the original program data. However, the result of a standard operation on two floating point numbers may well not be a floating point number. For example, 1 and 10 are both floating point numbers but we have already seen that 1/10 is not. In fact, multiplication of two arbitrary 24-bit significands generally gives a 48-bit significand which cannot be represented exactly in single precision.

When the result of a floating point operation is not a floating point number, the IEEE standard requires that the computed result must be the correctly rounded value of the exact result, using the rounding mode and precision currently in effect. It is worth stating this requirement carefully. Let x and y be floating point numbers, let +, −, ×, / denote the four standard arithmetic operations, and let ⊕, ⊖, ⊗, ⊘ denote the corresponding operations as they are actually implemented on the computer. Thus, x + y may not be a floating point number, but x ⊕ y is the floating point number which the computer computes as its approximation of x + y. The IEEE rule is then precisely:

    x ⊕ y = round(x + y),
    x ⊖ y = round(x − y),
    x ⊗ y = round(x × y),
    x ⊘ y = round(x / y).
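Because floating point numbers are rational, this rule can be checked exactly with Python's fractions module. The sketch below is my verification idea, not part of the text (math.nextafter requires Python 3.9+): under round to nearest, the computed sum must be at least as close to the exact sum as either of its neighboring doubles.

```python
import math
from fractions import Fraction

def is_correctly_rounded_sum(x, y):
    """Check that x + y in hardware is a closest double to the exact sum."""
    exact = Fraction(x) + Fraction(y)                 # exact rational sum
    computed = Fraction(x + y)                        # what the hardware produced
    lo = Fraction(math.nextafter(x + y, -math.inf))   # neighboring doubles
    hi = Fraction(math.nextafter(x + y, math.inf))
    err = abs(computed - exact)
    return err <= abs(lo - exact) and err <= abs(hi - exact)

print(is_correctly_rounded_sum(0.1, 0.2))         # True
print(is_correctly_rounded_sum(1.0, 2.0 ** -60))  # True
```

The same kind of exact check works for ⊖, ⊗ and ⊘, since each operand and each computed result is a rational number.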

From the discussion of relative rounding errors given above, we see then that the computed value x ⊕ y satisfies

    x ⊕ y = (x + y)(1 + δ),

where

    |δ| ≤ ε

for all rounding modes and

    |δ| ≤ (1/2) ε

in the case of round to nearest. The same result also holds for the other operations ⊖, ⊗ and ⊘.

Exercise 13

• Show that it follows from the IEEE rule for correctly rounded arithmetic that floating point addition is commutative, i.e.

    a ⊕ b = b ⊕ a

for any two floating point numbers a and b.

• Show with a simple example that floating point addition is not associative, i.e. it may not be true that

    a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c

for some floating point numbers a, b and c.

Now we ask the question: how is correctly rounded arithmetic implemented? Let us consider the addition of two floating point numbers x = m × 2^E and y = p × 2^F, using IEEE single precision. If the two exponents E and F are the same, it is necessary only to add the significands m and p. The final result is (m + p) × 2^E, which then needs further normalization if m + p is 2 or larger, or less than 1. For example, the result of adding 3 = (1.100)_2 × 2^1 to 2 = (1.000)_2 × 2^1 is:

      (1.10000000000000000000000)_2 × 2^1
    + (1.00000000000000000000000)_2 × 2^1
    = (10.10000000000000000000000)_2 × 2^1

and the significand of the sum is shifted right 1 bit to obtain the normalized format (1.0100...0)_2 × 2^2. However, if the two exponents E and F are

different, say with E > F, the first step in adding the two numbers is to align the significands, shifting p right E − F positions so that the second number is no longer normalized and both numbers have the same exponent E. The significands are then added as before. For example, adding 3 = (1.100)_2 × 2^1 to 3/4 = (1.100)_2 × 2^{-1} gives:

      (1.10000000000000000000000)_2 × 2^1
    + (0.01100000000000000000000)_2 × 2^1
    = (1.11100000000000000000000)_2 × 2^1.

In this case, the result does not need further normalization.

Now consider adding 3 to 3 × 2^-23. We get

      (1.10000000000000000000000)_2 × 2^1
    + (0.00000000000000000000001|1)_2 × 2^1
    = (1.10000000000000000000001|1)_2 × 2^1.

This time, the result is not an IEEE single precision floating point number,
since its significand has 24 bits after the binary point: the 24th is shown
beyond the vertical bar. Therefore, the result must be correctly rounded.
Rounding down gives the result (1.10000000000000000000001)_2 × 2^1, while
rounding up gives (1.10000000000000000000010)_2 × 2^1. In the case of round
to nearest, there is a tie, so the latter result, with the even final bit, is
obtained.
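The same tie-breaking rule can be seen in IEEE double precision, where the halfway spacing at 1.0 is 2^-53:

```python
import math

# 1 + 2^-53 lies exactly halfway between 1.0 and the next double up;
# 1.0 has an even final significand bit, so the tie rounds down:
assert 1.0 + 2.0 ** -53 == 1.0

# (1 + 2^-52) + 2^-53 is halfway between 1 + 2^-52 (odd final bit) and
# 1 + 2^-51 (even final bit), so this tie rounds up:
assert (1.0 + 2.0 ** -52) + 2.0 ** -53 == 1.0 + 2.0 ** -51

# The next double above 1.0 is indeed 1 + 2^-52:
assert math.nextafter(1.0, 2.0) == 1.0 + 2.0 ** -52
```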

For another example, consider the numbers 1, 2^-15 and 2^15, which are
all floating point numbers. The result of adding the first to the second is
(1.000000000000001)_2, which is a floating point number; the result of adding
the first to the third is (1.000000000000001)_2 × 2^15, which is also a floating
point number. However, the result of adding the second number to the third
is

    (1.000000000000000000000000000001)_2 × 2^15,

which is not a floating point number in the IEEE single precision format,
since the fraction field would need 30 bits to represent this number exactly.
In this example, using any of round towards zero, round to nearest, or round
down as the rounding mode, the correctly rounded result is 2^15. If round up
is used, the correctly rounded result is (1.00000000000000000000001)_2 × 2^15.
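One way to observe single precision rounding from Python, whose own floats are double precision, is to pack a value into a 32-bit float with the standard struct module; on an IEEE machine this conversion is correctly rounded (round to nearest). A sketch, with a helper name of our own choosing:

```python
import struct

def round_to_single(x):
    """Round a double value to IEEE single precision (round to nearest)
    by packing it into a 32-bit float and unpacking it again."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# 1 + 2^-15 fits in 23 fraction bits, so it is a single precision number:
assert round_to_single(1.0 + 2.0 ** -15) == 1.0 + 2.0 ** -15

# 2^15 + 2^-15 would need 30 fraction bits; round to nearest discards
# the small term entirely, as in the example above:
assert round_to_single(2.0 ** 15 + 2.0 ** -15) == 2.0 ** 15
```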

Exercise 14 In IEEE single precision, using round to nearest, what are the
correctly rounded values for: 64 + 2^20, 64 + 2^-20, 32 + 2^-20, 16 + 2^-20,
8 + 2^-20? Give the binary representations, not the decimal equivalents. What
are the results if the rounding mode is changed to round up?

Exercise 15 Recalling how many decimal digits correspond to the 23 bit
fraction in an IEEE single precision number, which of the following numbers
do you think round exactly to the number 1, using round to nearest: 1 + 10^-5,
1 + 10^-10, 1 + 10^-15?

Although the operations of addition and subtraction are conceptually
much simpler than multiplication and division and are much easier to carry
out by hand, the implementation of correctly rounded floating point addition
and subtraction is not trivial, even for the case when the result is a floating
point number and therefore does not require rounding. For example, consider
computing x - y with x = (1.0)_2 × 2^0 and y = (1.1111...1)_2 × 2^-1, where
the fraction field for y contains 23 ones after the binary point. (Notice that
y is only slightly smaller than x; in fact it is the next floating point number
smaller than x.) Aligning the significands, we obtain:

      (1.00000000000000000000000| )_2 × 2^0
    - (0.11111111111111111111111|1)_2 × 2^0
    = (0.00000000000000000000000|1)_2 × 2^0.

This is an example of cancellation, since almost all the bits in the two numbers
cancel each other. The result is (1.0)_2 × 2^-24, which is a floating point number,
but in order to obtain this correct result we must be sure to carry out the
subtraction using an extra bit, called a guard bit, which is shown after the
vertical line following the b_23 position. When the IBM 360 was first released,
it did not have a guard bit, and it was only after the strenuous objections
of certain computer scientists that later versions of the machine incorporated
a guard bit. Twenty-five years later, the Cray supercomputer still does not
have a guard bit. When the operation just illustrated, modified to reflect the
Cray's longer wordlength, is performed on a Cray XMP, the result generated
is wrong by a factor of two since a one is shifted past the end of the second
operand's significand and discarded. In this example, instead of having

    x ⊖ y = (x - y)(1 + δ), where |δ| ≤ ε,     (3)

we have x ⊖ y = 2(x - y). On a Cray YMP, on the other hand, the second
operand is rounded before the operation takes place. This converts the second
operand to the value 1.0 and causes a final result of 0.0 to be computed, i.e.
x ⊖ y = 0. Evidently, Cray supercomputers do not use correctly rounded
arithmetic.
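On an IEEE machine this cancellation example comes out exactly right. In Python's double precision, subtracting the next floating point number below 1.0 from 1.0 produces the exact difference:

```python
import math

x = 1.0
y = math.nextafter(1.0, 0.0)      # next double below 1.0, i.e. 1 - 2^-53

# The guard bit makes the subtraction exact: neither 2(x - y) as on the
# Cray XMP, nor 0.0 as on the Cray YMP.
assert x - y == 2.0 ** -53
```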

Machines supporting the IEEE standard do, however, have correctly rounded
arithmetic, so that (3), for example, always holds. Exactly how this is im-
plemented depends on the machine, but typically floating point operations
are carried out using extended precision registers, e.g. 80-bit registers, even
if the values loaded from and stored to memory are only single or double
precision. This effectively provides many guard bits for single and double
precision operations, but if an extended precision operation on extended pre-
cision operands is desired, at least one additional guard bit is needed. In fact,
the following example (given in single precision for convenience) shows that
one, two or even 24 guard bits are not enough to guarantee correctly rounded
addition with 24-bit significands when the rounding mode is round to nearest.
Consider computing x - y where x = 1.0 and y = (1.000...01)_2 × 2^-25, where
y has 22 zero bits between the binary point and the final one bit. In exact
arithmetic, which requires 25 guard bits in this case, we get:

      (1.00000000000000000000000| )_2 × 2^0
    - (0.00000000000000000000000|0100000000000000000000001)_2 × 2^0
    = (0.11111111111111111111111|1011111111111111111111111)_2 × 2^0.

Normalizing the result, and then rounding this to the nearest floating point
number, we get (1.111...1)_2 × 2^-1, which is the correctly rounded value of
the exact sum of the numbers. However, if we were to use only two guard
bits (or indeed any number from 2 to 24), we would get the result:

      (1.00000000000000000000000| )_2 × 2^0
    - (0.00000000000000000000000|01)_2 × 2^0
    = (0.11111111111111111111111|11)_2 × 2^0.

Normalizing and rounding then results in rounding up instead of down, giving
the final result 1.0, which is not the correctly rounded value of the exact sum.
Machines that implement correctly rounded arithmetic take such possibilities
into account, and it turns out that correctly rounded results can be achieved
in all cases using only two guard bits together with an extra bit, called a
sticky bit, which is used to flag a rounding problem of this kind.
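The correctly rounded answer in this example can be verified numerically: the exact difference fits in a double, and packing it into a 32-bit float with the struct module (a correctly rounded conversion on an IEEE machine) gives the value the text derives, not the 1.0 that two guard bits alone would produce:

```python
import struct

def round_to_single(x):
    """Round a double value to IEEE single precision (round to nearest)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = 1.0
y = 2.0 ** -25 + 2.0 ** -48       # (1.000...01)_2 x 2^-25 from the text

# x - y is exact in double precision; rounding it to single gives
# (1.111...1)_2 x 2^-1 = 1 - 2^-24, the correctly rounded result:
assert round_to_single(x - y) == 1.0 - 2.0 ** -24
```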

Floating point multiplication, unlike addition and subtraction, does not
require significands to be aligned. If x = m × 2^E and y = p × 2^F, then

    x × y = (m × p) × 2^(E+F)

so there are three steps to floating point multiplication: multiply the sig-
nificands, add the exponents, and normalize and correctly round the result.
Single precision significands are easily multiplied in an extended precision reg-
ister, since the product of two 24-bit significand bitstrings is a 48-bit bitstring
which is then correctly rounded to 24 bits after normalization. Multiplication
of double precision or extended precision significands is not so straightfor-
ward, however, since dropping bits may, as in addition, lead to incorrectly
rounded results. The actual multiplication and division implementation may
or may not use the standard "long multiplication" and "long division" al-
gorithms which we learned in school. One alternative for division will be
mentioned in a later section. Multiplication and division are much more
complicated operations than addition and subtraction, and consequently they
may require more execution time on the computer; this is the case for PC's.
However, this is not necessarily true on faster supercomputers since it is pos-
sible, by using enough space on the chip, to design the microcode for the
instructions so that they are all equally fast.
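The three steps can be seen at work in NumPy's float32 type, which follows IEEE single precision; a small sketch:

```python
import numpy as np

# Multiply significands, add exponents, then correctly round: the exact
# product of two 24-bit significands generally needs rounding to 24 bits.
a = np.float32(1.0 + 2.0 ** -12)   # exactly representable in single

# The exact square is 1 + 2^-11 + 2^-24, which needs 24 fraction bits;
# round to nearest (an exact tie, resolved to the even last bit) drops
# the 2^-24 term:
assert float(a * a) == 1.0 + 2.0 ** -11
```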

Exercise 16 Assume that x and y are normalized numbers, i.e. 1 ≤ m < 2,
1 ≤ p < 2. How many bits may it be necessary to shift the significand product
m × p left or right to normalize the result?

4 Exceptional Situations

One of the most difficult things about programming is the need to anticipate
exceptional situations. In as much as it is possible to do so, a program
should handle exceptional data in a manner consistent with the handling of
standard data. For example, a program which reads integers from an input
file and echoes them to an output file until the end of the input file is reached
should not fail just because the input file is empty. On the other hand, if
it is further required to compute the average value of the input data, no
reasonable solution is available if the input file is empty. So it is with floating
point arithmetic. When a reasonable response to exceptional data is possible,
it should be used.

The simplest example is division by zero. Before the IEEE standard was
devised, there were two standard responses to division of a positive number
by zero. One, often used in the 1950's, was to generate the largest floating
point number as the result. The rationale offered by the manufacturers was
that the user would notice the large number in the output and draw the
conclusion that something had gone wrong. However, this often led to total
disaster: for example the expression 1/0 - 1/0 would then have a result of 0,
which is completely meaningless; furthermore, as the value is not large, the
user might not notice that any error had taken place. Consequently, it was
emphasized in the 1960's that division by zero should lead to the generation
of a program interrupt, giving the user an informative message such as "fatal
error -- division by zero". In order to avoid this abnormal termination, the
burden was on the programmer to make sure that division by zero would never
occur. Suppose, for example, it is desired to compute the total resistance in
an electrical circuit with two resistors connected in parallel, with resistances
respectively R1 and R2 ohms, as shown in Figure 4.

[Figure 4: two resistors, R1 and R2, connected in parallel.]

The standard formula for the total resistance of the circuit is

    T = 1 / (1/R1 + 1/R2).

This formula makes intuitive sense: if both resistances R1 and R2 are the
same value R, then the resistance of the whole circuit is T = R/2, since the
current divides equally, with equal amounts flowing through each resistor.
On the other hand, if one of the resistances, say R1, is very much smaller
than the other, the resistance of the whole circuit is almost equal to that
very small value, since most of the current will flow through that resistor and
avoid the other one. What if R1 is zero? The answer is intuitively clear: if one
resistor offers no resistance to the current, all the current will flow through
that resistor and avoid the other one; therefore, the total resistance in the
circuit is zero. The formula for T also makes perfect sense mathematically;
we get

    T = 1 / (1/0 + 1/R2) = 1 / (∞ + 1/R2) = 1/∞ = 0.

Why, then, should a programmer writing code for the evaluation of parallel
resistance formulas have to worry about treating division by zero as an ex-
ceptional situation? In IEEE floating point arithmetic, the programmer is
relieved from that burden. As long as the initial floating point environment
is properly set (as explained below), division by zero does not generate an
interrupt but gives an infinite result, program execution continuing normally.
In the case of the parallel resistance formula this leads to a final correct result
of 1/∞ = 0.
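Python's own "/" raises ZeroDivisionError on floats rather than following the IEEE default response, but NumPy scalars do follow it once the divide exception is masked. A sketch, with a helper name of our own choosing:

```python
import numpy as np

def parallel_resistance(r1, r2):
    """T = 1 / (1/R1 + 1/R2), evaluated with the IEEE standard response:
    masking the divide-by-zero exception gives 1/0 = inf, and execution
    continues normally."""
    with np.errstate(divide='ignore'):
        return 1.0 / (1.0 / np.float64(r1) + 1.0 / np.float64(r2))

assert parallel_resistance(3.0, 3.0) == 1.5    # equal resistors: R/2
assert parallel_resistance(0.0, 2.0) == 0.0    # short circuit: 1/inf = 0
```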

It is, of course, true that a × 0 has the value 0 for any finite value of a.
Similarly, we can adopt the convention that a/0 = ∞ for any positive value
of a. Multiplication with ∞ also makes sense: a × ∞ has the value ∞ for any
positive value of a. But the expressions ∞ × 0 and 0/0 make no mathematical
sense. An attempt to compute either of these quantities is called an invalid
operation, and the IEEE standard calls for the result of such an operation to
be set to NaN (Not a Number). Any arithmetic operation on a NaN gives a
NaN result, so any subsequent computations with expressions which have a
NaN value are themselves assigned a NaN value. When a NaN is discovered
in the output of a program, the programmer knows something has gone wrong
and can invoke debugging tools to determine what the problem is. This may
be assisted by the fact that the bitstring in the fraction field can be used to
code the origin of the NaN. Consequently, we do not speak of a unique NaN
value but of many possible NaN values. Note that an ∞ in the output of
a program may or may not indicate a programming error, depending on the
context.

Addition and subtraction with ∞ also make mathematical sense. In the
parallel resistance example, we see that ∞ + 1/R2 = ∞. This is true even if
R2 also happens to be zero, because ∞ + ∞ = ∞. We also have a + ∞ = ∞
and a - ∞ = -∞ for any finite value of a. But there is no way to make
sense of the expression ∞ - ∞, which must therefore have a NaN value.
(These observations can be justified mathematically by considering addition
of limits. Suppose there are two sequences x_k and y_k both diverging to ∞,
e.g. x_k = 2^k, y_k = 2k, for k = 1, 2, 3, .... Clearly, the sequence x_k + y_k
must also diverge to ∞. This justifies the expression ∞ + ∞ = ∞. But it is
impossible to make a statement about the limit of x_k - y_k without knowing
more than the fact that they both diverge to ∞, since the result depends on
which of x_k or y_k diverges faster to ∞.)
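These conventions are exactly what IEEE double precision delivers, and they can be checked in Python with math.inf:

```python
import math

inf = math.inf

# Operations with infinity that make mathematical sense:
assert 1.0 / inf == 0.0
assert inf + inf == inf
assert 3.0 + inf == inf and 3.0 - inf == -inf

# inf - inf is an invalid operation, so its result is NaN, and any
# subsequent arithmetic on that NaN produces a NaN:
assert math.isnan(inf - inf)
assert math.isnan((inf - inf) + 1.0)
```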

Exercise 17 What are the possible values for

    1/a - 1/b

where a and b are both nonnegative (positive or zero)?

Exercise 18 What are sensible values for the expressions ∞/0, 0/∞ and
∞/∞?

Exercise 19 Using the 1950's convention for treatment of division by zero
mentioned above, the expression (1/0)/10000000 results in a number very
much smaller than the largest floating point number. What is the result in
IEEE arithmetic?

The reader may very reasonably ask the following question: why should
1/0 have the value ∞ rather than -∞? This is the main reason for the
existence of the value -0, so that the conventions a/0 = ∞ and a/(-0) = -∞
may be followed, where a is a positive number. (The reverse holds if a is
negative.) It is essential, however, that the logical expression ⟨0 = -0⟩ have
the value true while ⟨∞ = -∞⟩ have the value false. Thus we see that it is
possible that the logical expressions ⟨a = b⟩ and ⟨1/a = 1/b⟩ have different
values, namely in the case a = 0, b = -0 (or a = ∞, b = -∞). This
phenomenon is a direct consequence of the convention for handling infinity.
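The case a = 0, b = -0 can be demonstrated with NumPy scalars, which follow these IEEE conventions once the divide exception is masked:

```python
import numpy as np

a, b = np.float64(0.0), np.float64(-0.0)
assert a == b                            # <0 = -0> is true

with np.errstate(divide='ignore'):
    assert 1.0 / a == np.inf             # 1/0 = +inf
    assert 1.0 / b == -np.inf            # 1/(-0) = -inf
    assert not (1.0 / a == 1.0 / b)      # <inf = -inf> is false

# So <a = b> and <1/a = 1/b> have different values here.
```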

Exercise 20 What are the values of the expressions 0/(-0), ∞/(-∞) and
1/(-0)?

Exercise 21 What is the result of the parallel resistance formula if an input
value is negative, -0, or NaN?

Another perhaps unexpected consequence of these conventions concerns
arithmetic comparisons. When a and b are real numbers, one of three condi-
tions holds: a = b, a < b or a > b. The same is true if a and b are floating
point numbers in the conventional sense, even if the values ±∞ are permitted.
However, if either a or b has a NaN value, none of the three conditions can
be said to hold (even if both a and b have NaN values). Instead, a and b are
said to be unordered. Consequently, although the logical expressions ⟨a ≤ b⟩
and ⟨not(a > b)⟩ usually have the same value, they have different values (the
first false, the second true) if either a or b is a NaN.
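Python's comparison operators follow these IEEE rules, so the unordered case is easy to exhibit:

```python
import math

a, b = math.nan, 1.0

# None of a = b, a < b, a > b holds: a and b are unordered.
assert not (a == b) and not (a < b) and not (a > b)

# Hence <a <= b> is false while <not(a > b)> is true:
assert (a <= b) is False
assert (not (a > b)) is True
```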

Let us now turn our attention to overflow and underflow. Overflow is
said to occur when the true result of an arithmetic operation is finite but
larger in magnitude than the largest floating point number which can be
stored using the given precision. As with division by zero, there were two
standard treatments before IEEE arithmetic: either set the result to (plus
or minus) the largest floating point number, or interrupt the program with
an error message. In IEEE arithmetic, the standard response depends on the
rounding mode. Suppose that the overflowed value is positive. Then round up
gives the result ∞, while round down and round towards zero set the result to
the largest floating point number. In the case of round to nearest, the result
is ∞. From a strictly mathematical point of view, this is not consistent with
the definition for non-overflowed values, since a finite overflow value cannot
be said to be closer to ∞ than to some other finite number. From a practical
point of view, however, the choice ∞ is important, since round to nearest is
the default rounding mode and any other choice may lead to very misleading
final computational results.
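Overflow under round to nearest, the default mode, can be triggered directly in double precision:

```python
import math

big = 1.0e308                    # near the largest double, about 1.8e308

# The true product 1e309 is finite but too large to store, so the
# standard response under round to nearest is infinity:
assert big * 10.0 == math.inf

# A product that still fits remains finite:
assert big * 1.5 < math.inf
```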

Underflow is said to occur when the true result of an arithmetic operation
is smaller in magnitude than the smallest normalized floating point number
which can be stored. Historically, the response to this was almost always
the same: replace the result by zero. In IEEE arithmetic, the result may
be a subnormal number instead of zero. This allows results much
smaller than the smallest normalized number to be stored, closing the gap
between the normalized numbers and zero as illustrated earlier. However,
it also allows the possibility of loss of accuracy, as subnormal numbers have
fewer bits of precision than normalized numbers.
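In double precision the smallest normalized number is 2^-1022, and the subnormal range below it can be explored with math.ldexp:

```python
import math

tiny = math.ldexp(1.0, -1022)          # smallest normalized double

# Results below tiny underflow to subnormal numbers, not to zero:
assert math.ldexp(1.0, -1023) != 0.0   # subnormal
assert math.ldexp(1.0, -1074) != 0.0   # the smallest subnormal
assert math.ldexp(1.0, -1075) == 0.0   # below that, underflow to zero

# Thanks to subnormals, this quotient is still represented exactly:
assert tiny / 4.0 == math.ldexp(1.0, -1024)
```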

Exercise 22 Work out the sensible rounding conventions for underflow. For
example, using round to nearest, what values are rounded down to zero and
what values are rounded up to the smallest subnormal number?

Exercise 23 More often than not the result of ∞ following division by zero
indicates a programming problem. Given two numbers a and b, consider
setting

    c = a / √(a^2 + b^2),  d = b / √(a^2 + b^2).

Is it possible that c or d (or both) is set to the value ∞, even if a and b are
normalized numbers?⁵

Exercise 24 Consider the computation of the previous exercise again. Is it
possible that c and d could have many fewer digits of accuracy than a and b,
even though a and b are normalized numbers?

Table 4: IEEE Standard Response to Exceptions

    Invalid Operation    Set result to NaN
    Division by Zero     Set result to ±∞
    Overflow             Set result to ±∞ or largest f.p. number
    Underflow            Set result to zero or subnormal number
    Precision            Set result to correctly rounded value

Altogether, the IEEE standard defines five kinds of exceptions: invalid
operation, division by zero, overflow, underflow and precision, together with
a standard response for each of these. All of these have just been described
except the last. The last exception is, in fact, not exceptional at all because it
occurs every time the result of an arithmetic operation is not a floating point
number and therefore requires rounding. Table 4 summarizes the standard
response for the five exceptions.

The IEEE standard specifies that when an exception occurs it must be
signaled by setting an associated status flag, and that the programmer should
have the option of either trapping the exception, providing special code to
be executed when the exception occurs, or masking the exception, in which
case the program continues executing with the standard response shown in
the table. If the user is not programming in assembly language, but in a
higher-level language being processed by an interpreter or a compiler, the
ability to trap exceptions may or may not be passed on to the programmer.
However, users rarely need to trap exceptions in this manner. It is usually
better to mask all exceptions and rely on the standard responses described
in the table. Again, in the case of higher-level languages the interpreter or
compiler in use may or may not mask the exceptions as its default action.
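One higher-level language that does pass this choice on to the programmer is Python with NumPy, where np.errstate selects, per exception, either the masked standard response or a trap (delivered as a FloatingPointError). A sketch:

```python
import numpy as np

# Masked: the standard response from the table (division by zero -> inf).
with np.errstate(divide='ignore'):
    assert np.float64(1.0) / np.float64(0.0) == np.inf

# Trapped: the same operation instead transfers control to special code.
trapped = False
with np.errstate(divide='raise'):
    try:
        np.float64(1.0) / np.float64(0.0)
    except FloatingPointError:
        trapped = True
assert trapped
```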

⁵ A better way to make this computation may be found in the LAPACK routine slartg,
which can be obtained by sending the message "slartg from lapack" to the Internet address
netlib@ornl.gov