Floating Point

“The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number.”

A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent very large negative number (-1.23×10^88) and very small negative number (-1.23×10^88), as well as zero, as illustrated:

A floating-point number is typically expressed in the scientific notation, with a fraction (F), and an exponent (E) of a certain radix (r), in the form of F×r^E. Decimal numbers use radix of 10 (F×10^E); while binary numbers use radix of 2 (F×2^E). Representation of floating point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3. It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of (e.g., 32- or 64-bit). This is because there are infinite number of real numbers (even within a small range of says 0.0 to 0.1). It is also important to note that floating number arithmetic is very much less efficient than integer arithmetic. It could be speed up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers. In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative.

Fixed‐Point Data Types “In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (after the decimal point '.' in English decimal notation).” In a digital hardware, numbers are stored in binary words. A binary word is a fixed‐length sequence of bits (1's and 0's). How hardware components or software functions interpret this sequence of 1's and 0's is defined by the data type. Binary numbers are represented as either fixed‐point or floating‐point data types. In order to implement an algorithm such as communication algorithms, the algorithm should be converted to the fixed‐point domain and then it should be described with Hardware Description Language (HDL). In HDL coding process, it is necessary to indicate the size of the variables and registers. The registers should be large enough to represent the value of parameters with the desired precision. In the fixed‐point domain a pair (W, F) is considered for each of the parameters in the algorithm, where W is the word length of the parameters and F is the fractional length of the parameters. It is obvious that larger W and F results in a better performance and lower bit error rate (BER) but the design needs a large silicon area. On the other hand smaller W and F result in a larger BER but less area. So we should choose suitable values of (W, F) for each parameter in the algorithm. For this reason a simulation should be ran for the algorithm to get the dynamic range of the parameters. Simulation results indicate the dynamic range of the variables and the number of bits for W and F, which are used to represent the variables with the desired precision. According to the previous section, a fixed‐point data type is characterized by the word length in bits, the position of the binary point, and whether it is signed or unsigned. The position of the binary point is the means by which fixed‐point values are scaled and interpreted. For example, a binary representation of a generalized fixed‐point number (either signed or unsigned) is shown below:

Where: bi is the ith binary digit wl is the word length in bits. bwl-1 is the location of the most significant, or highest, bit (MSB) b0 is the location of the least significant, or lowest, bit (LSB). The binary point is shown three places to the left of the LSB. In this example, therefore, the number is said to have three fractional bits, or a fraction length of three. Fixed‐point data types can be either signed or unsigned. Signed binary fixed‐point numbers are typically represented in one of these ways: Sign, One's complement, Two's complement. Two's complement is the most common representation of signed fixed‐point numbers

Conversion A fixed-point number is a representation of a real number using a certain number of bits of a type for the integer part, and the remaining bits of the type for the fractional part. The number of bits representing each part is fixed (hence the name, fixed-point). An integer type is usually used to store fixed-point values.

Fixed-point numbers are usually used in systems which don't have floating point support, or need more speed than floating point can provide. Fixed-point calculations can be performed using the CPU's integer instructions.

A 32-bit fixed-point number would be stored in an 32-bit type such as int. Normally each bit in an (unsigned in this case) integer type would represent an integer value 2^n as follows:

1 0 1 1 0 0 1 0 = 2^7 + 2^5 + 2^4 + 2^1 = 178 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0 But if the type is used to store a fixed-point value, the bits are interpreted slightly differently:

1 0 1 1 0 0 1 0 = 2^3 + 2^1 + 2^0 + 2^-3 = 11.125 2^3 2^2 2^1 2^0 2^-1 2^-2 2^-3 2^-4 The fixed point number in the example above is called a 4.4 fixed-point number, since there are 4 bits in the integer part and 4 bits in the fractional part of the number. In a 32 bit type the fixed-point value would typically be in 16.16 format, but also could be 24.8, 28.4 or any other combination.

Converting from a floating-point value to a fixed-point value involves the following steps:

1. Multiply the float number by 2^(number of fractional bits for the type), eg. 2^8 for 24.8 2. Round the result (just add 0.5) if necessary, and floor it (convert to binary) leaving an integer value. 3. Assign this value into the fixed-point according to type of format that is used to multiply by float number. Obviously you can lose some precision in the fractional part of the number. If the precision of the fractional part is important, the choice of fixed-point format can reflect this - eg. use 16.16 or 8.24 instead of 24.8. OR

Step4: Assign this binary value into the fixed-point type format.

Qm.n Number Representation

Q is a fixed point number format where the number of fractional bits (and optionally the number of integer bits) is specified. For example, a Q15 number has 15 fractional bits; a Q1.14 number has 1 integer bit and 14 fractional bits. Q format is often used in hardware that does not have a floating-point unit For example, a Q14.1 format number:

 requires 14+1+1 = 16 bits for signed  requires 14+1=15bits for unsigned

Float to Qm.n To convert a number from floating point to Qm.n format: Multiply the floating point number by 2n Round to the nearest integer

Practice some examples of conversions