IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (>= 43-bit, not commonly used) and double-extended precision (>= 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that they implement IEEE arithmetic, although sometimes it is optional. The C programming language, for example, allows but does not require IEEE arithmetic. IEEE arithmetic is commonly used in C implementations, where float implements IEEE single precision and double implements IEEE double precision.
The standard is also known as the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) and as IEC 559, "Binary floating-point arithmetic for microprocessor systems".
Anatomy of a floating point number
Following is a description of the standard's format for floating point numbers.
Bits within a word of width W are indexed with integers in the range 0 to W-1 inclusive. Bit 0 is drawn on the right. When considering the word or regions within the word as binary numbers then usually the lowest indexed bit will also be the least significant.
A single-precision binary floating point number is stored in a 32 bit word:
 1     8               23              width in bits
+-+--------+-----------------------+
|S|  Exp   |       Fraction        |
+-+--------+-----------------------+
31 30    23 22                    0    bit index (0 on right)
     bias +127
S - sign
Exp - Exponent
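The field layout above can be illustrated with a short Python sketch. It assumes the standard `struct` module's `"f"` format produces an IEEE single-precision encoding (true on common platforms); the function name `split_single` is made up for this example.

```python
import struct

def split_single(x):
    """Return (S, Exp, Fraction) for x after rounding to single precision."""
    # Pack x as a big-endian 32-bit float, then reinterpret the same
    # four bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # bit 31
    exp = (bits >> 23) & 0xFF       # bits 30..23, still biased
    fraction = bits & 0x7FFFFF      # bits 22..0
    return sign, exp, fraction

print(split_single(1.0))    # (0, 127, 0)
print(split_single(-2.5))   # (1, 128, 2097152)
```

The three values returned are the raw field contents; the exponent is still in its stored (biased) form.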
The set of possible data values can be divided into the following classes: zeroes, normalised numbers, denormalised numbers, infinities, and NaNs (Not a Number).
(NaNs are used to represent exceptional cases, such as the square root of a negative number.)
Each class can be distinguished by the value of the Exp field, together with whether the Fraction field is zero:
Consider the Exp and Fraction fields as unsigned binary numbers (Exp will be in the range 0-255):
Class                  Exp     Fraction
Zeroes                 0       0
Denormalised numbers   0       non zero
Normalised numbers     1-254   any
Infinities             255     0
NaN (Not a Number)     255     non zero
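The classification table translates almost directly into code. The following is a minimal sketch that takes the 32-bit pattern as a Python integer; the function name `classify_single` and the returned strings are chosen for illustration only.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision bit pattern given as an integer."""
    exp = (bits >> 23) & 0xFF       # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF      # 23-bit fraction field
    if exp == 0:
        return "zero" if fraction == 0 else "denormalised"
    if exp == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalised"             # Exp in 1..254, any Fraction

print(classify_single(0x3F800000))  # normalised (this pattern is 1.0)
print(classify_single(0x7F800000))  # infinity
print(classify_single(0x00000001))  # denormalised
```

The sign bit plays no part in the classification, so the same function handles both positive and negative patterns.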
For normalised numbers, the most common class, Exp is the biased exponent and Fraction is the fractional part of the mantissa. The number has value v:
v = s * 2^e * m
Where
s = 1 (positive numbers) when S is 0
s = -1 (negative numbers) when S is 1
e = Exp - 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.Fraction in binary (the binary number 1 followed by the point followed by the binary bits of Fraction). Note 1 <= m < 2
Denormalised numbers use the same formula, but e is fixed at -126 and m is 0.Fraction (so 0 < m < 1). Note that -126 is also the smallest exponent for a normalised number.
There are two Zeroes, +0 (S is 0) and -0 (S is 1), and two Infinities +Inf (S is 0) and -Inf (S is 1).
Notice that NaNs and Infinities have all 1s in the Exp field.
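The value formula v = s * 2^e * m, for both normalised and denormalised numbers, can be sketched as follows. This is an illustrative decoder, not a reference implementation; the name `decode_single` is made up here, and infinities and NaNs are simply rejected.

```python
def decode_single(bits):
    """Compute the value of a finite single-precision bit pattern."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exp == 255:
        raise ValueError("infinity or NaN has no finite value")
    s = -1.0 if sign else 1.0
    if exp == 0:
        # Zeroes and denormalised numbers: e = -126, m = 0.Fraction
        e = -126
        m = fraction / 2**23
    else:
        # Normalised numbers: e = Exp - 127, m = 1.Fraction
        e = exp - 127
        m = 1 + fraction / 2**23
    return s * 2.0**e * m

print(decode_single(0x3F800000))  # 1.0
print(decode_single(0xC0200000))  # -2.5
```

Because Python floats are double precision on common platforms, every finite single-precision value, including the smallest denormalised number 2^-149, is computed exactly by this sketch.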
Double precision is basically the same but the fields are wider:
 1     11                           52
+-+-----------+----------------------------------------------------+
|S|    Exp    |                      Fraction                      |
+-+-----------+----------------------------------------------------+
63 62       52 51                                                 0
     bias +1023
NaNs and Infinities are represented with Exp being all 1s (2047).
For Normalised numbers the exponent bias is +1023 (so e is Exp - 1023). For Denormalised numbers the exponent is -1022 (the minimum exponent for a normalised number).
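The same field extraction works for double precision with wider masks. This sketch assumes Python's `struct` format `"d"` is an IEEE double (true on common platforms); `split_double` is a name made up for this example.

```python
import struct

def split_double(x):
    """Return (S, Exp, Fraction) for a double-precision value."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # bit 63
    exp = (bits >> 52) & 0x7FF          # bits 62..52, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # bits 51..0
    return sign, exp, fraction

print(split_double(1.0))            # (0, 1023, 0)
print(split_double(float("inf")))   # (0, 2047, 0)
```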
An interesting feature of this particular representation is that it makes comparisons of most of the numbers simple. For positive numbers a and b (sign bit 0, NaNs excepted), a < b whenever the unsigned binary integers with the same bit patterns as a and b are ordered the same way. In other words, if you are comparing two positive floating point numbers you can just use an unsigned binary integer comparison on the same bits.
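The ordering property for positive numbers can be checked with a small experiment. This is a sketch under the assumption that `struct`'s `"f"` format is IEEE single precision; the helper name `single_bits` and the sample values are chosen for illustration, and every sample is exactly representable in single precision.

```python
import struct

def single_bits(x):
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

# Positive values in increasing order: a denormalised number, the smallest
# normalised number, some ordinary values, and +Inf.
values = [2.0**-149, 2.0**-126, 0.5, 1.0, 1.5, 3.0, 2.0**100, float("inf")]
patterns = [single_bits(v) for v in values]

# The integer patterns are in the same increasing order as the values.
print(patterns == sorted(patterns))             # True

# With the sign bit set the integer order is reversed:
# -2.0 < -1.0, yet the pattern of -2.0 is the larger integer.
print(single_bits(-2.0) > single_bits(-1.0))    # True
```

The second check shows why the shortcut is stated only for positive numbers: negative values need separate handling before an integer comparison gives the right answer.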
See also: Let's Go To The (Floating) Point by Chris Hecker (http://www.d6.com/users/checker/pdfs/gdmfp.pdf)