Chapter 5: Binary Number Formats
In this chapter, we introduce fixed- and floating-point number systems that can represent rational numbers. Fixed-point numbers are analogous to decimals; some of the bits represent the integer part, and the rest represent the fraction. Floating-point numbers are analogous to scientific notation, with a mantissa and an exponent.
Objectives
By the end of this chapter you should be able to:
- Use the fixed- and floating-point number systems to represent rational numbers.
- Demonstrate signed fixed- and floating-point numbers.
- Recall how to convert decimal numbers to binary numbers.
- Express rational numbers in scientific notation.
- Identify the biased exponent for IEEE 754 representation.
- Demonstrate the floating-point precision.
5.1 Number Systems for Binary Representations
Computers operate on both integers and fractions. So far, the numbers we can represent in binary are positive and negative integers. Positive integers are represented with unsigned binary numbers, whereas negative integers are represented with two’s complement or sign/magnitude numbers. How can we represent fractions? There are two common notations for numbers with fractions: (1) fixed-point notation and (2) floating-point notation. In fixed-point notation, the location of the binary point is fixed, so there is a fixed number of digits after the binary point. Floating-point notation, on the other hand, allows a varying number of digits after the binary point: the binary point floats to the right of the most significant ‘1’ bit.
5.2 Fixed-Point Number Representation
A decimal number can be expressed as the sum of the products of each digit and that digit’s weight. Thus, the decimal number $123.45_{10}$ can be expressed as
$$123.45_{10} = 1 \times 10^{2} + 2 \times 10^{1} + 3 \times 10^{0} + 4 \times 10^{-1} + 5 \times 10^{-2}$$
Moving left, the weight of each digit increases by a factor of 10; moving right, it decreases by a factor of 10.
Now, let’s look at binary number representation. A binary number can be expressed in the same way, as the sum of the products of each digit and that digit’s weight. Thus, the binary number $101.11_2$ can be expressed as
$$101.11_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} = 5.75_{10}$$
In binary, the weight of each digit increases by a factor of 2 moving left and decreases by a factor of 2 moving right.
For example, what decimal number does the binary number $1011.1011_2$ represent? We find its decimal value by summing the products of each digit and its weight:
$$1011.1011_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 1 \times 2^{-4} = 11.6875_{10}$$
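The weighted-sum rule can be checked with a few lines of Python. This is a minimal sketch, not part of the original chapter; the function name `fixed_point_to_decimal` is a hypothetical helper, and it assumes an unsigned binary string with an explicit binary point.

```python
def fixed_point_to_decimal(bits: str) -> float:
    """Evaluate an unsigned binary string such as '1011.1011' as a decimal value."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Walk the integer digits from right to left: weights 2^0, 2^1, 2^2, ...
    for position, digit in enumerate(reversed(integer_part)):
        value += int(digit) * 2 ** position
    # Walk the fraction digits from left to right: weights 2^-1, 2^-2, ...
    for position, digit in enumerate(fraction_part, start=1):
        value += int(digit) * 2 ** -position
    return value


print(fixed_point_to_decimal("101.11"))     # 5.75
print(fixed_point_to_decimal("1011.1011"))  # 11.6875
```

Each loop iteration adds one "digit times weight" term, so the code mirrors the hand calculation term by term.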
What about the reverse process? Let’s convert the decimal number $6.75_{10}$ to a fixed-point binary number. First, we split the value into its integral part, 6, and its fractional part, 0.75. The integral part is converted to binary by repeated division by 2, giving $6_{10} = 110_2$. The fractional part 0.75 is converted to binary by repeated multiplication by 2, as shown below:
$0.75 \times 2 = 1.5$ → remove overflow digit 1
$0.5 \times 2 = 1.0$ → remove overflow digit 1
By collecting all the overflow digits from top to bottom, we can represent the decimal number $6.75_{10}$ as the binary number $110.11_2$.
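The repeated-multiplication procedure can be sketched in Python as follows. The function name `decimal_to_fixed_point` and its `fraction_bits` parameter are assumptions made for illustration, and the sketch only handles non-negative values.

```python
def decimal_to_fixed_point(value: float, fraction_bits: int) -> str:
    """Convert a non-negative decimal value to an unsigned fixed-point binary string."""
    integer_part = int(value)
    fraction_part = value - integer_part

    # Integer part: bin() yields the same digits as repeated division by 2.
    integer_digits = bin(integer_part)[2:]

    # Fraction part: multiply by 2 repeatedly and collect the overflow digits.
    fraction_digits = []
    for _ in range(fraction_bits):
        fraction_part *= 2
        overflow = int(fraction_part)   # 1 if the product reached 1.0, otherwise 0
        fraction_digits.append(str(overflow))
        fraction_part -= overflow       # remove the overflow digit
    return integer_digits + "." + "".join(fraction_digits)


print(decimal_to_fixed_point(6.75, 2))  # 110.11
```

If the fraction does not terminate within `fraction_bits` digits, this sketch simply truncates the remaining digits.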
Exercises
Represent the decimal number $12.6875_{10}$ as a binary number using 4 integer bits and 4 fraction bits.
After splitting the decimal number into the integral part and the fractional part, we convert the integral part as shown below:
- Integral part: 12 → 1100
The fractional part 0.6875 is converted to binary by repeated multiplication by 2, as shown below:
$0.6875 \times 2 = 1.375$ → remove overflow digit 1
$0.375 \times 2 = 0.75$ → no overflow 0
$0.75 \times 2 = 1.5$ → remove overflow digit 1
$0.5 \times 2 = 1.0$ → remove overflow digit 1
By collecting all the overflow digits from top to bottom, we can represent the decimal number $12.6875_{10}$ as the binary number $1100.1011_2$.
Signed Fixed-Point Numbers
Fixed-point numbers can represent both positive and negative values using the two’s complement or sign/magnitude number system.
For example, let’s represent the decimal number $-7.5_{10}$ as a signed fixed-point binary number using 4 integer bits and 4 fraction bits. In the sign/magnitude number system, the first bit always represents the sign. Since $-7.5_{10}$ is negative, the sign bit is ‘1’. The remaining three integer bits hold the integer magnitude, so we convert the integer part 7 into the binary value 111. The fractional part 0.5 is converted by repeated multiplication: $0.5 \times 2 = 1.0$ (overflow digit: 1), giving the fraction bits 1000. The number $-7.5_{10}$ is therefore the sign/magnitude fixed-point number 1111.1000, where the binary point is fixed.
Now, let’s represent the decimal number $-7.5_{10}$ as a two’s complement number. First, we find the positive representation of the number, and then we negate it. We find the positive representation of 7.5 by splitting the value into the integral part and the fractional part: integral 0111 and fractional 1000. We then negate the positive representation 01111000 by inverting all the bits and adding 1 to the LSB (least significant bit), as shown below:
$$01111000 \;\xrightarrow{\text{invert}}\; 10000111 \;\xrightarrow{+1}\; 10001000$$
So $-7.5_{10}$ is 1000.1000 in two’s complement.
In the two’s complement number system, the most significant bit carries a negative weight. The other bits have their usual positive binary weights, as shown below:
Digits | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Weights | -8 | 4 | 2 | 1 | 0.5 | 0.25 | 0.125 | 0.0625 |
That means the first digit ‘1’ contributes -8 and the fifth digit ‘1’ contributes 0.5. Their sum is -7.5 ($-8 + 0.5 = -7.5$), which matches the result of the negation above.
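The invert-and-add-one procedure can be sketched in Python. The function name `to_twos_complement_fixed` and its default of 4 integer and 4 fraction bits are assumptions chosen to match the example; the sketch scales the value to an integer first so that the bit operations apply directly.

```python
def to_twos_complement_fixed(value: float, integer_bits: int = 4, fraction_bits: int = 4) -> str:
    """Encode a value as a two's complement fixed-point bit string."""
    total_bits = integer_bits + fraction_bits
    mask = (1 << total_bits) - 1
    scaled = round(value * 2 ** fraction_bits)       # -7.5 becomes -120 with 4 fraction bits

    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if not lowest <= scaled <= highest:
        raise OverflowError("value does not fit in the chosen format")

    pattern = abs(scaled)                            # positive representation, e.g. 01111000
    if scaled < 0:
        inverted = pattern ^ mask                    # invert all the bits
        pattern = (inverted + 1) & mask              # add 1 to the LSB

    bits = format(pattern, f"0{total_bits}b")
    return bits[:integer_bits] + "." + bits[integer_bits:]


print(to_twos_complement_fixed(7.5))    # 0111.1000
print(to_twos_complement_fixed(-7.5))   # 1000.1000
```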
Exercises
Convert the following two’s complement binary fixed-point numbers to base 10. The implied binary point is explicitly shown to aid in your interpretation.
- 0101.1000 = 5.5
The integer part is 0101, which gives $0 \times (-8) + 1 \times 4 + 0 \times 2 + 1 \times 1 = 5$. The fractional part is 1000, which gives $1 \times 0.5 + 0 \times 0.25 + 0 \times 0.125 + 0 \times 0.0625 = 0.5$. The sum of the integer part and the fractional part is 5.5.
- 1111.1111 = -0.0625
The integer part is 1111, which gives $1 \times (-8) + 1 \times 4 + 1 \times 2 + 1 \times 1 = -1$. The fractional part is 1111, which gives $1 \times 0.5 + 1 \times 0.25 + 1 \times 0.125 + 1 \times 0.0625 = 0.9375$. The sum of the integer part and the fractional part is $-1 + 0.9375 = -0.0625$.
- 1000.0000 = -8
The fractional part is zero, so only the integer part 1000 contributes, and its value is -8.
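The exercise answers can be double-checked with a short decoder. This is a sketch under the assumption that the input is a two's complement bit string with an explicit binary point; the name `from_twos_complement_fixed` is hypothetical.

```python
def from_twos_complement_fixed(bits: str) -> float:
    """Decode a two's complement fixed-point bit string such as '1111.1111'."""
    integer_part, _, fraction_part = bits.partition(".")
    digits = integer_part + fraction_part
    raw = int(digits, 2)                 # read all bits as an unsigned integer
    if digits[0] == "1":
        raw -= 1 << len(digits)          # the most significant bit carries a negative weight
    return raw / 2 ** len(fraction_part) # move the binary point back into place


print(from_twos_complement_fixed("0101.1000"))  # 5.5
print(from_twos_complement_fixed("1111.1111"))  # -0.0625
print(from_twos_complement_fixed("1000.0000"))  # -8.0
```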
5.3 Floating-Point Number Representation
In a floating-point number, the binary point is always assumed to sit just after the most significant digit, much like decimal scientific notation. Before we dive into binary, let’s look at the decimal number $273_{10}$. We can write it in scientific notation: $273 = 2.73 \times 10^{2}$.
In general, a number is written in scientific notation as follows:
$$\pm M \times B^{E}$$
where the symbol $M$ defines the mantissa (fraction), the symbol $B$ defines the base, and the symbol $E$ defines the exponent. In the example, the mantissa $M$ is 2.73, the base $B$ is 10, and the exponent $E$ is 2.
A binary number can be written in scientific notation in the same way, with base 2. Once we have the scientific notation, we can store the binary number in 32 bits, as shown in Fig. 5-1. The first bit stores the sign: if the sign bit is 0, the number is positive; otherwise the number is negative. The next field (8 bits) stores the exponent. The last field (23 bits) stores the mantissa, that is, the digits of the number.
Fig. 5-1. Floating-Point Number Representation
We will show how to represent the decimal number $228_{10}$ using a 32-bit floating-point representation. We develop it in three versions; the final version is the IEEE 754 floating-point standard.
First, we need to convert the decimal number to a binary number:
$$228_{10} = 11100100_2$$
After getting the binary number, we can write it in “binary scientific notation”:
$$11100100_2 = 1.11001_2 \times 2^{7}$$
where we can identify the mantissa $M = 1.11001$, the base $B = 2$, and the exponent $E = 7$. Let’s fill in each field of the 32-bit floating-point number:
- The sign bit is positive (0)
- The 8-bit exponent represents the value 7: 00000111
- The remaining 23 bits are the mantissa: 11100100000000000000000
The mantissa has only 6 significant digits in this example, so the rest of the 23-bit mantissa field is filled with ‘0’. The following figure shows the first representation of the floating-point number.
Fig. 5-2. Floating-Point Number Representation 1
The first bit of the mantissa of a normalized binary number is always ‘1’, so storing it is redundant. For efficiency, this implicit leading one is not included in the 23-bit field; we store only the fraction bits that follow it. For $228_{10}$, the stored fraction becomes 11001000000000000000000. The following figure shows the second representation of the floating-point number.
Fig. 5-3. Floating-Point Number Representation 2
Notice that the first bit of the mantissa is gone. Now we only store the fraction.
The exponent field needs to represent both positive and negative exponents. To do so, floating-point uses a biased exponent, which is the original exponent plus a constant bias. The 32-bit floating-point format uses a bias of 127. The exponent of 7 is therefore stored as the biased exponent $127 + 7 = 134 = 10000110_2$. The IEEE 754 32-bit floating-point representation of $228_{10}$ is shown in the following figure:
Fig. 5-4. Floating-Point Number Representation 3 – IEEE 754
The hexadecimal representation of the number is 0x43640000.
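The steps above (normalize, drop the leading one, add the bias) can be traced in Python. This is a minimal sketch rather than a full IEEE 754 encoder: the function name `encode_single` is hypothetical, it handles only finite nonzero values whose exponent fits the normal range, and it truncates instead of rounding when more than 23 fraction bits would be needed. The standard `struct` module is used only to cross-check the result.

```python
import math
import struct


def encode_single(value: float) -> tuple[str, str, str]:
    """Return (sign, biased exponent, fraction) bit strings for a finite nonzero value."""
    if value == 0 or not math.isfinite(value):
        raise ValueError("this sketch handles only finite nonzero values")
    sign = "1" if value < 0 else "0"
    mantissa = abs(value)
    exponent = 0
    # Normalize so that 1 <= mantissa < 2, adjusting the exponent to compensate.
    while mantissa >= 2:
        mantissa /= 2
        exponent += 1
    while mantissa < 1:
        mantissa *= 2
        exponent -= 1
    biased = exponent + 127                      # single-precision bias
    fraction = int((mantissa - 1) * 2 ** 23)     # drop the implicit leading one
    return sign, format(biased, "08b"), format(fraction, "023b")


sign, exponent, fraction = encode_single(228)
word = int(sign + exponent + fraction, 2)
print(sign, exponent, fraction)   # 0 10000110 11001000000000000000000
print(f"0x{word:08X}")            # 0x43640000

# Cross-check against the bytes Python itself produces for a 32-bit float.
print(struct.pack(">f", 228.0) == word.to_bytes(4, "big"))  # True
```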
Exercises
Write the decimal number $-58.25_{10}$ in IEEE 754 floating-point format.
First, we need to convert the magnitude of the decimal number to a binary number, as shown below:
$$58.25_{10} = 111010.01_2$$
Second, we can write the binary number in “binary scientific notation”:
$$111010.01_2 = 1.1101001_2 \times 2^{5}$$
where we can identify the mantissa $M = 1.1101001$, the base $B = 2$, and the exponent $E = 5$. Let’s fill in each field of the 32-bit floating-point number:
- The sign bit is negative (1)
- The 8-bit biased exponent bits: $127 + 5 = 132 = 10000100_2$
- 23 fraction bits: 110 1001 0000 0000 0000 0000
Note that the first bit of the mantissa is gone and we have 23 fraction bits.
Fig. 5-5. Floating-Point Number Representation with IEEE 754 Format
The hexadecimal representation of the number is 0xC2690000.
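Going the other way, the three fields can be pulled out of a stored 32-bit pattern with shifts and masks. The sketch below uses Python's standard `struct` module to obtain the bit pattern; the helper name `single_fields` is an assumption made for illustration.

```python
import struct


def single_fields(value: float) -> tuple[int, int, int]:
    """Split the single-precision encoding of value into (sign, biased exponent, fraction)."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))  # reinterpret the 4 bytes as an integer
    sign = word >> 31                   # top bit
    biased_exponent = (word >> 23) & 0xFF   # next 8 bits
    fraction = word & 0x7FFFFF          # low 23 bits
    return sign, biased_exponent, fraction


sign, biased, fraction = single_fields(-58.25)
print(sign, format(biased, "08b"), format(fraction, "023b"))
# 1 10000100 11010010000000000000000
print(biased - 127)  # 5, the original exponent

# Rebuild the value from its fields: restore the leading one and remove the bias.
print((-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (biased - 127))  # -58.25
```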
Special Cases
The IEEE 754 floating-point format reserves special bit patterns for values such as zero, positive and negative infinity, and illegal results (NaN, “not a number”). The following table shows these special cases; an x in the sign column means the sign bit may be either 0 or 1.
Table 5-1. Special Cases of IEEE 754 Standard Format
Number | Sign | Exponent (8 bits) | Fraction (23 bits) |
---|---|---|---|
0 | x | 00000000 | 00000000000000000000000 |
∞ | 0 | 11111111 | 00000000000000000000000 |
-∞ | 1 | 11111111 | 00000000000000000000000 |
NaN | x | 11111111 | Non-zero |
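The table can be read as a small decision procedure on the exponent and fraction fields. The following sketch classifies a 32-bit pattern accordingly; the function name `classify_single` and the label strings are hypothetical, and every pattern the table does not list is simply reported as "other".

```python
import math
import struct


def classify_single(word: int) -> str:
    """Classify a 32-bit pattern using the special cases in the table."""
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 0 and fraction == 0:
        return "zero"                              # sign bit may be 0 or 1
    if exponent == 0xFF:
        if fraction == 0:
            return "-infinity" if sign else "+infinity"
        return "NaN"                               # any non-zero fraction
    return "other"                                 # patterns not listed in the table


def bits_of(value: float) -> int:
    """Return the single-precision bit pattern of value as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


print(classify_single(bits_of(0.0)))        # zero
print(classify_single(bits_of(math.inf)))   # +infinity
print(classify_single(bits_of(-math.inf)))  # -infinity
print(classify_single(bits_of(math.nan)))   # NaN
print(classify_single(bits_of(228.0)))      # other
```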
We have shown 32-bit floating-point numbers. When you declare a float variable in a programming language such as C or Java, the variable is typically stored in the format we have discussed so far. This format is called single-precision, or simply single. The IEEE 754 standard also defines 64-bit double-precision numbers (also called doubles), which provide greater precision and range.
The following table shows the number of bits used for the fields in each format.
Table 5-2. Single-Precision and Double-Precision Formats of IEEE 754 Standard
Format | Total bits | Sign bits | Exponent bits | Bias value | Fraction bits |
---|---|---|---|---|---|
Single-Precision | 32 | 1 | 8 | 127 | 23 |
Double-Precision | 64 | 1 | 11 | 1023 | 52 |
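The difference between 23 and 52 fraction bits is easy to observe. The sketch below assumes the usual situation in which a Python `float` is an IEEE 754 double; the helper `to_single` (a hypothetical name) rounds such a value to single precision by packing it into 4 bytes with the standard `struct` module and reading it back.

```python
import struct


def to_single(value: float) -> float:
    """Round a Python float (a double) to single precision and convert it back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


print(to_single(0.5) == 0.5)   # True: 0.5 needs only one fraction bit
print(to_single(0.1) == 0.1)   # False: 0.1 is cut off after 23 fraction bits
print(to_single(16777217.0))   # 16777216.0: 2**24 + 1 needs 24 fraction bits
```

With 1 implicit bit plus 23 stored bits, single precision cannot distinguish 2**24 + 1 from 2**24, whereas double precision represents both exactly.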
Recall that a number overflows when its magnitude is too large to be represented. Likewise, a number underflows when its magnitude is too small to be represented. Arithmetic results that cannot be represented exactly with the available precision must be rounded to a neighboring number. The rounding modes are: round down, round up, round toward zero, and round to nearest. The default rounding mode is round to nearest.
For example, let’s round the value $1.100101_2$ ($1.578125_{10}$) to only 3 fraction bits. In round down mode, the value rounds to 1.100. In round up mode, it rounds to 1.101. In round toward zero mode, it rounds to 1.100. In round to nearest mode, it rounds to 1.101, because $1.625_{10}$ ($1.101_2$) is closer to $1.578125_{10}$ ($1.100101_2$) than $1.5_{10}$ ($1.100_2$) is.
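The four rounding modes can be compared side by side with exact rational arithmetic. This is an illustrative sketch: the function name `round_binary` and the mode strings are assumptions, and Python's built-in `round` is used for round to nearest, which resolves exact ties toward the even neighbor (no tie occurs in this example).

```python
import math
from fractions import Fraction


def round_binary(value: Fraction, fraction_bits: int, mode: str) -> Fraction:
    """Round value to the given number of binary fraction bits."""
    scale = 2 ** fraction_bits
    scaled = value * scale                # shift the kept bits left of the binary point
    if mode == "down":
        kept = math.floor(scaled)         # toward negative infinity
    elif mode == "up":
        kept = math.ceil(scaled)          # toward positive infinity
    elif mode == "toward_zero":
        kept = math.trunc(scaled)         # drop the extra bits
    elif mode == "nearest":
        kept = round(scaled)              # closest neighbor
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return Fraction(kept, scale)


value = Fraction(0b1100101, 2 ** 6)       # 1.100101 in binary = 1.578125
for mode in ("down", "up", "toward_zero", "nearest"):
    print(mode, float(round_binary(value, 3, mode)))
# down 1.5
# up 1.625
# toward_zero 1.5
# nearest 1.625
```

Round down and round toward zero agree here because the value is positive; for a negative value they would differ.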