Finite Precision Arithmetic

Compare the two pieces of code given below. In infinite precision arithmetic they are logically equivalent.

In [1]:
a = 1.0
while a > 0.0:
    a = a/2.0
print a
In [2]:
a = 1.0
while a + 1.0 > 1.0:
    a = a/2.0
print a

Isn't the above number $1.1102230246316 \times 10^{-16}$ nearly 0 ? Not really. The smallest representable number in IEEE Double Precision format is approximately $4.95 \times 10^{-324}$ which is many orders of magnitude smaller than $1.1102230246316 \times 10^{-16}$. The difference between the output of the two pieces of seemingly equivalent codes can be understood by understanding gaps between floating point numbers in finite precision arithmetic.

In [1]:
# Smallest IEEE DP number
print '%g' % 2.**-1074

Anything smaller underflows (is zero). The above number is sometimes called UFL (for underflow).

In [2]:
print '%g' % 2.**-1075

IEEE single and double precision numbers

A single precision number is stored in 4 bytes (32 bits) which are used as follows: (1 sign bit) (8 bit exponent) (23 bit fraction). A double precision float is stored in 8 bytes (64 bits) which are used as follows: (1 sign bit $s$) (11 bit exponent $E\,$) and (52 bit fraction $f\,$). The smallest and largest $E$ values ($E=0$ and $E=2047$) are reserved and for the other $E$ values the exponent is interpreted as $E-1023$. Thus the minimum exponent for doubles is 1-1023 = -1022 and the maximum is 2046 - 1023 = 1023. If a number is $\ge 2^{-1022}$ then with some exceptions (for inf, NaN etc.) the number is $(-1)^s 1.\,f \times 2^{E-1023}$. For numbers smaller than $2^{-1022}$ with some exceptions the number is $0.\,f \times 2^{-1022}$.

Some special floating point numbers

We have already seen the smallest representable double precision floating point number. The reason it is $2^{-1074}$ should now be clear. It is because the smallest subnormal mantissa is $2^{-52}$ and the smallest exponent is $-1022$. Similarly the largest representable double is approximately $1.798\times 10^{308}$:

In [24]:
(1. + (1 - 2.**-52)) * 2**1023

Anything larger overflows (yields inf).

In [25]:
(1. + (1 - 2.**-53)) * 2**1023

The most important number to know is the machine epsilon. We will define it as the gap between 1 and the next representable double. (Some people call that number divided by 2 the machine epsilon). This gap is $2^{-52}$ which is approximately $2.22045 \times 10^{-16}$. We will denote this number by $\epsilon_M$.

In [1]:

If a number smaller than $\epsilon_M\, /\, 2$ is added to 1 then the result is 1. This explains the different outputs of the two pieces of code shown above. The gap at different points on the floating point range depends o n the exponent. This gap changes at every power of 2. Thus, for example, the gap at $2^{1000}$ is $2^{1000}$ times $\epsilon_M$.

Truncation error is due to approximations obtained by truncating infinite series expansions or terminating an iterative sequence before it has converged.

Rounding error is caused by the finite representation of and arithmetic operations on real numbers.

Forward error is the difference between true and computed values.

Backward error is the change in input that would produce the observed output in an exact computation.

Errors can be measured relative to the true values (relative error) or as an absolute quantity (absolute error). Even if the computation and representation is exact the solution to a problem may be sensitive to perturbations in input data. This is the concept of conditioning of a problem and it is a property of the problem itself. On the other hand the concept of stability is a property of the algorithm. One situation to be careful in is the subtraction of nearly equal numbers, which results in loss of significant digits. This is seen in the exercise below.


  1. Run the second piece of code above using pdb and step through the loop and make sure you understand the behavior.
  2. Write a function to compute an approximate value for the derivative of a function using the finite difference formula $$ f'(x) \approx \frac{f(x+h) - f(x)}{h}\, . $$ Test your function by computing the derivative of $\sin(x)$ at $x=1$. Compute the absolute value of the error by using the math.cos function to compare with. Use a value of $h=0.1$.
  3. On a log-log plot plot the error versus $h$ for $h = 10^{-k}$ for $k=0, \ldots, 16$.

Arbitrary precision arithmetic

The package mpmath provides the capabilities for computing at any desired precision. Higher the precision, slower the computation will be most likely since the standard single and double precision artithmetic is implemented in hardware while arbitrary precision arithmetic is done in software. But sometimes the tradeoff between precision and speed is unavoidable.

In [2]:
import mpmath as mpm
In [3]:
In [4]:
In [5]: = 100
In [6]:
print mpm.pi
In [7]:
In [8]: