Excerpt from The Art of Computer Programming by Donald E. Knuth:
    "Floating-point computation is by nature inexact, and it is not 
  difficult to misuse it so that the computed answers consist almost 
  entirely of 'noise'. 

  One of the principal problems of numerical analysis is to determine 
  how accurate the results of certain numerical methods will be; a 
  'credibility gap' problem is involved here: we don't know how much 
  of the computer's answers to believe. 

  Novice computer users solve this problem by implicitly trusting in 
  the computer as an infallible authority; they tend to believe all 
  digits of a printed answer are significant. 

  Disillusioned computer users have just the opposite approach, they 
  are constantly afraid their answers are almost meaningless."

 Properties of Floating-point arithmetic
 Real numbers have many 'good' properties. Which of them are 
 retained by floating-point arithmetic?

   Topological                        Validity
   -----------                        -----------------------------
    Connectivity                      no    All points are isolated
    Completeness                      no    No converging sequences

   Field axioms:
    Closure under addition            no    We may have Overflow/Underflow
    Associativity of addition         no    (a + b) + c  .NE.  a + (b + c)
    Additive commutativity            yes   a + b        .EQ.  b + a
    Unique zero element               yes   a + 0        .EQ.  a
    Unique additive inverse           yes   a + (-a)     .EQ.  0

    Closure under multiplication      no    We may have Overflow/Underflow
    Associativity of multiplication   no    (a * b) * c  .NE.  a * (b * c)
    Multiplicative commutativity      yes   a * b        .EQ.  b * a
    Unique unit element               yes   a * 1        .EQ.  a
    Unique multiplicative inverse     no    a * (1/a)    .NE.  1 

    Distributivity                    no    a * (b + c)  .NE. (a * b) + (a * c)

   Ordered field axioms:
    Completeness                      yes
    Transitivity                      yes   (a .ge. b) .and. (b .ge. c)
                                                    implies  (a .ge. c)
    Density                           no
    Translation invariance            yes
    Scale invariance                   
    Triangle inequality                
    Archimedes' axiom                 yes

 We see that some of the basic properties of real numbers are missing. 
 Some properties that are usually derived from the basic axioms are 
 still true: 

   (x * y) .eq. 0.0   IS EQUIVALENT TO  (x .eq. 0.0) .or. (y .eq. 0.0)
                                        (If no underflow occurs)

   (x .le. y) .and. (z .gt. 0.0)   IMPLIES   (x * z) .le. (y * z)

   (x .eq. y)   IS EQUIVALENT TO   (x - y) .eq. 0.0  (If using denormals)

 When you are doing floating-point arithmetic, bear in mind that you 
 are dealing with a discrete number system, which "quantizes" every 
 value that appears in the calculations, mangling everything in its way.
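
 The failure of associativity is easy to observe. The following is a 
 minimal sketch added for illustration; the variable names and values 
 are arbitrary choices, and IEEE single precision is assumed:

```fortran
C     ------------------------------------------------------------------
      REAL              A, B, C, LEFT, RIGHT
C     ------------------------------------------------------------------
      A = 1.0E20
      B = -1.0E20
      C = 1.0
C     ------------------------------------------------------------------
C     A + B is exactly zero, so C survives in LEFT.  In RIGHT, C is
C     far smaller than the spacing of floats near B and is lost
      LEFT  = (A + B) + C
      RIGHT = A + (B + C)
      WRITE (*,*) ' (A + B) + C = ', LEFT
      WRITE (*,*) ' A + (B + C) = ', RIGHT
C     ------------------------------------------------------------------
      END
```

 Under these assumptions the first result is 1.0 and the second is 0.0,
 although the two expressions are identical for real numbers.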

 A side remark
 Division is non-associative; simple counter-examples are:

   1 =  (8 / 4) / 2  .NE.  8 / (4 / 2)  =  4

   4 =  (8 / 4) * 2  .NE.  8 / (4 * 2)  =  1

 Be sure to supply parentheses to force the right order of evaluation!

 Sources of errors
 This is a schematic classification based on Dahlquist/Björck:

  A) Errors in the input data - measurement errors, errors introduced
     by the conversion of decimal data to binary, roundoff errors. 

  B) Roundoff errors during computation

  C) Truncation errors - using approximate calculation is inevitable
     when computing quantities involving limits and other infinite
     processes on a computer:

       1) Summing only a finite number of terms in an infinite series
       2) Discretization errors, approximating a derivative/integral 
          by a difference quotient/finite sum
       3) Approximating functions by polynomials or linear functions

  D) Simplifications in the mathematical model of the problem

  E) Human mistakes and machine malfunctions

 A bit of Error Analysis 
 We can visualize errors with intervals.  A number X that is known with 
 absolute error dX, can be represented by the interval (X - dX, X + dX),
 with a positive dX.

 Addition of intervals is given by: 

    (X-dX, X+dX) + (Y-dY, Y+dY) = (X+Y - (dX+dY), X+Y + (dX+dY))

 Similarly subtraction of intervals is given by: 

    (X-dX, X+dX) - (Y-dY, Y+dY) = (X-Y - (dX+dY), X-Y + (dX+dY))

 A good measure of accuracy is the ABSOLUTE RELATIVE ERROR (ARE). 
 If X is represented by the interval (X-dX, X+dX)  the absolute 
 relative error is:  

                dX
   ARE(X)  =  ------
              abs(X)

 The ARE is always positive since dX is so.

 The absolute relative error of addition is:   

                   dX + dY
   ARE(X + Y)  =  ----------
                  abs(X + Y)

 which can be written as:   

                     abs(X)       dX         abs(Y)       dY
   ARE(X + Y)  =   ---------- * ------  +  ---------- * ------
                   abs(X + Y)   abs(X)     abs(X + Y)   abs(Y)

                 abs(X) * are(X)  +  abs(Y) * are (Y)
    ARE(X + Y) = ------------------------------------
                             abs(X + Y)

 In a similar way:

                 abs(X) * are(X)  +  abs(Y) * are(Y)
    are(X - Y) = ----------------------------------
                             abs(X - Y)

 Note that these formulas don't take into account errors introduced 
 by the addition/subtraction operations themselves; they are purely 
 error-analytic. To get the total error you have to add a term whose 
 magnitude will be calculated in the next section.

    err(X + Y) = are(X + Y) + round(X + Y)

 Other error measures used in error analysis are the relative error, 
 ULP (Units in the last place), and the number of significant digits. 

 Estimating floating-point roundoff errors
 The IEEE floating-point representation of a real number X is:

    X  =  ((-1) ** S) *  1.ffff...fff  * (2 ** e)
                           |        |
                         t binary digits

 We have t binary digits instead of an infinite series; that is the 
 cause of the roundoff error. An upper bound for the roundoff error 
 is the maximum value the truncated tail could have:

    abs(Roundoff error)  <=  (2 ** (-t)) * (2 ** e)  =  2 ** (e - t)

 The (2 ** (-t)) term is the maximal value the tail could have; it is
 just the sum of an infinite geometric series composed of binary 1's.

 Note that we found here an upper bound for the roundoff error made
 in one rounding operation. Since almost every arithmetic operation
 involves rounding, our upper bound will get propagated and multiplied.

 The significand (1.ffff...fff) lies in the range [1,2), and so 
 abs(X) is of the same order of magnitude as (2 ** e). 
 We can denote this by  abs(X) ~ (2 ** e).

 The maximum Absolute Relative Error is approximately:

                2 ** (e - t)
   round(X)  =  ------------  ~  2 ** (-t)
                   abs(X)
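
 The value of 2 ** (-t) for a given machine can be obtained from the 
 Fortran 90 inquiry intrinsics DIGITS and EPSILON. The following is a 
 minimal sketch, assuming a compiler that supports these intrinsics; 
 the variable names are chosen only for illustration:

```fortran
C     ------------------------------------------------------------------
      INTEGER           T
      REAL              X
C     ------------------------------------------------------------------
      X = 1.0
C     DIGITS counts the implicit leading 1, so it returns t + 1
      T = DIGITS(X) - 1
C     ------------------------------------------------------------------
      WRITE (*,*) ' t          = ', T
      WRITE (*,*) ' 2 ** (-t)  = ', 2.0 ** (-T)
      WRITE (*,*) ' EPSILON(X) = ', EPSILON(X)
C     ------------------------------------------------------------------
      END
```

 On IEEE single precision t is 23, and the last two values are both 
 2 ** (-23), about 1.19E-07.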

 Relative error in addition and subtraction
 On machines which don't use guard digits, e.g. Crays (which also have 
 problems with division), the relative error introduced by the operation 
 of addition (even if the operands have no errors) can be as large as 1.0, 
 i.e. a relative error of 100%! 

 Machines with one guard digit will have at most a small relative error 
 of  2 * (2 ** (-t))  due to the addition operation.  See the article
 by David Goldberg for a proof of these results.

 Even on "good" machines, when you add numbers with rounding errors from 
 previous calculations (an unavoidable situation), it may get very bad.
 This is due to an error-analytic fact that has nothing to do with the 
 way machines perform addition:

                 abs(X) * are(X)  +  abs(Y) * are (Y)
    are(X + Y) = ------------------------------------
                             abs(X + Y)

 To simplify the argument, suppose  are(X)  is equal to  are(Y), 
 and denote the common value by  "are(X|Y)": 

                   abs(X) + abs(Y)
    are(X + Y) =   --------------- * are(X|Y)
                     abs(X + Y) 

 Note that the error associated with the addition operation should
 include also the rounding error associated with machine-adding X and Y.

 If X and Y have the same sign, we get "are(X|Y)" again.

 However, if X and Y have opposite signs and similar magnitudes the value 
 of the expression can be VERY LARGE, since the denominator is small.
 It can't get arbitrarily large, since X and Y belong to a discrete number 
 system and can't have arbitrarily close absolute values (if not equal), 
 but that is small consolation.

 The amplification of previous roundoff errors by subtracting nearly
 equal numbers is sometimes called "catastrophic cancellation".  
 If no previous roundoff errors are present (an unlikely situation) 
 we have "benign cancellation".  

 The strange terminology was probably invented when floating-point 
 arithmetic was less understood.

  |  Addition of two floating-point numbers with similar magnitudes   |
  |  and opposite signs, may cause a LARGE PRECISION LOSS.            |
  |                                                                   |
  |  Subtraction of two floating-point numbers with similar           |
  |  magnitudes and equal signs, may cause a LARGE PRECISION LOSS.    |
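
 The size of the amplification can be estimated directly from the 
 formula for are(X + Y). The following is a minimal sketch with 
 arbitrary illustrative values; the name AMP is not from the text above:

```fortran
C     ------------------------------------------------------------------
      REAL              X, Y, AMP
C     ------------------------------------------------------------------
      X =  1.000
      Y = -0.999
C     ------------------------------------------------------------------
C     Factor that multiplies the common relative error are(X|Y)
      AMP = (ABS(X) + ABS(Y)) / ABS(X + Y)
      WRITE (*,*) ' Amplification factor: ', AMP
C     ------------------------------------------------------------------
      END
```

 For these values the factor is about 2000, so operands carrying a 
 relative error of 1.0E-7 yield a sum with a relative error of roughly 
 2.0E-4.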

 A good example - computing a derivative
 A good example is provided by the numerical computation of a derivative,
 which involves the subtraction of two very close numbers. 

 The subtraction generates a relative error which increases as the step 
 size (h) decreases. On the other hand, the truncation error of the formula 
 decreases with decreasing h. These two opposing "effects" produce a minimum
 for the total error, located high above the precision threshold.

      INTEGER           I
      REAL              X, H, DYDX, TRUE
C     ------------------------------------------------------------------
      X = 0.1
      H = 10.0
      TRUE = COS(X)
C     ------------------------------------------------------------------
C     Central difference with step H, shrunk tenfold on each pass
      DO I = 1, 15
        DYDX = (SIN(X + H) - SIN(X - H)) / (2.0 * H)
        WRITE (*,'(1X,I3,E12.2,F17.7,F17.5)') 
     &                I, H, DYDX, 100.0 * (DYDX - TRUE) / TRUE
        H = H / 10.0
      END DO
C     ------------------------------------------------------------------
      WRITE (*,*) ' *** Analytical derivative is: ', TRUE
C     ------------------------------------------------------------------
      END

 Following are the results on a classic computer (VAX). The second 
 column is the step size, the third is the computed value, and the 
 fourth is the relative error in percent:

   1    0.10E+02       -0.0541303       -105.44022
   2    0.10E+01        0.8372672        -15.85290
   3    0.10E+00        0.9933466         -0.16659
   4    0.10E-01        0.9949874         -0.00168
   5    0.10E-02        0.9950065          0.00023
   6    0.10E-03        0.9949879         -0.00164
   7    0.10E-04        0.9950252          0.00211
   8    0.10E-05        0.9946526         -0.03533
   9    0.10E-06        0.9313227         -6.40012
  10    0.10E-07        0.7450581        -25.12010
  11    0.10E-08        0.0000000       -100.00000
  12    0.10E-09        0.0000000       -100.00000
  13    0.10E-10        0.0000000       -100.00000
  14    0.10E-11        0.0000000       -100.00000
  15    0.10E-12        0.0000000       -100.00000
  *** Analytical derivative is:   0.9950042    

 The results can be divided into 4 ranges:

   1 -  2    The step 'h' is just too large, truncation error dominates
   3 -  8    Useful region, optimal value of 'h' is in the middle
   9 - 10    Onset of severe numerical problems near precision threshold
  11 - 15    Computation is completely trashed
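
 The location of the minimum can be estimated by balancing the two 
 effects. The following is a rough sketch, assuming the truncation error 
 of the central difference is about (h ** 2 / 6) * abs(f''') and the 
 roundoff error is about EPS * abs(f) / h; minimizing their sum gives 
 h ** 3 = 3 * EPS * abs(f) / abs(f'''). The names EPS and HOPT are 
 chosen only for illustration, and EPSILON is a Fortran 90 intrinsic:

```fortran
C     ------------------------------------------------------------------
      REAL              X, EPS, HOPT
C     ------------------------------------------------------------------
      X   = 0.1
      EPS = EPSILON(X)
C     ------------------------------------------------------------------
C     For f = SIN the third derivative is -COS
      HOPT = (3.0 * EPS * ABS(SIN(X)) / ABS(COS(X))) ** (1.0 / 3.0)
      WRITE (*,*) ' Estimated optimal step: ', HOPT
C     ------------------------------------------------------------------
      END
```

 In single precision the estimate is of the order of 1.0E-3, which is 
 consistent with the smallest error in the table above. It is only an 
 order-of-magnitude guide.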

 Different magnitudes problem
 A small program will illustrate another effect of roundoff errors:

C     ------------------------------------------------------------------
      INTEGER           I
      REAL              X
C     ------------------------------------------------------------------
      X = 1.0
      DO I=1,20
        WRITE(*,*) ' 1.0 + ', X, ' = ', 1.0 + X
        X = X / 10.0
      END DO
C     ------------------------------------------------------------------
      END

 We see that addition (subtraction) of two floating-point numbers with 
 different magnitudes can be very inaccurate; the smaller number can be 
 effectively treated as zero. 

 Whenever the difference between the binary exponents is larger than 
 the number of bits in the mantissa, the smaller number will get shifted 
 out completely upon addition (subtraction).

 For example, when summing the terms of a sequence of numbers that begins
 with some large terms and then goes down we may get 'effective truncation'.
 The addition of the small terms to the large partial sum may leave it 
 unchanged:

      INTEGER           I
      REAL              SUM, AN
C     ------------------------------------------------------------------
C     Statement function giving the I-th term of the sequence
      AN(I) = 1000.0 * EXP(-REAL(I)) 
C     ------------------------------------------------------------------
      SUM = 0.0
C     ------------------------------------------------------------------
      DO I = 1, 21
        SUM = SUM + AN(I)
        WRITE (*,*) I, AN(I), SUM
      END DO
C     ------------------------------------------------------------------
      END

 The output on a classic machine (VAX):

           1   367.8795       367.8795    
           2   135.3353       503.2148    
           3   49.78707       553.0018    
           4   18.31564       571.3174    
           5   6.737947       578.0554    
           6   2.478752       580.5342    
           7  0.9118820       581.4460    
           8  0.3354626       581.7815    
           9  0.1234098       581.9049    
          10  4.5399930E-02   581.9503    
          11  1.6701700E-02   581.9670    
          12  6.1442126E-03   581.9732    
          13  2.2603294E-03   581.9755    
          14  8.3152874E-04   581.9763    
          15  3.0590233E-04   581.9766    
          16  1.1253518E-04   581.9767    
          17  4.1399377E-05   581.9768    
          18  1.5229979E-05   581.9768    
          19  5.6027966E-06   581.9768    
          20  2.0611537E-06   581.9768    
          21  7.5825608E-07   581.9768    

 We can see that the series is effectively truncated at fairly
 large terms, about two orders of magnitude above machine precision.

 Another example is the following "identity": 

              1 - (1 - e)
        1  =  ------------  #  0       (e is a small enough number)
              (1 - 1) + e

 Where the '=' denotes the exact value, and # denotes the result 
 of a floating-point computation.

 A more subtle problem occurs when the smaller number doesn't get 
 "nullified" but just gets rounded off with a large error. 

  |     When adding (subtracting) several floating-point        |
  |     numbers, group them according to relative size,         |
  |     so that as many operations as possible are              |
  |     performed between numbers with similar magnitudes.      |
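
 For a sequence of decreasing positive terms, the simplest way to follow 
 this advice is to add the smallest terms first. The following is a 
 minimal sketch that reuses the sequence from the summation example 
 above; the names UP and DOWN are chosen only for illustration:

```fortran
C     ------------------------------------------------------------------
      INTEGER           I
      REAL              UP, DOWN, AN
C     ------------------------------------------------------------------
      AN(I) = 1000.0 * EXP(-REAL(I))
C     ------------------------------------------------------------------
      DOWN = 0.0
      UP   = 0.0
C     ------------------------------------------------------------------
      DO I = 1, 21
C       Largest term first: late terms meet a large partial sum
        DOWN = DOWN + AN(I)
C       Smallest term first: partial sum grows along with the terms
        UP   = UP + AN(22 - I)
      END DO
C     ------------------------------------------------------------------
      WRITE (*,*) ' Largest first:  ', DOWN
      WRITE (*,*) ' Smallest first: ', UP
C     ------------------------------------------------------------------
      END
```

 The two sums may differ in their last digits; the one accumulated from 
 the small end is generally the more trustworthy.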

 Underflow problem
 Many computers replace by default an underflowed result with zero.
 It seems very plausible to replace a number that is smaller than your
 minimal number with zero. 

 However, if we replace the number X with 0.0, we create a
 relative error of:

                       X - 0.0
    Relative Error  =  -------  = 1.0   (A 100% error!)
                          X

 The following small program computes two MATHEMATICALLY IDENTICAL 
 expressions; the results are illuminating.

 Replace the constant 'TINY' by  0.3E-038  on old DEC machines,
 and by  0.1E-044  on new IEEE workstations.

C     ------------------------------------------------------------------
      REAL
     *                  A, B, C, D, X,
     *                  LEFT, RIGHT
C     ------------------------------------------------------------------
C     TINY is a placeholder: substitute the value for your machine
      A     = 0.5
      B     = TINY
      C     = 1.0
      D     = TINY
      X     = TINY
C     ------------------------------------------------------------------
      LEFT  = (A * X + B) / (C * X + D)
      RIGHT = (A + (B / X)) / (C + (D / X))
      WRITE(*,*) ' X = ', X, ' LEFT  = ', LEFT, ' RIGHT = ', RIGHT
C     ------------------------------------------------------------------
      END

 A difference of 50% is a little unexpected, isn't it?

 If the FPU underflows 'gracefully' (using 'denormalized' numbers), 
 the errors will develop and become large many orders of magnitude 
 below the 'least usable number'.

  |  Enable underflow checking if possible,  |
  |  if not check manually for underflow.    |
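
 A manual check can be as simple as comparing the magnitude of a result 
 with the smallest normalized number. The following is a minimal sketch, 
 assuming a compiler that provides the Fortran 90 intrinsic function 
 TINY() (unrelated to the placeholder constant used in the program 
 above); the variable names and values are chosen only for illustration:

```fortran
C     ------------------------------------------------------------------
      REAL              X, Y, P
C     ------------------------------------------------------------------
      X = 1.0E-20
      Y = 1.0E-20
      P = X * Y
C     ------------------------------------------------------------------
C     A zero or subnormal product of nonzero factors has underflowed
      IF ((X .NE. 0.0) .AND. (Y .NE. 0.0) .AND.
     &    (ABS(P) .LT. TINY(P))) THEN
        WRITE (*,*) ' Underflow in X * Y, result = ', P
      ELSE
        WRITE (*,*) ' X * Y = ', P
      END IF
C     ------------------------------------------------------------------
      END
```

 In IEEE single precision the exact product 1.0E-40 lies below the 
 smallest normalized number, so the check reports an underflow whether 
 the machine flushes the result to zero or keeps a denormalized value.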

 Error accumulation
 Roundoff errors are more or less random. The following example program 
 is a crude model that gives some insight into the accumulation of random 
 errors:

C     ------------------------------------------------------------------
      INTEGER
     *          i,
     *          seed
C     ------------------------------------------------------------------
      REAL
     *          x,
     *          sum
C     ------------------------------------------------------------------
      seed = 123456789
      sum  = 0.0
C     ------------------------------------------------------------------
      DO i = 1, 1000
        x = 2.0 * (RAN(seed) - 0.5)
        write (*,*) x
        sum = sum + x
        IF (MOD(i,10) .EQ. 0) WRITE (*,*) ' ', sum
      END DO
C     ------------------------------------------------------------------
      END

 RAN() is a random number generator (a non-standard extension). A 
 sequence of random numbers in the range [-1, 1) is generated and 
 summed, and the sum is displayed every 10 iterations.

 Examining the program's output, you will see that the partial sums 
 are not zero; they oscillate irregularly. Most of the partial sums 
 are contained in the interval [-15, 15] (may depend on the random 
 generator and the seed value), because long 'unbalanced' 
 sub-sequences are relatively rare.

 The conclusion is that random errors tend to cancel on a large scale,
 and accumulate on small scale. 

 However it is easy to give an example in which the random errors 
 DO accumulate: 

C     ------------------------------------------------------------------
      INTEGER
     *          I
C     ------------------------------------------------------------------
      REAL
     *          X, TMP
C     ------------------------------------------------------------------
      X = 0.1
      WRITE (*,*) 'DIRECT COMPUTATION: ', 1000000.0 * X
C     ------------------------------------------------------------------
      TMP = 0.0
      DO I = 1, 1000000
        TMP = TMP + X
      END DO
      WRITE (*,*) 'LOOP COMPUTATION:   ', TMP
C     ------------------------------------------------------------------
      END

 The error accumulates to about 1%, and could accumulate further if
 the number of loop iterations were made larger.
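
 One common remedy is to keep the running sum in higher precision. The 
 following is a minimal sketch of that idea, not a general cure; it 
 assumes DOUBLE PRECISION carries noticeably more digits than REAL, 
 and it reuses the names of the program above:

```fortran
C     ------------------------------------------------------------------
      INTEGER           I
      REAL              X
      DOUBLE PRECISION  TMP
C     ------------------------------------------------------------------
      X   = 0.1
      TMP = 0.0D0
C     ------------------------------------------------------------------
      DO I = 1, 1000000
C       X is converted to DOUBLE PRECISION before each addition
        TMP = TMP + X
      END DO
      WRITE (*,*) 'DOUBLE ACCUMULATOR: ', TMP
C     ------------------------------------------------------------------
      END
```

 The remaining error then comes almost entirely from the single 
 precision representation of 0.1, not from the million additions.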

 A more realistic example is:

C     ------------------------------------------------------------------
      INTEGER
     *          i, j
C     ------------------------------------------------------------------
      REAL
     *          interval,
     *          sum
C     ------------------------------------------------------------------
      DO i = 10, 2000, 20
        sum = 0.0
        interval = 3.1415926 / REAL(i)
        DO j = 1, i
          sum = sum + interval * SIN(j * interval) 
        END DO
        WRITE (*,*) i, sum
      END DO
C     ------------------------------------------------------------------
      END

 Here we do a crude numerical integration of SIN() in the range [0, Pi],
 the result should be 2.0 = (COS(0) - COS(Pi)) and is converging quite 
 closely to that value. 

 However when the number of intervals used is further increased (higher i), 
 the result develops a slight oscillation due to roundoff errors. 
