4-6  ERRORS IN FLOATING POINT COMPUTATIONS
******************************************

Excerpt from The Art of Computer Programming by Donald E. Knuth:
----------------------------------------------------------------
"Floating-point computation is by nature inexact, and it is not
difficult to misuse it so that the computed answers consist almost
entirely of 'noise'.

One of the principal problems of numerical analysis is to determine
how accurate the results of certain numerical methods will be; a
'credibility gap' problem is involved here: we don't know how much
of the computer's answers to believe.

Novice computer users solve this problem by implicitly trusting in
the computer as an infallible authority; they tend to believe all
digits of a printed answer are significant.

Disillusioned computer users have just the opposite approach, they
are constantly afraid their answers are almost meaningless."

Properties of Floating-point arithmetic
---------------------------------------
Real numbers have many 'good' properties; which of them are
retained by floating-point arithmetic?

Topological                        Validity
-----------                        -----------------------------
Connectivity                      no    All points are isolated
Completeness                      no    No converging sequences

Field axioms:
-------------
Closure under addition            no    We may have Overflow/Underflow
Associativity of addition         no    (a + b) + c  .NE.  a + (b + c)
Additive commutativity            yes   a + b        .EQ.  b + a
Unique zero element               yes   a + 0        .EQ.  a
Unique additive inverse           yes   a + (-a)     .EQ.  0

Closure under multiplication      no    We may have Overflow/Underflow
Associativity of multiplication   no    (a * b) * c  .NE.  a * (b * c)
Multiplicative commutativity      yes   a * b        .EQ.  b * a
Unique unit element               yes   a * 1        .EQ.  a
Unique multiplicative inverse     no    a * (1/a)  may round to .NE. 1

Distributivity                    no    a * (b + c)  .NE. (a * b) + (a * c)

Ordered field axioms:
---------------------
Completeness                      yes
Transitivity                      yes   (a .ge. b) .and. (b .ge. c)
                                        implies  (a .ge. c)
Density                           no
Translation invariance            yes
Scale invariance
Triangle inequality
Archimedes' axiom                 yes

We see that some of the basic properties of real numbers are missing.
Some properties that are usually derived from the basic axioms are
still true:

(x * y) .eq. 0.0   IS EQUIVALENT TO  (x .eq. 0.0) .or. (y .eq. 0.0)
                                     (provided x * y doesn't underflow)

(x .le. y) .and. (z .gt. 0.0)   IMPLIES   (x * z) .le. (y * z)

(x .eq. y)   IS EQUIVALENT TO   (x - y) .eq. 0.0  (If using denormals)
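
The last equivalence can be checked directly in Python, whose floats
are IEEE doubles with gradual underflow (denormals); a small sketch:

```python
# Two distinct subnormal doubles: without gradual underflow their
# difference would flush to zero and compare equal to 0.0.
x = 1.0e-320
y = 1.5e-320
difference = x - y        # a nonzero subnormal, so x .NE. y is preserved
```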

When you are doing floating-point arithmetic, bear in mind that you
are dealing with a discrete number system that "quantizes" every
value appearing in the calculations, mangling everything in its way.
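
The failed axioms are easy to observe; a minimal sketch in Python
(whose floats are IEEE doubles -- the exact failing values depend on
the format, but every binary format has such values):

```python
# Probing the field axioms with Python floats (IEEE 754 doubles).
a, b, c = 0.1, 0.2, 0.3

assoc_holds = ((a + b) + c) == (a + (b + c))   # False: grouping matters
comm_holds  = (a + b) == (b + a)               # True:  order doesn't

# Exact multiplicative inverses need not exist: for some n,
# n * (1/n) rounds to a value just below 1.
inverse_fails = [n for n in range(1, 1000)
                 if float(n) * (1.0 / float(n)) != 1.0]
```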

A side remark
-------------
Division is non-associative; simple counter-examples:

1 =  (8 / 4) / 2  .NE.  8 / (4 / 2)  =  4

4 =  (8 / 4) * 2  .NE.  8 / (4 * 2)  =  1

Be sure to supply parentheses to force the right order of evaluation!
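
The counter-examples above, verified in Python (all values here are
powers of two, so the only difference is the order of evaluation):

```python
# Division grouping: same operands, different parenthesization.
left_to_right = (8.0 / 4.0) / 2.0   # evaluates to 1.0
grouped_right = 8.0 / (4.0 / 2.0)   # evaluates to 4.0
```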

Sources of errors
-----------------
This is a schematic classification based on Dahlquist/Björck:

A) Errors in the input data - measurement errors, errors introduced
   by the conversion of decimal data to binary, roundoff errors.

B) Roundoff errors during computation

C) Truncation errors - using approximate calculations is inevitable
   when computing quantities involving limits and other infinite
   processes on a computer:

   1) Summing only a finite number of terms of an infinite series
   2) Discretization errors, approximating a derivative/integral
      by a difference quotient/finite sum
   3) Approximating functions by polynomials or linear functions

D) Simplifications in the mathematical model of the problem

E) Human mistakes and machine malfunctions

A bit of Error Analysis
-----------------------
We can visualize errors with intervals.  A number X that is known with
absolute error dX, can be represented by the interval (X - dX, X + dX),
with a positive dX.

Addition of intervals is given by:

(X-dX, X+dX) + (Y-dY, Y+dY) = (X+Y - (dX+dY), X+Y + (dX+dY))

Similarly subtraction of intervals is given by:

(X-dX, X+dX) - (Y-dY, Y+dY) = (X-Y - (dX+dY), X-Y + (dX+dY))

A good measure of accuracy is the ABSOLUTE RELATIVE ERROR (ARE).
If X is represented by the interval (X-dX, X+dX)  the absolute
relative error is:

               dX
ARE(X)  =  ---------
             abs(X)

The ARE is always positive since dX is so.

The absolute relative error of addition is:

                 dX + dY
ARE(X + Y)  =  ------------
                abs(X + Y)

which can be written as:

                  abs(X)       dX         abs(Y)       dY
ARE(X + Y)  =   ---------- * ------  +  ---------- * ------
                abs(X + Y)   abs(X)     abs(X + Y)   abs(Y)

                 abs(X) * are(X)  +  abs(Y) * are(Y)
ARE(X + Y)  =   -------------------------------------
                             abs(X + Y)

In a similar way:

                 abs(X) * are(X)  +  abs(Y) * are(Y)
are(X - Y)  =   -------------------------------------
                             abs(X - Y)

Note that these formulas don't take into account the errors introduced
by the addition/subtraction operations themselves; they are purely
error-analytic.
To get the total error you have to add a term whose magnitude will
be calculated in the next section.

err(X + Y) = are(X + Y) + round(X + Y)

Other error measures used in error analysis are the relative error,
ULP (Units in the last place), and the number of significant digits.
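
The interval formulas above are easy to put into code; a minimal
sketch in Python (double precision), with hypothetical helper names:

```python
def interval_add(x, dx, y, dy):
    """(X-dX, X+dX) + (Y-dY, Y+dY) = (X+Y - (dX+dY), X+Y + (dX+dY))"""
    return x + y, dx + dy

def interval_sub(x, dx, y, dy):
    """(X-dX, X+dX) - (Y-dY, Y+dY) = (X-Y - (dX+dY), X-Y + (dX+dY))"""
    return x - y, dx + dy

def are(x, dx):
    """Absolute relative error of the interval (X-dX, X+dX)."""
    return dx / abs(x)

# Adding two values of similar size keeps the ARE small...
s, ds = interval_add(10.0, 0.01, 5.0, 0.01)
# ...but subtracting nearly equal values inflates it (small denominator):
d, dd = interval_sub(10.0, 0.01, 9.9, 0.01)
```

Note that, as in the text, the machine rounding of the operation itself
is not modeled here; only the input error intervals propagate.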

Estimating floating-point roundoff errors
-----------------------------------------
The IEEE floating-point representation of a real number X is:

X  =  ((-1) ** S) *  1.ffff...fff  * (2 ** e)
                       |        |
                       +--------+
                     t binary digits

We have t binary digits instead of an infinite series; that is the
cause of the roundoff error. An upper bound for the roundoff error
is the maximum value the truncated tail could have:

Upper-Bound  <=  (2 ** (-t)) * (2 ** e)  =  2 ** (e - t)

The (2 ** (-t)) term is the maximal value the tail could have; it is
just the sum of an infinite geometric series composed of binary 1's.

Note that we found here an upper bound for the roundoff error made
in one rounding operation. Since almost every arithmetic operation
involves rounding, our upper bound will get propagated and multiplied.

The fractional part of the representation lies in the range [1,2), and
so abs(X) is of the same order of magnitude as (2 ** e).
We can denote this by  abs(X) ~ (2 ** e).

The maximum Absolute Relative Error is approximately:

              2 ** (e - t)
round(X)  =  --------------  ~  2 ** (-t)
                 abs(X)
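
For IEEE doubles t = 52, and the 2 ** (-t) bound is exactly what the
runtime reports as machine epsilon; a quick check in Python:

```python
import sys

t   = sys.float_info.mant_dig - 1   # 52 fraction bits (the leading 1 is hidden)
eps = 2.0 ** (-t)                   # the 2 ** (-t) bound from the text

# eps is the gap between 1.0 and the next larger double: anything at or
# below half of it is rounded away when added to 1.0.
still_one = (1.0 + eps / 2.0) == 1.0
bigger    = (1.0 + eps) > 1.0
```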

Relative error in addition and subtraction
------------------------------------------
On machines that don't use guard digits, e.g. old Crays (which also have
problems with division), the relative error introduced by the operation
of addition (even if the operands have no errors) can be as large as 1.0,
i.e. a relative error of 100% !

Machines with one guard digit will have at most a small relative error
of  2 * (2 ** (-t))  due to the addition operation.  See David Goldberg's
article for a proof of these results.

Even on "good" machines, when you add numbers carrying rounding errors
from previous calculations (an unavoidable situation), things may get
very bad.  This is due to an error-analytic fact that has nothing to do
with the machine arithmetic itself.  Recall the formula:

                 abs(X) * are(X)  +  abs(Y) * are(Y)
are(X + Y)  =   -------------------------------------
                             abs(X + Y)

To simplify the argument, suppose  are(X)  is equal to  are(Y),
and denote the common value by  "are(X|Y)":

                abs(X) + abs(Y)
are(X + Y)  =   ---------------  *  are(X|Y)
                   abs(X + Y)

Note that the error associated with the addition operation should also
include the rounding error of machine-adding X and Y.

If X and Y have the same sign, we get "are(X|Y)" again.

However, if X and Y have opposite signs and similar magnitudes the value
of the expression can be VERY LARGE, since the denominator is small.
It can't get arbitrarily large, since X and Y belong to a discrete number
system and can't have arbitrarily close (unequal) absolute values,
but that is small consolation.

The amplification of previous roundoff errors by subtracting nearly
equal numbers is sometimes called "catastrophic cancellation".
If no previous roundoff errors are present (an unlikely situation)
we have "benign cancellation".

The strange terminology was probably invented when floating-point
arithmetic was less understood.

+-------------------------------------------------------------------+
|  Addition of two floating-point numbers with similar magnitudes   |
|  and opposite signs may cause a LARGE PRECISION LOSS.             |
|                                                                   |
|  Subtraction of two floating-point numbers with similar           |
|  magnitudes and equal signs may cause a LARGE PRECISION LOSS.     |
+-------------------------------------------------------------------+
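
A one-line double-precision illustration in Python: mathematically
(1 + 1e-15) - 1 equals 1e-15, but the rounding error committed when
storing 1 + 1e-15 is promoted to the leading digits by the subtraction:

```python
# Catastrophic cancellation: the subtraction itself is exact, it only
# exposes the rounding error already present in 1.0 + 1e-15.
computed = (1.0 + 1e-15) - 1.0             # 1.1102230246251565e-15
rel_err  = abs(computed - 1e-15) / 1e-15   # about 0.11: an 11% error
```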

A good example - computing a derivative
---------------------------------------
A good example is provided by the numerical computation of a derivative,
which involves the subtraction of two very close numbers.

The subtraction generates a relative error which increases as the step
size (h) decreases. On the other hand, the truncation error of the formula
decreases with decreasing h. These two opposing "effects" produce a minimum
for the total error, located well above the precision threshold.

      PROGRAM NUMDRV
      INTEGER           I
      REAL              X, H, DYDX, TRUE
C     ------------------------------------------------------------------
      X = 0.1
      H = 10.0
      TRUE = COS(X)
C     ------------------------------------------------------------------
      DO I = 1, 15
         DYDX = (SIN(X + H) - SIN(X - H)) / (2.0 * H)
         WRITE (*,'(1X,I3,E12.2,F17.7,F17.5)')
     &                I, H, DYDX, 100.0 * (DYDX - TRUE) / TRUE
         H = H / 10.0
      ENDDO
C     ------------------------------------------------------------------
      WRITE (*,*) ' *** ANALYTICAL DERIVATIVE IS: ', TRUE
C     ------------------------------------------------------------------
      END

Following are the results on a classic computer (VAX). The second
column is the step size, the third is the computed value, and the
fourth is the relative error:

  1    0.10E+02       -0.0541303       -105.44022
  2    0.10E+01        0.8372672        -15.85290
  3    0.10E+00        0.9933466         -0.16659
  4    0.10E-01        0.9949874         -0.00168
  5    0.10E-02        0.9950065          0.00023
  6    0.10E-03        0.9949879         -0.00164
  7    0.10E-04        0.9950252          0.00211
  8    0.10E-05        0.9946526         -0.03533
  9    0.10E-06        0.9313227         -6.40012
 10    0.10E-07        0.7450581        -25.12010
 11    0.10E-08        0.0000000       -100.00000
 12    0.10E-09        0.0000000       -100.00000
 13    0.10E-10        0.0000000       -100.00000
 14    0.10E-11        0.0000000       -100.00000
 15    0.10E-12        0.0000000       -100.00000
*** Analytical derivative is:   0.9950042

The results can be divided into 4 ranges:

 1 -  2    The step 'h' is just too large, truncation error dominates
 3 -  8    Useful region, optimal value of 'h' is in the middle
 9 - 10    Onset of severe numerical problems near precision threshold
11 - 15    Computation is completely trashed
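
The same experiment in double precision (a Python sketch of NUMDRV)
shows the same four ranges, only shifted, since the precision
threshold is now 2 ** (-52):

```python
import math

def central_diff(f, x, h):
    """Symmetric difference quotient: truncation error is O(h**2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

true_value = math.cos(0.1)
err_huge = abs(central_diff(math.sin, 0.1, 10.0)  - true_value)  # range 1
err_good = abs(central_diff(math.sin, 0.1, 1e-5)  - true_value)  # useful region
err_tiny = abs(central_diff(math.sin, 0.1, 1e-12) - true_value)  # cancellation
```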

Different magnitudes problem
----------------------------
A small program will illustrate another effect of roundoff errors:

C     ------------------------------------------------------------------
      INTEGER           I
      REAL              X
C     ------------------------------------------------------------------
      X = 1.0
      DO I = 1, 20
         WRITE (*,*) ' 1.0 + ', X, ' = ', 1.0 + X
         X = X / 10.0
      ENDDO
C     ------------------------------------------------------------------
      END

We see that addition (or subtraction) of two floating-point numbers with
very different magnitudes can be very inaccurate: the smaller number may
effectively be treated as zero.

Whenever the difference between the binary exponents is larger than
the number of bits in the mantissa, the smaller number is shifted
completely out of the mantissa when the exponents are aligned, and
contributes nothing to the result.

For example, when summing the terms of a sequence of numbers that begins
with some large terms and then goes down we may get 'effective truncation'.
The addition of the small terms to the large partial sum may leave it
unchanged:

      PROGRAM TRNCSR
      INTEGER           I
      REAL              SUM, AN
C     ------------------------------------------------------------------
      AN(I) = 1000.0 * EXP(-REAL(I))
C     ------------------------------------------------------------------
      SUM = 0.0
C     ------------------------------------------------------------------
      DO I = 1, 21
         SUM = SUM + AN(I)
         WRITE (*,*) I, AN(I), SUM
      ENDDO
C     ------------------------------------------------------------------
      END

The output on a classic machine (VAX):

  1   367.8795       367.8795
  2   135.3353       503.2148
  3   49.78707       553.0018
  4   18.31564       571.3174
  5   6.737947       578.0554
  6   2.478752       580.5342
  7   0.9118820      581.4460
  8   0.3354626      581.7815
  9   0.1234098      581.9049
 10   4.5399930E-02  581.9503
 11   1.6701700E-02  581.9670
 12   6.1442126E-03  581.9732
 13   2.2603294E-03  581.9755
 14   8.3152874E-04  581.9763
 15   3.0590233E-04  581.9766
 16   1.1253518E-04  581.9767
 17   4.1399377E-05  581.9768
 18   1.5229979E-05  581.9768
 19   5.6027966E-06  581.9768
 20   2.0611537E-06  581.9768
 21   7.5825608E-07  581.9768

We can see that the series is effectively truncated at fairly
large terms, about two orders of magnitude above machine precision.
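
The same cutoff is easy to see in Python doubles, where the mantissa
holds 53 bits: once the partial sum is 2 ** 53 times larger than a
term, adding the term changes nothing.

```python
# Above 2**53 the spacing between neighbouring doubles is 2.0,
# so an added 1.0 is rounded completely away.
big  = 1.0e16
lost = (big + 1.0) == big    # the addition leaves the sum unchanged
```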

Another example is the following "identity":

           1 - (1 - e)
    1  =  -------------  #  0       (e is a small enough number)
           (1 - 1) + e

Where the '=' denotes the exact value, and # denotes the result
of a floating-point computation.

A more subtle problem occurs when the smaller number doesn't get
"nullified" but just gets rounded off with a large error.

+-------------------------------------------------------------+
|     On adding (subtracting) several floating-point          |
|     numbers, group them according to relative size,         |
|     so that as many operations as possible are              |
|     performed between numbers of similar magnitude.         |
+-------------------------------------------------------------+
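
A Python (double precision) illustration of the rule: summing the same
terms small-to-large preserves the small contributions, while summing
large-first destroys them:

```python
terms = [1.0e16] + [1.0] * 1000     # one huge term, a thousand small ones

large_first = 0.0
for t in terms:                     # each 1.0 meets 1.0e16 and is lost
    large_first += t

small_first = 0.0
for t in reversed(terms):           # the 1.0s accumulate to 1000.0 first
    small_first += t
```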

Underflow problem
-----------------
By default, many computers replace an underflowed result with zero;
it seems very plausible to replace a number that is smaller than your
minimal representable number with zero.

However, if we replace the number X with 0.0, we create a
relative error of:

                    X - 0.0
Relative Error  =  ---------  =  1.0   (A 100% error!)
                       X

The following small program computes two MATHEMATICALLY IDENTICAL
expressions, the results are illuminating.

Replace the constant 'TINY' by  0.3E-038  on old DEC machines,
and by  0.1E-044  on new IEEE workstations.

      PROGRAM UNDRFL
C     ------------------------------------------------------------------
      REAL
     *                  A, B, C, D, X,
     *                  LEFT, RIGHT
C     ------------------------------------------------------------------
      A     = 0.5
      B     = TINY
      C     = 1.0
      D     = TINY
      X     = TINY
C     ------------------------------------------------------------------
      LEFT  = (A * X + B) / (C * X + D)
      RIGHT = (A + (B / X)) / (C + (D / X))
      WRITE(*,*) ' X = ', X, ' LEFT  = ', LEFT, ' RIGHT = ', RIGHT
C     ------------------------------------------------------------------
      END

A difference of 50% is a little unexpected, isn't it?
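
The run can be reproduced with Python doubles by taking TINY to be the
smallest subnormal (this assumes IEEE gradual underflow): A * X
underflows to 0.0 inside LEFT, while RIGHT is computed exactly.

```python
tiny = 5e-324                       # smallest positive subnormal double
a, b, c, d, x = 0.5, tiny, 1.0, tiny, tiny

left  = (a * x + b) / (c * x + d)   # a*x underflows to 0.0
right = (a + b / x) / (c + d / x)   # b/x and d/x are exactly 1.0
```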

If the FPU underflows 'gracefully' (using 'denormalized' numbers),
the errors will develop and become large only many orders of magnitude
below the 'least usable number'.

+------------------------------------------+
|  Enable underflow checking if possible,  |
|  if not check manually for underflow.    |
+------------------------------------------+

Error accumulation
------------------
Roundoff errors are more or less random; the following example program
is a crude model that gives some insight into the accumulation of random
quantities:

      PROGRAM CUMERR
C     ------------------------------------------------------------------
      INTEGER
     *          i,
     *          seed
C     ------------------------------------------------------------------
      REAL
     *          x,
     *          sum
C     ------------------------------------------------------------------
      seed = 123456789
      sum  = 0.0
C     ------------------------------------------------------------------
      DO i = 1, 1000
         x = 2.0 * (RAN(seed) - 0.5)
         WRITE (*,*) x
         sum = sum + x
         IF (MOD(i,10) .EQ. 0) WRITE (*,*) ' ', sum
      END DO
C     ------------------------------------------------------------------
      END

RAN() is a random number generator (a VAX intrinsic); a random sequence
uniformly distributed in [-1, 1] is generated and summed, and the sum
is displayed every 10 iterations.

Examining the program's output, you will see that the partial sums
are not zero; they oscillate irregularly. Most of the partial sums
are contained in the interval [-15, 15] (this may depend on the random
generator and the seed value), because long 'unbalanced'
sub-sequences are relatively rare.

The conclusion is that random errors tend to cancel on a large scale,
and accumulate on a small scale.
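
A Python version of the experiment (random.uniform stands in for the
VAX RAN generator; the exact walk depends on the seed):

```python
import random

# Sum 1000 uniform values from [-1, 1], recording every 10th partial sum.
random.seed(123456789)
total = 0.0
partial_sums = []
for i in range(1, 1001):
    total += random.uniform(-1.0, 1.0)
    if i % 10 == 0:
        partial_sums.append(total)
```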

However, it is easy to give an example in which the random errors
DO accumulate:

      PROGRAM ACCRND
C     ------------------------------------------------------------------
      INTEGER
     *          I
C     ------------------------------------------------------------------
      REAL
     *          X, TMP
C     ------------------------------------------------------------------
      X = 0.1
      WRITE (*,*) 'DIRECT COMPUTATION: ', 1000000.0 * X
C     ------------------------------------------------------------------
      TMP = 0.0
      DO I = 1, 1000000
         TMP = TMP + X
      ENDDO
      WRITE (*,*) 'LOOP COMPUTATION: ', TMP
C     ------------------------------------------------------------------
      END

The error accumulates to about 1%, and could accumulate further if
the number of loop iterations were made larger.
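
The same effect appears with Python doubles, since 0.1 is not exactly
representable in binary either; the loop drift is small in double
precision, but far above a single rounding error:

```python
direct = 1000000.0 * 0.1        # one rounding step

looped = 0.0
for _ in range(1000000):        # a million rounding steps
    looped += 0.1

drift = abs(looped - direct)    # nonzero: the errors accumulate
```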

A more realistic example is:

      PROGRAM NUMINT
C     ------------------------------------------------------------------
      INTEGER
     *          i, j
C     ------------------------------------------------------------------
      REAL
     *          interval,
     *          sum
C     ------------------------------------------------------------------
      DO i = 10, 2000, 20
         sum = 0.0
         interval = 3.1415926 / REAL(i)
         DO j = 1, i
            sum = sum + interval * SIN(j * interval)
         END DO
         WRITE (*,*) i, sum
      END DO
C     ------------------------------------------------------------------
      END

Here we do a crude numerical integration of SIN() in the range [0, Pi];
the result should be 2.0 = (COS(0) - COS(Pi)), and it converges quite
closely to that value.

However, when the number of intervals is further increased (higher i),
the result develops a slight oscillation due to roundoff errors.
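
A double-precision Python sketch of NUMINT; with 1000 panels the
Riemann sum is already well within 1e-3 of 2.0:

```python
import math

def integrate_sin(n):
    """Riemann sum of sin over [0, pi] with n panels, as in NUMINT."""
    h = math.pi / n
    s = 0.0
    for j in range(1, n + 1):
        s += h * math.sin(j * h)
    return s

approx = integrate_sin(1000)    # should approach COS(0) - COS(Pi) = 2.0
```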
