4-9  MINIMIZING FLOATING-POINT ERRORS
*************************************

A few remarks
-------------

o  Some practical methods for checking the severity of floating-point
errors can be found in the chapter: 'practical issues'.

o  The chapter on 'FORTRAN pitfalls' discusses various programming
practises that may amplify floating-point (and other) errors,
and it is very important to avoid them.

o  Note that an interval/stochastic arithmetic package is not just a
diagnostic tool for FP errors, a result without an error estimation
is not very useful, as errors can never be eliminated completely
in experimental data and computation.

Carefully written programs
--------------------------
This term was probably coined by Sterbenz (see bibliography below),
and means programs that are numerically correct.  this is not easy
to achieve as the following example will show.

Sterbenz discusses the implementation of a Fortran FUNCTION that
returns the average of two REAL numbers, the specifications for the
routine are:

o  The sign must be always correct.
o  The result should be as close as possible to (x+y)/2
and stay within a predefined bound.

o  min(x,y)  <=  average(x,y)  <=  max(x,y)
o  average(x,y)  =  average(y,x)
o  average(x,y)  =  0  if and only if  x = -y  unless an
underflow occurred.
o  average(-x,-y)  =  -average(x,y)

o  An overflow should never occur.
o  An underflow should never occur, unless the mathematical
average is strictly less than the smallest representable
real number.

Even a simple task like this, requires considerable knowledge to
program in a good way, there are 4 (at least) possible average
formulas:

1)   (x + y) / 2
2)   x/2 + y/2
3)   x + ((y - x) / 2)
4)   y + ((x - y) / 2)

Sterbenz have a very interesting discussion on choosing the most
appropriate formulas, he also consider techniques like scaling up
the input variables if they are small.  Grossly oversimplifying,
we have:

Formula #1   may raise an overflow if x,y have the same sign
#2   may degrade accuracy, but is safe from overflows
#3,4 may raise an overflow if x,y have opposite signs

We will use formulas #1,3,4 according to the signs of the input
numbers:

real function average (x, y)
real	x, y, zero, two, av1, av2, av3, av4
logical	samesign
parameter	(zero = 0.0e+00, two = 2.0e+00)
av1(x,y) = (x + y) / two
av2(x,y) = (x / two) + (y / two)
av3(x,y) = x + ((y - x) / two)
av4(x,y) = y + ((x - y) / two)

if (x .ge. zero) then
if (y .ge. zero) then
samesign = .true.
else
samesign = .false.
endif
else
if (y .ge. zero) then
samesign = .false.
else
samesign = .true.
endif
endif

if (samesign) then
if (y .ge. x) then
average = av3(x,y)
else
average = av4(x,y)
endif
else
average = av1(x,y)
endif

return
end

Programming using exception handling
------------------------------------
Computing the average of two numbers may serve as an example for
the system-dependent technique of writing faster numerical code
using exception handling.

Most of the time the formula  (x + y)/2  is quite adequate,
the FUNCTION above is needed essentially to avoid overflow,
so the following scheme may be used:

call reset_overflow_flag
result = (x + y) / 2.0
call check_overflow_flag(status)
if (status .eq. .true.) result = average(x,y)

In this way the "expansive" call to the average routine may be
eliminated in most cases.  Of course, two system-dependent calls
were added for reseting and checking the overflow flag, but this
may be still worth it if the ratio between the two algorithms is
large enough (which is not the case here).

------------------------------
This is the most simple solution on machines that supports this
data type.  REAL*16 takes more CPU time than REAL*8/REAL*4,
but introduces very small roundoff errors, and has a huge range.

Performance cost of different size floats is VERY machine dependent
(see the performance chapter).

A crude example program:

program reals
real*4        x4,  y4,  z4
real*8        x8,  y8,  z8
real*16       x16, y16, z16

x4 = 1.0e+00
y4 = 0.9999999e+00
z4 = x4 - y4
write(*,*) sqrt(z4)

x8 = 1.0d+00
y8 = 0.9999999d+00
z8 = x8 - y8
write(*,*) sqrt(z8)

x16 = 1.0q+00
y16 = 0.9999999q+00
z16 = x16 - y16
write(*,*) sqrt(z16)

end

Normalization of equations
--------------------------
Floating-point arithmetic is best when dealing with numbers with
magnitudes of the order of 1.0, the 'representation density' is not
maximal but we are in the 'middle' of the range.

Usually you can decrease the range of numbers appearing in the
computation, by transforming the system of units, so that you get
dimensionless equations.

The diffusion equation will serve as an example, I apologize for
the horrible notation:

Ut = K * Uxx

Where the solution is U(X,T), the lowercase letters denote the
partial derivatives, and K is a constant.

Let:
L  be a typical length in the problem
U0 a typical value of U

Substitute:

X' = X / L     U' = U / U0

Then:

Ux  = Ux' / L
Uxx = Ux'x' / (L*L)

Substitute in the original equation:

(U' * U0)t = (K / (L*L)) * (U' * U0)x'x'

((L * L) / K) U't = U'x'x'

Substitute:

T' = (K * T) / (L * L)

And you get:

U't' = U'x'x'

With:   X' = X / L     U' = U / U0     T' = (K * T) / (L * L)

Multi-precision arithmetic
--------------------------
That is a really bright idea, you can simulate floating-point numbers
with very large sizes, using character strings (or other data types),
and create routines for doing arithmetic on these giant numbers.
Of course such software simulated arithmetic will be slow.

By the way, the function overloading feature of Fortran 90, makes using
multi-precision arithmetic packages with existing programs easy.

Two free packages are "mpfun" and "bmp" (Brent's multiple precision),
which are available from Netlib.

Using special tricks
--------------------
A good example are the following tricks for summing a series.
The first is sorting the numbers and adding them in ascending order.

An example program:

program rndof
integer       i
real          sum

sum = 0.0
do i = 1, 10000000, 1
sum = sum + 1.0 / real(i)
end do
write (*,*) 'Decreasing order: ', sum

sum = 0.0
do i = 10000000, 1, -1
sum = sum + 1.0 / real(i)
end do
write (*,*) 'Increasing order: ', sum

end

There is no need here for sorting, as the series is monotonic.
Executing 2 * 10**7 iterations will take some CPU seconds, but the
result is very illuminating.

Another way (though not as good as doubling the precision) is using
the Kahan Summation Formula.

Suppose the series is stored in an array X(1:N)

SUM = X(1)
C = 0.0
DO J = 2, N
Y = X(J) - C
T = SUM + Y
C = (T - SUM) - Y
SUM = T
ENDDO

Yet another method is using Knuth's formula.  The recommended method

Another example is using the standard formulae for solving the
quadratic equation (real numbers are written without mantissa

a*(x**2) + b*x + c = 0     (a .ne. 0)

When  b**2  is much larger than  abs(4*a*c), the discriminat is
nearly equal to  abs(b), and we may get "catastrophic cancellation".
Multiplying and dividing by the same number we get alternative
formulae:

-b + (b**2 - 4*a*c)**0.5              -2 * c
x1  =  ------------------------  =  -----------------------
2*a                  b + (b**2 - 4*a*c)**0.5

-b - (b**2 - 4*a*c)**0.5               2 * c
x2  =  ------------------------  =  ------------------------
2*a                  -b + (b**2 - 4*a*c)**0.5

If "b" is much larger than "a*c", use one of the standard and one
of the alternative formulae.  The first alternative formula is
suitable when "b" is positive, the other when it's negative.

--------------------------------
See the chapter: "The world of integers".

Manual safeguarding
-------------------
You can check manually every dangerous arithmetic operation,
special routines may be constructed to perform arithmetical
operations in a safer way, or get an error message if this
cannot be done.

Hardware support
----------------
IEEE conforming FPUs can raise an exception whenever a roundoff
was performed on an arithmetical result.

You can write an exception handler that will report the exceptions,
but as the result of most operations may have to be rounded, your
program will be slowed down, and you will get huge log files.

Rational arithmetic
-------------------
Every number can be represented (possibly with an error) as a
quotient of two integers, the dividend and divisor can be kept
along the computation, without actually performing the division.
See Knuth for technical details.

It seems this method is not used.

Bibliography
------------

An excellent article on floating-point arithmetic:

David Goldberg
What Every Computer Scientist Should Know about
Floating-Point arithmetic
ACM Computing Surveys
Vol. 23 #1  March 1991, pp. 5-48

An old but still useful book:

Sterbenz, Pat H.
Floating-Point Computation
Prentice-Hall, 1974
ISBN 0-13-322495-3

An old classic presented in a mathematical rigorous
way (oouch!):

Donald E. Knuth
The Art of Computer Programming
Volume II, sections 4.2.1 - 4.2.3

The Silicon Graphics implementation of the IEEE standard,
republished later in another issue of Pipeline:

How a Floating Point Number is represented on an IRIS-4D
Pipeline
July/August 1990

The homepage of Prof. William Kahan, the well-known expert
on floating-point arithmetic:

http://http.cs.berkeley.edu/~wkahan/

A short nice summary on floating-point arithmetic:

CS267: Supplementary Notes on Floating Point