1-10 CODE OPTIMIZATION - COMPILER ********************************* (Thanks to Craig Burley for the excellent comments. Thanks to Timothy Prince for the note on architectures with Instruction Level Parallelism) Optimization techniques used by compilers may inspire good and efficient programming practices and are interesting in their own right. Before trying to perform 'hand optimization' please note the following points: 1) Remember that the compiler perform such optimizations anyway, so the benefit of doing them manually may be small or negative!. 2) Profile First!! (A chapter on profiling will be added) Programmers who are learning this arcane art should certainly play around with all the techniques on "make-believe" code, but should NOT waste their time (and, especially, risk introducing bugs) by optimizing any _real_ code until after they've gotten _elegant, correct_ code working and profiled to discover what areas need optimizing (and have test suites to use to validate new optimized versions). Performing operations at compile-time (if possible) --------------------------------------------------- Computations and type conversions on constants, computing addresses of array elements with constant indexes, can be performed already by the compiler. Value propagation ----------------- Tracing the VALUE of Inlining small functions ------------------------ Repeatedly inserting the function code instead of calling it, saves the calling overhead and enable further optimizations. Inlining large functions will make the executable too large. Code hoisting ------------- Moving as much as possible computations outside loops, saves computing time. In the following example (2.0 * PI) is an invariant expression that there is no reason to recompute it 100 times. DO I = 1, 100 ARRAY(I) = 2.0 * PI * I ENDDO Introducing a temporary variable 't' it can be transformed to: t = 2.0 * PI DO I = 1, 100 ARRAY(I) = t * I ENDDO Dead store elimination ---------------------- If the compiler detects variables that are never used, it may safely ignore many of the operations that compute their values. Such operations can't be ignored if there are (non-intrinsic) function calls involved, those functions have to be called, because of their possible side effects. Remember that before Fortran 95, Fortran didn't have the concept of "pure" function. Programs used as performance tests, and perform no 'real' computations, should be written to avoid being 'completely optimized out', writing the 'results' to screen/file may be enough to fool the compiler. Strength reduction ----------------- Taking advantage of the machine architecture -------------------------------------------- A simple example, the subject is clearly too machine dependant and highly technical for more than that: Register operations are much faster than memory operations, so all compilers try to put in registers data that is supposed to be heavily used, like temporary variables and array indexes. To facilitate such 'register scheduling' the largest sub-expressions may be computed before the smaller ones. Eliminating common sub-expressions ---------------------------------- This is an old optimization trick that compilers are able to perform quite well: X = A * LOG(Y) + (LOG(Y) ** 2) We introduce an explicit temporary variable t: t = LOG(Y) X = A * t + (t ** 2) We saved one 'heavy' function call, by an elimination of the common sub-expression LOG(Y), now we will save the exponentiation by: X = (A + t) * t which is much better. The compiler may do all of this automatically, so don't waste too much energy on such transformations. A classic example - computing the value of a polynomial ------------------------------------------------------- Eliminating Common Subexpressions may inspire good algorithms like the classic 'Horner's rule' for computing the value of a polynomial. y = A + B*x + C*(x**2) + D*(x**3) (canonical form) It is more efficient (i.e. executes faster) to perform the two exponentiations by converting them to multiplications, in this way we will get 3 additions and 5 multiplications in all. The following forms are more efficient to compute, they require less operations, and the operations that are saved are the 'heavy' ones (multiplication is an operation that takes a lot of CPU time, much more than addition). Stage #1: y = A + (B + C*x + D*(x**2))*x Stage #2 and last: y => A + (B + (C + D*x)*x)*x The last form requires 3 additions and only 3 multiplications! The algorithm hinted here, can be implemented with one loop to compute an arbitrary order polynomial. It may also be better numerically than direct computation of the canonical form. Note: On architectures with Instruction Level Parallelism the fastest way is: A + B*x +x**2*(C + D*X). Adding parentheses A + (B*X + x**2()) will improve accuracy in the case where A is the largest term. +-------------------------------------------------+ | CHECK THE CODE FROM A NUMERICAL POINT OF VIEW | +-------------------------------------------------+Return to contents page