1-10 CODE OPTIMIZATION - COMPILER 
 *********************************
 (Thanks to Craig Burley for the excellent comments.
  Thanks to Timothy Prince for the note on architectures with 
  Instruction Level Parallelism)


 Optimization techniques used by compilers may inspire good and 
 efficient programming practices and are interesting in their 
 own right. 

 Before trying to perform 'hand optimization' please note the 
 following points:

    1) Remember that the compiler performs such optimizations anyway, 
       so the benefit of doing them manually may be small or even negative!

    2) Profile First!! (A chapter on profiling will be added)
       Programmers who are learning this arcane art should certainly 
       play around with all the techniques on "make-believe" code, 
       but should NOT waste their time (and, especially, risk 
       introducing bugs) by optimizing any _real_ code until after 
       they've gotten _elegant, correct_ code working and profiled 
       to discover what areas need optimizing (and have test suites 
       to use to validate new optimized versions).


 Performing operations at compile-time (if possible)
 ---------------------------------------------------
 Computations on constants, type conversions of constants, and the
 computation of addresses of array elements with constant indexes can
 all be performed by the compiler at compile time.
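
 For example, in the following small sketch (the variable names are 
 made up) the right-hand side of the first assignment is a constant 
 expression the compiler can evaluate once, and the address of Y(3) 
 is a fixed offset that needs no run-time computation:

       REAL X, Y(10)

       X = 2.0 * 3.14159 / 360.0    ! may be folded into one constant
       Y(3) = X                     ! address of Y(3) known at compile time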


 Value propagation
 -----------------
 Tracing the VALUE of variables through the program: if a variable is
 known to hold a specific constant at some point, references to it can
 be replaced by that constant, which in turn enables further
 compile-time computation.
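
 A small sketch of the idea (a made-up fragment):

       INTEGER I, N
       REAL A(100)

       N = 100
       DO I = 1, N        ! N is known to hold 100 here, so the compiler
         A(I) = 0.0       ! may replace the reference by the constant 100
       ENDDO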


 Inlining small functions
 ------------------------
 Repeatedly inserting the function code instead of calling it saves the
 calling overhead and enables further optimizations. Inlining large
 functions, however, may make the executable too large.
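
 For example (a made-up function), a small function like:

       REAL FUNCTION SQUARE(X)
       REAL X
       SQUARE = X * X
       END

 called inside a loop:

       DO I = 1, 100
         A(I) = SQUARE(B(I))
       ENDDO

 may be expanded in place by the compiler, so the loop effectively
 becomes:

       DO I = 1, 100
         A(I) = B(I) * B(I)      ! no call overhead
       ENDDO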


 Code hoisting
 -------------
 Moving as many computations as possible out of loops saves computing 
 time. In the following example (2.0 * PI) is a loop-invariant
 expression, and there is no reason to recompute it 100 times.

      DO I = 1, 100
        ARRAY(I) = 2.0 * PI * I
      ENDDO

 By introducing a temporary variable 't', the loop can be transformed to:

      t = 2.0 * PI
      DO I = 1, 100
        ARRAY(I) = t * I
      ENDDO


 Dead store elimination
 ----------------------
 If the compiler detects that a value stored into a variable is never
 used, it may safely eliminate the operations that compute and store
 that value.
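
 For example, the first assignment below is a 'dead store': its value
 is overwritten before it is ever used, so it may be removed (a small
 sketch with made-up variables):

       X = A + B          ! dead store - X is overwritten before use
       X = C * D
       WRITE (*,*) X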

 Such operations can't be eliminated if (non-intrinsic) function calls 
 are involved; those functions have to be called because of their 
 possible side effects.  Remember that before Fortran 95, Fortran didn't 
 have the concept of a "pure" function.

 Programs used as performance tests, which perform no 'real' 
 computations, should be written so that they are not 'completely 
 optimized out'; writing the 'results' to the screen or to a file may 
 be enough to fool the compiler.
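
 A minimal sketch of the idea:

       S = 0.0
       DO I = 1, N
         S = S + A(I) * B(I)
       ENDDO
       WRITE (*,*) 'Checksum: ', S   ! the result is 'used', so the loop
                                     ! can't be optimized away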


 Strength reduction
 ------------------
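 Replacing a 'strong' (costly) operation by an equivalent cheaper one.
 A classic example is replacing a multiplication by the loop index with
 an addition that is updated on every iteration (an illustrative
 sketch):

       DO I = 1, 100
         A(I) = I * 5
       ENDDO

 may be transformed by the compiler into:

       K = 0
       DO I = 1, 100
         K = K + 5          ! multiplication replaced by addition
         A(I) = K
       ENDDO

 Replacing an integer exponentiation like X**2 by the multiplication
 X*X is another common case.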


 Taking advantage of the machine architecture
 --------------------------------------------
 Only a simple example is given here; the subject is clearly too 
 machine dependent and too highly technical for more than that: 

 Register operations are much faster than memory operations, so all 
 compilers try to keep heavily used data, such as temporary variables 
 and array indexes, in registers. 

 To facilitate such 'register scheduling', the largest sub-expressions 
 may be computed before the smaller ones.
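
 An illustrative sketch (a made-up expression):

       X = A + (B*C + D*E)     ! computing the larger sub-expression
                               ! (B*C + D*E) first may reduce the number
                               ! of registers needed for intermediate
                               ! results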


 Eliminating common sub-expressions
 ----------------------------------
 This is an old optimization trick that compilers are able to 
 perform quite well:

        X = A * LOG(Y) + (LOG(Y) ** 2)

 We introduce an explicit temporary variable t:

        t = LOG(Y)
        X = A * t + (t ** 2)

 Eliminating the common sub-expression LOG(Y) saved one 'heavy' 
 function call; now we can also save the exponentiation by writing: 

        X = (A + t) * t

 which is much better. 

 The compiler may do all of this automatically, so don't waste too 
 much energy on such transformations. 


 A classic example - computing the value of a polynomial
 -------------------------------------------------------
 Eliminating Common Subexpressions may inspire good algorithms like 
 the classic 'Horner's rule' for computing the value of a polynomial.

        y = A + B*x + C*(x**2) + D*(x**3)       (canonical form)

 It is more efficient (i.e. executes faster) to perform the two
 exponentiations as multiplications (reusing x**2 when computing x**3);
 in this way we get 3 additions and 5 multiplications in all.
 
 The following forms are even more efficient to compute: they require 
 fewer operations, and the operations that are saved are the 'heavy' 
 ones (a multiplication takes considerably more CPU time than an 
 addition).

 Stage #1:

        y = A + (B + C*x + D*(x**2))*x

 Stage #2 and last:

        y = A + (B + (C + D*x)*x)*x

 The last form requires 3 additions and only 3 multiplications!

 The algorithm hinted at here can be implemented with one loop to 
 compute a polynomial of arbitrary order. It may also be numerically 
 better than direct computation of the canonical form.
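
 A minimal sketch of such a loop (the coefficient ordering, from the
 constant term up, is just an assumed convention):

       REAL FUNCTION POLY (COEFF, N, X)
       INTEGER N, I
       REAL COEFF(0:N), X

       POLY = COEFF(N)
       DO I = N - 1, 0, -1
         POLY = POLY * X + COEFF(I)
       ENDDO
       END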

 Note:
   On architectures with Instruction Level Parallelism the fastest way 
   is:  A + B*x + x**2*(C + D*x).  Adding parentheses, as in 
   A + (B*x + x**2*(C + D*x)), will improve accuracy in the case where 
   A is the largest term.
 
   +-------------------------------------------------+
   |  CHECK THE CODE FROM A NUMERICAL POINT OF VIEW  |
   +-------------------------------------------------+

