Code optimization on IBM RS/6000

by Ole H. Nielsen

of UNI-C, the Danish Computer Center for Research and Education; also at Center for Atomic-scale Materials Physics (CAMP) at the Technical University of Denmark.

Abstract:

This presentation discusses how to obtain maximum performance on RS/6000 workstations by choosing appropriate compiler options for the IBM XL compilers, and ultimately by replacing simple operations with library calls. The benefit of traversing memory with stride 1, as opposed to "jumps" in memory (large strides), is clearly seen in the performance numbers benchmarked on a number of fast IBM RS/6000 CPUs.
Several ways of measuring the performance of your code are discussed, including examples.


Code example: matrix times vector

The following Fortran-77 example is taken as a prototypical code:
c       Matrix-vector product result = matrix * vector

        parameter (N = 1000)
        real*8 matrix(N,N), vector(N), result(N)

c       LOOP #1 (wrong way through memory)
        do i = 1, N
           result(i) = 0.0d0
           do j = 1, N
              result(i) = result(i) + matrix(i,j) * vector(j)
           end do
        end do

c       LOOP #2 (stride 1 through memory)
        do i = 1, N
           result(i) = 0.0d0
        end do
        do j = 1, N
           do i = 1, N
              result(i) = result(i) + matrix(i,j) * vector(j)
           end do
        end do
(The code with timing calls is available separately.) Without any optimization, compiling with just the command:
xlf file.f
the following performance results were obtained on various RS/6000 models running AIX 4.1.4, with the most current (at the time of writing) XL Fortran compiler, XLF version 3.2.4.4:
                              LOOP #1    LOOP #2    Peak-speed
POWER 62.5 MHz (zeise)              2          4           125
Thin node 66 MHz (SP2)              3          7           264
Thin node-2, 66 MHz (rydberg)       3          7           264
Wide node 66 MHz (Jensen)           4          7           264
Wide node 77 MHz (Mom)              4          8           308
Numbers are in MFLOPS (Million FLoating-point Operations per Second), names in () are CAMP's machine names. The loops were iterated 100 times for better timing. Peak speed is the clock-frequency times the maximum possible number of floating-point operations per CPU clock cycle. The POWER CPUs can execute at most 1 multiply-add instruction (equal to 2 FLOPs) per cycle, whereas POWER2 CPUs can execute at most 2 multiply-adds.
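The MFLOPS figures follow from counting two floating-point operations (one multiply, one add) per inner-loop iteration and dividing by the elapsed time. The following is only a minimal sketch of such a measurement for LOOP #2, not the original timing code: it assumes the Fortran 90 intrinsic system_clock is available, and the program name and the initial values of the operands are chosen purely for illustration.

```fortran
c       Timing sketch: MFLOPS for LOOP #2, repeated NITER times.
c       system_clock is a Fortran 90 intrinsic (an assumption here;
c       the original timing code may use a different timer).
        program mflops
        integer N, NITER
        parameter (N = 1000, NITER = 100)
        real*8 matrix(N,N), vector(N), result(N)
        real*8 seconds, flops
        integer i, j, iter, t0, t1, rate

c       Fill the operands with simple, known values
        do j = 1, N
           vector(j) = 1.0d0
           do i = 1, N
              matrix(i,j) = 1.0d0
           end do
        end do

        call system_clock(t0, rate)
        do iter = 1, NITER
           do i = 1, N
              result(i) = 0.0d0
           end do
           do j = 1, N
              do i = 1, N
                 result(i) = result(i) + matrix(i,j) * vector(j)
              end do
           end do
        end do
        call system_clock(t1)

c       One multiply and one add per inner iteration: 2*N*N per pass
        seconds = dble(t1 - t0) / dble(rate)
        flops = 2.0d0 * dble(N) * dble(N) * dble(NITER)
        if (seconds .gt. 0.0d0) then
           print *, 'MFLOPS =', flops / (seconds * 1.0d6)
        end if
c       Printing a result element keeps the loops from being removed
        print *, 'result(1) =', result(1)
        end
```

With these operand values every element of result should equal N. Note that system_clock measures elapsed (wall-clock) time, so the measurement is best made on an otherwise idle machine.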


Optimization with IBM XL compilers

IBM's Fortran compiler for AIX is the XL Fortran compiler. IBM also publishes a document on five simple ways to boost performance, which is worth checking. You can read about the XL Fortran-77/90 compiler in the on-line manual by issuing the command:
info -l xlf
The version of your AIX system's Fortran compiler can be shown by this command:
lslpp -l xlfcmp xlfrte
In order to take advantage of the compiler optimizations in this paper, your Fortran compiler should be XLF version 3.2.4, or later.

IBM's C/C++ compiler for AIX is the C Set ++ compiler. Most compiler flags are common to both the Fortran and C/C++ compilers, so the techniques discussed in this paper apply also to codes written in C/C++. The C/C++ compiler's on-line manual is:

info -l cset

Basic optimization: -O

The -O2 option provides an intermediate level of optimization that avoids any techniques that could alter the semantics of valid Fortran programs. The -O and -O2 flags provide the same level of optimization.

Performance results on RS/6000 with -O optimization:

 
                              LOOP #1      LOOP #2
POWER 62.5 MHz (zeise)              2           31
Thin node 66 MHz (SP2)              4           33
Thin node-2, 66 MHz (rydberg)       3           33
Wide node 66 MHz (Jensen)           4           33
Wide node 77 MHz (Mom)              5           38
Numbers are in MFLOPS, names in () are machine names. The loops were iterated 100 times for better timing.

Higher level of optimization: -O3 -qstrict

This option increases the range of optimizations the compiler performs, but it can also increase compilation time and memory use by the compiler. Experience suggests that -O3 usually provides improvements in the range of 0-7% over -O2, although for particular types of programs, improvements are sometimes much more dramatic. In a few cases -O3 can decrease performance. If you use -O3, you may want to compare your program's performance at -O3 with its performance when compiled with -O2. In certain programs -O3 can change the behavior or results of your program unless you also specify the -qstrict option, which we recommend for novice users.

Performance results on RS/6000 with -O3 -qstrict optimization:

 
                              LOOP #1      LOOP #2
POWER 62.5 MHz (zeise)              2           31
Thin node 66 MHz (SP2)              4           41
Thin node-2, 66 MHz (rydberg)       3           54
Wide node 66 MHz (Jensen)           4           65
Wide node 77 MHz (Mom)              5           78
Numbers are in MFLOPS, names in () are machine names. The loops were iterated 100 times for better timing.

High-Order Transformations optimization: -O3 -qhot

The -qhot option provides even more optimizations than -O3. Its tuning efforts concentrate on iteration-reordering transformations. These transformations may produce results that are not bitwise identical to those produced only at -O2 or -O3. The -qhot option can occasionally reduce performance if it does not have enough information about the size of loop bounds and array dimensions, and you may want to use timing techniques to determine whether -qhot improves your program's performance.
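To give a feeling for what an iteration-reordering transformation looks like, the sketch below rewrites LOOP #2 by hand with the j loop unrolled by 4, so that each result(i) is loaded and stored once per four columns instead of once per column. This is an illustration only: whether -qhot applies this particular transformation to this loop is an assumption, and the program name, the test data, and the check array are hypothetical.

```fortran
c       Hand-written sketch of an iteration-reordering transformation:
c       LOOP #2 with the j loop unrolled by 4 ("unroll and jam").
c       Each result(i) is loaded and stored once per 4 columns.
        program unroll
        integer N
        parameter (N = 1000)
        real*8 matrix(N,N), vector(N), result(N), check(N)
        real*8 diff
        integer i, j

        do j = 1, N
           vector(j) = 1.0d0 / dble(j)
           do i = 1, N
              matrix(i,j) = dble(i + j)
           end do
        end do

c       Reference: plain LOOP #2
        do i = 1, N
           check(i) = 0.0d0
           result(i) = 0.0d0
        end do
        do j = 1, N
           do i = 1, N
              check(i) = check(i) + matrix(i,j) * vector(j)
           end do
        end do

c       Unrolled version (N is assumed to be a multiple of 4)
        do j = 1, N, 4
           do i = 1, N
              result(i) = result(i)
     +           + matrix(i,j)   * vector(j)
     +           + matrix(i,j+1) * vector(j+1)
     +           + matrix(i,j+2) * vector(j+2)
     +           + matrix(i,j+3) * vector(j+3)
           end do
        end do

c       The two versions are mathematically equal; they may differ
c       in the last bits if the compiler reassociates the sum.
        diff = 0.0d0
        do i = 1, N
           diff = max(diff, abs(result(i) - check(i)))
        end do
        print *, 'largest difference =', diff
        end
```

Comparing the two results in this way is also a simple check on whether a reordering optimization has changed the numerical output of a program.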

CPU-specific optimization: -qarch=... -qtune=...

The RISC System/6000 includes models based on three different chip configurations: the original POWER processor, the PowerPC processor, and the POWER2 processor. You can use -qarch and -qtune to target your program to particular machines.

CPU-specific compiler flags:
PowerPC                -qarch=ppc
POWER                  -qarch=pwr
POWER2 (Thin,Thin2)    -qarch=pwr2 -qtune=pwr2s
POWER2 (Wide)          -qarch=pwr2

Performance results on RS/6000 with -O3 -qhot -qarch=... optimization:

 
                              LOOP #1      LOOP #2
POWER 62.5 MHz (zeise)              2           31
Thin node 66 MHz (SP2)              4           44
Thin node-2, 66 MHz (rydberg)       3           62
Wide node 66 MHz (Jensen)           4           90
Wide node 77 MHz (Mom)              5          100
Numbers are in MFLOPS, names in () are machine names. The loops were iterated 100 times for better timing.


Using a tuned scientific library

The public-domain Basic Linear Algebra Subprograms (BLAS) library includes a matrix-vector operation:

y = beta * y + alpha * A * x

where A is the matrix, x and y are vectors, and alpha and beta are scalar constants. The BLAS subroutines have been highly tuned for the RS/6000 CPUs, and are included in the IBM ESSL library. An IBM paper on the techniques used in the ESSL library for obtaining maximum performance in the BLAS subroutines is at http://www.austin.ibm.com/tech/essl.html.

The BLAS matrix-vector subroutine is called in this way:

        call dgemv ('N', N, N, 1.0d0, matrix, N, 
     +    vector, 1, 0.0d0, result, 1)
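In this call, 'N' selects the untransposed matrix, the two N arguments are the numbers of rows and columns, 1.0d0 is alpha, the N after matrix is its leading dimension, 0.0d0 is beta, and the two 1 arguments are the strides of vector and result. As a rough guide to what the call computes, the following plain-Fortran sketch performs the same operation for unit strides. The names refmv and chkmv are hypothetical and are not part of BLAS or ESSL; the tuned library routine is of course implemented very differently.

```fortran
c       Plain-Fortran reference for the operation that
c       dgemv('N', ...) performs with unit strides:
c           y = beta * y + alpha * A * x
c       refmv is a hypothetical name, not a BLAS or ESSL routine.
        subroutine refmv (m, n, alpha, a, lda, x, beta, y)
        integer m, n, lda, i, j
        real*8 alpha, beta, a(lda,n), x(n), y(m)

        do i = 1, m
           y(i) = beta * y(i)
        end do
        do j = 1, n
           do i = 1, m
              y(i) = y(i) + alpha * a(i,j) * x(j)
           end do
        end do
        end

        program chkmv
        integer N
        parameter (N = 3)
        real*8 matrix(N,N), vector(N), result(N)
        integer i, j

        do j = 1, N
           vector(j) = 1.0d0
           do i = 1, N
              matrix(i,j) = dble(i)
           end do
        end do
c       This sketch reads y even when beta = 0, so result is
c       initialized before the call
        do i = 1, N
           result(i) = 0.0d0
        end do
        call refmv (N, N, 1.0d0, matrix, N, vector, 0.0d0, result)
c       Row i holds the value i in every column, so result(i) = 3*i
        print *, result
        end
```

When the real dgemv is used instead, the program must be linked against a BLAS implementation; with ESSL installed this is typically done by adding -lessl to the xlf command line.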

Performance results on RS/6000 using the ESSL library:

 
POWER 62.5 MHz (zeise)               53
Thin node 66 MHz (SP2)               50
Thin node-2, 66 MHz (rydberg)        99
Wide node 66 MHz (Jensen)           157
Wide node 77 MHz (Mom)              187
Numbers are in MFLOPS, names in () are machine names. The call was iterated 100 times for better timing.


Summary of performance optimizations

The conclusion is that it is very important to avoid memory access patterns like that of LOOP #1. The performance gains of LOOP #2 from the various optimization techniques, relative to the unoptimized case, are summarized in the following table:
                                    C P U   t y p e
Optimization:           POWER   Thin  Thin-2  Wide-66 Wide-77
No optimization           1.0    1.0     1.0      1.0     1.0
-O2 (equals -O)           8.4    4.8     4.8      4.8     4.7
-O3 -qstrict              8.6    6.0     7.9      9.5     9.6
-O3 -qhot -qarch=...      8.6    6.5     8.9     13.2    12.3
BLAS subroutine          14.3    7.3    14.2     23.0    23.0
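Each entry in this table is the ratio of two measured speeds for LOOP #2 on the same machine. As a worked example, the rounded MFLOPS values quoted earlier for the 77 MHz Wide node give the figures below; the table was presumably computed from unrounded measurements, so the last digit can differ slightly.

```latex
\[
\text{speedup} = \frac{P_{\text{optimized}}}{P_{\text{unoptimized}}},
\qquad
\frac{38}{8} = 4.75 \;\; (\texttt{-O}),
\qquad
\frac{187}{8} \approx 23.4 \;\; (\text{BLAS})
\]
```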

It is evident that for the POWER architecture, the basic -O optimization is sufficient for optimal performance. However, the POWER2 needs at least -O3 optimization for maximum performance. This is presumably due to the more complex CPU architecture requiring more work by the compiler. The -qarch flags are especially important for POWER2, since they inform the compiler about the width of the memory bus of the particular CPU: 64 bits (Thin), 128 bits (Thin-2) and 256 bits (Wide).

The BLAS subroutine DGEMV generally doubles the performance relative to the best efforts of the XL Fortran compiler, because it uses a cache prefetching technique. Only on the Thin node is this technique of limited value, owing to its narrower memory bandwidth.

All performance measurements in this paper were performed with the AIX XL Fortran compiler version 3.2.4.4, and the IBM ESSL library version 2.2.2.1.


Accelerating elementary mathematical functions

If your code's performance depends on elementary functions, you may want to link it with IBM's Mathematical Acceleration Subsystem (MASS) library. The MASS library contains accelerated versions of a set of frequently used math intrinsic functions from the AIX system math library libm.a.

The libmass.a library can be used with either Fortran or C applications and will run under AIX on all of the RISC System/6000 processors. It assumes that the IEEE rounding mode has been set to nearest and that exceptions have been masked off. In some cases MASS is not as accurate as the system library, and it may handle end-point cases differently from libm.a (sqrt(inf), for example). The trigonometric functions (sin, cos, tan) return NaN (Not-a-Number) for large arguments (abs(x)>2**50*pi).

In order to use the MASS library, you simply add the string -lmass when linking, for example:

xlf -O file.f -lmass
For further details, see IBM's MASS documentation.
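Whether MASS helps a given code is easiest to judge by experiment. The sketch below is a hypothetical test program, not taken from the MASS documentation: it spends nearly all its time in exp and can be built twice, once with and once without -lmass, to compare run times. It assumes the Fortran 90 intrinsic system_clock is available.

```fortran
c       Sketch: a loop dominated by an elementary function (exp).
c       Build it twice, with and without -lmass, and compare times.
        program elem
        integer N, NITER
        parameter (N = 100000, NITER = 100)
        real*8 x(N), total, seconds
        integer i, iter, t0, t1, rate

        do i = 1, N
           x(i) = dble(i) / dble(N)
        end do

        total = 0.0d0
        call system_clock(t0, rate)
        do iter = 1, NITER
           do i = 1, N
              total = total + exp(x(i))
           end do
        end do
        call system_clock(t1)

        seconds = dble(t1 - t0) / dble(rate)
        print *, 'seconds =', seconds
c       Print the sum so the loop is not optimized away; compare it
c       between the two builds to see any accuracy difference.
        print *, 'sum =', total
        end
```

Since MASS may be slightly less accurate than libm.a, comparing the printed sums from the two builds gives a first impression of whether the difference matters for your application.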

The MASS library also has subroutines for calculating whole vectors of mathematical functions.


How to measure the performance of your code

A number of tools are available for measuring the performance of your code: