of UNI-C, the Danish Computer Center for Research and Education; also at Center for Atomic-scale Materials Physics (CAMP) at the Technical University of Denmark.
c     Matrix-vector product result = matrix * vector
      parameter (N = 1000)
      real*8 matrix(N,N), vector(N), result(N)
c     LOOP #1 (wrong way through memory)
      do i = 1, N
         result(i) = 0.0d0
         do j = 1, N
            result(i) = result(i) + matrix(i,j) * vector(j)
         end do
      end do
c     LOOP #2 (stride 1 through memory)
      do i = 1, N
         result(i) = 0.0d0
      end do
      do j = 1, N
         do i = 1, N
            result(i) = result(i) + matrix(i,j) * vector(j)
         end do
      end do
(The code with timing calls is here.)
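The difference between the two loops comes from Fortran's column-major array storage. The following minimal sketch prints the distance in memory between consecutive accesses for each inner loop. The program name and the helper function offset are hypothetical and are used only for illustration.

```fortran
      program stride
      implicit none
      integer, parameter :: n = 1000
      integer :: di, dj
!     Step between neighbouring elements along each index.
      di = offset(6, 7) - offset(5, 7)
      dj = offset(5, 8) - offset(5, 7)
      print *, 'step in i (LOOP #2 inner loop):', di, 'elements'
      print *, 'step in j (LOOP #1 inner loop):', dj, 'elements'
      print *, 'bytes between LOOP #1 accesses:', 8*dj
      contains
!     Offset of matrix(i,j) from matrix(1,1), in array elements,
!     for a column-major n x n array (illustrative helper).
      integer function offset(i, j)
      integer, intent(in) :: i, j
      offset = (i - 1) + (j - 1)*n
      end function offset
      end program stride
```

With N = 1000 the inner loop of LOOP #2 moves 1 element at a time, whereas the inner loop of LOOP #1 jumps 1000 elements (8000 bytes for real*8 data) between accesses.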
Without any optimization, just using the command:
xlf file.f
the following performance results were obtained on various RS/6000 models running AIX 4.1.4, with the most current (at the time of writing) XL Fortran compiler, XLF version 3.2.4.4:
                                LOOP #1   LOOP #2   Peak-speed
POWER 62.5 MHz (zeise)             2         4         125
Thin node 66 MHz (SP2)             3         7         264
Thin node-2, 66 MHz (rydberg)      3         7         264
Wide node 66 MHz (Jensen)          4         7         264
Wide node 77 MHz (Mom)             4         8         308
Numbers are in MFLOPS (Million FLoating-point Operations per Second),
names in () are CAMP's machine
names.
The loops were iterated 100 times for better timing.
Peak speed is the clock-frequency times the maximum possible number
of floating-point operations per CPU clock cycle.
The POWER CPUs can execute at most 1 multiply-add instruction (equal
to 2 FLOPs) per cycle,
whereas POWER2 CPUs can execute at most 2 multiply-adds.
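The peak-speed definition can be checked with a few lines of arithmetic. The sketch below applies it to the three clock rates in the table, using 1 multiply-add per cycle for POWER and 2 for POWER2; the program and variable names are illustrative only.

```fortran
      program peak
      implicit none
      double precision mhz(3), p
      integer fma(3), k
!     Clock rates in MHz and maximum multiply-adds per cycle
!     (1 for POWER, 2 for POWER2), as stated in the text.
      data mhz /62.5d0, 66.0d0, 77.0d0/
      data fma /1, 2, 2/
      do k = 1, 3
!        Each multiply-add counts as 2 floating-point operations.
         p = 2.0d0*dble(fma(k))*mhz(k)
         print '(f6.1,a,f7.1,a)', mhz(k), ' MHz:', p, ' MFLOPS'
      end do
      end program peak
```

This gives 125 MFLOPS for the 62.5 MHz POWER CPU, and 264 and 308 MFLOPS for the 66 MHz and 77 MHz POWER2 CPUs.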
info -l xlf
The version of your AIX system's Fortran compiler can be shown by this command:
lslpp -l xlfcmp xlfrte
In order to take advantage of the compiler optimizations in this paper, your Fortran compiler should be XLF version 3.2.4, or later.
IBM's C/C++ compiler for AIX is the C Set ++ compiler. Most compiler flags are common to both the Fortran and C/C++ compilers, so the techniques discussed in this paper apply also to codes written in C/C++. The C/C++ compiler's on-line manual is:
info -l cset
Performance results on RS/6000 with -O optimization:
                                LOOP #1   LOOP #2
POWER 62.5 MHz (zeise)             2        31
Thin node 66 MHz (SP2)             4        33
Thin node-2, 66 MHz (rydberg)      3        33
Wide node 66 MHz (Jensen)          4        33
Wide node 77 MHz (Mom)             5        38
Numbers are in MFLOPS, names in () are machine names.
The loops were iterated 100 times for better timing.
Performance results on RS/6000 with -O3 -qstrict optimization:
                                LOOP #1   LOOP #2
POWER 62.5 MHz (zeise)             2        31
Thin node 66 MHz (SP2)             4        41
Thin node-2, 66 MHz (rydberg)      3        54
Wide node 66 MHz (Jensen)          4        65
Wide node 77 MHz (Mom)             5        78
Numbers are in MFLOPS, names in () are machine names.
The loops were iterated 100 times for better timing.
PowerPC                  -qarch=ppc
POWER                    -qarch=pwr
POWER2 (Thin, Thin-2)    -qarch=pwr2 -qtune=pwr2s
POWER2 (Wide)            -qarch=pwr2
Performance results on RS/6000 with -O3 -qhot -qarch=... optimization:
                                LOOP #1   LOOP #2
POWER 62.5 MHz (zeise)             2        31
Thin node 66 MHz (SP2)             4        44
Thin node-2, 66 MHz (rydberg)      3        62
Wide node 66 MHz (Jensen)          4        90
Wide node 77 MHz (Mom)             5       100
Numbers are in MFLOPS, names in () are machine names.
The loops were iterated 100 times for better timing.
y = beta * y + alpha * A * x
where A is the matrix, x and y are vectors, and alpha and beta are constant numbers. The BLAS subroutines have been highly tuned for the RS/6000 CPUs, and are included in the IBM ESSL library. An IBM paper on the techniques used in the ESSL library for obtaining maximum performance in the BLAS subroutines is at http://www.austin.ibm.com/tech/essl.html.
The BLAS matrix-vector subroutine is called in this way:
      call dgemv ('N', N, N, 1.0d0, matrix, N,
     +            vector, 1, 0.0d0, result, 1)
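To make the meaning of the arguments concrete, the following sketch spells out what the call above computes for the untransposed ('N') case with unit strides, alpha = 1 and beta = 0. The routine refgemv is a hypothetical, untuned reference written for this illustration. It is not part of BLAS or ESSL and has none of their performance.

```fortran
      program gemvref
      implicit none
      integer, parameter :: n = 3
      double precision a(n,n), x(n), y(n)
      integer i, j
!     Small test data: a(i,j) = i + j, x = (1, 1, 1).
      do j = 1, n
         x(j) = 1.0d0
         do i = 1, n
            a(i,j) = dble(i + j)
         end do
      end do
      call refgemv(n, 1.0d0, a, x, 0.0d0, y)
      print *, y
      end program gemvref

!     Hypothetical reference routine (not part of BLAS or ESSL):
!     y = beta*y + alpha*A*x for a square, untransposed matrix
!     with unit vector strides.
      subroutine refgemv(n, alpha, a, x, beta, y)
      implicit none
      integer n, i, j
      double precision alpha, beta, a(n,n), x(n), y(n)
      do i = 1, n
!        With beta = 0 the old contents of y are never read.
         if (beta .eq. 0.0d0) then
            y(i) = 0.0d0
         else
            y(i) = beta*y(i)
         end if
      end do
      do j = 1, n
         do i = 1, n
            y(i) = y(i) + alpha*a(i,j)*x(j)
         end do
      end do
      end subroutine refgemv
```

For this test data each y(i) is the sum of row i of the matrix, that is 9, 12 and 15. The loop order is the stride-1 order of LOOP #2.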
Performance results on RS/6000 using the ESSL library:
POWER 62.5 MHz (zeise)             53
Thin node 66 MHz (SP2)             50
Thin node-2, 66 MHz (rydberg)      99
Wide node 66 MHz (Jensen)         157
Wide node 77 MHz (Mom)            187
Numbers are in MFLOPS, names in () are machine names.
The call was iterated 100 times for better timing.
                                       CPU type
Optimization:            POWER    Thin   Thin-2  Wide-66  Wide-77
No optimization            1.0     1.0      1.0      1.0      1.0
-O2 (equals -O)            8.4     4.8      4.8      4.8      4.7
-O3 -qstrict               8.6     6.0      7.9      9.5      9.6
-O3 -qhot -qarch=...       8.6     6.5      8.9     13.2     12.3
BLAS subroutine           14.3     7.3     14.2     23.0     23.0
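The entries in the table above appear to be the LOOP #2 rate divided by the unoptimized LOOP #2 rate on the same machine. This is an inference from the numbers and is not stated explicitly. The sketch below repeats that calculation for the Wide node 77 MHz using the rounded MFLOPS figures from the earlier tables; the program and variable names are illustrative.

```fortran
      program speedup
      implicit none
      double precision mflops(5)
      integer k
!     LOOP #2 rates for the Wide node 77 MHz, in the order:
!     no optimization, -O, -O3 -qstrict, -O3 -qhot -qarch, BLAS.
      data mflops /8.0d0, 38.0d0, 78.0d0, 100.0d0, 187.0d0/
      do k = 1, 5
!        Speedup relative to the unoptimized rate.
         print '(i2,f8.2)', k, mflops(k)/mflops(1)
      end do
      end program speedup
```

The ratios obtained this way agree with the table only to within a few percent, presumably because the table was computed from unrounded measurements.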
It is evident that for the POWER architecture, the basic -O optimization is sufficient for optimal performance. However, the POWER2 needs at least -O3 optimization for maximum performance. This is presumably due to the more complex CPU architecture requiring more work by the compiler. The -qarch flags are especially important for POWER2, since they inform the compiler about the memory-bandwidth of the particular CPU: 64 bits (Thin), 128 bits (Thin-2) and 256 bits (Wide).
The BLAS subroutine DGEMV generally doubles the performance relative to the best efforts of the XL Fortran compiler, since it uses a cache prefetching technique. Only on the Thin node is this technique of limited value, due to the narrower memory bandwidth.
All performance measurements in this paper were performed with the AIX XL Fortran compiler version 3.2.4.4, and the IBM ESSL library version 2.2.2.1.
The libmass.a library can be used with either FORTRAN or C applications and will run under AIX on all of the RISC System/6000 processors. It assumes that the IEEE rounding mode has been set to nearest and that exceptions have been masked off. In some cases MASS is not as accurate as the system library and it may handle end-point cases differently from libm.a (sqrt(inf) for example). The trig functions (sin, cos, tan) return NaN (Not-a-Number) for large arguments (abs(x)>2**50*pi).
In order to use the MASS library, you simply add the string -lmass when linking, for example:
xlf -O file.f -lmass
For further details see here.
The MASS library also has subroutines for calculating whole vectors of mathematical functions.
time a.out
54.0u 0.1s 2:41 33% 15+7879k 0+0io 2pf+0w
      real*8 time
      time = dble(mclock()) * 0.01d0
and print out relevant timing numbers.
The mclock system subroutine returns the user CPU time spent, in
units of 10 milliseconds.
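A complete timing harness along these lines might look as follows. This is a sketch under stated assumptions: it uses the standard Fortran 95 intrinsic cpu_time in place of the AIX-specific mclock so that it runs on any system, and the problem size, iteration count and variable names are chosen for illustration only.

```fortran
      program timing
      implicit none
      integer, parameter :: n = 500, iters = 20
      double precision a(n,n), x(n), y(n)
      double precision t0, t1, secs, flops
      integer i, j, it
      do j = 1, n
         x(j) = 1.0d0
         do i = 1, n
            a(i,j) = 1.0d0
         end do
      end do
!     cpu_time is the standard Fortran 95 intrinsic, used here in
!     place of the AIX-specific mclock (an assumption of this sketch).
      call cpu_time(t0)
      do it = 1, iters
         do i = 1, n
            y(i) = 0.0d0
         end do
         do j = 1, n
            do i = 1, n
               y(i) = y(i) + a(i,j)*x(j)
            end do
         end do
      end do
      call cpu_time(t1)
      secs = t1 - t0
!     One multiply and one add per inner-loop iteration.
      flops = 2.0d0*dble(n)*dble(n)*dble(iters)
      print *, 'check y(1) =', y(1)
      print *, 'CPU seconds =', secs
      if (secs .gt. 0.0d0) print *, 'MFLOPS =', flops/secs/1.0d6
      end program timing
```

Printing an element of the result keeps the compiler from discarding the timed loops as dead code, and repeating the kernel several times makes the measured interval long compared with the timer resolution.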
rs2mon    Real-time hardware performance monitor
llflops   Performance monitor of a LoadLeveler batch pool
rs2hpm    Performance monitor of user code
The rs2hpm tool is especially useful for measuring the exact MFLOPS rate etc. of your code. We show some examples of output from rs2mon, llflops and rs2hpm. Note that these tools summarize the performance of all running processes, so they are mainly useful if only a single job is running on the CPU. See the relevant man-pages for further information.
xlf -qlist -g -d file.f
Then you run your code under the control of tprof:
tprof -p a.out -s -v -x a.out
The timing profile is recorded in a file __a.out.all (when the executable name is a.out). This file lists the number of "clock ticks" for all processes, for each subroutine of your code, and for the system libraries called by your code. Files whose names start with __h. give a further breakdown by source line. This information is also available in transformed source code files (file names starting with __t.), which list the number of "clock ticks" directly next to each source line. However, the present version of tprof sometimes generates corrupted __t. files, whereas no problems have been identified with the __h. files.
An example of tprof on the code above is here.