Mathematical Acceleration SubSystem (MASS)

MASS Version 2.7

Contents

  o Introduction
  o Performance and Accuracy Information for the MASS Libraries
  o Installing the Libraries
  o Using the Libraries
  o Obtaining MASS: License and Download

For questions, contact mass@austin.ibm.com or write to:

  MASS Program Manager (Mail Stop 9444)
  IBM Austin Laboratory
  11400 Burnet Road
  Austin, TX 78758-3406

Introduction

MASS consists of libraries of tuned mathematical intrinsic functions. Each new MASS release contains the material from previous releases that has not been changed so that only the latest release need be kept for currency.

Version 1.0

Version 1.0 of MASS introduced a library of mathematical intrinsic functions, libmass.a, that offered tuned alternatives for functions in libm.a, the math library supplied with AIX.

Version 2.1

Version 2.1 of MASS added two new libraries of vector function subroutines, libmassv.a for general use and libmassvp2.a for use on POWER2 machines only. These vector subroutines compute result vectors as functions of their argument vectors. The scalar library libmass.a was unchanged.

Version 2.2

Version 2.2 added functions to the vector libraries.

Version 2.3

Version 2.3 included improved scalar sqrt and rsqrt functions, added more vector functions, and introduced a Fortran/C source library of simple implementations of the functions in the MASS vector libraries. This permits users who vectorize their programs to have portable code. The performance and accuracy tables were considerably revised and extended as well.

Version 2.4

Version 2.4 added a tanh intrinsic function to the scalar library and slightly improved the sinh and x**y functions. Short-precision vector functions and vector atan2 were added to the general vector library.

Version 2.5

Version 2.5 makes the general vector library, libmassv.a, safe for multi-threaded applications where MASS functions might be called by two threads simultaneously (threadsafe).
Some of these vector functions were further tuned. A new POWER3 vector library, libmassvp3.a (also threadsafe), was added, with some routines tuned for the POWER3 architecture as implemented on the 630 processor (the remaining routines are unchanged from the general vector library). A dnint routine was added to the scalar library, libmass.a. The functions unique to the POWER2 vector library, libmassvp2.a, are unchanged.

Version 2.6

Version 2.6 makes the scalar library, libmass.a, safe for multi-threaded applications where MASS functions might be called by two threads simultaneously (threadsafe). The scalar library shows only minor changes in performance.

Version 2.7

Version 2.7 replaces twelve functions (forms of exp, log, sin, and cos) in the general vector library, libmassv.a, with versions that have been rewritten for better short-precision performance and better performance for the POWER3 architecture as implemented on the 630 processor. The architecture-specific vector libraries, libmassvp2.a and libmassvp3.a, also include the new functions.

Scalar Library

The MASS scalar library, libmass.a, contains an accelerated set of frequently used math intrinsic functions from the AIX system library libm.a (now called libxlf90.a in the IBM XL Fortran manual):

  o sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

The libmass.a library can be used with either Fortran or C applications and runs under AIX on all of the IBM RS/6000 processors.

Because MASS does not check its environment, it must be called with the IEEE rounding mode set to round-to-nearest and with exceptions masked off (the default XLF environment). MASS may not work properly with other settings. In some cases MASS is not as accurate as the system library, and it may handle edge cases differently from libm.a (sqrt(inf), for example). The trig functions (sin, cos, tan) return NaN (Not-a-Number) for large arguments (abs(x)>2**50*pi).
In release 2.4 the x**y function was revised to accept negative x arguments with integer y arguments, in accordance with the C standard. (The Fortran standard prohibits such arguments, and the function previously returned NaN for them.)

Vector Libraries

The general vector library, libmassv.a, contains vector function subroutines that run on the entire IBM RS/6000 family. The second library, libmassvp2.a, contains the subroutines of libmassv.a and adds a set that is tuned for and based upon the POWER2 architecture. The POWER2 processors include the 590 and follow-on servers and related desktop systems and the POWERParallel SP2 using the P2SC processor. The third library, libmassvp3.a, contains the routines of libmassv.a, some of which have been tuned for the POWER3 architecture used by the 630 processor. The performance tables in the following sections list the contents of these libraries.

Please note that in Version 2.3, the vdint and vdnint functions were rewritten so that the dependency on the POWER2 architecture was removed. Thus, they were added to libmassv.a and remain in libmassvp2.a by the inclusion of libmassv.a in libmassvp2.a. Users requested a vector atan2 function and short-precision vector functions, and these were added in Version 2.4.

The vector libraries libmassv.a, libmassvp2.a, and libmassvp3.a can be used with either Fortran or C applications. However, when calling the library functions from C, only call by reference is supported.

The vector functions were developed under assumptions similar to those for libmass.a: IEEE round-to-nearest mode and exceptions masked off. Accuracy is comparable to that of the libmass.a scalar counterparts, though results may not be bit-wise identical.

The MASS Vector Fortran/C Source Library enables application developers to write portable vector codes. For this purpose one Fortran source library has been provided: libmassv.f.
It corresponds to libmassvp2.a, and thus contains all of the vector functions of libmassv.a as well. The performance tables that follow show that there can be a performance improvement even when the MASS scalar functions are used in the vector loops of libmassv.f.

Performance and Accuracy Information for the MASS Libraries

Sample scalar library performance data is provided for the 604E (PowerPC), 630 (POWER3), and P2SC (POWER2) processors. The data should be considered approximate. It was obtained by timing many repetitions of a loop over 1,000 random arguments and includes all overhead. Timing in this way brings the input and output vectors into the on-chip cache (the loop is short enough for them to fit in cache). Performance may deteriorate seriously when the input and output vectors are not in cache. Performance may also deteriorate for arguments at or near the end-points of the valid argument ranges. The P2SC and 630 processors have a hardware sqrt instruction, which is not timed here.

The libxlf90.a measurements were made with the versions of the library available on the respective test systems. They may differ from the versions timed for previous MASS releases. Users may see performance that differs from the figures in the table below.
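The measurement procedure described above can be sketched in C. This is only an illustration under stated assumptions, not the harness that produced the tables: the names time_loop, cycles_per_call, and halve are hypothetical, the standard clock() timer is an arbitrary choice, and the processor cycle time has to be supplied by the caller (for example, 3.0 nanoseconds for the 604E).

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define LOOP_LEN 1000   /* loop length used for the tables */

typedef double (*scalar_fn)(double);

/* Hypothetical stand-in workload so the sketch builds without any math
   library; pass exp, sin, ... from <math.h> to time a real function. */
double halve(double v) { return 0.5 * v; }

/* Elapsed CPU seconds for `reps` passes over LOOP_LEN random arguments
   drawn uniformly from [lo, hi]. All loop overhead is included. */
double time_loop(scalar_fn f, int reps, double lo, double hi, double *sink)
{
    static double x[LOOP_LEN], y[LOOP_LEN];
    clock_t t0, t1;
    int i, r;

    assert(f != NULL && reps > 0 && lo <= hi);
    for (i = 0; i < LOOP_LEN; i++)
        x[i] = lo + (hi - lo) * ((double)rand() / RAND_MAX);

    t0 = clock();
    for (r = 0; r < reps; r++)
        for (i = 0; i < LOOP_LEN; i++)
            y[i] = f(x[i]);
    t1 = clock();

    *sink = y[LOOP_LEN - 1];  /* use a result so the loop is not discarded */
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Convert elapsed seconds into processor cycles per call, given the
   processor cycle time in nanoseconds. */
double cycles_per_call(double seconds, long calls, double cycle_ns)
{
    assert(calls > 0 && cycle_ns > 0.0);
    return seconds * 1.0e9 / ((double)calls * cycle_ns);
}
```

Running time_loop once with a function linked from libm.a and once with the same function linked from libmass.a, then dividing the two cycles_per_call results, would give a number comparable to the ratio column (for example, 68 / 50 = 1.36 for sqrt on the 604E).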
Math function performance (cycles per call, length 1000 loop)

                    libxlf90.a          libmass.a             ratio
Function  Range  604E   630  P2SC    604E   630  P2SC    604E   630  P2SC
sqrt      A        68    59*   41*     50    45*   27*   1.36  1.31  1.52
rsqrt     A        80    71*   56*     52    53*   28*   1.54  1.34  2.00
exp       D        87    64    53      42    27    22    2.07  2.37  2.41
log       C       100    87    67      55    53    33    1.82  1.64  2.03
sin       B        51    36    34      27    15    19    1.89  2.40  1.79
sin       D        79    60    49      60    42    37    1.32  1.43  1.32
cos       B        52    37    35      26    15    19    2.00  2.47  1.84
cos       D        76    58    50      59    42    36    1.29  1.38  1.39
tan       D       137   113    90      76    42    36    1.80  2.69  2.50
atan      B        60    52    40      53    45    36    1.13  1.16  1.11
atan      D        97    70    61      86    58    57    1.13  1.21  1.07
sinh      D       218   186   178      61    44    31    3.57  4.23  5.74
cosh      D       154   120   129      49    34    26    3.14  3.53  4.96
tanh      D       217   206   185      78    53    43    2.78  3.89  4.30
dnint     D        36    24    23      22    12    13    1.64  2.00  1.77
atan2     B       538   410   557     106    88    71    5.08  4.66  7.85
x**y      C       287   228   187     114    97    63    2.52  2.35  2.97

* When this function is compiled specifically for this processor, inline code using the optional sqrt instruction will be generated. This is not what is being timed here.

Range Key          Processor   Cycle time         Dcache
A = 0, 1           604E        3.0 nanoseconds    32k
B = -1, 1          630         5.0 nanoseconds    64k
C = 0, 100         P2SC        7.4 nanoseconds    128k
D = -100, 100

The following table gives estimated processor clock cycles per vector element evaluation for the various MASS vector libraries. These results were obtained for vectors of length 1000 so that the caches contain all vectors. Results using the functions in libxlf90.a are shown under the columns labeled libm. They were computed using code compiled from the MASS Fortran source code library libmassv.f (see The Vector Libraries) using the IBM XLF compiler with the -O option, without linking to libmass.a. Results obtained by repeating the previous process with linkage to libmass.a are shown under the columns labeled mass. Results obtained using libmassv.a, libmassvp3.a, or libmassvp2.a are shown in the columns labeled massv, vp3, or massvp2, respectively.
Times are not given for functions in the libmassvp2 and libmassvp3 libraries which have been carried over from the libmassv library. As before, results were computed on PowerPC, POWER2 and POWER3 systems. Users should expect results to vary with vector length. Items in the table where the indicated library function does not exist or the measurement was not done are left blank.

Math Library Performance (cycles per evaluation, length 1000 loop)

                        604E                   630                    P2SC
function   range  libm  mass massv    libm  mass massv   vp3    libm  mass massv
vrec       D        32*         10       9*          6     4       8*          4
vsrec      D        18*          8       7*          5     3       8*          4
vdiv       D        32*         12       9*          7     5       9*          5
vsdiv      D        18*         10       7*          6     3       9*          4
vsqrt      C        67    48    16      11*          9     6      13*          7
vssqrt     C        70    48    10       7*          8     5      13*          5
vrsqrt     C        79    49    16      22*          9     6      22*          7
vsrsqrt    C        83    51     9      16*          7     4      22*          5
vexp       D        83    45    16      64    33     6            53    21     7
vsexp      E        85    44    13      68    36     5            58    21     6
vlog       C        99    56    20      83    53     8            67    35     8
vslog      C       102    56    17      86    57     7            66    37     7
vsin       B        50    29    11      36    16     5            34    17     5
vsin       D        79    59    27      60    43    12            50    37    12
vssin      B        51    26     8      39    18     4            40    16     4
vssin      D        79    58    20      62    46     9            56    38     9
vcos       B        51    26     9      37    16     4            34    17     4
vcos       D        75    59    27      58    43    12            51    36    11
vscos      B        52    26     7      39    18     3            40    16     3
vscos      D        76    59    20      61    46     9            56    37     9
vsincos    B       100    53    19      80    33     8            80    38     8
vsincos    D       151   116    29     123    92    12           111    81    12
vssincos   B       107    55    15      79    38     6            78    36     7
vssincos   D       159   118    24     125    98    10           110    80    10
vcosisin   B       104    55    19      78    34     8            79    37     8
vcosisin   D       156   118    29     123    93    12           111    81    12
vscosisin  B       108    55    15      79    36     6            78    36     6
vscosisin  D       160   119    23     125    95     9           110    79    10
vtan       D       136    74    32     111    52    19            90    38    13
vstan      D       136    74    32     113    56    19            95    39    12
vatan2     D       545   104    40     413    87    25           555    73    17
vsatan2    D       545   104    40     418    89    25           558    71    17
vdnint     D        37    22     7      24    12   3.4            23    13   2.7
vdint      D        36           6      22         2.8            21         2.6

POWER2-only functions in libmassvp2.a (P2SC)

function   range   libm  massvp2
vidint     D        4.0      2.7
vasin      B         48       17
vacos      B         49       17
vdfloat    D        3.0      1.8
vdsign     D          9      3.5

* indicates inline instructions timed (not a subroutine call)

Range Key          Processor   Cycle time         Dcache size
A = 0, 1           604E        3.0 nanoseconds    32 kilobytes
B = -1, 1          630         5.0 nanoseconds    64 kilobytes
C = 0, 100         P2SC        7.4 nanoseconds    128 kilobytes
D = -100, 100
E = -10, 10

The performance of the POWER2-only vasin and vacos functions in libmassvp2.a is argument dependent. The results were obtained for a large set of arguments uniformly distributed between -1 and 1.

Short-precision versions of the vector functions vexp through vatan2 are now included in libmassv.a. They are obtained when the prefix is vs rather than just v.

The following table provides sample accuracy data for the libm, libmass, libmassv, and libmassvp3 libraries. The numbers are based on the results for 10,000 random arguments chosen in the specified ranges. Real*16 functions were used to compute the errors. There may be portions of the valid input argument range for which accuracy is not as good as illustrated in the table. Also, users may see accuracy that differs from the table for argument values that are not represented in it.

The Percent Correctly Rounded (PCR) column elements are obtained by counting the number of correctly rounded results out of the 10,000 random argument cases. A result is correctly rounded if the function returns the IEEE 64-bit value that is closest to the infinite-precision exact result.
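The two accuracy measures just defined can be sketched in C as follows. This is a hedged illustration, not the procedure used for the table: the names ulp_of and accuracy_stats are hypothetical, and long double stands in for the Real*16 reference functions mentioned above. On platforms where long double is no wider than double, the reference is not precise enough for the comparison to be meaningful.

```c
#include <assert.h>
#include <float.h>

/* Size of one unit in the last place (ulp) at a finite, normal,
   nonzero double x: DBL_EPSILON scaled by the largest power of two
   that does not exceed |x|. */
double ulp_of(double x)
{
    double a = (x < 0.0) ? -x : x;
    double p = 1.0;

    assert(a >= DBL_MIN && a <= DBL_MAX);
    while (p * 2.0 <= a) p *= 2.0;   /* scale up for |x| >= 2 */
    while (p > a) p *= 0.5;          /* scale down for |x| < 1 */
    return p * DBL_EPSILON;
}

/* Store in *pcr the percentage of correctly rounded results and in
   *max_e the largest observed error in ulps. reference[i] is assumed
   to hold a higher-precision value of the exact result. */
void accuracy_stats(const double *computed, const long double *reference,
                    int n, double *pcr, double *max_e)
{
    int i, correct = 0;

    assert(n > 0);
    *max_e = 0.0;
    for (i = 0; i < n; i++) {
        long double diff = (long double)computed[i] - reference[i];
        double err;

        if (diff < 0.0L) diff = -diff;
        err = (double)(diff / (long double)ulp_of((double)reference[i]));
        if (err > *max_e) *max_e = err;

        /* Correctly rounded: equal to the double nearest the reference. */
        if (computed[i] == (double)reference[i]) correct++;
    }
    *pcr = 100.0 * correct / n;
}
```

With this definition a correctly rounded result has an error of at most 0.5 ulp, which matches the MaxE value of .50 shown in the table for entries with a PCR of 100.00.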
Math Library Accuracy

                     libm          libmass         libmassv       libmassvp3
function  range     PCR  MaxE       PCR  MaxE       PCR  MaxE       PCR  MaxE
rec       D      100.00*  .50*                   100.00   .50    100.00   .50
srec      D      100.00*  .50*                    92.47   .66     99.97   .50
div       D      100.00*  .50*                    74.21  1.28     74.21  1.28
sdiv      D      100.00*  .50*                   100.00   .50     74.49  1.31
sqrt      A      100.00   .50     96.59   .58     96.42   .60     63.14  2.16
ssqrt     A      100.00   .50    100.00   .50     87.64   .79     87.05   .83
rsqrt     A       88.52   .98     98.60   .54     97.32   .62     82.00  1.22
srsqrt    A      100.00   .50    100.00   .50     86.39   .82     89.66   .86
exp       D       99.95   .50     96.55   .63     96.58   .63
sexp      E      100.00   .50    100.00   .50     98.87   .52
log       C       99.99   .50     99.69   .53     99.69   .53
slog      C      100.00   .50    100.00   .50     99.91   .51
sin       B       81.31   .91     96.88   .80     97.28   .72
sin       D       86.03   .94     83.88  1.36     83.85  1.27
ssin      B      100.00   .50    100.00   .50     99.95   .50
ssin      D      100.00   .50    100.00   .50     99.73   .51
cos       B       92.95  1.02     92.20  1.00     93.19   .88
cos       D       86.86   .93     84.19  1.33     84.37  1.33
scos      B      100.00   .50    100.00   .50     99.35   .51
scos      D      100.00   .50    100.00   .50     99.82   .51
tan       D       99.58   .53     64.51  2.35     50.48  3.19
stan      D      100.00   .50    100.00   .50    100.00   .50
atan2     D       74.66  1.59     88.02  1.69     84.01  1.67
satan2    D      100.00   .50    100.00   .50    100.00   .50
atan      B       99.82   .51     92.58  1.78
atan      D       99.98   .50     98.86  1.72
sinh      D       94.78  1.47     89.54  1.45
cosh      D       95.64   .97     92.73  1.04
tanh      E       94.08  2.95     83.33  1.79
x**y      C       99.95   .50     96.87   .62

* indicates hardware instruction was used

Range Key
A = 0, 1           PCR  = Percentage correctly rounded
B = -1, 1          MaxE = Maximum observed error in ulps
C = 0, 100
D = -100, 100
E = -10, 10

Installing the Libraries

MASS consists of nine files (LICENSE, MASS.readme, libmass.a, libmassv.a, libmassvp2.a, libmassvp3.a, libmassv.f, massv.h, and massvp2.h) packaged as a compressed tar file, MASS_RN.tar.Z, where RN denotes the release number (for example, 2.7).

There are two ways to install MASS. If you have root access to the target RS/6000, follow the instructions for installing as root. Otherwise, follow the instructions for installing as a non-root user.
In general, MASS is more convenient to use if it is installed with root access and linked into the conventional /usr/lib directory, since the shorthand -lmass flags can then be used instead of specifying an explicit path. The tar file uses relative path names, so it will create a mass subdirectory in the current directory.

Installing as Root

  1. Log in as root
     -- or --
     su to root

  2. cd /usr/lpp

     MASS files will be restored to the directory /usr/lpp/mass.

  3. zcat /tmp/MASS_RN.tar.Z | tar -xvf -
     -- or --
     uncompress /tmp/MASS_RN.tar.Z
     tar -xvf /tmp/MASS_RN.tar

  4. ln -s /usr/lpp/mass/libmass.a /usr/lib/libmass.a
     and/or
     ln -s /usr/lpp/mass/libmassv.a /usr/lib/libmassv.a
     ln -s /usr/lpp/mass/libmassvp2.a /usr/lib/libmassvp2.a

     (This step creates a symbolic link from /usr/lpp/mass/libmass.a to /usr/lib/libmass.a, etc., so that users can specify the -lmass flag as a shorthand to link MASS routines.)

Applications can now be linked by any user simply by using the correct flag. For example:

    cc -o foo foo.c -lmass -lm ...       (for scalar only)
    cc -o foo foo.c -lmassv -lm ...      (for vector only)
    cc -o foo foo.c -lmassvp2 -lm ...    (for POWER2 vector only)
    cc -o foo foo.c -lmass -lmassv -lm   (for scalar/vector)
    cc -o foo foo.c -lmass -lmassvp2 -lm (for POWER2 scalar/vector)
    etc.

Installing as Non-Root User

  1. cd to the directory where the mass subdirectory should be created

  2. zcat /tmp/MASS_RN.tar.Z | tar -xvf -
     -- or --
     uncompress /tmp/MASS_RN.tar.Z
     tar -xvf /tmp/MASS_RN.tar

Applications can now be linked with libmass.a by using the -L flag to specify the directory in which MASS is installed. For example, if MASS is installed in /home/somebody/mass, then any user with read access to that directory can link applications in the following manner:

    cc -o foo foo.c -L /home/somebody/mass -lmass -lm ...
    etc.

Note for World Wide Web Users

Some web browsers will uncompress the tar file before storing.
If the previous instructions result in the error message:

    MASS_RN.testing.tar.Z: not in compressed format

then rename the file and extract it directly:

    mv /tmp/MASS_RN.tar.Z /tmp/MASS_RN.tar
    tar -xvf /tmp/MASS_RN.tar

The MASS Fortran Source File

Successfully installing MASS also creates the file libmassv.f, a Fortran source file of simple loops for the MASS vector functions, in the mass directory. Its use is described in the section The Vector Libraries.

Using the MASS Libraries

The Scalar Library

To use libmass.a, specify -lmass before libm.a or libxlf90.a on the linker command line. The following examples use libm.a. For example, if the library is installed in the customary location in directory /usr/lib, then the command lines for Fortran and C would be:

    xlf progf.f -o progf -lmass
    cc progc.c -o progc -lmass -lm

If libmass.a is installed in a directory other than /usr/lib, for example in /home/somebody/mass, use the -L option to add that directory to the search path:

    xlf progf.f -o progf -L/home/somebody/mass -lmass
    cc progc.c -o progc -L/home/somebody/mass -lmass -lm

(Fortran links with libm.a automatically, so only -lmass needs to be specified on the command line. For C code, you must link both libmass.a and libm.a, since libmass.a includes only a subset of the functions in libm.a.)

The library uses some global names for shared tables. These names have the form %...$.

When called from C code, the functions in libmass.a do not set the global variable errno to indicate range, domain, or loss-of-precision errors. For example, with libm.a, sqrt(-1) returns the value NaN (not a number) and also sets errno to 33 (EDOM -- domain error); with libmass.a, sqrt(-1) simply returns NaN, and errno is not set.

Note that the XLF compiler handles the rsqrt function differently from the other intrinsic functions. This is discussed in the XLF manuals, and it is suggested that the user review this material.
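Because errno is not set by libmass.a, C code that tests errno after a math call will miss domain errors once it is linked with -lmass. One way to stay independent of the library is to inspect the returned value instead. The following is a minimal sketch under that assumption; the helper names is_domain_error and require_valid are hypothetical and are not part of MASS or of libm.a.

```c
#include <assert.h>
#include <math.h>   /* isnan, NAN */

/* Returns 1 when a one-argument math function turned a valid (non-NaN)
   argument into NaN, which is how a domain error appears when errno
   is not set. Returns 0 otherwise. */
int is_domain_error(double arg, double result)
{
    return !isnan(arg) && isnan(result);
}

/* Variant for call sites where a domain error indicates a programming
   bug: abort in debug builds, otherwise pass the result through. */
double require_valid(double arg, double result)
{
    assert(!is_domain_error(arg, result));
    return result;
}

/* Typical use (sqrt may come from libm.a or from libmass.a):
 *
 *     double r = sqrt(x);
 *     if (is_domain_error(x, r)) {
 *         ... handle the error ...
 *     }
 */
```

Note that this check also flags the NaN that the MASS trig functions return for very large arguments, which may or may not be the behavior an application wants.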
Selective Use of libmass.a

If you wish to use libmass.a for some functions and the normal libm.a for the remainder, you can use an export list with the ld command. For example, to select only the fast tangent routine from libmass.a for use with the C program sample.c:

  1. Create an export list containing the names of the desired functions. In this case, the file export.list will contain only two lines:

         #!
         .tan

     (Remember that Fortran names start with "._", while C names start with ".")

  2. Pull the exported routines into an object file using the ld command with libmass.a:

         ld -o fast_tan.o -bE:export.list -lmass

     or, if libmass.a is not in /usr/lib:

         ld -o fast_tan.o -bE:export.list -L/some/other/path -lmass

  3. Create the final executable using cc, specifying fast_tan.o before the standard math library, libm.a. This will link only the tan routine from MASS (now in fast_tan.o) and the remainder of the math subroutines from the standard system library:

         cc -o sample fast_tan.o -lm

(Note: this scheme will work for all routines in libmass.a except sin, cos, atan, and atan2. These routines are coded together, so selecting fast sin will also link in fast cos; selecting atan will also link atan2.)

The Vector Libraries

Successful use of the MASS vector libraries depends on users making the effort to vectorize their code. To assist in that effort, the Fortran source library libmassv.f has been provided for use on non-IBM systems where the MASS libraries are not available. The syntax for the vector functions is visible in libmassv.f, and the user can write code using these functions to obtain code that may port to systems other than the RS/6000. The user can then use the faster MASS vector libraries with that same code when running on an RS/6000. See the section on Vector Code Portability for more details.

To use the faster MASS libraries in a code that has been vectorized as indicated, simply use the corresponding library name(s) in the linker command line.
For example, if the library is installed in the customary location in directory /usr/lib, then the command lines for Fortran and C would be:

    xlf progf.f -o progf -lmassv
    cc progc.c -o progc -lmassv -lm

If libmassv.a is installed in a directory other than /usr/lib, for example in /home/somebody/mass, use the -L option to add that directory to the search path:

    xlf progf.f -o progf -L/home/somebody/mass -lmassv
    cc progc.c -o progc -L/home/somebody/mass -lmassv -lm

The vector function subroutines may be used like any Fortran subroutines via a CALL statement, using the same syntax as the functions in libmassv.f.

Except for vdiv, vsincos, vcosisin, vatan2, vdfloat, vidint, and vdsign, the functions in libmassv.a and libmassvp2.a are all of the form function_name(y,x,n), where x is the source vector, y is the target vector, and n is the vector length. The arguments y and x are assumed to be long-precision (real*8) for functions whose prefix is v, and short-precision (real*4) for functions whose prefix is vs. The three-argument subroutines are all used in the same way. For example:

    .....
    DIMENSION X(500),Y(500)
    .....
    CALL VEXP(Y,X,500)
    .....

returns a vector Y of length 500 whose elements are exp(X(I)), I=1,500.

The functions vdiv, vsincos, vatan2, and vdsign are of the form function_name(x,y,z,n). Vdiv returns a vector x whose elements are y(I)/z(I), I=1,n. Vsincos returns two vectors, x and y, whose elements are sin(z(I)) and cos(z(I)) respectively. Vatan2 returns a vector x whose elements are atan(y(I)/z(I)). Vdsign returns a vector x of elements of the form dsign(y(I),z(I)). Arguments follow the same conventions as given previously.

In vcosisin(y,x,n), x is a vector of n real*8 elements and the function returns a vector y of n complex*16 elements of the form (cos(x(I)),sin(x(I))).

In vdfloat(y,l,n) and vidint(l,x,n), x and y are vectors of n real*8 elements and l is a vector of n integer*4 elements.
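The argument lists above carry over to C, where only call by reference is supported, so the vector length is passed as a pointer as well. The sketch below shows that convention with simple-loop stand-ins written in the spirit of libmassv.f. The C prototypes are assumptions modeled on the Fortran forms vdiv(x,y,z,n) and vidint(l,x,n); the massv.h header shipped with MASS remains the authority for the real declarations.

```c
#include <assert.h>

/* Simple-loop stand-in for vdiv(x,y,z,n): x(i) = y(i) / z(i), i = 1..n.
   Every argument, including the length n, is passed by reference. */
void vdiv(double *x, double *y, double *z, int *n)
{
    int i;

    assert(*n >= 0);
    for (i = 0; i < *n; i++)
        x[i] = y[i] / z[i];
}

/* Simple-loop stand-in for vidint(l,x,n): l(i) = idint(x(i)), that is,
   x(i) truncated toward zero and stored as a 4-byte integer
   (int is assumed to be 4 bytes here). */
void vidint(int *l, double *x, int *n)
{
    int i;

    assert(*n >= 0);
    for (i = 0; i < *n; i++)
        l[i] = (int)x[i];
}
```

A caller declares the length as a variable and passes its address, for example `int n = 500; vdiv(x, y, z, &n);`. On an RS/6000 the stand-in definitions would be dropped and the program linked with -lmassv instead.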
Vdfloat returns a vector of elements of the form dfloat(l(I)). Vidint returns a vector of elements of the form idint(x(I)).

When calling the vector functions from C, the user is reminded that only call by reference is supported.

Vector Code Portability

The recommended procedure for writing portable code that is vectorized to use the fast MASS vector libraries is to write in ANSI standard language and use the vector functions defined by libmassv.f. Then, to prepare to run on a system other than an IBM RS/6000, compile the application source code together with the libmassv.f source. The vector syntax to be used is visible in the libmassv.f source. It may be necessary to comment out one line of the vrsqrt subroutine, which is a directive to the IBM XLF compiler, for full portability. When running the application on an IBM RS/6000, the faster MASS vector libraries can be linked as described previously.

WARNING: Do not use libmassv.f on IBM RS/6000 systems. Use the -lmassv flag instead. The libmassv.f Fortran source vector library should be used as a portable substitute for the MASS vector libraries only on non-IBM systems.

Obtaining MASS: License and Download

The routines in MASS Version 2.7 are made available at no additional charge to users on the world-wide web under the terms and conditions that follow. They are not licensed for resale with vendor applications. For distribution with vendor applications, please see the DEVCON AIX developers disk and its accompanying license, which can be found at http://www.developer.ibm.com/devcon/titlepg.htm.

By downloading MASS Version 2.7 you agree to the following:

MASS Version 2.7 is licensed to you under the terms and conditions of your AIX license with IBM, and the following additional provisions:

Notwithstanding anything to the contrary contained in your AIX license, MASS is provided to you AS IS.
IBM MAKES NO WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IBM has no obligation to defend or indemnify against any claim of infringement, including, but not limited to, patents, copyright, trade secret, or intellectual property rights of any kind.

Obtain a MASS Version 2.7 license and download the files.

Obtaining Previous MASS Versions

The following back-level versions of MASS are also available:

  o MASS Version 2.6
  o MASS Version 2.5
  o MASS Version 2.4
  o MASS Version 2.3
  o MASS Version 2.2
  o MASS Version 2.1
  o MASS Version 1.0