

From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming✩,✩✩

Peng Du (a), Rick Weber (a), Piotr Luszczek (a), Stanimire Tomov (a), Gregory Peterson (a), Jack Dongarra (a,b)

(a) University of Tennessee, Knoxville    (b) University of Manchester

Abstract

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While OpenCL was developed with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative Level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose using auto-tuning to better explore these kernels' parameter space using search heuristics.

Keywords: hardware accelerators, portability, auto-tuning

1. Introduction

People associated Graphics Processing Units (GPUs) with fast image rendering until the turn of the century. This is when the science community turned their attention to hardware predominantly discussed in computer gaming circles. One of the first attempts at non-graphical computation on a GPU was a matrix-matrix multiply [1]. In 2001, low-end graphics cards had no floating-point support; floating-point color buffers did not arrive until 2003 [2]. For the gaming industry, support for floating point meant more realistic game-play; rich, advanced lighting effects no longer suffered from the banding effects common in older generations of hardware that only allowed a single byte per color channel. For the scientific community, the addition of floating point meant that the overflow associated with fixed-point arithmetic was no longer a problem. Many research publications thoroughly analyzed the new breed of GPU hardware using languages borrowed from the graphics community [3, 4, 5]. It goes without saying that these computational advances would not have been possible if it weren't for the programmable shaders that broke the rigidity of the fixed graphics pipeline. LU factorization with partial pivoting on a GPU was one of the first common computational kernels that ran faster than an optimized CPU implementation [6]. The introduction of NVIDIA's CUDA [7, 8] (Compute Unified Device Architecture) ushered in a new era of improved performance for many applications as programming GPUs became simpler: archaic terms such as texels, fragments, and pixels were superseded by threads, vector processing, data caches and shared memory.

Further changes occurring in ATI's and NVIDIA's offerings made GPU acceleration even more pertinent to the scientific community. ATI's FireStream and NVIDIA's Fermi architecture added support for Fused Multiply-Add (FMA): a more accurate version of the former MAD (Multiply-Add) instruction. With only a single rounding step, this new instruction brings GPUs even closer to compliance with the IEEE 754 standard for floating-point arithmetic. Additionally, reworking of the cache hierarchy helped with some of the performance issues of the past. Finally, Error Correction Codes (ECC) are now used to protect the GPU device's memory as its capacity grows to the point of being vulnerable to errors induced by nature, such as cosmic radiation.

In our project called Matrix Algebra on GPU and Multicore Architectures [9] (MAGMA), we mainly focus on dense matrix routines for numerical linear algebra, similar to those available in LAPACK [10]. While CUDA is only available for NVIDIA GPUs, there are other existing frameworks that allow platform-independent programming for GPUs:

1. DirectCompute from Microsoft,
2. OpenGL Shading Language (GLSL), and
3. OpenCL.

✩ This work was supported by the SCALE-IT fellowship through grant number OR11907-001.
✩✩ This work was supported by NVIDIA, Microsoft, the U.S. National Science Foundation, and the U.S. Department of Energy.

DirectCompute allows access to graphics cards from multiple vendors. However, it is specific to Microsoft Windows, and therefore it is not portable between host Operating Systems (OS).

The OpenGL Shading Language [11] is portable across both GPU hardware and the host OS. However, it is specifically geared towards programming new graphics effects – GLSL does not have a scientific focus.

OpenCL [12] has been designed for general purpose computing on GPUs (GPGPU). It is an open standard maintained by the Khronos group with the backing of major graphics hardware vendors as well as large computer industry vendors interested in off-loading computations to GPUs. As such, there exist working OpenCL implementations for graphics cards and multi-core processors; OpenCL offers portability across GPU hardware, OS software, and multicore processors. In this work, BLAS routines are used to evaluate OpenCL's usefulness in creating a portable, high performance GPU numerical linear algebra library.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 details OpenCL's programming considerations; this involves a comparison between CUDA and OpenCL as programming models for GPUs, and a profiling analysis of the run time of each component of an OpenCL program. Sections 4 and 5 give an evaluation of both the NVIDIA and ATI GPU platforms by implementing a fast GEMM and analyzing its performance. Section 6 discusses cross-platform issues and section 7 lays out the basic structure of an auto-tuning system for cross-platform GPU math library design. Section 8 shows the performance results and section 9 concludes the paper. In the text we use the terms "ATI card" and Radeon 5870, and "Fermi card" and Tesla C2050, interchangeably.

2. Related Work

Synthetic benchmarking of GPUs was used extensively to gain an understanding of the graphics accelerators when the technical details of the hardware remained an industrial secret [13]. In the context of scientific applications, such benchmarking efforts led to algorithms that provide significant performance improvements [14].

A performance-oriented study of NVIDIA's PTX (Parallel Thread eXecution) architecture [15] was done in the context of the Ocelot project [16]. Code transformations of CUDA kernels were done by the MCUDA [17] project. The transformations enabled CUDA code to run efficiently on multicore CPUs. The inverse operation – porting multicore code to GPUs – was difficult [16].

Work on optimizing CUDA implementations of basic linear algebra kernels has demonstrated the importance of "how sensitive the performance of a GPU is to the formulation of your kernel" [18] and that an enormous amount of well thought out experimentation and benchmarking [14, 18] is needed in order to optimize performance. Optimizing OpenCL applications for a particular architecture faces the same challenges. Further, optimizing a fixed OpenCL code for several architectures is harder, even impossible, and naturally, many authors claim that OpenCL does not provide performance portability. This, along with the fact that GPUs are quickly evolving in complexity, has made tuning numerical libraries for them challenging. One approach (that we explore) to systematically resolve these issues is the use of auto-tuning, which in the context of OpenCL would involve collecting and generating multiple kernel versions, implementing the same algorithm optimized for different architectures, and heuristically selecting the best performing one. Auto-tuning has been used intensively on CPUs in the past to address these challenges and to automatically generate near optimal numerical libraries; e.g., ATLAS [19, 20] and PHiPAC [21] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [22, 23] has shown that the technique is a very practical solution for easily porting existing algorithmic solutions onto quickly evolving GPU architectures and for substantially speeding up even highly tuned hand-written kernels.

In [24], the authors¹ examined performance portability in OpenCL. In their study, they compared CUDA and OpenCL implementations of a Monte Carlo chemistry application running on an NVIDIA GTX285. They also compared the same application written in ATI's now defunct Brook+ to an OpenCL version on a FireStream 9170 and a Radeon 4870, respectively. Finally, they compared OpenCL to a C++ implementation running on multi-core processors. The paper showed that while OpenCL does provide code portability, it doesn't necessarily provide performance portability. Furthermore, they showed that platform-specific languages often, but not always, outperformed OpenCL.

¹ Rick Weber and Gregory Peterson are also authors of this paper.

3. OpenCL as a Programming Tool

To evaluate OpenCL as a programming tool for implementing high performance linear algebra routines, we pick the triangular solver (TRSM) routine from BLAS and profile each component of the program: environment setup, program compilation, kernel extraction, and execution.

TRSM solves the linear equation AX = B, where A is an upper or lower triangular matrix and B is a known matrix of right-hand sides. Its implementation involves a blocking algorithm in which the diagonal triangular blocks are inverted (TRTRI) in parallel, followed by a series of matrix multiplications (GEMM). Porting these routines from CUDA to OpenCL requires some translation.

CUDA and OpenCL have many conceptual similarities but they diverge on terminology. Table 1 shows the corresponding terms in both frameworks, while Figure 1 highlights differences in the CUDA and OpenCL software stacks. Similarly, ATI and NVIDIA GPUs have analogous platform definitions, as shown in Table 2. Table 3 shows the platform details of two different NVIDIA GPUs and one GPU from ATI/AMD. We show what these differences mean to application developers.
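As a schematic sketch of the blocked TRSM scheme described above (not the actual MAGMA code; the block size nb, the helper gemm(), and the block-indexing macros L(), invL(), B(), X() are hypothetical), a lower triangular solve L*X = B could be organized as follows:

    /* Diagonal nb-by-nb blocks of L are inverted up front (TRTRI); these
       inversions are independent and can run in parallel on the GPU.
       After that, all remaining work is expressed as GEMM calls. */
    for (int j = 0; j < n; j += nb) {
        /* X_j = inv(L_jj) * B_j, using the precomputed inverted diagonal block */
        gemm(nb, nrhs, nb,  1.0, invL(j, j), B(j), 0.0, X(j));

        /* Trailing update: B_i -= L_ij * X_j for every block row below j */
        for (int i = j + nb; i < n; i += nb)
            gemm(nb, nrhs, nb, -1.0, L(i, j), X(j), 1.0, B(i));
    }

Casting the bulk of the work as GEMM is what makes the GEMM results of Sections 4 and 5 directly relevant to TRSM performance.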


Figure 2: Comparison of Device Kernel Code Between OpenCL and CUDA.

CUDA term                        OpenCL term
host                             host
streaming multiprocessor (SM)    compute unit (CU)
scalar core                      processing element (PE)
host thread                      host program
thread                           work-item
thread block                     work-group
grid                             NDRange
shared memory                    local memory
constant memory space            constant memory
texture memory space             constant memory

Table 1: Comparison of terms used by CUDA and OpenCL to describe very similar concepts.

Figure 1: Comparison of software stacks used with CUDA and OpenCL on various hardware platforms.

3.1. Relation to CUDA

Figure 2 shows the side-by-side differences of the kernel code for the triangular inversion routine (TRTRI) in OpenCL and CUDA. The changes are in the lines annotated in red. They belong to the following categories (a minimal example follows the list):

• Obtaining the ID of the thread/work-item and block/work-group.
• The definition of shared memory in CUDA is replaced in OpenCL by local memory: the __shared__ qualifier is replaced with __local.
• OpenCL makes an explicit differentiation between global memory addresses (device memory address space) and local memory addresses (register variables or pointers to shared memory), whereas CUDA makes no such distinction.
• Syntax for synchronization primitives.
• Differences in the CUDA and OpenCL front-ends lead to different timing profiles.
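The toy kernel below (our own example, not the TRTRI code of Figure 2) illustrates the first four categories; it assumes a thread-block/work-group size of 256.

    /* CUDA version */
    __global__ void scale(float *x, float alpha, int n)
    {
        __shared__ float tmp[256];                      /* shared memory   */
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread/block ID */
        if (i < n) tmp[threadIdx.x] = x[i];
        __syncthreads();                                /* synchronization */
        if (i < n) x[i] = alpha * tmp[threadIdx.x];
    }

    /* OpenCL version */
    __kernel void scale(__global float *x, float alpha, int n)
    {
        __local float tmp[256];                         /* local memory    */
        int i = get_global_id(0);                       /* work-item ID    */
        if (i < n) tmp[get_local_id(0)] = x[i];
        barrier(CLK_LOCAL_MEM_FENCE);                   /* synchronization */
        if (i < n) x[i] = alpha * tmp[get_local_id(0)];
    }

Note the explicit __global qualifier on the pointer argument in the OpenCL version, which reflects the address-space distinction described above.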

NVIDIA term                      ATI term
streaming multiprocessor (SM)    compute unit (CU)
scalar core                      processing element (PE)
shared memory                    shared memory
texture cache                    texture cache
PTX                              IL
binary                           binary

Table 2: Comparison of terms used by ATI and NVIDIA to describe very similar concepts.

3.2. Profiling

Unlike CUDA, OpenCL requires environmental setup on the host (CPU) before launching kernels to run on the GPU. This also includes compiling the kernels. The process for setting up kernels to execute is similar for all GPU kernels, and an analysis of TRSM gives a typical breakdown of initialization. Figure 3 shows the breakdown of the initialization time.

We run oclDtrsm (our OpenCL double precision triangular solver) with M=10240 and NRHS=128 on an Intel Q9300 running at 2.5 GHz and a Tesla C2050. Setting up the kernel takes longer than the kernel execution itself.

Figure 3: Runtime break down (values are times in ms).

Compiling OpenCL source code into an intermediate representation takes the most time in initialization. We observed similar results on older NVIDIA cards (e.g. the GTX 280) and on ATI cards (e.g. the Radeon 5870) using the ATI Stream SDK [25]. On the Tesla C2050, the compilation of 300+ lines of OpenCL C code into 7000+ lines of PTX takes just over 2 seconds, while the computation on a fairly large problem takes less than 0.2 seconds. This overhead can lead to a severe performance impact if not accounted for when many OpenCL routines call each other in a software library. One solution to reduce this overhead is to separate compilation and execution.

Since OpenCL includes separate compile and build functions in its API, source code compilation can be performed once, during the deployment/installation stage of the math library. As of this writing, there is no off-line kernel compiler in NVIDIA's OpenCL platform; the documentation suggests [26] that this be implemented by the developers. We can do this by fetching the Intermediate Representation (IR) resulting from compilation using clGetProgramInfo and saving it to disk. During the initialization phase, the IR can be read from disk and processed with a call to clBuildProgram. This method reduced the time of getting the binary code ready to run from 2+ seconds to 0.2 seconds. While creating a kernel from a pre-built program still takes more time than a single TRSM, initialization is 10x faster when the raw source code isn't compiled every time the user runs the application.
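A minimal host-side sketch of this separate-compilation scheme follows; it assumes a single target device, omits error handling, and the file name and helper names are ours.

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Save the device binary of an already-built program to disk
       (done once, e.g. at library installation time). */
    static void save_program_binary(cl_program prog, const char *path)
    {
        size_t size;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
        unsigned char *bin = (unsigned char *)malloc(size);
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
        FILE *f = fopen(path, "wb");
        fwrite(bin, 1, size, f);
        fclose(f);
        free(bin);
    }

    /* At run time, recreate the program from the cached binary
       instead of recompiling the OpenCL C source. */
    static cl_program load_program_binary(cl_context ctx, cl_device_id dev,
                                          const char *path)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        size_t size = (size_t)ftell(f);
        rewind(f);
        unsigned char *bin = (unsigned char *)malloc(size);
        fread(bin, 1, size, f);
        fclose(f);

        cl_int status, err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                    (const unsigned char **)&bin,
                                                    &status, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* finalize for the device */
        free(bin);
        return prog;
    }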

Having sped up initialization, the time profile of the TRSM kernel itself is the next item to optimize. The performance of TRSM is dominated by GEMM [27]. Since GEMM is one of the most important kernels in linear algebra, we will focus on implementing and analyzing a fast OpenCL GEMM in the coming sections.

4. NVIDIA Platform

4.1. Fermi Architecture

Fermi is NVIDIA's latest GPU product line and includes the GTX4xx series and the Tesla C2050. It introduces several changes over previous GPU offerings, including more plentiful and more capable compute units and a revamped memory hierarchy. Under Fermi, more compute units are available, warp sizes have increased from 16 to 32 threads, and each compute unit issues instructions from 2 warps concurrently [28]. Multiple kernels can run simultaneously on Fermi hardware, as opposed to previous generations, which only support a single kernel executing at any given time [7]. This feature increases device utilization for matrix operations with small problem sizes by increasing the number of thread blocks beyond that which a single kernel allows. On the Tesla C2050 GPU, the ratio of double precision ALUs to single precision ALUs has increased to 1:2; for every double precision ALU, there are 2 single precision ALUs [29]. In previous generations, this ratio was 1:8. This implies the double precision peak performance (515 GFlops/s [30]) is half that of single precision (1.015 TFlops/s [30]).

Figure 4: NVIDIA C2050 (Fermi) architecture.

GPU                                NVIDIA GTX 280    NVIDIA C2050 (Fermi)    ATI Radeon 5870
Compute units                      30                14                      20
Processing elements (per unit)     8                 32                      16

Table 3: Comparison of computational resources available on NVIDIA's GTX 280 and C2050 (Fermi) and ATI's Radeon 5870.

In addition to extra compute units, Fermi revamps the memory architecture of previous generation GPUs. Fermi's retooled memory system mainly features changes in caching. Global memory loads and stores are now fetched through the L1 cache, which shares hardware with shared memory. Shared memory and L1 can be configured as 48kB/16kB or 16kB/48kB respectively. The former configuration can increase occupancy in applications that use a large amount of shared memory, while the latter configuration can decrease global memory access latencies within a compute unit. Fermi also increases the number of registers (the other limiting factor in occupancy) available to applications to 128kB per compute unit [29]. The Tesla C2050 has 144GB/s of memory bandwidth.

4.2. Fast GEMM

We have previously published a fast GEMM algorithm for the Fermi architecture [31]. This work expanded on the work of Volkov and Demmel [14], who provided a high performance matrix multiply for NVIDIA's previous generation of GPUs. The new algorithm is available in the MAGMA library and will be included in CUBLAS 3.2. Both MAGMA's and Volkov's GEMM algorithms are written in CUDA. In this work, we rewrite the MAGMA algorithm in OpenCL, tune the implementation for the Fermi architecture, and compare the performance of the OpenCL implementation to that of CUDA. Additionally, we run the MAGMA algorithm on both NVIDIA and ATI GPUs, illustrating OpenCL's cross-platform design and examining the portability of the algorithm's performance.

4.3. From CUDA to OpenCL

Given the performance of MAGMA GEMM on Fermi, the intuitive fast OpenCL implementation is one translated from the CUDA code. While CUDA offers the ability to bind a texture to global memory for direct access, which is used in MAGMA GEMM, OpenCL does not. To access matrices through the texture cache, data in global memory must be explicitly copied into an image before it can be read by GPU kernels. Figure 5 shows the performance of 2 SGEMM kernels - one using images (requiring copies) and one that directly accesses global memory.

Figure 5: MAGMA SGEMM with/without texture (copying) in OpenCL.

Figure 6: MAGMA SGEMM with CUDA and OpenCL.

Figure 7: Bandwidth when copying from a global buffer to an image for float and double types, with and without transposition.

Our kernels for copying global buffers into images don't run efficiently on Fermi. We show the memory bandwidth utilization when copying an NxN matrix into an image in Figure 7. These kernels use less than 7% of the Tesla C2050's peak bandwidth. We found that using clEnqueueCopyBufferToImage improves performance fourfold. For our SGEMM using the texture cache, we need to copy both A and B, without transposition, into images with a single floating point alpha channel. For DGEMM we copy A and B into images with red and green single precision channels, the concatenation of which represents one double precision number in the matrix. High performance data copies are important for small matrix multiplications, where there is less work to amortize them.

Figure 6 shows the CUDA performance of MAGMA's Fermi GEMM with and without textures. Not using the texture cache leads to a 5% performance drop, which means that streaming data through textures on Fermi contributes little to performance. This leads to the idea of removing the image copies to save time, which counteracts the performance drop from not using textures. Figure 5 demonstrates this by also showing the performance of the texture-free version of the same code. The crossover point shows that the copying overhead is expensive and only pays off for larger problem sizes.
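A host-side sketch of staging a matrix into an image for texture access follows. It is our own illustrative fragment: the CL_A single-float channel format mirrors the description above but may have to be CL_R or CL_RGBA depending on device support, and ctx, queue, M, N and the source buffer d_A are assumed to exist.

    /* Create a 2D image large enough for an M x N float matrix and fill it
       from an existing global buffer so kernels can read it through the
       texture cache. */
    cl_image_format fmt = { CL_A, CL_FLOAT };        /* one float per texel */
    cl_int err;
    cl_mem imgA = clCreateImage2D(ctx, CL_MEM_READ_ONLY, &fmt,
                                  (size_t)M, (size_t)N, 0, NULL, &err);

    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { (size_t)M, (size_t)N, 1 };  /* width, height, depth */
    /* The built-in copy path was roughly 4x faster than our hand-written
       copy kernels on Fermi. */
    err = clEnqueueCopyBufferToImage(queue, d_A, imgA,
                                     0,              /* byte offset into d_A */
                                     origin, region, 0, NULL, NULL);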

The way the OpenCL code is converted from CUDA is straightforward. Firstly, texture binding and fetching are removed from MAGMA's Fermi GEMM in CUDA. Then the code is translated into OpenCL using the keyword changes previously described. Most of the data I/O and computation instructions remain the same. The green line in figure 6 shows that the performance of this direct translation ramps up very slowly and drops by slightly more than 100 Gflop/s at large matrix sizes. We examine the PTX output from OpenCL and CUDA to explain the performance difference.

The kernel structure of MAGMA's Fermi GEMM is shown in figure 8. On the NVIDIA platform, both CUDA and OpenCL produce PTX, NVIDIA's intermediate representation between high-level code and device-specific assembly code. In the three yellow parts, ld.global.f32 is issued to load data from global memory, and in the blue parts, ld.shared.f32 is issued to load data from shared memory. When both operands A and B are in registers, FMA is issued to calculate C += A*B.

Figure 8: MAGMA GEMM structure.

By comparing the PTX generated by the CUDA and OpenCL compilers, the first major difference is that the CUDA compiler generates code that reads data from shared memory right before the computation that requires it. For example, the first blue and green boxes contain the following code:

    #pragma unroll
    for( int y=0; y<6; y++ )
        Bxp[y] = Bb[j1][res+y*16];
    #pragma unroll
    for( int y=0; y<6; y++ )
        Axs[y] = Abs[qot+y*16][j1];
    #pragma unroll
    for( int x=0; x<6; x++ ) {
        #pragma unroll
        for( int y=0; y<6; y++ )
            Cb[x*6+y] += Axs[x]*Bxp[y];
    }

The CUDA compiler's resultant output is:

    ld.shared.f32 %f508, [%rd53+3276];
    mov.f32       %f509, %f391;
    fma.rn.f32    %f510, %f508, %f446, %f509;
    mov.f32       %f511, %f510;
    mov.f32       %f512, %f394;
    fma.rn.f32    %f513, %f508, %f450, %f512;
    mov.f32       %f514, %f513;
    mov.f32       %f515, %f397;
    ......
    fma.rn.f32    %f525, %f508, %f466, %f524;
    mov.f32       %f526, %f525;

Here the shared memory data at address %rd53+3276 is fetched into register %f508 and 6 FMAs are issued to consume this value. This pattern allows better overlap of I/O and computation, and therefore the overhead of slow I/O can be better hidden through pipelining. The OpenCL compiler emits different PTX. The following is the equivalent PTX for the above C code:

    ld.shared.f32 %f315, [%r120+1088];
    ld.shared.f32 %f316, [%r120+2176];
    ......
    ld.shared.f32 %f326, [%r184];
    add.s32       %r184, %r184, 388;
    fma.rn.f32    %f375, %f319, %f325, %f375;
    fma.rn.f32    %f376, %f319, %f324, %f376;
    ......
    fma.rn.f32    %f386, %f318, %f326, %f386;

This PTX follows the C code directly: all the loads from shared memory are followed by all the FMAs. While executing in a loop, this pattern can stall GPU cores waiting for all the data to load from shared memory.

Another PTX discrepancy is that the OpenCL compiler does not unroll the outermost loop despite the unroll pragma. In order to make a fair comparison, we manually unrolled all the inner loops so that the unrolling pragma works on the outermost loop. The performance of this code is shown in figure 6 as 'fully unrolled'. After seeing drastic drops in performance, we found that the OpenCL compiler unrolled the outermost loop and grouped different iterations' ld.shared.f32 instructions, followed by many FMA operations. Similar results have been observed with ATI's OpenCL compiler. This grouping action further exacerbates stalls and explains the low performance. To demonstrate this claim, different combinations of manual unrolling and pragma locations were tested to counteract the OpenCL compiler's tendency of grouping similar operations. The best solution we found is putting the two blue and green parts together, forming one 16-iteration loop rather than two 8-iteration loops, which is then unrolled using a pragma with a factor of 15 (hence the name '16-1 unrolling'). This incomplete unrolling tricks the compiler into laying out the code in small groups of ld.shared.f32 and fma.rn.f32 instructions. This lifts the performance to within 50 Gflop/s of the original CUDA code, as shown in figure 6. If the compiler collected the ld.shared.f32 and fma.rn.f32 instructions into small groups interleaved with execution, the performance of the OpenCL code for this and similar operations should closely match the CUDA performance. Our ultimate objective is to elevate the performance of the sans-texture kernel to that of the CUDA non-texture kernel, and to use the texture-based kernel for large problems where the copy costs can be amortized.

5. ATI Platform

5.1. Radeon 5870 Architecture

ATI's Evergreen architecture has many analogues with NVIDIA GPUs. Like Fermi-based cards, the Radeon 5xxx GPUs have copious memory bandwidth and many parallel computation units to exploit parallelism. Our work focuses on the Radeon 5870.

The Radeon 5870 GPU has 1600 ALUs organized in groups (fig. 9). This graphics processor has 20 compute units, each of which contains 16 stream cores. Each stream core within a compute unit executes an instance of a kernel in lockstep. This SIMD hierarchy is analogous to NVIDIA's SIMT engines. However, unlike NVIDIA's Fermi architecture, each stream core is a 5-ALU Very Long Instruction Word (VLIW) processor.

Figure 9: Radeon 5870 architecture.

Threads are grouped and interleaved on compute units. Threads are interleaved to hide memory access latency. They are grouped into sets of 64 called a wavefront. A wavefront is analogous to a warp on NVIDIA hardware. Of these 64 threads, 16 execute on a given clock cycle on the 16 stream cores within a compute unit. Over the course of 4 clock cycles, all threads are interleaved, each executing an instruction. This hides memory latency; while one thread is loading data, another thread can be performing computation [32].

Each ALU in a stream core can independently perform basic floating point operations. In single precision, each ALU can issue one basic floating point instruction, such as subtract, multiply, Multiply-Add (MAD), etc., per clock cycle with a pipeline latency of 8 cycles. In addition to the basic floating point instructions (such as multiply, add, subtract, divide, and MAD), the fifth ALU can perform transcendental functions including log, square root, etc. To perform double precision floating point operations, some number of the ALUs are fused together. In the case of Fused Multiply-Add (FMA), four of the five ALUs are used per operation [32]. The fifth unit is free to do integer or single precision calculations during this time. Since four ALUs are fused per operation and the fifth does not perform double precision operations, the peak double precision throughput is 1/5 that of single precision. The Radeon 5870's 1600 ALUs run at 850MHz, yielding a peak throughput of 2.72TFlops/s in single precision and 544GFlops/s in double. Like NVIDIA's offerings, the Radeon 5870 has high off-chip memory bandwidth and even higher on-chip bandwidth.

To balance the performance of the floating point units, the Radeon 5870 features high bandwidth to global memory augmented with a texture cache. Global memory has a peak data throughput of 154GB/s divided over 8 memory controllers. Unlike the Fermi architecture, reads and writes to global memory are generally not cached. However, reads and writes to textures are cached. Each compute unit has its own 8KB L1 cache, yielding an aggregate bandwidth of 1TB/s, and can produce 4 bytes of data (1 float or half a double) per cycle per stream core. Multiple compute units share a 512kB L2 cache with 435GB/s of bandwidth between L1 and L2. In addition to the automatically controlled texture caches, data reuse can also be facilitated using shared memory [32].

The Radeon 5870 features 32KB of shared memory per compute unit. This shared memory provides 2TB/s of aggregate bandwidth. As with the Fermi architecture, local memory usage dictates how many concurrent wavefronts can run on a compute unit. Each compute unit can produce two 4-byte (2 floats or 1 double) shared memory requests per cycle. As with NVIDIA cards, bank conflicts can hurt shared memory performance and need to be minimized. Since each stream core can perform 5 MADs per cycle in single precision, more bandwidth is needed for operands than shared memory can provide. This is also true for double precision. As such, register blocking becomes crucial in obtaining a high performance GEMM kernel [32].

Registers provide the highest memory bandwidth on-chip. The Radeon 5870 has 256KB of register space per compute unit (5.1MB for the whole GPU) and can produce 48 bytes/cycle of data per stream core. Results generated on the previous cycle and used as operands don't count towards this limit, as they can be forwarded using the Previous Vector or Previous Scalar register [32]. Each of the 5 MADs per cycle takes 4 operands, yielding 20 total operands of 4 bytes a piece. 4 of these operands can be covered by the previous vector register, and one operand can be shared among the 5 ALUs (saving another 4 reads), leaving 12 operands, or exactly the 48 bytes/cycle needed to fully utilize all 5 ALUs in a perfectly scheduled SGEMM. If the scheduling is not perfect, registers can actually serve as a bottleneck in the computation. For DGEMM, registers provide sufficient bandwidth for the single FMA instruction per stream core regardless of operand reuse. Like shared memory, register usage dictates the number of concurrent wavefronts executing on a compute unit. Unlike NVIDIA GPUs, registers are 128-bit and support swizzling at no performance cost.

5.2. Fast GEMM

We present fast GEMM algorithms for single and double precision optimized for ATI hardware. These kernels are based on Nakasato's matrix multiply written in ATI's Intermediate Language (IL) [33].

Figure 10: ATI memory bandwidth utilization during image copies.

Figure 11: ATI DGEMM algorithm.

That work focused on computing the row major matrix multiplication C = A^T B. We expanded it to perform the column major C = αAB^T + βC. Our algorithm makes extensive use of the texture cache, which requires copies in OpenCL. We use custom copy kernels to move data into images configured with RGBA color mode and float values. We pad the leading dimension of the image with zeros if needed. For single and double precision, we pad the leading dimension of the image to a multiple of 4 floats and 2 doubles, respectively. In double precision, the concatenation of the red and green channels represents the first double, while the blue and alpha channels represent the second double. A is copied without transposition into its corresponding padded image while B is transposed and copied into its image. The time to perform these copies is O(N^2), which is amortized by the O(N^3) operations in GEMM for large problems. Copying A into an image uses the Radeon 5870's memory bandwidth efficiently (fig. 10), while copying B doesn't.

The kernels copying data from global memory into images achieve a high fraction of the Radeon 5870's available bandwidth. For non-transposed copies, our kernels used over 100GB/s of the 154GB/s of available bandwidth with little variance. Our transpose-and-copy kernels achieved half that amount and were far more sensitive to problem size. Poor memory coalescing in the transpose kernels is to blame for the low memory throughput. These kernels, and the ones needed for texture use in the NVIDIA texture-based kernel, ran an order of magnitude more quickly on the Radeon 5870 than on the Tesla C2050. Oddly enough, our float and double copy kernels were significantly faster than clEnqueueCopyBufferToImage on the Radeon 5870.

Once A and B have been copied, the matrix multiplication kernel executes. Our DGEMM algorithm is shown in fig. 11. Each thread computes a single 4x4 block of C as double2s. Since B is transposed, the columns of B reside in its leading dimension. The rows of A reside in its leading dimension. We scale the double2 row vectors of A by the corresponding double of B, using swizzling to extract and duplicate the required data element. This scaled vector is then accumulated into the corresponding double2 of C. All of this is done with a single MAD and swizzling.

To maximize the number of outstanding loads from the texture cache, we use 8 samplers. 2 samplers load 2 double2s from A into registers and 2 samplers fetch 2 double2s of B. We unroll the k loop twice to use the other 4 samplers. A useful feature of images is that when a sampler is declared using CL_ADDRESS_CLAMP, fetching outside the bounds of the 2D image yields 0.0, which has no effect on the result when accumulated. This allows k to be indivisible by 2 yet still function correctly; in this case, our DGEMM kernel samples beyond the bounds of A and B and accumulates 0.0 into the result. Handling cases where m is not a multiple of 2 is non-trivial and our algorithm currently doesn't handle this case. In fact, our algorithm requires both m and n to be multiples of 4. Padding C can overcome this limitation. SGEMM is analogous to our DGEMM kernel, where each thread computes an 8x8 block of C in float4 registers, and it has analogous limitations.
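Inside a kernel, the clamped sampler and the two-floats-per-double packing described above can be expressed roughly as follows. This is an illustrative kernel of ours, not the actual GEMM, and it assumes the device supports the cl_khr_fp64 extension.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    /* Out-of-bounds reads return 0.0 with this addressing mode, which is
       harmless once the value is accumulated into C. */
    const sampler_t samp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

    __kernel void read_packed_doubles(__read_only image2d_t A,
                                      __global double2 *out, int col)
    {
        int row = get_global_id(0);
        /* One RGBA float4 texel carries two doubles: (r,g) and (b,a). */
        float4 t = read_imagef(A, samp, (int2)(col, row));
        out[row] = as_double2(t);   /* reinterpret the 16 bytes as a double2 */
    }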


Figure 12: ATI SGEMM performance.

Figure 13: ATI DGEMM performance.

We compare our OpenCL results to Nakasato's IL performance in figures 12 and 13. Nakasato's timings do not include copy times, which exaggerates performance for small problems. Our timings do include this copy. Furthermore, Nakasato's algorithm performs only the matrix multiplication while we perform the alpha and beta scaling (a negligible amount of time for large N). Neither timing includes PCIe data transfer times, which would be amortized in large multiplications or when data can be heavily reused on the GPU.

Our SGEMM kernel exceeds 1.3TFlops/s for N=3840, 4352, and 4864. Nakasato's IL implementation just exceeds 2TFlops/s for these same N, implying our OpenCL MAGMA implementation achieves 65% of the performance of a fast matrix multiply algorithm written in high-level assembly code. Nakasato achieves 74% of the Radeon 5870's peak performance while we achieve 49%.

Our OpenCL ATI DGEMM algorithm achieves 308GFlops/s on the Radeon 5870. In comparison, Nakasato's matrix multiply algorithm achieves 472GFlops/s. This means that our OpenCL implementation has 69% of the performance of Nakasato's IL matrix multiply. The ATI OpenCL kernel computes at 57% of the hardware's peak performance while Nakasato's kernel operates at 87% of maximum throughput. From this performance comparison, we illustrate that OpenCL provides fair performance with a high degree of programmability on ATI hardware. Furthermore, we found that the relevant data copy and pack kernels effectively used the Radeon 5870's memory bandwidth (fig. 10).

6. Performance Portability

In this section, we run our device-specific kernels on hardware for which they aren't tuned, to evaluate performance portability. OpenCL is designed with program portability in mind. Despite different vendors having added extra functionality to their OpenCL implementations, our work only uses features in the OpenCL standard [12]. This theoretically allows the front-end and GPU kernels to run on any platform without changes.

Figure 14: Fast ATI SGEMM running on the Fermi card in OpenCL.

Figure 15: Fast Fermi SGEMM running on the ATI Radeon 5870 card in OpenCL.

Figure 14 shows our ATI SGEMM kernel running on a Tesla C2050. While achieving 1+ Teraflop/s (50+% of peak) on a Radeon 5870, it only manages to execute at 40 Gflop/s (4% of peak) on the Tesla C2050. We reverse this experiment in figure 15 and run the OpenCL version of MAGMA's Fermi GEMM on the Radeon 5870. While achieving 400+ Gflop/s (40+% of peak) on the Tesla C2050, it has very low performance when run on the Radeon 5870. Through reading the generated IL, we suspect that despite hints that the arrays holding the operands should live in registers, they actually reside in global memory, leading to orders of magnitude more data fetch latency. We suspect this is because the compiler fails to fully unroll the loops accessing these arrays and as such generates code that performs runtime indexing (which can't occur on registers). To resolve this, the arrays are replaced with standalone variables and performance shoots up to 600 Gflop/s (22% of peak).
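The kind of rewrite involved is illustrated by the pair of toy kernels below (our own example, far simpler than the real GEMM): the first keeps an accumulator in a private array indexed inside a loop, the second uses standalone variables.

    __kernel void axpy4_array(__global float4 *c, float4 a, float b)
    {
        size_t i = get_global_id(0);
        float acc[4] = { c[i].x, c[i].y, c[i].z, c[i].w };
        float av[4]  = { a.x, a.y, a.z, a.w };
        for (int k = 0; k < 4; k++)   /* runtime indexing may spill acc[] to memory */
            acc[k] += av[k] * b;
        c[i] = (float4)(acc[0], acc[1], acc[2], acc[3]);
    }

    __kernel void axpy4_scalars(__global float4 *c, float4 a, float b)
    {
        size_t i = get_global_id(0);
        float c0 = c[i].x, c1 = c[i].y, c2 = c[i].z, c3 = c[i].w;
        c0 += a.x * b;                /* standalone variables stay in registers */
        c1 += a.y * b;
        c2 += a.z * b;
        c3 += a.w * b;
        c[i] = (float4)(c0, c1, c2, c3);
    }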
From these two experiments we show that performance is not portable by simply using OpenCL. Without paying attention to the underlying architecture and designing algorithms accordingly, an algorithm's performance suffers.

7. Performance Portability with Auto-tuning

The goal behind the OpenCL standard is to provide functional portability, or in other words, to enable a single OpenCL application to run across a variety of hardware platforms. Although extremely important, functional portability by itself, i.e., without performance portability, would be insufficient to establish OpenCL in the area of high-performance scientific computing. In this section we address this issue by discussing auto-tuning – the vehicle that we recognize as the potential driver of OpenCL applications towards performance portability.

Automatic performance tuning (optimization), or auto-tuning for short, is a technique that has been used intensively on CPUs to automatically generate near-optimal numerical libraries. For example, ATLAS [19, 20] and PHiPAC [21] are used to generate highly optimized BLAS. The main approach for doing auto-tuning is based on empirical optimization techniques.

Namely, these are techniques that generate a large number of parametrized code variants for a given algorithm and run these variants on a given platform to discover the one that gives the best performance. The effectiveness of empirical optimization depends on the chosen parameters to optimize and on the search heuristic used. A disadvantage of empirical optimization is the time cost of searching for the best code variant, which is usually proportional to the number of variants generated and evaluated. Therefore, a natural idea is to combine it with a "model-driven" approach in a first stage that limits the search space for the second, empirical, stage of the search.

Work on auto-tuning CUDA kernels for NVIDIA GPUs [22, 23] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions onto quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.

7.1. Auto-tuning Infrastructure

The performance of CUDA GEMM implementations relies on a number of very well selected parameters and optimizations [18]. Previous work in the area has managed to auto-tune the selection of these parameters and optimizations in order to quickly find the best performing implementations for particular cases of GEMM [22, 23]. However, with the introduction of the Fermi architecture, these auto-tuning frameworks were not able to find the new "optimal" implementations for Fermi, simply because their search space did not consider the newly introduced features of the architecture [31]. Performance portability problems are aggravated even further when porting kernels across hardware vendors: kernels optimized for one vendor's architecture perform poorly on another vendor's architecture, as illustrated throughout the paper with the GEMM kernels optimized correspondingly for NVIDIA and ATI GPUs. Therefore, our work on providing performance portability has concentrated on building an auto-tuning infrastructure with the following two components (characteristic of a complete auto-tuning system):

Code generator: The code generator produces code variants according to a set of pre-defined, parametrized templates and/or algorithms. Currently we have identified and collected the best GEMM candidates for both NVIDIA and ATI GPUs, and we have identified several key parameters that affect performance. The code generator automatically creates kernels using these parameters and applying certain state-of-the-art optimization techniques.

Heuristic search engine: The heuristic search engine runs the variants produced by the code generator and discovers the best one using a feedback loop, e.g., the performance results of previously evaluated variants are used as guidance for the search over currently unevaluated variants.

7.2. Tuning GEMM

To illustrate the effect of auto-tuning we present numerical results from tuning SGEMM. In particular, we parametrize the new Fermi GEMM kernels [31]. Figure 16 gives the single precision performance of several versions derived from the original and run on a Fermi C2050 GPU. The parametrization is based on the size of the submatrix computed by a thread block.

Figure 16: Performance of various automatically generated SGEMM kernels for the Fermi C2050 GPU.
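A condensed sketch of the empirical search stage is shown below. It is exhaustive over a small grid rather than using the heuristic feedback loop described above, and it assumes ctx, dev, a profiling-enabled queue, the parametrized kernel source string src, the argument setup, and the launch geometry gws/lws already exist; the BLK_M/BLK_N macro names and the kernel name sgemm are ours.

    double best_time = 1e30;
    int best_m = 0, best_n = 0;
    static const int blk[] = { 16, 32, 48, 64, 80, 96 };

    for (int i = 0; i < 6; i++)
      for (int j = 0; j < 6; j++) {
        char opts[128];
        snprintf(opts, sizeof(opts), "-DBLK_M=%d -DBLK_N=%d", blk[i], blk[j]);

        cl_program p = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        if (clBuildProgram(p, 1, &dev, opts, NULL, NULL) != CL_SUCCESS) {
            clReleaseProgram(p);            /* skip variants that fail to build */
            continue;
        }
        cl_kernel k = clCreateKernel(p, "sgemm", NULL);
        /* ... clSetKernelArg calls for this variant ... */
        cl_event ev;
        clEnqueueNDRangeKernel(queue, k, 2, NULL, gws, lws, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t0, t1;                    /* device timestamps in nanoseconds */
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
        double t = (t1 - t0) * 1e-9;
        if (t < best_time) { best_time = t; best_m = blk[i]; best_n = blk[j]; }

        clReleaseEvent(ev); clReleaseKernel(k); clReleaseProgram(p);
      }

A model-driven first stage, as mentioned above, can prune most of this grid before any kernel is ever run.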

8. Performance Results on Multicore

So far we have presented performance results for the GPU platforms from ATI and NVIDIA that we were able to obtain using the CUDA and OpenCL software stacks. To put these numbers in a proper perspective, it is informative to compare them to the results obtained from optimized BLAS routines coming from ATLAS and Intel's MKL (Math Kernel Library) running on multicore hardware – a well researched experimental setting. We use ATLAS as a representative of the performance levels achievable from C with extensive use of auto-tuning (the tuning of the ATLAS library may take many hours). Intel's MKL represents one of the best performing libraries for Intel processors. When such libraries are distributed by the hardware vendor, most performance sensitive kernels are written in optimized assembly code. The results from the Intel Tigerton multicore system are shown in Figure 17. The tested system had 4 processors, each one a quad-core clocked at 2.4 GHz, for a total peak performance of 307.2 Gflop/s in single precision and half of that in double precision.

Figure 17: Performance of xGEMM routines on a multicore processor with 16 cores (Intel Tigerton, 4x quad-core).

The summary of the results is presented in Table 4 in relative terms (as the percentage of peak performance) to allow easy comparison of the tested platforms regardless of their absolute achieved performance. A common theme may be observed from this summary: computational kernels written in a low-level language (such as ATI's IL and assembly) achieve around 80% of peak performance while high-level languages achieve about 50% of peak.

                         Single precision    Double precision
ATI OpenCL               52%                 57%
ATI IL                   74%                 87%
NVIDIA CUDA              63%                 58%
NVIDIA OpenCL            57%                 49%
Multicore (vendor)       79%                 79%
Multicore (auto-tuned)   46%                 40%

Table 4: Comparison of efficiency achieved on the tested hardware.

9. Conclusions

In this paper, we evaluated various aspects of using OpenCL as a performance-portable method for GPGPU application development. Profiling results show that the environment setup overhead is large and should be minimized. Performance results for both the Tesla C2050 and the Radeon 5870 show that OpenCL has good potential to be used to implement high performance kernels, so long as architectural specifics are taken into account in the algorithm design. Even though good performance should not be expected from blindly running algorithms on a new platform, auto-tuning heuristics can help improve performance on a single platform. Putting these factors together, we conclude that OpenCL is a good choice for delivering a performance-portable application for multiple GPGPU platforms.

References

[1] E. S. Larsen, D. McAllister, Fast matrix multiplies using graphics hardware, in: Proceedings of Supercomputing 2001, Denver, CO, 2001.
[2] Á. Moravánszky, Dense matrix algebra on the GPU, available at: http://www.shaderx2.com/shaderx.PDF (2003).
[3] J. Krüger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Transactions on Graphics 22 (2003) 908–916.
[4] J. D. Hall, N. A. Carr, J. C. Hart, Cache and bandwidth aware matrix multiplication on the GPU, Tech. Rep. UIUCDCS-R-2003-2328, UIUC (2003).
[5] K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in: Graphics Hardware 2004, 2004, pp. 133–137.
[6] N. Galoppo, N. K. Govindaraju, M. Henson, D. Manocha, LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware, in: Proceedings of Supercomputing 2005, Seattle, WA, 2005.
[7] NVIDIA, NVIDIA CUDA Programming Guide, Version 3.0, NVIDIA Corporation, 2010.
[8] NVIDIA, NVIDIA CUDA Best Practices Guide, Version 3.0, NVIDIA Corporation, 2010.
[9] S. Tomov, R. Nath, P. Du, J. Dongarra, MAGMA version 0.2 User Guide, http://icl.cs.utk.edu/magma (11/2009).
[10] E. Anderson, Z. Bai, C. Bischof, S. L. Blackford, J. W. Demmel, J. J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. C. Sorensen, LAPACK User's Guide, Third Edition, Society for Industrial and Applied Mathematics, Philadelphia, 1999.
[11] R. J. Rost, D. Ginsburg, B. Licea-Kane, J. M. Kessenich, B. Lichtenbelt, H. Malan, M. Weiblen, OpenGL Shading Language, 3rd Edition, Addison-Wesley Professional, 2009.
[12] A. Munshi (Ed.), The OpenCL Specification, Khronos OpenCL Working Group, 2009, version 1.0, Document Revision 48.
[13] M. Papadopoulou, M. Sadooghi-Alvandi, H. Wong, Micro-benchmarking the GT200 GPU, Tech. rep., Computer Group, ECE, University of Toronto (2009).
[14] V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Supercomputing 08, IEEE, 2008.
[15] NVIDIA, NVIDIA Compute PTX: Parallel Thread Execution, 1st Edition, NVIDIA Corporation, Santa Clara, California, 2008.
[16] A. Kerr, G. Diamos, S. Yalamanchili, A characterization and analysis of PTX kernels, in: Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009, pp. 3–12.
[17] J. Stratton, S. Stone, W.-m. Hwu, MCUDA: An efficient implementation of CUDA kernels on multi-cores, Tech. Rep. IMPACT-08-01, University of Illinois at Urbana-Champaign, http://www.gigascale.org/pubs/1278.html (Mar. 2008).
[18] M. Wolfe, Special-purpose hardware and algorithms for accelerating dense linear algebra, HPC Wire, http://www.hpcwire.com/features/33607434.html.
[19] R. C. Whaley, A. Petitet, J. J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (1–2) (2001) 3–35.
[20] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. C. Whaley, K. Yelick, Self-adapting linear algebra algorithms and software, Proc. IEEE 93 (2) (2005) 293–312. doi:10.1109/JPROC.2004.840848.
[21] J. Bilmes, K. Asanovic, C.-W. Chin, J. Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, in: ICS '97: Proceedings of the 11th International Conference on Supercomputing, ACM, New York, NY, USA, 1997, pp. 340–347. doi:10.1145/263580.263662.
[22] Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, in: ICCS 2009, Vol. 5544 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2009, pp. 884–892.
[23] R. Nath, S. Tomov, J. J. Dongarra, Accelerating GPU kernels for dense linear algebra, in: Proceedings of VECPAR'10, Berkeley, CA, 2010.
[24] R. Weber, A. Gothandaraman, R. J. Hinde, G. D. Peterson, Comparing hardware accelerators in scientific applications: A case study, IEEE Transactions on Parallel and Distributed Systems 99 (RapidPosts). doi:10.1109/TPDS.2010.125.
[25] ATI, ATI Stream SDK v2.1, available at: http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx (2010).
[26] NVIDIA, NVIDIA OpenCL JumpStart Guide: Technical Brief, version 0.9 (April 2009).
[27] P. Du, P. Luszczek, S. Tomov, J. Dongarra, Mixed-tool performance analysis on hybrid multicore architectures, in: Proceedings of the First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010), San Diego, CA, 2010.
[28] Tuning CUDA Applications for Fermi, Version 1.0, available at: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_FermiTuningGuide.pdf (Feb. 2010).
[29] NVIDIA, NVIDIA's next generation compute architecture: Fermi, v1.1, Tech. rep., NVIDIA Corporation (2009).
[30] NVIDIA, Tesla C2050/C2070 GPU Computing Processor: Supercomputing at 1/10 the Cost, available at: http://www.microway.com/pdfs/Tesla_C2050_C2070_Final_lores.pdf (2010).
[31] R. Nath, S. Tomov, J. Dongarra, An improved MAGMA GEMM for Fermi GPUs, Tech. Rep. 227, LAPACK Working Note (July 2010). URL: http://www.netlib.org/lapack/lawnspdf/lawn227.pdf
[32] ATI, ATI Stream Computing OpenCL Programming Guide, available at: http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf (June 2010).
[33] N. Nakasato, Matrix Multiply on GPU, http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply.
