Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture


Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture: Development Experiences and Performance Analysis

John Eisenlohr (Ohio Supercomputer Center), David E. Hudak (Ohio Supercomputer Center), Karen Tomko (Ohio Supercomputer Center), Timothy C. Prince (Intel Corporation)

Abstract. Applications that generate large-scale task-vector parallelism are good candidates for general-purpose manycore architectures. We implemented variants of the communication-avoiding QR algorithm to test hardware utilization and run-time efficiency of OpenMP and Cilk Plus on Intel's MIC architecture. Preliminary results show similar performance for OpenMP and Cilk Plus and demonstrate a minimal impact of application-level restructuring on performance.

1 Introduction and Objectives

Intel's Many Integrated Core (MIC) is a general-purpose manycore architecture supporting open-source operating systems and language runtimes (e.g., Linux, OpenMP, and Cilk Plus). This run-time support makes initial application porting easier relative to special-purpose manycore alternatives; for example, existing OpenMP code need not be rewritten in CUDA. However, upcoming devices based on the MIC architecture will have in excess of fifty cores and support multiple threads per core, requiring hundreds of active threads for effective hardware utilization. Meeting this goal will require a combination of increased scalability at the OS/runtime level as well as new application tuning techniques. In order to begin studying these areas, we sought an algorithm that could create a large number of parallel tasks while providing flexibility in implementation. We selected QR factorization, since it allows specification of problem size and has been extensively studied for every major HPC architectural model.

There are a number of important research questions that must be answered for the MIC architecture. What impact do the run-time systems of OpenMP and Cilk Plus have on performance? How can application-level restructuring impact performance? To examine these questions, we have implemented a number of variants of the communication-avoiding QR algorithm in both OpenMP and Cilk Plus and performed a preliminary set of experiments measuring both wall-clock performance and event-based sampling. Initial results demonstrate similar performance for OpenMP and Cilk Plus.

2 Related Work and Algorithm Implementation

QR factorization is an important numerical technique, and a tremendous amount of effort has gone into developing efficient QR algorithms and implementations for every major HPC architectural model: vector (Linpack), SMP (LAPACK), single-core cluster (ScaLAPACK), multicore cluster (PLASMA), and GPU-accelerated cluster (MAGMA). Demmel et al. [3] describe the tall-skinny QR (TSQR) and communication-avoiding QR (CAQR) algorithms. Hadri et al. [5] describe the problems of QR factorization for TS matrices and develop both the semi-parallel tile CAQR (SP-CAQR) and the fully parallel tile CAQR (FP-CAQR). Kurzak et al. [6] utilize both the host CPU cores and a GPU for SP-CAQR. Agullo et al. [1] extend the work of Hadri and Kurzak. They compare the performance of Tile QR and Tile CAQR, noting that CAQR "introduces parallelism in the panel factorization itself", i.e., by adding a binary tree reduction of partial factors. They examine the impact of tile granularity on performance and show that a dynamic scheduling-based implementation improves performance.
Anderson et al. [2] describe the CAQR algorithm running entirely on GPUs, including the panel factorization steps previously performed on CPUs [6]. Song et al. [7] extend SP-CAQR and FP-CAQR [5] to a distributed-memory implementation with decentralized dynamic scheduling. The distributed-memory approach is similar to the multiple-GPU approach [1]: the matrix is treated as D domains, with each process responsible for computing D/P domains. Dongarra et al. [4] incorporate all prior approaches into a general-purpose hierarchical algorithm targeted toward distributed-memory clusters of SMP nodes.

2.1 Overview of the CAQR Algorithm

Our implementation of the communication-avoiding QR (CAQR) decomposition is written in C, with the factorization code offloaded to a Knights Ferry (KNF) coprocessor. The offload call transfers the input matrix to device memory, converts the matrix to blocked format, and stores the resulting factors on the device, on the assumption that downstream operations that make use of the decomposition will be offloaded to the KNF as well. At the highest level, the structure of the CAQR algorithm is to apply a set of four different kernels to the tiles that make up the matrix. These kernels are applied repeatedly, in a certain order, until the factorization is complete. Each kernel application is a task, and a great deal of research has been done to determine an ordering of the QR tasks that allows for maximum concurrency. The overall flow of the factorization is determined once we have fixed on the CAQR algorithm, but we are still free to make implementation choices that affect how well the program runs on the KNF. The implementation decisions are:

1. What is the best block size and shape to use? The lowest-level subroutines should take best advantage of the KNF's level-1 cache and 512-bit-wide vector registers.

2. What is the highest level of concurrency the KNF can sustain for these kernels? For example, is it beneficial to run four threads per core?

3. How should we implement task scheduling in parallel sections of the algorithm? To make this decision, we need to know how well the KNF runtime schedules tasks in parallel sections and how much improvement we can get by customizing task scheduling.

3 Implementing CAQR on the MIC Architecture

For this study we adopt the CAQR algorithm as presented by Anderson et al. in [2]. At the highest level, the structure of the CAQR algorithm is to apply a set of four different kernels, diagrammed in Figure 1, to the tiles that make up the matrix:

1. factor: compute Householder vectors for a single block of the matrix. factor is applied within a loop over all blocks in a column of blocks. This loop is fully parallel. Applying factor to all the blocks in a column is called a factor panel operation.

2. factor_tree: compute Householder vectors for a chunk of blocks. factor_tree is applied within a loop over all chunks of blocks in a column of blocks. This loop is fully parallel. Applying factor_tree to all the blocks in a column is called a factor panel tree operation.

3. apply_qt_h: apply a block of Householder vectors horizontally across the trailing matrix. apply_qt_h is applied within a loop over an MxN array of blocks. This loop is fully parallel. Applying apply_qt_h to an array of blocks is called a trailing matrix update operation.

4. apply_qt_tree: apply Householder vectors from a chunk of blocks horizontally across chunks of blocks in the trailing matrix. apply_qt_tree is applied within a loop over an array of chunks of blocks. This loop is fully parallel. Applying apply_qt_tree to an array of chunks is called a trailing matrix tree update operation.

These kernels are applied repeatedly, panel by panel, until the factorization is complete.

[Figure 1: CAQR algorithm steps. Figure adapted from [2].]
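To make this task structure concrete, the sketch below shows one plausible shape for the panel-by-panel driver loop. The kernel names follow the text, but every signature, loop bound, and the chunking scheme are our own illustrative assumptions, not the authors' code.

/* Illustrative kernel signatures; the real kernels operate on a matrix
 * already converted to blocked format on the coprocessor. */
void factor(double *A, int blk_row, int panel);
void factor_tree(double *A, int chunk, int panel);
void apply_qt_h(double *A, int blk_row, int blk_col, int panel);
void apply_qt_tree(double *A, int chunk, int blk_col, int panel);

/* Hypothetical CAQR driver: apply the four kernels panel by panel. */
void caqr(double *A, int block_rows, int block_cols, int chunks)
{
    for (int p = 0; p < block_cols; p++) {
        /* Factor panel: fully parallel loop over the blocks in column p. */
        for (int i = p; i < block_rows; i++)
            factor(A, i, p);

        /* Factor panel tree: fully parallel loop over chunks of blocks. */
        for (int c = 0; c < chunks; c++)
            factor_tree(A, c, p);

        /* Trailing matrix update: fully parallel over an MxN array of blocks. */
        for (int i = p; i < block_rows; i++)
            for (int j = p + 1; j < block_cols; j++)
                apply_qt_h(A, i, j, p);

        /* Trailing matrix tree update: fully parallel over chunks of blocks. */
        for (int c = 0; c < chunks; c++)
            for (int j = p + 1; j < block_cols; j++)
                apply_qt_tree(A, c, j, p);
    }
}

Each inner loop here corresponds to one of the "fully parallel" loops named above; Section 3.1 enumerates the OpenMP and Cilk Plus constructs used to parallelize them.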
3.1 Task Parallelism with OpenMP and Cilk Plus

To answer questions 1 and 2 posed in Sec. 2.1, we vary the block size and the maximum number of threads, then use Intel's SEP and VTune to measure cache reuse, vectorization efficiency, and cycles per instruction. To answer question 3, we explore the following implementations of task parallelism for the stages of the algorithm (illustrative code for two of these variants appears after Sec. 4):

1. OpenMP CAQR
(a) Simple OpenMP, using a single-level omp for in all four loops: factor panel, trailing matrix update, factor panel tree, and trailing matrix tree update.
(b) Nested OpenMP, using nested omp for in the two-dimensional loops trailing matrix update and trailing matrix tree update.
(c) Chunked OpenMP, comparing static and dynamic OMP task-scheduling options and schedule chunk sizes for trailing matrix update and trailing matrix tree update.

2. Cilk Plus CAQR
(a) Simple Cilk Plus, using a single cilk_for in factor panel and factor panel tree.
(b) Nested Cilk Plus, using a single cilk_for in factor panel and factor panel tree with a nested cilk_for in trailing matrix update and trailing matrix tree update.

[Figure 2: MIC Architecture Software Development System block diagram: Knights Ferry cards (cores with L1/L2 caches and device memory) attached by dual PCIe to dual-socket Xeon cluster nodes (6 cores each, L3 cache, RAM), with nodes linked by an InfiniBand interconnection network and MPI.]

4 Experimental Results

The experiments were carried out on an Intel Knights Ferry Software Development system with a dual-socket node containing two Intel Westmere (X5680) hex-core processors running at 3.33 GHz, 24 GB of memory, and two KNF coprocessors connected via PCIe, as shown in Figure 2. The host processors run Red Hat Enterprise Linux Server release 6.1 (Santiago), kernel version 2.6.32-131.0.15.el6.x86_64. Intel's Sampling Enabling Product (SEP) is used to determine clock cycles for routines by using the CPU_CLK_UNHALTED counter, and cache misses by using the DATA_READ_MISS_OR_WRITE_MISS and DATA_READ_OR_WRITE counters. One core is dedicated to system tasks, hence we restrict our experiments to 29 cores.
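As promised in Sec. 3.1, the two fragments below illustrate variants 1(c) (Chunked OpenMP) and 2(b) (Nested Cilk Plus) applied to the trailing matrix update. The kernel apply_qt_h and the loop bounds are the same illustrative assumptions used earlier; chunk stands for the schedule chunk size being tuned.

#include <cilk/cilk.h>

void apply_qt_h(double *A, int blk_row, int blk_col, int panel);

/* Chunked OpenMP (variant 1c): static vs. dynamic scheduling and the
 * chunk size are the parameters compared in the experiments. */
void trailing_update_omp(double *A, int p, int block_rows, int block_cols,
                         int chunk)
{
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = p; i < block_rows; i++)
        for (int j = p + 1; j < block_cols; j++)
            apply_qt_h(A, i, j, p);
}

/* Nested Cilk Plus (variant 2b): nested cilk_for loops expose the full
 * two-dimensional task set to the work-stealing scheduler. */
void trailing_update_cilk(double *A, int p, int block_rows, int block_cols)
{
    cilk_for (int i = p; i < block_rows; i++)
        cilk_for (int j = p + 1; j < block_cols; j++)
            apply_qt_h(A, i, j, p);
}

The OpenMP variant fixes the work partition via its schedule clause, while the Cilk Plus variant leaves load balancing entirely to the runtime; comparing these two policies is exactly question 3 of Sec. 2.1.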
Recommended publications
  • Other APIs: What's Wrong with OpenMP?
    Threaded Programming: Other APIs

    What's wrong with OpenMP?
    • OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles: you cannot arbitrarily start/stop threads, and you cannot put threads to sleep and wake them up later.
    • OpenMP is good for programs where each thread is doing (more or less) the same thing.
    • Although OpenMP supports C++, it's not especially OO-friendly, though it is gradually getting better.
    • OpenMP doesn't support other popular base languages, e.g. Java or Python.

    Threaded programming APIs
    • Essential features: a way to create threads; a way to wait for a thread to finish its work; a mechanism to support thread-private data; some basic synchronisation methods, at least a mutex lock or atomic operations.
    • Optional features: support for tasks; more synchronisation methods, e.g. condition variables, barriers, ...; higher levels of abstraction, e.g. parallel loops and reductions.

    What are the alternatives? POSIX threads, C++ threads, Intel TBB, Cilk, OpenCL, Java (not an exhaustive list!).

    POSIX threads
    • POSIX threads (or Pthreads) is a standard library for shared-memory programming without directives, part of the ANSI/IEEE 1003.1 standard (1996).
    • The interface is a C library: there is no standard Fortran interface, and it can be used with C++ but is not OO-friendly.
    • Widely available, even for Windows; typically installed as part of the OS, and code is pretty portable.
    • Lots of low-level control over the behaviour of threads, but it lacks a proper memory consistency model.

    Thread forking:

    #include <pthread.h>

    int pthread_create(
        pthread_t *thread,
        const pthread_attr_t *attr,
        void *(*start_routine)(void *),
        void *arg);

    • Creates a new thread; the first argument returns a pointer to a thread descriptor.
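    The forking call above can be exercised with a minimal, self-contained sketch; the worker function and its argument here are our own illustration, not part of the course notes:

    #include <pthread.h>
    #include <stdio.h>

    /* Worker run by the new thread; receives the arg passed to pthread_create. */
    static void *hello(void *arg)
    {
        printf("hello from worker thread %d\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;   /* thread descriptor filled in by pthread_create */
        int id = 1;

        pthread_create(&tid, NULL, hello, &id);  /* fork the new thread */
        pthread_join(tid, NULL);                 /* wait for it to finish */
        return 0;
    }

    Compile with, e.g., cc hello.c -pthread.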
  • Heterogeneous Task Scheduling for Accelerated OpenMP
    Heterogeneous Task Scheduling for Accelerated OpenMP
    Thomas R. W. Scogland (Department of Computer Science, Virginia Tech, Blacksburg, VA 24060 USA), Barry Rountree (Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551 USA), Wu-chun Feng (Virginia Tech), Bronis R. de Supinski (LLNL)
    [email protected] [email protected] [email protected] [email protected]

    Abstract: Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also all available accelerators. In spite of this fact, the majority of programming models for heterogeneous computing focus on only one of these. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. The extensions are geared toward switching from CPU parallelism to GPU parallelism. However they do not preserve the former while adding the latter. Thus computational potential is wasted since either the CPU cores or the GPU cores are left idle.

    [From the introduction:] ... currently requires a programmer either to program in at least two different parallel programming models, or to use one of the two that support both GPUs and CPUs. Multiple models however require code replication, and maintaining two completely distinct implementations of a computational kernel is a difficult and error-prone proposition. That leaves us with using either OpenCL or accelerated OpenMP to complete the task. OpenCL's greatest strength lies in its broad hardware support. In a way, though, that is also its greatest weakness. To enable one to program this disparate hardware
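    The incremental CPU-to-GPU extension path described in this excerpt was later standardized as OpenMP's target constructs. As a hedged sketch (ours, using OpenMP 4.x syntax, not the PGI/Cray Accelerated OpenMP extensions the paper discusses), a loop can gain an offload version while the CPU version is recovered simply by omitting the target line:

    /* Offload a saxpy-style loop to an accelerator with OpenMP 4.x target
     * directives; without the target line this is the ordinary CPU loop. */
    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }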
  • Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Christopher S. Zakian, Timothy A. K. Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton
    Indiana University - Bloomington, {czakian, tzakian, adkulkar, budkahaw, [email protected]

    Abstract. Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. However, there is a significant gap between the two developments. In previous work, and in today's software packages, lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs, which can expose more parallelism, and better supports data-driven computations waiting on results from remote devices.

    1 Introduction. Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block, for example when performing IO, and then resume again.
  • Parallel Programming
    Parallel Programming: Libraries and Implementations

    Outline: MPI, the distributed-memory de facto standard, and using MPI; OpenMP, the shared-memory de facto standard, and using OpenMP; CUDA, the GPGPU de facto standard, and using CUDA; others; hybrid programming; Xeon Phi programming; SHMEM; PGAS.

    MPI library: distributed, message-passing programming. Message-passing concepts: in message-passing, all the parallelism is explicit. The program includes specific instructions for each communication: what to send or receive, when to send or receive, and synchronisation. It is up to the developer to design the parallel decomposition and implement it: how will you divide up the problem, and when will you need to communicate between processes?

    Message Passing Interface (MPI): MPI is a portable library used for writing parallel programs using the message-passing model. You can expect MPI to be available on any HPC platform you use. It is based on a number of processes running independently in parallel; the HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun). There are a number of different implementations, but all should support the MPI 2 standard. As with different compilers, there will be variations between implementations, but all the features specified in the standard should work. Examples: MPICH2, OpenMPI.

    Point-to-point communications: a message sent by one process and received by another. Both processes are actively involved in the communication, though not necessarily at the same time. A wide variety of semantics is provided: blocking vs. non-blocking; ready vs. synchronous vs. buffered; tags, communicators, wild-cards; built-in and custom data-types. Point-to-point messages can be used to implement any communication pattern; collective operations, if applicable, can be more efficient.

    Collective communications: a communication that involves all processes, where "all" means all within a communicator, i.e.
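    A minimal, self-contained sketch of the point-to-point pattern described above; the tag value and the payload are our own illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Blocking send of one int to rank 1, tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Matching blocking receive from rank 0, tag 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

    Build with mpicc and launch with at least two processes, e.g. mpiexec -n 2 ./a.out.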
  • OpenMP API 5.1 Specification
    OpenMP Application Programming Interface, Version 5.1, November 2020.
    Copyright © 1997-2020 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material is granted, provided the OpenMP Architecture Review Board copyright notice and the title of this document appear. Notice is given that copying is by permission of the OpenMP Architecture Review Board.

    Contents (excerpt):
    1 Overview of the OpenMP API
      1.1 Scope
      1.2 Glossary: Threading Concepts; OpenMP Language Terminology; Loop Terminology; Synchronization Terminology; Tasking Terminology; Data Terminology; Implementation Terminology; Tool Terminology
      1.3 Execution Model
      1.4 Memory Model: Structure of the OpenMP Memory Model; Device Data Environments; Memory Management; The Flush Operation; Flush Synchronization and Happens Before; OpenMP Memory Consistency
      1.5 Tool Interfaces: OMPT; OMPD
      1.6 OpenMP Compliance
      1.7 Normative References
      1.8 Organization of this Document
    2 Directives
      2.1 Directive Format: Fixed Source Form Directives; Free Source Form Directives; Stand-Alone Directives; Array Shaping; Array Sections
  • OpenMP Made Easy with Intel® Advisor
    OpenMP Made Easy with Intel® Advisor (Zakhar Matveev, PhD, Intel CVCG; November 2018, SC'18 OpenMP booth)

    Motivation (instead of an agenda):
    • Starting from 4.x, OpenMP introduces support for both levels of parallelism: multi-core (the omp parallel for directive) and SIMD (the omp simd directive). These are the two pillars of the OpenMP SIMD programming model.
    • Hardware with Intel® AVX-512 support gives you theoretically 8x speed-up over an SSE baseline (less, or even more, in practice).
    • Intel® Advisor assists in enabling SIMD parallelism with OpenMP (if not yet done), improving the performance of already vectorized OpenMP SIMD code, and optimizing for the memory subsystem (Advisor Roofline).

    Don't use a single vector lane! Un-vectorized and un-threaded software will underperform; threading and vectorization are both needed to fully utilize modern hardware.

    Vector parallelism in x86 has grown from SSE through AVX and AVX2 to AVX-512 (F, CD, DQ, BW, VL) across the NHM, SNB, HSW, and SKL microarchitecture generations: theoretically 8x more SIMD FLOP/S compared to your (-O2) optimized baseline. The significant leap to 512-bit SIMD is supported by the Intel® Compilers and Intel® Math Kernel Library, with strong compatibility with AVX (the added EVEX prefix enables additional functionality). Don't leave it on the table!

    Two-level parallelism decomposition with OpenMP, image-processing example:

    #pragma omp parallel for
    for (int y = 0; y < ImageHeight; ++y) {
        #pragma omp simd
        for (int x = 0; x < ImageWidth; ++x) {
            count[y][x] = mandel(in_vals[y][x]);
        }
    }

    Two-level parallelism decomposition with OpenMP, fluid-dynamics example:

    #pragma omp parallel for
    for (int i = 0; i < X_Dim; ++i) {
        #pragma omp simd
        for (int m = 0; m < n_velocities; ++m) {
            next_i = f(i, velocities(m));
            X[i] = next_i;
        }
    }

    Key components of Intel® Advisor; what's new in the "2019" release. Step 1.
  • Parallelism in Cilk Plus
    Cilk Plus: Language Support for Thread and Vector Parallelism (Arch D. Robison, Intel Sr. Principal Engineer)

    Outline: motivation for Intel® Cilk Plus; SIMD notations; fork-join notations; Karatsuba multiplication example; GCC branch.

    Multi-threading and vectorization are essential to performance. The latest Intel® Xeon® chip has 8 cores, 2 independent threads per core, and an 8-lane (single-precision) vector unit per thread: a 128-fold potential for a single socket. The Intel® Many Integrated Core architecture has >50 cores (KNC), multiple independent threads per core, and a 16-lane (single-precision) vector unit per thread: parallel heaven.

    Importance of abstraction: software outlives hardware, and recompiling is easier than rewriting; coding too closely to the hardware du jour makes moving to new hardware expensive. The C++ philosophy is abstraction with minimal penalty: do not expect the compiler to be clever, but let it do the tedious bookkeeping.

    The "three-layer cake" abstraction: message passing exploits multiple nodes; fork-join exploits multiple cores and parallelism at multiple algorithmic levels; SIMD exploits vector hardware. Composition: message-driven code composes via send/receive, fork-join composes via call/return, and SIMD composes sequentially.
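    As a hedged illustration of the fork-join and SIMD notations named in the outline (this sketch is ours, not a slide from the deck):

    #include <cilk/cilk.h>

    /* Fork-join notation: recursively sum an array, spawning one half. */
    long sum(const long *a, int n)
    {
        if (n < 1000) {               /* serial base case */
            long s = 0;
            for (int i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        long s1 = cilk_spawn sum(a, n / 2);   /* runs in parallel...   */
        long s2 = sum(a + n / 2, n - n / 2);  /* ...with this half     */
        cilk_sync;                            /* join before combining */
        return s1 + s2;
    }

    /* SIMD (array) notation: c[i] = a[i] + b[i] for all i, one statement. */
    void add(float *c, const float *a, const float *b, int n)
    {
        c[0:n] = a[0:n] + b[0:n];
    }

    The spawn/sync pair expresses the fork-join layer of the cake; the array-section statement expresses the SIMD layer, which the compiler maps to the vector unit.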
  • Introduction to OpenMP (Paul Edmon, ITC Research Computing Associate)
    Introduction to OpenMP (Paul Edmon, ITC Research Computing Associate, FAS Research Computing)

    Overview: threaded parallelism; OpenMP basics; OpenMP programming; benchmarking.

    Threaded parallelism: shared memory, single node, non-uniform memory access (NUMA), one thread per core. Threaded languages and frameworks include PThreads, Python, Perl, OpenCL/CUDA, OpenACC, and OpenMP.

    What is OpenMP? OpenMP (Open Multi-Processing) is an application program interface (API) governed by the OpenMP Architecture Review Board. OpenMP provides a portable, scalable model for developers of shared-memory parallel applications; the API supports C/C++ and Fortran on a wide variety of architectures.

    Goals of OpenMP:
    • Standardization: provide a standard among a variety of shared-memory architectures/platforms, jointly defined and endorsed by a group of major computer hardware and software vendors.
    • Lean and mean: establish a simple and limited set of directives for programming shared-memory machines; significant parallelism can be implemented with just a few directives.
    • Ease of use: provide the capability to incrementally parallelize a serial program.
    • Portability: specified for C/C++ and Fortran; implemented on most major platforms, including Unix/Linux and Windows, and by all major compilers.

    OpenMP programming model. Shared-memory model: OpenMP is designed for multi-processor/core, shared-memory machines. Thread-based parallelism: OpenMP programs accomplish parallelism exclusively through the use of threads. Explicit parallelism: OpenMP provides explicit (not automatic) parallelism, offering the programmer full control over parallelization. Compiler-directive based: parallelism is specified through the use of compiler directives embedded in the C/C++ or Fortran code. I/O: OpenMP specifies nothing about parallel I/O.
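    A hedged illustration of the "ease of use" goal above: one directive incrementally parallelizes a serial loop (the loop itself is our example, not the course's):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;

        /* A single directive parallelizes the serial loop; the reduction
         * clause gives each thread a private partial sum and combines them. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / ((double)i * i);

        printf("pi^2/6 is approximately %f\n", sum);
        return 0;
    }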
  • Parallel Programming with OpenMP
    Parallel Programming with OpenMP

    Introduction: the OpenMP programming model. Thread-based parallelism is utilized on shared-memory platforms. Parallelization is either explicit, where the programmer has full control over parallelization, or achieved through compiler directives placed in the source code. A thread is a stream of a program's code being executed; a thread of execution is the smallest unit of processing. Multiple threads can exist within the same process and share resources such as memory.

    The master thread is a single thread that runs sequentially; parallel execution occurs inside parallel regions, and between two parallel regions only the master thread executes the code. This is called the fork-join model.

    OpenMP parallel computing hardware: shared memory allows immediate access to all data from all processors without explicit communication. In a shared-memory machine, multiple CPUs are attached to the bus, all processors share the same primary memory, and the same memory address on different CPUs refers to the same memory location. The CPU-to-memory connection becomes a bottleneck, so shared-memory computers cannot scale very well.

    OpenMP versus MPI. OpenMP (Open Multi-Processing): easy to use, with loop-level parallelism; non-loop-level parallelism is more difficult; limited to shared-memory computers; cannot handle very large problems. An alternative is MPI (Message Passing Interface), which requires low-level and more difficult programming.
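    A hedged sketch of the fork-join model just described (our own example): the master thread forks a team inside the parallel region and continues alone after the join.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("master thread before the fork\n");   /* sequential part */

        #pragma omp parallel                         /* fork: team of threads */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                            /* join: implicit barrier */

        printf("master thread after the join\n");    /* sequential again */
        return 0;
    }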
  • A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    Blake A. Hechtman, Andrew D. Hilton, and Daniel J. Sorin
    Department of Electrical and Computer Engineering, Duke University

    Abstract: We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk's "work-first" principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the "work-together" principle that addresses the specific strengths and weaknesses of GPUs. The work-together principle extends work-first by stating that (a) the overhead on the critical path should be paid by the entire system at once and (b) work overheads should be paid co-operatively. We have implemented the TREES runtime in OpenCL, and we experimentally evaluate TREES applications on a CPU/GPU platform.

    [From the introduction:] Task-parallel runtimes targeting CPUs are a poor fit for GPUs. To understand why this mismatch exists, we must first understand the performance of an idealized task-parallel application (with no runtime) and then how the runtime's overhead affects it. The performance of a task-parallel application is a function of two characteristics: its total amount of work to be performed (T1, the time to execute on 1 processor) and its critical path (T∞, the time to execute on an infinite number of processors). Prior work has shown that the runtime of a system with P processors, TP, is bounded by TP ≤ T1/P + T∞ due to the greedy offline scheduler bound [3][10]. A task-parallel runtime introduces overheads and, for purposes of performance analysis, we distinguish
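    For intuition, a worked instance of the bound (our own numbers, not the paper's): a program with T1 = 80 s of total work and a critical path of T∞ = 2 s, run on P = 16 processors, is guaranteed TP ≤ 80/16 + 2 = 7 s, so a greedy scheduler achieves a speedup of at least 80/7 ≈ 11.4x.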
  • Intel Threading Building Blocks
    Praise for Intel Threading Building Blocks

    "The Age of Serial Computing is over. With the advent of multi-core processors, parallel-computing technology that was once relegated to universities and research labs is now emerging as mainstream. Intel Threading Building Blocks updates and greatly expands the 'work-stealing' technology pioneered by the MIT Cilk system of 15 years ago, providing a modern industrial-strength C++ library for concurrent programming. Not only does this book offer an excellent introduction to the library, it furnishes novices and experts alike with a clear and accessible discussion of the complexities of concurrency." — Charles E. Leiserson, MIT Computer Science and Artificial Intelligence Laboratory

    "We used to say make it right, then make it fast. We can't do that anymore. TBB lets us design for correctness and speed up front for Maya. This book shows you how to extract the most benefit from using TBB in your code." — Martin Watt, Senior Software Engineer, Autodesk

    "TBB promises to change how parallel programming is done in C++. This book will be extremely useful to any C++ programmer. With this book, James achieves two important goals: presents an excellent introduction to parallel programming, illustrating the most common parallel programming patterns and the forces governing their use; and documents the Threading Building Blocks C++ library, a library that provides generic algorithms for these patterns. TBB incorporates many of the best ideas that researchers in object-oriented parallel computing developed in the last two decades." — Marc Snir, Head of the Computer Science Department, University of Illinois at Urbana-Champaign

    "This book was my first introduction to Intel Threading Building Blocks.
  • An Overview of Parallel Computing
    An Overview of Parallel Computing (Marc Moreno Maza, University of Western Ontario, London, Ontario, Canada; CS2101)

    Plan: 1. Hardware; 2. Types of Parallelism; 3. Concurrency Platforms: Three Examples (Cilk, CUDA, MPI).

    Hardware: von Neumann architecture. In 1945, the Hungarian mathematician John von Neumann proposed this organization for hardware computers. The control unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task. The arithmetic unit performs basic arithmetic operations, while input/output is the interface to the human operator. (Example: the Pentium family.)

    Parallel computer hardware: most computers today (including tablets, smartphones, etc.) are equipped with several processing units (control + arithmetic units). Various characteristics determine the types of computations: shared memory vs. distributed memory, single-core vs. multicore processors, data-centric vs. task-centric parallelism. Historically, shared-memory machines have been classified as UMA and NUMA, based upon memory access times.

    Uniform memory access (UMA): identical processors with equal access and access times to memory. In the presence of cache memories, cache coherency is accomplished at the hardware level: if one processor updates a location in shared memory, then all the other processors know about the update. UMA architectures were first represented by symmetric multiprocessor (SMP) machines. Multicore processors follow the same architecture and, in addition, integrate the cores onto a single circuit die.

    Non-uniform memory access (NUMA): often made by physically linking two or more SMPs (or multicore processors).