Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation
Li Chen1, Lei Liu1, Shenglin Tang1, Lei Huang2, Zheng Jing1, Shixiong Xu1, Dingfei Zhang1, Baojiang Shou1

1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
{lchen, liulei2007, tangshenglin, jingzheng, xushixiong, zhangdingfei, shoubaojiang}@ict.ac.cn
2 Department of Computer Science, University of Houston, Houston, TX, USA
[email protected]

Abstract. Unified Parallel C (UPC), a parallel extension to ANSI C, is designed for high-performance computing on large-scale parallel machines. With general-purpose graphics processing units (GPUs) becoming an increasingly important high-performance computing platform, we propose new language extensions to UPC that take advantage of GPU clusters. We extend UPC with hierarchical data distribution, revise UPC's execution model to mix SPMD with a fork-join execution model, and modify the semantics of upc_forall to reflect data-thread affinity on a thread hierarchy. We implement the compiling system, including affinity-aware loop tiling, GPU code generation, and several memory optimizations targeting NVIDIA CUDA. We also introduce unified data management for each UPC thread to optimize data transfer and memory layout across the separate memory modules of CPUs and GPUs. Experimental results show that the UPC extension offers better programmability than the mixed MPI/CUDA approach. We also demonstrate that the integrated compile-time and runtime optimization is effective in achieving good performance on GPU clusters.

1 Introduction

Following closely behind the industry-wide move from uniprocessor to multi-core and many-core systems, HPC platforms are undergoing another major change: from homogeneous to heterogeneous platforms. This heterogeneity, exhibited in compute capability, speed, memory size and organization, and power consumption, is likely to grow in the near future along with the increasing number of cores. Clusters based on general-purpose graphics processing units (GPGPUs) are attractive because they can deliver order-of-magnitude performance improvements at relatively low cost. Current prevailing programming models, such as MPI and OpenMP, do not yet support heterogeneous platforms. To program today's dominant heterogeneous systems, which couple general-purpose multi-core processors with NVIDIA GPUs, programmers typically start with OpenMP, Pthreads, or another suitable programming interface, then manually identify code regions for GPU acceleration, partition the computation among GPU threads, and write optimized accelerated code in OpenCL (or CUDA, which still dominates). In addition, users have to manage data transfers and specify architecture-specific parameters when optimizing the code. For GPGPU clusters, hybrid MPI and OpenCL programming is a suggested method, but it is not easy to use, since both are low-level programming interfaces.

Partitioned Global Address Space (PGAS) languages are successful in capturing the non-uniform memory access characteristics of clusters and provide programmability very close to that of shared-memory programming. A number of PGAS languages are available: UPC, Co-array Fortran, and Titanium are the dialects for C, Fortran, and Java, respectively. For UPC, there are open-source implementations such as Berkeley UPC, GCC-based Intrepid UPC, and MTU UPC, as well as commercial ones such as Cray UPC, HP UPC, and IBM XL UPC.
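As background for the extensions that follow, the short sketch below (illustrative only, not code from the paper; the array names, sizes, and block size are arbitrary choices) shows the standard UPC global-view style: a block-distributed shared array and a upc_forall loop whose affinity expression places each iteration on the UPC thread that owns the referenced element.

#include <upc.h>             /* standard UPC header; THREADS, MYTHREAD, upc_forall,
                                upc_barrier are part of the UPC language/runtime    */
#define N 1024

/* Block-cyclic layout: each UPC thread owns blocks of 64 consecutive elements. */
shared [64] float a[N], b[N], c[N];

int main(void) {
    int i;
    /* Global-view loop: the affinity expression &c[i] makes the UPC thread
       that owns c[i] execute iteration i; no explicit work partitioning.   */
    upc_forall (i = 0; i < N; i++; &c[i]) {
        c[i] = a[i] + b[i];
    }
    upc_barrier;             /* SPMD synchronization across UPC threads */
    return 0;
}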
Previous work [15,16] demonstrated that UPC delivers flexible programmability, high performance, and portability across different platforms. UPC encourages global-view programming with upc_forall, which lays the basis for hierarchical computation partitioning.

We make three contributions in this paper. First, we extend UPC with hierarchical data distribution and revise the semantics of upc_forall to support multi-level work distribution. The affinity expression, together with the hierarchical data distribution, indicates how the corresponding iterations are mapped onto an implicit thread hierarchy. This yields code portability across traditional HPC clusters and GPU clusters. Second, we investigate the compiler analyses and runtime support needed for these language extensions. Affinity-aware loop tiling is introduced for computation partitioning, array region analysis is used for inter-UPC-thread communication and explicit data transfers for GPUs, and unified data management is added to the runtime system to optimize data transfers and memory layout across the different memory modules of CPUs and GPUs. Third, we implement several memory-related optimizations for GPUs. Experimental results show that the UPC extensions offer better programmability than the MPI+CUDA approach, and that the integrated compile-time and runtime optimization is effective in achieving good performance.

The rest of the paper is organized as follows. Section 2 presents our language extension. Section 3 outlines the compiling and runtime framework for GPU clusters, including affinity-aware loop tiling, memory layout optimization, and the unified data management system. Section 4 presents our experimental methodology and results. Section 5 discusses related work, and Section 6 contains our conclusions and future work.

2 Extending UPC with Hierarchical Parallelism

In this section, we describe our revised execution model for UPC and the extension for hierarchical data distribution. We also explain how massive parallelism is exploited through the upc_forall construct.

2.1 UPC's Execution Model on GPGPU Clusters

Standard UPC has only flat SPMD parallelism among UPC threads. To match the hierarchical thread organization of GPUs, we introduce implicit threads. Implicit threads are created in fork-join style by each UPC thread at a upc_forall and are organized into groups similar to CUDA thread blocks. Implicit threads are created and managed by the runtime; programmers have little control over them beyond their granularity and organization.

Figure 1. Extension of UPC's execution model (UPC threads fork implicit thread groups at upc_forall and join them afterwards).

Figure 1 shows the extension of UPC's execution model on a GPU cluster. Solid lines represent the original UPC threads, which exist from the very beginning of a UPC program. UPC threads have their own thread identities and are synchronized using barriers and locks. Implicit threads, represented as dotted lines, are created at the entry of a upc_forall loop, synchronize with the UPC thread that forked them at the join point, and then disappear. The result is a mix of the SPMD and fork-join models. Thread groups are the form in which implicit threads are organized. Implicit threads run entirely on GPUs; the original UPC threads do not participate in the GPU computation.
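To make the fork-join semantics concrete, the following sketch (ours, not taken from the paper; the array, its layout, and the loop body are assumed for illustration) marks where implicit threads would be forked and joined around a upc_forall under the extended model.

#include <upc.h>

#define N 4096
shared [256] float x[N], y[N];

void scale_by(float alpha) {
    int i;
    /* Fork point: under the extended model, the iterations with affinity to
       this UPC thread are executed by implicit threads, organized in groups
       (similar to CUDA thread blocks) and run entirely on the GPU.          */
    upc_forall (i = 0; i < N; i++; &y[i]) {
        y[i] = alpha * x[i];    /* loop body becomes the implicit threads' work */
    }
    /* Join point: the implicit threads synchronize with the UPC thread that
       forked them and then disappear; SPMD execution of UPC threads resumes. */
}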
In our UPC execution model, the organization of implicit threads is determined by the affinity expression of upc_forall. Since there may be multiple upc_forall loops in a UPC program, the number and organization of implicit threads vary during program execution.

2.2 Hierarchical Data Distribution

UPC maintains a partitioned global address space across the UPC threads of a cluster. We introduce hierarchical data distribution into UPC, allowing users to map data to a hierarchical machine abstraction of GPU clusters. We adopt the multi-dimensional shared-array data distribution extension of [7]. On top of it, a shared array is decomposed into a tree of data tiles and mapped to a thread hierarchy. The syntax is as follows:

shared [ub1][ub2]...[ubn], [sb1][sb2]...[sbn], [tb1][tb2]...[tbn] <type> A[d1][d2]...[dn];

Here, the sequence [ub1][ub2]...[ubn] is the first-level layout qualifier, describing the shape of a data tile: array A is decomposed into data tiles of size ∏_{1≤i≤n} ub_i at the first level, and these data tiles are distributed to UPC threads in lexicographic order and in a block-cyclic manner. We call these data tiles upc-tiles. The second- and third-level layout qualifiers are separated by commas. Each upc-tile is further split into smaller tiles, ∏_{1≤i≤n} ⌈ub_i/sb_i⌉ in number. These smaller data tiles are called subgroup-tiles and are also arranged in lexicographic order. Similarly, each subgroup-tile is again split into data tiles of size ∏_{1≤i≤n} tb_i, called thread-tiles, which establishes a parent-child relationship between subgroup-tiles and thread-tiles. Through this parent-child relationship, each data distribution implies a tree of data-tiles. For example, consider the declaration:

shared [32][32], [4][4], [1][1] float A[128][128];

Figure 2. Two hierarchies implied by the same data distribution: (a) implicit thread tree; (b) tree of data-tiles.

Figure 2 illustrates the two hierarchies implied by this three-level data distribution. The tree of data-tiles is given in Figure 2b, while Figure 2a illustrates the related thread tree. In this example, there are 16 sub-trees of implicit threads, and four such sub-trees map to the same UPC thread if THREADS=4. The meaning of data affinity differs across the three thread levels. Level-1 data tiles are physically distributed to UPC threads; the data tiles on the other levels do not indicate physical data distribution. The computation granularity of implicit threads is decided by the leaf thread-tiles, while subgroup-tiles