
2016 International Conference on Information Engineering and Communications Technology (IECT 2016) ISBN: 978-1-60595-375-5

Study on Heterogeneous Queuing

Bao Zhenshan1,a, Chen Chong1,b, Zhang Wenbo1,c, Liu Jianli1,2, Brett A. Becker2,3

1College of Computer Science, Beijing University of Technology, 100 Ping Le Yuan, Chaoyang District, Beijing 100124, China
2Beijing-Dublin International College, Beijing University of Technology, 100 Ping Le Yuan, Chaoyang District, Beijing 100124, China
3University College Dublin, Belfield, Dublin 4, Ireland

[email protected], [email protected], [email protected]

Keywords: heterogeneous Queuing (hQ), HSA, APU, heterogeneous computing

Abstract. As year-on-year gains in CPU processing speed have slowed, heterogeneous “CPU-GPU” architectures combining multi-core CPUs with GPU accelerators have become increasingly attractive. Against this backdrop, the Heterogeneous System Architecture (HSA) standard was proposed in 2012. New Accelerated Processing Unit (APU) architectures – AMD Kaveri and Carrizo – were released in 2014 and 2015 respectively, and are compliant with HSA. These architectures incorporate two technologies central to HSA: hUMA (heterogeneous Unified Memory Access) and hQ (heterogeneous Queuing). This paper details the processes of hQ by analyzing the AMDKFD kernel driver source code. It also presents hQ performance indexes obtained by running matrix-vector multiplications on Kaveri and Carrizo experimental platforms. The experimental results show that hQ largely keeps the system from falling into kernel mode without adding overhead. We find that, compared with Kaveri, the Carrizo architecture provides better HSA performance.

1. Introduction

In recent years, as CPU performance growth has slowed, GPU acceleration has become more mainstream. Compared with CPUs, GPUs have shown their ability to provide better performance in many applications such as image processing and floating point arithmetic. As a result, heterogeneous multi-core "CPU-GPU" architectures are becoming an increasingly attractive platform, bringing increased performance and reduced energy consumption. This has opened several novel avenues of research for academia and industry. In 2012 a non-profit organization, the HSA Foundation, was established under the advocacy of AMD and proposed the HSA standard [1]. HSA aims to reduce communication latency between CPUs, GPUs and other compute devices and to make them more compatible from the programmer's perspective, by making the movement of data between devices' disjoint memories more transparent. HSA covers both hardware and software, and provides users with a unified model based on shared memory, aiming to lower the programmability barrier of heterogeneous computing. At the beginning of 2014, the APU Kaveri [2] became the first generation of hardware to implement HSA. In 2015, the APU Carrizo [2] was released, fully supporting the HSA standard. hUMA and hQ, the two core technologies of HSA, are both implemented in Kaveri and Carrizo. However, the actual versus perceived benefits of these technologies have not been fully determined. By analyzing AMDKFD [3], the HSA kernel driver integrated into the mainline Linux kernel, this paper presents the principal implementation mechanisms of hQ. By running matrix-vector multiplications [4,5] on experimental HSA platforms, the support for hQ was verified. Experimental results show that hQ provides better performance by placing the CPU and GPU on an equal footing, and that Carrizo supports hQ better than Kaveri. Our earlier related work is presented in [6]. This paper is organized as follows. We give an overview of HSA in Section 2, along with a description of the Kaveri and Carrizo architectures. In Section 3 we show how hQ works through the HSA Runtime, libhsakmt and AMDKFD. Next, experimental results and analysis are presented in Section 4. Finally, Section 5 provides conclusions.

2. HSA and APU Review

2.1 HSA

The HSA standard was proposed by the HSA Foundation in 2012, and the HSA Technical Specification 1.0 was formally issued in January 2015. HSA aims to make CPU and GPU integration more seamless, allocating workloads to the most suitable computing units to achieve true chip-level integration. In a system that complies with the HSA standard, thanks to the two key technologies, hUMA and hQ, the data of any processing unit can be accessed by the other processing units. Furthermore, all processing units can access the same virtual memory, significantly improving the computing capability of the system. hUMA removes the mutual isolation of the CPU and GPU, so that both use uniform addressing. When the CPU distributes a task to the GPU, it transfers only pointers, avoiding the transfer of the large amounts of associated data; after the GPU finishes processing, the CPU can access the results directly, significantly reducing unnecessary overhead (a minimal code sketch of this pointer-passing model is given at the end of this section). hQ removes the dominant position of the CPU in a heterogeneous system, so that the GPU can run independently, making the CPU and GPU more nearly equal than in the traditional CPU-GPU relationship. See Section 3 for more information on hQ.

2.2 APU

The AMD Accelerated Processing Unit (APU) [2,7], formerly known as AMD Fusion, places the CPU and GPU on the same chip, combining the processing performance of high performance processors with the latest discrete graphics processing technologies. Since earlier APUs did not support HSA, we do not consider them here. Kaveri, the 5th generation of APU, was launched by AMD in 2014 and was the first generation of APU to support the HSA standard. Kaveri is built on a 28nm process and has 4 "Steamroller" processor cores and 8 R7 graphics cores based on the GCN (Graphics Core Next) architecture. Carrizo, the 6th generation of APU, was launched in 2015 and fully supports the HSA 1.0 standard. Carrizo features cores 30% smaller than Kaveri's, shortening the distances between transistors. In addition, as instructions per clock increase, power consumption during computation can be further reduced. As for the GPU cores, although their scale is the same as Kaveri's, they are greatly improved in aspects such as tessellation performance, multimedia processing performance and power consumption control.
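The pointer-passing model of hUMA described in Section 2.1 can be illustrated with a minimal C sketch. It is not taken from the benchmark used in this paper; the function name make_shared_vector is hypothetical, and we assume an HSA Runtime 1.0 environment in which hsa_memory_register is available. The point is that the host buffer is never copied; only its pointer is later handed to the GPU.

#include <stdlib.h>
#include "hsa.h"

/* Sketch: share a CPU-allocated buffer with the GPU under hUMA.
 * Instead of copying the data into separate device memory, the host
 * registers the range and later places the same pointer in the kernel
 * arguments of a dispatch packet.                                     */
float *make_shared_vector(size_t n)
{
    float *vec = malloc(n * sizeof(float));   /* ordinary host allocation */
    if (vec == NULL)
        return NULL;

    /* Make the range visible to HSA agents; only the pointer travels. */
    if (hsa_memory_register(vec, n * sizeof(float)) != HSA_STATUS_SUCCESS) {
        free(vec);
        return NULL;
    }
    return vec;   /* the same address is valid on both CPU and GPU */
}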

3. Heterogeneous Queuing

hQ is one of the two central technologies defined by HSA, and is realized in a software stack consisting of the HSA Runtime, the libhsakmt library and the kernel driver. During HSA initialization, the related HSA Runtime APIs execute in order. The key API, hsa_queue_create, calls the libhsakmt library, which in turn calls AMDKFD, the first kernel driver merged into Linux 3.19 to support HSA. If this call chain succeeds, the user mode and kernel mode queues are built, and task packets can then be sent to the hardware continuously, without extra kernel mode cost, until the end of the process. Thus the building and initialization process needs to be performed only once.
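As an illustration of the call chain above, the following minimal sketch creates a user mode queue through the HSA Runtime. The helper name create_user_queue is ours, error handling is shortened, and the GPU agent is assumed to have been obtained earlier (for example via hsa_iterate_agents); this is a sketch of typical HSA Runtime 1.0 usage, not the authors' benchmark code.

#include <stdint.h>
#include "hsa.h"

/* Sketch: create a user mode queue on a GPU agent. hsa_queue_create
 * goes through libhsakmt into AMDKFD, which sets up the matching
 * kernel mode structures; afterwards packets are enqueued from user
 * mode without further kernel involvement.                           */
hsa_queue_t *create_user_queue(hsa_agent_t gpu_agent)
{
    uint32_t max_size = 0;
    hsa_queue_t *queue = NULL;

    /* Query how many packets the agent's queues may hold. */
    if (hsa_agent_get_info(gpu_agent, HSA_AGENT_INFO_QUEUE_MAX_SIZE,
                           &max_size) != HSA_STATUS_SUCCESS)
        return NULL;

    /* One call: HSA Runtime -> libhsakmt -> AMDKFD (MQD/HQD setup). */
    if (hsa_queue_create(gpu_agent, max_size, HSA_QUEUE_TYPE_MULTI,
                         NULL, NULL, UINT32_MAX, UINT32_MAX,
                         &queue) != HSA_STATUS_SUCCESS)
        return NULL;

    return queue;   /* subsequent dispatches stay in user mode */
}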


Figure 1. Process of hQ scheduling.

Fig. 1 shows an abstract framework of hQ and the two scheduling strategies for mapping a user mode queue into kernel mode. The major difference between them is whether the CP (Command Processor) is used. Without the CP, it is only necessary to build an HQD (Hardware Queue Descriptor) for the kernel queue, after which packets can be assigned to the hardware. Note, however, that in this no-CP mode most of the dispatching depends on manual configuration from the beginning. In contrast, the CP can perform the mapping automatically; this is the more common strategy and the main design objective of KFD. When the CP scheduling strategy is used, an MQD (Memory Queue Descriptor) is created to describe the basic information of the user mode queue, and the relationship between the user mode queue and kfd_process is established through the PASID (Process Address Space ID). After this combination, the user mode queue is converted into an HIQ (Heterogeneous Interface Queue), which can be mapped into the kernel mode queue. In the mapping process, both over-subscription and no over-subscription are permitted. Finally, with the help of the doorbell, a special synchronization mechanism, hQ reduces the time the system spends falling into kernel mode and puts the CPU and GPU in truly equivalent positions.
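For illustration, the user mode side of the doorbell mechanism can be sketched as follows. The function name ring_doorbell_dispatch is hypothetical and packet header construction is deliberately abbreviated (a real implementation writes the header last with release semantics and checks that the queue is not full); the sketch only shows that a dispatch amounts to a ring-buffer write plus a doorbell signal, with no system call.

#include <stdint.h>
#include "hsa.h"

/* Sketch: enqueue one AQL kernel dispatch packet and ring the doorbell.
 * The doorbell write notifies the GPU command processor directly, so
 * the dispatch does not fall into kernel mode.                        */
void ring_doorbell_dispatch(hsa_queue_t *queue,
                            const hsa_kernel_dispatch_packet_t *packet)
{
    /* Reserve a slot in the ring buffer (indices wrap by queue size). */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *slot =
        (hsa_kernel_dispatch_packet_t *)queue->base_address
        + (index % queue->size);

    *slot = *packet;   /* copy the packet body into the user mode queue */

    /* Ring the doorbell: the CP picks the packet up from user space. */
    hsa_signal_store_relaxed(queue->doorbell_signal, index);
}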

4. Experiments on Kaveri and Carrizo with Analysis

4.1 Experimental Environment

Before conducting a performance analysis on the Kaveri and Carrizo architectures, several steps needed to be completed. Our experimental platform runs Ubuntu 15.04 with kernel version 3.19. We installed the HSA-Drivers, HSA-Runtime and CLOC (CL Offline Compiler) [3] strictly in that order and used the check utility in HSA-Drivers to verify the installation. CLOC, together with SNACK (Simple No API Compiled Kernels), played an important role in compiling applications from CL kernels to HSAIL. Finally, we installed AMD CodeXL, a comprehensive tool suite that enables developers to harness the benefits of the APU. Additionally, the discrete graphics card driver had to be disabled, so it was uninstalled at the very beginning.

4.2 Results and Analysis

Because of differences in the underlying architectures, benchmarks for conventional architectures could not be used directly, and benchmarks customized for HSA platforms were still inadequate; it was therefore necessary to adapt a standard benchmark. We used matrix-vector multiplication, a common HPC benchmark. After modification, it could be run on both the Kaveri and Carrizo architectures. The experiments were divided into two parts. The first measured the execution time on Kaveri and Carrizo for matrices of the same scale using the Linux command "time"; here we focused on the overall performance of the two APUs, trying to find which yielded higher performance. The second obtained the hQ performance indexes, including the time cost of hQ and the kernel mode time during workload execution; here we aimed to observe how hQ behaves on the APUs. Based on the literature, we expected that hQ would provide better performance and that Carrizo would be more suitable for hQ than Kaveri.
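For reference, the computational core of the benchmark is ordinary matrix-vector multiplication; a minimal CPU sketch is shown below. The actual benchmark compiles a CL kernel to HSAIL with CLOC/SNACK and dispatches it through hQ, with each work-item typically computing one row of the result; the function name matvec here is illustrative.

#include <stddef.h>

/* Sketch: y = A * x for an n-by-n row-major matrix A. */
void matvec(const float *A, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}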

Figure 2. Results of matrix-vector multiplication.

Fig. 2 shows the results of matrix-vector multiplication running on Carrizo and Kaveri with the matrix size (n) increasing from 3000 to 21000. Since the number of elements of an n*n matrix grows quadratically, the execution time increases correspondingly. For all problem sizes, Carrizo performs better than Kaveri, as expected. It is worth noting that Kaveri struggles with large matrices, although it provides similar performance for small amounts of data. To investigate HSA and hQ further, several performance indexes were measured. As shown in Fig. 3 (a) and (b), the time consumed by HSA and hQ is relatively constant for all values of n, with slightly higher times on Kaveri. The figures also show that hQ, i.e. user and kernel mode queue creation, is the dominant factor in the whole HSA process, accounting for approximately 75% of the overall HSA time. Table 1 details the execution times presented in Fig. 3. It shows that the HSA time accounts for no more than 10% of the benchmark's execution time; because the benchmark's execution time increases nonlinearly while the HSA time stays roughly steady as n grows, the proportion of HSA declines constantly, to less than 1% in many cases. Although the HSA proportion is sometimes larger on Carrizo, owing to its shorter benchmark execution time, Carrizo's performance is still better than Kaveri's in absolute terms. The kernel mode time, presented in Table 2, is significantly reduced by HSA and hQ, whose setup is performed only once at initialization, after which the workload can dispatch packets without additional kernel mode time. In comparison, Carrizo spends less time in kernel mode during workload execution; on both platforms the kernel mode time is under 8.5% of overall execution time and can account for as little as 1%.

Figure 3. Time of HSA and hQ: (a) Carrizo; (b) Kaveri.

Table 1. The percentage of HSA time in execution time.

Size                3000      6000      9000      12000     15000     18000     21000
Carrizo  Total [s]  1.408     7.364     24.015    55.684    105.239   185.097   290.035
         HSA [ms]   137.251   126.2     120.268   129.209   123.088   122.725   127.242
         Pct [%]    9.748     1.714     0.501     0.232     0.117     0.066     0.044
Kaveri   Total [s]  2.023     9.208     28.684    70.410    164.372   295.387   515.209
         HSA [ms]   166.035   168.949   160.541   154.882   159.759   161.871   162.710
         Pct [%]    8.207     1.835     0.560     0.220     0.097     0.055     0.032

Table 2. The percentage of kernel mode time in execution time.

Size                3000      6000      9000      12000     15000     18000     21000
Carrizo  Total [s]  1.408     7.364     24.015    55.684    105.239   185.097   290.035
         Sys [s]    0.096     0.180     0.344     0.476     0.752     1.060     1.380
         Pct [%]    6.818     2.444     1.432     0.855     0.715     0.573     0.476
Kaveri   Total [s]  2.023     9.208     28.684    70.410    164.372   295.387   515.209
         Sys [s]    0.164     0.204     0.364     0.528     0.772     1.144     2.564
         Pct [%]    8.107     2.215     1.269     0.750     0.470     0.387     0.498

When all the results are considered, HSA and hQ occupy only a small part of the execution time while reducing the time spent falling into kernel mode, and Carrizo offers better overall performance than Kaveri.

5. Conclusion

In recent years, heterogeneous computing has become a research priority. With the establishment of the HSA Foundation, an increasing number of researchers are investigating the uses and performance of APU and HSA technologies. This paper focuses on hQ, one of the two key technologies of HSA, starting from the AMDKFD driver to analyze the basic procedures of this technology. hQ can map the user mode queue to kernel queues safely in multiple ways. With the doorbell synchronization mechanism, it reduces the time the system spends falling into kernel mode and accelerates workload execution. Meanwhile, task distribution through the kernel queue enables the CPU and GPU to function at truly equal positions. This paper also presents the results of testing these technologies on experimental platforms based on the Kaveri and Carrizo architectures. Compared with Kaveri, Carrizo has considerable advantages in overall performance. The system does not incur excessive overhead from the application of HSA and hQ, and the time spent falling into kernel mode and waiting is reduced considerably, so that superior computing performance is obtained.

Acknowledgment

This work was supported by the Major Special Project for Core Electronic Devices, High-end General Chips and Basic Software Products (2012ZX01039-004), and by the Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data.

References

[1] Information on http://www.hsafoundation.com/.
[2] Information on http://www.amd.com/en-us.
[3] Information on https://github.com/HSAFoundation.
[4] Tao Yuan, Deng Yangdong, Mu Shuai, Zhang Zhenzhong, Zhu Mingfa, Xiao Limin, Ruan Li, GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication, Concurrency and Computation: Practice and Experience, vol. 27, no. 14, 2015, pp. 3771-3789.
[5] Huang Yu-Ju, Wu Hsuan-Heng, Chung Yeh-Ching, Hsu Wei-Chung, Building a KVM-based Hypervisor for a Heterogeneous System Architecture Compliant System, ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), Atlanta, GA, USA, April 2-3, 2016.
[6] Zhang Wenbo, Chen Chong, Liu Fei, Bao Zhenshan, Liu Jianli, Linux Kernel Driver Support to Heterogeneous System Architecture, The 2015 International Conference on Computer Science (CCSE 2015), Guangzhou, China, December 18-19, 2015.
[7] Calandra Henri, Dolbeau Romain, Fortin Pierre, Lamotte Jean-Luc, Said Issam, Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil, 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2013), Belfast, United Kingdom, February 27-March 1, 2013.