Performance Comparison Between Java and JNI for Optimal Implementation of Computational Micro-Kernels
Nassim Halli (Univ. Grenoble, France; Aselta Nanographics, France), Henri-Pierre Charles (CEA-LIST, France), Jean-François Mehaut (Univ. Grenoble, France)
[email protected]
arXiv:1412.6765v1 [cs.PF] 21 Dec 2014

ABSTRACT
General purpose CPUs used in high performance computing (HPC) support a vector instruction set and an out-of-order engine dedicated to increasing instruction-level parallelism. Hence, related optimizations are currently critical to improve the performance of applications requiring numerical computation. Moreover, the use of a Java run-time environment such as the HotSpot Java Virtual Machine (JVM) in high performance computing is a promising alternative. It benefits from Java's programming flexibility and productivity, while performance is ensured by the Just-In-Time (JIT) compiler. The JIT compiler, however, suffers from two main drawbacks. First, the JIT is a black box for developers: we have no control over the generated code, nor any feedback from its optimization phases such as vectorization. Secondly, the time constraint narrows down the degree of optimization compared to static compilers like GCC or LLVM. It is therefore compelling to use statically compiled code, since it benefits from additional optimizations that reduce performance bottlenecks. Java can call native code from dynamic libraries through the Java Native Interface (JNI). Nevertheless, JNI methods are not inlined and cost more to invoke than Java methods. Therefore, to benefit from better static optimization, this call overhead must be amortized by the amount of computation performed at each JNI invocation. In this paper we tackle this problem and propose to carry out this analysis for a set of micro-kernels. Our goal is to select the most efficient implementation considering the amount of computation defined by the calling context. We also investigate the impact on performance of several different optimization schemes: vectorization, out-of-order optimization, data alignment, method inlining, and the use of native memory for JNI methods.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Optimization, Run-time environments; D.2.8 [Software Engineering]: Metrics - Complexity measures, Performance measures

Keywords
Java, JNI, HPC, Performance, Vectorization

1. INTRODUCTION AND MOTIVATION
The initial motivation for this study is to improve the performance of a Java application which intensively uses a small set of computational micro-kernels. Java has become an important general-purpose programming language, and industry expresses great interest in using it for scientific and engineering computation [6, 11]. High performance scientific Java, however, suffers from a lack of optimized libraries and low-level control [6]. This is in part due to the machine-independent bytecode representation of Java, which prevents target-specific optimization at the source code level. In a dynamically compiled environment, optimizations are delegated to the JIT compiler, which is in charge of translating bytecode into machine code at run-time using dynamic information. JIT optimizations like vectorization are essential for performance and still remain a major concern [2, 3, 8]. They enable programmers to maintain machine-independent code and save them from writing several versions for different target architectures. That is why, in Java applications, pure Java methods are often preferred to native ones.
The JIT compiler, however, suffers from two main drawbacks. First, the JIT is a black box for developers: we have no control over the generated code, nor any feedback from its optimization phases such as vectorization. Secondly, the time constraint narrows down the degree of optimization [7, 2] compared to static compilers like GCC or LLVM. As a result, it can be interesting to use statically compiled code to benefit from deeper optimization of performance bottlenecks, which usually represent a small part of the application. Java applications can call native code from dynamic libraries through the Java Native Interface (JNI). Nevertheless, JNI methods are not inlined and cost more to invoke than Java methods. Thus, to benefit from better static optimization, this call overhead must be amortized by the amount of computation performed at each JNI invocation. This is what we call the flop-per-invocation.
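To make the trade-off concrete, the sketch below shows the two implementation choices for a simple dot-product micro-kernel. It is a minimal illustration, not code from the paper: the class, method and library names are hypothetical.

    // Illustrative sketch: the two implementation choices for a
    // dot-product micro-kernel. All names are hypothetical.
    public class Kernels {
        static { System.loadLibrary("kernels"); } // hypothetical JNI library

        // Pure Java version: translated to machine code by the JIT,
        // which may or may not vectorize the loop.
        public static double dotJava(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                s += a[i] * b[i];
            }
            return s;
        }

        // JNI version: the body is statically compiled native code, so
        // it can be aggressively optimized, but each call pays the JNI
        // overhead and is never inlined by the JIT.
        public static native double dotNative(double[] a, double[] b);
    }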
Considering these aspects (summarized in Table 1), selecting the most efficient implementation between JNI and Java requires an analysis that takes the flop-per-invocation into account. In this paper we tackle this problem and propose to carry out this analysis for a set of micro-kernels, in order to select the most efficient implementation for a given range of flop-per-invocation. We also investigate how the flop-per-invocation impacts the performance of several different optimization schemes: vectorization, out-of-order optimization, data alignment, method inlining, and the use of native memory for JNI methods.

Table 1: Performance drawbacks and benefits overview
        Drawbacks                      Benefits
JIT     Lower level of optimization    No call overhead
JNI     Higher call overhead           Higher level of optimization

The study is performed using the Java HotSpot Virtual Machine on a Sandy Bridge x86-64 architecture, which supports the AVX (Advanced Vector Extensions) instruction set for vectorization. Furthermore, we use GCC for the static compilation of the micro-kernels. Section 2 provides background on performance analysis and Section 3 on the optimizations considered in this study. Section 4 presents our benchmark methodology and experimental conditions. In Section 5 we present and discuss the obtained results.

2. PERFORMANCE METRICS
2.1 Arithmetic intensity
We call flop-per-invocation the number of floating point operations (flop) performed by the kernel at each invocation. The performance of a kernel implementation is measured in flop/s (i.e. the number of floating point operations performed per second). We have the following equation:

F = AI × M    (1)

where F is the flop-per-invocation and M is the memory-per-invocation in bytes, corresponding to the amount of input and output memory processed by the kernel. AI is the arithmetic intensity in flop/byte, i.e. the ratio of the flop-per-invocation to the memory-per-invocation.
The arithmetic intensity allows us to locate micro-kernel bottlenecks [12]. If the bottleneck is the memory bandwidth then the kernel is called memory-bound; otherwise the bottleneck is the CPU throughput and the kernel is called CPU-bound. Indeed, we can write the following inequality:

Π > AI × β    (2)

where Π is the CPU peak performance in flop/s and β is the peak memory bandwidth in byte/s. If Equation 2 is satisfied then the kernel is memory-bound (otherwise CPU-bound). Since modern architectures use a multi-level cache hierarchy, the memory bandwidth depends on where the data resides in this hierarchy. The memory bandwidth therefore depends on the memory-per-invocation, which impacts cache efficiency: increasing the memory-per-invocation reduces the effective memory bandwidth.

2.2 Performance profile
We define the performance profile of a kernel implementation as the performance as a function of either the flop-per-invocation or the memory-per-invocation (since the two are linearly related). By plotting the performance profile of a kernel implementation over a wide range of memory-per-invocation, we can observe, for example, the memory bandwidth bottleneck.
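As an illustration of how such a profile can be collected, the sketch below times a pure Java dot-product kernel over a range of memory-per-invocation and reports flop/s. It is a simplified, hypothetical harness, not the benchmark methodology of Section 4; for this kernel AI = 2 flop / 16 bytes = 0.125 flop/byte, so F = 0.125 × M.

    // Minimal profiling sketch (hypothetical and simplified: no explicit
    // warmup phase, single-threaded, wall-clock timing only).
    public class Profile {
        static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        public static void main(String[] args) {
            for (int n = 1 << 4; n <= 1 << 20; n <<= 1) {
                double[] a = new double[n], b = new double[n];
                java.util.Arrays.fill(a, 1.0);
                java.util.Arrays.fill(b, 2.0);
                long reps = (1L << 28) / n;  // constant total work per size
                double sink = 0.0;           // keeps the JIT from removing calls
                long t0 = System.nanoTime();
                for (long r = 0; r < reps; r++) sink += dot(a, b);
                long elapsed = System.nanoTime() - t0;
                double flops = 2.0 * n * reps; // flop-per-invocation times reps
                System.out.printf("M = %8d bytes  ->  %.2f Gflop/s  (sink=%.0f)%n",
                                  16L * n, flops / elapsed, sink);
            }
        }
    }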
2.3 Theoretical impact of the invocation cost
A simple model describing the impact of the invocation cost on performance when the flop-per-invocation decreases is given by the following equation:

P = Pmax × F / (I + F)    (3)

where P is the performance, F is the flop-per-invocation, Pmax is the peak performance reached by the kernel, and I is a parameter describing the invocation cost in flop. We define the decrease-factor DF as follows:

DF = P / Pmax = F / (I + F)    (4)

Figure 1 shows the decrease-factor as a function of the flop-per-invocation for different values of I. We can see that for relatively small flop-per-invocation, the invocation cost has a large impact on performance. Accordingly, we must consider both the flop-per-invocation and the JNI invocation cost to select the most efficient implementation.

[Figure 1: Theoretical performance profile: decrease-factor as a function of the flop-per-invocation (2^0 to 2^12 flop) for invocation costs I = 1, 2, 4, 8 and 16.]

3. CODE OPTIMIZATION
We consider two kinds of optimizations. The first are asymptotic optimizations, which are significant when the flop-per-invocation grows. The second are optimizations that reduce the invocation cost of a method, which are significant for lower flop-per-invocation. They are described in Sections 3.1 and 3.2 respectively.

3.1 Asymptotic optimization
3.1.1 Vectorization and out-of-order optimization
Since microprocessor performance is limited by a maximum clock frequency, the trend goes towards increasing processor throughput by handling multiple data in parallel. The principle of vectorization is to pack data into a vector register and then perform one operation over all of the data. In this study we consider the AVX instruction set available on the 64-bit Intel Sandy Bridge architecture. AVX provides 256-bit registers, called YMM, which allow four double precision operations to be performed at a time. In addition to vector units, out-of-order (o-o-o) execution can increase the level of parallelism by dispatching independent instructions over the parallel execution units available on different ports.

Table 3: Core vectorization and out-of-order optimizations for a horizontal sum reduction kernel in double precision (no vectorization vs. vectorization with AVX intrinsics).
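To illustrate the out-of-order side of this optimization, the sketch below shows a horizontal sum in plain Java; the AVX-intrinsics variant referred to in Table 3 is C code and is not reproduced here. The multi-accumulator form is one standard way to expose instruction-level parallelism in a reduction; it is an illustration, not the paper's kernel.

    // Illustrative sketches (hypothetical names, not the paper's code).
    public class Reduce {
        // Naive horizontal sum: a single dependency chain, so each
        // addition must wait for the previous one to complete.
        static double sum(double[] a) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i];
            return s;
        }

        // Four independent accumulators break the dependency chain and
        // let the out-of-order engine execute additions in parallel.
        // Assumes a.length is a multiple of 4 to keep the sketch short.
        static double sumOoo(double[] a) {
            double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
            for (int i = 0; i < a.length; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
            }
            return (s0 + s1) + (s2 + s3);
        }
    }

Note that the two versions reassociate the floating point additions differently; since Java semantics forbid such reassociation for doubles, the JIT cannot apply this transformation automatically, which is one reason the statically compiled kernels can pull ahead.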