SIMD Intrinsics on Managed Language Runtimes
Anonymous Author(s)

Abstract

Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example of such control is the intrinsics interface that exposes instructions of SIMD (Single Instruction, Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.

CCS Concepts • Computer systems organization → Single instruction, multiple data; • Software and its engineering → Virtual machines; Translator writing systems and compiler generators; Source code generation; Runtime environments;

ACM Reference Format:
Anonymous Author(s). 2017. SIMD Intrinsics on Managed Language Runtimes. In Proceedings of ACM SIGPLAN Conference on Programming Languages, New York, NY, USA, January 01–03, 2017 (CGO'18), 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

CGO'18, January 01–03, 2017, New York, NY, USA. 2017. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction

Managed high-level languages are designed to be general-purpose and portable. For the programmer, the price is reduced access to detailed and low-level performance optimizations. In particular, SIMD vector instructions on modern architectures offer significant parallelism, and thus potential speedup, but neither languages like Java, JavaScript, Python, or Ruby, nor their managed runtimes, provide direct access to SIMD facilities. This means that SIMD optimizations, if available at all, are left to the virtual machine (VM) and its built-in just-in-time (JIT) compiler to carry out automatically, which often leads to suboptimal code. As a result, developers may be pushed to use low-level languages such as C/C++ to gain access to the intrinsics API. But leaving the high-level ecosystem of Java or other languages also means abandoning many high-level abstractions that are key to the productive and efficient development of large-scale applications, including access to a large set of libraries.

To reap the benefits of both high-level and low-level languages, developers using managed languages may write low-level native C/C++ functions that are invoked by the managed runtime. In the case of Java, developers can use the Java Native Interface (JNI) to invoke C functions that follow specific naming conventions. However, this process of dividing the application logic between two languages creates a significant gap in the abstractions of the program, limits code reuse, and impedes clear separation of concerns. Further, the native code must be cross-compiled ahead of time, an error-prone process that requires complicated pipelines to support different operating systems and architectures, and thus directly affects code maintenance and refactoring.

To address these problems, we propose a systematic and automated approach that gives developers access to SIMD instructions in the managed runtime, eliminating the need to write low-level C/C++ code. Our methodology supports the entire set of SIMD instructions in the form of embedded domain-specific languages (eDSLs) and consists of two parts. First, for each architecture, we automatically generate ISA-specific eDSLs from the vendor's XML specification of the SIMD intrinsics. Second, we provide the developer with the means to use the SIMD eDSLs to develop application logic, which automatically generates native code inside the runtime. Instead of executing each SIMD intrinsic immediately when invoked by the program, the eDSLs provide a staged, or deferred, API: it accumulates intrinsic invocations along with auxiliary scalar operations and control flow, batches them together in a computation graph, and generates a native kernel that executes them all at once, when requested by the program. This makes it possible to interleave SIMD intrinsics with the generic language constructs of the host language without switching back and forth between native and managed execution, enabling programmers to build both high-level and low-level abstractions, while running SIMD kernels at full speed.
This paper makes the following contributions:

1. We present the first systematic and automated approach that supports the entire set of SIMD instructions, automatically generated from the vendor specification, in a managed high-level language. The approach is applicable to other low-level instructions, provided the managed high-level language supports native code binding.
2. In doing so, we show how to use metaprogramming techniques and runtime code generation to give back low-level control to developers in an environment that typically hides architecture-specific details.
3. We provide an end-to-end implementation of our approach within the HotSpot JVM, which provides access to all Intel SIMD intrinsics from MMX to AVX-512.
4. We show how to use the SIMD eDSLs to build new abstractions using host language constructs. Programmers can use the entire managed language as a form of macro system to define new vectorized APIs with zero overhead. As an example, we present a "virtual ISA" for variable precision arithmetic.
5. We provide benchmarks that demonstrate significant performance gains of explicit SIMD code versus code auto-vectorized by the HotSpot JIT compiler.

Our work focuses on the JVM and Intel SIMD intrinsics, but would apply equally to other platforms. For the implementation of computation graphs and runtime code generation, we use the LMS (Lightweight Modular Staging) compiler framework [16].

2 Background

We provide background on intrinsics functions, JVMs, and the LMS metaprogramming framework that we use.

2.1 Intrinsics

Intrinsics are compiler built-in functions that usually map to a single assembly instruction, or a small number of them. During compilation, they are inlined to remove calling overhead.

• AVX / AVX2 - ISAs that expand the SSE operations to 256-bit wide registers and provide extra operations for manipulating non-contiguous memory locations.
• FMA - an extension to the SSE and AVX ISAs that provides fused multiply-add operations.
• AVX-512 - extends AVX to operate on 512-bit registers and consists of multiple parts called F / BW / CD / DQ / ER / IFMA52 / PF / VBMI / VL.
• KNC - the first production version of Intel's Many Integrated Core (MIC) architecture, which provides operations on 512-bit registers.

Additionally, we also include:

• SVML - an intrinsics short vector math library, built on top of the ISAs mentioned above.
• Sets of smaller ISA extensions: ADX / AES / BMI1 / BMI2 / CLFLUSHOPT / CLWB / FP16C / FSGSBASE / FXSR / INVPCID / LZCNT / MONITOR / MPX / PCLMULQDQ / POPCNT / PREFETCHWT1 / RDPID / RDRAND / RDSEED / RDTSCP / RTM / SHA / TSC / XSAVE / XSAVEC / XSAVEOPT / XSS

These ISAs yield a large number of associated intrinsics: arithmetic operations on both floating point and integer numbers, intrinsics for logical and bitwise operations, statistical and cryptographic operations, comparisons, string operations, and many more. Figure 1a gives a rough classification of the different classes of intrinsics. To ease the life of developers, Intel provides an interactive tool called the Intel Intrinsics Guide [2] where each available intrinsic is listed, including a detailed description of the underlying ISA instruction. Figure 1b shows the number of available intrinsics: 5912 in total (of which 338 are shared between AVX-512 and KNC), classified into 24 groups.

2.2 Java Virtual Machines

There are many active implementations of the JVM, including the open source IBM V9 [5], Jikes RVM [3], Maxine [24], JRockit [12], and the proprietary SAP JVM [19], CEE-J [21], and JamaicaVM [20]. HotSpot remains the primary reference