SIMD Intrinsics on Managed Language Runtimes

Anonymous Author(s)

Abstract
Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example of such control is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.

CCS Concepts • Computer systems organization → Single instruction, multiple data; • Software and its engineering → Virtual machines; Translator writing systems and compiler generators; Source code generation; Runtime environments;

ACM Reference Format:
Anonymous Author(s). 2017. SIMD Intrinsics on Managed Language Runtimes. In Proceedings of ACM SIGPLAN Conference on Programming Languages, New York, NY, USA, January 01-03, 2017 (CGO'18), 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 Introduction
Managed high-level languages are designed to be general-purpose and portable. For the programmer, the price is reduced access to detailed and low-level performance optimizations. In particular, SIMD vector instructions on modern architectures offer significant parallelism, and thus potential speedup, but neither languages like Java, JavaScript, Python, or Ruby, nor their managed runtimes, provide direct access to SIMD facilities. This means that SIMD optimizations, if available at all, are left to the virtual machine (VM) and the built-in just-in-time (JIT) compiler to carry out automatically, which often leads to suboptimal code. As a result, developers may be pushed to use low-level languages such as C/C++ to gain access to the intrinsics API. But leaving the high-level ecosystem of Java or other languages also means abandoning many high-level abstractions that are key for the productive and efficient development of large-scale applications, including access to a large set of libraries.

To reap the benefits of both high-level and low-level languages, developers using managed languages may write low-level native C/C++ functions that are invoked by the managed runtime. In the case of Java, developers could use the Java Native Interface (JNI) to invoke C functions with specific naming conventions. However, this process of dividing the application logic between two languages creates a significant gap in the abstractions of the program, limits code reuse, and impedes clear separation of concerns. Further, the native code must be cross-compiled ahead of time, which is an error-prone process that requires complicated pipelines to support different operating systems and architectures, and thus directly affects code maintenance and refactoring.

To address these problems, we propose a systematic and automated approach that gives developers access to SIMD instructions in the managed runtime, eliminating the need to write low-level C/C++ code. Our methodology supports the entire set of SIMD instructions in the form of embedded domain-specific languages (eDSLs) and consists of two parts. First, for each architecture, we automatically generate ISA-specific eDSLs from the vendor's XML specification of the SIMD intrinsics. Second, we provide the developer with the means to use the SIMD eDSLs to develop application logic, which automatically generates native code inside the runtime. Instead of executing each SIMD intrinsic immediately when invoked by the program, the eDSLs provide a staged or deferred API, which accumulates intrinsic invocations along with auxiliary scalar operations and control flow, batches them together in a computation graph, and generates a native kernel that executes them all at once, when requested by the program. This makes it possible to interleave SIMD intrinsics with the generic language constructs of the host language without switching back and forth between native and managed execution, enabling programmers to build both high-level and low-level abstractions, while running SIMD kernels at full speed.

This paper makes the following contributions:
1. We present the first systematic and automated approach that supports the entire set of SIMD instructions, automatically generated from the vendor specification, in a managed high-level language. The approach is applicable to other low-level instructions, provided the managed high-level language supports native code binding.
2. In doing so, we show how to use metaprogramming techniques and runtime code generation to give back low-level control to developers in an environment that typically hides architecture-specific details.
3. We provide an end-to-end implementation of our approach within the HotSpot JVM, which provides access to all Intel SIMD intrinsics from MMX to AVX-512.
4. We show how to use the SIMD eDSLs to build new abstractions using host language constructs. Programmers can use the entire managed language as a form of macro system to define new vectorized APIs with zero overhead. As an example, we present a "virtual ISA" for variable precision arithmetic.
5. We provide benchmarks that demonstrate significant performance gains of explicit SIMD code versus code auto-vectorized by the HotSpot JIT compiler.

Our work focuses on the JVM and the Intel SIMD intrinsics functions, but would equally apply to other platforms. For the implementation of computation graphs and runtime code generation, we use the LMS (Lightweight Modular Staging) compiler framework [16].

2 Background
We provide background on intrinsics functions, JVMs, and the LMS metaprogramming framework that we use.

2.1 Intrinsics
Intrinsics are compiler built-in functions that usually map to a single or a small number of assembly instructions. During compilation, they are inlined to remove calling overhead. This way they provide the programmer with assembly-like functionality, without having to worry about register allocation and instruction scheduling. SIMD intrinsics give access to data parallel instructions in vector ISAs, such as NEON on ARM processors, or the SSE and AVX families on Intel.
We focus on the x86 architecture and the associated SIMD intrinsics that are available in modern C/C++ compilers, such as GCC, Clang/LLVM, and Intel ICC. Specifically, these include the following ISAs:
• MMX - operating on 64-bit wide registers; provides integer operations only.
• SSE / SSE2 / SSE3 / SSSE3 / SSE4.1 / SSE4.2 - operating on 128-bit wide registers; provide integer, 32-bit, and 64-bit floating point operations, string operations, and cache and memory management operations.
• AVX / AVX2 - ISAs that expand the SSE operations to 256-bit wide registers and provide extra operations for manipulating non-contiguous memory locations.
• FMA - an extension to the SSE and AVX ISAs that provides fused multiply-add operations.
• AVX-512 - extends AVX to operate on 512-bit registers and consists of multiple parts called F / BW / CD / DQ / ER / IFMA52 / PF / VBMI / VL.
• KNC - the first production version of Intel's Many Integrated Core (MIC) architecture that provides operations on 512-bit registers.
Additionally, we also include:
• SVML - an intrinsics short vector math library, built on top of the ISAs mentioned above.
• Sets of smaller ISA extensions: ADX / AES / BMI1 / BMI2 / CLFLUSHOPT / CLWB / FP16C / FSGSBASE / FXSR / INVPCID / LZCNT / MONITOR / MPX / PCLMULQDQ / POPCNT / PREFETCHWT1 / RDPID / RDRAND / RDSEED / RDTSCP / RTM / SHA / TSC / XSAVE / XSAVEC / XSAVEOPT / XSS.

These ISAs yield a large number of associated intrinsics: arithmetic operations on both floating point and integer numbers, intrinsics that operate with logical and bitwise operations, statistical and cryptographic operations, comparisons, string operations, and many more. Figure 1a gives a rough classification of the different classes of intrinsics. To ease the life of developers, Intel provides an interactive tool called the Intel Intrinsics Guide [2] where each available intrinsic is listed, including a detailed description of the underlying ISA instruction. Figure 1b shows the number of available intrinsics: 5912 in total (of which 338 are shared between AVX-512 and KNC), classified into 24 groups.

2.2 Java Virtual Machines
There are many active implementations of the JVM, including the open source IBM J9 [5], Jikes RVM [3], Maxine [24], and JRockit [12], and the proprietary SAP JVM [19], CEE-J [21], and JamaicaVM [20]. HotSpot remains the primary reference JVM implementation and is used by both Oracle Java and OpenJDK. Each of the JVM implementations provides support for Java Standard Edition or Micro Edition, tailored for a particular need: a particular target machine or microarchitecture, embedded systems, a particular operating system, additional garbage collectors, resource control, or a parallelism model. However, none of the active JVM implementations provides any support for explicit vectorization or intrinsics, nor permits inlining of assembly code directly in the Java source, due to portability requirements.
The HotSpot JVM, which is the focus of this study, provides JIT compilation of Java bytecode as a black box. The developer has no control over, nor receives any feedback on, the compilation phases except through coarse-grained command-line and debug options [13]. There are two flavors of the VM: a client mode focused on latency, and a server mode tuned for throughput.

(a) Classification of intrinsics with example functions:
Arithmetics: _mm256_add_pd, _mm256_hadd_ps, ...
Shuffles: _mm256_permutevar_pd, _mm256_shufflehi_epi16, ...
Statistics: _mm_avg_epu8, _mm256_cdfnorm_pd, ...
Loads: _mm_i32gather_epi32, _mm256_broadcast_ps, ...
Compare: _mm_cmp_epi16_mask, _mm_cmpeq_epi8, ...
String: _mm_cmpestrm, _mm_cmpistrz, ...
Logical: _mm256_or_pd, _mm256_andnot_pd, ...
Stores: _mm512_storenrngo_pd, _mm_store_pd1, ...
Random: _rdrand16_step, _rdseed64_step, ...
Bitwise: _mm256_bslli_epi128, _mm512_rol_epi32, ...
Crypto: _mm_aesdec_si128, _mm_sha1msg1_epu32, ...
Conversion: _mm256_castps_pd, _mm256_cvtps_epi32, ...

(b) Intrinsics count per ISA:
ISA      Count
MMX        124
SSE        154
SSE2       236
SSE3        11
SSSE3       32
SSE41       61
SSE42       19
AVX        188
AVX2       191
AVX-512   3857
FMA         32
KNC        601
SVML       406

Figure 1. Simplified classification of intrinsics (a) and instruction count (b) of the x86 SIMD intrinsics set.

We only focus on the Server VM, as it is tuned to maximize peak operating speed. The Server VM offers tiered compilation of bytecode using the C1 and C2 compilers. C1 is a fast, lightly optimizing bytecode compiler, while C2 performs more aggressive optimizations. When JVM applications are started, the HotSpot VM starts interpreting bytecode. It detects computation-intensive hot spots in the code via profiling, and proceeds to compile the bytecode of frequently used functions with C1. Once further thresholds are reached, functions may be compiled using C2. C2 supports autovectorization, using Superword Level Parallelism (SLP) [8]. SLP detects groups of isomorphic instructions and replaces them with SIMD instructions, which results in a lightweight vectorization. The SLP approach is limited and cannot optimize across loop iterations, nor can it detect idioms such as reductions.

2.3 Lightweight Modular Staging
Lightweight Modular Staging (LMS) [16] is a framework for runtime code generation and for building compilers for embedded DSLs in Scala. LMS makes pervasive use of operator overloading to make code generation blend in with normal programming. The core abstraction is a type constructor Rep[T] that marks code expressions. For example, executing a + b where a and b are two Rep[Int] expressions will create a program expression that represents the addition a' + b', where a' and b' are the program expressions a and b evaluate to. The combined program expression can be unparsed to source code, in this paper to C or LLVM code, compiled dynamically, and loaded into the running JVM.
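To make the staging idea concrete, the following self-contained Scala sketch mimics the mechanism on a toy scale. It is an illustration only and not the LMS implementation; the names MiniStaging, Exp, Const, Plus and emit are invented for this example.

import scala.language.implicitConversions

object MiniStaging extends App {
  sealed trait Exp[T]
  case class Const[T](x: T) extends Exp[T]
  case class Plus(a: Exp[Int], b: Exp[Int]) extends Exp[Int]

  // Operator overloading: `+` on staged Ints builds a graph node
  // instead of computing a value.
  implicit class StagedIntOps(a: Exp[Int]) {
    def +(b: Exp[Int]): Exp[Int] = Plus(a, b)
  }
  implicit def liftInt(x: Int): Exp[Int] = Const(x)

  // Unparse the expression tree to C-like source code.
  def emit(e: Exp[Int]): String = e match {
    case Const(x)   => x.toString
    case Plus(a, b) => s"(${emit(a)} + ${emit(b)})"
  }

  val a: Exp[Int] = 1
  val b: Exp[Int] = 2
  println(emit(a + b + 3)) // prints ((1 + 2) + 3)
}

In the actual eDSLs, the graph nodes additionally carry effect information and symbolic references, as described in Section 3.2.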
JVM Types ↔ C/C++ Types
Float    ↔ float        Char     ↔ int16_t
Double   ↔ double       Boolean  ↔ bool
Byte     ↔ int8_t       UByte    ↔ uint8_t
Short    ↔ int16_t      UShort   ↔ uint16_t
Int      ↔ int32_t      UInt     ↔ uint32_t
Long     ↔ int64_t      ULong    ↔ uint64_t

Figure 2. Type mappings between JVM and C/C++ types.

3 Intrinsics in the JVM
In this section, we present our two-tier approach for making the Intel SIMD intrinsics available in the JVM. First, we automatically generate SIMD eDSLs, each implemented as a Scala class that corresponds to one of the 13 vector ISAs in Figure 1b. Then, we show how to use these eDSLs to generate high-performance SIMD code with high-level language constructs inherited from the host language. We show examples of end-user code in Scala, but any other JVM language could be used. Before we start, we have to establish a corresponding type system between the JVM and the SIMD intrinsics functions to represent the SIMD vector types, which is required for the generation of the eDSLs and their usage.

3.1 Type System for SIMD Intrinsics in the JVM
The JVM has no notion of SIMD vector types, thus we build abstract classes to mark the type of DSL expressions that represent SIMD intrinsics functions in LMS:

Rep[ __m64 ]   // MMX integer types
Rep[ __m128 ]  // SSE 4x32-bit float
Rep[ __m128d ] // SSE 2x64-bit float
Rep[ __m128i ] // SSE 2x64 / 4x32 / 8x16 / 16x8-bit integer
Rep[ __m256 ]  // AVX 8x32-bit float
Rep[ __m256d ] // AVX 4x64-bit float
Rep[ __m256i ] // AVX 4x64 / 8x32 / 16x16 / 32x8-bit integer
Rep[ __m512 ]  // AVX-512 16x32-bit float
Rep[ __m512d ] // AVX-512 8x64-bit float
Rep[ __m512i ] // AVX-512 8x64 / 16x32 / 32x16 / 64x8-bit integer

SIMD intrinsics functions take primitive arguments that correspond to low-level C/C++ primitive types. The primitive types in the JVM exhibit a fixed width, and therefore a direct mapping can be established with the C/C++ primitives. Some intrinsics, however, require the use of unsigned types that are not supported natively in the JVM:

unsigned int _mm_crc32_u16 (unsigned int, unsigned short)

To mitigate this problem, we use the Scala Unsigned [11] package, which implements unsigned types and operations on top of the signed types available in the JVM. Figure 2 shows the type mapping between the 12 primitives, which in most cases is straightforward, except for the JVM Char, which maps to int16_t to support UTF-16. Arrays of primitive types in the JVM and in C/C++ code are isomorphic: both represent a contiguous memory region of a certain primitive type. Therefore Array[T] maps to a memory pointer T* in the low-level SIMD intrinsics.

3.2 Automatic Generation of ISA-specific eDSLs
LMS provides a relatively simple interface to define eDSLs, but adding more than 5000 functions by hand would be tedious and error-prone. Our approach generates the LMS eDSLs automatically from the XML specification provided by the Intel Intrinsics Guide; these are then packed as a jar file that is later published in the Maven Central Repository [4] for deployment. At the time of writing, we use the latest version of the intrinsics specification, stored as the data-3.3.16.xml file, and build the generator such that it anticipates future extensions of the specification. Figure 3 shows a high-level overview of the generation process, which we explain step by step next.

Figure 3. Generating SIMD intrinsics eDSLs from vendor specification. (Flowchart: the inputs data-3.3.16.xml, the LMS framework, and immintrin.h feed a pipeline that parses the XML intrinsics specification, infers intrinsic mutability effects, and generates an ISA-specific DSL in LMS through four building blocks: 1. Definition as a case class of Def[T], 2. SSA conversion, 3. Mirroring, 4. Un-parsing of Def[T] to a string. Each ISA-specific DSL is then split into subclasses, producing intrinsics.jar with SSE.scala, AVX.scala, etc.)

Figure 4. XML specification of the _mm256_add_pd intrinsic. (The entry records the ISA ("AVX"), the intrinsic type ("Floating Point"), the category ("Arithmetic"), the header ("immintrin.h"), the parameters a and b, the description "Add packed double-precision (64-bit) floating-point elements in "a" and "b", and store the results in "dst".", and the operation pseudocode: FOR j := 0 to 3; i := j*64; dst[i+63:i] := a[i+63:i] + b[i+63:i]; ENDFOR; dst[MAX:256] := 0.)

Parse XML intrinsics specification. The first step in the generation process extracts the information from the XML file. As shown in the example in Figure 4, for each intrinsic the XML file contains a name that defines the intrinsic function, its return type, an ordered list of the arguments of the function with their corresponding types, a CPUID parameter that corresponds to the ISA set, and a category parameter.
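As an illustration of this extraction step, the following Scala sketch parses such entries with scala.xml. The element and attribute names ("intrinsic", "name", "rettype", "parameter", "varname", "CPUID", "category") are assumptions modeled after Figure 4, so the actual generator and schema may differ; the sketch also requires the scala-xml module.

import scala.xml.XML

case class IntrinsicSpec(name: String, retType: String,
                         params: Seq[(String, String)],
                         cpuid: Seq[String], category: Seq[String])

object ParseSpec extends App {
  val root = XML.loadFile("data-3.3.16.xml")
  // One record per <intrinsic> element, keeping the fields the generator needs.
  val specs = (root \\ "intrinsic").map { node =>
    IntrinsicSpec(
      name     = (node \ "@name").text,
      retType  = (node \ "@rettype").text,
      params   = (node \ "parameter").map(p => ((p \ "@varname").text, (p \ "@type").text)),
      cpuid    = (node \ "CPUID").map(_.text),
      category = (node \ "category").map(_.text)
    )
  }
  specs.take(3).foreach(println)
}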

Generate ISA-specific DSL in LMS. For each intrinsic function we implement four building blocks that define the eDSL. These are represented in terms of implementation classes provided by the LMS framework. The classes Def[T] and Exp[T] together define a computation graph. Subclasses of Def[T] implement graph nodes that represent individual computations, e.g., Plus(a,b). Here, a and b are values of type Exp[T]: either constants Const(.) or symbols Sym(id) that refer to other graph nodes through a numeric index id. The four necessary building blocks are as follows:
1. Definition of the intrinsic function, represented as a subclass of Def[T].
2. Implicit conversion from expression Exp[T] to definition Def[T], looking up a computation node given a symbolic reference.
3. Mirroring function that converts a Def[T] into an expression Exp[T], potentially applying a transformation.
4. Unparsing routine that converts each Def[T] into a C/C++ string.
To complete the first part, we define IntrinsicsDef[T], an abstract class that each intrinsics definition will inherit:

abstract class IntrinsicsDef[T:Manifest] extends Def[T] {
  val category: List[IntrinsicsCategory]
  val intrinsicType: List[IntrinsicsType]
  val performance: Map[MicroArchType, Performance]
  val header: String
}

Then for each intrinsic function, we define a Scala case class that corresponds to the intrinsic function's name, its input arguments and return type.

Each case class contains the category, the type of the intrinsic, and performance information when available. Additionally, we also include the header where the C/C++ intrinsic is defined:

case class MM256_ADD_PD(
  a: Exp[__m256d], b: Exp[__m256d]
) extends IntrinsicsDef[__m256d] {
  val category = List(Arithmetic)
  val intrinsicType = List(FloatingPoint)
  val performance = Map.empty[MicroArchType, Performance]
  val header = "immintrin.h"
}

With the current definition we allow a particular intrinsic to pertain to several categories. The header information gives us the control to include the correct header when unparsing the code to C/C++. Performance information is included but is not used in the staging process.
Next we generate Scala code for the implicit conversion from intrinsics expressions to definitions. This routine is essential in LMS, as it provides automatic conversion of the staged code into static single assignment (SSA) form. In most cases it is sufficient to rely on the Scala compiler to automatically perform the implicit conversion:

def _mm256_add_pd(a: Exp[__m256d], b: Exp[__m256d])
  : Exp[__m256d] = MM256_ADD_PD(a, b)

The LMS framework supports DSL transformations by substitution. Once a substitution is defined, LMS creates new definitions. However, when no substitution is available, a definition has to be converted to an expression through a mirroring routine that converts a Def[T] back to an Exp[T], potentially creating new definitions for subexpressions as part of the transformation:

override def mirror[A:Typ](e: Def[A], f: Transformer)
  (implicit pos: SourceContext): Exp[A] = (e match {
  case MM256_ADD_PD(a, b) =>
    _mm256_add_pd(f(a), f(b))
  case MM256_ADD_PS(a, b) =>
    _mm256_add_ps(f(a), f(b))
  // ... a lot more patterns to match
  case _ => super.mirror(e, f)
})

Once code is generated for all these routines and for each intrinsic, the final step is to generate code that performs the unparsing of the DSL into C code. The unparsing routine is written similarly to the mirroring routine, by pattern matching each DSL definition to produce the corresponding C expression:

override def emitNode(sym: Sym[Any], rhs: Def[Any]) = rhs match {
  case iDef@MM256_ADD_PD(a, b) =>
    headers += iDef.header
    emitValDef(sym, s"_mm256_add_pd(${a}, ${b})")
  // ... a lot more patterns to match
  case _ => super.emitNode(sym, rhs)
}

Infer intrinsic mutability. As mentioned before, when code is generated for the implicit conversion of an intrinsics expression to the intrinsics definition, we can rely on the Scala compiler to match the correct implicit method. This works correctly for immutable expressions, but not all intrinsics are immutable. For example, each intrinsic that loads from or stores to memory creates effects that have to be handled by LMS. The semantics of these effects is essential for scheduling the DSL.
To resolve this problem, we use the category information of each intrinsic (see Figure 4) and implement a conservative heuristic to generate the effects:
• Each time an intrinsic is discovered with a load category, we generate a read effect on each argument that is a memory location.
• Each time an intrinsic is discovered with a store category, we generate a write effect on each argument that is a memory location.
For example, an AVX load of 4 doubles has the form of:

def _mm256_load_pd[A[_], U:Integral](
  mem_addr: Exp[A[Double]], mem_addrOffset: Exp[U]
)(implicit cont: Container[A]): Exp[__m256d] = {
  cont.read(mem_addr)(
    MM256_LOAD_PD(mem_addr, mem_addrOffset)
    (implicitly[Integral[U]], cont)
  )
}

The heuristic is invoked on each intrinsic that performs loads, stores, maskloads, maskstores, gathers, scatters, and other memory-related operations.

Split each ISA-specific DSL into subclasses. The JVM has a hard limit of 64KB on the bytecode size of each method, which is an obstacle when generating the unparsing and mirroring routines for large ISAs such as AVX-512 or KNC. To avoid this obstacle, and still keep the LMS design pattern, we split the ISA-specific DSLs into subclasses that inherit from each other.

3.3 Developing Explicitly Vectorized Code in the JVM Using SIMD eDSLs
Figure 5 gives a high-level overview of how to use explicit vectorization in the JVM. The process consists of two parts: compile-time tasks, done by the high-performance code developer, and runtime tasks that are done automatically by LMS and our compiler pipeline. Specifically, the compile-time tasks of the developer comprise four steps:
1. Implement a native function placeholder that will represent the vectorized code.
2. Create a DSL instance by instantiating one or mixing in several ISA-specific eDSLs.
3. Implement the SIMD logic as a staged function.
4. Call the provided compile routine to generate, compile and link the code in the JVM.
After the four steps are completed and the JVM program is started, the compiler pipeline is invoked with the compile routine. This performs system inspection, searches for available compilers, and opportunistically picks the optimal compiler available on the system. In particular, it will attempt to find icc, gcc or llvm/clang. After a compiler is found, the runtime determines the target CPU, as well as the underlying microarchitecture, to derive the available ISAs.

This allows us to have full control over the system, and to pick the best mix of compiler flags for each compiler. Once this process is completed, the user-defined staged function is executed, which assembles a computation graph of SIMD instructions. From this computation graph, LMS generates vectorized C code. This code is then automatically compiled as a dynamic library with the set of derived compiler flags, and linked back into the JVM. To link the native code into the JVM, JNI requires the C function headers to contain the Java_ prefix, followed by the package name, the class name, and the name of the native function. The compile routine automates this process using JVM reflection and some lightweight use of Scala macros. By this automation, we ensure the interoperability between the native function and the staged function, creating code that is robust to modifications and refactoring, and we eliminate the need for the developer to recompile the native code each time major code revisions are performed on the low-level code or the class container.
Figure 6 illustrates a complete and self-contained implementation of a BLAS 1 routine called SAXPY [9], which computes y = y + αx for given vectors x, y and a scalar α. The expression forloop(...) creates a staged loop in the LMS computation graph.

Figure 5. Developing explicit vectorized code in the JVM using SIMD eDSLs. (Flowchart: at compile time the developer starts developing SIMD code, 1. implements a native function placeholder, 2. mixes in one or several ISA-specific eDSLs, 3. implements the SIMD staged function, and 4. calls compile to generate native code. At runtime, once the program is started on the JVM, LMS and the compiler pipeline detect the available C/C++ compilers, inspect the system through CPUID, infer the available ISAs and compiler flags, remove abstraction and generate C code, and compile and link the code to the JVM, after which the high-performance SIMD code is used.)

import ch.ethz.acl.commons.cir.IntrinsicsIR
import com.github.dwickern.macros.NameOf._

class NSaxpy {

  // Step 1: Placeholder for the SAXPY native function
  @native def apply (
    a      : Array[Float],
    b      : Array[Float],
    scalar : Float,
    n      : Int
  ): Unit

  // Step 2: DSL instance of the intrinsics
  val cIR = new IntrinsicsIR
  import cIR._

  // Step 3: Staged SAXPY function using AVX + FMA
  def saxpy_staged(
    a_imm  : Rep[Array[Float]],
    b      : Rep[Array[Float]],
    scalar : Rep[Float],
    n      : Rep[Int]
  ): Rep[Unit] = { import ImplicitLift._
    // make array `a` mutable
    val a_sym = a_imm.asInstanceOf[Sym[Array[Float]]]
    val a = reflectMutableSym(a_sym)
    // start with the computation
    val n0 = (n >> 3) << 3
    val vec_s = _mm256_set1_ps(scalar)
    forloop(0, n0, fresh[Int], 8, (i : Rep[Int]) => {
      val vec_a = _mm256_loadu_ps(a, i)
      val vec_b = _mm256_loadu_ps(b, i)
      val res = _mm256_fmadd_ps(vec_b, vec_s, vec_a)
      _mm256_storeu_ps(a, res, i)
    })
    forloop(n0, n, fresh[Int], 1, (i : Rep[Int]) => {
      a(i) = a(i) + b(i) * scalar
    })
  }

  // Step 4: generate the saxpy function,
  // compile it and link it to the JVM
  compile(saxpy_staged _, this, nameOf(apply _))
}

Figure 6. A complete implementation of the BLAS 1 routine SAXPY in the JVM using AVX and FMA SIMD intrinsics.
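For illustration, a hypothetical caller of the class in Figure 6 looks as follows. Once compile has generated, compiled and linked the native kernel, the @native placeholder behaves like an ordinary JVM method; assuming NSaxpy lives in the default package, JNI binds it to a C symbol named Java_NSaxpy_apply.

object SaxpyDemo extends App {
  val n = 1 << 20
  val a = Array.fill(n)(1.0f)
  val b = Array.fill(n)(2.0f)

  val saxpy = new NSaxpy     // staging, C code generation and linking happen here
  saxpy.apply(a, b, 0.5f, n) // runs the AVX/FMA kernel: a(i) += 0.5f * b(i)

  println(a.take(4).mkString(", ")) // 2.0, 2.0, 2.0, 2.0
}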

3.4 Evaluation
To assess the viability of our approach we consider two ubiquitous kernel functions: the aforementioned SAXPY and matrix multiplication.

Experimental setup. We perform the tests on a Haswell processor, an Intel Xeon CPU E3-1285L v3 at 3.10 GHz with 32 GB of RAM, running Debian GNU/Linux 8 (jessie), kernel 3.16.43-2+deb8u3. The available compilers are gcc 4.9.2-10 and Intel icc 17.0.0. The installed JVM is HotSpot 64-Bit Server 25.144-b01, supporting Java 1.8. To avoid the effects of frequency scaling and resource sharing on the measurements, Turbo Boost and Hyper-Threading are disabled.
We use ScalaMeter [14] to perform the benchmarks. To obtain precise results, we select a pre-configured benchmark that forks a new JVM and performs the measurements inside the clean instance. The new instance has a compilation threshold of 100 (-XX:CompileThreshold=100) and we perform at least 100 warm-up runs on all test cases to trigger the JIT compiler. Each test case is performed on a warm cache. Tests are repeated 30 times, and the median of the runtime is taken. We show the results as performance, measured in flops per cycle.
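A benchmark harness of this kind can be sketched with ScalaMeter roughly as follows. The generator and configuration keys are assumptions based on the ScalaMeter 0.x API and on the settings described above, not the benchmark code used for the measurements.

import org.scalameter.api._

object SaxpyBench extends Bench.OfflineReport { // runs the measurements in forked JVMs
  val sizes = Gen.exponential("n")(1 << 6, 1 << 22, 2)
  val saxpy = new NSaxpy

  performance of "SAXPY" in {
    using(sizes) config (
      exec.benchRuns     -> 30,  // 30 repetitions
      exec.minWarmupRuns -> 100, // at least 100 warm-up runs
      exec.jvmflags      -> List("-XX:CompileThreshold=100")
    ) in { n =>
      val a = Array.fill(n)(1.0f)
      val b = Array.fill(n)(2.0f)
      saxpy.apply(a, b, 0.5f, n)
    }
  }
}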

Figure 7. Performance analysis: Java implementation vs. LMS intrinsics generated code. (Two panels showing performance in flops per cycle [f/c] with the L1, L2 and L3 cache boundaries marked: (a) SAXPY for sizes 2^6 to 2^22, comparing Java SAXPY against the LMS generated SAXPY; (b) Matrix-Matrix-Multiplication for sizes 8 to 1024, comparing the Java MMM triple loop, the blocked Java MMM, and the LMS generated MMM.)

To inspect the JIT compilation of the HotSpot JVM, we use -XX:UnlockDiagnosticVMOptions, which unlocks the diagnostic JVM options, and -XX:CompileCommand=print to output the generated assembly. In all test cases we observe the full tiered compilation, starting from the C1 compiler to the last phase of the C2 compiler. For a fair comparison between the JVM and our generated intrinsics code, we consider the C2-compiled version of the bytecode only, excluding the JIT warm-up time and the LMS generation overhead.

SAXPY. We compare the generated vector code, shown in Figure 6, against an equivalent Java implementation:

public class JSaxpy {
  public void apply(float[] a, float[] b, float s, int n) {
    for (int i = 0; i < n; i++)
      a[i] += b[i] * s;
  }
}

MMM. For matrix-matrix multiplication (MMM) we compare three versions. The first is a standard Java implementation of a triple MMM loop. The two other versions are blocked versions of MMM with block size 8, the first implemented in Java and the second implemented using AVX intrinsics in Scala. For simplicity, we assume that the matrix has size n = 8k, and provide the implementation in Figure 8.
Our Scala implementation uses the SIMD intrinsics with high-level constructs of the Scala language, including pattern matching (lines 5, 10, 19, 20, etc.), lambdas (lines 4, 10, 34), Scala collections (lines 4, 34, etc.), closures (line 45) and others that are not available in low-level C code. Once LMS removes the abstraction overhead, the MMM function results in a high-performance implementation. The performance comparison in Figure 7b shows that the use of explicit vectorization through SIMD intrinsics can offer improvements of up to 5x over the blocked Java implementation, and over 7.8x over the baseline triple loop implementation. The assembly analysis shows that the C2 compiler will

1  // take 8 __m256 vector types, transpose their
2  // values, returning 8 __m256 vectors.
3  def transpose(row: Seq[Exp[__m256]]) = {
4    val __tt = row.grouped(2).toSeq.flatMap({
5      case Seq(a, b) => Seq(
6        _mm256_unpacklo_ps(a, b),
7        _mm256_unpackhi_ps(a, b)
8      )
9    }).grouped(4).toSeq.flatMap({
10     case Seq(a, b, c, d) => Seq(
11       _mm256_shuffle_ps(a, c, 68),
12       _mm256_shuffle_ps(a, c, 238),
13       _mm256_shuffle_ps(b, d, 68),
14       _mm256_shuffle_ps(b, d, 238)
15     )
16   })
17   val zip = __tt.take(4) zip __tt.drop(4)
18   val f = _mm256_permute2f128_ps _
19   zip.map({ case (a, b) => f(a, b, 0x20) }) ++
20   zip.map({ case (a, b) => f(a, b, 0x31) })
21 }
22 // Perform Matrix-Matrix-Multiplication
23 def staged_mmm_blocked(
24   a     : Rep[Array[Float]],
25   b     : Rep[Array[Float]],
26   c_imm : Rep[Array[Float]],
27   n     : Rep[Int] // assume n == 8k
28 ): Rep[Unit] = {
29   val c_sym = c_imm.asInstanceOf[Sym[Array[Float]]]
30   val c = reflectMutableSym(c_sym)
31   forloop(0, n, fresh[Int], 8, (kk: Exp[Int]) => {
32     forloop(0, n, fresh[Int], 8, (jj: Exp[Int]) => {
33       // Load the block of matrix B and transpose it
34       val blockB = transpose((0 to 7).map { i =>
35         _mm256_loadu_ps(b, (kk + i) * n + jj)
36       })
37       // Multiply all the vectors of a of the
38       // corresponding block column with the running
39       // block and store the result in matrix C
40       forloop(0, n, fresh[Int], 1, (i: Exp[Int]) => {
41         val rowA = _mm256_loadu_ps(a, i * n + kk)
42         val mulAB = transpose(
43           blockB.map(_mm256_mul_ps(rowA, _))
44         )
45         def f(l: Seq[Exp[__m256]]): Exp[__m256] =
46           l.size match {
47             case 1 => l.head
48             case s =>
49               val lhs = f(l.take(s/2))
50               val rhs = f(l.drop(s/2))
51               _mm256_add_ps(lhs, rhs)
52           }
53         val rowC = _mm256_loadu_ps(c, i * n + jj)
54         val accC = _mm256_add_ps(f(mulAB), rowC)
55         _mm256_storeu_ps(c, accC, i * n + jj)
56       })
57     })
58   })
59 }

Figure 8. Implementation of MMM in the JVM using AVX intrinsics.

Macros could potentially resolve this problem and ensure complete isomorphic binding of JNI and staged functions.
LMS does not provide any mechanism to deal with exceptions such as segfaults from the generated code. Therefore it is the responsibility of the developer to write valid SIMD code. LMS is also not optimized for fast code generation, which might result in an overhead surpassing the HotSpot interpretation speed when it is used to generate functions that are computationally light.
Another limitation is a consequence of the complex memory model of the HotSpot JVM. Once arrays are used in the native code, GetPrimitiveArrayCritical must be invoked to obtain the memory space of the array. Depending on the state of the garbage collector (GC), the array might end up in different segments of the heap, which could result in a copy once the native code tries to access the memory space, or the JVM could decide to temporarily disable the GC. Although we did not experience an array copy in any test case performed, we believe that the use of LMS intrinsics is best suited for compute-bound problems, where the copy overhead can be amortized by the fast runtime of the SIMD instructions upon each JNI invocation. Some of the issues with JVM arrays can be avoided by using Java NIO buffers or off-heap memory allocated with the sun.misc.Unsafe package.

4 Build Your Own Virtual ISA
In the previous section, we demonstrated the use of SIMD intrinsics for developing high-performance code, based on high-level constructs of the Scala language. However, with the metaprogramming and staging provided by LMS, we can also use the SIMD intrinsics to build new low-level abstractions and provide functionality similar to the SVML short vector math library that is typically implemented by low-level C/C++ compilers. As an example, we build abstractions for low-precision arithmetic and, in particular, building blocks for the stochastic gradient descent (SGD) algorithm.
SGD is currently among the most popular algorithms in machine learning, used for training neural networks [1, 17]. It consists of two main building blocks: a dot-product operator and a scale-and-add operator. The use of low precision is an important optimization in SGD for deep learning as it reduces both computation time and data movement for increased performance and efficiency [25].
In this section we build a virtual variable-precision ISA that implements the dot product operator, operating on arrays of 32, 16, 8 and 4-bit precision. For 32 and 16 bits we use floating point, which is natively supported by the hardware; for the lower-precision formats, we use quantized arrays [25]. Quantization is a lossy compression technique that maps continuous values to a finite set of fixed bit-width numbers. For a given vector v of size n and a precision of b bits, we first derive a factor s_v that scales the vector elements v_i into the representable range:

    $s_v = \frac{2^{b-1} - 1}{\max_{i \in [1,n]} |v_i|}$

The scaled v_i are then quantized stochastically:

    $v_i \mapsto \lfloor v_i \cdot s_v + \mu \rfloor$

where $\mu$ is drawn uniformly from the interval (0, 1). With this, a quantized array consists of one scaling factor and an array of quantized b-bit values.
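The scheme above can be made concrete with a small self-contained Scala sketch for the b-bit two's-complement case used by the 8-bit variant. It illustrates the two formulas only and is not the code generated by the eDSL; the names Quantized and quantize are invented here.

import scala.util.Random

final case class Quantized(scale: Float, values: Array[Byte])

// Quantize a float vector to b-bit values (b <= 8) with stochastic rounding.
def quantize(v: Array[Float], b: Int, rng: Random = new Random): Quantized = {
  val range  = (1 << (b - 1)) - 1    // 2^(b-1) - 1
  val maxAbs = v.map(math.abs).max
  val s      = range / maxAbs        // scaling factor s_v
  val q = v.map { vi =>
    val mu = rng.nextFloat()         // approximates a uniform draw from (0, 1)
    math.floor(vi * s + mu).toByte   // stochastic rounding to a b-bit integer
  }
  Quantized(s, q)
}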

Figure 9. Performance analysis: variable precision. (Performance in flops per cycle [f/c] for sizes 2^7 to 2^26, comparing the Java 32, 16, 8 and 4-bit implementations against the LMS generated 32, 16, 8 and 4-bit implementations.)

4.1 Implementation
32-bit. For the 32-bit version, we use the built-in hardware support for floating point. In Java, we can only express scalar (non-vectorized) code to multiply and add values; using our LMS intrinsics we have access to AVX2 and FMA.

16-bit. For the 16-bit version we use a half-precision floating point format, available as a separate ISA extension called FP16C that provides instructions to convert a 32-bit float into a 16-bit float, and vice versa. We use these instructions to load and store the data in the 16-bit format, and perform computations in the 32-bit format. In Java, there is no access to half-precision floating point; thus, instead, we quantize the values as shown before to the type short.

8-bit. We base our 8-bit version on Buckwild! [18]. Both the LMS intrinsics and the Java implementation operate on quantized values, using a scaling factor and an array of type byte to hold the 8-bit two's-complement values.

4-bit. We base this version on the analysis in the ZipML framework [28]. The values are not two's complement, but a sign bit followed by the base in binary format, and they are stored as pairs inside the 8-bit values of a byte array.

Scala implementation. In Scala, we can abstract the precision as a number that reflects the bit length of the format, and provide two virtual intrinsics functions:

int dot_ps_step (int bits);
__m256 dot_ps (int bits, void * x, void * y);

dot_ps_step returns the number of elements processed by dot_ps for a given bit length. For example, in the case of the 32, 16 and 8-bit versions, 32 elements are processed at a time, and in the case of the 4-bit version, 128 elements at a time. dot_ps takes the bit length and two memory addresses. According to the bit length, it selects the corresponding bit format, loads from the arrays, computes the dot product and returns the result. Assuming the arrays are padded to a multiple of their corresponding dot_ps_step value, the two intrinsics allow us to easily build for-loops with the increment defined by dot_ps_step, such that dot_ps is invoked at each iteration. Finally, the resulting dot product with variable precision is a sum reduction of the 8 floats stored in the acc variable. A generic implementation is given as Scala pseudo-code:

def dot_AVX2[T](bits: Rep[Int],
  a: Rep[Array[T]], b: Rep[Array[T]], len: Rep[Int]
): Rep[Float] = {
  var acc = _mm256_setzero_ps()
  val increment = dot_ps_step(bits)
  forloop(0, len, fresh[Int], increment, i => {
    acc += dot_ps(bits, a + i, b + i)
  })
  reduce_sum(acc)
}

In the 16-bit version, we use the _mm256_cvtph_ps intrinsic to convert the half-precision format into 32-bit floating point numbers. In the 8-bit and 4-bit versions we benefit from _mm256_sign_epi8 and _mm256_abs_epi8 combined with _mm256_maddubs_epi16 and _mm256_madd_epi16 to perform fast additions and multiplications that result in 32-bit values, without spending a single instruction on casts. Additionally, in the 4-bit version, we benefit from intrinsics that perform a wide range of bit-wise manipulations to extract the base and the sign of the 4-bit values. Note that this version of the dot product is implemented such that the existence of AVX2 and FP16C is assumed. It is possible to abstract the use of the dot operator into an ISA-agnostic implementation as shown in [22], and performance could be improved using methods such as prefetching, particularly in the 8-bit [18] and 4-bit versions; however, we did not apply these techniques.

Java implementation. We implement each case as a separate class corresponding to the 32, 16, 8 and 4-bit versions. The simplest dot product is the 32-bit version. Java does not support bit manipulation or numerical expressions on types narrower than 32 bits. Instead, 16-bit and 8-bit types are promoted to 32-bit integers before operations are performed. To provide a fair comparison, we write the 16, 8 and 4-bit versions of the dot product such that we block the loop and accumulate into integer values, avoiding unnecessary type promotion as much as possible.
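As an illustration of this baseline strategy, the following scalar sketch (written in Scala rather than Java, and reusing the Quantized helper from the earlier sketch) blocks the loop and accumulates each block into a 32-bit integer before scaling back. It is not the benchmarked Java code itself.

def dot8(a: Quantized, b: Quantized, block: Int = 32): Float = {
  val n   = math.min(a.values.length, b.values.length)
  var sum = 0.0f
  var i   = 0
  while (i < n) {
    var acc = 0                          // 32-bit accumulator for one block
    val end = math.min(i + block, n)
    var j   = i
    while (j < end) {
      acc += a.values(j) * b.values(j)   // byte * byte is promoted to int once
      j += 1
    }
    sum += acc                           // fold the block back into the float sum
    i += block
  }
  sum / (a.scale * b.scale)              // undo the quantization scaling
}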

4.2 Evaluation
We use the same experimental setup as in Section 3.4. For each benchmark, we use the identical flop (or op, for 8 and 4 bits) count of 2n for the dot product, where n is the size of the quantized array.
Figure 9 shows the obtained results. Our 4-bit implementation outperforms HotSpot by a factor of up to 40x, the 8-bit by up to 9x, the 16-bit by up to 4.8x, and the 32-bit version by up to 5.4x. There are several reasons for the speedups obtained with the use of SIMD intrinsics. In the 32-bit case, we see the limitation of SLP in detecting and optimizing reductions. In the 16-bit case, there is no way to obtain access in Java to an ISA extension such as FP16C. And in the 8-bit and 4-bit cases, Java is severely outperformed since it performs type promotion when dealing with integers. However, the largest speedup of 40x in the 4-bit case is due to the domain knowledge used in implementing the dot product, which the HotSpot compiler cannot synthesize with a lightweight autovectorization such as SLP.

5 Related Work
We review different lines of related work.

Explicit vectorization in the JVM. The first approach to expose data parallelism was the implementation of the Java Vectorization Interface (JVI) as part of the Jitrino JIT compiler [10]. JVI is designed as an abstract vector interface that provides a set of methods as vector operators. These methods are later compiled to different vector instructions, such as SSE and AVX. The approach offers competitive results in some cases, but is limited in the SIMD instructions it supports and lacks subsequent iterations for post-AVX ISAs. Similarly to JVI, Oracle has ongoing research on developing cross-platform APIs that can leverage SIMD instructions. Implemented as part of an experimental JVM called Panama [6], SIMD instructions are used through immutable vector types, parameterized by element type and size. Similarly to JVI, Panama also suffers from limited support of vector ISAs, and requires a specific JVM. Both approaches abstract SIMD instructions, which limits the ability of a developer to tune the code to a particular microarchitecture.

Autovectorization in the JVM. Initially introduced in the Jikes RVM [3], the HotSpot JVM uses SLP-based [8] autovectorization. SLP is limited and is only able to vectorize basic blocks consisting of groups of isomorphic instructions, generating SSE and AVX code. Partial support of FMA and AVX-512 is only planned for Java 9 [6].

Support of low-level code in the JVM. Sulong [15] is a system to execute low-level languages such as C/C++, Ada, and Haskell in the JVM. Sulong is capable of handling low-level languages that compile to LLVM, using an LLVM IR interpreter built on top of the Truffle framework [26], running on the Graal VM. While this approach can bring support for low-level instructions to the JVM, it does not support SIMD instructions, as Graal does not provide sufficient analysis for vectorization. Furthermore, due to interpretation, Sulong is shown to be outperformed by native compilers such as gcc.

Automatic generation of DSLs in LMS. DSLs have been generated into LMS before. Yin-Yang [7] automatically generates deep DSL embeddings from their shallow counterparts by reusing the core translation. Forge [23] generates DSLs from a declarative specification. None of these approaches has been challenged to generate DSLs at the scale imposed by the large set of SIMD intrinsics, nor were they designed to automatically infer effects of mutability.

SIMD intrinsics in LMS. Limited support for SIMD instructions has been introduced in the context of abstracting vector architectures [22]. This approach has been used for generating high-performance libraries, but integration with the JVM has not been demonstrated.

6 Conclusion
Our work shows how metaprogramming techniques can be used to bridge the gap between high-level managed languages and the need to access low-level instructions in high-performance code development. Specifically, we showed how to provide access to SIMD intrinsics in the HotSpot JVM, thus eliminating the need to write C/C++ code. Two key techniques underlie our approach. First is the use of embedded DSLs to express intrinsics inside JVM languages such as Scala. These are generated directly from the vendor XML specification, which enables complete intrinsics support and fast updates in the future. Second is the use of staging to convert SIMD intrinsics interspersed with Scala code into high-performance C kernels, which are then compiled and linked via JNI. The challenge in our work lies in the systematic handling of large sets of functions, converting them into sets of DSLs, automatically inferring their side effects, and creating a compiler and code generation pipeline for convenient and productive development. We show how the SIMD support in the JVM can be used to build powerful high-level and low-level abstractions while offering significant, often manifold speedups over the autovectorizing HotSpot JIT compiler.

References
[1] Léon Bottou. 1991. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes 91, 8 (1991).
[2] Intel Corporation. 2012. Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. (2012). [Online; accessed 4-August-2017].
[3] Sara Elshobaky, Ahmed El-Mahdy, and Ahmed El-Nahas. 2009. Automatic vectorization using dynamic compilation and tree pattern matching technique in Jikes RVM. In Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, ICOOOLPS 2009, Genova, Italy, July 6, 2009, Ian Rogers (Ed.). ACM, 63–69. https://doi.org/10.1145/1565824.1565833

[4] Apache Software Foundation. 2004. The Central Repository. https://search.maven.org/. (2004). [Online; accessed 4-August-2017].
[5] IBM. 2005. J9 Virtual Machine (JVM). https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/user/java_jvm.html. (2005). [Online; accessed 4-August-2017].
[6] Vladimir Ivanov. 2017. Vectorization in HotSpot JVM. http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf. (2017). [Online; accessed 4-August-2017].
[7] Vojin Jovanovic, Amir Shaikhha, Sandro Stucki, Vladimir Nikolaev, Christoph Koch, and Martin Odersky. 2014. Yin-yang: concealing the deep embedding of DSLs. In Generative Programming: Concepts and Experiences, GPCE'14, Vasteras, Sweden, September 15-16, 2014, Ulrik Pagh Schultz and Matthew Flatt (Eds.). ACM, 73–82. https://doi.org/10.1145/2658761.2658771
[8] Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI '00). ACM, New York, NY, USA, 145–156. https://doi.org/10.1145/349299.349320
[9] C. L. Lawson, Richard J. Hanson, D. R. Kincaid, and Fred T. Krogh. 1979. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw. 5, 3 (1979), 308–323. https://doi.org/10.1145/355841.355847
[10] Jiutao Nie, Buqi Cheng, Shisheng Li, Ligang Wang, and Xiao-Feng Li. 2010. Vectorization for Java. In Network and Parallel Computing, IFIP International Conference, NPC 2010, Zhengzhou, China, September 13-15, 2010. Proceedings (Lecture Notes in Computer Science), Chen Ding, Zhiyuan Shao, and Ran Zheng (Eds.), Vol. 6289. Springer, 3–17. https://doi.org/10.1007/978-3-642-15672-4_3
[11] Nate Nystrom. 2013. Scala Unsigned. https://github.com/nystrom/scala-unsigned. (2013). [Online; accessed 4-August-2017].
[12] Oracle. 2002. Oracle JRockit JVM. http://www.oracle.com/technetwork/middleware//overview/index.html. (2002). [Online; accessed 4-August-2017].
[13] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot™ Server Compiler. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (JVM'01). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=1267847.1267848
[14] Aleksandar Prokopec. 2012. ScalaMeter. https://scalameter.github.io. (2012). [Online; accessed 4-August-2017].
[15] Manuel Rigger, Matthias Grimmer, Christian Wimmer, Thomas Würthinger, and Hanspeter Mössenböck. 2016. Bringing low-level languages to the JVM: efficient execution of LLVM IR on Truffle. In Proceedings of the 8th International Workshop on Virtual Machines and Intermediate Languages, VMIL@SPLASH 2016, Amsterdam, Netherlands, October 31, 2016, Antony L. Hosking and Witawas Srisa-an (Eds.). ACM, 6–15. https://doi.org/10.1145/2998415.2998416
[16] Tiark Rompf and Martin Odersky. 2010. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs. In Generative Programming And Component Engineering, Proceedings of the Ninth International Conference on Generative Programming and Component Engineering, GPCE 2010, Eindhoven, The Netherlands, October 10-13, 2010, Eelco Visser and Jaakko Järvi (Eds.). ACM, 127–136. https://doi.org/10.1145/1868294.1868314
[17] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive modeling 5, 3 (1988), 1.
[18] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. ACM, 561–574. https://doi.org/10.1145/3079856.3080248
[19] SAP. 2011. SAP Java Virtual Machine (JVM). https://help.sap.com/viewer/65de2977205c403bbc107264b8eccf4b/Cloud/en-US/da030d10d97610149defa1084cb0b2f1.html. (2011). [Online; accessed 4-August-2017].
[20] Fridtjof Siebert. 2007. Realtime garbage collection in the JamaicaVM 3.0. In Proceedings of the 5th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES 2007, Institute of Computer Engineering, Vienna University of Technology, 26-28 September 2007, Vienna, Austria (ACM International Conference Proceeding Series), Gregory Bollella (Ed.). ACM, 94–103. https://doi.org/10.1145/1288940.1288954
[21] Skelmir. 1998. Embedded Virtual Machines (VM) to host Java applications. https://www.skelmir.com/products. (1998). [Online; accessed 4-August-2017].
[22] Alen Stojanov, Georg Ofenbeck, Tiark Rompf, and Markus Püschel. 2014. Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters. In ARRAY'14: Proceedings of the 2014 ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, Edinburgh, United Kingdom, June 12-13, 2014, Laurie J. Hendren, Alex Rubinsteyn, Mary Sheeran, and Jan Vitek (Eds.). ACM, 14–19. https://doi.org/10.1145/2627373.2627376
[23] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Generative Programming: Concepts and Experiences, GPCE'13, Indianapolis, IN, USA - October 27-28, 2013, Jaakko Järvi and Christian Kästner (Eds.). ACM, 145–154. https://doi.org/10.1145/2517208.2517220
[24] Christian Wimmer, Michael Haupt, Michael L. Van de Vanter, Mick J. Jordan, Laurent Daynès, and Doug Simon. 2013. Maxine: An approachable virtual machine for, and in, Java. TACO 9, 4 (2013), 30:1–30:24. https://doi.org/10.1145/2400682.2400689
[25] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. 2016. Quantized Convolutional Neural Networks for Mobile Devices. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 4820–4828. https://doi.org/10.1109/CVPR.2016.521
[26] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to rule them all. In ACM Symposium on New Ideas in Programming and Reflections on Software, Onward! 2013, part of SPLASH '13, Indianapolis, IN, USA, October 26-31, 2013, Antony L. Hosking, Patrick Th. Eugster, and Robert Hirschfeld (Eds.). ACM, 187–204. https://doi.org/10.1145/2509578.2509581
[27] Kamen Yotov, Xiaoming Li, Gang Ren, María Jesús Garzarán, David A. Padua, Keshav Pingali, and Paul Stodghill. 2005. Is Search Really Necessary to Generate High-Performance BLAS? Proc. IEEE 93, 2 (2005), 358–386. https://doi.org/10.1109/JPROC.2004.840444
[28] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. 2017. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 4035–4043. http://proceedings.mlr.press/v70/zhang17e.html