BlackjackBench: Portable Hardware Characterization with Automated Results Analysis

Journal: The Computer Journal    Manuscript ID: COMPJ-2012-03-0208

Manuscript Type: Original Article

Date Submitted by the Author: 31-Mar-2012

Complete List of Authors: Danalis, Anthony; University of Tennessee. Luszczek, Piotr; University of Tennessee. Marin, Gabriel; Oak Ridge National Laboratory. Vetter, Jeffrey; Oak Ridge National Laboratory. Dongarra, Jack; University of Tennessee.

Key Words: Micro-Benchmarks, Hardware Characterization, Statistical Analysis


BlackjackBench: Portable Hardware Characterization with Automated Results Analysis

ANTHONY DANALIS1, PIOTR LUSZCZEK1, GABRIEL MARIN2, JEFFREY S. VETTER2 AND JACK DONGARRA1

1University of Tennessee, Knoxville, TN, USA
2Oak Ridge National Laboratory, Oak Ridge, TN, USA
Email: {adanalis,luszczek,dongarra}@eecs.utk.edu, {maring,vetter}@ornl.gov

DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and NUMA characteristics of the system, properties of the processing cores affecting instruction execution speed, and the length of the OS scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other, resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor.

Keywords: Micro-Benchmarks, Hardware Characterization, Statistical Analysis

Received 00 January 2009; revised 00 Month 2009

1. INTRODUCTION

Compilers, autotuners, numerical libraries, and other performance sensitive software need information about the underlying hardware. If portable performance is a goal, automatic detection of hardware characteristics is necessary given the dramatic changes undergone by computer hardware. Several system benchmarks exist in the literature [1, 2, 3, 4, 5, 6, 7, 8]. However, as hardware becomes more complex, new features need to be characterized, and assumptions about hardware behavior need to be revised, or completely redesigned.

In this paper, we present BlackjackBench, a system characterization benchmark suite. The contribution of this work is twofold:

1. A collection of portable micro-benchmarks that can probe the hardware and record its behavior while control variables, such as buffer size, are varied.
2. A statistical analysis methodology, implemented as a collection of scripts for result parsing, that examines the output of the micro-benchmarks and produces the desired system characterization information, e.g., effective speeds and sizes.

BlackjackBench was specifically motivated by the effort to develop architecture aware compiler environments [9] that automatically adapt to hardware, which is unknown to the compiler writer, and optimize application codes based on the discovery of the runtime environment.

Often, important performance related decisions take into account effective values of hardware features, rather than their peak values. In this context, we consider an effective value to be the value of a hardware feature that would be experienced by a user level application written in C (or any other portable, high level, standards-compliant language) running on that hardware. This is in contrast with values that can be found in vendor documents, or through assembler benchmarks, or specialized instructions and system calls.

BlackjackBench goes beyond the state of the art in system benchmarking by characterizing features of modern multicore systems, taking into account contemporary – complex – hardware characteristics such as sophisticated cache prefetchers, the interaction between the cache and TLB hierarchies, etc. Furthermore, BlackjackBench combines established classification and statistical analysis techniques with heuristics tailored to specific benchmarks, to reduce experimental noise and aid


automatic interpretation of the results. As a consequence, BlackjackBench does not merely output large sets of data that require human intervention and comprehension; it shows information about the hardware characteristics of the tested platform. Moreover, BlackjackBench does not rely on assembler code, specialized kernel modules and libraries, nor non-portable system calls. Therefore, it is a portable system characterization tool.

2. RELATED WORK

Several low level system benchmarks exist, but most have different target audiences, functionality, or assumptions. Some benchmarks, such as those described by Molka et al. [10], aim to analyze the micro-architecture of a specific platform in great detail and thus sacrifice portability and generality. Others [11] sacrifice portability and generality by depending upon specialized software such as PAPI [12]. Autotuning libraries such as ATLAS [13] rely on micro-benchmarking for accurate system characterization for a very specific set of routines which need tuning. These libraries also develop their own characterization techniques [14], most of which we need to subsume in order to target a much broader feature spectrum.

Other benchmarks, such as CacheBench [8] or lmbench [6, 7], are higher level, portable, and use similar techniques to those we use – such as pointer chasing – but output large data sets or graphs that need human interpretation instead of "answers" about the values that characterize the hardware platform.

X-Ray [1, 2] is a micro- and nano-benchmark suite that is close to our work in terms of the scope of the system characteristics. There are, however, a number of features that we chose to discover with our tests that are not addressed by X-Ray. There are also differences in methodology which we mention, where appropriate, throughout this document. One important distinguishing feature is X-Ray's emphasis on code generation as part of the benchmarking activity, while we give more emphasis to analyzing the resulting data.

P-Ray [3] is a micro-benchmark suite whose primary aim is to complement X-Ray by characterizing multicore hardware features such as cache sharing and processor interconnect topology. We extend the P-Ray contribution with new tests and measurement techniques as well as a larger set of tested hardware architectures. In particular, the authors express interest in testing IBM Power and Intel Itanium architectures, which we did in our work.

Servet [5] is yet another suite of benchmarks that attempts to subsume both X-Ray and P-Ray by measuring a similar set of hardware parameters. It adds measurements of interconnect parameters for communication that occurs in a cluster of multicore processors with distributed memory. The methodology and measurement techniques in Servet complement, rather than imitate, those of X-Ray and P-Ray. They remain, however, in sharp contrast with our work. Unlike Servet, we do not focus on the actual hardware parameters of the tested system. Rather, we seek the observable parameters that often enough are below vendors' advertised specifications. Servet aims for maximum portability of its constituent tests, as does our work, but we were unable to compare this aspect of our efforts as the authors only presented results from two particular clusters.

In summary, our work differs from existing benchmarks in the methodology used in several micro-benchmarks, the breadth of hardware features it characterizes, the automatic statistical analysis of the results, and the emphasis on effective values and the ability to address modern, sophisticated architectures.

We consider the use of BlackjackBench as a tool for model-based tuning and performance engineering to be related to existing autotuning approaches based on exhaustive search [15, 16], analytical search methodologies [17], and techniques based on machine learning [18].

3. BENCHMARKS

In this section we describe the operation of our micro-benchmarks and discuss the assumptions about compiler and hardware behavior that make our benchmarks possible. We also present experimental results, from diverse hardware environments, as supporting evidence for the validity of our assumptions.

A key thesis of this work is that only hardware characteristics with a significant impact on application performance are important. Our benchmarks therefore vary control variables, such as buffer size, access pattern, number of threads, variable count, etc., in order to observe variations in performance. We assert that, by observing variations in the performance of benchmarks, all hardware characteristics that can significantly affect the performance of applications can be discovered. Conversely, if a hardware characteristic cannot be discovered through performance measurements, it is probably not very important to optimization tools such as compilers, auto-tuners, etc. Our benchmarks rely on assumptions about the behavior of the hardware under different circumstances, and try to trigger different behaviors by varying the circumstances.

The memory hierarchy in modern architectures is rather complex, with sophisticated hardware prefetchers, victim caches, etc. As a result, several details must be addressed to attain clean results from the benchmarks. Unlike benchmarks [5, 4] that use constant strides when accessing their buffers, we use a technique known as pointer chasing (or pointer chaining) [3, 1, 6]. To achieve this, we use a buffer that holds pointers (uintptr_t) instead of integers or characters. We initialize the buffer so that each element points to the element that should be accessed next, and then we traverse the buffer in a loop that reads an element and dereferences the value read to access the next element: ptr = (uintptr_t *)*ptr.
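To make the technique concrete, the following is a minimal, self-contained sketch of a pointer-chasing buffer and its timed traversal. The helper names, the buffer size, and the use of rand() and gettimeofday() are our illustrative choices rather than the actual BlackjackBench code, which additionally unrolls the traversal loop over 100 times and repeats each measurement about 30 times.

    /* Minimal sketch of pointer chasing: init a random cycle, then walk it. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Link the elements into one cycle that follows the permutation order[]. */
    static void chase_init(uintptr_t *buf, const size_t *order, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[order[i]] = (uintptr_t)&buf[order[(i + 1) % n]];
    }

    /* Timed loop: every load yields the address of the next element. */
    static void *chase_run(uintptr_t *start, size_t accesses)
    {
        uintptr_t *ptr = start;
        for (size_t i = 0; i < accesses; i++)
            ptr = (uintptr_t *)*ptr;
        return ptr;              /* returned so the compiler cannot drop the loop */
    }

    int main(void)
    {
        size_t n = 1u << 20;     /* 2^20 pointers: well beyond typical cache sizes */
        uintptr_t *buf = malloc(n * sizeof *buf);
        size_t *order = malloc(n * sizeof *order);
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        chase_init(buf, order, n);

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        void *last = chase_run(&buf[order[0]], 10 * n);
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f ns per access (%p)\n", 1e3 * us / (10.0 * n), last);

        free(buf);
        free(order);
        return 0;
    }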


The benefits of pointer chasing are threefold.

1. The initialization of the buffer is not part of the timed code. Therefore, it does not cause noise in the performance measurement loop, unlike calculations of offsets or random addresses done on the critical path. Furthermore, since we can afford for the initialization to be slow, we can use sophisticated pseudo-random number generators, when random access patterns are desirable.
2. It eliminates the possibility that a compiler could alter the loop that traverses the buffer, since the addresses used in the loop depend on program data (the pointers stored in the buffer itself).
3. It eliminates the possibility that the hardware prefetcher(s) can guess the address of the next element, when the access pattern is random, regardless of the sophistication level of the prefetcher.

To minimize the loop overhead, the pointer chasing loop is unrolled over 100 times. Finally, to minimize random noise in our measurements, we repeat each measurement several times (approximately 30).

3.1. Cache Hierarchy

Improved cache utilization is one of the most performance critical optimizations in modern computer hardware. Processor speed has been increasing faster than memory speed for several decades, making it increasingly harder for main memory to feed all processing elements with data quickly enough. To bridge the gap, fast, albeit small, cache memory has become necessary for fast program execution. In recent years, the pressure on main memory has increased further, as the number of processing elements per socket has been going up [19]. As a result, most modern processor designs incorporate complex, multi-level cache hierarchies that include both shared and non-shared cache levels between the processing elements. In this section we describe the operation of the micro-benchmarks that detect the cache hierarchy characteristics, as well as any eventual asymmetries in the memory hierarchy.

3.1.1. Cache Line Size
The assumption that enables this benchmark is that upon a cache miss the cache fetches a whole line (A1). (On some architectures the cache fetches two consecutive lines upon a miss. Since we are interested in effective sizes, for such architectures we report the cache line to be twice the size of the vendor documented value.) As a result, two consecutive accesses to memory addresses that are mapped to the same cache line will result in, at most, one cache miss. In contrast, two accesses to memory addresses that are farther apart than the size of a cache line can result in, at most, two cache misses.

To utilize this observation, our benchmark allocates a buffer large enough to be much larger than the cache, and performs memory accesses in pairs. The buffer is aligned to 512 bytes, which is a size assumed to be safely larger than the cache line size. Each access is to an element of size equal to the size of a pointer (uintptr_t). Each pair of accesses, in a sense, touches the first and the last elements of a memory segment with extent D at a random buffer location. To achieve this, the pairs are chosen such that every odd access, starting with the first one, is at a random memory address within the buffer and every even access is D − sizeof(uintptr_t) bytes away from the previous one. The random nature of the access pattern and the large size of the buffer guarantee, statistically, that the vast majority of the odd accesses will result in cache misses. However, the even accesses can result in either cache hits or misses depending on the extent D. If D is smaller than the cache line size, each even access will be in the same line as the previous odd access, and this will result in a cache hit. If D is larger than the cache line size, the two addresses will map to different cache lines and both accesses will result in cache misses.

Clearly, an access that results in a cache hit is served at the latency of the cache, while a cache miss is served at the latency of further away caches, or the RAM, which leads to significantly higher latency. Our benchmark forces this behavior by varying the value of D, expecting a significant increase in average access latency when D becomes larger than the cache line size. A sample run of this benchmark, on a Core 2 Duo processor, can be seen in Figure 1.

FIGURE 1. Cache Line Size Characterization on Core 2 Duo (average access latency versus access pair extent D).
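A possible way to chain such access pairs, reusing the chase helpers from the previous sketch, is shown below. Every "odd" access lands at a random element of a large buffer and the following "even" access lands extent_d bytes further on (minus one element). The pair count is illustrative, extent_d is assumed to be at least 2 * sizeof(uintptr_t), and index collisions, which the real benchmark takes care to avoid, are ignored here for brevity. The driver rebuilds the chain and times its traversal for, e.g., D = 16, 32, ..., 512 bytes, looking for the jump in average latency.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    static void pair_chain_init(uintptr_t *buf, size_t n_elems, size_t extent_d)
    {
        size_t gap   = extent_d / sizeof(uintptr_t) - 1; /* elements between the pair */
        size_t pairs = 4096;
        size_t prev  = 0;                                /* chain starts at buf[0]    */

        for (size_t p = 0; p < pairs; p++) {
            size_t odd  = 1 + (size_t)rand() % (n_elems - gap - 1);
            size_t even = odd + gap;           /* extent_d bytes past the pair start */
            buf[prev] = (uintptr_t)&buf[odd];  /* previous even access -> next odd   */
            buf[odd]  = (uintptr_t)&buf[even]; /* odd access -> its even partner     */
            prev = even;
        }
        buf[prev] = (uintptr_t)&buf[0];        /* close the cycle */
    }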

3.1.2. Cache Size and Latency
Using the cache line size information we can detect the number of cache levels, as well as their sizes and access latencies. The enabling assumption is that performing multiple accesses to a buffer that resides in the cache is faster than performing the same number of accesses to a buffer that does not reside in the cache (or only partially resides in the cache) (A2).

The benchmark starts by allocating a buffer that is expected to fit in the smallest cache of the system; for example, a buffer only a few cache lines large. Then we access the whole buffer with stride equal to the cache line size.


By making the access pattern random to avoid prefetching effects, and by continuously accessing the buffer until every element has been accessed multiple times to amortize start-up and cold-miss overhead, we can estimate the average latency per access.

The benchmark varies the buffer size and performs the same random access traversal for each new buffer size, recording the average access latency at each step. Due to assumption A2, we expect that the average access latency will be constant for all buffers that fit in a particular cache of the hierarchy, but there will be a significant access latency difference for buffers that reside in different levels of cache. Therefore, by varying the buffer size, we expect to generate a set of Access Latency vs. Buffer Size data that looks like a step function with multiple steps. A step in the data set should occur when the buffer size exceeds the size of a cache. The number of steps will equal the number of caches plus one (the extra step corresponds to the main memory), and the Y value at the apex of each step will correspond to the access latency of each cache.

In order to eliminate false steps, the cache benchmarks need to minimize the effects of the TLB. To achieve that goal, the accesses to the buffer are not uniformly random. Instead, the buffer is logically split into segments equal to a TLB page size. The benchmark accesses the elements of each segment in random order, but exhausts all the addresses in a segment before proceeding to the next segment. This approach guarantees that, in a system with a cache line size SL and a page size SP, there are at least SP/SL cache misses for each TLB miss. In a typical modern architecture the value of SP/SL is around 64. While this approach does not completely eliminate the cost of a TLB miss, it significantly amortizes it. Indeed, the data sets we have gathered from real hardware are easy to correlate with the characteristics of the cache hierarchy and exhibit little interference due to the TLB. Figure 2 shows the output of the benchmark on an Atom processor.

FIGURE 2. Cache Count, Size and Latency Characterization on Atom (access latency versus buffer size).
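The sketch below shows one way to build such a segment-by-segment random chain; chase_run() from the pointer-chasing sketch walks the returned chain. line_sz and page_sz are supplied by the other probes; the helper names and the shuffle are ours, not BlackjackBench's. The driver doubles the buffer size from a few KiB to tens of MiB and records the nanoseconds per access at every size.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    static void shuffle(size_t *v, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = v[i]; v[i] = v[j]; v[j] = t;
        }
    }

    /* buf_bytes must be a multiple of page_sz; page_sz a multiple of line_sz. */
    static uintptr_t *segment_chain_init(char *buf, size_t buf_bytes,
                                         size_t page_sz, size_t line_sz)
    {
        size_t lines_per_page = page_sz / line_sz;
        size_t pages = buf_bytes / page_sz;
        size_t *order = malloc(lines_per_page * sizeof *order);
        uintptr_t *first = NULL, *prev = NULL;

        for (size_t pg = 0; pg < pages; pg++) {
            for (size_t i = 0; i < lines_per_page; i++)
                order[i] = i;
            shuffle(order, lines_per_page);           /* random order within a segment */
            for (size_t i = 0; i < lines_per_page; i++) {
                /* first word of the chosen cache line inside this segment */
                uintptr_t *elem = (uintptr_t *)(buf + pg * page_sz
                                                + order[i] * line_sz);
                if (prev != NULL)
                    *prev = (uintptr_t)elem;
                else
                    first = elem;
                prev = elem;
            }
        }
        *prev = (uintptr_t)first;   /* close the cycle over the whole buffer */
        free(order);
        return first;
    }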
3.1.3. Cache Associativity
This benchmark is based on the assumption that in a system with a cache of size Sc, two memory addresses A1 and A2, where A1 % Sc == A2 % Sc, will map to the same set of an N-way set associative cache (A3). The benchmark assumes knowledge of the cache size, potentially extracted from the execution of the benchmark mentioned above.

Assuming that the cache size is Sc, we allocate a buffer many times larger than the cache size, M × Sc. Next, the benchmark starts accessing a part of the buffer that is K × Sc large (with K ≤ M), repeating the experiment for different, integral values of K. For every value of K the access pattern consists of randomly accessing addresses that are Sc bytes apart; this process is repeated a sufficient number of times. Since all such addresses will map into the same cache set, as soon as K becomes larger than N, the cache will start evicting elements to store the new ones, and therefore some of the accesses will result in cache misses. Consequently, the average access time should be significantly lower for K ≤ N than for K > N. The output from a sample run of this benchmark on an Itanium processor is shown in Figure 3.

FIGURE 3. Cache Associativity Characterization on Itanium II (access latency versus buffer size in multiples of the cache size).

We note that a victim cache may cause the benchmark to detect an associativity value higher than the real one. To prevent this, our benchmark implementation accesses elements in multiple sets, instead of just one.
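The core of one such measurement step can be sketched as follows. The k probed addresses are exactly cache_sz bytes apart, so they map to the same set; the caller allocates a buffer of at least k * cache_sz + sizeof(uintptr_t) bytes. The scrambling constant and repetition scheme are illustrative simplifications: the real benchmark uses pointer chasing and, as noted above, probes several sets rather than one. The driver times probe_ways() for k = 1, 2, ... and reports the value of k at which the time per access jumps.

    #include <stddef.h>
    #include <stdint.h>

    static uintptr_t probe_ways(const char *buf, size_t cache_sz,
                                size_t k, size_t reps)
    {
        uintptr_t sum = 0;
        for (size_t r = 0; r < reps; r++) {
            size_t idx = (r * 7919) % k;   /* scrambled order over the k addresses */
            sum += *(volatile const uintptr_t *)(buf + idx * cache_sz);
        }
        return sum;   /* returned so the reads are not optimized away */
    }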

3.1.4. Asymmetries in the Memory Hierarchy
With the move to multi-core processors, we witnessed the quasi-general introduction of shared cache levels in the memory hierarchy. A shared cache design provides larger cache capacity by eliminating data replication for multi-threaded applications. The entire cache may be used by the single active core for single-threaded workloads. More importantly, a shared cache design eliminates on-chip cache coherence at that cache level. In addition, it resolves coherence of the private lower level caches internally within the chip and thus reduces external coherence traffic. One downside of shared caches is a larger hit latency, which may cause increased cache contention and unpredictable conflicts.


Shared caches are not an entirely new design feature. Before level two caches were integrated onto the chip, some SMP architectures were using external shared L2 caches to increase capacity for single threaded workloads, and to reduce communication costs between processors.

Over the last decade, we also witnessed the integration of the memory controller into the processor chip, even for desktop grade processors. On the positive side, the on-chip memory controller increases memory performance by reducing the number of chip boundary crossings for a memory access, and by removing the memory bus as a shared resource. Memory bandwidth can be increased by adding additional memory controllers. Multi-socket systems now feature multiple memory controllers. It is expected that as the number of cores packed onto a single chip continues to increase, multiple on-chip memory controllers will be needed to keep up with the memory bandwidth demand. This higher memory bandwidth availability comes at the expense of increased hardware complexity in the form of Non-Uniform Memory Access (NUMA) costs. Cores experience lower latency when accessing memory attached to their own memory controller.

Both of these hardware features can potentially create asymmetries in the memory system. They may cause subsets of cores to have an affinity to each other – cores communicate faster with other cores from the same subset than with cores that are not part of the subset. For example, a cache level may be shared by only a subset of cores on a chip, as is the case with the L2 cache on the Intel Yorkfield family of quad core processors. Thus, cores that share some level of cache may exchange data faster among themselves than with cores that do not share that same level of cache. For NUMA architectures, data allocated on one NUMA node is accessed faster by cores located on the same NUMA node than by cores from different NUMA nodes.

To understand such memory asymmetries, we use a multi-threaded micro-benchmark based on a producer-consumer pattern. Two threads access a shared memory block. First, one thread allocates and then initializes the memory block with zeros. The benchmark assumes that a NUMA system implements the first-touch page allocation policy. By allocating a new memory block and then touching it, the memory block will be allocated in the local memory of the first thread. The page allocation policy can also be set using OS specific controls, e.g., the numactl command on NUMA-aware Linux kernels. Next, the two threads take turns incrementing all the locations in the memory block. One thread reads, and then modifies, the entire memory block before the other thread takes its turn. The two threads synchronize using busy waiting on a volatile variable. We use padding to ensure that the locking variable is allocated in a separate cache line, to avoid false sharing with other data. This approach causes the two threads to act both as producers and as consumers, switching roles after each complete update of the memory block. We repeat this process many times to minimize secondary effects and we compute the achieved bandwidth for different memory block sizes. We use a range of block sizes, from a block size smaller than the L1 cache size to a block size larger than the size of the last level of cache.
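A condensed sketch of this producer-consumer ping-pong follows. Thread pinning (e.g., through hwloc), the timing code, and the bandwidth computation are omitted; the block size, round count, and all names are our illustrative choices. C11 atomics stand in for the volatile busy-wait flag described above, and padding keeps the flag in its own cache line.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define BLOCK_BYTES (1u << 20)
    #define ROUNDS      1000

    static struct {
        atomic_int turn;                      /* whose turn it is to sweep the block */
        char pad[64 - sizeof(atomic_int)];    /* avoid false sharing with other data */
    } flag;

    static uint64_t *block;

    static void *worker(void *arg)
    {
        int me = (int)(intptr_t)arg;
        size_t n = BLOCK_BYTES / sizeof(uint64_t);

        if (me == 0)                          /* first touch: pages land on node 0   */
            for (size_t i = 0; i < n; i++)
                block[i] = 0;

        for (int r = 0; r < ROUNDS; r++) {
            while (atomic_load(&flag.turn) != me)
                ;                             /* busy wait for our turn              */
            for (size_t i = 0; i < n; i++)
                block[i]++;                   /* read-modify-write every cache line  */
            atomic_store(&flag.turn, 1 - me); /* hand the block to the peer thread   */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        block = malloc(BLOCK_BYTES);
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        free(block);
        return 0;
    }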
We measure the communication bandwidth between all pairs of cores. To be OS independent, the benchmark must assume that a thread is executed on the same core for its entire life. For best results, we control the placement of the threads using an OS specific API for pinning threads to cores. The hwloc [20] library provides a portable API for setting thread affinity, and we plan to integrate it into the benchmark suite. Next, the communication profiles are analyzed in decreasing order of the memory block size, to detect any potential cliques of cores that have an affinity to each other. The algorithm starts with a large clique that includes all the cores in the system. At each step, the data for the next smaller memory block is processed. The communication data between all cores of a clique, identified at a previous step, is analyzed to identify any sub-cliques of cores that communicate faster with each other at this smaller memory block level. In the end, the algorithm will produce a locality tree that captures all detectable asymmetries in the memory hierarchy.

FIGURE 4. One-way, inter-core communication bandwidth for different memory block sizes and core placements on a dual-socket Intel Gainestown system.

Figure 4 shows aggregated results for an Intel Gainestown system with two sockets and Hyper-Threading disabled. The x axis represents the memory block size, and the y axis represents the bandwidth observed by one of the threads. The bandwidth is computed as (number of updated lines × cache line size) / time, where the number of updated lines counts the cache lines updated by the first thread. Since the two threads update an equal number of lines, the values shown in the figure represent only half of the actual two-way bandwidth. As expected for such a system, the benchmark captures two distinct communication patterns:

1. when the two cores reside on different NUMA nodes, curve labeled inter-node;
2. when the two cores are on the same NUMA node, curve labeled intra-node.


For this system, when the data for the largest memory block size is analyzed, the algorithm divides the initial set of eight cores into two cliques of four cores each, corresponding to the two NUMA nodes in the system. The cores in each of these two cliques communicate among themselves at the speed shown by the intra-node point, while the communication speed between cores in distinct cliques corresponds to the inter-node point. The algorithm continues by processing the data for the smaller memory block sizes, but no additional asymmetries are uncovered in the following steps. The locality tree for this system has just two levels, with the root node corresponding to the entire system at the top, and two nodes of four cores each at the next level, corresponding to the two NUMA nodes in the system.

3.2. TLB Hierarchy

The TLB hierarchy is an important part of the memory system that bears some resemblance to the cache hierarchy. However, the TLB is sufficiently different to warrant its own characterization methodology. Accordingly, we will focus on the description of our TLB benchmarking techniques rather than present differences and similarities with the cache benchmarks.

The crucial feature that any TLB benchmark should possess is the ability to alleviate cache effects on the measurements. Both conflict and capacity misses coming from data caches should either be avoided at runtime or filtered out during the analysis of the results. We chose the latter, as it has the added benefit of capturing the rare events when the TLB and the data cache are inherently interconnected, such as when the TLB fits the same number of pages as there are data cache lines.

To determine the page size, our benchmark maximizes the penalty coming from TLB misses. We do so by traversing a large array multiple times with a given stride. The array is large enough to exceed the span of any TLB level – this guarantees a high miss rate if the stride is larger than or equal to the page size. If the stride is less than the page size, some of the accesses to the array will be contained in the same page, and thus will decrease the number of misses and the overall benchmark execution time. The false positives stemming from interference of data cache misses are eliminated by the high cost of a TLB miss in the last level of TLB. Handling these misses requires the traversal of the OS page table stored in the main memory – the combined latency exceeds the cost of a miss for any level of cache. Typical timing curves for this benchmark are shown in Figure 5. The figure shows results from three very different processors: ARM OMAP3, Intel Itanium 2, and Intel Nehalem EP. The graph line for each system has the same shape; for strides smaller than the page size, the line rises as the number of misses increases, because fewer memory accesses hit the same page. For strides that exceed the page size, the graph line is flat, because each array access touches a different page, so the per-access overhead remains the same. The page size is determined as the inflection point in the graph line. For our example, the page size is 4 KB on both ARM and Nehalem, and 16 KB on Itanium.

FIGURE 5. Graph of timing results that reveal the TLB page size (access latency versus memory access stride).
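The timed kernel of this probe can be sketched as a simple strided sweep; the array size and candidate strides are illustrative, and the driver (omitted) divides the sweep time by the number of accesses and looks for the stride at which the per-access cost stops growing.

    #include <stddef.h>

    #define ARRAY_BYTES ((size_t)512 << 20)   /* far beyond the span of any TLB level */

    /* One pass; the sum is returned so the accesses cannot be optimized away. */
    static unsigned long stride_sweep(volatile unsigned char *array, size_t stride)
    {
        unsigned long sum = 0;
        for (size_t off = 0; off < ARRAY_BYTES; off += stride)
            sum += array[off];
        return sum;
    }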
There are programmatic means of obtaining the TLB page size, such as the legacy POSIX call getpagesize() or the more modern sysconf(_SC_PAGE_SIZE). One obvious caveat of using any of these functions is portability. A more important caveat is the fact that these functions return the system page size rather than the hardware TLB page size. For most practical purposes they are the same, but modern processors support multiple page sizes, and large page sizes are often used to increase performance. Under such circumstances, our test delivers the observable size of a TLB page, which is the preferred value from a performance standpoint.

A similar argument can be made about counting the actual page faults rather than measuring the execution time and inferring the number of faults from the timing variations. There are system interfaces such as getrusage() on Unix. The pitfall here is that these interfaces assume the faults will either be the result of the first touch of the page or that the fault occurred because the page was swapped out to disk. Again, our technique sidesteps these issues altogether.

FIGURE 6. Graph of results that reveal the number of levels of TLB and the number of entries in each level (access latency versus the number of visited TLB pages, on Intel Core 2 Duo, AMD Opteron K10 (Istanbul), and IBM POWER7).


Once the actual TLB page size is known, it is possible to proceed with discovering the sizes of the levels of the TLB hierarchy. Probably the most important task is to minimize the impact of the data cache. The common and portable technique is to perform repeated accesses to a large memory buffer at strides equal to the TLB page size. This technique is prone to creating as many false positives as there are data cache levels, and a slight modification to it is required [4]. On each visited TLB page, our benchmark chooses a different cache line to access, thus maximizing the use of every level of the data cache. As a side note, choosing a random cache line within a page utilizes only half of the data cache on average. Figure 6 shows the timing graphs on a variety of platforms. Both Level 1 and Level 2 TLBs are identified accurately. As we mentioned before, we aim at the observable parameters: even if the TLB size is 512 entries, some of these entries will not contain user data. In a running process, there is a need for the function stack and the code segment, which will use one TLB page each, thus reducing the observable TLB size by two.
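A possible chain setup for this probe is sketched below: one access per page, with the cache line rotated from page to page so that the accesses spread over the data cache. page_sz and line_sz come from the earlier probes; the buffer must be at least pages * page_sz bytes, and chase_run() from the pointer-chasing sketch walks the returned chain while the driver varies the number of visited pages.

    #include <stddef.h>
    #include <stdint.h>

    static uintptr_t *tlb_chain_init(char *buf, size_t pages,
                                     size_t page_sz, size_t line_sz)
    {
        size_t lines_per_page = page_sz / line_sz;
        uintptr_t *first = NULL, *prev = NULL;

        for (size_t pg = 0; pg < pages; pg++) {
            /* a different cache line on every page, cycling through all of them */
            size_t line = pg % lines_per_page;
            uintptr_t *elem = (uintptr_t *)(buf + pg * page_sz + line * line_sz);
            if (prev != NULL)
                *prev = (uintptr_t)elem;
            else
                first = elem;
            prev = elem;
        }
        *prev = (uintptr_t)first;   /* wrap: the working set is exactly `pages` pages */
        return first;
    }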
3.3. Arithmetic Operations

Information about the processing units, such as the latency and throughput of various arithmetic operations, is important to compilers and performance analysis tools. Such information is needed to produce efficient execution schedules that attempt to balance the type of operations in loops with the number and type of resources available on the target architecture.

3.3.1. Instruction latencies
The latency L(O,T) of an operation O, with operands of type T, is calculated as the number of cycles it takes from the time one such operation is issued until its result becomes available to subsequent dependent operations. BlackjackBench reports all operation latencies relative to the latency of a 32-bit integer addition, since the CPU frequency is not directly detected. For each combination of operation type O and operand type T, our code generator outputs a micro-benchmark that executes in a loop a large number of chained operations of the given type. The cost per operation is computed by dividing the wall clock time of the loop by the total number of operations executed. The loop is executed twice, using the same number of iterations but with different unroll factors, and the difference of the two execution times is divided by the difference in operation counts between the two loops. This approach eliminates the effects of the loop overhead, without increasing the unroll factor to a very large value, which would significantly increase the time spent in compiling the micro-benchmarks and could create instruction cache issues. To account for run-time variability, each benchmark is executed six times, and the second lowest computed value is selected.

3.3.2. Instruction throughputs
The throughput T(O,T) of an operation O, with operands of type T, represents the rate at which one thread of control can issue and retire independent operations of a given type. Throughput is reported as the number of operations that can be issued in the time it takes to execute a 32-bit integer addition.

Instruction throughputs are measured using an approach similar to the one used for measuring instruction latencies. However, to determine the maximum rate at which operations are issued, the micro-benchmarks must include many independent operations as opposed to chained ones. At the same time, using too many independent operations increases register pressure, potentially causing unnecessary delays due to register spills/unspills. Therefore, for each operation type O and operand type T, multiple micro-benchmarks are generated, each with a different number of independent streams of operations. The number of parallel operations is varied between 1 and 20, and the minimum time per operation across all versions, Lmin, is recorded. Throughput is computed as the ratio between the latency of a 32-bit integer addition and Lmin.

3.3.3. Operations in flight
The number of operations in flight F(O,T) for an operation O, with operands of type T, is a measure of how many such operations can be outstanding at any given time in a single thread of control. This measure is a function of both the issue rate and the pipeline depth of the target processor, and is a unitless quantity.

Operations in flight are measured using an approach similar to the ones used for operation latencies and operation throughputs. For each operation type O and operand type T, multiple micro-benchmarks, with different numbers of independent streams of operations, are generated and benchmarked. However, unlike the operation throughput benchmarks, where we are interested in the minimum cost per operation, to understand the number of operations in flight we look at the cost per loop iteration. Each independent stream contains the same number of chained operations, and thus the total number of operations in one loop iteration grows proportionally with the number of streams. When we increase the number of independent streams in the loop, as long as the processor can pipeline all the independent operations, the cost per iteration should remain constant. Thus, the inflection point where the cost per iteration starts to increase yields the number of operations in flight supported by one thread of control.
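The following sketch shows the shape of one generated latency kernel for 32-bit integer addition; it is an illustration of the chained-dependence idea, not the generator's actual output. The additions alternate between two runtime-supplied accumulators so that a compiler is unlikely to collapse them; the iteration count and unroll factor are illustrative. A second variant with twice the unroll factor is also timed, and the time difference divided by the operation-count difference cancels the loop overhead. For the throughput and operations-in-flight variants, independent copies of the chain would be interleaved instead of lengthening a single chain.

    #include <stdint.h>

    #define ITERS (1u << 22)

    uint32_t chained_add8(uint32_t a, uint32_t b)   /* unroll factor 8 */
    {
        for (uint32_t i = 0; i < ITERS; i++) {
            /* every addition depends on the result of the previous one */
            a += b; b += a; a += b; b += a;
            a += b; b += a; a += b; b += a;
        }
        return a ^ b;   /* result is used, so the chain cannot be removed */
    }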


3.4. Execution Contexts

Modern systems commonly have multiple cores per socket and multiple sockets per node. To avoid confusion due to the overloading of the terms core, CPU, node, etc., by hardware vendors, we use the term "Execution Context" to refer to the minimum hardware necessary to effect the execution of a compute thread. Several modern architectures implementing virtual hardware threads exhibit selective preference over different resources. For example, a processor could have private integer units for each virtual hardware thread, but only a shared floating point unit for all hardware threads residing on a physical core. The Blackjack benchmarks attempt to discover the maximum number of:

1. Floating Point Execution Contexts,
2. Integer Execution Contexts, and
3. Memory Intensive Execution Contexts.

Intuitively, we are interested in the maximum number of compute threads that could perform floating point computations, integer computations, or intense memory accesses, without having to compete with one another, or wait for the completion of one another. As a result, in a system with N floating point contexts (or integer, or memory), we should be able to execute up to N parallel threads that perform intense floating point computations (or integer, or memory) and observe no interference between the threads. Thus the assumption is that in a system with N execution contexts, there should be no delay in the execution of each thread for M ≤ N threads, but for M > N threads at least one thread should be delayed (A4).

To utilize this assumption, our benchmark instantiates M threads that do identical work (floating point, integer, or memory intensive, depending on what we are measuring) and records the total execution time, normalized to the time of the case where M = 1. The experiment is repeated for different (integral) values of M until the normalized time exceeds some small predetermined threshold, typically 2 or 3. Due to assumption A4, the normalized total execution time will be constant and equal to 1 (with some possible noise) for M ≤ N and greater than one for M > N.
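A sketch of the floating-point variant of this probe is given below; the thread count sweep, work size, and names are illustrative choices. The divisor is read through a volatile variable so the division cannot be simplified away, and the memory-intensive variant would replace the loop body with pointer chasing over an L1-resident block.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define WORK (1u << 24)

    static void *fp_work(void *arg)
    {
        volatile double x = 1.0, one = 1.0;   /* values stay at 1.0: no divergence */
        for (unsigned i = 0; i < WORK; i++)
            x = x / one;                      /* a real division every iteration   */
        (void)arg;
        return NULL;
    }

    static double run_with_threads(int m)
    {
        pthread_t *t = malloc((size_t)m * sizeof *t);
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < m; i++)
            pthread_create(&t[i], NULL, fp_work, NULL);
        for (int i = 0; i < m; i++)
            pthread_join(t[i], NULL);
        gettimeofday(&t1, NULL);
        free(t);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    }

    int main(void)
    {
        double base = run_with_threads(1);
        for (int m = 1; m <= 32; m++)
            printf("%d threads: normalized time %.2f\n",
                   m, run_with_threads(m) / base);
        return 0;
    }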

While this behavior is true for every machine we have tested, the behavior for M > N depends on the scheduler of the Operating System. Namely, since we do not explicitly bind the threads to cores, the operating system is free to move them around in an effort to distribute the "extra" load caused by the M − N threads (for M > N) among all computing resources. If the operating system does so, then the total execution time for M > N will increase almost linearly with the number of threads. If the operating system chooses to leave the threads on the same core for their whole life time, the result will look like a step function with steps at M = N, M = 2N, M = 3N, etc., and corresponding apexes at Y = 1, Y = 2, Y = 3, and so on. The latter tends to be a rare case, but we observed it in our private Cray XT5 system, Jaguar.

For the case of the Memory Intensive Contexts it is important to distinguish between the ability of the execution contexts to perform memory operations and the ability of the memory subsystem to serve multiple parallel memory requests. To avoid erroneous results due to memory limitations, our benchmark limits the memory accesses of each thread to a small memory block that fits in the Level 1 cache and uses pointer chasing to traverse it. For the case of the Integer and Floating Point Contexts, the corresponding benchmarks execute a tight compute loop with divisions, which are among the most demanding arithmetic operations, on variables initialized to 1 (so the values do not diverge), in a way the compiler cannot decide statically, so that it cannot simplify the loop. A sample output of this benchmark on a Power7 processor can be seen in Figure 7.

FIGURE 7. Execution Contexts Characterization on Power7 (normalized execution time versus number of threads for the floating point, integer, and memory variants).

3.5. Live Ranges

An important hardware characteristic for optimization is the maximum number of registers available to an application. Although the number of registers on an architecture is usually known, not all the registers can be used for application variables. For example, either the stack pointer or the frame pointer must be held in a register, as at least one of them is essential for retrieving data from the stack (both cannot be stored on the stack). Other constraints or conventions require allocation of hardware registers for special purposes, thus reducing the actual number of registers available for applications. We use the term "Live Ranges" to describe the maximum number of registers concurrently usable by a user code.
When the number of concurrently live variables in a program exceeds the number of available registers, the compiler has to introduce code that "spills" the extraneous variables to memory. This incurs a delay because: 1) memory (even the L1 cache) is slower than the register file, and 2) the extra instructions needed to transfer the variable and calculate the appropriate memory location consume CPU resources that are otherwise used by the original computation performed by the application. Thus, the enabling assumption of this benchmark is that on a machine with N Live Ranges, a program using K > N registers will experience higher latency per operation than a program using K ≤ N registers (A5). To measure this effect, our benchmark runs a large number of small tests (130), each executing a loop with K live variables (involved in K additions), with K ranging from 3 to 132. By executing all these tests and measuring the average time per operation, we detect the maximum number of usable registers R by detecting the step in the resulting data.

The structure of the loop is similar to the one used by X-Ray [1], but is modified in two ways.

1. The switch statement, used in the loop by X-Ray, is unnecessary and detrimental to the benchmark. It is unnecessary because the chained dependences of the operations are sufficient to prohibit the compiler from aggressively optimizing the code (by simplifying the loop, or reducing the number of operations in it). It is detrimental because some compilers, on some architectures, choose to implement the switch with a jump determined by a register. As a result, the benchmark could report one less register than the number that is available for application variables.
2. The simple dependence chain suggested in X-Ray, P[i mod N] = P[i mod N] + P[(i + N − 1) mod N], can lead to inconclusive results on architectures with deep pipelines and sophisticated ALUs with spare resources. We observed this effect when the compiler-generated instruction schedules tried to hide the spill/unspill overhead by using the extra resources of the CPU. To reduce the impact of this effect we used the pattern P[i mod N] = P[i mod N] + P[(i + N − ⌊N/2⌋) mod N] for the loop body. This pattern enables ⌊N/2⌋ operations to be pipelined, provided sufficient pipeline depth. As a result, the execution of the operations in the loop puts much more pressure on the CPU and makes it harder to hide the memory overheads. An example of a generated loop body that follows this pattern is sketched below.
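The sketch below shows the shape of one generated kernel for K = 6 double-precision variables using the dependence pattern just described: each variable is incremented by the one K − ⌊K/2⌋ positions ahead, modulo K. The function name, iteration count, and the degree of manual unrolling are illustrative; the generator emits one such function for every K from 3 to 132.

    /* Each line below instantiates p[i] += p[(i + 6 - 3) % 6] for i = 0..5;
     * the first three additions are independent and can overlap in the pipeline. */
    double live6(double p0, double p1, double p2,
                 double p3, double p4, double p5, long iters)
    {
        for (long it = 0; it < iters; it++) {
            p0 += p3; p1 += p4; p2 += p5;
            p3 += p0; p4 += p1; p5 += p2;
            /* ...the real generated body repeats this block many times... */
        }
        return p0 + p1 + p2 + p3 + p4 + p5;   /* keep every variable live */
    }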


In a similar manner to other benchmarks where sensitive timing is performed, special attention was paid to details that can affect the performance, or the decisions of the compiler. As an example, the K operations that we time have to be performed thousands of times in a loop, due to the coarse granularity of widely available timers. However, the use of loops introduces latency and poses a dilemma for the compiler as to whether the loop induction variable should be kept in a register. Manually unrolling the body of the loop multiple times amortizes the loop overhead and reduces the importance of the induction variable when register allocation is performed. An example run of this benchmark, on a Power7 processor, can be seen in Figure 8.

FIGURE 8. Live Ranges Characterization on Power7 (average operation latency versus number of variables, for 32-bit and 64-bit integer and floating point variants).

The actual output data of this benchmark shows that for a very small number of variables the average operation latency is higher than the minimum operation latency achieved. This is the case because the pipeline is not fully utilized when the data dependencies do not allow several operations to execute in parallel. However, this benchmark is only aiming to identify the latency associated with spilling registers to memory, which results in increased latency. For this reason, the curves shown in the figure, which are used to extract the Live Ranges information, were obtained after applying monotonicity enforcement to the data. This technique is discussed in Section 4.1.
3.6. OS Scheduler time slot

Multithreaded applications can tune synchronization, scheduling, and work distribution decisions by using information about the typical duration of time a thread will run on the CPU uninterrupted after the OS schedules it for execution. We refer to this time duration as the OS scheduler time slot. The assumption that enables this benchmark is that when a program is being executed on a CPU with frequency F, the latency between instructions should be on the order of 1/F, whereas two instructions that execute in different time slots (because the program execution was preempted by the OS between these instructions) will be separated by a time interval on the order of the OS scheduler time slot, Ts, which is several orders of magnitude larger than 1/F (A6).

Our benchmark consists of several threads. Each thread executes the following pseudocode:

    te = ts = t0 = timestamp()
    while (te - t0 < TOTAL_RUNNING_TIME)
        OP_0 ... OP_N
        te = timestamp()
        if (te - ts > THRESHOLD)
            record(te - ts)
        ts = te

In other words, each loop iteration executes a minimal number of operations, takes a timestamp, and records the length of time Ti since the previous timestamp if Ti is larger than a predefined threshold. The threshold must be chosen such that it is much longer than the time the few operations and the timestamp function take to execute, but much shorter than the expected value of the scheduler time slot, Ts. In a typical modern system, a handful of arithmetic operations and a call to a timestamp function, such as gettimeofday(), take less than a microsecond. On the other hand, the scheduler time slot is typically on the order of, or larger than, a millisecond. Therefore, a THRESHOLD of 20 µs safely satisfies all our criteria.

Since we do not record the duration, Ti, of any loop iteration with Ti < THRESHOLD, and we have chosen THRESHOLD to be significantly larger than the pure execution time of each iteration, the only values recorded will be from iterations whose execution was interrupted by the OS and spanned more than one scheduler time slot.

The benchmark assumes knowledge of the number of execution contexts on the machine (i.e., cores), and oversubscribes them by a factor of two by generating twice as many threads as there are execution contexts. Since all threads are compute-bound, perform identical computation, and share an execution context with some other thread, we expect that, statistically, each thread will run for one time slot and wait during another. Regardless of scheduling fairness decisions and other esoteric OS details, the mode of the output data distribution (i.e., the most common measurement taken) will be the duration of the scheduler time slot, Ts. Indeed, experiments on several hardware platforms and OSes have produced the expected results.


4. STATISTICAL ANALYSIS

The output of the micro-benchmarks discussed in Section 3 is typically a performance curve, or rather, a large number of points representing the performance of the code for different values of the control variable. Since our motivation for developing BlackjackBench was to inform compilers and auto-tuners about the characteristics of a given hardware platform, we have developed analyses that can process the performance curves and output values that correspond to actual hardware characteristics.

4.1. Monotonicity enforcement

In all our benchmarks, except for the micro-benchmark that detects asymmetries in the memory hierarchy, the output curve is expected to be monotonically increasing. If any data points violate this expectation, it is due to random noise, or esoteric – second order – hardware details that are beyond the scope of this benchmark suite. Therefore, as a first post-processing step we enforce monotonic increase using the formula: for all i, Xi = min over j ≥ i of Xj.
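In code, a single backward pass implements this rule: every sample is replaced by the minimum of itself and all samples that follow it, which removes downward noise spikes while preserving genuine steps. The function name is ours.

    #include <stddef.h>

    static void enforce_monotonic(double *x, size_t n)
    {
        if (n == 0)
            return;
        for (size_t i = n - 1; i-- > 0; )   /* from the second-to-last sample down */
            if (x[i] > x[i + 1])
                x[i] = x[i + 1];            /* x[i] becomes min of x[i..n-1] */
    }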

4.2. Gradient Analysis

Most of our benchmarks result in data that resemble step functions. Therefore the challenge is to detect the location of the step, or steps, that contain the useful information.

4.2.1. First Step
The Execution Contexts benchmark produces curves that start flat (when the number of threads is less than the available resources), then exhibit a large jump (when the demand exceeds the resources), and then continue with noisy behavior and potentially additional steps, as can be seen in Figure 7.

To extract the number of Execution Contexts from this data, we are interested in that first jump from the straight line to the noisy data. However, due to noise, there can be small jumps in the part of the data that is expected to be flat. The challenge is to systematically define what constitutes a small jump versus a large jump for the data sets that can be generated by this benchmark. To address this, we first calculate the relative value increase at every step, dYr(n) = (Y(n+1) − Y(n)) / Y(n), and then compute the average relative increase <dYr>. We argue that the data point that corresponds to the jump we want (and thus the number of execution contexts) is the first data point i for which dYr(i) > <dYr>. The rationale is that the average of a large number of very small values (noise) and a few much larger values (actual steps) will be a value higher than the noise, but smaller than the steps. Thus the average relative increase <dYr> gives us a threshold to differentiate between small and large values for every data set of this type.
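A compact implementation of this first-step heuristic is sketched below; the function name is ours. With samples y[0], y[1], ... recorded for 1, 2, ... threads, the return value corresponds to the detected number of execution contexts.

    #include <stddef.h>

    static size_t first_step(const double *y, size_t n)
    {
        if (n < 2)
            return 0;
        double sum = 0.0;
        for (size_t i = 0; i + 1 < n; i++)
            sum += (y[i + 1] - y[i]) / y[i];
        double avg = sum / (double)(n - 1);    /* average relative increase */

        for (size_t i = 0; i + 1 < n; i++)
            if ((y[i + 1] - y[i]) / y[i] > avg)
                return i + 1;                  /* jump occurs after sample i */
        return n;                              /* no step found */
    }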

4.2.2. Biggest Step
The Live Ranges benchmark produces curves that start flat (when all variables fit in registers), then potentially grow slightly (if some registers are unavailable for some reason), then exhibit a large step when the first spill to memory occurs, and then continue growing in a non-regular way, as can be seen in Figure 8. Due to the increase in latency before the first spill, the previous approach for detecting the first step is not appropriate for this type of data. However, we can utilize the fact that the steps caused by the additional spills will be no larger than the step caused by the first spill. Furthermore, since the additional steps have higher starting Y values than the first step, the relative increase (Y(n+1) − Y(n)) / Y(n) for every n higher than the first spill will be lower than the relative increase of the first spill. To demonstrate this point, Figure 9 shows the data curves for the integer live ranges along with the relative dY values.

FIGURE 9. Live Ranges and Relative dY (average operation latency and relative increase versus number of variables, for the 32-bit and 64-bit integer variants).

The biggest relative step technique can also be used for processing the results of the cache line size benchmark and the cache associativity benchmark. For the TLB page size, where the desired information is in the last large step, the analysis seeks the biggest scaled step, dYs = dY × Y (instead of the biggest relative step).

4.2.3. Quality Threshold Clustering
Unlike the previous cases, where the analysis aimed to extract a single value from each data set, the benchmark for detecting the cache size, count, and latency has multiple steps that carry useful information. Due to the regular nature of the steps this benchmark tends to generate, we can group the data points into clusters based on their Y value (access latency), such that each cluster includes the data points that belong to one cache level.

For the clustering, we use a modified version of the quality threshold clustering algorithm [21]. The modification pertains to the cluster diameter threshold used by the algorithm to determine if a candidate point belongs to a cluster or not. In particular, unlike regular QT-Clustering, where the diameter is a constant value predetermined by the user of the algorithm, our version uses a variable diameter equal to 25% of the average value of each cluster.


The algorithm we use is shown in pseudocode in Figure 10.

    QT_clustering(G)
      if |G| <= 1 then
        output G
      else
        for i in G do
          flag = true
          A_i = {i}                          # A_i is the cluster seeded by point i
          while flag = true and A_i != G do
            find j in (G - A_i) such that diameter(A_i U {j}) is minimum
            if diameter(A_i U {j}) > 0.25 * average(A_i) then
              flag = false
            else
              A_i = A_i U {j}                # add j to cluster A_i
            end if
          end while
        end for
        identify the set C in {A_1, A_2, ..., A_|G|} with maximum cardinality
        print C
        QT_clustering(G - C)
      end if

FIGURE 10. Quality Threshold Clustering Pseudocode

Using QT-Clustering, we can obtain the clusters of points that correspond to each cache level. Thus, we can extract, for each cache level, the size information from the maximum X value of each cluster, the latency information from the minimum Y value of each cluster, and the number of levels of the cache hierarchy from the number of clusters. An example use of this analysis, on the data from a Power7 processor, can be seen in Figure 11. QT-Clustering is also used for the levels of the TLB.

FIGURE 11. QT-Clustering applied to Power7 Cache Data (access latency versus buffer size, with the L1, L2, local L3, and remote L3 clusters identified).
FIGURE 11. QT-Clustering applied to Power7 Cache Data (Access Latency in ns versus Buffer Size in KiB; clusters for L1, L2, the local L3, and the remote L3).

5. DISCUSSION

Comparing Figures 2 and 11 against vendor documents, one can notice that there are discrepancies in some of the reported values. For example, IBM advertises the L3 cache of the 8-core Power7 processor as 32 MiB. Also, according to the vendor, there is no L4 cache. However, our benchmark detects an L3 of size just under 4 MiB and a fourth level of cache with a size just under 32 MiB. The reason behind this behavior is that our benchmarks aim at the effective values of the hardware features, as those are viewed using changes in performance as the only criterion. In the case of the Power7, the L3 cache is organized in 4 MiB chunks "local" to each core, but shared between all cores. As a result, each core can access the first 4 MiB of the L3 faster than it accesses the rest. It then appears as if there were two discrete cache levels. The size being detected is slightly smaller than the vendor's claims. This phenomenon occurs on virtually every architecture we have tested, for the largest cache in the hierarchy. The cause differs between architectures, and it is a combination of imperfect cache replacement policies, misses due to associativity, and pollution due to caching of data such as page table entries or program instructions. As a result, if a user application has a working set equal to the vendor-advertised value for the largest cache, it is practically impossible for that application to execute without a performance penalty due to cache misses. Therefore, we consider the size of the largest working set that can be used without a significant performance penalty to be the effective cache size, which is the value we are interested in.

A few of our measurements may be affected by a variety of system settings. For example, the BIOS on some servers includes an option for pair-wise prefetching in Intel Xeon server chips. The Itanium architecture has its peculiarities, such as the lack of Level 1 caching support for floating-point values and two different TLB page sizes at each level of the TLB. In fact, TLB support can be quite peculiar, and this includes a separate TLB space for huge page entries on AMD's Opteron chips. Similarly, there are lock-down TLB entries that could potentially be utilized differently by various implementations of the ARM processor designs. And there is the issue of a read-only Level 1 TLB on the … architecture. Occasionally, the concurrency of in-memory TLB table walkers may limit performance if multiple cores happen to suffer heavy TLB miss rates and require constant TLB reloading from main memory. Finally, some OSes (HP-UX and AIX in particular) allow the user's binary executables to include the information on what page size should be used during process execution, rather than have a system-wide default value. Such uncommon conditions can be challenging for our benchmarks, in that the produced results could differ from the user's expectations.

6. GUIDED TUNING

Aside from benefiting AACE, we consider BlackjackBench an indispensable tool for model-based tuning and performance engineering of computational kernels for scientific codes. One such kernel, called DSSSM [22], is absolutely essential for good performance of tile linear algebra codes [23]. For the sake of brevity and simplicity of exposition, we will show how BlackjackBench helps us guide a model-based tuning of one of the components of the kernel: a Schur's complement update based on matrix-matrix multiplication [24]. We believe that our approach is a viable alternative to existing exhaustive search approaches [15, 16], analytical search [17], or approaches based on machine learning [18].


We proceed first by constructing a cache reuse model to maximize the issue rate of floating-point operations. In fact, it is desirable to retire as many such operations in every cycle as is feasible given the number of FPU units. The Schur's complement kernel has a high memory reuse factor that grows linearly with the input matrix sizes. Our model is a constrained optimization problem with discrete parameters:

    max  2mnk / (2mnk + (mn + mk + nk))    subject to    mn + mk + nk ≤ R

where m, n, and k are the input matrix dimensions and R is the number of registers or the cache size. When the above fraction is maximized, the code performs the most floating-point operations per load operation for the kernel under consideration. BlackjackBench was used to supply the values of R for the register file (Live Ranges) and the Level 1 cache size. Optimizations that take into account additional levels of cache are not as important for tile algorithms [23].
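As an illustration of how such a model can be instantiated, the brute-force sketch below enumerates blockings with k = 1 (the form of all blockings quoted below) and keeps the one that maximizes the fraction while the three blocks fit within R storage locations. The search strategy, the restriction to k = 1, and the function name are our own illustrative choices, not part of BlackjackBench.

    /* Brute-force sketch of the blocking model: maximize 2mnk/(2mnk + mn + mk + nk)
       subject to mn + mk + nk <= R, with k fixed to 1.  Illustrative only. */
    #include <stdio.h>

    static void best_blocking(long R)
    {
        long bm = 0, bn = 0;
        double best = -1.0;
        for (long m = 1; 2 * m + 1 <= R; m++) {
            for (long n = 1; m * n + m + n <= R; n++) {
                double flops = 2.0 * m * n;                 /* 2mnk with k = 1      */
                double loads = (double)(m * n + m + n);     /* mn + mk + nk, k = 1  */
                double frac  = flops / (flops + loads);     /* the modeled fraction */
                if (frac > best) { best = frac; bm = m; bn = n; }
            }
        }
        printf("R = %ld: m = %ld, n = %ld, k = 1\n", R, bm, bn);
    }

    int main(void)
    {
        best_blocking(16);         /* register file: prints m = 3, n = 3       */
        best_blocking(32768 / 8);  /* 32 KiB of doubles: prints m = 63, n = 63 */
        return 0;
    }

Under these assumptions, R = 16 yields m = n = 3 and R = 4096 (32 KiB of 8-byte values) yields m = n = 63, in line with the blockings discussed below before the practical adjustments are applied.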
We put our methodology to the test on a dual-core Intel Core i7 machine with a 2.66 GHz clock. This determines the size of the register file (16) and the Level 1 cache (32 KiB), as supplied by BlackjackBench and confirmed by the hardware specification. Our model points to register blocking with m = 3, n = 3, and k = 1, and to cache blocking with m = 63, n = 63, and k = 1. These parameters, however, are not practical, because the kernel would most likely always be called with even matrix sizes, and so the cleanup code [15] would have to be utilized in the majority of cases, which may drastically diminish the performance. Thus, a more feasible register blocking is m = 4, n = 2, and k = 1. The Level 1 cache blocking has to be adjusted to match the register blocking. A further adjustment comes from the fact that one of the input matrices in our kernel may have non-unit stride, and loading its entries into the L1 cache brings in one cache line worth of floating-point data, so we need room to accommodate these extra items. Using BlackjackBench we discover the cache line size, and the corresponding change in the cache blocking parameters is m, n = 56 and k = 1. Since m and n are the same, we refer to them collectively as the matrix size.

Figure 12 shows performance results for our kernel and the vendor implementation. According to our model and the considerations presented above, optimal performance should occur at matrix size 56, and the figure confirms this finding: the Blackjack line drops for matrix sizes larger than this value. An additional benefit of our guided tuning is that we are able to achieve significantly higher performance for the range of matrix sizes that are the most important in our application.

FIGURE 12. Performance of the Schur's complement kernel routines (Performance in Gflop/s versus matrix size). VecLib is based on ATLAS. Blackjack is a routine optimized for Level 1 cache.

7. CONCLUSION

We have presented the BlackjackBench system characterization suite. This suite of micro-benchmarks goes beyond the state of the art in benchmarking by:

1. Offering micro-benchmarks that can exercise a wider set of hardware features than most existing benchmark suites do.
2. Emphasizing portability by avoiding low level primitives, specialized software tools and libraries, or non-portable OS calls.
3. Providing comprehensive statistical analyses as part of the characterization suite, capable of distilling the results of micro-benchmarks into useful values that describe the hardware.
4. Emphasizing the detection of hardware features through variations in performance. As a result, BlackjackBench detects the effective values of hardware characteristics, which is what a user level application experiences when running on the hardware, instead of often unattainable peak values.

We describe how the micro-benchmarks operate, and their fundamental assumptions. We explain the analysis techniques for extracting useful information from the results and demonstrate, through several examples drawn from a variety of hardware platforms and OSes, that our assumptions are valid and our benchmarks are portable.

REFERENCES

[1] Yotov, K., Pingali, K., and Stodghill, P. (2005) Automatic measurement of memory hierarchy parameters. SIGMETRICS Perform. Eval. Rev., 33, 181–192.
[2] Yotov, K., Jackson, S., Steele, T., Pingali, K., and Stodghill, P. (2005) Automatic measurement of instruction cache capacity. Proceedings of the 18th Workshop on Languages and Compilers for Parallel Computing (LCPC).
[3] Duchateau, A. X., Sidelnik, A., Garzarán, M. J., and Padua, D. A. (2008) P-ray: A suite of micro-benchmarks for multi-core architectures. Proc. 21st Intl. Workshop on Languages and Compilers for Parallel Computing (LCPC'08), Edmonton, Canada, pp. 187–201.
[4] Saavedra, R. H. and Smith, A. J. (1995) Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes. IEEE Trans. Computers, 44, 1223–1235.

[5] González-Domínguez, J., Taboada, G., Fraguela, B., Martín, M., and Touriño, J. (2010) Servet: A Benchmark Suite for Autotuning on Multicore Clusters. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA.
[6] McVoy, L. and Staelin, C. (1996) lmbench: portable tools for performance analysis. ATEC '96: Proceedings of the USENIX 1996 Annual Technical Conference, Berkeley, CA, USA, USENIX Association, pp. 23–23.
[7] Staelin, C. and McVoy, L. (1998) mhz: Anatomy of a micro-benchmark. USENIX 1998 Annual Technical Conference, New Orleans, Louisiana, January 15–18, pp. 155–166.
[8] Mucci, P. J. and London, K. (1998) The CacheBench Report. Technical report, Computer Science Department, University of Tennessee, Knoxville.
[9] DARPA AACE. http://www.darpa.mil/ipto/solicit/baa/BAA-08-30_PIP.pdf.
[10] Molka, D., Hackenberg, D., Schöne, R., and Müller, M. S. (2009) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System. PACT '09: Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques, Washington, DC, USA, pp. 261–270. IEEE Computer Society.
[11] Dongarra, J., Moore, S., Mucci, P., Seymour, K., and You, H. (2004) Accurate Cache and TLB Characterization Using Hardware Counters. ICCS, June.
[12] Browne, S., Dongarra, J., Garner, N., Ho, G., and Mucci, P. (2000) A Portable Programming Interface for Performance Evaluation on Modern Processors. International Journal of High Performance Computing Applications, 14, 189–204.
[13] Whaley, R. C., Petitet, A., and Dongarra, J. J. (2001) Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27, 3–35.
[14] Whaley, R. C. and Castaldo, A. M. (2008) Achieving accurate and context-sensitive timing for code optimization. Software: Practice and Experience, 38, 1621–1642.
[15] Whaley, R. C. and Dongarra, J. (1998) Automatically tuned linear algebra software. CD-ROM Proceedings of SuperComputing 1998: High Performance Networking and Computing.
[16] Whaley, R. C., Petitet, A., and Dongarra, J. J. (2001) Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27, 3–35.
[17] Yotov, K., Li, X., Ren, G., Garzarán, M., Padua, D., Pingali, K., and Stodghill, P. (2005) A comparison of empirical and model-driven optimization. Proceedings of the IEEE, 93, special issue on "Program Generation, Optimization, and Adaptation".
[18] Singh, K. and Weaver, V. (2007) Learning models in self-optimizing systems. COM S, 612.
[19] Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., and Tomov, S. (2006) The impact of multicore on math software. In Kågström, B., Elmroth, E., Dongarra, J., and Wasniewski, J. (eds.), Applied Parallel Computing. State of the Art in Scientific Computing, 8th International Workshop, PARA, Lecture Notes in Computer Science, 4699, pp. 1–10. Springer.
[20] Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., and Namyst, R. (2010) hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Pisa, Italy.
[21] Heyer, L. J., Kruglyak, S., and Yooseph, S. (1999) Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9, 1106–1115.
[22] Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. J. (2008) Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20, 1573–1590. DOI: 10.1002/cpe.1301.
[23] Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., and Tomov, S. (2009) Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series, 180.
[24] Golub, G. and Van Loan, C. (1996) Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, MD.
