LORE: a Loop Repository for the Evaluation of Compilers Zhi Chen∗, Zhangxiaowen Gong†, Justin Josef Szaday†, David C
Total Page:16
File Type:pdf, Size:1020Kb
LORE: A Loop Repository for the Evaluation of Compilers Zhi Chen∗, Zhangxiaowen Gongy, Justin Josef Szadayy, David C. Wongz, David Padua y, Alexandru Nicolau ∗, Alexander V Veidenbaum ∗, Neftali Watkinson ∗, Zehra Sura x, Saeed Maleki{, Josep Torrellasy, Gerald DeJongy z Intel Corporation, x IBM Research, { Microsoft Research, ∗ University of California, Irvine, y University of Illinois at Urbana-Champaign, Email: ∗fzhic2, nicolau, alexv, [email protected], [email protected], yfgong15, szaday2, padua, torrella, [email protected], [email protected], {[email protected] Abstract—Although numerous loop optimization techniques successive versions, and the effectiveness of solo/compound have been designed and deployed in commercial compilers in transformations. While some of such data can be found in the the past, virtually no common experimental infrastructure nor literature, in many cases compiler studies are confined to a repository exists to help the compiler community evaluate the effectiveness of these techniques. few codes, which are not always widely available. In addition, This paper describes a repository, LORE, that maintains a there is little in the area of historical data that shows how much large number of C language for loop nests extracted from popu- progress compiler technology has made in terms of delivering lar benchmarks, libraries, and real applications. It also describes performance. the infrastructure that builds and maintains the repository. Each In this paper, we propose LORE, a repository of program loop nest in the repository has been compiled, transformed, executed, and measured independently. These loops cover a segments, their semantically equivalent transformed versions, variety of properties that can be used by the compiler community and performance measurements with various compilers. Our to evaluate loop optimizations using a broad and representative goal is to create an extensible repository for the compiler collection of loops. community that can provide a common base for experimen- To illustrate the usefulness of the repository, we also present tation and comparison of results, allowing compiler writers two example applications. One is assessing the capabilities of to see the effect of transformations on individual loop nests the auto-vectorization features of three widely used compilers. The other is measuring the performance difference of a compiler and to compare execution times of different transformation across different versions. These applications prove that the repos- sequences. Currently, the repository contains: itory is valuable for identifying the strengths and weaknesses of • A growing number of C language for loop nests (∼2,500 a compiler and for quantitatively measuring the evolution of a as of the writing of this paper) extracted from multiple compiler. benchmarks, libraries, and real applications by an extractor. The extractor encapsulates loop nests into standalone exe- I. INTRODUCTION cutables called codelets. Each codelet executes a loop using A significant fraction of execution time is consumed by data captured during the original benchmark execution, loops for a large class of programs. For this reason, numerous measures its execution time, and collects readouts from loop optimization techniques [1], [2] have been designed and hardware performance counters for further analysis. deployed in all major compilers. These optimizations trans- • Multiple semantically equivalent versions of each loop pro- form loops into semantically equivalent versions that exhibit duced by a source-to-source mutator that applies sequences better locality, require less computation, and/or take advantage of loop transformations constructed by combining tiling, of parallelism from vector devices/multicores. Furthermore, interchange, distribution, unrolling, and unroll-and-jam into compiler optimizations also enhance programability and en- sets of possibly repeated subsets of these transformations. able machine independence. The ultimate goal of research in We call the result of applying each of the transformation automatic program optimization is to enable programmers to sequences a mutation. The repository currently contains focus on the preparations of readable and correct code because ∼90,000 different mutations (an average of 36 mutations they can rely on compilers to generate highly efficient code per original loop), as of the writing of this paper. for each target machine. It is well known that we are far from • A database that correlates the execution profiling of the reaching the ultimate goal and that compilers today are brittle original loop and its mutations with the compilers and target in that it is impossible to know before hand if they will be machines used for each execution. This data is valuable not able to generate a highly efficient version of the code or if, on only for comparing the effectiveness of different compilers’ the contrary, it will be necessary to devote considerable time optimization passes but also for tracking the evolution of to manually transform the code for good performance. a compiler’s capability in loop optimization. Currently the Clearly, there is much room for advances in compiler repository contains data on the execution on a single target technology. An important enabler of these future advances is machine from 2 versions of the Intel R C++ Compiler data on the effectiveness of compilers on different classes of (ICC), the GNU C Compiler (GCC), and Clang (frontend codes and target machines, the progress of compilers across of LLVM for C family languages). We extract the loops from the benchmarks for two reasons. Loop identification One is to reduce the time to evaluate each loop. Evaluating and extraction the effect of different compilers with various switches on the C applications Source level original loop nests and their semantically equivalent mutations instrumentation typically requires numerous executions of the loop. Much time Input dataset is saved by executing the loop and its mutations in isolation. capturing The second reason is that it is important to study the execution Loop extractor Performance measurement of the loops separately so that we could classify loops for Dependence analysis Characteristic the purposes of applying machine learning techniques for extraction compilation and, when necessary, alter the context of execution Loop Loop Repository Website of each loop. transformations interfaces Both the source code for the original loop and its mutations Loop mutator Other tools and the measurement data are available for download/query Similarity analysis Loop clustering from the LORE website. The web interface also conveniently Loop clusterer allows users to view dynamically generated plots/charts from Fig. 1: The system framework of the infrastructure. selected loops/compilers. In order to illustrate the usefulness of LORE, we include in data items that the loop reads before being modified during this paper two experiments. One compares the effectiveness of its execution. This is done to guarantee that the execution major compilers’ auto-vectorizers since vectorization plays an of the codelet will follow the same flow of control as the important role in performance improvement and efficient hard- benchmark. The memory layout is preserved for data accessed ware utilization. The other assesses the performance difference in the loop, however the extractor does not replicate the cache across generations of a compiler. These experiments prove state when a benchmark executes the target loop since cache that LORE is an efficient approach to effectively identify the state is invisible to the programmers during coding. Finally, strengths and weaknesses of a compiler and to quantitatively the extractor copies the loop statements, creates the necessary measure the evolution of a compiler. declaration, and inserts the instrumentation statements. The remainder of the paper is organized as follows. Sec- Data copying. For each for loop nest to be separated, tionII presents the tools contained in the infrastructure for the extractor inserts code into the source file to first capture loop extraction, mutation, and clustering. We explain the the data read inside the loop and then save them to a file. details of the repository in Section III. SectionIV gives two To this end, the loop nest is analyzed statically to identify all example analysis that users can perform using the data in the variables that are read-only or read-before-write. Variables that repository. SectionV discusses related work and compares are guaranteed to be write-only or write-before-read are not our repository and infrastructure tools with the state-of-the-art instrumented as their values are generated during the execution research. Finally, SectionVI concludes the paper. of the loop. Then, for these collected variables the values are obtained as follows depending on the nature of the variables. ODULES TO CREATE AND ACCESS LORE II. M • Non-pointer scalar variables are instrumented simply by Figure1 shows the different software modules that we have adding code to write their values to a file directly. built to create and access LORE, which include: • Arrays with known dimensions are treated similarly to scalar • an extractor to create codelets, variables. Statements are added before the target loop to • a mutator to create multiple semantically equivalent versions write the values of all array elements to a file in a row- (mutations) of each codelet, major way. • a clusterer