MINIME: Pattern-Aware Multicore Benchmark Synthesizer Etem Deniz, Member, IEEE, Alper Sen, Senior Member, IEEE, Brian Kahne, Jim Holt, Senior Member, IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2014.2349522, IEEE Transactions on Computers. IEEE TRANSACTIONS ON COMPUTERS, VOL. V, NO. N, 2013.

0018-9340 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

E. Deniz and A. Sen are with the Department of Computer Engineering, Bogazici University, Istanbul, Turkey 34342. E-mail: [email protected], [email protected]. Brian Kahne and Jim Holt are with Freescale Semiconductor Inc., Austin, TX, USA. E-mail: [email protected],[email protected]

Abstract—We present a novel automated multicore benchmark synthesis framework with characterization and generation components. Our framework uses parallel patterns in capturing important characteristics of multi-threaded applications and generates synthetic multicore benchmarks from those applications. The resulting synthetic benchmarks are small, fast, portable, human-readable, and they accurately reflect microarchitecture dependent and independent characteristics of the original multicore applications. Also, they can use either Pthreads or MCA libraries. We implement our techniques in the MINIME tool and generate synthetic benchmarks from PARSEC, Rodinia, and EEMBC MultiBench™ benchmarks on x86 and Power Architecture® platforms. We show that synthetic benchmarks are representative across a range of multicore machines with different architectures, while being on average 21x faster and 14x smaller than original benchmarks.

Index Terms—Multicore Systems, Parallel Patterns, Synthetic Benchmarks.

1 INTRODUCTION

Thermal and power problems limit the performance that single core processors can deliver. This has led to the development of multicore systems. There is a need for efficient parallel programming to fully utilize these systems. Patterns, in our case parallel patterns, help ease the burden of parallel programming by bringing best practices to commonly occurring programming challenges. Parallel patterns are high level characteristics that define the structure of a multicore application in terms of communication and data sharing behaviors. They provide a way to design and create robust and understandable parallel multicore applications rapidly. We use these high level parallel pattern characteristics in developing synthetic multicore benchmarks.

Benchmarks represent software workloads for current and future multicore systems and they are used for early design exploration and to evaluate performance, power consumption, and reliability of new multicore systems. Development of new multicore systems requires a large number of benchmarks. At the same time, there is an increase in simulation runtimes of benchmarks that limits our ability to fully explore the design space. We need to develop faster benchmarks.

Existing benchmark suites such as PARSEC and Rodinia, as well as the embedded multicore benchmark suite EEMBC MultiBench, are big and rely on the presence of shared memory architectures, or Pthreads, OpenMP, and OpenCL libraries, as well as uniform CPU ISAs. Multicore systems may not be able to use these benchmarks as they may not support such architectures. There is a need for benchmarks suitable for any given infrastructure, that is, SMP or message passing architectures.

In order to solve the above limitations, we need to develop new benchmarks, but the benchmark development process is time- and labor-intensive. We present a novel synthetic benchmark synthesis approach using parallel patterns that addresses these limitations. Synthetic benchmarks do not perform any useful computation, yet they can approximate characteristics of real-life applications. These benchmarks can be generated by varying application characteristics or can be derived from existing benchmarks. A synthetic benchmark is smaller and faster than the original benchmark that it is derived from; hence it simulates faster. In this work, we generate synthetic multicore benchmarks that are fast, portable, and suitable for any given infrastructure.

We experimentally validate our techniques by generating synthetic multicore benchmarks from PARSEC, Rodinia, and EEMBC benchmark suites using our MINIME tool. Our synthetics can use either Pthreads or Multicore Association (MCA) libraries [1], the latter allowing us to have infrastructure independent benchmarks. Synthetic benchmarks are compared with the original benchmarks using similarity metrics based on both microarchitecture dependent and independent characteristics. We found that synthetic benchmarks are on average 92% similar to the original benchmarks. We also found that the synthetics that correctly captured the parallel patterns in the originals have a high level of similarity.

In particular, this paper makes the following contributions.

• The key novelty of our approach is that we use parallel patterns in generating synthetic multicore applications.
• We formalize the parallel pattern recognition process by presenting reference behaviors for each parallel pattern type.
• We present an algorithm for synthetic multicore benchmark generation.
• Our synthetics are portable since they are generated in a high level programming language, C, as opposed to assembly in earlier works. Also, they can be generated using either Pthreads or MCA libraries.
• Our synthetics are suitable for embedded systems and can run on any given infrastructure thanks to using MCA libraries.
• Our synthetics can act as proxies for proprietary customer applications that are not publicly available.
• We developed the MINIME tool and experimentally validate our techniques on both x86 and Power Architecture systems using PARSEC, Rodinia and EEMBC MultiBench benchmark suites. Experiments show that our synthetics are similar to the originals with respect to several metrics. They are also faster and smaller than the originals and they mimic the behavior of the original on different microarchitectures.
• We study the impact of input changes on the synthetics. We also perform correlation studies to determine the importance of correct parallel patterns in achieving high similarity.

2 PARALLEL PATTERNS

Architectural patterns are fundamental organizational descriptions of common top-level structures observed in a group of software systems [2]. One of the most important decisions during the design of the overall structure of a software system is the selection of an architectural pattern. Architectural patterns allow software developers to understand complex software systems in larger conceptual blocks and their relations, thus reducing the adoption complexity and providing less error prone applications.

Architectural design patterns have been developed for object-oriented software and have been found to be very useful [3]. Similarly, a parallel pattern language which is a collection of design patterns, guiding [...] use the term parallel pattern as synonym for parallel software architectural pattern.

Fig. 1. Parallel Patterns for Software [5]

There exist three classes of parallel patterns based on organization of tasks, data, and flow of data. Figure 1 shows parallel patterns in a decision tree [5]. Each parallel pattern has unique architectural characteristics to exploit. When work is divided among several independent tasks, which cannot be parallelized individually, the parallel pattern employed is Task Parallelism (TP). The independent tasks may read shared data, but they produce independent results. In Divide and Conquer (DaC), a problem is structured to be solved in sub-problems independently, and the outputs are merged later. This pattern is used to solve many sorting, computational geometry, graph theory, and numerical problems. Divide and conquer algorithms can cause load-balancing problems when using non-uniform sub-problems, but this can be resolved if the sub-problems can be further reduced.

In data centric patterns, data is decomposed in alignment with the set of tasks. When the data decomposition is linear, the parallel pattern that is employed is called Geometric Decomposition (GD). In GD, data decomposition can inherently deliver natural load balancing since data is partitioned into equal-size pieces. Matrix, list, and vector operations are examples of geometric decomposition. The parallel pattern used with recursively defined data structures is called Recursive Data (RD). Graph search and tree algorithms are example usages of recursive data.

Apart from task parallelism and data parallelism, if a series of ordered but independent computation stages need to be applied on data, where each output of a computation becomes input of a subsequent computation, the Pipeline (Pl) parallel pattern is used. Each stage processes its data serially and all stages run in parallel to increase the throughput. Event-based Coordination (EbC) parallel pattern defines a set of tasks that run concurrently
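
To make the Geometric Decomposition (GD) pattern described in this section concrete, the following is a minimal sketch in C with Pthreads. It is an illustration only and is not taken from MINIME or from the synthetics it generates. The names `chunk_t`, `sum_chunk`, and `gd_parallel_sum` are hypothetical. The sketch assumes that a vector sum is an acceptable stand-in for "vector operations": the input is split linearly into near-equal chunks, each thread works only on its own chunk, and the main thread combines the partial results.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* One contiguous slice of the input, owned by a single worker thread. */
typedef struct {
    const int *data;    /* start of this worker's slice */
    size_t     len;     /* number of elements in the slice */
    long       partial; /* partial sum written by the worker */
} chunk_t;

/* Worker: sums its own slice; it touches no other worker's data. */
static void *sum_chunk(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    long s = 0;
    for (size_t i = 0; i < c->len; i++)
        s += c->data[i];
    c->partial = s;
    return NULL;
}

/* Hypothetical helper: linear decomposition of data[0..n) into
 * num_threads near-equal chunks, summed in parallel.
 * Returns 0 and stores the sum in *total on success, -1 on failure. */
int gd_parallel_sum(const int *data, size_t n, int num_threads, long *total)
{
    assert(num_threads > 0 && total != NULL);
    size_t nt = (size_t)num_threads;
    pthread_t *tids = malloc(nt * sizeof *tids);
    chunk_t *chunks = malloc(nt * sizeof *chunks);
    if (tids == NULL || chunks == NULL) {
        free(tids);
        free(chunks);
        return -1;
    }

    /* Equal-size partitioning: the first `rem` chunks get one extra item. */
    size_t base = n / nt, rem = n % nt, offset = 0, started = 0;
    int status = 0;
    for (size_t i = 0; i < nt; i++) {
        chunks[i].data = data + offset;
        chunks[i].len = base + (i < rem ? 1 : 0);
        chunks[i].partial = 0;
        offset += chunks[i].len;
        if (pthread_create(&tids[i], NULL, sum_chunk, &chunks[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Join only the threads that were actually started, then combine. */
    long sum = 0;
    for (size_t i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        sum += chunks[i].partial;
    }
    free(tids);
    free(chunks);
    if (status == 0)
        *total = sum;
    return status;
}
```

A caller would pass an array, its length, and a thread count, for example `gd_parallel_sum(v, 10, 4, &total)`, and build with a Pthreads-enabled toolchain (commonly `cc -pthread`). Because the chunks differ in size by at most one element, the load is balanced by construction, which is the property the text attributes to GD.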

