Measuring Multicore Performance

EMB E DD E D COMPUTING ators to full-fledged processors. Further, SoCs could be based on shared- or distributed-memory architectures. They could consist of either Measuring homogeneous or heterogeneous cores. And they could employ a vari- ety of interconnect technologies. Multicore technologies are highly Multicore differentiated, so multicore benchmarks need to be highly differenti- Performance ating as well. TRADITIONAL BENCHMARKING methods Shay Gal-On and Markus Levy, EEMBC Before elaborating on different multicore benchmarking methods, it’s useful to provide a brief histori- cal perspective. During the past two decades, Multicore benchmarking benchmarks that only exercised a processor core’s internal work- techniques must break the ings were sufficient. In fact, these mold of traditional methods. benchmarks are still valid depending on the processor characteristics to be ascertained. Most of the EEMBC first-generation benchmarks fall into this cate- gory. While still quite popular, they predominantly exercise the proces- s an astute reader of MULTITUDES OF MULTICORE sor core and have little interaction Computer, you prob- Amazingly, many people—engi- with the external memory. For ably don’t need the neers included—think that a spe- example, they test functions and detailed explanation of cific x86 manufacturer invented features such as pipelines, branch why most of the indus- multicore. While the x86 has gar- prediction units, instruction sets, Atry has so markedly shifted to multi- nered the biggest spotlight, it only and caches. core technology. But in all honesty, represents a fraction of multicore- Although these benchmarks can many embedded system designers enabled devices. run on top of most operating sys- are still struggling to determine Focusing on the multicore pro- tems, they’re designed to run bare- whether multicore really buys them cessors with characteristics more metal. Performance is measured anything in terms of performance. or less similar to an x86, there’s in iterations per second, where an Resolving this quandary requires a plethora of general-purpose, iteration is the sequential execution a thorough understanding of the shared-memory, symmetric mul- of the benchmark kernel. target application, the character- tiprocessing (SMP)-featured The compiler also plays a big istics of multicore processors that products from companies such as role in this type of benchmarking. could be used, and the amount of ARM, Freescale Semiconductor, In fact, we’ve seen as much as 70 time that must be invested to make IBM, and MIPS Technologies. percent performance difference the transition. Beyond this, countless vendors depending on the compiler used to Having reliable performance are building multicore products generate the benchmark results. information provides a good start- in the form of application-spe- EEMBC’s second-generation ing point for analyzing these fac- cific systems on a chip. SoCs can benchmarks go a step further, pro- tors, but “reliable” is the operative be as simple as a processor with viding significantly larger code word. In other words, it’s impera- a general computing core plus a and datasets to ensure that even tive to pick the right types of bench- digital signal processing core. the most robust memory and cache marks to accurately predict the per- They might have any number of hierarchies are tested. formance of the multicore processor cores, ranging in complexity from Due to the popularity of and once it’s in the final product. single-function hardware acceler- familiarity with the first- and November 2008 99 Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply. EMB E DD E D COMPUTING second-generation benchmarks, Shared memory, typically alties when it oversubscribes com- many entities—including proces- accessed through a bus and con- puting resources. sor vendors, system developers, trolled by some type of locking In familiar terms, assume that and academic research groups— mechanism to avoid simultaneous an application program consists of want to continue using them access by multiple cores, provides a varying number of threads—it’s to measure multicore processor a straightforward programming not unreasonable to have hundreds performance. model because each processor of threads in a relatively complex In theory, multiple instantiations can directly access the memory. program. If the number of threads of each benchmark can be launched Typically associated with homo- exactly matches the number of pro- simultaneously. Some might rec- geneous multicore systems, shared cessor cores, performance could ognize this method as similar to a memory also facilitates program- scale linearly assuming no limita- SPECrate, which measures a sys- ming with traditional languages tions on memory bandwidth. tem’s capacity for processing jobs of because it allows passing data by However, realistically the num- a specified type in a given amount reference, without actually mov- ber of threads will exceed the num- of time. The Standard Performance ing the data. ber of cores, and performance will Evaluation Corporation notes that depend on other factors such as this “metric is used the same for cache utilization, memory and I/O multi-processor systems and for At the highest level, bandwidth, intercore communica- uni-processors” (www.spec.org/ benchmarks for multicore tions, OS scheduling support, and spec/glossary). architectures should be synchronization efficiency. In fact, even in a multicore system, it isn’t possible to guarantee either computationally or Example that the system is utilizing more memory intensive, or some So does sequential code work for than one core unless it’s employing combination of both. benchmarking multicore proces- some form of processor affinity. In sors? The answer is yes with respect other words, without programmer to the cumulative throughput of intervention, the platform’s sched- each individual core. In this case, uler will assume control over exe- For cores with individual caches, memory bandwidth and computa- cution of the individual benchmark there must be a coherency mechanism tion can be evaluated. instantiations. between the caches. The ease of use In fact, we’re aware of at least The question is whether sequen- of this architecture can lead to perfor- one example of this. A telecom tial code really works for this mance bottlenecks due to competition equipment manufacturer transi- purpose. between multiple cores accessing the tioned its application from a mul- same memory locations. tiprocessor to a multicore system. MULTICORE BENCHMARK In a distributed memory system, Merely switching from a system in CRITERIA which is more common within an which each processor had its own To answer this question, it’s nec- SoC, each processor can access memory subsystem to one in which essary to understand the important its own local memory but doesn’t the two cores shared the memory multicore performance character- have to share it with other cores, subsystem resulted in a critical per- istics. At the highest level, bench- even though there might also be formance reduction, regardless of marks for multicore architectures a global memory address space how much inherent parallelism the should be either computationally across them. code possessed. or memory intensive, or some com- When one core requires that data This is the type of information we bination of both. from another core or cores must syn- must be able to derive using bench- chronize, the system must physically marks before going through the Memory bandwidth move data or the control code must massive porting effort required to Regardless of the type of multi- switch to run on a different core. Even switch to a new platform. core architecture, memory band- though each core has its own local width is a key factor in perfor- memory, there might still be memory SMP-BASED MULTICORE mance. A multicore processor’s bottlenecks depending on how data benchmarks memory bandwidth, as with any moves on or off the chip itself. Executing multiple copies of other processor, depends on the sequential code doesn’t account for memory subsystem’s design. In Scalability one of the most important potential turn, the memory subsystem Another important benchmark benefits of multicore: using paral- depends on the underlying multi- criterion is scalability, in which the lelism to improve the performance core architecture. processor incurs performance pen- of individual tasks rather than 100 Computer Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply. improving overall throughput. One face. This means that if the sys- code into blocks that use MCAPI example, assuming a dual-core pro- tem already supports Pthreads, no to communicate. cessor, would be using both cores porting is necessary. to load one web page twice as fast. The multicore benchmarks are Application-specific Alternatively, you could parallelize delivered as a set of workloads, standard BENCHMARKS the work by using both cores to load each comprising one or more While the demand for SMP-based two web pages (one per core) in the work items. Although it’s easier benchmarks and for those that sup- same amount of time it would take said than done, users can select port heterogeneous cores is grow- to load one. from this list the workloads that ing almost exponentially, a move to Accomplishing that would require most closely resemble their appli- application-specific standard bench- benchmarks that utilize task decom- cation. marks is afoot. Also known as sce- position, functional decomposition, nario-oriented benchmarks, ASSBs or data decomposition. These meth- BEYOND SMP are designed to take into account ods could more comprehensively It’s important to note that the more system-level features. exercise all of the major multicore MultiBench tests are oriented benchmark criteria. Of course, the toward general-purpose proces- Black-box benchmarks devil’s in the details.

Load more