E m b e d d e d computing

ators to full-fledged processors. Further, SoCs could be based on shared- or distributed-memory archi- tectures. They could consist of either Measuring homogeneous or heterogeneous cores. And they could employ a vari- ety of interconnect technologies. Multicore technologies are highly Multicore differentiated, so multicore bench- marks need to be highly differenti- Performance ating as well. TRADITIONAL BENCHMARKING methods Shay Gal-On and Markus Levy, EEMBC Before elaborating on different multicore benchmarking methods, it’s useful to provide a brief histori- cal perspective. During the past two decades, Multicore benchmarking benchmarks that only exercised a processor core’s internal work- techniques must break the ings were sufficient. In fact, these mold of traditional methods. benchmarks are still valid depend- ing on the processor characteristics to be ascertained. Most of the EEMBC first-genera- tion benchmarks fall into this cate- gory. While still quite popular, they predominantly exercise the proces- s an astute reader of MULTITUDES OF MULTICORE sor core and have little interaction Computer, you prob- Amazingly, many people—engi- with the external memory. For ably don’t need the neers included—think that a spe- example, they test functions and detailed explanation of cific x86 manufacturer invented features such as pipelines, branch why most of the indus- multicore. While the x86 has gar- prediction units, instruction sets, Atry has so markedly shifted to multi- nered the biggest spotlight, it only and caches. core technology. But in all honesty, represents a fraction of multicore- Although these benchmarks can many embedded system designers enabled devices. run on top of most operating sys- are still struggling to determine Focusing on the multicore pro- tems, they’re designed to run bare- whether multicore really buys them cessors with characteristics more metal. Performance is measured anything in terms of performance. or less similar to an x86, there’s in iterations per second, where an Resolving this quandary requires a plethora of general-purpose, iteration is the sequential execution a thorough understanding of the shared-memory, symmetric mul- of the benchmark kernel. target application, the character- tiprocessing (SMP)-featured The also plays a big istics of multicore processors that products from companies such as role in this type of benchmarking. could be used, and the amount of ARM, Freescale Semiconductor, In fact, we’ve seen as much as 70 time that must be invested to make IBM, and MIPS Technologies. percent performance difference the transition. Beyond this, countless vendors depending on the compiler used to Having reliable performance are building multicore products generate the benchmark results. information provides a good start- in the form of application-spe- EEMBC’s second-generation ing point for analyzing these fac- cific systems on a chip. SoCs can benchmarks go a step further, pro- tors, but “reliable” is the operative be as simple as a processor with viding significantly larger code word. In other words, it’s impera- a general computing core plus a and datasets to ensure that even tive to pick the right types of bench- digital signal processing core. the most robust memory and cache marks to accurately predict the per- They might have any number of hierarchies are tested. formance of the multicore processor cores, ranging in complexity from Due to the popularity of and once it’s in the final product. single-function hardware acceler- familiarity with the first- and November 2008 99

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply. E m b e d d e d computing

second-generation benchmarks, Shared memory, typically alties when it oversubscribes com- many entities—including proces- accessed through a bus and con- puting resources. sor vendors, system developers, trolled by some type of locking In familiar terms, assume that and academic research groups— mechanism to avoid simultaneous an application program consists of want to continue using them access by multiple cores, provides a varying number of threads—it’s to measure multicore processor a straightforward programming not unreasonable to have hundreds performance. model because each processor of threads in a relatively complex In theory, multiple instantiations can directly access the memory. program. If the number of threads of each benchmark can be launched Typically associated with homo- exactly matches the number of pro- simultaneously. Some might rec- geneous multicore systems, shared cessor cores, performance could ognize this method as similar to a memory also facilitates program- scale linearly assuming no limita- SPECrate, which measures a sys- ming with traditional languages tions on memory bandwidth. tem’s capacity for processing jobs of because it allows passing data by However, realistically the num- a specified type in a given amount reference, without actually mov- ber of threads will exceed the num- of time. The Standard Performance ing the data. ber of cores, and performance will Evaluation Corporation notes that depend on other factors such as this “metric is used the same for cache utilization, memory and I/O multi-processor systems and for At the highest level, bandwidth, intercore communica- uni-processors” (www.spec.org/ benchmarks for multicore tions, OS scheduling support, and spec/glossary). architectures should be synchronization efficiency. In fact, even in a multicore sys- tem, it isn’t possible to guarantee either computationally or Example that the system is utilizing more memory intensive, or some So does sequential code work for than one core unless it’s employing combination of both. benchmarking multicore proces- some form of processor affinity. In sors? The answer is yes with respect other words, without to the cumulative throughput of intervention, the platform’s sched- each individual core. In this case, uler will assume control over exe- For cores with individual caches, memory bandwidth and computa- cution of the individual benchmark there must be a coherency mechanism tion can be evaluated. instantiations. between the caches. The ease of use In fact, we’re aware of at least The question is whether sequen- of this architecture can lead to perfor- one example of this. A telecom tial code really works for this mance bottlenecks due to competition equipment manufacturer transi- purpose. between multiple cores accessing the tioned its application from a mul- same memory locations. tiprocessor to a multicore system. MULTICORE BENCHMARK In a distributed memory system, Merely switching from a system in CRITERIA which is more common within an which each processor had its own To answer this question, it’s nec- SoC, each processor can access memory subsystem to one in which essary to understand the important its own local memory but doesn’t the two cores shared the memory multicore performance character- have to share it with other cores, subsystem resulted in a critical per- istics. At the highest level, bench- even though there might also be formance reduction, regardless of marks for multicore architectures a global memory address space how much inherent parallelism the should be either computationally across them. code possessed. or memory intensive, or some com- When one core requires that data This is the type of information we bination of both. from another core or cores must syn- must be able to derive using bench- chronize, the system must physically marks before going through the Memory bandwidth move data or the control code must massive porting effort required to Regardless of the type of multi- switch to run on a different core. Even switch to a new platform. core architecture, memory band- though each core has its own local width is a key factor in perfor- memory, there might still be memory SMP-BASED MULTICORE mance. A multicore processor’s bottlenecks depending on how data benchmarks memory bandwidth, as with any moves on or off the chip itself. Executing multiple copies of other processor, depends on the sequential code doesn’t account for memory subsystem’s design. In Scalability one of the most important potential turn, the memory subsystem Another important benchmark benefits of multicore: using paral- depends on the underlying multi- criterion is scalability, in which the lelism to improve the performance core architecture. processor incurs performance pen- of individual tasks rather than

100 Computer

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply. improving overall throughput. One face. This means that if the sys- code into blocks that use MCAPI example, assuming a dual-core pro- tem already supports Pthreads, no to communicate. cessor, would be using both cores porting is necessary. to load one web page twice as fast. The multicore benchmarks are Application-specific Alternatively, you could parallelize delivered as a set of workloads, standard BENCHMARKS the work by using both cores to load each comprising one or more While the demand for SMP-based two web pages (one per core) in the work items. Although it’s easier benchmarks and for those that sup- same amount of time it would take said than done, users can select port heterogeneous cores is grow- to load one. from this list the workloads that ing almost exponentially, a move to Accomplishing that would require most closely resemble their appli- application-specific standard bench- benchmarks that utilize task decom- cation. marks is afoot. Also known as sce- position, functional decomposition, nario-oriented benchmarks, ASSBs or data decomposition. These meth- BEYOND SMP are designed to take into account ods could more comprehensively It’s important to note that the more system-level features. exercise all of the major multicore MultiBench tests are oriented benchmark criteria. Of course, the toward general-purpose proces- Black-box benchmarks devil’s in the details. sors based on an SMP architec- We view ASSBs as black-box Further, for a benchmark to be ture. Although MultiBench is benchmarks. From the simplest relevant for multiple cores and significantly more parallel than perspective, they specify the input, produce comparable results, it the first- and second-generation expected output, and interface must be able to execute the same benchmarks, it still doesn’t support points. In other words, it doesn’t amount of work regardless of the the heterogeneous cores found in matter what’s inside the system as number of contexts used, and it SoCs. This calls for an entirely dif- long as the benchmark can handle must be able to show the perfor- ferent benchmarking strategy. the input and deliver the expected mance improvement (or degrada- output. tion) that results from the number Unlike complete off-the-shelf sys- of contexts used. The benchmark Heterogeneous cores tems, many embedded systems are must be able to utilize any number require careful analysis and tested using evaluation boards or of computation contexts because similar preproduction platforms. it will be used across many differ- workload partitioning. The problem is that developers can ent platforms. tinker with these platforms in subtle ways to indicate better performance MultiBench Unlike the SMP benchmarks, than a system will achieve as a com- Primarily addressing the embed- which can essentially consist of mercial product. ded market, EEMBC has imple- orthogonal threads, heterogeneous For example, suppose you were mented MultiBench, an extensive cores require careful analysis and evaluating the performance of an suite of multicore benchmarks that workload partitioning. Bench- SoC running the Session Initiation utilizes an API abstraction to more marking these devices in a relevant Protocol. SIP is used to set up and easily support SMP architectures manner with portable code is a tear down multimedia communica- (www..org/press/pressrelease/ huge challenge because each part tion sessions such as voice and video 223001_M30_EEMBC.pdf). of the system might use a different calls over the Internet. Given the variety of architec- tool chain, and communication In normal operation, a com- tures in the embedded industry, between different parts of the sys- mercial product should be able to flexibility is key. Hence, when tem isn’t standardized. receive and handle packet streams porting to a new platform, only There are several options for consisting of mostly valid, but 13 API calls are needed to allow benchmarking such systems. some invalid, packets. However, in the framework and all bench- Extending the existing MultiBench an evaluation platform, a perfor- marks to run and to exploit par- framework to support heteroge- mance advantage could hypotheti- allel execution. neous systems requires some stan- cally be obtained by leaving out the Moreover, many systems sup- dard to allow the benchmarks to code that processes invalid packets. port the Portable Operating Sys- be portable. For example, the Mul- Therefore, the ASSB must also inject tem Interface (Posix) threads ticore Communications API from and test for bad packets. interface, and EEMBC carefully the Multicore Association (www. chose the API abstraction such multicore-association.org/home. Challenges that there is a direct mapping to php) provides a standardized frame- Although ASSBs have been the more complex Pthreads inter- work for partitioning benchmark described as the way forward in

November 2008 101

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply. E m b e d d e d computing

computer benchmarking (S.M. missing various components of the Shay Gal-On is director of software Pieper et al., “A New Era of Per- software stack. engineering for EEMBC and leader formance Evaluation,” Computer, Another challenge of ASSBs lies of the EEMBC Technology Center. Sept. 2007, pp. 23-30), they don’t in analyzing the results. While an Contact him through www.eembc. come without challenges. embedded platform is the sum of org/contact. First, creating an industry stan- its components, running an ASSB dard, at least within the EEMBC makes it difficult to isolate any par- domain, requires a consensus ticular component. In other words, Markus Levy is president of EEMBC among the members to select the assuming that an ASSB requires the and the Multicore Association, as well right mix of scenarios to test from hardware (SoC, memory system, as chairman of the Multicore Expo. among an almost infinite number of I/O) and the software (OS, applica- Contact him through www.eembc. possibilities. tion, runtime stacks), finding bottle- org/contact. Second, getting an ASSB to run necks can be as difficult as agreeing properly on an evaluation platform on a performance metric. requires the assembly of many system-level components, includ- egardless of the architecture, ing both hardware and software. measuring multicore perfor- In many cases, the company run- mance requires new ways of R Editor: Tom Conte, College of ning the ASSB might not have all benchmarking. Coming up with the Computing, Georgia Institute of the components at its disposal. For benchmarks is only half of the chal- Technology; [email protected]. example, a processor vendor could lenge; the other half is interpreting have all the supporting hardware the results. Of course, that’s where to run a SIP benchmark but be the fun really begins. ■

CCAALLLL FFOORR PPAARRTTICICIPIPAATTIOIONN

CTS 2009

The 2009 International Symposium on Collaboration Technologies and Systems May 18 – 22, 2009 The Westin Baltimore Washington International Airport Hotel Baltimore, Maryland, USA http://cisedu.us/cis/cts/09/main/callForPapers.jsp

In cooperation with the ACM, IEEE, IFIP

102 Computer

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on September 14, 2009 at 02:47 from IEEE Xplore. Restrictions apply.