A Survey of Multicore Processors
Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge

[A review of their common attributes]

General-purpose multicore processors are being accepted in all segments of the industry, including signal processing and embedded space, as the need for more performance and general-purpose programmability has grown. Parallel processing increases performance by adding more parallel resources while maintaining manageable power characteristics. The implementations of multicore processors are numerous and diverse. Designs range from conventional multiprocessor machines to designs that consist of a "sea" of programmable arithmetic logic units (ALUs). In this article, we cover some of the attributes common to all multicore processor implementations and illustrate these attributes with current and future commercial multicore designs. The characteristics we focus on are application domain, power/performance, processing elements, memory system, and accelerators/integrated peripherals.

Digital Object Identifier 10.1109/MSP.2009.934110
IEEE SIGNAL PROCESSING MAGAZINE [26] NOVEMBER 2009 1053-5888/09/$26.00 ©2009 IEEE

INTRODUCTION

Parallel processors have had a long history going back at least to the Solomon computer of the mid-1960s. The difficulty of programming them meant they have been primarily employed by scientists and engineers who understood the application domain and had the resources and skill to program them. Along the way, a surprising number of companies created parallel machines. They were largely unsuccessful since their difficulty of use limited their customer base, although there were exceptions: the Cray vector machines are perhaps the best example. However, these Cray machines also had a very fast scalar processor that could be easily programmed in a conventional manner, and the vector programming paradigm was not as daunting as creating general parallel programs. Recently, the evolution of parallel machines has changed dramatically. For the first time, major chip manufacturers (companies whose primary business is fabricating and selling microprocessors) have turned to offering parallel machines, or single-chip multicore microprocessors as they have been styled.

There are a number of reasons behind this, but the leading one is to continue the raw performance growth that customers have come to expect from Moore's law scaling without being overwhelmed by the growth in power consumption. As single-core designs were pushed to ever higher clock speeds, the power required grew at a faster rate than the frequency. This power problem was exacerbated by designs that attempted to dynamically extract extra performance from the instruction stream, as we will note later. This led to designs that were complex, unmanageable, and power hungry. The trend was unsustainable. But ever higher performance is still desired, as is evident from the ITRS Roadmap [1], which predicts a need for 300x more performance by 2022, as shown in Figure 1. To meet these demands, chip designers have turned to multicore processors and parallel programming to continue the push for more performance, and in turn the ITRS Roadmap has projected that by 2022 there will be chips with upwards of 100x more cores than on current multicore processors. The main advantage of multicore systems is that raw performance increase can come from increasing the number of cores rather than frequency, which translates into a slower growth in power consumption. However, this approach represents a significant gamble because parallel programming science has not advanced nearly as fast as our ability to build parallel hardware.

[FIG1] ITRS Roadmap [1] for frequency, number of data processing elements (DPE), and overall performance. (Horizontal axis: year of production, 2007 to 2022. Vertical axis: trend normalized to 2007, logarithmic scale from 1 to 1,000. Series: 1/T intrinsic switching speed, number of DPE, processing performance.)

General-purpose multicores are becoming necessary even in the realm of digital signal processing (DSP) where, in the past, one general-purpose control core marshaled many special-purpose application-specific integrated circuits (ASICs) as part of a "system on chip." This is primarily due to the variety of applications and performance required from these chips. This has driven the need for more general-purpose processors. Recent examples include software-defined radio (SDR) base stations and cell phone processors that are required to support numerous codecs and applications, all with different characteristics, requiring a general programmable multicore.

ARCHITECTURE CLASSIFICATIONS

Multicore architectures can be classified in a number of ways. In this section we discuss five of the most distinguishing attributes: the application class, power/performance, processing elements, memory system, and accelerators/integrated peripherals.

APPLICATION CLASS

If a machine is targeted to a specific application domain, the architecture can be made to reflect this. The result is a design that is efficient for the domain in question but often ill-suited to other areas. The extreme example is an ASIC. Tuning to an application domain can have several positive consequences. Perhaps the most valuable is the potential for significant power savings. Conventional DSPs are a good example.

There are two broad classes of processing into which an application can fall: data processing dominated and control dominated.

DATA PROCESSING DOMINATED

Data processing-dominated applications include many familiar types of applications, such as graphics rasterization, image processing, audio processing, and wireless baseband processing. Many of the classic signal processing algorithms are part of this group. The computation of these types of applications is typically a sequence of operations on a stream of data with little or no data reuse. The operations can frequently be performed in parallel and often require high throughput and performance to handle the large amounts of data. These kinds of applications favor designs that have as many processing elements as is practical for the desired power/performance ratio.

CONTROL PROCESSING DOMINATED

Control-dominated applications include file compression/decompression, network processing, and transactional query processing. The code for these types of applications tends to be dominated by conditional branches, complicating parallelism. The programs themselves often need to keep track of large amounts of state and often have a high amount of data reuse. These types of applications favor a more modest number of general-purpose processing elements to handle the unstructured nature of control-dominated code.

In almost all cases, no application can fit into these neat divisions, but execution phases of an application may. For instance, the H.264/AVC [4] video codec is data dominated when performing the block filter, but control dominated when compressing or decompressing video using context-adaptive binary arithmetic coding (CABAC) compression. It is valuable to think of applications as falling into these divisions to understand how different multicore design aspects can affect performance. An unbalanced architecture may do very well on the data-dominated portion of H.264/AVC, but be very inefficient for CABAC encoding/decoding, leading to less than desired performance.

POWER/PERFORMANCE

Many applications and devices have strict performance and power requirements. For instance, a mobile phone that must support video playback has a strict power budget, but it also has to meet certain performance characteristics.

Performance has been the traditional goal. In the past decade, power has joined performance as a first-class design constraint. This is, in large part, due to the rise of mobile phones and other forms of mobile computing where battery life and size are critical. More recently, power consumption has also become a concern for computers that are not mobile. The driving force behind this is the growth in data centers to support "cloud" computing. The typical general-purpose multicore processor is ideally suited to these centers, but these centers are now consuming more energy than heavy manufacturing in the United States. They are so large (Google now refers to them as warehouse-scale computers) that the power consumption of the basic multicore component is critical to the cost and operation.

[...] a vector transpose in one instruction. The soft-core provider Tensilica has made this the main selling point for their Xtensa CPUs [7], offering customizable special-purpose instructions for specific designs.

MICROARCHITECTURE

The processing element microarchitecture governs, in many respects, the performance and power consumption that can be expected from the multicore. The microarchitecture of each processing element is often tailored to the application domain that is targeted by the multicore machine. Although the commercial offerings of the major chip manufacturers like Intel combine a number of identical cores into a homogeneous architecture, it is often advantageous to combine different types of processing elements into a heterogeneous architecture. The idea is again