!" #

      $%$&   ' (%%$&()   *+***,-.$&%/                                !!"  # ! $  % & $    $ '%  %( )%    *      + &%(

  ,   ( !!-(   .  & $  /% 0    (        (                     1( (    ( 2.34 50--10"1 506(

)%          7  %    &   $ *      * % &    %   $ %      8/'9( )%     $ /'   * %  &   0 $$     %  ( 2

   %    *    %    $ %    %    &   %  $    % ( )% %   * %   $ %     % * :  $    %  &  & %          * &    %        $ /'( 2             % ;  * :  %   7   *%%   & % & $ *      ( )%      % : $             ( )% % &        $     $  ( )    * %  % : & &    $ 0&    $ % &   &    %     %     $  %    ( )*  % & 7     # )% <./=  % & 7   *%% &     % &%    % &  % + *  % % &   $      & :*  % * %    &  %(           *% &  &  0%   *  0    %    ( >% &  & *                 &   ( 2  %$       * :  %   $ % $   $ %  ( )% %   % $ %    %   7    $   0  *  *%%  * :  %    &    $  %  &   $     (

              ! " ##$     %&$'()'    *

?    ,  !!-

2..4 "-0" 1 2.34 50--10"1 506  #  ### 0" -! 8% #@@ (:(@ A B #  ### 0" -!9 Till min familj

List of Papers

This thesis is based on the following papers, which are referred to in the text by the capital letters A through G.

A. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution Martin Karlsson and Erik Hagersten Technical Report 2005-040, Department of Information Technology, Uppsala University, Sweden, December 2005. Submitted for publication

B. Timestamp-Based Selective Cache Allocation Martin Karlsson and Erik Hagersten In High Performance Memory Systems, Chapter 4, Edited by H. Hadimioglu, D. Kaeli, J. Kuskin, A. Nanda, and J. Torrellas, pages 43-59, Springer-Verlag, 2003. ISBN 038700310X

C. Skewed Caches from a Low-Power Perspective Mathias Spjuth, Martin Karlsson and Erik Hagersten In Proceedings of the ACM International Conference on Computing Frontiers, Ischia, Italy, May 2005.

D. TMA: A Trap-Based Memory Architecture Håkan Zeffer, Zoran Radović, Martin Karlsson and Erik Hagersten Technical Report 2005-041, Department of Information Technology, Uppsala University, Sweden, December 2005. Submitted for publication

E. Memory System Behavior of Java-Based Middleware Martin Karlsson, Kevin Moore, Erik Hagersten and David Wood In Proceedings of the Ninth International Symposium on High Performance Computer Architecture, Anaheim, California, USA, February 2003.

F. Exploring Processor Design Options for Java-Based Middleware Martin Karlsson, Kevin Moore, Erik Hagersten and David Wood In Proceedings of the 34th International Conference on Parallel Processing, Oslo, Norway, June 2005.

G. VASA: A Simulator Infrastructure with Adjustable Fidelity Dan Wallin, Håkan Zeffer, Martin Karlsson and Erik Hagersten In Proceedings of the International Conference on Parallel and Distributed Computing and Systems, Phoenix, Arizona, USA, November 2005.


Other Publications

• Timestamp-Based Selective Cache Allocation, Martin Karlsson and Erik Hagersten, In Proceedings of the 1st Annual Workshop on Memory Performance Issues (WMPI 2001), held in conjunction with the 28th International Symposium on Computer Architecture (ISCA28), Göteborg, Sweden, June 2001.

• Memory Characterization of the ECperf Benchmark, Martin Karlsson, Kevin Moore, Erik Hagersten, and David Wood, In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), held in conjunction with the 29th International Symposium on Computer Architecture (ISCA29), Anchorage, Alaska, USA, May 2002.

• Skewed Caches from a Low-Power Perspective Mathias Spjuth, Martin Karlsson and Erik Hagersten, In Proceedings of the 4th Annual Workshop on Memory Performance Issues (WMPI 2004), held in conjunction with the 31st International Symposium on Computer Architecture (ISCA31), Munich, Germany, June 2004.

• Cache Memory Design Trade-offs for Current and Emerging Workloads Martin Karlsson, Licentiate Thesis 2003-009, Department of Information Technology, Uppsala University, September 2003.


Comments on my Participation

A. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution I am the principal author and have performed all the experiments in this work.

B. Timestamp-Based Selective Cache Allocation I am the principal author and have performed all the experiments in this work.

C. Skewed Caches from a Low-Power Perspective I am the principal author of the paper, while Mathias Spjuth is the principal investigator of the project. The energy model was developed jointly, while the performance model was based on Mathias Spjuth's Master's thesis, which I advised.

D. TMA: A Trap-Based Memory Architecture Håkan Zeffer is the principal investigator of the project and has performed all the experiments. I am the principal author of the article together with Håkan Zeffer, and we also co-designed the trap mechanism. I have had no part in the protocol implementation, which was carried out by Håkan Zeffer and Zoran Radović.

E. Memory System Behavior of Java-Based Middleware The paper was jointly written with Kevin Moore. From an experimental point of view, Kevin Moore was responsible for setting up and tuning the workload, while I designed most of the experiments and provided the initial benchmark setup inside Simics.

F. Exploring Processor Design Options for Java-Based Middleware The paper was written jointly with Kevin Moore. I designed and performed all the experiments in this project.

G. VASA: A Simulator Infrastructure with Adjustable Fidelity All authors have contributed to all aspects of this project. I am the principal author of the article together with Dan Wallin, who also designed and performed all the experiments. Håkan Zeffer is the main contributor to the VASA implementation.


Acknowledgments

My first attempt at research, during an internship at Los Alamos National Laboratories, was not much of a success, quite the opposite. The result was that I decided that getting a PhD (in databases), which I had been contemplating, was not for me. I am therefore very grateful to my advisor, Professor Erik Hagersten, who persuaded me to get a PhD and accepted me into his research group, which at that time was lacking members. I have since that day never regretted my decision and would do it all over again without hesitation. Erik's role somewhere along the way blurred between advisor, mentor and friend. I consider myself very fortunate to have had Erik as an advisor during these years. I am especially grateful to Erik for allowing me to pursue my own ideas to a large extent. I have learned many things from him, a surprising amount in areas unrelated to computer architecture. Thank you!

Apart from Erik, I have been very fortunate to have had an extraordinary set of teachers and mentors during my education, many of whom have indirectly been instrumental in reaching this goal. While the list is too long to be included here, my thanks go out to all of you.

The Uppsala Architecture Team (UART) members have been excellent companions along this long winding road: Lars Albertson, Erik Berg, Henrik Löf, Zoran Radović, Dan Wallin and Håkan Zeffer. I will miss our daily interaction a great deal. The IT department has been an excellent workplace and I have enjoyed the friendly atmosphere among faculty and staff very much. I would like to thank the collection of wise men that have joined me at the grad student Friday pub. Without the administrative staff, coffee breaks certainly would not have been such a pleasure. The computer administrators helped me out with numerous issues. I would like to thank the staff at Virtutech, particularly Andreas Moestedt, for countless hours of Simics support. Sverker Holmgren provided computational resources and shipped me to Sri Lanka, installing Linux boxes, in the middle of the cold dark winter. Thanks!

During the course of this work I have had three very rewarding internships at Sun Microsystems. I would like to thank the people at Sun Laboratories and the Processor Performance Group for making my visits there so enjoyable, especially Paul Caprioli and Shailender Chaudhry.

I would also like to express my gratitude towards my fellow student co-authors and collaborators Kevin Moore, Mathias Spjuth, Håkan Zeffer, Dan Wallin and Zoran Radović. I think in each project the result was far greater than the sum of our efforts. I would especially like to thank Kevin Moore for proofreading this thesis.

Finally, I want to thank my family, friends and other loved ones for all the support and encouragement I have received during these years. For that I will be forever grateful. I want to direct a special thank you to Josefin, who through her infinite wisdom during the later part of this journey has helped me keep things in perspective. Thanks!

This research has been supported by a grant from the Swedish Foundation for Strategic Research (SSF) under the ARTES/PAMP program.

Introduction

The services provided by multiprocessor servers are increasingly becoming a part of our everyday lives. For example, when we make an online reservation or purchase, servers running databases and web applications keep track of products in stock or available seats, and are used to deliver the service to us. The cost and performance of these multiprocessor servers are important. Multiprocessor servers are used in many domains. In the scientific computing arena, faster multiprocessor servers allow for more detailed and accurate modeling of, for example, weather and pharmaceuticals.

The discrepancy between the speed of processing and the time to access memory has made memory system performance one of the most significant bottlenecks preventing increased computational performance and capabilities. This trend is sometimes referred to as the memory wall problem [21] and shows no sign of decreasing in the foreseeable future.

Chip multiprocessors (CMPs) [12][2][16] introduce new resource trade-offs and further complicate the architectural task of chip design. While CMP technology is still in its infancy, it has already proven viable. CMP design is likely to be the prevailing trend in processor design for many years to come. All major processor manufacturers have either released or announced CMP-based designs. Researchers at Intel have, for example, projected that 75 percent of all their processor products will be CMPs by the end of 2006 [5]. In this thesis we study multiple aspects of memory system design when moving to CMPs. In particular, we study cost, performance and power aspects.

Chip Multiprocessors

Processor performance improvements have been driven partly by advances in silicon fabrication technology. Reduced feature sizes have allowed more transistors to fit on a die, following the well-known Moore's law [8], stating that the number of devices per chip doubles every 18 months. The increased transistor budgets have enabled architects to design more elaborate microarchitectures. Processor structures, such as register files, issue windows and caches, have grown with each generation, improving the performance of a single monolithic processor core. Technology projections show that the electrical characteristics of chips are changing [18]. While the transistors will get smaller, the relative speed of wires will decrease.


The fact that transistor speed scales inversely with feature size, while wire delays do not, leads to an increased cost of communication across the chip. In recent process generations, a signal could be propagated across an entire chip within one clock cycle. Projections for future process generations indicate that less than 10 percent of the chip will be reachable within a cycle, and that this will get progressively worse with future generations [1]. Therefore, the traditional approach of obtaining performance through larger and larger monolithic processor cores faces serious challenges.²

The introduction of chip multiprocessors provides an alternative approach. Instead of scaling up a monolithic core, multiple cores may be fitted on the same die. The CMP approach offers performance increases mainly through multiprocessing rather than through increased single-thread performance. The design of chip multiprocessors introduces new trade-offs, since architects now have to strike a balance between allocating more resources to each processor core and fitting more cores on a die. For example, designers must decide how much cache should be allocated per core and how many cores should share each cache. An advantage of chip multiprocessors compared to conventional monolithic designs is the ability to scale up a processor design.

Figure 1: Overview of a Chip Multiprocessor: four cores (CPU0-CPU3), each with private instruction and data L1 caches, connected through an intra-chip switch to banks of a shared L2 cache. A core together with its L1 caches and L2 banks forms a processor slice.

² One approach to mitigate the wire delay problem is to employ a clustered microarchitecture, where clusters of smaller structures replace larger structures [17].

Let us consider a processor core together with its L1 caches and a subset of the L2 banks, a processor slice, as illustrated in Figure 1. Once a process shrink provides more transistors, a CMP design can be scaled up by simply adding more processor slices, incurring a minimal redesign effort.

From a software perspective, the paradigm shift in processor design towards providing higher performance mainly through multiprocessing forces application writers to parallelize their applications in order to take advantage of the performance increase. Most commercial applications, such as databases, web or application servers, are throughput based and have already been parallelized. For such workloads, the interesting metric is not how fast one request is satisfied. Instead, performance is measured by how many requests can be satisfied per unit of time. Of course, single-thread performance cannot be entirely ignored, since certain quality-of-service aspects, e.g. responsiveness in a web application, must be observed. Such applications, which often have an abundance of parallelism [20], bring the complexity of each processing core into focus. Processor designers seeking to maximize the throughput per chip may find that a larger number of simpler cores outperforms fewer, more complex ones [6].

From a multiprocessing perspective, CMPs have a very important advantage. Since the processors are fitted on the same die, sharing resources such as an L2 cache can be done with a very limited latency penalty. Shared caches increase utilization and significantly reduce the inter-thread communication cost, as long as the communicating threads share the same cache. Since communication overhead often limits the scaling of multithreaded programs, the introduction of CMPs with shared caches is likely to have an important corollary for parallelization efforts: while it becomes easier to get an application to scale across cores within a chip, getting an application to scale across chips remains as challenging as it is today. Therefore, there exists a great incentive to build CMPs with as many processing cores as possible.

Processor Design Constraints

The basis for the resource constraints in current processor design lies in the economics of CMOS chip fabrication. When designing processors and the systems around them, cost and performance are the driving factors. Since computer evolution has followed a pattern of roughly doubling performance every 18 months, the time to market for a new design is critical. Due to this performance trend, every extra month in development time leads to an expectation of 6 percent higher performance. The overall development time of a new processor is often in the range of 3 to 7 years. Complex solutions lead to longer design, implementation and, foremost, verification time. Therefore,

15 complexity within a design can be viewed as a limited resource. Apart from complexity, some of the most important constraints in chip design are area, power and packaging costs.


Figure 2: Example of die cost as a function of chip area.

Processor chips are manufactured by imprinting a design on a wafer through lithography. A wafer today typically measures a few decimeters in diameter, while a chip is on the order of a few hundred mm². Larger chips simply mean that fewer fit on the same wafer. A significant fraction of the chips on a wafer usually have to be discarded due to process variations. These two factors lead to a quadratic area cost in chip design [10], as illustrated in Figure 2. Chip multiprocessors alleviate this problem somewhat, since a processor where one or even a few of the cores are unusable can be priced and sold as a scaled-down version. Due to the high manufacturing costs associated with area, it is important to make large processor structures, such as caches, as area efficient as possible.

Power consumption, which until recently was considered an issue only in portable computing, is currently one of the most pressing processor design constraints in high-performance computing as well [14]. Chip power consists of switching, short-circuit and leakage power. Switching power has traditionally made up the vast majority of chip power. As transistor sizes have kept decreasing, it has been possible to scale down the supply voltage. Since switching power is quadratically dependent on voltage, the voltage scaling approach has enabled more and more transistors per chip without a corresponding power increase. However, in current and future process generations, leakage power effects will counter this trend and are predicted to make up an increasingly large portion of chip power [15].
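
The quadratic voltage dependence referred to above is the standard first-order expression for switching power; the formula is a textbook relation rather than a result from the included papers:

    P_switching = α · C · V_dd² · f

where α is the activity factor (the fraction of the switched capacitance C that toggles per cycle), V_dd is the supply voltage and f is the clock frequency. Scaling V_dd down by 30 percent, for example, roughly halves switching power at a fixed frequency (0.7² ≈ 0.49), which is why voltage scaling could absorb growing transistor counts for so long.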

Power is limited in high-performance computing for multiple reasons. For large server facilities, the electricity and room cooling costs are significant parts of the total cost of ownership. Server density, i.e. how much processing power fits into a rack, is also an important design consideration. From the point of view of a single processor, power is important because the likelihood that a chip will break increases with temperature. To guarantee that a chip is safely kept cool, processor designers state a power envelope under which the maximum power is kept. System designers must make sure that the system design is able to cool the processor and maintain its temperature within the thermal limit.

The off-chip bandwidth of a processor is determined by the processor packaging. The number of pins, as well as the pin frequency, dictates the bandwidth. Packaging costs grow superlinearly with pin count and often represent a significant fraction of the overall cost [3]. The 2004 edition of the International Technology Roadmap for Semiconductors, published by the Semiconductor Industry Association (SIA), predicts that the cost-efficient number of pins per chip will remain constant while the number of transistors will grow by 67 percent [19]. Hence, if the increase in transistor count is used to scale up CMP designs by adding more processor cores (or slices), it will lead to a decrease in bandwidth per processor core.

Figure 3: The Semiconductor Industry Association (SIA) projection of bandwidth per million transistors.

Figure 3 shows the predicted cost-effective

amount of bandwidth (pin count times pin frequency) per million transistors over time. As can be seen, the bandwidth per transistor is projected to decrease by 90 percent over the next decade.
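
As a back-of-the-envelope check on this projection (our arithmetic, assuming a two-year process generation cadence and the 67 percent per-generation transistor growth quoted above; [19] does not state it in exactly this form): with pins and pin frequency roughly constant, bandwidth per transistor after n generations scales as 1/1.67ⁿ, so over five generations, roughly a decade,

    1 / 1.67⁵ ≈ 1 / 13 ≈ 0.08,

which is consistent with the projected decrease of about 90 percent.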

Memory System Design

The growing disparity between processor and memory speed has made the memory system the bottleneck of computer system performance. To overcome the ever increasing latencies, high-speed SRAM caches are used extensively. The principle of locality is the basis for the successful use of caches. Spatial locality (locality in space) and temporal locality (locality in time) refer to a repetitiveness exhibited by many programs. By exploiting locality, caches help to reduce the effects of the long memory latency. Spatial locality implies that after a piece of data has been accessed, there is an increased likelihood that nearby data will be used within a short period of time. Temporal locality means that a piece of data that has been accessed is likely to be requested soon again. Temporal locality is exploited by the use of caches holding a subset of the contents of the slow but large memory. Spatial locality is exploited by fetching more than one word from memory at a time.

As the memory gap has widened, memory system designers have responded by designing larger and larger caches and fetching larger cache blocks at a time. Larger caches lead to longer access times. Therefore, cache hierarchies are employed, with smaller and smaller caches closer to the processing core. In today's multi-GHz processors, the size of the first-level caches is constrained in order to allow access within a few cycles. It is therefore very important to design caches that are as efficient as possible from a miss-ratio perspective. However, cache performance is not the only concern. As part of the chip infrastructure, caches are limited by the same resource constraints as other structures, forcing architects to consider, for example, the power aspects of different cache design alternatives.

It is desirable to share some of the resources, such as the L2 cache, in a CMP. This enables a resource to be allocated according to demand rather than statically partitioned. For example, if only a single application thread is running on a CMP, it would be preferable if that thread could make use of the entire L2 cache instead of just a fraction of it. First-level caches are rarely shared among different cores for timing reasons.³ However, each core may contain multiple hardware threads [7], which often share an L1 cache. Also, the previously mentioned cost decrease for on-chip communication among software threads is a strong motivation for shared caches.

³ The MAJC-5200 is a notable exception to this [11].

While sharing enables more efficient use of caches, it is also prone to conflicts when multiple processors compete for the same cache space. A common approach to reduce this effect is to increase the cache associativity, i.e. the number of locations in which a piece of data may reside. This is costly, since power consumption scales with associativity.
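
To make the associativity-power connection concrete, the sketch below models a set-associative lookup in which all ways of the selected set are probed in parallel, so dynamic lookup energy grows roughly linearly with the number of ways. This is an illustrative toy model; the class, its parameters and the probe-count proxy are invented for the example and are not taken from any of the included papers.

    class SetAssociativeCache:
        """Toy set-associative cache: tag store only, with LRU replacement."""

        def __init__(self, num_sets, ways, block_size=64):
            self.num_sets = num_sets
            self.ways = ways
            self.block = block_size
            self.sets = [[] for _ in range(num_sets)]  # per set: MRU-first tags
            self.way_probes = 0  # proxy for dynamic lookup energy

        def access(self, addr):
            index = (addr // self.block) % self.num_sets
            tag = addr // (self.block * self.num_sets)
            tags = self.sets[index]
            # All ways of the selected set are read in parallel, so every
            # lookup is charged `ways` probes, hit or miss.
            self.way_probes += self.ways
            if tag in tags:
                tags.remove(tag)
                tags.insert(0, tag)  # promote to most recently used
                return True          # hit
            if len(tags) == self.ways:
                tags.pop()           # evict the least recently used tag
            tags.insert(0, tag)
            return False             # miss

With total capacity held constant, an 8-way configuration charges four times the probes per access of a 2-way one in this model; achieving comparable conflict tolerance without that cost is the goal of the elbow cache in Paper C.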

Architectural Design Methodology

When designing a new computer system, it is common to use simulation in combination with benchmarks that represent the workload of the system. By comparing the simulation results of an existing design with those of a new design, computer architects are able to approximate the performance associated with various design options.

System Modeling and Simulation

Simulation is extensively used in the field of computer architecture to evaluate new ideas. Many independent simulations are often made in parallel, e.g. for different benchmarks. Therefore, the task of simulation can be viewed as containing plenty of application-level parallelism. Since the simulation results are needed within hours or days in the design process, simulation speed is very important.

Care is often taken to make simulators resemble the proposed designs as closely as possible, leading to extremely detailed simulators. Although this provides high accuracy, it leads to very slow simulators and limits the simulation coverage. For example, a detailed simulator capable of simulating 100 000 instructions per second on today's computers is considered a relatively fast simulator. Modeling a target architecture at 4 GHz, running an application spending on average two cycles per instruction, would then require 6 hours to simulate a single second of the target architecture (the arithmetic is spelled out below). Full-system simulators, e.g. Virtutech Simics, capture operating system effects, which can have a significant impact when studying many workloads and applications. Simics has been used throughout this thesis.
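
The six-hour figure follows directly from the numbers above (the symbol names are ours, introduced only for this calculation):

    t_sim = (f_target / CPI) / r_sim = (4·10⁹ cycles/s ÷ 2 cycles/instruction) / 10⁵ instructions/s = 2·10⁴ s ≈ 5.6 hours.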

Benchmark Simulation

Since different applications have radically different memory characteristics, it is important to make design trade-offs based on the performance of the type of applications that are likely to run on the target system. Server vendors typically emphasize commercial benchmarks, and in particular database workloads, since a large fraction of computer systems are employed as database servers. As previously mentioned, the design time of a processor or a system is long and counted in years.

Therefore, it is important to use modern workloads while designing a system, so that it is optimized for the workloads that will be relevant when the system is deployed. It is also important, from a research perspective, to discover new performance bottlenecks that arise with new workloads. A new workload that is gaining importance due to its rapid and widespread use is middleware. Middleware is an important building block in e-commerce systems, allowing the business logic to be glued together with databases and web servers. An illustration of this is shown in Figure 4, where browsers or thin clients are connected to two backend database servers through middleware. The fact that middleware applications are commonly transaction-oriented Java applications makes them quite different from conventional number-crunching or database-style workloads.

Figure 4: Overview of a 3-tiered system: browsers or thin clients (tier 1) connect over a LAN/WAN to web servers and middleware implementing the business logic (tier 2), which in turn is backed by databases (tier 3).

An application server is a piece of software that provides an abstraction layer to N-tier application developers. Application servers typically provide services such as thread pooling, load balancing and database connection pooling, and allow the application writer to focus on the business rule implementation instead of the application performance. Application servers can therefore be viewed as an underlying software layer on which middleware software is implemented.

The rest of this thesis is organized as follows: Chapter 2 contains a summary of each paper included in this thesis. The contribution of this thesis is described in Chapter 3, followed by a summary of the thesis in Swedish.

Paper Summaries

The papers included in this thesis are verbatim copies of the previously published versions, but have been reformatted to fit the format of this thesis. Reprints were made with permission from the publishers. This chapter contains summaries of the included papers, together with a retrospective section per paper with comments and some of the insights gained since publication.

Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution

The cost of increasing the chip pin count, and thereby bandwidth, is high and predicted to increase considerably. As more transistors become available, the number of cores per chip will be dictated either by power or by bandwidth. In this paper we target the issue of conserving bandwidth and note a trend towards larger cache blocks in on-chip L2 caches. We find that large parts of the cache blocks are not being used, leading to a waste of the costly bandwidth. Runahead execution is a promising technique for tolerating long memory latencies. In this paper we observe multiple characteristics of runahead execution that can be exploited to remove the implicit assumption of spatial locality. We propose a method of fine-grained fetching that reduces the amount of unnecessary data being fetched and achieves a performance speedup through the reduction in bandwidth.
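
One plausible way to picture the bookkeeping behind fine-grained fetching (our illustration only; the word size, mask representation and interface are invented, and the mechanism in Paper A is implemented in hardware): during a runahead episode, the words touched within a missing block are recorded as a mask, and the subsequent fetch requests only the marked words instead of the full block.

    from collections import defaultdict

    WORD_BYTES = 8        # assumed word size
    WORDS_PER_BLOCK = 8   # assumed 64-byte cache blocks

    class FineGrainedFetcher:
        """Record which words of a missing block the runahead stream
        touches, then fetch only those words instead of the whole block."""

        def __init__(self):
            self.word_masks = defaultdict(int)  # block number -> word bitmask

        def record_runahead_access(self, byte_addr):
            block, word = divmod(byte_addr // WORD_BYTES, WORDS_PER_BLOCK)
            self.word_masks[block] |= 1 << word

        def words_to_fetch(self, block):
            """Consume the mask gathered during runahead for this block."""
            mask = self.word_masks.pop(block, 0)
            return [w for w in range(WORDS_PER_BLOCK) if mask & (1 << w)]

    f = FineGrainedFetcher()
    f.record_runahead_access(0x1000)   # word 0 of block 0x40
    f.record_runahead_access(0x1018)   # word 3 of the same block
    print(f.words_to_fetch(0x40))      # [0, 3]: fetch 2 of 8 words

A block whose mask marks two of eight words would then consume a quarter of the bandwidth of a full-block fetch, which is the kind of saving the paper targets.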

Retrospective

This paper provides a first step towards understanding how characteristics of runahead execution can be exploited in memory system design. While the fine-grained fetch method proposed in Paper A only targets data fetched from memory, a similar technique can be applied between nodes in a cache coherent system.

Timestamp-Based Selective Cache Allocation

In order to minimize miss ratio and maximize area and power utilization, it is important to carefully select which piece of data to evict every time a new piece of data is allocated in a cache. Therefore, a large number of replacement algorithms have been devised. In Paper B we take the orthogonal approach of proposing a selective allocation strategy. Instead of only selecting what to replace on a cache miss, we choose whether a piece of data should be allocated at all in the first place. The scheme is based on the observation that a large fraction of the data that is accessed does not benefit from cache allocation. By not allocating some data, the data that does get allocated is able to reside in the cache for a longer period of time, thus avoiding some of the capacity misses.

In Paper B we introduce the concept of Cache Allocation Ticks (CAT). CAT is a time metric for caches, based on a global counter that is incremented for each cache allocation, in combination with a timestamp per cache item. Based on these timestamps, we propose an algorithm to detect which data to avoid allocating. We also demonstrate how a selective allocation algorithm combined with a small staging cache, L0, can drastically increase the effectiveness of a cache, allowing a smaller cache to perform similarly to a larger one. We also show that CAT provides a fairly accurate and application-independent way of detecting good allocation candidates.
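
A minimal sketch of the CAT bookkeeping, under our reading of the summary above: CAT time advances on allocations, every address is stamped when touched, and the allocation decision compares the measured reuse distance against a cache-size-related threshold. The threshold rule and all names are simplifications for illustration, not the exact algorithm of Paper B.

    class CATFilter:
        """Selective allocation based on Cache Allocation Ticks (CAT):
        a global counter incremented on every cache allocation, plus a
        per-address timestamp recording when the address was last used."""

        def __init__(self, cache_blocks):
            self.cat = 0                    # global allocation counter
            self.last_seen = {}             # address -> CAT stamp of last use
            self.threshold = cache_blocks   # allocations that cycle the cache

        def should_allocate(self, block_addr):
            now = self.cat
            prev = self.last_seen.get(block_addr)
            self.last_seen[block_addr] = now
            if prev is None:
                return True  # no history: allocate by default
            reuse_distance = now - prev
            # If more allocations happened between two uses than the cache
            # holds, the block would likely have been evicted before its
            # reuse anyway, so allocating it mostly wastes space.
            return reuse_distance < self.threshold

        def note_allocation(self):
            self.cat += 1  # CAT time advances only on allocations

Data rejected by the filter could, for example, be kept briefly in a small staging cache such as the L0 mentioned above, rather than displacing long-lived blocks in the main cache.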

Retrospective

If the L0 staging cache used in the RASCAL cache is looked up before accessing the main cache, it can be expected to reduce power consumption. The reason is that many accesses are likely to be satisfied from the small L0 cache, which consumes less power than the main cache. This is, however, neither discussed nor quantified in Paper B.

Skewed Caches from a Low-Power Perspective

When more cache blocks concurrently map to a set than there are ways in a set-associative cache, conflict misses arise. The elbow cache presented in Paper C targets designs requiring a high degree of conflict tolerance. The shared caches often employed in chip multithreaded processors are a typical case where conflict tolerance is desirable. Increased associativity is the standard approach to achieve conflict tolerance. However, associativity comes at a high cost in dynamic power, a scarce resource in processors today. The elbow cache achieves the same conflict tolerance as highly associative caches with a fraction of the power.

The elbow cache extends the idea of a skewed cache by adding a relocation strategy, yielding an average miss ratio that is comparable to that of an 8-way set-associative cache. In this paper we also quantify the power consumption of a skewed cache organization, as well as of the proposed elbow cache, and compare them to set-associative caches. We find that the elbow cache consumes up to 48 percent less energy than an 8-way set-associative cache. Since caches often account for a significant fraction of the overall chip power [13][9], using the elbow cache design instead of an 8-way associative cache would lead to considerable overall chip power savings.
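
The sketch below shows the core idea of a two-way skewed cache (each way indexed by a different hash of the address) and a caricature of the elbow cache's relocation step: on a conflict, an existing block may be moved to its alternative location instead of being evicted. The hash functions and the victim-selection rule here are placeholders, not the ones evaluated in Paper C.

    NUM_SETS = 256
    ways = [dict(), dict()]  # two ways, each a tag array: {set index: block}

    def h0(block):
        return block % NUM_SETS

    def h1(block):
        # A different, cheap hash; real skewing functions are chosen
        # more carefully than this.
        return ((block >> 4) ^ block) % NUM_SETS

    def lookup(block):
        # Each way is indexed by its own hash, so two blocks that conflict
        # in one way rarely conflict in the other (the skewing property).
        return ways[0].get(h0(block)) == block or ways[1].get(h1(block)) == block

    def insert(block):
        for w, h in ((0, h0), (1, h1)):
            if h(block) not in ways[w]:
                ways[w][h(block)] = block
                return
        # Elbow-style relocation: before evicting, try to move the way-0
        # victim to its alternative location in way 1.
        victim = ways[0][h0(block)]
        if h1(victim) not in ways[1]:
            ways[1][h1(victim)] = victim
        ways[0][h0(block)] = block

Note that only two tag arrays are probed per access, regardless of how much conflict tolerance the relocation adds, which is where the power advantage over an 8-way lookup comes from.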

Retrospective

Similar to the cache organization presented in Paper B, the elbow cache evaluation was made from a first-level cache perspective. However, nothing prevents either of them from being employed in other types of caches, such as TLBs, higher-level caches, trace caches, or directory structures.

We used the miss ratio metric to evaluate our proposals in Paper B and Paper C. The goal was to devise more efficient cache organizations. To what extent a miss ratio reduction implies an overall performance improvement depends on many other factors, such as latency and the ability of the processor to overlap a cache miss with useful execution. Since these factors tend to differ significantly between processor designs and over time, we have chosen not to include execution time approximations in this study.

TMA: A Trap-Based Memory Architecture

The cost of designing large cache-coherent systems from multiple chip multiprocessors is very high. Since errors in coherence protocols can lead to incorrect computations, they are unacceptable; hence a new multiprocessor system design must be extraordinarily well tested and verified. In today's system design, verification time represents a significant part of the overall product development time. Since the task of verification grows quadratically or worse with system complexity, the sophistication level of coherence protocols is restricted by verification.

In almost all systems shipped today, the cache coherence protocol is implemented in hardware. While hardware solutions have a speed advantage, they have the drawback of lacking flexibility. The best performing protocol must be verified and ready the day the machine leaves the assembly plant, and the same protocol implementation is used for banking systems and scientific computations. In a software-based solution, such as the one proposed in this paper, a simple protocol can be shipped with the systems and more advanced versions can be made available later on, as they become verified and tested. Similarly, different protocols can be supplied with systems targeted at different tasks.

In this paper we propose to extend the existing trap mechanism in processors with a small set of hardware extensions to support coherence traps. Based on these extensions, fine-grained software-based coherence can be implemented in a way that is transparent to software. The described

architecture relies on interconnect functionality existing in some commodity interconnects such as InfiniBand. We show that our proposed trap-based memory architecture (TMA) can, on average, perform within 1 percent of a hardware-based system.
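
A schematic of the division of labor TMA argues for, with all details invented for illustration: a cheap hardware check decides whether an access may proceed, and on a coherence trap a software handler runs the protocol. The state machine below is a generic invalid/shared skeleton, not the protocol of Paper D, and the function names are ours.

    INVALID, SHARED, MODIFIED = range(3)

    state = {}   # block address -> local coherence state (default INVALID)
    memory = {}  # local copies of remote blocks

    def fetch_from_home(block):
        """Stand-in for an interconnect transfer (e.g. a remote read)."""
        return f"data@{block}"

    def coherence_trap(block):
        # The protocol runs in software, so it can be fixed or upgraded
        # after the machine ships; only the trap trigger is in hardware.
        memory[block] = fetch_from_home(block)
        state[block] = SHARED

    def load(block):
        # Hardware side: a cheap per-access check; traps are the rare case.
        if state.get(block, INVALID) == INVALID:
            coherence_trap(block)
        return memory[block]

The design bet is that the check on every access is nearly free, while the trap, although slow, is rare enough that overall performance stays close to a hardware protocol.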

Retrospective

Two of the most revenue-important commercial applications for server manufacturers are databases and application servers. Both workloads have successfully been parallelized at the application level, and do not require cache-coherent shared memory machines to scale. This trend towards software clustering may significantly decrease the market for large shared memory machines. If that is the case, the huge costs associated with cache-coherent shared memory system design would become harder to amortize. The low costs associated with software-based solutions for maintaining coherence may then provide an option to build large systems consisting of multiple single- or multi-CMP designs.

Memory System Behavior of Java-Based Middleware

Java middleware, and application servers in particular, is one of the most important workloads for the server processors and systems being designed today. Despite this, little is known about this new workload from an architectural point of view. In Paper E, we present a detailed characterization of the memory system behavior of two Java middleware benchmarks, ECperf and SPECjbb. Our results show that it is important to isolate the behavior of the middle tier. We find that these workloads have small memory footprints and primary working sets compared with other commercial workloads (e.g., on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference leads to opposite conclusions on the utility of moderately sized shared caches in a chip multiprocessor.

Retrospective

This is the first step towards understanding the characteristics of this new kind of workload.

As the workload matures and its performance-dependent factors are better understood, its characteristics are likely to change. Further advances in garbage collection and other JVM technology are also likely to change the memory behavior of the workload.

Exploring Processor Design Options for Java-Based Middleware

In Paper F we characterize the execution of Java middleware from a processor design perspective and compare and contrast it with more well-known workloads, such as online transaction processing (OLTP), static web serving (APACHE) and several SPEC benchmarks.

By idealizing a processor model, we are able to get an approximation of the single-thread performance that is obtainable in a conventional out-of-order design. We then take a realistic processor model and idealize various aspects of it, in order to quantify the performance impact of each structure. The performance effect of idealizing branch prediction, compared to today's predictors, is quantified and isolated for directional branch prediction, branch target prediction and return address prediction. We also measure the memory-level parallelism (MLP) of each workload; a sketch of how MLP can be computed follows below. MLP is an increasingly important metric, since it indicates how many of the costly memory accesses can be overlapped with each other.

With the introduction of chip multiprocessors, the task of processor architects becomes more complicated. In particular, architects must now strike a balance between allocating resources, such as die area and power, to each processor and fitting more processors on a chip. In Paper F we combine instruction-level parallelism findings with results from Paper E to investigate the optimal trade-off between instruction- and thread-level parallelism on a CMP for the Java middleware workload.
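
MLP is commonly reported as the average number of outstanding off-chip misses over the cycles in which at least one miss is outstanding; that is the definition we assume for this sketch, which simplifies away everything a cycle-accurate simulator would actually track.

    def mlp(outstanding_misses_per_cycle):
        """outstanding_misses_per_cycle: iterable giving the number of
        in-flight off-chip misses observed in each simulated cycle."""
        busy = [n for n in outstanding_misses_per_cycle if n > 0]
        return sum(busy) / len(busy) if busy else 0.0

    # Example: misses that overlap in bursts yield an MLP of 2.0
    # over the four cycles in which any miss was outstanding.
    print(mlp([0, 1, 3, 2, 0, 0, 2]))  # (1 + 3 + 2 + 2) / 4 = 2.0

A workload with an MLP near 1 pays nearly the full latency for every miss, whereas an MLP of 2 roughly halves the exposed miss latency, which is why the metric matters when choosing between few complex and many simple cores.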

Retrospective

The workload setup of the ECperf benchmark was based on a Solaris 8 installation, which does not use large TLB pages for code. This is likely to have inflated the number of ITLB misses we observed. Even when ignoring the ITLB misses, we observe a substantial trap rate in the commercial applications. This trap rate is likely to have severe performance implications for future processor designs with hundreds or thousands of instructions in flight [4], unless new, less costly trap mechanisms are devised.

VASA: A Simulator Infrastructure with Adjustable Fidelity

The arrival of chip multiprocessor designs has exacerbated the simulation slowdown problem by increasing the need to simulate multiprocessor designs. In Paper G we present the VASA simulator framework, which has multiple fidelity and speed modes. VASA is an architecture-level simulator that is highly integrated with the Virtutech Simics full-system simulator. It contains models of caches, store buffers, interconnects and memory controllers, and it is capable of modeling full multi-node systems with complex out-of-order SMT/CMP processors in great detail. However, VASA also contains two faster modes with highly simplified processor models, giving three different fidelity levels, or modes, where the fastest mode is 260 times faster than the most detailed mode. In this paper we show that, for memory system studies targeting the lower levels of the memory system, the more simplified modes can be used and still yield the same conclusions as the complex mode.
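
The adjustable-fidelity idea can be pictured as one simulator with pluggable timing models that share the memory-system and statistics code; the interface and the two core models below are invented for illustration and do not mirror VASA's internals.

    class FixedLatencyCore:
        """Fast mode: every instruction costs a fixed number of cycles."""
        def __init__(self, cpi=1.0):
            self.cpi = cpi
        def cycles_for(self, instr):
            return self.cpi

    class OutOfOrderCore:
        """Detailed mode: a real model would track issue windows, the
        reorder buffer, store buffers, etc.; stubbed out here."""
        def cycles_for(self, instr):
            raise NotImplementedError("detailed pipeline model goes here")

    def run(trace, core, memory_latency=200):
        """The memory-system model is shared regardless of core fidelity,
        which is what lets the fast modes stay valid for studies that
        target the lower levels of the memory hierarchy."""
        cycles = 0.0
        for instr in trace:
            cycles += core.cycles_for(instr)
            if instr.get("miss"):
                cycles += memory_latency
        return cycles

    trace = [{"op": "ld", "miss": True}, {"op": "add"}, {"op": "st"}]
    print(run(trace, FixedLatencyCore(cpi=1.0)))  # 203.0 cycles in fast mode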

Retrospective

The VASA simulator is the result of an attempt to create a simulator framework that would fulfill the requirements of each author's current, past and future research projects. It was designed using the authors' experience from various other simulators and research projects. This paper contains some of the lessons learned and insights gained in the design of the VASA simulator framework. VASA was used in both Paper A and Paper D.

Contributions of this Thesis

The main contributions of this thesis are:

• It highlights a trend of decreasing bandwidth per transistor, which is likely to have severe implications for scaling up CMP designs. Furthermore, it identifies characteristics of runahead execution regarding accesses exhibiting spatial locality, and presents a method, fine-grained fetching, to exploit them to save bandwidth and cache space.

• The Cache Allocation Tick (CAT) time metric is introduced and the potential of selective cache allocation is quantified. The RASCAL cache organization, targeted at capacity misses, is proposed.

• It presents the first dynamic power evaluation of skewed cache organizations, together with multiple implementation optimizations for skewed caches. It proposes the Elbow cache organization, targeted at conflict misses.

• The Trap-based Memory Architecture is proposed, which provides application-transparent fine-grained coherence through software and performs on par with hardware-based solutions.

• It includes the first two papers analyzing the new Java middleware workload from a memory system and processor perspective. It identifies several performance bottlenecks in current designs.

Summary in Swedish

Owing to advances in the chip manufacturing industry, several processor cores now fit on a single silicon chip. This, together with the new rules of the game in silicon technology, which make it increasingly difficult to improve the performance of individual processor cores, has led to the development of chip multiprocessors (CMPs). This thesis is about the consequences this development has for the memory system, and it proposes several new designs based on the resource constraints that apply to CMPs. Examples of such constraints are silicon area and the power a chip dissipates. In addition, the thesis contains studies of simulation and program characterization, which are important tools when designing new processors and computer systems.

To bridge the speed gap between processors and main memory, so-called cache memories are used. Cache memories are fast but, compared with ordinary main memory, very expensive. Since the read and write time of a cache memory increases with its size, the capacity of cache memories is limited. Most of today's computers therefore contain several levels of cache memory, where the memory closest to the processor is the smallest and fastest. One of the cornerstones of cache memory design is a phenomenon exhibited by the great majority of programs, namely locality of reference. Locality can be divided into temporal and spatial locality. Temporal locality means that recently used data is likely to be used again soon. Spatial locality indicates that data adjacent to recently used data is likely to be used within a short period of time. In this thesis, several assumptions about these well-known concepts are analyzed and questioned, and methods for taking advantage of the insights are presented.

When new processors and memory systems are designed, simulation is often used. A simulation model of the new computer is created and runs the intended type of program in order to estimate how the new computer will perform. Since different programs exhibit different behaviors, it is important to use modern, up-to-date programs when performing these simulations. This thesis contains the first memory system study, as well as the first processor characterization study, of a new type of application that is becoming increasingly important in our everyday lives, namely Java-based middleware. These form the basis of many of the Internet-based services, such as ticket booking and record shopping, that

many of us use regularly. We study this new type of program especially from a chip multiprocessor perspective.

Paper-by-paper summaries of the memory system designs described in this thesis are given below:

• The cost of increasing chip bandwidth, i.e. the amount of data that can be transferred to a processor chip per unit of time, is high. In Paper A we show that this cost is likely to increase considerably in the future. We also show that a large part of the data read onto the chip is often never used. At the same time, we have identified a behavior among reads and writes for a new type of processor. In Paper A we show how this behavior can be exploited to reduce the amount of unnecessary data read onto a chip, and thereby reduce the bandwidth requirement.

• Paper B describes a selective caching algorithm that avoids devoting cache space to certain data. Which data to avoid is determined by measuring the interval between uses of the data. Data that is judged to have a low probability of remaining in the cache until its next use is not allowed to take up space in the cache. This allows the data that is allocated to stay in the cache longer. A new unit of time, Cache Allocation Ticks (CAT), is introduced and used to measure the distance between two uses.

• Paper C proposes an extension of a type of cache design known as skewed caches. The extension is based on the CAT concept from Paper B and on a relocation strategy for cache blocks that compete for the same place in the cache. One of the limited resources in a processor is power. This paper contains the first power evaluation of skewed caches, as well as several optimizations for how they can be implemented.

• Paper D presents a new type of architecture for keeping multiple cache memories coherent. Instead of making changes to the memory system to provide coherence, we propose making small changes to the processor, extending the fault-handling mechanism found in today's computers to also handle coherence faults in software. We show that a system based on these extensions can perform on par with a hardware-based system.

References

[1] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA’00), pages 248–259, 2000.

[2] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), 2000.

[3] Doug Burger, James R. Goodman, and Alain Kägi. Limited Bandwidth to Affect Processor Design. IEEE Micro, 17(6), 1997.

[4] Adrian Cristal, Daniel Ortega, Josep Llosa, and Mateo Valero. Out-of-Order Commit Processors. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA ’04), page 48, 2004.

[5] Darrell Dunn. Intel Expects 75% Of Its Processors To Be Dual-Core Next Year. InformationWeek, March 2005.

[6] John D. Davis, James Laudon, and Kunle Olukotun. Maximizing CMP Throughput with Mediocre Cores. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), pages 51–62, 2005.

[7] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5), 1997.

[8] Gordon E. Moore. Cramming More Components onto Integrated Circuits. Electronics Magazine, April 1965.

[9] M. Gowan, L. Biro, and D. Jackson. Power Considerations in the Design of the Alpha 21264 Microprocessor. In 35th Design Automation Conference, 1998.

[10] J. Hirase. Economical importance of the maximum chip area. In Proceedings of the Seventh Asian Test Symposium, 1998.

[11] A. Kowalczyk, V. Adler, C. Amir, F. Chiu, C. P. Chng, W. J. De Lange, Y. Ge, S. Ghosh, T. C. Hoang, B. Huang, S. Kant, Y. S. Kao, C. Khieu, S. Kumar, L. Lee, A. Liebermensch, X. Liu, N. G. Malur, A. A. Martin, H. Ngo, S. H. Oh, I. Orginos, L. Shih, B. Sur, M. Tremblay, A. Tzeng, D. Vo, S. Zambere, and J. Zong. The First MAJC Microprocessor: A Dual CPU System-on-a-Chip. IEEE Journal of Solid-State Circuits, 36(11):1609–1616, 2001.

[12] M. Tremblay, J. Chan, S. Chaudhry, A. W. Conigliam, and S. S. Tse. The MAJC Architecture: A Synthesis of Parallelism and Scalability. IEEE Micro, 20(6):12–25, 2000.

[13] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA'98), 1998.

[14] Trevor Mudge. Power: A First-Class Architectural Design Constraint. IEEE Computer, 34(4), April 2001.

[15] Siva Narendra, Vivek De, Shekhar Borkar, Dimitri Antoniadis, and Anantha Chandrakasan. Full-Chip Sub-threshold Leakage Power Prediction Model for Sub-0.18 Micron CMOS. In ISLPED '02: Proceedings of the 2002 International Symposium on Low Power Electronics and Design, pages 19–23, 2002.

[16] Kunle Olukotun, Basem A. Nayfeh, L. Hammond, K. Wilson, and K.-Y. Chang. The Case for a Single-Chip Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996.

[17] Ramon Canal, Joan-Manuel Parcerisa, and Antonio González. Dynamic Cluster Assignment Mechanisms. In Proceedings of the Sixth International Symposium on High Performance Computer Architecture (HPCA-6), Toulouse, France, January 2000.

[18] Semiconductor Industry Association (SIA). International Technology Roadmap for Semiconductors 1999.

[19] Semiconductor Industry Association (SIA). International Technology Roadmap for Semiconductors 2004 Update.

[20] Lawrence Spracklen and Santosh G. Abraham. Chip Multithreading: Opportunities and Challenges. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), 2005.

[21] Wm. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. Computer Architecture News, 23(1):20–24, March 1995.

