Research Collection

Doctoral Thesis

Adaptive main memory compression

Author(s): Tuduce, Irina

Publication Date: 2006

Permanent Link: https://doi.org/10.3929/ethz-a-005180607

Rights / License: In Copyright - Non-Commercial Use Permitted


Doctoral Thesis ETH No. 16327

Adaptive Main Memory Compression

A dissertation submitted to the

Swiss Federal Institute of Technology Zurich (ETH ZÜRICH)

for the degree of Doctor of Technical Sciences

presented by
Irina Tuduce
Engineer, TU Cluj-Napoca
born April 9, 1976
citizen of Romania

accepted on the recommendation of
Prof. Dr. Thomas Gross, examiner
Prof. Dr. Lothar Thiele, co-examiner

2005

Abstract

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. Since fast memory is expensive, an economical solution to that desire is a memory hierarchy organized into several levels. Each level has greater capacity than the preceding one but is less quickly accessible. The goal of the memory hierarchy is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. In the last decades, processor performance has improved much faster than the performance of the memory levels. The memory hierarchy has proved to be a scalable solution, i.e., the bigger the performance gap between processor and memory, the more levels are used in the memory hierarchy. For instance, in 1980 microprocessors were often designed without caches, while in 2005 most of them come with two levels of caches on the chip.

Since the fast upper memory levels are small, programs with poor locality tend to access data from the lower levels of the memory hierarchy. Therefore, these programs run slower than programs with good locality. The sizes of all memory levels have increased continuously. Following this trend alone, applications would eventually fit into higher (and faster) memory levels. However, application developers have even more aggressively increased their memory demands. Applications with large memory requirements and poor locality are becoming increasingly popular, as people attempt to solve large problems (e.g., network simulators, traffic simulators, model checking, databases). Given the technology and application trends, efficiently executing large applications on a hierarchy of memories remains a challenge.

Given that the fast memory levels are close to the processor, bringing the working set of an application closer to the processor tends to improve the performance of the application. One approach to bringing the application's data closer to the processor is compressing one of the existing memory levels. This approach is becoming increasingly attractive as processors become faster and more cycles can be dedicated to (de)compressing data.

This thesis provides an example of efficiently designing and implementing a compressed-memory system. We choose to investigate compression at the main memory level (RAM) because the management of this level is done in software (thus allowing for rapid prototyping and validation). Although our design and implementation are specific to main memory compression, the concepts described are general and can be applied to any level in the memory hierarchy.

The key idea of main memory compression is to set aside part of main memory to hold compressed data. By compressing some of the data space, the effective memory size available to the applications is made larger and disk accesses are avoided. One of the thorny issues is that sizing the region that holds compressed data is difficult, and if it is not done right (i.e., the region is too large or too small) memory compression slows down the application.

There are two claims that make up the core of the thesis. First, the thesis shows that it is possible to implement main memory compression in an efficient way, meaning that while


applications execute the size of memory that stores compressed data can be changed easily if it is advisable to do so. We describe a practical design for an adaptive compressed-memory system and demonstrate that it can be integrated into an existing general-purpose operating system. The key idea of our design is to keep compressed pages in self-contained zones, and to grow and shrink the compressed region by adding and removing zones.

Second, the thesis shows that it is possible to estimate - during an application's execution - how much data should be compressed such that compression improves the application's performance. The technique we propose is based on an application's execution profile and finds the compressed region size such that the application's working set fits into the uncompressed and compressed regions. If such a size does not exist or compression hurts performance, our technique turns off compression. This way, if compression cannot improve an application's performance, it will not hurt it either. To determine whether compression hurts performance, we compare - at runtime - an application's performance on the compressed-memory system with an estimation of its performance without compression. The performance estimation is based on the memory access pattern of the application, the efficiency of the compression algorithm, and the amount of data being compressed.

The design proposed in this thesis is implemented in Linux OS, runs on both 32-bit and 64-bit architectures, and has been demonstrated to work in practice under complex workload conditions and memory pressure. To assess the benefits of main memory compression, we use benchmarks and real applications that have large memory requirements. The applications used are: Symbolic Model Verifier (SMV), NS2 network simulator, and qsim car traffic simulator. For the selected benchmarks and applications, the system shows an increase in performance by a factor of 1.3 to 55.

To sum up, this thesis shows that a compressed-memory level is a beneficial addition to the classical memory hierarchy, and this addition can be provided without significant effort. The compressed-memory level exploits the tremendous advances in processor speed that have not been matched by corresponding increases in memory performance. Therefore, if access times to memory and disk continue to improve over the next decade at the same rate as they did during the last decade (the likely scenario), compressed-memory systems are an attractive approach to improve total system performance.

Zusammenfassung

Pioniere der Informatik haben korrekt vorhergesehen, dass Programmierer unbegrenzten Bedarf nach schnellem Speicher haben würden. Schneller Speicher ist jedoch teuer. Eine ökonomische Lösung zur Zufriedenstellung des Bedarfs ist eine in mehrere Ebenen aufgeteilte Speicherhierarchie. Jede Ebene bietet eine grössere Kapazität als die vorgängige, jedoch eine geringere Zugriffsgeschwindigkeit. Das Ziel der Speicherhierarchie ist die Bereitstellung eines Speichersystems zu Kosten, die fast der günstigsten Ebene entsprechen, und mit einer Geschwindigkeit, die fast der schnellsten Ebene entspricht. Die Leistungsfähigkeit von Prozessoren hat in den letzten Jahrzehnten markant schneller zugenommen als diejenige der Speicherebenen. Die Idee des hierarchischen Speichersystems hat sich dabei als skalierbare Lösung erwiesen; je grösser die Differenz zwischen der Leistungsfähigkeit von Prozessoren und Speicher wird, desto mehr Ebenen werden in der Speicherhierarchie verwendet. 1980 wurden Mikroprozessoren zum Beispiel oft ohne Caches entworfen; 2005 besitzen jedoch die meisten Prozessoren zwei Cacheebenen direkt auf dem Prozessorchip. Da die schnellen höheren Speicherebenen klein sind, tendieren Programme mit schlechter Lokalität dazu, auf Daten in den tieferen Speicherebenen zuzugreifen. Das führt zu Geschwindigkeitseinbussen im Vergleich zu Programmen mit guter Lokalität. Trotz des kontinuierlichen Wachstums aller Speicherebenen hat sich diese Situation nicht verbessert, da die

Nachfrage nach Speicher durch Anwendungsentwickler in noch stärkerem Masse zugenommen hat. Das Lösen immer grösserer Problemstellungen (z.B. Netzwerksimulation, Verkehrssimulation, Model-Checking, Datenbanken) führt vermehrt zu Applikationen mit hohen Speicheranforderungen und schlechter Lokalität. Unter diesen Technologie- und Anwendungstrends bleibt die effiziente Ausführung von grossen Applikationen eine Herausforderung.

Wenn das Working Set einer Applikation näher zum Prozessor gebracht wird, kann damit tendenziell die Ausführungsgeschwindigkeit erhöht werden, da die schnelleren Speicherebenen nahe beim Prozessor liegen. Ein Ansatz, um die Daten einer Applikation näher zum Prozessor zu bringen, besteht in der Komprimierung einer bestehenden Speicherebene. Dieser Ansatz wird zunehmend attraktiver, bedingt durch die Zunahme der Prozessorgeschwindigkeit, die es erlaubt einen Teil der Prozessorzeit zur Komprimierung und Entkomprimierung von Daten aufzuwenden.

Diese Doktorarbeit beschreibt den Entwurf und die Implementation eines effizienten Speichersystems mit Speicherkomprimierung. Wir fokussieren uns dabei auf die Komprimierung auf Hauptspeicherebene (RAM). Diese Ebene wird softwareseitig verwaltet, was eine schnelle Prototypisierung und Validation ermöglicht. Obwohl unser Entwurf und die Implementation spezifisch auf die Hauptspeicherkomprimierung ausgerichtet sind, können die beschriebenen Konzepte auf alle Ebenen der Speicherhierarchie angewandt werden.

Die Kernidee der Hauptspeicherkomprimierung ist es, einen Teil des Hauptspeichers für


komprimierte Daten zu reservieren. Durch die Komprimierung wird der effektiv den Applikationen zur Verfügung stehende Speicher vergrössert und die Anzahl Festplattenzugriffe wird reduziert. Ein wichtiger Aspekt ist dabei die Wahl der Grösse der komprimierten Speicherregion. Wird diese Region zu gross oder zu klein gewählt, kann die Speicherkomprimierung zu einer Verlangsamung der Applikation führen.

Diese Doktorarbeit besteht aus zwei Grundthesen. Erstens zeigt die Arbeit, dass es möglich ist, Hauptspeicherkomprimierung effizient zu implementieren. Die Grösse des komprimierten Speichers kann dabei während der Laufzeit einfach angepasst werden, wenn dies zweckmässig ist. Wir beschreiben ein praktisches Design eines adaptiven Systems zur Speicherkomprimierung und zeigen, dass dieses in ein existierendes Betriebssystem integriert werden kann. Die Hauptidee unseres Designs besteht darin, komprimierte Speicherseiten in abgeschlossenen Zonen abzulegen und die Grösse des komprimierten Bereichs durch Hinzufügen oder Entfernen von solchen Zonen anzupassen.

Zweitens zeigt die Arbeit, dass während der Laufzeit von Programmen eine Abschätzung gemacht werden kann, welche Datenmenge komprimiert werden soll, damit die Komprimierung zu einer Leistungssteigerung der Applikation führt. Die dazu vorgeschlagene Technik basiert auf dem Ausführungsprofil der Applikation und bestimmt die Grösse des komprimierten Bereichs, so dass das Working Set der Applikation in den verfügbaren komprimierten und nicht komprimierten Speicher passt.

Die Kompression wird deaktiviert, wenn eine solche optimale Grösse nicht existiert oder wenn die Komprimierung zu einer Leistungseinbusse führt. Um zu bestimmen, ob die Komprimierung die Ausführungsleistung verschlechtert, vergleichen wir zur Laufzeit die effektive Leistung mit der geschätzten Leistung für eine Ausführung ohne Komprimierung. Diese Schätzung basiert auf dem Speicherzugriffsmuster der Applikation, der Effizienz des Komprimierungsalgorithmus und der komprimierten Datenmenge. Das vorgeschlagene Design wurde im Betriebssystem Linux implementiert und läuft sowohl auf 32- als auch auf 64-Bit-Architekturen. Die Funktionalität wurde unter Verwendung von komplexen Arbeits- und Speicherlasten unter Beweis gestellt. Wir verwenden Benchmark-Programme und reale Applikationen, um die Vorteile der Hauptspeicherkomprimierung zu beurteilen. Folgende Applikationen wurden dabei verwendet: der Symbolic Model Verifier (SMV), der Netzwerk-Simulator NS2 und der Verkehrs-Simulator qsim. Für die ausgewählten Benchmarks und Applikationen führt das System zu einer Leistungssteigerung zwischen Faktor 1.3 und 55.

Diese Doktorarbeit zeigt, dass eine komprimierte Speicherebene eine nützliche und praktikable Erweiterung der klassischen Speicherhierarchie von modernen Betriebssystemen ist, und dass diese zusätzliche Ebene ohne erheblichen Mehraufwand zur Verfügung gestellt werden kann. Mit diesem Ansatz wird die enorme Zunahme der Prozessorgeschwindigkeit ausgenutzt, die einer erheblich kleineren Zunahme der Festplattengeschwindigkeit gegenübersteht. Wenn sich die Zugriffszeiten auf Hauptspeicher und Festplatte weiterhin in gleichem Masse verbessern, was sehr wahrscheinlich ist, dann sind Speichersysteme mit Komprimierung ein attraktiver Ansatz, um die Gesamtleistung von Computersystemen zu erhöhen.

Acknowledgments

I thank Professor Thomas Gross for enabling and supporting the research presented in this thesis and for offering an enjoyable work environment for his PhD students. I am also very grateful to Professor Lothar Thiele for accepting to be my co-examiner and providing valuable feedback.

Thanks are owed to Oliver Trachsel who translated the abstract of this thesis to German. I thank Eva Ruiz for organizing exciting social events, as well as Patrik Reali and Viktor Schuppan for organizing great Assistentenabends.

I thank members of the Computer Systems Institute at ETH for feedback and interesting discussions, especially (in alphabetical order) Susanne Cech, Matteo Corti, Hans Domjan, Roger Karrer, Christian Kurmann, Nico Matsakis, Pieter Müller, Val Naumov, Luca Previtali, Felix Rauch, Patrik Reali, Florian Schneider, Viktor Schuppan, Yang Su, Oliver Trachsel, Cristian Tuduce, and Christoph von Praun.

I thank the students I supervised during their master theses for their contributions to the implementation and evaluation of the compressed-memory system, namely Raul Silaghi (performance modeling), Philip Oswald (first prototype), Nicolas Wettstein (memory traces), Daniel Steiner (instrumentation), and Cristian Morariu (adaptivity implementation).

Last but not least, I thank my parents, brother, and husband for their unconditional support, and all others who influenced the research of this PhD thesis and were not explicitly mentioned here.

Contents

Abstract v

Zusammenfassung vii

Acknowledgments ix

Contents xi

1 Introduction 1

1.1 Motivation 1
1.1.1 Technology Trends 1
1.1.2 Applications 3

1.1.3 Bridging the Main Memory - Disk Gap 3
1.1.4 Scope 4
1.2 Thesis Statement 4
1.3 Roadmap 5

2 Background 7
2.1 L1/L2 Compression 8
2.2 L2/DRAM Memory Compression 8
2.3 DRAM/Disk Compression 9
2.3.1 Hardware-based Main Memory Compression 9
2.3.2 Software-based Main Memory Compression 12
2.4 Summary 16

3 Design and Implementation 19

3.1 Design 19
3.1.1 Global Metadata 20

3.1.2 Local Metadata 21
3.1.3 Page Insert 23
3.1.4 Page Delete 25
3.1.5 Zone Add 26

3.1.6 Zone Delete 26
3.2 Implementation and Integration in Linux 29
3.2.1 Linux Internals 29


3.2.2 Implementation and Integration Details 37
3.3 Discussion 44
3.4 Summary 44

4 Performance Modeling 47
4.1 Introduction 47
4.2 Background 48
4.2.1 Performance Prediction 48

4.2.2 Machine Characterization 49

4.2.3 Instrumentation and Simulation Tools 49
4.3 Two Simple Models for Execution Prediction 58
4.4 Target System 59
4.5 Sample Applications 61
4.5.1 SMV (Symbolic Model Verifier) 61
4.5.2 CHARMM (Chemistry at HARvard Molecular Mechanics) 61
4.5.3 NS2 (Network Simulator) 62
4.6 Experimental Results 63
4.6.1 MSP-RA Prediction Model 63
4.6.2 MSP-IA Prediction Model 65
4.7 Summary 67

5 Adaptation 71
5.1 Introduction 71
5.1.1 Performance Potential of Main Memory Compression 72
5.2 Cost/Benefit Analysis of Main Memory Compression 73
5.2.1 Performance Model for a Compressed-Memory System 73
5.2.2 Compressed Region Size 75
5.2.3 Influence of Application Characteristics 75
5.3 Validation 76
5.3.1 Experimental Setup 76
5.3.2 Experimental Results 78
5.4 Our Approach to Addressing Adaptivity 80
5.4.1 Resizing Scheme 80
5.4.2 Implementation Details 82
5.4.3 Efficiency Considerations 84
5.5 Related Work 85
5.6 Summary 86

6 Evaluation 87
6.1 Experimental Setup 87

6.2 Is main memory compression beneficial? 88
6.2.1 Symbolic Model Verifier (SMV) 88
6.2.2 NS2 Network Simulator 89
6.2.3 qsim Traffic Simulator 90

6.2.4 Discussion 91
6.3 Does adaptation work? 92
6.3.1 Compression Improves Performance 92
6.3.2 Compression Degrades Performance 93
6.4 When does adaptation fail? 98
6.5 Efficiency 98
6.5.1 Time Complexity 99
6.5.2 Space Efficiency 100
6.6 Summary 109

7 Conclusions 111
7.1 Summary and Contributions 112
7.2 Future Work 113

Bibliography 115

List of Figures 123

1 Introduction

1.1 Motivation

1.1.1 Technology Trends

Over the last four decades, computer technology has improved each year. Occasionally technological breakthroughs have occurred and sometimes development has stalled, but both processor and memory technology have improved performance at a constant rate. The typical trend line is canonized by Moore's law that circuits-per-chip increase by a factor of four every three years. In other words, memories get four times larger every three years. This observation has been approximately true since the early RAM (random-access memory) chips of 1970. Moore's law, which originally applied only to RAM, has been generalized to apply to microprocessors and to disk storage capacity. Indeed, disk capacity has been improving by leaps and bounds; it has improved 100-fold over the last decade.

Processors. Microprocessors have changed the economics of computing completely. Initially, some thought that RISC (Reduced Instruction Set Computers) was the key to the increasing speed of microprocessors. However, Intel's IA-32 and IA-64 CISC (Complex Instruction Set Computers) continue to be competitive. Moreover, it appeared that the next step was for the RISC and CISC lines to merge as super-pipelined VLIW (Very Long Instruction Word) computers. These technology trends indicate that faster, inexpensive microprocessors are coming. Related to Moore's law is Bill Joy's law that Sun Microsystems' processor MIPS (millions of instructions per second) double every year. Though Sun's own technology has not strictly obeyed this law, the industry as a whole has (the actual rate for MIPS to double seems to be 1.6-1.7 years [87]). Performance, as the primary differentiator for microprocessors, has seen the greatest gains: 1000-2000x in bandwidth (MIPS) and 20-40x in latency (nsec) over the last 20 to 25 years [87]. The current trend is a 50% annual increase in clock speed and a 40% annual decline in cost [53].

RAM. DRAM is almost on the historic trend line of 50% price decline per year. A new generation appears about every 3 years (the actual rate seems to be 3.3 years), with each successive generation being 4 times larger than the previous (as predicted by Moore's law). However, capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which data and instructions are delivered to the CPU also determines its ultimate performance. Although processors are getting much faster, DRAMs are not keeping pace: DRAM speeds are increasing slowly, about 10% per year. As a consequence, high-speed static RAM chips are used to build multi-level caches to overcome the long latency to memory. Since SRAM is mainly built out of transistors (using essentially the same technology as the processor chip), SRAM density improves at the same rate as component density on processors, i.e., by about 50% per year, a slightly slower rate than DRAM density. SRAM speed improves by about 40% per year. Although SRAM speed increases faster than DRAM speed (40% versus 10% per year), the gap between processor speed (which grows by about 50% per year) and memory speed keeps increasing.

Emerging Memory Technologies. Driven by changing market needs, a recent flurry of activity in the development of new memory technologies has emerged. Products based upon Magnetic Random Access Memory (MRAM), Ferro RAM (FeRAM), chalcogenide, polymer, MEMS (MicroElectroMechanical Systems), ovonic memory, nanotube memory, holographic memory, molecular memory, scanning probe-based memories, write-once 3D memory technologies, molectronics and single-electron memories are in various stages of development by many firms, and commercial release of some of these products is planned for the near future. Although it is not easy for a new technology to become commercially successful in memory markets, changing market needs may facilitate the market penetration of one or more new technologies. However, it is not expected that we will see the rapid transition that marked the beginning of the Dynamic Random Access Memory (DRAM) era. Of the most recent emerging memory technologies, only Flash has succeeded in the market. Despite being fundamentally a variant of the existing electrically erasable programmable read-only memory (EEPROM), Flash required nearly a decade from introduction to major success. Technologies that succeed will offer either the ability to replace two different chips with a single chip, significantly more storage capacity at an acceptable cost, or a major user benefit at virtually no price penalty.

Magnetic Disks. In contrast to primary memory technologies, the performance of conventional magnetic disks has improved at a modest rate. The performance of these mechanical devices is dominated by the seek time, rotational latency, and data transfer time [65]. Although different technologies can reduce the average seek time, the raw seek time has improved by only 7% per year. Disk speeds have risen from 3600 to 15000 RPM, decreasing rotational latency by a factor of 2.5. Even with today's technology, more than 90% of the time taken to transfer a 4K-byte block is introduced by the mechanical delays of moving the head and rotating the disk. Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of . Greater recording density translates into a higher transfer rate once the information is located. The bandwidth-latency gap is expected to grow, as expressed by the following rule of thumb [87]: in the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4.

The cost of a disk access is further complicated by the software system. Each page fault or request for data on disk causes CPU overhead: a buffer has to be found for the data, the request must be initiated, and interrupts (with context switches) must be handled. Moreover, in some systems, e.g., UNIX file I/O or any database system, the data from disk must be copied one or more times within memory before the user can access it.

Amdahl's Law. As we have seen so far, different pieces of computer systems have improved performance at different rates. Amdahl's law quantifies the impact of improving the performance of some parts of a problem while leaving others the same [88]. Suppose that some applications spend 10% of their time in I/O. According to the current trend, in about three years processors will be 10x faster than now. For this figure, Amdahl's law predicts that the effective speed-up of these applications will be only 5x. When processors are 100x faster than now, these applications will be less than 10x faster, wasting 90% of the potential speed-up.
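As a worked aside, the standard textbook form of Amdahl's law reproduces these figures; the 10% I/O share and the 10x and 100x processor speed-ups are the only numbers taken from the example above.

\[
  S_{\text{overall}} \;=\; \frac{1}{(1 - f) + \dfrac{f}{s}}
\]

where f is the fraction of execution time that benefits from the speed-up s and (1 - f) is the unaffected (here, I/O) fraction. With f = 0.9 and s = 10, S = 1/(0.1 + 0.09) ≈ 5.3; with s = 100, S = 1/(0.1 + 0.009) ≈ 9.2, i.e., less than 10x.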

Unfortunately, the gap between the access time of primary and secondary storage is growing, not shrinking. This is due to the phenomenal improvements in microprocessor speeds that mechanical devices cannot match (a 60% yearly increase in processor speeds versus an 8% annual improvement in disk access times). This means that systems are and will be limited by the peripheral storage (disks). In other words, the performance limits of computer systems today are not how fast they can compute but how fast they can retrieve and safely store data.

1.1.2 Applications

The key to understanding the CPU-disk tradeoffs is understanding the applications. In this dissertation we focus on memory-bound applications, which are the applications that would benefit more from a larger memory than from a faster processor [46].

The two factors that play a key role in determining whether a program is memory bound are (1) locality of memory references and (2) ratio of memory references to other CPU operations. A program has good locality if it often reads memory locations referenced in the immediate past, or locations physically close together. A program with good locality does not have large physical memory requirements, as it will infrequently have to access secondary storage under demand paging. Similarly, if a program executes many internal CPU instructions between each memory reference, the need for a large memory is reduced. On the other hand, the applications that have poor locality and a high ratio of memory to internal CPU operations have large memory requirements.

Although the amount of main memory has increased significantly in the last decades, as described in Section 1.1.1, application developers have even more aggressively increased their demands. Many of today's computer applications require large amounts of system memory. This is especially true with very large and complex applications that provide hundreds of functionalities and handle large amounts of data, e.g., databases or CAD/CAM systems. If an application is going to require a large memory, it will be because it references a large data structure, e.g., a database or the representation of a VLSI chip. (Since instruction references have excellent locality, the code segment requires little memory.) The large data structures are becoming more common as people attempt to solve large problems. A good example of an application that is limited by the physical memory available in a system is symbolic model checking. Researchers and practitioners in computer-aided design continue to devote substantial resources to improve the (space) performance of model checkers [117]. Nevertheless, there are many interesting models that cannot be checked because the amount of memory required exceeds the size of the available main memory.

1.1.3 Bridging the Main Memory - Disk Gap

The technology trends described in Section 1.1.1 dictate that a processor must wait an increasingly large number of cycles for a disk read/write to complete. Moreover, as described in Section 1.1.2, even though DRAM sizes increase, there never seems to be enough memory for all applications. Given these trends, a huge amount of performance literature has focused on hiding the I/O latency for disk-bound applications.

Caching is a fundamental technique in hiding I/O latency and is widely used in storage controllers, databases, file systems, and operating systems. A cache is a high-speed memory or storage device used to reduce the effective time required to read data from or write data to a lower-speed memory or device. However, when an application's memory requirements largely exceed the cache size, the cache cannot improve the application's performance much. Another method to hide I/O latency is demand paging [30]. When demand paging is used, the system brings a page into main memory from the disk only on a miss. A deeper dent can be made in I/O latency by speculatively prefetching pages even before they are requested [112]. Nevertheless, commercial systems have rarely used sophisticated prediction schemes [48]. The main reason is that sophisticated prediction schemes need an extensive history of page accesses, which is cumbersome and expensive to maintain for real-life systems.

An attractive approach to avoiding disk accesses is main memory compression. The basic idea of a compressed-memory system is to reserve some memory that would normally be used directly by an application and use this memory region instead to hold pages in compressed format. By compressing some of the data space, the effective memory size available to the applications is made larger and disk accesses are avoided. Previous studies have shown that main memory compression can improve the performance of some applications significantly. On the other hand, because compressed pages must be decompressed before use, the average DRAM access time increases, potentially increasing an application's execution time. Therefore, sizing the region that holds compressed data is difficult and if not done right (i.e., the region is too small or too large) memory compression will slow down the application.

1.1.4 Scope

This dissertation describes a flexible design for a compressed-memory system. The proposed system is able to resize the compressed region easily if it is advisable to grow or shrink the size of the compressed region. In this context, emphasis is placed on understanding the impact of main memory compression on an application's performance. Based on this understanding of the compressed-memory system, the system finds the compressed region size that improves an application's performance the most, including the case when there is no need for a compressed region. In addition, this dissertation work aims at broadening the understanding of compressed-memory systems by means of a systematic performance evaluation that explains the key factors affecting the performance of a compressed-memory system.

1.2 Thesis Statement

The increasing gap between the CPU cycle time and disk access time together with the increasing memory demands of modern applications pose a problem to many memory-bound applications. I claim that:

• Software main memory compression, if implemented with adaptivity in mind, is capable of improving the performance of many memory-bound applications. The key idea of a practical design is to keep compressed pages in zones; the compressed region is grown and shrunk by adding and removing zones (a minimal sketch of one possible zone layout follows this list).

• The size of the compressed region that can improve an application's performance can be determined at run-time based on the application's memory access pattern. The main idea is to focus on the application's interaction with the memory system and combine analytical models with results from micro-benchmarking.
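The following sketch illustrates one possible shape of such a zone-based compressed region; the structure and function names are invented for illustration and are not the data structures of the implementation described later in the thesis.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical sketch of a zone-based compressed region.  A zone is a
     * self-contained chunk of memory that stores compressed pages; the
     * region grows and shrinks only at zone granularity. */
    struct czone {
        unsigned char *mem;       /* backing storage of the zone               */
        size_t         size;      /* total bytes in this zone                  */
        size_t         used;      /* bytes currently holding compressed pages  */
        struct czone  *next;      /* zones kept in a simple singly linked list */
    };

    struct cregion {
        struct czone *zones;      /* list of zones forming the compressed region */
        size_t        nr_zones;
    };

    /* Grow the region by adding one zone of zone_size bytes. */
    int cregion_grow(struct cregion *r, size_t zone_size)
    {
        struct czone *z = malloc(sizeof(*z));
        if (!z)
            return -1;
        z->mem = malloc(zone_size);
        if (!z->mem) {
            free(z);
            return -1;
        }
        z->size = zone_size;
        z->used = 0;
        z->next = r->zones;
        r->zones = z;
        r->nr_zones++;
        return 0;
    }

    /* Shrink the region by releasing one zone.  A real implementation would
     * first drain the zone, i.e., write its compressed pages to disk or move
     * them into other zones, before freeing it. */
    void cregion_shrink(struct cregion *r)
    {
        struct czone *z = r->zones;
        if (!z)
            return;
        r->zones = z->next;
        r->nr_zones--;
        free(z->mem);
        free(z);
    }

Because each zone is self-contained, resizing the compressed region never requires reorganizing the compressed data as a whole, which is what makes run-time adaptation cheap.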

1.3 Roadmap

The thesis is established as follows:

• We describe the design and implementation of a compressed-memory system that allows for easy resizing of the compressed region while applications execute. The integration of the proposed system in the Linux operating system is presented, drawing a line between the OS-dependent and OS-independent parts.

• We present simple performance models for predicting the performance of memory-bound applications on common and compressed-memory systems. The accuracy of the performance models is evaluated for several programs that are representative of memory-bound applications.

• We describe the adaptation scheme used to resize the compressed region while applications execute. The adaptation scheme tries to find the compressed region size for which an application's working set fits into the uncompressed and compressed regions.

• We present a systematic, application-oriented evaluation of the compressed-memory system proposed by this dissertation work. The analysis investigates two classes of applications: applications for which compression can improve performance and applications for which compression degrades performance.

The dissertation is organized as follows: Chapter 2 discusses related research in using hardware- and software-based compression at different memory levels and thereby provides evidence that the challenges of implementing main memory compression have largely remained untackled so far. Chapter 3 presents the design and implementation of the compressed-memory system proposed by this thesis, and describes its integration in the Linux OS. Chapter 4 describes two analytical models for predicting the performance of large applications executing on common systems. One of the models is extended in Chapter 5 to apply also to memory-bound applications executing on compressed-memory systems. Moreover, Chapter 5 discusses the adaptation process in detail; the adaptation process uses the proposed prediction model to detect the cases when there is no need for main memory compression. Chapter 6 introduces the evaluation methodology, analyzes the efficacy of main memory compression for different memory-bound applications, and presents quantitative results that allow us to identify the key factors affecting adaptation performance. Chapter 7 summarizes our findings and concludes the dissertation.

2 Background

Compression has been proposed as a solution for 1) more effective utilization of available storage and 2) reducing the amount of data transferred over the communication paths in a system. As the goal of this dissertation is to assess the benefits of using compression to increase the amount of memory available in a system, this chapter discusses the studies that investigate memory compression only (the first approach). However, as far as the second approach is concerned, researchers have shown that compression has the potential to reduce address bus widths in most cases and data bus widths in some cases while maintaining equal or better performance than in the uncompressed case [61, 60, 62]. Compression techniques have been used effectively for both data and instructions [57, 110]. On one hand, instruction compression is easier than data compression because typically code is not modified by a running program. On the other hand, viewed on the basis of the bit/byte patterns alone, code does not provide much opportunity for compression. Fortunately, it is possible to exploit knowledge about instruction formats and procedures to achieve compression ratios of 3-5 [76, 38, 33, 77]. However, applications that have large memory footprints require large memory sizes to accommodate their data sets, and the size of their data segment largely exceeds their code segment size. As this dissertation focuses on large applications that execute on commodity computers, it investigates the effect of data compression only. With the considerable interest in memory compression, several studies have examined the compressibility of in-memory data and proposed/derived specialized compression algorithms that achieve compression ratios of 2-3 on average. The existing compression algorithms are described shortly in the context of the systems that employ them.

This chapter focuses on lossless data compression, which allows exact reconstruction of the original data. All compression algorithms exploit expected regularities in data, and consist of two phases, which are typically interleaved in practice: modeling and encoding [83]. Modeling is the process of detecting regularities, and encoding is the construction of a small representation of the detected regularities. Most compression algorithms read through the input data token by token and construct a dictionary of observed patterns. The decompressor reads through the encoded data much like an interpreter, reconstructing the original data based on the dictionary created during compression.

This chapter discusses the existing approaches to memory compression and shows that compression is an attractive approach to improve application performance. A compressed-memory system architecture is a computer system architecture that employs compression at one or more levels of the memory hierarchy. Researchers have investigated the potential benefits of compression at the L1 and L2 cache levels, in DRAM memory, and on disk. Whether and to what extent


the potential benefits of compression are actually achieved depends on the implementation details of each approach. The discussion of related research in memory compression reveals the issues that have remained open so far and will consequently be explored by this dissertation.

2.1 L1/L2 Cache Compression

Yang et al. The only study that investigates the potential of using compression at the L1/L2 cache boundary is that of Yang et al. [118]. In their paper, the authors present the design and evaluation of an L1 cache where each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their sizes. (If a cache line cannot be compressed to half its size, it is kept in uncompressed form.) Additional bits are used to identify whether a line is stored in compressed form. The proposed system uses the frequent value cache (FVC) compression algorithm, which compresses individual items (e.g., 32-bit words) in a cache line [119]. Due to the small compression unit, data in the L1 cache can be compressed very efficiently. The system reduces traffic and energy in that it transfers data over external buses in compressed form. Namely, when a compressed cache line is evicted, it is transmitted off-chip in compressed form and uncompressed before being stored in off-chip memory. Simulations show that for the SPECint95 benchmarks L1 cache compression allows greater amounts of data to be stored, leading to substantial reductions in L1 miss rates (0-36.4%), off-chip traffic (3.9-48.1%), and energy consumed (1-27%).
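As a rough illustration of the storage layout only (not Yang et al.'s hardware design), a cache-line slot that holds either one uncompressed line or two half-size compressed lines can be modeled as follows; the line size and all names are assumptions.

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64          /* assumed L1 line size */

    /* One physical cache-line slot: either a single uncompressed line or two
     * lines that each compressed to at most LINE_BYTES/2. */
    struct l1_slot {
        uint8_t compressed;        /* flag kept alongside the tags       */
        uint8_t data[LINE_BYTES];  /* raw storage shared by both layouts */
    };

    /* Store two compressed lines in one slot if both fit in half a line;
     * otherwise fall back to storing a single uncompressed line. */
    int slot_store(struct l1_slot *s,
                   const uint8_t *c0, size_t len0,
                   const uint8_t *c1, size_t len1,
                   const uint8_t *uncompressed_line)
    {
        if (len0 <= LINE_BYTES / 2 && len1 <= LINE_BYTES / 2) {
            s->compressed = 1;
            memcpy(s->data, c0, len0);
            memcpy(s->data + LINE_BYTES / 2, c1, len1);
            return 2;              /* two logical lines resident */
        }
        s->compressed = 0;
        memcpy(s->data, uncompressed_line, LINE_BYTES);
        return 1;                  /* one logical line resident */
    }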

2.2 L2/DRAM Memory Compression

Although there are a couple of studies that assess the potential benefit of compression at the L2/DRAM memory boundary, given the complexity of the implementation, none of these studies has made it into production so far.

Lee et al. The Selective Compressed Memory System (SCMS) proposed by Lee et al. [75] comprises an L1 cache, managed as a conventional cache, and an L2 cache and main memory that can store compressed and uncompressed data. In the proposed scheme, a cache line may contain either an uncompressed line or two adjacent compressed lines. (If a line and its adjacent neighbor compress to less than one line, they are stored compressed.) To speed up compression, the authors modify the X-RL de/compressor [66] to support parallel de/compression. Furthermore, the system uses a small decompression buffer between L1 and L2 that acts like an intermediate cache. However, the main memory storage scheme is rather simplistic and the authors do not consider any effect of the compressed main memory. Main memory is divided into pages of normal size and half-size; a half-page is used if all blocks within a page are compressed. Detailed trace-driven simulations show that for the SPEC95 benchmarks this approach can reduce the miss ratio by up to 35% and read/write traffic in the core by up to 53%.

Alameldeen and Wood. The system proposed by Alameldeen and Wood [14] is a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way associative with LRU replacement, where each set can store up to eight compressed lines

but has space for only four uncompressed lines. The system uses a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. On each L2 reference, based on the LRU stack depth and compressed size, the system determines whether compression could have eliminated a miss or incurred an unnecessary decompression overhead. The global counter is incremented by the L2 miss penalty if compression did eliminate a miss, or is decremented by the decompression latency if the reference would have hit regardless of compression. When an L2 cache line is allocated, it is stored uncompressed if the counter is negative, and compressed otherwise. Besides the adaptive scheme, the authors propose a new compression scheme called Frequent Pattern Compression (FPC) that takes advantage of small values and has a low decompression latency [15]. Full-system simulations show that adaptive L2 cache compression can improve the performance of commercial workloads by up to 17% while never degrading performance by more than 0.4%.
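The counter update described above can be sketched as follows; this is an illustrative software rendering of a hardware predictor, and the saturation bounds and function names are assumptions.

    /* Global saturating counter that decides whether newly allocated L2 lines
     * are stored compressed; updated on every L2 reference.  The saturation
     * bounds are arbitrary. */
    static long adapt_counter;                   /* positive favors compression */
    #define COUNTER_MAX  (1L << 20)
    #define COUNTER_MIN  (-(1L << 20))

    void update_counter(int compression_avoided_a_miss,
                        int hit_regardless_of_compression,
                        long l2_miss_penalty,
                        long decompression_latency)
    {
        if (compression_avoided_a_miss)
            adapt_counter += l2_miss_penalty;        /* benefit: a miss was avoided   */
        else if (hit_regardless_of_compression)
            adapt_counter -= decompression_latency;  /* cost only: useless decompress */

        if (adapt_counter > COUNTER_MAX) adapt_counter = COUNTER_MAX;
        if (adapt_counter < COUNTER_MIN) adapt_counter = COUNTER_MIN;
    }

    /* On allocation, store the line compressed unless the counter is negative. */
    int allocate_compressed(void)
    {
        return adapt_counter >= 0;
    }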

Hallnor and Reinhardt. A different approach is proposed by Hallnor and Reinhardt [52], who analyze a memory hierarchy that consists of a compressed on-chip cache (the last-level cache) and a compressed main memory modeled after IBM's MXT system [109]. Because the proposed system is based on the MXT implementation, it uses the LZSS [44] compression algorithm employed in the MXT system. The design of the compressed cache builds on the Indirect Index Cache (IIC) design [51]. In addition to the IIC design, the new design, called the Indirect Index Cache with Compression (IIC-C), allocates variable amounts of storage to different cache lines based on their compressibility. To mitigate the possible negative effect of de/compression, Hallnor and Reinhardt extend the Generational Replacement algorithm (Gen) [51], originally proposed for the IIC, to keep the most recently accessed blocks uncompressed. Gen maintains prioritized FIFOs, or pools, of the blocks in the cache. Each block stays in a given pool for a number of misses, after which it moves to either a higher or lower priority pool depending on whether it has been referenced or not. When a block is accessed, it is decompressed, and when the block is moved to another pool, it is recompressed. If the cache has many misses, blocks will move between pools frequently, many blocks will be compressed, and the amount of data in the cache will increase. Conversely, if an application fits in the cache, there will be no misses, no blocks will move between pools, data will remain uncompressed, and the cost of compression will be avoided. Simulations of SPEC2000 benchmarks show an average speed-up of 19%, while degrading performance by no more than 5%. The combined scheme achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone.
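A loose sketch of the pool movement just described; the number of pools, the epoch length, and the promotion-rule details are assumptions, not the published Gen algorithm.

    #define NR_POOLS 4            /* assumed number of prioritized FIFOs (pools) */

    struct cblock {
        int pool;                 /* 0 = lowest priority                         */
        int referenced;           /* touched since entering the current pool?    */
        int misses_left;          /* misses to wait before the next pool move    */
        int is_compressed;        /* blocks outside the top pool stay compressed */
    };

    /* Called for a resident block when the cache takes a miss. */
    void gen_on_miss(struct cblock *b, int misses_per_epoch)
    {
        if (--b->misses_left > 0)
            return;
        b->misses_left = misses_per_epoch;

        if (b->referenced && b->pool < NR_POOLS - 1)
            b->pool++;                            /* promote referenced blocks */
        else if (!b->referenced && b->pool > 0)
            b->pool--;                            /* demote idle blocks        */
        b->referenced = 0;

        /* Keep only the highest-priority blocks uncompressed. */
        b->is_compressed = (b->pool < NR_POOLS - 1);
    }

    /* Called when the block is accessed: the data must be usable. */
    void gen_on_access(struct cblock *b)
    {
        b->referenced = 1;
        if (b->is_compressed) {
            /* decompress_block(b);   decompression hook, omitted here */
            b->is_compressed = 0;
        }
    }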

2.3 DRAM/Disk Compression

2.3.1 Hardware-based Main Memory Compression

In the mid-1990s, Franaszek began to reevaluate the idea of applying compression to squeeze more data into main memory in a way that did not slow down the computer. Franaszek and his team parallelized the Lempel-Ziv compression algorithm [120] to permit the use of multiple hardware engines operating in parallel [44]. They also developed techniques for efficient storage and fast retrieval of compressed data. Independently, Smith and Tremaine had been developing various hardware approaches to improve memory systems since the mid-1980s. In 1996 the two teams combined forces and built the Memory eXpansion Technology (MXT) [109], a solution to memory compression centered on a novel, single-chip memory controller, called Pinnacle. While the controller's main role is to compress memory data, MXT is unique in several respects [109, 40, 43, 42].

Architecture. The Pinnacle controller compresses data before writing it to the main memory. To absorb the de/compression latency, the controller uses a relatively large, 32MB L3 cache. For quick access, data is stored uncompressed in the L3 cache, which is shared by all the processors as shown in Figure 2.1. The L3 cache appears as the main memory to the processors and I/O devices, and its operation is transparent to them. The line size of the L3 cache is 1KB, the same as the unit of compression. The L3 cache is made of double data rate (DDR) SDRAM and the main memory uses standard off-the-shelf SDRAM [109].

Figure 2.1: MXT memory hierarchy. Processors with their L1 and L2 caches issue virtual addresses; the shared L3 cache holds uncompressed data and is addressed with real addresses; main memory holds compressed data and is addressed with physical addresses.

The Pinnacle controller implements the LZ1 compression algorithm [44], Franaszek's parallelized variation of the Lempel-Ziv algorithm [120]. The algorithm divides each input data block into sub-blocks and constructs a shared dictionary while compressing all sub-blocks in parallel. Four compression engines operate on 256-byte blocks (one quarter of the 1KB uncompressed data block) [109]. The compression scheme stores compressed cache lines in main memory in variable-length format, the unit of storage being a 256-byte sector. Depending on its compressibility, a (1KB) cache line may occupy 0 to 4 sectors. Each cache-line address maps to one entry in the compression translation table (CTT), which is kept uncompressed at a reserved location in the physical memory. A CTT entry is 16 bytes long and contains control flags and four physical addresses, each pointing to a sector in the physical memory.
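The description above suggests a CTT entry of roughly the following shape; this is an illustrative reconstruction (the real 16-byte entry packs the flags and four addresses more tightly, and the exact bit layout is not reproduced here).

    #include <stdint.h>

    #define SECTOR_BYTES     256                           /* unit of compressed storage */
    #define LINE_BYTES      1024                           /* unit of compression        */
    #define SECTORS_PER_LINE (LINE_BYTES / SECTOR_BYTES)   /* at most 4 per line         */

    /* One compression translation table (CTT) entry.  For the trivial line
     * format (lines that compress to less than about 120 bits), the compressed
     * data itself replaces the sector addresses. */
    struct ctt_entry {
        uint32_t flags;                            /* trivial format, sectors used, ... */
        uint32_t sector_addr[SECTORS_PER_LINE];    /* physical sector addresses         */
    };

    /* Number of 256-byte sectors occupied by a line that compressed to len bytes. */
    unsigned sectors_needed(unsigned len)
    {
        return (len + SECTOR_BYTES - 1) / SECTOR_BYTES;    /* 0 to 4 */
    }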

MXT implements two memory-saving optimizations [40]. The first optimization is the use of the trivial line format to efficiently store cache lines that compress to less than 120 bits. A cache line that compresses so well is stored entirely in the CTT entry, replacing the four address pointers. The second optimization is the possibility to use the same sector to store data from two cache lines. Two cache lines can share a sector if and only if they belong to the same virtual page. The lines that fulfill this condition are called cohort cache lines. For example, two cohorts each compressing to 100 bytes may split and share a sector, since their total size is less than the sector size of 256 bytes. Figure 2.2 illustrates possible occupancies of the physical memory that can result from compressing a (4KB) virtual page. When the page is in the L3 cache, its content is stored in four (1KB) cache lines, Line1 to Line4. When the page is to be stored in the physical memory, its cache lines are compressed. For the page in Figure 2.2, the trivial line format is used to store Line1, while Line3 and Line4 are cohort cache lines that share a sector.

Figure 2.2: Physical memory occupancy. A 1KB L3 cache line maps to a 16-byte CTT entry and to 256-byte sectors in the physical memory.

Performance Factors. The performance of a compressed memory system is heavily influenced by the size of the compression unit. Franaszek et al. [44] show that small data units may not compress well and longer units may degrade performance because of longer de/compression times. Therefore, MXT uses a compression unit size of 1KB, as it proved to be a good compromise between compression ratio and compression speed. Moreover, MXT uses an L3 cache line size equal to the unit of compression. Using different sizes for the cache line and compression unit may also work, but it introduces additional overheads, as shown by Benveniste et al. [21]. For instance, a sequence of cache misses residing in the same unit of compression could require repeated decompressions of the same main memory data. Experiments have shown that finding the right size of the storage unit (sector) in the physical memory strongly influences the system's performance [42]. Small sector sizes require large directory spaces, and large sectors result in small directory spaces but also in increased fragmentation. Thus finding an optimal sector size involves balancing the required directory space and the amount of fragmentation. Franaszek et al. [43] show that a sector size of 256 bytes works well for a large class of applications.

Operating System Support. Due to compression, in MXT the amount of addressable real memory (page frames) is variable and changes dynamically according to the compressibility of the pages currently in main memory. Abali et al. [10, 9, 41] describe the operating system changes required to actively manage the number of page frames that can be sustained at any one time. Their approach relies on a set of registers that monitor the physical memory utilization. At boot time, the maximum number of page frames is set to 2x the physical memory space. (Experiments show that an application's main memory contents can usually be compressed by a factor of 2 [10].) If an application's compressibility is better than (or equal to) expected, the operating system runs in the traditional way and performs page-outs when running low on available page frames. If compressibility is worse than expected, the operating system may have more than enough available page frames but be low on physical memory. In this case, the operating system performs page-outs, in which pages are written to disk (if modified) and then cleared to free up physical memory. As a result, the physical memory pressure is reduced and the overall compressibility is increased.

Performance. MXT's primary motivator is savings in memory cost [99]. MXT successfully doubles the amount of memory in a system at the cost of a small performance degradation: some SPECint2000 benchmarks run between 2% and 10% slower on the MXT system than on the standard system [11]. However, for several SPEC benchmarks compression improves performance by up to 8.3%. Moreover, the experiments show that with compression, the database system of an insurance company runs 25% and 66% faster than without compression on systems with 512MB and 1GB of physical memory.

2.3.2 Software-based Main Memory Compression

The key idea of a system that implements memory compression in software was first described by Wilson in the early 1990s. In his paper [115], Wilson discusses quite a few ideas related to the implementation of an advanced memory management system, such as process check-pointing, persistence, compressed virtual memory caching, and adaptive clustering of pages on disk. Many of these ideas have been subsequently elaborated in detailed papers about individual topics. Wilson notes that as CPUs become increasingly faster than disks (and compression speeds are improving), it becomes increasingly attractive to keep pages in memory in compressed form. His proposal aims to reduce paging by adding a new level into the memory hierarchy, where pages are stored in compressed form; we call this level the compressed region.

The software-based approaches are either adaptive or static. The adaptive approaches vary the size of the compressed region dynamically, while applications execute, and are either implementation- or simulation-based investigations. Static approaches use fixed sizes of the compressed region. Although the static approaches are useful to assess the benefits of memory compression, they fail to provide a solution that works for different system settings and applications.

Adaptive Approaches

Douglis. Douglis is the first to investigate memory compression in any detail [36], and many of the subsequent studies follow his design [116, 25, 32]. Following Wilson's idea, Douglis' system reserves some memory (the compressed region) that would normally be used by an application and uses this memory region to hold evicted pages in compressed form. When the compressed region becomes filled, parts of the compressed data are swapped to disk. When an evicted page is needed again (on a page fault) the system checks first whether the requested page is available in compressed form. If the page is not compressed, a disk access is initiated and the page is brought into memory. The compressed memory hierarchy and its page paths are depicted in Figure 2.3.

Douglis also notes that the effect of main memory compression is twofold. On one hand, by compressing some of the data space, the effective memory size available to applications increases and disk accesses are avoided. Therefore, because decompressing a page is faster than accessing a page on disk, memory compression may improve an application's performance. On the other hand, the compressed region takes away real memory from applications, and fewer

Figure 2.3: Memory hierarchies: a) the common memory hierarchy (caches, memory, disk) and b) the compressed memory hierarchy, in which an uncompressed and a compressed region sit between the caches and the disk, connected by swap-out and swap-in paths.

pages than before are uncompressed. Therefore, because the compressed pages must be decompressed before use, compression may also degrade an application's performance.

As different applications have different memory requirements, the compressed region size that can improve an application's performance is application specific. If the compressed region is too large, the system takes away memory that could be used to hold uncompressed pages. If the compressed region is too small, not many compressed pages are reused, and for the compressed pages that are not reused, the system adds the cost of de/compression on their way to the disk. For instance, if the working set of an application fits into the physical memory, compression should stay out of the way. Therefore, finding the size of the compressed region that can improve an application's performance is a key requirement for a compressed-memory system to work in practice.

Douglis adapts the compressed area size dynamically based on a global LRU scheme, and implements his adaptive scheme in the Sprite operating system. In the proposed system, the uncompressed pages, compressed pages, and file system pages compete for RAM on the basis of how recently their pages were accessed. Given that the uncompressed pages were more recently touched than the compressed pages, the adaptation scheme requires a bias to ensure that the compressed region has any memory at all. The main drawback of the resizing scheme is that it uses a single bias value for all applications. The bias value actually dictates the amount of memory to be compressed, and as Douglis noticed, although a single bias value works well for many applications, different applications require different values of the bias.

Douglis' system uses a software implementation of the LZRW1 compression algorithm [114] with a 4KB compression block size, which coincides with the operating system's page size. The effect of main memory compression for a couple of small applications is investigated on a DECstation 5000. For some applications main memory compression increases performance by up to 62.3% relative to an uncompressed system paging to disk. However, some applications with bad compression characteristics execute up to 36.4% slower with compression than without.

Wilson and Kaplan. In the late 1990s, Wilson et al. focus on CPU performance and reconsider the idea of compressed memory in the context of current technology trends. Their studies [116, 63], based on simulations, show that the discouraging results of former studies were primarily due to the use of machines that were quite slow computation engines by current standards. They show that for current machines compressed virtual memory offers substantial performance improvements, and its advantages increase as processors get faster.

The authors propose a technique that uses the compressed region for all disk I/O, including paging and file I/O, and resizes the compressed area while applications execute. The resizing mechanism addresses the issue of adaptation by performing an on-line cost/benefit analysis based on recent program behavior statistics. The proposed system maintains a queue of all pages ordered by their recency information. The queue contains records for pages that are in main memory and pages that are not in main memory but were evicted recently. On a page fault, the system updates statistics on whether the access would have been a hit if 10%, 23%, 37%, or 50% of the memory was holding compressed pages. Periodically, the compressed region size is changed to match the compressed region size that improves performance the most. The main drawback of this approach is that it relies on information that is difficult to gather on current systems: current systems do not maintain a list of all pages in (uncompressed) memory. Moreover, as the physical memory size increases, the size of the page queue increases as well, making this approach unsuitable for systems with large memories. Another limitation of this approach is that the system can choose only among a few sizes of the compressed region.
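One way to picture the bookkeeping is sketched below; only the candidate fractions (10%, 23%, 37%, 50%) and the recency-queue idea come from the description above, while the names, the assumed compression-ratio parameter, and the hit test are invented for illustration.

    /* On each page fault, credit the candidate compressed-region sizes for which
     * the faulting page would still have been resident.  recency_rank is the
     * page's position in the recency queue (0 = most recently used), mem_pages
     * is the number of page frames, and compression_ratio is the assumed average
     * compression ratio of in-memory data. */
    #define NR_CANDIDATES 4
    static const double candidate_fraction[NR_CANDIDATES] = { 0.10, 0.23, 0.37, 0.50 };
    static unsigned long faults_avoided[NR_CANDIDATES];

    void account_fault(unsigned long recency_rank,
                       unsigned long mem_pages,
                       double compression_ratio)
    {
        for (int i = 0; i < NR_CANDIDATES; i++) {
            double frac = candidate_fraction[i];
            /* Effective capacity in pages if this fraction of memory held
             * compressed pages: uncompressed part + compressed part * ratio. */
            double effective = mem_pages * (1.0 - frac)
                             + mem_pages * frac * compression_ratio;
            if ((double)recency_rank < effective)
                faults_avoided[i]++;  /* this size would have turned the fault into a hit */
        }
    }

Periodically comparing the benefit recorded for each candidate against the accumulated de/compression cost would then drive the choice of the next compressed region size.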

Besides the adaptive resizing scheme, Wilson et al. introduce novel compression algorithms that achieve good compression ratios [116]. The proposed algorithms exploit in-memory data representations and scan through the input data a 32-bit word at a time, looking for repetitions of the high-order 22-bit pattern of a word. To detect the repetitions, the encoder maintains a dictionary of just 16 recently-seen words.
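A stripped-down sketch of such an encoding loop, in the spirit of the WK family of algorithms; the tag values, the dictionary indexing, and the omission of the output bit-packing are assumptions made for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define DICT_WORDS 16

    enum wk_tag { TAG_ZERO, TAG_EXACT, TAG_PARTIAL, TAG_MISS };

    /* Classify each 32-bit word of a page against a 16-entry dictionary of
     * recently seen words (dict must hold DICT_WORDS entries, e.g. zeroed).
     * An exact match, or a match of the high-order 22 bits, can be encoded
     * compactly; a miss stores the full word.  Packing the tags and payloads
     * into an output bitstream is omitted. */
    void wk_classify(const uint32_t *in, size_t nwords,
                     enum wk_tag *tags, uint32_t *dict)
    {
        for (size_t i = 0; i < nwords; i++) {
            uint32_t w = in[i];
            unsigned slot = (w >> 10) % DICT_WORDS;      /* index on high bits */

            if (w == 0) {
                tags[i] = TAG_ZERO;                      /* tag only                  */
            } else if (dict[slot] == w) {
                tags[i] = TAG_EXACT;                     /* tag + dictionary index    */
            } else if ((dict[slot] >> 10) == (w >> 10)) {
                tags[i] = TAG_PARTIAL;                   /* tag + index + low 10 bits */
                dict[slot] = w;
            } else {
                tags[i] = TAG_MISS;                      /* tag + full 32-bit word    */
                dict[slot] = w;
            }
        }
    }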

To assess the performance of their approach the authors simulate the execution of six UNIX programs on three machines: a Pentium Pro at 180 MHz, UltraSPARC-10 300 MHz, and UltraSPARC-2 168 MHz. The simulation results show that main memory compression often eliminates up to 80% of the paging cost, with an average savings of approximately 40%. As expected, the faster SPARC 300 MHz machine has a lower de/compression overhead and thus performs better overall. Moreover, the simulations also show that the performance difference of different compression algorithms is about 15%.

Castro et al. A different resizing scheme is proposed by Castro et al. [32], who adapt the compressed region size dynamically at every access to the compressed region, depending on whether the page would be uncompressed or on disk if compression was not used. If two consecutive accesses to the compressed region are to pages that would be uncompressed in memory (if compression was turned off), the compressed region is shrunk. If the access is to a page that is in memory only due to compression, the compressed region is grown. However, the authors do not explain how the values that trigger the resizing process were chosen. The main drawback of the proposed scheme is that it analyzes every access to the compressed region, and resizes the compressed region quite often. Moreover, the resizing of the compressed region is done by adding and deleting cells of two pages. Therefore, although the approach may work well for small applications with few data accesses, it may not be feasible for large applications with frequent data accesses. The authors implemented their scheme in the Linux operating system and report performance improvements of up to 171% for small applications that run on a Pentium III PC. The proposed system supports two compression algorithms, WKdm [116] and LZO [120].

Static Approaches

Russinovich and Cogswell. A simple mathematical model to predict the performance of a compressed memory system is presented by Russinovich and Cogswell [92]. The authors plug numbers obtained from measurements on PCs running Windows 95 into their model and show that RAM compression does not pay off on the industry standard Winstone benchmarks. Nevertheless, the results do not seem to accurately reflect the trade-offs involved. The authors report very low compression overheads and a very high overhead for handling a page fault, which is possibly a result of using a slow processor and the Windows 95 operating system.

Kjelso et al. Kjelso et al. use simulations to demonstrate the efficacy of main memory compression. The system they propose is similar to that proposed by Douglis, except that it does not obey the normal inclusion property: a page is either in the uncompressed region or in the compressed region but not in both at the same time. To establish the feasibility of the general idea, the authors investigate how well and how fast memory data can be compressed [49, 68, 66, 67]. Moreover, the authors develop a performance model [69] to quantify the expected performance impact of a software- and hardware-based compression system. The model is based on the average memory-access time (AAT) commonly used for investigating memory hierarchy performance, and has four main components: the memory hierarchy model, memory hierarchy characteristics, data compression characteristics and workload behavior (namely, its miss rate information). The first component describes the functional behavior of the memory hierarchy, while the remaining three components define the temporal behavior of the memory system and application workload. For a number of DEC-WRL workloads, the analytical results show that software-based compression can improve system performance by up to a factor of 2 and hardware-based compression improves performance by up to an order of magnitude. The data compression characteristics used in these studies are based on performance estimations for two compression algorithms developed by the authors, X-Match and X-RL [66]. Although the authors acknowledge that some sizes of the compressed region can damage performance, they do not investigate the amount of data that should be compressed for memory compression to show performance improvements.

Cervera et al. A compressed memory architecture that differs in some points from Douglis' design is presented by Cervera et al. [25]. In the proposed system, before moving a page from the compressed region to disk, the system tries to merge several compressed pages into a disk block and writes the block to disk (no compressed page is split among several blocks). When a page on disk is swapped in, a whole disk block is read into memory. Based on the observation that the extra pages that are brought into memory on a page fault are most of the time not used by the application, the authors re-engineer their design. In the new design, the system uses two different buffers/paths, one for swapping in and one for swapping out pages. The buffer to cache the swap-in data holds uncompressed pages, while the swap-out buffer stores compressed pages. To assess the benefits of this design, the authors implemented it in the Linux operating system and measured its performance impact for a couple of benchmarks. The machine used is a Pentium II machine at 350MHz with 64MB physical memory, of which only 1MB is dedicated to the compressed region. The system uses a software implementation of the Ziv-Lempel compression algorithm, LZO [120], with 4KB compression block size, which coincides with the operating system's page size. The measurements show that compression increases system performance up to a factor of 2 relative to an uncompressed swap space system.

Roy et al. Roy et al. [90] implement a compressed memory system in the form of a device driver for the Linux operating system. Because the authors replace one of the RAID personalities (e.g., raid5), their compressed-memory implementation is a stand-alone package that does not require changes to the rest of the kernel. The effectiveness of the proposed compression scheme is evaluated for several SPEC2000 programs and for a SOR computation running on a Pentium III machine at 733MHz. The compression algorithm used is the WK4x4 algorithm proposed by Wilson et al. [116]. For the selected tests, the benefits of compression range from 5% up to 250%. However, the main drawback of this approach is that the compressed region size cannot be changed while applications execute.

Related Approach

RAMDoubler. An approach related to memory compression is RAM Doubler, a technology that expands the memory size for the Mac operating system [4]. This technique locates small chunks of RAM that applications are not actively using and makes that memory available to other applications. Moreover, RAM Doubler finds RAM that is unlikely to be accessed again, and compresses it. Finally, if all else fails, the system swaps seldom accessed data to disk. Although RAM Doubler allows the user to open more applications together, the user cannot run applications with memory footprints that exceed the physical memory size.

2.4 Summary

This chapter discusses the existing approaches to memory compression. Although the use of compression at both the L1 and L2 cache levels is possible, L1 and L2 cache compression is not a viable approach as it requires many changes to these hardware components. Previous studies show that compression at the main memory level is an attractive approach and can improve system performance considerably. The hardware-based approaches to compressing main memory seem to have limited success and acceptance as they require many changes to both hardware and software systems. Software-based main memory compression is more attractive than hardware-based compression, as it can be used on existing commodity processors and memories. A further advantage of software-based memory compression is that it allows the system to turn compression off if compression is not beneficial.

The discussion of the related research on software-based memory compression reveals that a key requirement of a practical system is the ability to adapt to the different memory requirements of different applications. Although some of the approaches described by previous work have tried to address the adaptivity issue, none of them offers a practical solution that works well for large applications. Based on this observation, we conclude that an adaptive approach to main memory compression that takes its decisions transparently to the user and detects the applications for which compression is not beneficial can make memory compression very attractive to end-users.

3 Design and Implementation

This chapter presents the static structure of our compressed-memory system. After presenting the design and data structures employed, Section 3.1 briefly describes how the data structures collaborate to carry out the tasks of data compression, storage and retrieval. Section 3.2 presents a prototype implementation of the proposed compressed-memory system and its integration in the Linux operating system.

3.1 Design

The compressed-memory system presented here follows the design proposed by Douglis [36] in that it divides the main memory into an uncompressed region that stores uncompressed pages and a compressed region that holds pages in compressed form.

On a common system, when the amount of physical memory is less than what an application requires, the OS swaps out some pages to make space for other pages the application needs. On a compressed-memory system, when an application's working set exceeds the uncompressed region, the OS compresses the pages that haven't been accessed for the longest time and stores them in the compressed region. When even the compressed region becomes filled, the pages that have been compressed for the longest time are decompressed and stored on disk. On a page fault, the OS checks for the faulted page first in the compressed region and then on disk, servicing the page from the compressed region if it is there and saving the cost of a disk access.

The benefits of main memory compression depend on the compressed region size and the characteristics of the application. Due to compression, accesses to compressed pages take longer than accesses to uncompressed pages. Therefore, compressing too much data decreases an application's performance. On the other hand, if the compressed region is not large enough to hold a sufficient number of compressed pages, some compressed pages must be decompressed and evicted to disk, decreasing the system's performance.

Because different applications have different memory requirements, the compressed region size that can improve an application's performance is application dependent. Therefore, the main requirement for a compressed-memory system to work in practice is the ability of the system to resize the compressed region easily, while applications execute. The key idea of our approach is to organize the compressed region in memory zones of the same size, as illustrated in Figure 3.1. Shrinking and growing the compressed region is done by adding and deleting zones. Because the resizing is based on these two simple operations, our design allows for easy resizing of the compressed region.

Although it is possible to transfer variable-size compressed pages to and from disk, implementing variable-size I/O transfers requires many changes to the operating system [36]. Moreover, on a page fault, if the faulted page is on disk in compressed form, the system will have to wait for the page to be decompressed before being able to use it. Therefore, to lower the latency of future accesses and to employ the operating system's swapping services, the proposed system decompresses the evicted pages before sending them to disk.

Figure 3.1: Compressed Memory Hierarchy (caches, uncompressed and compressed regions in main memory, disk).

3.1.1 Global Metadata

As shown in Figure 3.2, the proposed system relies on several global data structures for keeping track of all pages stored in the compressed region. First of all, all the zones that form the compressed region are linked in a chain called zone chain. As the size of the compressed region grows and shrinks, zones are added and removed from the zone chain.

The system uses an indexing structure for keeping track of all pages stored in the compressed region. For efficiency reasons, we have chosen a hash table to relate the swap handle of a compressed page to its data in the compressed region. When a page is stored in the compressed region, a hash function computes the index of the page in the hash table based on the value of its swap handle. All pages that have the same index in the hash table are linked in a chain, and the first page in the chain is identified by the value stored in the hash table at the computed index. Therefore, the number of entries in the hash table does not limit the number of pages that can be stored in the compressed region.

All pages in the compressed region are linked in a chain in the order of their insertion in the compressed region. The chain is called the LRU list, and the LRU first and LRU last pointers identify the first and last pages in the list. When the compressed region becomes full, the page referenced by LRU first is decompressed, sent to disk, LRU first is set to point to the next page in the list, and the page is deleted from the compressed region. When a page is stored in the compressed region, it is inserted after the element referenced by LRU last, and LRU last is set to point to the newly inserted page. In other words, the LRU list is a doubly-linked list that stores the recency information of all pages in the compressed region.
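To make the description concrete, the following C sketch shows one possible declaration of the global structures. The type names, field names, and the table size are illustrative assumptions for this chapter, not the declarations used in the actual implementation.

    /* Sketch of the global metadata; names and sizes are illustrative. */
    struct zone_t;                 /* per-zone metadata (Section 3.1.2) */
    struct cpage_t;                /* one comp page table entry         */

    /* Zone chain: all zones that form the compressed region.          */
    struct zone_t *zone_chain;

    /* Hash table: maps a swap handle to a comp page table entry.
     * Collisions are chained through the entries themselves, so the
     * table size does not limit the number of compressed pages.       */
    #define HASH_ENTRIES 4096      /* illustrative size                 */
    struct cpage_t *hash_table[HASH_ENTRIES];

    static unsigned hash_func(unsigned long swap_handle)
    {
        return (unsigned)(swap_handle % HASH_ENTRIES);
    }

    /* LRU list over all compressed pages, ordered by insertion time:
     * LRU_first is the oldest page (next victim), LRU_last the newest. */
    struct cpage_t *LRU_first;
    struct cpage_t *LRU_last;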

Figure 3.2: Bird's-eye view of the compressed-memory system design.

3.1.2 Local Metadata

A zone has physical memory to store compressed data and structures to manage its physical memory. Both local metadata and memory to store compressed data are allocated and deallocated when a zone is added or deleted, respectively. One of the local data structures of a zone is called zone structure and stores the memory usage information of that zone, as illustrated in Figure 3.3.

To keep fragmentation as low as possible, for each zone, the memory to store compressed data is divided in blocks of fixed size. Each zone uses the block table for identifying the starting address and for storing the usage information of all its memory blocks. The number of entries in the block table is equal to the number of blocks in a zone. Each entry in this table has a field called address that stores the address of a block. All free blocks are linked in a chain by their field next in the block table. The first block in the chain is identified by the free block field of the zone structure. The last block in the chain of free blocks has the value of its field next set to NULL to indicate the end of the list. A compressed page is stored in a set of blocks that are linked in a chain by their field next in the block table. The last block in the chain has the value of its field next set to NULL to indicate the end of the compressed page. Moreover, the number of blocks in a zone that store compressed data is stored in the used field of the zone structure.

A compressed page is stored within a single zone, which means that a page is not scattered over multiple zones. Each zone uses a table, called comp page table, for keeping track of all pages it stores. Because each entry in this table identifies a page, the number of entries in the table gives the maximum number of pages that can be stored in a zone. As Figure 3.3 shows, each entry in comp page table has several fields which will be described in the following paragraphs.

Figure 3.3: Global and local data structures of the compressed region.

All free entries in comp page table are linked in a chain stored in their field next. The first entry in the chain of free entries is identified by the free entry field of the zone structure. The last entry in the chain of free entries has the value of its field next set to NULL. All entries that have the same index in the hash table are linked in a chain stored in their field next. The first page in the chain is identified by the value in the hash table stored at the computed index, and the last page in the chain has the value of its field next set to NULL, to indicate the end of the chain.

As described earlier, a compressed page is stored in a set of blocks that are linked in a chain by their field next in the block table. When a page is stored in a zone, the first field of its comp page table entry identifies the first block that stores the compressed page.

As illustrated in Figure 3.3, all pages in the compressed region are linked by two fields in comp page table, called LRU prev and LRU next. This chain of pages is in fact a doubly-linked list called the LRU list. Because pages are added at the end of the list, the most recently used pages are at the end. The two global pointers that identify the beginning and the end of the list are the LRU first and LRU last pointers.

The other fields of a comp page table entry are zone, handle, and size, which store, for each compressed page, the zone id, the swap handle value, and the compressed size.
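The local metadata can be sketched in the same spirit. Again, the type names, field names, and the sizes below are assumptions made for illustration; in particular, the sketch uses the block index -1 to play the role of the NULL link that terminates a chain in the block table.

    #include <stddef.h>

    #define BLOCK_SIZE      256    /* illustrative block size in bytes   */
    #define BLOCKS_PER_ZONE 1024   /* illustrative blocks per zone       */
    #define CPAGES_PER_ZONE 512    /* max pages a zone can hold          */

    /* One block table entry: the block's address plus a next link that
     * chains either free blocks or the blocks of one compressed page.  */
    struct block_t {
        void *address;             /* start address of the memory block */
        int   next;                /* next block in chain, -1 ends it    */
    };

    /* One comp page table entry: describes one compressed page.        */
    struct cpage_t {
        int             first;     /* first block holding the page      */
        struct cpage_t *next;      /* free-list or hash-collision link  */
        struct cpage_t *LRU_prev;  /* neighbours in the global LRU list */
        struct cpage_t *LRU_next;
        int             zone;      /* id of the zone storing the page   */
        unsigned long   handle;    /* swap handle identifying the page  */
        size_t          size;      /* compressed size in bytes          */
    };

    /* The zone structure: memory-usage bookkeeping for one zone.       */
    struct zone_t {
        struct zone_t  *next_zone;  /* link in the zone chain            */
        int             id;         /* zone identifier                   */
        int             used;       /* number of blocks holding data     */
        int             free_block; /* head of the free block list       */
        struct cpage_t *free_entry; /* head of the free entry list       */
        struct block_t  block_table[BLOCKS_PER_ZONE];
        struct cpage_t  comp_page_table[CPAGES_PER_ZONE];
        unsigned char  *data;       /* BLOCKS_PER_ZONE * BLOCK_SIZE bytes */
    };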

The following subsections elaborate on how pages are stored and found in the compressed region.

3.1.3 Page Insert

When an application's working set exceeds the uncompressed region size, the system compresses the pages that have not been accessed for the longest time and stores them in the compressed memory. Figure 3.4 is an example flow diagram depicting the steps of inserting a page in the compressed memory. The first step (Step 1) when a page is evicted from the uncompressed region is to compress the page and store its data in a compression buffer. The compression buffer is a storage area that stores a page in compressed form, and its size is equal to the size of a page. A page is compressed using one of the existing algorithms, such as WKdm, WK4x4, LZRW1, or LZO [116]. After compressing the page, the system calls a hash function to compute the index of the new compressed page in the hash table.

Next (Step 2), the system searches for a zone that has enough blocks to store the new page. Because the system selects zones from the beginning of the zone chain, the allocation is basically the first-fit algorithm. In the next step (Step 3), the system checks the number of free blocks in the selected zone. The used field of the selected zone structure keeps track of the number of blocks that store compressed data. Based on the value of this field and the number of blocks in a zone, the system computes the number of free blocks in the selected zone. If the number of free blocks is insufficient to store the new compressed page, another zone is selected and Step 2 is repeated. If the zone has enough free blocks, the system checks whether there is a free entry in the corresponding comp page table (Step 4). A zone has at least one free entry in its comp page table if the value of the free entry field of its zone structure is valid. If there is a free entry, the search stops and the zone will store the new page. If there is no free entry, Step 2 is repeated and the system selects the next zone in the zone chain.

Figure 3.4: Flow Diagram: Insert Page.

A process monitors the amount of free memory in the compressed region. When the amount of free memory falls below a critical threshold, the process uses the LRU first pointer to select the page that has been in the compressed region for the longest time, decompresses it and stores it on disk. This operation is repeated until the amount of free memory is above the critical threshold. In other words, this process makes sure that there is always some free memory in the compressed region. Due to this process, when a page is to be inserted in the compressed region, the system always finds a zone that has enough free memory to store the new page.

After a zone to store the new page is found, the system selects as many blocks as needed to store the compressed data (Step 5). At this step, the system traverses the list of free blocks (whose beginning is identified by the free block field of the zone structure) and selects the necessary number of free blocks. The value of the free block field is set to point to the block following the last block selected. Because all free blocks are linked in a chain by their field next in the block table, the selected blocks are linked by the same field next. Therefore, all the blocks that will store the new page are linked in a chain, and the next field of the last block is set to NULL to indicate the end of the page. The compressed page is now copied into the selected blocks. The used field of the zone structure is incremented by the number of blocks used to store the new page.

Next (Step 6), the system selects an entry in the comp page table to store information about the new page. All free entries are linked by their field next, and the beginning of the list is identified by the free entry field of the zone structure. After the first free entry in the list is selected, the free entry field is updated to point to the second entry in the list. The first field of the selected entry is set to point to the first block that stores the compressed page. The page swap handle, the zone identifier and the size of the compressed page are stored in the handle, zone and size fields of the selected entry. The page is inserted at the end of the LRU list by setting the values of the LRU next and LRU prev fields, and the two global pointers LRU first and LRU last are updated if necessary.

The last step (Step 7) is to update the hash table to indicate that the new page is now stored in the compressed region. If the hash table entry that corresponds to the index computed at Step 1 is invalid, its value is set to point to the selected entry in comp page table. If the hash table entry stores a valid address, it indicates the first page mapped to that entry. All entries mapped to the same index (hash value) are linked in a chain by their field next in comp page table. The new page is inserted at the beginning of the chain and the hash table entry is updated.
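The insert path can be summarized in code as follows. The sketch reuses the hypothetical declarations from the previous sketches; compress_page() is named in Section 3.2.2, but its signature, like the rest of the buffer handling, is assumed here.

    #include <string.h>

    size_t compress_page(const void *page, unsigned char *out);   /* assumed */
    static unsigned char comp_buf[4096];    /* page-sized compression buffer */

    int cpage_insert(unsigned long handle, const void *page)
    {
        size_t csize = compress_page(page, comp_buf);             /* Step 1 */
        unsigned idx = hash_func(handle);
        int nblocks = (int)((csize + BLOCK_SIZE - 1) / BLOCK_SIZE);
        struct zone_t *z;

        /* Steps 2-4: first-fit search for a zone with enough free blocks
         * and at least one free comp page table entry.                     */
        for (z = zone_chain; z != NULL; z = z->next_zone)
            if (BLOCKS_PER_ZONE - z->used >= nblocks && z->free_entry != NULL)
                break;
        if (z == NULL)
            return -1;     /* in practice kcmswapd keeps enough memory free */

        /* Step 5: take nblocks blocks off the zone's free block list and
         * copy the compressed data into them; -1 ends the page's chain.    */
        int first = z->free_block, b = first;
        size_t copied = 0;
        for (int i = 0; i < nblocks; i++) {
            size_t chunk = csize - copied > BLOCK_SIZE ? BLOCK_SIZE
                                                       : csize - copied;
            memcpy(z->block_table[b].address, comp_buf + copied, chunk);
            copied += chunk;
            if (i == nblocks - 1) {
                z->free_block = z->block_table[b].next;
                z->block_table[b].next = -1;
            } else {
                b = z->block_table[b].next;
            }
        }
        z->used += nblocks;

        /* Step 6: take a free comp page table entry and populate it.       */
        struct cpage_t *e = z->free_entry;
        z->free_entry = e->next;
        e->first = first; e->handle = handle; e->size = csize; e->zone = z->id;
        e->LRU_prev = LRU_last; e->LRU_next = NULL;
        if (LRU_last) LRU_last->LRU_next = e; else LRU_first = e;
        LRU_last = e;

        /* Step 7: link the new entry at the head of its hash chain.        */
        e->next = hash_table[idx];
        hash_table[idx] = e;
        return 0;
    }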

3.1.4 Page Delete

On a page fault, the system checks for the faulted page in the compressed region. Based on the swap handle value, the hash function computes the index of the faulted page in the hash table. The entry in the hash table that corresponds to the computed index identifies an entry in a zone's comp page table. The system checks whether the handle field of the selected entry is equal to that of the faulted page. If the two values are equal, it means that the faulted page is in the compressed region. If the values are not equal, the system traverses the list of pages that have the same index in the hash table using the next field in comp page table. If no entry matches the faulted page, it means that the page is not in the compressed region and is brought from the disk into the uncompressed region.

If there is an entry in comp page table that matches the faulted page, that entry is selected. The system uses the first field of the selected entry to get the first block that stores the compressed data. Using the block table, the system follows the next field of the first block, selects the second block, and so on, until the value of the next field is NULL. All selected blocks are then copied into a decompression buffer. The decompression buffer is a storage area that stores a page in compressed form, and its size is equal to the page size. The data in the decompression buffer is decompressed and the uncompressed page is returned to the faulted process. The blocks that stored the faulted page in compressed form are freed and are added to the beginning of the list of free blocks as follows. The next field of the last block in the chain is set to point to the first free block, which is identified by the free block field of the zone structure. The value of the free block field is set to point to the first block that stored the compressed page.

After the page is decompressed and returned to the faulted process, its entry in the hash table is also deleted. In other words, the page is deleted from the chain of entries that have the same index in the hash table. If the selected entry is the only page in the chain, its corresponding entry in the hash table is invalidated. Otherwise, the next field of the previous entry in the chain is set to point to the entry following the entry to be deleted. Next, the entry is freed and is added to the list of free entries in comp page table as follows. Its next field in comp page table is set to point to the first entry in the free chain, which is identified by the free entry field of the zone structure. The free entry field is set to point to the entry to be freed. Finally, the entry is deleted from the LRU list, and the LRU first and LRU last are updated if necessary.
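A sketch of the corresponding lookup on a page fault is shown below, again reusing the hypothetical declarations above; zone_of() and decompress_page() are assumed helpers, and freeing the blocks and fixing the LRU list are only hinted at.

    #include <string.h>

    struct zone_t *zone_of(int zone_id);                          /* assumed */
    void decompress_page(const unsigned char *in, size_t csize, void *page);
    static unsigned char decomp_buf[4096];   /* page-sized staging buffer    */

    /* Returns 0 and fills 'page' if the faulted page is in the compressed
     * region; returns -1 if it has to be read from disk instead.           */
    int cpage_lookup(unsigned long handle, void *page)
    {
        unsigned idx = hash_func(handle);
        struct cpage_t *e = hash_table[idx], *prev = NULL;

        /* Walk the hash collision chain until the swap handle matches.     */
        while (e != NULL && e->handle != handle) {
            prev = e;
            e = e->next;
        }
        if (e == NULL)
            return -1;                             /* the page is on disk   */

        /* Copy the page's blocks (chained through 'next') into the staging
         * buffer and decompress them.                                      */
        struct zone_t *z = zone_of(e->zone);
        size_t copied = 0;
        for (int b = e->first; b != -1; b = z->block_table[b].next) {
            memcpy(decomp_buf + copied, z->block_table[b].address, BLOCK_SIZE);
            copied += BLOCK_SIZE;
        }
        decompress_page(decomp_buf, e->size, page);

        /* Unlink the entry from its hash chain and return it to the zone's
         * free entry list; freeing the blocks and unlinking the entry from
         * the LRU list are analogous and omitted here.                     */
        if (prev) prev->next = e->next; else hash_table[idx] = e->next;
        e->next = z->free_entry;
        z->free_entry = e;
        return 0;
    }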

3.1.5 Zone Add

Figure 3.5 is an example flow diagram depicting the steps of adding a zone to the compressed memory. The first step (Step 1) when a zone is allocated is to check whether the free space in the uncompressed memory is enough to store the new zone. If there is enough free space, the system allocates memory for the zone's local structures and memory to store compressed data, and the zone is added at the end of the zone chain (Step 2). As described in Subsection 3.1.1, all the zones that form the compressed region are linked in a chain stored in the zone chain.

If the free space in the uncompressed region is not enough to store a new zone, the system will swap out some pages until the free space is enough to store the new zone. First, the system selects the page that has been in the uncompressed region for the longest time (Step 3). Then, the amount of free space in the compressed region is checked (Step 4). If there is enough free space in the compressed region to store a new page, the selected uncompressed page is stored in the compressed region (Step 5) and Step 1 is repeated.

If the free space in the compressed region is insufficient to store a new page, the system will swap out some compressed pages until there is enough free space for the new page. To free some space (Step 6), the system uses the LRU first pointer to select the page that has been in the compressed region for the longest time, saves it on disk, and Step 4 is repeated.

Figure 3.5: Flow Diagram: Add Zone.
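The overall policy can be sketched as follows; every helper is a stand-in for module or kernel functionality described in the text, not an actual function of the implementation.

    /* Stand-in helpers for the add-zone policy sketched in Figure 3.5.     */
    #define ZONE_FOOTPRINT (sizeof(struct zone_t) + BLOCKS_PER_ZONE * BLOCK_SIZE)
    size_t free_uncompressed_space(void);
    struct zone_t *zone_new(void);
    void append_to_zone_chain(struct zone_t *z);
    unsigned long select_lru_uncompressed_page(void);
    int compressed_region_has_room(void);
    void evict_oldest_compressed_page_to_disk(void);     /* uses LRU_first   */
    void compress_and_store(unsigned long victim);

    void zone_add(void)
    {
        for (;;) {
            /* Step 1: is there room for a new zone in the uncompressed
             * region? If so, allocate it and link it into the zone chain.  */
            if (free_uncompressed_space() >= ZONE_FOOTPRINT) {
                append_to_zone_chain(zone_new());                /* Step 2   */
                return;
            }
            /* Step 3: pick the least recently used uncompressed page.      */
            unsigned long victim = select_lru_uncompressed_page();

            /* Steps 4-6: make sure the compressed region has room, compress
             * the victim into it, then retry the space check above.        */
            while (!compressed_region_has_room())
                evict_oldest_compressed_page_to_disk();
            compress_and_store(victim);
        }
    }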

3.1.6 Zone Delete

Figure 3.6 is an example flow diagram depicting the steps of removing a zone from the compressed memory. The first step (Step 1) is to select the zone with the smallest number of blocks used. Then (Step 2) the system checks how many blocks of the selected zone are used to store compressed data. If all the blocks are free, the memory that stores the zone's metadata and its compressed data is deallocated (Step 3).

If some blocks are used to store compressed data, all pages that are stored in the region to be deleted will be moved to other zones. After a page from the zone to be deleted is selected (Step 4), the system searches for a zone in the zone chain that has enough space to store the selected page. Because the reallocation uses the first-fit algorithm, the system selects a zone from the beginning of the zone chain (Step 5). If the amount of free space in the selected zone is enough to store the selected page (Step 6), the page is moved to that zone and Step 2 is repeated. If the number of free blocks in the selected zone is insufficient to store the selected page, Step 5 is repeated and another zone is selected.

Figure 3.6: Flow Diagram: Remove Zone.

There are two special cases that can appear when a zone is removed. To keep the flow diagram as simple as possible, these cases were not included in Figure 3.6. The first special case is when the zone to be removed is the only one in the compressed region. In this case, before the zone's memory is deallocated, the compressed pages within that zone are decompressed and saved on disk. The second case is when the system tries to move pages to other zones and does not find enough free space in any of the existing zones. In this case, for efficiency reasons, the system saves all pages within the zone to be deleted on disk and deallocates the zone's memory.
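A corresponding sketch of the remove policy, including the fallback to disk, is shown below; as before, the helpers are stand-ins, and they are assumed to keep the used counter, the tables, and the LRU list up to date.

    /* Stand-in helpers for the remove-zone policy of Figure 3.6.           */
    struct zone_t *zone_with_fewest_used_blocks(void);           /* Step 1   */
    struct cpage_t *any_page_in_zone(struct zone_t *z);          /* Step 4   */
    struct zone_t *first_fit_zone_with_room(struct zone_t *skip, size_t size);
    void move_page(struct cpage_t *p, struct zone_t *to);        /* Steps 5-6 */
    void decompress_page_to_disk(struct cpage_t *p);
    void remove_from_zone_chain(struct zone_t *z);
    void zone_free(struct zone_t *z);                            /* Step 3   */

    void zone_remove(void)
    {
        struct zone_t *z1 = zone_with_fewest_used_blocks();

        while (z1->used > 0) {                                   /* Step 2   */
            struct cpage_t *p = any_page_in_zone(z1);
            struct zone_t *z2 = first_fit_zone_with_room(z1, p->size);
            if (z2 != NULL)
                move_page(p, z2);
            else
                /* Covers both special cases: no other zone has room (or no
                 * other zone exists), so the remaining pages go to disk.
                 * The implementation writes them out in one batch; here
                 * they are handled one page at a time for simplicity.      */
                decompress_page_to_disk(p);
        }
        remove_from_zone_chain(z1);
        zone_free(z1);
    }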

In the proposed system, a compressed page is stored within a single zone, and is not scattered over multiple zones. This rule simplifies the page insert and delete operations in that the system does not have to keep track of all zones that store a page's data. Moreover, when a zone is deleted, the system does not have to deal with pages that are partially stored in other zones. Therefore, by storing all the blocks of a compressed page within a single zone, we avoid the scatter/gather problem encountered by Douglis [36].

3.2 Implementation and Integration in Linux

This section presents a prototype implementation of the proposed compressed-memory system and its integration in Linux. Linux is a variant of the UNIX system [89, 18] and has been chosen as an implementation platform because it is now a viable operating system that powers everything from embedded devices to huge enterprise servers [6]. The system was first built and integrated in the 2.4.12 Linux kernel and worked fine for IA32 architectures. Later on, the whole system was ported to Yellow Dog Linux 3.0.1 (YDL) [8], which is built on the 2.6.3 Linux kernel and provides 64-bit support for the Apple G5 machines. The prototype now works on both 32-bit and 64-bit architectures, and we installed it on a Pentium 4 PC and on a G5 machine. Although the discussion of our solution is necessarily OS-specific, the issues are general.

Section 3.2.1 gives a brief overview of the Linux internals (version 2.4) as far as they are needed to understand Section 3.2.2, which describes the implementation and integration details.

3.2.1 Linux Internals

Page Cache

All pages that are backed by regular files, block devices or swap area are stored in a page cache [50]. Pages exist in this cache for two reasons. The first is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table, and this table is always searched before the disk is accessed. The second reason is to help the page replacement algorithm to quickly select which pages to discard or page-out. The cache collectively consists of two lists defined in mm/page_alloc.c called active_list and inactive_list which broadly speaking store the "hot" and "cold" pages respectively.

Different page types are added to the page cache at different moments. Pages which are read from a file or block device are added to the page cache by calling __add_to_page_cache() during generic_file_read(). All file-systems use the high level function generic_file_read() so that once a file is read, its pages are in the page cache. On the other hand, anonymous pages are added to the page cache the first time they are about to be swapped out. When first allocated, pages are placed in the inactive_list. Once accessed, these pages are marked as referenced using mark_page_accessed() and will be eventually moved to the active_list.

Page-out Daemon (kswapd)

At system start, a kernel thread called kswapd is started from kswapd_init() which continuously executes the kswapd() function in mm/vmscan.c. This daemon is responsible for reclaiming pages when memory is running low. kswapd is woken by the physical page allocator when the number of free pages in a zone equals pages_low.

The call graph of the swap daemon is shown in Figure 3.7. When the swap daemon is woken up, it performs the following:

Figure 3.7: Call Graph: kswapd()

• Calls kswapd_can_sleep() which cycles through all zones checking the need_balance field in struct zone_t. If any of them are set, it cannot sleep;

• If it cannot sleep, it removes itself from the kswapd_wait wait queue;

• kswapd_balance() is called which cycles through all zones. It will free pages in a zone with try_to_free_pages_zone() if need_balance is set and will keep freeing pages until the pages_high watermark is reached;

• The task queue for tq_disk is run so that queued pages will be written out;

• kswapd is added back to the kswapd_wait queue and it goes back to the first step.

It is this daemon that performs most of the tasks needed to maintain the page cache correctly, shrink slab caches and swap out processes if necessary. kswapd keeps freeing pages until the pages_high watermark is reached. Under extreme memory pressure, processes will do the work of kswapd synchronously by calling balance_classzone() which calls try_to_free_pages_zone(). The physical page allocator will also call try_to_free_pages_zone() when the zone it is allocating from is under heavy pressure.

Refilling inactive_list

When caches are being shrunk, pages are moved from the active_list to the inactive_list by the function refill_inactive(). Linux resembles a Clock algorithm and takes pages from the end of the active_list and analyzes them. If the PG_referenced flag is set, it is cleared and the page is put back at the top of the active_list as it has been recently used and is still "hot". The move-to-front heuristic means that the lists behave in an LRU-like manner. (However, the lists are not strictly maintained in LRU order.) If the flag is cleared, the page is moved to the inactive_list and the PG_referenced flag set so that the page will be quickly promoted to the active_list if necessary. The number of pages to move is calculated in shrink_caches() such that the active_list is about two thirds the size of the total page cache. Figure 3.8 illustrates how the two lists are structured.

Figure 3.8: Page Cache LRU List

The move-to-front heuristic means that the lists behave in an LRU-like manner but there are too many differences between the Linux replacement policy and LRU to consider it a stack algorithm. Even if we ignore the problem of analyzing multi-programmed systems and the fact that the memory size for each process is not fixed, the policy does not satisfy the inclusion property as the location of pages in the lists depends heavily upon the size of the lists as opposed to the time of last reference. Moreover, the list is not priority ordered as that would require list updates with every reference. As a final nail in the stack algorithm coffin, the lists are almost ignored when paging out from processes as page-out decisions are related to their location in the virtual address space of the process rather than the location within the page lists. However, the algorithm does exhibit LRU-like behavior and it has been shown to perform well in practice.

Reclaiming Pages from the Page Cache

The function shrink_cache() is the part of the replacement algorithm which takes pages from the inactive_list and decides how to swap them out. shrink_cache() is a very large for-loop which frees pages from the end of the inactive_list until enough pages are freed or the inactive_list is empty. For each page in the inactive_list, shrink_cache() makes different decisions on what to do, as shown in Figure 3.9.

Figure 3.9: Reclaiming pages from the page cache.

Figure 3.10: Call Graph: shrink_cache()

If the page is mapped by a process (the page is in state S1), a max_mapped counter is decremented. This counter determines how many process pages are allowed to exist in the page cache until some pages will be swapped out. If the page is anonymous and belongs to a process, it is first unlocked and then max_mapped is decremented. When the max_mapped counter reaches 0, the swap_out() function is called to start swapping out process pages. swap_out() walks the process page tables until it finds enough pages to be freed. All process mapped pages are examined regardless of where they are in the lists or when they were last referenced, but pages which are part of the active_list or have been recently referenced will be skipped over. swap_out() walks the process page tables, unmaps pages, and calls try_to_swap_out() on these pages and PTEs. try_to_swap_out() removes the page from the process PTE and checks whether the page is dirty. If the page is dirty, it is added to the swap cache, and the page is now in state S2. If the page is clean, it will be in state S4.

If the page is dirty and is unmapped by all processes (the page is in state S2), the page is locked, the PG_launder bit is set, and the writepage() function is called to write the page to disk. The page is now in state S3.

If the page is locked and the PG_launder bit is set (the page is in state S3), the system waits for the IO to complete and decrements the reference count. The page is now in state S4.

The last case handles pages that have no references to them and are either clean or have been saved in the swap area (are in state S4). If the page is in the swap cache, it is deleted from there (by this time the IO operation is complete). The page is then deleted from the page cache and freed.

Swap Management

Each active swap area, be it a file or partition, is described by a struct swap_info_struct, and all structures in a running system are stored in a statically declared array called swap_info. When a page is swapped out, Linux uses the corresponding PTE to store enough information to locate the page on disk: a page's location on disk is given by an index into the swap_info array and an offset within the swap_map.

Figure 3.11: Call Graph: swap_writepage()

The pages on the swap cache are those pages in the page cache that have a slot reserved in the swap area. The swap cache is purely conceptual as there is no simple way to quickly traverse all pages on it. Different page types are added to the swap cache at different moments. For instance, pages that belong to a shared memory region are added to the swap cache when they are first written to. On the other hand, anonymous pages are not part of the swap cache until an attempt is made to swap them out.

When a page is being added to the swap cache, a slot is allocated with get_swap_page(), the page is added to the page cache with add_to_swap_cache() and is marked as dirty. A page is identified as being part of the swap cache once the page->mapping field has been set to swapper_space, which is the address_space struct managing the swap area. When the page is next laundered, it will actually be written to the swap area.

The top-level function for reading and writing to the swap area is rw_swap_page(). This function ensures that all operations are performed through the swap cache to prevent lost updates. However, as shown in Figure 3.11, the core function which performs the real work is rw_swap_page_base().

Pages are written out to disk when pages in the swap cache are laundered. To launder a page, the address_space->a_ops is consulted to find the appropriate write-out function. In the case of swap, the address_space is swapper_space and the swap operations are contained in swap_aops. The registered write-out function is swap_writepage() which will call rw_swap_page() to write the contents of the page out to backing storage.

Page Faulting

Linux, like most operating systems, has a Demand Fetch policy for dealing with pages that are not resident [106]. In other words, a page is fetched from the backing storage only when the hardware raises a page fault exception which the operating system traps [107]. Although a good page prefetching policy would result in fewer page faults, Linux is fairly primitive in this respect. When a page is paged in from swap space, a number of pages after it are read in by swapin_readahead() and placed in the swap cache.

Each architecture registers an architecture-specific function for the handling of page faults.

Figure 3.12: Call Graph: do_page_fault()

While the name of this function is arbitrary, a common choice is do_page_fault(), whose call graph is shown in Figure 3.12. This function is provided with a wealth of information such as the address of the fault, whether the page was simply not found or was a protection error, whether it was a read or write fault and whether it is a fault from user or kernel space. do_page_fault() is responsible for determining which type of fault has occurred and how it should be handled by the architecture-independent code.

Once the exception handler has decided the fault is a valid page fault in a valid memory region, the architecture-independent function handle_mm_fault(), whose call graph is shown in Figure 3.12, takes over. handle_mm_fault() allocates the required page table entries if they do not exist and calls handle_pte_fault() which takes different actions. If no PTE has been allocated, do_no_page() is called which handles Demand Allocation, otherwise it is a page that has been swapped out to disk and do_swap_page() is called to perform Demand Paging. Furthermore, if the page is being written to and if the PTE is write protected, do_wp_page() is called as the page is a Copy-On-Write (COW) page.

Demand allocation. When a process accesses a page for the very first time, the page has to be allocated and possibly filled with data by the do_no_page() function. If the parent VMA provided a vm_ops struct with a nopage() function, it is called. A nopage() function is provided for instance if the page is backed by a file or device. If the vm_area_struct->vm_ops field is not filled or a nopage() function is not supplied, the function do_anonymous_page() is called to handle an anonymous access. There are only two cases to handle, first time read and first time write. The first read maps the empty_zero_page (which is just a page of zeros) for the PTE and the PTE is marked write protected so that another page fault will occur if the process writes to the page.

Demand paging. When a process accesses a page that is swapped out to the backing storage, the page will be read in by the function do_swap_page(), whose call graph is shown in Figure 3.13. By this time, the information needed to find the page in the swap area is stored within the PTE itself. The core function used when reading in pages is read_swap_cache_async() which first searches the swap cache with find_get_page() and returns the page if the page is still in the swap cache. If the page is not in the swap cache, a new page is allocated with alloc_page(), it is added to the swap cache with add_to_swap_cache(), and swapin_readahead() is called to read in the requested page and a number of pages after it. swapin_readahead() calls rw_swap_page() which is the top-level function for writing and reading from the swap area and was already described in Figure 3.11. Finally, the IO is started with rw_swap_page() with flags to start the read operation. If the swap area is a file, bmap() is used to fill a local array with the list of all blocks in the file-system which contain the page being operated on. Once that is complete, a normal block IO operation takes place with brw_page().

Figure 3.13: Call Graph: do_swap_page()

Linear Address Space

A kernel module can allocate only kernel memory and is not involved in handling segmentation and paging (since the kernel offers a unified memory management interface to drivers). In Linux, the kmalloc() function allocates a memory region that is contiguous in physical memory. Nevertheless, the maximum memory size that can be allocated by kmalloc() is 128 KB [91]. Therefore, when dealing with large amounts of memory a module uses the vmalloc() function to allocate non-contiguous physical memory in a contiguous virtual address space. Unfortunately, the memory size that can be allocated by vmalloc() is also limited, as discussed in the next paragraph.

The Linux kernel splits its address space in two parts: user space and kernel space [50]. On x86 and SPARC architectures, 3 GB are available for processes and the remaining 1 GB is always mapped by the kernel. (The kernel space limit is 1 GB because the kernel may directly address only memory for which it has set up a page table entry.) From this 1 GB, the first 8 MB are reserved for loading the kernel image, as shown in Figure 3.14. After the kernel image, the mem_map array is stored and its size depends on the amount of physical memory. On low-memory systems (systems with less than 896 MB), the remaining amount of virtual address space (minus a 2 page gap) is used by the vmalloc() function, as shown in Figure 3.14.a. For illustration, on a Pentium 4 PC with 512 MB DRAM, a module can allocate about 400 MB. On high-memory systems, which are systems with more than 896 MB, the vmalloc region is followed by the kmap region (an area reserved for the mapping of high-memory pages into low memory) and the area for fixed virtual address mappings, as shown in Figure 3.14.b. On systems with large physical memories, the size of the mem_map array can be significant, and therefore not enough memory is left for the other regions. Because the kernel needs these regions, on x86 the vmalloc area, the kmap area, and the area for fixed virtual address mappings are together defined to be at least 128 MB; this minimum is denoted by VMALLOC_RESERVE. Hence, the amount of memory that can be allocated in kernel mode on high-memory systems is smaller than on low-memory systems (because the area used by the vmalloc() function is smaller). For illustration, on a Pentium 4 PC with 1 GB DRAM, a module can allocate at most 100 MB (compared with the 400 MB in the previous case). On the other hand, 64-bit architectures are not as limited as 32-bit architectures, and a module can allocate up to 2 TB on a 64-bit PowerPC that runs Linux in 64-bit mode.

Figure 3.14: Linux kernel space (a. low-memory systems: kernel image, mem_map, gap, vmalloc area; b. high-memory systems: kernel image, mem_map, gap, vmalloc area, kmap area, fixed virtual address mappings).

3.2.2 Implementation and Integration Details

The functionality of the compressed-memory system described in Section 3.1 is divided into two parts, namely i) a part that defines and manages the compressed region and ii) a part that adds the compression functionality to the kernel and defines its interaction with the kernel. Part i) is implemented as a loadable module [91] while part ii) consists of hooks in the operating system to call module functions at specific points. While the first part can be ported to new versions of the Linux kernel easily, the second part may require kernel modifications as the Linux implementation changes.

Figure 3.15: Call Graph: zone_add()

The Compressed Region Management

The compressed region is organized in memory zones, as described in Section 3.1. When the compression support is turned on, the system allocates the global data structures to manage the compressed region, which are the zone chain, the hash table, and the LRU first and LRU last pointers. When the compressed region is grown and shrunk, zones are added and removed with zone_add() and zone_rem().

Zone add. When a zone is added to the compressed memory, the memory for the new zone is allocated by calling zone_new() and the chain of zones is updated with ins_zone_chain(), as shown in Figure 3.15. zone_new() is actually the core function which allocates memory to store a zone's data structures with alloc_local_struct() and memory for the compressed data with alloc_blocks(). A zone's data structures are initialized with init_block_table(), init_cpage_table() and init_zone_struct().

Because vmalloc() is a flexible mechanism to allocate large amounts of data in kernel space, we use vmalloc() to allocate memory for both compressed data and metadata. As described earlier, on IA32 architectures the memory size that can be allocated by vmalloc() is at most 100 MB. However, as we will see later in this thesis, for applications with large memory footprints, a compressed region of 100 MB is insufficient.
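A minimal sketch of such an allocation is shown below; the per-zone size and the exact signature of alloc_blocks() are assumptions.

    #include <linux/vmalloc.h>

    #define ZONE_DATA_BYTES (4 * 1024 * 1024)     /* e.g., 4 MB per zone    */

    /* kmalloc() is limited to 128 KB on these kernels, so the zone data is
     * allocated with vmalloc(): physically non-contiguous pages that are
     * contiguous in the kernel's virtual address space.                    */
    static void *alloc_blocks(void)
    {
        return vmalloc(ZONE_DATA_BYTES);
    }

    static void free_blocks(void *data)
    {
        vfree(data);
    }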

Zone delete. When the compressed region is shrunk, the system selects the zone with the minimum number of blocks used and returns it to the uncompressed region as follows. If the selected zone stores no compressed data, its memory is deallocated by calling zone_del(), as shown in Figure 3.16. For the selected zone, zone_del() calls free_local_struct() and free_blocks() to deallocate both the local metadata and the memory to store compressed data. Finally, the zone is deleted from the zone chain. If the selected zone stores some compressed pages, these pages are moved to other zones. For each page to be moved, get_zone() selects a zone that has enough free blocks to store the page and move_cpage() moves the page to the selected zone. After all pages have been moved, the zone is deleted and its memory is returned to the uncompressed region.

Figure 3.16: Call Graph: zone_rem()

Page insert. Pages are inserted in the compressed region by calling cpbuf_put(). As shown in Figure 3.17, cpbuf_put() computes the page entry in the hash table, compresses the page by calling compress_page(), and selects a zone with enough free blocks to store the compressed page. Once a zone is found with zone_find(), the necessary number of free blocks is selected by calling get_block_entries() which traverses the list of free blocks and selects the desired number of blocks. Furthermore, the system calls get_cpage_entry() to select an entry in comp page table and adds, with handle_collisions(), the new entry to the list of all pages that map to the same hash value. The first field of the selected entry is set to point to the first block that will store the compressed data. Finally, the compressed data is copied into the selected blocks and the remaining fields of the selected entry are updated.

Page delete. A compressed page is retrieved by calling cpbuf_get() and its entry in the compressed region is removed by calling cpbuf_rem(). As shown in Figure 3.18, the first step of the cpbuf_get() function is to search for the swap handle in the compressed region. For the given handle, cpbuf_get() computes its index in the hash table. If the handle is in the chain of handles that map to the computed index, get_cpage_blocks() selects the blocks that store the compressed data and copies them into the decompression buffer. The page is then decompressed and returned to the system. The cpbuf_rem() function is called after the cpbuf_get() function has returned the page to the system. As the name indicates, cpbuf_rem() frees the entries in a zone's local structures that stored the compressed page and also updates the hash table to indicate that the page is no longer in the compressed region.

Global functions. For the handling of global data structures, the system registers a couple of functions, namely hash_table_init(), hash_func(), and hash_lookup() for handling hash table operations, and lru_insback(), lru_remove(), and lru_init() for handling LRU-related operations.

Figure 3.17: Call Graph: cpbuf_put()

Figure 3.18: Call Graph: cpbuf_get()

Figure 3.19: Call Graph: kcmswapd()

Page-out daemon (kcmswapd). When the compression support is turned on, a kernel thread called kcmswapd is started from kcmswapd_init() which continuously executes the kcmswapd() function. This daemon is responsible for reclaiming compressed pages when the compressed region is heavily used. kcmswapd is woken by cpbuf_put() when the number of free pages in the compressed region equals kcmswapd_low. The implementation of kcmswapd, shown in Figure 3.19, follows pretty much the same steps as the kernel swap daemon kswapd and performs most of the tasks needed to maintain some free memory in the compressed region. It is the mptd_trymove() function that retrieves the LRU pages from the compressed memory, decompresses them, and saves them on disk.
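The reclaim loop of kcmswapd can be sketched as follows; thread creation and wait-queue handling are omitted, and kcmswapd_high is an assumed upper threshold analogous to the pages_high watermark used by kswapd.

    /* Stand-in helpers around the kcmswapd reclaim loop.                   */
    int module_exiting(void);                /* COMPRMOD_FL_EXIT set?       */
    void wait_until_woken(void);             /* sleeps until cpbuf_put()
                                                wakes the daemon            */
    long free_compressed_pages(void);
    void mptd_trymove(void);                 /* decompress the LRU page and
                                                write it to disk            */
    extern long kcmswapd_high;               /* assumed counterpart of
                                                kcmswapd_low                */

    void kcmswapd_loop(void)
    {
        while (!module_exiting()) {
            wait_until_woken();
            /* Evict the oldest compressed pages until enough of the
             * compressed region is free again.                             */
            while (free_compressed_pages() < kcmswapd_high)
                mptd_trymove();
        }
    }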

Kernel Modifications

To add the compression functionality to the kernel, several kernel functions must be changed to call de/compression functions at specific points. The first set of functions that must be changed are those for writing and reading pages from the swap area. As described in Section 3.2.1, in the Linux 2.4 version, the top-level function for writing and reading from the swap area is rw_swap_page() which starts the IO by calling rw_swap_page_base(). Whether the operation is a disk read or write depends on the arguments with which this function is called. On the other hand, in the Linux 2.6 version, the system uses two different functions for writing and reading data from disk, namely swap_writepage() and swap_readpage(). Therefore, the functions that have been changed are rw_swap_page_base() in the Linux 2.4 version, and swap_writepage() and swap_readpage() in the Linux 2.6 version.

Swap out. When a page is to be swapped out and compression is turned on, the new swap_writepage() function checks whether the free space in the compressed region is enough to store a new page. If the free space is not enough, swap_writepage() waits for kcmswapd to free some space and tries again later. If the free space suffices, the page is inserted in the compressed region by calling cpbuf_put(). If compression is turned off, the original swap_writepage() function is called which writes the page to disk.

Swap in. As described in Section 3.2.1, when a process accesses a page that is swapped out, the system calls the swapin_readahead() function to read in the requested page and a number of pages after it. swapin_readahead() calls valid_swaphandles() to get the number of pages/handles to read ahead. For each of the returned handles, swapin_readahead() calls the read_swap_cache_async() function which calls swap_readpage(). Because readahead does not make sense for a page in the compressed region, the valid_swaphandles() function has been changed to turn off readahead for compressed pages. Furthermore, if compression is turned on, the swap_readpage() function checks first whether the faulted page is in the compressed region. If the page is compressed, cpbuf_get() is called to return the page to the faulting process and the page is removed from the compressed region with cpbuf_rem(). If the page is not in the compressed region, the original swap_readpage() function is called and the page is read from the disk into (uncompressed) memory.
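The decision logic added to these two paths can be summarized as follows. The signatures are simplified stand-ins rather than the real kernel prototypes, and only cpbuf_put(), cpbuf_get(), and cpbuf_rem() correspond to functions named above.

    struct page_stub;                        /* stands in for struct page   */

    int compression_enabled(void);           /* COMPRMOD_FL_ENABLED set?    */
    int compressed_region_has_room(void);
    void wait_for_kcmswapd(void);
    int cpbuf_put(struct page_stub *page);
    int cpbuf_get(unsigned long handle, struct page_stub *page);
    void cpbuf_rem(unsigned long handle);
    int original_swap_writepage(struct page_stub *page);
    int original_swap_readpage(struct page_stub *page);

    int compressed_swap_writepage(struct page_stub *page)
    {
        if (!compression_enabled())
            return original_swap_writepage(page);    /* straight to disk    */
        while (!compressed_region_has_room())
            wait_for_kcmswapd();                     /* let it free space   */
        return cpbuf_put(page);       /* store in the compressed region     */
    }

    int compressed_swap_readpage(struct page_stub *page, unsigned long handle)
    {
        if (compression_enabled() && cpbuf_get(handle, page) == 0) {
            cpbuf_rem(handle);        /* page leaves the compressed region   */
            return 0;
        }
        return original_swap_readpage(page);         /* page is on disk     */
    }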

Terminate process. When a process terminates, it frees all its pages (including those that have been swapped out to disk) and returns them to the operating system for reuse. For those pages that are on disk, the system calls swap_entry_free() to delete the corresponding entries in the swap cache and page cache. The swap_entry_free() function has been modified to check whether the page to be freed is in the compressed region or on disk. If the page is compressed, it is removed from the compressed region with cpbuf_rem(), and if the page is on disk the original swap_entry_free() function is called to remove the page from disk.

Module Flags

The actual implementation of the compressed-memory system uses several flags to indicate the state of the module:

• COMPRMOD_FL_ENABLED. This flag indicates whether compression is turned on or turned off. If the flag is not set (compression is turned off) the compressed region is empty and the module can be unloaded. Moreover, if the flag is not set, a user can select a new compression algorithm or can re-enable compression.

• COMPRMOD_FL_SHUTTINGDOWN. This flag indicates that the module shutdown operation is in progress. If this flag is set, pages from the uncompressed region are sent directly to disk, and all pages in the compressed region are moved to disk. Upon successful completion of the shutdown operation, compression is disabled.

• COMPRMOD_FL_EXIT. This flag is used when the module is unloaded. If this flag is set, kcmswapd stops.

• COMPRMOD_FL_REMONREAD. This flag enables the remove-on-read feature. On a page fault, before going to disk, the system searches for the faulted page in the compressed region. If the page is in compressed form, it is decompressed and brought into the uncompressed region. Then, the system checks the COMPRMOD_FL_REMONREAD flag and deletes the page from the compressed region if and only if this flag is set.
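The sketch below shows one way such flags could be declared and tested; the bit positions and the comprmod_flags variable are assumptions, only the flag names come from the text.

/* Sketch of possible module-state flag definitions (bit positions assumed). */
#include <stdio.h>

#define COMPRMOD_FL_ENABLED       (1UL << 0)  /* compression is turned on        */
#define COMPRMOD_FL_SHUTTINGDOWN  (1UL << 1)  /* shutdown in progress            */
#define COMPRMOD_FL_EXIT          (1UL << 2)  /* module unload: stop kcmswapd    */
#define COMPRMOD_FL_REMONREAD     (1UL << 3)  /* remove page from region on read */

/* Assumed flag word holding the module state. */
static unsigned long comprmod_flags = COMPRMOD_FL_ENABLED | COMPRMOD_FL_REMONREAD;

static int remove_on_read(void)
{
    return (comprmod_flags & COMPRMOD_FL_REMONREAD) != 0;
}

int main(void)
{
    printf("remove-on-read is %s\n", remove_on_read() ? "enabled" : "disabled");
    return 0;
}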

Interface between Compressed-Memory System and User

The interface between the module and the user consists of a set of functions. We create a device driver, called /dev/comprmod. The functions the module exports are called by using the ioctl() system call on the compressed-memory device file as follows: int ioctl(int fd, int cmd, int p), where fd is the file descriptor returned by a previous open("/dev/comprmod") call, cmd selects one of the functions, and the value of p is the parameter of the selected function (e.g., a user-space address of a buffer).
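A minimal user-space sketch of this calling convention is shown below. The command constant CMD_PRINTK_CFG is hypothetical; the real command numbers are defined by the module's header.

/* Minimal sketch: talking to the module through ioctl() on /dev/comprmod. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CMD_PRINTK_CFG  1   /* hypothetical command id */

int main(void)
{
    int fd = open("/dev/comprmod", O_RDWR);
    if (fd < 0) {
        perror("open /dev/comprmod");
        return 1;
    }
    /* int ioctl(int fd, int cmd, int p): p carries the function argument. */
    if (ioctl(fd, CMD_PRINTK_CFG, 0) < 0)
        perror("ioctl");
    close(fd);
    return 0;
}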

A small user-space library, called comprlib, provides easy access to the /dev/comprmod device. The library provides a wrapper function for each function exported by the kernel module (a short usage sketch follows the list):

• int comprlib_printk_cfg() prints module information to the kernel log file, usually /var/log/messages (e.g., the compression algorithm used).

• int comprlib_printu_cfg(char *text) prints module information to a user-space buffer provided by the calling process.

• int comprlib_printu_pid(char *pid) prints the kcmswapd pid.

• int comprlib_compr_pid(int pid) enables compression for the pid process only.

• int comprlib_printk_stat() prints the current values of various statistics counters to the kernel log file (e.g., the number of pages stored in the compressed region).

• int comprlib_printu_stat(char *text) prints various statistics counters to a user-space buffer provided by the calling process.

• int comprlib_reset_stat() sets all statistics counters to zero.

• int comprlib_printk_ppstat(int pid) prints per-process statistics to the kernel log file.

• int comprlib_printu_ppstat(char *text, int pid) prints per-process statistics to a user-space buffer.

• int comprlib_reset_ppstat(int pid) sets per-process statistics counters to zero.

• int comprlib_printk_pcstat(int cpu) prints per-CPU statistics to the kernel log file.

• int comprlib_printu_pcstat(char *text, int cpu) prints per-CPU statistics to a user-space buffer.

• int comprlib_reset_pcstat(int cpu) sets per-CPU statistics counters to zero.

• int comprlib_printu_calg(char *text) prints information about all compression algorithms the module supports to a user-space buffer provided by the calling process.

• int comprlib_select_calg(int calg) selects a compression algorithm. This function can be called only after the module has been shut down; calg is the ID of the algorithm.

• int comprlib_add_zone() adds a zone to the compressed region.

• int comprlib_rem_zone() removes a zone from the compressed region.

• int comprlib_getzones() returns the number of zones in the compressed region.

• int comprlib_getpages() returns the number of pages in the compressed region.

• int comprlib_printk_user(char *text) prints a zero-terminated string text to the kernel log file.

• int comprlib_shutdown() shuts down the module.

• int comprlib_continue() re-enables compression. This function can be called after the module has been shut down.
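A hypothetical usage sketch, assuming a header named comprlib.h and linking against the library described above, might look as follows.

/* Hypothetical example of driving the module through comprlib. */
#include <stdio.h>
#include "comprlib.h"             /* assumed header name */

int main(void)
{
    char buf[4096];

    comprlib_add_zone();          /* grow the compressed region by one zone */
    printf("zones: %d, pages: %d\n",
           comprlib_getzones(), comprlib_getpages());

    comprlib_printu_stat(buf);    /* fetch statistics into a user buffer */
    printf("%s\n", buf);

    comprlib_rem_zone();          /* shrink the region again */
    return 0;
}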

comprapp is a small command-line application that exposes the functions provided by comprlib through command-line arguments. In addition, comprapp provides a few synthetic benchmarks. Table 3.1 provides a short description of comprapp.

3.3 Discussion

This section briefly discusses the advantages of our design over previous compressed-memory systems.

Metadata. All approaches to implementing main memory compression reserve memory at boot time for the metadata needed to manage the compressed region. The amount of memory reserved for the metadata is based on an estimate of the maximum size of the compressed region. Therefore, later on, the compressed region cannot exceed this maximum size. Moreover, if the compressed region is smaller than the chosen maximum, some of the memory reserved for metadata is wasted, as it cannot be used by applications. By organizing the compressed region in self-contained zones, our approach is, at any moment in time, able to allocate the right amount of metadata needed to manage the compressed region. A zone is self-contained in that it comprises memory to store compressed data and structures for managing this memory. Because the compressed region is resized by adding and deleting zones, the amount of metadata increases and decreases as the compressed region size is increased and decreased.

Fragmentation. To keep fragmentation to a minimum, it is desirable to organize the compressed region in blocks of the same size. However, because the compressed region size varies as applications execute, most of the related approaches to compressing main memory ended up dividing the compressed region into variable-size blocks, thereby increasing fragmentation. By dividing the compressed region into zones of the same size and by further organizing a zone in fixed-size blocks, our design succeeds in keeping fragmentation to a minimum.

To sum up, we provide a method for organizing the compressed region in a way that keeps fragmentation to a minimum and allocates the right amount of memory for the metadata needed to manage the compressed region.

3.4 Summary

This chapter describes our approach to a flexible compressed-memory system. The key idea of our design is to organize the compressed region in zones of the same size. A zone is self-contained in that it consists of memory to store compressed data and structures to manage the compressed data. When a zone is allocated or deleted, both the memory to store compressed data and the local structures are allocated or deleted. Because a zone is self-contained, zones are added and removed easily. The proposed design imposes some locality on the blocks of a compressed page by storing a compressed page within a single zone and not scattering it over multiple zones. This design decision further eases the zone delete operation, as the system does not have to deal with pages that are partially stored in other zones. The proposed system grows and shrinks the size of the compressed region by adding and removing zones. Because adding and removing zones is done easily (a zone is self-contained), the design presented in this chapter allows for easy resizing of the compressed region.

Argument(s)              Description

--printk_cfg             prints module configuration information to the kernel log file.
--printu_cfg             prints module configuration information to stdout.
--printu_pid             prints the kcmswapd pid to stdout.
--setcomprpid pid        enables compression for the pid process only.
--printk_stat            prints various statistics to the kernel log file.
--printu_stat            prints various statistics to stdout.
--reset_stat             sets all statistics counters to zero.
--printk_ppstat pid      prints per-process statistics to the kernel log file.
--printu_ppstat pid      prints per-process statistics to stdout.
--reset_ppstat pid       sets per-process statistics counters to zero.
--printk_pcstat cpu      prints per-CPU statistics to the kernel log file.
--printu_pcstat cpu      prints per-CPU statistics to stdout.
--reset_pcstat cpu       sets per-CPU statistics counters to zero.
--printu_calg            prints the compression algorithm used to stdout.
--selectcalg id          selects the compression algorithm with ID id.
--addzone                adds a zone.
--remzone                removes a zone.
--getzones               returns the current number of zones.
--getpages               returns the number of compressed pages.
--shutdown               shuts down the module.
--continue               re-enables compression.
--writerand x y          allocates x MB of memory and writes randomly y * 1000 times.
--writerand_t3 x y       multi-threaded version of --writerand: three threads write to a shared memory region.
--writeseq x y           allocates x MB of memory and writes sequentially y times.
--writeseq_t3 x y        multi-threaded version of --writeseq: three threads write to a shared memory region.

Table 3.1: Command-line arguments of the comprapp sample application


This chapter also describes a possible implementation of the proposed compressed-memory system. The core implementation ideas are independent of a particular operating system. We organized the implementation into an operating system independent part and an operating system dependent part. The presented implementation has been successfully integrated into the Linux operating system with modest effort in terms of lines of code.

4 Performance Modeling

The goal of a compressed-memory system is to improve the performance of large applications whose memory requirements exceed what is available in a system. However, as described in Chapter 2, whether compression improves an application's performance depends on several factors. To understand the factors that influence an application's performance, this chapter focuses on modeling the performance of large applications whose execution times depend heavily on the memory system performance.

Section 4.1 introduces existing approaches to predicting an application's performance. Section 4.2 discusses related work in the area of performance prediction. Two new performance models are described in Section 4.3. Next, Sections 4.4 and 4.5 present the target system and the applications used to validate the proposed models. The accuracy of the performance estimates is presented in Section 4.6, and Section 4.7 summarizes the chapter.

4.1 Introduction

Research and engineering of computer systems often requires estimating the execution time of programs. For instance, research into paging techniques for large applications [64, 98] requires investigation of applications that have large data spaces and long execution times. However, a detailed simulator that models all aspects of modern processors requires several hours to simulate a few seconds of real execution time with reasonable accuracy [111]. Straightforward extensions of existing simulators incur high computational costs and may not even be practical if the data space is large. As researchers use larger and larger applications (and move away from application kernels) in their simulations, the situation will get worse, despite improvements in the cycle time of the platform that is used for simulations.

An attractive approach to predicting an application's performance on a (real or hypothetical) system is analytical modeling. Although analytical models can provide results in a short time, most researchers have had limited success in validating their performance predictions on real machines. The difficulty of modeling current systems comes from the fact that modern microprocessors use a number of techniques to overlap cache misses with computation and subsequent memory operations. Some of these techniques may speed up some applications, but may have a negligible impact on the performance of other applications. The complexity of modern systems poses a significant challenge in developing an accurate performance model with a small set of input parameters that are easy to measure or estimate. Given a class of applications, an application analyst could come up with a relatively simple performance model for predicting the performance of the selected applications. However, as the class of applications whose performance is to be predicted is enlarged, the model becomes more and more complex. Validation with a real system, however, is important to increase the confidence that estimations for hypothetical systems will be meaningful.

4.2 Background

In this thesis, we use the term target to refer to the system that is to be simulated. At the least, we want to estimate the execution time of an application on this target system. The platform is the system that hosts our simulator, e.g., the system that executes the application and produces inputs for a simulator. We group related research into three categories: (1) approaches to model the execution of the simulated program, (2) approaches to characterize the target machine, and (3) approaches to characterize the simulation on the platform (i.e., how the platform is coupled to the simulator). These categories are described in more detail in the following subsections.

4.2.1 Performance Prediction

Given the importance of performance prediction in both system and application design, a large number of performance prediction techniques have been proposed in the last decades, ranging from pure mathematical models to full system simulators. As the goal of this thesis is to predict performance for large applications, this section reviews only the prediction techniques that use data from executions of real applications and skips those focused solely on synthetic benchmarks.

Gallivan et al. [45] and Saavedra and Smith [93, 94] show that the execution time of a program on a specific machine can be computed by summing up the timings of elementary instructions scaled by their frequency during a specific execution. The main disadvantage of this technique is that it was applied only to unoptimized FORTRAN codes. Moreover, the approach requires a large number of measurements. For instance, Saavedra and Smith [94] consider 109 abstract FORTRAN operations and measure their execution time on each machine they studied.

The profile-based evaluation technique proposed by Ofelt and Hennessy [86] computes the execution time of a program by summing up the execution times of the basic blocks (and paths) of a program on a given machine multiplied by their frequency during a specific run. The technique requires detailed object code analysis and/or assistance to identify the basic blocks of a program.

Hennessy and Patterson [53] state that a good measure of memory hierarchy performance (although still an indirect measure of performance) is the average time to access memory. The Average Access Time (AAT) is defined in terms of the time to hit in the cache/memory, the miss penalty, and the miss rates for reads and writes. However, the authors note that as processors get more and more sophisticated, characterizing the memory system with a single parameter becomes insufficient and another measure is called for.
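For reference, the usual form of this measure, with reads and writes folded into a single miss rate, is

\mathrm{AAT} = t_{hit} + m \times t_{penalty}

where t_{hit} is the hit time, m the miss rate, and t_{penalty} the miss penalty; keeping separate miss rates and penalties for reads and writes gives the refined per-access-type variant referred to above.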

Ailamaki et al. [13] compute the execution time of a database query by summing up its computation time, memory stall time, branch misprediction overhead, and resource-related stalls. The study shows that for this class of applications, the miss penalty of a program is one of the most important performance factors. The miss penalty of a program is given by the penalty of a miss scaled by the number of misses during the program's execution. The number of cache misses can be either measured [93] or estimated using compile-time prediction [24] or a mathematical framework [47].

Recently, researchers have focused on analytic evaluation of shared-memory systems with ILP processors [100, 59]. Sorin et al. [100] show that although the ILP processors are difficult to model, it is possible to predict performance for some applications running on these processors. However, due to the high complexity of these processors, the analytical models are complex formulas with a large number of parameters (10-20 parameters).

4.2.2 Machine Characterization

Micro-benchmarking is a popular technique used as a basis for performance prediction. Initial work in this area focused on sequential programs and did not consider the memory hierarchy [94]. Since then, the benchmarking technique has been extended and nowadays can also capture the performance of parallel machines [23].

Measuring the memory performance of modern systems is difficult, given that not all misses incur the same timing penalties, and misses interact with one another. Hristea et al. [55] classify misses into: (1) in-isolation misses, (2) back-to-back misses, and (3) pipelined misses. In-isolation misses are isolated in time from one another so that the time to fill the L1 and L2 caches with the missing block is less than the time to the next miss. Back-to-back misses are dependent and have minimal separation. Pipelined misses are independent and have minimal separation so that performance is limited by whichever resource is the bottleneck during the cache fills. In-isolation and back-to-back misses represent the best- and worst-case performance of the memory system, and they dictate the performance of irregular applications. For regular codes, when pipelined transfers are possible, bandwidth is more important than latency [55, 103, 34, 81]. Hristea et al. describe three benchmarks to measure the performance of these three categories of misses (for reads only). Similar micro-benchmarks are McVoy's lmbench [81], which measures pipelined and back-to-back memory latency, and McCalpin's STREAM [80], which gathers memory pipeline performance.

Stricker and Gross [104] describe memperf, a bandwidth-oriented characterization of a memory system that pays attention to memory access patterns. Memory performance is measured as access bandwidth for different strides and different working sets. The stride parameter (with values between 1 and 192) shows how well the cache and external stream logic help with read-ahead. The working set parameter captures the effect of cache hits through reuse of recently accessed data. The write tests capture the performance of well-pipelined writes through a write-back queue.
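As an illustration of this kind of measurement, the following sketch reports read bandwidth for a range of strides over a fixed working set. It is not the memperf benchmark itself and omits its warm-up, loop unrolling, and write tests; all constants are chosen only for illustration.

/* Simplified strided-read bandwidth measurement, in the spirit of memperf. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double measure_MBps(size_t ws_bytes, size_t stride, int reps)
{
    size_t n = ws_bytes / sizeof(double);
    double *a = malloc(n * sizeof(double));
    volatile double sink = 0.0;
    struct timespec t0, t1;

    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;                              /* touch the working set */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i += stride)
            sink += a[i];                              /* strided reads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = (double)reps * (double)(n / stride) * sizeof(double);
    free(a);
    (void)sink;
    return bytes / secs / 1e6;
}

int main(void)
{
    for (size_t stride = 1; stride <= 192; stride *= 2)
        printf("stride %3zu: %8.1f MB/s\n", stride,
               measure_MBps(8u << 20, stride, 10));    /* 8 MB working set */
    return 0;
}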

4.2.3 Instrumentation and Simulation Tools

There has been a long history of work on instrumentation and simulation tools for evaluating memory performance [111]. Given the hardware diversity and the different approaches to characterizing a machine, there is a large number of tools and a huge body of literature that describes this research area. The rest of this subsection gives a brief overview of the instrumentation and simulation tools for evaluating memory performance, and provides a classification of these tools according to the processor code they instrument or simulate.

Instrumentation and simulation tools are particularly useful for modeling memory system performance. Instrumentation tools provide detailed information about the dynamic characteristics of a program, namely the sequence of instructions the program executes and the data addresses the program accesses. This information, called an instruction and address trace, is then given to a trace-driven simulation program that simulates a memory system in detail, including caches and any memory management hardware.

Interface. Because the memory addresses a program accesses are produced at high rates, the way data is communicated to the simulator is of great importance. The interface between the instrumentation and simulation tools can be (1) a file, (2) a pipe, (3) a procedure call, or (4) a region of memory. Each type of interface has its advantages and disadvantages. Given the huge amount of data, using (1) a file for data transmission is nowadays not an option, as the trace size largely exceeds the maximum size of a file. An alternative approach is to compress traces before storing them in files. Another approach is to record only the significant dynamic events, store them in a file, and reconstruct the complete trace before giving it to the simulator. When using (2) pipes, data is processed while it is collected, avoiding the cost of storing it. However, not saving the traces has its disadvantages, as traces must be re-collected for each simulation run. Moreover, there is the overhead of switching processes, as one process collects data and another one processes it. When using (3) procedure calls or (4) regions of memory as interfaces, both instrumentation and simulation are done by the same process. When using (4) regions of memory, execution begins in a trace-collecting mode and data is collected and stored in a reserved region of memory until that region is filled. The process then switches to a trace-processing mode, processes the collected data, and runs the simulation until the memory region is empty again. However, the main disadvantage of placing the simulator in the same process as the monitored program is that it decreases the trace accuracy.

Instrumentation Tools

Instrumentation tools provide detailed information about a program's execution, namely the sequence of instructions the program executes or the data addresses the program accesses. An instrumentation tool modifies the program under study so that its dynamic characteristics, e.g., instruction and data traces, are recorded while the program executes. Most tools focus on address traces, op-code and operand usage, instruction counts, operand values, or branch behavior. An accurate instrumentation tool does not affect the original functionality of the test program, although it slows down the program significantly.

Conte and Gimarc [31] classify the run-time data collection methods as either (i) hardware-assisted or (ii) software-only schemes. The hardware schemes (i) involve the use of hardware devices that are added to a system solely for the purpose of data collection. Examples of such devices are the hardware boards and off-computer logic analyzers that monitor the activity of the system bus. The most frequently used hardware method is a hardware-based performance counter. A performance counter is special on-chip logic that summarizes specific run-time events, like cache misses. Examples of interfaces for accessing hardware performance counters are VTune [3], PAPI, Brink and Abyss. The software-only collection schemes (ii) use existing hardware to gather the desired run-time information. Conte and Gimarc divide the software schemes into two approaches: those which simulate, emulate, or translate the application code, and those which instrument the code. In addition to these approaches, some run-time information can be collected by using operating system traps.

Uhlig and Mudge [111] classify the trace collection methods according to the level of system abstraction at which they are extracted. The authors define eight levels of system abstraction: the circuit, microcode, emulation, loader, linker, assembler, compiler, and operating system levels. The authors note that hardware-assisted methods, i.e., external hardware probes and microcode modifications, are either expensive, hard to port, hard to use, or outdated. The software-only methods, such as instruction-set emulation or static code annotation, are less expensive and easier to use and port, but they have to fight inherent architecture limitations. The next paragraphs describe each of these methods in detail.

External hardware probes collect program traces at the lowest level of system abstraction, the circuit and gate level. At this level, the signals are recorded by electrical probes physically connected to the address bus of a host computer. Examples of external probe-based trace collectors are Monster, MTM, BACH, and DASH.

A trace collection method that uses microcode modification collects program traces at the borderline between the hardware and software levels of a system. The hardware-software borderline is defined by the collection of all instructions, called the instruction-set architecture (ISA). A microcode engine is an ISA interpreter that is implemented in hardware.

An instruction-set emulator interprets an ISA in software and can be modified to generate address traces as a side effect of its execution. Because emulators perform instruction translation at run time, they have the ability to trace dynamically linked or dynamically compiled code. Examples of instruction-set emulators are Spa (Spy), Mable, SPIM, gsim, Talisman, MINT, and Shade.

When the target and host ISAs are the same and dynamically changing code is not of interest, a data collection method called static code annotation can be used to collect program information. The method adds code segments to guard each basic block or memory operation and creates a new executable that produces a stream of basic block executions or a stream of memory references. Although annotating code at the assembly level is easy to implement, its main disadvantage is that it requires the source code of the program to be analyzed. Examples of such tools are TRAPEDS, MPtrace, AE, and TangoLite. Annotating code at the object level no longer requires the source code of the application, but is more difficult to implement (e.g., Epoxie, ATOM). Furthermore, code annotation at the binary level is the most convenient for the user but is the most difficult to implement (e.g., Pixie, Goblin, IDtrace, QPT, EEL).

Another method, called single-step execution, collects traces at the highest level of system abstraction, namely the operating system level. At this level, the address traces are recorded using the OS debugging utility, which enables the user to step through a program one instruction at a time.

In general, sophisticated methods yield accurate traces, but the accuracy comes at the cost of huge slowdowns; high-quality traces are thus quite difficult to obtain. Although the hardware-assisted collection methods produce undistorted traces, they are either expensive, hard to port, hard to use, or outdated. On the other hand, the software-only collection schemes are less expensive and easier to port than the hardware schemes, but the traces they produce are incomplete. Hence no single method for trace collection is a clear winner.

Discussion. The static code annotation method is probably the most popular form of trace collection because of its low cost and because it is relatively easy to implement. However, the traces collected using code annotation are incomplete. Furthermore, because code annotation at the assembly level requires instrumentation of the assembly code, the instrumentation tools that fall into this category are tied to a particular architecture.

Because the ISA of a specific processor dictates the number of instructions to be instrumented, the implementation complexity of such a tool is given by the machine characteristics. For instance, an instruction set architecture that includes memory-to-memory operations (e.g., x86) has many more instruction types to instrument than a load-store architecture (e.g., SPARC), which usually retrieves operands from the register file. (Memory-to-memory architectures have smaller register sets, which forces local variables to be stored in memory locations.)

Furthermore, on memory-to-memory architectures memory operands are often used as source and destination in the same instruction, thereby generating two trace entries from one instruction. According to Conte and Gimarc, the x86 architecture has approximately 180 instructions that may address memory. In addition, many of these instructions may perform both a load and a store operation, and some non-string instructions may reference two different addresses. In contrast, the MIPS R3000 has only 14 instructions that may reference memory. Each of these instructions may perform only a single read or write operation, and none of these instructions may access more than one address.

The effort to instrument x86 assembly code is further increased by the presence of multi-reference instructions. The x86 instruction set includes string operations that may perform an indeterminate number of memory references per instruction. One example is the rep instruction prefix, which may cause one string instruction to repeatedly access sequential memory addresses until a condition is satisfied. It is impossible to ascertain the number of iterations at instrumentation time. To record an accurate reference trace, the single instruction must be replaced by a sequence of instructions which output the reference, perform the string operation, check the condition, and loop back if the condition is not satisfied.

To sum up, the size, and consequently the execution time, of a program instrumented to trace memory references is mainly influenced by the number of instructions requiring tracing. As mentioned above, the number of instructions to instrument is larger on a CISC processor than on a RISC processor. Another performance factor is the register set size of the machine under consideration. On processors with large register sets, an instrumentation tool may use some registers in the trace code segments, which results in a speed-up of the instrumented code. Hence, on load-store architectures the instrumented code is shorter and faster than the instrumented code for memory-to-memory architectures. To conclude, RISC processor code can in general be instrumented more easily than CISC processor code, and the resulting code is shorter and runs faster.

Simulation Tools

One of the most popular techniques to model the performance of a memory system is trace-driven simulation. Such a tool simulates a memory system in detail, including caches and any memory management hardware. However, simulating a memory system in detail is a time-consuming process. An approach to speeding up cache simulations, called single-pass cache simulation, is often used to simulate multiple cache designs in a single pass through the benchmark traces. The single-pass simulation can be either a stack-based simulation, which uses stack processing algorithms, or a non-stack simulation, which can be a forest simulation or an all-associativity simulation.

Because trace-driven simulation employs a stream (trace) of prerecorded instructions or addresses to drive the simulation, the processing of the trace data depends on the methods used to generate and transfer the trace. Therefore, most of the instrumentation tools provide memory simulators and analysis tools to process the trace. For example, Dinero IV, Cache2000, and Tycho are cache simulators built on top of the Pixie trace collector.

Tools

Instrumentation and simulation tools are available for most current microprocessors. This subsection classifies the instrumentation and simulation tools according to the processor code they instrument or simulate. The tools are presented starting with the lower hardware levels. Although our focus is mainly on uniprocessor memory simulations, we also pay attention to tools capable of tracing multiprocess workloads.

x86

VTune. One of the most accurate tools (maintained by Intel) is the VTune performance analyzer, which collects, analyzes, and provides software performance data specific to Intel architectures [3]. The tool runs on Windows, Windows NT, and Linux. The major features of the VTune analyzer are time-based and event-based sampling, performance counter event sampling, call graph profiling, static code analysis, static and dynamic assembly analysis, and code coach optimizations.

PAPI. The Performance Application Programming Interface (PAPI) specifies a standard API for accessing hardware performance counters available on most modern microprocessors [82]. The tool runs on most modern machines [35], including Intel Pentium III and Pentium 4, Itanium I and II, UltraSPARC I, II and III, Cray, MIPS, Alpha, and PowerPC. PAPI is organized into two software layers [78]. The upper layer consists of an API and machine-independent support functions. The lower layer maps the API to machine-dependent functions and data structures.

Brink and Abyss. A high-level interface to the Pentium 4 performance counters is provided by Brink and Abyss (for Linux) [101]. Brink takes a description of programs to run and events to monitor and creates input files for the Abyss program, which configures the performance counters and collects data.

MTM and Bach. Magellan Trace Machine (MTM) and Bach are both probe-based trace collectors that use special-purpose hardware with very large, high-speed memories to observe and record bus activity. MTM records bus transactions and Bach records all memory references. MTM collects data for programs executing only on i486 microprocessors. In addition to the i486, Bach also supports the 68030 and SPARC architectures.

IDtrace. Another instrumentation tool for Intel architectures running Unix is IDtrace. The tool can produce a variety of trace types, including profile, memory reference, and full execution traces. Primitive post-processing tools for reading output files, visualizing traces, and computing basic profile data are included in the IDtrace package. However, the executable to be instrumented must be statically linked, and kernel code references are not included in the trace.

Dyninst. Dyninst is an API for dynamic program instrumentation that permits the insertion of code into a running program [2]. More precisely, the tool provides a machine-independent interface to write binary instrumentation programs [108]. Dyninst runs on many architectures, including MIPS (IRIX), Power/PowerPC (AIX), SPARC (Solaris), and x86 (Linux, Solaris, and NT).

Valgrind. Valgrind [97] (initially Cacheprof) is an execution-driven memory simulator that works with a standard GNU tool-chain on x86 platforms running Linux. The tool annotates each instruction that reads or writes memory and links a cache simulator into the resulting executable. When the program runs, it traps all data references and sends them to the cache simulator. When the program finishes its execution, detailed profile information is written to a file. The profiling results are either on an instruction-by-instruction basis or on a per-function basis.

MPtrace. One of the few tools for monitoring multi-threaded workloads running on a multiprocessor memory system is MPtrace [37]. The tool annotates i386 source code and stores the traces on disk. MPtrace uses control-flow analysis to annotate programs in a minimal way and gathers traces of only significant dynamic events. The trace of dynamic events, together with other addresses that are statically reconstructed, is then given to a simulator.

Augmint. Augmint is an execution-driven multiprocessor simulator for Intel x86 architectures running UNIX or Windows NT [84]. The tool consists of a front-end memory event generator, a simulation infrastructure for managing the scheduling of events, and a collection of architectural models that represent the system under study. Augmint is optimized for on-the-fly trace consumption by linking the program, which is annotated at the assembly level, with the simulation libraries.

Limes. Linux MEmory Simulator (Limes) is a multiprocessor simulation environment for PCs running Linux. Like Augmint, it takes an assembly level annotation approach.

SPARC

Spa. Spa is an instruction-set emulator that uses iterative interpretation. The tool runs on SPARC systems and emulates SPARC architectures. Spa stores the emulated register set in the actual hardware registers of the host machine.

Shade. A popular instruction-set simulator and custom trace generator for SPARC architectures is Shade [5]. Applications are executed and traced under the control of a user-supplied trace analyzer. To reduce communication costs, Shade and the analyzer are run in the same address space [29]. Shade (maintained by Sun) runs on SPARC systems and simulates the SPARC (versions 8 and 9) and MIPS instruction sets. SpixTools is an instrumentation tool set for SPARC architectures. The two main tools in the SpixTools distribution are spix and spixstats. Spix does not generate instruction or data traces; it only generates basic block counts and stores them in a file. Spixstats uses the basic block counts to summarize the behavior of the instrumented program.

QPT. The design goal of the Quick program Profiling and Tracing system (QPT) is to produce compact traces that can be stored for later simulations. QPT [71] instruments MIPS and SPARC executables. To reduce the amount of instrumentation code, QPT performs control flow analysis and relies heavily on symbol table information and knowledge of the code structure. The trace output by the instrumented program is a compact trace that needs expansion before it can be used by a trace consumer program. QPT creates a regeneration program in the form of an object file that can be linked into the compiled consumer program; hence the consumer program reads the compact trace directly from disk. Abstract Execution (AE) is QPT's predecessor. While QPT instruments the executable, AE is part of the C compiler. Moreover, AE creates a trace regeneration tool for each instrumented application, which is piped to the consumer program.

CPROF. CPROF [73] is a tool that consists of two programs: Cprof, a uniprocessor cache simulator, and Xcprof, an X windows user interface. Cprof processes program traces generated by QPT and annotates source lines and data structures with the appropriate cache miss statistics. Xcprof provides a generalized X windows interface for easy viewing of annotated source files.

EEL. Executable Editing Library (EEL) is a C++ library that hides much of the complexity and system-specific detail of editing executables. EEL provides abstractions that allow a tool to analyze and modify executable programs [72]. The library allows the user to specify how to annotate the instructions and which machine state to extract. QPT2, Fast-Cache, and PP are examples of cache simulators and path profilers built on top of the EEL trace collector.

Fast-Cache. One of the memory system simulators that uses EEL to annotate each workload instruction that makes a memory reference is Fast-Cache [74]. The simulator optimizes the common case of cache hits. More precisely, Fast-Cache allows simulator writers to specify the appropriate action on each reference, including "no action" for the common case of cache hits.

PP. Path Profiling (PP) [16] is a path profiler built on EEL. A program's path is a sequence of basic blocks that are executed consecutively. As a program's path captures a program's control flow [20], many performance analysts rely on this information to model a program's performance.

ABSS. Augmentation-Based SPARC Simulator (ABSS) is a simulation environment that enables the user to implement a timing-accurate simulator of a SPARC-based multiprocessor.

Spike. Spike is an instrumentation tool built into the GNU CC compiler. The tool has been implemented for the Motorola 68000 family, SPARC, and HP PA-RISC ISAs. Spike is optimized for on-the-fly trace consumption by linking the original program with an instrumentation library. The library contains a procedure that is invoked for every trace event, and this procedure may implement any kind of simulator or trace collector.

WWT. The Wisconsin Wind Tunnel (WWT) [7] was the first trap-driven simulator that operates at a granularity smaller than a memory page. Its implementation relies on the fact that each memory location has error-correction code (ECC) check bits. The tool causes kernel traps by modifying the ECC check bits in a SPARC-based system. The standard software trap handler is also modified to get the relevant information after each memory access.

MIPS

Mable. Mable is an instruction-set emulator that uses iterative interpretation and simulates a MIPS architecture. Mable stores the emulated register set in memory as a virtual register data structure.

SPIM. Another instruction-set emulator, called SPIM [70], avoids the cost of repeatedly decoding instructions by saving pre-decoded instructions in a special table. A pre-decoded instruction typically includes a pointer to the handler for that instruction. SPIM reads and translates a MIPS executable to an intermediate representation, looks up pre-decoded instructions, and then emulates the instructions.

MINT. MINT runs on Silicon Graphics computers, DECstations, and SPARC architectures and interprets MIPS instructions. The tool is a trace generator for shared-memory multiprocessor simulators that also uses a form of pre-decoded interpretation (like SPIM).

Pixie. The first binary instrumentation tool that received widespread use is Pixie. The tool is a full execution trace generator and runs on MIPS-based systems. Pixie can generate traces of dynamically linked as well as statically linked code but does not record kernel activity. The instruction and/or data trace generated by Pixie is written to a file descriptor. Using another tool called makepipe, the trace can be piped to a trace consumer program such as a memory simulator. Dinero IV [1], Cache2000, Cheetah, and Tycho are examples of cache simulators built on top of the Pixie trace collector. In an attempt to lower the run-time overhead of Pixie, another tool called Nixie was proposed. However, because Nixie makes compiler-based assumptions about code structure, it can instrument fewer applications than Pixie.

Tango. Developed at Stanford, the multitasking simulator Tango is based on Unix shared memory and uses Unix context switches to switch from executing one process to another. Later on, the tool was rewritten to use a lightweight thread package. Tango requires all shared memory to be dynamically allocated. In other words, all accesses to shared global variables require two memory accesses, a fact that increases the reference rate of an application. TangoLite is a successor of Tango that minimizes the effects of time dilation by determining event order through event-driven simulation.

MemSpy. A memory simulation and analysis tool built on top of the TangoLite trace collector is MemSpy. MemSpy is based on the observation that a cache hit, unlike a cache miss, typically does not require any updates to a cache's contents. Therefore, the tool tests for a cache hit before invoking the full cache simulator.

Tapeworm. Tapeworm is a trap-driven TLB simulator that relies on the fact that all TLB misses on a MIPS-based DECstation are handled by software in the operating system kernel. The tool modifies the standard software handler for TLB misses to get the relevant information after each miss. Because it is a trap-driven TLB simulator, Tapeworm operates at the granularity of a memory page. Tapeworm II, the second generation of the Tapeworm simulator, demonstrates that trap-driven simulation can also monitor multiprocess and operating system workloads. The tool relies on the fact that each memory location has ECC check bits and causes kernel traps by modifying them. Hence, Tapeworm II operates at a granularity smaller than a memory page.

Alpha

ATOM. The Analysis Tool with Object Modification (ATOM) allows a user to build her own customized instrumentation and analysis tools [102]. ATOM provides library routines that give the user access to each procedure in an application, each basic block in a procedure, and each instruction in a basic block. By indicating where instrumentation code should go and which information is to be gathered at each instrumentation point, one can use ATOM to access the dynamic information of an application. In addition to instrumentation routines, the user can also write analysis (simulation) routines. Both instrumented code and analysis code run in the same address space. ATOM is implemented on top of a link-time modification system called OM and works by translating an Alpha executable into OM's symbolic intermediate representation. Through some extensions to OM, ATOM inserts instrumentation procedure calls at appropriate points in the application code, optimizes the instrumentation interface, and translates the symbolic intermediate representation back into an Alpha executable.

ALTO. ALTO develops whole-program data flow analysis and code optimization techniques for link-time program optimization. The current system targets DEC Alpha architectures. ALTO produces code that is typically faster than that produced by DEC's OM link-time optimizer.

IBM RS/6000

Goblin. The only trace generator that instruments applications running on IBM RS/6000 architectures is Goblin. The tool annotates code at the basic block level, i.e., code is added prior to each basic block to report block execution. Since storing large trace files is difficult, Goblin uses a library to perform on-the-fly basic block statistics calculations so that the whole trace need not be saved.

Full System Simulators

Besides trace-driven simulation, execution-driven simulation is a very useful technique for modeling memory system performance. This technique, which is the most accurate and most costly of the simulation techniques, requires instruction and I/O emulators to reproduce program computation. Through a detailed simulation of the memory system and processor pipeline, this technique provides access to all data produced and consumed during program execution. Execution-driven simulation is implemented by most of the full system simulators, which are briefly presented in the following paragraphs.

g88 models a uniprocessor Motorola 88100 based system and can boot the Unix operating system. gsim is based on the g88 tool and includes support for multiple processors with shared physical memory. Another multiprocessor simulator that models Motorola 88100 systems is Talisman. Both gsim and Talisman are complete system simulators that model caches and memory management units, as well as I/O devices. The two simulators are instruction-set emulators that pre-decode instructions lazily, as they are executed for the first time.

Simics, a successor of gsim, is a system-level instruction set simulator capable of simulating high-end target systems with sufficient fidelity and speed to boot and run operating systems and commercial workloads. Simics [79] can simulate a variety of target systems, including x86 and x86-64, SPARC V9, PowerPC, Alpha, IPF, MIPS, and ARM architectures. Simics can model both uniprocessor and multiprocessor systems, as well as clusters and networks of systems. The simulator can boot and run unmodified commercial workloads including Solaris, Red Hat Linux, Tru64, VxWorks, and Windows 2000/NT.

SimOS, developed at Stanford, models hardware similar to that of systems sold by Silicon Graphics and Digital Equipment Corporation [54]. The main component simulated by SimOS is the microprocessor. SimOS currently provides models of the MIPS R4000, MIPS R10000, and Digital Alpha processor families. In addition to the CPU, SimOS simulates caches, multiprocessor memory buses, disk drives, Ethernet, consoles, and other devices commonly found on modern machines. The operating systems that have been ported to the SimOS environment are IRIX and Digital UNIX.

Rsim is an architecture simulator for shared-memory systems built from processors that aggressively exploit instruction-level parallelism. The simulator interprets application executables and not data traces. Rsim converts SPARC V9 instructions into an expanded instruction set format. However, SimOS does not model any of the real systems.

SimpleScalar offers an infrastructure for simulation and architectural modeling. SimpleScalar [17] reproduces computing device operations by executing all program instructions using an interpreter. The tool set includes several instruction interpreters for several popular instruction sets, such as Alpha, PowerPC, x86, and ARM.

Conclusions. Several criteria can be used when selecting instrumentation and simulation tools (e.g., cost, speed, portability, flexibility). When both speed and accuracy are taken into consideration, no single method is a clear winner. Simple methods are relatively fast, but the traces they produce are inaccurate. Sophisticated methods gather accurate information at the cost of long execution times. When selecting tools to analyze and model memory system performance, the most important factor to keep in mind should be the balance between speed, accuracy, and cost.

4.3 Two Simple Models for Execution Prediction

Initial work on performance prediction characterizes memory performance with a single parameter, namely the memory access latency. For instance, the Average Access Time (AAT) model [53], briefly discussed in Section 4.2.1, predicts a program's performance by multiplying the number of cache hits and misses during the program execution by the hit and miss times. In other words, for a given program, AAT is defined in terms of the time to hit in the cache/memory, the miss penalty, and the miss rates for reads and writes. For many applications, modern processors can rearrange data accesses such that computation and data fetching (from lower memory levels) overlap. As different applications have different degrees of overlap between computation and data fetching on different architectures, predicting an application's performance is difficult. Therefore, as processors get more and more sophisticated, characterizing the memory system with a single parameter becomes insufficient and another measure is called for.

This thesis proposes two simple analytical models to predict the performance of large applications. The set of programs to model is restricted to regular and irregular applications, and memory performance is characterized by more than one parameter. The rest of this chapter first describes the performance models in detail, then presents the key parameters for a real target machine, and compares the estimates produced by the models with the estimates of the AAT model.

The new performance models focus on an application's interaction with the memory system and compute the execution time of a program as the sum of the times the program spends at each memory level. In other words,

T = \sum_{i=1}^{n} T_{total_i}    (4.1)

where T_{total_i} is the total time the program spends at level i of the memory system.

An application is characterized by its memory access types (e.g., strided or random accesses) and their frequency during the application's execution. According to the models proposed in this thesis, the time spent by an application at a memory level is a linear combination of the number of times each access type appears at that level multiplied by the time to execute that memory access. Clearly,

T_{total_i} = \sum_{j=1}^{s} N_{ij} \times T_{ij}    (4.2)

where i is the memory level we consider, N_{ij} is the number of times access type j appears at level i, and T_{ij} is the time to execute a memory access of type j at memory level i.

Putting it together, the execution time of a program is

T = \sum_{i=1}^{n} T_{total_i} = \sum_{i=1}^{n} \sum_{j=1}^{s} N_{ij} \times T_{ij}    (4.3)

where i iterates over the memory levels (e.g., L1, L2, DRAM) and j over the access types. Read and write times are considered separately.

The first model, called MSP-RA, uses Memory System Performance for Regular Applications to predict the performance of a regular application. An application is called regular if at least 80% of its non-continuous accesses are strided. (The remaining accesses, at most 20%, may be indexed array accesses.) The performance of regular codes is given by the performance of continuous and strided accesses. (Regular accesses are also called affine array accesses [19].) Because regular codes have two kinds of accesses, which means s = 2 in Eq. (4.3), the run-time of a regular application is given by

T = \sum_{i=1}^{n} T_{total_i} = \sum_{i=1}^{n} \sum_{j=1}^{2} N_{ij} \times T_{ij}    (4.4)

where N_{i1} and N_{i2} are the numbers of continuous and strided accesses at level i of the memory system, and T_{i1} and T_{i2} are the times to execute a continuous and a strided access at the same memory level i.

The second performance model, called MSP-IA, uses Memory System Performance for Irregular Applications to predict the run time of an irregular application. For such an application, memory performance is given by the performance of continuous accesses, accesses within the same L1/L2 cache line, and random accesses. (Random accesses are also known as pointer-chasing accesses [19].) Because irregular codes have three kinds of accesses, which means s = 3 in Eq. (4.3), the run-time of an irregular application is given by

T = \sum_{i=1}^{n} T_{total_i} = \sum_{i=1}^{n} \sum_{j=1}^{3} N_{ij} \times T_{ij}    (4.5)

where N_{i1}, N_{i2}, and N_{i3} are the numbers of continuous accesses, accesses within the same L1/L2 cache line, and random accesses at level i of the memory system, and T_{i1}, T_{i2}, and T_{i3} are the times to execute a continuous access, an access within the same cache line, and a random access at the same memory level i.
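The following sketch shows how Eq. (4.3) reduces to a double summation in practice. The access counts are hypothetical; the per-access times are the read values for the MSP-RA access types taken from Table 4.1.

/* Sketch: evaluating T = sum over levels i and types j of N_ij * T_ij. */
#include <stdio.h>

#define LEVELS 3   /* L1, L2, DRAM                     */
#define TYPES  2   /* MSP-RA: continuous, strided      */

int main(void)
{
    /* N[i][j]: number of accesses of type j observed at level i (hypothetical). */
    double N[LEVELS][TYPES] = { { 5e8, 1e8 }, { 4e7, 2e7 }, { 1e6, 5e5 } };
    /* T[i][j]: cycles per access of type j at level i (read values, Table 4.1). */
    double T[LEVELS][TYPES] = { { 1.51, 1.55 }, { 5.24, 4.92 }, { 19.14, 57.86 } };

    double cycles = 0.0;
    for (int i = 0; i < LEVELS; i++)
        for (int j = 0; j < TYPES; j++)
            cycles += N[i][j] * T[i][j];

    printf("predicted memory time: %.3f s at 500 MHz\n", cycles / 500e6);
    return 0;
}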

4.4 Target System

To allow a meaningful evaluation of the models proposed in this thesis, we use as a target system a Sun Blade 100 workstation, since a variety of good tools are available in source form. (Availability of the sources allowed us to obtain additional parameters of memory accesses, like the strides.) This system has a 16 KB L1 data cache (32-byte blocks, direct mapped), a 256 KB L2 cache (64-byte blocks, 4-way set associative), 256 MB DRAM, and is based on an UltraSPARC-II processor at 500 MHz.

The memory system performance is measured by the memperf benchmark [105], which was briefly described in Section 4.2.2. Although the memory characterization is done by this benchmark, the models proposed in this thesis are not tied to this specific benchmark. Other benchmark suites, such as that of Hristea et al. [55] and lmbench [81], may be used to gather the performance numbers the models require.

The memory performance is measured as access bandwidth for different strides and working set sizes. The performance of continuous accesses (or the pipelined bandwidth) is the memory throughput when memperf performs continuous loads or stores (i.e., accesses with stride 1). The performance of strided accesses is the average over the memory throughputs when data is accessed with small stride sizes, that is, strides with values between 2 and 7. (Small strides capture the performance of overlapped transfers.) The performance of accesses within the same L1/L2 cache line is the average over the memory throughputs when data is accessed with stride sizes that fit into the L1/L2 cache. (The prefetch effect of gathering a whole line increases the hit rate of accesses within the same cache line.) The average over the memory throughputs when data is accessed with large stride sizes (ranging from 12 to 192) gives the performance of random accesses. (Large strides defeat the aggressive overlap of cache misses supported by many microprocessors [12].) McCalpin's STREAM benchmark [80] also gathers the memory pipelined bandwidth. Furthermore, McVoy's lmbench [81] measures the pipelined bandwidth and the random read latency using linked-list pointer chasing.

Since memperf reads doubles (8 bytes), we use the following formulas to express the transfer performance in cycles or seconds per access:

cycles = \frac{8\ \mathrm{bytes} \times \mathit{clock\ frequency}}{\mathit{bandwidth}}

seconds = \frac{8\ \mathrm{bytes}}{\mathit{bandwidth}}

For instance, on a Sun Blade 100 (500 MHz), the measured L1 cache pipelined bandwidth for continuous reads is about 2648 MB/s. In other words, on this machine a sequential 8-byte load from the L1 cache needs about 1.5 clock cycles, or 3 ns:

\frac{8\ \mathrm{bytes} \times (500 \times 10^6\ \mathrm{1/s})}{2648 \times 10^6\ \mathrm{bytes/s}} \approx 1.5\ \mathrm{cycles}

\frac{8\ \mathrm{bytes}}{2648 \times 10^6\ \mathrm{bytes/s}} \approx 3\ \mathrm{ns}
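The same conversion can be expressed as a small helper; the function names below are ours, not part of memperf.

/* Convert a measured bandwidth (MB/s) into cycles and ns per 8-byte access. */
#include <stdio.h>

static double cycles_per_access(double mbps, double clock_mhz)
{
    return 8.0 * clock_mhz / mbps;      /* 8 bytes * frequency / bandwidth */
}

static double ns_per_access(double mbps)
{
    return 8.0 * 1e3 / mbps;            /* 8 bytes / bandwidth, in ns      */
}

int main(void)
{
    /* L1 pipelined read bandwidth measured on the Sun Blade 100: 2648 MB/s. */
    printf("%.2f cycles, %.2f ns\n",
           cycles_per_access(2648.0, 500.0), ns_per_access(2648.0));
    return 0;
}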

Table 4.1 lists the measured performance of the Sun Blade 100 machine for different types of memory accesses at different levels of the memory hierarchy. For this system, the working set sizes that capture the performance of the L1 cache, the L2 cache, and DRAM are 16 KB, 256 KB, and 8 MB.

Data Accuracy

We use McCalpin's STREAM benchmark to partially validate the performance data measured with memperf. STREAM computes memory pipeline performance on four unit-stride floating-point vector codes.

                                 L1 cache        L2 cache         DRAM
Access type                    Read   Write    Read   Write     Read    Write

Memory system performance for regular applications
continuous                     1.51    2.73    5.24    3.58    19.14    32.89
strided                        1.55    2.79    4.92    2.75    57.86   123.44

Memory system performance for irregular applications
continuous                     1.51    2.73    5.24    3.58    19.14    32.89
same L1 cache line             1.54    2.76    6.08    4.16    37.01    73.96
random                         2.89    4.18   16.31   27.23   101.05   197.48

Table 4.1: Sun Blade 100: read and write performance [cycles] for different types of memory accesses.

transfer rate in the absence of arithmetic. In other words, the copy operation measures main memory bandwidth for a continuous load followed by a continuous store. The copy perfor¬ mance of Sun Blade 100 measured by STREAM is 83 MB/s. On the other hand, the load copy mode of the memperf benchmark provides the same information as the STREAM copy oper¬ ation. The measured performance of a continuous load followed by a continuous store when memperf exceeds the working set size of the L1/L2 caches on a Sun Blade 100 is 83 MB/s, which agrees with the number reported by the STREAM benchmark.

4.5 Sample Applications

4.5.1 SMV (Symbolic Model Verifier)

Model checking tools are used for formal verification of finite state machines, and have been successful in finding subtle errors in complex system designs. However, the high space requirements of model checkers limit their applicability to large designs. One of the most successful approaches to reducing the model checker's space requirements is the Symbolic Model Verifier (SMV). Based on Binary Decision Diagrams (BDDs), SMV has made model checking applicable to industrial designs of medium size.

SMV is a complex C program with high CPU and memory requirements. Because it uses dynamically allocated data structures (DAGs), SMV has a highly irregular access pattern and its memory accesses are dependent on one another. We use Yang's SMV implementation [113] since it shows superior performance over other implementations [117]. We select 8 inputs; 6 of them are models commonly used in benchmark studies [117] and the other 2 inputs model the FireWire system [95]. Table 4.2 presents memory and run-time characteristics of the selected SMV models.

SMV model      Run time [s]   Memory [MB]
semaphore      0.32           43.55
counter        0.30           43.62
pci3p          1.01           44.12
dmel           1.87           47.77
abp8           16.45          79.87
node-3-3-2     24.29          160.34
idle           35.55          159.96
node-3-3-3     71.87          174.93

Table 4.2: SMV models: memory and run time characteristics on Sun Blade 100.

4.5.2 CHARMM (Chemistry at HARvard Molecular Mechanics)

CHARMM [22] is a macromolecular simulator that has been optimized to fully benefit from L1/L2 cache performance. Written in FORTRAN, CHARMM accesses its statically allocated data (arrays) in a regular fashion, performing strided and indexed array accesses. (Indexed array accesses are irregular accesses that are independent from one another, in contrast to pointer-chasing accesses, which are also irregular but are dependent on one another [19].) We select 8 different input files that are distributed with the CHARMM package. Table 4.3 presents memory and run-time characteristics of the selected CHARMM inputs.

CHARMM input   Run time [s]   Memory [MB]
gener          1.01           58.22
nbond          1.15           58.22
im             1.32           58.22
brb            1.42           58.23
dynl           1.83           58.27
ener           2.04           58.24
imh2o          2.20           58.25
djs            6.11           58.25

Table 4.3: CHARMM inputs: memory and run time characteristics on Sun Blade 100.

4.5.3 NS2 (Network Simulator)

NS2 is a discrete event simulator targeted at networking research [85]. NS2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. The amount of memory used by a simulation is determined by the number of nodes and traffic connections being simulated. Therefore, simulations of scenarios that involve very large sets of nodes and many traffic connections require a significant amount of memory to hold the data sets. The simulation run-time is mainly dictated by the protocol being simulated and the simulated time interval. Written in C++, NS2 has a highly irregular memory access pattern. We select 4 different inputs (or network protocol models) that are distributed with the NS2 package. Table 4.4 presents memory and run-time characteristics of the selected NS2 inputs.

NS2 test      Run time [s]   Memory [MB]
diffusion     7.12           14.82
shadowing     69.49          15.13
aodv          102.71         16.68
tdma          126.64         90.40

Table 4.4: NS2 inputs: memory and run time characteristics on Sun Blade 100.

Data Accuracy

This thesis characterizes an application by the number of its memory accesses and the access type (or size) at each memory level. The memory access type, or the distance between two consecutive memory accesses, can be extracted from a program's address trace. To gather an address trace, we use Sun's trace generator, shade. The tool is distributed with the cachesim5 cache simulator, which we extended to extract the stride information needed by the new performance models. We use performance counters to partially validate data gathered by the shade tool. (Although the hardware counters can gather the number of read/write cache hits/misses, they cannot provide the access type information.) To access the performance counters, we use the cputrack library. The events we measure are the number of reads and writes from/to the L1 and L2 caches. Compared to data gathered by the performance counters, the error range of the shade measurements is ±10%.

4.6 Experimental Results

4.6.1 MSP-RA Prediction Model

The performance of a regular application (e.g., CHARMM) is computed by the MSP-RA model and is given by Eq.(4.4), which on Sun Blade 100 becomes

T = Σ_{j=1}^{2} (N_L1,j × T_L1,j + N_L2,j × T_L2,j + N_M,j × T_M,j)    (4.6)

where j = 1 denotes continuous accesses and j = 2 strided accesses.
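A small C sketch of how Eq. (4.6) is evaluated. The per-access times are taken from the read columns of Table 4.1 for illustration only, and the access counts are placeholders rather than measured CHARMM data; in practice the counts come from the extended cache simulator.

    /* Evaluate Eq. (4.6): index 0 denotes continuous accesses, index 1
     * strided accesses.  Counts are placeholders; times (in cycles) are
     * the read values of Table 4.1, used here only for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double N_L1[2] = { 1.0e9, 2.0e8 };   /* L1 hits: continuous, strided */
        double N_L2[2] = { 5.0e7, 1.0e7 };   /* L2 hits                      */
        double N_M [2] = { 1.0e6, 4.0e5 };   /* DRAM accesses                */

        double T_L1[2] = { 1.51, 1.55 };
        double T_L2[2] = { 5.24, 4.92 };
        double T_M [2] = { 19.14, 57.86 };

        double T = 0.0;
        for (int j = 0; j < 2; j++)
            T += N_L1[j] * T_L1[j] + N_L2[j] * T_L2[j] + N_M[j] * T_M[j];

        printf("predicted execution time: %.3e cycles\n", T);
        return 0;
    }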

To validate the MSP-RA performance model, we compare its estimates for the selected CHARMM inputs running on a Sun Blade 100 machine with measurements of the same application executing on the same machine. The measurements, summarized in Figure 4.1, show that the MSP-RA model succeeds in capturing CHARMM's characteristics and effectively maps them to the machine characterization (with an error range of ±10%).

For the same CHARMM inputs, the AAT model overestimates the real execution time, as shown in Figure 4.2. The overestimation can be explained by CHARMM's access pattern. Because most of CHARMM's memory accesses are array references, which can be overlapped by most modern processors, the performance of CHARMM's memory accesses is better than the average time to access memory. Hence, for this application, AAT is too conservative a characterization of the memory system.

[Figure: prediction error for each CHARMM input]

Figure 4.1: MSP-RA prediction error.

[Figure: AAT and MSP-RA execution time estimates for each CHARMM input]

Figure 4.2: AAT and MSP-RA estimates.

[Figure: prediction error for each SMV model]

Figure 4.3: MSP-IA prediction error.

4.6.2 MSP-IA Prediction Model

The performance of an irregular application (e.g., SMV and NS2) is computed by the MSP-IA model and is given by Eq.(4.5), which on Sun Blade 100 becomes

T = Σ_{j=1}^{3} (N_L1,j × T_L1,j + N_L2,j × T_L2,j + N_M,j × T_M,j)    (4.7)

where j = 1 denotes continuous accesses, j = 2 accesses within the same L1/L2 cache line, and j = 3 random accesses.

To validate the MSP-IA performance model, we compare its estimates for the selected SMV models and NS2 inputs running on a Sun Blade 100 machine with measurements of the same applications executing on the same machine. The measurements, summarized in Figure 4.3 and Figure 4.4, show that the MSP-IA model accurately predicts the performance of these two applications (with an error range of ±10%).

For the same SMV models and NS2 inputs, the AAT model underestimates the real execution time, as shown in Figure 4.5 and Figure 4.6. The underestimation can be explained by the highly irregular access patterns of both the SMV and NS2 applications. Because the memory accesses are dependent and cannot be overlapped, their performance is worse than the average time to access memory. Hence, for these applications, AAT is too optimistic an approach.

Discussion

The experimental results show that the average access time (AAT) is too conservative a machine characterization for regular applications and too optimistic for irregular codes. On the other hand, AAT might be the right machine characterization for mixed codes. However, modeling mixed codes is complicated, as the relative importance of the regular and irregular accesses changes dynamically as the program executes.

[Figure: prediction error for each NS2 input]

Figure 4.4: MSP-IA prediction error.

[Figure: AAT and MSP-IA execution time estimates for each SMV model]

Figure 4.5: AAT and MSP-IA estimates.

[Figure: AAT and MSP-IA execution time estimates for each NS2 input]

Figure 4.6: AAT and MSP-IA estimates.

To further investigate whether more complex or simpler performance models can provide better performance estimates, we perform two experiments. First, we extend the MSP-RA performance model, which models the performance of continuous and strided accesses, to also include the performance of accesses within the same cache line. Figure 4.7 shows that extending the MSP-RA model does not improve prediction accuracy much. The small increase in prediction accuracy can be explained by the fact that the initial MSP-RA model already captures the spatial locality of regular memory accesses. Therefore, MSP-RA is quite accurate and there is no need to extend it.

In the second experiment, we simplify the MSP-IA performance model and consider all non-continuous accesses as random accesses. The simplified MSP-IA model considers continuous and random accesses only, and does not model the accesses within the same cache line separately. Figure 4.8 shows that simplifying the MSP-IA model decreases the prediction accuracy considerably. The accuracy decrease can be explained by the fact that for the selected SMV models most of the DRAM writes are writes to data within the same cache line (see Figure 4.9 and Figure 4.10). Because the prefetch effect of gathering a whole line increases the performance of accesses within the same cache line, these accesses have a higher transfer bandwidth than random accesses. Therefore, to accurately estimate the performance of irregular codes, MSP-IA has to consider three types of memory accesses.

These experiments show that simplifying the MSP-IA model increases the prediction error considerably, and extending the MSP-RA model doesn't improve prediction accuracy much.

[Figure: MSP-RA prediction error with 2 vs. 3 access types, for each CHARMM input]

Figure 4.7: MSP-RA prediction error.

[Figure: MSP-IA prediction error with 2 vs. 3 access types, for each SMV model]

Figure 4.8: MSP-IA prediction error.

[Figure: distribution of continuous, same-cache-line, and outside-cache-line accesses for each SMV model]

Figure 4.9: Access type distribution for DRAM reads.

[Figure: distribution of continuous, same-cache-line, and outside-cache-line accesses for each SMV model]

Figure 4.10: Access type distribution for DRAM writes.

4.7 Summary

This chapter presents two approaches for predicting the performance of programs that depend heavily on the speed of contiguous, strided, and random memory access streams. Both prediction models compute the execution time of a program as the sum of the times the program spends at each memory level. The time a program spends at a memory level is a linear combination of the number of times each access type appears at that level multiplied by the time to execute that access. The speed of contiguous, strided, and random memory accesses can be obtained through measurements on real machines, or through other means for designs that are not yet realized.

The increasing complexity of current and future microprocessors makes performance prediction of real applications difficult. A model that predicts performance for a large class of applications must capture all system details that have a non-negligible impact on an application's performance yet be simple enough to allow fast evaluation. To allow simple models, we restrict our attention to two classes of programs, namely regular and irregular programs. Membership in these classes is defined by the memory access pattern of an application (for regular applications, 80+% of their non-contiguous accesses are strided; for irregular applications, 80+% of their non-contiguous accesses are random). Since applications are classified according to their memory access patterns, it is not surprising that simple models with few parameters based on memory access properties perform well. Other models, or detailed simulations, are without doubt useful in specific scenarios, but the simple models presented here allow for a performance evaluation of real applications with large data spaces.

Whereas a machine characterization used together with a characterization of regular codes might want to include a large number of stride values, good results can be obtained from a simplified model that relies only on two data points: continuous and strided (regular) accesses. Furthermore, to accurately predict the execution time of an irregular application (written in C/C++) it is sufficient to characterize machine performance for three groups of access strides: continuous accesses, accesses within a cache line, and random accesses.

The methods require information that is easy to acquire. The memperf benchmark gathers the time to execute different access types on a memory system. The shade tool provides the memory access demands of an application, i.e. the number of times each access type appears at each level of the memory hierarchy. We have successfully applied the prediction models to executions of CHARMM, SMV, and NS2 on a real SPARC system with a three-level memory hierarchy.

The proposed performance models are relatively simple and do not require object code analysis. The errors of their performance estimates are in the range of ±10%; the models strive to find a reasonable compromise between the simulation time and the accuracy of the performance estimates they produce. These models are therefore useful for performance studies that involve complete, complex programs with large data sets.

5 Adaptation

As described in Chapter 2, the idea of memory compression (setting aside part of main memory to hold compressed data) has been investigated in several projects. One of the thorny issues is that sizing the region that holds compressed data is difficult and if not done right (i.e., the region is either too large or too small) memory compression slows down the application. Therefore, the objective of the adaptation process must be to find the compressed region size that can improve an application's performance, including the case when there is no need for a compressed region.

This chapter describes our approach to finding out when and how to adapt. As described in Section 5.1, understanding which applications under which conditions can benefit from main memory compression is complicated due to various tradeoffs and dynamic characteristics of applications. Section 5.2.1 describes an analytical model that states the conditions for a compressed-memory system to yield performance improvements. The model relies on a few data points, such as the efficiency of the compression algorithm, the amount of data being compressed, and an application's memory access pattern. Section 5.2.2 shows how the proposed model can be used to compute the compressed region size that can improve an application's performance. Section 5.4 details the decision-making process and answers the questions of when and how to adapt. Section 5.4.1 describes the process dynamics, and Section 5.4.2 shows how application-specific information needed by the decision process can be gathered efficiently. Section 5.4.3 briefly touches on other adaptation-related issues and Section 5.6 summarizes the chapter.

5.1 Introduction

As described in Chapter 2, compression has been used in many settings to increase the effective size of a storage device or to increase the effective bandwidth, and other researchers have proposed to build a compressed-memory system by integrating compression into the memory hierarchy. The basic idea of a compressed-memory system is to set aside a part of main memory to hold compressed pages; we call this level the compressed region. Then, instead of swapping a page P from the uncompressed region to disk (when its memory is needed), the evicted page P is compressed (producing Pcomp) and Pcomp is kept in the compressed region. If Pcomp is needed again, it is decompressed and moved to the uncompressed region. As long as compression and decompression of P (resp. Pcomp) take less time than swapping P in and out, there exists the opportunity to improve overall system performance. Of course, a number of conditions must be met: setting aside part of the main memory to hold compressed pages reduces the effective size of the application's main memory area and increases the number of page faults. And if the compressed region is not large enough to hold a sufficient number of compressed pages, some compressed pages must be evicted to disk, increasing the overhead of the memory system.


The potential benefits of main memory compression depend on the relationship between the size of the compressed region, an application's compression ratio, and an application's access pattern. Because accesses to compressed pages take longer than accesses to uncompressed pages, compressing too much data decreases an application's performance. Moreover, if an application's pages do not compress well, compression will not show any benefit. Furthermore, if an application accesses its data set such that compression does not save enough accesses to disk, memory compression will slow down the application.

To accurately decide how much data to compress during an application's execution, a compressed-memory system should rely on a performance model for predicting application performance for various sizes of the compressed region. As described in Chapter 4, accurate prediction models require a lot of information about applications and systems and have high computational demands. On the other hand, simple models are fast but produce less accurate predictions than complex models do. For two memory-intensive applications executing on a compressed-memory system that does not model performance, the measurements presented in Section 5.1.1 show that main memory compression does not improve performance much. Therefore, because on the system under study the performance improvements are small, a key requirement for a performance model to work in practice is to have low computational demands.

5.1.1 Performance Potential of Main Memory Compression

To assess the benefits of main memory (data) compression, we examine the performance of two applications that use large data sets. We use a Pentium 4 PC at 1.9 GHz to generate the memory reference string of the selected applications. To measure an application's performance on a compressed-memory system, we use the compressed-memory prototype described in Chapter 3. We select two programs that are simulators with different input sets, namely the SMV model checker [113] and the NS2 network simulator [85]. We use Yang's SMV implementation since it demonstrated superior performance over other implementations [117]. All three SMV inputs we select model the FireWire protocol [96]. Although the three SMV inputs have the same memory footprint of 625 MB, they perform different amounts of computation. We select four NS2 simulations that simulate the DSR protocol over a wireless network of 2500 nodes (the first two simulations) and 3000 nodes (the other two simulations). The memory footprint of the first two simulations is 600 MB and that of the last two simulations is 750 MB.

For each input set we select memory sizes for which an application's execution time is at least twice as slow, but no more than 100 times slower than its execution time without thrashing. When memory available is less than memory required to hold an application's working set, the 1.9 GHz CPU spends 1-99% of the total running time paging. We configure the system such that the amount of memory available is 95%, 90%, and 80% of memory required to run an application without thrashing and the size of the compressed region is 5%, 10%, and 20% of memory available. The measurements show that when compression is enabled, the execution of most of the SMV models is 40-80% slower than without compression. However, one of the SMV models executes 18-35% faster with compression than without. Also for the selected NS2 inputs, the results are mixed: main memory compression degrades performance of some inputs by 4-17%, and improves performance of other inputs by 10-29% compared to a system without compression.

The measurements show that software compression can both improve and degrade performance for applications that have large data sets. Moreover, for the selected applications, even when compression is beneficial, the performance improvement is low. Therefore, for the applications we select, the performance model for determining the compressed region size that can improve an application's performance must be fast (and therefore simple).

5.2 Cost/Benefit Analysis of Main Memory Compression

5.2.1 Performance Model for a Compressed-Memory System

The performance model described in Chapter 4 computes the execution time (T) of a memory intensive application as the sum of the times the application spends at each memory level. The time spent at a memory level i (T_total,i) is a linear combination of the number of hits at that level (N_i) multiplied by the access time to that memory level (T_i). Formally, an application's execution time is

T = Σ_{i=1}^{n} T_total,i = Σ_{i=1}^{n} N_i × T_i    (5.1)

where i iterates over the memory levels (e.g., L1, L2, DRAM).

Given the complexity of out-of-order execution processors, there is no simple characterization of the memory system performance with a single parameter (T_i). In this chapter we use and extend the performance model described in Chapter 4, and for each memory level we compute an application's access time as the weighted average of the time spent executing continuous accesses (T_cont), accesses within the same cache line (T_sameCL), and non-continuous accesses (T_non-cont). This definition of an application's access time captures a wide range of processor optimizations, such as how many of an application's accesses to cache, memory and disk are overlapped [27].

Formally, an application's read access time to a memory level is

T_Rd = (N_Rd,cont / N_Rd) × T_Rd,cont + (N_Rd,sameCL / N_Rd) × T_Rd,sameCL + (N_Rd,non-cont / N_Rd) × T_Rd,non-cont    (5.2)

where N_Rd = N_Rd,cont + N_Rd,sameCL + N_Rd,non-cont.

An application's write access time (T_Wr) is computed in a similar way and is given by the following formula

T_Wr = (N_Wr,cont / N_Wr) × T_Wr,cont + (N_Wr,sameCL / N_Wr) × T_Wr,sameCL + (N_Wr,non-cont / N_Wr) × T_Wr,non-cont    (5.3)

where N_Wr = N_Wr,cont + N_Wr,sameCL + N_Wr,non-cont.

An application's access time to a memory level i (T_i) is the weighted average of its read and write access times (T_i,Rd and T_i,Wr), and is

T_i = (N_i,Rd / N_i) × T_i,Rd + (N_i,Wr / N_i) × T_i,Wr    (5.4)

where N_i = N_i,Rd + N_i,Wr, and T_i,Rd and T_i,Wr are computed by Eq. 5.2 and Eq. 5.3.

We use Eq. 5.4 to compute an application's access time to the L1 cache (T_L1), the L2 cache (T_L2), and DRAM (T_DRAM). As we do not investigate changes to the L1 and L2 caches, we consider them a single cache level (L12). Formally, the time an application spends at the (L1 and L2) cache level is

T_L12 = N_L1 × T_L1 + N_L2 × T_L2

where N_L1 and N_L2 are the L1 and L2 cache hits (reads and writes), and T_L1 and T_L2 are the L1 and L2 cache access times (reads and writes).

An application's DRAM access time is computed as

T_DRAM = (N_DRAM,Rd / N_DRAM) × T_DRAM,Rd + (N_DRAM,Wr / N_DRAM) × T_DRAM,Wr    (5.5)

where N_DRAM = N_DRAM,Rd + N_DRAM,Wr, and T_DRAM,Rd and T_DRAM,Wr are computed by Eq. 5.2 and Eq. 5.3.

An application's disk access time (T_Disk) also includes the application characteristics, and is computed as

T_Disk = (N_Disk,seq / N_Disk) × T_Disk,seq + (N_Disk,non-seq / N_Disk) × T_Disk,non-seq    (5.6)

where T_Disk,seq is the performance of a sequential disk access, T_Disk,non-seq is the average time of a random disk access, and N_Disk = N_Disk,seq + N_Disk,non-seq.
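The weighted averages of Eqs. 5.2-5.6 all have the same shape, a sum of (N_k / N) × T_k terms. The C sketch below factors that out; the access counts are placeholders, and only the two disk times are real values (taken from Section 5.3.1).

    /* Weighted-average access times as in Eqs. 5.2-5.6.  All counts are
     * placeholders; only the two disk times are real (Section 5.3.1). */
    #include <stdio.h>

    static double weighted_time(const double *N, const double *T, int n)
    {
        double num = 0.0, den = 0.0;
        for (int k = 0; k < n; k++) {
            num += N[k] * T[k];
            den += N[k];
        }
        return den > 0.0 ? num / den : 0.0;
    }

    int main(void)
    {
        /* Eq. 5.2: DRAM read time from continuous / same-cache-line /
         * non-continuous accesses (placeholder counts, times in ns). */
        double N_rd[3] = { 1e5, 3e5, 6e5 };
        double T_rd[3] = { 10.0, 20.0, 50.0 };

        /* Eq. 5.6: disk time from sequential / non-sequential accesses
         * (placeholder counts; 5.15 ms and 13.05 ms per Section 5.3.1). */
        double N_dsk[2] = { 1.7e3, 8.3e3 };
        double T_dsk[2] = { 5.15, 13.05 };

        printf("T_DRAM,Rd = %.1f ns, T_Disk = %.2f ms\n",
               weighted_time(N_rd, T_rd, 3), weighted_time(N_dsk, T_dsk, 2));
        return 0;
    }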

An application's access time to the compressed region (T_ComprReg) is given by the application's de/compression speed (T_Cspeed) and the time spent waiting for old compressed pages to be swapped to disk. (The wait time appears when the compressed region is full and the rate at which pages are stored in the compressed region is higher than the rate at which pages are retrieved from the compressed region.) An application's de/compression speed includes the application characteristics, and is given by the formula

T_Cspeed = (N_Decompr / N_ComprReg) × T_Decompr + (N_Compr / N_ComprReg) × T_Compr

where T_Decompr and T_Compr are the times to decompress and compress a page, and N_ComprReg = N_Decompr + N_Compr.

We consider a three-level memory hierarchy ("3level-memory hierarchy") with a (L1 and L2) cache level, main memory, and disk. On this system, an application's performance when the amount of physical memory is less than is required to run the application without thrashing (see Fig. 5.1) is computed based on Eq. 5.1 (where n = 3), and is

T_with-swap = T_L12 + N_DRAM × T_DRAM + N_Disk × T_Disk    (5.7)

On a compressed-memory system (see Fig. 5.1) the execution time of the same application is computed using Eq. 5.1 (where n = 4). As both systems (with and without compression) have the same (L1 and L2) caches, an application's reference behavior is the same on both systems. Therefore, the time spent by an application at the (L1 and L2) cache level is also the same on both systems (T_L12). Formally,

T_ComprM = T_L12 + N_UncomprReg × T_DRAM + N_ComprReg × T_ComprReg + N_Disk,ComprM × T_Disk    (5.8)
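The two execution-time estimates can be compared directly once the access counts and times are known. The C sketch below does so for one hypothetical execution; all numbers are placeholders chosen to be consistent with Eq. 5.9, and T_L12 is omitted because it is identical in Eq. 5.7 and Eq. 5.8 and cancels out of the comparison.

    /* Compare Eq. 5.7 and Eq. 5.8 for one hypothetical execution.  All
     * counts and times are placeholders; T_L12 is dropped because it is
     * the same on both systems and cancels in the comparison. */
    #include <stdio.h>

    int main(void)
    {
        double T_DRAM     = 60e-9;     /* per-access times in seconds */
        double T_ComprReg = 20e-6;
        double T_Disk     = 10e-3;

        /* 3level-memory system */
        double N_DRAM = 5.0e7, N_Disk = 2.0e5;

        /* compressed-memory system; N_ComprReg obeys Eq. 5.9 */
        double N_UncomprReg  = 4.8e7;
        double N_Disk_ComprM = 5.0e4;
        double N_ComprReg    = (N_DRAM - N_UncomprReg) + (N_Disk - N_Disk_ComprM);

        double T_with_swap = N_DRAM * T_DRAM + N_Disk * T_Disk;        /* Eq. 5.7 */
        double T_ComprM    = N_UncomprReg * T_DRAM
                           + N_ComprReg * T_ComprReg
                           + N_Disk_ComprM * T_Disk;                   /* Eq. 5.8 */

        printf("without compression: %.0f s, with compression: %.0f s\n",
               T_with_swap, T_ComprM);
        return 0;
    }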

[Figure: the 3level-memory hierarchy (L1/L2 caches, DRAM, disk) next to the compressed-memory hierarchy, in which DRAM is split into an uncompressed region (UncomprReg) and a compressed region (ComprReg)]

Figure 5.1: A 3level-memory hierarchy and a compressed-memory hierarchy.

5.2.2 Compressed Region Size

This section shows how the proposed model can be used to compute the minimum size of the compressed region that can improve an application's performance.

As both the 3level-memory and the compressed-memory system have the same (L1 and L2) caches, an application's number of L2 cache misses is the same on both systems. On the 3level-memory system, the L2 cache misses become DRAM accesses (N_DRAM) and disk accesses (N_Disk), or N_L2misses = N_DRAM + N_Disk. On the compressed-memory system, L2 cache misses are DRAM accesses (N_UncomprReg), compressed region accesses (N_ComprReg), and disk accesses (N_Disk,ComprM); formally, N_L2misses = N_UncomprReg + N_ComprReg + N_Disk,ComprM. After substituting the number of L2 misses with the data above, we have

N_ComprReg = (N_DRAM - N_UncomprReg) + (N_Disk - N_Disk,ComprM)    (5.9)

where N_DRAM > N_UncomprReg and N_Disk > N_Disk,ComprM.

Main memory compression improves an application's performance for sizes of the compressed region for which the condition T_ComprM < T_with-swap is fulfilled. After substituting T_with-swap and T_ComprM with Eq. 5.7 and Eq. 5.8, and N_ComprReg with Eq. 5.9, we have

(N_DRAM - N_UncomprReg) < [(T_Disk - T_ComprReg) / (T_ComprReg - T_DRAM)] × (N_Disk - N_Disk,ComprM)    (5.10)

The term (N_DRAM - N_UncomprReg) in Condition 5.10 is the overhead due to less uncompressed memory and is called "Overhead". On the other hand, [(T_Disk - T_ComprReg) / (T_ComprReg - T_DRAM)] × (N_Disk - N_Disk,ComprM) is the benefit due to fewer disk accesses, and is called "Gain". Another form of Eq. 5.10 is given by

N_ComprReg × (T_ComprReg - T_DRAM) < (N_Disk - N_Disk,ComprM) × (T_Disk - T_DRAM)    (5.11)
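Condition 5.11 is cheap to evaluate at run time, which is what makes it usable inside the adaptation process of Section 5.4. A hedged C sketch of the check (an illustration only, not the kernel code; all inputs are placeholders):

    /* Evaluate Condition 5.11: compression helps when the extra work at
     * the compressed region costs less than the disk work it removes.
     * This is an illustration only, not the thesis implementation. */
    #include <stdio.h>
    #include <stdbool.h>

    static bool compression_beneficial(double N_ComprReg,
                                       double N_Disk, double N_Disk_ComprM,
                                       double T_ComprReg, double T_DRAM, double T_Disk)
    {
        double lhs = N_ComprReg * (T_ComprReg - T_DRAM);
        double rhs = (N_Disk - N_Disk_ComprM) * (T_Disk - T_DRAM);
        return lhs < rhs;                              /* Condition 5.11 */
    }

    int main(void)
    {
        bool ok = compression_beneficial(2.15e6,        /* N_ComprReg             */
                                         2.0e5, 5.0e4,  /* N_Disk, N_Disk,ComprM  */
                                         20e-6,         /* T_ComprReg [s]         */
                                         60e-9,         /* T_DRAM     [s]         */
                                         10e-3);        /* T_Disk     [s]         */
        printf("compression is %sbeneficial\n", ok ? "" : "not ");
        return 0;
    }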

5.2.3 Influence of Application Characteristics

Flynn [39] defines the page miss rate of an application as a function of the amount of memory provided, application memory footprint, and application memory access behavior. He shows

that an application's number of accesses to disk is given by the following formula: h × 10^(-V/z), where V is the fraction of memory required that is available, and h and z are application-dependent constants. The number of DRAM accesses is equal to the number of L2 misses minus the number of disk accesses, or formally N_L2misses - h × 10^(-V/z).
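As the next paragraph describes, h and z can be recovered from two measured (V, N_Disk) points by solving two equations, for example by taking logarithms. The C sketch below shows the fit and one prediction; the measured points and the 60% setting are placeholders, not the thesis data.

    /* Fit Flynn's miss-rate formula N_Disk(V) = h * 10^(-V/z) from two
     * measured points and predict N_Disk for another V.  The points are
     * placeholders; only the functional form comes from the text. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double V1 = 0.80, N1 = 4.0e5;     /* first measured point  (placeholder) */
        double V2 = 0.90, N2 = 9.0e4;     /* second measured point (placeholder) */

        double z = (V2 - V1) / (log10(N1) - log10(N2));
        double h = N1 * pow(10.0, V1 / z);

        double V = 0.60;                  /* fraction of memory available */
        printf("h = %.3g, z = %.3f, predicted N_Disk(V = %.2f) = %.3g\n",
               h, z, V, h * pow(10.0, -V / z));
        return 0;
    }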

To assess the influence of the compressed region size on an application's performance, we select the nodes3-4J2 SMV model. We consider two memory sizes smaller than SMV's memory footprint (or two values of V) and measure the number of disk accesses for these memory sizes. The values of h and z are computed by solving a system of two equations in two unknowns. Then, we set the size of memory available to be 60% of SMV's working set size (V = 0.6). Having h and z, we use Flynn's formula to compute the number of disk accesses for different sizes of the compressed region. The computed values of "Overhead" and "Gain" (in Condition 5.10) for different sizes of the compressed region are depicted in Figure 5.2. Based on this figure, we can make several observations.

First, if the compression speed is low, the value of (T_Disk - T_ComprReg) / (T_ComprReg - T_DRAM) is small and Condition 5.10 is not fulfilled ("Gain" is smaller than "Overhead"). In this case, shown in Figure 5.2(a), compression will slow down an application for all sizes of the compressed region. Second, for faster compression speeds, "Overhead" is smaller than "Gain" for small sizes of the compressed region and bigger for larger sizes, as shown in Figure 5.2(b). In this case, a compressed-memory system improves an application's performance for those sizes of the compressed region that are smaller than the intersection point of the two slopes. Last, fast compression speeds result in big values of (T_Disk - T_ComprReg) / (T_ComprReg - T_DRAM); "Gain" is bigger than "Overhead" and Condition 5.10 is fulfilled. In this case, illustrated by Figure 5.2(c), a compressed-memory system will show a clear benefit for all sizes of the compressed region. Our analysis confirms other researchers' measurements, which showed that devoting too much memory to hold compressed data hurts as much as devoting not enough memory [115]. In addition, our analytical model explains why main memory compression can both improve and degrade an application's performance.

5.3 Validation

5.3.1 Experimental Setup

To generate the memory reference string of an application, we use a commodity PC (Pentium 4 at 1.9 GHz, with an 8 KB L1 data cache, 256 KB L2 cache, and 1 GB DRAM) that has its swap partition on a ST340016A ATA disk. The PC runs the Linux operating system. We investigate the performance of two applications, SMV and NS2, which were briefly described in Section 5.1.1.

To model an application's performance on a compressed-memory system, we use the performance model described in Section 5.2.1. An application's DRAM read and write access time is the weighted average of the time spent executing continuous accesses, accesses within the same cache line, and non-continuous accesses. As SMV and NS2 are pointer-based applications, most of their memory accesses are non-continuous. On the selected PC, for SMV and NS2, the performance of DRAM reads is 50 ns and of DRAM writes is 62.5 ns. Furthermore, an application's DRAM access time is given by Eq. 5.4, and is the weighted average of an application's read and write access times. For SMV, 90% of its DRAM accesses are reads, and for NS2, 33% of the DRAM accesses are reads.

[Figure: "Overhead" and "Gain" accesses plotted against the compressed region size as a percentage of memory available, for (a) slow, (b) medium, and (c) fast compression speeds]

Figure 5.2: "Overhead" and "Gain" values for different compression speeds.

The disk performance is computed using a formula in [53]; for ST340016A ATA disks, T_Disk,seq and T_Disk,non-seq are 5.15 ms and 13.05 ms, respectively. An application's disk access time (T_Disk) is computed using Eq. 5.6, and is the weighted average of an application's sequential and non-sequential access times (17% of SMV disk accesses are sequential; 45% of NS2 disk accesses are sequential).

The de/compression speed (T_Compr and T_Decompr) of different algorithms is measured using binary files generated by the selected applications; the unit of compression is a page. We chose the WKdm compression algorithm, as it shows superior performance over other algorithms. For SMV, the WKdm de/compression speed T_Cspeed is 18.8 µs (45% of SMV's accesses to the compressed region are compression operations), and for NS2, T_Cspeed is 19 µs (48% of NS2's accesses are store operations). The access time to the compressed region (T_ComprReg) is given by an application's de/compression speed (T_Cspeed) plus the time spent waiting for compressed pages to be sent to disk.

The number of DRAM accesses (N_DRAM) is equal to the number of L2 misses minus the number of page faults. We use the Pentium 4 performance counters to gather the number of L2 misses and the Linux /proc file system to gather the number of page faults during an application's execution.
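As a side note, per-process fault counters can also be read from user space without parsing /proc. The short C sketch below uses getrusage(), which reports the minor/major page-fault counts for the calling process; it is shown only as a self-contained illustration, not as the method used by the prototype.

    /* Read the calling process's page-fault counters; major faults are
     * the ones that went to disk.  Illustration only: the prototype
     * gathers these numbers from the /proc file system instead. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("minor faults: %ld, major faults (disk): %ld\n",
                   ru.ru_minflt, ru.ru_majflt);
        return 0;
    }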

The number of accesses to the compressed region (N_ComprReg) is gathered by the compressed-memory prototype described in Chapter 3. The number of disk accesses (N_Disk) is equal to the number of page faults (reported by the Linux kernel) minus the number of compressed region accesses.

5.3.2 Experimental Results

This section presents the performance of the SMV and NS2 applications that execute on a system with main memory compression, when the size of the compressed region is fixed. We configure the system such that the amount of memory provided is 95% of memory required to run the application without thrashing. For this setup, the selected SMV models execute two times slower than without thrashing. We set the size of the compressed region to be 5%, 10%, and 20% of memory available, and we summarize the measurements in Table 5.1. The results confirm our cost/benefit analysis: main memory compression improves an application's performance for sizes of the compressed region that fulfill Condition 5.10. The measurements show a small performance improvement for nodes333. Although the selected SMV models have the same memory footprint of 625 MB, nodes333 performs less computation than the other two models. Therefore, the lower amount of computation explains the performance improvement for this model when compression is turned on.

NS2 allocates a large amount of data but uses only a small subset of this data at any one time. The memory footprint of an NS2 simulation is determined by the number of nodes simulated, while the working set size is given by the number of traffic connections that are simulated. Although both NS2 sim1 and NS2 sim2 have the same memory footprint of 600 MB, they have different working set sizes. NS2 sim3 and NS2 sim4 differ in the same way: although they have the same memory footprint of 750 MB, they have different working set sizes.

SMV model     Memory     ComprReg   Overhead     Gain         Cond. 5.10   Speedup (>1) /
              available  size       accesses     accesses                  slowdown (<1)
nodes333      95%        5%         11,720       99,003       Yes          1.21
                         10%        24,137       142,926      Yes          1.35
                         20%        153,835      158,265      Yes          1.18
nodes3-4J2    95%        5%         678,519      137,752      No           0.2
                         10%        1,622,204    198,867      No           0.4
                         20%        6,465,217    2,495,166    No           0.4
nodes-433     95%        5%         1,276,966    609,537      No           0.2
                         10%        3,168,318    879,965      No           0.2
                         20%        12,627,181   974,401      No           0.4

Table 5.1: Performance data of the selected SMV models on a compressed-memory system.

We configure the system such that the amount of memory provided is 80% of memory required to run the application without thrashing. For this setup, the selected NS2 inputs execute two times slower than normal. The small slowdown (of NS2 compared to SMV) is explained by the fact that although memory available is smaller than NS2's memory footprint, memory available is larger than NS2's working set size. We set the size of the compressed region to be 12.5%, 25%, and 50% of memory available, and we summarize the measurements in Table 5.2. The results indicate that once the size of memory available is smaller than NS2's working set size, many accesses are to the compressed region and few disk accesses are avoided.

NS2 sim       Memory     ComprReg   Overhead     Gain         Cond. 5.10   Speedup (>1) /
              available  size       accesses     accesses                  slowdown (<1)
NS2 sim1      80%        12.5%      5,290        14,511       Yes          1.1
                         25%        8,096        14,423       Yes          1.21
                         50%        17,831       14,352       No           0.96
NS2 sim2      80%        12.5%      5,043        14,236       Yes          1.1
                         20%        8,029        14,140       Yes          1.25
                         50%        17,477       14,092       No           0.93
NS2 sim3      80%        12.5%      1,216        6,226        Yes          1.3
                         25%        1,754        6,145        Yes          1.12
                         50%        1,403        6,076        Yes          1.25
NS2 sim4      80%        12.5%      264          5,991        Yes          1.1
                         25%        441          5,902        Yes          1.14
                         50%        353          5,846        Yes          1.29

Table 5.2: Performance data of the selected NS2 simulations.

To sum up, in this section we assess the benefits of main memory compression by analyzing the performance of two applications that use large data sets. The measurements confirm that compression improves performance for these two applications when the size of the compressed region fulfills Condition 5.10. Although in these experiments the compressed region sizes are fixed, the results not only verify that our cost/benefit calculation is reasonable, but also help reveal the relationship between the size of memory available and the compressed region size.

5.4 Our Approach to Addressing Adaptivity

Given the complex trade-offs that influence the size of the compressed region that can improve an application's performance, implementing a flexible compressed-memory system is difficult. As shown in Chapter 2, the existing techniques yield mixed results: they improve performance for some applications but slow down other applications considerably. The following sections describe our approach to finding the compressed region size that can improve an application's performance automatically, including the case when there is no need for a compressed region (hence minimizing the negative influence of compression).

We determine the size of the compressed region that can improve an application's performance based on the memory requirements and memory access pattern of that application. Some applications allocate large amounts of data but use only a small subset of their data at any time. In other words, their working set is a subset of their memory footprint. For other applications, the working set size is equal to the memory footprint during the entire execution. Furthermore, some applications access their data in such a way that most of their memory accesses go to disk and not to the compressed region. For these applications, compression decreases performance considerably.

The resizing scheme we propose adapts the compressed region size such that the uncompressed and compressed regions host most of an application's working set, and strives to detect the applications for which compression decreases performance. Once the negative influence of compression is detected, compression is turned off and the application continues its execution without compression.

We use the compressed-memory system presented in Chapter 3, which allows for adapting the allocation of real memory between uncompressed and compressed pages in a manner that keeps the resizing overhead to a minimum. As previously described, the key mechanism to allow adaptivity is to organize the memory space as self-contained zones [28]. On this system, resizing the compressed region is accomplished by reclaiming or adding zones. While Chapter 3 describes the static characteristics of our compressed-memory system, the remaining sections present its dynamic characteristics in detail.

5.4.1 Resizing Scheme

By default, on the proposed system compression is turned off and the size of memory available is checked periodically. If the amount of free memory falls below a certain threshold, compression is turned on and a zone is added to the compressed region. From then on, the system periodically checks the amount of compressed data and whether it should resize the compressed region or not. We call the process that is repeated periodically the adapt phase. Fig. 5.3 illustrates the way adaptation works, where Step 1 and Step 2 are the two steps of the adapt phase. In the first step (Step 1), the system checks whether compression is beneficial or not. If compression degrades an application's performance, compression is turned off and the application continues its execution without compression. If compression is beneficial, the system executes the second step of the adapt phase (Step 2), in which it checks whether the compressed region size is the optimal one. The two steps of the adapt phase are described in detail in the following paragraphs.

[Figure: flowchart of the adapt phase: enable compression, wait T, then Step 1 (does compression hurt? should execution continue?), which can remove all zones and disable compression, and Step 2, which compares the free space in the compressed region against the one-zone and four-zone thresholds]

Figure 5.3: Adapt phase.

In the first part of the adapt phase (Step 1), the system determines whether compression is beneficial or not. The key idea of our approach is to compare at runtime an application's performance on the compressed-memory system with an estimate of its performance without compression. More precisely, we use the approach and equations described in Section 5.2.1 to determine whether an application executes faster with compression than without compression. If compression is beneficial (the condition given by Eq. 5.11 is fulfilled), the adapt phase continues with its second part. If compression degrades an application's performance (Condition 5.11 is not fulfilled), the data in the compressed region is decompressed and swapped to disk, the memory that held the compressed data is returned to the uncompressed region, and compression is turned off.

In the second part of the adapt phase (Step 2), the system checks whether the compressed region size is the optimal one. The key idea is that when an application's working set fits in the uncompressed and compressed regions, most of the disk accesses are avoided and the application should run faster than without compression. Therefore, in this step, the system checks the amount of free memory in the compressed region, Size(Compr_free). If Size(Compr_free) is bigger than the size of four zones, the compressed region is shrunk by deleting a zone. The decision to shrink the compressed region is based on the observation that when the compressed region is larger than what an application requires, some space within the compressed region is unused. If Size(Compr_free) is less than the size of four zones, the system checks if Size(Compr_free) is smaller than the size of a zone. If this is the case, the compressed region is grown by adding a zone. In this way, the decision to grow the compressed region increases the compressed region size until the compressed and uncompressed regions can store an application's working set. If Size(Compr_free) is bigger than the size of a zone and smaller than the size of four zones, the compressed region size remains the same (until the next memory usage check).
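The decision logic of the adapt phase can be summarized in a few lines of C. The sketch below is pseudocode-level: the stubs (estimate_benefit(), the zone primitives, the free-space variable) are stand-ins for the comprlib interface described in Section 5.4.2, not the actual kernel code, and the thresholds are the one-zone/four-zone bounds stated above.

    /* Sketch of the adapt phase of Fig. 5.3.  The helpers are stubs that
     * stand in for the comprlib interface; they are assumptions, not the
     * real implementation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define ZONE_SIZE (4UL * 1024 * 1024)                /* 4 MB zones */

    static bool   compression_on = true;
    static size_t compr_free     = 2UL * 1024 * 1024;    /* free bytes in ComprReg */

    static bool estimate_benefit(void) { return true; }  /* would evaluate Cond. 5.11 */
    static void add_zone(void)    { puts("grow: add one zone"); }
    static void remove_zone(void) { puts("shrink: delete one zone"); }
    static void disable_compression(void)
    {
        puts("decompress, swap out, return all zones, turn compression off");
        compression_on = false;
    }

    static void adapt_phase(void)
    {
        if (!compression_on)
            return;
        if (!estimate_benefit()) {                       /* Step 1 */
            disable_compression();
            return;
        }
        if (compr_free > 4 * ZONE_SIZE)                  /* Step 2 */
            remove_zone();
        else if (compr_free < ZONE_SIZE)
            add_zone();
        /* otherwise keep the current size until the next periodic check */
    }

    int main(void) { adapt_phase(); return 0; }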

5.4.2 Implementation Details

Application Parameters

To determine whether compression is beneficial or not, the system evaluates Condition 5.11. To evaluate this condition, we must account for each access to the compressed region. In addition, for each access to the compressed region, we must determine whether the page accessed would be in memory or on disk if compression was turned off. As described in Chapter 3, all pages in the compressed region are linked in an LRU list in the order of their insertion into the compressed region. To avoid traversing the LRU list on each access to a compressed page, ordering information needs to be stored within the LRU list itself.

The system uses a counter index to keep track of how many pages have been inserted in the compressed region at any one time. When a page is inserted, the value of the index counter is copied into a field order of the new (compressed) page, the page is inserted in the LRU list, and index is incremented. In other words, the order field of a compressed page keeps track of how many pages have been inserted in the compressed region before that page. For example, in Fig. 5.4 the value of the index counter is 127. The ordering of pages in the LRU list is guaranteed by the insertion of compressed pages at the most recent end of the list. Page removal does not interfere with the ordering of the list.

[Figure: LRU list of compressed pages with orders 126, 125, ..., 103, 100, 99, ..., 10, 8, 1 from the MRU end to the LRU end; the guard points into the list and the counter index = 127]

Figure 5.4: LRU list of all compressed pages.

For easy evaluation of Condition 5.11, a guard is used to mark the page in the LRU list that would be the last page in the compressed region if compression was turned off. The number of (compressed) pages in the LRU list after the guard is equal to the size of the compressed region divided by the (uncompressed) page size. For example, for a system that has a compressed region of 40 KB, a page size of 4 KB, and whose LRU list is depicted in Fig. 5.4, between the MRU page and the guard there are 10 pages with orders between 126 and 100.

When a page is inserted in the compressed region, the system checks whether the number of pages in the compressed region is smaller than the number of pages that would fit in the compressed region if compression was turned off. If this is the case, the guard remains the same (points to the LRU page) and the compressed page is inserted at the MRU end of the list. If the number of compressed pages is bigger than the compressed region size divided by the page size, the guard is updated to point to the next page in the LRU list and the page is inserted at the MRU end of the list (see Fig. 5.4).

When a page is deleted from the compressed region, its order is compared to the order of the page identified by the guard, and a counter N_ComprReg is incremented. If the order of the page is smaller than that of the guard, the page would be on disk if compression was turned off. In this case, a counter N_ComprOnly is incremented. This would be the case when deleting page 10 in Fig. 5.4. If the order of the page to be deleted is bigger than that of the guard, that page would be in memory even if compression was turned off. In this case, if the page is the page identified by the guard, the guard is updated to point to the next page in the LRU list. The last step is to delete the page from the LRU list. This would be the case when deleting page 125 in Fig. 5.4, which would not change the guard.

Every time Condition 5.11 is evaluated, the values of N_ComprReg and N_Disk - N_Disk,ComprM are substituted with the values of the N_ComprReg and N_ComprOnly counters. As described above, these two counters capture the application characteristics that dictate the application's performance on a compressed-memory system.
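A compact user-space sketch of this bookkeeping is shown below. It is only an illustration of the guard idea (list handling is simplified and edge cases are ignored), not the kernel code; 'capacity' stands for the compressed region size divided by the page size.

    /* Guard bookkeeping for the LRU list of compressed pages (sketch).
     * 'capacity' = compressed region size / page size, i.e. how many
     * uncompressed pages that memory could hold instead. */
    #include <stdio.h>
    #include <stdlib.h>

    struct cpage {
        unsigned long order;                  /* insertion order                */
        struct cpage *next, *prev;            /* next points toward the LRU end */
    };

    static struct cpage *mru, *lru, *guard;
    static unsigned long index_ctr;           /* pages inserted so far          */
    static unsigned long npages, capacity;
    static unsigned long N_ComprReg, N_ComprOnly;   /* counters for Cond. 5.11  */

    static void insert_page(void)
    {
        struct cpage *p = calloc(1, sizeof *p);
        p->order = index_ctr++;
        p->next = mru;                        /* push at the MRU end            */
        if (mru) mru->prev = p; else lru = p;
        mru = p;
        if (++npages <= capacity)
            guard = lru;                      /* all pages still "in memory"    */
        else
            guard = guard->prev;              /* keep 'capacity' pages above it */
    }

    static void delete_page(struct cpage *p)
    {
        N_ComprReg++;                         /* every ComprReg access counts   */
        if (p->order < guard->order)
            N_ComprOnly++;                    /* would have been on disk        */
        else if (p == guard)
            guard = guard->next;              /* keep the guard valid           */
        if (p->prev) p->prev->next = p->next; else mru = p->next;
        if (p->next) p->next->prev = p->prev; else lru = p->prev;
        npages--;
        free(p);
    }

    int main(void)
    {
        capacity = 10;                        /* e.g. 40 KB region, 4 KB pages  */
        for (int i = 0; i < 16; i++)
            insert_page();
        delete_page(lru);                     /* oldest page: below the guard   */
        printf("N_ComprReg = %lu, N_ComprOnly = %lu\n", N_ComprReg, N_ComprOnly);
        return 0;
    }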

System Parameters

Besides the counters that gather an application's memory access pattern, estimating an application's performance also requires system parameters for characterizing the machine. Our performance estimates require several system parameters, namely an application's time to access main memory (T_DRAM), to access data on disk (T_Disk), and to access data in the compressed region (T_ComprReg). To gather these numbers, we use the approach and equations in Section 5.2.1.

We use Eq. 5.5 to compute an application's access time to main memory (T_DRAM). This equation takes into consideration both read and write access times, which are computed by Eq. 5.2 and Eq. 5.3. To evaluate the two equations, we need the time to read and write continuous data, data within the same cache line, and non-continuous data. These parameters can be either specified by the system designer or measured by using micro-benchmarking [27]. To measure these system parameters, we use the memperf benchmark (described in detail in Chapter 4) and we perform the measurements off-line. (Another possibility is to gather the system parameters during the initialization phase of the compressed-memory system.) We also use off-line measurements to gather an application's access pattern, namely how many of its accesses are continuous, within the same cache line, and non-continuous.

An application's disk access time (T_Disk) is computed using Eq. 5.6, and is the weighted average of an application's sequential and non-sequential disk access times. The performance of sequential and non-sequential disk accesses is computed based on disk specifications. To gather an application's disk access pattern (the percent of its sequential and non-sequential accesses), we use off-line measurements.

The access time to the compressed region (T_ComprReg) depends on the system and the compression algorithm used. Off-line, we measure the performance of different compression algorithms using binary files generated by different applications. Moreover, while applications execute, we gather the number of compression and decompression operations. T_ComprReg is computed as the weighted average of the time spent compressing and decompressing pages.

Resizing Function

The implementation of the resizing process follows the decision steps in Fig. 5.3. Therefore, the process starts with Step 1 and verifies whether compression is beneficial or not. In our implementation, the performance model that lies at the heart of the resizing scheme is implemented by the comprlib_get_compr_efficiency() function, which uses the application and system parameters described above. The comprlib_get_compr_efficiency() function is called by using the ioctl() system call on the compressed memory device file, as described in Section 3.2.2. This function evaluates Condition 5.11 and returns true if compression is beneficial and false otherwise. If compression hurts performance, the compressed region is returned to the uncompressed memory and compression is turned off. If compression is beneficial, the resizing process executes Step 2 in Fig. 5.3. In our implementation, the compressed region size is increased and shrunk by calling the comprlib_add_zone() and comprlib_rem_zone() functions.

5.4.3 Efficiency Considerations

This section describes a couple of optimizations that lower the adaptation overhead of our compressed-memory system.

There are two alternatives to obtain feedback about changes in a compressed-memory system: 1) the adaptation process can be repeatedly invoked, or 2) it can be triggered by significant changes in an application's memory requirements. In the first case, also called the polling-based approach, the adaptation process has to decide if the last changes are significant enough to change the compressed region size. More precisely, the adaptation process must decide, based on the number of accesses to compressed data in the last period, whether the compressed region size must be changed or not. In the second approach, the adaptation process is asynchronously notified if application changes call for adaptation. More precisely, every time a compressed page is accessed, the system verifies if the compressed region size should be changed and, if this is the case, it calls for the adaptation process. For the sake of simplicity, we pursue a polling-based approach. This way, the memory system usage is checked periodically, and not at every access to the compressed region. This approach is therefore suitable for memory intensive applications that have many data accesses. For these applications, verifying whether compression is beneficial or not at every access to the compressed region would slow down the application significantly.

A further performance improvement is given by the fact that the compressed region is not resized every time the system usage is checked. As previously described, if the size of free memory in the compressed region is larger than the size of a zone but smaller than the size of four zones, the compressed region size remains the same. This decision allows for variations in an application's memory requirements without the need for resizing the compressed region.

A further optimization comes from our approach to main memory compression. As described in Chapter 3, we organize the compressed region in zones of the same size, and we grow and shrink the compressed region by adding and removing zones. Because a zone is self-contained, at any moment in time, the amount of memory to store meta-data is the minimum amount required. Moreover, when the compressed region is shrunk, the system deletes the zone with the minimum number of blocks. This way, the amount of data to be relocated when the compressed region is shrunk is kept to a minimum. In addition, because a page is stored within a single zone, when a zone is deleted, the system must not deal with pages partially stored in other zones.

To sum up, our approach to addressing adaptivity keeps the adaptation overhead to a minimum. Therefore, the approach described here is suited to memory intensive applications with large working sets and many data accesses.

5.5 Related Work

Douglis' early paper [36] resizes the compressed region based on a global LRU scheme implemented in the Sprite operating system. The experimental results show that compression can both improve and decrease an application's performance. The mixed results may be explained by the fact that Douglis' resizing scheme uses a single bias value for all applications, and the bias value actually dictates the amount of memory to be compressed. In other words, for all applications, the compressed memory size is the same. However, as shown by our study, for an increase in performance, different applications require different sizes of the compressed region.

Wilson and Kaplan [116, 63] resize the compressed region based on recent program behavior and use simulations to validate their approach. The main drawback of their approach is that the adaptation is based on information that is difficult to gather on current systems: current systems do not maintain a list of all pages in (uncompressed) memory. Hence, on current systems we cannot track accesses to a compressed region larger than the current one (but we can do that for smaller sizes of the compressed region), and we cannot determine whether a larger compressed region would perform better than the current one.

Castro et al. [32] resize the compressed region depending on whether the page would be uncompressed or on disk if compression was not used. The main drawback of their approach is that their resizing scheme has to take a decision at (almost) every access to the compressed region. Furthermore, the compressed region is resized by adding or deleting one, two or four pages. Therefore, although the approach may work well for small applications, it may not be feasible for large applications with frequent data accesses. The authors implemented their scheme in the Linux operating system.

5.6 Summary

This chapter describes the key concepts that form the basis of our adaptive approach to main memory compression. Central to the model-based adaptation is a performance model for determining if compression should be turned off. The main idea of our approach is that the applications for which compression is not beneficial can be detected based on their memory access pattern. The system monitors an application's performance with compression and estimates the performance of the same application on a system without compression. If the estimated performance is better than the measured one, compression is turned off. The ability of the system to detect the applications for which compression is not beneficial makes the compressed-memory system practical, since no user needs to fear that the compressed region will take away performance.

Moreover, based on an application's memory requirements, we develop a heuristic adaptation algorithm that is capable of producing resizing decisions. The algorithm varies the allocation of real memory between the uncompressed and compressed regions such that the two regions host most of an application's working set. For this set-up, most of an application's disk accesses are avoided and the application should execute faster with compression than without.

6 Evaluation

In this chapter we want to answer two questions: (1) "is main memory compression beneficial?" and (2) "does adaptation work?". Previous work had answered these questions partially: compression can be beneficial and adaptation may work. In addition to the previous studies, our work goes a step further and establishes a detailed understanding of the complexities of the adaptive system under consideration by means of a systematic evaluation methodology. Moreover, we show that the concept of adaptation is worthwhile. This chapter is organized as follows: Section 6.1 describes the experimental setup. Section 6.2 addresses question (1), that is, whether main memory compression is beneficial. Section 6.3 investigates how efficient the adaptation scheme is, and thereby answers question (2). Section 6.4 discusses a case in which adaptation fails. Section 6.5 systematically identifies and evaluates the primary design factors and their effect on performance.

6.1 Experimental Setup

We run the experiments on a commodity PC (32-bit architecture) and an Apple G5 machine (64-bit architecture). The PC is a Pentium 4 at 2.6 GHz with an 8 KB L1 data cache, 512 KB L2 cache and 1 GB DRAM, and has its swap partition on an IC35L060AVV207-0 ATA disk. The Apple G5 has a dual 64-bit PowerPC 970 microprocessor at 1.8 GHz with a 32 KB L1 data cache, 512 KB L2 cache (per processor) and 1 GB or 1.5 GB DRAM, and has its swap partition on a ST3160023AS ATA disk. The amount of physical memory is chosen such that the applications exceed the DRAM size and the machine swaps to disk.

The machines run a modified version of Yellow Dog Linux 3.0.1 (YDL) - 32-bit mode on the Pentium 4 PC and 64-bit mode on the Apple G5 machine. Given the memory usage limitations of the 32-bit architectures (see Section 3.2.1), we run the applications that require compressed regions larger than 100 MB only on the G5 machine. Each experiment was repeated five times, and the results shown are the average of the five measurements. For all experiments, we use the WKdm compression algorithm as it shows superior performance over other algorithms [116]. Unless otherwise stated, the compressed-memory system under consideration has a zone size of 4 MB, a block size of 128 bytes, and a compression factor of 4. (A compression factor of 4 means that the system can store 4 times more pages within a zone than if no compression was used.)


Memory      Pentium 4                                     G5
available   W/o compr.   W/ compr.                        W/o compr.   W/ compr.
            sec          sec          speedup             sec          sec          speedup
100%        6            6            -                   10           10           -
97%         10           16           0.62                32           34           0.94
92%         16           65           0.24                47           51           0.92
87%         403          391          1.03                290          330          0.87
85%         1,307        450          3.22                1,212        1,175        0.97
82%         2,431        791          3.07                2,165        1,382        1.56
80%         3,601        1,175        3.06                3,290        1,508        2.18
78%         4,645        1,433        3.24                4,900        1,670        2.93
75%         5,649        1,609        3.51                5,650        1,931        2.92
73%         8,789        2,177        4.03                6,780        2,233        3.03

Table 6.1: Execution time of the nodes-2.4.3 model on a Pentium 4 PC at 2.6 GHz and on an Apple G5 at 1.8 GHz.

6.2 Is main memory compression beneficial?

The answer to whether memory compression can improve system performance depends on two issues. First, it depends on the application's access pattern, that is, on how many disk accesses are avoided and how many accesses are hits in the compressed region. Second, the answer depends on how efficient the adaptation scheme is.

This section analyzes the first issue and investigates whether compression can improve performance for real applications. The second issue is investigated in Section 6.3. We select a set of applications that have different memory requirements and access patterns. The selected applications are simulators that can have many inputs. For a simulator, although different inputs lead to different memory requirements, the memory access pattern does not vary much from input to input.

6.2.1 Symbolic Model Verifier (SMV)

SMV is a method based on Binary Decision Diagrams (BDDs) used for formal verification of finite state systems. We use Yang's SMV implementation since it demonstrated superior performance over other implementations [117]. We choose different SMV inputs that model the FireWire protocol [96]. SMV's working set is equal to its memory footprint (i.e., SMV uses all the memory it allocates during its execution rather than a small subset) and has a compression ratio of 52% (or 2:1) on average.

First, to investigate the limitations of main memory compression we select an SMV model, called nodes-2.4.3, that has a small memory footprint of 164 MB. An application with such a small footprint is unlikely to require compression but allows us to perform many experiments. We conduct the first set of experiments on the Pentium 4 PC at 2.6 GHz. We configure the system such that the amount of memory available is 100% to 73% of memory allocated. (The amount of memory allocated by an application is the memory footprint of that application.) The measurements are summarized in Table 6.1, column "Pentium 4 - W/o compr.", and show that

when physical memory is smaller than SMV's working set, SMV's performance is degraded substantially. In the next set of experiments, SMV executes on the adaptive compressed-memory system. The measurements are summarized in Table 6.1, column "Pentium 4 - W/ compr.", and indicate that when the amount of memory available is 87% to 73% of memory allocated by the application, our adaptive compression technique increases performance by a factor of up to 4. The measurements also show that for this small application, when the memory shortage is small (memory available is 97% to 92% of memory allocated), taking away space from the SMV model for the compressed region slows down the application slightly. The case when compression degrades SMV's performance is analyzed in detail in Section 6.3.2.

We repeat the experiments on the Apple G5 machine, which has a different architecture and a 1.8 GHz processor, and summarize the results in Table 6.1, column "G5". (The different DRAM chips we use have a negligible influence of 0.02% on an application's performance.) The results for the adaptive set-up, summarized in column "G5 - W/ compr.", indicate that when SMV executes on the G5 machine with the compressed-memory system described here, SMV's performance improves by a factor of up to 3. Overall, the results indicate that on a slow machine (Apple G5), compression improves performance for a smaller range of configurations than on a fast machine (Pentium 4 at 2.6 GHz). The measurements confirm other researchers' results, which show that on older machines memory compression can increase system performance by a factor of up to 2 relative to an uncompressed swap system [25, 69, 32]. Moreover, our measurements show that memory compression becomes more attractive as the processor speed increases.

6.2.2 NS2 Network Simulator

NS2 is a network simulator used to simulate different protocols over wired and wireless networks. We choose different inputs that simulate the AODV protocol over a wireless network. NS2's working set is smaller than its memory footprint (NS2 uses only a small subset of its data at any one time) and has a compression ratio of 20% (or 5:1) on average. The amount of memory allocated by an NS2 simulation (or the memory footprint) is determined by the number of nodes simulated, and the size of memory used is given by the number of traffic connections that are simulated.

We consider two simulations that allocate 880 MB and 1.5 GB. We configure the system such that the amount of memory provided is less than the memory allocated by the application. In other words, the memory available is smaller than the memory needed by the application to execute without disk accesses. We measure the simulations' execution time without compression and summarize the results in Table 6.2, column "W/o compr.". The measurements show that when memory available is 68% to 43% of memory allocated, NS2 executes slightly slower than normal. When we apply our compression technique to NS2 executing with the same reduced memory allocation, its performance improves by a factor of up to 1.4. The measurements for the adaptive set-up are summarized in Table 6.2, column "W/ compr.". The results show that because NS2 allocates a large amount of data but uses only a small subset of its data at any one time, compression does not improve performance much, but fortunately, compression does not hurt either.

The second set of experiments uses inputs that allocate 730 MB, 880 MB, and 990 MB. We execute the selected simulations on a system without and with compression when memory available is 70%, 58%, and 51% of memory allocated by the application, and summarize the

Memory      Memory      W/o compr.   W/ compr.
footprint   available   sec          sec          speedup
880 MB      100%        253          253          -
            58%         345          252          1.36
            50%         426          313          1.36
            43%         586          425          1.37
1.5 GB      100%        1,202        1,202        -
            68%         1,335        1,275        1.04
            62%         1,351        1,215        1.11

Table 6.2: NS2 execution time on an Apple G5 at 1.8 GHz.

Memory      Memory      W/o compr.   W/ compr.
footprint   available   sec          sec          speedup

Pentium 4 at 2.6 GHz
730 MB      70%         145          128          1.13
880 MB      58%         205          168          1.22
990 MB      51%         275          197          1.39

G5 at 1.8 GHz
730 MB      70%         243          226          1.07
880 MB      58%         345          252          1.36
990 MB      51%         398          319          1.23

Table 6.3: NS2 execution time without and with main memory compression.

results in Table 6.3. The data in column "W/ compr." show that because NS2's working set is small (smaller than memory allocated or memory footprint) and fits into small memories, compression does not improve NS2's performance much. Overall, the measurements show that on the faster Pentium 4 PC compression improvements are slightly bigger than on the slower G5 machine.

6.2.3 qsim Traffic Simulator

qsim [26] is a car traffic simulator that employs a queue model to capture the behavior of varying traffic conditions. Although a simulation can be distributed over many computers (e.g., a cluster), the simulation requires hosts with memory sizes bigger than 1 GB. For a geographic region, the number of travelers (or agents) simulated determines the amount of memory allocated to the simulation, and the number of (real) traffic hours being simulated determines the execution time of the simulation.

We consider simulations that allocate 1.3 GB, 1.7 GB, 1.9 GB, and 2.6 GB and simulate the traffic on the road network of Switzerland. We measure the execution time of these simulations on the G5 machine without compression and with adaptive compression, and summarize the results in Table 6.4. The system has a block size of 128 bytes, a zone size of 4 MB, and a compression factor of 9. The results in column "W/ compr." show that when qsim executes on our compressed-memory system, its performance improves by a factor of 20 to 55. qsim's

Memory      Physical    W/o compr.   W/ compr.
footprint   memory      sec          sec          speedup
1.3 GB      77%         3,993        135.45       29.47
1.7 GB      88%         2,900        141.53       20.49
            59%         24,580       513.66       47.85
1.9 GB      79%         11,456       277.91       41.22
            52%         46,049       825.72       55.76
2.6 GB      57%         13,319       332.50       40.05
            38%         51,569       988.01       52.19

Table 6.4: qsim execution time on an Apple G5 machine with 1 GB and 1.5 GB DRAM.

working set is equal to its memory footprint (during its execution, qsim uses all the memory it allocates), and has a compression ratio of 10% (or 10:1) on average. Because qsim compresses so well, even when the amount of memory provided is much smaller than the memory allocated, the simulation fits into the uncompressed and compressed memory and finishes its execution in a reasonable time. For instance, although the last simulation listed in Table 6.4 allocates 2.6 GB, it manages to finish its execution on a system with only 1 GB physical memory, which would not be possible without compression.

6.2.4 Discussion

Our analysis examines the performance of three applications and shows that compression improves performance for all of them, but the improvement varies according to the memory access behavior and also according to the compression ratio achieved.

SMV and qsim use their entire working set during their execution. When the amount of memory provided is less than the memory allocated by 10% or more, SMV executes approximately 600 times slower than without swapping. The measurements show that when the amount of memory available is 15% smaller than SMV's working set, our compression technique increases performance by a factor of 3 to 4 depending on the processor used (a factor of 3 for a G5 and 4 for a Pentium 4). When we apply our compression technique to qsim, its execution time improves by a factor of 20 to 55.

The NS2 simulator allocates a large amount of data but uses only a small subset of its data at any one time, and thus provides an example that is much different from SMV. For many setups, although NS2's memory footprint is larger than the memory available, NS2's working set fits into the uncompressed region. Because NS2's working set changes periodically, the benefit of compression is seen only during the time interval in which the working set changes, which is also the time interval of memory starvation. Under normal execution (without the aid of our compression techniques), when physical memory is 40% smaller than the memory allocated (or memory footprint), NS2's execution is slowed down by a factor of up to 2. When we apply our compression techniques to NS2 executing with the same reduced memory allocation, its performance improves by a factor of up to 1.4.

Memory      Memory      W/o compr.   W/ compr.
footprint   available   sec          sec          speedup
1.2 GB      83%         538.62       147.16       3.66
1.4 GB      71%         5,484.75     461.67       11.88
1.8 GB      55%         47,617.38    2,511.46     18.96

Table 6.5: rand execution time on an Apple G5 with 1 GB physical memory.

6.3 Does adaptation work?

Our approach to analyzing the system's ability to find the compressed region size that improves performance the most proceeds in two steps. First, we run applications for which compression can improve performance, and check whether the compressed region size found by the adaptive scheme is among those that improve performance the most. Second, we investigate whether the system is agile enough to detect the applications for which compression slows down performance. In other words, we investigate whether the proposed system can detect the cases when there is no need for a compressed region.

6.3.1 Compression Improves Performance

To assess the accuracy of our adaptation scheme, we examine the performance of the rand benchmark and the qsim simulator on a system with fixed sizes of the compressed region and on an adaptive compressed-memory system. As the experiments presented later in this section will show, compression improves rand and qsim performance for all sizes of the compressed region we experiment with. We choose these two applications because they have different memory access behavior and different compression ratios, require large sizes of the compressed region, and finish execution in a reasonable time. We run the experiments on a G5 machine with 1 GB physical memory. The system has a block size of 128 bytes and a zone size of 4 MB; the value of the compression factor is 14 for the rand benchmark and 9 for the qsim simulations.

rand benchmark

Programs that use dynamic memory allocation access their data by reference, and hence have irregular access patterns. To investigate the performance of such an application (e.g., written in C++) we use a synthetic benchmark called rand. The advantage of a benchmark over a real application is that its memory footprint and number of data accesses can be changed easily. The rand benchmark reads and writes its data set randomly and has a compression ratio of 50% on average (or 2:1). We consider three variants that allocate 1.2 GB, 1.4 GB, and 1.8 GB and access their data sets 200,000, 1,200,000, and 6,000,000 times.
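For illustration, a rand-style benchmark can be sketched in a few lines of C. This is not the thesis benchmark: the real rand also controls the compressibility of its data (about 2:1), which the sketch below does not, and the footprint and access count are simply command-line parameters.

```c
/* Sketch of a rand-style benchmark: allocate a large data set and read and
 * write it at random offsets, so that once the footprint exceeds physical
 * memory most accesses may fault.  Illustration only, not the thesis code. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    size_t footprint = (argc > 1) ? (size_t)strtoull(argv[1], NULL, 10)
                                  : (size_t)1200 << 20;      /* default: 1.2 GB  */
    long accesses    = (argc > 2) ? atol(argv[2]) : 200000;   /* default variant  */
    size_t words     = footprint / sizeof(long);

    long *data = calloc(words, sizeof(long));
    if (data == NULL)
        return 1;

    srand(42);
    for (long i = 0; i < accesses; i++) {
        size_t idx = (((size_t)rand() << 15) ^ (size_t)rand()) % words; /* random index */
        data[idx] += 1;                                                 /* read-modify-write */
    }
    printf("touched %ld random words of a %zu-byte data set\n", accesses, footprint);
    free(data);
    return 0;
}
```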

We measure the execution time of the three variants when compression is turned off and on, and summarize the results in Table 6.5. Without compression, the three variants finish execution in 538.62 sec, 5,484.75 sec, and 47,617.38 sec, respectively. When we apply our adaptive compression technique to these variants, their performance improves by a factor of 3 to 18; the compressed region size found by the resizing scheme is 64 MB, 96 MB, and 140 MB, respectively.

Figure 6.1: Execution time for rand 1.2 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB.)

We then run the benchmark on a system that has fixed sizes of the compressed region, and summarize the measurements in Figures 6.1-6.3. The dotted line indicates the compressed region size found by the adaptive scheme. The figures show that the size found by our resizing scheme is among those that improve performance the most.

qsim car traffic simulator

The second experiment investigates the performance of the qsim simulator on a system with fixed sizes of the compressed region and on an adaptive compressed-memory system. The measurements, summarized in Figures 6.4-6.7, show that also for this real application the size found by the resizing scheme is among those that improve performance the most.

6.3.2 Compression Degrades Performance

This section investigates whether the adaptation mechanism proposed in this dissertation is agile enough to detect the applications for which compression degrades performance. First, we subject the system to a synthetic benchmark, which models an application that does not benefit from memory compression. Second, tests with the SMV application enable us to assess the system's agility with respect to real applications.

thrasher benchmark

To investigate the agility of the adaptation mechanism, we experiment with a benchmark called thrasher, which pays the cost of compressing pages without gaining any benefit. The benchmark cycles linearly through its working set, reading and writing the whole data space. When thrasher's working set doesn't fit in memory, on systems that use the LRU algorithm for page

Figure 6.2: Execution time for rand 1.4 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB.)

Figure 6.3: Execution time for rand 1.8 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB for the 1.8 GB variant with 6,000,000 rand writes; the adaptive size is marked.)

Figure 6.4: Execution time for qsim 1.33 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB.)

Figure 6.5: Execution time for qsim 1.77 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB.)

Figure 6.6: Execution time for qsim 1.99 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB; the adaptive size is marked.)

Figure 6.7: Execution time for qsim 2.66 GB on an Apple G5 with 1 GB DRAM. (Plot of execution time in seconds against compressed region size in MB.)

Memory      Memory      W/o compr.   W/ compr. - w/o abort        W/ compr. - w/ abort
footprint   available   sec          sec          speedup         sec          speedup
1.8 GB      55%         263.91       449.52       0.58            271.45       0.97
1.9 GB      53%         255.47       541.50       0.47            288.80       0.88
2 GB        50%         297.86       556.47       0.54            304.62       0.98

Table 6.6: thrasher execution time on an Apple G5.

replacement (e.g., Linux), it takes a page fault on each page each time the benchmark iterates through its working set. Each page fault requires a disk read as well as a page write to make room for the faulted page. In addition, we also have the overhead of compressing pages. Because of its access pattern, thrasher will always require pages from disk and will never fault on compressed pages. In other words, for this benchmark, memory compression just adds the cost of compressing pages on their way to disk.
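The thrasher access pattern can likewise be sketched; again this is only an illustration of the pattern described above, not the thesis benchmark, and the 4 KB page size and the 1.8 GB footprint are assumptions.

```c
/* Sketch of the thrasher access pattern: cycle linearly over a data set that
 * is larger than physical memory, touching every page on every pass.  Under
 * LRU page replacement each pass then faults on every page.  Illustration
 * only; the real benchmark reads and writes the whole data space. */
#include <stdlib.h>

#define PAGE_SIZE 4096              /* assumed page size */

static void thrash(char *data, size_t bytes, int passes)
{
    for (int p = 0; p < passes; p++)
        for (size_t off = 0; off < bytes; off += PAGE_SIZE)
            data[off] = (char)(p + 1);   /* write one byte per page */
}

int main(void)
{
    size_t bytes = (size_t)1800 << 20;   /* e.g., the 1.8 GB variant of Table 6.6 */
    char *data = malloc(bytes);
    if (data == NULL)
        return 1;
    thrash(data, bytes, 10);
    free(data);
    return 0;
}
```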

We consider three variants of thrasher that allocate 1.8 GB, 1.9 GB, and 2 GB. We run the variants on the G5 machine with 1 GB DRAM, without compression, and summarize the measurements in Table 6.6, column "W/o compr.". We then turn on compression and disable the first part of the decision process that checks whether compression degrades an application's performance. For this setup, the measurements summarized in Table 6.6, column "W/ compr. - w/o abort", show a decrease in the application's performance (speedup < 1). Next, we enable the part of the decision process that turns off compression if an application's performance is degraded, and we summarize the results in Table 6.6, column "W/ compr. - w/ abort". The results show that the resizing scheme works as intended and succeeds in detecting when compression is not beneficial. The measurements show that turning on compression, detecting that compression hurts performance, and turning off compression degrades thrasher's performance by at most 10%.

SMV

Section 6.2.1 presents experiments with the nodes-2.4.3 SMV model. The measurements show that on the G5 machine, when the memory shortage is small (97% to 85%), running the application on a compressed-memory system slows down the application slightly. A closer look at the compressed region size shows that while nodes-2.4.3 executes, the system turns off compression and the simulation continues its execution without compression.

We run the SMV model again on the G5 machine; we turn on compression and disable the first part of the decision process that checks whether compression degrades an application's performance. For this setup, the measurements summarized in Table 6.7, column "W/ compr. - w/o abort", show a significant decrease in the application's performance (speedup < 1). Next, we enable the part of the decision process that turns off compression if an application's performance is degraded, and we summarize the results in column "W/ compr. - w/ abort". For this set-up, the SMV model executes slightly slower than without compression and faster than if the part of the resizing scheme that checks for compression benefits is turned off. Therefore, also for the SMV application, the adaptation mechanism succeeds in detecting when compression is not beneficial.

Memory      W/o compr.   W/ compr. - w/o abort        W/ compr. - w/ abort
available   sec          sec          speedup         sec          speedup
100%        10           10           -               10           -
97%         32           37           0.86            34           0.94
92%         47           71           0.66            51           0.92
87%         290          1,045        0.27            330          0.87
85%         1,212        1,270        0.95            1,175        0.97
82%         2,165        1,382        1.56            1,382        1.56

Table 6.7: nodes-2.4.3 execution time on an Apple G5.

6.4 When does adaptation fail?

We select the qsim simulation that allocates 1.9 GB and measure its performance on the Pentium 4 PC with 1 GB DRAM. (Section 6.2.3 presents experiments with the qsim car traffic simulator executing on a G5 machine with 1 GB physical memory.) On the Pentium 4 machine the simulation executes 8.5 times slower than on the Apple G5 machine, even though the Pentium 4 processor is faster than the PowerPC processor. A closer look at the compressed region size shows that it stays at 100 MB during most of the qsim execution time, even though the resizing scheme tries to increase it. The compressed region size stays at 100 MB because on the Pentium 4 PC the maximum amount of memory that can be allocated in kernel space is 100 MB (see Section 3.2.1). On the other hand, when the simulation executes on the Apple G5 machine, the compressed region size found by the resizing scheme is 180 MB. This experiment shows the importance of flexible OS support: if the amount of memory that can be allocated in kernel mode were not limited, main memory compression would improve performance for this large application considerably.

6.5 Efficiency

The compressed-memory system we propose organizes the compressed region in self-contained zones; a zone contains both the compressed data and all the overhead data structures required to manage the compressed memory within it. If not done right, the management of the compressed region may have a negative impact on performance. In other words, choosing efficient data structures to manage the compressed region is crucial to increasing system performance.

A well-designed data structure allows a variety of critical operations (e.g., search, insert, delete) to be performed using as few resources, both CPU time and memory space, as possible. The choice of data structures is a primary design consideration, as experience in building large systems has shown that the difficulty of implementation and the quality and performance of the final result depend heavily on choosing the best data structures. After the data structures are chosen, the algorithms to be used often become relatively obvious. Sometimes things work in the opposite direction - data structures are chosen because certain key tasks have algorithms that work best with particular data structures. In either case, the choice of appropriate data

structures is crucial.

Efficiency is generally captured by two properties: speed (the time it takes for an operation to complete) and space (the memory or non-volatile storage used up by the construct). The speed of an algorithm is measured in various ways. The most common method uses time complexity to determine the Big-O behavior of an algorithm; often, it is possible to make an algorithm faster at the expense of space. The space of an algorithm has two parts. The first part is the space taken up by the compiled executable on disk. The other part is the amount of temporary memory taken up during processing. In this section we investigate only the second part of the space efficiency.

6.5.1 Time Complexity

This section analyzes the time complexity of the current implementation of the compressed-memory system. We analyze the time spent in basic operations, like searching for, inserting, and deleting pages from the compressed region. The time complexity of the basic operations is dictated by the data structures used to manage the compressed region. We first analyze the performance of various data structures, and then present the data structures we choose to manage the compressed region.

Arrays permit efficient (constant time, O(1)) random access but not efficient insertion and deletion of elements (the worst case is O(n), where n is the size of the array). Moreover, arrays are among the most compact data structures; storing 100 integers in an array takes only 100 times the space required to store an integer, plus perhaps a few bytes of overhead for the whole array. However, the main disadvantage of an array is that it has a fixed size, and although its size can be altered in many environments, this is an expensive operation. Hence, arrays are most appropriate for storing a fixed amount of data which will be accessed in an unpredictable fashion.

Linked lists are most appropriate for storing a variable amount of data. Because the worst-case lookup time of a linked list is O(n), in general, linked lists are unsuitable for applications where it's useful to look up an element by its index quickly. In other words, linked lists are best for a list of data which will be accessed sequentially and updated often with insertions or deletions.

Like arrays, hash tables can provide constant-time O(1) lookup on average, regardless of the number of items in the table. However, the rare worst-case lookup time can be as bad as O(n). Compared to other data structures, hash tables are most useful when a large number of data records are to be stored.

The compressed-memory system we propose organizes the compressed region in fixed-size zones, each of the zones being organized in fixed-size blocks. To improve performance, a compressed-memory system must permit fast access to pages in the compressed region. Because hash tables have very good lookup times on average and support a large number of records, we use the hash table described in Section 3.1 to handle all compressed pages. To efficiently handle collisions, we use chaining. Besides collision handling, the main advantage of chaining is that it does not require resizing the hash table. In addition to using the hash table, we use the array called comp page table for keeping track of all pages within a zone. Besides the location information, the content of a compressed page (i.e., its corresponding blocks) must also

be available quickly. Because arrays support efficient (O(1)) random accesses, we use the array called block table for keeping track of all blocks within a zone. Since the number of blocks within a zone remains constant over time, an array is the appropriate data structure for storing information about a zone's blocks.

Because inserting an element in an array is an expensive operation, to speed up insertion, we use an additional data structure called the zone structure. All free entries in comp page table are linked in a list, and the beginning of the list is identified by the free entry field. Moreover, all free entries in block table are linked in a list whose beginning is identified by the free block field. Because pointer-based data structures have poor locality, traversing a linked list is an expensive operation. On the other hand, traversing an array sequentially is faster than traversing a linked list. Therefore, to speed up list traversals, the elements in block table are linked by value (i.e., by their index) and not by reference. The elements in comp page table are linked by reference because the elements in a collision list can be in multiple comp page tables.

When the compressed region becomes almost full, the system sends to disk those pages that have been in the compressed region for the longest time. For efficient lookups, all pages in the compressed region are linked in an LRU list. Using a linked list allows the system to efficiently (O(1)) delete the LRU pages, which are the pages at the end of the LRU list.
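The data-structure choices above can be summarized in a small sketch. The field names and entry layouts below are illustrative only and do not reproduce the kernel implementation of Chapter 3; they merely show which lists are linked by index and which by pointer.

```c
/* Sketch of the per-zone data structures described above (names illustrative). */
#include <stdint.h>

struct block_entry {                 /* one entry per block in the zone            */
    int32_t next_free;               /* free blocks linked by value (array index)  */
    void   *data;                    /* address of the block's memory              */
};

struct comp_page_entry {             /* one entry per compressed page in the zone  */
    unsigned long            key;         /* identifies the compressed page        */
    int32_t                  first_block; /* index of its first block in the zone  */
    struct comp_page_entry  *hash_next;   /* collision chain of the global hash    */
    struct comp_page_entry  *lru_next;    /* global LRU list of compressed pages   */
    struct comp_page_entry  *free_next;   /* free entries linked by reference      */
};

struct zone_structure {              /* speeds up insertion into the two arrays    */
    int32_t                  free_block;  /* head of the free-block index list     */
    struct comp_page_entry  *free_entry;  /* head of the free-entry list           */
    struct block_entry      *blocks;      /* the zone's block table                */
};
```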

6.5.2 Space Efficiency

This section analyzes the space resources required by our approach to main memory compression. We analyze the amount of memory taken up during an application's execution, and focus on the amount of memory required by the data structures that manage the compressed region. The size of the compressed-memory metadata is influenced by several factors. We first identify the primary factors, their interactions and their effects on the space requirements and performance of a compressed-memory system. After identifying the primary factors, we study performance as a function of a single (primary) factor.

Because the compressed-memory system we propose organizes the compressed region in fixed-size zones, we analyze the amount of memory required to manage a zone's memory. Given the space requirements of a zone's metadata, the total amount of metadata during an application's execution can be easily computed based on the number of zones used while the application executes. As described in Chapter 3, the data structures that manage a zone's memory are the block table, comp page table, and zone structure. Hence, the space required for managing a zone's memory is the sum of the sizes of these three data structures:

metadata = sizeof(block table) + sizeof(comp page table) + sizeof(zone structure)

Because block table and comp page table are arrays, for each of them the total size is given by the entry size multiplied by the number of entries. We use ZoneBlocks to denote the number of entries in the block table, and CompEntries to refer to the number of entries in the comp page table. Formally,

metadata = (sizeof(int) + sizeof(void*)) · ZoneBlocks
         + sizeof(comp page entry) · CompEntries
         + (sizeof(int) + 2 · sizeof(void*))

To simplify the formula, we make the following substitutions: a = sizeof(int) + sizeof(void*), b = sizeof(comp page entry), c = sizeof(int) + 2 · sizeof(void*), and we have

metadata = a · ZoneBlocks + b · CompEntries + c    (6.1)

The number of blocks in a zone, ZoneBlocks, is given by the zone size divided by the block size, or formally ZoneBlocks = ZoneSize / BlockSize. The number of uncompressed pages that can be stored in a zone is the zone size divided by the page size, or ZoneSize / PageSize. When compression is used, more compressed pages (than uncompressed pages) can be stored within a zone. We denote the expected compression ratio by ComprFactor and compute the maximum number of compressed pages in a zone as the number of uncompressed pages that can be stored in a zone multiplied by the expected compression ratio, or formally ComprFactor · ZoneSize / PageSize. Because each compressed page within a zone requires an entry in comp page table, the number of entries in comp page table (CompEntries) is equal to the maximum number of compressed pages in a zone. With ZoneBlocks and CompEntries, Eq. 6.1 becomes

metadata = a · ZoneSize / BlockSize + b · ComprFactor · ZoneSize / PageSize + c    (6.2)

Eq. 6.2 shows that the factors that influence the size of a zone's metadata are the zone size (ZoneSize), block size (BlockSize), and (expected) compression factor (ComprFactor). Because the page size (PageSize) is controlled by the system architect, we do not investigate the effect of the page size on system performance.
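To get a feel for the metadata cost, Eq. 6.2 can be evaluated for the default configuration (4 MB zones, 128-byte blocks, compression factor 4). The entry sizes a, b and c below are assumptions (a 32-bit build with 4-byte integers and pointers, and a 24-byte comp page table entry), as is the 4 KB page size; with these assumptions the metadata amounts to roughly 8.6% of a zone.

```c
/* Worked example for Eq. 6.2 with the default configuration: 4 MB zone,
 * 128-byte blocks, 4 KB pages, compression factor 4.  The entry sizes a, b
 * and c are assumptions for illustration, not measured values. */
#include <stdio.h>

int main(void)
{
    const double zone_size    = 4.0 * 1024 * 1024;   /* ZoneSize            */
    const double block_size   = 128.0;               /* BlockSize           */
    const double page_size    = 4096.0;              /* PageSize (assumed)  */
    const double compr_factor = 4.0;                 /* ComprFactor         */

    const double a = 4.0 + 4.0;       /* assumed sizeof(int) + sizeof(void*), 32-bit */
    const double b = 24.0;            /* assumed size of a comp page table entry     */
    const double c = 4.0 + 2.0 * 4.0; /* sizeof(int) + 2 * sizeof(void*)             */

    double metadata = a * zone_size / block_size
                    + b * compr_factor * zone_size / page_size
                    + c;                               /* Eq. 6.2 */

    printf("metadata per zone: %.0f bytes (%.1f%% of the zone)\n",
           metadata, 100.0 * metadata / zone_size);
    return 0;
}
```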

Primary Factors

The goal of a proper experimental design is to obtain the maximum information with the minimum number of experiments. The three most frequently used designs are simple designs, full factorial designs and fractional designs. At the beginning of a performance study, the number of factors and their levels is usually large. Therefore, a full factorial design with a large number of factors and levels may not be the best use of the available effort. To simplify the search for key factors, we restrict the experiment to a 2^k factorial design. A 2^k experimental design is used to determine the effect of k factors, each of which has two alternatives or levels. We use the 2^k design because it is easy to analyze and helps sort out factors in the order of their impact on system performance [58]. Table 6.8 summarizes the factors and factor levels used in the 2^k experiment (k=3). The levels of the ZoneSize factor are 2 MB and 8 MB, and those of the BlockSize factor are 64 bytes and 1024 bytes. The levels of ComprFactor are 4 and 14, which correspond to a good and a very good compression ratio. (Most applications have a compression ratio of at least 2:1.) The following paragraphs explain how to read the results of the 2^k experiment. The results are presented in tabular form. Table 6.9 lists the measured performance (in sec) of the qsim car traffic simulations on a compressed-memory system when the primary factors have the levels described in Table 6.8.

We use the sign table method to compute the percent of variation explained by the three factors and their interaction, and summarize the computations in Table 6.10. The results show

Factor        Level -1   Level 1
ZoneSize      2 MB       8 MB
BlockSize     64 B       1024 B
ComprFactor   4          14

Table 6.8: Factors and levels used.

Test       Compr.    2 MB zone                  8 MB zone
           factor    64 B        1024 B         64 B        1024 B
1.33 GB    4         245.30      258.30         688.96      165.64
           14        147.38      253.49         144.51      152.63
1.77 GB    4         2,229.04    1,796.98       2,803.01    552.47
           14        595.96      1,980.51       551.54      660.04
1.99 GB    4         7,351.71    5,479.42       4,395.01    888.60
           14        954.99      6,354.63       872.34      973.74
2.66 GB    4         7,721.11    6,340.01       3,688.07    1,055.76
           14        1,116.68    7,380.94       981.45      1,092.29

Table 6.9: Results of the 2^k experiment. The performance [sec] of different qsim simulations measured on an Apple G5 with 1 GB physical memory.

that most of the qsim performance variation is explained by the compression factor (row "ComprFactor") and the interaction between the block size and compression factor (row "BlockSize+ComprFactor"). Moreover, for the large simulations (i.e., 1.99 GB and 2.66 GB), the measurements indicate that a compressed-memory system that has a small zone size decreases performance considerably (the zone size explains more than 40% of the variation).
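To make the sign table method concrete, the sketch below recomputes the fractions of variation for the 1.33 GB qsim input from the raw times in Table 6.9. It follows the standard 2^k analysis (effects computed with a -1/+1 sign table, variation proportional to the squared effects [58]); it is not code from the thesis, but with this observation ordering it reproduces the 1.33 GB column of Table 6.10 up to rounding.

```c
/* Sign-table analysis of a 2^3 factorial design (standard method, cf. [58]).
 * The observations are the 1.33 GB qsim times from Table 6.9, ordered so that
 * index bit 0 = BlockSize (0: 64 B, 1: 1024 B), bit 1 = ZoneSize (0: 2 MB,
 * 1: 8 MB), bit 2 = ComprFactor (0: 4, 1: 14). */
#include <stdio.h>

int main(void)
{
    const double y[8] = {245.30, 258.30, 688.96, 165.64,   /* ComprFactor = 4  */
                         147.38, 253.49, 144.51, 152.63};  /* ComprFactor = 14 */
    const char *name[7] = {"ZoneSize", "BlockSize", "ComprFactor",
                           "ZoneSize+BlockSize", "ZoneSize+ComprFactor",
                           "BlockSize+ComprFactor", "ZoneSize+BlockSize+ComprFactor"};
    const int mask[7] = {2, 1, 4, 3, 6, 5, 7};  /* factor bits used by each effect */

    double q[7], total = 0.0;
    for (int t = 0; t < 7; t++) {
        q[t] = 0.0;
        for (int i = 0; i < 8; i++) {
            /* sign = product of the -1/+1 levels of the factors selected by mask[t] */
            int low = mask[t] & ~i;           /* selected factors that are at level -1 */
            int parity = 0;
            for (int m = low; m; m >>= 1)
                parity ^= (m & 1);
            q[t] += (parity ? -1.0 : 1.0) * y[i] / 8.0;
        }
        total += q[t] * q[t];
    }
    for (int t = 0; t < 7; t++)   /* fractions of explained variation, cf. Table 6.10 */
        printf("%-32s %6.2f%%\n", name[t], 100.0 * q[t] * q[t] / total);
    return 0;
}
```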

We use the 2^k experiment again to determine the effect of the same primary factors, ZoneSize, BlockSize, and ComprFactor, on the performance of the rand benchmark. The levels of the primary factors are the same as those described in Table 6.8, except that the two levels of the compression factor are 4 and 20. Table 6.11 lists the measured performance (in sec) of the rand variants on a compressed-memory system.

                                 1.33 GB    1.77 GB    1.99 GB    2.66 GB
ZoneSize                         3.31%      8.96%      39.03%     47.99%
BlockSize                        8.51%      3.06%      0.00%      1.08%
ComprFactor                      23.61%     27.91%     18.50%     13.13%
ZoneSize+BlockSize               21.81%     20.70%     11.08%     10.62%
ZoneSize+ComprFactor             11.2%      1.04%      1.00%      1.62%
BlockSize+ComprFactor            21.14%     37.69%     27.29%     20.90%
ZoneSize+BlockSize+ComprFactor   10.42%     0.64%      3.10%      4.65%

Table 6.10: The portion of variation explained by the three factors and their interaction for qsim simulations.

Test       Compr.    2 MB zone                    8 MB zone
           factor    64 B         1024 B          64 B        1024 B
1.2 GB     4         235.88       211.91          209.72      223.59
           20        65.35        257.62          83.40       166.66
1.4 GB     4         803.44       720.93          606.66      457.12
           20        314.08       791.38          379.24      516.40
1.8 GB     4         15,861.56    14,350.65       2,929.55    2,410.06
           20        2,318.25     16,016.12       2,248.79    2,687.42

Table 6.11: Results of the 2^k experiment. The performance [sec] of different rand variants measured on an Apple G5 with 1 GB physical memory.

                                 1.2 GB     1.4 GB     1.8 GB
ZoneSize                         2.68%      22.93%     58.38%
BlockSize                        24.75%     7.46%      5.84%
ComprFactor                      33.34%     17.58%     6.01%
ZoneSize+BlockSize               1.78%      8.46%      6.00%
ZoneSize+ComprFactor             1.20%      3.21%      5.25%
BlockSize+ComprFactor            28.66%     36.56%     10.42%
ZoneSize+BlockSize+ComprFactor   7.58%      3.81%      8.10%

Table 6.12: The portion of variation explained by the three factors and their interaction for the rand benchmark.

We use the sign table method to compute the percent of variation explained by the three factors and their interaction, and summarize the computations in Table 6.12. The results show that also for the rand benchmark, most of the performance variation is explained by the compression factor (row "ComprFactor") and the interaction between the block size and compression factor (row "BlockSize+ComprFactor"). Again, the results show that large applications (i.e., 1.8 GB) require large zone sizes.

Eq. 6.2 shows that the space required by the metadata increases as the zone size and compression factor increase, and as the block size decreases. However, the 2^k experiment shows that compression improves performance the most for large values of the zone size and compression factor and for small block sizes (even though the metadata space is then large). The experiments also show that performance is quite sensitive to the expected value of the compression factor.

For a better understanding of the tradeoffs between metadata and performance, the next sections investigate the effect of each of the primary factors on system performance.

Zone Size

To study the influence of the zone size on the performance of a compressed-memory system, we keep the block size and compression factor constant and measure qsim performance when the zone size is 2 MB, 4 MB, and 8 MB. The results summarized in Figure 6.8 show that a zone size of 4 MB and 8 MB improves performance for all simulations, while a zone size of 2 MB works fine only for the smallest simulation (1.33 GB).

Figure 6.8: The zone size influence on qsim performance. (Execution time in seconds for the 1.33 GB, 1.77 GB, 1.99 GB, and 2.66 GB simulations at zone sizes of 2 MB, 4 MB, and 8 MB.)

The data in Figure 6.9 indicate that for large zone sizes, the size of the compressed region that improves qsim performance grows slightly. However, although the compressed region size increases, for the given zone size the compressed region has the smallest size that can hold the application's working set. In other words, deleting a zone would result in a compressed region slightly smaller than the optimal one. Therefore, although the compressed region size increases as the zone size increases, this has a negligible impact on performance.

In general, large applications require large zone sizes. This is explained by the fact that large applications need large compressed regions, and allocating a large compressed region is faster when the system uses large zones (there are fewer zone add operations). The measurements also show that for medium-sized applications, large zone sizes do not hurt performance.

Block Size

Based on Eq. 6.2, the additional memory required to manage a zone's memory can be computed as follows:

metadata = a · ZoneSize / BlockSize + b · ComprFactor · ZoneSize / PageSize + c

metadata / ZoneSize = a / BlockSize + b · ComprFactor / PageSize + c / ZoneSize

metadata / ZoneSize = a / BlockSize + d

where d = b · ComprFactor / PageSize + c / ZoneSize.

We use the formula above to compute the size of the metadata needed to manage a zone's memory. For different block sizes, the metadata size is computed as a percentage of the zone size.

Figure 6.9: The zone size influence on the compressed region size. (Compressed region size in MB for the 1.33 GB, 1.77 GB, 1.99 GB, and 2.66 GB simulations at zone sizes of 2 MB, 4 MB, and 8 MB.)

The computations are summarized in Table 6.13, column "Metadata", and show that the metadata size decreases (from 15% to 4%) as the block size increases (from 64 bytes to 512 bytes). However, the metadata size is not the only aspect influenced by the block size.

Although organizing the physical memory of a zone in blocks keeps fragmentation to a minimum, memory fragmentation cannot be eliminated completely. External fragmentation appears when a zone's memory is used almost entirely, but its free blocks are insufficient for storing a new compressed page. Internal fragmentation appears when the last block that stores a compressed page is used only partially (this happens for almost all compression ratios and block sizes). The space wasted within a zone is the space wasted because of internal and external fragmentation. For different block sizes, we compute the amount of memory wasted within a zone as a percentage of the zone size. We do the computations for applications with different compression ratios and summarize the results in Table 6.13, column "Fragmentation". The numbers show that fragmentation increases as the block size increases. This can be explained by the fact that although a compressed page can be stored in fewer large blocks than small blocks, the percentage of memory unused in the last block increases as the block size increases.
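The internal-fragmentation component of this waste is easy to quantify in isolation. The sketch below only models the partially used last block of a single compressed page (here an arbitrary example page that compresses to 2100 bytes); the numbers in Table 6.13 additionally include external fragmentation over a whole zone, so they are not reproduced by this sketch.

```c
/* Internal fragmentation in the last block of one compressed page: a page
 * that compresses to `size` bytes occupies ceil(size / block) blocks, and the
 * unused tail of the last block is wasted.  External fragmentation (also
 * counted in Table 6.13) is not modeled here. */
#include <stdio.h>

int main(void)
{
    const int size = 2100;                         /* example compressed page size in bytes */
    const int blocks[] = {64, 128, 256, 512, 1024};

    for (int i = 0; i < 5; i++) {
        int n      = (size + blocks[i] - 1) / blocks[i];   /* blocks needed (ceiling)    */
        int wasted = n * blocks[i] - size;                 /* unused bytes in last block */
        printf("block %4d B: %2d blocks, %4d bytes wasted (%.1f%% of the space used)\n",
               blocks[i], n, wasted, 100.0 * wasted / (n * blocks[i]));
    }
    return 0;
}
```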

To validate the analytical results, we measure qsim performance on a compressed-memory system with block sizes between 64 and 1024 bytes. The measurements summarized in Figure 6.10 show that compression improves performance the most for block sizes smaller than 512 bytes. The measurements also show that block sizes of 1024 bytes are far too large.

Furthermore, the data in Figure 6.11 indicate that the compressed region size increases as the block size increases. This can be explained by the fact that the internal fragmentation increases as the block size increases. Hence, for large block sizes, the amount of memory wasted increases, which results in larger sizes of the compressed region.

Block size   Metadata        Fragmentation (% zone size)
(bytes)      (% zone size)   Compression ratio
                             0.25       0.50       0.75
64           15              6.25       3.13       2.08
96           11              3.03       3.18       3.12
128          9               12.50      6.25       4.17
192          7               13.33      3.03       6.25
256          6               25.00      12.50      8.33
384          5               11.11      13.34      12.50
512          4               50.00      25.00      16.66

Table 6.13: The size of metadata and memory unused because of fragmentation for different block sizes and compression ratios.

Figure 6.10: The block size influence on qsim performance. (Execution time in seconds for the 1.33 GB, 1.77 GB, 1.99 GB, and 2.66 GB simulations at block sizes from 64 B to 1024 B.)

Figure 6.11: The block size influence on the compressed region size. (Compressed region size in MB for the 1.33 GB, 1.77 GB, 1.99 GB, and 2.66 GB simulations at block sizes from 64 B to 1024 B.)

Compression Factor

We consider a system with a compression factor of 4. In other words, we expect to execute applications that have a compression ratio of 4:1. ComprFactor = 4 means that with compression a zone can store four times more pages than without compression. In this case, the number of entries in comp page table is CompEntries = 4 · ZoneSize / PageSize. First, we consider an application with a compression ratio better than 4, namely 6. For this application to use all physical memory, the system needs 6 · ZoneSize / PageSize page handles per zone. However, because a zone has only 4 · ZoneSize / PageSize page handles, some memory (or blocks) remains unused. Second, we consider an application with a compression ratio worse than 4, namely 2. For this application to use all physical memory, the system needs only 2 · ZoneSize / PageSize page handles, and therefore half of the entries in comp page table are unused. In short, if an application's compression ratio is better than the expected ComprFactor, some blocks cannot be used and their memory space is wasted. On the other hand, if an application's compression ratio is worse than ComprFactor, some entries in comp page table cannot be used. Next, we study whether, in general, it is better to have a high or a low value of the compression factor.
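As a concrete example (assuming the default 4 MB zone and a 4 KB page size): a zone holds 1024 uncompressed pages, so ComprFactor = 4 provides 4096 entries in comp page table. An application that compresses 6:1 would need about 6144 page handles to fill the zone's blocks, so roughly a third of the zone's memory cannot be addressed and stays unused; an application that compresses 2:1 needs only about 2048 handles, so half of the entries in comp page table remain unused.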

To study the influence of the compression factor on system performance, we keep the zone size and block size constant and measure qsim performance when the value of the compression factor is 4, 7, 9, and 14. (qsim's compression ratio is 10:1.) The measurements summarized in Figure 6.12 show that a compressed-memory system improves an application's performance for values of the compression factor that are equal to or larger than the application's compression ratio. Furthermore, the data in Figure 6.13 indicate that when the compression factor is smaller than an application's compression ratio, the compressed region size is also larger than the size that would suffice if enough entries to address a zone's memory were available. In other words, the space wasted when not enough page handles are available is more significant than the space wasted when the number of page handles is larger than the number required.

Figure 6.12: The compression factor influence on qsim performance. (Execution time in seconds for the 1.33 GB, 1.77 GB, 1.99 GB, and 2.66 GB simulations at compression factors of 4, 7, 9, and 14.)

Figure 6.13: The compression factor influence on the compressed region size. (Compressed region size in MB at compression factors of 4, 7, 9, and 14.)

Performance Factors Summary

To summarize, our analysis shows that a compressed-memory system that has a high value of the compression factor improves performance for a wide range of applications (with different compression ratios). Measurements indicate that block sizes smaller than 512 bytes work well for the selected applications. Furthermore, as the size of an application's working set increases, the zone size should also increase for compression to show maximum performance improvements.

6.6 Summary

This chapter takes a systematic approach to addressing the main concerns in the evaluation of compressed-memory systems and answers the following questions:

Is main memory compression beneficial? For real applications that need large memories to store their data sets, we find that our approach to main memory compression can improve performance by a factor of 1.3 to 55. The performance improvements are directly correlated to the memory access behavior of each application and also to the compression ratio achieved.

Does adaptation work? We find that the adaptation scheme proposed in this dissertation is able to fulfill its goals, i.e., it is able to find at run-time a compressed region size that is among those that improve performance the most, and it is also able to detect the cases when compression degrades an application's performance. When compression is not beneficial, the overhead of turning on compression, detecting that compression hurts performance, and turning off compression is fairly small (< 10%).

When does adaptation fail? The case when adaptation fails to provide the service expected is when the compressed region size exceeds the maximum amount of memory that can be allocated in kernel mode. We found that adaptation fails for applications that require compressed region sizes bigger than 100 MB when executing on IA32 architectures running the Linux operating system.

Efficiency. The detailed evaluation of the key factors affecting performance reveals that for a compressed-memory system to work well for various applications, the zone size should be 4 MB or larger, the block size should be 512 bytes or smaller, and the compression factor should be 10 or larger.

7 Conclusions

Memory-bound applications perform poorly in paged virtual memory systems because demand paging involves slow disk I/O accesses. Unfortunately, given the technology trends of the last decades, a processor must wait an increasingly large number of cycles for disk reads/writes to complete. Moreover, although the amount of main memory in a workstation has increased, application developers have even more aggressively increased their demands. Memory-bound applications involve data sets that are too large to fit in main memory. Such large data structures are becoming more common as people attempt to solve large problems. Many databases, astrophysics modeling, engineering problems, and network and car traffic simulators are examples of such applications.

Much research has been done on reducing the I/O overhead in such applications by either reducing the number of I/Os or lowering the cost of each I/O operation. The first approach to reducing the number of I/O accesses is to prefetch pages even before they are requested. To speculatively prefetch pages, the OS relies on a history of page accesses. Nevertheless, because accurate prediction schemes need extensive histories, which are expensive to maintain for real-life systems, commercial systems have rarely used sophisticated prediction schemes [112]. Another approach to reducing the number of I/O accesses is to have the application (not the OS) use explicit I/O calls, because the application has better knowledge of its own data locality and reference pattern. However, this approach requires significant restructuring of the code, which can be a tremendous task. An approach to lowering the cost of each I/O operation is the memory server system [56], which uses remote memory servers as distributed caches for disk backing stores. However, the main limitation of this approach is that for applications with poor data locality, the paging overhead in the memory server system is still significantly high.

An attractive approach to reducing the number of I/O accesses is main memory compression. The main idea of memory compression is to set aside part of main memory to hold compressed data. Although the idea of software-based memory compression has been investigated in several projects, a number of challenges remained largely unaddressed by the previous studies. For instance, resizing the compressed region requires moving (copying) un/compressed pages, as well as resizing the metadata needed to manage the compressed region. Therefore, implementing an efficient management system that varies the allocation of real memory between the uncompressed and compressed regions remains a challenge. Furthermore, the thorny issue is that sizing the compressed region is difficult and, if not done right, memory compression slows down the application. The size of the compressed region that can improve an application's performance is determined by the application's compression ratio and memory access pattern. Therefore, because memory-bound applications exhibit complex dynamic behavior, which is difficult to evaluate thoroughly, many current compressed-memory systems are built in a rather ad-hoc manner.

7.1 Summary and Contributions

This dissertation addresses the challenges mentioned and makes the following contributions:

Design. The dissertation presents a practical design for a compressed-memory system. We propose a method for organizing the compressed region in a way that keeps fragmentation to a minimum and allocates the right amount of memory for the metadata needed to manage the compressed region. The key idea of our design is to organize the compressed region in zones of the same size. A zone is self-contained in that it consists of memory to store compressed data and structures to manage the compressed data. By dividing the compressed region into zones of the same size and by further organizing a zone in fixed-size blocks, our design succeeds in keeping fragmentation to a minimum. Moreover, the proposed design imposes some locality on the blocks of a compressed page by storing a compressed page within a single zone and not scattering it over multiple zones. This design decision eases the zone delete operation, as the system need not deal with pages that are partially stored in other zones. The proposed system grows and shrinks the compressed region by adding and removing zones. Because zone add and zone remove operations have low costs (due to the separation of data and metadata), the design we propose allows for easy resizing of the compressed region.

Adaptation. Based on an application's memory requirements, we developed a heuristic adaptation algorithm for resizing the compressed region. The algorithm varies the allocation of real memory between uncompressed and compressed regions such that the two regions host most of the application's working set. The main idea of the proposed algorithm is that when an application's working set fits into the uncompressed and compressed regions, most of its disk accesses are avoided and the application should execute faster with compression than without. In addition, we present a technique for determining if compression hurts an application's performance. The key idea is to compare at run-time an application's performance on the compressed-memory system with an estimate of its performance without compression. If the measured performance is worse than the estimated one, compression is turned off and the application continues its execution without compression. The ability to detect the applications for which compression is not beneficial makes the compressed-memory system proposed in this thesis practical, since no user needs to fear that the compressed region will degrade performance.

Evaluation. The dissertation presents a systematic approach to evaluating the performance of memory-bound applications executing on an adaptive compressed-memory system. The evaluation establishes that the adaptation (resizing) scheme is robust with respect to many factors that influence adaptation decisions. Experimental results show that adaptation works as intended: (1) the compressed region size found by the adaptive scheme is among those that improve performance the most and (2) the system is agile enough to detect the applications for which compression hurts performance. Furthermore, the dissertation systematically identifies and evaluates the primary design factors and their impact on performance.

Performance prediction. As part of the adaptation scheme, we propose an analytical model for predicting the performance of memory-bound applications. The performance prediction technique we propose can be used to predict an application's performance on common as well as compressed-memory systems. The novel aspect is that we focus on an application's interaction with the memory system and classify applications according to their memory access pattern. We show that characterizing a memory-bound application in terms of its continuous, strided and random memory accesses suffices. To predict an application's performance, the analytical model is combined with results from micro-benchmarking. Experimental results show that the technique we propose is accurate and simple enough to be used on systems that run complete, complex programs with large data sets.

7.2 Future Work

There are two directions for future research based on our work. First, the adaptation mechanism and performance prediction model can be extended to address a larger class of applications. Second, the optimal values of the compressed-memory system parameters can be determined before compression is turned on.

Adaptation. Applications other than those studied in our work may stress the memory system differently and may therefore require performance prediction models other than the ones we propose. Our resizing scheme can easily be extended to include such models. Nevertheless, the increasing complexity of current and future microprocessors and memory technologies makes performance prediction for real, large applications difficult.

Initialization. We have seen that several performance factors (e.g., block size) influence an application's performance on a compressed-memory system. Unfortunately, the optimal factor values differ from application to application. Future implementations of the compressed-memory system can collect the information needed to determine optimal system parameters while applications execute and before compression is turned on. For instance, the OS can easily gather an application's compression ratio. If the system decides to turn on compression, it computes the parameter values based on the data gathered and initializes the compressed-memory system accordingly.
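
A user-space sketch of this idea follows: it estimates an application's compression ratio by compressing a sample of its pages with zlib (standing in for whatever compressor the system actually uses) before compression is turned on. The sampling policy, page size, and function name are assumptions of the sketch; the resulting ratio could then guide the choice of initial block size and number of zones.

    #include <stdlib.h>
    #include <zlib.h>                     /* zlib stands in for the real compressor */

    #define PAGE_SIZE 4096

    /* Estimate an application's compression ratio by compressing an evenly
       spread sample of its resident pages. */
    double estimate_compression_ratio(unsigned char **pages, int npages, int sample)
    {
        unsigned long in_bytes = 0, out_bytes = 0;
        uLong bound = compressBound(PAGE_SIZE);
        Bytef *buf = malloc(bound);

        if (buf == NULL || npages <= 0 || sample <= 0) {
            free(buf);
            return 1.0;                   /* fall back to "no compression benefit"  */
        }
        if (sample > npages)
            sample = npages;

        for (int i = 0; i < sample; i++) {
            /* Spread the sample evenly over the application's pages. */
            unsigned char *page = pages[(long)i * npages / sample];
            uLongf clen = bound;
            if (compress2(buf, &clen, page, PAGE_SIZE, Z_BEST_SPEED) == Z_OK) {
                in_bytes  += PAGE_SIZE;
                out_bytes += clen;
            }
        }
        free(buf);
        return out_bytes > 0 ? (double)in_bytes / (double)out_bytes : 1.0;
    }

The sketch links against zlib (e.g., -lz); a ratio near 1.0 would indicate that the application's data is hardly compressible and that enabling compression is unlikely to pay off.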

List of Figures

2.1 MXT memory hierarchy 10

2.2 Physical memory occupancy 11
2.3 Memory Hierarchies 13

3.1 Compressed Memory Hierarchy 20

3.2 Bird's-eye view of the compressed-memory system design 21

3.3 Detailed view of the compressed-memory system design 22
3.4 Flow Diagram: Insert Page 24
3.5 Flow Diagram: Add Zone 27
3.6 Flow Diagram: Remove Zone 28
3.7 Call Graph: kswapd() 30
3.8 Page Cache LRU List 31

3.9 Reclaiming pages from the page cache 32
3.10 Call Graph: shrink_cache() 33
3.11 Call Graph: swap_writepage() 34
3.12 Call Graph: do_page_fault() 35
3.13 Call Graph: do_swap_page() 36

3.14 Linux kernel space 37
3.15 Call Graph: zone_add() 38
3.16 Call Graph: zone_rem() 39
3.17 Call Graph: cpbuf_put() 40
3.18 Call Graph: cpbuf_get() 40
3.19 Call Graph: kcmswapd() 41

4.1 MSP-RA prediction error. 64

4.2 AAT and MSP-RA estimates 64

4.3 MSP-IA prediction error. 65

4.4 MSP-IA prediction error. 66

4.5 AAT and MSP-IA estimates 66

4.6 AAT and MSP-IA estimates 67

4.7 MSP-RA prediction error. 68

4.8 MSP-IA prediction error. 68

4.9 Access type distribution for DRAM reads 69

4.10 Access type distribution for DRAM writes 69

5.1 A 3-level memory hierarchy and a compressed-memory hierarchy 75
5.2 "Overhead" and "Gain" values for different compression speeds 77
5.3 Adapt phase 81

5.4 LRU list of all compressed pages 83

6.1 Execution time for rand 1.2 GB on an Apple G5 with 1 GB DRAM 93

6.2 Execution time for rand 1.4 GB on an Apple G5 with 1 GB DRAM 94

6.3 Execution time for rand 1.8 GB on an Apple G5 with 1 GB DRAM 94

6.4 Execution time for qsim 1.33 GB on an Apple G5 with 1 GB DRAM 95

6.5 Execution time for qsim 1.77 GB on an Apple G5 with 1 GB DRAM 95

6.6 Execution time for qsim 1.99 GB on an Apple G5 with 1 GB DRAM 96

6.7 Execution time for qsim 2.66 GB on an Apple G5 with 1 GB DRAM 96

6.8 The zone size influence on qsim performance 104

6.9 The zone size influence on the compressed region size 105

6.10 The block size influence on qsim performance 106

6.11 The block size influence on the compressed region size 107

6.12 The compression factor influence on qsim performance 108

6.13 The compression factor influence on the compressed region size 108

Curriculum Vitae

Irina Tuduce, born Chihaia

April 9, 1976 Born in Piatra-Neamt, Romania

1982 - 1990 Primary and secondary school, Baia-Mare, Romania

1990 - 1994 Gheorghe Sincai High School, Baia-Mare, Romania

1994 Diploma de Bacalaureat (Matura), Mathematics and Physics profile

1994 - 1999 Studies in Computer Science, Technical University of Cluj-Napoca, Romania

1999 Diploma in Computer Science, Technical University of Cluj-Napoca, Romania

since 1999 Research and Teaching Assistant, Laboratory for Software Technology, ETH Zurich
