The Mondrian Data Engine

Total Page:16

File Type:pdf, Size:1020Kb

The Mondrian Data Engine The Mondrian Data Engine Mario Drumond Alexandros Daglis Nooshin Mirzadeh Dmitrii Ustiugov Javier Picorel Babak Falsafi Boris Grot1 Dionisios Pnevmatikatos2 EcoCloud, EPFL 1University of Edinburgh 2FORTH-ICS & ECE-TUC [email protected],[email protected],[email protected] ABSTRACT massive amounts of data while maintaining a low request tail latency. The increasing demand for extracting value out of ever-growing Similarly, analytic engines for business intelligence are increasingly data poses an ongoing challenge to system designers, a task only memory resident to minimize query response time. The net result is made trickier by the end of Dennard scaling. As the performance that memory is taking center stage in server design as datacenter op- density of traditional CPU-centric architectures stagnates, advancing erators maximize memory capacity integrated per server to increase compute capabilities necessitates novel architectural approaches. throughput per total cost of ownership [42, 53]. Near-memory processing (NMP) architectures are reemerging as The slowdown in Dennard scaling has further pushed server promising candidates to improve computing efficiency through tight designers toward efficient memory access [18, 30, 49]. A large coupling of logic and memory. NMP architectures are especially spectrum of data services has moderate computational require- fitting for data analytics, as they provide immense bandwidth to ments [20, 36, 43] but large energy footprints primarily due to data memory-resident data and dramatically reduce data movement, the movement [16]: the energy cost of fetching a word of data from main source of energy consumption. off-chip DRAM is up to 6400× higher than operating on it [29]. To Modern data analytics operators are optimized for CPU execu- alleviate the cost of memory access, several memory vendors are tion and hence rely on large caches and employ random memory now stacking several layers of DRAM on top of a thin layer of logic, accesses. In the context of NMP, such random accesses result in enabling near-memory processing (NMP) [35, 46, 63]. The com- wasteful DRAM row buffer activations that account for a significant bination of ample internal bandwidth and a dramatic reduction in fraction of the total memory access energy. In addition, utilizing data movement makes NMP architectures an inherently better fit for NMP’s ample bandwidth with fine-grained random accesses requires in-memory data analytics than traditional CPU-centric architectures, complex hardware that cannot be accommodated under NMP’s tight triggering a recent NMP research wave [2, 3, 22, 23, 56, 57]. area and power constraints. Our thesis is that efficient NMP calls for While improving over CPU-centric systems with NMP is trivial, an algorithm-hardware co-design that favors algorithms with sequen- designing an efficient NMP system for data analytics in terms of tial accesses to enable simple hardware that accesses memory in performance per watt is not as straightforward [19]. Efficiency impli- streams. We introduce an instance of such a co-designed NMP archi- cations stem from the optimization of data analytics algorithms for tecture for data analytics, the Mondrian Data Engine. Compared to a CPU platforms, making heavy use of random memory accesses and CPU-centric and a baseline NMP system, the Mondrian Data Engine relying on the cache hierarchy for high performance. In contrast, the improves the performance of basic data analytics operators by up to strength of NMP architectures is the immense memory bandwidth 49× and 5×, and efficiency by up to 28× and 5×, respectively. they offer through tight integration of logic and memory. While DRAM is designed for random memory accesses, such accesses CCS CONCEPTS are significantly costlier than sequential ones in terms of energy, stemming from DRAM row activations [65]. • Computer systems organization ! Processors and memory In addition to their energy premium, random memory accesses architectures; Single instruction, multiple data; • Information sys- can also become a major performance obstacle. The logic layer of tems ! Main memory engines; • Hardware ! Die and wafer stack- NMP devices can only host simple hardware, being severely area- ing; and power-limited. Unfortunately, generating enough memory-level KEYWORDS parallelism to saturate NMP’s ample bandwidth using fine-grained accesses requires non-trivial hardware resources. Random memory Near-memory processing, sequential memory access, algorithm- accesses result in underutilized bandwidth, leaving a significant hardware co-design performance headroom. In this work, we show that achieving NMP efficiency requires an 1 INTRODUCTION algorithm-hardware co-design to maximize the amount of sequential memory accesses and provide hardware capable of transforming Large-scale IT services are increasingly migrating from storage to these accesses into bandwidth utilization under NMP’s tight area memory because of intense data access demands [7, 17, 51]. Online and power constraints. We take three steps in this direction: First, services such as search and social connectivity require managing we identify that the algorithms for common data analytics operators ISCA ’17, June 24-28, 2017, Toronto, ON, Canada that are preferred for CPU execution are not equally fitting for NMP. © 2017 Association for Computing Machinery. NMP favors algorithms that maximize sequential accesses, trading This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of ISCA ’17, June 24-28, 2017, https://doi.org/10.1145/3079856.3080233. 2 This work was done while the author was at EPFL. off algorithmic complexity for DRAM-friendly and predictable ac- Basic operator Spark operator cess patterns. Second, we observe that most data analytics operators Filter, Union, LookupKey, Scan feature a major data partitioning phase, where data is shuffled across Map, FlatMap, MapValues GroupByKey, Cogroup, ReduceByKey, memory partitions to improve locality in the following probe phase. Group by Partitioning unavoidably results in random memory accesses, regard- Reduce, CountByKey, AggregateByKey less of the choice of algorithm. However, we show that minimal Join Join Sort SortByKey hardware support that exploits a common characteristic in the data layout of analytics workloads can safely reorder memory accesses, Table 1: Characterization of Spark operators. thus complementing the algorithm’s contribution in turning random accesses to sequential. Finally, we design simple hardware that can operate on data streams at the memory’s peak bandwidth, while respecting the energy and power constraints of the NMP logic layer. plow through data [60]. Table 1 contains a broad set of common data Building on our observations, we propose the Mondrian Data transformations in Spark [21] and the basic data operators needed to Engine, an NMP architecture that reduces power consumption by implement them in memory: Scan, Group by, Join, and Sort. changing energy-hungry random memory access patterns to sequen- Contemporary analytics software is typically column-oriented tial DRAM-friendly ones and boosts performance through wide com- [1, 12, 21, 40, 50, 71], storing datasets as collections of individual putation units performing data analytics operations on data streams columns, each containing a key and an attribute of a particular type. at full memory bandwidth. We make the following contributions: A dataset is distributed across memory modules where the operators • Analyze common data analytics operators, demonstrating preva- can run in parallel. The simplest form of transformation is a Scan lence of fine-grain random memory accesses. In the context of operator where each dataset subset is sequentially scanned for a NMP, these induce significant performance and energy overheads, particular key. More complex transformations require comparing preventing high utilization of the available DRAM bandwidth. keys among one or more subsets of the dataset. • Argue that NMP-friendly algorithms fundamentally differ from To execute the comparison efficiently (i.e., maximize parallelism conventional CPU-centric ones, optimizing for sequential memory and locality), modern database systems partition datasets based on accesses rather than cache locality. key ranges [68]. For instance, parallelizing a Join would require first • Identify object permutability as a common attribute in the parti- a range partitioning of a dataset based on the data items’ keys to tioning phase of key analytics operators, meaning that the location divide them among the memory modules, followed by an indepen- and access order of objects in memory can be safely rearranged to dent Join operation on each subset. A similar data item distribution improve the efficiency of memory access patterns. We exploit per- is required for the Group by and Sort operators. mutability to coalesce random accesses that would go to disjoint Table 2 compares and contrasts the algorithmic steps in the ba- locations in the original computation into a sequential stream. sic data operators. All operators can be broken down into two main • Show that NMP efficiency can be significantly improved through phases. Starting from data that is initially randomly distributed across an algorithm-hardware co-design that trades off additional com- multiple memory partitions, the goal of the first phase, partitioning, putations and memory accesses for increased
Recommended publications
  • Voodoo - a Vector Algebra for Portable Database Performance on Modern Hardware
    Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware Holger Pirk Oscar Moll Matei Zaharia Sam Madden MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL [email protected] [email protected] [email protected] [email protected] ABSTRACT Single Thread Branch Multithread Branch GPU Branch In-memory databases require careful tuning and many engi- Single Thread No Branch Multithread No Branch GPU No Branch neering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and 10 hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer { even more so for a query compiler. To facilitate performance- 1 oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that ab- stracts the detailed architectural properties of the hard- ware, such as multi- or many-core architectures, caches and Absolute time in s 0.1 SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about 1 5 10 50 100 and tune than low-level C and related hardware-focused ex- Selectivity tensions (Intrinsics, OpenCL, CUDA, etc.). This enables Figure 1: Performance of branch-free selections based on our Voodoo compiler to produce (OpenCL) code that rivals cursor arithmetics [28] (a.k.a. predication) over a branching and even outperforms the fastest state-of-the-art in memory implementation (using if statements) databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache- conscious processing, predication and vectorization (again PeR [18], Legobase [14] and TupleWare [9], have arisen.
    [Show full text]
  • A Survey on Parallel Database Systems from a Storage Perspective: Rows Versus Columns
    A Survey on Parallel Database Systems from a Storage Perspective: Rows versus Columns Carlos Ordonez1 ? and Ladjel Bellatreche2 1 University of Houston, USA 2 LIAS/ISAE-ENSMA, France Abstract. Big data requirements have revolutionized database technol- ogy, bringing many innovative and revamped DBMSs to process transac- tional (OLTP) or demanding query workloads (cubes, exploration, pre- processing). Parallel and main memory processing have become impor- tant features to exploit new hardware and cope with data volume. With such landscape in mind, we present a survey comparing modern row and columnar DBMSs, contrasting their ability to write data (storage mecha- nisms, transaction processing, batch loading, enforcing ACID) and their ability to read data (query processing, physical operators, sequential vs parallel). We provide a unifying view of alternative storage mecha- nisms, database algorithms and query optimizations used across diverse DBMSs. We contrast the architecture and processing of a parallel DBMS with an HPC system. We cover the full spectrum of subsystems going from storage to query processing. We consider parallel processing and the impact of much larger RAM, which brings back main-memory databases. We then discuss important parallel aspects including speedup, sequential bottlenecks, data redistribution, high speed networks, main memory pro- cessing with larger RAM and fault-tolerance at query processing time. We outline an agenda for future research. 1 Introduction Parallel processing is central in big data due to large data volume and the need to process data faster. Parallel DBMSs [15, 13] and the Hadoop eco-system [30] are currently two competing technologies to analyze big data, both based on au- tomatic data-based parallelism on a shared-nothing architecture.
    [Show full text]
  • Memory Interference Characterization Between CPU Cores and Integrated Gpus in Mixed-Criticality Platforms
    Memory Interference Characterization between CPU cores and integrated GPUs in Mixed-Criticality Platforms Roberto Cavicchioli, Nicola Capodieci and Marko Bertogna University of Modena And Reggio Emilia, Department of Physics, Informatics and Mathematics, Modena, Italy [email protected] Abstract—Most of today’s mixed criticality platforms feature be put on the memory traffic originated by the iGPU, as it Systems on Chip (SoC) where a multi-core CPU complex (the represents a very popular architectural paradigm for computing host) competes with an integrated Graphic Processor Unit (iGPU, massively parallel workloads at impressive performance per the device) for accessing central memory. The multi-core host Watt ratios [1]. This architectural choice (commonly referred and the iGPU share the same memory controller, which has to to as General Purpose GPU computing, GPGPU) is one of the arbitrate data access to both clients through often undisclosed reference architectures for future embedded mixed-criticality or non-priority driven mechanisms. Such aspect becomes crit- ical when the iGPU is a high performance massively parallel applications, such as autonomous driving and unmanned aerial computing complex potentially able to saturate the available vehicle control. DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts In order to maximize the validity of the presented results, due to parallel accesses to main memory by both CPU cores we considered different platforms featuring different memory and iGPU, so to motivate the need of novel paradigms for controllers, instruction sets, data bus width, cache hierarchy memory centric scheduling mechanisms. We analyzed different configurations and programming models: well known and commercially available platforms in order to i NVIDIA Tegra K1 SoC (TK1), using CUDA 6.5 [2] for estimate variations in throughput and latencies within various memory access patterns, both at host and device side.
    [Show full text]
  • A Cache-Efficient Sorting Algorithm for Database and Data Mining
    A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations using Graphics Processors Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David Tuft, Dinesh Manocha University of North Carolina at Chapel Hill {naga,nikunj,henson,tuft,dm}@cs.unc.edu http://gamma.cs.unc.edu/GPUSORT Abstract els [3]. These techniques have been extensively studied for CPU-based algorithms. We present a fast sorting algorithm using graphics pro- Recently, database researchers have been exploiting the cessors (GPUs) that adapts well to database and data min- computational capabilities of graphics processors to accel- ing applications. Our algorithm uses texture mapping and erate database queries and redesign the query processing en- blending functionalities of GPUs to implement an efficient gine [8, 18, 19, 39]. These include typical database opera- bitonic sorting network. We take into account the commu- tions such as conjunctive selections, aggregates and semi- nication bandwidth overhead to the video memory on the linear queries, stream-based frequency and quantile estima- GPUs and reduce the memory bandwidth requirements. We tion operations,and spatial database operations. These al- also present strategies to exploit the tile-based computa- gorithms utilize the high memory bandwidth, the inherent tional model of GPUs. Our new algorithm has a memory- parallelism and vector processing functionalities of GPUs efficient data access pattern and we describe an efficient to execute the queries efficiently. Furthermore, GPUs are instruction dispatch mechanism to improve the overall sort- becoming increasingly programmable and have also been ing performance. We have used our sorting algorithm to ac- used for sorting [19, 24, 35]. celerate join-based queries and stream mining algorithms.
    [Show full text]
  • How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?
    Hot topic session: How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? Francky Catthoor, Nikil D. Dutt, IMEC , Leuven, Belgium ([email protected]) U.C.Irvine, CA ([email protected]) Christoforos E. Kozyrakis, U.C.Berkeley, CA ([email protected]) Abstract bedded memories are correctly incorporated. What is more controversial still however is how to deal with dynamic ap- Current processor architectures, both in the pro- plication behaviour, with multithreading and with highly grammable and custom case, become more and more dom- concurrent architecture platforms. inated by the data access bottlenecks in the cache, system bus and main memory subsystems. In order to provide suf- 2. Explicitly parallel architectures for mem- ficiently high data throughput in the emerging era of highly ory performance enhancement (Christo- parallel processors where many arithmetic resources can work concurrently, novel solutions for the memory access foros E. Kozyrakis) and data transfer will have to be introduced. The crucial question we want to address in this hot topic While microprocessor architectures have significantly session is where one can expect these novel solutions to evolved over the last fifteen years, the memory system rely on: will they be mainly innovative processor archi- is still a major performance bottleneck. The increasingly tecture ideas, or novel approaches in the application com- complex structures used for speculation, reordering, and piler/synthesis technology, or a mix. caching have improved performance, but are quickly run- ning out of steam. These techniques are becoming ineffec- tive in terms of resource utilization and scaling potential.
    [Show full text]
  • Prefetching for Complex Memory Access Patterns
    UCAM-CL-TR-923 Technical Report ISSN 1476-2986 Number 923 Computer Laboratory Prefetching for complex memory access patterns Sam Ainsworth July 2018 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2018 Sam Ainsworth This technical report is based on a dissertation submitted February 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Churchill College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Abstract Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernible pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetch- ing compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3 geometric mean speedup on the most complex × processors, and 2.7 on the simplest. I also provide an analysis of when and why this × is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher.
    [Show full text]
  • VECTORWISE Simply FAST
    VECTORWISE Simply FAST A technical whitepaper TABLE OF CONTENTS: Introduction .................................................................................................................... 1 Uniquely fast – Exploiting the CPU ................................................................................ 2 Exploiting Single Instruction, Multiple Data (SIMD) .................................................. 2 Utilizing CPU cache as execution memory ............................................................... 3 Other CPU performance features ............................................................................. 3 Leveraging industry best practices ................................................................................ 4 Optimizing large data scans ...................................................................................... 4 Column-based storage .............................................................................................. 4 VectorWise’s hybrid column store ........................................................................ 5 Positional Delta Trees (PDTs) .............................................................................. 5 Data compression ..................................................................................................... 6 VectorWise’s innovative use of data compression ............................................... 7 Storage indexes ........................................................................................................ 7 Parallel
    [Show full text]
  • Eth-28647-02.Pdf
    Research Collection Doctoral Thesis Adaptive main memory compression Author(s): Tuduce, Irina Publication Date: 2006 Permanent Link: https://doi.org/10.3929/ethz-a-005180607 Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Doctoral Thesis ETH No. 16327 Adaptive Main Memory Compression A dissertation submitted to the Swiss Federal Institute of Technology Zurich (ETH ZÜRICH) for the degree of Doctor of Technical Sciences presented by Irina Tuduce Engineer, TU Cluj-Napoca born April 9, 1976 citizen of Romania accepted on the recommendation of Prof. Dr. Thomas Gross, examiner Prof. Dr. Lothar Thiele, co-examiner 2005 Seite Leer / Blank leaf Seite Leer / Blank leaf Seite Leer lank Abstract Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. Since fast memory is expensive, an economical solution to that desire is a memory hierarchy organized into several levels. Each level has greater capacity than the preceding but it is less quickly accessible. The goal of the memory hierarchy is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. In the last decades, the processor performance improved much faster than the performance of the memory levels. The memory hierarchy proved to be a scalable solution, i.e., the bigger the perfomance gap between processor and memory, the more levels are used in the memory hierarchy. For instance, in 1980 microprocessors were often designed without caches, while in 2005 most of them come with two levels of caches on the chip.
    [Show full text]
  • Detecting False Sharing Efficiently and Effectively
    W&M ScholarWorks Arts & Sciences Articles Arts and Sciences 2016 Cheetah: Detecting False Sharing Efficiently andff E ectively Tongping Liu Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA; Xu Liu Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA Follow this and additional works at: https://scholarworks.wm.edu/aspubs Recommended Citation Liu, T., & Liu, X. (2016, March). Cheetah: Detecting false sharing efficiently and effectively. In 2016 IEEE/ ACM International Symposium on Code Generation and Optimization (CGO) (pp. 1-11). IEEE. This Article is brought to you for free and open access by the Arts and Sciences at W&M ScholarWorks. It has been accepted for inclusion in Arts & Sciences Articles by an authorized administrator of W&M ScholarWorks. For more information, please contact [email protected]. Cheetah: Detecting False Sharing Efficiently and Effectively Tongping Liu ∗ Xu Liu ∗ Department of Computer Science Department of Computer Science University of Texas at San Antonio College of William and Mary San Antonio, TX 78249 USA Williamsburg, VA 23185 USA [email protected] [email protected] Abstract 1. Introduction False sharing is a notorious performance problem that may Multicore processors are ubiquitous in the computing spec- occur in multithreaded programs when they are running on trum: from smart phones, personal desktops, to high-end ubiquitous multicore hardware. It can dramatically degrade servers. Multithreading is the de-facto programming model the performance by up to an order of magnitude, significantly to exploit the massive parallelism of modern multicore archi- hurting the scalability. Identifying false sharing in complex tectures.
    [Show full text]
  • A Hybrid Analytical DRAM Performance Model
    A Hybrid Analytical DRAM Performance Model George L. Yuan Tor M. Aamodt Department of Electrical and Computer Engineering University of British Columbia, Vancouver, BC, CANADA fgyuan,[email protected] Abstract 2000 As process technology scales, the number of transistors that 1600 can ¯t in a unit area has increased exponentially. Proces- 1200 sor throughput, memory storage, and memory throughput 800 have all been increasing at an exponential pace. As such, 400 DRAM has become an ever-tightening bottleneck for appli- 0 Memory Access Latency cations with irregular memory access patterns. Computer BS FWT DG LPS LIB MMC MUM NN RAY RED SP SLA TRA WP architects in industry sometimes use ad hoc analytical mod- eling techniques in lieu of cycle-accurate performance sim- Figure 1: Average memory latency measured in perfor- ulation to identify critical design points. Moreover, ana- mance simulation lytical models can provide clear mathematical relationships for how system performance is a®ected by individual mi- croarchitectural parameters, something that may be di±cult cess behavior. In the massively multithreaded GPU Com- to obtain with a detailed performance simulator. Modern pute architecture that we simulate, threads are scheduled to DRAM controllers rely on Out-of-Order scheduling policies Single-Instruction Multiple-Data pipelines in groups of 32 to increase row access locality and decrease the overheads of called warps, each thread of which is capable of generating timing constraint delays. This paper proposes a hybrid ana- a memory request to any location in GPU main memory. lytical DRAM performance model that uses memory address (This can be done as long as the limit on the number of in- traces to predict the DRAM e±ciency of a DRAM system flight memory requests per GPU shader core has not been when using such a memory scheduling policy.
    [Show full text]
  • Micro Adaptivity in Vectorwise
    Micro Adaptivity in Vectorwise Bogdan Raducanuˇ Peter Boncz Actian CWI [email protected] [email protected] ∗ Marcin Zukowski˙ Snowflake Computing marcin.zukowski@snowflakecomputing.com ABSTRACT re-arrange the shape of a query plan in reaction to its execu- Performance of query processing functions in a DBMS can tion so far, thereby enabling the system to correct mistakes be affected by many factors, including the hardware plat- in the cost model, or to e.g. adapt to changing selectivities, form, data distributions, predicate parameters, compilation join hit ratios or tuple arrival rates. Micro Adaptivity, intro- method, algorithmic variations and the interactions between duced here, is an orthogonal approach, taking adaptivity to these. Given that there are often different function imple- the micro level by making the low-level functions of a query mentations possible, there is a latent performance diversity executor more performance-robust. which represents both a threat to performance robustness Micro Adaptivity was conceived inside the Vectorwise sys- if ignored (as is usual now) and an opportunity to increase tem, a high performance analytical RDBMS developed by the performance if one would be able to use the best per- Actian Corp. [18]. The raw execution power of Vectorwise forming implementation in each situation. Micro Adaptivity, stems from its vectorized query executor, in which each op- proposed here, is a framework that keeps many alternative erator performs actions on vectors-at-a-time, rather than a function implementations (“flavors”) in a system. It uses a single tuple-at-a-time; where each vector is an array of (e.g.
    [Show full text]
  • Performance Analysis of Cache Memory
    CORE Metadata, citation and similar papers at core.ac.uk Provided by Drexel Libraries E-Repository and Archives Cache Miss Analysis of Walsh-Hadamard Transform Algorithms A Thesis Submitted to the Faculty of Drexel University by Mihai Alexandru Furis in partial fulfillment of the requirements for the degree of Master of Science in Computer Science March 2003 ii Acknowledgements I would like to express my sincere gratitude to my graduate advisor, Professor Jeremy Johnson Ph.D., for his continuing support during the thesis preparation. Without his advice, guidance, understanding and patience everything would have been a lot more complicated. Finally, I am grateful to all the members of the Computer Science and Mathematics Departments: professors, TAs or friends, with who I collaborated over the years at Drexel University. iii Table of Contents Table of Contents ....................................................................................................... iii List of Tables................................................................................................................ v List of Figures ............................................................................................................. vi Abstract....................................................................................................................... vii Chapter 1 Introduction................................................................................................1 Chapter 2 Review of cache memory structure and organization .........................6
    [Show full text]