Array Restructuring for Locality

Shun-Tak A. Leung

Department of Computer Science and Engineering
University of Washington

Technical Report UW-CSE-96-08-01
August 1996

This technical report is adapted from the author’s Ph.D. dissertation. © Copyright 1996

Shun-Tak Albert Leung
University of Washington

Abstract

Array Restructuring for Cache Locality

by Shun-Tak Albert Leung

Chairperson of Supervisory Committee: Professor John Zahorjan
Department of Computer Science and Engineering

Caches are used in almost every modern processor design to reduce the long memory access latency, which is increasingly a bottleneck to program performance. For caches to be effective, programs must exhibit good data locality. Thus, an optimizing compiler may have to restructure programs to enhance their locality. We focus on the class of restructuring techniques that target array accesses in loops.

There are two approaches to enhancing the locality of such accesses: loop restructuring and array restructuring. Under loop restructuring, a compiler adopts a canonical array layout but transforms the order in which loop iterations are performed and thereby reorders the execution of array accesses. Under array restructuring, in contrast, a compiler lays out array elements in an order that matches the access pattern, while preserving the flow of control. While loop restructuring has been studied extensively, array restructuring has received much less attention despite advantages such as its applicability to complicated loop structures that may hamper loop restructuring.

To fill the void, this dissertation investigates how to perform array restructuring effectively — efficiently, automatically, and generally. We present a formal framework for array transformations that meet these objectives. Such transformations are represented by linear transformations of array index vectors. Within this framework, we develop algorithms to solve various problems in array restructuring: selecting transformations based on the access pattern, laying out elements of restructured arrays, and determining which elements are accessed by a loop and thus restructuring only that part of an array.

To evaluate our array restructuring technique, we implemented a prototype compiler and performed a series of experiments with loops commonly used in related loop restructuring studies. Experimental measurements showed that array restructuring improved performance substantially in many cases, despite a modest runtime overhead in some. Moreover, the results also indicated that array restructuring complemented loop restructuring in applicability and performance: it applied where loop restructuring did not; when both applied, it offered comparable, sometimes even better, performance; in cases where it did not perform as well, loop restructuring improved performance considerably anyway. This observation points to the potential benefit of integrating the two complementary approaches.

Table of Contents

List of Figures
List of Tables
List of Common Symbols

Part I Introduction

Chapter 1: Introduction
  1.1 Motivation
    1.1.1 Two Approaches to Improving Locality
    1.1.2 Comparing Array and Loop Restructuring
    1.1.3 Understanding Array Restructuring
  1.2 Contributions
  1.3 Organization

Part II Analysis

Chapter 2: Array Restructuring Framework
  2.1 Loop Restructuring Framework: A Tutorial
  2.2 Array Restructuring Framework
  2.3 Alternatives
  2.4 Summary

Chapter 3: Choosing an Index Transformation
  3.1 Index Transformation for Better Locality
    3.1.1 Innermost Loop
    3.1.2 Enclosing Loops
    3.1.3 Unimodular Index Transformation Matrix
    3.1.4 Summary
  3.2 Computing the Index Transformation Matrix
    3.2.1 Permuting Array Indices
    3.2.2 General Unimodular Transformation Matrices
  3.3 More General Access Patterns
    3.3.1 Multiple Accesses
    3.3.2 Non-Affine Array Index Expressions
    3.3.3 Imperfect Loop Nests
    3.3.4 Multiple Loop Nests
  3.4 Summary

Chapter 4: Linearizing Restructured Arrays
  4.1 Computing the Bounds of the Restructured Array
  4.2 Problem of Non-Constant Array Bounds
  4.3 Introducing Our Solution
  4.4 Computing the Linearization Vector
    4.4.1 Relaxed Bounds
    4.4.2 Computing Linearization Vector from Relaxed Bounds
    4.4.3 Summary
  4.5 Relaxing the Transformed Array Bounds
    4.5.1 Relaxed Bounds Revisited
    4.5.2 Symmetric Image Polyhedron
    4.5.3 Computing Augmented Bounds
    4.5.4 Computing Center of Symmetry
    4.5.5 Computing Relaxed Bounds
    4.5.6 Summary
  4.6 Summary

Chapter 5: Partial Restructuring
  5.1 Regularly Spaced Elements
    5.1.1 Extending the Framework
    5.1.2 A Single Access
    5.1.3 Multiple Accesses
    5.1.4 Summary
  5.2 Range of Index Vectors
    5.2.1 The Problem
    5.2.2 Solutions
    5.2.3 Multiple Accesses
    5.2.4 Implications to Linearization
    5.2.5 Summary
  5.3 Summary

Part III Experimentation

Chapter 6: Implementation
  6.1 Runtime System
  6.2 Array Restructuring Pass
    6.2.1 Functions
    6.2.2 Limitations

Chapter 7: Experimental Results
  7.1 Experiments
  7.2 Array Restructuring
    7.2.1 General Results
    7.2.2 Individual Cases
  7.3 Comparing Array Restructuring and Loop Restructuring
    7.3.1 Methodology
    7.3.2 Results
  7.4 Array Restructuring and Tiling
  7.5 Summary

Chapter 8: Parallel Execution
  8.1 Experiments and Results
  8.2 Discussion
  8.3 Summary

Part IV Conclusion

Chapter 9: Related Work
  9.1 Array Restructuring
    9.1.1 Reordering Array Elements
    9.1.2 Other Forms of Data Layout Optimizations
  9.2 Loop Restructuring

Chapter 10: Summary and Conclusions

Bibliography

Appendix A: Memory Utilization
  A.1 Lemmas
  A.2 Volumes of Polyhedra

Appendix B: Additional Experimental Results
  B.1 Uniprocessor Workstation: IBM RS/6000
  B.2 Shared Address Space Multiprocessor: KSR-2

List of Figures

1.1 A Simple Example — Matrix Multiply
2.1 Linear Algebraic Representation of Array Access
2.2 Array Restructuring by Linear Transformation of Index Vector
3.1 Improving Locality through Index Transformation
3.2 Flattening Columns of an Access Matrix
3.3 Effect of Transformation Matrix with Non-Integral Inverse
3.4 Algorithm to Transform Access Matrix with Permutation Matrix
3.5 Transforming Access Matrix by Permuting Rows
3.6 Algorithm to Transform Access Matrix with Nonsingular Matrix
3.7 Affecting Magnitudes of Strides in Index Space
3.8 Multiple Accesses to One Array
3.9 Optimizing Multiple Accesses Individually or Collectively
3.10 Computing Transformation from Aggregate Access Matrix
3.11 Examples of Non-Affine Array Accesses
3.12 Handling an Imperfect Loop Nest
4.1 Transforming Array Bounds
4.2 Laying Out Elements of Restructured Array — Tentative Solution
4.3 Laying Out Elements of Restructured Array — Our Solution
4.4 Computing the Linearization Vector: A Road Map
4.5 Computing Linearization Vector from Relaxed Bounds
4.6 Relaxing Transformed Array Bounds
4.7 Worst-Case Example of Memory Use
5.1 Restructuring Regularly Spaced Elements
5.2 Restructuring Regularly Spaced Elements for Multiple Accesses
5.3 Examples of Restricted Index Ranges
5.4 Computing Image of Loop Bounds
5.5 Causes of Including Elements Not Accessed
6.1 Organization of Implementation
6.2 Sample Output Code
7.1 Array Restructuring Performance — Profitable Cases
7.2 Array Restructuring Performance — Unprofitable Cases (with Manual Override of Compiler Decision)
7.3 Array Restructuring Performance — Complete Execution Times
7.4 Access Pattern of SYR2K
7.5 Performance Variation of SYR2K
7.6 Comparing Array and Loop Restructuring
7.7 Array Restructuring and Tiling
8.1 Parallel Speedups on SGI Power Challenge — Profitable Cases
8.2 Parallel Speedups on SGI Power Challenge — Unprofitable Cases (with Manual Override of Compiler Decision)
B.1 Array Restructuring Performance on RS/6000 — Profitable Cases
B.2 Array Restructuring Performance on RS/6000 — Unprofitable Cases (with Manual Override of Compiler Decision)
B.3 Parallel Speedups on KSR-2 — Profitable Cases
B.4 Parallel Speedups on KSR-2 — Unprofitable Cases (with Manual Override of Compiler Decision)
B.5 Parallel Speedups of Original Loop on KSR-2 — Selected Cases

List of Tables

7.1 Loops for Experiments
7.2 Sizes of Original and Restructured Arrays
7.3 Problems and Solutions in Individual Cases

List of Common Symbols

A          Access matrix
a          Access offset vector
B          Array bounds matrix
b          Array bounds vector
c          Center of symmetry of image polyhedron
height(…)  Height of a matrix
i          Iteration vector
lj         Lower array bound of the j-th dimension
lj(…)      Lower bound for the j-th dimension in the augmented bounds
m          Number of array dimensions
n          Number of loop levels in a loop nest
R          Transformation matrix for relaxed bounds
rank(…)    Rank of a matrix
rl         Lower bounds of relaxed bounds
ru         Upper bounds of relaxed bounds
s          Scalar offset of an element into array
T          Index transformation matrix
t          Index transformation offset vector
uj         Upper array bound of the j-th dimension
uj(…)      Upper bound for the j-th dimension in the augmented bounds
v          Linearization vector
x          Array index vector
y          Transformed array index vector

Acknowledgments

First of all, I must thank John Zahorjan, my advisor. Throughout the almost five years I have been working with him, he has offered not only advice and guidance, but also constant encouragement, support, and confidence in me — even, and perhaps especially, at those times when I have doubts. His optimism is contagious.

I would also like to thank Craig Chambers and Susan Eggers, members of my Reading Committee. Their valuable comments helped to improve this dissertation substantially.

The National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign provided the Silicon Graphics Power Challenge used in some of my experiments. The SUIF compiler group of Stanford University provided the compiler infrastructure for my prototype. Ruth Anderson and Thu Nguyen installed the SUIF compiler locally and helped me with its use.

My parents, to whom this dissertation is dedicated, have always had my future in mind. Their silent support and sacrifice have enabled me to pursue my dreams and come thus far.

Last but certainly not least, praise and gratitude are due to God, who has showered me with undeserved blessings beyond what I dare to dream of.

Dedicated to my parents Sze-Wing Leung and Suet-Chor Kwan

Part I Introduction

Chapter 1

Introduction

1.1 Motivation

As the gap between processor and memory speeds continues to widen, it is increasingly important to execution performance that the average latency of memory accesses be reduced. One proven hardware technique is the use of caches. Caches appear in virtually all major processor designs [Bannon and Keller 1995; Hunt 1995; Levitan, Thomas, and Tu 1995; Papworth 1996; Tremblay and O’Conner 1996; Yeager 1996]. The memory system is structured as a hierarchy of caches in the hope that most data accesses can be satisfied by levels that have short access times, thus avoiding long trips to relatively slow memory. This technique exploits the observation that many programs exhibit locality of access: soon after accessing a piece of data, the program tends to access the same data again (which is called temporal locality) or other nearby data (spatial locality).

The use of caches in turn influences programming practice. To achieve good performance, programs must utilize the cache effectively. Therefore, programmers conscious of the heavy cache miss penalty will try to write code that exhibits good data locality — both temporal and spatial. For example, they can make use of blocked algorithms, which seek to operate on data already in the cache as much as possible before bringing another set of data into the cache [Lam, Rothberg, and Wolf 1991]. While offering maximum performance, manual tuning is not only time-consuming, but also makes programs less portable across hardware platforms. A program hand-optimized for one architecture may perform much worse on another.

Compiler-directed locality optimizations promise high performance without sacrificing portability or demanding excessive programmer effort, although they may not outdo manual tuning. A compiler can analyze the data access pattern and restructure programs to enhance their locality, perhaps even tailoring them to the particular target architecture.

This dissertation focuses on compiler-directed program restructuring techniques that target array accesses in loops, which represent a significant fraction of the total execution time in many applications. There has been intense interest in such techniques. Some of the extensive research is surveyed in Chapter 9. These techniques have received much attention in part because array accesses in loops are often executed many times. Any effort spent on optimizing even a small number of them promises huge reward in execution performance. Their regularity also makes them especially amenable to compiler analysis.

1.1.1 Two Approaches to Improving Locality

To improve the locality of array accesses in a loop, we can restructure the loop or restructure the arrays. In the loop restructuring approach, the compiler modifies the loop’s control structure to change the execution order of the iterations (or parts of iterations) and thereby the sequence of array accesses. We might, say, bring iterations accessing the same array element closer together in execution order; this would improve temporal locality. In the array restructuring approach, the compiler changes the storage order of an array according to its access pattern. We might, say, store an array in column-major order if it is accessed column by column, thus enhancing spatial locality. Although previous work has concentrated on loop restructuring, in fact both approaches can enhance locality because locality results from an interaction between the execution order of accesses and the storage order of elements, rather than from either execution order or data layout alone. For concreteness, let us consider an example.

Figure 1.1 shows a straightforward implementation of a dense matrix multiplication as well as graphical representations of the access patterns for the three arrays involved. We show the arrays laid out in row-major order.

(a) Original array layout:

    C is initially zero; B is row-major
    FOR i = 1, 100
      FOR j = 1, 100
        FOR k = 1, 100
          C[i,j] = C[i,j] + A[i,k] * B[k,j]

(b) Loop restructuring (interchange loops j, k):

    C is initially zero; B is row-major
    FOR i = 1, 100
      FOR k = 1, 100
        FOR j = 1, 100
          C[i,j] = C[i,j] + A[i,k] * B[k,j]

(c) Array restructuring (B2 = transpose of B):

    C is initially zero; copy B[x,y] to B2[y,x]
    FOR i = 1, 100
      FOR j = 1, 100
        FOR k = 1, 100
          C[i,j] = C[i,j] + A[i,k] * B2[j,k]

[The original figure also depicts, for each version, the row-major layout of A, B (or B2), and C and the access direction of the innermost loop through each array.]

Figure 1.1: A Simple Example — Matrix Multiply

We first examine the locality exhibited by the original loop as shown in Figure 1.1(a). The innermost loop computes a single element of the result array C as the dot product of a row of A and a column of B.

• The accesses to C[i,j] have excellent temporal locality because the innermost loop reuses C[i,j] many times. Moreover, after scalar replacement [Bacon, Graham, and Sharp 1994] — a compiler optimization that replaces an array element with a scalar variable — the running sum is accumulated in a register; C[i,j] is read from memory only before the innermost loop and written back after.

• The access to A[i,k] has modest temporal locality and good spatial locality. The innermost loop reads a row of elements, which are consecutive in memory because A is row-major. Thus, the innermost loop goes through adjacent memory locations, with good spatial locality. Furthermore, although iterations of the innermost loop read different elements, consecutive executions of the innermost loop reuse the same row. Temporal locality is moderately good, provided that a row of A is short enough to stay in cache until it is reused.

• The access to B[k,j] is problematic: it has poor temporal as well as spatial locality. Since the innermost loop goes through the row-major array column by column, consecutive iterations touch elements far apart in memory. These elements are in different cache lines (unless the arrays are unrealistically small). Each access therefore potentially results in a cache miss. Moreover, temporal locality is poor because the loop nest goes through the entire array before accessing any element again, by which time the cache line containing the reused element has most likely been displaced from the cache by the huge number of intervening accesses.

Loop restructuring improves the problematic access to B, as well as overall performance, despite some side-effects on the other two arrays. In part (b) of the figure, we interchange the two inner loops1. This alters the execution order of the iterations. For this loop nest, such a change respects all existing loop-carried dependences and therefore preserves program semantics. Under the new execution order, the innermost loop goes through array B row by row. Spatial locality is much better than before, although temporal locality remains unchanged because array B is still read in whole before being reused. The new execution order also affects accesses to arrays A and C. In particular, the innermost loop now reuses the elements of A and updates C row by row. Experiments have shown that these changes together lead to substantially better overall performance [Kennedy and McKinley 1992].

1. This simple loop nest could be improved further using loop tiling [Lam, Rothberg, and Wolf 1991; Wolf and Lam 1991a]. For simplicity, we present this illustrative example with loop interchange only.

Array restructuring also improves the problematic access to B, but without affecting the other accesses or temporal locality. Figure 1.1(c) illustrates this. While the loop structure is unchanged, the loop accesses a newly defined array B2, instead of the original array B. B2 contains the same elements as B, but in a different order determined by the access pattern: in this case B2 is the transpose of B. Thus, the access to B[k,j] is replaced by an access to the corresponding element in B2, namely B2[j,k]. As a result, the innermost loop goes through B2 row by row. This is logically equivalent to going through B column by column but has far better spatial locality. In addition to transforming the accesses to B, the compiler also generates code to copy data from B to B2 before loop execution2. If the array were updated, the data would also have to be copied back afterward. Notice that array restructuring has not changed temporal locality: the loop nest in Figure 1.1(c) still accesses the entire array B2 before reusing any element. Finally, the other two arrays are accessed in exactly the same way as before because the iteration execution order has not changed.
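The transformation of Figure 1.1(c) can also be written out as ordinary code. The following C sketch is only an illustration, not output of the prototype compiler described later; the array size N and the initialization values are assumptions made solely to keep the example self-contained. It shows the compiler-inserted copy step and the rewritten access, with the loop structure left untouched.

    /* A minimal C sketch of Figure 1.1(c): B2 is the transpose of B, so the
     * innermost loop walks B2 row by row. N and the initialization are
     * illustrative assumptions only. */
    #include <stdio.h>

    #define N 100

    static double A[N][N], B[N][N], B2[N][N], C[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)              /* assumed initialization */
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = i - j;
                C[i][j] = 0.0;
            }

        /* Copy step inserted before the loop: B2[y][x] = B[x][y]. */
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                B2[y][x] = B[x][y];

        /* Loop structure unchanged; only the access to B is rewritten. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B2[j][k];   /* was B[k][j] */

        printf("C[0][0] = %f\n", C[0][0]);
        return 0;
    }

Because B is only read in this loop, no copy-back is needed; if the restructured array were updated, a reverse copy after the loop would be required as well.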

1.1.2 Comparing Array and Loop Restructuring

While we can often improve locality by either loop or array restructuring, each approach has distinct advantages as well as disadvantages, making one or the other more appropriate in a given situation.

First, array restructuring can be easily applied to complicated loop structures that sometimes hamper, if not frustrate, loop restructuring. This is because array restructuring does not change a program’s flow of control, only its data layout. By contrast, loop restructuring is harder to apply because it affects control flow. Reordering data accesses without altering what the program computes requires careful compiler analysis to guarantee that the new order respects the same dependences as the old under all circumstances.

2. The original array B, and the data copying between B and B2, can be eliminated if all accesses to B throughout the program are replaced by accesses to B2. We discuss this possibility in Section 3.3.4.

This is especially difficult for complicated loops such as those that are imperfectly nested, contain conditional statements, have complex loop-carried dependences, have dependences dependent on runtime data, and so on. For example, while some of the most sophisticated loop restructuring techniques may be able to transform an imperfect loop nest implementing LU decomposition without pivoting [Anderson, Amarasinghe, and Lam 1995; Wolf and Lam 1991a], they would be frustrated by pivoting since the loop-carried dependences are determined by data available only at run time. Insufficient or imprecise compile-time information may also prevent the compiler from applying loop transformations desired for locality.

Second, array restructuring, by definition, affects spatial locality but not temporal locality, whereas loop restructuring affects both. Temporal locality is the property that after accessing a data item, programs tend to access it again soon, while spatial locality concerns the likelihood of accessing nearby data. Array restructuring does not affect temporal locality because temporal locality relates to how soon the same data item is accessed again, which in turn depends on the order of all accesses but not where that data item is placed in memory3.

This property of array restructuring may be viewed as a limitation but can also be exploited to advantage. Since array restructuring does not affect temporal locality, it obviously cannot be used to improve temporal locality. On the other hand, we can use array restructuring to improve spatial locality without jeopardizing temporal locality. With loop restructuring, however, we may have to strike a balance between the improvement in spatial locality won by a loop transformation and any accompanying degradation in temporal locality.

3. Locality is defined here as a property of the access pattern and data layout. We distinguish it from reuse, especially reuse by a specific level of the memory hierarchy. Whether reuse results from locality depends on architectural parameters such as cache size, cache line size, replacement policy, etc. While temporal locality is not affected by array restructuring, the degree of temporal reuse may be.

Third, when a loop accesses more than one array, array restructuring can transform each individually in the best way, independently of how, or whether, the others are restructured. Not only does this simplify analysis, but it also avoids the dilemma confronted in loop restructuring that the locality for one array may be improved only at the expense of another. By contrast, loop restructuring inevitably affects all the loop’s accesses and thus may necessitate such a tradeoff. In the earlier example, array restructuring improves the access to array B without affecting those to A and C. In contrast, with loop restructuring, we must assess how the loop transformation impacts accesses to all three arrays before concluding that it will likely increase overall performance.

Analogously, when an array is accessed by more than one loop, array restructuring affects all these loops, complicating analysis and sometimes resulting in a painful tradeoff, while loop restructuring simply considers each loop independently. However, array restructuring may still avoid such tradeoffs between loop nests if arrays are restructured dynamically in between. The runtime cost must, of course, be considered. Section 3.3.4 explores this further.

1.1.3 Understanding Array Restructuring

We have seen that both loop and array restructuring can improve locality of array accesses in loops, and that each approach has its advantages as well as disadvantages. Many of these advantages and disadvantages are “symmetric” to each other, making one or the other approach more suited to a given case.

However, relatively little attention has been given to array restructuring; most research effort so far has focused on loop restructuring techniques, including both individual transformations and, more recently, unifying analysis frameworks (such as one using nonsingular matrices [Li and Pingali 1993c]). To facilitate comparison with our work, we defer the discussion of these contributions to Chapter 9, which surveys the extensive, long-standing work on loop restructuring and some recent progress in array restructuring.

This dissertation fills the void in the understanding of array restructuring. We aim to identify, understand, and address issues that must be dealt with in order to perform array restructuring effectively — efficiently, automatically, and as generally as possible. The insights will help us not only to perform array restructuring per se better, but also to integrate both forms of restructuring for maximum effectiveness.

1.2 Contributions

We make the following contributions in this dissertation:

• We develop a formal framework for performing array restructuring efficiently, auto- matically, and as generally as possible.

• We devise detailed algorithms for the various steps of array restructuring within this framework and show that they solve the problems they address.

• We demonstrate the feasibility of array restructuring by implementing a prototype compiler incorporating these algorithms.

• We evaluate our array restructuring techniques through a series of experiments using loops from benchmarks and the related literature.

1.3 Organization

The rest of this dissertation is organized as follows. Part II presents the analysis techniques to restructure arrays. In particular, Chapter 2 presents our framework for implementing array restructuring without incurring extra indexing overhead. It is based on the linear transformation of array index vectors between the original and restructured arrays. Chapter 3 describes our algorithms for computing an index transformation to improve locality based on the data access pattern. Chapter 4 discusses how to lay out the elements of the restructured array in such a way that finding any element from its array indices is as efficient as in the conventional (untransformed) case, a nontrivial problem because of the generality of our array transformations. These three chapters assume that arrays are restructured in whole. Chapter 5 presents techniques to restructure only elements that are accessed by a given loop and to calculate such a subset of elements.

Part III presents our results. Chapter 6 outlines the implementation of a prototype compiler for array restructuring. Chapter 7 reports a series of experiments to evaluate our array restructuring techniques and study how they compare and interact with existing loop restructuring techniques. Chapter 8 presents experimental results for parallel execution.

Finally, Part IV concludes this dissertation by placing it in a larger context. Chapter 9 reviews related work. Chapter 10 summarizes our contributions and suggests future research directions.

Part II Analysis

Chapter 2

Array Restructuring Framework

To automate array restructuring, we need a concrete framework for representing and deciding how arrays are structured. This chapter presents the framework used in our analysis. First, we outline a loop restructuring framework based on linear algebraic representations of iterations, array elements, and accesses. Given this background, we present our array restructuring framework and discuss how it affects the generated code. Finally, we compare this framework with several other alternatives to show the rationale behind our choice.

2.1 Loop Restructuring Framework: A Tutorial

Arrays are restructured for better locality based on how they are accessed by the loop. To decide automatically how each array should be restructured, we must first represent the abstract notion of “access pattern” in a concrete way that allows formal analysis. For this purpose, we use a linear algebraic framework that others have found useful for analyzing certain loop transformations and their effects on array accesses [Li and Pingali 1993c; Wolf and Lam 1991b]. A tutorial is given here for completeness. Using the illustrative example in Figure 2.1, we first discuss the representations of iterations and array elements and then those of array accesses.

Consider an n-deep perfect loop nest such as the one at the top of Figure 2.1, for which n = 2. We assume that all the loop variables are incremented at unit stride. Loop nests that do not meet this condition are first normalized to the canonical form by adjusting the loop bounds.

    Declare X[1:5,1:5]
    FOR i1 = 0, 3
      FOR i2 = max(i1,1), 4
        X[i2+1,i1+1] = …

The access maps the iteration space (i) to the array index space (x), x = Ai + a:

    [ x1 ]   [ 0  1 ] [ i1 ]   [ 1 ]
    [ x2 ] = [ 1  0 ] [ i2 ] + [ 1 ]

    Iteration space (i):    i1 ≥ 0,  i1 ≤ 3,  i2 ≥ 1,  i2 ≤ 4,  i2 ≥ i1
    Array index space (x):  x1 ≥ 1,  x1 ≤ 5,  x2 ≥ 1,  x2 ≤ 5

[The original figure also shows both spaces graphically, with circles marking the valid iteration vectors and the elements they access.]

Figure 2.1: Linear Algebraic Representation of Array Access

The left part of Figure 2.1 illustrates how iterations are represented. Each iteration can be uniquely identified by an n-dimensional iteration vector, denoted i, whose components are the values of the loop variables (ordered from the outermost to the innermost) for that iteration. The iteration space is an n-dimensional vector space that contains the iteration vectors. Clearly, iteration vectors must be integral because the loop variables can assume only integer values. However, not every integral vector identifies an iteration.

The loop bounds determine which integral vectors in the iteration space correspond to iterations. In Figure 2.1, for instance, only the grid points marked by circles identify iterations. We can represent the loop bounds algebraically as well as geometrically. Assume that the lower (upper) bound for the k-th loop variable, ik, is the maximum (minimum) of one or more affine functions of the enclosing loop variables i1, i2, …, ik-1.1 Under this assumption, we can represent the loop bounds algebraically by a conjunctive set of linear inequalities involving components of the iteration vector: each affine function in the bounds for ik gives rise to a linear inequality involving i1, i2, …, ik. In Figure 2.1, the inequalities derived from the loop bounds are shown below the iteration space. Geometrically, the loop bounds are represented by a convex polyhedron2 in the iteration space bounded by hyperplanes corresponding to the inequalities. As shown in Figure 2.1, the bounds for the example loop nest form a pentagon. Only iteration vectors whose components satisfy the inequalities (or, equivalently, integral points lying within the polyhedron) identify iterations of the loop.

Since all the loops have unit stride, loop execution may be viewed as enumerating the valid iteration vectors in lexicographic order3 or, equivalently, the grid points in the polyhedron (the circles in Figure 2.1) in row-major order. The corresponding iterations are performed in that order.

Array elements can be represented in a similar way, as illustrated by the right part of Figure 2.1. For each array, an element is identified by an index vector comprising the individual array indices for the element. The index space for that array is the vector space containing the index vectors. The array bounds determine the set of valid index vectors — integral vectors in the index space that truly identify array elements. These bounds are represented by a conjunctive set of linear inequalities of the array indices or, equivalently, a convex polyhedron in the index space. In fact, because the lower and upper bounds for all array indices are constants, all the inequalities involve a single array index and the convex polyhedron is simply a rectilinear region. Under this framework, a row-major storage order means that array elements are laid out in lexicographic order of their index vectors: the element identified by index vector x is stored before (not necessarily immediately before) the one identified by x′ if and only if x is lexicographically less than x′.

1. In this context, an affine function is simply a linear combination of the arguments plus a constant, such as f(x, y) = 2x + 3y + 4.

2. A polyhedron is convex if and only if the line segment between any two points in the polyhedron also lies entirely in the polyhedron.

3. A vector x is lexicographically after, or greater than, another vector y if and only if the first nonzero component of x - y is positive. The definition for “lexicographically before” or “less than” is analogous.
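As a concrete illustration of this storage order, the following C sketch (not taken from the report) computes the row-major offset of an element from its index vector and the constant array bounds; the offset grows in lexicographic order of the index vectors.

    /* A small sketch, not from the report: row-major offset of an element.
     * l[j] and u[j] are the constant lower and upper bounds of dimension j,
     * x[j] is the j-th array index, and m is the number of dimensions. */
    int row_major_offset(int m, const int l[], const int u[], const int x[]) {
        int offset = 0;
        for (int j = 0; j < m; j++) {
            int extent = u[j] - l[j] + 1;              /* values index j can take */
            offset = offset * extent + (x[j] - l[j]);  /* lexicographic ordering  */
        }
        return offset;
    }
    /* For X[1:5,1:5] in Figure 2.1, the element (x1, x2) lands at offset
     * (x1 - 1) * 5 + (x2 - 1). */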

Given these representations of iterations and array elements, we can now represent an array access with a mapping from the iteration space to the index space: the mapping takes iteration vectors to the index vectors of those array elements accessed by the corresponding iterations. We focus on the case where all the array indices are affine functions of the loop variables. (More general access patterns will be discussed in Section 3.3.2.) In this case, which is illustrated in Figure 2.1, the mapping itself is also affine and thus can be written as

    x = Ai + a    (2.1)

where i is the iteration vector, x the index vector for the accessed element, A a constant matrix containing the coefficients of the (affine) index expressions for this access, and a a constant vector containing the offsets in those expressions.

The matrix A is called the access matrix. It holds the key access pattern information used later in our analysis. For an access to an m-dimensional array inside an n-deep loop nest, the access matrix has m rows and n columns. Each row corresponds to an array index expression whereas each column corresponds to a loop variable. From the viewpoint of an array index, corresponding rows of the access matrix A and the offset vector a specify how the array index is computed from the loop variables. For example, the first rows of A and a in the figure express the fact that the first array index is the inner loop variable (i2) plus 1.

While it is natural to interpret the meaning of A row by row, each column also gives key information. From the viewpoint of a loop variable, the corresponding column of A indicates how the array indices change as the loop variable is incremented. The last column of A, for instance, indicates that when the inner loop variable is incremented, the first array index increases by one while the second is unchanged.
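To make the row and column readings of A concrete, the following C sketch evaluates x = Ai + a for every iteration of the loop nest in Figure 2.1. It is an illustration only; the matrices are simply filled in from the figure.

    /* A sketch (with A and a taken from Figure 2.1) that evaluates x = Ai + a
     * for each iteration: each row of A encodes one index expression, and
     * each column encodes the change per loop variable. */
    #include <stdio.h>

    int main(void) {
        int A[2][2] = { {0, 1},     /* x1 = i2 + 1 */
                        {1, 0} };   /* x2 = i1 + 1 */
        int a[2]    = { 1, 1 };

        for (int i1 = 0; i1 <= 3; i1++)
            for (int i2 = (i1 > 1 ? i1 : 1); i2 <= 4; i2++) {   /* max(i1,1) */
                int i[2] = { i1, i2 };
                int x[2];
                for (int r = 0; r < 2; r++)
                    x[r] = A[r][0] * i[0] + A[r][1] * i[1] + a[r];
                printf("iteration (%d,%d) accesses X[%d,%d]\n", i1, i2, x[0], x[1]);
            }
        return 0;
    }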

2.2 Array Restructuring Framework

We now present our array restructuring framework and discuss its impact on the generated code. Its basic idea has already been introduced in the matrix multiplication example in Section 1.1.1. Given the linear algebraic framework just described, we can now express it more formally, thus paving the way for the analysis to be presented later. As a convention throughout this dissertation, all arrays are presented in row-major order.

The following discussion refers to Figure 2.2, which uses the same example as Figure 1.1 except for the renaming of loop variables. We focus on the array being restructured — array B. The upper half of the diagram shows the situation in the original loop, whereas the lower half demonstrates the effects of restructuring the array B. The original and transformed codes are shown at the top and bottom respectively.

To restructure an array, we define another array with the same number of dimensions. Let us call this new array the restructured array. In general, the restructured array only needs to contain those elements of the original array that are actually accessed by the loop nest in question. We discuss this fully in Chapter 5. For the moment, however, we require that the restructured array contains the same elements as the original array, but in a different order.

Corresponding elements of the original and restructured arrays are related by a linear transformation of index vectors4. In other words, at each array access, instead of using the original array indices (components of the index vector x) to find an element in the original array, we apply a linear transformation to x to obtain a transformed index vector y and then use y to find the corresponding element in the restructured array. In equation form, we can write

    y = Tx    (2.2)

4. Section 2.3 explains why affine transformations of index vectors, which are even more general than linear transformations, are not used.

Original loop:

    FOR i1 = 1, 100
      FOR i2 = 1, 100
        FOR i3 = 1, 100
          C[i1,i2] = C[i1,i2] + A[i1,i3] * B[i3,i2]

The access to B maps the iteration space (i) to the array index space (x), x = Ai + a:

    [ x1 ]   [ 0  0  1 ] [ i1 ]   [ 0 ]
    [ x2 ] = [ 0  1  0 ] [ i2 ] + [ 0 ]
                         [ i3 ]

The restructured array B2 uses the transformed index space (y), where

    y = Tx  with  T = [ 0  1 ],  so  y = Tx = (TA)i + (Ta):
                      [ 1  0 ]

    [ y1 ]   [ 0  1  0 ] [ i1 ]   [ 0 ]
    [ y2 ] = [ 0  0  1 ] [ i2 ] + [ 0 ]
                         [ i3 ]

Transformed loop:

    Copy elements of B to B2
    FOR i1 = 1, 100
      FOR i2 = 1, 100
        FOR i3 = 1, 100
          C[i1,i2] = C[i1,i2] + A[i1,i3] * B2[i2,i3]

[The original figure also depicts the iteration space and the two index spaces graphically; a column of elements of B (filled circles) maps to a row of elements of B2.]

Figure 2.2: Array Restructuring by Linear Transformation of Index Vector

where T is the transformation matrix that defines the linear transformation applied to the original index vector and thus also defines the correspondence between elements of the original and restructured arrays. Note that T must be nonsingular5 so that each transformed index vector y corresponds to only one original index vector x. Otherwise, multiple elements of the original array would be assigned to the same element in the restructured array. Further conditions on T are imposed in Section 3.1.3. For example, in Figure 2.2 the matrix T = [0 1; 1 0] expresses the fact that array B2 is the transpose of array B: the linear transformation it defines maps a column of elements in the original array to a row of elements in the restructured array (both sets of elements are marked by filled circles in the figure). Naturally, array transpose is only a special case; the use of linear transformation allows much more general forms of array restructuring.

Despite the extra index vector transformation, this array restructuring framework does not involve extra indexing overhead at each array access. In other words, it is no less efficient to find an element in the restructured array than in the original array. This is because the transformed index vector y can be computed directly from the iteration vector i in the same way as the original index vector x, without explicitly applying the transformation to x and thus without incurring the associated overhead.

To see this, let us look at Figure 2.2 again. Iteration i accesses an original array element with index vector x = Ai + a. After array restructuring, it accesses a restructured array element with transformed index vector y = Tx, which can be expressed directly in terms of the iteration vector i as

    y = Tx = T(Ai + a) = (TA)i + (Ta)    (2.3)

Thus, the transformed index vector y, like the original x, is an affine function of the iteration vector i, although the two functions may involve different constant matrices and vectors (TA and Ta versus A and a). A compiler can apply the same optimization technique in both cases: integer multiplications and additions required for computing an element’s address from its array indices are strength-reduced to pointer increments at various levels of the loop nest [Aho, Sethi, and Ullman 1986].

5. A matrix is nonsingular if and only if it is square and its determinant is nonzero.
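To make the strength-reduction point concrete, the following hand-written C sketch (an illustration, not the prototype’s generated code) shows the restructured matrix multiplication of Figure 2.2 with the innermost accesses reduced to pointer increments; the flat row-major arrays and the parameter n are assumptions of the sketch.

    /* Illustrative sketch only: the address of B2[i2][i3] is an affine
     * function of the loop variables, so it strength-reduces to a pointer
     * bumped by 1 per innermost iteration (it would be bumped by n for the
     * original B[i3][i2]). Arrays are flat, row-major, n x n. */
    void mm_restructured(int n, const double *A, const double *B2, double *C) {
        for (int i1 = 0; i1 < n; i1++)
            for (int i2 = 0; i2 < n; i2++) {
                double sum = C[i1 * n + i2];
                const double *ap = &A[i1 * n];     /* row i1 of A                   */
                const double *bp = &B2[i2 * n];    /* row i2 of B2 = column i2 of B */
                for (int i3 = 0; i3 < n; i3++)
                    sum += (*ap++) * (*bp++);      /* both advance with unit stride */
                C[i1 * n + i2] = sum;
            }
    }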

Apart from the transformed array accesses, few other changes to the generated code are required. Neither the loop bounds nor the loop nesting structure need be modified; the flow of control through the loop nest remains unchanged. Since we restructure arrays dynamically at run time, the program may have to copy elements from the original to the restructured array before loop execution and, if the array is updated, copy elements back to the original afterward. The compiler must generate code to perform these operations.

While code generation is relatively straightforward, choosing the transformation matrix T is far from trivial. Also, laying out elements of the restructured array (i.e., defining how the transformed index vector y is mapped to an element’s actual location) is a complicated question with the full generality of our framework. The next two chapters are devoted to these two issues.

2.3 Alternatives

We have presented our array restructuring framework. Corresponding elements of the original and restructured arrays are related by a linear transformation of index vectors. In this section, we discuss the rationale behind this choice by comparing it with several other plausible alternatives.

The comparison is based on three criteria.

• Generality. The framework should be general, but should not incur an unnecessary cost. More specifically, we would like a framework that can represent relationships between elements of the original and restructured arrays that are as general as possible. We consider whether it subsumes less general options. However, this generality should not come at the price of extra overhead in the simple cases where the subsumed options would suffice.

• Indexing overhead. We wish to minimize the indexing overhead (i.e., the runtime cost of finding an element) incurred by array restructuring. Any increase in this cost may substantially impact performance because it exacts a penalty every time the restructured array is accessed.

• Indexing data structures. We would like to avoid using indexing data structures — auxiliary data structures for locating array elements — as much as possible. They are undesirable for several reasons. First, they take up memory. Second, accesses to these data structures may degrade performance, especially if they cause cache misses. A third and related reason is that these accesses compete with accesses to “real” data for space in the cache since the two kinds of accesses are likely to be interspersed.

We now consider four options. We describe each option, assess it using the three aforesaid criteria, and explain why it is or is not selected.

The first option is to permute the array dimensions. For example, the restructured array element X2[i,j,k] corresponds to the original array element X[j,i,k] for any values of i, j, and k. In two dimensions, this reduces to an array transpose because the only possible permutation is an exchange. Permutation has been used in previous work to improve data locality and reduce false sharing [Ju and Dietz 1992; Anderson, Amarasinghe, and Lam 1995]. This transformation is the least general of the options considered in this section. It is a straightforward extension of the traditional row-major and column-major storage orders and requires no more indexing overhead than these cases. Nor does it require any indexing data structures to locate array elements.

The second option is the one we have chosen — linear transformation of index vectors. It is much more general than permutation, subsuming the latter as a special case where the transformation matrix T is a permutation matrix6. The example in Figure 2.2 is one such case. As discussed earlier, applying a linear transformation to index vectors requires no more indexing overhead than what is traditionally required. Also, like permutation, it requires no indexing data structures. Therefore, the generality does not carry an unnecessary cost even when simple permutation would have sufficed. In fact, the linear transformations selected automatically in such cases would correspond to permutations anyway. The extra generality is especially useful if accesses go through the array index space diagonally (or in any direction beyond just the natural orthogonal ones), as in banded matrix computations. Unlike permutation, however, this option allows an infinite number of possible transformations. Therefore, we need algorithms to choose one that achieves our goal; a brute force method that evaluates possibilities exhaustively is unacceptable. These algorithms will be discussed later. In short, we choose this option because it provides an extra level of generality that is useful in some cases but carries no extra runtime overhead. We now discuss two other options and explain why they are rejected.

The third, even more general option is to apply an affine, rather than linear, transformation to index vectors. In other words, the transformed index vector y is related to the original index vector by Tx + t, with an additional offset vector t. This subsumes the linear case with a zero t. Like the previous two options, this one does not need indexing data structures. Nor does it add to the indexing overhead because, as for linear transformation, the transformed index vector y can be expressed directly in terms of the iteration vector i with an affine function:

    y = Tx + t = T(Ai + a) + t = (TA)i + (Ta + t)    (2.4)

The only difference lies in the offset vector — Ta + t versus Ta in the linear case. Despite its “free” generality, this option is not selected because the generality has no effect in practice. A different offset vector changes neither the storage order of the elements nor the relative distances between them. It merely shifts the positions of all the elements by the same distance in memory. In other words, there is no practical reason why we should choose t to be a nonzero rather than a zero7. The latter case simply reduces to a linear transformation.

6. All rows and columns of a permutation matrix have exactly one nonzero element, which is 1. Multiplying a vector by a permutation matrix in effect permutes the vector’s components.

7. Shifting an entire array in memory may matter when we consider conflict misses arising from accesses to different data structures, particularly arrays. We consider this part of a broader question: relative placement of different data structures. As far as accesses to the array in question are concerned, however, shifting has no practical effect.

Finally, let us consider the completely general mapping. Implementing this requires the use of indirection arrays. X[i] of the original array is stored in X2[IDX[i]] of the restructured array, where IDX is itself an array with one element for each element of X. This is the most general option conceivable since it allows array elements to be stored in an arbitrary order defined by the indirection table IDX. However, it also leads to significant indexing and memory overheads because of its need for indexing data structures. Each access to X is replaced by two accesses: one to IDX and one to X2. Worst of all, the access pattern for IDX is identical to that for the original array X. Therefore, if the latter has poor locality, so does the former: if an access causes a cache miss in the original array, it would also cause a cache miss in the identically laid out indirection array8. Therefore, despite its full generality, we decide against the use of indirection tables.

8. Indirection tables have been successfully used to reduce false sharing on shared-memory multiprocessors [Jeremiassen 1995; Jeremiassen and Eggers 1995]. They can be effective in this context because misses in the original array, which is written, are coherence misses caused by false sharing and hence would not appear in the indirection array, which is only read, even if both arrays are laid out and accessed identically.

2.4 Summary

We have presented our formal framework for performing array restructuring. A restructured array is defined to contain the same elements as the original array in a different order. Finding an element in the former requires, conceptually, applying a linear transformation to the original index vector. However, no extra indexing overhead is actually involved because the linear transformation need not be applied explicitly. Instead, we can directly compute the transformed index vector from the iteration vector in the same way as we compute the original one. We have also considered several other alternatives that we have rejected in favor of linear index transformations.

Chapter 3

Choosing an Index Transformation

The index transformation matrix T defines how an array is restructured. This chapter addresses the key question of choosing T based on the access pattern to improve locality. We start in Section 3.1 with a set of requirements on the matrix. Section 3.2 discusses how to choose T to meet these requirements. To begin with, we focus on a simple case: a single array access with only affine index expressions in a single perfect loop nest. More general access patterns are covered in Section 3.3.

3.1 Index Transformation for Better Locality

We have the following two requirements for the index transformation matrix T. How to find the matrix T is discussed later in Section 3.2.

• Given the access matrix A, the transformed access matrix TA should be “lower-triangular,” in a loose sense to be defined later. This requirement follows from our goal of improving the spatial locality of the array access. We explain this fully in the first two subsections below, starting from the impact of the innermost loop and then considering the enclosing loops as well.

• T must be unimodular. We elaborate on this in the final subsection.

3.1.1 Innermost Loop

A suitable index transformation can improve spatial locality if it causes the innermost loop to go through memory consecutively when accessing elements in the restructured array1.

If this cannot be achieved, at least we want the stride to be as small as possible. We now show what this means for the transformation matrix T.

We illustrate the following discussion with the example in Figure 3.1. The original loop is adapted2 from the literature on loop restructuring [Li 1995; Li and Pingali 1993c]. It performs a symmetric rank-2k update3 for banded matrices. Array restructuring transforms the original version shown at the top to the version below. Although the loop is shown in its entirety for completeness, in this discussion we need to consider only the highlighted access to array X. The other arrays can be restructured independently. (As for the other access to X, we will discuss how to handle it in Section 3.3.1.) The rest of Figure 3.1 gives the geometrical and algebraic representations of the highlighted access before and after array restructuring.

In the original loop, the highlighted access has poor spatial locality. As the innermost loop is executed, each successive iteration reads an array element in a different row and a different column because incrementing the variable k throughout the loop changes both the row and column indices of the array (the former by -1 and the latter by 1). Elements accessed by consecutive iterations are scattered far apart in memory, almost assuredly in different cache lines. Each successive access potentially results in a cache miss. Note that in this case simply transposing the array offers little help because the elements accessed by the innermost loop lie in different rows and different columns.
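How far apart these elements lie can be read off the index expressions directly. The short C sketch below (an illustration only; the row length ROWLEN is an assumed value) computes the distance, in array elements, between the locations touched by consecutive innermost iterations for the original access X[j-k+b,k] and for the restructured access X2[j+b,k] discussed below.

    /* Illustrative sketch: distance (in elements) between the locations read
     * by iterations k and k+1 of the innermost loop, assuming a row-major
     * array with ROWLEN elements per row. For X[j-k+b,k] the row index
     * changes by -1 and the column index by +1; for X2[j+b,k] only the
     * column index changes. */
    #include <stdio.h>

    #define ROWLEN 1000   /* assumed number of elements per row */

    int main(void) {
        int d_row_orig = -1, d_col_orig = 1;   /* last column of A:  (-1, 1) */
        int d_row_new  =  0, d_col_new  = 1;   /* last column of TA: ( 0, 1) */
        printf("original:     %d elements apart\n", d_row_orig * ROWLEN + d_col_orig);
        printf("restructured: %d elements apart\n", d_row_new  * ROWLEN + d_col_new);
        return 0;
    }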

1. We assume, of course, that the innermost loop does not simply reuse the same array element. If it does, there is no reason to change this behavior. Nor is such a change possible by means of array restructuring.

2. The loop nest as presented in the literature uses Fortran-style arrays, with column-major storage and indices starting at 1 [Li 1995; Li and Pingali 1993c]. We have adapted the code for row-major arrays with zero lower bounds to conform with our convention of row-major storage and to simplify the bound and index expressions for presentation. The memory access pattern is nonetheless preserved.

3. A symmetric rank-2k update, part of the Basic Linear Algebra Subprograms Level 3 (BLAS3) [Dongarra et al. 1990], performs the matrix operation Z = Z + XY^t + YX^t on a symmetric matrix Z.

Original loop:

    FOR i = 0, n-1
      FOR j = i, min(i+2*b, n-1)
        FOR k = max(i-b,j-b,0), min(i+b,j+b,n-1)
          Z[j-i,i] += X[j-k+b,k] * Y[i-k+b,k] + X[i-k+b,k] * Y[j-k+b,k]

The access X[j-k+b,k] (highlighted in the original figure) has access matrix A and is transformed with T:

    A = [ 0  1  -1 ]     T = [ 1  1 ]     TA = [ 0  1  0 ]
        [ 0  0   1 ]         [ 0  1 ]          [ 0  0  1 ]

    x = Ai + a           y = Tx = (TA)i + (Ta)

Transformed loop:

    FOR i = 0, n-1
      FOR j = i, min(i+2*b, n-1)
        FOR k = max(i-b,j-b,0), min(i+b,j+b,n-1)
          Z[j-i,i] += X2[j+b,k] * Y2[i+b,k] + X2[i+b,k] * Y2[j+b,k]

[The original figure also depicts the access direction of the innermost loop: diagonal in the original index space (x), horizontal in the transformed index space (y).]

Figure 3.1: Improving Locality through Index Transformation

We can also view the problem geometrically, as shown by the middle part of Figure 3.1. The diagonal arrow indicates how the innermost loop goes through the array’s index space: each successive iteration moves one row up and one column to the right, corresponding to the changes in the two array indices when loop variable k is incremented. The nonzero projection of this direction in the x1 (vertical) dimension implies that successive iterations access elements in different rows. This, as we have seen, leads to poor spatial locality. Transposing the array merely reverses the direction, which remains diagonal and therefore equally bad for locality.

Finally, we can view the problem algebraically in terms of the nonzero structure of the access matrix, shown on the right in Figure 3.1. The last column of the access matrix represents changes in the two array indices as the innermost loop variable k is incremented through the innermost loop. This column is also the algebraic representation of the diagonal vector discussed above. Since the column starts with a nonzero (specifically -1), the first array index (i.e., the row index) changes with variable k. Again, this means that the innermost loop reads elements in different rows. Transposing the array would in effect interchange the two rows of the access matrix. Since the other element of the last column is also nonzero, this transformation does not solve the problem.

Our goal is to improve the spatial locality of the highlighted access. This goal can be stated in the same three ways we have just described the problem, and is met by the code, access matrix, and geometrical representation in the lower half of Figure 3.1. The array elements are laid out in such an order that in the transformed code, the innermost loop reads restructured array elements that lie in the same row and hence are stored contiguously. The innermost loop can thus go through memory consecutively, with better spatial locality than it has now. Geometrically, this means the innermost loop’s access direction (the arrow in the figure) should be horizontal, rather than diagonal; in other words, it has no projection in the x1 (vertical) dimension. Algebraically, the last column of the access matrix should start with a zero. In this way, the first array index (i.e., the row index) does not change with the increasing value of variable k as the innermost loop is executed.

An appropriate index transformation matrix T is selected to achieve the above goal for the restructured array. This is easiest to explain in terms of the algebraic representation of the access pattern. As discussed earlier (see Equation (2.3)), the transformed index vector y for the restructured array can be expressed directly as an affine function of the iteration vector i:

    y = (TA)i + (Ta)    (3.1)

where T is the index transformation matrix, A is the (original) access matrix, and a is a constant offset vector. Array restructuring in effect transforms the access matrix A to TA. Thus, given the original access matrix A, we have to choose a transformation matrix T such that TA has the nonzero structure we want: the last column of the matrix starts with a zero. Note that there are many transformation matrices that can achieve this. All of them are “equally good” in this respect, although we will discuss later why some are preferred over others.

The transformation matrix given in Figure 3.1 produces the desired effect on the access matrix, and thus on the geometrical representation and the code as well. Geometrically, the corresponding linear transformation maps the original, diagonal vector at the top to the horizontal vector below. The transformed access is generated from the transformed access matrix TA. It goes through elements of the restructured array X2 row by row, as desired.

Having discussed the two-dimensional case, we now turn to the general case where the array may have more dimensions. The goal remains the same: the innermost loop accesses elements consecutive in memory. In algebraic terms, this requires that the access matrix's last column, which corresponds to the innermost loop, contains one or more zeros followed by a single nonzero. (Strictly speaking this nonzero must be 1 or -1 for accesses to be consecutive. See page 45 for more discussion on the extent to which we can affect the magnitudes of memory strides.) Given the original access matrix A, we choose a transformation matrix T such that TA satisfies this condition. We say that T flattens the column because its effect on the column is to lower the latter's height, which is defined as the position of the top nonzero counted from the bottom.

In summary, to improve the spatial locality of an array access by array restructuring, we compute a transformation matrix T from the access matrix A such that the last column of their product TA has one or more leading zeros followed by a single nonzero value or, in terms of column height, has a height of 1. So far, we have focused entirely on the innermost loop, which is undoubtedly most important for locality. The next section considers the other, enclosing loops as well.
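Because the notion of column height recurs throughout this chapter, a small executable definition may help. The following sketch is our own illustration (the function name and the list-based column representation are not from the dissertation):

def column_height(col):
    """Height of a column: position of the topmost nonzero, counted from the bottom.
    A zero column has height 0."""
    m = len(col)
    for row, value in enumerate(col):   # scan from the top
        if value != 0:
            return m - row              # topmost nonzero found in (0-based) row `row`
    return 0

# For example, a column (-1, 1) has height 2; flattened to (0, 1), its height is 1.
assert column_height([-1, 1]) == 2
assert column_height([0, 1]) == 1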

3.1.2 Enclosing Loops

So far we have focused on the last column of the access matrix because it corresponds to the innermost loop, which is most important for locality. Specifically, we look for a transformation matrix to flatten this column to a height of one. For the very same reason, we also want to flatten the other columns. In this section, we discuss the inherent limit to how far columns can be flattened and its implications.

[Figure 3.2 writes out an example B = TA, showing the access matrix A and the transformed access matrix B, each annotated below with its height profile and rank profile. For B the two profiles coincide; for A the height profile lies above the rank profile at some columns.]

Figure 3.2: Flattening Columns of an Access Matrix

An example suffices to show that it is not possible to flatten all the columns to arbitrary heights using the same nonsingular transformation matrix. Consider the fourth and sixth columns of the access matrix A and transformed access matrix B in Figure 3.2. Let A_k and B_k be the k-th columns of A and B respectively. For the sake of argument, assume that for some nonsingular transformation matrix, say S, both B_4 and B_6 have a height of one. In other words, both columns have a single nonzero, and it is the last element in each case. Then, B_4 = γ B_6 for some nonzero scalar γ. Since the transformation matrix S is nonsingular, we can express the columns of A in terms of the corresponding columns of B. Thus,

    A_4 = S^-1 B_4 = S^-1 (γ B_6) = γ S^-1 (S A_6) = γ A_6        (3.2)

However, this is clearly impossible because both the first and last elements of A_6 are zero whereas those of A_4 are nonzero. Hence, there is no nonsingular transformation matrix that can flatten both the fourth and sixth columns of A to a height of one.

If we cannot flatten columns to arbitrary heights with nonsingular transformation matrices, how far can we go? Before answering this question, we have to define some terms for the following discussion. Let the height of a matrix be the maximum of the heights of its columns. For an m × n access matrix A, let h_j (1 ≤ j ≤ n) be the height of the submatrix comprising columns j through n. We call the sequence of h_j the height profile[4] of the access matrix A. Similarly, let r_j be the rank of the same submatrix and call this sequence the rank profile of A. Figure 3.2 illustrates these two profiles. Both the height and rank profiles increase or remain unchanged with decreasing j; in other words, they rise (strictly speaking, do not fall) as we go through the matrix columns from right to left.
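Both profiles can be computed mechanically from the definitions. The sketch below is our own illustration (list-of-rows matrices and a rational-arithmetic rank routine are implementation choices, not the dissertation's):

from fractions import Fraction

def column_height(col):
    m = len(col)
    for row, v in enumerate(col):
        if v != 0:
            return m - row
    return 0

def rank(matrix):
    """Rank of a matrix, by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def profiles(A):
    """Height profile h_j and rank profile r_j, j = 1..n, of an m-by-n access matrix A."""
    m, n = len(A), len(A[0])
    heights, ranks = [], []
    for j in range(n):                          # submatrix of columns j+1 .. n (1-based)
        sub = [row[j:] for row in A]
        heights.append(max(column_height([row[c] for row in sub]) for c in range(n - j)))
        ranks.append(rank(sub))
    return heights, ranks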

The access matrix's rank profile inherently limits how far its columns can be flattened by a nonsingular transformation matrix. Specifically, given the access matrix A, the columns of TA can be flattened at most to match the rank profile of A, as illustrated in Figure 3.2. In other words, the best (i.e., lowest) achievable height profile of TA coincides with the rank profile of A. This assertion follows from two observations.

• First, TA has the same rank profile as A, for the following reason. Since T is nonsingular, left-multiplying it to any matrix preserves the rank of the latter. Therefore, all corresponding submatrices of A and TA have the same rank. In particular, this statement applies to submatrices that consist of columns j through n for any j between 1 and n. Since the rank profile is defined by the ranks of these submatrices, A and TA have the same rank profile.

• Second, for any matrix, the height profile must always be "on or above" the rank profile. More precisely, h_j ≥ r_j for any j between 1 and n. In Figure 3.2, the two profiles coincide in the case of matrix B on the left, whereas for matrix A the height profile is "above" the rank profile at some columns.

To see why this is true in general, let us assume for the sake of argument that h_j < r_j for some j. Consider the columns j through n. By the definition of the height profile, these columns have heights of at most h_j. Therefore, in each of these columns, at most the bottom h_j elements are nonzero; the other elements must be zero. Thus, truncating the leading zero elements, we get a number of h_j-dimensional vectors.

4. The rank of a matrix is the number of linearly independent rows it has, which always equals the number of linearly independent columns [Bloom 1979].

On the other hand, by the definition of the rank profile, we know that the submatrix consisting of columns j through n has rank r_j. In other words, there are r_j linearly independent columns among these columns, and thus also among their truncated, h_j-dimensional counterparts. (Recall that all the truncated components are zeros.) However, this leads to a contradiction if h_j < r_j because in any h_j-dimensional vector space, it is impossible to find more than h_j linearly independent vectors. Hence, h_j ≥ r_j for any j between 1 and n. In other words, the height profile must be "on or above" the rank profile.

From these two observations, we see that multiplying a nonsingular transformation matrix T to A may change the height profile but never the rank profile of A. By flattening columns, we can "lower" the height profile at most to the extent that it coincides with the (unchanged) rank profile. Note, however, that so far we have said nothing about how one might find such a transformation matrix, or even whether such a matrix exists at all for a given access matrix. These are discussed later in Section 3.2.2.

A matrix whose height profile coincides with its rank profile would be, in a loose sense, lower-triangular. An example is the matrix B in Figure 3.2. By "lower-triangular," we do not mean "lower-triangular" in the strict mathematical sense; in fact, the matrix in question may not even be square. Instead, we mean that the matrix elements lying above the height profile (which, being identical to the rank profile, rises at most one row per column as we go from right to left through the matrix) constitute an "upper triangle" filled with zeros.

In summary, in the case of a single array access, optimizing for spatial locality requires solving the following mathematical problem: given the original access matrix A, compute a transformation matrix T such that TA has the lower-triangular form described above. We present an algorithm for this purpose later in this chapter.

3.1.3 Unimodular Index Transformation Matrix

In the previous section, we have discussed one requirement on the transformation matrix T: it should transform the access matrix A such that TA has a certain nonzero pattern. We now consider a requirement on the matrix T itself. How to find a transformation matrix satisfying both requirements is discussed shortly.

In a word, we require that T be unimodular. (This requirement can be relaxed if we restructure only part of an array. Chapter 5 elaborates on this.) A unimodular matrix is an integral, square matrix whose determinant is either 1 or -1. A matrix is unimodular if and only if both the matrix itself and its inverse are integral [Schrijver 1986]. This property is important for our purpose for two reasons.

First, the transformation matrix T itself should be integral. Since only integers can be valid array indices, we need to ensure that any integral vector (the original array index vector) is transformed to an integral vector (the transformed array index vector). An integral transformation matrix T is clearly a sufficient condition. It is also a necessary condition because the j-th column of T is the image of an integral vector (the vector with a one in the j-th position and zeros elsewhere) and therefore each column of T must be integral.

Second, in order that memory is used efficiently, the inverse of the transformation matrix, T^-1, must also be integral. Consider the example in Figure 3.3. Although T maps every integral vector to an integral vector, not every integral vector in the transformed index space is the image of some integral vector in the original index space. Thus, some elements of the restructured array (such as element [1,1]) do not correspond to elements of the original array and therefore do not contain array data, even though they do take up memory. If T^-1 is integral, however, every integral vector would be an image of some integral vector under the transformation T (the latter vector obtained by multiplying T^-1 to the former). Therefore, all elements of the restructured array are used. (For the moment, we ignore the array bounds. Issues related to the transformation of array bounds are discussed in Chapter 4.)

[Figure 3.3 plots the original array index space (x) and the transformed array index space (y) for]

    T = [0  2]        T^-1 = [0    1]
        [1  0]               [1/2  0]

[Integral points of the transformed space that are not images of integral points of the original space are marked in the figure.]

Figure 3.3: Effect of Transformation Matrix with Non-Integral Inverse

In short, an integral T^-1 is a sufficient condition for using all restructured array elements. Moreover, it is also a necessary condition. The argument examines each column of T^-1 and resembles the earlier argument on the necessity of an integral T.
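Checking the two unimodularity conditions for a candidate T is simple; the sketch below is an illustration only (the determinant is computed by cofactor expansion, which is adequate for the small matrices that arise here):

def det(M):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for c in range(len(M)):
        minor = [row[:c] + row[c+1:] for row in M[1:]]
        total += (-1) ** c * M[0][c] * det(minor)
    return total

def is_unimodular(T):
    """A unimodular matrix is square, integral, and has determinant +1 or -1."""
    square = all(len(row) == len(T) for row in T)
    integral = all(v == int(v) for row in T for v in row)
    return square and integral and abs(det(T)) == 1

# The matrix of Figure 3.3 is integral but has determinant -2, so it is not unimodular
# and its inverse is not integral; the matrix of Figure 3.1 qualifies.
assert not is_unimodular([[0, 2], [1, 0]])
assert is_unimodular([[1, 1], [0, 1]])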

3.1.4 Summary

To sum up, given the access matrix A for an array access, we wish to find an index transformation matrix T such that T is unimodular and TA has the "lower-triangular" form discussed earlier, with zeros in the "upper triangle" that lies above the rank profile of A. The next section discusses how to compute such a transformation matrix.

3.2 Computing the Index Transformation Matrix

We now explain how to compute an index transformation matrix T satisfying the above requirements. For clarity in exposition, we start with a simple algorithm that considers only a special class of unimodular matrices useful in many common cases. This is followed by a more general algorithm that does not have such a priori restrictions.

3.2.1 Permuting Array Indices

In this section, we describe a simple algorithm to compute an index transformation matrix T from the access matrix A. Essentially, it selects only matrices that represent permutations of array indices. Though more restrictive than the general algorithm presented later, it is useful in some simple cases and follows the same overall solution approach. Therefore, discussion here also helps to lay the groundwork for subsequent discussion of the general algorithm.

The algorithm is restrictive in that it considers only permutation matrices as candidates for the index transformation matrix. A permutation matrix is a square matrix in which each row and each column contains a single nonzero, which equals 1. Left-multiplying a permutation matrix to a (column) vector, as we do when applying an index transformation, produces a vector whose components are some permutation of the original vector's components[5]. Thus, with a permutation matrix, we in effect restructure an array by permuting the array indices and storing the elements in row-major order according to the permuted indices. For example, for the permutation matrix

    [0  1  0]
    [1  0  0]
    [0  0  1]

element [x1,x2,x3] of a three-dimensional original array corresponds to element [x2,x1,x3] of the restructured array. Two special cases are row-major storage (the transformation matrix is the identity matrix) and column-major storage (the transformation matrix has ones along the off-diagonal, reversing the order of the array indices). For two-dimensional arrays, these are the only possible permutations. In general, for an m-dimensional array, there are m! possible permutations.

Considering only permutation matrices simplifies the task of finding T. Any permutation matrix is unimodular because it is integral (each element is either 0 or 1) and its determinant is 1 or -1. Therefore, our problem reduces to finding a permutation matrix T such that TA has the desired nonzero structure.

5. Similarly, right-multiplying a permutation to a row vector also permutes the vector components. Since we work with column vectors only, this property does not concern us.

m = number of array dimensions = number of rows in A
n = number of loop levels      = number of columns in A

A = access matrix               (A is an m by n array)
T = m by m identity matrix      (T is an m by m array)
j = m                           (j indexes a row)
k = n                           (k indexes a column)

WHILE (j > 1) and (k >= 1) DO
    Permute rows 1 through j to bring zeros in A[1..j,k] to the top
    Permute rows 1 through j of T accordingly
    j = row position of last zero element in A[1..j,k], or 0 if none
    k = k - 1
END

Figure 3.4: Algorithm to Transform Access Matrix with Permutation Matrix

The algorithm for computing T from A hinges on the fact that multiplying a permutation matrix T to A in effect permutes the rows of A. Intuitively, since we permute the array indices, we also permute the corresponding rows of the original access matrix accordingly to obtain the transformed access matrix.

The algorithm permutes the rows of A to produce the desired nonzero structure; the permutation matrix T is obtained by applying the same permutation to the identity matrix. Figure 3.4 shows the algorithm. Figure 3.5 illustrates its operation on an example. The two columns of matrices show how the arrays A (the access matrix being transformed) and T (the transformation matrix being computed) change as the algorithm permutes rows.

At the beginning, when j equals m and k equals n, the algorithm focuses on the last column of the entire matrix. The algorithm permutes the rows to bring the zeros to the top of the column, thus flattening the column (i.e., lowering the position of the top nonzero as far as possible). The same permutation is applied to the transformation matrix being formed. In the special case where the entire column is zero, no action is required to flatten the column: the column already has a height of zero. Permuting the rows is valid, though superfluous. On the other hand, if the entire column is nonzero, nothing can be done: row permutation can move the nonzeros, but not eliminate them. 36

         A                        T

  A0:  [0  -2   0   1]        [1  0  0]
       [2   0   0   0]        [0  1  0]
       [0   0   1   0]        [0  0  1]

         Exchange rows 1 and 3

  A1:  [0   0   1   0]        [0  0  1]
       [2   0   0   0]        [0  1  0]
       [0  -2   0   1]        [1  0  0]

         Exchange rows 1 and 2

  A2:  [2   0   0   0]        [0  1  0]
       [0   0   1   0]        [0  0  1]
       [0  -2   0   1]        [1  0  0]

         No change in row order

  A3:  [2   0   0   0]        [0  1  0]
       [0   0   1   0]        [0  0  1]
       [0  -2   0   1]        [1  0  0]

  (At each step, a dashed line in the original figure delimits the submatrix, indicated by j and k, still under consideration.)

Figure 3.5: Transforming Access Matrix by Permuting Rows

After flattening the last column, the algorithm shifts its attention to the submatrix consisting of the remaining columns and the rows that contain zeros in the last column. (This submatrix is delimited by variables j and k in the algorithm and by the dashed line in Figure 3.5.) Future permutation of these rows will not affect the columns we have processed because the matrix elements in the intersection of these rows and columns are zero anyway. At each step, the submatrix shrinks by one column and possibly one or more rows. The algorithm terminates when the submatrix shrinks into a single row or vanishes altogether, at which point further row permutations become meaningless.

Notice that we have not specified exactly how rows should be permuted to bring zeros to the top of a column. There may be multiple ways of doing this, but they lead to results that are equivalent in having the same height profile. For instance, consider the example in Figure 3.5. In the first step, we could have brought the two zeros in the last column to the top by moving the second and third rows up and the first row to the bottom. If we had done so, we would have obtained A2 directly without going through the second step. In the end, the algorithm would have computed the same transformation matrix. In general, no matter how we permute rows to bring zeros in a column to the top, the number of leading zeros in the resulting column is the same. Therefore, so is the height of that column and consequently the eventual height profile of the matrix.
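As a concrete rendering of Figure 3.4, the sketch below is our own transcription into executable form (the list-of-rows representation and the stable-partition way of bringing zeros to the top are our choices; any permutation that brings the zeros up works equally well, as just discussed):

def permute_for_locality(A):
    """Figure 3.4 in executable form: permute the rows of access matrix A so that zeros
    rise to the top of each column, working from the rightmost column leftward.
    Returns (permuted_A, T) where T is the permutation matrix applied to the rows."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]                           # work on a copy
    T = [[1 if i == c else 0 for c in range(m)] for i in range(m)]
    j, k = m, n
    while j > 1 and k >= 1:
        # Bring rows with a zero in column k to the top of rows 1..j (stable partition).
        order = ([r for r in range(j) if A[r][k-1] == 0] +
                 [r for r in range(j) if A[r][k-1] != 0])
        A[:j] = [A[r] for r in order]
        T[:j] = [T[r] for r in order]
        j = sum(1 for r in range(j) if A[r][k-1] == 0)  # last zero's row position, or 0
        k -= 1
    return A, T

On the example of Figure 3.5, this version reaches the matrix labelled A2 in one step, which is exactly the alternative permutation discussed above.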

This simple algorithm works best when each column contains at most one nonzero, which means that each loop variable appears in at most one array index (but an array index may involve more than one loop variable). Figure 3.5 is one example. Intuitively, we consider the loops one by one starting from the innermost. If the loop variable appears in an array index (there is at most one), we move this array index to the next less rapidly varying index position.

However, if some column contains multiple nonzeros (i.e., some loop variable appears in multiple array indices), permuting array indices may not be enough to make the innermost loop go through elements consecutively. One example is the array access we considered in Section 3.1.1 (shown in bold in Figure 3.1 on page 26). Since the array has only two dimensions, there are only two possible permutations: no transformation or array transpose. As we discussed before, array transpose does not solve the problem because with or without transpose, the innermost loop accesses array elements in different rows

and different columns. For this and similar cases, we must use general unimodular matrices. The following section discusses an algorithm to choose one.

3.2.2 General Unimodular Transformation Matrices

Given the simple introduction afforded by our consideration of only permutation matrices, we now present the general algorithm for computing a unimodular index transformation matrix from the access matrix. This algorithm always finds a transformation that flattens the access matrix such that the resulting height profile coincides with the rank profile. As discussed earlier in Section 3.1.2, no nonsingular transformation matrix can flatten the access matrix any further.

The algorithm consists of two steps. First, it finds a nonsingular matrix S such that the height profile of SA coincides with the rank profile of A. However, S may or may not be unimodular. In fact, it is not even guaranteed to be integral. Therefore, the second step computes from S a unimodular matrix T such that TA has the same height profile as SA. In this sense, T is "as good as" S, but in addition it satisfies the unimodularity requirement. In the rest of this section, we describe the algorithm to find S, show that the height profile of SA must be the same as the rank profile of A, and finally discuss the algorithm to find T from S.

Finding Nonsingular Transformation Matrix S

The first step resembles the simple algorithm discussed earlier in Section 3.2.1, but with one crucial difference: instead of only permuting rows, we use elementary row operations. In short, the algorithm transforms the original access matrix A to the desired form by means of elementary row operations, and accumulates the effects of these operations in a nonsingular matrix computed by applying the same sequence of operations to the identity matrix.

m = number of array dimensions = number of rows in access matrix
n = number of loop levels      = number of columns in access matrix

A = access matrix               (A is an m by n array)
S = m by m identity matrix      (S is an m by m array)
j = m                           (j indexes a row)

FOR k = n DOWNTO 1 DO           (k indexes a column)

    IF (A[1..j,k] are not zero) THEN

        (Ensure that A[j,k] is nonzero)
        IF (A[j,k] = 0) THEN
            Find a nonzero in A[1..(j-1),k]. Suppose it is in row i.
            Exchange row i of A and row j of A
            Exchange row i of S and row j of S
        ENDIF

        (Make A[1..(j-1),k] zero by eliminating nonzeros)
        FOR i = 1 TO j-1 DO
            f = A[i,k]/A[j,k]
            Multiply row j of A by f and subtract from row i of A
            Multiply row j of S by f and subtract from row i of S
        ENDFOR

        (Update only rows 1 through j-1 in future)
        j = j - 1

    ENDIF

ENDFOR

Figure 3.6: Algorithm to Transform Access Matrix with Nonsingular Matrix

There are three types of elementary row operations: scaling a row by a nonzero factor, exchanging two rows, and adding a multiple of one row to another [Schrijver 1986]. They subsume row permutations: any row permutation is equivalent to a series of row exchanges. As noted in Section 3.2.1, row permutation can only move nonzeros, but not eliminate them. The multiply-and-add operation can eliminate nonzeros.

The algorithm is shown in detail in Figure 3.6. It resembles the Gaussian elimination algorithm for inverting a matrix but applies to square as well as non-square matrices. All arithmetic operations are performed on rational numbers, rather than integers or floating point numbers. The algorithm has O(mn^2) complexity for an m × n access matrix, where m is the number of array dimensions and n is the number of loop levels. This is because the k-loop is executed n times, the i-loop at most m times, and each elementary row operation takes n steps, one for each element in the row. The cost of the algorithm is small because there are typically only a handful of array dimensions and loop levels.

The algorithm starts from the last column (when k is n). If the entire column is zero, no action is needed: the height is already as small as it can possibly be. If the column contains nonzeros, we ensure that the bottom element is nonzero by a row exchange if necessary. (This is similar to pivoting in Gaussian elimination.) Then, we eliminate all other nonzeros, if any, by multiply-and-add row operations, thus obtaining a column with only one nonzero in the bottom row and therefore a height of 1. The same sequence of row operations is applied to the identity matrix to compute the corresponding transformation matrix in S.

After flattening the last column, the algorithm drops the last row and column from further consideration and focuses on the remaining submatrix (delimited by variables j and k). Future updates on the remaining rows (i.e., rows 1 through j) will not affect the columns that have been processed and dropped from consideration (i.e., columns k+1 through n) because the intersection of these rows and columns contains only zeros. The algorithm terminates after going through all columns. (In fact, it can terminate earlier: when j reaches 1, the loop body in the algorithm performs no more updates to A and S.)
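The same algorithm in executable form may be useful for reference. The sketch below is our own transcription of Figure 3.6 (exact rational arithmetic is done with Python's Fraction type, matching the requirement that row operations be performed over the rationals):

from fractions import Fraction

def flatten_access_matrix(A):
    """Figure 3.6 in executable form: find a nonsingular S such that S*A has identical
    height and rank profiles. Returns (SA, S), both as matrices of Fractions."""
    m, n = len(A), len(A[0])
    A = [[Fraction(v) for v in row] for row in A]
    S = [[Fraction(1 if i == c else 0) for c in range(m)] for i in range(m)]
    j = m
    for k in range(n - 1, -1, -1):                     # columns n down to 1
        if any(A[r][k] != 0 for r in range(j)):
            if A[j-1][k] == 0:                         # ensure a nonzero pivot in row j
                i = next(r for r in range(j - 1) if A[r][k] != 0)
                A[i], A[j-1] = A[j-1], A[i]
                S[i], S[j-1] = S[j-1], S[i]
            for i in range(j - 1):                     # eliminate the nonzeros above row j
                f = A[i][k] / A[j-1][k]
                A[i] = [a - f * p for a, p in zip(A[i], A[j-1])]
                S[i] = [a - f * p for a, p in zip(S[i], S[j-1])]
            j -= 1                                     # restrict future updates to rows above
    return A, S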

Guaranteeing Identical Height and Rank Profiles

The above algorithm always finds a transformation matrix S such that the height and rank profiles of SA coincide. As noted in Section 3.1.2, no nonsingular transformation matrix can flatten the access matrix A any further. We now explain why it is guaranteed to achieve this.

The key is to show that after iteration k (the iteration for which the loop variable k in Figure 3.6 has the value k, not the k-th iteration), the submatrix comprising columns k through n of matrix variable A (i.e., the columns flattened so far) is of height m - j_k as well as rank m - j_k, where j_k is the value of variable j at the end of iteration k (after the decrement, if any, in iteration k). To simplify the following argument, we augment A with a column n+1 that contains only zeros and our algorithm with an iteration n+1 that does nothing. Adding a zero column to any matrix has no effect on the matrix's rank and height. Therefore, the extra column does not affect the rank and height profiles. For the augmented A, we claim that

    rank(A_k^(l)) = height(A_k^(l)) = m - j_k        for 1 ≤ l ≤ k ≤ n+1        (3.3)

where A_k^(l) is the submatrix consisting of columns k through n+1 of matrix variable A at the end of iteration l, while rank(...) and height(...) denote the rank and height of a matrix respectively. (This claim will be justified shortly.) When the algorithm terminates in iteration 1 (l = 1), this claim applies to all k between 1 and n+1 (inclusive). Therefore, the height and rank profiles of the final A, which is the transformed access matrix SA augmented with a zero column, are identical: rank(A_k^(1)) = height(A_k^(1)) for 1 ≤ k ≤ n+1. At the same time, the corresponding transformation matrix S has been computed in the matrix variable S.

We now justify the claim in (3.3) by examining how each iteration updates A. Formally, we prove the claim by induction on l, which decreases from n+1 to 1.

We first examine the initial case: l = n+1. Variable j is initialized to the number of rows. Thus, j_{n+1} = m. The submatrix A_{n+1}^(n+1) consists of a single zero column (which we have augmented A with) and hence has height and rank 0. Thus, (3.3) holds true.

Now consider the situation at the end of iteration l (l ≤ n), with the assumption that (3.3) holds at the end of the previous iteration (i.e., for l+1).

First, note that (3.3) remains true for k ≥ l+1 because iteration l does not change the columns processed in previous iterations, namely columns l+1 through n+1. That is,

    A_{l+1}^(l) = A_{l+1}^(l+1)        (3.4)

To see this, note that at the end of iteration l+1 (the previous iteration), our induction hypothesis implies that height(A_{l+1}^(l+1)) = m - j_{l+1} (by letting k = l+1). Therefore, the top nonzero in A_{l+1}^(l+1) is in row j_{l+1} + 1, while rows 1 through j_{l+1} of A_{l+1}^(l+1) are all zero. Since the updates in iteration l (the current iteration) are restricted to these rows of matrix variable A, they do not affect columns l+1 through n+1.

What remains to be shown is that the claim holds true also for k = l, that is, for the submatrix including the just processed column l. In other words, we need to show that

    rank(A_l^(l)) = height(A_l^(l)) = m - j_l        (3.5)

There are two cases to consider: the top j_{l+1} elements of column l are all zero, or they are not.

• In the first case, iteration l does nothing: the variable j is not changed; nor does including the unmodified column l change the rank and height of the submatrix of processed columns. More specifically, j_l = j_{l+1} because j is not modified. Moreover, since iteration l simply expands the submatrix of processed columns by the unmodified column l, the resulting submatrix A_l^(l) consists of column l and the columns in A_{l+1}^(l+1). Note that rank(A_{l+1}^(l+1)) = height(A_{l+1}^(l+1)) = m - j_{l+1} by the induction hypothesis, and that column l has j_{l+1} leading zeros in this case. We first determine rank(A_l^(l)). We observe that rank(A_l^(l)) ≥ rank(A_{l+1}^(l+1)) = m - j_{l+1} because A_l^(l) contains all the columns in A_{l+1}^(l+1). We also observe that rank(A_l^(l)) ≤ m - j_{l+1} because every column of A_l^(l) (either column l or a column from A_{l+1}^(l+1)) has nonzeros only in the bottom m - j_{l+1} elements. Combining these two observations, we have rank(A_l^(l)) = m - j_{l+1}. By an analogous argument, height(A_l^(l)) = m - j_{l+1}. These two results together give us

    rank(A_l^(l)) = height(A_l^(l)) = m - j_{l+1} = m - j_l        (3.6)

• In the second case, where some of the top j_{l+1} elements of column l are nonzero, iteration l decrements j and updates A appropriately. Thus, j_l = j_{l+1} - 1. Also, A_l^(l) consists of the updated column l and the columns of A_{l+1}^(l), which is identical to A_{l+1}^(l+1) according to (3.4). Note that column l is updated in such a way that it has j_{l+1} - 1 leading zeros followed by a nonzero in row j_{l+1}. Thus, the updated column l must have a height of m - j_{l+1} + 1. Note also that by the induction hypothesis A_{l+1}^(l+1) has a height of m - j_{l+1}. Therefore, height(A_l^(l)) equals m - j_{l+1} + 1, the greater of the two heights. As for rank(A_l^(l)), A_l^(l) contains m - j_{l+1} + 1 linearly independent columns: m - j_{l+1} of them from A_{l+1}^(l+1) (whose rank is m - j_{l+1}) and the last one being the updated column l (which is linearly independent of all columns in A_{l+1}^(l+1) because of its unique nonzero in row j_{l+1}). Thus, rank(A_l^(l)) = m - j_{l+1} + 1. Combining the results for the rank and height, we have in this second case

    rank(A_l^(l)) = height(A_l^(l)) = m - j_{l+1} + 1 = m - j_l        (3.7)

This completes the proof for the claim in (3.3). It follows that our algorithm finds a transformation matrix S such that SA has identical height and rank profiles.

Finding Unimodular Transformation Matrix T

So far, we have described the first step of our algorithm: finding a nonsingular matrix S such that SA has the desired form. However, S may or may not be unimodular; in fact, it may not even be integral. In the next step, we compute from S a unimodular matrix T such that TA and SA have identical height profiles. In this sense, T and S are "equally good."

In a word, we choose T = H_S^-1 S, where H_S is the Hermite normal form of S. A rational matrix is in Hermite normal form if and only if (a) it is lower-triangular, (b) all its elements are nonnegative, and (c) the diagonal element in each row is greater than other elements in the same row [Schrijver 1986]. It is known that for any rational nonsingular matrix, say B, there exists a unimodular matrix U such that H = BU is in Hermite normal form. Moreover, for any such matrix B, the corresponding Hermite normal form matrix is unique. Thus, it is meaningful to talk about the Hermite normal form of a matrix. There are known algorithms to compute it [Schrijver 1986].

To justify our choice for T, we must first show that H_S^-1 does exist. Then, we show that T = H_S^-1 S satisfies both our requirements: T is unimodular, and TA has the same height profile as SA.

• H_S^-1 exists because the determinant of H_S is nonzero: the determinant of H_S is the product of its main diagonal elements (because it is lower-triangular); and those elements are all positive (because each of them is greater than other elements on the same row, which are nonnegative).

• To see that T is unimodular, note that

    T = H_S^-1 S = (SU)^-1 S = U^-1 S^-1 S = U^-1        (3.8)

for some unimodular matrix U. Since the inverse of a unimodular matrix is also unimodular [Schrijver 1986], T is unimodular.

• As for the height profile requirement, note that H_S is lower-triangular, by the definition of Hermite normal form. Therefore, so is its inverse H_S^-1. Moreover,

    TA = (H_S^-1 S) A = H_S^-1 (SA)        (3.9)

In other words, we can obtain TA by left-multiplying SA with a lower-triangular matrix, namely H_S^-1. This does not change the heights of the columns, and therefore TA and SA have the same height profile.

FOR i1 = …
  FOR i2 = …
    … X[2*i2,i1] …

    A = [0  2]        T = [0  1]        TA = [1  0]
        [1  0]            [1  0]             [0  2]

[Figure 3.7 plots the array index space (x) and the transformed array index space (y). In the upper half, the elements accessed by consecutive innermost iterations (black circles) lie in one row of the restructured array with a stride of 2. The lower half shows a hypothetical transformation T′ under which the same elements would be adjacent (unit stride); as discussed below, no linear transformation achieves this.]

Figure 3.7: Affecting Magnitudes of Strides in Index Space

Magnitudes of Strides in Index Space

So far, we have focused on the directions of strides in the index space. Could we also reduce the magnitudes of the strides to improve spatial locality? Unfortunately, we cannot reduce these magnitudes at will as long as the restructured array must contain all elements of the original. Consider the example in Figure 3.7. The upper half shows the computed index transformation. With this transformation, the innermost loop accesses elements in the same row as indicated by the row of three black circles.

One might want to reduce the innermost loop's stride (i.e., the top, and in fact only, nonzero in the last column of TA) from 2 to 1. The lower half of the diagram shows this hypothetical situation. The black elements, accessed by consecutive iterations, are adjacent, indicating a unit stride in the innermost loop. However, no linear transformation can achieve this because with any linear transformation, the elements between the black elements in the original index space remain between the black elements in the transformed index space. (This is inherent to linear transformations: they map collinear points to collinear points.) There is no "space" between the adjacent black elements in the restructured array. Nevertheless, we can have some control over the magnitudes of strides by omitting unaccessed elements (such as those between the black ones) from the restructured array completely. Chapter 5 discusses this in detail.

Summary

Given an access matrix A, we compute an index transformation matrix T that improves locality of the access in two steps.

1. Use the algorithm in Figure 3.6 on page 39 to compute from access matrix A a nonsingular matrix S such that SA has the desired form. SA is guaranteed to have identical height and rank profiles. No nonsingular transformation matrix can flatten the access matrix further than that.

2. Let T = H_S^-1 S, where H_S is the Hermite normal form of S, which can be computed using known algorithms. The final index transformation matrix T is guaranteed to be unimodular. Moreover, TA has the same height profile as SA, and in this sense T is just "as good as" S for our purpose of improving locality. (See the sketch below for how the two steps fit together.)
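For reference, the two steps might be glued together as in the following sketch (ours). `flatten_access_matrix` is the transcription of Figure 3.6 sketched earlier; `hermite_normal_form` is a hypothetical helper standing in for any of the known algorithms cited above; the small rational-matrix helpers are included only to keep the sketch self-contained.

from fractions import Fraction

def mat_mul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def mat_inv(M):
    """Inverse of a nonsingular square rational matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(1 if i == c else 0) for c in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                aug[r] = [a - aug[r][c] * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def index_transformation(A):
    """Select a unimodular index transformation T for access matrix A (illustrative)."""
    _, S = flatten_access_matrix(A)     # step 1: Figure 3.6 (sketched earlier)
    H_S = hermite_normal_form(S)        # step 2: hypothetical HNF routine, e.g. per Schrijver
    return mat_mul(mat_inv(H_S), S)     # T = H_S^-1 S, which is unimodular by (3.8)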

3.3 More General Access Patterns

We have fully discussed a simple, but restrictive, case: a single access with affine index expressions to the array in question in a single perfect loop nest. In any real application, many of these simplifying assumptions do not hold. In this section, we discuss how to handle more general cases:

• multiple accesses to the same array

• non-affine array index expressions

• imperfect loop nests

• multiple loop nests

3.3.1 Multiple Accesses

So far, we have focused on just one array access and discussed how to compute an index transformation matrix from one access matrix. In general, however, a loop nest may contain multiple accesses to the same array. In fact, the example loop nest in Figure 3.1 of Section 3.1.1 contains two accesses to each of the two read-only arrays (X and Y), even though we concentrated on only one access in our earlier discussion.

In this section, we discuss the handling of multiple accesses to the same array. We first explore two broad issues: whether or not to create multiple, differently transformed copies of the array for different accesses, and exploiting locality between accesses in the same iteration. Then, we describe the more specific question of how we compute our index transformation matrix from the access matrices for multiple accesses.

Multiple Copies or One Copy

Sometimes, there is simply no one array layout that yields good locality for all accesses. A simple example is a two-dimensional array accessed by the same loop in row-major order through one access and in column-major order through another. One of the accesses is bound to exhibit poor locality whether we store the array in row-major or column-major order. Even if we allow more general layouts, as our framework does, at least one of the accesses cannot go through memory consecutively as desired. In fact, in this case, other layouts only worsen the situation by causing both accesses to have bad locality.

This problem can be addressed by creating multiple versions of the same array, each laid out differently. At one extreme, each array access has its own version. Alternatively, accesses that are "similar enough" to one another may share a single version. However, this approach has several drawbacks. First, the array has to be read-only. If the array may be written, we would have to maintain consistency between the multiple versions by updating all copies of the same element whenever one is updated (analogous to maintaining cache coherence on multiprocessors, but with no hardware support). This is both cumbersome and expensive. Second, the amount of memory required for the restructured array is increased by the duplication factor. So is, roughly, the runtime copying overhead for creating the restructured array. Third, since the loop accesses these array versions simultaneously, the competition for limited cache space is intensified. This may lead to more cache misses and thus diminish, if not outweigh, the performance advantage gained by improving the locality of each individual array access. Similarly, given a fixed amount of memory, the largest problem size that can be handled without excessive paging is reduced.

In many cases, however, one array layout suffices for a number of similar, but not identical, array accesses. One common case is the case of uniformly generated accesses [Wolf and Lam 1991a]. Uniformly generated accesses are accesses that vary with the loop variables in identical ways and differ only in the constant offsets, such as X[i+j-1,j] and X[i+j,j-2]. In formal terms as defined in (2.1) on page 15, they have the same access matrix A, but different offset vectors a. Since our algorithm for computing the transformation matrix depends only on the access matrix, such accesses would yield exactly the same transformation matrix and thus can share the same array layout. Intuitively, different uniformly generated accesses move through the array index space in the same directions when execution goes through various loops; they differ only in their starting points. Since we are concerned with only these directions of movement, we can treat a set of uniformly generated accesses as a single access.

Even if a number of array accesses are not uniformly generated, they may still be similar enough such that a single array layout is "good enough," though not ideal, for all of them.

FOR i1 = 1, n
  FOR i2 = 1, n
    FOR i3 = 1, n
      … X[i3,1,i2] …
      … X[i3+1,i1,1] …

    T1 = [0  1  0]      OR      T2 = [0  0  1]
         [0  0  1]                   [0  1  0]
         [1  0  0]                   [1  0  0]

FOR i1 = 1, n                   FOR i1 = 1, n
  FOR i2 = 1, n                   FOR i2 = 1, n
    FOR i3 = 1, n                   FOR i3 = 1, n
      … X2[1,i2,i3] …                 … X2[i2,1,i3] …
      … X2[i1,1,i3+1] …               … X2[1,i1,i3+1] …

Figure 3.8: Multiple Accesses to One Array

Consider the example in Figure 3.8. The triple loop has two accesses to the three-dimensional array X. Although they are not uniformly generated, they are similar in that only the first array index changes with the innermost loop variable i3. Thus, both transformations in the figure would improve performance substantially because they both allow the innermost loop to go through memory consecutively, even though in each case the array layout is not ideal for one of the accesses (the second for T1, the first for T2).

Locality between Accesses in the Same Iteration

In optimizing for locality, we treat each access individually: we focus on the spatial locality between occurrences of one access in consecutive iterations rather than that of multiple accesses to the same array in one iteration. Consider the illustrative example in Figure 3.9. If array X is stored in row-major order, as it is originally, the four accesses would touch adjacent array elements in any single iteration, whereas consecutive iterations would touch elements in different rows.

FOR i = 1, b, 4                 FOR i = 1, b, 4
  FOR j = 1, b                    FOR j = 1, b
    … X[j,i] …                      … X2[i,j] …
    … X[j,i+1] …                    … X2[i+1,j] …
    … X[j,i+2] …                    … X2[i+2,j] …
    … X[j,i+3] …                    … X2[i+3,j] …

[The original figure also depicts, before and after the transpose, how the touched elements fall on cache lines.]

Figure 3.9: Optimizing Multiple Accesses Individually or Collectively

If X is transposed, accesses in the same iteration would touch elements in different rows, whereas for each access, consecutive iterations touch consecutive elements. These two scenarios are depicted in Figure 3.9. Between these two alternatives, our algorithm would choose to transpose the array because it considers each access individually and therefore does not recognize that the four seemingly unrelated accesses to X in fact touch adjacent memory locations.

The impact of not recognizing and exploiting this is compensated for (and sometimes more than compensated for) by the improved locality of each individual access. After array restructuring, the four transformed accesses lead to four separate streams of consecutive memory references through four distinct memory regions. Occurrences of any one access in consecutive iterations can probably reuse array elements in the same cache line despite intervening accesses to other cache lines because such intervening accesses are few (three to X2 in the example). More generally, suppose the stride were s, instead of 4, each cache line contains l elements, and the value of variable b is b. The original loop would incur roughly ⌈s/l⌉ · b · (b/s) cache misses: ⌈s/l⌉ per iteration, for b iterations in the innermost loop and b/s executions of the innermost loop. If the array were transposed, the approximate number of misses would be ⌈b/l⌉ · s · (b/s): ⌈b/l⌉ for each of the s accesses in one execution of the innermost loop. The latter count is likely to be smaller since b is typically much greater than l, and thus

    ⌈b/l⌉ · s · (b/s) ≅ b²/l ≤ ⌈s/l⌉ · b · (b/s)

However, for a cache with low associativity, performance may suffer severely if the memory locations accessed in an iteration happen to map to the same position in the cache (typically because they are apart by some power of two). This problem can be remedied by array padding, i.e., adding a small, fixed number of dummy array elements to each row to avoid such an unfortunate mapping of cache lines [Bacon et al. 1994].
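To make the comparison concrete with illustrative numbers (ours, not from the dissertation): with s = 4, l = 16, and b = 1024, the original loop incurs roughly ⌈4/16⌉ · 1024 · (1024/4) = 1 · 1024 · 256 = 262,144 misses, whereas the transposed version incurs roughly ⌈1024/16⌉ · 4 · (1024/4) = 64 · 4 · 256 = 65,536, about l/s = 4 times fewer.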

Computing an Index Transformation Matrix

Based on the general discussion above, we now describe how our compiler handles multiple accesses to the same array. We choose not to use multiple copies of an array in a single loop nest, for the reasons discussed above. Instead, we compute a single transformation matrix from an aggregate access matrix, which combines the individual access matrices in a way to be described. The offset vectors are ignored. As we have discussed, they have little effect on locality. By ignoring them, we lose no essential information and automatically treat a set of uniformly generated accesses as one.

The aggregate access matrix consists of columns from the individual access matrices, ordered in a way that reflects their relative importance to locality, as in the single-access case. Recall that each column indicates how the array indices change as the corresponding loop variable is incremented. For a single access, we aim to flatten these columns, most of all the one corresponding to the innermost loop because it is most important for locality, but to a lesser extent other columns as well. Since this column is the rightmost in the access matrix, our algorithm transforms the matrix into, loosely speaking, a lower-triangular form for this purpose.

This notion is generalized to multiple accesses. Columns from individual access matrices are ordered in the aggregate matrix according to their corresponding loop levels: the column corresponding to an inner loop is placed to the right of that corresponding to an outer loop; tie-breaking rules for columns corresponding to the same loop level are designed to preserve the original layout as much as possible[6]. (Thus, when there is only one access, the aggregate access matrix reduces to the single access matrix.) Figure 3.10 illustrates this. The identical rightmost columns of A1 and A2 become the rightmost column of the aggregate matrix A; the middle column of A1 becomes the next in A; the middle column of A2 is omitted as explained below; the leftmost columns of A1 and A2 make up the rest of A.

When combining individual access matrices into an aggregate, we can treat certain columns specially. First, zero columns are ignored (as in Figure 3.10). This is because any transformation matrix T would transform a zero column to a zero column. Hence, zero columns do not affect our choice and therefore are dropped altogether. Second, identical columns and columns differing by only a constant scaling factor (e.g., (1,2)^T and (2,4)^T) are treated as one because any transformation matrix that flattens one column to a certain height always flattens the other to the same height.

Thus, the algorithm transforming an individual access matrix to lower-triangular form achieves roughly the same effect with the aggregate access matrix: columns most important for locality are flattened most. Figure 3.10 again illustrates this. After choosing the transformation matrix T, we can transform the two accesses individually according to (2.3) on page 18, effectively turning their access matrices A1 and A2 into TA1 and TA2 respectively. The transformed aggregate access matrix TA has the expected lower-triangular form, but the two individual transformed matrices may not. However, since the rightmost columns in both are flattened to a height of one, both accesses now go through memory consecutively, at least in the innermost loop.

6. Specifically, columns are ordered as follows: a column corresponding to an inner loop is placed to the right of one corresponding to an outer loop; that being equal, a column occurring more times (in the individual access matrices) at that level is placed to the right of one occurring fewer times; that being equal, a column with fewer nonzeros is placed to the right of one with more; that being equal, a column with a smaller height is placed to the right of one with a greater height. The last two rules tend to preserve the original array layout, other things being equal.

FOR i1 = …
  FOR i2 = …
    FOR i3 = …
      … X[i1+i3,i1,i2] …        A1 = [1  0  1]        A2 = [0  0  1]
      … X[i3,i1,1] …                 [1  0  0]             [1  0  0]
                                     [0  1  0]             [0  0  0]

    A = [1  0  0  1]
        [1  1  0  0]
        [0  0  1  0]

    T = [0  1  0]
        [0  0  1]
        [1  0  0]

    TA1 = [1  0  0]       TA = [1  1  0  0]       TA2 = [1  0  0]
          [0  1  0]            [0  0  1  0]             [0  0  0]
          [1  0  1]            [1  0  0  1]             [0  0  1]

FOR i1 = …
  FOR i2 = …
    FOR i3 = …
      … X2[i1,i2,i1+i3] …
      … X2[i1,1,i3] …

Figure 3.10: Computing Transformation from Aggregate Access Matrix

Summary

In summary, we choose one index transformation for each array, focusing on the potential for reuse between instances of each access in consecutive iterations, rather than that between accesses in the same iteration. To find this transformation matrix, we extract columns from the individual access matrices, order those columns based on their importance to locality, form an aggregate access matrix, and finally apply previously described algorithms to compute the transformation matrix.
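The column-ordering rules (including the tie-breaking rules of the footnote) amount to a sort with a composite key. The sketch below is our own illustration of that idea, not the prototype compiler's code; in particular, normalizing each column by its leading nonzero is simply our way of merging columns that differ only by a constant scale factor.

from fractions import Fraction

def column_height(col):
    m = len(col)
    for row, v in enumerate(col):
        if v != 0:
            return m - row
    return 0

def aggregate_access_matrix(access_matrices):
    """Combine the columns of several access matrices (all for the same array) into one
    aggregate matrix whose columns are ordered, right to left, by importance to locality."""
    candidates = {}                                        # normalized column -> bookkeeping
    for A in access_matrices:
        for level in range(1, len(A[0]) + 1):              # loop level of this column
            col = [row[level - 1] for row in A]
            if all(v == 0 for v in col):
                continue                                   # zero columns are ignored
            lead = next(v for v in col if v != 0)
            key = tuple(Fraction(v) / lead for v in col)   # merge scaled duplicates
            info = candidates.setdefault(key, {"col": col, "level": 0, "count": 0})
            info["level"] = max(info["level"], level)
            info["count"] += 1
    def sort_key(info):                                    # leftmost first under ascending sort
        nonzeros = sum(1 for v in info["col"] if v != 0)
        return (info["level"], info["count"], -nonzeros, -column_height(info["col"]))
    ordered = sorted(candidates.values(), key=sort_key)
    m = len(access_matrices[0])
    return [[info["col"][r] for info in ordered] for r in range(m)]

On the two access matrices of Figure 3.10, this sketch produces the aggregate matrix shown in that figure.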

3.3.2 Non-Affine Array Index Expressions

We have so far assumed that all array index expressions are affine functions of loop variables, thus allowing analysis and transformation of accesses entirely in a simple linear algebraic framework. To maximize applicability, we now extend our techniques to deal with non-affine index expressions as well. While affine accesses are most common and therefore appropriately the focus of our effort, it would be overly conservative if we were to give up restructuring an array when even a single index expression is not affine.

In dealing with non-affine (or otherwise complicated) accesses, array restructuring promises wider applicability than traditional loop restructuring techniques. Unlike loop restructuring, its legality does not rely on dependence analysis, which may be frustrated by the existence of even a single non-affine index expression in the loop. With array restructuring, affine and non-affine accesses are transformed similarly, as discussed shortly. In other words, non-affine index expressions do not prevent us from applying a selected array transformation. To select the transformation in the first place, we use information from the affine accesses, if any, and whatever can be learnt from the non-affine ones, however limited. At worst, we pick a transformation that does not improve locality as expected, but we never modify the program illegally.

To handle non-affine array accesses, we have to address two issues: mechanism and policy. By mechanism, we mean how to transform array accesses for a given index transformation. We show the transformed accesses to be as efficient as the original ones, just as in the affine case. By policy, we mean choosing the index transformation itself. We discuss how the analysis can accommodate non-affine index expressions.

Applying a Selected Transformation

First, let us consider the mechanism aspect. Whether array index expressions are affine or not, the transformed access generally involves no more indexing overhead than the original one because the address of an array element is an affine function of the indices in both cases. In the affine case, we have explained how the transformed index vector y, like the original x, can be expressed directly as an affine function of the iteration vector i. This hinges on the associativity of matrix multiplication (see (2.3) on page 18 in Section 2.2). In the non-affine case, a similar argument does not hold. Thus, it might appear that extra computation would be needed to compute y from x when the access is performed at run time.

In fact, no extra computation is required because y need not be computed explicitly. The address of an array element is an affine function of the array indices — original or transformed. Traditionally, the scalar offset of an element X[x1,x2,…] in a row-major array X is

m m  s= ∑ ∏ () u –l 1+  ()x – l (3.10) k k  j j j 1= k= j 1+ wherelk ,uk , andxk are respectively the lower bound, upper bound, and array index in the k-th dimension of an m-dimensional array. In other words, the offset s is an affine func- tion of the array index vector x:

    s = v · x + v_0        (3.11)

where v is a constant vector and v_0 a constant scalar offset, both dependent only on the array bounds. As we shall see in Chapter 4, after array restructuring, the offset of an element within the restructured array, denoted s′, remains an affine function of the transformed index vector y and therefore also of the original index vector x:

    s′ = v′ · y + v′_0 = (v′ T) · x + v′_0        (3.12)

Without array restructuring, when performing an access, we may have to compute the array indices in x and then the location s according to (3.11). With array restructuring, we still compute x as before, and then s′ according to (3.12). The constants involved in these two cases may be different, but the form of the code should be the same and thus equally efficient because the same compiler optimization techniques should apply in both cases, provided that these techniques do not depend on the particular values of constants. For example, induction variable elimination would still apply unless a particular implementation relies on the increment having specific values, rather than merely the knowledge that the increment is loop-invariant.
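A small sketch (ours) may make the combination concrete; row_major_coeffs computes the v and v_0 of (3.10) and (3.11) for given bounds, and restructured_offset evaluates (3.12) by folding the index transformation into the addressing of the restructured array (v′ and v′_0 would come from the restructured array's bounds, a topic of Chapter 4):

def row_major_coeffs(lower, upper):
    """Coefficients (v, v0) with offset = v . x + v0 for a row-major array whose k-th
    dimension has bounds lower[k] .. upper[k], as in (3.10)-(3.11)."""
    m = len(lower)
    v = []
    for j in range(m):
        stride = 1
        for k in range(j + 1, m):
            stride *= upper[k] - lower[k] + 1
        v.append(stride)
    v0 = -sum(vj * lj for vj, lj in zip(v, lower))
    return v, v0

def restructured_offset(v_prime, v0_prime, T, x):
    """Offset of the restructured element for original index vector x, as in (3.12):
    s' = v' . (T x) + v0' = (v' T) . x + v0'."""
    y = [sum(t * xi for t, xi in zip(row, x)) for row in T]     # y = T x
    return sum(a * b for a, b in zip(v_prime, y)) + v0_prime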

Selecting the Index Transformation Matrix

Now we turn to the policy aspect of handling non-affine index expressions: choosing the index transformation itself. We extract as much access pattern information as we can from the index expressions, though perhaps not as much as in the affine case. Any limitation in the analysis, however, does not make the transformed program incorrect; at worst, we mistakenly select a transformation that does not improve performance as we hope. In the following discussion, we first review what information is needed to compute an index transformation and then discuss how far we can obtain that information from non-affine accesses.

To choose an index transformation, we need to know how array indices change as various loop variables are incremented. As discussed before, in the affine case, these changes can be determined precisely: they come from the columns of the access matrices. In the non-affine case, we have less accurate but often still useful information.

Let us look at several examples that illustrate possible techniques before we discuss our approach in general terms. These examples are shown in the upper half of Figure 3.11, in which f(.) and so on are non-affine expressions about which we know little except that their values are completely determined by their arguments.

FOR i1 = …                  FOR i1 = …                        FOR i1 = …
  FOR i2 = …                  FOR i2 = …                        FOR i2 = …
    X[i2,f(i1)]                 X[f(i1)+i2,g(i1)+i2]              X[f(i1)+h(i2),g(i1)+h(i2)]

    x = [0  0  1] (i1, f(i1), i2)^T
        [0  1  0]

    x = [0  1  0  1] (i1, f(i1), g(i1), i2)^T
        [0  0  1  1]

    x = [0  1  0  0  1] (i1, f(i1), g(i1), i2, h(i2))^T
        [0  0  1  0  1]

Figure 3.11: Examples of Non-Affine Array Accesses

For the first access X[i2,f(i1)], the non-affine f(i1) prevents us from knowing which column is accessed in each outer loop iteration. We do know, however, that every execution of the inner loop accesses a column. Therefore, we should transpose the array regardless of f(i1). For the second, less obvious access X[f(i1)+i2,g(i1)+i2], we know that the inner loop accesses elements on the same diagonal consecutively, although we cannot determine which diagonal it is for any given outer loop iteration. Nevertheless, this partial information is enough for us to choose a transformation, say

    [1  -1]
    [0   1]

that places elements on such a diagonal consecutively in memory. Consider the third access: X[f(i1)+h(i2),g(i1)+h(i2)]. In this case, the inner loop also accesses elements along a diagonal because given the value of i1, varying i2 always changes both array indices by the same amount. We do not know which diagonal it is. Nor can we tell how the elements in that diagonal are accessed: they could be accessed in some irregular order. Despite this lack of information, it still seems beneficial to store elements in the same diagonal consecutively. Even if the inner loop may not go through them consecutively, at least the memory accesses are likely to stay within a smaller region than if the array is stored in the original, row-major order. The performance benefits are uncertain, however.

Generally, our analysis for non-affine accesses identifies non-affine sub-expressions in index expressions and writes each index expression as an affine function of these non-affine sub-expressions and the loop variables, as we have done in the examples above.

Moreover, we recognize non-affine sub-expressions common to different array dimensions, such as the h(i2) in X[f(i1)+h(i2),g(i1)+h(i2)]. Thus, the three previous examples may be represented by "augmented" access matrices and iteration vectors as shown in the lower half of Figure 3.11. Each column in an augmented access matrix corresponds to a loop variable or one of the non-affine sub-expressions identified, which are themselves functions of loop variables. Although we cannot determine exactly how each non-affine sub-expression varies with the loop variables, we do know which loop variables it depends on. Therefore, when we order columns from individual access matrices to form an aggregate access matrix, we treat a column for a non-affine sub-expression as if it were for the deepest loop whose variable the sub-expression depends on. For instance, the column (1,1)^T in the third example is deemed to correspond to the inner loop variable i2. With this rule, we form the aggregate access matrix as before, ordering columns by the depths of their corresponding loops and with tie-breaking rules aimed at preserving the original array layout as much as possible. The index transformation matrix is then chosen just as discussed earlier.

How do we identify these non-affine sub-expressions? Roughly speaking, we examine the expression trees for array indices from the bottom up. If a node is an affine function of loop variables and already identified non-affine sub-expressions, it is annotated as "affine" and with the coefficients of the affine function. Specifically, a node that adds or subtracts two affine nodes or multiplies a loop-invariant constant to an affine node falls into this category. The parent's coefficients are computed from the children's. In other cases, the node is annotated as non-affine, and we record the set of loop variables it depends on, which is the union of the sets for its children. This procedure is similar to value numbering [Alpern, Wegman, and Zadeck 1988; Bacon, Graham, and Sharp 1994; Reif and Lewis 1986].
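The following sketch is a rough illustration of this bottom-up annotation (ours; the expression representation, the restriction to two-operand nodes, and the use of tree nodes themselves as stand-ins for non-affine sub-expressions are all simplifying assumptions, not the compiler's internals):

class Node:
    """Tiny expression tree: op is one of 'var', 'const', '+', '-', '*', or 'call'."""
    def __init__(self, op, value=None, children=()):
        self.op, self.value, self.children = op, value, tuple(children)

def used_loop_vars(node, loop_vars):
    deps = {node.value} & loop_vars if node.op == 'var' else set()
    for child in node.children:
        deps |= used_loop_vars(child, loop_vars)
    return deps

def annotate(node, loop_vars, nonaffine):
    """Write `node` as an affine combination (coeffs, const) of loop variables and
    non-affine sub-expressions; each non-affine sub-expression found is entered into
    `nonaffine`, mapped to the set of loop variables it depends on."""
    if node.op == 'const':
        return {}, node.value
    if node.op == 'var' and node.value in loop_vars:
        return {node.value: 1}, 0
    if node.op in ('+', '-') and len(node.children) == 2:
        (c1, k1), (c2, k2) = (annotate(c, loop_vars, nonaffine) for c in node.children)
        sign = 1 if node.op == '+' else -1
        coeffs = dict(c1)
        for term, coef in c2.items():
            coeffs[term] = coeffs.get(term, 0) + sign * coef
        return coeffs, k1 + sign * k2
    if node.op == '*' and len(node.children) == 2:
        (c1, k1), (c2, k2) = (annotate(c, loop_vars, nonaffine) for c in node.children)
        if not c1:                 # left factor is a loop-invariant constant
            return {t: k1 * v for t, v in c2.items()}, k1 * k2
        if not c2:                 # right factor is a loop-invariant constant
            return {t: k2 * v for t, v in c1.items()}, k2 * k1
    # Anything else (calls, products of variables, ...) is a non-affine sub-expression.
    nonaffine[node] = used_loop_vars(node, loop_vars)
    return {node: 1}, 0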

Summary

In summary, non-affine array accesses by no means hamper the application of a selected index transformation. The transformed array accesses do not involve any more indexing overhead than the original ones because the address computation and index transformation, both represented by affine functions, can be combined. Non-affine accesses do, however, complicate the analysis for choosing the transformation. We try to express non-affine index expressions as affine functions of non-affine sub-expressions to determine, as far as we can, how array indices change when loop variables are incremented and thus choose a suitable index transformation.

3.3.3 Imperfect Loop Nests

Imperfect nesting of loops often complicates, if not hampers, loop restructuring. For a perfect loop nest, one can think of many loop transformations (e.g., interchange [Allen and Kennedy 1984], reversal [Wedel 1975], and skewing [Wolfe 1989b]) simply as the reordering of indivisible iterations. They can be expressed formally as transformations of an iteration space whose points represent iterations of the innermost loop. For an imperfect loop nest, however, iterations of non-innermost loops also have to be represented in the same framework. Work on this continues [Kelly and Pugh 1995; Pugh 1991]. Alternatively, the compiler may split an imperfectly nested loop into equivalent, perfect loop nests with loop distribution, transform the latter individually, and finally recombine them with loop fusion if appropriate [Kennedy and McKinley 1994].

On the other hand, array restructuring can easily handle imperfect loop nests. Since the loop structure is unchanged, as before we only need to compute the changes in array indices as various loop variables are incremented. These changes can be obtained from columns of the access matrix as before. The access matrix for an access outside the innermost loop may have fewer columns than one within, but all access matrices have the same number of rows, which is the number of dimensions of the array in question. Therefore, access matrix columns may be extracted and ordered according to loop levels to form an aggregate access matrix just as discussed before. This is illustrated by the example in

Figure 3.12. Despite the different dimensions of access matrices $A_1$ and $A_2$, this example is handled in much the same way as the perfect loop nest in Figure 3.10 on page 53.

FOR i1 = 1, n
  FOR i2 = 1, n
    … X[i2,i1,i2] …                A1 = [0 1; 1 0; 0 1]
    FOR i3 = 1, n
      … X[i3,1,i2] …               A2 = [0 0 1; 0 0 0; 0 1 0]

Aggregate access matrix:           A   = [0 1 0 1; 1 0 0 0; 0 1 1 0]

Index transformation:              T   = [0 1 0; 0 0 1; 1 0 0]

Transformed access matrices:       TA1 = [1 0; 0 1; 0 1]
                                   TA  = [1 0 0 0; 0 1 1 0; 0 1 0 1]
                                   TA2 = [0 0 0; 0 1 0; 0 0 1]

FOR i1 = 1, n
  FOR i2 = 1, n
    … X2[i1,i2,i2] …
    FOR i3 = 1, n
      … X2[1,i2,i3] …

Figure 3.12: Handling an Imperfect Loop Nest
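To illustrate the aggregation step in Figure 3.12, the following sketch (our own helper, not the prototype's interface; the tie-breaking rules within a loop level are glossed over) groups access matrix columns by loop depth, drops all-zero columns, and applies the chosen index transformation.

```python
import numpy as np

def aggregate(accesses, n_loops):
    """accesses: list of (access_matrix, loop_depth) pairs; column j of an
    access matrix corresponds to loop variable i_{j+1}."""
    cols = []
    for depth in range(n_loops):                     # outermost loop first
        for A, d in accesses:
            if depth < d:
                c = A[:, depth]
                if np.any(c):                        # drop all-zero columns
                    cols.append(c)
    return np.column_stack(cols)

A1 = np.array([[0, 1], [1, 0], [0, 1]])              # X[i2,i1,i2], nested in 2 loops
A2 = np.array([[0, 0, 1], [0, 0, 0], [0, 1, 0]])     # X[i3,1,i2], nested in 3 loops
A = aggregate([(A1, 2), (A2, 3)], 3)
T = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])      # chosen index transformation

print(A)          # [[0 1 0 1] [1 0 0 0] [0 1 1 0]], as in Figure 3.12
print(T @ A1)     # transformed access: X2[i1,i2,i2]
print(T @ A2)     # transformed access: X2[1,i2,i3]
```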

3.3.4 Multiple Loop Nests

We have so far focused on optimizing for a single loop nest. Before loop execution, elements of the original array are copied to the restructured array; during loop execution, the restructured array is accessed instead of the original; after loop execution, elements are copied back to the original array if the array is written. The whole process is then repeated for the next loop nest. This strategy is simple but potentially inefficient: the copying overhead may outweigh the performance gain in loop execution.

This copying overhead can be amortized if a single array transformation is chosen for a larger segment of the program that contains more than one loop nest. The array needs to be restructured only before and, if written, after the entire set of loop nests; the intermediate copying is eliminated. We can trivially extend the analysis techniques discussed so far for this purpose: simply enclose the program segment in question in an imaginary loop that has only one iteration and then analyze the resulting imperfect loop nest as before.

The nontrivial issue that warrants more discussion is which loop nests should be grouped together. When making this decision, we have to consider the tradeoff between copying overhead and potential improvement in loop execution performance. At one end of the spectrum, choosing an array transformation best for each loop nest would maximize the performance gain but at a substantial copying cost; at the other, grouping as many loop nests together as possible would amortize the overhead but may cause some loop nests to be executed with an array layout unsuitable for the access pattern. In the following, we first look at several options, going from fine- to more coarse-grained grouping, and then discuss the approach we have taken.

First, we can consider each loop nest individually. Arrays are restructured before and, if written, after each loop execution. We have already discussed this option.

Second, we may consider a “basic block” of loop nests at a time. By a “basic block,” we mean a sequence of loop nests (perhaps with intervening non-loop code) that are always executed together, analogous to a basic block of instructions. As discussed above, we can handle this simply by enclosing the loop nests inside an imaginary loop with one iteration.

We need not always group the longest straight-line sequence of loop nests together. Instead, we could break such a sequence into subsequences of loop nests such that each subsequence has its own array transformation and the array is dynamically restructured between subsequences. The sequence can be broken up in different ways for different arrays; there is no reason why all arrays must be restructured at the same points in program execution. For each array, the access patterns of loop nests in a subsequence should be similar enough that there is some array transformation appropriate (though not necessarily ideal) for all of them. Assigning each loop nest to its own subsequence no doubt achieves this, but on the other hand we want to minimize the number of subsequences or, more precisely, the copying overhead between subsequences. This problem can probably be solved with some form of dynamic programming. We would need cost models to estimate the copying overhead and the performance benefit of a given array transformation to a given series of loop nests, together with an algorithm that chooses array transformations to maximize such benefits. Much remains to be done; we have not explored this direction further.

Third, we may consider an entire procedure. Arrays are restructured dynamically only at procedure calls and returns. Interprocedural analysis is not required. This is especially appropriate for procedures that are intended for a library and therefore may be called by code that has been compiled separately without any notion of array restructuring. Arrays are stored in a canonical form whenever control is being transferred between procedures. Alternatively, if the calling convention includes the layout information of arrays passed between caller and callee, arrays can be restructured only as required even across proce- dure calls.

Finally, we can consider the entire program and choose a single transformation for each array. Although this approach requires sophisticated interprocedural analysis to locate all accesses, including aliased accesses, to each array, it has the major advantage of eliminating not only all runtime copying overhead but also the need for extra memory to store the restructured arrays in addition to the originals. However, there may not be a single array transformation appropriate throughout the entire program. If there is an obvious choice, the programmer may have written the program based on that to begin with.

Our strategy for multiple loop nests combines compiler analysis on each loop nest individually with "lazy restructuring" at run time. The compiler chooses a transformation for each array and each loop nest, and transforms the accesses accordingly. However, we do not obliviously copy from and to an original array laid out in some fixed canonical order. Instead, our runtime system caches valid copies of an array, each corresponding to a different transformation. Before a loop nest is executed, it looks for a valid copy matching the desired transformation for this loop nest, as determined by the compiler. If such a copy is found, it is used; otherwise, the runtime system creates one from a valid copy. After loop execution, the restructured array is preserved, and other copies are invalidated if the array has been written. This is analogous to the maintenance of cache coherence in shared-memory multiprocessors with an invalidation-based protocol. For compatibility with separately compiled procedures, arrays are assumed to be in row-major order on procedure entry and restored to row-major order on exit. In other words, array restructuring occurs only within a procedure.
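The sketch below gives a schematic picture of such a lazy-restructuring runtime. The class and method names are illustrative assumptions, not the actual interface of the prototype; the point is only the invalidation-based bookkeeping.

```python
class ArrayCopies:
    """Cache of restructured copies of one array, one per transformation."""

    def __init__(self, original):
        # the canonical row-major copy is the only valid one initially
        self.copies = {'identity': original}
        self.valid = {'identity'}

    def get(self, transform_key, restructure):
        """Return a copy laid out for `transform_key`, creating it lazily."""
        if transform_key not in self.valid:
            source_key = next(iter(self.valid))          # any currently valid copy
            self.copies[transform_key] = restructure(self.copies[source_key],
                                                     transform_key)
            self.valid.add(transform_key)
        return self.copies[transform_key]

    def written(self, transform_key):
        """After a loop nest writes this copy, invalidate all the others."""
        self.valid = {transform_key}
```

A loop nest would call `get` with the compiler-chosen transformation before executing, and `written` afterwards if the array was modified, mirroring an invalidation-based coherence protocol.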

The major limitation of this strategy is the extra storage required to keep multiple copies of the same array. Keeping more than two copies does not imply proportionally increased pressure for cache space, because only some of them are accessed simultaneously — two during copying, one in loop execution. However, for some fixed capacity of storage (be it memory or disk), keeping extra copies does limit the size of the array (and hence the problem) that can be handled. Therefore, for large problems whose bottleneck is memory or disk capacity, we should discard some (perhaps all but one) of the copies.

When an array should be restructured is as important and difficult a question as how. The answer depends on a tradeoff between allowing each loop nest to run with the best array layout and the runtime copying overhead needed to create those layouts. Another concern is the extra storage needed for multiple copies of an array, which may limit problem sizes for a fixed amount of available storage. In this section, we have explored only some of the possible options. In fact, we have not even touched on the possibility of restructuring an array within a loop nest, rather than always before and after it. Much more work is needed to fully understand the issue of when to restructure an array.

3.4 Summary

To improve the spatial locality of a single array access by array restructuring, we find a unimodular index transformation matrix with, roughly speaking, a "lower-triangular" nonzero structure. Our procedure for this consists of two steps. First, a Gaussian-elimination-like algorithm transforms the access matrix to the right form with elementary row operations, accumulating the effects of those operations in a nonsingular, but not necessarily unimodular, matrix. The algorithm always produces a transformed access matrix with identical rank and height profiles; we have shown that it is impossible to flatten the access matrix columns any further. In the second step, a unimodular matrix is computed from the nonsingular matrix just obtained by means of the latter's Hermite normal form.

We have also discussed handling general access patterns. If there are multiple accesses to one array, the same algorithm for computing the transformation matrix can be applied to an aggregate access matrix constructed from the columns of the individual ones. Non-affine index expressions do not hinder the application of an already selected index transformation. They do make choosing that transformation harder by making the analysis less precise than in the affine case. We have nonetheless developed techniques to gather some useful access pattern information that guides the choice. Imperfect loop nests pose no special problem. Finally, using one array transformation for multiple loop nests can amortize the runtime copying overhead but involves complex tradeoffs that require further study.

Chapter 4

Linearizing Restructured Arrays

What does it mean to store elements of an array in row-major order? In Chapter 2, we assume that the restructured array is row-major. Accessing an element involves, conceptually, computing the original index vector, applying a linear transformation, and finally indexing the restructured array under this assumption. Based on the same assumption, we have discussed in Chapter 3 how we can relate locality to the access matrix and thus choose a suitable index transformation to enhance locality. To see what this key assumption means to the restructured array, we first look at what it means traditionally.

Traditionally, an array declaration specifies lower and upper bounds on each array index. The bounds are constant: they do not depend on the values of other indices. Thus,

for a row-major, m-dimensional array, finding an element with indices $x_1, \ldots, x_m$ requires computing a scalar offset s into the array as

$$ s = \sum_{j=1}^{m} \left[ \prod_{k=j+1}^{m} (u_k - l_k + 1) \right] (x_j - l_j) \qquad (4.1) $$

where $l_k$ and $u_k$ are respectively the lower and upper bounds of the k-th array index. This can also be expressed in vector form:

$$ s = vx - vl \qquad (4.2) $$

where

$$ v_j = \prod_{k=j+1}^{m} (u_k - l_k + 1) \qquad (1 \le j \le m) \qquad (4.3) $$

We call the row vector v the linearization vector. The address of the element can be further computed as an affine function of the offset s, using the size of an element and the starting address of the array. Since only affine functions are used, a compiler can optimize the address calculation by strength reduction: replace the brute-force evaluation of affine functions, which involves multiple integer multiplications and additions per access, with incremental evaluation, which only involves incrementing an address pointer by a precomputed amount in each iteration.
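For the constant-bound case, the computation in (4.1) through (4.3) can be written out directly. The short sketch below is our own illustration (the helper name is an assumption), using the array X[0:50,0:1000] that appears later in Section 4.1 purely as an example.

```python
import numpy as np

def linearization_vector(lower, upper):
    """v_j = prod_{k>j} (u_k - l_k + 1), the traditional row-major case."""
    extents = np.array(upper) - np.array(lower) + 1
    v = np.ones(len(extents), dtype=int)
    for j in range(len(extents) - 2, -1, -1):
        v[j] = v[j + 1] * extents[j + 1]
    return v

lower, upper = [0, 0], [50, 1000]
v = linearization_vector(lower, upper)                 # [1001, 1]
offset = v @ np.array([3, 7]) - v @ np.array(lower)    # s = vx - vl = 3010
```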

Thus, to lay out elements of the restructured array in row-major order, we need to compute the same two pieces of information: the bounds of the restructured array and, from those bounds, the linearization vector. This is simple if our index transformation just permutes array indices (as in Section 3.2.1). We can obtain the transformed array bounds by permuting the original array bounds accordingly. The linearization vector is then calcu- lated in the traditional way from the transformed (i.e., permuted) array bounds.

The rest of this chapter considers the general case where the index transformation matrix is an arbitrary unimodular matrix. Computing the bounds of the restructured array is relatively simple. It is discussed in Section 4.1. Computing the linearization vector from those bounds, however, is much harder than in the traditional case because those bounds may not be constant. In other words, the lower and upper bounds for one array index may depend on the values of other array indices. Such non-constant array bounds render the traditional solution described above inapplicable. Section 4.2 explains this problem more clearly. Section 4.3 introduces our solution with an example. Section 4.4 and Section 4.5 discuss the general algorithm in detail.

4.1 Computing the Bounds of the Restructured Array

To compute the bounds of the restructured array, we express those of the original array algebraically, as well as geometrically, and transform these representations with simple linear algebra.

We can represent the bounds of the original array algebraically as a set of inequalities involving the array indices, or geometrically as a convex polyhedron in the array index space. Both representations are illustrated on the left of Figure 4.1 for the two-dimensional array X shown at the top of the diagram.

• In the algebraic representation, each inequality simply states that some array index must be at least its lower bound, or at most its upper bound. To facilitate mathematical manipulation, we write the inequalities in vector form as

$$ Bx \ge b \qquad (4.4) $$

in which the comparison is interpreted row by row, with each row of the matrix B and vector b corresponding to an inequality in standard form.

• In the geometrical representation, valid array index vectors correspond to integral points (i.e., points whose coordinates are integers) within a convex polyhedron in the index space. Each face of the polyhedron is a hyperplane defined by one of the inequalities, and points in the polyhedron have coordinates that satisfy all the inequalities. For the original array, the polyhedron is simply a rectilinear region (a rectangle in the two-dimensional case) because all the inequalities involve only one array index.

Given the index transformation matrix T, we can compute the bounds of the restructured array by simple substitution. Since T is unimodular, it is nonsingular. Thus,

$$ x = T^{-1} y \qquad (4.5) $$

Original array X[0:50,0:1000]; restructured array X2[???]

    T = [1 1; 0 1],    T^{-1} = [1 -1; 0 1]

Original bounds (left):      x1 >= 0,  x1 <= 50,  x2 >= 0,  x2 <= 1000
    i.e.  [1 0; -1 0; 0 1; 0 -1] x >= (0, -50, 0, -1000),  written Bx >= b

Transformed bounds (right):  y1 >= y2,  y1 <= y2 + 50,  y2 >= 0,  y2 <= 1000
    i.e.  [1 -1; -1 1; 0 1; 0 -1] y >= (0, -50, 0, -1000),  written (B T^{-1}) y >= b

Figure 4.1: Transforming Array Bounds

where x and y are index vectors for the original and restructured arrays respectively. Substituting this back into (4.4), we get

1–  BT  y≥ b (4.6)

This gives us the linear inequalities that the transformed array indices must satisfy if the original ones satisfy (4.4). The right half of Figure 4.1 shows the transformed inequalities for the example. Notice that the transformed array bounds cannot be expressed as a conventional array declaration because the bounds on one array index may depend on the values of the others. In geometrical terms, each face of the polyhedron in the original index space is mapped through the index transformation to a hyperplane in the transformed index space. These image hyperplanes together delineate the image polyhedron representing the bounds of the restructured array.

All integral vectors within the transformed array bounds (4.6) correspond to original index vectors within the original array bounds (4.4), and therefore identify elements that the restructured array must contain. This is because any integral vector y satisfying (4.6) is the image of $x = T^{-1} y$, which is integral (since both $T$ and $T^{-1}$ are unimodular) and satisfies (4.4). Similarly, the converse is also true.
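The substitution is mechanical; the following sketch (our own, with variable names chosen for this illustration) carries it out for the example of Figure 4.1.

```python
import numpy as np

T = np.array([[1, 1], [0, 1]])                    # unimodular index transformation
B = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])  # original bounds: B x >= b
b = np.array([0, -50, 0, -1000])

T_inv = np.round(np.linalg.inv(T)).astype(int)    # T unimodular, so T^{-1} is integral
B_t = B @ T_inv                                   # transformed bounds: (B T^{-1}) y >= b
# B_t = [[1 -1], [-1 1], [0 1], [0 -1]], i.e. y1-y2 >= 0, y1-y2 <= 50, 0 <= y2 <= 1000
```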

4.2 Problem of Non-Constant Array Bounds

Having computed the bounds of the restructured array according to (4.6), we lay out the elements (in row-major order) in such a way that the location of an element can be efficiently computed from the array indices. Specifically, we find an affine function to express the location in terms of the array indices. This affine function is represented by the linearization vector. In this section, we explain why computing this vector is nontrivial in our case, highlighting the difficulty with a simple but unsatisfactory solution.

Computing the linearization vector is complicated by array restructuring. When array bounds do not depend on array indices, as in the traditional case, (4.1) gives a well-known, closed-form answer. However, in general our array restructuring technique may lead to array bounds that vary with other array indices, as Figure 4.1 has shown. In geometrical terms, the bounds of the restructured array are delineated by a parallelepiped¹, rather than a rectilinear region, in the array index space. The traditional method therefore does not apply.

1. A parallelepiped is the analogue of a parallelogram in three or more dimensions.

[The left half of the figure shows the 0–50 by 0–1000 index space of the original array; the right half shows its image under T = [1 1; 0 1], a parallelogram whose first index ranges from 0 to 1050, enclosed in a dashed bounding box.]

Figure 4.2: Laying Out Elements of Restructured Array — Tentative Solution

A simple but unsatisfactory solution is illustrated by Figure 4.2, showing the same example as Figure 4.1. As in Figure 4.1, the rectangle on the left and the parallelogram on the right represent the bounds of the original and restructured arrays respectively. The tentative solution uses the dashed box on the right as the bounds of an "expanded" array. In other words, we compute the two extremities of the transformed array bounds (the shaded parallelogram) in each dimension. Then, we conventionally declare an array with the extremities as lower and upper bounds. Since this expanded array contains all the elements necessary for storing the original, it can be used in place of the restructured array.

However, this naive solution could waste a large, potentially unlimited, amount of memory just to make it efficient to find an element given the array indices. In Figure 4.2, this wasted memory corresponds to the two empty triangles in the dashed box. In this example, the original array has 51,051 elements (51 rows by 1001 columns), but the naively expanded array has 1,052,051 elements (1051 rows by 1001 columns). Only one-twentieth of the expanded array is used! This utilization rate, already poor, can be made arbitrarily small by making the original array, and hence also the restructured array, "skinnier." Moreover, the unused memory locations are interspersed with the utilized ones at a moderately fine granularity: each contiguous range of unused memory spans roughly one row of elements. Therefore, we cannot reduce the waste by not allocating physical pages for the unused portions. The same problem, of course, occurs also in higher-dimensional cases.

[The figure has three parts: on the left, the 0–50 by 0–1000 index space of the original array; in the middle, the transformed parallelogram for T = [1 1; 0 1] (first index 0 to 1050), enclosed by an outer parallelogram with a pair of horizontal sides; on the right, the rectangle with first index 0 to 1050 and second index -50 to 0 obtained by applying R = [1 0; -1 1] to the outer parallelogram.]

Figure 4.3: Laying Out Elements of Restructured Array — Our Solution

4.3 Introducing Our Solution

We address this problem with a two-step algorithm that, roughly speaking, finds a transformation to turn the array bounds back into constants without jeopardizing prior transformations for better locality. Once we have constant array bounds, the traditional method can be used to compute the linearization vector. Figure 4.3 shows how the algorithm applies to the example. Again, the rectangle on the left and the deeply shaded parallelogram in the middle represent the bounds of the original and restructured arrays respectively.

1. The first step relates to the middle part of the figure. It computes a parallelogram (the shaded, outer parallelogram in the middle) with a pair of horizontal sides (the

top and bottom sides) to enclose the parallelogram representing the bounds of the restructured array (the deeply shaded, inner parallelogram). Why horizontal sides are significant will become clear below. The outer parallelogram defines the bounds of an expanded array that contains all the necessary elements (corresponding to integral points in the inner parallelogram) and perhaps some unnecessary ones. Nat- urally we want to minimize the number of the latter. Hence, the outer parallelogram should enclose the inner one as tightly as possible.

2. The second step takes the middle part of the figure to the right part. It computes a linear transformation, represented by matrix R, to transform the outer parallelogram to a rectangle. This transformation in effect shifts each row leftward by the row index (i.e., row 0 at the top is not moved, row 1 is shifted left one column, and so on). With this transformation, the horizontal sides remain horizontal, whereas the slanted sides are made vertical by the horizontal movement. The resulting expanded array has constant array bounds. Thus, it can be handled traditionally.

The properties of the additional transformation R have important implications. First, because it is linear, it can be combined with the affine functions that represent other inter- mediate steps of computing the location of the accessed element from the loop variables. No extra indexing overhead is required.

Second, by only shifting rows horizontally, R does not jeopardize the previous trans- formation T chosen to improve spatial locality. Earlier on, we have described how to choose T so that the innermost loop goes through the array index space horizontally, rather than diagonally, as the arrows in the figure show. Since R shifts rows horizontally, it keeps a horizontal direction horizontal and thus preserves the effect of T.

Finally, it should be noted that we have not eliminated unused memory. We still allocate but do not use the memory corresponding to the two lightly shaded triangles in the right part of Figure 4.3. In this example, the expanded array has 53,601 elements (1051 rows by 51 columns), of which about 5 percent are not used. In general, at most half the

allocated memory is unused in the two-dimensional case. A bound for any number of dimensions is proved in Appendix A.

4.4 Computing the Linearization Vector

Having seen how a specific example is handled, we can now more easily discuss the full details of the general algorithm. We have to find a linearization vector v, for the restruc- tured array, that meets the following requirements.

• v is integral. This facilitates address computation and the associated compiler opti- mizations.

• v represents a row-major storage order — the storage order of the restructured array assumed throughout the earlier analysis. In other words, if transformed index vectors y and $y'$ are both within the transformed array bounds and y is lexicographically less than² $y'$, then $vy < vy'$. This subsumes the obvious requirement that two different vectors within the bounds must lead to different offsets, because one of these two vectors must be lexicographically less than the other.

Among vectors meeting these requirements, we naturally want to find one that minimizes the amount of allocated memory. The linearization vector given by (4.3) satisfies both requirements and uses the minimum amount of memory. However, as noted earlier, this formula applies only to constant array bounds, while our transformed bounds may not be constant. Therefore, we need a more general way of computing the linearization vector.

Our overall approach is to relax the transformed bounds so that the resulting bounds, which we call the relaxed bounds, are in a special form for which we know how to compute the linearization vector.

2. A vector is lexicographically positive (negative) if its leading nonzero is positive (negative). A vector y is lexicographically less than, or before, a vector $y'$ if $y - y'$ is lexicographically negative. This means that in the first dimension where the two vectors differ, the component of y is less than that of $y'$. "Lexicographically greater than" and "lexicographically after" are defined similarly.

[Road map: transformed bounds → augmented bounds (Section 4.5.3) → center of symmetry (Section 4.5.4, Equation (4.24)) → relaxed bounds (Section 4.5.5, Equation (4.30)) → linearization vector (Sections 4.4.1 and 4.4.2, Equation (4.7) and Figure 4.5).]

Figure 4.4: Computing the Linearization Vector: A Road Map

In geometrical terms, we look for a special form of polyhedron to enclose the polyhedron representing the transformed bounds. An unsatisfactory application of this approach is the "dashed box" solution attempted in Section 4.2: we relaxed the transformed bounds directly to a set of constant bounds, but found it too wasteful in the use of memory.

This section and the next present a much better solution. Figure 4.4 outlines the whole process and indicates where related discussion is. In this section, we first specify the form of the relaxed bounds and explain why it makes computing the linearization vector easier. Then, we describe the algorithm to calculate the linearization vector and argue that it produces a vector meeting our requirements. This roughly corresponds to the second step (i.e., transforming the outer parallelogram and then linearizing the resulting rectangle in Figure 4.3) in the previous section's example. The issue of computing the relaxed bounds themselves from the transformed bounds is left to the next section. It roughly corresponds to the first step (i.e., finding an enclosing parallelogram with a pair of horizontal sides) of the example. The discussion is organized in this way because knowing how the relaxed bounds are used is crucial to understanding why they are computed in the way to be described later.

4.4.1 Relaxed Bounds

We have defined the relaxed bounds as the result of relaxing the transformed bounds to a special form that facilitates computing the linearization vector. We now specify exactly what that form is and explain how it helps the calculation of the linearization vector.

Intuitively, we need a special form such that the relaxed bounds in the transformed index space, which are not constant but in this form, can be further transformed through another linear transformation to bounds that are constant. (As Section 4.3 explains, in two dimensions, geometrically this special form is “a parallelogram with a pair of horizontal sides.”) With the resulting bounds, we can compute a linearization vector in the traditional way (since those bounds are constant) and combine this vector with all prior transforma- tions by composing the corresponding affine functions. Furthermore, we require that the extra transformation here preserve the locality improvement effected by the previous transformation.

Formally, we require that the relaxed bounds — a set of inequalities in the transformed array indices — can be expressed in vector form as

$$ r_l \le Ry \le r_u \qquad (4.7) $$

where R is an $m \times m$ matrix required to be lower-triangular with a unit diagonal (i.e., all diagonal elements must be 1), $r_l$ and $r_u$ are m-dimensional vectors, and y denotes the transformed index vector as before. For the example in Figure 4.3, the relaxed bounds are represented by the shaded, outer parallelogram in the middle, and the transformed bounds

by the deeply shaded, inner parallelogram. The special form stipulated here translates into the requirement for “a pair of horizontal sides” in that example. See Section 4.5.1 for the geometrical implications in higher-dimensional spaces.

If the relaxed bounds are in this form, we can transform the transformed index vector y again, with the matrix R, to another vector space so that the resulting bounds in this space are constant, allowing them to be handled traditionally. Formally, we can define

$$ z = Ry \qquad (4.8) $$

In Figure 4.3, this transformation corresponds to further transforming the outer parallelogram to a rectangle. Substituting (4.8) into (4.7) yields the bounds on z, which are

$$ r_l \le z \le r_u \qquad (4.9) $$

In Figure 4.3, $r_l$ and $r_u$ represent the four sides of the rectangle on the right. We might then compute a linearization vector, denoted w, for these constant bounds in the traditional manner. The linearization vector v, which takes a transformed index vector y to an element's offset into the restructured array, is therefore equal to $wR$.

Because of the form R is required to have, the extra transformation preserves the effect of the transformation applied earlier for better locality: since R is lower-triangular with a unit diagonal, left-multiplying an access matrix by it changes neither the height nor the value of the top nonzero of any access matrix column. In this sense, prior locality optimizations are preserved. In particular, an access matrix column representing a unit stride through memory consists of leading zeros and a single 1 or -1 in the last dimension, and this is not affected by the application of R.

4.4.2 Computing Linearization Vector from Relaxed Bounds

Computing the linearization vector v from the relaxed bounds in (4.7) is in fact more complicated than just described because we cannot assume that R, $r_l$, and $r_u$ are always integral. If R is not integral, the vector $z = Ry$ may be non-integral for an integral transformed index vector y. Thus, restructured array elements may be identified by non-integral vectors z within the bounds $r_l \le z \le r_u$. Because of this and possibly non-integral $r_l$ and $r_u$, strictly speaking the linearization vector w cannot be computed traditionally after all.

For this reason, instead of first transforming the relaxed bounds to constant bounds and then finding w with the formula (4.1), we integrate the two steps in one algorithm.

Whether R, $r_l$, and $r_u$ are integral or not, we compute a vector w such that the product $v = wR$ is integral, even though w itself may or may not be integral. The final result is v — the linearization vector taking transformed index vectors y to offsets in the restructured array. Since all the intermediate steps of addressing an element given the original array indices are eventually merged into a single affine function anyway, all that matters is that $vT = wRT$ is integral; non-integral w or R is not a problem.

The algorithm is shown in Figure 4.5. We discuss this algorithm in the remainder of this subsection. We show that the linearization vector v it computes does meet the require- ments stipulated at the beginning of this section. Furthermore, the computed v, to a good approximation, requires the least amount of memory among all vectors that also satisfy those requirements. Finally, we show that the algorithm reduces to the traditional method in the simple case of constant bounds.

First, the algorithm guarantees that v is integral. To see this, consider each component of v in turn. From $v = wR$, we have

$$ v_j = \sum_{k=1}^{m} w_k R_{kj} = w_j + \sum_{k=j+1}^{m} w_k R_{kj} \qquad (4.10) $$

because R is lower-triangular with a unit diagonal. For $j = m$, we know that $v_m$ is equal to $w_m$, which is set to 1 in the algorithm. Thus, $v_m$ is an integer. As for $1 \le j < m$, when the algorithm chooses $w_j$, sum is precisely the rightmost summation in the above equation.

R is an m × m lower-triangular matrix
r_l and r_u are m-dimensional column vectors
v and w are m-dimensional row vectors

w_m = 1
FOR j = m-1 DOWNTO 1 DO
    at_least = (r_{u,j+1} - r_{l,j+1} + 1) * w_{j+1}
    sum = Σ_{k=j+1..m} w_k R_{kj}
    w_j = smallest number >= at_least whose fractional part is ceiling(sum) - sum
ENDFOR
v = w R

Figure 4.5: Computing Linearization Vector from Relaxed Bounds
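The following is our own direct transcription of the Figure 4.5 algorithm, using floating-point arithmetic for brevity (a production version would use exact rationals); the function name and the example data are ours.

```python
import math
import numpy as np

def linearization_from_relaxed(R, r_l, r_u):
    """Compute w and v = wR for relaxed bounds r_l <= R y <= r_u (Figure 4.5)."""
    m = len(r_l)
    w = np.zeros(m)
    w[m - 1] = 1.0
    for j in range(m - 2, -1, -1):                   # j = m-1 downto 1, zero-based
        at_least = (r_u[j + 1] - r_l[j + 1] + 1) * w[j + 1]
        s = sum(w[k] * R[k][j] for k in range(j + 1, m))
        frac = math.ceil(s) - s                      # required fractional part of w_j
        w[j] = math.floor(at_least) + frac
        if w[j] < at_least:
            w[j] += 1.0                              # smallest value >= at_least with that fraction
    return w @ np.array(R), w                        # v = wR is integral

# Running example (relaxed bounds of Figure 4.6(c)):
R = [[1, 0], [-1, 1]]
v, w = linearization_from_relaxed(R, [0, -50], [1050, 0])
# w = (51, 1), v = wR = (50, 1); memory needed is roughly
# w (r_u - r_l) + 1 = 51*1050 + 1*50 + 1 = 53,601 elements, matching Section 4.2.
```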

Choosing $w_j$, which can be shown to be positive by induction on j, such that its fractional part is ceiling(sum) - sum ensures that

$$ w_j - \lfloor w_j \rfloor = \left\lceil \sum_{k=j+1}^{m} R_{kj} w_k \right\rceil - \sum_{k=j+1}^{m} R_{kj} w_k \qquad (4.11) $$

and hence

$$ v_j = w_j + \sum_{k=j+1}^{m} w_k R_{kj} = \lfloor w_j \rfloor + \left\lceil \sum_{k=j+1}^{m} R_{kj} w_k \right\rceil \qquad (4.12) $$

which is an integer because both terms in the rightmost expression are results of floor and ceiling functions. Hence, the linearization vector v is integral.

Second, if y and $y'$ are both within the relaxed bounds and y is lexicographically less than $y'$, then $vy < vy'$. Since $vy' - vy = wRy' - wRy = w(Ry' - Ry)$, we start justifying this claim by considering the possible ranges for the components of $Ry' - Ry$, which is denoted d from now on.

• The leading nonzero of d is at least 1. Since R is lower-triangular, the leading nonzero of $d = R(y' - y)$ is in the same dimension as that of $y' - y$. In fact, the two leading nonzeros are equal since all diagonal elements of R are 1. As $y' - y$ is lexicographically positive, its leading nonzero is positive. Moreover, as $y' - y$ is integral, this leading nonzero is at least 1.

• $d_j$ is at least $-(r_{uj} - r_{lj})$. Since y and $y'$ are within the relaxed bounds, the j-th components of $Ry$ and $Ry'$ are both between $r_{lj}$ and $r_{uj}$ (see (4.7) on page 75). Their difference, $d_j$, cannot be less than $-(r_{uj} - r_{lj})$.

Assume that the leading nonzero of d is in the l-th dimension. We would have

$$ vy' - vy = wRy' - wRy = wd = w_l d_l + \sum_{k=l+1}^{m} w_k d_k \qquad (4.13) $$

Noting the two previous observations on the ranges for $d_l$ and $d_k$, we see that

$$ w_l d_l + \sum_{k=l+1}^{m} w_k d_k \;\ge\; w_l - \sum_{k=l+1}^{m} (r_{uk} - r_{lk}) w_k \qquad (4.14) $$

The right-hand side must be at least $w_m$ because the algorithm always chooses $w_j$ to be at least $(r_{u,j+1} - r_{l,j+1} + 1) w_{j+1}$. Specifically, by repeated substitution

$$ \begin{aligned} w_j &\ge (r_{u,j+1} - r_{l,j+1}) w_{j+1} + w_{j+1} \\ w_j &\ge (r_{u,j+1} - r_{l,j+1}) w_{j+1} + (r_{u,j+2} - r_{l,j+2}) w_{j+2} + w_{j+2} \\ &\;\;\vdots \\ w_j &\ge \sum_{k=j+1}^{m} (r_{uk} - r_{lk}) w_k + w_m \end{aligned} \qquad (4.15) $$

Combining the equation (4.13) and inequalities (4.14) and (4.15), we thus conclude that $vy' - vy \ge w_m = 1 > 0$.

Thus, we see that the algorithm computes a linearization vector that has the required properties. Among the many vectors having these properties, we want one that minimizes the amount of memory allocated. To this end, the algorithm chooses the smallest allowable value for every $w_j$. In fact, selecting the smallest possible $w_j$ even has the side-effect of decreasing the value that each subsequently chosen $w_k$ ($k = j-1, \ldots, 1$) is required to exceed (i.e., the value of at_least).

Although choosing the smallest possible value for each $w_k$ does not guarantee the minimum amount of allocated memory, it does tend to reduce that amount because the restructured array needs a contiguous memory region for, roughly,

$$ w(r_u - r_l) + 1 = \sum_{k=1}^{m} w_k (r_{uk} - r_{lk}) + 1 \qquad (4.16) $$

elements. We now justify this approximation. Since the elements are laid out in the lexicographic order of their index vectors (in short, row-major order), we need memory for $vy_{last} - vy_{first} + 1$ elements, where $y_{first}$ and $y_{last}$ are respectively the lexicographically least and greatest vectors within the transformed bounds and thus identify the first and last elements of the restructured array. This size equals $w(Ry_{last} - Ry_{first}) + 1$.

To see that this is approximated by (4.16), we relate the bracketed expression to $r_l$ and $r_u$. Because R is lower-triangular with a unit diagonal, $R(y' - y)$ is lexicographically positive if $(y' - y)$ is: the leading nonzero of $R(y' - y)$ is in the same dimension as that of $(y' - y)$ and is positive because it equals the latter, which is positive. In other words, if y is lexicographically less than $y'$, the same is true of $Ry$ and $Ry'$. Hence, $Ry_{first}$ and $Ry_{last}$ are respectively the lexicographically least and greatest in the set of vectors $Ry$ such that y is integral and within the relaxed bounds (4.7). If we do not insist on y being integral, the set of vectors $Ry$ such that y is within the relaxed bounds (in the form $r_l \le Ry \le r_u$) is simply the set of vectors z such that $r_l \le z \le r_u$. The lexicographically least and greatest among them are $r_l$ and $r_u$ respectively. Thus, by dropping the integrality condition, we informally arrive at the above approximation:

$$ w(Ry_{last} - Ry_{first}) + 1 \approx w(r_u - r_l) + 1 \qquad (4.17) $$

Finally, note that although our algorithm is more general than the traditional method, it reduces to the latter in the case of constant bounds. For constant bounds, $R = I$ and hence $v = w$. The algorithm always chooses $w_m$ to be 1. As for $1 \le j < m$, notice that sum is always zero because $R_{kj} = 0$ for $k \ne j$. Therefore, the algorithm simply chooses for $w_j$ the smallest integer greater than or equal to at_least. The value of at_least is itself an integer because $r_{l,j+1}$ and $r_{u,j+1}$, being simply the array bounds, are integers. Thus, $w_j$ is simply chosen to be the value of at_least. To sum up, we have

$$ \begin{aligned} v_m &= w_m = 1 \\ v_j &= w_j = (r_{u,j+1} - r_{l,j+1} + 1) w_{j+1} \qquad (1 \le j < m) \end{aligned} \qquad (4.18) $$

These recursive equations lead to the closed-form formula in (4.3).

4.4.3 Summary

We have discussed how we compute a linearization vector that satisfies the requirements at the beginning of this section. Essentially, we relax the transformed array bounds so that the resulting bounds have the form specified in (4.7). This allows us to compute a linearization vector that preserves the locality improvement brought about by the earlier index transformation. Figure 4.5 shows our algorithm to do this. The algorithm guarantees that the linearization vector is integral and represents a row-major storage order. It reduces to the traditional method in the simple case of constant array bounds. Moreover, it minimizes the approximate amount of memory required for the restructured array — for the given set of relaxed bounds. In the next section, we discuss how the relaxed bounds themselves are calculated from the transformed bounds.

4.5 Relaxing the Transformed Array Bounds

The relaxed bounds correspond to an m-dimensional parallelepiped that encloses the polyhedron representing the transformed array bounds. We call the former the enclosing parallelepiped and the latter the image polyhedron. As discussed in Section 4.4.1, the relaxed bounds must be in the form (4.7) so that, roughly speaking, they can be linearly transformed to constant bounds without nullifying the benefit of the transformation chosen in Chapter 3. In geometrical terms, we require that the enclosing parallelepiped can be transformed to an m-dimensional rectilinear region; not all parallelepipeds are acceptable.

This section concerns computing the relaxed bounds. Again, Figure 4.4 on page 74 shows how different parts of our discussion fit together. We first elaborate on requirements for the relaxed bounds and, equivalently, the enclosing parallelepiped (Section 4.5.1). Then, our solution is presented. As we shall see, it applies to polyhedra symmetric about a center (Section 4.5.2). These include all parallelepipeds and hence any image polyhedron we may get by transforming the original bounds. The algorithm consists of three steps.

1. Compute the augmented bounds — a set of inequalities that is equivalent to the transformed bounds but contains additional information (Section 4.5.3).

2. Compute the center of symmetry of the image polyhedron (Section 4.5.4).

3. Compute the relaxed bounds by selecting a subset from the augmented bounds and supplementing it with other necessary bounds constructed from it (Section 4.5.5).

Although the presentation is mostly in abstract terms for the general m-dimensional case, we refer to the two-dimensional example in Figure 4.6 for concrete illustration whenever appropriate. The transformed bounds and the equivalent image polyhedron are in part (a). They are taken from the right half of Figure 4.1 on page 68. To simplify phrasing and explain concepts in whichever terminology seems most intuitive, we may occasionally refer interchangeably to a set of bounds and its corresponding polyhedron or parallelepiped, or to a vector and its corresponding point.

(a) Transformed bounds and image polyhedron:
    y1 >= y2,  y1 <= y2 + 50,  y2 >= 0,  y2 <= 1000
    i.e.  [1 -1; -1 1; 0 1; 0 -1] y >= (0, -50, 0, -1000)

        (run Fourier-Motzkin, compute center)

(b) Augmented bounds, with the center of symmetry marked:
    y1 >= 0,  y1 <= 1050,  y2 >= 0,  y2 <= 1000,  y2 >= y1 - 50,  y2 <= y1
    i.e.  [1 0; -1 0; 0 1; 0 -1; -1 1; 1 -1] y >= (0, -1050, 0, -1000, -50, 0)

        (select hyperplanes, build mirror images)

(c) Relaxed bounds and enclosing parallelepiped:
    y1 >= 0,  y1 <= 1050,  y2 >= y1 - 50,  y2 <= y1
    i.e.  (0, -50) <= [1 0; -1 1] y <= (1050, 0)

Figure 4.6: Relaxing Transformed Array Bounds

4.5.1 Relaxed Bounds Revisited

We have seen in (4.7) that the relaxed bounds must be in the form $r_l \le Ry \le r_u$, where R is lower-triangular with a unit diagonal. Part (c) of Figure 4.6 demonstrates bounds of this form. In scalar terms, the j-th array index, $y_j$, is bounded by a pair of affine functions in the preceding indices (i.e., $y_1, \ldots, y_{j-1}$) that differ only in the constant term:

$$ y_j \ge \sum_{k=1}^{j-1} -R_{jk} y_k + r_{lj} \qquad\qquad y_j \le \sum_{k=1}^{j-1} -R_{jk} y_k + r_{uj} \qquad (4.19) $$

Geometrically, the enclosing parallelepiped is delineated by m pairs of parallel hyperplanes, one pair for each array dimension. The two hyperplanes for the j-th dimension are defined by (4.19). They must be parallel to the (j+1)-st through m-th dimensions but not the j-th (they may or may not be parallel to the first $j-1$ dimensions). In the case of two dimensions (the first vertical and the second horizontal), this means that we are looking for a parallelogram delineated by a pair of parallel lines ($j = 1$) that are parallel to the second dimension but not the first (in short, horizontal) and another parallel pair ($j = 2$) that are not parallel to the second dimension (in short, vertical or slanted). In other words, we want a parallelogram with a pair of horizontal sides — just as discussed in Section 4.3. In the case of three dimensions (with the third dimension perpendicular to this page of paper), this parallelogram "grows out from the page toward the reader," and the resulting "pillar," whose cross-section is a parallelogram with a pair of horizontal sides, is limited by a third pair of parallel planes ($j = 3$) that are not perpendicular to this page (i.e., not parallel to the third dimension pointing out from the page).

Furthermore, we want the relaxed bounds to be tight. Specifically, we want to minimize the difference between $r_l$ and $r_u$ in each dimension so that the amount of memory required is minimized. We have seen in (4.16) on page 80 that the restructured array needs memory for roughly $\sum_{k=1}^{m} w_k (r_{uk} - r_{lk}) + 1$ elements. Minimizing $r_{uk} - r_{lk}$ helps to reduce this summation in two ways: it reduces the term $w_k (r_{uk} - r_{lk})$ directly ($w_k$ is the same whatever $r_{uk} - r_{lk}$ might be); it also reduces the terms $w_j (r_{uj} - r_{lj})$ for $1 \le j < k$ indirectly by decreasing the value that $w_j$ must exceed in the algorithm in Figure 4.5 on page 78.

To sum up, the requirements on the relaxed bounds and, equivalently, the enclosing parallelepiped are as follows.

• Form. The relaxed bounds and enclosing parallelepiped have the form specified in (4.7), discussed in detail above.

• Inclusion. Any vector satisfying the transformed bounds computed in Section 4.1 also satisfies the relaxed bounds. In geometrical terms, the enclosing parallelepiped must, needless to say, enclose the image polyhedron.

• Tightness. The relaxed bounds and enclosing parallelepiped are tight. In algebraic terms, $r_{uj} - r_{lj}$ is minimized. In geometrical terms, each pair of parallel faces of the enclosing parallelepiped is as close together as possible.

4.5.2 Symmetric Image Polyhedron

Our algorithm for finding the relaxed bounds applies to convex polyhedra symmetric about a point. By symmetry, we mean that there exists a point c (the center of symmetry or, simply, the center) such that if any point y is in the polyhedron, its mirror image 2c– y (the point “on the opposite side of the center”) is also in the polyhedron. All three poly- gons in Figure 4.6 exhibit this symmetry.

Our algorithm always applies to the transformed bounds because the image polyhedron is a parallelepiped (we justify this shortly) and any parallelepiped is symmetric in the sense above. The center of a parallelepiped is the intersection of the hyperplanes exactly halfway between pairs of parallel faces of the parallelepiped. For example, in each polygon in Figure 4.6, the center can be obtained by intersecting the line halfway between the two slanted sides with the line halfway between their horizontal or vertical counterparts.

In the case of transformed array bounds, we can see directly that the image polyhedron is symmetric. We know that the original array indices lie within the original array bounds:

$$ b_l \le x \le b_u \qquad (4.20) $$

where the vectors $b_l$ and $b_u$ are respectively the lower and upper bounds of the original array. With these, we define

$$ c = \tfrac{1}{2} T (b_l + b_u) \qquad (4.21) $$

This must be the center of the image polyhedron. For any point y in the image polyhedron, there is an x satisfying the original bounds (4.20) such that $y = Tx$. The mirror image of y about the center c is

$$ 2c - y = T(b_l + b_u) - Tx = T(b_l + b_u - x) \qquad (4.22) $$

Since x satisfies the original bounds (4.20), so does $b_l + b_u - x$. Therefore, $2c - y$ is also in the image polyhedron.
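As a concrete check (our own arithmetic for the example of Figure 4.1, where $b_l = (0, 0)$ and $b_u = (50, 1000)$), (4.21) gives

$$ c = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 50 \\ 1000 \end{pmatrix} = \begin{pmatrix} 525 \\ 500 \end{pmatrix} $$

which is the point marked as the center in parts (b) and (c) of Figure 4.6.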

Having verified that the image polyhedron is symmetric, we discuss in the next few subsections an algorithm to find the relaxed bounds for any polyhedron having this kind of symmetry.

4.5.3 Computing Augmented Bounds

Using the Fourier-Motzkin algorithm [Schrijver 1986], the algorithm first of all computes the augmented bounds — a set of inequalities that is equivalent to the transformed bounds but contains more information. In geometrical terms, the Fourier-Motzkin algorithm projects the image polyhedron onto successively lower-dimensional subspaces formed by the first j dimensions (for j from $m-1$ down to 1). The resulting inequalities are divided into m subsets. The j-th subset delineates the projection of the image polyhedron onto the first j dimensions. Since inequalities in this subset are within the subspace of the first j dimensions, they involve $y_j$, possibly $y_1, \ldots, y_{j-1}$, but never $y_{j+1}, \ldots, y_m$. In other words, the coefficient for $y_j$ is not zero, those for $y_1, \ldots, y_{j-1}$ may be, and those for $y_{j+1}, \ldots, y_m$ always are. Part (b) of Figure 4.6 illustrates the augmented bounds. The first pair of inequalities, as well as the pair of horizontal lines it corresponds to, delineates the projection of the image polyhedron onto the first dimension $y_1$. The second and third pairs of inequalities are the transformed bounds, rewritten as lower and upper bounds on $y_2$ in terms of $y_1$.
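For readers who want to see the projection step spelled out, the following is a bare-bones sketch of one Fourier-Motzkin elimination step; it is our own illustration (names and structure are assumptions), not the prototype's implementation, and it does not remove redundant inequalities.

```python
from itertools import product
import numpy as np

def eliminate_last(A, b):
    """Project {y : A y >= b} onto the first m-1 dimensions by eliminating y_m."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    keep  = [(r, c) for r, c in zip(A, b) if r[-1] == 0]
    lower = [(r, c) for r, c in zip(A, b) if r[-1] > 0]     # give lower bounds on y_m
    upper = [(r, c) for r, c in zip(A, b) if r[-1] < 0]     # give upper bounds on y_m
    new = []
    for (rl, cl), (ru, cu) in product(lower, upper):
        # positive combination that cancels the coefficient of y_m
        new.append((rl * (-ru[-1]) + ru * rl[-1], cl * (-ru[-1]) + cu * rl[-1]))
    rows = keep + new
    return np.array([r[:-1] for r, _ in rows]), np.array([c for _, c in rows])

# For Figure 4.6(a): eliminating y2 from the transformed bounds yields
# y1 >= 0 and y1 <= 1050 (plus trivially true inequalities), i.e. the first
# pair of augmented bounds in Figure 4.6(b).
A, b = eliminate_last([[1, -1], [-1, 1], [0, 1], [0, -1]], [0, -50, 0, -1000])
```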

Divided into such subsets, the augmented bounds can be written in the following form. For $1 \le j \le m$, the j-th subset gives the bounds on $y_j$ in terms of the preceding vector components $y_1, \ldots, y_{j-1}$:

$$ l_j(y_1, \ldots, y_{j-1}) \le y_j \le u_j(y_1, \ldots, y_{j-1}) \qquad (4.23) $$

where $l_j(\cdot)$ ($u_j(\cdot)$) is the lower (upper) bound on $y_j$ and is the maximum (minimum) of one or more affine functions in $y_1, \ldots, y_{j-1}$. (The bounds on $y_1$ are constants.) For example, in Figure 4.6(b), $y_2 \ge 0$ and $y_2 \ge y_1 - 50$ together are written as $y_2 \ge \max(0, y_1 - 50)$. Notice that this form of writing the bounds begins to resemble the form required by (4.19) on page 84 in Section 4.5.1. We would get the required form if we could somehow replace $l_j(\cdot)$ and $u_j(\cdot)$ by suitable affine functions in the same arguments, namely $y_1, \ldots, y_{j-1}$. The next few sections aim to do precisely this. In later discussion, we will repeatedly use the following property of the augmented bounds.

Lemma 4.1: For any $y_1, \ldots, y_j$ satisfying the augmented bounds for the first j dimensions, there exist $y_{j+1}, \ldots, y_m$ such that all m components of y satisfy the entire set of augmented bounds.

This property is a known result of the Fourier-Motzkin algorithm [Schrijver 1986]; we do not attempt to justify it on our own. Its converse is trivially true: for any $y_1, \ldots, y_m$ satisfying all the augmented bounds, the first j components of course satisfy the bounds for the first j dimensions. For instance, in part (b) of Figure 4.6, the two thick horizontal lines are augmented bounds for the first dimension. If they were too close together, the augmented bounds would not be equivalent to the transformed bounds because some vectors within the latter would be excluded by the former. If they were too far apart, there would be values of $y_1$ between them for which there is no $y_2$ such that the complete vector y lies within the image polyhedron.

In summary, the Fourier-Motzkin algorithm finds the augmented bounds, which have the form (4.23) where the bounds for a dimension are functions of preceding dimensions.

4.5.4 Computing Center of Symmetry

Next, we compute the image polyhedron’s center of symmetry. In fact, this can be done directly from the transformed bounds without first computing the augmented bounds, as we have alluded to when arguing in Section 4.5.2 that the image polyhedron is symmetric. We simply apply the index transformation T to the center of the rectilinear region for the original bounds, which is obtained by averaging the lower and upper bounds of each dimension. However, this section presents a more tortuous but also more general method that applies even when the bounds of the restructured array are not simply the image of the original array bounds — a situation arising from partial restructuring, which will be dis- cussed in a later chapter.

With the augmented bounds in (4.23), we can compute the image polyhedron's center of symmetry c recursively as follows:

$$ c_j = \tfrac{1}{2} \left[ l_j(c_1, \ldots, c_{j-1}) + u_j(c_1, \ldots, c_{j-1}) \right] \qquad j = 1, \ldots, m \qquad (4.24) $$

where $l_j(\cdot)$ and $u_j(\cdot)$ are as defined in (4.23). We show below that c is indeed a center of symmetry. In fact, it is the center because any center would be equal to c.
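As an illustration we add here (it is not part of the original argument), applying the recursion to the augmented bounds of Figure 4.6(b) gives

$$ c_1 = \tfrac{1}{2}(0 + 1050) = 525, \qquad c_2 = \tfrac{1}{2}\bigl[\max(0,\, c_1 - 50) + \min(1000,\, c_1)\bigr] = \tfrac{1}{2}(475 + 525) = 500, $$

in agreement with applying T to the center of the original bounds as in (4.21).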

Lemma 4.2: Suppose c is defined as in (4.24) and $c'$ is a center of symmetry of the image polyhedron. Then $c_j = c'_j$ and $l_j(c_1, \ldots, c_{j-1}) \le c_j \le u_j(c_1, \ldots, c_{j-1})$ for all $1 \le j \le m$.

Proof: This can be justified by induction on j. Assume that it is true for all k such that $1 \le k < j$. We argue that it would then be true for j as well. (The case for $j = 1$ can follow exactly the same argument since the assumption is trivially true for $j = 1$.) Our strategy is to show that $c_j$ can be neither greater nor less than $c'_j$ and therefore must equal $c'_j$.

First, assume for the sake of argument that $c_j > c'_j$. Because $c_1, \ldots, c_{j-1}$ satisfy the bounds for the first $j-1$ dimensions, by Lemma 4.1 there exists a vector $\gamma$ such that the first $j-1$ components of $\gamma$ are equal to those of c and $\gamma$ satisfies all the augmented bounds. In particular,

$$ l_j(c_1, \ldots, c_{j-1}) \le \gamma_j \le u_j(c_1, \ldots, c_{j-1}) \qquad (4.25) $$

It follows that $u_j(c_1, \ldots, c_{j-1}) \ge l_j(c_1, \ldots, c_{j-1})$. Hence, $c_1, \ldots, c_{j-1}, u_j(c_1, \ldots, c_{j-1})$ satisfy the bounds for the first j dimensions. Again by Lemma 4.1 there exists a point z in the image polyhedron such that $z_1 = c_1$, ..., $z_{j-1} = c_{j-1}$, and $z_j = u_j(c_1, \ldots, c_{j-1})$. Let $z'$ be the mirror image of z about $c'$. Being the mirror image of z, $z'$ also satisfies the augmented bounds. The j-th component of $z'$ would then be

$$ \begin{aligned} z'_j &= 2c'_j - u_j(c_1, \ldots, c_{j-1}) \\ &= 2c_j - u_j(c_1, \ldots, c_{j-1}) + 2(c'_j - c_j) \\ &= l_j(c_1, \ldots, c_{j-1}) + 2(c'_j - c_j) \end{aligned} \qquad (4.26) $$

(The last line is obtained by substituting the definition of $c_j$ in (4.24).) This is less than $l_j(c_1, \ldots, c_{j-1})$ because we assume that $c_j > c'_j$. Therefore, $z'$ violates the augmented bounds for the j-th dimension — a contradiction — and the tentative assumption that $c_j > c'_j$ must be false.

The possibility that $c_j < c'_j$ is similarly rejected by considering a point z such that $z_j = l_j(c_1, \ldots, c_{j-1})$ and arguing that the j-th component of its mirror image exceeds $u_j(c_1, \ldots, c_{j-1})$. Hence, the only possibility is that $c_j = c'_j$.

Finally, note that $l_j(c_1, \ldots, c_{j-1}) \le c_j \le u_j(c_1, \ldots, c_{j-1})$ because $c_j$ is the average of $l_j(c_1, \ldots, c_{j-1})$ and $u_j(c_1, \ldots, c_{j-1})$ (see (4.24)), and $u_j(c_1, \ldots, c_{j-1}) \ge l_j(c_1, \ldots, c_{j-1})$ (see (4.25)). This completes the proof.

4.5.5 Computing Relaxed Bounds

Having found the image polyhedron’s center of symmetry by (4.24), we are ready to com- pute the relaxed bounds and, equivalently, the enclosing parallelepiped. As before, we describe the algorithm first and then argue that what it computes meets our requirements.

Algorithm

For j ranging from 1 to m, we evaluate at $c_1, \ldots, c_{j-1}$ the affine functions making up the lower bound function $l_j(\cdot)$ and select the one producing the maximum value. Suppose this affine function has coefficients $F_{jk}$ and offset $f_j$; in other words, it is given by $\sum_{k=1}^{j-1} F_{jk} y_k + f_j$. We include the following inequality in the relaxed bounds:

$$ y_j \ge \sum_{k=1}^{j-1} F_{jk} y_k + f_j \quad\Leftrightarrow\quad y_j - \sum_{k=1}^{j-1} F_{jk} y_k \ge f_j \qquad (4.27) $$

In Figure 4.6(c), these inequalities (one per dimension) are listed on the left and correspond to the two black lines in the index space. They are selected from the augmented bounds in Figure 4.6(b); the striped lines are omitted. We write them in vector form as

$$ (I - F) y \ge f \qquad (4.28) $$

where F is a lower-triangular matrix with a zero diagonal and whose elements in the lower triangle are $F_{jk}$, as illustrated by the first vector inequality in Figure 4.6(c). To this set of m inequalities, we add its mirror image about the center:

$$ (I - F) y \le 2 (I - F) c - f \qquad (4.29) $$

These correspond to the two grey lines. Combining the two sets of inequalities (4.28) and (4.29), we obtain the full set of relaxed bounds:

$$ f \le (I - F) y \le 2 (I - F) c - f \qquad (4.30) $$
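The construction in (4.27) through (4.30) is mechanical once the augmented lower bounds and the center are known. The sketch below is our own illustration for the two-dimensional example of Figure 4.6; the data-structure layout and names are assumptions made for this example.

```python
import numpy as np

# Lower bounds on y_j from the augmented bounds, each as (coefficients over
# y_1..y_{j-1}, constant offset):
lower = {
    1: [([], 0)],                          # y1 >= 0
    2: [([0], 0), ([1], -50)],             # y2 >= 0,  y2 >= y1 - 50
}
c = np.array([525.0, 500.0])               # center of symmetry from (4.24)

m = len(c)
F = np.zeros((m, m))
f = np.zeros(m)
for j in range(1, m + 1):
    # pick the affine lower bound that is largest at (c_1, ..., c_{j-1})
    coeffs, offset = max(lower[j], key=lambda cf: np.dot(cf[0], c[:j - 1]) + cf[1])
    F[j - 1, :j - 1] = coeffs
    f[j - 1] = offset

R = np.eye(m) - F                          # (I - F) y >= f, unit-diagonal lower-triangular
r_l = f
r_u = 2 * R @ c - f                        # mirror image about the center, as in (4.29)
# R = [[1,0],[-1,1]], r_l = (0,-50), r_u = (1050,0): the relaxed bounds of Figure 4.6(c)
```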

These bounds satisfy the three requirements stated in Section 4.5.1: they have the right form, include the transformed bounds, and are tight. We consider each of these in turn.

Form

First, the bounds are in the form required by (4.7) in Section 4.4.1, with

$$ r_l = f, \qquad r_u = 2(I - F)c - f, \qquad R = I - F \qquad (4.31) $$

This follows from the observation that $I - F$ is lower-triangular with a unit diagonal. At this point, we take note of the following lemma on $r_{uj} - r_{lj}$, which is the "distance" between each pair of parallel hyperplanes forming the enclosing parallelepiped. This result will be used in later discussion.

Lemma 4.3: $r_{uj} - r_{lj} = 2c_j - 2l_j(c_1, \ldots, c_{j-1})$ for $1 \le j \le m$.

Proof: When computing the relaxed bounds, we include bounds selected from the augmented bounds according to (4.27). Recall that $l_j(y_1, \ldots, y_{j-1})$ is the maximum of affine functions in $y_1, \ldots, y_{j-1}$, and the algorithm picks the one yielding that maximum at $c_1, \ldots, c_{j-1}$. Therefore, at those values the maximum and the selected affine function have the same value:

$$ l_j(c_1, \ldots, c_{j-1}) = \sum_{k=1}^{j-1} F_{jk} c_k + f_j \qquad (4.32) $$

Substituting this into the right-hand side of this lemma, we get

$$ 2c_j - 2 l_j(c_1, \ldots, c_{j-1}) = 2c_j - 2\left( \sum_{k=1}^{j-1} F_{jk} c_k \right) - 2 f_j \qquad (4.33) $$

which is the j-th component of

$$ 2c - 2Fc - 2f = 2(I - F)c - 2f = r_u - r_l \qquad (4.34) $$

by the definition of $r_l$ and $r_u$ in (4.31) above. In other words, the right-hand side of this lemma equals $r_{uj} - r_{lj}$, as is to be proved.

Inclusion

Second, any vector y that satisfies the transformed bounds also satisfies the relaxed bounds. To see this, consider each half of the relaxed bounds in turn. For the half selected from the augmented bounds, (4.28), we know that any vector y satisfying the transformed bounds also satisfies all the augmented bounds and, of course, any subset of them. For the mirror image half, (4.29), note that the mirror image of y about c satisfies the transformed bounds because of the symmetry. Therefore, it also satisfies (4.28), which implies that y satisfies (4.29):

$$ (I - F)(2c - y) \ge f \quad\Leftrightarrow\quad (I - F) y \le 2(I - F)c - f \qquad (4.35) $$

Tightness

Third, the relaxed bounds are at least as tight as any set of bounds satisfying the two ′ ≤′ ≤ ′ requirements above. Consider an alternative set of bounds:r l R y r u . We claim that ()′ ′ ≥ () ≤ ≤ ()′ ′ ≥ () r uj – r lj ruj – rlj for1 j m or, in vector form,r u – r l ru – rl .

To see this, we select two vectors within the transformed bounds that are mirror images of each other about c. For every j, a different pair is chosen and their “distance” along the j-th dimension is considered. In Figure 4.6(c), the points for $j = 1$ are marked by unfilled squares, and those for $j = 2$ by unfilled triangles. In general, the first $j - 1$ components of both vectors are $c_1, \ldots, c_{j-1}$. The other components are chosen as follows. First, from Lemma 4.2 we know that $c_1, \ldots, c_{j-1}$ satisfy the bounds for the first $j - 1$ dimensions. Therefore, by Lemma 4.1 there exists a vector that has the same first $j - 1$ components and satisfies all the bounds, in particular those for the j-th dimension. Its existence implies that the lower bound for the j-th dimension, $l_j(c_1, \ldots, c_{j-1})$, does not exceed the upper bound, $u_j(c_1, \ldots, c_{j-1})$. It follows that $c_1, \ldots, c_{j-1}, l_j(c_1, \ldots, c_{j-1})$ satisfy the bounds for the first j dimensions. Therefore, again by Lemma 4.1, there exists a vector, say z, within the bounds and whose first j components are $c_1, \ldots, c_{j-1}$, and $l_j(c_1, \ldots, c_{j-1})$. Similarly, we construct the vector $z'$ within the bounds and whose first j components are $c_1, \ldots, c_{j-1}$, and $u_j(c_1, \ldots, c_{j-1})$. In summary,

$z_1 = z'_1 = c_1,\ \ldots,\ z_{j-1} = z'_{j-1} = c_{j-1}, \quad z_j = l_j(c_1, \ldots, c_{j-1}), \quad z'_j = u_j(c_1, \ldots, c_{j-1})$   (4.36)

Since z and $z'$ are within the transformed bounds, they both satisfy the alternative bounds $r'_l \le R'y \le r'_u$. In particular, for the j-th dimension

$r'_{lj} \le z_j + \sum_{k=1}^{j-1} R'_{jk} z_k = l_j(c_1, \ldots, c_{j-1}) + \sum_{k=1}^{j-1} R'_{jk} c_k$
$r'_{uj} \ge z'_j + \sum_{k=1}^{j-1} R'_{jk} z'_k = u_j(c_1, \ldots, c_{j-1}) + \sum_{k=1}^{j-1} R'_{jk} c_k$   (4.37)

Negating the first inequality and adding the second would eliminate the two summations and yield

$r'_{uj} - r'_{lj} \ge u_j(c_1, \ldots, c_{j-1}) - l_j(c_1, \ldots, c_{j-1})$   (4.38)

Substituting into the right-hand side the definition of $c_j$ in (4.24) on page 88 and then applying Lemma 4.3 on page 91, we have

$r'_{uj} - r'_{lj} \ge 2c_j - 2\,l_j(c_1, \ldots, c_{j-1}) = r_{uj} - r_{lj}$   (4.39)

as we wish to prove.

Figure 4.7: Worst-Case Example of Memory Use (the image polyhedron and its enclosing parallelepiped)

In conclusion, the relaxed bounds as given by (4.30) are at least as tight as alternative bounds — in every dimension. They are therefore as efficient in memory use as other bounds of the same form.

Using the algorithm presented so far in this chapter, we allocate for the restructured array at most roughly m! times the memory needed for storing array elements, where m is the number of array dimensions. A proof is given in Appendix A. For two dimensions, the worst case occurs when, for example, the image polyhedron is a square with horizontal and vertical diagonals, as Figure 4.7 illustrates. A worst case in three dimensions is analogous: a cube with diagonals along the three natural, orthogonal dimensions. Such cases are rare in our experience. Access patterns that require skewing of array indices typically occur in banded matrix applications, where the arrays have one long dimension and one or more short dimensions to store a narrow band of matrix elements.

4.5.6 Summary

Our algorithm for computing the relaxed bounds applies to transformed bounds that are symmetric. First, using the Fourier-Motzkin algorithm, the algorithm computes the augmented bounds, which are equivalent to the transformed bounds but divided into bounds for each dimension dependent on preceding dimensions only. Next, we compute the center of symmetry according to (4.24). Finally, the relaxed bounds are constructed from a specially chosen subset of the augmented bounds and its mirror image according to (4.30).

The relaxed bounds thus computed are shown to have the required form, to include all vectors included by the transformed bounds, and to require the least memory among bounds meeting the other two conditions.
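To make the construction concrete, the following is a minimal sketch (in Python, not the dissertation’s implementation) of the relaxed-bounds computation for a two-dimensional example: the “diamond” polyhedron $|y_1| + |y_2| \le 2$ of the kind shown in Figure 4.7. It assumes that (4.24) defines each $c_j$ as the midpoint of $l_j$ and $u_j$ evaluated at $c_1, \ldots, c_{j-1}$, which is consistent with the substitution used after (4.38); the augmented bounds are supplied directly rather than computed by Fourier-Motzkin elimination.

from fractions import Fraction

# Augmented bounds for the diamond |y1| + |y2| <= 2, supplied as, for each
# dimension j, the affine functions whose maximum is l_j and whose minimum
# is u_j.  An affine function is a pair (coeffs, offset) over y_1..y_{j-1}.
lower = [
    [([], Fraction(-2))],                                            # y1 >= -2
    [([Fraction(-1)], Fraction(-2)), ([Fraction(1)], Fraction(-2))], # y2 >= -y1-2, y1-2
]
upper = [
    [([], Fraction(2))],                                             # y1 <= 2
    [([Fraction(-1)], Fraction(2)), ([Fraction(1)], Fraction(2))],   # y2 <= -y1+2, y1+2
]

def evaluate(fn, point):
    coeffs, offset = fn
    return sum(a * p for a, p in zip(coeffs, point)) + offset

# Center of symmetry, assuming (4.24) takes the midpoint dimension by dimension.
c = []
for j in range(len(lower)):
    lj = max(evaluate(fn, c) for fn in lower[j])
    uj = min(evaluate(fn, c) for fn in upper[j])
    c.append((lj + uj) / 2)

# Relaxed bounds (4.27)-(4.30): pick the lower-bound function attaining the
# maximum at c, then add its mirror image about c.
rows_F, f_lower, f_upper = [], [], []
for j in range(len(lower)):
    coeffs, offset = max(lower[j], key=lambda fn: evaluate(fn, c))
    row = list(coeffs) + [Fraction(0)] * (len(lower) - len(coeffs))
    rows_F.append(row)
    f_lower.append(offset)                        # (I - F) y >= f           (4.28)
    mirror = 2 * (c[j] - sum(row[k] * c[k] for k in range(j))) - offset
    f_upper.append(mirror)                        # (I - F) y <= 2(I - F)c - f (4.29)

print("c =", c)                 # [0, 0]
print("F rows =", rows_F)       # strictly lower-triangular part of F
print("f =", f_lower, " f' =", f_upper)
# Result: -2 <= y1 <= 2 and -2 <= y1 + y2 <= 2, a parallelogram that encloses
# the diamond and uses twice (m! for m = 2) its area, as discussed above.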

4.6 Summary

Laying out elements of the restructured array is difficult. The traditional method does not apply because of the generality of our index transformations. We have to find a linearization vector for the restructured array that is integral and represents a row-major storage order — the key assumption in our earlier analysis. To do that, we first compute the bounds of the restructured array by linearly transforming the original array bounds. Then, the transformed bounds are relaxed to a special form amenable to a further transformation that preserves the improvement of previous transformations. Finally, we calculate a linearization vector from the relaxed bounds.

Chapter 5

Partial Restructuring

In the preceding chapters, we have assumed that restructuring is always applied to the entire array. However, a loop may access only a small part of an array, in which case restructuring the whole array is unnecessary. For example, a loop may use only part of a large array that is statically declared to accommodate the maximum problem size. Sometimes, an array access may cover only a subspace of the full array index space, such as a row, column, or diagonal of a two-dimensional array. Also, an access may touch only a subset of regularly spaced elements in an array, such as the elements with even indices.

In this chapter, we discuss restructuring only that part of an array that is accessed by a given loop. In other words, only the accessed elements are placed in the restructured array. Not only does this reduce memory and copying overheads by decreasing the size of the restructured array, it also improves spatial locality further because array elements are stored more compactly in fewer cache lines. We describe how to determine the set of accessed elements and how to extract only those elements into the restructured array — without deviating from the affine framework of Chapter 2. In those cases that our partial restructuring techniques do not handle, we can and do fall back to restructuring the entire array.

Two sets of techniques are presented. The first deals with elements regularly spaced in the index space (like a checkerboard, for instance). The second applies when the loop touches elements in a restricted region of the index space (e.g., only those with indices between 1 and 100).

5.1 Regularly Spaced Elements

This section presents techniques to restructure only elements that are regularly spaced in the index space in the form of a “lattice.” Such access patterns arise from array indices with non-unit strides. We start by extending the transformation framework. This is followed by an algorithm to choose the transformation for a single access and finally techniques for multiple accesses. As in previous chapters, array bounds are not considered for the moment; they are dealt with in the next section.

5.1.1 Extending the Framework

Consider the example in Figure 5.1. We concentrate on the upper half here; the lower half is discussed later in Section 5.1.2. Both the inner and outer loops go through the two-dimensional array at strides of two. Only elements with an even row index ($x_1$) and odd column index ($x_2$) are accessed. To improve locality, not only should we transpose the array so that the inner loop accesses a row of elements, but we also want the restructured array to contain only those accessed elements. The figure shows how this is achieved.

The index transformation is extended from $y = Tx$ to

$y = Tx + t$   (5.1)

where x and y, as before, are the original and transformed index vectors respectively, T is an $m \times m$ transformation matrix, and t an m-dimensional offset vector (m being the number of array dimensions). This transformation departs from earlier discussion in two ways:

• its matrix T is nonsingular — possibly non-integral and hence non-unimodular;

• it has an additional offset vector t, which may also be non-integral.

Because of the non-integral matrix T and offset vector t, some (integral) original index vectors x may be mapped to non-integral vectors in the transformed index space. Being non-integral, the latter cannot identify any element in the restructured array.

Figure 5.1: Restructuring Regularly Spaced Elements (the original loop accesses X[2*i2,2*i1+1]; the transformed loop accesses X2[i1,i2]; the figure shows the array index space, the transformed index space, and the matrices S, H_SA, T, and t for this example)

However, this is not a problem. Such original index vectors correspond to elements that are not accessed anyway; there are no corresponding elements in the restructured array, as it should be. Index vectors for elements that are accessed are always mapped to integral vectors, which do identify the corresponding restructured array elements. In fact, this is precisely the mechanism by which unaccessed elements (whose transformed index vectors would be non-integral) are omitted from the restructured array. We say that an element is selected by the index transformation if its transformed index vector is integral, and omitted otherwise.

Although T or t may be non-integral, index computation can still be done strictly in integer arithmetic since in choosing T and t we ensure that the transformed access has an integral access matrix and offset vector. This is crucial because it means that the usual compiler code generation and optimization techniques continue to apply despite the non-integral index transformation.

The extra offset vector t does not affect the access pattern, but merely shifts it. This neither improves nor degrades locality. Recall that in Section 2.3 we have considered adding an offset vector to our linear array transformation but have decided against it because it does not change the storage order of elements. That conclusion remains true. We add the offset here only to make the transformed array indices integral.

Having introduced the basic idea of restructuring regularly spaced elements, we next consider how the transformation is chosen.

5.1.2 A Single Access

As mentioned above, we extend the index transformation from linear to affine and the transformation matrix from unimodular to nonsingular: $y = Tx + t$. Neither T nor t is necessarily integral, although x and y must be. Since the transformed index vector y is an affine function of the original index vector x, which is itself an affine function of the iteration vector i (i.e., $x = Ai + a$), we can express y directly in terms of i:

$y = Tx + t = (TA)i + (Ta + t)$   (5.2)

The use of affine rather than linear index transformations complicates the transformed array bounds only slightly. Instead of $x = T^{-1}y$, we have $x = T^{-1}(y - t)$. Hence, the original array bounds, written in the form of a vector inequality $Bx \ge b$ (see (4.4) on page 67), are transformed to $(BT^{-1})y \ge (b + BT^{-1}t)$ (cf. (4.6) on page 68). The rest of the linearization procedure in Chapter 4 still applies.

This section discusses what we require of T and t and how to compute them. Only a single array access is considered; the case for multiple accesses is left to the next section.

Requirements on T and t

Our requirement for the transformation consists of two parts that mirror each other. First, any iteration vector i, which is of course integral, must be mapped to an integral transformed index vector y. In other words, any element accessed by some iteration must have an integral transformed index vector and therefore is selected by the index transformation. Related to this, we further require that y can be computed from i using only integer arithmetic. Second, and conversely, any integral transformed index vector y should correspond to one or more integral iteration vectors i. This means that all elements in the restructured array really are accessed by some iterations or, equivalently, all elements selected by the index transformation are accessed by some iteration(s). For example in Figure 5.1, if the transformation were only an array transpose, the element X2[3,1] would store the value of X[1,3], which is not accessed by any iteration because its row index is odd.

To satisfy the two requirements above, we require that

• $TA$ and $Ta + t$ are both integral, and

• $TA$’s Hermite normal form¹ is the identity matrix.

1. For any $m \times n$ matrix of full row rank, say $\Gamma$, there exists an $n \times n$ unimodular matrix U such that $\Gamma U = [H_\Gamma \; 0]$, where the $m \times m$ matrix $H_\Gamma$ is in Hermite normal form (see page 44 for an earlier discussion of Hermite normal form). $H_\Gamma$ is the Hermite normal form of $\Gamma$ [Schrijver 1986].

Let us consider each in turn. First, in order that any integral i is mapped to an integral y according to (5.2), we want both $TA$ and $Ta + t$ to be integral. This is clearly a sufficient condition, guaranteeing that y can be computed from i using only integer operations. It is also a necessary condition. If $TA$ were not integral (say in the j-th column), two integral vectors differing only in the j-th dimension by 1 would be mapped to two vectors differing by the j-th column of $TA$, at least one of which must be non-integral because their difference is not integral. Thus, $TA$ has to be integral. Given that, $Ta + t$ must also be integral because y, i, and $TA$ are all integral, and $y = (TA)i + (Ta + t)$.

Second, in order that any integral y corresponds to some integral i according to (5.2), we want the Hermite normal form of $TA$, denoted $H_{TA}$, to be the identity matrix. We now explain why. If $H_{TA}$ equals the identity matrix, then by the definition of Hermite normal form, $(TA)U = [I \; 0]$ for some $n \times n$ unimodular matrix U. In other words,

$TA = [I \; 0]\,U^{-1}$   (5.3)

Thus, given any integral vector y, we can find a corresponding iteration vector i (there may be many) as follows: lengthen the m-dimensional vector $y - (Ta + t)$ to n dimensions with $n - m$ appended zeros, and left-multiply the lengthened vector by U. To verify that the resulting i corresponds to the given y, we substitute it into the right-hand side of (5.2):

1–  y– () Ta+ t  ()TA i+ () Ta+ t = []I 0 U  U  + ()Ta+ t 0 y– () Ta+ t (5.4) = []I 0 + ()Ta+ t 0 = y

Finding T and t

We now describe how to compute T and t to meet the above requirements: $TA$ and $Ta + t$ are integral, and $TA$’s Hermite normal form is the identity matrix. This is done in three steps as illustrated in the lower half of Figure 5.1.

1. Compute a nonsingular matrix, denoted S, to flatten the columns of the access matrix exactly as described earlier in Section 3.2.2. Roughly speaking, this matrix represents a transformation to reorder the elements as appropriate. In Figure 5.1, S is $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, corresponding to an array transpose to make the innermost loop go through array elements row by row.

2. Compute the Hermite normal form of $SA$², denoted $H_{SA}$. Choose $T = H_{SA}^{-1}S$. Intuitively, the matrix $H_{SA}^{-1}$ represents an additional transformation to omit unaccessed elements while preserving the effect of the previous transformation S. In the example, array indices are halved. As Figure 5.1 illustrates, this may lead to non-integral array indices. (Notice that the circles lie between grid points.) This problem is remedied by the next step.

3. Choose $t = -Ta$. Adding the offset vector t makes the non-integral index vectors integral by shifting all of them by a suitable non-integral amount (in the example, up by $1/2$).
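The following is a small numeric sketch of these three steps (in Python, not the prototype compiler’s code), using the access matrix, offset, and flattening matrix that Figure 5.1 suggests for the access X[2*i2,2*i1+1]; the 2-by-2 Hermite normal form is written down by inspection rather than computed by a general routine.

from fractions import Fraction

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(len(v))) for i in range(len(P))]

A = [[0, 2], [2, 0]]                 # access X[2*i2, 2*i1+1]: x = A i + a
a = [0, 1]
S = [[0, 1], [1, 0]]                 # step 1: transpose, from Section 3.2.2

SA = matmul(S, A)                    # [[2, 0], [0, 2]]
H_SA_inv = [[Fraction(1, 2), 0],     # inverse of the Hermite normal form of SA
            [0, Fraction(1, 2)]]

T = matmul(H_SA_inv, S)              # step 2: T = H_SA^{-1} S = [[0, 1/2], [1/2, 0]]
t = [-x for x in matvec(T, a)]       # step 3: t = -T a = [-1/2, 0]

print("TA     =", matmul(T, A))                              # identity matrix
print("Ta + t =", [x + y for x, y in zip(matvec(T, a), t)])  # zero vector

With these values, $TA$ is the identity matrix (so its Hermite normal form is I) and $Ta + t$ is the zero vector, which is why the transformed access in the figure is simply X2[i1,i2].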

We now argue more formally that these choices for T and t meet our requirements. First of all, choosing $-Ta$ for t makes $Ta + t$ zero and hence integral. As for $TA$, by the definition of Hermite normal form, $(SA)U = [H_{SA} \; 0]$ for some unimodular matrix U. Therefore,

  1– 1– () 1– [] 1– [] 1– TAH=SAS  AH =SA SA =HSA HSA 0 U = I 0 U (5.5)

Because $U^{-1}$ is the inverse of a unimodular matrix, it is also unimodular and thus integral by definition. Furthermore, this very equation also indicates that the Hermite normal form of $TA$ is I, as required. Thus, the requirements on $TA$ are also met.

2. Strictly speaking, the Hermite normal form exists only for matrices of full row rank (i.e., the rank is equal to the number of rows). The access matrix A is not necessarily of full row rank; hence neither is $SA$. However, our algorithm for finding S from A (Figure 3.6 on page 39) ensures that the height of matrix $SA$ is the same as its rank. (Recall that our algorithm guarantees that the transformed access matrix has identical height and rank profiles.) Therefore, $SA$ consists of $m - \mathrm{rank}(SA)$ rows of zeros followed by a $\mathrm{rank}(SA) \times n$ submatrix of full row rank. In the rest of our discussion, “$SA$” refers to this submatrix.

Finally, $TA$ and $SA$ have identical height profiles because $TA = H_{SA}^{-1}(SA)$ and $H_{SA}^{-1}$, like its inverse $H_{SA}$, is lower-triangular. This means that T and S are equally “good” in flattening the columns of the access matrix.

5.1.3 Multiple Accesses

Multiple accesses are handled with a slightly adapted algorithm. As usual, we start with our requirements and then present the algorithm for satisfying them. Figure 5.2 illustrates the algorithm. In this example, the computed transformation transposes the array so that the inner loop goes through a row of elements, and omits from the restructured array elements that are not accessed at all.

Requirements

The requirements are similar to those in the last section. In order that the restructured array includes all accessed elements, for every access, any integral iteration vector i must be mapped to an integral transformed index vector y. As before, we ensure this by requiring that both $TA$ and $Ta + t$ are integral for each access. However, it may not always be possible to satisfy the converse requirement because of the regularity of affine index transformations. In other words, not every integral y corresponds to some integral iteration vector i for at least one of the accesses; the restructured array could contain elements not really accessed by any iteration. Nevertheless, the index transformation we compute is minimal: it selects no more elements than any affine transformation satisfying the first requirement. Before justifying this claim, let us first describe how we compute the transformation.

Algorithm

The algorithm to compute T and t resembles that in the last section for a single access. In fact, it reduces to the latter when there is only one access. The main difference lies in the new steps 2 and 3, which construct a special matrix.

Figure 5.2: Restructuring Regularly Spaced Elements for Multiple Accesses (the original loop accesses X[2*i2,2*i1+1] and X[2*i1+2*i2+1,2*i1]; the transformed loop accesses X2[2*i1,i2-i1] and X2[2*i1-1,i2+1]; the figure shows the reference offset, the relative offsets, the aggregate matrix C, and the matrices S, H_SC, T, and t for this example)

The lower half of Figure 5.2 serves as our example in the following discussion.

1. Compute matrix S from the aggregate matrix formed from columns of the individual access matrices, as described before in Section 3.3.1. This transformation reorders elements for better locality. In the figure, the matrix $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ in effect transposes the array. As in the last section, what remains to be done is finding further transformations to omit the unaccessed elements without nullifying the effect of S.

2. Choose any one of the accesses as the “reference.” Call the offset vector of the reference access the reference offset, denoted $a_{ref}$. In Figure 5.2, we pick the first access as the reference; $a_{ref}$ is thus $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$. For each access, compute its relative offset, namely the difference between its offset vector and the reference offset. The array index vector can thus be expressed as $Ai + (a - a_{ref}) + a_{ref}$. Below, we will make each access’s $Ta + t$ integral by making both $T(a - a_{ref})$ and $Ta_{ref} + t$ integral.

3. Compose a matrix, C, with relative offsets and access matrix columns from all accesses. (In Figure 5.2, C is shown in the middle with a dashed line separating the two parts.) The columns of C can be in any order. We will explain the significance of C shortly. Note that when there is only one access, C is simply the access matrix, and the next two steps reduce to steps 2 and 3 in the last section.

4. Compute the Hermite normal form of $SC$, denoted $H_{SC}$. Choose $T = H_{SC}^{-1}S$. As before, $H_{SC}^{-1}$ intuitively represents a transformation that omits unaccessed elements from the restructured array.

5. Choose $t = -Ta_{ref}$, as in the last section, to shift index vectors resulting from step 4 so that they all become integral.
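The sketch below (Python, illustration only, not the prototype’s code) runs these five steps on the two accesses that Figure 5.2 suggests, X[2*i2,2*i1+1] and X[2*i1+2*i2+1,2*i1]; the Hermite normal form of $SC$ is again written down by inspection.

from fractions import Fraction

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(len(v))) for i in range(len(P))]

A1, a1 = [[0, 2], [2, 0]], [0, 1]       # first access (chosen as the reference)
A2, a2 = [[2, 2], [2, 0]], [1, 0]       # second access
S = [[0, 1], [1, 0]]                    # step 1: from Section 3.3.1

rel2 = [a2[k] - a1[k] for k in range(2)]       # steps 2-3: relative offset [1, -1]
C = [[rel2[0], 0, 2, 2, 2],                    # aggregate matrix: relative offsets
     [rel2[1], 2, 0, 2, 0]]                    # followed by the columns of A1 and A2

SC = matmul(S, C)
H_SC_inv = [[1, 0],                            # step 4: inverse of the Hermite
            [Fraction(-1, 2), Fraction(1, 2)]] # normal form [[1, 0], [1, 2]] of SC

T = matmul(H_SC_inv, S)                 # [[0, 1], [1/2, -1/2]]
t = [-x for x in matvec(T, a1)]         # step 5: t = -T a_ref = [-1, 1/2]

for A, a in ((A1, a1), (A2, a2)):       # every TA and Ta + t comes out integral
    print(matmul(T, A), [x + y for x, y in zip(matvec(T, a), t)])

The transformed accesses work out to X2[2*i1,i2-i1] and X2[2*i1-1,i2+1], matching the figure.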

As in the last section, T is as good as S in flattening access matrix columns. Because $T = H_{SC}^{-1}S$ and $H_{SC}^{-1}$ is lower-triangular, for any access matrix A both transformed access matrices $SA$ and $TA = H_{SC}^{-1}(SA)$ must have the same height profile.

Moreover, the index transformation thus computed ensures that for every access, $TA$ and $Ta + t$ are integral as required. Let us explain why. First of all, because $H_{SC}$ is the Hermite normal form of $SC$, $TC$ is integral, for the same reason as why $TA$ is integral in the last section. Recall that the columns of C are each access’s relative offset $a - a_{ref}$ and access matrix columns. Therefore, the product of T with them must be integral: for every access, $T(a - a_{ref})$ and $TA$ are integral. Moreover, step 5 ensures that $Ta_{ref} + t$ is integral (in fact, zero). It follows from these two observations that for each access, $TA$ and $Ta + t$ (which equals $T(a - a_{ref}) + (Ta_{ref} + t)$) are both integral, as we require.

Computed Index Transformation Is Minimal

Although the computed index transformation may select elements that are not really accessed, it is minimal in that it selects no more elements than other index transformations that also select all the accessed elements. In other words, any element selected by our transformation is also selected by such an alternative. Let the alternative index transformation be $y = T'x + t'$.

To justify this claim, we first have to argue that $T'C$ must be integral. For argument’s sake, suppose it were not. Recall that the columns of C originate from access matrices or relative offsets. Again for the sake of argument, suppose further that one of the non-integral columns in $T'C$ originates from the j-th column of some access matrix A. Then, for that access, the transformed access function $y = (T'A)i + (T'a + t')$ would map two iteration vectors differing by 1 in the j-th dimension to two transformed index vectors differing by the j-th column of $T'A$. If, as we have assumed, this column is not integral, the two index vectors cannot both be integral and hence the two elements cannot both be selected by the alternative transformation, even though both elements are accessed and therefore should be selected. Hence, the non-integral columns in $T'C$, if any, could not have originated from access matrices.

Moreover, they could not have originated from relative offsets either. Again, suppose for argument’s sake that one of them were. Let that access be $Ai + a$. The transformed access $(T'A)i + (T'a + t')$ can be written as

$(T'A)i + T'(a - a_{ref}) + (T'a_{ref} + t')$.   (5.6)

Compare it with the transformed reference access:

$(T'A_{ref})i + (T'a_{ref} + t')$   (5.7)

Note that $T'A$ and $T'A_{ref}$ are integral as we have concluded in the last paragraph, but $T'(a - a_{ref})$ is non-integral as we assume in this paragraph. Therefore, for any integral iteration vector i, the two index vectors (5.6) and (5.7) cannot both be integral: if (5.6) is integral, $T'a_{ref} + t'$ is not and hence neither is (5.7); if (5.7) is integral, $T'a_{ref} + t'$ is also integral and hence (5.6) must be non-integral. In other words, these two elements accessed by iteration i cannot both be selected even though they should. Again, we reach a contradiction and therefore must conclude that $T'C$ is integral.

Having shown that $T'C$ is integral, we can now argue that any element selected by our index transformation (i.e., $Tx + t$ is integral) is selected by the alternative as well (i.e., $T'x + t'$ is also integral). With some minor algebraic manipulations, we have

1–  1–  T′x+ t′ = T′T  ()Tx+ t + t′ – T′T t  (5.8)

Based on this, an integral $Tx + t$ would always imply an integral $T'x + t'$ if both $T'T^{-1}$ and $t' - T'T^{-1}t$ are integral, which indeed they are as we show below.

First, let us consider $T'T^{-1}$. Substituting the choice for T in step 4 above, we have

    ′ ′ 1– ′ 1– 1– T CT=T TCT = T  HSC S  C (5.9) 108

By the definition of Hermite normal form, there exists a unimodular matrix U such that $SC = [H_{SC} \; 0]\,U$. Substituting this into (5.9) gives us

$T'C = T'T^{-1}H_{SC}^{-1}[H_{SC} \; 0]\,U = T'T^{-1}[I \; 0]\,U = [T'T^{-1} \; 0]\,U$   (5.10)

which implies that

$(T'C)\,U^{-1} = [T'T^{-1} \; 0]$   (5.11)

Since U is unimodular, so is $U^{-1}$ [Schrijver 1986]. By the definition of a unimodular matrix, $U^{-1}$ is integral. Moreover, we have shown above that $T'C$ is integral. Therefore, $T'T^{-1}$ is also integral.

Next, we consider $t' - T'T^{-1}t$. Let x be the original index vector of some element accessed by the loop nest. Both our index transformation and the alternative must select this element. In other words, both $Tx + t$ and $(T'T^{-1})(Tx + t) + (t' - T'T^{-1}t)$ (see (5.8)) are integral. As we already know that $T'T^{-1}$ is integral and hence so is the first term on the right-hand side, $t' - T'T^{-1}t$ must be integral also. As $T'T^{-1}$ and $t' - T'T^{-1}t$ are both integral, according to (5.8) $T'x + t'$ is integral whenever $Tx + t$ is. This completes the argument for our claim that any element selected by the computed index transformation is also selected by the alternative transformation.

5.1.4 Summary

We have discussed techniques to restructure a subset of elements regularly spaced in the index space. The index transformation is affine and has a nonsingular, possibly non-unimodular or even non-integral, transformation matrix. Although the transformation may be non-integral, the algorithm to find the transformation matrix T and offset vector t guarantees that all the transformed access matrices and offsets, namely $TA$ and $Ta + t$, are integral. This ensures that the restructured array contains all the elements accessed by the loop but omits those that are not, to the extent allowed by affine index transformations. Specifically, when there is only one access, we further ensure that our index transformation selects only elements that are actually accessed, and no more. When there are multiple accesses, however, we cannot guarantee this but our index transformation is minimal in that it selects no more elements than alternative transformations.

5.2 Range of Index Vectors

The restructured array need not always contain all the elements of the original array. In some cases, a loop may access only elements with indices within a restricted range, for example elements 1 to 100 among the 200 elements along some array dimension or one column of a two-dimensional array. By using such restricted ranges in place of the original array bounds for our linearization procedure, we can include in the restructured array only those elements that are really accessed.

This section discusses how to compute, though sometimes approximately, the range of accessed elements from array index expressions and loop bounds. The required assumptions will be presented as they become relevant to the discussion. The analysis below is not critical for array restructuring: whenever a necessary assumption does not hold, we can (and do) always fall back to the option of restructuring entire arrays based on transformed array bounds.

We first explain the problem. Then, we present solutions for a single access, followed by a solution in the case of multiple accesses. We close this section by discussing how partial restructuring affects the linearization procedure presented earlier.

5.2.1 The Problem

After choosing an index transformation as in Section 5.1, we want to find the “image” of the loop bounds in the transformed index space. Specifically, we wish to compute the set of transformed index vectors that correspond to one or more iteration vectors within the loop bounds.

Even for a single affine access, this problem is much harder than the seemingly similar problem solved in Section 4.1 — computing the transformed array bounds from the original array bounds. In that problem, the transformation is invertible because the transformation matrix T is nonsingular. Here, however, the transformed access matrix $TA$ may be singular, and therefore the mapping may not be invertible.

Figure 5.3 illustrates the difficulties. Part (a) shows the easy case where $TA$ is nonsingular. The image of the loop bounds can be computed just as before by substituting $i = (TA)^{-1}[y - (Ta + t)]$ into the bounds on i (i.e., the loop bounds). In part (b), $TA$ is of full column rank but not full row rank. The above substitution is invalid because $TA$ has no inverse. The single loop accesses only one dimension, specifically a row, of a two-dimensional array. Part (c) is difficult for the opposite reason: $TA$ is of full row rank but not full column rank. The two inner loops cover a square in the index space; the outermost loop shifts this square diagonally, sweeping out the shaded region. In both parts (b) and (c), it is not obvious how to find the image. Finally, part (d) combines the difficulties of (b) and (c): $TA$ is neither of full row rank nor of full column rank. The inner loop covers a diagonal line segment; the outer loop shifts it diagonally, thus sweeping out a longer diagonal line segment. While it is perhaps possible to handle each special case with a specialized technique, we prefer to deal with them uniformly, as we now discuss.

5.2.2 Solutions

The problem is to find the image of the loop bounds in the transformed index space. We make two assumptions. First, the array access has affine index expressions. Second, the loop bounds can be represented as a conjunctive set of linear inequalities on the loop variables or, geometrically, as a convex polyhedron in the iteration space. In other words, each lower (upper) loop bound is the maximum (minimum) of one or more affine functions of enclosing loop variables. As mentioned earlier, we fall back to restructuring the entire array whenever our assumptions do not hold.

Figure 5.3: Examples of Restricted Index Ranges. (a) Full row and column rank: a doubly nested loop accessing X2[i1,i2], with $TA$ of rank 2. (b) Full column rank only: a single loop accessing X2[3,i1], with $TA$ of rank 1. (c) Full row rank only: a triply nested loop accessing X2[i1+i2,i1+i3], with $TA$ of rank 2. (d) Neither full column rank nor full row rank: a doubly nested loop accessing X2[i1+i2,i1+i2], with $TA$ of rank 1. (All loops range from 0 to 100.)

With these assumptions, we can refine the problem statement as follows: given a transformed array access with access function $Fi + f$, compute the set of index vectors y such that $y = Fi + f$ for one or more iteration vectors i satisfying the loop bounds. The result is to be represented as a set of linear inequalities on the transformed array indices or, geometrically, as a convex polyhedron in the transformed index space. We discuss two ways to calculate this result. They both compute the required set of index vectors, although they are not guaranteed to produce identical sets of inequalities. The first way is conceptually simpler, but the second has been implemented for efficiency.

One way is to project a polyhedron in the (m+n)-dimensional vector space of $\begin{bmatrix} y \\ i \end{bmatrix}$ onto the m-dimensional subspace of y alone³. Specifically, a vector in the (m+n)-dimensional vector space is $\begin{bmatrix} y \\ i \end{bmatrix}$, the vector formed by concatenating y and i. Define a polyhedron, P, in this (m+n)-dimensional vector space with the following inequalities:

$Di \ge d$
$y \ge Fi + f, \quad y \le Fi + f$   (5.12)

The first line is the loop bounds; the second simply means $y = Fi + f$. Projecting P onto the m-dimensional subspace of y produces the set of y for which there is some i such that y and i together satisfy all the inequalities defining P, that is $y = Fi + f$ with i within the loop bounds. Therefore, this set of y is precisely what we try to compute: the image of the loop bounds in the transformed index space. To find the projection, we can apply the Fourier-Motzkin algorithm to P to get bounds for each dimension of $\begin{bmatrix} y \\ i \end{bmatrix}$ in terms of preceding dimensions of the vector, and retain only the bounds for dimensions corresponding to y. By Lemma 4.1 on page 87, this procedure yields the aforesaid set of y.
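For illustration, here is a minimal sketch (in Python) of this first, projection-based method applied to the example of Figure 5.3(d); it is not the prototype’s implementation, and it makes no attempt to remove the redundant inequalities that Fourier-Motzkin elimination produces.

from fractions import Fraction

# Each constraint  a . v >= b  over v = (y1, y2, i1, i2) is stored as (a, b).
# Eliminating one variable combines every lower bound on it with every upper
# bound on it, exactly as in Fourier-Motzkin elimination.
def eliminate(constraints, idx):
    lower, upper, rest = [], [], []
    for a, b in constraints:
        (lower if a[idx] > 0 else upper if a[idx] < 0 else rest).append((a, b))
    for al, bl in lower:
        for au, bu in upper:
            s, t = Fraction(1, al[idx]), Fraction(-1, au[idx])   # both positive
            rest.append(([s * x + t * y for x, y in zip(al, au)], s * bl + t * bu))
    return rest

# Polyhedron P of (5.12) for Figure 5.3(d): y = F i + f with F = [[1,1],[1,1]],
# f = 0, and loop bounds 0 <= i1, i2 <= 100 (each equality written as two
# inequalities).
P = [
    ([0, 0, 1, 0], 0), ([0, 0, -1, 0], -100),    # 0 <= i1 <= 100
    ([0, 0, 0, 1], 0), ([0, 0, 0, -1], -100),    # 0 <= i2 <= 100
    ([1, 0, -1, -1], 0), ([-1, 0, 1, 1], 0),     # y1 = i1 + i2
    ([0, 1, -1, -1], 0), ([0, -1, 1, 1], 0),     # y2 = i1 + i2
]
for var in (3, 2):                               # project away i2, then i1
    P = eliminate(P, var)
print(P)   # among redundant inequalities: 0 <= y1 <= 200, y1 <= y2, y2 <= y1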

While this method is conceptually simple, we have not implemented it mainly because of concerns for efficiency. The Fourier-Motzkin algorithm sometimes produces redundant inequalities. In particular, writing equations as pairs of inequalities may contribute to this problem [Amarasinghe and Lam 1993]. To avoid generating a large number of spurious inequalities, we use instead an algorithm that treats equations as equations.

3. Amarasinghe and Lam have used a similar approach involving the projection of polyhedra to tackle another problem: generating communication code for distributed-memory multiprocessors [Amarasinghe and Lam 1993].

In the second method, roughly speaking we express i as a function of y (“inverting” the function $Fi + f$) and then substitute it into the loop bounds to get equivalent bounds on the index vector y. We have seen that if F is nonsingular, i can be written in terms of y as $i = F^{-1}(y - f)$. If F is singular, however, this fails because F has no inverse. Instead, we express i in terms of some components of y and possibly some other parameters (both of which we call independent components, or simply independents), substitute that expression into the loop bounds to obtain equivalent bounds on the independents, and finally project the latter bounds onto the subspace spanned by the independents in y. We now explain this in more detail with the help of Figure 5.4, which shows how the example in Figure 5.3(d) is handled step by step.

First of all, to express i in terms of the independents (to “invert” $Fi + f$), we treat $y = Fi + f$ as an equation in unknowns y and i, rather than a formula to compute y from i. The “solutions” to this equation can be expressed parametrically as

$\begin{bmatrix} y \\ i \end{bmatrix} = \begin{bmatrix} F \\ I \end{bmatrix}\lambda + \begin{bmatrix} f \\ 0 \end{bmatrix}$   (5.13)

where $\lambda$ is an arbitrary n-dimensional vector. Any $\lambda$ would give us y and i such that $y = Fi + f$ because, in fact, the parametric vector $\lambda$ is equal to i.

We want to express the same solutions in another way such that the parametric vector resembles y more closely. To do this, we first convert $\begin{bmatrix} F \\ I \end{bmatrix}$ into column echelon form [Bloom 1979]. (The implications are discussed shortly.) For example, the matrix in the middle of Figure 5.4 is in this form. Intuitively, the top nonzero in each column descends like a staircase as we go through the columns from left to right, is the only nonzero in the row, and equals 1. Formally, a matrix is in column echelon form if and only if

Figure 5.4: Computing Image of Loop Bounds. For the loop of Figure 5.3(d), which accesses X2[i1+i2,i1+i2] with loop bounds $0 \le i_1, i_2 \le 100$, the stacked matrix $\begin{bmatrix} F \\ I \end{bmatrix}$ is turned into the column echelon form $E = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{bmatrix}$; the independents are $y_1$ and $i_1$, the dependents are $y_2 = y_1$ and $i_2 = y_1 - i_1$; substituting into the loop bounds, running Fourier-Motzkin, and omitting the bounds for $i_1$ leaves the bounds $0 \le y_1 \le 200$ on y.

• it consists of some (or no) nonzero columns followed by some (or no) zero columns;

• in each nonzero column, if any, the top nonzero

- equals 1,

- is the only nonzero in its row, and

- occurs below the top nonzero of the column to the left, if any.

The matrix $\begin{bmatrix} F \\ I \end{bmatrix}$ (or any matrix for that matter) can be converted to this form by elementary column operations, whose effects can be summarized in a nonsingular matrix multiplied on the right [Bloom 1979]. That is, there is a nonsingular matrix $\Delta$ such that $\begin{bmatrix} F \\ I \end{bmatrix}\Delta$, denoted E, is in column echelon form. Thus, the solutions to the “equation” $y = Fi + f$ can also be expressed as

$\begin{bmatrix} y \\ i \end{bmatrix} = E\mu + \begin{bmatrix} f \\ 0 \end{bmatrix}$   (5.14)

where $\mu$, like $\lambda$, is an arbitrary n-dimensional vector. This is equivalent to (5.13) because any solution given by a certain $\lambda$ can also be given by some $\mu$ ($\mu = \Delta^{-1}\lambda$), and vice versa ($\lambda = \Delta\mu$).

The fact that E is in column echelon form allows i to be expressed in part as a function of y. Specifically, the parametric vector $\mu$ in effect consists of some, possibly all, components of index vector y and perhaps some components of iteration vector i — both identified by rows of E that contain the top nonzeros of the columns and therefore, the column echelon form specifies, contain a single “1.” (There must be a unique row for each column since column echelon form also specifies that the top nonzero descends through the columns.) In Figure 5.4, for instance, the first and third rows fit the above description. They mean that $\mu_1 = y_1$ and $\mu_2 = i_1$. Thus, $y_1$ and $i_1$ are the independent components. The other components of y and i (the “dependents” $y_2$ and $i_2$) are expressed in terms of the independents $y_1$ and $i_1$, as the diagram illustrates.

After i has been expressed in terms of the independents, it can be substituted into the loop bounds to yield equivalent bounds on the independents. The latter are projected onto the subspace spanned by independents in y by applying the Fourier-Motzkin algorithm and then omitting the bounds on the other independents, namely those in i. This leads to bounds on the independent components of y (shown in the bottom right of the diagram).

In addition, we must also include inequalities representing the equations that relate the dependents in y to the independents. For example, $y_2 = y_1$ in Figure 5.4 is written as a pair of inequalities in $y_1$ and $y_2$. These inequalities involve only y’s components because y’s dependents can always be expressed in terms of y’s independents without involving the independents in i. Again, this follows from the column echelon form. We explain this with reference to $y_2$ in Figure 5.4. The row in E for $y_2$ has a nonzero in the first column, which means that $y_2$ varies with $\mu_1$. The top nonzero in this column must be above $y_2$’s nonzero. (These two nonzeros in the same column cannot be the same element; if they were, $y_2$ would have been classified as an independent.) Therefore, the independent in $\begin{bmatrix} y \\ i \end{bmatrix}$ that corresponds to $\mu_1$ must lie above $y_2$ and thus must be another component of y. In fact, it is $y_1$. The preceding argument would have applied to any other column in which $y_2$’s row contains a nonzero and to any other dependents in y, had there been any.
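The sketch below (Python, illustration only) carries out the first step of this second method for the Figure 5.4 example: it reduces $\begin{bmatrix} F \\ I \end{bmatrix}$ to column echelon form by elementary column operations and reports which rows become independents. The reduction assumes the matrix has full column rank, which $\begin{bmatrix} F \\ I \end{bmatrix}$ always does because of the identity block.

from fractions import Fraction

def column_echelon(M):
    """Reduce M to column echelon form; return (E, rows holding top nonzeros)."""
    M = [[Fraction(x) for x in row] for row in M]
    rows, cols = len(M), len(M[0])
    pivots, c = [], 0
    for r in range(rows):
        if c >= cols:
            break
        j = next((j for j in range(c, cols) if M[r][j] != 0), None)
        if j is None:
            continue                          # no pivot in this row: a dependent
        for row in M:                         # move the pivot column into place
            row[c], row[j] = row[j], row[c]
        piv = M[r][c]
        for i in range(rows):                 # make the top nonzero equal 1
            M[i][c] /= piv
        for j in range(cols):                 # make it the only nonzero in its row
            if j != c and M[r][j] != 0:
                factor = M[r][j]
                for i in range(rows):
                    M[i][j] -= factor * M[i][c]
        pivots.append(r)
        c += 1
    return M, pivots

F = [[1, 1], [1, 1]]                          # access X2[i1+i2, i1+i2]
stacked = F + [[1, 0], [0, 1]]                # [F; I], rows are (y1, y2, i1, i2)
E, independents = column_echelon(stacked)
print(E)             # [[1, 0], [1, 0], [0, 1], [1, -1]], as in Figure 5.4
print(independents)  # rows 0 and 2: y1 and i1 are the independents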

The above two groups of inequalities on y together delineate the image of the loop bounds. Any y such that $y = Fi + f$ for some iteration vector i within the loop bounds is within the image, and vice versa. We outline the argument here. The forward direction of the claim can be justified by “following the arrows in Figure 5.4.” Since i satisfies the loop bounds and $y = Fi + f$, the bounds on the independents and hence those derived by projection are satisfied. For the same reason, so are the inequalities relating dependents and independents in y. Conversely, we could start from a vector y in the image and find some i such that i satisfies the loop bounds and $y = Fi + f$. This is done by following the same arrows backward. First, according to Lemma 4.1 on page 87, from y’s independents we can pick values for i’s independents such that the two together satisfy the bounds on all the independents.

Figure 5.5: Causes of Including Elements Not Accessed. (a) Ignoring that iteration vectors are integral: a loop nest with bounds $0 \le i_1 \le 1$ and $0 \le i_2, i_3 \le 100$ accessing X2[100*i1+i2,100*i1+i3]. (b) Uniformly generated accesses: a loop nest with bounds $0 \le i_1, i_2 \le 4$ accessing X2[i1,i2] and X2[i1+2,i2+2].

Next, from the independents we can compute i’s dependents such that the entire iteration vector i satisfies the loop bounds and $y = Fi + f$.

Imprecision

Our technique, however, has one main limitation. Since we ignore the fact that i, being an iteration vector, must be integral, the set of elements accessed by some iteration may not be calculated precisely. Some elements not accessed may be needlessly included in the restructured array. Figure 5.5(a) illustrates the problem. An element in one of the two empty triangles would be mistaken as being accessed by some iteration because its index vector corresponds to some iteration vector with $0 < i_1 < 1$, which is within the loop bounds 0 and 1 but not integral. Note, however, that because elements are not erroneously omitted, the computed bounds of the restructured array do include all the elements that must be in the array.

5.2.3 Multiple Accesses

Having discussed the case of a single access, we now outline how multiple accesses can be handled. We handle the common case of uniformly generated accesses⁴, and fall back to restructuring the whole array in other cases. For uniformly generated accesses, the images of the loop bounds are the same up to a constant offset, as demonstrated by the two squares delineated by thin lines in Figure 5.5(b). Therefore, in each group of corresponding faces from the identically shaped polyhedra, we select the most “lenient” one to form the bounds of the restructured array. In other words, writing the inequalities (one for every access) in the standard form $\beta y \ge \beta_0$, we pick the one with the smallest right-hand side. This results in the larger square bounded by thick dashed lines.
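A tiny sketch of this “most lenient face” rule (Python, illustration only) for the two accesses of Figure 5.5(b): because the accesses are uniformly generated, the two images are the same square shifted by the constant offset, so corresponding inequalities $\beta y \ge \beta_0$ differ only in their right-hand sides and we keep the smaller one.

# Images of the loop bounds for X2[i1,i2] and X2[i1+2,i2+2] with
# 0 <= i1, i2 <= 4, each face written as (beta, beta0) meaning beta . y >= beta0.
images = [
    [((1, 0), 0), ((-1, 0), -4), ((0, 1), 0), ((0, -1), -4)],   # 0 <= y1, y2 <= 4
    [((1, 0), 2), ((-1, 0), -6), ((0, 1), 2), ((0, -1), -6)],   # 2 <= y1, y2 <= 6
]

merged = [(beta, min(img[k][1] for img in images))
          for k, (beta, _) in enumerate(images[0])]
print(merged)   # 0 <= y1 <= 6 and 0 <= y2 <= 6: the larger dashed square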

As the figure shows, this may cause unaccessed elements to be mistakenly included in the restructured array. However, we expect that there are many fewer such elements than accessed elements because the offsets are typically small, especially compared with array dimensions. For this reason, only a relatively small amount of memory is wasted.

One may envision more sophisticated analysis, but no major benefits are expected. For example, we could find the image of the loop bounds for each access (whether or not the accesses are uniformly generated) and compute the convex hull of these images. In Figure 5.5(b), this would be the hexagon bounded by thick grey lines. As the figure shows, even this does not guarantee that only accessed elements are included: the elements identified by the two unfilled circles are still included needlessly. Moreover, in the common case of uniformly generated accesses with small offsets, this method saves only a small amount of memory compared with the simpler technique we use.

4. A set of accesses are uniformly generated if their array indices differ only by constant offsets. See previous discussion on page 48.

5.2.4 Implications to Linearization

After computing the image of the loop bounds in the transformed index space, we find a linearization vector with the procedure in Chapter 4, but using as bounds of the restructured array this computed image rather than the transformed array bounds. Though developed with the transformed array bounds in mind, that procedure applies to bounds having the form of a symmetric convex polyhedron, not just transformed array bounds. Therefore, it applies here as well, provided that the computed image is symmetric.

One common case in which the computed image is symmetric occurs when there is only one access and the loop bounds themselves are symmetric, which subsumes the common case of constant loop bounds but not the less common, though still reasonable, case of triangular loop bounds. If the loop bounds are symmetric about, say $\gamma$, the restructured array’s bounds would be symmetric about $(TA)\gamma + (Ta + t)$. This is because any index vector y within the latter bounds has a (not necessarily unique) corresponding iteration vector i within the loop bounds. The mirror image of i about $\gamma$, which is within the loop bounds because of the symmetry, maps to the mirror image of y about $(TA)\gamma + (Ta + t)$, which is therefore within the bounds of the restructured array.

If the image of the loop bounds is not symmetric, we can adapt the linearization procedure slightly to handle a special case, while resorting to restructuring the entire array in others. The adapted procedure differs from the earlier one outlined in Figure 4.4 on page 74 only in the final step, namely computing the relaxed bounds. In particular, the point c is defined exactly as before (i.e., according to (4.24) on page 88), although it is clearly not the center of symmetry — there is no center of symmetry because there is no symmetry. As for the relaxed bounds, in Section 4.5.5 on page 90, they consist of inequalities selected from the transformed bounds and the mirror images of those selected inequalities. Here, all inequalities are from the image of the loop bounds. In particular, for $1 \le j \le m$, we pick the affine functions that yield the lower bounds $l_j(c_1, \ldots, c_{j-1})$, like in Section 4.5.5, and those that yield the upper bounds $u_j(c_1, \ldots, c_{j-1})$, which replace the mirror images before. (Recall that $l_j(\cdot)$ and $u_j(\cdot)$ are augmented bounds for the j-th dimension and are respectively the maximum and minimum of one or more affine functions, as discussed in Section 4.5.3.) We check whether each selected pair of affine functions differ only by a constant offset (i.e., they represent parallel hyperplanes). If they do not, the following does not apply, and we fall back to restructuring the entire array. If they do, let the affine functions picked from $l_j(\cdot)$ and $u_j(\cdot)$ be respectively

$\sum_{k=1}^{j-1} F_{jk} y_k + f_j \quad\text{and}\quad \sum_{k=1}^{j-1} F_{jk} y_k + f'_j$   (5.15)

We choose the relaxed bounds as

$\sum_{k=1}^{j-1} F_{jk} y_k + f_j \;\le\; y_j \;\le\; \sum_{k=1}^{j-1} F_{jk} y_k + f'_j \qquad\text{for } 1 \le j \le m$   (5.16)

or, with F being an $m \times m$ matrix having $F_{jk}$ in the lower triangle and zeros on the main diagonal and in the upper triangle,

$f \le (I - F)y \le f'$   (5.17)

The relaxed bounds so computed meet the three requirements set out in Section 4.5.5. Let us consider each of them in turn. First, the relaxed bounds have the form required in (4.7) on page 75, with $r_l = f$, $r_u = f'$, and $R = I - F$. This is because, owing to the form of F described above, $I - F$ is a lower-triangular matrix with a unit diagonal. Second, the relaxed bounds include the image of the loop bounds because the inequalities making up the relaxed bounds are all selected from those making up the image of the loop bounds. Therefore, any vector satisfying the latter must, of course, satisfy the former as well. Finally, the relaxed bounds are tight: for any alternative bounds $r'_l \le R'y \le r'_u$ satisfying the above two conditions, $r'_u - r'_l \ge r_u - r_l$. The argument for this is almost identical to the one given in Section 4.5.5. We can follow the same argument here because up to (4.38) on page 93, it does not rely on c being the center of symmetry, only on the definition of c in (4.24), which we also use here. Here, however, the “distance” between the pair of parallel hyperplanes for the j-th dimension is

$u_j(c_1, \ldots, c_{j-1}) - l_j(c_1, \ldots, c_{j-1}) = \left(\sum_{k=1}^{j-1} F_{jk} c_k + f'_j\right) - \left(\sum_{k=1}^{j-1} F_{jk} c_k + f_j\right) = f'_j - f_j = r_{uj} - r_{lj}$   (5.18)

Continuing the previous argument from (4.38), we see that $r'_{uj} - r'_{lj} \ge r_{uj} - r_{lj}$, as we want to show.

5.2.5 Summary

A loop may access only those elements whose array indices lie within a limited range. We can use this range, instead of the transformed array bounds, to compute a linearization vector for the restructured array using the procedure in Chapter 4, provided the range is a symmetric convex polyhedron. We adapt the linearization procedure slightly to deal with some special asymmetric cases. To find that range for one access, we map the loop bounds through the transformed access function into the transformed index space. Calculating this image is complicated by the fact that the mapping is not always invertible. We can deal with the problem by casting it as a problem of polyhedral projection. Alternatively, we can “invert” the mapping function in a loose sense: the iteration vector is expressed in terms of part (possibly but not necessarily all) of the array index vector and other independent parameters. The set of accessed elements computed may be imprecise in some cases. However, errors are always on the conservative side: elements that are not accessed may be needlessly put in the restructured array, but elements that are indeed accessed are not incorrectly omitted. Whenever our technique does not apply, we fall back to restructuring entire arrays.

5.3 Summary

Restructuring entire arrays is not always necessary because only elements accessed by the loop need to be in the restructured array. In this case, we may restructure part of an array. We use affine, possibly non-integral index transformations to represent the fact that the restructured array contains only elements regularly spaced in the index space. Runtime index computation still involves only integer arithmetic despite the use of non-integers in the transformations. Moreover, the accessed elements may also be restricted to a region in the array index space. We define the bounds of the restructured array with the image of the loop bounds in the transformed index space but follow basically the same procedure as before to compute a linearization vector.

Part III

Experimentation

Chapter 6

Implementation

We have implemented a prototype compiler based on the SUIF compiler from Stanford University [Wilson et al. 1994]. SUIF is designed for rapid prototyping of compiler techniques. It has been chosen for our implementation because of its extensibility.

The overall structure of our prototype is shown in Figure 6.1. SUIF itself consists of compiler passes communicating through a well-defined intermediate program representation. Each pass reads the program representation from a file, modifies it or annotates it with additional information, and outputs the result. SUIF accepts C or Fortran programs. A frontend converts the source program to its intermediate representation¹. This representation is then analyzed and transformed by various passes. Finally, code is generated from the transformed representation. SUIF has a MIPS code generator. To compile for other architectures, it can also generate C code, which may then be compiled by the native C compiler of the machine platform on which the binary is eventually executed.

To the basic SUIF infrastructure we have added two components for array restructuring: compiler analysis and runtime support. Compiler analysis is implemented in an array restructuring pass immediately before code generation and after other existing passes. (The rationale for and limitations of this decision are discussed in Section 6.2.2.) Runtime support is implemented in a runtime system linked with the generated code. The following sections discuss these two components in detail.

1. Fortran code is translated to C before being read by SUIF proper. A separate SUIF pass recovers certain Fortran-specific information lost in the translation by recognizing idiomatic patterns in the C code that the translator produces.

Figure 6.1: Organization of Implementation (C and Fortran sources go through the SUIF frontend and existing passes, then the added array restructuring pass, then code generation; the generated C code is compiled by the native C compiler and linked with the array restructuring runtime system to produce the binary, alongside SUIF’s own MIPS code generator path)

6.1 Runtime System

The runtime system is called immediately before the execution of each loop nest. It has two main functions: managing copies of each array and dealing with runtime constants.

The runtime system manages multiple copies of the same array to support a lazy restructuring scheme. (We summarize here previous related discussion on page 63 in Section 3.3.4.) In this scheme, the compiler determines for each loop how an array should be transformed. At run time, multiple copies of an array transformed differently may exist simultaneously. In preparation for loop execution, the runtime system finds a copy with the desired transformation, creating one if there is none. Elements are copied from a valid copy using data remapping routines that are carefully coded for maximum efficiency. The runtime system also ensures consistency between array copies by invalidating all but the one being used if it is written.
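The following schematic sketch (written in Python only for brevity; the actual runtime system is C code linked with the generated program, and all names here are invented for illustration) captures the copy-management protocol just described.

class ArrayCopies:
    """Hypothetical illustration of the lazy copy-management scheme."""

    def __init__(self, original):
        # Map each layout key (e.g., the transformation and linearization
        # vector) to a [data, valid] pair; None denotes the canonical layout.
        self.copies = {None: [original, True]}

    def transform(self, key, remap, written):
        """Called before a loop nest that wants the layout identified by key."""
        if key not in self.copies or not self.copies[key][1]:
            source = next(data for data, valid in self.copies.values() if valid)
            self.copies[key] = [remap(source), True]   # copy from a valid copy
        if written:
            for other in self.copies:                  # writing invalidates the rest
                if other != key:
                    self.copies[other][1] = False
        return self.copies[key][0]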

Our runtime system also deals with runtime constants, such as loop bounds and array bounds, which typically depend on the problem size and thus the input, as in the example in Figure 6.2(a). The compiler needs the array index expressions to select index transformations. This is possible because the strides in those expressions are usually known at compile time. Given the transformation, the compiler can then modify the array accesses. However, the linearization vector is not fully determined yet because computing the bounds of the restructured array requires runtime information like the original array bounds, the loop bounds, or both. Thus, our compiler calculates as much as it can with what is known and outputs the intermediate results into the generated code. Before loop execution, the runtime system in effect completes the process and calculates the linearization vector, combining information from compiler analysis and runtime information that has become available.

6.2 Array Restructuring Pass

The array restructuring pass implements the analysis techniques presented in previous chapters. In this section, we describe its major functions and discuss its limitations.

6.2.1 Functions

The array restructuring pass relies on SUIF to identify the loops, array declarations, and array accesses and to provide information on the loop bounds, array bounds, and index expressions, among other things. Given this information, it makes restructuring decisions and modifies the program representation accordingly.

Making Decisions

The array restructuring pass makes two types of decisions. First, it analyzes the access pattern of each loop nest and selects an index transformation for each array. This has been discussed extensively in previous chapters.

Moreover, it also decides whether to restructure an array at all according to a heuristic comparison of the cost and benefit. In the current implementation, our simple heuristic is based on the assumption that the runtime cost of restructuring an array, mainly the cost of copying elements, is roughly proportional to the array size, whereas the benefit is roughly proportional to the (dynamic) number of accesses to that array. Thus, array restructuring seems profitable for, say, a triple loop accessing a two-dimensional array because each element is likely to be accessed many times but copied only once (or twice if the array is written).

However, this view may be deceptive. First, while the array is two-dimensional, the loop may access only one dimension, say a row, of it. Second, while an access may appear syntactically within a triple loop, its index expressions may vary with only the outer loop variables. In this case, the benefit of array restructuring does not grow with how many times the innermost loop repeatedly accesses the same element. In fact, a good compiler would lift the array access outside the innermost loop and replace it with a scalar variable inside, a common optimization called scalar replacement [Bacon, Graham, and Sharp 1994]. Therefore, our heuristic rule has to look beyond the number of array dimensions or loop levels.

The prototype compiler considers array restructuring profitable if the access matrix has more columns, excluding zero columns at the right, than its rank. The former is an indication of how often the access is executed; the latter is the dimensionality of the subspace of elements accessed. Neither is affected by left-multiplying the access matrix with a nonsingular matrix, as we do when restructuring an array. In other words, the heuristic yields the same result whether it is applied to the original or transformed access matrix. Although this simple rule has been effective so far, we expect that a more accurate cost-and-benefit analysis using information on actual loop and array bounds will be required for larger, more complicated applications.
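As an illustration, the heuristic amounts to a check like the following (Python, invented helper name, not the prototype’s code): compare the number of access-matrix columns, ignoring trailing zero columns, with the matrix’s rank.

from fractions import Fraction

def restructuring_profitable(A):
    cols = list(zip(*A))                        # columns of the access matrix
    while cols and all(x == 0 for x in cols[-1]):
        cols.pop()                              # drop zero columns at the right
    # Rank by Gaussian elimination over rationals (rows of M = columns of A).
    M = [[Fraction(x) for x in col] for col in cols]
    rank = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((r for r in range(rank, len(M)) if M[r][c] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        M[rank] = [x / M[rank][c] for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c] != 0:
                M[r] = [a - M[r][c] * b for a, b in zip(M[r], M[rank])]
        rank += 1
    return len(cols) > rank

print(restructuring_profitable([[0, 0, 1], [1, 0, 0]]))  # X[i3,i1] in a triple loop: True
print(restructuring_profitable([[1, 0, 0], [0, 1, 0]]))  # X[i1,i2], innermost i3 unused: False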

Modifying Code

After deciding whether and how to restructure each array, the array restructuring pass modifies the program accordingly. Specifically, it transforms array accesses and inserts calls to the runtime system. Figure 6.2 shows sample output code, abbreviated and slightly modified for presentation, for one of the subroutines used in our experiments.

Accesses to the restructured array replace those to the original, as illustrated by the code shown in bold in Figure 6.2. We treat the restructured array as one-dimensional regardless of the dimensionality of the original. (For example, a2 and b2 in Figure 6.2(b) are pointers to elements.) This is because its bounds cannot always be expressed in a conventional multidimensional array declaration, as noted in Section 4.1. Besides, since we compute the linearization vector directly, we may as well treat the restructured array as a one-dimensional array. The lone array index is the scalar offset into the array expressed directly as a dot product of the linearization vector and the loop variables (see code in bold). Note that some components of the linearization vector are variables because they cannot be fully determined until run time, as explained in Section 6.1.

The array restructuring pass also inserts calls to the runtime system (the calls to Transform and Cleanup in part (b) of the figure). They are placed where arrays may be dynamically restructured: immediately before loop nests and before procedure returns. The latter is for restoring restructured arrays back to their original, canonical layouts before returning. Note that calls to the runtime are needed neither at procedure entry nor after a loop nest (even if some arrays are written). This is because array restructuring is done lazily. For example, after a restructured array has been written, it is not copied back to the original until execution encounters a loop nest that demands the original array. For the same reason, the runtime system is called even when the compiler has decided not to restructure an array, such as before the first loop nest, because the runtime still needs to update the original array if it does not contain the latest changes. The overhead of making such runtime decisions is trivial compared with the costs of loop execution and, to a lesser extent, data copying.

      SUBROUTINE CHOLSKY (IDA, NMAT, M, N, A, NRHS, IDB, B)
      REAL A(0:IDA,-M:0,0:N), B(0:NRHS,0:IDB,0:N)
C
C     CHOLESKY DECOMPOSITION
C
      DO 1 J = 0, N
        … Update array A …
    1 CONTINUE
C
C     SOLUTION
C
      DO 6 I = 0, NRHS
        …
        B(I,L,K+JJ) = …
        …
    6 CONTINUE
      RETURN
      END

(a) Original Fortran Code

extern int cholsky_(ida, nmat, m, n, a, nrhs, idb, b)
…
{
    float *a2, *b2;                          /* restructured arrays */
    void *descriptor_a, *descriptor_b;       /* array descriptors */
    int lv_a_1, lv_a_2, lv_b_1, lv_b_2;      /* linearization vector */

    a2 = Transform(&descriptor_a, a, …, lv_a_1, lv_a_2);
    for (j = 0; j <= n; j++) {
        …
        /* No need to restructure A.  Use original array. */
        …
    }

    a2 = Transform(&descriptor_a, a, …, lv_a_1, lv_a_2);
    b2 = Transform(&descriptor_b, b, …, lv_b_1, lv_b_2);
    for (i = 0; i <= nrhs; i++) {
        …
        /* Restructure B.  Use original A. */
        …
        b2[lv_b_1 * i + lv_b_2 * (k+jj) + l] = …
        …
    }

    Cleanup(descriptor_a);
    Cleanup(descriptor_b);
}

(b) After Array Restructuring Pass

Figure 6.2: Sample Output Code
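The lazy policy described above might look roughly like the following sketch (our own illustration; beyond the Transform and Cleanup entry points, all names, fields, and the simplified argument list are assumptions):

    #include <stdlib.h>
    #include <string.h>

    /* A descriptor records which copy, original or restructured, holds the
     * latest data.  Copying happens only when a loop nest demands a version
     * that is stale, and the restructured copy is allocated on first use. */
    typedef struct {
        float *original;       /* canonical copy supplied by the program     */
        float *restructured;   /* reordered copy, allocated on demand        */
        size_t nelems;
        int    latest_in_copy; /* nonzero if the restructured copy is newest */
    } Descriptor;

    /* In a real runtime these would permute elements according to the chosen
     * index transformation; identity copies stand in for them here. */
    static void copy_reordered(float *dst, const float *src, size_t n)
    { memcpy(dst, src, n * sizeof(float)); }
    static void copy_back(float *dst, const float *src, size_t n)
    { memcpy(dst, src, n * sizeof(float)); }

    /* Called before a loop nest.  A loop that wants the restructured layout
     * gets an up-to-date copy; one that wants the original layout first has
     * any pending changes flushed back.  (A real runtime would also track
     * whether the loop writes, to avoid needless copy-backs after read-only
     * use.) */
    float *Transform(Descriptor *d, int want_restructured)
    {
        if (want_restructured) {
            if (d->restructured == NULL)
                d->restructured = malloc(d->nelems * sizeof(float));
            if (!d->latest_in_copy)
                copy_reordered(d->restructured, d->original, d->nelems);
            d->latest_in_copy = 1;
            return d->restructured;
        }
        if (d->latest_in_copy)
            copy_back(d->original, d->restructured, d->nelems);
        d->latest_in_copy = 0;
        return d->original;
    }

    /* Called before the procedure returns: restore the canonical layout. */
    void Cleanup(Descriptor *d)
    {
        if (d->latest_in_copy)
            copy_back(d->original, d->restructured, d->nelems);
        free(d->restructured);
        d->restructured = NULL;
        d->latest_in_copy = 0;
    }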

6.2.2 Limitations

Though useful as a proof of concept in this study, our current prototype leaves several issues for future work. These include alias analysis, interprocedural analysis, and the integration of array restructuring with other compiler optimizations, especially loop restructuring techniques.

First, our prototype does not perform alias analysis. For example, it does not check for aliasing arising from COMMON and EQUIVALENCE constructs in Fortran or pointer-based array accesses in C. Instead, we simply assume that arrays are always accessed through easily identifiable array accesses (never through address pointers, for example) and that distinct array variables refer to distinct, non-overlapping arrays.

However, alias analysis is necessary for array restructuring in general. To restructure an array, the compiler should identify and transform all accesses to that array in the loop nest. This is critical if the array is written. It is less important for read-only arrays because accesses to either the original or restructured version are acceptable, provided that both versions are consistent to begin with. Aliases hinder the task of identifying all accesses to a given array and thereby limit array restructuring.

We omitted alias detection because it is an independent problem beyond the scope of this study. It impacts loop restructuring equally, if not more. Specifically, if aliasing precludes the compiler from restructuring an array, most likely the loop nest itself cannot be restructured either, because the same problem would likewise hamper the dependence analysis required for validating loop transformations. In fact, loop restructuring is even more susceptible to such difficulties. For one thing, it is frustrated by aliasing problems in any of the arrays written by the loop, whereas for array restructuring, problems in one array do not preclude the restructuring of another. Also, for array restructuring, we only need to know which accesses are made to the array in question; we need not fully understand how all array indices vary in order to apply a transformation (though such knowledge surely helps in choosing one). To validate a loop transformation, however, we need dependence analysis, which demands much more information on the array indices.

Second, the prototype does not perform interprocedural analysis for array restructuring. If a loop nest contains procedure calls, the prototype compiler assumes, perhaps over-optimistically, that the callees do not access any array that is restructured in the caller. Like alias analysis, interprocedural analysis is needed for both loop and array restructuring to deal with procedure calls within loop nests. For array restructuring we only have to determine whether the callee accesses a certain array, whereas for loop restructuring we also need detailed information on individual accesses for a full dependence analysis.

Finally, it is to some extent an ad hoc decision to perform array restructuring after other compiler optimizations. Exactly how best to integrate array restructuring with other optimizations, especially loop restructuring, warrants much more study than can be devoted to it here. As a first step, we perform it after the other SUIF optimization passes because the latter obviously were not designed with our array restructuring technique in mind. Therefore, it is better to let the array restructuring pass, which runs last, tackle unintended artifacts of existing passes than vice versa.

Chapter 7

Experimental Results

We performed a series of experiments to evaluate our array restructuring technique. These experiments have been designed to study the performance impact of array restructuring by itself, as well as how it compares and interacts with existing loop restructuring techniques. In this chapter, we report and discuss these results.

7.1 Experiments

We selected for our experiments loops commonly used in related loop restructuring literature and with readily available source code, since we are particularly interested in how array restructuring compares and interacts with existing loop restructuring techniques. These loops include, among others, the NASA7 kernels1 of the SPEC 92 benchmarks [Dixit 1992], which have been used in several key studies of compiler-directed cache optimizations [Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992]. They are described in Table 7.1.

For each loop, we experimented with a range of problem sizes. We chose the problem sizes so that the major arrays are at least a few times larger than the second-level cache. This ensures that the data do not fit in the cache. If they did, we would not be able to study how various locality optimizations (either by loop or array restructuring) affect performance, because virtually all array accesses would be cache hits. In particular, for the NASA7 kernels our problem sizes differ from, and typically exceed, the standard ones. (The enlarged data sets were simply generated at random because in all these loops the access pattern does not depend on the contents of arrays, only on their sizes.) Since our problem sizes differ from those of the standard benchmarks, our results should not be treated as a report on SPEC benchmark performance.

1. Among the SPEC 92 benchmarks, the NASA7 kernels are likely to benefit most from locality optimizations of any kind. In an extensive experimental study, Carr et al. found that, among the SPEC 92 benchmarks, their locality optimization techniques improved performance only for some NASA7 kernels [Carr, McKinley, and Tseng 1994]. Their cache simulation results indicated that before any optimization, only the NASA7 kernels had a significant cache miss rate (slightly below 20%) while other SPEC 92 benchmarks' miss rates were no more than 3%, many practically zero.

Table 7.1: Loops for Experiments

Loop / Description (a) / Related Studies

MATMUL
  Description: Simple dense matrix multiply. Its innermost loop computes one element of the result matrix.
  Related studies: Cierniak and Li 1995; Kennedy and McKinley 1992; Li and Pingali 1992; Wolf and Lam 1991a

SYR2K
  Description: Symmetric rank-2k update for banded matrices. It computes Z = Z + X Y^t + Y X^t.
  Related studies: Li 1993; Li 1995; Li and Pingali 1992

MXM
  Description: Hand-tuned matrix multiply. Its outermost loop is unrolled four times and jammed.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

GMTRY
  Description: Gaussian elimination. It sets up a linear system for a vortex method solution and inverts the resulting matrix using Gaussian elimination without pivoting.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

CFFT2D
  Description: Two-dimensional fast Fourier transform (FFT). It consists of two routines performing FFT on the first and second array dimension respectively.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

CHOLSKY
  Description: Cholesky decomposition and solution. It performs Cholesky decomposition on multiple banded matrices stored as a three-dimensional array. It then performs forward and backward triangular solves simultaneously for multiple right-hand sides stored as a three-dimensional array.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

BTRIX
  Description: Block tridiagonal matrix solution. It performs a block tridiagonal matrix solution along one dimension of a four-dimensional array.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

VPENTA
  Description: Pentadiagonal inversion. It simultaneously inverts three pentadiagonal matrices.
  Related studies: Anderson, Amarasinghe, and Lam 1995; Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992


EMIT
  Description: Vortex emission. It is extracted from a vortex code. It creates new vortices according to certain boundary conditions.
  Related studies: Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992

a. The descriptions for the NASA7 kernels (MXM to EMIT) are mostly adapted from a document distributed with the source code [Bailey and Barton 1986].

These loops are relatively small pieces of code, though many of them (in particular the NASA7 kernels) have been extracted from complete numerical scientific codes. As explained above, we chose them to facilitate a comparison between array and loop restructuring. Moreover, compared with large applications, they offer two advantages. First, their manageable sizes allow deeper understanding of the observed performance behavior: it is feasible to examine each loop and array to determine what transformations have been used and how they affect the access pattern. Second, our experiments comparing array and loop restructuring required manually applying loop transformations, for reasons to be explained later. This would have been impractical if the applications had been too large.

Naturally, measuring code fragments rather than complete applications has its drawbacks. For one thing, without considering large, complete applications — and as many of them as possible — we cannot tell how often array restructuring (or any program restructuring technique for that matter) benefits loop execution, but only how much it does for specific loops. Moreover, we cannot accurately gauge the interaction between a loop that array restructuring has transformed and other parts of the program, especially those executed immediately before and after it.

In the latter respect, however, we address this limitation of our methodology by making pessimistic assumptions, so that if the measurements are in any way deficient, they are biased against array restructuring. We report performance as the execution times of the loops themselves plus all copying overhead (from some canonical array layout to the desired layout, and back in the case of a read-write array). In effect, this assumes that in the context of a complete application, the overhead is paid whenever execution enters or leaves the loop in question. If that is not the case (e.g., the cost is amortized over multiple loop executions), performance would exceed what our results suggest. Also, in a complete application, a loop's performance is affected by what data happen to be left in the cache by preceding accesses, while what the loop itself leaves in the cache may likewise affect the performance of subsequent execution. We expect this effect to be inconsequential in the experiments because most arrays are at least several times the cache size and accessed many times. Therefore, the cost of filling even the entire cache at such transitions, both on entry into and exit from the loop nest in question, is small relative to the total cost of accesses inside the loop nest.

All experiments reported in this chapter were done on a DEC 3000 Model 400 workstation based on the Alpha 21064 processor [DEC 1992; Dutton et al. 1991; Sites 1992]. (Results on an IBM RS/6000 Model 41T workstation are included in Appendix B. We omit them here since they are qualitatively the same as those reported below.) Our configuration has two levels of cache for data: an on-chip, 8 KB, write-through data cache, and an off-chip, 512 KB, write-back, unified cache shared by data and instructions. Both levels of cache are direct-mapped. Cache lines are 32 bytes long. On a read miss in the first-level cache, 5 cycles are required to read the accessed data from the second-level cache and another 5 cycles to fill the rest of the cache line. A write miss adds another 5 cycles. On a miss in the second-level cache, it takes 24 cycles to read the accessed data from main memory and another 6 cycles to fill the other half of the cache line. The Alpha 21064 processor also has a 32-entry, fully associative translation lookaside buffer (TLB) for data pages.
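For a rough sense of what these latencies cost (a first-order estimate of ours that simply adds the listed numbers and ignores line fills, writes, and TLB misses), the average read latency can be written as

\[
  t_{\mathrm{read}} \;\approx\; t_{\mathrm{hit}} \;+\; m_{\mathrm{L1}}\,\bigl(5 + m_{\mathrm{L2}} \cdot 24\bigr)\ \text{cycles},
\]

where m_L1 and m_L2 are the local miss rates of the two cache levels. Even modest miss rates at both levels make the penalty terms dominate the cost of a hit, which is why the locality of the array accesses matters so much on this machine.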

The loops were compiled by our prototype compiler. The output C code was compiled by the native C compiler (cc) and linked with our runtime system. Standard compiler optimizations (-O2) were enabled in both steps. In addition, we augmented SUIF with two simple passes implementing limited versions of two known optimizations unrelated to array restructuring. Both were applied in all cases, independently of whether any loop or array was restructured. The first optimization is scalar replacement [Bacon, Graham, and Sharp 1994]. This optimization replaces an array access in the innermost loop(s) with the use of a temporary scalar variable, whose value is loaded from the accessed element before the loop and stored back afterward if modified. The second optimization replaces loop-invariant upper loop bound expressions in the generated C code with temporary variables storing the pre-computed values to avoid evaluating the expressions in every loop iteration, as implied by the semantics of the for construct in C. We added these two optimizations because they proved critical to performance (with or without loop or array restructuring) in some of the loops but, for some reason, were not performed by SUIF or the native C compiler on those loops. Omitting them would have unduly and significantly impacted performance in a way unrelated to either loop or array restructuring, making it much more difficult to assess their effects.
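The shape of these two transformations is illustrated by the following C fragment (our own example, not code produced by the passes):

    /* Before: c[i] is loaded and stored in every iteration of the inner loop,
     * and the bound expression n - 1 is re-evaluated each time, as the
     * semantics of the C for construct imply. */
    void before(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n - 1; j++)
                c[i] += a[i * n + j] * b[j];
    }

    /* After scalar replacement and bound hoisting: c[i] lives in a scalar for
     * the duration of the inner loop and is stored back once, and the bound
     * is computed once per outer iteration. */
    void after(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++) {
            float ci = c[i];          /* load the element once */
            int ub = n - 1;           /* hoist the loop-invariant bound */
            for (int j = 0; j < ub; j++)
                ci += a[i * n + j] * b[j];
            c[i] = ci;                /* store back once, since modified */
        }
    }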

7.2 Array Restructuring

The first, most natural question is whether array restructuring improves performance at all, especially in view of the potential for substantial copying overhead at run time. Our experiments showed that it did in many of the cases. In this section, we also discuss what array transformations the compiler applied in individual cases and give insights into how they improved performance in sometimes unanticipated ways.

7.2.1 General Results

The loops listed in Table 7.1 fall into three categories according to our compiler's array restructuring decisions. Recall that the compiler heuristically decides whether or not an array should be restructured by comparing the potential benefit and cost. For five of the loops, it decided that array restructuring could be applied profitably despite the potential overhead. Their performance is shown in Figure 7.1. For another three, the compiler determined that loop execution would benefit from array restructuring, but the copying overhead might be too large to justify it. The performance of these loops is shown in Figure 7.2. Finally, for one of the loops (specifically EMIT of the NASA7 kernels), the compiler saw no opportunity to improve the original array layouts. Therefore, no array was restructured. Previous studies likewise found virtually no improvement using other program restructuring techniques [Carr, McKinley, and Tseng 1994; Li 1993; Wolf 1992]. For this reason, we did not measure the performance of this loop; it is omitted throughout the following discussion.

Both Figure 7.1 and Figure 7.2 show how execution times vary with problem sizes for the original loop (i.e., with standard compiler optimizations but no loop or array restructuring) and for array restructuring — one curve for loop execution alone and one including runtime overhead. This overhead includes the cost of copying elements from the original to the restructured arrays, and back if necessary. Time spent by the runtime system to complete the analysis (see Section 6.1) is also attributed to the runtime overhead. In practice, however, it is trivial compared with the copying cost.

First, let us consider Figure 7.1. Generally, array restructuring improved performance substantially. In all cases except SYR2K, execution times were roughly halved. For SYR2K, the improvement was more dramatic: execution times decreased by almost a factor of seven.

GMTRY, CFFT2D, and CHOLSKY contain some loop nests for which our prototype compiler decided that the default array layouts were already best and hence did not restructure any array. The execution times shown here and in the remainder of this dissertation do not include contributions from such loop nests (though we did compile and execute the entire routines in all experiments). This allows us to see clearly the direct performance impact of array restructuring when it was indeed applied. The results shown here are qualitatively representative of the total execution times, however, as the part that array restructuring did affect dominated the whole computation in each case. For example, Figure 7.3 shows the execution times of the complete routines, including both the parts affected and unaffected by array restructuring. These timings exhibit the same trend as those for the affected part alone (in Figure 7.1), although the percentage improvement is, of course, lower because of contributions from the unaffected loop nests.

[Figure 7.1: Array Restructuring Performance — Profitable Cases. Execution time (seconds) versus problem size for (a) MATMUL (matrix order), (b) SYR2K (matrix bandwidth), (c) GMTRY (matrix order), (d) CFFT2D (array dimension), and (e) CHOLSKY (matrix order). Curves: original loop (BASE), array restructuring without overhead (ARR), and array restructuring with overhead (ARR+O); points mark the size threshold for the L2 cache. In several panels the ARR+O curve obscures the ARR curve.]

[Figure 7.2: Array Restructuring Performance — Unprofitable Cases (with Manual Override of Compiler Decision). Execution time (seconds) versus problem size for (a) MXM (matrix order), (b) BTRIX (matrix order), and (c) VPENTA (matrix order). Curves: original loop (BASE), array restructuring without overhead (ARR), and array restructuring with overhead (ARR+O); points mark the size threshold for the L2 cache. In panel (a) the ARR+O curve obscures the ARR and BASE curves.]

[Figure 7.3: Array Restructuring Performance — Complete Execution Times. Execution time (seconds) versus problem size for (a) GMTRY (matrix order), (b) CFFT2D (array dimension), and (c) CHOLSKY (matrix order). Curves: original loop and array restructuring (with overhead).]

Performance improved despite the runtime overhead. For MATMUL, SYR2K, and GMTRY, the overhead was negligible. In fact, it was so small that including the overhead did not visibly change the performance curves. In the cases of CFFT2D and CHOLSKY, the overhead did have a noticeable impact. However, it remained much smaller than the performance gain in loop execution. Therefore, array restructuring still increased overall performance significantly even though the overhead was nontrivial.

Furthermore, performance improvement was consistent over a wide range of problem sizes, not just for particular sizes. We expect this to hold for problems even larger than those we measured: array restructuring would still improve performance substantially, and the overhead would still have little impact. In fact, the overhead would decrease relative to execution times. This is because asymptotically the overhead grows more slowly than the performance gain. The copying overhead is roughly proportional to the amount of data copied, and thus to the size of the arrays being restructured (assuming, in the worst case, that entire arrays are restructured). The performance gain, however, is expected to grow at least linearly with the computation (or, more precisely, the part involving the array accesses in question). In these five loops, the total array size grows asymptotically slower than the computation. Our cost-and-benefit heuristics attempt to designate only arrays and loops with this property as candidates for array restructuring.
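To make the asymptotics concrete, take a dense matrix multiply of order n as an assumed example: restructuring copies each of the Theta(n^2) array elements a constant number of times, while the loop performs Theta(n^3) accesses whose cost the new layout reduces, so

\[
  \frac{\text{copying overhead}}{\text{performance gain}}
  \;=\; \frac{\Theta(n^{2})}{\Theta(n^{3})}
  \;=\; \Theta\!\left(\frac{1}{n}\right) \longrightarrow 0
  \qquad \text{as } n \to \infty .
\]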

Figure 7.2 shows the execution times of loops for which our compiler decided not to restructure any array because of concerns over copying overheads. Manually overriding the compiler to restructure arrays once for the entire routine, we obtained the measurements shown in the figure, to see whether that decision was correct and how much performance would have suffered if our heuristics had designated these cases as profitable.

Not surprisingly, array restructuring did not improve overall performance in any significant way. For BTRIX and VPENTA, it did roughly halve the execution times of the loops themselves. However, the copying overhead was also substantial, as expected, offsetting most of the gain. Incidentally, in these cases performance was improved by array restructuring despite such overhead, at least for the larger of the problems that we measured (but not for some small problems, such as VPENTA at 100). We expect these observations to hold for even larger problems that we did not measure, because in these loops both the total size of the arrays being restructured and the computation involving the accesses in question grow at the same rate.

Finally, Table 7.2 shows the memory overheads of array restructuring, namely the sizes of the restructured arrays. Only a single problem size is shown for each loop as the relative overheads are independent of absolute array sizes. The data exclude arrays that are not restructured. In all cases except SYR2K, the restructured arrays are as large as the original because the compiler decided, in effect, to permute array dimensions. For SYR2K, the index transformation skews the array index space. Some memory is left unused to facilitate index computation, as discussed in Chapter 4. Partial restructuring could have reduced the size of the restructured array to the exact size of the original. However, our current implementation failed to do so because of runtime constants in loop bounds.

Table 7.2: Sizes of Original and Restructured Arrays

Loop        Size of Original Array(s) (MB)    Size of Restructured Array(s) (MB)
MATMUL      0.95                              0.95 (100%)
SYR2K       6.10                              6.41 (105%)
MXM         0.95                              0.95 (100%)
GMTRY       0.95                              0.95 (100%)
CFFT2D      8.00                              8.00 (100%)
CHOLSKY     9.54                              9.54 (100%)
BTRIX       11.6                              11.6 (100%)
VPENTA      6.10                              6.10 (100%)

To summarize, when our compiler decided that array restructuring would be profitable, performance improved substantially for a wide range of problem sizes despite modest, sometimes insignificant, copying overheads. The compiler also successfully identified cases in which array restructuring would not be profitable because of the overhead. If we had performed array restructuring in those cases, performance would have improved only slightly, if at all.

7.2.2 Individual Cases

Array transformations used in individual cases and the problems they address are summarized in Table 7.3. We discuss in detail only the case of SYR2K because it demonstrates an unanticipated effect of array restructuring on something beyond cache misses.

We believe that array restructuring dramatically improved SYR2K’s performance mainly because it reduced TLB misses. Like the reduction in cache misses, this resulted from accessing elements consecutively in memory.

To see this, let us examine the access pattern of SYR2K, shown in Figure 7.4. We consider only the accesses to array X; those to Y are identical. X has long rows and relatively short columns. Both accesses to X go through it diagonal by diagonal. Although this suggests, correctly, a need to improve spatial locality, the impact on cache misses is smaller than it seems because cache lines are generally reused soon. Consider the access X[j-k+b,k]. Little data reuse is expected between iterations of the innermost loop because they read elements in a diagonal, which are far apart. However, the middle loop carries spatial reuse. When the innermost loop is next executed (as the next middle loop iteration), the access moves on to the next diagonal. The new diagonal's elements are probably in the same cache lines as the old because the two are adjacent. As the innermost loop has relatively few iterations (only tens in the experiments), cache lines containing the old diagonal are likely to remain in the cache. Furthermore, temporal reuse is carried by the outermost loop. In any one iteration of the outermost loop, the access never moves beyond a fixed column, touching a triangle of elements. When we execute the next iteration, this column and hence the triangle shift right by one. Therefore, consecutive triangles overlap considerably: in any one triangle, only elements in the boundary column have not been read before. The rest probably have stayed in the cache since the last iteration, because each iteration has to displace only a small number of cache lines to make room for newly read elements, namely the boundary column. In fact, even an entire triangle contains at most a thousand or so elements, which occupy just a fraction of common cache capacities. The pattern of the other access, X[i-k+b,k], is similar except that temporal reuse is carried by the middle loop: each outermost loop iteration reads one diagonal repeatedly (hence the thickness of the arrows in the diagram).

Table 7.3: Problems and Solutions in Individual Cases

Loop / Problematic Access Pattern (a) / Solution by Array Restructuring

MATMUL
  Problem: The innermost loop reads one array column by column.
  Solution: Transpose the array so that the innermost loop accesses it row by row.

SYR2K
  Problem: The innermost loop reads two arrays diagonally.
  Solution: Skew the array index space to map diagonals in the original arrays to rows in the restructured arrays.

MXM
  Problem: An access in the middle loop (of a triply nested loop) reads an array column by column.
  Solution: Transpose the array so that it is read row by row.

GMTRY
  Problem: The implementation of Gaussian elimination updates a submatrix column by column.
  Solution: Transpose the array so that the corresponding submatrix in the restructured array is updated row by row.

CFFT2D
  Problem: One of two phases performs FFT on columns; the innermost loop therefore updates a two-dimensional array column by column.
  Solution: Transpose the array dynamically between phases so that it is always updated row by row.

CHOLSKY
  Problem: The solution phase updates a three-dimensional array for the right-hand sides not in row-major order: the innermost loop increments the middle array index.
  Solution: Permute array indices so that the array is accessed in "almost" row-major order: the innermost loop increments the last index, middle loops increment middle indices, etc.

BTRIX
  Problem: Four four-dimensional arrays are not accessed in row-major order.
  Solution: Permute array indices so that all four arrays are accessed in "almost" row-major order. For one of them, the restructured array contains only the plane of the original that is really accessed.

VPENTA
  Problem: All six two-dimensional and one three-dimensional arrays are not accessed in row-major order.
  Solution: Permute array indices so that all arrays are accessed in strictly row-major order.

EMIT
  Problem: None is found.
  Solution: None is required.

a. In line with the convention in this dissertation, we describe the problematic access patterns and their solutions in terms of row-major arrays, although the arrays are column-major as the source code is in Fortran. Thus, when we, for example, say that a loop accesses a two-dimensional array (which is implicitly row-major) "column by column," we mean that it strides through the least rapidly varying array dimension. For a column-major array in Fortran, this would mean going through elements of a row. This is only a matter of terminology; the Fortran code itself was not modified or affected in any way.

Since cache misses do not seem to have enough impact to account for the performance difference observed, we believe TLB misses to be a more likely cause of SYR2K's poor performance. Notice that consecutive innermost loop iterations may touch different pages because they read elements separated by a row long enough to span a page or more. A row, and therefore page, is touched again only after a complete execution of the innermost loop. If there are not enough TLB entries for all the rows in this and other arrays, each access potentially suffers a costly TLB miss. This seems to be our case: the Alpha 21064 processor has 32 TLB entries for data [DEC 1992], fewer than the total number of rows in all the arrays. (Page faults cannot explain the poor performance because our measurements indicated no unusual paging activity.)

Although we had no way to directly measure TLB misses, the following experiment lent support to our hypothesis. We measured the original loop for different values of b (see the top of Figure 7.4 for b). The results are shown in Figure 7.5. By reducing the value of b, we reduce the number of rows the innermost loop touches while keeping each row long enough to need a separate TLB entry. Performance would suffer if there are not enough entries for all the rows touched. For a TLB with 32 entries, we expect the threshold to be 7 or 8 because each execution of the innermost loop touches some elements of 4*b-1 rows (all 2*b-1 rows of array X, the same for Y, and one row of Z). Figure 7.5 shows a discrete change at the expected point: between 7 and 8, execution time rises by 120%, much more than can be explained by the corresponding 30% increase in the number of iterations, and the execution time rises much faster thereafter.
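The expected threshold follows from a back-of-the-envelope count (our own arithmetic; it ignores the handful of additional TLB entries needed for the rest of Z, the code, and the stack):

\[
  \text{rows touched per innermost-loop execution}
  \;=\; \underbrace{(2b-1)}_{X} + \underbrace{(2b-1)}_{Y} + \underbrace{1}_{Z}
  \;=\; 4b - 1 ,
\]
\[
  4 \cdot 7 - 1 = 27 < 32 , \qquad 4 \cdot 8 - 1 = 31 \approx 32 ,
\]

so b = 8 is roughly where the 32-entry TLB can no longer hold a translation for every row in the working set, which matches the jump observed between b = 7 and b = 8.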

Array restructuring reduced TLB misses by making array accesses go through elements consecutively in memory. The resulting access pattern is shown at the bottom of Figure 7.4. The innermost loop reads a row of elements, which are in adjacent cache lines and likely to be in the same page. Moreover, the triangle of elements touched by one outermost loop iteration covers half the elements in consecutive, short rows rather than a few elements per short row (and presumably per page) scattered around the array. As the triangle shifts down the restructured array, only a few TLB entries are needed for the working set at a given time. This would have helped paging behavior as well had it been a problem.

Declare X[2*b-1,n], Y[2*b-1,n], Z[2*b-1,n]
FOR i = 1, n
  FOR j = i, max(i+2*b-2,n)
    FOR k = max(1,i-b+1,j-b+1), min(n,i+b-1,j+b-1)
      Z[j-i+1,i] += X[j-k+b,k] * Y[i-k+b,k] + X[i-k+b,k] * Y[j-k+b,k]

[Diagram: the accesses X[j-k+b,k] and X[i-k+b,k] sweep the original array diagonal by diagonal.]

becomes

Declare X2[n,2*b-1], Y2[n,2*b-1], Z[2*b-1,n]
FOR i = 1, n
  FOR j = i, max(i+2*b-2,n)
    FOR k = max(1,i-b+1,j-b+1), min(n,i+b-1,j+b-1)
      Z[j-i+1,i] += X2[j+b,-j+k-b] * Y2[i+b,-i+k-b] + X2[i+b,-i+k-b] * Y2[j+b,-j+k-b]

[Diagram: the accesses X2[j+b,-j+k-b] and X2[i+b,-i+k-b] sweep the restructured arrays row by row.]

Figure 7.4: Access Pattern of SYR2K

[Figure 7.5: Performance Variation of SYR2K. Execution time (seconds) of the original loop for values of the variable b from 3 to 10.]

7.3 Comparing Array Restructuring and Loop Restructuring

In addition to evaluating array restructuring itself, we also compared array restructuring with common loop restructuring techniques that also aim at improving locality, spatial as well as temporal. Before discussing the results, we first describe the comparison methodology.

7.3.1 Methodology

We compared the execution time of each loop after array restructuring with that after loop restructuring. To apply array restructuring, we simply compiled each loop with the prototype, manually overriding the compiler's decision not to restructure because of potentially high overhead in some of the loops. On the other hand, loop restructuring was done manually as described below.

Each loop was restructured by manually changing its Fortran source. We then compiled it with SUIF (without array restructuring). In each case, the best loop transformation was chosen manually using various loop restructuring techniques in the literature and their combinations. Transformations were both chosen and applied manually because we would like to consider a wide range of techniques in the literature, but no single compiler to our knowledge has implemented all of them. Moreover, any particular implementation of a general technique might fail to transform a given loop because of limitations incidental to the implementation but not inherent to the technique itself. The transformations considered include loop interchange (or permutation, for loops more than doubly nested) [Allen and Kennedy 1984; Kennedy and McKinley 1992], skewing [Wolfe 1989b], reversal [Bacon, Graham, and Sharp 1994; Wedel 1975], scaling [Li and Pingali 1993c], fusion, and distribution [Carr, McKinley, and Tseng 1994; Kennedy and McKinley 1990, 1994].
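As a reminder of what the simplest of these transformations does, the following C fragment (a generic example of ours, not one of the experimental loops) shows loop interchange turning strided accesses into unit-stride ones:

    /* Original: the inner loop walks a column of the row-major array a,
     * striding by n elements between consecutive accesses. */
    void original(float *a, const float *x, int n)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                a[i * n + j] += x[j];
    }

    /* Interchanged: the inner loop now walks a row of a contiguously.  The
     * interchange is legal here because every iteration touches a distinct
     * element of a and only reads x. */
    void interchanged(float *a, const float *x, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i * n + j] += x[j];
    }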

To calibrate the quality of the loop restructuring we did by hand, we also compared our manually selected transformations with those performed by the POWER Fortran Accelerator (PFA) on a Silicon Graphics Power Challenge [SGI 1995b]. We found the manual restructuring to be as good as, and sometimes better than, what PFA did. PFA is a source-to-source code optimizer. It performs a number of advanced loop transformations automatically to maximize locality and parallelism. Those relevant to locality include loop interchange, fusion, blocking, and unroll-and-jam [SGI 1995b]. In five of our loops, PFA performed basically the same transformations as we did. In the other three, PFA did not perform any transformation. (Without PFA's source code, it cannot be determined whether PFA did not transform these loops because it could not or because it decided against any change.) In two of these three loops (SYR2K and CHOLSKY), our manual restructuring based on other existing techniques improved performance significantly. In the last remaining case (CFFT2D), we did not transform the loops manually because it seems extremely unlikely that they could be automatically transformed by a compiler with existing technology. The major difficulty is the use of indirection arrays for indexing read-write data arrays, which frustrates dependence analysis.

Experimental results are shown in Figure 7.6. The bar chart shows the execution times of each loop after different kinds of processing, all normalized to the execution time of the original loop. Normalized, rather than absolute, execution times are shown because we wish to consolidate the results for all loops in one chart to facilitate discussion, but the execution times of different loops differ vastly. In the graph, all timings for array restructuring include two components: loop execution and data copying. "Manual loop restructuring" refers to the manually chosen loop transformations; "automatic loop restructuring" refers to those selected by PFA automatically. The latter gives us some indication of what one particular production-quality optimizing compiler based on the latest technology can achieve. Performance resulting from other compilers may, of course, vary. The problem size of each loop is at the top of the range used in previous experiments (which have been reported in Figure 7.1 and Figure 7.2).

7.3.2 Results

Generally, the results suggest that array and loop restructuring complement each other. More specifically, the loops fall into three categories based on the comparison between the two types of techniques.

For MATMUL, SYR2K, and CFFT2D, array restructuring excelled in applicability as well as performance. It either applied where loop restructuring did not (CFFT2D) or produced better performance despite the copying overhead. MXM is a unique case. Array restructuring did not improve performance; it performed better than loop restructuring only because the latter degraded performance "unexpectedly." MXM is a hand-tuned implementation of matrix multiplication in which the outermost loop has been unrolled four times. We deliberately permuted the loops in a way that would be optimal if the unrolling were absent, expecting that a compiler would do the same. The SGI PFA confirmed our expectation.

For CHOLSKY and GMTRY, array restructuring and loop restructuring resulted in comparable performance improvement. For CHOLSKY, however, array restructuring arguably performed better because it seems difficult to automatically select the complicated sequence of loop transformations required to get the reported performance. Difficulties include imperfect nesting at almost every loop level and certain array indices and loop bounds that involve multiple loop variables. The entire sequence was chosen with considerable experimentation and human intuition, even though it merely consists of well-known transformations including loop interchange, fusion, and distribution. It effectively turned the loop nest completely "inside out." The SGI PFA did not transform this loop nest. Therefore, array restructuring achieved better performance than automatic loop restructuring, although it was only comparable to manual loop restructuring.

[Figure 7.6: Comparing Array and Loop Restructuring. Normalized execution time (relative to the original loop) for MATMUL, SYR2K, MXM, CFFT2D, CHOLSKY, GMTRY, BTRIX, and VPENTA under four treatments: original loop, array restructuring (loop execution plus copying overhead), manual loop restructuring, and automatic loop restructuring. For CFFT2D, the loop was not restructured by either form of loop restructuring.]

For BTRIX and VPENTA, loop restructuring was undoubtedly the better approach. Although array and loop restructuring both reduced loop execution times to the same extent, loop restructuring need not incur the runtime overhead of copying array elements. Moreover, the loop restructuring done in both cases was simple: we merely interchanged loops in a few perfect loop nests. The automatic optimizer performed the same transformations with little difficulty.

Finally, we note that the prototype compiler successfully identified the cases where it would be wiser to do something else than to restructure arrays. As mentioned earlier, the compiler decided against array restructuring for three loops: MXM, BTRIX, and VPENTA. From Figure 7.6, we see that for MXM it is best not to restructure anything, loop or arrays. The original version is already highly optimized, at least for the particular architecture in these experiments. (The graph may suggest that array restructuring performed equally well, but in fact it performed slightly worse because of the overhead, although the difference is too small to be visible.) As for BTRIX and VPENTA, some simple loop restructuring suffices to decrease execution times significantly.

To sum up, in these results array restructuring complemented loop restructuring. It applied where loop restructuring did not. When both applied, array restructuring performed comparably, sometimes better. In cases where it did not perform as well, some simple loop restructuring would have been good enough. This strongly suggests that combining the two can potentially improve performance even further.

However, their complex interaction makes exploiting this potential nontrivial. For example, suppose we first apply automatic loop restructuring and then array restructuring. One might expect some extra performance improvement over applying either alone. This, however, will not materialize. As we have noted, out of the eight loops in the experiments, five were automatically transformed by PFA and three were not. In the latter group of loops, naturally, combined loop and array restructuring would have the same performance as array restructuring alone, since it is array restructuring alone. As for the former group, our prototype would not transform the arrays any further because they already have good spatial locality after loop restructuring. Worse, for MATMUL and MXM, better performance could have resulted from simply applying array restructuring to the original loop. The same would also happen in the case of SYR2K if the results for manual loop restructuring are considered. The reason for this is that in trying to improve spatial locality, which array restructuring could have done without jeopardizing temporal locality, loop restructuring sacrifices some temporal locality — a loss that subsequent array restructuring cannot recover. More work is required to fully understand how the two types of restructuring should be integrated.

7.4 Array Restructuring and Tiling

Array restructuring also complements loop tiling, a powerful locality optimization technique for nested loops [Lam, Rothberg, and Wolf 1991; Wolf 1992; Wolf and Lam 1991a]. This section discusses how the two interact.

We experimentally compared three forms of each loop: tiling alone, tiling with loop restructuring, and tiling with array restructuring. In each case, the innermost loop(s) were tiled manually by changing the Fortran source2. For loop restructuring, the loops were restructured and then tiled, as in previous work [Li 1995; Wolf and Lam 1991a]. For array restructuring, however, we tiled the original loop and then applied array restructuring because we could not manually tile the output code of the prototype compiler. Tiling the loops first also agrees with this dissertation's convention of applying array restructuring last (see Section 6.2.2). We expect the results would be similar if array restructuring had been applied first, because our algorithm chooses array transformations based on the directions in which innermost loops go through the index space, and loop tiling does not alter those directions. We measured execution times for a range of tile sizes. Problem sizes were the same as those used in Section 7.3. We did not apply tiling to CFFT2D because its imperfect nesting and use of indirection arrays would likely frustrate any form of automatic loop restructuring.
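For reference, tiling an inner loop has roughly the following shape (a generic sketch with an assumed tile size T, not the code used in the experiments):

    /* The j loop of a simple two-deep nest is split into a loop over tiles
     * and a loop within a tile.  The T elements of b touched by one tile can
     * then stay in the cache while every iteration of the i loop reuses
     * them. */
    void tiled(float *c, const float *a, const float *b, int n, int T)
    {
        for (int jj = 0; jj < n; jj += T) {
            int jmax = jj + T < n ? jj + T : n;
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jmax; j++)
                    c[i] += a[i * n + j] * b[j];
        }
    }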

Figure 7.7 shows the results. Execution times are plotted against tile sizes. In each case, the performance for the largest tile size is effectively that of the untiled version of the same loop because the largest tile size equals the number of iterations3, which means there is only one big tile for that loop. In some cases, tiling was not appropriate after loop or array restructuring because locality had improved so much that subsequent tiling brought no further benefit. To reflect this in the graphs, we show the execution time of the untiled loop as a horizontal line without individual data points for different tile sizes.

From Figure 7.7, we see that tiling improved the performance of the original loop substantially, as previous studies have also found [Lam, Rothberg, and Wolf 1991; Li 1995]. However, without loop or array restructuring, the performance improvement depended very much on the tile size. The performance curves are generally U-shaped, as Figure 7.7(a) demonstrates most evidently. Execution time decreases with increasing tile size, reaches a minimum, and then rises again. This observation also agrees with previous findings [Lam, Rothberg, and Wolf 1991; Li 1995]. Some researchers have therefore investigated how to choose an appropriate tile size for best performance [Coleman and McKinley 1995; Esseghir 1993; Lam, Rothberg, and Wolf 1991].

2. We applied tiling by hand even though SUIF itself has implemented this technique because its current implementation is insufficient for our purpose. The compiler tiled none of the loops when loop bounds were runtime constants, possibly because the compiler decided against tiling when it could not ascertain whether the loops were large enough to justify it. With loop bounds set to large compile-time constants, three (MATMUL, MXM, and one pair of inner loops in CHOLSKY) of the eight loops were tiled automatically.

3. The only exception is SYR2K after loop restructuring. Its innermost loop has many times more iterations than that of the original version. If we were to show this iteration count as the largest tile size (as we do in all other cases), either the x-axis would be awkwardly long or the small performance variation with tile size would become illegible.

[Figure 7.7: Array Restructuring and Tiling. Execution time (seconds) versus tile size for (a) MATMUL, (b) SYR2K, (c) MXM, (d) GMTRY, (e) CHOLSKY, (f) BTRIX, and (g) VPENTA. Curves: tiling alone, tiling plus loop restructuring, and tiling plus array restructuring (with overhead). Where tiling is not appropriate, the execution time of the untiled loop is shown as a horizontal line without data points. CFFT2D is omitted because it cannot be tiled automatically.]

As Figure 7.7 shows, array and loop restructuring sometimes made the performance of the tiled loops less sensitive to tile size. Performance still deteriorated when tiles were too small, presumably because of excessive loop overhead. However, performance degradation for large tiles was mitigated. In several cases, tiling brought no further improvement after arrays had been restructured; the "optimal" tile was the largest possible, namely all the iterations. Loop restructuring had a similar effect, as a previous study has also observed [Li 1995]. However, it is always valid to apply any array restructuring and tiling together. Since array restructuring does not alter the loop structure, it never makes an originally tilable loop untilable. The same cannot be said of loop restructuring in general.

7.5 Summary

In our experiments, array restructuring by itself improved performance significantly in most cases, with a modest, sometimes insignificant, runtime overhead. In other cases, however, the overhead was too high for the technique to be beneficial. The prototype compiler was able to distinguish between the two. We also found that array restructuring complemented loop restructuring: it applied to cases where the latter did not and performed comparably when both applied; when it did not perform as well, some simple loop restructuring would have sufficed. Moreover, array restructuring also complemented loop restructuring in that it made tiling performance less dependent on the size of tiles.

Chapter 8

Parallel Execution

This dissertation focuses on restructuring arrays to improve locality for loop execution on a single processor. However, array restructuring may also help parallel execution. At the very least, by improving the locality of each processor's access pattern, it can decrease the execution times on individual processors and thus the overall execution time as well. Moreover, changing array layouts may alleviate false sharing, a problem that arises from a mismatch between data layouts and inter-processor access patterns and has been successfully addressed by means of data transformations [Jeremiassen and Eggers 1995]. Therefore, we also carried out some experiments for parallel execution to gauge how our array restructuring technique affects performance in the parallel case. Results are discussed in this chapter.

8.1 Experiments and Results

The experiments were performed on a Silicon Graphics Power Challenge [SGI 1995a]. (Results on a Kendall Square Research KSR-2 are included in Appendix B. They resemble the Power Challenge results here.) The Power Challenge is a shared-memory multiprocessor based on MIPS R8000 processors. Each processor has two levels of cache for integer data, but only one for floating point data. The first-level cache is an on-chip, 16 KB, write-through, direct-mapped cache with a line size of 32 bytes. However, it is used only for integer operations, while all the loops in our experiments perform primarily floating point operations. Floating point loads and stores are satisfied directly by a second-level cache: an off-chip, 4 MB, write-back, four-way set-associative cache with a line size of 512 bytes. The MIPS R8000 also has a 384-entry three-way set-associative translation lookaside buffer (TLB) shared by instructions and data. (All background information in this paragraph is from SGI's on-line documentation [SGI 1995a].)

As before, we first compiled the Fortran code using our prototype compiler, and then the output C code using Power Challenge's native C compiler (cc), both with standard compiler optimizations (-O2). SUIF automatically parallelized the loops where it identified such opportunities and inserted calls to its runtime system for parallel execution. The object code was finally linked with SUIF's and our runtime systems.

Figure 8.1 shows the results for loops that our prototype compiler considered profitable; Figure 8.2 shows those for the unprofitable cases with manual override of the compiler's decision against array restructuring. Each graph plots the parallel speedup against the number of processors for several cases. In addition to the usual categories explained in the previous chapter, all of which involve SUIF in some way, we also measured the loops directly parallelized and compiled by Power Challenge's native Fortran compiler. This allows us to further validate any trend that may be observed from the SUIF-based results. We do not, however, quantitatively compare the two sets of results because they come from two vastly different compilers, one being a commercial product and the other a research prototype.

All speedups were computed relative to the execution time of the parallelized version of the original loop (compiled by SUIF) running on a single processor. Therefore, for other versions of the same loop, the speedup may, and often does, exceed the number of processors because the speedup is in part due to reasons that are not directly related to parallel execution. Also, note that parallel speedups strictly speaking should be calculated based on the execution time of the sequential version, rather than a parallelized version running on one processor. However, our focus here is not parallelization overhead, but how array restructuring affects parallel execution. Using the sequential execution time as a basis would merely scale the vertical axes but not influence the relative trends.
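Stated as a formula (our notation), the quantity plotted for a version v of a loop on p processors is

\[
  S_{v}(p) \;=\; \frac{T_{\text{original, parallelized}}(1)}{T_{v}(p)} ,
\]

so a version that already runs faster than the original on a single processor starts above one, and its curve can rise above p.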

[Figure 8.1: Parallel Speedups on SGI Power Challenge — Profitable Cases. Speedup versus number of processors (up to eight) for (a) MATMUL, (b) SYR2K, (c) GMTRY, (d) CFFT2D, and (e) CHOLSKY. Curves: linear speedup, original loop, native Fortran compiler, manual loop restructuring, and array restructuring (with overhead). All speedups are relative to the execution time of the original loop on one processor.]

[Figure 8.2: Parallel Speedups on SGI Power Challenge — Unprofitable Cases (with Manual Override of Compiler Decision). Speedup versus number of processors (up to eight) for (a) MXM, (b) BTRIX, and (c) VPENTA. Curves: linear speedup, original loop, native Fortran compiler, manual loop restructuring, and array restructuring (with overhead). All speedups are relative to the execution time of the original loop on one processor.]

8.2 Discussion

Let us first consider Figure 8.1, which shows the loops for which the prototype compiler considered array restructuring profitable. Array restructuring did substantially improve performance over the original loop. This agrees with our previous observation on uniprocessor execution. Array restructuring achieved this in two ways: improving the spatial locality of execution on each processor and reducing false sharing between processors.

Better spatial locality accounts for the performance improvement in MATMUL, SYR2K, and CFFT2D, as in the workstation results reported in the last chapter. MATMUL and SYR2K are both perfect loop nests that can be parallelized easily. In both, the outermost loop is parallel, and its iterations have little interaction. Thus, all the speedup curves are almost linear. Array restructuring improved performance by roughly the same factor on any number of processors. It was, incidentally, only applied to read-only arrays. The improvement could not have come from reduced false sharing. Instead, it resulted from better spatial locality of loop execution on each individual processor. Essentially, what we expect on a single processor occurred independently on each processor of the multiprocessor. As for CFFT2D, the situation was similar, except that the restructured array was both read and written, but still with no false sharing. Performance did not scale well with more processors because the compilers (SUIF as well as the native Fortran compiler) were able to parallelize only the innermost loop.

False sharing comes into play for GMTRY and CHOLSKY. For GMTRY, on one processor, performance for the four loop versions was comparable. On eight processors, however, the original loop performed much worse than the other three. This was due to false sharing. In GMTRY, the parallelized loop writes a column of elements (in a two-dimensional array), which are stored consecutively according to the Fortran convention. False sharing may result if different processors write different elements in the same cache line. Both loop and array restructuring can remedy this problem, as evidenced in Figure 8.1(c) by the comparable performance of all but the original loops. As in the workstation results, array and loop restructuring had similar performance. Array restructuring in effect transposes the array, causing the parallelized loop to write elements far apart, and a sequential loop within it to write consecutive elements. This both improves spatial locality and reduces false sharing. Loop restructuring (done manually as well as by the native Fortran compiler) can achieve a similar effect with loop distribution followed by interchange.
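The false-sharing pattern can be pictured with the following sketch (generic C of ours with an imagined parallel loop; it is not the GMTRY source):

    /* Suppose iterations of i are distributed across P processors.  In
     * write_adjacent, consecutive iterations write adjacent words, so several
     * processors update the same cache line and the line migrates between
     * them even though no word is actually shared.  In write_far_apart, the
     * writes of different processors are a whole row (n floats) apart and
     * fall in different lines. */
    void write_adjacent(float *a, int n)
    {
        /* parallel for: iteration i executed by processor i mod P */
        for (int i = 0; i < n; i++)
            a[i] = 0.0f;
    }

    void write_far_apart(float *a2, int n)
    {
        /* parallel for: iteration i executed by processor i mod P */
        for (int i = 0; i < n; i++)
            a2[(long)i * n] = 0.0f;
    }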

CHOLSKY suffered a similar problem, and array restructuring offered a similar solution. Figure 8.1(e) shows hardly any parallel speedup for the original loop. False sharing between processors, poor spatial locality on each individual processor, and load imbalance (since the parallelized loop had only a few iterations) all contributed to the problem. Array restructuring solved the first two, by allowing the outermost, parallelized loop to write elements far apart and the inner loops executed entirely on one processor to go through elements consecutively. A suitable sequence of loop transformations that effectively turns the imperfect loop nest "inside out" can solve all three problems. Thus, manual loop restructuring performed much better than array restructuring, unlike in the workstation's case where the two performed comparably. However, as discussed in Section 7.3.2, it is difficult to find the right loop transformation sequence. Failing to do that, the native Fortran compiler did not improve performance.

Finally, let us briefly look at the unprofitable cases in Figure 8.2. (Recall that in these cases array restructuring was applied against the prototype compiler’s decision for experimental purposes.) Generally, the observations resemble those for single processor execution (see Section 7.3.2). For MXM, array restructuring had virtually no effect on performance, whereas loop restructuring, including that performed by the native Fortran compiler, decreased performance unexpectedly. However, loop restructuring did have an advantage at eight processors because of a better distribution of iterations among processors. For BTRIX and VPENTA, simple loop restructuring chosen by the native Fortran compiler obtained significant improvement over the original loops, whereas array restructuring performed dismally once copying overhead was taken into account.

8.3 Summary

Although our array restructuring targets execution on a single processor, it may also improve performance of parallel execution. First, it improves the spatial locality of loop execution on each individual processor. Moreover, it may reduce false sharing between processors. Laying out an array in such a way that inner loops executed entirely on one processor go through the elements consecutively tends to separate the elements accessed by parallelized, outer loops by a large distance, thus decreasing the potential for false sharing. We could in principle use array restructuring to target false sharing more specifically because false sharing, like spatial locality, is related to the layout of data in memory. However, we have not pursued this direction further in this study since our focus is on sequential execution.

Parallel execution introduces more dimensions into the problem of program restructuring. In addition to temporal and spatial locality, both important on one processor, we also have to consider parallelism, inter-processor communication, and load balance, to name just a few. Loop restructuring may affect all these factors. It is difficult to find a suitable loop transformation to balance often conflicting goals, even harder than for single processor execution. This suggests that it is even more important to combine array and loop restructuring in parallel programs than in sequential ones.

Part IV Conclusion

Chapter 9

Related Work

In this chapter, we review related work that also aims to improve cache performance. As discussed earlier, one may enhance the locality of array accesses in loops by restructuring loops or by restructuring arrays. This chapter is organized accordingly, but with an emphasis on array restructuring techniques since they are more closely related to this study.

9.1 Array Restructuring

Our survey on array restructuring techniques is divided into two parts. Since we are mainly concerned with changing the storage order of array elements, we first discuss in detail related work with a similar focus. Then, we also look at other forms of data layout transformations and optimizations.

9.1.1 Reordering Array Elements

We review three other studies of reordering array elements for better cache performance. These studies share several characteristics that differentiate them from our work. First, they use both loop and array restructuring; we focus on array restructuring. Second, they address cache performance on shared address space multiprocessors; our main concern is single processor execution. Third, they consider a small subset of possible array transformations, typically permutations of array dimensions; we explore a much larger design space. Fourth, they try to select a single transformation for each array maintained throughout multiple loop nests; we allow dynamic restructuring between loop nests. As we discuss each study, we elaborate on these and other differences.

Cierniak and Li have proposed a unified framework for loop and array restructuring [Cierniak and Li 1995]. It extends Li’s previous loop restructuring framework based on nonsingular matrices [Li and Pingali 1993c]. Essentially, they observe that for a traditional row-major or column-major array layout, the memory location of an element is an affine function of its indices. (Recall that, as discussed in Chapter 4, the scalar offset of an element into the array is the dot product of the index and linearization vectors, plus some scalar constant.) They propose that a compiler can choose from all affine functions rather than always adhere to a standard layout. This flexibility, together with the freedom to change loop structures, offers many opportunities to improve locality.
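As a small sketch of such affine layout functions, the following computes an element's offset as the dot product of its index vector with a linearization vector plus a constant; the 100 x 200 array bounds are illustrative assumptions.

```python
# Affine layout functions: offset = (linearization vector) . (index vector) + constant.
# The array bounds below are illustrative.
n1, n2 = 100, 200                      # array A(n1, n2), zero-based indices

row_major    = ([n2, 1], 0)            # offset(i, j) = i*n2 + j*1 + 0
column_major = ([1, n1], 0)            # offset(i, j) = i*1 + j*n1 + 0 (Fortran order)

def offset(layout, index):
    lin, const = layout
    return sum(l * x for l, x in zip(lin, index)) + const

print(offset(row_major,    (2, 3)))    # 403
print(offset(column_major, (2, 3)))    # 302
```

Choosing among all affine functions, rather than just these two canonical ones, is precisely the extra freedom their framework allows.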

Their array restructuring framework differs from ours in not having the notion of transformed index vectors. In their framework, the compiler chooses directly an affine function taking original array indices to memory locations; in ours, the compiler selects an index transformation that takes original index vectors to transformed index vectors, which in turn are used to compute memory locations. Any transformation in either framework can be expressed in the other. In this sense, both are equally general.

Although their framework is potentially very general, the problem of actually choosing the right combination of array layout and loop transformation is tractable only because many possibilities are ruled out a priori. Imposing several conditions on array layouts, they eventually consider only layouts that effectively correspond to permutations of array indices (in two dimensions, the possibilities reduce to row-major and column-major layouts). Also, instead of computing a loop transformation with an algorithm, as done in Li’s earlier work [Li and Pingali 1992; Li and Pingali 1993a; Li 1995], they consider a predetermined subset of transformations that they believe to be sufficient. Given these restrictions, they select a combination of array and loop transformations by exhaustively evaluating m! · 2^n combinations of candidates for legality and potential locality improvement (for an n-deep loop nest and m-dimensional array). As for multiple accesses, the compiler resolves conflicting loop and array transformation requirements from different accesses apparently also with an exhaustive search, although details were not given.

In contrast, our algorithm explores the full range of array transformations allowed by the framework, including index space “skewing” and transformations for partial restructuring. Moreover, we explore the possibilities without an exhaustive search, which would be impossible because the number of transformations to consider is infinite. For example, we have explained in Section 3.1.1 that for SYR2K, permuting array dimensions cannot solve the performance problem; the index space must be skewed. This possibility, though allowed in both frameworks, is explored and selected only by our algorithm. More generally, access patterns involving diagonal movement through the index space, often found in banded matrix computations, require similar array transformations.
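To illustrate the kind of skewing meant here, the sketch below applies an illustrative unimodular index transformation to a diagonal access pattern; it is not necessarily the exact transformation selected for SYR2K in Section 3.1.1.

```python
import numpy as np

# Illustrative index-space skewing for diagonal accesses: T maps index vector
# (i, j) to (j - i, i), so all elements on one diagonal j - i = d share the
# same first transformed index.
T = np.array([[-1, 1],
              [ 1, 0]])

d = 2                                    # walk the diagonal j = i + d
orig = np.array([[i, i + d] for i in range(5)])
new  = orig @ T.T                        # transformed index vectors

print(orig.tolist())   # [[0, 2], [1, 3], [2, 4], [3, 5], [4, 6]]
print(new.tolist())    # [[2, 0], [2, 1], [2, 2], [2, 3], [2, 4]]
```

If the restructured array stores the last transformed index contiguously, these five accesses become consecutive in memory, which no permutation of the original dimensions alone could achieve.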

The SUIF compiler restructures arrays to reduce false sharing and improve processor and spatial locality [Anderson, Amarasinghe, and Lam 1995]. The SUIF project aims at automatically parallelizing sequential programs. The compiler decides how to partition array elements and loop iterations across processors in a way that maximizes parallelism while minimizing inter-processor communication [Anderson and Lam 1993]. For machines with a shared address space but distributed memories (such as Stanford’s DASH [Lenoski et al. 1992]), one might rely on the cache coherence hardware to move data between processors as needed. However, this straightforward solution may lead to false sharing and poor processor and spatial locality. These problems arise because in the shared address space, array elements supposedly allocated to the same processor may be separated by elements belonging to other processors [Anderson, Amarasinghe, and Lam 1995].

To remedy the problem, Anderson et al. propose restructuring arrays based on their data partitions [Anderson, Amarasinghe, and Lam 1995]. Elements allocated to the same processor are put in a contiguous region in the shared address space; in effect, such contiguous regions serve as the “local memories” of processors. Arrays are restructured with two transformations: permutation permutes array dimensions; strip-mining turns one array dimension into two in much the same way as loop strip-mining turns a singly nested loop into a doubly nested one. Anderson et al. also describe a linear algebraic framework for array restructuring that resembles ours in using linear index transformations.
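A minimal sketch of strip-mining an array index follows; the block size is an illustrative assumption.

```python
# Strip-mining one array dimension into two: index i becomes the pair
# (i // B, i % B).  Elements with the same quotient form one contiguous strip,
# which is how elements assigned to one processor can be grouped together.
B = 4   # illustrative block size

def strip_mine(i):
    return (i // B, i % B)

print([strip_mine(i) for i in range(10)])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3), (2, 0), (2, 1)]
# Note that i // B is not a linear function of i, which is why this transformation
# cannot be expressed as a linear transformation of index vectors.
```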

Their work differs from ours in several ways. First, they use their framework only for representing array permutations. (Strip-mining cannot be represented by linearly transforming index vectors.) They mentioned the possibility of unimodular transformations but did not pursue it. In comparison, we fully explore unimodular and even non-unimodular invertible transformations. Second, they restructure arrays solely to bring elements allocated to the same processor together in the shared address space. They are not concerned about the storage order of those elements, expecting another compilation phase to reorder elements for better cache performance. In contrast, the storage order of array elements on a single processor is precisely our focus. In this sense, our work complements theirs. Finally, the particular sets of transformations are also different; neither includes the other.

Ju and Dietz have proposed combining array and loop restructuring to reduce cache coherence overhead in shared-memory multiprocessors [Ju and Dietz 1992]. For each loop and each array the loop accesses, the compiler estimates the cache coherence overhead for all possible combinations of loop structures and data layouts. Calculated from a detailed machine-specific cost model, these estimates are used to construct an interference graph. Nodes of the graph represent loops and arrays; edges connect loops to the arrays they access and are labeled with a cost table containing the above estimates. The compiler selects loop structures and array layouts to minimize the sum of the edge costs, which is the expected total cache coherence overhead. Since all combinations of loop structures and data layouts are exhaustively searched (with some pruning), in practice the compiler can afford to consider only a small number of possibilities [Ju and Dietz 1992]. In the only example given, arrays are either row-major or column-major, and loops are interchanged.
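The following is a hedged sketch of the kind of exhaustive cost minimization over an interference graph described above. The loops ("L1", "L2"), arrays ("A", "B"), and all cost numbers are invented for illustration; the actual compiler derives its estimates from a detailed machine-specific model of cache coherence overhead.

```python
from itertools import product

loop_options  = {"L1": ("original", "interchanged"), "L2": ("original", "interchanged")}
array_options = {"A": ("row-major", "column-major"), "B": ("row-major", "column-major")}

# One cost table per edge (loop, array it accesses):
# cost[(loop, array)][(loop structure, array layout)] = estimated coherence overhead.
cost = {
    ("L1", "A"): {("original", "row-major"): 10, ("original", "column-major"): 3,
                  ("interchanged", "row-major"): 4, ("interchanged", "column-major"): 9},
    ("L1", "B"): {("original", "row-major"): 2, ("original", "column-major"): 8,
                  ("interchanged", "row-major"): 7, ("interchanged", "column-major"): 2},
    ("L2", "A"): {("original", "row-major"): 6, ("original", "column-major"): 1,
                  ("interchanged", "row-major"): 5, ("interchanged", "column-major"): 6},
}

def total_cost(loops, arrays):
    return sum(table[(loops[l], arrays[a])] for (l, a), table in cost.items())

best, best_cost = None, float("inf")
for lc in product(*loop_options.values()):
    for ac in product(*array_options.values()):
        loops, arrays = dict(zip(loop_options, lc)), dict(zip(array_options, ac))
        c = total_cost(loops, arrays)
        if c < best_cost:
            best, best_cost = (loops, arrays), c

print(best, best_cost)   # both loops left as-is, A column-major, B row-major, cost 6
```

Even in this toy setting the search space grows multiplicatively with every loop and array added, which is why the approach can only afford a handful of candidate structures and layouts.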

This work differs from ours in several ways. First, the goals are different: it aims to reduce cache coherence traffic on shared-memory multiprocessors, while we seek to improve spatial locality on a single processor. Second, for array restructuring, it can afford to consider only a few possible transformations (e.g., row- and column-major storage orders), while our framework allows efficient exploration of a much larger design space. Third, they use both loop and array restructuring, while we focus on the latter. Finally, they analyze loops and arrays on a global level to select one layout for each array, while we analyze individual loops locally, restructuring arrays in between if necessary.

9.1.2 Other Forms of Data Layout Optimizations

The following optimizations differ from our array restructuring technique fundamentally in that they do not focus on reordering array elements as we do.

Array padding has long been used to improve memory system performance [Bacon, Graham, and Sharp 1994]. The compiler adds dummy elements to an array so that elements accessed consecutively are separated by some suitable distance in memory. This technique is useful for fully utilizing the memory bandwidth on vector machines with banked memory: the padding causes consecutive memory accesses to be directed to different memory banks and thus serviced in a pipelined fashion. Array padding may also be needed to avoid excessive conflict misses in caches with low associativities. In such caches, excessive conflict misses may result if the program happens to access array elements at strides that equal powers of two (which often happens when some array dimensions are powers of two) because all those elements would be mapped to a small number of sets in the cache. Array padding solves the problem by slightly shifting those elements so that they are mapped to different sets. In a related technique, array alignment, the compiler judiciously chooses array starting addresses to avoid the conflict misses that may occur if the program alternately accesses elements of different arrays and those elements happen to be mapped to the same set in the cache. The systematic selection of suitable padding amounts has been studied recently [Bacon et al. 1994].

Array padding is orthogonal to our array restructuring technique. Unlike our technique, it does not reorder array elements; it just adds dummy elements between the “real” elements. However, it can be incorporated into our framework. When calculating the linearization vector, we can account for the dummy elements by choosing the vector’s components to be slightly greater than what would be strictly required for all elements to have their own unique memory locations.
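The sketch below illustrates the idea of folding padding into the linearization vector. The cache geometry (a direct-mapped cache with 256 sets of 64-byte lines) and the 256 x 256 array of 8-byte elements are illustrative assumptions, not a model of any particular machine.

```python
# Padding folded into the linearization vector: enlarging one component of the
# vector by one element spreads a power-of-two-strided walk over many cache sets.
SETS, LINE, ELEM, N = 256, 64, 8, 256

def sets_touched(row_length):
    # Walk one column of a row-major array whose linearization vector is
    # (row_length, 1): element A(i, 0) lies at offset i * row_length elements.
    offsets = [i * row_length * ELEM for i in range(N)]
    return len({(off // LINE) % SETS for off in offsets})

print(sets_touched(N))       # 8   -- the power-of-two stride maps the column to few sets
print(sets_touched(N + 1))   # 256 -- padding the row by one element spreads it out
```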

Jeremiassen and Eggers apply data transformations to reduce false sharing in coarse-grained, explicitly parallel applications on shared-memory multiprocessors [Jeremiassen and Eggers 1995; Jeremiassen 1995]. Based on how different threads of execution access data, the compiler transforms data structures so that data accessed by the same thread are grouped together while write-shared data with little processor locality are separated into different cache lines. Three data transformations are used: group and transpose converts a group of statically declared vectors into a vector of records with one field for each original vector; pad and align pads data structures to the cache line size and aligns them on cache line boundaries; indirection places dynamically allocated data structures predominantly written by one processor in a heap region specific to that processor, leaving a pointer to the new location in the original location [Eggers and Jeremiassen 1991; Jeremiassen 1995; Jeremiassen and Eggers 1995].
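A minimal sketch of group and transpose follows; the vector names and sizes are illustrative, not taken from their benchmarks.

```python
# "Group and transpose": three parallel vectors accessed element by element are
# interleaved into one vector of records, so x[i], y[i], and z[i] share a cache
# line instead of living in three separate arrays.
n = 8
x = list(range(n))
y = list(range(100, 100 + n))
z = list(range(200, 200 + n))

grouped = [(x[i], y[i], z[i]) for i in range(n)]   # vector of records

print(grouped[3])   # (3, 103, 203): one record now holds all the data for index 3
# Viewed as a 3 x n array, this amounts to moving the "which vector" dimension
# to the fastest-varying index position.
```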

Their work and ours differ in several significant ways. First, they aim at reducing false sharing in explicitly parallel applications on shared-memory machines, while we try to improve spatial locality in sequential programs running on uniprocessors. Second, their transformations target data structures in general, while ours focus on reordering array elements. In particular, as applied to multidimensional arrays, group and transpose amounts to a form of permuting array dimensions: if the vector components are themselves arrays, this transformation effectively permutes the array dimension corresponding to the vector to the leftmost index position while preserving the relative order of the other dimensions. Third, their transformations consist of individual useful techniques selected heuristically by the compiler, while our transformations are represented in a formal framework and chosen by algorithmic analysis of formal representations. Finally, they make data restructuring decisions after a static whole-program analysis, while we may dynamically restructure arrays between loops and use local analysis.

Torrellas, Lam, and Hennessy have also studied the impact of false sharing and spatial locality in multiprocessor caches [Torrellas, Lam, and Hennessy 1994]. They propose five data layout optimizations to improve cache performance. These include different forms of padding and alignment, dynamically allocating data structures in processor-specific heaps, and putting locks and the variables they protect together. Their techniques, generally speaking, differ from ours in the same ways as those of Jeremiassen and Eggers just discussed.

9.2 Loop Restructuring

In this section, we review loop restructuring techniques for improving locality — both temporal and spatial — to demonstrate the depth and breadth of the long-standing work in this area. In particular, we discuss loop interchange, permutation, fusion, and tiling (or blocking) [Bacon, Graham, and Sharp 1994]. In addition, we survey other transformations that do not improve locality by themselves but are sometimes needed to enable those just listed, as well as frameworks for systematically selecting sequences of transformations. In keeping with our theme, we focus on how the transformations enhance locality, although many of them may also serve other purposes such as increasing instruction-level parallelism, reducing loop overhead, and enabling automatic parallelization or vectorization.

Loop interchange swaps two different loops in a loop nest, one of which is typically the innermost loop [Allen and Kennedy 1984]. This transformation reorders iterations so that those accessing the same or nearby data are executed closer together in time than under the original program order. It can improve temporal locality by, say, moving a loop that reuses the same array element to the innermost position, or spatial locality by similarly moving a loop that goes through an array at unit stride.

Loop permutation generalizes loop interchange. It permutes the loops in a loop nest into some order different from the original program order. A permutation may be viewed as a series of interchanges. For loop interchange and permutation to be legal, all the transformed loop nest’s dependence vectors, obtained by permuting components of the original dependence vectors accordingly, must remain lexicographically positive. Some researchers have studied selecting a permutation for both parallelism and locality based on a simple memory model [Carr, McKinley, and Tseng 1994; Kennedy and McKinley 1992].
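A hedged sketch of this legality test follows; the dependence vectors are invented for illustration.

```python
# Legality of a loop permutation: permute each dependence vector's components
# and check that every permuted vector is still lexicographically positive.

def lex_positive(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False          # the all-zero vector is treated as not positive here

def permutation_legal(deps, perm):
    return all(lex_positive([d[p] for p in perm]) for d in deps)

deps = [(1, 0), (0, 1), (1, -1)]
print(permutation_legal(deps, (0, 1)))   # True  -- original loop order
print(permutation_legal(deps, (1, 0)))   # False -- interchange turns (1, -1) into (-1, 1)
```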

Loop skewing and loop reversal by themselves do not improve locality but are often required to make loop permutation legal [Wolfe 1989b]. Loop skewing merely skews the iteration space by adding some multiple of an outer loop variable to the bounds of inner loops; by itself, it does not reorder iterations and thus is always legal. Loop reversal does reorder iterations, reversing the order in which a loop goes through its iterations [Wedel 1975]. Therefore, loop reversal is legal only if all dependence vectors remain lexicographically positive after their components for the reversed loop have been negated [Bacon, Graham, and Sharp 1994].

Given a loop nest, it is often difficult to choose a legal sequence of such loop transformations to achieve the desired effects on locality. The difficulty is compounded by the fact that a sequence may be legal overall even if some intermediate steps are illegal. Some researchers have therefore proposed representing these transformations in a formal framework amenable to mathematical analysis [Banerjee 1991; Wolf and Lam 1991b]. Specifically, these loop transformations and their combinations are represented by unimodular transformations on the iteration space. A compiler represents array accesses and loop bounds as affine functions, selects a unimodular transformation mathematically, computes the transformed affine functions, and generates code directly from these transformed functions — all without explicitly applying any of the intermediate loop transformations. This mathematics-based approach has inspired other research efforts, including, of course, this dissertation. Others have also developed a framework to represent iteration-reordering loop transformations as sequences of kernel transformations, each with specific rules for converting dependence vectors, loop bounds, etc. [Sarkar and Thekkath 1992]. It facilitates legality checking and code generation for a given transformation sequence, although algorithms must still be developed to choose a suitable sequence.
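The following small numerical sketch shows the bookkeeping such a framework performs. The particular unimodular matrix (a skew composed with an interchange) is an illustrative assumption, not one drawn from any specific paper.

```python
import numpy as np

# A unimodular transformation T of the iteration space: |det T| = 1, so it maps
# the integer iteration points one-to-one onto integer points.
T = np.array([[1, 1],
              [1, 0]])
assert abs(round(np.linalg.det(T))) == 1

# Dependence vectors transform by the same matrix; legality just means the
# transformed vectors stay lexicographically positive.
deps = np.array([[1, -1], [0, 1]])
print((deps @ T.T).tolist())      # [[0, 1], [1, 0]] -- both still positive

# An affine access function A*i + a becomes (A * T^-1)*i' + a in the new
# iteration coordinates i' = T*i, so code can be generated directly from it.
A = np.array([[1, 0], [0, 1]])    # identity access, e.g. X(i, j)
print((A @ np.linalg.inv(T)).tolist())   # [[0.0, 1.0], [1.0, -1.0]]
```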

The unimodular loop restructuring framework has notably been extended to one using nonsingular matrices, which subsume unimodular matrices [Li and Pingali 1993c]. This extension adds another loop transformation called loop scaling, which replaces a loop variable by a multiple of itself [Li 1993]. This framework has led to access normalization [Li and Pingali 1992; Li and Pingali 1993a]. Access normalization is an optimization for non-uniform memory access (NUMA) multiprocessors. Given a distribution of data to processors, it aims to transform a loop nest so that the outermost loop can be parallelized without compromising processor locality, while in the inner loops, data can be transferred between processors using block transfer memory operations [Li 1993]. The same framework has also led to cache locality optimization techniques especially suited for banded matrix applications on uniprocessors [Li 1995; Li 1993]. As discussed in Section 9.1.1, some recent work on unifying loop and array restructuring has also built on this framework [Cierniak and Li 1995].

The above optimizations target perfect loop nests because they reorder iterations as a whole rather than statements in an iteration. To apply them to imperfect loop nests, a compiler may use loop distribution to convert the original loop nest into a series of perfect loop nests, each containing some of the statements in the original loop body [Kennedy and McKinley 1990]. Even when starting with a perfect loop nest, we may still want to distribute it if loop transformations that will improve the performance of some statements in the loop body are precluded by loop-carried dependences arising from others.

Loop fusion, the inverse of loop distribution, can also improve locality [Abu-Sufah 1979; Wolfe 1989b]. By fusing corresponding iterations from different loop nests into one, it brings uses of the same or nearby data, originally in corresponding iterations separated by an entire loop nest, to within the same iteration. It has been used together with loop distribution to maximize loop-level parallelism and improve locality on shared-memory multiprocessors [Kennedy and McKinley 1994], and furthermore with loop permutation and reversal to improve locality on uniprocessors as well [Carr, McKinley, and Tseng 1994].

Finally, loop tiling (or blocking) is another powerful iteration reordering technique to improve locality [Abu-Sufah 1979; Gannon, Jalby, and Gallivan 1988]. Tiling can be viewed as strip-mining followed by permutation [Wolfe 1989a]. The iteration space is divided into rectangular “tiles” such that the data touched by a tile fit in the cache. The restructured loop executes iterations one tile at a time, finishing one tile before executing another. Thus, each tile reuses data in the cache as much as possible before the next tile replaces them with other data.
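A minimal sketch of tiling a doubly nested loop follows; the tile size is an illustrative assumption, whereas in practice it is chosen so that a tile's data fit in the cache.

```python
# Tiling a doubly nested loop: both versions visit exactly the same iterations;
# only the order changes.
N, B = 8, 4   # illustrative iteration-space size and tile size

untiled = [(i, j) for i in range(N) for j in range(N)]

tiled = [(i, j)
         for ii in range(0, N, B)             # loops over tiles
         for jj in range(0, N, B)
         for i in range(ii, min(ii + B, N))   # loops within one tile
         for j in range(jj, min(jj + B, N))]

assert sorted(untiled) == sorted(tiled)
print(tiled[:6])   # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
```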

Loop tiling has long been used by scientific programmers in the form of blocked algorithms [Lam, Rothberg, and Wolf 1991]. Compiler algorithms to automate tiling are relatively recent [Wolf and Lam 1991a]. The compiler applies loop permutation, skewing, and reversal within a unimodular transformation framework as discussed above to produce a fully permutable nest — a nest of loops for which any permutation is legal — which can then be tiled [Wolf and Lam 1991a; Wolf 1992]. Others have looked into the possibility of using non-rectangular tiles [Irigoin and Triolet 1988].

However, tiling performance is often sensitive to tile size [Lam, Rothberg, and Wolf 1991]. If a tile is too large, the data it touches may overflow the cache, destroying the intra-tile reuse critical to tiling’s potential to improve performance. This is especially true for caches with low associativity, in particular direct-mapped caches, because conflict misses may occur even if the data size is well within the cache’s capacity. Proposed solutions to this problem include copying data accessed by each tile to a separate buffer while they are being used [Lam, Rothberg, and Wolf 1991], judiciously (and conservatively) choosing tile sizes with compiler analysis [Coleman and McKinley 1995], and combining tiling with other forms of loop restructuring [Li 1995].

Chapter 10

Summary and Conclusions

This dissertation studies the use of array restructuring to improve the locality of array accesses in loops. Under this approach, the compiler analyzes the access pattern for each array to determine how elements should be laid out for better locality. We explore how this can be done effectively — efficiently, automatically, and as generally as possible.

To this end, we have developed a framework for array restructuring. In this framework, a restructured array stores the same elements as the original, but in a different order defined by a linear transformation of array index vectors. Using linear transformations is efficient: it incurs no extra indexing overhead, even in the case of non-affine array index expressions. Moreover, it makes possible automatic analysis based on linear algebraic techniques. Finally, it is extremely general: for example, it subsumes permuting array dimensions and allows restructuring only regularly spaced elements that are actually accessed by the loop.

Within this framework, we have devised algorithms to automate array restructuring.

• First, the compiler chooses an index transformation for each array based on the access pattern so that, first and foremost, the innermost loop accesses elements consecutively. Among other things, we can handle non-affine index expressions and imperfect loop nests, both frequent obstacles to loop restructuring.

• Second, the restructured array is linearized: its elements are laid out in such a way that an element’s location is an affine function of the array indices. Though easy for traditional, constant array bounds, linearization is complicated by the non-constant bounds that may result from general array restructuring. Our algorithm offers a general solution that guarantees low indexing overhead at the price of some unused memory.

• Finally, we can also restructure only those elements of an array that are really accessed by the loop, thus reducing memory use and copying overhead as well as improving spatial locality by storing elements more compactly. Our techniques support subsets of elements that are regularly spaced or within a restricted range, or both, in the index space.

We obtained promising results from experiments using loops from related literature. We found that for a range of problem sizes, array restructuring improved performance substantially in many cases, despite a modest overhead in some under pessimistic assumptions. Our prototype was able to identify those cases where array restructuring was unprofitable due to excessive overhead.

We also found that array restructuring compared favorably with loop restructuring in both applicability and performance. The two approaches complement each other; neither enjoys an unqualified superiority. Array restructuring applied in some cases where loop restructuring did not and achieved comparable, sometimes better, performance when both were applicable. However, there were also cases where simple loop restructuring outperformed our array restructuring technique — precisely those cases for which our prototype decided against array restructuring because it might not be profitable.

These quantitative results confirm the qualitative observation at the beginning of this dissertation that loop and array restructuring each has its advantages and disadvantages, which often mirror each other.

• Array restructuring can be more easily applied to complicated loop structures. For example, we have discussed how it handles imperfect loop nests, non-affine index expressions, etc., with relative ease. Unlike loop restructuring, it is also immune to insufficient or imprecise information on loop-carried dependences.

• Array restructuring changes spatial locality but not temporal locality. Thus, it can improve the former without jeopardizing the latter. Loop restructuring changes both, which makes it more powerful (it can improve both) but also complicates performance tradeoffs.

• Array restructuring can improve independently the locality for each array accessed in a loop nest, while loop restructuring affects accesses to all arrays in the loop, thus necessitating tradeoffs between them. On the other hand, multiple loop nests can be restructured independently, while array restructuring affects all loop nests accessing the same array. While this latter case may occur more often than the former, they are not mutually exclusive. In the latter case, dynamically restructuring arrays between loop nests may help, though at the expense of runtime overhead.

The complementary nature of loop and array restructuring suggests that the best results are likely to be obtained by integrating both approaches. However, simply performing them one after the other in some order may not produce the expected performance advantage. There have been attempts to fully integrate the selection of loop and array transformations. As noted in the discussion on related work, in these previous attempts the problem of choosing the right combination is tractable only because many possible transformations — loop or array — are ruled out a priori. The potential of integrating the widest possible classes of loop and array transformations may not be fully realized.

This dissertation is less ambitious in that it focuses on array restructuring. Loop restructuring has been widely studied for a long time, with the tremendous amount of research effort culminating in the state of the art. Array restructuring has so far received relatively little attention. It is hoped that this study has contributed to a deeper understanding of effective array restructuring. In the short term, the techniques developed can be used on their own as another set of tools available to an optimizing compiler for improving locality. In the longer term, the insights obtained may also serve as one small step on the road to an eventual integration of both approaches.

Bibliography

[Abu-Sufah 1979] W. Abu-Sufah. “Improving the Performance of Virtual Memory Computers.” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 1979.

[Aho, Sethi, and Ullman 1986] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.

[Allen and Kennedy 1984] J. R. Allen and K. Kennedy. “Automatic Loop Interchange.” In Proceedings of the ACM SIGPLAN Symposium on Compiler Construction, (Montreal, Canada, June), 233-46. New York: ACM Press, 1984.

[Alpern, Wegman, and Zadeck 1988] B. Alpern, M. N. Wegman, and F. Zadeck. “Detecting Equality of Values in Programs.” In Conference Record of the Fifteenth Annual ACM Symposium on Principles of Programming Languages, (San Diego, CA, January), 1-11. New York: ACM Press, 1988.

[Amarasinghe and Lam 1993] Saman P. Amarasinghe and Monica S. Lam. “Communication Optimization and Code Generation for Distributed Memory Machines.” In Proceedings of ACM SIGPLAN ’93 Conference on Programming Language Design and Implementation, (Albuquerque, NM, June), 126-38. New York: ACM Press, 1993.

[Anderson and Lam 1993] Jennifer M. Anderson and Monica S. Lam. “Global Optimizations for Parallelism and Locality on Scalable Parallel Machines.” In Proceedings of ACM SIGPLAN ’93 Conference on Programming Language Design and Implementation, (Albuquerque, NM, June), 112-25. New York: ACM Press, 1993.

[Anderson, Amarasinghe, and Lam 1995] Jennifer M. Anderson, Saman P. Amarasinghe, and Monica S. Lam. “Data and Computation Transformations for Multiprocessors.” In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (Santa Barbara, CA, July), 166-78. New York: ACM Press, 1995.

[Bacon, Graham, and Sharp 1994] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. “Compiler Transformations for High-Performance Computing.” ACM Computing Surveys 26(4):345-420 (December 1994).

[Bacon et al. 1994] David. F. Bacon et al. “A Compiler Framework for Restructuring Data Declarations to Enhance Cache and TLB Effectiveness.” In Proceedings of CASCON ’94, (Toronto, Canada, November), 270-82. Ottawa: National Research Council of Canada, 1994.

[Bailey and Barton 1986] David H. Bailey and John T. Barton. “The NAS Kernel Benchmark Program.” Numerical Aerodynamic Simulations Systems Division, NASA Ames Research Center, 1986. Distributed electronically with code.

[Banerjee 1991] Utpal Banerjee. “Unimodular Transformations of Double Loops.” In Advances in Languages and Compilers for Parallel Processing, edited by A. Nicolau et al. Research Monographs in Parallel and Distributed Computing. Cambridge, MA: MIT Press, 1991.

[Bannon and Keller 1995] Peter Bannon and Jim Keller. “Internal Architecture of Alpha 21164 Microprocessor.” In Digest of Papers, COMPCON Spring '95, Technologies for the Information Superhighway, 79-87. Los Alamitos, CA: IEEE Computer Society Press, 1995.

[Bloom 1979] David M. Bloom. Linear Algebra and Geometry. Cambridge, England: Cambridge University Press, 1979.

[Carr, McKinley, and Tseng 1994] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. “Compiler Optimizations for Improving Data Locality.” In Proceedings of Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, October), 252-62. New York: ACM Press, 1994.

[Cierniak and Li 1995] Michal Cierniak and Wei Li. “Unifying Data and Control Transformations for Distributed Shared-Memory Machines.” In Proceedings of ACM SIGPLAN ’95 Conference on Programming Language Design and Implementation, (La Jolla, CA, June), 205-17. New York: ACM Press, 1995.

[Coleman and McKinley 1995] Stephanie Coleman and Kathryn S. McKinley. “Tile Size Selection Using Cache Organization and Data Layout.” In Proceedings of ACM SIGPLAN ’95 Conference on Programming Language Design and Implementation, (La Jolla, CA, June), 279-90. New York: ACM Press, 1995.

[DEC 1992] Digital Equipment Corporation (DEC). DEC 3000 Model 400/400S AXP Technical Summary. Maynard, MA: Digital Equipment Corporation, 1992.

[Dixit 1992] Kaivalya M. Dixit. “New CPU Benchmark Suites from SPEC.” In Digest of Papers, COMPCON Spring '92, Thirty-Seventh IEEE Computer Society International Conference, (San Francisco, CA, February), 305-10. Los Alamitos, CA: IEEE Computer Society Press, 1992.

[Dongarra et al. 1990] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. “A Set of Level 3 Basic Linear Algebra Subprograms.” ACM Transactions on Mathematical Software 16(1):1-17 (March 1990).

[Dutton et al. 1991] Todd A. Dutton, Daniel Eiref, Hugh R. Kurth, James J. Reisert, and Robin L. Stewart. “The Design of the DEC 3000 AXP Systems, Two High-Performance Workstations.” Digital Technical Journal 4(4):66-81 (Special issue 1992).

[Eggers and Jeremiassen 1991] Susan J. Eggers and Tor E. Jeremiassen. “Eliminating False Sharing.” In Proceedings of the 1991 International Conference on Parallel Processing, vol. 1, 377-81. University Park, PA: Pennsylvania State University Press, 1991.

[Esseghir 1993] K. Esseghir. “Improving Data Locality for Caches.” Master’s thesis, Rice University, 1993.

[Gannon, Jalby, and Gallivan 1988] Dennis Gannon, William Jalby, and Kyle Gallivan. “Strategies for Cache and Local Memory Management by Global Program Transformation.” Journal of Parallel and Distributed Computing 5(5):587-616 (October 1988).

[Hunt 1995] Doug Hunt. “Advanced Performance Features of the 64-Bit PA-8000.” In Digest of Papers, COMPCON Spring '95, Technologies for the Information Superhighway, 123-8. Los Alamitos, CA: IEEE Computer Society Press, 1995.

[Irigoin and Triolet 1988] F. Irigoin and R. Triolet. “Supernode Partitioning.” In Proceedings of the Fifteenth Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, (San Diego, CA, January), 319-29. New York: ACM Press, 1988.

[Jeremiassen 1995] Tor E. Jeremiassen. “Using Compile-Time Analysis and Transformations to Reduce False Sharing on Multiprocessors.” Ph.D. dissertation, University of Washington, 1995.

[Jeremiassen and Eggers 1995] Tor E. Jeremiassen and Susan J. Eggers. “Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.” In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (Santa Barbara, CA, July), 179-88. New York: ACM Press, 1995.

[Ju and Dietz 1992] Y.-J. Ju and H. Dietz. “Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation.” In Languages and Compilers for Parallel Computing: Fourth International Workshop Proceedings, Ithaca, NY, USA, edited by Utpal Banerjee et al., 344-58. Berlin: Springer-Verlag, 1992.

[Kelly and Pugh 1995] Wayne Kelly and William Pugh. “Finding Legal Reordering Transformations Using Mappings.” In Languages and Compilers for Parallel Computing: 7th International Workshop, Ithaca, NY, USA, edited by Keshav Pingali et al., 107-24. Berlin: Springer-Verlag, 1995.

[Kennedy and McKinley 1990] Ken Kennedy and Kathryn S. McKinley. “Loop Distribution with Arbitrary Control Flow.” In Proceedings of Supercomputing ’90, (New York, NY, November), 407-16. New York: ACM Press, 1990.

[Kennedy and McKinley 1992] Ken Kennedy and Kathryn S. McKinley. “Optimizing for Parallelism and Data Locality.” In Proceedings of 1992 International Conference on Supercomputing, (Washington, D.C., July), 323-34. New York: ACM Press, 1992.

[Kennedy and McKinley 1994] Ken Kennedy and Kathryn S. McKinley. “Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution.” In Languages and Compilers for Parallel Computing: 6th International Workshop Proceedings, Portland, OR, USA, edited by Utpal Banerjee et al., 301-20. Berlin: Springer-Verlag, 1994.

[KSR 1992] Kendall Square Research Corporation (KSR). “Memory Management.” In Principles of Operation. Revision 6.0. Waltham, MA: Kendall Square Research Corporation, 1992.

[KSR 1993] Kendall Square Research Corporation (KSR). “KSR/Series Memory Model and Processor.” In KSR/Series Parallel Programming. Release dated 12/15/93. Waltham, MA: Kendall Square Research Corporation, 1993.

[Lam, Rothberg, and Wolf 1991] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. “The Cache Performance and Optimizations of Blocked Algorithms.” In Proceedings of Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, (Santa Clara, CA, April), 63-74. New York: ACM Press, 1991.

[Lenoski et al. 1992] D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. “The Stanford DASH Multiprocessor.” Computer 25(3):63-79 (March 1992).

[Levitan, Thomas, and Tu 1995] David Levitan, Thomas Thomas, and Paul Tu. “The PowerPC 620 Microprocessor: A High Performance Superscalar RISC Microprocessor.” In Digest of Papers, COMPCON Spring '95, Technologies for the Information Superhighway, 285-91. Los Alamitos, CA: IEEE Computer Society Press, 1995.

[Li 1993] Wei Li. “Compiling for NUMA Parallel Machines.” Ph.D. dissertation, Cornell University, 1993.

[Li 1995] Wei Li. “Compiler Cache Optimizations for Banded Matrix Problems.” In Proceedings of 9th ACM International Conference on Supercomputing. 1995.

[Li and Pingali 1992] Wei Li and Keshav Pingali. “Access Normalization: Loop Restructuring for NUMA Compilers.” In Proceedings of Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, (Boston, MA, October), 285-95. New York: ACM Press, 1992.

[Li and Pingali 1993a] Wei Li and Keshav Pingali. “Access Normalization: Loop Restructuring for NUMA Computers.” ACM Transactions on Computer Systems 11(4):353-75 (November 1993).

[Li and Pingali 1993b] Wei Li and Keshav Pingali. “Loop Transformations for NUMA Machines.” SIGPLAN Notices 28(1):9-12 (January 1993).

[Li and Pingali 1993c] Wei Li and Keshav Pingali. “A Singular Loop Transformation Framework Based on Non-Singular Matrices.” In Languages and Compilers for Parallel Computing: 5th International Workshop Proceedings, New Haven, CT, USA, edited by Utpal Banerjee et al., 391-405. Berlin: Springer-Verlag, 1993.

[Moore 1993] Charles R. Moore. “The PowerPC 601 Microprocessor.” In Digest of Papers, COMPCON Spring '93, Thirty-Eighth IEEE Computer Society International Conference, (San Francisco, CA, February), 109-16. Los Alamitos, CA: IEEE Computer Society Press, 1993.

[Paap and Silha 1993] George Paap and Ed Silha. “PowerPC: A Performance Architecture.” In Digest of Papers, COMPCON Spring '93, Thirty-Eighth IEEE Computer Society International Conference, (San Francisco, CA, February), 109-16. Los Alamitos, CA: IEEE Computer Society Press, 1993.

[Papworth 1996] David B. Papworth. “Tuning the Pentium Pro Microarchitecture.” IEEE Micro 16(2):8-15 (April 1996).

[Pugh 1991] William Pugh. “Uniform techniques for Loop Optimization.” In Proceedings of the 1991 ACM International Conference on Supercomputing, 341-52. 1991.

[Reif and Lewis 1986] J. H. Reif and H. R. Lewis. “Efficient Symbolic Analysis of Programs.” Journal of Computer and Systems Sciences 32(3):280-313 (June 1986).

[Sarkar and Thekkath 1992] Vivek Sarkar and Radhika Thekkath. “A General Framework for Iteration-Reordering Transformations.” In Proceedings of ACM SIGPLAN ’92 Conference on Programming Language Design and Implementation, (San Francisco, CA, June), 175-87. New York: ACM Press, 1992.

[Schrijver 1986] Alexander Schrijver. Theory of Linear and Integer Programming. Series in Discrete Mathematics. New York: Wiley, 1986.

[SGI 1995a] Silicon Graphics, Inc. (SGI). “The POWER CHALLENGE Technical Report.” Mountain View, CA: Silicon Graphics, Inc., 1995. Published as world-wide web page at http://www.sgi.com/Products/hardware/Power/brief-toc.html.

[SGI 1995b] Silicon Graphics, Inc. (SGI). POWER Fortran Accelerator User’s Guide. Mountain View, CA: Silicon Graphics, Inc., 1995.

[Sites 1992] Richard L. Sites, ed. Alpha Architecture Reference Manual. Bedford, MA: Digital Press, 1992.

[Taylor 1992] Angus E. Taylor. “Partial Differentiation.” In McGraw-Hill Encyclopedia of Science and Technology. 7th ed. Vol. 13. New York, 1992.

[Torrellas, Lam, and Hennessy 1994] Josep Torrellas, Monica S. Lam, and John L. Hennessy. “False Sharing and Spatial Locality in Multiprocessor Caches.” IEEE Transactions on Computers 43(6):651-63 (June 1994).

[Tremblay and O’Conner 1996] Marc Tremblay and J. Michael O’Conner. “UltraSPARC-I: A Four-Issue Processor Supporting Multimedia.” IEEE Micro 16(2):42-50 (April 1996).

[Wedel 1975] D. Wedel. “FORTRAN for the Texas Instrument ASC System.” SIGPLAN Notices 10(3):119-32.

[Wilson et al. 1994] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S. W. Liao, C. W. Tseng, M. W. Hall, M. S. Lam and J. L. Hennessy. “SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers.” SIGPLAN Notices 29(12):31-37 (December 1994).

[Wolf 1992] Michael E. Wolf. “Improving Locality and Parallelism in Nested Loops.” Ph.D. dissertation, Stanford University, 1992.

[Wolf and Lam 1991a] Michael E. Wolf and Monica S. Lam. “A Data Locality Optimizing Algorithm.” In Proceedings of ACM SIGPLAN ’91 Conference on Programming Language Design and Implementation, (Toronto, Canada, June), 30-44. New York: ACM Press, 1991.

[Wolf and Lam 1991b] Michael E. Wolf and Monica S. Lam. “A Loop Transformation Theory and an Algorithm to Maximize Parallelism.” IEEE Transactions on Parallel and Distributed Systems 2(4):452-71 (October 1991).

[Wolfe 1989a] Michael J. Wolfe. “More Iteration Space Tiling.” In Proceedings of Supercomputing ’89, (Reno, NV, November), 655-64. New York: ACM Press, 1989.

[Wolfe 1989b] Michael J. Wolfe. Optimizing Supercompilers for Supercomputers. Research Monographs in Parallel and Distributed Computing. Cambridge, MA: MIT Press, 1989.

[Yeager 1996] Kenneth C. Yeager. “The MIPS R10000 Superscalar Microprocessor.” IEEE Micro 16(2):28-41 (April 1996).

Appendix A

Memory Utilization

In this appendix, we show that the amount of memory needed for the restructured array (to facilitate address computation from array indices as described in Chapter 4) is at most m! times the amount of memory for storing data elements, where m is the number of array dimensions. This upper bound is only approximate because we do not actually count elements, but only compare volumes of regions in the transformed index space that represent memory use. We believe, however, that the approximation is acceptable unless one or more array dimensions is unusually small.

Recall that in Chapter 4 we transform the original array bounds with some index transformation to get the bounds of the restructured array. The transformed bounds define the set of elements that the restructured array must contain. These bounds are represented geometrically by the image polyhedron. Thus, in the following discussion we approximate the memory needed for array data with the volume of the image polyhedron. Recall also that to facilitate addressing an element, we allocate memory according to the relaxed bounds — a special form of bounds computed from the transformed bounds. The relaxed bounds are represented geometrically by the enclosing parallelepiped. Thus, we use the enclosing parallelepiped’s volume to approximate the amount of allocated memory.

We show below that the volume of the enclosing parallelepiped is at most m! times that of the image polyhedron, assuming that the enclosing parallelepiped is in the form specified in Section 4.4.1 on page 75 and the image polyhedron is symmetric as defined in Section 4.5.2 on page 85. In our proof, we compute and compare the volumes of two regions: the enclosing parallelepiped itself and a polyhedron guaranteed to be inside the image polyhedron. The construction of the latter is discussed shortly in Section A.2, but first we prove several lemmas that will be used later. All summations and products are from 1 to m unless stated otherwise; the range is omitted to simplify notation.

A.1 Lemmas

Lemma A.1: For any vector $\gamma$ and any $m$ vectors $P_k$ ($1 \le k \le m$), let $P'_k$ be the mirror image of $P_k$ about $\gamma$ (i.e., $P'_k = 2\gamma - P_k$). The convex hull spanned by all $2m$ vectors $P_k$ and $P'_k$ (for $1 \le k \le m$) is equal to the set

$$\left\{ \gamma + \sum_k \delta_k (P_k - \gamma) \;:\; \sum_k |\delta_k| \le 1 \right\} \tag{A.1}$$

Proof: Let $S'$ be the set in (A.1) and $S$ be the convex hull. By definition,

$$S = \left\{ \sum_k (\alpha_k P_k + \beta_k P'_k) \;:\; \alpha_k \ge 0,\ \beta_k \ge 0,\ \text{and } \sum_k (\alpha_k + \beta_k) = 1 \right\} \tag{A.2}$$

First, we show $S \subseteq S'$. Consider an element of $S$, denoted $\sigma$. Choosing $\delta_k = \alpha_k - \beta_k$, we can express $\sigma$ as follows:

$$\begin{aligned}
\sigma &= \sum_k [\alpha_k P_k + \beta_k P'_k] \\
       &= \sum_k [\alpha_k P_k + \beta_k (2\gamma - P_k)] \\
       &= \sum_k [\alpha_k (P_k - \gamma) + \beta_k (\gamma - P_k)] + \sum_k (\alpha_k + \beta_k)\gamma \\
       &= \sum_k [(\alpha_k - \beta_k)(P_k - \gamma)] + \gamma \\
       &= \gamma + \sum_k \delta_k (P_k - \gamma)
\end{aligned} \tag{A.3}$$

Moreover, noting that all $\alpha_k$ and $\beta_k$ are nonnegative and their sum is 1, we have

$$\sum_k |\delta_k| = \sum_k |\alpha_k - \beta_k| \le \sum_k (\alpha_k + \beta_k) = 1 \tag{A.4}$$

Therefore, $\sigma \in S'$ and hence $S \subseteq S'$.

Conversely, consider an element of $S'$, denoted $\sigma'$. We choose $\alpha_k$ and $\beta_k$ from the $\delta_k$ according to the following table:

$$\begin{array}{c|c|c}
 & \text{if } \delta_k \ge 0 & \text{if } \delta_k < 0 \\ \hline
1 \le k < m & \alpha_k = \delta_k, \quad \beta_k = 0 & \alpha_k = 0, \quad \beta_k = -\delta_k \\ \hline
k = m & \alpha_m = \dfrac{1 - \sum_k |\delta_k|}{2} + |\delta_m|, \quad \beta_m = \dfrac{1 - \sum_k |\delta_k|}{2} & \alpha_m = \dfrac{1 - \sum_k |\delta_k|}{2}, \quad \beta_m = \dfrac{1 - \sum_k |\delta_k|}{2} + |\delta_m|
\end{array}$$

It is clear that in all cases $\alpha_k \ge 0$ and $\beta_k \ge 0$. Furthermore, $\alpha_k + \beta_k = |\delta_k|$ for $1 \le k < m$ and $\alpha_m + \beta_m = 1 - \sum_k |\delta_k| + |\delta_m|$. Therefore,

$$\sum_{k=1}^{m} (\alpha_k + \beta_k) = \sum_{k=1}^{m-1} |\delta_k| + \left( 1 - \sum_{k=1}^{m} |\delta_k| \right) + |\delta_m| = 1 \tag{A.5}$$

Hence, $\sigma'$ belongs to $S$ and therefore $S' \subseteq S$. Since $S$ and $S'$ are subsets of each other, they are equal.

Lemma A.2: For any nonnegative number $\lambda$ and positive integer $m$,

$$\int_{|y_1| + \cdots + |y_m| \le \lambda} dy_1 \cdots dy_m = \frac{(2\lambda)^m}{m!} \tag{A.6}$$

Proof: We prove this by induction on $m$. Let $I_m(\lambda)$ be the integral on the left-hand side. For $m = 1$, (A.6) is true because

$$I_1(\lambda) = \int_{|y_1| \le \lambda} dy_1 = \int_{-\lambda}^{\lambda} dy_1 = 2\lambda \tag{A.7}$$

Assume that (A.6) is true for $m = j$. Consider $I_{j+1}(\lambda)$.

$$\begin{aligned}
I_{j+1}(\lambda) &= \int_{|y_1| + \cdots + |y_{j+1}| \le \lambda} dy_1 \cdots dy_{j+1} \\
&= \int_{-\lambda}^{\lambda} \left( \int_{|y_1| + \cdots + |y_j| \le \lambda - |y_{j+1}|} dy_1 \cdots dy_j \right) dy_{j+1} \\
&= \int_{-\lambda}^{\lambda} I_j(\lambda - |y_{j+1}|) \, dy_{j+1} \\
&= \int_{-\lambda}^{\lambda} \frac{[2(\lambda - |y_{j+1}|)]^j}{j!} \, dy_{j+1} \\
&= \int_{-\lambda}^{0} \frac{[2(\lambda + y_{j+1})]^j}{j!} \, dy_{j+1} + \int_{0}^{\lambda} \frac{[2(\lambda - y_{j+1})]^j}{j!} \, dy_{j+1} \\
&= 2 \int_{0}^{\lambda} \frac{[2(\lambda - y_{j+1})]^j}{j!} \, dy_{j+1} \\
&= \left[ -\frac{2^{j+1} (\lambda - y_{j+1})^{j+1}}{(j+1)!} \right]_{y_{j+1}=0}^{y_{j+1}=\lambda} \\
&= \frac{(2\lambda)^{j+1}}{(j+1)!}
\end{aligned} \tag{A.8}$$

Therefore, (A.6) holds true for $m = j + 1$ as well. By induction, it holds for any positive integer $m$.
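As a purely illustrative aside (not part of the proof), the identity (A.6) can be checked numerically; the dimension, radius, and sample count below are arbitrary choices.

```python
import math
import random

# Monte Carlo check of (A.6): the volume of { y : |y_1| + ... + |y_m| <= lam }
# should be (2*lam)^m / m!.  Points are sampled in the bounding cube [-lam, lam]^m.
def l1_ball_volume(m, lam, trials=200_000, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if sum(abs(rng.uniform(-lam, lam)) for _ in range(m)) <= lam)
    return hits / trials * (2 * lam) ** m

m, lam = 3, 1.5
print(l1_ball_volume(m, lam))                # roughly 4.5
print((2 * lam) ** m / math.factorial(m))    # exactly 27 / 6 = 4.5
```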

Lemma A.3: For any vector $\gamma$ and any $m$ $m$-dimensional vectors $L_1, \ldots, L_m$ such that $L_k$ starts with $k - 1$ zeros (which are followed by $L_{kk}$), the volume of the region defined by $S = \{ \gamma + \sum_k \delta_k L_k : \sum_k |\delta_k| \le 1 \}$ is $\frac{2^m}{m!} \left| \prod_k L_{kk} \right|$.

Proof: The volume of $S$ is the absolute value of the integral

$$\int_S dz_1 \cdots dz_m \tag{A.9}$$

where $z = \gamma + \sum_k \delta_k L_k$ because $z$ is a point in $S$. This integral therefore equals

$$\int_{|\delta_1| + \cdots + |\delta_m| \le 1} \frac{\partial (z_1, \ldots, z_m)}{\partial (\delta_1, \ldots, \delta_m)} \, d\delta_1 \cdots d\delta_m \tag{A.10}$$

where

$$\frac{\partial (z_1, \ldots, z_m)}{\partial (\delta_1, \ldots, \delta_m)} = \begin{vmatrix} \dfrac{\partial z_1}{\partial \delta_1} & \cdots & \dfrac{\partial z_1}{\partial \delta_m} \\ \vdots & & \vdots \\ \dfrac{\partial z_m}{\partial \delta_1} & \cdots & \dfrac{\partial z_m}{\partial \delta_m} \end{vmatrix} \tag{A.11}$$

is called the Jacobian of $z_1, \ldots, z_m$ with respect to $\delta_1, \ldots, \delta_m$ [Taylor 1992]. Because $z_j = \gamma_j + \sum_k \delta_k L_{kj}$, the partial derivative of $z_j$ with respect to $\delta_k$ is

$$\frac{\partial z_j}{\partial \delta_k} = L_{kj} \tag{A.12}$$

Substituting this back into (A.11), we can see that the Jacobian is equal to $\det(L)$ if we treat $L_k$ as the $k$-th column of a matrix $L$. Thus, (A.10) becomes

$$\int_{|\delta_1| + \cdots + |\delta_m| \le 1} \det(L) \, d\delta_1 \cdots d\delta_m \tag{A.13}$$

By Lemma A.2, the integral evaluates to $2^m / m!$ (simply let $\lambda$ be 1). Moreover, $L$ is lower-triangular because of the special form of its columns, $L_k$, as described in this lemma. As a result, the determinant of $L$ is the product of its diagonal elements. Combining these two facts, we conclude that the integral (A.9) is $\frac{2^m}{m!} \prod_k L_{kk}$, and therefore the volume of $S$ is the absolute value of (A.9): $\frac{2^m}{m!} \left| \prod_k L_{kk} \right|$.

Lemma A.4: For any vectors $\lambda$ and $\mu$ such that $\lambda \le \mu$, and a lower-triangular matrix $\Phi$ with a unit diagonal, the volume of the region $\{ y : \lambda \le \Phi y \le \mu \}$ is $\prod_k (\mu_k - \lambda_k)$.

Proof: The volume of the specified region is the absolute value of the integral

$$\int_{\lambda \le \Phi y \le \mu} dy_1 \cdots dy_m \tag{A.14}$$

Notice that $\Phi$ is lower-triangular. Therefore, its determinant is the product of its diagonal elements, all of which are 1. Thus, $\det(\Phi) = 1$, and $\Phi$ is hence nonsingular. By a change of integration variables $z = \Phi y$, we see that the integral is equal to

$$\int_{\lambda \le z \le \mu} \frac{\partial (y_1, \ldots, y_m)}{\partial (z_1, \ldots, z_m)} \, dz_1 \cdots dz_m \tag{A.15}$$

where $\frac{\partial (y_1, \ldots, y_m)}{\partial (z_1, \ldots, z_m)}$ is the Jacobian of $y_1, \ldots, y_m$ with respect to $z_1, \ldots, z_m$, as discussed in (A.11) [Taylor 1992]. Because $y = \Phi^{-1} z$,

$$\frac{\partial (y_1, \ldots, y_m)}{\partial (z_1, \ldots, z_m)} = \det\left( \Phi^{-1} \right) = \frac{1}{\det(\Phi)} = 1 \tag{A.16}$$

The integral (A.15) is thus further reduced to

$$\int_{\lambda \le z \le \mu} dz_1 \cdots dz_m \tag{A.17}$$

which is equal to $\prod_k (\mu_k - \lambda_k)$. The volume is the absolute value of the latter, but since $\lambda \le \mu$, all the factors $\mu_k - \lambda_k$ are non-negative. Consequently, the volume is simply $\prod_k (\mu_k - \lambda_k)$.

A.2 Volumes of Polyhedra

Equipped with the previous lemmas, we are now ready to compute the volumes of the relevant polyhedra and thus bound the ratio of allocated memory to memory actually used for array elements. First, consider the enclosing parallelepiped. It is defined by (4.7) on page 75, which is repeated below for convenience:

$$r_l \le R y \le r_u \tag{A.18}$$

$R$ is a lower-triangular matrix with a unit diagonal. By a direct application of Lemma A.4, the enclosing parallelepiped’s volume is

$$V_E = \prod_k (r_{uk} - r_{lk}) \tag{A.19}$$

Next, let us construct a polyhedron within the image polyhedron as follows and compute its volume. For each $k$ from 1 to $m$, let $P_k$ be an $m$-dimensional vector within the transformed bounds such that

$$P_{k1} = c_1,\ \ \ldots,\ \ P_{k,k-1} = c_{k-1},\ \ P_{kk} = l_k(c_1, \ldots, c_{k-1}) \tag{A.20}$$

where $c$ is the center of symmetry of the image polyhedron, as computed in (4.24) on page 88, and $l_k(\cdots)$ is the augmented bounds for the $k$-th dimension, as defined by (4.23) on page 87. Such a vector $P_k$ must exist according to Lemma 4.1 on page 87 because the first $k$ components of $P_k$ satisfy the augmented bounds for the first $k$ dimensions. (The choice of $P_k$ is in fact the same as the $z$ in the proof of Lemma 4.2.)

We claim that the convex hull spanned by all the $P_k$’s and their mirror images about $c$ is inside the image polyhedron. This follows from two facts. First, $P_k$’s mirror image, like $P_k$ itself, is in the image polyhedron because the image polyhedron is symmetric. Second, the convex hull they span is also in the image polyhedron because the image polyhedron is convex. We now find the volume of the convex hull, denoted $V_H$. By Lemma A.1, this convex hull can be written as

$$\left\{ c + \sum_k \delta_k (P_k - c) \;:\; \sum_k |\delta_k| \le 1 \right\} \tag{A.21}$$

Because we choose $P_k$ according to (A.20), the vector $P_k - c$ starts with $k - 1$ zeros followed by $P_{kk} - c_k = l_k(c_1, \ldots, c_{k-1}) - c_k$. By Lemma A.3, the convex hull’s volume is

$$V_H = \frac{2^m}{m!} \left| \prod_{k=1}^{m} \left[\, l_k(c_1, \ldots, c_{k-1}) - c_k \,\right] \right| \tag{A.22}$$

Applying Lemma 4.3 on page 91 to each factor in the product gives us

$$V_H = \frac{1}{m!} \prod_{k=1}^{m} (r_{uk} - r_{lk}) \tag{A.23}$$

Finally, having computed $V_H$ and $V_E$, we can now compare the volumes of the image polyhedron and its enclosing parallelepiped. Let $V_I$ be the volume of the image polyhedron. Since the convex hull defined above is inside the image polyhedron, $V_H \le V_I$. Moreover, from (A.19) and (A.23), we know that $V_E = m!\, V_H$. Hence,

$$V_E \le m!\, V_I \tag{A.24}$$

In other words, the volume of the enclosing parallelepiped is at most m! times that of the image polyhedron. Roughly speaking, the memory for the restructured array is at most m! times that needed for data.
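As an illustrative aside on how tight this bound is (this is not the Chapter 4 construction itself, and it assumes the relaxed bounds reduce to a simple axis-aligned bounding box in this case), a symmetric cross-polytope shows a ratio of exactly m! between the two volumes, so a factor smaller than m! could not hold in general.

```python
import math

# If the image polyhedron were the cross-polytope |y_1| + ... + |y_m| <= 1 and the
# enclosing parallelepiped its bounding box [-1, 1]^m, the volume ratio would be m!.
for m in range(1, 6):
    box   = 2.0 ** m                       # volume of [-1, 1]^m
    cross = 2.0 ** m / math.factorial(m)   # volume of the cross-polytope (Lemma A.2)
    print(m, box / cross)                  # 1.0, 2.0, 6.0, 24.0, 120.0
```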

Appendix B

Additional Experimental Results

In this appendix, we briefly report experiments on other hardware platforms to demonstrate the generality of our previous observations. We used an IBM RS/6000 uniprocessor workstation and a Kendall Square Research KSR-2 shared address space multiprocessor. We ran essentially the same experiments as before, though with slightly different problem sizes in some cases because of differences in cache and TLB sizes.

B.1 Uniprocessor Workstation: IBM RS/6000

This section reports experimental results for an IBM RS/6000 Model 41T workstation based on the PowerPC 601 processor [Moore 1993; Paap and Silha 1993]. The PowerPC 601 has an on-chip, 32 KB, eight-way set-associative, write-back, unified cache. Cache lines are 64 bytes long but split into two 32-byte sectors for cache coherence purposes. In addition, the processor has a 256-entry, two-way set-associative TLB for instructions and data [Moore 1993]. The workstation itself also has a 512 KB second-level cache.

Figure B.1 (cf. Figure 7.1 on page 138) shows the results for loops for which the prototype compiler considered array restructuring profitable; Figure B.2 (cf. Figure 7.2 on page 139) shows those to which array restructuring was applied despite the compiler’s decision.

Generally, these results are qualitatively the same as those for a DEC 3000 Model 400 workstation based on the Alpha 21064 processor, which have already been reported in Section 7.2.

[Figure B.1: Array Restructuring Performance on RS/6000 — Profitable Cases. Panels (a) MATMUL, (b) SYR2K, (c) GMTRY, (d) CFFT2D, and (e) CHOLSKY plot execution time (seconds) against problem size (matrix order, matrix bandwidth, or array dimension) for the original loop (BASE), manual loop restructuring (MAN), array restructuring without overhead (ARR), and array restructuring with overhead (ARR+O). As noted in the graphs, some curves may obscure others.]

[Figure B.2: Array Restructuring Performance on RS/6000 — Unprofitable Cases (with Manual Override of Compiler Decision). Panels (a) MXM, (b) BTRIX, and (c) VPENTA plot execution time (seconds) against matrix order for the same four loop versions (BASE, MAN, ARR, ARR+O).]

However, unlike in the DEC 3000 results, the execution times of MATMUL, SYR2K, and GMTRY rise markedly faster once the respective problem sizes exceed certain thresholds. Below these thresholds, array restructuring did not noticeably improve performance over the original loop, while loop restructuring sometimes even made performance slightly worse. Above the thresholds, both loop and array restructuring produced the expected improvement. We attribute this to the PowerPC 601's much larger TLB, which, with 256 entries, is eight times as large as the 32-entry TLB of the Alpha 21064 [DEC 1992]. A larger TLB, or cache, implies a larger problem size threshold above which locality-improving optimizations matter to performance, and below which the anticipated performance gains from such optimizations may not materialize. For example, from Figure B.1(b), SYR2K's problem size threshold appears to be 60. This is almost exactly eight times (the ratio of TLB sizes) the threshold for the Alpha processor (namely, 7 or 8), which was predicted in Section 7.2.2 and observed experimentally in Figure 7.5 on page 147.
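Written out, and assuming only that the threshold scales in proportion to the number of TLB entries, the predicted RS/6000 threshold is

\frac{256}{32} \times (7\ \text{to}\ 8) = 56\ \text{to}\ 64 ,

which brackets the value of about 60 read off Figure B.1(b).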

B.2 Shared Address Space Multiprocessor: KSR-2

In this section, we report experiments on a Kendall Square Research KSR-2, a shared address space multiprocessor. The KSR-2 has a cache-only memory architecture (COMA): it has no “main memory”; all its memory is distributed to processors and kept coherent by hardware [KSR 1992]. Processors are connected in a hierarchy of rings, each with at most 32 processors. Each processor has a 32 MB, 16-way set-associative local cache. Local cache space is allocated in pages of 16 KB, but different local caches are kept coherent in subpages of 128 bytes. Thus, the cache line size is 16 KB for allocation purposes, but 128 bytes for coherence purposes. Each processor also has a 512 KB, two-way set-associative subcache, divided into two halves for instructions and data. Like the local cache, the subcache has different units for allocation and for coherence: 2 KB blocks and 64-byte subblocks, respectively [KSR 1992]. Misses in the subcache and local cache lead to latencies of 20-24 cycles (0.5-0.6 µs) and 175-600 cycles (4.375-15 µs), respectively [KSR 1993]. Our experiments were done on a KSR-2 with two rings. To avoid potential complications if data traffic crosses the ring boundary, we used only the 25 processors on one ring.¹

Results are shown in Figure B.3 and Figure B.4, respectively, for cases where array restructuring was applied according to, and in spite of, the compiler's decision. They resemble the results for the SGI Power Challenge, shown in Figure 8.1 on page 158 and Figure 8.2 on page 159. All speedups are relative to the execution time of the original loop running on one processor, for reasons explained earlier in Section 8.1.

1. Each ring was configured for 32 processors, but only 25 processors were operational at the time of our experiments.

[Figure B.3: Parallel Speedups on KSR-2 — Profitable Cases. Each graph plots speedup against number of processors (up to 24) for the original loop, manual loop restructuring, and array restructuring (with overhead); all speedups are relative to the execution time of the original loop on 1 processor. Panels: (a) MATMUL, (b) SYR2K, (c) GMTRY, (d) CFFT2D, (e) CHOLSKY.]

[Figure B.4: Parallel Speedups on KSR-2 — Unprofitable Cases (with Manual Override of Compiler Decision). Each graph plots speedup against number of processors for the original loop, manual loop restructuring, and array restructuring (with overhead); all speedups are relative to the execution time of the original loop on 1 processor. Panels: (a) MXM, (b) BTRIX, (c) VPENTA.]

Qualitatively, results for the two machines are similar. Quantitatively, however, both loop and array restructuring improved performance much more significantly on the KSR-2 than on the Power Challenge. Notice that the vertical scales of the KSR-2 graphs are often at least several times those of the corresponding Power Challenge graphs, so as to accommodate the large speedups relative to the original loop that we observed on the KSR-2. We attribute this to the KSR-2's much larger cache line sizes. Since cache space is allocated in very large units (2 KB subcache blocks, 16 KB local cache pages), each cache contains relatively few such units. This in turn leads to frequent cache misses when memory is accessed at long strides, as in the original loops. Moreover, replacing a cache line is especially costly because it potentially displaces a large amount of data. (There is no need to load the same large amount of data, though, because coherence is managed in the smaller units of subblocks and subpages.)
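To make this concrete, using the capacities and allocation units quoted above,

\frac{512\ \text{KB}}{2\ \text{KB}} = 256\ \text{subcache blocks}, \qquad \frac{32\ \text{MB}}{16\ \text{KB}} = 2048\ \text{local cache pages},

so a loop that touches a new allocation unit on nearly every access, as a long-stride traversal does, exhausts these units after only a few hundred to a few thousand distinct accesses.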

[Figure B.5: Parallel Speedups of Original Loop on KSR-2 — Selected Cases. Each graph plots the speedup of the original loop against number of processors, on a smaller vertical scale (up to 24) than in Figure B.3. Panels: (a) MATMUL, (b) SYR2K, (c) GMTRY, (d) CHOLSKY.]

Figure B.5 shows the speedups of the original loops of MATMUL, SYR2K, GMTRY, and CHOLSKY. These curves are the same as those in Figure B.3, only drawn on different vertical scales, because the scales in Figure B.3 make the rise in speedup illegible. With these graphs, we again observe the same trend as we do for the Power Challenge in Section 8.2.

For MATMUL and SYR2K, the original loop parallelized reasonably well. Array restructuring improved performance by roughly the same factor on any number of processors because it improved the locality of execution on each individual processor independently. For GMTRY and especially CHOLSKY, however, the original loop parallelized poorly in part because of false sharing. Both loop and array restructuring remedied the problem; both produced performance that was not only higher in absolute terms but also scaled better with the number of processors.
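To illustrate the kind of false sharing involved, the following C sketch is a hypothetical example; it is not taken from the benchmark sources, and the array size N, the processor count, and the function names are purely illustrative. Processors divide the columns of a row-major array among themselves, so writes by neighboring processors fall only a few bytes apart and land in the same 128-byte coherence unit; transposing the array layout, as array restructuring does for such loops, gives each processor its own contiguous, unit-stride data.

#define N 1024                      /* illustrative array size */

double a[N][N];                     /* row-major in C: a[i][j] and a[i][j+1] are adjacent */
double a_t[N][N];                   /* transposed layout used after restructuring */

/* Original pattern: processor p updates columns j = p, p+P, p+2P, ...
 * Each of its own accesses is N*8 bytes apart (a long stride), and writes by
 * processors p and p+1 to a[i][p] and a[i][p+1] are only 8 bytes apart, so
 * they fall in the same 128-byte coherence unit (false sharing). */
void update_columns(int p, int P) {
    for (int j = p; j < N; j += P)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

/* Restructured pattern: the same logical columns are stored as contiguous
 * rows of a_t, so each processor writes its own unit-stride region and
 * coherence units are not shared between processors (except possibly at
 * region boundaries). */
void update_columns_restructured(int p, int P) {
    for (int j = p; j < N; j += P)
        for (int i = 0; i < N; i++)
            a_t[j][i] += 1.0;
}

int main(void) {
    /* What one of four processors would execute; a real parallel run would
     * launch one such call per processor. */
    update_columns(0, 4);
    update_columns_restructured(0, 4);
    return 0;
}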