Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

ABSTRACT Title of dissertation: HEAP DATA ALLOCATION TO SCRATCH-PAD MEMORY IN EMBEDDED SYSTEMS Angel Dominguez Doctor of Philosophy, 2007 Dissertation directed by: Professor Rajeev K. Barua Department of Electrical and Computer Engineering This thesis presents the first-ever compile-time method for allocating a portion of a program’s dynamic data to scratch-pad memory. A scratch-pad is a fast directly addressed compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees vs cache and by its signifi- cantly lower overheads in access time, energy consumption, area and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global and stack data (static data) in scratch-pad memory; heap and recursive-function objects(dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move heap data back and forth between scratch-pad and DRAM to better track the program’s locality characteristics. With our method, code, global, stack and heap variables can share the same scratch-pad. When compared to placing all dynamic data variables in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3%, and the average power consumption by 26.7%, for the same size of scratch-pad fixed at 5% of total data size. Significant savings in runtime and energy were also observed when compared against cached memory organizations, showing our method’s success with SPM placement of dynamic data under constrained memory sizes. HEAP DATA ALLOCATION TO SCRATCH-PAD MEMORY IN EMBEDDED SYSTEMS by Angel Dominguez Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2007 Advisory Committee: Professor Rajeev K. Barua, Chair/Advisor Professor Manoj Franklin Professor Shuvra S. Bhattacharrya Professor Peter Petrov Professor Chau-Wen Tseng c Copyright by Angel Dominguez 2007 Table of Contents List of Figures iv 1 Introduction 1 1.1 OrganizationofThesis . .. .. 11 2 Embedded Systems and Software Development 13 2.1 EmbeddedSystems ............................ 14 2.2 IntelStrongARMMicroprocessor . 22 2.3 Embedded Software Development . 26 2.4 CLanguageCompilers . .. .. 33 2.5 HeapDataAllocation. .. .. 39 2.6 RecursiveFunctions. .. .. 44 3 Previous Work on SPM allocation 47 3.1 OverviewofRelatedResearch . 47 3.2 StaticSPMAllocationMethods . 48 3.3 Dynamic SPM Allocation Techniques . 53 3.4 Existing Methods For Dynamic Program Data . 58 3.5 Heap-to-Stack Conversion Techniques . ... 60 3.6 MemoryHierarchyResearch . 62 3.7 DynamicMemoryManagerResearch . 68 3.8 OtherRelatedMethods. 69 5 Dynamicallocationofstaticprogramdata 73 5.1 Overviewforstaticprogramallocation . ... 74 5.2 TheDynamicProgramRegionGraph. 78 5.3 Allocation Method for Code, Stack and Global Objects . ..... 83 5.4 AlgorithmModifications . 92 5.5 LayoutandCodeGeneration. .100 5.6 Summaryofresults . .. .. .104 5 Dynamic program data 106 5.1 Understanding dynamic data in software . 107 5.2 Obstacles to optimizing software with dynamic data . .117 5.3 CreatingtheDPRGwithdynamicdata. .122 6 Compiler allocation of dynamic data 131 6.1 Overview of our SPM allocation method for dynamic data . .132 6.2 PreparingtheDPRGforallocation . 136 6.3 Calculating Heap Bin Allocation Sizes . 138 6.4 Overviewoftheiterativeportion . 141 6.5 TransferMinimizations . .142 6.6 HeapSafetyTransformations. 144 ii 6.7 Memory Layout Technique for Address Assignment . .149 6.8 FeedbackDrivenTransformations . 153 6.9 TerminationofIterativesteps . 155 6.10 Code generation for optimized binaries . .156 7 Robust dynamic data handling 159 7.1 GeneralOptimizations . .159 7.2 Recursive function stack handling . 164 7.3 Compile-time Unknown-Size Heap Objects . 171 7.4 ProfileSensitivity .............................184 8 Methodology 194 8.1 TargetHardwarePlatform . .195 8.2 SoftwarePlatformRequirements. 199 8.3 CompilerImplementation . .203 8.4 SimulationPlatform . .211 8.5 BenchmarkOverview . .215 8.6 BenchmarkClasses . .. .. .216 8.7 BenchmarkSuite .............................219 9 Results 223 9.1 DynamicHeapAllocationResults . 224 9.1.1 Runtimeandenergygain. .224 9.1.2 TransferMethodComparison . .228 9.1.3 Reduction in Heap DRAM Accesses . 232 9.1.4 EffectofvaryingSPMsize . .235 9.2 Unknown-size Heap Allocation . 236 9.3 Recursive Function Allocation . 239 9.4 Comparisonwithcaches . .247 9.5 ProfileSensitivity .............................254 9.5.1 Non-ProfileInputVariation . .258 9.6 CodeAllocation..............................264 10 Conclusion 268 10.1 PrimaryHeapAllocationResults . 270 10.2 CacheComparisonResults . .274 10.3 Profile Sensitivity Results . 283 Bibliography 287 iii List of Figures 1.1 Example of heap allocation using our method . .. 8 2.1 Diagramoftypicaldesktopcomputer.. .. 15 2.2 Diagram of typical embedded computer. 16 2.3 Memory types common to embedded platforms . 18 2.4 Comparison between popular embedded memory types. .... 22 2.5 Diagram of the Intel StrongARM embedded cpu. 23 2.6 Compilation of an application from source files. ..... 34 2.7 Compilerviewofprogrammemory . 39 2.8 Sample memory layout for an embedded application . .... 41 2.9 Heapmanagerexample. .. .. 42 2.10 Stackgrowthofarecursivefunction. ... 46 5.1 DPRGcreatedforasampleprogram. 80 5.2 Algorithm for dynamic allocation of static program data........ 85 5.3 DPRG enhanced with code regions. 90 5.1 Memory map for a typical ARM application. 109 5.2 Example of a recursive data structure. 116 5.3 Exampleprogramfragment . .124 5.4 DPRG showing a heap allocation site. 125 5.5 DPRGforasamplefunctionwithheapdata.. 127 6.1 Algorithm for dynamic allocation of heap data. .134 6.2 Calculating heap bin sizes for allocation. .140 6.3 Allocation scenario for an example program . .150 iv 7.1 DPRGofarecursivefunction. .167 7.2 Binary tree showing access frequency. .169 7.3 Sample Program containing unknown-size heap allocation.......177 7.4 Sample Function containing unknown-size heap allocation. .189 8.1 GCC compiler flow for an application. 206 8.2 Mainstagesofourallocationalgorithm. .207 8.3 BenchmarkSuiteInformation-Part1 . 221 8.4 BenchmarkSuiteInformation-Part2 . 221 8.5 BenchmarkSuiteInformation-Part3 . 222 9.1 Runtime gain from our method for the default scenario. .225 9.2 Energy savings from our method for the default scenario . .227 9.3 Runtime results from using different transfer methods. .......229 9.4 Power consumption using different transfer methods . .231 9.5 Percentage of heap accesses going to DRAM after allocation. .232 9.6 Effects of varying DRAM latency on runtime gain(Part 1). .233 9.7 Effects of varying DRAM latency on runtime gain(Part 2). .234 9.8 Effect of varying SPM size on runtime gain using our approach. .235 9.9 Normalized runtime for unknown-size benchmark set. .236 9.10 Normalized energy consumption for unknown-size benchmark set. 237 9.11 Normalized runtime for unknown-size benchmark set when varying SPMsize. .................................239 9.12 Normalized energy usage for unknown-size benchmark set when vary- ingSPMsize. ...............................240 9.13 Normalized runtime for recursive benchmark set. .......242 9.14 Reduction in energy consumption for recursive benchmarkset. .243 v 9.15 Normalized runtime for recursive benchmark set at 25% SPM. .245 9.16 Normalized energy usage for recursive benchmark set at 25% SPM. 246 9.17 Average normalized runtimes for benchmarks using combinations of SPMandcache...............................249 9.18 Averaged normalized energy usage for benchmarks using combina- tionsofSPMandcache. .251 9.19 Normalized runtimes for benchmarks using the second benchmark inputset. .................................255 9.20 Normalized energy usage for benchmarks using the second benchmark inputset. .................................256 9.21 Average runtime gain for benchmarks using both benchmark input sets. ....................................257 9.22 Average energy savings for benchmarks using both benchmark input sets. ....................................258 9.23 Normalized runtime for benchmarks showing profile input sensitivity. 260 9.24 Normalized energy usage for benchmarks showing profile input sensitivity....................................261 9.25 Improvement in runtime from profile averaging passes. .......262 9.26 Improvement in energy usage from profile averaging passes.. .263 9.27 Runtime gain when code as well as global, stack and heap data are allocated. .................................265 9.28 Energy savings when code as well as global, stack and heap data are allocated. .................................267 10.1 Normalized runtime for recursive applications when SPM size varies(Part 1). .....................................270

Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

Analyzing the Benefits of Scratchpad Memories for Scientific Matrix

Memory Architectures 13 Preeti Ranjan Panda

Gpgpus: Overview Pedagogically Precursor Concepts UNIVERSITY of ILLINOIS at URBANA-CHAMPAIGN

Hardware and Software Optimizations for GPU Resource Management

Improving Memory Subsystem Performance Using Viva: Virtual Vector Architecture

A Dynamic Scratchpad Memory Unit for Predictable Real-Time Embedded Systems

Architectural Exploration of Heterogeneous Memory Systems Marcos Horro, Gabriel Rodr´Iguez, Juan Tourino,˜ Mahmut T

Designing Scratchpad Memory Architecture with Emerging STT-RAM Memory Technologies

5. Embedded Systems Dilemma of Chip Memory Diversity By

Application-Specific Memory Management in Embedded Systems Using Software-Controlled Caches

FPGA Implementation of a Configurable Cache/Scratchpad

Compiler-Directed Scratchpad Memory Management Via Graph Coloring