Energy Modelling of Multi-Threaded, Multi-Core Software for Embedded Systems

Energy modelling of multi-threaded, multi-core software for embedded systems Steven P. Kerrison A thesis submitted to the University of Bristol in accordance with the requirements of the degree Doctor of Philosophy in the Faculty of Engineering, Department of Computer Science, September 2015. 51,000 words. Copyright © 2015 Steven P. Kerrison, some rights reserved. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. http://creativecommons.org/licenses/by-nc-nd/4.0/ 3 Abstract Efforts to reduce energy consumption are being made across all disciplines. ICT's contribution to global energy consumption and by-products such as CO2 emissions con- tinues to grow, making it an increasingly significant area in which improvements must be made. This thesis focuses on software as a means to reducing energy consumption. It presents methods for profiling and modelling a multi-threaded, multi-core embedded processor at the instruction set level, establishing links between the software and the energy consumed by the underlying hardware. A framework is presented that profiles the energy consumption characteristics of a multi-threaded processor core, associating energy consumption with the instruction set and parallelism present in a multi-threaded program. This profiling data is used to build a model of the processor that allows instruction set simulations to be used to estimate the energy that programs will consume, with an average of 2.67 % error. The profiling and modelling is then raised to the multi-core level, examining a channel based message passing system formed of a network of embedded multi-threaded processors. Additional profiling is presented that determines network communication costs as well as giving consideration towards system level properties such as power sup- ply efficiency. Then, this is used to form a system level energy model that can estimate consumption using simulations of multi-core programs. The system level model com- bines multiple instances of a core energy model with a network level communication cost model. The broader implications of this work are explored in the context of other embedded and multi-core processor architectures, identifying opportunities for expanding or transferring the models. The models in this thesis are formed at the instruction set level, but have been demonstrated to be effective at higher-levels of abstraction than instruction set simulation, through their support of further work carried out externally. This work is enabled by several pieces of development effort, including a profiling framework for taking power measurements of the devices under investigation, tools for programming, routing and debugging software on a multi-core hardware platform called Swallow, and enhancements to an instruction set simulator for the simulation of this multi-core system. Through the work of this thesis, an embedded software developer for multi-threaded and multi-core systems is equipped with tools, techniques and new understanding that can help them in determining how their software consumes energy. This raises the status of energy efficiency in the software development cycle as we continue our efforts to minimise the energy impact of the world's embedded devices. 5 Acknowledgements I owe the successful completion of this PhD thesis to a great many people, and I can directly thank but a few of them here. If I interacted with you during the course of my research, please know that I am eternally grateful for that. Thank you to my family for supporting and encouraging me throughout. Thank you to my supervisor, Kerstin Eder, whose guidance helped me develop a compelling research topic and secure funding for my work, without which none of this would have been possible. Thank you to Simon Hollis and Jake Longo Galea for creating a rather interesting set of research problems for us to collectively solve in the Swallow platform. David May, your thought provoking discussions have been invaluable and inspiring. Many thanks to my external examiners, Alex Yakovlev and Peter Marwedel, as well as my internal coordinator JoséLuis Nú~nez-Yá~nez. My colleagues and companions in research deserve much gratitude for their input, collaboration, and of course their tolerance. Jamie Hanlon, Roger Owens, Neville Grech, Kyriakos Georgiou, Jeremy Morse and James Pallister, you and many others in the department made research an exciting experience. A special thank you to Dejanira Araiza Illan for your support, particularly during the write-up. In the first year of my studies I was hosted by XMOS. This was an excellent place to form ideas, gain industrial insight and motivate my work. In particular, thanks to Henk Muller, John Ferguson, Matt Fyles and Richard Osborne for their expert advice and support. My work was funded by a University of Bristol PhD Scholarship, and much of it became relevant to the ENTRA EU FP7 FET research project. I am grateful to these funding sources for making this work possible, and for creating an ecosystem in which to disseminate and further explore this work. 7 Author's declaration I declare that the work in this dissertation was carried out in accordance with the requirements of the University's Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Except where indicated by specific reference in the text, the work is the candidate's own work. Work done in collaboration with, or with the assistance of, others, is indicated as such. Any views expressed in the dissertation are those of the author. Signed: Date: 9 Contents List of Figures 13 List of Tables 15 List of Code Listings 15 1. Introduction 17 1.1. Research questions and thesis . 18 1.2. Contributions . 20 1.3. Structure . 22 1.4. Terminology and conventions . 23 I. Background 25 2. Parallelism and concurrency in programs and processors 29 2.1. Concurrent programs and tasks . 29 2.2. Parallelism in a single core . 33 2.3. Multi-core processing . 35 2.4. Summarising parallelism and concurrency . 36 3. Energy modelling 39 3.1. Hardware energy modelling . 40 3.2. Software energy modelling . 42 3.3. Summary . 44 4. Influencing software energy consumption in embedded systems 47 4.1. Forming objectives to save energy in software . 47 4.2. Energy's many relationships . 50 4.3. Can we sit back and let Moore's Law do the work? . 54 4.4. Efficiency through event-driven paradigms . 56 4.5. Summary . 56 5. A multi-threaded, multi-core embedded system 59 5.1. The XS1-L processor family . 59 5.2. Swallow multi-core research platform . 65 5.3. Research enabled by the XS1-L and Swallow . 71 II. Constructing a multi-threaded, multi-core energy model 73 6. Model design and profiling of an XS1-L multi-threaded core 77 6.1. Strategy . 77 6.2. Profiling device behaviour . 77 6.3. Model design considerations . 79 6.4. XMProfile: A framework for profiling the XS1-L . 80 6.5. Generating tests . 83 6.6. Profiling summary . 85 11 Contents 7. Core level XS1-L model implementation 87 7.1. Workflow . 87 7.2. A preliminary model . 88 7.3. Preliminary model evaluation . 97 7.4. An extended core energy model . 100 7.5. Evaluation of the extended model . 104 7.6. Beyond simulation . 107 7.7. Summary . 109 8. Multi-core energy profiling and model design using Swallow 111 8.1. Core energy consumption on Swallow . 111 8.2. Network communication energy profiling . 113 8.3. Determining communication costs . 115 8.4. Summary of Swallow profiling . 117 9. Implementing and testing a multi-core energy model 119 9.1. Workflow . 119 9.2. Core and network timing simulation in axe ...................... 120 9.3. Communication aware modelling . 122 9.4. Displaying multi-core energy consumption data . 126 9.5. Demonstration and evaluation . 128 9.6. I/O as an adaptation of the network model . 132 9.7. Summary . 133 10.Beyond the XS1 architecture 135 10.1. Epiphany . 135 10.2. Xeon Phi . 137 10.3. Multi-core ARM implementations . 139 10.4. EZChip Tile processors . 140 10.5. Summary of model transferability . 141 11.Conclusions 143 11.1. Review of thesis contributions . 143 11.2. Building a multi-core platform for energy modelling research . 144 11.3. ISA-level energy modelling for a multi-threaded embedded processor . 144 11.4. Multi core software energy modelling from a network perspective . 145 11.5. The transferability of multi-threaded, multi-core models . 146 11.6. Writing energy efficient multi-threaded embedded software . 147 11.7. Future work . 148 11.8. Concluding remarks . 149 List of acronyms 151 Bibliography 155 12 List of Figures 2.1. A multi-threaded task structure in a USB audio application . 31 2.2. An abstract example of instruction flow through a super-scalar processor . 34 4.1. Power savings for an Ethernet receiving with DVFS . 54 4.2. CPU frequencies since 1972 . 55 5.1. XS1 architecture block diagram . 60 5.2. Channel communication in the XS1 ISA . 63 5.3. Photos of the Swallow platform . 65 5.4. Dual-core XS1-L link topology . 66 5.5. Swallow board JTAG chain . 68 5.6. Swallow network toplogy . 69 6.1. Process undertaken to profile/model the XS1-L core . 78 6.2. XMProfile test harness hardware and software structure . 81 6.3. Test harness process flow . 82 7.1. XMTraceM workflow for a single-core multi-threaded XMOS device . 87 7.2. Active and inactive thread costs for the XS1-L processor . 89 7.3. Instruction power heat-maps for the XS1-L . 91 7.4. Data constrained instruction power heat-maps for the XS1-L . 93 7.5.

Energy Modelling of Multi-Threaded, Multi-Core Software for Embedded Systems

Advanced Data Structures

Fast As a Shadow, Expressive As a Tree: Hybrid Memory Monitoring for C

Rockjit: Securing Just-In-Time Compilation Using Modular Control-Flow Integrity

Speculative Separation for Privatization and Reductions

Download/Repository/Ivan Fratric.Pdf, 2012

Practical Analysis of Framework-Intensive Applications

Detecting Cacheable Data to Remove Bloat

IAR C/C++ Compiler User Guide

Energy Modelling of Multi-Threaded, Multi-Core Software for Embedded Systems

Shadow State Encoding for Efficient Monitoring of Block-Level Properties

Fast As a Shadow, Expressive As a Tree: Hybrid Memory Monitoring for C∗

Control-Flow Integrity: Precision, Security, and Performance