Eindhoven University of Technology MASTER Embedded Platform
Total Page:16
File Type:pdf, Size:1020Kb
Eindhoven University of Technology MASTER Embedded platform selection based on the Roofline model applied to video content analysis Spierings, M.G.M.; van de Voort, R.W.P.M. Award date: 2012 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Eindhoven University of Technology Department of Electrical Engineering Electronic Systems Group Master’s Thesis Embedded platform selection based on the Roofline model Applied to video content analysis By Martien Spierings & Rob van de Voort Supervisors: Henk Corporaal Cedric Nugteren Tom Goossens Nick de Koning Embedded platform selection based on the roofline model Science Park Eindhoven, December 8, 2011 Course : Graduation project Master : Embedded Systems Department : Electrical Engineering - Eindhoven University of Technology Chair : Electronic Systems - Eindhoven University of Technology Committee : Eindhoven University of Technology: Henk Corporaal Pieter Jonker Cedric Nugteren Prodrive B.V.: Tom Goossens Authors : Rob van de Voort & Martien Spierings Science Park Eindhoven 5501 5692 EM SON Copyright 2011 Prodrive B.V. All rights are reserved. No part of this book may be reproduced, stored in a retrieval system, or trans- mitted , in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission from the copyright owner. Prodrive B.V. Eindhoven University of Technology P.O. Box 28030 P.O. Box 513 Science Park 5501 Den Dolech 2 5692 EM Son 5612 AZ Eindhoven The Netherlands The Netherlands T +31 (0)40 2676 200 T +31 (0)40 2479 111 Ò +31 (0)40 2676 201 Ò +31 (0)40 2441 692 Í http://www.prodrive.nl Í http://www.tue.nl B [email protected] B [email protected] Abstract An Embedded platform is an information processing system, designed to perform a dedicated func- tion. These systems are embedded into many electronic products, e.g., watches, cars, ovens, and mobile phones. Embedded systems have become part of our daily life and their number, functionality and complexity increased over the years. For instance, in the past a phone only offered functions like calling and text messaging. Nowadays, these basic functions are extended with video calling, multimedia messaging, internet browsing and gaming. These functionalities of an embedded system are often controlled by software that executes on one or more processing units inside the embedded system. Selecting hardware is an essential step in the development of an embedded system, especially selecting the appropriate processing units. It is a key step to the success of an embedded system that the processing units are able to meet the required performance. Performance estimations can be done using a cycle accurate simulator or by running benchmarks, however running a simulation is time consuming and running benchmarks may not be possible, this causes a major problem when a combination of processing units needs to be selected in the early phase of a design. Therefore, we present a method for the selection of processing units based on documentation and a high- level application description. This allows for a fast selection which decreases the time to market for embedded systems. The method focusses on throughput oriented systems and extends the Roofline model to heteroge- neous platforms, in order to give an upper bound for the performance of an application on a platform while providing an insightful visualization of the attainable performance. A platform is the compo- sition of multiple processing units. This method enables the modelling of embedded applications on different processing units without requiring detailed knowledge of the application. It consist of decomposing applications into Generic Computational Blocks (GCBs) from which the number of op- erations and data transfers are extracted. This decomposition is then used to match the GCBs to the available hardware model and to select a set of processing units to form a computational platform for the application. The method is validated by using micro-benchmarks and an real-life application from the video content analysis domain. The theoretical estimates constructed with this method match the real-life performance of an processing unit for 95%. However the memory estimate is only accurate for 40% up to 80% due to the complex implementation of the memory controllers involved. i The method uses an extended version of the Roofline model. These extensions make the method suited for fast selection of processing units for an application. The guideline constructed from this method is usable by a designer to see different design trade-offs. ii Contents List of Figures iv List of Tables vi 1 Introduction 1 1.1 Related work . .2 1.2 Motivation . .6 1.3 Thesis outline . .7 2 Method for selecting an embedded platform 9 2.1 The Roofline model and its shortcomings . 10 2.2 Contributions to the Roofline Model . 14 2.3 Roofline estimation errors . 17 2.4 Determining a Roofline model . 19 2.5 Application utilisation Roofline . 23 2.6 Verifying a Roofline Model . 26 3 CPU Roofline model 29 3.1 Background . 29 3.2 Determining the theoretical Roofline . 33 3.3 Determining the practical Roofline . 36 3.4 Intel Atom E6xx . 39 3.5 Intel Xeon E5540 . 43 iii CONTENTS 3.6 Texas Instruments OMAP4430 . 46 3.7 Performance restrictions . 49 3.8 Conclusions . 51 4 DSP Roofline model 53 4.1 Background . 53 4.2 Determining the Roofline for the Omap 137 . 55 4.3 Additional properties . 57 4.4 Conclusion . 58 5 GPU Roofline model 59 5.1 Background . 59 5.2 Determining the theoretical Roofline . 63 5.3 Determining the practical Roofline . 63 5.4 Performance restrictions . 67 5.5 Conclusion . 70 6 FPGA Roofline model 71 6.1 FPGA basics . 71 6.2 A Roofline for the FPGA . 72 6.3 The FPGA computational Roofline . 76 6.4 Determining the practical Roofline . 80 6.5 Performance restrictions . 81 6.6 Conclusions . 81 7 The decomposition of VCA applications 83 7.1 VCA applications . 84 7.2 Describing Generic Computational Blocks . 84 7.3 Operations per byte of a GCB . 87 7.4 VCA use case . 89 iv 7.5 Multiple GCBs on one processing unit . 98 7.6 Pipelining GCBs on one processing unit . 100 7.7 Platform . 102 8 From application to platform, a guideline 107 8.1 Guideline prerequisites . 107 8.2 Processing unit selection guideline . 108 8.3 Example usage of the guideline . 110 9 Conclusions 117 9.1 Future work . 119 v List of Figures 1.1 Design space exploration . .3 1.2 Processor design space, image from [7] . .6 2.1 Research Approach . .9 2.2 Model assumed by the Roofline model . 11 2.3 Original Roofline model . 12 2.4 Differences between the original and the extended Roofline model . 13 2.5 Different types of operations result in points in the Roofline model . 15 2.6 Extra data sources in the Roofline model . 17 2.7 Extensions to the original Roofline . 18 2.8 Simplified processing unit model . 20 2.9 Utilisation Roofline for an application and processing unit . 26 2.10 Micro-benchmarking concept . 27 3.1 Basic components of a stored instruction digital computer . 30 3.2 Basic CPU concept withArithmetic Logic Unit (ALU), memory and CU) . 31 3.3 Modern CPUs contain multiple ALUs to process operands in parallel . 31 3.4 Intel Atom tunnel-creek micro-architecture instruction ports . 40 3.5 Intel E630 benchmark results for integer, single precision float and SIMD . 42 3.6 Intel E630 benchmark results for intrinsics, assembly and cache . 43 3.7 Intel Nehalem micro-architecture instruction ports . 44 vii LIST OF FIGURES 3.8 Intel Xeon 5540 benchmark results . 45 3.9 Intel Xeon 5540 benchmark results for intrinsics, assembly and cache measurements 46 3.10 Cortex-A9 instruction ports . 47 3.11 Texas Instruments OMAP4430 micro-benchmark results . 48 3.12 X86 ISA registers . 50 3.13 Register of the ARM Instruction Set Architecture (ISA) and extensions to it such as NEON and VFPv3 . 51 4.1 Early DSP architectures . 54 4.2 TI DSP C674x architecture . 56 4.3 Measured roof line of DSP . 57 4.4 Software pipelining . 58 5.1 NVIDIA Architectures . 60 5.2 AMD VLIW5 architecture . 61 5.3 Cuda and OpenCL programming model . 62 5.4 NVIDIA FX1700 ..