A Methodology for Efficient Multiprocessor System-on-Chip Software Development

Dissertation approved by the Faculty of Electrical Engineering and Information Technology of Rheinisch-Westfälische Technische Hochschule Aachen (RWTH Aachen University) for the academic degree of Doktor der Ingenieurwissenschaften (Doctor of Engineering)

Submitted by M.Sc. Jianjiang Ceng from Zhejiang, China

Reviewers: Univ.-Prof. Dr. rer. nat. Rainer Leupers and Univ.-Prof. Dr.-Ing. Gerd Ascheid

Date of the oral examination: 28.04.2011

This dissertation is available online on the websites of the university library.

Contents

1 Introduction
  1.1 Background
  1.2 MPSoC Software Development Challenge
    1.2.1 Programming Model
    1.2.2 Application Parallelization
    1.2.3 Mapping and Scheduling
    1.2.4 Code Generation
    1.2.5 Architecture Modeling and Simulation
    1.2.6 Summary
  1.3 Thesis Organization

2 Related Work
  2.1 General Purpose Multi-Processor
    2.1.1 Cell
  2.2 General Purpose Graphics Processing Unit
  2.3 Embedded Multi-Processor System-on-Chip
    2.3.1 MPCore
    2.3.2 IXP
    2.3.3 OMAP
    2.3.4 CoMPSoC
    2.3.5 SHAPES
    2.3.6 Platform 2012
    2.3.7 HOPES
    2.3.8 Daedalus
    2.3.9 MPA
    2.3.10 TCT
  2.4 Summary

3 Methodology Overview
  3.1 MPSoC Application Programming Studio (MAPS)
    3.1.1 Application Modeling
    3.1.2 Architecture Modeling
    3.1.3 Profiling
    3.1.4 Control/Data Flow Analysis
    3.1.5 Partitioning
    3.1.6 Scheduling & Mapping
    3.1.7 Code Generation
    3.1.8 High-Level Simulation
    3.1.9 Target MPSoC Platform
    3.1.10 Integrated Development Environment
  3.2 Contribution of This Thesis

4 Architecture Model
  4.1 Performance Estimation and Architecture Model
  4.2 Related Work
  4.3 Processing Element Model
    4.3.1 C Compilation Process
    4.3.2 Operation Cost Table
    4.3.3 Computational Cost Estimation
  4.4 Architecture Description File
    4.4.1 Processing Element Description
    4.4.2 Architectural Description
  4.5 Summary

5 Profiling
  5.1 Related Work
  5.2 Profiling Process Overview
  5.3 Trace Generation
    5.3.1 LANCE Frontend
    5.3.2 Instrumentation
    5.3.3 Profiling Runtime Library
    5.3.4 Trace File
  5.4 Post-Processing
    5.4.1 IR Profile Generation
    5.4.2 Source Profile Generation
    5.4.3 Call Graph Generation
  5.5 Summary

6 Control/Data Flow Analysis
  6.1 Control Flow Graph
  6.2 Statement Control Flow Graph
  6.3 Statement Control/Data Flow Graph
    6.3.1 Data Dependence Analysis
  6.4 Weighted Statement Control/Data Flow Graph
    6.4.1 Node Weight Annotation
    6.4.2 Control Edge Annotation
    6.4.3 Data Edge Annotation
  6.5 Summary

7 Partitioning
  7.1 Related Work
  7.2 Task Granularity
  7.3 Coupled Block
    7.3.1 CB Definition
    7.3.2 Schedulability Constraint
    7.3.3 Data Locality Constraint
  7.4 CB Generation and WSCDFG Partitioning
    7.4.1 WSCDFG Partition
    7.4.2 Partition Optimality
    7.4.3 CAHC Algorithm
  7.5 Global Analysis
  7.6 User Refinement
  7.7 Summary

8 High-Level Simulation
  8.1 Related Work
  8.2 MVP Overview
  8.3 MVP Simulator
    8.3.1 Virtual Processing Element
    8.3.2 Generic Shared Memory
    8.3.3 User Interface and Virtual Peripheral
  8.4 Software Tool Chain
    8.4.1 Execution Environment
    8.4.2 Code Generation
  8.5 Debug Method
  8.6 Summary

9 Case Study
  9.1 JPEG Encoder
    9.1.1 TCT Tools
    9.1.2 Profiling
    9.1.3 Partitioning
    9.1.4 Simulation Result and Further Refinement
  9.2 H.264 Baseline Decoder
    9.2.1 Target Platform
    9.2.2 Profiling
    9.2.3 Partitioning
    9.2.4 Target Code Generation
    9.2.5 Simulation Results
  9.3 Summary

10 Summary and Outlook
  10.1 Summary
  10.2 Outlook

A MAPS Architecture Description XML Schema
  A.1 Component Library
  A.2 Architecture Model

B Pseudo Code for Profiling
  B.1 Code Instrumentation
  B.2 Function Call Cost Calculation
  B.3 IR Statement Execution Cost Calculation
  B.4 Call Graph Generation

C MAPS Application Profile XML Schema

D Pseudo Code for Control/Data Flow Analysis
  D.1 SCFG Construction
  D.2 SCDFG Construction
  D.3 Data Dependence Analysis
    D.3.1 Static Data Analysis
    D.3.2 Dynamic Analysis
  D.4 WSCDFG Construction
    D.4.1 Node Weight Annotation
    D.4.2 Control Edge Weight Annotation
    D.4.3 Data Edge Weight Annotation

Bibliography

Chapter 1 Introduction

1.1 Background

Our lifestyles have been greatly changed by the many new electronic devices that continuously appear in today's market. For example, Personal Navigation Assistants (PNAs) help people find driving directions without the need to read a map; Personal Digital Assistants (PDAs) let people carry a vast amount of information with them in a pocket; Portable Multimedia Players (PMPs) allow people to enjoy music and video content while they travel; and, last but not least, mobile/smart phones have fundamentally changed the way in which people are connected with each other. Behind these different devices, it is the advancement of technologies, especially the rapidly progressing semiconductor technology, that makes their development and production possible. Figure 1.1 shows the transistor count of some processors introduced in the past forty years.

Figure 1.1: Moore's Law & Transistor Count of Some Intel Processors

It can be seen that the semiconductor industry has been advancing at a spectacular pace, as predicted by Moore's law [1], i.e., the number of transistors that can be fabricated on a single silicon chip doubles every two years. In order to utilize this massive amount of transistors, designers started integrating multiple processor cores into one chip several years ago. One of the reasons for this change is power dissipation. Since a processor's power dissipation increases dramatically when its clock frequency is raised [2], it would soon be necessary to use water cooling if cranking up the clock frequency were the only way to improve performance. By employing multiple processors, the overall computational power is improved without increasing the clock speed; hence, a balance between performance and energy consumption can be achieved. Today, multiprocessing is seen as the only solution for building high-performance computing systems. In the embedded area, this trend appears as the Multiprocessor System-on-Chip (MPSoC) [3]. Companies have already developed many MPSoC platforms for multimedia and wireless communication applications. For example, TI's OMAP (Open Multimedia Application Platform) [4] product line targets the mobile phone and personal multimedia device market. Since its first announcement in 2003, three generations of OMAP processors have been put on the market and used by different mobile phone manufacturers. Apart from that, other companies have also developed their own MPSoC platforms in the past few years, e.g., the Freescale i.MX [5], the ST-Ericsson Nomadik [6], the NXP Nexperia [7], the ATMEL DIOPSIS [8], the SAMSUNG mobile SoC [9] and the Qualcomm Snapdragon [10]. By now, the MPSoC has become the de facto architecture for the design of high-performance embedded processors with low energy consumption. To give an example of what an MPSoC can look like, Figure 1.2(a) shows the block diagram of the upcoming next-generation OMAP processor, the OMAP44x, which is designed specifically for mobile multimedia telecommunication devices. It mainly consists of two ARM Cortex-A9 MPCore RISC processors, a C64x VLIW DSP based programmable video accelerator, a PowerVR 2D/3D graphics accelerator, an image signal processor, a number of peripherals for video, camera, IO, etc., and the interconnection between the functional components. On the right-hand side, Figure 1.2(b) shows the block diagram of the latest AMD Phenom X4 quad-core processor [11], which contains four identical processor cores, a large block of L3 cache, a memory controller and several interconnection components. Comparing both processors from the architectural perspective, it can be seen that the general purpose multiprocessor has fewer functionalities built in than the embedded MPSoC, although it is constructed from several hundred million transistors, several times more than the latter. There are reasons for MPSoC designers to integrate so many heterogeneous computation and IO components into one chip. Hardware accelerators and programmable DSPs used in MPSoCs are normally optimized for a specific application or application domain; therefore, they can achieve high performance with low energy consumption, which is crucial for today's battery-driven mobile devices.
In addition, the high-level integration of processing elements and IO peripherals greatly reduces the number of external components required to build a complete system; the overall manufacturing cost of the product is thereby reduced, which is very important in the highly cost-sensitive consumer electronics market. Moreover, MPSoCs are not always designed from scratch. Often, only by reusing components from a previous generation or a third-party IP vendor are MPSoC designers able to meet the time-to-market requirement. It can be seen that, in order to meet these stringent design criteria, designers have to develop embedded MPSoCs very differently from general purpose multiprocessors. Nonetheless, even though the use of multiple application/domain-specific processors helps designers to achieve their design goals, it also brings a significant side effect: software complexity. Presently, the problem of how to program software efficiently is deemed to be one of the biggest challenges for MPSoC design [12].

Figure 1.2: Embedded and General Purpose Multiprocessor Chips. (a) TI OMAP44x Processor; (b) AMD Phenom X4 Processor

1.2 MPSoC Software Development Challenge

For decades, software developers have been trained to think in terms of sequential programming languages and to write applications that run on uniprocessor machines. In fact, this sequential method has proven very successful in providing a comfortable and efficient programming environment for software developers. The success is mainly due to the fact that the sequential way of programming is very similar to the natural way of human thinking, and decades of compiler research have made compilers for high-level programming languages like C/C++ so sophisticated that most of the hardware details of uniprocessor machines can be hidden from programmers. Moreover, graphical Integrated Development Environments (IDEs) like Visual Studio [13] or Eclipse [14] are nowadays commonly used by programmers and provide a completely integrated solution from source code editing and compilation to debugging. This further helps to reduce the difficulty of software development. All these factors have contributed to the success of the existing hardware/software ecosystem in the uniprocessor era. Nevertheless, now that MPSoCs are massively employed in new designs, software developers discover that the old programming method cannot keep up with the changes of the platform. The underlying hardware architecture of an MPSoC is dramatically different from that of traditional uniprocessor machines:

• Multiprocessing: as the name suggests, MPSoCs have more than one programmable processing element (PE), whose computational performance needs to be exploited by software developers. Uniprocessor-oriented tools, however, can unlock only a fraction of the potential that MPSoCs have.


• Heterogeneity: often, the PEs built into MPSoCs are optimized for one application or an application domain by using special instruction set architectures (ISAs) like VLIW or SIMD. Consequently, due to the huge differences between ISAs, different optimization techniques are required in order to produce efficient code. Unfortunately, compilers designed for uniprocessor machines can barely provide satisfactory support for this.

• Memory Architecture: when programming single-processor machines, software developers seldom need to take care of memory, because it is available in one address space. The situation is completely different in MPSoCs, where memories are organized non-uniformly. Each PE can have its own private locations for storing data which only it needs, and PEs can share some memory blocks for communication purposes. Sometimes, even implementing a simple functionality requires the data to be carefully placed in appropriate locations, as the sketch after this list illustrates.
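To make the data placement issue concrete, the following minimal sketch shows one common mechanism for it. The section names are invented for illustration, and the actual mapping of sections onto physical memories depends entirely on the target toolchain and linker script; this is not tied to any particular tool discussed in this thesis.

    /* Illustrative only: with GCC-style section attributes, data can be pinned
     * to a named section that a linker script then maps onto a specific
     * physical memory. The section names below are hypothetical. */
    int coeff_table[64]    __attribute__((section(".pe_private")));  /* PE-local scratchpad */
    int frame_buffer[4096] __attribute__((section(".shared_mem")));  /* shared communication memory */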

Because of these hardware changes, the traditional holistic compiler solution is not able to support MPSoC programming efficiently. Instead of one compiler for all applications, programmers are faced with a number of tools for just one MPSoC application. They are given a programming environment which has little in common with that of a uniprocessor. A number of new challenges need to be solved for the programming of MPSoCs.

1.2.1 Programming Model

For any software development process, there is always a description of the target application, which is created by using a given programming model. Programming models define the fundamental elements of a program (such as objects, functions, variables, etc.) and the steps that can be used to compose a computation (such as assignments, branches, etc.). Developers use programming models to create application descriptions, which are translated by software tools like compilers into machine-executable binary code. Eventually, how fast the binary executable can run, i.e., the application performance, strongly depends on the starting point of the whole development process: the programming model. Generally, programming models can be classified into two categories, sequential and parallel models.

Sequential Model

In the past few decades, the dominant programming model has always been the sequential model. For uniprocessor machines, it fits naturally, because the hardware beneath can do only one thing at a time. No matter how complex an application is, it must be programmed such that everything is done one after another. As a matter of fact, nearly 90% of software developers nowadays use the sequential C and C++ [15] programming languages. Compiling an application written in such sequential programming languages for a single processor is a mature field. Even for processors with complex micro-architectures, much progress has been made in the past few years: different optimization techniques have been proposed for superscalar processors [16], for DSPs [17], for VLIW processors [18] and for exploiting SIMD instructions [19]. Due to these advanced compiler technologies, the hardware details of a single processor are well hidden from programmers when a sequential programming model is used. The maturity of single-processor compilation techniques has made programmers accustomed to sequential programming. For decades, they have been educated to program sequentially for uniprocessor machines.

It is not easy for them to switch from the way they are used to working and to describe applications in a parallel manner. Apart from that, there are millions of lines of sequential legacy code that cannot easily be rewritten within a short period of time. All this indicates that the sequential programming model will continue to be widely used for a long time. In the MPSoC area, however, the underlying architectures of multiprocessor machines are so different from those of uniprocessor machines that the old method can no longer hide the hardware details and shows its limitations. When an application is written in a programming language like C, its intrinsic parallelism is completely hidden by the sequential, control-flow oriented semantics. Compilers have a hard time parallelizing such a sequential program, not only because of the hidden parallelism but also because of the lack of application knowledge, which makes it difficult to find the most suitable parallelization strategy. In the meantime, parallel programming models have long been used in the high-performance computing (HPC) area, where parallel processing is the only way to make use of massively parallel machines. For developers, it is natural to think of using a parallel programming model, which better matches the underlying hardware.

Parallel Model

Historically, parallel programming has been considered to be "the high end of computing" and has been used to solve difficult scientific and engineering problems in areas like atmospheric simulation, physics, bioscience, etc. Such problems often have one characteristic in common, inherent parallelism, i.e., a big problem can be divided into a number of small sub-problems to be solved simultaneously. To model them, several parallel programming models are commonly used in the HPC community: threads (e.g., POSIX Threads, PThreads [20]), message passing (e.g., the Message Passing Interface, MPI [21]) and data parallelism (e.g., Open Multi-Processing, OpenMP [22]). Nevertheless, the acceptance of these models in the embedded community is still limited. The main obstacle is the overhead of their implementation. Unlike scientific problems, which usually feature massive parallelism, embedded applications do not often show a regularity that can easily be exploited for parallelization. Even for streaming-oriented embedded multimedia applications, the granularity is often too fine to afford the overhead introduced by the complex software stacks that implement a sophisticated parallel programming model like OpenMP. In other words, the time required for communication and synchronization between the parallel computational tasks can be so long that the application no longer benefits from parallel processing. Therefore, parallel programming models targeting embedded MPSoCs are still needed. Designing such a programming model is not trivial. From the above discussion, it can be seen that MPSoC software development needs the support of the sequential programming model, because vast amounts of legacy code and programming experience have been accumulated over decades and need to be reused. Nevertheless, for those applications that have inherent parallelism, parallel programming models are required in order to allow programmers to model them efficiently. Due to the difference between the thinking patterns behind these models, it is challenging for programmers to benefit from them simultaneously. At the same time, for MPSoC programming tool developers, it is a major challenge to create a development environment that is capable of supporting different models.
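As a minimal illustration of the data-parallel style mentioned above, the following OpenMP sketch parallelizes a vector addition. It is intentionally trivial: for a loop of this size, the fork/join and synchronization overhead of the OpenMP runtime can easily outweigh the speed-up, which is exactly the granularity problem that limits such models on embedded MPSoCs.

    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* The runtime forks worker threads, splits the iteration space
         * among them, and joins at the end of the parallel region. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }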

5 6 CHAPTER 1. INTRODUCTION

1.2.2 Application Parallelization

Parallel programming models provide programmers with the means to describe applications that run in parallel. As pointed out earlier, years of programming have left developers a huge amount of legacy code written in sequential programming languages. To reuse it on MPSoC platforms, it is necessary to transform the sequential specification into a parallel one by using a suitable parallel programming model. This process is known as parallelization or partitioning, and it extracts the hidden parallelism out of an application. While a traditional compiler tries to exploit instruction-level parallelism (ILP), the goal of application parallelization for MPSoCs is to extract task-level parallelism. Theoretically, programmers can choose to parallelize the code by hand or use tools that are supposed to do the job automatically. However, the lessons from the HPC domain have already shown that manually developing parallel software is a time-consuming, complex and error-prone process. In the embedded domain, where product development mostly faces a stringent time-to-market, programmers typically do not have enough time to do everything manually. Therefore, there is a need for automatic parallelization tools to assist the programmer with converting serial programs into parallel ones. Unfortunately, parallelization has never been a straightforward process. To discover parallelism in any form automatically, powerful flow and dependence analysis capabilities are required. Computing these dependencies is one of the most complex tasks inside a compiler, for uniprocessor as well as for multiprocessor systems. For a language like C, the problem of finding all dependencies statically is NP-complete and in some cases undecidable. The main reason for this is the use of pointers [23] and of indexes into data structures that can only be resolved at runtime. Without accurate dependence information, tools can only conservatively assume the existence of a dependence, which eventually becomes an inhibitor for parallelism exploration. Moreover, during the parallelization process, not only the behavior of the application needs to be considered; features of the underlying MPSoC architecture should also be taken into account so as to determine the best granularity for the application. This makes the parallelization problem even more challenging. Because of these difficulties, existing auto-parallelization tools are mostly limited in terms of performance, flexibility and applicability. The imperfection of auto-parallelization techniques and the big effort of manual parallelization put developers in a difficult situation. How to parallelize applications efficiently is still a significant problem both for MPSoC programmers and for programming tool developers.
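The following hypothetical C fragment illustrates why pointers defeat static dependence analysis. Whether the loop can be parallelized depends on whether dst and src overlap, which in general is only known at runtime; a static tool must conservatively assume the loop-carried dependence and keep the loop sequential.

    /* If dst == src + 1, iteration i reads the value written in iteration
     * i-1, so the loop is inherently sequential; if the arrays never
     * overlap, all iterations are independent and can run in parallel. */
    void shift_add(int *dst, const int *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] + 1;
    }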

1.2.3 Mapping and Scheduling

Mapping, in the context of MPSoCs, mainly refers to the process of assigning parallel tasks to PEs and logical communication links to physical channels. It can be done either statically or dynamically. When the assignment is decided before execution, the mapping is static; otherwise, it is dynamic. Compared to static mapping, dynamic mapping makes the decision based on the availability of resources, and hence results in higher resource utilization but less predictability. For MPSoCs, scheduling means finding an appropriate timing sequence for the parallel tasks. Like the mapping problem, it can be solved either statically or dynamically. Typically, mapping and scheduling are correlated, since a change of the mapping will influence the load of the PEs, which in turn influences the task scheduling needed to maximize PE usage or meet specific timing constraints.

Moreover, embedded applications often have real-time requirements, e.g., the decoding of a video frame must be finished before it is rendered to the display. The existence of real-time requirements adds more complexity to the problem. Therefore, finding an efficient scheduling and mapping solution is also a major challenge in MPSoC software development.
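To make the static variant concrete, here is a small, self-contained sketch of design-time mapping with a greedy load-balancing heuristic. Everything in it (task names, cost numbers, the heuristic itself) is invented for illustration; practical mappers additionally weigh communication cost, memory capacity and timing constraints.

    #include <stdio.h>

    #define NUM_TASKS 5
    #define NUM_PE    2

    typedef struct { const char *name; int cost; int pe; } task_t;

    int main(void) {
        task_t tasks[NUM_TASKS] = {
            {"vld", 30, -1}, {"idct", 50, -1}, {"mc", 40, -1},
            {"deblock", 25, -1}, {"display", 10, -1}
        };
        int pe_load[NUM_PE] = {0};

        /* Greedy static mapping: assign each task to the currently
         * least-loaded PE; the assignment is fixed before execution. */
        for (int t = 0; t < NUM_TASKS; t++) {
            int best = 0;
            for (int p = 1; p < NUM_PE; p++)
                if (pe_load[p] < pe_load[best]) best = p;
            tasks[t].pe = best;
            pe_load[best] += tasks[t].cost;
        }

        for (int t = 0; t < NUM_TASKS; t++)
            printf("%-8s -> PE%d\n", tasks[t].name, tasks[t].pe);
        return 0;
    }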

1.2.4 Code Generation

In the context of a classical compiler, code generation refers to the back-end, which is typically the last phase of the compilation process and eventually generates assembly instructions for the application. Here, in the MPSoC domain, it means the generation of code which can be digested by the target PE toolchain. It was mentioned earlier that embedded MPSoCs are often heterogeneous, i.e., different types of processors can be deployed in the same platform. The PE components can be developed in-house or acquired from third-party IP vendors, in which case software tools like C compilers, linkers, assemblers, etc., are shipped with them. Each individual PE alone might not directly support the programming model used for the overall MPSoC programming. Therefore, after parallelization, mapping and scheduling, PE-specific code in the form of either a high-level programming language or assembly must be generated and further compiled by the toolchain of the corresponding PE; the PE compiler can then turn on its own optimization features to further improve the code quality. During the MPSoC code generation process, PE-specific communication/synchronization primitives need to be inserted into the generated code in order to implement the high-level programming model. Because of the heterogeneous nature of the underlying MPSoC, the same high-level feature, like a semaphore or a message queue, might be supported differently by different PEs. It is therefore important for the MPSoC code generator to be aware of such differences and to generate the target code reliably. In addition, scheduling is crucial for real-time applications. It can be implemented by the platform in hardware, in software, or in a mixture of both. No matter in which form scheduling is supported, the generator needs to be platform-aware and generate the appropriate code to use the available features. Sometimes it can also happen that no existing scheduling solution is provided by the PE vendor. In such a case, the generator needs to synthesize a scheduler for the application, which is even more challenging.
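The following fragment sketches what the insertion of PE-specific primitives can look like for a single high-level operation, sending data on a logical channel. The primitive names (dsp_mbox_post, riscos_queue_put) are hypothetical placeholders, not any real vendor API; the point is only that one programming-model operation must be lowered differently per PE.

    /* Hypothetical output of an MPSoC code generator: the same logical
     * "send" is mapped onto different primitives depending on the PE the
     * task was assigned to. */
    #if defined(TARGET_DSP)
      #define CHANNEL_SEND(ch, buf, len)  dsp_mbox_post((ch), (buf), (len))
    #elif defined(TARGET_RISC)
      #define CHANNEL_SEND(ch, buf, len)  riscos_queue_put((ch), (buf), (len))
    #else
      #error "code generator must select a target PE"
    #endif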

1.2.5 Architecture Modeling and Simulation

Generally speaking, an architecture model is an abstraction of the underlying hardware. Depending on the use of the model, the abstraction can be done at different levels. For example, a traditional uniprocessor compiler needs a model that describes the pipeline architecture of the target processor, so that the instruction scheduling can be determined; conversely, a model that simulates the behavior of the instruction set of the target processor is often used by developers to run software without real hardware. Based on the employed modeling technique, architecture models can usually be categorized as follows.

• Abstract model Abstract models describe the target platform at a very high level with little or no architectural detail. They are typically used to roughly estimate the execution time of the application, so that the developer can find performance bottlenecks or other problems early. Since little hardware detail is included in abstract models, no major effort is required to create them. At the same time, it must be noted that the absence of detail is a double-edged sword: the accuracy of such models is often not very high, which is particularly problematic in the presence of caches and other non-deterministic architectural features.

• Simulation model Simulation models are software simulators which are able to mimic the behavior of the hardware on the host machine. They typically employ the instruction-set simulation (ISS) technique to run the machine code compiled for the target architecture. Embedded application developers often use them to run applications instead of hardware, because less effort is required to build a software simulator than a hardware prototype. In industry, they are also called Virtual Platforms (VPs). The accuracy of a VP depends on the amount of detail it simulates: the more, the better. However, the speed of simulation decreases when more hardware details are simulated, and a more detailed simulator requires more time to be developed. Therefore, VP developers have to carefully balance these issues to find the best trade-off.

Because of the limitations of the various modeling methods, it is difficult to develop one golden model that can do everything for programmers. In the whole MPSoC software development process, multiple models of the target platform are required. For instance, abstract models can be used early in the design to roughly estimate the effect of different scheduling and mapping scenarios, and detailed simulation models can be used to validate the results predicted by the high-level model. Programmers need to identify the requirements and carefully select the corresponding model. For the developer of MPSoC architecture models, it is an even bigger challenge to create models under conflicting measures such as speed, accuracy, flexibility, usability, etc.
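As an illustration of how lightweight an abstract model can be, the sketch below estimates execution time by weighting counted operations with per-PE cost values from a table, instead of simulating hardware. The operation classes and cost numbers are invented; this shows the general table-based idea only, and it deliberately ignores caches, stalls and contention, which is precisely why such estimates are fast but inaccurate.

    #include <stdio.h>

    enum { OP_ALU, OP_MUL, OP_LOAD, OP_STORE, NUM_OPS };

    /* Hypothetical cycles-per-operation table for one PE. */
    static const double pe_cost[NUM_OPS] = { 1.0, 3.0, 4.0, 4.0 };

    /* Abstract performance estimate: a weighted sum of operation counts. */
    double estimate_cycles(const long op_count[NUM_OPS]) {
        double cycles = 0.0;
        for (int i = 0; i < NUM_OPS; i++)
            cycles += pe_cost[i] * (double)op_count[i];
        return cycles;
    }

    int main(void) {
        long counts[NUM_OPS] = { 100000, 20000, 40000, 15000 };
        printf("estimated cycles: %.0f\n", estimate_cycles(counts));
        return 0;
    }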

1.2.6 Summary

Overall, MPSoC programming is a very complex process that poses challenges in many different aspects of software development. Variations of some of the problems have been studied in other contexts, e.g., the parallel programming model has been studied in the HPC community for decades. However, due to the huge difference between MPSoCs and traditional computational platforms like general purpose multiprocessor machines, the old solutions are not directly applicable to the new platforms. These challenges must be studied in a purely MPSoC-focused context. A systematic methodology is desired in order to provide programmers with a satisfactory solution for the efficient development of MPSoC software.

1.3 Thesis Organization

Within the scope of this thesis, a software development framework, called MAPS, is proposed in order to assist programmers in developing applications for MPSoCs. The rest of this thesis presents the details of the methodology and the implementation of the components of MAPS. After the introduction of MPSoCs and their programming challenges in the previous sections, Chapter 2 presents a survey of existing industrial platforms and academic research related to MAPS. An overview of the proposed methodology is first given in Chapter 3, before the discussion of each individual component of the framework.


Afterwards, the MAPS architecture model, which is used as input to provide high-level architectural information, is introduced in Chapter 4. Chapter 5 discusses the profiling technique that is employed by MAPS to characterize the target application in the form of a profile for both the programmer and MAPS itself. To prepare the parallelization of the application, MAPS performs sophisticated control and data flow analysis, whose details are given in Chapter 6. The parallelization method supported by MAPS is discussed in Chapter 7. In order to test the software early in the design flow, a high-level abstract MPSoC simulator, called the MAPS Virtual Platform (MVP), has been developed as part of the framework. Chapter 8 presents the details of the MVP. The applicability and usability of the MAPS framework and methodology have been tested in several MPSoC platform case studies, whose results are presented in Chapter 9. Finally, Chapter 10 summarizes the entire thesis and provides an outlook on future work.


Chapter 2 Related Work

As discussed in the previous chapter, the MPSoC is today's de facto standard architecture for embedded applications, and the growth of its complexity has made software development the biggest challenge in its design. Considering that the problem is still deemed to be largely unsolved [24], researchers and engineers have done a lot of work in the past few years in order to find a satisfactory solution. This chapter gives a survey of different approaches which have been proposed by both academic research groups and industrial companies and are related to the work done in this thesis. Special attention is paid to key aspects like the choice of programming model, parallelization method, scheduling and mapping exploration, code generation support, etc. In the rest of this chapter, Section 2.1 first discusses general purpose multi-processors, which are typically employed in desktop-like environments. Then, Section 2.2 gives a brief introduction to general purpose Graphics Processing Units. The focus of this chapter is on embedded MPSoCs and their programming tools, which are discussed in detail in Section 2.3. Finally, Section 2.4 summarizes the whole chapter.

2.1 General Purpose Multi-Processor

General purpose multi-processors, such as the Intel Core 2 Quad [25] and the AMD Phenom II X4 [11], are not specifically optimized for one application or application domain. Typically, they are installed in a desktop or workstation environment where the user can run any kind of application. Normally, the architecture of such general purpose multi-processors is symmetric. For example, Figure 2.1 shows the architecture of an Intel Core 2 Quad processor, in which four identical processor cores are employed. In order to achieve high performance, these processors spend a huge number of transistors on sophisticated micro-architectures and large integrated cache memories¹. Strictly speaking, such multi-processors are not SoCs, because the extra components, such as memory and I/O peripherals, which are required to compose a complete system, are not integrated.

¹ The latest Intel quad-core processor, the i7-975, consists of 731 million transistors [26].

Figure 2.1: Intel Core 2 Quad (Source: Intel)

The programming of general purpose multi-processors is nowadays still mostly manual. To exploit the computational power of multiple processor cores, the programmer needs to directly write code with parallel programming APIs or languages such as PThreads [20], OpenMP [22] and MPI [21]. As a concrete example, the following paragraphs discuss the Cell multi-processor [27] in detail.
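For reference, the thread-based flavor of this manual style looks roughly as follows; the sketch is deliberately minimal, with the data partitioning and synchronization that a real application would need left out.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* In a real application each worker would process its share of the data. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("worker %d running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t th[NUM_THREADS];
        int id[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++) {
            id[i] = i;
            pthread_create(&th[i], NULL, worker, &id[i]);  /* explicit fork */
        }
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(th[i], NULL);                     /* explicit join */
        return 0;
    }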

2.1.1 Cell

The Cell processor was jointly developed by IBM, Sony, and Toshiba for both the consumer and the high-performance computing market. It is employed in Sony's PlayStation 3 game console and in some IBM server products. The architecture of the microprocessor differs from that of a typical Intel multi-processor with respect to its processing elements and memory: the processing elements used in the Cell processor are heterogeneous, and its memory architecture is non-uniform. Figure 2.2 shows the architecture of the processor. The Cell multi-processor consists of one Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). The PPE is a 64-bit PowerPC processor and can run 32-bit and 64-bit operating systems and applications. The SPEs feature a single-instruction-multiple-data architecture which is optimized for running computation-intensive applications. An on-chip interconnection bus connects all the processor elements and allows them to access the shared main memory. Additionally, each SPE has a so-called local storage (LS) as private memory.

Figure 2.2: Cell Multi-Processor


The architecture of the Cell processor is tailored to support mainly two levels of parallel execution, task and instruction. IBM provides both open source and proprietary programming tools to help Cell software developers exploit the potential of the processor. At the task level, OpenMP is used by programmers to write parallel applications for the Cell processor; the programmer still needs to decide where and how to parallelize the target application. At the instruction level, the SPE compiler is able to utilize the SIMD instructions automatically, provided that the programmer follows the related coding guidelines carefully [28].
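The guideline-driven style referred to above typically amounts to writing loops the compiler can vectorize. The fragment below illustrates that general idea rather than quoting [28]: a countable loop with unit stride, no branches, and pointers declared non-aliasing gives an auto-SIMDizing compiler its best chance.

    /* C99 restrict promises the compiler that y and x do not alias, removing
     * the conservative dependence assumption that would block SIMDization. */
    void saxpy(float * restrict y, const float * restrict x, float a, int n) {
        for (int i = 0; i < n; i++)   /* unit stride, fixed trip count */
            y[i] = a * x[i] + y[i];
    }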

2.2 General Purpose Graphics Processing Unit

Graphics Processing Units (GPUs) were originally designed to accelerate the pixel processing of 3-dimensional (3D) graphics. Since the pixels of a 3D scene can naturally be processed in parallel, GPUs often feature a massively parallel architecture. In the past few years, they have been used more and more for general purpose computing, which makes them General Purpose Graphics Processing Units (GP-GPUs). Nvidia CUDA [29] and ATI Stream [30] are examples of such GP-GPU technology. The architecture of GP-GPUs is highly parallel. The ATI Radeon HD5970 graphics card [31], for example, has 1,600 so-called thread processors per GPU. Figure 2.3 shows a simplified diagram of an ATI GPU. It consists of a group of SIMD engines; one SIMD engine contains numerous thread processors, and each thread processor is a five-issue VLIW processor. All thread processors within a SIMD engine execute the same instruction stream. Different SIMD engines can execute different instruction streams, and they are all managed by the thread dispatch processor. Notice that multiple thread processors must share a single instruction stream. In comparison to traditional multi-processors, which are capable of running different programs simultaneously, this is a big difference, and special support is required to program GP-GPUs efficiently. For example, the ATI stream processor requires the GPU-executed code, called a thread kernel, to be wrapped in a function with array or vector parameters. Besides, new languages such as OpenCL [32], an extension of the C language, have been proposed to support GP-GPU programming.

Figure 2.3: ATI Stream Processor
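A minimal OpenCL-style thread kernel matching this description might look as follows: the GPU code is a function over array parameters, and each of the many hardware threads processes the element selected by its global ID. Host-side setup (context, buffers, kernel launch) is omitted.

    /* OpenCL C thread kernel: one work-item computes one element. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
        size_t i = get_global_id(0);   /* this work-item's element index */
        c[i] = a[i] + b[i];
    }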


2.3 Embedded Multi-Processor System-on-Chip

While general purpose multi-processors and GPUs mostly feature a homogeneous architecture, the situation is completely different in the embedded domain. Today, battery-driven personal wireless communication and entertainment devices have strong performance requirements under a stringent energy consumption budget. To tackle these requirements, embedded MPSoCs often employ application-specific processing elements and are highly integrated. Besides, since the time-to-market window for consumer electronic products is very small, companies often prefer a platform-based MPSoC design methodology: a new generation of the MPSoC architecture is based on the previous successful model with some evolutionary improvements. To date, many MPSoC platforms have been developed and put on the market by different companies, e.g., the ATMEL Diopsis [8], Freescale i.MX51 [33], Infineon XGold SDR [34], NXP Nexperia [7], ST Nomadik [6], Qualcomm Snapdragon [10], etc. In order to program these MPSoCs efficiently, a lot of work has been done by companies and research institutes.

2.3.1 MPCore

MPCore is an embedded multi-processor IP provided by ARM, which can be used by different MPSoC vendors in their designs. The Cortex-A9 MPCore [35] is the latest generation of the product line; both TI and ST have deployed this multi-processor in their latest MPSoC chips [36], [37]. MPCore is configurable: depending on the target application, MPSoC architects can choose to integrate up to four processor cores into their designs. Figure 2.4 shows the structure of a quad-core MPCore. As can be seen from the figure, MPCore features a symmetric multi-processing architecture which is similar to the general purpose multi-processors used in desktop computers and eases software development. From the software perspective, MPCore is supported by the OS [38]. On top of this, OpenMP and MPI are available for programmers to write parallel applications. However, ARM as an IP vendor does not provide programming tools specifically designed for MPCore. There are several academic projects, such as HOPES (Section 2.3.7) and MPA (Section 2.3.9), which use MPCore as the target platform for parallel programming research.

Figure 2.4: ARM Cortex-A9 MPCore

2.3.2 IXP

The Intel IXP processor [39] is a network processor which is specifically targeted at the networking application domain. As a generic packet processor, the IXP has its functions optimized for tasks which are typically found in network protocol processing, such as packet queue management, data bitfield manipulation, etc.

Figure 2.5: Intel IXP2400 Network Processor

Figure 2.5 shows the structure of the IXP2400 processor, which contains one Intel XScale core and eight so-called Micro-Engines (MEs) arranged in two clusters. The XScale core is a RISC processor with an ARM-compatible instruction set. Its intended use in the IXP is to control and support the processes running on the MEs. Each ME is a simple RISC processor with an instruction set specially tuned for processing network data. Dedicated communication channels are built between neighboring MEs to accelerate their data exchange. To program the processor cores in the IXP, two separate compilation toolchains are provided by Intel [40]. Programmers can use C to write code for the XScale and the ME, although only a subset of C is supported for the ME. Besides, when the ME packet processing instructions are required in the application, they need to be invoked through either special compiler-known functions or assembly routines, which is not very convenient. Research has been done to ease the programming of the IXP. For instance, the NP-Click tool [41] uses a special language to describe network applications, which is more intuitive than assembly and compiler-known functions. The NP-Click authors estimated a four-fold productivity increase using NP-Click, and demonstrated performance between 60% and 99% of designs directly coded with the Intel tools.

2.3.3 OMAP

The TI OMAP (Open Multimedia Application Platform) product family is one of the earliest heterogeneous MPSoCs to be productized. The whole product line targets the mobile multimedia communication device market and is upgraded regularly. Since the first generation, the OMAP1510 processor, was revealed in 2002, three generations of OMAP products have been put on the market, and the fourth generation is on its way. If the different generations of the OMAP platform are compared with each other, it can be seen that the basic structure of the platform has not changed much, which makes it a paradigmatic example of platform-based design. Figure 2.6 shows the high-level architecture of the OMAP1510 and OMAP3630 processors, the first and the third generation of the product line, which both consist of a RISC processor from ARM and a DSP from TI itself. Though the newer generation features more peripherals and accelerators for application domains like 3D and image processing, the main programmable elements of the MPSoCs are similar. If the next-generation OMAP, the OMAP44x processor whose diagram is shown in Figure 1.2(a), is also taken into account, one can easily see that the idea behind the OMAP platform is basically to use a combination of RISC and DSP processors for control-centric and data-centric applications respectively. Thereby, a single chip is able to support both general purpose and multimedia applications efficiently.

The OMAP platform is designed to be flexible, so that device manufacturers can differentiate their products from others through the software running on the platform. Therefore, the RISC processor is completely open to developers, who are free to run any available OS and applications on it. The choice of the DSP software solution, on the other hand, is very limited: there is only one OS available for the DSP in OMAP, the DSP/BIOS developed by TI itself. The communication between the RISC and the DSP is realized through a communication layer called DSP Bridge, which on the DSP side is part of the OS. On the RISC side, a DSP Bridge driver is required to allow software on the RISC processor to talk to the DSP. A simplified overview of the OMAP software stack is shown in Figure 2.7. There are reasons for such a heterogeneous software architecture. DSP programming is known to be more difficult than programming standard RISC processors when performance is considered: running a piece of unoptimized code on a DSP might not bring any speed-up compared to a RISC processor, and expertise and time are required to write code for DSPs. As a result, in order to help customers exploit the potential of the DSP, TI provides its own software stack. Besides, TI has a large collection of highly optimized DSP software libraries, accumulated over the years, and it would be a loss if they were not reused. On the one hand, such a heterogeneous software stack ensures the maximum reuse of existing DSP routines; on the other hand, it makes parallel programming on OMAP more difficult.

• Programming Model Strictly speaking, there is currently no parallel programming model supported by OMAP. In order to write an application using both the RISC and the DSP, the programmer needs to write code for the two processors separately and use different compiler toolchains and libraries to build the respective executable binaries, which is a relatively tedious process.

• Parallelization Since the low-level software running on the RISC processor of OMAP completely depends on the choice of the device manufacturer, the final application runtime environment can only be known after the product, or at least a complete prototype, is finished. It is difficult for developers to write a tool that parallelizes applications for an unknown environment. No work has been published so far which is able to support parallelization for OMAP.

Figure 2.6: Comparison of the 1st and 3rd Generation OMAP Processors. (a) OMAP1510 Processor; (b) OMAP3630 Processor

Figure 2.7: OMAP Software Stack

• Mapping and Scheduling Task mapping in OMAP is in principle not a problem for tools but for programmers, because the software for the RISC and the DSP is written separately by hand. Programmers need to use their experience to decide which processor should do what. Scheduling support on the RISC processor depends on the device manufacturer, who can choose to implement a sophisticated OS or a lightweight one; mostly, the choice depends on the demands of the market. On the DSP side, the TI DSP/BIOS is a multi-tasking real-time OS which supports both static and dynamic scheduling. The scheduler of DSP/BIOS is preemptive, and tasks can be assigned different priorities in order to consume different amounts of processing time.

• Modeling and Simulation OMAP is one of the earliest MPSoCs to get virtual platform support. Starting from the first OMAP generation, Virtio [42], which is now part of Synopsys, has been providing OMAP virtual platforms to developers. The virtual platform built by Virtio is a very detailed full-system simulator; it models not only the MPSoC itself but also all other components, from external memories to mechanical switches, which are required to build a complete prototype. Because of the use of ISS technology, running unmodified target binaries is supported.

Overall, although TI provides a rich set of tools for developing applications on the different processing elements, along with many abstractions that hide as much platform detail as possible, how to partition and allocate tasks/applications onto the platform so as to efficiently leverage the underlying processing power still relies heavily on the knowledge and experience of the programmer. Programming OMAP processors as a whole is not a trivial job.

2.3.4 CoMPSoC

The CoMPSoC (Composable and Predictable Multi-Processor System on Chip) platform [43] is jointly developed by NXP and the Eindhoven University of Technology. It aims at providing a scalable hardware and software MPSoC template with composability and predictability. Figure 2.8 shows the basic hardware architecture of the CoMPSoC platform. CoMPSoC has a configurable number of PEs, which are Silicon Hive Very Long Instruction Word (VLIW) processors [44]. The processor core does not support preemptive multitasking, and static scheduling is used to share a PE between tasks. Memories in CoMPSoC are distributed: each PE has its own local memory, and an SRAM is available in the platform as global shared memory. The interconnection between the hardware modules is realized through the Æthereal NoC [45]. In order to ensure that all applications can get bandwidth on the NoC, CoMPSoC uses time division multiplexing to control the communication between its components.

Figure 2.8: CoMPSoC Hardware Architecture

The software stack used by the CoMPSoC platform is the C-HEAP API [46], which handles the communication between PEs, shared memories and peripherals. It is implemented in a distributed manner for the sake of scalability and synchronization efficiency.

• Programming Model When programmers write software with the C-HEAP API for CoMPSoC, the underlying programming model of the application is the so-called Kahn Process Network (KPN) [47]. A KPN models each application as a set of processes which communicate exclusively with each other via unidirectional FIFO channels. Originally, the FIFO channels defined in a KPN have infinite length, which implies that a channel can never be full. Nonetheless, in order to have an efficient implementation, some restrictions are imposed by the C-HEAP API: for example, the communication FIFOs between processes are bounded, and a write operation on a full FIFO will block the writing process from further execution (a minimal sketch of such a bounded blocking FIFO follows this list).

• Parallelization Parallelizing applications for the CoMPSoC platform is currently still done manually. The developer follows a typical manual parallelization flow consisting of profiling, hot-spot identification and manual code partitioning. No auto-parallelization facility has been reported so far.

• Mapping and Scheduling For the CoMPSoC platform, task mapping and scheduling are done statically before the application is executed. Although the C-HEAP API defines primitives for dynamic task creation and termination, these features are not implemented on CoMPSoC. This is mainly due to the restriction of the platform's VLIW PE, which supports only static scheduling. Besides, sharing one PE between different applications is at the moment not supported in CoMPSoC.

• Code Generation The code generation of the CoMPSoC platform covers both hardware and software. Based on a high-level system description, HDL code is generated, which implements the hardware of the platform. Since the application for CoMPSoC is written in C, it can be directly compiled to binary. No code needs to be generated for the application itself. Nevertheless, in order to configure the cores and the NoC, a piece of configuration code is generated at design time.


• Modeling and Simulation For early testing purposes, a transaction-level simulation model of the platform is generated during the code generation process along with the RTL-level hardware description. The generated system model is accurate and is mainly used to test the application software before the Very Large Scale Integration (VLSI) implementation is available. Besides, FPGA emulation is also employed in the design flow of CoMPSoC for verification purposes.
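The bounded, blocking FIFO semantics described in the Programming Model item above can be sketched as follows. This is an illustration using POSIX primitives on a host machine; C-HEAP's actual implementation on CoMPSoC is distributed and works without such an OS layer. Initialization of the fifo_t fields and the producer/consumer threads themselves are omitted.

    #include <pthread.h>

    #define FIFO_DEPTH 8

    typedef struct {
        int buf[FIFO_DEPTH];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
    } fifo_t;

    void fifo_write(fifo_t *f, int token) {
        pthread_mutex_lock(&f->lock);
        while (f->count == FIFO_DEPTH)        /* full: block the writer */
            pthread_cond_wait(&f->not_full, &f->lock);
        f->buf[f->tail] = token;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        pthread_cond_signal(&f->not_empty);
        pthread_mutex_unlock(&f->lock);
    }

    int fifo_read(fifo_t *f) {
        pthread_mutex_lock(&f->lock);
        while (f->count == 0)                 /* empty: block the reader */
            pthread_cond_wait(&f->not_empty, &f->lock);
        int token = f->buf[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        pthread_cond_signal(&f->not_full);
        pthread_mutex_unlock(&f->lock);
        return token;
    }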

Since composability and predictability are the main goals of CoMPSoC, the design of the platform inevitably makes some compromises to avoid unpredictability; for example, only static scheduling is used and no PE is shared between different applications. Besides, although the configuration code for CoMPSoC is generated automatically, the application development itself, including parallelization, is still manual.

2.3.5 SHAPES

SHAPES [48] is a European Union FP6 Integrated Project whose objective is to develop a prototype of a massively scalable HW and SW architecture for applications featuring inherent parallelism. The fundamental building blocks of SHAPES are processing element tiles, each of which is a self-contained sub-system. Figure 2.9 a) shows the basic configuration of a tile, which is composed of an ATMEL mAgic VLIW floating-point DSP, an ARM9 RISC processor, on-chip memory, and a network interface for on- and off-chip communication. Because it includes two different types of processors, it is called the RISC-DSP Tile (RDT). On the basis of the RDT, the RISC-Elementary Tile (RET) and the DSP-Elementary Tile (DET) were developed at the same time; they contain only the RISC or the DSP processor, respectively, in order to obtain specialized computational characteristics. Using tiles as components, a multi-tile chip (Figure 2.9 b) can be constructed by putting several of them into one chip and using an on-chip network for interconnection. Furthermore, a large number of multi-tile chips can be linked together through a scalable off-chip interconnection network, which eventually forms a massively parallel system (illustrated in Figure 2.9 c) with a very high theoretical aggregate computational performance. In order to leverage the potential performance of the SHAPES system, a sophisticated software environment has been developed by several research groups from different European universities and companies. An overview of the SHAPES software development flow is illustrated in Figure 2.10 [49].

Figure 2.9: SHAPES Tiled Architecture. a) RISC-DSP Tile (RDT); b) Multi-Tile Chip; c) Multi-Chip System

Figure 2.10: SHAPES Software Development Flow

The first phase is the Distributed Operation Layer, in short DOL, whose input includes an application specification, an architecture specification and a list of mapping constraints. The DOL is supposed to accomplish several tasks for SHAPES programmers, namely functional simulation, mapping and scheduling exploration, and performance estimation. Using a multi-objective optimization framework [50], DOL tries to find the best task-to-processor mapping under consideration of various conflicting criteria such as throughput, delay, predictability, etc. Below DOL is the Hardware-dependent Software layer (HdS), which provides OS services and the realization of communication protocols to DOL. Given the exploration result from DOL, it generates the necessary dedicated communication and synchronization primitives for the underlying SHAPES hardware, configures the OS for the PEs in the SHAPES tiles, and eventually compiles/links everything together into a binary image which can be loaded by either the SHAPES hardware prototype or the Virtual SHAPES Platform (VSP), i.e., an ISS-based simulation model of SHAPES. Both the HW prototype and the VSP are capable of running the SHAPES software and providing performance feedback to DOL for the next iteration of the SHAPES SW exploration loop.

• Programming Model
DOL also uses the Kahn Process Network [47] as its programming model. The application specification is composed of two parts, an XML file and C/C++ code. The XML file is used to specify the topology of the process network, which includes processes, channels and the connections between them. The functionality of processes is specified in the C/C++ source code. A set of coding rules must be fulfilled for the correct specification of process behavior. In each process there must be an init and a fire function. The init function allocates and initializes data; it is called once during the initialization of the application. The fire function is invoked repeatedly afterwards. The communication between processes is realized through the DOL APIs, e.g. DOL_read() and DOL_write(); as mentioned earlier, these functions are blocking. Figure 2.11 a) shows an example process network XML file of a simple application modeled with DOL. The whole network consists of two processes, producer and consumer, which are connected by one software FIFO channel. The source code of the processes is specified separately from the XML file, and Figure 2.11 b) shows the definitions of the init and the fire functions of the producer process; a sketch of such a process follows below. The process network can be visualized as the diagram shown in Figure 2.11 c). It can be imagined that such a mixed-language model could take advantage of the strengths of different languages. For instance, the XML-based process network specification could be supported by tools like [51] to create a graph-editing-style programming environment. However, until now, there is no such support available from SHAPES. The strict definition of the process init and fire functions formalizes the behavior of each process. But it also makes the reuse of existing source code for DOL harder, because an existing application might not be written with such behavior in mind.
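To make the coding rules concrete, the following is a minimal sketch of a DOL-style producer process in C. It is an illustration only: the header name, the DOLProcess type, the port macro PORT_OUT and the exact DOL_write() signature are assumptions in the spirit of the API described above, not verbatim DOL code.

    #include "dol.h"   /* assumed DOL runtime header */

    /* Illustrative process state kept across fire invocations. */
    static int counter;

    /* init: called once, allocates and initializes process data. */
    void producer_init(DOLProcess *p) {
        counter = 0;
    }

    /* fire: invoked repeatedly; writes one token to the output port.
       DOL_write() blocks when the FIFO channel is full. */
    int producer_fire(DOLProcess *p) {
        DOL_write((void *)PORT_OUT, &counter, sizeof(counter), p);
        counter++;
        return 0;
    }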

• Parallelization
Until now, parallelizing an application for the SHAPES platform is still a manual process. If a sequential application needs to be parallelized for the platform, the programmer has to write a DOL model for it by hand. No auto-parallelization effort has been reported so far.

• Code Generation
Since different languages are used in the SHAPES application model, a model compiler is implemented in the DOL layer to unify the descriptions into one format for simulation or for generating binary executables. For functional simulation, it generates SystemC code out of the XML specification of the process network. The process C/C++ source code is compiled together with the generated code to obtain a functional simulator. The code generation procedure does not require sophisticated control and data flow analysis, because processes can only access communication channels explicitly through the predefined DOL API.

• Mapping and Scheduling
The mapping and scheduling exploration process in DOL includes two phases: performance evaluation and optimization. The performance evaluation uses data from an analytical architecture model, which is fast but not fully accurate, and from the VSP, which does detailed simulation and is more accurate yet slow. Combining information from both the high and the low level, the estimator then predicts the application performance in different mapping/scheduling scenarios. The goal of the DOL optimization is to find the best scenario. To achieve this, the optimization objectives have to be given by the designer, and the DOL then finds a solution using an evolutionary algorithm through the PISA interface [50]. The result is generated in XML format and passed to the HdS layer.
Both the RISC processor and the DSP are under the management of the HdS, which provides DOL with OS services and communication primitives. However, the current scheduling support of the HdS is limited. For the DSP, only static scheduling is available, and the scheduling on the RISC processor is non-preemptive, which limits the usability of the platform.

• Modeling and Simulation


Figure 2.11: Example DOL Application

It is mentioned in the previous paragraph that DOL uses both an analytical model and the VSP for performance estimation. The former is an abstract architecture model omitting many details of the underlying architecture. It is specified in XML format and contains three kinds of information: structural elements such as processors/memories, performance data such as bus throughputs/delays, and parameters such as memory sizes and resource sharing policies. The latter is a virtual platform built with the commercial CoWare [52] virtual platform tools. Both the RISC processor and the DSP are simulated in the VSP by ISSs. Therefore, the VSP is used not only to provide performance feedback to DOL, but also for running and debugging SHAPES software.

Note that there is one potential issue in the SHAPES software development flow: the DOL mapping and scheduling exploration requires information from the simulation result of the VSP, whose execution requires the binary executable to be compiled first. Since the generation of the software image needs to first pass the DOL layer, a circular dependence is thereby established. So far, it is not clear how this issue is resolved. One possibility is that an unoptimized solution is first generated by the framework as the initial solution to bootstrap the exploration loop.

2.3.6 Platform 2012

Platform 2012 (P2012) is a project initiated by STMicroelectronics [53], whose goal is to develop an MPSoC platform targeting future embedded applications. Similar to SHAPES, the P2012 platform uses a tiled architecture. The basic building block of the platform is a so-called customizable P2012 cluster (see Figure 2.12a), which is capable of accommodating up to eight processing elements. Both programmable cores and hardware blocks can be integrated into the P2012 cluster. As programmable core, the STxP70 processor, a four-issue VLIW architecture, is used. Within the customizable cluster, the processing elements are connected by a high-speed interconnect; furthermore, multiple clusters can be linked in the form of a 2D mesh with high scalability, as shown in Figure 2.12b. Since the P2012 platform is still relatively new, details about its programming are still not clear.


Figure 2.12: Platform 2012 (a: customizable P2012 cluster; b: multi-cluster 2D mesh)

However, it is claimed that the platform will support OpenCL and provide programming tools optimized for the embedded many-core environment of P2012. Besides, the European Union research project 2Parma [54], which started in January 2010, also works on the programming support of the P2012 platform. The project will take three years to finish, and the outcome of the research is still to be seen.

2.3.7 HOPES

HOPES is a parallel programming framework developed at Seoul National University [55]. Unlike the SHAPES project, in which hardware and software are developed concurrently, HOPES is designed to support existing multiprocessor architectures. In [56], it is reported that HOPES supports the ARM MPCore [35] and the IBM CELL processor [27], and support for other processors is already planned. It can therefore be seen as a retargetable programming framework. Figure 2.13 gives an overview of the software development flow of HOPES [57]. The center of the whole procedure is an intermediate format called the Common Intermediate Code (CIC) [58]. The CIC includes descriptions of both the application (task code) and the target parallel architecture. The programmer can either write the CIC task code manually or use a generator to produce the code from a data-flow application model called PeaCE [59]. The architecture description is specified in an XML file. In the CIC layer, applications are modeled as process networks, and the process-to-processor mapping is currently done manually. The CIC translator translates the CIC program into C code for each target processor core, which includes phases such as generic API translation and scheduler code generation. Finally, the generated target-dependent C code is compiled by processor-specific compilation tool chains into executables to run on the target platform.

• Programming Model
Strictly speaking, two programming models are supported by the HOPES framework, namely the CIC model and the PeaCE model. The core of the framework is the CIC model, which is in principle a process network and is designed to be the central intermediate representation enabling retargetable parallel code generation. What the traditional process network definition calls a process is called a task in the CIC model.


Similar to the SHAPES DOL model, the CIC model also separates the process behavior from the structure of the process network. The former is specified in the form of task code, which is C-based and supports OpenMP pragmas. The latter is in XML format, but mixed with the architecture description. Figure 2.14a) shows the application part of the CIC XML file, which instantiates two tasks. The corresponding task code is shown in Figure 2.14b). Each task code consists of three functions: task_init(), task_go(), and task_wrapup(). The task_init() function is called once when the task is invoked, to initialize the task. The task_go() function defines the main body of the task and is executed repeatedly in the main scheduling loop. The task_wrapup() function is called before the task is terminated, and its job is to release the used resources. To exploit more fine-grained parallelism within tasks, OpenMP pragmas can be used in these functions. The default inter-task communication model is a streaming-based, channel-like structure, which uses ports to send/receive data between tasks. Generic APIs for channel access and file I/O are used in the task code to preserve code reusability. A special pragma, which can be used to indicate the use of a hardware accelerator for certain functions, is also available. All these constructs are later translated or implemented by target-dependent translators, depending on the mapping. Although the programmer can manually write CIC models, HOPES supports generating a CIC model from a PeaCE model, when the latter is already available. The PeaCE model is a mixture of three different models of computation. At the top level, a task model is used, which specifies the execution condition of each task and the communication requirements between tasks; the internal definition of each task is specified with an extended Synchronous Data Flow (SDF) model called SPDF [60]; and the control tasks are modeled with a hierarchical concurrent FSM model (fFSM [61]). With the PeaCE model, the application specification can be well formalized. However, the use of several heterogeneous models of computation and languages could make the resulting model very complex.


Figure 2.13: HOPES Parallel Programming Flow


Figure 2.14: Example CIC Model

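As an illustration of the three-function task structure described above, here is a minimal CIC-style task in C. It is a sketch under assumptions: the function names follow the task_init/task_go/task_wrapup convention from the text, while the commented port calls (read_port/write_port) are an invented stand-in for the generic channel-access APIs, not actual HOPES code.

    #include <stdlib.h>

    #define N 64
    static int *buffer;

    /* Called once when the task is invoked: allocate and initialize data. */
    void task_init(void) {
        buffer = (int *)malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) buffer[i] = 0;
    }

    /* Main body, executed repeatedly in the scheduling loop.  Generic
       channel APIs would be used here and later translated by the CIC
       translator into target-specific primitives, e.g.:
           read_port(IN_PORT, buffer, N * sizeof(int));                */
    void task_go(void) {
        #pragma omp parallel for   /* optional fine-grained parallelism */
        for (int i = 0; i < N; i++) buffer[i] += i;
        /*  write_port(OUT_PORT, buffer, N * sizeof(int));             */
    }

    /* Called before the task terminates: release resources. */
    void task_wrapup(void) {
        free(buffer);
    }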

• Parallelization
Since the PeaCE model already contains a task-level model which can be seen as parallelized, the CIC code generation does not have to parallelize the application. Consequently, the HOPES framework does not support parallelizing sequential applications for the programmer.

• Mapping and Scheduling
In the HOPES framework, task mapping and scheduling are decided manually. The programmer specifies not only the task-to-PE mapping, but also the period, the deadline and the priority of tasks. Besides, the overall scheduling policy is also determined in this phase. This information is then stored in the CIC XML files for target code generation. Recently, the HOPES research group has published work on MPSoC mapping/scheduling techniques for pipelined data-level parallel tasks [62]. It proposes the use of a so-called Quantum-inspired Evolutionary Algorithm (QEA) for solving the problem automatically. So far, it is not clear whether the algorithm will be employed in the HOPES framework.

• Code Generation
The CIC translator combines the information from the target-independent task code and the CIC XML, and generates target-dependent C code. The generic communication and I/O APIs are translated into target-specific implementations according to the architectural information in the XML file. For the OpenMP pragmas, if the target processor tool chain provides the corresponding support, the translator keeps the pragmas as they are. Otherwise, the translator converts them into parallel C code using the services available on the target processor. If the target processor does not have an OS, the translator also synthesizes a run-time system to realize task scheduling. All this code is then compiled by the C compilers of the target processors. Although the CIC model is supposed to be architecture independent, the CIC translator is completely target dependent. To support the HOPES flow on a new MPSoC architecture, a new CIC translator needs to be written to realize the translation process, which could be non-trivial.


• Modeling and Simulation
For code generation purposes, a simple architecture model is included in HOPES and stored in the same CIC XML file as the application process network specification. It provides information about the address and the size of each memory segment. Besides, memory segments can be assigned to processors as local or shared resources. Beyond this, no further detail is available in the specification.

From the above discussion, it can be seen that the HOPES framework mostly focuses on its intermediate representation, the CIC model. Although the model provides the possibility of specifying parallelism, its exploration still relies on the knowledge and experience of programmers. Besides, the mixture of architecture model and application model in one specification makes CIC models target dependent, which contradicts its model reusability philosophy, i.e. that a CIC model should be reusable for different target MPSoCs.

2.3.8 Daedalus

Daedalus [63] is a system-level MPSoC design framework developed at Leiden University. It provides an integrated solution for the hardware/software co-design of MPSoCs. Unlike the previously introduced projects, whose target architectures are fixed, the hardware architecture of the target MPSoC is co-designed together with its software in the Daedalus framework. As a result, the output of the framework includes not only C code for the target processors but also RTL code for hardware implementation.
The Daedalus design flow consists of three key components: KPNgen [64], Sesame (Simulation of Embedded System Architectures for Multilevel Exploration) [65] and Espam (Embedded System-level Platform synthesis and Application Modeling) [66]. The relation of these tools is depicted in Figure 2.15. Starting from the application specification, the KPNgen tool helps its user to generate KPN models from sequential C code. The KPNgen tool has some restrictions on the input sequential C code; therefore, in case the input C code does not fulfill the requirements, the designer can also create the model manually. The created KPN models are stored in XML files, which are first used as input by the Sesame tool. The goal of the Sesame tool is to perform high-level design space exploration for the overall MPSoC design. It uses an IP library to obtain information about the available building blocks, and the configuration of the resulting architecture is exported as files in XML format. Along with the hardware architecture specification, the mapping information is also generated by the tool in XML files.
The final HW/SW code generation is done by the Espam tool, with the application KPN, the architecture specification and the mapping specification as input. For hardware implementation, it uses the component RTL models in the library and generates RTL code for the complete platform, which can be further processed by commercial HW synthesis tools; for software programming, it produces C code for each target processor, which can be compiled by the corresponding compiler tool chain. The framework automates the complete HW/SW design process for the MPSoC designer; however, the design space is limited by the component library, which currently does not have a large collection of components.

• Programming Model


In Daedalus, the application programming model is the KPN model, which is also used in other frameworks. Both the high-level design space exploration and the target HW/SW code generation take KPN models as input. The framework supports automatically generating KPN models from sequential C code in the form of so-called static affine nested loop programs (SANLPs) via the KPNgen tool. A SANLP is a program or a program fragment composed of a set of statements, each possibly enclosed in loops and/or guarded by conditions. Several rules must be followed to construct a SANLP: all lower and upper bounds of the loops, as well as all expressions in conditions and array accesses, may contain enclosing loop iterators and parameters as well as modulo and integer divisions, but no products of these elements; such expressions are called quasi-affine. Besides, the parameters are symbolic constants, that is, their values may not change during the execution of the program fragment. An example SANLP is shown in Figure 2.16 and sketched below. Although the hot spots of scientific, matrix computation, multimedia and adaptive signal processing applications are often SANLPs, applications are seldom written completely in this way. Therefore, for applications outside this class, manual specification is needed.
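For concreteness, the following small C loop nest has the SANLP properties described above (quasi-affine bounds, conditions and index expressions). It is an invented illustration, not the exact code of Figure 2.16.

    #define N 8   /* N acts as a symbolic parameter: constant during execution */

    void sanlp_example(int a[N][N], int b[N][N]) {
        /* loop bounds and index expressions are affine in i and j */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i + j;

        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                if (i % 2 == 0)   /* modulo is allowed: quasi-affine condition */
                    b[i][j] = a[i - 1][j] + a[i][j - 1] + a[i + 1][j];
    }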

• Parallelization
In Daedalus, the transformation from the sequential C SANLP code to the KPN model can be seen as a parallelization process. Since SANLPs make heavy use of loops, the KPNgen tool performs array dataflow analysis in order to determine the loop-carried dependences. The commercial version of the KPNgen tool, the Compaan Hot-spot Parallelizer [67], focuses on the translation of computationally intensive kernels of applications into KPNs. These kernels are also SANLPs, and the generated KPNs are used for SW/HW implementation as in Daedalus.

• Mapping and Scheduling


Figure 2.15: Daedalus Framework


" & & * + & ,, -

#$ %

'() )

" . . * / .,,-

#$ %

'() )

& 2 . 2 " -

0 1 1 ## %

' )

" & & * + 4 &,,-

#$ %

' 3 ) 3 )

" . . * / 4 . ,, -

#$ %

' 3 ) 3 )

& 2 . 2 & 4 2 . 4 2 9 & 2 . 4 2 9 &, 2 . 4 2 9

5

67 1 1 0 $ %0 1 1 0 1 1 0 1 1

' 8 3 3 3 3 3

& 4 2 . 2 9 & 2 . 2 9 &, 2 . 2 9

0 1 1 0 1 1 0 1 1

3 3

& 4 2 ., 2 9 & 2 ., 2 9 &, 2 ., 2-

0 1 1 0 1 1 0 1 1

3 3 3 3 3 )

Figure 2.16: SANLP Example

The mapping and scheduling problems are solved together with the design space exploration of the system-level architecture. The problems are combined into one multi-objective optimization problem, whose solution gives answers to the sub-problems. To solve the optimization problem, an improved Strength Pareto Evolutionary Algorithm (SPEA2) [68] is used. The final mapping result is written into an XML file for the Espam tool to do code generation. Since the KPN model of computation does not depend on the scheduling of processes, two simple scheduling algorithms are directly supported by the framework: First Come First Serve (FCFS) and Round-Robin. So far, there is no information about how applications with real-time constraints are handled in Daedalus.

• Modeling and Simulation
The architecture modeling and simulation in Daedalus are done at a high level in the Sesame tool. The topology of the MPSoC platform can be configured in the tool by the user: which components should be taken from the IP library, how they ought to be connected, how many memory blocks should be instantiated, etc. The goal of the Sesame tool is automatic design space exploration, and the configurations of the best candidate platforms are generated as the result. In order to find the best candidates, high-level simulation is employed in Sesame to estimate the performance of each configuration for comparison. Unlike ISS-based simulators, Sesame uses an application event trace to do the simulation; a sketch of this trace-replay idea follows below. The application events are obtained by executing the KPN model natively on the host machine [69]. Three types of application events are traced, namely the communication events read/write and the computational event execute. The first two correspond to the communication channel accesses in the KPN model, and the execute event records the execution of high-level functions such as the computation of a DCT transformation. To estimate the overall performance of the application, the event trace is rescheduled in Sesame according to the process-to-PE mapping and the hardware configuration of the platform; the functionality of the application is not simulated in this process. So far, no detail is available about how the timing of functions such as a DCT transformation is estimated in the framework.
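The following is a deliberately simplified sketch of such trace-driven estimation: recorded events are replayed against a process-to-PE mapping and their latencies are accumulated per PE, without simulating functionality. The event kinds, latencies and the omission of inter-process dependencies are all simplifying assumptions, not Sesame's actual implementation.

    #include <stdio.h>

    /* Event kinds follow the read/write/execute classification above. */
    typedef enum { EV_READ, EV_WRITE, EV_EXECUTE } EventKind;
    typedef struct { int process; EventKind kind; unsigned latency; } Event;

    #define NUM_PES 2

    /* Replay the trace against a mapping; return the largest per-PE
       busy time as a crude performance estimate. */
    unsigned long replay(const Event *trace, int n, const int *map) {
        unsigned long pe_time[NUM_PES] = { 0 };
        for (int i = 0; i < n; i++)
            pe_time[map[trace[i].process]] += trace[i].latency;
        unsigned long makespan = 0;
        for (int p = 0; p < NUM_PES; p++)
            if (pe_time[p] > makespan) makespan = pe_time[p];
        return makespan;
    }

    int main(void) {
        Event trace[] = { {0, EV_EXECUTE, 120}, {0, EV_WRITE, 10},
                          {1, EV_READ, 10},     {1, EV_EXECUTE, 200} };
        int map[] = { 0, 1 };   /* process 0 -> PE0, process 1 -> PE1 */
        printf("estimated makespan: %lu cycles\n", replay(trace, 4, map));
        return 0;
    }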

• Code Generation
The Espam tool is the overall system code generator of Daedalus; its software generation function is the most interesting part here. The software code generated by it is composed of three parts: process behavior in C, communication/synchronization primitives, and a memory map. The C behaviors of the processes are synthesized from the KPN specification in the XML file. The communication/synchronization primitives are realized through read/write FIFO channel function calls. According to the communication channel information in the XML file, the generator determines the address of each accessed FIFO, and a list of all FIFO addresses is put into a global memory map file.

Overall, the Daedalus framework automates the whole design flow of the MPSoC platform, though some limitations may apply due to the high level of automation. The automatic KPN generation only supports SANLPs; manual input is still needed for applications which are not in this category. Moreover, the default scheduling support of the framework is relatively simple, and no real-time scheduling service has been reported so far. The current Daedalus software development flow seems to be sufficient for the processor components in its IP component library and the MPSoCs designed by the framework itself. The applicability of the approach to other MPSoC platforms, e.g. the OMAP platform, is still to be seen.

2.3.9 MPA

The MPSoC Parallelization Assist (MPA) tool ([70], [71]) is developed at IMEC; its goal is to help designers map applications onto embedded multicore platforms and perform design space exploration. Similar to the HOPES framework, the MPA tool targets existing MPSoC platforms, and currently it supports the ARM11 MPCore [35] processor. Besides, multicore processors such as the AMD Phenom [11] are also supported as native verification platforms.
An overview of the MPA framework is depicted in Figure 2.17. The starting point of the MPA exploration flow is the sequential C source code of the target application. It is profiled by source-level instrumentation and execution on the target platform or an Instruction-Set Simulator (ISS). Based on the profiling result, the programmer needs to create an initial parallel specification (ParSpec) in order to let the MPA tool generate the first parallel version of the application and trigger the exploration loop. The generated parallel C code is then simulated in a high-level simulator (HLsim), which additionally takes the profiling data and a target architecture specification as input. The result is produced as an execution trace, according to which the programmer then adapts the architecture specification and/or the ParSpec for the next iteration of the exploration loop. Once the programmer is satisfied with the parallelization result, the resulting parallel C code can be directly reused on the target platform, provided that the required runtime environment (RTlib) is already implemented on it.

• Programming Model
The programming model of each single application in the MPA is still sequential, because it requires the input to be written in the C programming language. Since C is to this day the most used programming language in the embedded area, and the MPA tool targets embedded MPSoCs, supporting C is quite natural.

• Parallelization
Along with the sequential C source code, the programmer needs to provide an additional file which explicitly specifies the application parallelism, the so-called ParSpec file. The ParSpec file controls the parallelization of the sequential input by specifying the computational partitioning of the application. The partitioning is defined by one or more parallel sections; outside a parallel section the code is executed sequentially, whereas inside a parallel section all code must be assigned to at least one thread. Both functional and data-level parallelism are supposed to be supported by the ParSpec.


Figure 2.17: Overview of the MPA Framework

a) Sequential C Code:

     1  int main(){
     2    …
     3    parsect:
     4    {
     5      pr:
     6      printf("Print once.\n");
     7      for (i=0; i<100; i++) {
     8        add:
     9        c += a[i] + b[i];
    10      }
    11    } /* implicit barrier */
    12    …
    13  }

b) ParSpec File:

    parsection parsection1:
      parsecblock label main::parsec
      sharedvar a b
      thread T1:
        include label main::pr
        looprange iterator i from 0 to 50
        include label main::add
      thread T2:
        looprange iterator i from 50 to 100
        include label main::add

c) Thread Relation: the Master thread forks T1 and T2 at the parallel section and joins them again at the implicit synchronization barrier.

Figure 2.18: MPA Application Example

Figure 2.18 gives an example, in which the sequential C source code is partially shown in a), and the corresponding ParSpec file is displayed in b). As can be seen in the example, the parallel sections are identified through labels in the source code. For instance, a parallel section parsection1 is defined, whose source code is enclosed by the labeled block parsect (line 3 to 11), and the thread T1 includes the code from the label pr to the end of the for loop (line 5 to 10). Data-level parallelism is exploited by distributing different iterations of loops over multiple threads, e.g. the T1 thread executes the for loop (line 7 to 10) with the index variable i ranging from 0 to 50, and the T2 thread covers the range from 50 to 100. At the end of each parallel section, synchronization is implicitly performed between the parallel threads and the master thread. Figure 2.18 c) shows the relation of the threads in the example. The combination of sequential C and a separate parallel specification provides a flexible way of specifying different parallelization schemes. The same labeled statements and blocks can be grouped in different ways, and the groupings can be stored in different ParSpec files without changing the C source code. However, the problem of how to come up with a reasonable partitioning still needs to be solved by the programmer himself.

• Mapping and Scheduling
At the moment, the mapping and the scheduling of the parallel threads are managed by the target platform OS, if available. The MPA runtime library, RTlib, is supposed to be implemented on top of native thread functionality such as Pthreads, through which the thread-to-processor mapping and the scheduling are handled dynamically by the OS. If the target platform does not have OS support, it is mentioned that a fixed mapping will be used, but no detail is so far available about how this is done.

• Code Generation
The MPA tool generates parallel C code according to the input sequential C code and the ParSpec file. The thread model used by the parallel version is not the one directly provided by the target platform; instead, it uses the API provided by the RTlib. Before code generation, scalar dataflow analysis is performed on the functions containing parallel sections and on their directly or indirectly invoked callee functions. The result of the analysis is the so-called Factored Use-Def (FUD) chains [72]. Based on the FUD chains, the code generator identifies data which needs to be communicated across thread boundaries. Communication primitives are then inserted into the parallel C code, which use FIFO-style communication channels for transferring the data; a sketch of such a channel follows below. For some shared variables such as arrays, which are not directly handled by the tool, the designer needs to explicitly declare them as shared in the ParSpec file so that the code generator generates the corresponding synchronization code.
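To illustrate the kind of FIFO-style primitive such a generator inserts at thread boundaries, here is a minimal blocking FIFO built on Pthreads. The fifo_put/fifo_get names and the fixed capacity are invented for illustration; they are a stand-in for, not a copy of, the RTlib API.

    #include <pthread.h>

    #define CAP 64

    typedef struct {
        int buf[CAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } Fifo;   /* initialize the mutex/conds, e.g. with PTHREAD_*_INITIALIZER */

    /* Blocking write: waits while the channel is full. */
    void fifo_put(Fifo *f, int v) {
        pthread_mutex_lock(&f->lock);
        while (f->count == CAP)
            pthread_cond_wait(&f->not_full, &f->lock);
        f->buf[f->tail] = v;
        f->tail = (f->tail + 1) % CAP;
        f->count++;
        pthread_cond_signal(&f->not_empty);
        pthread_mutex_unlock(&f->lock);
    }

    /* Blocking read: waits while the channel is empty. */
    int fifo_get(Fifo *f) {
        pthread_mutex_lock(&f->lock);
        while (f->count == 0)
            pthread_cond_wait(&f->not_empty, &f->lock);
        int v = f->buf[f->head];
        f->head = (f->head + 1) % CAP;
        f->count--;
        pthread_cond_signal(&f->not_full);
        pthread_mutex_unlock(&f->lock);
        return v;
    }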


• Modeling and Simulation
The parallelized application can be executed either on the target platform or in HLsim, a high-level simulator. The latter is mainly used by the designer to quickly evaluate different parallelization schemes. In order to estimate the execution time, HLsim uses the profiling information collected from the target platform or a cycle-accurate ISS. The delay introduced by communicating data between threads is computed from the size of the data and the platform parameters specified in the architecture specification. After the simulation, HLsim generates an execution trace, which is then analyzed by the designer to decide whether changes to the application partitioning are needed.

Overall, the focus of the MPA tool is on the parallelization of sequential C applications. Since the parallel sections in the C code and the parallel specification need to be created manually by the developer, the whole design flow still requires a lot of user interaction. Besides, the tool completely relies on the target OS for mapping and scheduling, and the framework has little control over either process. For this reason, no real-time support is currently provided by the MPA tool.

2.3.10 TCT

TCT is the name of an MPSoC platform [73] developed at the Tokyo Institute of Technology. It has its own programming model called Tightly-Coupled-Thread (TCT), which is an extension of the C language. Figure 2.19 gives an overview of the complete framework. The design flow starts with the sequential C code with the thread extension. A special compiler is provided by the framework to compile the C code. After the compilation, the developer has two possibilities to execute the application: either in an ISS or on the TCT MPSoC hardware.
The TCT MPSoC uses one RISC processor and an array of six TCT co-processors as its processing engines [74]. The former features an ARM-compatible instruction set and works mainly as a controller; the latter is designed for computation-intensive applications.


Figure 2.19: TCT Framework Overview

The co-processor array has a fully distributed memory architecture, i.e. no memory is shared between co-processors. Each co-processor has its own memory, and communication is realized through dedicated channels, which are physically implemented through a crossbar switch.

• Programming Model
The TCT programming model is designed with the goal of allowing programmers to write parallel applications in C. From the coding point of view, its syntax is very simple: the developer just needs to insert thread constructs, which are similar to macros, into the C code. An example of TCT C code is shown in Figure 2.20 a). The developer defines three threads: T1 (line 3 to 16), T2 (line 7 to 10) and T3 (line 11 to 14). Besides, there is a main thread which is always instantiated by the tool. In total, the example application has four threads executing in parallel, and Figure 2.20 b) illustrates their activation relation. Most C statements can be included in TCT threads, with the exception that goto statements and dynamic memory allocations are not allowed. The execution model of TCT is a program-driven MIMD (Multiple Instruction Multiple Data) model where each thread is statically allocated to one processor ([75]). Since multiple flows of control are executed simultaneously, different parallelism schemes such as functional pipelining and data-level parallelism can be modeled. For programmers, the TCT programming model is very friendly in the sense that the inserted thread scope annotations hardly change the sequential form of the C code.

• Parallelization
The TCT programming model provides an intuitive way of declaring code blocks to be executed in parallel. Nevertheless, the framework itself mainly relies on the developer to insert the parallel constructs explicitly. Hence, the exploration of the parallelism in the application is driven by the programmer.

• Mapping and Scheduling
As mentioned earlier, each TCT thread is statically allocated to one processor. In this scenario, mapping and scheduling are very straightforward, because no special exploration process is needed.

a) Threaded C Code:

     1  int main(){
     2    …
     3    THREAD(T1){
     4      /* T1 body */
     5      for (i=0; i<100; i++) {
     6        …
     7        THREAD(T2){
     8          /* T2 body */
     9          …
    10        }
    11        THREAD(T3){
    12          /* T3 body */
    13          …
    14        }
    15      }/* End for */
    16    }/* End T1 */
    17  }

b) Thread Activation Tree: Main activates T1, which in turn activates T2 and T3.

Figure 2.20: TCT C Code Example


• Code Generation
Since the communication between threads is not explicitly specified in the C code, it is the job of the TCT compiler to analyze the data transfer between threads using the dependence flow graph [76]. Afterwards, the compiler inserts the inter-thread communication/synchronization instructions and generates the binary executable. The communication and synchronization between threads are realized through special instructions: Control Token (CT), Data Transfer (DT) and Data Synchronization (DS). The CT instruction activates the execution of a thread; the DT instruction transfers data through a communication channel; and the DS instruction checks the availability of certain data, stalling the thread if necessary until the data is available.

• Modeling and Simulation
An ISS is provided by the TCT framework for both software development and the design of the TCT MPSoC. Since it is also used for design space exploration, the number of processors instantiated in the simulator is configured according to the thread definitions in the application. The simulator can do both instruction-accurate simulation and trace-driven simulation. In the latter case, the instruction behavior is not simulated, and therefore the simulation speed can be very high. However, in order to obtain an execution trace, the application has to be executed in the instruction-accurate mode at least once.

The TCT framework provides a complete platform covering both software development support and MPSoC hardware. Compared to other frameworks, it supports a developer-friendly programming model, which is enabled by the sophisticated control/data flow analysis built into the TCT compiler. Therefore, later in this thesis, the platform is used as the target for developing new parallelization techniques.

2.4 Summary

MPSoC software development is a very challenging research topic. Due to market pressure, the approaches followed by industrial companies are typically conservative, in order to be more acceptable to developers who are familiar with traditional software development flows. Academic groups, on the other hand, are more active in proposing new programming models and languages and try to solve the problem from the top of the flow. However, almost no work covers all aspects of the MPSoC software development problem, including programming model, parallelization, mapping/scheduling, code generation and modeling/simulation. There is thus still considerable room for new research.

Chapter 3

Methodology Overview

3.1 MPSoC Application Programming Studio (MAPS)

MAPS is an MPSoC software development framework which aims to realize the methodology proposed by this thesis. The one and only goal of MAPS is to efficiently support MPSoC software development. Nevertheless, due to the complexity of the problem, it is unlikely that a single tool can provide a complete solution. Therefore, MAPS features a set of tools, each of which focuses on one or more parts of the overall MPSoC programming process.


Figure 3.1: Overview

Figure 3.1 shows an overview of the framework. With MAPS, the development of MPSoC software is carried out in a sequence of systematic steps. The rest of this section gives an overview of each step in the tool flow.

3.1.1 Application Modeling

Generally speaking, an application model in the MAPS framework is a description of the application which is to be deployed on the target MPSoC platform. As the input of the tool flow, it is supposed to contain all information necessary for driving MPSoC software development. Typically, this includes the application behavior, the application's real-time requirements and the relation between multiple applications when they co-exist in one MPSoC. Since some of this information is only required when multiple applications are handled, MAPS separates the modeling of applications into two parts, namely single-application and multi-application.

Single-Application Model

The MAPS single-application model focuses on individual applications considered standalone. This includes their functionality and real-time behavior. To describe the functionality of an application, a programming model is required. As already discussed in Chapter 1, two types of programming models are normally used by developers for this purpose: sequential and parallel. The C programming language is an example of the former, and it is to this day the most used programming language in the embedded domain [77]. The Message Passing Interface (MPI) [21] is an example of the latter, which has been used for decades in the high-performance computing area. Although parallel programming models are well suited for applications with intrinsic parallelism, none of them has been widely used in embedded devices.
The solution MAPS envisions is an extension of the C language. The reasons for extending an existing language are twofold. On the one hand, since developers have written a huge amount of C code during the past few decades, it is helpful to allow the reuse of this code in the MPSoC era. On the other hand, extensions are required to allow programmers to explicitly express parallelism when they know how the application should be parallelized. The model of computation supported by the extension should be flexible, so that most applications can be modeled without restriction. Currently, the Kahn Process Network (KPN) model [47] is targeted by the MAPS C extension. Moreover, the extension needs to allow programmers to directly specify real-time requirements with the MAPS-extended semantic elements, so that no external data structure is required and the model-inconsistency problem is avoided.

Multi-Application Model

For devices which run only one application, the single-application model is enough for software development. In reality, however, most devices, such as mobile phones, have more than one application running at the same time. In such an environment, scheduling and mapping tasks solely based on knowledge of each individual application is not enough, because the processing elements are shared, and a change to one application can easily influence the behavior of another.


Therefore, an overview of all applications in the target MPSoC is required to better coordinate their execution. In MAPS, this is provided by a multi-application model. In principle, the MAPS multi-application model mainly describes the relationships between applications. For example, when multiple applications share one hardware resource such as a hardware accelerator, only one of them can be executed at a time. This exclusive resource usage information helps the scheduler to schedule these applications. Moreover, some applications are unlikely to run at the same time; for example, a video player normally does not run during an ongoing teleconference. Although they can technically run in parallel, it is better for the device to spend more resources on the application with the higher priority. Such concurrency information should also be included in the multi-application model. In summary, the goal of the multi-application model is to store all information relevant to the global scheduling and mapping of applications.

3.1.2 Architecture Modeling

Heterogeneous processing elements and special communication architectures are often employed in MPSoCs to improve performance and/or energy consumption for a specific application or application domain. The corresponding software development process must be aware of the MPSoC architecture so that the hardware resources can be utilized efficiently. For example, knowledge of the instruction set architecture (ISA) of a processor can be used to estimate the computational cost of a piece of C code; an auto-parallelizing compiler can better partition the input program if it is aware of the differences between the processing elements in the target platform; task scheduling and mapping first need to know which and how many processors the platform contains before any decision can be made; and a code generator needs to know for which MPSoC the code should be generated.
In MAPS, such knowledge is provided by an architecture model describing details of the processing elements and communication architectures of the target MPSoC platform. It serves as a centralized database, as depicted in Figure 3.2, from which tools can query the architecture information required for MPSoC software development. Structural information, such as the number and type of the processing elements and their connections, gives tools an overview of the topology of the target platform. Details like the instruction sets of processing elements and the latencies of communication channels should also be available to support the estimation of the computational/communication cost of MPSoC software; a sketch of such a cost estimate follows below. Typically, the modeling of an MPSoC platform needs to be done only once, and the resulting model can be reused for different applications. Therefore, in the MAPS tool flow, the overall effort for creating architecture models needs to be kept as low as possible.


Figure 3.2: Architecture Model
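As an illustration of how per-PE cost data from such a model might be used, the following sketch weights profiled C-level operation counts with a per-operation cycle table. The operation set, the cycle numbers and the two PE classes are invented for illustration; the actual MAPS model is described in Chapter 4.

    #include <stdio.h>

    enum { OP_ADD, OP_MUL, OP_LOAD, NUM_OPS };

    /* Assumed cycles per C-level operation on two hypothetical PEs. */
    static const unsigned cost[2][NUM_OPS] = {
        { 1, 3, 2 },   /* PE 0: RISC-like */
        { 1, 1, 1 },   /* PE 1: DSP-like  */
    };

    /* Weight profiled operation counts with the per-PE cost table. */
    unsigned long estimate_cycles(int pe, const unsigned long counts[NUM_OPS]) {
        unsigned long total = 0;
        for (int op = 0; op < NUM_OPS; op++)
            total += (unsigned long)cost[pe][op] * counts[op];
        return total;
    }

    int main(void) {
        unsigned long counts[NUM_OPS] = { 1000, 500, 800 }; /* from profiling */
        printf("RISC: %lu cycles, DSP: %lu cycles\n",
               estimate_cycles(0, counts), estimate_cycles(1, counts));
        return 0;
    }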


3.1.3 Profiling

In software engineering, profiling is a commonly used technique for investigating software behavior. Normally, programmers use the profile information to get an overview of the application and to identify the hotspots (the most frequently executed parts of the application) which require more optimization. Since profile information is collected during the execution of the program, profiling is also known as a form of dynamic analysis.
As software development projects are nowadays carried out by teams of programmers, one developer can be asked to work on code written by another. In MPSoC software development, where code reuse is emphasized, such situations occur even more often. Therefore, in MAPS, profiling is used as the first analysis process, helping programmers to get familiar with the application source code. Moreover, for the MAPS framework itself, the collected dynamic information is used by the control/data flow analysis process in the tool flow. The profiling approach followed by MAPS uses the fine-grained source code instrumentation technique first introduced in the micro-profiler [78]; an overview of the process is shown in Figure 3.3, and a sketch of the instrumentation idea follows below. From the source code of the input application, a special native program is created through instrumentation, whose execution produces a trace file recording the execution history of the application. Afterwards, the trace file is post-processed together with the architecture model to produce an architecture-aware profile for the target application.
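To give a flavor of fine-grained source instrumentation, the following sketch shows a function whose C-level operations have been wrapped with a counting hook. The hook name prof_trace_op and the opcode constants are invented; a real instrumenter would emit trace records to a file rather than simple counters.

    #include <stdio.h>

    /* Illustrative trace hook: here it only counts executed operations. */
    static unsigned long op_count[2];
    #define OP_ADD 0
    #define OP_MUL 1
    static void prof_trace_op(int opcode) { op_count[opcode]++; }

    /* Instrumented version of:
       int mac(int a, int b, int acc) { return acc + a * b; } */
    int mac(int a, int b, int acc) {
        prof_trace_op(OP_MUL);
        int t = a * b;
        prof_trace_op(OP_ADD);
        return acc + t;
    }

    int main(void) {
        int acc = 0;
        for (int i = 0; i < 10; i++) acc = mac(i, i, acc);
        printf("adds=%lu muls=%lu\n", op_count[OP_ADD], op_count[OP_MUL]);
        return 0;
    }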

3.1.4 Control/Data Flow Analysis

Control/data flow analysis is used in almost all compilers to obtain information about how a program executes and uses data. The control/data flow analysis in MAPS has a similar goal as its counterpart in traditional compilers. Moreover, it not only analyzes the dependences between different parts of the source code, but also combines the resulting dependence information with dynamic profiling information in order to obtain a comprehensive view of the target application.
In traditional compilers, control/data flow analysis is static, i.e. the analyses are performed purely on the source code of the application, without execution information. Consequently, the results of the analyses can only indicate the existence of control or data dependences. From static analysis it is difficult, sometimes impossible, to determine how often a dependence occurs, which is information useful for parallelizing the application. Therefore, dynamic profiling information is used by MAPS in its control/data flow analysis to complement the dependence information with runtime statistics. The result of the analysis process is generated in the form of flow graphs whose nodes are fine-grained C-level atomic operations such as addition, pointer dereference, etc.


Figure 3.3: Profiling Process

Such information is needed by the MAPS partitioning tool to search for parallelism in the target application. From the user's perspective, the analysis process is internal and completely transparent; it is triggered automatically before the MAPS partitioning tool is started.

3.1.5 Partitioning

The goal of the partitioning process is to find the parallelism in the target application. The partitioning result is crucial for MPSoCs, because only by separating an application into parallel tasks is it possible to exploit the computational power provided by the multiple processing elements of an MPSoC. Typically, the code being partitioned is a complete sequential application. Besides, when the target application is modeled using a coarse-grained parallel programming model, in which each parallel process is sequential and contains a lot of computation, code partitioning can be applied to a parallel process to further exploit its internal parallelism. Traditionally, such partitioning work is done manually by developers. Unfortunately, this makes the development of MPSoC software tedious and error-prone. Therefore, a tool is provided by MAPS to help MPSoC programmers partition the application.
To partition an application into parallel tasks, three main issues need to be solved: granularity, dependence and parallelism. First is the granularity of the tasks, i.e. the amount of computation which should be performed by a task. Fine-grained tasks are small and can be finished within a short period of time; on the contrary, a coarse-grained task requires more computation time. The choice of granularity depends on the underlying MPSoC architecture and the behavior of the target application. Besides, the amount of communication between tasks also plays an important role: the less, the better. Ideally, when no data needs to be transferred between two tasks, they can be executed in parallel.
The second issue in the partitioning process is the dependence between tasks. In principle, tasks depend on each other due to two possible causes: control and data. A control dependence exists when a task controls the activation of one or several other tasks, and a data dependence typically occurs when a task reads data which is produced by another task. The control and data dependences unveil the correlation between tasks. For instance, when a task produces a large amount of data for another task, they are closely related and might better be merged into one task to avoid the data transfer.
The available parallelism in the application is the last issue in partitioning. The biggest question to be answered here is which part(s) of the target application can be parallelized. Normally, control-intensive code is unfriendly to parallelization, while loops are good candidates for constructing parallel tasks. It is challenging for a software tool to find the hidden parallelism in applications.
The MAPS partitioning tool addresses the three problems mentioned above through a semi-automatic approach. An overview of the partitioning process is given in Figure 3.4. Given the dependence information provided by the preceding control/data flow analysis process, the partitioning tool first automatically searches for parallel task candidates at different granularity levels in the application. Afterwards, the programmer can choose to accept the generated result or to review/modify it according to his knowledge of the application and the target MPSoC architecture.


Figure 3.4: Semi-Automatic Partitioning

3.1.6 Scheduling & Mapping

In MPSoC software development, the goal of mapping is to allocate processing elements to parallel tasks, i.e. to decide on which processing element a task should run. This is also known as spatial mapping. Once the mapping information is available, the execution order of the tasks on each processing element is determined by scheduling, often referred to as temporal mapping.
Task scheduling and mapping are two highly correlated problems. The question of which one should be solved first is very similar to the phase-coupling problem in traditional compilers, where the order of instruction scheduling and register allocation needs to be determined. Since task scheduling must consider real-time requirements, which can be stringent, it is not always possible to find a feasible schedule if too many tasks are mapped to a processing element. In such a case, if task mapping is performed first, the result of task mapping needs to be reconsidered according to the scheduling result, and some tasks might be remapped and rescheduled. The whole process continues until the application requirements are met.
Due to the high correlation between the task mapping and scheduling problems, MAPS targets solving them as a whole, in one process. Instead of mapping and scheduling the parallel tasks of one application individually, multiple MPSoC applications are considered altogether with respect to their requirements on computational resources and their real-time constraints. The MAPS multi-application model plays an important role here, since it provides information about all the applications on an MPSoC platform. The operating system running on the platform has to be considered, because the resulting task schedule needs to be realized on top of it. Moreover, for platforms without an operating system, a customized scheduler may be required to implement the temporal mapping of tasks on processing elements.

3.1.7 Code Generation

With the application partitioned into parallel tasks and the mapping and scheduling of the tasks determined, the code generation process finally generates target code for the underlying MPSoC platform.

The target code refers to source code which can be compiled for the processing elements of the target MPSoC. In most cases, C code is generated, because it is supported by most embedded processors. However, in special cases where no C compiler support is available, it can be necessary to generate assembly code for the target processor. During the generation process, communication and synchronization primitives need to be added by the MAPS code generator to the generated target code. Typically, such primitives use the support provided by the operating system running on the target MPSoC. However, it is also possible that a processing element does not provide any such support; then, the code generator is responsible for generating code to implement the required communication and synchronization functionality. Besides, some MPSoC platforms come with their own support for parallel programming models, like MPI [21] and OpenMP [22]. For them, it is necessary for the MAPS code generator to be fully aware of the underlying programming model, so that the produced code can make the best use of the target platform.

3.1.8 High-Level Simulation

Traditionally, programmers need to use a real hardware prototype for testing the functionality of MPSoC software. Recently, Virtual Platforms (VPs), which are simulators of the target MPSoC platform, are used more and more by developers. VPs employ instruction-set simulation technologies for executing the instructions of the target processing elements, and target binaries of MPSoC software can be executed on VPs without modification. Typically, less time is needed to develop a VP than a hardware prototype, because everything in a VP is simulated in software. However, since hardware details like instruction sets have to be simulated in VPs, and a lot of effort is required to create simulation models of such details, the development of VPs is still time-consuming. Often, MPSoC programmers still have to wait for VPs to be finished, which prevents early testing of MPSoC software and potentially reduces the efficiency of the development team.
Therefore, high-level simulation is provided by MAPS as the means for early functional testing of MPSoC software. Since details of the target MPSoC platform might not be available at an early design stage, the architecture of the target MPSoC needs to be generalized and abstracted. For this purpose, a generic MPSoC simulator is provided by MAPS. Figure 3.5 roughly shows the whole simulation process. Programmers can easily configure the simulator with different numbers of processing elements and different performance characteristics in order to let it mimic the architecture of the target MPSoC.

Figure 3.5: High-Level Simulation (a parallel application runs natively on the generic MPSoC simulator, which delivers early functional test results and performance estimates)


For functional testing, the MPSoC software is executed natively in the simulator, and to debug the source code, the programmer can simply use a native debugger. With the high-level simulation support provided by the MAPS framework, all of this can be done at a very early design stage, when neither a hardware prototype nor a VP is available.

3.1.9 Target MPSoC Platform

The MAPS high-level simulation facility enables early testing of MPSoC software, but it is not enough to finally determine whether the developed code fulfills criteria like real-time constraints. This requires an accurate execution environment, which is not readily available through highly abstracted simulation; only a real hardware prototype or a VP with clock cycle accuracy can provide the required accuracy and environment. Therefore, a target MPSoC platform is necessary in the MAPS tool flow as the final execution platform for software testing purposes. In addition to the functionality of the application, the application timing must also be verified against the design criteria when real-time constraints exist. Depending on availability, either hardware or a VP can be used as the test vehicle. Ideally, both kinds of platforms should behave the same and provide identical information to the programmer. Nevertheless, in reality, a VP might simplify some hardware details, such as caches, for the sake of simulation speed, so the timing information provided by a VP might not be fully correct. Therefore, the most accurate timing information can only be derived from hardware.

3.1.10 Integrated Development Environment

As can be seen from the previous sections, the complete MAPS tool flow consists of a number of processes. Some of them are automatic and do not require user interaction, e.g. control/data flow analysis; some work semi-automatically under the control of the user, like the MAPS partitioning tool. In order to create a fluent software design flow for MPSoCs, the tools need to be integrated and put under the control of one design environment. For this purpose, the MAPS IDE (Integrated Development Environment) is designed. The MAPS IDE is a software development environment based on the Eclipse [14] platform, which is widely used in industry for constructing customized design environments. The C/C++ development tools [79] from Eclipse are reused in the MAPS IDE to provide basic support for developing software in C. On top of this, the IDE is enhanced with a set of special MAPS plugins for controlling the execution of the tools involved in the MAPS design flow. The results produced by the tools are also visualized in the MAPS IDE for the programmer. By integrating the tools in the MAPS IDE, a unified working environment is provided to MPSoC software developers.

3.2 Contribution of This Thesis

Within the scope of this thesis, the overall concept of the MAPS programming methodology is developed, and a set of tools are implemented to realize the proposed methodology. In this first implementation of the MAPS methodology, different processes in the MAPS development flow are realized as follows:


• Application Modeling Since the MAPS single/multi-application models are still under development, sequential C source code is used here as the input application model to drive the development of the framework.

• Architecture Modeling A high-level modeling method is developed for creating abstract MPSoC architecture models.

• Profiling A trace based profiler is used to generate not only a source level performance profile for the programmer but also dynamic information for the control/data flow analysis.

• Control/Data Flow Analysis An analysis process is implemented, which combines static and dynamic analysis techniques for discovering the control and data dependences inside the application.

• Partitioning With the dependence information given by the previous process, a semi-automatic code partitioning tool is provided to partition a sequential application into parallel tasks.

• Code Generation For generating target code, a code generator is implemented, which is capable of automatically producing code for the TCT MPSoC platform [75]. For other MPSoC platforms used in this thesis, the code generation is done manually.

• High-Level Simulation The high-level simulation is done in this thesis with a generic MPSoC simulator which simulates MPSoC architectures at a very high abstraction level.

• MAPS IDE Last but not least, a prototype of the MAPS IDE is also constructed within this thesis to provide a convenient environment for using the above mentioned tools.

To prove the feasibility of the MAPS methodology, several MPSoC platforms are used in this thesis as targets:

• TCT The TCT MPSoC platform [73] is developed at the Tokyo Institute of Technology. The TCT hardware is composed of one RISC processor as the main controller and six co-processors as processing engines. This thesis uses the TCT MPSoC simulator to run the corresponding target code; the simulator can be configured with more than six processors. Besides, no operating system support is provided by TCT.

• Multi-ARM The multi-ARM platform is an in-house MPSoC platform [80] which consists of a configurable number of ARM926 [81] RISC processors. For the dynamic scheduling of software tasks, the platform features a special hardware block. Since no hardware prototype is available for the platform, VPs are built to run applications.

• LT-RISC/VLIW The LT-RISC/VLIW platform is an experimental MPSoC platform whose components are the LT-RISC and LT-VLIW processors from the IP library of the CoWare processor designer [52]. To experiment with different architecture configurations, VPs are built with different numbers and combinations of the processors. Like the TCT platform, the LT-RISC/VLIW platform has no operating system available.

Several multimedia and signal processing applications are developed using the MAPS methodology. Since the target platforms provide limited support for scheduling, the mapping and scheduling are done manually for the test applications. The results show that the MAPS framework can support programming these platforms efficiently. In the following chapters, the components which are implemented in this thesis will be discussed in detail.

Chapter 4 Architecture Model

In Chapter 3, it was mentioned that the development of MPSoC software is carried out by the MAPS framework in several steps. Within this design flow, the target architecture information is used by different processes. For example, the partitioning tool needs to know how many clock cycles it takes a processor to execute a piece of code, so that the tool can decide whether the code should be parallelized; the mapping process needs to find the most suitable PE for the parallelized tasks by taking into account their execution cost on different PEs. Due to this extensive use of architecture knowledge, it is necessary for the framework to put all required information into a centralized location, which is the MAPS architecture model discussed in this chapter. MAPS contains a high-level architecture model of the target MPSoC platform. It provides information about the target processing elements for estimating the computational cost of software. Both homogeneous and heterogeneous architectures can be supported. Physically, the model is specified in XML format so that future improvements can be incorporated.

The remainder of this chapter is organized as follows. First, in Section 4.1, a brief analysis of the relation between performance estimation and architecture modeling is given to motivate the modeling method used in MAPS. Afterwards, related work in this field is discussed in Section 4.2. Then, the processing element model is introduced in Section 4.3. Following the discussion of the modeling methods, Section 4.4 presents how the model information is stored in the framework. Finally, the chapter is concluded with a short summary in Section 4.5.

4.1 Performance Estimation and Architecture Model

In the MAPS framework, the main usage of the architecture model is performance estimation. Generally, there are three types of approaches to obtaining performance information of an application or a piece of source code written in a high-level programming language such as C:

• Hardware execution is the most accurate way to measure software run time. However, since the creation of a hardware prototype requires a large amount of effort, programmers cannot always be expected to have one at hand.


• Simulation is often used as an alternative to hardware. The speed and the accuracy of simulation heavily depend on the underlying simulator. On the one hand, an RTL level simulator like ModelSim [82] can achieve very high accuracy, but a simulation speed of several tens of KIPS (Kilo Instructions Per Second) would give programmers a hard time doing software development. On the other hand, high speed instruction-set simulation techniques introduced by companies like CoWare [52] can achieve simulation speeds of several hundred MIPS (Million Instructions Per Second) at the sacrifice of accuracy. Compared to hardware, simulation is able to give results relatively early. However, this approach often suffers from long execution times for large applications.

• Abstract models describe the underlying architecture at a very high abstraction level. They have been proposed to provide a much more efficient performance estimation method with low modeling and execution effort [83]. Since no execution is done in this approach, the speed of estimation can be very high.

Table 4.1 summarizes the differences between the three above-mentioned approaches. In the MAPS design flow, processes like partitioning often have a number of design alternatives to be evaluated, e.g. a piece of code can be partitioned in different ways. The main criterion for making the decision is often the effect on performance. Since the number of design alternatives can be very large, it is not possible to execute the software each time the performance information is needed. Therefore, MAPS employs a high-level approach for early performance estimation, and an abstract architecture model is developed and used.

Method             Abstract     Simulation        Hardware
Abstraction Level  High         Medium            Low
Accuracy           Low          Model dependent   100%
Speed              Very High    Model dependent   Realtime
Setup Effort       Low          Moderate          Very High

Table 4.1: Comparison of Performance Estimation Methods

In the next section, existing abstract modeling methods for performance estimation will be discussed.

4.2 Related Work

In [84], Giusto and Martin propose a method to estimate the execution time of software running on a microprocessor. The target processor is modeled as a virtual processor instruction set. The number of executed virtual instructions is counted by the Cadence Virtual Component Co-Design (VCC) [85] compiler and estimator. A set of benchmark applications is executed in an Instruction Set Simulator (ISS) of the target processor to get the real execution time. The collected data are then used in a multiple linear regression to estimate the execution cycles of virtual instructions at the CPU/compiler level. After that, the overall performance can be calculated according to the number of virtual instructions in an application. Due to the linear regression, the proposed method is accurate only when the benchmark applications and the application to be estimated are similar. To define this similarity, a discriminator is developed, which tells whether the predictor can be used for other programs or not. The reported error range is between -12% and +5% when the estimated application is similar to those used for training.

A dynamic data analysis driven technique, which characterizes the performance of embedded applications, is presented by Bontempi and Kruijtzer in [86]. In comparison to the above mentioned method, a nonlinear function is used as the estimator to predict the performance of applications. To calibrate the estimator, some representative training applications are required. Their performance measurements are first obtained using a cycle accurate simulator, and then the parameters of the non-linear estimator are calculated by machine learning techniques. With this method, the reported mean error is about 8.8%.

The above introduced methods are improved in [83]. A more precise method using a neural network to estimate virtual instruction execution cycles is applied, which mainly targets advanced architectures. The results have errors of up to 17% compared to cycle accurate simulation.

Varbanescu et al. introduce a performance predictor for MPSoCs named PAM-SoC in [87], which is based on PALMELA [88]. The models of the system (number of processors, memory behavior) and the applications (instructions) are combined in this tool-chain in order to get timing information. The reported average error is about 19% for the Wasabi platform [89], which consists of several DSPs and one or more general purpose processors.

The performance estimation method employed in MAPS, especially the way the computational cost is calculated, is very similar to the approach introduced in [90]. A cost table which describes the execution cost of primitive C level operations, such as addition and subtraction, is used to characterize a target processor. In the following sections, details about how processing elements are modeled in MAPS will be discussed.

4.3 Processing Element Model

A complete MAPS architecture model is mainly comprised of two parts, the Processing Element (PE) model and the PE interconnection information. The PE model contains information about the computational resources available in the target MPSoC. As mentioned earlier, the provided information is at a high abstraction level; instead of directly specifying latencies of assembly instructions, C operation costs are described. To explain how this can be used to estimate the computational cost of a piece of C code, it is necessary to first have a look at the compilation process of a high-level programming language like C.

4.3.1 C Compilation Process

Typically, to program a processor, software developers can use both high-level programming languages, e.g. C/C++, and low-level language, i.e. assembly. Today, programmers mostly use high-level languages due to their ease of use, productivity and portability across platforms. But even if a program is written in C, it must first be translated into low-level assembly language, which is, in principle, a symbolic representation of the numeric machine code and constants. This translation process (compilation) is done by a compiler.

Generally speaking, a compiler is a program that translates a program written in one language (the source language) into a semantically equivalent representation in another language (the target language). Over the years new programming languages have emerged, target architectures continue to change, and input programs become ever more ambitious in their scale and complexity. Thus, despite the long history of compiler design, and its standing as a relatively mature software technology, it is still an active research field. However, the basic tasks that any compiler must perform remain essentially the same. Conceptually, the compilation process can be subdivided into several phases, as shown in Figure 4.1:

Figure 4.1: Typical Compilation Process (the frontend performs lexical, syntax and semantic analysis on the C source code; the middle-end applies optimizations to the IR; the backend performs code selection, register allocation, instruction scheduling and code emission to produce assembly code)

• The frontend creates an Intermediate Representation (IR) of the source program. The process can be further separated into several phases: the lexical analysis first breaks the input source text into a stream of meaningful tokens; then, the syntax analysis checks if the sequence of the tokens is correct according to a given grammar; finally, the semantic analysis verifies if the logic of the input program conforms to the rules which are defined in the programming language.

• The middle-end applies a sequence of high-level, typically machine-independent optimizations to transform the IR into a form that is better suited for code generation. This includes tasks such as common subexpression elimination, constant folding, constant propagation, etc. A very common set of high-level optimizations is described in [91].

• The backend constructs the target program in assembly language. Like the frontend, the backend is also composed of several phases: the code selection chooses suitable assembly instructions for the IR operations; then, the register allocation decides the mapping between variables and physical registers; afterwards, the instruction scheduling determines the order of the instructions; finally, the code emission outputs the assembly instruction stream to a text file which can be further processed by an assembler.

From the compilation process, it can be seen that the IR and the assembly code both contain information about the input program, and hence both can be used for estimating its computational cost. Theoretically, direct estimation from C statements is also possible. A comparison between them is shown in Table 4.2. A C statement can contain a sequence of operations, e.g. multiplication and addition. In the IR, these operations are recognized and stored in a suitable data structure. The example IR is in the format of three-address code, where each statement contains at most one operation, and all C operators and the majority of memory accesses are visible. IR operations are translated by the backend into assembly instructions. The translation is not always straightforward, e.g.

48 4.3. PROCESSING ELEMENT MODEL 49

C Code              IR - Three Address Code    Assembly Code
x = a * b + c*5;    T1 = a*b;                  mul R1, R2, R3
                    T2 = c*5;                  mul R4, R5, 5
                    x = T1+T2;                 add R1, R1, R4

Table 4.2: A Comparison Between C, IR and Assembly Code

DSPs often support Multiply-Accumulate (MAC) instructions which can perform a multiplication and an addition together. Since a C statement can contain multiple calculations, as the example shows, using it directly for performance estimation could overlook the operations it carries. In comparison, assembly code is highly machine dependent, and a complex compiler backend is needed to get information about the generated instructions. For the purpose of MAPS, it is desired that the model used for performance estimation can be easily adapted to new processor architectures. Therefore, instead of C statements or assembly, IR level information is used in the MAPS architecture model, which is kept in PE operation cost tables. In comparison to using assembly, using the IR for performance estimation is less accurate, because the effect of Instruction Level Parallelism (ILP), which is processor dependent, cannot be taken into account. Nevertheless, for RISC-like architectures, a reasonable accuracy can be achieved [90]. Presently, this approach is chosen to drive the design of the remainder of the MAPS framework, and it is possible to improve the approach in the future.

4.3.2 Operation Cost Table

In the MAPS architecture model, an operation cost table specifies the cost a processor needs to complete C level IR operations. Mostly, the number of clock cycles is used as the measure of cost, but the user can also fill the table with other performance related numbers, like execution time, instead.

4.1. Definition (Operation). An IR level operation o_{o,d} is a calculation, where

• o ∈ O, O is the set of IR operators; and

• d ∈ D, D is the set of data types.

Given the definition of the IR operation, an operation cost table can be defined as:

4.2. Definition (Operation Cost Table). An operation cost table C is a two dimensional array C = [c_{o,d}], where c_{o,d} denotes the estimated cost of one execution of the operation o_{o,d}.

The set of IR operators is defined by the compiler frontend which produces the IR, and the data types are defined by the C language specification. For example, the value of c_{add,char} in a cost table is the estimated cost of the target processor performing an addition on two character variables. In the IR, calculations with high-level composite data types, like struct, are typically transformed by the frontend into a sequence of operations on primitive data types like pointer and integer. Figure 4.2 shows an example of the transformation of a struct member access, which is done by the frontend of the LANCE compiler [92]. The simple assignment statement whose target is a member of a struct is lowered into pointer and integer arithmetic. Since the IR operations are closer to assembly instructions, using them for computational cost estimation is more accurate than using C statements directly.

struct s {                 /** a.c=1; */
  int b;                   t1 = &a;
  int c;                   t2 = (char *)t1;
} a;                       t3 = t2 + 4;
...                        t4 = (int *)t3;
a.c = 1;                   *t4 = 1;

Original C Code            Transformed IR

Figure 4.2: Transformation of Struct Member Access

The cost values are typically derived from the instruction cycle times of the target processor architecture, which can normally be found in its technical reference manual. However, since the mapping between IR operators and assembly instructions is not always straightforward, some entries need to be determined through compilation. Table 4.3 shows an example of operation costs for the ARM9EJ-S processor [93]. According to its manual [81], its addition operation takes 1 cycle to finish in most cases. Here, the bit width of int and long is 32, which is the same as the register file width of the processor; therefore, the estimated cost of their addition is set to 1. However, data of the long long type has a bit width of 64, which is more than an ARM9EJ-S processor can directly compute with one instruction, and the compiler translates one 64-bit addition into two 32-bit addition instructions; hence, the value 2 is used for the corresponding entry in the table.

        int    long   long long   ...
add     1      1      2           ...
...     ...    ...    ...

Table 4.3: Example Operation Cost of ARM9EJ-S
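For illustration, the "add" row of Table 4.3 can be held in a simple two-dimensional array, directly mirroring Definition 4.2. This is a minimal sketch: the enumeration and variable names are invented here and are not part of the MAPS implementation.

/* Minimal sketch of an operation cost table (Definition 4.2) in C,
 * populated with the "add" row of Table 4.3. */
#include <stdio.h>

enum ir_op  { OP_ADD, NUM_OPS };
enum c_type { TY_INT, TY_LONG, TY_LLONG, NUM_TYPES };

/* cost_table[o][d] is c_{o,d}: estimated cycles for one execution */
static const unsigned cost_table[NUM_OPS][NUM_TYPES] = {
    [OP_ADD] = { 1, 1, 2 },   /* int, long, long long */
};

int main(void)
{
    /* a 64-bit add maps to two 32-bit add instructions on ARM9EJ-S */
    printf("c_add,longlong = %u\n", cost_table[OP_ADD][TY_LLONG]);
    return 0;
}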

The cost table approach works well for RISC processors [78]. Since the target platforms in this thesis mostly feature RISC-like instruction sets, the approach is sufficient. However, for processors with instruction-level parallelism, such as VLIW processors, the accuracy achieved by the approach is not enough and further improvement is necessary, which is not covered by this thesis.

4.3.3 Computational Cost Estimation

Based on the operation cost table, the overall computational cost C_{overall} of a C program can be calculated by the following equation:

C_{overall} = \sum_{\forall o \in O} \sum_{\forall d \in D} c_{o,d} \cdot n_{o,d}    (4.1)

where n_{o,d} is the total number of times the operation o_{o,d} is performed in the program.


An example C program is shown in Figure 4.3, which contains only one function, and no control flow statement like if/then/else is used. Its IR code and the related operation cost and count values are given along with the C code. Since the program is very simple, its computational cost can easily be calculated: C_{overall} = 7.

int main()                 int main() {
{                            int a_3, b_4, c_5, t1;
  int a=1, b=2, c;           a_3 = 1;          //load const.   cost 1, count 2
  c=a+b;                     b_4 = 2;          //load const.
  return c;                  t1 = a_3 + b_4;   //add           cost 1, count 1
}                            c_5 = t1;         //assign        cost 1, count 1
                             return c_5;       //return        cost 3, count 1
                           }

C Code                     IR Code             Operation Cost and Number

Figure 4.3: A Simple Example of Computational Cost Estimation

Note that, to illustrate the calculation of the cost, the above example is kept minimal and does not contain any loop or other structure which can change the execution flow; hence, counting operations directly in the IR is possible. In reality, applications contain complex control flow whose execution can depend on input data, i.e. applications behave differently according to the input data they get. This makes static cost estimation from the IR nearly infeasible. Therefore, MAPS employs an advanced approach which combines static analysis and dynamic information gathered from execution. Its details will be discussed in Chapter 5.
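As a concrete illustration of Equation 4.1, the accumulation amounts to the nested loop below, reusing the enumerations from the earlier cost-table sketch. The count array is assumed to be filled by the profiling step of Chapter 5; this is a sketch, not the MAPS code itself.

/* Sketch of Equation 4.1: total cost = sum over all operator/type
 * pairs of cost-per-execution times execution count. */
unsigned long overall_cost(const unsigned cost[NUM_OPS][NUM_TYPES],
                           const unsigned long count[NUM_OPS][NUM_TYPES])
{
    unsigned long total = 0;
    for (int o = 0; o < NUM_OPS; o++)
        for (int d = 0; d < NUM_TYPES; d++)
            total += cost[o][d] * count[o][d];   /* c_{o,d} * n_{o,d} */
    return total;
}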

4.4 Architecture Description File

In MAPS, an architecture description file is used to centrally store all the information about the PE models and the architecture of the target MPSoC. For the rest of the framework, it serves as a database from which the architecture information can be retrieved for the estimation of computational cost. In order to make the description easily extensible, XML (Extensible Markup Language [94]) is used as the format for the architecture description; XML is currently the de facto standard for information exchange, and future improvements can be incorporated by extending the specification.

Figure 4.4: Overall Structure of the MAPS Architecture Description File (an XML skeleton containing ProcessingElement, Processor and Channel nodes)

Figure 4.4 shows the overall structure of an architecture description file. The information contained is mainly composed of two parts:


• PE description mainly contains the cost tables of the processing elements used in the target MPSoC platform;

• Architectural description describes the overall architecture of the MPSoC, which tells how many processors are available in the MPSoC and how they are connected.

The ProcessingElement nodes are the PE models, and the architectural description is contained in the Processor and Channel nodes. In the rest of this section, these two parts of the description will be introduced respectively.

4.4.1 Processing Element Description

In Section 4.3, it is already discussed that PEs are modeled in MAPS using operation cost tables. Therefore, the PE description mainly contains the table for the corresponding processor. Besides, since MPSoCs can employ different PEs, a ProcessingElement node is needed for each type of PE, and an architecture description file must have at least one PE description. An example of the PE description is shown in Figure 4.5. In each ProcessingElement node, there are two sub-nodes defined:

• CoreType specifies the type of the described PE. Processing elements with the same type can refer to the same ProcessingElement node for the model information. Besides, there must be one CoreType in a ProcessingElement.

• CostTable saves the information of the different operation costs. Similar to the CoreType node, its existence in ProcessingElement is also mandatory. The basic element in CostTable is Operation. The execution cost of an Operation with different VariableTypes is contained in the Cost nodes.

– Operation contains the cost for one specific operation with different types of variables. The type of operation is stored in its attribute Name.

– A VariableType node corresponds to one type of data which can occur in the compiler IR. Since C is the programming language supported by MAPS, normal C data types like int and char can appear in this node. The Name attribute of the node contains the type name.

– Cost is the leaf node of the whole PE description. It gives the execution cost of an operation with a specific variable type. The value of Cost is typically integral. For generality, floating point values can also be used here.

The example in Figure 4.5 shows that this PE description is for the ARM9EJ-S processor core, and that the cost for an addition operation with the data type int is 1, i.e. c_{add,int} = 1.

4.4.2 Architectural Description

The PE description provides the part of the information about an MPSoC which is required for the computational cost estimation for each individual PE. Information about the overall structure of the target platform is stored in the Processor and Channel nodes of the architecture description XML file.


<ProcessingElement>
  <CoreType>ARM9EJ-S</CoreType>
  <CostTable>
    <Operation Name="Add">
      <VariableType Name="int">
        <Cost>1</Cost>
      </VariableType>
      ...
    </Operation>
    ...
  </CostTable>
</ProcessingElement>

Figure 4.5: PE Description Example

• A Processor node represents one instance of a PE in the target MPSoC. An architecture description must contain at least one PE instance. The type of the PE is specified in the Type sub-node, which contains a text string referring to one of the PE descriptions specified in the current XML file. When multiple PEs of the same type are integrated in one MPSoC, they are distinguished from each other through the ID attribute of the node, whose content must be a unique identifier.

• A Channel node represents one connection between two PEs. The type of the connection, e.g. DMA, is specified in the Protocol sub-node. The Source and Target attributes specify the two end points of the channel, which are PEs. Since multiple channels using different communication mechanisms, like DMA or a bus, can co-exist between two PEs, multiple Channel nodes are allowed between them; an ID attribute is used to identify the different channel instances.

Figure 4.6 shows an example architectural description of the SHAPES RISC DSP Tile (RDT). From the XML data, it can be seen that the RDT is composed of two PEs, one ARM9EJ-S processor and one mAgic processor. The PEs are connected through a DOL (see Section 2.3.5) communication channel. It needs to be mentioned that communication channels are not covered by the current architecture model. Their models are still under development and will be added to the MAPS architecture model in the future.

Figure 4.6: Architectural Description Example (two Processor nodes of types ARM9EJ-S and mAgic, connected by a Channel node with the DOL protocol)


4.5 Summary

Modeling MPSoC architectures is itself a research topic, which can be very complex. The MAPS architecture model abstracts the target platform at a very high level. The proposed approach uses compiler IR level operation costs to describe how efficiently a PE performs computations. The model information is stored in an extensible form so that there is enough room for future improvements. The complete specification of the format is given in Appendix A. The modeling work is a one-time effort: a platform only needs to be described once, and the result can be reused for different applications. For the other parts of the MAPS framework, the created architecture model is taken as an input together with the application model.

Chapter 5 Profiling

Profiling is a common software analysis technique which is widely used today. In contrast to static code analysis, profiling uses information gathered from program execution; it is therefore a form of dynamic program analysis. In the MAPS framework, it is employed as the first analysis process. This chapter introduces the details of the MAPS profiling tool.

The goal of the profiling process in the MAPS framework is twofold. First, it provides source level profile information to MPSoC software developers to help them analyze the application behavior, which is useful when the source code is obtained from a third party. Second, the dynamic information is used internally by other processes in the MAPS framework, like the control/data flow analysis. To start profiling, two inputs are required: the source code of the target application and an architecture model of the target MPSoC platform, created according to the specification given in Chapter 4. The whole process is carried out in two steps, trace generation and post-processing. In the trace generation step, the application is compiled by a special compiler toolchain and executed on the host machine. The execution of the specially compiled application produces a trace file which records the dynamic information that is of interest to MAPS. Afterwards, the post-processing tool analyzes the generated trace file together with the provided architecture model and produces the final profile data.

For the user of MAPS, the result of profiling is provided in two forms. A source level performance profile is given by the tool, which shows the estimated computational cost per source line. Besides, a dynamic call graph is created, which unveils the calling relationships between functions along with their estimated computational costs. Both are visualized in the MAPS IDE for the user. For MAPS itself, the profile information is given at a much more fine-grained level, i.e. the compiler Intermediate Representation (IR) level, to be used internally for dynamic analysis.

The rest of this chapter is organized as follows: Section 5.1 first discusses research works as well as industrial products which are related to profiling; an overview of the whole profiling process is then given in Section 5.2, followed by Section 5.3, which presents details about the code instrumentation procedure and the profiling runtime system employed in this work; afterwards, Section 5.4 introduces the post-processing procedure together with its results; finally, Section 5.5 briefly summarizes the whole chapter.


5.1 Related Work

Profiling is nowadays a widely used technique for analyzing software, and a great deal of work has already been done in this area. According to the profiling technique, profilers can be put into three categories: sampling based, simulation based and instrumentation based.

Sampling based profilers like [95] and [96] probe the program counter of the host processor at regular intervals by utilizing operating system interrupts. Since only an additional interrupt routine is injected into the operating system and the application binary is not modified, the approach is non-intrusive and causes little performance loss. Nevertheless, because sampling profilers do not monitor the events occurring between the sampling intervals, their results are typically not very accurate and cannot easily be annotated back to the compiler IR of the application. Besides, memory access profiling is not possible with this approach.

Simulation based profilers are often at the same time instruction set simulators, e.g. [97]. Application binaries are executed in a completely monitored environment, and hence instruction level profile information can be obtained, which is very accurate and fine grained. However, because the target instructions are completely simulated in software, the performance is normally three orders of magnitude slower than native host execution, and the result is specific to the simulated instruction set architecture.

Instrumentation inserts additional instructions or code into the target program to collect the required information. By carefully inserting code at different places in the program, different kinds of runtime information can be collected. Therefore, instrumentation based profilers are flexible in terms of the results they generate. As a matter of fact, most profilers today are based on code instrumentation techniques. Since code instrumentation can be done at different levels, such profilers can be further classified into source, fine-grained and binary profilers.

Source code level instrumentation is the most widespread type of profiling technique, mostly for C/C++ code. Standard tools like GNU gprof/gcov [98] can be used to generate statistics about the CPU time spent in different functions as well as execution frequencies of code lines. Programmers typically use such tools to optimize the application code by rewriting the hot spots for more efficient execution on the target machine. The profiling information generated by source-level profiling is mostly coarse-grained. Depending on the coding style, a single line of C code can consist of a number of different operations that map to many assembly instructions. Besides, implicit address computations and memory accesses, pointer scaling, and type casts are not directly visible in high-level languages like C/C++. None of these is captured by source-level instrumentation.

In order to improve the profiling result, fine-grained instrumentation has been proposed. In [99], the SIT toolkit is presented, which performs C-operation level profiling by exploiting the C++ operator overloading capability. However, this approach cannot properly recognize different C operators with similar instruction-level semantics. For example, pointer indirection ("*ptr") as well as array ("[]") and structure access ("->") all map to "LOAD"/"STORE" instructions, but they are different operators in C++. In [78], a tool called micro-profiler is proposed, which instruments the target application at the compiler IR level.
At this level, high-level operations like array element accesses are all lowered to primitive operations like memory loads and stores, so the problems caused by operator overloading are overcome. In fine-grained profiling, application source lines are broken down into small operations and instrumented. Therefore, more code is inserted than in source-level profiling, and hence the instrumented application typically runs slower. Moreover, both source and fine-grained profiling tools are specific to the programming languages supported by the code instrumentation process.

Binary instrumentation tools like [100], [101] and [102] are not programming language specific, because they instrument application binaries dynamically at runtime. However, they are machine specific and difficult to retarget to new processor architectures. In the embedded area, where the diversity of instruction-set architectures is large and the variation in programming languages is small, binary instrumentation is not widely accepted.

For the MAPS framework, which requires an accurate and retargetable solution, the fine-grained IR level profiling approach is used. The MAPS profiling process is similar to the work presented in [78], but differs in several aspects. First, instead of generating the profile information during the execution, profiling is done in two steps, trace generation and post-processing. Second, the architecture model is managed separately from the instrumented application; therefore, the post-processing can flexibly generate the profiling result for different processing elements without instrumenting and executing the application again.

5.2 Profiling Process Overview

Profilers execute the analyzed application in order to collect dynamic information; however, the statistics of interest are not always available directly after the execution. Developers normally take one of two approaches at this point: either calculating the statistics during execution, or generating a trace file and post-processing it. Both have their advantages and disadvantages. The first approach does the calculation at runtime, which influences the application execution speed. If the profile calculation is computationally intensive, e.g. cache profiling, the speed of the application can become unacceptable. Hence, it is suited only for profiling application characteristics which do not require much computational power, like function call profiles. The second approach postpones the computation of the final result by separating the whole procedure into two steps, trace generation and post-processing. During the execution of the application, the dynamic events which are of interest to the profiler are recorded in a trace file. Afterwards, the file is post-processed, which eventually produces the final profiling result. Since the computation of the profile is postponed, this approach is less intrusive than the former one. Some profilers even use both approaches at the same time. For example, the micro-profiler does function profiling at runtime, but generates memory traces for calculating the cache profile in a post-processing step afterwards.

In the MAPS framework, the second approach is employed. The profiling process is carried out in two steps, whose overview is shown in Figure 5.1. The input of the process is the C source code of the application. It is first instrumented and compiled by a special toolchain into a native executable. The execution of the instrumented application generates a trace file recording the events which are important for profiling. The trace file is then post-processed together with the MAPS architecture model to produce the final profiling result.

It can be seen that the profiling result is computed in the post-processing step; hence, the trace generation can be done independently from the target architecture. The architecture information is first used when the computational cost is calculated. This separation of the execution and the profile calculation provides the benefit that profiling an application for different PEs can be done without executing it again. In such cases, the micro-profiler, on the other hand, needs to compile and execute the application multiple times. Since the MAPS framework needs to deal with heterogeneous MPSoCs, the avoidance of multiple application executions makes this approach more favorable.


Figure 5.1: Overview of the MAPS Profiling Process (the C source code is processed by the LANCE frontend and the instrumentation tool, compiled by the host compiler, and executed on the host; the resulting trace file is then post-processed together with the architecture model to produce the profiling results)

5.3 Trace Generation

The trace of an application typically contains records of events which occurred during the execution. The decision of which events should be recorded depends on the interests of the profiler user. Since MAPS uses the runtime information for the estimation of computational cost and the analysis of dynamic data dependences, two kinds of information are recorded: the fine-grained execution history and the memory accesses. The former is realized by keeping a record of the sequence of the executed Basic-Blocks (BBs) [1], and the latter is stored along with the BB trace.

To enable the generation of such traces, a set of tools is needed. In the overview diagram (Figure 5.1), it is shown that the LANCE compiler [92] frontend first takes the C source code as input. An Intermediate Representation (IR) file, which is in the format of 3-Address Code (3-AC), is generated by the frontend. Then, the instrumentation tool reads in the IR, inserts additional code (mainly function calls) at appropriate places and eventually generates a modified version of the application IR. Afterwards, the instrumented IR is compiled by the host compiler into a native executable program. The implementation of the instrumented function calls is provided by the profiling runtime library, which needs to be linked with the application in the compilation process. In the following sub-sections, the whole procedure will be explained in more detail.

5.3.1 LANCE Frontend

It was mentioned in the earlier section that MAPS profiles applications at the compiler IR level, which is more fine-grained and accurate than source level profiling. To do this, a compiler frontend is needed to first transform the input C source code into IR. For this purpose, MAPS uses the frontend of the LANCE compiler framework [92]. The format of the IR used in LANCE is the so-called IR-C code, which is 3-Address Code written in the high-level C language.

[1] Basic-Block is a concept commonly used in compilers for analyzing source code. A basic block is a linear sequence of code with unique entry and exit points [91].


In the IR-C code, all high-level C constructs, such as for/while loops, complex arithmetic expressions, and the implicit address arithmetic for array and structure accesses, are transformed by the LANCE frontend into primitive statements. There are in total five different types of statements defined in the LANCE IR-C: assignment, branch, conditional branch, label and return statements. Functions in the IR-C code directly correspond to their original definitions in the input C code. Each function has a local symbol table, which keeps information about local variables, and a global symbol table is defined in each IR-C file for storing information about global variables. Figure 5.2 gives an example of the IR-C code generated by the LANCE frontend. From the example, it can be seen that the for loop is lowered into several assignment statements with conditional branches; each assignment statement has at most one computational operation, like compare or add.

Figure 5.2: IR-C Example (a C function containing a for loop, shown next to the IR-C code generated for it by the LANCE frontend)
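Since the original listing of Figure 5.2 is only partially recoverable here, the following hand-made sketch illustrates the same kind of lowering; the function, variable and label names are invented for this example, and the exact naming conventions of the LANCE IR-C may differ.

/* Illustrative sketch of IR-C lowering (not the original Figure 5.2):
 * all control flow becomes labels and (conditional) branches, and each
 * assignment carries at most one operation. */

/* original C */
int sum(int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* corresponding IR-C: three-address code that is still valid C */
int sum_ir(int *a, int n)
{
    int s, i, t1, t2, t3;
    int *t4;
    s = 0;                 /* assignment         */
    i = 0;
LL1:
    t1 = i < n;            /* compare            */
    if (!t1) goto LL2;     /* conditional branch */
    t4 = a + i;            /* address arithmetic */
    t2 = *t4;              /* memory load        */
    t3 = s + t2;           /* add                */
    s = t3;
    i = i + 1;
    goto LL1;              /* branch             */
LL2:
    return s;              /* return             */
}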

In comparison to high-level C, the IR-C code is primitive in terms of its realization of control flow structures and its simple statements. Since the IR-C code is still compliant with the C language, it can easily be compiled by a host compiler into a native executable. Besides, the LANCE IR-C writer allows user callback functions during the IR-C generation. These features make it convenient for implementing code instrumentation and native application profiling.

5.3.2 Instrumentation

Based on the IR-C code delivered by the LANCE frontend, the instrumentation procedure adds C code at appropriate places for the profiling purpose. Functions are instrumented to record their entries and exits; BBs are instrumented so that their execution sequence can be kept; and memory accesses are instrumented to generate memory traces for the dynamic data flow analysis in the framework. The inserted code consists mainly of function calls, which write data to the generated trace file and maintain some dynamic information, like the current status of the stack. The implementation of the functions is provided in the profiling runtime library, which needs to be linked with the instrumented application.

The code instrumentor is implemented using the API of the LANCE framework. The IR-C code can be read/written with the API functions. Besides, the framework also provides iterators for functions, BBs and IR statements, which makes the traversal of the whole IR data structure very easy. The pseudo code of the code instrumentor is shown in Section B.1. The entrance of each function is instrumented with a call to the EnterFunction function. Then, a series of initializations is done for both global and local variables. After the per-function instrumentation is done, the code instrumentor iterates through all the IR statements and inserts code before those statements having special semantics. Except for the entries of BBs, which can be looked up from the BB list of the function, the instrumentor needs to check the IR operator inside a statement in order to understand its behavior. Additionally, special attention is paid to dynamic memory allocation functions like malloc or free, so that heap objects can also be recognized at runtime. In the following section, the profiling runtime library will be introduced in detail.
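To make the effect of the instrumentor concrete, the sketch below shows how the sum_ir function from the previous section might look after instrumentation. The runtime functions are those of the profiling runtime library introduced in the next section, but their exact signatures, the BB numbers and the statement IDs are assumptions made for this example.

/* Sketch of an instrumented IR-C function; runtime-call signatures,
 * BB numbers and statement IDs are illustrative assumptions. */
enum access_type { ACC_READ, ACC_WRITE };

/* declarations standing in for the profiling runtime library */
void EnterFunction(const char *name, int caller_stmt, const char *file);
void InitLocalVariable(const char *name, void *addr, unsigned size);
void EnterBasicBlock(int bb_id);
void TraceMemAccess(int stmt_id, const void *addr, enum access_type t);
void ExitFunction(const char *name, const char *file);

int sum_ir(int *a, int n)
{
    int s, i, t1, t2, t3;
    int *t4;
    EnterFunction("sum_ir", 42, "sum.c");    /* function entry       */
    InitLocalVariable("s", &s, sizeof s);    /* dynamic symbol table */
    InitLocalVariable("i", &i, sizeof i);
    EnterBasicBlock(17);                     /* BB execution trace   */
    s = 0;
    i = 0;
LL1:
    EnterBasicBlock(18);
    t1 = i < n;
    if (!t1) goto LL2;
    EnterBasicBlock(19);
    t4 = a + i;
    TraceMemAccess(57, t4, ACC_READ);        /* memory access trace  */
    t2 = *t4;
    t3 = s + t2;
    s = t3;
    i = i + 1;
    goto LL1;
LL2:
    EnterBasicBlock(20);
    ExitFunction("sum_ir", "sum.c");         /* function exit        */
    return s;
}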

5.3.3 Profiling Runtime Library

Since the goal of executing the instrumented application is to record dynamic information in a trace file, no complex statistics need to be collected at runtime, and hence the implementation of the profiling runtime library can be kept relatively simple and straightforward. Most functions in the profiling runtime library just write a record into the trace file for their corresponding events. For instance, upon the entry of a BB, the EnterBasicBlock function writes the ID of the BB into the trace file.

Nevertheless, the generation of the memory access trace is not as trivial as that of the BBs, because the recorded information is much more than an address. When an application executes, the addresses of variables can be quite dynamic. This is especially true for local variables: when a function is called, a block of memory is allocated on the stack for its local variables, register spills [2], etc. After the function has finished, the memory block is released and reused for other functions, which implies that accessing the same memory address does not mean accessing the same variable. If the profiled application uses malloc and free intensively, one memory block can be used multiple times during the execution for different purposes, and it is difficult to deduce the accessed variable from an address alone. Therefore, memory addresses are first disambiguated at runtime before they are written into the trace file.

In order to be able to check which memory object an address points to, a dynamic symbol table is needed, which keeps the addresses of all variables at runtime. This is achieved through the variable initialization functions inserted at the beginning of each function. Addresses of local variables are written to the table each time a function is called; when the function exits, the addresses of its local variables are removed. Entries for global variables in the table are permanent; they are created at the beginning of the main function. Besides, when the application uses heap memory through malloc or free, the corresponding addresses are inserted into or removed from the table accordingly.

[2] Memory locations for saving registers temporarily.
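A dynamic symbol table of this kind can be sketched as follows; the data layout and function names are assumptions made for illustration, and the actual runtime library may organize it differently.

/* Sketch of a dynamic symbol table for runtime address disambiguation;
 * structure and function names are illustrative assumptions. */
#include <stdlib.h>
#include <string.h>

struct dyn_sym {
    char   name[32];       /* variable name                      */
    void  *addr;           /* current start address              */
    size_t size;           /* object size in bytes               */
    int    instance_id;    /* call instance for local variables  */
    struct dyn_sym *next;
};

static struct dyn_sym *symtab;   /* linked list; a real implementation
                                    would use a faster structure     */

static void sym_add(const char *name, void *addr, size_t size, int inst)
{
    struct dyn_sym *s = malloc(sizeof *s);
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->addr = addr; s->size = size; s->instance_id = inst;
    s->next = symtab; symtab = s;
}

/* map a raw address back to the variable it currently belongs to */
static struct dyn_sym *sym_lookup(const void *p)
{
    for (struct dyn_sym *s = symtab; s; s = s->next)
        if (p >= s->addr && (char *)p < (char *)s->addr + s->size)
            return s;
    return NULL;   /* unknown address */
}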


Function             Description
EnterFunction        Initialize the function stack information and write the function entry to the trace
InitGlobalVariable   Add a global variable to the dynamic symbol table
InitLocalVariable    Add a local variable to the dynamic symbol table
EnterBasicBlock      Write the ID of the basic block to the trace
TraceMemAccess       Look up the accessed address and write the access information to the trace
PrepareFunctionCall  Store the calling statement ID to a variable
TraceMalloc          Add an entry for the allocated memory object to the dynamic symbol table
TraceCalloc          Add an entry for the allocated memory object to the dynamic symbol table
TraceFree            Remove the entry of the released memory block from the dynamic symbol table
ExitFunction         Clean up the stack information of the exited function, remove the entries of its
                     local variables from the dynamic symbol table, and write the function exit to the trace

Table 5.1: Summary of Profiling Runtime Functions

The functions in the profiling runtime library perform the management of the dynamic symbol table and the generation of the trace file. A brief summary of the functions is given in Table 5.1. With these functions, the execution of the profiled application is tracked with fine-grained detail in the trace file, as will be explained in the next section.
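As a minimal sketch, two of the functions from Table 5.1 might emit their trace records as follows, using the line formats described in the next section; the trace_fp handle and the exact formatting details are assumptions.

/* Sketch of trace emission for two runtime functions of Table 5.1. */
#include <stdio.h>

static FILE *trace_fp;   /* opened once at program start */

void EnterBasicBlock(int bb_id)
{
    /* a BB entry line consists of the basic-block ID only */
    fprintf(trace_fp, "%d\n", bb_id);
}

void ExitFunction(const char *name, const char *file)
{
    /* clean-up of the stack info and of the local-variable entries in
       the dynamic symbol table would happen here (omitted), then the
       exit record is written */
    fprintf(trace_fp, "exit:%s:%s\n", name, file);
}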

5.3.4 Trace File

The generated trace file contains three levels of information: function, basic-block and memory access. Together they give a detailed view of the execution of the profiled application. Figure 5.3 shows an excerpt taken from the trace file generated for a JPEG encoder application. In the text-format trace file, there are five types of trace lines:

Figure 5.3: Example Trace File (an excerpt from a JPEG encoder trace: enter function main; enter BB 240; enter function initializeWorkspace; read global variable imageQuality; write local variable huffsize; exit function initializeWorkspace; exit function main)

• Enter Function: a function entry line records the entrance of a function together with its name, the ID of the calling IR statement, and the source file where the function definition is located. The format of the line is:


s:<caller stmt ID>:enter:<function name>:<source file>

• Exit Function: a function exit line records the return of a function together with its name and source location. The format of the line is:

exit:<function name>:<source file>

• Enter BB: the BB entry line simply contains the ID of the entered BB.

• Global Variable Access: a global variable trace line gives information about the statement ID where the access occurs, the name of the variable, the access type (Read or Write), the accessed data type, and the offset of the access. The format of the line is:

m:<stmt ID>|<def stmt ID>|g|<variable name>|<access type>|<data type ID>|<offset>

Note that, since the LANCE IR-C is used, the accessed data type is recorded using the type ID defined in the LANCE compiler framework. Besides, the ID of the last IR statement which defines the variable is also written into the trace file to ease the dynamic dependence analysis.

• Local Variable Access: a local variable trace line is very similar to that of a global variable, but with several additional items: the name of the function defining the variable and its instance ID. Since a function can be invoked multiple times and recursively, i.e. there can be multiple instances of the same local variable alive at the same time, the instance ID is used to identify the accessed local variable exactly. Each time a function is entered, the total number of calls to this function so far is used as the instance ID and kept in its stack information. The format of the generated trace line is:

m:<stmt ID>|<def stmt ID>|l|<variable name>|<access type>|<data type ID>|<offset>|<function name>|<instance ID>

The information kept in the trace file is detailed. From the function enter/exit information, the dynamic call graph of the profiled application can easily be created. Besides, based on the complete execution sequence of the basic-block IDs, it is possible to compute statistics on the executed IR statements, as done in the post-processing step.

5.4 Post-Processing

The trace generated by the execution of the instrumented application provides dynamic information about it, but in a raw format. The execution history is kept as a sequence of BB IDs with marks of function entries and exits, and memory accesses are recorded as individual trace lines. Since data in such a form is indigestible for the user of MAPS, the trace file must be post-processed in order to extract useful information.

An overview of the post-processing procedure is given in Figure 5.4. First, the trace file is analyzed together with the IR of the application source code and the target architecture model. The result of the analysis is an IR profile, which contains fine-grained profiling information about each IR statement.

62

5.4. POST-PROCESSING 63

Ò

Ñ

Ï

Ó Ó

ÍÌ ÔÊÍ ÊÌ

Ð Ð

Ý

Ë

Õ Ï Ï

Ó Call Graph

ÍÖÌÊ × ØÙ

Ò

Þ

ÊË×

Ð ÎÏ

Í Generator

Ú

É

ÊËÌÍ ØÔÊ ÌÍ

IR-Level ß à Source Profile

Ü Ð Ð

Ï ÎÏ Û

ÊØ Í Í

Ü Ð

Ï Û Í Analysis ÊØ Generator

ß à Other Internal Analysis

Figure 5.4: Overview of Post-Processing Procedure statement. Based on the information at a fine-grained IR-level, the dynamic call graph and the source profile of the application are then generated and presented to the user. The IR profile is used for not only the generation of source level profile, but also other internal analysis which needs dynamic information, like the dynamic data flow analysis to be introduced in Chapter 6. The rest of this section mainly focuses on the creation of the IR profile and its use for the call graph and the source profile generation.

5.4.1 IR Profile Generation

The IR profile generated by MAPS contains dynamic information about the application, fine-grained at the level of compiler IR statements. Since the input trace file only provides information at the basic-block and function level, it is necessary to have the original IR of the application as an additional input. Besides, a description of the target MPSoC architecture, in the form of the MAPS architecture model, is also used as input so that the characteristics of the target platform can be taken into account.

With the given inputs, the generator first computes the intrinsic cost of each IR statement of the application. The term intrinsic means that the cost is caused by the statement itself. In case a call statement invokes another function, its intrinsic cost does not include the cost of the invoked function, which can only be determined with dynamic information. Before the definition of the statement intrinsic cost is given, the IR of an application is first defined as:

5.1. Definition (Application Intermediate Representation). The intermediate representation of an application is a triplet, IR = (S, B, F), where:

• S is a set of the IR statements;

• B is a set of the basic-blocks; and

• F is a set of the functions.

Besides, the set of processing elements in the target MPSoC platform is defined here as P , which can be known from the architecture model.


5.2. Definition (Statement Intrinsic Cost). StmtIntrinsicCost(s,p) is the estimated cost of executing s once on p, where:

• s ∈ S; if s is a call statement, the cost of the callee function is excluded; and

• p ∈ P.

The application IR used here is the LANCE IR, which has already been used in Section 5.3 for code instrumentation. Since a LANCE IR statement is in the form of three-address code, and one statement contains at most one operator, the calculation of StmtIntrinsicCost(s,p) can be easily done by first checking the type of the operator and the operand data type, and then looking up the cost value in the corresponding operation cost table (see Section 4.3) of the input architecture model. The process is very straightforward. Based on the statement intrinsic cost, the BB intrinsic cost is computed next, which is defined as:
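To make the lookup concrete, the following C sketch shows the idea; the enum values and the table layout are illustrative assumptions, not the actual MAPS data structures:

    /* One operation cost table per processing element (cf. Section 4.3),
     * indexed by the statement's single operator and its operand type. */
    typedef enum { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_CALL, OP_COUNT } Operator;
    typedef enum { TY_CHAR, TY_INT, TY_FLOAT, TY_COUNT } TypeId;

    typedef struct {
        int cost[OP_COUNT][TY_COUNT];   /* operation cost table of one PE */
    } PEModel;

    typedef struct {
        Operator op;    /* at most one operator per three-address statement */
        TypeId   type;  /* operand data type                                */
    } IRStmt;

    int StmtIntrinsicCost(const IRStmt *s, const PEModel *p)
    {
        /* intrinsic cost only: for OP_CALL the callee cost is excluded */
        return p->cost[s->op][s->type];
    }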

5.3. Definition (BB Intrinsic Cost). BBIntrinsicCost(b,p) is the estimated cost of executing b once on p, where:

• b ∈ B; if any functions are invoked in b, their cost is excluded; and

• p ∈ P.

Since a BB is simply composed of a sequence of IR statements with at most one control-flow-changing statement at the end, BBIntrinsicCost(b,p) can be calculated by adding up the statement intrinsic costs, as presented in the following equation:

$BBIntrinsicCost(b,p) = \sum_{\forall s \in b} StmtIntrinsicCost(s,p)$   (5.1)

Note that BBs are numbered with unique IDs at the beginning of the post-processing, using the same numbering scheme as the one used in the code instrumentation for trace generation. Therefore, by counting the BB IDs in the trace file and checking the function call boundaries, which are marked with the enter and exit trace lines, the execution count of a BB in each function call can be obtained. Since a function can be invoked from different places in the application and behave differently, it is necessary to use the IR statement which invokes the function, i.e. the call site, to distinguish different instances of function calls.
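A possible way to implement this counting is sketched below; the event encoding is an assumption made for illustration, not the actual MAPS trace interface:

    #define NUM_BBS   1024  /* illustrative bounds */
    #define NUM_SITES 1024
    #define MAX_DEPTH 1024

    typedef enum { EV_ENTER, EV_EXIT, EV_BB } EventKind;
    typedef struct { EventKind kind; int callSite; int bbId; } Event;

    /* count[b][c]: executions of BB b while its parent function was
     * active after being entered from call site c (the parent function
     * f is implied by c, so BBExecutionCount(b,f,c) = count[b][c]). */
    void CountBBExecutions(const Event *ev, long n,
                           long count[NUM_BBS][NUM_SITES])
    {
        int stack[MAX_DEPTH];     /* call-site stack mirroring enter/exit */
        int top = -1;
        for (long i = 0; i < n; i++) {
            switch (ev[i].kind) {
            case EV_ENTER: stack[++top] = ev[i].callSite; break;
            case EV_EXIT:  top--;                         break;
            case EV_BB:
                if (top >= 0) count[ev[i].bbId][stack[top]]++;
                break;
            }
        }
    }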

5.4. Definition (BB Execution Count). BBExecutionCount(b,f,c) is the number of times that b is executed when its parent function, f, is invoked by c, where:

• c ∈ S, c is the caller IR statement;

• b ∈ B; and

• f ∈ F.

The above definition of the BB execution count does not differentiate between multiple calls to the same function made by the same IR statement when the application runs, although the invoked function can behave differently each time. In the context of MAPS, this is not required, because the IR profile aims to provide overall statistics on the computational cost contribution of each IR statement in the application. Multiple function calls from one location are therefore counted together. Next, the generator computes the self cost of functions.


5.5. Definition (Function Self Cost). FunctionSelfCost(f,c,p) is the estimated cost of executing f on p, when f is invoked by c, where:

• f ∈ F , if functions are invoked from f, their cost is excluded;

• c ∈ S, c is the caller IR statement; and

• p ∈ P .

Given BBExecutionCount(b,f,c) and BBIntrinsicCost(b,p), FunctionSelfCost(f,c,p) can be calculated by the following equation:

$FunctionSelfCost(f,c,p) = \sum_{\forall b \in f} BBExecutionCount(b,f,c) \cdot BBIntrinsicCost(b,p)$   (5.2)

In contrast to the function self cost, the function call cost, FunctionCallCost(f,c,p), takes the cost of the callee functions into account.

5.6. Definition (Function Call Cost). FunctionCallCost(f,c,p) is the estimated cost of executing f and its callees on p, when f is invoked by c, where:

• c ∈ S, c is the caller IR statement;

• f ∈ F , and p ∈ P .

From the definition, it can be seen that FunctionCallCost(f,c,p) = FunctionSelfCost(f,c,p) when f does not invoke any subroutine, which is a special case. In general, the calculation of FunctionCallCost(f,c,p) must inspect each IR call statement in the function and calculate the callee function cost. Section B.2 shows the pseudo code which computes the function call cost. For a statement c which calls f, FunctionCallCost(f,c,p) is the cost which the statement introduces to its parent function. Together with the intrinsic cost of c and its execution count, which can be deduced from the execution count of its parent BB, it is possible to define the overall execution cost of IR statements.
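The recursion behind this calculation can be sketched as follows; the data structures and helper functions are illustrative stand-ins for the pseudo code in Section B.2, and recursion in the profiled application itself would require extra cycle handling, which is omitted here:

    typedef struct PE PE;
    typedef struct Stmt Stmt;
    typedef struct Function Function;

    struct Stmt     { Function *callee; Stmt *nextCall; };  /* a call site    */
    struct Function { Stmt *firstCall; };                   /* its call sites */

    long FunctionSelfCost(Function *f, Stmt *c, PE *p);  /* Equation (5.2)    */
    long StmtExecCount(Stmt *s, Stmt *c);  /* deduced from BB exec counts     */

    long FunctionCallCost(Function *f, Stmt *c, PE *p)
    {
        long cost = FunctionSelfCost(f, c, p);
        /* add each callee's call cost, weighted by how often its call
         * site executed while f was entered via c */
        for (Stmt *s = f->firstCall; s != NULL; s = s->nextCall)
            cost += StmtExecCount(s, c) * FunctionCallCost(s->callee, s, p);
        return cost;
    }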

5.7. Definition (Statement Execution Cost). StmtExecutionCost(s,p) is the estimated cost of executing s and the function it may call on p, where s ∈ S and p ∈ P .

StmtExecutionCost(s,p) characterizes the overall contribution of s to the whole application: no matter where its parent function is invoked, the cost is counted together. To get the overall execution count of an IR statement s, it is necessary to sum its parent BB's execution count, BBExecutionCount(b,f,c), over all call sites from which the parent function is invoked. The pseudo code of the general procedure which calculates the execution cost of IR statements is presented in Section B.3. Similar to the statement execution cost, which does not differentiate between function call sites, the overall estimated execution cost of each function is also calculated.


5.8. Definition (Function Total Cost). FunctionTotalCost(f,p) is the estimated cost of executing f and the functions it may call on p, where f ∈ F and p ∈ P.

Since the cost for each call site of f is already given as FunctionCallCost(f,c,p), the computation of FunctionTotalCost(f,p) can be easily done by adding the FunctionCallCost(f,c,p) values together, as shown in Equation 5.3, where c → f means that c invokes f.

$FunctionTotalCost(f,p) = \sum_{\forall c \to f} FunctionCallCost(f,c,p)$   (5.3)

The IR profile generated in this step provides detailed dynamic information about the application at different levels, including IR statement, BB and function. For a programmer, this large amount of raw numbers can be too detailed to be digested. Therefore, further post-processing is needed to generate information which is readable for the user of MAPS.

5.4.2 Source Profile Generation

Based on the statistical information in the IR profile, a source-level profile is created in the post-processing procedure, which tells the programmer the estimated execution cost of each line of C code in the application. Here, the line execution cost is defined as:

5.9. Definition (Line Execution Cost). LineExecutionCost(l,p) is the estimated cost of executing the source line l and the functions invoked in l on p, where:

• l ∈ L, where L is the set of all source lines in the application; and

• p ∈ P.

With the previously generated IR profile providing the IR statement execution cost information, i.e. StmtExecutionCost(s,p), the overall execution cost of one line of C code can be computed as the sum of the costs of all related IR statements, as shown in Equation 5.4, where s ∈ l means that the IR statement s is generated from source line l. This connection between IR statement and source location is typically created by compiler frontends for debugging purposes. The source profile generator of MAPS utilizes the debug information available in the LANCE IR.

$LineExecutionCost(l,p) = \sum_{\forall s \in l} StmtExecutionCost(s,p)$   (5.4)
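In code, the folding of Equation 5.4 amounts to a single pass over the statements of the IR profile; the field names below are illustrative, not the actual MAPS data layout:

    /* Each profiled statement carries its source line (from the LANCE
     * debug information) and its overall execution cost on a given PE. */
    typedef struct { int srcLine; long execCost; } StmtProfile;

    void ComputeLineCosts(const StmtProfile *sp, int nStmts, long lineCost[])
    {
        for (int i = 0; i < nStmts; i++)
            lineCost[sp[i].srcLine] += sp[i].execCost;  /* sum over s in l */
    }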

The final line execution cost information is stored in an XML file, which is visualized by the MAPS IDE for the programmer. Figure 5.5 shows part of a source profile XML file, which is obtained by profiling a JPEG application. From the example, it can be seen that the relative percentage of the execution cost of each line of code is also calculated by the generator and stored along with the estimated cost. The complete specification of the profile XML file is given in Appendix C. Figure 5.6 shows a screenshot of the MAPS IDE visualizing the example of Figure 5.5 together with the application C source code. The source profile information is displayed beside the corresponding line numbers.

<profile:Profile xmlns:profile="profile"> ... </profile:Profile>

Figure 5.5: Source Profile Example

From this view, the programmer can directly see the estimated cost of the code and thereby quickly find the hot spots. In the example, it can be seen that line 1004 is the most computation-intensive part of the code. Note that the source profile generated by MAPS also takes the callee cost into account, which is handled differently by the micro-profiler [78]: the micro-profiler does not consider the cost of the functions invoked by a line of C code, so lines 1003 and 1004 would be considered to have the same execution cost. The cost information from the source profile tells the programmer how much computation is required behind each line of code, and it also provides an overview of the distribution of the computation cost within a function.

Figure 5.6: Visualized Source Profile


5.4.3 Call Graph Generation

In addition to the source profile, the post-processing of the trace file also produces a dynamic call graph for the MAPS user, which describes the exact calling relationships between functions during the execution. Moreover, based on the cost information provided by the IR profile, the generator annotates the estimated cost of the functions to their corresponding nodes in the call graph. Using this information, the programmer can quickly identify the computation-intensive functions for further optimization or parallelization.

5.10. Definition (Dynamic Call Graph). A dynamic call graph is a weighted directed graph G = (F, E, p) with

• F: the node set, in which each element, f ∈ F, represents a function and is labeled with three values, W_count(f), W_self(f) and W_total(f);

• W_count(f): the number of times that function f is called;

• W_self(f): the overall self cost of function f;

• W_total(f): the total execution cost of function f;

• E: the set of directed edges; each edge, e ∈ E : e = (x,y), represents that function x calls y;

• W_e(e), e = (x,y): the edge weight, which represents the number of times that x has called y; and

• p ∈ P: the target processing element.

Note that the call graph created by MAPS is context-insensitive, i.e., if a callee function is invoked by several caller functions, only one node is instantiated in the graph for the callee. The value of W_self(f) takes all call sites of the callee into account and unveils how much computation is performed locally inside the function in total. The function self cost, FunctionSelfCost(f,c,p), which is directly provided by the IR profile data, however, differentiates between call sites; the value of W_self(f) can be easily computed from FunctionSelfCost(f,c,p). The pseudo code for generating the call graph from the trace is given in Section B.4. After the generation is finished, the resulting call graph is stored in the form of an XML file, which is later visualized by the MAPS IDE. Figure 5.7 shows a piece of the call graph XML file of the same JPEG application used in Figure 5.5. The Function element in the example tells that there is a function BLK8x8 in the application, which is executed 2850 times. Its estimated self cost is 85500, and its total cost is 68768716, which corresponds to 57% of the overall estimated cost. The big difference between the self and the total cost indicates that the function itself is small, but the function(s) it calls consume a lot of time. In the example call graph, the Call element denotes the caller/callee relation between two functions, and it can be seen that the DCTcore function has been invoked 2850 times by the BLK8x8 function during the execution. Additionally, the LANCE IR provides information about the source code location of the functions, and this information is also stored in the XML file. The location information is used by the MAPS IDE to quickly navigate the source code editor to the function which is selected by the user in the call graph viewer. Figure 5.8 shows part of the visualized call graph of the above example, from which the relationship between the function BLK8x8 and the other functions can be easily seen.


... BLK8x8 DCTcore ...

Figure 5.7: Call Graph Example

It can be seen that the function BLK8x8 is invoked only by the function JPEGtop and itself invokes several other functions besides DCTcore. For convenience, the cost information is also displayed in the graphical view for the user.

Figure 5.8: Visualized Call Graph

From the dynamic call graph provided by MAPS, the user can quickly obtain an overview of the functions, unveiling their calling relationships and estimated costs. For MPSoC software development, especially parallelization, the most time-consuming or computation-intensive functions are typically the focus of development.

5.5 Summary

This chapter introduces the profiling method employed by the MAPS framework, which involves a trace generation and a post-processing procedure. Through native execution, the former generates a trace file containing not only the application execution history but also high-level memory access information. The post-processing then refines the raw information and provides a source-level profile and a call graph to the MAPS user to help analyze the application. Internally, an IR profile is created for generating the source-level information and for use in other parts of the framework, e.g. the control/data flow analysis process introduced in Chapter 6.

Chapter 6 Control/Data Flow Analysis

In traditional C compilers, control/data flow analysis is a standard component, which is responsible for the construction of a global “understanding” of the program, so that code optimization techniques like loop fission [103] can be implemented. Similarly, the control/data flow analysis in MAPS also analyzes the input C source code in terms of its internal control and data dependences, and the result is used by the partitioning tool for parallelizing the application. Nevertheless, while normal compilers do the analysis statically, MAPS employs both static and dynamic approaches. Due to the nature of the C programming language, especially the use of pointers, it is impossible to realize a 100% accurate static control/data flow analysis [23]. Therefore, runtime information is used by MAPS in parallel to static analysis to find the internal dependences, including those caused by pointers. Thereby, the framework is able to get a better view into the target application for the later parallelization work. The goal of the analysis introduced in this chapter is to create a so-called Weighted Statement Control/Data Flow Graph (WSCDFG), which describes the target application from several aspects. First, a WSCDFG is a flow graph whose edges represent the control and data dependences in the application. Second, the nodes and edges of the WSCDFG are annotated with values derived from the profiling information, so that the application runtime behavior is also captured. Moreover, the graph is constructed at the IR statement level, which implies that the information contained in the WSCDFG is very fine-grained. The static analysis used in this work is based on the existing facilities of the LANCE compiler framework. Since techniques for static analysis have already been well discussed in the literature, this chapter focuses on the dynamic part of the analysis process. The rest of this chapter is organized as follows: Section 6.1 first briefly discusses the traditional Control Flow Graph (CFG), which is commonly used in most compilers; Section 6.2 then explains the concept of the Statement Control Flow Graph (SCFG), which is in principle a fine-grained CFG; the Statement Control/Data Flow Graph (SCDFG) is a step further from the SCFG, and it is introduced in Section 6.3; the definition and construction of the WSCDFG is eventually discussed in detail in Section 6.4. Finally, Section 6.5 briefly summarizes the whole chapter.


6.1 Control Flow Graph

The basic concept of the WSCDFG is very similar to that of the control flow graph, which is used in most compilers for describing the flow-of-control information of a program. For each function, the nodes of its control flow graph are basic blocks (BBs), and the edges represent the control flow between the BBs.

6.1. Definition (Control Flow Graph). A control flow graph of a function, f, is a directed graph G = (V, E), where

• V is a set of nodes, in which each node v ∈ V , represents a basic block;

• E ⊆ V × V is a set of directed edges, in which each edge e = (v,v′) ∈ E, represents a potential control flow between v and v′, which indicates that v′ can be executed directly after v;

• v′ is a successor of v, and v is a predecessor of v′;

• ∃!entry ∈ V , entry does not have a predecessor;

Moreover, the set of all successors of a BB v is denoted as Succ(v), and the set of all predecessors of v is denoted as Pred(v), with:

• Succ(v) = { v′ ∈ V |(v,v′) ∈ E}

• Pred(v) = { v′ ∈ V |(v′,v) ∈ E}

An example program is shown in Figure 6.1 a), with the LANCE three-address IR and the CFG of its main function displayed in Figure 6.1 b) and c), respectively. The structure of the function is very simple: there is only one if-then-else statement. The corresponding CFG has four BBs — entry, bb1, bb2 and exit — and four edges: (entry, bb1), (entry, bb2), (bb1, exit) and (bb2, exit). Note that, although there is no IR statement at the end of entry which jumps to bb1, the edge (entry, bb1) still exists, because BBs are linearly arranged in the function and bb1 directly follows entry in the three-address IR. If the jump condition of the if statement fails, bb1 is executed directly after entry by default, which realizes the control flow from entry to bb1. From the example, it can be seen that CFGs describe the control flow of the program on a basic-block basis. Since BBs are defined according to the control flow, their size is highly variable. The IR statement sequence of a BB can be arbitrarily long, as long as there is no branch in between. Therefore, no matter how different the sizes of BBs are, in CFGs they are considered equal. For instance, bb1 in the example contains only one computational statement, but bb2 contains calls to two functions which are implemented externally. Nonetheless, they are seen as equal blocks in the CFG. For the development of MPSoC software, especially parallelization, considering all BBs equal hides the behavior of the IR statements and prevents analysis inside the BBs and further intra-BB optimizations. In order to overcome this limitation, BBs are first broken down into IR statements in the MAPS framework for analysis. The graph which describes the flow-of-control information at the IR statement level is called Statement Control Flow Graph, in short SCFG.
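A minimal C program of the shape described — a then-branch holding a single computational statement and an else-branch calling two externally implemented functions — could look like the following sketch (the concrete code of the Figure 6.1 example is an assumption here):

    extern void f1(void);   /* externally implemented */
    extern void f2(void);

    int main(void)
    {
        int x = 1, y = 0;
        if (x) {            /* entry: ends with the conditional jump */
            y = x + 1;      /* bb1: a single computational statement */
        } else {
            f1();           /* bb2: calls to two external functions  */
            f2();
        }
        return y;           /* exit                                  */
    }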

a) C code   b) Three-address IR   c) Control Flow Graph

Figure 6.1: Control Flow Graph Example

6.2 Statement Control Flow Graph

The definition of the SCFG is similar to that of the CFG. The main difference is that the nodes of SCFGs are IR statements instead of BBs as in a CFG.

6.2. Definition (Statement Control Flow Graph). A statement control flow graph of a function, f, is a directed graph G = (S, E) with

• S is a set of nodes, in which each node s ∈ S represents an IR statement;

• E ⊆ S × S is a set of directed edges, in which each edge e = (s,s′) ∈ E represents a potential control flow from s to s′, which means that s′ can be executed directly after s;

• s′ is a successor of s, and s is a predecessor of s′; and

• ∃! entry ∈ S, entry does not have a predecessor.

In principle, SCFGs can be seen as fine-grained CFGs, in which BBs are replaced by sequences of IR statements. Figure 6.2 shows the SCFG of the example from Figure 6.1. Note that, although labels do not contain any computational operation, they are kept as targets of branch statements. From the marked BB boundaries, it can be seen that the overall structure of the example SCFG is very similar to that of the CFG in Figure 6.1 c). Since there is no branch statement in the middle of a BB, the IR statements belonging to the same BB are simply connected by a sequence of control flow edges. Due to the similarity between SCFG and CFG, SCFGs can easily be constructed from CFGs. The control flow analysis facility of the LANCE compiler is reused in the MAPS framework to generate CFGs first; based on the LANCE CFGs, MAPS then constructs the required SCFGs. Section D.1 shows the pseudo code which constructs an SCFG from a LANCE CFG. In SCFGs, BBs are broken down into statements. However, SCFGs carry only control flow information; data dependence information is not contained. Therefore, based on SCFGs, MAPS constructs so-called Statement Control/Data Flow Graphs (SCDFGs) to carry both control and data dependence information at the same time.


Figure 6.2: Statement Control Flow Graph Example
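The construction can be outlined in a few lines of C; the BB representation and the addEdge callback below are illustrative stand-ins for the LANCE interface used by the pseudo code in Section D.1:

    typedef struct Stmt Stmt;
    typedef struct BB BB;
    struct BB {
        Stmt **stmts;   /* statements of the BB in program order */
        int    nStmts;
        BB   **succs;   /* successor BBs taken from the CFG      */
        int    nSuccs;
    };

    void BuildSCFG(BB **bbs, int nBBs, void (*addEdge)(Stmt *, Stmt *))
    {
        for (int i = 0; i < nBBs; i++) {
            BB *b = bbs[i];
            /* chain the statements inside the BB */
            for (int j = 0; j + 1 < b->nStmts; j++)
                addEdge(b->stmts[j], b->stmts[j + 1]);
            /* connect the last statement to each successor's first one */
            for (int k = 0; k < b->nSuccs; k++)
                addEdge(b->stmts[b->nStmts - 1], b->succs[k]->stmts[0]);
        }
    }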

6.3 Statement Control/Data Flow Graph

In SCDFGs, the edges are categorized into two classes: control edges and data edges. The control edges are completely inherited from SCFGs and represent the control flow of the program. The data dependence information is represented by the data edges.

6.3. Definition (Statement Control/Data Flow Graph). A statement control/data flow graph of a function, f, is an extended SCFG G = (S, Ec, Ed) with

• S is the node set;

• Ec is the control edge set;

• Ed ⊆ S × S is a set of data edges; each edge, ed = (s,s′) ∈ Ed, represents a data flow from s to s′, i.e. the execution of s′ will read data which is written by the execution of s; and

• for a data edge ed = (s,s′), s is the producer, and s′ is the consumer.

Note that the above SCDFG definition only considers the read-after-write (RAW) dependency [91], which shows how data is produced and consumed by statements. There are two other kinds of data dependencies in computer programs, namely the write-after-read (WAR) and the write-after-write (WAW) dependency. In principle, the latter two indicate how data is overwritten in the program. For partitioning an application, the existence of a RAW dependence requires the data to be transferred from the producer to the consumer, but overwriting old data does not. Therefore, in MAPS, where the control/data flow analysis is used for partitioning, only the RAW dependency is considered. Besides, if a statement calls a function which implicitly modifies some data, such a change of data is often referred to as a side effect in the compiler literature, e.g. [104]. In the above definition of the data edge, the execution of s includes the behavior of the IR statement itself and of the function it might invoke, i.e., the side effect of s must be taken into account.

a) C Code   b) Statement Control/Data Flow Graph

Figure 6.3: Statement Control/Data Flow Graph Example

Figure 6.3 b) presents the SCDFG of a simple program which contains such an implicit data dependence. The dotted lines in the graph are data edges, labeled with the name of the variable which causes the dependency. It can be seen in the example that, although s2 and s4 do not read or write any variable directly, the functions they call access a global variable g. This dependence results in a data edge (s2, s4) in the SCDFG. In comparison, the data edge (s1, s3) is caused by a local variable a accessed directly in the IR statements. Since an SCDFG can be seen as an SCFG plus some additional data edges, its construction can be done in a straightforward way if an SCFG is given. Section D.2 gives the pseudo code which constructs an SCDFG from an SCFG. The code relies on a function, GetProducer(s), to retrieve the data producers of a statement s. The implementation of this function requires dependence analysis, which is done here using both static and dynamic methods.
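A program of the kind depicted in Figure 6.3 can be sketched as follows; only the roles of s1–s4 and the shared global g are taken from the text, while the concrete code is assumed:

    int g;                         /* global accessed only inside the callees */

    void set_g(void) { g = 1; }    /* hidden write to g  */
    int  get_g(void) { return g; } /* hidden read from g */

    int main(void)
    {
        int a = 0;                 /* s1: writes local a                      */
        set_g();                   /* s2: implicitly writes g                 */
        int b = a;                 /* s3: reads a -> data edge (s1, s3)       */
        int c = get_g();           /* s4: implicitly reads g -> edge (s2, s4) */
        return b + c;
    }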

6.3.1 Data Dependence Analysis

As mentioned at the beginning of this chapter, MAPS reuses the static data flow analysis facility of the LANCE compiler framework. Two functions are provided by the LANCE data flow analysis:

• GetStatementUse(s) returns the set of variables which are used by the statement s; and

• GetDefines(v) returns the set of IR statements which define the variable v.

Based on the two functions provided by LANCE, the static part of the dependence analysis is implemented as a function GetStaticProducer(s), whose pseudo code is shown in Section D.3.1. Nevertheless, the data flow analysis implemented in the LANCE compiler is limited in that it only analyzes dependences among local scalar variables. For global variables and variables of aggregate data types like structures and arrays, not enough detail is provided [105]. Therefore, dynamic analysis is used here in parallel to the LANCE analysis to find data dependences for all non-scalar and global variables. The dynamic data dependence analysis uses the memory access trace which is generated in the profiling process. The raw format of the generated memory trace lines has already been introduced in Chapter 5.

In principle, from the trace lines, all details about a memory access are known, including the ID of the accessing IR statement, the type of the access, the target variable, etc. During the generation of the source profile and the dynamic call graph, these memory trace lines are simply ignored, because they are not needed there. In the control/data flow analysis, however, they are used for the extraction of the dynamic data dependence information. A function GetDynamicProducer(s) is constructed to retrieve the producer IR statements which are found in the memory traces. Section D.3.2 shows the pseudo code of the GetDynamicProducer(s) function.

Although the SCDFG is capable of carrying both control and data flow information, its nodes and edges do not reflect the amount of computation and data dependences which take place during the execution of the program. For instance, a statement inside a loop can be executed thousands of times, while a statement at the beginning of a function might run only once in the actual execution. In SCDFGs, the two are considered to be the same, which can be misleading. Hence, Weighted Statement Control/Data Flow Graphs (WSCDFGs) are introduced to incorporate profiling information, so that a complete “understanding” of the program can be obtained.

6.4 Weighted Statement Control/Data Flow Graph

Generally speaking, WSCDFGs can be seen as annotated SCDFGs, whose nodes and edges are all annotated with values carrying specific meanings. Compared to the previously introduced graphs, WSCDFGs contain the most information that can be obtained from the analysis and are hence used by MAPS as the analysis output format, which is further used in parallelization.

6.4. Definition (Weighted Statement Control/Data Flow Graph). A weighted statement control/data flow graph of a function, f, is an annotated statement control/data flow graph, G = (S, Ec, Ed, Ws, Wc, Wd), where

• S, Ec and Ed are the sets of nodes, control edges and data edges of the SCDFG;

• Ws(s) is the weight of the IR statement s;

• Wc(ec) is the weight of the control edge ec = (s,s′); and

• Wd(ed) is the weight of the data edge ed = (s,s′).

As can be seen from the definition, WSCDFGs are closely related to SCDFGs; they differ only in the annotated node and edge weights. Therefore, the construction of WSCDFGs can easily be done on the basis of SCDFGs. The detailed pseudo code which constructs a WSCDFG from a given SCDFG is presented in Section D.4. Three annotation functions, GetStatementWeight(s), GetControlEdgeWeight(ec) and GetDataEdgeWeight(ed), are used to get the weight values of the nodes and edges. Theoretically, the weight values can be calculated according to any arbitrary measure. Nonetheless, in the MAPS framework, the node weight values are related to the computational cost of the IR statements, and the edge weight values are computed according to the amount of control/data dependences that occur during the execution. Figure 6.4 shows a WSCDFG which is constructed from the example of Figure 6.3 b). The edge weight values are labeled directly beside the edges, and the node weight values are listed on the right-hand side of the diagram.

Figure 6.4: Weighted Statement Control/Data Flow Graph Example

In the example, the control edges are all labeled with 1, since the corresponding control flow occurred exactly once in the execution. Besides, the two data edges in the example are both labeled with the value 4, because the variables which cause the dependence have a size of 4 bytes¹. The value of Ws(entry) is zero, because a label does not perform any computation, and Ws(s1) equals one, since the cost for an assignment is one. In the following sections, the computation of the weight values is discussed in detail.

6.4.1 Node Weight Annotation

The weights of the WSCDFG nodes, i.e. IR statements, are calculated from the overall computational cost of the statements. Recall that in Chapter 5 an IR profile is created which holds the profiling information of each IR statement, and a function, StmtExecutionCost(s,p), is provided, which returns the overall execution cost of a statement. The function is not only used for the computation of the source profile; the value it returns is also directly used here as the node weight in the construction of the WSCDFG, as shown in the pseudo code of Section D.4.1.

6.4.2 Control Edge Annotation

In the design of the WSCDFG, the weight of a control edge reflects the intensity of the corresponding control flow in the application. Its value is determined by the number of times that the control flow is taken during the execution of the application. For instance, suppose there is a control edge ec = (s,s′) in a WSCDFG; each time the IR statement s′ is executed directly after s, the control flow from s to s′ is taken. If s and s′ are inside the same basic block, it is obvious that the total occurrence of their control flow equals the execution count of the BB. In case they are not in the same BB, their parent BBs must be executed in sequence in order for the control flow to be taken. Therefore, by counting the number of occurrences of two BBs appearing one directly after the other in the execution trace, the occurrence count of the control flow from the last IR statement of the preceding BB to the first statement of the succeeding BB can be determined. Section D.4.2 shows the pseudo code which implements the function GetControlEdgeWeight(ec).

¹The size of an int variable is assumed to be 4 bytes.
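For the inter-BB case, the core of the counting reduces to scanning the trace for back-to-back BB pairs, as in the following sketch (the trace is simplified here to a plain sequence of BB IDs; the function enter/exit marks of the real trace are ignored for brevity):

    /* Count how often BB `from` is immediately followed by BB `to`;
     * the result is the weight of the control edge from the last
     * statement of `from` to the first statement of `to`. */
    long CountAdjacentBBs(const int *bbTrace, long n, int from, int to)
    {
        long count = 0;
        for (long i = 0; i + 1 < n; i++)
            if (bbTrace[i] == from && bbTrace[i + 1] == to)
                count++;
        return count;
    }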


6.4.3 Data Edge Annotation

The weight of a data edge reveals the amount of data dependence between two IR statements. In order to quantify the data dependence, the size of the variable and the number of occurrences are used; the result is the product of the two values. Section D.4.3 shows the pseudo code of the function, GetDataEdgeWeight(ed), which computes the weight of a given data edge. Since the size of the variable can easily be looked up in the symbol table of the application IR, most of the code is used to find out the occurrence count of the data dependence. As shown in the code, in case the producer and the consumer statements are in the same BB, the number of occurrences is the same as the execution count of the BB. When they are not in the same BB, however, it is necessary to first construct the set of BBs which define the dependent variable, and then to count the number of occurrences from the trace file.

6.5 Summary

This chapter introduces the control and data flow analysis used in the MAPS framework. The final product of the analysis is the so-called Weighted Statement Control/Data Flow Graph, in short WSCDFG. WSCDFGs describe the control and data dependences at the fine-grained IR statement level. Moreover, they incorporate profiling information through the weights of the graph nodes and edges, so that an overview is provided which covers not only the existence but also the frequency of the dependence relationships. Through WSCDFGs, the MAPS tools know the computational cost of the application and the dependences within it. It is mainly based on this knowledge that the partitioning process finds the potential parallelism in the application. Chapter 7 discusses how the parallelism is found by the MAPS partitioning tool.

Chapter 7 Partitioning

The MAPS partitioning tool discussed in this chapter splits a sequential application into parallel tasks. This process is also called parallelization and has been intensively researched in the high-performance computing community for parallel compilers. These parallel compilers mostly do static analysis and focus on the parallelization of loops. MAPS takes a different approach. First, the WSCDFGs produced by the previous analysis process are used to provide detailed information about the control and data dependences inside the application. Then, the partitioning process uses this information to find parallel tasks. A novel granularity concept, the coupled block, is employed here in order to overcome the limitations of using basic blocks or functions as basic parallelization units. The process finds application parts which are closely connected through control and data dependences and makes parallel tasks out of them. Finally, instead of doing everything automatically, the MAPS GUI presents the parallel tasks found by the partitioning tool to the programmer, who eventually decides whether the machine-produced result is good enough or not. Since the programmer possesses application knowledge which cannot easily be obtained by an algorithm or software, human interaction is allowed in the MAPS partitioning process to keep flexibility. The rest of this chapter is organized as follows: Section 7.1 first provides a survey of related work in the area of parallelization; the granularity problem is then discussed in Section 7.2; afterwards, in Section 7.3, the granularity concept, the coupled block, is introduced; the algorithm which is used in the partitioning process to generate coupled blocks is presented in Section 7.4; Section 7.6 gives details about how the user can refine the partitioning result; finally, Section 7.7 summarizes the whole chapter.

7.1 Related Work

Since the dawn of parallel computing, plenty of work has been published on automatically extracting parallelism from sequential applications. For instance, in [106], some general techniques are introduced to expose coarse-grained parallelism in C applications. These techniques are in principle complementary to the approach used by MAPS and can be integrated into the framework.

Besides, several attempts to produce a partition of an application starting from a sequential implementation have been reported in the literature. [107] and [108] derive KPNs from loops based on the COMPAAN compiler technology. The derivation is possible for those Static Affine Nested Loop Programs (SANLPs) whose loop conditions are affine combinations of iterator variables. Because of this limitation, it would be difficult to apply the technique to applications with dynamic behavior and code with complex loop conditions. In [109], an algorithm is presented which parallelizes C applications through pipelining the execution of loops. The approach is similar to the one employed in the MAPS partitioning tool in that weighted flow graphs are also used to provide control and data dependence information. However, the analysis is performed solely on outermost loops, and the partitioning method is simple and not capable of handling hierarchy such as nested loops and loops with function calls. In [110], a HW/SW co-design environment is introduced which targets reconfigurable systems for data-intensive applications. A three-step algorithm is employed in the environment to extract coarse-grained parallelism, which includes task clustering, partitioning and scheduling. Nevertheless, the entry point of the algorithm is a Directed Acyclic Graph (DAG) representation of the application, and manual effort is required for the conversion from implementations written in a high-level programming language such as C to the DAG. Such a conversion is normally nontrivial. Similarly, [111] presents an approach in the network processor domain which also uses dynamic analysis, like MAPS does. Runtime instruction traces are used in that work to derive an Annotated DAG (ADAG) representation of the application, which is later taken by the clustering and mapping processes as input. In comparison to the MAPS dynamic analysis approach, applications are profiled at the assembly level. It could be difficult to annotate the partitioning results back to the original source code as coarse-grained tasks, because the mapping between assembly code and source code is often broken by compiler optimizations. [112] proposes a designer-controlled code partitioning approach for system-level MPSoC models. A set of six source-level code transformations is developed in the framework to partition C code and data structures. Since the transformations are completely user-controlled, the programmer needs to know the target application before a reasonable partitioning can be created. In comparison, MAPS employs a semi-automatic partitioning approach which first automatically searches for parallel tasks and then lets the programmer interact with the tool to determine the final parallel tasks. In the high-performance computing community, parallelization has been studied for decades. Although scientific applications normally feature intrinsic parallelism and are easier to parallelize than embedded multimedia applications, some works in that field are related to the MAPS approach in terms of the employed techniques. For example, [113] proposes a Decoupled Software Pipelining (DSWP) approach which analyzes the dependence graph of an application and generates a DAG out of a loop by merging the strongly connected components, i.e. parts that are highly dependent on each other. Thereafter, pipeline balancing is performed in a bin-packing fashion, and the solution that reduces the communication overhead is selected. This approach targets loops, and it is not clear whether loops containing function calls are supported. Besides, the time budget for each pipeline stage in the loop must be provided by the programmer in advance, which may not be easy. [114] presents an approach to exploit pipelined parallelism from C applications with user annotations.

Programmers are asked to provide information to the compiler telling where the pipeline stages are, and dynamic analysis is used to analyze the data dependences between pipeline stages. A work is presented in [115] which was developed mostly in parallel to that in [114]. The DSWP [113] approach is used and extended with C constructs that allow a set of legal outcomes from a program execution instead of only one. These constructs give more room for parallelizing code, and a case study is presented to show the potential of pipelined execution of C programs. All these approaches require the programmer to have a good understanding of the application, either to identify the threads or to estimate the time for a pipeline stage. In comparison, MAPS tries to automate the partitioning process as much as possible while keeping the possibility for the programmer to interact with the process, so that he can use his application knowledge when available.

7.2 Task Granularity

To find parallel tasks in a sequential implementation, granularity is a major issue in the search process and has a direct impact on the kind and degree of parallelism that can be achieved. Typically, fine-grained tasks perform small amounts of computation and can be finished within a short period of time. Coarse-grained tasks, on the contrary, do more computation and take longer to finish than fine-grained ones. Hence, given the same amount of computation, more fine-grained tasks are required to finish the calculation than coarse-grained ones. Since parallel tasks need to communicate and synchronize with each other, extra time is spent on inter-task communication and synchronization, which is not needed in a sequential application and is hence considered overhead. Such parallelization overhead is most of the time inevitable: even in the ideal case where no data needs to be communicated between parallel tasks, synchronization still takes place to coordinate the task execution. Therefore, using a large number of small parallel tasks normally means that the introduced parallelization overhead is also high. Sometimes this overhead can be so high that no speedup can be achieved by using multiple processors, and it is not uncommon that a parallel application runs slower than its sequential version. Coarse-grained tasks have higher data locality and therefore synchronize with each other less often. Consequently, less parallelization overhead is introduced when an application is parallelized in a coarse-grained manner. It is mainly for this reason that coarse-grained parallelism is favored by programmers for developing embedded applications. However, due to the complex control and data dependences inside applications, it is not always possible to parallelize them by exploiting coarse-grained parallelism. To find the appropriate granularity for parallelization, the application needs to be analyzed with respect to its internal control and data dependences. Traditional compilers analyze applications written in high-level programming languages at different granularity levels. The most commonly used constructs are:

• Statement: In programming languages such as C, a statement typically comprises one or a sequence of related calculations. It is the smallest entity of a programming language, and an application can be broken down to the level of statements for analyzing the relations between them. This granularity provides the highest degree of freedom to the analysis, but it is obviously too fine-grained for parallelization. For example, Figure 7.1a shows a code snippet in which the statement at line 10 performs only one single assignment operation.

a) Statement   b) Basic Block   c) Function

Figure 7.1: Granularity in Traditional Compilers
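For reference, the Figure 7.1 listing has roughly the following shape; lines 1–12 follow the surviving fragment, while the loop body (lines 13–16) is an assumed reconstruction based on the descriptions in the surrounding bullets:

     1: int f1(…);
     2: int f2(…);
     3: int f3(int a, int b){
     4:   return a+b;
     5: }
     6: int f4(int c, int d){
     7:   int a;
     8:   int i;
     9:   …
    10:   i = 0;
    11:   a = f1(…);
    12:   while(i < …){
    13:     a = f1(…);      /* lines 13–14: one BB of function calls */
    14:     a = f2(…);
    15:     …
    16:     i = i + 1;      /* line 16: one BB, a single addition    */
    17:   }
    18: }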

• Basic Block: A basic block (BB) is defined as a sequence of statements without any control flow in between. BBs are mostly used in traditional compilers as analysis units in which instruction-level parallelism is explored. For exploring coarse-grained parallelism, however, they are not well suited. In practice, it is easy to find examples where BBs are either too big or too small. For instance, a basic block is shown in Figure 7.1b from line 13 to 14 in the loop body. It is composed of a sequence of function calls whose execution times can be very long. Since these function calls belong to one BB, they are seen as a single node in the analysis. In such a situation, potential parallelism could be overlooked. On the other hand, BBs can also be very simple, like the one at line 16 of Figure 7.1b: it contains only a single statement with one addition operation. Extracting parallelism from such a BB is impossible.

• Function: In compilers, functions are normally subroutines with their own stacks. At this level, only function calls are analyzed, and the rest of the code is considered irrelevant for the analysis. Similar to BBs, functions can also be too coarse or too fine-grained, depending on the application and on the programming style of the developer. For example, in Figure 7.1c, f3 and f4 are two functions. f3 has a very small function body, which indicates that its execution time is short. f4, however, has a larger body and performs more computations than f3 does. Due to this difference in behavior, f3 and f4 should not be treated equally in the partitioning process. For parallelization, using functions alone as the granularity is therefore not enough.

From the above discussion, it can be seen that statements, BBs and functions are not well suited as the granularity level for extracting coarse-grained parallelism. This is mainly caused by the fact that these constructs are defined from a control-oriented point of view, which takes neither the amount of computation nor the internal connections within each construct into account. For the search of coarse-grained parallel tasks, a new granularity is needed. In practice, the control structure of parallel tasks can vary: they can be constructed from a simple sequence of statements as well as from complex control flow such as loops, if-then-else, etc. For this reason, the new granularity concept is required to be general enough to support analyzing code with different control structures. Besides, the data dependences should also be taken into account in order to increase the data locality within the potential parallel tasks. These requirements are addressed in the MAPS partitioning tool with a new concept.

7.3 Coupled Block

This section presents details about the Coupled Block (CB). First, the CB definition is given; then the design criteria of the CB are discussed.

7.3.1 CB Definition

Remember that in WSCDFGs nodes are IR statements. They are the most fine-grained elements that can be used by analyses to represent a program and provide the highest flexibility for defining a subpart of the program. Therefore, the definition of CB is based on the WSCDFG.

7.1. Definition (Coupled Block). A coupled block (CB) $G' = (S', E_c', E_d', W_s', W_c', W_d')$ is a subgraph of a WSCDFG $G = (S, E_c, E_d, W_s, W_c, W_d)$, where:

• $S' \subseteq S$ is the node set;

• $E_c' \subseteq E_c$ is the control edge set;

• $E_d' \subseteq E_d$ is the data edge set;

• $W_s'(s) = W_s(s)$ is the weight of node $s$;

• $W_c'(e_c) = W_c(e_c)$ is the weight of control edge $e_c$;

• $W_d'(e_d) = W_d(e_d)$ is the weight of data edge $e_d$;

• $\exists!\, s_{entry} \in S' : Pred(s_{entry}) \cap S' = \emptyset \land \forall s_i \in S', s_{entry}\ dom\ s_i$;

$s_{entry}\ dom\ s_i$ means $s_{entry}$ dominates $s_i$ [91], i.e. all execution paths to $s_i$ include $s_{entry}$;

• $\exists!\, s_{exit} \in S' : Succ(s_{exit}) \cap S' = \emptyset \land \forall s_i \in S', s_{exit}\ pdom\ s_i$;

$s_{exit}\ pdom\ s_i$ means $s_{exit}$ postdominates $s_i$ [91], i.e. all execution paths from $s_i$ include $s_{exit}$;

• $\forall s_i \in S : (s_{entry}\ dom\ s_i \land s_{exit}\ pdom\ s_i) \rightarrow s_i \in S'$;

Altogether, the above three conditions ensure that a CB has at most one entry and one exit;

• $W_{ij}(s_i,s_j) = \sum_{e_d(s_i,s_j) \in E_d'} W_d(e_d) + \sum_{e_c(s_i,s_j) \in E_c'} W_c(e_c)$;

$W_{ij}(s_i,s_j)$ is the overall weight of the edges between $s_i$ and $s_j$; and

• $\frac{\sum_{s_i,s_j \in S'} W_{ij}(s_i,s_j)}{|S'|} > T$, where $T$ is a user-given parameter; i.e. the ratio between the sum of all edge weights and the number of nodes $|S'|$ must be larger than a user-given parameter $T$.

In other words, a CB is a single-entry/single-exit subgraph of a WSCDFG whose nodes are strongly connected by control/data dependences. The definition is developed with respect to two main criteria, namely schedulability and data locality. In the following sections, the two criteria are discussed in detail.


7.3.2 Schedulability Constraint

The ultimate goal of the partitioning process is to parallelize a sequential application into smaller parallel tasks, each of which performs part of the functionality of the application. From this point of view, these tasks originate from parts of the sequential implementation, i.e. code blocks in the original source code. Generally speaking, a code block can contain arbitrary code. Figure 7.2 shows some example code blocks which can be constructed from a piece of C code: a) shows a code block with one line of code, b) is an example containing a complete loop structure, and the code block in c) includes the condition of a loop and part of the loop body.

a) a single line of code   b) a complete loop structure   c) a loop condition and part of the loop body

Figure 7.2: General Code Block Example

Although the construction of code blocks is very flexible, not every code block is a good candidate for a parallel task. For instance, the code block in Figure 7.2c contains a complex control structure which is part of a loop. It does not have a deterministic exit point, and partitioning the code according to it would imply that the task following it must have multiple entries. However, the execution of a task can only begin at one starting point; otherwise, the task scheduler would not know where to start the execution. Therefore, if parallel tasks were to be constructed from such a code block, the code would need to be massively restructured, which could be very complex. For this reason, schedulability is the first criterion for the construction of CBs. Specifically, it requires that each CB must have a unique entry and exit, at which the execution of the CB starts and ends. In the CB definition, the dominance and postdominance relations are used to ensure the fulfillment of this constraint. Note that, since dominators and postdominators have been well studied in compiler literature such as [104], they are not further discussed in this thesis. Compared to the definition of the BB, which does not allow branches inside a BB, the schedulability constraint is more relaxed concerning the internal control flow. Through this, CBs can be constructed from a wider range of source code than BBs. For instance, the examples in Figure 7.2 a) and b) have completely different internal control flows, but both fulfill the constraint.

7.3.3 Data Locality Constraint

Generally speaking, data locality refers to the phenomenon that a piece of code frequently accesses the same variable or memory location. For MPSoC programming, improving the data locality of a parallel task means that only a small amount of data needs to be transferred from other tasks. Since communicating data from one processor to another is very expensive in terms of the required transfer time, parallel tasks with high data locality normally have less communication overhead than those with low data locality. Typically, data locality can be observed through data flow graphs. Figure 7.3 shows some example data flow graphs, in which solid arrows represent control flow and dashed arrows represent data flow. In example a), since si and sj have no data dependence, they do not share any data. In comparison, sm in example b) writes variable a, which is later read by sn; the two share the variable a. The weight of the data edge between sm and sn is 4 when a is a 4-byte integer. In c), there is a backward edge, which shows the situation when the statements sp and sq belong to a loop. Assuming that the loop is iterated 100 times in the execution, the edge weight will be 400, which shows a strong connection between the two statements.

a) si: a = x;  sj: b = y;  (no data edge)   b) sm: a = foo();  sn: b = a;  (data edge, weight 4)   c) sp: a = foo();  sq: b = a;  (data edge inside a loop, weight 400)

Figure 7.3: Data Locality Examples

For constructing CBs, putting si and sj of example a) into one CB does not increase data locality, because no data is shared between them. On the other hand, since the variable a is shared by sp and sq multiple times, it makes sense to create a CB including the two statements to increase data locality. Otherwise, the variable might need to be communicated frequently between parallel tasks, which is not desired. Therefore, the second criterion of the CB definition is data locality, which is controlled by the ratio between the weight of all edges inside a CB and the number of nodes in the CB. Note that a user-given parameter T is used in the definition to control how strongly the nodes should be connected: the higher T is, the more strongly the nodes of a CB are connected.

a) Loosely Connected (nodes joined by edges of weight 10)   b) Strongly Connected (nodes joined by edges of weight 100)

Figure 7.4: Data Locality Example


Two examples are shown in Figure 7.4 with the edge weight values labeled beside each edge; the node weights are assumed to be the same in both graphs. Since the graphs are very simple, it can be directly seen that the nodes in example b) are more tightly connected through edges than the nodes in example a). Relating this observation to data dependence, a parallel task created from example b) has a higher data locality than one created from example a).

7.4 CB Generation and WSCDFG Partitioning

For creating CBs in WSCDFGs, a heuristic algorithm, Constrained Agglomerative Hierarchical Clustering (CAHC), has been developed and is used in the MAPS framework. This section first gives the definition of a partitioned WSCDFG. Then the condition of an optimal partition, which is the ultimate goal of the CB generation, is introduced. Afterwards, the CAHC algorithm is presented in detail. Finally, the concept of iterative clustering is explained.

7.4.1 WSCDFG Partition

Since CBs are subgraphs of a WSCDFG, a partitioned WSCDFG can be seen as a graph whose nodes are a mixture of IR statements and CBs. Below is the definition of a WSCDFG partition:

7.2. Definition (WSCDFG Partition). Given a WSCDFG $G = (S, E_c, E_d, W_s, W_c, W_d)$, its partition $P = (S', CB, E_c', E_d', W_s', W_c', W_d')$ is a graph with the following properties:

• $S' \subseteq S$ is the set of statements which do not belong to any CB;

• $CB$ is the set of coupled blocks in $G$;

• $E_c' \subseteq (S' \times S') \cup (S' \times CB) \cup (CB \times CB) \cup (CB \times S')$ is the set of control edges;

• $E_d' \subseteq (S' \times S') \cup (S' \times CB) \cup (CB \times CB) \cup (CB \times S')$ is the set of data edges;

• $W_s'$ is the node weight;

• $W_c'$ is the control edge weight;

• $W_d'$ is the data edge weight;

• $\forall cb \in CB : Nodes(cb) \cap S' = \emptyset$, where $Nodes(cb)$ is the set of nodes in $cb$; i.e. the statements in $S'$ do not belong to any CB;

• $\forall cb_i, cb_j \in CB, cb_i \neq cb_j : Nodes(cb_i) \cap Nodes(cb_j) = \emptyset$; i.e. CBs cannot have common nodes;

• $\bigcup_{cb \in CB} Nodes(cb) \cup S' = S$; i.e. the union of all nodes in CBs and the statements not belonging to any CB equals the node set of the WSCDFG $G$;

• $\forall e_c(s_i, s_j) \in E_c : (s_i \in cb \wedge s_j \notin cb) \Rightarrow \exists e_c'(cb, s_j) \in E_c', \; W_c'(e_c') = W_c(e_c)$;

• $\forall e_c(s_i, s_j) \in E_c : (s_i \notin cb \wedge s_j \in cb) \Rightarrow \exists e_c'(s_i, cb) \in E_c', \; W_c'(e_c') = W_c(e_c)$;

• $\forall e_d(s_i, s_j) \in E_d : (s_i \in cb \wedge s_j \notin cb) \Rightarrow \exists e_d'(cb, s_j) \in E_d', \; W_d'(e_d') = W_d(e_d)$;

• $\forall e_d(s_i, s_j) \in E_d : (s_i \notin cb \wedge s_j \in cb) \Rightarrow \exists e_d'(s_i, cb) \in E_d', \; W_d'(e_d') = W_d(e_d)$;

The above four conditions ensure that if exactly one end of an edge in the WSCDFG $G$ belongs to a CB, then there is a corresponding edge with the same weight in the partition $P$, whose source or destination is the corresponding CB;

• $\forall e_c(s_i, s_j) \in E_c : (s_i \in cb_m \wedge s_j \in cb_n) \Rightarrow \exists e_c'(cb_m, cb_n) \in E_c', \; W_c'(e_c') = W_c(e_c)$;

• $\forall e_d(s_i, s_j) \in E_d : (s_i \in cb_m \wedge s_j \in cb_n) \Rightarrow \exists e_d'(cb_m, cb_n) \in E_d', \; W_d'(e_d') = W_d(e_d)$;

The above two conditions ensure that if the two ends of an edge in the WSCDFG $G$ belong to two different CBs, then there is a corresponding edge with the same weight in the partition $P$, whose source and destination are the corresponding CBs;

• $\forall e_c(s_i, s_j) \in E_c : s_i, s_j \in S' \Rightarrow \exists e_c'(s_i, s_j) \in E_c', \; W_c'(e_c') = W_c(e_c)$;

• $\forall e_d(s_i, s_j) \in E_d : s_i, s_j \in S' \Rightarrow \exists e_d'(s_i, s_j) \in E_d', \; W_d'(e_d') = W_d(e_d)$;

The last two conditions ensure that the edges between statements which do not belong to any CB are kept in the partition.
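The definition translates naturally into a small set of data structures. The following C sketch shows one possible in-memory representation of a partition, in which a node is either an IR statement or a CB; all type and field names here are illustrative and not taken from the MAPS sources.

    /* Illustrative sketch of a WSCDFG partition; names are hypothetical. */
    typedef struct Stmt {
        int    id;
        double weight;              /* W_s: estimated execution cost */
    } Stmt;

    typedef struct CoupledBlock {
        Stmt **stmts;               /* Nodes(cb), disjoint from S' */
        int    n_stmts;
    } CoupledBlock;

    typedef enum { STMT_NODE, CB_NODE } NodeKind;

    typedef struct PNode {          /* node of the partition graph */
        NodeKind kind;
        union { Stmt *s; CoupledBlock *cb; } u;
    } PNode;

    typedef struct Edge {           /* control or data edge */
        PNode *src, *dst;
        double weight;              /* W_c' or W_d' */
    } Edge;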

A simple WSCDFG example is shown in Figure 7.5 a), with a partitioned version presented in b). Note that, since $s_3$ and $s_4$ are merged into $cb_1$, the related data edges, $e_d(s_1, s_3)$ and $e_d(s_2, s_4)$, end at the CB in the partitioned WSCDFG.

Figure 7.5: WSCDFG Partition: a) WSCDFG; b) Partitioned WSCDFG

An unpartitioned WSCDFG can be seen as a special partition with $CB = \emptyset$. Theoretically, a huge number of possible partitions can be derived from a given WSCDFG, because the CB creation is very flexible. Hence, it is necessary to define the optimality of a partition in order to guide the generation of CBs.

7.4.2 Partition Optimality

It is difficult to define the optimality of a partition when only one CB is considered, because a single CB represents only a part of the partition. A global measure is required to evaluate whether a partition is good or not. For this, some definitions need to be introduced first.


7.3. Definition (Node-Node Similarity). Given a WSCDFG $G = (S, E_c, E_d, W_s, W_c, W_d)$, the similarity between any two nodes in the graph, $sim(s_i, s_j), s_i, s_j \in S$, is defined as:

• $sim(s_i, s_j) = \frac{W_{ij}(s_i, s_j)}{Dist(s_i, s_j)}$, where

$Dist(s_i, s_j)$ is the distance between $s_i$ and $s_j$, i.e. the length of the shortest control path between the nodes; and

$W_{ij}(s_i, s_j)$ is the overall weight of the edges between $s_i$ and $s_j$, as defined in Section 7.3.1.

Similarity is a term commonly used in graph clustering problems; typically, it expresses how close two nodes are. In the node-node similarity definition, the distance between two nodes takes only the control edges into account. Given two nodes, the shorter the distance between them, the more similar they are. Figure 7.6 shows two examples in which control edges are drawn with solid lines. It can be seen from the figure that $s_1$ and $s_3$ in example b) are separated from each other by another node, and are therefore less similar than $s_1$ and $s_2$ in example a). In graph clustering, the similarity factor can be used to prevent a cluster from spanning a large portion of the graph. Here, it is used to denote the closeness of any two nodes in a WSCDFG.

Figure 7.6: Node-Node Similarity Examples: a) $sim(s_1, s_2) = 100$ with $W_{12} = 100$; b) $sim(s_1, s_3) = 50$ with $W_{13} = 100$
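The values in Figure 7.6 follow directly from the definition: in a), $s_1$ and $s_2$ are adjacent, while in b), $s_1$ and $s_3$ are two control edges apart:

$$sim(s_1, s_2) = \frac{W_{12}}{Dist(s_1, s_2)} = \frac{100}{1} = 100, \qquad sim(s_1, s_3) = \frac{W_{13}}{Dist(s_1, s_3)} = \frac{100}{2} = 50$$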

7.4. Definition (CB Internal Similarity). Given a CB, $CB' = (S', E_c', E_d', W_s', W_c', W_d')$, its internal similarity is defined as:

• $isim(CB') = \frac{\sum_{s_i, s_j \in S'} sim(s_i, s_j)}{|S'|}$
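As a hypothetical illustration of the effect of dividing by $|S'|$: a CB with two nodes and a single pair similarity of 100 yields $isim = 100/2 = 50$, whereas a CB with five nodes whose pair similarities sum to 150 yields only $isim = 150/5 = 30$, even though its total connection weight is higher.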

The CB internal similarity describes the average similarity over all node pairs in a CB. Note that the sum is divided by the cardinality of the set $S'$. By doing this, a CB with heavy connections among few nodes has a higher internal similarity than one with light connections among many nodes, which implies that the former can achieve higher data locality. When the optimality of a parallel application is considered, it is natural to use the speedup in comparison to the sequential version as the target function. However, here in the partitioning phase, the speedup number is not available, because the parallel version is yet to be developed. Therefore, data locality is used as a replacement measure to define the optimality of partitions, since higher data locality can lead to less inter-processor communication and hence better parallel performance. A WSCDFG partition is seen as optimal when the highest overall data locality is achieved. Given the node-node and the CB internal similarity, the optimality of a WSCDFG partition can be defined as follows:


7.5. Definition (Optimum WSCDFG Partition). Given the set of all possible partitions of a given WSCDFG $G$, $\hat{P} = \{P_1, P_2, \ldots, P_p\}$, a partition $P^* \in \hat{P}$ is said to be optimal if:

• $P^* = \arg\max_{p \in \hat{P}} \left( \sum_{cb \in CB(p)} isim(cb) \right)$, where $CB(p)$ is the set of coupled blocks in a partition $p$.

Finding the optimal partition can be seen as a graph clustering problem, which has been proven to be NP-hard even for simple algorithms such as k-means with Euclidean distances [116]. For this reason, MAPS employs a heuristic algorithm to partition WSCDFGs.

7.4.3 CAHC Algorithm

From a graph-theoretic point of view, CBs can be seen as node clusters which are strongly connected by control/data edges. The problem of finding such clusters in a huge graph also exists in the data mining area, where clusters of data are searched for according to given criteria. Algorithms have been developed for this purpose, and MAPS chooses to extend an existing one for partitioning. The Constrained Agglomerative Hierarchical Clustering (CAHC) algorithm used in MAPS is a modified version of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [117] algorithm. The algorithm converts a graph clustering problem into a vector clustering problem by using the similarity measures introduced in the previous section. The execution of the algorithm can be roughly separated into three steps:

• Dense Core Calculation Dense cores are node sets whose members are tightly connected with edges. They are first calculated as skeletons of CBs.

• CB Generation Since a dense core only includes the strongly connected nodes, it does not directly fulfill the CB definition. An additional step is used here to create CBs from dense cores.

• Iterative Clustering In order to obtain CBs on different granularity levels, fine-grained CBs are further clustered into coarse-grained ones in the final step of the CAHC algorithm.

The CAHC algorithm mainly uses the DBSCAN algorithm in the first step to find clusters of nodes which are tightly connected, though some parameters are changed in order to exploit the control and data flow information. The overall structure of the algorithm is sketched below.
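The following C sketch outlines the three steps as a top-level driver loop; all function and type names are illustrative placeholders, not the actual MAPS code.

    /* Illustrative top-level sketch of CAHC; all names are hypothetical. */
    typedef struct WSCDFG    WSCDFG;     /* weighted stmt control/data flow graph */
    typedef struct Partition Partition;  /* partitioned WSCDFG (Definition 7.2)   */
    typedef struct CoreSet   CoreSet;    /* set of dense cores                    */

    Partition *as_trivial_partition(WSCDFG *g);         /* P^0 with CB = empty    */
    CoreSet   *find_dense_cores(Partition *p);          /* step 1                 */
    int        core_set_empty(CoreSet *c);
    void       grow_cbs_from_cores(Partition *p, CoreSet *c, double T); /* step 2 */
    Partition *coarsen(Partition *p);                   /* step 3                 */

    Partition *cahc(WSCDFG *g, double T) {
        Partition *p = as_trivial_partition(g);
        for (;;) {
            CoreSet *cores = find_dense_cores(p);  /* tightly connected nodes     */
            if (core_set_empty(cores))             /* stop: no new CB found       */
                break;
            grow_cbs_from_cores(p, cores, T);      /* dense cores become CBs      */
            p = coarsen(p);                        /* P^n becomes input of P^(n+1) */
        }
        return p;  /* in MAPS, the intermediate partitions are kept as well */
    }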

Dense Core Calculation

The goal of dense core calculation is to find those nodes which are tightly connected by dependences. Since a WSCDFG can contain a large number of nodes, it is necessary to identify the nodes which are strongly connected with their neighbors (predecessors and successors). For this, the concept of core-node is used, which is derived from the core-condition of the DBSCAN algorithm.


Figure 7.7: Core-Node Example (a node strongly connected to its neighboring nodes)

7.6. Definition (Core-Node). A core-node $c$ in a WSCDFG is a node whose weighted degree, $wDeg(c)$, fulfills the following condition:

• $wDeg(c) = \frac{\sum_{s_i \in Pred(c) \cup Succ(c)} sim(s_i, c)}{|Pred(c) \cup Succ(c)|} \geq m$

The condition $wDeg(c) \geq m$ is called the core-condition, and the parameter $m$ is the average weighted degree of the WSCDFG, which is calculated by the following equation:

$$m = \frac{\sum_{s_i \in G} wDeg(s_i)}{|G|} \qquad (7.1)$$

The core-condition here differs from the one used by the DBSCAN algorithm, which sets a minimum number of neighboring nodes that a core-node shall have. Here, a core-node can be located at any position in a WSCDFG, as long as its connectivity to its neighbor(s) is higher than average. However, since a core-node represents only one node in a graph and can have multiple predecessors and successors, as the example in Figure 7.7 shows, it is necessary to determine which node exactly has a strong connection to a core-node. For this purpose, the concept of the $\epsilon$-Neighbor is used.

7.7. Definition ($\epsilon$-Neighborhood). The $\epsilon$-neighborhood of a node $s_i$, $N_\epsilon(s_i)$, is the set of nodes which fulfills the following condition:

• $N_\epsilon(s_i) = \{s_x \mid s_x \in Succ(s_i) \cup Pred(s_i), \; sim(s_i, s_x) > \epsilon\}$

If a node $s_x \in N_\epsilon(s_i)$, then $s_x$ is an $\epsilon$-Neighbor of $s_i$. The parameter $\epsilon$ is calculated by the following equation:

$$\epsilon = \frac{\sum_{s_i, s_j \in G} sim(s_i, s_j)}{|\{(s_i, s_j) \mid sim(s_i, s_j) \neq 0\}|} \qquad (7.2)$$


Figure 7.8: $\epsilon$-Neighbor Example ($s_i$: $\epsilon$-Neighbor of the core-node $s_j$, linked by a strong connection)

Note that when two nodes are not directly connected by a control or data edge, their similarity is zero. Therefore, $\epsilon$ is in fact the average similarity between connected nodes. Figure 7.8 shows an example of an $\epsilon$-Neighbor, illustrating that an $\epsilon$-Neighbor has an above-average connectivity to the core-node. Based on the definitions of $\epsilon$-Neighbor and core-node, the relation Directly Density Reachable can be derived as follows:

7.8. Definition (Directly Density Reachable). A node $s_x$ is said to be Directly Density Reachable (DDR) from a node $s_i$, w.r.t. $\epsilon$ and $m$, if $s_i$ is a core-node and $s_x \in N_\epsilon(s_i)$. This relation is denoted as:

• $s_x \; DDR_{\epsilon,m} \; s_i$

Using the DDR relation, lightly weighted connections to a core-node are excluded. For example, in Figure 7.8, assume that $s_i$ is the only $\epsilon$-Neighbor of $s_j$; then $s_i \; DDR_{\epsilon,m} \; s_j$. The concept can be extended to a chain of nodes, which gives the relation Density Reachable (DR).

7.9. Definition (Density Reachable). A node $s_x$ is said to be Density Reachable (DR) from a node $s_i$ w.r.t. $\epsilon$ and $m$, if:

• $\exists \{z_i \mid i \in \{1, 2, \ldots, p\}\}$ with $z_1 = s_i$, $z_p = s_x$, and $\forall i = 1, \ldots, p-1 : z_{i+1} \; DDR_{\epsilon,m} \; z_i$.

• This relation is denoted as $s_x \; DR_{\epsilon,m} \; s_i$.

By applying the DR relation, a sequence of nodes can be strongly connected, as the example in Figure 7.9 shows. Since $s_k \in N_\epsilon(s_i)$, $s_i$ and $s_j$ are core-nodes, and $s_i \; DDR_{\epsilon,m} \; s_j$, it follows that $s_k \; DR_{\epsilon,m} \; s_j$. Given the definitions of the DDR and DR relations, the dense core is defined as follows:

7.10. Definition (Dense Core). In a WSCDFG G, a dense core C is a set of nodes of G that satisfies the following conditions:

• $C \subseteq Nodes(G) \wedge C \neq \emptyset$;

• $\forall s_i, s_j \in Nodes(G) : (s_i \in C \wedge s_j \; DR_{\epsilon,m} \; s_i) \Rightarrow s_j \in C$; i.e. if two nodes are density reachable, then they are in the same dense core;

• $\forall s_i, s_j \in C : s_i \; DR_{\epsilon,m} \; s_j$; i.e. all nodes in a dense core are density reachable.

Figure 7.9: Density Reachable Example: $s_k \; DR_{\epsilon,m} \; s_i$ ($s_k$: $\epsilon$-Neighbor; $s_i$, $s_j$: core-nodes)

According to this definition, all strongly connected nodes in a WSCDFG are clustered into a set of dense cores. For instance, the nodes $s_i, s_j, s_k$ in Figure 7.9 form a dense core. These nodes are the skeletons of CBs; however, dense cores are not CBs yet. Based on the example of Figure 7.9, Figure 7.10 illustrates the subgraph that a dense core could cover. As can be seen from the diagram, there are multiple control edges entering and exiting the subgraph, which violates the schedulability constraint of the CB definition. Therefore, a post-processing step is needed to build CBs from the dense cores, which is the second part of the CAHC algorithm, CB generation.

Figure 7.10: Dense Core Example: $C = \{s_i, s_j, s_k\}$


CB Generation

According to the CB definition, two constraints need to be met in order to construct CBs from dense cores: schedulability and data locality. To find a set of nodes which fulfills the schedulability constraint, the CAHC algorithm uses the concept of the induced set.

7.11. Definition (Induced Set). Given a set of nodes $C$ in a WSCDFG $G$, its induced set $IS(C)$ is the smallest node set which fulfills the following conditions:

• $C \subseteq IS(C) \subseteq Nodes(G)$;

• $\exists! s_{entry} \in IS(C) : Pred(s_{entry}) \not\subseteq IS(C) \wedge \forall s_i \in IS(C), \; s_{entry} \; dom \; s_i$;

• $\exists! s_{exit} \in IS(C) : Succ(s_{exit}) \not\subseteq IS(C) \wedge \forall s_i \in IS(C), \; s_{exit} \; pdom \; s_i$;

• $\forall s_i \in Nodes(G) : (s_{entry} \; dom \; s_i \wedge s_{exit} \; pdom \; s_i) \Rightarrow s_i \in IS(C)$.

Note that the last three conditions are similar to the CB schedulability definition. In fact, if a subgraph is constructed from an induced set, the conditions become the same as those of the CB in the context of the subgraph. Figure 7.11 gives an example induced set which is constructed from the dense core in Figure 7.10. It can be seen that some neighboring nodes are included in the induced set in order to fulfill the conditions. In the example, the dense core nodes $s_k$ and $s_j$ happen to be the entry and exit nodes; however, this is not necessarily the case for the induced sets of all dense cores. It is often the case that the entry and exit nodes do not have a strong connection to the other nodes. The induced set guarantees that a CB constructed from a set of nodes can be scheduled. To meet the data locality constraint, the internal connectivity of the nodes needs to be checked using the equation given in the CB definition. Nevertheless, since dense cores are created without considering the user-given granularity parameter T of the CB definition, it can happen that a dense core is more coarse-grained than the parameter defines. Therefore, directly constructing CBs from dense cores might result in too coarse-grained CBs.

Figure 7.11: Induced Set Example: $IS(\{s_i, s_j, s_k\})$ with $s_{entry} = s_k$ and $s_{exit} = s_j$

To prevent this from happening, the construction of a CB starts from the induced set of the center of a dense core and gradually includes more nodes until the CB data locality constraint is met. The dense core center is determined using the following definition:

7.12. Definition (Dense Core Center). Given a dense core $C$, its center $s_c$ is defined as the node with the highest weighted degree, i.e.:

• $s_c = \arg\max_{s_i \in C} (wDeg(s_i))$

Finally, given a dense core $C$, the set of nodes $K$ which should belong to a CB is determined using the following criteria:

• $K = IS(C^*), \; C^* \subseteq C$; and

• $\frac{\sum_{s_i, s_j \in IS(C^*)} W_{ij}(s_i, s_j)}{|IS(C^*)|} > T$.

$C^*$ is a subset of $C$ and initially contains only the dense core center $s_c$ of $C$. When $C^*$ is extended, the remaining nodes of $C$ are added one by one according to their weighted degree; the node with the heavier degree is always added first. Thereby, the search for $K$ given a dense core $C$ can be done without trying all possible subsets of $C$ and can therefore be finished within a short period of time; a sketch of this greedy procedure is given below. After the creation of CBs, a WSCDFG partition can easily be constructed by removing the control/data edges between nodes within CBs and adding new ones for the connections between CBs and non-clustered IR statements. This process is very straightforward.
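The following C sketch illustrates the greedy growth of $C^*$; the helper functions and types are hypothetical stand-ins for the corresponding MAPS internals, declared here only to make the sketch self-contained.

    /* Illustrative sketch of CB construction from a dense core. */
    typedef struct NodeSet NodeSet;
    typedef struct Node    Node;

    NodeSet *set_singleton(Node *n);
    void     set_add(NodeSet *s, Node *n);
    NodeSet *induced_set(NodeSet *s);          /* Definition 7.11             */
    double   locality(NodeSet *k);             /* sum of W_ij over |IS(C*)|   */
    int      nodes_by_wdeg_desc(NodeSet *core, Node *center, Node **out);

    NodeSet *build_cb(NodeSet *core, Node *center, double T) {
        Node *rest[256];
        int n = nodes_by_wdeg_desc(core, center, rest); /* heaviest first */

        NodeSet *cstar = set_singleton(center); /* C* starts at the center */
        NodeSet *K = induced_set(cstar);

        for (int i = 0; i < n && locality(K) <= T; i++) {
            set_add(cstar, rest[i]);            /* grow C* node by node    */
            K = induced_set(cstar);             /* keep K schedulable      */
        }
        return K;   /* K = IS(C*); locality exceeds T if reachable */
    }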

Iterative Clustering

The hierarchical clustering part of the CAHC algorithm is based on the fact that a WSCDFG partition can be further clustered into a more coarse-grained partition. Given a WSCDFG partition $P = (S, CB, E_c, E_d, W_s, W_c, W_d)$, when IR statements and CBs are equally considered as graph nodes, all definitions introduced above, such as node-node similarity and dense core, can be applied again to the members of $S$ and $CB$. Thereby, a more coarse-grained partition $P^n = (S^n, CB^n, E_c^n, E_d^n, W_s^n, W_c^n, W_d^n)$ can be obtained from a partition $P^{n-1} = (S^{n-1}, CB^{n-1}, E_c^{n-1}, E_d^{n-1}, W_s^{n-1}, W_c^{n-1}, W_d^{n-1})$ which results from an earlier iteration of the CB generation procedure. Remembering that a WSCDFG $G$ can be seen as a special partition $P^0$ with $CB = \emptyset$, applying the clustering process described above iteratively to $P^0$ yields a sequence of partitions, $P^0 \Rightarrow P^1 \Rightarrow \cdots \Rightarrow P^n$. This iterative clustering process is illustrated in Figure 7.12, in which it can be seen that after each iteration some nodes are clustered into more coarse-grained ones. Between different iterations of the clustering process, the CB parameter T is not changed; it controls the CB granularity within each iteration. After the clustering of a WSCDFG is finished, i.e. no new CB can be found, a set of partitions at different granularity levels has been produced. Since a WSCDFG represents only one function of an application, it is necessary to analyze all functions globally in order to decide which partition, i.e. which granularity, is best suited for the whole application. The next section explains the global analysis in the MAPS partitioning tool, which is implemented for this purpose.


Figure 7.12: Iterative Clustering Example: a) $P^0$; b) $P^1$ with $CB_1 = \{s_2, s_3, s_4\}$; c) $P^2$ with $CB_2 = \{s_2, s_3, s_4, s_5\}$

7.5 Global Analysis

The CB generation procedure described in the previous section does not directly determine whether the granularity selected for a given function matches the granularity suitable for the other functions of the application. In order to cope with this problem, a global analysis is performed in the MAPS partitioning tool based on the partitions produced by the CAHC algorithm. The basic idea of the analysis is to balance the “size” of the CBs, where the overall node weight of a CB is used as the measure of “size”. Remember that the weight of a node in the WSCDFG is its estimated execution cost, which is related to the execution time. Having CBs of similar “size” helps to avoid the case that one parallel task runs for a long time while another runs only for a short period; in such a situation, the processing element running the short task would have to wait for the one running the long task to finish, which is not desirable. According to this idea, given a set of partitions for each function of the target application, the most suitable partition is determined in MAPS with the following steps:

• First, the application call graph is traversed in a bottom-up fashion (finishing at the main function), and CB statistics are collected with respect to the partition iteration number and the source function.

• Second, the CB weights are compared against each other in order to find a common granularity for the different functions. To make the decision, a simple load balancing is performed following the first-fit algorithm introduced in [118]. With this value, the granularity of the final parallel tasks is equalized.

• Finally, for each function, the partition is selected whose CB weights are closest to the common granularity value; a minimal sketch of this selection follows below.
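As an illustration of the last step, the following self-contained C sketch picks, for one function, the partition whose average CB weight is closest to a common granularity value g; the function name and the array-based interface are assumptions made for this example.

    #include <math.h>

    /* Return the index of the partition whose average CB weight is
     * closest to the common granularity value g (illustrative sketch). */
    int pick_partition(const double avg_cb_weight[], int n_partitions, double g) {
        int best = 0;
        for (int i = 1; i < n_partitions; i++)
            if (fabs(avg_cb_weight[i] - g) < fabs(avg_cb_weight[best] - g))
                best = i;
        return best;
    }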

Alternatively, the global analysis procedure can also be turned off. In this case, the user has to decide manually which functions of the application should be partitioned and for how many iterations each function should be partitioned. Figure 7.13 shows the GUI of the MAPS IDE, with which the user can semi-automatically partition an application. For convenience, functions are sorted according to their execution costs. The user only needs to check the functions to be partitioned and specify the iteration numbers. Such a semi-automatic partitioning procedure is

supported by MAPS in parallel to the automatic global analysis, so that the user has more freedom in deciding the parallelization strategy.

Figure 7.13: Semi-Automatic Partitioning Control

7.6 User Refinement

In the MAPS tool flow, the CBs produced by the CAHC algorithm and the global analysis are directly considered by the tool as parallel tasks. Nevertheless, before parallel code is generated from them, the user is given the chance to refine the results. Since the automatic partitioning procedure described in the previous sections analyzes the application mainly based on its internal control/data dependence information, application knowledge, such as the structure of a video frame, is not taken into account. Such information is typically very specific to an application and difficult to extract automatically. Therefore, the user refinement step is necessary in the MAPS framework to allow the programmer to inspect the result of the automatic tool and perform modifications when necessary. Through this, the programmer can use his knowledge of the application domain to further improve the parallelization result. The user refinement is supported through the graphical user interface (GUI) of the MAPS IDE. Figure 7.14 is a screenshot of the corresponding GUI. On the left-hand side, the application source code is displayed in the MAPS C code editor, which highlights the code regions of different tasks with different background colors. The programmer can directly see which lines of source code are clustered together. On the right-hand side, a graphical view of the corresponding WSCDFG partition is given, through which the programmer can see the IR statements in the CBs and the dependences between them. For example, the topmost CB node shown in the screenshot contains the IR statements of the two function calls def1 and use1, and the node corresponds to the task covering source lines 114 and 115. The control and data edges are annotated in the graph with their weights, and the connectivity between nodes is thus directly visible to the user. Using the MAPS IDE, the programmer can easily modify the partitioning results by removing or modifying existing tasks or adding new ones.


Figure 7.14: User Refinement of Partitioning Results

7.7 Summary

This chapter has introduced the partitioning process of the MAPS development flow. A granularity concept, the coupled block, is used in the partitioning tool to capture different granularity levels. In order to find parallelism in an application automatically, the tool uses the CAHC algorithm and a global analysis process. The generated tasks are visualized by the MAPS IDE, and the programmer is given the chance to change the result when necessary. From a programmer's perspective, the partitioning process thus remains completely under his control.


Chapter 8 High-Level Simulation

In the MAPS framework, high-level simulation is employed for early MPSoC software development. The meaning of early here is two-fold. First, when the MPSoC design is at a very early stage and no VP or hardware prototype is available, programmers can use the high-level simulation facility of MAPS to execute and test MPSoC applications. Second, even when a hardware prototype already exists, parallel applications, which are difficult to debug, can first be tested using high-level simulation before being executed on the target platform. Moreover, in addition to the functional test of MPSoC applications, programmers can also get early performance estimates through high-level simulation. These goals are supported in the MAPS framework by the MAPS Virtual Platform (MVP), which consists of two parts, an abstract MPSoC simulator and a special software tool chain. This chapter introduces the high-level simulation approach implemented in MAPS and gives details about the MVP. The rest of this chapter is organized as follows: first, Section 8.1 gives a survey of related work; then, an overview of the MVP is given in Section 8.2; the details of the MVP simulator are introduced in Section 8.3, and the MVP-specific software tool chain is discussed in Section 8.4. Finally, Section 8.6 summarizes the whole chapter.

8.1 Related Work

High-level simulation is a key component of Electronic System Level (ESL) design flows. Many MPSoC design frameworks which use high-level simulation for early design space exploration have been developed in the past few years. Although they all simulate MPSoCs at a high abstraction level, different approaches are used to create the high-level application models. Some works use SystemC [119] to model MPSoC applications, e.g. [120] and [121]. In this case, the application behavior must be modeled within SystemC-specific C++ classes in order to fit into the simulation environment. Hence, the source code cannot be directly reused for the target MPSoC, and special treatment is required in these approaches to convert the high-level application models into C code for target execution. Typically, this process is called software synthesis.


The framework introduced in [122] also uses SystemC as its high-level modeling language, but it takes a different approach to get the software running on the target. Instead of synthesizing C code from a high-level application model, it provides a special runtime implementing the SystemC environment on the target platform. To generate a target executable binary, the programmer only needs to cross-compile the source code, which is written in SystemC, and link the output objects with the special runtime library. Although the application model does not need to be changed in the whole process, a SystemC runtime environment must be implemented for the target platform, which is not trivial. Besides SystemC, other modeling languages are also used for creating high-level models. For instance, Simulink [123] is used in the work introduced in [124], and in [125] the author uses UML [126] to describe software behavior. Since these languages cannot be directly compiled into executables, extra code generators are needed to generate C source code for the target platform. Besides, such modeling languages are very different from traditional sequential programming languages such as C or C++; therefore, it is difficult to reuse legacy C code. This implies that even if the C/C++ code of an old application is available, the application needs to be modeled from scratch in the new modeling language. Some works use C/C++ as high-level modeling language, e.g. [127], [128] and [57], because C is until today the most used programming language in the embedded domain [77]. Nevertheless, the target application must be modeled as a Kahn Process Network (KPN) [47] by using a special API, which restricts the flexibility of high-level modeling. Modeling applications is just one aspect of high-level simulation; another is abstracting the underlying MPSoC architecture. In this area, a lot of work has been done with a focus on the modeling of operating systems. For example, [129] introduces a simulation framework which uses an abstract RTOS model for task scheduling. However, since software behavior is not included in its application model, the framework is more suitable for overall system design than for software development. Some frameworks model the OS together with the software behavior at a high abstraction level by using SystemC ([130] and [131]) or SpecC ([132] and [130]) as their modeling language. Such approaches normally require the user to insert timing annotations into the source code explicitly in order to let the software consume execution time in the high-level simulation. Instruction set simulation is a typical source of the required timing information. For example, the Sesame [133] and MPA [70] frameworks determine software delays by first running the target application in an instruction-set simulator (ISS) and then back-annotating the result to the corresponding high-level models. Different from MPA, the Sesame framework additionally supports the use of an ISS while its high-level simulation is running; the timing information is extracted and back-annotated on-line at runtime. A similar approach is also implemented in the work introduced in [134] to automate the annotation process. Since the simulation speed of an ISS is limited, timing-annotated native execution has been proposed, which is typically supported by a tool capable of extracting timing information automatically.
For example, the work introduced in [135] statically extracts timing information from target binaries which are cross-compiled from intermediate C source code; the results are then annotated back to the natively compiled code. In comparison, the approach taken in [136] is more direct: it modifies the backend of the GCC compiler to produce timing-annotated SystemC code for native simulation. The high-level simulation tool of the MAPS framework, the MAPS Virtual Platform (MVP), is

similar to the VPU work introduced in [137] and [138]. The VPU has been developed as a part of the CoWare Platform Architect (PA) [139], which is essentially a SystemC modeling framework, and targets both MPSoC hardware architecture exploration and software development. The design of the MVP, in contrast, solely targets MPSoC software development. With this goal in mind, the approach taken by the MVP differs from the above-mentioned frameworks mainly in the following aspects:

• MPSoC model Since the MVP targets software development, especially functional testing, some hardware details are strongly simplified in the MVP, e.g. the communication architecture. In comparison, frameworks targeting architecture exploration are capable of modeling the complete network-on-chip of an MPSoC, which can be very detailed.

• A priori knowledge To use the MVP, a programmer is not required to have deep knowledge of hardware architecture or SystemC modeling experience. Some frameworks are designed for creating platform models with SystemC, and hence the user must learn SystemC before he can use the tools.

• User interface To create an abstract MPSoC model with the MVP, the programmer can use just mouse-click and drag-and-drop operations in a graphical user interface and finish the modeling process within minutes. For a similar MPSoC platform, more time is required by tools oriented towards SystemC modeling, because the user has many more platform details to take care of.

• Application model C is the only programming language the programmer needs to know in order to create an application model that runs in the MVP. Most other frameworks require the application source code to be wrapped in SystemC-specific C++ classes.

• Inter-task communication The MVP supports inter-task communication in parallel applications simply through shared memories, which can be accessed by normal C pointer dereferences. This is sufficient for developing functionally correct software. Other frameworks require the communication between tasks to be explicitly coded as SystemC channel accesses so as to generate traffic on the interconnection network, which is necessary for architecture exploration but not for functional simulation.

• Application/simulator decoupling In the MAPS framework, the application binaries are separated from the MVP simulator. They are dynamically loaded into the simulator before the simulation starts. To change the applications running in the MVP, the programmer only needs to load a different set of applications. Most other frameworks, on the other hand, compile the application source code together into one simulation executable; each time the MPSoC configuration changes, e.g. when the task-to-processor mapping is modified, the simulation needs to be recompiled.

• Runtime configuration When the simulation is running, the MVP simulator allows its user to dynamically change


some configurations such as the processing element execution speed, the task-to-processor mapping, etc. In other frameworks, such a modification would require the simulation to be stopped, reconfigured and sometimes recompiled before the effect of the change can be seen.

8.2 MVP Overview

The goal of the MVP is to support MPSoC programming with high-level simulation; abstract simulation of MPSoCs is therefore the main feature of the MVP. To achieve this, SystemC is used as the simulation backend, which is the most common approach for system- and high-level exploration. However, since SystemC is based on C++ and its predefined language constructs were originally designed for hardware simulation, C applications cannot be directly simulated in it. Normally, developers need to manually wrap a piece of C code into SystemC modules in order to run it in SystemC, which can be a tedious process and requires a priori knowledge of SystemC. As the target users of MAPS are software developers who might not be familiar with SystemC or hardware architecture, a special software tool chain is included in the MVP, which allows the MAPS user to develop applications directly in C without worrying about the underlying simulation framework. Figure 8.1 shows an overview of the MVP. Roughly, the MVP can be divided into two parts:

• Abstract MPSoC simulator, which is responsible for the high-level simulation of MPSoC architectures; and

• Software tool chain, whose final products are native binaries with the application behavior embedded, which can be loaded into the simulator dynamically.

Figure 8.1: MVP Overview (application C sources are compiled by the MVP software tool chain into native binaries, which are dynamically loaded into the abstract MPSoC simulator with its VPEs, shared memory, GUI and configuration files)

To use the simulator, a GUI is provided by the MVP as a centralized cockpit for the configuration and control of the simulation process. A programmer can use the GUI to load applications, instantiate processing elements, set processor speeds and map applications to processors. The creation of an abstract high-level MPSoC model and its configuration can be done completely in the GUI by mouse-click and drag-and-drop operations within minutes. Additionally, the simulator can also be

configured through files, with which the programmer can easily manage the simulation of different MPSoC configurations. Moreover, as can be seen in the figure, the application native binaries are decoupled from the MVP simulator. Here, the meaning of decoupling is twofold. First, the applications are compiled into shared libraries which are physically independent and separate from the simulator. Second, each application in the MVP is a standalone C program, which has its own main function and does not interfere with that of the SystemC simulation kernel. Because of this decoupling, the MVP simulator can be used in a very flexible way. An MPSoC programmer often needs to simulate different application scenarios, each of which runs a different set of applications. Using the MVP, the programmer only needs to load a different set of applications in order to simulate these scenarios, which can easily be done through the GUI or the configuration file. In the following sections, the above-mentioned parts of the MVP and their usage will be discussed in detail.

8.3 MVP Simulator

The MVP simulator provides application developers with a high-level abstraction of the underlying MPSoC platform. It allows programmers to easily model an MPSoC architecture without taking care of its low-level details, and afterwards simulates the architecture at a high abstraction level. From a high-level perspective, an MPSoC roughly consists of:

• Processing elements, which perform computations;

• Interconnections, through which processing elements transfer data between each other;

• Peripherals, which are responsible for the interaction with users and external devices.

In the MVP simulator, these components are abstracted with high-level models. Processing elements are abstracted by so-called Virtual Processing Elements (VPEs). The interconnections for inter-processor communication are modeled using generic shared memories. A virtual I/O peripheral is provided for the display of graphical and textual information. The parallel operation of the components is achieved by using SystemC as the simulation environment. The implementation of the simulator is in line with the loosely-timed coding style which is suggested by the TLM 2.0 standard [140] for software development.

8.3.1 Virtual Processing Element

In the MVP simulator, a VPE is a high-level abstraction of a processing element and the operating system running on it. It is responsible for controlling the execution of the software tasks mapped to it. This includes the decisions of: which task should be executed (i.e. scheduling), how long the selected task should run (i.e. time slicing), and how many operations the task should perform in the given time slot (i.e. execution speed). Conversely, tasks can also request OS services from the VPE, such as sleeping for n microseconds. The interaction between a VPE and its tasks can be

roughly illustrated as in Figure 8.2, where the VPE sends out an event TASK_RUN to let Task1 execute 200,000 operations within 1 ms, and the task requests to sleep for 50 ms during its execution.

Figure 8.2: VPE-Task Interaction Example (the VPE grants Task1 a 1 ms time slice of 200,000 operations via TASK_RUN; the task requests a 50 ms sleep via TASK_REQ_SLEEP)

Technically, tasks and VPEs are implemented as SystemC modules. Their interaction is realized by events which are communicated through TLM channels between them. The following paragraphs introduce the semantics of the VPE model and the task model.

VPE Model

A VPE can be described as an event-driven finite state automaton, which is a 7-tuple $V = (S, s_0, U, O, I, \omega, \nu)$ with:

• $S$ = {RESET, RUN, SWITCH, IDLE} is the set of explicit states;

• $s_0$ = RESET is the initial state of the VPE;

• U is the set of internal variables which represent the implicit state, like TIME_SLICE_LENGTH, etc.;

• O is the set of output events for task control, e.g. TASK_RUN;

• I is the set of input events which can be sent by tasks, e.g. TASK_REQ_SLEEP;

• $\omega : S \times I \rightarrow O$ is the output function, in which functionalities like scheduling are implemented; and

• $\nu : S \times I \rightarrow S$ is the next-state function, which manages the OS state.

From a user's perspective, a VPE simply appears as a parameterized abstract processor. The settings which can be configured by the user are mainly the clock frequency, the scheduler and the task mapping. Presently, three scheduling algorithms are implemented in the VPE:

• round-robin: tasks are executed one after another with the same period;

• earliest deadline first: the task whose deadline comes first is executed first; and

• priority based scheduling: the task with the highest priority is executed first.

These parameters can be adjusted both before and during the simulation. This allows the programmer to change the platform and check the application behavior in different scenarios conveniently. Besides, the event history can be saved by the simulator in the form of Value Change Dump (VCD) files, which can be used to inspect the VPE activity after the simulation for debugging purposes.


Task Model

A task in the MVP simulator can be seen as an event-driven nondeterministic finite state automaton, which is a 6-tuple $T = (S, s_0, I, O, \omega, \nu)$ consisting of:

• $S$ = {READY, RUN, SUSPEND} is the set of task states;

• $s_0$ = READY is the initial state;

• I is the set of input events, which are sent by the VPE, like TASK_RUN;

• O is the set of output events, such as TASK_REQ_SLEEP;

• $\nu : S \times I \rightarrow P(S)$ is the next-state function; since $\nu$ can return multiple states, the power set of S ($P(S)$, which includes all possible subsets of S) is used as its codomain; and

• $\omega : S \times I \rightarrow P(O)$ is the output function; similar to $\nu$, its result lies in the power set of O, $P(O)$.

It can be seen from the above definition that the task model is nondeterministic in that the next-state function can return a set of possible states. In reality, this corresponds to the case that the status of a task can change from RUN to either READY or SUSPEND: the former normally happens when the granted time slice is used up, and the latter can occur when the sleep function is invoked in the code in order to suspend the task for a while. Figure 8.3 shows the state transition diagrams of the VPE and the task, together with an example event which can be exchanged during the simulation. The VPE decides in the SWITCH state which task should run, and then wakes the READY task with a TASK_RUN event. Afterwards, the VPE switches to the RUN state. Correspondingly, upon receiving the TASK_RUN event, the state of the task is immediately switched from READY to RUN. In this model, the application behavior, which is encapsulated in the user-provided shared library, serves as the output and state transition function of the task and is executed in the RUN state. Since the programmer only provides the C source code, i.e. the functionality of the application, the control of the state machine needs to be inserted. The MVP provides a software tool chain which hides this insertion procedure from the developer and keeps the application C code intact so that it can be reused. Section 8.4 gives more details about this when the programming support is discussed.

Figure 8.3: VPE/Task State Transition: a) VPE (states RESET, SWITCH, RUN, IDLE); b) Task (states READY, RUN, SUSPEND)


8.3.2 Generic Shared Memory

Since the MVP is focused on high-level functional simulation, no complex interconnect is modeled in it. Nonetheless, in order to enable the communication between parallel tasks, the MVP provides the so-called Generic Shared Memory (GSHM). From a programmer's view, the use of the GSHM is similar to the dynamic memory management functions of the C runtime library, except that a string is used to identify each GSHM block so that the communicating tasks can refer to the same block. An example is shown in Figure 8.4: two tasks communicate through a GSHM block which is identified by the key string “SHM”. Inside the simulator, all shared memory blocks are centrally managed by the GSHM manager, which keeps a list of them. Since SystemC simulates all modules one after another, i.e. sequentially, no parallel access to the GSHM is possible during the simulation, and all GSHM accesses are thereby automatically synchronized.

Task1:

    void main() {
        char* p;
        char in_char;
        /* FindGSHM tries to get the address of a GSHM block;
           it returns null if no block with the given key is found. */
        while (!(p = FindGSHM("SHM")));
        in_char = *p;
        ...
    }

Task2:

    void main() {
        char* p;
        char out_char = 'a';
        /* GSHMMalloc initializes a GSHM block and returns its address. */
        p = GSHMMalloc("SHM", 128);
        *p = out_char;
        ...
    }

Figure 8.4: Generic Shared Memory Example (both tasks refer to the same 128-byte block registered under the key “SHM” in the GSHM manager)

The previous example shows a scenario where the data is passed by value, which typically occurs in multi-processor systems where each parallel task has its own address space and data must be copied between tasks explicitly. Nonetheless, the use of the GSHM can be much more flexible. Since the user-provided application binaries are loaded into one simulation process, all tasks running in the MVP simulator implicitly share one common address space. This implies that pointers can also be transferred between tasks without being invalidated, which gives an execution environment similar to an SMP (Symmetric Multiprocessing) machine, where the processors share the same address space. In the MVP, programmers have the freedom of choosing the most suitable communication mechanism for the application; a small sketch of pointer passing is given below. Moreover, this flexibility is also helpful for code partitioning, because the programmer does not have to convert all implicit communications into explicit ones in order to test the functionality of the parallelized application. The MVP provides a relaxed test environment in which partially partitioned prototypes can be simulated and used as intermediate steps towards a cleanly parallelized application. Finally, it needs to be mentioned that the GSHM itself just provides a primitive but easy-to-use way of sharing data between tasks in the MVP; how the data is transferred on the target still depends on the implementation of the real platform.
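The following hedged C fragment illustrates this pointer-passing style; the struct and key name are invented for the example, and the GSHMMalloc/FindGSHM prototypes are assumptions derived from their use in Figure 8.4.

    #include <stdlib.h>

    void* GSHMMalloc(const char* key, int size);  /* assumed signatures */
    void* FindGSHM(const char* key);

    typedef struct { int len; char payload[64]; } Frame;  /* illustrative */

    /* Producer task: publish a pointer through a GSHM slot. */
    void producer(void) {
        Frame** slot = (Frame**)GSHMMalloc("PTR_SLOT", sizeof(Frame*));
        Frame* f = malloc(sizeof(Frame));
        f->len = 64;
        *slot = f;   /* valid: all tasks live in one simulation process */
    }

    /* Consumer task: fetch the pointer and use the data without copying. */
    void consumer(void) {
        Frame** slot;
        while (!(slot = (Frame**)FindGSHM("PTR_SLOT")) || !*slot);
        int n = (*slot)->len;   /* direct access, unlike pass-by-value */
        (void)n;
    }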


8.3.3 User Interface and Virtual Peripheral

As the MAPS framework focuses on software programming, its users are not likely to be experts in architecture modeling. Therefore, an easy-to-use interface is required to simplify the creation of MPSoC models and the control of the simulation. With this goal in mind, the GUI of the MVP simulator has been designed and implemented. Figure 8.5 shows a screenshot of the GUI. It can be used by the MVP user to:

• Create abstract MPSoC models;

• Control the simulation process;

• Get application performance feedback; and

• Interact with the application running in the simulator.

To start creating an abstract MPSoC model, the user only needs to click a button to instantiate the desired number of VPEs, and a VPE control panel is displayed in the center of the GUI for each VPE. Using the control panel, the user can set the VPE clock frequency and select the scheduling algorithm of the VPE's abstract OS. Since the interconnection between processing elements is not modeled in the MVP, no further configuration is needed. Thanks to this simple modeling process, the creation of the model can be done within minutes. Applications or tasks are loaded into the simulator by using the GUI to select the corresponding native binaries. After the loading succeeds, the names of the applications are displayed in the global task list on the left-hand side of the GUI. By dragging an application from the global task list to the VPE task list in a VPE control panel, its mapping to the VPE is done, which is very intuitive. The whole configuration of an MVP simulation can thus be accomplished with the GUI through mouse operations alone. The high-level simulation process is controlled through the three buttons at the top middle of the GUI window, which start, pause and stop the simulation. When the simulation is running, the overall VPE usage information is displayed at the bottom of the VPE control panel, and the VPE usage of each individual task or application is displayed directly in the VPE task list. Such runtime information provides programmers with a rough estimate of the load of the processing elements. Based on this information, the programmer can try to optimize the system by experimenting with different VPE configurations and task-to-VPE mappings.

Figure 8.5: MVP User Interface (platform configuration, simulation control, VPE configuration, global and VPE task lists, VPE usage information, virtual peripheral)


Since the MVP simulator supports changing the task mapping and the VPE settings at runtime, such experiments can be done dynamically without stopping the simulation, which saves development time. Besides, part of the MVP GUI is a virtual display device (on the left of the GUI window in Figure 8.5), through which graphical and textual information can be shown directly without using the host machine's facilities. Moreover, the GUI supports forwarding keyboard input from its user to the application running in the simulator; thereby, interactive applications can also be simulated in the MVP. On the host machine, the GUI runs in a separate thread in order to avoid interfering with the SystemC thread directly. The communication and synchronization between them are realized through mutexes and messages on the host machine.

8.4 Software Tool Chain

The MVP software tool chain is mainly responsible for two things:

• First, it provides an execution environment for MPSoC applications. A set of API functions is provided for applications to perform inter-VPE communication, scheduling, etc.

• Second, in order to be simulated in the MVP, applications need to be given to the simulator as special dynamic libraries, and the creation of such libraries is supported by the tool chain.

8.4.1 Execution Environment

The programming language supported by the MVP is C. This is mainly due to the fact that C is until today the most used programming language for embedded software development. The difference between running C applications on a real MPSoC platform and in the MVP is illustrated in Figure 8.6. In a typical MPSoC platform, applications run on top of the whole hardware/software hierarchy: as Figure 8.6a shows, below an application there are middleware such as a network protocol stack, an operating system such as Linux, and the MPSoC hardware. In the MVP, however, the operating system and the MPSoC hardware platform are abstracted by the MVP simulator. Between the simulator and the application, there is the MVP API, which allows the latter to interact with the former.

Figure 8.6: MVP Execution Environment: a) Typical MPSoC (application / middleware / OS / hardware); b) MVP (application / MVP API / MVP simulator)

The MVP API consists of a small number of functions which enable the communication and synchronization between tasks, the interaction with the VPE in the simulator, and the access to the virtual peripherals. The interface is defined in C, which is the programming

language supported by the MVP, and a short summary of the available functions is given in Table 8.1.

Functions                                               Description
Suspend, Yield, GetTaskStatus, WakeTask                 Synchronization
GSHMMalloc, FindGSHM, GSHMFree                          Shared memory support
GetTime, SleepMS, MakePeriodicTask, WaitNextPeriod      Scheduling
DisplayRenderPixel, DisplayRenderText                   Virtual peripheral support

Table 8.1: MVP API Functions

It can be seen that advanced features like message queues are not available in the MVP, and the functions provided are primitive in terms of their functionality. This is intended, for several reasons. First, the number of functions is kept small so that the programmer does not need much effort to learn them and can easily start programming. Second, the functionality is generic and simple, so that the application source code can be kept as MVP-independent as possible; once the target platform is available, the code developed on the MVP should be reusable with a minimum amount of additional effort. The virtual peripheral functions are provided to visualize data in graphical and textual form. For programmers, this is not the only way supported by the MVP to get output from an application running in the simulator: host I/O functions like fread and printf can be used directly by native tasks without changing the code.
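To give an impression of the programming style, the following hedged sketch uses functions from Table 8.1 in a simple periodic task; the exact parameter lists of MakePeriodicTask, WaitNextPeriod and DisplayRenderText are not documented here, so the signatures shown are assumptions.

    #include <stdio.h>

    /* Assumed prototypes; only the names are taken from Table 8.1. */
    void MakePeriodicTask(int period_ms);
    void WaitNextPeriod(void);
    void DisplayRenderText(int x, int y, const char* text);

    void main(void) {             /* MVP tasks have their own main, cf. Figure 8.4 */
        MakePeriodicTask(40);     /* assumed: declare a 40 ms period */
        for (int frame = 0; frame < 100; frame++) {
            /* ... compute one frame ... */
            DisplayRenderText(0, 0, "frame done");  /* virtual peripheral */
            WaitNextPeriod();     /* block until the next period starts */
        }
        printf("finished\n");     /* host I/O also works unchanged */
    }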

8.4.2 Code Generation

Given the MVP API, writing software for the MVP simulator is mostly like normal C application development, except that the code generation flow differs from that of a host application: instead of being compiled into native executables, tasks must be given to the simulator as shared libraries. Since the simulator itself does not contain any software behavior, the behavior is brought into the simulation through these libraries. One important piece of information determined during the code generation process is the timing of the application. Since a piece of C code by itself does not give any hint about how much time it needs to execute, external tools are necessary to determine the timing information. Currently, the MVP tool chain supports three different code generation flows for MPSoC applications, namely native, virtual cross-compilation and instruction-set simulation. Figure 8.7 gives an overview of all three approaches.

• Native: the source code is instrumented before being compiled into native binaries. Since the instrumentation does not consider the processing elements of the potential MPSoC platform, this flow is completely target independent and fast, but the timing information it introduces is the least accurate.

• Virtual Cross-Compilation: the source code is first processed by a special virtual cross-compiler, which is similar to a cross-compiler but processes the code natively, and then compiled into native binaries. With this flow, the performance estimation of the application is improved in comparison to the native flow, because the virtual cross-compiler takes the target processor information into account. Nevertheless, the execution speed of the application is lower.

• Instruction-Set Simulation: in this flow, a cross-compiler is used to translate the source code into a target binary, which is then loaded into an ISS wrapped in a dynamic library. Whereas the previous two flows are not binary compatible with the target MPSoC platform, this approach allows the programmer to develop code which is binary compatible with the target. However, since the applications are executed in an ISS, whose speed is normally much lower than the native execution speed, this is the slowest of the three flows.

Figure 8.7: MVP Code Generation Flows

Native Flow

The native flow supported by the MVP provides developers a processor-independent way to run their applications. In order to allow user applications to run in the SystemC-based simulator, several issues need to be solved. First, C applications have their own main functions, which would conflict not only with the one in the SystemC kernel but also with each other if they were directly linked together. This problem is solved in the MVP by compiling the applications into shared libraries and using the task modules in the simulator as loaders which load them dynamically before the simulation starts. During the dynamic load process, the functions in the libraries are all treated as extern; hence no conflict occurs even if a function has the same name as one in the simulator. Moreover, as mentioned earlier, pure C code is natively executed in a SystemC simulation without advancing the clock, i.e. no time is spent in the code, which does not reflect the real behavior of the application. Normally, programmers are required to manually annotate the source code by inserting calls to the SystemC wait function in order to let the software consume simulation time. However, since such approaches introduce extra constructs into the source which are irrelevant to the application, the reusability of the code is reduced. For this reason, the MVP provides a user-transparent and target-independent solution for the code generation of native tasks. Figure 8.8 gives an overview of the tool flow. The tools are developed based on the LLVM compiler framework [141]. The application source code is first parsed by the LLVM C frontend and optimized if required by the programmer. Afterwards, an MVP code instrumentor is used, which automatically inserts calls to the step function between basic blocks for consuming simulation time. The behavior of the step function is very simple. It first consumes time and then checks whether the time slot granted by the VPE is used up. If so, it switches the status of the task to READY, sends out an event to inform the VPE of the change, and starts waiting for the next TASK_RUN event. Otherwise, the execution of the task simply continues.

Figure 8.8: Native Code Generation Flow

For the computation of the elapsed simulation time, since the native flow is focused on high-level functional simulation and no specific processor information is available here, the calculation is simplified by assuming that each VPE uses one clock cycle to perform a C-level operation like an addition. The instrumentation is done per basic block so that a very high simulation speed can be achieved. It needs to be mentioned that using a cost table as in the profiling process would also be possible in the native flow. Nevertheless, since the virtual cross-compilation flow already estimates the application performance better than the cost table approach [142], it is not employed in the current MVP. Finally, it needs to be mentioned that the whole flow is automatic and does not require any manual code annotation. The C source code is kept intact and thus can potentially be further reused for the target platform. In the simplest case, a sequential application can be brought to the MVP by just replacing the native compiler with the MVP tool chain.
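The effect of the instrumentation can be pictured with the small hedged example below; the call step(8) between basic blocks appears in the snippet of Figure 8.8, while the assumed prototype and the operation counts are illustrative.

    #include <stdio.h>

    /* Assumed prototype of the MVP runtime's step function: advance the
     * simulated clock by `ops` cycles (one cycle per C-level operation)
     * and yield to the VPE if the granted time slice is used up. */
    void step(unsigned ops);

    int main(void) {
        int a = 2, b = 3, c;
        c = a * b + a;            /* basic block 1 */
        step(8);                  /* inserted by the MVP code instrumentor */
        printf("%d\n", c);        /* basic block 2 */
        step(3);                  /* inserted: assumed cost of block 2 */
        return 0;
    }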

Virtual Cross-Compilation Flow

The virtual cross-compilation flow uses a special code transformation module called the virtual compiler backend, which was initially introduced in the Totalprof framework [142]. The MVP uses this backend to improve the accuracy of the software timing. An overview of the tool flow is given in Figure 8.9. The first two steps of the tool flow do the same as their counterparts in the native flow: the input C source code is translated by the frontend into an Intermediate Representation (IR), and machine-independent optimizations are performed on the IR. Afterwards, the optimized IR is processed by the virtual compiler backend, which is a special IR-to-IR transformation module. It has all components necessary for a cross-compilation process, such as code selection, register allocation and scheduling. Nevertheless, the code produced by the virtual compiler backend is not machine assembly but LLVM IR, which is called virtual assembly here. Since the virtual assembly is very similar to the target assembly, the performance estimation can easily assess the timing information of each basic block from it. Besides, potential micro-architecture level events such as pipeline stalls are taken into account during the analysis. The resulting performance information is then used by the following MVP instrumentation module to insert the step function calls which later consume simulation time in the simulation. Finally, a native compiler backend is used to

111 112 CHAPTER 8. HIGH-LEVEL SIMULATION generate native shared libraries.

The virtual compiler backend internally performs code selection, register allocation, scheduling, and code emission. An excerpt of its output, the virtual assembly, looks as follows:

    //Virtual Assembly (LLVM IR)
    _BBi:
        ...
        call void @addiu(i32* @r1, ...)
        call void @add(i32* @r1, ...)
        call void @lw(i32* @r1, ...)
        call void @sw(i32* @r2, ...)
        br label %_BBj

Figure 8.9: Virtual Cross-Compilation Code Generation Flow

Since the virtual compiler backend only performs an IR-to-IR translation and the final outputs of the whole flow are still native, a high simulation speed can be achieved (usually more than 1,000 MIPS for a single processing element [142]). Considering that both the instruction set and the micro-architecture of the target processor are taken into account by the performance estimation, the high speed does not come at the cost of accuracy. The virtual compiler backend is target processor dependent; for each specific processor architecture, a corresponding backend needs to be developed. The backend is template-based, and all its components are retargetable. The retargeting process is driven by a processing element description, which supports different types of processor architectures, such as RISC, DSP, VLIW, and ASIP. Overall, the required effort is significantly less than for traditional compiler retargeting, because the generated code is still native and no rigorous Application Binary Interface (ABI) level compatibility with the target processor has to be achieved.
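
The reason why virtual assembly remains natively executable can also be illustrated with a small sketch: each target instruction in the virtual assembly corresponds to a call to a small host function operating on a virtual register file. The function names below mirror the listing in Figure 8.9, but the register file and the operand conventions are simplified assumptions, not the actual backend output.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified virtual register file; the real backend models the full
     * target register set. */
    static int32_t r1, r2;

    /* Each "virtual instruction" is an ordinary host function, so the code
     * stays natively compilable while mirroring target instructions. */
    static void addiu(int32_t* rd, int32_t rs, int32_t imm) { *rd = rs + imm; }
    static void add(int32_t* rd, int32_t rs, int32_t rt)    { *rd = rs + rt; }

    int main(void)
    {
        addiu(&r1, 0, 5);        /* cf. call void @addiu(i32* @r1, ...) */
        add(&r2, r1, r1);        /* cf. call void @add(...)             */
        printf("r2 = %d\n", r2); /* prints 10 */
        return 0;
    }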

Instruction-Set Simulation Flow

The third alternative provided by the MVP to run an application is using an ISS. This is supported mainly for the early MPSoC design stage, when the overall architecture of the target platform is still under design but the use of a certain processor is already determined. Besides, the combination of ISS tasks and the abstract OS model in the VPE allows the developer to evaluate the application behavior early as if an OS were available, even though the real OS for the target processor would be finished much later in the design. Of course, in both cases, an ISS is needed beforehand. The code generation procedure for ISS tasks is the same as a typical target compilation flow, where a cross-compiler is used. Since the target binary will be executed by an instruction-set simulator, assembly code is also allowed here, which gives the programmer more flexibility in writing the application. Nevertheless, in order to control the instantiation and the execution of the ISS, an extra wrapper is required, which is mainly responsible for instantiating the ISS, stepping the ISS according to the events sent from the VPE, and rerouting all accesses to the shared memories to the corresponding GSHM blocks.


Figure 8.10 depicts how the ISS is embedded into the MVP simulator. In the example, the accesses to the shared memory region of the target processor are intercepted by the wrapper and then rerouted through the VPE to the corresponding GSHM block. In this way, the ISS task is able to communicate with other tasks in the simulator.

Figure 8.10: ISS Embedded in the MVP simulator

The ISS wrapper is natively compiled into a shared library. From the VPE's point of view, an ISS task looks the same as a native task, because both use the same TLM interface to interact with the VPE; hence, running them together in one simulation is also allowed in the MVP. Currently, wrappers have been developed for the ISS generated by the LISATek processor designer [97]. Other instruction-set simulators can also be supported, as long as they can be controlled by an ISS wrapper. Besides, the wrapper is application independent; its development is therefore a one-time effort. Once a wrapper is developed for an ISS, it can be reused for different applications.
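
The following is a minimal sketch of the two wrapper responsibilities described above: stepping the embedded ISS when the VPE grants a time slot, and rerouting shared-memory accesses to a GSHM block. The function names, the shared-memory window, and the stubbed ISS are assumptions made for illustration; the real LISATek simulator API differs.

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-ins: a real ISS (e.g. LISATek-generated) and the GSHM block
     * belong to the simulator; these stubs only make the sketch
     * self-contained. */
    static uint32_t gshm[256];              /* one GSHM block (stub)        */
    static uint32_t local_mem[256];         /* the processor's local memory */
    #define SHM_BASE 0x80000000u            /* assumed shared-memory window */

    static void iss_step_cycles(uint64_t n) { (void)n; /* run target code */ }

    /* Step the ISS for the time slot granted by the VPE (TASK_RUN event). */
    void wrapper_on_task_run(uint64_t slot_cycles)
    {
        iss_step_cycles(slot_cycles);
    }

    /* Reroute loads that hit the target's shared-memory region to the GSHM
     * block; everything else goes to the local memory. */
    uint32_t wrapper_read32(uint32_t addr)
    {
        if (addr >= SHM_BASE && addr - SHM_BASE < sizeof gshm)
            return gshm[(addr - SHM_BASE) / 4];
        return local_mem[(addr / 4) % 256];
    }

    int main(void)
    {
        wrapper_on_task_run(1000);
        printf("shm[0]=%u\n", (unsigned)wrapper_read32(SHM_BASE));
        return 0;
    }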

8.5 Debug Method

One advantage of using the MVP for parallel application development is that debugging is much simpler than on a real multiprocessor machine. Since the simulation is based on SystemC and executes in one thread, the behavior of the application is independent of the host machine and thus fully deterministic. Besides, no special debugger is needed to debug natively executed tasks: the instrumentation step in their code generation flow does not destroy the debug information, so to a debugger they look just like normal host applications. Therefore, the programmer can use any host debugger like GDB to connect to the simulation thread and debug the code. For ISS tasks, the situation is a little more complicated. Their debugger support depends on the ISS used and its wrapper: the ISS must support the connection from a debugger, and the wrapper needs to be implemented such that it can block the simulation and wait for the connected debugger to finish the user interaction. The LISATek generated simulator and the corresponding wrapper fulfill these requirements; thus, tasks running in the ISS can also be debugged in the MVP. In a mixed simulation, where ISS tasks are executed together with native tasks, both the ISS debugger and the host debugger can be used concurrently, as depicted in Figure 8.11. The host debugger is directly connected to the simulator for debugging native tasks; at the same time, the debugger for the ISS is connected through a channel which is instantiated by the wrapper. In addition to source-level debugging, the MVP also provides facilities for monitoring system-level events that occur in the simulator. For example, the user can use the GUI to enable the generation of VCD trace files which record the time and the duration of the activation of all tasks. This gives the developer an overview of the application execution process.

Figure 8.11: Debugger Usage with the MVP

8.6 Summary

High-level simulation is employed in the MAPS framework mainly for early MPSoC software development. It is supported by the MAPS virtual platform, which consists of a generic MPSoC simulator and a software tool chain. The former models MPSoC architectures at a very high abstraction level, and the latter translates the application source code into binary executables which are dynamically loaded into the simulator. For MPSoC programmers, the MAPS virtual platform provides the possibility of simulating applications early in order to verify their functionality. Besides, a rough performance estimate can also be obtained. In the next chapter, the use of the high-level simulation is presented through case studies.

Chapter 9 Case Study

This chapter presents results which demonstrate the applicability and efficiency of the MAPS methodology proposed in this thesis. Two applications from the multimedia domain and several MPSoC platforms are used as targets. The case studies are carried out with the following objectives in mind:

• To demonstrate how the tools provided by the MAPS framework can help programmers with the development of MPSoC applications, and • To show the efficiency of the framework by examining the performance of the resulting applications on target platforms.

The rest of this chapter is organized as follows: the first case study uses a JPEG [143] encoder as the target application; Section 9.1 shows how it is parallelized in the MAPS framework for the TCT [73] MPSoC. Section 9.2 shows the results of a parallel H.264 baseline decoder for several different MPSoC platforms. Finally, Section 9.3 summarizes this chapter.

9.1 JPEG Encoder

JPEG (Joint Photographic Experts Group) is a widely used compression method for photographic images. The degree of compression can be adjusted, allowing a selectable trade-off between storage size and image quality. In this case study, a JPEG encoder application is used, whose sequential C implementation is available as the starting point. It is a modified version of the encoder in a free JPEG library developed by the Independent JPEG Group [144]. As mentioned in Section 2.3.10, the TCT MPSoC platform [73] is used here as target.

9.1.1 TCT Tools

As a backend in the MAPS development flow, the TCT platform is responsible for binary code generation, instruction-set simulation and parallel performance profiling [73]. This case study mainly uses the following TCT tools:

• TCT Compiler, which takes C code with THREAD annotations as input, analyzes inter-procedural dependence flows including pointer dereferences, inserts thread communication instructions for thread activation, data transfer and data synchronization, and finally generates partitioned thread executable binaries. Its concurrent execution model is constructed upon a hierarchy of functional pipelining and task parallelism for a fully distributed memory system, which guarantees race-free, deterministic behavior.

• TCT Simulator which consists of a set of cycle-accurate instruction-set simulators for parallel thread execution. The communication on the full crossbar interconnect is also modeled at cycle accuracy. For convenience, the number of processors can be configured in the simulator and the user can instantiate more than 6 processors for testing purposes.

• TCT Trace Scheduler, which is a fast performance estimation tool. It uses an abstract computation/communication model to quickly report the estimated execution time, with an error of only a few percent compared to the TCT Simulator.

The TCT platform used in the case study does not support sharing one processor among multiple parallel threads. Due to this limitation, the temporal and spatial task mapping problem is trivially solved: one task per processor.

9.1.2 Profiling

To develop a parallel version of the encoder, the sequential one is first profiled in the MAPS framework. For this purpose, the processing element cost table of the MAPS architecture description is adapted to the TCT architecture in order to reflect the corresponding computation cost. Figure 9.1a shows an overview of the call graph of the sequential JPEG encoder. Since the node color in the graph reflects the share of each function in the overall execution cost, one can directly identify the heavily weighted functions. In Figure 9.1c, the 5 most time-consuming functions are listed along with their numbers of calls and execution costs. Among these functions, JPEGtop is the most interesting one, because it is invoked only once by the main function but occupies almost 100% of the overall execution cost. Besides, it directly calls the BLK8x8 and ReadOneLine functions, which are also in the list. Figure 9.1b gives an enlarged view of part of the call graph, which shows the relation between JPEGtop, BLK8x8 and DCTcore. Based on the profiling information provided by MAPS, it can be seen that the focus of parallelization should be put on the JPEGtop function. Nevertheless, to extract the most parallelism from the application, the other computation-intensive functions in the list are also passed to the MAPS partitioning tool.

9.1.3 Partitioning

To obtain an initial partitioning of the application, the MAPS automatic partitioning tool is used. For the JPEGtop function, it generates 5 tasks which cover all the hot spots in the function. Figure 9.2 is a screenshot of the MAPS IDE showing the partitioning result. It can be seen that all the hot spots of the function are recognized automatically by the tool, and tasks are generated which cover the corresponding code blocks.


a) Call Graph Overview; b) Part of the Call Graph; c) Top 5 of the Most Time-Consuming Functions:

Function                Number of Calls (Times)   Total Cost (%)
main                    1                          100
JPEGtop                 1                          100
BLK8x8                  2850                       57
ReadOneLine             304                        38
DCTcore                 2850                       25

Figure 9.1: Profiling Result of the JPEG Encoder

Among the other top 5 functions, the DCTcore function is not partitioned, because the Discrete Cosine Transformation (DCT) it performs is typically executed as an atomic transformation on a block of data. Besides, the main function is also excluded from partitioning, for it simply calls the JPEGtop function and does not perform further control or computation. Table 9.1 gives a summary of the tasks in the initial partition created by the MAPS partitioning tool.

Function      Number of Tasks
JPEGtop       5
BLK8x8        2
ReadOneLine   4

Table 9.1: Initial Partition of the JPEG Encoder Application

Generating target code from tasks for the TCT platform is very straightforward. A code generator is developed, which automatically inserts the TCT THREAD annotation around those code blocks which belong to tasks. The MAPS user can directly invoke the TCT code generator through the MAPS IDE.
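
A minimal sketch of what the generated code might look like is given below. The exact annotation syntax is defined by the TCT compiler [73]; THREAD is modeled here as a no-op macro purely so that the sketch compiles stand-alone, and all task names and function signatures are invented for illustration.

    #include <stdio.h>

    /* The exact TCT THREAD syntax is defined by the TCT compiler [73]; it
     * is modeled as a no-op macro so the sketch compiles with a plain C
     * compiler. Function signatures are invented for illustration. */
    #define THREAD(name)

    static int line_buf[176];

    static void ReadOneLine(int* buf)  { buf[0] = 1; }            /* stub */
    static void BLK8x8(const int* in)  { printf("%d\n", in[0]); } /* stub */

    void JPEGtop(void)
    {
        THREAD(read_line) {          /* inserted by the MAPS code generator */
            ReadOneLine(line_buf);
        }
        THREAD(encode_blk) {         /* inserted by the MAPS code generator */
            BLK8x8(line_buf);
        }
    }

    int main(void) { JPEGtop(); return 0; }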


Figure 9.2: Partitioning Result of the JPEGtop Function

9.1.4 Simulation Result and Further Refinement

To examine the performance of the parallelized encoder, the TCT simulator is used to execute the code. Before that, the code generated by MAPS must first be compiled by the TCT compiler. The initial partition produced by the MAPS tools gives a 3.61x speedup using 16 processors, which corresponds to a parallel efficiency of 0.23. Here, the parallel efficiency E(P) for an application running on P processors is defined in the usual way [145] by

    E(P) = Speedup / P    (9.1)

so that, e.g., 3.61/16 = 0.23. By examining the graphical execution traces provided by the TCT trace scheduler, one can find out that another THREAD annotation is needed to obtain a hierarchical pipeline (step 2). Since this is a special feature of TCT which is not commonly available in other MPSoCs, the MAPS partitioning tool currently does not automatically exploit this potential. However, the programmer can manually modify the automatic partitioning result through the MAPS IDE in order to improve it. After the second step, the speedup of the application rises to 5.48x using 17 processors, i.e. a parallel efficiency of 0.32. In comparison, a manually partitioned JPEG encoder version included in the TCT tools achieves a 9.43x speedup using 19 processors. Table 9.2 summarizes the speedup results.

Partition   Speedup (x)   Number of Processors   Parallel Efficiency
Initial     3.61          16                     0.23
Step 2      5.48          17                     0.32
Manual      9.43          19                     0.50

Table 9.2: Speedup of the JPEG Encoder Application

Finally, it needs to be mentioned that the whole parallelization using the MAPS framework is carried out in around two hours. Figure 9.3 shows the time spent by the MAPS tools on a PC workstation with a 2.67 GHz quad-core CPU and 8 GB RAM. It can be seen that the execution of the MAPS tools takes only a small fraction of the overall parallelization time. Besides, 10 of the total 12 THREAD annotations generated by the MAPS partitioning tool are also recognized as tasks in the manual partition. Only the BLK8x8 function is partitioned differently by the MAPS user than in the manual partition, which is the cause of the speedup gap between the manual and the semi-automatic partition. This is mainly due to the complex data dependencies between different executions of this function across different loop iterations. A possible solution to overcome this problem is to extend the data dependency analysis, e.g. by taking loop information into account.

Tool flow and time spent: C source code and architecture model as inputs; Profiling: ~3 min; CDFA & Partitioning: ~5 min; Code Generation: <1 min; output: TCT code for the TCT platform.

Figure 9.3: Time Spent by the MAPS Tools

9.2 H.264 Baseline Decoder

The second case study is an H.264 [146] baseline decoder. H.264 is a standard for video compression, which is able to achieve a high compression ratio through the use of algorithms with high computational complexity. The goal of this case study is to develop a parallel H.264 baseline decoder for two MPSoC platforms, the Multi-ARM platform and the Multi-LTRISC/LTVLIW platform. Like in the previous case study, a sequential C implementation of the decoder is given as the starting point.

9.2.1 Target Platform

Two MPSoC platforms are used in this case study as targets:

• Multi-ARM, which is a homogeneous MPSoC consisting of multiple ARM926-EJS [81] processors;

• Multi-LTRISC/LTVLIW, which employs a combination of RISC and VLIW processors.


Multi-ARM

The Multi-ARM platform is an MPSoC platform which is developed in-house [80]. It has a configurable number of ARM926-EJS processor cores and some peripheral modules for tasks like I/O, display, etc. A large shared memory holds both the instructions and the data of all the ARM cores in the MPSoC. The interconnection between the components is realized through an AMBA bus [147]. Since the platform is still experimental, no hardware prototype has been built for it. In order to test the application and measure the parallel performance, a SystemC based virtual prototype is built using the CoWare Virtual Platform [52] tool suite, which simulates the functionality of the experimental platform. The ARM processor is modeled with an instruction-accurate ISS in the virtual prototype. An overview of the platform is shown in Figure 9.4.

Figure 9.4: Multi-ARM Platform

Within the platform, a light-weight OS is responsible for scheduling the parallel tasks dynamically. On top of that, tasks access the low-level OS services through API functions. The communication between tasks is realized by using the shared memory, and the operating system provides semaphore support for synchronizing shared memory accesses. Unlike the TCT platform, which has a compiler capable of compiling C code with THREAD annotations, the Multi-ARM platform does not have such advanced compiler support. Therefore, for this platform, the transformation from the tasks identified by the MAPS framework to target C code with the target OS API is done manually.
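
The following sketch illustrates this communication pattern, i.e. shared-memory data exchange guarded by OS semaphores. The API names and the busy-waiting stub implementations are assumptions made for illustration only; the platform's light-weight OS provides the real primitives.

    #include <stdint.h>

    /* Hypothetical OS semaphore; the real Multi-ARM OS type and API differ.
     * The busy-waiting stubs below only make the sketch self-contained. */
    typedef struct { volatile int count; } os_sem_t;

    static void os_sem_wait(os_sem_t* s) { while (s->count <= 0) {} s->count--; }
    static void os_sem_post(os_sem_t* s) { s->count++; }

    static os_sem_t buf_free = {1}, buf_full = {0};
    static uint8_t shm_buf[4096];        /* region inside the shared memory */

    /* Producer task: write one item into the shared buffer. */
    void producer_task(void)
    {
        os_sem_wait(&buf_free);          /* wait until the buffer is free   */
        shm_buf[0] = 42;                 /* data for the consumer task      */
        os_sem_post(&buf_full);          /* signal data availability        */
    }

    /* Consumer task: read the item written by the producer. */
    int consumer_task(void)
    {
        os_sem_wait(&buf_full);
        int value = shm_buf[0];
        os_sem_post(&buf_free);
        return value;
    }

    int main(void) { producer_task(); return consumer_task() == 42 ? 0 : 1; }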

Multi-LTRISC/LTVLIW

The Multi-LTRISC/LTVLIW platform is another in-house experimental MPSoC platform. It uses two processing elements as basic building blocks, the LTRISC and the LTVLIW processor, both taken from the CoWare IP library. The former is a simple RISC processor with 5 pipeline stages, and the latter is a 4-issue VLIW processor using the same basic instruction set as the former. The platform is configurable, and the number of RISC and VLIW processing elements can be adjusted at design time according to the requirements. Each processor in the platform has a local memory for storing instructions and data, and the platform has a block of memory shared between all processors. All components of the platform are connected using a SimpleBus, which is a simple bus model shipped with the OSCI SystemC kernel [119]. Two variants of the platform are used in this case study, Multi-LTRISC and Multi-LTRV. The former is a homogeneous MPSoC in which all processing elements are LTRISC processors; Figure 9.5a illustrates its basic structure. The latter is configured as a heterogeneous MPSoC with one LTRISC processor and the rest being LTVLIW processors, as Figure 9.5b shows. Like for the Multi-ARM platform, virtual prototypes are built for testing and benchmarking the target application, in which the processors are simulated at cycle accuracy.

a) Multi-LTRISC; b) Multi-LTRV

Figure 9.5: Multi-LTRISC/LTVLIW Platform

The Multi-LTRISC/LTVLIW platform does not have operating system support. Therefore, no temporal mapping is possible on this platform, and the task-to-processor mapping is simplified to one task per processor. Tasks communicate with each other through the shared memory. Since no automatic target code generator is available, the code generation for the platform is done manually.

9.2.2 Profiling

The process of profiling is similar to that of the previous case study, except that the cost table in the MAPS architecture description is adapted to the Multi-ARM and the Multi-LTRISC/LTVLIW platform, respectively. The LTVLIW processor is not taken into account, because the profiling method currently implemented in MAPS does not support processors with ILP well; this will be improved in future developments by using tools like Totalprof [142]. In this case study, the LTRISC processor is used for profiling the application for the Multi-LTRISC/LTVLIW platform. Figure 9.6a lists the 5 most time-consuming functions of the application, and Figure 9.6b shows the calling relationship between them. Note that the profiling result of the Multi-ARM platform is very close to that of the Multi-LTRISC/LTVLIW platform. In fact, the difference in the total cost percentages of the functions is smaller than 1%, which is negligible for the selection of hot spots. The results of the two platforms are therefore shown together in one column in Figure 9.6a. This can be explained by the fact that the LTRISC processor and the ARM9-EJS processor both feature a RISC instruction set and have a similar pipeline architecture. Based on the obtained profiling information, it can be seen that the DECODEH264Start function controls the execution of decoding. Both the syntax decoding and the Macro-Block¹ (MB) decoding are invoked by this function. Therefore, the parallelization of the decoder is focused on the DECODEH264Start function.

¹In the H.264 standard, a Macro-Block is defined as a 16x16 pixel block in a video frame.

Function                Number of Calls (Times)   Total Cost (%)
main                    1                          100
DECODEH264Start         1                          99
DecodeMacroBlock        99                         58
H264SYNTAX              1                          40
GetSyntaxDecodedFrame   1                          40

a) Top 5 of the Most Time-Consuming Functions; b) Part of the Call Graph

Figure 9.6: Profiling Result of the Sequential H.264 Baseline Decoder

9.2.3 Partitioning

The initial partitioning produced by MAPS contains three parallel tasks: the main task, the Syntax Decoder task, and the MB Decoder task, of which the latter two are located in the DECODEH264Start function and activated in order by the first one. Figure 9.7a shows the relation of the three tasks. Due to the sequential execution of the Syntax Decoder and the MB Decoder, the parallelism that can be exploited by this partitioning is limited. To further exploit the parallelism, the source code is manually analyzed, through which it is found that the body of the MB Decoder task is a loop which invokes the DecodeMacroBlock function once per iteration to decode a macro block. Since macro blocks in a video frame can be decoded in parallel, the loop is unrolled to enable parallel decoding of MBs. Given the information from the H.264 standard and a QCIF (width x height = 176x144) sized input video sequence, it follows that decoding 6 MBs simultaneously is possible. Therefore, 6 MB Decoders are instantiated in the improved partitioning. Besides, the main task and the Syntax Decoder are merged, because they execute one after the other, and it is not necessary to separate sequentially executed code. Figure 9.7b illustrates the partitioning result after the improvement, which is in fact an instance of the master/slave parallel programming paradigm: the Main task is the master, which controls the execution of a number of slave MB decoders, as the sketch below illustrates.
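
This sequential mock-up only illustrates the dispatch order of the improved partitioning; in the real decoder each DecodeMacroBlock call runs inside a slave task on its own processor. The MB count of 99 corresponds to one QCIF frame (11x9 macro blocks); all names except DecodeMacroBlock are illustrative.

    #include <stdio.h>

    #define NUM_MB_DECODERS 6       /* QCIF allows 6 MBs decoded in parallel */
    #define MBS_PER_FRAME   99      /* (176/16) * (144/16)                   */

    /* Stub standing in for the real DecodeMacroBlock(); in the parallel
     * decoder each slave task runs this on its own processor. */
    static void DecodeMacroBlock(int mb) { (void)mb; }

    int main(void)
    {
        /* Master (merged Main + Syntax Decoder): dispatch MBs to slaves. */
        for (int mb = 0; mb < MBS_PER_FRAME; mb += NUM_MB_DECODERS)
            for (int slave = 0; slave < NUM_MB_DECODERS; ++slave)
                if (mb + slave < MBS_PER_FRAME)
                    DecodeMacroBlock(mb + slave); /* handled by this slave */
        printf("frame decoded\n");
        return 0;
    }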

9.2.4 Target Code Generation

Unlike the TCT MPSoC, the target platforms used in this case study do not have automatic code generation support which allows translating the MAPS generated tasks into target code automatically. Therefore, the target code generation in this case study is done manually. The behavior of each task is first extracted from the sequential application and encapsulated as the target platform requires: the Multi-ARM platform requires that a task be wrapped in an entry function which is later called by its OS, and each task in the Multi-LTRISC/LTVLIW platform is compiled as a stand-alone application. Besides, the number of MB decoders is implemented to be configurable, because the target platforms are both configurable and can use different numbers of processors.

About one man-week is spent writing target code for the MPSoC platforms used in this case study. In comparison, the MAPS tools need just minutes to produce the profiling and partitioning results for the application. The reason for the time-consuming manual code writing process is mainly the data dependence analysis and the corresponding code modification. Especially when the declaration, initialization, communication and manipulation of data objects are distributed over different functions and source files, the analysis is very time consuming. An example of such a scenario, written in pseudo C code, is shown in Figure 9.8.

a) Initial Partitioning: the Main Task activates the Syntax Decoder and a single MB Decoder in sequence. b) Improved Partitioning: the merged Main Task controls six parallel MB Decoders (0-5).

Figure 9.7: Partitioning Result of the Sequential H.264 Baseline Decoder

File_1.h - Declaration:

    struct data_struct {
        char* buffer_1;
        char* buffer_2;
        ...
    };

File_2.c - Initialization:

    data_struct* new_struct();
    ...
    function_1 {
        data_struct* struct_pointer = new_struct();
        ...
        struct_pointer->buffer_1 = malloc(...);
    }

File_3.c - Communication:

    void function_3(data_struct* struct_pointer);
    ...
    void function_2(...) {
        ...
        //function_3 is supposed to be a parallel task run in another processor
        function_3(struct_pointer);
        ...
    }

File_4.c - Manipulation:

    //Code inside the parallel task
    void function_x(...) {
        ...
        read_buffer(struct_pointer->buffer_1);
        ...
    }

Figure 9.8: Manual Data Dependency Analysis

Suppose function_3 in File_3.c is going to run on a separate processor, which implies that its input parameter, struct_pointer, needs to be communicated between processors. Since the pointer points to a data structure which is declared in File_1.h and initialized in File_2.c, simply sending the pointer value is not enough: the complete structure needs to be copied from one processor to another. Unfortunately, sometimes even this is not enough, because the data structure can contain further pointers addressing dynamically allocated data; e.g. data_struct itself has two more pointers, which store the addresses of two buffers. In File_4.c, one buffer is accessed by function_x in the parallel task. To allow parallel execution, such data dependencies need to be found completely, and extra code is needed to transfer the data between processors. The example is kept simple for the sake of clarity. In reality, a data structure can be accessed at many places in thousands of lines of code, and it is not trivial to discover all the data dependencies. In this case study, the dependencies are still resolved manually when writing the target code. Nevertheless, it is desirable to have a tool which can help the developer with this work; the MAPS framework might provide such a tool in the future.
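
To make the required marshalling effort concrete, the following is a minimal sketch of the kind of extra code the example demands: a deep copy that transfers the structure together with every buffer it points to. The transfer primitive send_raw and the explicit size fields are assumptions; the real platform API, and the way buffer sizes are tracked, may differ.

    typedef struct {
        char* buffer_1;
        char* buffer_2;
        unsigned size_1;   /* assumed: buffer sizes must be tracked somewhere */
        unsigned size_2;
    } data_struct;

    /* Stand-in for the platform's low-level transfer primitive (e.g. a
     * shared-memory copy); stubbed here so the sketch is self-contained. */
    static void send_raw(int dest, const void* data, unsigned size)
    { (void)dest; (void)data; (void)size; }

    /* Sending the pointer value alone would be meaningless on the receiving
     * processor: the struct and each dynamically allocated buffer must be
     * transferred explicitly. */
    void send_data_struct(int dest, const data_struct* s)
    {
        send_raw(dest, s, sizeof *s);
        send_raw(dest, s->buffer_1, s->size_1);
        send_raw(dest, s->buffer_2, s->size_2);
    }

    int main(void) { data_struct s = {0}; send_data_struct(0, &s); return 0; }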

9.2.5 Simulation Results

The simulation of the parallel decoder is done in two steps. First, the MAPS virtual platform (MVP) is used to run the parallelized decoder at high level with native simulation techniques, because the simulation speed of VPs is limited by their use of ISSes, and it is easier to debug code on the host machine than in a VP. Besides, the performance of the parallelized decoder can also be roughly estimated with the MVP. Afterwards, target VPs are employed to check whether the parallelized decoder is correctly implemented and whether the performance predicted at high level also holds true on the targets.

High-Level Simulation

The functionality of the parallelized decoder is first tested by decoding an encoded H.264 video file and checking the output video frames in the virtual display of the MVP. Once the functional test is passed, the speedup achieved by the parallelized decoder is measured in order to obtain an early evaluation of the effect of the parallelization. The parallelized decoder is simulated in the MVP using two techniques, native execution and virtual cross-compilation (VCC). Note that the native execution mode of the MVP does not directly support VLIW processors. In the case study, the clock frequency of the LTVLIW processor in native mode is therefore adjusted to 1.7 times that of the LTRISC processor so as to mimic the speedup brought by Instruction Level Parallelism (ILP). The factor is given by the developer of the LTVLIW compiler, who knows that the processor is able to execute 1.7 instructions per cycle (IPC) on average, compared to about 1 in the case of the LTRISC processor. In the VCC mode, this frequency scaling is not necessary, because the virtual compiler backend already takes the ILP effect into account. Figure 9.9 shows the application speedup measured using the MVP, where the speedup value is calculated by the following equation:

    Speedup = Time_Sequential / Time_Parallel    (9.2)

Speedup (x) vs. number of processors (1-7); curves: SMP Native, Multi-LTRISC VCC, Multi-LTRV Native, Multi-LTRV VCC.
Figure 9.9: Application Speedup Measured in the MAPS Virtual Platform

The curves labeled Native are obtained through native execution, and those with VCC in their labels are simulated using the virtual cross-compilation technique. The SMP Native curve estimates the speedup of the parallelization when identical processing elements are used in the target platform; this corresponds to the Multi-ARM platform and the Multi-LTRISC platform. The Multi-LTRISC VCC curve shows the result of using the LTRISC virtual compiler backend, which deviates slightly from the native execution result due to the different performance estimation method. The Multi-LTRV Native curve shows the native estimation result, which is very close to the result obtained with the virtual cross-compilation (Multi-LTRV VCC). The results show that, in general, the speedup increases when more processors are used. Nevertheless, when the processors are identical, it is not necessary to use more than 4 of them. Two potential causes could be responsible for the saturation of the speedup: not enough parallelism, or not fully exploited parallelism. The former means that the application itself does not have more potential for parallel execution; the latter means that the application can be further parallelized. Since the major goal of the MVP is functional testing, the further exploration for improvement is done later with virtual platforms, which can estimate the performance more accurately.

Virtual Platform Simulation

The MVP provides an early development environment in which parallel applications can be simulated and tested. Finally, the application needs to be executed on the target platforms. The results are shown in Figure 9.10. As a reference, a fully manually parallelized version for the Multi-ARM platform is available, and the curve Multi-ARM Manual shows its speedup result. The other curves are the results of the MAPS parallelized versions. Compared to the curves in Figure 9.9, it can be seen that, in general, the results are quite close, which implies that the MVP has shown the correct trend of the application speedup. Nevertheless, the MAPS parallelized decoder shows a smaller speedup on the Multi-ARM virtual platform. This can be explained by the fact that the OS overhead of the Multi-ARM platform is not taken into account in the MVP.

Speedup (x) vs. number of processors (1-7); curves: Multi-ARM, Multi-LTRISC, Multi-LTRV, Multi-ARM Manual.
Figure 9.10: Application Speedup Measured in Virtual Platforms

Besides, in comparison to the manually parallelized version, the MAPS parallelized decoders show smaller speedups on the Multi-LTRISC platform and the Multi-ARM platform. A better speedup is achieved only with the help of the instruction level parallelism of the LTVLIW processor, which implies that there is still potential for further performance improvement. To find the reason for the better performance, a close study of the manual version is made, which reveals that a dynamic MB task dispatcher is the major source of the difference. According to the H.264 standard, there can be dependencies between the MBs of H.264 video frames. In the MAPS parallelized decoders, the conservative assumption is made that MBs always depend on their neighbors, which is not completely true but simplifies the analysis. In the manual version, the MB dependency is dynamically resolved by analyzing the information encoded in the MB header, and an MB task is dispatched as soon as the MBs it depends on are decoded. This reduces the time the processing elements spend waiting and therefore results in better performance. However, exploiting such opportunities requires deep application knowledge and time: it is estimated that one man-month is needed to parallelize the decoder manually for the Multi-ARM platform. Considering that the MAPS parallelized decoders for the three platforms are developed in about one man-week, it can be said that the development efficiency is improved and the results are acceptable.

9.3 Summary

In this chapter, the results of two case studies are introduced in detail, in which the MAPS framework is used throughout the entire software development flow for several MPSoC platforms. For the TCT platform, the JPEG encoder is efficiently parallelized using the MAPS tools. Besides, parallel H.264 decoders are developed in the framework for multiple targets. In both cases, the proposed method has improved the development efficiency.

Chapter 10 Summary and Outlook

10.1 Summary

Software development is one of the most challenging tasks in MPSoC design. Therefore, tool support is demanded by developers in order to improve the efficiency of MPSoC software development. For this purpose, the MAPS framework is proposed and developed within the scope of this thesis. MAPS provides MPSoC software developers with a unified environment integrating all the tools that are needed for MPSoC development. The development process is divided by MAPS into a sequence of steps: application modeling, architecture modeling, profiling, control and data flow analysis, partitioning, high-level simulation and code generation. The tools can be connected together to form a highly automated development flow. Meanwhile, MAPS also allows the developer to interact with the environment between the steps, so that he can always fine-tune the intermediate results with his knowledge. In contrast, an automatic parallelizing compiler is a monolithic software tool with far less user interaction. The philosophy behind the MAPS framework is to provide assistance to developers and involve them in the development flow, which can be seen from the following aspects:

• At the early stage of the development, the MAPS profiling facility helps MPSoC software developers to better understand the target application, so that potential parallelism can be quickly identified. The visualization of the profile information is seamlessly integrated into the MAPS environment, and the developer can see the result directly in an intuitive way.

• For partitioning the target application, MAPS provides MPSoC software developers with suggestions on where parallel tasks can be created. To achieve this, the control and data dependences inside the application are thoroughly analyzed by using both traditional compiler techniques and dynamic analysis. The latter is implemented in MAPS to complement the former, such that precise dependence information about the application can be obtained. Based on this information, MAPS searches for parallel tasks for the programmer on a novel granularity level. Parallel tasks are identified by MAPS according to their internal and external control and data dependences. When necessary, the programmer can also take complete control over the partitioning process by interacting with the tool.

• Once a partitioning is available, the MAPS virtual platform (MVP) helps the developer to test the result early at a high abstraction level. Native execution is used, which ensures a fast simulation speed. With the MVP, MPSoC software developers can easily explore different partitionings and application scenarios.

• To run the parallelized application on the target MPSoC platform, target code needs to be generated. The MAPS framework aims to completely take over this process and generate target code fully automatically. Within this thesis, a code generator for the TCT MPSoC is developed, which produces code that can be directly compiled by the TCT compiler.

The usability of the MAPS framework is shown by two case studies, a JPEG encoder and an H.264 baseline decoder. With the help of the MAPS tools, the former is successfully parallelized for the TCT MPSoC. Although the achieved speedup is still behind what can be achieved manually by a programmer experienced with the TCT compiler and architecture, MAPS helps the programmer to obtain a reasonably parallelized version within a short period of time. Since the framework is fully under the control of the developer, further improvement is possible as the user gains more knowledge of and experience with the platform. In the second case study, several MPSoC platforms are used as targets, for which parallel H.264 decoders are developed. Using the MAPS tools, parallel tasks are identified within the sequential version of the target application, and based on the MAPS suggested partitioning, the application is parallelized. Before the parallel decoder is tested on the target platforms, the MAPS virtual platform (MVP) is employed as an early simulation vehicle to perform functional tests and to estimate the effect of the parallelization. Finally, the parallelized application is executed on the target MPSoC platforms, and the simulation results show that the effect of the parallelization is well predicted by the MVP. The results from the case studies show that the efficiency of MPSoC software development is improved by MAPS through its advanced tooling support and high level of integration. With MAPS, programmers are able to develop code for MPSoCs quickly.

10.2 Outlook

MPSoC software development is an actively researched area. Within this thesis, the concept of the MAPS framework is developed, and a set of tools is implemented to realize the proposed methodology. Compared to other academic works, which are mostly composed of sparsely connected tools, MAPS aims to provide developers with a single complete environment for application modeling, source analysis, code partitioning, scheduling/mapping, target simulation, and debugging. To achieve this goal, future developments are definitely required. This thesis uses sequential C as input specification, in which the parallelism of the target application is implicit. Sophisticated analyses are required to extract such implicit parallelism from sequential code. To improve this, application models with explicit parallelism, such as the Kahn Process Network, can be used. This is a feature that will be included in future MAPS development. Due to the limitations of the target MPSoCs, scheduling and mapping are solved in a relatively simple way in this thesis. In the future, for complex MPSoC platforms with advanced parallel runtime support, further exploration is needed to find a viable solution for both problems.


Moreover, target code development is always a tedious task for MPSoC programmers. In this thesis, some code generation work is still done manually. To achieve a more fluent MPSoC software development flow, much work needs to be done to further automate the code generation process for different MPSoC platforms. Last but not least, debugging is another important step in software development after the executable code is generated. Especially for MPSoCs, which run parallel applications, debugging is more challenging than for uniprocessor systems. New debugging methods and tools are especially appreciated by MPSoC programmers when they can help to solve the puzzle of MPSoC software bugs.


Appendix A MAPS Architecture Description XML Schema

A.1 Component Library




name = "ContentionModel" type = "archlib:ContentionModel"/>

<xsd:list itemType="xsd:decimal"/>

A.2 Architecture Model


name="Channel" type="arch:Channel"/>


Appendix B Pseudo Code for Profiling

This appendix shows the pseudo code used by the profiling tools of the MAPS framework.

B.1 Code Instrumentation

For Each function in IR-C
    insertFunctionCall("EnterFunction")
    If function is main
        For Each globalVariable in IR-C
            insertFunctionCall("InitGlobalVariable")
        End For
    End If
    For Each localVariable in function
        insertFunctionCall("InitLocalVariable")
    End For
    For Each statement in function
        If statement is a basic block entry
            insertFunctionCall("EnterBasicBlock")
        End If
        If statement has memory access
            insertFunctionCall("TraceMemAccess")
        End If
        If statement calls function
            insertFunctionCall("PrepareFunctionCall")
            If statement calls "malloc"
                insertFunctionCall("ProcessMalloc")
            End If
            If statement calls "calloc"
                insertFunctionCall("ProcessCalloc")
            End If
            If statement calls "free"
                insertFunctionCall("ProcessFree")
            End If
        End If
        If statement is return
            insertFunctionCall("ExitFunction")
        End If
    End For
End For

B.2 Function Call Cost Calculation

Function FunctionCallCost(f, c, p)
Begin
    Result = FunctionSelfCost(f, c, p)
    For Each IR statement s in function f Do
        If s is a call statement which invokes function f' And f' ≠ f Then
            Result = Result + FunctionCallCost(f', s, p)
        End If
    End For
    Return Result
End

B.3 IR Statement Execution Cost Calculation

Function StmtExecutionCost(s, p)
Begin
    OverallExecutionCount = 0
    Result = 0
    b is the parent BB containing s
    f is the parent function containing b and s
    For Each IR statement c which calls f Do
        OverallExecutionCount += BBExecutionCount(b, f, c)
    End For
    Result += OverallExecutionCount * StmtIntrinsicCost(s, p)
    If s is a call statement which invokes function f' Then
        Result = Result + FunctionCallCost(f', s, p)
    End If
    Return Result
End


B.4 Call Graph Generation

Provide:

• T is the input trace file;

• Function GetNextTraceLine(T ) returns one trace line a time in sequential order and returns NULL at the end of T ; and

• Function GetEnteredFunction(t) returns the function an enter trace line enters.

Output:

• CallGraph = (F, E, p)

Procedure GenDynamicCallGraph(T)
Begin
    /* Construct the call graph structure; ReadFunctionTrace is defined below */
    ReadFunctionTrace(T, NULL)
    /* Annotate node weights */
    For Each function f in F Do
        Wtotal[f] = FunctionTotalCost(f, p)
        Wself[f] = 0
        For Each call site c of function f Do
            Wself[f] = Wself[f] + FunctionSelfCost(f, c, p)
        End For
    End For
End

/* Read the trace of one function */
Procedure ReadFunctionTrace(T, current)
Begin
    If current is not NULL Then
        If F does not contain function current Then
            Insert current to F
            Wcount[current] = 1
        Else
            Wcount[current] = Wcount[current] + 1
        End If
    End If
    t = GetNextTraceLine(T)
    While t is not NULL And t is not a function exit Do
        If t is an entry trace line Then
            callee = GetEnteredFunction(t)
            If E does not contain edge e = (current, callee) Then
                Insert e = (current, callee) to E
                We[e] = 1
            Else
                We[e] = We[e] + 1
            End If
            ReadFunctionTrace(T, callee)
        End If
        t = GetNextTraceLine(T)
    End While
End

Appendix C MAPS Application Profile XML Schema


Appendix D Pseudo Code for Control Data Flow Analysis

This appendix shows the pseudo code used by the control/data flow analysis of the MAPS framework.

D.1 SCFG Construction

Provide:

• V = (B, N) is the input LANCE CFG, where B is the node set and N is the edge set;

• Function GetNextStatement(s) returns the next IR statement after s in the same basic block, and NULL if s is at the end of the basic block;

• Function GetFirstStatement(b) returns the first IR statement of the basic block b; and

• Function GetLastStatement(b) returns the last IR statement of the basic block b.

/* G is the output statement control flow graph */
G = (S, E), S = φ, E = φ
Procedure ConstructStatementControlFlowGraph(V)
Begin
    /* Analyze each basic block */
    For Each basic block b in B Do
        For Each IR statement s in b Do
            /* Get the next IR statement after s */
            t = GetNextStatement(s)
            If s is not in S Then
                Insert s to S
            End If
            If t is not NULL And e = (s, t) is not in E Then
                Insert e to E
            End If
        End For
    End For
    /* Insert edges between basic blocks */
    For Each edge n = (b, c) in N Do
        s = GetLastStatement(b)
        t = GetFirstStatement(c)
        If edge e = (s, t) is not in E Then
            Insert e to E
        End If
    End For
End

D.2 SCDFG Construction

Provide:

• V = (S, E) is the input SCFG, where S is the node set and E is the edge set;

• Function GetProducer(s) returns the set of producer statements which write variables read by s.

Procedure ConstructStatementControlDataFlowGraph(V)
/* G = (S', Ec, Ed) is the output SCDFG */
Begin
    /* SCDFG and SCFG have the same node set */
    S' = S
    /* The control edge set of the SCDFG is the same as the SCFG edge set */
    Ec = E
    For Each statement s' in S' Do
        /* Get all producers of s' */
        P = GetProducer(s')
        For Each statement p in P Do
            If IR statement p is in S' And data edge ed = (p, s') does not exist in Ed Then
                Insert ed = (p, s') to Ed
            End If
        End For
    End For
End

D.3 Data Dependence Analysis

D.3.1 Static Data Analysis

Function GetStaticProducer(s)
Begin
    /* S is the output IR statement set */
    S = φ
    V = GetStatementUse(s)
    For Each variable v in V Do
        D = GetDefines(v)
        For Each statement d in D Do
            If d is not in S Then
                Insert d to S
            End If
        End For
    End For
    Return S
End

D.3.2 Dynamic Analysis

Provide:

• T is the input trace;

• Function GetStatement(t) returns the accessing IR statement which is recorded in trace line t;

• Function GetTargetVariable(t) returns the variable which is accessed in trace line t;

• Function GetParentFunction(s) returns the parent function of the statement s; and

• Function GetCallSite(f) returns the call site of the function f.

/* This procedure is called once at the beginning of the analysis */
Procedure ConstructDynamicDefUseTable(T)
    Used[s]: the output dynamic Use table, which is indexed by IR statement;
             each of its elements is the set of variables used by s.
    Defd[v]: the output dynamic Def table, which is indexed by variable;
             each of its elements is the set of IR statements which define v.
Begin
    For Each memory trace line t in T Do
        s = GetStatement(t)
        v = GetTargetVariable(t)
        While true Do
            If t is a read access trace Then
                Insert v to Used[s]
            Else /* t must be a write access trace */
                Insert s to Defd[v]
            End If
            f = GetParentFunction(s)
            If v is not locally defined in f And f is not the main function Then
                s = GetCallSite(f) /* Propagate the information to the call site of f */
            Else
                Break /* Jump out of the while loop */
            End If
        End While
    End For
End

/* This function provides an interface to the dynamic dependence information */
Function GetDynamicProducer(s)
Begin
    /* S is the output IR statement set */
    S = φ
    V = Used[s]
    For Each variable v in V Do
        D = Defd[v]
        For Each statement d in D Do
            If d is not in S Then
                Insert d to S
            End If
        End For
    End For
    Return S
End

D.4 WSCDFG Construction

Provide:

• G' = (S', E'c, E'd) is the input SCDFG;

• Function GetStatementWeight(s) returns the weight of the IR statement s;

• Function GetControlEdgeWeight(ec) returns the weight of the control edge ec;

• Function GetDataEdgeWeight(ed) returns the weight of the data edge ed;

Procedure ConstructWeightedStatementControlDataFlowGraph(G')
/* G = (S, Ec, Ed, Ws, Wc, Wd) is the output weighted statement control data flow graph */
Begin
    /* Copy nodes and edges from the input SCDFG */
    S = S'
    Ec = Ec'
    Ed = Ed'
    /* Annotate node weights */
    For Each statement s in S Do
        Ws(s) = GetStatementWeight(s)
    End For
    /* Annotate control edge weights */
    For Each control edge ec in Ec Do
        Wc(ec) = GetControlEdgeWeight(ec)
    End For
    /* Annotate data edge weights */
    For Each data edge ed in Ed Do
        Wd(ed) = GetDataEdgeWeight(ed)
    End For
End

D.4.1 Node Weight Annotation

Provide:

• Function StmtExecutionCost(s) returns the overall execution cost of an IR statement s.

Function GetStatementWeight(s)
Begin
    w = StmtExecutionCost(s)
    Return w
End

D.4.2 Control Edge Weight Annotation

Provide:

• Function BBOverallExecutionCount(b) returns the overall execution count of basic block b;

• Function GetParentBB(s) returns the parent basic block of the statement s;

• Function GetFirstStatement(b) returns the first IR statement of the basic block b; and

• Function GetLastStatement(b) returns the last IR statement of the basic block b.

/* This procedure counts the occurrence of all inter-basic-block control
   flow in the complete trace T */
Procedure CountInterBBControlEdge(T)
    Countce[ec] is the global inter-BB control flow occurrence count table.
Begin
    For Each function f in T Do
        CountControlEdgeInFunction(f) /* Defined below */
    End For
End

/* This procedure counts inter-BB control flow in one function */
Procedure CountControlEdgeInFunction(f)
Begin
    For Each trace line t of f Do
        If t is a BB trace, which records the execution of BB b,
           And there exists a BB trace line t' following t,
           which records the execution of BB b' Then
            s = GetLastStatement(b)
            s' = GetFirstStatement(b')
            ec = (s, s')
            Countce[ec] += 1
        End If
    End For
End

/* This function returns the weight of the control edge ec = (s, s') */
Function GetControlEdgeWeight(ec)
Begin
    b = GetParentBB(s)
    b' = GetParentBB(s')
    If b and b' are the same basic block Then
        w = BBOverallExecutionCount(b)
    Else
        w = Countce[ec]
    End If
    Return w
End

D.4.3 Data Edge Weight Annotation

Provide:

• Function GetParentBB(s) returns the parent basic block of the statement s;

• Function BBOverallExecutionCount(b) returns the overall execution count of basic block b;

• Function GetDependentVariable(ed) returns the variable on which the dependence relationship represented by the data edge ed is established;

• Function SizeOf(v) returns the size of the variable v in bytes;

• Function GetDefines(v) is one of the LANCE static data flow analysis functions; it returns the set of IR statements which define the variable v; and

• Defd[v]: is the table constructed during the dynamic data dependence analysis. It is indexed by variables, and each element is the set of IR statements which define v.

/* This function returns the weight of the data edge ed = (s, s') */
Function GetDataEdgeWeight(ed)
Begin
    b = GetParentBB(s)
    b' = GetParentBB(s')
    v = GetDependentVariable(ed)
    If b and b' are the same basic block Then /* Intra-BB data dependence */
        w = SizeOf(v) * BBOverallExecutionCount(b)
    Else /* Inter-BB data dependence */
        /* D is the set of all IR statements which define v */
        D = GetDefines(v) ∪ Defd[v]
        B = φ
        /* This loop converts the IR statement set D to a BB set B */
        For Each statement d in D Do
            bd = GetParentBB(d)
            Insert bd to B
        End For
        Countd = the number of times that b' is executed directly after b
        w = SizeOf(v) * Countd
    End If
    Return w
End


Bibliography

[1] G. Moore, "Cramming More Components Onto Integrated Circuits," Proceedings of the IEEE, vol. 86, pp. 82–85, Jan 1998.
[2] K. Azar, "The History of Power Dissipation," Electronics Cooling Magazine, vol. 6, pp. 42–50, January 2000.
[3] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips. Morgan Kaufmann, 2004.
[4] Texas Instruments, "OMAP (Open Multimedia Application Platform)." http://focus.ti.com/general/docs/gencontent.tsp?contentId=46946.
[5] Freescale, "i.MX Application Processors." http://www.freescale.com/webapp/sps/site/homepage.jsp?code=IMX_HOME.
[6] ST-Ericsson, "Nomadik Multimedia Application Processor." http://www.stericsson.com/platforms/nomadik.jsp.
[7] NXP, "Nexperia Processors." http://www.nxp.com.
[8] ATMEL, "DIOPSIS (ARM+DSP)." http://www.atmel.com/products/Diopsis/.
[9] SAMSUNG, "Mobile SoC Application Processor." http://www.samsung.com/global/business/semiconductor/products/mobilesoc/Products_MobileSoC.html.
[10] Qualcomm, "Snapdragon Platform." http://www.qualcomm.com/products_services/chipsets/snapdragon.html.

[11] AMD, "Phenom II X4 Processor." http://products.amd.com/en-us/default.aspx.
[12] G. Martin, "Overview of the MPSoC Design Challenge," in DAC '06: Proceedings of the 43rd Annual Conference on Design Automation, (New York, NY, USA), pp. 274–279, ACM, 2006.
[13] Microsoft Visual Studio, http://www.microsoft.com/visualstudio/en-us/default.mspx.


[14] Eclipse - An Open Development Platform, http://www.eclipse.org.
[15] T. Collette, "Key Technologies for Many Core Architectures," in 8th International Forum on Application-Specific Multi-Processor SoC, 2008.
[16] K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002.
[17] R. Leupers, Retargetable Code Generation for Digital Signal Processors. Norwell, MA, USA: Kluwer Academic Publishers, 1997.
[18] J. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan-Kaufmann (Elsevier), Jan. 2005.
[19] R. Leupers, "Code Selection for Media Processors with SIMD Instructions," in DATE '00, pp. 4–8, ACM, 2000.
[20] Standard for Information Technology - Portable Operating System Interface (POSIX). Shell and Utilities. IEEE Std 1003.1-2004, The Open Group Base Specifications Issue 6, Section 2.9, IEEE and The Open Group.
[21] MPI Forum, "Message Passing Interface." http://www.mpi-forum.org.
[22] OpenMP Architecture Review Board, The OpenMP Specification for Parallel Programming. http://www.openmp.org.
[23] M. Hind, "Pointer Analysis: Haven't We Solved This Problem Yet?," in PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, (New York, NY, USA), pp. 54–61, ACM, 2001.
[24] R. Leupers, A. Vajda, M. Bekooij, S. Ha, R. Dömer, and A. Nohl, "Programming MPSoC Platforms: Road Works Ahead!," in DATE, pp. 1584–1589, 2009.
[25] Intel, "Core 2 Quad Processors." http://www.intel.com/products/processor/core2quad.
[26] Intel, "i7-975 Processor." http://ark.intel.com/Product.aspx?id=37153&processor=i7-975&spec-codes=SLBEQ.
[27] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell Multiprocessor," IBM J. Res. Dev., vol. 49, no. 4/5, pp. 589–604, 2005.
[28] IBM Redbooks, Programming the Cell Broadband Engine Architecture: Examples and Best Practices. Vervante, 2008.
[29] Nvidia, "CUDA Zone." http://www.nvidia.com/object/cuda_home.html.
[30] ATI, "ATI Stream Processor." http://www.amd.com/us/products/technologies/stream-technology.

[31] ATI, "Radeon HD5970 Graphics Card." http://www.amd.com.
[32] Khronos, "OpenCL - The Open Standard for Parallel Programming of Heterogeneous Systems." http://www.khronos.org/opencl.
[33] Freescale, "i.MX51 Processors." http://www.freescale.com.


[34] Infineon, "XGold SDR Platform." http://www.infineon.com.
[35] K. Hirata and J. Goodacre, "ARM MPCore; The Streamlined and Scalable ARM11 Processor Core," in Design Automation Conference, 2007. ASP-DAC '07. Asia and South Pacific, pp. 747–748, Jan. 2007.
[36] TI, "OMAP 4 Platform." http://www.ti.com.
[37] ST, "STMicroelectronics and ARM Team Up to Power Next-Generation Home Entertainment." http://www.st.com/stonline/stappl/cms/press/news/year2009/t2430.htm, Oct 2009.
[38] ARM, "Linux Support for the ARM Architecture." http://www.arm.com/community/software-enablement/linux.php.
[39] Intel, "IXP2400 Network Processor White Paper." http://www.intel.com/design/network/papers/ixp2400.pdf.
[40] Intel, IXP2400 and IXP2800 Network Processor Programmer's Reference Manual.
[41] N. Shah, W. Plishker, and K. Keutzer, "NP-Click: A Programming Model for the Intel IXP1200," in 2nd Workshop on Network Processors (NP-2) at the 9th International Symposium on High Performance Computer Architecture (HPCA-9), pp. 100–111, Morgan Kaufmann, 2003.
[42] Synopsys, "Synopsys Virtual Platforms." http://www.synopsys.com/Tools/SLD/VirtualPlatforms/Pages/default.aspx.
[43] A. Hansson, K. Goossens, M. Bekooij, and J. Huisken, "CoMPSoC: A Template for Composable and Predictable Multi-Processor System on Chips," ACM Trans. Des. Autom. Electron. Syst., vol. 14, no. 1, pp. 1–24, 2009.
[44] Silicon Hive. http://www.siliconhive.com.
[45] K. Goossens, J. Dielissen, and A. Radulescu, "Aethereal Network on Chip: Concepts, Architectures, and Implementations," IEEE Design and Test of Computers, vol. 22, pp. 414–421, Sept.-Oct. 2005.
[46] A. Nieuwland, J. Kang, O. P. Gangwal, R. Sethuraman, R. S. C. N. Busa, K. Goossens, and R. P. Llopis, "C-HEAP: A Heterogeneous Multi-Processor Architecture Template and Scalable and Flexible Protocol for the Design of Embedded Signal Processing Systems," 2002.
[47] G. Kahn, "The Semantics of a Simple Language for Parallel Programming," in Information Processing '74: Proceedings of the IFIP Congress (J. L. Rosenfeld, ed.), pp. 471–475, New York, NY: North-Holland, 1974.
[48] P. S. Paolucci, A. A. Jerraya, R. Leupers, L. Thiele, and P. Vicini, "SHAPES: A Tiled Scalable Software Hardware Architecture Platform for Embedded Systems," in CODES+ISSS '06: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, (New York, NY, USA), pp. 167–172, ACM, 2006.
[49] L. Thiele and I. Bacivarov, "SHAPES Software Architecture Overview," in CASTNESS (Computing Architectures and Software Tools for Numerical Embedded Scalable Systems) '08, (Roma, Italy), 2008.
[50] "PISA - A Platform and Programming Language Independent Interface for Search Algorithms." http://www.tik.ee.ethz.ch/sop/pisa/.
[51] Eclipse, "Graph Modeling Framework." http://www.eclipse.org/modeling/gmf.


[52] CoWare, "Virtual Platform." http://www.coware.com/products.
[53] A. Cremonesi, "Semiconductor Industry: Perspective, Evolution and Challenges," in International Symposium on Networks-on-Chip (NOCS), 2010.
[54] EU, "2PARMA Project." http://www.2parma.eu.
[55] HOPES, "HOPES Parallel Programming Framework." http://peace.snu.ac.kr/hopes.
[56] S. Ha, "Embedded Software Development for MPSoC: Preliminary HOPES Experience," in 8th International Forum on Application-Specific Multi-Processor SoC, (Aachen, Germany), June 2008.
[57] S. Kwon, Y. Kim, W.-C. Jeun, S. Ha, and Y. Paek, "A Retargetable Parallel-Programming Framework for MPSoC," ACM Trans. Des. Autom. Electron. Syst., vol. 13, no. 3, pp. 1–18, 2008.
[58] S. Ha, "Model-Based Programming Environment of Embedded Software for MPSoC," in ASP-DAC '07: Proceedings of the 2007 Asia and South Pacific Design Automation Conference, (Washington, DC, USA), pp. 330–335, IEEE Computer Society, 2007.
[59] S. Ha, S. Kim, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "PeaCE: A Hardware-Software Codesign Environment for Multimedia Embedded Systems," ACM Trans. Des. Autom. Electron. Syst., vol. 12, no. 3, pp. 1–25, 2007.
[60] C. Park, J. Chung, and S. Ha, "Extended Synchronous Dataflow for Efficient DSP System Prototyping," in RSP '99: Proceedings of the Tenth IEEE International Workshop on Rapid System Prototyping, (Washington, DC, USA), p. 196, IEEE Computer Society, 1999.
[61] D. Kim and S. Ha, "Static Analysis and Automatic Code Synthesis of Flexible FSM Model," in ASP-DAC '05: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, (New York, NY, USA), pp. 161–165, ACM, 2005.
[62] H. Yang and S. Ha, "Pipelined Data Parallel Task Mapping/Scheduling Technique for MPSoC," in Design, Automation and Test in Europe Conference and Exhibition (DATE '09), pp. 69–74, April 2009.
[63] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere, "Daedalus: Toward Composable Multimedia MP-SoC Design," in DAC '08: Proceedings of the 45th Annual Conference on Design Automation, (New York, NY, USA), pp. 574–579, ACM, 2008.
[64] S. Verdoolaege, H. Nikolov, and T. Stefanov, "pn: A Tool for Improved Derivation of Process Networks," EURASIP J. Embedded Syst., vol. 2007, no. 1, pp. 19–19, 2007.
[65] A. D. Pimentel, C. Erbas, and S. Polstra, "A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels," IEEE Transactions on Computers, vol. 55, no. 2, pp. 99–112, 2006.
[66] H. Nikolov, T. Stefanov, and E. Deprettere, "Systematic and Automated Multiprocessor System Design, Programming, and Implementation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, pp. 542–555, 2008.
[67] Compaan, "HotSpot Parallelizer." http://www.compaandesign.com.
[68] E. Zitzler, M. Laumanns, and L. Thiele, "SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization," in Evolutionary Methods for Design, Optimisation and Control with Application to Industrial Problems (EUROGEN 2001) (K. Giannakoglou et al., eds.), pp. 95–100, International Center for Numerical Methods in Engineering (CIMNE), 2002.


[69] J. E. Coffland and A. D. Pimentel, "A Software Framework for Efficient System-Level Performance Evaluation of Embedded Systems," in SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, (New York, NY, USA), pp. 666–671, ACM, 2003.
[70] R. Baert, E. Brockmeyer, S. Wuytack, and T. J. Ashby, "Exploring Parallelizations of Applications for MPSoC Platforms Using MPA," in DATE '09: Proceedings of the Conference on Design, Automation and Test in Europe, 2009.
[71] J.-Y. Mignolet, R. Baert, T. J. Ashby, P. Avasare, H.-O. Jang, and J. C. Son, "MPA: Parallelizing an Application onto a Multicore Platform Made Easy," IEEE Micro, vol. 29, no. 3, pp. 31–39, 2009.
[72] E. Stoltz, M. Gerlek, and M. Wolfe, "Extended SSA with Factored Use-Def Chains to Support Optimization and Parallelism," in Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, Vol. II: Software Technology, pp. 43–52, Jan. 1994.
[73] M. Z. Urfianto, T. Isshiki, A. U. Khan, D. Li, and H. Kunieda, "A Multiprocessor System-on-Chip Architecture with Enhanced Compiler Support and Efficient Interconnect," in IP-SOC 2006, Design and Reuse, S.A., 2006.
[74] M. Z. Urfianto, T. Isshiki, A. U. Khan, D. Li, and H. Kunieda, "A Multiprocessor SoC Architecture with Efficient Communication Infrastructure and Advanced Compiler Support for Easy Application Development," IEICE Trans. Fundamentals, vol. E91-A, no. 4, 2008.
[75] M. Z. Urfianto, T. Isshiki, A. U. Khan, D. Li, and H. Kunieda, "A Multiprocessor SoC Architecture with Efficient Communication Infrastructure and Advanced Compiler Support for Easy Application Development," IEICE Transactions, vol. 91-A, no. 4, pp. 1185–1196, 2008.
[76] R. Johnson and K. Pingali, "Dependence-Based Program Analysis," in SIGPLAN Conference on Programming Language Design and Implementation, pp. 78–89, 1993.
[77] J. Ganssle, "500 Embedded Engineers Have Their Say About Jobs, Tools." EETimes Europe, January 2009. http://www.eetimes.eu/design/213000236.
[78] K. Karuri, M. A. Al Faruque, S. Kraemer, R. Leupers, G. Ascheid, and H. Meyr, "Fine-Grained Application Source Code Profiling for ASIP Design," in DAC '05: Proceedings of the 42nd Annual Design Automation Conference, (New York, NY, USA), pp. 329–334, ACM, 2005.
[79] Eclipse C/C++ Development Tooling - CDT. http://www.eclipse.org/cdt/.
[80] J. Castrillon, D. Zhang, T. Kempf, B. Vanthournout, R. Leupers, and G. Ascheid, "Task Management in MPSoCs: An ASIP Approach," in ICCAD '09: Proceedings of the 2009 International Conference on Computer-Aided Design, (New York, NY, USA), pp. 587–594, ACM, 2009.
[81] ARM9EJ-S Technical Reference Manual - Revision: r1p2. http://www.arm.com/documentation.
[82] Mentor Graphics, ModelSim. http://www.mentor.com.
[83] M. Oyamada, F. R. Wagner, M. Bonaciu, W. Cesario, and A. Jerraya, "Software Performance Estimation in MPSoC Design," in ASP-DAC '07: Proceedings of the 2007 Conference on Asia South Pacific Design Automation, (Washington, DC, USA), pp. 38–43, IEEE Computer Society, 2007.
[84] P. Giusto, G. Martin, and E. Harcourt, "Reliable Estimation of Execution Time of Embedded Software," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '01), pp. 580–588, 2001.


[85] Cadence, "Cierto Virtual Component Co-Design (VCC) Environment." http://www.cqpub.co.jp/dwm/editors/sn/edat2000/cadence/vcc.pdf.
[86] G. Bontempi and W. Kruijtzer, "A Data Analysis Method for Software Performance Prediction," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '02), pp. 971–976, 2002.
[87] A. L. Varbanescu, H. J. Sips, and A. J. C. van Gemund, "PAM-SoC: A Toolchain for Predicting MPSoC Performance," in Euro-Par (W. E. Nagel, W. V. Walter, and W. Lehner, eds.), vol. 4128 of Lecture Notes in Computer Science, pp. 111–123, Springer, 2006.
[88] A. van Gemund, "Symbolic Performance Modeling of Parallel Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 14, pp. 154–165, Feb. 2003.
[89] D. Borodin, A. Terechko, B. Juurlink, and P. Stravers, "Optimisation of Multimedia Applications for the Philips Wasabi Multiprocessor System," in Proceedings of the 16th Annual Workshop on Circuits, Systems and Signal Processing, ProRISC 2005, pp. 598–608, November 2005.
[90] T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers, and H. Meyr, "A SW Performance Estimation Framework for Early System-Level-Design Using Fine-Grained Instrumentation," in DATE '06: Proceedings of the Conference on Design, Automation and Test in Europe, (Leuven, Belgium), pp. 468–473, European Design and Automation Association, 2006.
[91] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, August 2006.
[92] LANCE Retargetable C Compiler. http://www.lancecompiler.com.
[93] ARM (Advanced RISC Machines). http://www.arm.com.
[94] W3C, "Extensible Markup Language." http://www.w3c.org/XML.

[95] Intel, "VTune™ Performance Analyzer." http://software.intel.com/en-us/intel-vtune.
[96] AMD, "CodeAnalyst Performance Analyzer." http://developer.amd.com/CPU/CODEANALYST.
[97] CoWare, "Processor Designer." http://www.coware.com/products/processordesigner.php.
[98] S. L. Graham, P. B. Kessler, and M. K. McKusick, "gprof: A Call Graph Execution Profiler," SIGPLAN Not., vol. 17, no. 6, pp. 120–126, 1982.
[99] M. Ravasi and M. Mattavelli, "High-Level Algorithmic Complexity Evaluation for System Design," Journal of Systems Architecture: the EUROMICRO Journal, vol. 48, no. 13-15, pp. 403–427, 2003.
[100] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools," in PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, (New York, NY, USA), pp. 196–205, ACM, 1994.
[101] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," in PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, (New York, NY, USA), pp. 190–200, ACM, 2005.


[102] N. Nethercote and J. Seward, "Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation," SIGPLAN Notices, vol. 42, no. 6, pp. 89–100, 2007.
[103] K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002.
[104] S. S. Muchnick, Advanced Compiler Design and Implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997.
[105] S. Dingel, LANCE 2 Control and Data Flow Analysis, October 1999.
[106] M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, and M. S. Lam, "Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler," in Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, (New York, NY, USA), p. 49, ACM, 1995.
[107] S. Verdoolaege, H. Nikolov, and T. Stefanov, "Improved Derivation of Process Networks," in Digest of the 4th Workshop on Optimization for DSP and Embedded Systems (ODES-4), pp. 1–10, 2006.
[108] T. Harriss, R. Walke, and B. Kienhuis, "Compilation from MATLAB to Process Networks Realized in FPGA," in Proceedings of the 35th Asilomar Conference on Signals, Systems, and Computers, pp. 385–403, 2001.
[109] I. Karkowski and H. Corporaal, "Design of Heterogeneous Multi-Processor Embedded Systems: Applying Functional Pipelining," in PACT '97: Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques, (Washington, DC, USA), p. 156, IEEE Computer Society, 1997.
[110] T. Wiangtong, P. Cheung, and W. Luk, "Hardware/Software Codesign: A Systematic Approach Targeting Data-Intensive Applications," IEEE Signal Processing Magazine, vol. 22, pp. 14–22, May 2005.
[111] R. Ramaswamy, N. Weng, and T. Wolf, "Application Analysis and Resource Mapping for Heterogeneous Network Processor Architectures," in Network Processor Design: Issues and Practices, pp. 103–119, Morgan Kaufmann, 2004.
[112] P. Chandraiah and R. Dömer, "Code and Data Structure Partitioning for Parallel and Flexible MPSoC Specification Using Designer-Controlled Recoding," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 1078–1090, June 2008.
[113] G. Ottoni, R. Rangan, A. Stoler, and D. I. August, "Automatic Thread Extraction with Decoupled Software Pipelining," in MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, (Washington, DC, USA), pp. 105–118, IEEE Computer Society, 2005.
[114] W. Thies, V. Chandrasekhar, and S. Amarasinghe, "A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs," in MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, (Washington, DC, USA), pp. 356–369, IEEE Computer Society, 2007.
[115] M. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. August, "Revisiting the Sequential Programming Model for Multi-Core," in MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, (Washington, DC, USA), pp. 69–84, IEEE Computer Society, 2007.
[116] P. Brucker, "On the Complexity of Clustering Problems," in Optimization and Operations Research (R. Henn, B. Korte, and W. Oettli, eds.), vol. 157 of Lecture Notes in Economics and Mathematical Systems, Berlin: Springer, 1978.


[117] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231, 1996.
[118] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman, January 1979.
[119] OSCI, "Open SystemC Initiative." http://www.systemc.org.
[120] C. Haubelt, T. Schlichter, J. Keinert, and M. Meredith, "SystemCoDesigner: Automatic Design Space Exploration and Rapid Prototyping from Behavioral Models," in DAC '08, (New York, NY, USA), pp. 580–585, ACM, 2008.
[121] Y. Ahn, K. Han, G. Lee, H. Song, J. Yoo, K. Choi, and X. Feng, "SoCDAL: System-on-Chip Design AcceLerator," ACM Trans. Des. Autom. Electron. Syst., vol. 13, no. 1, pp. 1–38, 2008.
[122] P. Destro, F. Fummi, and G. Pravadelli, "A Smooth Refinement Flow for Co-Designing HW and SW Threads," in DATE '07, 2007.
[123] MathWorks, "Simulink - Simulation and Model-Based Design." http://www.mathworks.com/products/simulink/.
[124] K. Popovici, X. Guerin, F. Rousseau, P. S. Paolucci, and A. A. Jerraya, "Platform-Based Software Design Flow for Heterogeneous MPSoC," ACM Trans. on Embedded Computing Sys., vol. 7, no. 4, pp. 1–23, 2008.
[125] L. B. Brisolara, M. F. S. Oliveira, R. Redin, L. C. Lamb, L. Carro, and F. Wagner, "Using UML as Front-end for Heterogeneous Software Code Generation Strategies," in DATE '08, (New York, NY, USA), ACM, 2008.
[126] OMG, "Object Management Group - Unified Modeling Language (UML)." http://www.uml.org/.
[127] W. Tibboel, V. Reyes, M. Klompstra, and D. Alders, "System-Level Design Flow Based on a Functional Reference for HW and SW," in DAC '07: Proceedings of the 44th ACM/IEEE Design Automation Conference, June 2007.
[128] M. Thompson, H. Nikolov, T. Stefanov, A. D. Pimentel, C. Erbas, S. Polstra, and E. F. Deprettere, "A Framework for Rapid System-Level Exploration, Synthesis, and Programming of Multimedia MP-SoCs," in CODES+ISSS '07, pp. 9–14, 2007.
[129] S. Mahadevan, K. Virk, and J. Madsen, "ARTS: A SystemC-Based Framework for Multiprocessor Systems-on-Chip Modelling," Design Automation for Embedded Systems, vol. 11, pp. 285–311, December 2007.
[130] A. Gerstlauer, H. Yu, and D. D. Gajski, "RTOS Modeling for System Level Design," in DATE '03, 2003.

[131] H. Posadas, J. Adámez, P. Sánchez, E. Villar, and F. Blasco, "POSIX Modeling in SystemC," in ASP-DAC '06, pp. 485–490, IEEE Press, 2006.
[132] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao, SpecC: Specification Language and Methodology. Springer-Verlag New York, Inc., 2000.
[133] A. D. Pimentel, M. Thompson, S. Polstra, and C. Erbas, "Calibration of Abstract Performance Models for System-Level Design Space Exploration," Journal of Signal Processing Systems, vol. 50, no. 2, pp. 99–114, 2008.
[134] T. Meyerowitz, A. Sangiovanni-Vincentelli, M. Sauermann, and D. Langen, "Source-Level Timing Annotation and Simulation for a Heterogeneous Multiprocessor," in DATE '08, March 2008.


[135] Z. Wang and A. Herkersdorf, "An Efficient Approach for System-Level Timing Simulation of Compiler-Optimized Embedded Software," in 46th Design Automation Conference (DAC '09), 2009.
[136] E. Cheung, H. Hsieh, and F. Balarin, "Fast and Accurate Performance Simulation of Embedded Software for MPSoC," in ASP-DAC '09: Proceedings of the 2009 Asia and South Pacific Design Automation Conference, (Piscataway, NJ, USA), pp. 552–557, IEEE Press, 2009.
[137] T. Kempf, M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, and B. Vanthournout, "A Modular Simulation Framework for Spatial and Temporal Task Mapping onto Multiprocessor SoC Platforms," in DATE '05, 2005.
[138] T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers, and H. Meyr, "A SW Performance Estimation Framework for Early System-Level-Design Using Fine-Grained Instrumentation," in DATE '06, 2006.
[139] CoWare, "Platform Architect." http://www.coware.com/products/platformarchitect.php.
[140] OSCI, "TLM-2.0 User Manual." http://www.systemc.org/downloads/standards.
[141] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in CGO '04, 2004.
[142] L. Gao, J. Huang, J. Ceng, R. Leupers, G. Ascheid, and H. Meyr, "TotalProf: A Fast and Accurate Retargetable Source Code Profiler," in CODES+ISSS, 2009.
[143] JPEG, "Joint Photographic Experts Group." http://www.jpeg.org.
[144] "Independent JPEG Group." http://www.ijg.org.
[145] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing: Design and Analysis of Algorithms. Addison-Wesley, 2003.
[146] ITU-T, "H.264: Advanced Video Coding for Generic Audiovisual Services." http://www.itu.int/rec/T-REC-H.264.
[147] ARM, Advanced Microcontroller Bus Architecture. http://www.arm.com/products/solutions/AMBAHomePage.html.
