A Methodology for Efficient Multiprocessor System-On
Total Page:16
File Type:pdf, Size:1020Kb
A Methodology for Efficient Multiprocessor System-on-Chip Software Development Von der Fakult¨at f¨ur Elektrotechnik und Informationstechnik der Rheinisch–Westf¨alischen Technischen Hochschule Aachen zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften genehmigte Dissertation vorgelegt von M.Sc. Jianjiang Ceng aus Zhejiang/China Berichter: Univ.-Prof. Dr. rer. nat. Rainer Leupers Univ.-Prof. Dr.-Ing. Gerd Ascheid Tag der m¨undlichen Pr¨ufung: 28.04.2011 Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verf¨ugbar. Contents 1 Introduction 1 1.1 Background ...................................... 1 1.2 MPSoCSoftwareDevelopmentChallenge . ........ 3 1.2.1 ProgrammingModel .............................. 4 1.2.2 ApplicationParallelization . ...... 6 1.2.3 MappingandScheduling . .. 6 1.2.4 CodeGeneration ................................ 7 1.2.5 ArchitectureModelingandSimulation . ...... 7 1.2.6 Summary .................................... 8 1.3 ThesisOrganization. .. ... .. .. .. .. .. ... .. .. .. .. .. ..... 8 2 Related Work 11 2.1 GeneralPurposeMulti-Processor . ........ 11 2.1.1 Cell ....................................... 12 2.2 GeneralPurposeGraphicsProcessingUnit . ......... 13 2.3 Embedded Multi-Processor System-on-Chip . .......... 14 2.3.1 MPCore ..................................... 14 2.3.2 IXP ....................................... 14 2.3.3 OMAP...................................... 15 2.3.4 CoMPSoC.................................... 17 2.3.5 SHAPES..................................... 19 2.3.6 Platform2012.................................. 22 2.3.7 HOPES ..................................... 23 2.3.8 Daedalus..................................... 26 2.3.9 MPA....................................... 29 2.3.10 TCT....................................... 32 2.4 Summary ........................................ 34 3 Methodology Overview 35 3.1 MPSoC Application Programming Studio (MAPS) . ...... 35 3.1.1 ApplicationModeling. .. 36 3.1.2 ArchitectureModeling . .. 37 3.1.3 Profiling..................................... 38 3.1.4 Control/DataFlowAnalysis . ... 38 3.1.5 Partitioning .................................. 39 i ii Contents 3.1.6 Scheduling&Mapping . 40 3.1.7 CodeGeneration ................................ 40 3.1.8 High-LevelSimulation . .. 41 3.1.9 TargetMPSoCPlatform . 42 3.1.10 Integrated Development Environment . ........ 42 3.2 ContributionofThisThesis . ...... 42 4 Architecture Model 45 4.1 Performance Estimation and Architecture Model . ........... 45 4.2 RelatedWork ..................................... 46 4.3 ProcessingElementModel . ..... 47 4.3.1 CCompilationProcess . .. 47 4.3.2 OperationCostTable. .. 49 4.3.3 ComputationalCostEstimation . ..... 50 4.4 ArchitectureDescriptionFile. ........ 51 4.4.1 ProcessingElementDescription . ...... 52 4.4.2 ArchitecturalDescription. ..... 52 4.5 Summary ........................................ 54 5 Profiling 55 5.1 RelatedWork ..................................... 56 5.2 ProfilingProcessOverview. ...... 57 5.3 TraceGeneration ................................. ... 58 5.3.1 LANCEFrontend................................ 58 5.3.2 Instrumentation............................... .. 59 5.3.3 ProfilingRuntimeLibrary . ... 60 5.3.4 TraceFile .................................... 61 5.4 Post-Processing ................................. .... 62 5.4.1 IRProfileGeneration. .. 63 5.4.2 SourceProfileGeneration . ... 66 5.4.3 CallGraphGeneration . .. 68 5.5 Summary ........................................ 69 6 Control/Data Flow Analysis 71 6.1 ControlFlowGraph ................................ .. 72 6.2 StatementControlFlowGraph . ..... 73 6.3 StatementControl/DataFlowGraph . ....... 74 6.3.1 DataDependenceAnalysis. .. 75 6.4 Weighted Statement Control/Data Flow Graph . ......... 76 6.4.1 NodeWeightAnnotation. 77 6.4.2 ControlEdgeAnnotation. .. 77 6.4.3 DataEdgeAnnotation ............................. 78 6.5 Summary ........................................ 78 7 Partitioning 79 7.1 RelatedWork ..................................... 79 7.2 TaskGranularity ................................. ... 81 7.3 CoupledBlock.................................... .. 83 7.3.1 CBDefinition.................................. 83 ii Contents iii 7.3.2 SchedulabilityConstraint. ...... 84 7.3.3 DataLocalityConstraint. .... 84 7.4 CBGenerationandWSCDFGPartitioning . ....... 86 7.4.1 WSCDFGPartition............................... 86 7.4.2 PartitionOptimality . ... 87 7.4.3 CAHCAlgorithm................................ 89 7.5 GlobalAnalysis ................................... .. 95 7.6 UserRefinement .................................... 96 7.7 Summary ........................................ 97 8 High-Level Simulation 99 8.1 RelatedWork ..................................... 99 8.2 MVPOverview ..................................... 102 8.3 MVPSimulator..................................... 103 8.3.1 VirtualProcessingElement . ... 103 8.3.2 GenericSharedMemory . 106 8.3.3 UserInterfaceandVirtualPeripheral . ...... 107 8.4 SoftwareToolChain ............................... ... 108 8.4.1 ExecutionEnvironment . 108 8.4.2 CodeGeneration ................................ 109 8.5 DebugMethod ..................................... 113 8.6 Summary ........................................ 114 9 Case Study 115 9.1 JPEGEncoder ..................................... 115 9.1.1 TCTTools ................................... 115 9.1.2 Profiling..................................... 116 9.1.3 Partitioning .................................. 116 9.1.4 Simulation Result and Further Refinement . ....... 118 9.2 H.264BaselineDecoder. .... 119 9.2.1 TargetPlatform................................ 119 9.2.2 Profiling..................................... 121 9.2.3 Partitioning .................................. 122 9.2.4 TargetCodeGeneration . 122 9.2.5 SimulationResults ............................. 124 9.3 Summary ........................................ 126 10 Summary and Outlook 127 10.1Summary ........................................ 127 10.2Outlook ........................................ 128 AMAPSArchitectureDescriptionXMLSchema 131 A.1 ComponentLibrary................................. 131 A.2 ArchitectureModel ................................. 134 B Pseudo Code for Profiling 137 B.1 CodeInstrumentation . ... .. .. .. .. .. ... .. .. .. .. .. .. .... 137 B.2 FunctionCallCostCalculation . ....... 138 B.3 IRStatementExecutionCostCalculation . ......... 138 iii iv Contents B.4 CallGraphGeneration . .... 139 C MAPS Application Profile XML Schema 141 DPseudoCodeforControlDataFlowAnalysis 143 D.1 SCFGConstruction................................ ... 143 D.2 SCDFGConstruction............................... ... 144 D.3 DataDependenceAnalysis . .... 144 D.3.1 StaticDataAnalysis ............................. 144 D.3.2 DynamicAnalysis................................ 145 D.4 WSCDFGConstruction .............................. 146 D.4.1 NodeWeightAnnotation. 147 D.4.2 ControlEdgeWeightAnnotation . ... 147 D.4.3 DataEdgeWeightAnnotation. 148 Bibliography 151 iv Chapter 1 Introduction 1.1 Background Our lifestyles have been greatly changed by the various new electronic devices, which continuously appear in today’s market. For example, Personal Navigation Assistants (PNAs) help people to find driving direction without the need to read a map; Personal Digital Assistants (PDAs) are used by people to carry a vast amount of information with them just inside a pocket; Portable Multimedia Players (PMPs) allow people to enjoy music and video contents while they travel; and, last but not least, mobile/smart phones have fundamentally changed the way in which people are connected with each other. Behind these different devices, it is the advancement of technologies, especially the rapidly progressing semiconductor technology, that makes the development and production of such devices possible. Figure 1.1 shows a statistic of the transistor count of some Intel processors introduced in the past forty years. !! ! " # ; < = < = @A < C; C E < GC HI : ?? > B > > >F D DJ ¡ ¡ ¨ £ ¢ ¤ ¦ ¥ ¥ ¨ £ ¢ ¤ ¦ ¥ ¥ !! !+ " # I CAC = @ < C < < < I C : ? ? ? ¡ F B >F K> B > F > F L LM N O D O ¡ ¤ ¥ ¨ £ ¢ ¤ ¦ ¥ ¡ ¥ ¨ £ ¢ ¤ ¦ ¥ ¡ ¥ !! !* " # ¡ ¡ ¢ £ ¤ ¦ § ¥ ¢ £ ¤ ¦ ¨ ¨ ¥ 6 3 ¡ !! ! " ) # 9 ¢ £ ¤ ¦ ¨ ¨ ¨ ¥ 7 ¡ 8 ¢ £ ¤ ¦ ¥ 1 § © 7 !! !( " # 6 4 5 4 © 3 !! ! " ' # 2 © 1 0 © © !! !& " # © © § § © © !! !% " # Z [ [ Z P Y `S Q TU ^^QS X\ U ]S Q TU ^^QS _R !! !$ " # , - ./ +( + ! + +*! +* ++! ++ $!!! $!! $! ! $! ' ) )' ' ' ' ' Z P Y Q S TU V WX U R Figure 1.1: Moore’s Law & Transistor Count of Some Intel Processors 1 2 CHAPTER 1. INTRODUCTION It can be seen that the semiconductor industry has been advancing in a spectacular speed as predicted by Moore’s law [1], i.e., the number of transistors that can be fabricated on a single silicon chip doubles every two years. In order to utilize the massive amount of transistors, several years ago designers started integrating multiple processor cores into one chip. One of the reasons that caused this change is power dissipation. Since a processor’s power dissipation increases dramatically when its clock frequency [2] is raised, it would soon be necessary to use water cooling to cool a system down, if cranking up clock frequency is used as the only way to improve performance. By employing multiple processors, the overall computational power is improved without increasing the clock speed; hence, a balance between performance and energy consumption can be achieved. Today, multiprocessing