
UCAM-CL-TR-588 Technical Report ISSN 1476-2986

Number 588

Computer Laboratory

MulTEP: A MultiThreaded Embedded Processor

Panit Watcharawitch

May 2004

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500

http://www.cl.cam.ac.uk/

© 2004 Panit Watcharawitch

This technical report is based on a dissertation submitted November 2003 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Newnham College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet:

http://www.cl.cam.ac.uk/TechReports/

ISSN 1476-2986

Summary

Conventional embedded processors have traditionally followed in the footsteps of high-end processor design to achieve high performance. Their underlying architectures prioritise tasks by time-critical interrupts and rely on software to perform scheduling. Single-threaded execution relies on instruction-based probabilistic techniques, such as dynamic scheduling and branch prediction. These techniques might be unsuitable for embedded systems where real-time performance guarantees need to be met [1, 2].

Multithreading appears to be a feasible solution with potential for embedded processors [3]. The multithreaded model benefits from the sequential characteristic of control-flow and the concurrency characteristic of data-flow [4]. Thread-level parallelism has the potential to overcome the limitations of insufficient instruction-level parallelism and to hide the increasing memory latencies. Earlier empirical studies on multithreaded processors [5, 4, 6, 7, 3] demonstrated that exploiting thread-level concurrency not only offers latency tolerance, but also provides predictable performance gains.

The MultiThreaded Embedded Processor (MulTEP) is designed to provide high-performance thread-level parallelism, real-time characteristics, a flexible number of threads and a low incremental cost per thread for embedded systems. In its architecture, a matching-store synchronisation mechanism [4] allows a thread to wait for multiple data items. A tagged up/down dynamic-priority hardware scheduler [8] is provided for real-time scheduling. Pre-loading, pre-fetching and colour-tagging techniques are implemented to allow context switches without any overhead. The architecture supports four additional multithreading instructions (i.e. spawn, wait, switch and stop) for programmers and advanced compilers to create programs with low-overhead multithreaded operations.

Experimental results demonstrate that multithreading can be effectively used to improve performance and system utilisation. Latency operations that would otherwise stall the pipeline are hidden by the execution of other threads. The hardware scheduler provides priority scheduling, which is suitable for real-time embedded applications.

Contents

Contents 4

List of Figures 10

List of Tables 14

I Background 15

1 Introduction 16
1.1 Prelude ...... 16
1.2 Research Motivation ...... 17
1.2.1 Inspiration for Targeting the Embedded Processor Market ...... 17
1.2.2 Inspiration for Multithreaded Execution ...... 18
1.3 Research Aims ...... 20
1.3.1 Challenges ...... 20
1.3.2 Contributions ...... 20
1.4 Dissertation Outline ...... 21

2 Architectural Level Parallelism 23
2.1 Introduction ...... 23
2.2 Control-flow Architectures ...... 24
2.2.1 Instruction Encoding ...... 24
2.2.2 Instruction Level Parallelism ...... 26
2.2.3 Thread Level Parallelism ...... 29

2.3 Data-flow Architectures ...... 30
2.3.1 Static Data Flow ...... 30
2.3.2 Dynamic Data Flow ...... 30
2.3.3 Explicit Token-store Data Flow ...... 31
2.4 Memory System ...... 31
2.4.1 Memory Hierarchy ...... 31
2.4.2 Virtual Memory ...... 34
2.4.3 Protection Mechanism ...... 34
2.5 Summary ...... 34

3 Multithreaded Processors 36
3.1 Introduction ...... 36
3.2 Multithreaded Architectures ...... 37
3.2.1 Context-switching Mechanisms ...... 38
3.2.2 Multiple Execution Contexts ...... 40
3.2.3 Concurrency Control ...... 40
3.2.4 Multiple-issue Models ...... 42
3.3 Support Environment ...... 44
3.3.1 Memory Management ...... 44
3.3.2 Software Support ...... 44
3.4 Efficiency Factors ...... 45
3.4.1 Design Trade-offs ...... 45
3.4.2 Processor Efficiency ...... 46
3.4.3 Efficiency Distribution ...... 47
3.4.4 Cost-effectiveness Model ...... 47
3.5 Summary ...... 48

4 Embedded Processors 50
4.1 Introduction ...... 50
4.2 Design Constraints and Solutions ...... 50
4.2.1 Real-time Response ...... 51
4.2.2 Low Power Consumption ...... 52

4.2.3 Low Cost and Compact Size ...... 54
4.2.4 Application Specifications ...... 55
4.3 Techniques in Embedded Architectures ...... 56
4.3.1 Specially Compact Instruction Encoding ...... 56
4.3.2 Predicated Instructions ...... 56
4.3.3 Subword-level Manipulation ...... 57
4.3.4 Thread Level Parallelism Support ...... 57
4.4 Summary ...... 58

II Architectural Design 59

5 System Overview 60
5.1 Introduction ...... 60
5.2 Background to the MulTEP Architecture ...... 60
5.2.1 Objectives ...... 61
5.2.2 Project Challenges ...... 61
5.2.3 Design Choices ...... 62
5.3 MulTEP System ...... 63
5.3.1 Overall Picture ...... 64
5.3.2 Execution Context ...... 65
5.3.3 Instruction Set Architecture ...... 66
5.3.4 Thread Life Cycle ...... 68
5.3.5 Synchronisation Technique ...... 69
5.3.6 Scheduling Technique ...... 69
5.3.7 Memory Protection ...... 70
5.3.8 Power Operating Modes ...... 70
5.4 Architectural Theoretical Investigation ...... 71
5.4.1 The Utilisation of the Processing Elements ...... 71
5.4.2 Incremental Cost per Thread ...... 75
5.5 Summary ...... 75

6 Hardware Architecture 77
6.1 Introduction ...... 77
6.2 Processing Unit ...... 77
6.2.1 Level-0 Instruction Cache ...... 80
6.2.2 Fetching Mechanism ...... 82
6.2.3 Context-switch Decoder ...... 84
6.2.4 Execution Units ...... 86
6.2.5 Colour-tagged Write Back ...... 87
6.3 Multithreading Service Unit ...... 89
6.3.1 Request Arbiter ...... 91
6.3.2 Spawning Mechanism ...... 91
6.3.3 Synchronising Mechanism ...... 92
6.3.4 Switching Mechanism ...... 93
6.3.5 Stopping Mechanism ...... 96
6.4 Load-Store Unit ...... 97
6.4.1 Load/Store Queues ...... 98
6.4.2 Load Operations ...... 100
6.4.3 Store Operations ...... 101
6.4.4 Multithreaded Operations ...... 102
6.5 Memory Management Unit ...... 102
6.5.1 Memory Hierarchy ...... 103
6.5.2 Memory Address Mapping ...... 104
6.5.3 Address Translation ...... 105
6.6 Summary ...... 108

7 Software Support 110
7.1 Introduction ...... 110
7.2 Low-level Software Support ...... 110
7.2.1 MulTEP Assembler ...... 112
7.2.2 MulTEP Assembly Macros ...... 113
7.3 Support for System Kernel ...... 117

7.3.1 System Daemon ...... 117
7.3.2 Exception/Interrupt Daemons ...... 121
7.3.3 Non-runnable States ...... 121
7.4 Support for High-level Languages ...... 122
7.4.1 Real-time Embedded Languages ...... 122
7.4.2 Java-to-native Post-compiler ...... 122
7.4.3 Native Compiler Implementation ...... 123
7.5 Summary ...... 124

III Evaluation and Conclusions 125

8 Evaluation and Results 126
8.1 Introduction ...... 126
8.2 Simulating the MulTEP Architecture ...... 127
8.2.1 Pre-simulation Strategies ...... 127
8.2.2 The MulTEP Simulator ...... 129
8.2.3 MulTEP Simulation Analysis ...... 132
8.3 MulTEP Simulation Results ...... 137
8.3.1 Selected Benchmarks ...... 137
8.3.2 Processor Performance ...... 139
8.3.3 Efficiency of Multithreading Mechanisms ...... 144
8.3.4 Real-Time Response ...... 145
8.3.5 Side-effect Evaluation ...... 146
8.4 Summary ...... 147

9 Conclusions and Future Directions 149
9.1 Introduction ...... 149
9.2 Overview of MulTEP ...... 149
9.3 Key Contributions ...... 151
9.4 Related Work ...... 152
9.5 Future Directions ...... 154
9.5.1 Multi-ported Priority Queue ...... 154

9.5.2 Further Enhancements to the L0-Icache ...... 155
9.5.3 Thread-0 Daemon Performance Improvements ...... 155
9.5.4 Processor Scalability ...... 155
9.5.5 Cache Conflict Resolution ...... 156
9.5.6 Application Specific Processor Refinement ...... 156
9.5.7 Software Support ...... 156

A Tagged Up/Down Sorter 158
A.1 The Verilog HDL source code ...... 158
A.2 The simulation source code ...... 160
A.2.1 tgsort.hh ...... 160
A.2.2 tgsort.cc ...... 162

B Livermore Loop 7 164
B.1 The single-threaded Java code ...... 164
B.2 The multithreaded Java code ...... 164
B.3 The 14-thread assembly code ...... 165

C The Simulator 172
C.1 The MulTEP configuration ...... 172
C.2 The SimpleScalar configuration ...... 175
C.3 The Thread-0 System Daemon ...... 175

D Additional BNF for MulTEP assembler 177

Bibliography 179

Index 189

List of Figures

Chapter 1 1.1 Shipment trends of general and embedded processors...... 17 1.2 The increasing trend of processor-memory performance gap...... 19

Chapter 2 2.1 A control-flow architecture and a data-flow architecture...... 23 2.2 Instruction encoding of representative CISC and RISC...... 25 2.3 Horizontal and vertical instruction level parallelism...... 26 2.4 An address interpretation for obtaining data from a cache...... 33

Chapter 3 3.1 The combination of control flow and data flow...... 36 3.2 The design development of multithreaded architectures...... 37 3.3 Context switching mechanisms...... 38 3.4 The tagged up/down priority queue...... 42 3.5 The illustration of multithreading with multiple-issue...... 42 3.6 Relationship of processor efficiency and the number of contexts...... 46

3.7 Cost-effectiveness of a multithreaded processor when varying Ct...... 48

Chapter 4

4.1 The relationships of power and delay with Vdd and Vth...... 53 4.2 Leakage power on the increase...... 54 4.3 Embedded processor classification...... 55

Chapter 5

5.1 The overall picture of MulTEP...... 64 5.2 The execution context and the activation frame...... 65 5.3 A thread life cycle in hardware...... 68 5.4 The store signal is used for synchronising threads...... 69 5.5 Dynamic-priority multithreading illustration...... 70 5.6 An overview of the power-mode generator...... 71 5.7 The model of the MulTEP processing unit (M/M/s/C)...... 72 5.8 Utilisation U with different values of C and µ−1...... 74 5.9 Processor utilisation with different values of p and µ−1...... 74 5.10 Incremental parts per one thread in the processing element...... 75 5.11 The affected part after adding one thread in MulTEP...... 75

Chapter 6 6.1 The MulTEP hardware architecture...... 78 6.2 An overview of the processing unit...... 79 6.3 A state diagram of each level-0 instruction cache line...... 81 6.4 Pre-fetching operation on the level-0 instruction cache...... 81 6.5 The access procedure in the level-0 instruction cache...... 82 6.6 Fetching and pre-fetching operation by the fetch unit...... 83 6.7 The 3-bit state diagram of the context status...... 85 6.8 The extended execution stage of one processing element...... 87 6.9 The processing unit to support zero context-switching overhead...... 88 6.10 Thread creation and termination in the activation-frame cache...... 90 6.11 The multithreading service unit...... 90 6.12 Preparing an empty activation frame in advance...... 92 6.13 Spawning a new thread...... 93 6.14 The store mechanism...... 94 6.15 Preparing the index to the highest runnable AF in advance...... 95 6.16 Issuing a switching decision to the context-switch handling unit...... 95 6.17 The stopping mechanism...... 96 6.18 Instruction flows of load and store instruction without the LSU...... 97

6.19 Instruction flows of load and store instruction with the LSU...... 98 6.20 The load-store unit...... 99 6.21 The memory management unit...... 103 6.22 Virtual address segmentation and address mapping...... 104 6.23 Translation Base Address...... 105 6.24 Activation Frame Address...... 105 6.25 A virtual address translation when TLB hit...... 106 6.26 A virtual address translation when TLB miss...... 107 6.27 The initial state of a translation table in MulTEP...... 108

Chapter 7 7.1 Software Overview for MulTEP...... 111 7.2 MulTEP assembler's grammar in BNF...... 112 7.3 The generic life cycle of a thread...... 113 7.4 The life cycle of a Java thread...... 113 7.5 A new macro...... 114 7.6 A start macro...... 114 7.7 A sleep macro...... 115 7.8 A join macro...... 115 7.9 A suspend macro...... 116 7.10 A wnotify macro...... 116 7.11 A resume macro...... 116 7.12 A notify macro...... 117 7.13 A kill macro...... 117 7.14 An active-thread table and its extension to the main memory...... 119 7.15 A wait-for-join table and its extension to the main memory...... 119 7.16 A wait-for- table and its extension to the main memory...... 120 7.17 A wait-for-notify table and its extension to the main memory...... 121 7.18 Java-threaded code compilation...... 123

Chapter 8 8.1 Results from simulating tgsort in Verilog...... 128

8.2 Results from simulating tgsort in C++...... 128 8.3 Results of the Verilog simulation of the MIPS pipeline...... 129 8.4 Results of the C++ simulation of the MIPS pipeline...... 130 8.5 An interpretation of the SimObj definition...... 130 8.6 The structure of the sample embedded objects in MulTEP...... 131 8.7 The illustration of nanothreading in the system...... 133 8.8 The PEs issue different store packages to the MSU...... 134 8.9 The required loaded data updates an invalid stored package...... 134 8.10 The normalised run time with different queue size...... 135 8.11 The multithreading version of Livermore Loop 7...... 136 8.12 The multithreading results of 2 LL7 programs in MulTEP...... 138 8.13 Speedups of MulTEP with single-/multi-threaded codes...... 140 8.14 PE/Rset Utilisation of single-threaded code...... 140 8.15 Speedup from multi-threading features in MulTEP...... 141 8.16 Similarity in memory hierarchy and functional units...... 142 8.17 The speedup of MulTEP based on the SimpleScalar simulation...... 143 8.18 The speedup of MulTEP based on 2 out-of-order superscalars...... 143 8.19 The percentages of zero-cycle context switches...... 145 8.20 Hit ratios of the L0-Icache with various numbers of workloads...... 145 8.21 The run length in % of execution time...... 146 8.22 Hit ratios of L1-Icache with various numbers of workloads...... 147 8.23 Hit ratios of DCache with various numbers of workloads...... 148

Chapter 9 9.1 Comparison with other related work...... 152

List of Tables

Chapter 5 5.1 Project criteria and the given priorities...... 61 5.2 Multithreaded design challenges...... 62 5.3 Embedded design challenges...... 63 5.4 The MulTEP solutions...... 63 5.5 The representations of flags in the status register...... 67 5.6 Utilisation U with different values of p, C and µ−1...... 73

Chapter 6 6.1 The states of a cache block in the level-0 instruction cache...... 80 6.2 A round-robin waiver logic table...... 100 6.3 The ContextInfo states from the MSU to the LSU...... 102

Chapter 7 7.1 A state of the thread...... 121

Chapter 8 8.1 Selected single-threaded benchmarks from SPEC CPU2000...... 141 8.2 The configuration of the enhanced superscalar...... 144

Chapter 9 9.1 A comparison of multithreaded embedded architectures...... 153

Part I

Background

Chapter 1

Introduction

If we knew what it was we were doing, it would not be called research, would it?

Albert Einstein

1.1 Prelude

Embedded processors are increasingly deployed in applications requiring high performance with good real-time characteristics whilst consuming low power. Contemporary high-performance embedded architectures often follow in the footsteps of high-end processor design and prioritise tasks by time-critical interrupts, leaving software to perform scheduling. Parallelism has to be extracted in order to improve performance at an architectural level. Extracting instruction level parallelism requires extensive speculation, which adds complexity and increases power consumption.

Multithreading appears to be a feasible alternative approach. The multithreading model benefits from the sequential characteristic of control-flow and the concurrency characteristic of data-flow [4]. Many embedded applications can be written in a threaded manner, and therefore multithreading in hardware may extract parallelism without speculation whilst keeping each component of the system quite simple.

The structure of this chapter is as follows: Section 1.2 discusses my motivations for researching embedded design and multithreaded architectures. Section 1.3 states the aims of my research. Section 1.4 presents the structure of the remaining chapters of this dissertation.


1.2 Research Motivation

This research is driven by two key observations:
1. There is a rapid growth in the embedded processor market which shows no signs of abating.
2. It is unclear how to provide this market with high-performance processors which are low power, have good real-time properties and are not overly complex to implement.
The following sections address the issues that surround these observations.

1.2.1 Inspiration for Targeting the Embedded Processor Market

The functionality required of embedded processors is determined by the processor market. Derived from McClure's survey [9], the trend of volume shipments of general purpose processors1 shows that approximately 100 million general purpose processors were shipped in the year 2000, with a 10% yearly growth rate (see Figure 1.1).

Figure 1.1: Shipment trends of general and embedded processors [9].

Devices ranging from industrial automation lines to commercial mobile equipment require embedded processors. Even PCs, which already contain a powerful CPU, also use

1 General purpose processors from McClure's survey [9] represent microprocessors.

additional embedded processors (e.g. drive controllers and I/O devices). The shipment of embedded processors was around 4 billion in the year 2000. This is approximately 40 times greater than the shipment of general purpose processors, with a potential growth rate of around 35% per year [9]. This trend reflects that the functional requirements from the market focus on the embedded models. Thus, designing a processor to support embedded system constraints is an interesting research area.

1.2.2 Inspiration for Multithreaded Execution

In a processor design, high computational performance is attained by minimising the execution time (Texe) of a system. The execution time Texe is a combination of the service time of the processor (Tservice) and its wasted latency (Tlatency)2, as shown in Equation 1.1.

Texe = Tservice + Tlatency (1.1)

Many research approaches have been proposed to reduce both the service time and the latency period. Equation 1.2 shows that the service time can be reduced by reducing the time spent per cycle (Tclk), reducing the number of instructions (ninst) or increasing the number of instructions per cycle (IPC).

Tservice = (Tclk × ninst) / IPC (1.2)
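To make the relationship concrete, the following is a minimal illustrative sketch in C++ of Equations 1.1 and 1.2; the numbers are hypothetical and are not measurements of any processor discussed in this report.

    #include <cstdio>

    int main() {
        // Hypothetical figures, chosen only to illustrate the equations.
        double t_clk     = 2e-9;   // seconds per cycle (a 500 MHz clock)
        double n_inst    = 1e6;    // dynamic instruction count
        double ipc       = 1.5;    // average instructions completed per cycle
        double t_latency = 4e-4;   // seconds wasted on stalls (memory, branches)

        double t_service = (t_clk * n_inst) / ipc;  // Equation 1.2
        double t_exe     = t_service + t_latency;   // Equation 1.1

        std::printf("Tservice = %g s, Texe = %g s\n", t_service, t_exe);
        return 0;
    }

With these figures the stall time (0.4 ms) is comparable to the service time (about 1.3 ms), which is why the rest of this section considers hiding Tlatency as well as reducing Tservice.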

To reduce Tclk, faster implementation technologies have been exploited [10]. Architectural techniques such as deeper pipelines, easy-to-decode instructions (Reduced Instruction Set Computers — RISC [11, 12]) and advanced circuit techniques have been used to shorten critical paths. However, the development in these areas is slowing because of physical limitations in semiconductor fabrication technologies [13, 14]. Furthermore, 50 years of research in low-level architecture and circuit techniques has left little scope to make improvements here.

To reduce ninst, ideas from Complex Instruction Set Computers (CISC) [15, 16, 17, 18, 19] and Reconfigurable Instruction Set Processors (RISP) [20] could be exploited. The disadvantage of these processors is the increase in Tclk due to their instruction complexity, which requires complex implementations. A single pipeline allows the IPC to be improved up to one instruction per clock cycle. Multiple pipelines are used by multiple-instruction-issue architectures to push the IPC to multiple instructions per clock cycle. Example multi-issue architectures include superscalar, Very Long Instruction Word (VLIW) [21] and Explicitly Parallel Instruction Computer (EPIC) [22] designs. The superscalar approach requires aggressive instruction scheduling at the

2 Tlatency occurs when a processor has to wait for memory access or certain operations (e.g. PC redirection from a branch instruction) during execution.

hardware level, which results in complex implementations whose cost grows alarmingly as the number of pipelines increases. VLIW and EPIC push some of the instruction scheduling burden onto the compiler.
Figure 1.2 illustrates an improvement in processor performance of 35-55% a year3 due to alterations in the architecture. In contrast, memory access latency, Tlatency, only improves around 7% per year. This memory access latency is increasingly a crucial factor which limits performance. Thus, the trend shows a gap between processor cycle times and memory access times growing at approximately 50% per year [23].


Figure 1.2: The increasing trend of processor-memory performance gap [23].

Tlatency is reduced by the use of caches [24], vector operations [25] and speculative prefetching/pre-loading mechanisms [26]. Alternatively, latency (Tlatency) can be hidden by concurrent execution. Today's high-end processors execute many instructions per cycle using Instruction Level Parallelism (ILP). Instruction level parallelism is exploited using statistical speculation, such as dynamic scheduling and branch prediction. Speculative execution to exploit instruction level parallelism requires additional dedicated hardware units, which consume power, and extra operations to eliminate mispredicted tasks.
Parallelism beyond instruction level parallelism, such as thread level parallelism, has been explored further. In 1992, DEC architects believed that up to ten times the performance improvement could be gained by

3 The improvement is based on the processor performance since 1980.

the contribution of Thread Level Parallelism (TLP) [27] within 20 years. A number of investigations [28, 29, 30] indicate that thread level parallelism, i.e. multithreading, at the hardware level has great potential to efficiently hide memory and I/O access latencies. Though thread level parallelism has the potential to improve processing performance, a framework to support thread level parallelism at the hardware level has not yet been standardised. As a result, thread level parallelism has not been utilised to the same degree as the pipelining or superscalar techniques that effectively provide instruction level parallelism. There is plenty of room for research in this area.

1.3 Research Aims

My research aims to develop a high-performance multithreaded processor architecture for embedded environments. A number of design challenges to support such a requirement are first identified (§1.3.1), followed by the main contributions that this research intends to offer to the architectural community (§1.3.2).

1.3.1 Challenges

To design a multithreaded embedded architecture, a number of challenges arose:
1. Exploit thread level parallelism to improve performance by hiding operating latencies (e.g. memory access, branch delay slots).
2. Support a flexible number of threads with a minimal incremental cost per thread.
3. Be able to schedule the threads in order to meet real-time constraints.
4. Avoid speculative execution mechanisms that degrade real-time performance, consume extra power executing instructions whose results are never used, and require extra memory to store and analyse statistical data.

1.3.2 Contributions

This section outlines the contributions of this thesis in areas central to the development of a MultiThreaded Embedded Processor (MulTEP):
1. An effective framework to support multithreading in hardware:
• Minimal mechanisms to progress a thread through its life cycle.
• Minimal mechanisms to provide low-overhead context switching.
• Minimal mechanisms to schedule threads that have a dynamic characteristic (i.e. a change of deadline/priority).
2. A solution to provide a minimal incremental cost per thread.


3. A model to support a flexible number of threads in hardware.
4. A scheduling mechanism that offers real-time response.
5. A strategy to exploit thread level parallelism without resorting to speculative execution.

1.4 Dissertation Outline

The rest of the thesis is structured as follows:

Chapter 2: Architectural Level Parallelism
This chapter reviews architectural level parallelism in three key areas: the control-flow model, the data-flow model and the memory hierarchy. It critiques related work and determines the main requirements that the design and implementation of a processor architecture must satisfy.

Chapter 3: Multithreaded Processors
In this chapter, key multithreading theories are examined. Design criteria for multithreaded architectures are established. The current state of the art is evaluated.

Chapter 4: Background to Embedded Design
This chapter addresses embedded environment constraints and requirements. Current solutions in the embedded design space are then surveyed.

Chapter 5: System Overview
In this chapter, three desirable characteristics (high-performance multithreading, real-time operation and low power consumption) are identified. Key design strategies in the project, their integration and the overall architecture from the programmer's point of view are then described.

Chapter 6: Hardware Architecture
This chapter details the MulTEP hardware architecture, which comprises the Processing Unit (PU), the Load-Store Unit (LSU), the Multithreading Service Unit (MSU) and the Memory Management Unit (MMU).

Chapter 7: Software Support
In this chapter, software support tools (i.e. the MulTEP assembler, the Java-to-native post-compiler, MulTEP macros and the Thread-0 system daemon) for the MulTEP architecture are presented.

Chapter 8: Evaluation and Results
This chapter describes evaluation methodologies and procedures for examining the MulTEP system. Evaluation results are then presented.


Chapter 9: Conclusions and Future Directions
Conclusions of this research are drawn in this chapter. Areas of possible future research are then suggested.

Chapter 2

Architectural Level Parallelism

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl’s Law

2.1 Introduction

Parallelism has long been employed at the architectural level in order to improve both the performance and the utilisation efficiency of a processor. In architectural design, there are two predominant models. One is the control-flow processor model, which executes instructions in sequential order, as presented in Figure 2.1(a). The other is the data-flow processor model, where instructions are executed concurrently in a data-dependent ordering, as illustrated in Figure 2.1(b).


Figure 2.1: A control-flow architecture and a data-flow architecture.


This chapter focuses on the design evolution to support parallelism at the architectural level. It first provides an overview of the parallelism offered by the control-flow processor model (Section 2.2) and the data-flow processor model (Section 2.3). The memory system to support architectural level parallelism is then investigated (Section 2.4). Finally, a summary of the current architectural level parallelism issues is presented (Section 2.5).

2.2 Control-flow Architectures

Based upon Babbage’s proposal in 1838 [31], Eckert, Mauchly and von Neumann suc- cessfully introduced a control-flow processor model in 1944 [26]. The architecture is a design for processing a list of programming instructions according to their sequence. It uses a (PC) to indicate the current executed position in the execution sequence as depicted in Figure 2.1(a). The flow of execution is controlled by manipulating the value of PC using instructions such as branches and jumps which may be conditionally executed. Intermediate processed values are typically stored in local registers, an accu- mulator or a nearby stack [32]. Various mechanisms have been implemented to improve functionality and performance, which has enabled the model to dominate the computer industry for over 50 years. This section surveys the developments of the control-flow architectural design to support parallelism. The section first focuses on how an instruction encoding influences a control-flow architectural design ( 2.2.1). Then, instruction level parallelism in control- flow architectures is explained ( 2.2.2),§ followed by a discussion of adding thread level parallelism to the control-flow models§ ( 2.2.3). Further details of the control-flow architec- tures in the other aspects can be found in§ these texts [24, 26, 33, 34].

2.2.1 Instruction Encoding

An instruction consists of an opcode and its operands. Opcode is shorthand for operation code; it either explicitly encodes the controls for execution or implicitly represents an index into a set of microcodes that control the different parts of the processing element. Operands are either an immediate value, a displacement, an index into a register set, an index to a memory location or an additional operating function. CISC and RISC are the two extreme foundational Instruction Set Architectures (ISAs). The remainder of this section analyses how the two techniques deal with parallelism.

Complex Instruction Set Computer (CISC)

The CISC philosophy compresses an instruction into its most compact form, as illustrated in Figure 2.2(a). Its encoding technique causes instructions to vary in length (e.g. from 4 up to 56 bytes in the VAX [17]). The compact form offers a set of instructions from the simple (e.g. "add") to the complex (e.g. "evaluate polynomial"). The complex instructions in

24 2.2. Control-flow Architectures a compact form enable a program to be compiled into a condensed binary object. The condensed object reduces memory bandwidth requirements and memory space.

[Figure 2.2(a) shows a representative CISC encoding: variable-length instructions built from optional prefix, opcode, SIB, displacement and immediate fields. Figure 2.2(b) shows the fixed 32-bit RISC (MIPS IV) formats: R-Type (opcode, rd, rs, rt, shift amount and function fields of 6, 5, 5, 5, 5 and 6 bits), J-Type (6-bit opcode, 26-bit target), B-Type (opcode, rd, rs and branch target) and I-Type (opcode, rd, rs and immediate).]

Figure 2.2: Instruction encoding of representative CISC and RISC.

The disadvantages of CISC stem from implementation complexities. The architecture incurs considerable temporal and area overhead in order to decode complicated instructions. Its intermediate computation stages and values are not easily separated. Hence, the architecture carries a significant burden when it needs to switch its execution context. In addition, around 20% of the instructions are used frequently whilst the circuits provided for the other 80% are not effectively utilised. Hardware complexity results in large area and high power dissipation. This architecture has successfully dominated commodity microprocessors (e.g. x86 [15], Motorola 68k [16]) and was common for minicomputers (VAX [17], PDP-11 [18]). Modern CISCs often decode instructions into one or more RISC-like operations (e.g. Intel's µops [15], AMD's r-ops [19]).

Reduced Instruction Set Computer (RISC)

The RISC philosophy is based on the goal of "making the common case fast". Its architecture provides rapid execution by using a small set of simplified fixed-length instructions, as presented in Figure 2.2(b). This encoding allows instructions to be aligned on word boundaries and fetched in a single cycle. Opcodes and operands are easily segmented and quickly decoded due to their fixed positions. This feature allows the architecture to have a short and simpler pipeline [11]. RISC code density is not as good as CISC. Complicated operations need to be constructed from a sequence of instructions. Fixed-length instructions limit the number and size of operands, so that multiple memory operands become impractical. RISC has to eliminate complex addressing modes and, instead, relies on register-based addressing

modes and specific load and store instructions to access memory. Because of this limitation, data needs to be transferred to registers prior to use, resulting in a larger number of registers. The advantages of RISC are its simple architecture, which allows a rapid design and implementation cycle; the separation of memory access instructions (i.e. load and store), which helps the compiler to schedule instructions efficiently1; the possibility of an independent load/store unit that handles memory access instructions separately from the other computational instructions; and room to add new instructions [12]. Processors based on RISC ISAs have been used in the domains of servers (e.g. SPARC [35], MIPS [36], Alpha [27], PowerPC [37], System/370 [25], PA-RISC [38], IA-64 [22]) and embedded applications (e.g. ARM [39], SuperH [40]).
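As an illustration of why fixed field positions simplify decoding, the following C++ sketch extracts the fields of a MIPS-like R-type instruction with a handful of shifts and masks; it follows the standard MIPS field order (opcode, rs, rt, rd, shamt, funct) and is not part of any MulTEP tool.

    #include <cstdint>
    #include <cstdio>

    struct RType { uint32_t opcode, rs, rt, rd, shamt, funct; };

    // Every field sits at a fixed bit position, so decoding is a few
    // shifts and masks -- no length calculation is needed first.
    RType decode_rtype(uint32_t word) {
        RType r;
        r.opcode = (word >> 26) & 0x3F;  // 6 bits
        r.rs     = (word >> 21) & 0x1F;  // 5 bits
        r.rt     = (word >> 16) & 0x1F;  // 5 bits
        r.rd     = (word >> 11) & 0x1F;  // 5 bits
        r.shamt  = (word >>  6) & 0x1F;  // 5 bits
        r.funct  =  word        & 0x3F;  // 6 bits
        return r;
    }

    int main() {
        RType r = decode_rtype(0x012A4020u);   // encodes add $8, $9, $10
        std::printf("opcode=%u rs=%u rt=%u rd=%u funct=%u\n",
                    r.opcode, r.rs, r.rt, r.rd, r.funct);
        return 0;
    }

A CISC decoder, by contrast, must first establish the instruction length (prefixes, addressing-mode bytes and so on) before it even knows where the operand fields begin.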

2.2.2 Instruction Level Parallelism

The concept of Instruction Level Parallelism (ILP) was first described by Wilkes and Stringer in 1953 [41]: "In some cases it may be possible for two or more micro-operations to take place at the same time." Early ILP is based on two concepts. The first segments operation horizontally into pipeline stages, as presented in Figure 2.3(a). The second increases parallelism vertically in a superscalar form, as illustrated in Figure 2.3(b).


Figure 2.3: Horizontal and vertical instruction level parallelism.

The following sections review the pipelining and superscalar techniques which are used to exploit horizontal and vertical ILP. Then, alternative ILP architectures, VLIW and EPIC, are investigated.

1 The compiler can schedule instructions into the delay slots immediately after a load instruction.


Pipelining

The data path of a processor consists of multiple stages (e.g. instruction fetch, instruction decode, operand fetch, execution, memory access and register write back). Pipelining is introduced to improve processor utilisation by dividing the data path into stages in order to reduce the cycles per instruction down to 1 [24]. Pipelining uses latches to separate intermediate results from each functional block. Thus, a clock cycle is effectively shortened to the amount of time taken by the slowest stage. The technique automatically exploits temporal instruction level parallelism by overlapping the execution of sequential instructions in various functional units.
Unfortunately, pipelining introduces interdependencies between stages. Thus, the processor requires a hazard detection mechanism to deal with the following data, control and structural hazards:
1. Data hazard: an instruction requires the result of a previous instruction.
2. Control hazard: the PC needs to be corrected because of a previous branch instruction.
3. Structural hazard: two instructions need access to the same resource.
Once one of these interdependency violations occurs, the hazard detection logic stalls the relevant functional units until all conflicts are resolved. This introduces idle periods in the pipeline, called bubbles. In order to reduce bubbles, a number of techniques are used:
• To reduce data hazards, data forwarding or bypassing paths are introduced to make recent intermediate results available before they are written to the register file.
• To reduce control hazards, a branch prediction mechanism is used. This mechanism reduces branch delays and decreases incorrect instruction fetches. However, it requires large support units such as a branch history table, n-bit local/global predictors and a branch target buffer.
• To reduce structural hazards, more access ports to shared resources, such as register ports, are introduced in order to avoid resource conflicts.
The disadvantages of pipelining are longer execution latencies for individual instructions and many additional hardware mechanisms, especially for branch prediction. However, the disadvantages are often outweighed by faster clock speeds and higher instruction throughput. As a result, by the early 1990s, all major processors from both industry and academia incorporated pipelining techniques into their architectural design [42].
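As a rough illustration of the cost of bubbles (the figures here are hypothetical and not taken from this report): if the ideal CPI is 1, 10% of instructions suffer a one-cycle load-use stall and 5% suffer a two-cycle branch stall, the effective CPI becomes 1 + (0.10 × 1) + (0.05 × 2) = 1.2, i.e. hazards alone give away roughly 20% of the pipeline's peak throughput, which is exactly what forwarding and branch prediction try to claw back.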

Superscalar

Not all instructions have the same latencies (i.e. integer, floating point, multiply/divide, memory access). To deal with this, multiple subpipelines are introduced in order to

simultaneously execute various computations in different functional units. This technique is called superscalar, a technique to exploit spatial instruction level parallelism.
Because superscalar needs multiple instructions per cycle, it requires a large window over the instruction stream. Within each window, there is no explicit information about dependencies between instructions, which must therefore be determined by the hardware. Dynamic scheduling is used to extract multiple independent instructions. Where control and data dependencies permit, instructions may even be issued out of order.
Regrettably, the technique introduces additional structural conflicts on shared resources, especially when register write-backs from each subpipeline do not occur in order. Thus, a reorder buffer and a register colouring technique are required at the final stage to keep instructions in order, to remove false dependencies and to undo changes after exceptions. The resolution of such structural conflicts requires multiple stalls.
In addition, a number of studies [43, 44, 45] report that only a few instructions (typically 2 to 4) can be executed in parallel in one basic block. It is possible to execute 5 to 9 instructions in parallel, but only when impractical design choices are made: an infinite table for branch prediction, perfect register alias analysis and a large instruction issuing window [46]. The lack of instruction level parallelism limits the number of functional units that can be utilised.
Speculative execution is proposed to provide sufficient instructions even before their control dependencies are resolved. This technique strives to issue an instruction to each subpipeline every cycle. It requires a mechanism to allocate a virtual register space to each speculative instruction. This mechanism requires a register alias table to map processor registers to a much larger set of temporary internal registers, and a scoreboard2 to keep track of register usage. The result is higher utilisation and statistically better performance. Nevertheless, mis-speculations need to be discarded. All effects of the speculatively executed instructions must be disposed of. This wastes both energy and execution time on the mis-speculated path.
In terms of hardware cost, the number of functional units goes up linearly with the required degree of parallelism. The complexity of the interconnection network between functional units grows quadratically, i.e. O(n²), with the degree of parallelism n [47]. This results in large, complex hardware with very significant wiring complexity, which pushes up power consumption and forces a lower clock speed.

VLIW and EPIC

A Very Long Instruction Word (VLIW) architecture [21] is introduced to exploit instruction level parallelism vertically. The architecture gains benefit from a compiler which statically allocates instructions to each subpipeline. Therefore, the architecture issues only one large fixed-length instruction to be processed on multiple independent subpipelines simultaneously. The operation occurs without issue checks, resulting in a high clock speed.

2 Scoreboard bits on registers are similar to the presence bits used in the static data-flow model (see §2.3.1).


However, because the utilisation relies heavily on the compiler, a compiler has to be specifically built for every possible implementation of each architecture in order to distribute suitable instructions to the supported subpipelines. An interrupt or exception in one functional unit causes a detrimental delay in all following functional units, because the later static issuing process needs to be stalled. The resulting code density is quite poor because a number of NOP (no-operation) instructions need to be issued when there is insufficient instruction-level parallelism in the software. Also, a single flow of control still lacks adequate independent instructions for this architecture [46].
To deal with VLIW's poor code density, Intel introduced an evolution of VLIW called the Explicitly Parallel Instruction Computer (EPIC), aka IA-64 [22]. This architecture uses a template to determine an appropriate instruction bundle and a stop bit as a barrier between independent groups, resulting in better code density. Therefore, it offers better code compression. Its compiler supports different processors. Nevertheless, it still requires a hardware interlock mechanism to stall the pipeline to resolve structural hazards, and it relies heavily on state-of-the-art compiler technology in order to deliver high performance.

2.2.3 Thread Level Parallelism

This section reviews the support of thread level parallelism in a traditional control-flow single processor and multiprocessors.

Single processor

A single processor sequentially executes instructions and benefits from efficient use of local storage. Disrupting the process requires the current execution context to be saved in order to be restored later. To support thread level parallelism in such a processor, interrupts are used to disrupt the flow of control [48]. The context switch is then conducted by the software scheduler. The process includes context save and restore, pipeline flush and refill, and cache behaviour manipulation. This context-switching overhead is so large that the technique is only appropriate when the context switch is really necessary and occurs infrequently.

Multiprocessors

Parallel execution of threads may be achieved using multiple control-flow processors. The multiprocessor architecture can be designed in the form of Multiple Instruction streams, Multiple Data streams (MIMD) [49]. There are two categories of MIMD architectures. The first is a shared-memory MIMD, where all processors have uniform access to a global main memory and require a hardware mechanism to maintain cache coherency. The second is a distributed-memory MIMD, where each processor has its own local memory. Data communication on a distributed-memory machine is effectively performed by message passing in the form of a token to another address. Coarse-grained TLP is achieved when each processor works on a specific thread and communicates when necessary.


Nevertheless, the cost and complexity of the communication network are very high. With current compiler technology, it is difficult to extract parallelism from a single thread group, and programmers are still required to write parallel code. Thus, multiprocessors are often designed for high computational performance in applications such as scientific computing, visualisation and financial modelling.

2.3 Data-flow Architectures

In the early 1970s, Dennis [50, 51] derived the fundamental principles of the data-flow model from the data-flow graph. In this architecture, a program is explicitly represented as a directed graph where nodes represent instructions and arcs represent data dependencies. The operation in each node is executed, i.e. fired, when a matching-store mechanism [52] indicates that all inputs are present, and its outcome is propagated to one or more nodes in the form of a token. This model naturally offers instruction level parallelism where data dependencies impose the order of execution. The architecture eliminates the requirement for a program counter [53, 54]. It requires coordination from a hardware scheduler to issue independent instructions as appropriate.
The remainder of this section presents three paradigms of the data-flow model: the static data-flow model (§2.3.1), the dynamic data-flow model (§2.3.2) and the explicit token-store data-flow model (§2.3.3).

2.3.1 Static Data Flow

In a static data-flow model (e.g. TI's DDP [55], DDM1 [56]), all nodes are pre-specified and at most one token (datum) per arc is allowed at a time. A node is enabled, i.e. updated, as soon as a token is present on its arc. For instance, a node is activated when its predecessor count reaches zero in TI's DDP [55]. The system uses a backward signal to inform previous nodes when its result has already been obtained. However, this static style restricts the implementation of shared functions. Mutual exclusion needs to be enforced when issuing operations, and the only available solution is an inefficient replication of functional units.

2.3.2 Dynamic Data Flow

The dynamic data-flow model [57, 58] is more flexible than the static model. Additional nodes may be generated at run-time, and multiple coloured tokens are allowed to be stored on one arc. The colour consists of the address of the instruction and the identifier of the computational context in which that value is used. Each invocation of a function is given a unique colour in order to represent functions recursively. Only tokens with the same colour are allowed to be matched in dyadic operations.


The problems with this model are that the colour-matching mechanism is expensive and temporally unpredictable due to its storage bounds, and that fan-out is uncontrollable and may result in matching-store overflow when too many concurrent parts need to be initiated [4].

2.3.3 Explicit Token-store Data Flow

An explicit token-store data-flow model [59, 60, 61] allocates a separate memory frame per function invocation, called an Activation Frame (AF) [59]. This alleviates the matching bottlenecks of the dynamic model by explicitly using the token-store address (computed by the composition of frame address and operand offset) for matching and firing operations.
The firing operation is driven by two different mechanisms. The first is data-driven, where operations may be executed as soon as their input data is available. Unconstrained parallelism can place a large burden on the matching-store mechanism, which unavoidably ends up being the bottleneck. The second is demand-driven. The demand-driven scheme only fires operations when their results are requested. The disadvantage of this data-pull scheme is that an additional operation-lookahead circuit is required [62].
The explicit token-store architecture naturally exploits ILP asynchronously because all active instructions may be executed simultaneously. It exposes maximum instruction level concurrency and allows execution to proceed without any need for caches (i.e. it tolerates memory latency). However, it requires data synchronisation on every event, forcing data to go via a data-matching mechanism which may act as an access bottleneck; the matching scheme may be extended to gain control-flow's advantage by allowing intermediate results to be held in registers, which reduces the number of accesses to the token-store matching mechanism (e.g. in EM-4 [60]).
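The matching-store idea can be sketched in a few lines of C++: each dyadic node in an activation frame keeps presence bits for its two operands and fires when the second one arrives. This is a deliberately simplified illustration of the general principle, not the mechanism of EM-4, Monsoon or MulTEP.

    #include <cstdint>
    #include <cstdio>
    #include <functional>

    // One two-operand slot of a matching store: presence bits plus the
    // operation to fire once both tokens have arrived.
    struct MatchSlot {
        int32_t operand[2] = {0, 0};
        bool    present[2] = {false, false};
        std::function<void(int32_t, int32_t)> fire;

        void deposit(int which, int32_t value) {
            operand[which] = value;
            present[which] = true;
            if (present[0] && present[1]) {       // tokens matched
                fire(operand[0], operand[1]);     // execute the node
                present[0] = present[1] = false;  // clear the slot
            }
        }
    };

    int main() {
        MatchSlot add_node;
        add_node.fire = [](int32_t a, int32_t b) {
            std::printf("fired: %d + %d = %d\n", a, b, a + b);
        };
        add_node.deposit(0, 7);    // first token arrives: nothing fires
        add_node.deposit(1, 35);   // second token arrives: the node fires
        return 0;
    }

The cost described in the text is visible even here: every operand delivery touches the slot's state, so the matching store sits on the critical path of every data transfer unless intermediate results are allowed to bypass it in registers.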

2.4 Memory System

As parallelism increases, so do the demands on the memory system for both programs and data. This section analyses the memory hierarchy (§2.4.1), virtual memory (§2.4.2) and protection mechanisms (§2.4.3). The analysis focuses on their features to support architectural level parallelism. Details of further enhancements for multithreading (i.e. thread-level parallelism) are given in the next chapter.

2.4.1 Memory Hierarchy

Due to the increasing performance gap between a processor and its main memory, mentioned in the previous chapter, several levels of fast caches are introduced in order to store recently-used memory areas and alleviate memory latency problems. However, adding more units along the access path lengthens the main memory's access latency.


The caches are placed in such an order that the fastest, smallest one lies closest to the processor and the other slower, larger caches are located farther away. Valid data can be obtained immediately from a cache if its identity fields match, i.e. a hit. Otherwise, if there is a miss, a suitable cache block needs to be reloaded from a lower level, thereby suffering a large miss penalty.
Based on the Harvard architecture [24], the first level cache (L1) has been divided into an instruction cache and a data cache with completely separate data and address buses. Therefore, the L1 I-cache can be single-ported, read-only and wide access, while the L1 D-cache is multi-ported, read and write. This solution widens the memory bottleneck, which results in higher access utilisation.
When designing a memory organisation, data locality and access latency are two factors that need to be investigated. This section first reviews methods to provide data locality, along with comments about their benefits and possible enhancements for parallelism. Then, the section focuses on access latency and how it influences architectural design.

Data locality

The locality of the data has a strong impact on processor performance and power utilisation. It is common practice to exploit the following principles of locality [24]:
1. Spatial locality: addresses near a previous access tend to be referenced in the near future.
2. Temporal locality: recently accessed addresses are likely to be accessed again soon.
To help exploit spatial locality, a cache fetches data/instructions in a block, or cache line. Each cache line contains the requested data/instruction and its neighbours. Figure 2.4 presents an address interpretation for obtaining data from a cache. A tag field in the address is used as the block identification for the line that an index points to. The number of bits in the offset field is an important factor that determines the block size.
In order to support temporal locality, a block replacement policy incorporates time into its decision. Thus, a close-by cache tends to hold the most recently-used data. However, though a Least-Recently Used (LRU) replacement policy is the most natural method, its implementation is inefficient. Instead, a random replacement policy may be used, which proves to be efficient. Without adding much extra hardware, random replacement can be further improved by combining it with a Not Last Used (NLU) policy.
The block placement technique is also important. A simple direct-mapped cache is cheap in terms of circuitry and power, but provides a low hit ratio. At the other extreme, a high-hit-ratio fully associative cache unavoidably introduces high complexity in the memory system. Thus, in general, a medium n-way set-associative cache turns out to be preferable and proves to be sufficient with a value of n around 2 to 8 [24].

Figure 2.4: An address interpretation for obtaining data from a cache.
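The address interpretation in Figure 2.4 amounts to a few shifts and masks. The sketch below, in C++, uses a hypothetical direct-mapped cache with 64-byte lines and 256 sets purely for illustration; it is not the organisation of any cache in MulTEP.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical geometry: 64-byte lines (6 offset bits), 256 sets (8 index bits).
    constexpr uint32_t kOffsetBits = 6;
    constexpr uint32_t kIndexBits  = 8;

    struct CacheAddr { uint32_t tag, index, offset; };

    CacheAddr split(uint32_t addr) {
        CacheAddr a;
        a.offset = addr & ((1u << kOffsetBits) - 1);                  // byte within the line
        a.index  = (addr >> kOffsetBits) & ((1u << kIndexBits) - 1);  // which set/line
        a.tag    = addr >> (kOffsetBits + kIndexBits);                // block identification
        return a;
    }

    int main() {
        CacheAddr a = split(0x1234ABCDu);
        std::printf("tag=0x%X index=0x%X offset=0x%X\n", a.tag, a.index, a.offset);
        return 0;
    }

On an access, the index selects a line, the stored tag is compared with the address tag, and a match on a valid line signals a hit; a larger offset field (bigger lines) trades more spatial locality against fewer, more contended sets.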

Access Latency

A cache access time is a combination of parameters as presented in Equation 2.1.

A = (1 − p)C + pM (2.1)

where A is the access time, C is the cost of a hit, p is the proportion of misses and M is the miss penalty.
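As a worked illustration of Equation 2.1 (the figures are hypothetical and not drawn from this report's evaluation): with a one-cycle hit cost (C = 1), a 5% miss proportion (p = 0.05) and a 40-cycle miss penalty (M = 40), the average access time is A = (1 − 0.05) × 1 + 0.05 × 40 = 2.95 cycles, so even a small miss rate nearly triples the effective access time; this is why the discussion below concentrates on reducing p and M.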

Parameters A and C are mostly influenced by physical level design and the structure of the memory hierarchy. Parameters p and M are affected by architectural level design. The miss rate p is reduced by increasing the block size to exploit spatial locality, increasing the cache associativity, and using a suitable replacement policy that works well with the characteristics of the benchmarks. The miss penalty M depends on how cache writes are handled [26]. There are two approaches to handling writes:
1. Write-through approach: A cache with a write-through approach replicates all written data to the lower levels. The approach takes time when writing, so it needs a write buffer to accept data from the processor at full rate. The advantage is that cache lines need only be allocated on read misses. Additionally, the approach ensures that the data seen by multiple devices remains coherent.
2. Write-back approach:


A cache with a write-back policy stores the data in the cache and only writes it back to the lower level on a cache spill. A cache line is allocated whenever there is a read or write miss. This reduces the write traffic to lower levels of the memory hierarchy. This method makes the store process faster but requires mechanisms to handle data consistency, especially when the address space is shared among various processes.
In addition, access latency may be reduced by techniques such as allowing hit-after-miss, or fetching with an early restart, i.e. critical word first. The latter technique delivers the requested word to the processor as quickly as possible without waiting for the complete transfer of the cache line.

2.4.2 Virtual Memory

In general, programs, data and devices are indexed in memory by a physical address. However, the logical address space is normally larger than the physical address space. For example, an architecture with a 32-bit address can support up to 4 GB (2^32 bytes) of address space. The availability of virtual space is ideal for parallel applications because it allows them to view the memory space independently. Access by virtual address also simplifies the allocation of physical memory resources.
The virtual address space is typically divided into pages. The address of each page has to be translated from a virtual address to a physical address. Virtual-to-physical page translations are normally held in a translation table, which is allocated in the memory itself. However, it is impractical to access memory every time a virtual address needs to be translated. Therefore, an additional cache, called a Translation Look-aside Buffer (TLB), is provided to store a set of recent translations [26]. Traditionally, the TLB does not only hold the page translations; it also provides additional information to let the system validate the access permissions of a transaction before it accesses the physical memory.
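The page-based translation described above can be sketched as follows in C++; the 4 KB page size and the single cached mapping are hypothetical choices for illustration only and do not describe the MulTEP MMU, which is covered in Chapter 6.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    constexpr uint32_t kPageBits   = 12;                      // 4 KB pages
    constexpr uint32_t kOffsetMask = (1u << kPageBits) - 1;

    // The TLB is modelled as a small map from virtual to physical page numbers.
    std::unordered_map<uint32_t, uint32_t> tlb = { {0x00012u, 0x04A7Fu} };

    bool translate(uint32_t vaddr, uint32_t &paddr) {
        uint32_t vpn    = vaddr >> kPageBits;                 // virtual page number
        uint32_t offset = vaddr & kOffsetMask;                // byte within the page
        auto hit = tlb.find(vpn);
        if (hit == tlb.end())
            return false;                                     // TLB miss: walk the translation table
        paddr = (hit->second << kPageBits) | offset;          // TLB hit
        return true;
    }

    int main() {
        uint32_t paddr;
        if (translate(0x00012ABCu, paddr))
            std::printf("0x00012ABC -> 0x%08X\n", paddr);     // 0x04A7FABC
        return 0;
    }

A real TLB entry would also carry the protection bits mentioned in the next subsection, so that permission checking happens in the same lookup as translation.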

2.4.3 Protection Mechanism

A protection mechanism is required to protect each process from malicious or accidental modification by unwelcome processes. Its duties are to distinguish between authorised and unauthorised usage, to specify the controls to be imposed and to provide a means of enforcement. Protection information is typically stored with the virtual address translations, since they both refer to pages in memory.

2.5 Summary

Architectural-level parallelism is required for both performance improvement and processor utilisation. Though the traditional control-flow programmer's model executes instructions sequentially, implementations now support instruction level parallelism using

pipelining, superscalar, VLIW and EPIC approaches. These implementations allow a processing element to simultaneously execute multiple instructions from a single thread stream. Additionally, concurrency from multiple threads may be imposed on a single processor via an interrupt mechanism, leaving scheduling and context switching to software. In shared-memory multiprocessors, thread level parallelism is available, but at the expense of a complex cache-coherence protocol which requires a high-performance interconnection network. Loosely-coupled multiprocessors have limited applicability.
A data-flow model naturally supports instruction-level concurrency. Both static and dynamic data-flow models exploit uncontrolled parallelism. Their performance is throttled by matching-store transactions. Nevertheless, ideas from such architectures inspired processor architects to enhance the control-flow model with data-dependent features, such as register scoreboarding and out-of-order execution.
Instruction encoding strongly influences architectural design. The simplicity of RISC architectures allows their design and implementation to be faster than that of CISC architectures. Memory access via load and store instructions in RISC allows flexible instruction scheduling and independent data transaction operations. Memory access via virtual addresses imposes order between concurrent processes.
The next chapter describes extended models to support thread-level parallelism and demonstrates how an amalgam of control-flow and data-flow techniques can support thread level parallelism in hardware.

Chapter 3

Multithreaded Processors

Computing is similar to cookery. Programs, like recipes, are lists of instructions to be carried out. The raw materials, like ingredients, are data which must be sliced & diced in exactly the right way, and turned into palatable output as quickly as possible.

The Economist, April 19th 2001

3.1 Introduction

A multithreaded processor employs multiple streams of execution to hide idle periods due to memory access latency or the non-deterministic interval in which a processor waits for data dependencies to be resolved (e.g. while identifying independent instructions in an instruction window for issue) [6]. The architecture trades design simplicity for efficiency by combining control-flow and data-flow at the hardware level, as illustrated in Figure 3.1.

Figure 3.1: The combination of control flow and data flow (multiple threads, each with its own PC, are scheduled, context-switched and synchronised at the hardware level).


This chapter is divided into four parts. Section 3.2 first reviews the development of multithreaded architectural designs. Their supporting software and memory systems are discussed further in Section 3.3. Section 3.4 presents essential efficiency factors and appropriate evaluation methods for multithreaded processors. Section 3.5 summarises the necessary requirements for multithreaded architectural design.

3.2 Multithreaded Architectures

The idea of supporting multiple contexts in a single pipeline was first proposed around 1958 by both Honeywell [63] and the Gamma 60 [64] (see Figure 3.2). In 1964, the Cray-designed CDC 6600 proved that an architecture could gain benefit by running multiple independent operations in parallel on its peripheral processors.

Figure 3.2: The design development of multithreaded architectures (a timeline from 1958 to 2003 covering academic research, industrial/joint research and commercialised designs, including Honeywell, Gamma 60, CDC 6600, HEP, Horizon, MASA, Monsoon, EM-4, P-RISC, *T, TAM, Multiscalar, Tera MTA, Alewife, EARTH, SMT, Anaconda, CMP, IBM Cyclops and Intel hyper-threading).

Later, in 1976, the first commercial TLP architecture, the Heterogeneous Element Processor (HEP) [28], was released. The HEP architecture was designed to support large scientific calculations. Unfortunately, the HEP was not successful, perhaps because of the immaturity of software for exploiting TLP and the fact that memory access latency was not significant during that period. Nevertheless, its existence demonstrated that a complete multithreaded system was feasible, and it served as a basis for subsequent architectures.

In 1995, Tera Corporation successfully introduced the MultiThreading Architecture (MTA) [29] to the market [5]. The architecture contributes a reduction of programming effort in scientific computing, with scalable performance on industrial applications where parallelism is difficult to extract. Recently, Intel released a commodity multithreaded processor for servers, the Pentium IV Xeon, based on hyper-threading technology, i.e. Simultaneous MultiThreading (SMT) [30].

Various aspects of multithreaded architectural design are presented in the remainder of this section. Context-switch mechanisms for a single pipeline are first explained (§3.2.1). Then, different styles of execution contexts from a range of architectures are presented (§3.2.2), followed by the concurrency control models of multithreaded architectures (§3.2.3). The section finishes off with multithreaded extensions to multiple-issue models (§3.2.4).

3.2.1 Context-switching Mechanisms

For the execution of a single pipeline, there are two main approaches to context switching: cycle-by-cycle context switching and block context switching. Figure 3.3 presents an overview of these two switching mechanisms compared with context switching in a single-threaded processor.

Figure 3.3: Context switching mechanisms (a single-threaded processor, with its switching overhead, compared with cycle-by-cycle switching and block switching in multithreaded processors).

Cycle-by-cycle Switching

Cycle-by-cycle switching statically interleaves threads every clock cycle using a round-robin schedule (e.g. HEP [28], Tera MTA [29], Monsoon [59], TAM [65]). The advantages of this fine-grained technique are its zero context-switching overhead and the absence of the complex circuits that would otherwise be needed to handle instruction, data and control dependencies (e.g. issue speculation, data forwarding and branch prediction).
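The following C fragment is a minimal software model of this fine-grained policy: one thread is selected per cycle in strict round-robin order, so no switching penalty ever appears. The thread count and the cycle budget are arbitrary illustrative values.

    #include <stdio.h>

    #define NUM_THREADS 4

    int main(void)
    {
        unsigned pc[NUM_THREADS] = {0};               /* one PC per hardware context        */
        for (unsigned cycle = 0; cycle < 12; cycle++) {
            unsigned tid = cycle % NUM_THREADS;       /* round-robin selection, no penalty  */
            printf("cycle %2u: issue from thread %u (pc=%u)\n", cycle, tid, pc[tid]);
            pc[tid] += 4;                             /* each thread advances one instruction */
        }
        return 0;
    }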


Nevertheless, cycle-by-cycle switching poses the following disadvantages:
• The optimal number of active threads needs to be equal to or greater than the number of pipeline stages to support sufficient parallelism (e.g. Tera MTA provides 128 threads per pipeline).
• The performance of a single thread is degraded because the execution time is shared by multiple threads.
• The pipeline could be under-utilised when it processes an inadequate number of threads (e.g. less than the number of pipeline stages).

Block Switching

This technique, sometimes referred to as coarse-grain switching, allows a thread to run for a number of cycles before being switched out for the following reasons:

1. A latency event is encountered – A switching point is implicitly indicated by a cache miss, a branch delay slot, a failed synchronisation (e.g. Alewife's Sparcle processor [66]) or an invalid flag in a scoreboard that indicates an invalid value of the required register [67].
2. A context-switch indicator is reached – Context switching is explicitly indicated by special instructions. The switching signal is provided to break the current running thread into short control-flow microthreads [4]. The mechanism is necessary to avoid the domination of a single thread.

Context switching to another register space is done either by windowing a different register set onto a larger register space or by selecting an alternative register set using fast crossbars or multiplexers. There are a number of advantages of the block-switching technique:
• It requires a smaller number of threads for multithreading.
• The design allows each thread to execute at full processor speed.
• Load instructions can be grouped into one thread to prefetch the operands of a subsequent thread in order to avoid load latency penalties.

Though switching a register set with these approaches is very efficient, it supports only a restricted number of threads. Boothe and Ranade's study [68] indicates that the mechanism is inefficient at handling short latency events if they occur frequently, because the run length can be overshadowed by the context-switching penalty of several cycles needed to clear the processor pipeline.
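As a rough illustration of the block-switching policy, the sketch below decides when a running thread should be switched out: implicitly on a latency event such as a cache miss or a failed synchronisation, explicitly on a switch indicator, or when a maximum run length is exceeded so that a single thread cannot dominate. The event codes and thresholds are hypothetical.

    #include <stdio.h>

    typedef enum { EV_NONE, EV_CACHE_MISS, EV_SWITCH_INSTR, EV_FAILED_SYNC } event_t;

    /* Decide whether the currently running thread should be switched out. */
    static int should_switch(event_t ev, unsigned run_length, unsigned max_run)
    {
        if (ev == EV_CACHE_MISS || ev == EV_FAILED_SYNC)
            return 1;                      /* implicit switch point: a latency event   */
        if (ev == EV_SWITCH_INSTR)
            return 1;                      /* explicit context-switch indicator        */
        return run_length >= max_run;      /* bound the run length of a single thread  */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               should_switch(EV_NONE, 3, 16),        /* 0: keep running               */
               should_switch(EV_CACHE_MISS, 3, 16),  /* 1: switch on the cache miss   */
               should_switch(EV_NONE, 16, 16));      /* 1: run length exhausted       */
        return 0;
    }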


3.2.2 Multiple Execution Contexts

For a control-flow based architecture, the execution context of one thread consists of a PC and a register set. In order to support multithreading, multiple contexts are employed at the hardware level. The number of contexts in an architecture is generally based on the hardware budget and the cost of context switching.

In the early 1980s, multithreaded architectures were designed to minimise context-switching cost by using a minimal context size (e.g. HEP's context is only the PC [28]; an INMOS Transputer's context is just two registers, one for a workspace pointer and the other for the PC [69]). However, intermediate calculation results in such architectures need to be transferred to/from the memory, which imposes a huge burden on the memory hierarchy.

Later on, more use was made of registers (e.g. *T reduced data transfers by loading all necessary context into registers [70]), although the context switch operation remained rather expensive. Later models (e.g. Alewife [66], SMT [30]) added more register sets to store a few active contexts in the pipeline.

Systems which evolved from data-flow architectures, such as EM-4 [60], MDFA [71] and Anaconda [4], store their execution contexts in Activation Frames (AFs). The use of AFs removes the limit on the number of contexts because an AF can be relocated to any memory space. In Anaconda [4], an execution context is additionally associated with presence and deadline bits for concurrency support.

3.2.3 Concurrency Control

Operating multiple threads needs a robust concurrency model to co-ordinate all activities in the system. Parallelism at the hardware level requires mechanisms to handle thread communication, thread synchronisation and thread scheduling as presented below:

Thread Communication

Data dependencies exist between multiple threads, and multithreaded architectures prefer low inter-thread communication overhead. There are two methods through which threads can communicate with one another. The first method is based on shared memory (e.g. HEP [28], EM-4 [71] and Monsoon [59]); it uses thread synchronisation, such as mutual exclusion or monitors, in order to wait for data from a remote memory space. The second method is message passing (e.g. Alewife [66] and the Transputer [69]), by which a data package is transferred directly to the other threads.

Thread Synchronisation

Thread synchronisation is a mechanism whereby an activity of one thread is delayed until some event in another thread occurs. Control dependencies exist between multiple threads, and inter-thread synchronisation is required to co-ordinate these dependencies by synchronising the threads' signals and data.

There are two types of synchronisation. The first type is co-operative synchronisation, where threads wait for either data or control from one another; an example of this method is the producer-consumer model [72]. The second type is competitive synchronisation, which prevents interference on shared resources where simultaneous accesses by multiple threads may otherwise lead to data corruption. Mutual exclusion is provided to prevent the competitive case: a key acts as a guard, and only the thread holding the key may access, i.e. lock, the protected resource while the others have to wait until the resource is released, i.e. unlocked. A thread may hold many keys simultaneously and hence reserve many resources at the same time.

Synchronisation techniques in multithreaded architectures vary greatly. They include signal graphs consisting of local counters for fork/join [28, 73], full/empty bits [74], memory-based data-flow sequencing [59] and data-flow presence flags [60, 75, 29].
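Competitive synchronisation can be illustrated in software with a POSIX mutex playing the role of the key that guards a shared resource; only the holder of the key may update the resource, so concurrent increments do not corrupt it. This is a software analogy for the concept, not a description of any hardware mechanism.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;            /* the protected resource              */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);         /* acquire the key                     */
            shared_counter++;                  /* exclusive access                    */
            pthread_mutex_unlock(&lock);       /* release (unlock) the resource       */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", shared_counter);   /* 200000, without corruption    */
        return 0;
    }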

Thread Scheduling

Thread scheduling is crucial for multithreaded operation. The mechanism avoids resource allocation problems such as deadlock and starvation. A thread scheduler determines the execution order of all ready threads, which wait to be dispatched in accordance with their priorities or deadlines.

In practice, scheduling multiple threads at the hardware level has usually been implemented with a simple queue that provides First-In-First-Out (FIFO) order, or with a simple stack that offers Last-In-First-Out (LIFO) order. However, such simple techniques are inadequate both for avoiding starvation and for providing real-time scheduling. Therefore, information from the application level, such as a thread's priority or deadline, needs to be incorporated into the scheduling mechanism, and a hardware sorting queue is required.

A number of hardware sorting queues have been introduced (e.g. a parallel sorter [76], a sorting network [77] and a systolic-array sorter [78]). However, they either require more than a single cycle to make a decision, or need O(n) comparators to sort n elements. Fortunately, Moore introduced a fast scheduling queue called the tagged up/down priority queue [8], as depicted in Figure 3.4. The tagged up/down priority queue provides four advantages:
1. Data insertion and data extraction can be performed every cycle.
2. Identical priorities/deadlines are extracted in the FIFO order of insertion in order to preserve their sequential properties.
3. Queue status (empty and full signals) is provided.
4. Only n/2 comparators are required to sort up to n data entries.


Figure 3.4: The tagged up/down priority queue (a chain of sorting cells holding key/data pairs, with insert/extract controls and empty/full status signals).
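The behaviour that the tagged up/down queue provides can be modelled in software as follows: each entry carries an insertion-sequence tag, extraction returns the most urgent entry, and entries of equal priority leave in FIFO order of their tags. The model uses a linear scan purely for clarity; the hardware achieves the same ordering with n/2 comparator cells and single-cycle operations. Treating a larger priority value as more urgent is an assumption of the sketch.

    #include <stdint.h>
    #include <stdio.h>

    #define QUEUE_SIZE 16

    typedef struct {
        uint8_t  priority;   /* larger value = more urgent (assumption of this sketch) */
        uint32_t tag;        /* insertion order, used to break ties FIFO-fashion       */
        uint32_t data;       /* e.g. a thread identifier                               */
        int      used;
    } entry_t;

    static entry_t  q[QUEUE_SIZE];
    static uint32_t next_tag = 0;

    static int insert(uint8_t priority, uint32_t data)
    {
        for (int i = 0; i < QUEUE_SIZE; i++) {
            if (!q[i].used) {
                q[i] = (entry_t){ priority, next_tag++, data, 1 };
                return 0;
            }
        }
        return -1;                                   /* full  */
    }

    static int extract(uint32_t *data)
    {
        int best = -1;
        for (int i = 0; i < QUEUE_SIZE; i++) {
            if (!q[i].used) continue;
            if (best < 0 ||
                q[i].priority > q[best].priority ||
                (q[i].priority == q[best].priority && q[i].tag < q[best].tag))
                best = i;
        }
        if (best < 0) return -1;                     /* empty */
        *data = q[best].data;
        q[best].used = 0;
        return 0;
    }

    int main(void)
    {
        uint32_t tid;
        insert(3, 100); insert(7, 200); insert(7, 201); insert(1, 300);
        while (extract(&tid) == 0)
            printf("dispatch thread %u\n", tid);     /* 200, 201, 100, 300 */
        return 0;
    }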

3.2.4 Multiple-issue Models

Scaling conventional multiple-issue processors for multithreading increases complexity and cost, but sufficient parallelism must be exploited to meet a high-throughput goal. Three different implementations have been introduced to allow instructions to be issued concurrently from multiple threads: the trace processor, Simultaneous MultiThreading (SMT) and the Chip MultiProcessor (CMP). Figure 3.5 illustrates their multithreaded operation in comparison with two single-threaded multiple-issue architectures – a superscalar architecture and a VLIW architecture.

Figure 3.5: The illustration of multithreading with multiple-issue: issue slots over time for two single-threaded architectures (superscalar and VLIW) and three multithreaded architectures (trace processor, SMT and CMP).


Trace Processor

Trace processors require very complex hardware to exploit parallelism by speculatively spawning threads without the need for software re-compilation [79, 80]. They normally use a trace cache [81] to identify threads at runtime. Inter-thread communication is then performed with the help of a centralised global register file. The architecture exploits both control-flow and data-flow hierarchies, using a statistical method for speculative execution supported by a misprediction handling mechanism. Unfortunately, the additional complexity in the control logic introduces delay into the system. In addition, the large amount of hardware devoted to speculative execution wastes power during the execution of parallel applications.

Simultaneous MultiThreading

Simultaneous MultiThreading (SMT) [30], or hyper-threading as used in the Intel Pentium IV Xeon processor for servers [82], is designed to combine hardware multithreading with a superscalar's out-of-order execution. The architecture simultaneously issues instructions from a small number of threads and aggressively shares all functional units among threads, in addition to the ILP obtained from dynamic scheduling. Different threads may issue instructions in the same cycle, which helps to eliminate poor ILP between the subpipelines and to hide latencies within them. The architecture supports register-to-register synchronisation and communication. It uses a special register-name checking mechanism to hide the conflicts arising from inter-thread competition over shared resources [83].

Nevertheless, the architecture needs to replicate the register renaming logic and the return stack pointer; a number of shared resources, such as the re-order buffer, the load/store buffer and the scheduling queue, need to be partitioned; and the scheduling unit is complex because it has to dynamically allocate the resources with a speculative pre-computation that adds delay to every cycle.

Chip MultiProcessor

The Chip MultiProcessor (CMP) [84, 85] (e.g. Sun's MAJC [86] and Raw [14]) is a symmetric out-of-order multiprocessor. This architecture decentralises functional units and requires a mechanism to distribute threads to each processor.

There are a number of advantages introduced by this architecture. Firstly, the delay from crossbars or multiplexers used for context switching is eliminated, which simplifies the circuitry and benefits execution rates. Secondly, the number of thread-processing elements in a CMP increases linearly, whereas the growth in the other multiple-issue multithreaded architectures is quadratic because their register sets and pipelines are fully connected. Lastly, its simpler design layout and interconnection effectively reduce the cost of design and verification.

The disadvantages of CMP are a long context-switching overhead and under-utilised processing elements when programs exhibit insufficient parallelism. This reflects the fact that some functional units could be better utilised by being shared.

3.3 Support Environment

This section investigates two support environments that are crucial for multithreaded operations: the memory management system (§3.3.1) and the support from the software level (§3.3.2).

3.3.1 Memory Management

A processor normally builds up a thread's soft context in the memory hierarchy during execution. To support multiple soft contexts in a multithreaded architecture, the roles of the instruction cache, the data cache and the TLB need to be adjusted.

In practice, both instructions and data rely on pre-fetching/pre-loading into caches prior to execution. Often, pre-fetching and pre-loading operations are conducted by using a statistical pattern-based table or a special address translation from Speculative Precomputation (SP) [87]. The technique reduces performance degradation even where cache misses may already have been hidden by multithreading [88].

The first level cache is typically addressed with virtual addressing to allow the system to perform cache access in parallel with address translation. However, different contexts may use the same virtual address to refer to different physical addresses, which introduces a virtual space conflict. Possible solutions for such a case are either to associate a process identifier with each cache line or to use a virtually indexed, physically tagged cache, although temporal nondeterminism may occur if the number of entries required by active threads exceeds the resources.

Anaconda [4] offers a scalable memory tree with an extensible topology, where a message-passing scheme is used for communication. The structure results in an O(log(size)) memory access latency. PL/PS [89] proposes non-blocking threads where all memory accesses are decoupled from a thread and pre-loaded into a pipelined memory.

3.3.2 Software Support

Concurrency primitives in programming languages are increasingly common. For example, thread libraries or thread objects in Java [90], Id [91], Sisal [92] and Tera C [93] offer various multithreading models. Occam [94] was designed to facilitate parallel programming on the INMOS Transputer [69]. The Multilisp language [95] was implemented to support a nested functional language with additional operators and semantics to handle parallel execution for the MASA architecture [96].

A number of thread libraries exist for high-level programming languages. Among them, the POSIX thread library, the UI thread library and the Java-based thread library have competed with one another to become the standard programming method for multithreading [97].


Operating Systems (OSs) additionally support multiple threads via multitasking. Early OSs (e.g. Windows 3.x, Mac OS 9.x) used co-operative multitasking, which relied on the running programs to co-operate with one another. However, because each program is responsible for regulating its own CPU usage, an ill-behaved program may monopolise the processor, halting the whole system. Alternatively, pre-emptive multitasking is preferred by modern OSs (e.g. UNIX, Windows NT). These OSs conform to the IEEE POSIX 1003.x thread standard [98], providing pre-emptive multithreading, symmetric load balancing and portability (e.g. Lynx, VxWorks, Embedded Linux and PowerMac). Most of them transform thread deadlines into priorities.

Windows CE, Mac OS X, Sun Solaris, Digital Unix and Palm OS have a common backbone which naturally supports the execution of multiple threads in accordance with the Unix International (UI) standard [97]. In the UI standard, a thread of execution contains the start address and the stack size. The multithreading features are the same as those required by PThreads, such as pre-emptive multithreading, semaphores and mutual exclusion. Pre-emptive multithreading strictly enforces time-slicing: the highest-priority runnable thread is executed at the next scheduling decision, pushing the currently running thread back to a dispatch queue. Time-slicing can be used to prevent low-priority starvation by scheduling threads of the same priority fairly in round-robin order.

3.4 Efficiency Factors

This section introduces the design trade-offs in multithreaded architectures (§3.4.1). Processor efficiency is then reviewed (§3.4.2), followed by the distribution of the efficiency (§3.4.3). The section finishes off with a useful cost-effectiveness model (§3.4.4).

3.4.1 Design Trade-offs

Although multithreaded architectures offer enormous potential for performance improvement, the following trade-offs should be addressed:
• A higher hardware cost: hardware cost becomes higher because complex mechanisms, extra storage and interconnections are needed for multithreading.
• An incremental cycle time: extra multiplexing/interconnections at the hardware level tend to add computational delay to each execution cycle.
• Memory bandwidth bottleneck: simultaneous execution of multiple threads increases the memory traffic.
• Poor cache hit ratio: moving from single-thread support to multiple threads reduces data locality, thereby reducing cache efficiency.


3.4.2 Processor Efficiency

Multithreaded processors exhibit more parallelism than their single-threaded counterparts. Performance is effectively gained by context switching rather than stalling, for example, during long-latency memory or branch operations. However, the balance between the performance and the cost of hardware needs to be justified. Theoretically, the efficiency (E) of a multithreaded processor is determined by the following four factors [99]:
1. The number of contexts supported by the hardware (n).
2. The context-switching overhead (S).
3. The run length in instruction execution cycles between switches (R).
4. The latency of the operations to be hidden (L).

Figure 3.6 presents the relationship between the processor efficiency (E) and the number of execution contexts (n). There are two regions on this processor efficiency graph: a linear region on the left and a saturation region on the right. The saturation point is reached when the service time of the processor completely conceals the latency, i.e. (n − 1)(R + S) = L. The efficiency is assessed using Equation 3.1 for the linear region and Equation 3.2 for the saturation region.

Figure 3.6: Relationship of processor efficiency and the number of contexts (efficiency grows through a linear region and levels off in the saturation region beyond the saturation point).

Elinear = nR / (R + S + L)    (3.1)


Esaturation = R / (R + S)    (3.2)

The equations suggest that, before reaching saturation, the processor efficiency may be improved by reducing L or S, or increasing n. Likewise, in the saturation region, the processor efficiency increases by lowering the value of S.
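As a worked illustration of Equations 3.1 and 3.2, the short program below evaluates the efficiency for a range of context counts using arbitrary example values (R = 8 cycles, S = 1 cycle, L = 40 cycles); these numbers are not measurements from any real system. With these values the saturation condition (n − 1)(R + S) ≥ L is first met at n = 6, where the efficiency levels off at R/(R + S) ≈ 0.89.

    #include <stdio.h>

    int main(void)
    {
        double R = 8.0, S = 1.0, L = 40.0;        /* illustrative values only        */

        for (int n = 1; n <= 8; n++) {
            double e_linear = n * R / (R + S + L);        /* Equation 3.1            */
            double e_sat    = R / (R + S);                /* Equation 3.2            */
            double e        = ((n - 1) * (R + S) >= L) ? e_sat : e_linear;
            printf("n=%d  E=%.2f\n", n, e);
        }
        return 0;
    }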

3.4.3 Efficiency Distribution

Although Equation 3.2 indicates that the system may offer full processor efficiency (i.e. Esat = 1.0) if the context-switching overhead S is zero in the saturation region, the full-efficiency case is impossible because, in reality, a multithreaded processor is utilised dynamically.

For the saturation region, Agarwal extended the analytical model to rely on service/workload distribution information [100]. In the model, service-time intervals between context switches are distributed geometrically and the latency penalty is distributed exponentially. The processor efficiency distribution is based on an M/M/1//M queueing model [101], as presented in Equation 3.3.

E(n) = 1 − [ Σ_{i=0..n} (r(n)/l(n))^i · n!/(n−i)! ]^(−1)    (3.3)

where n is the degree of multithreading,
r(n) is the mean service time distribution,
l(n) is the mean latency penalty distribution and
E(n) is the processor efficiency distribution.

3.4.4 Cost-effectiveness Model

It is insufficient to focus on improving only efficiency without considering the increased implementation costs (e.g. power, increased transistor count and design complexity). Culler has proposed a cost-effectiveness (CE) metric to measure multithreading efficiency [3] as follows:

CE(n) = E(n) / C(n)    (3.4)

C(n) = (Cs + nCt + Cx) / Cb    (3.5)


where E(n) is the processor efficiency distribution,
n is the degree of multithreading,
Cs is the cost of the single-threaded mechanism,
Ct is the incremental cost per thread,
Cx is the incremental cost of thread interactions,
Cb is the base cost of an equivalent single-threaded processor and
C(n) is the total implementation cost.

When the cost per thread (Ct) increases, the cost-effectiveness decreases as shown in Figure 3.7. Therefore, it is necessary to minimise the number of threads supported in the hardware in order to obtain the peak cost-effectiveness result.
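A small numerical sketch of Equations 3.4 and 3.5 is given below. The cost figures Cs, Ct, Cx and Cb, as well as the simple saturating stand-in for E(n), are illustrative assumptions chosen only to reproduce the qualitative shape of Figure 3.7: CE(n) rises while efficiency is still climbing and then falls once the per-thread cost n·Ct dominates.

    #include <stdio.h>

    /* A stand-in for E(n): rises linearly and saturates at 0.9 (cf. Figure 3.6). */
    static double efficiency(int n)
    {
        double e = 0.18 * n;
        return e < 0.9 ? e : 0.9;
    }

    int main(void)
    {
        double Cs = 1.0;     /* cost of the single-threaded mechanism      (assumed) */
        double Ct = 0.05;    /* incremental cost per thread                (assumed) */
        double Cx = 0.10;    /* cost of thread-interaction hardware        (assumed) */
        double Cb = 1.0;     /* base cost of a single-threaded processor   (assumed) */

        for (int n = 1; n <= 10; n++) {
            double C  = (Cs + n * Ct + Cx) / Cb;     /* Equation 3.5 */
            double CE = efficiency(n) / C;           /* Equation 3.4 */
            printf("n=%2d  C(n)=%.2f  CE(n)=%.2f\n", n, C, CE);
        }
        return 0;
    }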

Figure 3.7: Cost-effectiveness of a multithreaded processor when varying Ct (cost-effectiveness curves for Ct increases of 5%, 10% and 20% peak and then fall as the number of contexts grows, while efficiency saturates).

3.5 Summary

Multithreaded architectures have the potential to offer performance improvements by ex- ploiting thread-level parallelism. This parallelism is capable of overcoming the limitations of instruction-level parallelism. However, sufficient instructions need to be obtained to cover long latencies (e.g. memory access, control dependencies). There are different ap- proaches to multithreaded processor design. On the one hand, a single instruction thread size (e.g. cycle-by-cycle switching model) is desirable because of the absence of pipeline dependencies. However, this approach degrades the performance of a single thread. On the other hand, a coarse-grained thread size has better support for sequential execution.


Its performance on single-threaded applications is competitive with conventional control-flow architectures, provided that the context-switching overhead is small.

Hardware support for multithreaded parallelism must address the key issues of communication, synchronisation and scheduling. Thread communication can be performed using shared memory or shared registers. Thread synchronisation is required to avoid conflicts over data; one approach is to use data-flow style synchronisation based on presence flags. Thread scheduling should be simple. However, to meet real-time constraints, a hardware scheduler using a priority or a deadline scheme is required.

Multithreaded architectures can be enhanced to support multiple issue. This is done either by decentralising the activities to compromise with hardware complexity, as in the design of chip multiprocessors, or by sharing functional units in order to eliminate all horizontal and vertical waste, in a way similar to simultaneous multithreading.

At the software level, a number of standard thread libraries and multithreaded applications are beginning to be widely available. Various operating systems naturally provide multiple threads by multitasking. A number of operating systems offer pre-emptive multitasking, which explicitly enforces a time slice during operation to prevent starvation problems.

The efficiency model indicates the need for a minimum number of execution contexts in the processing elements; a simple concurrency mechanism that provides a low context-switch overhead; a simple multithreading scheduler that offers real-time support; and an efficient synchronisation mechanism. For embedded systems, not only is execution performance a concern, but also the cost-effectiveness of the hardware and its power efficiency.

Chapter 4

Embedded Systems Design

So you can find the hidden doors to places no one’s been before And the pride you’ll feel inside is not the kind that makes you fall, it’s the kind that recognises... the bigness found in being small.

A. A. Milne

4.1 Introduction

The architectures of embedded processors are substantially different from those of desktop processors. Embedded systems have more stringent constraints that are mostly determined by the characteristics of the applications they support. In general, such systems often respond to external events and cope with unusual conditions without human intervention, with operations that are subject to deadlines.

Section 4.2 identifies design constraints that should be considered in developing feasible solutions for embedded systems. Section 4.3 introduces novel embedded architectural techniques and extensions for multithreaded operations. Section 4.4 summarises the issues raised in this chapter.

4.2 Design Constraints and Solutions

There are two important trends in the embedded market. The first trend is the growth of applications [14], which has stimulated rapid development of real-time media computation (e.g. video/music coding, speech processing, image processing and 3D animation). The second trend is the growth in demand for inexpensive gadgets [102], for example fully integrated devices combining game, camera, recorder, PDA and movie-player functions.


These trends demand higher computing power, yet with minimal size, weight and power consumption. Though different embedded applications have their own sets of constraints, the nature of applications that use embedded devices shares the following five constraints:
1. Real-time response (§4.2.1)
2. Low power consumption (§4.2.2)
3. Restricted cost (§4.2.3)
4. Physical limitation (§4.2.3)
5. Application specifications (§4.2.4)

4.2.1 Real-time Response

In many embedded systems, the ability to respond (produce appropriate output) within a specified time after an input stimulus is as important as the computational performance. In embedded architectures, the raw speed of a computation matters less than knowing the time required to complete it. This requirement differs from general fast systems, which are typically designed to minimise the average execution time in order to obtain overall high performance [103] and often respond to an external interrupt with a considerable context-switch overhead. For embedded designers, real-time response is another challenge.

Generally, embedded architectures reduce their context-switch overhead by allocating a flexible register window to each execution process from a large set of registers (e.g. 192 registers in the AMD 29K [104] or 6 overlapping sets of registers in the ARM [39]). The context-switching overhead is reduced because the currently executing context does not have to be saved immediately. Systems supported by a flexible register window benefit when used in conjunction with techniques derived from multithreaded architectures, such as the use of priority scheduling in Anaconda [4], or the ready/switch technique in MSPARC [105].

One characteristic of real-time processes is the need to accurately predict bounds on execution times to ensure that deadlines can be met. Therefore, instruction speculation (i.e. branch prediction, dynamic scheduling), which relies on non-deterministic probabilistic mechanisms, is usually avoided by embedded systems. Embedded architectures employ alternative techniques. To lessen the reliance on branches, sets of predicated instructions have been introduced (e.g. ARM [39], PA-RISC [38] and TI DSPs [106]). This enhancement extends the opcode of instructions with a condition code, allowing the processor to execute them with fewer branches. This results in denser code and faster execution.

The correctness of timing is achieved through support in software. Currently, substantial real-time languages (e.g. real-time Java [107], Ada95 [108], SPARK [109], Ravenscar [110]) require mechanisms to handle their timing analysis, communication and concurrency. Timing analysis is normally achieved using worst-case timing estimation [111]


(e.g. assume that all loads will cause a cache miss). Communication and concurrency support in the embedded hardware is an on-going topic of research.

4.2.2 Low Power Consumption

Power consumption is critical in embedded systems. Lowering power dissipation improves battery life, reduces heat, lowers electromagnetic emissions, reduces battery size and minimises manufacturing costs. The power consumption is a combination of dynamic, leakage and static power, as presented in Equation 4.1 [112].

P = Pdynamic + Pleakage + Pstatic
  = (Σ ai ci) Vdd² f + n Vdd Ileakage + Pstatic    (4.1)

where P is the power consumption,
ai is the activity factor of unit i,
ci is the capacitance of unit i,
Vdd is the power supply voltage,
f is the input frequency,
n is the number of transistors,
Vth is the threshold voltage and
Ileakage is the leakage current.

A quadratic reduction in dynamic power is achieved by lowering the supply voltage (Vdd) [113]. The SIA International Technology Roadmap (2002 edition) [10] predicts that the voltage could be reduced to 0.4 V by 2016.

The left-hand chart in Figure 4.1(a) presents evidence that decreasing the supply voltage (Vdd) or increasing the threshold voltage (Vth) reduces power consumption. Nevertheless, these adjustments increase the circuit delay, as illustrated in Figure 4.1. Therefore, Vdd cannot simply be reduced, because doing so affects the necessary timing assumptions. The minimum gate delay of the chip directly depends on the switching threshold of the transistor (Vth) and the supply voltage (Vdd), as presented in Equation 4.2 [114]. For this reason, most embedded processors operate with a slow clock in order to save power.

tmin ≈ Vdd / (Vdd − Vth)^α    (4.2)

where tmin is the minimum gate delay (cycle time) and α is a constant, currently around 1.3.


Figure 4.1: The relationships of power consumption and circuit delay with the supply and the threshold voltage [115].

On the other hand, lowering the supply voltage (Vdd) also reduces the noise margin and increases leakage. This is because the 10-15% decrement of Vdd directly reduces Vth as shown in Figure 4.1. The reduction of Vth doubles the leakage current (Ileak) as presented in Equation 4.3. This shows that Vdd can only be reduced to a value that allows reliable operation in the envisaged environment.

Ileak ∝ e^(−q Vth / kT)    (4.3)

where q, k and T are constant values.

Alternative techniques for lowering power consumption are transistor sizing, dynamic voltage scaling [116, 117], dynamic frequency scaling, dynamic pipeline scaling [118], transition reduction and related techniques [119]. The purpose of these techniques is to reduce the activity (ai) and circuitry (ci) responsible for dynamic power consumption, typically under the control of a power operating mode (i.e. running, standby, suspend and hibernate). These various techniques can be combined to reduce power consumption, but all of them come at the expense of performance [118, 120].

As transistor geometries decrease, leakage power is increasing, as presented in Figure 4.2. Instead of wasting leakage power, instruction-level parallelism utilises otherwise idle functional units by keeping them in operation. However, this technique is speculative and may waste a significant amount of power on mispredicted tasks. Further research is required to exploit parallelism in an energy-efficient way.


Figure 4.2: Leakage power on the increase [113].

4.2.3 Low Cost and Compact Size

Embedded processors trade performance for lower cost and smaller size. For a compact size, embedded architectures often have shallower pipelines, which operate with a small number of registers and a reduced bus bandwidth [120]. The cost of a processor is influenced by its architectural features, materials, packaging techniques, on-chip circuits, fabrication yields and design & development costs.

By the year 2006, the state of the art in Very Large Scale Integration (VLSI) is expected to provide one billion transistors on a single chip [10]. As the number of transistors increases, more components (e.g. memories and peripherals) are being integrated onto one chip to produce a System On Chip (SOC) [13]. Advances in many factors are driven by improvements in transistor lithography processes [10].

A trend identified in [121] is that on-chip caches increasingly dominate chip area. These caches reduce the number of memory accesses, thereby reducing latency and power. Bigger caches allow a larger working set to be held for one application and reduce contention between concurrent applications. However, the performance of caches relies on statistical properties which become less effective when running multiple applications in parallel. The cache area could be reduced without performance penalty provided the cache is managed better by exploiting knowledge about scheduling decisions. Furthermore, the saving in cache area could be used to support such a management mechanism.


4.2.4 Application Specifications

Performance improvements for embedded systems are often obtained by tailoring the processor design to its application requirements. In general, there are several degrees of application specificity, as classified in Figure 4.3. This figure presents the trend of performance improvement, power dissipation and flexibility as architectures dedicate their hardware infrastructures to specific applications.

Figure 4.3: Embedded processor classification [122]: general-purpose processors, digital signal processors, application-specific signal processors, field programmable devices, application-specific ICs and physically optimised ICs, compared on logarithmic scales of flexibility, power dissipation and performance.

General-purpose embedded processors (e.g. ARM9E [39], MIPS32 [36], Motorola 68K [16] and AMD 29K [104]) are the most flexible architectures, designed for personal handheld equipment such as PDAs, mobile phones and cameras, which must support an embedded OS (e.g. Windows CE, Embedded Linux, Symbian). The core architectures therefore mostly inherit from successful general-purpose processor designs, but at a smaller scale to reduce power and size.

Digital Signal Processors (DSPs) are available for numerically intensive applications with streamed data. Their architectures often offer mechanisms to accelerate signal processing, such as single-cycle multiply-and-accumulate, saturation arithmetic, fixed-point notation, extended addressing modes for array & circular-buffer handling, and multiple memory accesses [106].

A processor can be enhanced by co-designing its core with specific software support in the form of an Application Specific Signal Processor (ASSP). The ASSP core benefits from application-specific instructions that are tailored for particular calculations to serve a narrowly defined segment of services. These processors are designed to be programmable in order to allow flexible extensions. Examples include media processors, which are designed for massive floating-point calculation with high memory bandwidth [123, 40], and network processors, which focus on rapid communication [124].


Field Programmable Devices (FPDs) enable users to customise task-specific processors by using device programming utilities [125]. Better still, Application Specific Integrated Circuits (ASICs) achieve a more compact computation core. Furthermore, full-custom physically-optimised IC designs are available for the highest level of optimisation.

4.3 Techniques in Embedded Architectures

Embedded processors employ various techniques from microprocessors. In general, embedded architectures rely on a flexible processing element that consists of approximately 3 to 6 pipeline stages. Their execution units typically support a 16-bit to 32-bit data/instruction width and issue 1-4 instructions per cycle (e.g. integer, floating point, load/store and MAC subpipelines).

The following sections explain the details of critical architectural innovations for the embedded domain: compact instruction encoding (§4.3.1), predicated instructions (§4.3.2), subword-level manipulation (§4.3.3) and thread level parallelism support (§4.3.4).

4.3.1 Specially Compact Instruction Encoding

An embedded processor can have an instruction set architecture based on the CISC (e.g. Motorola 68K [16]) or RISC (e.g. ARM [39], MIPS [36]) approach. The advantage of CISC instructions is good code density, which allows efficient use of memory, reduces the instruction bandwidth and provides an unaligned access capability. The disadvantages of CISC are that registers are used inefficiently, which increases memory traffic, and that instruction decoding is complex, which results in complex control.

RISC achieves high performance through hardware simplicity. However, RISC instructions are less dense than CISC ones. Thus, some embedded RISCs offer a solution in the form of two instruction formats: a full and a compact form, e.g. ARM and ARM Thumb [39], MIPS32 and MIPS16 [36]. The compact form can offer a 40% reduction in code size [1], despite requiring more instructions to complete the same task.

4.3.2 Predicated Instructions

Processors waste a significant number of cycles on branch mis-predictions, because all mispredicted instructions must be flushed from the pipeline; avoiding this normally requires a branch predictor [26]. However, such mechanisms are expensive in terms of circuit area and power consumption.

ARM [39] goes to the extreme of making every instruction predicated. Predicated instructions are used to reduce the number of branches: they are only allowed to commit their results to the registers if their conditions match the status register.


Using them, a high proportion of conditional branches are avoided. The use of predicated instructions instead of some branches leads to a compact instruction stream, owing to the elimination of a number of branch delay slots. Nevertheless, the implementation of predicated instructions requires additional bits for a predicate field.

A compromise solution that lets an instruction remain compact but still avoids a number of branches is partial predication, such as conditional moves and conditional loads. Such an instruction moves/loads a value only when its condition is true. Using these instructions has been shown to provide an acceptable ratio of branch elimination [27].
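The effect of partial predication can be seen at the source level in the small C function below: the conditional expression has no control-flow dependence, so a compiler targeting an ISA with conditional moves may emit a branch-free sequence for it. Whether it actually does so depends on the compiler and target; the function itself is only an illustration.

    /* Clamp a value at zero without an explicit branch in the source.           */
    static int saturate_to_zero(int x)
    {
        /* branchy form:   if (x < 0) x = 0;                                      */
        return (x < 0) ? 0 : x;     /* candidate for a conditional move           */
    }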

4.3.3 Subword-level Manipulation

Programs that manipulate data at the bit level are necessary in some embedded domains. Bit-manipulation instructions such as bit test, bit clear and bit set [16] have been introduced to handle encryption algorithms, calculate checksums and control peripheral hardware. Furthermore, subword-level manipulation instructions such as BSX (Bit Section eXtension [126]) have been used to reduce the overhead of packing/unpacking narrow-width data (i.e. data less than a word but more than one bit wide).
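The C sketch below shows the kind of bit-level and subword operations such instructions accelerate: setting, clearing and testing single bits, and extracting a narrow field from a word. On a processor without dedicated support each of these costs several shift-and-mask instructions; the field positions chosen here are purely illustrative.

    #include <stdint.h>

    #define BIT_SET(w, n)    ((w) |  (1u << (n)))
    #define BIT_CLEAR(w, n)  ((w) & ~(1u << (n)))
    #define BIT_TEST(w, n)   (((w) >> (n)) & 1u)

    /* Extract a narrow field of 'width' bits starting at bit 'lsb' — the kind of
       packing/unpacking that a bit-section-extract instruction performs directly. */
    static inline uint32_t extract_field(uint32_t word, unsigned lsb, unsigned width)
    {
        return (word >> lsb) & ((1u << width) - 1u);
    }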

4.3.4 Thread Level Parallelism Support

Conventional embedded processors handle time-critical events by interrupting the current process. In order to simulate concurrency, a single-threaded control-flow processor must perform a software-driven context switch, which has considerable overhead. Multithreaded processors, on the other hand, provide hardware support for context management. Furthermore, low-level scheduling decisions may be performed in hardware [4, 105] rather than in the core of a Real-Time Operating System (RTOS).

Hardware support for best-effort scheduling needs only a couple of shared register sets to hold the contexts and a simple priority-based scheduling scheme. For example, the differential MultiThreading (dMT) [127] architecture for embedded processors supports two active contexts by duplicating the PC, register file, fetch-decode latch, decode-execute latch and execute-memory latch. A thread is switched out when a cache miss occurs and is resumed when the cache miss is resolved, provided no higher-priority thread is available. Of course, multithreading can be combined with the chip multiprocessor technique (see Chapter 3) to allow more than one thread to be executed in parallel.

Embedded systems with multiple competing threads require good scheduling to meet real-time needs. A hardware scheduler that makes best-effort decisions over a couple of execution contexts appears to be inadequate. To meet real-time constraints, an alternative multithreading architecture that can make decisions based on all execution contexts is required.


4.4 Summary

The embedded market is growing and requires an architecture that can provide real-time response, low power, low cost, compact size and high performance. Although there are trade-offs between hardware performance and power dissipation in an embedded processor, various techniques have been proposed to meet these challenges:
1. Power operating modes – decisions need to be made at an architectural level in order that higher-level information (e.g. scheduling decisions) can be used to control power-saving circuits. This control information might be communicated as power operating modes, e.g. running, standby, suspend and hibernate.
2. Stripped-down circuits – to fit within embedded constraints, non-deterministic mechanisms such as speculative scheduling or branch prediction are avoided. Embedded architectures employ alternative techniques to compensate for the elimination of mechanisms such as branch prediction. For example, rather than predicting branches, the number of branches could be reduced by using predicated instructions.
3. Multithreading – processing a single thread stream often leaves many functional units of the processor under-utilised, which wastes leakage power. Speculative branches combined with instruction level parallelism may improve functional unit utilisation, though complexity goes up and predictability reduces. Exploiting concurrency at the thread level appears to be a better approach, provided context-switch overheads can be minimised. A multithreaded approach appears to offer good thread level parallelism, and could support real-time systems if an appropriate scheduling mechanism were used. Thus, there is room for much improvement.

Part II

Architectural Design

Chapter 5

System Overview

One must learn by doing the thing, for though you think you know it, you have no certainty until you try.

Aristotle

5.1 Introduction

The investigation of architectural level parallelism in Chapter 2 and the review of multithreaded processor architectures in Chapter 3 suggest that thread level parallelism at the hardware level is a tangible approach to improving processor performance. Nevertheless, enhancing a processor to exploit thread level parallelism is often limited by a number of embedded design constraints, as reviewed in Chapter 4. As a result, alternative design choices are proposed in this chapter for my MultiThreaded Embedded Processor (MulTEP) architecture.

Section 5.2 starts with the background to MulTEP. Section 5.3 presents MulTEP's programming model. Section 5.4 shows my theoretical investigations of the design choices, which were conducted prior to implementation. Section 5.5 summarises the system overview of MulTEP.

5.2 Background to the MulTEP Architecture

As mentioned in Chapter 1, the research is motivated by the growth of the embedded systems market and my personal interest in implementing a novel architecture. This section addresses my research objectives (§5.2.1), the design challenges of the architecture being considered (§5.2.2) and the design choices made (§5.2.3).


5.2.1 Objectives

My research attempts to bring the advantages of Thread Level Parallelism (TLP) to embedded systems. Correspondingly, my objectives are:
• To improve processing performance by effectively hiding idle slots in one thread stream (e.g. from increasing memory latency or branch/load delay slots) with the execution of other threads.
• To schedule threads to respond to their requests in real-time.
• To implement thread level parallelism and real-time support at the hardware level without much incremental hardware cost.

5.2.2 Project Challenges

Before implementing a multithreaded embedded architecture, I identified design challenges using the background knowledge as reviewed in Chapter 3 and Chapter 4, and then prioritised them with regards to my research objectives. Table 5.1 presents the criteria with the priorities I assigned.

Aspect of Research    Design Challenge               Priority
Multithreading        Support TLP in hardware        Critical
                      Avoid high memory bandwidth    High
                      Avoid cache conflicts          High
                      Reduce physical complexity     High
                      Allow thread scalability       High
Embedded system       Response in real-time          Critical
                      Use power effectively          Critical
                      Compact size                   Moderate
                      Work in harsh environments     Moderate
                      Low radio emission             Moderate
                      Low cost of end product        Moderate

Table 5.1: Project criteria and the given priorities.

For the multithreaded design, I gave a critical priority to the implementation of thread level parallelism in hardware. The side effects of incorporating thread level parallelism into hardware, namely memory access bottlenecks, cache conflicts and design complexity, are given a high priority, since ignoring them may severely degrade the performance. For the embedded design, I have given real-time response and power efficiency a critical priority. I assigned a high priority to scalability because it is important for further improvement. The other embedded challenges, which rely on fabrication techniques, such as cost, size, reliability and radio emissions, are given a moderate priority.


5.2.3 Design Choices

This section presents design choices that are raised to satisfy the selected critical-priority and high-priority challenges presented in the previous section. Table 5.2 shows a list of design choices from the multithreaded design viewpoint.

Challenge              Priority    Design Choice
TLP in hardware        Critical    C1: Provide a mechanism to progress a thread through its life cycle
                                   C2: Provide thread concurrency mechanisms
                                   C3: Eliminate context-switch overhead
                                   C4: Prevent thread starvation
Memory bandwidth       High        H1: Provide a hardware mechanism to handle memory traffic of multiple threads
Cache conflict         High        H2: Enhance cache replacement policy
Physical complexity    High        H3: Avoid very large register sets and complex control structures
Thread scalability     High        H4: Allow the number of threads to scale without significant architectural change

Table 5.2: Multithreaded design challenges.

To implement thread level parallelism in hardware, design choices C1 and C2 are proposed to distinguish the multithreaded architecture from a single-threaded architecture. Additionally, I introduced design choice C3 because, in my opinion, a multithreaded processor should benefit from the execution of multiple threads without any cycles wasted on context-switching overhead. Design choice C4 is concerned with the situation that may arise when concurrency support is migrated from software to hardware.

Side effects of multithreading result in design choices H1 and H2. Design choice H1 reflects my belief that introducing a separate hardware mechanism to handle memory traffic may reduce the memory bandwidth limitations. For cache conflicts (H2), the conflict is a result of multiple threads competing against each other by repeatedly swapping out each other's working sets; therefore, I believe that an enhanced cache replacement policy may minimise the problem.

For the physical complexity challenge, design choice H3 is proposed since a smaller number of registers will reduce the physical complexity. However, this small number must yield sufficient performance to allow the processor to benefit from multithreading. Design choice H4 is my proposal for flexible scalability in terms of the number of threads.

Table 5.3 presents design choices to satisfy the embedded design challenges. Design choice C5 is inspired by the availability of multiple priority levels for real-time support in many embedded OSs (see Chapter 4). Design choice C6 is required for an effective real-time response; however, there should be a compromise between this choice and the need to reduce physical complexity (H3). The power efficiency goal (C7) reflects the desire to provide power-saving control from the architectural level (see Chapter 4), though specific power-saving circuits are beyond the scope of this thesis.

Challenge             Priority    Design Choice
Real-time response    Critical    C5: Provide a hardware mechanism to issue threads with regard to their priorities
                                  C6: Allow the hardware to make a scheduling decision based on the requests of all threads
Power efficiency      Critical    C7: Design a hardware mechanism to provide operating modes for power saving

Table 5.3: Embedded Design challenges.

5.3 MulTEP System

Architectural approaches raised in the previous section are implemented in MulTEP. These approaches are summarised in Table 5.4, together with forward references to implementation details.

Choice    Technique                                 Approach         Implementation
C1:       Hardware thread life cycle model          see §5.3.4       see §6.3
          Software support daemon                   see §7.3         see §7.3.1
C2:       Synchronisation mechanism                 see §5.3.5*      see §6.3.3
          Scheduling mechanism                      see §5.3.6*      see §6.2.3, §6.3.4
          Multiple execution contexts               see §5.4.1       see §5.3.2
C3:       Zero-cycle context switches               see §5.3.6       see §6.2
C4:       Dynamic-priority mechanism                see §5.3.6       see §6.3
C5:       A priority-based scheduler                see §5.3.6       see §6.3.4
C6:       A tagged up/down scheduler                see §3.2.3*      see §8.2.1
C7:       Power operating modes                     see §4.2.2*      see §5.3.8
H1:       Load-store unit                           see §6.4         see §6.4
H2:       Tagged priority cache line                see §6.5         see §6.5
H3:       4 contexts for 2 processing elements      see §5.4         see §6.2
H4:       An activation frame representation        see §2.3.3*      see §5.4.2

Table 5.4: The MulTEP solutions. (An entry marked by '*' is an idea borrowed/inspired from techniques in the other architectures.)

In the next section, an overall picture of the MulTEP system is presented (§5.3.1). The execution context storage is explained in §5.3.2. This is followed by the instruction set architecture (§5.3.3) and the thread life cycle model (§5.3.4). Synchronisation (§5.3.5) and scheduling (§5.3.6) are then introduced, followed by the memory protection (§5.3.7) and the power operating modes (§5.3.8).


5.3.1 Overall Picture

The overall picture of the system from the user level to the hardware level is illustrated in Figure 5.1. Programs from the user level are compiled to native code. Software tools then either convert the programs into several threads (e.g. Program A) or preserve the existing flow of instructions (e.g. the sequential form of Program B and the multithreaded form of Program C). Threads from the kernel level are issued to the hardware level according to their priority.

Figure 5.1: The overall picture of MulTEP (user-level programs A, B and C, with priorities 3, 2 and 1, are turned into threads by software tools at the kernel level and issued to the two processing elements at the hardware level; a latency in thread B1, e.g. a memory access, triggers a context switch).

Each program is represented as a directed graph where nodes represent nanothread blocks [128, 129] and arcs represent their communication events. Nanothreads operate sequentially using the control-flow model, yet effectively realise the natural concurrency behaviour of the data-flow model to support multithreading.

The processor consists of two processing elements sharing four execution contexts (see §5.4.1). Excess execution contexts spill out to the memory in the form of activation frames (see §5.3.2). Circuits for non-deterministic operations, such as branch prediction and out-of-order execution, are excluded from the design, leaving the architecture to rely on thread-level parallelism.

Threads progress through their life cycle (see §5.3.4) using four multithreading instructions: spawn, switch, wait and stop (see §5.3.3). Thread scheduling is dynamic and based on priorities (see §5.3.6). A mechanism to switch execution contexts operates when a stall or a NOP occurs (see §6.3.4). MulTEP uses a multiple-data-entry matching store for thread synchronisation, which is similar to that used by Anaconda [4].


5.3.2 Execution Context

The processing element prototype is based on the RISC processor DLX [24], since its instruction format is simple to decode and its integer pipeline is simple yet gives good performance (see Chapter 3 for background requirements). Thus, the execution context of one thread consists of a program counter, 31x32-bit integer registers (excluding register $zero, whose value is always 0) and 32x32-bit floating-point registers. The left-hand part of Figure 5.2 presents the execution context of the processor. The right-hand part of the figure shows the preserved form of the execution context, called an Activation Frame (AF) [4], which may be spilled to the memory system.


Figure 5.2: The execution context and the activation frame.

In each activation frame, the space for register $zero, called PP, is reserved for two purposes. The first 8 bits store a priority for scheduling. The next 24 bits store presence flags, which indicate the presence of up to 24 input parameters, for handling complex data dependencies.

AT, K0 and K1 are reserved2. AT stores a program counter. K0 stores a group ID and a status3. K1 stores a nanothread pointer (i.e. the start PC of the thread).
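The packing of the PP word can be illustrated with a small C sketch. This is not the MulTEP hardware, just an interpretation of the description above; it assumes the priority occupies the most-significant byte and that each of the 24 presence flags maps to one bit of the lower three bytes.

    #include <stdint.h>

    /* Illustrative packing of the PP word of an activation frame:
     * an 8-bit scheduling priority and 24 presence flags.           */
    typedef uint32_t pp_word;

    static inline pp_word pp_pack(uint8_t priority, uint32_t presence24) {
        return ((pp_word)priority << 24) | (presence24 & 0x00FFFFFFu);
    }

    static inline uint8_t pp_priority(pp_word pp) {
        return (uint8_t)(pp >> 24);
    }

    /* A thread is runnable once every presence flag it requires is set. */
    static inline int pp_all_present(pp_word pp, uint32_t required_mask) {
        uint32_t need = required_mask & 0x00FFFFFFu;
        return (pp & need) == need;
    }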

5.3.3 Instruction Set Architecture

MulTEP is based on the 32-bit MIPS IV instruction set [130]4. The instruction set handled by the processing element is divided into five groups: a load/store group, an ALU group, a jump/branch group, a miscellaneous group and a multithreading group.

Load/Store Instructions

The load/store instructions support both big-endian and little-endian ordering for byte, halfword, word, doubleword, unaligned word and unaligned doubleword data. Signed and unsigned data of different sizes are transferred using sign-extension and zero-extension mechanisms. Register+displacement addressing is supported by the architecture as follows:

• [load|store] reg, offset(base):  opcode (6 bits) | base (5 bits) | reg (5 bits) | offset (16 bits)

ALU Instructions

The ALU group consists of arithmetic and logical instructions. Each may operate with an operand from a register or a 16-bit immediate value. Arithmetic instructions are signed or unsigned, based on a two's complement representation. The instructions are encoded as follows:

• r-type rd, rs, rt:  opcode (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | sh (5 bits) | func (6 bits)

• i-type rt, rs, immediate:  opcode (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

• [mul|div] rs, rt:  0x00 (6 bits) | rs (5 bits) | rt (5 bits) | 0x00 (10 bits) | mul/div (6 bits)

Speculative execution of branches is not supported (see §5.3.1). To reduce the performance penalty of branches, predicated instructions are used. The MulTEP architecture adds predication indicators to the sh field of register-based arithmetic instructions as the list of ISA pro-fixes presented in Table 5.5.

2Registers $at, $k0 and $k1 are originally preserved for the kernel.
3The status area of register $k0 is reserved for the flags Negative (N), Zero (Z), Carry (C), oVerflow (V), Greater (G), Lower (L), Equal (E) and execution Mode (M).
4MulTEP is enhanced with the MIPS IV ISA in addition to the MIPS I ISA from the RISC DLX architecture to match code produced by the compiler (see Chapter 7).

66 5.3. MulTEP System

ISA pro-fix   Zero   Negative   Greater   Lower   Equal
-z             1       -          -         -       -
-ltz           -       1          -         -       -
-gtz           -       0          -         -       -
-eq            -       -          0         0       1
-lt            -       -          0         1       0
-lte           -       -          0         1       1
-gt            -       -          1         0       0
-gte           -       -          1         0       1

Table 5.5: The representations of flags in the status register.
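To make the predication scheme concrete, the following C sketch shows how one of the Table 5.5 conditions could be evaluated from the status flags held in register $k0 before a predicated r-type result is committed. The flag structure and function names are illustrative, not the MulTEP implementation.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed view of the status flags in $k0 (see footnote 3). */
    typedef struct { bool z, n, g, l, e; } status_flags;

    /* Table 5.5, row "-eq": Greater = 0, Lower = 0, Equal = 1. */
    static bool pred_eq(status_flags f) {
        return !f.g && !f.l && f.e;
    }

    /* A predicated add: the destination is updated only when the
     * predicate holds, so no branch instruction is needed.        */
    static void add_eq(int32_t *rd, int32_t rs, int32_t rt, status_flags f) {
        if (pred_eq(f)) {
            *rd = rs + rt;
        }
    }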

Jump/Branch Instructions

The branch instructions are PC-relative and conditional. Jump instructions are either PC-relative unconditional or absolute unconditional. Their formats are:

• b-type rs, rt, offset:  branch opcode (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)

• j-type target:  jump opcode (6 bits) | offset (26 bits)

Miscellaneous Instructions

The miscellaneous group consists of software exceptions, such as SYSCALL and BREAK (s-type), sync, conditional move and pre-fetch instructions, for example:

• s-type:  0x00 (6 bits) | code (20 bits) | func (6 bits)

• sync:  0x00 (6 bits) | code (15 bits) | rt (5 bits) | sync (6 bits)

Multithreading Instructions

Four additional instructions to control threads (see §5.3.4) are implemented:

1. spawn reg, address:  0x1C (6 bits) | reg (5 bits) | address (21 bits)

2. wait reg:  0x1D (6 bits) | reg (5 bits) | 0 (20 bits) | 0 (1 bit)

3. switch:  0x1D (6 bits) | unused (25 bits) | 1 (1 bit)


4. stop reg:  0x1E (6 bits) | reg (5 bits) | unused (21 bits)
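The following C sketch decodes these four encodings from a 32-bit instruction word. It assumes MIPS-like field positions (opcode in the top 6 bits, reg in the next 5) and that the final bit distinguishes wait (0) from switch (1), as the layouts above suggest; it is an illustration rather than the decoder hardware.

    #include <stdint.h>

    #define OPCODE(i)  (((i) >> 26) & 0x3Fu)   /* top 6 bits           */
    #define REG(i)     (((i) >> 21) & 0x1Fu)   /* next 5 bits          */
    #define SELECT(i)  ((i) & 0x1u)            /* wait/switch selector */

    typedef enum { MT_SPAWN, MT_WAIT, MT_SWITCH, MT_STOP, MT_OTHER } mt_op;

    static mt_op decode_mt(uint32_t inst) {
        switch (OPCODE(inst)) {
        case 0x1C: return MT_SPAWN;                    /* spawn reg, address */
        case 0x1D: return SELECT(inst) ? MT_SWITCH     /* switch             */
                                       : MT_WAIT;      /* wait reg           */
        case 0x1E: return MT_STOP;                     /* stop reg           */
        default:   return MT_OTHER;
        }
    }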

5.3.4 Thread Life Cycle

MulTEP progresses a thread through its life cycle, as presented in Figure 5.3, using the four multithreading instructions (see §5.3.3). A thread is created by the spawn instruction and then waits in a non-running state (i.e. born, joining, blocked, suspended, sleeping or waiting). A store instruction is sent to synchronise the thread through the hardware matching-store synchroniser. A thread becomes ready when all of its inputs are present. The hardware scheduler promotes the highest-priority ready thread to the running state.


Figure 5.3: A thread life cycle in hardware [131].

The hardware mechanism to enable this life cycle is presented in Section 6.3. While running, a thread can be switched back to the ready state by the switch instruction, which releases the processing element so that it can be used by another thread. A switch also occurs when execution has to wait for a register: this arises when execution reaches a wait $reg instruction or detects a register miss, i.e. an access to a not-ready register indicated by the scoreboard. The wait 0 instruction represents the completion of a nanothread block5. During execution, a thread can be killed by the stop instruction, issued either by itself or by another thread.

5The pseudo code of wait 0 is end (see Chapter 7).
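The life cycle can be condensed into the following C sketch. The state and event names are illustrative and collapse the non-running states (born, joining, blocked, suspended, sleeping, waiting) into one; a newly spawned thread starts in the waiting group, and the wait 0 transition follows Figure 5.3, which shows it leading to the dead state.

    typedef enum { T_WAITING, T_READY, T_RUNNING, T_DEAD } t_state;
    typedef enum { E_ALL_INPUTS, E_PROMOTE, E_SWITCH, E_REGISTER_MISS,
                   E_WAIT_REG, E_WAIT_ZERO, E_STOP } t_event;

    static t_state next_state(t_state s, t_event e) {
        if (e == E_STOP) return T_DEAD;                 /* stop kills from any state */
        switch (s) {
        case T_WAITING: return (e == E_ALL_INPUTS) ? T_READY   : s;
        case T_READY:   return (e == E_PROMOTE)    ? T_RUNNING : s;
        case T_RUNNING:
            if (e == E_SWITCH || e == E_REGISTER_MISS) return T_READY;
            if (e == E_WAIT_REG)                       return T_WAITING;
            if (e == E_WAIT_ZERO)                      return T_DEAD;
            return s;
        default:        return s;
        }
    }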


5.3.5 Synchronisation Technique

Thread synchronisation is achieved through message passing via a store instruction. As execution contexts are memory-mapped in the form of activation frames, threads communicate by storing data to the appropriate activation frames. A thread whose presence flags are all asserted can be dispatched to the processing unit by the hardware scheduler according to its priority; otherwise, the thread must wait. For example, in Figure 5.4, Thread A1 sends a value N to register x of Thread A2. The matching-store synchroniser translates the destination to a specific memory address and then stores the value N to that address. Concurrently, the synchroniser sets the presence flag of that register. When the synchroniser detects that all inputs are present, the thread ID and its priority are dispatched to the data and key fields of the tagged up/down scheduler, respectively.


Figure 5.4: The store signal is used for thread synchronisation.
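The matching-store behaviour can be sketched in C as follows. The structure assumes the PP packing of §5.3.2 (priority in the top byte, one presence bit per parametric register) and that flags for unused parameters are pre-set; the scheduler call is a stand-in for the tagged up/down queue, and none of the names are the hardware's own.

    #include <stdint.h>

    typedef struct {
        uint32_t pp;          /* 8-bit priority + 24 presence flags */
        uint32_t regs[64];    /* memory-mapped register image       */
        uint16_t thread_id;
    } activation_frame;

    /* Stand-in for the tagged up/down scheduler: key = priority, data = ID. */
    static void scheduler_insert(uint8_t key, uint16_t data) {
        (void)key; (void)data;
    }

    /* "Store N to register x of thread A2": write the value, assert the
     * presence flag, and dispatch the thread once all flags are present. */
    static void matching_store(activation_frame *af, unsigned param, uint32_t value) {
        af->regs[param] = value;
        af->pp |= (1u << param);                       /* param < 24 assumed */
        if ((af->pp & 0x00FFFFFFu) == 0x00FFFFFFu) {
            scheduler_insert((uint8_t)(af->pp >> 24), af->thread_id);
        }
    }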

5.3.6 Scheduling Technique

Scheduling of multiple threads is determined by thread priorities. Pre-emptive multi-tasking is applied to enforce time-slicing. This prevents a high-priority thread from dominating shared resources (see §6.2.1). Figure 5.5 presents an example of thread scheduling. The hardware scheduler is based upon the tagged up/down priority queue [8]. To prevent the lowest-priority thread from starving, all in-queue priorities are incremented when a new thread arrives (see Appendix A.2.1).
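The anti-starvation rule can be illustrated with a small C sketch: every priority already in the ready queue is incremented when a new thread is inserted, so a long-waiting low-priority thread eventually rises to the front. The linear-scan queue below is purely illustrative; the real structure is the single-cycle tagged up/down priority queue.

    #include <stdint.h>

    #define QMAX 16                                             /* illustrative capacity */

    typedef struct { uint8_t key; uint16_t thread_id; } entry;  /* key = priority        */
    typedef struct { entry e[QMAX]; int n; } ready_queue;

    /* Insert a runnable thread: age every queued priority by one,
     * then add the newcomer with its own priority.               */
    static void rq_insert(ready_queue *q, uint8_t priority, uint16_t tid) {
        for (int i = 0; i < q->n; i++)
            if (q->e[i].key < 0xFF) q->e[i].key++;              /* dynamic-priority aging */
        if (q->n < QMAX) {
            q->e[q->n].key = priority;
            q->e[q->n].thread_id = tid;
            q->n++;
        }
    }

    /* Extract-maximum policy: the highest key is dispatched first. */
    static int rq_extract_max(ready_queue *q, entry *out) {
        if (q->n == 0) return 0;
        int best = 0;
        for (int i = 1; i < q->n; i++)
            if (q->e[i].key > q->e[best].key) best = i;
        *out = q->e[best];
        q->e[best] = q->e[--q->n];
        return 1;
    }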



Figure 5.5: Dynamic-priority multithreading illustration.

5.3.7 Memory Protection

To protect against unwanted interference between threads, I introduced two protection modes: user mode and system mode. Most threads execute in user mode. The system mode is provided for kernel operations such as exception-handling threads, housekeeping system daemon threads and software traps. The protection mode is tagged with the virtual page on the Translation Look-aside Buffer (TLB) line. A group ID is added to provide separate virtual address spaces for groups of threads. To guide the cache-line replacement policy, a cache line incorporates the thread priority of its owner. This association of a thread priority is introduced to minimise cache conflicts caused by competing working sets.

5.3.8 Power Operating Modes

MulTEP offers four power operating modes: running, standby, suspend and hibernate. The system operates in the running mode when a thread is available in the pipeline. The standby mode is activated once the pipeline becomes empty, detected by the absence of threads from all execution contexts. The suspend mode is entered when there are no threads in either the execution contexts or the scheduling queue. The hibernate mode is issued when only a kernel thread is available in the system, waiting to be activated by a timing signal. These four power operating modes are available for implementers to supply a suitable power management technique (see §4.2.2). Figure 5.6 illustrates the state diagram and the I/O interfaces of the power-mode generator. The power operating modes reflect the scheduling information. To maintain this, the generator is allocated in the multithreading service unit (see §6.3).



Figure 5.6: An overview of the power-mode generator.
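The mode selection can be summarised by the following C sketch. The 2-bit encodings follow Figure 5.6; the function signature, the precedence of the tests and the names are an interpretation of the description above rather than the generator's actual logic.

    #include <stdbool.h>

    typedef enum { PM_RUNNING = 0x0, PM_STANDBY = 0x1,
                   PM_HIBERNATE = 0x2, PM_SUSPEND = 0x3 } power_mode;

    static power_mode select_power_mode(bool pu_empty, bool sched_empty,
                                        int active_threads) {
        if (!pu_empty)           return PM_RUNNING;    /* a thread is in the pipeline       */
        if (!sched_empty)        return PM_STANDBY;    /* pipeline empty, work still queued */
        if (active_threads == 1) return PM_HIBERNATE;  /* only the kernel thread remains    */
        return PM_SUSPEND;                             /* no runnable threads anywhere      */
    }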

5.4 Architectural Theoretical Investigation

This section presents two investigations that demonstrate how some design choices satisfy the challenges raised in §5.2.3. Real-time performance, design constraints, cache conflicts and access bottlenecks are left to be evaluated by benchmarking (see Chapter 8 for the evaluation results). The structure of this section is as follows: relevant factors for maximising the utilisation of the processing elements through the use of multiple threads are investigated first (§5.4.1). The minimal incremental expense per thread is then derived (§5.4.2).

5.4.1 The Utilisation of the Processing Elements

An investigation into the utilisation of the processing elements was carried out using queueing analysis [101]. The queue is a space-limited set of execution contexts (C) in the processing unit. The service stations of this queue are MulTEP's multiple processing elements (s). Following Agarwal's multithreading study [100], the inter-arrival time distribution into the queue and the service time distribution of each processing element are both taken to be exponential (M). Based on these assumptions, the utilisation of the processing elements was analysed using the M/M/s/C model, as depicted in Figure 5.7. The fraction of time each server is busy, i.e. the processing-element utilisation U, in an M/M/s/C system is calculated by Equation 5.1 [101].

U = \frac{\lambda \sum_{n=0}^{C-1} p_n}{s\mu}    (5.1)



Figure 5.7: The model of the MulTEP processing unit (M/M/s/C).

where

p_n = \begin{cases} \dfrac{(\rho s)^n}{n!}\,p_0 & (n = 1, 2, \ldots, s-1) \\ \dfrac{\rho^n s^s}{s!}\,p_0 & (n = s, s+1, \ldots, C) \end{cases}

p_0 = 1 - \sum_{n=1}^{C} p_n

\rho = \frac{\lambda}{s\mu}

To derive the processing element utilisation U, the arrival rate λ (i.e. the fetch rate) and the service rate µ (i.e. the inverse of the run length) were first identified. The fetch rate λ was calculated from the number of execution contexts (C), the context size (x), the input bandwidth (B) and the run length (µ−1) [132], as presented in Equation 5.2.

The arrival rate λ (i.e. the fetch rate) is derived as:

\lambda = \frac{F_r + F_t}{C}
        = \frac{(C-1)\,F + (1 \cdot i)\,F}{C}
        = \frac{(C-1)\,F + \frac{B}{x}\,F}{C}
        = \frac{\left(C-1+\frac{B}{x}\right)\frac{t_s}{t}}{C}
        = \frac{\left(C-1+\frac{B}{x}\right)\mu^{-1}}{\left(\mu^{-1}+\frac{x}{B}\right)C}    (5.2)

6It is assumed that the execution contexts are perfectly utilised.


where
    F_r  is the fetch rate of Valid contexts,
    F_t  is the fetch rate of the Trans In context,
    F    is the average fetch rate,
    i    is the input rate of the Trans In context,
    t_s  is the service time per context,
    t    is the total consumed time per context.

To calculate λ for MulTEP, the input bandwidth B is 128 bits/cycle (see Section 6.2). The context size x can be up to 64x32 bits in the worst case, when contexts are transferred between different thread groups. The service rate µ depends on the mean time to complete the service, µ−1. According to the study in [133], a service block larger than 48 instructions per window yields almost no performance benefit. Hence, my investigation focuses on a run length that covers this range and a little further (from 8 to 64 instructions per service). Table 5.6 illustrates the outcomes when the model was analysed with different numbers of processing elements (p), numbers of contexts (C) and sizes of service block (µ−1). These results show that when the service time is shorter than the context-switching time, the utilisation U is degraded.

Table 5.6: Utilisation U for different values of p, C and µ−1.

Figure 5.8 depicts the utilisation trends as the mean service time µ−1 increases, for different numbers of contexts per processing element (C/p). Models with a higher number of contexts per processing element (C/p) yield better processing-element utilisation. However, a high ratio C/p could waste circuits if applications provide an insufficient number of threads to utilise them. Furthermore, the ratio C/p needs to be small to reduce physical complexity in order to achieve cost-effective performance (see §3.4.4). For cost-effective performance, models with C/p = 2 seem to be the suitable choice, especially when the run length is greater than or equal to the context-switching period (16 instructions), where the utilisation is greater than 95%. Among the members of the C/p = 2 group, the model with two processing elements provides the highest average utilisation, as illustrated in Figure 5.9. Thus, this model is used in the MulTEP architecture.
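To make the analysis reproducible, the following C program evaluates Equations 5.1 and 5.2 for a given number of processing elements s, context limit C, run length µ−1, bandwidth B and context size x. The parameter values in main() are the ones quoted above (B = 128 bits/cycle, worst-case x = 64x32 bits, s = 2, C = 4); the program is an illustration of the queueing model, not part of the MulTEP design.

    #include <stdio.h>
    #include <math.h>

    /* M/M/s/C utilisation U (Equation 5.1) with the arrival rate lambda of
     * Equation 5.2 and service rate mu = 1 / run_length.                    */
    static double utilisation(int s, int C, double run_length,
                              double bits_per_cycle, double context_bits) {
        double mu = 1.0 / run_length;
        double B_over_x = bits_per_cycle / context_bits;
        double lambda = (C - 1.0 + B_over_x) * run_length
                      / ((run_length + 1.0 / B_over_x) * C);       /* Equation 5.2 */
        double rho = lambda / (s * mu);

        /* p_n = c_n * p_0, with p_0 fixed by the normalisation sum p_n = 1. */
        double c[257], sum_c = 0.0;                                 /* assumes C <= 256 */
        c[0] = 1.0;
        for (int n = 1; n <= C; n++) {
            c[n] = (n < s) ? pow(rho * s, n) / tgamma(n + 1.0)
                           : pow(rho, n) * pow((double)s, (double)s) / tgamma(s + 1.0);
            sum_c += c[n];
        }
        double p0 = 1.0 / (1.0 + sum_c);

        double accept = 0.0;                                        /* sum_{n=0}^{C-1} p_n */
        for (int n = 0; n < C; n++) accept += c[n] * p0;

        return lambda * accept / (s * mu);                          /* Equation 5.1 */
    }

    int main(void) {
        for (int run = 8; run <= 64; run *= 2)
            printf("s=2 C=4 run=%2d  U=%.3f\n", run,
                   utilisation(2, 4, run, 128.0, 64.0 * 32.0));
        return 0;
    }

Compiled against the C maths library (e.g. cc util.c -lm), this reproduces the trend reported above: utilisation falls when the run length drops below the 16-instruction context-switching period and exceeds 95% once it reaches or passes it.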


Figure 5.8: Utilisation U for different values of C and µ−1.

Figure 5.9: Processor utilisation for different values of p and µ−1.


5.4.2 Incremental Cost per Thread

In conventional multithreading architectures, a simple approach to provide hardware support for multithreading is to duplicate hardware resources to store the execution context of additional threads. The register file grows linearly with the number of thread contexts (see Chapter 3 for more details). As such, the incremental hardware cost per thread is the additional Program Counter (PC), Register set (Rset) and Context Pointer (CP) to be latched in each pipeline stage. Figure 5.10 depicts this additional cost, in pink, for one processing element.


Figure 5.10: Incremental parts per one thread in the processing element.

Figure 5.11 illustrates the incremental cost per thread in MulTEP. MulTEP offers an alternative approach: the register file and fetch unit are only slightly larger, and a closely-coupled activation-frame cache stores inactive contexts. If the activation-frame cache overfills, contexts are spilled to the rest of the memory hierarchy. The pink parts present the units that would need to be altered if the architecture had to support more than 2^16 threads. Otherwise, there is no incremental hardware cost per thread in this architecture.


Figure 5.11: The affected part after adding one thread in MulTEP.

5.5 Summary

My research focuses on designing a high-performance multithreading architecture for embedded systems. The main objectives of this research are:

• To improve performance by context switching rather than stalling the pipeline
• To schedule threads according to their real-time requirements


• To minimise incremental hardware costs

Thread-level parallelism, real-time response and low power consumption become a set of crucial design challenges. To satisfy these challenges, MulTEP has been implemented. Its processing element is developed from the RISC DLX with extensions to support the MIPS IV instruction set architecture. Predicated r-type instructions are added to minimise the number of context-switching indicators, i.e. branch instructions. Furthermore, four multithreading instructions (i.e. spawn, wait, switch and stop) are introduced to progress a thread through its life cycle.

MulTEP is modelled with two processing elements and four execution contexts for utilisation and cost-effectiveness. Thread scheduling is priority-based, using the modified tagged up/down priority queue to meet real-time requirements. The scheduler incorporates dynamic-priority and time-slicing policies to prevent thread starvation. Thread synchronisation is handled by a matching-store mechanism. The synchroniser is facilitated by message passing via a store instruction. The architecture is capable of executing a large and flexible number of threads in accordance with their priorities.

The execution context consists of the program counter, the 31x32-bit integer registers and the 32x32-bit floating-point registers. Though the number of register sets is fixed at four, excess execution contexts can be spilled out to the memory in the form of an activation frame, which is cached. This scheme results in a minimal incremental hardware cost per thread.

Low-power dissipation is supported by eliminating non-deterministic mechanisms such as branch prediction and speculative instruction scheduling. Power operating modes (i.e. running, standby, suspend and hibernate) are provided to allow an implementer to supply suitable power management circuits.

Chapter 6

Hardware Architecture

Two roads diverged in a wood, and I - I took the one less travelled by, and that has made all the difference.

Robert Frost

6.1 Introduction

The MulTEP architecture has been designed to provide thread-level parallelism which is suitable for embedded environments. In accordance with my architectural decisions in Chapter 5, MulTEP consists of four components connected to one another as illustrated in Figure 6.1. A Processing Unit (PU) contains two processing elements. A Multithreading Service Unit (MSU) provides multithreading operations for the synchronisation and scheduling mechanisms. A Load-Store Unit (LSU) is designated to organise the load and store operations. A Memory Management Unit (MMU) handles data, instructions and I/O transactions.

The structure of this chapter is as follows: Section 6.2 illustrates the design strategies and the supported mechanisms of the processing unit. Sections 6.3 and 6.4 describe the multithreading service unit and the load-store unit, respectively. Section 6.5 shows the techniques used in the memory management unit. Section 6.6 summarises the MulTEP hardware architecture.

6.2 Processing Unit

The project focuses on implementing hardware multithreading for embedded systems. Multiple threads often compete for shared computational resources. Thus, support hardware for thread collaboration and competition needs to be included in the core computational component.



Figure 6.1: The MulTEP hardware architecture.

Because of this, the Processing Unit (PU) of MulTEP is designed for multithreading by employing a number of the potential techniques described in Chapter 3. According to the design decisions in Chapter 5, an abstract hardware block diagram of the processing unit is illustrated in Figure 6.2¹. The processing unit is capable of handling up to four execution contexts from different thread streams using two processing elements (see §5.4.1). The unit supports thread competition and collaboration by using message passing through the load-store unit, where the destination of the message is identified with a thread ID (see Section 6.4). Contexts are switched when a context-switch instruction is received or when the execution of a thread needs to be stalled because a register is still waiting for data from the memory.

The underlying mechanisms of the processing unit incorporate pre-fetching, pre-loading and colour-tagging techniques to allow execution contexts to switch without unnecessary overhead. For pre-fetching, the fetch stage is associated with a level-0 instruction cache and a block detector to pre-fetch instructions for at least four different threads. For pre-loading, the unit offers four register sets which contain the four highest-priority execution contexts extracted from all runnable threads in the multithreading service unit. The pre-loading process is supported by an outside-PU context-switching mechanism. The operation of each thread is tagged with a register set's identifier, i.e. a colour identity, which enables

1The processing unit is depicted with a complete illustration of the detailed I/O ports and interconnections to the other units.



Figure 6.2: An overview of the processing unit.

the write-back stage and data-forwarding unit to select a destination with regard to its colour identity.

Each processing element in the processing unit is an enhanced form of the classic RISC DeLuX (DLX) pipeline [24], which has 5 pipeline stages: fetch, decode, execute, memory access and write back. Speculative execution mechanisms are eliminated to minimise power consumption (see the design motivation in §5.4.1).

The remainder of this section presents the detailed design of each component in the processing unit. The level-0 instruction cache is first explained (§6.2.1), followed by the fetch mechanism (§6.2.2), the context-switch decoder (§6.2.3), the modifications to the execution unit (§6.2.4) and the write-back stage which uses colour tagging (§6.2.5).

6.2.1 Level-0 Instruction Cache

Multiple instruction streams need to be fetched from the memory in order to exploit thread-level parallelism. Unfortunately, pre-fetching multiple thread streams often results in competition for resources. To alleviate the competition and further accelerate multithreaded operations, a Level-0 Instruction cache (L0-Icache) is introduced. The level-0 instruction cache pre-fetches instructions in accordance with the scheduling commands from the multithreading service unit. Time-slicing preempts thread execution. This prevents the highest-priority thread from dominating the pipeline. As such, it helps the scheduler to eliminate starvation that may occur with low-priority threads.

Each cache line of the L0 I-cache is designed to support up to 32 instructions. If the run length is more than 32 instructions, further cache lines will be used. In order to support instruction pre-fetching, the L0 I-cache consists of 2n cache lines, where n represents the number of execution contexts. This number is provided for zero-cycle context switches because n cache lines can hold instructions for n executing thread streams while the other n cache lines are used to pre-fetch the next instruction blocks. Hence, eight cache lines are provided in the L0 I-cache to support the four execution contexts in MulTEP. To make use of the scheduling information from the multithreading service unit, four cache-line states (i.e. None, InQ, Next and Ready) are provided with different priorities, as shown in Figure 6.3 and Table 6.1.

State   Priority   Detail
None    00         Holds no instructions (owner does not exist in the PU)
InQ     01         Holds instructions of a thread inserted in the ready queue
Next    10         Holds the next pre-fetched block of a thread in the PU
Ready   11         Holds a block of the extracted thread, reserved for execution

Table 6.1: The states of a cache block in the level-0 instruction cache.

The lowest-priority cache line is replaced if and only if the replacement cache line is of a higher priority2.



Figure 6.3: A state diagram of each level-0 instruction cache line.

Figure 6.4 illustrates how the pre-fetching mechanism alters a cache-line state. The first cache line has held instructions starting at Address1 since time n. Its InQ state indicates that its owner, i.e. Thread 1, has already been inserted into the scheduler of the multithreading service unit3 (see more details in Section 6.3). At time n+2, the execution context of Thread 1 is extracted to register set 1 of the processing unit. The state of the first cache line is then changed to Ready.


Figure 6.4: Pre-fetching operation on the level-0 instruction cache.

An incomplete instruction block is indicated by a block detector when a branch, a jump or a wait opcode could not be found in the pre-fetched instructions. In the case that

2The operation is conducted with the same mechanism as the LRU replacement policy [26].
3The operation that sets a cache line to the InQ state is also shown at time n+1 of Figure 6.4.

81 Chapter 6. Hardware Architecture the current block is incomplete, the rest of the instructions are required to be pre-fetched. At the next clock cycle, after an incomplete block is selected by the processing element, a new block starting with the next location (Address1+block size) is pre-fetched4 and the cache-line state is set to Next. Figure 6.5 depicts an access procedure in the L0 I-cache. The upper 25-bit tag field5 is first compared to every tag field in the L0 cache table. A matched entry with a valid status provides an index to the corresponding cache line. The 5-bit offset is then used to locate the required instruction.


Figure 6.5: The access procedure in the level-0 instruction cache.
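The address decomposition used by this lookup can be written as a short C sketch; the field widths follow the description above (25-bit tag, 5-bit instruction offset, 2-bit byte offset), while the structure and function names are illustrative.

    #include <stdint.h>

    typedef struct {
        uint32_t tag;      /* upper 25 bits, compared against the L0 table     */
        uint32_t offset;   /* which of the 32 instructions within a cache line */
    } l0_addr;

    static l0_addr l0_split(uint32_t address) {
        l0_addr a;
        a.tag    = address >> 7;           /* drop 5-bit offset + 2-bit byte offset */
        a.offset = (address >> 2) & 0x1Fu; /* instruction index within the line     */
        return a;
    }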

6.2.2 Fetching Mechanism

As reviewed in Chapter 2, the operation of a single thread stream often introduces a number of bubbles caused by pipeline conflicts or memory access latency. Chapter 3 demonstrates that these wasted bubbles can be replaced by alternative thread streams. To assist this, multiple thread streams need to be pre-fetched. In MulTEP, two processing elements share four fetch elements. This requires a special fetching strategy to fill all fetch units in order to switch contexts without any overhead. As mentioned in §6.2.1, pre-fetched instructions of the extracted threads are prepared in the 8-set L0-Icache adjacent to the fetch unit. Therefore, a fetch queue is provided to sort the extracted threads' start addresses in accordance with their priorities.

4Instruction pre-fetching is conducted if the replacement policy allows it.
5The 25 bits are the remainder of the 32-bit instruction address after subtracting the 5-bit offset (32 instructions per cache line) and the 2-bit instruction size (4 bytes).


The fetch queue is implemented with a tagged up/down priority queue [8] because it is capable of sorting prioritised data within a single clock cycle. In the priority queue, only two LR elements are required to buffer four addresses, since the other four addresses of the eight maximum cache lines of the L0-Icache will be held in the fetch unit. The input key of the priority queue is an 8-bit thread priority and the input data is a 30-bit start address6. The sorting policy is "extract maximum"7. In the example presented in Figure 6.6, three fetch units contain executable threads whose addresses start from A1, A2 and A3. The first two cache lines are limited by the boundary of 32 instructions and still need more instructions. The remaining cache lines starting with A1+32 and A2+32 have already been pre-fetched into cache lines 5 and 68. At time n, one fetch element is available. The fetch unit immediately extracts the L0 cache-line pointer whose priority is the highest (i.e. cache line 4) from the fetch queue.


Figure 6.6: Fetching and pre-fetching operation by the fetch unit.

At time n + 1, the highest-priority address (i.e. A4) is loaded into the available fetch element. The address is simultaneously used to fetch an instruction from the L0-Icache via the L0-Icache table. The instruction will be returned to the fetch element in the next clock cycle. During the address resolution, if the complete flag in the L0-Icache table is clear9, the next block of instructions will be pre-fetched into another appropriate cache line of the L0-Icache in the next clock cycle. The pre-fetching address for that cache line starts at the address of the incomplete cache line plus one block size (A4+32). Consequently, the priority of the continuation cache line is set to Next.

6The start address is only 30 bits because the lower 2 bits are always 0.
7With the extract-maximum key policy, the unused key value is set to 0.
8Cache lines 5 and 6 are available because their previous instruction blocks are incomplete and those incomplete cache lines have already been selected by the processing unit.
9A set complete flag indicates that the block is complete. Hence, if the flag is valid, no further instructions need to be loaded.

83 Chapter 6. Hardware Architecture

6.2.3 Context-switch Decoder

To maximise the benefits of multithreading, idle units should be detected as early as possible and filled with alternative streams of instructions. Most of the time, bubbles are found by the decode stage because it is capable of detecting most of the data, control and structural conflicts. Therefore, the context-switch decision is made in this stage of the pipeline. The decoder also understands the special multithreading instructions which indicate context-switch signals (see §5.3.3).

In general, the decoder translates an opcode to its corresponding control signals10 and extracts operands according to the field allocation specified by the type of the opcode (see §5.3.3). Decoding the opcode allows the unit to detect switch, wait and stop, which are three special opcodes that indicate context switches. Decoding the operands allows the unit to detect data-conflict bubbles early (e.g. the scoreboard indicates that data for one of its registers is being loaded). Therefore, the execution can be switched to another thread instead of issuing a bubble. The decoder is also capable of interpreting stall signals from the other units, such as an instruction miss from the fetch unit. These signals are used to make context-switch decisions.

There are two levels of context-switching operation. The first is an inside-PU context switch, which switches an execution context without any overhead. The second is an outside-PU context switch, which allows the system to support a flexible number of threads. These two context switches operate as follows:

1. Inside-PU context switch

The inside-PU context switch is done simply by changing a Context Pointer (CP) of the available decode element to point to an alternative context in the processing unit. This alternative context should have the highest priority and must not have been selected by the other decode element. In MulTEP, the context pointer is obtained from the decode queue, which is a two-entry tagged up/down priority queue. In the sorting queue, the package's key is an 8-bit thread priority and the package's data is a 2-bit register index. The extracted register index from the queue is also used to signal instruction pre-fetch for the L0-Icache.

2. Outside-PU context switch

An outside-PU context switch is provided to exchange an execution context in a register set with its activation frame held in the multithreading service unit. The operation combines the following two methods:

• Eliminate unnecessary transfer traffic

10Simple operations such as load/store, multithreaded, logical and arithmetic instructions are directly translated, leaving more complicated opcodes such as multiply, divide and floating point to be used as an index to their [24].


The underlying context-switching method is similar to Dansoft's nanothreading approach [129], which reduces context-transfer traffic when contexts within the same thread group need to be exchanged. In such a case11, the context transaction is reduced by transferring only the priority information, the program counter and the parametric registers12 (see details in §5.3.2). Therefore, only 26 words are transferred instead of the whole 64 words. Switching the full execution context occurs when the exchange of threads involves two different thread groups, or when an incoming context's start PC differs from the nanothread pointer (field K1).

• Associate a context status with each execution context

Execution contexts in both the register sets and the program counters can be spilled out as activation frames to the multithreading service unit. Each register set and its corresponding program counter are usually shared by more than one activation frame. Therefore, it is necessary to harmonise multiple execution contexts on each shared resource. As a result, each execution context is associated with a suitable context status indicator. In general, the status represents whether a context is valid, invalid, being used or being transferred. The transfer states reflect the need to transfer contexts over multiple clock cycles due to bandwidth limitations between the register set and the activation frame. Figure 6.7 presents the state diagram for context transfers. In total, there are eight context states represented by a 3-bit state value. The initial state is Invalid.


Figure 6.7: The 3-bit state diagram of the context status.

An Invalid context is activated by a waitThread signal when there is a waiting thread requiring its context to be transferred in. MulTEP dedicates two separate buses for transferring execution contexts in and out.

11The exchange within the same group occurs when the upper 16 bits of $at from the register file equal the upper 16 bits of field K1 in the activation frame.
12The parametric registers range from register $v0 to register $t9.

85 Chapter 6. Hardware Architecture

Each execution context bus is 128 bits wide to transfer 4 words per cycle13. Therefore, a fixed number of cycles (i.e. 7 or 16 cycles) is required for an outside-PU context switch. The CtxtIn counter starts with the number of cycles required to undertake the transfer. The context remains in the Transfer In state until this counter reaches zero. After that, the context status is changed to either Valid (if an instruction pointed to by the program counter is available in the fetch unit) or Wait for Instruction (to be validated when the instruction is available).

A Valid context can be selected by a decoder and changed to the Used status. The context leaves the Used status when:

– An I-cache miss is detected.
– The context is deselected.
– The execution of the thread is completed.
– An outside-PU context switch is activated.

If the context needs to be transferred out, the status is changed to either Transfer Out, if no thread waits to be transferred in, or Transfer Out/In, when contexts in the register set are being transferred in and out simultaneously. When the transferring-out process is complete and there is no waiting thread to be transferred in, the status is changed to Hold. The context is held ready to be activated later, or replaced when another thread in the same thread group needs to be transferred in. The context can be reset back to Invalid once a kill signal occurs, except during the Transfer Out/In process, because the other thread is being transferred in.
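The cycle counts quoted above follow directly from the 128-bit bus width, as the following C sketch shows: 4 words move per cycle, so a reduced 26-word transfer takes 7 cycles and a full 64-word activation frame takes 16 cycles. The helper is illustrative only.

    /* Cycles needed to move `words` 32-bit words over a 128-bit bus
     * (4 words per cycle), rounded up.                               */
    static int transfer_cycles(int words) {
        return (words + 3) / 4;
    }

    /* transfer_cycles(26) == 7  : same-group switch (PP, PC, parametric registers)
     * transfer_cycles(64) == 16 : full activation frame                            */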

6.2.4 Execution Units

The execution unit consists of four subpipelines to support multi-cycle instructions (see Chapter 2). The unit comprises an integer pipeline, a Floating-Point (FP) adder subpipeline, an FP/integer multiplier subpipeline and an FP/integer divider subpipeline, as presented in Figure 6.8. The execution of a single-precision floating-point multiply is based on a Verilog 6-stage confluence FP/integer multiplier core introduced by the Opencore initiative [135]. The FP adder is a standard RISC DLX integer ALU [24]. A single-precision [136] FP/integer divider emulates the 15 pipeline levels developed by the DFPDIV development of the Digital Core Design [137].

Data feed-forwarding for all subpipelines is similar to the mechanism in DLX [24]. A slight difference is its support for multithreading, where additional register-set ownership information is required on top of the register index.

13The size of the context bus is designed to be the same as the size of the data bus in commercial embedded processors [37, 134].



Figure 6.8: The extended execution stage of one processing element.

This requirement only results in two additional bits per data-forwarding package14. Unexpected exceptions from the subpipelines are handled by imprecise roll-back techniques [24, 26, 33, 34]. Whilst MulTEP takes advantage of multiple instruction issue, speculative execution is not used, since speculative techniques consume extra power and reduce real-time predictability. Furthermore, the performance gained from thread-level parallelism is believed to compensate for the need for instruction-level parallelism extraction using speculation.

6.2.5 Colour-tagged Write Back

A thread is tagged with an index into its register set, called a colour. With this technique, each colour-tagged operation flows through a processing element with a reference to its corresponding register set. Figure 6.9 illustrates the flow of multiple threads through the two integer subpipelines. The figure shows that the blue thread stream, executed in PE1, is switched to PE0 at time n without any overhead. Instructions are processed independently in accordance with their colour, supported by the data-forwarding and write-back stages.

A data feed-forwarding unit is shared between the two pipelines. This unit is necessary for zero-cycle context-switch support in the execution stage (i.e. an integer subpipeline), as illustrated at times n+1 and n+2. The scoreboard prevents data hazards which may be caused by a duplicate reference to the same register15. This allows the write-back stage to be independent. Thus, up to eight write requests16 may arrive at the same register set simultaneously. However, a register set with 8 write ports is impractical. Thus, the design uses a traditional 2 write ports with 4 buffers for each subpipeline17. All write-back requests are queued and granted in round-robin order.

14Seven bits are required for register identification (two bits identify the register set and five bits identify the register).
15The issuance of an instruction whose register is invalid is not permitted.
16From the 4 subpipelines of the 2 processing elements.
17The pipeline will be stalled if the queue is full.



Figure 6.9: The processing unit to support zero context-switching overhead.

With this colour-tagging scheme, if sufficient18 execution contexts have already been pre-loaded and sufficient instructions have already been pre-fetched, the processing unit is capable of switching contexts without any overhead.

6.3 Multithreading Service Unit

As discussed in Chapter 1 and §5.2.3, MulTEP aims to be capable of simultaneously handling a large and flexible number of threads. Nevertheless, the processing core is designed to support only a fixed number of threads, i.e. 4 execution contexts, in order to minimise both context-switch delay and implementation size. Therefore, excess threads need to be handled by an additional mechanism. To support this requirement, I decided to introduce a Multithreading Service Unit (MSU) into the MulTEP architecture.

In accordance with the proposed strategies in Section 5.3, the multithreading service unit holds excess execution contexts in the form of Activation Frames (AFs) in a local fully-associative activation-frame cache. The size of the activation-frame cache does not limit the number of execution contexts because they are extensible to the main memory. Activation frames preserve the execution contexts of both excess runnable threads and non-runnable threads. The multithreading service unit schedules runnable threads with a tagged up/down priority queue (see scheduling details in §5.3.6). The unit synchronises non-runnable threads by using a matching-store mechanism (see synchronisation details in §5.3.5).

The multithreading service unit additionally provides activation-frame allocation and deallocation mechanisms to facilitate thread creation and thread termination in accordance with messages received from the processing unit. Figure 6.10 presents how these two operations influence the allocation of activation frames in the activation-frame cache. Empty activation frames are held in a linked list. The head of the list is pointed to by an EmptyH pointer, which is used for thread creation (see §6.3.2). The tail of the list is pointed to by an EmptyT pointer, which is used for thread termination (see §6.3.5).

Underlying mechanisms inside the multithreading service unit are presented in Figure 6.11. The following sections introduce the details of these mechanisms, starting with the request arbiter (§6.3.1). Next, four mechanisms that provide the multithreading services, namely the spawning mechanism (§6.3.2), the synchronising mechanism (§6.3.3), the switching mechanism (§6.3.4) and the stopping mechanism (§6.3.5), are described.

18Sufficient execution contexts occur when the number of valid execution contexts exceeds the number of processing elements.



Figure 6.10: Thread creation and termination in the activation-frame cache.


Figure 6.11: The multithreading service unit.


6.3.1 Request Arbiter

More than one multithreaded operation (i.e. spawning, synchronising, switching and stopping) may request access to the shared resources (i.e. the activation-frame cache table and the activation-frame cache) at the same time. To prevent an undesirable clash, the multithreading service unit needs a request arbiter at the unit's front end (depicted in blue in Figure 6.11). With the arbiter, requests from the two processing elements of the processing unit (i.e. PUin[0] and PUin[1]), a request from the load-store unit (i.e. SQueue), a request from the spawning preparation (see §6.3.2) and a request from the switching preparation (see §6.3.4) are selected separately according to their priorities. If their priorities are the same, then round-robin ordering is applied. In general, the arbiter makes a decision on every clock cycle. However, when a previous operation needs more than one clock cycle to complete, namely a switching operation (§6.3.4) or an activation-frame cache miss penalty, incoming requests are blocked.

6.3.2 Spawning Mechanism

In conventional processors, or even in most multithreaded processors, the initialisation of a new thread is performed by software [5]. With such a design, the processor performs the following sequence: reserve stack space and create the thread's initial context. Thus, the spawning operation often consumes a large number of processing cycles, which may limit processing performance, obstruct real-time response and waste power.

To eliminate this time-consuming spawning operation, MulTEP provides a simple but quick independent spawning mechanism19 in the multithreading service unit. The spawning mechanism delegates the task to the multithreading service unit, freeing up the processing elements. The spawning operation consists of the following two steps:

1. Prepare an empty activation frame

An empty activation frame is extracted from the head of the empty-AF linked list. This empty activation frame is available in advance. Figure 6.12 illustrates the case where the empty activation frame at the head is the last entry in the linked list. The spawning mechanism speculates that additional activation frames will be required when this last empty activation frame is used. Thus, the mechanism simultaneously sends a request for additional empty activation frames to be allocated from the main memory20.

2. Spawn a thread

Once the token arrives, the mechanism creates a new thread context in the empty activation frame and simultaneously sends the activation-frame address back to the

19A spawning operation is additionally supported by a reserved activate thread table in the memory model (see Section 6.5) and a preparation of such a table by the kernel level (see §7.3.1).
20Additional empty activation frames are allocated with support from an interrupt daemon (see Chapter 7).



Figure 6.12: Preparing an empty activation frame in advance.

parent thread that issued this spawn instruction, as an acknowledgement. The mechanism performs and acknowledges the spawning request in a single clock cycle, as shown in Figure 6.13. The acknowledgement package is delivered to the parent thread via the load-store unit21 in the form of a STORE DAT package (see §6.4.4). Simultaneously, the empty activation frame is initialised as follows (see §5.3.2 and the sketch below):

• The AT field, i.e. the PC space, and the K1 field, the re-activate location, are both initialised with the address.
• The K0 field is initialised with the parent ID and the thread ID.
• The priority and presence fields are initialised with the value of the PP operand, obtained from the $v0 register of the parent thread.
• The SP field is initialised with a stack operand, obtained from the $sp register of the parent thread.
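The initialisation above can be summarised by the following C sketch. The structure mirrors the repurposed activation-frame words of §5.3.2, but the field names, the packing of K0 and the function itself are illustrative rather than the hardware's behaviour.

    #include <stdint.h>

    typedef struct {
        uint32_t pp;    /* priority + presence flags     */
        uint32_t at;    /* current program counter       */
        uint32_t k0;    /* parent/group ID and status    */
        uint32_t k1;    /* nanothread pointer (start PC) */
        uint32_t sp;    /* stack pointer                 */
    } af_header;

    /* Initialise an empty frame for "spawn reg, address" using the parent's
     * $v0 (pp) and $sp (stack) operands, as listed above.                   */
    static void spawn_init(af_header *af, uint32_t address,
                           uint16_t parent_id, uint16_t thread_id,
                           uint32_t pp, uint32_t stack) {
        af->at = address;                                 /* PC starts at the spawn target */
        af->k1 = address;                                 /* re-activate location          */
        af->k0 = ((uint32_t)parent_id << 16) | thread_id; /* packing assumed               */
        af->pp = pp;                                      /* priority and presence flags   */
        af->sp = stack;
    }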

6.3.3 Synchronising Mechanism

The hardware mechanism that supports thread synchronisation is similar to Anaconda's matching-store mechanism [4]. However, it is slightly different because MulTEP delegates the match process to the load-store unit (see Section 6.4). A store message to an activation frame is a synchronisation event handled by the multithreading service unit.

21The parent thread may either remain in the processing unit or have already been switched out (see Section 6.4).



Figure 6.13: Spawning a new thread.

This process is illustrated in Figure 6.14. The destination address is identified by a thread ID and a register ID. The thread ID indicates the activation-frame address22. The register ID indicates the location where the data will be stored and asserts the corresponding presence flag. The remaining mechanism then validates the status of the updated activation frame: if the presence flags are all set, the updated thread is runnable and is dispatched to the tagged up/down scheduler. In the tagged up/down scheduler, priority 0 is reserved to indicate an empty entry. Therefore, the runnable thread's priority is increased by 1 before being inserted into the queue, to avoid a coincidental occurrence of priority 0.

6.3.4 Switching Mechanism

To support a large number of threads, MulTEP provides a mechanism to switch execution contexts between the processing unit and the multithreading service unit. The switching mechanism collaborates with the outside-PU context-switching operation in the processing unit (see §6.2.3).

From the processing unit's perspective, the lowest-priority register set which is not in use23 is swapped out to the multithreading service unit. Simultaneously, the highest-priority runnable activation frame from the multithreading service unit is swapped in.

Prior to the swapping procedure, a switching decision needs to be made. The switching decision requires two steps, because both the address of the highest-priority runnable activation frame and the address of the lowest-priority not-used activation frame have to be queued for accessing the shared activation-frame cache table, as follows:

22The figure illustrates the case when a thread is present in the activation-frame cache.
23A context whose status is not Used (see §6.2.3).



Figure 6.14: The store mechanism.

1. Prepare the address of the highest-priority runnable activation frame

Prior to the arrival of a switch package, the index of the highest-priority activation-frame address is prepared in an AFout register. The highest-priority activation frame is obtained from the thread ID in the valid24 R[0].Data entry of the tagged up/down scheduler (see Figure 6.15).

2. Issue a switching decision to the context-switch handling unit

The switching decision for the context-switch handling unit consists of:

• A pointer to the register set whose contexts will be swapped.
• An "in" activation-frame index for an execution context moving from the processing unit to the multithreading service unit.
• An "out" activation-frame index for an execution context moving from the multithreading service unit to the processing unit.

When a switch token arrives, the following process occurs (see Figure 6.16):

• The register-set pointer is sent to the context-switch handling unit.
• The thread ID resolves the index of the "in" activation frame and sends it to the context-switch handling unit.
• The index in AFout, which has already been prepared, is sent to the context-switch handling unit.

24The valid entry is indicated by the non-zero value of R[0].Key.



Figure 6.15: Preparing the index to the highest runnable AF in advance.

Figure 6.16: Issuing a switching decision to the context-switch handling unit.


As MulTEP uses 128-bit buses to transfer contexts, the context-switch handling unit transfers a context at 4 words per cycle, moving up to 64 words in and out of the multithreading service unit. Based on this, the transfer takes either 7 clock cycles for an exchange of execution contexts within the same thread group, or 16 clock cycles for an exchange between different thread groups (see §5.3.2).

When the transfer is complete, the context-switch handling unit signals the arbiter to release its lock (see §6.3.1). The unit simultaneously notifies the load-store unit via an UPDATE PUInfo command (see §6.4.4) to update the location of the transferred activation frames.
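The worst-case figure follows directly from the bus width quoted above; the shorter 7-cycle case corresponds to the smaller within-group context exchange described in §5.3.2:

\[
\frac{64\ \text{words}}{4\ \text{words per cycle}} = 16\ \text{cycles}.
\]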

6.3.5 Stopping Mechanism

Terminated activation frames should be freed so that they are available for thread creation. This is provided by a stopping mechanism, as shown in Figure 6.17. The mechanism frees the terminated activation frame, indicated by the thread ID of the stop token, by adding the activation frame to the empty-AF linked list. Simultaneously, a DEL PUInfo command (see §6.4.4) is generated for the load-store unit in order to remove all of the thread's corresponding entries (see Section 6.4).

Figure 6.17: The stopping mechanism.


Additionally, a thread can be killed by another thread, including a thread which is still waiting to be scheduled. To avoid scheduling a terminated thread, the stopping mechanism searches through the tagged up/down scheduler in parallel. If a matching thread-ID entry is found, its priority is set to zero and the zero-priority token is then effectively removed by the tagged up/down mechanism25.
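A behavioural sketch of this kill path, with invented names and sizes, is shown below: the terminated frame is linked onto the empty-AF list, a DEL PUInfo command is issued, and every matching scheduler entry has its sort key cleared to zero so that the tagged up/down mechanism drops it. The hardware performs the search in parallel; the loop merely stands in for that.

```c
#include <stdint.h>

#define SCHED_ENTRIES 64                    /* scheduler capacity (illustrative) */

typedef struct { uint8_t key; uint16_t thread_id; } sched_entry;

static sched_entry scheduler[SCHED_ENTRIES];

/* Assumed helpers for the empty-AF linked list and the DEL PUInfo command. */
void empty_af_list_add(uint16_t thread_id);
void lsu_del_puinfo(uint16_t thread_id);

/* Stop-token handling: free the frame and purge the thread from the scheduler. */
void on_stop_token(uint16_t thread_id)
{
    empty_af_list_add(thread_id);
    lsu_del_puinfo(thread_id);
    for (int i = 0; i < SCHED_ENTRIES; i++) {
        if (scheduler[i].key != 0 && scheduler[i].thread_id == thread_id)
            scheduler[i].key = 0;   /* zero-priority entries are removed by the scheduler */
    }
}
```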

6.4 Load-Store Unit

In MulTEP, store instructions are used for thread synchronisation and play a critical role in hardware multithreading. Analysis of the flow paths of load and store commands resulted in separate paths from the Processing Unit (PU) to the Multithreading Service Unit (MSU) and to the Memory Management Unit (MMU). This is because the path between the processing unit and the multithreading service unit is provided for thread synchronisation, while the path between the processing unit and the MMU is provided for data transfer.

Figure 6.18: Instruction flows of load and store instructions without the LSU: (a) the flow of loads/stores via the PU; (b) data is loaded and stored from the main memory; (c) data is loaded by a thread which has been switched out of the PU.

Figure 6.18 illustrates the MulTEP architecture without a load-store unit. Figure 6.18(c) depicts that data could be loaded from memory by a thread which has been switched out of the processing unit. A return of the loaded data would therefore require the processing unit to generate a corresponding store command to the multithreading service unit. Centralising load/store execution on the processing unit wastes its service time and bus bandwidth, both of which are critical for the system. So that loads can bypass the processing unit, an additional functional unit called a Load-Store Unit (LSU) was introduced into the architecture; it has the potential to alleviate this problem, based on the investigation of load/store independence presented in Chapter 2. The load-store unit handles the flow of load and store instructions in MulTEP as illustrated in Figure 6.19(a).

25 If the dead thread is in R[0] of the tagged up/down sorter, the zero-priority level will be extracted from the scheduler in the next clock cycle.


Figure 6.19: Instruction flows of load and store instructions with the LSU: (a) the flow of loads/stores via the LSU; (b) data is loaded from memory to be stored to another location; (c) data is loaded from the main memory by a thread which has been switched out of the PU.

The load-store unit not only handles data independently of the processing unit but also allows the processing unit to continue without stalling for load operations. Figure 6.19(b) and Figure 6.19(c) present the flow of load and store transactions of data that are irrelevant to the processing unit.

The underlying architecture of the load-store unit and its I/O interfaces to the processing unit, the multithreading service unit and the data memory are presented in Figure 6.20. Inside the load-store unit, two separate input queues are available for load/store commands from each processing element26 (§6.4.1), and three groups of mechanisms are provided to handle transactions independently, as depicted with different colours in the figure. Mechanisms in green support load operations (§6.4.2). Mechanisms in pink handle store signals (§6.4.3). Mechanisms in blue handle the side effects of load and store communications and support a couple of operations required by multithreaded execution (§6.4.4).

6.4.1 Load/Store Queues

The load-store unit consists of a number of mechanisms which are shared by the two processing elements. To prevent clashing accesses to these shared resources, arriving load/store requests from the two processing elements need to be queued. The size of each input queue was estimated using operational analysis [138] (λT). From the cache performance results in [139], the arrival rate of load/store instructions (λ) is 23.20% of the total instructions. Estimated from the cache design and the conflicts of 4 threads obtained from [139], the worst data cache-miss rate (p) is 20%. As MulTEP is simulated with a cache miss penalty (M) of 200 cycles and a hit cost (C) of 5 cycles [24], T is computed using Equation 2.1. To maximise the coverage of the binary indexing bits, the size of the queue (N) should be a power of two:

26 The load/store queues are separated because load/store events may arrive at the same time.


Figure 6.20: The load-store unit. (In the figure, * indicates that a queue is a tagged up/down priority queue.)


N = 2^⌈log₂(λT)⌉                                                      (6.1)

  = 2^⌈log₂(0.2320 (pM + (1−p)C))⌉ = 2^⌈log₂ 5.568552⌉ = 8            (6.2)

From this calculation, the suitable size of the queue is eight entries. Further evaluation beyond this theoretical estimate is conducted in Chapter 8 using simulations in which the length of the queue is varied to find the optimum performance.

To support the real-time requirements of MulTEP, each PUin queue is a tagged up/down priority queue [8]. The sorting key is the 8-bit thread priority and the data is the input package. To cope with the worst case, in which a queue becomes full, an RQFull signal is provided. This signal is asserted when the queue fills, stalling the corresponding integer subpipeline.
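As a check on the estimate, the final rounding-up-to-a-power-of-two step can be reproduced with a few lines of C. The occupancy value of roughly 5.57 is the λT figure quoted in the derivation above; the function name is invented for the sketch.

```c
#include <math.h>
#include <stdio.h>

/* Round the expected occupancy lambda*T up to the next power of two. */
static int queue_entries(double expected_occupancy)
{
    return 1 << (int)ceil(log2(expected_occupancy));
}

int main(void)
{
    /* 5.568552 is the lambda*T value quoted in the text; the result is 8 entries. */
    printf("queue size = %d\n", queue_entries(5.568552));
    return 0;
}
```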

6.4.2 Load Operations

The load-store unit redirects load-return transactions to their appropriate destinations. For a thread which has been switched out of the processing element, the returned data is transferred to its activation frame in the multithreading service unit.

In the interest of fairness and a simple implementation, load requests from the load/store input queues are selected in round-robin order through a multiplexer. To avoid wasted cycles, the multiplexer's selector is capable of waiving the round-robin opportunity if the queue whose turn it is happens to be empty. The waiver is implemented by simply XORing the round-robin turn with the empty flag of that queue, as presented in Table 6.2.

Round robin   PUin[0] empty   PUin[1] empty   Selector
     0              0               -             0
     1              -               0             1
     0              1               -             1
     1              -               1             0

Table 6.2: A round-robin waiver logic table.
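The waiver in Table 6.2 amounts to a single XOR. A minimal C rendering, with invented names, is:

```c
#include <stdbool.h>

/* Select the next load/store input queue. `turn` is the round-robin turn
 * (0 selects PUin[0], 1 selects PUin[1]); the waiver flips the choice when
 * the queue whose turn it is happens to be empty (Table 6.2). */
static int select_queue(int turn, bool empty0, bool empty1)
{
    bool empty_of_turn = (turn == 0) ? empty0 : empty1;
    return turn ^ (empty_of_turn ? 1 : 0);
}
```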

To prevent undesired read-before-write conflicts, the selected load address is compared with all addresses of the waiting store operations in a store-wait buffer (swait). If the address matches, the data is returned to the first load-return (LRet[1]) queue and waits to be sent back to the processing unit. Load-return data from the swait buffer is buffered in the LRet[1] queue because load-return data may arrive from both the data memory and the activation frame at the same time.

If the load address is not located in the swait buffer, then the address, its protection mode and its group ID are sent out to acquire the datum. Simultaneously, its register destination, its thread priority and its thread ID are all kept in a load-wait buffer (lwait). Once a valid datum returns, the corresponding entry in the lwait buffer is sent to the zeroth load-return (LRet[0]) queue and is then removed from the lwait buffer.

Valid data from the load-return queues, and from some multithreading operations in the multithreading service unit (see details in §6.4.4), are selected in round-robin order. Before load-return data is transferred to the processing unit, it is sent through a byte-lane steering unit. The byte-lane steering unit performs sign extension and big-endian/little-endian adjustment as appropriate. It additionally allows the load data to be unaligned or to be represented as a byte, a halfword or a word.
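The request side of this load path can be summarised by the sketch below. It is a simplified software model under assumed names (`swait_lookup`, `lwait_record`, `lret1_push`, `memory_load`): a load is either satisfied from the store-wait buffer or issued to memory while its destination is remembered in the load-wait buffer.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;
    uint32_t data;
    bool     data_valid;
} swait_entry;

typedef struct {
    uint32_t addr;
    uint8_t  reg_dest;
    uint8_t  priority;
    uint16_t thread_id;
} lwait_entry;

/* Assumed helpers for the load-return queues, the memory port and the buffers. */
void lret1_push(uint32_t data, uint8_t reg_dest, uint16_t thread_id);
void memory_load(uint32_t addr, int mode, uint16_t group_id);
bool swait_lookup(uint32_t addr, swait_entry *out);
void lwait_record(const lwait_entry *e);

/* Selected load request: forward data from the store-wait buffer if the address
 * is pending there (read-before-write), otherwise issue the load to memory and
 * remember the destination in the load-wait buffer until the datum returns. */
void handle_load(uint32_t addr, uint8_t reg_dest, uint8_t prio,
                 uint16_t thread_id, int mode, uint16_t group_id)
{
    swait_entry pending;
    if (swait_lookup(addr, &pending) && pending.data_valid) {
        lret1_push(pending.data, reg_dest, thread_id);
    } else {
        memory_load(addr, mode, group_id);
        lwait_entry e = { addr, reg_dest, prio, thread_id };
        lwait_record(&e);
    }
}
```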

6.4.3 Store Operations

Thread synchronisation in MulTEP is based on the matching-store methodology (see Chapter 2). Store instructions to activation frames are crucial for synchronising the execution of multiple threads. Therefore, the load-store unit must redirect these store instructions to the multithreading service unit instead of to the data memory.

Two store requests come from the processing unit and a third comes from a store-in queue (Sin). These requests are handled in round-robin order. The Sin store-signal queue is the element that redirects load-return data to the multithreading service unit instead of to the processing unit. This redirection occurs when the execution context of the package's owner has already been switched out of the processing unit.

To redirect the flow of store transactions to their appropriate locations, an AF-address comparator lets the store mechanism detect whether the address of a store request targets an activation frame (i.e. starts with 0xFF) or not. If the destination is an activation frame, the store instruction is sent to the multithreading service unit. Otherwise, the instruction is sent to the data memory. Simultaneously, the store instruction is accumulated in the swait buffer27 and remains there until its store acknowledgement is returned. The acknowledged store package is then removed from the swait buffer.

MulTEP allows the processing unit to issue an incomplete store request so that execution can continue to the next instruction without stalling (see an example in §8.2.2). An incomplete request is a store package whose data, address, or both are not yet available. To support this, the load-store unit holds incomplete packages in the swait buffer and validates them against entries in the lwait buffer for data consistency. When an entry becomes complete, it is removed from the swait buffer and transferred to its appropriate destination.

27 The swait buffer is provided to prevent read-before-write conflicts.
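For a complete store request, the redirection rule reduces to a test on the top address byte, as sketched below. The helper names (`msu_store`, `memory_store`, `swait_add`, `swait_remove`) are assumptions; the incomplete-store handling described above is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed output ports and store-wait bookkeeping. */
void msu_store(uint32_t addr, uint32_t data);      /* store to an activation frame */
void memory_store(uint32_t addr, uint32_t data);   /* store to the data memory */
void swait_add(uint32_t addr, uint32_t data);      /* track until acknowledged */
void swait_remove(uint32_t addr);

/* Activation-frame addresses occupy the 0xFF... segment (Figure 6.24). */
static bool is_af_address(uint32_t addr)
{
    return (addr >> 24) == 0xFFu;
}

/* Dispatch a complete store request and keep it in swait until acknowledged. */
void handle_store(uint32_t addr, uint32_t data)
{
    swait_add(addr, data);
    if (is_af_address(addr))
        msu_store(addr, data);      /* thread synchronisation via the MSU */
    else
        memory_store(addr, data);   /* ordinary data transfer via the MMU */
}

void on_store_ack(uint32_t addr)
{
    swait_remove(addr);
}
```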


6.4.4 Multithreaded Operations

As mentioned in the preceding sections, the load-store unit supports thread communication by filtering incoming addresses and directing them to their appropriate destinations. The destinations of load-return data are indicated by information in a PUInfo table. Each entry of the table consists of two fields: a thread ID and a pointer to its register set. If the load-return thread ID exists in the table, the flow is sent to the processing unit. Otherwise, it is changed into a store instruction and sent to the multithreading service unit.

The load-store unit is additionally capable of generating load and store instructions in response to the multithreading requests of the multithreading service unit (see details in Section 6.3). These additional load and store packages are necessary to allow the additional multithreading instructions to be executed efficiently.

Table 6.3 shows how the 2-bit ContextInfo signal of the multithreading service unit represents four multithreading commands.

Command          Value   Meaning
NONE             00      No request
UPDATE PUInfo    01      Update PU info about context allocation
STORE DAT        10      A store instruction from the MSU
DEL PUInfo       11      Delete PU info of a dead thread

Table 6.3: The ContextInfo states from the MSU to the LSU.

The default command is NONE, which indicates that there is no request from the multithreading service unit. An UPDATE PUInfo command indicates that an execution context, pointed to by Cdat (see the MSU input in Figure 6.20), has already been sent to the register set indicated by Ccp. The UPDATE PUInfo command arrives when an outside-PU context switch finishes. A STORE DAT command supports the spawn instruction: it lets the load-store unit generate a load-return package to send the location of the spawned activation frame back to its waiting register (see §6.3.2). A DEL PUInfo command is generated when a thread in the processing unit has been stopped; it lets the load-store unit remove the thread's corresponding entry from the PUInfo table.
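The dispatch on ContextInfo can be written as a small switch over the four 2-bit codes of Table 6.3. The enum and the helper names (`puinfo_update`, `puinfo_delete`, `send_load_return`) are illustrative, not the hardware interface.

```c
#include <stdint.h>

/* The 2-bit ContextInfo commands from the MSU (Table 6.3). */
enum context_info {
    CI_NONE          = 0x0,   /* no request */
    CI_UPDATE_PUINFO = 0x1,   /* record which register set now holds a context */
    CI_STORE_DAT     = 0x2,   /* forward a store/load-return generated by the MSU */
    CI_DEL_PUINFO    = 0x3    /* drop the PUInfo entry of a dead thread */
};

/* Assumed actions on the PUInfo table and the load-return path. */
void puinfo_update(uint16_t thread_id, uint8_t rset_ptr);
void puinfo_delete(uint16_t thread_id);
void send_load_return(uint32_t data);

void handle_context_info(enum context_info cmd, uint16_t thread_id,
                         uint8_t rset_ptr, uint32_t data)
{
    switch (cmd) {
    case CI_UPDATE_PUINFO: puinfo_update(thread_id, rset_ptr); break;
    case CI_STORE_DAT:     send_load_return(data);             break;
    case CI_DEL_PUINFO:    puinfo_delete(thread_id);           break;
    case CI_NONE:
    default:               break;                              /* nothing to do */
    }
}
```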

6.5 Memory Management Unit

The Memory Management Unit (MMU) is a hardware mechanism that handles memory access authorisation, virtual memory addressing and page translation. The system consists of a cache hierarchy, a pair of translation look-aside buffers, a multithreading access-control mechanism and address translation logic. The memory management unit incorporates the protection mode into the predefined memory domains. Its underlying mechanisms give priority to load data over store data to reduce the access latency of the memory hierarchy.

The following sections start with the memory hierarchy of MulTEP (§6.5.1). The predefined memory mapping is then explained (§6.5.2), followed by the address translation (§6.5.3).

6.5.1 Memory Hierarchy

The memory hierarchy of MulTEP is illustrated in Figure 6.21. Virtual addresses are translated and validated through the TLB before accessing the caches. This avoids the alias problem in which multiple virtual addresses refer to the same physical address (see Chapter 2). The L1 I-cache is a 16 kB direct-mapped unit with a 1 kB block size. It has a 5-cycle access latency and uses a least-priority28 but Not Last Used (NLU) replacement policy with read allocation.

Figure 6.21: The memory management unit.

The D-cache is 16 kB, 4-way set-associative, with a 128-byte block size and a write-back policy. The access latency is set to 5 cycles, which reflects current cache designs [140].

28 An 8-bit priority is associated with every cache line.


The replacement policy is least-priority but not last used. Main memory access latency is assumed to be 200 cycles in simulation.

6.5.2 Memory Address Mapping

MulTEP's virtual memory is segmented and reserved for different purposes, as presented in Figure 6.22. MulTEP requires additional specific areas for keeping track of multithreaded operations and for storing execution contexts in the form of activation frames. The areas in red are reserved for system accesses only and consist of the kernel system daemons, the address translation table, and a set of tables for multithreading (see §7.3.1). The instruction, data and stack areas are similar to those of other architectures [24, 26, 33, 34].

Figure 6.22: Virtual address segmentation and address mapping.

With the assigned mapping presented in Figure 6.22,29 a translation entry can be identified by a tag in a translation base address, as shown in Figure 6.23 (see §6.5.3). This division of the translation base address allows one tag to cover up to 2M page entries (2^21).

29 This physical memory layout is used in the initialised model of the simulator (see Chapter 8).

Figure 6.23: Translation Base Address — an 8-bit 0xFE prefix, a 21-bit tag and three zero bits.

The virtual address of an activation frame is shown in Figure 6.24. MulTEP supports up to 2^16 threads, each of which has 64 bytes for storing the thread's execution context. The mapping onto physical memory presented here is the initialised mapping. Further physical area can be added by dynamic address allocation, which creates more entries in the translation-table area and rearranges the mapped addresses of the physical memory.

Figure 6.24: Activation Frame Address — an 8-bit 0xFF prefix, a 16-bit thread ID (ThID), a 6-bit register ID (RegID) and two low-order bits.
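Decoding such an address is a pair of shift-and-mask operations following the bit layout of Figure 6.24. The field and function names below are illustrative only.

```c
#include <stdint.h>

/* Fields of an activation-frame address (Figure 6.24):
 * bits 31..24 = 0xFF, bits 23..8 = thread ID, bits 7..2 = register ID. */
typedef struct { uint16_t thread_id; uint8_t reg_id; } af_fields;

static af_fields decode_af_address(uint32_t addr)
{
    af_fields f;
    f.thread_id = (uint16_t)((addr >> 8) & 0xFFFFu);  /* bits 23..8 */
    f.reg_id    = (uint8_t)((addr >> 2) & 0x3Fu);     /* bits 7..2  */
    return f;
}

static uint32_t make_af_address(uint16_t thread_id, uint8_t reg_id)
{
    return 0xFF000000u | ((uint32_t)thread_id << 8) | ((uint32_t)(reg_id & 0x3Fu) << 2);
}
```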

6.5.3 Address Translation

A virtual address needs to be translated into a physical address. A translation look-aside buffer is used to cache recent virtual-to-physical address translations for rapid translation. In a multithreaded system, each translation entry needs to incorporate some thread information in order to eliminate conflicts caused by duplicated virtual addresses from multiple threads. A large amount of space for translation entries is therefore required, although some of the entries may never be used. To avoid wasting space, a multilevel paging scheme is selected for MulTEP to minimise the space required for address translation [24].

For the address translation process, the translation look-aside buffer is searched in parallel when the virtual address arrives (see Figure 6.25). A matched tag indicates the appropriate virtual-page translation and the protection for the incoming virtual address. A “hit” on the translation look-aside buffer occurs when the incoming tag field matches a virtual page and the incoming information is qualified. Otherwise, a “miss” on the translation look-aside buffer occurs, indicating that the translation information must be loaded from memory.

In Figure 6.25, an input load request consists of an operation mode30, a read/write flag, a 16-bit group ID and a 32-bit virtual address. The upper 20 bits of the virtual address are first matched against all tags of the DTLB in parallel. The protection bits (valid, sys_read, sys_write, user_read and user_write) are validated. The group ID is checked if and only if the private bit is equal to 1, indicating that the target page does not allow public access. If the physical page is available, it is concatenated with the remaining 12 bits of the address (the page offset) and sent as a physical address to the data cache.

30 Value 0 represents user mode; value 1 represents system mode.
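A software model of this hit check is sketched below. The DTLB size and the helper naming are assumptions; the hardware performs the tag comparison in parallel rather than with a loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* One DTLB entry, following the fields named in the text. */
typedef struct {
    uint32_t tag;          /* virtual page number (upper 20 bits) */
    bool     valid;
    bool     sys_read, sys_write, user_read, user_write;
    bool     private_page;
    uint16_t group_id;
    uint32_t physical_page;
} dtlb_entry;

#define DTLB_ENTRIES 32    /* illustrative size */
static dtlb_entry dtlb[DTLB_ENTRIES];

/* Returns true on a hit and writes the translated physical address. */
bool dtlb_translate(uint32_t vaddr, bool system_mode, bool is_write,
                    uint16_t group_id, uint32_t *paddr)
{
    uint32_t vpage  = vaddr >> 12;          /* upper 20 bits */
    uint32_t offset = vaddr & 0xFFFu;       /* lower 12 bits (page offset) */

    for (int i = 0; i < DTLB_ENTRIES; i++) {
        dtlb_entry *e = &dtlb[i];
        if (!e->valid || e->tag != vpage)
            continue;
        bool allowed = system_mode ? (is_write ? e->sys_write  : e->sys_read)
                                   : (is_write ? e->user_write : e->user_read);
        if (!allowed)
            continue;
        if (e->private_page && e->group_id != group_id)
            continue;                        /* group ID checked only for private pages */
        *paddr = (e->physical_page << 12) | offset;
        return true;
    }
    return false;                            /* miss: walk the translation table */
}
```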

Figure 6.25: A virtual address translation when the TLB hits.

A translation which misses in the translation look-aside buffer needs to be retrieved from main memory. Multilevel paging requires two accesses to the memory, as presented in Figure 6.26. For the first access, the 10-bit Tag 1 of the missed virtual address is used together with a 20-bit page base31 to retrieve the second-level page base address. For the second access, Tag 2, shifted left by 3 bits, is added to the second-level page base address. The result is a pointer to the required 8-byte translation entry. The physical page of this translation entry is then concatenated with the page offset, the lower section of the input virtual address, to construct the physical address. Simultaneously, the translation entry replaces a random but not-last-used entry in the TLB.

31 The 20-bit page base is provided in the page-base register of the memory management unit.
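The two memory accesses of the walk can be written out explicitly as below. This is a sketch under assumptions: `mem_read32` is an invented helper, and the exact layout of the 8-byte translation entry (here, the physical page in the first word) is not specified in the text.

```c
#include <stdint.h>

/* Assumed memory-read helper: fetch a 32-bit word from a physical address. */
uint32_t mem_read32(uint32_t paddr);

/* Two-level translation-table walk (Figure 6.26). The 20-bit page base
 * comes from the MMU's page-base register. */
uint32_t walk_translation_table(uint32_t vaddr, uint32_t page_base)
{
    uint32_t tag1   = (vaddr >> 22) & 0x3FFu;   /* upper 10 bits */
    uint32_t tag2   = (vaddr >> 12) & 0x3FFu;   /* middle 10 bits */
    uint32_t offset = vaddr & 0xFFFu;           /* page offset */

    /* First access: {page base, Tag 1, 00} selects the second-level page base. */
    uint32_t level2_base = mem_read32((page_base << 12) | (tag1 << 2));

    /* Second access: Tag 2 shifted left by 3 indexes an 8-byte translation entry. */
    uint32_t physical_page = mem_read32(level2_base + (tag2 << 3));

    return (physical_page << 12) | offset;
}
```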


Figure 6.26: A virtual address translation when the TLB misses.


Figure 6.27 illustrates the initialised translation table in the main memory of MulTEP. Entries in pink are reserved for the first-level page bases. Entries in yellow are reserved for the second-level virtual-to-physical address translations.

Figure 6.27: The initial state of a translation table in MulTEP.

6.6 Summary

The MulTEP hardware architecture supports a large number of threads based on a data-driven nanothread model. Priority-based switching and scheduling mechanisms are included for high-performance multithreading. Load and store instructions are used for thread synchronisation. These operations are supported by the following four hardware components:

1. The Processing Unit (PU)
The processing unit has two processing elements, each of which consists of four subpipelines (integer, FP multiplier, FP adder and FP divider). Both processing elements share four execution contexts in order to support thread-level parallelism at the hardware level.
MulTEP uses pre-fetching, pre-loading and colour-tagging mechanisms to switch threads without any context-switch overhead. Pre-fetching is provided by the collaboration of the level-0 instruction cache and the fetch sorted queue. Pre-loading is supported by the outside-PU context-switch mechanism, which allows a context to be available in advance. Colour-tagging is the association of a register-set index with each package that flows within the processing unit. With colour-tagging, a package can flow independently while maintaining a reference to its corresponding context.

2. The Multithreading Service Unit (MSU)
The multithreading service unit supports a flexible number of threads. Its underlying mechanisms consist of the activation-frame cache, the synchronisation circuit and the tagged up/down scheduler. This unit handles thread creation, thread synchronisation, thread scheduling and thread termination.

3. The Load-Store Unit (LSU)
The load-store unit reduces traffic to the processing unit. It enables faster loads and allows the processor to continue without stalling on load and store operations. The underlying architecture supports transfers of signed and unsigned bytes, halfwords and words in both big-endian and little-endian data formats.

4. The Memory Management Unit (MMU)
The memory management unit supports virtual memory addressing and memory protection. The memory hierarchy consists of two separate TLBs and caches for instructions and data.

Chapter 7

Software Support

Daring ideas are like chessmen moved forward; they may be defeated, but they start a winning game.

Johann Wolfgang von Goethe

7.1 Introduction

Two software tools (i.e. the j2n and the MulTEP assembler) were written to facilitate multithreaded program creation for MulTEP. These tools support a number of thread standards and are adequate for evaluating the MulTEP architecture. An overview of software support for MulTEP is displayed in Figure 7.1. The remainder of this chapter is structured as follows. Section 7.2 presents MulTEP low-level software support (as categorised in Figure 7.1). Section 7.3 shows MulTEP kernel-level support. Section 7.4 describes how some high-level languages make use of the lower level support along with suggestions for further implementation. Section 7.5 summarises the software tools designed for the MulTEP system.

7.2 Low-level Software Support

This section introduces two MulTEP low-level software tools. The first tool is the MulTEP assembler (§7.2.1), which is required to assemble programs for MulTEP's instruction-set architecture. The second tool is a set of MulTEP assembly macros (§7.2.2), which allow software to make easy use of MulTEP's multithreading features.


Figure 7.1: Software overview for MulTEP — high-level programs (Java via javac and j2n; C++ and other languages via a native compiler or MIPS cross-compiler), kernel-level multithreading support and daemons, and the multithreaded MIPS assembler producing objects for the hardware level.


7.2.1 MulTEP Assembler

In MulTEP, multithreaded operations are represented in assembly as a set of referable macros. A MulTEP assembler was built to support MIPS IV and the multithreading instruction extensions (i.e. the spawn, switch, wait and stop instructions). The assembler was implemented in C. There are two phases to the assembly process:

1. Code and data segments are separated. Instructions are re-ordered after a branch (i.e. a single delay slot is filled after each branch instruction). All labels are assigned their correct absolute addresses.

2. All label references are replaced with absolute addresses. Opcodes and operands are encoded into a series of 32-bit instructions.

The assembler parses MulTEP instructions (see §5.3.3) based on a Backus-Naur Form (BNF) grammar, as shown in Figure 7.2 and Appendix D.
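The label handling across the two phases follows the usual two-pass pattern; a minimal C sketch (with an invented symbol-table structure, not the assembler's actual code) is:

```c
#include <stdint.h>
#include <string.h>

#define MAX_LABELS 256

typedef struct { char name[32]; uint32_t address; } label;

static label symtab[MAX_LABELS];
static int   nlabels;

/* Pass 1: record each label with its absolute address. */
void record_label(const char *name, uint32_t address)
{
    if (nlabels < MAX_LABELS) {
        strncpy(symtab[nlabels].name, name, sizeof symtab[nlabels].name - 1);
        symtab[nlabels].address = address;
        nlabels++;
    }
}

/* Pass 2: replace a label operand with its absolute address. */
uint32_t resolve_label(const char *name)
{
    for (int i = 0; i < nlabels; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].address;
    return 0;   /* unknown label; a real assembler would report an error */
}
```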
