MulTEP: A MultiThreaded Embedded Processor


MulTEP: A MultiThreaded Embedded Processor
UCAM-CL-TR-588, Technical Report Number 588, ISSN 1476-2986
University of Cambridge, Computer Laboratory
15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom
phone +44 1223 763500, http://www.cl.cam.ac.uk/

Panit Watcharawitch, May 2004
© 2004 Panit Watcharawitch

This technical report is based on a dissertation submitted in November 2003 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Newnham College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/TechReports/

Summary

Conventional embedded microprocessors have traditionally followed in the footsteps of high-end processor design to achieve high performance. Their underlying architectures prioritise tasks by time-critical interrupts and rely on software to perform scheduling. Single-threaded execution depends on instruction-based probabilistic techniques, such as speculative execution and branch prediction, which might be unsuitable for embedded systems where real-time performance guarantees need to be met [1, 2].

Multithreading appears to be a feasible solution with potential for embedded processors [3]. The multithreaded model combines the sequential characteristics of control-flow with the concurrency characteristics of data-flow [4]. Thread-level parallelism has the potential to overcome the limitations of insufficient instruction-level parallelism and to hide increasing memory latencies. Earlier empirical studies of multithreaded processors [3, 4, 5, 6, 7] demonstrated that exploiting thread-level concurrency not only offers latency tolerance but also provides predictable performance gains.

The MultiThreaded Embedded Processor (MulTEP) is designed to provide high-performance thread-level parallelism, real-time characteristics, a flexible number of threads and a low incremental cost per thread for embedded systems. In its architecture, a matching-store synchronisation mechanism [4] allows a thread to wait for multiple data items. A tagged up/down dynamic-priority hardware scheduler [8] is provided for real-time scheduling. Pre-loading, pre-fetching and colour-tagging techniques are implemented to allow context switches without overhead. The architecture supports four additional multithreading instructions (spawn, wait, switch and stop) with which programmers and advanced compilers can create programs with low-overhead multithreaded operations.

Experimental results demonstrate that multithreading can be used effectively to improve performance and system utilisation. Long-latency operations that would otherwise stall the pipeline are hidden by the execution of other threads, and the hardware scheduler provides priority scheduling suitable for real-time embedded applications.
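
Since the report's own worked example (Livermore Loop 7, Appendix B) is written in Java, a short behavioural sketch in Java may help fix the matching-store idea: a consumer thread stays blocked until every data item it awaits has been deposited, much as a MulTEP thread remains non-runnable until its matching store fires. Everything below is an illustrative assumption — the class and method names are invented for this sketch, and a Java latch stands in for the hardware presence flags.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;

    // Behavioural sketch only, not the MulTEP ISA: a latch models the
    // presence flags that make a thread runnable once all data arrive.
    final class MatchingStore {
        private final CountDownLatch presence;                 // one count per awaited item
        private final Map<String, Long> slots = new ConcurrentHashMap<>();

        MatchingStore(int expectedItems) {
            presence = new CountDownLatch(expectedItems);
        }

        // A producer deposits one data item; the consumer's presence count drops.
        void deposit(String slot, long value) {
            slots.put(slot, value);
            presence.countDown();
        }

        // The consumer blocks until *every* awaited item has arrived — the
        // analogue of staying non-runnable until the matching store fires.
        long await(String slot) throws InterruptedException {
            presence.await();
            return slots.get(slot);
        }
    }

    public class MatchingStoreDemo {
        public static void main(String[] args) throws InterruptedException {
            MatchingStore ms = new MatchingStore(2);              // wait for two items
            new Thread(() -> ms.deposit("a", 3)).start();         // "spawn" producer 1
            new Thread(() -> ms.deposit("b", 4)).start();         // "spawn" producer 2
            System.out.println(ms.await("a") + ms.await("b"));    // prints 7 once both arrive
        }
    }

The spawn/stop pair maps loosely onto Thread.start() and thread termination here; the real instructions, of course, avoid the software overhead this sketch incurs.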

Contents

List of Figures
List of Tables

I Background

1 Introduction
  1.1 Prelude
  1.2 Research Motivation
    1.2.1 Inspiration for Targeting the Embedded Processor Market
    1.2.2 Inspiration for Multithreaded Execution
  1.3 Research Aims
    1.3.1 Challenges
    1.3.2 Contributions
  1.4 Dissertation Outline

2 Architectural Level Parallelism
  2.1 Introduction
  2.2 Control-flow Architectures
    2.2.1 Instruction Encoding
    2.2.2 Instruction Level Parallelism
    2.2.3 Thread Level Parallelism
  2.3 Data-flow Architectures
    2.3.1 Static Data Flow
    2.3.2 Dynamic Data Flow
    2.3.3 Explicit Token-store Data Flow
  2.4 Memory System
    2.4.1 Memory Hierarchy
    2.4.2 Virtual Memory
    2.4.3 Protection Mechanism
  2.5 Summary

3 Multithreaded Processors
  3.1 Introduction
  3.2 Multithreaded Architectures
    3.2.1 Context-switching Mechanisms
    3.2.2 Multiple Execution Contexts
    3.2.3 Concurrency Control
    3.2.4 Multiple-issue Models
  3.3 Support Environment
    3.3.1 Memory Management
    3.3.2 Software Support
  3.4 Efficiency Factors
    3.4.1 Design Trade-offs
    3.4.2 Processor Efficiency
    3.4.3 Efficiency Distribution
    3.4.4 Cost-effectiveness Model
  3.5 Summary

4 Embedded Systems Design
  4.1 Introduction
  4.2 Design Constraints and Solutions
    4.2.1 Real-time Response
    4.2.2 Low Power Consumption
    4.2.3 Low Cost and Compact Size
    4.2.4 Application Specifications
  4.3 Techniques in Embedded Architectures
    4.3.1 Specially Compact Instruction Encoding
    4.3.2 Predicated Instructions
    4.3.3 Subword-level Manipulation
    4.3.4 Thread Level Parallelism Support
  4.4 Summary

II Architectural Design

5 System Overview
  5.1 Introduction
  5.2 Background to the MulTEP Architecture
    5.2.1 Objectives
    5.2.2 Project Challenges
    5.2.3 Design Choices
  5.3 MulTEP System
    5.3.1 Overall Picture
    5.3.2 Execution Context
    5.3.3 Instruction Set Architecture
    5.3.4 Thread Life Cycle
    5.3.5 Synchronisation Technique
    5.3.6 Scheduling Technique
    5.3.7 Memory Protection
    5.3.8 Power Operating Modes
  5.4 Architectural Theoretical Investigation
    5.4.1 The Utilisation of the Processing Elements
    5.4.2 Incremental Cost per Thread
  5.5 Summary

6 Hardware Architecture
  6.1 Introduction
  6.2 Processing Unit
    6.2.1 Level-0 Instruction Cache
    6.2.2 Fetching Mechanism
    6.2.3 Context-switch Decoder
    6.2.4 Execution Units
    6.2.5 Colour-tagged Write Back
  6.3 Multithreading Service Unit
    6.3.1 Request Arbiter
    6.3.2 Spawning Mechanism
    6.3.3 Synchronising Mechanism
    6.3.4 Switching Mechanism
    6.3.5 Stopping Mechanism
  6.4 Load-Store Unit
    6.4.1 Load/Store Queues
    6.4.2 Load Operations
    6.4.3 Store Operations
    6.4.4 Multithreaded Operations
  6.5 Memory Management Unit
    6.5.1 Memory Hierarchy
    6.5.2 Memory Address Mapping
    6.5.3 Address Translation
  6.6 Summary

7 Software Support
  7.1 Introduction
  7.2 Low-level Software Support
    7.2.1 MulTEP Assembler
    7.2.2 MulTEP Assembly Macros
  7.3 Support for System Kernel
    7.3.1 System Daemon
    7.3.2 Exception/Interrupt Daemons
    7.3.3 Non-runnable States
  7.4 Support for High-level Languages
    7.4.1 Real-time Embedded Languages
    7.4.2 Java-to-native Post-compiler
    7.4.3 Native Compiler Implementation
  7.5 Summary

III Evaluation and Conclusions

8 Evaluation and Results
  8.1 Introduction
  8.2 Simulating the MulTEP Architecture
    8.2.1 Pre-simulation Strategies
    8.2.2 The MulTEP Simulator
    8.2.3 MulTEP Simulation Analysis
  8.3 MulTEP Simulation Results
    8.3.1 Selected Benchmarks
    8.3.2 Processor Performance
    8.3.3 Efficiency of Multithreading Mechanisms
    8.3.4 Real-Time Response
    8.3.5 Side-effect Evaluation
  8.4 Summary

9 Conclusions and Future Directions
  9.1 Introduction
  9.2 Overview of MulTEP
  9.3 Key Contributions
  9.4 Related Work
  9.5 Future Directions
    9.5.1 Multi-ported Priority Queue
    9.5.2 Further Enhancements to the L0-cache
    9.5.3 Thread-0 Daemon Performance Improvements
    9.5.4 Processor Scalability
    9.5.5 Cache Conflict Resolution
    9.5.6 Application Specific Processor Refinement
    9.5.7 Software Support

A Tagged Up/Down Sorter
  A.1 The Verilog HDL source code
  A.2 The simulation source code
    A.2.1 tgsort.hh
    A.2.2 tgsort.cc

B Livermore Loop 7
  B.1 The single-thread Java code
  B.2 The multithreaded Java code
  B.3 The 14-thread assembly code

C The Simulator
  C.1 The MulTEP configuration
  C.2 The SimpleScalar configuration
  C.3 The Thread-0 System Daemon

D Additional BNF for MulTEP assembler

Bibliography

Index

List of Figures

Chapter 1
  1.1 Shipment trends of general and embedded processors.
  1.2 The increasing trend of the processor-memory performance gap.

Chapter 2
  2.1 A control-flow architecture and a data-flow architecture.
  2.2 Instruction encoding of representative CISC and RISC.
  …
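
Appendix A lists the Verilog HDL source for the tagged up/down sorter behind the dynamic-priority hardware scheduler mentioned in the summary. As a rough software stand-in (an illustrative assumption, not the report's design: the hardware structure sorts in place and hands out the top element in constant time), a heap-backed priority queue reproduces the dispatch order such a scheduler enforces:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Software stand-in only: the real MulTEP scheduler is a hardware
    // tagged up/down sorter (Appendix A); this heap merely reproduces its
    // dispatch order, not its constant-time behaviour. Names are invented.
    final class ReadyThread {
        final int id;
        final int priority;        // larger value = more urgent

        ReadyThread(int id, int priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    public class SchedulerSketch {
        public static void main(String[] args) {
            PriorityQueue<ReadyThread> readyQueue = new PriorityQueue<>(
                    Comparator.comparingInt((ReadyThread t) -> t.priority).reversed());

            readyQueue.add(new ReadyThread(1, 10));   // background work
            readyQueue.add(new ReadyThread(2, 99));   // time-critical handler
            readyQueue.add(new ReadyThread(3, 50));   // ordinary thread

            // Always dispatch the most urgent runnable thread: 2, then 3, then 1.
            while (!readyQueue.isEmpty()) {
                ReadyThread next = readyQueue.poll();
                System.out.println("dispatch thread " + next.id
                                   + " (priority " + next.priority + ")");
            }
        }
    }

A multi-ported version of exactly this queue structure is what Section 9.5.1 flags as future work.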