INTEROPERABILITY BETWEEN OPERATING

SYSTEMS AND MICROPROCESSORS ON

EMBEDDED PLATFORMS

by

MONG TEE SIM

B.S. DeVry University, Pomona, 2002

A thesis submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Master of Science

Department of Electrical and Computer Engineering

2017

© 2017 Mong Tee Sim. All Rights Reserved.

This thesis for the Master of Science degree by

Mong Tee Sim

has been approved for the

Department of Electrical and Computer Engineering

by

Darshika G. Perera, Chair

T. S. Kalkur

Charlie Wang

Date: 20th April 2017


Sim, Mong Tee (M.S., Electrical Engineering)

Interoperability between Operating Systems and Microprocessors on Embedded Platforms

Thesis directed by Professor Darshika G. Perera.

ABSTRACT

Software designs and programming are limited by the hardware resources of the system, including the memory capacity and the CPU clock frequency. This has opened up research and investigation into efficient computing systems and architectures. However, increasing the memory capacity and optimizing the CPU clock frequency alone will not be sufficient to fulfill today’s complex software algorithmic requirements. Therefore, it is imperative to incorporate some hardware-based algorithms and designs to satisfy the requirements of the software-based algorithms, and also to address the constraints associated with the existing computing systems.

In many cases, software algorithms are designed with no or minimal software optimizations, provided that these software designs meet the specifications. Software engineers and programmers typically depend on the compilers to perform the necessary code optimizations. Although the current compilers have often satisfied the requirements of software algorithm designs, these compilers have no knowledge of some software constructs, including logical flow control and spin-waits for resources, which leads to wasted CPU time.

Within semiconductor devices, hardware has also reached its limitations, such as the power wall, the process geometry, and the parametric latencies. As a result, the choice of adding more memory and increasing the CPU clock frequency will soon be obsolete.


If we hastily modify the way microprocessors operate, most software, if not all, will break. In order to modify the architecture of the microprocessor, a detailed study of hardware and software interoperability is crucial. In this research work, our main objective is to investigate the various issues and constraints associated with the interoperability between the operating systems (OSes) and the microprocessors on embedded platforms.

Based on our extensive analyses, we design a novel and efficient five-virtual-core Pipelined Barrel Processor (PBP) that does not have control and data hazards. The PBP also addresses some of the issues and constraints associated with the interoperability between the OSes and the microprocessors on embedded platforms.


DEDICATION

To my late Mother, Madam Choo (Pearl), I dedicate this thesis to you for your love and for raising and nurturing my two older brothers, my four older sisters, and me single-handedly. Mother left us on Wednesday, 25th February 2015, to join my late Father in a better place.

Mother, I love you. You will always live in my heart and memory, your youngest son always.

To my Wife, Swee Yee Lee, I dedicate this Thesis to you for your mental and physical support from my undergraduate days until now, and for raising and educating our two children day and night without a sign of resentment. I am forever in your debt.

To my two Children, my daughter, JieZhou Melody Sim, and my son, Monte Patrick Sim, this Thesis is a Father setting a good example for his children: learning and knowledge are timeless. Knowledge not only enriches life, it also enriches the soul.


ACKNOWLEDGEMENTS

One of my Professors at the University of Texas, Austin once said, "Nice to read but not original, original but not nice to read.” I sincerely appreciate my Professor, Dr. Darshika G. Perera, for spending many hours of her family time bashing through the jungle of my text trying to make my Thesis more readable. Although no word could describe her dedication to her profession and her keen sense to see her students succeed, I must say her conduct is exemplary and admirable. Thank you, Professor Darshika! Besides being passionate about her job, she has vast knowledge in the field of Electrical Engineering, and especially in the field of Data Mining.

Dr. T. S. Kalkur is our ECE chairman. In his busy daily schedule between the University and his family, he still finds the time to educate the younger generation and to elevate the standard of our ECE students. I have attended one of his classes and one social event with him. As a Professor, Dr. Kalkur is brilliant at his trade, and at the social event he was a friendly gentleman. Thank you for taking the time to be my thesis committee member.

Dr. Charlie Wang is my Professor for the Computer Architecture and Design class. He provided his students with some of the most useful and informative lecture slides of all the schools I have ever attended. His lectures were systematic, concise, and clear. He did not just give us lectures; he provided us the tools to solve real problems. I am proud to say that I have experienced this benefit recently at my job. What makes Dr. Wang admirable, besides his teaching techniques, is his passion as an educator: nothing can stop him from coming to UCCS to teach and see his students succeed. Thank you, Dr. Wang, for being my Professor and my thesis committee member.

Eva Wynhorst is our ECE program assistant. She works tirelessly to help the UCCS ECE students. She even goes beyond her normal responsibilities to help the students get their paperwork done. I am one of the students who has benefited from her generosity. Thank you, Eva, for being a wonderful person.


TABLE OF CONTENTS

CHAPTER

I. INTRODUCTION ...... 1

1.1 Our Objectives ...... 4

1.2 Thesis Organization ...... 4

II. ANALYSIS OF OPERATING SYSTEMS ...... 6

2.1 Super Loop ...... 6

2.2 Operating System Task Terminologies...... 7

2.3 Cooperative Operating System ...... 9

2.4 Real-time Operating System ...... 16

2.4.1 Hybrid Real-time Operating Systems ...... 17

2.4.2 True Real-time Operating System ...... 22

2.5 Create Task ...... 23

2.6 Context Switching ...... 24

2.7 Existing Literature ...... 27

III. ANALYSIS OF SCHEDULING METHODS ...... 28

3.1 Round Robin Scheduling Method ...... 28

3.2 Group Based Scheduling Method ...... 29

3.3 Threshold Based Scheduling Method ...... 30

3.4 Fixed Priority Based Scheduling Method ...... 31

3.5 Fixed Priority Lookup Table Method ...... 32

3.6 Time Tick ISR...... 34

3.7 Software and Hardware Semaphores ...... 35

3.8 Key Features of Operating Systems ...... 36

3.9 Compiler ...... 37

3.10 Existing Literature ...... 39


IV. MICROPROCESSOR ...... 41

4.1 ...... 43

4.2 Analysis of Microprocessors ...... 43

4.3 Constraint on Operating Systems and Microprocessors ...... 46

4.4 Single-Cycle Microprocessor ...... 49

4.5 Pipelined Microprocessor ...... 49

4.6 Coarse-grained Multitasking ...... 52

4.7 Fine-Grained Multitasking ...... 54

4.8 The Barrel Microprocessor ...... 56

4.9 Multiple Cores Microprocessors ...... 58

4.10 Semaphore ...... 59

4.11 Existing Literature ...... 60

4.11.1 New ISA ...... 60

4.11.2 Debugging...... 61

4.11.3 Interrupts and Exceptions ...... 61

V. A NOVEL FIVE-VIRTUAL-CORE PIPELINED BARREL PROCESSOR...... 63

5.1 The New Microprocessor Architecture ...... 63

5.2 PBP Stages ...... 64

5.2.1 PBP Generator (PCGEN) ...... 64

5.2.2 PBP Stage, IF ...... 65

5.2.3 PBP Stage, ID ...... 65

5.2.4 PBP Stage, EXE, MEM and WB ...... 65

5.3 PBP Proof-of-Concept ...... 66

5.3.1 PBP’s Memory Map ...... 66

5.3.2 PBP Power-on-Reset ...... 67

5.3.3 PBP’s 5 Programs ...... 67


5.3.4 PBP ModelSim Simulation Waveforms ...... 68

5.3.5 Decoding PBP Waveforms ...... 73

5.3.6 No Control and Data Hazards, Proven ...... 76

5.4 Compare and Contrast ...... 76

5.4.1 PBP versus PLP ...... 76

5.4.2 PBP versus SCP...... 77

5.4.3 PBP versus MCP ...... 78

5.4.4 PBP with RTOS ...... 78

VI. CONCLUSION AND FUTURE WORK ...... 80

6.1 Conclusions ...... 80

6.2 Future Work ...... 81

BIBLIOGRAPHY ...... 83


TABLE OF FIGURES

FIGURE

1: Cooperative Operating System Architecture ...... 11

2: Multi-Core Cooperative Operating System Event Diagram ...... 15

3: Hybrid Real-time Operating System Event Diagram ...... 19

4: Context Switching ...... 25

5: Dhrystone 2.0 Benchmark Performance Chart ...... 37

6: A Typical Microprocessor Block Diagram ...... 42

7: Coarse-grained Multitasking ...... 52

8: Fine-grained Multitasking ...... 54

9: A Pipelined Microprocessor Structure ...... 56

10: A Novel and Efficient Five-Virtual-Core Pipelined Barrel Processor ...... 65

11: PBP ModelSim Simulation Waveforms 01 ...... 70

12: PBP ModelSim Simulation Waveforms 02 ...... 71

13: PBP ModelSim Simulation Waveforms 03 ...... 72

14: PBP ModelSim Simulation Waveforms 04 ...... 73


TABLE OF TABLES

TABLE

1: Key Features of Operating Systems ...... 37

2: Pipeline Content during an Interrupt ...... 51

3: PBP Vector Table of Reset Vectors ...... 66

4: PBP Code Fetching Sequence at IF Stage ...... 74

5: PBP Virtual Cores Sequencing ...... 75

6: Microprocessor Configuration ...... 76

7: PBP versus PLP ...... 77

8: PBP versus SCP ...... 78

9: PBP versus MCP ...... 78


TABLE OF CODES

CODE

1: Cooperative Operating System Task Template ...... 10

2: Proof-of-Concept Assembly Program ...... 68


CHAPTER I

INTRODUCTION

The advancement of semiconductor technologies has led to the proliferation of state-of-the-art system-on-chips (SoCs) consisting of millions of transistors on a single die. The microcontrollers/microprocessors are the most noticeable SoCs that leverage these technologies. The current state-of-the-art microcontrollers not only consist of many hardware intellectual properties (IPs) but also of various protocols governing how the IPs should function and communicate. The higher the number of hardware IPs in a microcontroller, the more difficult it is to manage these IPs.

Furthermore, due to the increasing demands for different microcontrollers with various features, and reduced time-to-market, the Super Loop programming model can no longer handle the timing requirements of tasks and multiple IPs in the microcontroller. As a result, operating systems (OSes), including cooperative, preemptive, and mixed, were introduced to replace the Super Loop. These OSes maintain the readiness of the system by providing kernel services, including thread scheduling, event threads, low-level device drivers that control the data transactions of the embedded hardware IPs, and system traps to ensure system operational integrity. Conversely, these features of the operating systems pose a different set of issues to the microcontrollers/microprocessors.

The operating systems, especially Real-Time Operating Systems (RTOS), introduce intrusive methods to schedule the threads, and services that require more memory to function, while consuming thousands of CPU cycles from the microcontrollers. The microcontrollers improve the system performance by integrating more memory and increasing the CPU clock frequency to satisfy the requirements of the operating systems.

This leads to more power-hungry computing systems. Thus, it is essential to investigate and design novel techniques and architectures (for the next generation of systems) for the operating systems and the microcontrollers to overcome these issues. To facilitate this quest, in this research work, we carried out an extensive survey of the interoperability between embedded OSes and embedded microcontrollers/microprocessors.

Since the development and first use of OSes in the 1950s, the idea has been to loosely couple the OS with the processor to ensure high portability. In the early days, operating systems were typically simple Loaders. Over the years, this simple concept has transformed into a very complex system consisting of millions of lines of code. The modern OSes consist of features including the graphical user interface (GUI) that enhances the user experience with a click of a mouse, data storage devices that are capable of storing gigabytes of data, and so on. These software capabilities are the results of the research and development of software compilers and software algorithms.

Although there is an increase in software-based research, the six-decade-old basic idea of the OS still remains.

With the advancement of semiconductor technologies and the invention of the transistor, a square millimeter of a wafer die can contain millions of transistors.

This semiconductor evolution enabled the microcontrollers/microprocessors as well as many other SoCs to provide more functional capabilities, including communication for data exchange, volatile memory for temporary data storage, non-volatile memory for permanent data storage, video and audio for entertainment, and general-purpose inputs and outputs to control actuators. The microcontrollers/microprocessors carry out these tasks either with or without human intervention. The operating systems also provide methods to conceal the complexity from the application programmer through the use of kernel services such as the device drivers, event threads, and user-level threads. In the early days, these operating systems existed on mainframes and then on desktop computers. Today, OSes have found their way into the embedded systems domain. The use of OSes in embedded applications is gradually becoming a requirement for most mobile and handheld devices, and also for Internet of Things (IoT) devices. In order to function, the real-time operating systems (used in embedded systems) require hardware interrupts and context switching. Without these features, the real-time operating kernel services, event threads, and device drivers will not function, thus rendering the RTOS inoperable. The interrupts and context switching disrupt the basic design flow of the microprocessors, which is to execute instructions without pipeline stalls or misses. The introduction of the RTOS disrupts the instructional flow of the microcontroller, since the RTOS kernel services and other OS functions take thousands of CPU cycles, which hinders the performance of the microcontrollers/microprocessors.

Many SoCs take advantage of the advancement of semiconductor technologies by replacing time-consuming temporal software functions with spatial hardware IPs. These changes allow the microcontrollers to focus on managing the flow control of the system. However, both the OS and the microcontroller have so far ignored this very synergetic example, such as implementing OS functions as hardware IP blocks. Obviously, there is no technical reason that the real-time operating systems and the microcontrollers cannot work seamlessly together. This is almost like a religious debate, which no one can win, on whether the real-time operating system or the microcontroller is the offender that causes the deficiency in system performance. Evidently, the reluctance to change in this case is clear.

1.1 Our Objectives

The main objective of this research work is to provide a low-level understanding of how the operating systems and microcontrollers work, while identifying the operational conflicts between them that cause the system to underperform.

In order to achieve our goal, several analyses of different types of operating systems and microprocessors are carried out. The analysis of operating systems focuses on which operating system features affect the microcontroller’s operation, while the analysis of microcontrollers focuses on which functional blocks of the microcontroller are affected by the operating system.

Based on these analyses, we design and implement a novel and efficient five-virtual-core Pipelined Barrel Processor (PBP) as a case study to overcome some of the issues and constraints associated with the interoperability between the OSes and the microprocessors on embedded platforms.

1.2 Thesis Organization

This Thesis is organized in six chapters. In chapter one, we provide a brief introduction to the different types of operating systems. In chapter two, we examine the internals of the operating systems and how they function. We examine the Cooperative Operating System, the Hybrid Real-Time Operating System, and the True Real-Time Operating System.

In chapter three, we examine how different operating systems schedule their threads and how these scheduling methods function. We also examine how scheduling can be done deterministically using a lookup table.

In chapter four, we examine the different types of microprocessors: the single-core microcontroller, the multiple-core microcontroller, the single-cycle processor, the multi-cycle processor, the pipelined processor, and the barrel processor. We also discuss coarse-grained and fine-grained multitasking and what benefits they bring to the system.

In chapter five, we design, analyze, and compare and contrast a novel and efficient five-virtual-core pipelined barrel processor (PBP). We also provide simulation results and data for comparison between the PBP and the different microprocessors.

In chapter six, we conclude and summarize our work. We also provide recommendations to designers to overcome the constraints, and discuss potential future work to enable real-time operating systems and microcontrollers to operate seamlessly without retarding each other.

CHAPTER II

ANALYSIS OF OPERATING SYSTEMS

In this chapter, we analyze, discuss, and present various types of existing operating systems (OSes), including Cooperative, Hybrid, and Real-time OSes. We also present the notion of the “Super Loop” in order to clarify some of the concepts in operating systems.

It should be noted that, in this research work, our intention is to identify the pitfalls of the embedded operating systems and the microcontrollers, and how these pitfalls affect each other; it is not about improving any operating system programmatically.

2.1 Super Loop

“Super Loop” is one of the most common programming architectures, which provides an open model for programming. This concept has enabled the development of many electronic devices such as Internet of Things (IoT) devices, medical devices, and engine controllers.

A super loop comprises an infinite loop, where all the tasks of the system are contained in that loop. All the system tasks in a super loop may be arranged in a sequential order, and/or execution may branch out of the loop via a sub-routine call to perform a certain task and then return to the loop after the sub-routine completes.
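As a concrete illustration, a minimal super loop in C might look like the sketch below; the task routines and the commented-out hardware initialization are hypothetical placeholders, not taken from the thesis.

/* A minimal super loop: every task lives inside one infinite loop. */
static void task_read_sensors(void)   { /* poll input ports */ }
static void task_update_control(void) { /* compute actuator outputs */ }
static void task_service_comms(void)  { /* handle communication traffic */ }

int main(void)
{
    /* hw_init();  one-time hardware and software initialization */

    for (;;) {                     /* the infinite "super loop" */
        task_read_sensors();       /* tasks run in sequential order */
        task_update_control();
        task_service_comms();      /* each sub-routine returns to the loop */
    }
}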

The super loop architecture provides an easy-to-predict programming flow control. For instance, in this case, the programming flow control can be done by executing in sequential order, in a conditional branch manner, or in a non-conditional branch manner. The latter two flow controls are ideal candidates for hardware support, since they use branch target predictors, which mitigate the branching latency. In addition to hardware branch predictors, providing hardware support for other technologies such as instruction caches and instruction FIFOs can further enhance the performance of the super loop programming architecture.

Although it is easy to program with the super loop model, it is very difficult to control the system timing. For instance, the overall system timing changes for every line of code added to the system. Therefore, this model is not an ideal candidate for performing time-critical tasks. In order to overcome this shortcoming, programmers typically utilize hardware interrupts to achieve time-critical requirements. Hardware interrupts are one of the most efficient methods; however, there are some penalties to pay for using this method, including additional software overhead, processing time, and so on.

2.2 Operating System Task Terminologies

Computer and operating system terminologies are often interpreted differently by different people, and are typically context dependent. To facilitate discussions in this research work, the following definitions are used [2] [3]:

1) Task, also known as a thread, is similar to a main program, which executes until it finishes (in a cooperative OS) or until it is preempted by the scheduler (in a RTOS).

Tasks can usually suspend themselves while waiting for an event and later regain CPU time. In a single-core system, tasks are executed pseudo-concurrently, sharing the same CPU; whereas in a multi-core system, tasks are executed concurrently in a parallel-processing fashion. Tasks can be used to structure a program in modular pieces, thus enabling multiple team members to develop an application with a shared address space and making the application easier to manage.

A task can be in one of the following states:

• READY: means the task is ready to run, without waiting for any event or requiring anything else, and is not currently executing instructions.

• RUNNING: means the task is executing instructions.

• SLEEP: means the task will not be ready until the required sleep tick count expires.

• SUSPENDED: means the task is not ready to run because it is waiting for an event to happen.

• TERMINATED: means the task is not eligible to run.
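For illustration, these states could be captured in C as a simple enumeration; the names below are illustrative, not prescribed by any particular OS.

/* Task states as defined above. */
typedef enum {
    TASK_READY,      /* ready to run, not waiting on anything */
    TASK_RUNNING,    /* currently executing instructions */
    TASK_SLEEP,      /* waiting for its sleep tick count to expire */
    TASK_SUSPENDED,  /* waiting for an event to happen */
    TASK_TERMINATED  /* no longer eligible to run */
} task_state_t;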

2) Scheduling: Tasks are typically assigned a priority if they are not executed in a round robin fashion. A priority indicates the relative importance of a task, and the order in which the tasks are granted access to the CPU if more than one task is ready to run.

Various OSes assign priorities differently. In this research work, the priority of a task corresponds to the number assigned to it, i.e., the higher the number, the higher the priority of the task. In this case, the assigned numbers (i.e., priorities) are unsigned integer values, with zero (0) being the lowest, assigned to the Idle task. The priorities often can be changed dynamically, or can be grouped or uniquely fixed. The Idle task is executed if there are no tasks in the READY state. The Idle task is also used to calculate the CPU usage.

2.3 Cooperative Operating System

The Cooperative Operating System (OS) has some similarities to the Super Loop architecture. The main loop of the Cooperative OS is a super loop, and it inherits all the features, advantages, and disadvantages of the super loop model. However, this OS differs from a structured super loop: in this case, the threads are executed in time slices in a non-deterministic way. In addition, this OS provides event threads that handle hardware events such as interrupts for input and output ports, communication interfaces, etc.

Cooperative OSes are thread-safe since all the threads operate on the same stack and each thread runs to its completion. That is, each thread must relinquish its CPU time and return to the scheduler before the next thread can gain processing time to perform its task.

The Cooperative OS has a different thread structure than other OSes’. Its thread is structured like a state machine. As shown in Code 1, this thread structure allows part of the task to be done without degrading the integrity of the full task, and allows the task to resume from the point where it left off when it regains processing time. It also provides a mechanism for the thread to cede its processing time in a timely manner.

Similar to the Super Loop architecture, a cooperative OS is non-deterministic and must use interrupt sub-routines for its time-critical operations. However, a cooperative OS provides a more disciplined way of programming than the super loop model, and it also allows more precise locality for debugging code.

In this section, we analyze and present the architecture of the cooperative OS. This will give us some insight into this OS: how it operates, what the associated overheads are, and so on. Consequently, this information will be used when analyzing the embedded microcontroller in Chapter 4.

Code 1: Cooperative Operating System Task Template
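The body of Code 1 did not survive the document conversion. A minimal sketch of such a state-machine-structured task in C, under the assumption of a switch-based layout, might look like the following; the function and state names are illustrative.

/* A cooperative task structured as a state machine. Each call performs
 * one slice of work and then returns, ceding the CPU to the scheduler. */
void app_task(void)
{
    static int state = 0;   /* retained across calls; resumes where it left off */

    switch (state) {
    case 0:
        /* ... first portion of the work ... */
        state = 1;          /* next call resumes at case 1 */
        break;
    case 1:
        /* ... second portion of the work ... */
        state = 2;
        break;
    default:
        /* ... final portion; start over on the next call ... */
        state = 0;
        break;
    }
    /* returning here relinquishes the CPU in a timely manner */
}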

A cooperative operating system (OS) is fairly easy to construct because it resembles the super loop architecture. The difference in this case is that all the tasks (known as co-routines) in a cooperative OS have to work together without monopolizing the CPU and depriving other tasks of their share of the CPU. The running task must voluntarily relinquish the CPU for other tasks to run.

A cooperative OS may consist of application tasks (co-routines), event tasks, and other operating system primitives, which allow the tasks to access system resources through a form of application programming interface (API). The cooperative OS architecture is illustrated in Figure 1. Although the architecture presented in Figure 1 is not associated with any existing commercial cooperative OS, it provides a good representation of currently available cooperative OSes.


Figure 1: Cooperative Operating System Architecture

Label 1 (in Figure 1), the Hardware, refers to the embedded microcontroller on which this cooperative OS is executed. Label 2 consists of three system components: Time Tick, I/O, and Comm. Time Tick is responsible for decrementing each task’s counter from a certain value to 0; when the counter is 0, the task is said to be READY. I/O (input and output ports) and Comm (communication) are responsible for posting event notifications to the scheduler. (The event tasks have higher priority than application tasks, even in a round robin scheduling scheme.) Label 3 indicates Label 2 posting to the scheduler. At Label 4, the scheduler first checks for any event notifications when a running task returns. If there are event notifications, an event task is sent for execution; otherwise, the next application task that is ready-to-run is sent for execution [8]. Labels 5 and 6 are the event tasks and application tasks. Label 7 is the system primitives and APIs accessible to both event tasks and application tasks.

Most cooperative OSes employ a round robin scheduling method. This method allows all the tasks an equal opportunity to run, and also allows each task to run to completion. This run-to-completion technique allows the OS to maintain only one stack that is shared by all the tasks. If the system is using a pure round robin method, the scheduler is simple to design. However, an application task can also request the scheduler (if needed) to place it in a READY state so that it regains the CPU again. For example, let’s consider a scenario where one task may run every 5000 clock ticks while other tasks may run with different clock tick counts. In this case, the scheduler needs to make a decision, especially if two or more tasks with different tick counts are ready-to-run. Let’s consider another scenario, where application task one has run once, application task two has run twice, and application task three has not run at all. In this case, the scheduler must decide which task is allowed to run first when the current running task returns.

The time scheduling mechanism usually requires an additional system primitive to perform the tick count subtraction. This system primitive, known as the "Time Tick", is an interrupt sub-routine. The tick period is typically set to 10 ms or higher; the higher the tick period, the lower the system overhead. Similar to a RTOS, the cooperative OS uses the time tick to decrement the tick count of each task. However, since the cooperative OS is not a RTOS, it does not use the time tick ISR to preempt a running task. As a result, it has to wait until the running task returns before the next ready-to-run task can run. It should be noted that even if the priority and time scheduling mechanisms are enforced, the tasks are not always guaranteed to run on time.

In a single-core environment, a cooperative OS is considered to be thread/task-safe. As a result, there is no data coherence issue; thus, semaphores or mutexes are not needed. However, executing a cooperative OS in a multi-core environment is more complex. To illustrate the multi-core system, we consider a scenario where we have two cores. On both cores, we run only one cooperative OS, which is capable of scheduling application tasks and event tasks to any core as needed. We do not consider the case where two different cooperative OSes are running, one on each core.

In a multi-core environment, the cooperative OS has to comply with additional rules. For instance, all the shared system primitives and the APIs must provide locking and unlocking features. All the sub-routines accessible by any task must be reentrant sub-routines. In this case, data coherence becomes an issue. This data coherence issue is similar to that of a RTOS.

The above data coherence issue "may be" resolved by special atomic commands. Typical atomic commands are executed to completion and cannot be interrupted by the system or by external hardware interrupts. A special atomic command used for the data coherence issue (with a semaphore or mutex) should have the following attributes. In a single instruction cycle, it should be able to: read from a memory location; compare its data; and write the required data back to the same memory location. It should be noted that an instruction cycle is different from a clock cycle. A single instruction can take one clock cycle or multiple clock cycles to complete and is often not interruptible by the system. The clock cycle is one cycle of the system clock.
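To illustrate the read-compare-write attribute, the sketch below builds a simple lock on a compare-and-swap primitive using C11 atomics. This is only an illustration of the attribute; the thesis itself discusses reservation-based PowerPC instructions, not C11.

#include <stdatomic.h>

static atomic_int lock = 0;   /* 0 = free, 1 = taken */

void acquire(void)
{
    int expected = 0;
    /* compare_exchange reads the location, compares it with 'expected',
     * and writes 1 back only if they match -- one atomic read-compare-write. */
    while (!atomic_compare_exchange_weak(&lock, &expected, 1)) {
        expected = 0;         /* reset after a failed attempt, then retry */
    }
}

void release(void)
{
    atomic_store(&lock, 0);   /* give the lock back */
}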

We used the words "may be" in the above paragraph because the atomic commands in one core may not be visible to other cores. To illustrate this case, we present an example considering PowerPC cores. First, to create a semaphore, we have to start with an atomic read command. The read command will set a specific bit in one of the special-purpose registers. Next, we compare the data and then perform an atomic write command with the required data. If the bit set by the read command in the special-purpose register has been cleared, the write operation will fail [7]. This means that someone has already accessed that memory location. In the multi-core environment, the other cores cannot access, nor have any knowledge of, this special-purpose register. Hence, implementing a semaphore using these atomic read and write commands would fail.

Also, in this case, implementing a critical section does not have any effect on other cores, since it would not block the other tasks (running on the other cores) from accessing the shared memory.

To overcome the above issue, other methods, including hardware semaphores, can be used. However, a hardware semaphore is not a good solution [12] because the number of semaphores is limited by design. The bus-centric semaphore is another method that overcomes these issues. These methods are discussed and presented in section 4.10.

We modified the cooperative OS architecture presented in Figure 1 to accommodate multi-core scheduling (in Figure 2). One of the methods is to implement two FIFOs, one for ready-to-run tasks and another for returned tasks, plus a super loop model on the second core. The ready-to-run FIFO is written by the scheduler, which pushes ready-to-run tasks into it, and is read by the second core, which sends each ready-to-run task to its running state. The super loop behaves as an idle task within the loop and checks for any ready-to-run tasks. If there are any, it sends the next ready-to-run task to its running state. After the running task returns to the super loop, the super loop posts the task to the return FIFO, for the scheduler to reschedule its next running state.
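A sketch of the second core's super loop under this two-FIFO scheme is shown below; the FIFO helper functions are hypothetical, standing in for whatever inter-core queue the platform provides.

#include <stdbool.h>

typedef void (*task_fn)(void);

/* Hypothetical inter-core FIFOs; each returns false on empty/full. */
bool ready_fifo_pop(task_fn *out);    /* written by Core 0's scheduler  */
bool return_fifo_push(task_fn task);  /* read back by Core 0's scheduler */

/* Core 1 behaves as an idle task inside a super loop. */
void core1_main(void)
{
    for (;;) {
        task_fn task;
        if (ready_fifo_pop(&task)) {  /* a ready-to-run task is available */
            task();                   /* run it to completion */
            return_fifo_push(task);   /* let Core 0 reschedule it */
        }
        /* otherwise: idle work, e.g. CPU-usage accounting */
    }
}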

A high-level event diagram of a multi-core cooperative OS is depicted in Figure 2. This is implemented on a symmetrical dual-core PowerPC device. The right side of the event diagram is the cooperative OS. The FIFO placed between Core 0 and Core 1 acts as the ready-to-run task FIFO and the return task FIFO. In this case, the ready-to-run task FIFO is employed by the scheduler, which commands Core 1 to execute any tasks assigned to Core 1 in the FIFO. Tasks that are not assigned to a specific core can be executed on any core; however, tasks that are assigned to a specific core can only be executed on that core.


Figure 2: Multi-Core Cooperative Operating System Event Diagram

In a multi-core environment, reentrant sub-routines are important, since they allow different applications running on different cores to share functions. Reentrant sub-routines are the only sub-routines that should be called by tasks in the running state on different cores. A good programming practice is to write all sub-routines as reentrant sub-routines.

2.4 Real-time Operating System

Real Time operating systems (RTOSs) are commonly classified into three categories [1]: hard real-time, firm real-time and soft real-time. These three categories differ based on the time requirement to complete a particular task.

A Hard RTOS, the most time-stringent of the three, requires its thread to complete a certain task within a specific time limit; otherwise, the system is considered to have failed. Examples of applications with a hard RTOS include pacemakers, braking systems, and missile controllers.

A Firm RTOS is less time-stringent compared to a hard RTOS. In this case, even if the system fails to meet its deadlines more than once, the system is not considered to have failed; however, a result delivered after its deadline is useless. An example would be a weather forecasting system: the information obtained after the weather changes is useless for predicting the weather.

A Soft RTOS, the least stringent of the three, does not consider the system to have failed even if the system waits for a long time. An example of an application with a soft RTOS is a text editor waiting for the typist to enter text via the keyboard. The typist may type a couple of chapters, go for a long break, and resume his/her work at a later time.

All RTOSes are preemptive operating systems, i.e., an OS that is not capable of preemption cannot be considered a real-time operating system. For instance, a cooperative OS is incapable of preempting a thread; hence it is not a RTOS.

A RTOS is typically more complex than other operating systems. However, a RTOS that is only capable of preempting its tasks in a round robin fashion is less complex to construct. In this case, since all the tasks are always ready to be executed, a context switcher can be used instead of a scheduler. However, most RTOSes are far more complex than the aforementioned example due to various requirements, such as different scheduling and priority schemes, to name a few, which are discussed in the following sub-sections.

Real-time operating systems (RTOS) must have a deterministic behavior in terms of deadlines. Their throughput is lower than that of other types of operating systems, because a RTOS uses more CPU time than other OSes. The CPU usage in a RTOS is often based on the number of tasks and the time tick period. This type of OS requires more system resources than other OSes, including CPU time and random access memory (RAM).

Apart from the aforementioned features, RTOS can be further categorized into two types: Hybrid and True. The analyses of these two types are detailed in the following sub-sections.

2.4.1 Hybrid Real-time Operating Systems

The Hybrid RTOS has the ability to preempt a running task of a lower priority, similar to the True RTOS, and it also has some features of the Cooperative OS; thus the name hybrid.

To facilitate this discussion and analysis on hybrid RTOS, it is important to understand the following two concepts: recursive programming algorithm and hardware interrupt.

1) The recursive programming algorithm that we focus on, in this case, is recursion, not iteration. A recursive sub-routine has the ability to call itself: it saves the state it was in and all its variables on the stack prior to calling itself at the beginning of the sub-routine’s entry point. Recursive algorithms are used in many applications. For instance, the Tower of Hanoi application uses the recursive algorithm to rearrange all the parts of the tower. The Hybrid RTOS has many features of the recursive algorithm, including the ability to call itself, to save states and variables, and to preempt other tasks.

2) The hardware interrupt that we focus on, in this case, is the nested interrupt. Interrupts can be categorized into two kinds: one allows a higher priority interrupt to interrupt a lower priority interrupt while it is being executed; the other allows the same interrupt to interrupt itself, also called the self-nested interrupt. The Hybrid RTOS has the features of the latter.

Similar to many OSes, the Hybrid RTOS starts from a Time Tick, whereas a pure round robin cooperative OS does not require a Time Tick. With multi-task operations, the highest priority task typically gains access to the CPU. While a task is in the running state, the Time Tick counts down the tick counters of all tasks except the running task and the preempted tasks. When one or more tasks are ready to run, and if the highest priority task among the pool of ready-to-run tasks has a higher priority than the current running task, the current running task is preempted so that the new higher priority task can be executed. If the priority of the current running task is higher than the next ready-to-run task’s, then the system waits for the current task to run to completion and cede its CPU time to the next ready-to-run task. Run-to-completion is also a common feature in cooperative OSes. The hybrid RTOS’ tasks must run to completion unless preempted by a higher priority task.

Similar to the cooperative OS, the hybrid RTOS only requires one stack for all its tasks and all the software and hardware interrupts. By using only one stack for all the functionalities, the system overhead is kept low and the design is kept simple. Unlike the True RTOS, the Hybrid RTOS uses smaller task control blocks to maintain its tasks, and uses a simplified algorithm to keep track of all the tasks and their associated stacks. The hybrid RTOS uses nested hardware interrupts and software prioritization to run all the tasks. Consider a multi-task operation scenario, where the second lowest priority task is ready-to-run and is sent to the running state. Before the running task can run to completion, it is preempted by a higher priority task. This process continues until the preemption chain reaches the highest priority task (also known as the fixed priority, see section 3.5). In this case, the nested interrupts are used, and the lower priority tasks are preempted, by calling the Time Tick recursively.


Figure 3: Hybrid Real-time Operating System Event Diagram

For the hybrid RTOS, the prioritization of the tasks is typically fixed. However, in certain situations, group prioritization can also be used for the hybrid RTOS. Group prioritization reduces the number of pushes and pops on the stack, thus saving precious CPU time. As a result, tasks have more time to run to completion. Task prioritization and the various techniques of scheduling the next ready-to-run task are detailed in sections 3.1, 3.2, 3.3, and 3.4.

The Hybrid RTOS tasks are designed using the same concept as the cooperative OS’s. As illustrated in Code 1, the tasks are normal C functions, and the thread is structured like a state machine. Similar to the cooperative OS, this thread structure allows partial work to be done without degrading the integrity of the work; allows a task to resume its work when it regains the processing time; and allows it to cede its processing time in a timely manner.

The above information is used to build an event diagram for the hybrid RTOS as demonstrated in Figure 3. The hybrid RTOS is also known as a single stack RTOS.

As illustrated in Figure 3, the event diagram starts with a timer interrupt (i.e., a hardware interrupt) that jumps to the Time Tick interrupt sub-routine (ISR). When the timer interrupt is triggered due to the preset period expiring, it immediately preempts the running task. Then the timer interrupt performs its operation, which includes counting down the time counters of each task except the preempted tasks and the running task. If there are no ready-to-run tasks, the ISR re-arms the interrupt, executes the return-from-interrupt (RTI) instruction, and restores the running task to its running state. In this case, the ISR returns normally (i.e., returns control back to the interrupted task without going into a recursion state). If there are ready-to-run tasks, the scheduler initially verifies whether the highest priority task among the pool of ready-to-run tasks has a higher priority than the current running task. If the priority is lower than the running task’s, the current running task resumes its running state and the ISR returns normally. If the priority is higher than the running task’s, the scheduler re-arms the interrupt, launches the next ready-to-run task, and sets it to the running state. Since the stack typically operates from the high address to the low memory address space, when the ISR returns, its information is on top of the stack layer of the running task. When the timer interrupt is triggered the next time, the ISR enters a recursive state. The stack frame is depicted on the left of the event diagram in Figure 3. Conversely, if there are no ready-to-run tasks and the current running task runs to completion, the current running task returns control to the scheduler. Next, the scheduler returns control to the ISR. When the ISR returns by executing the RTI (return-from-interrupt) instruction, a popping action of the stack frame is triggered. This popping action loads the stack information into the CPU, restores the previously preempted task, and sets it as the current running task. The popping of the stack frame as the ISRs return from their recursion state is illustrated in Figure 3, on the right side of the stack frame.

Tasks that are preempted by higher priority tasks must wait for the tasks above them to run to completion before they can resume their running state. This sequencing of preempted tasks back to their running state follows a LIFO order. This LIFO sequence makes the hybrid RTOS less responsive: it is difficult to determine when a preempted task can resume its running state. Consider a scenario where a shared resource is locked by a preempted task buried deep in the stack. Even though the operating system can change the priority of that task to a higher value, the task cannot resume its running state until all the other tasks in the stack are executed.

Typically, proper planning of the tasks in an operating system reduces deadlocks. However, a good OS in general should not inherit unrecoverable deadlock situations.

Real-time operating systems should be deterministic, so that the behavior of the system operation can be predicted. The hybrid RTOS has deterministic behavior, but not for enough of its responses to events to classify this operating system as a hard RTOS.

In summary, the hybrid RTOS has some of the features of the cooperative OS and the true RTOS. The hybrid RTOS also uses fewer resources, such as RAM and CPU cycles (compared to a true RTOS), because it only requires a single stack for all its operation. In the following sub-section, we analyze the features of the True RTOS and investigate how it differs from the hybrid RTOS.

2.4.2 True Real-time Operating System

To facilitate the discussion and analysis of the true RTOS, we initially investigate some features of this RTOS, including context switching, and software and hardware semaphores.

Context switching is the process of storing and restoring the state of a task/thread, so that execution can be resumed from the same point at a later time. It is the process of preempting a task/thread and restoring it to its running state from the point where it was preempted earlier. In a RTOS, the context switching process is more complex than in other OSes. Achieving timely and deterministic context switching requires more sophisticated software algorithms than in non-deterministic OSes. It is important to distinguish determinism from the speed of the CPU. For example, a certain sub-system always takes the same number of clock cycles to accomplish the deterministic context switching process regardless of the frequency of the CPU. The execution time changes with the system clock frequency, but the number of clock cycles remains the same. This makes it easier to predict the behavior of the system.

Unlike the hybrid RTOS, the true RTOS typically requires many stacks. In this case, the number of stacks equals the number of tasks or threads. Each task is allocated its own stack, but all the tasks share the same heap. In order to keep track of all the individual stacks and tasks, a task control block (TCB) is also created for each task.

This block contains the location of the stack, information about the particular task, and linked lists to other TCBs. A TCB usually contains the following information about a task:

• Pointer to the task stack

• Task status

• Stack size

• Task priority

• Number of tick delays

• Pointer to the next TCB

• Pointer to the previous TCB

The above TCB information is necessary to facilitate our discussion of context switching in section 2.6.
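For illustration, such a TCB could be sketched in C as follows; the field names and types are illustrative assumptions, not taken from any particular RTOS.

#include <stdint.h>

typedef struct tcb {
    uint32_t   *stack_ptr;   /* pointer to the task's saved stack top */
    uint8_t     status;      /* READY, RUNNING, SLEEP, ... */
    uint32_t    stack_size;  /* size of the allocated stack */
    uint8_t     priority;    /* task priority (higher number = higher) */
    uint32_t    tick_delay;  /* remaining tick-delay count */
    struct tcb *next;        /* linked list: next TCB */
    struct tcb *prev;        /* linked list: previous TCB */
} tcb_t;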

2.5 Create Task

Before switching into a RTOS environment, every task that has to be executed in that environment must be registered with it. Creating a task is similar to registering a task with the RTOS. In this case, initially, a stack is allocated to the particular task, and the information regarding the location and size of the stack is kept in the TCB (task control block). Next, the starting address of this task is stored on top of this allocated stack. Note that stacks always grow downwards from high addresses to low addresses: i.e., when a value is pushed (saved) onto the stack, the stack pointer decrements to the next address, and when a value is popped (retrieved) from the stack, the stack pointer increments to the next address. The entry point of the task is saved onto the top of the stack, followed by null values for the status register, since the status register needs to be cleared for the first run. Once this process is completed, the stack pointer in the TCB points to where the NULL values end.
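A sketch of this registration step is shown below, continuing the tcb_t sketched earlier and assuming a 32-bit CPU whose context restore pops a status register and a program counter; the exact frame layout is architecture-specific, and the helper is illustrative.

#include <stdint.h>

/* Build an initial stack frame so that the first context restore can
 * launch the task. */
void create_task(tcb_t *tcb, void (*entry)(void),
                 uint32_t *stack_mem, uint32_t words)
{
    uint32_t *sp = &stack_mem[words];   /* stacks grow downward from the top */

    *--sp = (uint32_t)(uintptr_t)entry; /* entry point, popped into the PC */
    *--sp = 0;                          /* status register cleared initially */
    /* ... null values for the general-purpose registers would follow ... */

    tcb->stack_ptr  = sp;               /* points to where the NULL values end */
    tcb->stack_size = words * sizeof(uint32_t);
}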

2.6 Context Switching

Context switching is the main feature of the true RTOS. Without this feature, the OS is unable to preempt any task. Unlike the hybrid RTOS, which requires only a single stack for all the tasks, the true RTOS requires one stack per task. That means, with a true RTOS, a single task is allocated a single stack. In this case, the stack size depends on the application requirements. Initially, the stack size for a practical task is not optimized, which results in an oversized stack. However, the stack and the stack size are optimized after the coding of the task is completed.

In Figure 4, the block diagram illustrates the context switching between two tasks. As shown in Figure 4, it involves the TCBs and the system stack pointer. Context switching requires system hardware, and it also involves some initial software setup.

Prior to starting the true RTOS, the main program operates on a main stack similar to those in super loop applications. In order to switch to a true RTOS environment, the application typically calls a "Multi-Tasking" function. After calling the Multi-Tasking function: all interrupts are disabled; the scheduler determines which task has the highest priority, then performs the context switch, enables interrupts, and sets the highest priority task from the pool of ready-to-run tasks to its running state.


Figure 4: Context Switching

Prior to activating the first context switch, the application still operates on the main stack. After calling the Multi-Tasking function, all the system registers are pushed onto the main stack and a copy of the main stack pointer value is kept in a system variable. When this process is completed, the context switching begins. Note that Multi-Tasking is called very rarely in the lifetime of this OS. In Figure 4 [2], the Task A Stack on the left represents the current running task’s stack, and the Task B Stack on the right represents the task that is going to replace the current running task via context switching. At this time, the CPU stack pointer points to "BEFORE" in the Task A Stack. When the context switch occurs, the system registers of the current running task are pushed onto the Task A Stack and the CPU stack pointer is decreased to "AFTER" (as shown in the Task A Stack in Figure 4). The CPU stack pointer is then saved in the TCB of Task A.

At this time, Task A is suspended and the scheduler looks for the highest priority task to run. If the scheduler is unable to find a higher priority task to run, Task A resumes its running state. In this case, Task B has a higher priority than Task A; thus, Task A is preempted. In order to preempt Task A as well as to set Task B to a running state, the stack pointer address stored in Task B’s TCB is written into the CPU stack pointer, which redirects the CPU stack pointer to point to the Task B Stack. Next, a return-from-interrupt (RTI) instruction is executed, which completes the context switch from Task A to Task B.
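In pseudo-C, the heart of this switch is a stack-pointer exchange; the sketch below assumes hypothetical CPU primitives (normally a few lines of architecture-specific assembly) and reuses the tcb_t sketched earlier.

#include <stdint.h>

/* Hypothetical CPU primitives, normally implemented in assembly. */
extern uint32_t *cpu_get_sp(void);                /* read the CPU stack pointer  */
extern void      cpu_set_sp(uint32_t *sp);        /* write the CPU stack pointer */
extern void      cpu_return_from_interrupt(void); /* RTI: pops the saved context */

/* Called with the running task's registers already pushed on its stack. */
void context_switch(tcb_t *curr, tcb_t *next)
{
    curr->stack_ptr = cpu_get_sp();   /* save the CPU SP into Task A's TCB */
    cpu_set_sp(next->stack_ptr);      /* redirect the SP to Task B's stack */
    cpu_return_from_interrupt();      /* RTI pops Task B's saved context   */
}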

In addition to the context switching, the scheduler also has to perform further storing and retrieving of information on the respective stacks and registers. If the compiler supports intrinsic commands, then all the contexts are automatically saved by issuing intrinsic commands such as pop-all or push-all. If assembly language is used, the saving and restoring of the system registers must be done using the appropriate push and pop instructions.

In summary, context switching can be done deterministically. Context switching typically takes a constant number of clock cycles: to push and pop the stack pointer, flag, and general-purpose registers; and also to change the CPU stack pointer to point to a new task stack and execute an RTI.

2.7 Existing Literature

Linux, Windows, and UNIX operating systems (OSes) [23][26][28][29][30][31][33][34] are popular subjects for research. In other words, most research is very high-level. The researchers may not know exactly how the operating systems work. The question is whether they need to know how the OSes work, or whether they just need to know enough about how to use and modify the APIs to perform their research.

In all fairness, not many people understand the primitives and hardware abstraction layers of an operating system. Therefore, most of the research on OSes uses intrusive methods, such as filter-driver workarounds to gather the results, or running different programs of different functionality and size to rationalize the expected results [23] to [34].

With the Windows OS growing to some 50 million lines of code and the Linux OS to more than 5 million lines of code, understanding the complete operation of an OS is an impossible task. Needless to say, the research results from these OSes are always approximate; there are too many variables that the researchers cannot control or cannot predict.

In our thesis work, we are only interested in the OS’s primitives and hardware abstraction layers: how the OS intrusively disrupts the processor pipeline, so that we can, to this end, find a solution in our future work on our processor. Although the information in these papers on OSes is interesting and may have some usefulness, it does not serve our purpose. However, most of the research methods are clever.

CHAPTER III

ANALYSIS OF SCHEDULING METHODS

Real-time operating systems (RTOS) use various techniques to preempt the tasks. These techniques are called scheduling methods, which determine when and how to preempt tasks. In this chapter, we analyze, discuss, and present various scheduling methods including round robin, group based, threshold based, and fixed priority based scheduling.

3.1 Round Robin Scheduling Method

The round robin [1] is the simplest scheduling method since it allows every task an equal chance to run. This method is similar to circular buffering of the tasks. For example, if there are ten tasks in the system, task one runs first followed by the next task, task two. The subsequent tasks are processed one after another until the tenth task. After task ten finishes, the round robin method starts to run task one again. This process is repeated in an endless loop.

The round robin method is usually coupled with time scheduling (task delay). However, this does not significantly alter or affect the concept of round robin. For instance, if task two is next but is not ready-to-run, then task three takes its place. Although the above scenario seems trivial, complexity arises in a situation where task two should have been the next ready-to-run task but missed the chance to run during its time slot because it was not ready, and the next ready-to-run task is task seven. The scheduler typically decides which task to run, and allows the selected task to run. In this case, if the scheduler allows task seven to run and task two waits for the next available time slot, then task two is unable to run even once in that particular round robin cycle.
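A round robin pick-next routine that skips not-ready tasks can be sketched as follows; it is a minimal illustration, reusing the task_state_t enumeration sketched in section 2.2.

#define NUM_TASKS 10

/* Returns the index of the next READY task, scanning circularly from the
 * task after 'current'; returns -1 if no task is ready (run the idle task). */
int rr_next(const task_state_t states[NUM_TASKS], int current)
{
    for (int i = 1; i <= NUM_TASKS; i++) {
        int candidate = (current + i) % NUM_TASKS;
        if (states[candidate] == TASK_READY)
            return candidate;   /* e.g. task three takes task two's place */
    }
    return -1;
}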

3.2 Group Based Scheduling Method

The group based scheduling employs both priority-based and round robin techniques to schedule the tasks. This method enables the system to schedule and preempt tasks based on their levels of importance in specific groups. Typically, tasks within a group are considered to be of equal importance and operate in a round robin fashion.

In general, group based scheduling is categorized into three groups: HIGH, MIDDLE, and LOW, where the HIGH group has the highest priority and the LOW group has the lowest priority. This definition of the groups represents the importance of their tasks and their privilege to preempt tasks from groups of lower priorities. One of the major advantages of group based scheduling is that it reduces the preemption of tasks compared to fixed priority based scheduling.

We present an example to illustrate how the tasks are scheduled using the group based method. Let’s consider a scenario where a MIDDLE priority task is running and no other tasks are ready-to-run at the time. However, a tick later, another task becomes ready-to-run. In order for the scheduler to preempt the current running task and to set the new ready-to-run task to its running state, one of the following three conditions must be met:

1) The new ready-to-run task is from a higher priority group, i.e., from the HIGH priority group. In this case, the current MIDDLE priority running task is preempted and the HIGH priority task is set to its running state. If the new ready-to-run task is equal (MIDDLE) or lower (LOW) than the current running task (MIDDLE), then the current task resumes its running state.

2) The current running task performs self-preemption. In this case, the next highest priority task from the pool of ready-to-run tasks is set to its running state.

3) The current running task’s runtime expires, and the scheduler preempts the current running task.

As mentioned before, the group based method reduces the number of preemptions of the tasks, thus assigning more CPU time to the running tasks. For instance, when the current task is in the running state, and several tasks become ready-to-run on the next tick, these ready-to-run tasks are pushed into the corresponding FIFOs. Next, these ready-to-run tasks are set to their running state when the next running slot is available to that group. This way, the ready-to-run tasks are guaranteed to run in a round robin fashion based on the order in which they reach the ready-to-run state.
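A sketch of this group dispatch, with one ready-to-run FIFO per group, is shown below; the FIFO helper is hypothetical, and the preemption test follows the three-condition rule above.

#include <stdbool.h>

typedef void (*task_fn)(void);
typedef enum { GROUP_LOW, GROUP_MIDDLE, GROUP_HIGH } group_t;

/* Hypothetical per-group FIFO of ready-to-run tasks; false when empty. */
bool group_fifo_pop(group_t g, task_fn *out);

/* Preempt only when the candidate's group outranks the running group;
 * equal or lower priority lets the current task keep running. */
bool should_preempt(group_t running, group_t candidate)
{
    return candidate > running;
}

/* Pick the next task: drain the highest-priority non-empty FIFO first.
 * Within a group, FIFO order yields round robin by ready-to-run order. */
bool pick_next(task_fn *out)
{
    return group_fifo_pop(GROUP_HIGH,   out) ||
           group_fifo_pop(GROUP_MIDDLE, out) ||
           group_fifo_pop(GROUP_LOW,    out);
}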

3.3 Threshold Based Scheduling Method

The threshold based scheduling method has some features that are similar to the group scheduling. For example, any task that is under the threshold is pushed into a FIFO, and operates in a round robin fashion similar to the group scheduling method.

We present an example, where we integrate the threshold based scheduling with the fixed priority based scheduling. After the integration, the round robin is also applied within the group method. In this case, we consider a scenario, where we have sixty-four tasks and each task has been assigned a unique priority from zero to sixty-three. Task zero has the lowest priority and task sixty-three has the highest priority. Then we apply a threshold at task sixty, which means that any task with a priority below sixty is not allowed to preempt any other task below this threshold, regardless of whether it has a higher priority than the running task. All the tasks below this threshold run in a round robin fashion based on the order in which they reach the ready-to-run state. Whereas all the tasks at or above this threshold are allowed to preempt other tasks with lower priorities to gain the CPU time. That is, all the tasks above the threshold use the fixed priority scheduling method, which is discussed in detail in the next section (a short sketch of the preemption rule follows below).
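
The preemption rule just described can be captured in a few lines of C. This is a hedged sketch; the threshold value and the function name are invented for illustration:

    #include <stdbool.h>

    #define PRIO_THRESHOLD 60    /* tasks at or above may preempt */

    /* Below the threshold, a task never preempts (it waits for its
     * round robin turn); at or above it, plain fixed priority wins. */
    bool may_preempt(int new_prio, int running_prio)
    {
        if (new_prio < PRIO_THRESHOLD)
            return false;
        return new_prio > running_prio;
    }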

3.4 Fixed Priority Based Scheduling Method

The fixed priority based scheduling [2][3] is more complex than the other scheduling methods presented in this thesis, since decoding the next highest priority ready-to-run task in a deterministic way is a harder problem.

The fixed priority based scheduling, as the name indicates, assigns a unique priority to each task in the system. With unique priorities, a task preempts the running task if its priority is higher. This allows the higher priority tasks to be more responsive. As a result, if the time allocated to run certain tasks is not planned and assigned properly, the lowest priority task may not have a chance to run at all.

With the fixed priority scheduling, if the next ready-to-run task has a higher priority than the current running task, then the current running task is preempted and the next highest priority task is set to its running state. If the next ready-to-run task has a lower priority than the current running task, then the current running task resumes its running state until the task preempts itself or its runtime expires.

A major disadvantage of the fixed priority based scheduling is that the number of preemptions is high, thus more CPU cycles are wasted on preempting tasks. As a result, tasks have fewer CPU cycles to perform their work. One advantage of the fixed priority based method is the responsiveness of its tasks.

Let's consider how the scheduler decodes the highest priority task from a pool of ready-to-run tasks. One method is to compare all the ready-to-run tasks and check which task has the highest priority. This method is inefficient, especially if the system has a large number of ready-to-run tasks. For instance, if the system has a thousand ready-to-run tasks, we have to compare all thousand tasks to find out which one has the highest priority. This method might be feasible only if the system has few ready-to-run tasks.

There are several techniques that can be employed to enhance the fixed priority scheduling. One technique is to use a lookup table method to parse the highest priority task from a pool of ready-to-run tasks. This lookup table method is elaborated in the following sub-section.

3.5 Fixed Priority Lookup Table Method

The fixed priority lookup table [2][3] method requires two lookup tables, an indexer, and an 8-byte array. One lookup table holds the eight powers of two, 2^N, where N ranges from 0 to 7. The other lookup table is a 256-entry (16 by 16) log2-based table. The indexer indicates which bytes of the 8-byte array contain ready-to-run tasks. The Ready Task Table is an 8-byte array, which is used to store the ready-to-run tasks. These are illustrated as follows:

The Indexer: to provide a reference into the Second Lookup Table

    uint8_t OSRdyIndex;

The Ready Task Table: to host all the ready-to-run tasks

    uint8_t OSRdyTbl[8];

First Lookup Table: to provide the bit position for a task in the Ready Task Table

    uint8_t const OSIndexTbl[] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};

Second Lookup Table: to decode a task in the Ready Task Table

    uint8_t const OSLog2Tbl[] = {
        0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
        4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
        5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
        5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
    };

In this example, we consider that the system supports 64 tasks. The lookup table method (with the aforementioned lookup tables, the indexer, and the 8-byte array) provides a deterministic way to decode the highest priority task in a fixed number of clock cycles regardless of the number of ready-to-run tasks, i.e., whether there is one ready-to-run task or 63.

This method decodes the priority number sixty three as the highest priority, and the priority number one as the lowest priority. The priority zero is typically assigned to the idle task, which runs when there are no ready-to-run tasks.

When a specific task is ready-to-run, it is stored in the Ready Task Table in a predefined way. For example, if task 9 is ready-to-run, it is stored in the second byte of the array (9 >> 3 = 1), at bit position one (9 & 0x07 = 1) of the Ready Task Table. Also, we have to set the indexer to indicate that there is a task in the second byte of the array. The code below demonstrates how to perform this task.

    task = 9;
    OSRdyIndex |= OSIndexTbl[task >> 3];
    OSRdyTbl[task >> 3] |= OSIndexTbl[task & 0x07];

The above code snippet only places the ready-to-run task in its appropriate position in the Ready Task Table and sets the corresponding bit in the Indexer. After the ready-to-run tasks are placed in the table, we can decode the highest priority task (deterministically) from the Ready Task Table using the following code:

    y = OSLog2Tbl[OSRdyIndex];
    x = OSLog2Tbl[OSRdyTbl[y]];
    task = (y << 3) + x;
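
For completeness, a self-contained sketch follows that exercises both operations end to end. The main function, the table-building loop (which reproduces the log2 table listed above), and the clear_ready step (removing a task once it runs, mirroring the insertion code) are assumptions added for illustration; they are not listings from this thesis.

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t OSRdyIndex;
    static uint8_t OSRdyTbl[8];
    static uint8_t OSLog2Tbl[256];
    static uint8_t const OSIndexTbl[] =
        {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};

    /* Build the 256-entry log2 table shown above. */
    static void init_log2(void)
    {
        for (int i = 2; i < 256; i++)
            OSLog2Tbl[i] = (uint8_t)(OSLog2Tbl[i >> 1] + 1);
    }

    static void set_ready(uint8_t task)
    {
        OSRdyIndex          |= OSIndexTbl[task >> 3];
        OSRdyTbl[task >> 3] |= OSIndexTbl[task & 0x07];
    }

    /* Clear a task's ready bit; if its byte becomes empty, clear
     * the indexer bit too (the mirror image of set_ready). */
    static void clear_ready(uint8_t task)
    {
        if ((OSRdyTbl[task >> 3] &= (uint8_t)~OSIndexTbl[task & 0x07]) == 0)
            OSRdyIndex &= (uint8_t)~OSIndexTbl[task >> 3];
    }

    static uint8_t highest_ready(void)
    {
        uint8_t y = OSLog2Tbl[OSRdyIndex];
        uint8_t x = OSLog2Tbl[OSRdyTbl[y]];
        return (uint8_t)((y << 3) + x);
    }

    int main(void)
    {
        init_log2();
        set_ready(9);
        set_ready(42);
        printf("highest = %u\n", highest_ready());  /* prints 42 */
        clear_ready(42);
        printf("highest = %u\n", highest_ready());  /* prints 9  */
        return 0;
    }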

Lookup tables often provide one of the fastest and most efficient ways to solve complex programming problems. Conversely, lookup tables occupy precious resources on chip. For instance, if the system has a total of 128 tasks, the tables occupy twice the space on chip.

Furthermore, by combining the threshold scheduling with the fixed priority scheduling, we can reduce the number of preemptions, while still allowing tasks above the threshold to preempt any task that has a lower priority.

3.6 Time Tick ISR

The Time Tick function is an interrupt service routine (ISR), which is imperative to the operation of the real-time operating system (RTOS). Thus, it is considered the heartbeat of the RTOS. The Time Tick ISR is usually triggered by a low priority interrupt from a hardware timer. The main function of this ISR is to decrement the tick counts in the operating system's TCBs (Task Control Blocks), one tick at a time, each time the preset period of the timer expires. Upon returning to the point of interrupt (RTI), the ISR rearms itself, and the ISR process repeats continuously until the RTOS is disabled.

The main function of the Time Tick ISR is to decrement the tick counters and then call the scheduler. Although this Time Tick ISR function is simple, within the context of an OS it consumes a large number of CPU cycles, and this cost varies with the number of tasks. One of the reasons for this issue is the way the tasks and their corresponding TCBs are designed. In this case, the Time Tick ISR has to traverse the linked lists within the TCBs to identify the next task, in order to determine whether to decrement its tick count. Our investigation and analysis revealed that, thus far, no software solution has been proposed to resolve this fundamental issue of the Time Tick ISR, i.e., to enable this ISR to decrement the tick counts of the TCBs in a constant number of CPU cycles regardless of the number of tasks in the RTOS. This is an area where software engineering has reached its limit.
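
To make the cost concrete, here is a hypothetical sketch of such a Time Tick ISR; the TCB layout and names are illustrative assumptions, not the listing of any particular RTOS. The walk over the linked list is O(n) in the number of tasks, which is precisely the limitation discussed above.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct tcb {
        uint32_t    tick_count;  /* remaining delay ticks; 0 = ready */
        struct tcb *next;        /* linked list of all tasks         */
    } tcb_t;

    extern tcb_t *tcb_list_head;
    extern void   scheduler(void);

    /* Called from the hardware timer interrupt on every tick. */
    void time_tick_isr(void)
    {
        for (tcb_t *t = tcb_list_head; t != NULL; t = t->next) {
            if (t->tick_count > 0)
                t->tick_count--;     /* one decrement per task       */
        }
        scheduler();                 /* may preempt the current task */
    }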

3.7 Software and Hardware Semaphores

Software semaphore operations are typically achieved either by using atomic instructions or by entering into a critical section (i.e., by disabling the interrupts). However, these methods are useful only in a single core environment. In a multi-core environment, atomic commands and critical sections entered on one core are not visible to the other cores. As a result, the semaphore flag (of the initial core) can be overwritten by the other cores, resulting in data coherency issues.

Hardware semaphores can be used to overcome the aforementioned issues in a multi-core environment. However, the hardware semaphore itself has many issues. For example, consider a scenario, where core one gains access to a hardware semaphore gate; this gate then becomes visible to all the tasks in core one, which in turn get permission to access the shared resources. In this case, if any task other than the task that locked the semaphore accesses the shared resources, data incoherency arises.

In an RTOS, semaphores and mutexes play an important role in preventing data incoherency. In Chapter 4, we analyze and present several techniques utilized to resolve the data coherency issue (by employing semaphores) in single core as well as multi-core environments.
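
As a minimal single-core illustration of the critical-section approach mentioned above, the sketch below disables interrupts around the semaphore update. The interrupt-control functions are hypothetical placeholders for platform-specific intrinsics; as discussed, this technique does not extend to multi-core systems.

    #include <stdint.h>
    #include <stdbool.h>

    extern void disable_interrupts(void);   /* platform specific */
    extern void enable_interrupts(void);

    static volatile uint8_t sem_count = 1;  /* binary semaphore  */

    /* Single core only: the critical section is invisible to
     * other cores, which is exactly the limitation noted above. */
    bool sem_try_wait(void)
    {
        bool taken = false;
        disable_interrupts();
        if (sem_count > 0) {
            sem_count--;
            taken = true;
        }
        enable_interrupts();
        return taken;
    }

    void sem_post(void)
    {
        disable_interrupts();
        sem_count++;
        enable_interrupts();
    }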

3.8 Key Features of Operating Systems

The key features of the OSes analyzed above are summarized and illustrated in Table 1. The three operating systems, discussed in Chapter 2, have their own features, advantages, and disadvantages, regardless of whether they are coupled with the same microcontroller. These features have a significant impact on the performance of the microcontroller. Hence, in Chapter 4, we investigate and analyze these features and their effect on the microcontroller performance.

Description                  Cooperative OS     Hybrid RTOS       RTOS
Stack                        One                One               Tasks + 1
Task                         Run to complete    Run to complete   Super loop
Memory Usage                 Low                Medium            High
Preemption                   No                 Yes               Yes
Round Robin Scheduling       Yes                Yes               Yes
Group Priority Scheduling    Possible           Yes               Yes
Fixed Priority Scheduling    No                 Yes               Yes
Threshold Scheduling         Possible           Yes               Yes
Priority Inversion           No                 No                Yes
Deterministic                No                 Soft              Soft/firm/hard
Design Complexity            Low                Medium            High
CPU Usage                    Low                Medium            High
Hardware Interrupt           One                One               One
Reentrant Subroutine         No (Single Core)   Yes               Yes
Semaphore                    No (Single Core)   Yes               Yes
Task Delay/Sleep             Yes                Yes               Yes

Table 1: Key Features of Operating Systems

3.9 Compiler

[Chart: "Dhrystone per Second/Frequency" shows Dhrystones per second (0 to 120,000) versus core frequency (10 to 80 MHz) for the seven configurations listed in the legend below.]

Legend:

BTB : Branch Target Buffer
1. C Optimize off : BTB and SPE enabled, Flash Page = 00, no C optimization
2. BTB_11 : BTB and SPE enabled, Flash Page = 11, with loop optimization
3. BTB_10 : BTB and SPE enabled, Flash Page = 10, with loop optimization
4. BTB_00 : BTB and SPE enabled, Flash Page = 00, with loop optimization
5. NoBTB_11 : BTB disabled, SPE enabled, Flash Page = 11, with loop optimization
6. NoBTB_10 : BTB disabled, SPE enabled, Flash Page = 10, with loop optimization
7. noBTB_00 : BTB disabled, SPE enabled, Flash Page = 00, with loop optimization

Flash Page settings:
00: No accesses may be performed by the processor core
01: Only read accesses may be performed by the processor core
10: Only write accesses may be performed by the processor core
11: Both read and write accesses may be performed by the processor core

Figure 5: Dhrystone 2.0 Benchmark Performance Chart

Compilers have been in existence since 1952. They provide many features that reduce the code size and increase the application throughput. In this section, the compilers are discussed and presented, since certain of their features are closely related to the OSes. For instance, features such as "intrinsic commands" allow programmers to access lower-level instructions without using assembly programming. Where an equivalent intrinsic command exists, the compiler optimizes the code to the fullest, even in the absence of assembly. However, if inline assembly is used, the compiler simply inserts the assembly code without any optimization, i.e., the compiler does not attempt any modifications to the assembly code.

Some compilers can be quite aggressive with the optimization process. These compilers may promote local variables to heap or static variables, in order to share them among several subroutines. This feature violates the basic principle of the reentrant subroutine. In this case, programming a reentrant subroutine with such a compiler corrupts the system. This issue can be resolved by disabling such features of these compilers.
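
A small invented example makes the violation concrete: if a working variable is shared (static) rather than local, the subroutine is no longer reentrant, because a preempting task that calls the same subroutine corrupts the first caller's partial result.

    #include <stdint.h>

    /* Reentrant: 'acc' lives on each caller's stack. */
    int32_t sum_reentrant(const int32_t *v, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += v[i];
        return acc;
    }

    /* Not reentrant: 'acc' is shared by every caller. If a task is
     * preempted inside the loop and another task calls this function,
     * both tasks corrupt each other's partial sum. */
    int32_t sum_broken(const int32_t *v, int n)
    {
        static int32_t acc;
        acc = 0;
        for (int i = 0; i < n; i++)
            acc += v[i];
        return acc;
    }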

Figure 5 depicts the performance chart of a PowerPC microcontroller, which we created in 2010 [11]. The performance is obtained by porting the Dhrystone software to our PowerPC core. Experiments are performed and results are obtained with varying hardware options, varying compiler optimization options, and the core executing at varying speeds. As shown in Figure 5 (legend 1), it is evident that compiler optimization plays an important role in the overall performance of the system. Also, as illustrated in Figure 5 (legend 2), branch behavior can only be optimized using hardware. From Figure 5 (legend 3), it is observed that software cannot perform certain tasks without the hardware. Furthermore, if software performs in a disruptive way, as it does with a real-time operating system, the throughput of the microcontroller is reduced, thus reducing the system performance.

In the next chapter, we investigate and analyze different types of embedded microcontrollers, their features, and also the features of the OSes that significantly impact these microcontrollers. It should be noted that, in this research work, our intention is to investigate the microcontrollers in general with respect to operating systems. Thus, our investigation does not include the analyses of segmented versus non-segmented (linear addressing) microcontrollers, or of Complex Instruction Set Computing (CISC) versus Reduced Instruction Set Computing (RISC) microcontrollers.

3.10 Existing Literature

Thread and process scheduling is the essence of a real-time operating system. Linux, Windows, and UNIX [35][36][37][42] are the operating systems used in the research on thread and process scheduling. The primary timing measurement tool in this research is the system timer. Unfortunately, the use of the system timer degrades the correctness of the scheduler capability measurement.

A properly designed scheduler must be deterministic; that is, it is not affected by the system clock or by the number of processes running in the system. Otherwise, the test results cannot reflect the actual capability of the scheduler. Nonetheless, the authors manually modify the data to make it representative, without understanding which variables they are trying to omit [35] to [41].

The author pointed out that the size of the system cache affects the performance of the scheduler (context switching), which is not true [35]. The truth is that the size of the system cache affects the system performance. The OS scheduler (context switching) is deterministic and, in Linux, resides in a physical memory space that will not be swapped out or mapped out by the MMU. For this reason, the OS requires a reasonably sizable piece of memory. The operating system also periodically activates system processes for housekeeping, which affects the test results; the author may not be able to determine the duration of these activations or which system processes are involved. All these interruptions degrade the accuracy of the test results.

We are interested in understanding how the OS and the processor interact, and in designing a system that schedules and context switches seamlessly. These papers are too high level and of no use for our future work.

CHAPTER IV

MICROPROCESSOR

A microprocessor is usually composed of a central processing unit (CPU), a clock tree, and circuits for power and logic. The CPU consists of an Instruction Decoder (ID), an Arithmetic Logic Unit (ALU), Registers, a Control Unit (CU), and Buses. The Instruction Decoder interprets the instructions fetched from the memory by the control unit, and then assigns specific signals to the functional blocks for proper execution of the instructions. A typical microprocessor block diagram is depicted in Figure 6.

The ALU often performs the arithmetic operations including multiplication, division, addition, and subtraction, as well as logic operations including bitwise operations. The ALU also performs comparison operations and decisions for branching operations. The ALU effectively is the main piece of hardware that provides the conditions for making a decision in the microprocessor.

The microprocessor typically fetches data and instructions from the memory; and stores these in temporary storage, called Registers. The size (in bits) and the number of registers depend on the type of the microprocessor. The registers can be roughly divided into two categories; special-purpose registers and general-purpose registers.

The special-purpose registers include status register, program counter, and stack pointer, as well as other registers with specific functions including I/O functions. In contrast, general-purpose registers are there to hold key local variables, and intermediate results of calculations.

The control unit initially executes a default action, which is to fetch an instruction from the memory to the CPU. Next, the control unit moves data from the source locations to the destination locations, which includes moving data to the decoder, registers, ALU, and other destinations accessible by the microprocessor.

The peripherals of a microprocessor are connected by buses, where a bus is a collection of parallel wires for transmitting address, data, and control signals. Buses can be external to the CPU, connecting it to memory and I/O devices, but also internal to the CPU. A microprocessor often has two types of buses: a data bus and an address bus. A data bus moves data from one point to another, whereas an address bus directs the data to a specific location. The buses can be unidirectional or bidirectional. Also, tri-state devices allow buses to be shared among multiple peripherals [6].

[Block diagram: the system bus connects the instruction register, general registers, ALU, control unit, stack pointer, program counter, and status register via internal control signals within the microprocessor.]

Figure 6: A Typical Microprocessor Block Diagram

4.1 Microcontroller

A microprocessor by itself is nonfunctional. It requires other components, including I/O devices and various types of memories, to function as a system. For instance, a widely used desktop computer such as the Intel Personal Computer (PC) typically consists of an Intel processor, volatile and non-volatile memories, communication ports, etc.

Consider a scenario, where we incorporate the aforementioned system as a whole into a single chip. This is called a microcontroller, or a single chip computer. A microcontroller is a special purpose computing system; its composition is often based on the requirements of the application and the associated peripherals such as volatile and non-volatile memories, communication ports, timers, etc.

The functionality of the microcontroller depends on the type of the microprocessor and the associated peripherals. Both the microprocessor and the peripherals are equally important when programming the microcontroller, as well as when designing an operating system that is capable of controlling the resources in the microcontroller. Thus, we analyze and present various types of microprocessor technologies in microcontrollers.

4.2 Analysis of Microprocessors

There are different types of microprocessors. Each has its own advantages and disadvantages. For instance, the multiple-cycle microprocessor is area-efficient (occupying less hardware space on chip), but has lower performance than the single-cycle microprocessor. The single-cycle microprocessor, in turn, occupies more hardware space on chip and has higher performance than the multiple-cycle microprocessor. An improvement over the single-cycle processor is the pipelined microprocessor, which divides the single-cycle processor into many pipeline stages (typically five or more stages). As a result, the pipelined microprocessor can execute at a faster CPU clock frequency than a single-cycle microprocessor. Although the pipelined microprocessor improves the throughput of the instructions compared to the single-cycle processor, it occupies more hardware space on chip due to the pipeline stages. In addition to the hardware required for the pipelining, it also requires extra hardware to eliminate the hazard issues. These hazards are discussed in section 4.5.

Apart from the above three types, microprocessors can also be categorized based on how the instructions are issued: single-issue and dual-issue microprocessors.

With a single-issue microprocessor, a new instruction is fetched into the pipeline in every CPU clock cycle. All the instructions in the pipeline move to their next stage of execution as the instruction in the last stage completes its operation. This methodology reduces the hardware requirement for each stage, hence allowing a higher operational frequency. This technique improves the throughput of the pipelined microprocessor compared to a multi-cycle processor. Also, since the pipeline is executed without stalling, it is in an optimal state. In this case, the pipelined microprocessor is said to have one instruction per cycle (IPC).

A dual-issue microprocessor can issue certain pairs of instructions simultaneously, thus increasing the IPC. This type of a microprocessor requires more hardware on chip to support the dual-issue circuitry. There are several other constraints associated with a dual-issue microprocessor.

For instance, both instructions (in an instruction pair) must be available in the issuing stage simultaneously; the first instruction must not use the program counter (PC) as the destination register; the second instruction must not use the PC as a source register; both instructions (in an instruction pair) should belong to the same instruction set; and there should not be any data dependency between the two instructions.

A dual-issue single-core microprocessor is often mistaken for a dual-core microprocessor. As stated in [non-official number], a Qorivva dual-core microprocessor has 1.6 IPC [61], whereas a dual-issue single-core microprocessor has 1.3 IPC [60]. A dual-core microprocessor can execute two different applications in parallel, or a single application and its carbon-copy in parallel. Conversely, a dual-issue microprocessor can only execute a single application at a given time. Even with an operating system, a dual-issue microprocessor can only execute different applications at different times.

Advanced microprocessors such as the Intel i7 have Hyper-Threading technology, which claims to improve the performance by 15%-30% by creating virtual cores. For example, a 4-core i7 device appears to have eight cores. One drawback of this technology is that the applications running on such a system should be hyper-thread aware. Another technique is simultaneous multithreading, which is used to improve the overall efficiency of superscalar CPUs with hardware multithreading. The Barrel microprocessor is another advanced microprocessor, which allows multiple threads to execute in an interleaved fashion. Unlike a pipelined microprocessor, a Barrel processor does not have pipeline stalls and does not need feed-forward circuits. A Barrel microprocessor can guarantee that a real-time thread executes with precise timing, regardless of whether the other threads lock up, are interrupted, or go into a spinwait state.

4.3 Constraints on Operating Systems and Microprocessors

In this section, we discuss and present the constraints associated with using an operating system (OS), prior to investigating the interoperability of the OSes and microprocessors in terms of a single-cycle microprocessor and a pipelined microprocessor.

As mentioned in section 2.4, a real-time operating system (RTOS) depends on a Time Tick interrupt service routine (ISR), which determines when to preempt a running task. When the Time Tick ISR is triggered, it checks the system for any ready-to-run tasks. If there are no ready-to-run tasks in the system, the ISR simply returns to the point-of-interrupt, and the current running task resumes its operation.

Let's assume that the time required to perform a Time Tick ISR interrupt is $T_{INT}$. It should be noted that this time can vary from one microprocessor to another; however, the basic concept is the same. During an interrupt, the microprocessor state machine executes the following steps:

 Wait for the current instruction to complete its execution

 Save the current instruction pointer or program counter

 Save the program status register

 Load the interrupt service routine (ISR) address (i.e., the interrupt vector)

 The ISR is now executed, and interrupts are disabled to avoid nested interrupts

During the execution of the Time Tick ISR, the ISR checks all the task control blocks to find out which task is ready-to-run next. Let's assume that the time required to find the ready-to-run task is $T_{FIND}$. This operation typically takes a large amount of CPU time from the overall execution of an application. If there is no ready-to-run task, the Time Tick ISR returns to the point-of-interrupt. In this case, the OS restores the current task information, and returns the control to the current running task for it to resume its operation.

If there is a ready-to-run task, the ISR passes the control to the scheduler, which decides which task is allowed to run before it preempts the current running task. If the ready-to-run task is scheduled to run immediately, then the scheduler performs a context switch, followed by a "return to the point-of-interrupt" operation; the point-of-interrupt at that time is inside the new ready-to-run task. After returning to the point-of-interrupt, the new task takes control of the CPU. Let's assume that the time taken to perform the context switch is $T_{CS}$ and the time taken to return to the point-of-interrupt is $T_{RTI}$. From these details, we can formulate an equation for the operating system overhead as follows:

$T_{OS} = T_{INT} + T_{FIND} + T_{CS} + T_{RTI}$        (Equation 1)

where:
 $T_{OS}$ is the operating system time overhead required to switch from one task to another
 $T_{INT}$ is the time taken for an interrupt
 $T_{FIND}$ is the time required to find the next ready-to-run task
 $T_{CS}$ is the time required for the scheduler to perform a context switch
 $T_{E}$ is the effective time
 $T_{TASK}$ is the average time all the tasks spent on performing their jobs
 $T_{RTI}$ is the time required to return to the point-of-interrupt

From Equation 1, we can define the effective time as follows:

$T_{E} = \frac{T_{TASK}}{T_{TASK} + T_{OS}}$        (Equation 2)

From Equation 2, we come up with the following conclusions:

 To maximize the effective time ($T_{E}$), the time spent on executing a task ($T_{TASK}$) must be higher than the operating system overhead ($T_{OS}$).

 For the system to be very responsive, the operating system overhead ($T_{OS}$) must be as low as possible. The example below illustrates the effective time (in Equation 2). For instance, if we take $T_{TASK}$ and $T_{OS}$ as one time unit each, then the effective time of the system is 50%, as shown below:

$T_{E} = \frac{1}{1 + 1} = 50\%$        (Equation 3)

The above example shows that half of the CPU time is consumed by the OS, and only half of the CPU time is left for the execution of the application. It is evident that it is essential for the application programmer to understand this constraint and adjust $T_{TASK}$ appropriately to achieve the maximum effective time. Conversely, if $T_{TASK}$ is high, it takes a long time before other tasks can run. Hence, a trade-off should be made based on the job nature of that task. It should be noted that although the effective time varies with different types of microprocessors, Equation 2 remains valid for all types of microprocessors with a real-time preemptive operating system.
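
As a further illustration (the numbers below are chosen arbitrarily for this sketch and are not from the thesis experiments), suppose each overhead component in Equation 1 costs one time unit while a task runs for sixteen units between switches:

$T_{OS} = 1 + 1 + 1 + 1 = 4$, and $T_{E} = \frac{16}{16 + 4} = 80\%$

That is, lengthening the time a task runs relative to the fixed overhead raises the effective time, at the cost of responsiveness.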

Furthermore, there are constraints (associated with Equation 3) that neither the software nor the hardware can resolve. These constraints include the context switching and the Time Tick ISR, which are discussed in sections 2.6 and 3.6. All the above equations (1, 2, and 3) assume no instruction cache misses, or no instruction cache in the microprocessor. In the following sub-sections, we analyze, discuss, and present how the effective time is affected by a single-cycle microprocessor and a pipelined microprocessor.

4.4 Single-Cycle Microprocessor

The way the instructions are executed in a single-cycle microprocessor is similar to that of a cooperative operating system (OS). For instance, in a single-cycle microprocessor, each instruction is executed to its completion before the next instruction is allowed to execute, thus eliminating the processing hazards. As a result, there is no need to stall or flush a pipeline in a single-cycle microprocessor.

From an RTOS's perspective, a single-cycle microprocessor is more time-effective than a pipelined microprocessor (no time is wasted on flushing and stalling the pipeline). The time taken for an interrupt and context switching is as shown in Equation 2. However, different single-cycle microprocessors may support different sets of instructions, and may push and pop the stack differently. Moreover, if the microprocessor has more registers, the context switching time increases.

In summary, with a single-cycle microprocessor, the pushing and popping of the additional registers onto the stack have little effect on the effective time compared to the time spent in the operating system.

4.5 Pipelined Microprocessor

A pipelined microprocessor [9] is essentially a single-cycle microprocessor with its instruction execution process decomposed into several stages. To facilitate this discussion, we consider a five-stage pipelined microprocessor. These pipeline stages are: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Operation (MEM), and Write Back to Register (WB). The decomposition of the instruction execution process into five stages usually allows the pipelined microprocessor to execute up to five times faster than a single-cycle microprocessor. For the pipelined microprocessor to operate properly, pipeline registers are required to store the intermediate results/information between the pipeline stages. These pipeline registers require extra hardware on chip. Since a pipelined microprocessor can execute five partial instructions in the pipeline, the throughput of the microprocessor increases.

The pipelined microprocessor can have several operational hazards: structural, data, and control hazards. The structural hazard is due to a single memory map for both the instruction and the data memories, which prevents the access of data and instruction in the same clock cycle. This hazard can be eliminated by using separate instruction and data cache memories with different memory maps. The data hazard can be eliminated by data forwarding, by stall insertion logic, or by both. If the data forwarding cannot resolve this hazard, a pipeline stall is enabled. In this case, the pipeline stall forces the data-dependent operation to wait, thus giving ample time for the forward-stage operation to complete its task. The control hazard occurs when it is necessary to make a decision based on the result of a forward-stage instruction while other instructions are executing. This hazard can only be eliminated using the stall operation. In this case, branch prediction logic can reduce the number of stall operations by placing the prediction logic in the ID stage.

However, when an interrupt is triggered, the pipelined processor is required to save the PC, Cause, and Status information in registers, and to stall its pipeline if needed, before the processor can handle the interrupt (some processors save this information on the stack). The information saving and the possible pipeline stalling reduce the response time of the ISR.

$T_{OS} = T_{INT} + T_{FIND} + T_{CS} + T_{RTI} + 2\,T_{SAVE}$        (Equation 4)

      C1   C2   C3      C4       C5       C6       C7      C8    C9    C10
      IF   ID   EXE     MEM      WB
           PC   ID      EXE      MEM      WB
                Cause   PC       EXE      MEM      WB
                        Status   Cause    PC       MEM     WB
                                 INT      Status   Cause   PC    WB

Table 2: Pipeline Content during an Interrupt

In this case, $T_{SAVE}$ is the additional time the pipelined microprocessor requires to save the PC, Cause, and Status information, plus possibly a stall cycle. Similarly, when an ISR returns to its point-of-interrupt, it is required to reverse the process, and thus incurs (for pushing and popping the stack) another $T_{SAVE}$; see Equation 4. The contents of the pipeline are shown in Table 2. As depicted in Table 2, the pipeline needs five cycles before the first instruction reaches the WB stage. In Table 2, INT means an interrupt occurring at cycle C5 [13].

According to Table 2, the time taken for the context switching depends on the number of registers to be saved and the stall cycles to be waited. However, in a pipelined microprocessor, the additional time required for the stall cycles and the time required to perform the context switching are relatively insignificant, compared to the time required for the OS to make a decision to preempt a task.

From the above analysis, it is evident that the total time spent in a system is mainly due to the operating system ISR and the context switching from task to task. It is observed that the RTOS operates in a coarse-grained multitasking paradigm. We will further examine coarse-grained and fine-grained multitasking in the following sections.

4.6 Coarse-grained Multitasking

In coarse-grained multitasking, context switching from task to task takes many CPU cycles. From the above analysis (see sections 4.4 and 4.5), it is observed that the single-cycle microprocessor as well as the pipelined microprocessor can only support coarse-grained multitasking. In other words, a software RTOS cannot improve upon this architectural limitation in either of these microprocessors; that is, these processors cannot operate in fine-grained multitasking (see section 4.7).


Figure 7: Coarse-grained Multitasking

Coarse-grained multitasking has several advantages and disadvantages. The advantages of coarse-grained multitasking are: its task has all the CPU time; it controls all system peripherals; and it is easy to manage from the microcontroller point of view; that is, a task can get more work done while it has the CPU time. One disadvantage of coarse-grained multitasking is that the other tasks in this environment must wait for their chance to run. This waiting period can be quite significant, for instance, thousands of clock cycles.

Figure 7 depicts the context switching (CX) among six tasks executing in a round-robin scheduling method. In round-robin scheduling, task 6 will run in its time slot in every round-robin cycle (all the tasks run once). If each task (from task 1-6) in the system has been assigned a different priority number, and task 6 has been assigned the lowest priority, then task 6 may not have a chance to run in the designated time slot, or may not have a chance to run at all (from Figure 7). In this case, the application programmer must perform a system load balancing to ensure that no task is starved.

In coarse-grained multitasking, most of the tasks in an operating system spend their time waiting for some events to occur. Time is wasted due to this waiting period. Furthermore, in a running system, the microprocessor may stall from time to time due to access issues, thus leading to more time lost. The coarse-grained OS does not have the ability to mitigate these lost times, since the current running task has all the CPU time and resources. The time lost significantly reduces the system performance.

With a coarse-grained environment, it is only feasible to have one task that meets hard real-time requirements. The coarse-grained environment is unable to support multiple hard real-time tasks, which is another limitation of this system.

A fine-grained system can overcome the limitations of a coarse-grained system. The fine-grained system can support the hard real-time requirements of the tasks and also can mitigate the time lost in a system. These are discussed in detail in the following sub-section.

4.7 Fine-Grained Multitasking

With fine-grained multitasking, the context switching from one task to another is done in interleaved fashion, without wasting CPU cycles. In a fine-grained multitasking system, several tasks run in an interleaved fashion; hence, it is also called interleaved multitasking.

[Timeline: clock cycles C1 to C10 with Tasks 1 to 5 executing in an interleaved fashion.]

Figure 8: Fine-grained Multitasking

As illustrated in Figure 8, the fine-grained system has five tasks running in an interleaved fashion. In this case, each task is interleaved with one clock cycle. Each task in the fine-grained fabric runs for a clock cycle, and the system switches to the next task on the following clock cycle, and so on, i.e., Task 1 on clock 1, Task 2 on clock 2, etc. This process repeats itself until the OS intervenes. Unlike the coarse-grained system, the fine-grained system supports multiple hard real-time tasks, mitigates the time lost in systems, and improves the system response time. As depicted in Figure 8, all the tasks in a fine-grained multitasking system are guaranteed to execute once in every five CPU cycles.

It should be noted that context switching is not included in Figure 8. In this case, if there are only five tasks to run, it is not necessary to perform context switching, since the five tasks are executed in an interleaved fashion. However, if there is a need to run an additional task (task 6), then it is necessary to preempt one of the running tasks in order to allow task 6 to run in one of the five slots. The reduction of context switching also increases the throughput, since while one task is being preempted, the other four tasks are still running without interruption.
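
The essential mechanism is just a rotation over the task slots. The following hypothetical C model (names invented for illustration) shows how the active slot is derived from the clock cycle, which is why each task is guaranteed one cycle in every five:

    #include <stdint.h>

    #define NUM_SLOTS 5

    /* The fine-grained fabric rotates through the slots: slot 0 on
     * clock 0, slot 1 on clock 1, ..., wrapping every NUM_SLOTS. */
    static inline uint32_t active_slot(uint32_t clock_cycle)
    {
        return clock_cycle % NUM_SLOTS;
    }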

Applying the effective time equation (Equation 2, with the overhead terms of Equation 4 removed, since no context switch occurs) to compute the time spent on the tasks presented in Figure 8, we obtain 100% performance, since the system has only five tasks and there is no time lost in the system due to context switching. Even if we have additional tasks to run, the effective time for a fine-grained multitasking microprocessor is typically higher than that of a coarse-grained multitasking microprocessor.

The advantages of the fine-grained multitasking system are summarized below:

 It has the ability to run the tasks even if any of the tasks is in spinwait.

 Designing a fine-grained multitasking fabric is simpler than designing the hazard logic in a pipelined microprocessor.

 Fine-grained multitasking guarantees that a real-time task can execute with precise timing, regardless of what happens to the other tasks (spinwait, interrupt, crashes, etc.).

 Interrupts can be assigned to different time slots, i.e., if one task is interrupted, the other tasks can still be executed.

A Barrel microprocessor is one of the processors that provide the aforementioned features and can potentially operate the tasks in a fine-grained multitasking environment.

4.8 The Barrel Microprocessor

[Block diagram: IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers separating the instruction fetch logic, register file, multiplexers, ALU, and data memory.]

Figure 9: A Pipelined Microprocessor Structure

The Barrel microprocessor [14] design has some features similar to those of a pipelined microprocessor. For instance, it has the same simple pipeline stages. However, the Barrel microprocessor has different system register requirements and uses a different method to fetch its instructions from the memory compared to a pipelined microprocessor.

Figure 9 [9] illustrates how an instruction is processed in a pipelined microprocessor. The pipeline stages of the pipelined processor are typically complex, since they are also used to eliminate the different pipeline hazards (see section 4.5). A Barrel microprocessor executes its instructions to completion; hence the Barrel microprocessor does not have any data or control hazards. As a result, unlike the pipelined processor, the pipeline stages of a Barrel microprocessor are simpler. However, similar to a pipelined microprocessor, a Barrel microprocessor has a structural hazard.

Considering the example depicted in Figure 8, each task executes in an interleaved fashion, and each instruction takes five cycles to complete. From the above analysis, it is observed that a Barrel microprocessor can be designed with N tasks on an N-cycle core. In this case, the Barrel microprocessor has five virtual cores, and each core takes five clock cycles to complete an instruction. That means, in a Barrel microprocessor, the number of virtual cores and the cycles per instruction (CPI, per program) are directly proportional to the number of pipeline stages.

In order to execute its tasks in an interleaved fashion, the Barrel microprocessor needs extra system registers to store the information about the current running tasks. Therefore, a Barrel processor with N tasks requires N sets of system registers.

The instruction fetch logic of the Barrel microprocessor is more complex than the instruction fetch logic of the other types of microprocessors. Each instruction fetch must be synchronized to ensure that the proper instruction is assigned to the correct virtual core. In addition, to improve the latency associated with the instruction fetch, each virtual core has its own program counter.

There are some disadvantages associated with fine-grained multitasking compared to a coarse-grained microprocessor, as follows:

 A Barrel microprocessor with N virtual cores needs N sets of system registers.

 Each task may use the same instruction cache (i-cache), or each task may have a separate i-cache, which is costly (an n-way cache requires more hardware than a one-way cache).

 Since all tasks are not created equal, a task that needs more processing power cannot obtain it (i.e., a task cannot run for two CPU cycles out of every five CPU cycles).

In the next section, we analyze, discuss, and present the impact of the real-time operating systems on a multi-core microprocessor.

4.9 Multiple Cores Microprocessors

Multi-core microprocessors usually come in two forms: symmetrical and asymmetrical. The symmetrical multi-core microprocessor consists of identical cores in one die; for example, having 3 PowerPC XYZ cores on a single die. The asymmetrical multi-core microprocessor consists of various types of cores on a single die; for example, having an ARM core and two identical PowerPC cores on a single die.

In this research, our focus is on symmetrical multi-core microprocessors. In a multi-core environment, in general, most of the resources are shared among the cores. The system becomes less efficient with the increasing number of cores. The main reason for this issue is contention for access rights among the cores, i.e., a core that fails to gain access to a resource has to stall. This increases the time lost in a system, thus decreasing the efficiency of the system. Even if there are only two cores, each running a different task, the average effective time is still less than 100%.

When executing a real-time operating system (RTOS) on a multi-core platform, many issues arise, some of which are similar to those of a single core microprocessor. As discussed in Chapter 2, access rights are one of the primary causes that degrade its effective time and performance. In general, a multi-core microprocessor has a better throughput compared to a single core, since the former has more cores, but the time spent on stalling is relatively higher.

In summary, there is no single microprocessor that works flawlessly with an RTOS. Therefore, the system designer must select the processor that is most suitable for the required task or application.

4.10 Semaphore

Semaphores are important in real-time operating systems for maintaining data coherency, regardless of whether they operate in a single core environment or in a multi-core environment. Semaphores can often be designed in the following three ways.

1) The semaphore can be core-centric, supported by built-in instructions. This type of semaphore atomic command (SAC) is ideal for single core applications. An advantage of a core-centric atomic command is that it is limited only by the amount of memory the system has, and it does not need additional hardware (see the sketch at the end of this section).

2) The hardware semaphores are useful in a multi-core environment. However, these semaphores are limited and only good for inter-processor gating; for instance, core 1 versus core 2 semaphore gating. When dealing with tasks running in the same core, the tasks have to depend on the core's SAC to prevent any race conditions. This is similar to two participants (in a game show) pressing the buttons to gain access to answer the question first.

3) The bus type semaphore atomic command differs from the other two types of semaphores; it uses a bus arbiter to prevent the other core(s) from overwriting the semaphore. In this case, the read data is processed, and then the semaphore atomic write command writes the modified information back to that address. If the set bit is cleared, the atomic write command fails and returns a fail flag. If the operation is successful, the bit is cleared and a success flag is returned. A semaphore atomic read command fails if the bit is set; thus, only one read is allowed to succeed, followed by a write.

In a multi-core environment, the third type is preferred, as it limits the amount of memory that can be accessed, and also the third type does not require a core-centric SAC.
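
As promised in item 1 above, here is a hedged sketch of a core-centric semaphore built on an atomic test-and-set instruction. The GCC __atomic builtins are used as a stand-in for the processor's atomic commands; any memory byte can serve as a semaphore flag, which illustrates why this type is limited only by the amount of memory available.

    #include <stdint.h>
    #include <stdbool.h>

    /* Returns true if we obtained the semaphore. The previous value
     * of the flag is returned atomically: false means it was free. */
    static inline bool sem_try_acquire(uint8_t *s)
    {
        return !__atomic_test_and_set(s, __ATOMIC_ACQUIRE);
    }

    static inline void sem_release(uint8_t *s)
    {
        __atomic_clear(s, __ATOMIC_RELEASE);
    }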

4.11 Existing Literature

4.11.1 New ISA

Many RISC architectures have come and gone, while ARM remains in a big way. Constrained by the lack of an open and free ISA and by proprietary ISA licensing issues, the RISC-V organization from the University of California, Berkeley developed an open ISA aiming to support the future of open RISC architectures. The RISC-V ISA provides instruction lengths from 16 bits to greater than 192 bits [52][53][54][55].

The RISC-V ISA is further grouped into base instruction formats, such as the R-type, I-type, S-type, and U-type, and immediate encoding variants, such as the SB-type and UJ-type. These instruction formats were created to ensure an easy-to-design decoder with less hardware and to support higher bandwidth [54][55].

Although the ISA resembles the MIPS ISA, the RISC-V ISA is more comprehensive and supports a full range of RISC-V microprocessors of different instruction lengths. The cost to develop a RISC-V is significantly lower due to the openness and standardization of the ISA, source code, and development tools. Nonetheless, the RISC-V has the same pipeline issues as other pipelined processors [54][55].

4.11.2 Debugging

Debugging is the part of the development process concerned with locating faults in the system; that is, it relies on the ability to provide low level and source level debugging information. Many of the debugging papers provide an interesting high-level view of debugging. However, if we want to know how the interactions of a multiple-module system fault, these tactics are not able to provide a correct answer; especially regarding the orders in which these modules communicate that trigger a system fault.

This type of system fault can further complicate the debugging process because it can happen randomly and at different times. With the current debug module and software approach, debugging this type of system fault is only possible by best guess.

There was no mention of how to improve the debug module to provide a much easier way to debug this type of system fault. Also, these analyses did not show how to locate the root cause of the problem, but only the points of infection [48] to [51].

4.11.3 Interrupts and Exceptions

Program interrupts, which are due to instruction exceptions, are also known as soft interrupts or traps. Hardware interrupts are due to outside sources such as an Ethernet controller or an external I/O pin. All the interrupt behaviors are architecture specific and cannot change. That is, the interrupt responses are deterministic, except for the erroneous exception.

These papers did not clearly show when and how a system can recover from an erroneous exception. For example, if the system went into an exception trying to execute an illegal instruction, how does the system recover from it? What actions must an exception routine take to gracefully prevent a system crash [44] to [47]?

The SPARC processor provides a simple and effective way to handle exceptions by creating an exception stage in the pipeline [59]. That is, all exceptions, regardless of how or when they happen, are handled at this pipeline stage. Besides the exception stage, the instruction address is also forwarded to the pipeline registers. With the address information, the system knows the location of the illegal instruction, how to flush the pipeline, and how to safely enter an exception state. Beyond this point, the exception software takes the necessary actions, which are not discussed here.

CHAPTER V

A NOVEL FIVE-VIRTUAL-CORE PIPELINED BARREL PROCESSOR

In chapters one through four, we discussed how different OSes and microprocessors operate, and their advantages and disadvantages. In this chapter, as a proof-of-concept work, we introduce a new microprocessor architecture that harnesses the advantages of a Single-Cycle Processor (SCP), a Pipelined Processor (PLP), and a Multi-Core Processor (MCP). We perform experiments to evaluate our architecture against the existing designs in the literature (4.7, 4.8). We discuss and present how this new microprocessor architecture improves the processor throughput.

5.1 The New Microprocessor Architecture

The PBP uses the PLP's pipeline structure for one purpose only: to increase the core operating frequency, and hence improve the throughput of the system. Nonetheless, the PBP does not inherit all the disadvantages of the PLP; the PBP does not have control and data hazard issues (it has a structural hazard) (4.8). Therefore, there is no need for control and data hazard logic, making the PBP's pipeline easier to design (4.8).

The PBP is similar to an MCP. The PBP (for this design) is capable of executing five different programs in its pipeline after the first four clock cycles. The PBP does not have bus contention like the MCP (an MCP implements a crossbar or arbiter to allow one core to access the bus at a time while stalling the other cores), because the PBP operates in an interleaved fashion (4.7); each core accesses the bus in a different clock cycle. An N-MCP (N number of cores) requires N sets of instruction and data caches, whereas the PBP requires only one instruction cache and one data cache. A PBP requires less hardware relative to an MCP.

The PBP is similar to an SCP. The PBP executes a program instruction to its completion (4.8) before it executes the next instruction. This method of executing its instructions frees the PBP pipeline from control and data hazard issues.

From the above three paragraphs, we can deduce that the PBP has a more efficient processor architecture than the pipelined processor, the multi-core processor, and the single-cycle processor. We will substantiate our claim in the following sections.

5.2 PBP Stages

Figure 10 depicts the block diagram of the new microprocessor architecture, a novel and efficient Five-Virtual-Core Pipelined Barrel Processor (PBP) (4.8). The PBP consists of five stages. These five stages are the Instruction Fetch stage (IF), the Instruction Decoding stage (ID), the Execution stage (EXE), the Memory stage (MEM), and the Write Back stage (WB). The PBP has a more complex Program Counter Generator. We discuss the different stages in the following sections.

5.2.1 PBP Program Counter Generator (PCGEN)

Our PBP is capable of running five different programs (4.8) in an interleaved fashion. We need to track and feed the next instruction from each program into the PBP's pipeline. The PCGEN helps to update the Virtual Core (VC) PC address. The PCGEN also generates the 5 Barrel Virtual Core Identifiers (BVCID) for data synchronization. Besides these two functions, the PCGEN also corrects the VC program counters when a jump or a branch-taken instruction is executed.
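
A hypothetical C model of the PCGEN behavior follows (the actual design is Verilog RTL; the names, the +4 word size, and the simplification that a branch outcome is available when its virtual core next fetches are illustrative assumptions). Because each virtual core issues only once every five cycles, its previous instruction has completed by then, so the corrected PC is ready in time; this is why no control hazard arises.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_VCORES 5

    static uint32_t vc_pc[NUM_VCORES];  /* one PC per virtual core      */
    static uint32_t bvc_id;             /* current Barrel VC identifier */

    /* One PCGEN step per clock: emit the fetch address and BVCID for
     * the active virtual core, then advance that core's PC. A taken
     * jump/branch resolved for this core overrides the +4 update. */
    uint32_t pcgen_step(bool branch_taken, uint32_t branch_target,
                        uint32_t *fetch_bvc_id)
    {
        uint32_t fetch_addr = vc_pc[bvc_id];
        *fetch_bvc_id = bvc_id;

        vc_pc[bvc_id] = branch_taken ? branch_target : fetch_addr + 4;

        bvc_id = (bvc_id + 1) % NUM_VCORES;  /* rotate to next VC */
        return fetch_addr;
    }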

[Block diagram: the PCGEN, with per-virtual-core program counters (BPC), +4 address calculation (ACALC), and control logic, feeds the instruction memory (IMEM) at the IF stage; the ID stage holds five sets of Barrel General Registers (GPR0 to GPR4) selected by the BVC ID; the EXE, MEM, and WB stages follow.]

Figure 10: A Novel and Efficient Five-Virtual-Core Pipelined Barrel Processor

5.2.2 PBP Stage, IF

The PBP's IF stage is similar to the PLP's IF stage. The only difference is that the PBP carries the BVCID stage information.

5.2.3 PBP Stage, ID

The PBP's ID stage is similar to the PLP's ID stage. The only difference is that the PBP has five sets of General Purpose Registers (GPR) and the BVCID stage information. The BVCID information allows the processor to access the correct set of GPRs.

5.2.4 PBP Stage, EXE, MEM and WB

The remaining three PBP stages are also similar to the PLP's stages. The only difference is that they carry the BVCID stage information. On a register-to-register or memory-to-register write back operation, the BVCID helps to select the correct set of GPRs to store the data.

5.3 PBP Proof-of-Concept

The PBP proof-of-concept (POC) requires the PBP design that consists of all the sub-modules in section 5.2. We also need to define the PBP's memory map and five programs for this POC.

5.3.1 PBP’s Memory Map

Vector Table

    Address        Virtual Core
    0000_0000      0
    0000_0008      1
    0000_0010      2
    0000_0018      3
    0000_0020      4

Table 3: PBP Vector Table of Reset Vectors

In this section, we define a minimum memory map requirement to run the PBP.

We define the reset vectors for each virtual core and the general memory area as depicted in Table 3. The memory space after the vector table (0x0000_0024) is general purpose memory (code and data space).

5.3.2 PBP Power-on-Reset

Upon Power-on-Reset (POR), VC0 to VC4 fetch their first instructions from 0x0000_0000 to 0x0000_0020 (the reset vectors), respectively. After fetching and executing the first instruction, each virtual core jumps into its respective program code region (see Code 2, BVC0 to BVC4, program 0 to program 4, respectively).

5.3.3 PBP’s 5 Programs

In this section, we define five programs for this proof-of-concept. Four programs are written in assembly language to test whether the PBP has a control hazard issue. We used four virtual cores (VC1 to VC4) for this proof because we want the system to rapidly change its execution sequence. If the PCGEN fails to correct the PC, the IF stage will fetch a NOP instruction. If this happens, the no-control-hazard proof fails.

We used VC0 to prove that there is no data hazard. We used a load-use data hazard [9] (a textbook example) application (written in assembly) to prove that the PBP has no data hazard issue. If the read-back results are wrong, our proof fails. These 5 programs are shown in Code 2.


Note: NOPs are inserted for the control hazard test. If any one of the NOPs is fetched, the test fails.

Code 2: Proof-of-Concept Assembly Program

5.3.4 PBP ModelSim Simulation Waveforms

We used Verilog HDL for our PBP RTL design, and ModelSim for debugging and simulating the PBP design. For this simulation, a few variables are pre-initialized: all virtual core GPRs are initialized with 0x0000_0000. The data memory locations 0x0000_0000 to 0x0000_0010 are initialized with the values 0x5555_5555, 0x4444_4444, 0x3333_3333, 0x2222_2222, and 0x1111_1111, respectively. These data memory values are used by the VC0 program to prove that there is no data hazard in a PBP. If, after running the data hazard proof program, the read-back results are not 0x9999_9999 and 0x8888_8888, our proof fails. The ModelSim waveforms are appended in Figure 11 to Figure 14.


Figure 11: PBP ModelSim Simulation Waveforms 01


Figure 12: PBP ModelSim Simulation Waveforms 02


Figure 13: PBP ModelSim Simulation Waveforms 03


Figure 14: PBP ModelSim Simulation Waveforms 04

5.3.5 Decoding PBP Waveforms

The PBP fetches and executes programs differently from the other microprocessors (SCP, PLP, and MCP). We include two tables (Table 4 and Table 5) to help our readers understand how the program (Code 2) instructions are fetched and executed, and how to interpret the simulation waveforms (Figure 11 to Figure 14).

The program counter address on the waveform “/test/MO/pc,” addresses are shown in Table 4. Table 4 should be read from left to right starting from Barrel Cycle

1 to 16 respectively.

Barrel  VC0          VC1          VC2          VC3          VC4
Cycle   PC    Code   PC    Code   PC    Code   PC    Code   PC    Code
  1     0000  jmp    0008  jmp    0010  jmp    0018  jmp    0020  jmp
  2     0028  lw     005C  jmp    006C  jmp    007C  jmp    008C  jmp
  3     002C  lw     0064  beq    0074  beq    0084  beq    0094  beq
  4     0030  addu   005C  jmp    006C  jmp    007C  jmp    008C  jmp
  5     0034  sw     0064  beq    0074  beq    0084  beq    0094  beq
  6     0038  lw     005C  jmp    006C  jmp    007C  jmp    008C  jmp
  7     003C  addu   0064  beq    0074  beq    0084  beq    0094  beq
  8     0040  sw     005C  jmp    006C  jmp    007C  jmp    008C  jmp
  9     0044  lw     0064  beq    0074  beq    0084  beq    0094  beq
 10     0048  lw     005C  jmp    006C  jmp    007C  jmp    008C  jmp
 11     004C  jmp    0064  beq    0074  beq    0084  beq    0094  beq
 12     0054  beq    005C  jmp    006C  jmp    007C  jmp    008C  jmp
 13     004C  jmp    0064  beq    0074  beq    0084  beq    0094  beq
 14     0054  beq    005C  jmp    006C  jmp    007C  jmp    008C  jmp
 15     004C  jmp    0064  beq    0074  beq    0084  beq    0094  beq
 16     0054  beq    005C  jmp    006C  jmp    007C  jmp    008C  jmp

Table 4: PBP Code Fetching Sequence at IF Stage

Table 5 shows the sequencing of the PBP virtual cores. We can use the waveforms "/test/MO/bvcID" through "/test/MO/s4bvcID" to locate the pipeline stage in which each VC's program is executing. For example, VC0 at PC address 0x0000_0028 (/test/MO/bvcID = 0) fetches an LW instruction. At stage 4, MEM (/test/MO/s3bvcID = 0), it reads the data memory value 0x5555_5555.

Clock   Program      Instruction Pipeline
Cycle   Counter      IF     ID     EXE    MEM    WB
  0     0000_0000    VC0
  1     0000_0008    VC1    VC0
  2     0000_0010    VC2    VC1    VC0
  3     0000_0018    VC3    VC2    VC1    VC0
  4     0000_0020    VC4    VC3    VC2    VC1    VC0
  5     0000_0028    VC0    VC4    VC3    VC2    VC1
  6     0000_005C    VC1    VC0    VC4    VC3    VC2
  7     0000_006C    VC2    VC1    VC0    VC4    VC3
  8     0000_007C    VC3    VC2    VC1    VC0    VC4
  9     0000_008C    VC4    VC3    VC2    VC1    VC0
 10     0000_002C    VC0    VC4    VC3    VC2    VC1
 11     0000_0064    VC1    VC0    VC4    VC3    VC2
 12     0000_0074    VC2    VC1    VC0    VC4    VC3
 13     0000_0084    VC3    VC2    VC1    VC0    VC4
 14     0000_0094    VC4    VC3    VC2    VC1    VC0
 15     0000_0030    VC0    VC4    VC3    VC2    VC1
 16     0000_0064    VC1    VC0    VC4    VC3    VC2
 17     0000_0074    VC2    VC1    VC0    VC4    VC3
 18     0000_0084    VC3    VC2    VC1    VC0    VC4
 19     0000_0094    VC4    VC3    VC2    VC1    VC0
 ...    ...          ...    ...    ...    ...    ...

Legend: the program counter values in clock cycles 0 to 4 are the virtual core reset vectors; those from clock cycle 5 onward are the virtual core program codes.

Table 5: PBP Virtual Cores Sequencing

5.3.6 No Control and Data Hazards, Proven

From our results, we have proven that the PBP has no control and data hazards. The jump and branch-taken instructions execute without the need to insert NOP instructions. The load-use data hazard test program gave us the correct results (0x5555_5555 + 0x4444_4444 = 0x9999_9999 and 0x5555_5555 + 0x3333_3333 = 0x8888_8888). The results are read back on the waveform "/test/MO/memout" at 440 ns and 490 ns respectively (Figure 14).
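
The read-back can also be checked automatically in the testbench. In this sketch the memout path and the 440 ns / 490 ns sample times follow the waveform description above, while the task name and module placement are assumptions:

    // Check the two sums stored by the VC0 program (sketch, inside the
    // testbench module)
    task check_memout(input [31:0] expected);
        if (test.MO.memout !== expected)
            $display("FAIL: memout = %h, expected %h",
                     test.MO.memout, expected);
    endtask

    initial begin
        #440 check_memout(32'h9999_9999); // 0x5555_5555 + 0x4444_4444
        #50  check_memout(32'h8888_8888); // 0x5555_5555 + 0x3333_3333 (490 ns)
    end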

5.4 Compare and Contrast

In this section, we analyze and compare the different microprocessors, with and without a real-time operating system, based on five application programs. Each type of microprocessor operates at the core frequency shown in Table 6.

Type of Microprocessor   Frequency   Description
PBP                      500 MHz     5-stage pipeline
PLP                      500 MHz     5-stage pipeline
SCP                      100 MHz
MCP                      100 MHz     5 SCP cores, each core operates at 100 MHz

Table 6: Microprocessor Configuration

5.4.1 PBP versus PLP

In this section, we compare and contrast the PBP and the PLP. The PBP's IPC throughput is better than the PLP's (Table 7). Without an operating system, the PBP can execute five programs while the PLP can execute one program. If we are running five programs, the PBP does not require an operating system, whereas the PLP requires one. The preemption of application programs and other operating system primitives consume precious CPU time, leaving less CPU time for executing the application programs. From this comparison (Table 7), the PBP is a more efficient processor than the PLP.

    Type of Microprocessor   IPC   Super Loop   RTOS
1   PBP                      1     5 programs   No need
2   PLP                      <1    1 program    Preemption

Table 7: PBP versus PLP

5.4.2 PBP versus SCP

In this section, we compare and contrast the PBP and the SCP. Both the PBP and the SCP have an IPC of 1. However, the PBP can operate at a higher CPU frequency; hence, the PBP has a higher throughput. Without an operating system, the PBP can execute five programs while the SCP can execute one program. If we are running five programs, the PBP does not require an operating system, whereas the SCP requires one. The preemption of application programs and other operating system primitives consume precious CPU time, leaving less CPU time for executing the application programs. From this comparison (Table 8), the PBP is a more efficient processor than the SCP.

    Type of Microprocessor   IPC   Super Loop   RTOS
1   PBP                      1     5 programs   No need
2   SCP                      1     1 program    Preemption

Table 8: PBP versus SCP

5.4.3 PBP versus MCP

In this section, we compare and contrast the PBP and the MCP (5 SCP cores). Both the PBP and the MCP have an IPC of 1. However, the MCP has a bus contention issue (section 5.1) when more than one core tries to access the bus. The PBP does not have a bus contention issue because the PBP virtual cores access the bus at different clock cycles (section 4.7). The PBP also consumes less power than the MCP (the MCP needs more hardware). From this comparison (Table 9), the PBP is a more efficient processor than the MCP.

    Type of Microprocessor   Average IPC   Super Loop   RTOS
1   PBP                      1             5 programs   No need
2   MCP                      1             5 programs   No need

Table 9: PBP versus MCP

5.4.4 PBP with RTOS

If we need to execute more than five application programs, we need an operating system. With the PBP, we can achieve multiple hard real-time threads. We can lock any virtual core from preemption and assign any virtual core for preemption (the SCP and PLP cannot). In comparison to the other types of processors except the MCP, the number of preemptions on a PBP is smaller.

From our results and comparisons (sections 5.3 and 5.4), we can substantiate our claims that the PBP does not have control and data hazard issues and that the PBP is a more efficient processor than the PLP, SCP and MCP.

CHAPTER VI

CONCLUSION AND FUTURE WORK

6.1 Conclusions

Operating systems are important for desktops and larger computing systems. OSes are also becoming increasingly popular in the embedded systems domain. The advent of operating systems in the embedded systems domain changed the landscape of this domain, although many existing embedded systems still use bare-metal code (Super Loop).

As discussed in Chapter 2, the operating systems' strengths outweigh their weaknesses. For instance, the OS provides system initialization and bridges portability to the application software; it also provides localization for system analysis and debugging. These two features enable the system designers to measure the performance of the system, thus leading to a successful project. This also reduces the time-to-market.

In Chapter 3, we analyzed and presented the different scheduling methods employed by embedded operating systems. The OS would not function without these scheduling methods. Scheduling and context switching provide temporal multitasking (uniprocessor) and spatial multitasking (multiprocessor). All the RTOS scheduling methods (using a Timer ISR) are asynchronous and intrusive. The scheduling methods consume a significant amount of CPU time, reducing the CPU time available for application software.

The scheduling methods analyzed and discussed in this research work differ from those of desktops and larger computing systems. For instance, these (non-embedded) computing systems employ soft operating systems and are capable of supporting hundreds of threads.

These operating systems do not strictly enforce their thread priorities. Instead, the OSes provide a pattern paradigm, in addition to the thread priorities, to ensure that all the threads in the system have a chance to run. The pattern paradigm is a by-product that dramatically slows down the system if there are too many applications running in the system.

In Chapter 4, we analyzed and discussed the different types of microprocessors, including the uniprocessor, multiprocessor, and barrel processor. We also discussed their advantages and disadvantages. We observed that the barrel processor is the most time-efficient processor for multitasking operations, since it supports fine-grained multitasking. However, the barrel processor is unable to dynamically allocate more than one clock cycle per task in one interleaving cycle. This is an area where the uniprocessor and multiprocessor are at an advantage.

In Chapter 5, we designed a novel and efficient five-virtual-core Pipelined Barrel Processor (PBP). We analyzed, compared, and contrasted the PBP with the PLP, SCP and MCP. With the test results and data, we concluded that the PBP is a more efficient processor compared to the other processors.

6.2 Future Work

The analyses carried out throughout our research work enable us to hypothesize further about the identified issues and to set our future research endeavors. It should be noted that our future research will abide by the following two specifications:

• The operating system software modification should not affect the application program; hence, no application program software needs to change, or only minimal changes are required to take advantage of the new system features.

• Hardware modification should not require any of the instructions in the existing instruction set to change; hence, no operating system, application program or compiler software needs to be modified.

Based on the analyses presented in this thesis, we are planning to carry out our future research as follows:

• How can the microprocessor reduce the number of CPU cycles spent on context switching?

• How can the microprocessor dynamically allocate system clock cycles to the threads?

• How can the system execute multiple threads in hard real-time multitasking mode?

• How can interrupts be redirected to the thread level without affecting other running threads?

• How can a crashing thread be prevented from hanging the system (that is, all other threads can still run as normal even if another thread crashes)?

• How can an instruction cache system improve its hit rates?

• How can the operating system provide a thread's instruction usage pattern to the cache subsystem?

• How can we modify the operating system to take advantage of the aforementioned features?

BIBLIOGRAPHY

[1] Andrew S. Tanenbaum, Albert S. Woodhull, “Operating Systems Design and Implementation,” Prentice Hall, Second Edition, 1997

[2] Jean J. Labrosse, “uC/OS The Real-Time Kernel,” R&D, Fifth Printing Revised for v.1.10, 1992

[3] Jean J. Labrosse, “MicroC/OS-II, The Real-Time Kernel,” R&D, Fifth Printing Revised for v.1.10, 1999

[4] Ronald Mak, “Writing Compilers and Interpreters, An Applied Approach Using C++,” Wiley, Second Edition, 1996

[5] Jeff Duntemann, “Assembly Language, Step-by-Step,” Wiley, Second Edition, 1992

[6] Alan Clements, “Microprocessor Systems Design, 68000 Hardware, Software, and Interfacing,” Clements, Third Edition, 1997

[7] IBM, “The PowerPC Architecture: A Specification for a New Family of RISC Processors,” Morgan Kaufmann, Second Edition, 1994

[8] Pat Villani, “FreeDOS Kernel,” R&D, 1996

[9] David Patterson, John L. Hennessy, “Computer Organization and Design: The Hardware/Software Interface,” Elsevier, Fifth Edition, 2014

[10] Richard P. Paul, “SPARC Architecture, Assembly Language Programming, and C,” Prentice Hall, Second Edition, 2000

[11] Mong Sim, “Initialization and Optimization Program for MPC563xM,” 2010

[12] Mong Sim, “A Practical Approach to Hardware Semaphores, For MPC56xx and MPC57xx Multi-core Qorivva Devices,” Freescale Semiconductor, Revision 2, 2014

[13] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” Morgan Kaufmann, Fourth Edition, 2007

[14] Control Data Corporation, “CDC Cyber 170 Computer Systems; Models 720, 730, 750, and 760; Model 176 (Level B); CPU Instruction Set; PPU Instruction Set,” Control Data Corporation, Revision A, 1973

[15] Karel Driesen and Urs Hölzle, “The Cascaded Predictor: Economical and Adaptive Branch Target Prediction,” Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31), 1998, Pages 249-258, DOI: 10.1109/MICRO.1998.742786, IEEE Conference Publications.

[16] Muhammad Umar Farooq, Lei Chen, and Lizy Kurian John, “Value Based BTB Indexing for Indirect Jump Prediction,” HPCA-16: The Sixteenth International Symposium on High-Performance Computer Architecture, 2010, Pages 1-11, DOI: 10.1109/HPCA.2010.5416659, IEEE Conference Publications.

[17] Zichao Xie, Dong Tong, Mingkai Huang, Xiaoyin Wang, Qinqing Shi, Xu Cheng, “TAP Prediction: Reusing Conditional Branch Predictor for Indirect Branches with Target Address Pointers,” 2011 IEEE 29th International Conference on Computer Design (ICCD), 2011, Pages 119-126, DOI: 10.1109/ICCD.2011.6081386, IEEE Conference Publications.

[18] Zichao Xie, Dong Tong, Mingkai Huang, “A general low-cost indirect branch prediction using target address pointers,” Journal of Computer Science and Technology, November 2014, Volume 29, Issue 6, pp 929–94, Springer Link

[19] Wenbing Jin, Feng Shi, Qiugui Song, Yang Zhang, “A novel architecture for ahead branch prediction,” Frontiers of Computer Science, December 2013, Volume 7, Issue 6, pp 914–923, Springer Link

[20] David R. Kaeli, Philip G. Emma, “Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns,” Proceeding ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture, Pages 34-42, Toronto, Ontario, Canada, May 27 - 30, 1991, ACM New York, NY, USA ©1991.

[21] Karel Driesen and Urs Hölzle, “The Cascaded Predictor: Economical and Adaptive Branch Target Prediction,” Proceeding MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture, Pages 223-229, Orlando, Florida, USA, November 27 - 29, 1990, IEEE Computer Society Press Los Alamitos, CA, USA ©1990.

[22] Reuven Bakalash and Zhong Xu, “A Barrel Shift Microsystem for Parallel Processing,” Proceeding MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture, Pages 223-229, Orlando, Florida, USA, November 27 - 29, 1990, IEEE Computer Society Press Los Alamitos, CA, USA ©1990.

[23] Oren Laadan, Jason Nieh, “Operating System Virtualization: Practice and Experience,” Columbia University, 2010

[24] Michael Goldweber, Renzo Davoli, Tomislav Jonjic, “Supporting Operating Systems Projects Using the µMPS2 Hardware Simulator,” Proceeding ITiCSE '12 Proceedings of the 17th ACM annual conference on Innovation and technology in computer science education, Pages 63-68, Haifa, Israel, July 03 - 05, 2012, ACM New York, NY, USA ©2012.

[25] Darko Kirovski, Nuria Oliver, Mike Sinclair, and Desney Tan, “Health-OS: A Position Paper,” Proceeding HealthNet '07 Proceedings of the 1st ACM SIGMOBILE international workshop on Systems and networking support for healthcare and assisted living environments, Pages 76-78, San Juan, Puerto Rico, June 11 - 11, 2007, ACM New York, NY, USA ©2007.

[26] David Wentzlaff and Anant Agarwal, “Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores,” Newsletter ACM SIGOPS Operating Systems Review archive, Volume 43 Issue 2, April 2009, Pages 76-85, ACM New York, NY, USA

[27] Juan Carlos Guzmán, Patrick O. Bobbie, “Hands-On Operating Systems Made Easy,” Journal of Computing Sciences in Colleges, Volume 22, Issue 4, April 2007, Pages 145-151, Consortium for Computing Sciences in Colleges, USA.

[28] Feng Xian, Witawas Srisa-an, and Hong Jiang, “Contention-Aware Scheduler: Unlocking Execution Parallelism in Multithreaded Java Programs,” Proceeding OOPSLA '08 Proceedings of the 23rd ACM SIGPLAN conference on Object- oriented programming systems languages and applications, Pages 163-180, Nashville, TN, USA — October 19 - 23, 2008, ACM New York, NY, USA ©2008.

[29] Jean Mayo, Phil Kearns, “A Secure Networked Laboratory for Kernel Programming,” Proceeding ITiCSE '98 Proceedings of the 6th annual conference on the teaching of computing and the 3rd annual conference on Integrating technology into computer science education: Changing the delivery of computer science education, Pages 175-177, Dublin City Univ., Ireland — August 18 - 21, 1998, ACM New York, NY, USA ©1998

[30] Edmund B. Nightingale, Orion Hodson, Ross McIlroy, “Helios: Heterogeneous Multiprocessing with Satellite Kernels,” Proceeding SOSP '09 Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, Pages 221-234, Big Sky, Montana, USA, October 11 - 14, 2009, ACM New York, NY, USA ©2009

[31] Steven Robbins, “A UNIX Concurrent I/O Simulator,” Proceeding SIGCSE '06 Proceedings of the 37th SIGCSE technical symposium on Computer science education, Pages 303-307, Houston, Texas, USA — March 03 - 05, 2006, ACM New York, NY, USA ©2006.

[32] Stavros Passas, Sven Karlsson, “SRC: FenixOS - A Research Operating System Focused on High Scalability and Reliability,” Proceeding ICS '11 Proceedings of the international conference on Supercomputing, Pages 371-371, Tucson, Arizona, USA, May 31 - June 04, 2011, ACM New York, NY, USA ©2011.

[33] Michael D. Black, “Build an Operating System from Scratch: A Project for an Introductory Operating Systems Course,” Proceeding SIGCSE '09 Proceedings of the 40th ACM technical symposium on Computer science education, Pages 448-452, Chattanooga, TN, USA, March 04 - 07, 2009, ACM New York, NY, USA ©2009

[34] Chi-Sheng Shih, Hsin-Yu Lai, “nuKernel: MicroKernel for multi-core DSP SoCs with load sharing and priority interrupts,” Proceeding SAC '13 Proceedings of the 28th Annual ACM Symposium on Applied Computing, Pages 1525-1532, Coimbra, Portugal — March 18 - 22, 2013, ACM New York, NY, USA ©2013.

[35] Chuanpeng Li, Chen Ding, Kai Shen, “Quantifying The Cost of Context Switch,” Proceeding ExpCS '07 Proceedings of the 2007 workshop on Experimental computer science, Article No. 2, San Diego, California — June 13 - 14, 2007, ACM New York, NY, USA ©2007.

[36] Francis M. David, Jeffrey C. Carlyle, Roy H. Campbell, “Context Switch Overheads for Linux on ARM Platforms,” Proceeding ExpCS '07 Proceedings of the 2007 workshop on Experimental computer science, Article No. 3, San Diego, California — June 13 - 14, 2007, ACM New York, NY, USA ©2007

[37] Fang Liu and Yan Solihin, “Understanding the Behavior and Implications of Context Switch Misses,” ACM Transactions on Architecture and Code Optimization (TACO), Volume 7, Issue 4, December 2010, Article No. 21, ACM New York, NY, USA.

[38] Marius Evers, Po-Yung Chang, Yale N. Patt, “Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches,” Proceeding ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture, Pages 3-11, Philadelphia, Pennsylvania, USA — May 22 - 24, 1996, ACM New York, NY, USA ©1996

[39] Volker Barthelmann, “Inter-Task Register-Allocation for Static Operating Systems,” Proceeding LCTES/SCOPES '02 Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems, Pages 149 – 154, Berlin, Germany — June 19 - 21, 2002, ACM New York, NY, USA ©2002

[40] Antonio Diaz Tula, Filipe M. S. de Campos, Carlos H. Morimoto, “Dynamic Context Switching for Gaze Based Interaction,” Proceeding ETRA '12 Proceedings of the Symposium on Eye Tracking Research and Applications, Pages 353-356, Santa Barbara, California — March 28 - 30, 2012, ACM New York, NY, USA ©2012.

[41] Lina Sawalha, Ronald D. Barnes, “Phase-Based Scheduling and Thread Migration for Heterogeneous Multicore Processors,” Proceeding PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 493-494, Minneapolis, Minnesota, USA, September 19 - 23, 2012, ACM New York, NY, USA ©2012

[42] Fabien Hermenier, Adrien Lèbre, Jean-Marc Menaud, “Cluster-Wide Context Switch of Virtualized Jobs,” Proceeding HPDC '10 Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Pages 658-666, Chicago, Illinois — June 21 - 25, 2010, ACM New York, NY, USA ©2010

[43] Gurindar S. Sohi, “Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors,” Proceeding ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture, Pages 27-34, Pittsburgh, Pennsylvania, USA — June 02 - 05, 1987, ACM New York, NY, USA ©1987

[44] A. Severance, J. Edwards, H. Omidian, G. Lemieux, “Soft Vector Processors with Streaming Pipelines,” Proceeding FPGA '14 Proceedings of the 2014 ACM/SIGDA international symposium on Field- programmable gate arrays, Pages 117-126, Monterey, California, USA — February 26 - 28, 2014, ACM New York, NY, USA ©2014

[45] Harry Dwyer, H.C. Torng, “An Out-of-Order Superscalar Processor with Speculative Execution and Fast, Precise Interrupts,” Proceeding MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture, Pages 272-281, Portland, Oregon, USA, December 01 - 04, 1992, IEEE Computer Society Press Los Alamitos, CA, USA ©1992

[46] James E. Smith, “Implementation of Precise Interrupts in Pipelined Processors,” Proceeding ISCA '98 25 years of the international symposia on Computer architecture (selected papers), Pages 291-299, Barcelona, Spain, June 27 - July 02, 1998, ACM New York, NY, USA ©1998.

[47] J. Carver Hill, “Synchronizing Processors with Memory-Content-Generated Interrupts,” Communications of the ACM, Volume 16, Issue 6, June 1973, Pages 350-351, ACM New York, NY, USA

[48] J Robert Schaefer, “Debugging Debugged, a Metaphysical Manifesto of Systems Integration,” ACM SIGSOFT Software Engineering Notes, Volume 33, Issue 3, May 2008, Article No. 5, ACM New York, NY, USA.

[49] Pankaj Shanker, “Spatial Debug & Debug without Re-programming in FPGAs,” Proceeding Proceedings of the 2016 ACM/SIGDA International Symposium on Field- Programmable Gate Arrays Pages 3-3, Monterey, California, USA — February 21 - 23, 2016, ACM New York, NY, USA ©2016

[50] Dick Hamlet, “Debugging Level: Step-Wise Debugging,” Proceeding SIGSOFT '83 Proceedings of the ACM SIGSOFT/SIGPLAN software engineering symposium on High-level debugging, Pages 4-8, ACM New York, NY, USA ©1983

[51] Donglin Liang and Kai Xu, “Debugging Object Oriented Programs with Behavior Views,” Proceedings of the sixth international symposium on Automated analysis- driven debugging Pages 133-142, Monterey, California, USA — September 19 - 21, 2005, ACM New York, NY, USA ©2005

[52] Asanović, Krste, “The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA version 2.1 (Technical Report EECS-2016-118)," University of California, Berkeley, July 2016.

[53] Santifort, Conor, "Amber ARM-compatible core," OpenCores, August 2010.

[54] Waterman, Andrew; Lee, Yunsup; Avižienis, Rimas; Patterson, David; Asanovic, Krste. “Draft Privileged ISA Specification 1.9,” RISC-V Foundation, August 2016.

[55] RISC-V Foundation. "RISC-V The Free and Open Instruction Set," The RISC-V Foundation's Web Site, The RISC-V Foundation. Retrieved 11 November 2016.

[56] Strauch, Tobias. "OpenRisc 1200 HP, Hyper Pipelined OR1200 Core," OpenCores, 2010

[57] Richard Herveille. "WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores," OpenCores, 2002

[58] ARM Limited. "AMBA™ Specification (Rev 2.0)," ARM, 1999

[59] Jiri Gaisler. "LEON 3FT SPARC Processor," Gaisler, 2005

[60] Freescale Semiconductor. "MPC5121e e300 CPU core based on the Power Architecture Technology,” Freescale Semiconductor, 2007

[61] Freescale Semiconductor. "Qorivva MPC5643L High-performance e200z4d core Technology,” Freescale Semiconductor, 2011
